Brisk guide to Mathematics

Jan Slovák and Michal Bulant, Ioannis Chrysikos, Martin Panák
with help of Ray Booth, Vladimir Ejov, Radek Suchánek, Vojtěch Žádník, ...

Brno, 2024

Work on this text was financially supported by the project NPO MUNI 3.2.1 DataAnalytics NPO-MUNI-MSMT-16606/2022

Authors: Michal Bulant, Ioannis Chrysikos, Martin Panák, Jan Slovák
With further help of: Ray Booth, Vladimir Ejov, Radek Suchánek, Vojtěch Žádník
Graphics and illustrations: Petra Rychlá

2023 Masaryk University

Contents – theory

Chapter 1. Initial warmup 4
  1. Numbers and functions 4
  2. Difference equations 10
  3. Combinatorics 14
  4. Probability 18
  5. Plane geometry 26
  6. Relations and mappings 41
Chapter 2. Elementary linear algebra 83
  1. Vectors and matrices 83
  2. Determinants 96
  3. Vector spaces and linear mappings 107
  4. Properties of linear mappings 128
Chapter 3. Linear models and matrix calculus 193
  1. Linear optimization 193
  2. Difference equations 202
  3. Iterated linear processes 211
  4. More matrix calculus 221
  5. Decompositions of the matrices and pseudoinversions 244
Chapter 4. Analytic geometry 319
  1. Affine and Euclidean geometry 319
  2. Transformations 338
  3. Geometry of quadratic forms and quadrics 342
  4. Projective geometry 349
Chapter 5. Establishing the ZOO 368
  1. Polynomial interpolation 368
  2. Real numbers and limit processes 379
  3. Derivatives 403
  4. Infinite sums and power series 417
Chapter 6. Differential and integral calculus 508
  1. Differentiation 508
  2. Integration 529
  3. Sequences, series and limit processes 557
Chapter 7. Continuous tools for modelling 631
  1. Fourier series 631
  2. Integral operators and Fourier transform 654
  3. Metric spaces 666
Chapter 8. Calculus with more variables 709
  1. Functions and mappings on Rn 709
  2. Integration for the second time 747
  3. Differential equations 761
Chapter 9. Continuous models – further selected topics 802
  1. Exterior differential calculus and integration 802
  2. Remarks on Partial Differential Equations 827
  3. Remarks on Variational Calculus 858
  4. Complex Analytic Functions 873
Chapter 10. Statistics and probability theory 901
  1. Descriptive statistics 901
  2. Probability 913
  3. Mathematical statistics 959
Chapter 11. Elementary number theory 978
  1. Fundamental concepts 978
  2. Primes 982
  3. Congruences and basic theorems 987
  4. Solving congruences and systems of them 1000
  5. Diophantine equations 1016
  6. Applications – calculation with large integers, cryptography 1020
Chapter 12. Algebraic structures 1040
  1. Posets and Boolean algebras 1040
  2. Elements of Logic 1055
  3. Polynomial rings 1073
  4. Groups, rings, and fields 1086
  5. Coding theory 1110
  6. Systems of polynomial equations 1117
Chapter 13. Combinatorial methods, graphs, and algorithms 1140
  1. Elements of Graph theory 1140
  2. A few graph algorithms 1166
  3. Remarks on Computational Geometry 1188
  4. Remarks on more advanced combinatorial calculations 1206

Contents – practice

Chapter 1. Initial warmup 4
  A. Numbers and functions 4
  B. Difference equations 12
  C. Combinatorics 16
  D. Probability 20
  E. Plane geometry 28
  F. Relations and mappings 43
  G. Additional exercises for the whole chapter 47
Chapter 2. Elementary linear algebra 83
  A. Vectors and matrices 83
  B. Determinants 100
  C. Vector spaces and linear mappings 107
  D. Properties of linear mappings 133
  E. Additional exercises for the whole chapter 142
Chapter 3. Linear models and matrix calculus 193
  A. Linear optimization 193
  B. Recurrence relations 204
  C. Models of growth and iterated processes 210
  D. More matrix calculus 224
  E. Matrix decompositions 248
  F. Additional exercises for the whole chapter 262
Chapter 4. Analytic geometry 319
  A. Affine geometry 319
  B. Euclidean geometry 327
  C. Geometry of quadratic forms 341
  D. Further exercises on this chapter 361
Chapter 5. Establishing the ZOO 368
  A. Polynomial interpolation 368
  B. Real numbers and limit processes 380
  C. Derivatives 407
  D. Infinite sums and power series 428
  E. Additional exercises for the whole chapter 444
Chapter 6. Differential and integral calculus 508
  A. Derivatives of higher orders 508
  B. Integration 534
  C. Sequences, series and limit processes 566
  D. Additional exercises for the whole chapter 577
Chapter 7. Continuous tools for modelling 631
  A. Fourier series 631
  B. Integral operators and Fourier Transform 654
  C. Metric spaces 667
  D. Additional exercises to the whole chapter 686
Chapter 8. Calculus with more variables 709
  A. Multivariate functions 709
  B. The topology of En 711
  C. Limits and continuity of multivariate functions 713
  D. Tangent lines, tangent planes, graphs of multivariate functions 715
  E. Taylor polynomials 722
  F. Extrema of multivariate functions 723
  G. Implicitly given functions and mappings 728
  H. Constrained optimization 730
  I. Volumes, areas, centroids of solids 742
  J. First-order differential equations 759
  K. Practical problems leading to differential equations 768
  L. Higher-order differential equations 770
  M. Applications of the Laplace transform 777
  N. Numerical solution of differential equations 780
  O. Additional exercises to the whole chapter 795
Chapter 9. Continuous models – further selected topics 802
  A. Exterior differential calculus 802
  B. Applications of Stokes’ theorem 802
  C. Equation of heat conduction 808
  D. Partial Differential Equations 809
  E. Variational Problems 830
  F. Complex analytic functions 830
  G. Additional exercises to the whole chapter 897
Chapter 10. Statistics and probability methods 901
  A. Dots, lines, rectangles 901
  B. Visualization of multidimensional data 910
  C. Classical and conditional probability 912
  D. What is probability? 920
  E. Random variables, density, distribution function 922
  F. Expected value, correlation 932
  G. Transformations of random variables 937
  H. Inequalities and limit theorems 939
  I. Testing samples from the normal distribution 944
  J. Linear regression 953
  K. Bayesian data analysis 955
  L. Processing of multidimensional data 960
Chapter 11. Number theory 978
  A. Basic properties of divisibility 978
  B. Prime numbers 981
  C. Congruences 985
  D. Solving congruences 997
  E. Diophantine equations 1012
  F. Primality tests 1015
  G. Encryption 1019
  H. Additional exercises to the whole chapter 1038
Chapter 12. Algebraic structures 1040
  A. Boolean algebras and lattices 1040
  B. Rings 1052
  C. Polynomial rings 1053
  D. Rings of multivariate polynomials 1058
  E. Algebraic structures 1063
  F. Groups 1065
  G. Burnside’s lemma 1084
  H. Codes 1087
  I. Extension of the stereographic projection 1094
  J. Elliptic curves 1095
  K. Gröbner bases 1098
Chapter 13. Combinatorial methods, graphs, and algorithms 1140
  A. Fundamental concepts 1140
  B. Fundamental algorithms 1149
  C. Minimum spanning tree 1158
  D. Flow networks 1160
  E. Classical probability and combinatorics 1164
  F. More advanced problems from combinatorics 1169
  G. Probability in combinatorics 1171
  H. Combinatorial games 1178
  I. Generating functions 1180
  J. Additional exercises to the whole chapter 1220

Index 1228

Preface

This textbook is a follow-up to the Czech course material “Matematika drsně a svižně”, reflecting many years of lecturing Mathematics at the Faculty of Informatics of Masaryk University in Brno. The programme there required an introduction to genuine mathematical thinking and precision, but within a quite limited time-frame for lectures. This endeavor has been undertaken by Jan Slovák and Martin Panák since 2004, with further collaborators joining later. Our goal was to cover seriously, but quickly, about as much of the mathematical methods as is usually seen in bigger courses in the classical Science and Technology programmes. At the same time, we did not want to give up the completeness and correctness of the mathematical exposition. We wanted to introduce and explain the more demanding parts of Mathematics together with elementary explicit examples of how to use the concepts and results in practice. But we did not want to decide how much theory or practice the reader should enjoy, nor in which order.

Now, we have tried to accommodate all the above features in one book, providing a more or less complete account of the basics of Mathematics, as taught in typical Mathematics Bachelor programmes, but in a way relevant to the coming era of computational power and artificial intelligence resources, including GPT and other widely available chatbots. Thus, we chose the two-column format of the textbook, where the theoretical explanation on one side and the practical procedures and exercises on the other side are split. Moreover, we suppose the practical learning will be heavily supported by computer aided mathematics tools; we chose mainly Sage for our illustrations. This way, we want to encourage and help the readers to find their own way: either to go through the examples and algorithms first (perhaps with the help of Sage or GPT-like chatbots), and then to come to more serious thinking on why the things work, or the other way round. We also hope to overcome the usual stress of readers horrified by the amount of the material. With our text, they are not supposed to read through the book in a linear order. On the contrary, the readers should enjoy browsing through the text and finding their own thrilling paths through the new mathematical landscapes.

In both columns, we intend to present a rather standard exposition of basic Mathematics, but focusing on the essence of the concepts and their relations. The exercises address simple mathematical problems, but we also try to show the exploitation of mathematical models in practice as much as possible.

We are aware that the text is written in a very compact and non-homogeneous way. A lot of details are left to the readers, in particular in the more difficult paragraphs, while we try to provide a lot of simple intuitive explanation when introducing new concepts or formulating important theorems. Similarly, the examples display a variety from very simple ones to those requiring independent thinking. We would very much like to help the reader

• to formulate precise definitions of basic concepts and to prove simple mathematical results;
• to perceive the meaning of roughly formulated properties, relations and outlooks for exploring mathematical tools;
• to understand the instructions and algorithms underlying mathematical models and to appreciate their usage.

These goals are ambitious and there are no simple paths reaching them without failures on the way.
This is one of the reasons why we come back to basic ideas and concepts several times, with growing complexity and width of the discussions. Of course, this might also look chaotic, but we very much hope that this approach gives a better chance to those who will persist in their efforts. We also hope this textbook should be a perfect beginning and help for everybody who is ready to think and who is ready to return to earlier parts again and again.

To make the task simpler and more enjoyable, we have added what we call "emotive icons". We hope they will enliven the dry mathematical text and indicate which parts should be read more carefully, or better left out in the first round. They should work as a sort of switch: when losing ground, the reader is advised to find the next appealing icon and go on. The usage of the icons follows the feelings of the authors, and we tried to use them in a systematic way. We hope the readers will assign the meaning to the icons individually. Roughly speaking, there are icons indicating complexity, difficulty, etc. Further icons indicate unpleasant technicality and the need for patience, or possible entertainment and pleasure. Some icons are related to feelings when solving problems and appear mainly in the practical column.

The practical column with the solved problems and exercises should be readable nearly independently of the theory. Without the ambition to know the deeper reasons why the algorithms work, it should be possible to read mainly just this column. In order to help such readers, some definitions and descriptions in the theoretical text are marked in order to catch the eye easily when reading the exercises. The exercises and theory are partly coordinated to allow jumping back and forth, but the links are not tight. The numbering in the two columns is distinguished by using different numberings of sections, i.e., those like 1.2.1 belong to the theoretical column, while 1.A.14 points to the practical column. The equations are numbered within subsections and their quotes include the subsection numbers if necessary.

In general, our approach stresses the fact that the methods of so-called discrete Mathematics seem to be more important for mathematical models nowadays. They also seem simpler to perceive and grasp, supported by the computational tools. However, the continuous methods are strictly necessary too. First of all, the classical continuous mathematical analysis is essential for understanding the convergence and robustness of computations. It is hard to imagine how to deal with error estimates and computational complexity of numerical processes without it. Moreover, the continuous models are often the efficient and effectively computable approximations to discrete problems coming from practice.

As usual with textbooks, there are numerous figures completing the exposition. We very much advise the readers to draw their own pictures whenever necessary, in particular in the later chapters, where we provide only a few.

The rough structure of the book and the dependencies between its chapters are depicted in the diagram below. The darker the color is, the more demanding is the particular chapter (or at least its essential parts). In particular, chapters 7 and 9 include a lot of material which would perhaps not be covered in regular course activities or required at exams in great detail. The solid arrows mean strong dependencies, while the dashed links indicate only partial dependencies.
In particular, the textbook could support courses starting with any of the white boxes, i.e. aiming at standard linear algebra and geometry (chapters 2 through 4), discrete chapters of mathematics (11 through 13), and the rudiments of Calculus (5, 6, 8). All topics covered in the book are now included (with more or less detail) in teaching within our Mathematics Minor programme, complemented by numerical seminars. In this block of four courses, the first semester covers chapters 1 and 2 and selected topics from chapters 3 and 4. The second semester essentially includes chapters 5, 6, and 7. The third semester is now split into two parts: the first one is covered by chapter 8 (with only a few glimpses towards the more advanced topics from chapter 9), while the rest of the semester is devoted to the rudiments of graph theory in chapter 13. The last semester provides large parts of chapters 11 through 13.

This textbook project got a new impulse when developing the main background material for a new professional programme Data Analytics, a cross-disciplinary endeavor aiming at the formal education of top-tier young programmers in the era of GPT and further AI resources. There, the two elementary maths courses in the first semester are covered by Chapters 1-7, while in the second semester, another two courses are based on Chapters 8-9, and 13. Finally, one course in the third semester is covered by Chapters 11-12. Two more courses are devoted to Probability and Statistics, and they already go much beyond Chapter 10 here.

CHAPTER 1

Initial warmup

“value, difference, position” – what is it and how to comprehend it?

The goal of this first chapter is to introduce the reader to the fascinating world of mathematical thinking. The name of this chapter can also be understood as an encouragement for patience. Even the simplest tasks and ideas are easy only for those who have already seen similar ones.

We start with the simplest thing: numbers. They will also serve as the first example of how mathematical objects and theories are built. The entire first chapter will be a quick tour through various mathematical landscapes (including germs of analysis, combinatorics, probability, geometry, and the language of Mathematics itself). Perhaps sometimes our definitions and ideas will look too complicated and not practical enough. The simpler the objects and tasks are, the more difficult the mastering of the depth and all the nuances of the relevant tools and procedures might be. We shall come back to all of the notions again and again in the further chapters, and hopefully this will be the crucial step towards the ultimate understanding. Thus the advice: do not worry if you find some particular part of the exposition too formal or otherwise difficult – come back later for another look.

1. Numbers and functions

Since the dawn of time, people have wanted to know “how much” of something they have, “how much” something is worth, “how long” a particular task will take, etc. The answer to such questions is usually some kind of “number”. We consider something to be a number if it behaves according to the usual rules – either according to all the rules we accept, or maybe only to some of them. For instance, the result of multiplication does not depend on the order of the multiplicands. We have the number zero whose addition to another number does not change the result. We have the number one whose product with another number does not change the result. And so on. The simplest example of numbers are the positive integers, which we denote $Z^+ = \{1, 2, 3, \dots\}$.
The natural numbers consist of either just the positive integers, or the positive integers together with the number zero. The number zero is either considered to be a natural number, as is usual in computer science, or not a natural number, as is usual in some other contexts.

A. Numbers and functions

People mostly think that “Mathematics” equals “counting”. Indeed, we start the introduction with number systems including integers, rationals, reals, and complex numbers, denoted by Z, Q, R and C, respectively. Counting is based on functions, and numbers are their arguments and values. We approach these elementary and well known concepts now in order to invoke a more realistic view of Mathematics as a way of thinking. The reader is always advised to keep an eye on the right-hand column if the concepts and methods get unclear. In the next pages, we also use basic properties of numbers which we discuss carefully only in Chapter 12 (e.g., the unique decomposition of integers into products of primes, etc.).

As soon as mathematical tasks approach counting, they can easily be done in computer aided mathematics software in a few lines of code. Therefore, our exposition will often include Sage cells, but we avoid presenting preliminaries on programming with Sage. Our experience indicates that the Sage (Python based) interface is user-friendly, and even unfamiliar readers should quickly feel comfortable with basic coding in Sage.

1.A.1. Integers are discrete, in the sense that for any m ∈ Z there is no integer between m and m + 1. Show by means of an example that this fails for every pair of distinct rational numbers.

Thus the set of natural numbers is either $Z^+$, or the set $N = \{0, 1, 2, 3, \dots\}$. To count “one, two, three, ...” is learned already by children in their pre-school age. Later on, we meet all the integers $Z = \{\dots, -2, -1, 0, 1, 2, \dots\}$ and finally we get used to floating-point numbers. We know what a 1.19-multiple of the price means if we have a 19% tax.

1.1.1. Properties of numbers. In order to be able to work properly with numbers, we need to be careful with their definition and properties. In mathematics, the basic statements about properties of objects, whose validity is assumed without the need to prove them, are called axioms. We list the basic properties of the operations of addition and multiplication for our calculations with numbers, which we denote by letters a, b, c, . . . . Both operations work by taking two numbers a, b. By applying addition or multiplication we obtain the resulting values a + b and a · b.

Properties of numbers

Properties of addition:
(CG1) (a + b) + c = a + (b + c), for all a, b, c;
(CG2) a + b = b + a, for all a, b;
(CG3) there exists 0 such that for all a, a + 0 = a;
(CG4) for all a there exists b such that a + b = 0.

The properties (CG1)-(CG4) are called the properties of a commutative group. They are called respectively associativity, commutativity, the existence of a neutral element (when speaking of addition we usually say zero element), and the existence of an inverse element (when speaking of addition we also say the negative of a and denote it by −a).
Properties of multiplication:
(R1) (a · b) · c = a · (b · c), for all a, b, c;
(R2) a · b = b · a, for all a, b;
(R3) there exists 1 such that for all a, 1 · a = a;
(R4) a · (b + c) = a · b + a · c, for all a, b, c.

The properties (R1)-(R4) are called respectively associativity, commutativity, the existence of a unit element, and distributivity of addition with respect to multiplication. The sets with operations +, · that satisfy the properties (CG1)-(CG4) and (R1)-(R4) are called commutative rings.

Two further properties of multiplication are:
(F) for every a ≠ 0 there exists b such that a · b = 1;
(ID) if a · b = 0, then either a = 0 or b = 0 or both.

The property (F) is called the existence of an inverse element with respect to multiplication (this element is then denoted by $a^{-1}$). For ordinary arithmetic, this is called the reciprocal of a, the same as 1/a. The property (ID) then says that there exist no “divisors of zero”. A divisor of zero is a number a, a ≠ 0, such that there is a number b, b ≠ 0, with a · b = 0.

Solution (to 1.A.1). Let p = a/b and q = c/d be two rationals, where a, b, c, d ∈ Z, and suppose for instance that p < q. Consider the average α of p, q, i.e., $\alpha = \frac{1}{2}(p+q)$. Then $\alpha = \frac{ad+cb}{2bd}$, and this is a rational number, being the quotient of the integers ad + cb and 2bd. Now, adding p to both sides of the inequality p < q we get 2p < p + q, whereas adding q to both sides gives p + q < 2q. Hence we have 2p < p + q < 2q, and so p < α < q. □

1.A.2.

1.A.3. Show that the integer 2 does not have a rational square root (hint: think about the number of copies of 2 appearing in the decompositions of integers). ⃝

1.A.4. An irrational number is a real number which cannot be written as the ratio of two integers. Using 1.A.3, prove that between any two different rational numbers there is an irrational number. ⃝

Perhaps the people reading this book are familiar with naturals, integers, rationals, and reals. However, there are larger sets of numbers, such as the complex numbers, or the quaternions, the latter introduced by William R. Hamilton already in 1843. Next we provide examples on complex numbers, see their introduction in 1.1.3. The algebra of quaternions is a topic requiring some knowledge from linear algebra. Hence it will be discussed later, in paragraph 2.E.66, in terms of the so-called Pauli matrices.¹

1.A.5. Consider the function $f(x) = x^2 - 4x + 8$. Use its graph to convince yourself that we need to extend the real numbers in order to solve every quadratic equation with real coefficients. Recall the formula for such a universal solution.

Solution. Given a function f with domain X and codomain Y, its graph is the set $G_f = \{(x, f(x)) : x \in X\}$. Thus $G_f$ is a subset of the Cartesian product X × Y of X, Y, i.e.,
$$G_f \subseteq X \times Y = \{(x, y) : x \in X,\ y \in Y\},$$
where here (x, y) are ordered pairs, see also 1.6.1. To illustrate the graph of the given f, we first note that $G_f \subseteq R \times R$. This means that $G_f$ is a subset of the real plane $R^2$, also called the Euclidean plane (see the discussion in 1.5.1 and 1.5.2 for more details on $R^2$). To visualize $G_f$ or parts of it, there are many alternatives. Next we will demonstrate how to use SageMath, which is a free mathematical software system that can be accessed at https://sagecell.sagemath.org .

¹ Complex numbers appear already in Ars Magna by Gerolamo Cardano (1501-1576, Italian), around 1545, in relation to solving cubic equations (see 1.A.21 below). Another known number system consists of the octonions, also referred to as “Cayley numbers”. Roughly speaking, octonions provide a generalization of quaternions. In a similar fashion, quaternions can be thought of as a generalization of complex numbers, and complex numbers of reals. William Rowan Hamilton (1805-1865) was an Irish mathematician, astronomer, and physicist. Arthur Cayley (1821-1895) was a British mathematician working mainly in algebra. Actually, the octonions were introduced slightly earlier and independently by John T. Graves in 1843, the same year as his friend Hamilton discovered the quaternions.
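Readers curious about the quaternions already now can experiment with them in Sage, which has a built-in constructor for quaternion algebras. The following cell is only our own quick illustration of the non-commutative arithmetic mentioned in the footnote (the choice of the rationals QQ as base field is ours and plays no role later):

Q.<i,j,k> = QuaternionAlgebra(QQ, -1, -1)     # Hamilton quaternions: i^2 = j^2 = -1, k = i*j
print(i*j == k, j*i == -k)                    # True True: multiplication is not commutative
print((1 + 2*i + 3*j + 4*k).reduced_norm())   # 30 = 1^2 + 2^2 + 3^2 + 4^2

The failure of commutativity observed here is the main difference from the scalars discussed in 1.1.1: the quaternions satisfy all the properties listed there except (R2).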
1.1.2. Remarks. The integers Z are a good example of a commutative group. The natural numbers are not such an example, since they do not satisfy (CG4) (and possibly do not even contain the neutral element, if one does not consider zero to be a natural number).

If a commutative ring also satisfies the property (F), we speak of a field (often also called a commutative field). The last stated property (ID) is automatically satisfied if (F) holds. However, the converse statement is false. Thus we say that the property (ID) is weaker than (F). For example, the ring of integers Z does not satisfy (F) but does satisfy (ID). In such a case we use the term integral domain.

Notice that the set of all non-zero elements in a field, together with the operation of multiplication, satisfies (R1), (R2), (R3), (F), and thus is also a commutative group. However, in this case we speak of multiplication instead of addition. As an example, the set of all non-zero real numbers forms a commutative group under multiplication.

The elements of some set with operations + and · satisfying (not necessarily all of) the stated properties (for example, a commutative field, or an integral domain) may be called scalars. To denote them we usually use lowercase Latin letters, either from the beginning or from the end of the alphabet. We will use only these properties of scalars, and thus our results will hold for any objects with such properties. This is the true power of mathematical theories – they do not hold just for a specific solved example. Quite the opposite: when we build ideas in a rational way, they are always universal. We will try to emphasise this aspect, although our ambitions are modest due to the limited size of this book.

Before coming to any use of scalars, we should make a short formal detour and pay attention to their existence. We shall come back to this at the very end of this chapter, when we deal with the formal language of Mathematics in general, cf. the constructions starting in 1.6.5. There we indicate how to get the natural numbers N, the integers Z, and the rational numbers Q, while the real numbers R will be treated much later, in chapter 5. At this point, let us just remark that it is not enough to pose the axioms of objects. We have to be sure that the given conditions are not in conflict and that such objects might exist. We suppose the readers are sure about the existence of the domains N, Z, Q and can handle them easily. The real numbers are usually understood as a denser and better behaved version of Q, but what about the domain of complex numbers? As is usual in mathematics, we will use variables (letters of the alphabet or other symbols) to denote numbers, and it does not matter whether we know their value beforehand or not.

1.1.3. Complex numbers. We are forced to extend the domain of real numbers as soon as we want to see solutions of equations like $x^2 = b$ for all real numbers b. We know that this equation always has a solution x in the domain of real numbers whenever b is non-negative.
The graph of the given function f is displayed in the image on the right. The Sage cell that generates this graph uses the plot command, which requires three parameters: “what to plot”, “what the variable is”, and “the range of plotting”. Thus, in general, the syntax we use is as follows:

plot(f(x), (x, x_min, x_max))

In our case this takes the very simple form

f(x) = x**2 - 4*x + 8; plot(f(x), x, -10, 10)

In this block observe that we used “;” to type different commands on the same line, while the values −10, 10 specify where x is evaluated. In the graph we see that f does not intersect the x-axis. Thus the equation $f(x) = x^2 - 4x + 8 = 0$ cannot admit a solution over the reals. In general, this reflects the negative value of the discriminant $\Delta = b^2 - 4ac$ of an equation $ax^2 + bx + c = 0$. Indeed, ∆ = −16 in our case. Recall the well known formula for the solutions,
$$x_{1,2} = \frac{-b \pm \sqrt{\Delta}}{2a},$$
giving us the (possibly) complex solutions $x_{1,2}$ once we introduce $\sqrt{-1} = i$. Indeed, in our case $\sqrt{\Delta} = \sqrt{-1}\,\sqrt{-\Delta} = \sqrt{-1}\,\sqrt{16} = i\sqrt{16} = 4i$, and the solutions are $x_{1,2} = 2 \pm 2i$. □

1.A.6. A comment on power functions. Suppose that n ∈ N is a positive natural number.² The nth root $\sqrt[n]{x} = x^{1/n}$ of some positive real x > 0 is the inverse of the power function $f(x) = x^n$, so $(\sqrt[n]{x})^n = \sqrt[n]{x^n} = x$. A polynomial function is the sum of a finite number of (constant) multiples of power functions with natural exponents, as for example the function $f(x) = 2(x-4)^3 - 16$. Note that $x^0 = 1$ holds for each real or complex number x. Consider the equation f(x) = 0, i.e., $2(x-4)^3 = 16$. We have $x - 4 = \sqrt[3]{2^3} = 2$, which gives x = 6 as a quick solution. This is visible also in the graph of f, given below (this was obtained in Sage via the command plot(f, -2, 10), as before). However, since our equation is of degree 3, one expects three solutions over C (based on the fundamental theorem of algebra, see Chapter 12, and also the end of this section).

² In this book we mainly adopt the notation N = {0, 1, 2, 3, . . .} for the (ordered) set of naturals, so we view 0 as a natural number (except in Chapter 11 on number theory, where 0 is not considered a natural number).

If b < 0, then such a real x cannot exist. Thus we need to find a larger domain, where this equation has a solution. The crucial idea is to add to the real numbers the new number i, the imaginary unit, for which we require $i^2 = -1$. Next we try to extend the definitions of addition and multiplication in order to preserve the usual behaviour of numbers (as summarised in 1.1.1). Clearly we need to be able to multiply the new number i by real numbers and to sum it with real numbers. Therefore we need to work in our newly defined domain of complex numbers C with formal expressions of the form z = a + i b, called the algebraic form of z. The real number a is called the real part of the complex number z, the real number b is called the imaginary part of the complex number z, and we write Re(z) = a, Im(z) = b.

It should be noted that if z = a + i b and w = c + i d, then z = w implies both a = c and b = d. In other words, we can equate both real and imaginary parts. For positive x we then get $(i \cdot x)^2 = -1 \cdot x^2$, and thus we can solve the equations as requested. In order to satisfy all the properties of associativity and distributivity, we define the addition so that we add the real parts and the imaginary parts independently.
Similarly, we want the multiplication to behave as if we multiply pairs of real numbers, with the additional rule that $i^2 = -1$. Thus
$$(a + i b) + (c + i d) = (a + c) + i\,(b + d),$$
$$(a + i b) \cdot (c + i d) = (ac - bd) + i\,(bc + ad).$$
Next, we have to verify all the properties (CG1-4), (R1-4) and (F) of scalars from 1.1.1. But this is an easy exercise: zero is the number 0 + i 0, one is the number 1 + i 0; both these numbers are for simplicity denoted as before, that is, 0 and 1. For non-zero z = a + i b we easily check that $z^{-1} = (a^2 + b^2)^{-1}(a - i b)$. All other properties are obtained by direct calculations.

1.1.4. The complex plane and polar form. A complex number is given by a pair of real numbers, therefore it corresponds to a point in the real plane $R^2$. Our algebraic form of the complex numbers z = x + i y corresponds in this picture to understanding the x-coordinate axis as the real part and the y-coordinate axis as the imaginary part of the number. The absolute value of the complex number z is defined as its distance from the origin, thus $|z| = \sqrt{x^2 + y^2}$. The reflection with respect to the real axis then corresponds to changing the sign of the imaginary part. We call this operation $z \mapsto \bar z = x - i y$ the complex conjugation.

Let us now consider complex numbers of the form z = cos φ + i sin φ, where φ is a real parameter giving the angle between the real axis and the line from the origin to z (measured in the positive, i.e. counter-clockwise, sense). These numbers describe all points on the unit circle in the complex plane. Every non-zero complex number z can then be written as z = |z|(cos φ + i sin φ).

In order to determine all the solutions, we may use the command solve in Sage. This is an appropriate tool for solving equations and systems of equations. In the sequel we will encounter the solve command in various contexts and analyze its possible implementations in more detail. For the function f of our example, we can proceed as follows:

x = var("x"); f = 2*(x-4)**3 - 16
sol = solve([f==0], x); sol

and as an output we obtain a list of the three solutions, two of them being complex conjugate, i.e.,

[x == -I*sqrt(3) + 3, x == I*sqrt(3) + 3, x == 6]

A traditional way to find the roots of polynomials relies on the so-called Horner’s scheme. This provides a quick way to evaluate a given polynomial f(x), and arises by rewriting the polynomial as follows:
$$f(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n = a_0 + x\bigl(a_1 + x\bigl(a_2 + x(a_3 + \dots + x(a_{n-1} + x a_n)\dots)\bigr)\bigr).$$
Observe that such a procedure requires n multiplications and n additions (and this turns out to be optimal). In fact, we can treat the division of polynomials by divisors of the form x − ρ (ρ ∈ R) using Horner’s table:³

  ρ  |  a_n  |  a_{n-1}          |  · · ·  |  a_0
     |       |  ρ a_n            |  · · ·  |
     |  a_n  |  ρ a_n + a_{n-1}  |  · · ·  |  f(ρ)

In this table, each entry in the second row is the product of ρ with the bottom-row entry immediately to the left. The bottom line is the sum of the previous two lines. For example, if $f(x) = x^3 + x^2 - x + 2$ and ρ = −2, we obtain

  ρ = −2  |  1  |   1  |  −1  |   2
          |     |  −2  |   2  |  −2
          |  1  |  −1  |   1  |   0

The zero in the last entry (the remainder) verifies that ρ = −2 is a root of our polynomial, i.e., f(ρ) = 0.

³ The motivation behind the notation ρ in Horner’s scheme is the Greek word “ρίζα”, which means “root”.

1.A.7. Use Horner’s scheme to divide $p(x) = 5x^4 + x^2 - 2x + 2$ by x − 4. Then verify your answer by the classical algorithm of long division. ⃝
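The two rows of Horner’s table are easy to reproduce programmatically. The following short sketch is our own illustration (the names horner_eval and synthetic_division are our choices, not built-in Sage commands); since only ring operations are used, it works verbatim for complex coefficients as well (cf. 1.A.12 below):

def horner_eval(coeffs, rho):
    # coeffs = [a_n, ..., a_1, a_0], highest power first
    value = 0
    for a in coeffs:
        value = value*rho + a     # one multiplication and one addition per step
    return value                  # this is exactly f(rho)

def synthetic_division(coeffs, rho):
    # bottom row of Horner's table: quotient coefficients plus remainder f(rho)
    bottom = [coeffs[0]]
    for a in coeffs[1:]:
        bottom.append(bottom[-1]*rho + a)
    return bottom[:-1], bottom[-1]

# the example from the text: f(x) = x^3 + x^2 - x + 2 and rho = -2
print(synthetic_division([1, 1, -1, 2], -2))   # ([1, -1, 1], 0), so f(-2) = 0

The returned quotient [1, −1, 1] encodes $x^2 - x + 1$, in agreement with the bottom row of the table above.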
Polynomials can also be defined over the complex numbers, and Horner’s scheme is applicable in this case as well. Recall that to each complex number z = x + iy ∈ C we can associate the ordered pair (x, y) in the Euclidean plane. Thus, complex numbers can be viewed as ordered pairs of real numbers $(x, y) \in R^2$, and such pairs correspond to vectors in $R^2$ with initial point at the origin of $R^2$. As usual, we will use the x-axis for the reals Re z, and the y-axis for the purely imaginary numbers i Im z.

We proceed with basic exercises on the arithmetic of complex numbers. Recall that both the addition and the multiplication of complex numbers admit a geometric interpretation: addition corresponds to the vector sum, while multiplication by $i = \sqrt{-1}$, for example, is equivalent to a counterclockwise rotation by a right angle.

1.A.8. Arithmetic in C. Compute Re(z1), Im(z1), $\bar z_1$, |z2|, z1 + z2, and z1 z2, for the cases:
(a) z1 = 1 − 2i, z2 = 4i − 3;
(b) z1 = 3 + 4i, z2 = 3 − 4i.

Solution. (a) We have Re(z1) = 1, Im(z1) = −2, $\bar z_1 = 1 + 2i$, $|z_2| = \sqrt{4^2 + (-3)^2} = 5$, z1 + z2 = −2 + 2i, z1 z2 = 5 + 10i.

For given z ≠ 0, φ is unique if 0 ≤ φ < 2π. The number φ is called the argument of the complex number z, and this form of z is called the polar form of the complex number. This way of writing the complex numbers is very convenient for understanding multiplication. Consider the numbers z = |z|(cos φ + i sin φ) and w = |w|(cos ψ + i sin ψ) and calculate their product:
$$z \cdot w = |z|(\cos\varphi + i\sin\varphi)\,|w|(\cos\psi + i\sin\psi) = |z||w|\bigl(\cos\varphi\cos\psi - \sin\varphi\sin\psi + i\,(\cos\varphi\sin\psi + \sin\varphi\cos\psi)\bigr) = |z||w|\bigl(\cos(\varphi+\psi) + i\sin(\varphi+\psi)\bigr).$$
The last equality is a result of the addition formulas for trigonometric functions (we shall deal with them in more detail later in our discussion of rotations in the plane, see page 37). Division is equally easy. If z = |z|(cos φ + i sin φ) ≠ 0, then $w = |z|^{-1}(\cos\varphi - i\sin\varphi)$ satisfies zw = wz = 1, hence we can write $w = z^{-1} = 1/z$. We can summarize (and iterate the application of the previous formula on the product of the number z with itself):

Polar form and de Moivre Theorem

Consider two complex numbers z = |z|(cos φ + i sin φ) and w = |w|(cos ψ + i sin ψ) in polar form. Then, if n is an integer, positive or negative,
$$z\,w = |z|\,|w|\bigl(\cos(\varphi + \psi) + i\sin(\varphi + \psi)\bigr),$$
$$z^n = |z|^n\bigl(\cos(n\varphi) + i\sin(n\varphi)\bigr).$$

1.1.5. Functions. In most tasks we do not deal just with numbers, i.e. with individual values of scalars. More often, the values are associated to each of the elements in a set of objects. Formally, we talk about a mapping f : A → B assigning to each element x in the domain set A the value f(x) in the codomain set B. The set of all images f(x) ∈ B is called the range of f. The set A or B can be a set of numbers, but there is nothing to stop them being sets of other objects. The mapping f, however it is described, must unambiguously determine a unique member of B for each member of A. In another terminology, the member x ∈ A is often called the independent variable, and y = f(x) ∈ B is called the dependent variable. We also say that the value y = f(x) is a function of the independent variable x in the domain of f. For now, we shall restrict ourselves to the case where the codomain B is a subset of scalars, and we shall talk about scalar functions.

The simplest way to define a function appears if A is a finite set. Then we can describe the function f by a table or a listing showing the image of each member of A. We have certainly seen many examples of such functions:
Let f denote the pay of a worker in some company in a certain year. The values of the independent variable, that is, the domain of the function, are the individual workers x from the set of all considered workers. The value f(x) is their pay for the given year. Similarly, we can talk about the age of students or their teachers in years, the litres of beer and wine consumed by individuals from a given group, etc. Another example is a food dispensing machine. The domain of a function f would be the button pushed together with the money inserted to determine the selection of the food item.

Let A = {1, 2, 3} = B. The set of equalities f(1) = 1, f(2) = 3, f(3) = 3 defines a function f : A → B. Generally, as there are 3 possible values for f(1), and the same for f(2) and f(3), there are 27 possible functions from A into B in total.

But there are other ways to define a function than as a table. For example, the function f can denote the area of a planar region. Here, the domain consists of subsets of the plane (e.g. all triangles, circles or other planar regions with a defined area). The range of f consists of the respective areas of the regions. Rather than providing a list of areas for a finite number of regions, we hope for a formula allowing us to compute the functional value f(P) for any given planar region P from a suitable class. Of course, there are many simple functions given by formulae, e.g. f(x) = 3x + 7 with A = B = R or A = B = N. Not all functions can be given by a formula or list.

(b) Here we use Sage, where the field of complex numbers is denoted by CC, while, as we saw above, the square root of −1 is denoted by I (or i). To introduce a complex number there are many alternatives: e.g., we simply type z1 = 3 + 4*I, where * is the general multiplication operator in Sage,⁴ or type z1 = CC(3, 4). As for the additional code that gives the computations, type and run successively the following:

z1 = CC(3, 4); z2 = CC(3, -4)
z1.real(); z1.imag()
z1.conjugate(); abs(z2); z1+z2; z1*z2

Note that in case (b) we have $\bar z_2 = z_1$. In Sage, a verification of this takes the form z1 == conjugate(z2), after introducing z1, z2 as above. As is usual in Python (the mother of Sage), one uses “=” to determine a quantity, while “==” is used to state an equation between two quantities (in the latter case Sage returns True or False). It follows that $|z_1| = |z_2| = \sqrt{25} = 5$ (see also 1.A.10 below), and the product z1 z2 can also be obtained from the usual formula $z\bar z = |z|^2$, valid for any z ∈ C (verify this). □

⁴ As we will see later, in Sage the multiplication of matrices and vectors is also denoted by *.

1.A.9. Show that the Euclidean distance between two arbitrary complex numbers z = x + iy and w = u + iv is given by the formula $d = |z - w| = \sqrt{(x-u)^2 + (y-v)^2}$, with x, y, u, v ∈ R. ⃝
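As a quick Sage cross-check of the distance formula in 1.A.9, one can take the numbers z1, z2 from 1.A.8(a); this little cell is our own addition, not part of the exercise:

z = CC(1, -2); w = CC(-3, 4)                  # z1 = 1 - 2i and z2 = 4i - 3
print(abs(z - w))                             # 7.2111..., the distance |z1 - z2|
print(sqrt((1 - (-3))**2 + (-2 - 4)**2).n())  # the same value, sqrt(52)

Both commands return $\sqrt{52} \approx 7.2111$, as the formula predicts.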
1.A.10. Given the complex number $1 + i\sqrt{3}$, make use only of the figure given below to calculate its distance from its complex conjugate. Next, consider the complex number
$$z = \frac{2 + 2i\sqrt{3} + 3i - 3\sqrt{3}}{1 - i\sqrt{3}}.$$
Relate a verification of the equality $|z| = \sqrt{13}$ to the given figure, and find another z′ ∈ C with the same property, i.e., $|z'| = \sqrt{13}$.

Solution. Complex conjugates are symmetric with respect to the x-axis. Moreover, the distance of a complex number from the x-axis equals the absolute value of its imaginary part. Thus, as we also deduce from the given figure, the distance between $1 + i\sqrt{3}$ and its conjugate $1 - i\sqrt{3}$ equals $2\sqrt{3}$. Now, from the Pythagorean Theorem it follows that a complex number x + iy and its complex conjugate x − iy have the same length (absolute value), given by $r := \sqrt{x^2 + y^2}$. For our task this applies to $1 + i\sqrt{3}$. On the other hand, recall that the absolute value of the product of any two complex numbers is the product of their absolute values. Thus, if z1, z2 are two complex numbers, provided that z2 ≠ 0, we see that
$$\Bigl|\frac{z_1 z_2}{\bar z_2}\Bigr| = \frac{|z_1 z_2|}{|\bar z_2|} = \frac{|z_1|\,|z_2|}{|\bar z_2|} = |z_1|.$$
Observe now that the numerator of the given z can be expressed as
$$2 + 2i\sqrt{3} + 3i - 3\sqrt{3} = 2(1 + i\sqrt{3}) + 3i(1 + i\sqrt{3}) = (2 + 3i)(1 + i\sqrt{3}).$$
This means that our z has the form $z_1 z_2/\bar z_2$ with z1 = 2 + 3i and z2 = $1 + i\sqrt{3}$, respectively, and hence the result: $|z| = |z_1| = |2 + 3i| = \sqrt{2^2 + 3^2} = \sqrt{13}$. Because z2 does not play any essential role in this computation, one can find infinitely many complex numbers with the same property as the given z. □

1.A.11. Determine the distance d of the numbers z, $\bar z$ in the complex plane for $z = \sqrt{13} - i\,(\sqrt{13}/2)$. ⃝

1.A.12. Use Horner’s scheme to solve the equation
$$z^3 + (1 + i)z^2 + az + 2 = 0, \qquad z \in C,$$
when it is already known that $z_0 = i$ is a root of the given equation. ⃝

1.A.13. Solve the equation $x^2 + 2(1 + i)x = 14i$ (observe that this is a quadratic equation with complex coefficients).

Solution. Add to both sides of the equation the term $(1+i)^2$. This gives
$$x^2 + 2(1+i)x + (1+i)^2 = 14i + (1+i)^2,$$
or equivalently, $(x + (1+i))^2 = 16i$. Therefore, $x + (1+i) = \pm 4\sqrt{i}$, that is, $x = -(1+i) \pm 4\sqrt{i}$, and it remains to compute $\sqrt{i}$. Suppose that $a + ib = \sqrt{i}$ for some reals a, b ∈ R. Then $(a + ib)^2 = i$, or in other words, $a^2 + 2iab - b^2 = i = 0 + 1i$. By comparing the real and imaginary parts, we obtain the equations $a^2 - b^2 = 0$ and $2ab = 1$. The first relation gives a = ±b.

For example, let f(t) denote the speed of the car at time t. For any given car and time t, there will be functional values f(t) denoting its speed, which can of course be measured approximately, but usually not given by a formula. Another example: let f(n) be the nth digit in the decimal expansion of $\pi \doteq 3.1415\ldots$. So for example f(4) = 5. The value of f(n) is defined but unknown if n is large enough.

The mathematical approach in modelling real problems often starts from the indication of certain dependencies between some quantities and aims at explicit formulas for functions which describe them. Often a full formula is not available, but we may obtain the values f(x) at least for some instances of the independent variable x, or we may be able to find a suitable approximation. We shall see all of the following types of expressions of the requested function f in this book:
• exact finite expression (like the function f(x) = 3x + 7 above);
• infinite expression (we shall come to that only much later in chapter 5 when introducing the limit processes);
• description of how the function’s values change under a given change of the independent variable (this behaviour will be displayed under the name difference equation in a moment and under different circumstances later on);
• approximation of a not computable function with a known one (usually including some error estimates – this could be the case with the car above: say we know it goes with some known speed at the time t = 0, we brake as much as possible on a known surface, and we compute the decrease of speed with the help of some mathematical model);
• finding only the probability of possible values of the function, for example the function giving the length of life of a given group of still living people, in dependence on some health related parameters.
1.1.6. Functions defined explicitly. Let us start with the most desirable case, when the function values are defined by a computable finite formula. Of course, we shall also be interested in the efficiency of the formulas, i.e. how fast the evaluations would be. In principle, real computations can involve only a finite number of summations and multiplications of numbers. This is how we define the polynomials, i.e. functions of the form
$$f(x) = a_n x^n + \dots + a_1 x + a_0,$$
where $a_0, \dots, a_n$ are known scalars, and x is the unknown variable whose value we can insert. Here $x^n = 1 \cdot x \cdots x$ means the n-times repeated multiplication of the unit by x (in particular, $x^0 = 1$), and f(x) is the value of the indicated sum of products. This is a fairly well computable formula for each n ∈ N. The choice n = 0 provides the constants (i.e. constant functions) $a_0$.

The next example is more complicated.

Factorial function

Let A = $Z^+$ be the set of positive integers. For each n ∈ $Z^+$, define the factorial function by
$$n! = n(n-1)(n-2)\cdots 3 \cdot 2 \cdot 1.$$
For convenience we also define 0! = 1. (We will see later on why this is sensible.) It is easy to see that n! = n · (n − 1)! for all n ≥ 1. So 1! = 1, 2! = 2 · 1 = 2, 3! = 3 · 2 · 1 = 6, 6! = 720, etc.

The latter example deserves more attention. Notice that we could have defined the factorial by setting A = B = N and giving the equation f(n) = n · f(n − 1) for all n ≥ 1. This does not yet define f, but for each n it does determine what f(n) is in terms of its predecessor f(n − 1). This is sometimes called a recurrence relation. After choosing f(0) = 1, the recurrence determines f(1) and hence successively f(2), etc., and so a function is defined. It is the factorial function as described above.

2. Difference equations

The factorial function is one example of a function which can be defined on the natural numbers by means of a recurrence relation. Such a situation can often be seen when formulating mathematical models that describe real systems in practice. We will observe here only a few simple examples and return to this topic in chapter 3.

If a = b, then the second relation reduces to $a^2 = 1/2$, that is, $a = \pm 1/\sqrt{2} = \pm\sqrt{2}/2 = b$. We also see that the case a = −b is impossible, since it gives $2a^2 = -1$, which contradicts the fact that a ∈ R. Thus
$$\sqrt{i} = \pm\frac{\sqrt{2}}{2}(1 + i) \quad\text{and}\quad x = -(1+i) \pm 2\sqrt{2}(1+i) = (-1 \pm 2\sqrt{2})(1+i).$$
For convenience, let us verify the answer via Sage. The next commands, successively written, yield the computations of the real/imaginary parts of the two complex solutions:

eq = x**2 + 2*(1+I)*x - 14*I == 0
sols = solve([eq], x); sols
sols[0].rhs().real_part()
sols[0].rhs().imag_part()
sols[1].rhs().imag_part()
sols[1].rhs().real_part()

For example, the last command returns 2*sqrt(2) - 1, as it should, and similarly for the previous commands. As a useful remark, keep in mind that in Sage the command sols[0] returns the first solution, and sols[1] the second one. This means that Sage assigns labels to these solutions automatically, as the first two integers “0” and “1”, respectively. □

1.A.14. Polar form and algebraic form. Express z1 = 2 + 3i in polar form. Next, express z2 = 3(cos(π/3) + i sin(π/3)) in algebraic form.

Solution. As we learned from the figure in 1.A.10, the absolute value |z1| of the complex number z1 equals $r = \sqrt{13}$. Next, we need the angle (argument) φ of z1, indicated in the figure.
Applying the well-known rules for the trigonometric functions cos and sin yields $\sin\varphi = 3/\sqrt{13}$ and $\cos\varphi = 2/\sqrt{13}$, respectively. Therefore, with the aid of a calculator we get
$$\varphi = \arcsin(3/\sqrt{13}) = \arccos(2/\sqrt{13}) \doteq 56.3^\circ.$$
This computation is also easy in Sage: running the cell

z = 2 + 3*I; arg(z)

we obtain arctan(3/2). Then, to translate the angle into radians or degrees we just type

phi = arctan(3/2)
print(N(phi, digits=5), "radians")
print(N(phi*180/pi, digits=4), "degrees")

The given answer is 0.98279 radians and 56.31 degrees, respectively. Now we are able to present the polar form of z1 = x + iy: we have $x = \sqrt{13}\cos\varphi$ and $y = \sqrt{13}\sin\varphi$, which gives
$$z_1 = \sqrt{13}\Bigl(\frac{2}{\sqrt{13}} + i\,\frac{3}{\sqrt{13}}\Bigr) = \sqrt{13}\Bigl(\cos\bigl(\arccos\tfrac{2}{\sqrt{13}}\bigr) + i\,\sin\bigl(\arcsin\tfrac{3}{\sqrt{13}}\bigr)\Bigr).$$
Transition from polar form to algebraic form is even simpler:
$$z_2 = 3\Bigl(\cos\frac{\pi}{3} + i\sin\frac{\pi}{3}\Bigr) = 3\Bigl(\frac{1}{2} + i\,\frac{\sqrt{3}}{2}\Bigr). \qquad \square$$

1.A.15. Remark. In principle, the argument arg(z) = φ of a complex number z is only defined up to “modulo 2π”. That is, for any integer k ∈ Z, the angle φ + 2kπ would serve as well (since full circle rotations do not change the point). Therefore, in the previous exercise we have essentially defined the so-called principal (value of the) argument of z.

1.2.1. Linear difference equations of first order. A general difference equation of the first order (or first order recurrence) is an expression of the form
$$f(n+1) = F(n, f(n)),$$
where F is a known function with two arguments (independent variables). If we know the “initial” value f(0), we can compute f(1) = F(0, f(0)), then f(2) = F(1, f(1)), and so on. Using this, we can compute the value f(n) for arbitrary n ∈ N. An example of such an equation is provided by the factorial function f(n) = n!, where
$$(n+1)! = (n+1) \cdot n!, \qquad f(0) = 1.$$
In this way, the value of f(n + 1) depends on both n and the value of f(n), and formally we would express this recurrence in the form F(x, y) = (x + 1)y. A very simple example is f(n) = C for some fixed scalar C and all n. Another example is the linear difference equation of first order
$$(1)\qquad f(n+1) = a \cdot f(n) + b,$$
where a ≠ 0 and b are fixed scalars.

Such a difference equation is easy to solve if b = 0. Then it is the well-known recurrent definition of the geometric progression. We have f(1) = a f(0), f(2) = a f(1) = $a^2 f(0)$, and so on. Hence for all n we have $f(n) = a^n f(0)$. This is also the relation for the Malthusian population growth model, which is based on the assumption that the population size grows with a constant rate when measured at a sequence of fixed time intervals.

It is time to prove our first mathematical theorem. We deduce a general result for first order equations with variable coefficients, namely
$$(2)\qquad f(n+1) = a_n \cdot f(n) + b_n.$$
We use the usual notation ∑ for sums and the similar notation ∏ for products. We also use the convention that when the index set is empty, the sum is zero and the product is one.

1.2.2. Proposition. The general solution of the first order difference equation (2) from the previous paragraph with the initial condition f(0) = $y_0$ is, for n ∈ N, given by the formula
$$(1)\qquad f(n) = \Bigl(\prod_{i=0}^{n-1} a_i\Bigr) y_0 + \sum_{j=0}^{n-2}\Bigl(\prod_{i=j+1}^{n-1} a_i\Bigr) b_j + b_{n-1}.$$
Proof. We use mathematical induction. The result clearly holds for n = 1, since $f(1) = a_0 y_0 + b_0$. Assuming that the statement holds for some fixed n, we compute (do not be upset by the many brackets, and remember the conventions about the empty sums and products):
$$f(n+1) = a_n\Biggl(\Bigl(\prod_{i=0}^{n-1} a_i\Bigr) y_0 + \sum_{j=0}^{n-2}\Bigl(\prod_{i=j+1}^{n-1} a_i\Bigr) b_j + b_{n-1}\Biggr) + b_n = \Bigl(\prod_{i=0}^{n} a_i\Bigr) y_0 + \sum_{j=0}^{n-1}\Bigl(\prod_{i=j+1}^{n} a_i\Bigr) b_j + b_n,$$
as can be seen directly by multiplying out. □

Note that for the proof we did not use anything about the scalars except for the properties of a commutative ring. There is another approach to the proof, which explains the name "linear". First observe that with all $b_n = 0$, the formula from the proposition is obvious. Next observe that, dealing with two sequences $y_n$, $y'_n$ solving 1.2.2(2), their difference $z_n = y_n - y'_n$ must be a solution to the “homogenized” equation $f(n+1) = a_n \cdot f(n)$. Conversely, knowing such sequences $z_n$ and $y_n$, their sum $y'_n = z_n + y_n$ must be a solution of 1.2.2(2). Thus, we could just guess and check that the second and third terms of (1) solve our equation. In particular, the following corollary can be proved very easily this way. Try to complete this line of arguments in detail!

1.2.3. Corollary. The general solution of the linear difference equation (1) from 1.2.1, with a ≠ 1 and initial condition f(0) = $y_0$, is
$$(1)\qquad f(n) = a^n y_0 + (1 + \dots + a^{n-1})\,b = a^n y_0 + \frac{1 - a^n}{1 - a}\,b.$$

Proof. If we set $a_i$ and $b_i$ to be constants and use the general formula 1.2.2(1), we obtain
$$f(n) = a^n y_0 + b\Bigl(1 + \sum_{j=0}^{n-2} a^{n-j-1}\Bigr).$$
We observe that the expression in the bracket is $(1 + a + \dots + a^{n-1})$. The sum of this geometric progression follows from $1 - a^n = (1 - a)(1 + a + \dots + a^{n-1})$. □

The proof of the former proposition is a good example of a mathematical result where the verification is quite easy, as soon as someone tells us the theorem. Mathematical induction is a natural method of proof. Note that for calculating the sum of a geometric progression we silently assumed the existence of the inverse element for non-zero scalars. But actually, we derived the formula in a way manifesting that the division is always possible if our scalars do not allow for divisors of zero. In particular, notice that the formula (1) is valid with integer coefficients a, b and integer initial conditions. Here, we know in advance that each f(n) is an integer. Thus our formula necessarily gives correct integer solutions, although living in the extension of Z to the rational numbers Q.

1.A.16. (a) Is it true that the (principal) argument of a positive real number is zero and of a negative one is π?
(b) Is it true that the (principal) argument of the imaginary number i is π/2, while of −2i it is −π? ⃝

1.A.17. Express $z = 1 + \cos\frac{\pi}{3} + i\sin\frac{\pi}{3}$ in polar form. ⃝

1.A.18. De Moivre’s theorem applied. Simplify the expression $(5\sqrt{3} + 5i)^n$ for n = 2 and n = 12.

Solution. Using the binomial theorem for n = 2, we see that
$$(5\sqrt{3} + 5i)^2 = 75 + 10\sqrt{3} \cdot 5i - 25 = 50 + 50\sqrt{3}\,i.$$
For n = 12 it is shorter (and wiser) to express the complex number in polar form first:
$$5\sqrt{3} + 5i = 10\Bigl(\frac{\sqrt{3}}{2} + \frac{i}{2}\Bigr) = 10\Bigl(\cos\frac{\pi}{6} + i\sin\frac{\pi}{6}\Bigr).$$
Then, an application of the de Moivre formula presented in 1.1.4 gives the following:
$$(5\sqrt{3} + 5i)^{12} = 10^{12}\Bigl(\cos\frac{12\pi}{6} + i\sin\frac{12\pi}{6}\Bigr) = 10^{12}. \qquad \square$$

1.A.19. Compute the expression $\bigl(\cos\frac{\pi}{6} + i\sin\frac{\pi}{6}\bigr)^{31}$ by applying de Moivre’s theorem. ⃝

1.A.20. Use de Moivre’s theorem to prove the identities
$$\cos(3\varphi) = 4\cos^3(\varphi) - 3\cos(\varphi), \qquad \sin(3\varphi) = 3\sin(\varphi) - 4\sin^3(\varphi). \qquad ⃝$$
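Computations with de Moivre’s theorem such as those in 1.A.18 and 1.A.19 can be cross-checked in Sage; the cell below is our own sketch (note that it also reveals the expected outcome of 1.A.19):

z = 5*sqrt(3) + 5*I
print((z**2).expand())                  # 50*I*sqrt(3) + 50
print((z**12).expand())                 # 1000000000000, i.e. 10^12
print(cos(31*pi/6) + I*sin(31*pi/6))    # -1/2*I - 1/2*sqrt(3)

The last line uses the right hand side of de Moivre’s formula with n = 31 and φ = π/6.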
The beauty of complex numbers is also linked to the “fundamental theorem of algebra”, see 12.4.20. In fact, complex numbers arose from the need to solve cubic equations (without meaning that the solutions must be complex). Next we describe a procedure which was developed during the 16th century by S. del Ferro (1465-1526), G. Cardano (1501-1576), N. F. Tartaglia (1500-1557), and possibly others. Unlike the quadratic equations, here we need the complex numbers even if all three solutions are real.

1.A.21. Express a solution of the cubic equation $x^3 + ax^2 + bx + c = 0$ in terms of the real coefficients a, b and c.

Solution. The first step reduces the quadratic dependence by setting x := t − a/3. The original equation then becomes $t^3 + pt + q = 0$, with $p = b - a^2/3$ and $q = c + (2a^3 - 9ab)/27$, respectively. The next step requires us to be more creative. We introduce new variables u and v satisfying the conditions
$$u + v = t, \qquad 3uv + p = 0.$$
Using these, we can substitute the first condition into the previous equation to obtain
$$u^3 + v^3 + (3uv + p)(u + v) + q = 0.$$
Next, use the second equation to eliminate v. This yields
$$u^6 + qu^3 - \frac{p^3}{27} = 0,$$
which is a quadratic equation in the unknown $s = u^3$. Hence
$$u = \sqrt[3]{-\frac{q}{2} \pm \sqrt{\frac{q^2}{4} + \frac{p^3}{27}}}.$$
Finally, by back substitution we obtain the desired answer:
$$x = -\frac{p}{3u} + u - \frac{a}{3}.$$
Observe that in order to obtain all three solutions, one has to work with complex roots. This is because the equation $x^3 = a$, with a ≠ 0, has exactly three solutions over C (according to the fundamental theorem of algebra). If one needs all three solutions, the Sage cell given below provides an answer (though the expressions are complicated enough, and hence not presented here).

x, a, b, c = var("x, a, b, c")
assume(a, b, c, "real")
solve(x**3 + a*x**2 + b*x + c == 0, x)

□

1.A.22. Solve the equation $x^3 + x^2 - 2x - 1 = 0$. ⃝

A series of additional tasks related to complex numbers is presented at the end of this chapter. In fact, once we provide a systematic treatment of basic linear algebra in Chapter 2, we will be able to demonstrate how to treat complex numbers via matrices. This takes place in paragraph ??.

B. Difference equations

Difference equations (also called recurrence relations) are rules which determine the values of the elements of a sequence in terms of previous elements. Solving a difference equation means finding an explicit formula for an arbitrary element of the sequence. If each element of the sequence is determined only by the previous element, we talk about first order difference equations. As we will see below, and also in Chapter 3, such recurrence relations are often induced by problems appearing in our everyday life.

The linear difference equation 1.2.1(1) can be neatly interpreted as a mathematical model for finance, e.g. savings or loan payoff with a fixed interest rate a and fixed repayment b. (The cases of savings and loans differ only in the sign of b.) With varying parameters a and b we obtain a similar model with varying interest rate and repayment. We can imagine, for instance, that n is the number of months, $a_n$ is the interest rate in the nth month, and $b_n$ the repayment in the nth month.

1.2.4. A nonlinear example. When discussing linear difference equations, we mentioned a very primitive population growth model which depends directly on the momentary population size p. At first sight, it is clear that such a model with a > 1 leads to very rapid and unbounded growth. A more realistic model has such a population change ∆p(n) = p(n + 1) − p(n) only for small values of p, that is, ∆p/p ∼ r > 0. Thus, if we want to let the population grow by 5% per time interval only for small p, then we choose r to be 0.05. For some limiting value p = K > 0 the population may not grow.
For even greater values it may even decrease, for instance if the resources for feeding the population are limited, or if the individuals in a large population are obstacles to each other, etc. Assume that the values y_n = ∆p(n)/p(n) change linearly in p(n). Graphically, we can imagine this dependence as a line in the plane of the variables p and y. This line passes through the point [0, r], so that y = r when p = 0. It also passes through [K, 0], which encodes the second condition, namely that the population does not change when p = K. Thus we set y = −(r/K) p + r. Setting y = y_n = ∆p(n)/p(n) and p = p(n), we obtain (p(n+1) − p(n))/p(n) = −(r/K) p(n) + r. Multiplying out, we obtain a first order difference equation with p(n) present in both its first and second powers: (1) p(n+1) = p(n) (1 − (r/K) p(n) + r). 13 type of recurrence relations are often induced by problems appearing in our everyday life. 1.B.1. Michael wants to buy a great new car, which costs €30,000. He wants to take out a loan and pay it off under a fixed monthly repayment agreement. The car company offers him a loan with a yearly interest rate of 6%, with the first repayment at the end of the first month of the loan. Specify the exact amount of the monthly payment, under the assumption that Michael aims to redeem the loan in three years. Solution. Let us denote by P the sum that Michael has to pay per month, and by d_k the amount of the remaining loan after k months. Set also C = 30000 and u = 0.06/12 (the latter represents the monthly interest rate). We have d_0 = C = 30000, and after the first month we see that d_1 = C − P + u·C. After the kth month we compute d_k = d_{k−1} − P + u d_{k−1} = (1 + u) d_{k−1} − P. (∗) In order to solve (∗) we may apply the statement of Corollary 1.2.3. In terms of this result we have a = 1 + u, b = −P, and thus the solution takes the form d_k = d_0 a^k − P (a^k − 1)/(a − 1). Paying off the loan in three years means d_36 = 0. This gives P = 30000 · u(1 + u)^36 / ((1 + u)^36 − 1) ≈ 912.7. Thus Michael should pay approximately €913 per month. □ 1.B.2. Consider the task in 1.B.1. For how long will Michael pay, if the monthly payment is €500? ⃝ 1.B.3. Determine the sequence {y_n}_{n=1}^∞ which satisfies the recurrence relation y_{n+1} = (3/2) y_n + 1, n ≥ 1, with y_1 = 1. Then verify your answer via Sage. Solution. We can apply Corollary 1.2.3 again. We have a = 3/2, b = 1 and y_0 = 0 (so that y_1 = 1). Hence, according to the given formula, we deduce that y_n = (3/2)^n · 0 + ((1 − (3/2)^n)/(1 − 3/2)) · 1 = −2 + 2(3/2)^n. Let us explain the procedure via Sage, which has a very friendly package for treating recurrence relations. An appropriate function to load is rsolve (from the pure Python package sympy, which is for symbolic computations). It can deal with linear recurrence relations, and initially we type the following code:
from sympy import Function, rsolve
from sympy.abc import n
y = Function("y")
The next step is to type the relation that we want to solve and the initial conditions: CHAPTER 1. INITIAL WARMUP Try to think through the behaviour of this model for various values of r and K. In the diagram we can see the results for the parameters r = 0.05 (that is, five percent growth in the ideal state) and K = 100 (resources limit the population to the size 100), with p(0) = 2, so that initially there are two individuals. Note that the original almost exponential growth slows down later, and the population size approaches the desired limit of 100 individuals.
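The experiment from the diagram is easy to reproduce; a minimal Sage sketch iterating the recurrence (1) with the parameters of the text (r = 0.05, K = 100, p(0) = 2):
# Iterate p(n+1) = p(n)*(1 - (r/K)*p(n) + r), cf. (1)
r, K, p = 0.05, 100, 2.0
values = [p]
for _ in range(300):
    p = p*(1 - r/K*p + r)
    values.append(p)
print(values[60], values[300])   # the growth slows down and approaches K = 100
# list_plot(values) reproduces the picture
Larger values of r lead to the much more diverse behaviour mentioned in the footnote below.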
For p close to one and K much greater than r, the right side of the equation (1) is approximately p(n)(1 + r). That is, the behaviour is similar to that of the Malthusian model. On the other hand, if p is almost equal to K, the right-hand side of the equation is approximately p(n). For an initial value of p greater than K, the population size will decrease; for an initial value of p less than K, it will increase.1 3. Combinatorics A typical "combinatorial" problem is to count in how many ways something can happen. For instance, in how many ways can we choose two different sandwiches from the daily offering of a grocery shop? In this situation we first need to decide what we mean by different. Do we then allow the choice of two "identical" sandwiches? Many such questions occur in the context of card games and other games. The solution of particular problems usually involves either some multiplication of partial results (if the individual possibilities are independent) or some addition (if their appearances are disjoint). This is demonstrated in many examples in the problem column (cf. several problems starting with 1.D.2). 1.3.1. Permutations. Suppose we have a set of n (distinguishable) objects, and we wish to arrange them in some order. We can choose the first object in n ways, then the second in n − 1 ways, the third in n − 2 ways, and so on, until we choose the last object, for which there is only one choice. The total number of possible arrangements is the product of these; hence there are exactly n! = n(n − 1)(n − 2) ··· 3 · 2 · 1 distinct orders of the objects. Each ordering of the elements of a set S is called a permutation of the elements of S. The number of permutations of a set with n elements is n!. We can identify the elements of S by numbering them (using the digits from one to n), that is, we identify S with the set S = {1, . . . , n} of n natural numbers. Then the permutations correspond to the possible orderings of the numbers from one to n. Thus we have an example of a simple mathematical theorem, and this discussion can be considered to be its proof. 1This model is called the discrete logistic model. Its continuous version was introduced already in 1845 by Pierre François Verhulst. Depending on the proportions of the parameters r, K and p(0), the behaviour can be very diverse, including chaotic dynamics. There is much literature on this model. 14
f = y(n+1) - 3*y(n)/2 - 1
initial = {y(1):1, y(2):5/2}
To get the solution we finally type rsolve(f, y(n), initial). This gives the answer5 2*(3/2)**n − 2. □ In Chapter 3 we will study recurrences in a more systematic way. Now we continue our observations with a very characteristic example, the Fibonacci numbers, which we will meet several times, see e.g. 3.B.1. These numbers are defined by the recurrence relation involving two preceding values, F_n = F_{n−1} + F_{n−2} (n ≥ 3), with the initial conditions F_1 = 1 and F_2 = 1 (or the same with F_0 = 0). Hence the very first numbers in the Fibonacci sequence (F_n)_{n∈N} are 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, etc.6 We shall call such equations (second order) linear difference equations with constant coefficients in Chapter 3, and we will learn how to solve them. 1.B.4. Fibonacci numbers. A 2-composition of a positive integer n is a representation of n as an ordered sum of the numbers 1 and 2. Notice that the order indeed matters; for instance, 3 = 1 + 1 + 1, 3 = 1 + 2 and 3 = 2 + 1, hence the integer 3 has three distinct 2-compositions.
Let C_n be the number of 2-compositions of n ∈ N. Show that C_n = F_{n+1} for n ∈ N. Having Sage as a tool, test (or prove) the claim, too. Solution. Notice that C_0 = 1 = F_1 due to our conventions, since the empty sum is the only one providing the result. Thus we may omit n = 0 from our considerations. Let us denote by C_n(1) and C_n(2) the number of 2-compositions of n ending with 1 and 2, respectively. For n = 1 we have C_1(1) = 1 and C_1(2) = 0, so C_1 = C_1(1) + C_1(2) = 1 + 0 = 1 = F_2. For n = 2 we have C_2(1) = 1 and C_2(2) = 1, so C_2 = C_2(1) + C_2(2) = 1 + 1 = 2 = F_3. Hence the claim holds for n = 1, 2. Suppose now that the integer n satisfies n ≥ 3. Assume first that a 2-composition of n ends with 1. In this case, we obtain a 2-composition of the integer n − 1 by deleting this 1. Conversely, by appending a 1 to a 2-composition of n − 1, we get a 2-composition of n that ends with 1. Thus we conclude that C_n(1) = C_{n−1}. Similarly, if a 2-composition of n ends with 2, then we get a 2-composition of the integer n − 2 by deleting this 2. However, by appending either a 2 or two 1s to the end of a 2-composition of n − 2, we obtain a 2-composition of n ending with 2 or with 1, respectively. Since the latter case has been counted already above, we conclude that C_n(2) = C_{n−2}. Altogether we 5Recall that a**n in Sage means a^n. 6The Fibonacci numbers have a long and rich history, and they appear in many applications. These numbers were first discussed by Acharya Pingala, an Indian poet and mathematician (around 200 BC), when counting the possible patterns of poetry forms based on syllables of two lengths. The Italian mathematician Leonardo di Pisa (also called Bonacci, Pisano, or Fibonacci) explained them to the western community in his famous book Liber Abaci in 1202. There, these numbers were used to explain the rabbit reproduction question. They also arise in combinatorics and graph theory, and we shall meet them several times, e.g. in Chapters 3 and 13. CHAPTER 1. INITIAL WARMUP Number of permutations Proposition. The number p(n) of distinct orderings of a finite set with n elements is given by the factorial function: (1) p(n) = n! Suppose S is a set with n elements and we wish to choose and arrange in order just k of the members of S, where 1 ≤ k ≤ n. This is called a k-permutation without repetition of the n elements. The same reasoning as above shows that this can be done in v(n, k) = n(n − 1)(n − 2) ··· (n − k + 1) = n!/(n − k)! ways. The right-hand side of this result also makes sense for k = 0 (there is just one way of choosing nothing), and for k = n, since 0! = 1. Now we modify the problem to the case where the order of selection is immaterial. 1.3.2. Combinations. Consider a set S with n elements. A k-combination of the elements of S is a selection of k elements of S, 0 ≤ k ≤ n, where the order does not matter. For k ≥ 1, the number of possible results of a sequential choice of our k elements is n(n − 1)(n − 2) ··· (n − k + 1) (a k-permutation). We obtain the same k-tuple in k! distinct orders. Hence the number of k-combinations is n(n − 1)(n − 2) ··· (n − k + 1)/k! = n!/((n − k)! k!). If k = 0, the same formula is still true, since 0! = 1, and there is just one way to select nothing. Combinations Proposition. The number c(n, k) = (n k) of combinations of the k-th degree among n elements, where 0 ≤ k ≤ n, is (1) c(n, k) = n(n − 1) ··· (n − k + 1)/(k(k − 1) ··· 1) = n!/((n − k)! k!). We pronounce the binomial coefficient (n k) as "n over k" or "n choose k".
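Both counts are available in Sage directly; a quick hedged check of the formulas above, with n = 10, k = 3 chosen arbitrarily:
# c(10,3) in two ways, and v(10,3) = 10!/7!
n, k = 10, 3
print(binomial(n, k), factorial(n)/(factorial(k)*factorial(n - k)))   # 120 120
print(factorial(n)/factorial(n - k))                                  # 720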
The name stems from the binomial expansion, which is the expansion of (a + b)^n. If we expand (a + b)^n, the coefficient of a^k b^{n−k} is the number of ways to choose a k-tuple of parentheses from the n parentheses in the product (from these parentheses we take a, from the others we take b). Therefore we have (2) (a + b)^n = ∑_{k=0}^{n} (n k) a^k b^{n−k}. Note that only distributivity, commutativity and associativity of multiplication and addition were necessary. The formula (2) therefore holds in every commutative ring. 15 get C_n = C_n(1) + C_n(2) = C_{n−1} + C_{n−2}, and this coincides with the Fibonacci recurrence. Thus the difference may be only in the initial conditions; but we have already verified C_1 = F_2 and C_2 = F_3, and this concludes the proof. To test C_n = F_{n+1} via Sage we will use the observed recurrence relation together with the initial conditions, which we present via Sage. Notice how Sage imports tools from Python packages; here Sage learns how to solve recurrences. Of course, c(n) will be listed very fast, after being resolved in the 6th line:
from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n) - a(n-1) - a(n-2)
initial = {a(1):1, a(2):2}
c(n) = rsolve(f, a(n), initial)
all(c(k) == fibonacci(k+1) for k in (0 .. 120))
This returns True and so verifies our claim. Inside the cell we used the function fibonacci(n), which returns the nth Fibonacci number, and the function all, which checks that all of the stated equalities hold. Later, in 1.G.20, the reader can find a visual presentation of C_n. □ Recurrence relations can be much more complicated than those of first or second order. To highlight the case, we present an example of order 3, and another one involving more equations in more variables. In the latter case, although we are not able to evaluate an arbitrary term of the presented sequence P(k,l) explicitly, we succeed in answering the question using the relevant recurrence relations. This is an example of the so-called partial difference equations, since the terms of the sequence are indexed by two independent variables k, l. Further applications of recurrence relations appear in the final section of this chapter. 1.B.5. How many words of length 12 can be constructed using only the letters A and B, under the assumption that they do not contain the string BBB? Solution. Let a_n denote the number of words of length n consisting of the letters A and B, but without BBB as a substring. The words of length n > 3 that satisfy the given condition end either with the letter A, or with the two letters AB, or with the three letters ABB. (Notice that endings in AA are covered by the first case, ABB is the only admissible ending of the form ·BB, and the remaining possibility BAB of the last three letters is covered already.) There are a_{n−1} admissible words ending with A (observe that preceding the last A there can be an arbitrary word of length n − 1 satisfying our condition), and analogously for the two remaining groups. Thus a_n (n > 3) satisfies the recurrence relation a_n = a_{n−1} + a_{n−2} + a_{n−3} CHAPTER 1. INITIAL WARMUP
If we expand the right-hand side of (2), we obtain (n k) + (n k+1) = n!/(k!(n − k)!) + n!/((k + 1)!(n − k − 1)!) = ((k + 1) n! + (n − k) n!)/((k + 1)!(n − k)!) = (n + 1)!/((k + 1)!(n − k)!), which is the left-hand side of (2). In order to prove (3), we use mathematical induction. Mathematical induction consists of two steps. In the initial step, we establish the claim for n = 0 (in general, for the smallest n for which the claim should hold). In the inductive step, we assume that the claim holds for some n (and all smaller numbers), and use this to prove the claim for n + 1. The principle of mathematical induction then asserts that the claim holds for every n. The claim (3) clearly holds for n = 0, since (0 0) = 1 = 2^0. It holds also for n = 1. Now assume that the claim holds for some n ≥ 1. We must prove the corresponding claim for n + 1 using the claims (2) and (3). We calculate ∑_{k=0}^{n+1} (n+1 k) = ∑_{k=0}^{n+1} ((n k−1) + (n k)) = ∑_{k=−1}^{n} (n k) + ∑_{k=0}^{n+1} (n k) = 2^n + 2^n = 2^{n+1}. Note that the formula (3) gives the number of all subsets of an n-element set, since (n k) is the number of all its subsets of size k. Note also that (3) follows from 1.3.2(2) by choosing a = b = 1. To prove (4) we again employ induction, as we did for (3). For n = 0 the claim clearly holds. The inductive assumption says that (4) holds for some n. We calculate the corresponding sum for n + 1 using (2) and the inductive assumption. We obtain 16 with a_1 = 2, a_2 = 4, and a_3 = 7. Using this relation we can now compute a_12. The answer is 1705. Of course, we can get this by a simple modification of the previous Sage cell; give it a try! □ 1.B.6. After the first quarter, the score of a basketball match between the national teams of Russia and the Czech Republic is 12 : 9 for the Czech team. In how many ways could the score have developed? Solution. We can divide all possible evolutions of a quarter with final score k : l into six mutually exclusive possibilities, according to which team scored last, and how many points that score was worth (1, 2 or 3). If we denote by P(k,l) the number of ways in which the score could have developed for a quarter that ended with k : l, then for k, l ≥ 3 the following recurrence relation holds: P(k,l) = P(k−3,l) + P(k−2,l) + P(k−1,l) + P(k,l−1) + P(k,l−2) + P(k,l−3). By the symmetry of the problem, P(k,l) = P(l,k). In addition, for k ≥ 3 we see that P(k,2) = P(k−3,2) + P(k−2,2) + P(k−1,2) + P(k,1) + P(k,0), P(k,1) = P(k−3,1) + P(k−2,1) + P(k−1,1) + P(k,0), P(k,0) = P(k−3,0) + P(k−2,0) + P(k−1,0). These relations, along with the initial conditions P(0,0) = 1, P(1,0) = 1, P(2,0) = 2, P(3,0) = 4, P(1,1) = 2, give P(2,1) = P(1,1) + P(0,1) + P(2,0) = 5, P(2,2) = P(0,2) + P(1,2) + P(2,1) + P(2,0) = 14. Therefore, by repeatedly using the above equations, we arrive at nearly 500 million options, since P(12,9) = 497178513. □ C. Combinatorics In this section we use natural numbers to label items that may or may not be distinguishable, and address questions about the number of ways certain events can occur. Let us begin with a straightforward problem involving permutations, which will demonstrate the usefulness of mathematical software packages such as Matlab, Maple, Sage, etc. We expect the reader to be familiar with the basic concepts such as permutations and combinations, as well as with adding or multiplying the numbers of independent options. If not, have a look into the other column in 1.3.1.
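The identities being proved in the other column are easy to test numerically; a small hedged Sage check for the first few n:
# Check identities (3) and (4) of 1.3.3 for n = 0,...,7
for n in range(8):
    assert sum(binomial(n, k) for k in range(n + 1)) == 2^n
    assert sum(k*binomial(n, k) for k in range(n + 1)) == n*2^(n - 1)
print("identities (3) and (4) hold for n < 8")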
CHAPTER 1. INITIAL WARMUP ∑_{k=0}^{n+1} k (n+1 k) = ∑_{k=0}^{n+1} k ((n k−1) + (n k)) = ∑_{k=−1}^{n} (k+1) (n k) + ∑_{k=0}^{n+1} k (n k) = ∑_{k=0}^{n} (n k) + ∑_{k=0}^{n} k (n k) + ∑_{k=0}^{n} k (n k) = 2^n + n 2^{n−1} + n 2^{n−1} = (n + 1) 2^n. This completes the inductive step, and the claim is proved for all natural n. □ The second property above allows us to write down all the binomial coefficients in the Pascal triangle.2 Here, every coefficient is obtained as the sum of the two coefficients situated right "above" it:
n = 0 : 1
n = 1 : 1 1
n = 2 : 1 2 1
n = 3 : 1 3 3 1
n = 4 : 1 4 6 4 1
n = 5 : 1 5 10 10 5 1
Note that the individual rows contain the coefficients of the individual powers in the expansion (2). For instance, the last row given says (a + b)^5 = a^5 + 5a^4 b + 10a^3 b^2 + 10a^2 b^3 + 5ab^4 + b^5. 1.3.4. Choice with repetitions. An ordering of n elements, some of which are indistinguishable, is called a permutation with repetitions. Among n given elements, suppose there are p_1 elements of the first kind, p_2 elements of the second kind, ..., p_k of the k-th kind, where p_1 + p_2 + ··· + p_k = n. Then the number of permutations with repetitions of these elements is denoted by P(p_1, . . . , p_k). We consider orderings which differ only in the order of indistinguishable elements to be identical. Elements of the ith kind can be ordered in p_i! ways, thus we have Permutations with repetitions The number of permutations with repetitions is P(p_1, . . . , p_k) = n!/(p_1! ··· p_k!). Let S be a set with n distinct elements. We wish to select k elements, 0 ≤ k ≤ n, from S with repetition permitted. This is called a k-permutation with repetition. Since the first selection can be done in n ways, the second again in n ways, and so on, the total number V(n, k) of k-permutations with repetitions is n^k. Hence 2Although the name goes back to Blaise Pascal's treatise from 1653, such a neat triangle configuration of the numbers c(n, k) was known for centuries earlier in China, India, Greece, etc. 17 1.C.1. Presentations of permutations. List all permutations of 4 digits using any computer algebra system of your choice. Think of them as mappings. Solution. There exist 4! = 4 · 3 · 2 · 1 = 24 such permutations, e.g. [1, 2, 3, 4], [1, 2, 4, 3], [1, 3, 2, 4], etc., and with some care one can list all the possible cases. We appreciate Sage, which can do this for us much faster, or at least verify our computation, via the cell
per4 = Permutations(4); per4.list()
The answer is [[1, 2, 3, 4], [1, 2, 4, 3], [1, 3, 2, 4], [1, 3, 4, 2], [1, 4, 2, 3], [1, 4, 3, 2], [2, 1, 3, 4], [2, 1, 4, 3], [2, 3, 1, 4], [2, 3, 4, 1], [2, 4, 1, 3], [2, 4, 3, 1], [3, 1, 2, 4], [3, 1, 4, 2], [3, 2, 1, 4], [3, 2, 4, 1], [3, 4, 1, 2], [3, 4, 2, 1], [4, 1, 2, 3], [4, 1, 3, 2], [4, 2, 1, 3], [4, 2, 3, 1], [4, 3, 1, 2], [4, 3, 2, 1]]. Note that the command per4[12] returns the 13th permutation (Sage counts from zero), which is [3, 1, 2, 4]; similarly one can access any of the 24 permutations directly, without needing to list them all (hence, in this case we do not need the command per4.list()). Recall also that permutations of the elements of a finite set S are essentially bijections f : S → S. To realize this perspective via Sage and interpret permutations as functions, let us fix the 13th permutation as above. Then, the cell
Then, the cell per4=Permutations(4) Pm = Permutation(per4[12]) for i in Pm: print(Pm.index(i)+1,"->",i) returns the desired result, i.e., 1 -> 3; 2 -> 1; 3 -> 2; 4 -> 4 Think of the syntax of these commands in Sage (e.g., the command per4[12] provides simply a list, while the second line keeps Pm to be the permutation object in Sage; the index method returns the index of the first appearance of the value - which is OK with permutations, but the method enumerate applied to lists works in general; the +1 is necessary since the indices run from 0). Experiment with additional examples yourself! For instance, composition of permutations is encoded by the standard ∗ operator – what would be the Pm ∗ Pm permutation? Hint: follow the above arrows to compose the mappings, and check in Sage. □ 1.C.2. During a conference, 8 speakers are scheduled. Determine the number of all possible orderings in which two given speakers do not speak one right after the other. Solution. Denote the two given speakers by A and B, respectively. If B follows directly after the speaker A, we can consider it as a speech by a single speaker AB. The number of all orderings where B speaks directly after A is therefore 7!, CHAPTER 1. INITIAL WARMUP k-permutations with repetitions V (n, k) = nk . If we are interested in a choice of k elements without taking care of order, we speak of k-combinations with repetitions. At first sight, it does not seem to be easy to determine the number. We reduce the problem to another problem we have already solved, namely combinations without repetitions: Combinations with repetitions Theorem. The number of k-combinations with repetitions from n elements equals for every k ≥ 0 and n ≥ 1 ( n + k − 1 k ) . Proof. Label the n elements as a1, a2, · · · , an. Suppose each element labeled ai is selected ki times, 0 ≤ ki ≤ k, so that k1+k2+· · ·+kn = k. Each such selection can be paired with the sequence of symbols ∗ and | where each ∗ represents one selection of an element and individual boxes are separated by | (therefore there are n − 1 of them). The number of ∗ in the ith box is equal to ki, so we obtain the sequence ∗ · · · ∗ k1 | ∗ · · · ∗ k2 | · · · | ∗ · · · ∗ kn . The other way around, from any such sequence we can determine the number of selections of any element (e.g. the number of ∗ before first | determines k1). Having altogether k symbols ∗ and n − 1 separators | we see that there are ( n + k − 1 n − 1 ) = ( n + k − 1 k ) possible sequences and therefore also the same number of the required selections. □ 4. Probability Now we are going to discuss the last type of function description, as listed in the very end of the subsection 1.1.5. Thus, instead of assigning explicit values of a function, we shall try to describe the probabilities of the individual options. 1.4.1. What is probability? As a simple example we can use common six-sided dice throwing, with sides labelled as 1, 2, 3, 4, 5, 6. If we describe the mathematical model of such throwing with a “fair” dice, we expect by symmetry that every side occurs with the same frequency. We say that “every side occurs with the probability 1/6”. 18 the number of permutations of seven elements. By symmetry, the number of all orderings where A speaks directly after B is also 7!. Since the number of all possible orderings of eight speakers is 8!, the solution is 8! − 2 · 7!. □ 1.C.3. 
1.C.3. How many rearrangements of the letters of the word "problem" are there, such that: (a) the letters b, r are next to each other; (b) the letters b, r are not next to each other. ⃝ 1.C.4. k-permutations and k-combinations. Determine the number of codes consisting of 5 digits in a specific order, under the assumption that every digit can be used only once. Deduce that 5-digit codes of this type are much more secure than the corresponding 4-digit codes. Finally, what is the situation if the order of the digits in the code does not matter? Solution. We are interested in codes with 5 digits, where the order of the digits is crucial and every digit appears only once. In other words, we have the set of digits, say S = {0, 1, . . . , 9}, and we want to choose and arrange in order 5 elements of S. In terms of the paragraph 1.3.1, this is a 5-permutation of 10 elements without repetition, which can be done in u(10, 5) = 10!/(10 − 5)! = 10 · 9 · 8 · 7 · 6 = 30240 ways. In Sage we can compute u(10, 5) by typing
factorial(10)/factorial(5)
For a 4-digit code with the same characteristics, the corresponding number of permutations is u(10, 4) = 5040, hence the claim. (Of course, we see directly from the formulas that the latter number is 6 times smaller than the former!) Assume now that the order plays no role. Then codes of the form 12345 and 54123, for instance, represent the same code, hence one finally gets far fewer possibilities (codes of this type are very unsafe). In particular, this case yields the definition of a 5-combination of 10 elements, whose number, according to 1.3.2, is given by c(10, 5) = (10 5) = 10!/(5!(10 − 5)!) = 30240/120 = 252. Sage provides the command binomial(n, k) for the k-combinations of n elements, which in our case applies as
binomial(10, 5)
In these terms, the very first result about u(10, 5) is also obtained by the command binomial(10, 5)*factorial(5). □ 1.C.5. Remark. Notice that Sage does not have a direct command for k-permutations. Let us construct one, by introducing a function called uperm(n, k):7 7Note that usually the cells that we provide in Sage can be copy-pasted without any extra editing in the Sage editor. However, when we introduce functions, as below, the reader should be careful to type the commands exactly as they appear here, since the indentation plays a role. CHAPTER 1. INITIAL WARMUP But when throwing some less symmetric six-faced die, the actual probabilities of the individual results might be quite different. Let us build a simple mathematical model for this. We shall work with parameters p_i for the probabilities of the individual sides, with two requirements: these probabilities have to be non-negative real numbers, and their sum must be one, i.e. p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1. At this time, we are not concerned about the particular choice of the specific values p_i; they are given to us. Later on, in Chapter 10, we shall link probability with mathematical statistics, and then we shall introduce methods for discussing the reliability of such a model for a specific real die. 1.4.2. Classical probability. Let us come back to the mathematical model of the fair die. We consider the sample space Ω = {1, 2, 3, 4, 5, 6} of all possible elementary events (each of them corresponding to one possible result of the experiment of throwing the die). Then we can consider any event as a given subset A of Ω.
For example, A = {1, 3, 5} describes the result of getting an odd number on the resulting side (we count the labels on the sides of the die). Similarly, the set B = A^c = {2, 4, 6} = Ω \ A is the complementary event of getting an even number of points. The probability of each of A and B will be 1/2. Indeed, |A|/|Ω| = 1/2, where |A| means the number of elements of a set A. This leads to the following obvious generalization: Classical probability Let Ω be a finite set with n = |Ω| elements. The classical probability of the event corresponding to any subset A ⊂ Ω is defined as P(A) = |A|/|Ω|. Such a definition immediately allows us to solve problems related to throwing several fair dice simultaneously. Indeed, we may treat this as throwing one die independently several times, and thus multiplying the probabilities. For example, the probability of getting an odd sum of points on two dice is given by adding the probabilities of having an even number on the first one and an odd number on the second one, and vice versa. Thus the probability will be twice 1/2 · 1/2, which is 1/2, as expected. 1.4.3. Probability space. Next, we formulate a more general concept of probability, covering also the unfair die example above. We shall need a finite set Ω of all possible states of a system (e.g. results of an experiment), which we call the sample space. Further, the space of all possible events is given as the set A of all subsets of Ω. Finally, we need the function describing the probabilities of occurrence of the individual events: 19
def uperm(n, k):
    a = factorial(k) * binomial(n, k)
    return a
In these terms we can directly compute u(n, k) for any n and k = 0, 1, 2, . . ., by typing the command uperm(n, k) in the Sage editor. For instance, uperm(10, 5) returns 30240, uperm(10, 1) returns 10, uperm(10, 2) returns 90, uperm(10, 11) returns 0 (as it should, since k > n in this case), etc. 1.C.6. Determine the number of 4-digit codes composed of the digits 1, 3, 5, 6, 7 and 9, under the condition that no digit occurs more than once. Solution. We have 6 distinct digits at our disposal, and we ask: how many distinct (ordered) 4-tuples can be chosen from them? Obviously the result is u(6, 4) = 6!/(6 − 4)! = 6 · 5 · 4 · 3 = 360. The reader may verify the computation in Sage with the function uperm(n, k) constructed above, that is, via the command uperm(6, 4). □ 1.C.7. Six men at a meeting shake hands with one another. How many handshakes happen in total? Solution. We understand that each pair of men shakes hands once. Thus the number of handshakes equals the number of ways of choosing an unordered pair among 6 elements, that is, the number of combinations c(6, 2). Let us present the answer via Sage:
c62 = binomial(6, 2); c62
which gives 15. □ To summarize, k-permutations u(n, k) take care of the order, while combinations c(n, k) do not. Moreover, u(n, k) = c(n, k) k!. We have noticed that permutations and combinations are most useful when counting possibilities. Let us conclude this paragraph by treating examples where repetitions occur. Notice how the principle of "inclusion and exclusion" may be useful. 1.C.8. k-permutations with repetition. The Greek alphabet consists of 24 letters. How many words of exactly five letters can be composed from it? (Disregarding whether the words have an actual meaning or not.) Solution. For each of the five positions in the word we have 24 possibilities, since the letters may repeat. According to the discussion in 1.3.4, this is nothing but a 5-permutation with repetitions.
Hence the total number of words that can be composed is given by V(24, 5) = 24^5 = 7962624. In Sage we compute this by typing 24^5 or 24**5. □ CHAPTER 1. INITIAL WARMUP Probability function Let us consider a non-empty fixed sample space Ω. The probability function P : A → R satisfies: (1) P(Ω) = 1; (2) 0 ≤ P(A) for all events A; (3) P(A ∪ B) = P(A) + P(B) whenever A ∩ B = ∅. Notice that the intersection A ∩ B describes the simultaneous occurrence of both events, while the union A ∪ B means that at least one of the events A and B occurs. The event A^c = Ω \ A is called the complementary event. There are some further straightforward consequences of the definition, valid for all events A, B: (4) P(A) = 1 − P(A^c); (5) P(∅) = 0; (6) P(A) ≤ 1 for all events A; (7) P(A) ≤ P(B) whenever A ⊂ B; (8) P(A ∪ B) = P(A) + P(B) − P(A ∩ B). The proofs are all elementary. For example, A ∪ A^c = Ω, and thus (3) implies (4). Similarly, we can write A = (A \ B) ∪ (A ∩ B) and A ∪ B = (A \ B) ∪ (B \ A) ∪ (A ∩ B) with disjoint unions of sets on the right-hand sides. Thus P(A) = P(A \ B) + P(A ∩ B) and P(A ∪ B) = P(A \ B) + P(B \ A) + P(A ∩ B) by (3), which implies the last equality. The remaining three claims are even simpler. All these properties correspond exactly to our intuition of how probability should behave. Probability should always be a real number between zero and one. The event Ω includes all possible results of the experiment, so it must have probability one. The empty event of "no result at all" appears with probability zero, the probabilities of disjoint events add up, etc. Of course, the classical probability on the sample space Ω is an example of a probability function. In our more general model, the set A of all subsets is closed under union, intersection and taking complements, and this has been essential 20
Next we have to subtract the choices with exactly one envelope empty, and finally with two empty ones (we still may distribute the stamps freely). We thus obtain (in the second term we choose the empty envelop and distribute the letters, and we have to correct the multiple appearances) C(3, 5) − 3(C(2, 5) + 3) − 3C(3, 5) = (7 2 ) − 3 (6 1 ) + 3 = 6. Thus the resulting number is 6C(3, 5) = 126. If our interpretation was that the stamps are glued only on the non-empty envelops, the result would be the square of the latter number of choices for envelops, i.e., 36. □ D. Probability We proceed with simple exercises related to the concept of classical probability. In many cases we are dealing with experiments having only a finite number of outcomes and we are interested in whether or not the outcome belongs to a subset of favourable outcomes. Then, the probability we are trying to determine equals the number of favourable outcomes divided by the total number of all possible outcomes. Classical probability can be used when we assume, or know, that each CHAPTER 1. INITIAL WARMUP in our exposition above. This will continue in all our discussion on probability in the sequel, where we shall have to allow for more general spaces of events A in the sets of all subsets in the sample space. We will return to this and more serious generalizations in chapter 10. 1.4.4. Summing probabilities. By using mathematical induction, the additivity of probability is easily extended to any (finite) number of mutually exclusive events Ai ⊂ Ω, i = 1, . . . , n. That is, P(∪i∈IAi) = n∑ i=1 P(Ai), whenever Ai ∩ Aj = ∅, for all i ̸= j, i, j = 1, . . . , n. Indeed, 1.4.3(3) is the result for n = 2. If we assume the validity of the formula for some fixed n, then the union of any n + 1 events A0, A1, . . . , An can be split into the union of A0 and A1 ∪. . . An. Then by the induction assumption, together with 1.4.3(3) again, the result follows. In general, the summing of probabilities of event occurrences is much more difficult. The problem is that whenever the events are mutually compatible, the possible results in their intersection are counted multiple times. We have seen the simplest case of two mutually compatible events A and B in 1.4.3(8). For classical probability, it reduces just to counting elements in subsets. Indeed, those elements that belong to both the sets A and B count in the formula P(A ∪ B) = P(A) + P(B) − P(A ∩ B) twice and thus we have to subtract them once. Now, we look at the general case. The approach of interactive inclusion and exclusion (potentially too many) elements in some count is a standard method in combinatorics known as the inclusion-exclusion principle. We shall exploit this method in our general finite probability spaces. As we shall see, this is an example of a mathematical theorem, where the hard part is to understand (and find) the formulation of the result. The proof is then relatively simple. The diagram explains the situation for three sets A, B, C for classical probability: P(A∪B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C). Clearly, the probabilities are given by first counting the elements in each set and adding. Then we subtract the sum 21 outcome has the same probability of happening (for instance, fair dice throwing, etc). 1.D.1. 
1.D.1. Prove that the following two events happen with the same probability (independently of each other): (a) the roll of a die results in a number greater than 4; (b) when throwing two dice, and assuming that their total sum equals 7, at least one of them resulted in a four. Solution. For the first case, there are six possible outcomes (the set {1, 2, 3, 4, 5, 6}), two of which are favourable ({5, 6}). Thus the probability is 2/6 = 1/3. For the second event, one can again apply classical probability (where the condition is interpreted as a restriction of the probability space). The restricted space has 6 elements, and exactly 2 of those are favourable to the given event. Thus the answer is again 2/6 = 1/3. Notice that if we treat the rolling of two dice as two independent rolls of one die and view the results independently of the order (i.e., consider just the total), then the probabilities of the individual sums are not equal. For instance, 12 and 2 each have probability 1/36, while 10 appears with probability 3/36, etc. Thus, throwing two dice is not a case of classical probability on the totals in general, but we were still right to treat the resulting sum 7 via classical probability. Why? □ 1.D.2. Helen wants to give John and Mary five pears and six apples. If we consider the pears indistinguishable, and likewise the apples, what is the probability that either John or Mary gets nothing? Solution. Let us assume that the pears and apples are distributed independently and that each option has the same probability. Of course, one of the two mentioned persons may then get nothing. We have to count in how many ways the distribution can happen. The five pears can be divided in six ways (determined by the number of pears given to John; the rest goes to Mary). Similarly, the six apples can be divided in seven ways. These divisions are independent, hence we can apply the product rule, which states that the number of ways two independent events can occur together is obtained by multiplying the individual counts; i.e., we have 6 · 7 = 42 options in total. Two of them result in John or Mary getting nothing. Thus the probability is 2/42 = 1/21 (relatively high). Notice how different this is from dealing with 11 distinguishable objects, where 11 independent tosses of the same coin all giving the same result would have probability 2 · 2^{−11} (extremely low). □ 1.D.3. We choose randomly a group of five people from a group of eight men and four women. What is the probability that there are at least three women in the chosen group? ⃝ 1.D.4. Remark. The classical probability concept can often be applied, but we have to be careful. Imagine we should compute the probability that the reader of this remark will win at least 25 million euro in EuroLotto during the next week. First, notice that the formulation of the problem is incomplete. For CHAPTER 1. INITIAL WARMUP of those in the intersections of pairs of sets, since those elements are counted twice. But we must then add back the number of elements in the intersection of all three. We shall now follow the same idea in order to write down the formula in the following theorem. It seems plausible that such a formula should work with proper coefficients on the sums of probabilities of intersections of more and more events among A_1, . . . , A_k, at least in the case of classical probability. The reader will perhaps appreciate that a quite straightforward mathematical induction verifies the theorem in full generality. 1.4.5. Theorem. Let A_1, . . . , A_k ∈ A be arbitrary events over the sample space Ω with a set of events A.
Then P(∪_{i=1}^{k} A_i) = ∑_{i=1}^{k} P(A_i) − ∑_{i=1}^{k−1} ∑_{j=i+1}^{k} P(A_i ∩ A_j) + ∑_{i=1}^{k−2} ∑_{j=i+1}^{k−1} ∑_{ℓ=j+1}^{k} P(A_i ∩ A_j ∩ A_ℓ) − ··· + (−1)^{k−1} P(A_1 ∩ A_2 ∩ ··· ∩ A_k). Proof. For k = 1 the claim is obvious. The case k = 2 is the same as the equality 1.4.3(8), which we have already proved. Assume that the theorem holds for any number of events up to k, where k ≥ 1. In the induction step we can work with the formula for k + 1 events, where the union of the first k of them plays the role of A in the equation 1.4.3(8), and the remaining event plays the role of B: P(∪_{i=1}^{k+1} A_i) = P((∪_{i=1}^{k} A_i) ∪ A_{k+1}) = ∑_{j=1}^{k} (−1)^{j+1} ∑_{1≤i_1<···} ... > 0, then P(A_1 ∩ A_2) = P(A_2) P(A_1|A_2) = P(A_1) P(A_2|A_1). All these numbers express (in different manners) the probability that both events A_1 and A_2 occur. For instance, in the last case we first look whether the first event occurred; then, assuming it has, we look whether the second also occurs. Similarly, for three events A_1, A_2, A_3 satisfying P(A_1 ∩ A_2 ∩ A_3) > 0 we obtain P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2). The probability that three events occur simultaneously can thus be computed as follows: compute the probability that the first occurs; then the probability that the second occurs under the assumption that the first has occurred; then the probability that the third occurs under the assumption that both the first and the second have occurred; finally, multiply the results together. In general, if we have k events A_1, . . . , A_k satisfying P(A_1 ∩ ··· ∩ A_k) > 0, then the theorem says P(A_1 ∩ ··· ∩ A_k) = P(A_1) P(A_2|A_1) ··· P(A_k|A_1 ∩ ··· ∩ A_{k−1}). Notice that our condition P(A_1 ∩ ··· ∩ A_k) > 0 implies that all the hypotheses in the latter formula have non-zero probabilities, and thus all the conditional probabilities make sense. Indeed, each of the events in the hypotheses is at least as big as the full intersection, and thus its probability is at least as big, see 1.4.3(7). 1.4.9. Geometric probability. In practical problems, the sample space may not be a finite set, and the set A of all events need not be the entire set of all subsets of Ω. Generalising probability to such situations is beyond our scope now, but we can at least give a simple illustration. Consider the plane R^2 of pairs of real numbers and a subset Ω with known area. Events are represented by subsets A ⊂ Ω (again with known areas); for the event set A we consider some suitable system of subsets for which we can determine the area. An event A then occurs if a randomly chosen point from Ω belongs to the subregion determined by A; otherwise the event does not occur. Consider the problem of randomly choosing two numbers a < b in the interval [0, 1] ⊂ R, where all values a and b are chosen with equal probability. The question is: what is the probability that the interval (a, b) has length at least one half? The choice of the pair (a, b) is actually the choice of a point [a, b] inside the triangle Ω with vertices [0, 0], [0, 1], [1, 1] (see the diagram below). We can imagine this as a description of a problem where a very tired guest at a party tries to divide a sausage by two cuts into three pieces for himself and his two friends. What is the probability that the middle part will be at least half of the sausage? 25 none of the given 500 people will die, occurs by the product rule. This is because the events are independent; hence we obtain (1 − 12/10^5)^5000.
It follows that the probability of the complementary event, that is, that some of the chosen people will die, equals 1 − (1 − 12/10^5)^5000 ≈ 0.4512. □ Remark. The model we have used in the previous exercise to describe the given situation is only approximate. The complication arises from the condition that every person in the sample has the same probability of dying, with the probabilities derived from the total number of deaths per year. However, the number of deaths changes yearly, and the population changes as well. We may handle one of the possible inaccuracies as follows. Suppose that 1200 persons per year die, so in ten years 12000 persons die. The probability that a certain person dies within ten years is then estimated by 12000/10^7. Thus, the probability that a specific person will not die within ten years is 1 − 12/10^4 (the first two terms of the binomial expansion of (1 − 12/10^5)^10). In total, we obtain the following estimate of the probability: 1 − (1 − 12/10^4)^500 ≈ 0.4514. Observe that both estimates are very close to each other. We have already met probabilities based on the assumption that other events have happened; we say "under a certain hypothesis". We talk about conditional probability, see 1.4.8. As we will see below, such estimates can also be successfully derived via Sage (this is related to the "Monte Carlo" computational method, cf. 1.4.10). 1.D.11. Sage simulation. What is the probability that when rolling two dice the sum is 7, if we know that neither of the rolls resulted in a 2? Estimate the solution via Sage. Solution. Let B be the event that neither of the rolls results in a 2, and let A be the event "the sum is 7". The set of all possible outcomes is again denoted by Ω. Then P(A|B) = P(A ∩ B)/P(B) = (|A ∩ B|/|Ω|)/(|B|/|Ω|) = |A ∩ B|/|B|. Now, the number 7 can appear as a sum in four ways if there is no 2; that is, |A ∩ B| = 4 and |B| = 5 · 5 = 25. Thus P(A|B) = 4/25 = 0.16. Observe that P(A) = 1/6, so A and B are not stochastically independent. Let us now run a simulation to estimate the resulting probability via Sage. Initially we use the parameter num_rolls and the method random_element (after copying and pasting the cells below, be careful to keep the line breaks and indentation exactly as they appear here):
num_rolls = 100_000
die1 = [IntegerRing().random_element(x=1, y=7) for _ in range(num_rolls)]
die2 = [IntegerRing().random_element(x=1, y=7) for _ in range(num_rolls)]
CHAPTER 1. INITIAL WARMUP Thus we need to determine the area of the subset corresponding to points with b ≥ a + 1/2, that is, the interior of the triangle A bounded by the points [0, 1/2], [0, 1], [1/2, 1]. We find P(A) = (1/8)/(1/2) = 1/4. Similarly, if we ask for the probability that some of the three guests will get at least half of the sausage, then we have to add the probabilities of two other events: B, saying a ≥ 1/2, and C, given by b ≤ 1/2. Clearly they correspond to the lowest and the rightmost top triangles, and thus they also have probability 1/4 each. Thus the requested probability is 3/4. Equivalently, we could have asked for the complementary event "all of them get less than a half", which clearly corresponds to the middle triangle and thus has probability 1/4. Try to answer on your own the question: what is the minimal prescribed length ℓ such that the probability of choosing an interval (a, b) of length at least ℓ is one half? 1.4.10. Monte Carlo methods. One efficient method for computing approximate values is simulation of the relative occurrence of a chosen event.
We present an example. Let Ω be the unit square with vertices [0, 0], [1, 0], [0, 1], and [1, 1], and let A be the intersection of Ω with the unit disc centred at the origin. Then area(A) = π/4. Suppose we have a reliable generator of random numbers a and b between zero and one. We then compute the relative frequency of how often a^2 + b^2 < 1, that is, of [a, b] ∈ A. The result (after a large number of attempts) should approximate the area of the quarter disc, that is, π/4, quite well. (Draw a picture yourself!) Of course, the well-known formula for the area of a circle with radius r is πr^2, where π = 3.14159... It is an interesting question why the area of a circle should be a constant multiple of the square of its radius. We will be able to prove this later, but experimentally we can hint at it by the above approach, using squares of different sizes. Numerical approaches based on such probabilistic principles are called Monte Carlo methods. 5. Plane geometry So far we have been using elementary notions from the geometry of the real plane in an intuitive way. Now we will investigate in more detail how to deal with the need to describe the "position in the plane", and to find relations between the positions of distinct points in the plane. 26 We select only the rolls where neither die resulted in a 2:
good_rolls = []
for x, y in zip(die1, die2):
    if x != 2 and y != 2:
        good_rolls.append((x, y))
Next we calculate the number of rolls with sum 7:
count_sum_7 = 0
for x, y in good_rolls:
    if x + y == 7:
        count_sum_7 += 1
The final probability is given by the ratio:
result = count_sum_7/len(good_rolls)
print(n(result))
Sage's output is 0.162572964225857 (this will differ slightly each time you run the code!), hence the estimate given by Sage is close to the correct answer. □ 1.D.12. Michael has two mailboxes, one at gmail.com and the other at hotmail.com. His username is the same on both servers, but the passwords are different. He does not remember which password corresponds to which server. When typing the password to access his mailbox, he makes a typo with probability 5% (i.e., if he tries to type a specific password, he types what he intended with probability 95%). At the server hotmail.com, Michael typed in the username and a password, but the server told him that something was wrong. What is the probability that he chose the correct password but just mistyped it? (We assume that the username is always typed correctly, and that a typo cannot turn a wrong password into the right one.) Solution. Let A be the event that Michael typed a wrong password at hotmail.com. This event is the union of two disjoint events: • A_1: he wanted to type the correct password and mistyped; • A_2: he wanted to type the wrong password (the one from gmail.com), whether he mistyped it or not. We are looking for the conditional probability P(A_1|A), which according to paragraph 1.4.8 is given by P(A_1 ∩ A)/P(A) = P(A_1)/P(A_1 ∪ A_2) = P(A_1)/(P(A_1) + P(A_2)). Here we have used the fact that P(A_1 ∪ A_2) = P(A_1) + P(A_2), since A_1 and A_2 are disjoint. So we only need to compute the probabilities P(A_1) and P(A_2). The event A_1 is the intersection of two independent events: Michael wanted to type the correct password, and Michael mistyped. According to the problem statement, the probability of the first event is 1/2 and the probability of the second is 1/20. In total, since the events are independent, we get P(A_1) = (1/2) · (1/20) = 1/40. In addition, P(A_2) = 1/2, and thus
CHAPTER 1. INITIAL WARMUP Our tools will be mappings. We will mainly consider mappings F which, to (ordered) pairs of values (x, y), assign pairs (w, z) = F(x, y). Such a mapping consists of two functions w(x, y) and z(x, y), each depending on the two arguments x, y. This will also serve as a gentle introduction to the part of mathematics called linear algebra, with which we will deal in the subsequent three chapters. 1.5.1. Vector space R^2. We view the "plane" as the set of pairs of real numbers (x, y) ∈ R^2. We will call these pairs vectors in R^2. For such vectors we can define addition "coordinate-wise", that is, for vectors u = (x, y) and v = (x′, y′) we set u + v = (x + x′, y + y′). Since all the properties of commutative groups hold for the individual coordinates, they hold for our new vector addition too. In particular, there exists a zero vector 0 = (0, 0) such that v + 0 = v. We use the same symbol 0 for the vector and for the number zero on purpose; the context will always make it clear which "zero" it is. Next we define scalar multiplication of vectors. For a ∈ R and u = (x, y) ∈ R^2, we set a · u = (ax, ay). Usually we will omit the symbol · and use the juxtaposition a v to denote the scalar multiple of a vector. We can directly check further properties of scalar multiplication by a or b and addition of vectors u and v, for instance a(u + v) = a u + a v, (a + b)u = a u + b u, a(b u) = (ab)u. We use the same symbol + for both vector addition and scalar addition. Now we take a very important step. Define the vectors e_1 = (1, 0) and e_2 = (0, 1). Every vector can then be written uniquely as u = (x, y) = x e_1 + y e_2. The expression on the right is called a linear combination of the vectors e_1 and e_2 (with coefficients x and y). The pair of vectors e = (e_1, e_2) is called the standard basis of the vector space R^2. As shown in the picture, these operations are easy to imagine if we consider the vectors v to be arrows starting at the origin 0 = (0, 0) and ending at the position (x, y) in the plane. 27 P(A) = P(A_1) + P(A_2) = 1/40 + 1/2 = 21/40. This gives P(A_1|A) = P(A_1)/P(A) = (1/40)/(21/40) = 1/21. With the Sage cell
PA1 = 1/2*1/20; PA2 = 1/2; PA = PA1 + PA2
PA1condA = PA1/PA
print("P(A1|A) =", PA1condA)
we obtain the verification P(A_1|A) = 1/21. □ 1.D.13. Consider a deck of 32 cards. If we draw one card twice, what is the probability that the second drawn card is an ace, first if we return the first card to the deck, and then if we do not return it (so that there are 31 cards in the deck for the second draw)? Solution. If we return the card to the deck, we are just repeating the experiment, which has 32 possible results (all with the same probability), and exactly four of them are favourable. Thus we see that the probability is p = 1/8. In the second case, when we do not return the card, the probability is the same. Indeed, notice that when drawing all the cards one by one, the probability of getting an ace as the first card is identical to the probability of getting an ace in the second draw. We can also apply conditional probability, splitting the event into the two disjoint options of drawing or not drawing an ace as the first card: p = (4/32) · (3/31) + (28/32) · (4/31) = 1/8. □ The concept of geometric probability is useful if we can identify the individual events with positions on a line, in a plane, in a space, etc., while the sample space Ω is a region with known length, area, or volume, respectively. The favourable positions are then represented by measurable objects, too.
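Such ratios of measures can also be estimated experimentally, in the spirit of 1.4.10; a minimal hedged Sage sketch estimating the quarter-disc area π/4 by random sampling:
# Monte Carlo estimate of pi/4, the area of the quarter disc in the unit square
import random
trials = 10^5
hits = sum(1 for _ in range(trials)
           if random.random()^2 + random.random()^2 < 1)
print(n(4*hits/trials), n(pi))   # the first number approximates pi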
In analogy to the classical probability, we define the probability as the ratio of the length, area, volume, etc., to that of Ω, see ??. This reflects the idea that all subregions of the same measure are hit with the same probability. We present one example (and we include more of them in the final section of this chapter, starting at page 61). 1.D.14. In a certain country, a bus departs from town A to town B once a day, at a random time between eight a.m. and eight p.m. Once a day, in the same time interval, another bus departs in the opposite direction. The trip in either direction takes five hours. Compute the probability that the buses meet, assuming they use the same route. Solution. The sample space is a square 12 × 12. If we denote the departure times of the buses by x and y respectively, then they meet on the route if and only if |x − y| ≤ 5. This inequality determines the region of "favourable events" in the square, the complement of the union of two right-angled isosceles triangles with legs of length 7; see also the figure below. CHAPTER 1. INITIAL WARMUP The addition of two such arrows is then given by the parallelogram law: given two arrows starting at the origin, their sum is the diagonal arrow (also starting at the origin) of the parallelogram with the two given arrows as adjacent sides. Multiplication by a scalar a corresponds to stretching the arrow to its a-multiple. This includes negative scalars, where the direction of the vector is reversed. If we choose any two non-zero vectors u, v such that neither of them is a multiple of the other, then they form a basis of R^2, too, i.e. every other vector can be written (uniquely) as a linear combination of them. This seems clear from the picture, and we may easily write down the two equations for the coefficients and solve them (as we shall deduce in a while). 1.5.2. Points in the plane. In geometry, we should distinguish between the points in the plane (as for instance the chosen origin O above) and the vectors, the arrows describing the difference of two such points. We will work in fixed standard coordinates, that is, with pairs of real numbers; but for better understanding we will now distinguish vectors, written in parentheses and denoted for a moment by boldface letters like u, v, from points, whose coordinates are written in brackets. Points themselves are denoted by capital Latin letters. Even if we view the entire plane as pairs of real numbers in R^2, we may understand the addition of two such pairs as follows. The first pair of coordinates describes a point P = [x, y], while the other denotes a vector u = (u_1, u_2). Their sum P + u corresponds to adding the (arrow) vector u to the point P. If we fix the vector u, we call the resulting mapping P = [x, y] → P + u = [x + u_1, y + u_2] the shift of the plane (or translation) by the vector u. Thus the vectors in R^2 can be understood more abstractly as the shifts of the plane (sometimes called free vectors in elementary geometry texts). The standard coordinates on R^2, understood as pairs of real numbers, are not the only ones. We can put a coordinate system on the plane as we choose. 28 Its total area is 49, so the area of the "favourable part" is 144 − 49 = 95. The probability is p = 95/144 ≈ 0.66. □ E. Plane geometry In this section we handle elementary geometric objects in the real plane R^2, including points, vectors, lines, circles, and polygons. Our explanation will be supported by Sage.
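Sage handles the coordinates of points and of shift vectors in the same way; a small hedged illustration of the point-plus-vector arithmetic of 1.5.2, with toy data of our own:
# The point P = [2, 0] shifted by the vector u = (3, 2)
P = vector([2, 0]); u = vector([3, 2])   # both modelled as coordinate vectors
print(P + u)                             # (5, 2)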
E. Plane geometry

In this section we handle elementary geometric objects in the real plane R^2, including points, vectors, lines, circles, and polygons. Our explanation will be supported by Sage. We begin with a series of standard tasks related to the fundamental notion of lines. Recall that a line in the plane is given "implicitly" by an equation ax + by = c (the line is the set of all points [x, y] given by the solutions, with fixed parameters a, b, c). We shall mostly write ℓ, p, q for lines. Lines can also be "parametrized" by equations of the form x = x0 + t v1 and y = y0 + t v2, where P = [x0, y0] is a point on the line, (v1, v2) is the direction vector, and t ∈ R is a parameter, see 1.5.3 for more details.

1.E.1. Points versus vectors. Points in the plane will be denoted by capital letters, as P = [x, y], where x, y ∈ R are the coordinates, while the vectors shifting P to Q = [x + u1, y + u2] will be written as u = (u1, u2). This reflects the need to distinguish the "position in the plane" P from the "change of position" u. Notice that the same notation is used for closed and open intervals on the real line, cf. Chapter 5, but we do not see much space for confusion here. The reader is advised to look carefully at the explanation starting in 1.5.2, to fully appreciate working with both concepts as couples of reals, as soon as we fix our coordinates. This approach will be crucial when dealing with affine and Euclidean geometries in Chapter 4. We shall also adopt the matrix calculus, which brings the notation for both points and vectors into the form of columns, i.e., instead of u = (x, y) or P = [x, y], we often write the column vector (x, y)^T.

1.E.2. Determine the line ℓ given by {x = 2 − t, y = 1 + 3t : t ∈ R} implicitly. Then present an illustration of ℓ using Sage for those x with −5 ≤ x ≤ 5.

Solution. We need to find the implicit representation of ℓ, and this can be done by eliminating the parameter t. Solving the first equation for t gives t = 2 − x. Substituting this into the second equation gives y = 1 + 3(2 − x), that is, y + 3x − 7 = 0, which is the solution. To plot a line in Sage, there are many alternatives. Here we use the explicit form y = −3x + 7, and then the code can be

Coordinates in the plane R^2: Choose any point in the plane and call it the origin O. All other points P in the plane can be identified with the vectors (arrows) \overrightarrow{OP} with their tails at the origin. Choose any point other than O and call it E1. This defines the vector e1 = \overrightarrow{OE_1} = (1, 0). Choose another point E2 so that O, E1, E2 are distinct and not collinear. This defines the vector e2 = \overrightarrow{OE_2} = (0, 1). Then every point P = (a, b) in the plane can be described uniquely as P = O + a e1 + b e2 for real a, b, or in vector notation, \overrightarrow{OP} = a e1 + b e2.

Translation, by adding a fixed vector, can be used either to shift the coordinate system (including the origin), or to shift sets of points in the plane. Notice that the vector corresponding to the shift of the point P into the point Q is given as the difference Q − P (in any coordinates). Thus we shall also use this notation for the vector \overrightarrow{PQ} = Q − P. For each choice of coordinates, we have two distinct lines, the two axes; the origin is their point of intersection. The other way round, each choice of two non-parallel lines, together with scales on each of them, defines coordinates in the plane. These are called affine coordinates. Clearly each nontrivial triangle in the plane with vertices O, E1, E2 defines coordinates in which this triangle is given by the points [0, 0], [1, 0], [0, 1]. Thus we may say that in the geometry of the plane, "all nontrivial triangles are the same, up to a choice of coordinates".

1.5.3.
Lines in the plane. Every line is parallel to a (unique) line through the origin. To define a line, we therefore need two ingredients. One is a nonzero vector which describes the direction of the line; call it v = (v1, v2). The other is a point P0 = [x0, y0] on the line. Every point on the line is then of the form P(t) = P0 + t v, t ∈ R.

Parametric description of a line: We may understand the line p as the set of all multiples of the vector v, shifted by the vector (x0, y0). This is called the parametric description of the line:
p = {P ∈ R^2 ; P = P0 + t v, t ∈ R}.
The vector v is called the direction vector of the line p.

ell=-3*x+7
plot(ell,-5,5,color="black",thickness=2)

Or simply type ell = -3*x + 7; plot(ell, -5, 5), or plot(-3*x+7, -5, 5), without the options for the colour and the thickness of the line. The figure that Sage returns via the very first choice is given here: □

1.E.3. Consider the lines p, q given in parametric form as [2, 0] + t(3, 2) and [−1, 2] + s(1, 3), with t, s ∈ R, respectively. Present them in implicit form, and examine whether they intersect. If they do, determine the (unique) intersection point explicitly. Verify your answer via Sage.

Solution. The coordinates of the points on the first line are given by the parametric equations {x = 2 + 3t, y = 0 + 2t}. Eliminating t from the equations, we obtain the implicit form of p: 2x − 3y − 4 = 0. Similarly, the points on the line q are given by {x = −1 + s, y = 2 + 3s}. By eliminating s, we get the implicit form of q, namely 3x − y + 5 = 0. The solution (x*, y*) of the system of equations 2x − 3y − 4 = 0, 3x − y + 5 = 0, if any, determines the coordinates of the unique intersection point P = [x*, y*] of p and q. We compute P = [−19/7, −22/7]. As we have seen, Sage offers the command solve; here its application comes with the cell

x, y=var("x, y")
eq1=2*x-3*y-4; eq2=3*x-y+5
solve([eq1==0, eq2==0], x, y)

which returns [[x == (-19/7), y == (-22/7)]]. As for the plot that joins both lines, we have applied the Sage cell

In the chosen coordinates, the point P(t) = [x(t), y(t)] is given as x = x(t) = x0 + t v1, y = y(t) = y0 + t v2. We can eliminate t from these two equations to obtain
−v2 x + v1 y = −v2 x0 + v1 y0.
Since the vector v = (v1, v2) is non-zero, at least one of the numbers v1, v2 is non-zero. If one of the coordinates v1 or v2 is zero, then the line is parallel to one of the coordinate axes.

Implicit description of a line: The general equation of a line in the plane is
(1) ax + by = c,
with a and b not both zero. The relation between the pair of numbers (a, b) and the direction vector v = (v1, v2) of the line is
(2) a v1 + b v2 = 0.
We can view the left hand side of the equation (1) as a function z = f(x, y) mapping each point [x, y] of the plane to a scalar; the line then corresponds to a prescribed constant value of this function. We shall soon see that (2) says the vector (a, b) is perpendicular to the direction of the line. Suppose we have two lines p and q. We ask about their intersection p ∩ q. That is a point [x, y] which satisfies the equations of both lines simultaneously; we write them as
(3) ax + by = r, cx + dy = s.
Again, we can view the left hand sides as a mapping F, which to every pair of coordinates [x, y] of a point P in the plane assigns the vector of values of the two scalar functions f1 and f2 given by the left hand sides of the particular equations in (3).
Hence we can write our two scalar equations as one vector equation F(v) = w, where v = (x, y) and w = (r, s). Notice that the two lines are not parallel if and only if they have a unique point in their intersection.

z=var("z")
y1=(2/3)*z-(4/3); y2=3*z+5
P = plot([],figsize=(4,4))
P += plot(y1,z,(-4,4), color="black")
P += plot(y2,z,(-4,4), color="black")
P += point((-19/7, -22/7), size=50, color="black"); show(P)

□

1.E.4. Remark. In 1.E.3 you can also solve the problem by substituting the parametric equations of the line q into the implicit equation of p, which gives 2(−1 + s) − 3(2 + 3s) − 4 = 0. This equation has the unique solution s = −12/7; returning to the parametric equation of q we obtain the coordinates of P as previously mentioned. In Sage you can directly plot a line given in parametric form using the parametric_plot command. For example, to plot the lines from 1.E.3 you can use the following Sage cell (check the resulting figure in your editor). Sage can also find the intersection point by solving the system of the two equations for the parameters s, t; try it out yourselves!

t = var("t")
P1= parametric_plot((2+3*t,2*t),(t,-3,3),
    color="blue",legend_label="line p")
P2= parametric_plot((-1+t,2+3*t),(t,-3,3),
    color="black",legend_label="line q")
Q= point((-19/7,-22/7),size=70)
PL=P1+P2+Q; show(PL)

1.E.5. Determine the intersection of the lines x + y − 4 = 0 and {x = −1 + 2t, y = 2 + t : t ∈ R}. ⃝

1.E.6. Find the line p which passes through the point P = [2, 3] ∈ R^2 and is parallel to the line q given by x − 3y + 2 = 0. Provide a figure in Sage illustrating the situation.

Solution. In the real plane there are two possibilities for two parallel lines: either they coincide, or they have no intersection. Since the line p should pass through the point [2, 3], and this point does not belong to q (it does not satisfy its equation), we deduce that p and q have no intersection. This means that p is of the form x − 3y + c = 0 for some c ∈ R with c ≠ 2. To find c, we substitute the coordinates of the point P into this equation. This reveals c = 7, which means that p is implicitly given by y = (1/3)x + 7/3. For the figure presented above, we have used the cell

1.5.4. Linear mappings and matrices. The mappings F with which we have worked when describing the intersection of lines have one very important property in common: they preserve the operations of addition of vectors and multiplication by scalars, that is, they preserve linear combinations:
F(a·v + b·w) = a·F(v) + b·F(w)
for all a, b ∈ R, v, w ∈ R^2. We say that F is a linear mapping from R^2 to R^2, and write F : R^2 → R^2. This can also be described in words: a linear combination of vectors maps to the same linear combination of their images, that is, linear mappings are exactly those mappings which preserve linear combinations. We have already encountered the same behaviour in the equation 1.5.3(1) for the line, where the linear mapping in question was f : R^2 → R together with its prescribed value c. That is also the reason why the values of the mapping z = f(x, y) are depicted in the picture as a plane in R^3.

We can write such mappings using matrices. By a matrix we mean a rectangular array of numbers, for instance
A = ( a b ; c d ) or v = ( x ; y )
(here and below we display a matrix row by row, separating its rows by semicolons). We speak of a (square 2 × 2) matrix A and a (column) vector v. Multiplication of a matrix by a vector, row by column, is defined as follows:
A·v = ( a b ; c d )·( x ; y ) = ( ax + by ; cx + dy ).
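This row-times-column rule is exactly what Sage's matrix arithmetic implements, so we can let Sage reproduce the general formula symbolically (a small sketch of ours):

a, b, c, d, x, y = var("a, b, c, d, x, y")
A = matrix([[a, b], [c, d]])
v = vector([x, y])
print(A*v)   # (a*x + b*y, c*x + d*y)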
We introduce some more tools for vectors and matrices. Our goal is to compute with matrices in a similar way as we do with scalars. We define the product C = A·B of two square matrices A and B by applying the above formula to the individual columns of the matrix B and writing the resulting column vectors as the columns of the matrix C. In order to multiply two vectors v and w in a similar way, we can write the vector w as a row of numbers (the transposed vector) w^T. Then the product of w^T and v is
w^T·v = ( r s )·( x ; y ) = rx + sy.
This is the scalar product of the vectors v and w introduced a while ago. We can easily check the associativity of multiplication (do it for general matrices A, B and a vector v in detail):
(A·B)·v = A·(B·v).
Instead of a vector v we can write any 2 × 2 matrix C. In a similar way, distributivity also holds:
A·(B + C) = A·B + A·C, (A + B)·C = A·C + B·C.

1.E.7. Consider the following lines in R^2:
p1 : 2x + 3y − 4 = 0,
p2 : x − y + 3 = 0,
p3 : −2x + 2y = −6,
p4 : −x − (3/2)y + 2 = 0,
p5 : x = 2 + t, y = −2 − t, t ∈ R.
Determine which lines are parallel to each other. ⃝

1.E.8. Find a parametric equation of the line ℓ which goes through the points P1 = [1, 3] and P2 = [−2, 1] in R^2. What is the corresponding implicit equation of ℓ?

Solution. The parametric equation of a line ℓ in the Euclidean plane passing through two points P1 = [u1, v1] and P2 = [u2, v2] is given by P1 + t·\overrightarrow{P_1P_2} = P1 + t(P2 − P1), where \overrightarrow{P_1P_2} := P2 − P1 = (u2 − u1, v2 − v1) is the displacement vector from P1 to P2. Applying this rule to the given points, we get P2 − P1 = (−3, −2), and so the parametric equation of ℓ is [1, 3] + t(−3, −2) = [1, 3] − t(3, 2), with t ∈ R. Hence x = 1 − 3t and y = 3 − 2t. By eliminating t we get y = (2/3)x + 7/3. As a verification, one can check that both points P1, P2 satisfy the equation of ℓ. □

1.E.9. A planar soccer player shoots a ball from the point F = [1, 0] in the direction of the vector u = (3, 4), hoping to hit the goal, which is the line segment from the point A = [23, 36] to B = [26, 30]. Does the ball fly towards the goal?

Solution. The ball travels along the line [1, 0] + t(3, 4). The line segment joining A and B has the parametrization [23, 36] + s(3, −6). The intersection of these lines is given by the equations 1 + 3t = 23 + 3s and 4t = 36 − 6s, with the solution t = 8, s = 2/3. As 0 < 2/3 < 1, the intersection lies in the line segment between A and B, and so the ball hits the goal. Another approach (close to how our vision works) is based on the slopes of the vectors \overrightarrow{FA}, u = (3, 4), and \overrightarrow{FB} (i.e., the ratios of the coordinates; we shall come back to that later). Since 36/22 > 4/3 > 30/25, the player scores. □

Next, we shall touch on one of the basic computational tools in the whole of mathematics, the matrix calculus. For now, we restrict our attention to tasks in the plane only; the reader can find the introductory concepts of the so-called "linear algebra" in 1.5.4, also restricted to the real plane R^2. Below we first experience the operations of addition and multiplication of matrices, and exercise how to compute with them more or less exactly as with scalars. We shall return to the geometry of the plane with the new tools at hand later.
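Before moving on, the small linear system from 1.E.9 is a pleasant one-liner for Sage's solve command (a quick sketch of ours):

t, s = var("t, s")
print(solve([1 + 3*t == 23 + 3*s, 4*t == 36 - 6*s], t, s))
# [[t == 8, s == (2/3)]], so the ball crosses the goal segment, since 0 < 2/3 < 1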
But commutativity does not hold. For example,
( 0 1 ; 0 0 )·( 0 0 ; 0 1 ) = ( 0 1 ; 0 0 ), ( 0 0 ; 0 1 )·( 0 1 ; 0 0 ) = ( 0 0 ; 0 0 ).
The last product also shows the existence of divisors of zero. Notice that the mapping defined by multiplication of vectors by a fixed matrix is a linear mapping, i.e. it respects linear combinations. On the other hand, every linear mapping F is completely determined by its values on the two vectors e1 = (1, 0) and e2 = (0, 1) of the standard basis, and these values appear in the columns of the matrix A which expresses F as matrix multiplication (check this observation carefully!). With matrices and vectors we can write the equations for lines and points, respectively, as
u^T·v = ( a b )·( x ; y ) = c,
A·v = ( a b ; c d )·( x ; y ) = ( r ; s ) = w.

1.5.5. Determinant of a matrix. The procedure for finding the intersection of lines described in 1.5.3 fails in some special cases. For instance, the intersection of two parallel lines is either empty (when the lines are parallel but distinct) or the line itself (when the lines are identical). This occurs exactly when the ratios a/c and b/d are the same, that is,
(1) ad − bc = 0.
Note that this expression already takes care of the cases where either c or d is zero. The expression on the left in (1) is called the determinant of the matrix A. We write it as
det A = | a b ; c d | = ad − bc.
Our discussion can now be expressed as follows:

Proposition. The determinant is a real valued function det A defined for all square 2 × 2 matrices A. The (vector) equation A·v = u has a unique solution for v if and only if det A ≠ 0.

So far, we have worked with pairs of real numbers in the plane. Equally well we might pose exactly the same questions for points with integer coordinates and lines with equations with integer coefficients. Notice that the latter requirement is equivalent to considering rational coefficients in the equations. We have to be careful which properties of the scalars we exploit. In fact, we needed all the properties of the field of scalars when discussing the solvability of the system of two equations; try to think it through. At least, we can be sure that the intersection of two non-parallel lines with rational coefficients is again a point with rational coordinates. The case of integer coefficients and coordinates is more difficult; we shall come back to this in the next chapter.

1.E.10. Quick comments on notation. In order to simplify our notation, for the multiplication of two matrices A, B (of appropriate sizes) we just write AB instead of A·B (the complete notation used in the other column, but only in this chapter). Later on we shall use the operator "·" for the scalar product u·v of two vectors u, v (as we do in the other column as well). In the sequel, we shall write Mat_{m,n}(R) for the set of all m × n real matrices (so in this section both m and n will be either 1 or 2). For m = n, i.e., square matrices, we use the notation Mat_n(R). If m = 1, n = 2, the object is a couple of scalars written in a row; if n = 1 and m = 2, we deal with a vector written as a column (while Mat_1(R) coincides with R). Moreover, for A = ( α β ; γ δ ) ∈ Mat_2(R) we denote by det(A) and tr(A) the determinant and the trace of A, which are the real numbers defined by det(A) = αδ − βγ and tr(A) = α + δ, respectively. On all matrices, there is the transposition operation A → A^T, swapping rows and columns, and the product by scalars cA, for c ∈ R.
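These conventions are easy to exercise in Sage; for instance, the failure of commutativity and the zero divisors exhibited above come out directly (a tiny sketch of ours):

A = matrix([[0, 1], [0, 0]])
B = matrix([[0, 0], [0, 1]])
print(A*B)   # equals A again
print(B*A)   # the zero matrix, although neither A nor B is zero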
In Chapter 2, we will extend the matrix calculus to general positive integers m and n, and to general scalars.

1.E.11. Let A, B be two real square matrices. How many arithmetic operations do we need to compute the product AB?

Solution. Given A, B ∈ Mat_2(R), each entry of the product AB requires 2 multiplications and 2 − 1 = 1 addition. Since there are 2^2 = 4 entries in AB, we need 4(2 + 1) = 12 operations in total. We may notice that the same operation of "multiplying" rows and columns for general A, B ∈ Mat_n(R) requires n multiplications and n − 1 additions for each entry of the result. Thus, as we shall see in the next chapter, in total we need n^2(n + n − 1) = n^2(2n − 1) arithmetic operations to compute AB for two real square matrices A, B.8 □

1.E.12. Compute v = 2(A − B)^T C u, for A = ( 0 5 ; −2 2 ), B = ( 2 0 ; −1 1 ), C = ( 2 −2 ; 4 5 ), and u = (3, 2)^T, respectively. ⃝

1.E.13. Matrices in Sage. Our aim is to compute with matrices and vectors just like with scalars, and Sage can do this. Let us look at simple computations as in 1.E.12. We have to tell Sage to use objects of the right class; the commands matrix and vector do that for us. Let us type

A = matrix([[0,5],[-2,2]])
B = matrix([[2,0],[-1,1]])
C = matrix([[2,-2],[4,5]])
u = vector([3,2])
v = 2*transpose(A-B)*C*u; print(v)

8 This means that the complexity of matrix multiplication is "polynomial", essentially of the size of n^3. With big matrices, it is an essential issue in theoretical computer science to lower this power. Volker Strassen introduced his famous algorithm reaching the complexity of the size n^2.807 in 1969, and as of July 2023, the fastest known algorithm performs with complexity of the size n^2.371866.

In particular, we shall see in the next chapter that the equation 1.5.3(3) with fixed integer coefficients a, b, c, d has a unique integer solution for all integer values (r, s) if and only if the determinant equals ±1.

1.5.6. Affine mappings. We now investigate how the matrix notation allows us to work with simple mappings in the affine plane. We have seen that matrix multiplication defines a linear mapping. Shifting R^2 by a fixed vector w = (r, s) ∈ R^2 in the affine plane can also easily be written in matrix notation:
P = ( x ; y ) → P + w = ( x + r ; y + s ).
If we add a fixed vector to the result of a linear mapping, then we have the expression
v = ( x ; y ) → A·v + w = ( ax + by + r ; cx + dy + s ).
In this way we have described all affine mappings of the plane to itself. Such mappings allow us to recompute coordinates which arise from different choices of origins and bases. We shall come back to this in detail later.

1.5.7. The distance and angle. Now we consider distance. We define the length of the vector v = (x, y) to be ∥v∥ = √(x^2 + y^2). Immediately we can define the notions of distance, angle and rotation in the plane.

Distance in the plane: The distance between the points P, Q in the plane is given as the length of the vector \overrightarrow{PQ}, i.e. ∥Q − P∥. Obviously, the distance does not depend on the ordering of P and Q, and it is invariant under shifts of the plane by any fixed vector w. The Euclidean plane is an affine plane with the distance defined as above.

Notice that Sage does all the multiplications properly (including multiplying all the components by 2 correctly), and returns the components (−52, 64) of the vector v. The methods of addition and multiplication are appropriately chosen for the objects, with the same symbols as for scalars.
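As a sanity check of the arithmetic against a hand computation, one can also print the intermediate objects (our own sketch, redefining the data from 1.E.12):

A = matrix([[0,5],[-2,2]]); B = matrix([[2,0],[-1,1]])
C = matrix([[2,-2],[4,5]]); u = vector([3,2])
print((A - B).transpose())           # rows (-2 -1) and (5 1)
print(C*u)                           # (2, 22)
print(2*(A - B).transpose()*(C*u))   # (-52, 64), matching v above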
Of course, we may introduce matrices with other scalar types of entries. For example, real floating point entries are invoked by A = matrix(RR, [[0, 5], [-2, 2]]) (recall that RR denotes the field of floating point real numbers in Sage), and similarly CC for complex floating point entries. Sage also offers the methods transpose(A) and determinant(A) (or write A.transpose() and A.determinant(), respectively), which return the transpose and the determinant of a matrix A. Further methods and comments will appear as we use Sage below.

1.E.14. Prove the statements listed below, where A, B, C are the matrices from 1.E.12:
(a) (A − B)^2 ≠ A^2 − 2AB + B^2;
(b) D := 2ABC − BCA − CAB ≠ 0;9
(c) (c(AB) − (c(AB))^T)/√2 = (11√2 c/2)·( 0 1 ; −1 0 ), for c ≠ 0;
(d) det(AB) − det(A)·det(B) = 0;
(e) the determinant of the matrix F := det(A)B − det(B)A − tr(C)C is an integer divisible by the numbers 1, 2, 19 and 38. ⃝

1.E.15. Give an example of matrices A and B for which
(a) (A + B)(A − B) ≠ A^2 − B^2;
(b) (A + B)(A + B) ≠ A^2 + 2AB + B^2. ⃝

9 Here we denote by 0 the zero 2 × 2 matrix.
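These identities are finger exercises on paper, but Sage is happy to do the bookkeeping; e.g. for item (e) of 1.E.14 (a sketch of ours):

A = matrix([[0,5],[-2,2]]); B = matrix([[2,0],[-1,1]]); C = matrix([[2,-2],[4,5]])
F = A.det()*B - B.det()*A - C.trace()*C
print(F.det())   # -38, indeed divisible by 1, 2, 19 and 38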
The so-called "affine geometry" of the real plane is characterized by the concept of "going straight with constant velocity" from a given point, i.e., along a line. The mappings F : R^2 → R^2 of the plane which preserve this feature are called "affine". Those fixing the origin O = [0, 0] are just the "linear" ones,
F(a u + b v) = a F(u) + b F(v), for all a, b ∈ R and u, v ∈ R^2.
Remarkably, all affine maps are obtained by additionally allowing the "shifts" P → P + u, P ∈ R^2, for fixed vectors u. In coordinates, the linear mappings are given by matrix multiplication, while the affine mappings also include the addition of constant vectors. See the detailed explanation in the other column, e.g., 1.5.6. Due to the associativity of matrix multiplication, composition of mappings corresponds to multiplication of matrices, too. Moreover, invertible affine mappings can also be understood as changes of coordinates, see 1.5.6; thus we call them affine transformations.

Angles are a matter of vectors rather than points in Euclidean geometry. Let u be a vector of length 1, at angle φ measured counter-clockwise from the vector (1, 0). In coordinates, u lies on the unit circle and has first and second coordinates cos φ and sin φ, respectively (this is one of the elementary definitions of the sine and cosine functions). That is, u = (cos φ, sin φ), with −1 ≤ sin φ ≤ 1, −1 ≤ cos φ ≤ 1, satisfying (cos φ)^2 + (sin φ)^2 = 1.

Angle between vectors: The angle φ between two vectors v and v′ can in general be described using their coordinates v = (x, y), v′ = (x′, y′) as
(1) cos φ = (xx′ + yy′)/(∥v∥ ∥v′∥), with 0 ≤ φ ≤ π.
In the special case v = (1, 0), this general equation gives cos φ = x′/∥v′∥, which is just the definition of the function cos φ. The general case can always be reduced to this special one. First we notice that the angle φ between two vectors u, v is always the same as the angle between the normalized vectors (1/∥u∥)u and (1/∥v∥)v. Thus we can restrict ourselves to two vectors on the unit circle. Then we can rotate our coordinates in such a way that the first of the vectors becomes (1, 0). This means it is enough to show that the scalar product xx′ + yy′, as well as the lengths of vectors, are invariant with respect to rotations. We shall come back to this in a moment. In the special case when the scalar product is zero, we say that the vectors are perpendicular; this corresponds to φ = π/2, as expected. Of course, the best example of perpendicular vectors of length 1 is the pair of standard basis vectors (1, 0) and (0, 1). Notice that our formula for the angle between vectors is symmetric in the two arguments; thus we take the smaller of the possible angles between the two vectors, and φ is always between 0 and π.

We can easily imagine that not all affine coordinates are adequate for expressing distances, and thus for use in the Euclidean plane. Indeed, although we may again choose any point O as the origin, we also want the basis vectors e1 = \overrightarrow{OE_1} and e2 = \overrightarrow{OE_2} to be perpendicular and of length one. Such a basis will be called orthonormal. We shall see that the angles and distances computed in such coordinates are always the same, no matter which such coordinates are used.

1.E.16. Decide whether the mappings F, G : R^2 → R^2 given by
F : (x, y) → (7x − 3y, −2x + 5y), G : (x, y) → (2x + 2y − 4, 4x − 9y + 3),
are linear. Write them via matrix multiplication, analyze the meaning of the individual columns, and check that their composition is expressed via matrix multiplication, too. ⃝

Affine transformations play a crucial role in computer graphics and computer vision. Computer graphics involves creating and manipulating images, displayed or animated on a computer screen. Modern computer-aided design (CAD) programs and tools like LibreCAD have a wide range of applications, from designing satellites to creating computer games and educational software. Thus these tools are particularly valuable for engineers and designers working in various industries. Below we explore how to manipulate affine transformations of the plane, which will enhance our understanding of their mathematical significance in practical applications. Using Sage to illustrate these concepts, we focus on two-dimensional graphics (2D graphics) through the problems analyzed below. For additional learning and Sage examples, refer to the tasks outlined in 1.G.52, 1.G.53, and ??.

1.E.17. Affine transformations. Find the matrix representations of simple affine transformations of the plane, such as: shifts, homotheties, stretching in one direction only, shears, rotations, and reflections.

Solution. In order to talk about matrix representations, we need to fix the coordinates, see 1.5.2. We shall work with the standard ones, that is, with the vectors e1 = (1, 0), e2 = (0, 1), identifying the plane with R^2 explicitly.

• "Identity transformation." The simplest affine transformation is the identity; the relevant matrix is the identity matrix E = ( 1 0 ; 0 1 ). Clearly E u = u for all u ∈ R^2.

• "Shifts." The shift (or translation) τ_w by a vector w = r e1 + s e2 = (r, s)^T ∈ R^2 is given by applying the matrix E and adding the vector w, i.e., (x, y)^T → (x, y)^T + (r, s)^T = (x + r, y + s)^T, or in other words τ_w(u) = u + w. For instance, if w = (3, 0)^T, then applying τ_w to an image shifts it to the right by 3. Notice that when we want to shift a point P to the origin, we write τ_{−P} for the relevant translation. To illustrate a translation (and also the remaining transformations), we use the triangle ABC with vertices at the points A = [1, 0], B = [4, −2], and C = [2, −2]. In the figures, these vertices are coloured blue, green and red, respectively, and the new coordinates are painted in the corresponding colours too, hopefully giving a better visualization of how the given transformation acts. For instance, the accompanying figure illustrates the translation τ_{(3,0)} applied to ABC.
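Such illustrations are produced with Sage; a minimal sketch of this kind (with our own choice of names and colours) might read:

A = vector([1, 0]); B = vector([4, -2]); C = vector([2, -2])
w = vector([3, 0])   # the translation vector of tau_(3,0)
before = polygon([A, B, C], fill=False, color="gray")
after  = polygon([A + w, B + w, C + w], fill=False, color="black")
show(before + after, aspect_ratio=1)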
1.5.8. Rotation around a point in the plane. As we saw, the matrix of any given linear mapping F : R^2 → R^2 is easy to guess. If the mapping works by applying the matrix with columns (a, c) and (b, d), then the first column (a, c) is obtained by multiplying this matrix with the basis vector (1, 0), and the second is the evaluation at the second basis vector (0, 1). We can see from the picture that the columns of the matrix corresponding to rotating counter-clockwise through the angle ψ are computed as follows:
( a b ; c d )·( 1 ; 0 ) = ( cos ψ ; sin ψ ), ( a b ; c d )·( 0 ; 1 ) = ( − sin ψ ; cos ψ ).
The counter-clockwise direction is called the positive direction, the other one the negative direction.

Rotation matrix: Rotation through a given angle ψ in the positive direction about the origin is given by the matrix Rψ:
v = ( x ; y ) → Rψ·v = ( cos ψ − sin ψ ; sin ψ cos ψ )·( x ; y ).

Now that we know what the matrix of a rotation in the plane looks like, we can check that rotations preserve distances and angles (defined by equation (1) in 1.5.7). Denote the image of a vector v by
v′ = ( x′ ; y′ ) = Rψ·v = ( x cos ψ − y sin ψ ; x sin ψ + y cos ψ ),
and similarly w′ = Rψ·w for w = (r, s)^T and w′ = (r′, s′)^T. A straightforward expansion (exploiting (cos ψ)^2 + (sin ψ)^2 = 1) checks the expected equalities ∥v′∥^2 = ∥v∥^2 and x′r′ + y′s′ = xr + ys. The latter can be written using vectors and matrices as
(Rψ·w)^T (Rψ·v) = w^T v.

• "Homotheties." A homothety h_c is a constant rescaling of all vectors by a factor c ∈ R. Thus the relevant matrix is a multiple of E and hence diagonal, A = cE = ( c 0 ; 0 c ). As a result, a homothety with c > 1 stretches a figure. Let us illustrate this using the triangle ABC and the homothety h_c((x, y)^T) = (3x/2, 3y/2)^T, i.e., c = 3/2. Conversely, a homothety h_c with 0 < c < 1 shrinks the figure; try to produce examples yourselves. In case we want to apply a homothety with another fixed point P = [a, b] in R^2 (thus rescaling the differences Q − P for all points Q), we can first translate P to the origin, then employ the usual homothety, and then shift the plane back:
( x ; y ) → ( x − a ; y − b ) → ( c(x − a) ; c(y − b) ) → ( c(x − a) + a ; c(y − b) + b ).
Imagine what happens with the plane under the three transformations along the way.

• "Rotations." If we rotate by the angle θ counter-clockwise, we get e1 → (cos θ, sin θ) (this is the elementary definition of the trigonometric functions), while e2 → (− sin θ, cos θ) (look at the picture in 1.5.8). Thus, the rotation matrix has the form
Rθ = ( cos θ − sin θ ; sin θ cos θ ).
For instance, for θ = π/2 we get R_{π/2} = ( 0 −1 ; 1 0 ), and hence in this case the rotation takes (x, y)^T to (−y, x)^T:
( x ; y ) → R_{π/2}( x ; y ) = ( 0 −1 ; 1 0 )( x ; y ) = ( −y ; x ).
The figure below illustrates the result of R_{π/2} on the triangle ABC used above.10

10 Notice that a rotation preserves both the lengths of vectors and the angles.

The transposed vector (Rψ·w)^T equals w^T·Rψ^T, where Rψ^T is the so-called transpose of the matrix Rψ. That is the matrix whose rows consist of the columns of the original matrix, and similarly whose columns consist of the rows of the original matrix. Therefore we see that the rotation matrices satisfy the relation Rψ^T·Rψ = I, where the matrix I (sometimes we denote this matrix just by 1, meaning the unit in the ring of matrices) is the unit matrix I = ( 1 0 ; 0 1 ).
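This relation is also a pleasant symbolic exercise for Sage (a sketch of ours; simplify_trig lets Sage apply the identity cos^2 ψ + sin^2 ψ = 1):

psi = var("psi")
R = matrix([[cos(psi), -sin(psi)], [sin(psi), cos(psi)]])
print((R.transpose()*R).simplify_trig())   # the unit matrix ( 1 0 ; 0 1 )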
This leads us to a remarkable observation: the matrix X with the property that X·Rψ = I (we call such a matrix the inverse matrix of the rotation matrix Rψ) is the transpose of the original matrix. This makes sense, since the inverse mapping to the rotation through the angle ψ is again a rotation, through the angle −ψ. That is, the inverse matrix is the transpose Rψ^T, which equals the matrix
R_{−ψ} = ( cos(−ψ) − sin(−ψ) ; sin(−ψ) cos(−ψ) ) = ( cos ψ sin ψ ; − sin ψ cos ψ ).
It is easy to write the rotation around a point P = O + w, P = [r, s], using matrices again. One just has to note that instead of rotating around the given point P, we can first shift P into the origin, then do the rotation, and then do the inverse shift. We calculate:
v = ( x ; y ) → v − w → Rψ·(v − w) → Rψ·(v − w) + w = ( cos ψ (x − r) − sin ψ (y − s) + r ; sin ψ (x − r) + cos ψ (y − s) + s ).

1.5.9. Reflections. Another well-known example of a length preserving mapping is reflection through a line. It is enough to understand reflection through a line that goes through the origin O; all other reflections can be derived using shifts. Thus, we first look for the matrix Zψ of the reflection with respect to the line through the origin and through the point (cos ψ, sin ψ). Notice that Z0 = ( 1 0 ; 0 −1 ).

• "Stretching by a parameter c." Let us stretch the plane only in the direction of the first basis vector e1. Then the relevant matrix A_c is diagonal, obtained by multiplying the first row of the identity matrix by the stretching parameter c, i.e., A_c = ( c 0 ; 0 1 ). The figure below shows the stretching of our triangle ABC in the direction of e1 with c = 3/2; the particular linear transformation has the form
( x ; y ) → ( 3/2 0 ; 0 1 )( x ; y ) = ( 3x/2 ; y ).
Applying the same stretching but in the direction of e2 forces only the y-coordinates to be transformed, as we see in the next figure. If we want to employ the same operation with a general fixed point P lying on a line ℓ, and along this line, we may again compose three transformations: first shift by τ_{−P}, then find the angle θ of ℓ with the direction of e1 and use the matrix R_{−θ}, then apply the above stretching, and finally move the plane back, i.e., apply Rθ and τ_P.

• "Shears." The horizontal shear is a linear transformation keeping the y-coordinate of each point fixed, so that the entire "x-coordinate axis" (the line y = 0) stays fixed, too. In other words, a horizontal shear takes the point (x, y) to the point (x + ay, y) for some a. Similarly, vertical shears are the transformations with the roles of the components swapped. Thus, the relevant matrices have the following

Any line going through the origin can be rotated so that it has the direction (1, 0), and thus we can write the general reflection matrix as Zψ = Rψ·Z0·R_{−ψ}, where we first rotate via the matrix R_{−ψ} so that the line is in the "zero" position, reflect with the matrix Z0, and return back with the rotation Rψ. Therefore we can calculate (by the associativity of matrix multiplication):
Zψ = ( cos ψ − sin ψ ; sin ψ cos ψ )·( 1 0 ; 0 −1 )·( cos ψ sin ψ ; − sin ψ cos ψ )
= ( cos ψ sin ψ ; sin ψ − cos ψ )·( cos ψ sin ψ ; − sin ψ cos ψ )
= ( cos^2 ψ − sin^2 ψ 2 sin ψ cos ψ ; 2 sin ψ cos ψ −(cos^2 ψ − sin^2 ψ) )
= ( cos 2ψ sin 2ψ ; sin 2ψ − cos 2ψ ).
The last equality follows from the usual double angle formulas for the trigonometric functions:
(1) sin 2ψ = 2 sin ψ cos ψ, cos 2ψ = cos^2 ψ − sin^2 ψ.
Notice that the product Zψ·Z0 gives
( cos 2ψ sin 2ψ ; sin 2ψ − cos 2ψ )·( 1 0 ; 0 −1 ) = ( cos 2ψ − sin 2ψ ; sin 2ψ cos 2ψ ).
This observation can be formulated as follows:

Proposition. A rotation through the angle ψ can be obtained as two subsequent reflections through lines that have the angle ψ/2 between them.

form: H_a = ( 1 a ; 0 1 ), V_a = ( 1 0 ; a 1 ).
Let us illustrate a shear in the x-direction for the values a = ±3. We get the transformations
( x ; y ) → ( 1 ±3 ; 0 1 )( x ; y ) = ( x ± 3y ; y ),
so the triangle ABC is transformed as in the accompanying figures (cases a = 3 and a = −3).

• "Reflections." Finally, let us deal with reflections. They actually appeared above as stretchings along one direction, with the parameter c = −1. Thus, stretching along the direction of e1 with c = −1 gives the reflection with respect to the y-axis, with matrix representation ( −1 0 ; 0 1 ). Repeating this with the vector e2 instead of e1, we get the reflection with respect to the x-axis, with matrix representation ( 1 0 ; 0 −1 ). Below we illustrate the reflection of our triangle ABC with respect to the y-axis; for an illustration of a reflection in the plane with respect to the x-axis, see 1.G.51. On the other hand, the reflection in the plane through the line y = x sends the x-axis to the y-axis, and vice versa; hence it is also referred to as the "axial symmetry". Obviously, its matrix representation is given by ( 0 1 ; 1 0 ), and for example it transforms the given triangle as in the accompanying figure (see 1.G.52 for another example). □
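The proposition above is easy to test symbolically as well: composing the reflections Zψ and Z0 in Sage should return the rotation matrix through 2ψ, in line with the computation in the other column (a sketch of ours):

psi = var("psi")
Z0 = matrix([[1, 0], [0, -1]])
Rpsi = matrix([[cos(psi), -sin(psi)], [sin(psi), cos(psi)]])
Zpsi = (Rpsi*Z0*Rpsi.transpose()).simplify_trig()   # note R(-psi) = R(psi)^T
print((Zpsi*Z0).simplify_trig())   # should reduce to the rotation matrix through 2*psi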
In fact, we can prove the previous proposition purely by geometric argumentation, as shown in the picture above (try to be a "synthetic geometer" when reflecting A to A′ and then A′ to A′′). If we believe in this proof "by picture", then the computational derivation of the proposition provides a proof of the standard double angle formulas (1).

1.5.10. The following is a recapitulation of the previous ideas.

Mappings that preserve length
Theorem. A linear mapping of the Euclidean plane is composed of one or more reflections if and only if it is given by a matrix R which satisfies
R = ( a b ; c d ), ab + cd = 0, a^2 + c^2 = b^2 + d^2 = 1.
This happens if and only if the mapping preserves length. Such a mapping is a rotation if and only if the determinant of the matrix R equals one, which corresponds to an even number of reflections. When there is an odd number of reflections, the determinant equals −1.

Proof. We calculate what a general matrix A might look like when the corresponding mapping preserves length. That is, we have the mapping
( x ; y ) → ( a b ; c d )·( x ; y ) = ( ax + by ; cx + dy ).
Preserving length means that for every x and y we have
x^2 + y^2 = (ax + by)^2 + (cx + dy)^2 = (a^2 + c^2)x^2 + (b^2 + d^2)y^2 + 2(ab + cd)xy.
Since this equation must hold for every x and y, the coefficients of the individual powers x^2, y^2 and xy on the left and right sides must be equal. Thus we have calculated that the conditions imposed on the matrix R in the first part

1.E.18. Show that every rotation can be obtained as a composition of three shears.11

Solution. This is a slightly more demanding exercise, but we may experiment with the abilities of Sage, which makes our life really easy; we would add the lines of the following session one by one, following the partial results. Composing horizontal shears gives again a horizontal shear (and similarly for the vertical ones):
( 1 a ; 0 1 )( 1 b ; 0 1 ) = ( 1 a + b ; 0 1 ).
Thus we try to combine horizontal and vertical shears to get a rotation by an angle θ.

# introduce three shear matrices with
# variable parameters a,b,c
a,b,c=var("a,b,c")
A = matrix([[1,a],[0,1]])
B = matrix([[1,0],[b,1]])
C = matrix([[1,c],[0,1]])
# take the general transformation X
# obtained by composition
X = C*B*A; print(X)
# notice that one of the conditions on X
# to be a rotation indicates a=c
X = X.substitute(c=a)
eq1 = X[0,0]^2+X[1,0]^2-1; eq2 = X[1,0]+X[0,1]
# eq3 = X[0,0]-X[1,1] == 0 is already
# fulfilled; solve the system
sol = solve([eq1,eq2],[a,b]); print(sol)

The final line returns the two solutions
a = (√(1 − r1^2) − 1)/r1, b = r1, or a = −(√(1 − r2^2) + 1)/r2, b = r2.
This means that b is a free parameter, and from the printed form of X we learn that b plays the role of sin θ. The graphical illustration of this composition is given in ??. □

11 This is actually of high interest for pixel based computer graphics: a horizontal or vertical shear is a very quick operation consisting of a constant shifting of the individual rows or columns of pixels, while implementing a rotation directly is much more time consuming! It seems that the first algorithm based on this was introduced by Alan Paeth in 1986.

1.E.19. Change of coordinates. Look at the picture in 1.5.6 and assume that, in the standard coordinates, O′ = [1, 5], P = O′ + e′1 = [2, 3], Q = O′ + e′2 = [2, 7], and that these define the "girl's" coordinate system. Consider the two options that the meeting point M = [2, 1] is given (a) by the boy, or (b) by the girl, in their own coordinate system. What does it mean for the other one? Find the answer first for the situation where at least the origins of the coordinate systems coincide!

Solution. If the boy and the girl share at least the same origin of the coordinate systems, then the transformation of coordinates is simple. The girl's basis is already expressed in the standard coordinates (the boy's ones). Thus the matrix T with columns

of the theorem we are proving are equivalent to the property that the given mapping preserves length. Because a^2 + c^2 = 1, we can assume that a = cos φ and c = sin φ for a suitable angle φ. As soon as we choose the first column of the matrix R, the relation ab + cd = 0 determines the second column up to a multiple. But we also know that the length of the vector in the second column is one, and thus we have only two possibilities for the matrix R, namely
( cos φ − sin φ ; sin φ cos φ ), ( cos φ sin φ ; sin φ − cos φ ).
In the first case, we have a rotation through the angle φ; in the second case we have a rotation composed with the reflection through the first coordinate axis. As we have seen in the proposition of 1.5.9, every rotation corresponds to two reflections. The determinant of the matrix R is in these two cases either one or minus one, and it distinguishes between the two cases by the parity of the number of reflections. □

Notice that we have now proved our earlier claim on the invariance of the formulae for distance and angle in any orthonormal coordinates. Moreover, we have seen that all Euclidean affine mappings are generated by translations and reflections.

1.5.11. Area of a triangle. At the end of our little trip to geometry we focus on the area of planar objects. For us, triangles will be sufficient. Every triangle is determined by a pair of vectors v and w which, if translated so that they start from one vertex P of the triangle, determine the remaining two vertices.
We would like to find a formula (a scalar function area) which assigns to such a pair the number area ∆(v, w) equal to the area of the triangle ∆(v, w) defined in the aforementioned way. By translating, we can place P at the origin, since translation should not change the area. We can see from the picture that the desired value is half of the area of the parallelogram spanned by the vectors v and w. It is easy to calculate (using the well-known formula: base times corresponding height), or simply observe from the picture, that the following holds:
area ∆(v + v′, w) = area ∆(v, w) + area ∆(v′, w),
area ∆(a v, w) = a·area ∆(v, w).

e′1 and e′2 yields the linear mapping providing the boy's coordinate expression of the girl's vectors. In order to go in the opposite direction, we just take the inverse matrix T^{−1}. Thus,
T = ( 1 1 ; −2 2 ) and T^{−1} = ( 1/2 −1/4 ; 1/2 1/4 ),
and the boy knows that the girl means him to be at [3, −2], while the girl knows that the boy's [2, 1] would be [3/4, 5/4]. In the general case, we must consider the shifts of the origin, evaluated in the right coordinates. We shall again use Sage to compute:

# go to the new affine coordinates
O=vector([0,0]); e1=vector([1,0]); e2=vector([0,1])
OO=vector([1,5]); ee1=vector([2,3])-OO; ee2=vector([2,7])-OO
# define the relevant transformations
T=matrix([ee1,ee2]); T=T.transpose()
print(T); print(T.inverse())
def change_girl_to_boy(coordinates):
    position = vector(coordinates)
    return T*position + OO
def change_boy_to_girl(coordinates):
    # [0,0] = a*[1,-2]+b*[1,2]+[1,5] =>
    # a-b=5/2, a+b=-1 => a=3/4, b=-7/4
    position = vector(coordinates)
    return T.inverse()*position + vector([3/4,-7/4])
print(change_girl_to_boy(vector([2,1])))
print(change_boy_to_girl(vector([2,1])))
# tests
OO+2*ee1+1*ee2 == O+4*e1+3*e2 and O+2*e1+1*e2 == OO+3/2*ee1-1/2*ee2

The linear parts of the transformations are still T and T^{−1}; the right shifts are added. The last two prints give the answer: the girl's meeting point [2, 1] means the boy should go to the position [4, 3], while if it is the boy who tells his meeting point, then the girl should look for him at [3/2, −1/2] in her coordinates. The final line confirms this by returning "True". □

Let us now move to Euclidean tasks. In addition, there is the concept of the norm (size) of a vector, allowing us to measure distances between points. We have already seen this in the case of complex numbers, viewing C as the real plane with standard coordinates. As we will see below, we are then also able to measure angles between vectors and areas of polygonal objects.

1.E.20. (a) Find the norms, the scalar product and the angle of the vectors u = (−3, −2) and v = (−2, 3).
(b) Prove that the triangle with vertices P = [2, 2], Q = [3, 0], R = [4, 3] is isosceles.

Solution. (a) We shall work with the standard Euclidean norm on R^2 (which we may identify with C), as we did in 1.A.8. The norm of a vector w = (x, y) is given by ∥w∥ = √(x^2 + y^2).

Finally we add to the formulation of our problem the condition
area ∆(v, w) = −area ∆(w, v),
which corresponds to the idea that we give a sign to the area, according to the order in which we take the vectors. If we write the vectors v and w into the columns of a matrix A, then the mapping A = (v, w) → det A satisfies all three conditions we wanted. How many such mappings could there possibly be? Every vector can be expressed using the two basis vectors e1 = (1, 0) and e2 = (0, 1). By linearity, area ∆ is uniquely determined by its values on these vectors. We want area ∆(e1, e2) = 1/2.
In other words, we have chosen the orientation and the scale through the choice of the basis vectors, and we choose the unit square to have area equal to one. Thus we see that the determinant gives the area of the parallelogram determined by the columns of the matrix A. The area of the triangle is then one half of the area of the parallelogram.

1.5.12. Visibility in the plane. The previous description of the value of the oriented area gives us an elegant tool for determining the position of a point relative to oriented line segments. By an oriented line segment we mean two points in the plane R^2 with a selected order. We can imagine it as an arrow from one point to the other. Such an oriented line segment divides the plane into two half-planes; let us call them "left" and "right". We want to be able to determine whether a given point is in the left or the right half-plane. Such tasks are often met in computer graphics when dealing with the visibility of objects. We can imagine that an oriented line segment can be "seen" from the points to the right of it and cannot be seen from the points to the left of it.

Thus ∥u∥ = √(9 + 4) = √13 = ∥v∥. Their scalar product (dot product) is given by u·v ≡ ⟨u, v⟩ = (−3)(−2) + (−2)·3 = 0, and hence u and v are orthogonal to each other, i.e., their angle is φ = π/2. In Sage, we have the method dot_product, which is applied to the first vector and takes the second one as its argument, see the cell

u = vector([-3, -2]); v = vector([-2, 3])
u.dot_product(v)

which returns 0. As for the norms of the vectors u, v, type u.norm(); v.norm(), which in both cases returns √13.

(b) In this case we compute
|\overrightarrow{PQ}| = ∥Q − P∥ = √((2 − 3)^2 + (2 − 0)^2) = √5,
|\overrightarrow{QR}| = ∥R − Q∥ = √((3 − 4)^2 + (0 − 3)^2) = √10,
|\overrightarrow{PR}| = ∥R − P∥ = √((2 − 4)^2 + (2 − 3)^2) = √5.
Thus |\overrightarrow{PQ}| = |\overrightarrow{PR}|, and the triangle is isosceles. □

1.E.21. Given the line ℓ : 7y = 6x + 13, determine the line ℓ′ which is perpendicular to ℓ and passes through the point P = [−6, 7] ∈ R^2. Use at least two methods.

Solution. (1) Given a line ℓ in parametric form P0 + t u = [x0, y0] + t(x1, y1) (t ∈ R), it is easy to determine a line ℓ′ perpendicular to ℓ which goes through another point Q = [a, b], as follows. The direction vector of ℓ′ must be orthogonal to the direction vector of ℓ. So, if we assume that Q + t v is the parametric form of ℓ′, then we should have u·v = 0, where · is the usual dot product on R^2. Taking v = (−y1, x1), we see that (x1, y1)·(−y1, x1) = 0, that is, u ⊥ v. Thus the parametric equation of ℓ′ is simply Q + t v = [a, b] + t(−y1, x1), with t ∈ R. To apply this in our case, we need to express the initial line ℓ in parametric form. The points [−1, 1] and [0, 13/7] belong to the line ℓ, so the parametric form of ℓ is (cf. 1.E.8)
[−1, 1] + t(0 − (−1), 13/7 − 1) = [−1, 1] + t(1, 6/7).
As a result, the direction vector of ℓ′ is (−6/7, 1), with (1, 6/7)·(−6/7, 1) = 0, and ℓ′ reads [−6, 7] + t(−6/7, 1), or in other words x = −6 − (6/7)t and y = 7 + t, with t ∈ R. For a verification, we can transfer the given parametric form of ℓ′ to implicit form. Eliminating t from these equations gives x = −6 − (6/7)(y − 7), which is equivalent to y = −(7/6)x.

(2) Actually, writing the equation for ℓ as 6x − 7y + 13 = 0, it is obvious that the coefficients (6, −7) provide just a vector perpendicular to ℓ. Thus we may write the implicit equation for ℓ′ immediately.

(3) We may also work with the slopes. Writing ℓ as y = (6/7)x + 13/7, we get its slope m = 6/7.
The line ℓ′, perpendicular to the line ℓ, must then have the slope m′ = −7/6, so that mm′ = −1.

We have the oriented line segment AB and are given some point C. We calculate the oriented area of the corresponding triangle determined by the vectors A − C and B − C. If the point C is to the left of the line segment, then with the usual positive orientation (counter-clockwise) we obtain a positive sign of the oriented area, while the negative sign corresponds to the points to the right. This approach is often used for testing relative positions in 2D graphics.

6. Relations and mappings

In the final part of this introductory chapter, we return to the formal description of mathematical structures. We will try to illustrate them on examples we already know. We can consider this part to be an exercise in the formal approach to the objects and concepts of mathematics, i.e. the "language of mathematics".

1.6.1. Relations between sets. First we define the Cartesian product A × B of two sets A and B. It is the set of all ordered pairs (a, b) such that a ∈ A and b ∈ B. A binary relation between the two sets A and B is then a subset R of the Cartesian product A × B. We write a ≃_R b to mean (a, b) ∈ R, and say that a is related to b. The domain of the relation is the subset D = {a ∈ A : ∃b ∈ B, (a, b) ∈ R}. Here the symbol ∃b means that there is at least one such b satisfying the rest of the claim. Similarly, the codomain of the relation is the subset I = {b ∈ B : ∃a ∈ A, (a, b) ∈ R}. A special case of a relation between sets is a mapping from the set A to the set B. This is the case when every element of the domain of the relation is related to exactly one element of the codomain. Examples of mappings known to us are all functions, where the codomain of the mapping is a set of numbers, for instance the set of integers or the set of real numbers, or the linear mappings in the plane given by matrices. We write f : D ⊆ A → I ⊆ B, f(a) = b, to express the fact that (a, b) belongs to the relation, and we say that b is the value of f at a. Furthermore we say that
• the mapping f of the set A to the set B is surjective (or onto) if D = A and I = B,
• the mapping f of the set A to the set B is injective (or one-to-one) if D = A and for every b ∈ I there exists exactly one preimage a ∈ A with f(a) = b.
Expressing a mapping f : A → B as a relation f ⊆ A × B, f = {(a, f(a)); a ∈ A}, is also known as the graph of the mapping f.

Thus, ℓ′ is of the form y = −(7/6)x + b, and substituting the coordinates of P implies b = 0, i.e., y = −(7/6)x. □
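All three approaches can be cross-checked in Sage in a few lines (a sketch of ours): the normal vectors of ℓ and ℓ′ must be orthogonal, and P must satisfy the equation of ℓ′.

x, y = var("x, y")
ell_prime = -7/6*x - y   # the line ℓ′: y = -(7/6)x
print(vector([6, -7]).dot_product(vector([7, 6])))   # 0, the normal vectors are orthogonal
print(ell_prime.subs(x=-6, y=7))                     # 0, so P = [-6, 7] lies on ℓ′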
1.E.22. Verify that every affine mapping maps parallel lines to parallel lines, but that angles and distances are changed in general. ⃝

1.E.23. Remark on plots. Sage, as well as other software, provides extensive 2D plotting functionality. However, even for drawing simple lines in the plane, we should pay attention to several details. First, let us look at the so-called "aspect ratio" of the figures, which is governed by the option aspect_ratio in Sage. The aspect ratio reflects the apparent height/width ratio of a unit square. In Sage one can explicitly ask for an "automatic setup" by typing aspect_ratio = "automatic". However, this does not always produce what we normally expect. Instead, the option aspect_ratio = 1 works much better, applying the same scale to both coordinates. This is relevant when illustrating Euclidean properties, where we want to see the true angles and lengths, while in the affine setup we may prefer to choose the scales independently. We may also adjust the size of a figure via the option figsize. The default figsize is 4, while setting figsize = 8 makes a figure roughly twice as big. We can also request separate horizontal and vertical sizes, by typing for example figsize = [4, 8] (both are measured in inches). Usually, small figsize values work well.

1.E.24. To illustrate the above discussion, provide plots of the lines appearing in Problem 1.E.21. ⃝

In the next series of exercises we focus on applications of linear and affine mappings. In applications, they appear either directly or as "linear approximations".12 We already saw the simple building blocks of most of them in 1.E.17, see also the other column starting in 1.5.8. We shall also include the concept of area and its link to the properties of the determinant function on matrices. We refer to the paragraphs 1.5.5 and 1.5.11 for details.

1.E.25. Compute the area of the triangle PQR, if P = [−8, 1], Q = [−2, 0] and R = [5, 9]. What would we have to do if the points were given in a coordinate system different from the standard Euclidean one? ⃝

1.E.26. Consider the quadrilateral S given by its vertices A = [1, 1], B = [6, 1], C = [11, 4], and D = [2, 4]. Illustrate S via Sage and deduce that S is a trapezoid whose bases have lengths 5 and 9, respectively, and whose height equals 3. Then compute its area in at least two ways. ⃝
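For 1.E.25, the determinant from 1.5.11 turns the area computation into a one-liner; a quick Sage sketch of ours (the oriented area of the triangle is half of the determinant, so we take the absolute value):

P = vector([-8, 1]); Q = vector([-2, 0]); R = vector([5, 9])
print(abs(matrix([Q - P, R - P]).det()) / 2)   # 61/2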
12 While the linear approximations are the core of building models in science and engineering (we start developing the relevant methods in Chapter 5), in the arts the concept of linearity is essential in finding the right proportions, helping to create structure, balance, and rhythm in compositions, guiding the viewer's eye and affecting emotions.

1.6.2. Composition of relations and mappings. For mappings, the concept of composition is clear. Suppose we have two mappings f : A → B and g : B → C. Then their composition g ∘ f : A → C is defined as (g ∘ f)(a) = g(f(a)). Composition can also be expressed in the notation used for relations:
f ⊆ A × B, f = {(a, f(a)); a ∈ A},
g ⊆ B × C, g = {(b, g(b)); b ∈ B},
g ∘ f ⊆ A × C, g ∘ f = {(a, g(f(a))); a ∈ A}.
The composition of relations is defined in a very similar way; we just add existential quantifiers to the statements, since we have to consider all possible "preimages" and all possible "images". Let R ⊆ A × B, S ⊆ B × C be relations. Then
S ∘ R ⊆ A × C, S ∘ R = {(a, c); ∃b ∈ B, (a, b) ∈ R, (b, c) ∈ S}.
A special case of a relation is the identity relation id_A = {(a, a) ∈ A × A; a ∈ A} on the set A. It is a neutral element with respect to composition with any relation that has A as its codomain or domain. For every relation R ⊆ A × B, we define the inverse relation R^{−1} = {(b, a); (a, b) ∈ R} ⊆ B × A. Beware, the same term is used for mappings in a more specific situation. Of course, for every mapping there is its inverse relation, but this relation is in general not a mapping. Therefore we speak about the existence of an inverse mapping only if every element b ∈ B is the image of exactly one element in A. In such a case the inverse mapping is exactly the inverse relation. Note that the composition of a mapping and its inverse mapping (if it exists) is the identity mapping. In general, this is not true for relations.

1.E.27. (a) An equilateral triangle with vertices [1, 0] and [0, 1] lies entirely in the first quadrant. Find the coordinates of its third vertex with the help of Sage.
(b) An equilateral triangle has vertices at P = [1, 1] and Q = [2, 3], and its third vertex lies in the same half-plane as the point S = [0, 0]. The triangle is rotated by 60° in the positive direction around the point S, producing a new triangle. Determine the coordinates of the vertices of the new triangle.

Solution. (a) Let us denote by A = [1, 0] and B = [0, 1] the given points. To find the coordinates [x, y] of the third vertex, we rotate the point A around the point B by the angle 60° = π/3 in the positive (counter-clockwise) direction. Thus, in Sage we can find the third vertex by typing

a=pi/3
M=matrix([[cos(a),-sin(a)],[sin(a),cos(a)]])
A=vector([1, 0]); B=vector([0, 1])
rot=M*(A-B)+B; print(rot)

Hence we have
[x, y] = ( cos(π/3) − sin(π/3) ; sin(π/3) cos(π/3) )(A − B) + B,
which gives [x, y] = [1/2 + √3/2, 1/2 + √3/2].

(b) Using the general rotation transformation as found in Sage above, we easily obtain the wanted points (check yourselves):
[−(3/2)√3, √3 − 1/2], [1/2 − (1/2)√3, (1/2)√3 + 1/2], [1 − (3/2)√3, √3 + 3/2]. □

1.E.28. Consider a regular hexagon with vertices labeled in the positive direction, with the first vertex A = [0, 2] and centre at the point S = [1, 0]. Determine the coordinates of the third vertex C. ⃝

1.E.29. Which sides of the quadrangle given by the vertices [−2, −2], [1, 4], [3, 3] and [2, 1] are "visible" from the position of the point X = [3, π − 2]?

Solution. In the first step we order the vertices so that their order corresponds to the counter-clockwise direction. We choose the vertex A = [−2, −2]; the order of the remaining vertices is then B = [2, 1], C = [3, 3], D = [1, 4] (think about how to order the points without a picture; you can actually use a procedure similar to what follows). First consider the side AB. Together with the point X = [3, π − 2] it determines the matrix
( −2 − 3 2 − 3 ; −2 − (π − 2) 1 − (π − 2) ),
whose first column is the difference A − X and whose second column is B − X. Whether the side can be "seen" from the point [3, π − 2] (i.e., whether X is left or right of the vector \overrightarrow{AB}, see 1.5.12) is then determined by the sign of the determinant
| −5 −1 ; −π 3 − π | = 4π − 15 < 0.

1.6.3. Relation on a set. In the case when A = B we speak about a relation on the set A. We say that the relation R is:
• reflexive, if id_A ⊆ R, that is, (a, a) ∈ R for every a ∈ A,
• symmetric, if R^{−1} = R, that is, if (a, b) ∈ R, then also (b, a) ∈ R,
• antisymmetric, if R^{−1} ∩ R ⊆ id_A, that is, if (a, b) ∈ R and also (b, a) ∈ R, then a = b,
• transitive, if R ∘ R ⊆ R, that is, if (a, b) ∈ R and (b, c) ∈ R, then also (a, c) ∈ R.
A relation is called an equivalence relation if it is reflexive, symmetric and transitive. A relation is called an ordering if it is reflexive, transitive and antisymmetric. Orderings are usually denoted by the symbol ≤; the fact that the element a is related to the element b is then written as a ≤ b. Notice that the relation <, "to be strictly smaller than", is not an ordering of the set of real numbers, since it is not reflexive. The common practice (also used in this book) is to use the signs < or ⊂ for orderings, too. A good example of an ordering is set inclusion. Consider the set 2^A of all subsets of a finite set A. We have the relation ⊆ on the set 2^A given by the property of "being a subset".
A good example of an ordering is set inclusion. Consider the set 2^A of all subsets of a finite set A. We have the relation ⊆ on the set 2^A given by the property of “being a subset”. Thus X ⊆ Z if X is a subset of Z. Clearly all three conditions from the definition of an ordering are satisfied: if X ⊆ Y and Y ⊆ X then necessarily X and Y must be identical; if X ⊆ Y ⊆ Z then also X ⊆ Z; and reflexivity is clear from the definition. (As mentioned above, we shall mostly use the symbol ⊂ without excluding equality.)

We say that an ordering ≤ on a set A is complete, if every two elements a, b ∈ A are comparable, that is, either a ≤ b or b ≤ a. If A contains more than one element, there exist subsets X and Y with neither X ⊆ Y nor Y ⊆ X, so the ordering ⊆ is not complete on the set of all subsets of A. The set of real numbers with the usual ≤ is complete. Thus the subdomains N, Z, Q come equipped with a complete ordering, too. On the other hand, there is no such natural complete ordering on C. The absolute value provides at least a preordering there (comparing the radii of circles in the complex plane).

1.6.4. Partitions of an equivalence. Every equivalence relation R on a set A also defines a partition of the set A, consisting of subsets of mutually equivalent elements, namely the equivalence classes. For any a ∈ A we consider the set of elements which are equivalent with a, that is, [a] = Ra = {b ∈ A; (a, b) ∈ R}. Clearly a ∈ Ra by reflexivity. If (a, b) ∈ R, then Ra = Rb by symmetry and transitivity. Furthermore, if Ra ∩ Rb ≠ ∅ then there is an element c in both Ra and Rb, so that Ra = Rc = Rb. It follows that for every pair a, b, either Ra = Rb, or Ra and Rb are disjoint. That is, the equivalence classes are pairwise disjoint. Finally, A = ∪a∈A Ra. That is, the set A is partitioned into equivalence classes. We write [a] = Ra, and by the above, we can represent an equivalence class by any one of its elements.

For the sides BC, CD and DA we analogously obtain

det ( 2 − 3, 3 − 3 ; 1 − (π − 2), 3 − (π − 2) ) < 0,
det ( 3 − 3, 1 − 3 ; 3 − (π − 2), 4 − (π − 2) ) > 0,
det ( 1 − 3, −2 − 3 ; 4 − (π − 2), −2 − (π − 2) ) > 0.

The determinants differ in sign, thus the point X is outside the given quadrangle. According to our convention of putting the vectors XA, XB, XC, XD into the determinants, a side is visible exactly when the corresponding determinant is negative (i.e., X is right of the oriented side). Thus, from the point X exactly the sides determined by the pairs of vertices A = [−2, −2], B = [2, 1] and B = [2, 1], C = [3, 3] are visible. □

1.E.30. Determine which sides of the pentagon with vertices at the points [−2, −2], [−2, 2], [1, 4], [3, 1] and [2, −11/6] are visible from the point [300, 1]. ⃝

1.E.31. Consider the triangle with vertices P = [5, 6], Q = [7, 8], and R = [5, 8]. Determine which of its sides are visible from the point X = [0, 1]. ⃝
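The sign test of 1.E.29 is easy to automate. A minimal Sage sketch (assuming, as above, that the vertices are already ordered counter-clockwise; the variable names are ours):

X = vector([3, pi - 2])
V = [vector([-2, -2]), vector([2, 1]), vector([3, 3]), vector([1, 4])]
for i in range(len(V)):
    P, Q = V[i], V[(i + 1) % len(V)]
    d = matrix([P - X, Q - X]).transpose().det()   # columns P - X and Q - X
    print(P, Q, "visible" if bool(d < 0) else "hidden")

This reports the sides AB and BC as visible, in agreement with the computation above.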
F. Relations and mappings

We close this chapter by considering briefly some aspects of the language of mathematics. Thus, instead of the usual intuitive introduction, we ask the reader to have a quick look at the formal approach to the relevant concepts, beginning in 1.6.1. Note that digesting the abstract concepts of equivalence and ordering (see 1.6.3) is one of the crucial steps towards mathematical thinking.

1.F.1. Determine whether the following relations on the given set M are equivalence relations:
i) M = {f : R → R}, where f ∼ g if f(0) = g(0).
ii) M = {f : R → R}, where f ∼ g if f(0) = g(1).
iii) M is the set of lines in the plane, where two lines are related if they do not intersect.
iv) M is the set of lines in the plane, where two lines are related if they are parallel.
v) M = N, where m ∼ n if S(m) + S(n) = 20, where S(n) stands for the sum of the digits of the integer n.
vi) M = N, where m ∼ n if C(m) = C(n), where C(n) = S(n) if the sum of the digits S(n) is less than 10, and otherwise we define C(n) = C(S(n)) (thus we always have C(n) < 10).

Solution. For each case we must check the three defining properties of equivalence relations.
i) a) Reflexivity: for any real function f, f(0) = f(0).
b) Symmetry: if f(0) = g(0), then also g(0) = f(0).

1.6.5. Existence of scalars. As before, we assume we know what sets are, and we indicate the construction of the natural numbers. We denote the empty set by ∅ (notice the difference between the symbol 0 for zero and the empty set ∅) and define

(1) 0 := ∅, n + 1 := n ∪ {n},

in other words 0 := ∅, 1 := {0}, 2 := {0, 1}, . . . , n + 1 := {0, 1, . . . , n}. This notation says that if we have already defined the numbers 0, 1, 2, . . . , n, then the number n + 1 is defined as the set of all previous numbers. We have defined the set of natural numbers N.3

Next, we should construct the operations + and · and deduce their required properties. In order to do that in detail, we would have to pay more attention to the basic understanding of sets. For example, once we know what a disjoint union of sets is, we may define the natural number c = a + b as the unique natural number having the same number of elements as the disjoint union of a and b. Of course, formally speaking, we need to explain what it means for two sets to have the same number of elements. Let us notice that, in general, two sets A and B having the same “size” should mean that there exists a bijection A → B. This is completely in accordance with our intuition for finite sets. However, it is much less intuitive with infinite sets. For example, there are exactly as many natural numbers as there are squares of natural numbers (via the bijection a → a²), although the picture in the solution to 1.A.3 could be read as “many natural numbers do not have a rational square root”. We say that each set which is bijective to the natural numbers N is countable. Sets bijective to some natural number n (as defined above) are called finite (with number of elements n), while sets which are neither finite nor countable are called uncountable.

So adding a + b is understood as increasing a iteratively by one exactly as many times as there are elements in b. Similarly, we multiply a·b by adding a to zero as many times as there are elements in b. Now we should check the axioms for scalar operations for our operations + and ·. The (routine but tedious) check is left to the reader.

We can also define a relation ≤ on N as follows: m ≤ n if either m ∈ n or m = n. Clearly this is a complete ordering. For instance 2 ≤ 4, since

2 = {∅, {∅}} ∈ {∅, {∅}, {∅, {∅}}, {∅, {∅}, {∅, {∅}}}} = 4.

3 The concept of natural numbers based on the principle of “increasing by one” was known to all ancient civilisations; however, their smallest natural number was always one. The set-theoretical approach was developed in the 19th century, and there zero became the logical smallest natural number, as the counterpart of the empty set.

c) Transitivity: if f(0) = g(0) and g(0) = h(0), then also f(0) = h(0).
We conclude that the relation is an equivalence relation.
ii) The relation is not reflexive, since for instance for the function f(x) = sin(x) we have sin(0) ≠ sin(1).
Moreover, it is not transitive.
iii) The relation is neither reflexive (every line intersects itself) nor transitive.
iv) This is an equivalence relation, whose equivalence classes are well represented by the (unoriented) lines through the origin.
v) The relation is clearly not reflexive. Nor is it transitive.
vi) In this case the answer is positive, and we leave the verification of the details to the reader. □

1.F.2. Consider the set A = {0, 1, 2, 3} and let R be the relation on A defined by

R = {(0, 0), (0, 1), (0, 3), (1, 0), (1, 1), (2, 3), (3, 3)}.

Show that R is not an equivalence relation by explaining which of the defining properties of an equivalence relation are not satisfied.

Solution. It is easy to see that R is not reflexive, since 2 ∈ A but (2, 2) ∉ R. It is not symmetric, since for example (2, 3) ∈ R but (3, 2) ∉ R. Finally, it is not transitive, since (1, 0) and (0, 3) both belong to R but (1, 3) ∉ R. □

1.F.3. Over the set A = {a, b, c, d} consider the relation R1 = {(a, a), (b, b), (c, c), (d, d), (b, a), (b, c), (b, d)}. Is R1 an equivalence relation?

Solution. Since (k, k) ∈ R1 for any k ∈ A, the relation is reflexive. It is also transitive. However, it is not symmetric, since for example (b, a) ∈ R1 but (a, b) ∉ R1. Hence R1 is not an equivalence relation. □

1.F.4. Verify the result of the previous task in Sage. ⃝

1.F.5. Let the relation R be defined on R² such that ((a, b), (c, d)) ∈ R for arbitrary a, b, c, d ∈ R if and only if b = d. Determine whether or not this is an equivalence relation. In the positive case, describe geometrically the partitioning determined by R. ⃝

1.F.6. Present the domain D and the image I of the relation R = {(a, v), (b, x), (c, x), (c, u), (d, v), (f, y)} defined between the sets A = {a, b, c, d, e, f} and B = {x, y, u, v, w}. Is the relation R a mapping? Next attempt to solve the problem using Sage. ⃝

1.F.7. Determine for each of the following relations on the set {a, b, c, d} whether it is an ordering and whether it is complete:

R1 = {(a, a), (b, b), (c, c), (d, d), (b, a), (b, c), (b, d)},
R2 = {(a, a), (b, b), (c, c), (d, d), (d, a), (a, d)},
R3 = {(a, a), (b, b), (c, c), (d, d), (a, b), (b, c), (b, d)}.

In other words, the recursive definition itself gives the relation n ≤ n + 1, and transitivity then gives n ≤ k for all k obtained later in this manner. This ordering of the positive integers or natural numbers (the number a is strictly smaller than b if a ∈ b) obviously has the following striking property: every nonempty subset of N or Z+ has a smallest element.

1.6.6. Integers and rational numbers. With the set N of positive integers together with zero, we can always add two numbers together. Also, adding zero to a number does not change it. We can also define subtraction, but the result does not always belong to N. The basic idea of the construction of the integers from the natural numbers or positive integers is to add to N these missing results. This can be done as follows: instead of subtraction, we work with ordered pairs of numbers (a, b) representing the desired result a − b. It just remains to define which such pairs are equivalent (with respect to the result of subtraction). The necessary relation is then:

(a, b) ∼ (a′, b′) ⟺ a − b = a′ − b′ ⟺ a + b′ = a′ + b.

Note that the expression in the middle equation may not belong to N, but the expression on the right always does. It is easy to check that it really is an equivalence, and we denote its classes as the integers Z.
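The construction just described is easy to play with in code. A small sketch (our own illustration, runnable in Sage or Python) representing an integer by a pair (a, b) that stands for the would-be difference a − b:

def equivalent(p, q):
    # (a, b) ~ (c, d) exactly when a + d = c + b
    (a, b), (c, d) = p, q
    return a + d == c + b

def add(p, q):
    # [(a, b)] + [(c, d)] = [(a + c, b + d)]
    return (p[0] + q[0], p[1] + q[1])

print(equivalent((2, 5), (0, 3)))   # True: both pairs represent -3
print(add((2, 5), (7, 1)))          # (9, 6), i.e. the class of 3

Independence of the choice of representatives can be tested in the same way.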
We define addition and subtraction on Z using representatives. For instance, [(a, b)] + [(c, d)] = [(a + c, b + d)], which is clearly independent of the choice of representatives. It is always possible to choose the representative (a, 0) for a natural number a, and the representative (0, a) for a negative number −a. This is probably the simplest choice.

Next, we define multiplication of integers similarly to addition (i.e., using representatives we aim at a representative of (a − b) · (c − d)):

(a, b) · (c, d) = (ac + bd, bc + ad).

This is clearly commutative and obviously does not depend on the choice of representatives. Moreover, choosing positive or negative multiplicands with their special representatives as above, we get just the standard multiplication in N, taking proper care of the signs.

Now it is easy to see that we have all the properties (CG1)–(CG4) and (R1)–(R4), see the paragraph 1.1.1. For multiplication, the neutral element is one, but no integer a other than ±1 has an inverse, that is, an integer a−1 with the property a · a−1 = 1. Thus, for multiplication, we are missing the inverse elements. However, the property of the integral domain (ID) holds. This means that if the product of two integers equals zero, then at least one of them has to be zero. We can construct the rational numbers Q by adding all the missing multiplicative inverses, by a method analogous to the construction of Z from N.

Solution. The relation R1 is an ordering, which is not complete (for instance, neither (a, c) ∈ R1 nor (c, a) ∈ R1). The relation R2 is not antisymmetric, as both (a, d) ∈ R2 and (d, a) ∈ R2 while a ≠ d; therefore it is not an ordering. The relation R3 does not define an ordering either, since it is not transitive (for instance (a, b), (b, c) ∈ R3, but (a, c) ∉ R3). □

1.F.8. In the following three figures, icons are connected with lines as people in different parts of the world could have assigned them. Determine whether the connection is a mapping, and whether it is injective, surjective or bijective.

Solution. In the first figure the connection is a mapping which is surjective but not injective, because both the snake and the spider are labeled as poisonous. The second figure is not a mapping but only a relation, since the dog is labeled both as a pet and as a meal. The third connection is again a mapping. This time it is neither injective nor surjective. □

1.F.9. Determine whether or not the mapping f is injective (one-to-one) or surjective (onto), when f is given by:
(a) f : Z × Z → Z, f((x, y)) = x + y − 10x²;
(b) f : N → N × N, f(x) = (2x, x² + 10). ⃝

1.F.10. Determine the number of equivalence relations on the set A = {1, 2, 3, 4}.

Solution. We divide the sought equivalences according to the types of the corresponding partitions (given by the number and cardinalities of the equivalence classes), and we count the number of partitions of each type:

type of partition     number of partitions of this type
1,1,1,1               1
2,1,1                 (4 choose 2) = 6
2,2                   (1/2)·(4 choose 2) = 3
3,1                   (4 choose 1) = 4
4                     1

Thus in total we have 15 different equivalences. □

1.F.11. Determine the number of orderings of a four-element set.

Solution. We will consider all possible Hasse diagrams of orderings on a four-element set M. For each Hasse diagram we count how many different orderings (recall that an ordering is a subset of the set M × M) it yields. See the diagram (the figure with the Hasse diagrams is omitted here):
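Since an ordering on a four-element set M is just a reflexive, antisymmetric and transitive subset of M × M, the total found in 1.F.11 can also be double-checked by brute force over all 2¹² choices of off-diagonal pairs. A sketch of our own (a verification, not part of the original solution):

from itertools import combinations

M = range(4)
diagonal = {(a, a) for a in M}
off = [(a, b) for a in M for b in M if a != b]

def is_ordering(R):
    if any((b, a) in R for (a, b) in R if a != b):   # antisymmetry
        return False
    # transitivity
    return all((a, d) in R for (a, b) in R for (c, d) in R if b == c)

count = 0
for k in range(len(off) + 1):
    for extra in combinations(off, k):
        if is_ordering(diagonal | set(extra)):
            count += 1
print(count)   # 219

The output agrees with the total of 219 orderings stated at the end of the solution.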
On the set of all ordered pairs (p, q), q ≠ 0, of integers, we define a relation ∼ so that it models our expectation of the fractions p/q:

(p, q) ∼ (p′, q′) ⟺ p/q = p′/q′ ⟺ p · q′ = p′ · q.

Again, we are not able to formulate the expected behaviour via the middle equation when we work in Z, but for the equation on the right this is indeed possible. This relation is a well-defined equivalence (think it through!). If we formally write p/q instead of the pairs (p, q), we can define the operations of multiplication and addition by the well-known formulas

p/q · r/s = pr/qs,
p/q + r/s = ps/qs + qr/qs = (ps + qr)/qs.

1.6.7. Remainder classes. Another example of equivalence classes is the remainder classes of integers. For a fixed natural number k we define an equivalence ∼k so that two numbers a, b ∈ Z are equivalent if they have the same remainder when divided by k. The resulting set of equivalence classes is denoted by Zk. This procedure is simplest for k = 2. This yields Z2 = {[0], [1]}, where zero stands for the even numbers and one for the odd numbers. It is easy to see that addition and multiplication can be defined using representatives.

Remainder class rings and fields

Theorem. The set of remainder classes Zk is always a commutative ring of scalars. It is a commutative field of scalars (that is, the property (F) from the paragraph 1.1.1 is also satisfied) if and only if k is a prime. If k is not prime, then Zk contains a divisor of zero, thus it is not an integral domain.

Proof. The second part is easy to see: if x · y = k for natural numbers x, y, then the result of multiplying the corresponding classes [x] · [y] is zero. On the other hand, if x and k are relatively prime, then according to the Bezout equality (which we derive later, see 11.1.2) there are integers a and b satisfying ax + bk = 1, which for the corresponding equivalence classes gives [a] · [x] + [0] = [a] · [x] = [1], and thus [a] is the inverse element to [x]. □

In total, there are 219 orderings on a four-element set. □

Many combinatorics problems involve relations, and examples can be found in the final section with additional material (see Section G).

G. Additional exercises for the whole chapter

We will now present supplementary material related to the concepts discussed so far. We begin by reviewing some key equalities and inequalities that appear throughout the book, due to their versatility in various applications.

A) Material on numbers and functions

1.G.1. (Basic arithmetic and the triangular numbers) Recall from high school that the principle of mathematical induction can be used to prove the following in just a few steps:

∆n := 1 + 2 + · · · + n = n(n + 1)/2, n ∈ Z+.

The numbers ∆n are called triangular numbers. Provide a shorter proof of this summation without using induction, and illustrate your method with a figure. Next, show that the number of distinct handshakes or wine glass clinks among n people equals the triangular number ∆n−1. Finally, prove that ∆n − ∆n−1 = n and ∆n + ∆n−1 = n² for any n. ⃝

1.G.2. (Average speed) A bicyclist ascends a hill at 25 km/h, and descends the same hill at 75 km/h. Find the bicyclist’s average speed for the entire trip (it turns out that the length of the hill is not needed for the calculation). ⃝
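The dichotomy in the theorem on remainder classes in 1.6.7 (Zk is a field if and only if k is prime) is easy to probe in Sage; the following small cell of our own lists, for each k, whether Zk is a field together with its nonzero non-units:

for k in [2, 3, 4, 5, 6, 7, 12, 13]:
    R = Zmod(k)        # the ring Z_k of remainder classes
    zero_divisors = [a for a in R if a != 0 and not a.is_unit()]
    print(k, R.is_field(), zero_divisors)

For composite k the last list is non-empty: in the finite ring Zk the nonzero non-units are exactly the divisors of zero.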
1.G.3. The arithmetic–geometric mean inequality (AM-GM inequality).
(a) For any two non-negative real numbers a, b prove the inequality

√(ab) ≤ (a + b)/2, or equivalently (ab)^{1/2} ≤ (a + b)/2, (†)

with equality if and only if a = b (recall that √(ab) is called the geometric mean of a, b, while (a + b)/2 is the arithmetic mean of a, b). Next use (†) to prove that for any four non-negative real numbers a, b, c, d the following inequality holds:

(abcd)^{1/4} ≤ (a + b + c + d)/4, (‡)

with equality if and only if a = b = c = d.
(b) Present a proof using mathematical induction for the general arithmetic–geometric mean inequality

(a1 · a2 · . . . · an)^{1/n} ≤ (a1 + a2 + · · · + an)/n,

with equality if and only if a1 = a2 = . . . = an, where the ai are non-negative real numbers for all i = 1, . . . , n.

Solution. (a) There are numerous alternative proofs of the AM-GM inequality. For instance, take the square of both sides of the relation given in (†). We obtain ab ≤ (a + b)²/4, which simplifies to 4ab ≤ a² + 2ab + b². This is equivalent to stating that (a − b)² ≥ 0. This last inequality is obviously true, as the square of a real number is non-negative. Moreover, equality holds if and only if a = b. Since all steps are reversible, the statement is proven.

A more geometric proof can be described using the figure below. It illustrates four rectangles, each of area ab, placed inside a square of area (a + b)². We observe that 4ab does not exceed (a + b)², with the “wasted” area being the shaded region in the figure. By demonstrating that the shaded area is (a − b)², we can conclude that equality holds if and only if a = b.

The second inequality follows directly from (†), when combined with the known properties of powers of real numbers:

(abcd)^{1/4} = ((abcd)^{1/2})^{1/2} = (√(ab) · √(cd))^{1/2} ≤ (√(ab) + √(cd))/2 ≤ ((a + b)/2 + (c + d)/2)/2 = (a + b + c + d)/4,

where (†) is used in both estimates.

(b) To simplify the notation, suppose that Υ(n) stands for the hypothesis that the given inequality is valid for n numbers. Above we proved the result for n = 2, 4, that is, we have confirmed Υ(2) and Υ(4), and in a similar way one can show that the result is valid for n = 8, that is,

(a1 · a2 · . . . · a8)^{1/8} = ((a1 · . . . · a4)^{1/4} · (a5 · . . . · a8)^{1/4})^{1/2} ≤ ((a1 · . . . · a4)^{1/4} + (a5 · . . . · a8)^{1/4})/2 ≤ (a1 + · · · + a8)/8,

using (†) and then (‡). Hence we have confirmed Υ(2), Υ(4) and Υ(8), and one can obtain the statement for any power 2^k with k ≥ 1, that is, verify Υ(2^k) for all integers k ≥ 1 (we leave this verification to the reader).

Let us now explain how Υ(4) can be used to derive Υ(3). Set µ := (a1 + a2 + a3)/3 for the arithmetic mean of some positive real numbers a1, a2, a3. A crucial but simple observation is that µ is also the arithmetic mean of the numbers a1, a2, a3, µ, that is,

(a1 + a2 + a3)/3 = (a1 + a2 + a3 + µ)/4,

or in other words µ = (a1 + a2 + a3 + µ)/4. A proof of this is easy: by definition we have 3µ = a1 + a2 + a3, or equivalently 4µ = a1 + a2 + a3 + µ. A division by 4 then gives the claim. Now we can apply Υ(4) to the numbers a1, a2, a3, µ:

(a1a2a3µ)^{1/4} ≤ (a1 + a2 + a3 + µ)/4 = µ.

Considering the fourth power of both sides of this inequality, we obtain a1a2a3µ ≤ µ⁴, which can be rewritten as a1a2a3 ≤ µ³. Thus, after taking cube roots on both sides, we arrive at the result, i.e., (a1a2a3)^{1/3} ≤ µ. Before describing the general step, keep in mind that a1a2a3µ can be rewritten as a1a2a3µ^{2²−3} and a1 + a2 + a3 + µ = a1 + a2 + a3 + (2² − 3)µ, since 2² − 3 = 1. Now one can use Υ(8) to obtain also Υ(5), Υ(6), and Υ(7), and so forth. However, let us explicitly present the general argument.
Observe that in order to compute Υ(3) from Υ(4), we extended the triple (a1, a2, a3) into the quadruple (a1, a2, a3, µ) = (â1, â2, â3, â4), so that we could apply Υ(4). We will use the analogous technique to prove Υ(n) for all n < 2^k. Hence, the goal is to extend (a1, a2, . . . , an) to (â1, . . . , â_{2^k}), so that we can apply Υ(2^k) to the latter tuple. A question that may arise is how to perform this extension, and the case n = 3 provides the necessary motivation. Indeed, consider the following:

âj := aj for j = 1, 2, . . . , n, and âj := µ := (a1 + a2 + . . . + an)/n for j = n + 1, . . . , 2^k

(for n = 3 this formula yields the expressions used above, and in particular we have k = 2 for this case). Apply now Υ(2^k) to (â1, . . . , â_{2^k}). This gives

(a1 · · · an µ^{2^k − n})^{1/2^k} ≤ (a1 + · · · + an + (2^k − n)µ)/2^k = µ.

Obviously (µ^{2^k − n})^{1/2^k} = µ · µ^{−n/2^k}, therefore the previous formula can also be written as

(a1 · · · an µ^{2^k − n})^{1/2^k} = (a1 · · · an)^{1/2^k} · µ · µ^{−n/2^k} ≤ µ.

Thus we can eliminate µ from both sides, i.e., we have proved so far that (a1 · · · an)^{1/2^k} µ^{−n/2^k} ≤ 1. Raising both sides of this inequality to the power 2^k/n yields the result:

(a1 · · · an)^{1/n} µ^{−1} ≤ 1, or equivalently (a1 · · · an)^{1/n} ≤ µ.

This completes the main proof, and the equality case can be easily derived. □

The proof given above goes back almost two centuries and is due to the French mathematician A.-L. Cauchy (1789–1857). Note that this is a special kind of mathematical induction, often referred to as “forward-backward” induction. This is because we first derived the result for all powers of 2, and next proceeded with another induction to establish it for all positive integers. As mentioned, the AM-GM inequality admits many different proofs, most of them based on induction. There are also methods arising from calculus, which allow for shorter proofs. For example, much later in Chapter 8, we will provide a proof of the AM-GM inequality in terms of the so-called Lagrange multipliers.

1.G.4. Bernoulli inequality. For any x ∈ R with x > −1 one can easily prove, based on mathematical induction, that (1 + x)ⁿ ≥ 1 + nx for any n ∈ N, see also 5.4.1. This is the so-called Bernoulli inequality. Prove that the Bernoulli inequality yields the AM-GM inequality.

Solution. For n = 1 we have nothing to prove, so we may assume that n ≥ 2. Setting y = x + 1 with x > −1, so that y > 0, the Bernoulli inequality takes the form

yⁿ ≥ 1 + n(y − 1). (∗)

Now, to simplify the description, let us set A(n) = (a1 + · · · + an)/n and G(n) = (a1 · . . . · an)^{1/n}, so that the AM-GM inequality simply reads G(n) ≤ A(n). Obviously A(n)/A(n − 1) > 0, and as y we may consider the ratio A(n)/A(n − 1) with n ≥ 2. Then, using the relation (∗), we deduce that

(A(n)/A(n − 1))ⁿ ≥ 1 + n(A(n)/A(n − 1) − 1) = (A(n − 1) + nA(n) − nA(n − 1))/A(n − 1) = an/A(n − 1),

and this can be equivalently rephrased as A(n)ⁿ ≥ an A(n − 1)^{n−1}. Using this inequality successively, we obtain

A(n)ⁿ ≥ an A(n − 1)^{n−1} ≥ an · an−1 A(n − 2)^{n−2} ≥ . . . ≥ an · an−1 · . . . · a2 A(1)¹ = an · an−1 · . . . · a2 · a1 = G(n)ⁿ,

that is, A(n)ⁿ ≥ G(n)ⁿ for any n ≥ 2. Raising both sides to the power 1/n we obtain the desired A(n) ≥ G(n). □
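A quick numerical sanity check of the AM-GM inequality in Sage or Python, on random data of our own choosing:

import random
from math import prod
for _ in range(5):
    a = [random.uniform(0, 10) for _ in range(6)]
    am = sum(a) / len(a)
    gm = prod(a) ** float(1 / len(a))
    print(am >= gm, am - gm)   # always True; the gap vanishes only for equal entries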
1.G.5. Symbolic computations with Sage and polynomials. In SageMath, for a range of applications such as doing symbolic computations, solving equations, plotting functions, etc., the so-called symbolic variables are available. Symbolic variables are often used to represent “indeterminates”, which are variables that do not have a fixed value. This concept is frequently employed in mathematical problem-solving to allow for general solutions and manipulations, such as simplifying expressions, solving equations, and performing symbolic differentiation and integration (see Chapter 5 and Chapter 6, respectively). By using indeterminates, we can explore mathematical relationships and derive formulas that hold true for any value of the variable, providing a powerful tool for both theoretical and applied mathematics.

On the other hand, polynomials form a fundamental component of algebra and calculus, and symbolic computation provides a powerful toolset for examining their properties and behavior. Let us explain how, using symbolic variables, we can manipulate polynomials easily. First, to establish a symbolic variable we type t = var("t"), or simply var("t"), or

t = SR.var("t"); t

which returns a symbolic variable named t. We stress that Sage does this automatically for the variable x, but not for other variables. Thus, given the cell

y = var("y"); x - y

Sage will understand x − y as an expression of the two variables x, y. In contrast, typing x − y directly, an error will be printed out, declaring that the name “y” is not defined. It is also possible to create more than one symbolic variable in a row. This allows us, for example, to use expressions depending on many variables, or on both variables and parameters, as in the examples below:

x, y, z = var("x, y, z"); x + y + z
t, a, b, c = var("t, a, b, c"); a*t**2 + b*t + c

The cell

var("a, b, c"); f = a*x**2 + b*x + c
print([f.coefficient(x**2), f.coefficient(x)])

returns the coefficients of the terms x² and x, respectively, as [a, b], which can be useful in complicated expressions. Of course, if you want to solve the equation ax² + bx + c = 0 with a > 0, then type

x, a, b, c = var("x, a, b, c"); assume(a>0)
show(solve(a*x**2 + b*x + c == 0, x))

In this case Sage returns the known solutions in terms of a, b, c; check the output yourselves.

An alternative way to introduce symbolic variables is the command x = SR.var("x", n). This creates the symbolic variables x0, . . . , xn−1 for some positive integer n, and each of them can be accessed by typing x[i], for some i ∈ {0, . . . , n − 1}. As an example, type

x = SR.var("x"); a = SR.var("a", 4)
a[0] + a[1]*x + a[2]*x**2 + a[3]*x**3

Executing this cell we obtain a3*x^3 + a2*x^2 + a1*x + a0, hence fixing the parameters a3, a2, a1, a0 we can define a real polynomial of degree three. For such substitutions one may apply the following code:

x = SR.var("x"); a = SR.var("a", 4)
f = a[0] + a[1]*x + a[2]*x**2 + a[3]*x**3
f.subs(a[0] == 6, a[1] == 4, a[2] == 2, a[3] == 1/8)

This prints out the polynomial 1/8*x^3 + 2*x^2 + 4*x + 6, which we may plot by typing (produce the figure in your editor)

plot(1/8*x^3 + 2*x^2 + 4*x + 6, (x, -50, 50))

Notice that to factor a polynomial we may use the command factor:

p(x) = x^5 + x^4 - 8*x^3 + 11*x^2 - 15*x + 2; show(factor(p(x)))

In this case Sage prints out the expression (x⁴ + 3x³ − 2x² + 7x − 1)(x − 2). We will encounter more options for manipulating polynomials in Sage later in this text.
1.G.6. Suppose that P(x) is a polynomial (with constant coefficients) satisfying the following: divided by x + 2 it leaves remainder −1, divided by x + 3 it leaves remainder 1, while divided by x − 3 it leaves remainder 4. Find the remainder υ(x) of the division of P(x) by the polynomial (x + 2)(x + 3)(x − 3).

Solution. Let us write P(x) = q(x) · π(x) + υ(x), where q(x) = (x + 2)(x + 3)(x − 3) is the divisor, π(x) is the quotient and υ(x) the remainder. Since the divisor is of degree 3, the remainder υ(x) is a polynomial of degree at most 2, say υ(x) = ax² + bx + c. By assumption we have P(−2) = −1, P(−3) = 1 and P(3) = 4. Since P(x) = (x + 2)(x + 3)(x − 3) · π(x) + ax² + bx + c, we get the following system of equations:

4a − 2b + c = −1, 9a − 3b + c = 1, 9a + 3b + c = 4.

Solving this system we obtain a = 1/2 = b and c = −2. To verify this, use for example Sage via the block

var("a, b, c")
f1 = 4*a - 2*b + c + 1; f2 = 9*a - 3*b + c - 1; f3 = 9*a + 3*b + c - 4
solve((f1==0, f2==0, f3==0), a, b, c)

□

1.G.7. Find a polynomial P(x) of fourth degree satisfying P(1) = 1, such that P(x + 1) is divisible by (x − 1)² and P(x − 1) is divisible by (x + 1)².

Solution. By assumption, we have

P(x + 1) = (x − 1)² · π1(x), and P(x − 1) = (x + 1)² · π2(x),

where π1(x), π2(x) are the corresponding quotients. Using these equations, one can obtain the relations

P(x) = (x − 2)² · π1(x − 1), and P(x) = (x + 2)² · π2(x + 1),

respectively. This means that P(x) is divisible by (x − 2)² and also by (x + 2)². But then P(x) is also divisible by the product (x − 2)²(x + 2)² = (x² − 4)².13 Now, P(x) is of degree 4, hence the previous conclusion implies that P(x) = c · (x² − 4)² for some constant c ∈ R\{0}. Based on the condition P(1) = 1 we compute c = 1/9, and hence P(x) = (1/9)(x² − 4)². □

1.G.8. Given the polynomial P(x) = x² + x + 2, show that the polynomial Q(x) := P(P(x)) − P(x) − 3 is divisible by P(x) − 1. What is the quotient? ⃝

1.G.9. Prove that the polynomial P(x) of degree 4 satisfying P(0) = 0 and P(x + 1) − P(x) = 4x³ admits x0 = 1 as a double root. ⃝

1.G.10. (a) Factor the polynomial p(x) = 4x⁴ − 8x³ − 3x² + 7x − 2 using Horner’s scheme and indicate the multiplicities of its roots. Verify your answer via Sage and present the graph of p(x) for x ∈ [−2, 2].
(b) Consider the polynomials p1(x) = x⁴ + 2x³ + x² + x + 12, p2(x) = 2x⁵ + x⁴ − 53x³ − 7x + 3, q1(x) = x + 2 and q2(x) = x² − 5x + 1. Show that the Euclidean divisions of pi(x) by qi(x) for i = 1, 2 satisfy υ2(x) = −62x + υ1(x) and moreover π2(x) = 2π1(x) + 11x² − 2x − 9, where we write pi(x) = qi(x)πi(x) + υi(x) with i = 1, 2.

Solution. (a) Dividing the divisors of 2 by the divisors of 4, one gets a list of all possible rational roots. In particular, we see that ρ1 = −1 is a root: p(ρ1) = 4 + 8 − 3 − 7 − 2 = 0. Thus, using Horner’s table we get

ρ1 = −1 | 4   −8   −3    7   −2
        |     −4   12   −9    2
        | 4  −12    9   −2    0

Hence p(x) = (x − ρ1)(4x³ − 12x² + 9x − 2) = (x + 1)(4x³ − 12x² + 9x − 2). Next we see that ρ2 = 2 is a root of 4x³ − 12x² + 9x − 2, thus using Horner’s method for this polynomial we get

ρ2 = 2 | 4  −12    9   −2
       |      8   −8    2
       | 4   −4    1    0

13 Here we rely on the general fact that if a polynomial P(x) (with constant coefficients) is divisible by the linear factors (x − ρ1), (x − ρ2), . . . , (x − ρk), where ρi ≠ ρj for 1 ≤ i ≠ j ≤ k, then P(x) is also divisible by the product (x − ρ1) · (x − ρ2) · . . . · (x − ρk). The converse is also true.

Thus p(x) = (x − ρ1)(x − ρ2)(4x² − 4x + 1) = (x + 1)(x − 2)(2x − 1)², or equivalently p(x) = 4(x − ρ1)(x − ρ2)(x − ρ3)², where ρ1 = −1, ρ2 = 2 and ρ3 = 1/2.
Note that ρ3 is a double root. In Sage we simply type

p(x) = 4*x^4 - 8*x^3 - 3*x^2 + 7*x - 2; factor(p(x))

which returns (2*x - 1)^2*(x + 1)*(x - 2). In this block one can specify the ring over which the factorization is implemented by typing x = QQ["x"] at the beginning of the code. This defines x as a polynomial variable over the rational numbers. However, in our case this approach yields the same factorization as the default setting. Additionally, by typing p.roots() Sage will return the roots of the polynomial p, along with their multiplicities:

[(2, 1), (-1, 1), (1/2, 2)]  # in the pair (a, b), a denotes a root, b its multiplicity

As for the graph of p(x) for x ∈ [−2, 2], one may add the syntax

plot(4*x^4 - 8*x^3 - 3*x^2 + 7*x - 2, x, -2, 2)

(b) The first division gives

        x³ + x − 1
x + 2 ) x⁴ + 2x³ + x² + x + 12
      −(x⁴ + 2x³)
              x² + x
            −(x² + 2x)
                 −x + 12
               −(−x − 2)
                      14

Thus p1(x) = q1(x)π1(x) + υ1(x), where π1(x) = x³ + x − 1 and υ1(x) = 14 (a constant). For the second division one gets

              2x³ + 11x² − 11
x² − 5x + 1 ) 2x⁵ + x⁴ − 53x³ − 7x + 3
            −(2x⁵ − 10x⁴ + 2x³)
                  11x⁴ − 55x³
                −(11x⁴ − 55x³ + 11x²)
                       −11x² − 7x + 3
                     −(−11x² + 55x − 11)
                            −62x + 14

hence p2(x) = q2(x)π2(x) + υ2(x) with π2(x) = 2x³ + 11x² − 11 and υ2(x) = −62x + 14, respectively. Thus υ2(x) = −62x + υ1(x), and the second relation follows since

              2
x³ + x − 1 ) 2x³ + 11x² − 11
           −(2x³ + 2x − 2)
               11x² − 2x − 9

□

1.G.11. (Polynomial division via SageMath) Given a polynomial p(x) = anxⁿ + an−1x^{n−1} + · · · + a1x + a0 and a number x0, use Sage to implement the division of p(x) by (x − x0), resulting in a quotient polynomial q(x) and a remainder r. Next use your program to confirm that dividing p(x) = 4 − 3x + 5x² − 2x³ + x⁴ by (x − 3) results in the quotient q(x) = 21 + 8x + x² + x³ and the remainder r = 67. ⃝
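Sage's built-in quo_rem offers a convenient cross-check for Euclidean divisions such as those in 1.G.10(b); it is of course not the hand-made implementation requested in 1.G.11, only a way to verify the answers:

R.<x> = QQ[]                       # polynomials over the rationals
p1 = x^4 + 2*x^3 + x^2 + x + 12
print(p1.quo_rem(x + 2))           # (x^3 + x - 1, 14)
p = x^4 - 2*x^3 + 5*x^2 - 3*x + 4
print(p.quo_rem(x - 3))            # (x^3 + x^2 + 8*x + 21, 67)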
The omission of basic features of integers, such as their divisibility and the theory of prime numbers, has been intentional up to this point. It is more practical to explore these topics in detail in Chapter 11, which is entirely dedicated to number theory. For this reason, we delay discussing additional tasks related to the beautiful theory of integers and prime numbers, and instead focus on problems involving complex numbers and their geometry for now.

1.G.12. (True or False?) Answer the following True or False questions, providing a proof to support your statements.
(1) “The expression of the complex number z = (1 + 2i)/(4 + 5i) in the form z = x + iy is given by −6/41 + i·13/41.”
(2) “The ratio (1 + i)^{2023}/(1 − i)^{2020} equals 2i − 2.”
(3) “The product zw of two elements z, w ∈ S¹ on the unit circle S¹ = {z ∈ C : |z| = 1} also belongs to S¹.”
(4) “Any element z ∈ S¹ has an inverse which also belongs to S¹.”
(5) “The cubic equation z³ + i = 0 has three solutions over C, given by z1 = i, z2 = √3/2 − i/2, z3 = −√3/2 − i/2.” ⃝

1.G.13. Consider the complex numbers z1 = 2 + 6i and z2 = −4 − 2i. Suppose that there is another complex number w = x + iy which is at a distance of 2√2 from z2. Determine the positions of w in the plane R², such that the triangle formed by the points P1 = [2, 6], P2 = [−4, −2] and P3 = [x, y] (representing z1, z2 and w, respectively) is a right triangle. In particular, show that there are two possible solutions for w, and illustrate both of them using SageMath. ⃝

1.G.14. Let a, b, c, d ∈ R with ac > 0 and ad = bc. Show that the equation P(x) = 0 for the polynomial P(x) = ax³ + bx² + cx + d has a unique real solution and two purely imaginary solutions.

Solution. The condition ac > 0 implies that a ≠ 0, and so d = bc/a. Therefore we see that

P(x) = ax³ + bx² + cx + bc/a = (1/a)(a²x³ + abx² + acx + bc) = (1/a)(ax²(ax + b) + c(ax + b)) = (1/a)(ax + b)(ax² + c).

As a consequence, the equation P(x) = 0 is equivalent to (ax + b)(ax² + c) = 0, that is, x = −b/a or x² = −c/a < 0. So we have the real solution x0 = −b/a and two purely imaginary ones given by x1,2 = ±i√(c/a). □

Euler’s identity for complex numbers states that any complex number z = r(cos(φ) + i sin(φ)), where r = |z| and φ is the argument of z, can be expressed as z = r e^{iφ}. Euler’s identity thus provides an alternative to the polar form of z. For example, given z ∈ S¹ we get z = cos(φ) + i sin(φ) = e^{iφ}, with |z| = 1. It is worth noting that the exponential e^{iy} of an imaginary number iy is given by the power series Σ_{n=0}^{∞} (iy)ⁿ/n!. We will present a proof of this claim later, in Section 5.4.12, which discusses the relationship between exponentials and trigonometric functions. However, it is useful to get an initial understanding of this concept at this point.

1.G.15. Based on Euler’s identity, provide an answer to the following tasks:
(a) Express e^{iπ/2}, e^{iπ}, and e^{i2π} in algebraic form.
(b) Show that cos(φ) = (e^{iφ} + e^{−iφ})/2 and sin(φ) = (e^{iφ} − e^{−iφ})/(2i), respectively. Deduce that tan(φ) = −i (e^{iφ} − e^{−iφ})/(e^{iφ} + e^{−iφ}).
(c) Show that |e^{iφ}| = 1.
(d) Using the formula e^{i(α+β)} = e^{iα} e^{iβ}, prove the following classical trigonometric identities:

cos(α + β) = cos(α) cos(β) − sin(α) sin(β),
sin(α + β) = sin(α) cos(β) + cos(α) sin(β).

Solution. (a) Using Euler’s identity we obtain

e^{iπ/2} = cos(π/2) + i sin(π/2) = i, e^{iπ} = cos(π) + i sin(π) = −1, e^{i2π} = cos(2π) + i sin(2π) = 1.

(b) We have e^{iφ} = cos(φ) + i sin(φ) and cos(φ) − i sin(φ) = e^{−iφ}, the complex conjugate of e^{iφ} (see also 5.4.12). By adding these two equations and solving for cos(φ), we derive the formula for cos(φ). In a similar way (subtracting and dividing by 2i) we obtain the formula for sin(φ), and the final claim follows in combination with the usual definition of the trigonometric function tan, that is, tan(φ) := sin(φ)/cos(φ).
(c) Indeed, |e^{iφ}| = |cos(φ) + i sin(φ)| = √(cos²(φ) + sin²(φ)) = √1 = 1.
(d) We see that

e^{i(α+β)} = e^{iα} e^{iβ} = (cos(α) + i sin(α)) · (cos(β) + i sin(β)) = (cos(α) cos(β) − sin(α) sin(β)) + i (sin(α) cos(β) + cos(α) sin(β)).

On the other hand, e^{i(α+β)} = cos(α + β) + i sin(α + β). By comparing the real and imaginary parts of these two expressions we obtain the given formulas. □

1.G.16. (a) Show that the following equation does not admit a real solution:

(1 + iz)^{2025} = (2025 + i)/(1 + i·2025), z ∈ C. (∗)

(b) Consider the complex number w = (1/2)(z + 1/z) with z ∈ C such that arg(z) = φ ≠ kπ for all k ∈ Z and Im(w) = 0. Show that |z| = 1, i.e., z lies on the unit circle.

Solution. (a) Suppose on the contrary that the equation (∗) has a real solution z = x ∈ R, that is, (1 + ix)^{2025} = (2025 + i)/(1 + 2025i). Taking absolute values of both sides we have the following equivalent relations:

|1 + ix|^{2025} = |2025 + i|/|1 + 2025i| ⟺ (√(1 + x²))^{2025} = √(2025² + 1)/√(1 + 2025²) = 1 ⟺ √(1 + x²) = 1 ⟺ 1 + x² = 1 ⟺ x = 0.

However, x = 0 cannot be a solution of the equation (∗), as follows from the computation below:

1^{2025} = (2025 + i)/(1 + i·2025) ⟺ 1 + i·2025 = 2025 + i,

a contradiction. Hence the given equation does not admit a real solution.
(b) By assumption, φ is the (principal) argument of z, and we may use the polar form z = |z|(cos φ + i sin φ).
Thus we can write

w = (1/2)( |z|(cos φ + i sin φ) + (1/|z|)·1/(cos φ + i sin φ) )
  = (1/2)( |z|(cos φ + i sin φ) + (1/|z|)(cos(−φ) + i sin(−φ)) )
  = (1/2)( |z|(cos φ + i sin φ) + (1/|z|)(cos(φ) − i sin(φ)) )
  = (1/2)( |z| + 1/|z| ) cos φ + i (1/2)( |z| − 1/|z| ) sin φ.

Here we have used that the complex number e^{iφ} = cos φ + i sin φ satisfies (see also the previous problem)

1/e^{iφ} = e^{−iφ} = cos(−φ) + i sin(−φ) = cos(φ) − i sin(φ).

Thus the requirement Im(w) = 0 (in combination with the condition φ ≠ kπ, which guarantees sin φ ≠ 0) gives

(1/2)( |z| − 1/|z| ) sin φ = 0 ⟺ |z| − 1/|z| = 0 ⟺ |z|² = 1 ⟺ |z| = 1. □

1.G.17. Roots of unity and regular n-polygons. (a) Prove that the cube roots of 1 are the vertices of an equilateral triangle inscribed in the unit circle S¹ (in the plane), with side length √3. Verify your computation via Sage.
(b) As a generalization, deduce that for any positive integer n the equation zⁿ = 1, where z ∈ C, has n roots z1, z2, . . . , zn, called the nth roots of unity. Prove that these solutions form the vertices of a regular n-polygon in the plane.

Solution. (a) We are interested in the solutions of the equation z³ = 1, where z ∈ C. By the fundamental theorem of algebra we expect three solutions over C. To find them explicitly, we can use de Moivre’s formula, which provides an easy way of solving equations of the form zⁿ = z0. Indeed, in polar form the equation z³ = 1 is expressed as r³(cos(3φ) + i sin(3φ)) = 1·(cos 0 + i sin 0). Verify by simple computations that this equation is satisfied if and only if r = 1 and 3φ = 0 (modulo 2π). This means that the three solutions are of the form z0 = cos 0 + i sin 0, z1 = cos(2π/3) + i sin(2π/3), and z2 = cos(4π/3) + i sin(4π/3), or in algebraic form

z0 = 1, z1 = −1/2 + i√3/2, z2 = −1/2 − i√3/2.

An appropriate Sage cell that gives directly all three solutions in algebraic form reads as

z = var("z"); solve(z**3 - 1 == 0, z)

It is easy to verify that |zi| = 1 for any i = 0, 1, 2, so each of them lies on the unit circle S¹, see the figure below. In addition, we see that z1² = z̄1 = z2. It follows that the coordinate points of z0, z1, z2 form an equilateral triangle with sides of length √3, as we can see in the figure (for instance, the points z1, z2 are conjugate to each other, so their distance is 2·(√3/2) = √3).

(b) The first claim follows from the fundamental theorem of algebra. Now, as in the case n = 3, de Moivre’s formula provides the arguments of the roots. In particular, we see that the argument multiplied by n has to be a multiple of 2π. Moreover, the absolute value has to be one, so the roots are of the form

zk = cos(2kπ/n) + i sin(2kπ/n), k = 0, . . . , n − 1.

Recall now that polygons are geometric objects in the plane composed of points and line segments connected together to close and form a single shape. A regular polygon is a polygon having equal angles and sides of equal length. For the roots derived above we see that the arguments of any two consecutive roots differ by 2π/n. Moreover, the absolute value of any root equals 1. Together these claims prove that the points in question are the vertices of a regular n-polygon. □

1.G.18. Polygons via Sage. In Sage one can apply the following cell to produce the second figure given above (this already includes the code producing the first, simpler figure).
n = 3; A = [exp(2*pi*I*k/n) for k in range(n)]
a = plot([], xmin=-1, xmax=1, ymin=-1, ymax=1, figsize=(4,4), rgbcolor=(1/4,1/8,3/4))
a += list_plot(A, color="black", size=70, xmin=-1, xmax=1)
a += circle((0.0, 0.0), 1, color="black")
a += line([(-1/2, (sqrt(3)/2)), (-1/2, -(sqrt(3)/2))], color="black", linestyle="--")
a += line([(-1/2, (sqrt(3)/2)), (1, 0)], color="black", linestyle="--")
a += line([(1, 0), (-1/2, -(sqrt(3)/2))], color="black", linestyle="--")
a += text("$z_2$", (-0.65, (sqrt(3)/2)+0.05), color="black", fontsize="16")
a += text("$z_3$", (-0.65, -(sqrt(3)/2)-0.05), color="black", fontsize="16")
a += text("$z_1$", (1.10, 0.07), color="black", fontsize="16")
show(a)

In fact, the commands in the first five lines of the previous code, with the only difference being the initial n, can produce similar plots for any favourite n (though for large n Sage needs some time to answer). Below we present the figures for n = 5, 15, 25, 50.

Finally, it is also possible to produce the regular polygons directly, using the command polygon2d. Let us present a Sage cell adapted to the case n = 6.

n = 6; A = [exp(2*pi*I*k/n) for k in range(n)]
a = plot([], xmin=-1, xmax=1, ymin=-1, ymax=1, figsize=(4,4))
a += list_plot(A, color="black", size=10, xmin=-1, xmax=1)
a += circle((0.0, 0.0), 1, color="black")
L = [[cos(pi*i/3), sin(pi*i/3)] for i in range(6)]
b = polygon2d(L, color="lightgrey")
show(a + b)

This produces the hexagon.

1.G.19. Enumerate all ordered pairs (x, y) ∈ R × R satisfying the relation (x + iy)^{2024} = x − iy.

Solution. Setting z = x + iy, the given relation becomes z^{2024} = z̄. Recall that z·z̄ = |z|², and by mathematical induction (or by de Moivre’s formula) one can prove that |zⁿ| = |z|ⁿ for any n ∈ Z. Thus, taking absolute values in the relation z^{2024} = z̄, we compute

|z|^{2024} = |z^{2024}| = |z̄| = |z| ⟺ |z|(|z|^{2023} − 1) = 0.

From this equation we see that either |z| = 0 or |z| = 1. The first condition implies that (x, y) = (0, 0). If |z| = 1, then using the relation z^{2024} = z̄ we see that z^{2025} = z·z^{2024} = z·z̄ = |z|² = 1. However, the equation z^{2025} = 1 has 2025 distinct roots. Hence, all together we can enumerate 2025 + 1 ordered pairs (x, y) ∈ R × R satisfying the given condition. □

B) Material on difference equations

1.G.20. Tilings. For Problem 1.B.4 on 2-compositions of natural numbers and for the given solution, provide a visual interpretation.

Solution. To provide a visual presentation of Cn (and so also of the Fibonacci numbers), one may think of the 1s as squares and the 2s as dominoes. Then Cn counts the distinct ways to tile a board of length n with squares and dominoes, and the result proved in 1.B.4 means that we have Fn+1 ways to do so. Below we illustrate the idea for n = 4 and n = 5. For n = 4 one has the 2-compositions 1 + 1 + 1 + 1, 1 + 1 + 2, 1 + 2 + 1, 2 + 1 + 1, 2 + 2, that is, C4 = 5 = F5. Notice that C4 = C3 + C2 = 3 + 2. For n = 5 we can form 8 such decompositions, i.e., C5 = 8 = F6, as we can also see in the figure below (notice that C5 = C4 + C3 = 5 + 3). □
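The tiling interpretation can be checked by a few lines of code: the recursion below mirrors the first tile being either a square or a domino (the function name is our own):

from functools import lru_cache

@lru_cache(maxsize=None)
def C(n):
    # number of tilings of a 1 x n board by squares and dominoes
    if n in (0, 1):
        return 1
    return C(n - 1) + C(n - 2)

print([C(n) for n in range(1, 9)])   # [1, 2, 3, 5, 8, 13, 21, 34]

In particular C(4) = 5 and C(5) = 8, matching the decompositions listed above.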
1.G.21. Basic properties of Fibonacci numbers. Show that the Fibonacci numbers Fn satisfy the following:
(a) the sum of the first n Fibonacci numbers equals Fn+2 − 1;
(b) F1² + F2² + · · · + Fn² = Fn Fn+1;
(c) Fm+n = Fn−1 Fm + Fn Fm+1;
(d) Fn−1 Fn+1 − Fn² = (−1)ⁿ.

Solution. (a) We have F1 = F3 − F2, F2 = F4 − F3, . . . , Fn−1 = Fn+1 − Fn and Fn = Fn+2 − Fn+1. Adding these relations (side by side), we get F1 + F2 + · · · + Fn = Fn+2 − F2, and the claim follows.
(b) First notice that using the recurrence relation of the Fibonacci numbers, one may extend these numbers to zero and negative indices; here we need to know that F0 = F2 − F1 = 0. Now, since we have Fk Fk+1 − Fk−1 Fk = Fk(Fk+1 − Fk−1) = Fk², we obtain the relations

F1² = F1F2, F2² = F2F3 − F1F2, · · · , Fn² = FnFn+1 − Fn−1Fn.

Adding these relations (side by side) we arrive at the desired formula.
(c)+(d) The last two cases can be proved by induction on n; we shall prove only (d), which is known as the Cassini identity. For n = 1 it becomes F0F2 − F1² = 0 · 1 − 1² = −1 = (−1)¹, which is true. The induction hypothesis says that the result is true for some arbitrary positive integer k, which means that

Fk−1 Fk+1 − Fk² = (−1)ᵏ. (∗)

Then, for n = k + 1 we compute

Fk Fk+2 − Fk+1² = (Fk+1 − Fk−1)(Fk + Fk+1) − Fk+1²
= Fk+1Fk + Fk+1² − Fk−1Fk − Fk−1Fk+1 − Fk+1²
= Fk+1Fk − Fk−1Fk − (Fk² + (−1)ᵏ)
= Fk+1Fk − Fk(Fk−1 + Fk) + (−1)ᵏ⁺¹
= Fk+1Fk − FkFk+1 + (−1)ᵏ⁺¹ = (−1)ᵏ⁺¹,

where we have applied repeatedly the definition of the Fibonacci numbers and, in the third equality, the hypothesis (∗). This means that the claim is true for n = k + 1, and hence it is valid for all n ≥ 1. □

Recurrence relations have a wide range of applications across various fields, often in surprising ways. For example, linear recurrences naturally arise in simple geometric problems, requiring creative problem-solving skills. For your convenience, we have included some examples of these problems below. In Chapter 3 we will cover difference equations again, this time with more advanced applications, expanding upon the introduction given in this chapter.

1.G.22. Consider n lines dividing a plane into regions. Find the maximum number of regions that are formed.

Solution. Let us denote the number of regions by pn. If there is no line in the plane, then the whole plane is one region and p0 = 1. If there are n lines, then adding an additional line increases the number of regions exactly by the number of regions that the new line intersects. If no lines are parallel and no three lines intersect at the same point, then the number of regions intersected by the (n + 1)th line is one more than the number of intersections it has with the existing n lines. Each intersection divides an existing region into two, resulting in an increase of one region for each intersection. The new line can have at most n intersections with the existing n lines. Each segment of the line between two intersections crosses exactly one region, so the new line can cross at most n + 1 regions. Thus we obtain the recurrence relation pn+1 = pn + (n + 1), with p0 = 1. We can derive an explicit formula for pn via Corollary 1.2.3, which we leave to the reader. A more straightforward approach relies on the observation that pn = pn−1 + n. This gives

pn = pn−1 + n = pn−2 + (n − 1) + n = pn−3 + (n − 2) + (n − 1) + n = . . . = p0 + Σ_{i=1}^{n} i = 1 + n(n + 1)/2 = (n² + n + 2)/2.

Or we can use Sage, as we did in 1.B.3:

from sympy import Function, rsolve
from sympy.abc import n
p = Function("p")
eq = p(n) - p(n+1) + n + 1
initial = {p(0): 1, p(1): 2}
rsolve(eq, p(n), initial)

□

1.G.23. What is the maximum number of areas that a 3-dimensional space can be divided into by n planes?

Solution. Let rn be the desired number. Obviously, r0 = 1. As in 1.G.22, let us consider n planes in the space. Now, add an additional plane and determine the maximum number of new regions it can create. The maximum number of new regions is precisely the number of regions that the new plane intersects.
What is the maximum number of new regions that can be created? The number of regions intersected by the (n + 1)th plane equals the number of regions created on this new plane by its intersections with the n existing planes. These intersections are lines in the new plane, so according to the previous exercise in the plane, there can be at most (n² + n + 2)/2 such regions. Therefore, we obtain the recurrence formula

rn+1 = rn + (n² + n + 2)/2.

This equation can be solved directly as follows:

rn = rn−1 + ((n−1)² + (n−1) + 2)/2 = rn−1 + (n² − n + 2)/2
   = rn−2 + ((n−1)² − (n−1) + 2)/2 + (n² − n + 2)/2
   = . . .
   = r0 + (1/2) Σ_{i=1}^{n} i² − (1/2) Σ_{i=1}^{n} i + Σ_{i=1}^{n} 1
   = 1 + n(n + 1)(2n + 1)/12 − n(n + 1)/4 + n = (n³ + 5n + 6)/6.

Above we used the formula Σ_{i=1}^{n} i² = n(n + 1)(2n + 1)/6, which you may want to prove using mathematical induction. □

1.G.24. Find the maximum number of areas that the plane can be divided into by n circles.

Solution. A solution to this geometric problem again relies on a recurrence, which can be described as follows. First observe that the (n + 1)th circle intersects the n existing circles in at most 2n points. For example, see the figure presented at the right-hand side for the case of 2 + 1 circles; the third circle intersects the previous two in 4 points (hence 4 = 2n for n = 2). Thus, for the maximum number pn of areas one can derive the recurrence formula pn+1 = pn + 2n. Clearly p1 = 2. Thus for pn we obtain

pn = pn−1 + 2(n − 1) = pn−2 + 2(n − 2) + 2(n − 1) = · · · = p1 + Σ_{i=1}^{n−1} 2i = n² − n + 2.

Or we may directly solve the recurrence in Sage, as before:

from sympy import Function, rsolve
from sympy.abc import n
p = Function("p")
eq = p(n) - p(n+1) + 2*n
initial = {p(1): 2, p(2): 4}
rsolve(eq, p(n), initial)

□

1.G.25. Determine the maximum number of regions that a 3-dimensional space can be divided into by n balls. ⃝

1.G.26. Find the number of regions formed when n distinct planes intersect at a single point in a 3-dimensional space. ⃝

C) Material on combinatorics

1.G.27. Pascal’s triangle via Sage. Recall that Pascal’s triangle is formed by the binomial coefficients, as discussed in the section at the end of the paragraph 1.3.2. In particular, the entry in the nth row and kth column of Pascal’s triangle is given by the binomial coefficient (n choose k). For example, for n = 6 we get the corresponding illustration (figure omitted here). Note that in the second figure, the colored “diagonals” represent the natural numbers and the triangular numbers, respectively, the latter as illustrated in the first exercise of this section. Present Pascal’s triangle via Sage for n = 9.

Solution. It is sufficient to use the command binomial(n, i) and type the cell:

[[binomial(n, i) for i in range(n+1)] for n in range(9)]

Alternatively, we can give the cell

n = 8
[[binomial(i, j) for j in range(i + 1)] for i in range(n + 1)]

Sage prints out the following:

[[1], [1, 1], [1, 2, 1], [1, 3, 3, 1], [1, 4, 6, 4, 1], [1, 5, 10, 10, 5, 1], [1, 6, 15, 20, 15, 6, 1], [1, 7, 21, 35, 35, 21, 7, 1], [1, 8, 28, 56, 70, 56, 28, 8, 1]]

□

1.G.28. For any fixed n ∈ N, determine the number of all solutions of the equation

x1 + x2 + · · · + xk = n

for the following two cases: (a) over the set Z≥ := {a ∈ Z : a ≥ 0} of non-negative integers; (b) over the set Z+ := {a ∈ Z : a > 0} of strictly positive integers.

Solution. (a) Every solution (r1, . . . , rk)
to the equation Σ_{i=1}^{k} xi = n can be uniquely encoded as a sequence of 1s and separators. In particular, the sequence is constructed by writing r1 1s, followed by a separator, then r2 1s, followed by another separator, and so on. Consequently, this sequence contains exactly n 1s and k − 1 separators. Each such sequence uniquely corresponds to a solution of the given equation. Therefore, the number of solutions is equal to the number of such sequences, and the latter is given by (n + k − 1 choose n).
(b) We now look for solutions in the domain of positive integers. We see that the natural numbers x1, . . . , xk provide a solution of the given equation if and only if the non-negative integers yi = xi − 1, i = 1, . . . , k, form a solution of the equation

y1 + y2 + · · · + yk = n − k.

Using the result obtained in (a), we deduce that there are (n − 1 choose k − 1) of them. □

1.G.29. (Trolleybus) (a) In how many ways can five people be seated in a car for five people, under the assumption that only two of them have a driving licence? (b) In how many ways can 20 passengers and two drivers be seated in a trolleybus for 25 people? ⃝

1.G.30. (Flipping coins – I) We flip a coin six times. (a) How many distinct sequences of heads and tails are there? (b) How many sequences with exactly four heads are there? (c) How many sequences with at least two heads are there? ⃝

1.G.31. (Playing with divisions) Determine the number of all possible ways in which the following can happen:
(a) dividing 40 identical balls among 4 boys;
(b) dividing 33 distinct coins among three people A, B and C such that A and B together have twice as many coins as C;
(c) dividing 9 girls and 6 boys into two groups such that each group contains at least two boys. ⃝

1.G.32. According to quality, we divide food products into groups I, II, III, IV. Determine the number of all possible divisions of 9 food products into these groups, such that the numbers of products in the groups are all distinct.

Solution. If we directly write the considered groups from the elements of I, II, III, IV, we create combinations with repetition of the ninth order from four elements. The number of such combinations is (12 choose 9) = 220. □

1.G.33. (Handshakes) New players meet in a volleyball team (6 people). How many handshakes are there when everybody shakes hands once with everybody else? How many handshakes are there if everybody shakes hands once with each opponent after playing a match? ⃝

1.G.34. We need to accommodate 9 people in one four-bed room, one three-bed room and one two-bed room. In how many ways can this be done?

Solution. We may assign the number 1 to the people in the four-bed room, the number 2 to those in the three-bed room and the number 3 to those in the two-bed room. In this way we create permutations with repetition from the elements 1, 2, 3, where 1 occurs four times, 2 three times and 3 twice. The number of such permutations is 9!/(4!·3!·2!) = 1 260. □

1.G.35. In a long-distance race, where the racers start one after another at given time intervals, there were k racers, among them 3 friends. Determine the number of starting schedules in which no two of the 3 friends start next to each other. For simplicity assume k ≥ 5.

Solution. The remaining k − 3 racers can be ordered in (k − 3)! ways. For the three friends there are then k − 2 places (the start, the end and the k − 4 gaps), where we can place them in v(k − 2, 3) ways. Using the rule of (combinatorial) product, we obtain

(k − 3)! · (k − 2) · (k − 3) · (k − 4) = (k − 2)! · (k − 3) · (k − 4). □
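The stars-and-bars counts of 1.G.28 can be cross-checked by brute-force enumeration for small parameters (the values of n and k below are our own choice); the cell runs in Sage:

from itertools import product

n, k = 6, 3
nonneg = sum(1 for xs in product(range(n + 1), repeat=k) if sum(xs) == n)
positive = sum(1 for xs in product(range(1, n + 1), repeat=k) if sum(xs) == n)
print(nonneg, binomial(n + k - 1, n))     # 28 28
print(positive, binomial(n - 1, k - 1))   # 10 10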
1.G.36. There are 32 participants in a tournament. The organisers have stated that the participants must be divided arbitrarily into four groups, such that the first one has size 10, the second and the third 8, and the fourth 6. In how many ways can this be done?

Solution. We can imagine that from the 32 participants we create a row, where the first 10 form the first group, the next 8 the second group, and so on. On the other hand, there are 32! orderings of all participants. Note that the division into groups is not influenced if we change the order of the people within the same group. Therefore the number of distinct divisions equals 32!/(10!·8!·8!·6!). □

D) Material on probabilities

Next we will provide supplementary material on basic and conditional probability. Probability theory will be explored in depth in Chapter 10, where we will build on this foundational material by introducing additional key results, such as Bayes’ theorem, and discussing its significance in statistics. Additionally, the rich interplay between combinatorics and probability will be examined in Chapter 13, which concludes the book.

1.G.37. For the final exams in the 6th class of all the high schools in a European country, it is known that:
• 25% of the students fail the exams in mathematics;
• 18% of the students fail the exams in physics;
• 10% of the students fail both the exams in mathematics and in physics.
Choose randomly a student belonging to this class.
(a) Find the probability that the student has failed at least one of these two subjects.
(b) Find the probability that the student has failed the exams in mathematics, but not in physics.
(c) If the student has failed the exams in physics, find the probability that he also failed the exams in mathematics.
(d) If the student has failed the exams in mathematics, find the probability that he also failed the exams in physics.

Solution. We define two events: the event M, described by “the candidate has failed the exams in mathematics”, and the event P, described by “the candidate has failed the exams in physics”.
(a) We should compute P(M ∪ P) = P(M) + P(P) − P(M ∩ P), which gives 0.25 + 0.18 − 0.1 = 0.33.
(b) In this case we should compute P(M ∩ Pᶜ), which equals P(M) − P(M ∩ P) = 0.25 − 0.1 = 0.15.
(c) We are looking for the probability P(M|P). By definition, we have

P(M|P) = P(M ∩ P)/P(P) = 0.1/0.18 = 5/9 ≈ 0.56.

(d) In this final case we are looking for the probability P(P|M), which is given by

P(P|M) = P(P ∩ M)/P(M) = 0.1/0.25 = 0.4. □

1.G.38. From ten cards, where exactly one is an ace, we randomly draw a card and put it back. How many times must we do this, so that the probability that the ace is drawn at least once is greater than 0.9?

Solution. Let Ai be the event “at the ith drawing the ace is drawn”. The individual events Ai are independent, so we know that

P(∪_{i=1}^{n} Ai) = 1 − (1 − P(A1)) · (1 − P(A2)) · · · (1 − P(An)), n ∈ N.

We are looking for some n ∈ N satisfying

P(∪_{i=1}^{n} Ai) = 1 − (1 − P(A1)) · (1 − P(A2)) · · · (1 − P(An)) > 0.9.

Clearly, P(Ai) = 1/10 for all i ∈ N. Therefore, it is sufficient to solve the inequality

1 − (9/10)ⁿ > 0.9,

from which one gets n > log_a 0.1 / log_a 0.9, for any base a > 1. Evaluating, we deduce that one must do the drawing at least twenty-two times. In particular, a Sage cell verifying this has the following form:

x = var("x")
ineq = 1 - 0.9**x > 0.9
sol = solve(ineq, x, algorithm="sympy")
print(sol)

Sage prints out x > 21.8543453267828. □
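Alternatively (a small cell of our own), the smallest such n can be computed directly from the equivalent inequality (9/10)ⁿ < 1/10:

m = ceil(log(0.1) / log(0.9))
print(m, (1 - (9/10)^m).n())    # 22 0.901...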
1.G.39. (Playing with dice) (a) Throwing a die eleven times in a row, the result was 4 each time. Determine the probability that the twelfth roll results in 4. (b) Throwing n dice, compute the probability that the values 1, 3 and 6 are not present among the numbers that appear. (c) Throwing two dice, determine the conditional probability that the first die resulted in five, under the condition that the sum is 9. Based on this result, decide whether the events “the first die results in five” and “the sum is 9” are independent. ⃝
1.G.40. We flip a coin five times. For every head, we put a white ball in a hat; for every tail, we put a black ball in the same hat. Compute the probability that the hat contains more black balls than white balls, given that there is at least one black ball in the hat.
Solution. We introduce the following two events:
A − “There are more black than white balls in the hat”;
H − “There is at least one black ball in the hat”.
Notice that the event A has H as a consequence. Denote by $A^c$, $H^c$ the complementary events of A and H, respectively. The goal is to compute the conditional probability P(A|H). The probability $P(H^c)$ is $2^{-5}$, so $P(H) = 1 - 2^{-5}$. Moreover, since five balls can never split evenly, the probability of A is the same as the probability $P(A^c)$ of the complementary event of A, and hence P(A) = 1/2. Furthermore, P(A ∩ H) = P(A), since the event H contains the event A, as mentioned above. This finally gives
$P(A|H) = \frac{P(A \cap H)}{P(H)} = \frac{1/2}{1 - (1/2)^5} = \frac{16}{31}$. □
1.G.41. In a painting exhibition there are 15 paintings, of which 12 are authentic and 3 are copies. A visitor chooses a painting at random and asks the opinion of an art expert about the authenticity of the artwork. The expert gives a correct opinion, for an original painting as well as for a copy, on average 9 times out of 10. If the expert decides that the painting is authentic, find the probability that the expert is right.
Solution. Let us consider the events:
A − “The painting is authentic”;
B − “The expert considers the painting authentic”.
We are looking for the conditional probability P(A|B). By the statement we have P(A) = 12/15 = 4/5 and $P(A^c) = 3/15 = 1/5 = 1 - P(A)$. We also see that the probability of the event B occurring, given that A is true, equals 9/10, i.e., P(B|A) = 9/10. Similarly $P(B^c|A^c) = 9/10$ and moreover $P(B^c|A) = 1/10 = P(B|A^c)$. Thus, by the definition of the conditional probability and the product rule we get
$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) \cdot P(B|A)}{P(B)}$. (∗)
However, by Bayes’ formula (see 10.2.7) we have
$P(B) = P(A) \cdot P(B|A) + P(A^c) \cdot P(B|A^c) = \frac{4}{5} \cdot \frac{9}{10} + \frac{1}{5} \cdot \frac{1}{10} = \frac{37}{50}$.
Thus, in combination with (∗) we arrive at P(A|B) = 36/37. □
1.G.42. A rod of length two meters is randomly divided into three parts. Determine the probability that at least one part is at most 20 cm long.
Solution. A random division of the rod into three parts is given by two cut points x and y (we first cut the rod at distance x from the origin; we do not move it and cut it again at distance y from the origin). The sample space is a square C with side 200 cm (2 meters), see also the figure below. If we place the square C so that two of its sides lie on the axes in the plane, then the condition that at least one part is at most 20 cm determines in the square a subregion O, defined by
O = { (x, y) ∈ C : x ≤ 20 ∨ x ≥ 180 ∨ y ≤ 20 ∨ y ≥ 180 ∨ |x − y| ≤ 20 }.
A straightforward computation shows that this subregion has area 51/100 times the area of the square.
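Indeed (the following check is our addition), the complement of O in C consists of the points with 20 < x, y < 180 and |x − y| > 20, i.e., two congruent right triangles with legs 160 − 20 = 140 cm, and a two-line Sage cell carries out the resulting exact arithmetic:

# The complement of O consists of two right triangles with legs 140,
# so the favourable ratio is 1 - (2 * 140^2 / 2) / 200^2.
square_area = 200^2
complement_area = 2 * (140^2 / 2)
print(1 - complement_area / square_area)  # prints 51/100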
Let us now explain how one can treat this task via Sage. We first use a cell to generate some random points (working in metres, so that a piece is short when it is at most 0.2 m long).
num_points = 100_000  # generate num_points points
x = [RealField().random_element(min=0, max=2) for _ in range(num_points)]
y = [RealField().random_element(min=0, max=2) for _ in range(num_points)]
Next we divide the points into two groups, according to whether at least one piece of the rod is at most 20 cm long or not.
short_rod_cases = []
other_cases = []
for i, j in zip(x, y):
    if i <= 0.2 or i >= 1.8 or j <= 0.2 or j >= 1.8 or abs(i - j) <= 0.2:
        short_rod_cases.append((i, j))
    else:
        other_cases.append((i, j))
Finally we plot these two sets using blue and red colours (as we see in the figure shown above) via the cell
scatter1 = scatter_plot(short_rod_cases, facecolor="blue", marker="o", markersize=10, edgecolor="blue")
scatter2 = scatter_plot(other_cases, facecolor="red", marker="o", markersize=10, edgecolor="red")
show(scatter1 + scatter2, axes=True, aspect_ratio=1, ticks=[0.2, 0.2])
This returns the figure shown above. Recall that we are interested in the ratio of the blue area to the total area of the square. To this end, we can compute its approximate value by typing
len(short_rod_cases) / (len(short_rod_cases) + len(other_cases))
Sage prints out the estimate 0.51018. □
1.G.43. Michael and Alex have lunch at the school canteen, which is open from 11 to 14. Each of them spends 30 minutes at lunch, and the arrival times are random. What is the probability that they meet on a given day, if they always sit at the same table? ⃝
E) Material on plane geometry
1.G.44. Compute the area S of a quadrilateral with vertices A = [0, −2], B = [1, −1], C = [1, 5], and D = [−1, 1]. ⃝
1.G.45. Determine the relative position of the lines p, q in the plane for p : 2x − y − 5 = 0, q : x + 2y − 5 = 0. If they are not parallel, determine the intersection point. ⃝
1.G.46. Determine the sum of the three angles between the vectors (1, 1), (2, 1) and (3, 1), respectively, and the x-axis in the plane $R^2$. ⃝
1.G.47. Compute the area of a parallelogram with vertices at [5, 5], [6, 8] and [6, 9].
Solution. Although such a parallelogram is not uniquely determined (the fourth vertex is not given), the triangle with vertices at [5, 5], [6, 8] and [6, 9] is necessarily half of every parallelogram with these three vertices (one of the sides of the triangle becomes the diagonal of the parallelogram). Therefore the area equals the determinant
$\begin{vmatrix} 6-5 & 6-5 \\ 8-5 & 9-5 \end{vmatrix} = \begin{vmatrix} 1 & 1 \\ 3 & 4 \end{vmatrix} = 1 \cdot 4 - 1 \cdot 3 = 1$. □
1.G.48. Determine the angle φ between the two diagonals $A_3A_7$ and $A_5A_{10}$ of a regular dodecagon $A_0A_1A_2 \dots A_{11}$.
Solution. The angle depends neither on the size nor on the position of the given dodecagon. Consider a dodecagon inscribed in a circle of radius 1, and let $A_0$ be the point [1, 0]. The vertices of the dodecagon can then be represented by the twelfth roots of unity in the complex plane. Thus we can write $A_k = \cos(2k\pi/12) + i\sin(2k\pi/12)$ and hence
$A_3 = \cos(\pi/2) + i\sin(\pi/2) = i \sim [0, 1]$,
$A_5 = \cos(5\pi/6) + i\sin(5\pi/6) = -\frac{\sqrt{3}}{2} + \frac{1}{2}i \sim [-\frac{\sqrt{3}}{2}, \frac{1}{2}]$,
$A_7 = \cos(7\pi/6) + i\sin(7\pi/6) = -\frac{\sqrt{3}}{2} - \frac{1}{2}i \sim [-\frac{\sqrt{3}}{2}, -\frac{1}{2}]$,
$A_{10} = \cos(5\pi/3) + i\sin(5\pi/3) = \frac{1}{2} - i\frac{\sqrt{3}}{2} \sim [\frac{1}{2}, -\frac{\sqrt{3}}{2}]$.
Combining now with the description given in 1.5.7, we deduce that $\cos\varphi = \frac{1}{2\sqrt{2+\sqrt{3}}}$. This gives φ = 75°. □
1.G.49. Consider the matrices
$A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$, $B = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}$, $X = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, $D = \begin{pmatrix} 5 & -b \\ 4 & -4d \end{pmatrix}$, $E = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
with a, b, c, d ∈ R. Solve the matrix equation $X(E - A^2) + BB^T = D$. ⃝ 1.G.50.
Illustrate the expression of a rotation as a composition of shears on an interactive plot, using Sage (use the method from 1.E.18).
Solution. A solution goes as follows:
# import the right objects and define the shear functions
from sage.plot.point import point
from sage.plot.line import line

def shear_transformation(x, y, shear_factor):
    new_x = x + shear_factor * y
    new_y = y
    return new_x, new_y

def shear_vert_transformation(x, y, shear_factor):
    new_x = x
    new_y = y + shear_factor * x
    return new_x, new_y

# create an illustrative interactive plot
@interact
def plot_shear(b=(0.0, 1.0, 0.02)):
    # choose points and lines to show
    points1 = [(0, 0), (1, 1), (1, 2), (2, 3)]
    points2 = [(1, 1), (2, 2), (2, 3), (3, 4)]
    # get the shear parameter a and apply the shears
    a = (sqrt(-b^2 + 1) - 1)/b
    sheared_points11 = [shear_transformation(x, y, a) for x, y in points1]
    sheared_points12 = [shear_transformation(x, y, a) for x, y in points2]
    sheared_points21 = [shear_vert_transformation(x, y, b) for x, y in sheared_points11]
    sheared_points22 = [shear_vert_transformation(x, y, b) for x, y in sheared_points12]
    sheared_points31 = [shear_transformation(x, y, a) for x, y in sheared_points21]
    sheared_points32 = [shear_transformation(x, y, a) for x, y in sheared_points22]
    # prepare the objects for the final plot
    p1 = point(points1, color='blue', size=50)
    p2 = point(points2, color='blue', size=50)
    p_sheared11 = point(sheared_points11, color='red', size=25)
    p_sheared12 = point(sheared_points12, color='red', size=25)
    p_sheared21 = point(sheared_points21, color='green', size=25)
    p_sheared22 = point(sheared_points22, color='green', size=25)
    p_sheared31 = point(sheared_points31, color='red', size=50)
    p_sheared32 = point(sheared_points32, color='red', size=50)
    l1 = line(points1, thickness=2)
    l2 = line(points2, thickness=2)
    l1_sheared1 = line(sheared_points11, thickness=1, color='black')
    l2_sheared1 = line(sheared_points12, thickness=1, color='black')
    l1_sheared2 = line(sheared_points21, thickness=1, color='green')
    l2_sheared2 = line(sheared_points22, thickness=1, color='green')
    l1_sheared3 = line(sheared_points31, thickness=2, color='red')
    l2_sheared3 = line(sheared_points32, thickness=2, color='red')
    # combine the plots and show the result (choose options)
    combined_plot = p1 + p_sheared11 + p_sheared21 + p_sheared31
    combined_plot += p2 + p_sheared12 + p_sheared22 + p_sheared32
    combined_plot += l1 + l1_sheared1 + l1_sheared2 + l1_sheared3
    combined_plot += l2 + l2_sheared1 + l2_sheared2 + l2_sheared3
    combined_plot.show(axes_labels=['x', 'y'], gridlines=True, aspect_ratio=1, figsize=8)
Executing this code (for one chosen value of b = sin θ, e.g., b = 0.3) we obtain the following picture: □
1.G.51. (a) Let ABC be the triangle in the plane with vertices A = [1, 0], B = [4, −2] and C = [2, −3]. Reflect ABC with respect to the x-axis and illustrate the initial triangle and the result of the reflection via Sage.
Solution. On $R^2$ the reflection with respect to the x-axis is the linear mapping $f : R^2 \to R^2$ with $u = (x, y)^T \mapsto f(u) := (x, -y)^T$. Obviously, the x-axis is the fixed-point set of f, which means that f preserves A = [1, 0]. However, it transforms B and C to B′ = [4, 2] and C′ = [2, 3], respectively, see the figure given below. Alternatively, recall that the matrix of f with respect to the standard basis of $R^2$ is $\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$, so that
$f(u) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ -y \end{pmatrix}$.
Viewing A, B, C as vectors we obtain
$f(A) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, $f(B) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} 4 \\ -2 \end{pmatrix} = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$, $f(C) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} 2 \\ -3 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$.
Thus indeed, reflecting the triangle ABC via f, we arrive at the triangle AB′C′, where B′ = [4, 2] and C′ = [2, 3], respectively. An illustration in Sage can be seen below. Can you reproduce it in your editor? For help, see the cell given in 1.E.26. □
1.G.52. Let ABCD be the trapezoid in the plane with vertices A = [0, 0], B = [4, 0], C = [3, 3/2] and D = [3/2, 3/2]. Let {e1 = (1, 0), e2 = (0, 1)} be the standard basis, identifying the plane with $R^2$.
(a) Transform ABCD via the homothety $h_{1/2}$;
(b) Stretch ABCD in the direction of e1 with stretching parameter c = 3/4;
(c) Stretch ABCD in the direction of e2 with stretching parameter c = 4/3;
(d) Apply to ABCD a horizontal shear with parameter a = 2;
(e) Apply to ABCD a vertical shear with parameter a = 1/2;
(f) Transform ABCD via the axial symmetry.
Illustrate all the cases.
Solution. (a) The matrix of $h_{1/2}$ is $\begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}$, and we have $h_{1/2}((x, y)^T) = (x/2, y/2)^T$. Viewing A, B, C, D as vectors in the plane, one computes
$h_{1/2}(A) = [0, 0] = A$, $h_{1/2}(B) = [2, 0]$, $h_{1/2}(C) = [3/2, 3/4]$, $h_{1/2}(D) = [3/4, 3/4]$.
Thus, the new coordinates of the points B, C, D are B′ = [2, 0], C′ = [3/2, 3/4] and D′ = [3/4, 3/4], respectively, and so $h_{1/2}$ has shrunk the trapezium (because 0 < c = 1/2 < 1), as you can see here:
(b) The matrix of the stretching in the direction of e1 only, with stretching parameter c = 3/4, is again diagonal, given by $\begin{pmatrix} 3/4 & 0 \\ 0 & 1 \end{pmatrix}$. This means that the corresponding transformation is given by $(x, y)^T \mapsto (\frac{3x}{4}, y)^T$, so it preserves the y-coordinates. Therefore
A = [0, 0] → A, B = [4, 0] → [3, 0], C = [3, 3/2] → [9/4, 3/2], D = [3/2, 3/2] → [9/8, 3/2],
and the initial trapezium ABCD has been transformed to AB′C′D′, with B′ = [3, 0], C′ = [9/4, 3/2] and D′ = [9/8, 3/2], respectively. Hence in this case the illustration looks as follows:
(c) The stretching here preserves x and scales y by 4/3, i.e., it has the form $(x, y)^T \mapsto (x, \frac{4y}{3})^T$. We compute
A = [0, 0] → A, B = [4, 0] → B, C = [3, 3/2] → [3, 2], D = [3/2, 3/2] → [3/2, 2],
thus the initial trapezium ABCD has been transformed to ABC′D′, with C′ = [3, 2] and D′ = [3/2, 2], respectively. As for the illustration, we get
(d) As a transformation, the horizontal shear in the plane with parameter a = 2 has matrix $\begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix}$, and thus we get the assignment $(x, y)^T \mapsto (x + 2y, y)^T$. We compute
A = [0, 0] → A, B = [4, 0] → B, C = [3, 3/2] → [6, 3/2], D = [3/2, 3/2] → [9/2, 3/2].
This means that the new coordinates of the initial trapezium are now ABC′D′ with C′ = [6, 3/2] and D′ = [9/2, 3/2], respectively. Let us postpone its illustration for a moment, and present it below together with the illustration of the vertical shear.
(e) The vertical shear in the plane with parameter a = 1/2 has matrix $\begin{pmatrix} 1 & 0 \\ 1/2 & 1 \end{pmatrix}$, so it corresponds to the mapping $(x, y)^T \mapsto (x, \frac{x}{2} + y)^T$. We compute
A = [0, 0] → A, B = [4, 0] → [4, 2], C = [3, 3/2] → [3, 3], D = [3/2, 3/2] → [3/2, 9/4],
hence in this case the new coordinates of ABCD are AB′C′D′ with B′ = [4, 2], C′ = [3, 3] and D′ = [3/2, 9/4], respectively. Below, the figure on the left illustrates the horizontal shear, while the vertical shear corresponds to the figure on the right.
(f) Recall that the axial symmetry is the linear transformation $(x, y)^T \mapsto (y, x)^T$.
Let us present only the illustration of its action on ABCD, and leave the mathematical explanation of the figure to the reader. Can you predict which vertices remain fixed (without an illustration or computation)? □
1.G.53. Let ABCD be the rhombus in the plane with vertices A = [0, 0], B = [5, 0], C = [8, 4] and D = [3, 4], respectively.
(a) Rotate ABCD counter-clockwise around the origin through θ1 = π/3.
(b) Then rotate the rhombus obtained in (a) counter-clockwise around the origin through θ2 = π/3.
(c) Finally rotate the rhombus obtained in (b) counter-clockwise around the origin through another θ3 = π/3.
(d) Determine directly the final position of ABCD, after imposing the rotations described in (a), (b) and (c).
Solution. First observe that the given polygon ABCD is indeed a rhombus. This is because the vectors $u_1 = \overrightarrow{AB}$ and $u_2 = \overrightarrow{AD}$ are of the same length (with respect to the dot product on $R^2$), $\|u_1\| = \|u_2\| = 5$, and moreover the opposite sides are parallel to each other. The angle between $u_1, u_2$ equals arccos(3/5) ≈ 53.13°. For a moment, think why after applying a rotation to ABCD the new polygon will again be a rhombus. This is because rotations preserve lengths and angles (we will essentially prove this in Chapter 2, studying orthogonal transformations). Thus, rotating ABCD around the origin through any possible angle, we obtain a rhombus with the same characteristics (same lengths and angles as those described for ABCD). The rotation matrix for θ = π/3 is given by $\begin{pmatrix} 1/2 & -\sqrt{3}/2 \\ \sqrt{3}/2 & 1/2 \end{pmatrix}$, and hence, as a linear transformation of $R^2$, the rotation $R_{\pi/3}$ has the form $(x, y)^T \mapsto \big(\frac{x - \sqrt{3}y}{2}, \frac{\sqrt{3}x + y}{2}\big)^T$. Based on direct computations, one can now verify the following:
(a) Under the action of $R_{\pi/3}$ the points A, B, C, D (seen as vectors in $R^2$) are mapped to:
A → A, B → $[5/2, 5\sqrt{3}/2]$, C → $[4 - 2\sqrt{3}, 4\sqrt{3} + 2]$, D → $[\frac{3}{2} - 2\sqrt{3}, \frac{3\sqrt{3}}{2} + 2]$.
Hence, a counter-clockwise rotation around the origin through θ = π/3 transforms the rhombus ABCD to $AB_1C_1D_1$, where $B_1 = [5/2, 5\sqrt{3}/2]$, $C_1 = [4 - 2\sqrt{3}, 4\sqrt{3} + 2]$ and $D_1 = [\frac{3}{2} - 2\sqrt{3}, \frac{3\sqrt{3}}{2} + 2]$, respectively (see the figure below, where we have included all the rotations together).
(b) Next we need to apply $R_{\pi/3}$ to $AB_1C_1D_1$. We compute
A → A, $B_1$ → $[-5/2, 5\sqrt{3}/2]$, $C_1$ → $[-4 - 2\sqrt{3}, 4\sqrt{3} - 2]$, $D_1$ → $[-\frac{3}{2} - 2\sqrt{3}, \frac{3\sqrt{3}}{2} - 2]$.
This means that $AB_1C_1D_1$ has been transformed to the rhombus $AB_2C_2D_2$, where $B_2 = [-5/2, 5\sqrt{3}/2]$, $C_2 = [-4 - 2\sqrt{3}, 4\sqrt{3} - 2]$ and $D_2 = [-\frac{3}{2} - 2\sqrt{3}, \frac{3\sqrt{3}}{2} - 2]$, respectively.
(c) Let us finally apply the third rotation, using this time the rhombus $AB_2C_2D_2$ obtained in (b). We compute
A → A, $B_2$ → [−5, 0], $C_2$ → [−8, −4], $D_2$ → [−3, −4].
Thus a counter-clockwise rotation around the origin by π/3 transforms the rhombus $AB_2C_2D_2$ to $AB_3C_3D_3$, where $B_3 = [-5, 0] = -B$, $C_3 = [-8, -4] = -C$ and $D_3 = [-3, -4] = -D$, respectively. Let us illustrate all these rotations together in one figure (via Sage), as follows:
The code behind this figure is based on slightly more advanced options of 2D-graphics in Sage than what we have seen so far (such as automatically assigning names to the points in a list).
For the interested reader we present it here:
pts = [(0, 0), (5, 0), (8, 4), (3, 4), (5/2, 5*sqrt(3)/2),
       (4-2*sqrt(3), 2+4*sqrt(3)), (3/2-2*sqrt(3), (3/2)*sqrt(3)+2),
       (-5/2, 5*sqrt(3)/2), (-4-2*sqrt(3), 4*sqrt(3)-2),
       (-3/2-2*sqrt(3), (3/2)*sqrt(3)-2), (-5, 0), (-8, -4), (-3, -4)]
pt_names = ['$A$', '$B$', '$C$', '$D$', '$B_1$', '$C_1$', '$D_1$',
            '$B_2$', '$C_2$', '$D_2$', '$B_3$', '$C_3$', '$D_3$']
pt_opt = {'color': 'black', 'horizontal_alignment': 'left', 'vertical_alignment': 'top'}
plt = point2d(pts, color='blue', size=20)
plt += sum(text(name, vector(pt), **pt_opt) for name, pt in zip(pt_names, pts))
plt += point([0, 0], size=40, color="blue", aspect_ratio=1)
plt += point([5, 0], size=40, color="green")
plt += point([8, 4], size=40, color="red")
plt += point([3, 4], size=40, color="purple")
plt += polygon([(0, 0), (5, 0), (8, 4), (3, 4)], fill=False, rgbcolor=(0, 1/2, 1))  # the rhombus ABCD
plt += point([5/2, 5*sqrt(3)/2], size=40, color="green")
plt += point([4-2*sqrt(3), 2+4*sqrt(3)], size=40, color="red")
plt += point([3/2-2*sqrt(3), (3/2)*sqrt(3)+2], size=40, color="purple")
plt += polygon([(0, 0), (5/2, 5*sqrt(3)/2), (4-2*sqrt(3), 2+4*sqrt(3)), (3/2-2*sqrt(3), (3/2)*sqrt(3)+2)], fill=False, rgbcolor=(1/8, 1/4, 1/2), linestyle="--")  # the rhombus AB_1C_1D_1
plt += point([-5/2, 5*sqrt(3)/2], size=40, color="green")
plt += point([-4-2*sqrt(3), 4*sqrt(3)-2], size=40, color="red")
plt += point([-3/2-2*sqrt(3), (3/2)*sqrt(3)-2], size=40, color="purple")
plt += polygon([(0, 0), (-5/2, 5*sqrt(3)/2), (-4-2*sqrt(3), 4*sqrt(3)-2), (-3/2-2*sqrt(3), (3/2)*sqrt(3)-2)], fill=False, rgbcolor=(1/8, 1/4, 1/2), linestyle="--")  # the rhombus AB_2C_2D_2
plt += point([-5, 0], size=40, color="green")
plt += point([-8, -4], size=40, color="red")
plt += point([-3, -4], size=40, color="purple")
plt += polygon([(0, 0), (-5, 0), (-8, -4), (-3, -4)], fill=False, rgbcolor=(1/8, 1/4, 1/2), linestyle="--")  # the rhombus AB_3C_3D_3
plt.show(aspect_ratio=1, figsize=8)
(d) This task relies on the principle that composing two rotations results in another rotation, which can be easily determined. In other words, applying a rotation by an angle θ1 followed by a rotation by an angle θ2 is equivalent to a single rotation by the angle θ1 + θ2. In terms of matrices, this implies that
$\begin{pmatrix} \cos\theta_1 & -\sin\theta_1 \\ \sin\theta_1 & \cos\theta_1 \end{pmatrix}\begin{pmatrix} \cos\theta_2 & -\sin\theta_2 \\ \sin\theta_2 & \cos\theta_2 \end{pmatrix} = \begin{pmatrix} \cos(\theta_1+\theta_2) & -\sin(\theta_1+\theta_2) \\ \sin(\theta_1+\theta_2) & \cos(\theta_1+\theta_2) \end{pmatrix}$,
that is, $R_{\theta_1} R_{\theta_2} = R_{\theta_1+\theta_2}$. To prove this identity, you can use the trigonometric identities provided in part (d) of Problem 1.G.15. For our case we compute $R_{\theta_1} R_{\theta_2} = R_{2\pi/3}$, and $(R_{\theta_1} R_{\theta_2})R_{\theta_3} = R_{2\pi/3} R_{\pi/3} = R_{\pi}$. To determine the final position of the initial rhombus ABCD it is now sufficient to apply a rotation by π. This results in the linear transformation
$\begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} \cos\pi & -\sin\pi \\ \sin\pi & \cos\pi \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -x \\ -y \end{pmatrix} = -\begin{pmatrix} x \\ y \end{pmatrix}$.
Hence ABCD will finally be transformed to AB′C′D′, where B′ = −B, C′ = −C and D′ = −D, respectively. As a verification, we see that these are the points obtained in step (c), that is, B′ = $B_3$, C′ = $C_3$ and D′ = $D_3$, respectively. □
F) Material on relations and mappings
1.G.54. Consider the relation R on the set {1, 2, 3, 4} where (a, b) ∈ R if and only if a − b is divisible by 2. Show that R is an equivalence relation.
Solution. To verify reflexivity we need to show that (a, a) ∈ R for all a ∈ {1, 2, 3, 4}. This is obvious, since 0 is divisible by 2. Therefore, R is reflexive.
To verify symmetry we need to show that if (a, b) ∈ R then (b, a) ∈ R. If (a, b) ∈ R, then a − b is divisible by 2, and hence b − a = −(a − b) is also divisible by 2. Thus (b, a) ∈ R and R is symmetric.
To verify transitivity we need to show that if (a, b) ∈ R and (b, c) ∈ R then (a, c) ∈ R. If (a, b) ∈ R and (b, c) ∈ R, then both a − b and b − c are divisible by 2 and their sum (a − b) + (b − c) = a − c is also divisible by 2. Thus (a, c) ∈ R and R is transitive.
For example, (1, 3) ∈ R since 1 − 3 = −2 is divisible by 2, and since R is symmetric we also have (3, 1) ∈ R. Then we see that (1, 1) ∈ R as well.
Since R is reflexive, symmetric, and transitive, it is an equivalence relation. □
1.G.55. Determine the number of mappings from the set {1, 2} to the set {a, b, c}. How many of them are surjective and how many are injective?
Solution. The element 1 can be mapped to any of a, b, c, and the same holds for 2. Thus there are exactly $3^2 = 9$ mappings from the set {1, 2} to the set {a, b, c}. None of them can be surjective, because the set {a, b, c} has more elements than the set {1, 2}. Injectivity requires that the images of 1 and 2 are different. There are three possibilities for the image of 1, and once the image of 1 is given, there remain two possibilities for the image of 2. Thus, the number of injective mappings from the set {1, 2} to the set {a, b, c} is 6. □
1.G.56. Let {a, b, c, d} be a set with the relation {(a, a), (b, b), (a, b), (b, c), (c, b)}. What is the minimal number of elements we have to add to the relation in order to make it an equivalence?
Solution. Let us successively ensure the three properties that define an equivalence. First there is reflexivity: we must add the tuples (c, c), (d, d). Second is symmetry: we must add (b, a). For the third step we must take the so-called transitive closure: since a is in relation with b and b is in relation with c, we must add (a, c) and (c, a). □
1.G.57. Determine how many distinct binary relations can be defined between a set X and the set of all subsets of X, if the set X has exactly 3 elements.
Solution. First, notice that the set of all subsets of X has exactly $2^3 = 8$ elements, and thus its Cartesian product with X has 8 · 3 = 24 elements. Binary relations correspond to subsets of this Cartesian product, and hence the total number of such relations is $2^{24} = 16\,777\,216$. □
1.G.58. Find the number of surjective mappings f from the set {1, 2, 3, 4, 5} to the set {1, 2, 3} such that f(1) = f(2).
Solution. Every such mapping is uniquely given by the images of the elements {1, 3, 4, 5}, so there are exactly as many such mappings as there are surjective mappings from the set {1, 3, 4, 5} to the set {1, 2, 3}, that is, 36. □
1.G.59. Determine the number of surjective mappings of the set {1, 2, 3, 4} to the set {1, 2, 3}. ⃝
1.G.60. List all the relations on the two-element set {1, 2} that are symmetric, but neither reflexive nor transitive.
Solution. The reflexive relations are exactly those which contain both pairs (1, 1), (2, 2). Hence this excludes the following relations:
{(1, 1), (2, 2)}, {(1, 1), (2, 2), (1, 2)}, {(1, 1), (2, 2), (2, 1)}, {(1, 1), (2, 2), (1, 2), (2, 1)}.
We claim that the remaining relations, which are symmetric but not transitive, must contain (1, 2), (2, 1). If such a relation contains one of these two (ordered) pairs, by symmetry it must contain the other as well. If it contains neither of these pairs, then it is clearly transitive.
Thus, from the total number of 16 relations on a two-element set we select
{(1, 2), (2, 1)}, {(1, 2), (2, 1), (1, 1)}, {(1, 2), (2, 1), (2, 2)}.
It is now clear that each of these 3 relations is symmetric, but neither reflexive nor transitive. □
1.G.61. Determine the number of relations on the set {1, 2, 3, 4} which are both symmetric and transitive.
Solution. A relation with the given properties is an equivalence on some subset of the set {1, 2, 3, 4}. Since the numbers of equivalences on sets with 0, 1, 2, 3, 4 elements are 1, 1, 2, 5, 15, respectively, we obtain in total
$1 + 4 \cdot 1 + \binom{4}{2} \cdot 2 + \binom{4}{3} \cdot 5 + 15 = 52$.
Thus, there are 52 relations on the set {1, 2, 3, 4} that satisfy the required properties. □
1.G.62. Determine all the elements in S ◦ R, if
R = {(2, 4), (4, 4), (4, 5)} ⊂ N × N, and S = {(3, 1), (3, 2), (3, 5), (4, 1), (4, 4)} ⊂ N × N.
Solution. Consider all choices of two ordered tuples — (2, 4) with (4, 1); (2, 4) with (4, 4); (4, 4) with (4, 1); (4, 4) with (4, 4) — such that the second element of the first ordered tuple (which is a member of R) is equal to the first element of the second ordered tuple (which is a member of S). Then we obtain S ◦ R = {(2, 1), (2, 4), (4, 1), (4, 4)}.
This task can also be solved using Sage. First, define a function to compose two given relations as follows:
def RelCompose(Rel1, Rel2):
    RS = set()
    for (a, b) in Rel1:
        for (c, d) in Rel2:
            if b == c:
                RS.add((a, d))
    return RS
Now we should introduce R, S, as follows:
R = {(2, 4), (4, 4), (4, 5)}; S = {(3, 1), (3, 2), (3, 5), (4, 1), (4, 4)}
print(RelCompose(R, S))
Executing the whole block, Sage returns {(4, 4), (2, 4), (4, 1), (2, 1)}. □
1.G.63. Let R be the binary relation between the sets A = Z and B = R, defined by R = {(0, 4), (−3, 0), (5, π), (5, 2), (0, 2)}. Express explicitly $R^{-1}$ and $R \circ R^{-1}$. ⃝
1.G.64. Is there an equivalence relation on the set of all lines in the plane that also serves as an ordering?
Solution. An equivalence relation (or an ordering relation) must be reflexive, therefore every line must be in relation with itself. Furthermore, we require that the relation is both symmetric (equivalence) and antisymmetric (ordering). This implies that each line can only be related to itself. If we define the relation such that two lines are in relation if and only if they are identical, we obtain a very natural relation, which is both an equivalence relation and an ordering. We just need to check that it is transitive, which is easy. Therefore, the only relation that meets the criteria is the identity relation on the set of all lines in the plane. □
1.G.65. We have the set {3, 4, 5, 6, 7}. Write explicitly the following relations and explore their properties: i) a divides b; ii) either a divides b or b divides a; iii) a and b have a common divisor greater than one. ⃝
Solutions to the problems
1.A.3. Already the ancient Greeks knew that if we prescribe the area of a square as $a^2 = 2$, then we cannot find a rational a to satisfy it. Why? Suppose that $(p/q)^2 = 2$ for some natural numbers p and q which have no common divisor greater than 1 (otherwise we can further reduce the fraction p/q). Then $p^2 = 2q^2$ is an even number, so $p^2$ is even and therefore so is p. Now p is even, thus $p^2$ must be divisible by 4. But then $q^2$ is even and so q must be even too. This certifies that p and q both have 2 as a common factor, a contradiction.
1.A.4. Let p, q be two arbitrary rational numbers with p ≠ q. We may assume that p < q; the other case is treated similarly. Set $\beta := p - \frac{p - q}{\sqrt{2}}$.
We will show that β is irrational, that is, β ∈ R\Q, and in addition that p < β < q. The first claim follows because $1/\sqrt{2}$ is irrational, in combination with the following two facts:
• The product of a rational and an irrational number is an irrational number.
• Adding a rational and an irrational number, we obtain an irrational number.
Let us prove the first statement; the second one is proved in a very similar way. Let $p = \frac{a}{b} \in Q$ be a rational number, and let x ∈ R\Q be irrational. Then $xp = px = \frac{xa}{b}$. Assume that xp is rational, that is, $xp = \frac{xa}{b} = \frac{r}{s}$ for some integers r, s. Multiplying both sides by $\frac{b}{a}$ we obtain $x = \frac{rb}{sa}$. Being the product of the rationals $\frac{r}{s}$ and $\frac{b}{a}$, the number x would be rational, a contradiction.
Now it remains to prove that p < β < q. We begin with $0 < \frac{1}{\sqrt{2}} < 1$ and multiply both sides by q − p > 0. This gives $0 < \frac{q-p}{\sqrt{2}} < q - p$, and by adding the number p to both sides we get $p < p - \frac{p-q}{\sqrt{2}} < q$. This proves the assertion.
1.A.7. By assumption, $p(x) = 5x^4 + x^2 - 2x + 2$, so Horner's table has the form
ρ = 4 | 5    0    1    −2    2
      |      20   80   324   1288
      | 5    20   81   322   1290
This means that $p(x) = (x - 4)(5x^3 + 20x^2 + 81x + 322) + 1290$. In terms of the long division method, dividing $5x^4 + x^2 - 2x + 2$ by x − 4 produces the quotient $5x^3 + 20x^2 + 81x + 322$ and the remainder 1290, through the successive subtractions of $5x^4 - 20x^3$, $20x^3 - 80x^2$, $81x^2 - 324x$ and $322x - 1288$.
1.A.9. The horizontal distance between z, w is the absolute value of the difference of their real parts, that is, hor = |x − u|. For an illustration, see the figure below. Note that here one uses the absolute value of x − u since, depending on whether z lies to the left or to the right of w, the horizontal distance is given by ±(x − u). Similarly, the vertical distance between z, w is the absolute value of the difference of their imaginary parts, that is, ver = |y − v|. According to the Pythagorean theorem, the distance d between z, w satisfies $d^2 = \mathrm{hor}^2 + \mathrm{ver}^2 = |x - u|^2 + |y - v|^2$, and the assertion follows.
1.A.11. The distance in question equals $\sqrt{13}$.
1.A.12. By assumption, i is a solution of the given equation. Hence we have that $i^3 + (1 + i)i^2 + ai + 2 = 0$. From this equation we compute a = 2 + i. Thus, we need to solve the equation f(z) = 0, where $f(z) = z^3 + (1 + i)z^2 + (2 + i)z + 2$. Since f(z) is of degree three, over C we can specify three roots. In particular, we know that i is a solution of f(z) = 0, and we can use this to obtain the corresponding Horner table. Set ρ = i. We have
ρ = i | 1    1 + i     2 + i    2
      |      i         i − 2    −2
      | 1    1 + 2i    2i       0
The last entry of the bottom row is 0, since i is a root of f(z) = 0. Hence we can write
$f(z) = z^3 + (1 + i)z^2 + (2 + i)z + 2 = (z - i)(z^2 + (1 + 2i)z + 2i) = (z - \rho)\pi(z)$
with $\pi(z) := z^2 + (1 + 2i)z + 2i$. Now, the discriminant of π(z) is given by
$\Delta = (1 + 2i)^2 - 4 \cdot 1 \cdot 2i = 1 + 4i - 4 - 8i = -3 - 4i = (1 - 2i)^2$.
Thus, the remaining two roots are given by $z_{2,3} = \frac{-(1 + 2i) \pm \sqrt{(1 - 2i)^2}}{2}$, that is, $z_2 = -2i$ and $z_3 = -1$.
1.A.16. (a) This is true, as we can see in the figure below. In particular, the argument of a real number is kπ, with k ∈ Z.
(b) The principal argument of i equals φ = π/2 and is given correctly. However, −2i has the same principal argument as −i, i.e., ϑ = −π/2, see also the figure. Hence the last statement in (b) is false. One can show that a complex number is purely imaginary if and only if its argument is given by ±π/2 + nπ, with n ∈ Z.
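These claims are easy to experiment with in Sage (this cell is our addition); the built-in function arg returns the principal argument:

# Principal arguments of a few complex numbers.
for z in [2, -3, I, -I, -2*I]:
    print(z, arg(z))
# prints 0, pi, 1/2*pi, -1/2*pi, -1/2*pi respectively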
1.A.17. One can apply the method presented in Problem 1.A.14. This means that for the given z ∈ C we need to compute its magnitude r = |z| and also its argument. A direct calculation shows that
$|z| = \sqrt{\big(1 + \cos\frac{\pi}{3}\big)^2 + \sin^2\frac{\pi}{3}} = \sqrt{3}$, $\cos\varphi = \frac{\mathrm{Re}(z)}{|z|} = \frac{1 + \frac{1}{2}}{\sqrt{3}} = \frac{\sqrt{3}}{2}$, $\sin\varphi = \frac{\mathrm{Im}(z)}{|z|} = \frac{1}{2}$.
Therefore we see that φ = π/6. Altogether, we obtain $z = \sqrt{3}\big(\cos\frac{\pi}{6} + i\sin\frac{\pi}{6}\big)$.
1.A.19. Recall that de Moivre’s formula states that $z^n = |z|^n\big(\cos(n\varphi) + i\sin(n\varphi)\big)$ for any n ∈ Z and z ∈ C, see 1.1.4. We are interested in $z^{31}$, where $z = \cos\frac{\pi}{6} + i\sin\frac{\pi}{6}$ (note that $z \in S^1 = \{z \in C : |z| = 1\}$). In such tasks it is often useful to draw the diagram corresponding to z, shown in the figure for the particular case at hand. Therefore, we obtain
$\big(\cos\frac{\pi}{6} + i\sin\frac{\pi}{6}\big)^{31} = \cos\frac{31\pi}{6} + i\sin\frac{31\pi}{6} = \cos\frac{7\pi}{6} + i\sin\frac{7\pi}{6} = -\frac{\sqrt{3}}{2} - i\frac{1}{2}$.
1.A.20. This task highlights a different use of de Moivre’s theorem. Indeed, having in mind the result in 1.1.4 and the identities $(a + b)^3 = a^3 + 3a^2b + 3ab^2 + b^3$, $i^2 = -1$, $i^3 = -i$, we compute
$\cos(3\varphi) + i\sin(3\varphi) = (\cos\varphi + i\sin\varphi)^3 = \cos^3\varphi + 3i\cos^2\varphi\sin\varphi + 3i^2\cos\varphi\sin^2\varphi + i^3\sin^3\varphi = \big(\cos^3\varphi - 3\cos\varphi\sin^2\varphi\big) + i\big(3\cos^2\varphi\sin\varphi - \sin^3\varphi\big)$.
A comparison of the real and the imaginary parts now yields the result:
$\cos(3\varphi) = \cos^3\varphi - 3\cos\varphi\sin^2\varphi = \cos^3\varphi - 3\cos\varphi(1 - \cos^2\varphi) = 4\cos^3\varphi - 3\cos\varphi$,
and
$\sin(3\varphi) = 3\cos^2\varphi\sin\varphi - \sin^3\varphi = 3(1 - \sin^2\varphi)\sin\varphi - \sin^3\varphi = 3\sin\varphi - 4\sin^3\varphi$.
1.A.22. Note that this equation has no rational roots. Substitution into the formulas obtained in 1.A.21 yields $p = b - a^2/3 = -7/3$, $q = -7/27$. It follows that $u = \frac{1}{6}\sqrt[3]{28 \pm 12\sqrt{-147}}$. We can theoretically choose up to six possibilities for u (two for the choice of the sign and three independent choices of the cubic root). But we obtain only three distinct values for x. By substitution into the formulas, one of the roots is of the form
$x = \frac{14}{3\sqrt[3]{28 - 84i\sqrt{3}}} + \frac{\sqrt[3]{28 - 84i\sqrt{3}}}{6} - \frac{1}{3} \doteq 1.247$,
and similarly for the other two (approximately −0.445 and −1.802). Finally, observe that even though we have used complex numbers during the computation, all the solutions are real.
1.B.2. Set as before $a = 1 + \frac{0.06}{12} = 1.005$ and C = 30 000. The condition $d_k = 0$ gives the equation
$a^k = \frac{P/(a-1)}{P/(a-1) - C} = \frac{200P}{200P - C}$.
By taking logarithms of both sides, we arrive at
$k = \frac{\ln(200P) - \ln(200P - C)}{\ln a}$.
For P = 500 this gives approximately k = 71.5. Therefore, Michael will pay for 72 months, with the last repayment being less than 500.
1.C.3. (a) The pair of letters b and r can be considered as a single indivisible “double-letter”. In total we then have six distinct letters, and there are 6! words of six indivisible letters. We have to multiply this by two, since the double-letter can be either br or rb. So the solution is 2 · 6!.
(b) The events in this case form the complement of part (a) in the set of all rearrangements of the seven letters. The solution is therefore 7! − 2 · 6!.
1.C.10. First, we can place the white rook on any of $8^2$ positions. Then we have at our disposal $7^2$ positions (the remaining 7 rows and columns) in which we can place the black rook. Therefore, the total number of ways equals $8^2 \cdot 7^2 = 3136$. “Inclusion-exclusion” would, of course, provide the same result: $8^2(8^2 - 1) - 8^2 \cdot 7 \cdot 2 = 3136$ (the first term counts all possibilities, the second one the forbidden ones).
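The count in 1.C.10 is small enough to verify by brute force; the following cell (our addition) enumerates all ordered placements of the two rooks on distinct squares and keeps those that do not share a row or a column:

# Brute-force check of 1.C.10 on an 8x8 board; squares are numbered 0..63.
count = 0
for w in range(64):
    for b in range(64):
        if w != b and w // 8 != b // 8 and w % 8 != b % 8:
            count += 1
print(count)  # prints 3136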
1.D.3. First divide the favourable cases according to the number of men in the chosen group: 2 or 1. Now, there are eight groups of five people of which exactly one is a man (all four women have to be present in such a group, so the group depends only on which man is chosen). Also, there are $c(8, 2) \cdot c(4, 3) = \binom{8}{2}\binom{4}{3}$ groups with two men. This is because we choose two men from eight and then independently three women from four. These two choices can be combined independently, and thus using the rule of product we obtain the number of such groups. The total number of groups with five people is $c(12, 5) = \binom{12}{5}$. Therefore, the desired probability, being the quotient of the number of favourable outcomes and the total number of outcomes, equals
$\frac{8 + \binom{4}{3}\binom{8}{2}}{\binom{12}{5}} = \frac{5}{33}$.
1.D.8. We solve this exercise using the theorem about multiplication of probabilities, which is explained in 1.4.8 and based on the concept of conditional probability. Here it seems to be obvious: first we require a red ball, which happens with probability 9/16. If a red ball was drawn, then in the second round we draw a red ball with probability 8/15 (there are 15 balls in the box, 8 of them red). Finally, if two red balls were drawn, the probability that a white ball is drawn is 7/14 (there are 7 white balls and 7 red balls in the box). Thus we obtain
$\frac{9}{16} \cdot \frac{8}{15} \cdot \frac{7}{14} = 0.15$.
1.E.5. Eliminate t to obtain q : x − 2y = −5. Then solve for x and y. The intersection has coordinates x = 1, y = 3.
1.E.7. It is clear that $-2 \cdot \big(-x - \frac{3}{2}y + 2\big) = 2x + 3y - 4$. Thus $p_1$ and $p_4$ describe the same line. Moreover, note that $p_2$ can be rewritten as −2x + 2y − 6 = 0, thus the lines $p_2$ and $p_3$ are parallel and distinct. Also, by eliminating t, the line $p_5$ has the equation x + y = 0, which is not parallel to any of the other lines.
1.E.12. We compute
$A - B = \begin{pmatrix} -2 & 5 \\ -1 & 1 \end{pmatrix}$, $(A - B)^T = \begin{pmatrix} -2 & -1 \\ 5 & 1 \end{pmatrix}$,
and by matrix multiplication we obtain
$v = 2\begin{pmatrix} -2 & -1 \\ 5 & 1 \end{pmatrix}\begin{pmatrix} 2 & -2 \\ 4 & 5 \end{pmatrix}\begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} -52 \\ 64 \end{pmatrix}$.
1.E.14. Of course, all these results can be easily computed via Sage and the methods mentioned in the previous subsection. But it is all so easy with 2 by 2 matrices that we could count on our fingers, too:
(a) We have $A - B = \begin{pmatrix} -2 & 5 \\ -1 & 1 \end{pmatrix}$, so we easily compute
$(A - B)^2 = \begin{pmatrix} -2 & 5 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} -2 & 5 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} -1 & -5 \\ 1 & -4 \end{pmatrix}$, $A^2 = \begin{pmatrix} -10 & 10 \\ -4 & -6 \end{pmatrix}$, $B^2 = \begin{pmatrix} 4 & 0 \\ -3 & 1 \end{pmatrix}$
and
$AB = \begin{pmatrix} -5 & 5 \\ -6 & 2 \end{pmatrix} \neq BA = \begin{pmatrix} 0 & 10 \\ -2 & -3 \end{pmatrix}$.
Thus, computing $(A - B)^2$ we may use the formula $A^2 - AB - BA + B^2$ (but we see that it is false to claim that $(A - B)^2 = A^2 - 2AB + B^2$). Hence in matrix calculus the multiplication is in general not commutative.
(b) We compute ABC, BCA, and CAB, which yields:
$D = 2\begin{pmatrix} 10 & 35 \\ -4 & 22 \end{pmatrix} - \begin{pmatrix} 8 & 12 \\ -14 & 24 \end{pmatrix} - \begin{pmatrix} 2 & 6 \\ -50 & 30 \end{pmatrix} = \begin{pmatrix} 10 & 52 \\ 56 & -10 \end{pmatrix} \neq \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$.
Therefore, in general, given three arbitrary matrices A, B, C, we have ABC ≠ BCA ≠ CAB.
(c) We compute (in the result, we write $E = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$):
$\frac{c(AB) - (c(AB))^T}{\sqrt{2}} = \frac{c}{\sqrt{2}}\big(AB - (AB)^T\big) = \frac{\sqrt{2}c}{2}\Big\{\begin{pmatrix} -5 & 5 \\ -6 & 2 \end{pmatrix} - \begin{pmatrix} -5 & -6 \\ 5 & 2 \end{pmatrix}\Big\} = \frac{11\sqrt{2}}{2}cE$.
(d) This is a direct computational consequence of the formula for the determinant. In Chapter 2, we will see that this holds true for square matrices of all sizes.
(f) Let us better handle this one by Sage, where it is convenient to use the cell presented in 1.E.12 and add the code
D = det(A)*B - det(B)*A - C.trace()*C; print(det(D)); print(divisors(det(D)))
This gives −38, which is the determinant of D, and [1, 2, 19, 38], which are its divisors.
1.E.15. (a) For any two square matrices A and B we have $(A + B)(A - B) = A^2 - AB + BA - B^2$. Therefore, the identity $(A + B)(A - B) = A^2 - B^2$ is valid if and only if AB = BA.
Thus any pair of matrices which do not commute will do. A choice is given for instance by
$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, $B = \begin{pmatrix} 4 & 3 \\ 2 & 1 \end{pmatrix}$. Indeed, $AB = \begin{pmatrix} 8 & 5 \\ 20 & 13 \end{pmatrix} \neq BA = \begin{pmatrix} 13 & 20 \\ 5 & 8 \end{pmatrix}$.
(b) Similarly, any two square matrices A, B satisfy $(A + B)(A + B) = A^2 + AB + BA + B^2$. It follows that $(A + B)(A + B) = A^2 + 2AB + B^2$ if and only if AB = BA, as in the first case. Hence the pair of matrices presented above provides an example also for this case.
1.E.16. The definition of linear mappings shows that knowing the values at u = (1, 0) and v = (0, 1) determines the values at all the other vectors: $(x, y) \mapsto F(xu + yv) = xF(u) + yF(v)$. This corresponds to placing the vectors F(u) and F(v) into the columns of the appropriate matrix A, and then expressing the value as $\begin{pmatrix} x \\ y \end{pmatrix} \mapsto A\begin{pmatrix} x \\ y \end{pmatrix}$. This answers the question about the role of the columns in the matrix A, once we have it. Moreover, affine mappings differ from linear ones only by having a constant vector added. Thus, an affine mapping is linear if and only if it keeps the origin fixed. In our case, clearly
$F\Big(\begin{pmatrix} x \\ y \end{pmatrix}\Big) = \begin{pmatrix} 7 & -3 \\ -2 & 5 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = A\begin{pmatrix} x \\ y \end{pmatrix}$, $G\Big(\begin{pmatrix} x \\ y \end{pmatrix}\Big) = \begin{pmatrix} 2 & 2 \\ 4 & -9 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} -4 \\ 3 \end{pmatrix} = B\begin{pmatrix} x \\ y \end{pmatrix} + w$,
so $F\big((0, 0)^T\big) = (0, 0)^T$, $G\big((0, 0)^T\big) = (-4, 3)^T$, and we deduce that the mapping F is linear, but G is not. The final claim is obvious from the associativity of matrix multiplication, i.e.,
$B\Big(A\begin{pmatrix} x \\ y \end{pmatrix}\Big) + w = (BA)\begin{pmatrix} x \\ y \end{pmatrix} + w$, $A\Big(B\begin{pmatrix} x \\ y \end{pmatrix} + w\Big) = (AB)\begin{pmatrix} x \\ y \end{pmatrix} + Aw$.
Check these computations yourselves and see also the right column in 1.5.4.
1.E.22. This is a typical task to be discussed without any explicit coordinates in mind. Let us start with the linear mappings. In the plane, lines are parallel if and only if their directional vectors u in the parametric representations P(t) = P + tu are parallel. The images of the lines can be parametrized by F(P(t)) = F(P) + tF(u). Now recall that in the plane, two non-zero vectors are parallel if and only if one is a scalar multiple of the other. Thus, the images of parallel lines are clearly parallel as well, except in the case where the image F(u) = 0. So, strictly speaking, the claim we should prove is false. But we have proved that invertible linear mappings transform parallel lines to parallel lines. Adding translations in order to deal with general affine mappings does not spoil this property. On the other hand, for example, a shear transformation clearly changes both the angles and the distances, cf. 1.E.17.
1.E.24. An “admissible picture” that effectively illustrates the result of the task in 1.E.21 should resemble the figure shown below:
This figure allows you to visually verify the two requirements of the task: first, that the two lines are perpendicular, and second, that the line in question passes through the point P = [−6, 7]. To create this plot, you can use a Sage cell with specific options to ensure the correct aspect ratio and figure size. For example, you can type:
x = var("x")
y2 = -7*x/6; y1 = (1/7)*(6*x + 13)
fig = plot([], figsize=[4, 4])
fig += plot(y1, x, (-7, 7), aspect_ratio=1)
fig += plot(y2, x, (-7, 7), aspect_ratio=1)
fig += point((-6, 7), size=70)
show(fig)
To verify the effect of the different options on the plot, try the following: modify the Sage cell to use aspect_ratio="automatic". You will observe that the two lines may appear non-perpendicular, which can be misleading. Additionally, changing the figsize to [4, 8] can produce a distorted or “ugly” figure, further complicating the visual interpretation.
By experimenting with these settings in your editor, you can see firsthand how altering the aspect ratio and figure size impacts the clarity and accuracy of your plot.
1.E.25. We know that the area equals the absolute value of half the determinant of the matrix whose first column is given by the vector Q − P and whose second column by the vector R − P, that is, the determinant of the matrix
$A = \begin{pmatrix} -2 - (-8) & 5 - (-8) \\ 0 - 1 & 9 - 1 \end{pmatrix} = \begin{pmatrix} 6 & 13 \\ -1 & 8 \end{pmatrix}$.
A simple calculation yields $\frac{1}{2}|\det(A)| = \frac{1}{2}|6 \cdot 8 - 13 \cdot (-1)| = \frac{61}{2}$.
Alternatively, in Sage we can type the cell
A = matrix([[6, 13], [-1, 8]]); A.det()
which returns the determinant of A. As we will prove in Chapter 2, changing the order of the vectors leads to a change of the sign of the determinant, but the absolute value remains unchanged. Similarly, the transposed matrix (writing the vectors in the rows instead) yields the same determinant. Finally, if the vertices P, Q, R are ordered in the anti-clockwise direction, the determinant formed by the vectors Q − P and R − P is always positive. If the given coordinates were not in the standard Euclidean coordinate system, we would have to look at their defining frame $e_1$, $e_2$. If their norms are one and they are perpendicular (an orthonormal frame, see the end of 1.5.7), then we do not need to do anything. Otherwise, we should first transform the coordinates into the standard coordinate system, see the methods discussed in 1.E.22 (although area-preserving transformation matrices are those with determinant one, e.g., all shears, so this might not be necessary at all – think about it!).
1.E.26. Let us denote the quadrilateral S by ABCD. Being a polygon, to sketch it in Sage we can apply the command polygon or polygon2d. Here we use the first command in the following cell:
P = polygon([(1, 1), (6, 1), (11, 4), (2, 4)], fill=False, color="black")
A = point([1, 1], size=40, color="black")
B = point([6, 1], size=40, color="black")
C = point([11, 4], size=40, color="black")
D = point([2, 4], size=40, color="black")
Al = text("$A$", (0.8, 0.8), color="black", fontsize="12")
Bl = text("$B$", (6.2, 0.8), color="black", fontsize="12")
Cl = text("$C$", (11.2, 4.2), color="black", fontsize="12")
Dl = text("$D$", (1.8, 4.2), color="black", fontsize="12")
l = line([(1, 1), (11, 4)], color="black", linestyle="--")
show(P + A + B + C + D + Al + Bl + Cl + Dl + l)
Executing this block we obtain the following illustration:
From this figure we observe that S is a trapezoid, with bases of length 5 and 9 and a height of 3. Thus, the area of S is calculated as $\mathrm{Area}(S) = \frac{5 + 9}{2} \cdot 3 = 21$. This formula can be derived by noting that shearing transformations do not alter the area of a shape. By selecting an appropriate trapezoid that simplifies the calculations, you can verify this result. Can you identify a suitable trapezoid to use for this purpose?
Alternatively, we can divide S into two triangles, △ABC and △ACD, as shown in the figure. We can then find the area of S by summing the areas of these two triangles, which can be calculated using the determinants of the corresponding matrices:
$d_1 = \begin{vmatrix} 6-1 & 11-1 \\ 1-1 & 4-1 \end{vmatrix} = \begin{vmatrix} 5 & 10 \\ 0 & 3 \end{vmatrix}$, $d_2 = \begin{vmatrix} 11-1 & 2-1 \\ 4-1 & 4-1 \end{vmatrix} = \begin{vmatrix} 10 & 1 \\ 3 & 3 \end{vmatrix}$,
where the columns are the vectors B − A, C − A (for $d_1$) and C − A, D − A (for $d_2$). In such terms we see that $\mathrm{Area}(S) = \frac{1}{2}(|d_1| + |d_2|)$, and this gives 21 as well. Note that the second approach works for all polygonal objects in the plane.
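This triangulation idea is exactly the classical “shoelace” formula, which we can wrap into a small Sage routine (this cell is our addition, not part of the original text):

# Shoelace formula: half the absolute value of the sum of the cross
# products of consecutive vertex pairs gives the area of a simple polygon.
def shoelace_area(vertices):
    n = len(vertices)
    s = sum(vertices[i][0] * vertices[(i + 1) % n][1]
            - vertices[(i + 1) % n][0] * vertices[i][1]
            for i in range(n))
    return abs(s) / 2

print(shoelace_area([(1, 1), (6, 1), (11, 4), (2, 4)]))  # prints 21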
1.E.28. The coordinates of the vertex C can be obtained by rotating the point A around the centre S through the angle 2π/3 in the positive direction. This gives $C = [\frac{3}{2} - \sqrt{3}, -1 - \frac{\sqrt{3}}{2}]$. As a verification in Sage you may type
a = 2*pi/3; x1 = 0; y1 = 2; x2 = 1; y2 = 0
M = matrix([[cos(a), -sin(a)], [sin(a), cos(a)]])
A = vector([x1, y1]); S = vector([x2, y2])
rot = (M*(A - S) + S); show(rot)
1.E.30. For simplicity, set A = [−2, −2], B = [2, −11/6], C = [3, 1], D = [1, 4], and E = [−2, 2]. The sides BC and CD are clearly visible from the position of the point [300, 1]. On the other hand, DE and EA cannot be seen. For the side AB, we compute
$\begin{vmatrix} -2 - 300 & 2 - 300 \\ -2 - 1 & -\frac{11}{6} - 1 \end{vmatrix} = -302 \cdot \big(-\frac{17}{6}\big) - (-298) \cdot (-3) < 0$.
This implies that the side can be seen from the point [300, 1].
1.E.31. Order the vertices in the positive direction, that is, counter-clockwise: P = [5, 6], Q = [7, 8], R = [5, 8]. Using the corresponding determinants we can determine whether the point X = [0, 1] lies to the “left” or to the “right” of the sides of the triangle, when we view them as oriented line segments. We have
$\begin{vmatrix} Q - X \\ R - X \end{vmatrix} = \begin{vmatrix} 7 & 7 \\ 5 & 7 \end{vmatrix} > 0$, $\begin{vmatrix} R - X \\ P - X \end{vmatrix} = \begin{vmatrix} 5 & 7 \\ 5 & 5 \end{vmatrix} < 0$, $\begin{vmatrix} P - X \\ Q - X \end{vmatrix} = \begin{vmatrix} 5 & 5 \\ 7 & 7 \end{vmatrix} = 0$.
We see that the determinants are not all positive, so X is outside the triangle. In particular, if X lies to the left of some oriented segment (a side of the triangle), the segment is not visible from X. Because the last determinant is zero, the points [0, 1], [5, 6] and [7, 8] lie on a line, and thus the side joining P and Q is not visible. The side joining Q and R is also not visible, unlike the side joining P and R, for which the determinant is negative.
1.F.4. We should first express the three defining properties of an equivalence relation as functions, and next examine whether R satisfies them. We can do this as follows:
def is_reflexive(Rel, A=None):
    if not A:
        A = set(x[0] for x in Rel)  # in this way we define the domain
    return all({(a, a) in Rel for a in A})
def is_symmetric(Rel):
    return all({(b, a) in Rel for (a, b) in Rel})
def is_transitive(Rel):
    return all({(a, d) in Rel for (a, b) in Rel for (c, d) in Rel if b == c})
def relation_summary(Rel):
    A = set(x[0] for x in Rel)
    reflexive = is_reflexive(Rel, A)
    symmetric = is_symmetric(Rel)
    transitive = is_transitive(Rel)
    print(f'Reflexive: {reflexive}')
    print(f'Symmetric: {symmetric}')
    print(f'Transitive: {transitive}')
Rel1 = {('a','a'), ('b','b'), ('c','c'), ('d','d'), ('b','a'), ('b','c'), ('b','d')}
relation_summary(Rel1)
Sage returns the desired result:
Reflexive: True
Symmetric: False
Transitive: True
Notice the use of the so-called “f-strings” in the print – we are allowed to use expressions whose values should be printed inside the string.
1.F.5. From the relationship ((a, b), (a, b)) ∈ R for all a, b ∈ R it follows that the relation is reflexive. It is also easy to see that the relation is symmetric, since in the equality of the second coordinates we can interchange the left and right side. If ((a, b), (c, d)) ∈ R and ((c, d), (e, f)) ∈ R, that is, b = d and d = f, then we get b = f, that is, ((a, b), (e, f)) ∈ R. Hence R is also transitive, and so it is an equivalence relation. Notice that two points in the plane are related if and only if they have the same second coordinate (the line they determine is perpendicular to the y-axis). Thus the corresponding partition divides the plane into the lines parallel with the x-axis.
1.F.6. Directly from the definition of the domain and the codomain of a relation we obtain
D = {a, b, c, d, f} ⊂ A, I = {x, y, u, v} ⊂ B.
To check if a relation is a mapping, we need to ensure that each element in the domain is assigned exactly one element of the codomain. In our case we deduce that R is not a mapping, since (c, x), (c, u) ∈ R, that is, c ∈ D has two images. In Sage, the domain and codomain can be found like this:
A = set(["a", "b", "c", "d", "e", "f"])
B = set(["x", "y", "u", "v", "w"])
Rel = set([("a","v"), ("b","x"), ("c","x"), ("c","u"), ("d","v"), ("f","y")])
# to compute the domain we iterate over R taking the first element of each pair
domain = set(x[0] for x in Rel)
print(f"domain = {domain}")
# to compute the codomain, we take the second element
codomain = set(x[1] for x in Rel)
print(f"codomain = {codomain}")
1.F.9. In the first case, the map f is surjective (it is enough to set x = 0) but not injective (it is enough to set (x, y) = (0, −9) and (x, y) = (1, 0)). In the second case, f is an injective mapping (both its coordinates, that is, the functions y = 2x and y = x² + 10, are clearly increasing over N). The mapping is not surjective (for instance the pair (1, 1) has no preimage).
1.G.1. An easy way is given by Sage, using the command sum. Type in your editor the code
# Define symbolic variables
n, k = var("n k")
# Define the function to sum
f = k
# Compute the symbolic sum
sum1 = sum(f, k, 1, n)
# Show the symbolic sum
show(sum1)
Sage’s output is $\frac{1}{2}n^2 + \frac{1}{2}n$, which represents the evaluated form of the sum $\sum_{k=1}^{n} k$. However, since the task requires an illustration, we need to explore a different approach. The idea that we will present below likely dates back to the Pythagoreans (6th century BC), but it is also closely associated with the German mathematician C. F. Gauss (1777–1855), one of the most influential scientists of his time. Let $\Delta_n = 1 + 2 + \cdots + (n - 1) + n$ be the sum that we want to compute, and rewrite it with the summands in reverse order, that is, $\Delta_n = n + (n - 1) + \cdots + 2 + 1$. Adding these two relations, we get
$2\Delta_n = (n + 1) + (n - 1 + 2) + (n - 2 + 3) + \cdots + (2 + (n - 1)) + (1 + n) = (n + 1) + (n + 1) + \cdots + (n + 1)$,
where the sum on the r.h.s. has n summands. Thus $2\Delta_n = n(n + 1)$, which completes the proof. Graphically, the triangular number $\Delta_n = \frac{1}{2}n(n+1)$ (n = 1, 2, ...) can be represented by the triangular grid of points (black dots) figured below, where the first row contains a single dot ($\Delta_1 = 1$) and each subsequent row contains one more dot than the previous one (we always move the first dot upwards). In the figure the first five triangular numbers are presented.14 To illustrate the proof given above, one converts the triangle arrangements to half-square arrangements, as in the figure below. Let us illustrate the case of $\Delta_7$. The shaded triangle is the triangle arrangement corresponding to $\Delta_7$, while the white triangle occurs after rotating the shaded diagram, so as to obtain the rectangle of size 7 × (7 + 1). Hence $2\Delta_7 = 56 = 7(7 + 1)$, that is, $\Delta_7 = 28$. Finally, as in 1.C.7 we see that the number of distinct handshakes or wine glass clinks that can be made in a group of n people is given by $\binom{n}{2}$, and this can be rephrased as $\Delta_{n-1}$.15 This last claim is based on the relation $\Delta_n = \binom{n+1}{2}$. As for the two given recurrence relations, they follow very easily, and it is worth mentioning that adding them returns us to the initial definition of $\Delta_n$.
1.G.2. Suppose that the length of the hill is ℓ km. Then the bicyclist needs ℓ/25 hours to go up the hill, and ℓ/75 hours to go down the same hill.
For instance, if ℓ = 25 km, then the cyclist needs one hour to go uphill and a third of an hour to go downhill. Since the total distance is 2ℓ km, we deduce that the average speed is
$\frac{2\ell}{\frac{\ell}{25} + \frac{\ell}{75}} = \frac{75 \cdot 2\ell}{4\ell} = \frac{75}{2} = 37.5$ km/h.
The average speed here is the harmonic mean of the two given rates.
14 Many mathematicians include 0 as the very first element in the sequence of triangular numbers, but we do not adopt this here.
15 Each person shakes hands with n − 1 others, but each handshake involves two people, so we count each handshake twice.
1.G.8. By assumption, $P(x) = x^2 + x + 2$, hence we have
$Q(x) = P(P(x)) - P(x) - 3 = (P(x))^2 + P(x) + 2 - P(x) - 3 = (P(x))^2 - 1 = (P(x) - 1) \cdot (P(x) + 1)$.
This shows that Q(x) is divisible by P(x) − 1, with quotient $\pi(x) = P(x) + 1 = x^2 + x + 3$.
1.G.9. Suppose that $P(x) = ax^4 + bx^3 + cx^2 + dx + e$. Since P(0) = 0 we get e = 0 and so $P(x) = ax^4 + bx^3 + cx^2 + dx$. Now, we see that
$P(x + 1) - P(x) = a(x + 1)^4 + b(x + 1)^3 + c(x + 1)^2 + d(x + 1) - (ax^4 + bx^3 + cx^2 + dx) = a(x^4 + 4x^3 + 6x^2 + 4x + 1) + b(x^3 + 3x^2 + 3x + 1) + c(x^2 + 2x + 1) + d(x + 1) - ax^4 - bx^3 - cx^2 - dx = 4ax^3 + 3(2a + b)x^2 + (4a + 3b + 2c)x + (a + b + c + d)$.
Hence, by the relation $P(x + 1) - P(x) = 4x^3$ we get the following system of equations:
4a = 4, 2a + b = 0, 4a + 3b + 2c = 0, a + b + c + d = 0.
This has the unique solution a = 1, b = −2, c = 1 and d = 0, thus $P(x) = x^4 - 2x^3 + x^2 = x^2(x^2 - 2x + 1) = x^2(x - 1)^2$, and the claim follows.
1.G.11. This task can be challenging, especially at this point. However, we will make every effort to explain the code thoroughly, including our comments within the block, to ensure a clear understanding of each step. We will use the def keyword to define a function (or routine), which we name horner_division. This function has two inputs: a list of coefficients starting from degree 0, which we may think of as $[a_0, \dots, a_n]$, and the real number $x_0$. The length of the coefficient list is obtained in Sage by the command len. To perform the division we will combine the for loop with the range() function. Such combinations are essential in SageMath for handling repetitive tasks, as demonstrated in our example. Recall that the range() function generates a sequence of numbers, commonly used in for loops. We can customize such sequences using different arguments:
• One argument: range(stop) generates numbers from 0 to stop − 1.
• Two arguments: range(start, stop) generates numbers from start to stop − 1.
• Three arguments: range(start, stop, step) generates numbers from start towards stop with increments of step.
With all the necessary tools in place, we can now construct our routine. The code is as follows:
def horner_division(p, x0):
    n = len(p)  # the number of coefficients in the polynomial
    q = [0] * (n - 1)  # initialize a list to hold the coefficients of the quotient
    r = p[n - 1]  # start with the leading coefficient (highest degree term)
    for i in range(n - 2, -1, -1):  # loop from the second last coefficient to the first
        q[i] = r  # assign the current remainder as a coefficient in the quotient
        r = p[i] + x0 * r  # update the remainder using the current coefficient and x0
    return q, r  # return the list of quotient coefficients and the remainder
Our routine is now ready to be tested. Let us use it to prove the second claim.
# Example polynomial
# Define the polynomial p(x) = 4 - 3x + 5x^2 - 2x^3 + x^4
p = [4, -3, 5, -2, 1]; x0 = 3
q, r = horner_division(p, x0)
print("Quotient coefficients:", q)
print("Remainder:", r)
Sage’s output has the form
Quotient coefficients: [21, 8, 1, 1]
Remainder: 67
Thus, dividing $p(x) = 4 - 3x + 5x^2 - 2x^3 + x^4$ by (x − 3) we obtain the quotient $q(x) = 21 + 8x + x^2 + x^3$ and the remainder r = 67.
1.G.12. (1) Fractions of the form $z := \frac{z_1}{z_2}$, where $z_1, z_2$ are complex numbers with $z_2 \neq 0$, can be written in the form z = x + iy with x, y ∈ R, after multiplying z by $\frac{\bar{z}_2}{\bar{z}_2}$. This is because $z_2\bar{z}_2 = |z_2|^2 \in R$. For the given z we compute
$\frac{1 + 2i}{4 - 5i} = \frac{1 + 2i}{4 - 5i} \cdot \frac{4 + 5i}{4 + 5i} = \frac{4 + 5i + 8i + 10i^2}{|4 + 5i|^2} = \frac{-6 + 13i}{4^2 + 5^2} = -\frac{6}{41} + i\frac{13}{41}$.
This can be quickly verified in Sage by executing the following cell:
real((1+2*I)/(4-5*I)) == -6/41; imag((1+2*I)/(4-5*I)) == 13/41
For both commands Sage prints out True. You can also obtain the result directly by typing the following command:
z = (1+2*I)/(4-5*I); simplify(z)
(2) Denote by α the quantity that we want to compute. We have
$\alpha := \frac{(1 + i)^{2023}}{(1 - i)^{2020}} = \Big(\frac{1 + i}{1 - i}\Big)^{2020}(1 + i)^3$, and moreover $\frac{1 + i}{1 - i} = \frac{1 + i}{1 - i} \cdot \frac{1 + i}{1 + i} = \frac{(1 + i)^2}{2} = i$.
Hence $\alpha = i^{2020}(1 + i)^3 = i^{2020}(2i - 2)$, and it suffices to compute $i^{2020}$. A direct computation shows that $i^4 = 1$ and moreover $i^8 = 1$. By induction it follows that $i^{4n} = 1$ for all n ∈ N. Thus $i^{2020} = i^{4 \cdot 505} = 1$ and so α = 2i − 2. A verification in Sage is given by the cell
a = ((1+I)**(2023))/((1-I)**(2020)); simplify(a)
(3-4) Both claims are true and indicate that $S^1$ has the structure of a commutative group under multiplication, as discussed in 1.1.1. Indeed, the first assertion is true because the absolute value function is multiplicative, which implies that $S^1$ is closed under multiplication. Specifically, for any $z \in S^1$ we have |z| = 1, and hence $z\bar{z} = |z|^2 = 1$. Moreover, since $|\frac{1}{z}| = \frac{1}{|z|} = 1$, we confirm that $\frac{1}{z} \in S^1$ for any $z \in S^1$, ensuring that $S^1$ is closed under taking reciprocals. Below, we will explore a slightly different but equivalent interpretation of the unit circle, which provides an alternative way to verify these properties.
(5) The statement is true. We will verify this using Sage, and encourage the reader to explore and present alternative methods.
z = var("z"); solve(z**3 + I == 0, z)
1.G.13. The complex numbers that satisfy the conditions of the task lie on a circle in $R^2$ with centre at $P_2 = [-4, -2]$ and radius ρ, where ρ is the distance between $z_2$ and w, i.e., $\rho = 2\sqrt{2} = |z_2 - w|$. Let us illustrate this circle along with the complex numbers $z_1, z_2$ in a figure (see the figure on the left-hand side). Now, $|z_2 - w| = |w - z_2| = |(x + 4) + i(y + 2)|$, and we obtain
$\rho^2 = (2\sqrt{2})^2 = 8 = (x + 4)^2 + (y + 2)^2 = (x^2 + y^2) + 4(2x + y) + 20$, i.e., $(x^2 + y^2) + 4(2x + y) + 12 = 0$. (†)
Moreover, if $b = |z_1 - w| = |(2 - x) + i(6 - y)|$ is the distance between $z_1$ and w, then we compute
$b^2 = (2 - x)^2 + (6 - y)^2 = (x^2 + y^2) - 4(x + 3y) + 40$. (‡)
Finally, the distance between $z_1, z_2$ is given by $d = |z_1 - z_2| = |6 + 8i| = \sqrt{36 + 64} = 10$. Recall that in Sage this can be verified by the command abs(6 + 8*I). We can now proceed with the Pythagorean theorem, which yields the condition $d^2 = \rho^2 + b^2$. Due to (‡) and the above observations, this translates to
$100 = 8 + (x^2 + y^2) - 4(x + 3y) + 40 \iff 52 = (x^2 + y^2) - 4(x + 3y)$.
Hence, together with (†), we arrive at the following system of equations:
$(x^2 + y^2) + 4(2x + y) = -12$, $(x^2 + y^2) - 4(x + 3y) = 52$.
We may multiply the first equation by $-1$ and then add it to the second one. This gives the equation
$$y = -\left(\tfrac{3}{4}x + 4\right).$$
Substituting this expression for $y$ into either of the two equations of the system (or even into the equation (‡)), one arrives at the following quadratic equation determining $x$, namely
$$\tfrac{25}{16}x^2 + 11x + 12 = 0,$$
with discriminant $\Delta = 46 > 0$. Thus, there are two solutions that result in valid triangles, which are illustrated in the figure on the right above. Let us present the explicit form of the solutions:
$$x_{1,2} = \frac{8(-11 \pm \sqrt{46})}{25}, \qquad y_{1,2} = -\left(\frac{3}{4} \cdot \frac{-88 \pm 8\sqrt{46}}{25} + 4\right).$$
Using Sage one may verify this by typing

x, y=var("x, y")
solve([x**2+y**2+4*(2*x+y)+12==0, x**2+y**2-4*(x+3*y)-52==0], x, y)

1.G.25. Consider first the division of a plane by $n$ circles. As we know from 1.G.24, the maximum number $y_n$ of areas a plane can be divided into by $n$ circles satisfies $y_n = y_{n-1} + 2(n-1)$, $y_1 = 2$, that is, $y_n = n^2 - n + 2$. For the maximum number $p_n$ of areas a space can be divided into by $n$ balls, we obtain the recurrence $p_{n+1} = p_n + y_n$, $p_1 = 2$, that is, $p_n = \frac{n}{3}(n^2 - 3n + 8)$.

1.G.26. Let $x_n$ denote the number of regions in question. When $n = 1$, a single plane divides the space into two regions. This means that $x_1 = 2$. Now, each additional plane intersects the already placed planes along lines which divide the new plane into regions. In more detail, the $n$-th plane intersects the existing $n-1$ planes along $n-1$ lines, and these divide the new plane into $2(n-1)$ sectors, each of which splits an existing region in two. Thus the $n$-th plane adds $2(n-1)$ new regions. Therefore we derive the recurrence relation $x_n = x_{n-1} + 2(n-1)$, with the initial condition $x_1 = 2$. It is now straightforward to see that the solution has the form $x_n = n(n-1) + 2$.

1.G.29. (a) For the driver's place we have two choices and the other places are then arbitrary, that is, for the second seat we have four choices, for the third three choices, then two and then one. That makes $2 \cdot 4! = 48$ ways.
(b) Similarly in the bus we have two choices for the driver, and then the other driver plus the passengers can be seated among the 24 seats arbitrarily. First choose the seats to be occupied, that is, $\binom{24}{21}$. Among these seats the people can be seated in $21!$ ways. The solution is $2 \cdot \binom{24}{21} \cdot 21! = \frac{24!}{3}$ ways.

1.G.30. (a) $2^6 = 64$. (b) $\binom{6}{4} = 15$. (c) No head is one possibility, $\binom{6}{0} = 1$; one head gives $\binom{6}{1} = 6$. Thus there are 7 sequences with at most one head and the result is $64 - 7 = 57$.

1.G.31. (i) Let us add three matches to the 40 balls. If we order the balls and matches in a row, the matches divide the balls into 4 sections. We order the boys at random, give the first boy all the balls from the first section, give the second boy all the balls from the second section and so on. It is now evident that the result is $\binom{43}{3} = 12\,341$.
(ii) From the problem statement it is clear that C must receive 11 coins. That can be done in $\binom{33}{11}$ ways. Each of the remaining 22 coins can be given either to A or to B, which gives $2^{22}$ ways. Using the rule of product we obtain the result $\binom{33}{11} \cdot 2^{22}$.
(iii) We divide the boys and the girls independently. Thus the answer is $2^9(2^5 - 7) = 12\,800$.

1.G.33. Each pair of players shakes hands at the introduction. As we know, the number of handshakes is the combination $c(6,2) = \binom{6}{2} = 15$. After a match each of the six players shakes hands six times (with each of six opponents). Hence the required number is $6^2 = 36$.
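Counts of this kind are easy to double-check by machine. The following small Sage cell is our own verification of the numbers in 1.G.31 and is not part of the original solution; binomial is Sage's built-in binomial coefficient:

print(binomial(43, 3))           # 12341, the number of distributions in (i)
print(binomial(33, 11) * 2^22)   # the number of distributions in (ii)
print(2^9 * (2^5 - 7))           # 12800, the number of divisions in (iii)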
1.G.39. (a) The answer is $1/6$.
(b) One can reformulate this task as if we were throwing the dice $n$ times. The probability that the first roll does not result in 1, 3 or 6 is $1/2$. The probability that neither the first nor the second roll does is clearly $1/4$. This is because the result of the first roll does not influence the result of the second roll. Because the event determined by the result of a given roll and the event determined by the result of another roll are always (stochastically) independent, we deduce that the probability is $1/2^n$.
(c) If we denote the event "first dice resulted in five" by $A$ and the event "the sum is 9" by $H$, then it holds
$$P(A|H) = \frac{P(A \cap H)}{P(H)} = \frac{1/36}{4/36} = \frac{1}{4}.$$
Note that the sum 9 occurs when the first die is 3 and the second is 6, the first is 4 and the second is 5, the first is 5 and the second is 4, or the first is 6 and the second is 3. Of those four results (which have the same probability) only one is favourable to the event $A$. Since the probability of $A$ is clearly $1/6 \neq 1/4$, the events are not mutually independent.

1.G.43. The space of all possible events can be viewed as a square $3 \times 3$. Denote by $x$ the arrival time of Michael and by $y$ the arrival time of Alex. In such terms we can claim that these two persons will meet if and only if $|x - y| \leq \frac{1}{2}$. This inequality determines in the square of all possible events the region whose area is $11/36$ of the area of the whole square. Thus, this is also the probability of the event.

1.G.44. Dividing the quadrilateral into two triangles $ABC$ and $ACD$ with areas $S_1$ and $S_2$ we obtain
$$S = S_1 + S_2 = \frac{1}{2}\begin{vmatrix} 1-0 & 1-0 \\ -1+2 & 5+2 \end{vmatrix} + \frac{1}{2}\begin{vmatrix} 1-0 & -1-0 \\ 5+2 & 1+2 \end{vmatrix} = \frac{1}{2}(7-1) + \frac{1}{2}(3+7) = 8.$$

1.G.45. Eliminating $y$ yields $(4x - 2y - 10) + (x + 2y - 5) = 0$, from which $x = 3$, and hence $y = 1$. Hence the lines intersect at the point $P = [3, 1]$.

1.G.46. The given vectors correspond to complex numbers $1+i$, $2+i$ and $3+i$. We are to find the sum of their arguments. According to de Moivre's formula this equals the argument of their product. Their product is $(1+i)(2+i)(3+i) = (1+3i)(3+i) = 10i$, which is a purely imaginary number with argument $\pi/2$. So the sum we are looking for is $\pi/2$.

1.G.49. Clearly, $A^2 = A$ and hence $E - A^2 = E - A = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$. Also, $B$ is symmetric, i.e., $B = B^T$, and hence $BB^T = B^2 = \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix}$. Thus, the equation $X(E - A^2) + BB^T = D$ is equivalent to
$$\begin{pmatrix} 5 & b+4 \\ 4 & d+5 \end{pmatrix} = \begin{pmatrix} 5 & -b \\ 4 & -4d \end{pmatrix} \iff \begin{pmatrix} 0 & 2b+4 \\ 0 & 5d+5 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}.$$
Its solution is given by $b = -2$, $d = -1$, and $a, c \in \mathbb{R}$ free. In Sage we can directly compute the expression $X(E - A^2) + BB^T - D$ as follows:

var("a, b, c, d")
X=matrix([[a, b], [c, d]]); D=matrix([[5, -b], [4, -4*d]])
A=matrix([[1, 0], [0, 0]]); B=matrix([[1, 2], [2, 1]])
E = identity_matrix(2)
show(X*E-X*A^2+B*(B.transpose())-D)

1.G.59. The number we need to determine is obtained by subtracting the number of non-surjective mappings from the number of all mappings. The number of all mappings is $V(3,4) = 3^4$. Non-surjective mappings have either a one-element or a two-element codomain. There are just three mappings with a one-element codomain. The number of mappings with a two-element codomain is $\binom{3}{2}(2^4 - 2)$ (there are $\binom{3}{2}$ ways to choose the codomain and for a fixed two-element codomain there are $2^4 - 2$ ways to map four elements onto it). Therefore, the number of surjective mappings is $3^4 - \binom{3}{2}(2^4 - 2) - 3 = 36$.
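The inclusion–exclusion count in 1.G.59 can also be confirmed by brute force enumeration. The following cell is our own sketch, not part of the original text: it runs through all $3^4$ mappings of a four-element set into a three-element set and counts the surjective ones.

from itertools import product
# every f below is a 4-tuple of values in {0, 1, 2}, i.e., a mapping of a 4-set into a 3-set
count = sum(1 for f in product(range(3), repeat=4) if set(f) == {0, 1, 2})
print(count)  # prints 36, in agreement with the computation above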
1.G.63. We see that
$$R^{-1} = \{(4,0), (0,-3), (\pi,5), (2,5), (2,0)\}.$$
Moreover,
$$R \circ R^{-1} = \{(4,4), (0,0), (\pi,\pi), (2,2), (4,2), (\pi,2), (2,\pi), (2,4)\}.$$

1.G.65. Here are the solutions:
i) $(3,3), (4,4), (5,5), (6,6), (7,7), (3,6)$; check that it is an ordering relation.
ii) again $(i,i)$ for $i = 1, \ldots, 7$ and additionally $(3,6), (6,3)$; check that it is an equivalence relation.
iii) $(i,i)$ for $i = 1, \ldots, 7$ and also $(3,6), (6,3), (4,6), (6,4)$. Check that it is not an equivalence, since transitivity does not hold.

In the previous chapter we warmed up by considering relatively simple problems which did not require any sophisticated tools. It was enough to use addition and multiplication of scalars. In this and subsequent chapters we shall add more sophisticated thoughts and concepts. First we restrict ourselves to concepts and operations built from finitely many additions and multiplications of finitely many scalars. This will take us three chapters and only then will we move on to infinitesimal concepts and tools.

Typically we deal with finite collections of scalars of a given size. We speak about "linear objects" and "linear algebra". Although it might seem to be a very special tool, we shall see later that even more complicated objects are studied mostly using their "linear approximations". In this chapter we will work with finite sequences of scalars. Such sequences arise in real-world problems whenever we deal with objects described by several parameters, which we shall call coordinates. Do not try too hard to imagine a space with more than three coordinates. You have to live with the fact that we are able to depict only one, two or three dimensions. However, we will deal with an arbitrary number of dimensions. For example, observing any parameter in a group of 500 students (for instance, their study results), our data will have 500 elements and we would like to work with them. Our goal is to develop tools which will work well even if the number of elements is large.

Do not be afraid of terms like field or ring of scalars $\mathbb{K}$. Simply imagine any specific domain of numbers. Rings of scalars are for instance the integers $\mathbb{Z}$ and all residue classes $\mathbb{Z}_k$. Among fields we have seen only $\mathbb{R}$, $\mathbb{Q}$, $\mathbb{C}$ and the residue classes $\mathbb{Z}_k$ for $k$ prime. $\mathbb{Z}_2$ is very specific among them, because the equation $x = -x$ does not imply $x = 0$ there, whereas in nearly every other field it does.

1. Vectors and matrices

In the first two parts of this chapter, we will work with vectors and matrices in the simple context of finite sequences of scalars. We can imagine working with integers or residue classes as well as real or complex numbers. We hope to illustrate how easily a concise and formal reasoning can lead to strong results valid in a much broader context than just for real numbers.

CHAPTER 2

Elementary linear algebra

Can't you count with scalars yet? – no worry, let us go straight to matrices...

A. Vectors and matrices

The vectors ($n$-tuples of scalars) and matrices (2-dimensional arrays of scalars) are the backbone of most of the computational power in practical tasks in all disciplines relying on any kind of data analytics or numerics (e.g., physics, biology, chemistry, engineering, or economics). In this chapter we shall focus on matrix calculus and illustrate some of its simple applications. To start with, we display a few elementary tasks on matrices and vectors, extending the discussion in the first chapter (see, e.g., 1.E.12 and 1.E.14). Recall that the matrix multiplication works only if the matrix on the left has got the same number of columns as the number of rows of the matrix on the right; the number of rows in the result coincides with that of the left matrix, while its number of columns equals the number of columns of the right-hand matrix.
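To get a feeling for this rule, one can let Sage do the bookkeeping. The following toy cell is our own illustration (the matrices are ours, not from the text): it multiplies a $2 \times 3$ matrix by a $3 \times 2$ matrix and back; both products exist here, but they have different types.

A = matrix([[1, 2, 3], [4, 5, 6]])    # a matrix of type 2/3
B = matrix([[1, 0], [0, 1], [2, 2]])  # a matrix of type 3/2
show(A*B)  # type 2/2: the 3 columns of A match the 3 rows of B
show(B*A)  # type 3/3: both products exist, but are of different types
# A*A would raise a TypeError: 3 columns cannot be matched with 2 rows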
Later, we follow the general terminology where the notion of vectors is related to fields of scalars only.

2.1.1. Vectors over scalars. For now, a vector is for us an ordered $n$-tuple of scalars from $\mathbb{K}$, where the fixed $n \in \mathbb{N}$ is called the dimension. We can add and multiply scalars. We will be able to add vectors, but multiplying a vector will be possible only by a scalar. This corresponds to the idea we have already seen in the plane $\mathbb{R}^2$. There, addition is realized as vector composition (as composition of arrows having their direction and size and compared when emanating from the origin). Multiplication by a scalar is realized as stretching the vectors.

A vector $u = (a_1, \ldots, a_n)$ is multiplied by a scalar $c$ by multiplying every element of the $n$-tuple $u$ by $c$. Addition is defined coordinate-wise.

Basic vector operations

For all vectors $u, v$ and scalars $c$ we define
$$u + v = (a_1, \ldots, a_n) + (b_1, \ldots, b_n) = (a_1 + b_1, \ldots, a_n + b_n),$$
$$c \cdot u = c \cdot (a_1, \ldots, a_n) = (c \cdot a_1, \ldots, c \cdot a_n).$$

For vector addition and multiplication by scalars we shall use the same symbols as for scalars, that is, respectively, plus and either dot or juxtaposition, i.e., $cu = c(a_1, \ldots, a_n) = (ca_1, \ldots, ca_n)$.

The vector notation convention. We shall not, unlike many other textbooks, use any special notation for vectors and leave it to the reader to pay attention to the context. For scalars, we shall mostly use letters from the beginning of the alphabet, for the vectors from the end of the alphabet. The middle part of the alphabet can be used for indices of variables or components and also for summation indices.

In the general theory at the end of this chapter and later, we will work exclusively with fields of scalars when talking about vectors. Now we will work with the more relaxed properties of scalars as listed in 1.1.1. For vector addition in $\mathbb{K}^n$, the properties (CG1)–(CG4) (see 1.1.1) clearly hold with the zero element being (notice we define the addition coordinate-wise)
$$0 = (0, \ldots, 0) \in \mathbb{K}^n.$$
We are purposely using the same symbol for both the zero vector and the zero scalar. Next, let us notice the following basic properties of vectors:

Vector properties

For all vectors $v, w \in \mathbb{K}^n$ and scalars $a, b \in \mathbb{K}$ we have
(V1) $a \cdot (v + w) = a \cdot v + a \cdot w$
(V2) $(a + b) \cdot v = a \cdot v + b \cdot v$
(V3) $a \cdot (b \cdot v) = (a \cdot b) \cdot v$
(V4) $1 \cdot v = v$

2.A.1. Matrix multiplication. Check the computations:
(a) $\begin{pmatrix} 1 & 2 \\ -1 & 3 \end{pmatrix} \begin{pmatrix} 1 & -1 \\ 2 & 1 \end{pmatrix} = \begin{pmatrix} 5 & 1 \\ 5 & 4 \end{pmatrix}$,
(b) $\begin{pmatrix} 1 & -1 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ -1 & 3 \end{pmatrix} = \begin{pmatrix} 2 & -1 \\ 1 & 7 \end{pmatrix}$,
(c) $\begin{pmatrix} 1 & 2 & 3 \\ 1 & -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & -1 & 2 & 1 \\ 1 & 1 & -2 & -3 \\ 3 & 2 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 12 & 7 & 1 & -5 \\ 3 & 0 & 5 & 4 \end{pmatrix}$,
(d) $\begin{pmatrix} 1 & 3 & -3 \end{pmatrix} \begin{pmatrix} 1 & -2 & 3 \\ 3 & 2 & 1 \\ 1 & -1 & -4 \end{pmatrix} = \begin{pmatrix} 7 & 7 & 18 \end{pmatrix}$,
(e) $\begin{pmatrix} 1 & 2 & -2 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} = (-2) = -2$.

Remark. As we already mentioned in Chapter 1, the multiplication of matrices is not commutative in general. Parts (a) and (b) in 2.A.1 illustrate this fact for $2 \times 2$ matrices. On the other hand, part (e) illustrates that multiplying rows and columns produces the so-called scalar products of vectors (also called dot products), and our approach to distances and angles will rely on them. For two vectors like $u = (2, 1, 3)^T$ and $v = (1, 2, -2)^T$, we write $\langle u, v \rangle$ or $u \cdot v$ for $v^T u = -2$.
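The same number is produced in Sage by the dot_product method; here is a quick check of our own:

u = vector([2, 1, 3])
v = vector([1, 2, -2])
print(u.dot_product(v))  # prints -2, the scalar product computed above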
As usual, we use the superscript $T$ to indicate the transposition of matrices (i.e., we write the rows of $A$ as columns in $A^T$), and $E$ for the identity matrix, whose dimension will be clear from the context.

2.A.2. For the following matrices and vectors
$$A = \begin{pmatrix} 1 & 1 \\ 0 & \sqrt{2} \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 4 \\ 4 & 0 \end{pmatrix}, \quad u = \begin{pmatrix} 0 \\ \sqrt{2} \end{pmatrix}, \quad v = \begin{pmatrix} 1 \\ 4 \end{pmatrix}$$
compute the given expressions:
1) $(A - E)(A + E)$, 6) $4A^3 - B^2$,
2) $A(Bu)$, 7) $\sqrt{2}\,u - \frac{1}{2}v$,
3) $(A - B)(A + B)u$, 8) $\frac{\sqrt{2}}{2}u - ABu + 2v$,
4) $aA^2 + aB^2$ ($a \in \mathbb{R}$), 9) $A^2 u - B^2 v$,
5) $u \cdot v - 4\sqrt{2}$, 10) $(B^2 u) \cdot (4v)$. ⃝

2.A.3. Compute all tasks in 2.A.2 in Sage.

Solution. A solution goes as follows:

The properties (V1)–(V4) of our vectors are easily checked for any specific ring of scalars $\mathbb{K}$, since we need just the corresponding properties of scalars as listed in 1.1.1, applied to the individual components of the vectors. In this way we shall work with, for instance, $\mathbb{R}^n$, $\mathbb{Q}^n$, $\mathbb{C}^n$, but also with $\mathbb{Z}^n$, $(\mathbb{Z}_k)^n$, $n = 1, 2, 3, \ldots$

2.1.2. Matrices over scalars. Matrices are slightly more complicated objects, useful when working with vectors.

Matrices of type m/n

A matrix of the type $m/n$ over scalars $\mathbb{K}$ is a rectangular schema $A$ with $m$ rows and $n$ columns
$$A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{pmatrix}$$
where $a_{ij} \in \mathbb{K}$ for all $1 \leq i \leq m$, $1 \leq j \leq n$. For a matrix $A$ with elements $a_{ij}$ we also use the notation $A = (a_{ij})$.

The vector $(a_{i1}, a_{i2}, \ldots, a_{in}) \in \mathbb{K}^n$ is called the ($i$-th) row of the matrix $A$, $i = 1, \ldots, m$. The vector $(a_{1j}, a_{2j}, \ldots, a_{mj}) \in \mathbb{K}^m$ is called the ($j$-th) column of the matrix $A$, $j = 1, \ldots, n$.

Matrices of the type $1/n$ or $n/1$ are actually just vectors in $\mathbb{K}^n$. All general matrices can be understood as vectors in $\mathbb{K}^{mn}$; we just consider all the columns in a row. In particular, matrix addition and matrix multiplication by scalars are defined:
$$A + B = (a_{ij} + b_{ij}), \qquad a \cdot A = (a \cdot a_{ij}),$$
where $A = (a_{ij})$, $B = (b_{ij})$, $a \in \mathbb{K}$. The matrix $-A = (-a_{ij})$ is called the additive inverse to the matrix $A$ and the matrix
$$0 = \begin{pmatrix} 0 & \ldots & 0 \\ \vdots & & \vdots \\ 0 & \ldots & 0 \end{pmatrix}$$
is called the zero matrix. By considering matrices as $mn$-dimensional vectors, we obtain the following:

Proposition. The formulas for $A + B$, $a \cdot A$, $-A$, $0$ define the operations of addition and multiplication by scalars for the set of all matrices of the type $m/n$, which satisfy properties (V1)–(V4).

2.1.3. Matrices and equations. Many mathematical models are based on systems of linear equations. Matrices are useful for the description of such systems. In order to see this, let us introduce the notion of the scalar product of two vectors, assigning to the vectors $(a_1, \ldots, a_n)$ and $(x_1, \ldots, x_n)$ their product
$$(a_1, \ldots, a_n) \cdot (x_1, \ldots, x_n) = a_1 x_1 + \cdots + a_n x_n.$$
This means, we multiply the corresponding coordinates of the vectors and sum the results.

E=matrix.identity(2)
A=matrix([[1, 1], [0, sqrt(2)]])
B=matrix([[1, 4], [4, 0]])
u=vector([0, sqrt(2)])
v=vector([1, 4])
show((A-E)*(A+E))
show(A*B*u)
show((A-B)*(A+B)*u)
show(A^2+B^2)
u.dot_product(v)
show(4*A^3-B^2)
show(sqrt(2)*u-(1/2)*v)
show((sqrt(2)/2)*u-A*B*u+2*v)
show(A^2*u-B^2*v)
(B^2*u).dot_product(4*v)
□

Doing computations with matrices, we always recommend using Sage to verify your formal computations, especially for matrices of large size.
2.A.4. Determine explicitly each of the following vectors (by hand), and next verify your answer in Sage:
1) $u_1 := (A - E)^2(4B)u$, 3) $u_3 := A(Bu)$,
2) $u_2 := (A - E)^T(B^2 u)$, 4) $u_4 := B(Au)$,
where $E$ is the identity $3 \times 3$ matrix and $A$, $B$, $u$ are respectively given by
$$A = \begin{pmatrix} 1 & \sqrt{2} & \pi \\ 0 & 1 & 1 \\ -\pi & 0 & 2 \end{pmatrix}, \quad B = \begin{pmatrix} 0 & 1 & 1 \\ -1 & 0 & \pi \\ -1 & 0 & 1 \end{pmatrix}, \quad u = \begin{pmatrix} \pi \\ 0 \\ \sqrt{2} \end{pmatrix}. ⃝$$

While explicit computations in high dimensions (i.e., dealing with vectors with a large number of components) can hardly be imagined without the use of computer-aided mathematics software like Sage, the understanding of structure and properties is often based on visual perception. Of course, we are then limited to dimensions two or three, but even there we can gain essential insight into how things may work in general (and this fact might be viewed as one of the strategies to build a mathematical mindset). Thus, we should enjoy the Sage tools designed to plot vectors and similar objects.¹

2.A.5. Use Sage for plotting:
(a) The vectors $u = (1, 2, 3)^T$, $v = (2, -3, 4)^T$, $u + v$, $u - v$, which are all vectors of the 3-dimensional Euclidean space $\mathbb{R}^3$.
(b) Repeat for the vectors $x = (4, 0, 2)^T$, $y = (-3, 0, 1)^T$, $x + y$, and $x - y$.

Solution. (a) The simplest method is to introduce the vectors, and then use the function plot, as follows:

u = vector([1, 2, 3])

¹ Notice that so far Sage allows us to introduce vectors and make operations between them, without a reference to the choice of scalars. Later in 2.C.12 we will see how to treat vectors in Sage, when they are viewed as elements of some specific vector space.

Every system of $m$ linear equations in $n$ variables
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\ &\ \,\vdots \\ a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m \end{aligned}$$
can be seen as a constraint on values of $m$ scalar products with one unknown vector $(x_1, \ldots, x_n)$ (called the vector of variables, or vector variable) and the known vectors of coordinates $(a_{i1}, \ldots, a_{in})$.

The vector of variables can also be seen as a column in a matrix of the type $n/1$, and similarly the values $b_1, \ldots, b_m$ can be seen as a vector $u$, that is again a single column of a matrix of the type $m/1$. Our system of equations can then be formally written as $A \cdot x = u$ as follows:
$$\begin{pmatrix} a_{11} & \ldots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \ldots & a_{mn} \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix}$$
where the left-hand side is interpreted as $m$ scalar products of the individual rows of the matrix (giving rise to a column vector) with the vector variable $x$, whose values are prescribed by the equations. That means that the identity of the $i$-th coordinates corresponds to the original $i$-th equation
$$a_{i1}x_1 + \cdots + a_{in}x_n = b_i,$$
and the notation $A \cdot x = u$ gives the original system of equations.

2.1.4. Matrix product. In the plane, that is, for vectors of dimension two, we developed a matrix calculus which we noticed was effective to work with (see 1.5.4). Now we generalize this calculus and develop all the tools we already know from the plane case to deal with higher dimensions $n$. It is possible to define matrix multiplication only when the dimensions of the rows and columns allow it, that is, when the scalar product is defined for them as before:

Matrix product

For any matrix $A = (a_{ij})$ of the type $m/n$ and any matrix $B = (b_{jk})$ of the type $n/q$ over the ring of scalars $\mathbb{K}$ we define their product $C = A \cdot B = (c_{ik})$ as a matrix of the type $m/q$ with the elements
$$c_{ik} = \sum_{j=1}^{n} a_{ij} b_{jk}, \quad \text{for arbitrary } 1 \leq i \leq m,\ 1 \leq k \leq q.$$
That is, the element $c_{ik}$ of the product is exactly the scalar product of the $i$-th row of the matrix on the left and the $k$-th column of the matrix on the right. For instance we have
$$\begin{pmatrix} 2 & 1 \\ 1 & -1 \end{pmatrix} \cdot \begin{pmatrix} 2 & 1 & 1 \\ -1 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 3 & 2 & 3 \\ 3 & 1 & 0 \end{pmatrix}.$$

v = vector([2, -3, 4])
u_plus_v = u + v
u_minus_v = u - v
uu=plot(u, color="blue")
vv=plot(v, color="green")
uplusv=plot(u_plus_v, color="green")
uminusv=plot(u_minus_v, color="purple")
show(uu+vv+uplusv+uminusv)

However, Sage offers an alternative method for plotting vectors in three dimensions, using the arrow3d function. For instance, to visualize the vectors in (a) you can use the following block:

u=vector([1, 2, 3])
v=vector([2, -3, 4])
u_plus_v=u + v
u_minus_v=u - v
u_ar=arrow3d([0, 0, 0], u, color="red")
v_ar=arrow3d([0, 0, 0], v, color="blue")
u_plus_v_ar=arrow3d([0, 0, 0], u_plus_v, color="green")
u_minus_v_ar=arrow3d([0, 0, 0], u_minus_v, color="purple")
p=u_ar+v_ar+u_plus_v_ar+u_minus_v_ar
p.show()

In this block we used red, blue, green and purple to represent the vectors $u$, $v$, $u+v$ and $u-v$, respectively. Run the code in your Sage environment to generate and view the vector plot. Next apply the same method to analyze the second set of vectors. □

As our first real task essentially involving matrices, we are going to develop a systematic approach to solving systems of linear equations. Our first aim is to introduce the beautiful Gauss elimination method (see 2.1.7 for details), treating a few examples. You might also look into 2.1.3. Later, in Section C, we shall relate the description given here with the important notions of "affine spaces" and "vector spaces".

2.A.6. A colourful example. A company of painters orders 810 litres of paint, to contain 270 litres each of red, green and blue coloured paint. The provider can satisfy this order by mixing the colours he usually sells, namely:
• reddish colour – it contains 50% of red, 25% of green and 25% of blue colour;
• greenish colour – it contains 12.5% of red, 75% of green and 12.5% of blue colour;
• bluish colour – it contains 20% of red, 20% of green and 60% of blue colour.
How many litres of each of the colours at the warehouse have to be mixed in order to satisfy the order?

Solution. Let us denote by
• x – the number of litres of reddish colour to be used;
• y – the number of litres of greenish colour to be used;

2.1.5. Square matrices. If there is the same number of rows and columns in the matrix, we speak of a square matrix. The number of rows or columns is then called the dimension of the matrix. The matrix
$$E = (\delta_{ij}) = \begin{pmatrix} 1 & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & 1 \end{pmatrix}$$
is called the unit matrix, or alternatively, the identity matrix. The numbers $\delta_{ij}$ defined in this way are also called the Kronecker delta. When we restrict ourselves to square matrices over $\mathbb{K}$ of a fixed dimension $n$, the matrix product is defined for any two matrices, so there is a well-defined multiplication operation there. Its properties are similar to those of scalars:

Proposition. On the set of all square matrices of dimension $n$ over an arbitrary ring of scalars $\mathbb{K}$, the multiplication operation is defined with the following properties of rings (see 1.1.1):
(R1) Multiplication is associative.
(R3) The unit matrix $E = (\delta_{ij})$ is the unit element for multiplication.
(R4) Multiplication and addition are distributive.
In general, neither the property (R2) nor (ID) holds.
Therefore, the square matrices for $n > 1$ do not form an integral domain, and consequently they cannot be a (commutative or non-commutative) field.

Proof. Associativity of multiplication – (R1): Since scalars are associative, distributive and commutative, we can compute for any three matrices $A = (a_{ij})$ of type $m/n$, $B = (b_{jk})$ of type $n/p$ and $C = (c_{kl})$ of type $p/q$:
$$A \cdot B = \Big(\sum_j a_{ij} b_{jk}\Big), \qquad B \cdot C = \Big(\sum_k b_{jk} c_{kl}\Big),$$
$$(A \cdot B) \cdot C = \Big(\sum_k \Big(\sum_j a_{ij} b_{jk}\Big) c_{kl}\Big) = \Big(\sum_{j,k} a_{ij} b_{jk} c_{kl}\Big),$$
$$A \cdot (B \cdot C) = \Big(\sum_j a_{ij} \Big(\sum_k b_{jk} c_{kl}\Big)\Big) = \Big(\sum_{j,k} a_{ij} b_{jk} c_{kl}\Big).$$
Note that while computing, we relied on the fact that it does not matter in which order we perform the sums and products, that is, we were relying on the properties of scalars.

We can easily see that multiplication by a unit matrix has the property of a unit element:
$$A \cdot E = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mm} \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = A$$
and similarly from the left, $E \cdot A = A$. It remains to prove the distributivity of multiplication and addition. Again using the distributivity of scalars we can

• z – the number of litres of bluish colour to be used.
By mixing the colours, we want the final colour to contain 270 litres of red. By assumption, the reddish contains 50% of red, the greenish contains 12.5% of red and the bluish 20% of red. Hence, we get the equation
$$0.5x + 0.125y + 0.2z = 270.$$
Similarly, for green and blue colour, respectively, we get
$$0.25x + 0.75y + 0.2z = 270, \qquad 0.25x + 0.125y + 0.6z = 270.$$
From the first equation we have $x = 540 - 0.25y - 0.4z$, and then the second and third equations give $2.75y + 0.4z = 540$ and $0.25y + 2z = 540$, respectively. Hence, $z = 270 - 0.125y$ and after substituting into the first of these, we obtain $2.7y = 432$, that is, $y = 160$. Therefore $z = 270 - 0.125 \cdot 160 = 250$ and $x = 540 - 0.25 \cdot 160 - 0.4 \cdot 250 = 400$. This means that it is necessary to mix 400 litres of reddish, 160 litres of greenish and 250 litres of bluish colour.

Once you have set up the system that describes the problem, you can quickly find the solution using Sage. For example, enter the following block:

var("x, y, z")
eq1=0.5*x+0.125*y+0.2*z-270
eq2=0.25*x+0.75*y+0.2*z-270
eq3=0.25*x+0.125*y+0.6*z-270
solve([eq1==0, eq2==0, eq3==0], x, y, z)

Sage's output is the list

[[x == 400, y == 160, z == 250]]
□

2.A.7. Matrix notation. Approach the previous problem using the matrix notation and solve the matrix equation in Sage.

Solution. A practical approach to solving linear systems of equations relies on matrix notation. The first step is to construct the corresponding coefficient matrix, where each row represents the coefficients of the variables in a specific equation. In particular, the first row contains the coefficients from the first equation, the second row from the second equation, and so forth. In our case, the system in 2.A.6 leads to the $3 \times 3$ matrix
$$A = \begin{pmatrix} 0.5 & 0.125 & 0.2 \\ 0.25 & 0.75 & 0.2 \\ 0.25 & 0.125 & 0.6 \end{pmatrix}.$$
Then we write the system of equations as $Ax = b$, where $x = (x_1, x_2, x_3)^T$ and $b = (270, 270, 270)^T$, respectively. We call $x$ the vector of unknowns, and $b$ the vector of constants on the right-hand side of the system. Merging $A$ and the vector $b$, we arrive at the so-called extended matrix², which in our example has got the form
$$(A \mid b) = \begin{pmatrix} 0.5 & 0.125 & 0.2 & 270 \\ 0.25 & 0.75 & 0.2 & 270 \\ 0.25 & 0.125 & 0.6 & 270 \end{pmatrix}. \qquad (♯)$$

² The extended matrix is also referred to as the augmented matrix.
easily calculate for matrices $A = (a_{ij})$ of the type $m/n$, $B = (b_{jk})$ of the type $n/p$, $C = (c_{jk})$ of the type $n/p$, $D = (d_{kl})$ of the type $p/q$:
$$A \cdot (B + C) = \Big(\sum_j a_{ij}(b_{jk} + c_{jk})\Big) = \Big(\sum_j a_{ij} b_{jk}\Big) + \Big(\sum_j a_{ij} c_{jk}\Big) = A \cdot B + A \cdot C,$$
$$(B + C) \cdot D = \Big(\sum_k (b_{jk} + c_{jk}) d_{kl}\Big) = \Big(\sum_k b_{jk} d_{kl}\Big) + \Big(\sum_k c_{jk} d_{kl}\Big) = B \cdot D + C \cdot D.$$
As we have seen in 1.5.4, two matrices of dimension two do not necessarily commute: for example
$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}.$$
This gives us immediately a counterexample to the validity of (R2) and (ID). For matrices of type $1/1$ both axioms clearly hold, because the scalars themselves have them. For matrices of greater dimension the counterexamples can be obtained similarly: simply place the counterexamples for dimension 2 in the left upper corner, and select the rest to be zero. (Verify this on your own!) □

In the proof we have actually worked with matrices of more general types, thus we have proved the properties in greater generality:

Associativity and distributivity

Matrix multiplication is associative and distributive, that is,
$$A \cdot (B \cdot C) = (A \cdot B) \cdot C, \qquad A \cdot (B + C) = A \cdot B + A \cdot C,$$
whenever all the given operations are defined. The unit matrix is a unit element for multiplication (both from the right and from the left).

2.1.6. Inverse matrices. With scalars we can do the following: from the equation $a \cdot x = b$ with a fixed invertible $a$, we can express $x = a^{-1} \cdot b$ for any $b$. We would like to be able to do this for matrices too. So we need to solve the problem – how to tell that such a matrix exists, and if so, how to compute it? We say that $B$ is the inverse of $A$ if
$$A \cdot B = B \cdot A = E.$$
Then we write $B = A^{-1}$. From the definition it is clear that both matrices must be square and of the same dimension $n$. A matrix which has an inverse is called an invertible matrix or a regular square matrix.

Observe that the extended matrix completely encodes the original linear system of equations (just leaving out the variable names). We shall come back to the extended matrix in the next task. In order to find a solution of $Ax = b$ in Sage, we can use the command A.solve_right(b). For our case:

A=matrix(QQ, [[0.5, 0.125, 0.2], [0.25, 0.75, 0.2], [0.25, 0.125, 0.6]])
b=vector([270, 270, 270])
x = A.solve_right(b); print(x)

Or, instead of A.solve_right(b), we could type A\b, i.e.,

A=matrix(QQ, [[0.5, 0.125, 0.2], [0.25, 0.75, 0.2], [0.25, 0.125, 0.6]])
b=vector([270, 270, 270])
A \ b  # this is a Matlab-like command
□

In the sequel we will refer to A\b as the "backslash" operator in Sage. If we are interested in a system of the form $xA = b$, in order to solve for $x$ we use the command A.solve_left(b). Let us notice that if the matrix equation does not have a solution, Sage returns an error (see also 2.A.20 for a similar case), while if it does have solutions, Sage returns just one. We shall come back to this phenomenon later.

2.A.8. Using the echelon form. Solve the previous problem again, this time using the transformation to the echelon form of the extended matrix.

Solution. The mathematical approach suggests looking for "equivalent" systems of equations with the same solution sets, perhaps much nicer than the original one. The crucial observation is that elementary row transformations allow us to transform the extended matrix into a row echelon form, see 2.1.7.
Recall, the elementary row operations correspond to interchanging any two rows, multiplying each element of a row by a non-zero number, or adding (a multiple of) a row to another one. Dealing with an extended matrix $B = (A \mid b)$, in Sage the elementary row operations are implemented via the following functions:³

B.swap_rows() – interchange two rows,
B.rescale_row() – scale a row by a factor,
B.add_multiple_of_row() – add a multiple of one row to another row and replace.

Recall, a matrix is in row echelon form if it meets the following requirements:
• The first non-zero number from the left in a non-zero row, known as the leading coefficient or pivot, is always to the right of the leading coefficient of the row above it.
• Rows containing only zeros are located at the bottom of the matrix.

³ In Sage there are similar functions encoding the column operations.

In the subsequent paragraphs we derive (among other things) that $B$ is actually the inverse of $A$ whenever just one of the above required equations holds. The other is then a consequence.

We easily check that if $A^{-1}$ and $B^{-1}$ exist, then there also is the inverse of the product $A \cdot B$:
$$(1) \qquad (A \cdot B)^{-1} = B^{-1} \cdot A^{-1}.$$
Indeed, because of the associativity of matrix multiplication proved a while ago, we have
$$(B^{-1} \cdot A^{-1}) \cdot (A \cdot B) = B^{-1} \cdot (A^{-1} \cdot A) \cdot B = E,$$
$$(A \cdot B) \cdot (B^{-1} \cdot A^{-1}) = A \cdot (B \cdot B^{-1}) \cdot A^{-1} = E.$$
Because we can calculate with matrices similarly as with scalars (they are just a little more complicated), the existence of an inverse matrix can really help us with the solution of systems of linear equations: if we express a system of $n$ equations for $n$ unknowns as a matrix product
$$A \cdot x = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mm} \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} = u,$$
and when the inverse of the matrix $A$ exists, then we can multiply from the left by $A^{-1}$ to obtain
$$A^{-1} \cdot u = A^{-1} \cdot A \cdot x = E \cdot x = x,$$
that is, $A^{-1} \cdot u$ is the desired solution. On the other hand, expanding the condition $A \cdot A^{-1} = E$ for the unknown scalars in the matrix $A^{-1}$ gives us $n$ systems of linear equations with the same matrix on the left and different vectors on the right. Thus we should think about methods for solving systems of linear equations.

2.1.7. Equivalent operations with matrices. Let us gain some practical insight into the relation between systems of equations and their matrices. Clearly, searching for the inverse can be more complicated than finding the direct solution to the system of equations. But note that whenever we have to solve several systems of equations with the same matrix $A$ but different right sides $u$, then obtaining $A^{-1}$ can be really beneficial for us.

From the point of view of solving systems of equations $A \cdot x = u$, it is natural to consider the matrices $A$ and vectors $u$ equivalent whenever they give a system of equations with the same solution set. Let us think about possible operations which would simplify the matrix $A$ such that obtaining the solution is easier. We begin with simple manipulations of rows of equations which do not influence the solution, and similar modifications of the right-hand side vector. If we are able to change a square matrix into the unit matrix, then the right-hand side vector is a solution of the original system. If some of the rows of the system vanish during the course of manipulations (that is,

These two conditions ensure that all entries in a column below a leading coefficient are zeros.
Paragraph 2.1.8 explains how to interpret the elementary row transformations as matrix multiplication. This presentation is often very convenient, and we shall come back to it later in Section E; see for example the task in 2.E.3.

We will now apply this method to solve the system posed in 2.A.6. Let $R_1, R_2, R_3$ denote the rows of the matrix obtained at each step of the procedure. We begin by considering the extended matrix $B = (A \mid b)$, as outlined in relation (♯) of 2.A.7, and apply the row operations $R_1 \to 2R_1$, $R_2 \to 4R_2$ and $R_3 \to 4R_3$. This yields
$$B = (A \mid b) \sim \begin{pmatrix} 1 & 0.25 & 0.4 & 540 \\ 1 & 3 & 0.8 & 1080 \\ 1 & 0.5 & 2.4 & 1080 \end{pmatrix}.$$
In Sage these operations are implemented by the block

B=matrix(QQ,[[0.5,0.125,0.2,270],[0.25,0.75,0.2,270],[0.25,0.125,0.6,270]])
B.rescale_row(0,2)  # multiply the 1st row by 2 (Sage labels rows as 0,1,...)
B.rescale_row(1,4)  # the 2nd row by 4
B.rescale_row(2,4)  # the 3rd row by 4
show(B)

where the last command is used to verify the matrix obtained along this first step. Notice that in B.rescale_row the first argument is the index of the row and the second is the scale factor. Next, B.add_multiple_of_row(k, l, c) will replace the $k$-th row $R_k$ by $R_k + cR_l$; here the first argument is the index of the row to be replaced, the second argument is the row to form a multiple of, and the final argument is the scale factor. Using the resulting matrix, we proceed with the row operations $R_2 \to R_2 - R_1$ and $R_3 \to R_3 - R_1$, while keeping $R_1$ untouched:
$$(A \mid b) \sim \begin{pmatrix} 1 & 0.25 & 0.4 & 540 \\ 0 & 2.75 & 0.4 & 540 \\ 0 & 0.25 & 2 & 540 \end{pmatrix}.$$
These row operations also admit an easy description in Sage, simply by adding to the previous cell the code

B.add_multiple_of_row(1,0,-1)  # replace the 2nd row R2 by R2-R1
B.add_multiple_of_row(2,0,-1)  # replace the 3rd row R3 by R3-R1
show(B)

Notice, in Sage we do not need to introduce new matrices when doing row operations successively. Next we multiply both $R_2$ and $R_3$ by 4 and then interchange them, i.e., $R_2 \to 4R_2$, $R_3 \to 4R_3$ and next $R_2 \leftrightarrow R_3$. This provides
$$B \sim \begin{pmatrix} 1 & 0.25 & 0.4 & 540 \\ 0 & 11 & 1.6 & 2160 \\ 0 & 1 & 8 & 2160 \end{pmatrix} \sim \begin{pmatrix} 1 & 0.25 & 0.4 & 540 \\ 0 & 1 & 8 & 2160 \\ 0 & 11 & 1.6 & 2160 \end{pmatrix}.$$
In order to describe this step in Sage, we should continue typing in the previous cell the code given below:

they become zero), then we get some direct information about the solution. Our simple operations are:

Elementary row transformations

• interchanging two rows,
• multiplication of any given row by a non-zero scalar,
• adding another row to any given row.

These operations are called elementary row transformations. It is clear that the corresponding operations at the level of the equations in the system do not change the set of solutions whenever we deal with a field of scalars; they still look reasonable if our ring of coordinates is an integral domain (but multiplying by a divisor of zero might be a problem). Analogously, elementary column transformations of matrices are
• interchanging two columns,
• multiplication of any given column by a non-zero scalar,
• adding another column to any given column.
These do not preserve the solution set, since they change the variables themselves.

Systematically we can use elementary row transformations for subsequent elimination of variables. This gives an algorithm which is usually called the Gaussian elimination method. Henceforth, we shall assume that our scalars come from an integral domain (e.g. integers are allowed, but not, say, $\mathbb{Z}_4$).

Gaussian elimination of variables

Proposition.
Any non-zero matrix over an arbitrary integral domain of scalars $\mathbb{K}$ can be transformed, using finitely many elementary row transformations, into row echelon form:
• For each $j$, if $a_{ik} = 0$ for all columns $k = 1, \ldots, j$, then $a_{kj} = 0$ for all $k \geq i$,
• if $a_{(i-1)j}$ is the first non-zero element of the $(i-1)$-st row, then $a_{ij} = 0$.

Proof. The matrix in row echelon form looks like
$$\begin{pmatrix} 0 & \ldots & 0 & a_{1j} & \ldots & \ldots & \ldots & a_{1m} \\ 0 & \ldots & 0 & 0 & \ldots & a_{2k} & \ldots & a_{2m} \\ \vdots & & & & & & & \vdots \\ 0 & \ldots & \ldots & \ldots & \ldots & 0 & a_{lp} & \ldots \end{pmatrix}.$$
The matrix can (but does not have to) end with some zero rows. In order to transform an arbitrary matrix, we can use a simple algorithm, which will bring us, row by row, to the resulting echelon form:

B.rescale_row(1,4)  # multiply the 2nd row R2 by 4
B.rescale_row(2,4)  # multiply the 3rd row R3 by 4
B.swap_rows(1,2)  # interchange the 2nd and the 3rd rows
show(B)

Observe now that in the matrix obtained in this step, the coefficient of $y$ in the third row is not yet zero. Therefore, it remains to apply one more row operation, namely $R_3 \to R_3 - 11R_2$:
$$B \sim \begin{pmatrix} 1 & 0.25 & 0.4 & 540 \\ 0 & 1 & 8 & 2160 \\ 0 & 0 & -86.4 & -21600 \end{pmatrix}.$$
This matrix is in row echelon form, and in Sage for this last row operation we proceed with the code

B.add_multiple_of_row(2,1,-11)  # replace the 3rd row R3 by R3-11R2
show(B)

Finally, by the so-called backward substitution, we can describe the solution of the initial system. In particular, from the last row we get $z = \frac{-21600}{-86.4} = 250$. Using this, from the second row we obtain $y = 2160 - 8 \cdot 250 = 160$ and finally from the first row we have $x = 540 - 0.4 \cdot 250 - 0.25 \cdot 160 = 400$. □

2.A.9. Based on elementary row transformations solve the linear system posed below. Then use Sage to find an echelon form of the corresponding extended matrix, and verify your solution:
$$x_1 + 2x_2 + 3x_3 = 2, \qquad 2x_1 - 3x_2 - x_3 = -3, \qquad -3x_1 + x_2 + 2x_3 = -3.$$

Solution. Let us express the system in the matrix form $Ax = b$:
$$\begin{pmatrix} 1 & 2 & 3 \\ 2 & -3 & -1 \\ -3 & 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 2 \\ -3 \\ -3 \end{pmatrix}.$$
The extended matrix is given by
$$B := (A \mid b) = \begin{pmatrix} 1 & 2 & 3 & 2 \\ 2 & -3 & -1 & -3 \\ -3 & 1 & 2 & -3 \end{pmatrix}.$$
Let us now perform elementary row operations to transform the extended matrix into row echelon form. We have:
$$(A \mid b) \sim \begin{pmatrix} 1 & 2 & 3 & 2 \\ 0 & -7 & -7 & -7 \\ 0 & 7 & 11 & 3 \end{pmatrix} \sim \begin{pmatrix} 1 & 2 & 3 & 2 \\ 0 & -7 & -7 & -7 \\ 0 & 0 & 4 & -4 \end{pmatrix} \sim \begin{pmatrix} 1 & 2 & 3 & 2 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{pmatrix}.$$
Above, we first applied the operations $R_2 \to R_2 - 2R_1$, $R_3 \to R_3 + 3R_1$ on $B$ to obtain the first mentioned matrix. Using this, we proceeded with $R_3 \to R_3 + R_2$, to get the

Gaussian elimination algorithm

(1) By a possible interchange of rows we can obtain a matrix where the first row has a non-zero element in the first non-zero column. In other words, if that column is column $j$, then $a_{1j} \neq 0$, while $a_{iq} = 0$ for all $i$ and all $q$ with $1 \leq q < j$.
(2) For each $i = 2, \ldots$, multiply the first row by the element $a_{ij}$, multiply the $i$-th row by the element $a_{1j}$ and subtract, to obtain $a_{ij} = 0$ on the $i$-th row.
(3) By repeated application of the steps (1) and (2), always on the not-yet-echelon part of rows and columns of the matrix, we reach, after a finite number of steps, the final form of the matrix.

This algorithm clearly stops after a finite number of steps and it provides the proof of the proposition. □

The given algorithm is really the usual elimination of variables used in the systems of linear equations.
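The algorithm can be turned into code almost word for word. The following Sage routine is our own sketch of the procedure (the name echelon_integral and the test matrix are ours, not part of the text); following step (2), it multiplies rows instead of dividing, so it stays inside an integral domain such as $\mathbb{Z}$:

def echelon_integral(A):
    # bring A to row echelon form using only the transformations above
    M = copy(A)
    row = 0
    for col in range(M.ncols()):
        # step (1): find a row with a non-zero entry in the current column
        pivot = next((i for i in range(row, M.nrows()) if M[i, col] != 0), None)
        if pivot is None:
            continue  # the whole column is zero, move to the next one
        M.swap_rows(row, pivot)
        # step (2): clear the entries below the pivot without any division
        for i in range(row + 1, M.nrows()):
            c = M[i, col]
            if c != 0:
                M.rescale_row(i, M[row, col])      # multiply row i by the pivot
                M.add_multiple_of_row(i, row, -c)  # subtract c times the pivot row
        row += 1
    return M

show(echelon_integral(matrix(ZZ, [[2, -1, 3], [3, 16, 7], [3, -5, 4]])))

On this integer matrix the routine returns the echelon form with rows $(2, -1, 3)$, $(0, 35, 5)$, $(0, 0, 0)$, in agreement with the hand computation in 2.A.10 below.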
In a completely analogous manner we define the column echelon form of matrices, and considering column elementary transformations instead of the row ones, we obtain an algorithm for transforming matrices into the column echelon form.

Remark. Although we could formulate the Gaussian elimination for general scalars from any ring, this does not make much sense in view of solving equations. Clearly, having divisors of zero among the scalars, we might get zeros during the procedure and lose information this way. Think carefully about the differences between the choices $\mathbb{K} = \mathbb{Z}$, $\mathbb{K} = \mathbb{R}$ and possibly $\mathbb{Z}_2$ or $\mathbb{Z}_4$. On the other hand, if we are dealing with fields of scalars, we can always arrive at a row echelon form where the non-zero entries on the "diagonal" are ones. This is done by applying the appropriate scalar multiplication to each individual row. However, this is not possible in general – think for instance of the integers $\mathbb{Z}$.

2.1.8. Matrices of elementary row transformations. Let us now restrict ourselves to fields of scalars $\mathbb{K}$, that is, every non-zero scalar has an inverse. Note that elementary row or column transformations correspond respectively to multiplication from the left or right by the following matrices (only the differences from the unit matrix are indicated):

(1) Interchanging the $i$-th and $j$-th row (column): the unit matrix with the $i$-th and $j$-th diagonal entries replaced by the anti-diagonal block
$$\begin{pmatrix} \ddots & & & & \\ & 0 & \ldots & 1 & \\ & \vdots & \ddots & \vdots & \\ & 1 & \ldots & 0 & \\ & & & & \ddots \end{pmatrix} \begin{matrix} \\ \leftarrow i\text{-th row} \\ \\ \leftarrow j\text{-th row} \\ \end{matrix}$$

second matrix, and the final step consists of the operations $R_2 \to -\frac{1}{7}R_2$ and $R_3 \to \frac{1}{4}R_3$, which give the third matrix. This one is in row echelon form and implies directly $x_3 = -1$, and moreover
$$x_1 + 2x_2 + 3x_3 = 2, \qquad x_2 + x_3 = 1,$$
from where we specify $x_2 = 2$ and $x_1 = 1$.

Sage provides an easy way to compute a row echelon form corresponding to a matrix $B$, via the command B.echelon_form(). For our example we can type

A=matrix([[1,2,3],[2,-3,-1],[-3,1,2]])
b=vector([2, -3, -3])
B=A.augment(b)  # the augmented matrix
show(B); show(B.echelon_form())

This code returns the extended matrix $B$ and a row echelon form associated to $B$:⁴
$$\begin{pmatrix} 1 & 2 & 3 & 2 \\ 0 & 7 & 3 & 11 \\ 0 & 0 & 4 & -4 \end{pmatrix}.$$
Note that Sage's answer differs from ours. However, there is no need to worry, as there are various methods for performing row reduction, and thus the row echelon form of a matrix is not unique! From the last row of this matrix we obtain $4x_3 = -4$, that is, $x_3 = -1$, and so the second row, which reads $7x_2 + 3x_3 = 11$, becomes $7x_2 = 14$, so $x_2 = 2$. From the first row we have $x_1 + 2x_2 + 3x_3 = 2$, and by replacing the values found for $x_2, x_3$ we get $x_1 = 1$, the same solution as above. □

In summary, the most efficient method for solving large systems of linear equations involves applying elementary row transformations to achieve row echelon form. This technique, known as Gaussian elimination, can be optimized through various sophisticated schemes for selecting the best pivots in numerical contexts. Subsequent examples will illustrate that systems of linear equations may have infinitely many solutions, exactly one solution, or none at all. For a theoretical exploration of the classification of solutions using matrix invariants and other methods, see 2.1.13 and 2.3.5.

2.A.10. Using the Gauss elimination method solve the following linear system:
$$2x_1 - x_2 + 3x_3 = 0, \quad 3x_1 + 16x_2 + 7x_3 = 0, \quad 3x_1 - 5x_2 + 4x_3 = 0, \quad -7x_1 + 7x_2 - 10x_3 = 0.$$

⁴ If we want the vertical line in the presentation of the extended matrix, we can type B = A.augment(b, subdivide=True).
(2) Multiplication of the $i$-th row (column) by the scalar $a$: the unit matrix with the $i$-th diagonal entry replaced by $a$,
$$\begin{pmatrix} \ddots & & \\ & a & \\ & & \ddots \end{pmatrix} \leftarrow i\text{-th row}.$$

(3) Adding the $j$-th row (column) to the $i$-th row (column): the unit matrix with one additional unit entry at the position $(i, j)$, i.e., in the $i$-th row and $j$-th column,
$$\begin{pmatrix} \ddots & & & \\ & 1 & \ldots & 1 \\ & & \ddots & \vdots \\ & & & 1 \\ & & & & \ddots \end{pmatrix}.$$

This trivial observation is actually very important, since the product of invertible matrices is invertible (recall 2.1.6(1)) and all elementary transformations over a field of scalars are invertible (the definition of the elementary transformation itself ensures that the inverse transformations are of the same type, and it is easy to determine the corresponding matrix). Thus, the Gaussian elimination algorithm tells us that for an arbitrary matrix $A$ we can obtain its equivalent row echelon form $A' = P \cdot A$ by multiplying from the left with a suitable invertible matrix $P = P_k \cdots P_1$ (that is, by sequential multiplication with $k$ matrices of the elementary row transformations).

If we apply the same elimination procedure to the columns, we can transform any matrix $B$ into its column echelon form $B'$ by multiplying it from the right by a suitable invertible matrix $Q = Q_1 \cdots Q_\ell$. If we start with the matrix $B = A'$ in row echelon form, this procedure eliminates only the still non-zero elements out of the "diagonal" of the matrix, and in the end we can transform the remaining elements to be units. Thus we have verified a very important result which we will use many times in the future:

2.1.9. Theorem. For every matrix $A$ of the type $m/n$ over a field of scalars $\mathbb{K}$, there exist square invertible matrices $P$ and $Q$ of dimensions $m$ and $n$, respectively, such that the matrix $P \cdot A$ is in row echelon form and
$$P \cdot A \cdot Q = \begin{pmatrix} 1 & \ldots & 0 & 0 & \ldots & 0 \\ \vdots & \ddots & & & & \vdots \\ 0 & \ldots & 1 & 0 & \ldots & 0 \\ 0 & \ldots & 0 & 0 & \ldots & 0 \\ \vdots & & & & & \vdots \end{pmatrix}.$$
The number of ones on the diagonal is independent of the particular choice of $P$ and $Q$.

Proof. We have already proved everything but the last sentence. We shall see this last claim below in 2.1.11. □

Solution. This is a so-called homogeneous linear system, since the vector $b$ is the zero vector. In matrix notation it reads as
$$\begin{pmatrix} 2 & -1 & 3 \\ 3 & 16 & 7 \\ 3 & -5 & 4 \\ -7 & 7 & -10 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},$$
that is, $Ax = 0$, where $0$ denotes the zero vector. A homogeneous system of linear equations has either exactly one solution, the zero one, or infinitely many solutions, see 2.3.5 for details. By transforming the matrix into a row echelon form via elementary row operations, we will show that the given system has infinitely many solutions. Initially we obtain
$$A \sim \begin{pmatrix} 2 & -1 & 3 \\ 0 & 35/2 & 5/2 \\ 0 & -7/2 & -1/2 \\ 0 & 7/2 & 1/2 \end{pmatrix}.$$
From there, we see that the second, third and fourth equations are all multiples of the equation $7x_2 + x_3 = 0$. We continue as follows:
$$\begin{pmatrix} 2 & -1 & 3 \\ 0 & 35/2 & 5/2 \\ 0 & -7/2 & -1/2 \\ 0 & 7/2 & 1/2 \end{pmatrix} \sim \begin{pmatrix} 2 & -1 & 3 \\ 0 & 35/2 & 5/2 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \sim \begin{pmatrix} 2 & -1 & 3 \\ 0 & 7 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
We can translate this literally as
$$2x_1 - x_2 + 3x_3 = 0, \qquad 7x_2 + x_3 = 0,$$
where the equations induced by the last two rows are redundant. In particular, $x_3$ is a free variable to which we assign the real parameter $t \in \mathbb{R}$. This gives infinitely many solutions of the form
$$x_2 = -\frac{1}{7}x_3 = -\frac{1}{7}t, \qquad x_1 = \frac{1}{2}(x_2 - 3x_3) = -\frac{11}{7}t.$$
We may verify this result in Sage, as before:

var("x1, x2, x3")
eq1=2*x1-x2+3*x3
eq2=3*x1+16*x2+7*x3; eq3=3*x1-5*x2+4*x3
eq4=-7*x1+7*x2-10*x3
solve([eq1==0,eq2==0,eq3==0,eq4==0], x1, x2, x3)

which answers

x1 == -11/7*r1, x2 == -1/7*r1, x3 == r1

Notice the substitution $t = -7s$ brings our solution to the form $(11s, s, -7s)$, with $s \in \mathbb{R}$, that is,
$$(x_1, x_2, x_3)^T = s \cdot (11, 1, -7)^T, \quad s \in \mathbb{R},$$
where "·" is simply the scalar multiplication. This expression can be obtained very quickly in Sage, by the command A.right_kernel(), where A is the matrix of the linear homogeneous system, as discussed in the next task. □

2.1.10. Algorithm for computing inverse matrices. In the previous paragraphs we almost obtained the complete algorithm for computing the inverse matrix. Using the simple modification below, we either find that the inverse does not exist, or we compute it. Keep in mind that we are still working over a field of scalars.

Equivalent row transformations of a square matrix $A$ of dimension $n$ lead to an invertible matrix $P'$ such that $P' \cdot A$ is in row echelon form. If $A$ has an inverse, then there exists also the inverse of $P' \cdot A$. But if the last row of $P' \cdot A$ is zero, then the last row of $P' \cdot A \cdot B$ is also zero for any matrix $B$ of dimension $n$. Thus, the existence of a zero row in the result of (row) Gaussian elimination excludes the existence of $A^{-1}$.

Assume now that $A^{-1}$ exists. As we have just seen, the row echelon form of $A$ will have exclusively non-zero rows. In particular, all diagonal elements of $P' \cdot A$ are non-zero. But now, we can employ row elimination by the elementary row transformations from the bottom-right corner backwards and also transform the diagonal elements to units. In this way, we obtain the unit matrix $E$. Summarizing, we find another invertible matrix $P''$ such that for $P = P'' \cdot P'$ we have $P \cdot A = E$.

Now observe that we could clearly work with columns instead of row transformations and thus, under the assumption of the existence of $A^{-1}$, we would find a matrix $Q$ such that $A \cdot Q = E$. From this we see immediately that
$$P = P \cdot E = P \cdot (A \cdot Q) = (P \cdot A) \cdot Q = E \cdot Q = Q.$$
That is, we have found the inverse matrix $A^{-1} = P = Q$ for the matrix $A$. Notice that at the point of finding the matrix $P$ with the property $P \cdot A = E$, we do not have to do any further computation, since we have already obtained the inverse matrix.

In practice, we can work as follows:

Computing the inverse matrix

Write the unit matrix $E$ to the right of the matrix $A$, producing an augmented matrix $(A, E)$. Transform the augmented matrix using the elementary row transformations to row echelon form. This produces an augmented matrix $(PA, PE)$, where $P$ is invertible and $PA$ is in row echelon form. By the above, either we may achieve $PA = E$, in which case $A$ is invertible and $P = PE = A^{-1}$, or $PA$ has a row of zeros, in which case we conclude that the inverse matrix for $A$ does not exist.

2.1.11. Linear dependence and rank. In the previous practical algorithms dealing with matrices, we worked all the time with row and column additions and scalar multiplications, seeing the rows and columns as vectors. Such operations are called linear combinations. We shall return to such operations in an abstract sense later on in 2.3.1.
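The procedure in the box is immediate to replay in Sage. The following cell is our own minimal sketch over the rationals (the example matrix is ours): we augment $A$ by $E$, row reduce, and read off the inverse in the right-hand block, comparing with Sage's built-in inverse.

A = matrix(QQ, [[1, 2], [3, 5]])
E = identity_matrix(QQ, 2)
M = A.augment(E).rref()  # row reduce the augmented matrix (A, E)
if M[:, :2] == E:        # the left block became E, so A is invertible
    print(M[:, 2:])                 # the right block is the inverse of A
    print(M[:, 2:] == A.inverse())  # sanity check against Sage: True
else:
    print("A is not invertible")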
Viewing matrices as linear mappings x → Ax, the kernel of a (real) m × n matrix A, also known as the null space of A, is the set of vectors x ∈ Rn satisfying Ax = 0, where 0 denotes the zero vector of degree m. We saw in the previous task that the solution space, i.e., the kernel of A, was spanned by one vector (11, 1, −7)T . To compute the null space of A in Sage we can just type A.right_kernel( ). However, be aware that Sage also provides the command A.kernel( ), which refers to the left kernel of A and it is an alias of the command A.left_kernel(). In our text we will use only the left/right kernel commands to avoid confusions. Now, to obtain a verification of the previous expression you can type A=matrix([[2, -1, 3], [3, 16, 7], [3, -5, 4], [-7, 7, -10]]) sol=A.right_kernel() show(sol) The output has the form RowSpan(11 1 − 7), which essentially means that the corresponding solution is spanned by the vector (11, 1, −7)T , as we described above. Alternatively, you can simply type A.right_kernel(), and then Sage returns a more complex information: Free module of degree 3 and rank 1 over Integer Ring Echelon basis matrix: [11 1 -7] In the first line of this output, the “degree 3” part refers to the “dimension” of the vectors in the kernel, that is 3 × 1 vectors. The “rank 1” part indicates the “dimension” of the right kernel of A, i.e., of the null space of A. Notice, Sage automatically considered A over the ring Z, and therefore it talks about free modules instead of vector spaces. The answer would be different if we enforced the field Q (by adding QQ in the definition of A).5 We will learn about vector spaces, subspaces and the notion of dimension of a vector space in Section C, where we will meet further examples on null spaces. □ 2.A.12. Based on Gauss elimination method, solve the linear system below. Next present the solution in Sage.    3x1 + 3x3 − 5x4 = −8 , x1 − x2 + x3 − x4 = −2 , −2x1 − x2 + 4x3 − 2x4 = 0 , 2x1 + x2 − x3 − x4 = −3 . 5Beware that declaring A to be over the field R, we obtain a completely different result since the numerical errors in the 53 bit arithmetic prevent us to have the echelon form with two zero rows. Thus Sage would answer that there is only the zero solution there! CHAPTER 2. ELEMENTARY LINEAR ALGEBRA But it will be useful to understand their core meaning right now. A linear combination of rows of a matrix A = (aij) of type m/n is understood as an expression of the form c1ui1 + · · · + ckuik , where ci are scalars, uj = (aj1, . . . , ajn) are rows of the matrix A. Similarly, we can consider linear combinations of columns by replacing the above rows uj by the columns uj = (a1j, . . . , amj). If the zero row can be written as a linear combination of some given rows with at least one non-zero scalar coefficient, we say that these rows are linearly dependent. In the alternative case, that is, when the only possibility of obtaining the zero row is to select all the scalars cj equal to zero, the rows are called linearly independent. Analogously, we define linearly dependent and linearly independent columns. Lemma. The number of non-zero “steps” in the row echelon form always equals to the number of linearly independent rows of the matrix. In particular, the elementary row transformations do not change the maximal number of linearly independent rows. Similarly for columns and the column echelon form, and the maximal number of linearly independant rows and columns coincides (and equals to the number of ones in Theorem 2.1.9) Proof. 
Indeed, the claims all follow once we prove that the number $h$ of ones in the matrix $E_h$ from Theorem 2.1.9 does not depend on our choice of the row or column transformations in our procedure. Thus, assume that by two different transformation procedures into the echelon form we obtain two different numbers $h' < h$. But then according to our algorithm there are invertible matrices $P'$, $P''$, $Q'$, and $Q''$ such that
$$E_h = P' \cdot A \cdot Q', \qquad E_{h'} = P'' \cdot A \cdot Q''.$$
In particular, $E_h = P' \cdot P''^{-1} \cdot E_{h'} \cdot Q''^{-1} \cdot Q'$, and so there are invertible matrices $P$ and $Q$ such that $E_{h'} \cdot Q = P \cdot E_h$. Now, on the left-hand side, there are the first $h'$ rows of $Q$, complemented by zero rows. At the same time, on the right-hand side, there are the first $h$ columns of $P$, complemented by zero columns. Thus, the equality implies that taking the first $h$ columns of $P$, only the first $h'$ of the rows there are non-zero. But this shows that our Gaussian elimination algorithm applied to $P$ would show that $P$ is not invertible. Therefore $h' = h$, and we have proved that the number of ones in the matrix $P \cdot A \cdot Q$ in Theorem 2.1.9 is independent of the choice of our elimination procedure, and it is always equal to the number of linearly independent rows in $A$, which must be the same as the number of linearly independent columns in $A$. □

Definition. The maximal number $h(A)$ of linearly independent rows (or columns) of $A$ is called the rank of the matrix. We have the following theorem:

Solution. The corresponding extended matrix has the form
$$B := (A \mid b) = \begin{pmatrix} 3 & 0 & 3 & -5 & -8 \\ 1 & -1 & 1 & -1 & -2 \\ -2 & -1 & 4 & -2 & 0 \\ 2 & 1 & -1 & -1 & -3 \end{pmatrix}.$$
By changing the order of rows (equations) we obtain
$$B \sim \begin{pmatrix} 1 & -1 & 1 & -1 & -2 \\ 2 & 1 & -1 & -1 & -3 \\ -2 & -1 & 4 & -2 & 0 \\ 3 & 0 & 3 & -5 & -8 \end{pmatrix}$$
which we transform into a row echelon form:
$$\begin{pmatrix} 1 & -1 & 1 & -1 & -2 \\ 2 & 1 & -1 & -1 & -3 \\ -2 & -1 & 4 & -2 & 0 \\ 3 & 0 & 3 & -5 & -8 \end{pmatrix} \sim \begin{pmatrix} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & -3 & 6 & -4 & -4 \\ 0 & 3 & 0 & -2 & -2 \end{pmatrix} \sim \begin{pmatrix} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & 0 & 3 & -3 & -3 \\ 0 & 0 & 3 & -3 & -3 \end{pmatrix} \sim \begin{pmatrix} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & 0 & 3 & -3 & -3 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
Notice that via the command B.echelon_form(), Sage computes a different echelon form. In any case, we obtain three equations in four variables, and thus the initial linear system has infinitely many solutions (as in the previous example, although this is not a homogeneous system). These three equations have exactly one solution for each choice of the free variable $x_4 \in \mathbb{R}$. So we may set $x_4 = t \in \mathbb{R}$, and translate the final matrix of the Gauss elimination as
$$x_1 - x_2 + x_3 - t = -2, \qquad 3x_2 - 3x_3 + t = 1, \qquad 3x_3 - 3t = -3.$$
The solution now can be easily described by back substitution:
$$x_3 = t - 1, \qquad x_2 = \frac{1}{3}(2t - 2), \qquad x_1 = \frac{1}{3}(2t - 5).$$
Of course, the same solution occurs if we use the echelon form that Sage provides. Now, to have a Sage verification of the given solution, it is sufficient to use the cell

Eq1=x1-x2+x3-x4+2
Eq2=3*x2-3*x3+x4-1
Eq3=3*x3-3*x4+3
show(solve([Eq1==0, Eq2==0, Eq3==0], x1, x2, x3))

Notice for $t = 3s$ our solution $(x_1, x_2, x_3, x_4)$ takes the form
$$\left(2s - \frac{5}{3},\ 2s - \frac{2}{3},\ 3s - 1,\ 3s\right), \quad s \in \mathbb{R}. \ □$$

2.A.13. Sage and general systems. Analyze the solutions of the system of linear equations from the previous task in view of the vector operations and Sage tools.

Solution. We could approach the linear system of 2.A.12 by using the backslash operator in Sage:

Theorem. Let $A$ be a matrix of type $m/n$ over a field of scalars $\mathbb{K}$. The matrix $A$ has the same maximal number $h(A)$ of linearly independent rows and of linearly independent columns. In particular, the rank is always at most the minimum of the dimensions of the matrix $A$.
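In Sage the rank is available directly via the rank() method; here is a quick illustration of the theorem on a toy example of our own:

A = matrix(QQ, [[1, 2, 3], [2, 4, 6], [1, 0, 1]])
print(A.rank())              # 2: the second row depends on the first
print(A.transpose().rank())  # also 2, the row and column ranks coincide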
The algorithm for computing the inverse matrix also says that a square matrix A of dimension m has an inverse if and only if its rank equals m.

2.1.12. Matrices as mappings. Similarly to the way we worked with matrices in the geometry of the plane (see 1.5.7), we can interpret every matrix A of the type m/n as a mapping

A : Kn → Km , x → A · x.

By the distributivity of matrix multiplication, it is clear how linear combinations of vectors are mapped by such mappings: A · (a x + b y) = a (A · x) + b (A · y). Straight from the definition we see, by the associativity of multiplication, that the composition of mappings corresponds to matrix multiplication in the given order. Thus invertible matrices of dimension n correspond to bijective mappings A : Kn → Kn.

Remark. From this point of view, the theorem 2.1.9 is very interesting. We can see it as follows: the rank of the matrix determines how large the image of the whole Kn under this mapping is. In fact, if A = P·Ek·Q, where the matrix Ek has k ones as in 2.1.9, then the invertible Q first bijectively “shuffles” the n-dimensional vectors in Kn, the matrix Ek then “copies” the first k coordinates and completes them with the remaining m − k zeros. This “k-dimensional” image then cannot be enlarged by multiplying with P. Multiplying by P can only bijectively reshuffle the coordinates in the image.

2.1.13. Back to linear equations. We shall return to the notions of dimension, linear independence and so on in the third part of this chapter. But we should notice now what our results say about the solutions of the systems of linear equations. If we consider the matrix of the system of equations and add to it the column of the required results, we speak about the extended matrix of the system. The above Gaussian elimination approach corresponds to the sequential elimination of variables in the equations and the deletion of the linearly dependent equations (these are simply consequences of the other equations). Thus we have derived complete information about the size of the set of solutions of the system of linear equations, based on the rank of the matrix of the system. If we are left with more non-zero rows in the row echelon form of the extended matrix than in the original matrix of the system, then there cannot be a solution (simply, we cannot obtain the given vector value with the corresponding linear mapping). If the

A = matrix([[3, 0, 3, -5], [1, -1, 1, -1], [-2, -1, 4, -2], [2, 1, -1, -1]])
b = vector([-8, -2, 0, -3])
x = A.solve_right(b); x

The output is the solution given above with t = 0 only:

(-5/3, -2/3, -1, 0)

As we have seen before, this provides just one solution of the system. But the matrix form Ax = b of the equation reveals that having two such solutions x and y ensures that A(x − y) = Ax − Ay = b − b = 0 (notice we can compute with vectors and matrices just as if they were scalars). On the other hand, if y is a solution of the homogeneous system, Ay = 0, then A(x + y) = b + 0 = b. Hence we can describe all the solutions by picking one solution of the original system and combining the backslash operator with the right_kernel() method, as follows:

A = matrix([[3, 0, 3, -5], [1, -1, 1, -1], [-2, -1, 4, -2], [2, 1, -1, -1]])
b = vector([-8, -2, 0, -3])
x = A.solve_right(b)
k = A.right_kernel().matrix()  # basis of solutions in rows
nrows = k.nrows()              # counts the kernel dimension
if nrows > 0:  # non-unique solution
    t = vector(var("t", n=nrows))  # row of parameters t0,...
    show(x + t*k)
else:  # unique solution
    show(x)

Executing this block, Sage’s output is as at the end of the previous task. Actually, we add the vector subspace of the solutions of the homogeneous system to one particular solution of the original system. □

If we work over a field of scalars, e.g., Q or R, the extended matrices B = ( A | b ) of the linear systems can be transformed further, aiming at having all the pivots equal to one, and all the other entries in the columns of the pivots equal to zero. Such a matrix is in a special row echelon form, known as the reduced row echelon form (often referred to as RREF). It is not difficult to prove that this reduced row echelon form of any matrix is unique, once the pivot columns are fixed.

2.A.14. Come back to the task in 2.A.12 and solve it using the RREF method. Then check with Sage.

Solution. A few elementary row transformations of the augmented matrix lead to

( 1 −1 1 −1 | −2; 0 3 −3 1 | 1; 0 0 3 −3 | −3; 0 0 0 0 | 0 )
∼ ( 1 −1 1 −1 | −2; 0 1 −1 1/3 | 1/3; 0 0 1 −1 | −1; 0 0 0 0 | 0 )

rank of both matrices is the same, then the backwards elimination provides exactly as many free parameters as the difference between the number of variables n and the rank h(A). In particular, there will be exactly one solution if and only if the matrix is invertible. All this will be stated explicitly in terms of abstract vector spaces in the important Kronecker-Capelli theorem, see 2.3.5.

2. Determinants

In the fifth part of the first chapter, we introduced the scalar function det on square matrices of dimension 2 over the real numbers, called the determinant, see 1.5.5. We saw that the determinant assigned a non-zero number to a matrix if and only if the matrix was invertible. We did not say it in exactly this way, but you can check for yourself in the previous paragraphs, starting with 1.5.4 and formula 1.5.5(1). We saw also that determinants were useful in another way, see the paragraphs 1.5.10 and 1.5.11. There we showed that the area of the parallelogram should depend linearly on each of the two vectors defining it. It was useful to require the change of the sign when changing the order of these vectors. Because determinants (and only determinants) have these properties, up to a constant scalar multiple, we concluded that the determinant computes the area. Now we will see that we can proceed similarly for every finite dimension. We work again with arbitrary scalars K and matrices over these scalars. Our results about determinants will thus hold for all commutative rings, notably also for integer matrices or matrices over any residue classes.

2.2.1. Definition of the determinant. Each bijective mapping from a set X to itself is called a permutation of the set X, cf. 1.3.1. If X = {1, 2, . . . , n}, a permutation can be written by putting the resulting ordering into a table:

( 1    2    . . . n    )
( σ(1) σ(2) . . . σ(n) ) .

In this way, we shall view a permutation as a bijection or as an ordering. The element x ∈ X is called a fixed point of the permutation σ if σ(x) = x. If there exist exactly two distinct elements x, y ∈ X such that σ(x) = y, while all other elements z ∈ X are fixed points, then the permutation σ is called a transposition, and we denote it by (x, y). Of course, then σ(y) = x holds for such a transposition.
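For readers who want to experiment right away (a small illustration of ours, ahead of the systematic treatment of permutations in Sage in 2.B.1 below), a permutation is entered in Sage by listing the images of 1, . . . , n, so a transposition looks as follows:

sigma = Permutation([3, 2, 1])  # the transposition (1, 3) on {1, 2, 3}
print(sigma(1), sigma(3))       # 3 1, i.e. sigma swaps 1 and 3
print(sigma.fixed_points())     # [2], all the other elements stay fixed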
For dimension 2, the formula for a determinant was simple – take all possible products of two elements, one from every column and every row of the matrix, give them a sign such that interchanging two columns leads to the change of the sign of the whole result, and sum all of them (that is, both):

A = ( a b; c d ) , det A = ad − bc.

∼ ( 1 −1 0 0 | −1; 0 1 0 −2/3 | −2/3; 0 0 1 −1 | −1; 0 0 0 0 | 0 )
∼ ( 1 0 0 −2/3 | −5/3; 0 1 0 −2/3 | −2/3; 0 0 1 −1 | −1; 0 0 0 0 | 0 ) .

The final matrix is in reduced row echelon form, which evidently directly provides the solution

( x1, x2, x3, x4 )T = ( −5/3, −2/3, −1, 0 )T + t · ( 2/3, 2/3, 1, 1 )T , t ∈ R ,

as above. Notice again that this solution is the sum of a particular solution of our initial system and the general solution of the corresponding homogeneous system Ax = 0, cf. the Kronecker-Capelli Theorem in 2.3.5. We also mention that in the Gauss elimination method, the free variables are the variables whose column in the reduced row echelon form does not contain any pivot (e.g., x4 is the free variable of this specific example, and indeed, the fourth column in the RREF is the unique column which does not contain any pivot).

The reduced echelon form of a given matrix B can be quickly computed in Sage by the command B.rref(). Thus, the following cell verifies the (necessarily unique) expression presented above.

A = matrix([[3, 0, 3, -5], [1, -1, 1, -1], [-2, -1, 4, -2], [2, 1, -1, -1]])
b = vector([-8, -2, 0, -3])
B = A.augment(b)
show(B.rref())

□

2.A.15. Remarks on Sage methods. Although we noticed the importance of B.echelon_form() in 2.A.10 and 2.A.12 (mimicking our handling of matrices by hand), the Sage method B.rref() seems to be much more efficient. We should also mention that Sage provides methods for directly identifying the pivots of the augmented matrix. The relevant command is B.pivots() (or we can use the command B.pivot_rows()). For example, adding B.pivots() in the previous cell, Sage answers (0, 1, 2), which means that the pivots are in the first, second and third columns. Recall now that a linear system of equations is inconsistent (does not have a solution) if there is a pivot in the last column of an echelon form of the augmented matrix. Therefore, we can combine this with the previous commands in Sage to decide quickly whether a linear system of equations is consistent or not.

2.A.16. Use Sage to find the reduced echelon form and the pivot entries of the following matrices:

B± = ( 1 1 −2; 3 2 0; 2 ±1 2 ) , C = ( 2 1 7 0; −1 4 10 5; 3 2 12 10; 0 0 0 10 ) .

Next, decide on the consistency of (a) the linear systems having as extended matrix one of B±; (b) the homogeneous linear system Cx = 0. ⃝

Consider now square matrices A = (aij) of dimension n over K. The formula for the determinant of the matrix A is also composed of all possible products of elements from the individual rows and columns, with properly chosen signs. In dimension 3 we can guess the correct signs easily. The product of the elements on the diagonal should come with a positive sign, and we want anti-symmetry when interchanging two columns or rows.
This gives the so-called Sarrus rule:

Sarrus rule

| a11 a12 a13 |
| a21 a22 a23 | = a11a22a33 + a13a21a32 + a12a23a31 − a13a22a31 − a11a23a32 − a12a21a33
| a31 a32 a33 |

The general definition can be formulated via a sum over all permutations:

Definition of determinant

The determinant of the matrix A is a scalar det A = |A| defined by the relation

|A| = Σ_{σ∈Σn} sgn(σ) a1σ(1) · a2σ(2) · · · anσ(n) ,

where Σn is the set of all possible permutations over {1, . . . , n}, and the symbol sgn(σ) for a permutation σ, called the parity of σ, will be described below. Each of the expressions sgn(σ) a1σ(1) · a2σ(2) · · · anσ(n) is called a term in the determinant |A|.

2.2.2. Parity of permutation. How should we define the sign of a permutation? We say that a pair of elements a, b ∈ X = {1, . . . , n} forms an inversion in the permutation σ if a < b and σ(a) > σ(b). A permutation σ is called even or odd if it contains an even or odd number of inversions, respectively. Thus, the parity sgn(σ) of the permutation σ is (−1)^(number of inversions), and we denote it by sgn(σ). This amounts to our definition of sign for computing the determinant. But we would like to know how to calculate the parity. The following theorem reveals that the Sarrus rule really defines the determinant in dimension 3.

Theorem. Over the set X = {1, 2, . . . , n} there are exactly n! distinct permutations. These can be ordered in a sequence such that every two consecutive permutations differ in exactly one transposition. Every transposition changes parity. For any chosen permutation σ there is such a sequence starting with σ.

Proof. For n = 1 or n = 2, the claim is trivial. We prove the theorem by induction on the size n of the set X. Assume that the claim holds for all sets with n − 1 elements and consider a permutation σ(1) = a1, . . . , σ(n) = an.

Our next focus is on computing inverses A−1 of matrices A ∈ Matn(K) for fields K. Viewing X = A−1 as an unknown matrix, we aim at AX = E (the identity matrix). This provides n systems of linear equations with the same coefficient matrix A (we view the matrix equation column-wise). We already touched this in the case n = 2. We realize that the general case is equivalent to finding the RREF of the augmented matrix (A | E).

2.A.17. Solve the following matrix equations:

( 1 3; 3 8 ) · X1 = ( 1 2; 3 4 ) , X2 · ( 1 3; 3 8 ) = ( 1 2; 3 4 ) . ⃝

2.A.18. Compute the inverse of the matrices

A = ( 4 3 2; 5 6 3; 3 5 2 ) , B = ( 1 0 1; 3 3 4; 2 2 3 ) .

Then determine the matrix ( AT · B )−1.

Solution. As just explained, an n×n matrix A is invertible if and only if A is row-equivalent to the n×n identity matrix E. Hence, to find the inverse we first form the augmented matrix ( A | E ), which in our case reads as

( A | E ) = ( 4 3 2 | 1 0 0; 5 6 3 | 0 1 0; 3 5 2 | 0 0 1 ) .

The goal now is to transform the submatrix A of the augmented matrix into the identity matrix, using elementary row operations on ( A | E ). This process illustrates how Gaussian elimination is applied to compute the inverse of a matrix. Note that this transformation may not always be possible, depending on the rank of A. (Later we shall see that having the maximal rank is equivalent to having a non-zero determinant.) However, whenever we can achieve the unit matrix in the place of A, the original unit submatrix gets transformed into the desired A−1. Notice, this is exactly solving all the given n systems simultaneously, using the RREF method with the shared coefficient matrix A. In this case the final (augmented) matrix will be ( E | A−1 ).
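The whole procedure can also be carried out directly in Sage (a sketch of ours, using only commands met earlier): we augment A with the identity matrix, pass to the reduced row echelon form, and read off the right block. For the matrix A of 2.A.18 this looks as follows:

A = matrix(QQ, [[4, 3, 2], [5, 6, 3], [3, 5, 2]])
AE = A.augment(identity_matrix(QQ, 3))  # the augmented matrix (A | E)
R = AE.rref()                           # reduced row echelon form (E | A^-1)
Ainv = R[:, 3:]                         # the right 3x3 block
print(Ainv == A.inverse())              # True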
According to the induction assumption, all the permutations that end with an can be obtained in a sequence, where every two consecutive permutations differ in one transposition. There are (n − 1)! such permutations. In order to proceed further, we select the last of them, and use the transposition of σ(n) = an with some element ai which has not been at the last position yet. Once again, we form a sequence of all permutations that end with ai. After doing this procedure n times, we obtain n(n−1)! = n! distinct permutations – that is, all permutations on n elements. The resulting sequence satisfies the condition. Note that the last sentence of the theorem does not seem to be useful in practice. But it is a very important part for proving the theorem by induction over the size of X.

It remains to prove the part of the theorem about parities. Consider the ordering (a1, . . . , ai, ai+1, . . . , an), containing r inversions. Then in the ordering (a1, . . . , ai+1, ai, . . . , an) there are either r − 1 or r + 1 inversions. Every transposition (ai, aj) is obtainable by doing (j−i)+(j−i−1) = 2(j−i)−1 transpositions of neighbouring elements. Therefore any transposition changes the parity. Also, we already know that all permutations can be obtained by applying transpositions. □

We found that applying a transposition changes the parity of a permutation, and any ordering of the numbers {1, 2, . . . , n} can be obtained through transpositions of neighbouring elements. Therefore we have proven:

Corollary. On every finite set X = {1, . . . , n} with n elements, n > 1, there are exactly n!/2 even permutations and n!/2 odd permutations.

If we compose two permutations, it means first doing all the transpositions forming the first permutation and then all the transpositions forming the second one. Therefore for any two permutations σ, η : X → X we have sgn(σ ◦ η) = sgn(σ) · sgn(η) and also sgn(σ−1) = sgn(σ).

2.2.3. Decomposing permutations into cycles. A good tool for practical work with permutations is the cycle decomposition, which is also a good exercise on the concept of equivalence.

We compute

( A | E ) = ( 4 3 2 | 1 0 0; 5 6 3 | 0 1 0; 3 5 2 | 0 0 1 )
  R1→R1−R3 ∼ ( 1 −2 0 | 1 0 −1; 5 6 3 | 0 1 0; 3 5 2 | 0 0 1 )
  R2→R2−5R1, R3→R3−3R1 ∼ ( 1 −2 0 | 1 0 −1; 0 16 3 | −5 1 5; 0 11 2 | −3 0 4 )
  R2→R2−R3 ∼ ( 1 −2 0 | 1 0 −1; 0 5 1 | −2 1 1; 0 11 2 | −3 0 4 )
  R3→R3−2R2 ∼ ( 1 −2 0 | 1 0 −1; 0 5 1 | −2 1 1; 0 1 0 | 1 −2 2 )
  R1→R1+2R3, R2→R2−5R3 ∼ ( 1 0 0 | 3 −4 3; 0 0 1 | −7 11 −9; 0 1 0 | 1 −2 2 )
  R2↔R3 ∼ ( 1 0 0 | 3 −4 3; 0 1 0 | 1 −2 2; 0 0 1 | −7 11 −9 ) ,

where the row operations performed at each step are indicated for clarity. This means that

A−1 = ( 3 −4 3; 1 −2 2; −7 11 −9 ) .

As a verification it is easy to see that AA−1 = A−1A = E, or we can directly use the command A.inverse() in Sage:

A = matrix([[4, 3, 2], [5, 6, 3], [3, 5, 2]])
show(A); show(A.inverse())

Observe now that when calculating A−1 we did not have to cope with fractions, thanks to the suitably chosen row transformations (and the fact that A−1 is an integer matrix, too!). We leave the similar steps leading to B−1 to the reader (again, the result happens to be an integer matrix):

B−1 = ( 1 2 −3; −1 1 −1; 0 −2 3 ) .

Based now on the identity ( AT · B )−1 = B−1 · ( AT )−1 = B−1 · ( A−1 )T we finally obtain

( AT · B )−1 = ( −14 −9 42; −10 −5 27; 17 10 −49 ) .

To describe this result in Sage one can continue typing in the previous cell, by adding the code

B = matrix([[1, 0, 1], [3, 3, 4], [2, 2, 3]])
show(B.inverse())
show((transpose(A)*B).inverse())
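As a quick extra check (our own addition to the solution above), the identity used in the last step can also be verified directly in Sage:

A = matrix(QQ, [[4, 3, 2], [5, 6, 3], [3, 5, 2]])
B = matrix(QQ, [[1, 0, 1], [3, 3, 4], [2, 2, 3]])
lhs = (A.transpose()*B).inverse()
rhs = B.inverse()*A.inverse().transpose()
print(lhs == rhs)  # True, confirming (A^T B)^-1 = B^-1 (A^-1)^T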
Cycles

A permutation σ over the set X = {1, . . . , n} is called a cycle of length k if we can find elements a1, . . . , ak ∈ X, 2 ≤ k ≤ n, such that σ(ai) = ai+1 for i = 1, . . . , k − 1, while σ(ak) = a1, and the other elements in X are fixed points of σ. Cycles of length two are transpositions.

Proposition. Every permutation is a composition of cycles. Cycles of even length have parity −1, cycles of odd length have parity 1.

Proof. Fix a permutation σ and define a relation R such that two elements x, y ∈ X are R-related if and only if σℓ(x) = y for some iteration ℓ ∈ Z of the permutation σ (notice σ−1 means the inverse bijection to σ). Clearly, it is an equivalence relation (check it carefully!). Because X is a finite set, for some ℓ it must be that σℓ(x) = x. If we pick one equivalence class {x, σ(x), . . . , σℓ−1(x)} ⊂ X and declare the other elements to be fixed points, we obtain a cycle. Evidently, the original permutation σ is then the composition of all these cycles for the individual equivalence classes, and it does not matter in which order we compose the cycles. For determining the parity we just have to note that cycles of even length can be written as a composition of an odd number of transpositions, therefore their parity is −1. Analogously, a cycle of odd length can be obtained using an even number of transpositions, and therefore it has parity 1. □

2.2.4. Expansion of determinant. Our understanding of the permutations allows us to find the expansion method of computing determinants. The simple idea is to collect, in the determinant sum, the terms containing an element in a fixed row, and to add these contributions along the row. Consider a matrix A = (aij) and let us look at all terms in |A| containing the element a11. By the very definition, these terms correspond to all permutations σ with σ(1) = 1. Thus, the contribution of all these terms to |A| is a11A11, where A11 is the determinant of the matrix obtained from A by omitting the first row and the first column. Similarly, we can take any other fixed element aij in A and look for the contribution of all terms containing it. Again, we could write Aij for the determinant of the matrix obtained from A by omitting the i-th row and the j-th column, and the latter contribution must have terms like in aijAij, but we have to be very careful about the signs. While the actual terms of |A| are of the form sgn(σ) aij a1σ(1) · · · anσ(n) with the i-th factor omitted and σ(i) = j, the signatures of the permutations in Aij, with i and j omitted, might be different. In order to compare it to the previous case i = 1, j = 1, we can change the initial ordering of the elements in the domain and target of the permutations σ. Clearly, i − 1 changes on the domain and j − 1 changes on the target do the job

which also verifies the given expression for B−1. □

2.A.19. Compute the inverse of the following matrices and next provide a verification by Sage:

A = ( 1 0 −2; 2 −2 1; 5 −5 2 ) , B = ( 8 3 0 0 0; 5 2 0 0 0; 0 0 −1 0 0; 0 0 0 1 2; 0 0 0 3 5 ) ,
C = ( 1 1 1 1; 1 1 −1 1; 1 −1 1 −1; 1 −1 −1 1 ) , D = ( 1 i; −i 3 ) ,

where as usual i = √−1. ⃝

We have seen that a system of linear equations Ax = b may have: (1) no solution; (2) a unique solution; (3) infinitely many solutions, depending on one or more free parameters. In the homogeneous case, where b = 0, there is always (at least) one solution.
The collection of all solutions forms a vector space (computed as the kernel of the mapping x → Ax; we shall come back to these concepts in a more abstract way later). If the right-hand side vector b has at least one non-zero entry, we say that the system is non-homogeneous. Then, the space of all solutions is closely related to the ranks of the matrices A and (A | b). The rank of a matrix is defined as the maximal number of its linearly independent rows (which is the same as the maximal number of linearly independent columns in A, cf. 2.1.11). This corresponds to the fact that solutions x of the system Ax = b are the coefficients in the expression of b as a linear combination of the columns of A. We have seen that the case (1) happens if the rank of the augmented matrix is bigger than that of A (which of course cannot happen if b = 0). The number of free parameters in the cases (2) and (3) is the difference between the number of variables and the rank of A. Later we will see that the rank is closely related to the nullity of the matrix, see for example the description in ??.

2.A.20. Determine the rank of the matrix

A = ( 1 −3 0 1; 1 −2 2 −4; 1 −1 0 1; −2 −1 1 −2 ) .

Then, determine the number of solutions of the system

x1 + x2 + x3 − 2x4 = 4 ,
−3x1 − 2x2 − x3 − x4 = 5 ,
2x2 + x4 = 1 ,
x1 − 4x2 + x3 − 2x4 = 3 .

(by “bubbling” the index in question to the first position by consecutive swaps of neighboring positions). Thus, the sign correction is (−1)^(i+j−2), and we have to adjust the value of Aij as in the following algorithm, which is the simplest version of the more general Laplace expansion formula, see 2.2.9 below. The readers not sure about the details of our argumentation here may wait for the detailed proof in the more general situation.

Expansion of determinant

The algebraic complement Aij of the element aij in a matrix A is the (−1)^(i+j)-multiple of the determinant of the matrix obtained from A by omitting the i-th row and the j-th column. Fixing the i-th row or the j-th column,

|A| = Σ_{j=1..n} aij Aij , |A| = Σ_{i=1..n} aij Aij .

The latter formulae correspond to splitting the determinant sum into parts containing the terms with the individual elements in the row or column. For example, an easy application derives the Sarrus rule from the formula in dimension 2 now.

2.2.5. Simple properties. Knowing the properties of permutations and their parities from the previous paragraphs allows us to derive quickly the basic properties of determinants. For every matrix A = (aij) of the type m/n over scalars from K we define the transpose of A as the matrix AT = (a′ij) with elements a′ij = aji. The matrix AT is of the type n/m. A square matrix A with the property A = AT is called symmetric. If A = −AT, then A is called antisymmetric.

Simple properties of determinants

Theorem. Every square matrix A = (aij) satisfies the following conditions:
(1) |AT| = |A|.
(2) If one of the rows contains only zero elements from K, then |A| = 0.
(3) If a matrix B was obtained from A by transposing two rows, then |A| = −|B|.
(4) If a matrix B was obtained from A by multiplying one row by a scalar a ∈ K, then |B| = a |A|.
(5) If all elements of the k-th row in A are of the form akj = ckj + bkj and all the remaining rows in the matrices A, B = (bij), C = (cij) are identical, then |A| = |B| + |C|.
(6) A determinant |A| does not change if we add to any row of A a linear combination of the other rows.
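Before reading the proof, the claims are easy to test on examples. Here is a minimal Sage sketch of ours, checking (1), (3) and (6) on a sample matrix (a numerical test on one instance, of course, not a proof):

A = matrix(QQ, [[1, 2, 3], [1, -1, 2], [3, 2, 2]])
print(A.det() == A.transpose().det())      # (1) |A^T| = |A|
B = A.with_swapped_rows(0, 1)
print(B.det() == -A.det())                 # (3) a row swap flips the sign
C = A.with_added_multiple_of_row(0, 1, 5)  # add 5 times row 2 to row 1
print(C.det() == A.det())                  # (6) the determinant is unchanged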
Moreover, find all solutions of the corresponding homogeneous system, and in addition of the system

x1 − 3x2 = 1 ,
x1 − 2x2 + 2x3 = −4 ,
x1 − x2 = 1 ,
−2x1 − x2 + x3 = −2 . ⃝

B. Determinants

A key object in matrix calculus is the scalar function det, the so-called determinant, which we already came across when discussing the dimension two calculus (see for example 1.E.25, 1.E.29). Now we shall handle the determinant concept for general square matrices of size n. In dimension two, the formula was det A = det(aij) = a11a22 − a12a21, i.e., each term picked up just one entry from each row and column, and there are just two options there. Again, we want to see a function linear in all the individual columns and changing the sign whenever two columns are exchanged. Thus, we want to include all terms with any permutation of columns, to be chosen for the individual rows. This leads us back to the permutations, i.e., bijections σ : {1, . . . , n} → {1, . . . , n}, see the 3rd section of Chapter 1, and 2.2.1. In particular, by 2.2.1 we know that a convenient way to denote permutations is the so-called two-row notation (a table form for the function σ). This notation is adopted in this column, as well.

2.B.1. Decompose the permutation

σ = ( 1 2 3 4 5 6 7 8 9 )
    ( 3 1 6 7 8 9 5 4 2 )

into a product of cycles and a product of transpositions.

Solution. The term “cycle” is closely related to invariant subsets in {1, . . . , n} under the iterated action of σ. We may easily see them from the defining table: start with the first element and look at the second row, iterating σ. We see that 1 is mapped to 3, while the second iteration sends 3 to 6, and so on. We continue until we again reach the starting element (which must happen since the set is finite). We obtain the following sequence of elements, which map to each other under the given permutation: 1 → 3 → 6 → 9 → 2 → 1. Such a sequence is called a cycle; better, we extend it to a bijection by leaving all the other elements as fixed points (see 2.2.3), and we may denote it by (1, 3, 6, 9, 2). Next, choose any element which belongs to the set {1, 2, . . . , 9} but does not belong to our cycle (1, 3, 6, 9, 2), e.g., the integer 4. Applying the same procedure as before, we obtain the cycle (4, 7, 5, 8). Observe that each element from the set {1, 2, . . . , 9} appears now in one of the previous two cycles. Thus, we can express σ as the composition of these two cycles, i.e.,

σ = (1, 3, 6, 9, 2) ◦ (4, 7, 5, 8) = (4, 7, 5, 8) ◦ (1, 3, 6, 9, 2),

Proof. (1) The terms of the determinants |A| and |AT| are in bijective correspondence, where the term sgn(σ) a1σ(1) · a2σ(2) · · · anσ(n) corresponds to the following term of |AT| (notice, multiplication does not depend on the order of the scalars):

sgn(σ) aσ(1)1 · aσ(2)2 · · · aσ(n)n = sgn(σ) a1σ−1(1) · a2σ−1(2) · · · anσ−1(n),

and we have to ensure that this member has the correct sign. But the parities of σ and σ−1 are the same, and so this is really a term in the determinant |AT|, and the first claim is proved.

(2) This comes straight from the definition of the determinant, because all its terms contain exactly one member from every row. Thus, if one of the rows is zero, all terms of the determinant are also zero.

(3) The only change in the terms of |B| compared to |A| is the addition of one transposition in all the permutations, therefore all the signs are reversed.

(4) This follows straight from the definition, because the terms of |B| are just the terms of |A| multiplied by the scalar a.
(5) In every term of |A|, there is exactly one element from the k-th row of the matrix A. By the distributive law for multiplication and addition in K, the claim follows directly from the definition of the determinant.

(6) If there are two identical rows in A, then the terms in the determinant come in pairs which are identical up to the sign, and so they cancel each other. Therefore in this case |A| = 0. Thus, by (5), we can add any other row to the given row without changing the value of the determinant. In view of the claims (4) and (5), we can in fact add a scalar multiple of any other row. □

2.2.6. Computational corollaries. Let us note a nice corollary of the first claim of the previous theorem about the equality of the determinants of the matrix and its transpose. It ensures that whenever we prove some claim about determinants formulated in terms of the rows of the corresponding matrix, we immediately obtain an analogous claim in terms of the columns. For instance, we can immediately formulate all the claims (2)–(6) for linear combinations of columns.

For a while, let us assume we work with a field of scalars K. Then, by the previous theorem, we can use elementary row transformations to bring any square matrix A into row echelon form, without changing the value of its determinant. We just have to be careful and add only linear combinations of other rows to a given one. Thus let us look at the distribution of the elements in the individual terms of a determinant |A| with the dimension of A equal to n > 1. There is just one term with all of its elements on the diagonal. In all other terms, there must be elements both above and below the diagonal (if we place one element outside of the diagonal, we block two diagonal entries and we leave only n − 2 diagonal positions for the other n − 1 elements).

where the second equality occurs since independent cycles commute. Next, we notice how to decompose general cycles into the simplest ones – the transpositions (i, j) (i.e., i → j → i and all other elements remain fixed). A simple check reveals (do it carefully yourselves!) that any cycle (i1, i2, . . . , ik) is a composition of transpositions of the neighbouring numbers, starting from the back. Thus, in our case

(1, 3, 6, 9, 2) = (1, 3) ◦ (3, 6) ◦ (6, 9) ◦ (9, 2) ≡ (1, 3)(3, 6)(6, 9)(9, 2) .

For example, evaluating this resulting bijection on 2:

((1, 3)(3, 6)(6, 9)(9, 2))(2) = ((1, 3)(3, 6)(6, 9))(9) = ((1, 3)(3, 6))(6) = (1, 3)(3) = 1,

as expected. This means σ = (1, 3)(3, 6)(6, 9)(9, 2)(4, 7)(7, 5)(5, 8) (notice that this is also a decomposition into the minimal number of transpositions).

As we saw in Chapter 1, Sage offers a rich portfolio of methods available for solving combinatorial tasks. To treat the given task in Sage, we need to define a permutation p of a finite set, which can be done via the command Permutation(p). Also, we can determine the cycles of p by the command Permutation(p).to_cycles(). Let us illustrate the situation for p = σ, where σ is as above:

p = Permutation([3, 1, 6, 7, 8, 9, 5, 4, 2])
c = p.to_cycles()
ts = []  # collecting transpositions
for x in c:
    for i in range(len(x) - 1):
        ts.append((x[i], x[i + 1]))
print(ts)

This block returns the following list, verifying our answer:

[(1, 3), (3, 6), (6, 9), (9, 2), (4, 7), (7, 5), (5, 8)] □

We are ready to summarize: the right formula for the determinant of a matrix A = (aij) of size n is

det(A) = |A| = Σ_{σ∈Σn} sgn(σ) a1σ(1) . . . anσ(n) ,

where the parity sgn(σ) is ±1, according to the even or odd number of transpositions constituting σ.
This is well defined. Clearly this definition fulfills our wish to have an expression linear in each of the columns, while changing its sign with each transposition of columns. Note that the minimal number of transpositions in the decomposition of a given permutation can be obtained as follows: we first decompose the permutation into a product of independent cycles, and then the cycles canonically into transpositions (actually the length of each cycle determines its parity, and the parities multiply). You may find more details in 2.2.2, where the parity is defined in terms of the so-called inversions.

Therefore, if the matrix A is in a row echelon form, then every term of |A| is zero, except the term with exclusively diagonal entries. This proves the following algorithm:

Computing determinants using elimination

Lemma. If A is in the row echelon form, then |A| = a11 a22 · · · ann.

This observation gives an effective method for computing determinants using the Gauss elimination method for matrices over a field of scalars K, see the paragraph 2.1.7. Notice that the very same argumentation allows us to stop the elimination once the first k columns are in the requested form, and to find the determinant of the matrix B of dimension n − k in the right bottom corner of A in another way. The result will then be |A| = a11 a22 · · · akk |B|.

As a useful (theoretical) illustration of this principle, we shall derive the following formula for the direct calculation of solutions of systems of linear equations. For the sake of simplicity, we still work with a field of scalars now. (But we shall see later that the Cramer rule actually works for all scalars.)

Cramer rule

Proposition. Consider the system of n linear equations in n variables with the matrix of the system A = (aij) and the column of values b = (b1, . . . , bn). In matrix notation this means we are solving the equation A · x = b. If there exists the inverse |A|−1, then the individual components of the unique solution x = (x1, . . . , xn) are given as xi = |Ai| |A|−1, where the matrices Ai arise from the matrix A of the system by replacing the i-th column by the column b of values.

Proof. Even for general scalars, if A−1 exists, then there must be a unique solution. As we have already seen, working over a field of scalars the inverse of the matrix of the system exists if and only if the system has a unique solution, and this in turn happens if and only if |A| ̸= 0. If we have such a solution x, we can express the column b in the matrix Ai by the corresponding linear combination of the columns of the matrix A, that is, the values bk = ak1 x1 + · · · + akn xn. Then, by subtracting the xℓ-multiples of all the other ℓ-th columns from this i-th column in Ai, we arrive at just the xi-multiple of the original column of A. The number xi can thus be brought in front of the determinant to obtain the equation |Ai| = xi |A|, and thus |Ai| |A|−1 = xi |A| |A|−1 = xi, which is our claim. □

Notice also that the properties (3)–(5) from the previous theorem say that the determinant (considered as a mapping which assigns a scalar to n vectors of dimension n) is an

In the two-row notation for permutations, the number of inversions is obtained by going through the columns and counting, for each column, the subsequent columns with a smaller entry in the second row.

2.B.2. Determine the parity of the permutation σ in 2.B.1, and of the permutation

τ = ( 1 2 3 4 5 6 )
    ( 2 4 6 1 5 3 ) . ⃝

2.B.3. Check the previous result in Sage.

Solution.
It is very easy to obtain the parity of a permutation in Sage. This is done via the command p.sign(), where p is the given permutation. For instance, for the given σ and τ in 2.B.2 we can type:

s = Permutation([3, 1, 6, 7, 8, 9, 5, 4, 2])
t = Permutation([2, 4, 6, 1, 5, 3])
print(s.sign()); print(t.sign())

Sage says that σ, τ are odd by returning −1 in both cases. □

2.B.4. Compute the determinant of the matrices

A = ( 1 2; 2 1 ) , B = ( 1 2 3; 1 −1 2; 3 2 2 ) , C = ( 1 1 1; 1 0 0; −2 0 1 ) .

Solution. For A we compute det(A) = 1·1 − 2·2 = −3. As for the 3 × 3 matrices, the definition yields the so-called Sarrus rule, see 2.2.1. In particular, for B we get

det(B) = 1·(−1)·2 + 2·2·3 + 3·1·2 − 3·(−1)·3 − 1·2·2 − 2·1·2 = 17.

Another way to remember this rule:

det ( a b c; d e f; ℓ m n ) = a·α − b·β + c·γ ,

where

α = det ( e f; m n ) , β = det ( d f; ℓ n ) , γ = det ( d e; ℓ m ) .

Similarly, for the matrix C we obtain

det(C) = 1·0·1 + 1·0·(−2) + 1·1·0 − 1·0·(−2) − 1·0·0 − 1·1·1 = −1 .

Of course, all cases can be done in Sage, as in Chapter 1. Let us recall the relevant code:

B = matrix([[1, 2, 3], [1, -1, 2], [3, 2, 2]])
det(B)

□

There are several methods for computing determinants of big matrices. One of them is to use the elementary row (or column) transformations. Remember though that we are allowed to add linear combinations of other rows to the transformed one, since multiplying a row clearly results in the same multiple of the resulting determinant. If A is in a row-echelon (column-echelon) form, then det(A) is just the product of all the diagonal elements (straight from the definition!). See a more detailed review of the elementary properties in 2.2.5.

antisymmetric mapping linear in every argument, exactly as we required in analogy to the 2-dimensional case.

2.2.7. Further properties of the determinant. Later we will see that, exactly as in dimension 2, the determinant of a matrix equals the (oriented) volume of the parallelepiped determined by the columns of the matrix. We shall also see that considering the mapping x → A · x given by the square matrix A on Rn, we can understand the determinant of this matrix as expressing the ratio between the volume of the parallelepiped given by the vectors x1, . . . , xn and that of their images A · x1, . . . , A · xn. Because the composition x → A·x → B·(A·x) of mappings corresponds to the matrix multiplication, the Cauchy theorem below is easy to understand:

Cauchy theorem

Theorem. Let A = (aij), B = (bij) be square matrices of dimension n over the ring of scalars K. Then |A · B| = |A| · |B|.

In the next paragraphs, we derive this theorem in a purely algebraic way, in particular because the previous argumentation based on geometric intuition could hardly work for arbitrary scalars. The basic tool is the determinant expansion using one or more rows or columns, which we have seen in the simplest case of single rows or columns in 2.2.4. We will also need a little technical preparation. The reader who is not fond of too much abstraction can skip these paragraphs and note only the statement of the Laplace theorem and its corollaries. Notice also that the claims (2), (3) and (6) from the theorem 2.2.5 are easily deduced from the Cauchy theorem and the representation of the elementary row transformations as multiplication by suitable matrices (cf. 2.1.8).

2.2.8. Minors of the matrix. When investigating matrices and their properties, we often work only with parts of the matrices.
Therefore we need some new concepts.

Submatrices and minors

Let A = (aij) be a matrix of the type m/n and let 1 ≤ i1 < . . . < ik ≤ m, 1 ≤ j1 < . . . < jℓ ≤ n be fixed natural numbers. Then the matrix

M = ( ai1j1 ai1j2 . . . ai1jℓ ; . . . ; aikj1 aikj2 . . . aikjℓ )

of the type k/ℓ is called a submatrix of the matrix A determined by the rows i1, . . . , ik and the columns j1, . . . , jℓ. The remaining (m − k) rows and (n − ℓ) columns determine a matrix M∗ of the type (m − k)/(n − ℓ), which is called the complementary submatrix to M in A.

The other option is to use the Laplace expansion with respect to one or more rows or columns, see the description in 2.2.9.

2.B.5. Compute det(B) for the matrix B from the previous task by the Gauss elimination method.

Solution. We aim at transforming the matrix into a row echelon form, and then multiplying the numbers on the diagonal. However, we must remember that multiplying a row by a scalar changes the determinant of the matrix by the same multiple. Moreover, interchanging two rows changes the sign of the determinant of the matrix.

det ( 1 2 3; 1 −1 2; 3 2 2 ) = det ( 1 2 3; 0 −3 −1; 0 −4 −7 )
= (1/(−4))·(1/3)·det ( 1 2 3; 0 12 4; 0 −12 −21 ) = −(1/12)·det ( 1 2 3; 0 12 4; 0 0 −17 ) .

The remaining step is the computation of the determinant of an upper triangular matrix. The latter equals the product of the diagonal entries; thus, det(B) = −(1/12)·(1·12·(−17)) = 17. Of course, the direct application of the Sarrus rule was faster. □

2.B.6. Compute the determinant of the matrix

A = ( 1 3 5 6; 1 2 2 2; 1 1 1 2; 0 1 2 1 ) .

Solution. A standard way to compute det(A) is the reduction of the size of the matrices by the expansion along the first column (see also 2.3.10). This gives

det(A) = 1·det ( 2 2 2; 1 1 2; 1 2 1 ) − 1·det ( 3 5 6; 1 1 2; 1 2 1 ) + 1·det ( 3 5 6; 2 2 2; 1 2 1 ) = −2 − 2 + 6 = 2 ,

where we have exploited the fact that the last entry of the first column is zero. An alternative way is to convert the matrix to row echelon form, which we can do as follows:

det(A) = −det ( 1 1 1 2; 1 2 2 2; 1 3 5 6; 0 1 2 1 ) = −det ( 1 1 1 2; 0 1 1 0; 0 2 4 4; 0 1 2 1 )
= −det ( 1 1 1 2; 0 1 1 0; 0 0 2 4; 0 0 1 1 ) = det ( 1 1 1 2; 0 1 1 0; 0 0 1 1; 0 0 2 4 )
= det ( 1 1 1 2; 0 1 1 0; 0 0 1 1; 0 0 0 2 ) = 2 .

Observe that here we have interchanged rows twice. Specify the rest of the applied rules, according to 2.2.5! □

2.B.7. Use the Laplace theorem (see 2.2.9) to compute the determinant of the matrix D given below. Next verify your

When k = ℓ we call the determinant |M| a subdeterminant or minor of order k of the matrix A. If m = n and k = ℓ, then M∗ is also a square matrix and |M∗| is called the minor complement to |M|, or the complementary minor of the submatrix M in the matrix A. The scalar (−1)^(i1+···+ik+j1+···+jℓ) · |M∗| is then called the algebraic complement of the minor |M|. The submatrices formed by the first k rows and columns are called leading principal submatrices, and their determinants are called leading principal minors of the matrix A. If we choose k sequential rows and columns starting with the i-th row, we speak of principal submatrices and principal minors. In particular, when k = ℓ = 1 and m = n, we call the corresponding algebraic complementary minor the algebraic complement Aij of the element aij of the matrix A, which we met already in 2.2.4.

2.2.9. Laplace determinant expansion. If the leading principal minor |M| of the matrix A is of order k, then, directly from the definition of the determinant, each of the individual k!(n − k)!
terms in the product of |M| with its algebraic complement is a term of |A|. In general, consider a square submatrix M, that is, a square matrix given by the rows i1 < i2 < · · · < ik and the columns j1 < · · · < jk. Then using (i1 − 1) + · · · + (ik − k) exchanges of neighbouring rows and (j1 − 1) + · · · + (jk − k) exchanges of neighbouring columns in A, we can transform this submatrix M into a leading principal submatrix, and the complementary matrix gets transformed into its complementary matrix. The whole matrix A gets transformed into a matrix B satisfying (cf. 2.2.5 and the definition of the determinant) |B| = (−1)^α |A|, where α = (i1 + j1) + · · · + (ik + jk) − 2(1 + · · · + k). But (−1)^α = (−1)^β with β = (i1 + j1) + · · · + (ik + jk). Therefore we have checked:

Proposition. If A is a square matrix of dimension n and |M| is its minor of order k < n, then the product of any term of |M| with any term of its algebraic complement is a term in the determinant |A|.

This claim suggests that we could perhaps express the determinant of the matrix by using some products of smaller determinants. We see that |A| contains exactly n! distinct terms, exactly one for each permutation. These terms are mutually distinct as polynomials in the components of a general matrix A. If we can show that there are exactly that many mutually distinct expressions from the previous claim, we obtain the determinant |A| as their sum. It remains to show that the terms of the product |M|·|M∗| contain exactly n! distinct members from |A|. From the chosen k rows we can choose (n choose k) minors M, and using the previous lemma each of the k!(n − k)!

answer in Sage.

D = ( 1 0 1 0 1; 0 2 0 2 0; 0 0 3 0 3; 4 0 0 4 4; 0 0 0 0 5 ) . ⃝

2.B.8. This is a theoretical task. Let us fix any matrices A, B ∈ Matn(K), over any ring K.
(a) If det(A) = 0, show that the product AB also has determinant zero.
(b) Prove the identity det(cA) = c^n det(A), for each scalar c ∈ K.
(c) Show by a counterexample that in general det(A + B) ̸= det(A) + det(B). ⃝

The upcoming series of exercises will explore various methods for computing the inverse of an invertible matrix, including Cramer’s rule, and other topics. As we will demonstrate, Sage can efficiently handle these computations, leveraging its extensive matrix calculus functionality. Before diving into the problems, let us review some key points about matrix manipulation in Sage. There are several convenient shortcuts for matrix creation. As we mentioned, to specify the ring of the matrix entries use QQ for rationals, RR for reals, CC for complex numbers, and ZZ for integers. For exact computations with rational entries the results are precise. For numerical calculations with real or complex matrices, we recommend the use of RDF and CDF, which work with a floating-point representation, instead of RR and CC. After creating a matrix A in your favourite Sage editor, you can access its entries in a straightforward manner using the square bracket notation A[i, j]. Here, i and j represent the row and column indices, respectively. Note that both Python and Sage use zero-based indexing. For example, to access the entry a33 of a 4 × 4 matrix A, we type A[2, 2]. Notice that Sage also allows indexing from the end of the matrix, using negative indices. For example, A[-1, :] refers to the last row of A, and A[-2, :] refers to the second to last row of A. Additionally, indices can be specified as intervals using the format a : b, which selects the indices from a to b − 1. For instance, A[2 : 3] is valid and will be useful when discussing submatrices later.
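The following small cell (our own illustration, on an arbitrary 3 × 4 rational matrix) demonstrates the negative and slice-based access just described:

A = matrix(QQ, [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(A[-1, :])     # the last row of A
print(A[0:2, 1:3])  # the 2 x 2 submatrix of rows 0..1 and columns 1..2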
Another useful tip is the notation p : q : r, which generates all the indices from p to q − 1, in steps of r.

2.B.9. Given the matrix

A = ( 1 1 3 0 8; 2 2 −4 −8 4; −1 −5 2 4 6; 7 −1 10 0 9 )

use Sage to access the entries a33, a12, a11, a45 and a44.

Solution. We can use the cell

A = matrix(QQ, [[1, 1, 3, 0, 8], [2, 3, -4, -8, 4], [-1, -5, 2, 4, 6], [7, -1, 10, 0, 9]])
print(A[2, 2]); print(A[0, 1])
print(A[0, 0]); print(A[3, 4])
print(A[3, 3])

terms in the products of |M| with their algebraic complements is a term in |A|. But for distinct choices of M we can never obtain the same terms, and the individual terms in (−1)^(i1+···+ik+j1+···+jk) · |M| · |M∗| are also mutually distinct. Therefore we have exactly the required number k!(n − k)!·(n choose k) = n! of terms, and we have proved:

Laplace theorem

Theorem. Let A = (aij) be a square matrix of dimension n over an arbitrary ring of scalars, with k rows fixed. Then |A| is the sum of all the (n choose k) products (−1)^(i1+···+ik+j1+···+jk) · |M| · |M∗| of the minors of order k chosen among the fixed rows, with their algebraic complements.

The Laplace theorem transforms the computation of |A| into the computation of determinants of lower dimension. This method of computation is called the Laplace expansion along the chosen rows (or columns). For instance, the expansion along the i-th row or the j-th column reads

|A| = Σ_{j=1..n} aij Aij = Σ_{i=1..n} aij Aij ,

where Aij are the algebraic complements of the elements aij (the algebraic complements of the minors of order one), as deduced in 2.2.4 already. In practical computations, it is often efficient to combine the Laplace expansion with the direct method of Gaussian elimination.

2.2.10. Proof of the Cauchy theorem. The theorem is based on a clever but elementary application of the Laplace theorem. We just use the Laplace expansion twice on a particular arrangement of a well chosen matrix. Consider first the following matrix H of dimension 2n (we are using the so-called block symbolics, that is, we write the matrix as if composed of the (sub)matrices A, B, and so on):

H = ( A 0; −E B ) = ( a11 . . . a1n 0 . . . 0; . . . ; an1 . . . ann 0 . . . 0; −1 0 . . . 0 b11 . . . b1n; . . . ; 0 . . . 0 −1 bn1 . . . bnn ) .

The Laplace expansion along the first n rows gives |H| = |A| · |B|. Now in sequence, we add linear combinations of the first n columns to the last n columns in order to obtain a matrix

Sage returns 2, 1, 1, 9 and 0, respectively. Notice that entering A[4, 3], Sage will produce an error message: matrix index out of range, which is expected. □

2.B.10. For the matrices A, B given below, compute:
(a) the corresponding null spaces;
(b) the submatrix of A obtained by canceling the first row, and the first and third columns; is it diagonal?
(c) the submatrix of B obtained by canceling the first and last rows, and the last column;
(d) the submatrix of B obtained by canceling the first and third rows, and the second and fourth columns; is it symmetric?

A = ( 1 −1 π 0; 0 π −1 0; 1 0 −1 π ) , B = ( 1 2 3 4; 2 1 4 3; 3 4 1 2; 4 3 2 1 ) . ⃝

2.B.11. Using Cramer’s rule, solve the system posed in 2.A.9. Next demonstrate the situation in Sage.

Solution. Consider a linear system Ax = b of n equations in n unknowns, with det(A) ̸= 0. Recall from 2.2.6 that Cramer’s rule describes the unique solution of such a linear system as the unique vector x = (x1, . . . , xn)T of Rn with entries xi = det(Ai)/det(A), for all i = 1, . . . , n,
where the Ai are the matrices obtained from the coefficient matrix A by replacing the i-th column by the vector b, for i = 1, . . . , n. The matrix A of the system in 2.A.9 is 3 × 3, and Cramer’s rule gives

x1 = det ( 2 2 3; −3 −3 −1; −3 1 2 ) / det ( 1 2 3; 2 −3 −1; −3 1 2 ) = 1 ,
x2 = det ( 1 2 3; 2 −3 −1; −3 −3 2 ) / det ( 1 2 3; 2 −3 −1; −3 1 2 ) = 2 ,
x3 = det ( 1 2 2; 2 −3 −3; −3 1 −3 ) / det ( 1 2 3; 2 −3 −1; −3 1 2 ) = −1 .

It is easy to demonstrate Cramer’s rule in Sage. For a 3 × 3 system this can be done really quickly, and an illustration is here:

A = matrix([[1, 2, 3], [2, -3, -1], [-3, 1, 2]])
b = vector([2, -3, -3])
A1 = copy(A); A2 = copy(A); A3 = copy(A)
A1[:, 0] = b  # construct the matrix A_1
A2[:, 1] = b  # construct the matrix A_2
A3[:, 2] = b  # construct the matrix A_3
show(A1); show(A2); show(A3)
x1 = A1.det()/A.det(); x2 = A2.det()/A.det()
x3 = A3.det()/A.det(); print([x1, x2, x3])

with zeros in the bottom right corner. We obtain

K = ( a11 . . . a1n c11 . . . c1n; . . . ; an1 . . . ann cn1 . . . cnn; −1 0 . . . 0 0 . . . 0; . . . ; 0 . . . 0 −1 0 . . . 0 ) .

The elements of the submatrix in the top right part must satisfy cij = ai1 b1j + ai2 b2j + · · · + ain bnj, that is, they are exactly the components of the product A · B, and |K| = |H|. The expansion along the last n columns gives us

|K| = (−1)^n (−1)^(1+···+2n) |A·B| = (−1)^(2n(n+1)) |A · B| = |A · B|.

This proves the Cauchy theorem.

2.2.11. Determinant and the inverse matrix. Assume first that there is an inverse matrix of the matrix A, that is, A · A−1 = E. Since the unit matrix always satisfies |E| = 1, it follows that for every invertible matrix its determinant is an invertible scalar, and by the Cauchy theorem we have |A−1| = |A|−1. But we can say more, combining the Laplace and Cauchy theorems.

Inverse matrix determinant formula

For any square matrix A = (aij) of dimension n we define the matrix A∗ = (a∗ij), where a∗ij = Aji are the algebraic complements of the elements aji in A. The matrix A∗ is called the algebraically adjoint matrix of the matrix A (or the adjugate matrix A∗).

Theorem. For every square matrix A over a ring of scalars K we have

(1) AA∗ = A∗A = |A| · E.

In particular, (i) A−1 exists as a matrix over the ring of scalars K if and only if |A|−1 exists in K. (ii) If A−1 exists, then A−1 = |A|−1 · A∗.

Proof. As already mentioned, the Cauchy theorem shows that the existence of A−1 implies the invertibility of |A| ∈ K. For an arbitrary square matrix A we can directly compute A · A∗ = (cij), where

cij = Σ_{k=1..n} aik a∗kj = Σ_{k=1..n} aik Ajk .

If i = j, this is exactly the Laplace expansion of |A| along the i-th row. If i ̸= j, then we may imagine we expand the determinant along the j-th row, but plug in the values of the i-th row instead of the ajk’s. This is the expansion of the determinant

Notice that the copy() function provides a way to make a copy of a matrix before we make changes to it. This block returns the matrices A1, A2, A3 and the solution in the form [1, 2, −1]. In Section E we will meet another implementation of Cramer’s rule via Sage, see 2.E.23. □

2.B.12. Is the matrix A given below invertible? In the positive case find its algebraically adjoint matrix and its inverse:

A = ( 1 0 2 0; 0 3 0 4; 5 0 6 0; 0 7 0 8 ) .

Solution. Recall that a square matrix A is invertible if and only if det(A) ̸= 0. Expanding along the first row of the given A, we see that

det(A) = 1·det ( 3 0 4; 0 6 0; 7 0 8 ) + 2·det ( 0 3 4; 5 0 0; 0 7 8 ) = −24 + 2·20 = 16 .

Hence there exists A−1 such that AA−1 = E = A−1A, where E is the 4 × 4 identity matrix.
To compute A−1 we will use the algebraic adjoint matrix of A, the latter given by

adj(A) = ( A11 A12 A13 A14; A21 A22 A23 A24; A31 A32 A33 A34; A41 A42 A43 A44 )T .

Here, Aij is the algebraic complement of the element aij of the matrix A, that is, the product of the number (−1)^(i+j) and the determinant of the matrix obtained from A by deleting the i-th row and the j-th column. We compute:

A11 = det ( 3 0 4; 0 6 0; 7 0 8 ) = −24 , A12 = −det ( 0 0 4; 5 6 0; 0 0 8 ) = 0 ,
A13 = det ( 0 3 4; 5 0 0; 0 7 8 ) = 20 , A14 = −det ( 0 3 0; 5 0 6; 0 7 0 ) = 0 ,
A21 = −det ( 0 2 0; 0 6 0; 7 0 8 ) = 0 , A22 = det ( 1 2 0; 5 6 0; 0 0 8 ) = −32 ,
A23 = −det ( 1 0 0; 5 0 0; 0 7 8 ) = 0 , A24 = det ( 1 0 2; 5 0 6; 0 7 0 ) = 28 ,
A31 = det ( 0 2 0; 3 0 4; 7 0 8 ) = 8 , A32 = −det ( 1 2 0; 0 0 4; 0 0 8 ) = 0 ,
A33 = det ( 1 0 0; 0 3 4; 0 7 8 ) = −4 , A34 = −det ( 1 0 2; 0 3 0; 0 7 0 ) = 0 ,
A41 = −det ( 0 2 0; 3 0 4; 0 6 0 ) = 0 , A42 = det ( 1 2 0; 0 0 4; 5 6 0 ) = 16 ,

of a matrix where the i-th and the j-th rows are the same; therefore cij = 0. This implies that A · A∗ = |A| · E, and we have proven one of the equalities in (1). In particular, if |A|−1 exists, then A · (|A|−1 A∗) = E. If |A| is an invertible scalar, we may repeat the previous computation for A∗ · A, and we obtain (|A|−1 A∗) · A = E. Therefore our computation really gives the inverse matrix of A, as claimed in the theorem. □

Notice that for fields of scalars we have already proved that the right inverse of a matrix is automatically the left inverse, and thus the inverse, too. Here we have obtained the same result for all rings of scalars, together with a strong and effective existence condition. On the other hand, the exact formula for the inverse has become rather theoretical, with little practical value. As a direct corollary of this theorem we can once again prove the Cramer rule for solving systems of linear equations, see 2.2.6. Really, for the solution of the system A·x = b we just need to read, in the equation x = A−1 · b = |A|−1 A∗ · b, the individual components of the expression A∗ · b as the Laplace expansions of the determinant of the matrix Ai which arose through the exchange of the i-th column of A for the column b.

3. Vector spaces and linear mappings

2.3.1. Abstract vector spaces. Let us go back for a while to the systems of m linear equations in n variables from 2.1.3 and further, and let us assume that the system is homogeneous, A · x = 0, i.e.,

( a11 . . . a1n; . . . ; am1 . . . amn ) · ( x1; . . . ; xn ) = ( 0; . . . ; 0 ) .

By the distributivity of the matrix multiplication it is clear that the sum of two solutions x = (x1, . . . , xn) and y = (y1, . . . , yn) satisfies A · (x + y) = A · x + A · y = 0, and thus it is also a solution. Similarly, a scalar multiple a · x is also a solution. The set of all solutions of a fixed system of equations is therefore closed under vector addition and scalar multiplication. These are the basic properties of vectors of dimension n in Kn, see 2.1.1. Now we have the vectors in the solution space with n coordinates. The “dimension” of this space is given by the difference of the number of variables and the rank of the matrix A. Thus we can easily deal with the solution of a system of 1000 equations in 1000 variables which needs only one or two free parameters. The whole solution space will then behave as a plane or a line, as we have already seen in 1.5.3 at page 29, although the vectors themselves are given by so many components.

A43 = −det ( 1 0 0; 0 3 4; 5 0 0 ) = 0 , A44 = det ( 1 0 2; 0 3 0; 5 0 6 ) = −12 .
Thus, by substitution we obtain

adj(A) = ( −24 0 20 0; 0 −32 0 28; 8 0 −4 0; 0 16 0 −12 )T = ( −24 0 8 0; 0 −32 0 16; 20 0 −4 0; 0 28 0 −12 ) .

As a verification of the given expression in Sage, you can use either the command A.adjugate(), or its alias A.adjoint_classical(), as follows:

A = matrix([[1, 0, 2, 0], [0, 3, 0, 4], [5, 0, 6, 0], [0, 7, 0, 8]])
show(A.adjugate())

The inverse matrix A−1 is now obtained by the rule A−1 = (1/det(A)) · adj(A), that is,

A−1 = ( −3/2 0 1/2 0; 0 −2 0 1; 5/4 0 −1/4 0; 0 7/4 0 −3/4 ) .

Recall that we can directly compute either the determinant of A or the inverse of A in Sage by adding the commands det(A) and A.inverse(), respectively. □

C. Vector spaces and linear mappings

Intuitively, a vector space is a landscape (of any dimension n, finite or infinite) where each position corresponds to a unique set of scalar coordinates (again n of them), and we add and multiply these exactly as we already saw in Chapter 1 when working with R2 and C. Vector spaces can become quite abstract and unusual (as the scalars themselves already can be), and they are fundamental building blocks in mathematics, with an amazingly wide range of applications in many fields, including chemistry, physics, image processing, computer science, economics, etc. For an informative introduction to the concepts, see 2.3.1. Next we will primarily focus on finite-dimensional vector spaces.

2.C.1. Vector spaces. Decide whether the following sets form a vector space over the given field:

(a) The set of real solutions of the linear system:

x1 + x2 + · · · + x8 + x9 + x10 = 10x1 ,
x1 + x2 + · · · + x8 + x9 = 9x1 ,
x1 + x2 + · · · + x8 = 8x1 ,
. . .
x1 + x2 = 2x1 .

We go further. Already in paragraph 1.2.1 we encountered an interesting example of the space of all solutions of a homogeneous linear difference equation of first order. All solutions were obtained from a single one by scalar multiplication, and they are also closed under addition and scalar multiples. These “vectors” of solutions are infinite sequences of numbers, although we intuitively expect that the “dimension” of the whole space of solutions should be one. We shall understand such phenomena with the help of a more general definition of a vector space and its dimension.

Vector space definition

A vector space V over a field of scalars K is a set where we define the operations
• addition, which satisfies the axioms (CG1)–(CG4) from paragraph 1.1.1 on page 5,
• scalar multiplication, for which the axioms (V1)–(V4) from paragraph 2.1.1 on page 84 hold.

Recall our simple notational convention: scalars are usually denoted by letters from the beginning of the alphabet, that is, a, b, c, . . . , while for vectors we shall use letters from the end, that is, u, v, w, x, y, z. Usually, x, y, z will denote n-tuples of scalars. For completeness, letters from the middle of the alphabet, for instance i, j, k, ℓ, will mostly denote indices.

In order to gain some practice in the formal approach, we check some simple properties of vectors. These are trivial for n-tuples of scalars, but not so evident for general vectors in our new abstract sense.

2.3.2. Proposition. Let V be a vector space over a field of scalars K. Suppose a, b, ai ∈ K, and u, v, uj ∈ V. Then
(1) a · u = 0 if and only if a = 0 or u = 0,
(2) (−1) · u = −u,
(3) a · (u − v) = a · u − a · v,
(4) (a − b) · u = a · u − b · u,
(5) ( Σ_{i=1..n} ai ) · ( Σ_{j=1..m} uj ) = Σ_{i=1..n} Σ_{j=1..m} ai · uj.

Proof.
We can expand
a · u = (a + 0) · u = a · u + 0 · u (by (V2)),
which, according to the axiom (CG4), implies 0 · u = 0. Now
u + (−1) · u = (1 + (−1)) · u = 0 · u = 0 (by (V2)),
and thus −u = (−1) · u. Further,
a · (u + (−1) · v) = a · u + (−a) · v = a · u − a · v (by (V2), (V3)),
which proves (3). It follows that
(a − b) · u = a · u + (−b) · u = a · u − b · u (by (V2), (V3)),
which proves (4). Property (5) follows by induction using (V2) and (V1). It remains to prove (1):
a · 0 = a · (u − u) = a · u − a · u = 0,
which along with the first equality derived in this proof proves one implication. For the other implication, we use an axiom for the field of scalars and axiom (V4) for vector spaces: if p · u = 0 and p ≠ 0, then u = 1 · u = (p^{-1} · p) · u = p^{-1} · 0 = 0. □

(b) The set of real solutions of the equation x_1 + x_2 + · · · + x_{10} = 0.
(c) The set of real solutions of the equation x_1 + 2x_2 + 3x_3 + · · · + 10x_{10} = 1.
(d) The set of all real or complex sequences (recall that a real or a complex sequence can be viewed as a mapping f : N → R or f : N → C, respectively).
(e) The set F = {f | f : X → R} of real-valued functions whose domain is a non-empty set X.
(f) The set of real solutions of a non-homogeneous difference equation.
(g) The set Mat_{m,n}(K) of m × n matrices with entries in K, where K ∈ {R, C}.
(h) The set Q(√2) = {a + √2 b : a, b ∈ Q} over Q.

Solution. (a) This set is a vector space, since the solutions are exactly the real multiples of the vector u := (1, . . . , 1)^T (with 10 entries). A sum of two multiples of the same vector is again a multiple of that vector, and so is the additive inverse. All other axioms are trivially satisfied.
(b) This provides an example of a vector space of dimension 9 (its dimension is determined by the number of free parameters of the solution). More generally, recall that the set of solutions of any homogeneous linear system is a vector space, see 2.3.5.
(c) Taking twice the solution x_1 = 1, x_i = 0 for i = 2, . . . , 10, we do not obtain a solution. Hence, the given set cannot be a vector space. However, the set of solutions forms an affine space, see 4.1.1.
(d) The set of all real (respectively, complex) sequences clearly forms a real (respectively, complex) vector space. Notice that the addition and scalar multiplication are defined termwise. Moreover, the zero vector is represented by the zero sequence (0, 0, 0, . . . ).
(e) This provides an example of an infinite-dimensional vector space, where the addition and multiplication are defined in the usual way, i.e., (f + g)(x) = f(x) + g(x) and (c · f)(x) = c · f(x), respectively, for all x ∈ X. Notice the zero vector is simply the constant function with zero value everywhere on X, i.e., the zero function defined on X.
(f) Consider two solutions of a non-homogeneous equation
\[
\begin{aligned}
a_n x_{n+k} + a_{n-1} x_{n+k-1} + \cdots + a_0 x_k &= c,\\
a_n y_{n+k} + a_{n-1} y_{n+k-1} + \cdots + a_0 y_k &= c,
\end{aligned}
\]
where c is some non-zero scalar. Their sum satisfies
\[
a_n(x_{n+k} + y_{n+k}) + a_{n-1}(x_{n+k-1} + y_{n+k-1}) + \cdots + a_0(x_k + y_k) = 2c,
\]
and obviously this cannot be a solution of the given non-homogeneous equation. Hence we do not obtain a vector space. However, the set of solutions in this case forms an affine space (see 4.1.1).
(g) This is one of the most fundamental examples of finite-dimensional vector spaces. For simplicity, let us fix K = R; the complex case is treated similarly, see also the task

2.3.3. Linear (in)dependence. In paragraph 2.1.11 we worked with linear combinations of rows of a matrix.
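Such dependencies among rows can also be tested quickly in Sage. A hedged sketch of ours (the matrix below is an ad hoc example, not from the text):

# the third row equals the first row plus twice the second,
# so the rows are linearly dependent
M = matrix(QQ, [[1, 2, 0], [0, 1, 1], [1, 4, 2]])
print(M.rank())                                   # 2 < 3
print((QQ^3).linear_dependence(M.rows()) != [])   # True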
With vectors we work analogously:

Linear combination and independence
An expression of the form a_1 v_1 + · · · + a_k v_k is called a linear combination of vectors v_1, . . . , v_k ∈ V. A finite sequence of vectors v_1, . . . , v_k is called linearly independent if the only zero linear combination is the one with all coefficients zero. That is, for any scalars a_1, . . . , a_k ∈ K,
a_1 v_1 + · · · + a_k v_k = 0 implies a_1 = a_2 = · · · = a_k = 0.

It is clear that in an independent sequence of vectors, all vectors are mutually distinct and nonzero. The set of vectors M ⊂ V in a vector space V over K is called linearly independent if every finite k-tuple of vectors v_1, . . . , v_k ∈ M is linearly independent. The set of vectors M is linearly dependent if it is not linearly independent.

A nonempty subset M of vectors in a vector space over a field of scalars K is dependent if and only if one of its vectors can be expressed as a finite linear combination of other vectors in M. This follows directly from the definition: indeed, at least one of the coefficients in the corresponding linear combination must be nonzero, and since we are over a field of scalars, we can multiply the whole combination by the inverse of this nonzero coefficient and thus express its corresponding vector as a linear combination of the others.

Every subset of a linearly independent set M is clearly also linearly independent (we require the same conditions on a smaller set of vectors). Similarly, we can see that M ⊂ V is linearly independent if and only if every finite subset of M is linearly independent.

2.3.4. Generators and subspaces. A subset M ⊂ V is called a vector subspace if it forms, together with the restricted operations of addition and scalar multiplication, a vector space. That is, we require
∀a, b ∈ K, ∀v, w ∈ M : a · v + b · w ∈ M.

We investigate a couple of cases: The space of m-tuples of scalars R^m with coordinate-wise addition and multiplication is a vector space over R, but also a vector space over Q. For instance for m = 2, the vectors (1, 0), (0, 1) ∈ R^2 are linearly independent, because from a · (1, 0) + b · (0, 1) = (0, 0) it follows that a = b = 0. Further, the vectors (1, 0), (√2, 0) ∈ R^2 are linearly dependent over R, because √2 · (1, 0) = (√2, 0), but over Q they are linearly independent! Over R these two

in 2.E.56 for more details on complex matrices. The vector addition on Mat_{m,n}(R) is given by matrix addition, i.e., A + B = (a_{ij} + b_{ij}) ∈ Mat_{m,n}(R) for all A = (a_{ij}), B = (b_{ij}) ∈ Mat_{m,n}(R), and scalar multiplication is defined by multiplying each entry of the given matrix by the scalar, i.e., cA = (c · a_{ij}) ∈ Mat_{m,n}(R) for all c ∈ R and A = (a_{ij}) ∈ Mat_{m,n}(R), see also 2.1.2. Let us check the axioms (CG1)–(CG4) presented in 1.1.1:
• (CG1) – associativity: Indeed, (A + B) + C = A + (B + C) for all A, B, C ∈ Mat_{m,n}(R). This property follows from the associativity of the addition of real numbers: (a_{ij} + b_{ij}) + c_{ij} = a_{ij} + (b_{ij} + c_{ij}), where we assume that A = (a_{ij}), B = (b_{ij}) and C = (c_{ij}), respectively.
• (CG2) – commutativity: In an analogous way we can prove that A + B = B + A for all A, B ∈ Mat_{m,n}(R).
• (CG3) – existence of a neutral element: This is the zero matrix, which we denote by 0 ∈ Mat_{m,n}(R). Obviously, it satisfies A + 0 = A = 0 + A for all A ∈ Mat_{m,n}(R).
• (CG4) – existence of additive inverses: For A ∈ Mat_{m,n}(R), B = −A is the unique matrix satisfying A + B = 0.
These axioms ensure that the pair (Mat_{m,n}(R), +) has the structure of a “commutative group”.
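These axioms can also be spot-checked numerically. The following minimal Sage sketch (our own illustration, not from the text) tests them on random elements of Mat_{2,3}(Q); of course, random instances only illustrate the axioms, while the argument above is what establishes them:

M = MatrixSpace(QQ, 2, 3)
A = M.random_element(); B = M.random_element(); C = M.random_element()
print((A + B) + C == A + (B + C))   # (CG1) associativity
print(A + B == B + A)               # (CG2) commutativity
print(A + M.zero() == A)            # (CG3) neutral element
print(A + (-A) == M.zero())         # (CG4) additive inverse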
Finally, it is very easy to check the axioms (V1)–(V4) from paragraph 2.1.1, which the scalar multiplication must satisfy. For completeness, we list them:
• (V1) – α(A + B) = αA + αB for all α ∈ R and A, B ∈ Mat_{m,n}(R).
• (V2) – (α + β)A = αA + βA for all α, β ∈ R and A ∈ Mat_{m,n}(R).
• (V3) – α(βA) = (αβ)A for all α, β ∈ R and A ∈ Mat_{m,n}(R).
• (V4) – 1A = A for all A ∈ Mat_{m,n}(R).
(h) The set Q(√2) = {a + √2 b : a, b ∈ Q} provides an example of a vector space over Q. Indeed, let x = a + √2 b, y = c + √2 d be two elements of Q(√2). Then x + y = (a + c) + √2(b + d) ∈ Q(√2). Moreover, for q ∈ Q we have qx ∈ Q(√2), since qa, qb ∈ Q for all a, b ∈ Q. □

2.C.2. By the discussion in 2.3.4 we know that the space R_m[x] of real polynomials of degree at most m is a vector space over R. Can we claim the same for the set of real polynomials of degree exactly m? ⃝

2.C.3. Vector spaces in Sage. (a) Use Sage to create the vector spaces R^4 and Q^4. Next check whether the vector u = 2v + 3w is an element of Q^4, where v = (1, −2, −3, 0)^T and w = (1, 0, 3, −2)^T, respectively. (b) Use Sage to create the matrix spaces Mat_4(R) of 4 × 4 real matrices, and Mat_{4,1}(C) of 4 × 1 complex matrices.

Solution. (a) An easy way to introduce a vector space in Sage is to start by specifying the field over which the vectors are defined, and then use an exponent to indicate the dimension of the vector space. This process is compact and mimics the notation we use when working on paper. For instance, to introduce R^4, type

vectors “generate” a one-dimensional subspace, while over Q the subspace is “larger”.

Polynomials with real coefficients and of degree at most m form a vector space R_m[x]. We can consider the polynomials as mappings f : R → R and define the addition and scalar multiplication like this: (f + g)(x) = f(x) + g(x), (a · f)(x) = a · f(x). Polynomials of all degrees also form a vector space R[x] (or R_∞[x]), and R_m[x] ⊂ R_n[x] is a vector subspace for any m ≤ n ≤ ∞. Further examples of subspaces are given by all even polynomials or all odd polynomials, that is, polynomials satisfying f(−x) = ±f(x).

In complete analogy with polynomials, we can define a vector space structure on the set of all mappings R → R, or of all mappings M → V of an arbitrary fixed set M into the vector space V.

Because the condition in the definition of subspace consists only of universal quantifiers, the intersection of subspaces is still a subspace. We can see this also directly: Let W_i, i ∈ I, be vector subspaces in V, a, b ∈ K, u, v ∈ ∩_{i∈I} W_i. Then a · u + b · v ∈ W_i for all i ∈ I, hence a · u + b · v ∈ ∩_{i∈I} W_i.

It can be noted that the intersection of all subspaces W ⊂ V that contain some given set of vectors M ⊂ V is a subspace. It is called the linear span or linear hull of M, and we write span M. We say that a set M generates the subspace span M, or that the elements of M are generators of the subspace span M.

We formulate a few simple claims about subspace generation:

Proposition. For every nonempty set M ⊂ V, we have
(1) span M = {a_1 · u_1 + · · · + a_k · u_k; k ∈ N, a_i ∈ K, u_j ∈ M, j = 1, . . . , k};
(2) M = span M if and only if M is a vector subspace;
(3) if N ⊂ M then span N ⊂ span M is a vector subspace; the subspace span ∅ generated by the empty subset is the trivial subspace {0} ⊂ V.

Proof. (1) The set of all linear combinations a_1u_1 + · · · + a_ku_k on the right-hand side of (1) is clearly a vector subspace, and of course it contains M.
On the other hand, each of the linear combinations must be in span M, and thus the first claim is proved. Claim (2) follows immediately from claim (1) and the definition of a vector subspace. Analogously, (1) implies most of the third claim. Finally, the smallest possible vector subspace is {0}. Notice that the empty set is contained in every subspace, and each of them contains the vector 0. This proves the last claim. □

V=RR^4; V # we can also type V=RR**4

Sage returns the following:

Vector space of dimension 4 over Real Field with 53 bits of precision

Similarly, to introduce the vector space Q^4 type QQ^4, etc. We can also use the VectorSpace() function, which requires the name of the number system for the entries and the number of entries in each vector. For instance the cell

V = VectorSpace(QQ, 4); print(V)

prints out

Vector space of dimension 4 over Rational Field

Vectors can be constructed in the usual way, that is,

V=QQ^4; v=vector(QQ, [1, -2, -3, 0])
w=vector(QQ, [1, 0, 3, -2])

It is also easy to perform computations with vectors (vector addition and scalar multiplication):6

V=QQ^4; v=vector(QQ, [1, -2, -3, 0])
w=vector(QQ, [1, 0, 3, -2])
u=2*v+3*w
print(u) # we could type show(u)
u in V

Here Sage returns the vector u, i.e., (5, −4, 3, −6), and the word True, the latter verifying that u ∈ V.
(b) Sage provides built-in functions for working with matrix spaces. This cell creates the space Mat_4(R):

M = MatrixSpace(RR,4); M

Similarly, to introduce the space Mat_{4,1}(C) use the cell

M = MatrixSpace(CC,4, 1); M

Notice that when the number of columns is omitted, it defaults to the number of rows. □

2.C.4. Describe the linear combination 3v_1 + 2v_2 + 5v_3 + v_4 of the vectors v_1, . . . , v_4 given below in Sage, using at least two different methods:
\[
v_1 = (\tfrac12, \tfrac23, 1, 0)^T,\quad v_2 = (2, 0, \tfrac17, 0)^T,\quad v_3 = (0, \tfrac15, \tfrac25, 1)^T,\quad v_4 = (-1, 2, 0, 3)^T. \qquad ⃝
\]

2.C.5. Consider the system Ax = b given below, with a solution given by (−2/15, 1/5, 2/15, −11/15)^T:
\[
\begin{pmatrix} 3 & -9 & 9 & 0\\ 9 & -3 & 6 & 0\\ 9 & -6 & 0 & -6\\ -9 & 12 & 6 & 6 \end{pmatrix} \begin{pmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{pmatrix} = \begin{pmatrix} -1\\ -1\\ 2\\ 0 \end{pmatrix}.
\]

6 Be aware that Sage displays vectors using parentheses (to distinguish them from lists) and presents them horizontally. In particular, in Sage there is no inherent distinction between a “row vector” and a “column vector”. However, when dealing with matrices, it is crucial to make this distinction.

Basis and dimension
A subset M ⊂ V is called a basis of the vector space V if span M = V and M is linearly independent. A vector space with a finite basis is called finitely dimensional; the number of elements of the basis is called the dimension of V. If V does not have a finite basis, we say that V is infinitely dimensional. We write dim V = k, k ∈ N, or k = ∞.

In order to be satisfied with such a definition of dimension, we must know that different bases of the same space will always have the same number of elements. We shall show this below, cf. 2.3.9. But we note immediately that the trivial subspace is generated by the empty set, which is an “empty” basis, and thus it has dimension zero.

The linearly independent vectors e_i = (0, . . . , 1, . . . , 0) ∈ K^n, i = 1, . . . , n (all zeros, except one value 1 at the i-th position), are the most useful example of a basis in the vector space K^n. We call it the standard basis of K^n.

2.3.5. Linear equations again. It is a good time now to recall the properties of systems of linear equations in terms of abstract vector spaces and their bases.
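Before doing so, note that the span construction from the proposition proved above is directly available in Sage. A minimal hedged sketch (the generating vectors below are chosen ad hoc for illustration):

V = QQ^3
M = [vector(QQ, [1, 0, 1]), vector(QQ, [2, 0, 2]), vector(QQ, [0, 1, 0])]
W = V.span(M)                        # span M: all linear combinations of M
print(W.dimension())                 # 2, since (2, 0, 2) is a multiple of (1, 0, 1)
print(vector(QQ, [3, 5, 3]) in W)    # True: 3*(1,0,1) + 5*(0,1,0)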
As we have already noted in the introduction to this section (cf. 2.3.1), the set of all solutions of the homogeneous system A · x = 0 is a vector space. If A is a matrix with m rows and n columns, and the rank of the matrix is k, then using the row echelon transformation (see 2.1.7) to solve the system, we find that the dimension of the space of all solutions is exactly n − k. Indeed, the left-hand side of the equation can be understood as a linear combination of the columns of A with coefficients given by x, and the rank k of the matrix gives the number of linearly independent columns of A, thus the dimension of the subspace of all possible linear combinations of the given form. Therefore, after transforming the system into row echelon form, exactly m − k zero rows remain. In the next step, we are left with exactly n − k free parameters. By setting one of them to the value one while all others are zero, we obtain exactly n − k linearly independent solutions. Then all solutions are given by all the linear combinations of these n − k solutions. Every (n − k)-tuple of linearly independent solutions is called a fundamental system of solutions of the given homogeneous system of equations. We have proved:

Proposition. The set of all solutions of the homogeneous system of equations A · x = 0, for n variables with the matrix A of rank k, is a vector subspace in K^n of dimension n − k. Every basis of this space forms a fundamental system of solutions of the given homogeneous system.

Next, consider the general system of equations A · x = b. Notice that the columns of the matrix A are actually images of the vectors of the standard basis in K^n under the mapping

Use Sage to illustrate that the solution gives us scalars that yield the vector of constants b as a linear combination of the columns of the coefficient matrix A. ⃝

2.C.6. Describe the vector space structure of Z_5^3. What is the total number of vectors in Z_5^3? Use the cardinality command in Sage to support your answer. ⃝

Next we will explore problems concerning the concept of linear subspaces. A vector subspace of a vector space V is a non-empty subset W of V which is closed under addition and scalar multiplication. It can be given either as a span of some vectors (i.e., all of their linear combinations), or by some (linear) conditions.7 Therefore, any subspace of V is itself a vector space and shares the same fundamental properties and significance; see also the discussion in 2.3.4. In Sage we have numerous options to explore the properties and characteristics of subspaces. Once we have introduced a vector space V, a subspace W ⊂ V is defined by specifying a set of generators that span the subspace. This can be achieved with the subspace() function or with the span command.

2.C.7. Linear subspaces. Decide whether the following statements are true or false:
(a) The subset U_1 = {(x, y, z) ∈ R^3 : 2x + 7y + z = 1} is a linear subspace of R^3.
(b) The subset U_2 = {(x, y, z) ∈ R^3 : 2x^2 + 7y + z = 0} is a linear subspace of R^3.
(c) The subset U_3 = {(x, y, z) ∈ R^3 : 2x + 7y + z = 0} is a linear subspace of R^3.
(d) The subset U_4 = {ax^2 + c : a, c ∈ R} is a linear subspace of the vector space R_2[x] of polynomials with real coefficients and degree at most 2.
(e) The subset U_5 = {(x_1, x_2, x_3) ∈ R^3 : |x_1| = |x_2| = |x_3|} is a linear subspace of R^3.
(f) The subset U_6 = {p ∈ R_3[x] : p(1) = 0} is a linear subspace of the vector space R_3[x] of polynomials with real coefficients and degree at most 3.
(g) The set of real solutions of a homogeneous difference equation is a linear subspace of the vector space of all real sequences.
(h) The set \(\bigl\{\begin{pmatrix} a & b\\ 0 & c \end{pmatrix} : a, b, c \in \mathbb{R}\bigr\}\) of 2 × 2 upper triangular matrices is a subspace of Mat_2(R).
(i) The subset V_λ = {x ∈ R^n : Ax = λx}, where A ∈ Mat_n(R) and λ ∈ R, is a linear subspace of R^n.
(j) The subset {A ∈ Mat_n(R) : A = −A^T} of skew-symmetric n × n matrices is a subspace of Mat_n(R).

7 This implies that W contains the zero vector. Indeed, any w ∈ W will have an additive inverse −w ∈ W and hence 0 = w + (−w) ∈ W.

assigning the vector A · x to each vector x. If there is to be a solution, b must be in the image under this mapping, and thus it must be a linear combination of the columns of A. If we extend the matrix A by the column b, the number of linearly independent columns, and thus also rows, might increase (but does not have to). If this number increases, then b is not in the image and the system of equations does not have a solution. If, on the other hand, the number of linearly independent rows does not change after adding the column b to the matrix A, it means that b must be a linear combination of the columns of A. The coefficients of such combinations are then exactly the solutions of our system. Consider now two fixed solutions x and y of our system and some solution z of the homogeneous system with the same matrix. Then clearly
A · (x − y) = b − b = 0, A · (x + z) = b + 0 = b.
Thus we can summarise in the form of the so-called Kronecker–Capelli theorem1:

Kronecker–Capelli Theorem
Theorem. A solution of a non-homogeneous system of linear equations A · x = b exists if and only if adding the column b to the matrix A does not increase the number of linearly independent rows. In such a case the space of all solutions is given by all sums of one fixed particular solution of the system and all solutions of the homogeneous system with the same matrix.

2.3.6. Sums of subspaces. Since we now have some intuition about generators and the subspaces generated by them, we should understand the possibilities of how some subspaces can generate the whole space V.

Sum of subspaces
Let V_i, i ∈ I, be subspaces of V. Then the subspace generated by their union, that is, span ∪_{i∈I} V_i, is called the sum of the subspaces V_i. We denote it by W = ∑_{i∈I} V_i. Notably, for a finite number of subspaces V_1, . . . , V_k ⊂ V we write
W = V_1 + · · · + V_k = span(V_1 ∪ V_2 ∪ · · · ∪ V_k).

We see that every element in the considered sum W can be expressed as a linear combination of vectors from the subspaces V_i. Because vector addition is commutative, we can

1 A common formulation of this fact is "a system has a solution if and only if the rank of its matrix equals the rank of its extended matrix". Leopold Kronecker was a very influential German mathematician, who dealt with algebraic equations in general and in particular pushed forward Number Theory in the middle of the 19th century. Alfredo Capelli, an Italian, worked on algebraic identities. This theorem is equally often called by different names, e.g. the Rouché–Frobenius theorem or the Rouché–Capelli theorem. This is a very common feature in Mathematics.

Solution. (a) The subset U_1 cannot be a subspace of R^3, since the zero vector does not belong to U_1.
(b) This is also false, and U_2 cannot be a subspace of R^3. For example, the vector u = (1, 0, −2) ∈ U_2, but −u = (−1, 0, 2) ∉ U_2, since 2 · (−1)^2 + 7 · 0 + 2 = 4 ≠ 0.
(c) This is true and we leave the verification for practice.
(d) The set U_4 is a linear subspace of R_2[x], since
(a_1x^2 + c_1) + (a_2x^2 + c_2) = (a_1 + a_2)x^2 + (c_1 + c_2),
k(ax^2 + c) = (ka)x^2 + kc,
for all real numbers a_1, c_1, a_2, c_2, a, c, k ∈ R. Notice also that for a = c = 0 we obtain the zero polynomial.
(e) The subset U_5 is not a linear subspace of R^3, since for example (1, 1, 1) + (−1, 1, 1) = (0, 2, 2) ∉ U_5.
(f) The subset U_6 is clearly a subspace of R_3[x]. Indeed, if p, q ∈ U_6 and c ∈ R, then we see that
(p + q)(1) = p(1) + q(1) = 0, so (p + q) ∈ U_6,
(cp)(1) = c p(1) = c · 0 = 0, so (cp) ∈ U_6.
(g) Consider two sequences (x_j)_{j=0}^∞ and (y_j)_{j=0}^∞ satisfying the same homogeneous difference equation, that is,
a_n x_{n+k} + a_{n-1} x_{n+k-1} + · · · + a_0 x_k = 0,
a_n y_{n+k} + a_{n-1} y_{n+k-1} + · · · + a_0 y_k = 0.
By adding these equations, we obtain
a_n(x_{n+k} + y_{n+k}) + a_{n-1}(x_{n+k-1} + y_{n+k-1}) + · · · + a_0(x_k + y_k) = 0.
This means that the sequence (x_j + y_j)_{j=0}^∞ also satisfies the given equation. Analogously, if the sequence (x_j)_{j=0}^∞ satisfies the given equation, then so does (u x_j)_{j=0}^∞ for any u ∈ R. Thus the assertion is true.
(h) This claim is again true, as you can easily verify yourself.
(i) If x, y ∈ V_λ, then we see that
A(x + y) = Ax + Ay = λx + λy = λ(x + y),
A(cx) = c(Ax) = c(λx) = λ(cx),
and hence x + y ∈ V_λ and cx ∈ V_λ, for all x, y ∈ V_λ and c ∈ R. Hence the claim is true. Non-zero vectors lying in V_λ are called eigenvectors of A corresponding to the eigenvalue λ. Thus we have just proved that the eigenspace V_λ corresponding to the eigenvalue λ is a subspace of R^n. We will learn more about these notions later (see Section 2.D.1). Notice that when λ = 0, then V_0 = Ker(A) is the null space (kernel) of the square matrix A.
(j) Taking A, B ∈ Mat_n(R) with A^T = −A and B^T = −B, we see that (A + B)^T = −(A + B) and (cA)^T = −(cA) for all c ∈ R. Hence the claim is true. Show that the subset of n × n symmetric matrices is also a subspace of Mat_n(R). □

2.C.8. Construct in Sage the 2-dimensional subspaces of Q^3 generated over Q by two vectors of the standard basis e = {e_1 = (1, 0, 0)^T, e_2 = (0, 1, 0)^T, e_3 = (0, 0, 1)^T}. ⃝

Dealing with vector spaces, it is crucial to understand the linear dependence or independence of vectors. A basis is an independent set of generators. In finite-dimensional cases, this is the same as a maximal set of linearly independent vectors. Every finite-dimensional K-vector space V admits a basis; however, bases are not at

aggregate summands that belong to the same subspace, and for a finite sum of k subspaces we obtain
V_1 + V_2 + · · · + V_k = {v_1 + · · · + v_k; v_i ∈ V_i, i = 1, . . . , k}.
The sum W = V_1 + · · · + V_k ⊂ V is called the direct sum of subspaces if the intersection of each V_i with the sum of the other subspaces is trivial, that is, V_i ∩ ∑_{j≠i} V_j = {0}. We show that in such a case every vector w ∈ W can be written in a unique way as the sum w = v_1 + · · · + v_k, where v_i ∈ V_i. Indeed, if we could simultaneously write w as w = v′_1 + · · · + v′_k, then
0 = w − w = (v_1 − v′_1) + · · · + (v_k − v′_k).
If v_i − v′_i is the first nonzero term on the right-hand side, then this vector from V_i can be expressed using vectors from the other subspaces. This is a contradiction to the assumption that V_i has trivial intersection with the sum of the other subspaces. The only possibility is that all the vectors on the right-hand side are zero, and thus the expression of w is unique. For direct sums of subspaces we write
W = V_1 ⊕ · · · ⊕ V_k = ⊕_{i=1}^{k} V_i.

2.3.7. Basis.
Now we have everything prepared for understanding minimal sets of generators, as we understood them in the plane R^2, and for proving the promised independence of the number of basis elements of any choices. A basis of a k-dimensional space will usually be denoted as a k-tuple v = (v_1, . . . , v_k) of basis vectors. This is just a matter of convention: with finitely dimensional vector spaces we shall always consider the bases along with a given order of the elements, even if we have not defined it that way (strictly speaking).

Clearly, if (v_1, . . . , v_n) is a basis of V, then the whole space V is the direct sum of the one-dimensional subspaces
V = span{v_1} ⊕ · · · ⊕ span{v_n}.
An immediate corollary of the derived uniqueness of the decomposition of any vector w in V into components in the direct sum gives the unique decomposition
w = x_1v_1 + · · · + x_nv_n.
This allows us, after choosing a basis, to see the abstract vectors again as n-tuples of scalars. We shall return to this idea in paragraph 2.3.11, when we finish the discussion of the existence of bases and sums of subspaces in the general case.

2.3.8. Theorem. From any finite set of generators of a vector space V we can choose a basis. Every basis of a finitely dimensional space V has the same number of elements.

Proof. The first claim is easily proved using induction on the number of generators k.

all unique, cf. 2.3.7, 2.3.10. At the same time, the number of vectors in any basis is unique, and we call it the dimension dim_K(V) of V. We write just dim(V) if the field K is clear from the context. Notice that vector spaces (over a fixed field K) of the same finite dimension are isomorphic, cf. 2.3.13.

2.C.9. Consider any matrix A of size m × n over a field K. Besides the kernel Ker(A) = {x ∈ K^n : Ax = 0} of A, there are several other fundamental subspaces associated with A:
• The column space C(A) = {Ax : x ∈ K^n}, which is the span of the columns of A.
• The row space C(A^T) = {A^T y : y ∈ K^m}, which is the span of the rows of A.
• The left null space Ker(A^T) = {y ∈ K^m : A^T y = 0}, also known as the cokernel of A.
Prove that C(A^T) is a subspace of K^n, while Ker(A^T) and C(A) are subspaces of K^m. Moreover, show that dim_K C(A) = rank(A) = dim_K C(A^T). ⃝

Although vector spaces over infinite fields are always infinite sets, Sage can effectively handle them as mathematical objects. The main tool for this is proper work with the concept of a basis. See our discussion in 2.C.14 for the finite case, and see also the discussion in 2.3.10.

2.C.10. Determine whether or not the vectors u_1 = (1, 2, 3, 1)^T, u_2 = (1, 0, −1, 1)^T, u_3 = (2, 1, −1, 3)^T and u_4 = (0, 0, 3, 2)^T are linearly independent. Do they provide a basis of R^4? Describe an answer using Sage as well.

Solution. Given n vectors in R^n, say x_1 = (x_{11}, . . . , x_{1n})^T, . . . , x_n = (x_{n1}, . . . , x_{nn})^T, it is easy to check that the condition det(A) ≠ 0 is equivalent to their linear independence, where A is the coefficient matrix
\[
A = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1n}\\ x_{21} & x_{22} & \dots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \dots & x_{nn} \end{pmatrix}.
\]
For the given vectors in R^4, we compute
\[
\det(A) = \begin{vmatrix} 1 & 2 & 3 & 1\\ 1 & 0 & -1 & 1\\ 2 & 1 & -1 & 3\\ 0 & 0 & 3 & 2 \end{vmatrix} = 10 \neq 0.
\]
Thus the set u = (u_1, u_2, u_3, u_4) consists of linearly independent vectors. Since dim R^4 = 4, any four linearly independent vectors span the whole space, so u is indeed a basis. To check the linear independence/dependence in Sage, we can rely on built-in functions, as follows:

u1=vector(RR, [1, 2, 3, 1])
Only the zero subspace does not need a generator, and thus we are able to choose the empty basis. On the other hand, we cannot choose the zero vector (the generators would then be linearly dependent), and there is nothing else in the subspace. In order to make the inductive step more natural, we deal with the case k = 1 first. We have V = span{v} and v ≠ 0, because {v} is a linearly independent set of vectors. Then {v} is also a basis of the vector space V, and any other vector is a multiple of v, so all bases of V must contain exactly one vector, which can be chosen from any set of generators.

Assume that the claim holds for k = n and consider V = span{v_1, . . . , v_{n+1}}. If v_1, . . . , v_{n+1} are linearly independent, then they form a basis. If they are linearly dependent, there exists i such that
v_i = a_1v_1 + · · · + a_{i-1}v_{i-1} + a_{i+1}v_{i+1} + · · · + a_{n+1}v_{n+1}.
Then V = span{v_1, . . . , v_{i-1}, v_{i+1}, . . . , v_{n+1}}, and we can choose a basis using the inductive assumption.

It remains to show that bases always have the same number of elements. Consider a basis v = (v_1, . . . , v_n) of the space V and, for an arbitrary nonzero vector u = a_1v_1 + · · · + a_nv_n ∈ V, suppose a_i ≠ 0 for some i. Then
v_i = (1/a_i)(u − (a_1v_1 + · · · + a_{i-1}v_{i-1} + a_{i+1}v_{i+1} + · · · + a_nv_n)),
and therefore also span{u, v_1, . . . , v_{i-1}, v_{i+1}, . . . , v_n} = V. We show that this is again a basis. For if adding u to the linearly independent vectors v_1, . . . , v_{i-1}, v_{i+1}, . . . , v_n led to a set of linearly dependent vectors, then V = span{v_1, . . . , v_{i-1}, v_{i+1}, . . . , v_n}, which would yield a basis of n − 1 vectors chosen from v, which is not possible. Thus we have proved that for any nonzero vector u ∈ V there exists i, 1 ≤ i ≤ n, such that (u, v_1, . . . , v_{i-1}, v_{i+1}, . . . , v_n) is again a basis of V.

Similarly, instead of one vector u, we can consider a linearly independent set u_1, . . . , u_k. We will sequentially add u_1, u_2, . . . , always exchanging for some v_i using our previous approach. We have to ensure that there always is such a v_i to be replaced (that is, that the vectors u_i do not end up replacing each other). Assume thus that we have already placed u_1, . . . , u_ℓ instead of some of the v_j's. Then the vector u_{ℓ+1} can be expressed as a linear combination of the latter vectors u_i and the remaining v_j's. As we have seen, u_{ℓ+1} may replace any vector with a non-zero coefficient in this linear combination. If only the coefficients at u_1, . . . , u_ℓ were nonzero, then the vectors u_1, . . . , u_{ℓ+1} would be linearly dependent, which is a contradiction. Summarizing, for every k ≤ n we can arrive after k steps at a basis in which k vectors from the original basis have been exchanged for the new u_i's. If k > n, then in the n-th step we

u2=vector(RR, [1, 0, -1, 1])
u3=vector(RR, [2, 1, -1, 3])
u4=vector(RR, [0, 0, 3, 2])
V=RR^4; V.linear_dependence([u1, u2, u3, u4]) ==[]

Sage's output is True. In case you want to compute only the determinant of A, replace the last line by the syntax given here:

A=column_matrix([u1, u2, u3, u4])
show(det(A))

□

2.C.11. Consider the vector space V = Q^4 and the vectors v_1 = (1, 0, 1, 2)^T and v_2 = (0, 1, 0, 0)^T, both lying in V. What is the dimension of the vector subspace of V = Q^4 spanned by v_1, v_2? ⃝

2.C.12. (a) Consider Q^3 over the rationals and let W be the subspace of Q^3 generated by the vectors v_1 = (2, −1, 3)^T and v_2 = (4, 0, 1)^T. Determine the expression of a general vector of W and next use Sage to determine the dimension of W.
What happens if we replace v_2 by a scalar multiple of v_1?
(b) Consider Q^4 over the rationals and let W be the subspace spanned by the vectors u_1 = (1, 1, 0, 0)^T and u_2 = (0, 1, 1, 0)^T. Describe the quotient space V/W (i.e., "what is left" of V when viewing vectors "up to W", cf. 3.4.13) and next verify your answer via Sage. ⃝

2.C.13. For the second task in 2.C.12, find a basis of the quotient space V/W. Confirm your answer via Sage by using the method vector in subspace. ⃝

2.C.14. (a) Show that the following set of vectors is a subspace of Z_5^3:
\[
U = \left\{ \begin{pmatrix} a\\ 2b\\ 3a + 4b \end{pmatrix} : a, b \in \mathbb{Z}_5 \right\}.
\]
(b) Find the reduced row echelon form of the matrix A having as columns the generators of U. Next deduce that dim_{Z_5} U = 2 by finding a basis of U. ⃝

We can often rely on Sage to perform mathematical "experiments". In linear algebra, for example, this can be done using matrices or vectors with random coefficients. Such experiments can be useful to test theoretical statements, or to illustrate them in a random way. Let us describe such an example.

2.C.15. Provide in Sage a short code that generates a random (ordered) tuple S of vectors within V = Q^4 and checks their linear independence. Moreover, create a random 4 × 4 matrix A whose columns are the vectors in S, and use the matrix's rank to establish an alternative criterion for the linear independence of S (recall the following basic criterion for the invertibility of a square matrix A ∈ Mat_n(K): A is invertible if and only if rank(A) = n).

would obtain a basis consisting only of the new vectors u_i, which means that the original set could not be linearly independent. In particular, it is not possible for two bases to have a different number of elements. □

In fact, we have proved a much stronger claim, the Steinitz exchange lemma:

Steinitz Exchange Lemma
For every finite basis v of a vector space V and every set of linearly independent vectors u_i, i = 1, . . . , k in V, we can find a subset of the basis vectors v_j which completes the set of the u_i's into a new basis.

2.3.9. Corollaries of the Steinitz lemma. Because of the possibility of freely choosing and replacing basis vectors, we can immediately derive nice (and intuitively expected) properties of bases of vector spaces:

Proposition. (1) Every two bases of a finite dimensional vector space have the same number of elements, that is, our definition of dimension is basis-independent.
(2) If V has a finite basis, then every linearly independent set can be extended to a basis.
(3) A basis of a finite dimensional vector space is a maximal linearly independent set of vectors.
(4) The bases of a vector space are the minimal sets of generators.

A little more complicated, but now easy to deal with, is the situation of dimensions of subspaces and their sums:

Corollary. Let W, W_1, W_2 ⊂ V be subspaces of a space V of finite dimension. Then
(1) dim W ≤ dim V,
(2) V = W if and only if dim V = dim W,
(3) dim W_1 + dim W_2 = dim(W_1 + W_2) + dim(W_1 ∩ W_2).

Proof. It remains to prove only the last claim. This is evident if the dimension of one of the spaces is zero. Assume dim W_1 = r ≥ 1, dim W_2 = s ≥ 1 and let (w_1, . . . , w_t) be a basis of W_1 ∩ W_2 (or the empty set, if the intersection is trivial). According to the Steinitz exchange lemma, this basis of the intersection can be extended to a basis (w_1, . . . , w_t, u_{t+1}, . . . , u_r) of W_1 and to a basis (w_1, . . . , w_t, v_{t+1}, . . . , v_s) of W_2. The vectors w_1, . . . , w_t, u_{t+1}, . . . , u_r, v_{t+1}, . . . , v_s clearly generate W_1 + W_2.
We show that they are linearly independent. Let
a_1w_1 + · · · + a_tw_t + b_{t+1}u_{t+1} + · · · + b_ru_r + c_{t+1}v_{t+1} + · · · + c_sv_s = 0.

Solution. To introduce a random vector in Sage we can use the command V.random_element(), where V is the underlying vector space. For instance, the cell

V=QQ^4
u=V.random_element(); show(u)

prints a random vector of Q^4. Each time the code is run, a new vector appears on the screen. To construct simultaneously four random vectors lying in Q^4 and check their linear independence, we can instead type

V=QQ^4
S=[V.random_element() for i in range(4)]
show(S); V.linear_dependence(S)==[]

which returns an ordered random tuple S of vectors in Q^4, and moreover the message "True" or "False", depending on the linear independence/dependence of S. Or type

V=QQ^4
S=[V.random_element() for i in range(4)]
W=V.span(S); print(W)

In this case, if Sage confirms that W is a 4-dimensional subspace, then the set S of four random vectors is linearly independent. Remarkably, in most cases Sage returns a linearly independent set, and hence this method can be used when you need to locate many different (ordered) bases of a given vector space. We recommend practicing with further examples in your editor. Finally, to introduce the required matrix A and specify a rank-based criterion for linear independence or dependence, you can add the following code to either of the previous two cells:

A=column_matrix(S)
if rank(A) < 4 : # then linearly dependent
    print(False)
else : # then linearly independent
    print(True)

Thus, when A is of full rank, i.e., rank(A) = 4, S consists of linearly independent vectors and Sage prints True. When rank(A) < 4, S consists of linearly dependent vectors and the message is False. □

2.C.16. Given arbitrary linearly independent vectors u, v, w, z in a vector space V, decide whether or not the vectors u − 2v, 3u + w − z, u − 4v + w + 2z, 4v + 8w + 4z are also linearly independent. ⃝

2.C.17. Determine the coordinates of the vector w = (1, 1, 1)^T with respect to the following ordered basis of R^3:
u = {u_1 = (1, 2, 1)^T, u_2 = (−1, 1, 0)^T, u_3 = (0, 1, 1)^T}.
Next, verify your answer in Sage, based for example on the command coordinates, or any other method.

Then necessarily
−(c_{t+1} · v_{t+1} + · · · + c_s · v_s) = a_1 · w_1 + · · · + a_t · w_t + b_{t+1} · u_{t+1} + · · · + b_r · u_r
must belong to W_2 ∩ W_1. Since this intersection is spanned by the w_i's and the expression in the basis of W_1 is unique, this implies that b_{t+1} = · · · = b_r = 0. But then also
a_1 · w_1 + · · · + a_t · w_t + c_{t+1} · v_{t+1} + · · · + c_s · v_s = 0,
and because the corresponding vectors form a basis of W_2, all the coefficients are zero. Claim (3) now follows by directly counting the generators. □

2.3.10. Examples. (1) K^n has (as a vector space over K) dimension n. The n-tuple of vectors
((1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, . . . , 0, 1))
is clearly a basis, called the standard basis of K^n. Note that in the case of a finite field of scalars, say Z_k with k prime, the whole space K^n has only a finite number k^n of elements.
(2) C as a vector space over R has dimension 2. A basis is for instance the pair of numbers 1 and i, or any other two complex numbers which are not real multiples of each other, e.g. 1 + i and 1 − i.
(3) K_m[x], that is, the space of all polynomials with coefficients in K of degree at most m, has dimension m + 1. A basis is for instance the sequence 1, x, x^2, . . . , x^m.
The vector space of all polynomials K[x] has dimension ∞, but we can still find a basis (although infinite in size): 1, x, x^2, . . . .
(4) The vector space R over Q has dimension ∞; it does not have a countable basis.
(5) The vector space of all mappings f : R → R also has dimension ∞; it does not have any countable basis.

2.3.11. Vector coordinates. If we fix a basis (v_1, . . . , v_n) of a finite dimensional space V, then every vector w ∈ V can be expressed as a linear combination w = a_1v_1 + · · · + a_nv_n in a unique way. Indeed, assume that we can do it in two ways:
w = a_1v_1 + · · · + a_nv_n = b_1v_1 + · · · + b_nv_n.
Then 0 = (a_1 − b_1) · v_1 + · · · + (a_n − b_n) · v_n, and thus a_i = b_i for all i = 1, . . . , n, because the vectors v_i are linearly independent. We have reached the concept of coordinates:

Solution. When a vector u ∈ R^n is given in coordinates, such as w, it is always understood to be expressed with respect to the standard basis, unless stated otherwise. The assumption that u is an ordered basis, with a fixed and well-defined order as indicated by the positions of the vectors u_1, u_2, u_3, is also crucial. To solve the problem we need to determine reals a, b and c ∈ R which satisfy the following matrix equation:
\[
a \cdot \begin{pmatrix} 1\\ 2\\ 1 \end{pmatrix} + b \cdot \begin{pmatrix} -1\\ 1\\ 0 \end{pmatrix} + c \cdot \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ 1 \end{pmatrix}.
\]
This is equivalent to (a − b, 2a + b + c, a + c)^T = (1, 1, 1)^T, which induces the following system of equations:
a − b = 1, 2a + b + c = 1, a + c = 1.
It is easy to see that this system has a unique solution, given by a = 1/2, b = −1/2, c = 1/2. Thus the coordinates of w with respect to u are given by (1/2, −1/2, 1/2).

There are several different methods to demonstrate a verification via Sage; here we describe two. The first method uses the coordinates command, which is designed to determine the coordinates of a vector with respect to a given basis. Its implementation goes as follows:

V=RR^3
u1=vector([1,2,1]);u2=vector([-1,1,0])
u3=vector([0, 1, 1]);w=vector([1, 1, 1])
L=[u1, u2, u3]
W=V.subspace_with_basis(L)
cord=W.coordinates(w); show(cord)
sum([cord[i]*L[i] for i in range(3)])==w

This cell returns the coefficients of w with respect to u, along with the message True. The latter indicates that the command in the last row verifies the accuracy of the result from show(cord) (i.e., Sage is used to verify its own output). The second method relies on computing the reduced row echelon form of the matrix having as columns the vectors of the given basis together with w, that is, of the extended matrix corresponding to the linear system posed above. For this, add to the previous cell the code

B=column_matrix([u1, u2, u3, w])
B1=B.rref(); show(B1[:,3])

By executing the block, Sage prints out the fourth column of the reduced row echelon form, which indeed consists of the coefficients of w with respect to the ordered basis u. □

So far we have demonstrated a straightforward algorithm to identify a maximal linearly independent set within a given collection of vectors. The steps are as follows:
• Arrange the given vectors as columns in a matrix.
• Transform the matrix into reduced row echelon form using row operations.
• Identify the vectors corresponding to the pivot columns; these form a maximal linearly independent set.

Coordinates of vectors
Definition. The coefficients of the unique linear combination expressing the given vector w ∈ V in the chosen basis v = (v_1, . . . , v_n) are called the coordinates of the vector w in this basis.
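Since this linear combination is unique, the coordinates can also be read off by inverting the matrix whose columns are the basis vectors. A hedged Sage sketch for the basis u of 2.C.17 (the matrix name B is our own choice):

# columns of B are the basis vectors u1, u2, u3; solving B*(a,b,c)^T = w
B = column_matrix(QQ, [[1, 2, 1], [-1, 1, 0], [0, 1, 1]])
w = vector(QQ, [1, 1, 1])
print(B.inverse() * w)       # (1/2, -1/2, 1/2), the coordinates of w in u
print(B.solve_right(w))      # the same, without forming the inverse explicitly

In practice solve_right is preferable to computing the inverse, but both recover exactly the coefficients found above.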
Whenever we speak about coordinates (a_1, . . . , a_n) of a vector w, which we express as a sequence, we must have a fixed ordering of the basis vectors v = (v_1, . . . , v_n). Although we have defined a basis as a minimal set of generators, in reality we work with bases as with sequences (that is, with completely ordered sets).

Assigning coordinates to vectors
The mapping assigning to the vector v = a_1v_1 + · · · + a_nv_n its coordinates in the basis v will be denoted by the same symbol v : V → K^n. It has the following properties:
(1) v(u + w) = v(u) + v(w), ∀u, w ∈ V,
(2) v(a · u) = a · v(u), ∀a ∈ K, ∀u ∈ V.

Note that the operations on the two sides of these equations are not identical; quite the opposite, they are operations on different vector spaces!

Sometimes it is really useful to understand vectors as mappings from a fixed set of independent generators to coordinates (without having the generators ordered). In this way, we may think about a basis M of an infinite dimensional vector space V: even though the set M will be infinite, there can be only a finite number of non-zero values for any mapping representing a vector. The vector space of all polynomials K_∞[x], with the basis M = {1, x, x^2, . . . }, is a good example.

2.3.12. Linear mappings. The above properties of the assignment of coordinates are typical for what we have called linear mappings in the geometry of the plane R^2. For any vector spaces (of finite or infinite dimension) we define "linearity" of a mapping between spaces in a similar way to the case of the plane R^2:

Linear mappings
Let V and W be vector spaces over the same field of scalars K. The mapping f : V → W is called a linear mapping, or homomorphism, if the following holds:
(1) f(u + v) = f(u) + f(v), ∀u, v ∈ V,
(2) f(a · u) = a · f(u), ∀a ∈ K, ∀u ∈ V.

We have seen such mappings already in the case of matrix multiplication:
f : K^n → K^m, x ↦ A · x,
with a fixed matrix A of the type m/n over K. Analogously to the abstract definition of vector spaces, it is again necessary to prove seemingly trivial claims that follow from the axioms:

This approach is essentially what was utilized in the second method described in 2.C.17. For a theoretical validation and a more in-depth discussion, see the theoretical section 2.3.5.

2.C.18. Determine the vector subspace of the Euclidean space R^4 generated by the vectors u_1 = (−1, 3, −2, 1)^T, u_2 = (2, −1, −1, 2)^T, u_3 = (−4, 7, −3, 0)^T, and u_4 = (1, 5, −5, 4)^T.

Solution. Write the vectors u_i into the columns of a matrix and transform it using elementary row transformations. This gives
\[
\begin{pmatrix} -1 & 2 & -4 & 1\\ 3 & -1 & 7 & 5\\ -2 & -1 & -3 & -5\\ 1 & 2 & 0 & 4 \end{pmatrix} \sim
\begin{pmatrix} 1 & 2 & 0 & 4\\ -1 & 2 & -4 & 1\\ 3 & -1 & 7 & 5\\ -2 & -1 & -3 & -5 \end{pmatrix} \sim
\begin{pmatrix} 1 & 2 & 0 & 4\\ 0 & 4 & -4 & 5\\ 0 & -7 & 7 & -7\\ 0 & 3 & -3 & 3 \end{pmatrix} \sim
\begin{pmatrix} 1 & 2 & 0 & 4\\ 0 & 1 & -1 & 5/4\\ 0 & 1 & -1 & 1\\ 0 & 0 & 0 & 0 \end{pmatrix} \sim
\begin{pmatrix} 1 & 2 & 0 & 4\\ 0 & 1 & -1 & 5/4\\ 0 & 0 & 0 & -1/4\\ 0 & 0 & 0 & 0 \end{pmatrix} \sim
\begin{pmatrix} 1 & 0 & 2 & 0\\ 0 & 1 & -1 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 0 \end{pmatrix}.
\]
Now, according to the algorithm above, a maximal linearly independent set (basis) consists of those vectors corresponding to the columns with the pivots. For our case, the pivots in the reduced row echelon form sit in the first, second and fourth columns8, and hence a maximal linearly independent set consists of the vectors u_1, u_2 and u_4. This means that the vector subspace generated by {u_1, u_2, u_3, u_4} is only 3-dimensional. Finally, let us verify the output of our algorithm via Sage (where for convenience we view the vectors as elements of Q^4).
Let us check first that the given vectors are indeed linearly dependent:

v1=vector(QQ, [-1, 3, -2, 1])
v2=vector(QQ, [2, -1, -1, 2])
v3=vector(QQ, [-4, 7, -3, 0])
v4=vector(QQ, [1, 5, -5, 4])
V=QQ^4; L=[v1, v2, v3, v4]
V.linear_dependence(L)==[]

Next we apply the trick with the pivot elements:

M=column_matrix([v1, v2, v3, v4])
M1=M.rref(); show(M1); M.pivots()

with the last command returning the desired (0, 1, 3). □

2.C.19. Derive in Sage a method analogous to the one presented in 2.C.18, with the aim of completing the following set of vectors to a basis of the vector space Q^4:
L = {v_1 = (2, 0, 1, 0)^T, v_2 = (1, 0, 1, 0)^T}. ⃝

8 We will adopt this tactic especially in Chapter 3.

Proposition. Let f : V → W be a linear mapping between two vector spaces over the same field of scalars K. The following holds for all vectors u, u_1, . . . , u_k ∈ V and scalars a_1, . . . , a_k ∈ K:
(1) f(0) = 0,
(2) f(−u) = −f(u),
(3) f(a_1 · u_1 + · · · + a_k · u_k) = a_1 · f(u_1) + · · · + a_k · f(u_k),
(4) for every vector subspace V_1 ⊂ V, its image f(V_1) is a vector subspace in W,
(5) for every vector subspace W_1 ⊂ W, the set f^{-1}(W_1) = {v ∈ V; f(v) ∈ W_1} is a vector subspace in V.

Proof. We rely on the axioms, definitions and already proved results (in case you are not sure what has been used, look it up!):
f(0) = f(u − u) = f((1 − 1) · u) = 0 · f(u) = 0,
f(−u) = f((−1) · u) = (−1) · f(u) = −f(u).
Property (3) is derived easily from the definition for two summands, using induction on the number of summands. Next, (3) implies span f(V_1) = f(V_1), thus it is a vector subspace. On the other hand, if f(u) ∈ W_1 and f(v) ∈ W_1, then for any scalars we arrive at f(a · u + b · v) = a · f(u) + b · f(v) ∈ W_1. □

The image of a linear mapping, Im f = f(V) ⊂ W, is always a vector subspace, since for any set of vectors u_i, the linear combination of the images f(u_i) is the image of the linear combination of the vectors u_i with the same coefficients. Analogously, the set of all vectors Ker f = f^{-1}({0}) ⊂ V is a subspace, since any linear combination of vectors with zero images is again mapped to the zero vector. The subspace Ker f is called the kernel of the linear mapping f. A linear mapping which is a bijection is called an isomorphism.

2.3.13. Proposition (Simple corollaries). (1) The composition g ◦ f : V → Z of two linear mappings f : V → W and g : W → Z is again a linear mapping.
(2) The linear mapping f : V → W is an isomorphism if and only if Im f = W and Ker f = {0} ⊂ V. The inverse mapping of an isomorphism is again an isomorphism.
(3) For any two subspaces V_1, V_2 ⊂ V and a linear mapping f : V → W,
f(V_1 + V_2) = f(V_1) + f(V_2),
f(V_1 ∩ V_2) ⊂ f(V_1) ∩ f(V_2).
(4) The "coordinate assignment" mapping u : V → K^n given by an arbitrarily chosen basis u = (u_1, . . . , u_n) of a vector space V is an isomorphism.
(5) Two finitely dimensional vector spaces are isomorphic if and only if they have the same dimension.
(6) The composition of two isomorphisms is an isomorphism.

2.C.20. Consider the matrix
\[
A = \begin{pmatrix} 2 & 4\\ 1 & 3\\ 0 & 5 \end{pmatrix}.
\]
i) Find the column space C(A), the row space C(A^T), the kernel Ker(A) and the cokernel Ker(A^T) of A.
ii) Compute the dimensions of these subspaces, using Sage as well. Hint: Use the commands column_space, row_space, right_kernel and left_kernel.

Solution. (i) The given matrix A has size 3 × 2, and thus the column space should be a subspace of R^3. Let us check the linear independence of the two column vectors of A.
By performing elementary row operations we see that
\[
A = \begin{pmatrix} 2 & 4\\ 1 & 3\\ 0 & 5 \end{pmatrix}
\xrightarrow[R_3 \to \frac15 R_3]{R_1 \to \frac12 R_1}
\begin{pmatrix} 1 & 2\\ 1 & 3\\ 0 & 1 \end{pmatrix}
\xrightarrow{R_2 \to R_2 - R_1}
\begin{pmatrix} 1 & 2\\ 0 & 1\\ 0 & 1 \end{pmatrix}
\xrightarrow{R_3 \to R_3 - R_2}
\begin{pmatrix} 1 & 2\\ 0 & 1\\ 0 & 0 \end{pmatrix}
\xrightarrow{R_1 \to R_1 - 2R_2}
\begin{pmatrix} 1 & 0\\ 0 & 1\\ 0 & 0 \end{pmatrix}.
\]
Thus rank(A) = 2 and the column vectors are linearly independent. Hence C(A) = span{(2, 1, 0)^T, (4, 3, 5)^T} with dim_R C(A) = 2. From the reduced row echelon form of A we also deduce that the first two rows of A are linearly independent. For instance, it is easy to see that the third row is a linear combination of the previous two: (0, 5)^T = a(2, 4)^T + b(1, 3)^T with a = −5/2 and b = 5. Thus C(A^T) = span{(2, 4)^T, (1, 3)^T} with dim_R C(A^T) = 2 as well.

Recall that null spaces are the solution spaces of homogeneous linear systems Au = 0. For our case, this system has the form
\[
\begin{pmatrix} 2 & 4\\ 1 & 3\\ 0 & 5 \end{pmatrix} \begin{pmatrix} x_1\\ x_2 \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ 0 \end{pmatrix},
\]
and it is easy to see that the unique solution is given by x_1 = 0 = x_2. Thus Ker(A) is trivial, Ker(A) = {(0, 0)^T} ⊂ R^2. In other words, the linear map F : R^2 → R^3 corresponding to A is an injection. On the other hand, the left null space is the kernel of the transposed matrix A^T. This space consists of vectors w = (w_1, w_2, w_3)^T ∈ R^3 such that A^T w = 0, i.e.,
\[
\begin{pmatrix} 2 & 1 & 0\\ 4 & 3 & 5 \end{pmatrix} \begin{pmatrix} w_1\\ w_2\\ w_3 \end{pmatrix} = \begin{pmatrix} 0\\ 0 \end{pmatrix}.
\]
Solving the corresponding system we obtain Ker(A^T) = span{(5, −10, 2)^T} ⊂ R^3, and hence the cokernel of A is 1-dimensional.
(ii) In Sage, you can easily execute the commands outlined in the statement as follows:

A=matrix([[2, 4], [1, 3], [0, 5]])
rank(A)
print(A.column_space())
print(dim(A.column_space())==rank(A))
print(A.row_space())

Proof. Proving the first claim is a very easy exercise left to the reader. In order to verify (2), notice that f is surjective if and only if Im f = W. If Ker f = {0}, then f(u) = f(v) ensures f(u − v) = 0, that is, u = v; in this case f is injective. Finally, if f is a linear bijection, then the vector w is the preimage of a linear combination au + bv, that is, w = f^{-1}(au + bv), if and only if f(w) = au + bv = f(a · f^{-1}(u) + b · f^{-1}(v)). Thus we also get w = a f^{-1}(u) + b f^{-1}(v), and therefore the inverse of a linear bijection is again a linear bijection. The third property is obvious from the definition, but try finding an example showing that the inclusion in the second relation can indeed be sharp. The remaining claims all follow immediately from the definitions. □

2.3.14. Coordinates again. Consider any two vector spaces V and W over K with dim V = n, dim W = m, and consider some linear mapping f : V → W. For every choice of a basis u = (u_1, . . . , u_n) on V and v = (v_1, . . . , v_m) on W, there are the following linear mappings, as shown in the diagram:
\[
\begin{array}{ccc}
V & \xrightarrow{\;f\;} & W\\
u\downarrow\simeq & & \simeq\downarrow v\\
K^n & \xrightarrow{\;f_{u,v}\;} & K^m
\end{array}
\]
The bottom arrow f_{u,v} is defined by the remaining three, i.e., as the composition of linear mappings f_{u,v} = v ◦ f ◦ u^{-1}.

Matrix of a linear mapping
Every linear mapping is uniquely determined by its values on an arbitrary set of generators, in particular on the vectors of a basis u. Denote by
\[
\begin{aligned}
f(u_1) &= a_{11}v_1 + a_{21}v_2 + \cdots + a_{m1}v_m,\\
f(u_2) &= a_{12}v_1 + a_{22}v_2 + \cdots + a_{m2}v_m,\\
&\ \ \vdots\\
f(u_n) &= a_{1n}v_1 + a_{2n}v_2 + \cdots + a_{mn}v_m,
\end{aligned}
\]
that is, the scalars a_{ij} form a matrix A, whose columns are the coordinates of the values f(u_j) of the mapping f on the basis vectors, expressed in the basis v on the target space W. The matrix A = (a_{ij}) is called the matrix of the mapping f in the bases u, v.
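As a small hedged Sage illustration of this definition (the mapping f and all names here are ad hoc choices of ours, not from the text), the matrix in the standard bases is assembled column by column from the images of the basis vectors:

# f(x, y) = (x + y, x - y, 2x) as a linear mapping Q^2 -> Q^3
f = lambda v: vector(QQ, [v[0] + v[1], v[0] - v[1], 2*v[0]])
A = column_matrix([f(e) for e in (QQ^2).basis()])   # columns = f(e1), f(e2)
print(A)                                            # the 3x2 matrix of f
v = vector(QQ, [3, 5])
print(A * v == f(v))                                # True: f acts as A on coordinates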
print(dim(A.row_space())==rank(A))
print(A.right_kernel())
print((A.T).right_kernel())
print(A.left_kernel())
print((A.T).right_kernel()==A.left_kernel())

In this block the final line confirms that the command (A.T).right_kernel() can be used as an alternative to A.left_kernel(). □

In linear algebra, there are several classical constructions that generate new vector spaces from existing ones. Examples include intersections, direct sums (see e.g. 2.3.6), and quotients of vector spaces. While intersection means just solving systems of linear equations on coordinates, sums require merging all generators (and keeping only a maximal independent subset). We already touched on quotient spaces in 2.C.12, although they are formally introduced in Chapter 3, see 3.4.13. Since we are short of space, we display only a few tasks; further discussion appears later, see also Section E.

2.C.21. In R^4 consider the three-dimensional linear subspaces U = span{u_1, u_2, u_3} and V = span{v_1, v_2, v_3}, where the vectors u_i and v_i (i = 1, 2, 3) are respectively given by
u_1 = (1, 1, 1, 0)^T, v_1 = (1, 1, −1, −1)^T,
u_2 = (1, 1, 0, 1)^T, v_2 = (1, −1, 1, −1)^T,
u_3 = (1, 0, 1, 1)^T, v_3 = (1, −1, −1, 1)^T.
Determine explicitly the subspace W := U ∩ V and verify your answer via Sage. Moreover, compute dim(W) over R.

Solution. The intersection U ∩ V contains exactly those vectors which are linear combinations of the vectors u_i and also linear combinations of the vectors v_i, for i = 1, 2, 3. Thus we search for some x_1, x_2, x_3, y_1, y_2, y_3 ∈ R satisfying
x_1u_1 + x_2u_2 + x_3u_3 = y_1v_1 + y_2v_2 + y_3v_3.
This means that we are looking for a solution of the following system of linear equations:
\[
\begin{aligned}
x_1 + x_2 + x_3 &= y_1 + y_2 + y_3,\\
x_1 + x_2 &= y_1 - y_2 - y_3,\\
x_1 + x_3 &= -y_1 + y_2 - y_3,\\
x_2 + x_3 &= -y_1 - y_2 + y_3.
\end{aligned}
\]
We convert this to a homogeneous system of linear equations by moving all variables to the left-hand side. Its matrix reads
\[
A = \begin{pmatrix}
1 & 1 & 1 & -1 & -1 & -1\\
1 & 1 & 0 & -1 & 1 & 1\\
1 & 0 & 1 & 1 & -1 & 1\\
0 & 1 & 1 & 1 & 1 & -1
\end{pmatrix}.
\]
Let us now apply row operations to find an echelon form:
\[
A \sim \begin{pmatrix}
1 & 1 & 1 & -1 & -1 & -1\\
0 & 0 & -1 & 0 & 2 & 2\\
0 & -1 & 0 & 2 & 0 & 2\\
0 & 1 & 1 & 1 & 1 & -1
\end{pmatrix}
\]

For a general vector u = x_1u_1 + · · · + x_nu_n ∈ V we calculate (recall that vector addition is commutative and distributive with respect to scalar multiplication)
\[
\begin{aligned}
f(u) &= x_1f(u_1) + \cdots + x_nf(u_n)\\
&= x_1(a_{11}v_1 + \cdots + a_{m1}v_m) + \cdots + x_n(a_{1n}v_1 + \cdots + a_{mn}v_m)\\
&= (x_1a_{11} + \cdots + x_na_{1n})v_1 + \cdots + (x_1a_{m1} + \cdots + x_na_{mn})v_m.
\end{aligned}
\]
Using matrix multiplication we can now very easily and clearly write down the values of the mapping f_{u,v}(w) defined uniquely by the previous diagram. Recall that vectors in K^ℓ are understood as columns, that is, matrices of the type ℓ/1:
f_{u,v}(u(w)) = v(f(w)) = A · u(w).
On the other hand, if we have fixed bases on V and W, then every choice of a matrix A of the type m/n gives a unique linear mapping K^n → K^m, and thus also a mapping f : V → W. We have found a bijective correspondence between matrices of the fixed type (determined by the dimensions of V and W) and linear mappings V → W.

2.3.15. Coordinate transition matrix. If we choose V = W to be the same space, but with two different bases u, v, and take the identity mapping for f, then the approach from the previous paragraph expresses the vectors of the basis u in coordinates with respect to the basis v. Let the resulting matrix be T. Thus, we are applying the concept of the matrix of a linear mapping to the special case of the identity mapping id_V.
\[
\begin{array}{ccc}
V & \xrightarrow{\;\mathrm{id}_V\;} & V\\
u\downarrow\simeq & & \simeq\downarrow v\\
K^n & \xrightarrow{\;T = (\mathrm{id}_V)_{u,v}\;} & K^n
\end{array}
\]
The resulting matrix T is called the coordinate transition matrix for changing the basis from u to the basis v. The fact that the matrix T of the identity mapping yields exactly the transformation of coordinates between the two bases is easily seen. Consider the expression of u in the basis u,
u = x_1u_1 + · · · + x_nu_n,
and replace the vectors u_i by their expressions as linear combinations of the vectors v_i in the basis v. Collecting the terms properly, we obtain the coordinate expression x̄ = (x̄_1, . . . , x̄_n) of the same vector u in the basis v. It is enough just to reorder the summands and express the individual scalars at the vectors of the basis. But this is exactly what we do when forming the matrix of the identity mapping; thus x̄ = T · x. We have arrived at the following instruction for building the coordinate transition matrix:

\[
\sim \begin{pmatrix}
1 & 1 & 1 & -1 & -1 & -1\\
0 & 1 & 1 & 1 & 1 & -1\\
0 & 0 & -1 & 0 & 2 & 2\\
0 & 0 & 1 & 3 & 1 & 1
\end{pmatrix}
\sim \begin{pmatrix}
1 & 1 & 1 & -1 & -1 & -1\\
0 & 1 & 1 & 1 & 1 & -1\\
0 & 0 & 1 & 0 & -2 & -2\\
0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
\sim \begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0\\
0 & 1 & 1 & 0 & 0 & -2\\
0 & 0 & 1 & 0 & -2 & -2\\
0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
\sim \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 2\\
0 & 1 & 0 & 0 & 2 & 0\\
0 & 0 & 1 & 0 & -2 & -2\\
0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}.
\]
The final matrix is in row echelon form and gives
x_1 = −2t, x_2 = −2s, x_3 = 2s + 2t, y_1 = −s − t, y_2 = s, y_3 = t, with t, s ∈ R.
One can verify the previous computation in Sage by typing

A=matrix(QQ, [[1, 1, 1, -1, -1, -1],
[1, 1, 0, -1, 1, 1],
[1, 0, 1, 1, -1, 1],
[0, 1, 1, 1, 1, -1]])
show(A.echelon_form())

Hence we can claim that a general vector of W is written as
\[
\begin{pmatrix} x_1 + x_2 + x_3\\ x_1 + x_2\\ x_1 + x_3\\ x_2 + x_3 \end{pmatrix} = \begin{pmatrix} 0\\ -2t - 2s\\ 2s\\ 2t \end{pmatrix},
\]
that is, W = span_R{(0, −2t − 2s, 2s, 2t)^T : s, t ∈ R}. Obviously, W is generated by the vectors w_1 = (0, −1, 1, 0)^T and w_2 = (0, −1, 0, 1)^T, and it is easy to see that they are linearly independent. Thus dim_R(W) = 2. In Sage we can determine the intersection U ∩ V of two linear subspaces of a given vector space via the command U.intersection(V). For our example, give the cell

u1=vector([1, 1, 1, 0])
u2=vector([1, 1, 0, 1])
u3=vector([1, 0, 1, 1])
U=(RR**4).span([u1, u2, u3], QQ)
v1=vector([1, 1, -1, -1])
v2=vector([1, -1, 1, -1])
v3=vector([1, -1, -1, 1])
V=(RR**4).span([v1, v2, v3], QQ)
W=U.intersection(V)
show(W); print(dim(W))

□

2.C.22. Referring to the results presented in 2.C.20, demonstrate that for the given matrix A the following direct sum decompositions are valid:
R^3 = C(A) ⊕ Ker(A^T), R^2 = C(A^T) ⊕ Ker(A).

Solution. In 2.C.20 we obtained the expressions
C(A) = span{u_1 = (2, 1, 0)^T, u_2 = (4, 3, 5)^T}, Ker(A^T) = span{u_3 = (5, −10, 2)^T}.
In particular, we have seen that dim C(A) + dim Ker(A^T) = 2 + 1 = 3 = dim R^3. This, along with the relation C(A) ∩ Ker(A^T) = {0} that we are going to prove below, precisely establishes the conditions for a direct sum, i.e., R^3 = C(A) ⊕ Ker(A^T). Hence, suppose that v ∈ C(A) ∩ Ker(A^T). Then there exist scalars a, b, c ∈ R such that v = au_1 + bu_2 and v = cu_3, respectively,

Calculating the matrix for changing the basis
Proposition. The matrix T for the transition from the basis u to the basis v is obtained by taking the coordinates of the vectors of the basis u expressed in the basis v and writing them as the columns of the matrix T. The new coordinates x̄ in terms of the new basis v are then x̄ = T · x, where x is the coordinate vector in the original basis u.
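A hedged Sage sketch of this proposition (the two bases below are our own ad hoc choices): if the columns of U and V hold the bases u and v in standard coordinates, then the columns of T = V^{-1}U are exactly the v-coordinates of the u basis vectors.

U = column_matrix(QQ, [[1, 1], [1, -1]])   # basis u as columns
V = column_matrix(QQ, [[1, 0], [1, 1]])    # basis v as columns
T = V.inverse() * U                        # transition matrix from u to v
x = vector(QQ, [2, 3])                     # coordinates of a vector in u
# the same vector: U*x in standard coordinates, so V^{-1}*(U*x) in v-coordinates
print(T * x == V.inverse() * (U * x))      # True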
Because the inverse mapping to the identity mapping is again the identity mapping, the coordinate transition matrix is always invertible, and its inverse T−1 is the coordinate transition matrix in the opposite direction, that is, from the basis v to the basis u (just have a look at the diagram above and invert all the arrows).

2.3.16. More coordinates. Next, we are interested in the matrix of a composition of linear mappings. Consider another vector space Z over K of dimension k with basis w, a linear mapping g : W → Z, and denote the corresponding matrix by g_{v,w}:
\[
\begin{array}{ccccc}
V & \xrightarrow{\;f\;} & W & \xrightarrow{\;g\;} & Z\\
{\scriptstyle u}\downarrow\simeq & & {\scriptstyle v}\downarrow\simeq & & {\scriptstyle w}\downarrow\simeq\\
K^n & \xrightarrow{\;f_{u,v}\;} & K^m & \xrightarrow{\;g_{v,w}\;} & K^k
\end{array}
\]
The composition g ◦ f on the upper row corresponds to the matrix of the mapping Kn → Kk on the bottom, and we calculate directly (writing A for the matrix of f and B for the matrix of g in the chosen bases)
\[
g_{v,w} \circ f_{u,v}(x) = (w \circ g \circ v^{-1}) \circ (v \circ f \circ u^{-1})(x) = B \cdot (A \cdot x) = (B \cdot A) \cdot x = (g \circ f)_{u,w}(x)
\]
for every x ∈ Kn. By the associativity of matrix multiplication, the composition of mappings corresponds to the multiplication of the corresponding matrices. Note that isomorphisms correspond exactly to invertible matrices and that the matrix of the inverse mapping is the inverse matrix.

The same approach shows how the matrix of a linear mapping changes if we change the coordinates on both the domain and the codomain:
\[
\begin{array}{ccccccc}
V & \xrightarrow{\;\mathrm{id}_V\;} & V & \xrightarrow{\;f\;} & W & \xrightarrow{\;\mathrm{id}_W\;} & W\\
{\scriptstyle u'}\downarrow\simeq & & {\scriptstyle u}\downarrow\simeq & & {\scriptstyle v}\downarrow\simeq & & {\scriptstyle v'}\downarrow\simeq\\
K^n & \xrightarrow{\;T\;} & K^n & \xrightarrow{\;f_{u,v}\;} & K^m & \xrightarrow{\;S^{-1}\;} & K^m
\end{array}
\]
where T is the coordinate transition matrix from u′ to u and S is the coordinate transition matrix from v′ to v. If A is the original matrix of the mapping, then the matrix of the new mapping is given by A′ = S−1AT.

In the special case of a linear mapping f : V → V, that is, when the domain and the codomain are the same space V, we usually express f in terms of a single basis u of the space V. Then the change from the old basis to the new basis u′ with the coordinate transition matrix T leads to the new matrix A′ = T−1AT.

and we should have the relation au1 + bu2 = cu3. This gives the matrix equation
\[
a\begin{pmatrix} 2\\ 1\\ 0 \end{pmatrix} + b\begin{pmatrix} 4\\ 3\\ 5 \end{pmatrix} - c\begin{pmatrix} 5\\ -10\\ 2 \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ 0 \end{pmatrix}.
\]
To solve the corresponding homogeneous system, you can use Sage in the following straightforward manner:

var("a, b, c")
eq1=2*a+4*b-5*c; eq2=a+3*b+10*c; eq3=5*b-2*c
solve([eq1==0, eq2==0, eq3==0], a, b, c)

Sage’s output has the form [[a == 0, b == 0, c == 0]], and hence v = (0, 0, 0)T, i.e., C(A) ∩ Ker(AT) = {0}. This proves the first direct sum decomposition; you are encouraged to prove the second one, which is actually easier. □

In mathematics, structures are best understood via the mappings which preserve them. For vector spaces, these are the linear mappings, cf. 2.3.12. We have already met homotheties, rotations, and reflections in Chapter 1. By its very definition, a linear mapping f : V → W is uniquely determined by its action on the basis vectors of its domain.9 This results in the unique matrix representation of f, once bases on V and W are chosen. Thus, we are back at matrix calculus, and composition of mappings is given by products of matrices.

Two special subspaces are associated to a linear map f : V → W, the kernel Ker(f) = f−1({0}) ⊂ V and the image Im(f) = f(V) ⊂ W. Isomorphisms are those f with trivial kernel Ker(f) = {0} and full image Im(f) = W. Such mappings are both injective and surjective, and their matrices are invertible. Any transformation of coordinates is such an example. Although we usually write vectors as columns, in the sequel we shall often write just x = (x1, . . . , xn) for vectors in Kn.
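The statement that composition of mappings corresponds to the product of matrices is easy to test in Sage. A minimal sketch, with illustrative matrices A, B of our own choosing (side="right" makes the matrices act on columns, as in the text):

A = matrix(QQ, [[1, 2], [0, 1], [1, 1]])   # matrix of f: QQ^2 -> QQ^3
B = matrix(QQ, [[1, 0, 2], [3, 1, 0]])     # matrix of g: QQ^3 -> QQ^2
f = linear_transformation(QQ^2, QQ^3, A, side="right")
g = linear_transformation(QQ^3, QQ^2, B, side="right")
x = vector(QQ, [5, -2])
print(g(f(x)) == B * A * x)   # True: the composition acts by the product B*A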
2.C.23. Check which of the following mappings F are linear transformations:
i) F : R3 → R2, F((x, y, z)) = (y, z);
ii) F : R3 → R3, F((x, y, z)) = (x + z, 2x − 3 + z, −2y);
iii) F : R2 → R2, F((x, y)) = (x + a, y), 0 ≠ a ∈ R;
iv) F : R → R, F(x) = x + a, where a ∈ R;
v) F : R2 → R2, F((x, y)) = (x2, y);
vi) F : R2 → R1[t], F((x, y)) = y + x t;
vii) F : Matn(R) → Matn(R), F(A) = AB − BA, where B ∈ Matn(R) is fixed;
viii) F : Matn(R) → Matn(R), F(A) = AT;
ix) F : Z3^2 → Z3^3, F((x, y)) = (x + y, 2x + y, x), where as usual Z3^n = Z3 × · · · × Z3 (n factors);
x) F : R → R, F(t) = e^t, where e is the base of the natural logarithm. ⃝

9 Linear mappings T : V → W are also called linear transformations or operators. Linear mappings T : V → V are also called endomorphisms of V.

2.3.17. Linear forms. A simple but very important case of linear mappings on an arbitrary vector space V over the scalars K appears when the codomain is the scalars themselves, i.e., for mappings f : V → K. We call them linear forms. If we are given coordinates on V, then the assignment of the i-th coordinate to a vector is an example of a linear form. More precisely, for every choice of a basis v = (v1, . . . , vn) there are the linear forms v∗i : V → K such that v∗i(vj) = δij, that is, v∗i(vj) = 1 when i = j, and v∗i(vj) = 0 when i ≠ j.

The vector space of all linear forms on V is denoted by V∗ and we call it the dual space of the vector space V. Let us now assume that the vector space V has finite dimension n. The basis v∗ = (v∗1, . . . , v∗n) of V∗, composed of the assignments of the individual coordinates as above, is called the dual basis to v. Clearly this is a basis of the space V∗, because these forms are evidently linearly independent (prove it!) and if α ∈ V∗ is an arbitrary form, then for every vector u = x1v1 + · · · + xnvn
\[
\alpha(u) = x_1\alpha(v_1) + \cdots + x_n\alpha(v_n) = \alpha(v_1)v^*_1(u) + \cdots + \alpha(v_n)v^*_n(u),
\]
and thus the linear form α is a linear combination of the forms v∗i.

Taking into account the standard basis {1} on the one-dimensional space of scalars K, any choice of a basis v on V identifies the linear forms α with matrices of the type 1/n, that is, with rows y. The components of these rows are the coordinates of the general linear forms α in the dual basis v∗. Evaluating such a form on a vector is then given by multiplying the corresponding row vector y with the column of the coordinates x of the vector u ∈ V in the basis v:
\[
\alpha(u) = y \cdot x = y_1x_1 + \cdots + y_nx_n.
\]
Thus we can see that for every finite-dimensional space V, the dual space V∗ is isomorphic to the space V. The choice of the dual basis provides such an isomorphism. In this context we meet again the scalar product of a row of n scalars with a column of n scalars. We have worked with it already in paragraph 2.1.3 on page 85.

The situation is different for infinite-dimensional spaces. For instance, the simplest example, the space of all polynomials K[x] in one variable, is a vector space with a countable basis with elements vi = xi. As before, we can define the linearly independent forms v∗i. Every formal infinite sum \(\sum_{i=0}^{\infty} a_iv^*_i\) is now a well-defined linear form on K[x], because it is evaluated only on finite linear combinations of the basis polynomials xi, i = 0, 1, 2, . . . . The countable set of all the v∗i is thus not a basis. Actually, it can be proved that this dual space cannot have a countable basis.
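In coordinates, a dual basis is easy to compute: if M is the matrix with the basis vectors ui as its columns, then the i-th row of M−1 satisfies (row i)·uj = δij, so the rows of M−1 are exactly the forms u∗i. A small Sage sketch (the basis below is an illustrative choice of ours; it happens to reappear in task 2.C.35):

u1 = vector(QQ, [1, 1]); u2 = vector(QQ, [3, 1])
M = column_matrix([u1, u2])
Minv = M.inverse()
# Evaluate each candidate dual form on each basis vector:
print([[Minv.row(i) * u for u in [u1, u2]] for i in range(2)])
# [[1, 0], [0, 1]] -- the defining property u*_i(u_j) = delta_ij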
2.C.24. Show that the map T : R2 → R3 defined by T((x, y)) = (x, y, x − y) is a linear transformation. Next, find the matrix A corresponding to T with respect to the standard bases of R2 and R3, respectively.

Solution. Let u = (u1, u2)T and v = (v1, v2)T be two arbitrary vectors in R2. Then we see that
\[
T(au + bv) = T\!\begin{pmatrix} au_1 + bv_1\\ au_2 + bv_2 \end{pmatrix}
= \begin{pmatrix} au_1 + bv_1\\ au_2 + bv_2\\ au_1 + bv_1 - au_2 - bv_2 \end{pmatrix}
= a\begin{pmatrix} u_1\\ u_2\\ u_1 - u_2 \end{pmatrix} + b\begin{pmatrix} v_1\\ v_2\\ v_1 - v_2 \end{pmatrix}
= aT(u) + bT(v)
\]
for all a, b ∈ R. Thus T is a linear transformation. Let e = {e1 = (1, 0)T, e2 = (0, 1)T} be the standard basis of R2 and denote by ε = {ε1 = (1, 0, 0)T, ε2 = (0, 1, 0)T, ε3 = (0, 0, 1)T} the standard basis of R3. According to the discussion in 2.3.14, the columns of the matrix A = (aij) corresponding to T consist of the coordinates of the values T(ej), for j = 1, 2, expressed in the basis ε on the target space R3. We compute
\[
T(e_1) = (1, 0, 1)^T = 1\cdot\varepsilon_1 + 0\cdot\varepsilon_2 + 1\cdot\varepsilon_3,\qquad
T(e_2) = (0, 1, -1)^T = 0\cdot\varepsilon_1 + 1\cdot\varepsilon_2 - 1\cdot\varepsilon_3,
\]
hence the matrix A has the form
\[
A = \begin{pmatrix} 1 & 0\\ 0 & 1\\ 1 & -1 \end{pmatrix}.
\]
As a simple verification, show that T(u) = Au = (x, y, x − y)T for all vectors u = (x, y)T ∈ R2. In fact, the matrix presentation T(u) = Au provides an easier way to verify the linearity of T. Indeed, assuming that a, b, u, v are as above, we get T(au + bv) = A(au + bv) = a(Au) + b(Av) = aT(u) + bT(v). □

2.C.25. Remark. In Sage we can treat linear mappings via the command linear_transformation; see also the task in 2.C.27 for a complementary application. Let us use the linear mapping T : R2 → R3 given in 2.C.24 to illustrate the situation:

V=RR^2; W=RR^3
var("x, y"); f(x, y)=[x, y, x-y]
T=linear_transformation(V, W, f)
show(T)

Executing this block will display, among other information, the matrix representation of T, which we present here:
\[
\begin{pmatrix} 1.0 & 0.0 & 1.0\\ 0.0 & 1.0 & -1.0 \end{pmatrix}.
\]
Notice this matrix acts from the “left”, in terms of Sage, and hence it is the transpose of the matrix A presented above.10 To ensure that Sage prints the correct matrix, add the following code:

10 This means that A acts on a vector x as xT A.

2.3.18. The length of vectors and the scalar product. When dealing with the geometry of the plane R2 in the first chapter, we also needed the concepts of the length of vectors and of their angles, see 1.5.7. For defining these concepts we used the scalar product of two vectors u = (x, y) and v = (x′, y′) in the form u · v = xx′ + yy′. Indeed, the expression for the length of v = (x, y) is given by
\[
\|v\| = \sqrt{x^2 + y^2} = \sqrt{v \cdot v},
\]
while the (oriented) angle φ of two vectors u = (x, y) and v = (x′, y′) is in planar geometry given by the formula
\[
\cos\varphi = \frac{xx' + yy'}{\|u\|\,\|v\|}.
\]
Note that this scalar product is linear in each of its arguments; we denote it by u · v or by ⟨u, v⟩. The scalar product defined in this way is symmetric in its arguments and, of course, ∥v∥ = 0 if and only if v = 0. We also see immediately that two vectors in the Euclidean plane are perpendicular whenever their scalar product is zero.

Now we shall mimic this approach in higher dimensions. First, observe that the angle between two vectors is always a two-dimensional concept (we want the angle to be the same in the two-dimensional subspace containing the two vectors u and v). In the subsequent paragraphs, we shall consider only finite-dimensional vector spaces over the real scalars R.
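Before formalizing, here is a quick numerical illustration of these planar formulas in Sage (the two vectors are an arbitrary choice of ours):

u = vector(QQ, [1, 0]); v = vector(QQ, [1, 1])
print(u.norm(), v.norm())                        # 1, sqrt(2)
cosphi = u.dot_product(v) / (u.norm() * v.norm())
print(arccos(cosphi))                            # 1/4*pi, i.e. 45 degrees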
Scalar product and orthogonality
A scalar product on a vector space V over the real numbers is a mapping ⟨ , ⟩ : V × V → R which is symmetric in its arguments, linear in each of them, and such that ⟨v, v⟩ ≥ 0, with ∥v∥2 = ⟨v, v⟩ = 0 if and only if v = 0. The number ∥v∥ = √⟨v, v⟩ is called the length, or norm, of the vector v. Vectors v, w ∈ V are called orthogonal or perpendicular whenever ⟨v, w⟩ = 0; we also write v ⊥ w. The vector v is called normalised whenever ∥v∥ = 1. A basis of the space V composed exclusively of mutually orthogonal vectors is called an orthogonal basis. If the vectors in such a basis are all normalised, we call the basis orthonormal.

A scalar product is very often denoted by the common dot, that is, ⟨u, v⟩ = u · v. It is then necessary to recognize from the context whether the dot means the product of two vectors (the result is a scalar) or something different (e.g., we often denote the product of matrices and the product of scalars in the same way). Because the scalar product is linear in each of its arguments, it is completely determined by its values on pairs of basis vectors. Indeed, choose a basis u = (u1, . . . , un) of the space V and denote sij = ⟨ui, uj⟩.

A=T.matrix(side="right"); show(A)

Check yourselves that now Sage prints out the matrix A posed in 2.C.24.

2.C.26. Let T : R3 → R3 be the linear mapping given by
\[
T\!\begin{pmatrix} x\\ y\\ z \end{pmatrix} = \begin{pmatrix} x + y + z\\ x + y + 2z\\ x + y + 3z \end{pmatrix}.
\]
Determine its matrix with respect to the standard basis of R3, and next present an answer via Sage. ⃝

2.C.27. Consider the matrix
\[
A = \begin{pmatrix} -1 & 2 & 3\\ 4 & 2 & 0 \end{pmatrix}.
\]
Find the value f(u) of the linear mapping f : R2 → R3 induced by A, where u = (1, 2)T ∈ R2. Next, using Sage, find a basis for the kernel and the image of f. Is f injective or surjective?

Solution. The mapping f has domain R2 and target R3, thus its matrix should be 3 × 2. By our assumption, this means that the matrix of f is the transpose of the given A. Thus we have
\[
f(u) = \begin{pmatrix} -1 & 4\\ 2 & 2\\ 3 & 0 \end{pmatrix}\begin{pmatrix} 1\\ 2 \end{pmatrix} = \begin{pmatrix} 7\\ 6\\ 3 \end{pmatrix}.
\]
As before, use the command linear_transformation() in Sage to verify this result:

A = matrix(RR, [[-1, 2, 3], [4, 2, 0]])
f = linear_transformation(A); f

This gives:

Vector space morphism represented by the matrix:
[-1.00000000000000  2.00000000000000  3.00000000000000]
[ 4.00000000000000  2.00000000000000 0.000000000000000]
Domain: Vector space of dimension 2 over Real Field with 53 bits of precision
Codomain: Vector space of dimension 3 over Real Field with 53 bits of precision

We can now evaluate f at a vector, simply by typing f([1, 2]). This prints the vector (7, 6, 3). We can also compute the kernel and the image of f by adding the commands

f.kernel(); f.image()

respectively. For them, the output is

Vector space of degree 2 and dimension 0 over Real Field with 53 bits of precision
Basis matrix: []

and

Vector space of degree 3 and dimension 2 over Rational Field
Basis matrix:
[   1    0 -3/5]
[   0    1  6/5]

Then from the symmetry of the scalar product we know sij = sji, and from the linearity of the product in each of its arguments we get
\[
\Big\langle \sum_i x_iu_i,\ \sum_j y_ju_j \Big\rangle = \sum_{i,j} x_iy_j\langle u_i, u_j\rangle = \sum_{i,j} s_{ij}x_iy_j.
\]
If the basis is orthonormal, the matrix S = (sij) is the unit matrix. This proves the following useful claim:

Scalar product in coordinates
Proposition. For every orthonormal basis, the scalar product is given by the coordinate expression
(1) ⟨x, y⟩ = yT · x.
For every basis of the space V there is a symmetric matrix S such that the coordinate expression of the scalar product is
(2) ⟨x, y⟩ = yT · S · x.

Notice that, with a symmetric matrix S, it is just a matter of convention in which order we insert the vectors: the formula
\[
x^T \cdot S \cdot y = (x^T \cdot S \cdot y)^T = y^T \cdot S^T \cdot x = y^T \cdot S \cdot x
\]
produces the same value. However, we shall later consider the second argument as a linear form, so it seems more convenient to use the expression yT · S · x.

2.3.19. Orthogonal complements and projections. For every fixed subspace W ⊂ V in a space with a scalar product, we define its orthogonal complement as
\[
W^\perp = \{u \in V;\ u \perp v \text{ for all } v \in W\}.
\]
It follows directly from the definition that W⊥ is a vector subspace. If W ⊂ V has a basis (u1, . . . , uk), then W⊥ is described by k homogeneous equations in n variables. Thus W⊥ has dimension at least n − k. Moreover, u ∈ W ∩ W⊥ means that ⟨u, u⟩ = 0, and thus also u = 0 by the definition of the scalar product. Clearly then, V is the direct sum V = W ⊕ W⊥.

A linear mapping f : V → V on any vector space is called a projection if f ◦ f = f. In such a case we can write, for every vector v ∈ V,
\[
v = f(v) + (v - f(v)) \in \operatorname{Im}(f) + \operatorname{Ker}(f) = V,
\]
and if v ∈ Im(f) and f(v) = 0, then also v = 0. Thus the above sum of subspaces is direct. We say that f is the projection onto the subspace W = Im(f) along the subspace U = Ker(f). In words, the projection can be described naturally as follows: we decompose the given vector into a component in W and a component in U, and forget the second one.

respectively. This means that Ker(f) = {0}, and so f is injective, but Im(f) is 2-dimensional, with a basis determined by the two vectors that Sage indicates in the solution. Hence f is not surjective. Note that Sage allows us to verify the injectivity and surjectivity of f directly, as follows:

f.is_injective()
f.is_surjective()

Sage’s output is True and False, respectively. □

2.C.28. Consider R3 endowed with its standard basis e = {e1, e2, e3}, and let f : R3 → R3 be a linear mapping satisfying f(e1) = (0, 1, 2)T, f(e2) = (1, 0, 1)T, f(e3) = (−1, 1, 1)T.
(a) Determine the explicit form of f;
(b) Determine the matrix of f with respect to the standard basis e;
(c) Present bases of the kernel and the image of f;
(d) Find the inverse f−1, provided it exists.

Solution. (a) By assumption we have f(e1) = (0, 1, 2)T, f(e2) = (1, 0, 1)T and f(e3) = (−1, 1, 1)T. However, any vector u = (x, y, z)T ∈ R3 is written as u = x·e1 + y·e2 + z·e3 (here “·” is just the scalar multiplication), and since f is linear we get
\[
f(u) = x\cdot f(e_1) + y\cdot f(e_2) + z\cdot f(e_3) = x\cdot(0, 1, 2)^T + y\cdot(1, 0, 1)^T + z\cdot(-1, 1, 1)^T = (y - z,\ x + z,\ 2x + y + z)^T,
\]
that is, f((x, y, z)) = (y − z, x + z, 2x + y + z)T.
(b) We have f(e1) = 0·e1 + 1·e2 + 2·e3, f(e2) = 1·e1 + 0·e2 + 1·e3, and f(e3) = −1·e1 + 1·e2 + 1·e3. Thus, the matrix of f with respect to the standard basis of R3 is given by
\[
A = \begin{pmatrix} 0 & 1 & -1\\ 1 & 0 & 1\\ 2 & 1 & 1 \end{pmatrix}.
\]
In Sage we can verify this result by executing the block

V=W=RR**3; var("x, y, z")
f(x, y, z)=[y-z, x+z, 2*x+y+z]
T=linear_transformation(V, W, f)
A=T.matrix(side="right"); show(A)

As an alternative, compute first the matrix A, and then determine the explicit form of f by the rule f(u) = Au, for all u ∈ R3.
(c) Let us first compute a basis of the image Im(f) of f. By the relation f(u) = x·f(e1) + y·f(e2) + z·f(e3) it is obvious that Im(f) = spanR{f(e1), f(e2), f(e3)} = spanR{(0, 1, 2)T, (1, 0, 1)T, (−1, 1, 1)T}.
We should now check the linear independence of these vectors. By elementary row operations on the matrix A presented above, we obtain

If V has a scalar product, we say that the projection is orthogonal if the kernel is orthogonal to the image. Every subspace W ≠ V thus defines an orthogonal projection onto W. It is the projection onto W along W⊥, given by the unique decomposition of every vector u into components uW ∈ W and uW⊥ ∈ W⊥; that is, the linear mapping which maps uW + uW⊥ to uW.

2.3.20. How to compute projections. The orthogonal projections projv onto one-dimensional subspaces span{v} spanned by v ∈ V are the simplest. Indeed, for any u ∈ V we require the projection of u to be a scalar multiple cv such that u − cv ⊥ v, i.e., ⟨u − cv, v⟩ = 0 (see part (a) in the figure below). Solving this equation yields the correct value for c, and we deduce
\[
\operatorname{proj}_v u = cv = \frac{\langle u, v\rangle}{\langle v, v\rangle}\,v = \frac{\langle u, v\rangle}{\|v\|^2}\,v.
\]
Consequently, the vector u decomposes into a pair of orthogonal vectors: u = projv u + (u − projv u).

Similarly, for linear subspaces W ⊂ V of (V, ⟨ , ⟩), there is the direct decomposition V = W ⊕ W⊥, and clearly any vector u ∈ V can be uniquely expressed as u = w + z, where w ∈ W and z ∈ W⊥. The vector w is called the orthogonal projection of u onto W; we also write projW u. The other part of u, orthogonal to W, is denoted by projW⊥ u, and we have (see part (b) in the figure below) z = projW⊥ u = u − projW u.

2.3.21. Existence of orthonormal bases. It is easy to see that on every finite-dimensional real vector space there exist scalar products. Just choose any basis, define lengths so that each basis vector has unit length, and call the basis orthonormal. Immediately we have a scalar product: in this basis, the scalar products of vectors are computed as in the formula (1) of the Proposition in 2.3.18.

More often we are given a scalar product on a vector space V, and we want to find an appropriate orthonormal basis for it. We present an algorithm which uses suitable orthogonal projections in order to transform any basis into an orthogonal one. It is called the Gram-Schmidt orthogonalization process. The point of this procedure is to transform a given sequence of independent generators v1, . . . , vk of a finite-dimensional space V into an orthogonal set of independent generators of V.

\[
A = \begin{pmatrix} 0 & 1 & -1\\ 1 & 0 & 1\\ 2 & 1 & 1 \end{pmatrix}
\sim \begin{pmatrix} 1 & 0 & 1\\ 0 & 1 & -1\\ 2 & 1 & 1 \end{pmatrix}
\sim \begin{pmatrix} 1 & 0 & 1\\ 0 & 1 & -1\\ 0 & 1 & -1 \end{pmatrix}
\sim \begin{pmatrix} 1 & 0 & 1\\ 0 & 1 & -1\\ 0 & 0 & 0 \end{pmatrix}.
\]
The final matrix is in reduced row echelon form (recall that a quick computation of the RREF in Sage is obtained by adding to the previous cell the code A1 = A.rref(); show(A1)). Therefore, the pivots lie in the first two columns (use the command A.pivots() to verify this). This means that only the vectors f(e1), f(e2) are linearly independent (in fact it is easy to see that f(e3) = f(e1) − f(e2)), and so a basis of Im(f) has the form {(0, 1, 2)T, (1, 0, 1)T}. Thus dim Im(f) = 2, and in particular f is not surjective. For the kernel, we should have dim R3 = dim Ker(f) + dim Im(f), from which we get dim Ker(f) = 3 − 2 = 1. We may use Sage to find a basis of Ker(f), by one of the commands

print(A.right_nullity())
A.right_kernel()

The first command prints out the dimension of Ker(A), while the output of the second one also includes a basis, Ker(A) = spanR{(1, −1, −1)T}. Of course, this also gives the required basis of Ker(f).
Indeed, recall that the computation of Ker(f) relies on solving the homogeneous system Au = 0, where u = (x, y, z)T ∈ R3 is an arbitrary vector. This exploits the identification Ker(A) = Ker(f) and gives
\[
\begin{pmatrix} 0 & 1 & -1\\ 1 & 0 & 1\\ 2 & 1 & 1 \end{pmatrix}\begin{pmatrix} x\\ y\\ z \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ 0 \end{pmatrix},
\]
or in other words {y − z = 0, x + z = 0, 2x + y + z = 0}. The next step relies on the reduced row echelon form posed above, which first implies that z is a free variable, say z = t ∈ R. Then we get y = z = t and x = −z = −t, that is, there are infinitely many solutions, of the form t·(−1, 1, 1)T with t ∈ R. Of course, one can replace t by −t to obtain the answer that Sage provides.
(d) The given linear transformation is not injective, since we saw that Ker(f) ≠ {0}. Therefore, the inverse f−1 does not exist. □

2.C.29. Consider the endomorphism T : C3 → C3 with T(e1) = (1, 0, i)T, T(e2) = (0, 1, 1)T, T(e3) = (i, 1, 0)T, where {e1, e2, e3} denotes the standard basis of C3. Is T invertible? Describe a solution using Sage, as well. ⃝

2.C.30. (a) Consider the complex numbers C as a real vector space with its standard basis u = {1, i}. In this basis determine the matrix of the following linear mappings: 1) conjugation, 2) multiplication by the number (2 + i).

Gram-Schmidt orthogonalization
Proposition. Let (u1, . . . , uk) be a linearly independent k-tuple of vectors of a space V with a scalar product. Then there exists an orthogonal system of vectors (v1, . . . , vk) such that vi ∈ span{u1, . . . , ui} and span{u1, . . . , ui} = span{v1, . . . , vi}, for all i = 1, . . . , k. We obtain it by the following procedure:
• The independence of the vectors ui ensures that u1 ≠ 0; we choose v1 = u1.
• If we have already constructed the vectors v1, . . . , vℓ with the required properties and if ℓ < k, we choose vℓ+1 = uℓ+1 + a1v1 + · · · + aℓvℓ, where ai = −⟨uℓ+1, vi⟩/∥vi∥2.

Proof. We begin with the first (nonzero) vector v1 = u1 and calculate the orthogonal projection v2 of u2 onto span{v1}⊥ ⊂ span{v1, u2}, i.e., we find the right constant a1 for which v2 = u2 + a1v1 is perpendicular to v1. The result is nonzero if and only if u2 is independent of v1. All the other steps are similar: in step ℓ, ℓ ≥ 1, we seek the vector vℓ+1 = uℓ+1 + a1v1 + · · · + aℓvℓ satisfying ⟨vℓ+1, vi⟩ = 0 for all i = 1, . . . , ℓ. This implies
\[
0 = \langle u_{\ell+1} + a_1v_1 + \cdots + a_\ell v_\ell,\ v_i\rangle = \langle u_{\ell+1}, v_i\rangle + a_i\langle v_i, v_i\rangle,
\]
and we can see that the vectors with the desired properties are determined uniquely up to a scalar multiple. □

Whenever we have an orthogonal basis of a vector space V, we just have to normalise the vectors in order to obtain an orthonormal basis. Thus, starting the Gram-Schmidt orthogonalization with any basis of V, we have proven:

Corollary. On every finite-dimensional real vector space with a scalar product there exists an orthonormal basis.

In an orthonormal basis, coordinates and orthogonal projections are very easy to calculate. Indeed, suppose we have an orthonormal basis (e1, . . . , en) of a space V. Then every vector v = x1e1 + · · · + xnen satisfies ⟨ei, v⟩ = ⟨ei, x1e1 + · · · + xnen⟩ = xi, and so we can always express
(1) v = ⟨e1, v⟩e1 + · · · + ⟨en, v⟩en.
If we are given a subspace W ⊂ V and its orthonormal basis (e1, . . . , ek), then we can extend it to an orthonormal basis (e1, . . . , en) of V. The orthogonal projection of a general vector v ∈ V onto W is then given by the expression
\[
v \mapsto \langle e_1, v\rangle e_1 + \cdots + \langle e_k, v\rangle e_k.
\]
In particular, we need only consider an orthonormal basis of the subspace W in order to write the orthogonal projection onto W explicitly.

(b) Determine the matrix of these mappings also in the basis f = ((1 − i), (1 + i)). ⃝

2.C.31. Consider the vector space Matm,n(K) of m × n matrices with coefficients in K, where K is R or C. Show that dimK Matm,n(K) = mn. Next establish an isomorphism φ : Matm,n(K) → Kmn between Matm,n(K) and Kmn. ⃝

2.C.32. The vector spaces Rn[x] and Rn+1 have the same dimension. Find an isomorphism between them. ⃝

2.C.33. Prove that the vectors u1 = (1, 2, 0, 0)T, u2 = (0, 1, 0, 1)T, u3 = (1, 0, 0, 0)T generate a subspace U of R4 which is isomorphic to R3. Can you find an explicit linear isomorphism? ⃝

2.C.34. List all subspaces of R3, up to isomorphism. ⃝

The simplest linear mappings α on Kn are of the form x ↦ α(x) = c1x1 + · · · + cnxn ∈ K, where c = (c1, . . . , cn) is a fixed n-tuple of scalars from the field K and x = (x1, . . . , xn)T ∈ Kn. Actually, these are all the linear mappings Kn → K, and they can be expressed as α(x) = c · x through matrix multiplication, with c viewed as a row. Such linear mappings are called linear forms on Kn. In fact, any choice of coordinates on an n-dimensional vector space V over a field K is actually an n-tuple of independent linear forms on V, i.e., of linear mappings V → K. All the linear forms together constitute the dual vector space V∗ of V, which is again a vector space over K, see 2.3.17. Clearly, V and V∗ are isomorphic as vector spaces over K.

2.C.35. Given the basis u = {u1 = (1, 1)T, u2 = (3, 1)T} of R2, find the dual basis in (R2)∗.

Solution. We seek linear forms φi : R2 → R satisfying
\[
\varphi_1(u_1) = 1,\ \varphi_1(u_2) = 0,\qquad \varphi_2(u_1) = 0,\ \varphi_2(u_2) = 1,
\]
respectively. Suppose that φ1(x, y) = ax + by and φ2(x, y) = cx + dy for some reals a, b, c, d. These equations induce the systems {a + b = 1, 3a + b = 0} and {c + d = 0, 3c + d = 1}, respectively. Solving them, we deduce that the dual basis to u consists of the linear forms φ1(x, y) = −(1/2)x + (3/2)y and φ2(x, y) = (1/2)x − (1/2)y. □

2.C.36. Prove that the set u = {u1 = (1, 0, 0)T, u2 = (0, i, 0)T, u3 = (1, 1, i)T} is a basis of C3. Next find the dual basis. ⃝

2.C.37. Let {u1, . . . , un} be a basis of a vector space V over a field K, and let {φ1, . . . , φn} be the dual basis in V∗. Prove that
(a) any u ∈ V is written as u = \(\sum_{i=1}^{n} \varphi_i(u)u_i\);
(b) any ξ ∈ V∗ is written as ξ = \(\sum_{i=1}^{n} \xi(u_i)\varphi_i\). ⃝

Note that, in general, the projection f onto the subspace W along U and the projection g onto U along W are related by the equality g = idV − f. Thus, when dealing with orthogonal projections onto a given subspace W, it is always more efficient to calculate an orthonormal basis of whichever of the spaces W and W⊥ has the smaller dimension.

Note also that the existence of an orthonormal basis guarantees that for every real space V of dimension n with a scalar product there exists a linear mapping which is an isomorphism between V and the space Rn with the standard scalar product (i.e., respecting the scalar products as well). We saw already in 2.3.18 that the desired isomorphism is exactly the coordinate assignment. In words: in every orthonormal basis, the scalar product is computed by the same formula as the standard scalar product in Rn. We shall return to the questions of the length of a vector and of projections in the following chapter, in a more general context.
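The projection formula above is straightforward to try in Sage. The following sketch uses the built-in method gram_schmidt() to produce an orthogonal (not normalised) basis, so each term is divided by the squared norm, exactly as in 2.3.20; the subspace and the vector are illustrative choices of ours:

B = matrix(QQ, [[1, 1, 0], [0, 1, 1]])   # rows span a plane W in QQ^3
G, mu = B.gram_schmidt()                 # rows of G: orthogonal basis of W
v = vector(QQ, [1, 2, 3])
projW = sum((v.dot_product(g) / g.dot_product(g)) * g for g in G.rows())
print(projW)                                           # (1/3, 8/3, 7/3)
print([(v - projW).dot_product(g) for g in G.rows()])  # [0, 0]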
2.3.22. Angle between two vectors. As we have already noted, the angle between two linearly independent vectors must be the same as when we consider them in the two-dimensional subspace they generate. Basically, this is the reason why the notion of angle is independent of the dimension of the original space. If we choose an orthogonal basis such that its first two vectors generate the same subspace as the two given vectors u and v (whose angle we are measuring), we can simply take the definition from planar geometry. Independently of the choice of coordinates, we can formulate the definition as follows:

Angle between two vectors
The angle φ between two vectors v and w in a vector space with a scalar product is given by the relation
\[
\cos\varphi = \frac{\langle v, w\rangle}{\|v\|\,\|w\|}.
\]
The angle defined in this way does not depend on the order of the vectors v, w, and it is chosen in the interval 0 ≤ φ ≤ π.

We shall return to scalar products and angles between vectors in further chapters.

2.3.23. Multilinear forms. The scalar product was given as a mapping from the product of two copies of a vector space V into the space of scalars which was linear in each of its arguments. Similarly, we will work with mappings from the product of k copies of a vector space V into the scalars which are linear in each of their k arguments. We speak of k-linear forms. Most often we will meet bilinear forms, that is, the case α : V × V → K, where for any four vectors u, v, w, z and scalars a, b, c, d we have
\[
\alpha(au + bv, cw + dz) = ac\,\alpha(u, w) + ad\,\alpha(u, z) + bc\,\alpha(v, w) + bd\,\alpha(v, z).
\]

Endomorphisms F on V are represented by square matrices A in a given basis. If we leave the choice of bases free on both the domain and the target, then we may always achieve a diagonal matrix A with just ones and zeros on the diagonal, the rank being equal to the dimension of the image, see 2.1.9. But when dealing with endomorphisms, we rather want to fix just one basis on V. Then the available change of the matrix is A ↦ P · A · P−1 = B, and we call such matrices A and B similar. Of course, P provides the relevant change of basis on V, see 2.3.16. Much of our future effort will aim at understanding linear forms and other concepts within the matrix calculus which are invariant under the above similarity and thus reveal properties of the linear mappings themselves. The trace of matrices, discussed in the subsequent tasks, is an example.

2.C.38. Show that the trace tr : Matm(K) → K, tr(A) = \(\sum_{i=1}^{m} a_{ii}\), is a linear functional, i.e., tr ∈ Mat∗m(K) ≅ (Km²)∗. Next show that tr(AB) = tr(BA) for any A, B ∈ Matm(K). Moreover, provide an example verifying that in general tr(AB) ≠ tr(A) tr(B). ⃝

2.C.39. Demonstrate with a low-dimensional example that the trace of an endomorphism remains invariant under a change of basis. ⃝

2.C.40. Prove that if A, B are two similar square matrices, then tr(A) = tr(B) and det(A) = det(B). Next decide which of the given pairs of matrices are similar:
(a) A = \(\begin{pmatrix} 1 & -2\\ 2 & 3 \end{pmatrix}\), B = \(\begin{pmatrix} 2 & 2\\ -1 & 2 \end{pmatrix}\);
(b) A = \(\begin{pmatrix} 0 & 1\\ 0 & 0 \end{pmatrix}\), B = \(\begin{pmatrix} \gamma\delta & \delta^2\\ -\gamma^2 & -\gamma\delta \end{pmatrix}\), where γ, δ are non-zero real numbers;
(c) A = \(\begin{pmatrix} 2 & 1\\ 1 & 2 \end{pmatrix}\), B = \(\begin{pmatrix} 3 & 0\\ 0 & 1 \end{pmatrix}\). ⃝

2.C.41. Transition matrix. Find the transition matrix from the standard basis e = {e1, e2, e3} of R3 to the basis
\[
u = \{u_1 = (1, 1, 3)^T,\ u_2 = (1, -1, 1)^T,\ u_3 = (3, 1, 5)^T\}.
\]
Next, find the coordinates of the vector w = (1, 2, 3)T ∈ R3 in the new ordered basis u.

Solution. We see that u1 = e1 + e2 + 3e3, u2 = e1 − e2 + e3, u3 = 3e1 + e2 + 5e3.
Thus, the transition matrix T from u to e has as its columns exactly the vectors of the basis u:
\[
T = \begin{pmatrix} 1 & 1 & 3\\ 1 & -1 & 1\\ 3 & 1 & 5 \end{pmatrix}.
\]

If additionally we always have α(u, w) = α(w, u), then we speak of a symmetric bilinear form. If interchanging the arguments leads to a change of sign, we speak of an antisymmetric bilinear form. Already in planar geometry we defined the determinant as a bilinear antisymmetric form α, that is, α(u, w) = −α(w, u). In general, due to Theorem 2.2.5, we know that the determinant in dimension n can be seen as an n-linear antisymmetric form.

As with linear mappings, it is clear that every k-linear form is completely determined by its values on all k-tuples of basis elements in a fixed basis. In analogy to linear mappings, we can view these values as k-dimensional analogues of matrices. We show this in the example k = 2, where the values correspond to matrices as we have defined them.

Matrix of a bilinear form
If we choose a basis u of V and define, for a given bilinear form α, the scalars aij = α(ui, uj), then we obtain for vectors v, w with coordinates x and y (as columns of coordinates)
\[
\alpha(v, w) = \sum_{i,j=1}^{n} a_{ij}x_iy_j = y^T \cdot A \cdot x,
\]
where A is the matrix A = (aij).

Directly from the definition of the matrix of a bilinear form, we see that the form is symmetric or antisymmetric if and only if the corresponding matrix has this property.

Every bilinear form α on a vector space V defines a mapping V → V∗, v ↦ α( , v). That is, by placing a fixed vector in the second argument, we obtain a linear form which is the image of this vector. If we choose a fixed basis on a finite-dimensional space V and the dual basis on V∗, then this is the mapping y ↦ (x ↦ yT · A · x). All this is a matter of convention; we might equally fix the first vector and again obtain a linear form.

4. Properties of linear mappings

In order to exploit vector spaces and linear mappings in modelling real processes and systems in other sciences, we need a more detailed analysis of the properties of diverse types of linear mappings.

2.4.1. We begin with four examples in the lowest dimension of interest. With the standard basis of the plane R2 and the standard scalar product, we consider the following matrices of mappings f : R2 → R2:
\[
A = \begin{pmatrix} 1 & 0\\ 0 & 0 \end{pmatrix},\quad
B = \begin{pmatrix} 0 & 1\\ 0 & 0 \end{pmatrix},\quad
C = \begin{pmatrix} a & 0\\ 0 & b \end{pmatrix},\quad
D = \begin{pmatrix} 0 & -1\\ 1 & 0 \end{pmatrix}.
\]

The transition matrix from e to u is now given by the inverse T−1 of T, see 2.3.15. We compute
\[
T^{-1} = \begin{pmatrix} -\tfrac{3}{2} & -\tfrac{1}{2} & 1\\ -\tfrac{1}{2} & -1 & \tfrac{1}{2}\\ 1 & \tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix},
\]
hence the new coordinates of w are given by the matrix multiplication
\[
T^{-1}w = \begin{pmatrix} -\tfrac{3}{2} & -\tfrac{1}{2} & 1\\ -\tfrac{1}{2} & -1 & \tfrac{1}{2}\\ 1 & \tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix}\begin{pmatrix} 1\\ 2\\ 3 \end{pmatrix} = \begin{pmatrix} \tfrac{1}{2}\\ -1\\ \tfrac{1}{2} \end{pmatrix}. \qquad\square
\]

2.C.42. Solve the problem in 2.C.41 using Sage. ⃝

2.C.43. Consider the ordered basis u1 = {E1 = (1, 0, 1)T, E2 = (1, 1, 0)T, E3 = (0, 1, 1)T} of R3, and suppose that the transition matrix T from another ordered basis u2 of R3 to u1 is given by
\[
T = \begin{pmatrix} 1 & 1 & 1\\ 1 & 2 & 1\\ -1 & 1 & 2 \end{pmatrix}.
\]
Find the basis u2. ⃝

2.C.44. Suppose that a linear mapping F : R3 → R3 has the following matrix with respect to the standard basis:
\[
A = \begin{pmatrix} 1 & -1 & 0\\ 0 & 1 & 1\\ 2 & 0 & 0 \end{pmatrix}.
\]
Compute the matrix of this mapping in the new basis f := {f1, f2, f3} := {(1, 1, 0)T, (−1, 1, 1)T, (2, 0, 1)T}. ⃝

In applications, we often know the size of vectors, i.e., there is the so-called norm on the vector space. We saw how to deal with this in Chapter 1 already: the norm is usually defined via the scalar product (see 1.E.20). This is closely linked with the concept of the angle between vectors, which is a 2-dimensional concept and works in all dimensions, even infinite ones, without changes. The scalar product allows us to employ orthogonal projections, orthogonal bases, and other important concepts; see the theoretical column for details. In the following section, we shall come to mappings and matrices compatible with scalar products, which leads to the spectral theory discussed properly in the next chapter. Therefore, consider the remainder of this section as a gentle preparation.
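As a gentle warm-up for what follows, the coordinate formula ⟨x, y⟩ = yT · S · x from 2.3.18 can be tested in Sage. A minimal sketch with an illustrative (non-orthonormal) basis of QQ2 of our own choosing:

u1 = vector(QQ, [1, 1]); u2 = vector(QQ, [3, 1])      # a basis of QQ^2
S = matrix(QQ, [[u1.dot_product(u1), u1.dot_product(u2)],
                [u2.dot_product(u1), u2.dot_product(u2)]])
x = vector(QQ, [1, 2]); y = vector(QQ, [3, -1])       # coordinates in u
v = x[0]*u1 + x[1]*u2; w = y[0]*u1 + y[1]*u2          # the actual vectors
print(v.dot_product(w) == y * S * x)                  # True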
2.C.45. Remark. Consider the vectors u = (1, 2, 2)T and v = (0, 1, −1)T in R3. Then we see that u · v = ⟨u, v⟩ = 0, that is, u and v are orthogonal. To encode this we often write u ⊥ v. As we know from 1.E.20, to compute the scalar product ⟨u, v⟩ = \(\sum_i u_iv_i\) of two vectors u, v ∈ Rn in Sage, we use the command u.dot_product(v). For instance, the block

The matrix A describes the orthogonal projection along the subspace W = {(0, a); a ∈ R} ⊂ R2 onto the subspace V = {(a, 0); a ∈ R} ⊂ R2, that is, the projection onto the x-axis along the y-axis. Evidently, for this f : R2 → R2 we have f ◦ f = f, and thus the restriction f|V of the given mapping to its image is the identity mapping. The kernel of f is exactly the subspace W.

The matrix B has the property B2 = 0, therefore the same holds for the corresponding mapping f. We can envision it as the differentiation of polynomials R1[x] of degree at most one in the basis (1, x) (we shall come to differentiation in chapter five, see 5.1.6).

The matrix C gives a mapping f which rescales the first vector of the basis a-times and the second one b-times. Therefore the whole plane splits into two subspaces which are preserved under the mapping and on which f is merely a homothety, that is, scaling by a scalar multiple (the first case was the special case a = 1, b = 0). For instance, the choice a = 1, b = −1 corresponds to the axial symmetry (mirror symmetry) with respect to the x-axis, which is the same as the complex conjugation x + iy ↦ x − iy on the two-dimensional real space R2 ≃ C in the basis (1, i). This is a linear mapping of the two-dimensional real vector space C, but not of the one-dimensional complex space C.

The matrix D is the matrix of the rotation by 90 degrees (the angle π/2) around the origin in the standard basis. We can see at first glance that no one-dimensional subspace is preserved under this mapping. Such a rotation is a bijection of the plane onto itself, therefore we can surely find distinct bases in the domain and codomain for which its matrix is the unit matrix E: simply take any basis of the domain and its image in the codomain. But we are not able to do this with one and the same basis for both the domain and the codomain. Consider the matrix D as the matrix of a mapping g : C2 → C2 in the standard basis of the complex vector space C2. Then we can find vectors u = (i, 1)T, v = (−i, 1)T, for which
\[
g(u) = \begin{pmatrix} 0 & -1\\ 1 & 0 \end{pmatrix}\begin{pmatrix} i\\ 1 \end{pmatrix} = \begin{pmatrix} -1\\ i \end{pmatrix} = i\cdot u,\qquad
g(v) = \begin{pmatrix} 0 & -1\\ 1 & 0 \end{pmatrix}\begin{pmatrix} -i\\ 1 \end{pmatrix} = \begin{pmatrix} -1\\ -i \end{pmatrix} = -i\cdot v.
\]
That means that in the basis (u, v) of C2, the mapping g has the matrix
\[
K = \begin{pmatrix} i & 0\\ 0 & -i \end{pmatrix}.
\]
Notice that by extending the scalars to C, we arrive at an analogy to the matrix C with the diagonal elements a = cos(π/2) + i sin(π/2) and its complex conjugate ā. In other words, the

u = vector(RDF, [1, 2, 2])
v = vector(RDF, [0, 1, -1])
u.dot_product(v)

returns the result 0.0 and verifies that u ⊥ v. We also know how to compute the norms of vectors.
For instance, for the given u, v, we can add the code

u.norm(); v.norm()

returning 3.0 and 1.4142135… ≈ √2, respectively. Scalar products over complex vector spaces need a different approach; we come to them in Chapter 3. They are linked to Hermitian forms on Cn and are fundamental in quantum computing.

Let us now turn our attention to vectors in our 3-dimensional world. In the solution to the next task we shall see the general concepts and procedures working independently of the (finite) dimension. One of the main concepts is that of orthogonal projections; see 2.3.20 for instructions on how to compute them.

2.C.46. Consider the following four vectors in R3:
i) u1 = (1, 3, √2)T, u2 = (−1, 1, −√2)T;
ii) v1 = (0, 1, 2)T, v2 = (−1, 2, 3)T.
Compute their dot products, norms, and angles. Check that the product is symmetric, and next use Sage to solve the tasks. ⃝

2.C.47. Given the linear form α : R3 → R with α(u) = 4x + 6y − 2z for all u = (x, y, z)T ∈ R3, determine the unique vector v = (a, b, c)T ∈ R3 satisfying α(u) = ⟨u, v⟩, where ⟨ , ⟩ denotes the usual dot product on R3.

Solution. This is an application of the fact that the standard basis {e1, e2, e3} of R3 is orthonormal with respect to the dot product, i.e., ⟨ei, ej⟩ = δij for 1 ≤ i, j ≤ 3. Then we have
\[
v = \sum_{i=1}^{3} \langle v, e_i\rangle e_i = \sum_{i=1}^{3} \alpha(e_i)e_i,
\]
where we have used that ⟨ , ⟩ is symmetric and moreover that α(ei) = ⟨ei, v⟩ for all i. Since α(e1) = 4, α(e2) = 6 and α(e3) = −2, we get v = 4e1 + 6e2 − 2e3, that is, v = (4, 6, −2)T ∈ R3. Now one can easily verify that α(u) = ⟨u, v⟩ for all u ∈ R3. □

2.C.48. Orthogonal projections via Sage. It is very easy to compute orthogonal projections of vectors in Sage. For instance, suppose that we want to compute the projection of u = (1, 2, 4)T ∈ R3 onto w = (1, −2, 1)T ∈ R3. We have projw(u) = (⟨u, w⟩/∥w∥2)w, so it is reasonable to type

u = vector([1, 2, 4])
w = vector([1, -2, 1])
proj = u.dot_product(w)/(norm(w)^2)*w
proj

sage: (1/6, -1/3, 1/6)

Hence projw(u) = (1/6, −1/3, 1/6)T ∈ R3. As a verification, we can check that projw(u) is orthogonal to u − projw(u):

argument of the number a in polar form provides the angle of the rotation. This is easy to understand if we denote the real and imaginary parts of the vector u as follows:
\[
u = x_u + iy_u = \operatorname{Re} u + i\operatorname{Im} u = \begin{pmatrix} 0\\ 1 \end{pmatrix} + i\cdot\begin{pmatrix} 1\\ 0 \end{pmatrix}.
\]
The vector v is the complex conjugate of u. We are interested in the restriction of the mapping g to the real vector subspace V = R2 ∩ spanC{u, v} ⊂ C2. Evidently,
\[
V = \operatorname{span}_{\mathbb{R}}\{u + \bar{u},\ i(u - \bar{u})\} = \operatorname{span}_{\mathbb{R}}\{x_u, -y_u\}
\]
is the whole plane R2. The restriction of g to this plane is exactly the original mapping given by the matrix D (notice this matrix is real, thus it preserves the real subspace). It is immediately seen that this is the rotation through the angle π/2 in the positive sense with respect to the chosen basis xu, −yu. Work it out yourself by a direct calculation. Note also why exchanging the order of the vectors u and v leads to the same result, although in a different real basis!
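The behaviour of the rotation matrix D just discussed can be replayed in Sage, foreshadowing the next paragraph; a small sketch over the algebraic closure QQbar:

D = matrix(QQ, [[0, -1], [1, 0]])
print(D.charpoly())                   # x^2 + 1, no real roots
DC = D.change_ring(QQbar)
print(DC.eigenvalues())               # I and -I
u = vector(QQbar, [QQbar(I), 1])
print(DC * u == QQbar(I) * u)         # True: u = (i, 1) is an eigenvector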
2.4.2. Eigenvalues and eigenvectors of mappings. A key to the description of the mappings in the previous examples was the answer to the question “what are the vectors satisfying the equation f(u) = a · u for some suitable scalar a?”. We consider this question for any linear mapping f : V → V on a vector space of dimension n over scalars K. If we imagine such an equality written in coordinates, i.e., using the matrix A of the mapping in some basis, we obtain a system of linear equations
\[
A \cdot x - a \cdot x = (A - a \cdot E) \cdot x = 0
\]
with an unknown parameter a. We know already that such a system of equations has only the solution x = 0 if the matrix A − aE is invertible. Thus we want to find those values a ∈ K for which A − aE is not invertible, and for that the necessary and sufficient condition reads (see Theorem 2.2.11)
(1) det(A − a · E) = 0.
If we consider λ = a as a variable in the previous scalar equation, we are actually looking for the roots of a polynomial of degree n. As we have seen in the case of the matrix D, the roots may exist in an extension of our field of scalars if they are not in K.

Eigenvalues and eigenvectors
Scalars λ ∈ K satisfying the equation f(u) = λ · u for some nonzero vector u ∈ V are called the eigenvalues of the mapping f. The corresponding nonzero vectors u are called the eigenvectors of the mapping f.

If u, v are eigenvectors associated with the same eigenvalue λ, then for every linear combination of u and v,
\[
f(au + bv) = af(u) + bf(v) = \lambda(au + bv).
\]
Therefore the eigenvectors associated with the same eigenvalue λ, together with the zero vector, form a nontrivial vector

proj.dot_product(u-proj)

which gives 0. Recall now that a nice feature of Sage is that we can easily extend its capabilities by introducing new commands. Here we create a function that computes the orthogonal projection of one vector onto another, a procedure which allows us to view (in Sage) the orthogonal projection as a function of two vectors. For this, use the cell

def proj(u, w):
    a = u.dot_product(w)/(norm(w)^2)*w
    return a

which defines the projection of u onto w. In these terms, the following cell in Sage gives the same result as above.

u1 = vector([1, 2, 4])
u2 = vector([1, -2, 1])
def proj(u, w):
    a = u.dot_product(w)/(norm(w)^2)*w
    return a
proj(u1, u2)

sage: (1/6, -1/3, 1/6)

Try implementing this block on your own. The function defined here will be useful for implementing the Gram-Schmidt procedure, which we are going to describe very soon.

2.C.49. Determine the matrix A which, in the standard basis of R3, gives the orthogonal projection onto the vector subspace generated by the vectors u1 = (−1, 1, 0)T and u2 = (−1, 0, 1)T.

Solution. First observe that the given subspace is a plane containing the origin of R3, with normal vector u3 = (1, 1, 1)T. This is because the ordered triple (1, 1, 1) is clearly a solution of the system {−x1 + x2 = 0, −x1 + x3 = 0}, that is, the vector u3 is perpendicular to the vectors u1, u2. Under the given projection, the vectors u1 and u2 must map to themselves, and the vector u3 to the zero vector. Thus, with respect to the basis {u1, u2, u3}, the matrix of the projection is
\[
P = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{pmatrix}.
\]
If the transition matrix from the basis {u1, u2, u3} to the standard basis is denoted by T, then the transition matrix from the standard basis to {u1, u2, u3} is the inverse T−1. We compute
\[
T = \begin{pmatrix} -1 & -1 & 1\\ 1 & 0 & 1\\ 0 & 1 & 1 \end{pmatrix},\qquad
T^{-1} = \begin{pmatrix} -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{3}\\ -\tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{2}{3}\\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix},
\]
and thus we obtain
\[
A = TPT^{-1} = \begin{pmatrix} \tfrac{2}{3} & -\tfrac{1}{3} & -\tfrac{1}{3}\\ -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{3}\\ -\tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix}.
\]

subspace Vλ ⊂ V. We call it the eigenspace associated with λ. For instance, if λ = 0 is an eigenvalue, the kernel Ker f is the eigenspace V0.

We have seen how to compute eigenvalues in coordinates. The independence of the eigenvalues from the choice of coordinates is clear from their definition.
But let us look explicitly at what happens if we change the basis. As a direct corollary of the transformation properties from paragraph 2.3.16 and the Cauchy theorem 2.2.7 for the determinant of a product, the matrix A′ in the new coordinates is A′ = P−1AP with an invertible matrix P. Thus
\[
|P^{-1}AP - \lambda E| = |P^{-1}AP - P^{-1}\lambda EP| = |P^{-1}(A - \lambda E)P| = |P^{-1}|\,|A - \lambda E|\,|P| = |A - \lambda E|,
\]
because scalar multiplication is commutative and we know that |P−1| = |P|−1. For these reasons we use the same terminology for matrices and mappings:

Characteristic polynomials
For a matrix A of dimension n over K, we call the polynomial |A − λE| ∈ Kn[λ] the characteristic polynomial of the matrix A. The roots of this polynomial are the eigenvalues of the matrix A. If A is the matrix of a mapping f : V → V in a certain basis, then |A − λE| is also called the characteristic polynomial of the mapping f.

Because the characteristic polynomial of a linear mapping f : V → V is independent of the choice of the basis of V, the coefficients of the individual powers of the variable λ are scalars expressing properties of f; in particular, they cannot depend on the choice of the basis. Suppose dim V = n and A = (aij) is the matrix of the mapping in some basis. Then
\[
|A - \lambda E| = (-1)^n\lambda^n + (-1)^{n-1}(a_{11} + \cdots + a_{nn})\lambda^{n-1} + \cdots + |A|\lambda^0.
\]
The coefficient at the highest power says whether the dimension of the space V is even or odd. The most interesting coefficient is the sum of the diagonal elements of the matrix. We have just proved that it does not depend on the choice of the basis, and we call it the trace of the matrix A, denoted by Tr A. The trace of a mapping f is defined as the trace of its matrix in an arbitrary basis.

In fact, this is not so surprising once we notice that the trace is actually the linear approximation of the determinant in the neighbourhood of the unit matrix in the direction A. We shall deal with such concepts only in Chapter 8. But since the determinant is a polynomial, we may easily see that the only terms in det(E + tA) which are linear in the real parameter t

□

A choice of basis of a given (finite-dimensional) vector space corresponds to a coordinate system on it (the basis comes with a fixed ordering of its elements). If the basis u = {u1, . . . , um} is orthonormal, i.e., ⟨ui, uj⟩ = δij (the Kronecker delta) for all i, j, then the coordinates are given just by the scalar products, i.e., v = ⟨v, u1⟩u1 + · · · + ⟨v, um⟩um for all v ∈ V. With an orthogonal basis, we merely request that its elements are mutually perpendicular; in the latter expression we then have to divide by the squared norms of the elements, i.e., the coordinates are ⟨v, ui⟩/⟨ui, ui⟩. The figure below compares a general basis and an orthonormal basis in R2. Notice that the formula for the scalar product is the same in all orthonormal bases: just the standard Euclidean dot product, as we know it from the matrix calculus. Orthogonal or orthonormal bases are easily obtained by the Gram-Schmidt procedure, see 2.3.21.
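The coordinate formula just recalled is easy to check in Sage; the orthonormal basis of R2 below is an illustrative choice of ours (a rotation of the standard one):

e1 = vector(QQ, [3/5, 4/5]); e2 = vector(QQ, [-4/5, 3/5])   # orthonormal
v = vector(QQ, [2, 1])
x1 = v.dot_product(e1); x2 = v.dot_product(e2)
print(x1, x2)                       # 2, -1
print(x1*e1 + x2*e2 == v)           # True: v = <v,e1>e1 + <v,e2>e2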
2.C.50. (a) Prove that the vectors E1 = (1, 1, 1, 1)T, E2 = (1, 1, 1, 0)T and E3 = (1, 1, 0, 0)T of R4 are linearly independent.
(b) Next find an orthonormal basis of the 3-dimensional subspace W = spanR{E1, E2, E3} of R4.
(c) Find the new coordinates of the vector w = (4, 4, 2, 1)T ∈ W with respect to the orthonormal basis that you described in part (b).

Solution. (a) We check the linear independence in Sage:

V = RR^4
E1 = vector(RR, [1, 1, 1, 1])
E2 = vector(RR, [1, 1, 1, 0])
E3 = vector(RR, [1, 1, 0, 0])
V.linear_dependence([E1, E2, E3]) == []

Sage’s output is indeed True.
(b) Let us apply the Gram-Schmidt method to obtain an orthogonal basis of W. Set w1 = E1 and
\[
w_2 = E_2 - \operatorname{proj}_{w_1}(E_2) = E_2 - \frac{\langle E_2, w_1\rangle}{\|w_1\|^2}w_1,\qquad
w_3 = E_3 - \operatorname{proj}_{w_2}(E_3) - \operatorname{proj}_{w_1}(E_3) = E_3 - \frac{\langle E_3, w_2\rangle}{\|w_2\|^2}w_2 - \frac{\langle E_3, w_1\rangle}{\|w_1\|^2}w_1.
\]
We compute ∥w1∥2 = 4 and ⟨E2, w1⟩ = 3, hence w2 = (1/4, 1/4, 1/4, −3/4)T with ∥w2∥2 = 3/4. We proceed by computing ⟨E3, w2⟩ = 1/2 and ⟨E3, w1⟩ = 2, and hence

are just the trace. We shall see the relation to the matrix exponential later, in Chapter 8. The coefficient at λ0 is the determinant |A|, and we shall see later that it describes the rescaling of volumes by the mapping.

2.4.3. Basis of eigenvectors. We now discuss a few important properties of eigenspaces.

Theorem. Eigenvectors of a linear mapping f : V → V associated to different eigenvalues are linearly independent.

Proof. Let a1, . . . , ak be distinct eigenvalues of the mapping f, and let u1, . . . , uk be eigenvectors with these eigenvalues. The proof is by induction on the number of linearly independent vectors among the chosen ones. We can start with ℓ = 1, because eigenvectors are nonzero. Assume that u1, . . . , uℓ are linearly independent and that uℓ+1 = \(\sum_{i=1}^{\ell} c_iu_i\) is their linear combination. Then
\[
f(u_{\ell+1}) = a_{\ell+1}\cdot u_{\ell+1} = \sum_{i=1}^{\ell} a_{\ell+1}c_iu_i,
\quad\text{and also}\quad
f(u_{\ell+1}) = \sum_{i=1}^{\ell} c_if(u_i) = \sum_{i=1}^{\ell} c_ia_iu_i.
\]
By subtracting the two expressions we obtain
\[
0 = \sum_{i=1}^{\ell} (a_{\ell+1} - a_i)c_iu_i.
\]
All the differences between the eigenvalues are nonzero and at least one coefficient ci is nonzero. This is a contradiction with the assumed linear independence of u1, . . . , uℓ; therefore the vector uℓ+1 must be linearly independent of the others. □

The latter theorem can be seen as a decomposition of a linear mapping f into a sum of much simpler mappings. If there are n = dim V distinct eigenvalues λi, we obtain the entire V as a direct sum of one-dimensional eigenspaces Vλi. Each of them describes a projection onto this invariant one-dimensional subspace, on which the mapping is just multiplication by the eigenvalue λi. Furthermore, this decomposition can be easily calculated:

w3 = (1/3, 1/3, −2/3, 0)T. The orthogonality is verified easily: ⟨w1, w2⟩ = ⟨w1, w3⟩ = ⟨w2, w3⟩ = 0. We also compute ∥w3∥2 = 2/3, hence the orthonormal basis {ŵi = wi/∥wi∥ : i = 1, 2, 3} of W has the explicit form
\[
\hat{w}_1 = \begin{pmatrix} 1/2\\ 1/2\\ 1/2\\ 1/2 \end{pmatrix},\quad
\hat{w}_2 = \begin{pmatrix} \sqrt{3}/6\\ \sqrt{3}/6\\ \sqrt{3}/6\\ -\sqrt{3}/2 \end{pmatrix},\quad
\hat{w}_3 = \begin{pmatrix} \sqrt{6}/6\\ \sqrt{6}/6\\ -\sqrt{6}/3\\ 0 \end{pmatrix}.
\]
(c) The given vector w = (4, 4, 2, 1)T lies in W, since we have the expression w = E1 + E2 + 2E3. To find its coordinates with respect to the orthonormal basis {ŵi : i = 1, 2, 3} of W constructed above, we apply the formula
\[
w = \sum_{i=1}^{3} \langle w, \hat{w}_i\rangle\hat{w}_i.
\]
We compute ⟨w, ŵ1⟩ = 11/2, ⟨w, ŵ2⟩ = 7√3/6 and ⟨w, ŵ3⟩ = 2√6/3, thus
\[
w = \frac{11}{2}\hat{w}_1 + \frac{7\sqrt{3}}{6}\hat{w}_2 + \frac{2\sqrt{6}}{3}\hat{w}_3.
\]
Hence the initial coordinates (1, 1, 2) of w with respect to the basis {E1, E2, E3} change to the new coordinates (11/2, 7√3/6, 2√6/3) with respect to the basis {ŵ1, ŵ2, ŵ3}. □

2.C.51. Use Sage, via the function proj(v, u) introduced in 2.C.48, to confirm the expression of the orthonormal basis {ŵ1, ŵ2, ŵ3} presented in 2.C.50. ⃝
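One possible route for 2.C.51 (a sketch only, using exact rational vectors so that the square roots stay symbolic):

def proj(u, w):
    return (u.dot_product(w) / (norm(w)^2)) * w
E1 = vector(QQ, [1, 1, 1, 1])
E2 = vector(QQ, [1, 1, 1, 0])
E3 = vector(QQ, [1, 1, 0, 0])
w1 = E1
w2 = E2 - proj(E2, w1)
w3 = E3 - proj(E3, w2) - proj(E3, w1)
print(w2, w3)    # (1/4, 1/4, 1/4, -3/4) (1/3, 1/3, -2/3, 0)
show([w1/w1.norm(), w2/w2.norm(), w3/w3.norm()])   # the basis of 2.C.50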
2.C.52. Apply the Gram-Schmidt orthogonalisation process to obtain an orthogonal basis of the linear subspace U ⊂ R4,
\[
U = \{(x_1, x_2, x_3, x_4)^T \in \mathbb{R}^4;\ x_1 + x_2 + x_3 + x_4 = 0\}. \qquad\text{⃝}
\]

2.C.53. Consider the linear mapping φ : R4 → R4 whose matrix with respect to the standard basis of R4 is given by
\[
A = \begin{pmatrix} 1/2 & 2 & -1/2 & -1/2\\ 1 & 1/2 & 1 & 3/2\\ 2 & 9/2 & 0 & 1/2\\ 2 & 2 & 0 & 0 \end{pmatrix}.
\]
Find an orthonormal basis of the kernel of φ. ⃝

2.C.54. Consider the Euclidean space R4 with its standard dot product ⟨ , ⟩. Find a basis of the linear subspace of R4 consisting of the vectors which are orthogonal to:
i) u = (1, 1, 1, 1)T ∈ R4;
ii) w = (1, 0, 0, −1)T and z = (−1, 0, 1, 0)T ∈ R4. ⃝

2.C.55. Orthogonal complement. Let W be the subspace of R4 spanned by the vectors u1 = (−1, 1, 3, 0)T and u2 = (0, 0, 0, 1)T. Find the orthogonal complement of W.

Solution. Any vector x ∈ W is written as x = au1 + bu2 = (−a, a, 3a, b)T for some a, b ∈ R. Vectors y ∈ W⊥ should satisfy x · y = 0, i.e., −ay1 + ay2 + 3ay3 + by4 = 0 for all a, b ∈ R. For a = 1, b = 0 (this corresponds to the condition u1 · y = 0) and a = 0, b = 1 (this corresponds to the condition u2 · y = 0) we obtain the following system of equations: {−y1 + y2 + 3y3 = 0, y4 = 0}. Viewing y2 = c ∈ R and y3 = d ∈ R as free variables, we obtain the solution (c + 3d, c, d, 0)T. It follows that W⊥ = spanR{w1, w2}, where the vectors w1, w2 are given by w1 = (1, 1, 0, 0)T and w2 = (3, 0, 1, 0)T, respectively. □

Basis of eigenvectors
Corollary. If there exist n mutually distinct roots λi of the characteristic polynomial of the mapping f : V → V on the n-dimensional space V, then there is a decomposition of V into the direct sum of eigenspaces of dimension one. This means that there exists a basis of V consisting only of eigenvectors, and in this basis the matrix of f is the diagonal matrix with the eigenvalues on the diagonal. This basis is uniquely determined up to the order of its elements and the scaling of the vectors.

The corresponding basis (expressed in coordinates with respect to an arbitrary basis of V) is obtained by solving n systems of homogeneous linear equations in n variables with the matrices (A − λi · E), where A is the matrix of f in the chosen basis.

2.4.4. Invariant subspaces. We have seen that every eigenvector v of a mapping f : V → V generates a subspace span{v} ⊂ V which is preserved by the mapping f. More generally, we say that a vector subspace W ⊂ V is an invariant subspace for a linear mapping f if f(W) ⊂ W. If V is a finite-dimensional vector space and we choose a basis (u1, . . . , uk) of a subspace W, we can always extend it to a basis (u1, . . . , uk, uk+1, . . . , un) of the whole space V. For every such basis, f(W) ⊂ W implies that the mapping f has a matrix A of the form
\[
(1)\qquad A = \begin{pmatrix} B & C\\ 0 & D \end{pmatrix},
\]
where B is a square matrix of dimension k, D is a square matrix of dimension n − k, and C is a matrix of the type k/(n − k). On the other hand, if for some basis (u1, . . . , un) the matrix of the mapping f is of the form (1), then W = span{u1, . . . , uk} is invariant under the mapping f. By the same argument, the mapping with the matrix A as in (1) leaves the subspace span{uk+1, . . . , un} invariant if and only if the submatrix C is zero. From this point of view, the eigenspaces of the mapping are special cases of invariant subspaces. Our next task is to find conditions under which there are invariant complements of invariant subspaces.

2.4.5. We illustrate some typical properties of mappings on the spaces R3 and R2 in terms of eigenvalues and eigenvectors.
(1) Consider the mapping f : R3 → R3 given in the standard basis by the matrix
\[
A = \begin{pmatrix} 0 & 0 & 1\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{pmatrix}.
\]
2.C.56. Given the dot products on Rm and Rn, prove that for any m × n real matrix A the column space C(A) is orthogonal to the left null space Ker(AT) and the row space C(AT) is orthogonal to the kernel Ker(A) of A. Next illustrate these statements using the matrix A described in 2.C.20. ⃝

D. Properties of linear mappings

In numerical matrix calculus, central attention is devoted to the simplest possible representation of linear mappings, i.e., to the optimal choice of coordinates allowing us to understand the mappings best. The most perfect understanding of f : V → V comes if we can represent the mapping by a diagonal matrix. This leads to the concepts of the “eigenvectors” and “eigenvalues” of f. Thus, an eigenvector of an n × n matrix A is a non-trivial solution x of the equation Ax = λx with an unknown scalar λ. Obviously, this homogeneous system of equations has a non-trivial solution if and only if λ is a root of the characteristic polynomial χA(λ) := det(A − λE), where E is the n × n identity matrix. We talk about the eigenvalues of A; they may come with multiplicities, and the spectrum of A consists of the eigenvalues of A including their multiplicities (called algebraic multiplicities). By the very definition, all this corresponds to f(u) = λu, i.e., these concepts are independent of the choice of coordinates.

If we find enough eigenvalues and eigenvectors, the mapping f enjoys a diagonal matrix! This may fail for two reasons: the polynomial might not have enough roots in the chosen field K, or the number of independent eigenvectors for a given eigenvalue λ might be smaller than the algebraic multiplicity of λ, i.e., the dimension of the eigenspace Vλ spanned by all such eigenvectors is too small (we call it the geometric multiplicity). We met examples of both in the first chapter in the plane (rotations and derivatives of first-order polynomials; see also the task presented in 2.E.57).

2.D.1. Eigenvalues and eigenvectors. Find the eigenvalues and the eigenvectors of the matrix
\[
A = \begin{pmatrix} -1 & 1 & 0\\ -1 & 3 & 0\\ 2 & -2 & 2 \end{pmatrix}.
\]
Solution. We begin with the characteristic polynomial of A:
\[
\chi_A(\lambda) = \begin{vmatrix} -1-\lambda & 1 & 0\\ -1 & 3-\lambda & 0\\ 2 & -2 & 2-\lambda \end{vmatrix} = -(\lambda^3 - 4\lambda^2 + 2\lambda + 4).
\]
The eigenvalues of A are the roots of χA(λ), i.e., of λ3 − 4λ2 + 2λ + 4. Using, for example, the Horner’s scheme presented in Chapter 1, we compute

We compute
\[
|A - \lambda E| = \begin{vmatrix} -\lambda & 0 & 1\\ 0 & 1-\lambda & 0\\ 1 & 0 & -\lambda \end{vmatrix} = -\lambda^3 + \lambda^2 + \lambda - 1,
\]
with roots λ1 = 1, λ2 = 1, λ3 = −1. The eigenvectors with the eigenvalue λ = 1 can be computed from
\[
\begin{pmatrix} -1 & 0 & 1\\ 0 & 0 & 0\\ 1 & 0 & -1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & -1\\ 0 & 0 & 0\\ 0 & 0 & 0 \end{pmatrix};
\]
a basis of the space of solutions, that is, of all eigenvectors with this eigenvalue, is u1 = (0, 1, 0), u2 = (1, 0, 1). Similarly for λ = −1 we obtain the third independent eigenvector:
\[
\begin{pmatrix} 1 & 0 & 1\\ 0 & 2 & 0\\ 1 & 0 & 1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 1\\ 0 & 2 & 0\\ 0 & 0 & 0 \end{pmatrix} \quad\Rightarrow\quad u_3 = (-1, 0, 1).
\]
In the basis u1, u2, u3 (note that u3 must be linearly independent of the remaining two because of the previous theorem, and u1, u2 were obtained as two independent solutions), f has the diagonal matrix
\[
A = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & -1 \end{pmatrix}.
\]
The whole space R3 is the direct sum of eigenspaces, R3 = V1 ⊕ V2, with dim V1 = 2 and dim V2 = 1. This decomposition is uniquely determined and says much about the geometric properties of the mapping f.
The eigenspace V1 is furthermore a direct sum of one-dimensional eigenspaces, which can be selected in other ways (thus such a decomposition has no further geometrical meaning).
(2) Consider the linear mapping f : R2[x] → R2[x] defined by polynomial differentiation, that is, f(1) = 0, f(x) = 1, f(x^2) = 2x. The mapping f thus has in the usual basis (1, x, x^2) the matrix A = ( 0 1 0 ; 0 0 2 ; 0 0 0 ). The characteristic polynomial is |A − λ·E| = −λ^3, thus it has only one eigenvalue, λ = 0. We compute the eigenvectors:
( 0 1 0 ; 0 0 2 ; 0 0 0 ) ∼ ( 0 1 0 ; 0 0 1 ; 0 0 0 ).
The space of the eigenvectors is thus one-dimensional, generated by the constant polynomial 1. The striking property of this mapping is that there is no basis in which its matrix would be diagonal. There is the “chain” of vectors mapping the independent generators as follows: (1/2)x^2 → x → 1 → 0. This builds a sequence of invariant subspaces without invariant complements.
134
λ1 = 2, λ± = 1 ± √3, all with multiplicity one.11 Hence the algebraic multiplicities of λ1 and λ± are all one. It follows that each has a 1-dimensional eigenspace, i.e., the geometric multiplicity of each eigenvalue is also one (see 3.4.10). The eigenvector associated to λ1 is determined by solving the matrix equation (A − λ1E)x = 0, with x = (x1, x2, x3)T ∈ R3. This reduces to the system {−3x1 + x2 = 0 , −x1 + x2 = 0 , x1 − x2 = 0}, and in order to find a solution we may apply what we learned in Section A. We see that x3 is a free variable, x3 ∈ R, and the other two variables must satisfy x1 = x2 = 0.12 Thus, the eigenvector associated to λ1 is the vector u1 := (0, 0, 1)T (or any multiple of it), and in particular Vλ1 is 1-dimensional (spanned by u1).
Similarly, the eigenvector corresponding to λ+ = 1 + √3 arises by solving the matrix equation (A − λ+E)x = 0. This gives the system {−(2 + √3)x1 + x2 = 0 , −x1 + (2 − √3)x2 = 0 , 2x1 − 2x2 + (1 − √3)x3 = 0}, whose solution has the form {(2 − √3, 1, −2)t : t ∈ R}. This implies that the eigenspace Vλ+ corresponding to λ+ is 1-dimensional as well: Vλ+ = spanR{(2 − √3, 1, −2)T}. In an analogous way we obtain that the eigenspace associated to the third eigenvalue λ− is 1-dimensional, Vλ− = spanR{(2 + √3, 1, −2)T}. □
2.D.2. Eigentheory via Sage. In Sage, given a square matrix it is easy to compute its eigenvalues and eigenvectors. Let us use the matrix A from the previous task to illustrate the situation. To introduce the characteristic polynomial and find its roots, we can type
A=matrix(SR, [[-1, 1, 0], [-1, 3, 0], [2, -2, 2]])
p(t) = A.characteristic_polynomial(t)
show(p(t)); show(p.roots())
The output here is the characteristic polynomial (with respect to the variable t), i.e., t^3 − 4t^2 + 2t + 4, and its roots, presented in a list as follows: [(−√3 + 1, 1), (√3 + 1, 1), (2, 1)].
11 Use also Sage to solve the equation χA(λ) = 0, and then read below for a variety of techniques in Sage for finding the eigenvalues of a given matrix A.
12 Although this case is extremely easy and the fact that x3 is a free variable (as is the solution) is obvious, recall that a good technique for solving the homogeneous system (A − λE)x = 0, where λ is an eigenvalue of A, relies on the reduced row echelon form of the matrix A − λE, especially when A is an n × n matrix with large n.
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
2.4.6. Orthogonal mappings.
We consider the special case of the mapping f : V → W between spaces with scalar products, which preserve lengths for all vectors u ∈ V . Orthogonal mappings A linear mapping f : V → W between spaces with scalar product is called an orthogonal mapping, if for all u ∈ V ⟨f(u), f(u)⟩ = ⟨u, u⟩. The linearity of f and the symmetry of the scalar product imply that for all pairs of vectors the following equality holds: ⟨f(u + v), f(u + v)⟩ = ⟨f(u), f(u)⟩ + ⟨f(v), f(v)⟩ + 2⟨f(u), f(v)⟩. Therefore all orthogonal mappings satisfy also the seemingly stronger condition for all vectors u, v ∈ V : ⟨f(u), f(v)⟩ = ⟨u, v⟩, i.e. the mapping f leaves the scalar product invariant if and only if it leaves invariant the length of the vectors. (We should have noticed that this is true for all fields of scalars, where 1 + 1 ̸= 0, but it does not hold true for Z2.) In the initial discussion about the geometry in the plane we proved in the Theorem 1.5.10 that a linear mapping R2 → R2 preserves lengths of the vectors if and only if its matrix in the standard basis (which is orthonormal with respect to the standard scalar product) satisfies AT · A = E, that is, A−1 = AT . In general, orthogonal mappings f : V → W must be always injective, because the condition ⟨f(u), f(u)⟩ = 0 implies ⟨u, u⟩ = 0 and thus u = 0. In such a case, the dimension of the range is always at least as large as the dimension of the domain of f. But then both dimensions are equal and f : V → Im f is a bijection. If Im f ̸= W, we extend the orthonormal basis of the image of f to an orthonormal basis of the range space and the matrix of the mapping then contains a square regular submatrix A along with zero rows so that it has the required number of rows. Without loss of generality we can assume that W = V . Our condition for the matrix of an orthogonal mapping in any orthonormal basis requires that for all vectors x and y in the space Kn : (A · x)T · (A · y) = xT · (AT · A) · y = xT · y. Special choice of the standard basis vectors for x and y yields directly AT · A = E, that is, the same result as for dimension two. Thus we have proved the following theorem: Matrix of orthogonal mappings Theorem. Let V be a real vector space with scalar product and let f : V → V be a linear mapping. Then f is orthogonal if and only if in some orthonormal basis (and then consequently in all of them) its matrix A satisfies AT = A−1 . 135 In case we simply type A.characteristic_polynomial( ), then Sage returns the characteristic polynomial as a polynomial with respect to the variable x. The previous list indicates the three eigenvalues of A, together with their algebraic multiplicity. Of course, one can proceed in a more elementary way, e.g., by the block A=matrix(SR, [[-1, 1, 0], [-1, 3, 0], [2, -2, 2]]) E=identity_matrix(3) var("c"); ch(c)=det(A-c*E) ch.factor(); show(ch.roots()) Another alternative to compute the eigenvalues of A has the form A=matrix(SR, [[-1, 1, 0], [-1, 3, 0], [2, -2, 2]]) eigen=A.eigenvalues() show(eigen) In this case, Sage’s output looks like as [− √ 3+1, √ 3+1, 2], so the command A.eigenvalues( ) computes only the eigenvalues, without their algebraic multiplicities. As for the eigenvectors, add in some of the previous cells the code A.eigenvectors_right() Here Sage prints out the eigenvectors of A, together with the corresponding eigenvalues and algebraic multiplicities. 2.D.3. Find bases of the eigenspaces of the matrix A =   0 1 1 1 0 1 1 1 0   . Solution. 
We first compute the determinant det(A − λE), expanding along the first row:
det( −λ 1 1 ; 1 −λ 1 ; 1 1 −λ ) = −λ · det( −λ 1 ; 1 −λ ) − 1 · det( 1 1 ; 1 −λ ) + 1 · det( 1 −λ ; 1 1 ),
and hence the characteristic polynomial is given by χA(λ) = −λ^3 + 3λ + 2. Applying the same method as in 2.D.1, we see that χA(λ) = −(λ + 1)^2(λ − 2), hence λ1 = −1 is a double root, while the other root is given by λ2 = 2. These are the eigenvalues of A, with algebraic multiplicities two and one, respectively. Let us describe the associated eigenspaces.
• For λ1 = −1 we get the matrix equation ( 1 1 1 ; 1 1 1 ; 1 1 1 )(x1 ; x2 ; x3) = (0 ; 0 ; 0). Obviously, the reduced row echelon form of the matrix A + E is given by ( 1 1 1 ; 0 0 0 ; 0 0 0 ), hence we have two free variables, namely x2 = a ∈ R and x3 = b ∈ R. The general solution has the form (−a − b, a, b)T, thus the eigenspace V−1 is a 2-dimensional subspace of R3 given by V−1 = {(−a − b, a, b)T : a, b ∈ R} = spanR{(−1, 1, 0)T, (−1, 0, 1)T}.
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
Proof. Indeed, if f preserves lengths, it must have the claimed property in every orthonormal basis. On the other hand, the previous calculations show that this property of the matrix in one such basis ensures length preservation. □
Square matrices which satisfy the equality AT = A−1 are called orthogonal matrices. The shape of the coordinate transition matrices between orthonormal bases is a direct corollary of the above theorem. Each such matrix must provide a mapping Kn → Kn which preserves lengths and thus satisfies the condition S−1 = ST. When changing from one orthonormal basis to another, the matrix of any linear mapping changes according to the relation A′ = ST A S.
2.4.7. Decomposition of an orthogonal mapping. We take a more detailed look at eigenvectors and eigenvalues of orthogonal mappings on a real vector space V with scalar product. Consider a fixed orthogonal mapping f : V → V with the matrix A in some orthonormal basis. We continue as with the matrix D of rotation in 2.4.1.
We think first about invariant subspaces of orthogonal mappings and their orthogonal complements. Namely, given any subspace W ⊂ V invariant with respect to an orthogonal mapping f : V → V, for all v ∈ W⊥ and w ∈ W we immediately see that ⟨f(v), w⟩ = ⟨f(v), f ◦ f−1(w)⟩ = ⟨v, f−1(w)⟩ = 0, since f−1(w) ∈ W, too. But this means that also f(W⊥) ⊂ W⊥, and we have proved a simple but very important proposition:
Proposition. The orthogonal complement of a subspace invariant with respect to an orthogonal mapping is also invariant.
If all eigenvalues of an orthogonal mapping are real, this claim ensures that there always exists a basis of V composed of eigenvectors. Indeed, the restriction of f to the orthogonal complement of an invariant subspace is again an orthogonal mapping, therefore we can add one eigenvector to the basis after another, until we obtain the whole decomposition of V.
However, mostly the eigenvalues of orthogonal mappings are not real. We need to make a detour into complex vector spaces. We formulate the result right away:
136
These two eigenvectors form a basis of V−1.
• For λ2 = 2 we obtain the matrix equation ( −2 1 1 ; 1 −2 1 ; 1 1 −2 )(x1 ; x2 ; x3) = (0 ; 0 ; 0). It is easy to verify that the matrix ( 1 0 −1 ; 0 1 −1 ; 0 0 0 ) is the reduced row echelon form of the matrix A − 2E. Hence we deduce that the solution has the form (a, a, a)T, where a = x3 ∈ R is the free variable. Thus, V2 is 1-dimensional, V2 = spanR{(1, 1, 1)T}. □
2.D.4. Present the solution of the task in 2.D.3 in Sage. ⃝
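One possible sketch (try your own version first; the names below are arbitrary, and Sage may choose a different, but equivalent, basis of each eigenspace):
A = matrix(QQ, [[0, 1, 1], [1, 0, 1], [1, 1, 0]])
show(A.characteristic_polynomial().factor())  # (x - 2)*(x + 1)^2
for ev, vecs, alg_mult in A.eigenvectors_right():
    print(ev, alg_mult)  # the eigenvalue and its algebraic multiplicity
    show(vecs)           # a basis of the corresponding eigenspace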
In 2.D.3 we observed that the eigenspace of a multiple eigenvalue, such as the double root λ1 = −1, can have dimension greater than one. In fact, for this particular eigenvalue, the algebraic and geometric multiplicities are both equal to 2. However, be aware that, in general, the algebraic and geometric multiplicities of an eigenvalue need not match. In particular, the algebraic multiplicity of an eigenvalue is always greater than or equal to its geometric multiplicity.
2.D.5. Determine the eigenvalues of the following matrix, together with their algebraic and geometric multiplicities. A = ( 1 1 0 ; −1 3 0 ; 2 −2 2 ). ⃝
2.D.6. Compute tr(A) and det(A) for the matrix A = ( −11 5 4 1 ; −3 0 1 0 ; −21 11 8 2 ; −9 5 3 1 ), based on the formulas det(A) = λ1λ2λ3λ4 , tr(A) = λ1 + λ2 + λ3 + λ4 , where λ1, λ2, λ3, λ4 are the roots of the characteristic polynomial χA(λ) of A. Moreover, do the computation in Sage.
Solution. The characteristic polynomial associated to A is of degree four, χA(λ) = λ^4 + 2λ^3 − 2λ − 1. It has two roots, namely λ+ = 1 with multiplicity one and λ− = −1 with multiplicity three, that is, χA(λ) = (λ − 1)(λ + 1)^3. It follows that det(A) = λ+ · λ−^3 = −1 and tr(A) = λ+ + 3λ− = 1 − 3 = −2. To perform these computations in Sage, use the following block:
A=matrix(SR, [[-11, 5, 4, 1], [-3, 0, 1, 0], [-21, 11, 8, 2], [-9, 5, 3, 1]])
p(t) = A.characteristic_polynomial(t)
show(p(t).factor())
show(p.roots())
print(A.det()); print(A.trace())
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
Orthogonal mapping decomposition
Theorem. Let f : V → V be an orthogonal mapping on a real vector space V with scalar product. Then all the (in general complex) roots of the characteristic polynomial of f have length one. There exists a decomposition of V into one-dimensional eigenspaces corresponding to the real eigenvalues λ = ±1 and two-dimensional subspaces Pλ,λ̄ with λ ∈ C∖R, where f acts by the rotation through the angle equal to the argument of the complex number λ, in the positive sense. All these subspaces are mutually orthogonal.
Proof. Without loss of generality we can work with the space V = Rm with the standard scalar product. The mapping is thus given by an orthogonal matrix A, which can equally well be seen as the matrix of a (complex) linear mapping on the complex space Cm (which just happens to have all of its coefficients real). There exist exactly m (complex) roots of the characteristic polynomial of A, counting their algebraic multiplicities (see the fundamental theorem of algebra, 1.A.6).
Furthermore, because the characteristic polynomial of the mapping has only real coefficients, its roots are either real or come in pairs of complex conjugates λ and λ̄. The associated eigenvectors in Cm for such pairs of complex conjugates are actually solutions of two systems of linear homogeneous equations which are also complex conjugate to each other – the corresponding matrices of the systems have real components, except for the eigenvalues λ. Therefore the solutions of these systems are also complex conjugates (check this!).
Next, we exploit the fact that for every invariant subspace its orthogonal complement is also invariant. First we find the eigenspaces V±1 associated with the real eigenvalues, and restrict the mapping to the orthogonal complement of their sum. Without loss of generality we can thus assume that our orthogonal mapping has no real eigenvalues and that dim V = 2n > 0.
Now choose an eigenvalue λ and let uλ be the eigenvector in C2n associated to the eigenvalue λ = α + iβ, β ̸= 0. Analogously to the case of rotation in the plane discussed in paragraph 2.4.1 in terms of the matrix D, we are interested in the real part of the sum of two one-dimensional (complex) subspaces W = span{uλ} ⊕ span{¯uλ}, where ¯uλ is the eigenvector associated to the conjugated eigenvalue ¯λ. Now we want the intersection of the 2-dimensional complex subspace W with the real subspace R2n ⊂ C2n , which is clearly generated (over R) by the vectors uλ + ¯uλ and i(uλ − ¯uλ). We call this real 2-dimensional subspace Pλ,¯λ ⊂ R2n and notice, this subspace is generated by the basis given by the real and imaginary part of uλ xλ = Re uλ, −yλ = − Im uλ. 137 eign=A.eigenvalues(); print(eign) bool(sum(eign)==A.trace()) bool(prod(eign)==A.det()) □ 2.D.7. Let A be a n × n triangular matrix over K ∈ {R, C}, either upper or lower. Prove that the eigenvalues of A are its diagonal elements. ⃝ 2.D.8. (a) Prove that a n×n matrix A is invertible if and only if 0 is not an eigenvalue of A. (b) Compute the characteristic polynomial of the n×n matrix An given below. Then combine your result with the statement in (a), to determine those n for which the matrix An is invert- ible. An =           1 0 0 0 · · · 0 0 1 1 1 0 0 · · · 0 0 0 0 1 1 0 · · · 0 0 0 0 0 1 1 · · · 0 0 0 ... ... ... ... ... ... ... ... 0 0 0 0 · · · 1 1 0 0 0 0 0 · · · 0 1 1           . ⃝ As described in 2.4.3 the eigenvectors of a squared matrix corresponding to distinct eigenvalues, are linearly independent. This property enables us to use an “eigenvector basis” to represent the matrix as a diagonal matrix. Matrices (or endomorphisms) with this property are commonly referred to as diagonalizable. Summarizing, a matrix A is diagonalizable if it is similar to a diagonal matrix D, meaning that there exists an invertible matrix P such that A = PDP−1 . This is equivalent to saying that the algebraic and geometric multiplicities of each eigenvalue of A are equal. Diagonalizable matrices are particularly useful because they simplify many matrix computations, such as raising a matrix to a power, or solving systems of linear differential equations. As we will explore in Chapter 3, significant examples of diagonalizable endomorphisms include symmetric matrices. These matrices are important in various fields, including physics, engineering, and computer science. 2.D.9. Diagonalization. Consider the matrix A introduced in the tasks 2.D.1, 2.D.3 and 2.D.5, respectively. Specify the cases where A is diagonalizable, and for these cases find matrices P, D such that A = PDP−1 . Solution. (a) Let us consider first the matrix A given in 2.D.1. There we saw that this matrix admits three eigenvalues, all distinct each other. Hence in this case A is diagonalizable. In particular, observe that for any eigenevalue of A, the corresponding algebraic and geometric multiplicity coincide (they are all equal to one). Thus we have A = PDP−1 , with D =   λ1 0 0 0 λ+ 0 0 0 λ−   , P =   0 2 − √ 3, 2 + √ 3 0 1 1 1 −2 −2   , CHAPTER 2. ELEMENTARY LINEAR ALGEBRA Because A · (uλ + ¯uλ) = λuλ + ¯λ¯uλ and similarly with the second basis vector, it is clearly an invariant subspace with respect to multiplication by the matrix A and we obtain A · xλ = αxλ + βyλ, A · (−yλ) = −αyλ + βxλ. Consequently, ∥A · xλ∥2 + ∥A · yλ∥2 = (α2 + β2 )(∥x∥2 + ∥y∥2 ) and, since our mapping preserves lengths, the absolute value of the eigenvalue λ must equal one. 
But that means that the restriction of our mapping to Pλ,λ̄ is the rotation through the argument of the eigenvalue λ. Note that the choice of the eigenvalue λ̄ instead of λ leads to the same subspace with the same rotation; we would just express it in the basis xλ, yλ, so in these coordinates the rotation goes through the same angle, but with the opposite sign, as expected.
The proof of the whole theorem is completed by restricting the mapping to the orthogonal complement and finding another 2-dimensional subspace, until we get the required decomposition. □
We return to the ideas in this proof once again in chapter three, where we study complex extensions of the Euclidean vector spaces, see 3.4.4.
Remark. The previous theorem is very powerful in dimension three. Here at least one eigenvalue must be real, equal to ±1, since three is odd. But then the associated eigenspace is an axis of the rotation of the three-dimensional space, through the angle given by the argument of the other eigenvalues. Try to think how to detect in which direction the space is rotated. Note also that the eigenvalue −1 means an additional reflection through the plane perpendicular to the axis of the rotation.
We shall return to the discussion of such properties of matrices and linear mappings in more detail at the end of the next chapter, after illustrating the power of the matrix calculus in several practical applications. We close this section with a general and quite widely used definition:
138
respectively, where λ1 = 2, λ± = 1 ± √3 are the eigenvalues of A. The columns of P are the eigenvectors of A, described in 2.D.1. The block below provides a quick verification of the relation A = PDP−1 in Sage:
A=matrix(SR, [[-1, 1, 0], [-1, 3, 0], [2, -2, 2]])
u1=vector([0, 0, 1])
u2=vector([2-sqrt(3), 1, -2])
u3=vector([2+sqrt(3), 1, -2])
P=column_matrix([u1, u2, u3])
D=diagonal_matrix([2, 1+sqrt(3), 1-sqrt(3)])
bool(A==P*D*P.inverse())
Note that we can replace any column of the matrix P with a non-zero scalar multiple of the corresponding eigenvector. As a result, there are infinitely many possible matrices P that satisfy the diagonalization condition.
(b) Let us now consider the matrix A given in 2.D.3. There we saw that this matrix has two eigenvalues: λ1 = −1 with algebraic multiplicity 2, and λ2 = 2 with algebraic multiplicity 1. We also proved that the geometric multiplicity of λ1 coincides with its algebraic multiplicity, and similarly for λ2. Hence in this case the matrix A is also diagonalizable: A = PDP−1. To explicitly determine D and P we can follow the same method outlined in (a). We leave this case as an exercise for practice.
(c) Finally, the matrix A presented in 2.D.5 is not diagonalizable. This is because its unique eigenvalue λ = 2 has algebraic multiplicity 3, but geometric multiplicity 2. Consequently, it is impossible to construct a complete basis of eigenvectors for this matrix. □
2.D.10. Determine whether the matrices A1 and A2 given below are diagonalizable. If diagonalizable, determine the matrices Di, Pi such that Ai = PiDiPi−1: A1 = ( 2 0 0 ; 4 2 2 ; −2 0 1 ), A2 = ( 2 0 0 ; −4 2 2 ; −2 0 1 ). ⃝
Consider a real vector space V endowed with a scalar product ⟨ , ⟩, and an orthogonal endomorphism f : V → V. Thus, f preserves the scalar product, the lengths and the angles of vectors in V. Prime examples of orthogonal endomorphisms are rotations, in R2 or R3 (cf. 1.E.17).
It is most remarkable that, in matrix terms, A ∈ Matn(R) is the matrix of an orthogonal transformation f (in suitable orthonormal coordinates) if and only if AAT = AT A = E. We call such matrices orthogonal, and they are particularly easy to invert. This property leads to several interesting results, such as the fact that the determinant of an orthogonal matrix is always ±1.
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
Spectrum of linear mapping
2.4.8. Definition. The spectrum of a linear mapping f : V → V, or the spectrum of a square matrix A, is the sequence of roots of the characteristic polynomial of f or of A, along with their multiplicities. The algebraic multiplicity of an eigenvalue means the multiplicity of the root of the characteristic polynomial, while the geometric multiplicity of the eigenvalue is the dimension of the associated subspace of eigenvectors. The spectral diameter of a linear mapping (or matrix) is the greatest of the absolute values of the eigenvalues.
In this terminology, our results about orthogonal mappings can be formulated as follows: the spectrum of an orthogonal mapping is always a subset of the unit circle in the complex plane. Thus only the values ±1 may appear in the real part of the spectrum, and their algebraic and geometric multiplicities are always the same. Complex values of the spectrum then correspond to rotations in suitable two-dimensional subspaces which are mutually perpendicular.
139
The next task illustrates that orthogonal endomorphisms on Rn correspond to orthogonal n × n real matrices, and conversely. For more information, see the discussion in 2.4.6.
2.D.11. On V = Rn endowed with the dot product consider the linear mapping φA : V → V defined by φA(u) = Au for all u ∈ V, where A ∈ Matn(R). Prove that A is orthogonal if and only if φA is an orthogonal endomorphism of V. ⃝
2.D.12. Determine which of the matrices below are orthogonal:
A = ( 1/√2 −1/√2 ; −1/√2 −1/√2 ), B = ( 2/3 −2/3 1/3 ; 2/3 1/3 −2/3 ; 1/3 2/3 2/3 ), C = ( 3/√11 −1/√6 −1/√66 ; 1/√11 2/√6 −4/√66 ; 1/√11 1/√6 7/√66 ).
Solution. All the given matrices A, B and C are orthogonal. There are many ways to verify this fact. For A notice that AT = A and AAT = ( 1/√2 −1/√2 ; −1/√2 −1/√2 )( 1/√2 −1/√2 ; −1/√2 −1/√2 ) = E.
On the other hand, later in Chapter 3 we will learn that a square matrix is orthogonal if and only if its columns (or rows) form an orthonormal basis (see 3.4.4). Given our familiarity with orthonormal bases, we can use this criterion to determine whether the matrix B is orthogonal. We need to verify, for example, that the (column) vectors v1 = (2/3, 2/3, 1/3)T, v2 = (−2/3, 1/3, 2/3)T, v3 = (1/3, −2/3, 2/3)T form an orthonormal basis of R3 with respect to the dot product. We compute ∥v1∥ = ∥v2∥ = ∥v3∥ = 1 and v1 · v2 = −4/9 + 2/9 + 2/9 = 0, v1 · v3 = 2/9 − 4/9 + 2/9 = 0, v2 · v3 = −2/9 − 2/9 + 4/9 = 0. This proves the claim.
You can also use the command A.is_unitary( ) in Sage. This checks whether a given real matrix A is orthogonal (or whether a complex matrix is unitary, as we will see in Chapter 3). Let us apply this method to verify the orthogonality of the matrix C:
C = matrix(SR, [[3/sqrt(11), -1/sqrt(6), -1/sqrt(66)],
[1/sqrt(11), 2/sqrt(6), -4/sqrt(66)],
[1/sqrt(11), 1/sqrt(6), 7/sqrt(66)]])
C.is_unitary()
Sage's output is True. □
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
140
Proper rotations are orthogonal matrices with determinant 1, and play a crucial role in both applied and pure mathematics.
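As a quick illustration of both facts at once, here is a minimal symbolic sketch checking the generic rotation of the plane by the angle θ:
var("theta")
R = matrix(SR, [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]])
show((R.transpose()*R).simplify_full())  # the identity matrix E
show(R.det().simplify_full())            # 1, so R is a proper rotation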
Orthogonal matrices with determinant −1 describe orientation-reversing transformations, i.e., compositions of a rotation with a reflection; these are used to model symmetry operations, such as those seen in particle physics or optics. Recall that in low dimensions, eigentheory provides a nuanced geometric perspective on both rotations and reflections (see the remark in 2.4.7, and the tasks 1.E.17, 2.E.57).
In a three-dimensional space, a general proper rotation is characterized by its axis and its angle θ. Conventionally, a positive angle corresponds to a counterclockwise rotation when viewed from the tip of the axis, see also the discussion in ??. The rotation matrix in this case necessarily has three eigenvalues: a real one λ1 = 1, and two complex conjugate eigenvalues of absolute value one. As explained in 2.4.7, the eigenvector u1 associated with the eigenvalue 1 represents the rotation axis, which remains invariant under the rotation. The direction of this axis can be determined using the right-hand rule: curl the fingers of your right hand around the axis of rotation, with the fingers pointing in the direction of the rotation; your thumb will then point in the direction of u1. Additionally, the angle θ is given by the argument of one of the two complex conjugate eigenvalues, which represent a 2-dimensional rotation within the plane orthogonal to the rotation axis.
2.D.13. Consider the matrix A = ( 3/5 16/25 −12/25 ; −16/25 93/125 24/125 ; 12/25 24/125 107/125 ).
(a) Show that the linear mapping induced by A is a rotation. (b) Find its axis, the plane of rotation, and the rotation angle.
Solution. (a) To prove that the induced linear mapping is a rotation, it is sufficient to show that A is an orthogonal matrix with determinant 1. In Sage you may apply the method described above, i.e.,
A=matrix(SR, [[3/5, 16/25, -12/25],
[-16/25, 93/125, 24/125],
[12/25, 24/125, 107/125]])
A.is_unitary()
Sage's output is True. Adding the command det(A), Sage gives that det(A) = 1. Compute the determinant by hand, as well. Thus, A is an orthogonal matrix with determinant 1. Let us also compute the eigenvalues of A by adding the command A.eigenvalues(). Sage returns the list
[-4/5*I + 3/5, 4/5*I + 3/5, 1]
but for our convenience we set λ1 = 1, λ2 = 3/5 + (4/5)i and λ3 = 3/5 − (4/5)i. All three eigenvalues have absolute value one. Combined with the fact that the matrix is orthogonal, this confirms that the matrix represents a rotation.
(b) In this part we need to find the eigenvectors of A. Let
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
141
us use Sage to specify them. By adding the command A.eigenvectors_right() to the previous block, Sage will return the eigenvectors corresponding to the eigenvalues λ1, λ2, λ3. These are given by v1 = (0, 1, 4/3)T, v2 = (1, 4i/5, −3i/5)T, and v3 = (1, −4i/5, 3i/5)T, respectively. The axis of the rotation is given by the eigenvector v1. The plane of rotation is the real plane E in R3 which is defined by the intersection of the 2-dimensional complex subspace of C3 spanned by the remaining eigenvectors v2, v3, with R3. A direct computation shows that E = spanR{E1 = (1, 0, 0)T, E2 = (0, −4, 3)T} ⊂ R3, where the first generator is a real multiple of v2 + v3 and the second one is a real multiple of i(v2 − v3), see also 2.4.7. We are now ready to determine the rotation angle in this plane: it is a rotation through the angle arccos(3/5) ≈ 0.295π, which is the argument of the eigenvalue 3/5 + (4/5)i (or minus that number, if we chose the other eigenvalue). □
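As an independent cross-check of the axis and the angle, one can avoid the complex eigenvectors altogether: the axis must be fixed by A, and for any rotation matrix in R3 the trace satisfies tr(A) = 1 + 2 cos θ. A small sketch:
A = matrix(QQ, [[3/5, 16/25, -12/25], [-16/25, 93/125, 24/125], [12/25, 24/125, 107/125]])
v1 = vector([0, 1, 4/3])
print(A*v1 == v1)                # True: v1 spans the axis of the rotation
show(arccos((A.trace() - 1)/2))  # arccos(3/5), the angle found above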
You can find more exercises on eigenvalues, eigenvectors, orthogonal matrices and rotations in Section E. In Chapter 3 we will return to these topics and expand our perspective by exploring applications of eigenvalues and orthogonal diagonalization.
142
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
E. Additional exercises for the whole chapter
Below we provide additional material for most of the concepts discussed in the second chapter. The goal is to gain further computational experience with matrix calculus, become more familiar with basic notions of linear transformations, and hopefully enrich our approach to solving linear problems.
A) Material on vectors and matrices
2.E.1. Sage's vocabulary in linear algebra. This chapter has provided extensive information on using Sage with vectors and matrices. Before we begin this supplementary section, it is helpful to summarize this knowledge. Below, we list most of the Sage commands we have explored so far, along with instructions for manipulating matrices and linear mappings. We strongly recommend using Sage to aid your understanding of the problems presented below, especially in cases where our solution does not include Sage. This practice will also help expand your Sage vocabulary on linear algebra and its applications, as we will see later in Chapter 3.
Sage vocabulary for matrix calculus, vector spaces and linear mappings:
zero m/n matrix: zero_matrix(KK, m, n)
identity n/n matrix: identity_matrix(KK, n)
diagonal matrix: diagonal_matrix(KK, [a, b, c])
random matrix: random_matrix(KK, m, n)
matrix multiplication: A*b, A*B or B*A
matrix exponentiation: e.g., A^3 or A**3
determinant, trace: A.det( ), A.trace( )
row echelon form: A.echelon_form( )
RREF: A.rref( )
pivot columns: A.pivots( )
inverse: A.inverse( )
minors of order k: A.minors(k)
adjoint: A.adjugate( )
transpose: A.transpose( )
conjugate: A.conjugate( )
conjugate transpose: A.conjugate_transpose( )
rank: A.rank( )
left kernel: A.left_kernel( )
right kernel: A.right_kernel( )
dimension of the left kernel: A.left_nullity( )
dimension of the right kernel: A.right_nullity( )
characteristic polynomial: A.characteristic_polynomial( )
eigenvalues: A.eigenvalues( )
eigenvectors: A.eigenvectors_right( )
orthogonal (real case) or unitary: A.is_unitary( )
symmetric: A.is_symmetric( )
rescale a row: A.rescale_row( )
add a multiple of one row to another: A.add_multiple_of_row( )
interchange rows: A.swap_rows( )
accessing the entry a12: A[0, 1] or A[0][1]
accessing the kth column: A[:, k−1], e.g. the 3rd column A[:, 2]
accessing the nth row: A[n−1, :], e.g. the 2nd row A[1, :]
finding the size of A: A.nrows( ), A.ncols( )
introducing a vector u: u = vector(KK, [u1, . . . , un])
size of u: u.degree( )
submatrices of B: e.g., B[0:2, 1:3], B[:, 1:3], B[1:3, :]
block matrix ( A 0 ; 0 B ): block_diagonal_matrix([A, B])
stacking A on top of B: A.stack(B)
augmented matrix: A.augment(b)
a solution of Ax = b: A\b
As we learned in the main part, many of the commands listed above come with various options, which we have not included here for simplicity. Above, KK represents a numerical system K, e.g., ZZ for the integers Z, etc.
2.E.2. Consider the finite field Z7 and the matrices A = ( 2 3 ; 4 5 ) and B = ( 6 1 ; 0 2 ) with elements in Z7. Describe the multiplication AB and then confirm your result using Sage. How many arithmetic operations do we need to compute the product AB over Z7, and why does this differ from the count presented in 1.E.11?
Solution. The finite field Z7 consists of the integers {0, 1, 2, 3, 4, 5, 6} with addition and multiplication defined modulo 7.
Thus, to multiply the matrices A, B over Z7 we follow the standard procedure for matrix multiplication but reduce each entry modulo 7. Let us denote by C = AB = (cij) the product of A, B. Then, the entries cij are given by cij = ∑2 k=1 aikbkj(mod7). We compute c11 = (2 · 6 + 3 · 0)(mod7) = 12(mod7) = 5 , c12 = (2 · 1 + 3 · 2)(mod7) = 8(mod7) = 1 , c21 = (4 · 6 + 5 · 0)(mod7) = 24(mod7) = 3 , c22 = (4 · 1 + 5 · 2)(mod7) = 14(mod7) = 0 . Thus C = AB = ( c11 c12 c21 c22 ) = ( 5 1 3 0 ) . A verification in Sage follows the standard procedure. However, we first need to introduce the finite field Z7 and define the matrices A and B over this field. 143 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA Z7=GF(7) A=Matrix(Z7, [[2, 3], [4, 5]]); B=Matrix(Z7, [[6, 1], [0, 2]]); show(A*B) Finally, according to 1.E.11 the multiplication AB of two 2 × 2 real matrices requires 12 arithmetic operations: For each matrix element, we perform two multiplications and one addition. Over Z7 for each matrix element we need one more operation, the modulo operation. Thus in total we will have 12 + 4 = 16 operations. □ Elementary matrices are fundamental tools in linear algebra, particularly useful for understanding and performing row operations on matrices. An elementary matrix is derived from the identity matrix by performing a single elementary row operation, see 2.1.8. These operations are essential for matrix manipulations, solving systems of linear equations, and finding matrix inverses. Let us describe related tasks. 2.E.3. Consider the matrix A =   1 0 2 3 1 −1 2 4 2  . i) Determine the elementary matrix representing the row operation R2 → R2 − 3R1. Next, confirm that the inverse of this elementary matrix is the elementary matrix of the row operation R2 → R2 + 3R1. ii) Determine the elementary matrix representing the row operation R2 → R2 − 3 2 R3. Next, confirm that the inverse of this elementary matrix is the elementary matrix of the row operation R2 → R2 + 3 2 R3. iii) Determine the elementary matrix representing the row operation R3 → 1 2 R3. Next, confirm that the inverse of this elementary matrix is the elementary matrix of the row operation R3 → 2R3. ⃝ 2.E.4. Consider the matrix A =   2 1 3 1 0 1 4 2 1  . (a) Show that A is invertible. (b) Apply elementary row operations to determine the inverse of A. (c) Express A as a product of inverses of elementary matrices. Solution. (a) Using the Sarrus rule (see 2.2.1) it is straightforward to verify that the determinant of A equals 5 ̸= 0. Therefore, the matrix is invertible. (b) Consider the augmented matrix ( A E ) (where E is the 3 × 3 identity). The goal is to use elementary row operations to transform ( A E ) into ( E A−1 ) . We compute: ( A E ) =   2 1 3 1 0 0 1 0 1 0 1 0 4 2 1 0 0 1   R1→ 1 2 R1 −→    1 1 2 3 2 1 2 0 0 1 0 1 0 1 0 4 2 1 0 0 1    R2→R2−R1 −→    1 1 2 3 2 1 2 0 0 0 −1 2 −1 2 −1 2 1 0 4 2 1 0 0 1    R3→R3−4R1 −→    1 1 2 3 2 1 2 0 0 0 −1 2 −1 2 −1 2 1 0 0 0 −5 −2 0 1    R2→−2R2 −→    1 1 2 3 2 1 2 0 0 0 1 1 1 −2 0 0 0 −5 −2 0 1    R3→− 1 5 R3 −→    1 1 2 3 2 1 2 0 0 0 1 1 1 −2 0 0 0 1 2 5 0 −1 5    R2→R2−R3 −→    1 1 2 3 2 1 2 0 0 0 1 0 3 5 −2 1 5 0 0 1 2 5 0 −1 5    R1→R1− 3 2 R3 −→    1 1 2 0 − 1 10 0 3 10 0 1 0 3 5 −2 1 5 0 0 1 2 5 0 −1 5    R1→R1− 1 2 R2 −→    1 0 0 −2 5 1 1 5 0 1 0 3 5 −2 1 5 0 0 1 2 5 0 −1 5    . Thus A−1 =   −2/5 1 1/5 3/5 −2 1/5 2/5 0 −1/5  . To confirm this expression, you can either use Sage, or verify the relation AA−1 = E = A−1 A by hand. 
Let us present the verification in Sage: A=Matrix(SR, [[2, 1, 3], [1, 0, 1], [4, 2, 1]]); print(det(A)) A_inv=A.inverse(); show(A_inv) 144 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA (c) In part (b) we obtained A−1 by applying eight elementary row operations. These operations correspond to eight elementary matrices E1, . . . , E8, such that E8E7E6E5E4E3E2E1A = E. Thus, A = (E8E7E6E6E5E4E3E2E1)−1 = E−1 1 E−1 2 E−1 3 E−1 4 E−1 5 E−1 6 E−1 7 E−1 8 . Let us specify E1, . . . , E8 and their inverses. row operation R1 → 1 2 R1 : E =   1 0 0 0 1 0 0 0 1   R1→ 1 2 R1 −→   1/2 0 0 0 1 0 0 0 1   =: E1 =⇒ E−1 1 =   2 0 0 0 1 0 0 0 1   , row operation R2 → R2 − R1 : E =   1 0 0 0 1 0 0 0 1   R2→R2−R1 −→   1 0 0 −1 1 0 0 0 1   =: E2 =⇒ E−1 2 =   1 0 0 1 1 0 0 0 1   , row operation R3 → R3 − 4R1 : E =   1 0 0 0 1 0 0 0 1   R3→R3−4R1 −→   1 0 0 0 1 0 −4 0 1   =: E3 =⇒ E−1 3 =   1 0 0 0 1 0 4 0 1   , row operation R2 → −2R2 : E =   1 0 0 0 1 0 0 0 1   R2→−2R2 −→   1 0 0 0 −2 0 0 0 1   =: E4 =⇒ E−1 4 =   1 0 0 0 −1 2 0 0 0 1   , row operation R3 → − 1 5 R3 : E =   1 0 0 0 1 0 0 0 1   R3→− 1 5 R3 −→   1 0 0 0 1 0 0 0 −1/5   =: E5 =⇒ E−1 5 =   1 0 0 0 1 0 0 0 −5   , row operation R2 → R2 − R3 : E =   1 0 0 0 1 0 0 0 1   R2→R2−R3 −→   1 0 0 0 1 −1 0 0 1   =: E6 =⇒ E−1 6 =   1 0 0 0 1 1 0 0 1   , row operation R1 → R1 − 3 2 R3 : E =   1 0 0 0 1 0 0 0 1   R1→R1− 3 2 R3 −→   1 0 −3 2 0 1 0 0 0 1   =: E7 =⇒ E−1 7 =   1 0 3 2 0 1 0 0 0 1   , row operation R1 → R1 − 1 2 R2 : E =   1 0 0 0 1 0 0 0 1   R1→R1− 1 2 R2 −→   1 −1 2 0 0 1 0 0 0 1   =: E8 =⇒ E−1 8 =   1 1 2 0 0 1 0 0 0 1   . It is now a straightforward (though lengthy) computation to verify that A = E−1 1 E−1 2 E−1 3 E−1 4 E−1 5 E−1 6 E−1 7 E−1 8 . To speed up this process, we can program Sage to perform the calculation for us, by adding the following cell to the previous code block: E1inv=Matrix(SR, [[2, 0, 0], [0, 1, 0], [0, 0, 1]]) E2inv=Matrix(SR, [[1, 0, 0], [1, 1, 0], [0, 0, 1]]) E3inv=Matrix(SR, [[1, 0, 0], [0, 1, 0], [4, 0, 1]]) E4inv=Matrix(SR, [[1, 0, 0], [0, -1/2, 0], [0, 0, 1]]) E5inv=Matrix(SR, [[1, 0, 0], [0, 1, 0], [0, 0, -5]]) E6inv=Matrix(SR, [[1, 0, 0], [0, 1, 1], [0, 0, 1]]) E7inv=Matrix(SR, [[1, 0, 3/2], [0, 1, 0], [0, 0, 1]]) E8inv=Matrix(SR, [[1, 1/2, 0], [0, 1, 0], [0, 0, 1]]) bool(A==E1inv*E2inv*E3inv*E4inv*E5inv*E6inv*E7inv*E8inv) Execute this program yourselves to verify that Sage’s output is True. □ Sage provides built-in functions to explore elementary matrices, which we can list as follows: • The matrix which multiplies the kth row by c: elementary_matrix(R, n, row1 = k, scale = c) • The matrix which multiplies the kth row by c and adds it to the mth row: elementary_matrix(R, n, row1 = k, row2 = m, scale = c) • The matrix which swaps the kth and mth rows: elementary_matrix(R, n, row1 = k, row2 = m) In each case, R is the base ring, and is optional, while n denotes the size of the square matrix. 145 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA 2.E.5. Use Sage to describe the elementary matrices E1, . . . , E8 obtained in 2.E.4, using the commands mentioned above. Compute their inverses, as well. ⃝ 2.E.6. Solve the system of linear equations given by the extended matrix B = ( A b ) =     3 3 2 1 3 2 1 1 0 4 0 5 −4 3 1 5 3 3 −3 5     . ⃝ 2.E.7. Provide a formal solution of the linear system given below. 
Next use Sage to verify your solution:
{ x1 + x2 + x3 + x4 − 2x5 = 3 , 2x2 + 2x3 + 2x4 − 4x5 = 5 , −x1 − x2 − x3 + x4 + 2x5 = 0 , −2x1 + 3x2 + 3x3 − 6x5 = 2 }.
Solution. The extended matrix of the system is
B = ( A b ) = ( 1 1 1 1 −2 | 3 ; 0 2 2 2 −4 | 5 ; −1 −1 −1 1 2 | 0 ; −2 3 3 0 −6 | 2 ).
Adding the first row to the third, adding its 2-multiple to the fourth, and then adding the (−5/2)-multiple of the second row to the fourth, we obtain
B ∼ ( 1 1 1 1 −2 | 3 ; 0 2 2 2 −4 | 5 ; 0 0 0 2 0 | 3 ; 0 5 5 2 −10 | 8 ) ∼ ( 1 1 1 1 −2 | 3 ; 0 2 2 2 −4 | 5 ; 0 0 0 2 0 | 3 ; 0 0 0 −3 0 | −9/2 ).
The last row is clearly a multiple of the previous one, and thus we can omit it. The pivots are located in the first, second and fourth columns, while x3 and x5 are free variables, which we substitute by the real parameters t and s. Thus we consider the system
x1 + x2 + t + x4 − 2s = 3 , 2x2 + 2t + 2x4 − 4s = 5 , 2x4 = 3 .
We see that x4 = 3/2, and hence the second equation gives x2 = 1 − t + 2s. From the first equation we now obtain x1 = 1/2. This gives the expression (x1, x2, x3, x4, x5) = (1/2, 1 − t + 2s, t, 3/2, s) with t, s ∈ R. Note that setting t0 = −t + 2s and t1 = t − s, the solution takes the form (1/2, 1 + t0, t0 + 2t1, 3/2, t0 + t1), with t0, t1 ∈ R.
As an alternative method, one could transform the extended matrix B to reduced row echelon form. In this procedure, we can omit the fourth equation since it is a combination of the first three. By sequentially multiplying the second and third rows by 1/2, subtracting the third row from the second and first rows, and finally subtracting the second row from the first, we obtain the following:
B ∼ ( 1 1 1 1 −2 | 3 ; 0 2 2 2 −4 | 5 ; 0 0 0 2 0 | 3 ) ∼ ( 1 1 1 1 −2 | 3 ; 0 1 1 1 −2 | 5/2 ; 0 0 0 1 0 | 3/2 ) ∼ ( 1 1 1 0 −2 | 3/2 ; 0 1 1 0 −2 | 1 ; 0 0 0 1 0 | 3/2 ) ∼ ( 1 0 0 0 0 | 1/2 ; 0 1 1 0 −2 | 1 ; 0 0 0 1 0 | 3/2 ).
This confirms that x3, x5 are free variables, i.e., x3 = t, x5 = s with t, s ∈ R. Moreover, a short computation yields the solution presented above. These computations, particularly the (unique) reduced row echelon form of the extended matrix B and the given solution, can be obtained in Sage by applying the method proposed in 2.A.13. In particular, the commands B.rref() and B.pivots() appearing in the block below return the reduced row echelon form of B and the pivot columns, respectively.
A=matrix([[1, 1, 1, 1, -2], [0, 2, 2, 2, -4], [-1, -1, -1, 1, 2], [-2, 3, 3, 0, -6]])
b=vector([3, 5, 0, 2]); B=A.augment(b)
Br=B.rref(); show(Br); B.pivots()
x=A.solve_right(b)
k=A.right_kernel().matrix()
nrows = k.nrows()
if nrows > 0:
    t = vector(var("t", n=nrows))
    show(x + t*k)
else:
    show(x)
146
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
Sage's output has the form (1/2, t0 + 1, t0 + 2t1, 3/2, t0 + t1), with t0, t1 ∈ R, and this verifies our formal computations. □
2.E.8. Find all the solutions of the following homogeneous system of four linear equations in five unknowns x, y, z, u, v: {x + y = 2z + v , z + 4u + v = 0 , −3u = 0 , z = −v}. Moreover, specify a basis of the vector space of solutions.
Solution. Let us first rewrite the system into the matrix form Ax = 0, where here x = (x, y, z, u, v)T ∈ R5, and, as usual, the first column of A consists of the coefficients of x, the second column of the coefficients of y, and so on. This gives
A = ( 1 1 −2 0 −1 ; 0 0 1 4 1 ; 0 0 0 −3 0 ; 0 0 1 0 1 ).
Such homogeneous systems, with fewer (linear) equations than unknowns, are called underdetermined and admit an infinite number of solutions.
Indeed, let us compute the reduced row echelon form of A. For this, let us denote by R1, R2, R3, R4 the rows at any step of the procedure. We see that
( 1 1 −2 0 −1 ; 0 0 1 4 1 ; 0 0 0 −3 0 ; 0 0 1 0 1 ) (R2 → R2 + (4/3)R3) ∼ ( 1 1 −2 0 −1 ; 0 0 1 0 1 ; 0 0 0 −3 0 ; 0 0 1 0 1 ) (R4 → R4 − R2) ∼ ( 1 1 −2 0 −1 ; 0 0 1 0 1 ; 0 0 0 −3 0 ; 0 0 0 0 0 ) (R3 → −(1/3)R3) ∼ ( 1 1 −2 0 −1 ; 0 0 1 0 1 ; 0 0 0 1 0 ; 0 0 0 0 0 ) (R1 → R1 + 2R2) ∼ ( 1 1 0 0 1 ; 0 0 1 0 1 ; 0 0 0 1 0 ; 0 0 0 0 0 ).
The final matrix is in reduced row echelon form. Recall that in Sage we can achieve this form in two ways: by applying the command A.rref( ), or by manually performing the necessary elementary row operations. Let us demonstrate both methods using matrices over Q instead of R, for convenience, as this does not affect the result.
A=matrix(QQ, [[1, 1, -2, 0, -1], [0, 0, 1, 4, 1], [0, 0, 0, -3, 0], [0, 0, 1, 0, 1]])
show(A.rref())
The output is the same as the one obtained by executing the following block:
A=matrix(QQ, [[1, 1, -2, 0, -1], [0, 0, 1, 4, 1], [0, 0, 0, -3, 0], [0, 0, 1, 0, 1]])
A.add_multiple_of_row(1, 2, 4/3); show(A)
A.add_multiple_of_row(3, 1, -1); show(A)
A.rescale_row(2, -1/3); show(A)
A.add_multiple_of_row(0, 1, 2); show(A)
where recall that the elementary row operations should be applied successively. Running any of these blocks verifies that our computations are accurate. Therefore, we can now pose the system {x + y + v = 0, z + v = 0, u = 0}, where the variables y, v are free (they correspond to the 2nd and 5th columns of the RREF, since these columns do not contain a pivot). Hence, let us set y = t and v = s with t, s ∈ R and derive the solution by back substitution:
(x, y, z, u, v) = (−t − s, t, −s, 0, s) , t, s ∈ R .
Obviously, we can rewrite the solution as
(x, y, z, u, v)T = t (−1, 1, 0, 0, 0)T + s (−1, 0, −1, 0, 1)T , t, s ∈ R .
147
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
It is easy to prove that the vectors on the right-hand side are linearly independent, and thus they form a basis of the solution space. In other words, the kernel of the matrix A is a 2-dimensional subspace of R5. □
2.E.9. Determine all solutions of the following linear system: { x2 + x4 = 1 , 3x1 − 2x2 − 3x3 + 4x4 = −2 , x1 + x2 − x3 + x4 = 2 , x1 − x3 = 1 }. ⃝
2.E.10. Solve the following linear system: { 3x − 5y + 2u + 4z = 2 , 5x + 7y − 4u − 6z = 3 , 7x − 4y + 3z = 4 , x + 6y − 2u − 5z = 2 }. ⃝
2.E.11. Consider the following linear system: { x1 + x2 − x3 = 1 , x1 + 2x2 + κx3 = 2 , 2x1 + κx2 + 2x3 = 3 }. Determine the values of the real parameter κ such that (a) the system has a unique solution; (b) the system has infinitely many solutions; (c) the system has no solution. For the cases (a) and (b) determine the corresponding solutions.
Solution. The extended matrix of the system has the form B = ( A b ) = ( 1 1 −1 | 1 ; 1 2 κ | 2 ; 2 κ 2 | 3 ). By applying elementary row operations we obtain
B = ( 1 1 −1 | 1 ; 1 2 κ | 2 ; 2 κ 2 | 3 ) (R2 → R2 − R1, R3 → R3 − 2R1) ∼ ( 1 1 −1 | 1 ; 0 1 κ+1 | 1 ; 0 κ−2 4 | 1 ) (R3 → R3 − (κ−2)R2) ∼ ( 1 1 −1 | 1 ; 0 1 κ+1 | 1 ; 0 0 4 − (κ+1)(κ−2) | 3−κ ).
The above matrix is in row echelon form and we can rewrite it as
B ∼ ( 1 1 −1 | 1 ; 0 1 κ+1 | 1 ; 0 0 −κ^2+κ+6 | 3−κ ) = ( 1 1 −1 | 1 ; 0 1 κ+1 | 1 ; 0 0 (3−κ)(κ+2) | 3−κ ). (∗)
Hence we deduce that:
(a) The system has a unique solution when κ ≠ 3 and κ ≠ −2. Indeed, in this case we obtain { x1 + x2 − x3 = 1 , x2 + (κ+1)x3 = 1 , (2+κ)(3−κ)x3 = 3 − κ }.
Therefore, by back substitution we obtain the solution (x1, x2, x3)T = (1, 1/(2+κ), 1/(2+κ))T.
(b) The system has infinitely many solutions when κ = 3, since then the last row of the row echelon form presented in (∗) contains only zeros, the pivots belong to the first and second columns, and x3 is a free variable. Hence in this case we obtain the system { x1 + x2 − x3 = 1 , x2 + 4x3 = 1 }, with x3 = t ∈ R, having infinitely many solutions given by (x1, x2, x3)T = (5t, 1 − 4t, t)T, (t ∈ R). Note that we may express these solutions as (x1, x2, x3)T = t(5, −4, 1)T + (0, 1, 0)T. Thus, when κ = 3 the space of all solutions is given by summing all solutions of the corresponding homogeneous system, given by t(5, −4, 1)T, with a particular solution of the original system, given by (0, 1, 0)T.
148
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
(c) The system is inconsistent when κ = −2. Indeed, for this value of κ the third row of the matrix on the right-hand side of (∗) gives us 0 · x3 = 5, which is impossible. □
2.E.12. Solve the following parametric system of linear equations in terms of the parameter µ ∈ R: { µx1 + 4x2 + 2x3 = 0 , 2x1 + 3x2 − x3 = 0 }. ⃝
2.E.13. Determine the number of solutions of the following parametric linear system, in terms of the parameter a ∈ R: ( 4 1 4 a ; 2 3 6 8 ; 3 2 5 4 ; 6 −1 2 −8 )(x1 ; x2 ; x3 ; x4) = (2 ; 5 ; 3 ; −3). ⃝
2.E.14. Consider the following parametric system of linear equations: { x1 − a x2 − 2x3 = b , x1 + (1 − a)x2 = b − 3 , x1 + (1 − a)x2 + a x3 = 2b − 1 }. Find the values of the parameters a, b ∈ R such that: (a) the system has exactly one solution; (b) the system has no solution; (c) the system has infinitely many solutions. ⃝
Let us now explore an important application of linear systems in the analysis of electric circuits. This description relies on Ohm's law and on Kirchhoff's voltage and current laws, providing an opportunity to revisit some concepts from high school physics. Readers not interested in these applications can skip this task.
2.E.15. Kirchhoff's Circuit Laws. Consider an electric circuit as in the figure and write down the values of the currents there if you know the values V1 = 20, V2 = 120, V3 = 50, R1 = 10, R2 = 30, R3 = 4, R4 = 5, R5 = 10. Notice that the quantities Ii denote the electric currents, while Rj are resistances and Vk are voltages.
Solution. There are two closed loops, namely ABEF and EBCD, and two branching vertices B and E of degree no less than 3. On every segment of the circuit bounded by branching points, the electric current is constant. Let us denote it by I1 for the segment EFAB, I2 for EB, and I3 for BCDE. Applying Kirchhoff's current law to the branching points B and E we obtain I1 + I2 = I3 and I3 − I1 = I2, which are, of course, the same equation. In case there are many branching vertices, we include all the Kirchhoff current law equations in the system, at least one of those equations being redundant. Choose the counterclockwise orientation of both loops ABEF and EBCD. Applying Kirchhoff's voltage law and Ohm's law to the loop ABEF we obtain the equation:
V1 + I1R3 − I2R5 + V3 + I1R1 + I1R4 = 0 .
149
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
It is easy to see that this has a unique solution, given by I1 = −80 53 ≈ −1.509, I2 = 219 53 ≈ 4.132, I3 = 139 53 ≈ 2.623. □ 2.E.16. The general case. In general, the method for electrical circuit analysis can be formulated along the following steps: (α) Identify all branching vertices of the circuit, i.e., vertices of degree no less than 3; (β) Identify all closed loops of the circuit; (γ) Introduce variables Ik, denoting the oriented currents on each segment of the circuit between two branching vertices; (δ) Write down the Kirchhoff’s current conservation law for each branching vertex. The total incoming current equals the total outgoing current; (ε) Choose an orientation on every closed loop of the circuit and write down the Kirchhoff’s voltage conservation law, according to the chosen orientation. Here, in case you find an electric charge of voltage Vj and you go from the short bar to the long bar, then the contribution of this charge is Vj. It should be −Vj in case you go from the long bar to the short one. Notice also that if you go in the positive direction of a current I and find a resistor with resistance Rj, then the contribution is −RjI, and it is RjI if the orientation of the loop is opposite to the direction of the current I. The total voltage change along each closed loop must be zero. (ζ) Compose the system collecting all equations, representing the Kirchhoff’s current and voltage laws and solve it with respect to the variables, representing the currents. Notice that some equations may be redundant, however, the solution should be unique! To illustrate this general approach, consider the circuit example in the diagram below. Solution. Let us apply the steps posed above: (α) The set of branching vertices is {B, C, F, G, H}. (β) The set of closed loops is {ABHG, FHBC, GHF, CDEF}. (γ) Let I1 be the current on the segment GAB, I2 on the segment GH, I3 on the segment HB, I4 on the segment BC, I5 on the segment FC, I6 on the segment FH, I7 on GF, and I8 on CDEF. (δ) Next we write the Kirchhoff’s current conservation laws for the branching vertices: B : I1 + I3 = I4 , C : I4 + I5 = I8 , F : I8 = I5 + I6 − I7 , G : −I7 = I1 + I2 , H : I2 + I6 = I3 . Notice that the second and third equations give I8 − I5 = I4 and I8 − I5 = I6 − I7, respectively. Hence we can replace any of them by the equation I4 = I6 − I7. (ε) Let us now write Kirchhoff’s voltage conservation for each of the closed loops traversed counter-clockwise: loop ABHG : −R1I2 + V3 + R2I1 − V2 = 0 , loop GHF : R1I2 − V1 = 0 , loop FHBC : V4 + R3I4 − V3 = 0 , loop CDEF : R4I8 − V4 = 0 . (ζ) To pass to the final step we need to use some specific values for Ri, Vi for i = 1, . . . , 4. Setting R1 = 4, R2 = 7, R3 = 9, R4 = 12, V1 = 10, V2 = 20, V3 = 60 and V4 = 120, we get a unique solution (although the system is overdetermined!) 150 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA I1 = − 30 7 , I2 = 5 2 , I3 = − 50 21 , I4 = − 20 3 , I5 = 50 3 , I6 = − 205 42 , I7 = 25 14 , I8 = 10 . Below we present the solution in Sage, where first we should introduce the unknowns I1, . . . , I8 as symbolic vari- ables. var("I1, I2, I3, I4, I5, I6, I7, I8") eq1=I1+I3-I4; eq2=I4+I5-I8 eq3=I5+I6-I7-I8; eq4=I1+I2+I7 eq5=I2-I3+I6; eq6=7*I1-4*I2+40 eq7=9*I4+60; eq8=4*I2-10 eq9=12*I8-120 show(solve([eq1, eq2, eq3, eq4, eq5, eq6, eq7, eq8, eq9], [I1, I2, I3, I4, I5, I6, I7, I8])) □ B) Material on determinants 2.E.17. 
Factor the following permutations into a product of transpositions: σ = ( 1 2 3 4 5 6 7 7 6 5 4 3 2 1 ) , τ = ( 1 2 3 4 5 6 7 8 6 4 1 2 5 8 3 7 ) , ρ = ( 1 2 3 4 5 6 7 8 9 10 4 6 1 10 2 5 9 8 3 7 ) . ⃝ 2.E.18. Determine the parity of the following permutations: σ = ( 1 2 3 4 5 6 7 7 5 6 4 1 2 3 ) , τ = ( 1 2 3 4 5 6 7 8 6 7 1 2 3 8 4 5 ) , ρ = ( 1 2 3 4 5 6 7 8 9 10 9 7 1 10 2 5 4 9 3 6 ) . ⃝ 2.E.19. For the matrix A shown below, determine all values of the complex parameter a ∈ C that satisfy det(A) = 1: A =     a 1 1 1 0 a 1 1 0 1 a 1 0 0 0 −a     . Solution. Let us compute the determinant of A by expanding the first column of the matrix: det(A) = a 1 1 1 0 a 1 1 0 1 a 1 0 0 0 −a = a · a 1 1 1 a 1 0 0 −a = −a4 + a2 . In Sage we can verify this computation by the cell var("a") A=matrix([[a, 1, 1, 1], [0,a, 1, 1], [0, 1, a, 1],[0, 0, 0, -a]]); det(A) Thus, the equation det(A) = 1 is equivalent to a4 − a2 + 1 = 0. Substituting t = a2 we obtain t2 − t + 1 with roots t1 = 1 + i √ 3 2 = cos(π/3) + i sin(π/3) , t2 = 1 − i √ 3 2 = cos(π/3) − i sin(π/3) = cos(−π/3) + i sin(−π/3) . Therefore, there are four possible values for the parameter a: a1 = cos(π/6) + i sin(π/6) = √ 3/2 + i/2 , a2 = cos(7π/6) + i sin(7π/6) = − √ 3/2 − i/2 , a3 = cos(−π/6) + i sin(−π/6) = √ 3/2 − i/2 , a4 = cos(5π/6) + i sin(5π/6) = − √ 3/2 + i/2 . Of course, solving the equation det(A) = 1 with Sage is straightforward. Simply add the code solve(det(A) == 1, a) to the previous cell. This provides a fast way to verify the result. Note that there is an alternative way to specify the possible values of a. For this, multiply the equation det(A) = 1, i.e., the equation a4 − a2 + 1 = 0 by a2 + 1. This gives 0 = (a2 + 1)(a4 − a2 + 1) = a6 + 1, and the idea is that we can easier treat the equation a6 + 1, comparing the original equation. Indeed, recall by 1.G.17 that the equation a6 = −1 has six complex solutions, which can be presented as a = cos φ + i sin φ , where φ = π/6 + kπ/3 = (2k + 1)π/6 , k = 0, 1, 2, 3, 4, 5 . 151 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA Between them, we must discard the two choices k = 1 and k = 4, since these choices solve a2 + 1 = 0, and not our equation a4 − a2 + 1 = 0. Hence we conclude that a = cos φ + i sin φ with φ = (2k + 1)π/6, for k = 0, 2, 3, and k = 5. □ 2.E.20. Establish whether the following matrix is invertible: A =     3 2 −1 2 4 1 2 −4 −2 2 4 1 2 3 −4 8     . Solution. We just need to compute det(A). We will apply Laplace theorem (see 2.2.9) by expanding the first row of the given matrix. This gives det(A) = 3 · 1 2 −4 2 4 1 3 −4 8 − 2 · 4 2 −4 −2 4 1 2 −4 8 + (−1) · 4 1 −4 −2 2 1 2 3 8 − 2 · 4 1 2 −2 2 4 2 3 −4 = 3 · 90 − 2 · 180 − 1 · 110 − 2 · (−100) = 0. Thus det(A) = 0 and A cannot be invertible. As a verification by Sage, give the cell A=matrix([[3, 2, -1, 2], [4, 1, 2, -4], [-2, 2, 4, 1],[2, 3, -4, 8]]) det(A) Note that giving the command A.inverse(), Sage will return an error, since A is singular, i.e., det(A) = 0. □ We now present the determinant of the famous “Vandermonde matrix”. The Vandermonde matrix Vn = (Vij) is the square matrix of size n, with columns formed by the powers of a given vector x = (x1, . . . , xn)T ∈ Rn , i.e., Vij = xj−1 i for i, j = 1, . . . , n, see below. This matrix is fundamental in both pure and applied mathematics. For example, in Chapter 5, we will explore its role in polynomial interpolation (see 5.1.5). Beyond mathematics, Vandermonde matrices are also significant in the natural sciences, particularly in economics and statistics. 
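Before the formal statement and proof in the next task, the reader may find it instructive to test the determinant of this matrix in the case n = 3 symbolically in Sage; a minimal sketch:
x1, x2, x3 = var("x1 x2 x3")
V3 = matrix(SR, [[1, 1, 1], [x1, x2, x3], [x1^2, x2^2, x3^2]])
show(V3.det().factor())  # a product of the differences xj - xi, up to the order of the factors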
2.E.21. Vandermonde determinant. The matrix
Vn := ( 1 1 . . . 1 ; x1 x2 . . . xn ; x1^2 x2^2 . . . xn^2 ; . . . ; x1^(n−1) x2^(n−1) . . . xn^(n−1) ),
where x1, . . . , xn ∈ R, is the so-called Vandermonde matrix. Prove that det(Vn) = ∏_{1≤i<j≤n} (xj − xi). ⃝
for n > 1 we see that det(A + B) = det(2A) = 2^n det(A) > 2 det(A) = det(A) + det(B).
2.B.10. (a) We present the solutions mainly via Sage, and leave to the reader the description of formal proofs. To find the null space of the matrix A, give the cell
A=matrix([[1, -1, pi, 0], [0, pi, -1, 0], [1, 0, -1, pi]])
show(A.right_kernel())
The output has the form RowSpanSR( 1, −1/(π(π − 1/π)), −1/(π − 1/π), −(π − 1/π + 1)/(π(π − 1/π)) ). Therefore, Ker(A) is 1-dimensional. In a similar way we compute Ker(B) = span{(1, −1, −1, 1)T}. We will learn about vector subspaces and the notion of the “dimension” of a vector (sub)space very soon, in Section C.
(b) Obviously, the required submatrix has the form ( π 0 ; 0 π ), which is diagonal. In Sage there are many ways to determine submatrices. For this specific case, where we should cancel both rows and columns, an appropriate command is A.matrix_from_rows_and_columns( ):
171
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
d=A.matrix_from_rows_and_columns([1, 2], [1, 3]); show(d)
#verify that d is diagonal
print(bool(d==diagonal_matrix([pi, pi])))
In this command the brackets [1, 2] and [1, 3] specify the rows and columns that are retained in the submatrix. Another method relies on the bracket operation that we briefly mentioned above. Hence, for example, add the cell
suba=A[1:3, 1:4:2]; show(suba)
We mention that in the expression A[1:3, 1:4:2] the final index 2 is the slice step between the columns.
(c) The submatrix obtained from B after canceling the first and last rows, and the fourth column, has the form ( 2 1 4 ; 3 4 1 ). In Sage use the block
B=matrix([[1, 2, 3, 4], [2, 1, 4, 3], [3, 4, 1, 2], [4, 3, 2, 1]])
show(B.matrix_from_rows_and_columns([1, 2], [0, 1, 2]))
or, as an alternative, you may type
B=matrix([[1, 2, 3, 4], [2, 1, 4, 3], [3, 4, 1, 2], [4, 3, 2, 1]])
show(B[1:3, 0:-1])
In fact, one can replace the final line by the command show(B[1:3, 0:3]).
(d) In this case the required submatrix has the form ( 2 4 ; 4 2 ), which is obviously symmetric. In Sage add to the previous block the code
subb=B.matrix_from_rows_and_columns([1, 3], [0, 2])
bool(subb==subb.transpose())
2.C.2. The set of real polynomials of degree exactly m is not a real vector space, because it fails to satisfy one of the key properties of a vector space: closure under addition. This is because the sum of two polynomials of degree m does not necessarily have degree m. Can you provide an explicit example?
2.C.4. One can use the following block:
v1=vector(QQ, [1/2, 2/3, 1, 0])
v2=vector(QQ, [2, 0, 1/7, 0])
v3=vector(QQ, [0, 1/5, 2/5, 1])
v4=vector(QQ, [-1, 2, 0, 3])
show(3*v1+2*v2+5*v3+v4)
Another approach involves placing the vectors and their coefficients into two separate lists and then manipulating these lists as follows:
vectors=[v1, v2, v3, v4]
scalars=[3, 2, 5, 1]
lin_comb=sum([scalars[i]*vectors[i] for i in range(len(vectors))])
show(lin_comb)
In this block we used the len() function to determine the number of elements in the list of vectors, which can be very useful when working with large lists. Check yourselves that both cells print out the following answer: (9/2, 5, 37/7, 8). There are further methods to describe the linear combination (span) of vectors via Sage, presented in 2.C.12.
2.C.5.
A solution goes as follows: A=matrix(QQ, [[3, -9, 9, 0],[9, -3, 6, 0],[9, -6, 0,-6],[-9, 12, 6, 6]]) b = vector(QQ, [-1, -1, 2, 0]) cols=A.columns(); show(cols) soln = [-2/15, 1/5, 2/15, -11/15] bool(sum([soln[i]*cols[i] for i in range(len(cols))])==b) In this block we used the function .columns() to access the columns of A. Note also that the final line verifies that linear combination of the columns of A with coefficients from soln (this represents the given solution), equals the vector b. Executing this block we obtain True. In fact, in our solution we could replace the final line by show(sum([soln[i]*cols[i] for i in range(len(cols)))) so that we can directly compare this expression with the constant vector b. 172 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA 2.C.6. The finite field Z5 is an important mathematical structure used in various areas (such as algebra, cryptography, and computer science). It consists of the integers from 0 to 4, Z5 = {0, 1, 2, 3, 4}, with arithmetic operations performed modulo 5. For your convenience we present the addition and multiplication table for Z5: + 0 1 2 3 4 0 0 1 2 3 4 1 1 2 3 4 0 2 2 3 4 0 1 3 3 4 0 1 2 4 4 0 1 2 3 · 0 1 2 3 4 0 0 0 0 0 0 1 0 1 2 3 4 2 0 2 4 1 3 3 0 3 1 4 2 4 0 4 3 2 1 . For instance, 3 + 4 = 7 = 2(mod5) and 2 · 3 = 6 = 1(mod5). The vector space Z3 5 = Z5 × Z5 × Z5 consists of vectors (a, b, c)T , where each entry a, b, c is an element of the finite field Z5. Hence, Z3 5 is 3-dimensional over Z5. The addition and scalar multiplication are respectively defined by u + v =   (u1 + v1) mod 5 (u2 + v2) mod 5 (u3 + v3) mod 5   , cu =   (cu1) mod 5 (cu2) mod 5 (cu3) mod 5   , for any two vectors u = (u1, u2, u3)T , v = (v1, v2, v3)T and scalar c ∈ Z5. For instance:   3 4 2   +   2 3 4   =   (3 + 2) mod 5 (4 + 3) mod 5 (2 + 4) mod 5   =   0 2 1   , 3   1 4 2   =   3 mod 5 12 mod 5 6 mod 5   =   3 2 1   . The zero vector is the additive identity in Z3 5. Can you derive the expression of the additive inverse of some vector u = (u1, u2, u3)T ∈ Z3 5? About the total number of vectors in Z3 5 we see that each entry of a vector in Z3 5 can independently take one of 5 possible values: {0, 1, 2, 3, 4}. Thus, the total number of vectors in this space is 53 = 125. Here is a verification via Sage (to create the finite field Z5 we will use the function GF) # Define the finite field Z_5 F = GF(5) # Define the 3-dimensional vector space over Z_5 V = VectorSpace(F, 3) # Calculate the total number of vectors in the vector space total_vectors = V.cardinality() # Output the result total_vectors 2.C.8. Here is a block that we can use to answer the task: V=QQ^3 e1=vector([1, 0, 0]) e2=vector([0, 1, 0]) e3=vector([0, 0, 1]) Vxy=V.span([e1, e2]); print(Vxy) Vxz=V.span([e1, e3]); print(Vxz) Vyz=V.span([e2, e3]); print(Vyz) Let us also present what Sage prints out: Vector space of degree 3 and dimension 2 over Rational Field Basis matrix: [1 0 0] [0 1 0] Vector space of degree 3 and dimension 2 over Rational Field Basis matrix: [1 0 0] [0 0 1] Vector space of degree 3 and dimension 2 over Rational Field Basis matrix: [0 1 0] [0 0 1] 173 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA 2.C.9. For a matrix A of size m × n the column space is defined by C(A) = {Ax : x ∈ Rn }. Hence, obviously, C(A) is a subset of Fm . Note that C(A) contains the zero vector, since A0 = 0. To prove that is a linear subspace of Fm we should prove that C(A) is closed under vector addition and scalar multiplication. Let u, v ∈ C(A). 
Then, there exist vectors x, y ∈ Fn such that u = Ax and v = Ay, respectively. Thus, considering the linear combination αu + βv of u, v for some scalars α, β ∈ F, we see that αu + βv = α(Ax) + β(Ay) = A(αx + βy). This shows that αu + βv ∈ C(A), and hence C(A) is a subspace of Fm . Recall that the rank of A equals the number of (nonzero) pivots in elimination, hence it coincides with the maximum number of linearly independent columns (or linearly independent rows). The relation now between the dimension of C(A) and the rank of A follows. A similar analysis applies for the row space and the left null space, which we leave for practice. 2.C.11. The given vectors are linearly independent: The equation a · v1 + b · v2 = 0 for some a, b ∈ R, can be equivalently written in matrix form as (a, 0, a, 2a) T + (0, b, 0, 0)T = (0, 0, 0, 0)T , from where we immediately get a = 0 = b. We can verify the independence of v1, v2 in Sage as in 2.C.10: V=QQ^4;v1=vector(QQ, [1, 0, 1, 2]); v2=vector(QQ, [0, 1, 0, 0]) V.linear_dependence([v1, v2]) ==[] It follows that the linear span W = spanQ{v1, v2} is a 2-dimensional linear subspace of Q4 . As we said an alternative is given by using the command span. For example, adding in the previous cell the syntax W=V.span([v1, v2]); W} verifies that W is 2-dimensional subspace of Q4 . In Sage we can directly find a basis of a vector space V by the command W.basis(). Hence for example we can type V=QQ^4 v1=vector(QQ, [1, 0, 1, 2]) v2=vector(QQ, [0, 1, 0, 0]) W=V.subspace([v1, v2]) B=W.basis();show(B) By this block Sage will directly print the basis vectors v1 and v2. Other techniques to create subspaces rely on the creation of row or column spaces of matrices, a situation that we will encounter a bit later. 2.C.12. (a) It is easy to see the vector equation av1 + bv2 = 0 for some scalars a, b, implies that a = b = 0. Thus, the vectors v1, v2 are linearly independent and automatically form a basis of W = spanQ{v1, v2}. Therefore, W is 2-dimensional, see also the discussion in 2.3.7-2.3.10 for the concept of dimension of a vector space. A general element of W has the form w = c1v1 + c2v2 = (2c1 + 4c2, −c1, 3c1 + c2)T with c1, c2 ∈ Q. There are many ways to program Sage to verify the linear independence of v1, v2. Here, we present a method that uses the rank of the matrix A, which has v1, v2 as its columns. For a built-in method in Sage to check linear independence we refer to 2.C.10. A = matrix(QQ, [[2, 4], [-1, 0], [3, 1]]) rank_of_A = A.rank() print(f"The rank of matrix A is: {rank_of_A}") if rank_of_A == 2: print("The vectors are linearly independent.") else: print("The vectors are linearly dependent.") Sage’s output has the form The rank of matrix A is: 2 The vectors are linearly independent. To introduce W in Sage we can type V = VectorSpace(QQ, 3); v1 = V([2, -1, 3]); v2 = V([4, 0, 1]) W = V.subspace([v1, v2]) print("\nSubspace W spanned by v1 and v2:") print(W) An alternative to create W is to use the span command, as in 2.C.8: W.span([v1, v2]); print(W) For both cases Sage’s output has the form 174 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA Subspace W spanned by v1 and v2: Vector space of degree 3 and dimension 2 over Rational Field Basis matrix: [ 1 0 1/4] [ 0 1 -5/2] We will analyze the terms used in this answer in more detail soon. For now, keep only that W is a 2-dimensional subspace of Q3 , over the rationals. To illustrate W we can leverage Sage’s 3D-graphics functions, because we are working within a three-dimensional space. 
We want to generate a figure displaying v1 in red, v2 in green and W as the shaded area. With this goal in mind one can execute the following block: # Define the vectors v1 = vector(QQ, [2, -1, 3]) v2 = vector(QQ, [4, 0, 1]) # Create a meshgrid for the plane using linear combinations of v1 and v2 u, v = var("u v") plane = u*v1 + v*v2 # Define the plot range for u and v plot_range = (-1, 1) # Plot the plane plane_plot = parametric_plot3d(plane, (u, *plot_range), (v, *plot_range), color="lightblue", opacity=0.5) # Plot the vectors v1 and v2 starting from the origin v1_plot = arrow3d((0, 0, 0), v1, color="red", thickness=0.05) v2_plot = arrow3d((0, 0, 0), v2, color="green", thickness=0.05) # Combine the plots show(plane_plot + v1_plot + v2_plot, figsize=8) This produces the following illustration of W: We will meet further applications of the Sage function parametric_plot3d in Chapter 4. Finally, if v2 is a multiple of v1, then W will be reduced to a line (dimension 1). (b) This task is a bit more challenging compared to part (a). By assumption, W = spanQ{u1 = (1, 1, 0, 0)T , u2 = (0, 1, 1, 0)T } = {au1 + bu2 : a, b ∈ Q} = { (a, a + b, b, 0)T : a, b ∈ Q } ⊂ Q4 . 175 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA It is easy to see that u1, u2 are linearly independent, hence dimQ W = 2, with a basis given by these vectors. For instance, the rank of the matrix A =     1 0 1 1 0 1 0 0     formed by u1, u2 is 2. This is because by row reducing A we get     1 0 1 1 0 1 0 0     →     1 0 0 1 0 1 0 0     →     1 0 0 1 0 0 0 0     , and the final matrix has 2 pivot columns. Thus rank(A) = 2 and u1, u2 are linearly independent. The quotient space V/W is the set of equivalence classes of vectors in V , under the equivalence relation ∼ defined by the subspace W ⊂ V , i.e., v1 ∼ v2 if and only if v1 − v2 ∈ W, where v1, v2 ∈ V . Thus, we can view V/W as V/W = {[u] : u ∈ V }, where the equivalence class of u ∈ V is denoted by [u] = {u + w : w ∈ W}, see 3.4.13 in Chapter 3. The quotient space V/W inherits a vector space structure from V , with operations defined by [u] + [v] = [u + v] and a[u] = [au], for any two vectors u, v and scalar a, try to verify this claim yourselves. It is also easy to prove that the dimension of V/W is given by dim(V/W) = dim(V ) − dim(W). Four our case, V = Q4 and dimQ W = 2, thus dimQ V/W = 4 − 2 = 2. Since W is a 2-dimensional plane in Q4 , geometrically the quotient space V/W represents the remaining “directions”, after factoring out the influence of W. To introduce the quotient space V/W in Sage we can use the command V.quotient(W), as follows: V = VectorSpace(QQ, 4) u1 = vector(QQ, [1, 1, 0, 0]) u2 = vector(QQ, [0, 1, 1, 0]) W = V.subspace([u1, u2]) print(V.quotient(W)) Let us present Sage’s output: Vector space quotient V/W of dimension 2 over Rational Field where V: Vector space of dimension 4 over Rational Field W: Vector space of degree 4 and dimension 2 over Rational Field Basis matrix: [ 1 0 -1 0] [ 0 1 1 0] Notice in this answer the basis matrix refers to W. For a basis of V/W see 2.C.13. 2.C.13. The quotient space V/W is a 2-dimensional vector space over Q, as shown in 2.C.12. Thus, a basis of V/W will consists of two vectors. These vectors can be chosen to extend the basis of W to a basis of V = Q4 . Fo example, consider the vector e3 = (0, 0, 1, 0)T of the standard basis of Q4 , and suppose that e3 = au1 + bu2 for ome a, b ∈ Q. 
Then we get the following system of equations {a = 0 , a + b = 0 , b = 1}, which does not admit a consistent solution for a, b ∈ Q. This means that e3 /∈ W. Similarly we can show that e4 = (0, 0, 0, 1)T /∈ W. Thus a basis of V/W is given by u = {e3, e4}. To verify with Sage that u forms a basis of V/W we will use the vector in subspace command, which provides a convenient way to check if a given vector is an element of a specified subspace. Our program has the following form: # Define the vectors u1 = vector(QQ, [1, 1, 0, 0]) u2 = vector(QQ, [0, 1, 1, 0]) e3 = vector(QQ, [0, 0, 1, 0]) e4 = vector(QQ, [0, 0, 0, 1]) # Define the space W spanned by u1 and u2 V = VectorSpace(QQ, 4) W = V.span([u1, u2]) # Check if e3 and e4 are in the span of W def is_in_subspace(vector, subspace): return vector in subspace is_e3_in_W = is_in_subspace(e3, W) is_e4_in_W = is_in_subspace(e4, W) # Check linear independence of e3 and e4 176 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA # Construct the matrix with e3 and e4 as rows matrix_e3_e4 = Matrix(QQ, [e3, e4]) is_independent = matrix_e3_e4.rank() == 2 # Construct the span of e3 and e4 span_V = V.span([e3, e4]) span_check = span_V.dimension() == 2 # Display results is_e3_in_W, is_e4_in_W, is_independent, span_check Confirm yourselves that Sage’s output appears as (False, False, True, True), verifying that e3, e4 do not lie on W and that u is indeed a basis of V/W. Try exploring additional examples by presenting an alternative basis. 2.C.14. (a) Taking a = b = 0 we see that U contains the zero vector (0, 0, 0)T . Thus, to show that U is a subspace of Z3 5, we need to prove closure under addition and scalar multiplication. Let u = (a1, 2b1, 3a1 + 4b1)T and v = (a2, 2b2, 3a2 + 4b2)T be arbitrary vectors of U, where ai, bi ∈ Z5 for all i = 1, 2. Then, we see that u + v = ( a1 + a2, 2(b1 + b2), 3(a1 + a2) + 4(b1 + b2) )T ∈ U , cu = ( ca1, 2cb1, c(3a1 + 4b1) ) ∈ U , with c ∈ Z5. Therefore, U is a vector subspace of Z3 5. (b) Notice that   a 2b 3a + 4b   = a   1 0 3   + b   0 2 4   = au1 + bu2 , where u1 =   1 0 3   , u2 =   0 2 4   . Thus the vectors u1, u2 span U, i.e., U = spanZ5 {u1, u2}, and to obtain a basis it remains to prove linear independence. Let A =   1 0 0 2 3 4   be the matrix having as columns the vectors u1, u2. To find the RREF of A we mention that in Z5 the inverse of 2 equals 3 and the inverse of 4 equals 4 itself. For example, for the case of 2 one needs to find an integer y such that 2 · y = 1 mod 5. Since 6 = 1 mod 5 our claim follows. Similarly for 4, we have 4 · 4 = 16 = 1 mod 5. Performing elementary row operations with respect to the arithmetic in Z5 (see the tables in 2.C.6), we obtain the following reduced row echelon form:   1 0 0 1 0 0  . For a confirmation of this matrix in Sage, add the cell F = GF(5) A=matrix(F, [[1, 0], [0, 2], [3, 4]]); A.rref() It is now immediate that rank(A) = 2. Hence the vectors u1, u2 are linearly independent. and so they form a basis of U. This implies that dimZ5 U = 2. To confirm the linear independence of u1, u2 in Sage, we use the following cell: F = GF(5); V = VectorSpace(F, 3) u1=vector(F, [1, 0, 3]); u2=vector(F, [0, 2, 4]) V.linear_dependence([u1, u2]) ==[] By adding U=span([u1, u2]); U.dimension() we also get a direct verification of the dimension of U. 2.C.16. Considered vectors are linearly independent if and only if the vectors (1, −2, 0, 0)T , (3, 0, 1, −1)T , (1, −4, 1, 2)T , and (0, 4, 8, 4)T are linearly independent in R4 . 
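In Sage, the determinant test carried out below is a one-liner:
M = matrix(QQ, [[1, -2, 0, 0], [3, 0, 1, -1], [1, -4, 1, 2], [0, 4, 8, 4]])
print(M.det())  # -36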
We see that 1 −2 0 0 3 0 1 −1 1 −4 1 2 0 4 8 4 = −36 ̸= 0 , thus the vectors are linearly independent. 2.C.19. In this task, first check that the given vectors v1, v2 are linearly independent. Next, if A is the matrix having as columns these two vectors, construct the extended matrix ( A I4 ) , where I4 is the 4 × 4 identity matrix. Then compute its reduced row echelon form. From this matrix, the columns with the pivots will form the desired basis. All these in Sage can be done as follows: 177 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA V=QQ^4; v1=vector(QQ, [2, 0, 1, 0]); v2=vector(QQ, [1, 0, 1, 0]); L=[v1,v2] V.linear_dependence(L)==[] I4=identity_matrix(4); A=column_matrix(L); A1=A.augment(I4) M=A1.rref() [A1.column(p) for p in A1.pivots( )] Sage’s output is here: [(2, 0, 1, 0), (1, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 1)] Thus the vectors {v1, v2, v3 = (0, 1, 0, 0)T , v4 = (0, 0, 0, 1)T form the desired basis of Q4 . The linear independence of these vectors can be confirmed by adding in the previous cell the next few lines of code: v3=vector(QQ, [0, 1, 0, 0]); v4=vector(QQ, [0, 0, 0, 1]); Lext=[v1,v2,v3,v4] V.linear_dependence(Lext)==[] 2.C.23. (i) This is the projection to the yz-plane and a short computation ensures its linearity. (ii) We see that F((0, 0, 0)) = (0, −3, 0) ̸= (0, 0, 0). However, any linear mapping maps the zero vector of its domain to the zero vector of the target space. Hence, this F is not linear. (iii) By assumption a ̸= 0. Hence, for two vector u = (x1, y1) and v = (x2, y2) lying on R2 , we see that F(u + v) = (x1 + x2 + a, y1 + y2) ̸= F(u) + F(v) = (x1 + a, y1) + (x2 + a, y2) = (x1 + x2 + 2a, y1 + y2) . This shows that F is not linear. (iv) When a = 0 this is the identity map, which is obviously a linear mapping. For a ̸= 0 the mapping F(x) = x + a is an affine mapping. This is because the mapping G : R → R defined by G(x) := F(x) − F(0) is linear, for all x ∈ R. (v) The mapping F((x, y)) = (x2 , y) is not linear. This is because F(c (x, y)) = F((c x, c y)) = (c2 x2 , c y) and c F((x, y)) = (c x2 , c y). Hence in general we have F(c (x, y)) ̸= c F((x, y)), except if c = 0 or c = 1. Since the relation F(c (x, y)) = c F((x, y)) must hold for any c ∈ R and u = (x, y) ∈ R2 we conclude. (vi) The mapping F : R2 → R1[t] with F((x, y)) = y + x t is linear. Indeed, consider two arbitrary vectors u = (x1, y1) and v = (x2, y2) on R2 . Then for any two scalars a, b ∈ R we have F(a u + b v) = F((a x1, a y1) + (b x2, b y2)) = F((a x1 + b x2, a y1 + b y2)) = (a y1 + b y2) + (a x1 + b x2) t = a (y1 + x1 t) + b (y2 + x2 t) = a F((x1, y1)) + b F((x2, y2)) . (vii) Fix some B ∈ Matn(R). The mapping F(A) = AB − BA is also linear, and hence an endomorphism of Matn(R). In order to prove its linearity, consider some λ, µ ∈ R and A1, A2 ∈ Matn(R). Then we have F(λ A1 + µ A2) = (λ A1 + µ A2)B − B(λ A1 + µ A2) = λ A1B + µ A2B − λ BA1 − µ BA2 = λ (A1B − BA1) + µ (A2B − BA2) = λ F(A1) + µ F(A2) . (viii) The mapping F(A) = AT is another linear endomorphism of Matn(R). Perform the brief computation yourselves. (ix) Let u = (x1, y1) and v = (x2, y2) two vectors on Z2 3 = Z3 × Z3. Then we have F(u + v) = F((x1 + x2, y1 + y2)) =   (x1 + x2) + (y1 + y2) 2(x1 + x2) + (y1 + y2) x1 + x2   =   x1 + y1 2x1 + y1 x1   +   x2 + y2 2x2 + y2 x2   = F(u) + F(v) , and F(c u) = F((c x1, c y1)) =   c x1 + c y1 2c x1 + c y1 c x1   = c   x1 + y1 2x1 + y1 x1   = c F(u) , for all c ∈ Z3. Thus, F is linear. 
(x) The exponential map is, of course, not linear because, as we will see in Chapter 5, $e^{t_1+t_2} = e^{t_1}\cdot e^{t_2} \neq e^{t_1} + e^{t_2}$.

2.C.26. Let $e = \{e_1 = (1, 0, 0)^T, e_2 = (0, 1, 0)^T, e_3 = (0, 0, 1)^T\}$ be the standard basis of R³. We see that
$$T(e_1) = (1, 1, 1)^T = e_1 + e_2 + e_3,\quad T(e_2) = (1, 1, 1)^T = e_1 + e_2 + e_3,\quad T(e_3) = (1, 2, 3)^T = e_1 + 2e_2 + 3e_3.$$
The columns of the matrix A corresponding to T are given by the images of the basis vectors. Thus we get
$$A = \begin{pmatrix} 1&1&1\\ 1&1&2\\ 1&1&3 \end{pmatrix}.$$
To work with this example in Sage, use the following code block:
# Define the variables and vector space
x, y, z = var("x y z")
V = W = RR^3
# Define the linear transformation
f(x, y, z) = [x + y + z, x + y + 2*z, x + y + 3*z]
T = linear_transformation(V, W, f)
# Get the matrix representation of the transformation
A = T.matrix(side="right")
# Display the matrix
show(A)
Verify that this confirms the matrix A obtained above.

2.C.29. The matrix of T with respect to the standard basis is given by
$$A = \begin{pmatrix} 1&0&i\\ 0&1&1\\ i&1&0 \end{pmatrix}.$$
The endomorphism T is invertible if and only if A is invertible. However, A is not of full rank; in particular, rank(A) = 2 and hence A is not invertible. Alternatively, we may compute the determinant of A, which is zero, det(A) = 0. Hence A is singular and T⁻¹ does not exist. A verification in Sage is given in the usual way:
v1 = vector([1, 0, i]); v2 = vector([0, 1, 1]); v3 = vector([i, 1, 0])
A = column_matrix([v1, v2, v3])
print(rank(A)); print(det(A))

2.C.30. (a) We begin with the conjugation. This maps 1 → 1 and i → −i, written in the coordinates as (1, 0) → (1, 0) and (0, 1) → (0, −1), respectively. By writing the images into the columns we obtain the matrix $\begin{pmatrix} 1&0\\ 0&-1 \end{pmatrix}$. For the second map, using the basis u = {1, i} we obtain 1 → 2 + i, i → 2i − 1, that is, (1, 0) → (2, 1), (0, 1) → (−1, 2). Thus, the matrix of multiplication by the number 2 + i under the basis u has the form $\begin{pmatrix} 2&-1\\ 1&2 \end{pmatrix}$.
(b) We will only determine the matrix of the second linear map in the basis f, and leave the first case for practice. Multiplication by (2 + i) gives us: (1 − i) → (1 − i)(2 + i) = 3 − i, (1 + i) → (1 + 3i). The coordinates (a, b)_f of the vector 3 − i in the basis f are given, as we know, by the equation a·(1 − i) + b·(1 + i) = 3 − i, that is, (3 − i)_f = (2, 1). Analogously (1 + 3i)_f = (−1, 2). Altogether, we obtain the matrix $\begin{pmatrix} 2&-1\\ 1&2 \end{pmatrix}$. Observe that the matrices for multiplication by 2 + i are the same in both bases. Why does this happen? Would the matrices be identical in both bases for multiplication by any complex number?¹⁶

¹⁶ This similarity arises because multiplication by a complex number c corresponds to a linear transformation that is invariant under the basis changes considered here. Essentially, c acts as a scalar multiplication on the vector space, and scalar multiplication is insensitive to the choice of a basis.

2.C.31. To simplify the explanation, we will prove the statement for a low-dimensional case, as the idea extends easily to the general case. For example, let us consider proving that the space Mat₂,₃(K) of 2 × 3 matrices over K is isomorphic to $K^{2\cdot 3} = K^6$. Any element of Mat₂,₃(K) has the form
$$\begin{pmatrix} a&b&c\\ d&e&f \end{pmatrix},\qquad \text{for some } a, b, c, d, e, f \in K.$$
From this expression we see that the matrices
$$E_1 = \begin{pmatrix} 1&0&0\\ 0&0&0 \end{pmatrix},\ E_2 = \begin{pmatrix} 0&1&0\\ 0&0&0 \end{pmatrix},\ E_3 = \begin{pmatrix} 0&0&1\\ 0&0&0 \end{pmatrix},\ E_4 = \begin{pmatrix} 0&0&0\\ 1&0&0 \end{pmatrix},\ E_5 = \begin{pmatrix} 0&0&0\\ 0&1&0 \end{pmatrix},\ E_6 = \begin{pmatrix} 0&0&0\\ 0&0&1 \end{pmatrix}$$
generate Mat₂,₃(K). It is also easy to see that these matrices are linearly independent, and hence they form a basis of Mat₂,₃(K). This implies that dim_K Mat₂,₃(K) = 6. For a verification, use Sage and its command MatrixSpace, as follows:
MS = MatrixSpace(SR, 2, 3)
MS.dimension()
Sage prints out the number 6, which is the dimension of Mat₂,₃(K) over K. Adding the following command, one can also obtain the basis mentioned above:
B = MS.basis(); list(B)
Consider finally the mapping φ defined by
$$\varphi: \mathrm{Mat}_{2,3}(K) \to K^6,\qquad \begin{pmatrix} a&b&c\\ d&e&f \end{pmatrix} \mapsto (a, b, c, d, e, f)^T.$$
This is linear and maps the basis E₁, ..., E₆ to the standard basis e₁ = (1, 0, ..., 0)^T, ..., e₆ = (0, ..., 0, 1)^T of K⁶. Hence φ is a linear isomorphism (you should be able to prove that a linear mapping having this property is always an isomorphism).

2.C.32. An isomorphism f: Rₙ[x] → R^{n+1} is given by $a_0 + a_1x + a_2x^2 + \cdots + a_nx^n \mapsto (a_0, a_1, \ldots, a_n)$. We leave it to the reader to prove that f is a well-defined linear bijection. Can you specify the inverse f⁻¹: R^{n+1} → Rₙ[x]?

2.C.33. To prove the claim it is sufficient to show that U = span_R{u1, u2, u3} is a 3-dimensional subspace of R⁴, and hence that the vectors u1, u2, u3 are linearly independent. Indeed, for some a, b, c ∈ R the matrix equation au1 + bu2 + cu3 = 0 is equivalent to the system {a + c = 0, 2a + b = 0, b = 0}, which has the unique solution a = b = c = 0. Or, in Sage we can type
u1 = vector([1, 2, 0, 0]); u2 = vector([0, 1, 0, 1]); u3 = vector([1, 0, 0, 0]); L = [u1, u2, u3]
V = RR**4; V.linear_dependence(L) == []
Let e = {e1, e2, e3} be the standard basis of R³. Check yourselves that a linear isomorphism f: R³ → U is defined by f(e1) = u1, f(e2) = u2, f(e3) = u3.

2.C.34. A subspace of R³ can be 0-, 1-, 2- or 3-dimensional. Thus, up to linear isomorphisms, such a subspace is isomorphic to (a) the origin of R³ (the single zero vector), or (b) a line passing through the origin of R³, or (c) a plane passing through the origin of R³, or (d) R³ itself.

2.C.36. We can easily see that u is a basis of C³, e.g., via Sage. To find its dual, we need to specify linear forms φᵢ: C³ → C for i = 1, 2, 3, satisfying the relations
φ₁(u1) = 1, φ₁(u2) = 0 = φ₁(u3), φ₂(u1) = 0, φ₂(u2) = 1, φ₂(u3) = 0, φ₃(u1) = 0 = φ₃(u2), φ₃(u3) = 1,
respectively. We may suppose that
$$\varphi_1(z_1, z_2, z_3) = \sum_{i=1}^{3} a_iz_i,\quad \varphi_2(z_1, z_2, z_3) = \sum_{i=1}^{3} b_iz_i,\quad \varphi_3(z_1, z_2, z_3) = \sum_{i=1}^{3} c_iz_i,$$
for some complex numbers aᵢ, bᵢ, cᵢ (i = 1, 2, 3). From the first set of equations we get a₁ = 1, a₂ = 0 and a₃ = i, thus φ₁(z₁, z₂, z₃) = z₁ + iz₃. From the second set of equations we get b₁ = 0, b₂ = −i and b₃ = 1, thus φ₂(z₁, z₂, z₃) = −iz₂ + z₃. Finally, for φ₃ we get c₁ = 0 = c₂ and c₃ = −i, that is, φ₃(z₁, z₂, z₃) = −iz₃. You may try to verify on your own that {φ₁, φ₂, φ₃} is indeed a basis of (C³)*.

2.C.37. (a) Suppose that $u = \sum_{i=1}^n c_iu_i$ for some real numbers cᵢ, i = 1, ..., n. Then, by linearity, we see that
$$\varphi_1(u) = \varphi_1\Big(\sum_i c_iu_i\Big) = \sum_i c_i\varphi_1(u_i) = c_1\varphi_1(u_1) + c_2\varphi_1(u_2) + \cdots + c_n\varphi_1(u_n) = c_1\cdot 1 + c_2\cdot 0 + \cdots + c_n\cdot 0 = c_1,$$
and in an analogous way we can show that φᵢ(u) = cᵢ for all i = 1, ..., n. Based on this remark we obtain the first relation: $u = \sum_{i=1}^n c_iu_i = \sum_{i=1}^n \varphi_i(u)u_i$.
(b) To prove the second relation, let us apply ξ to both sides of the relation in (a). Since the φᵢ(u) are scalars, i.e., φᵢ(u) ∈ K, this gives
$$\xi(u) = \xi\Big(\sum_i \varphi_i(u)u_i\Big) = \sum_i \varphi_i(u)\xi(u_i) = \Big(\sum_i \xi(u_i)\varphi_i\Big)(u).$$
This relation holds for all u ∈ V and hence the result follows, i.e., $\xi = \sum_i \xi(u_i)\varphi_i$.
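As a quick sanity check of the reproduction formula in (a), take V = Q³ with its standard basis, in which case the forms φᵢ are simply the coordinate functions (a toy example of our own; the nontrivial dual basis case was treated in 2.C.36):
u1 = vector(QQ, [1, 0, 0]); u2 = vector(QQ, [0, 1, 0]); u3 = vector(QQ, [0, 0, 1])
u = vector(QQ, [3, -1, 2])
# here phi_i(u) is just the i-th coordinate u[i], so the formula u = sum phi_i(u) u_i reads:
print(sum(u[i]*[u1, u2, u3][i] for i in range(3)) == u)  # True

2.C.38.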
If A = (aij), B = (bij) ∈ Matm(K) and c ∈ K, then we have tr(A + B) = (a11 + b11) + · · · + (amm + bmm) = (a11 +· · ·+amm)+(b11 +· · ·+bmm) = tr(A)+tr(B) and tr(cA) = ca11 +· · ·+camm = c(a11 +· · ·+amm) = c tr(A). Thus tr is a linear transformation, and since the output is a scalar, i.e., tr(A) ∈ K, it is a linear form of Matm(K). Now, for A, B ∈ Matm(K) we have AB = ((ab)ij) and BA = ((ba)ij). Thus, tr(AB) = m∑ i=1 (ab)ii = m∑ i=1 m∑ k=1 aikbki = m∑ k=1 m∑ i=1 bkiaik = m∑ k=1 (ba)kk = tr(BA) . 180 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA As a counterexample of the relation tr(AB) = tr(A) tr(B), consider any non-zero matrix A ∈ Matm(R) and the identity matrix E ∈ Matm(R). Then we see that tr(AE) = tr(A), but tr(E) tr(A) = m tr(A). Hence, except of the trivial case m = 1 where the relation actually holds, this provides an answer. 2.C.39. Consider for example the endomorphism f : R2 → R2 represented by the matrix A = ( 3 1 0 2 ) with respect to the standard basis e of R2 . Obviously, tr(A) = 3 + 2 = 5. Another basis of R2 is given by u = { (1, 1)T , (1, −1)T } (use Sage, as indicated below, to verify that these two vectors are linearly independent). V=RR^2; v1=vector(RR, [1,1]) v2=vector(RR, [1, -1]) V.linear_dependence([v1, v2]) ==[] The matrix that changes from the new basis u to the standard one e is formed by placing the new basis vectors as columns, see also below. Let us denote this matrix by P = ( 1 1 1 −1 ) . The matrix presentation of f in the new basis is given by A′ = P−1 AP = ( 3 0 1 2 ) , see the discussion in the end of 2.3.16. We see that tr(A′ ) = 5 = tr(A), hence the trace is invariant under a change of basis. You may like to confirm this result in Sage (and the expression of A′ ). This can be done by the usual methods, see the cell given here: # Define the matrix A in the standard basis A = Matrix(RR, [[3, 1], [0, 2]]) # Define the change of basis matrix P P = Matrix(RR, [[1, 1], [1, -1]]) # Compute the inverse of P P_inv = P.inverse() # Compute the matrix representation of the endomorphism in the new basis A_prime = P_inv * A * P show(A_prime) # Compute the traces bool(A.trace()==A_prime.trace()) 2.C.40. In 2.C.38 we proved the relation tr(AB) = tr(BA). Hence, for any invertible m × m matrix P over K we obtain tr(P−1 AP) = tr(P−1 (AP)) = tr((AP)P−1 ) = tr(APP−1 ) = tr(AE) = tr(A) . Thus, A and P−1 AP have the same trace and our claim follows. For the determinant observe that det(B) = det(P−1 AP) = det(P−1 ) det(A) det(P) = det(A) det(P) det(P) = det(A) . Here, we utilized the fact that the determinant of the product of matrices equals the product of their determinants, and that for any invertible matrix P, the determinant of its inverse is given by det(P−1 ) = 1/ det(P). Now, in (a) the given matrices A, B are such that tr(A) = tr(B) = 4 but det(A) = 7 ̸= 6 = det(B). Thus, according to our result above, the matrices A, B cannot be similar. For the matrices A, B given in (b), we have tr(A) = tr(B) = 0 and det(A) = det(B) = 0, hence we cannot conclude by applying our criterium. However, let M = ( α β γ δ ) be an invertible 2 × 2 real matrix. Thus we assume that det(M) = αδ − βγ ̸= 0, so that M−1 = 1 αδ−βγ ( δ −β −γ α ) . For simplicity, we may fix α, β so that det(M) = 1. Then we see that M−1 AM = ( δ −β −γ α ) ( 0 1 0 0 ) ( α β γ δ ) = ( 0 δ 0 −γ ) ( α β γ δ ) = ( γδ δ2 −γ2 −γδ ) . Therefore, in this case the matrices A, B are similar. Notice that for γ = δ = 0, we have det(M) = 0, hence we should exclude these values. 
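Before turning to part (c), we can let Sage redo the computation above symbolically (a small sketch of our own; instead of normalizing det(M) = 1 we clear the determinant, using that det(M)·M⁻¹AM equals the matrix displayed above):
var("alpha beta gamma delta")
A = matrix(SR, [[0, 1], [0, 0]])
M = matrix(SR, [[alpha, beta], [gamma, delta]])
lhs = (M.det()*M.inverse()*A*M).simplify_full()
rhs = matrix(SR, [[gamma*delta, delta^2], [-gamma^2, -gamma*delta]])
print((lhs - rhs).simplify_full().is_zero())  # True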
The matrices A, B given in (c) are also similar. Let us verify this in Sage via the command B.is_similar(A), as follows:
A = matrix([[2, 1], [1, 2]])
B = matrix([[3, 0], [0, 1]])
b, P = B.is_similar(A, transformation=True)
b; show(P)
The output here is True, which means that A, B are indeed similar, while show(P) prints the matrix P satisfying P⁻¹AP = B. A verification of this relation can be done by adding the code
bool(P.inverse()*A*P == B)
In fact, later we will explain that the diagonal matrix B in (c) consists of the eigenvalues of A, and so the similarity of A, B is related to the study of the eigenvectors of A, see also below.

2.C.42. To solve in Sage the task given in 2.C.41, we first introduce the vector space V = R³ and confirm the linear independence of the vectors u1, u2 and u3. Next, we use these vectors to form the transition matrix T, which can be done easily via the column_matrix command. Finally, we compute the inverse of T and display the new coordinates of w by applying the rule T⁻¹w. Here is the corresponding code:
V = RR^3; u1 = vector([1, 1, 3])
u2 = vector([1, -1, 1]); u3 = vector([3, 1, 5])
V.linear_dependence([u1, u2, u3]) == []
T = column_matrix([u1, u2, u3])
Tinv = T.inverse()
w = vector([1, 2, 3]); wcor = Tinv*w
show(wcor)

2.C.43. Suppose that u2 consists of the vectors V1, V2, V3. By the given expression of T we have
V1 = E1 + E2 − E3 = (2, 0, 0)^T, V2 = E1 + 2E2 + E3 = (3, 3, 2)^T, V3 = E1 + E2 + 2E3 = (2, 3, 3)^T.
Thus u2 = {(2, 0, 0)^T, (3, 3, 2)^T, (2, 3, 3)^T}. As a verification in Sage, you may type
V = RR**3; V1 = vector(RR, [2, 0, 0]); V2 = vector(RR, [3, 3, 2]); V3 = vector(RR, [2, 3, 3])
V.linear_dependence([V1, V2, V3]) == []

2.C.44. Again the transition matrix T for changing the basis from the basis f = {f1, f2, f3} to the standard basis e can be obtained by writing the coordinates of the vectors f1, f2, f3 in the standard basis as the columns of the matrix T. On the other hand, the transition matrix for changing the basis from e to f is the inverse of T. We compute
$$T = \begin{pmatrix} 1&-1&2\\ 1&1&0\\ 0&1&1 \end{pmatrix},\qquad T^{-1} = \begin{pmatrix} 1/4&3/4&-1/2\\ -1/4&1/4&1/2\\ 1/4&-1/4&1/2 \end{pmatrix}.$$
Now we can derive the matrix of the mapping in the basis f, which is given by (see also 2.2.11)
$$T^{-1}AT = \begin{pmatrix} -1/4&2&-3/4\\ 5/4&0&7/4\\ 3/4&-2&9/4 \end{pmatrix}.$$
A method in Sage that can be used to derive the matrix of F: R³ → R³ using built-in functions is described in 2.E.46.

2.C.46. (i) As always, the expressions of the given vectors u1, u2 are with respect to the standard basis of R³. To compute their (standard) scalar product we have
$$u_1 \cdot u_2 = \langle u_1, u_2\rangle = u_2^Tu_1 = \begin{pmatrix} -1 & 1 & -\sqrt{2} \end{pmatrix}\begin{pmatrix} 1\\ 3\\ \sqrt{2} \end{pmatrix} = -1 + 3 - 2 = 0,$$
and the same results from the computation $u_2 \cdot u_1 = \langle u_2, u_1\rangle = u_1^Tu_2 = -1 + 3 - 2 = 0$. This verifies that ⟨ , ⟩ is symmetric. Having u1 · u2 = 0 means that the vectors u1, u2 are orthogonal to each other, u1 ⊥ u2. Moreover, $\|u_1\| = \sqrt{u_1\cdot u_1} = \sqrt{1 + 9 + 2} = \sqrt{12} = 2\sqrt{3}$ and $\|u_2\| = \sqrt{u_2\cdot u_2} = \sqrt{1 + 1 + 2} = \sqrt{4} = 2$, respectively. A verification of these computations in Sage relies on the block
u1 = vector([1, 3, sqrt(2)])
u2 = vector([-1, 1, -sqrt(2)])
sp1 = u1.dot_product(u2); sp2 = u2.dot_product(u1); bool(sp1 == sp2)
print(sp1); print(sp2)
print(norm(u1)); print(norm(u2))
(ii) Let us present the solution in Sage:
v1 = vector([0, 1, 2])
v2 = vector([-1, 2, 3])
spr1 = v1.dot_product(v2)
print(spr1); print(norm(v1))
print(norm(v2))
In order to compute the angle between v1, v2, add the code
theangle = spr1/(norm(v1) * norm(v2))
arccos(theangle).n()
Notice the answer printed out by Sage is in radians.

2.C.51. It is straightforward to verify the orthonormal basis obtained in 2.C.48, using the function proj(v, u) developed there. The corresponding code block is as follows:
E1 = vector([1, 1, 1, 1])
E2 = vector([1, 1, 1, 0])
E3 = vector([1, 1, 0, 0])
def proj(v, u):
    p = v.dot_product(u)/(norm(u)^2)*u
    return p
w1 = E1; w2 = E2 - proj(E2, w1)
w3 = E3 - proj(E3, w2) - proj(E3, w1)
g = [w1, w2, w3]; g
W1 = w1/norm(w1); W2 = w2/norm(w2)
W3 = w3/norm(w3); G = [W1, W2, W3]; G
Verify on your own that running this block, Sage will print the basis {w1, w2, w3}, denoted by g, and the orthonormal basis {ŵ1, ŵ2, ŵ3}, denoted by G.

2.C.52. The set of solutions of a homogeneous linear equation is always a vector space. For our case, a computation shows that a basis of this vector space consists of the vectors u1 = (−1, 1, 0, 0)^T, u2 = (−1, 0, 1, 0)^T, u3 = (−1, 0, 0, 1)^T. Let us now denote by {v1, v2, v3} the orthogonal basis obtained using the Gram–Schmidt orthogonalization process. We have v1 = u1, and for the remaining two vectors we respectively compute
$$v_2 = u_2 - \frac{u_2^T\cdot v_1}{\|v_1\|^2}v_1 = u_2 - \frac{1}{2}v_1 = \Big(-\frac{1}{2}, -\frac{1}{2}, 1, 0\Big)^T,$$
$$v_3 = u_3 - \frac{u_3^T\cdot v_1}{\|v_1\|^2}v_1 - \frac{u_3^T\cdot v_2}{\|v_2\|^2}v_2 = u_3 - \frac{1}{2}v_1 - \frac{1}{3}v_2 = \Big(-\frac{1}{3}, -\frac{1}{3}, -\frac{1}{3}, 1\Big)^T.$$

2.C.53. The kernel Ker(φ) of φ is a linear subspace of R⁴. To specify Ker(φ) we need to solve the matrix equation Ax = 0 for an arbitrary vector x = (x1, x2, x3, x4)^T ∈ R⁴. Let us use Sage to quickly perform this task:
A = matrix(SR, 4, 4, [[1/2, 2, -1/2, -1/2], [1, 1/2, 1, 3/2], [2, 9/2, 0, 1/2], [2, 2, 0, 0]])
A.right_kernel()
Sage's output has the form
Vector space of degree 4 and dimension 1 over Symbolic Ring
Basis matrix:
[ 1 -1 -8  5]
This means that Ker(φ) is a 1-dimensional subspace of R⁴, spanned by the vector E1 = (1, −1, −8, 5)^T ∈ R⁴. For the norm of this vector we compute ∥E1∥ = √91, hence an orthonormal basis of Ker(φ) consists of the vector (1/√91)E1.

2.C.54. (i) Let x = (x1, x2, x3, x4)^T be a vector orthogonal to u, with respect to the dot product on R⁴. Then x · u = x1 + x2 + x3 + x4 = 0. Thus, if U is the subspace of vectors orthogonal to u, then
U = {(x1, x2, x3, x4)^T ∈ R⁴ : x4 = −(x1 + x2 + x3)} = {(x1, x2, x3, −(x1 + x2 + x3))^T : x1, x2, x3 ∈ R}.
It follows that U is 3-dimensional. In particular, vectors of U can be expressed as
$$(x_1, x_2, x_3, -(x_1 + x_2 + x_3))^T = x_1E_1 + x_2E_2 + x_3E_3,$$
where E1, E2, E3 are the vectors defined by E1 = (1, 0, 0, −1)^T, E2 = (0, 1, 0, −1)^T, and E3 = (0, 0, 1, −1)^T, respectively. We easily check that these are linearly independent, and hence they provide a basis of U.
(ii) For the second task we get two conditions from orthogonality, namely:
x · w = x1 − x4 = 0, x · z = −x1 + x3 = 0.
This means that the vector x = (x1, x2, x3, x4)^T is orthogonal to both the vectors w and z if and only if x1 = x4 and x1 = x3. Therefore, if V denotes the subspace of vectors orthogonal to w and z, then we see that V = {(x1, x2, x1, x1)^T : x1, x2 ∈ R}.
It turns out that V is 2-dimensional, with a basis given by the vectors v1 = (1, 0, 1, 1)T and v2 = (0, 1, 0, 0)T . 2.C.56. We will prove that C(A) is orthogonal to the cokernel Ker(AT ) and leave the second case for practice. Recall that for a m × n matrix A, both C(A) and Ker(AT ) are subspaces of Rm given by C(A) = {Ax : x ∈ Rn } and Ker(AT ) = {y ∈ Rm : AT y = 0}, respectively. Hence, for u ∈ C(A) and w ∈ Ker(AT ) we see that ⟨u, w⟩ = wT u = wT Ax = (AT w)T x = 0x = 0 , where the first equality is the definition of the dot product, the second occurs by replacing u ∈ C(A) by Ax and the third relies on the relation (AB)T = BT AT . Thus wT u = 0 and C(A) ⊥ Ker(AT ) with respect to the dot product on Rm . Recall now from 2.C.20 that for the matrix A =   2 4 1 3 0 5   the column space C(A) is spanned by the vectors u1 = (2, 1, 0)T and u2 = (4, 3, 5)T , while the left null space Ker(AT ) is spanned by the vector u3 = (5, −10, 2)T . Hence, we see that ⟨u1, u3⟩ = uT 3 u1 = ( 5 −10 2 )   2 1 0   = 10 − 10 = 0 , ⟨u2, u3⟩ = uT 3 u2 = ( 5 −10 2 )   4 3 5   = 20 − 30 + 10 = 0 . Similarly, we saw that the row space C(AT ) is the subspace span{(2, 4)T , (1, 3)T }, while Ker(A) is trivial, Ker(A) = {0}. Hence it is also trivial to prove that C(AT ) ⊥ Ker(A). 2.D.5. Observe that this matrix differs from the one in 2.D.1 by just a single sign. However, in this case the characteristic polynomial of A has the form λ3 −6λ2 +12λ−8, or in other words χA(λ) = (λ−2)3 . Therefore, the unique eigenvalue of A is given by λ = 2, with algebraic multiplicity three. The geometric multiplicity of λ is either one, two or three. Let us determine the eigenvectors associated to λ. They are the solutions of the matrix equation (A − 2E)x = 0, and a direct computation shows that the eigenspace V2 has the form V2 = spanR{(1, −1, 0)T , (0, 0, 1)T }. Thus, the geometric multiplicity of the unique eigenvalue λ of A equals to two. 2.D.7. Let us prove the statement for n = 3 and when A is upper triangular, and similarly is treated the more general case. Thus we assume that A has the form A =   a11 a12 a13 0 a22 a23 0 0 a33   , with aij ∈ K for all i, j. Then it is easy to see that the characteristic polynomial has the form χA(λ) = (a11 − λ)(a22 − λ)(a33 − λ), hence the roots of χA(λ) are the diagonal entries a11, a22, a33. The result now follows. 2.D.8. (a) The first claim is based on the fact that det(A) is the product of eigenvalues of A. On the other hand, we know that A is invertible if and only if det(A) ̸= 0. Combining these two statements we obtain that A is invertible if and only if 0 is not an eigenvalue of A. (b) The solution arises as an application of part (a). You may like to write down the matrix An for small values of n and then generalize (by induction over n). For instance, since A1 = 1 , A2 = ( 1 1 1 1 ) , A3 =   1 0 1 1 1 0 0 1 1   , A4 =     1 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1     , we see that det(A2) = det(A4) = 0 but det(A1) ̸= 0 and det(A3) ̸= 0. Thus, an option is to compute the determinant of An and show that det(An) ̸= 0 if and only if n is odd. An alternative method is suggested by the statement, and is based on the characteristic polynomial: 184 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA χAn (λ) = det(An − λ E) = 1 − λ 0 0 0 · · · 0 0 1 1 1 − λ 0 0 · · · 0 0 0 0 1 1 − λ 0 · · · 0 0 0 0 0 1 1 − λ · · · 0 0 0 ... ... ... ... ... ... ... ... 0 0 0 0 · · · 1 1 − λ 0 0 0 0 0 · · · 0 1 1 − λ . 
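In Sage we can compute the first few of these characteristic polynomials and confirm the pattern (a quick sketch of our own; Aₙ has 1 on the diagonal, 1 just below it, and 1 in the top-right corner):
def A_n(n):
    return matrix(QQ, n, n, lambda i, j: 1 if i == j or i == j + 1 or (i == 0 and j == n - 1) else 0)
for n in (2, 3, 4, 5):
    print(n, A_n(n).charpoly("t").factor())  # charpoly is (t - 1)^n - 1, matching the formula below up to sign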
By expanding the determinant with respect to the first column, for example, we see that $\chi_{A_n}(\lambda) = (1 - \lambda)^n + (-1)^{n-1}$. Based on this expression we deduce that Aₙ is invertible if and only if n is odd (we are looking for those n such that χ_{Aₙ}(λ) does not admit λ = 0 as a root).

2.D.10. The given matrix A1 is diagonalizable, since it has two eigenvalues, λ1 = 1 with algebraic and geometric multiplicity 1, and λ2 = 2 with algebraic and geometric multiplicity 2. The eigenspace V1 is generated by the eigenvector (0, 1, −1/2)^T (or any non-zero multiple of this vector), and V2 is generated by the eigenvectors (1, 0, −2)^T and (0, 1, 0)^T (or any non-zero multiples of them). Thus, for example, the matrices D1 and P1 presented below satisfy A1 = P1D1P1⁻¹:
$$D_1 = \mathrm{diag}(1, 2, 2),\qquad P_1 = \begin{pmatrix} 0&1&0\\ 2&0&1\\ -1&-2&0 \end{pmatrix}.$$
To verify all these results via Sage, execute the following block:
A1 = matrix(SR, [[2, 0, 0], [4, 2, 2], [-2, 0, 1]])
p(t) = A1.characteristic_polynomial(t)
show(p(t).factor())
show(p.roots())
show(A1.eigenvectors_right())
u1 = vector([0, 2, -1])
u2 = vector([1, 0, -2])
u3 = vector([0, 1, 0])
P1 = column_matrix([u1, u2, u3])
D1 = diagonal_matrix([1, 2, 2])
bool(A1 == P1*D1*P1.inverse())
For the matrix A2, which differs from A1 in only one sign, the result is negative; in particular, A2 is not diagonalizable. It has the same eigenvalues as A1, i.e., λ1 = 1 and λ2 = 2, with algebraic multiplicity 1 and 2, respectively. However, in this case λ2 has geometric multiplicity 1, and thus it is impossible to obtain a basis of eigenvectors for the matrix A2.

2.D.11. Suppose that AAᵀ = AᵀA = E. Then we see that
⟨φ_A(u), φ_A(w)⟩ = ⟨Au, Aw⟩ = (Aw)ᵀAu = wᵀAᵀAu = wᵀu = ⟨u, w⟩,
for any u, w ∈ V. Hence φ_A is orthogonal. For the converse, assume that φ_A is orthogonal. Then we have ⟨φ_A(u), φ_A(w)⟩ = ⟨Au, Aw⟩ = ⟨u, w⟩, or equivalently wᵀAᵀAu = wᵀu, or (AᵀAw)ᵀu = wᵀu. Hence ⟨u, w⟩ = ⟨u, AᵀAw⟩. By linearity this final relation can be equivalently written as ⟨u, (E − AᵀA)w⟩ = 0 for any u, w ∈ V. Recall now that ⟨ , ⟩ is a scalar product and hence non-degenerate, which means that the condition ⟨x, y⟩ = 0 for all x implies y = 0. Thus we obtain (E − AᵀA)w = 0 for all w ∈ V, which yields the desired relation E − AᵀA = 0, i.e., A is orthogonal.

2.E.3. (i) The row operation R2 → R2 − 3R1 transforms A as follows:
$$A = \begin{pmatrix} 1&0&2\\ 3&1&-1\\ 2&4&2 \end{pmatrix} \xrightarrow{R_2 \to R_2 - 3R_1} \begin{pmatrix} 1&0&2\\ 0&1&-7\\ 2&4&2 \end{pmatrix}.$$
To determine the corresponding elementary matrix, which we may denote by E1, we apply the same row operation to the 3 × 3 identity matrix. This gives
$$E = \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix} \xrightarrow{R_2 \to R_2 - 3R_1} \begin{pmatrix} 1&0&0\\ -3&1&0\\ 0&0&1 \end{pmatrix} =: E_1.$$
To confirm this, note that the product E1A yields the same result as performing the row operation on A, i.e.,
$$E_1A = \begin{pmatrix} 1&0&0\\ -3&1&0\\ 0&0&1 \end{pmatrix}\begin{pmatrix} 1&0&2\\ 3&1&-1\\ 2&4&2 \end{pmatrix} = \begin{pmatrix} 1&0&2\\ 0&1&-7\\ 2&4&2 \end{pmatrix}.$$
Every elementary matrix is invertible, and the inverse of an elementary matrix is also an elementary matrix. It is found by performing the reverse row operation on the identity matrix, which in our case is the row operation R2 → R2 + 3R1. Hence we have
$$E_1^{-1} = \begin{pmatrix} 1&0&0\\ 3&1&0\\ 0&0&1 \end{pmatrix},\quad \text{so that} \quad E_1E_1^{-1} = \begin{pmatrix} 1&0&0\\ -3&1&0\\ 0&0&1 \end{pmatrix}\begin{pmatrix} 1&0&0\\ 3&1&0\\ 0&0&1 \end{pmatrix} = E = E_1^{-1}E_1.$$
(ii) By applying the row operation R2 → R2 − (3/2)R3 we obtain
$$A = \begin{pmatrix} 1&0&2\\ 3&1&-1\\ 2&4&2 \end{pmatrix} \xrightarrow{R_2 \to R_2 - \frac{3}{2}R_3} \begin{pmatrix} 1&0&2\\ 0&-5&-4\\ 2&4&2 \end{pmatrix}.$$
For the corresponding elementary matrix we have
$$E = \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix} \xrightarrow{R_2 \to R_2 - \frac{3}{2}R_3} \begin{pmatrix} 1&0&0\\ 0&1&-3/2\\ 0&0&1 \end{pmatrix} =: E_2.$$
As a confirmation, we see that the product E2A matches the result of the row operation on A, i.e.,
$$E_2A = \begin{pmatrix} 1&0&0\\ 0&1&-3/2\\ 0&0&1 \end{pmatrix}\begin{pmatrix} 1&0&2\\ 3&1&-1\\ 2&4&2 \end{pmatrix} = \begin{pmatrix} 1&0&2\\ 0&-5&-4\\ 2&4&2 \end{pmatrix}.$$
For the inverse we get
$$E_2^{-1} = \begin{pmatrix} 1&0&0\\ 0&1&3/2\\ 0&0&1 \end{pmatrix}.$$
(iii) In a similar way we find the elementary matrix E3 representing the row operation R3 → (1/2)R3, and its inverse, corresponding to the row operation R3 → 2R3:
$$E_3 = \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1/2 \end{pmatrix},\qquad E_3^{-1} = \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&2 \end{pmatrix}.$$

2.E.5. Here is a block in Sage that confirms the expressions of E1, ..., E8 and of their inverses, given in 2.E.4:
# Define the elementary matrices
E1 = elementary_matrix(QQ, 3, row1=0, scale=1/2)
E2 = elementary_matrix(QQ, 3, row1=1, row2=0, scale=-1)
E3 = elementary_matrix(QQ, 3, row1=2, row2=0, scale=-4)
E4 = elementary_matrix(QQ, 3, row1=1, scale=-2)
E5 = elementary_matrix(QQ, 3, row1=2, scale=-1/5)
E6 = elementary_matrix(QQ, 3, row1=1, row2=2, scale=-1)
E7 = elementary_matrix(QQ, 3, row1=0, row2=2, scale=-3/2)
E8 = elementary_matrix(QQ, 3, row1=0, row2=1, scale=-1/2)
# Print the elementary matrices
show(E1, E2, E3, E4, E5, E6, E7, E8)
# Compute their inverses
E1_inv = E1.inverse()
E2_inv = E2.inverse()
E3_inv = E3.inverse()
E4_inv = E4.inverse()
E5_inv = E5.inverse()
E6_inv = E6.inverse()
E7_inv = E7.inverse()
E8_inv = E8.inverse()
# Print the inverses
show(E1_inv, E2_inv, E3_inv, E4_inv, E5_inv, E6_inv, E7_inv, E8_inv)

2.E.6. We have four unknowns x1, x2, x3, x4, and four equations. Hence the matrix A has size 4 × 4, and b = (3, 4, 1, 5)^T ∈ R⁴. Let us present the solution using Sage, and compute the reduced row echelon form of the extended matrix B:
B = matrix([[3, 3, 2, 1, 3], [2, 1, 1, 0, 4], [0, 5, -4, 3, 1], [5, 3, 3, -3, 5]])
Br = B.rref(); show(Br); print(Br.pivots())
The output is the matrix
$$\begin{pmatrix} 1&0&0&0&4\\ 0&1&0&0&-2\\ 0&0&1&0&-2\\ 0&0&0&1&1 \end{pmatrix}$$
and the quadruple (0, 1, 2, 3). Therefore, the pivots lie in the first four columns and the corresponding linear system admits a unique solution, given by x1 = 4, x2 = −2, x3 = −2, x4 = 1. Try to present a formal computation of the reduced row echelon form.

2.E.9. There are infinitely many solutions, represented by the vectors (x1, x2, x3, x4)^T = (1 + t, 3/2, t, −1/2)^T, with t ∈ R.

2.E.10. This linear system has no solution. Can you explain why this claim is true?

2.E.12. The solution has the form {(−10t, (µ + 4)t, (3µ − 8)t)^T : t ∈ R}.

2.E.13. For a = 0 the system has no solution. For a ≠ 0 the system has infinitely many solutions.

2.E.14. Let us use the extended matrix and apply elementary row transformations to obtain
$$\left(\begin{array}{ccc|c} 1&-a&-2&b\\ 1&1-a&0&b-3\\ 1&1-a&a&2b-1 \end{array}\right) \sim \left(\begin{array}{ccc|c} 1&-a&-2&b\\ 0&1&2&-3\\ 0&1&a+2&b-1 \end{array}\right) \sim \left(\begin{array}{ccc|c} 1&-a&-2&b\\ 0&1&2&-3\\ 0&0&a&b+2 \end{array}\right).$$
Above, we first subtracted the first row from the second and the third; then we subtracted the second row from the third. We see that the system has a unique solution (determined by backward elimination) if and only if a ≠ 0. If a = 0 and b = −2, we have a zero row in the extended matrix; choosing x3 ∈ R as a parameter then gives infinitely many distinct solutions. For a = 0 and b ≠ −2 the last equation reads 0 = b + 2, which cannot be satisfied, so the system has no solution. Note that:
• For a = 0, b = −2 we have infinitely many solutions of the form (x1, x2, x3)^T = (−2 + 2t, −3 − 2t, t)^T, with t ∈ R.
• For a ≠ 0 the unique solution has the form $\Big(\dfrac{-3a^2 - ab - 4a + 2b + 4}{a},\ -\dfrac{3a + 2b + 4}{a},\ \dfrac{b + 2}{a}\Big)^T$.
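The case analysis above can be spot-checked in Sage for concrete parameter values (a small sketch of our own; the equations are re-entered symbolically):
a, b, x1, x2, x3 = var("a b x1 x2 x3")
eqs = [x1 - a*x2 - 2*x3 == b, x1 + (1-a)*x2 == b - 3, x1 + (1-a)*x2 + a*x3 == 2*b - 1]
print(solve([e.subs(a=1, b=0) for e in eqs], x1, x2, x3))   # unique solution (-3, -7, 2)
print(solve([e.subs(a=0, b=-2) for e in eqs], x1, x2, x3))  # a one-parameter family of solutions
print(solve([e.subs(a=0, b=1) for e in eqs], x1, x2, x3))   # no solution: []

2.E.17.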
We compute: σ = (1, 7)(2, 6)(5, 3) , τ = (1, 6)(6, 8)(8, 7)(7, 3)(2, 4) , ρ = (1, 4)(4, 10)(10, 7)(7, 9)(9, 3)(2, 6)(6, 5) . 2.E.18. For σ we compute 17 inversions, and its parity is thus odd. For τ we enumerate 12 inversions, and hence its parity is even. Finally for ρ we enumerate 25 inversions, so its parity is odd. 2.E.24. A confirmation of the inverse of A and the given solution can be obtained as follows: A=matrix(QQ, [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, 1]]) det(A) Ainv=A.inverse() show(Ainv) b=vector(QQ, [2, 3, 3, 5]) x=Ainv*b; show(x) As for the computation of the algebraic complements Aij, one can proceed with the following block: B11=A.matrix_from_rows_and_columns ([1, 2, 3], [1, 2, 3]);show(B11) A11=(-1)^(1+1)*det(B11); show(" A11 is equal to",A11) B12=A.matrix_from_rows_and_columns ([1, 2, 3], [0, 2, 3]);show(B12) A12=(-1)^(1+2)*det(B12); show(" A12 is equal to",A12) B13=A.matrix_from_rows_and_columns ([1, 2, 3], [0, 1, 3]);show(B13) A13=(-1)^(1+3)*det(B13); show(" A13 is equal to",A13) B14=A.matrix_from_rows_and_columns ([1, 2, 3], [0, 1, 2]);show(B14) A14=(-1)^(1+4)*det(B14); show(" A14 is equal to",A14) B21=A.matrix_from_rows_and_columns ([0, 2, 3], [1, 2, 3]);show(B21) A21=(-1)^(1+2)*det(B21); show(" A21 is equal to",A21) B22=A.matrix_from_rows_and_columns ([0, 2, 3], [0, 2, 3]);show(B22) A22=(-1)^(2+2)*det(B22); show(" A22 is equal to",A22) B23=A.matrix_from_rows_and_columns ([0, 2, 3], [0, 1, 3]);show(B23) A23=(-1)^(2+3)*det(B23); show(" A23 is equal to",A23) B24=A.matrix_from_rows_and_columns ([0, 2, 3], [0, 1, 2]);show(B24) A24=(-1)^(2+4)*det(B24); show(" A24 is equal to",A24) B31=A.matrix_from_rows_and_columns ([0, 1, 3], [1, 2, 3]);show(B31) 187 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA A31=(-1)^(3+1)*det(B31); show(" A31 is equal to",A31) B32=A.matrix_from_rows_and_columns ([0, 1, 3], [0, 2, 3]);show(B32) A32=(-1)^(3+2)*det(B32); show(" A32 is equal to",A32) B33=A.matrix_from_rows_and_columns ([0, 1, 3], [0, 1, 3]);show(B33) A33=(-1)^(3+3)*det(B33); show(" A33 is equal to",A33) B34=A.matrix_from_rows_and_columns ([0, 1, 3], [0, 1, 2]);show(B34) A34=(-1)^(3+4)*det(B34); show(" A34 is equal to",A34) B41=A.matrix_from_rows_and_columns ([0, 1, 2], [1, 2, 3]);show(B41) A41=(-1)^(4+1)*det(B41); show(" A41 is equal to",A41) B42=A.matrix_from_rows_and_columns ([0, 1, 2], [0, 2, 3]);show(B42) A42=(-1)^(4+2)*det(B42); show(" A42 is equal to",A42) B43=A.matrix_from_rows_and_columns ([0, 1, 2], [0, 1, 3]);show(B43) A43=(-1)^(4+3)*det(B43); show(" A43 is equal to",A43) B44=A.matrix_from_rows_and_columns ([0, 1, 2], [0, 1, 2]);show(B44) A44=(-1)^(4+4)*det(B44); show(" A43 is equal to",A44) Ain=(-1/16)*matrix(QQ, [[A11, A21, A31, A41], [A12, A22, A32, A42], [A13, A23, A33, A43], [A14, A24, A34, A44]]) show(Ain); bool(Ain==A.inverse()) Inside this cell, the matrices Bij represent the matrices ˆAij. The final command verifies that the matrix constructed via the algebraic complement, named here Ain, is the inverse of A. Or we can test our computation for the matrix adj(A), e.g., by adding the code Adtr=matrix(QQ, [[A11, A21, A31, A41], [A12, A22, A32, A42], [A13, A23, A33, A43], [A14, A24, A34, A44]]) show(Adtr); show(A.adjugate()) bool(Adtr==A.adjugate()) 2.E.25. To do this, we will use a slightly different method from those presented in 2.B.11. Both methods can be employed to further explore applications of Cramer’s rule, such as solving linear systems with parameters. Here we will define a function to construct the matrices Ai in Cramer’s notation (see 2.B.11). 
This will enable us to express the solution to Ax = b as $x_i = \det(A_i)/\det(A)$, $i = 1, \ldots, n$. This function can be introduced by the block given here:
def column_replace(M, column, u):
    n1 = M.nrows()
    n2 = M.ncols()
    P = matrix(n1, n2)
    for i in range(n1):
        for j in range(n2):
            P[i, j] = M[i, j]
    for i in range(n1):
        P[i, column - 1] = u[i, 0]
    return P
To solve our task, we can now add the following cell:
A = matrix([[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, 1]])
b = matrix(4, 1, [2, 3, 3, 5])
A1 = column_replace(A, 1, b)
A2 = column_replace(A, 2, b)
A3 = column_replace(A, 3, b)
A4 = column_replace(A, 4, b)
x = matrix(4, 1, [det(A1)/det(A), det(A2)/det(A), det(A3)/det(A), det(A4)/det(A)])
show(x)
Executing this block, Sage returns the solution obtained above, i.e., (13/4, −3/4, −3/4, 1/4)^T.

2.E.27. One may first compute F⁻¹. Then, based on the formula F⁻¹ = (1/det(F)) adj(F), we deduce that
$$\mathrm{adj}(F) = (\alpha\delta - \beta\gamma)F^{-1} = \begin{pmatrix} \delta&-\beta&0\\ -\gamma&\alpha&0\\ 0&0&\alpha\delta - \beta\gamma \end{pmatrix},$$
for all α, β, γ, δ ∈ R.

2.E.28. The corresponding adjoint matrices have the form
$$\mathrm{adj}(A) = \begin{pmatrix} 1&1&-2&-4\\ 0&1&0&-1\\ -1&-1&3&6\\ 2&1&-6&-10 \end{pmatrix},\qquad \mathrm{adj}(B) = \begin{pmatrix} 6&-2i\\ -3+2i&1+i \end{pmatrix}.$$

2.E.29. We can easily verify that A satisfies A² = A. In Sage give the cell
A = matrix([[2, -3, -5], [-1, 4, 5], [1, -3, -4]]); bool(A == A*A)
Sage returns True. As for the product AB, we get AB = A(A − I) = A² − A = A − A = 0. Moreover, for B² we see that
$$B^2 = (A - I)(A - I) = A^2 - 2A + I = I - A = -B = \begin{pmatrix} -1&3&5\\ 1&-3&-5\\ -1&3&5 \end{pmatrix}.$$
In Sage a verification of these results is very fast. For example, add the following syntax to the previous cell:
E = matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]]); B = A - E; show(A*B); show(B^2)
Finally, let us use Sage again to compute det(A), where we just need to add the command det(A). It gives det(A) = 0, i.e., A is a singular matrix. Try to present a formal computation of det(A). In fact, except for the identity matrix, all other idempotent matrices have determinant zero (see 2.E.61 below).

2.E.30. Both relations are true. To prove this, we will use Cauchy's theorem, which states that det(AB) = det(A) det(B) (see 2.2.7). Since det(PP⁻¹) = det(E) = 1 we obtain
det(B) = det(PAP⁻¹) = det(P) det(A) det(P⁻¹) = det(P) det(P⁻¹) det(A) = det(PP⁻¹) det(A) = det(A).
Hence det(B) = det(A), and we also obtain det(A⁻¹B) = det(A⁻¹) det(B) = det(A⁻¹) det(A) = det(A⁻¹A) = 1.

2.E.32. This task is solved in a similar way to 2.E.31, and you can verify that the desired coordinates are given by (2 + 1/√3, 2 − 1/√3).

2.E.33. Let us write (5, 1, 11)^T = p(3, 2, 2)^T + q(2, 3, 1)^T + r(1, 1, 3)^T for some p, q, r ∈ R. Solving the corresponding linear system we obtain a unique solution, given by p = 2, q = −2 and r = 3.

2.E.34. We see that the vectors u1, u2, u3 are linearly dependent whenever at least one of the following conditions is satisfied: a = b = 1, or a = c = 1, or b = c = 1.

2.E.35. It turns out that the given vectors are linearly independent. In order to prove this assertion, you may try to apply the method presented in 2.C.16.

2.E.36. It is easy to check that adding the polynomial x gives a basis of R3[x].

2.E.37. For both cases we have to check the linear independence of the given vectors u1, u2, u3. (a) In the first case we compute dim U = 2 for t ∈ {1, 2}; otherwise we have dim U = 3. (b) In the second case we compute dim U = 2 for t ≠ 0, and dim U = 1 for t = 0.
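The vectors u1, u2, u3 of 2.E.37 are stated in the problem section; as a generic illustration of how such parameter-dependent dimension counts are done in Sage, here is a sketch with sample vectors of our own:
t = var("t")
M = matrix(SR, [[1, 1, t], [1, t, 1], [t, 1, 1]])
show(M.det().factor())  # -(t + 2)*(t - 1)^2: the span is 3-dimensional except at t = 1 and t = -2

2.E.39.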
According to the definition of intersection, the vectors in the intersection lie both in the span of the vectors (1, 1, −3)^T, (1, 2, 2)^T, and in the span of the vectors (1, 1, −1)^T, (1, 2, 1)^T, (1, 3, 3)^T. We see that U is spanned by two linearly independent vectors, and hence U is a plane in R³. Next, V is spanned by three vectors, but these are linearly dependent:
$$\begin{vmatrix} 1&1&1\\ 1&2&3\\ -1&1&3 \end{vmatrix} = \begin{vmatrix} 1&1&-1\\ 1&2&1\\ 1&3&3 \end{vmatrix} = 0.$$
Therefore, V is also a plane. If the vector (x1, x2, x3) lies in U, then (x1, x2, x3) = λ(1, 1, −3) + µ(1, 2, 2), for some scalars λ, µ. Similarly, if the vector (x1, x2, x3) lies in V, then (x1, x2, x3) = α(1, 1, −1) + β(1, 2, 1) + γ(1, 3, 3), for some scalars α, β, γ. This gives a system of six equations in eight unknowns. Solving such a system can be quite cumbersome, but we can make some simplifications. First notice that the subspace U is described by the following three equations:
x1 = λ + µ, x2 = λ + 2µ, x3 = −3λ + 2µ.
Solving this system of equations with respect to λ and µ, or alternatively eliminating λ and µ from these equations, one obtains the single equation 8x1 − 5x2 + x3 = 0, which we may use to replace the first three. Now notice that the subspace V is described by the following three equations:
x1 = α + β + γ, x2 = α + 2β + 3γ, x3 = −α + β + 3γ.
Solving this system of equations with respect to α, β and γ, or alternatively eliminating α, β and γ from these equations, we obtain the single equation 3x1 − 2x2 + x3 = 0, which we can use to describe V. Hence, it is now straightforward that, after introducing a new parameter t, we can express the intersection as (x1, x2, x3) = t · (3, 5, 1).

2.E.40. The answer is given by the matrix
$$\begin{pmatrix} 5/6&-1/6&1/3\\ -1/6&5/6&1/3\\ 1/3&1/3&1/3 \end{pmatrix}.$$

2.E.41. The answer is given by the matrix
$$\begin{pmatrix} 5/9&2/9&-4/9\\ 2/9&8/9&2/9\\ -4/9&2/9&5/9 \end{pmatrix}.$$

2.E.42. (a) The vector that determines the subspace U is perpendicular to each of the three vectors that generate W. Thus the subspaces are orthogonal and the first claim holds. (b) It is not true that R⁴ = U ⊕ W. This is because the subspace W is only two-dimensional, since (−1, 0, −1, 2)^T = (−1, 0, 1, 0)^T − 2(0, 0, 1, −1)^T.

2.E.43. Let
$$A = \begin{pmatrix} 1&0&1\\ 0&1&2\\ 0&0&1\\ 1&1&0 \end{pmatrix}$$
be the 4 × 3 matrix induced by the vectors E1, E2, E3. Recall that the maximum number of linearly independent column vectors of A coincides with the rank of A. Because the determinant of the first three rows equals 1 ≠ 0, the rank of A equals 3, rank(A) = 3. Hence the vectors E1, E2, E3 are linearly independent. Recall that to verify this conclusion in Sage it suffices to add the block
V = RR^4
E1 = vector(RR, [1, 0, 0, 1])
E2 = vector(RR, [0, 1, 0, 1])
E3 = vector(RR, [1, 2, 1, 0])
V.linear_dependence([E1, E2, E3]) == []
On the other hand, for the rank of A the appropriate cell is
A = matrix(SR, [[1, 0, 1], [0, 1, 2], [0, 0, 1], [1, 1, 0]])
A.rank()
In the first case Sage's output is True, while in the second Sage prints out the desired number 3. Since E1, E2, E3 are linearly independent and span W, they form a basis of W. We will use this basis to obtain an orthogonal basis by applying the usual Gram–Schmidt procedure. Set
$$w_1 = E_1,\qquad w_2 = E_2 - \frac{\langle E_2, w_1\rangle}{\|w_1\|^2}w_1,\qquad w_3 = E_3 - \frac{\langle E_3, w_1\rangle}{\|w_1\|^2}w_1 - \frac{\langle E_3, w_2\rangle}{\|w_2\|^2}w_2.$$
Since ⟨E2, w1⟩ = 1 and ∥w1∥² = 2, this gives w2 = (−1/2, 1, 0, 1/2)^T. We also compute ⟨E3, w1⟩ = 1, ⟨E3, w2⟩ = 3/2 and ∥w2∥² = 3/2. Hence w3 = E3 − (1/2)w1 − w2, that is, w3 = (1, 1, 1, −1)^T, with ∥w3∥² = 4.
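Since hand computations in the Gram–Schmidt process are error-prone, it is worth letting Sage repeat the projections (a small check of our own):
E1 = vector(QQ, [1, 0, 0, 1]); E2 = vector(QQ, [0, 1, 0, 1]); E3 = vector(QQ, [1, 2, 1, 0])
w1 = E1
w2 = E2 - (E2.dot_product(w1)/w1.dot_product(w1))*w1
w3 = E3 - (E3.dot_product(w1)/w1.dot_product(w1))*w1 - (E3.dot_product(w2)/w2.dot_product(w2))*w2
print(w2, w3)  # (-1/2, 1, 0, 1/2) (1, 1, 1, -1)
print(w1.dot_product(w2), w1.dot_product(w3), w2.dot_product(w3))  # 0 0 0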
Thus an orthonormal basis of W consists of the vectors
$$\hat{w}_1 = \frac{w_1}{\|w_1\|} = \Big(\frac{1}{\sqrt{2}}, 0, 0, \frac{1}{\sqrt{2}}\Big)^T,\quad \hat{w}_2 = \frac{w_2}{\|w_2\|} = \Big(-\frac{1}{\sqrt{6}}, \frac{2}{\sqrt{6}}, 0, \frac{1}{\sqrt{6}}\Big)^T,\quad \hat{w}_3 = \frac{w_3}{\|w_3\|} = \Big(\frac{1}{2}, \frac{1}{2}, \frac{1}{2}, -\frac{1}{2}\Big)^T.$$
By definition, we have R⁴ = W ⊕ W^⊥ and hence dim W^⊥ = dim R⁴ − dim W = 1. Choosing a suitable vector of the standard basis of R⁴, say e2 = (0, 1, 0, 0)^T, we extend the above basis to a basis of R⁴, namely {Z1 = ŵ1, Z2 = ŵ2, Z3 = ŵ3, Z4 = e2} (we encourage the reader to verify this statement in Sage). By applying the Gram–Schmidt method to this basis we obtain an orthogonal basis {Ẑ1, ..., Ẑ4} of R⁴, given by Ẑj = Zj = ŵj for j = 1, 2, 3 and
$$\hat{Z}_4 = Z_4 - \frac{\langle Z_4, \hat{w}_1\rangle}{\|\hat{w}_1\|^2}\hat{w}_1 - \frac{\langle Z_4, \hat{w}_2\rangle}{\|\hat{w}_2\|^2}\hat{w}_2 - \frac{\langle Z_4, \hat{w}_3\rangle}{\|\hat{w}_3\|^2}\hat{w}_3.$$
It follows that W^⊥ = span_R{Ẑ4}, and the explicit computation of Ẑ4 is left to the reader.

2.E.47. Using the Gram–Schmidt orthogonalization process we obtain the basis {(1, 1, 1, 1)^T, (1, 1, 1, −3)^T, (−2, 1, 1, 0)^T}.

2.E.48. For example, one can obtain the orthogonal bases {(1, 0, 1, 0)^T, (0, 1, 0, −7)^T} for the first part, and {(1, 2, 2, −1)^T, (2, 3, −3, 2)^T, (2, −1, −1, −2)^T} for the second part.

2.E.49. The solution is a = 9/2, b = −5 (since 1 + b + 4 + 0 + 0 = 0 and 1 − b + 0 + 3 − 2a = 0).

2.E.50. The orthogonal complement V^⊥ is the set of all scalar multiples of the vector (4, 2, 7, 0)^T.

2.E.51. Obviously, there are infinitely many possible extensions. For example, a very simple one is given by the set {u1, u2, u3 = (1, 0, 0, −1)^T, u4 = (1, 0, −1, 1)^T}.

2.E.52. The basis consists only of the vector u1 = (3, −7, 1, −5, 9)^T (or any non-zero scalar multiple of u1).

2.E.53. The answers are given as follows:
(a) W^⊥ = span_R{(1, 0, −1, 1, 0)^T, (1, 3, 2, 1, −3)^T}.
(b) W^⊥ = span_R{(1, 0, −1, 0, 0)^T, (1, −1, 1, −1, 1)^T}.

2.E.54. Obviously, any square matrix A can be written as
$$A = \frac{1}{2}\big(A + A^T\big) + \frac{1}{2}\big(A - A^T\big).$$
Setting $A_s = \frac{1}{2}(A + A^T)$ and $A_a = \frac{1}{2}(A - A^T)$, respectively, it is clear that
$$A_s^T = \frac{1}{2}\big(A + A^T\big)^T = A_s,\qquad A_a^T = \frac{1}{2}\big(A - A^T\big)^T = -A_a.$$
This proves the first statement, i.e., A = A_s + A_a. For the given matrix A we compute
$$A_s = \frac{1}{2}\left(\begin{pmatrix} 1&0&2\\ 6&3&0\\ 2&2&4 \end{pmatrix} + \begin{pmatrix} 1&6&2\\ 0&3&2\\ 2&0&4 \end{pmatrix}\right) = \begin{pmatrix} 1&3&2\\ 3&3&1\\ 2&1&4 \end{pmatrix}$$
and
$$A_a = \frac{1}{2}\left(\begin{pmatrix} 1&0&2\\ 6&3&0\\ 2&2&4 \end{pmatrix} - \begin{pmatrix} 1&6&2\\ 0&3&2\\ 2&0&4 \end{pmatrix}\right) = \begin{pmatrix} 0&-3&0\\ 3&0&-1\\ 0&1&0 \end{pmatrix}.$$
Observe that a skew-symmetric matrix, such as A_a, has zeros on the diagonal. In Sage, to find A_s and A_a we can proceed with the cell
A = matrix([[1, 0, 2], [6, 3, 0], [2, 2, 4]]); show(A)
As = (A + A.transpose())/2; Aa = (A - A.transpose())/2
show(As); show(Aa)
As.is_symmetric()  # this command is used to verify that A_s is symmetric
Notice that an alternative way to compute the transpose of a matrix A in Sage is the shortcut A.T, a notation which is common in some other programming environments, like NumPy. Hence, for example, to determine the matrix A_s in Sage one could type
A = matrix([[1, 0, 2], [6, 3, 0], [2, 2, 4]])
As = (A + A.T)/2; print(As)

2.E.55. (a) The assertions are obvious:
(A + Aᵀ)ᵀ = Aᵀ + (Aᵀ)ᵀ = Aᵀ + A = A + Aᵀ, (A − Aᵀ)ᵀ = Aᵀ − (Aᵀ)ᵀ = Aᵀ − A = −(A − Aᵀ).
(b) By assumption A is skew-symmetric, i.e., A = −Aᵀ, and n is odd. Recall that det(X) = det(Xᵀ) and det(cX) = cⁿ det(X) for any n × n matrix X, see 2.B.8. Thus,
det(A) = det(Aᵀ) = det(−A) = (−1)ⁿ det(A) = −det(A),
which implies that 2 det(A) = 0, i.e., det(A) = 0.
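A quick numerical illustration of (b) in Sage, with a random skew-symmetric matrix of odd size (a sketch of our own):
A = random_matrix(QQ, 5)
S = A - A.transpose()   # a 5x5 skew-symmetric matrix
print(S == -S.transpose(), S.det())  # True 0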
(c) The proof of the claim about the trace is also easy: since a skew-symmetric matrix A = (aij) satisfies aij = −aji for all i, j, we have aii = −aii and hence aii = 0 for all i. Thus, all the diagonal entries of a skew-symmetric matrix are zero.
2.E.58. The given identity follows easily from (∗), which for λ = 1 gives χA(1) = det(A − E) = −1 + tr(A) + c1 + det(A), so we can replace c1 in (∗) by det(A − E) + 1 − tr(A) − det(A). For the given matrix A we compute tr(A) = 12 and det(A) = 60. Hence an application gives
χA(λ) = det(A − λE) = −λ³ + 12λ² − 47λ + 60.
Using Sage we find that χA(λ) has three roots: λ1 = 3, λ2 = 4, and λ3 = 5. These are the eigenvalues of A, all with algebraic multiplicity one. Try to solve the equation χA(λ) = 0 in a formal way. For a description in Sage of the given expression of the characteristic polynomial, but also for a verification of the determinant and trace, type
A=matrix(SR, [[32, -67, 47], [7, -14, 13], [-7, 15, -6]])
print(A.trace()); print(A.det())
p(t) = A.characteristic_polynomial(t)
show(p(t)); show(p.roots())
2.E.59. Possible examples are illustrated by the matrices A, B and C provided below, which correspond to the cases (i), (ii) and (iii), respectively:
$$A = \begin{pmatrix} 6 & 0 & 0 & 0 \\ 0 & 7 & 0 & 0 \\ 0 & 0 & 7 & 0 \\ 0 & 0 & 0 & 7 \end{pmatrix}, \quad B = \begin{pmatrix} 6 & 0 & 0 & 0 \\ 0 & 7 & 1 & 0 \\ 0 & 0 & 7 & 0 \\ 0 & 0 & 0 & 7 \end{pmatrix}, \quad C = \begin{pmatrix} 6 & 0 & 0 & 0 \\ 0 & 7 & 1 & 0 \\ 0 & 0 & 7 & 1 \\ 0 & 0 & 0 & 7 \end{pmatrix}.$$
Explain the reasoning behind the derivation of these matrices and then explore additional solutions.
2.E.60. There is a triple eigenvalue −1. The corresponding eigenspace is spanned by the eigenvectors (1, 0, 0)T and (0, 2, 1)T, hence it is 2-dimensional (over R).
2.E.65. The matrix has a double eigenvalue −1, and its associated eigenspace is spanned over R by the vectors (2, 0, 1)T and (1, 1, 0)T. Further, the matrix has the eigenvalue 0, with eigenvector (1, 4, −3)T. Combining this with the statements in 2.E.63, we deduce that the mapping induced by the given matrix M is an axial symmetry through the line induced by the last vector, composed with the projection onto the plane perpendicular to the last vector. Verify that this plane is given by the equation x + 4y − 3z = 0.
2.E.66. (i) Let us present the computations in Sage, where we can use the command A.commutator(B) to compute the commutator [A, B] = AB − BA of two square matrices A, B. The corresponding block is here:
s1=matrix(SR, [[0, 1], [1, 0]])
s2=matrix(SR, [[0, -i], [i, 0]])
s3=matrix(SR, [[1, 0], [0, -1]])
show(s1.commutator(s2)); bool(s1.commutator(s2)==2*i*s3)
show(s1.commutator(s3)); bool(s1.commutator(s3)==-2*i*s2)
show(s2.commutator(s3)); bool(s2.commutator(s3)==2*i*s1)
Execute this block to read Sage's output. (ii) To verify the claims posed in the second part one can proceed by adding in the previous cell the code
bool(s1*s1==identity_matrix(2))
bool(s2*s2==identity_matrix(2))
bool(s3*s3==identity_matrix(2))
ev1=s1.eigenvalues()
ev2=s2.eigenvalues()
ev3=s3.eigenvalues()
bool(ev1==ev2); bool(ev1==ev3); bool(ev2==ev3); show(ev1)
Let us also verify the claim for the eigenvalues by a formal computation, e.g., for σ1. The point is to compute its characteristic polynomial χσ1(λ) = det(σ1 − λE), where E is the 2 × 2 identity matrix. We have
$$\sigma_1 - \lambda E = \begin{pmatrix} -\lambda & 1 \\ 1 & -\lambda \end{pmatrix} \Longrightarrow \chi_{\sigma_1}(\lambda) = \begin{vmatrix} -\lambda & 1 \\ 1 & -\lambda \end{vmatrix} = \lambda^2 - 1.$$
Thus there are two roots, λ± = ±1. Similarly for σ2 and σ3. Try to determine the corresponding eigenvectors. (iii) The claim in this part can be proved similarly, and is left as an exercise.
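To determine the eigenvectors suggested above, one may again let Sage do the work. The following cell is a sketch of ours, using the standard eigenvectors_right method:
s1 = matrix(SR, [[0, 1], [1, 0]])
# each triple is (eigenvalue, [eigenvectors], algebraic multiplicity)
for ev, vecs, mult in s1.eigenvectors_right():
    print(ev, vecs, mult)
The same call applied to s2 and s3 yields their eigenvectors as well.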
2.E.67. This is direct and left for practice.

CHAPTER 3

Linear models and matrix calculus

where are the matrices useful? – basically almost everywhere...

We have already developed a useful package of tools, and it is time to show some applications of matrix calculus. The first three parts of this chapter are independent, and readers more interested in the theory may skip any of them and continue straight with the fourth part. It might seem that the assumption of linearity of relations between quantities is too restrictive, but this is often not so. In real problems, linear relations may appear directly, or a problem may be solved as the result of an iteration of many linear steps. Even if this is not the case, we may still use this approach at least to approximate real non-linear processes. We should also like to compute with matrices (and linear mappings) as easily as we compute with scalars. In order to do that, we prepare the necessary tools in the second part of this chapter. We also present a useful application of matrix decompositions to pseudoinverse matrices, which are needed for the numerical mastery of matrix calculus. We try to illustrate all the phenomena with rather easy problems. Still, some parts of this chapter may be difficult at first reading. This concerns in particular the very first part, providing some glimpses of linear optimization (linear programming), and the third part, devoted to iterated processes (the Frobenius–Perron theory). The rest of the chapter comes back to more advanced parts of the matrix calculus (the Jordan canonical form, decompositions, and pseudoinverses of matrices). The reader should feel free to move forward when getting lost.

1. Linear optimization

The simplest linear processes are given by linear mappings φ : V → W on vector spaces. As we can surely imagine, the vector v ∈ V can represent the state of some system we are observing, while φ(v) gives the result after some process is realized. If we want to reach a given result b ∈ W of such a process, we solve the problem φ(x) = b for some unknown vector x and a known vector b. In fixed coordinates we then have the matrix A of the mapping φ and the coordinate expression of the vector b. We mastered such problems in the previous chapter. Now we draw more interesting conclusions in the setup of linear optimization models (also called linear programming).

A. Linear optimization

The idea of maximizing or minimizing a linear function subject to linear constraints arises naturally in many fields. For instance, we may want to maximize profits, minimize costs, etc. Linear programming builds upon notions from linear algebra discussed in Chapter 2 and is a powerful tool for solving linear optimization problems subject to linear constraints.¹ In a linear programming problem (LP problem, in short) we seek a vector in some Euclidean space maximizing (or minimizing) the value of some linear functional, among all vectors satisfying a given system of linear constraints that govern the process. In 3.1.2 this linear functional is called the objective function.

3.A.1. To illustrate the idea, let us begin with LP problems in two dimensions, where a solution can be found graphically. As usual, we view vectors x in R2 as column matrices (x1, x2)T with x1, x2 ∈ R. We want to maximize the value of the linear function h(x1, x2) = x1 + x2 (objective function) subject to the constraints
x1 + 2x2 ≤ 3, 2x1 + x2 ≤ 3, x1 ≥ 0, x2 ≥ 0.

¹The name “programming” refers to process planning and was introduced before computers were invented, cf. the footnote on page 195.
3.1.1. Linear optimization. In the practical column, the previous chapter started with a painting problem, and we continue here in a similar way. Imagine that our very specialized painter in a black&white world is willing to paint facades of either small family houses or large public buildings, and that he (of course) uses only black and white colours. He can arbitrarily choose the proportions between x1 units of area for the small houses and x2 units for the large buildings. Assume that his maximal workload in a given interval of time is L units of area, and that his net income (that is, after subtracting the costs) is c1 per unit of area for small houses and c2 per unit of area for large buildings. Furthermore, he has only W kg of white colour and B kg of black colour at his disposal. Finally, a unit of area for small houses requires w1 kg of white colour and b1 kg of black colour; for large buildings the corresponding values are w2 and b2. If we write all this information as inequalities, we obtain the conditions

(1) x1 + x2 ≤ L
(2) w1x1 + w2x2 ≤ W
(3) b1x1 + b2x2 ≤ B.

The total net income of the painter, which is the linear form h(x1, x2) = c1x1 + c2x2, is to be maximized. Each of the given inequalities clearly determines a half-plane in the plane of the variables (x1, x2), bounded by the line given by the corresponding equality, and we must also assume that both x1 and x2 are non-negative real numbers (because the painter cannot paint negative areas). Thus we have constraints on the values (x1, x2) – either the constraints are unsatisfiable, or they allow the points inside a polygon with at most five vertices. See the diagram.

Solution. The inequalities are graphically represented by half-planes in R2, and the intersection of these half-planes determines the “feasible region” of solutions. Thus, a feasible solution is any point in R2 satisfying all the given constraints. The non-negativity constraints xi ≥ 0 for i = 1, 2 clearly restrict all the feasible solutions to the first quadrant. Therefore, the feasible region is the shaded (hatched) area in the diagram below. While it is possible to construct linear programming (LP) problems with an empty feasible region, we are primarily interested in those with a non-empty feasible region. In a linear maximization problem, an “optimal solution” is a point within the feasible region that maximizes the objective function's value. For our example, an optimal solution appears at a unique vertex. To see this, consider the vector c = (1, 1)T, so that cT x = x1 + x2. Consider also the objective function line x1 + x2 = k (k ∈ R), perpendicular to the vector c. As illustrated in the figure above, we now move this line upwards in the direction of the vector c, maintaining the same slope, without leaving the feasible region. By doing so, we determine the point where the moving line intersects the feasible region for the last time, namely the point (1, 1). This point provides the maximum value k = 2 of the objective function h. □
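The optimal vertex just found can also be confirmed numerically. Anticipating the Sage interface described later in 3.A.12, a minimal check (a sketch of ours) reads:
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective(x1 + x2)
p.add_constraint(x1 + 2*x2 <= 3)
p.add_constraint(2*x1 + x2 <= 3)
print(p.solve(), p.get_values(x1), p.get_values(x2))   # 2.0 1.0 1.0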
On the other hand, there exist LP problems with infinitely many optimal solutions. Nevertheless, at least one of these optimal solutions must occur at a vertex of the feasible region. Let us describe such an example.

3.A.2. Consider the problem of maximizing the functional h(x1, x2) = 2.5x1 + x2 under the constraints
3x1 + 5x2 ≤ 15, 5x1 + 2x2 ≤ 10, x1 ≥ 0, x2 ≥ 0.

How to solve such a problem? We seek the maximum value of a linear form h over subsets M of a vector space which are defined by linear inequalities. In the plane, M is given by an intersection of half-planes. Next, note that every linear form h : V → R on a real vector space (that is, an arbitrary linear scalar function) is monotone in every chosen direction. More precisely, if we choose a fixed starting vector u ∈ V and a “directional” vector v ∈ V, then the composition of our form h with this parametrization yields t ↦ h(u + tv) = h(u) + t h(v). As a function of t, this expression is either increasing, or decreasing, or constant (depending on whether h(v) is positive, negative, or zero). Thus, if the set M is bounded, as in our picture above, we easily find the solution by testing the value of h at the vertices of the boundary polygon. In general, we must expect that problems similar to the one with the painter are either unsatisfiable (if the set given by the constraints is empty), or the profit is unbounded (if the constraints allow for unbounded directions in the space and the form h is non-zero in some of the unbounded directions), or they attain a maximal solution in at least one of the “vertices” of the set M. Normally the maximum is attained at a single point of M, but sometimes it is attained on a part of the boundary of the set M. Try to choose explicit values for the parameters w1, w2, b1, b2, c1, c2, draw the above picture for these parameters, and find the explicit solution to the problem (if it exists)!

3.1.2. Terminology. In general, we speak of a linear programming problem whenever we seek either the maximum or the minimum value of a linear form h on Rn over a set bounded by a system of linear inequalities, which we call linear constraints. The vector on the right-hand side is then called the vector of constraints. The linear form h is also called the objective function.¹ In real practice we meet hundreds or thousands of constraints for dozens of variables. The standard maximization problem seeks a maximum of the objective function while the restrictive inequalities are ≤ and the variables are non-negative. On the other hand, the standard minimization problem seeks a minimum of the objective function while the restrictive inequalities are ≥ and the variables are non-negative. It is easy to see that every general linear programming problem can be transformed into a standard one of either type. Aside from sign changes, we can decompose each variable without sign restriction into a difference of two non-negative ones. Without loss of generality, we will work only with the standard maximization problem.

¹Leonid Kantorovich and Tjalling Koopmans shared the 1975 Nobel prize in economics for their formulation and solution of economic and logistics problems in this way during the Second World War. But it was George B. Dantzig who independently developed the general linear programming formulation in the period 1946–49, motivated by planning problems in the US Air Force. Among other things, he invented the simplex method algorithm.
Solution. Obviously, we have h = cT x where c = (2.5, 1)T, and the feasible region is the shaded polygon in the figure below. All the points on the thick side of the polygon are optimal solutions. This is because the level lines of h are parallel to the line 5x1 + 2x2 = 10 induced by the second constraint. Indeed, we see that both the points (2, 0) and (20/19, 45/19) are optimal solutions to the problem, with h(2, 0) = h(20/19, 45/19) = 5. The line segment p joining these points has the form
(x1, x2) = t(2, 0) + (1 − t)(20/19, 45/19) = (20/19 + (18/19)t, 45/19 − (45/19)t),
with t ranging from 0 to 1. Then any point (x1, x2) on p maximizes h and hence serves as an optimal solution:
h(x1, x2) = (5/2)(20/19 + (18/19)t) + (45/19 − (45/19)t) = 5. □

Finally, we may encounter LP problems with an unbounded feasible region, where the objective function can achieve arbitrarily large values, resulting in the absence of optimal solutions (an unbounded objective). For example, this occurs when attempting to maximize x1 + x2 in the first quadrant without any additional constraints, or when adding the constraint x2 − x1 ≤ 1. Another possible scenario involves an unbounded linear programming problem that still has infinitely many optimal solutions. This situation is illustrated by the following example, where the details are left as an exercise.

3.A.3. Maximize h = 2x2 − x1 subject to the constraints
x1 − x2 ≥ −1, −0.5x1 + x2 ≤ 2, x1 ≥ 0, x2 ≥ 0. ⃝

Linear programming applies to a wide range of problems, and in the sequel we will explore several of its applications. Examples include maximizing the profits of a production process, minimizing the area on a chip, optimizing supply chain logistics, scheduling tasks efficiently, and determining the best investment portfolio mix.

3.1.3. Formulation using linear equations. Finding an optimum is not always as simple as in the previous 2-dimensional case. The problem can contain many variables and constraints, and even deciding whether the set M of feasible points is non-empty can be a problem. We do not have the ambition to go into the detailed theory here. But we mention at least some ideas showing that a solution can always be found, and then we build an effective algorithm solving the problem in the next paragraphs. We begin by a comparison with systems of linear equations – because we understand those well. We write the inequalities (1)–(3) in 3.1.1 in the general form A · x ≤ b, where x is now an n-dimensional vector, b is an m-dimensional vector, and A is the corresponding matrix. By an inequality between vectors we mean the individual inequalities between all their coordinates. We want to maximize the product c · x for a given row vector c of coefficients of the linear form h over the feasible values of x. If we add new auxiliary variables xs, one for every inequality, and another variable z for the value of the linear form h, we can rewrite the whole system as a system of linear equations

(1)  $$\begin{pmatrix} 1 & -c & 0 \\ 0 & A & E_m \end{pmatrix} \cdot \begin{pmatrix} z \\ x \\ x_s \end{pmatrix} = \begin{pmatrix} z - c \cdot x \\ A \cdot x + x_s \end{pmatrix} = \begin{pmatrix} 0 \\ b \end{pmatrix},$$

where the matrix is composed of blocks with 1 + n + m columns and 1 + m rows, with the corresponding individual components of the vectors. We call the new variables xs the slack variables. Moreover, we require non-negativity of all coordinates of x and xs. If the given system of equations has a solution, we seek values of the variables z, x and xs such that all x and xs are non-negative and z is maximized.
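To see the block structure of (1) concretely, the following small Sage sketch of ours assembles the extended matrix for data of our own choosing, namely the constraints and objective of 3.A.1:
A = matrix(QQ, [[1, 2], [2, 1]]); c = matrix(QQ, [[1, 1]])
one = matrix(QQ, [[1]]); Z1 = zero_matrix(QQ, 1, 2); Z2 = zero_matrix(QQ, 2, 1)
M = block_matrix([[one, -c, Z1], [Z2, A, identity_matrix(QQ, 2)]], subdivide=False)
show(M)   # rows: (1,-1,-1,0,0), (0,1,2,1,0), (0,2,1,0,1)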
Specifically, in our problem of the black&white painter from 3.1.1, the system of linear equations looks like this:

$$\begin{pmatrix} 1 & -c_1 & -c_2 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & w_1 & w_2 & 0 & 1 & 0 \\ 0 & b_1 & b_2 & 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} z \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix} = \begin{pmatrix} 0 \\ L \\ W \\ B \end{pmatrix}.$$

In paragraph 4.1.11 on page 328 we will discuss this situation from the viewpoint of affine geometry. For now we just notice that being on the boundary of the set M of feasible points of the problem is equivalent to the vanishing of some of the slack variables. Our algorithm will try to move from one such position to another while increasing h. But we shall need some conceptual preparation first.

3.1.4. Duality of linear programming. Consider the real matrix A with m rows and n columns, the vector of constraints b, and the row vector c giving the objective function. From this data we can consider two problems of linear programming for x ∈ Rn and y ∈ Rm.

3.A.4. A company manufactures two models A and B of a product. Each piece of model A requires ten labour hours for fabrication and one labour hour for finishing and packing. For model B, each piece requires sixteen labour hours for fabrication and two labour hours for finishing and packing. The profit is €500 for each piece of model A and €850 for each piece of model B. Assuming there are 180 labour hours available monthly for fabrication and 20 labour hours available for finishing and packing, how many pieces of models A and B should be manufactured to maximize the profit?

Solution. Let us encode the given data in a table:

               model A          model B          labour hours
fabrication    10 hours/piece   16 hours/piece   180 hours
finish & pack  1 hour/piece     2 hours/piece    20 hours
profit         €500/piece       €850/piece

Suppose that x1 is the number of pieces of model A and x2 is the number of pieces of model B that are manufactured. The profit function is given by h(x1, x2) = 500x1 + 850x2, and we want to maximize it subject to the following constraints: 10x1 + 16x2 ≤ 180, x1 + 2x2 ≤ 20, with x1 ≥ 0 and x2 ≥ 0. Here, the first inequality represents the constraint on the labour hours for fabrication, the second one pertains to the labour hours for finishing and packing, and the last two are the natural non-negativity conditions. Equivalently, we have the LP problem of maximizing the given profit h = h(x1, x2) with respect to the conditions 5x1 + 8x2 ≤ 90, x1 + 2x2 ≤ 20, with xi ≥ 0 for i = 1, 2. Draw a figure yourself to see that this is a bounded LP problem, and hence the maximum of h appears at one of the corners of the corresponding feasible region. The vertices of the feasible region are (18, 0), (0, 10), and (10, 5). Since h(18, 0) = 9000, h(0, 10) = 8500, h(10, 5) = 9250, we deduce that the company must produce 10 pieces of model A and 5 pieces of model B, achieving the maximum profit of €9250. □

3.A.5. Minimize the costs of feeding. A stable in the west of the Czech Republic purchases fodder for the winter: hay and oats. The nutritional values of the fodder, along with the daily requirements (portions) per foal, are detailed in the following table.

g per kg                     Hay     Oats    Requirements
Dry basis                    841     860     ≥ 6300 g
Digestible nitrogen matter   53      123     ≥ 1150 g
Starch                       0.348   0.868   ≤ 5.35 g
Calcium                      6       1.6     ≥ 30 g
Phosphate                    2.8     3.5     ≤ 44 g
Sodium                       0.2     1.4     ≃ 7 g
Costs                        €1.80   €1.60
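As a quick aside, the optimum of 3.A.4 is easy to double-check with the Sage LP interface explained later in 3.A.12 (a sketch of ours):
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective(500*x1 + 850*x2)
p.add_constraint(10*x1 + 16*x2 <= 180)
p.add_constraint(x1 + 2*x2 <= 20)
print(p.solve(), p.get_values(x1), p.get_values(x2))   # 9250.0 10.0 5.0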
Dual problems of linear programming

Maximization problem: Maximize c · x under the conditions A · x ≤ b and x ≥ 0.
Minimization problem: Minimize yT · b under the conditions yT · A ≥ c and y ≥ 0.

We say that these two problems are dual problems of linear programming. Before deriving further properties of linear programming, we need some terminology. We say that a problem is solvable if there is an admissible vector x (or an admissible vector yT) which satisfies all its constraints. A solvable maximization (minimization) problem is bounded if the objective function is bounded from above (below) over the set of admissible vectors.

Lemma (Weak duality theorem). If x ∈ Rn is an admissible vector for the standard maximization problem, and if y ∈ Rm is an admissible vector for the dual minimization problem, then c · x ≤ yT · b.

Proof. It is a simple observation. Since x ≥ 0 and c ≤ yT · A, it follows that c · x ≤ yT · A · x. But also y ≥ 0 and A · x ≤ b, hence c · x ≤ yT · A · x ≤ yT · b, which is what we wanted to prove. □

We see immediately that if both dual problems are solvable, then they must be bounded. Even more interesting is the following corollary, which is directly implied by the inequality in the previous proof.

Corollary. If there exist admissible vectors x and y of the dual linear problems such that the objective functions satisfy c · x = yT · b, then both are optimal solutions of the corresponding problems.

3.1.5. Theorem (Strong duality theorem). If a standard problem of linear programming is solvable and bounded, then its dual is also bounded and solvable. There exists an optimal solution for each of the problems, and the optimal values of the corresponding objective functions are equal.

Proof. As already proved in the latter corollary, once it is established that the values of the objective functions of the dual problems are equal, we have the required optimal solutions to both problems. It remains to prove the other implication, i.e. the existence of an optimal solution under the assumptions of the theorem, as well as the fact that the objective functions share their values in such a case. This will be verified by delivering an efficient algorithm in the next paragraphs. □

We notice yet another corollary of the just formulated duality theorem:

Corollary (Equilibrium theorem). Consider two admissible vectors x and y for the standard maximization problem and its dual problem as defined in 3.1.4. Then both vectors are

During each daily meal, each foal requires a minimum of 2 kg of oats. The average cost, including transportation, is €1.80 per kg of hay and €1.60 per kg of oats. Design a daily diet for one foal that minimizes costs. ⃝

As explained earlier, when the feasible region of a linear programming problem with two variables is non-empty and bounded, the maximum value of the given linear cost function h will be found at one of the extreme points (corners) of this region, in the direction of the normal vector to the line defined by h. This principle also holds true in higher dimensions, as we will see below.

3.A.6. Linear programming on Rn. Consider a function f : Rn → R of the form f(x1, . . . , xn) = c0 + c1x1 + · · · + cnxn for some constants c0, c1, . . . , cn. Due to the appearance of the constant term c0, the function f is an affine function.² Since f is affine, the difference of the values of f at the points A = (a1, . . . , an) and B = A + u = (a1 + u1, . . . , an + un) equals f(B) − f(A) = c1u1 + · · · + cnun. Notice that the latter is the dot product of the vectors (c1, . . . , cn)T and (u1, . . . , un)T in Rn. Next we will refer to f by the term “objective function”.
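Digressing briefly to the weak duality lemma above: with the data of 3.A.1 and admissible vectors of our own choosing, the inequality c · x ≤ yT · b is easy to watch numerically in Sage:
A = matrix(QQ, [[1, 2], [2, 1]]); b = vector(QQ, [3, 3]); c = vector(QQ, [1, 1])
x = vector(QQ, [1/2, 1]); y = vector(QQ, [1/3, 1/3])
print(all(u <= w for u, w in zip(A*x, b)))   # x is admissible: True
print(all(u >= w for u, w in zip(y*A, c)))   # y is admissible: True
print(c*x, y*b)                              # 3/2 and 2, so c*x <= y*b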
The relation between the scalar product and the cosine of the angle between vectors ensures that fixing the value of the given function f defines a hyperplane in Rn with normal vector (c1, . . . , cn)T. This hyperplane splits the space Rn into two half-spaces. Clearly the given function grows when moving into one of these half-spaces and declines when moving into the other. This is essentially the same principle as we saw when discussing the visibility of segments in dimension 2 (we checked whether the observer is to the left or to the right of the oriented segment, cf. 1.5.12). At the same time, each linear inequality also defines a half-space, and we shall learn the properties of intersections of half-spaces in Chapter 4 (they form the so-called simplexes). These observations lead to an algorithm for finding the extremal values of the linear objective function f on the set of admissible points defined by linear inequalities. For the sake of simplicity, we shall deal with the “standard problem” of linear programming. This means that we treat the task of maximizing the linear function
h(x1, . . . , xn) = ∑_{j=1}^n cj xj = cT x
subject to the conditions ∑_{j=1}^n aij xj ≤ bi and xj ≥ 0, with i = 1, . . . , m, j = 1, . . . , n, respectively. According to the discussion in 3.1.6, our first task is to add the (non-negative) slack variables xs (also called slacks), one for each of the less-than inequalities, to reduce them to equalities:

²Recall that affine functions are essentially linear functions plus a constant offset.

optimal if and only if yi = 0 for all coordinates with index i for which ∑_{j=1}^n aij xj < bi, and simultaneously xj = 0 for all coordinates with index j such that ∑_{i=1}^m yi aij > cj.

Proof. Suppose both relations concerning the zeros among the xj and yi are true. Since the summands with strict inequality have zero coefficients, we have
∑_i yi bi = ∑_i yi ∑_j aij xj = ∑_i ∑_j yi aij xj,
and for the same reason
∑_i ∑_j yi aij xj = ∑_j cj xj.
This shows one implication, by the duality theorem. Suppose now that both x and y are optimal vectors. Then
∑_i yi bi ≥ ∑_i ∑_j yi aij xj ≥ ∑_j cj xj.
But the left- and right-hand sides are equal, and hence there is equality everywhere. If we rewrite the first equality as
∑_i yi (bi − ∑_j aij xj) = 0,
then we see that it can be satisfied only if the relation from the statement holds, since it is a sum of non-negative numbers which equals zero. From the second equality we similarly derive the second part, and the proof is finished. □

The duality theorem and the equilibrium theorem are useful when solving linear programming problems, because they reveal the relations between the zeros among the additional variables and the fulfilment of the constraints. As usual, it is good to know that the problem is solvable in principle and to have some theory related to that, but we still need some clever ideas to turn it all into an efficient algorithmic procedure. The next paragraph provides some insight into this.

3.1.6. The algorithm. As already explained, the linear programming problem of maximizing the linear objective function h = c x under the conditions A x ≤ b can be turned into solving the system of equations (1) in 3.1.3, where we added the slack variables xs. If all entries in b are non-negative, then the choice xs = b and x = 0 provides an admissible solution of the system with the value of the objective function h = 0.
This is the choice of the origin x = 0 as one of the vertices of the distinguished region M of admissible points. We can understand this as choosing the variables xs as the basic variables, whose values are given by the right-hand sides of the equations, while all the other variables are set to zero. In the general case (allowing negative entries in b), we shall see in 4.1.11 that we can always find an admissible vertex. That is, the choice of the basic variables in the above

∑_{j=1}^n aij xj + (xs)i = bi, for all i = 1, . . . , m.

Therefore, our goal is to maximize h over the solution set of a system of linear equations, while ensuring that all coordinate values are non-negative. This represents the “canonical form” of a linear programming problem. In fact, it is useful to have a notation in which the slacks are more or less indistinguishable from the original variables. Therefore, we will often write (x1, . . . , xn, xn+1, . . . , xn+m) instead of (x1, . . . , xn, (xs)1, . . . , (xs)m), and then the above equations take the form
∑_{j=1}^n aij xj + xn+i = bi, i = 1, . . . , m.
If there are inequalities of the opposite direction, we can transform them into our standard form by multiplying them by −1. Additionally, minimizing h is equivalent to maximizing −h. Therefore, we can reduce all linear programming problems to the standard form described above. The simplex method is an iterative process in which we begin with a less-than-optimal “solution” that meets the given equations and non-negativity constraints, and we then seek a new solution that improves upon it by increasing the objective function value. Using the Gaussian elimination method, we iterate this process until we reach a solution that cannot be further improved – thus achieving an optimal solution. Next, we will analyze the key steps in detail, summarizing the discussion given in 3.1.1 and 3.1.6. To illustrate these steps clearly, we will demonstrate them on a straightforward example.

Guide example. Consider the task of maximizing h(x1, x2) = 140x1 + 100x2 under the constraints
8x1 + 8x2 ≤ 960, 4x1 + 2x2 ≤ 400, 4x1 + 3x2 ≤ 420,
together with the positivity conditions x1 ≥ 0, x2 ≥ 0. The canonical form is obtained by maximizing h subject to
8x1 + 8x2 + x3 = 960, 4x1 + 2x2 + x4 = 400, 4x1 + 3x2 + x5 = 420,
with xi ≥ 0 for all i = 1, . . . , 5. The slacks are the variables x3, x4, x5, and in terms of matrices we have

$$\tilde A = (A \mid I_3) = \begin{pmatrix} 8 & 8 & 1 & 0 & 0 \\ 4 & 2 & 0 & 1 & 0 \\ 4 & 3 & 0 & 0 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 960 \\ 400 \\ 420 \end{pmatrix},$$

and x = (x1, x2, x3, x4, x5)T such that Ãx = b.

3.A.7. The simplex algorithm. To keep the exposition simple, from now on we restrict ourselves to the case where all bi are non-negative, bi ≥ 0 for all i = 1, . . . , m.

Step 1. Using the canonical form of an LP problem, we construct the initial simplex tableau. The first row of this tableau consists of the coefficients of h (with negative signs), while

sense, describing an admissible solution. Next, we shall assume that we have such a vertex already. The idea of the algorithm is to perform equivalent row transformations of the entire system in such a way that we move to other vertices of the region M while the function h increases. In order to move to more interesting vertices of M, we must bring some of the slack variables to zero, while the corresponding column of the unit matrix moves to one of the columns corresponding to the variables x.
A simple check reveals that in order to do this, we must choose one of the negative entries in the first row of the matrix 3.1.3(1), pick this column, and choose a row in such a way that, when Gaussian elimination is used to push the other entries in this column to zero, the right-hand sides of the equations remain non-negative. The latter condition means that we have to choose the index i such that bi/aij is minimal. This entry of the matrix is called the pivot for the next step of the elimination. Of course, non-positive coefficients aij are not taken into consideration, since they would not lead to any increase of the objective function. When there are no more negative entries in the first row, we are finished, and the claim is that the optimal value of h appears in the right-hand top corner of the matrix. Before indicating the proof of all the above claims, we show how all this works for the simple problem from 3.F.1. In practice, the very first column of the matrix in question does not change during the procedure at all, so we can omit it completely. Thus we deal with the matrix tableaux:

−4  −6  0  0  0  0 |   0
 1   2  1  0  0  0 | 120
 1   4  0  1  0  0 | 180
−1   1  0  0  1  0 | −90
 1   0  0  0  0  1 | 110

We cannot find an admissible solution by fixing xs as the basic variables here, since there are negative values in b. We try to initiate the above algorithm by changing the sign in the last but one row and performing Gaussian elimination on the very first column, aiming to have only the 1 in the last but one row there. We obtain:

 0  −10  0  0  −4  0 | 360
 0    3  1  0   1  0 |  30
 0    5  0  1   1  0 |  90
 1   −1  0  0  −1  0 |  90
 0    1  0  0   1  1 |  20

We choose the boxed entries for the basic variables; this represents the values x1 = 90, x2 = 0, x3 = 30, x4 = 90, x5 = 0, x6 = 20, and h = 360 = 4 · 90 = −4 · (−90), which is an admissible solution. We have also circled the pivot for the next step, i.e. the element in the second column which we want to replace with 1, eliminating the rest of the column (remember, this is the one yielding the smallest ratio with the

the subsequent rows are derived from the matrices (A|I) and b, i.e.,

−c1  · · ·  −cn | 1  0  · · ·  0 |  0
a11  · · ·  a1n | 1  0  · · ·  0 | b1
a21  · · ·  a2n | 0  1  · · ·  0 | b2
 ⋮            ⋮ | ⋮           ⋮ |  ⋮
am1  · · ·  amn | 0  0  · · ·  1 | bm

In this initial tableau, there are m basic variables and n non-basic variables, a configuration that remains consistent throughout the process. Recall that a basic variable is one that can take non-zero values, whereas a non-basic variable is one that equals zero in the current solution of the problem. Basic variables correspond to the columns of the tableau in which exactly one entry equals one and all other entries are zero. To initiate the algorithm, we choose all slack variables as basic variables. This is achieved by setting the initial decision variables x1, . . . , xn to zero in the equations ∑_{j=1}^n aij xj + xn+i = bi, for all i. This initialization ensures an initial feasible solution, though not necessarily an optimal one. For the guide example the initial tableau has the form

        x1    x2    x3  x4  x5
R0  −h  −140  −100  0   0   0 |   0
R1  x3   8     8    1   0   0 | 960
R2  x4   4     2    0   1   0 | 400
R3  x5   4     3    0   0   1 | 420

where R0, . . . , R3 denote the rows of the table. An initial feasible solution is given by x1 = x2 = 0 and x3 = 960, x4 = 400, x5 = 420. This yields h = 0, and since there are negative values in R0, a better solution exists.

Step 2. We now move in iterated steps (compare the more theoretical explanation in 3.1.6).
(a) We locate the column with the most negative value in the first row of the tableau (locating the leftmost column with a negative value also works). This column, say the jth one, is called the work column. For the guide example, the work column corresponds to the variable x1, since −140 < −100.
(b) In the work column we pick the positive entry aij of A which provides the minimal ratio bi/aij. If more than one element yields the same smallest ratio, we choose one of them. We call this entry the pivot. For our guide example, the pivot appears in row R2 (circled), because 400/4 < 420/4 < 960/8.
(c) To configure the new tableau we proceed with the Gaussian method. Our goal is to eliminate the current work column, and to determine the new pivot and the new basic variable (recall that within each iteration of the simplex method, exactly one variable goes from non-basic to basic and exactly one variable goes from basic to non-basic). To illustrate this on the guide example, denote by R̂2 = (1, 1/2, 0, 1/4, 0 | 100) the result of dividing the row R2 by the pivot 4. The elementary row operations R0 → R0 + 140R̂2,

last right-hand column entry among the positive elements: 30/3 = 10, which is less than 90/5 = 18 and 20/1 = 20). This leads to the next admissible vertex of our region M and, of course, the value of h increases:

 0  0  10/3  0  −2/3  0 | 460
 0  1   1/3  0   1/3  0 |  10
 0  0  −5/3  1  −2/3  0 |  40
 1  0   1/3  0  −2/3  0 | 100
 0  0  −1/3  0   2/3  1 |  10

with x1 = 100, x2 = 10, x3 = x5 = 0, x4 = 40, x6 = 10, and h = 460 = 4 · 100 + 6 · 10 = (10/3) · 120 − (2/3) · (−90). We still have one negative entry in the first row. We circled the next pivot, leading to

 0  0    3   0  0    1 | 470
 0  1   1/2  0  0  −1/2 |   5
 0  0   −2   1  0    1 |  50
 1  0    0   0  0    1 | 110
 0  0  −1/2  0  1   3/2 |  15

with the final values x1 = 110, x2 = 5, x3 = 0, x4 = 50, x5 = 15, x6 = 0, and h = 470 = 4 · 110 + 6 · 5 = 3 · 120 + 1 · 110. Let us recall why we can be sure that this is the optimal solution. Thanks to the fact that the first row is exclusively non-negative, we have obtained an admissible solution of the dual problem which leads to the same value as the solution of the original one. Thus the equilibrium theorem implies that we are done!

Correctness of the algorithm. Let us come back to the above claims in some detail. We should check whether the algorithm provides the right answer when it terminates, but we should also check under which conditions it terminates. We start with the first part. A special feature of the above algorithm reshuffling the basic variables is that the slack-variable parts of the matrix are closely linked to the dual linear programming problem. Moreover, there is an invariant of the entire procedure:

Claim. Writing (−ĉ, ĉs, ĥ) for the current first row of the matrix and (x̂, x̂s) for the current values of the variables, we obtain c · x̂ = ĉs · b = ĥ at each step. In particular, at the moment of termination of the above algorithm, the coefficients y = ĉs in the first row represent admissible values of the dual problem (while the values ĉ stand for the slack variables of the dual problem), and the right-hand top corner provides the value of the corresponding objective function yT · b.

Since the two objective functions are equal, we know that the algorithm provides the optimal solution.
R1 → R1 − 8R̂2, and R3 → R3 − 4R̂2 yield the second simplex tableau, given by

         x1  x2   x3  x4   x5
R̂0  −h   0  −30   0   35   0 | 14000
R̂1  x3   0   4    1  −2    0 |   160
R̂2  x1   1   1/2  0   1/4  0 |   100
R̂3  x5   0   1    0  −1    1 |    20

The variable x4 is replaced by x1, which becomes basic, and hence we moved it to the column of basic variables. Together with x3, x5, these are now the new basic variables, and we have the solution x1 = 100, x2 = 0 = x4, x3 = 160, x5 = 20 (the values of the basic variables are read from the last column). Since there is still a negative number in R̂0, we proceed.
(d) We repeat steps (a), (b) and (c), and terminate the procedure when there are no more negative entries in the first row. For our example, from the second tableau one deduces the new work column, that of x2, and the new pivot, which is the circled number 1. Applying the elementary row operations R̂0 → R̂0 + 30R̂3, R̂1 → R̂1 − 4R̂3, and R̂2 → R̂2 − (1/2)R̂3, we arrive at the third tableau:

     x1  x2  x3  x4    x5
−h   0   0   0   5     30 | 14600
x3   0   0   1   2    −4  |    80
x1   1   0   0   3/4  −1/2 |   90
x2   0   1   0  −1     1  |    20

The new basic variable is x2 and it replaces x5. So we have the solution x1 = 90, x2 = 20, x3 = 80 and x4 = 0 = x5. Since the above tableau is the final one, this solution is optimal, and we may read the maximal value of h in the right top corner of the tableau, that is, h(90, 20) = 14600. In the solution obtained above all the original variables are among the basic ones, and so their values are non-zero. This is based on the assumption b ≥ 0. The problem 3.F.1 presented in the final section of this chapter encodes the more general situation; its explanation via this simplex algorithm is given in 3.1.6.

3.A.8. A note on duality. By the theoretical discussion in 3.1.4 we know that when the primal LP problem is a maximization one, its dual is a minimization problem, and conversely. The number of decision variables in the dual problem is the same as the number of less-than inequalities in the primal problem, while the number of constraints in the dual problem equals the number of decision variables in the primal one. Note that when some constraint appears as an equality in the primal (maximization) problem, then the corresponding dual variable is unrestricted. The coefficients of the objective function of the primal problem form the right-hand side of the constraints of the dual one, while the right-hand side of the primal determines the objective function of the dual LP problem.

Proof of the Claim. As we know, Gaussian elimination can be expressed as multiplication by a suitable transition matrix T from the left. Our “pivoting” always corresponds to a (block-wise) matrix
$$T = \begin{pmatrix} 1 & y^T \\ 0 & R \end{pmatrix},$$
with an invertible matrix R. Thus (dealing with the standard problem)
$$T \cdot \begin{pmatrix} 1 & -c & 0 & 0 \\ 0 & A & E_m & b \end{pmatrix} = \begin{pmatrix} 1 & -c + y^T A & y^T & y^T b \\ 0 & RA & R & Rb \end{pmatrix}$$
and the second row corresponds (since R is invertible) to the equality Ax̂ + x̂s = b. Consequently, we arrive at −ĉ x̂ = (−c + yT A) x̂ = −c x̂ + yT (b − x̂s). Now, notice that the components of x̂s vanish whenever the corresponding components of yT are non-zero, i.e., yT x̂s = 0. Similarly, ĉ x̂ = 0, and so we read from the latter displayed equality c x̂ = yT b, as claimed. □

The termination of the algorithm is a more subtle question. While it terminates nearly always, it might happen that the algorithm cycles without enlarging the objective function. We shall not go into more details here; a simple example is given in 3.F.5. □
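Readers who wish to trace such pivoting steps mechanically may try Sage's interactive simplex implementation (available in recent Sage versions; we sketch its use for the guide example, without guaranteeing the exact output format):
A = ([8, 8], [4, 2], [4, 3]); b = (960, 400, 420); c = (140, 100)
P = InteractiveLPProblemStandardForm(A, b, c)
P.run_simplex_method()     # displays every tableau and pivot choice
print(P.optimal_value())   # 14600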
3.1.7. Notes about linear models in economy. Our simple scheme of the black&white painter from paragraph 3.1.1 can be used to illustrate one of the typical economic models, the model of production planning. The model tries to capture the problem completely, that is, to capture both external and internal relations. The left-hand sides of the inequalities (1), (2), (3) in 3.1.1 and the objective function h(x1, x2) express various production relations. Depending on the character of the problem, we have on the right-hand sides either exact values (and so we solve equations) or capacity constraints and goal optimization (then we obtain linear programming problems). Thus in general we can solve the problem of source allocation with supplier constraints and either minimize costs or maximize income. Among economic models we can find many modifications. One of them is the problem of financial planning, which is connected with the optimization of a portfolio. We set up a volume of investment into individual investment possibilities with the goal of meeting the given constraints on risk factors while maximizing the profit, or dually, minimizing the risk for a given volume. Another common model is the marketing application, for instance the allocation of costs for advertisement in various media, or the placement of advertisements into time intervals. Restrictions are in this case determined by the budget, the target population, etc. Very common are models of nutrition, that is, setting up how much of various kinds of food should be eaten in order to meet given total volumes of specific components, e.g. minerals and vitamins.

For our guide example, the dual problem is the task of minimizing 960y1 + 400y2 + 420y3 subject to
8y1 + 4y2 + 4y3 ≥ 140, 8y1 + 2y2 + 3y3 ≥ 100,
with yi ≥ 0 for all i = 1, 2, 3. The final tableau of the primal problem also provides the solution of the dual one. According to the strong duality theorem (see 3.1.5), the minimal value is again 14600, while the corresponding values of the dual variables y1, y2 and y3 are read off the first row of the corresponding final tableau, that is, y1 = 0, y2 = 5, y3 = 30.

3.A.9. For the following LP problems, find their standard form and describe the corresponding dual problem.
(a) Maximize h = x1 + 2.5x2 subject to 2x1 + 3x2 ≤ 20, x1 + x2 ≥ −1, x1 − 2x2 = 1, with x1 ≥ 0, x2 ≥ 0.
(b) Minimize h = 2x1 + 3x2 + 2x4 subject to x1 + 2x2 + 2x3 ≤ −6, x1 + 4x2 − 2x4 = 5, x2 − x3 + 4x4 ≥ 2, with xi ≥ 0 for all i = 1, . . . , 4. ⃝

3.A.10. LP problems with redundant constraints. Use the simplex method to solve the LP problem of maximizing the linear function h = 4x1 + 9x2 subject to the conditions
x1 + 4x2 ≤ 8, x1 + 2x2 ≤ 4, x1 ≥ 0, x2 ≥ 0.

Solution. The corresponding feasible region is the grey region in the figure below, and as we see, the first constraint is redundant. The evaluation of h at the three corners of this triangle shows that h is maximized at (x1 = 0, x2 = 2). We want to verify this result by the simplex method. The canonical form of the problem is the following: maximize h = 4x1 + 9x2 + 0x3 + 0x4 subject to
x1 + 4x2 + x3 = 8, x1 + 2x2 + x4 = 4,
with xi ≥ 0 for i = 1, . . . , 4. The variables x3, x4 are the slacks, and we begin with the first simplex tableau:

        x1  x2  x3  x4
R0  −h  −4  −9  0   0 | 0
R1  x3   1   4  1   0 | 8
R2  x4   1   2  0   1 | 4

The work column is the column of x2, and as pivot one may fix the circled number 4. To eliminate the work column, replace R1 by R̂1 := (1/4)R1 and apply the row operations R0 → R0 + 9R̂1 and R2 → R2 − 2R̂1.
Problems of linear programming arise also with personnel tasks, where workers with specific qualifications and other properties are distributed into working shifts. Common are also problems of merging, problems of splitting, and problems of goods distribution.

2. Difference equations

We have already met difference equations in the first chapter, albeit briefly and of first order only. Now we consider a more general theory for linear equations with constant coefficients. This not only provides very practical tools but also represents a good illustration of the concepts of vector spaces and linear mappings.

Homogeneous linear difference equation of order k

3.2.1. Definition. A homogeneous linear difference equation (or homogeneous linear recurrence) of order k is given by the expression
a0 xn + a1 xn−1 + · · · + ak xn−k = 0, a0 ≠ 0, ak ≠ 0,
where the coefficients ai are scalars, which may possibly depend on n. A solution of this equation is a sequence of scalars xi, for all i ∈ N (or i ∈ Z), which satisfy the equation for every n. We often understand the sequence in question as a function
xn = f(n) = −(a1/a0) f(n − 1) − · · · − (ak/a0) f(n − k).
By giving any k consecutive values xi in the sequence, all the other values are determined uniquely. Indeed, we work over a field of scalars, so the values a0 and ak are invertible; hence, using the recurrence, any xn can be computed uniquely from the preceding k values, and similarly for xn−k. Induction thus immediately proves that all the remaining values are uniquely determined. The space of all infinite sequences xi forms a vector space, where addition and multiplication by scalars work coordinate-wise. The definition immediately implies that a sum of two solutions of a homogeneous linear difference equation, or a scalar multiple of a solution, is again a solution. Analogously to homogeneous linear systems, we see that the set of all solutions forms a subspace. Initial conditions on the values x0, . . . , xk−1 of a solution represent a k-dimensional vector in Kk. The sum of initial conditions determines the sum of the corresponding solutions, and similarly for scalar multiples. Note also that substituting zeros and ones into the initial k values immediately yields k linearly independent solutions of the difference equation. Thus, although the vectors are infinite sequences, the set of all solutions has finite dimension. The dimension equals the order k of the equation. Moreover, we can easily obtain a basis of all those solutions. Again we speak of the fundamental

This brings the variable x2 into the column of basic variables, while x3 leaves. Thus the second simplex tableau has the form

        x1    x2  x3    x4
R̂0  −h  −7/4  0   9/4   0 | 18
R̂1  x2   1/4  1   1/4   0 |  2
R̂2  x4   1/2  0  −1/2   1 |  0

We observe that in this tableau the basic variable x4 takes the zero value, which means that there is a redundant constraint. Such a basic feasible solution, with at least one basic variable equal to zero, is called degenerate. Now, the column of x1 is the new work column and the new pivot is the circled 1/2. We replace R̂2 by Ř2 := 2R̂2 and apply the row operations R̂1 → R̂1 − (1/4)Ř2 and R̂0 → R̂0 + (7/4)Ř2. Thus x1 enters the column of basic variables, x4 leaves, and the final tableau is given by

        x1  x2  x3    x4
Ř0  −h   0   0  1/2   7/2 | 18
Ř1  x2   0   1  1/2  −1/2 |  2
Ř2  x1   1   0  −1    2   |  0

This gives us the optimal solution (x2 = 2, x1 = 0), which is degenerate, with the optimal value of the objective function being h = 18. □
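Once more, the result is easy to confirm with the interface of 3.A.12 (our own sketch):
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective(4*x1 + 9*x2)
p.add_constraint(x1 + 4*x2 <= 8)
p.add_constraint(x1 + 2*x2 <= 4)
print(p.solve(), p.get_values(x1), p.get_values(x2))   # 18.0 0.0 2.0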
3.A.11. Cycling. Observe that during the second iteration above, the value of the given h did not increase. When a sequence of iterations returns to a previously constructed tableau, we say that the simplex method cycles. In this case the method enters a loop, without improving the objective function value; see also the end of section 3.1.6. Cycling can occur only in the presence of degeneracy. However, there are many LP problems that are degenerate and do not cycle (as above). Examples of LP problems for which the simplex method cycles are in general difficult to construct, and rare in practice. We describe such an example in Section 3.F.5, where a remedy to this pitfall is also presented. Numerous software packages are available to handle the simplex method and solve linear programming problems via a computer. As is customary in our book, next we will use Sage to implement the tasks at hand.

3.A.12. LP treated by Sage. In Sage, for handling linear programming problems, one can utilize the mixed integer linear programming (MILP) package. Next, we will demonstrate the procedure using our guide example from section 3.A.7. Let us summarize the main steps:
1. We initialize the linear programming algorithm:
p = MixedIntegerLinearProgram()
2. We introduce the non-negative decision variables:
v = p.new_variable(real=True, nonnegative=True)

system of solutions, and all other solutions are its linear combinations. As we have just checked, if we choose k consecutive indices i, i + 1, . . . , i + k − 1, the homogeneous linear difference equation gives a linear mapping Kk → K∞ of k-dimensional vectors of initial values into infinite-dimensional sequences of the same scalars. The independence of such solutions is equivalent to the independence of the initial values – which can easily be checked by a determinant: if we have a k-tuple of solutions (x[1]n, . . . , x[k]n), it is independent if and only if the following determinant, sometimes called the Casoratian, is non-zero for some n:

$$C_n = \begin{vmatrix} x^{[1]}_n & \cdots & x^{[k]}_n \\ x^{[1]}_{n+1} & \cdots & x^{[k]}_{n+1} \\ \vdots & & \vdots \\ x^{[1]}_{n+k-1} & \cdots & x^{[k]}_{n+k-1} \end{vmatrix} \neq 0.$$

Notice that the determinant Cn+1 is obtained from Cn by replacing the first row of Cn by the last row of Cn+1, expressed as the linear combination of the rows of Cn given by the difference equation, and then moving this row to the last position. Thus Cn+1 = (−1)^k (ak/a0) Cn, with the values a0, ak corresponding to n + k. In particular, linearly independent initial conditions lead to independent k-dimensional vectors in all consecutive k components of the solution.

3.2.2. Recurrences with constant coefficients. It is difficult to find a universal mechanism for finding a solution (that is, a directly computable expression) of general homogeneous linear difference equations. We shall come back to this problem at the end of chapter 13. In practical models one very often meets equations where the coefficients are constant. In this case it is possible to guess a suitable form of the solution and indeed to find k linearly independent solutions. This is then a complete solution of the problem, since all other solutions are linear combinations of them. For simplicity we start with equations of second order. Such recurrences are very often encountered in practical problems, where there are relations based on two previous values.
A linear difference equation (recurrence) of second order with constant coefficients is thus a formula

(1) f(n + 2) = a f(n + 1) + b f(n) + c,

where a, b, c are known scalar coefficients. Consider a population model, where the individuals in a population mature and start breeding two seasons later (that is, they add to the value f(n + 2) by a multiple b f(n) with positive b > 1), while the immature individuals at the same time weaken and destroy part of the mature population (that is, the coefficient a at f(n + 1) is negative). Furthermore, it might be that somebody destroys (uses, eats) a fixed amount −c of individuals every season.

For the guide example there are two decision variables, and to introduce them one should additionally type
x1, x2 = v["x1"], v["x2"]
3. We set the objective function by typing
p.set_objective(140*x1 + 100*x2)
4. Next we set the constraints. For the guide example this can be done as follows:
p.add_constraint(8*x1 + 8*x2 <= 960)
p.add_constraint(4*x1 + 2*x2 <= 400)
p.add_constraint(4*x1 + 3*x2 <= 420)
5. If k is the maximum of the objective function, we let the program know about this by typing
k = p.solve()
Since x1, x2 are the coordinates of the solution, we should also include the code
x1, x2 = p.get_values(x1,x2)
6. Finally, to get the output we write
print("Answer =", round(k, 2))
print("(x1, x2) =", (x1, x2))
and Sage prints out the following answer:
Answer = 14600.0
(x1, x2) = (90.0, 20.0)
Let us summarize the code all together, without interruptions:
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective(140*x1 + 100*x2)
p.add_constraint(8*x1 + 8*x2 <= 960)
p.add_constraint(4*x1 + 2*x2 <= 400)
p.add_constraint(4*x1 + 3*x2 <= 420)
k = p.solve()
x1, x2 = p.get_values(x1,x2)
print("Answer =", round(k, 2))
print("(x1, x2) =", (x1, x2))
Problems with more decision variables can be treated similarly. For instance, if you need three variables, give the cell
v = p.new_variable(real=True, nonnegative=True)
x1, x2, x3 = v["x1"], v["x2"], v["x3"]
Further tasks in linear programming and applications of Sage are described in section F. It is noteworthy that an elegant application of linear programming involves game theory. In 3.F.6 we shall explore an application in this direction concerning zero-sum games, also known as “matrix games”. In these games, the interests of the two players are directly opposed, meaning one player's loss is the other player's gain. The term “zero-sum” indicates that the sum of the payoffs for both players always equals zero, regardless of the game's outcome. As detailed in

A similar situation with c = 0 and both other coefficients positive determines the famous Fibonacci sequence of numbers y0, y1, . . ., where yn+2 = yn+1 + yn, see 3.B.1. If we have no idea how to solve a mathematical problem, we can always blindly try known solutions of similar problems. Thus, let us substitute into the equation (1) with coefficient c = 0 a solution similar to that of the linear equations from the first chapter (cf. 1.2.1), that is, we try f(n) = λ^n for some scalar λ. Substitution into the equation yields
λ^{n+2} − aλ^{n+1} − bλ^n = λ^n (λ² − aλ − b) = 0.
This relation holds either for λ = 0 or for the choice of the values
λ1 = (1/2)(a + √(a² + 4b)), λ2 = (1/2)(a − √(a² + 4b)).
It is easy to see that such solutions work; we just had to choose the scalar λ suitably.
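This substitution is easy to replay symbolically in Sage (a sketch of ours; we write lam for λ, since lambda is a reserved word):
var('a b n lam')
expr = lam^(n+2) - a*lam^(n+1) - b*lam^n
print((expr / lam^n).simplify_full())        # lam^2 - a*lam - b
print(solve(lam^2 - a*lam - b == 0, lam))    # the two roots λ1, λ2 above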
But we are not finished, since we want to find a solution for any two initial values f(0) and f(1). So far, we have only found two specific sequences satisfying the given equation (possibly even only one sequence, if λ2 = λ1). As we have already derived for linear recurrences, the sum of two solutions f1(n) and f2(n) of our equation f(n + 2) − a f(n + 1) − b f(n) = 0 is again a solution of the same equation, and the same holds for scalar multiples of a solution. Our two specific solutions thus generate the more general solutions f(n) = C1 λ1^n + C2 λ2^n for arbitrary scalars C1 and C2. For a unique solution of the specific problem with given initial values f(0) and f(1), it remains only to find the corresponding scalars C1 and C2.

3.2.3. The choice of scalars. We show how this can work on an example. Consider the problem

(1) yn+2 = yn+1 + (1/2) yn, y0 = 2, y1 = 0.

Here λ1,2 = (1/2)(1 ± √3), and clearly
y0 = C1 + C2 = 2,
y1 = (1/2) C1 (1 + √3) + (1/2) C2 (1 − √3)
is satisfied for exactly one choice of these constants. Direct calculation yields C1 = 1 − (1/3)√3, C2 = 1 + (1/3)√3, and our problem has the unique solution
f(n) = (1 − (1/3)√3) (1/2^n)(1 + √3)^n + (1 + (1/3)√3) (1/2^n)(1 − √3)^n.
Note that even though the found solution of our equation with rational coefficients and rational initial values looks complicated and is expressed with irrational numbers, we know a priori that the solution itself is again rational. But without this “step aside” into a larger field of scalars, we would not be able to describe the general solution.

3.F.6, such problems can be effectively addressed using stochastic matrices.

B. Recurrence relations

A recurrence relation for a given sequence defines a relationship among its terms, often useful for modeling recurring patterns or behaviours. Such relations are particularly valuable in scenarios where a recursive structure is observed repeatedly. For instance, they serve as a powerful tool for describing growth models. Without doubt, the simplest recurrence relations are the “homogeneous linear recurrence relations” (with constant coefficients), expressed as
x_{n+k} = α1 x_{n+k−1} + α2 x_{n+k−2} + · · · + αk x_n,
where the αi are complex numbers with αk ≠ 0. The number k is called the order of the recurrence relation; see also Section 3.2.1 for details. When the right-hand side includes a constant term, we get the notion of a “non-homogeneous linear recurrence relation” (with constant coefficients). We begin with a well-known example of a second-order homogeneous linear recurrence relation, historically used to model populations of rabbits.

3.B.1. Rabbits and the Fibonacci sequence. Consider a scenario where, at the start of spring, a stork brings two newborn rabbits, one male and one female, to a meadow. Each female rabbit becomes fertile at two months old and gives birth to a pair of newborns, one male and one female, every month thereafter. Specifically, each female is pregnant for one month before producing new offspring. How many pairs of rabbits will be present after nine months, assuming no deaths and that none “move in”?

Solution. After the first month, there is still only one pair of rabbits, but the female is already pregnant. By the end of the second month, the first offspring are born, resulting in two pairs. Each subsequent month, the number of new pairs equals the number of pregnant females in the previous month. This corresponds to the number of pairs at least one month old, which coincides with the number of pairs that were there two months ago.
Let us denote by F_n the total number of pairs after n months. According to the explanation given above, for the first months we compute

F_1 = 1, F_2 = 1, F_3 = F_2 + F_1 = 2, F_4 = F_3 + F_2 = 3,
F_5 = F_4 + F_3 = 5, F_6 = F_5 + F_4 = 8, …

Hence F_n equals the sum of the numbers of pairs in the previous two months, and this is described by the following homogeneous linear recurrence relation:

F_{n+2} = F_{n+1} + F_n,  n = 1, 2, ….

We will often meet similar phenomena. Moreover, the general solution often allows us to discuss the qualitative behaviour of the sequence of numbers f(n) without direct enumeration of the constants. For example, we may see whether the values approach some fixed value with increasing n, or oscillate in some interval, or are unbounded.

3.2.4. General homogeneous recurrences. We substitute x_n = λ^n, for some (yet unknown) scalar λ, into the general homogeneous equation from the definition 3.2.1 (with constant coefficients). For every n we obtain the condition

λ^{n−k}(a_0 λ^k + a_1 λ^{k−1} + ⋯ + a_k) = 0.

This means that either λ = 0 or λ is a root of the so-called characteristic polynomial in the parentheses. The characteristic polynomial is independent of n.

Assume that the characteristic polynomial has k distinct roots λ_1, …, λ_k. For this purpose, we may extend the field of scalars we are working in, for instance Q into R or C. Of course, if the initial conditions are in the original field, then the solutions stay there too, since the recurrence equation itself does. Each of the roots gives us a single possible solution x_n = (λ_i)^n. We need k linearly independent solutions, so we check the independence by substituting the k values n = 0, …, k−1 for the k choices of λ_i into the Casoratian (see 3.2.1). This yields the Vandermonde matrix. It is a good but not entirely trivial exercise to show that for every k and any k-tuple of distinct λ_i the determinant of such a matrix is non-zero, see 2.E.21 on page 151. It follows that the chosen solutions are linearly independent. Thus we have found the fundamental system of solutions of the homogeneous difference equation in the case that all the (possibly complex) roots of its characteristic polynomial are distinct.

Now we suppose λ is a multiple root and ask whether x_n = nλ^n could be a solution. We arrive at the condition

a_0 nλ^n + ⋯ + a_k(n − k)λ^{n−k} = 0.

This condition can be rewritten as

λ·(a_0 λ^n + ⋯ + a_k λ^{n−k})′ = 0,

where the prime denotes differentiation with respect to λ (cf. the infinitesimal definition in 5.1.6, and 12.3.7 for the purely algebraic treatment). Moreover, a root c of a polynomial f has multiplicity greater than one if and only if it is a root of f′, see 12.3.7 for the proof. Our condition is thus satisfied. With greater multiplicity ℓ of the root of the characteristic polynomial, we can proceed similarly and use the (now obvious) fact that a root with multiplicity ℓ is a root of all derivatives of the polynomial up to order ℓ − 1 (inclusively).

This relation, along with the initial conditions F_1 = 1 and F_2 = 1, uniquely determines the numbers of pairs of rabbits on the meadow in the individual months. Observe that for a suitable r the sequence r^n is a solution of the difference equation (without initial conditions). Such r can be obtained by substituting into the recurrence relation, r^{n+2} = r^{n+1} + r^n, and after dividing by r^n we obtain

r^2 = r + 1.
This is the characteristic equation of the given recurrence, and the numbers (1 − √5)/2 and (1 + √5)/2 are its roots.³ Thus, according to the theory in 3.2.4, the solution of the given recurrence relation will be a sequence of the form F_n = a·x_n + b·y_n, with

x_n := ((1 − √5)/2)^n,  y_n := ((1 + √5)/2)^n,

and a, b ∈ R to be specified. To compute the constants a, b, we use the initial conditions. Alternatively, we can set F_0 = 0 and compute a and b from the equations for F_0 and F_1. We find a = −1/√5, b = 1/√5, and hence the solution is given by

F_n = (1/√5)·(y_n − x_n) = ((1 + √5)^n − (1 − √5)^n)/(2^n·√5).

Although the Fibonacci sequence looks like a sequence of irrational numbers, the value of F_n is actually an integer for every natural n. Hence all terms in the Fibonacci sequence are integers; in particular F_9 = 34, which gives the answer. □

We proceed with additional exercises related to second-order linear homogeneous difference equations with constant coefficients.

3.B.2. Find the next two terms of the sequence (a_n)_{n≥0} beginning as follows:

3, 5, 11, 21, 43, 85, …

by providing a recursive definition of a_n. Next solve the corresponding recurrence relation.

Solution. Set a_0 = 3 and a_1 = 5. We see that the terms in question are a_6 = 171 and a_7 = 341. Indeed, observe that a_2 = a_1 + 2a_0, a_3 = a_2 + 2a_1, and so on. Thus the sequence satisfies

a_n = a_{n−1} + 2a_{n−2},  n = 2, 3, …,

and this is the recurrence formula we are seeking. The corresponding characteristic equation is the quadratic equation r^2 − r − 2 = 0, whose roots are the real numbers 2 and −1. Thus the general solution has the form a_n = a·2^n + b·(−1)^n, with a, b ∈ R to be computed. From the initial conditions we obtain the system {3 = a + b, 5 = 2a − b}, so a = 8/3, b = 1/3 and

a_n = (8/3)·2^n + (1/3)·(−1)^n. □

³ The number (1 + √5)/2 is called the golden ratio and has fascinated mathematicians since Pythagoras (570–495 B.C.).

The derivatives look like this:

f(λ) = a_0 λ^n + ⋯ + a_k λ^{n−k}
f′(λ) = a_0 nλ^{n−1} + ⋯ + a_k(n−k)λ^{n−k−1}
f″(λ) = a_0 n(n−1)λ^{n−2} + ⋯ + a_k(n−k)(n−k−1)λ^{n−k−2}
⋮
f^{(ℓ)}(λ) = a_0 n(n−1)⋯(n−ℓ+1)λ^{n−ℓ} + ⋯ + a_k(n−k)(n−k−1)⋯(n−k−ℓ+1)λ^{n−k−ℓ}.

We look at the case of a triple root λ and try to find a solution in the form n²λ^n. Substituting into the definition, we obtain the equation

a_0 n²λ^n + ⋯ + a_k(n − k)²λ^{n−k} = 0.

Clearly the left side equals the expression λ²f″(λ) + λf′(λ), and because λ is a root of both derivatives, the condition is satisfied. Using induction, we prove that even in the general condition for a solution of the form x_n = n^ℓ λ^n,

a_0 n^ℓ λ^n + ⋯ + a_k(n − k)^ℓ λ^{n−k} = 0,

the left-hand side can be obtained as a linear combination of the derivatives of the characteristic polynomial, starting with the expression (check the combinatorics!)

λ^ℓ f^{(ℓ)} + (ℓ choose 2)·λ^{ℓ−1} f^{(ℓ−1)} + ⋯ .

We have thus come close to the complete proof of the following result:

Homogeneous equations with constant coefficients

Theorem. The solution space of a homogeneous linear difference equation of order k with constant coefficients, over the field of scalars K = C, is the k-dimensional vector space generated by the sequences x_n = n^ℓ λ^n, where λ runs over the (complex) roots of the characteristic polynomial and the powers ℓ run over the natural numbers 0, …, r_λ − 1, where r_λ is the multiplicity of the root λ.

Proof. The relation between the multiplicity of roots and the derivatives of real polynomials will be proved later (cf. 5.3.7), while the fact that every complex polynomial has exactly as many roots (counting multiplicities) as its degree will appear in 10.2.11.
It remains to prove that the k-tuple of solutions thus found is linearly independent. Even in this case we can prove inductively that the corresponding Casoratian is non-zero. We have done this already in the case of the Vandermonde determinant.

3.B.3. Find the solution of the recurrence relation x_n = 6x_{n−1} − 9x_{n−2} with initial conditions x_0 = 2, x_1 = 3. Additionally, validate your answer using Sage.

Solution. The characteristic equation has the form

r² − 6r + 9 = 0 ⟺ (r − 3)² = 0.

Hence r = 3 is the unique (double) root. Thus, according to the theory in 3.2.4, the general solution must be of the form x_n = a·3^n + b·n·3^n, for some a, b ∈ R to be computed. Based on the initial conditions x_0 = 2 and x_1 = 3 we obtain a = 2 and b = −1. Thus we deduce that

x_n = 2·3^n − n·3^n = 3^n(2 − n).

Sage provides a user-friendly package for handling recurrence relations. One useful function is rsolve from the Python package sympy, designed for symbolic computations; it can effectively handle linear recurrence relations. To begin, we typically start by entering the following cell:

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")

The next step is to input the recurrence relation that we want to solve, along with its initial conditions:

f = a(n) - 6*a(n-1) + 9*a(n-2)
initial = {a(0):2, a(1):3}

Now we can obtain the solution by adding the syntax

rsolve(f, a(n), initial).expand()

Sage prints out the answer -3**n*n + 2*3**n. Let us summarize all the code together:

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n) - 6*a(n-1) + 9*a(n-2)
initial = {a(0):2, a(1):3}
rsolve(f, a(n), initial).expand()

□

3.B.4. Determine an explicit expression for the sequence satisfying the difference equation x_{n+2} = 3x_{n+1} + 3x_n with members x_1 = 1 and x_2 = 3. Next verify your answer using Sage. ⃝

Note that the characteristic polynomial of a homogeneous difference equation may have complex roots. Therefore, when solving recurrence relations, we may encounter situations involving complex numbers. This requires transitioning from complex bases of the solution space to real ones, as discussed in Section 3.2.5. Let us explore such an example.

This approach is well illustrated by the calculation in the case of a root λ_1 with multiplicity one and a root λ_2 with multiplicity two:

C(λ_1^n, λ_2^n, nλ_2^n) = det [ λ_1^n      λ_2^n      nλ_2^n         ]
                              [ λ_1^{n+1}  λ_2^{n+1}  (n+1)λ_2^{n+1} ]
                              [ λ_1^{n+2}  λ_2^{n+2}  (n+2)λ_2^{n+2} ]

= λ_1^n λ_2^{2n} · det [ 1     1     n          ]
                       [ λ_1   λ_2   (n+1)λ_2   ]
                       [ λ_1²  λ_2²  (n+2)λ_2²  ]

= λ_1^n λ_2^{2n} · det [ 1              1   n    ]
                       [ λ_1 − λ_2      0   λ_2  ]
                       [ λ_1(λ_1 − λ_2) 0   λ_2² ]

= −λ_1^n λ_2^{2n} · det [ λ_1 − λ_2       λ_2  ]
                        [ λ_1(λ_1 − λ_2)  λ_2² ]

= λ_1^n λ_2^{2n+1} (λ_1 − λ_2)² ≠ 0.

In the general case the proof can be carried out inductively in a similar way. □

3.2.5. Real basis of the solutions. For equations with real coefficients, real initial conditions always lead to real solutions (and similarly for scalars in Z or Q). However, the corresponding fundamental solutions derived using the above theorem might exist only in the complex domain. We therefore try to find other generators, which are more convenient. Because the coefficients of the characteristic polynomial are real, each of its roots is either real, or the roots come in pairs of complex conjugates.
If we describe such a pair of roots in polar form as

λ^n = |λ|^n (cos nφ + i sin nφ),  λ̄^n = |λ|^n (cos nφ − i sin nφ),

we see immediately that their sum and difference lead (up to constant multiples) to two linearly independent solutions

x_n = |λ|^n cos nφ,  y_n = |λ|^n sin nφ.

Difference equations very often appear as models of the dynamics of some system. A nice topic to think about is the connection between the absolute values of the individual roots and the stability of the solution. We will not go into details here, because we shall only speak about convergence of values to some limit in the fifth chapter. There is space for some interesting numerical experiments: for instance with oscillations of suitable population or economical models.

3.2.6. The non-homogeneous case. As in the case of systems of linear equations, we can obtain all solutions of the non-homogeneous linear difference equation

a_0(n)x_n + a_1(n)x_{n−1} + ⋯ + a_k(n)x_{n−k} = b(n),

where the coefficients a_i and b are scalars which might depend on n, with a_0(n) ≠ 0, a_k(n) ≠ 0. Again, we proceed by finding one particular solution and adding the complete vector space of dimension k of solutions of the corresponding homogeneous system. Indeed, each such sum yields a solution, and since the difference of two solutions of the non-homogeneous system is a solution of the homogeneous system, we obtain all solutions in this way.

3.B.5. Complex bases versus real bases. Determine the solution of the difference equation x_{n+2} = 2x_{n+1} − 2x_n with initial values x_1 = 2 = x_2. Moreover, express your solution in terms of a real basis of the associated solution space.

Solution. The roots of the characteristic polynomial r² − 2r + 2 are the complex numbers 1 + i and 1 − i. Hence the sequences y_n = (1 + i)^n and z_n = (1 − i)^n form a basis of the (complex) vector space of solutions, see also the main theorem in 3.2.4. In particular, a general solution is given as a linear combination of y_n, z_n (with complex coefficients), that is, x_n = a·y_n + b·z_n, where a = a_1 + ia_2, b = b_1 + ib_2, with a_j, b_j ∈ R for j = 1, 2. The initial conditions will guide us in choosing these scalars. Extending the sequence backwards via the recurrence, x_2 = 2x_1 − 2x_0 gives x_0 = 1, so the conditions yield the following system of equations:

1 = a_1 + ia_2 + b_1 + ib_2,
2 = (a_1 + ia_2)(1 + i) + (b_1 + ib_2)(1 − i).

By comparing the real and the imaginary parts of both equations, we finally arrive at a system of four equations in four unknowns:

a_1 + b_1 = 1,  a_2 + b_2 = 0,  a_1 − a_2 + b_1 + b_2 = 2,  a_1 + a_2 − b_1 + b_2 = 0.

We compute a_1 = b_1 = b_2 = 1/2, a_2 = −1/2, and thus the sequence in question attains the following form:

x_n = (1/2 − i/2)(1 + i)^n + (1/2 + i/2)(1 − i)^n.

If we want to verify our solution via Sage, we may proceed as before, that is,

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n+2) - 2*a(n+1) + 2*a(n)
initial = {a(1):2, a(2):2}
rsolve(f, a(n), initial)

Sage's answer has the form

(1/2 + I/2)*(1 - I)**n + (1/2 - I/2)*(1 + I)**n

For the final task we use the sequences

u_n = (1/2)(y_n + z_n) = (√2)^n cos(nπ/4),  v_n = (i/2)(z_n − y_n) = (√2)^n sin(nπ/4).

Note that the transition matrix for changing the basis from the complex one to the real one is given by

T := [[1/2, −i/2], [1/2, i/2]],  with  T^{−1} = [[1, 1], [i, −i]].

Therefore, if (c, d) are the coordinates of the sequence x_n with respect to the basis {u_n, v_n}, then

(c, d)^T = T^{−1}·(a, b)^T = (1, 1)^T.

Thus we now have an alternative expression for the sequence x_n, which involves square roots instead of complex numbers:

x_n = (√2)^n cos(nπ/4) + (√2)^n sin(nπ/4). □
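One can also gain confidence in the closed form just obtained by comparing it numerically with a direct iteration of the recurrence. The following small cell is our own check (not part of the exercise): both printed lists should agree, giving 2, 2, 0, −4, −8, −8, 0, 16 for n = 1, …, 8.

# iterate x_{n+2} = 2*x_{n+1} - 2*x_n with x_1 = x_2 = 2
xs = [2, 2]
for k in range(6):
    xs.append(2*xs[-1] - 2*xs[-2])
print(xs)
# compare with the real-basis expression (sqrt(2))^n*(cos(n*pi/4) + sin(n*pi/4))
closed = [(sqrt(2)^n * (cos(n*pi/4) + sin(n*pi/4))).simplify_full() for n in range(1, 9)]
print(closed)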
When we were working with systems of linear equations, it was possible that no solution existed. This cannot happen with difference equations. But it is not always easy to find the one particular solution of a non-homogeneous system, especially if the behaviour of the scalar coefficients in the equation is complicated. Even for linear recurrences with constant coefficients it may not be easy to find a solution if the right-hand side is complicated. But we can always try to find a solution in a form similar to the right-hand side.

Consider the case when the corresponding homogeneous system has constant coefficients and b(n) is a polynomial of degree s. A solution can then be sought in the form of the polynomial

x_n = α_0 + α_1 n + ⋯ + α_s n^s

with unknown coefficients α_i, i = 0, …, s. Substituting into the difference equation and comparing the coefficients of the individual powers of n, we obtain a system of s + 1 equations for the s + 1 variables α_i. If this system has a solution, then we have found a solution of our original problem. If it has no solution, we can try again with an increased degree of the polynomial in question.

For instance, the equation x_n − x_{n−2} = 2 cannot have a constant solution, because substituting the potential solution x_n = α_0 yields the impossible requirement α_0 − α_0 = 0 = 2. But setting x_n = α_0 + α_1 n, we obtain the solution x_n = α_0 + n, with α_0 arbitrary. Thus the general solution of our equation is

x_n = C_1 + C_2(−1)^n + n.

We use this method of indeterminate coefficients, for example, in 3.F.8.

3.2.7. Variation of constants. Another possible way to solve such an equation is the method of variation of constants. Here we first find the solution

y(n) = Σ_{i=1}^{k} c_i f_i(n)

of the homogeneous equation, and then consider the constants c_i as functions c_i(n) of the variable n. We look for a particular solution of the given equation in the form

y(n) = Σ_{i=1}^{k} c_i(n) f_i(n).

We illustrate the method on second-order equations. Suppose that the homogeneous part of the second-order non-homogeneous equation

x_{n+2} + a_n x_{n+1} + b_n x_n = f_n

has x_n^{(1)} and x_n^{(2)} as a basis of solutions. We will be looking for a particular solution of the non-homogeneous equation in the form

x_n = A_n x_n^{(1)} + B_n x_n^{(2)}

It is often useful to treat recurrence relations using matrices. While we will describe many such cases in Section C, let us briefly provide an example to highlight this elegant application of matrices. For the reader's convenience, our description includes verifications using Sage, especially for matrix computations.

3.B.6. The matrix method via Sage. Using tools from linear algebra, find for n ≥ 0 the solution of the difference equation x_{n+2} = 2x_{n+1} + 3x_n, with x_0 = 0 and x_1 = 1.

Solution. Set p_n = (x_{n+1}, x_n)^T, so that p_0 = (x_1, x_0)^T = (1, 0)^T. Then, for n ≥ 0, the difference equation in question can be expressed in terms of matrices as

p_{n+1} = (x_{n+2}, x_{n+1})^T = (2x_{n+1} + 3x_n, x_{n+1})^T = [[2, 3], [1, 0]]·(x_{n+1}, x_n)^T = A·p_n,

where A = [[2, 3], [1, 0]]. The matrix A is usually referred to as the "companion matrix" of the recurrence and satisfies the relations

p_1 = A·p_0,  p_2 = A·p_1 = A²·p_0,  p_3 = A·p_2 = A³·p_0,

and so forth. For example, we obtain p_1 = (2, 1)^T and p_2 = (7, 2)^T, which implies that x_2 = 2, x_3 = 7, and so on. Inductively, this approach gives us the expression p_n = A^n·p_0. Therefore, to find the vector p_n, and consequently the desired sequence x_n, we simply need to compute A^n. This process is straightforward when A is diagonalizable.
For our case, using Sage and the syntax

var("l")
solve(l^2-2*l-3, l)

we find that the characteristic polynomial of A, represented by λ² − 2λ − 3, has roots λ_1 = 3 and λ_2 = −1. Hence A has two eigenvalues, both with multiplicity one, and as we will see, it is diagonalizable. The corresponding eigenvectors, that is, solutions of the equation AX = λX, are given by X_1 = (1, 1/3)^T and X_2 = (1, −1)^T. To obtain these expressions we used Sage, via the block

A = matrix(QQ, [[2,3], [1,0]])
A.eigenvectors_right()

This prints out the answer [(3, [(1, 1/3)], 1), (-1, [(1, -1)], 1)].⁴ Recall that one may verify that A is diagonalizable directly via the command A.is_diagonalizable(), which in our case prints True. Therefore, we have A^n = P·D^n·P^{−1}, with

D = [[3, 0], [0, -1]],  P = [[3, 1], [1, -1]],

⁴ Recall that in the expression (3, [(1, 1/3)], 1) the first number 3 corresponds to the first eigenvalue, the pair (1, 1/3) encodes its eigenvector, that is, X_1 = (1, 1/3)^T, and the last number 1 is the multiplicity of the eigenvalue. Similarly for the second component inside the list.

with some conditions on A_n and B_n to be imposed. We have

x_{n+1} = A_{n+1} x_{n+1}^{(1)} + B_{n+1} x_{n+1}^{(2)}
        = A_n x_{n+1}^{(1)} + B_n x_{n+1}^{(2)} + (A_{n+1} − A_n) x_{n+1}^{(1)} + (B_{n+1} − B_n) x_{n+1}^{(2)}
        = A_n x_{n+1}^{(1)} + B_n x_{n+1}^{(2)} + δA_n x_{n+1}^{(1)} + δB_n x_{n+1}^{(2)},

where δA_n = A_{n+1} − A_n and δB_n = B_{n+1} − B_n. In order to be able to use the same A_n, B_n in the expression for x_{n+1}, we impose for all n the condition

δA_n x_{n+1}^{(1)} + δB_n x_{n+1}^{(2)} = 0.

Thus, for all n,

x_{n+1} = A_n x_{n+1}^{(1)} + B_n x_{n+1}^{(2)},

and in particular

x_{n+2} = A_{n+1} x_{n+2}^{(1)} + B_{n+1} x_{n+2}^{(2)} = A_n x_{n+2}^{(1)} + B_n x_{n+2}^{(2)} + δA_n x_{n+2}^{(1)} + δB_n x_{n+2}^{(2)}.

Now,

f_n = x_{n+2} + a_n x_{n+1} + b_n x_n
    = A_n (x_{n+2}^{(1)} + a_n x_{n+1}^{(1)} + b_n x_n^{(1)}) + B_n (x_{n+2}^{(2)} + a_n x_{n+1}^{(2)} + b_n x_n^{(2)}) + δA_n x_{n+2}^{(1)} + δB_n x_{n+2}^{(2)}
    = δA_n x_{n+2}^{(1)} + δB_n x_{n+2}^{(2)}.

Hence the variations δA_n and δB_n are subject to the system

δA_n x_{n+1}^{(1)} + δB_n x_{n+1}^{(2)} = 0,
δA_n x_{n+2}^{(1)} + δB_n x_{n+2}^{(2)} = f_n,

with solutions (compute the inverse matrix, e.g., by means of the algebraic adjoint and the determinant)

δA_n = A_{n+1} − A_n = −f_n x_{n+1}^{(2)} / W_{n+1},
δB_n = B_{n+1} − B_n = f_n x_{n+1}^{(1)} / W_{n+1},

where W_{n+1} is the Wronski determinant

W_{n+1} = det [ x_{n+1}^{(1)}  x_{n+1}^{(2)} ]
              [ x_{n+2}^{(1)}  x_{n+2}^{(2)} ].

It follows that

A_n − A_0 = Σ_{j=0}^{n−1} (−f_j x_{j+1}^{(2)} / W_{j+1}),
B_n − B_0 = Σ_{j=0}^{n−1} (f_j x_{j+1}^{(1)} / W_{j+1}).

where for simplicity we have used the multiple 3X_1 = (3, 1)^T in the first column of P (this is also an eigenvector corresponding to λ_1). We compute

P^{−1} = [[1/4, 1/4], [1/4, -3/4]],

which one can verify by the relation P^{−1}·A·P = [[λ_1, 0], [0, λ_2]] = D. With all the necessary ingredients in hand, we now obtain

A^n = [[3, 1], [1, -1]] · [[3^n, 0], [0, (−1)^n]] · [[1/4, 1/4], [1/4, -3/4]]
    = (1/4)·[[3^{n+1} + (−1)^n, 3^{n+1} − 3(−1)^n], [3^n − (−1)^n, 3^n + 3(−1)^n]].

To verify the expression for A^n one can use Sage, by the following block (this relies on the fact that A is diagonalizable):

A = matrix(SR, [[2,3], [1,0]])
P = matrix(SR, [[3, 1], [1, -1]])
n = var('n')
D = matrix(SR, [[3**n, 0], [0, (-1)**n]])
An = P * D * P.inverse()
An

Returning to our initial task, we get

p_n = (x_{n+1}, x_n)^T = A^n·p_0 = (1/4)·(3^{n+1} + (−1)^n, 3^n − (−1)^n)^T.

From this expression it follows that x_n = (1/4)(3^n − (−1)^n). □

3.B.7. Verify the solution given in 3.B.6 via Sage. ⃝

Homogeneous recurrence relations of order higher than two can essentially be treated in the same way.
However, the characteristic polynomial in this case will be of a higher degree, making the overall procedure a bit more complicated. With this in mind, we present the following tasks.

3.B.8. For the difference equation x_{n+4} = x_{n+3} + x_{n+1} − x_n, find a real basis of the corresponding solution space. ⃝

3.B.9. Determine an explicit formula for the n-th member of the unique solution {x_n}_{n=1}^∞ that satisfies −x_{n+3} = 2x_{n+2} + 2x_{n+1} + x_n with x_1 = x_2 = x_3 = 1. ⃝

Non-homogeneous difference equations have numerous applications. Next, we dedicate some space to examples of this kind (see also 3.2.6 and the problem 3.F.8 in the final section). Notice that in Sage, non-homogeneous recurrence relations can be treated via exactly the same method described above for the homogeneous case.

Setting A_0 = B_0 = 0 we obtain

A_n = Σ_{j=0}^{n−1} (−f_j x_{j+1}^{(2)} / W_{j+1}),  B_n = Σ_{j=0}^{n−1} (f_j x_{j+1}^{(1)} / W_{j+1}),

and the acquired general solution of our recurrence equation is

x_n = C_1 x_n^{(1)} + C_2 x_n^{(2)} + (Σ_{j=0}^{n−1} −f_j x_{j+1}^{(2)} / W_{j+1})·x_n^{(1)} + (Σ_{j=0}^{n−1} f_j x_{j+1}^{(1)} / W_{j+1})·x_n^{(2)}.

This method is used to solve the example 3.F.9.

3.2.8. Linear filters. Now we consider infinite sequences

x = (…, x_{−n}, x_{−n+1}, …, x_{−1}, x_0, x_1, …, x_n, …).

As in the case of systems of linear equations, we work with an operation T that maps the sequence x to the sequence z = Tx with elements

z_n = a_0 x_n + a_1 x_{n−1} + ⋯ + a_k x_{n−k}.

As already noticed, the sequences x = (x_n) are vectors with respect to coordinate-wise operations, and the vector space of all such sequences is infinite-dimensional. The operation T is clearly a linear mapping on this space.

The sequences can be imagined as discrete values of a signal, often captured in very short time units. T plays the role of a filter that works with the signal; this is, for example, how the sampling of an audio signal looks. We are interested in estimating the properties such a linear filter can have.

Signals are often a linear combination of superimposed parts which are themselves periodic. From our definition it is clear that periodic sequences x_n, that is, sequences satisfying

x_{n+p} = x_n

for some fixed natural number p and all n, also have periodic images z = Tx:

z_{n+p} = a_0 x_{n+p} + a_1 x_{n−1+p} + ⋯ + a_k x_{n−k+p} = a_0 x_n + a_1 x_{n−1} + ⋯ + a_k x_{n−k} = z_n,

with the same period p. We are interested in the periodic input sequences x for which Tx remains roughly the same (up to a scalar multiple), and in those for which Tx is suppressed close to zero. This means that we are looking for the kernel of our linear mapping T and for its other eigenvectors. The kernel is the subspace of sequences given by the homogeneous difference equation

a_0 x_n + a_1 x_{n−1} + ⋯ + a_k x_{n−k} = 0,  a_0 ≠ 0, a_k ≠ 0,

which we are able to solve.

3.B.10. Determine the sequence of real numbers that satisfies the non-homogeneous difference equation

2x_{n+2} = −x_{n+1} + x_n + 2

with initial conditions x_1 = 2, x_2 = 3.

Solution. The general solution of the homogeneous equation is of the form a·(−1)^n + b·(1/2)^n. A particular solution is the constant 1. The general solution of the non-homogeneous equation, without initial conditions, is thus

a·(−1)^n + b·(1/2)^n + 1.

Based on the initial conditions we compute a = 1 and b = 4. Consequently, the solution is given by the sequence

x_n = (−1)^n + 4·(1/2)^n + 1. □

3.B.11. Find a sequence which satisfies the non-homogeneous difference equation x_{n+2} = x_{n+1} + 2x_n + 1 with the initial conditions x_1 = 2, x_2 = 2. Next verify your solution via Sage. ⃝
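As an illustration of the last remark, here is one possible cell (our own sketch) checking the solution of 3.B.10 with the same rsolve routine used in 3.B.3; the non-homogeneous relation is simply rewritten with all terms on one side.

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
# 2*x_{n+2} = -x_{n+1} + x_n + 2, rewritten as f = 0
f = 2*a(n+2) + a(n+1) - a(n) - 2
initial = {a(1):2, a(2):3}
rsolve(f, a(n), initial)

The output should be an expression equivalent to (−1)^n + 4·(1/2)^n + 1, in accordance with the solution above.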
C. Models of growth and iterated processes

Many population models are based on recurrence relations in vector spaces. In these models, the unknown is not a sequence of numbers but a sequence of vectors, with matrices serving as the coefficients. Thus matrices play a crucial role in population dynamics and can be used to simulate population growth. Next we focus on such problems and, moreover, explore how various iterated processes, such as discrete Markov chains, can model everyday situations and offer intriguing insights.

3.C.1. Alice and Bob together in savings. We begin with a model of growth of saved money. Alice and Bob are artists who have a band together, and they are saving money for the Black Friday online sales, to upgrade their musical instruments. As Black Friday falls in the last week of November, they start saving every December. In the first month Alice deposits 1 € and Bob deposits 2 €. Every consecutive month, each of them gives as much as in the previous month, plus one half of what the other gave the month before. How much money will Alice and Bob be able to spend together the next Black Friday, having applied the above plan for exactly one year? Is it true or false that during the last month Alice and Bob put away essentially the same amount of money?

Solution. Let us denote by a_n the amount of money that Alice saves in the n-th month, and let b_n be the corresponding amount saved by Bob. In the first month they deposit a_1 = 1, b_1 = 2. The subsequent savings can be encoded as follows:

a_{n+1} = a_n + (1/2)·b_n,  b_{n+1} = b_n + (1/2)·a_n.

3.2.9. Bad equalizer. As an example, consider a very simple linear filter given by the equation

z_n = (Tx)_n = x_n + x_{n−2}.

Clearly, the kernel of T is generated by x_n = cos(πn/2) and x_n = sin(πn/2), while the solutions of x_{n+2} = x_n correspond to the requirement (Tx)_n = 2x_n. The results of such an operation on a signal are illustrated by the two diagrams below. There we use two different frequencies of signals and display their discrete sampling (the solid lines and the points x_n on them). The dashed line represents the sampling z_n of the filtered signal. The first case shows an amplification of the signal, while the second frequency is close to the kernel, which is killed by the filter. Notice that the filtered signal suffers serious shifts in phase, which vary with the frequencies. Cheap equalizers work in such a bad way. Notice also how badly the original signal is sampled in the second picture. This is due to the fact that the sampling frequency is not much higher than the frequency of the signal.

3. Iterated linear processes

3.3.1. Iterated processes. In practical models we often encounter the situation where the evolution of a system in a given time interval is given by a linear process, and we are interested in the behaviour of the system after many iterations. The linear process often remains the same, so from the mathematical point of view we are dealing with iterated multiplication of the state vector by the same matrix. While solving systems of linear equations requires only minimal knowledge of the properties of linear mappings, in order to understand the behaviour of an iterated system we shall exploit eigenvalues, eigenvectors and further structural features. In fact, the determination of the solution of a linear recurrence equation by a set of initial conditions can itself be described as an iterated process.
Setting p_n := a_n + b_n for the common savings during the n-th month, we obtain p_{n+1} = (3/2)·p_n. This is obviously a geometric sequence, and hence p_n = 3·(3/2)^{n−1}.⁵ Now, the saving period lasts exactly twelve months, so the sum p_1 + p_2 + ⋯ + p_{12} represents the total common savings. We compute

3·(1 + 3/2 + ⋯ + (3/2)^{11}) = 3·((3/2)^{12} − 1)/(3/2 − 1) ≈ 772.5.

To confirm this computation use the command sum in Sage, as follows:

var("n"); pn = 3*(3/2)**(n-1)
N(sum(pn, n, 1, 12))

In this cell we first introduced n as a symbolic variable, and then asked Sage for the numerical approximation of the desired sum. This answers the first question: the next Black Friday, Alice and Bob will be able to spend almost 773 €.

To answer the second task, observe that

a_{n+1} = a_n + (1/2)·b_n = (1/2)(a_n + p_n) = (1/2)·a_n + (3/2)^n.

For a quick solution of this recurrence relation we apply Sage. Note that the extra initial condition a(2) = 2 passed to rsolve below is forced by the recurrence itself, since a_2 = (1/2)·1 + 3/2 = 2.

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n+1) - (1/2)*a(n) - (3/2)**n
initial = {a(1):1, a(2):2}
rsolve(f, a(n), initial)

which gives the answer (3/2)**n - (1/2)**n. This means that

a_n = (3/2)^n − (1/2)^n,

and hence during the 12th month Alice's savings should reach the level of 130 €, as given by the expression (3/2)^{12} − (1/2)^{12}. Bob's recurrence looks the same, just with b(1) = 2, and its solution is b_n = (3/2)^n + (1/2)^n. Bob thus saves basically the same amount, and the statement is true. □

3.C.2. Remark. Obviously, the example can be adapted to higher initial amounts, although in the last months Alice and Bob may find it hard to follow their plan. For instance, starting with a_1 = 10 € and b_1 = 20 €, we get p_1 = 30. Therefore, in this case one has

p_n = 30·(3/2)^{n−1},  a_{n+1} = (1/2)·a_n + 15·(3/2)^{n−1},

with solution a_n = 10·((3/2)^n − (1/2)^n). For convenience, we list the savings in a table, with accuracy of two decimal digits.

⁵ Recall that the geometric sequence x_n = κ·x_{n−1} satisfies x_n = x_1·κ^{n−1}.

Imagine we keep the state vector of the last k values, Y_n = (x_n, …, x_{n−k+1}), filled by the initial conditions at the beginning of the process. In the next step we update the state vector to Y_{n+1} = (x_{n+1}, x_n, …, x_{n−k+2}), where the first entry x_{n+1} = a_1 x_n + ⋯ + a_k x_{n−k+1} is computed by means of the homogeneous difference equation, while the other entries are just a shift by one position, with the last one forgotten. The corresponding square matrix of order k satisfying Y_{n+1} = A·Y_n is as follows:

A = [ a_1  a_2  ⋯  a_{k−1}  a_k ]
    [ 1    0    ⋯  0        0   ]
    [ 0    1    ⋯  0        0   ]
    [ ⋮         ⋱           ⋮   ]
    [ 0    0    ⋯  1        0   ].

A while ago, we derived an explicit procedure leading to the complete formula for the solution of such an iterated process with this special type of matrix. In general, this will not be easy even for very similar systems. A typical case is the study of the dynamics of populations in biological systems, which we discuss below.

The characteristic polynomial |A − λE| of our matrix is

p(λ) = (−1)^k (λ^k − a_1 λ^{k−1} − ⋯ − a_k),

as we can check directly, or by expanding the last column and employing induction on k. Thus the eigenvalues are exactly the roots λ of the characteristic polynomial of the linear recurrence. We should have expected this, because having a nonzero solution x_n = λ^n of the linear recurrence means that the matrix A must map (λ^k, …, λ)^T to its λ-multiple. Thus every such λ must be an eigenvalue of the matrix A.
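To see the companion matrix at work, the following small Sage cell (our own illustration) takes the Fibonacci-type coefficients a_1 = a_2 = 1, checks the characteristic polynomial, and iterates the state vector Y_n a few times.

A = matrix(QQ, [[1, 1], [1, 0]])   # companion matrix of x_{n+1} = x_n + x_{n-1}
print(A.charpoly())                # x^2 - x - 1, as predicted
Y = vector(QQ, [1, 1])             # initial state (x_1, x_0)
for k in range(5):
    Y = A * Y                      # one iteration of the linear process
    print(Y)

The printed vectors (2, 1), (3, 2), (5, 3), (8, 5), (13, 8) reproduce consecutive pairs of Fibonacci numbers.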
3.3.2. Leslie model for population growth. Imagine that we are dealing with some system of individuals (cattle, insects, cell cultures, etc.) divided into m groups (according to their age, evolution stage, etc.). The state X_n is thus given by the vector

X_n = (u_1, …, u_m)^T

depending on the time t_n at which we are observing the system. A linear model of the evolution of such a system is then given by an m×m matrix A, which gives the change of the vector X_n into X_{n+1} = A·X_n when the time changes from t_n to t_{n+1}.

month   Alice        Bob
1st     10.00 €      20.00 €
2nd     20.00 €      25.00 €
3rd     32.50 €      35.00 €
4th     50.00 €      51.25 €
5th     75.62 €      76.25 €
6th     113.75 €     114.06 €
7th     170.78 €     170.94 €
8th     256.25 €     256.33 €
9th     384.41 €     384.45 €
10th    576.64 €     576.66 €
11th    864.97 €     864.98 €
12th    1297.46 €    1297.47 €

The dominant chord of the next series of problems is the description of suitable frameworks for investigating age-structured population dynamics from a linear perspective based on discrete variables. First we focus on the "Leslie growth model", which effectively describes the growth of age-structured populations and is therefore very popular in population ecology. Recall that this model is written as p_{n+1} = A·p_n, where A is the Leslie matrix and p_n = (p_n^1, …, p_n^m)^T is the population vector at time n, divided into m age classes. The relevant data for each age class consist of the reproduction rate and the rate of survival into the next age class. These vital rates form the essential part of A.

The matrix form makes the Leslie model flexible and mathematically very tractable. This is because the solutions of the Leslie model asymptotically exhibit exponential growth: the population p_n grows at an exponential rate determined by λ_1, so that p_n ≈ λ_1^n X_1. Here λ_1 is the dominant eigenvalue (also known as the "Perron–Frobenius eigenvalue", see 3.3.3) of A, and X_1 is the corresponding rescaled eigenvector. Furthermore, we can utilize X_1 to infer long-term trends of the age classes, leading to the determination of the stable age distribution. For a more detailed explanation and the proofs of these statements, we encourage the reader to refer to the theoretical section starting at 3.3.2.

3.C.3. Bushbabies and Leslie matrices. A small team of biologists studies a colony of bushbabies⁶ living in a specific region of South Africa. In the wild, these mammals typically have a lifespan of no more than four years. For research purposes, we can categorize their population into four distinct age classes, as follows:

class A: 0 to 1 year old
class B: 1 to 2 years old
class C: 2 to 3 years old
class D: 3 to 4 years old

The biologists determined that the breeding rates b_i for the four age classes in this colony are as follows:⁷ b_1 = 0, b_2 = 1, b_3 = 2 and b_4 = 1, respectively.⁸

⁶ Also known as galagos or little night monkeys.
⁷ Breeding rates are also called fertility rates, and it is natural to assume that they are non-negative real numbers.
⁸ Galagos usually give birth to a single offspring at their first pregnancy, and then produce twins in subsequent litters. In particular, they may give birth to two sets of twins a year. (Source: https://animaldiversity.org/, https://africageographic.com)

As an example, we consider the Leslie model for population growth. Here the matrix is

A = [ f_1  f_2  f_3  ⋯  f_{m−1}  f_m ]
    [ τ_1  0    0    ⋯  0        0   ]
    [ 0    τ_2  0    ⋯  0        0   ]
    [ 0    0    τ_3  ⋯  0        0   ]
    [ ⋮              ⋱           ⋮   ]
    [ 0    0    0    ⋯  τ_{m−1}  0   ],
whose parameters are tied to the evolution of a population divided into m age groups: f_i denotes the relative fertility of the corresponding age group (in the observed time shift, N individuals in the i-th group give rise to f_i·N new ones, which belong to the first group), while τ_i is the relative survival rate, that is, the fraction of the i-th group which passes to the (i+1)-st group in one time interval. Clearly such a model can be used with any number of age groups. All coefficients are thus non-negative real numbers, and the numbers τ_i lie between zero and one.

Note that when all τ_i equal one, the model is actually a linear recurrence with constant coefficients, and thus exhibits either exponential growth or decay (for real roots λ of the characteristic polynomial) or oscillation connected with potential growth or decay (for complex roots).

Before we introduce a more general theory, we consider this specific model in more detail. Direct computation with the Laplace expansion along the last column yields the characteristic polynomial p_m(λ) of the matrix A for the model with m groups:

p_m(λ) = −λ·p_{m−1}(λ) + (−1)^{m−1} f_m τ_1 ⋯ τ_{m−1}.

By induction we derive that this characteristic polynomial is of the form

p_m(λ) = (−1)^m (λ^m − a_1 λ^{m−1} − ⋯ − a_{m−1} λ − a_m).

The coefficients a_1, …, a_m are all positive if all the parameters τ_i and f_i are positive; in particular, a_m = f_m τ_1 ⋯ τ_{m−1}.

Consider the distribution of the roots of the polynomial p_m(λ). We write the characteristic polynomial in the form

p_m(λ) = ±λ^m (1 − q(λ)),

where q(λ) = a_1 λ^{−1} + ⋯ + a_m λ^{−m} is a strictly decreasing non-negative function for λ > 0. For positive but very small λ the value of q is arbitrarily large, while for large λ it is arbitrarily close to zero. Thus there exists exactly one positive λ for which q(λ) = 1, and hence also p_m(λ) = 0.² In other words, for every Leslie matrix (with all the parameters f_i and τ_i positive) there exists exactly one positive real eigenvalue.

For actual Leslie models of populations, a typical situation is that the unique positive real eigenvalue λ_1 is greater than or equal to one, while the absolute values of all the other eigenvalues are strictly less than one.

² Actually, we shall spend a lot of time in chapter 5 making such considerations precise, involving the convergence and continuity issues.

Therefore, except for the babies in age class A, all other (female) galagos in the colony can mate and produce offspring.⁹ The biologists also estimated the corresponding survival rates s_i for these four classes: s_1 = 0.4, s_2 = 0.5, s_3 = 0.2, and s_4 = 0. Notice that 0 ≤ s_i ≤ 1, as each number represents the probability of a bushbaby surviving from one age class to the next. Graphically, the vital rates and the interactions between the age classes are represented by the following directed graph:

⁹ Note that our focus is solely on the female population.

Suppose that the current age distribution of female bushbabies across the four age classes is given by the vector p_0 = (20, 20, 30, 10)^T. Apply the Leslie model to compute the population of babies and the total female population of the colony after one year. In addition, determine the population of female galagos in age classes C and D ten years later, and deduce that the total female population decreases. Finally, provide the long-term trends for the four age classes.

Solution. The associated Leslie matrix is given by

A = [ b_1  b_2  b_3  b_4 ]   [ 0    1    2    1 ]
    [ s_1  0    0    0   ] = [ 0.4  0    0    0 ]
    [ 0    s_2  0    0   ]   [ 0    0.5  0    0 ]
    [ 0    0    s_3  0   ]   [ 0    0    0.2  0 ].

Let us denote by A_t, …, D_t the numbers of females belonging to the age classes A, …, D, respectively, at time t.
Then the Leslie condition p_{t+1} = A·p_t gives

p_{t+1} = (A_{t+1}, B_{t+1}, C_{t+1}, D_{t+1})^T = (B_t + 2C_t + D_t, 0.4·A_t, 0.5·B_t, 0.2·C_t)^T.

Thus p_1 = (A_1, B_1, C_1, D_1)^T = (90, 8, 10, 6)^T, and so after one year the model predicts the existence of 90 babies and in total 114 female galagos (the sum A_1 + ⋯ + D_1). For the second task, we utilize the rule p_{10} = A·p_9 = A^{10}·p_0. Using the following cell in Sage

A = matrix(SR, [[0, 1, 2, 1], [0.4, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 0.2, 0]])
show(A**10)

If we begin with any state vector X given as a sum of eigenvectors, X = X_1 + ⋯ + X_m, with eigenvalues λ_i, then the iterations yield

A^k·X = λ_1^k X_1 + ⋯ + λ_m^k X_m.

Thus, under the assumption that |λ_i| < 1 for all i ≥ 2, all components in the eigensubspaces diminish very fast, except for the component λ_1^k X_1. The distribution of the population among the age groups therefore quickly approaches the ratios of the components of the eigenvector associated with the dominant eigenvalue λ_1.

As an example, consider the matrix below, whose individual coefficients are taken from a model for sheep breeding; that is, the values τ contain both natural deaths and the activities of breeders:

A = [ 0     0.2  0.8  0.6  0 ]
    [ 0.95  0    0    0    0 ]
    [ 0     0.8  0    0    0 ]
    [ 0     0    0.7  0    0 ]
    [ 0     0    0    0.6  0 ].

The eigenvalues are approximately

1.03, 0, −0.5, −0.27 + 0.74i, −0.27 − 0.74i,

with absolute values 1.03, 0, 0.5, 0.78, 0.78, and the eigenvector corresponding to the dominant eigenvalue is approximately X^T = (30, 27, 21, 14, 8). We have chosen the eigenvector whose coordinates sum to 100, so it directly gives the percentage distribution of the population.

Suppose instead that we wish for a constant population, with one-year-old sheep being removed for consumption. Then we need to ask how to decrease τ_2 so that the dominant eigenvalue becomes one. A direct check shows that the farmer could then eat about 10% more of the one-year-old sheep to keep the population constant.

3.3.3. Matrices with non-negative elements. Real matrices with no negative elements have very special properties and are very often present in practical models. Thus we introduce the Perron–Frobenius theory, which deals with such matrices. Actually, we show some results of Perron and omit the more general situations due to Frobenius.³

³ Oskar Perron and Ferdinand Georg Frobenius were two great German mathematicians at the turn of the 19th and 20th centuries. Even in this textbook we shall meet their names in Analysis, Number Theory, Algebra. Look up the index.

we obtain

A^{10} ≈ [ 0.195  0.465  0.457  0.204 ]
         [ 0.081  0.195  0.208  0.095 ]
         [ 0.047  0.102  0.099  0.044 ]
         [ 0.008  0.023  0.023  0.010 ].

Therefore, a further computation shows that

p_{10} = (A_{10}, …, D_{10})^T ≈ (28.98, 12.75, 6.44, 1.44)^T.

It follows that C_{10} ≈ 6.4 and D_{10} ≈ 1.4. We also deduce that after ten years the total population of female galagos will be significantly smaller, specifically fewer than 50 individuals. The eventual extinction of the population can be inferred from the dominant eigenvalue of A. To determine this, one must compute the eigenvalues of A and compare them accordingly. To compute the eigenvalues we will use Sage.
However, instead of employing the conventional method from Chapter 2 (see ??), we opt for a more convenient approach and obtain the eigenvalues numerically. This can be done using the command A_num.eigenvalues(), which applies to matrices in numerical form. After defining the matrix A, the process unfolds as follows:

A_num = A.change_ring(RDF); show(A_num)
eig_numeric = A_num.eigenvalues()
show(eig_numeric)

Verify that this block accurately computes the eigenvalues of A in a very convenient format. For a more comprehensive response, you can replace the final line with

print("Numerical eigenvalues:")
for eigenvalue in eig_numeric:
    print(eigenvalue.n(digits=10))

According to Sage there are two real eigenvalues, λ_1 ≈ 0.9347 and λ_2 ≈ −0.1121, and two complex ones, λ_{3,4} ≈ −0.4112 ± 0.4607i. Thus λ_1 is the dominant eigenvalue of A, and since λ_1 < 1, the colony will die out, with the ratio p_{t+1}/p_t tending to λ_1 ≈ 0.9347.

The answer to the final task relies on the eigenvector X̂_1 corresponding to λ_1. Adding to the initial cell the code

eigV = A_num.eigenvectors_right()
show(eigV)

we see that X̂_1 is (approximately) given by (0.8988, 0.3847, 0.2058, 0.04403)^T, but it is more convenient to rescale this vector as X̂_1 ≈ (1, 0.427, 0.228, 0.048)^T. The number 1.703 approximates the sum of its entries, so the normalized eigenvector associated with λ_1 has the form

X_1 = (1/1.703)·X̂_1 ≈ (0.587, 0.250, 0.133, 0.028)^T.

From this we obtain the long-term trends of the four age classes in the population: about 59 % babies and 25 % in age class B, while the share of female galagos older than two years is smaller than 16.2 %. □

3.C.4. Remarks. To illustrate the decline of the population in the previous example, we may present a graph where the population's fading is depicted for a period of fifty years. Included is also the corresponding graph for the babies' population.

Positive and primitive matrices

Definition. A positive matrix is a square matrix A all of whose elements a_ij are real and strictly positive. A primitive matrix is a square matrix A such that some power A^k, k ∈ N, is positive.

Recall that the spectral radius of a matrix A is the maximum of the absolute values of all (complex) eigenvalues of A. The spectral radius of a linear mapping on a (finite-dimensional) vector space coincides with the spectral radius of its matrix with respect to any basis. In the sequel, the norm of a matrix A ∈ R^{n²} or of a vector x ∈ R^n will mean the sum of the absolute values of all its elements. For a vector x we write |x| for its norm.

The following result is very useful and hopefully understandable, but the difficulty of its proof is rather untypical for this textbook. If you prefer, read just the theorem and skip the proof until later.

Perron Theorem

Theorem. If A is a primitive matrix with spectral radius λ ∈ R, then λ is a root of the characteristic polynomial of A with multiplicity one, and λ is strictly greater than the absolute value of any other eigenvalue of A. Furthermore, there exists an eigenvector x associated with λ such that all elements x_i of x are positive.

Proof. We shall present the proof only briefly, relying on intuition from elementary geometry as well as on some results presented much later in this book. Notice that the matrices A and A^k share the same eigenvectors, while the corresponding eigenvalues are λ and λ^k respectively. Thus the assertion of the theorem holds for A if and only if the same is true for A^k.
In particular, we may assume without loss of generality that the matrix A itself is positive. Many of the necessary concepts and properties will be discussed in chapter four and in the subsequent chapters devoted to analytical aspects, so the reader might come back to this proof later.

The first step is to show the existence of an eigenvector with all elements positive. Consider the standard simplex

S = {x = (x_1, …, x_n)^T ; |x| = 1, x_i ≥ 0, i = 1, …, n}.

Since all elements of the matrix A are positive, the image A·x for x ∈ S has all coordinates positive too. The mapping x ↦ |A·x|^{−1}·(A·x) thus maps S to itself. This mapping S → S satisfies all the assumptions of the Brouwer fixed point theorem,⁴ and thus there exists a vector y ∈ S which is mapped by this mapping to itself. That means

A·y = λ·y,  λ = |A·y|,

and we have found an eigenvector lying in S. By assumption, A·y has all coordinates positive, so y must have the same property; moreover, λ > 0.

In order to prove the rest of the theorem, we consider the mapping given by the matrix A in a more suitable basis, in which the coordinates of the eigenvector become (λ, …, λ). Moreover, we multiply the mapping by the constant λ^{−1}.

⁴ This theorem is a great example of a blend of (homological) Algebra, (differential) Topology and Analysis. We shall discuss it in Chapter 9, cf. 9.1.16 on page 822.

To obtain the graph, each plot can be produced separately by a command of the following type (we present the code for the first 5 years of the total population; the babies' population is treated similarly):

list_plot({0: 80, 1: 114, 2: 76, 3: 78.4, 4: 79.2, 5: 66.32},
    plotjoined=True, color="dodgerblue") \
    + text(r"total population", (22, 34), color="black")

In this block the numbers 0, 1, 2, … represent the years, and the numbers 80, 114, 76, … represent the size of the female population in the corresponding year. The female population in the k-th year is given by m_k = A_k + B_k + C_k + D_k. Thus we need to compute the matrices A, A², …, A^{50} and then apply the Leslie condition p_k = A^k·p_0 for any k with 0 ≤ k ≤ 50; recall that p_k = (A_k, B_k, C_k, D_k)^T. Finally, to join the two plots in one figure one can add in the previous cell the code g1 = list_plot({0: 80, …}), g2 = list_plot({0: 20, …}), and then use the command (g1 + g2).show().

3.C.5. Show that the result of 3.C.3 about the decline of the population changes if we assume that galagos of age class B give birth to two female offspring instead of one, while the rest of the given data remains the same. What are the long-term trends of the four age classes in this case, and what is the total population of the colony after ten years, if we assume that for any female bushbaby there exists a male one? Present a graph illustrating the increase of the female population over a period of thirty years. ⃝

In the study of age-structured populations, an interesting question is how to stabilize the population under examination, so that it is neither growing nor declining. For the Leslie model this is the case if and only if the dominant eigenvalue of the associated Leslie matrix equals one, λ_1 = 1. Below we describe this situation in two cases: one less realistic, involving the population of female galagos analyzed in 3.C.3, and one more realistic, involving a population of fish in a restricted environment, such as an artificial pond.
Thus we work with the matrix B = λ^{−1}·(Y^{−1}·A·Y), where Y is the diagonal matrix with the coordinates y_i of the above eigenvector y on its diagonal. Evidently B is also a positive matrix. By construction, the vector z = (1, …, 1)^T is its eigenvector with eigenvalue 1, because Y·z = y. It remains to prove that µ = 1 is a simple root of the characteristic polynomial of the matrix B and that all other roots have absolute value strictly smaller than one. Then the proof of the Perron theorem is finished. For this we use an auxiliary lemma, which is discussed below.

Consider for the moment the matrix B as defining the linear mapping ψ that maps row vectors,

u = (u_1, …, u_n) ↦ u·B = v,

that is, using multiplication from the right (i.e., B is viewed as the matrix of a linear map on one-forms). Since z = (1, …, 1)^T is an eigenvector of the matrix B (with eigenvalue 1), the sum of the coordinates of the row vector v = u·B is

u·B·(1, …, 1)^T = Σ_{i,j=1}^n u_i b_ij = Σ_{i=1}^n u_i = 1

whenever u ∈ S. Therefore ψ maps the simplex S onto itself, and thus has a (row) eigenvector w in S with eigenvalue one (a fixed point, by the Brouwer theorem again). Because some power of B is positive by our assumption, the image of the simplex S under this power lies inside S.

We continue with the row vectors. Denote by P the shift of the simplex S into the origin by the eigenvector w we have just found, that is, P = −w + S. Evidently P is a set containing the origin and defined by linear inequalities. Clearly ψ(P) = −w + ψ(S) ⊂ P, and thus the vector subspace V ⊂ R^n generated by P is invariant with respect to the action of the matrix B through multiplication of row vectors from the right. Moreover, if ψ(p) ∈ P sits on the boundary of P, then ψ(p + w) is on the boundary of S. Hence the restriction of our mapping to V, together with P itself, satisfies the assumptions of the auxiliary lemma discussed below, and thus all its eigenvalues are strictly smaller than one. Now the entire space decomposes as the sum R^n = V ⊕ span{w} of invariant subspaces, where w is the eigenvector with eigenvalue 1, while all eigenvalues of the restriction to V are strictly smaller in absolute value.

3.C.6. Stabilizing the size of the bushbabies' colony. Let us return to the colony of bushbabies studied in 3.C.3 and determine the adjustments of the age-dependent vital rates that are required to stabilize the population size. By age-dependent vital rates we mean the fertility rates b_1, …, b_4 and the survival rates s_1, …, s_4.

Solution. Since we are using the Leslie model, the stabilization process is governed by the condition λ_1 = 1, where λ_1 is the dominant eigenvalue of the corresponding Leslie matrix. This condition is independent of the initial population size. We consider the following possibilities (though there may be others):

1. Increase s := s_1, that is, the survival rate of babies. Hence we use the Leslie matrix

A = [ 0    1    2    1 ]
    [ s    0    0    0 ]
    [ 0    0.5  0    0 ]
    [ 0    0    0.2  0 ]

for some s with 0 ≤ s ≤ 1. Since we require λ_1 = 1, to determine the survival rate s we need to compute the determinant of A − E, where E is the identity matrix. For this one can use Sage, as follows:

s = var("s")
A = matrix(SR, 4, 4, [0, 1, 2, 1, s, 0, 0, 0, 0, 0.5, 0, 0, 0, 0, 0.2, 0])
E = identity_matrix(4)
(A-E).det()

which gives -2.1*s + 1. Thus the equation det(A − E) = 0 has the solution s = 10/21 ≈ 0.47619.
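Before drawing the conclusion, one may double-check (this cell is our own addition) that substituting s = 10/21 back into the Leslie matrix indeed produces the dominant eigenvalue 1:

A1 = matrix(RDF, [[0, 1, 2, 1], [10/21, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 0.2, 0]])
# the spectral radius should be (numerically) equal to 1
max(abs(ev) for ev in A1.eigenvalues())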
Hence we need to increase the survival rate s = s_1 from its initial value 0.4 to 10/21 ≈ 0.47619 in order to stabilize the size of the female population of bushbabies (under the assumption that the other vital rates remain as in 3.C.3).

2. Change the fertility rates b_i, i = 2, 3, 4, in a proportional way. This means that we use the Leslie matrix

A = [ 0    s    2s   s ]
    [ 0.4  0    0    0 ]
    [ 0    0.5  0    0 ]
    [ 0    0    0.2  0 ]

for some s > 0, which we specify via the constraint λ_1 = 1. Using Sage we get det(A − E) = −0.84s + 1, and the equation det(A − E) = 0 gives s = 25/21 ≈ 1.19048. Therefore, by setting b_2 = b_4 = 1.19048 and b_3 = 2.38095, while maintaining the survival rates s_i as in 3.C.3, the population is stabilized. □

3.C.7. Population of fish in an artificial pond. Suppose we have a simple model of a pond where a population of fish lives (e.g., bleak, vimba, or nase). Assume that 20% of the babies survive their second year and, from that age on, they are able to reproduce. It is known that approximately 60% of these young fish survive the third year, and in the following years mortality can be ignored. We also assume that the birth rate is three times the number of fish that can reproduce.

The theorem is nearly proved. It remains to deal with the problem that the mapping in question was given by multiplication of row vectors by the matrix B from the right, while originally we were interested in the mapping given by the matrix B with multiplication of column vectors from the left. But the latter is equivalent to the usual multiplication, from the left, of the transposed column vectors by the transposed matrix B. Thus we have proved the claim about the eigenvalues for the transpose of B. But transposing does not change the eigenvalues, and so the proof is complete. □

A bounded polyhedron in R^n is a nonempty subset defined by linear inequalities, sitting in some large enough ball. The simplex S from the proof, or any translation of it, are examples. We shall provide a concise explanation of all these concepts in chapter 4.

Lemma. Consider any bounded polyhedron P ⊂ R^n containing a ball around the origin 0 ∈ R^n. If some iteration of the linear mapping ψ : R^n → R^n maps P into its interior (that is, ψ(P) ⊂ P and the image does not intersect the boundary), then the spectral radius of the mapping ψ is strictly less than one.

Proof. Consider the matrix A of the mapping ψ in the standard basis. Because the eigenvalues of A^k are the k-th powers of the eigenvalues of the matrix A, we may assume (without loss of generality) that the mapping ψ already maps P into P. Clearly ψ cannot have any eigenvalue λ with absolute value greater than one. This is easy to see if the eigenvalue is real. In the complex case, there is a 2-dimensional invariant plane in which ψ acts as multiplication by |λ| composed with a rotation, so |λ| > 1 is again in conflict with the invariance of P.

Next, we argue by contradiction and assume that there exists an eigenvalue λ with |λ| = 1. Then there are two possibilities: either λ^k = 1 for a suitable positive integer k, or there is no such k. The image of P is a closed set (this means that if the points in the image ψ(P) get arbitrarily close to some point y in R^n, then the point y is also in the image — a general feature of linear maps on finite-dimensional vector spaces). By our assumption, the boundary of P does not intersect the image.
Thus ψ cannot have a fixed point on the boundary, and there cannot even be any point on the boundary to which some sequence of points in the image would converge. The first argument excludes the possibility that some power of λ equals one, because such a fixed point of ψ^k on the boundary of P would then exist, and it would lie in the image. In the remaining case, there would be a two-dimensional subspace W ⊂ R^n on which the restriction of ψ acts as a rotation by an irrational angle, and thus there exists a point y in the intersection of W with the boundary of P. But then the point y could be approached arbitrarily closely by points of the set of iterates ψ^k(y), and thus it would have to be in the image too. This is a contradiction, and thus the lemma is proved. □

Clearly, such a population would fill the pond very quickly. To maintain a balance, we need to introduce a predator, such as esox. Assume that an esox eats approximately 500 mature fish per year. How many predators of this type should be introduced into the pond to keep the population constant?

Solution. There are three age classes: babies, young fish, and adult fish. Let p be the number of babies, m the number of young fish, and r the number of adult fish. In terms of vectors, the state of the population in the next year is given by

(p, m, r)^T ↦ (3m + 3r, 0.2·p, 0.6·m + y·r)^T.

The relative mortality of the adult fish caused by the predators equals the difference 1 − y, and we should specify the unknown y. This model is described by a generalized Leslie matrix of the form

A = [ 0    3    3 ]
    [ 0.2  0    0 ]
    [ 0    0.6  y ].

To stabilize the population, one of the eigenvalues of this matrix should be equal to 1, and this gives us the desired value of the unknown y. As before, in Sage we may type

y = var("y")
A = matrix(SR, 3, 3, [0, 3, 3, 0.2, 0, 0, 0, 0.6, y])
I3 = matrix(RR, [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
(A-I3).det()

This prints out the expression 0.4*y - 0.04, with the obvious solution y = 0.1. This means that in the next year only 10 % of the adult fish are allowed to survive, and the rest should be eaten by the predators. Let us denote the number of predators by x. Together they eat 500x fish, which according to the previous computation should be 0.9r. Consequently, the ratio of the number of fish to the number of predators is given by

r/x = 500/0.9,

that is, one esox per (approximately) 556 adult fish. □

3.C.8. Predators and prey. In a population model, let D_k be the number of predators and K_k the number of prey in month k. The relation between month k and month k + 1 is given by one of the following three linear systems:

(a) D_{k+1} = 0.6·D_k + 0.5·K_k,  K_{k+1} = −0.16·D_k + 1.2·K_k,
(b) D_{k+1} = 0.6·D_k + 0.5·K_k,  K_{k+1} = −0.175·D_k + 1.2·K_k,
(c) D_{k+1} = 0.6·D_k + 0.5·K_k,  K_{k+1} = −0.135·D_k + 1.2·K_k.

Analyse the behaviour of this model for large time values.

Solution. We can encode all three cases as

(D_k, K_k)^T = T_a · (D_{k−1}, K_{k−1})^T,  k ∈ N,

3.3.4. Simple corollaries. Once we know the Perron theorem, the following very useful claim has a surprisingly simple proof. It shows how strong the primitivity assumption on a matrix is.

Corollary. If A = (a_ij) is a primitive matrix and x ∈ R^n is its eigenvector with all coordinates non-negative and eigenvalue λ, then λ > 0 is the spectral radius of A. Moreover,

min_{j∈{1,…,n}} Σ_{i=1}^n a_ij ≤ λ ≤ max_{j∈{1,…,n}} Σ_{i=1}^n a_ij.

Proof. Because A is primitive, we can choose k such that A^k has only positive elements.
Then Ak · x = λk x is a vector with all coordinates strictly positive, so obviously λ > 0. According to the Perron theorem, the spectral radius µ of A is an eigenvalue, and the associated eigenvectors y have only positive coordinates. Thus we may choose such an eigenvector y with the property that the difference x − y has only strictly positive coordinates. Then for all large positive integer powers m we have 0 < Am · (x − y) = λm x − µm y, while λ ≤ µ since µ is the spectral radius. If µ = λ + α with α > 0, then
\[
0 < \lambda^m x - (\lambda + \alpha)^m y < \lambda^m \Big( x - y - \frac{m\alpha}{\lambda}\, y \Big),
\]
which is clearly negative for m large enough. Hence λ = µ.

It remains to estimate the spectral radius using the minimum and maximum of the column sums of the matrix; denote them by bmin and bmax. Choose x to be the eigenvector with the sum of coordinates equal to one and compute:
\[
\lambda = \sum_{i=1}^n \lambda x_i = \sum_{i,j=1}^n a_{ij}x_j = \sum_{j=1}^n \Big( \sum_{i=1}^n a_{ij} \Big) x_j \le \sum_{j=1}^n b_{\max} x_j = b_{\max},
\]
\[
\lambda = \sum_{j=1}^n \Big( \sum_{i=1}^n a_{ij} \Big) x_j \ge \sum_{j=1}^n b_{\min} x_j = b_{\min}. \qquad \square
\]

Note that, for instance, all Leslie matrices from 3.3.2 are primitive as soon as all their parameters fi and τj are strictly positive. Thus we can apply the just derived results to them. (Compare this with the ad hoc analysis of the roots of the characteristic polynomial in 3.3.2.)

3.3.5. Markov chains. A very frequent and interesting case of linear processes with only non-negative elements in the matrix is the mathematical model of a system which can be in one of m states with various probabilities. At a given point of time, the system is in state i with probability xi. The transition from state j to state i happens with probability tij. We can write the process as follows: at time n the system is described by the stochastic vector (we also say probability vector) xn = (u1(n), . . . , um(n))T,

where
\[
T_a := \begin{pmatrix} 0.6 & 0.5 \\ -a & 1.2 \end{pmatrix}
\]
and a ∈ {0.16, 0.175, 0.135}. The coefficient a represents the average number of prey killed by one predator per month. (The value of a does not depend on the size of the population; in fact, for stable populations one can estimate the size of a. As a consequence, observe that a small change of a can lead to different conclusions.) It follows that
\[
\begin{pmatrix} D_k \\ K_k \end{pmatrix} = T_a^k \cdot \begin{pmatrix} D_0 \\ K_0 \end{pmatrix}, \quad k \in \mathbb{N}.
\]
Using the powers of the matrix Ta we can determine the evolution of the populations of predators and prey after a very long time. For this procedure it is useful to summarize the eigenvalues and eigenvectors of Ta in a table:

            λ1      λ2      X1         X2
a = 0.16    1       4/5     (5, 4)T    (5, 2)T
a = 0.175   19/20   17/20   (10, 7)T   (2, 1)T
a = 0.135   21/20   3/4     (10, 9)T   (10, 3)T

To compute Tak we may simply proceed by hand. As we know, if X = (X1 X2) is the matrix formed by the eigenvectors, then
\[
T_a^k = X \cdot \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}^{\!k} \cdot X^{-1}, \quad k \in \mathbb{N}.
\]
For instance, for a = 0.16 we have
\[
X = \begin{pmatrix} 5 & 5 \\ 4 & 2 \end{pmatrix}, \qquad X^{-1} = \begin{pmatrix} -1/5 & 1/2 \\ 2/5 & -1/2 \end{pmatrix}, \qquad \begin{pmatrix} 1 & 0 \\ 0 & 0.8 \end{pmatrix}^{\!k} \approx \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\]
for large k. Thus, for large time values we get
\[
T_{0.16}^k \approx \begin{pmatrix} -1 & 5/2 \\ -4/5 & 2 \end{pmatrix}.
\]
In a similar way, for large k one computes
\[
T_{0.175}^k \approx \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad T_{0.135}^k \approx 1.05^k \begin{pmatrix} -1/2 & 5/3 \\ -9/20 & 3/2 \end{pmatrix}.
\]
Knowing Tak, we can describe the systems for large time values:
\[
\begin{pmatrix} D_k \\ K_k \end{pmatrix} \approx \frac{1}{10}\begin{pmatrix} 5(-2D_0 + 5K_0) \\ 4(-2D_0 + 5K_0) \end{pmatrix}, \qquad \begin{pmatrix} D_k \\ K_k \end{pmatrix} \approx \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad \begin{pmatrix} D_k \\ K_k \end{pmatrix} \approx \frac{1.05^k}{60}\begin{pmatrix} 10(-3D_0 + 10K_0) \\ 9(-3D_0 + 10K_0) \end{pmatrix},
\]
for a = 0.16, a = 0.175 and a = 0.135, respectively. A qualitative interpretation of these relations produces valuable conclusions for the future of the populations. In particular:
(a) If 2D0 < 5K0, the sizes of both populations stabilise at non-zero values; if 2D0 ≥ 5K0, both populations die out.
(b) Both populations die out.
(c) For 3D0 < 10K0 there is a population boom of both kinds; for 3D0 ≥ 10K0 both populations die out.
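The eigen-data in the table above are easy to confirm in Sage; the following loop is our own verification sketch, using exact rational entries:

for a in [16/100, 175/1000, 135/1000]:
    T = matrix(QQ, 2, 2, [3/5, 1/2, -a, 6/5])
    print(a, T.eigenvalues())
    print(T.eigenvectors_right())

Sage normalizes the eigenvectors differently (with first coordinate 1), but they span the same lines as the columns X1, X2 given in the table.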
Note that there are many models that effectively describe predator–prey population dynamics. These models can be either discrete iterative models or continuous models based on differential equations; a prominent example is the Lotka–Volterra model, see 8.3.10.

This means that all components of the vector x are real non-negative numbers and their sum equals one. The components give the probability distribution of the individual possible states of the system. The distribution of the probabilities at time n + 1 is given by multiplication by the transition matrix T = (tij), that is, xn+1 = T · xn. Since we assume that the vector x captures all possible states of the system and moves again to some of these states with total probability one, all columns of T are also stochastic vectors. We call such matrices stochastic matrices. Note that every stochastic matrix maps every stochastic vector x to a stochastic vector Tx:
\[
\sum_{i,j} t_{ij}x_j = \sum_j \Big(\sum_i t_{ij}\Big)x_j = \sum_j x_j = 1 .
\]
Such a sequence xn+1 = Txn is called a (discrete) Markov process, and the resulting sequence of vectors x0, x1, . . . is called a Markov chain. Now we can exploit the Perron–Frobenius theory in its full power. Because the sum of the rows of the matrix is always equal to the vector (1, . . . , 1), the matrix T − E is singular, and thus one is an eigenvalue of the matrix T. Furthermore, if T is a primitive matrix (for instance, when all its elements are non-zero), we know from the corollary in 3.3.4 that one is a simple root of the characteristic polynomial and all the other roots have absolute value strictly smaller than one. This leads to:

Ergodic Theorem
Theorem. Markov processes with primitive matrices T satisfy:
• there exists a unique eigenvector x∞ of the matrix T with the eigenvalue 1 which is stochastic;
• the iterations Tk x0 approach the vector x∞ for any initial stochastic vector x0.

Proof. The first claim follows directly from the positivity of the coordinates of the eigenvector derived in the Perron theorem (notice the dominant eigenvalue comes with multiplicity one). Next, assume that the algebraic and geometric multiplicities of the eigenvalues of the matrix T coincide. Then every stochastic vector x0 can be written (in the complex extension Cn) as a linear combination x0 = c1x∞ + c2y2 + · · · + cnyn, where y2, . . . , yn extend x∞ to a basis of eigenvectors. The k-th iteration then gives again a stochastic vector
\[
x_k = T^k \cdot x_0 = c_1 x_\infty + \lambda_2^k c_2 y_2 + \cdots + \lambda_n^k c_n y_n .
\]
All the eigenvalues λ2, . . . , λn are strictly smaller than one in absolute value, so all components of the vector xk but the first one approach zero (in norm).

Further exercises related to the Leslie model and population growth are presented in Section F, see for example the tasks 3.F.12, 3.F.13 and 3.F.15. □

In the remainder of this section, we delve into the discrete "Markov process", a topic where matrix calculus intersects with probability theory. For a more theoretical treatment, see 3.3.5 and 3.4.7. We begin with a brief introduction to stochastic processes, a concept that will be extensively analyzed in Chapter 10. Here, we primarily focus on "discrete stochastic processes" (cf. 10.2.14).
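Before turning to formal definitions, the reader may enjoy generating a first example in Sage; the following sketch (entirely our own illustration) produces one realization of a simple random walk, one of the most basic discrete stochastic processes:

import random
random.seed(0)                 # reproducible illustration
X = [0]                        # the state X_0
for n in range(10):
    X.append(X[-1] + random.choice([-1, 1]))
print(X)                       # one sample path X_0, ..., X_10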
Recall that a "random variable" in the context of a random experiment with sample space Ω is an R-valued function X : Ω → R satisfying the property that {s : X(s) ∈ I} is an event for any interval I ⊆ R; for further details see 10.2.10. In simpler terms, a random variable assigns a real number to each possible outcome of a random experiment. A random variable X that can take at most countably many values is termed discrete. For instance, the number of typographical errors on a page of a book is a discrete random variable.

A stochastic (or random) process is essentially a sequence of random variables {X(t) ≡ Xt}, t = 0, 1, . . .. Such processes depict the evolution over time of a random phenomenon, thereby encapsulating the concept of probabilistic dynamics. In the following we are only interested in discrete random variables and, consequently, in discrete stochastic processes.

3.C.9. Describe examples of discrete stochastic processes.

Solution. There is a plethora of such examples. Let Xn be the number of customers served in a bank, or the number of product sales in a market, at the end of the nth working day. Then {Xn : n = 0, 1, . . .} is a discrete stochastic process. Another example occurs when Xn represents the number of arrivals at an emergency room between midnight and 8.00 am. Or Xn can represent the number of books sold in a bookstore, the number of people who attend a concert, etc. □

The values that a discrete random variable Xn takes are known as the states of the system at step n. We can denote these states by integers 0, 1, 2, . . ., or by symbols such as s0, s1, and so on. Given that Xn−1 = j and Xn = i, the system is said to have made a "transition" from state sj to state si. Such a sequence of transitions is known as a chain. "Discrete Markov processes", also known as "Markov chains", are stochastic processes with a finite number of possible states, in which memory effects are strongly limited: the probability of having Xn = i depends only on the immediately preceding state of the system. Next we will assume time-homogeneity, meaning that the transition probabilities P(Xn = i|Xn−1 = j) are constant over time and do not depend on n. This is a common assumption in many Markov chain models.

But xk is still stochastic, thus the only possibility is that c1 = 1, and the second claim is proved. In fact, even if the algebraic and geometric multiplicities of the eigenvalues do not coincide, we reach the same conclusion using a more detailed study of the root subspaces of the matrix T. (We meet them when discussing the Jordan matrix decomposition later in this chapter.) Consequently, even in the general case the eigensubspace span{x∞} comes with a unique invariant (n − 1)-dimensional complement, on which all the eigenvalues are smaller than one in absolute value, and the corresponding components of xk approach zero as before. See the note 3.4.11 where we finish this argument in detail. □

3.3.6. Iteration of stochastic matrices. We reformulate the previous theorem into a simple but surprising result. By convergence to a limit matrix in the following theorem we mean: given any error bound ε > 0, we can find a number of iterations k after which all the components of the matrix differ from the limit ones by less than ε.

Corollary. Let T be a primitive stochastic matrix of a Markov process and let x∞ be the stochastic eigenvector for the dominant eigenvalue 1 (as in the Ergodic Theorem above). Then the iterations Tk converge to the limit matrix T∞, whose columns all equal x∞.

Proof.
The columns of the matrix Tk are the images of the standard basis vectors under the corresponding iterated linear mapping. But these are images of stochastic vectors, and thus they all converge to x∞. □

3.3.7. Final brief remark. Before leaving the Markov processes, we briefly mention their more general versions with matrices which are not primitive. Here we would need the full Frobenius–Perron theory. Without going into technicalities, consider a process with a blockwise diagonal or upper triangular matrix T,
\[
T = \begin{pmatrix} P & R \\ 0 & Q \end{pmatrix},
\]
and imagine first that P, Q are primitive and R = 0. Here we can again apply the above results blockwise. In words, if we start in a state x0 with all probability concentrated in the first block of coordinates, the process converges to a value x∞ which again has all the probability distributed among the first block of coordinates, and the same for the other block. If R > 0, then we can always jump to the states corresponding to the first block from those in the second block with non-zero probability, and the iterations get more complicated:
\[
T^2 = \begin{pmatrix} P^2 & P\cdot R + R\cdot Q \\ 0 & Q^2 \end{pmatrix}, \qquad
T^3 = \begin{pmatrix} P^3 & P^2\cdot R + P\cdot R\cdot Q + R\cdot Q^2 \\ 0 & Q^3 \end{pmatrix}.
\]

The transition from state sj to si, or simply from j to i, is encoded by the transition probability tij = P(Xn = i|Xn−1 = j); compare with 3.3.5. Thus, the dynamics of a discrete-time homogeneous Markov chain with a state space consisting of m states is described by the m × m matrix T = (tij). This is known as the transition matrix of the Markov process, from the (n − 1)th step to the nth step. In our context, T is a column-stochastic matrix, meaning that tij ≥ 0 and ∑i tij = 1. In other words, each column of T is a stochastic vector (it has non-negative real entries summing to one, see also 3.3.5).

3.C.10. Two-state Markov chains are Markov chains having a state space consisting of two elements, typically denoted by S = {0, 1}. They can be used to model various situations, such as the operational status of everyday machines where the probability of a machine being out of operation the next day is known. For example, consider a traffic light on a road in New York. Suppose the probability that a working traffic light will be out of order the next day is p, while the probability that an out-of-order traffic light will start operating the next day is q. Demonstrate that this scenario constitutes a Markov process and determine its transition matrix.

Solution. What happens on each day depends only on the previous day, hence we have a two-state (homogeneous) Markov process {Xn}, where Xn is the state of the traffic light on the nth day. By definition, Xn = 0 if the traffic light is out of order on the nth day, and Xn = 1 otherwise. Thus one computes
t00 = P(Xn = 0|Xn−1 = 0) = 1 − q,
t01 = P(Xn = 0|Xn−1 = 1) = p,
t10 = P(Xn = 1|Xn−1 = 0) = q,
t11 = P(Xn = 1|Xn−1 = 1) = 1 − p.
This means that
\[
T = \begin{pmatrix} t_{00} & t_{01} \\ t_{10} & t_{11} \end{pmatrix} = \begin{pmatrix} 1-q & p \\ q & 1-p \end{pmatrix}. \qquad \square
\]

3.C.11. Remarks. a) Markov chains on finite state spaces can alternatively be represented graphically by a "transition diagram". This diagram is a directed graph in which each vertex represents a state of the chain, and a directed edge is drawn from vertex j to vertex i, labeled with the probability tij, whenever tij > 0.
For example, in the case of a two-state Markov process as described earlier, the transition diagram consists of two vertices 0 and 1, with a loop at 0 labeled t00 = 1 − q, a loop at 1 labeled t11 = 1 − p, an edge from 0 to 1 labeled t10 = q, and an edge from 1 to 0 labeled t01 = p. This graphical representation is particularly useful for visualizing the structure of the Markov chain, understanding the possible state transitions, and analyzing the overall behavior of the chain in terms of state probabilities and sequences of transitions; see 3.C.12 for another example.

An interesting special case is when P = E and R is positive. Then Q − E must be a regular matrix, and a simple computation yields the general iteration (notice that E and Q commute and thus (E − Q)(E + Q + · · · + Qk−1) = E − Qk):
\[
T^k = \begin{pmatrix} E & R(E - Q)^{-1}(E - Q^k) \\ 0 & Q^k \end{pmatrix}.
\]
Thus the entire first block of states is formed by eigenvectors with eigenvalue 1 (so these states stay constant with probability 1), while the behavior on the other block is more complicated.

4. More matrix calculus

We have seen that understanding the inner structure of matrices is a strong tool for both computation and analysis. This is even more true when considering numerical calculations with matrices. Therefore we now return to the abstract theory. We introduce special types of linear mappings on vector spaces. We consider general linear mappings whose structure is understood in terms of the Jordan normal form (see 3.4.10). In all these cases, complex scalars are essential. So we extend our discussion of the scalar product (see 2.3.18–2.3.22) to complex vector spaces. Actually, in many areas complex vector spaces are the essential platform necessary for introducing the mathematical models. For instance, this is the case in so-called quantum computing, which has become a very active area of theoretical computer science. Many people hope to construct an effective quantum computer soon.

3.4.1. Unitary spaces and mappings. The definitions of scalar product and orthogonality easily extend to the complex case. But we do not mean the complex bilinear symmetric forms α, since there the quadratic expressions α(v, v) are not real in general, and thus we would not get the right definition of the length of vectors. Instead, we define:

Unitary spaces
A unitary space is a complex vector space V along with a mapping V × V → C, (u, v) → u · v, called the scalar product, satisfying for all vectors u, v, w ∈ V and scalars a ∈ C the following axioms:
(1) $u \cdot v = \overline{v \cdot u}$ (the bar stands for complex conjugation),
(2) (au) · v = a(u · v),
(3) (u + v) · w = u · w + v · w,
(4) if u ≠ 0, then u · u > 0 (notice u · u is always real).
The real number √(v · v) is called the norm of the vector v, and a vector is normalized if its norm equals one. Vectors u and v are said to be orthogonal if their scalar product is zero. A basis composed of mutually orthogonal and normalized vectors is called an orthonormal basis of V.

b) Be aware that other authors may define the transition matrix T = (tij) with tij being the probability of a transition from state i to state j, that is, tij = P(Xn = j|Xn−1 = i). This is the opposite of our convention, producing a row-stochastic matrix whose transpose is our transition matrix.
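To tie these conventions to concrete computations, here is a quick symbolic check in Sage (our own sketch, using the traffic-light matrix from 3.C.10; the variable names x0, x1 are ours): the columns of T sum to one, and the stationary vector is found by solving Tx = x together with the normalization.

p, q = var("p q")
T = matrix(SR, 2, 2, [1-q, p, q, 1-p])
print([sum(c) for c in T.columns()])    # [1, 1]: T is column-stochastic
x0, x1 = var("x0 x1")
solve([(1-q)*x0 + p*x1 == x0, x0 + x1 == 1], [x0, x1])
# [[x0 == p/(p + q), x1 == q/(p + q)]]

So in the long run the light is out of order with probability p/(p + q), in line with the Ergodic Theorem of 3.3.5.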
3.C.12. Absent-minded professor. An absent-minded professor always carries an umbrella with him, but each time he leaves a place, he forgets the umbrella there with probability 1/2. His daily routine is strict: in the morning he walks from his house to his office; from there he goes to a restaurant for lunch, then returns to his office, and finally, in the evening, he returns home. It is assumed that the professor does not visit other locations, and if he forgets his umbrella at the restaurant, it remains there until his next visit. This situation can be modeled as a Markov process. Determine its transition matrix and diagram, and calculate the probability that after many days, in the morning, the umbrella is at the restaurant.

Solution. We take as time unit one day, from morning to the next morning, and denote by Xn the place where the umbrella sits in the morning of the nth day. This defines a (homogeneous) Markov process {Xn : n ∈ N} with state space S = {s1 = house, s2 = office, s3 = restaurant}. For simplicity we represent these states as {1, 2, 3}. Let us proceed with the transition matrix T = (tij), where tij = P(Xn+1 = i|Xn = j) and 1 ≤ i, j ≤ 3. We explain the computations for the first column and leave the verification of the other two to the reader. By definition, t11 = P(Xn+1 = 1|Xn = 1), hence this is the probability that the umbrella starts its day at home and stays there till the next morning. There are three distinct possibilities p1, p2, p3 for this scenario:
p1: The umbrella stays at home in the morning. Thus p1 = 1/2.
p2: The umbrella arrives at the office and stays there during lunch, but in the evening it is taken back home. Thus p2 = 1/2 · 1/2 · 1/2 = 1/8.
p3: The professor carries the umbrella with him all the time and never forgets it. Thus p3 = 1/2 · 1/2 · 1/2 · 1/2 = 1/16.
In total, t11 = p1 + p2 + p3 = 11/16. Next, t21 = P(Xn+1 = 2|Xn = 1) is the probability that the umbrella starts its day at home and the next morning sits in the office. There are two possibilities q1, q2 for such a scenario:
q1: The umbrella arrives at the office, stays there during lunch, and in the evening it remains there again. Thus q1 = 1/2 · 1/2 · 1/2 = 1/8.
q2: The umbrella arrives at the office, then at the restaurant, then back at the office, and in the evening it remains at the office. Thus q2 = 1/2 · 1/2 · 1/2 · 1/2 = 1/16.
In total, t21 = 1/8 + 1/16 = 3/16. Finally, t31 = 1 − t11 − t21 = 1/8. Repeating this procedure for the remaining entries of T yields
\[
T = \begin{pmatrix} 11/16 & 3/8 & 1/4 \\ 3/16 & 3/8 & 1/4 \\ 1/8 & 1/4 & 1/2 \end{pmatrix}.
\]

At first sight this is an extension of the definition of Euclidean vector spaces into the complex domain. We will continue to use the alternative notation ⟨u, v⟩ for the scalar product of vectors u and v. As in the real domain, we obtain immediately from the definition the following simple properties of the scalar product, for all vectors in V and scalars in C:
u · u ∈ R,
u · u = 0 if and only if u = 0,
u · (av) = ā(u · v),
u · (v + w) = u · v + u · w,
u · 0 = 0 · u = 0,
$\big(\sum_i a_i u_i\big) \cdot \big(\sum_j b_j v_j\big) = \sum_{i,j} a_i \bar b_j (u_i \cdot v_j)$,
where the last equality holds for all finite linear combinations. It is a simple exercise to prove everything formally. For instance, the first property follows from (1), since the product u · u has to be the complex conjugate of itself. A standard example of the scalar product on the complex vector space Cn is
(x1, . . . , xn)T · (y1, . . . , yn)T = x1ȳ1 + · · · + xnȳn.
This expression is also called the standard (positive definite) Hermitian form on Cn.
Thanks to the conjugation of the coordinates of the second argument, this mapping satisfies all the required properties. The space Cn with this scalar product is called the standard unitary space of dimension n. In matrix notation, we can write this scalar product of vectors x and y as ȳT · x (here the complex conjugation indicated by the bar is performed on all components of y). As usual, the mappings which leave the additional structure invariant are of great importance.

Unitary mappings
A linear mapping φ : V → W between unitary spaces is called a unitary mapping if for all vectors u, v ∈ V,
u · v = φ(u) · φ(v).
A unitary isomorphism is a bijective unitary mapping.

3.4.2. Real and complex spaces with scalar product. In the previous chapter we already derived some simple properties of spaces with scalar products. The properties and proofs are very similar in the complex case. In the sequel we shall work with real and complex spaces simultaneously and write K for R or C. In the real case the conjugation is just the identity mapping (it is the restriction of the conjugation in the complex plane to the real line). As in the real case, we define the orthogonal complement of a vector subspace U ⊂ V in the unitary space V as
U⊥ = {v ∈ V ; u · v = 0 for all u ∈ U},
which is clearly also a vector subspace of V.

As for the transition graph, it has three vertices 1, 2, 3 (house, office, restaurant), a loop at each vertex labeled t11, t22, t33 respectively, and a pair of oppositely directed edges between every two vertices, labeled with the corresponding probabilities tij. For the final task, compute the eigenvector of T corresponding to the dominant eigenvalue 1. It is given by (y1 = 2, y2 = 1, y3 = 1)T, and we should rescale it to obtain the desired probability. This equals y3/(y1 + y2 + y3) = 1/4. □

Let {Xn : n ≥ 0} be a (homogeneous) Markov process with transition matrix T = (tij) and let x(n) be the probability distribution of the states at the nth transition, i.e., the vector whose ith component describes the probability of the system being in state i after n steps: x(n)i = P(Xn = i). We have ∑i x(n)i = 1, and it follows that each x(n) is a stochastic vector. Moreover, in terms of the matrix T we know that x(n+1) = Tx(n), which gives x(n) = Tn x(0). This makes the Markov process computationally very tractable, since once T is known, one can use matrix calculus to compute Tn. Let us describe such applications.

3.C.13. Experiment in a laboratory. In a laboratory, an experiment is carried out with equal probabilities of success and failure. If an experiment succeeds, the probability of success of the next experiment is 0.7. If it fails, the probability of success of the next experiment is 0.6. This process is continued indefinitely. For any n ∈ N, determine the probability that the nth experiment is successful.

Solution. This is a two-state Markov process {Xn : n ∈ N} with transition matrix
\[
T = \begin{pmatrix} 7/10 & 3/5 \\ 3/10 & 2/5 \end{pmatrix}.
\]
The corresponding state space admits the description {success, failure}, which here we denote by {1, 2}. Now, for n ∈ N, consider the stochastic vectors x(n) = (x(n)1, x(n)2)T, where x(n)1 is the probability of success of the nth experiment and x(n)2 is the probability of its failure, that is, x(n)1 = P(Xn = 1), x(n)2 = P(Xn = 2) = 1 − x(n)1. By assumption we have x(1) = (1/2, 1/2)T, and by the theoretical result in 3.3.5 we obtain the relation x(2) = Tx(1) = (13/20, 7/20)T. Similarly, for general n ∈ N the relation x(n+1) = Tx(n) yields x(n+1) = T2 x(n−1) = · · · = Tn x(1).
Although we deal exclusively with finite-dimensional spaces now, the results in the next two theorems have a natural generalization to Hilbert spaces, which are infinite-dimensional spaces with scalar products. We shall meet them later, in connection with approximation in vector spaces of real or complex valued functions.

Theorem. For every finite-dimensional space V of dimension n with scalar product we have:
(1) There exists an orthonormal basis in V.
(2) Every system of non-zero orthogonal vectors in V is linearly independent and can be extended to an orthogonal basis.
(3) For every system of linearly independent vectors (u1, . . . , uk) there exists an orthonormal basis (v1, . . . , vn) such that ⟨v1, . . . , vi⟩ = ⟨u1, . . . , ui⟩ for all 1 ≤ i ≤ k, i.e., its vectors consecutively generate the same subspaces as the vectors uj.
(4) If (u1, . . . , un) is an orthonormal basis of V, then the coordinates of every vector u ∈ V are expressed via
u = (u · u1)u1 + · · · + (u · un)un.
(5) In any orthonormal basis, the scalar product has the coordinate form
u · v = ȳT · x = x1ȳ1 + · · · + xnȳn,
where x and y are the columns of coordinates of the vectors u and v in the chosen basis. Notably, every n-dimensional space with scalar product is isomorphic to the standard Euclidean Rn or the unitary Cn.
(6) The orthogonal sum of unitary subspaces V1 + · · · + Vk in V is always a direct sum.
(7) If A ⊂ V is an arbitrary subset, then A⊥ ⊂ V is a vector subspace (and thus also unitary), and (A⊥)⊥ ⊂ V is exactly the subspace generated by A. Furthermore, V = span A ⊕ A⊥.
(8) V is an orthogonal sum of n one-dimensional unitary subspaces.

Proof. (1), (2), (3): First we extend the given system of vectors into any basis (u1, . . . , un) of the space V and then start the Gram–Schmidt orthogonalization from 2.3.21. This procedure works in the complex case as well. It yields an orthogonal basis with the properties required in (3). But from the Gram–Schmidt orthogonalization algorithm it is clear that if the original k vectors formed an orthogonal system, then they remain unchanged by the orthogonalization process. Thus we have also proved (2) and (1).
(4): If u = a1u1 + · · · + anun, then
u · ui = a1(u1 · ui) + · · · + an(un · ui) = ai∥ui∥2 = ai.
(5): If u = x1u1 + · · · + xnun and v = y1u1 + · · · + ynun, then
u · v = (x1u1 + · · · + xnun) · (y1u1 + · · · + ynun) = x1ȳ1 + · · · + xnȳn.

Next we compute Tn. By running the following commands in Sage we verify that T is diagonalizable and obtain its eigenvalues:

A = matrix(QQ, [[7/10, 3/5], [3/10, 2/5]])
A.is_diagonalizable()
A.eigenvalues()

The built-in function A.is_diagonalizable() provides a straightforward method to determine whether a matrix A is diagonalizable. We also obtain the eigenvalues λ1 = 1 and λ2 = 1/10. As eigenvectors we choose the vectors e1 = (2, 1) and e2 = (−1, 1). Since T is diagonalizable, we can compute Tn by applying the same method as in 3.B.6. Hence we should compute Tn = PDnP−1, where
\[
P = \begin{pmatrix} 2 & -1 \\ 1 & 1 \end{pmatrix}, \qquad D = \operatorname{diag}(\lambda_1, \lambda_2) = \operatorname{diag}(1, 1/10).
\]
This gives
\[
T^n = \frac{1}{3}\begin{pmatrix} 2 + 10^{-n} & 2 - 2\cdot 10^{-n} \\ 1 - 10^{-n} & 1 + 2\cdot 10^{-n} \end{pmatrix}, \quad n \in \mathbb{N}.
\]
Here is a verification in Sage:

T = matrix(SR, [[7/10, 3/5], [3/10, 2/5]])
P = matrix(SR, [[2, -1], [1, 1]])
n = var("n")
D = matrix(SR, [[1**n, 0], [0, (1/10)**n]])
Tn = P*D*P.inverse(); Tn

Now, matrix multiplication of Tn with the vector x(1) gives
\[
x^{(n+1)} = \Big( \frac{2}{3} - \frac{1}{6\cdot 10^{n}},\; \frac{1}{3} + \frac{1}{6\cdot 10^{n}} \Big)^{T}, \quad n \in \mathbb{N}.
\]
Thus, for big n, the probability of success of the nth experiment is close to 2/3. In other words, for large n we have x(n+1)1 ≈ 2/3. □
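In line with the corollary on iterations of stochastic matrices (see 3.3.6), one may also check numerically that the powers Tk approach the limit matrix whose columns both equal the stationary vector (2/3, 1/3)T. A one-line sanity check of our own:

T = matrix(QQ, [[7/10, 3/5], [3/10, 2/5]])
(T^50).n(digits=8)    # both columns are numerically (2/3, 1/3)^T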
To summarize, another significant advantage of studying Markov chains is their ability to predict the long-term behaviour of a system. Let us illustrate this with a straightforward example related to forecasting weather patterns.

3.C.14. Weather expectations. Suppose that the days are divided into warm, medium, and cold, and that the following hold:
1. After a warm day, the next day is warm with probability 50% and medium with probability 30%.
2. After a medium day, the next day is medium with probability 40% and cold with probability 30%.
3. After a cold day, the next day is cold with probability 50% and medium with probability 30%.
Without any further information, derive how many warm, medium and cold days can be expected in a year.

Solution. The problem is clearly a Markov process, since it is assumed that the daily weather depends only on the weather of the previous day.

(6): We need to show that for any pair Vi, Vj of the given subspaces, their intersection is the zero vector. If u ∈ Vi and u ∈ Vj, then u ⊥ u, that is, u · u = 0. This is possible only for the zero vector u ∈ V.
(7): Let u, v ∈ A⊥. Then (au + bv) · w = 0 for all w ∈ A and a, b ∈ K (by the distributivity of the scalar product). Thus A⊥ is a subspace of V. Let (v1, . . . , vk) be a basis of span A chosen among the elements of A, and let (u1, . . . , uk) be the orthonormal basis resulting from the Gram–Schmidt orthogonalization of the vectors (v1, . . . , vk). We extend it to an orthonormal basis of the whole of V (both exist by the already proven parts of this proposition). Because it is an orthogonal basis, necessarily
span{uk+1, . . . , un} = span{u1, . . . , uk}⊥ = A⊥
and A ⊂ span{uk+1, . . . , un}⊥ (this follows by expressing the coordinates in the orthonormal basis). If u ⊥ span{uk+1, . . . , un}, then u is necessarily a linear combination of the vectors u1, . . . , uk, and that happens exactly when it is a linear combination of the vectors v1, . . . , vk, which is equivalent to u being in span A.
(8): This is equivalent to the existence of an orthonormal basis. □

3.4.3. Important properties of the norm. Now we have everything prepared for the basic properties related to our definition of the norm of vectors; we also speak of the length of vectors defined by the scalar product. Note that all the claims consider finite sets of vectors; their validity does not depend on the dimension of the space V where it all takes place.

Properties of the norm
Theorem. Let V be a vector space with scalar product and let u, v be vectors in V. Then:
(1) ∥u + v∥ ≤ ∥u∥ + ∥v∥, with equality if and only if u and v are linearly dependent. This is called the triangle inequality.
(2) |u · v| ≤ ∥u∥∥v∥, with equality if and only if u and v are linearly dependent. This property is called the Cauchy inequality.
(3) If (e1, . . . , ek) is an orthonormal system of vectors, then ∥u∥2 ≥ |u · e1|2 + · · · + |u · ek|2. This property is called the Bessel inequality.
(4) If (e1, . . . , ek) is an orthonormal system of vectors, then u ∈ span{e1, . . . , ek} if and only if ∥u∥2 = |u · e1|2 + · · · + |u · ek|2. This is called the Parseval equality.
(5) If (e1, . . . , ek) is an orthonormal system of vectors and u ∈ V, then the vector w = (u · e1)e1 + · · · + (u · ek)ek is the unique vector minimizing the norm ∥u − v∥ among all v ∈ span{e1, . . . , ek}.
This has a 3-dimensional state space S = {w, m, c}, where w stands for warm, m for medium and c for cold, and the transition matrix is
\[
T = \begin{pmatrix} 0.5 & 0.3 & 0.2 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}.
\]
Let us now consider the probabilistic vector x(n) = (x(n)w, x(n)m, x(n)c)T, whose components are the probabilities that the nth day is warm, medium, or cold: x(n)w = P(Xn = w), x(n)m = P(Xn = m), and x(n)c = P(Xn = c), respectively. Since all the entries of T are positive, there exists a probabilistic vector x∞ = (x∞w, x∞m, x∞c)T which the vectors x(n) approach as n grows. By the corollary of the Perron–Frobenius theory (see 3.3.4), x∞ must be the eigenvector of T corresponding to the eigenvalue 1. This gives rise to the condition Tx∞ = x∞, which together with the condition that x∞ is stochastic yields the following system of equations:
x∞w = 0.5 x∞w + 0.3 x∞m + 0.2 x∞c,
x∞m = 0.3 x∞w + 0.4 x∞m + 0.3 x∞c,
x∞c = 0.2 x∞w + 0.3 x∞m + 0.5 x∞c,
1 = x∞w + x∞m + x∞c.
It is easy to see that x∞w = x∞m = x∞c = 1/3 is the unique solution of this system. Thus, one should expect roughly the same number of warm, medium and cold days. □

The column-stochastic matrix T in Problem 3.C.14 is symmetric, TT = T. Consequently, it is also "row-stochastic", meaning that the sum of the entries in any row is one. A matrix which is both column- and row-stochastic is called a "doubly stochastic matrix". A significant property of every doubly stochastic primitive matrix is that the corresponding vector x∞ has all its components identical, as above. This means that, after sufficiently many iterations, all states in the corresponding Markov chain are reached with the same frequency. A series of additional problems related to Markov chains is presented in Section F. There we address the development of an algorithm for determining the importance of web pages (see 3.F.20), along with other compelling tasks.

D. More matrix calculus

In this section we explore fundamental concepts of linear algebra that underpin advanced applications across diverse fields. We begin with the notion of "unitary spaces", see also 3.4.1. Given the vector space V = Fn, where F is either R or C, consider the standard scalar product ⟨ , ⟩ : V × V → F defined by
\[
\langle x, y \rangle := \sum_j x_j \bar y_j = x^T \bar y,
\]
where x = (x1, . . . , xn)T and y = (y1, . . . , yn)T are vectors in Fn, and the last expression is in terms of matrices. Since xT ȳ ∈ F is a scalar, we have xT ȳ = (xT ȳ)T = ȳT x = y∗x,

Proof. The verifications are all based on direct computations.
(2): The result is obvious if v = 0. Otherwise, define the vector w = u − ((u · v)/(v · v)) v, so that w ⊥ v, and compute
\[
\|w\|^2 = \|u\|^2 - \frac{\overline{(u\cdot v)}}{\|v\|^2}(u\cdot v) - \frac{u\cdot v}{\|v\|^2}(v\cdot u) + \frac{(u\cdot v)\overline{(u\cdot v)}}{\|v\|^4}\,\|v\|^2 ,
\]
hence
\[
\|w\|^2\,\|v\|^2 = \|u\|^2\|v\|^2 - 2|u\cdot v|^2 + |u\cdot v|^2 = \|u\|^2\|v\|^2 - |u\cdot v|^2 .
\]
These are non-negative real values, and thus ∥u∥2∥v∥2 ≥ |u · v|2, with equality if and only if w = 0, that is, whenever u and v are linearly dependent.
(1): It suffices to compute
\[
\|u+v\|^2 = \|u\|^2 + \|v\|^2 + u\cdot v + v\cdot u = \|u\|^2 + \|v\|^2 + 2\,\mathrm{Re}(u\cdot v)
\le \|u\|^2 + \|v\|^2 + 2|u\cdot v| \le \|u\|^2 + \|v\|^2 + 2\|u\|\|v\| = (\|u\| + \|v\|)^2 .
\]
Since we deal with squares of non-negative real numbers, this means that ∥u + v∥ ≤ ∥u∥ + ∥v∥. Furthermore, equality forces equality in all the previous inequalities, which is equivalent to the condition that u and v are linearly dependent (using the previous part).
(3), (4): Let (e1, . . . , ek) be an orthonormal system of vectors. We extend it to an orthonormal basis (e1, . . . , en) (that is always possible by the previous theorem). Then, again by the previous theorem, we have for every vector u ∈ V
\[
\|u\|^2 = \sum_{i=1}^n (u\cdot e_i)\overline{(u\cdot e_i)} = \sum_{i=1}^n |u\cdot e_i|^2 \ge \sum_{i=1}^k |u\cdot e_i|^2 .
\]
But that is the Bessel inequality. Furthermore, equality holds if and only if u · ei = 0 for all i > k, which proves the Parseval equality.
(5): Choose an arbitrary v ∈ span{e1, . . . , ek} and extend the given orthonormal system to an orthonormal basis (e1, . . . , en). Let (u1, . . . , un) and (x1, . . . , xk, 0, . . . , 0) be the coordinates of u and v in this basis. Then
\[
\|u - v\|^2 = |u_1 - x_1|^2 + \cdots + |u_k - x_k|^2 + |u_{k+1}|^2 + \cdots + |u_n|^2 ,
\]
and this expression is clearly minimized by choosing x1 = u1, . . . , xk = uk. □

3.4.4. Unitary and orthogonal mappings. The properties of orthogonal mappings have direct analogues in the complex domain. We can formulate and prove them together:

Proposition. Consider a linear mapping (endomorphism) φ : V → V on a (real or complex) space with scalar product. Then the following conditions are equivalent:
(1) φ is a unitary or orthogonal transformation;
(2) φ is a linear isomorphism and for every u, v ∈ V, φ(u) · v = u · φ−1(v);
(3) the matrix A of the mapping φ in any orthonormal basis satisfies $A^{-1} = \bar A^T$ (for Euclidean spaces this means A−1 = AT);

where y∗ := ȳT. Hence, we can equivalently express ⟨ , ⟩ as ⟨x, y⟩ = xT ȳ = y∗x. By convention, ⟨ , ⟩ is linear in the first argument but conjugate linear in the second one, in the sense that ⟨ax, y⟩ = a⟨x, y⟩ but ⟨x, ay⟩ = ā⟨x, y⟩, for all x, y ∈ V and a ∈ F. The scalar product ⟨ , ⟩ induces a norm on Fn defined by ∥x∥2 = ⟨x, x⟩ = ∑nj=1 |xj|2 (if zj = xj + iyj ∈ C, recall that z̄j = xj − iyj and |zj|2 = xj2 + yj2). Ensure that ∥ · ∥ satisfies the defining properties of a norm, that is:
• ∥x∥ ≥ 0, with ∥x∥ = 0 if and only if x = 0;
• ∥ax∥ = |a|∥x∥ for any a ∈ F and x ∈ Fn;
• ∥x + y∥ ≤ ∥x∥ + ∥y∥ for any x, y ∈ Fn.
Next we adopt the following terminology: ⟨·, ·⟩ is called the standard dot product for V = Rn, and the standard Hermitian form for V = Cn. This notation is consistent with what is used in Chapter 2 for the real case. Both spaces Rn and Cn endowed with ⟨ , ⟩ provide examples of unitary spaces (also known as inner product spaces), as defined in 3.4.1. On the other hand, the distance map d : V × V → R, defined by d(x, y) = ∥x − y∥, establishes V = Fn as a "metric space", a fundamental concept in analysis which we will explore in Chapter 7.

Next we will see that unitary spaces extend beyond Rn and Cn, with many other examples existing. We begin with the following task, which is left as an easy challenge for you.

3.D.1. Consider the vector space C3, endowed with the standard Hermitian form ⟨x, y⟩ = ∑k xk ȳk and the induced norm ∥ · ∥. Given the vectors x = (3 + 2i, 1 − i, −i)T and y = (2 − 2i, 1 − i, 2 + i)T, compute the inner product ⟨x, y⟩, the distance d(x, y) = ∥x − y∥, and the normalized vectors x̂, ŷ corresponding to x, y, with ∥x̂∥ = 1 = ∥ŷ∥. ⃝

3.D.2. When using Sage to compute scalar products of complex vectors, we need to be cautious. Sage permits applying the standard command x.dot_product(y), as discussed in Chapter 2 (see for example 2.C.45), to complex vectors as well. However, the handling of complex vectors may require special attention to ensure correct application and interpretation of the standard Hermitian scalar product. For example, let us use the vectors x, y given in 3.D.1.
Execute the cell

x = vector(CDF, [3+2*I, 1-I, -I])
y = vector(CDF, [2-2*I, 1-I, 2+I])
y.dot_product(x)

In this case Sage prints out the expression 11.0 − 6.0*I, which does not match the result obtained using the standard rule ⟨x, y⟩ = ∑i xi ȳi. To obtain the correct result we should use the cell

x = vector(CDF, [3+2*I, 1-I, -I])
y = vector(CDF, [2-2*I, 1-I, 2+I])
y.hermitian_inner_product(x)

which returns 3.0 + 8.0*I.

(4) the matrix A of the mapping φ in some orthonormal basis satisfies $A^{-1} = \bar A^T$;
(5) the rows of the matrix A of the mapping φ in an orthonormal basis form an orthonormal basis of the space Kn with the standard scalar product;
(6) the columns of the matrix A of the mapping φ in an orthonormal basis form an orthonormal basis of the space Kn with the standard scalar product.

Proof. (1) ⇒ (2): The mapping φ is injective, therefore it must be onto. Also φ(u) · v = φ(u) · φ(φ−1(v)) = u · φ−1(v).
(2) ⇒ (3): The standard scalar product on Kn is given for columns x, y of scalars by the expression x · y = ȳT E x = ȳT x, where E is the unit matrix. Property (2) thus means that the matrix A of the mapping φ is invertible and $\bar y^T A x = (\overline{A^{-1}y})^T x$. This means that $(\bar y^T A - (\overline{A^{-1}y})^T)x = 0$ for all x ∈ Kn. By substituting the complex conjugate of the expression in the parentheses for x, we find that equality is possible only when $\bar A^T = A^{-1}$. (We may also rewrite the expression as $\bar y^T (A - (\bar A^{-1})^T)x$ and see the conclusion by substituting the basis vectors for x and y.)
(3) ⇒ (4): This is an obvious implication.
(4) ⇒ (5): In the relevant basis, the claim is expressed via the matrix A of the mapping φ as the equation $A \bar A^T = E$, which is ensured by (4).
(5) ⇒ (6): We have $|A\bar A^T| = |E| = |A|\,\overline{|A|} = |\det A|^2 = 1$, so the inverse matrix A−1 exists. But we also have $A \bar A^T A = A$, therefore $\bar A^T A = E$, which is expressed exactly by (6).
(6) ⇒ (1): In the chosen orthonormal basis,
$\varphi(u)\cdot\varphi(v) = \overline{(Ay)}^T Ax = \bar y^T \bar A^T A x = \bar y^T E x = \bar y^T x$,
where x and y are the columns of coordinates of the vectors u and v. This ensures that the scalar product is preserved. □

The characterizations of the previous theorem deserve some notes. The matrices A ∈ Matn(K) with the property $A^{-1} = \bar A^T$ are called unitary matrices in the case of complex scalars (in the case of R we have already used the name orthogonal matrices). The definition immediately implies that a product of unitary (orthogonal) matrices is again unitary (orthogonal), and the same is true for inverses. Unitary matrices thus form a subgroup U(n) ⊂ Gln(C) of the group of all invertible complex matrices with the product operation. Orthogonal matrices form a subgroup O(n) ⊂ Gln(R) of the group of real invertible matrices. We speak of the unitary group and of the orthogonal group. The simple calculation
$1 = \det E = \det(A\bar A^T) = \det A \, \overline{\det A} = |\det A|^2$
shows that the determinant of a unitary matrix has norm equal to one. For real scalars the determinant is ±1.

Remark. It is important to note that in Sage the command x.hermitian_inner_product(y) prints out the expression 3.0 − 8.0*I. Therefore, Sage uses the rule ∑i x̄i yi instead of our rule ∑i xi ȳi. In summary, when computing the standard Hermitian form of complex vectors x, y ∈ Cn, to stay consistent with our conventions we should use the rule
⟨x, y⟩ = y∗x = y.hermitian_inner_product(x).
An alternative is based on the dot_product function and goes as follows:

u.dot_product(v.conjugate())

This corresponds to ⟨u, v⟩ for two complex vectors u, v.

3.D.3. Use the standard Hermitian form ⟨v, w⟩ = ∑i vi w̄i to compute ⟨v, w⟩, ∥v∥2, ∥w∥2 and the angle θ between v and w, where these vectors are given as follows:
(a) v = (1 + i, 2 − i)T, w = (3 − 2i, 1 + i)T in C2;
(b) v = (3, i)T, w = (2 − i, 1 − i)T in C2;
(c) v = (−i, 0, 2)T, w = (4, 1 − i, 1)T in C3.
Next verify your answers via Sage. ⃝

3.D.4. On V = F2 for F ∈ {R, C} consider the map
f((x1, x2)T, (y1, y2)T) = x1ȳ1 + 4x1ȳ2 + 4x2ȳ1 + x2ȳ2.
Does f define a scalar product on V? ⃝

3.D.5. For x = (x1, x2)T and y = (y1, y2)T in R2 set
g(x, y) := 2x1y1 − x1y2 − x2y1 + 5x2y2.
Show that g is a scalar product and compute its matrix relative to the standard basis of R2.

Solution. Linearity and symmetry are easily proved. In addition, we see that
g(x, x) = 2x12 − 2x1x2 + 5x22 = (x1 + x2)2 + (x1 − 2x2)2
for all x ∈ R2. Thus g(x, x) ≥ 0; in particular, g(x, x) = 0 if and only if x1 + x2 = 0 and x1 − 2x2 = 0, which gives x = 0. Thus g is a scalar product, clearly different from the standard dot product. Its matrix with respect to the standard basis of R2 is
\[
A = \begin{pmatrix} g(e_1, e_1) & g(e_1, e_2) \\ g(e_2, e_1) & g(e_2, e_2) \end{pmatrix} = \begin{pmatrix} 2 & -1 \\ -1 & 5 \end{pmatrix},
\]
such that g(x, y) = yT Ax. Observe that A is symmetric and positive definite (recall that A ∈ Matn(R) is called a "positive definite matrix" when uT Au > 0 for any non-zero vector u ≠ 0 in Rn). □

3.D.6. Let a ≠ b be positive real numbers. Show that the rule ρa,b(u, v) := au1v1 + bu2v2 defines a scalar product on R2, where u = (u1, u2)T and v = (v1, v2)T. Next, compute the angle between the vectors u = (1, 1)T and v = (1, −1)T with respect to ρ2,1, and compare the result with the angle that occurs if we use the dot product on R2. ⃝

Furthermore, if Ax = λx for a unitary or orthogonal matrix, then
(Ax) · (Ax) = x · x = |λ|2 (x · x).
Therefore the real eigenvalues of orthogonal matrices are ±1, and the eigenvalues of unitary matrices are always complex units in the complex plane. The same argument as for orthogonal mappings implies that orthogonal complements of invariant subspaces with respect to unitary mappings φ : V → V are also invariant. Indeed, if φ(U) ⊂ U and u ∈ U, v ∈ U⊥ are arbitrary, then
φ(v) · φ(φ−1(u)) = v · φ−1(u).
Because the restriction φ|U is also unitary, it is a bijection; notably, φ−1(u) ∈ U. But then φ(v) · u = 0, because v ∈ U⊥. Thus φ(v) ∈ U⊥. This leads to an immediate useful corollary in the complex domain.

Corollary. Let φ : V → V be a unitary mapping of complex vector spaces. Then V is an orthogonal sum of one-dimensional eigensubspaces.

Proof. There exists at least one eigenvector v ∈ V, since complex eigenvalues always exist. The restriction of φ to the invariant subspace ⟨v⟩⊥ is again unitary and also has an eigenvector. After n such steps we obtain the desired orthogonal basis of eigenvectors. After normalising the vectors, we obtain an orthonormal basis. □

Now it is possible to understand the details of the proof of the spectral decomposition of an orthogonal mapping from 2.4.7 at the end of the second chapter. The real matrix of an orthogonal mapping is interpreted as the matrix of a unitary mapping on the complex extension of the Euclidean space. We observe the consequences of the structure of the roots of the real characteristic polynomial over the complex domain.
Automatically we obtain invariant two-dimensional subspaces given by the pairs of complex conjugated eigenvalues, and hence the corresponding rotations for the restricted original real mapping.

3.4.5. Dual and adjoint mappings. When discussing vector spaces and linear mappings in the second chapter, we briefly mentioned the dual vector space V∗ of all linear forms on the vector space V, see 2.3.17. This duality extends to mappings:

Dual mappings
For any linear mapping ψ : V → W, the expression
(1) ⟨v, ψ∗(α)⟩ = ⟨ψ(v), α⟩,
where ⟨ , ⟩ denotes the evaluation of the linear forms (the second argument) on the vectors (the first argument), while v ∈ V and α ∈ W∗ are arbitrary, defines the mapping ψ∗ : W∗ → V∗ called the dual mapping to ψ.

Choose bases v in V, w in W, and write A for the matrix of the mapping ψ in these bases. Then we compute the matrix of the mapping ψ∗ in the corresponding dual bases of the dual spaces.

3.D.7. Suppose that A = (aij) is an m × n matrix over C whose column space has (complex) dimension n. Consider the mapping ρA : Cn × Cn → C with ρA(x, y) = y∗A∗Ax, where as usual $A^* = \bar A^T$ denotes the conjugate transpose of the matrix A. Show that the pair (Cn, ρA) is a unitary space. ⃝

3.D.8. Show that the rule B(A, B) := tr(B∗A) defines a scalar product on the space Matm,n(F) of m × n matrices with entries in F, for F ∈ {R, C}. This scalar product is known as the Frobenius inner product.

Solution. For any A, B ∈ Matm,n(F), the matrix B∗A is n × n, and hence B is well-defined. One can proceed by proving the axioms of a unitary space, as presented in 3.4.1. However, a direct calculation yields
\[
\mathcal{B}(A, B) = \sum_{i=1}^m \sum_{j=1}^n a_{ij}\,\overline{b_{ij}} .
\]
Hence, if we express A and B in terms of column vectors, say A = (A1 . . . An) and B = (B1 . . . Bn), then B(A, B) = ∑nj=1 ⟨Aj, Bj⟩, which is the sum of the standard scalar products of the corresponding columns of the matrices A and B. Hence B is a scalar product (as a sum of scalar products). Otherwise, a direct proof is mainly based on the properties of the trace, described in 2.C.38 in Chapter 2. For instance, over the complex numbers we have
B(aA, B) = tr(B∗(aA)) = a tr(B∗A) = a B(A, B)
for any scalar a ∈ C and any two elements A, B ∈ Matm,n(C); moreover, if C ∈ Matm,n(C) is another matrix, then
B(A + C, B) = tr(B∗(A + C)) = tr(B∗A) + tr(B∗C) = B(A, B) + B(C, B).
Positive-definiteness occurs as follows: if A∗A has entries cij, then $c_{jj} = \sum_{k=1}^m \overline{a_{kj}}\,a_{kj}$ and thus
\[
\|A\|_{\mathcal B}^2 = \mathcal B(A, A) = \sum_{j=1}^n c_{jj} = \sum_{i,j} |a_{ij}|^2 .
\]
Hence, as soon as A ≠ 0, we see that B(A, A) is strictly positive, B(A, A) > 0. We leave as an exercise the direct verification of the property $\mathcal B(A, B) = \overline{\mathcal B(B, A)}$ for A, B ∈ Matm,n(C). □

In Chapter 7 we will encounter additional examples of unitary spaces. There, leveraging the concept of integration, which we will examine in Chapter 6, we will introduce inner products on spaces of polynomials and, more generally, on infinite-dimensional function spaces (see for example 7.1.1, 7.1.2, and also the tasks 7.D.3, 7.D.4, 7.D.5, and 7.D.6). In this context, orthogonality becomes crucial, especially in Fourier analysis. For now, let us explore a few more elementary tasks related to orthogonality.
Indeed, the definition says that if we represent the vectors from W∗ in coordinates as rows of scalars, then the mapping ψ∗ is given by the same matrix as ψ, if we multiply the row vectors by it from the right:
\[
\langle \psi(v), \alpha \rangle = (\alpha_1, \ldots, \alpha_n) \cdot A \cdot \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \langle v, \psi^*(\alpha) \rangle .
\]
This means that the matrix of the dual mapping ψ∗ is the transpose AT, because α · A = (AT · αT)T.

Assume further that we have a vector space with a scalar product. Then we can naturally identify V and V∗ using the scalar product. Indeed, choosing one fixed vector w ∈ V, we substitute it into the second argument of the scalar product in order to obtain the identification
V ≃ V∗ = Hom(V, K), V ∋ w ↦ (v ↦ ⟨v, w⟩) ∈ V∗.
The non-degeneracy of the scalar product ensures that this mapping is a bijection. Notice that it is important to use w as the fixed second argument in the case K = C in order to obtain linear forms. Since factoring complex multiples out of the second argument yields complex conjugated scalars, the identification V ≃ V∗ is linear over real scalars only. It is clear that the vectors of an orthonormal basis are mapped to the forms constituting the dual basis, i.e., orthonormal bases are self-dual under our identification. Moreover, every vector is automatically understood as a linear form, by means of the scalar product. How does the above dual mapping W∗ → V∗ look in terms of our identification? We use the same notation ψ∗ : W → V for the resulting mapping, which is uniquely given as follows:

Adjoint mapping
For every linear mapping ψ : V → W between spaces with scalar products, there is the adjoint mapping ψ∗ uniquely determined by the formula
(2) ⟨ψ(u), v⟩ = ⟨u, ψ∗(v)⟩.
The brackets mean the scalar products on W and V, respectively.

Notice that using the same brackets for the evaluation of one-forms and for scalar products (which reflects the identification above) makes the defining formulae of dual and adjoint mappings look the same. Equivalently, we can understand the relation (2) as the definition of the adjoint mapping ψ∗. By substituting all pairs of vectors from an orthonormal basis for the vectors u and v, we obtain directly all the values of the matrix of the mapping ψ∗.

3.D.9. (The Pythagorean theorem) (a) In a real vector space V with a scalar product ⟨ , ⟩, prove that two vectors u, w are orthogonal, i.e., ⟨u, w⟩ = 0, if and only if ∥u + w∥2 = ∥u∥2 + ∥w∥2. (Observe that this condition, known as the Pythagorean theorem, is not the same as the equality case in the triangle inequality ∥u + w∥ ≤ ∥u∥ + ∥w∥.) Next demonstrate with a counterexample that this property does not hold for a complex unitary vector space.
(b) If (V, ⟨ , ⟩) is a real scalar product space, prove that the vectors u − w and u + w are orthogonal if and only if ∥u∥ = ∥w∥, where u, w ∈ V are two arbitrary vectors. ⃝

3.D.10. Let (V, ⟨ , ⟩) be a unitary space over F, and u, v ∈ V two arbitrary vectors. Show that:
(a) u ⊥ v if and only if ∥u + av∥ = ∥u − av∥ for all a ∈ F;
(b) u ⊥ v if and only if ∥u + av∥ ≥ ∥u∥ for all a ∈ F. ⃝

3.D.11. Consider the space Matm,n(R) with the scalar product B(A, B) = tr(BT A), introduced in 3.D.8. For the matrices
\[
A = \begin{pmatrix} 1 & 3 & 5 \\ 0 & 2 & 2 \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} 2 & 4 & 0 \\ 7 & 9 & 1 \end{pmatrix}
\]
compute the following:
(a) the angle θ between A and B;
(b) the distance ∥A − B∥B between A and B;
(c) verify the Cauchy–Schwarz inequality.

Solution. (a) By 3.D.8, for two real matrices A = (aij) and B = (bij), both of size m × n, we get B(A, B) = tr(BT A) = ∑mi=1 ∑nj=1 aij bij and ∥A∥2B = B(A, A) = ∑mi=1 ∑nj=1 aij2.
In particular, if
\[
A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{pmatrix}
\]
are two elements of Mat2,3(R), then the product BT A equals
\[
\begin{pmatrix}
b_{11}a_{11} + b_{21}a_{21} & b_{11}a_{12} + b_{21}a_{22} & b_{11}a_{13} + b_{21}a_{23} \\
b_{12}a_{11} + b_{22}a_{21} & b_{12}a_{12} + b_{22}a_{22} & b_{12}a_{13} + b_{22}a_{23} \\
b_{13}a_{11} + b_{23}a_{21} & b_{13}a_{12} + b_{23}a_{22} & b_{13}a_{13} + b_{23}a_{23}
\end{pmatrix},
\]
such that B(A, B) = tr(BT A) = a11b11 + a12b12 + a13b13 + a21b21 + a22b22 + a23b23. Thus, for the given A and B we compute
B(A, B) = 2 + 12 + 0 + 0 + 18 + 2 = 34,
∥A∥2B = a112 + a122 + a132 + a212 + a222 + a232 = 43,
∥B∥2B = b112 + b122 + b132 + b212 + b222 + b232 = 151.

Using the coordinate expression for the scalar product, the formula (2) reveals the coordinate expression of the adjoint mapping:
\[
\langle \psi(v), w\rangle = \bar w^T \cdot A \cdot \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \Big( \overline{\bar A^T \cdot \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}} \Big)^{T} \cdot \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \langle v, \psi^*(w)\rangle .
\]
It follows that if A is the matrix of the mapping ψ in an orthonormal basis, then the matrix of the adjoint mapping ψ∗ is the conjugate transpose of A, which we denote by $A^* = \bar A^T$. The matrix A∗ is called the adjoint matrix of the matrix A. Note that the adjoint matrix is well defined for any rectangular matrix. We should not confuse adjoint matrices with the algebraic adjoints used for square matrices when working with determinants.

We can summarise. For any linear mapping ψ : V → W between unitary spaces, with matrix A in some bases of V and W, the dual mapping has the matrix AT in the dual bases. If there are scalar products on V and W, we identify them (via the scalar products) with their duals. Then the dual mapping coincides with the adjoint mapping ψ∗ : W → V, which has the matrix A∗. The distinction between the matrix of the dual mapping and the matrix of the adjoint mapping is thus the additional conjugation. This is of course a consequence of the fact that our identification of a unitary space with its dual is not a linear mapping over complex scalars.

3.4.6. Self-adjoint mappings. Those linear mappings which coincide with their adjoints, ψ∗ = ψ, are of particular interest. They are called self-adjoint mappings. Equivalently, they are the mappings whose matrix A satisfies A = A∗ in some (and thus in every) orthonormal basis. In the case of Euclidean spaces, the self-adjoint mappings are those with symmetric matrices (in an orthonormal basis); they are often called symmetric mappings. In the complex domain, the matrices satisfying A = A∗ are called Hermitian matrices, or also Hermitian symmetric matrices; sometimes they are called self-adjoint matrices. Note that the Hermitian matrices form a real vector subspace of the space of all complex matrices, but not a complex vector subspace.

Remark. The next observation is of special interest. If we multiply a Hermitian matrix A by the imaginary unit, we obtain the matrix B = iA, which has the property $B^* = \bar i\,\bar A^T = -B$. Such matrices are called anti-Hermitian or Hermitian skew-symmetric.

Hence ∥A∥B = √43, ∥B∥B = √151, and
\[
\cos\theta = \frac{\mathcal B(A, B)}{\|A\|_{\mathcal B}\|B\|_{\mathcal B}} = \frac{34}{\sqrt{43}\sqrt{151}} .
\]
From this one can explicitly compute θ, as before.
(b) We see that
\[
A - B = \begin{pmatrix} -1 & -1 & 5 \\ -7 & -7 & 1 \end{pmatrix}.
\]
Thus ∥A − B∥2B = B(A − B, A − B) = 1 + 1 + 25 + 49 + 49 + 1 = 126, such that ∥A − B∥B = √126.
(c) For the Cauchy inequality we refer to 3.4.3. This important inequality for the scalar product B takes the form
\[
|\mathcal B(A, B)| \le \|A\|_{\mathcal B}\|B\|_{\mathcal B} = \sqrt{\mathcal B(A, A)}\sqrt{\mathcal B(B, B)}
\iff |\operatorname{tr}(B^*A)| \le \sqrt{\operatorname{tr}(A^*A)}\sqrt{\operatorname{tr}(B^*B)} .
\]
This gives us 34 < √43 √151 ≈ 80. □
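All three items are quick to confirm in Sage; the following is our own verification sketch (the names ip, nA, nB, dist are ours):

A = matrix(QQ, [[1, 3, 5], [0, 2, 2]])
B = matrix(QQ, [[2, 4, 0], [7, 9, 1]])
ip = (B.transpose()*A).trace()            # B(A, B) = 34
nA = sqrt((A.transpose()*A).trace())      # ||A|| = sqrt(43)
nB = sqrt((B.transpose()*B).trace())      # ||B|| = sqrt(151)
theta = arccos(ip/(nA*nB))                # the angle from part (a)
dist = sqrt(((A-B).transpose()*(A-B)).trace())   # sqrt(126)
bool(ip <= nA*nB)                         # True: the Cauchy inequality holds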
We proceed with exercises related to orthonormal bases and orthogonal complements. Recall that we have already discussed these concepts in Chapter 2 (see for example 2.C.48, 2.C.50). There we introduced the Gram–Schmidt orthogonalization process, which transforms any basis {E1, . . . , En} of a scalar product space (V, ⟨ , ⟩) into an orthogonal basis {w1, . . . , wn} of V. The process begins with w1 = E1 and constructs the jth member wj as follows:
\[
w_j = E_j - \frac{\langle E_j, w_1\rangle}{\|w_1\|^2}\,w_1 - \cdots - \frac{\langle E_j, w_{j-1}\rangle}{\|w_{j-1}\|^2}\,w_{j-1} .
\]
This results in an orthonormal basis {w1/∥w1∥, . . . , wn/∥wn∥} of V. This straightforward yet essential construction has a wide range of applications. We begin with an example that requires extending the method presented in Chapter 2 (cf. 2.C.50) to fit Sage's capabilities.

3.D.12. Consider the Euclidean space R3 endowed with the scalar product ⟨⟨u, v⟩⟩ = vT Au, where A is given by
\[
A = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{pmatrix}
\]
(verify yourselves that A is symmetric and positive definite; for the second claim, show for example that all the eigenvalues of A are positive). Apply the Gram–Schmidt procedure to obtain a ⟨⟨ , ⟩⟩-orthogonal basis of R3, starting with the standard basis e = {e1, e2, e3} of R3.

Solution. Let e1 = (1, 0, 0)T, e2 = (0, 1, 0)T and e3 = (0, 0, 1)T be the vectors of the standard basis e of R3. According to the Gram–Schmidt method, a ⟨⟨ , ⟩⟩-orthogonal basis of R3 is given by {w1, w2, w3}, with w1 = e1 and
\[
w_2 = e_2 - \frac{\langle\langle e_2, w_1\rangle\rangle}{\langle\langle w_1, w_1\rangle\rangle}\,w_1 , \qquad
w_3 = e_3 - \frac{\langle\langle e_3, w_1\rangle\rangle}{\langle\langle w_1, w_1\rangle\rangle}\,w_1 - \frac{\langle\langle e_3, w_2\rangle\rangle}{\langle\langle w_2, w_2\rangle\rangle}\,w_2 .
\]

Every real matrix can be written as a sum of its symmetric part and its anti-symmetric part:
\[
A = \tfrac{1}{2}(A + A^T) + \tfrac{1}{2}(A - A^T).
\]
In the complex domain we analogously have
\[
A = \tfrac{1}{2}(A + A^*) + i\,\tfrac{1}{2i}(A - A^*).
\]
In particular, we may express every complex matrix in a unique way as a sum A = B + iC with Hermitian symmetric matrices B = (1/2)(A + A∗) and C = (1/2i)(A − A∗). This is an analogue of the decomposition of a complex number into its real and purely imaginary components, and in the literature we often encounter the notation
B = re A = (1/2)(A + A∗), C = im A = (1/2i)(A − A∗).
In the language of linear mappings this means that every complex linear automorphism can be uniquely expressed by means of two self-adjoint mappings playing the role of the real and imaginary parts of the original mapping.

3.4.7. Spectral decomposition. Consider a self-adjoint mapping ψ : V → V with the matrix A in some orthonormal basis. We proceed similarly as in 2.4.7, where we diagonalized the matrix of an orthogonal mapping. Again, consider arbitrary invariant subspaces of self-adjoint mappings and their orthogonal complements. If a self-adjoint mapping ψ : V → V leaves a subspace W ⊂ V invariant, i.e., ψ(W) ⊂ W, then for every v ∈ W⊥ and w ∈ W,
⟨ψ(v), w⟩ = ⟨v, ψ(w)⟩ = 0.
Thus also ψ(W⊥) ⊂ W⊥. Next, consider the matrix A of a self-adjoint mapping in an orthonormal basis and an eigenvector x ∈ Cn, i.e., A · x = λx. We obtain
λ⟨x, x⟩ = ⟨Ax, x⟩ = ⟨x, Ax⟩ = ⟨x, λx⟩ = λ̄⟨x, x⟩.
The positive real number ⟨x, x⟩ can be cancelled on both sides, thus λ̄ = λ, and we see that the eigenvalues of Hermitian matrices are always real. The characteristic polynomial det(A − λE) has as many complex roots as the dimension of the square matrix A (counted with multiplicities), and all of them are actually real.
3.D.13. Let $(V, \langle\,,\rangle)$ be a (finite-dimensional) unitary vector space and let $\{E_j\}_{j=1}^n$ be a $\langle\,,\rangle$-orthonormal basis of $V$, i.e., $\langle E_i, E_j\rangle = \delta_{ij}$. Show that any two vectors $x, y \in V$ satisfy $\langle x, y\rangle = \sum_{j=1}^n \langle x, E_j\rangle\,\overline{\langle y, E_j\rangle}$. ⃝
3.D.14. Show that the vectors $E_1 = (-1, 1, 2)$, $E_2 = (2, 0, 1)$ and $E_3 = (1, 5, -2)$ form an orthogonal basis of $\mathbb{R}^3$ (with respect to the usual dot product). Next express the vector $u = (6, 2, -4)$ as a linear combination of this basis. ⃝
3.D.15. Based on the Cauchy-Schwarz inequality, show that any triple $(a, b, c)$ of positive real numbers satisfies the following inequality:
$$\sqrt{\frac{a+2b}{a+b+c}} + \sqrt{\frac{b+2c}{a+b+c}} + \sqrt{\frac{c+2a}{a+b+c}} \le 3. \qquad ⃝$$
3.D.16. Consider $\mathbb{C}^3$ endowed with the standard Hermitian form $\langle u, v\rangle = \sum_{i=1}^3 u_i\bar v_i$ and the standard basis $e = \{e_1, e_2, e_3\}$. Verify Parseval's equality for the vector $u = (2+i, -1+2i, 3-i) \in \mathbb{C}^3$. ⃝
3.D.17. In Problem 3.D.11 use the isomorphism $\mathrm{Mat}_{2,3}(\mathbb{R}) \cong \mathbb{R}^6$ established in Problem 2.C.31 to show that the matrices $A$, $B$ are linearly independent. Then consider the subspace $W$ of $\mathrm{Mat}_{2,3}(\mathbb{R})$ spanned by the matrices $A$, $B$. Find a basis of the orthogonal complement $W^\perp$ of $W$ with respect to the Frobenius scalar product $\mathcal{B}$ (introduced in 3.D.8).
Solution. Under the isomorphism $\varphi : \mathrm{Mat}_{2,3}(\mathbb{R}) \cong \mathbb{R}^6$ discussed in Problem 2.C.31, we may view the matrices $A$ and $B$ as the vectors $\varphi(A) = v_1 = (1, 3, 5, 0, 2, 2)^T$ and $\varphi(B) = v_2 = (2, 4, 0, 7, 9, 1)^T$ in $\mathbb{R}^6$, respectively. We now test their linear independence via Sage by the cell

V = RR^6
v1 = vector(RR, [1, 3, 5, 0, 2, 2])
v2 = vector(RR, [2, 4, 0, 7, 9, 1])
V.linear_dependence([v1, v2]) == []

Sage prints out True, so $v_1$, $v_2$ are linearly independent. Consider the subspace $W = \mathrm{span}_{\mathbb{R}}\{A, B\} \cong \mathrm{span}_{\mathbb{R}}\{v_1, v_2\}$ of $\mathrm{Mat}_{2,3}(\mathbb{R}) \cong \mathbb{R}^6$ spanned by $A$, $B$. We need to determine $W^\perp$ with respect to $\mathcal{B}$, that is,
$$W^\perp = \{C \in \mathrm{Mat}_{2,3}(\mathbb{R}) : \mathcal{B}(C, A) = 0 = \mathcal{B}(C, B)\}.$$
Let us express $C \in \mathrm{Mat}_{2,3}(\mathbb{R})$ as $C = \begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}$ for some reals $a, \dots, f$. Then $C \in W^\perp$ if and only if $\operatorname{tr}(A^T C) = 0 = \operatorname{tr}(B^T C)$, that is,
$$a + 3b + 5c + 2e + 2f = 0, \qquad 2a + 4b + 7d + 9e + f = 0.$$
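These two linear conditions can also be handed to Sage directly. The following small cell is our own; it reproduces symbolically the solution of the system that is derived in the continuation of this exercise below:

var('a b c d e f')
eqs = [a + 3*b + 5*c + 2*e + 2*f == 0,
       2*a + 4*b + 7*d + 9*e + f == 0]
solve(eqs, a, b)   # a == 10*c - 21/2*d - 19/2*e + 5/2*f, b == -5*c + 7/2*d + 5/2*e - 3/2*f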
Using the coordinate expression for the scalar product, the formula (2) reveals the coordinate expression of the adjoint mapping:
$$\langle \psi(v), w\rangle = (\bar w_1, \dots, \bar w_n)\, A \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \overline{\left(\bar A^T \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}\right)}^{\,T} \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \langle v, \psi^*(w)\rangle.$$
It follows that if $A$ is the matrix of the mapping $\psi$ in an orthonormal basis, then the matrix of the adjoint mapping $\psi^*$ is the transposed and conjugated matrix; we denote this by $A^* = \bar A^T$. The matrix $A^*$ is called the adjoint matrix of the matrix $A$. Note that the adjoint matrix is well defined for any rectangular matrix. We should not confuse adjoint matrices with the algebraic adjoints, which we used for square matrices when working with determinants.
We can summarize. For any linear mapping $\psi : V \to W$ between unitary spaces, with matrix $A$ in some bases on $V$ and $W$, its dual mapping has the matrix $A^T$ in the dual bases. If there are scalar products on $V$ and $W$, we identify $V$ and $W$ (via the scalar products) with their duals. Then the dual mapping coincides with the adjoint mapping $\psi^* : W \to V$, which has the matrix $A^*$. The distinction between the matrix of the dual mapping and the matrix of the adjoint mapping thus lies in the additional conjugation. This is of course a consequence of the fact that our identification of a unitary space with its dual is not a linear mapping over the complex scalars.
3.4.6. Self-adjoint mappings. Those linear mappings which coincide with their adjoints, $\psi^* = \psi$, are of particular interest. They are called self-adjoint mappings. Equivalently, they are the mappings whose matrix $A$ satisfies $A = A^*$ in some (and thus in every) orthonormal basis.
In the case of Euclidean spaces the self-adjoint mappings are those with symmetric matrices (in an orthonormal basis). They are often called symmetric mappings. In the complex domain the matrices that satisfy $A = A^*$ are called Hermitian matrices, or also Hermitian symmetric matrices; sometimes they are also called self-adjoint matrices. Note that the Hermitian matrices form a real vector subspace of the space of all complex matrices, but not a complex one.
Remark. The next observation is of special interest. If we multiply a Hermitian matrix $A$ by the imaginary unit, we obtain the matrix $B = iA$, which has the property $B^* = \bar i\,\bar A^T = -B$. Such matrices are called anti-Hermitian or Hermitian skew-symmetric. Every real matrix can be written as the sum of its symmetric part and its anti-symmetric part,
$$A = \tfrac12(A + A^T) + \tfrac12(A - A^T).$$
In the complex domain we have analogously
$$A = \tfrac12(A + A^*) + i\,\tfrac{1}{2i}(A - A^*).$$
In particular, we may express every complex matrix in a unique way as a sum $A = B + iC$ with Hermitian symmetric matrices $B = \tfrac12(A + A^*)$ and $C = \tfrac{1}{2i}(A - A^*)$. This is an analogy of the decomposition of a complex number into its real and purely imaginary components, and in the literature we often encounter the notation
$$\operatorname{re} A = \tfrac12(A + A^*), \qquad \operatorname{im} A = \tfrac{1}{2i}(A - A^*).$$
In the language of linear mappings this means that every complex linear automorphism can be expressed uniquely by means of two self-adjoint mappings playing the roles of the real and imaginary parts of the original mapping.
3.4.7. Spectral decomposition. Consider a self-adjoint mapping $\psi : V \to V$ with the matrix $A$ in some orthonormal basis. We proceed similarly as in 2.4.7, where we diagonalized the matrices of orthogonal mappings. Again, we consider arbitrary invariant subspaces of self-adjoint mappings and their orthogonal complements. If a self-adjoint mapping $\psi : V \to V$ leaves a subspace $W \subset V$ invariant, i.e. $\psi(W) \subset W$, then for every $v \in W^\perp$, $w \in W$,
$$\langle \psi(v), w\rangle = \langle v, \psi(w)\rangle = 0.$$
Thus also $\psi(W^\perp) \subset W^\perp$.
Next, consider the matrix $A$ of a self-adjoint mapping in an orthonormal basis and an eigenvector $x \in \mathbb{C}^n$, i.e. $A\cdot x = \lambda x$. We obtain
$$\lambda\langle x, x\rangle = \langle Ax, x\rangle = \langle x, Ax\rangle = \langle x, \lambda x\rangle = \bar\lambda\langle x, x\rangle.$$
The positive real number $\langle x, x\rangle$ can be cancelled on both sides, thus $\bar\lambda = \lambda$, and we see that the eigenvalues of Hermitian matrices are always real. The characteristic polynomial $\det(A - \lambda E)$ has as many complex roots as the dimension of the square matrix $A$ (counting multiplicities), and all of them are actually real. Thus we have proved the important general result:
Proposition. The orthogonal complements of invariant subspaces of self-adjoint mappings are also invariant. Furthermore, the eigenvalues of a Hermitian matrix $A$ are always real.
The very definition ensures that the restriction of a self-adjoint mapping to an invariant subspace is again self-adjoint. Thus the latter proposition implies that there always exists an orthonormal basis of $V$ composed of eigenvectors. Indeed, start with any eigenvector $v_1$, normalize it, consider its linear hull $V_1$ and restrict the mapping to $V_1^\perp$. There, choose another eigenvector $v_2 \in V_1^\perp$ and take $V_2 = \mathrm{span}(V_1 \cup \{v_2\})$, which is again invariant. Continuing in this way, we construct the sequence of invariant subspaces $V_1 \subset V_2 \subset \cdots \subset V_n = V$, building the orthonormal basis of eigenvectors, as expected.
Actually, it is easy to see directly that eigenvectors associated with different eigenvalues are perpendicular to each other. Indeed, if $\psi(u) = \lambda u$ and $\psi(v) = \mu v$, then we obtain
$$\lambda\langle u, v\rangle = \langle \psi(u), v\rangle = \langle u, \psi(v)\rangle = \bar\mu\langle u, v\rangle = \mu\langle u, v\rangle,$$
and for $\lambda \neq \mu$ this forces $\langle u, v\rangle = 0$.
Usually this result is formulated using projections onto the eigenspaces. Recall the properties of projections along subspaces, as discussed in 2.3.19. A projection $P : V \to V$ is a linear mapping satisfying $P^2 = P$. This means that the restriction of $P$ to its image is the identity, and the projection is completely determined by choosing the subspaces $\operatorname{Im} P$ and $\operatorname{Ker} P$. A projection $P : V \to V$ is called orthogonal if $\operatorname{Im} P \perp \operatorname{Ker} P$. Two orthogonal projections $P$, $Q$ are called mutually perpendicular if $\operatorname{Im} P \perp \operatorname{Im} Q$.
Spectral decomposition of self-adjoint mappings
Theorem (Spectral decomposition). For every self-adjoint mapping $\psi : V \to V$ on a vector space with scalar product there exists an orthonormal basis composed of eigenvectors. If $\lambda_1, \dots, \lambda_k$ are all the distinct eigenvalues of $\psi$ and $P_1, \dots, P_k$ are the corresponding orthogonal and mutually perpendicular projections onto the eigenspaces, then
$$\psi = \lambda_1 P_1 + \cdots + \lambda_k P_k.$$
The dimensions of the images of the projections $P_i$ equal the algebraic multiplicities of the eigenvalues $\lambda_i$.
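The theorem is easy to illustrate in Sage. In the following minimal sketch the symmetric matrix S and all names are our own choices; the orthogonal projections are assembled from the eigenvectors, and the weighted sum of projections recovers the mapping.

S = matrix(QQ, [[2, 1], [1, 2]])     # a self-adjoint (symmetric) matrix
D, P = S.eigenmatrix_right()         # S*P == P*D; the columns of P are eigenvectors
v1 = P.column(0); v2 = P.column(1)
P1 = v1.column()*v1.row()/(v1*v1)    # orthogonal projection onto the first eigenspace
P2 = v2.column()*v2.row()/(v2*v2)    # orthogonal projection onto the second eigenspace
print(P1*P2 == 0)                    # mutually perpendicular projections
print(S == D[0,0]*P1 + D[1,1]*P2)    # the decomposition psi = lambda_1 P_1 + lambda_2 P_2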
3.4.8. Orthogonal diagonalization. Linear mappings which admit orthonormal bases as in the latter theorem on spectral decomposition are called orthogonally diagonalizable. Of course, they are exactly the mappings for which we can find an orthonormal basis in which the matrix of the mapping is diagonal. We ask what such mappings look like.
In the Euclidean case this is simple: diagonal matrices are first of all symmetric, thus the orthogonally diagonalizable mappings are exactly the self-adjoint ones. As a corollary we note that an orthogonal mapping of a Euclidean space into itself is orthogonally diagonalizable if and only if it is self-adjoint. These are exactly the self-adjoint mappings with eigenvalues $\pm 1$.
The situation is much more interesting on unitary spaces. Consider any linear mapping $\varphi : V \to V$ on a unitary space and let $\varphi = \psi + i\eta$ be the (unique) decomposition of $\varphi$ into its Hermitian and anti-Hermitian parts. If $\varphi$ has a diagonal matrix $D$ in a suitable orthonormal basis, then $D = \operatorname{Re} D + i \operatorname{Im} D$, where the real and the imaginary parts are exactly the matrices of $\psi$ and $\eta$. This follows from the uniqueness of the decomposition. Knowing this in these particular coordinates, we conclude the following relations at the level of mappings: $\psi \circ \eta = \eta \circ \psi$ (i.e. the real and imaginary parts of $\varphi$ commute), and $\varphi \circ \varphi^* = \varphi^* \circ \varphi$ (since this clearly holds for all diagonal matrices). The mappings $\varphi : V \to V$ with the latter property are called the normal mappings. A detailed characterization is given by the following theorem (stated in the notation of this paragraph):
Theorem. The following conditions on a mapping $\varphi : V \to V$ on a unitary space $V$ are equivalent:
(1) $\varphi$ is orthogonally diagonalizable,
(2) $\varphi^* \circ \varphi = \varphi \circ \varphi^*$ ($\varphi$ is a normal mapping),
(3) $\psi \circ \eta = \eta \circ \psi$ (the Hermitian and anti-Hermitian parts commute),
(4) if $A = (a_{ij})$ is the matrix of $\varphi$ in some orthonormal basis and $\lambda_i$ are the $m = \dim V$ eigenvalues of $A$, then $\sum_{i,j=1}^m |a_{ij}|^2 = \sum_{i=1}^m |\lambda_i|^2$.
Proof. The implication (1) ⇒ (2) was discussed above.
(2) ⇔ (3): it suffices to calculate
$$\varphi \circ \varphi^* = (\psi + i\eta)(\psi - i\eta) = \psi^2 + \eta^2 + i(\eta\psi - \psi\eta),$$
$$\varphi^* \circ \varphi = (\psi - i\eta)(\psi + i\eta) = \psi^2 + \eta^2 + i(\psi\eta - \eta\psi).$$
Subtraction of the two lines yields $\varphi\varphi^* - \varphi^*\varphi = 2i(\eta\psi - \psi\eta)$.
(2) ⇒ (1): If $\varphi$ is normal, then $\langle \varphi(u), \varphi(u)\rangle = \langle \varphi^*\varphi(u), u\rangle = \langle \varphi\varphi^*(u), u\rangle = \langle \varphi^*(u), \varphi^*(u)\rangle$, thus $|\varphi(u)| = |\varphi^*(u)|$. Next, notice $(\varphi - \lambda\,\mathrm{id}_V)^* = \varphi^* - \bar\lambda\,\mathrm{id}_V$. Thus, if $\varphi$ is normal, then $\varphi - \lambda\,\mathrm{id}_V$ is normal too. If $\varphi(u) = \lambda u$, then $u$ is in the kernel of $\varphi - \lambda\,\mathrm{id}_V$. The latter equality of the norms of the values of normal mappings and their adjoints then ensures that $u$ is also in the kernel of $\varphi^* - \bar\lambda\,\mathrm{id}_V$. It follows that $\varphi^*(u) = \bar\lambda u$. We have proved, under the assumption (2), that $\varphi$ and $\varphi^*$ have the same eigenvectors, associated to conjugated eigenvalues.
Similarly to our procedure with self-adjoint mappings, we now prove orthogonal diagonalizability. The procedure is based on the fact that the orthogonal complements of sums of eigenspaces are invariant subspaces. Consider an eigenvector $u \in V$ with eigenvalue $\lambda$ and any $v \in \langle u\rangle^\perp$. We have
$$\langle \varphi(v), u\rangle = \langle v, \varphi^*(u)\rangle = \langle v, \bar\lambda u\rangle = \lambda\langle v, u\rangle = 0.$$
Thus $\varphi(v) \in \langle u\rangle^\perp$. The same occurs if $u$ is replaced by a sum of eigenvectors.
(1) ⇒ (4): the expression $\sum_{i,j}|a_{ij}|^2$ is the trace of the matrix $AA^*$, which is the matrix of the mapping $\varphi \circ \varphi^*$. Therefore its value does not depend on the choice of the orthonormal basis. Thus if $\varphi$ is orthogonally diagonalizable, this expression equals exactly $\sum_i |\lambda_i|^2$.
(4) ⇒ (1): This part of the proof is a direct corollary of the Schur theorem on unitary triangulation of an arbitrary linear mapping $V \to V$, which we prove later in 3.4.15. This theorem says that for every linear mapping $\varphi : V \to V$ there exists an orthonormal basis in which $\varphi$ has an upper triangular matrix; all the eigenvalues of $\varphi$ then appear on its diagonal. Since we have already shown that the expression $\sum_{i,j}|a_{ij}|^2$ does not depend on the choice of the orthonormal basis, all the elements of the upper triangular matrix which are not on the diagonal must be zero. □
Remark. We can rephrase the main statement of the latter theorem in terms of matrices. A mapping is normal if and only if its matrix $A$ satisfies $AA^* = A^*A$ in some orthonormal basis (and then equivalently in any orthonormal basis). Such matrices are called normal. Moreover, we can consider the last theorem as a generalization of standard calculations with complex numbers. The linear mappings appear similar to complex numbers in their algebraic form.
The role of real numbers is played by self-adjoint mappings, and the unitary mappings play the role of the complex units cos t+i sin t ∈ C. The following consequence of the theorem shows the link to the property cos2 t + sin2 t = 1. Corollary. The unitary mappings on a unitary space V are exactly those normal mappings φ on V for which the unique decomposition φ = ψ + iη into Hermitian and antiHermitian parts satisfies ψ2 + η2 = idV . Proof. If φ is unitary, then φφ∗ = idV = φ∗ φ and thus φφ∗ = (ψ + iη)(ψ − iη) = ψ2 + 0 + η2 = idV . On the other hand, if φ is normal, we can read the latter computation backwards which proves the other implication. □ 3.4.9. Roots of matrices. Non-negative real numbers are exactly those which are squares of real numbers (and thus we may find their square roots). At the same time, their positive square roots are uniquely defined. Now we observe a similar behaviour of matrices of the form B = A∗ A. Of course, these are the matrices of the compositions of mappings φ with their adjoints. By definition, (1) ⟨B x, x⟩ = ⟨A∗ A x, x⟩ = ⟨A x, A x⟩ ≥ 0 for all vectors x. Furthermore, we clearly have B∗ = (A∗ A)∗ = A∗ A = B. Hermitian matrices B with the property ⟨Bx, x⟩ ≥ 0 for all x are called positive semidefinite matrices. If the zero value is attained only for x = 0, they are called positive definite. Analogously, we speak of positive definite and positive semidefinite (self-adjoint) mappings φ : V → V . For every mapping φ : V → V we can define its square root as a mapping ψ such that ψ ◦ ψ = φ. The next theorem completely describes the situation when restricting to positive semidefinite mappings. 233 This is a system of two equations with six unknowns. Let c, d, e, f ∈ R be the free variables. Then we get the solution a = 10c − 21 2 d − 19 2 e + 5 2 f , b = −5c + 7 2 d + 5 2 e − 3 2 f . Hence W⊥ consists of matrices of the form C = ( 10c − 21 2 d − 19 2 e + 5 2 f −5c + 7 2 d + 5 2 e − 3 2 f c d e f ) with c, d, e, f ∈ R. We see that C = c ( 10 −5 1 0 0 0 ) + d ( −21/2 7/2 0 1 0 0 ) +e ( −19/2 5/2 0 0 1 0 ) + f ( 5/2 −3/2 0 0 0 1 ) = cW1 + dW2 + eW3 + fW4 , where we denote the matrices appearing above by W1, W2, W3, W4, respectively. This shows that W1, . . . , W4 generate W⊥ . Actually, they provide a basis of W⊥ . For a quick verification of their linear independence, we use again the isomorphism Mat2,3(R) ∼= R6 and proceed with Sage, as before: V = RR^6 w1 = vector(RR, [10, -5, 1, 0, 0, 0]) w2 = vector(RR, [-21/2, 7/2, 0, 1, 0, 0]) w3 = vector(RR, [-19/2, 5/2, 0, 0, 1, 0]) w4 = vector(RR, [5/2, -3/2, 0, 0, 0, 1]) V.linear_dependence([w1, w2, w3, w4]) == [] □ Invertible linear transformations naturally intersect with group theory, a topic that we briefly introduced in Chapter 1 and will explore in greater detail in Chapter 12. This intersection forms the basis of “matrix groups”, which are of particular interest in this context. Next we will focus on the group of all invertible linear mappings from Rn to Rn , known as the “real general linear group”, denoted by Gln(R). The operation that defines this group is the composition of linear mappings. Equivalently, Gln(R) can be described as the group of all invertible n×n matrices with real entries, where the group operation is the matrix multiplication, i.e., Gln(R) = {A ∈ Matn(R) : det(A) ̸= 0} . On the other hand, a “matrix group” is defined as a closed subgroup of Gln(R).14 An example of a matrix group 14Matrix groups are special cases of the well-known “Lie groups”. 
In simple terms, a Lie group is a group equipped with a compatible differentiable structure, also known as a smooth manifold. Lie groups were introduced by the Norwegian mathematician Sophus Lie (1842-1899) during the late 19th century, shortly after the discovery of non-Euclidean geometries. Lie referred to them as “continuous symmetry groups”. During this period, Lie collaborated with F. Klein, and together they significantly altered perspectives in geometry and the theory of differential equations. Today, the theory of Lie groups and Lie algebras has become a fundamental area in differential geometry with a wide range of applications. It’s worth noting that not all Lie groups are matrix groups, indicating the potential complexity of these structures. CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Positive semidefinite square roots Theorem. For each positive semidefinite square matrix B, there is the uniquely defined positive semidefinite square root √ B. If P is any matrix such that P−1 BP = D is diagonal, then √ B = P √ DP−1 , where D has got the (non-negative) eigenvalues of B on its diagonal and √ D is the matrix with the positive square roots of these values on its diagonal. Proof. Since B is a matrix of a self-adjoint mapping φ, there is even an orthonormal P as in the theorem (cf. Theorem 3.4.7) with all eigenvalues in the diagonal of D non-negative. Consider C = √ B as defined in the second claim and notice that in- deed C2 = P √ DP−1 P √ DP−1 = PDP−1 = B. Thus the mapping ψ given by C must have the same eigenvectors as φ and thus these two mappings share the decompositions of Kn into mutually orthogonal eigenspaces. In particular, both of them will share the bases in which they have diagonal matrices and thus the definition of √ D must be unique in each such basis. This proves that the definition of √ B does not depend on our particular choice of the diagonalization of φ. □ Notice there could be a lot of different roots, if we relax the positivity condition on √ B (e.g., we may choose the signs in the diagonal matrix D). 3.4.10. Spectra and nilpotent mappings. We return to the behavior of linear mappings in full generality. We continue to work with real or complex vector spaces, but without necessarily fixing a scalar product there. Recall that the spectrum of a linear mapping f : V → V is a sequence of roots of the characteristic polynomial of the mapping f, counting multiplicities. The algebraic multiplicity of an eigenvalue is its multiplicity as a root of the characteristic polynomial. The geometric multiplicity of an eigenvalue is the dimension of the corresponding subspace of eigenvec- tors. A linear mapping f : V → V is called nilpotent, if there exists an integer k ≥ 1 such that the iterated mapping fk is identically zero. The smallest k with such a property is called the degree of nilpotency of the mapping f. The mapping f : V → V is called cyclic, if there exists the basis (u1, . . . , un) of the space V such that f(u1) = 0 and f(ui) = ui−1 for all i = 2, . . . , n. In other words, the matrix of f in this basis is of the form A =    0 1 0 . . . 0 0 1 . . . ... ... ...    . 234 is the “orthogonal group” O(n), which consists of all linear transformations φ : Rn → Rn of Rn preserving the standard Euclidean product, i.e., ⟨φ(u), φ(v)⟩ = ⟨u, v⟩, for all u, v ∈ Rn . We studied such endomorphisms in the end of Chapter 2 and we learned that they correspond to orthogonal matrices (see 2.4.6 and 2.D.11). 
Hence, in terms of matrices, O(n) consists of all n × n orthogonal matrices, i.e., O(n) = {A ∈ Gln(R) : A−1 = AT }. 3.D.18. Prove that O(n) is a group and a subgroup of Gln(R). Additionally, demonstrate that the determinant of any matrix A ∈ O(n) is either 1 or −1. Solution. Obviously, the n × n identity matrix E belongs to O(n) and this is the corresponding identity element of the group. Thus, to verify that O(n) is a group it remains to prove that the composition of two orthogonal transformations is orthogonal, and that the inverse of an orthogonal transformation is again orthogonal. By the conclusion in Problem 2.D.11, one can equivalently work with orthogonal matrices. Let A, B ∈ O(n). Then we see that (AB)T AB = BT AT AB = BT B = E , AB(AB)T = ABBT AT = AAT = E , (A−1 )T A−1 = (AT )−1 A−1 = (AAT )−1 = E−1 = E , A−1 (A−1 )T = A−1 (AT )−1 = (AT A)−1 = E−1 = E , and these relations certify our claim. Finally recall that the matrix multiplication is associative. Now, given a group (G, ◦), a non-empty subset K of G which is closed under composition and taking inverses (with respect to the restriction of the group operation ◦ to K), is called a subgroup of G, see also 12.4.1 for more details. To demonstrate that a (non-empty) subset K ⊂ G of G is a subgroup of G, we need to show that a ◦ b−1 ∈ K, for any two elements a, b ∈ K. Let A, B ∈ O(n) be two orthogonal matrices. By the previous assertion we know that B−1 ∈ O(n) as well, and hence AB−1 ∈ O(n). Since O(n) is also a subset of Gln(R), our claim follows. For the determinant, let A ∈ Matn(R) be an orthogonal matrix, i.e., AAT = E = AT A. Hence det(AAT ) = det(E) = 1. However, det(AAT ) = det(A) det(AT ) = det(A)2 , thus det(A)2 = 1, i.e., det(A) = ±1. □ The special orthogonal group SO(n) is the subgroup of O(n) consisting of orthogonal matrices with determinant one, SO(n) = {A ∈ O(n) : det(A) = 1}. Obviously, for n = 1 we have O(1) ∼= Z2 and the group SO(1) is trivial. For n = 2 it is not hard to prove that the group SO(2) is isomorphic (as a group) to the unit circle S1 = {z ∈ C : |z| = 1}. For n = 3 we have already described examples of special orthogonal matrices in Chapter 2, e.g., rotations on R3 about an axis through the origin, which indeed have matrices lying in the special orthogonal group SO(3) (see 2.D.13). Let us further emphasize on rotations on the 3-dimensional Euclidean CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS If f(v) = a v, then fk (v) = ak · v for every natural k. Note that, the spectrum of nilpotent mapping can contain only the zero scalar (and this is always present). By the definition, every cyclic mapping is nilpotent. Moreover, its degree of nilpotency equals the dimension of the space V . The derivative operator on polynomials, D(xk ) = kxk−1 , is an example of a cyclic mapping on the spaces Kn[x] of all polynomials of degree at most n over the scalars K. Perhaps surprisingly, this is also true the other way round – every nilpotent mapping is a direct sum of cyclic mappings. A proof of this claim takes much work. So we formulate first the results we are aiming at, and only then come back to the technical work. In the resulting theorem describing the Jordan decomposition, the crucial role is played by vector (sub)spaces and linear mappings with a single eigenvalue λ given by the ma- trix (1) J =     λ 1 0 . . . 0 0 λ 1 . . . 0 ... ... ... ... 0 0 0 . . . λ     . These matrices (and the corresponding invariant subspaces) are called Jordan blocks.5 Jordan canonical form Theorem. 
Let $V$ be a real or complex vector space of dimension $n$ and let $f : V \to V$ be a linear mapping with $n$ eigenvalues (in the chosen domain of scalars), counting algebraic multiplicities. Then there exists a unique decomposition of the space $V$ into a direct sum of subspaces $V = V_1 \oplus \cdots \oplus V_k$ such that not only $f(V_i) \subset V_i$, but the restriction of $f$ to each $V_i$ has a single eigenvalue $\lambda_i$, and the restriction $f - \lambda_i\,\mathrm{id}_{V_i}$ to $V_i$ is either cyclic or the zero mapping. In particular, there is a suitable basis in which $f$ has a block-diagonal matrix $J$ with Jordan blocks along the diagonal.
We say that the matrix $J$ from the theorem is in Jordan canonical form. In the language of matrices, we can rephrase the theorem as follows:
Corollary. For each square matrix $A$ over complex scalars, there is an invertible matrix $P$ such that $A = P^{-1} J P$ and $J$ is in Jordan canonical form. The matrix $P$ is the transition matrix to the basis from the theorem above.
Notice that the total number of ones above the diagonal in $J$ equals the difference between the total algebraic and geometric multiplicities of the eigenvalues. The ordering of the blocks in the matrix corresponds to the chosen ordering of the subspaces $V_i$ in the direct sum. Thus the matrix $J$ is unique up to the ordering of the Jordan blocks, and there is a corresponding freedom in the choice of the basis for the Jordan canonical form.
5 Camille Jordan was a famous French mathematician working in analysis and algebra at the end of the 19th and the beginning of the 20th centuries.
space, by presenting some additional exercises; see also Section F for further material (cf. 3.F.35, 3.F.37).
3.D.19. Write down the matrices of the rotations by the angle $\theta$ about the (oriented) axes $x$, $y$ and $z$ in $\mathbb{R}^3$.(15) ⃝
15 The matrices presented in the solution are well known and commonly used in 3D graphics, robotics, and other fields involving three-dimensional transformations.
The concept of linear transformations on real vector spaces naturally extends to complex vector spaces. In particular, the groups $O(n)$ and $SO(n)$ have counterparts in the complex case, known as the "unitary group" $U(n)$ and the "special unitary group" $SU(n)$. These groups are closed subgroups of the complex general linear group $\mathrm{Gl}_n(\mathbb{C})$, which serves as the starting point in this context, replacing $\mathrm{Gl}_n(\mathbb{R})$. As a result, they are also matrix groups (note that when identifying $\mathbb{C}^n \cong \mathbb{R}^{2n}$, the group $\mathrm{Gl}_n(\mathbb{C})$ can be viewed as a subgroup of $\mathrm{Gl}_{2n}(\mathbb{R})$). The unitary group consists of all complex linear mappings $\psi : \mathbb{C}^n \to \mathbb{C}^n$ preserving the standard Hermitian form $\langle\,,\rangle$ on $\mathbb{C}^n$. Equivalently, it can be viewed as the matrix group
$$U(n) = \{U \in \mathrm{Gl}_n(\mathbb{C}) : U^{-1} = U^* = \bar U^T\}.$$
Just as $SO(n)$ is defined as a subgroup of $O(n)$, the special unitary group $SU(n)$ is defined as the subgroup of $U(n)$ consisting of unitary matrices with determinant one.
3.D.20. As we said above, an $n \times n$ unitary matrix $U$ over $\mathbb{C}$ is defined as one preserving the standard Hermitian form $\langle\,,\rangle$ on $\mathbb{C}^n$, i.e., $\langle Ux, Uy\rangle = \langle x, y\rangle$ for any $x, y \in \mathbb{C}^n$. Show that this is equivalent to saying that $U^*U = UU^* = E$. Next prove that $U(n)$ is a group, and that the determinant of any unitary matrix is a complex unit. ⃝
3.D.21. Unitary matrices. Determine which of the matrices listed below are unitary:
$$A = \begin{pmatrix} \frac{1}{\sqrt 2} & \frac{1}{\sqrt 2} \\ \frac{i}{\sqrt 2} & -\frac{i}{\sqrt 2} \end{pmatrix}, \qquad B = \frac12\begin{pmatrix} 1+i & \sqrt 2 \\ 1-i & \sqrt 2\, i \end{pmatrix}, \qquad C = \begin{pmatrix} 1 & -i \\ -1 & i \end{pmatrix}.$$
Solution. Let us use Sage to analyze the matrix A. We want to determine whether the products $A^*A$ and $AA^*$ yield the identity matrix. With this goal in mind, we type:

A = matrix(SR, [[1/sqrt(2), 1/sqrt(2)],
                [I/sqrt(2), -I/sqrt(2)]])
show(A)
A_her = A.conjugate_transpose()
show(A_her)
bool(A_her*A == identity_matrix(2)) and bool(A*A_her == identity_matrix(2))

The command A.conjugate_transpose() returns the conjugate transpose of A and has the shortcut A.H. Executing this block we confirm that A is a unitary matrix.
As mentioned before, a more direct method uses the command A.is_unitary(), which for our cases returns True. Alternatively, we can assess the orthonormality of the matrix's columns (or rows) with respect to the standard Hermitian form on $\mathbb{C}^n$ (for an $n \times n$ matrix). This is analogous to evaluating orthogonality in real matrices.
3.4.11. Remarks. The existence of the Jordan canonical form is clear in the cases when all eigenvalues are either distinct or when the geometric and algebraic multiplicities of the eigenvalues coincide, and the definition of normal mappings requires exactly this behavior. In particular, this is the case for all unitary and self-adjoint mappings on unitary vector spaces, and the Jordan canonical form of a normal mapping is always diagonal.
A consequence of the Jordan canonical form theorem is that for every linear mapping $f$, every eigenvalue of $f$ uniquely determines an invariant subspace corresponding to all Jordan blocks with this particular eigenvalue. We shall call this subspace the root subspace corresponding to the given eigenvalue.
We mention one useful corollary of the Jordan theorem (which is already used in the discussion about the behavior of Markov chains). Assume that the eigenvalues of our mapping $f$ are all of absolute value less than one. Then repeated application of the linear mapping to any vector $v \in V$ makes all the coordinates of $f^k(v)$ decrease to zero. Indeed, assume $f$ has only one eigenvalue $\lambda$ on all of the complex space $V$ and that $f - \lambda\,\mathrm{id}_V$ is cyclic (that is, we consider only one Jordan block separately). Let $v_1, \dots, v_\ell$ be the corresponding basis. Then the theorem says that $f(v_2) = \lambda v_2 + v_1$, $f^2(v_2) = \lambda^2 v_2 + \lambda v_1 + \lambda v_1 = \lambda^2 v_2 + 2\lambda v_1$, and similarly for the other $v_i$'s and higher powers. In any case, the iteration of $f$ results in higher and higher powers of $\lambda$ at all non-zero components. The smallest of these powers can differ from the largest one by less than the dimension of $V$, and the coefficients are bounded too. This proves the claim. The same argument can be used to prove that a mapping with all eigenvalues of absolute value strictly greater than one leads to unbounded growth of all coordinates of the iterations $f^k(v)$.
The remainder of this part of the third chapter is devoted to the proof of the Jordan theorem and a few necessary lemmas. It is much more difficult than anything so far; the reader may skip it and continue at the beginning of the fifth part of this chapter in case of any problems reading it.
3.4.12. Root spaces. We have already seen by explicit examples that the eigensubspaces completely describe the geometric properties only for some linear mappings. Thus we now introduce a more subtle tool, the root subspaces.
Definition. A non-zero vector $u \in V$ is called a root vector of the linear mapping $\varphi : V \to V$ if there exists an $a \in \mathbb{K}$ and an integer $k > 0$ such that $(\varphi - a\,\mathrm{id}_V)^k(u) = 0$.
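The definition is easy to probe computationally. Here is a tiny Sage illustration of our own: for a single 2 × 2 Jordan block the eigenspace is one-dimensional, while the iterated kernels grow until they fill the whole root subspace.

A = matrix(QQ, [[2, 1], [0, 2]])
(A - 2).right_kernel().dimension()      # 1: only the eigenvectors of the eigenvalue 2
((A - 2)^2).right_kernel().dimension()  # 2: every non-zero vector is a root vector here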
These arguments enable us to conclude that the matrix B is also unitary, while the matrix C is not. We will demonstrate the verification process for the matrix B and leave the other case for practice. For the matrix B, the column vectors are given by $u_1 = \big(\frac{1+i}{2}, \frac{1-i}{2}\big)^T$ and $u_2 = \big(\frac{\sqrt 2}{2}, \frac{\sqrt 2\,i}{2}\big)^T$. Thus
$$\|u_1\| = \sqrt{\frac{1+i}{2}\cdot\frac{1-i}{2} + \frac{1-i}{2}\cdot\frac{1+i}{2}} = 1, \qquad \|u_2\| = \sqrt{\frac{\sqrt 2}{2}\cdot\frac{\sqrt 2}{2} - i^2\cdot\frac{\sqrt 2}{2}\cdot\frac{\sqrt 2}{2}} = 1,$$
$$\langle u_1, u_2\rangle = \frac{1+i}{2}\cdot\frac{\sqrt 2}{2} - \frac{1-i}{2}\cdot\frac{i\sqrt 2}{2} = 0.$$
Hence B is unitary. Let us also proceed with Sage, applying one of the methods described above. For example, executing the command

B = matrix(SR, [[(1+I)/2, sqrt(2)/2],
                [(1-I)/2, sqrt(2)*I/2]])
B.is_unitary()

Sage returns True. □
Let $V$, $W$ be two unitary spaces. Given a linear map $\psi : V \to W$, its "adjoint" is the linear map $\psi^* : W \to V$ defined by $\langle \psi(u), w\rangle_W = \langle u, \psi^*(w)\rangle_V$ for all $u \in V$ and $w \in W$. By the discussion in 3.4.5 we know that if $A$ is the matrix of $\psi$ in orthonormal bases, then the conjugate transpose $A^*$ is the matrix of $\psi^*$. When $\psi = \psi^*$, or equivalently when $A$ is Hermitian, i.e., $A = A^*$, then $\psi$ is called "self-adjoint", see 3.4.6.(16) The following series of exercises explores these objects, starting with the properties of the adjoint operator.
16 Self-adjoint operators and Hermitian matrices are fundamental in various scientific and engineering disciplines. In quantum mechanics, they represent observable quantities such as energy, momentum, and position, with their real eigenvalues being of special importance. In numerical analysis, self-adjoint operators are used to solve eigenvalue problems, which are essential for simulations and modelling in engineering and physics. In computer science, Hermitian matrices find applications in machine learning, where they help in dimensionality reduction and data analysis by identifying important features in large datasets.
3.D.22. Consider two linear mappings $\varphi, \psi : V \to V$ on a complex unitary space $(V, \langle\,,\rangle)$. Show that the adjoint operators of $\varphi$ and $\psi$ satisfy the following properties:
• $(c\varphi)^* = \bar c\,\varphi^*$, for any $c \in \mathbb{C}$;
• $(\varphi + \psi)^* = \varphi^* + \psi^*$;
• $(\varphi \circ \psi)^* = \psi^* \circ \varphi^*$;
• $(\varphi^*)^* = \varphi$;
• if $\varphi$ is invertible, then so is $\varphi^*$, with $(\varphi^*)^{-1} = (\varphi^{-1})^*$. ⃝
This means that the $k$-th iteration of the given mapping sends $u$ to zero. The set of all root vectors corresponding to a fixed scalar $\lambda$, together with the zero vector, is called the root subspace associated with the scalar $\lambda \in \mathbb{K}$. We denote it by $R_\lambda$.
If $u$ is a root vector and the integer $k$ from the definition is chosen as small as possible for $u$, then $(\varphi - a\,\mathrm{id}_V)^{k-1}(u)$ is an eigenvector with the eigenvalue $a$. Thus we have $R_\lambda = \{0\}$ for all scalars $\lambda$ which are not in the spectrum of the mapping $\varphi$.
Proposition. Let $\varphi : V \to V$ be a linear mapping. Then
(1) $R_\lambda \subset V$ is a vector subspace for every $\lambda \in \mathbb{K}$,
(2) for every $\lambda, \mu \in \mathbb{K}$, the subspace $R_\lambda$ is invariant with respect to the linear mapping $\varphi - \mu\,\mathrm{id}_V$; in particular, $R_\lambda$ is invariant with respect to $\varphi$,
(3) if $\mu \neq \lambda$, then $(\varphi - \mu\,\mathrm{id}_V)|_{R_\lambda}$ is invertible,
(4) the mapping $(\varphi - \lambda\,\mathrm{id}_V)|_{R_\lambda}$ is nilpotent.
Proof. (1) Checking the properties of a vector subspace is easy and is left to the reader.
(2) Assume that $(\varphi - \lambda\,\mathrm{id}_V)^k(u) = 0$ and put $v = (\varphi - \mu\,\mathrm{id}_V)(u)$. Then
$$(\varphi - \lambda\,\mathrm{id}_V)^k(v) = (\varphi - \lambda\,\mathrm{id}_V)^k\big((\varphi - \lambda\,\mathrm{id}_V) + (\lambda - \mu)\,\mathrm{id}_V\big)(u) = (\varphi - \lambda\,\mathrm{id}_V)^{k+1}(u) + (\lambda - \mu)\,(\varphi - \lambda\,\mathrm{id}_V)^k(u) = 0.$$
(3) If $u \in \operatorname{Ker}(\varphi - \mu\,\mathrm{id}_V)|_{R_\lambda}$, then $(\varphi - \lambda\,\mathrm{id}_V)(u) = (\varphi - \mu\,\mathrm{id}_V)(u) + (\mu - \lambda)u = (\mu - \lambda)u$. This implies $0 = (\varphi - \lambda\,\mathrm{id}_V)^k(u) = (\mu - \lambda)^k u$, and thus $u = 0$ for $\lambda \neq \mu$.
(4) Choose a basis e1, . . .
, ep of the subspace Rλ. By definition, there exist integers ki such that (φ − λ idV )ki (ei) = 0. In particular, the entire mapping (φ − λ idV )|Rλ must be nilpotent. □ 3.4.13. Quotient spaces. Our next aim is to show that the dimension of the root spaces always equals the algebraic multiplicity of the corresponding eigenvalues. First, we introduce some general useful technical tools. Quotient spaces Definition. Let U ⊂ V be a vector subspace. Define an equivalence relation on the set of all vectors in V by v1 ∼ v2 if and only if v1 − v2 ∈ U. Axioms of equivalence are easy to check. The set V/U of the classes of this equivalence is equipped by the operations defined by using representatives. That is, for classes [u] and [v] determined by the vectors u and v, set [v] + [w] = [v + w], a [u] = [a u]. This is a well defined vector space called the quotient vector space of the space V by the subspace U. Check the correctness of the definition of the operations and verify all axioms of the vector space in detail! The classes (vectors) in the quotient space V/U will often be denoted as formal sums of one representative with all 237 3.D.23. Let φ : C2 → C2 the linear mapping given by φ (( z1 z2 )) = ( iz1 + 2z2 z1 − iz2 ) . Determine the matrix of φ with respect to the standard basis of C2 and find its adjoint operator φ∗ . Deduce that φ is not self-adjoint. Next solve the task in Sage. Solution. Consider the standard orthonormal basis e = {e1, e2} of C2 (with respect to the standard Hermitian form ⟨ , ⟩). Recall that if A = [φ]e is the matrix of φ with respect to e and A = (aij), then we have aij = ⟨φ(uj), ui⟩. We compute φ(e1) = (i, 1)T , φ(e2) = (2, −i)T and hence a11 = ⟨φ(e1), e1⟩ = i, a12 = ⟨φ(e2), e1⟩ = 2, a21 = ⟨φ(e1), e2⟩ = 1 and a22 = ⟨φ(e2), e2⟩ = −i. Directly A = [φ]e = ( φ(e1) φ(e2) ) = ( i 2 1 −i ) . Then, for the matrix A∗ = ¯AT we compute A∗ = ( −i 1 2 i ) . Consequently, the adjoint of φ is given by φ∗ (( z1 z2 )) = ( −iz1 + z2 2z1 + iz2 ) . To verify this result we will use defining equation of φ∗ , that is, ⟨φ(u), v⟩ = ⟨u, φ∗ (v)⟩, for any two vectors u = (z1, z2)T and v = (z3, z4)T of C2 . We compute ⟨( iz1 + 2z2 z1 − iz2 ) , ( z3 z3 )⟩ = (iz1 + 2z2)¯z3 + (z1 − iz2)¯z4 = iz1 ¯z3 + 2z2 ¯z3 + z1 ¯z4 − iz2 ¯z4 , ⟨( z1 z2 ) , ( −iz3 + z4 2z3 + iz4 )⟩ = z1(−iz3 + z4) + z2(2z3 + iz4) = z1(i¯z3 + ¯z4) + z2(2¯z3 − i¯z4) = iz1 ¯z3 + z1 ¯z4 + 2z2 ¯z3 − iz2 ¯z4 . Finally, since A ̸= A∗ it follows that φ is not self-adjoint. In Sage a solution goes as follows: # Define the matrix A of the linear map phi A = Matrix(QQbar, [[I, 2], [1, -I]]) # Display the matrix A show("Matrix A of phi:") show(A) # Compute the adjoint of A A_adjoint = A.H # Display the adjoint matrix show("Adjoint matrix A*:") show(A_adjoint) # Check if A is self-adjoint is_self_adjoint = (A == A_adjoint) # Display whether A is self-adjoint show("Is A self-adjoint?") show(is_self_adjoint) □ 3.D.24. The numerical system QQbar in Sage. Observe that the Sage solution presented in 3.D.23 uses the option QQbar to define the matrix A. This represents the numerical system of algebraic complex numbers. The field of algebraic CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS vectors in the subspace U, for instance u+U ∈ V/U, u ∈ V . The class 0 + U is the zero vector in V/U, i.e. the vector u ∈ V represents the zero element in V/U if and only if u ∈ U. Trivial examples are V/{0} ∼= V , V/V ∼= {0}. 
Another example is the quotient space of the plane R2 factored by any one-dimensional subspace (here, every one-dimensional subspace U ⊂ R2 is a line passing through the origin). Then the equivalence classes are all the lines parallel to this line. Proposition. Let U ⊂ V be a vector subspace and (u1, . . . , un) be a basis of V , such that (u1, . . . , uk) is a basis of U. Then dim V/U = n − k and the vectors uk+1 + U, . . . , un + U form a basis of V/U. Proof. V = span{u1, . . . , un}, so V/U = span{u1 + U, . . . , un + U}. But the first k generators are zero, thus V/U = span{uk+1 + U, . . . , un + U}. Assume that the linear combination ak+1 (uk+1 + U) + · · · + an (un + U) = (ak+1 uk+1 +· · · +an un)+U = 0 ∈ V/U vanishes. Equivalently, this linear combination of the vectors uk+1, . . . , un belongs to the subspace U. Since U is generated by the remaining vectors in the basis of V , the latter linear combination is necessarily zero, and so all coefficients ai are zero. This proves the linear independence of the generators of V/U. □ 3.4.14. Induced mappings on quotient spaces. Assume that U ⊂ V is an invariant subspace with respect to linear mapping φ : V → V and choose basis u1, . . . , un of the space V such that the first k vectors of this basis is a basis of U. With this basis, φ has block matrix A = ( B C 0 D ) . Then we can prove the following lemma: Lemma. (1) the mapping φ induces a linear mapping φV/U : V/U → V/U, φV/U (v+U) = φ(v)+U with the matrix D under the induced basis uk+1 +U, . . . , un +U on V/U, (2) the characteristic polynomial of φV/U divides the characteristic polynomial of φ. Proof. For v, w ∈ V , u ∈ U, a ∈ K we have φ(v+u) ∈ φ(v)+U (because U is invariant), (φ(v)+U)+(φ(w)+U) = φ(v +w)+U and a (φ(v)+U) = a φ(v)+U = φ(a v)+U (because φ is linear), thus the mapping φV/U is well-defined and linear. Moreover the very definition of the matrix of a mapping in a basis implies that the matrix of φV/U in the induced basis on V/U is exactly the matrix D (when counting the images of the basis elements the coefficients of the matrix C add only to the class U). The characteristic polynomial of the induced mapping φV/U is thus |D−λ E|, while characteristic polynomial of the original mapping φ is |A − λ E| = |B − λ E||D − λ E|. □ 238 numbers, usually denoted by ¯Q, is formed by adjoining the rational numbers Q with the roots of all polynomial equations with rational coefficients. In other words, an algebraic number is a complex number that is a root of a non-zero polynomial equation with rational coefficients. Thus, in Sage QQbar is an extension of the numerical system QQ, enabling exact computations with algebraic numbers. However, the result would be the same if we used CC, instead of QQbar, as both numerical systems approximate complex numbers. Note however that QQbar allows for exact algebraic computations, while CC uses floating-point arithmetic for complex numbers. In particular, when displaying complex matrices with QQbar, then each element is shown in its exact algebraic form, preserving the precision of computations. In contrast, CC displays complex matrices using floating-point approximations, which can introduce rounding errors but are generally more efficient for large-scale computations. 3.D.25. 
Use Sage to determine which of the following matrices is Hermitian: A =   √ 2 i 1 − i −i 10 √ 2 + i 1 + i √ 2 − i 0   , B = ( a i + b −i + b √ |a| ) , where a, b ∈ R , C =   1 8i 1 − i √ 5 −8i 4z 0 1 + i √ 5 0 1   , where z ∈ C with Im(z) = 0 , D =   i √ 2 i 1 − i −i 0 4 + i 1 + i 4 − i 0   . ⃝ 3.D.26. Prove that with respect to the Frobenius inner product B(A, B) = tr(B∗ A) on Matn(C) the following statement is true: Two matrices A, B ∈ Matn(C) satisfying B = U∗ AU for some n×n unitary matrix U, are such that ∥A∥2 B = ∥B∥2 B. ⃝ 3.D.27. Consider the linear map φ : R2 → R3 defined by φ (( x y )) =   √ 2x + y x − y 2y   . Compute its adjoint operator φ∗ : R3 → R2 . Next examine the linear operator ψ : C3 → C3 whose matrix with respect to the standard basis of C3 is given by B =   10 5 + 5i 3 + 2i 5 − 5i 5 √ 2i 3 − 2i − √ 2i 0   . CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Corollary. Let V be a vector space over K of dimension n and let φ : V → V be a linear mapping whose spectrum contains n elements (that is, all roots of the characteristic polynomial lie in K and we count their multiplicities). Then there exists a sequence of invariant subspaces {0} = V0 ⊂ V1 ⊂ · · · ⊂ Vn = V with dimensions dim Vi = i. Consider a basis u1, . . . , un of the space V such that Vi = span{u1, . . . , ui}. In this basis, the matrix of the mapping φ is an upper triangular matrix:    λ1 . . . ∗ ... ... ... 0 . . . λn    , with the spectrum λ1, . . . , λn on the diagonal. Proof. The subspaces Vi are constructed inductively. Let {λ1, . . . , λn} be the spectrum of the mapping φ. Thus the characteristic polynomial of the mapping φ is of the form ±(λ − λ1) · · · (λ − λn). We choose V0 = {0}, V1 = span{u1}, where u1 is an eigenvector with eigenvalue λ1. According to the previous theorem, the characteristic polynomial of the mapping φV/V1 is of the form ±(λ − λ2) · · · (λ − λn). Assume that we have already constructed linearly independent vectors u1, . . . , uk and invariant subspaces Vi = span{u1 . . . , ui}, i = 1, . . . , k < n such that the characteristic polynomial of φV/Vk is of the form ±(λ − λk+1) · · · (λ − λn) and φ(ui) ∈ (λi · ui + Vi−1) for all i = 1, . . . , k. We want to add one more vector uk+1 with analogous properties. There exists an eigenvector uk+1 +Vk ∈ V/Vk of the mapping φV/Vk with the eigenvalue λk+1. Consider the space Vk+1 = span{u1, . . . , uk+1}. If the vector uk+1 is a linear combination of the vectors u1, . . . , uk then uk+1 + Vk would be the zero class in V/Vk. But this is not possible. Thus dim Vk+1 = k + 1. It remains to study the induced mapping φV/Vk+1 . The characteristic polynomial of this mapping is of degree n − k − 1 and divides the characteristic polynomial of the mapping φ. But completing the vectors u1, . . . , uk+1 to the basis of V yields a block matrix of the mapping φ with an upper triangular submatrix B in the left upper corner and zero in the left lower corner. The diagonal elements are exactly the scalars λ1, . . . , λk+1. Therefore the roots of the characteristic polynomial of the induced mapping have the required properties. □ Remark. If V decomposes into the direct sum of eigensubspaces for φ, the latter results do not say anything new. But their significance consists in the fact, that only the existence of dim V roots of the characteristic polynomial (counting multiplicities) is assumed. This is ensured whenever the field K is algebraically closed, for instance the complex numbers C. 
As a direct consequence we see that the determinant and the trace of the mapping $\varphi$ are always the product and the sum, respectively, of the elements of the spectrum. This can also be used for all real matrices: just consider them as complex matrices, and calculate the determinant or the trace as the product or the sum of the eigenvalues. Because both the determinant and the trace are algebraic expressions in the entries of the matrix, the results will be correct.
3.4.15. Orthogonal triangulation. If we are given a scalar product on a vector space $V$ and $U \subset V$ is a subspace, then clearly $V/U \simeq U^\perp$, where $v \in U^\perp$ is identified with $v + U$. Moreover, each class of the quotient space $V/U$ contains exactly one vector from $U^\perp$ (the difference of two such vectors lies in $U \cap U^\perp$).
Demonstrate that this operator is self-adjoint. ⃝
3.D.28. Derive the adjoint of the trace $\operatorname{tr} : \mathrm{Mat}_m(\mathbb{F}) \to \mathbb{F}$ with respect to the Frobenius inner product $\mathcal{B}(A, B) = \operatorname{tr}(B^*A)$, for $\mathbb{F} \in \{\mathbb{R}, \mathbb{C}\}$. ⃝
3.D.29. Let $(V, \langle\,,\rangle)$ be an inner product space and let $W \subset V$ be a subspace of $V$. Show that the orthogonal projection $\operatorname{proj}_W : V \to V$ onto $W$ is self-adjoint. ⃝
Recall that a linear endomorphism $f : V \to V$ on a (finite-dimensional) vector space $V$ is diagonalizable if and only if $V$ admits a basis $B$ consisting of eigenvectors of $f$. When $V$ is equipped with a scalar product $\langle\,,\rangle$ and given an endomorphism $f$, it is natural to ask whether there exists an orthonormal basis of eigenvectors for $f$. If such a basis exists, the matrix $[f]_B$ of $f$ with respect to $B$ is diagonal, and the same holds for the matrix associated with the adjoint $f^*$ of $f$. Since diagonal matrices commute, this implies $f \circ f^* = f^* \circ f$. Linear operators (or matrices) satisfying this relation are called "normal". For instance, orthogonal or unitary matrices are normal, but there are many other types of normal matrices as well; see below. It turns out that on a complex unitary space $(V, \langle\,,\rangle)$ an endomorphism $f : V \to V$ is normal if and only if $V$ has an orthonormal basis consisting of eigenvectors of $f$. It is important to note that this result does not hold for real vector spaces: in the case of a real Euclidean space, the existence of an orthonormal basis of eigenvectors for a linear operator $f$ is equivalent to $f$ being self-adjoint, i.e., $f = f^*$.
3.D.30. Prove that orthogonal and unitary matrices are normal. Next demonstrate that not all normal matrices (or transformations) are orthogonal or unitary by providing a counterexample, that is, a normal matrix that is neither orthogonal nor unitary. ⃝
Let $\varphi : V \to V$ be a normal operator on a unitary space $(V, \langle\,,\rangle)$. By 3.4.8 and the proof given in the main theorem of that paragraph, we know that eigenvectors corresponding to distinct eigenvalues of $\varphi$ are orthogonal with respect to $\langle\,,\rangle$. Let us illustrate this important situation by examples.
3.D.31. Suppose that $\varphi : V \to V$ is a normal operator on a (finite-dimensional) unitary space $(V, \langle\,,\rangle)$ having two distinct eigenvalues, namely $\lambda_1 = 3$ and $\lambda_2 = 5$. Assume also that the corresponding eigenvectors are of unit length. Use the Pythagorean theorem to prove that there exists a vector $u \in V$ with $\|u\| = \sqrt 2$ and $\|\varphi(u)\| = \sqrt{34}$, where $\|\cdot\|$ is the norm induced by $\langle\,,\rangle$.
Solution. Suppose that $v, w \in V$ are eigenvectors of $\varphi$ corresponding to $\lambda_1$ and $\lambda_2$, that is, $\varphi(v) = 3v$ and $\varphi(w) = 5w$, respectively. By assumption $\varphi$ is normal, hence eigenvectors corresponding to distinct eigenvalues are orthogonal: $\langle v, w\rangle = 0$. Then, for the vector $u = v + w$, the Pythagorean theorem gives $\|u\|^2 = \|v\|^2 + \|w\|^2 = 2$, that is, $\|u\| = \sqrt 2$. On the other hand, again by the Pythagorean theorem we obtain
$$\|\varphi(u)\| = \|\varphi(v + w)\| = \|\varphi(v) + \varphi(w)\| = \|3v + 5w\| = \sqrt{9 + 25} = \sqrt{34}. \qquad □$$
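In the same spirit, a small Sage experiment (our own sketch, anticipating exercise 3.D.32 below, whose matrix we borrow) exhibits the orthogonality of eigenvectors of a normal matrix across distinct eigenvalues:

A = matrix(QQbar, [[2, 1+I, 0], [1-I, 3, 0], [0, 0, 1]])
print(A*A.conjugate_transpose() == A.conjugate_transpose()*A)   # True, so A is normal
D, P = A.eigenmatrix_right()    # columns of P are eigenvectors, A*P == P*D
for i in range(3):
    for j in range(i+1, 3):
        if D[i,i] != D[j,j]:    # compare only across distinct eigenvalues
            print(P.column(i).hermitian_inner_product(P.column(j)) == 0)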
We can exploit this observation in every inductive step of the proof of the theorem above. Choose the representative $u_{k+1} \in V_k^\perp$ of the eigenvector of $\varphi_{V/V_k}$. This modification leads to an orthogonal basis with the properties required in the claim about triangulation in the corollary above. Therefore there exists such an orthonormal basis, and we arrive at a very important theorem:
Schur's orthogonal triangulation theorem
Theorem. Let $\varphi : V \to V$ be a linear mapping on a vector space with scalar product, and let there be $m = \dim V$ eigenvalues, counting multiplicities. Then there exists an orthonormal basis of the space $V$ such that the matrix of $\varphi$ in this basis is upper triangular with the eigenvalues $\lambda_1, \dots, \lambda_m$ on the diagonal.
3.4.16. Theorem. Let $\varphi : V \to V$ be a linear mapping and let $\lambda_1, \dots, \lambda_k$ be all its distinct eigenvalues. Then the sum of the root spaces $R_{\lambda_1}, \dots, R_{\lambda_k}$ is direct. Furthermore, for every eigenvalue $\lambda$ the dimension of the subspace $R_\lambda$ equals the algebraic multiplicity of $\lambda$.
Proof. We prove first the independence of nonzero vectors from different root spaces. We proceed by induction on the number $k$ of root spaces. The claim is obvious for $k = 1$. Assume that the theorem holds for fewer than $k > 1$ root spaces, and assume that vectors $u_1 \in R_{\lambda_1}, \dots, u_k \in R_{\lambda_k}$ satisfy $u_1 + \cdots + u_k = 0$. Then $(\varphi - \lambda_k\,\mathrm{id}_V)^j(u_k) = 0$ for a suitable $j$, and moreover all $y_i = (\varphi - \lambda_k\,\mathrm{id}_V)^j(u_i)$ are non-zero vectors in $R_{\lambda_i}$, $i = 1, \dots, k-1$, whenever the $u_i$ are non-zero, by Proposition 3.4.12. But at the same time
$$y_1 + \cdots + y_{k-1} = (\varphi - \lambda_k\,\mathrm{id}_V)^j\Big(\sum_{i=1}^k u_i\Big) = 0,$$
and, according to the inductive assumption, all the $y_i$ are zero. But then all $u_i$, $1 \le i < k$, must vanish, and thus $u_k = 0$ too. This proves the first claim.
It remains to consider the dimensions of the root spaces $R_\lambda$. Consider an eigenvalue $\lambda$ of $\varphi$, use the same notation $\varphi$ for the restriction $\varphi|_{R_\lambda}$, and write $\psi : V/R_\lambda \to V/R_\lambda$ for the mapping induced by $\varphi$ on the quotient space. Assume that the dimension of $R_\lambda$ is strictly smaller than the algebraic multiplicity of the root $\lambda$ of the characteristic polynomial. In view of Lemma 3.4.14, $\lambda$ is then also an eigenvalue of the mapping $\psi$.
3.D.32. Consider the normal operator $\varphi : \mathbb{C}^3 \to \mathbb{C}^3$ whose matrix with respect to the standard basis of $\mathbb{C}^3$ is given by
$$A = \begin{pmatrix} 2 & 1+i & 0 \\ 1-i & 3 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
Use Sage to illustrate the orthogonality of eigenvectors corresponding to distinct eigenvalues of $\varphi$. Next present a formal solution. ⃝
3.D.33. Skew-Hermitian matrices. Recall that a complex matrix $A = (a_{ij})$ is called skew-Hermitian if $A = -A^*$, or equivalently $a_{ij} = -\bar a_{ji}$. Prove that:
(i) every skew-Hermitian matrix $A$ is normal;
(ii) the eigenvalues of a skew-Hermitian matrix $A$ are purely imaginary, that is, they are of the form $i\mu$ with $\mu \in \mathbb{R}$.
Solution. (i) The first claim is obvious: if $A^* = -A$, then $A^*A = -A^2 = AA^*$. Hence $A$ is normal.
(ii) Suppose that $\lambda$ is an eigenvalue of $A$ with corresponding eigenvector $u$, that is, $Au = \lambda u$. It is sufficient to prove that $\bar\lambda = -\lambda$. First we see that $u^*Au = u^*(\lambda u) = \lambda u^*u = \lambda\|u\|^2$, and since $\|u\|^2$ is real, taking conjugate transposes we obtain
$$\bar\lambda\|u\|^2 = (u^*Au)^* = u^*A^*u = -u^*Au = -\lambda\|u\|^2.$$
This final relation yields the desired $\bar\lambda = -\lambda$. □
3.D.34. Present an example of a normal matrix which is neither Hermitian nor skew-Hermitian.
Moreover, use Sage to verify your answer. ⃝
Recall that an $n \times n$ matrix $A$ is diagonalizable if and only if it has $n$ linearly independent eigenvectors, and in this case $A$ is similar to the diagonal matrix $D$ consisting of the eigenvalues of $A$, i.e., $P^{-1}AP = D$. Here $P$ is the matrix formed by the eigenvectors of $A$ (as column vectors). A square matrix $A$ is said to be orthogonally diagonalizable when an orthogonal matrix $P$ can be found such that $P^{-1}AP = P^TAP$ is diagonal. To clarify the concept we provide an example; additional tasks on orthogonal diagonalization are discussed in Section F (see for example 3.F.46 for an implementation of orthogonal diagonalization using Sage).
3.D.35. (Orthogonal diagonalization) Consider the symmetric matrix $A$ given below. Find a matrix $P \in \mathrm{Mat}_3(\mathbb{R})$ such that $P^{-1}AP = P^TAP$ is diagonal:
$$A = \begin{pmatrix} 1 & 2 & 6 \\ 2 & 0 & 2 \\ 6 & 2 & 1 \end{pmatrix}. \qquad ⃝$$
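For a first orientation one can let Sage do the work; the following quick sketch is our own code, not the formal solution of 3.D.35, and the normalization step at the end is what turns the eigenvector matrix into an orthogonal one:

A = matrix(QQ, [[1, 2, 6], [2, 0, 2], [6, 2, 1]])
D, P = A.eigenmatrix_right()     # A*P == P*D; the eigenvalues appear on the diagonal of D
print(D.diagonal())              # here: 8, -1, -5 (in some order)
# eigenvectors of a symmetric matrix to distinct eigenvalues are orthogonal:
print(all(P.column(i)*P.column(j) == 0 for i in range(3) for j in range(i+1, 3)))
# dividing each column by its length yields an orthogonal matrix Q with Q^T*A*Q == D
Q = matrix([v/v.norm() for v in P.columns()]).transpose()
print((Q.transpose()*A*Q).simplify_full() == D)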
A real $n \times n$ symmetric matrix has real eigenvalues, and the same holds true for Hermitian matrices, see 3.4.7. However, the converse is not necessarily true, as demonstrated by the counterexample given by the matrix $A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$. Moreover, by the spectral decomposition theorem for self-adjoint operators (see the theorem in 3.4.7), the eigenvectors corresponding to distinct eigenvalues of a symmetric or a Hermitian matrix $A$ are orthogonal. Leveraging these properties, one can derive the following fundamental result, which is further motivated by the result in 3.D.35 mentioned above.
3.D.36. Prove that a real $n \times n$ matrix $A$ is orthogonally diagonalizable if and only if it is symmetric, i.e., $A = A^T$. ⃝
In a similar vein, a complex square matrix $A \in \mathrm{Mat}_n(\mathbb{C})$ is said to be "unitarily diagonalizable" if there exists a unitary matrix $U$ such that $U^*AU$ is diagonal. By the spectral theorem (see 3.4.7) we know that any Hermitian matrix $A$ is unitarily diagonalizable. Conversely, the following holds.
3.D.37. Suppose that $A$ is a complex matrix with real eigenvalues which can be diagonalized by a unitary matrix $U$. Show that $A$ is Hermitian.
Solution. By assumption we have $U^*AU = D$, where $D$ is the diagonal matrix whose diagonal entries are the eigenvalues of $A$. However, the hypothesis states that $A$ has real eigenvalues, so $D$ is a real diagonal matrix, which implies $D^* = D$. Now, since $U$ is unitary, from the relation $U^*AU = D$ we get $A = (U^*)^{-1}DU^{-1} = UDU^*$. Then we see that
$$A^* = (UDU^*)^* = UD^*U^* = UDU^* = A. \qquad □$$
3.D.38. (Unitary diagonalization) Consider the matrices $A$, $B$ given below. For each case demonstrate that the eigenvectors corresponding to distinct eigenvalues are orthogonal to each other (with respect to the standard Hermitian forms on $\mathbb{C}^3$ and $\mathbb{C}^2$, respectively). Next demonstrate that both matrices are unitarily diagonalizable and illustrate the conditions necessary for unitary diagonalization:
$$A = \begin{pmatrix} -1 & 0 & i \\ 0 & 1 & 0 \\ -i & 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 3 & i \\ -i & 3 \end{pmatrix}. \qquad ⃝$$
3.D.39. Prove that a normal matrix $A$ having all its eigenvalues purely imaginary is skew-Hermitian. ⃝
Let $(v + R_\lambda) \in V/R_\lambda$ be the corresponding eigenvector, that is, $\psi(v + R_\lambda) = \lambda(v + R_\lambda)$. Then $v \notin R_\lambda$ and $\varphi(v) = \lambda v + w$ for a suitable $w \in R_\lambda$. Thus $w = (\varphi - \lambda\,\mathrm{id}_V)(v)$ and $(\varphi - \lambda\,\mathrm{id}_V)^j(w) = 0$ for a suitable $j$. We conclude that $(\varphi - \lambda\,\mathrm{id}_V)^{j+1}(v) = 0$, which contradicts the choice $v \notin R_\lambda$. It follows that the dimension of $R_\lambda$ equals the algebraic multiplicity of the root $\lambda$ of the characteristic polynomial of the mapping $\varphi : V \to V$. □
Combining the latter theorem with the triangulation result from Corollary 3.4.14, we can formulate:
Corollary. Consider a linear mapping $\varphi : V \to V$ on a vector space $V$ over scalars $\mathbb{K}$ whose entire spectrum lies in $\mathbb{K}$. Then $V = R_{\lambda_1} \oplus \cdots \oplus R_{\lambda_n}$ is the direct sum of the root subspaces. If we choose suitable bases for these subspaces, then with respect to this basis $\varphi$ has block-diagonal form with upper triangular matrices in the blocks and the eigenvalues $\lambda_i$ on the diagonal.
3.4.17. Nilpotent and cyclic mappings. Now almost everything is prepared for the discussion of canonical forms of matrices. It only remains to clarify the relation between cyclic and nilpotent mappings and to combine the results proved already.
Theorem. Let $\varphi : V \to V$ be a nilpotent linear mapping. Then there exists a decomposition of $V$ into a direct sum of subspaces $V = V_1 \oplus \cdots \oplus V_k$ such that the restriction of $\varphi$ to each summand $V_i$ is cyclic.
Proof. We provide a straightforward construction of a basis of the space $V$ such that the action of the mapping $\varphi$ on the basis vectors directly shows the decomposition into the cyclic mappings. Let $k$ be the degree of nilpotency of the mapping $\varphi$ and write $P_i = \operatorname{Im}(\varphi^i)$, $i = 0, \dots, k$. Thus
$$\{0\} = P_k \subset P_{k-1} \subset \cdots \subset P_1 \subset P_0 = V.$$
Choose a basis $e_1^{k-1}, \dots, e_{p_{k-1}}^{k-1}$ of the space $P_{k-1}$, where $p_{k-1} > 0$ is its dimension. By definition, $P_{k-1} \subset \operatorname{Ker}\varphi$, i.e. $\varphi(e_j^{k-1}) = 0$ for all $j$. Assume that $P_{k-1} \neq V$. Since $P_{k-1} = \varphi(P_{k-2})$, there necessarily exist vectors $e_j^{k-2}$, $j = 1, \dots, p_{k-1}$, in $P_{k-2}$ such that $\varphi(e_j^{k-2}) = e_j^{k-1}$. Assume
$$a_1 e_1^{k-1} + \cdots + a_{p_{k-1}} e_{p_{k-1}}^{k-1} + b_1 e_1^{k-2} + \cdots + b_{p_{k-1}} e_{p_{k-1}}^{k-2} = 0.$$
Applying $\varphi$ to this linear combination yields $b_1 e_1^{k-1} + \cdots + b_{p_{k-1}} e_{p_{k-1}}^{k-1} = 0$. This is a linear combination of independent vectors, therefore all $b_j = 0$. But then also all $a_j = 0$. Thus the linear independence of all $2p_{k-1}$ chosen vectors is established. Next, extend them to a basis
$$(1)\qquad e_1^{k-1}, \dots, e_{p_{k-1}}^{k-1};\quad e_1^{k-2}, \dots, e_{p_{k-1}}^{k-2}, e_{p_{k-1}+1}^{k-2}, \dots, e_{p_{k-2}}^{k-2}$$
of the space $P_{k-2}$. The images of the added basis vectors lie in $P_{k-1}$, so they must be linear combinations of the basis elements $e_1^{k-1}, \dots, e_{p_{k-1}}^{k-1}$. We can thus adjust the chosen vectors $e_{p_{k-1}+1}^{k-2}, \dots, e_{p_{k-2}}^{k-2}$ by adding the appropriate linear combinations of the vectors $e_1^{k-2}, \dots, e_{p_{k-1}}^{k-2}$, with the result that the adjusted vectors lie in the kernel of $\varphi$. Thus we may assume that our choice in the scheme (1) has this property.
Assume that we have already constructed a basis of the subspace $P_{k-\ell}$, arranged in a scheme similar to (1):
$$e_1^{k-1}, \dots, e_{p_{k-1}}^{k-1}$$
$$e_1^{k-2}, \dots, e_{p_{k-1}}^{k-2},\ e_{p_{k-1}+1}^{k-2}, \dots, e_{p_{k-2}}^{k-2}$$
$$e_1^{k-3}, \dots, e_{p_{k-1}}^{k-3},\ e_{p_{k-1}+1}^{k-3}, \dots, e_{p_{k-2}}^{k-3},\ e_{p_{k-2}+1}^{k-3}, \dots, e_{p_{k-3}}^{k-3}$$
$$\vdots$$
$$e_1^{k-\ell}, \dots, e_{p_{k-1}}^{k-\ell},\ e_{p_{k-1}+1}^{k-\ell}, \dots, e_{p_{k-2}}^{k-\ell},\ e_{p_{k-2}+1}^{k-\ell}, \dots\dots, e_{p_{k-\ell}}^{k-\ell}$$
where the value of the mapping $\varphi$ on any basis vector is located directly above it; the value is zero if there is nothing above that basis vector. If $P_{k-\ell} \neq V$, then again there must exist vectors $e_1^{k-\ell-1}, \dots, e_{p_{k-\ell}}^{k-\ell-1}$ which map to $e_1^{k-\ell}, \dots, e_{p_{k-\ell}}^{k-\ell}$. We extend them to a basis of $P_{k-\ell-1}$, say by the vectors $e_{p_{k-\ell}+1}^{k-\ell-1}, \dots, e_{p_{k-\ell-1}}^{k-\ell-1}$. Again, exactly as when adjusting (1) above, we choose the additional basis vectors from the kernel of $\varphi$, and analogously as before we verify that we indeed obtain a basis of $P_{k-\ell-1}$. After $k$ steps we obtain a basis for the whole of $V$, which has the properties given for the basis of the subspace $P_{k-\ell}$. Individual columns of the resulting scheme then generate the subspaces Vi.
Additionally we have found the bases of these subspaces which show that corresponding restrictions of φ are cyclic mappings. □ 3.4.18. Proof of the Jordan theorem. Let λ1, . . . , λk be all the distinct eigenvalues of the mapping φ. From the assumptions of the Jordan theorem it follows that V = Rλ1 ⊕ · · · ⊕ Rλk . The mappings φi = (φ|Rλi −λi idRλi ) are nilpotent and thus each of the root spaces is a direct sum Rλi = P1,λi ⊕ · · · ⊕ Pji,λi of spaces on which the restriction of the mapping φ−λi idV is cyclic. Matrices of these restricted mappings on Pr,s are Jordan blocks corresponding to the zero eigenvalue, the restricted mapping φ|Pr,s has thus for its matrix the Jordan block with the eigenvalue λi. For the proof of Jordan theorem it remains to verify the claim about uniqueness (up to reordering the blocks). Because the diagonal values λi are given as roots of the characteristic polynomial, their uniqueness is immediate. The decomposition to root spaces is unique as well. Thus, without loss of generality we may assume that there is just one eigenvalue λ and we are going to express the dimensions of 242 We have seen that working with diagonalizable matrices significantly simplifies many computations, such as matrix powers. When A is not diagonalizable, it is still useful to have a simplified form of A with respect to some ordered basis. In fact, for any linear transformation f : V → V on a finite-dimensional complex vector space V , there exists a particular simple matrix representation. If A is a complex square matrix, this means that we can find a matrix J such that A = PJP−1 for some invertible matrix P. The matrix J is known as the Jordan normal form (or Jordan canonical form) and is typically represented as: J =     J1 0 J2 ... 0 Jk     . In this expression, Ji (i = 1, . . . , k) are the so called Jordan blocks, which are upper triangular matrices whose form is explained below. The Jordan canonical form J will be a diagonal matrix if and only if A is diagonalizable, see 3.4.10. Let us briefly outline the procedure for computing the Jordan canonical form of a given n × n matrix A: Step 1: Compute the eigenvalues λ1, . . . , λk of A and their algebraic multiplicities, which we denote by α(λi). Step 2: Compute the geometric multiplicity γ(λi) of each eigenvalue λi of A. Recall that γ(λi) = dim Ker(A − λiE), where E is the n × n identity matrix (we have γ(λi) ≤ α(λi) for all i = 1, . . . , k). Step 3: For each eigenvalue λi determine the root subspace (also known as generalized eigenspace) Rλi = Ker ( (A − λiE)j ) , where j = α(λi) is the algebraic multiplicity of λi. Note the null spaces Ker ( (A−λiE)ℓ ) for ℓ < j = α(λi) are included in Rλi , for any eigenvalues λi, as Rλi captures all the necessary vectors. In particular, the dimension of Rλi coincides with the algebraic multiplicity j = α(λi) of λi, for all i = 1, . . . , k. Step 4: Construct the Jordan block Ji ≡ Jλi for all i = 1, . . . k, which has necessarily λi on the diagonal and 1s on the superdiagonal, while other elements are 0. To determine the size of Ji we start from the highest power j for which the dimension dim Ker ( (A − λiE)j ) is non-zero (this highest power is typically the algebraic multiplicity, i.e., j = α(λi)), and move downwards to j = 1. For example, a Jordan block of size m × m corresponding to the eigenvalue λi has the form Ji =       λi 1 0 . . . 0 0 λi 1 . . . 0 0 0 λi . . . 0 ... ... ... ... 1 0 0 0 . . . λi       ∈ Matm(C) . CHAPTER 3. 
LINEAR MODELS AND MATRIX CALCULUS individual Jordan blocks using the ranks rk of the mapping (φ−λ idV )k . This will show that the blocks are uniquely determined (up to their order). On the other hand, changing the order of the blocks corresponds to renumbering the vectors of basis, thus we can obtain them in any order. If ψ is a cyclic operator on an m-dimensional space, then the defect of the iterated mapping ψk is k for 0 ≤ k ≤ m, while the defect is m for all k ≥ m. This implies that if our matrix J of the mapping φ on the n-dimensional space V (remind we assume V = Rλ) contains dk Jordan blocks of the order k, then the defect Dℓ = n − rℓ of the matrix (J − λ E)ℓ is Dℓ = d1 + 2d2 + · · · + ℓdℓ + ℓdℓ+1 + · · · . Now, taking the combination 2Dk −Dk−1 −Dk+1 we cancel all those terms in the latter expression which coincide for ℓ = k − 1, k, k + 1 and we are left with 2Dk − Dk−1 − Dk+1 = dk. Substituting for Dℓ’s, we finally arrive at dk = 2n−2rk −n+rk−1 −n+rk+1 = rk−1 −2rk +rk+1. This is the requested expression for the sizes of the Jordan blocks and the theorem is proved. 3.4.19. Remarks. The proof of the theorem about the existence of the Jordan canonical form was constructive, but it does not give an efficient algorithmic approach for the construction. Now we show how our results can be used for explicit computation of the basis in which the given mapping φ : V → V has its matrix in the canonical Jordan form.6 (1) Find the roots of the characteristic polynomial. (2) If there are less than n = dim V roots (counting multiplicities), then there is no canonical form. (3) If there are n linearly independent eigenvectors, there is a basis of V composed of eigenvectors under which φ has diagonal matrix. (4) Let λ be the eigenvalue with geometric multiplicity strictly smaller than the algebraic multiplicity and v1, . . . , vk be the corresponding eigenvectors. They should be the vectors on the upper boundary of the scheme from the proof of the theorem 3.4.17. We need to complete the basis by application of iterations φ − λ idV . By doing this we also find in which row the vectors should be located. Hence we find the linearly independent solutions wi of the equations (φ − λ id)(wi) = vi from the rows below it. Repeat the procedure iteratively (that is, for wi and so on). In this way, we find the “chains” of basis vectors that give invariant subspaces, where φ − λ id is cyclic (the columns from the scheme in the proof). 6There is a beautiful purely algebraic approach to compute the Jordan canonical form efficiently, but it does not give any direct information about the right basis. This algebraic approach is based on polynomial matrices and Weierstrass divisors. We shall not go into details in this textbook. 243 Later we will see that each Jordan block corresponds to a size which represents the length of a so-called chain of generalized eigenvectors associated with λi. The increase in dimension between the kernel of (A − λiE)j−1 and (A − λiE)j is the number of blocks of size j or more. We finally form the matrix J as posed above, i.e., J consists of the Jordan blocks Ji which all lie along the diagonal, while the blank off-diagonal blocks are all zero. Therefore, in a Jordan matrix J the only possibly non-zero entries are those on the diagonal, which can attain any complex value (including 0), and those on the superdiagonal, which are either 0 or 1. 3.D.40. Let A be a matrix with characteristic polynomial χA(λ) = (λ− √ 2)3 . 
Illustrate all possible configurations for the Jordan canonical form of A. Solution. Since the characteristic polynomial is of degree three, the matrix A should be 3 × 3. Since χA(λ) = (λ −√ 2)3 , the unique eigenvalue λ = √ 2 has algebraic multiplicity 3. Now, the geometric multiplicity γ(λ) of λ satisfies 1 ≤ γ(λ) ≤ 3. Hence there are three possible configurations (up to the ordering of the Jordan blocks): • If γ(λ) = 1, then the Jordan canonical form will consist of a single Jordan block of size 3: J =   √ 2 1 0 0 √ 2 1 0 0 √ 2   . • If γ(λ) = 2, then the Jordan canonical form will consist of one Jordan block of size 2 and one Jordan block of size 1: J =   √ 2 1 0 0 √ 2 0 0 0 √ 2   . • If γ(λ) = 3, then the Jordan canonical form will consist of three Jordan blocks of size 1 each (in this case there exist three linearly independent eigenvectors, hence A is diagonalizable and J is diagonal): J =   √ 2 0 0 0 √ 2 0 0 0 √ 2   . □ 3.D.41. Which of the following matrices represents a Jordan canonical form? Explain your answer. G1 =     0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 2     , G2 =   3 0 0 0 1 1 0 0 1   , G3 = ( 5 1 0 1 ) , G4 =   −2 1 0 0 −1 0 0 0 −2   , G5 = (π), CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS The procedure is practical for matrices when the multiplicities of the eigenvalues are small, or at least when the degrees of nilpotency are small. For instance, for the matrix A =   2 0 1 0 2 1 0 0 2   we obtain the two-dimensional subspace of eigenvectors span{(1, 0, 0)T , (0, 1, 0)T }, but we still do not know, which of them are the “ends of the chains”. We need to solve the equations (A − 2E)x = (a, b, 0)T for (yet unknown) constants a, b. This system is solvable if and only if a = b, and one of the possible solutions is x = (0, 0, 1)T , a = b = 1. The entire basis is then composed of (1, 1, 0)T , (0, 0, 1)T , (1, 0, 0)T . Note that we have free choices on the way and thus there are many such bases. 5. Decompositions of the matrices and pseudoinversions Previously we concentrated on the geometric description of the structure of a linear mapping. Now we translate our results into the language of matrix decomposition. This is an important topic for numerical methods and matrix calculus in general. Even when computing effectively with real numbers we use decompositions into products. The simplest one is the unique expression of every real number in the form a = sgn(a) · |a|, that is, as a product of the sign and the absolute value. Proceeding in the same way with complex numbers, we obtain their polar form. That is, we write z = (cos φ + i sin φ)|z|. Here the complex unit plays the role of the sign and the other factor is a non-negative real multiple. In the following paragraphs we list briefly some useful decompositions for distinct types of matrices. Remind, we met suitable decompositions earlier, for instance for positive semidefinite matrices in paragraph 3.4.9 when finding the square roots. We shall start with similar simple examples. 3.5.1. LU-decomposition. In paragraphs 2.1.7 and 2.1.8 we transformed matrices over scalars from any field into row echelon form. For this we used elementary row transformations, based on successive multiplication of our matrix by invertible lower triangular matrices Pi. In this way we added multiples of the rows above the currently transformed one. Sometimes we also interchanged the rows, which corresponded to multiplication by a permutation matrix. 
That is a square matrix in which all elements are zero except exactly one value 1 in each row and column. To imagine why, consider a matrix with just one non-zero element in the first column but not in the first row. When we used the backwards 244 G6 =         2 1 0 0 0 0 0 2 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0         . ⃝ 3.D.42. Find the Jordan canonical form of the matrix A =   5 4 2 0 1 −1 −1 −1 3   . Solution. Step 1: The characteristic polynomial is given by χA(λ) = 5 − λ 4 2 0 1 − λ −1 −1 −1 3 − λ = (5 − λ) 1 − λ −1 −1 3 − λ − 4 0 −1 −1 3 − λ + 2 0 1 − λ −1 −1 = −λ3 + 9λ2 − 24λ + 16 . We see that 1 is a root of χA(λ). To find all roots we can apply the Horner’s scheme: 1 −1 9 −24 16 −1 8 −16 −1 8 −16 0 Thus, χA(λ) = −λ3 + 9λ2 − 24λ + 16 = (λ − 1)(−λ2 + 8λ − 16) = (λ − 1)(λ − 4)2 , and A has two eigenvalues: λ1 = 1 with α(λ1) = 1 and λ2 = 4 with α(λ2) = 2. Step 2: Let us now compute the geometric multiplicities. For λ1 = 1 we need to solve the matrix equation (A − E)u = 0 for some vector u = (x1, x2, x3)T ∈ R3 . Since A − E =   4 4 2 0 0 −1 −1 −1 2   , we obtain the linear system {4x1 + 4x2 + 2x3 = 0 , x3 = 0 , −x1 − x2 + 2x3 = 0} . The solution space of this system is 1-dimensional:      x1 x2 x3   =   r −r 0   : r ∈ R    . Thus γ(λ1) = dim Ker(A − I) = 1, in particular we may assume that V1 is generated by the eigenvector u1 = (1, −1, 0)T . For the second eigenvalue λ2 = 4 we have the matrix equation (A − 4E)v = 0 for some vector v = (y1, y2, y3)T ∈ R3 . Since A − 4E =   1 4 2 0 −3 −1 −1 −1 −1   , CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS elimination to transform the matrix into the blockwise form ( Eh 0 0 0 ) (remind Eh stays for the unit matrix of rank h) then we potentially needed to interchange columns as well. This was achieved by multiplying by a permutation matrix from the right hand side. For simplicity, assume we have a square matrix A of size m and that Gaussian elimination does not force a row interchange. Thus all matrices Pi can be lower triangular with ones on diagonal. Finally we note that inverses of such Pi are again lower triangular with ones on the diagonal (either remember the algorithm 2.1.10 or the formula in 2.2.11). We obtain U = P · A = Pk · · · P1 · A where U is an upper triangular matrix. Thus A = L · U where L is lower triangular matrix with ones on diagonal and U is upper triangular. This decomposition is called LU-decomposition of the matrix A. We can also absorb the diagonal values of U into a diagonal matrix D and obtain the LDU-decomposition where both U and L have just ones along the diagonal, A = L D U. For a general matrix A, we need to add the potential permutations of rows during Gaussian elimination. Then we obtain the general result. (Think why we can always put the necessary permutation matrices to the most left and most right positions!) LU-decomposition Let A be any square matrix of size m over a field of scalars. Then we can find lower triangular matrix L with ones on its diagonal, upper triangular matrix U and permutation matrices P and Q, all of size m, such that A = P · L · U · Q. 3.5.2. Remarks. As one direct corollary of the Gaussian elimination we can observe that, up to a choice of suitable bases on the domain and codomain, every linear mapping f : V → W is given by a matrix in block-diagonal form with unit matrix of the size equal to the dimension of the image of f, and with zero blocks all around. 
This can be reformulated as follows: every matrix A of the type m/n over a field of scalars K can be decomposed into the product
A = P · \begin{pmatrix} E & 0 \\ 0 & 0 \end{pmatrix} · Q,
where P and Q are suitable invertible matrices. Previously (in 3.4.10) we discussed properties of linear mappings f : V → V over complex vector spaces.

we obtain the linear system {y1 + 4y2 + 2y3 = 0, 3y2 + y3 = 0, y1 + y2 + y3 = 0}. Using Sage via the cell

A1=matrix([[1, 4, 2], [0, -3, -1], [-1, -1, -1]])
show(A1)
show(A1.rref())

we obtain the following reduced row echelon form
A − 4E = \begin{pmatrix} 1 & 4 & 2 \\ 0 & -3 & -1 \\ -1 & -1 & -1 \end{pmatrix} ∼ \begin{pmatrix} 1 & 0 & 2/3 \\ 0 & 1 & 1/3 \\ 0 & 0 & 0 \end{pmatrix},
which can be verified by hand in a few steps. Therefore, the solution space is given by
{ (y1, y2, y3)^T = (−(2/3)s, −(1/3)s, s)^T : s ∈ R }.
Consequently, we may assume that the eigenspace V4 is generated by the eigenvector u2 = (2, 1, −3)^T (this corresponds to the choice s = −3). In particular, γ(λ2) = 1.

Step 3: Since the eigenvalue λ2 = 4 satisfies γ(λ2) = 1 < α(λ2) = 2, the matrix A is not diagonalizable, and we should proceed with Step 3. In particular, for λ2 = 4 we have j = α(λ2), and this algebraic multiplicity coincides with the dimension of the generalized eigenspace Rλ2 = Ker((A − 4E)²), i.e., dim Ker((A − 4E)²) = 2 (see below in 3.D.44 for a description of Rλ2). Thus, according to our algorithm, for the eigenvalue λ2 = 4 there exists a unique Jordan block of size 2 × 2 or larger, since dim Ker((A − 4E)²) − dim Ker(A − 4E) = 2 − 1 = 1. Counting dimensions, we see that this block necessarily has the form J2 = \begin{pmatrix} 4 & 1 \\ 0 & 4 \end{pmatrix}, while for λ1 = 1 the corresponding Jordan block J1 is just a scalar, J1 = (1).

Step 4: We are now able to describe the Jordan canonical form:
J = (1) ⊕ \begin{pmatrix} 4 & 1 \\ 0 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 4 & 1 \\ 0 & 0 & 4 \end{pmatrix}. □

Having the Jordan canonical form J of a square matrix A, we can derive a similarity matrix P such that P^{−1}AP = J, or, equivalently, A = PJP^{−1}. Such a matrix P is composed of columns that are eigenvectors and generalized eigenvectors of A. To understand how P is constructed, recall that for an eigenvalue λi of A, a generalized eigenvector u is defined as a non-zero vector such that (A − λiE)^j u = 0 for some integer j > 0. If j is the smallest positive integer with this property, the set {(A − λiE)^{j−1}u, (A − λiE)^{j−2}u, . . . , (A − λiE)u, u} is known as a "(Jordan) chain of length j", see also the discussion in 3.4.19. The structure of these chains reflects the sizes and arrangements of the Jordan blocks in the Jordan form J. In

We showed that every square matrix A of dimension m can be decomposed into the product A = P · J · P^{−1}, where J is a block-diagonal matrix with the Jordan blocks associated with the eigenvalues of A on the diagonal. Indeed, this is just a reformulation of the Jordan theorem, because multiplying by the matrix P and by its inverse from the other side corresponds in this case just to the change of the basis on the vector space V (with transition matrix P). The quoted theorem says that every mapping has the Jordan canonical form in a suitable basis. Analogously, when discussing the self-adjoint mappings we proved that for real symmetric matrices or for complex Hermitian matrices there exists a decomposition into the product A = P · D · P∗, where D is the diagonal matrix with all (always real) eigenvalues on the diagonal, counting multiplicities. Indeed, we proved that there is an orthonormal basis consisting of eigenvectors (a small numerical illustration follows below).
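For instance, the fact just quoted is easy to check numerically. The following minimal Sage sketch uses a 2 × 2 symmetric matrix of our own choosing; since its eigenvalues are simple, the computed eigenvectors are automatically orthogonal, and it only remains to normalize them:

A = matrix(QQ, [[2, 1], [1, 2]])   # symmetric; eigenvalues 3 and 1
D, P = A.eigenmatrix_right()       # A*P == P*D, columns of P are eigenvectors
P = P*diagonal_matrix([1/v.norm() for v in P.columns()])   # normalize columns
print(bool(P.T*P == identity_matrix(2)))   # True: P is orthogonal
print(bool(P*D*P.T == A))                  # True: A = P*D*P^T

For eigenvalues of higher multiplicity one would additionally orthogonalize inside each eigenspace, e.g. by the Gram-Schmidt procedure.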
Thus the transition matrix P reflecting the appropriate change of the basis must be orthogonal. In particular, P−1 = P∗ . For real orthogonal mappings we derived analogous expression as for the symmetric ones, i.e. A = P · B · P∗ . But in this case the matrix B is block-diagonal with blocks of size two or one, expressing rotations, mirror symmetry and identities with respect to the corresponding subspaces. 3.5.3. Singular decomposition theorem. We return to general linear mappings f : V → W between vector spaces (generally distinct). We assume that scalar products are defined on both spaces and we restrict ourselves to orthonormal bases only. If we want a similar decomposition result as above, we must proceed in a more refined way than in the case of arbitrary bases. But the result is surprisingly similar and strong: Singular decomposition Theorem. Let A be a matrix of the type m/n over real or complex scalars. Then there exist square unitary matrices U and V of dimensions m and n, and a real diagonal matrix D with non-negative elements of dimension r, r ≤ min{m, n}, such that A = U S V ∗ , S = ( D 0 0 0 ) and r is the rank of the matrix AA∗ . The matrix S is determined uniquely up to the order of the diagonal elements si in D. Moreover, si are the square roots of the positive eigenvalues di of the matrix A A∗ . If A is a real matrix, then the matrices U and V are orthogonal. 246 particular, each chain corresponds to a block, and the number of vectors in the chain indicates the block’s size. Now each column in P corresponds to a vector in a Jordan chain. For a Jordan block of size j associated with λi, the matrix P will include j columns form the Jordan chain of a generalized eigenvector associated with λi. Let us illustrate these facts with examples and for further exercises on the Jordan canonical form see Section F. 3.D.43. Using the matrices G1, . . . G6 given in 3.D.41, derive the characteristic polynomial of the corresponding matrix A for those cases that represent a Jordan canonical from. ⃝ 3.D.44. For the matrix A provided in 3.D.42, determine a similarity matrix P such that P−1 AP = J, where J is Jordan canonical form of A, as obtained in 3.D.42. Solution. We know that Rλ2 = Ker ( (A − 4E)2 ) is 2-dimensional. This is because the homogeneous linear system induced by the matrix equation (A − 4E)2 w = 0, i.e.,   −1 −10 −4 1 10 4 0 0 0     w1 w2 w3   =   0 0 0   has a 2-dimensional solution space, given by      w1 w2 w3   =   −10t − 4s t s   : t, s ∈ R    . To derive that matrix P we first need to solve the matrix equation (A − 4E)w = u2, where v2 = (2, 1, −3)T is the eigenvector corresponding to λ2 = 4 (see 3.4.19). Then, the vector w should be an element of Rλ2 and the set {(A−4E)w, w} = {u2, w} will give rise to a Jordan chain for w. Moreover, the columns of P will be the vectors u1, u2 and w, i.e., P =( u1 u2 w ) . Let us find w. The augmented matrix of the linear system induced by the equation (A − 4E)w = u2, is given by B := ( A − 4E u2 ) =   1 4 2 2 0 −3 −1 1 −1 −1 −1 −3   , and we can apply elementary row operations to obtain the corresponding RREF. Successively one obtains B R3→R1+R3 −→   1 4 2 2 0 −3 −1 1 0 3 1 −1   R3→R2+R3 −→   1 4 2 2 0 −3 −1 1 0 0 0 0   R2→− 1 3 R2 −→   1 4 2 2 0 1 1/3 −1/3 0 0 0 0   R1→R1−4R2 −→   1 0 2/3 10/3 0 1 1/3 −1/3 0 0 0 0   . You can verify this expression in Sage by the rref()-method, as we learned in Chapter 2. CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Proof. Assume first that m ≤ n. 
Denote by φ : Kn → Km the mapping between real or complex spaces with standard scalar products, given by the matrix A in the standard bases. We can reformulate the statement of the theorem as follows: there exists orthonormal bases on Kn and Km in which the mapping φ is given by the matrix S from the statement of the theorem. As noted before, the matrix A∗ A is positive semidefinite. Therefore it has only real non-negative eigenvalues and there exists an orthonormal basis w of Kn in which the corresponding mapping φ∗ ◦ φ is given by a diagonal matrix with eigenvalues on the diagonal. In other words, there exists a unitary matrix V such that A∗ A = V B V ∗ for a real diagonal matrix B with non-negative eigenvalues (d1, d2, . . . , dr, 0, . . . , 0) on the diagonal, di ̸= 0 for all i = 1, . . . , r. Thus B = V ∗ A∗ A V = (A V )∗ (A V ). This is equivalent to the claim that the first r columns of the matrix AV are orthogonal, while the remaining columns vanish because they have zero norm. Next, we denote the first r columns of A V as v1, . . . , vr ∈ Km . Thus, ⟨vi, vi⟩ = di, i = 1, . . . , r, and the normalized vectors ui = 1√ di vi form an orthonormal system of non-zero vectors. Extend them to an orthonormal basis u = u1, . . . , um for the entire Km . Expressing the original mapping φ in the bases w of Kn and u of Km , yields the matrix √ B. The transformations from the standard bases to the newly chosen ones correspond to the multiplication from the left by a unitary (orthogonal) matrix U and from the right by V −1 = V ∗ . This is the claim of the theorem. If m > n, we can apply the previous part of the proof to the matrix A∗ which implies the desired result. All the previous steps in the proof are also valid in the real domain with real scalars. □ This proof of the theorem about singular decomposition is constructive and we can indeed use it for computing the unitary (orthogonal) matrices U and V and the non-zero diagonal elements of the matrix S. The diagonal values of the matrix D from the previous theorem are called singular values of the matrix A. 3.5.4. Further comments. When dealing with real scalars, the singular values of a linear mapping φ : Rn → Rm have a simple geometric meaning: Let K ⊂ Rn be the unit ball in the standard scalar product. The image φ(K) is always an m-dimensional ellipsoid (possibly degenerate). The singular values of the matrix A are then the norms of the main halfaxes. The theorem says further that the original ball allows an orthogonal set of diameters, whose images are exactly the half-axes of this ellipsoid. For square matrices it can be seen that A is invertible if and only if all singular values are non-zero. The ratio of the 247 A=matrix([[5, 4, 2], [0, 1, -1], [-1, -1, 3]]) E=identity_matrix(3) AA=A-4*E print(bool(AA== matrix([[1, 4, 2], [0, -3, -1], [-1, -1, -1]]))) b=vector([2, 1, -3]) B=AA.augment(b) B.rref() Therefore, we obtain the system (we assume that w has the form w = (x, y, z)T ) {x + 2 3 z = 10 3 , y + 1 3 z = − 1 3 , z ∈ R} with solution space      x y z   =   −2 3 s + 10 3 −1 2 s − 1 3 s   : s ∈ R    . To obtain a representative we may set s = 0 which gives the generalized eigenvector w = (10/3, −1/3, 0)T . Let us use Sage to verify that w is indeed a generalized eigenvector. 
To do so, we can simply run the following code block: w=vector([10/3, -1/3, 0]) A=matrix([[5, 4, 2], [0, 1, -1],[-1, -1, 3]]) E=identity_matrix(3) BB=(A-4*E)^2 bool(BB*w==0) Sage’s output is True and we can present the matrix P: P = ( u1 u2 w ) =   1 2 10 3 −1 1 −1 3 0 −3 0   . To verify the correctness of this result, first compute P−1 and next check that P−1 AP = J. For convenience, we will use Sage to perform these computations promptly. A=matrix([[5, 4, 2], [0, 1, -1], [-1, -1, 3]]) J=matrix([[1, 0, 0], [0, 4, 1], [0, 0, 4]]) P=matrix([[1, 2, 10/3], [-1, 1, -1/3], [0, -3, 0]]) show(A, P, J, P.inverse()) print(P.inverse()*A*P==J) We mention that Sage provides a built-in method to obtain the Jordan canonical form of a matrix, using the command A.jordan_form(). For instance, for our matrix A give the cell A=matrix([[5, 4, 2],[0, 1, -1], [-1, -1, 3]]) J,P=A.jordan_form(transformation=True) show(J, P) CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS greatest to the smallest singular value is an important parameter for the robustness of many numerical computations with matrices, for instance the computation of the inverse matrix. Note that there are fast methods of computation (approximations) for eigenvalues. Thus the singular decomposition is a very effective tool to work with. 3.5.5. Polar decomposition theorem. The singular decomposition theorem is the starting point for many other useful tools. We present several direct corollaries (which by themselves are non-trivial and important). The statement of the singular decomposition theorem saying that for any matrix A, real or complex, A = USW∗ with S diagonal with non-negative real numbers on the diagonal and U and W unitary, can be rephrased as A = USU∗ UW∗ and let us denote P = USU∗ , V = UW∗ . The first of the matrices, P, is Hermitian (in the real case, symmetric) and positive semidefinite (because P and S are matrices of the the same mapping in different orthonormal bases). At the same time, V is the product of two unitary matrices and thus again is unitary (in the real case orthogonal). Next, assume that A = PV = QZ are two such decompositions of the matrix A into the product of a positive semidefinite Hermitian matrix and a unitary matrix. Clearly, A∗ = WSU∗ . Thus AA∗ = USSU∗ = P2 , and the matrix P is actually the square root of the easily computable Hermitian matrix AA∗ . In particular, this proves that P is uniquely determined, cf. 3.4.9. Further, assume that A is invertible. Then also P is invertible and Z = V = P−1 A. We have derived a very useful analogy of the decomposition of a real number into a sign and the absolute value: Polar decomposition Theorem. Every square complex matrix A of dimension n can be expressed in the form A = PV , where P is a Hermitian positive semi-definite square matrix of the same dimension while V is unitary. The matrix P = √ AA∗ is uniquely given, and if A is invertible, the decomposition is unique and V = ( √ AA∗)−1 A. If A is a matrix of real scalars, then P is symmetric and V is orthogonal. If we apply the same theorem to A∗ instead of A, we obtain the same result, but with the order of the Hermitian and unitary matrices is reversed. This means A = V P with V unitary and P = √ A∗A positive semidefinite. The matrices in the corresponding right and left polar decompositions will in general be different. 248 Sage’s output has the form   1 0 0 0 4 1 0 0 4   ,   1 1 2 1 −1 1 4 0 0 −3 4 −1 4   . 
This verifies the expression of the matrix J derived in 3.D.42 and presents a similarity matrix P such that P^{−1}AP = J. However, be aware that P is not uniquely determined, which is why Sage's candidate for P may differ from ours. □

3.D.45. Consider the matrix A = \begin{pmatrix} 3 & 1 \\ -1 & 1 \end{pmatrix}. (a) Show that A is not diagonalizable and find the associated Jordan canonical form. Moreover, find a similarity matrix P such that P^{−1}AP = J and verify the results using Sage. (b) Compute the 4th power A⁴ of A using the Jordan canonical form J and the matrices P, P^{−1} derived in (a). ⃝

3.D.46. Consider the matrix A = \begin{pmatrix} 5 & 4 & 2 & 1 \\ 0 & 5 & 1 & 1 \\ 0 & 0 & 5 & 2 \\ 0 & 0 & 0 & 5 \end{pmatrix}. Determine the eigenvectors and generalized eigenvectors of A and find generators of the subspaces Ker((A − 5E)²), Ker((A − 5E)³) and Ker((A − 5E)⁴). Next determine the Jordan canonical form J of A and present a matrix P such that P^{−1}AP = J. Moreover, show that the generalized eigenspace Rλ of the unique eigenvalue λ of A satisfies Rλ = Ker((A − 5E)⁴) ≅ R⁴. ⃝

E. Matrix decompositions

Matrices with some special property, such as diagonal matrices, triangular matrices, or unitary matrices, are simpler to work with than general matrices. The final part of this chapter focuses on matrix decompositions, i.e., techniques for expressing a given matrix as a product of "simpler" matrices such as those mentioned above, see the paragraphs 3.5.1 – 3.5.6. Matrix decompositions are essential for tasks like solving linear systems of equations and performing many complex matrix operations.

We begin with the LU-decomposition, which is essentially the matrix form of the Gauss elimination method described in paragraphs 2.1.7–2.1.9. The LU-decomposition exists for square matrices A which can be reduced to a row-echelon form through a series of row operations without requiring row exchanges. In such a situation we arrive at a factorization of the form A = LU, where U is the upper triangular matrix whose diagonal entries are the pivots, and L = (ℓij) is a lower triangular matrix with ones on the diagonal whose off-diagonal entries ℓij encode the elementary row operations that we applied to find U. More precisely, ℓij records the multiple of row j that was subtracted from row i during the corresponding step of the Gaussian elimination.

Actually, if A is invertible, it is easy to check that the matrices in the left and the right polar decompositions coincide if and only if A is normal. Look at theorem 3.4.8 and verify it yourself. In the complex case the analogy with the decomposition of numbers is even more entertaining. The positive semidefinite P again plays the role of the absolute value of the complex number. The unitary matrix V uniquely allows the expression as a sum V = re V + i im V with Hermitian real and imaginary parts and the property (re V)² + (im V)² = E. We obtain a full analogy with the polar form of the complex numbers (see the final remark and corollary in 3.4.8). But note that in the higher dimensional case it is important in which order this "polar form" of the matrix is written. It is possible in both ways, but the results differ in general.

3.5.6. QR decomposition. For many practical applications it is faster to use another decomposition of matrices, which is an analogy of the Schur orthogonal triangulation theorem:

QR decomposition

Theorem. For every complex matrix A of the type m/n there exists a unitary matrix Q and an upper triangular matrix R such that A = QR. If all the scalars are real, then both Q and R are real (i.e., Q is orthogonal).
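Before turning to the proof, the statement is easy to test numerically. A minimal sketch, using Sage's built-in QR method (available over QQbar or CDF, and discussed further in the practice column) on an illustrative 3 × 2 matrix of our own choosing:

A = matrix(QQbar, [[1, 2], [0, 1], [1, 0]])   # an illustrative 3x2 real matrix
Q, R = A.QR()          # Q is 3x3 unitary, R is 3x2 upper triangular
print(Q.is_unitary())  # True
print(A == Q*R)        # True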
Proof. In the geometric formulation we need to prove that for every mapping φ : K^n → K^m with the matrix A in the standard bases, we can choose a new orthonormal basis on K^m for which φ has an upper triangular matrix. Consider the images φ(e1), . . . , φ(en) ∈ K^m of the vectors of the standard orthonormal basis of K^n. Choose from them a maximal linearly independent system v1, . . . , vk in such a way that the removed dependent vectors are always a linear combination of the previous vectors. Extend it into a basis v1, . . . , vm. Let u1, . . . , um be an orthonormal basis of K^m obtained by the Gram-Schmidt orthogonalization of this system of vectors. For every ei, φ(ei) is either one of the vj, j ≤ i, or it is a linear combination of v1, . . . , v_{i−1}. Therefore, in the expression of φ(ei) in the basis u only the vectors u1, . . . , ui appear. Thus, in the standard basis on K^n and u on K^m, the mapping φ has an upper triangular matrix R. The change of the basis u on K^m corresponds to the multiplication by a unitary matrix Q∗ from the left. That is, R = Q∗A, equivalently A = QR. The last claim is clear from the construction. □

3.5.7. Pseudoinversions. Finally, we discuss an especially useful and important extension of the inversion concept, which is of great importance for numerical procedures and also in Statistics.

3.E.1. Find an LU-decomposition of the matrix A = \begin{pmatrix} -2 & 1 & 0 \\ -4 & 4 & 2 \\ -6 & 1 & -1 \end{pmatrix}.

Solution. The desired decomposition is given by
A = LU = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & -1 & 1 \end{pmatrix} \begin{pmatrix} -2 & 1 & 0 \\ 0 & 2 & 2 \\ 0 & 0 & 1 \end{pmatrix}.
To obtain this result, let us denote by R1, R2, R3 the rows of A during any step of the Gaussian elimination. Then U (the row-echelon form of A) is obtained by the following elementary row operations:
\begin{pmatrix} -2 & 1 & 0 \\ -4 & 4 & 2 \\ -6 & 1 & -1 \end{pmatrix} \xrightarrow{R2 − 2R1,\; R3 − 3R1} \begin{pmatrix} -2 & 1 & 0 \\ 0 & 2 & 2 \\ 0 & -2 & -1 \end{pmatrix} \xrightarrow{R3 − (−1)R2} \begin{pmatrix} -2 & 1 & 0 \\ 0 & 2 & 2 \\ 0 & 0 & 1 \end{pmatrix} = U.

Recall from Chapter 2 (cf. 2.A.9) that in Sage one can compute a row echelon form of a matrix A using the echelon_form() command. However, Sage's output may differ from the expected result, and some caution is required when choosing the numerical system. For instance, consider the matrix A with entries over Z. Executing in Sage the cell given below, we will obtain the matrix \begin{pmatrix} 2 & 1 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix}, which is clearly not the desired result.

A = matrix(ZZ, [[-2, 1, 0], [-4, 4, 2], [-6, 1, -1]])
A.echelon_form()

On the other hand, you can verify on your own that by replacing ZZ with RR, RDF, QQbar or the symbolic ring SR, the resulting matrix will be the 3 × 3 identity matrix. Hence, in this case, Sage returns the reduced row-echelon form of A, which is not useful for our task. Now, for the lower triangular matrix L = \begin{pmatrix} 1 & 0 & 0 \\ ℓ21 & 1 & 0 \\ ℓ31 & ℓ32 & 1 \end{pmatrix} we see that the entries ℓ21, ℓ31, ℓ32 are given by the circled numbers 2, 3 and −1 appearing in the row operations mentioned above. In 3.E.5 below, you will be asked to derive the expression of L in terms of elementary matrices. Finally, a verification in Sage can be done manually, as follows:

Technically, the following quite straightforward application of singular decompositions of matrices allows us to define the pseudoinverse. However, we should beware that the singular decomposition is not unique, and thus we must verify that such a definition is consistent. We shall see that in the next theorem.

Pseudoinverse matrices

Let A be a real or complex matrix of the type m/n.
Let A = U S V ∗ , S = ( D 0 0 0 ) be its singular decomposition (in particular, D is invertible). The matrix A† := V S† U∗ , S† = ( D−1 0 0 0 ) is called the pseudoinverse matrix of the matrix A. In geometric terms, we may view the linear mapping φ given by the matrix A in the two special orthonormal basis, where φ has got the matrix S with non-negative diagonal entries. We take the inverse of the “invertible part” of φ and complete it trivially to the pseudoinverse7 mapping φ† . The result is then viewed in the original basis and yields A† . As the following theorem shows, the pseudoinverse is an important generalization of the notion of inverse matrix, together with direct applications. At the same time, property (3), together with property (2), verifies the appropriateness of the definition. 7This concept was introduced by Eliakim Hastings Moore, an American mathematician, around 1920. It was reinvented by Roger Penrose and others later. In the literature, it is often called the Moore-Penrose pseudoinverse. Roger Penrose is an extremely influential mathematical physicist and philosopher of science working in Oxford, known also for his many bestselling popular books such as The Emperor’s New Mind: Concerning Computers, Minds, and The Laws of Physics (1989); Shadows of the Mind: A Search for the Missing Science of Consciousness (1994); The Road to Reality: A Complete Guide to the Laws of the Universe (2004); Cycles of Time: An Extraordinary New View of the Universe (2010). 250 A = matrix(RR,[[-2, 1, 0],[-4, 4, 2], [-6, 1, -1]]) L = matrix(RR,[[1, 0, 0],[2, 1, 0], [3, -1, 1]]) U = matrix(RR,[[-2, 1, 0],[0, 2, 2], [0, 0, 1]]) L*U==A Sage returns True. Note in this block we could use RDF or SR instead of RR, or even omit specifying a numerical system altogether. □ 3.E.2. Permuted LU-decomposition via Sage. If partial pivoting (row exchanges) is allowed, then for any squared matrix A we have a decomposition of the form PA = LU, where P is a permutation matrix. This is known as the LU-decomposition with partial pivoting, or permuted LU-decomposition. Note that Sage provides the command A.LU(), which corresponds to a built-in method to obtain the permuted LU-decomposition. In particular, for a m×n matrix A this command returns a triple of matrices P, L and U, such that A = PLU, satisfying the following properties: P is a m × m permutation matrix, L is a lower triangular m × m matrix with units in the diagonal, and U is an upper triangular m × n matrix. Notice in this case U is not in general a row-echelon form of A. For example, using the matrix A from 3.E.1 and running the cell A = matrix([[-2, 1, 0], [-4, 4, 2], [-6, 1, -1]]) A.LU() we obtain the following answer: ( [0 0 1] [ 1 0 0] [ -6 1 -1] [0 1 0] [2/3 1 0] [ 0 10/3 8/3] [1 0 0], [1/3 1/5 1], [ 0 0 -1/5] ) Notice that the matrices L and U that Sage prints out here, differ from those obtained in 3.E.1. The LU-decomposition is an effective method for solving linear systems. Given that A = LU is the LU-factorization of a squared matrix A, the equation Ax = b can be written as LUx = b. Introducing Ux = y, this gives us Ly = b. Typically, we first solve Ly = b with respect to y and then use this result to solve Ux = y for x. 3.E.3. Solve Ax = b using a LU-decomposition of the matrix A, where A and b are respectively given by A =     3 −6 2 −2 −3 4 1 0 6 −4 0 −4 −9 6 −2 12     , b =     −7 4 6 −3     . Solution. One can verify that CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Properties of pseudoinverse matrices Theorem. 
Let A be a real or complex matrix of the type m/n and let A† be its pseudoinverse. Then: (1) if A is invertible (necessarily square), then A† = A−1 , (2) A† A and AA† are Hermitian (in real case symmetric) and AA† A = A, A† AA† = A† . (3) the pseudoinverse matrices A† are uniquely defined by the four properties from (2). Thus if some matrix B of the type n × m has the properties that BA and AB are both Hermitian, ABA = A and BAB = B, then B = A† . (4) if A is a matrix of the system of linear equations Ax = b with b ∈ Km , then the vector y = A† b ∈ Kn minimizes the norm ∥Ax − b∥ for all vectors x ∈ Kn . (5) the system of linear equations Ax = b with b ∈ Km is solvable if and only if AA† b = b. In this case all solutions are given by the expression x = A† b + (E − A† A)u, where u ∈ Kn is arbitrary. Proof. (1): If A is invertible, then the matrix S = U∗ AV is also invertible and directly from the definition S† = S−1 . Consequently, A† A = AA† = E. (2): Direct computation yields SS† S = S and S† SS† = S† , therefore AA† A = USV ∗ V S† U∗ USV ∗ = USS† SV ∗ = USV ∗ = A and analogically for the second equation. Furthermore, (AA† )∗ = (USS† U∗ )∗ = U(S† )∗ S∗ U∗ = U(SS† )∗ U∗ = USS† U∗ = AA† . It can be proved similarly that (A† A)∗ = A† A. (3) The claim can be proved by direct computation. Of course we can consider the matrices A, A† , B as the matrices of the mappings φ, φ† , and ψ in the standard bases on Kn and Km , or any other pair of orthonormal bases. The requested equality is equivalent to the equality φ† = ψ independently of the choice of the bases. We choose a couple of orthogonal bases from the singular decomposition of A. Then the mapping φ has the matrix S from the definition of the pseudoinverse A† , so we write directly A = ( D 0 0 0 ) , A† = ( D−1 0 0 0 ) , with the diagonal matrix D consisting of all non-zero singular values. We write again B for the matrix of ψ in these bases. Clearly B and A satisfy the assumptions of the claim (3). Thus A† A = ( E 0 0 0 ) , ABA = A 251 A = LU =    1 0 0 0 −1 1 0 0 2 −4 1 0 −3 6 −14/8 1         3 −6 2 −2 0 −2 3 −2 0 0 8 −8 0 0 0 4    . For example, use Sage via the block A=matrix(RDF,[[3, -6, 2, -2],[-3, 4, 1, 0], [6, -4, 0, -4], [-9, 6, -2, 12]]) L=matrix(RDF,[[1, 0, 0, 0], [-1, 1, 0, 0], [2, -4, 1, 0], [-3, 6, -14/8, 1]]) U=matrix(RDF,[[3, -6, 2, -2], [0, -2, 3, -2], [0, 0, 8, -8], [0, 0, 0, 4]]) L*U == A Sage prints out True. Now we solve the equation Ly = b, which is given by     1 0 0 0 −1 1 0 0 2 −4 1 0 −3 6 −14/8 1         y1 y2 y3 y4     =     −7 4 6 −3     . From this we obtain y1 = −7 , y2 = 4 + y1 = −3 , y3 = 6 − 2y1 + 4y2 = 8 , y4 = −3 + 3y1 − 6y2 + 14 8 y3 = 8 . Next we solve Ux = y, which is given by     3 −6 2 −2 0 −2 3 −2 0 0 8 −8 0 0 0 4         x1 x2 x3 x4     =     −7 −3 8 8     . By back substitution we get the desired solution: x4 = 8 4 = 2 , x3 = 8 + 8x4 8 = 3 , x2 = −3 − 3x3 + 2x4 −2 = −8 −2 = 4 , x1 = −7 + 6x2 − 2x3 + 2x4 3 = 15 3 = 5 . □ Given a n × n matrix A with a LU-decomposition A = LU, the efficiency of solving the linear equation Ax = b can be analyzed by counting the number of multiplications and divisions re- quired.17 As we saw before, to solve Ax = b using LU-factorization, the process involves solving Ly = b and Ux = y. It can be proved that approximately n2 multiplications and divisions are needed for these steps. 
For instance, in the case of a 4 × 4 matrix A as in 3.E.3, you can count exactly 16 multiplications and divisions, 6 for solving the equation Ly = b and 10 for solving the equation Ux = y. 17Additions are typically ignored in this analysis due to their lower computational “cost”. CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS and we obtain A† = A† ABAA† = ( E 0 0 0 ) B ( E 0 0 0 ) = ( D−1 0 0 0 ) . Consequently B = ( D−1 P Q R ) for suitable matrices P, Q and R. Next, BA = ( D−1 P Q R ) ( D 0 0 0 ) = ( E 0 QD 0 ) is Hermitian. Thus QD = 0 which implies Q = 0 (the matrix D is diagonal and invertible). Analogously, the assumption that AB is Hermitian implies that P is zero. Finally, we com- pute B = BAB = ( D−1 0 0 R ) ( D 0 0 0 ) ( D−1 0 0 R ) . On the right side in the right-lower corner there is zero, and thus also R = 0 and the claim is proved. (4): Consider the mapping φ : Kn → Km , x → Ax, and direct sums Kn = (Ker φ)⊥ ⊕Ker φ, Km = Im φ⊕(Im φ)⊥ of the orthogonal complements. The restricted mapping ˜φ := φ|(Ker φ)⊥ : (Ker φ)⊥ → Im φ is a linear isomorphism. If we choose suitable orthonormal bases on (Ker φ)⊥ and Im φ and extend them to orthonormal bases on whole spaces, the mapping φ will have matrix S and ˜φ the matrix D from the theorem about the singular decomposition. In the next section, we shall discuss in detail that for any given b ∈ Km , there is the unique vector which minimizes the distance ∥b − z∥ among all z ∈ Im φ (in analytic geometry we shall say that the point z realises the distance of b from the affine subspace Im φ), see 4.1.13). The properties of the norm proved in theorem 3.4.3 directly imply that this is exactly the component z = b1 of the decomposition b = b1 +b2, b1 ∈ Im φ, b2 ∈ (Im φ)⊥ . Now, in our choice of bases, the mapping φ† is given by the matrix S† from the singular decomposition theorem. In particular, φ† (Im φ) = (Ker φ)⊥ , D−1 is the matrix of the restriction φ† | Im φ, and φ† |(Im φ)⊥ is zero. Indeed, φ ◦ φ† (b) = φ(φ† (z)) = z and the proof is finished. (5) Evidently, the equality Ax = b, with x ∈ Kn fixed, implies b = Ax = AA† Ax = AA† b. Thus the condition is necessary. On the other hand, if this condition holds, then the choice x = A† b + (E − A† A)u as in (5) implies Ax = A(A† b + (E − A† A)u) = b + (A − AA† A)u = b. The rank of the matrix E − A† A gives the correct size of the image of the corresponding mapping according to the Kronecker-Capelli theorem (cf. 2.3.5) about the solution of the system of linear equations, and thus we obtain all solutions in this way. □ 252 On the other hand, performing row reduction directly on the augmented matrix ( A b ) to transform it to ( U y ) requires about n3 /3 arithmetic operations (as multiplications and divisions), which is significantly more time-consuming. Even worse, computing the inverse of A demands about n3 operations, and multiplying b by A−1 to obtain x = A−1 b requires another n2 operations. 3.E.4. How are determinants of large matrices computed? Typically, determinants of large matrices are not calculated using cofactor expansion, but rather through matrix decomposition. For instance, with an LU-decomposition of a squared matrix A, the determinant of A satisfies the relation det(A) = det(L) det(U). Since both L, U are triangular matrices, the previous relation implies that det(A) equals the product of the elements on the diagonals of L and U. 3.E.5. Consider the LU-decomposition of the matrix A described in 3.E.1. 
i) Express the lower-triangular matrix L in terms of elementary matrices and compute its inverse. ii) Compute the inverse of the matrix U using Gauss elimination. iii) Calculate both the determinant and the inverse matrix of A, using its LU-decomposition as derived above. ⃝

Let us now focus on the singular value decomposition, which is possibly the most important matrix decomposition, with a wide range of remarkable applications. According to the singular value decomposition (in short SVD), given a matrix A ∈ Matm,n(F), where F = R or F = C, we can find unitary matrices U ∈ Matm(F) and V ∈ Matn(F), and a diagonal matrix Σ ∈ Matm,n(F) with non-negative entries, such that A = UΣV∗, see 3.5.3. In such a decomposition the diagonal matrix Σ consists of the singular values of A, which are the non-negative square roots of the eigenvalues of A∗A (or equivalently of AA∗). They are typically denoted by σ1, . . . , σ_{min{m,n}}, and we order them so that σ1 ≥ σ2 ≥ . . . ≥ σ_{min{m,n}}. In fact, there are exactly rank(A) non-zero singular values of A, as rank(UΣV∗) = rank(Σ), and the rank of a diagonal matrix coincides with the number of non-zero diagonal elements. On the other hand, the columns of V are (normalized) eigenvectors of A∗A, while the columns of U are (normalized) eigenvectors of AA∗.

Remark. Notice that the last computation in the proof verifies that (E − A†A) is the matrix of the projection of R^n onto the subspace of all solutions of the homogeneous system Ax = 0. It can also be shown that the matrix A† minimizes the square of the norm of the expression AA† − E, that is, the sum of squares of all elements of the given matrix. The claim (4) of the theorem can also be interpreted as follows: AA† is the matrix of the orthogonal projection from the vector space R^m onto the subspace generated by the columns of the matrix A (m is the number of rows of the matrix A). This interpretation is especially meaningful for matrices with more rows than columns. Moreover, for matrices A whose columns are independent vectors, the expression (A^T A)^{−1} A^T makes sense, and it is not hard to verify that this matrix satisfies all the properties from (1) and (2) of the previous theorem. Thus it is the pseudoinverse A† of the matrix A.

3.5.8. Linear regression. The approximation property (4) from the previous theorem is very useful in the cases where we are to find as good an approximation as possible of the (non-existent) solution of a given system Ax = b, where A is a real matrix of the type m/n and m > n. For instance, an experiment gives many measured real values bj, j = 1, . . . , m, and we want to find a linear combination of only a few fixed functions fi, i = 1, . . . , n, which approximates the values bj as well as possible. The actual values of the fixed functions at the relevant points yj ∈ R define the matrix A with entries aji = fi(yj), whose columns are given by the values of the individual functions fi at the considered points. The goal is to determine the coefficients xi ∈ R so that the sum of the squares of the deviations from the actual values,
∑_{j=1}^{m} ( bj − ∑_{i=1}^{n} xi fi(yj) )² = ∑_{j=1}^{m} ( bj − ∑_{i=1}^{n} aji xi )²,
is minimized. By the previous theorem, the optimal coefficients are A†b.

As an example, consider just three functions f0(y) = 1, f1(y) = y, f2(y) = y². Assume that the "measured values" of their unknown combination g(y) = x0 + x1y + x2y² at the integral values of y between 1 and 5 are b^T = (1, 12, 6, 27, 33).
(This vector b arose by computing the values 1+y +y2 at the given points adjusted by random integral values in the range ±10.) This leads in our case to the matrix A = (bji) AT =   1 1 1 1 1 1 2 3 4 5 1 4 9 16 25   . 253 3.E.6. How many (non-zero) singular values has the matrix A = ( 4 0 3 5 ) ? Describe them explicitly. ⃝ 3.E.7. For the matrix A given below, derive the matrix Σ, as described above: A =   0 0 −1 2 −1 0 0 0 0 0   . ⃝ There is a simple way to perform the SVDdecomposition, which is described as follows: First compute the matrices Σ and V as suggested above. To compute the unitary matrix U, which forms the last piece of the SVD-decomposition, we rely on the relation AV = UΣ. To simplify the description, suppose that rank(A) = r and express V, U in terms of columns, as V = ( v1 v2 · · · vn ) and U = ( u1 u2 · · · um ) respectively. Then we have AV = ( Av1 Av2 · · · Avn ) = ( σ1u1 · · · σrur 0 · · · 0 ) = UΣ . Hence we will have uk = 1 σk Avk, for all k = 1, . . . , r. To determine the remaining m − r columns of U it suffices to extend the set {u1, . . . , ur} to an orthonormal basis of Fm , which usually can be done via the Gram–Schmidt procedure. 3.E.8. Describe the singular value decompositions of the matrices given in 3.E.6 and 3.E.7, respectively. Solution. (a) Let us first consider the 2 × 2 matrix A, where Σ = diag(2 √ 10, √ 10). Let us derive the eigenvectors corresponding to the eigenvalues λ1 = 40 and λ2 = 10 of AT A, to construct the matrix V . The solution space of the linear system induced by the matrix equation (AT A − 40E)v has the form {(x1, x2)T = (t, t)T : t ∈ R}. Hence, eigenvectors of AT A corresponding to λ1 = 40 are scalar multiples of the vector (1, 1)T . This vector has norm √ 2 and we obtain the normalized eigenvector v1 = (√ 2 2 , √ 2 2 )T . For the second eigenvalue we see that the linear system induced by the matrix equation (AT A − 10E)v has solution space {(x1, x2)T = (−s, s)T : s ∈ R}. Consequently, eigenvectors of AT A corresponding to λ2 are scalar multiples of (−1, 1)T . The norm of this vector is also √ 2, and we obtain the normalized eigenvector v2 = ( − √ 2 2 , √ 2 2 )T . Therefore, V = ( v1 v2 ) = (√ 2 2 − √ 2 2√ 2 2 √ 2 2 ) , CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS The requested optimal coefficients for the combination are x = A† · b =    9 5 0 −4 5 −3 5 3 5 −37 35 23 70 6 7 37 70 −23 35 1 7 − 1 14 −1 7 − 1 14 1 7    ·       1 12 6 27 33       ≃   0.600 0.614 1.214   . The resulting approximation can be seen in the picture, where the given values b are shown by the diamonds, while the dashed curve stays for the resulting approximation g(y) = x1 + x2y + x3y2 . The computation was produced in Maple and taking 15 values yi = 1 + i + i2 , with a random vector of deviations from the same range added produced the following picture: 254 and it is easy to verify that V is unitary. Next, for the matrix U = ( u1 u2 ) we have u1 = 1 σ1 Av1 = 1 2 √ 10 ( 4 0 3 5 ) (√ 2 2√ 2 2 ) = ( √ 5 5 2 √ 5 5 ) , u2 = 1 σ2 Av2 = 1 √ 10 ( 4 0 3 5 ) ( − √ 2 2√ 2 2 ) = ( −2 √ 5 5√ 5 5 ) . A direct computations show that u1, u2 are orthonormal, and hence the matrix U is given by U = ( u1 u2 ) = ( √ 5 5 −2 √ 5 5 2 √ 5 5 √ 5 5 ) , such that A = UΣV T . 
For a quick confirmation, you may use Sage:

A=matrix([[4, 0], [3, 5]])
show("\nThe matrix A is given by A=", A)
S=diagonal_matrix([2*sqrt(10), sqrt(10)])
show("\nThe matrix S is given by S=", S)
V=matrix([[sqrt(2)/2, -sqrt(2)/2], [sqrt(2)/2, sqrt(2)/2]])
show("\nThe matrix V is given by V=", V)
print(V.is_unitary())
U=matrix([[sqrt(5)/5, -2*sqrt(5)/5], [2*sqrt(5)/5, sqrt(5)/5]])
show("\nThe matrix U is given by U=", U)
print(U.is_unitary())
print(A*V==U*S)

(b) In this case we have seen that Σ = diag(1, 1/2, 0). For the eigenvalues λ1 = 1, λ2 = 1/4 and λ3 = 0 of A^T A we compute the normalized eigenvectors v1 = (1, 0, 0)^T, v2 = (0, 0, 1)^T and v3 = (0, 1, 0)^T. This can be confirmed in Sage by the following cell:

A=matrix([[0, 0, -1/2], [-1, 0, 0], [0, 0, 0]])
AT=A.T; a= AT*A
show(a.eigenvalues())
S=diagonal_matrix([1, 1/2, 0])
show("\nThe matrix S is given by S=", S)
aein=a.eigenvectors_right()
show(aein)

which also prints the expression of the matrix Σ (denoted inside the cell by S). Thus V = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}. Next, for the matrix U = ( u1 u2 u3 ) we can compute the first two columns by the rule uk = (1/σk) A vk. This gives
u1 = (1/σ1) A v1 = (1/1) \begin{pmatrix} 0 & 0 & -1/2 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ -1 \\ 0 \end{pmatrix},
u2 = (1/σ2) A v2 = (1/(1/2)) \begin{pmatrix} 0 & 0 & -1/2 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} -1 \\ 0 \\ 0 \end{pmatrix}.

It remains to extend {u1, u2} to an orthonormal basis of R³. Thus, we need to find a unit vector which is orthogonal to both u1, u2. Such a vector is u3 = (0, 0, −1)^T, which is in fact the cross product of u1, u2 in R³:
u1 × u2 = \begin{vmatrix} ⃗i & ⃗j & ⃗k \\ 0 & -1 & 0 \\ -1 & 0 & 0 \end{vmatrix} = 0 ·⃗i − 0 ·⃗j + (−1) · ⃗k.
This gives the matrix U = \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & -1 \end{pmatrix}, such that A = UΣV^T, i.e.,
A = \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1/2 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}.
If you would like to verify this computation using Sage, you can continue by typing the following code in the previous block:

V=matrix([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
show("\nThe matrix V is given by V=", V)
print(V.is_unitary())
U=matrix([[0, -1, 0], [-1, 0, 0], [0, 0, -1]])
show("\nThe matrix U is given by U=", U)
print(U.is_unitary())
A==U*S*(V.T)

In this way Sage confirms that both the matrices U and V are unitary, and moreover the relation A = UΣV^T. Note that Sage offers a built-in method to compute the singular value decomposition of a matrix, but it is only available for matrices over the numerical systems CDF or RDF. For our matrix A, for example, one can type

A=matrix(RDF, [[0, 0, -1/2], [-1, 0, 0], [0, 0, 0]])
show(A.SVD())

Note that Sage's output differs from our result only in terms of signs and column permutations. This variation is expected, since the singular vectors (the columns of U and V) are determined only up to such choices. Hence, be aware that different software implementations or methods might produce different but equally valid SVD results. □

Using the components of the SVD-decomposition of a matrix A we can construct the polar decomposition, which is another significant matrix factorization, see 3.5.5. This describes a complex matrix A as a product of a unitary matrix Up and a positive semi-definite matrix P (recall that P being positive semi-definite means that its eigenvalues are non-negative). If A is real, then Up is an orthogonal matrix. If A = UΣV∗ is the SVD-decomposition of A, then the corresponding polar decomposition A = PUp is obtained easily by setting Up = UV∗ and P = UΣU∗, respectively.
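This recipe is easy to try numerically. Here is a minimal sketch over RDF, applied to the matrix from 3.E.6 (so as not to spoil 3.E.9 below); since RDF works with floating point numbers, equality is checked only up to rounding:

A = matrix(RDF, [[4, 0], [3, 5]])
U, S, V = A.SVD()            # over RDF: A == U*S*V.transpose(), up to rounding
P = U*S*U.transpose()        # the positive semi-definite factor
Up = U*V.transpose()         # the orthogonal factor
print((P*Up - A).norm() < 1e-12)   # True: A = P*Up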
3.E.9. For the matrix A given in 3.E.7 compute its polar decomposition A = PUp and use Sage to verify your result. Is this decomposition unique? Next show that the matrix P is Hermitian and positive semi-definite. ⃝

Let A = UΣV∗ be the SVD-decomposition of an m × n matrix A of rank r, where Σ = diag(σ1, . . . , σr, 0, . . . , 0) is the m × n diagonal matrix with the singular values of A. The pseudo-inverse of A (also known as the Moore-Penrose inverse) is defined by A† = V Σ† U∗, where Σ† = diag(1/σ1, . . . , 1/σr, 0, . . . , 0). This matrix functions as a generalized inverse for matrices that are singular or non-square. As such, it extends the concept of an inverse matrix in several ways, see 3.5.7. Notably, while not every matrix has an inverse, every matrix, including non-square ones, has a pseudo-inverse.

3.E.10. For the matrices A given in 3.E.6 and 3.E.7 compute the corresponding pseudo-inverses.

Solution. (a) For the matrix A presented in 3.E.6 we obtain Σ† = \begin{pmatrix} 1/(2√10) & 0 \\ 0 & 1/√10 \end{pmatrix}. Thus, the explicit form of the pseudo-inverse A† = V Σ† U^T of the matrix A is given by
A† = \begin{pmatrix} √2/2 & -√2/2 \\ √2/2 & √2/2 \end{pmatrix} \begin{pmatrix} 1/(2√10) & 0 \\ 0 & 1/√10 \end{pmatrix} \begin{pmatrix} √5/5 & 2√5/5 \\ -2√5/5 & √5/5 \end{pmatrix} = \begin{pmatrix} 1/4 & 0 \\ -3/20 & 1/5 \end{pmatrix}.
To confirm our result we need to verify the identity AA†A = A, which is direct:
AA†A = \begin{pmatrix} 4 & 0 \\ 3 & 5 \end{pmatrix} \begin{pmatrix} 1/4 & 0 \\ -3/20 & 1/5 \end{pmatrix} \begin{pmatrix} 4 & 0 \\ 3 & 5 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 4 & 0 \\ 3 & 5 \end{pmatrix} = A.
In fact, observe that det(A) = 20 ≠ 0. Hence A is nonsingular and has a unique inverse, which should coincide with A†. This provides another confirmation:
A^{−1} = (1/det(A)) \begin{pmatrix} 5 & 0 \\ -3 & 4 \end{pmatrix} = \begin{pmatrix} 1/4 & 0 \\ -3/20 & 1/5 \end{pmatrix} = A†.

(b) For the matrix A presented in 3.E.7 we obtain Σ† = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}, and hence the corresponding pseudo-inverse has the form
A† = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & -1 \end{pmatrix} = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & 0 \\ -2 & 0 & 0 \end{pmatrix}.
Note that det(A) = 0, hence this matrix does not have an inverse. To confirm the relation AA†A = A (or A†AA† = A†), you can use Sage as follows:

A=matrix([[0, 0, -1/2], [-1, 0, 0], [0, 0, 0]])
Aps=matrix([[0, -1, 0], [0, 0, 0], [-2, 0, 0]])
print(A*Aps*A==A)

Or you can use the following routine, which relies on Scipy (or Numpy):

def pseudo_inverse(mat):
    from scipy import linalg
    return matrix(linalg.pinv(mat))
Ap=pseudo_inverse(A)
show(Ap)

Note that one should add this cell to the previous block. □

Let us conclude our description of matrix factorizations with the QR-decomposition, which expresses a complex matrix as a product of a unitary matrix Q and an upper-triangular matrix R, see 3.5.6. If, for example, A ∈ Matm,n(C) comes with n linearly independent columns, then using the Gram-Schmidt orthogonalization we can obtain the factorization A = QR, where Q is of type m × n with orthonormal columns, and R is an invertible upper triangular matrix with positive diagonal entries. Finding the QR-decomposition only requires simple algebraic operations, and it can be useful to summarize the steps for a matrix A ∈ Matm,n(C), as above:
Step 1: Via the Gram-Schmidt procedure find an orthogonal basis of the column space C(A) of A.
Step 2: Normalize the vectors of this basis to get the columns of Q.
Step 3: As the columns of Q are orthonormal with respect to the dot product, we have Q^T Q = E, where E is the n × n identity matrix. Hence, the upper triangular matrix R is given by the formula R = Q^T A.
Alternatively, if u1, . . . , un are the linearly independent columns of A, and
q1, . . . , qn are the orthonormal vectors obtained by the Gram-Schmidt procedure, then the entries rij of R are given by rij = ⟨uj, qi⟩ for any 1 ≤ i, j ≤ n, so that rij = 0 for all i > j. Note that both Q and R are unique.

3.E.11. Present the QR-decomposition of the matrix A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}.

Solution. The columns of A are the vectors E1, E2, E3 presented in 2.C.50. Recall that there we constructed the orthonormal basis {ŵi : i = 1, 2, 3} of the 3-dimensional subspace W = span_R{E1, E2, E3}. Obviously, this coincides with the column space of A, i.e., W = C(A). Thus the 4 × 3 matrix Q is given by Q = ( ŵ1 ŵ2 ŵ3 ), or in other words
Q = \begin{pmatrix} 1/2 & √3/6 & √6/6 \\ 1/2 & √3/6 & √6/6 \\ 1/2 & √3/6 & -√6/3 \\ 1/2 & -√3/2 & 0 \end{pmatrix}.
We can verify that Q^T Q = E in Sage, but again you should be careful in the following sense. If we use the number system QQbar to define the matrix Q, then the cell below gives True, which is the correct result.

Q=matrix(QQbar, [[1/2, sqrt(3)/6, sqrt(6)/6], [1/2, sqrt(3)/6, sqrt(6)/6], [1/2, sqrt(3)/6, -sqrt(6)/3], [1/2, -sqrt(3)/2, 0]])
Q.transpose()*Q == matrix.identity(3)

However, if we configure the matrix Q using one of the number systems RR, CC, or CDF, Sage will return False. Can you explain the possible reasons why this might occur? Next, the matrix R arises as the product Q^T A, which gives
R = \begin{pmatrix} 2 & 3/2 & 1 \\ 0 & √3/2 & √3/3 \\ 0 & 0 & √6/3 \end{pmatrix}.
Note that det(R) = √2 ≠ 0, hence R is invertible, as it should be. Sage provides a built-in method to obtain the QR-decomposition of a matrix A, based on the command A.QR(). Hence, you can quickly verify the result, just by typing

A=matrix(QQbar, [[1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 0, 0]])
Q, R = A.QR()

In case you would like to confirm the expression of R in a "manual" way, you may use the following cell:

Q=matrix(QQbar, [[1/2, sqrt(3)/6, sqrt(6)/6], [1/2, sqrt(3)/6, sqrt(6)/6], [1/2, sqrt(3)/6, -sqrt(6)/3], [1/2, -sqrt(3)/2, 0]])
A = matrix(RR, [[1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 0, 0]])
R=(Q.T)*A; R

□

The QR-decomposition of a matrix A is very useful when a system of linear equations Ax = b which has no solution is given, but an approximation as good as possible is needed. In other words, the main point is to minimize ∥Ax − b∥. According to the Pythagorean theorem, one has
∥Ax − b∥² = ∥Ax − b_∥∥² + ∥b_⊥∥²,
where b is decomposed as b = b_∥ + b_⊥. The first component belongs to the range of the linear transformation A, and the second one is perpendicular to this range. Now, the projection onto the range of A can be written in the form QQ^T for a suitable matrix Q, which one obtains by the Gram-Schmidt orthonormalisation of the columns of A. Then it follows that
Ax − b_∥ = Q(Q^T Ax − Q^T b).
The system in the parentheses has a solution, for which ∥Ax − b∥ = ∥b_⊥∥, and this is the minimal value. Furthermore, the matrix R := Q^T A is upper triangular, therefore the approximate solution can be easily obtained.

3.E.12. Find an approximate solution of the following system of equations: {x1 + 2x2 = 1, 2x1 + 4x2 = 4}.

Solution. We may express the given system in the matrix form Ax = b, where A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}, b = \begin{pmatrix} 1 \\ 4 \end{pmatrix}. Evidently, this system has no solution. Let us orthonormalize the columns of A: Take the first of them and divide it by its norm. This yields the first vector of the orthonormal basis, (1/√5)(1, 2)^T. The second column vector of A is twice the first, and thus we deduce that Q = (1/√5) \begin{pmatrix} 1 \\ 2 \end{pmatrix}. The projector onto the range of A is then given by QQ^T = (1/5) \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}. Next, we see that
Q^T b = (1/√5) ( 1  2 ) \begin{pmatrix} 1 \\ 4 \end{pmatrix} = 9/√5 and R = Q^T A = (1/√5) ( 1  2 ) \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = (1/√5) ( 5  10 ).
Thus, the approximate solution should satisfy the relation Rx = Q^T b, which implies that 5x1 + 10x2 = 9. Note that the approximate solution is not unique (in contrast to the QR-decomposition). In particular, the QR-decomposition of our matrix A has the form
\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = (1/√5) \begin{pmatrix} 1 \\ 2 \end{pmatrix} · (1/√5) ( 5  10 ). □
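This least-squares answer can be cross-checked with the pseudo-inverse introduced above: among all solutions of 5x1 + 10x2 = 9, the vector A†b is the one of minimal norm. A minimal Sage sketch:

A = matrix(QQ, [[1, 2], [2, 4]])
b = vector(QQ, [1, 4])
x = A.pseudoinverse()*b   # minimal-norm least-squares solution
print(x)                  # (9/25, 18/25)
print(5*x[0] + 10*x[1])   # 9, so x indeed satisfies Rx = Q^T*b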
3.E.12. Find an approximate solution of the following system of equations:
$$x_1 + 2x_2 = 1, \qquad 2x_1 + 4x_2 = 4.$$

Solution. We may express the given system in the matrix form $Ax = b$, where
$$A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 4 \end{pmatrix}.$$
Evidently, this system has no solution. Let us orthonormalize the columns of $A$: take the first of them and divide it by its norm. This yields the first vector of the orthonormal basis, $\frac{1}{\sqrt{5}}(1, 2)^T$. The second column vector of $A$ is twice the first, and thus we deduce that $Q = \frac{1}{\sqrt{5}}\begin{pmatrix} 1 \\ 2 \end{pmatrix}$. The projector onto the range of $A$ is then given by
$$QQ^T = \frac{1}{5}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}.$$
Next, we see that
$$Q^T b = \frac{1}{\sqrt{5}}\begin{pmatrix} 1 & 2 \end{pmatrix}\begin{pmatrix} 1 \\ 4 \end{pmatrix} = \frac{9}{\sqrt{5}} \quad\text{and}\quad R = Q^T A = \frac{1}{\sqrt{5}}\begin{pmatrix} 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = \frac{1}{\sqrt{5}}\begin{pmatrix} 5 & 10 \end{pmatrix}.$$
Thus, the approximate solution should satisfy the relation $Rx = Q^T b$, which implies that $5x_1 + 10x_2 = 9$. Note that the approximate solution is not unique (unlike the QR-decomposition). In particular, the QR-decomposition of our matrix $A$ has the form
$$\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = \frac{1}{\sqrt{5}}\begin{pmatrix} 1 \\ 2 \end{pmatrix} \cdot \frac{1}{\sqrt{5}}\begin{pmatrix} 5 & 10 \end{pmatrix}. \quad \square$$

3.E.13. Minimise $\|Ax - b\|$ for
$$A = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}.$$
Write down also the QR-decomposition of the matrix $A$.

Solution. The normalised first column of the matrix $A$ is $u_1 = \frac{1}{\sqrt{6}}(2, -1, -1)^T$. From the second column, subtract its component in the direction $u_1$: set $u = (-1, 2, -1)^T$. Then we see that $\langle u, u_1 \rangle = -\frac{3}{\sqrt{6}}$. Thus it follows that
$$u - \langle u, u_1 \rangle u_1 = \frac{1}{2}\begin{pmatrix} 0 \\ 3 \\ -3 \end{pmatrix}.$$
Hence, we have created an orthogonal vector, which we may normalise to obtain the vector $u_2 = \frac{1}{\sqrt{2}}(0, 1, -1)^T$. The third column of the matrix $A$ is already linearly dependent on the first two, and thus the matrix $Q$ with orthonormal columns has the form
$$Q = \frac{1}{\sqrt{6}}\begin{pmatrix} 2 & 0 \\ -1 & \sqrt{3} \\ -1 & -\sqrt{3} \end{pmatrix}.$$
Consequently, we obtain
$$R = Q^T A = \frac{1}{\sqrt{6}}\begin{pmatrix} 2 & -1 & -1 \\ 0 & \sqrt{3} & -\sqrt{3} \end{pmatrix}\begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix} = \frac{1}{\sqrt{6}}\begin{pmatrix} 6 & -3 & -3 \\ 0 & 3\sqrt{3} & -3\sqrt{3} \end{pmatrix}$$
and
$$Q^T b = \frac{1}{\sqrt{6}}\begin{pmatrix} 2 & -1 & -1 \\ 0 & \sqrt{3} & -\sqrt{3} \end{pmatrix}\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \frac{1}{\sqrt{6}}\begin{pmatrix} 2 \\ 0 \end{pmatrix}.$$
Writing $x = (x, y, z)^T$, the solutions of the equation $Rx = Q^T b$ satisfy $y = z$ and $x = y + \frac13$; that is, the vectors $(\frac13, 0, 0)^T + t(1, 1, 1)^T$, $t \in \mathbb{R}$, minimize $\|Ax - b\|$. Note that the mapping given by the matrix $A$ is, up to the factor 3, the projection onto the plane with normal vector $(1, 1, 1)^T$. □

When dealing with an overdetermined system of linear equations (more equations than unknowns), the system may not have an exact solution. In such cases, the pseudo-inverse provides the best approximate solution in the least squares sense. The pseudo-inverse is also used in linear regression for finding the optimal parameters. Next we analyze how to solve problems of linear regression. Such tasks are about finding the best approximation of some functional dependence by a linear function. In particular, given the values of the dependence at some points, that is,
$$f(a^1_1, \dots, a^1_n) = y_1, \ \dots,\ f(a^k_1, a^k_2, \dots, a^k_n) = y_k,$$
with $k > n$, we wish to find the “best possible” approximation of this dependence by a linear function $f(x_1, \dots, x_n) = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + c$. Observe that since $k > n$, one has more equations than unknowns. We choose to define “best possible” by the minimisation of
$$\sum_{i=1}^{k}\Big(y_i - \sum_{j=1}^{n} b_j a^i_j - c\Big)^2$$
with respect to the real constants $b_1, \dots, b_n, c$. The goal is to find such a linear combination of the columns of the matrix $A = (a^i_j)$ (with coefficients $b_1, \dots, b_n$) that is closest to the vector $(y_1, \dots, y_k)^T$ in $\mathbb{R}^k$. This means that the whole procedure is about specifying an orthogonal projection of the vector $(y_1, \dots, y_k)^T$ onto the subspace generated by the columns of $A$. Using the theorem presented in 3.5.7, it follows that this projection is described by the vector $(b_1, \dots, b_n)^T = A^\dagger (y_1, \dots, y_k)^T$, where $A^\dagger$ is the pseudo-inverse of $A$.
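Here is a small illustration of this projection formula on a made-up data set (the points and values below are our own example, not from the text), reusing the Scipy-based routine from 3.E.10. The column of ones in $A$ picks up the constant term $c$:
def pseudo_inverse(mat):
    from scipy import linalg
    return matrix(linalg.pinv(mat))

A = matrix(RDF, [[1, 0], [1, 1], [1, 2], [1, 3]])
y = vector(RDF, [0.1, 0.9, 2.1, 2.9])
b = pseudo_inverse(A) * y
print(b)       # (c, b_1): intercept and slope of the best-fitting line
print(A * b)   # the orthogonal projection of y onto the column space C(A)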
3.E.14. Linear regression. Using the least squares method, solve the linear system given below:
$$2x + y + 2z = 1, \quad x + y + 3z = 2, \quad 2x + y + z = 0, \quad x + z = -1.$$

Solution. The system has no solution, since its matrix has rank 3, while the extended matrix has rank 4. According to the theorem in 3.5.7, the best approximation of the vector $b = (1, 2, 0, -1)^T$ is obtained by the vector $A^\dagger b$. In particular, $AA^\dagger b$ is then the best approximation – the perpendicular projection of the vector $b$ onto the column space $C(A)$ of $A$. Because the columns of the matrix $A$ are linearly independent, its pseudo-inverse is given by the relation $A^\dagger = (A^T A)^{-1} A^T$. We compute
$$A^\dagger = \left[\begin{pmatrix} 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 0 \\ 2 & 3 & 1 & 1 \end{pmatrix}\begin{pmatrix} 2 & 1 & 2 \\ 1 & 1 & 3 \\ 2 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix}\right]^{-1}\begin{pmatrix} 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 0 \\ 2 & 3 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 10 & 5 & 10 \\ 5 & 3 & 6 \\ 10 & 6 & 15 \end{pmatrix}^{-1}\begin{pmatrix} 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 0 \\ 2 & 3 & 1 & 1 \end{pmatrix}$$
$$= \begin{pmatrix} 3/5 & -1 & 0 \\ -1 & 10/3 & -2/3 \\ 0 & -2/3 & 1/3 \end{pmatrix}\begin{pmatrix} 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 0 \\ 2 & 3 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 1/5 & -2/5 & 1/5 & 3/5 \\ 0 & 1/3 & 2/3 & -5/3 \\ 0 & 1/3 & -1/3 & 1/3 \end{pmatrix},$$
and now we can compute the desired $x$, which is given by $x = A^\dagger b = (-6/5, 7/3, 1/3)^T$. Consequently, the best possible approximation of the right-hand side is the vector $(3/5, 32/15, 4/15, -13/15)^T$. □

F. Additional exercises for the whole chapter

As usual, we proceed with supplementary exercises related to the notions that we have discussed so far.

A) Material on linear programming

3.F.1. Manufacturing bolts and nuts. A company manufactures bolts and nuts. Moulding a box of bolts takes one minute, while a box of nuts is moulded for 2 minutes. Preparing the box itself takes one minute for bolts and 4 minutes for nuts. The company has at its disposal two hours for moulding and three hours for box preparation. Demand dictates that it is necessary to manufacture at least 90 boxes of bolts more than boxes of nuts. Due to technical reasons, it is not possible to manufacture more than 110 boxes of bolts. The profit from one box of bolts is $4 and the profit from one box of nuts is $6. The company has no trouble with selling. Write down the corresponding LP problem and present its standard form. Deduce graphically how many boxes of nuts and bolts should be manufactured in order to have maximal profit.

Solution. For convenience, let us put the given data into a table:

            Bolts (1 box)   Nuts (1 box)   Capacity
Mould       1 min./box      2 min./box     120 min.
Box         1 min./box      4 min./box     180 min.
Profit      $4/box          $6/box

Denote by $x_1$ the number of manufactured boxes of bolts and by $x_2$ the number of manufactured boxes of nuts. From the restriction on moulding time and from the restriction on the box preparation we obtain the following restrictive conditions:
$$x_1 + 2x_2 \leq 120, \quad x_1 + 4x_2 \leq 180, \quad x_1 \geq x_2 + 90, \quad x_1 \leq 110.$$
The standard form reads as the maximization of the profit function $h(x_1, x_2) = 4x_1 + 6x_2$ subject to the conditions
$$x_1 + 2x_2 \leq 120, \quad x_1 + 4x_2 \leq 180, \quad x_2 - x_1 \leq -90, \quad x_1 \leq 110,$$
with $x_1 \geq 0$, $x_2 \geq 0$. The feasible region is the grey region in the figure below, where we have also included the objective lines $4x_1 + 6x_2 = k$. From this figure we deduce that the point $P = (110, 5)$ maximizes $h$, and the maximum possible income is thus $4 \cdot 110 + 6 \cdot 5 = \$470$. □
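The graphical answer can also be cross-checked numerically. The following is a sketch using the same Sage LP interface that appears in 3.F.2 below:
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective(4*x1 + 6*x2)
p.add_constraint(x1 + 2*x2 <= 120)
p.add_constraint(x1 + 4*x2 <= 180)
p.add_constraint(x2 - x1 <= -90)
p.add_constraint(x1 <= 110)
print(p.solve())                 # 470.0
print(p.get_values(x1, x2))      # (110.0, 5.0)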
3.F.2. Investments and profits. An insurance company has a capital of €100,000 which it aims to invest in two different ways: investments of type X and investments of type Y. These types of investments give an annual income of 10% and 15%, respectively. However, there is a limitation by the government, which requires that at least 25% of the capital should be invested in the X-type investment. On the other hand, the policy of the company requires that the ratio between the capital used for the Y-type investment and the capital used for the X-type investment not be greater than 1.5. How should the company invest its capital? Articulate the problem as an LP problem, and find the solution via the simplex method. Verify your solution using Sage.

Solution. Let $x_1, x_2$ be the decision variables corresponding to the capital (in euros) that the company will use for the X-type and the Y-type investment, respectively. The objective function is given by
$$h = 0.1x_1 + 0.15x_2 = \frac{1}{10}x_1 + \frac{3}{20}x_2 = \frac{2}{20}x_1 + \frac{3}{20}x_2.$$
The corresponding LP problem is the maximization of $h$ under the constraints
$$x_1 + x_2 \leq 100000, \quad \frac{x_1}{x_1 + x_2} \geq 0.25, \quad \frac{x_2}{x_1} \leq 1.5, \quad x_1 \geq 0, \quad x_2 \geq 0,$$
which we can equivalently write as
$$x_1 + x_2 \leq 100000, \quad -\tfrac34 x_1 + \tfrac14 x_2 \leq 0, \quad -\tfrac32 x_1 + x_2 \leq 0, \quad x_1 \geq 0, \quad x_2 \geq 0.$$
Let us introduce the slack variables $y_1, y_2, y_3$, with $y_i \geq 0$ for $i = 1, 2, 3$. The application of the simplex method yields four tableaux, which we present as follows:

         x1      x2     y1   y2   y3  |  rhs
 −h | −2/20  −3/20    0    0    0  |      0
 y1 |     1       1    1    0    0  | 100000
 y2 |  −3/4     1/4    0    1    0  |      0
 y3 |  −3/2       1    0    0    1  |      0
(x2 enters ⟹ y2 leaves)

 −h | −11/20   0   0   3/5   0  |      0
 y1 |      4   0   1    −4   0  | 100000
 x2 |     −3   1   0     4   0  |      0
 y3 |    3/2   0   0    −4   1  |      0
(x1 enters ⟹ y3 leaves)

 −h |  0   0   0  −13/15  11/30  |      0
 y1 |  0   0   1    20/3   −8/3  | 100000
 x2 |  0   1   0      −4      2  |      0
 x1 |  1   0   0    −8/3    2/3  |      0
(y2 enters ⟹ y1 leaves)

 −h |  0   0  13/100   0   1/50  | 13000
 y2 |  0   0    3/20   1   −2/5  | 15000
 x2 |  0   1     3/5   0    2/5  | 60000
 x1 |  1   0     2/5   0   −2/5  | 40000

Since in the last tableau the row of $-h$ has no negative entry, we have arrived at the optimal solution, which reads as follows: $(x_1 = 40000, x_2 = 60000)$ with $y_2 = 15000$. The maximum value of $h$ is 13000, hence the investments will bring the company a maximum profit of €13,000. To verify the solution in Sage, you can use the block
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective((2/20)*x1 + (3/20)*x2)
p.add_constraint(x1 + x2 <= 100000)
p.add_constraint((-3/4)*x1 + (1/4)*x2 <= 0)
p.add_constraint((-3/2)*x1 + x2 <= 0)
k = p.solve()
x1, x2 = p.get_values(x1,x2)
print("Answer =", round(k, 2))
print("(x1, x2) =", (x1, x2))
By executing this block we obtain the desired verification:
Answer = 13000.0
(x1, x2) = (40000.0, 60000.0) □
3.F.3. If $x_i \geq 0$ for all $i = 1, 2, 3$, minimize the expression $-3x_1 - x_2 - 2x_3$ subject to the conditions
$$x_1 - x_2 + x_3 \geq -4, \quad 2x_1 + x_3 \leq 3, \quad x_1 + x_2 + 3x_3 \leq 8.$$

Solution. We multiply the objective function and the first inequality by $-1$, to get the equivalent task of maximizing $h = 3x_1 + x_2 + 2x_3$ subject to the conditions
$$-x_1 + x_2 - x_3 \leq 4, \quad 2x_1 + x_3 \leq 3, \quad x_1 + x_2 + 3x_3 \leq 8,$$
with $x_i \geq 0$ for $i = 1, 2, 3$. Introducing the non-negative slacks $x_4, x_5$ and $x_6$, we interpret the objective function as $3x_1 + x_2 + 2x_3 + 0\cdot x_4 + 0\cdot x_5 + 0\cdot x_6$. We now write down the first simplex tableau (the pivot element always appears in brackets):

 R0 |  −3  −1  −2   0   0   0  | 0
 R1 |  −1   1  −1   1   0   0  | 4
 R2 | [2]   0   1   0   1   0  | 3
 R3 |   1   1   3   0   0   1  | 8

To eliminate the first column, which is the driver, we apply the elementary row operations $R_2 \to \hat{R}_2 := \frac12 R_2$, $R_0 \to R_0 + 3\hat{R}_2$, $R_1 \to R_1 + \hat{R}_2$, and $R_3 \to R_3 - \hat{R}_2$. This yields the second tableau:

 R̂0 |  0  −1  −1/2   0   3/2   0  |  9/2
 R̂1 |  0 [1]  −1/2   1   1/2   0  | 11/2
 R̂2 |  1   0   1/2   0   1/2   0  |  3/2
 R̂3 |  0   1   5/2   0  −1/2   1  | 13/2

where the new basic variables are $x_1 = 3/2$, $x_4 = 11/2$, and $x_6 = 13/2$. This reflects the fact that we moved as much from the former slack $x_5$ to the new basic variable $x_1$ as possible. This increased the value of the objective function, which we may read in the top right corner of the tableau. We now move to the new work column, which is the one containing $-1$ in $\hat{R}_0$. In this column we bracket the new pivot, which equals 1 (since $11/2 < 13/2$). Hence we can directly eliminate the second column by the row operations $\hat{R}_0 \to \hat{R}_0 + \hat{R}_1$ and $\hat{R}_3 \to \hat{R}_3 - \hat{R}_1$ (observe that $\hat{R}_2$ already contains 0 in the work column). This gives

 R̃0 |  0   0   −1    1    2   0  |   10
 R̃1 |  0   1  −1/2   1   1/2  0  | 11/2
 R̃2 |  1   0   1/2   0   1/2  0  |  3/2
 R̃3 |  0   0  [3]   −1   −1   1  |    1

and shifts the new basic variable from $x_4$ to $x_2$. In this way we have increased the objective function. However, the first row $\tilde{R}_0$ still contains a negative number, and hence there is one more repetition. The new work column is the third one, and the next pivot is the bracketed number 3. By applying the necessary elementary row operations we arrive at the final tableau

 | 0  0  0   2/3   5/3   1/3  | 31/3
 | 0  1  0   5/6   1/3   1/6  | 17/3
 | 1  0  0   1/6   2/3  −1/6  |  4/3
 | 0  0  1  −1/3  −1/3   1/3  |  1/3

where the basic variables are the initial decision variables $x_1, x_2, x_3$. This gives the optimal solution $x_1 = 4/3$, $x_2 = 17/3$ and $x_3 = 1/3$, with the maximal value of $h$ being $31/3$. □
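As with 3.F.1, the hand computation can be cross-checked with a short Sage sketch (we maximize the transformed objective $h$, as in the solution above):
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2, x3 = v["x1"], v["x2"], v["x3"]
p.set_objective(3*x1 + x2 + 2*x3)
p.add_constraint(-x1 + x2 - x3 <= 4)
p.add_constraint(2*x1 + x3 <= 3)
p.add_constraint(x1 + x2 + 3*x3 <= 8)
print(p.solve())                 # 10.333... = 31/3
print(p.get_values(x1, x2, x3))  # (4/3, 17/3, 1/3), up to rounding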
3.F.4. Infinite optimal solutions. Consider the LP problem of maximizing $h = 3x_1 + x_2$ subject to the conditions
$$6x_1 + 2x_2 \leq 30, \quad 10x_1 + x_2 \leq 40, \quad x_1 \geq 0, \quad x_2 \geq 0.$$
Use the simplex method to show that an optimal solution of this LP problem is not unique. Verify your answer by a figure.

Solution. Let us introduce slack variables $y_1, y_2$ to bring the problem to its canonical form. This corresponds to the maximization of $h = 3x_1 + x_2 + 0y_1 + 0y_2$ under the constraints
$$6x_1 + 2x_2 + y_1 = 30, \quad 10x_1 + x_2 + y_2 = 40,$$
with $x_i \geq 0$ and $y_j \geq 0$ for $i, j = 1, 2$. By applying the simplex method we need two iterations to arrive at an optimal solution. Let us summarize the tableaux together with the pivot elements in brackets (we also indicate the corresponding elementary row operations).

            x1    x2   y1   y2  | rhs
 R0 −h |   −3    −1    0    0  |  0
 R1 y1 |    6     2    1    0  | 30
 R2 y2 | [10]    1    0    1  | 40

($R_2 \to \hat{R}_2 := \frac{1}{10}R_2$, $R_0 \to \hat{R}_0 := R_0 + 3\hat{R}_2$, $R_1 \to \hat{R}_1 := R_1 - 6\hat{R}_2$) ⟹

 R̂0 −h |  0  −7/10   0   3/10  | 12
 R̂1 y1 |  0  [7/5]   1  −6/10  |  6
 R̂2 x1 |  1   1/10   0   1/10  |  4

($\hat{R}_1 \to \tilde{R}_1 := \frac57\hat{R}_1$, $\hat{R}_0 \to \tilde{R}_0 := \hat{R}_0 + \frac{7}{10}\tilde{R}_1$, $\hat{R}_2 \to \tilde{R}_2 := \hat{R}_2 - \frac{1}{10}\tilde{R}_1$) ⟹

 R̃0 −h |  0   0    1/2     0  |   15
 R̃1 x2 |  0   1    5/7  −3/7  | 30/7
 R̃2 x1 |  1   0  −1/14   1/7  | 25/7

Hence in the first iteration $x_1$ enters the column of basic variables and $y_2$ leaves, while in the second one $x_2$ enters and $y_1$ leaves. Since $\tilde{R}_0$ does not contain a negative entry, we have arrived at an optimal solution, given by $(x_1 = \frac{25}{7}, x_2 = \frac{30}{7})$ (with $y_1 = 0 = y_2$). The maximal value of $h$ equals 15. However, observe that in the above tableau the non-basic variable $y_2$ appears with zero coefficient in $\tilde{R}_0$. Therefore, $y_2$ can be brought into the basis to generate a new optimal solution with the same value of the objective function $h$. For this we use the entry $1/7$ of $\tilde{R}_2$ as a pivot and apply the row operations $\tilde{R}_2 \to r_2 := 7\tilde{R}_2$, $\tilde{R}_1 \to \tilde{R}_1 + \frac37 r_2$, $\tilde{R}_0 \to \tilde{R}_0$ (the cost row remains the same). This gives the following tableau

 r0 −h |  3*0+0... 

 r0 −h |  0   0   1/2   0  | 15
 r1 x2 |  3   1   1/2   0  | 15
 r2 y2 |  7   0  −1/2   1  | 25

which provides the new optimal solution $(x_1 = 0, x_2 = 15)$ (with $y_1 = 0$ and $y_2 = 25$), such that $h(0, 15) = 15$. Since two optimal solutions have been found, their convex combinations are also optimal solutions of the initial LP problem. This means that $tP_1 + (1-t)P_2$ is also an optimal solution for any $t$ with $0 \leq t \leq 1$, where $P_1 := (\frac{25}{7}, \frac{30}{7})$ and $P_2 := (0, 15)$. Thus this LP problem admits infinitely many optimal solutions, as one can also deduce with the help of the corresponding figure. □
3.F.5. Cycling in the simplex method. Consider the following task: maximize the functional $h = 10x_1 - 57x_2 - 9x_3 - 24x_4$ subject to the conditions
$$\tfrac12 x_1 - \tfrac{11}{2} x_2 - \tfrac52 x_3 + 9x_4 \leq 0, \quad \tfrac12 x_1 - \tfrac32 x_2 - \tfrac12 x_3 + x_4 \leq 0, \quad x_1 + x_2 + x_3 + x_4 \leq 1,$$
with $x_i \geq 0$ for $i = 1, \dots, 4$. Show that the simplex method applied to this LP problem cycles, which roughly speaking means that the same tableau occurs more than once. Can you find an optimal solution?¹⁸

Solution. Introduce the slack variables $y_1, y_2, y_3$, such that
$$\tfrac12 x_1 - \tfrac{11}{2} x_2 - \tfrac52 x_3 + 9x_4 + y_1 = 0, \quad \tfrac12 x_1 - \tfrac32 x_2 - \tfrac12 x_3 + x_4 + y_2 = 0, \quad x_1 + x_2 + x_3 + x_4 + y_3 = 1,$$
with $x_i \geq 0$ and $y_j \geq 0$ for any $i = 1, \dots, 4$ and $j = 1, 2, 3$. Thus the initial simplex tableau reads as follows:

             x1     x2    x3   x4  y1  y2  y3 | rhs
 R0 −h |   −10     57     9   24   0   0   0 | 0
 R1 y1 | [1/2]  −11/2  −5/2    9   1   0   0 | 0
 R2 y2 |   1/2   −3/2  −1/2    1   0   1   0 | 0
 R3 y3 |     1      1     1    1   0   0   1 | 1

In the driver column, which is the column of $x_1$, there are two choices of a pivot and we choose the bracketed $1/2$ in $R_1$. Hence the variable $x_1$ enters the column of basic variables and the variable $y_1$ leaves. We replace $R_1$ by $\hat{R}_1 := 2R_1$ and next apply the row operations $R_0 \to R_0 + 10\hat{R}_1$, $R_2 \to R_2 - \frac12\hat{R}_1$ and $R_3 \to R_3 - \hat{R}_1$. This gives the second tableau:

 R̂0 −h |  0  −53  −41  204  20  0  0 | 0
 R̂1 x1 |  1  −11   −5   18   2  0  0 | 0
 R̂2 y2 |  0  [4]    2   −8  −1  1  0 | 0
 R̂3 y3 |  0   12    6  −17  −2  0  1 | 1

Since $-53 < -41$, the column of $x_2$ is the new work column and the bracketed 4 in $\hat{R}_2$ is the new pivot. So $x_2$ enters and $y_2$ leaves. To eliminate the second column we replace $\hat{R}_2$ by $\tilde{R}_2 := \frac14\hat{R}_2$ and apply the row operations $\hat{R}_0 \to \hat{R}_0 + 53\tilde{R}_2$, $\hat{R}_1 \to \hat{R}_1 + 11\tilde{R}_2$ and $\hat{R}_3 \to \hat{R}_3 - 12\tilde{R}_2$. This gives the third tableau:

 R̃0 −h |  0  0  −29/2   98  27/4  53/4  0 | 0
 R̃1 x1 |  1  0  [1/2]   −4  −3/4  11/4  0 | 0
 R̃2 x2 |  0  1    1/2   −2  −1/4   1/4  0 | 0
 R̃3 y3 |  0  0      0    7     1    −3  1 | 1

The column of $x_3$ is the new driver and we choose as pivot the bracketed $1/2$ in $\tilde{R}_1$. So $x_3$ enters the column of basic variables and $x_1$ leaves. To do so we replace $\tilde{R}_1$ by $\check{R}_1 := 2\tilde{R}_1$ and apply the row operations $\tilde{R}_0 \to \tilde{R}_0 + \frac{29}{2}\check{R}_1$ and $\tilde{R}_2 \to \tilde{R}_2 - \frac12\check{R}_1$. This gives the fourth tableau:

 Ř0 −h |  29  0  0  −18   −15    93  0 | 0
 Ř1 x3 |   2  0  1   −8  −3/2  11/2  0 | 0
 Ř2 x2 |  −1  1  0  [2]   1/2  −5/2  0 | 0
 Ř3 y3 |   0  0  0    7     1    −3  1 | 1

Since $-18 < -15$, the next work column is the column of $x_4$ and the pivot is the bracketed 2 in $\check{R}_2$. Hence $x_4$ enters and $x_2$ leaves. For this, replace $\check{R}_2$ by $\bar{R}_2 := \frac12\check{R}_2$ and apply the row operations $\check{R}_0 \to \check{R}_0 + 18\bar{R}_2$, $\check{R}_1 \to \check{R}_1 + 8\bar{R}_2$ and $\check{R}_3 \to \check{R}_3 - 7\bar{R}_2$. This gives the following tableau:

 R̄0 −h |   20     9  0  0  −21/2  141/2  0 | 0
 R̄1 x3 |   −2     4  1  0  [1/2]   −9/2  0 | 0
 R̄2 x4 | −1/2   1/2  0  1    1/4   −5/4  0 | 0
 R̄3 y3 |  7/2  −7/2  0  0   −3/4   23/4  1 | 1

The new driver column is the column of $y_1$ and we choose as pivot the bracketed $1/2$ in $\bar{R}_1$. Thus $y_1$ enters and $x_3$ leaves. To do so we replace $\bar{R}_1$ by $r_1 := 2\bar{R}_1$ and apply the row operations $\bar{R}_0 \to \bar{R}_0 + \frac{21}{2}r_1$, $\bar{R}_2 \to \bar{R}_2 - \frac14 r_1$ and $\bar{R}_3 \to \bar{R}_3 + \frac34 r_1$. This yields the following tableau:

 r0 −h |  −22    93    21  0  0  −24  0 | 0
 r1 y1 |   −4     8     2  0  1   −9  0 | 0
 r2 x4 |  1/2  −3/2  −1/2  1  0  [1]  0 | 0
 r3 y3 |  1/2   5/2   3/2  0  0   −1  1 | 1

Since $-24 < -22$, the new work column is that of $y_2$ and the pivot is the bracketed 1 in $r_2$. Hence $y_2$ enters the column of basic variables and $x_4$ leaves. To eliminate the driver column we leave $r_2$ unchanged and apply the row operations $r_0 \to r_0 + 24r_2$, $r_1 \to r_1 + 9r_2$ and $r_3 \to r_3 + r_2$. This returns us to the initial simplex tableau, and hence this LP problem exhibits cycling.

A simple remedy against cycling is the so-called Bland's rule. This says that for the entering variable we should take the first one in the present ordering of variables which has a negative entry in the objective row. For choosing the leaving variable, if there is a tie for the least ratio, take the candidate that is first in the ordering. For our problem, the ordering of the variables is $x_1, \dots, x_4, y_1, \dots, y_3$, and an application of Bland's rule leaves everything the same up to the last pivot. This means that $x_1$ should enter instead of $y_2$, and the pivot for this is the number $1/2$ in $r_2$ in the above tableau. Thus we replace the row $r_2$ by $\hat{r}_2 := 2r_2$ and apply the row operations $r_0 \to r_0 + 22\hat{r}_2$, $r_1 \to r_1 + 4\hat{r}_2$ and $r_3 \to r_3 - \frac12\hat{r}_2$. This yields the following tableau:

 r̂0 −h |  0  27   −1  44  0  20  0 | 0
 r̂1 y1 |  0  −4   −2   8  1  −1  0 | 0
 r̂2 x1 |  1  −3   −1   2  0   2  0 | 0
 r̂3 y3 |  0   4  [2]  −1  0  −2  1 | 1

The next iteration will increase the value of the objective function, and in particular will provide an optimal solution. Indeed, $x_3$ enters and the new pivot is bracketed in the row $\hat{r}_3$, so $y_3$ leaves. The final row operations are $\hat{r}_3 \to \tilde{r}_3 := \frac12\hat{r}_3$, $\hat{r}_0 \to \hat{r}_0 + \tilde{r}_3$, $\hat{r}_1 \to \hat{r}_1 + 2\tilde{r}_3$ and $\hat{r}_2 \to \hat{r}_2 + \tilde{r}_3$, and we arrive at the following optimal tableau:

 r̃0 −h |  0  29  0  87/2  0  19  1/2 | 1/2
 r̃1 y1 |  0   0  0     7  1  −3    1 |   1
 r̃2 x1 |  1  −1  0   3/2  0   1  1/2 | 1/2
 r̃3 x3 |  0   2  1  −1/2  0  −1  1/2 | 1/2

Hence we have the optimal solution given by $x_1 = 1/2 = x_3$ and $x_2 = 0 = x_4$, with the maximal value being $h = 1/2$. □

¹⁸ This example is adapted from the book of V. Chvátal (1983), Linear Programming, W. H. Freeman. Vasek Chvátal is a Czech mathematician known for his contributions to LP, graph theory and combinatorics.
3.F.6. Matrix games. Imagine a game played by two players – a billionaire and fate. The billionaire would like to invest in gold, silver, diamonds or stocks of an important IT software company. Suppose that the wins and losses of such investments are well known for a period of four to five years (for simplicity, we consider only a period of four years and write them into the matrix $A = (a_{ij})$):

        gold   silver   diamonds   software
2018     2%      1%        4%         3%
2019     3%     −1%       −2%         6%
2020     1%      2%        3%        −4%
2021    −2%      1%        2%         3%

The billionaire would like to invest for one year only. How should he split his investment in order to ensure the maximal win independently of the development of the stock market? We assume that next year will be some (unknown) probabilistic mix of the previous four ones. In terms of our game, fate will play some stochastic vector $(x_1, x_2, x_3, x_4)^T$ fixing the behaviour of the market (as a probabilistic mixture of the previous ones), while the billionaire will play another stochastic vector $(y_1, y_2, y_3, y_4)^T$ describing the split of his investment. The win of the billionaire is $\sum_{i,j=1}^4 x_i y_j a_{ij}$.

Solution. The task is to find the stochastic vector $(y_1, y_2, y_3, y_4)^T$ which maximizes the minimum of all values $\sum_{i,j=1}^4 x_i y_j a_{ij}$ for the fixed matrix $A$ and any stochastic vector $(x_1, x_2, x_3, x_4)^T$. This is equivalent to the problem of maximizing $z_1 + z_2 + z_3 + z_4$ under the conditions $A^T z \leq (1, \dots, 1)^T$, $z \geq 0$ (the requested stochastic vector $y$ is then obtained by normalizing the vector $z$, and the requested optimal value is the inverse of the optimal value obtained).¹⁹ Thus, one has to solve an LP problem, and the first step is to introduce the slack variables $w_1, w_2, w_3, w_4$ and transform the problem to the standard form:
$$\max\{z_1 + z_2 + z_3 + z_4 \mid (A^T | E_4)(z, w)^T = (1, 1, 1, 1)^T\}.$$
The initial tableau of the simplex method is the following one:

 | −1  −1  −1  −1  0  0  0  0 | 0
 |  2   3   1  −2  1  0  0  0 | 1
 |  1  −1   2   1  0  1  0  0 | 1
 |  4  −2   3   2  0  0  1  0 | 1
 |  3   6  −4   3  0  0  0  1 | 1

and after four iterations we arrive at the final tableau, given by

 |  188/89  0  0  0   25/89  0   44/89   17/89 | 86/89
 |  114/89  0  1  0   18/89  0   21/89   −2/89 | 37/89
 | −146/89  0  0  0   −9/89  1  −55/89    1/89 | 26/89
 |  −85/89  0  0  1  −10/89  0   18/89   11/89 | 19/89
 |   78/89  1  0  0   17/89  0    5/89    8/89 | 30/89

We can read off the optimal solution: $z_2 = \frac{30}{89}$, $z_3 = \frac{37}{89}$, $z_4 = \frac{19}{89}$, $z_1 = 0$. The optimal value (upper right corner) is $z_1 + z_2 + z_3 + z_4 = \frac{86}{89}$. After rescaling to a stochastic vector (multiplying by $\frac{89}{86}$) we get the solution of the original problem:
$$y_1 = 0, \quad y_2 = \tfrac{30}{86}, \quad y_3 = \tfrac{37}{86}, \quad y_4 = \tfrac{19}{86},$$
with the optimal value $\frac{89}{86}$. □

¹⁹ The observation comes from the proof of the von Neumann minimax theorem, 1928. The theorem claims that any probabilistic extension of a matrix game enjoys an equilibrium state.

B) Material on difference equations

3.F.7. Find a real basis of solutions for the difference equation $x_{n+4} = x_{n+3} - x_{n+2} + x_{n+1} - x_n$.

Solution. The space of solutions is a four-dimensional vector space whose generators can be obtained from the roots of the characteristic polynomial of the given equation. The characteristic equation has the form
$$r^4 - r^3 + r^2 - r + 1 = 0.$$
This is a so-called reciprocal equation, which means that the coefficients at the $(n-k)$-th and the $k$-th powers of $r$, for $k = 1, \dots, n$, are equal. After dividing the characteristic equation by $r^2$ (zero cannot be a root) and setting $u := r + \frac{1}{r}$ (note that $r^2 + \frac{1}{r^2} = u^2 - 2$) we obtain
$$r^2 - r + 1 - \frac{1}{r} + \frac{1}{r^2} = u^2 - u - 1 = 0.$$
This gives $u_{1,2} = \frac{1 \pm \sqrt{5}}{2}$, and from the equation $r^2 - ur + 1 = 0$ we determine the roots
$$r_{1,2,3,4} = \frac{1 \pm \sqrt{5} \pm \sqrt{-10 \pm 2\sqrt{5}}}{4}.$$
In fact, since $r^5 + 1 = (r + 1)(r^4 - r^3 + r^2 - r + 1)$, the roots of the characteristic equation could have been “guessed” right away. The roots of the characteristic polynomial are also roots of the polynomial $r^5 + 1$, which are exactly the fifth roots of $-1$. By this we obtain that the solutions of the characteristic equation are the numbers
$$r_{1,2} = \cos\big(\tfrac{\pi}{5}\big) \pm i\sin\big(\tfrac{\pi}{5}\big), \qquad r_{3,4} = \cos\big(\tfrac{3\pi}{5}\big) \pm i\sin\big(\tfrac{3\pi}{5}\big).$$
Now, by the description in 3.2.5, a real basis of the space of solutions of the given difference equation is given by the sequences
$$\cos\big(\tfrac{n\pi}{5}\big), \quad \sin\big(\tfrac{n\pi}{5}\big), \quad \cos\big(\tfrac{3n\pi}{5}\big), \quad \sin\big(\tfrac{3n\pi}{5}\big).$$
Note that these are the sines and cosines of the arguments of the corresponding powers of the roots of the characteristic polynomial. □
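The factorization and the location of the roots in 3.F.7 are easy to confirm with a short Sage cell:
R.<r> = QQ[]
p = r^4 - r^3 + r^2 - r + 1
print((r^5 + 1).factor())           # (r + 1) * (r^4 - r^3 + r^2 - r + 1)
for root, _ in p.roots(CC):
    print(root, abs(root), root^5)  # each root has modulus 1 and satisfies r^5 = -1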
3.F.8. A simplified model for the behaviour of gross domestic product. Non-homogeneous difference equations of second order appear in macroeconomics. Consider for example the difference equation
$$x_{n+2} - A(1 + B)x_{n+1} + ABx_n = 1,$$
where $x_k$ is the gross domestic product at the year $k$. The constant $A$ is the consumption tendency, a macroeconomic factor that gives the fraction of money that people spend (from what they have at their disposal), and the constant $B$ describes the dependence of the measure of investment of the private sector on the consumption tendency. Moreover, we assume that the size of the domestic product is normalised such that the right-hand side of the equation equals 1. Compute the values $x_n$ for $A = 3/4$, $B = 1/3$, $x_0 = 1 = x_1$.

Solution. For $A = 3/4$ and $B = 1/3$, it is easy to check that the constant function $x_n = 4$ is a particular solution of the initial (non-homogeneous) equation. We now look for solutions of the corresponding homogeneous equation. The general characteristic equation has the form
$$r^2 - A(1 + B)r + AB = 0,$$
which reduces to $r^2 - r + \frac14 = 0$ for the particular values of $A$ and $B$. Hence $1/2$ is a double root. In this case, as we have seen before (cf. 3.B.3), the solutions of the homogeneous equation are given by $a(\frac12)^n + bn(\frac12)^n$ for some $a, b \in \mathbb{R}$. Hence, according to the theoretical description in 3.2.6, the solutions of the given difference equation (for $A = 3/4$ and $B = 1/3$) are expressed by $4 + a(\frac12)^n + bn(\frac12)^n$. Using the initial conditions $x_0 = x_1 = 1$ we obtain $a = b = -3$, and we can present the solution explicitly:
$$x_n = 4 - 3\left(\frac12\right)^n - 3n\left(\frac12\right)^n.$$
Let us verify this result in Sage, by the method that we learned in the main part (cf. 3.B.11):
from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n+2) - a(n+1) + (1/4)*a(n) - 1
initial = {a(0):1, a(1):1}
rsolve(f, a(n), initial)
Sage's output is the expression 4 + (-3*n - 3)/2**n. □

3.F.9. Find the solution of the recurrence relation $x_{n+2} - 6x_{n+1} + 5x_n = n e^n$.

Solution. Solving first the homogeneous part we obtain $x_n^{(h)} = c_1 \cdot 1^n + c_2 \cdot 5^n$. To find a particular solution we can use the method of variation of constants, described in ??. For this, we compute the Wronskian determinant:
$$W_{j+1} = \det\begin{pmatrix} 1^{j+1} & 5^{j+1} \\ 1^{j+2} & 5^{j+2} \end{pmatrix} = 4 \cdot 5^{j+1}.$$
Thus,
$$x_n = c_1 + c_2 \cdot 5^n - \frac14\sum_{j=0}^{n-1} j e^j + \left(\sum_{j=0}^{n-1}\frac{j e^j}{4 \cdot 5^{j+1}}\right) 5^n,$$
with $c_1, c_2 \in \mathbb{R}$. □

3.F.10. Find the solution of the recurrence relation $-x_{n+3} = 3x_{n+2} + 3x_{n+1} + x_n$, with initial conditions $x_1 = x_2 = x_3 = 1$. ⃝

C) Material on models of growth and iterated processes

In this paragraph, the initial set of exercises addresses population growth and the Leslie model. However, it is crucial to first recognize the role that “primitive matrices” play in the theory of linear iterative models. Recall that a matrix $A$ is called primitive if there exists a positive integer $k$ such that $A^k$ has only positive entries. These matrices are discussed in detail in Section 3.3.3, where further information can be found.

3.F.11. Which of the matrices given below are primitive?
$$A = \begin{pmatrix} 0 & 1/7 \\ 1 & 6/7 \end{pmatrix}, \quad B = \begin{pmatrix} 1/2 & 0 & 1/3 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 1/6 \end{pmatrix}, \quad C = \begin{pmatrix} 0 & 1 & 0 \\ 1/4 & 0 & 1/2 \\ 3/4 & 0 & 1/2 \end{pmatrix}, \quad D = \begin{pmatrix} 1/3 & 1/2 & 0 & 0 \\ 1/2 & 1/3 & 0 & 0 \\ 0 & 1/6 & 1/6 & 1/3 \\ 1/6 & 0 & 5/6 & 2/3 \end{pmatrix}.$$

Solution. We see that
$$A^2 = \begin{pmatrix} 1/7 & 6/49 \\ 6/7 & 43/49 \end{pmatrix}, \qquad C^3 = \begin{pmatrix} 3/8 & 1/4 & 1/4 \\ 1/4 & 3/8 & 1/4 \\ 3/8 & 3/8 & 1/2 \end{pmatrix}.$$
So the matrices $A$ and $C$ are primitive, since $A^2$ and $C^3$ are positive matrices. For $n \in \mathbb{N}$, the middle column of the matrix $B^n$ is always the vector $(0, 1, 0)^T$, which contains the entry 0. Hence the matrix $B$ cannot be primitive. Finally, the product
$$\begin{pmatrix} 1/3 & 1/2 & 0 & 0 \\ 1/2 & 1/3 & 0 & 0 \\ 0 & 1/6 & 1/6 & 1/3 \\ 1/6 & 0 & 5/6 & 2/3 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0 \\ a \\ b \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ a/6 + b/3 \\ 5a/6 + 2b/3 \end{pmatrix},$$
with $a, b \in \mathbb{R}$, implies that the matrix $D^2$ has a zero two-dimensional (square) sub-matrix in its right upper corner. By induction, the same property is shared by the matrices $D^3 = D \cdot D^2$, $D^4 = D \cdot D^3$, ..., $D^n = D \cdot D^{n-1}$, and so on. Consequently, the matrix $D$ is not primitive. □
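Primitivity can also be tested by brute force. The following tiny helper (our own, not a built-in; the bound 20 is an arbitrary cut-off, so a negative answer is only an indication) simply tries successive powers:
def looks_primitive(M, bound=20):
    P = M
    for k in range(1, bound + 1):
        if all(x > 0 for x in P.list()):
            return True, k
        P = P * M
    return False, None

A = matrix(QQ, [[0, 1/7], [1, 6/7]])
print(looks_primitive(A))   # (True, 2), in accordance with the solution above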
3.F.12. Rabbits and their population growth. We are again interested in the population of rabbits in a meadow, as in problem 3.B.1. However, we now assume that the rabbits die after reaching the ninth year of age.²⁰ Show that according to this model the population grows approximately as the geometric sequence $1.608^t$.

Solution. Denote by $x_1(t), x_2(t), \dots, x_9(t)$ the numbers of rabbits according to their age (in years) at time $t$. Then the numbers of rabbits in the individual categories after one year are described by the formulas
$$x_1(t+1) = x_2(t) + x_3(t) + \dots + x_9(t), \qquad x_i(t+1) = x_{i-1}(t) \ \text{ for } i = 2, 3, \dots, 9.$$
Equivalently, we may write
$$\begin{pmatrix} x_1(t+1) \\ x_2(t+1) \\ x_3(t+1) \\ x_4(t+1) \\ x_5(t+1) \\ x_6(t+1) \\ x_7(t+1) \\ x_8(t+1) \\ x_9(t+1) \end{pmatrix} = \begin{pmatrix} 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}\begin{pmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \\ x_4(t) \\ x_5(t) \\ x_6(t) \\ x_7(t) \\ x_8(t) \\ x_9(t) \end{pmatrix}.$$
The characteristic polynomial of the given matrix is
$$\lambda^9 - \lambda^7 - \lambda^6 - \lambda^5 - \lambda^4 - \lambda^3 - \lambda^2 - \lambda - 1.$$
To obtain a verification in Sage we can apply the command A.eigenvalues(), as usual:
A = matrix(QQ, [[0,1,1,1,1,1,1,1,1], [1,0,0,0,0,0,0,0,0],
[0,1,0,0,0,0,0,0,0], [0,0,1,0,0,0,0,0,0], [0,0,0,1,0,0,0,0,0],
[0,0,0,0,1,0,0,0,0], [0,0,0,0,0,1,0,0,0], [0,0,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,0]])
A.eigenvalues()
The answer shows that $\lambda_1 \approx 1.608$ is indeed the only positive real eigenvalue, a property that every Leslie matrix satisfies (there are also eight complex eigenvalues). In fact, we can estimate this root of the characteristic polynomial very well (think about why it must be smaller than $(\sqrt{5} + 1)/2$). Now, the normalized eigenvector corresponding to $\lambda_1$ has the form (the coordinates of $X_1$ sum to 100)
$$X_1 \approx (38.36, 23.85, 14.83, 9.22, 5.73, 3.56, 2.21, 1.37, 0.85)^T.$$
Consequently, according to this model the population grows approximately as the geometric sequence $1.608^t$. □

²⁰ In the original model the rabbits were immortal. Note that domesticated rabbits can live between eight and twelve years.
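The normalization of $X_1$ quoted above can be reproduced by simple power iteration with the Leslie matrix $A$ from the previous cell; this is only a numerical sketch, relying on the fact that repeated multiplication converges to the dominant eigendirection:
Ad = A.change_ring(RDF)
x = vector(RDF, [1]*9)
for _ in range(200):
    x = Ad * x
    x = x / x.norm()                 # keep the iterate normalized
print(100 * x / sum(x))              # ~ (38.36, 23.85, ..., 0.85)
print((Ad*x).norm() / x.norm())      # ~ 1.608, the dominant eigenvalue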
3.F.13. Consider the following Leslie model, in which a farmer breeds sheep. The birth-rate of sheep depends only on their age and is on average 2 lambs per sheep between one and two years of age, 5 lambs per sheep between two and three years of age, and 2 lambs per sheep between three and four years of age. Younger sheep do not deliver any lambs. Every year, half of the sheep die, uniformly distributed among all age groups. Every sheep older than four years is sent to the butchery. The farmer would like to sell (living) lambs younger than one year for their skin. What proportion of the lambs can be sold every year so that the size of the herd remains the same? In what ratio will the sheep then be distributed among the individual age categories?

Solution. The matrix of the model, without any action of the farmer, can be expressed as
$$L = \begin{pmatrix} 0 & 2 & 5 & 2 \\ \frac12 & 0 & 0 & 0 \\ 0 & \frac12 & 0 & 0 \\ 0 & 0 & \frac12 & 0 \end{pmatrix}.$$
The farmer can influence how many sheep younger than one year stay in his herd for the next year, that is, he can influence the element $l_{21}$ of the matrix $L$. Thus we are finally dealing with the model encoded by the matrix
$$L = \begin{pmatrix} 0 & 2 & 5 & 2 \\ a & 0 & 0 & 0 \\ 0 & \frac12 & 0 & 0 \\ 0 & 0 & \frac12 & 0 \end{pmatrix}.$$
We are looking for an $a$ such that the matrix has the eigenvalue 1 (we know that it has only one real positive eigenvalue). The characteristic polynomial of this matrix is
$$\lambda^4 - 2a\lambda^2 - \frac{5a}{2}\lambda - \frac{a}{2}.$$
If we require 1 to be a root of this polynomial, then we get the solution $a = \frac15$. Thus, the farmer can sell $\frac12 - \frac15 = \frac{3}{10}$ of the lambs that are born each year. The eigenvector of the matrix corresponding to the eigenvalue 1 is $(20, 4, 2, 1)^T$, and in these ratios the population stabilises. □
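A quick symbolic check of this computation in Sage: require $\lambda = 1$ to be a root of the characteristic polynomial and solve for the retention rate $a$.
a = var('a')
L = matrix(SR, [[0, 2, 5, 2], [a, 0, 0, 0], [0, 1/2, 0, 0], [0, 0, 1/2, 0]])
chi = L.charpoly()        # the characteristic polynomial, with parameter a
print(chi)
print(solve(chi(1) == 0, a))   # a == 1/5, as claimed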
3.F.14. Consider the Leslie population growth model for a population of rats, divided into three groups according to age: younger than one year, between one and two years, and between two and three years. Assume that there exist no rats older than three years. The average birth-rate of one rat in the individual age categories is the following: in the first group it is zero, in the second and in the third it is 2 rats. The mortality in the second group is zero, that is, the rats that survive their first year die after three years of life. Determine the mortality in the first group, given that the population stagnates (the total number of rats does not change). ⃝

3.F.15. Model of evolution of a whale population. For the evolution of a population (of whales), females are important. The important factor is not age but fertility. From this point of view, one can divide the females into newborns (juveniles), that is, females who are not yet fertile; young fertile females; adult females with the highest fertility; and postclimacterial females, who are no longer fertile but are still important with respect to taking care of newborns and food gathering.

We model the evolution of such a population in time. For a time unit, we choose the time it takes to reach adulthood. A newborn female who survives this interval becomes fertile. The evolution of a young female to full fertility and to the postclimacterial state depends mainly on the environment. That is, the transition to the next category is a random event. Analogously, the death of an individual is also a random event. Note that a young fertile female has fewer children per unit interval than an adult female. Let us now try to formalise these statements.

Denote by $x_1(t), x_2(t), x_3(t), x_4(t)$ the numbers of juvenile, young, adult and postclimacterial females at time $t$, respectively. The amount can be expressed as a number of individuals, but also as a number of individuals per unit area (population density), or as total biomass. Denote further by $p_1$ the probability that a juvenile female survives the unit time interval and becomes fertile, and let $p_2$ and $p_3$ be the respective probabilities that a young female becomes adult and that an adult female becomes old. Another random event is the death (positively formulated: survival) of females who do not move to the next category – we denote these probabilities respectively by $q_2$, $q_3$ and $q_4$ for young, adult and old females. Of course, each of the numbers $p_1, p_2, p_3, q_2, q_3, q_4$ is a probability from the interval $[0, 1]$. Now, a young female can survive and stay young, reach adulthood, or die. These events are mutually exclusive, and together they form the certain event; since the possibility of death cannot be excluded, $p_2 + q_2 < 1$. For similar reasons $p_3 + q_3 < 1$. Finally, we denote by $f_2$ and $f_3$ the average numbers of daughters of a young and an adult female, respectively. These parameters satisfy $0 < f_2 < f_3$.

The expected number of newborn females in the next time interval is the sum of the daughters of the young and of the adult females, that is,
$$x_1(t+1) = f_2 x_2(t) + f_3 x_3(t).$$
Let us temporarily denote by:
• $x_{2,1}(t+1)$ the number of young females at time $t+1$ who were juvenile in the previous time interval;
• $x_{2,2}(t+1)$ the number of young females who were already fertile at time $t$, survived that time interval, but did not move into adulthood.
The probability $p_1$ that a juvenile female survives the interval can be expressed by classical probability, that is, by the ratio $x_{2,1}(t+1)/x_1(t)$. Similarly, the probability $q_2$ can be expressed as the ratio $x_{2,2}(t+1)/x_2(t)$, and we finally have the relation
$$x_2(t+1) = x_{2,1}(t+1) + x_{2,2}(t+1) = p_1 x_1(t) + q_2 x_2(t).$$
In the same way we deduce the expected number of fully fertile females,
$$x_3(t+1) = p_2 x_2(t) + q_3 x_3(t),$$
and finally the expected number of postclimacterial females,
$$x_4(t+1) = p_3 x_3(t) + q_4 x_4(t).$$
Set now
$$A := \begin{pmatrix} 0 & f_2 & f_3 & 0 \\ p_1 & q_2 & 0 & 0 \\ 0 & p_2 & q_3 & 0 \\ 0 & 0 & p_3 & q_4 \end{pmatrix} \quad\text{and}\quad x(t) := \begin{pmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \\ x_4(t) \end{pmatrix}.$$
Then we may rewrite the previous recurrence formulas in matrix form as $x(t+1) = Ax(t)$. Using this matrix difference equation it is possible to compute the expected numbers of females in the individual categories, assuming that the distribution of the population at some initial time is known.

Let us focus for example on the population of orca whales, where we may assume the following parameters:
$$p_1 = 0.9775, \quad q_2 = 0.9111, \quad f_2 = 0.0043, \quad p_2 = 0.0736, \quad q_3 = 0.9534, \quad f_3 = 0.1132, \quad p_3 = 0.0452, \quad q_4 = 0.9804.$$
In this case the time interval is one year. If we start at the time $t = 0$ with a unit measure of young females in some unoccupied area, that is, with the vector $x(0) = (0, 1, 0, 0)^T$, then with the help of Sage one easily computes
$$x(1) = \begin{pmatrix} 0 & 0.0043 & 0.1132 & 0 \\ 0.9775 & 0.9111 & 0 & 0 \\ 0 & 0.0736 & 0.9534 & 0 \\ 0 & 0 & 0.0452 & 0.9804 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0.0043 \\ 0.9111 \\ 0.0736 \\ 0 \end{pmatrix},$$
and similarly
$$x(2) = A\,x(1) = (0.01224925, 0.83430646, 0.13722720, 0.00332672)^T.$$
The reader may try a computation and graphical depiction of the results for a different initial distribution of the population. The result should be the observation that the total population grows exponentially, but the ratios of the sizes of the individual groups stabilise at constant values. In fact, the matrix $A$ has the following eigenvalues:
$$\lambda_1 = 1.025441326, \quad \lambda_2 = 0.980400000, \quad \lambda_3 = 0.834222976, \quad \lambda_4 = 0.004835698.$$
The eigenvector associated with $\lambda_1$ is given by
$$w = (0.03697187, 0.31607121, 0.32290968, 0.32404724)^T,$$
and this vector is normalized so that the sum of its components equals one.
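The suggested experiment is a few lines of Sage; the following sketch iterates $x(t+1) = Ax(t)$ for the orca parameters above and displays the stabilised ratios:
A = matrix(RDF, [[0, 0.0043, 0.1132, 0],
                 [0.9775, 0.9111, 0, 0],
                 [0, 0.0736, 0.9534, 0],
                 [0, 0, 0.0452, 0.9804]])
x = vector(RDF, [0, 1, 0, 0])
for t in range(100):
    x = A * x
print(x / sum(x))   # close to the normalized eigenvector w above
print(sum(x))       # the total grows roughly like 1.0254^t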
We will now explore a variety of further exercises related to discrete Markov chains. For the reader's convenience, we have included some classic problems that lend themselves to a Markov process interpretation, such as the “ruin of a player” (see 3.F.21), the “algorithm for determining the importance of web pages” (see 3.F.20), and others. We begin with an enjoyable problem.

3.F.16. Students in three groups. Suppose that we can divide students into three groups, as follows: those who are present at a lecture and pay attention, those who are present but pay no attention, and those who are in a pub instead. Now observe, lecture after lecture, how the numbers in the individual groups change. The first step is to observe the probabilities with which a student changes his state. Suppose that this goes as follows. A student who pays attention stays in the same state with probability 50%, stops paying attention with probability 40%, and moves to the pub with probability 10%. A student who pays no attention starts paying attention with probability 10%, stays in the same state with probability 50%, and moves to the pub with probability 40%. Finally, for a student who is in the pub we assume that there is zero probability of returning to the lectures. How does the model evolve in time? How does the situation change if we assume at least a ten percent probability that a student returns from the pub to the lecture (but is not going to pay any attention)?

Solution. This is obviously a (homogeneous) discrete Markov process. Its transition matrix $T$ has the form
$$T = \begin{pmatrix} 0.5 & 0.1 & 0 \\ 0.4 & 0.5 & 0 \\ 0.1 & 0.4 & 1 \end{pmatrix},$$
where the first column concerns the students paying attention, the second one those paying no attention, and the third one those in the pub. We will prove that all the students end up in the pub. Indeed, by Sage we see that $T$ has the following three eigenvalues: $\lambda_1 = 1$, while the remaining two are $0.7$ and $0.3$. We also see that the eigenvector associated with $\lambda_1$ is a multiple of $(0, 0, 1)^T$. This vector describes the limit distribution of students among the three groups with the passing of time, and this proves our claim (of course, such a result is clear even without any computation: as the probability of returning from the pub is zero, all students end up in the pub).

For the second task, we will show that adding a 10 percent possibility of leaving the pub changes the situation only slightly. The corresponding transition matrix is given by
$$T = \begin{pmatrix} 0.5 & 0.1 & 0 \\ 0.4 & 0.5 & 0.1 \\ 0.1 & 0.4 & 0.9 \end{pmatrix}.$$
We see by Sage (or by Mathematica) that $\lambda_1 = 1$ is an eigenvalue of $T$, with corresponding eigenvector a multiple of $(1, 5, 21)^T$. This means that the limit distribution of the students among the individual groups is encoded by the multiple of this vector whose coordinates sum to one, that is, the vector $(\frac{1}{27}, \frac{5}{27}, \frac{21}{27})^T$. Therefore, again most of the students end up in the pub. □
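The Sage computation referenced in the solution can be done as follows: the stationary distribution is the kernel of $T - E$, normalized to sum to one.
T = matrix(QQ, [[1/2, 1/10, 0], [2/5, 1/2, 1/10], [1/10, 2/5, 9/10]])
v = (T - 1).right_kernel().basis()[0]   # here T - 1 means T minus the identity
print(v / sum(v))                       # (1/27, 5/27, 7/9), i.e. (1, 5, 21)/27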
3.F.17. Daily running. Jonas goes running every evening. He has three tracks – short, medium and long. Whenever he chooses the short track, the next day he feels bad about it and chooses with equal probabilities between the long and the medium one. Whenever he chooses the long track, the next day he chooses arbitrarily among all three. Whenever he chooses the medium track, the next day he feels good about it, and again chooses with equal probabilities between the medium and the long one. Jonas claims that he has been running in this mode for a very long period. How often does he choose the short track and how often the long track? What is the probability that Jonas follows a long track, under the assumption that he did the same a week ago?

Solution. Clearly this task forms a (homogeneous) discrete Markov process with a three-dimensional state space $S = \{\text{short track, medium track, long track}\}$ (which we can silently encode as $\{1, 2, 3\}$). This order of the states gives the following transition matrix:
$$T = \begin{pmatrix} 0 & 0 & 1/3 \\ 1/2 & 1/2 & 1/3 \\ 1/2 & 1/2 & 1/3 \end{pmatrix}.$$
Let us explain the computation for the second column, which corresponds to the choice of the medium track during the previous day. This means that with probability $1/2$ a medium track will be chosen (the second row), and with probability $1/2$ a long track will be chosen (the third row). Hence $T$ follows. Now, we see that
$$T^2 = \begin{pmatrix} 1/6 & 1/6 & 1/9 \\ 5/12 & 5/12 & 4/9 \\ 5/12 & 5/12 & 4/9 \end{pmatrix}.$$
Therefore, we can use the corollary of the Perron–Frobenius theorem for Markov chains, see 3.3.4. The eigenvector corresponding to the eigenvalue 1 is the stochastic vector $(\frac17, \frac37, \frac37)^T$, where the numbers $1/7, 3/7, 3/7$ are respectively the probabilities that on a randomly chosen day Jonas chooses a short, medium or long track. Suppose now that on a certain day (that is, at time $n \in \mathbb{N}$) Jonas follows a long track. This corresponds to the probabilistic vector $x_{(n)} = (0, 0, 1)^T$. As we know, for the following day we will have
$$x_{(n+1)} = \begin{pmatrix} 0 & 0 & 1/3 \\ 1/2 & 1/2 & 1/3 \\ 1/2 & 1/2 & 1/3 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}.$$
In particular, after seven days
$$x_{(n+7)} = T^7 \cdot (0, 0, 1)^T = T^6 \cdot (1/3, 1/3, 1/3)^T.$$
Let us present this computation in Sage:
T=matrix(QQbar, [[0, 0, 1/3], [1/2, 1/2, 1/3], [1/2, 1/2, 1/3]])
x=vector(QQbar, [1/3, 1/3, 1/3]); T^6*x
Sage prints out (4999/34992, 29993/69984, 29993/69984), and hence we deduce that
$$x_{(n+7)} \approx (0.142861, 0.428569, 0.428569)^T.$$
Consequently, the probability that Jonas follows a long track under the assumption that he did the same a week ago is $0.428569 \approx 3/7$. □

3.F.18. Car rental. A car rental company has two branches – one in Prague and one in Brno. A car rented in Brno can be returned in Prague and vice versa. After operating for some time, the company has observed that roughly 80% of the cars rented in Prague and 90% of the cars rented in Brno are finally returned in Prague. However, the strategy of the company requires that in both branches, at the start of every week, there should be the same number of cars as in the week before. How should the cars then be distributed?

Solution. Let us denote by $x_B$ and $x_P$ the initial numbers of cars in Brno and in Prague, respectively. Then $x = (x_B, x_P)^T$ is the vector describing the distribution of the cars between the two branches. Now, according to the statement, the matrix describing the (linear) system of car rental is given by
$$A = \begin{pmatrix} 0.1 & 0.2 \\ 0.9 & 0.8 \end{pmatrix}.$$
The state at the end of the week is given by $Ax$. However, at the end of the week the branches should have the same numbers of cars as at the beginning. This means that $x$ must satisfy $Ax = x$, so it must be an eigenvector of the matrix $A$ associated with the eigenvalue $\lambda_1 = 1$. We can save time on computations by using Sage as follows:
A = matrix(QQ, [[0.1, 0.2], [0.9,0.8]])
print(A.eigenvalues())
show(A.eigenvectors_right())
Sage's output for the eigenvectors has the form
$$\left[\left(1, \left[\left(1, \tfrac92\right)\right], 1\right),\ \left(-\tfrac{1}{10}, \left[(1, -1)\right], 1\right)\right],$$
and we deduce that the eigenvector corresponding to $\lambda_1$ is $(1, 9/2)^T$, so that $x = (1, 9/2)^T$ up to scale. Now, the percentage distribution of the cars is given by the normalized vector associated with $x$, that is, the vector whose entries sum to 1. This vector is given by
$$\frac{2}{11}\begin{pmatrix} 1 \\ 9/2 \end{pmatrix} = \begin{pmatrix} 0.181818 \\ 0.818182 \end{pmatrix}.$$
Therefore, the optimal distribution of the cars would have approximately 18% stationed in Brno and 82% in Prague. □
3.F.19. Suppose that two students, A and B, spend every Monday morning playing a certain computer game. The person who wins pays for both of them in the evening in the restaurant. The game can also be a draw – then each pays for half of the meal. The result of the previous game partially determines the next game. If student A won a week ago, then with probability $3/4$ he wins again, and with probability $1/4$ the game is a draw. A draw is repeated with probability $2/3$, and with probability $1/3$ the next game is won by B. Moreover, if student B won a game, then with probability $1/2$ he/she wins again, and with probability $1/4$ student A wins the next game. Determine the probability that today each of them pays half of the costs, given that the first game, played a long time ago, was won by A.

Solution. This is a Markov process with the following three states: “student A wins”, “the game ends in a draw”, “student B wins”. Labelling these states in this order as $\{1, 2, 3\}$, we arrive at the following transition matrix of the process:
$$T = \begin{pmatrix} 3/4 & 0 & 1/4 \\ 1/4 & 2/3 & 1/4 \\ 0 & 1/3 & 1/2 \end{pmatrix}.$$
We want to find the probability of the transition from the first state to the second after a large number $n \in \mathbb{N}$ of steps (weeks). Observe that the matrix $T$ is primitive, because
$$T^2 = \begin{pmatrix} 9/16 & 1/12 & 5/16 \\ 17/48 & 19/36 & 17/48 \\ 1/12 & 7/18 & 1/3 \end{pmatrix}.$$
Thus, it suffices to find the probabilistic eigenvector $x_\infty$ of the matrix $T$ associated with the eigenvalue $\lambda_1 = 1$. As before, executing the following block in Sage we compute the eigenvalues and eigenvectors of $T$:
T = matrix(QQbar, [[3/4, 0, 1/4], [1/4, 2/3, 1/4], [0, 1/3, 1/2]])
T.eigenvalues()
T.eigenvectors_right()
We deduce that the eigenvector associated with the eigenvalue $\lambda_1 = 1$ is the vector $(1, 3/2, 1)^T$, which we may normalize as $x_\infty = (\frac27, \frac37, \frac27)^T$. Recall now that for large $n$ the probabilistic vector differs only very slightly from $x_\infty$, and in particular it does not depend on the initial state. Indeed, for large $n \in \mathbb{N}$ we compute
$$T^n \approx \begin{pmatrix} 2/7 & 2/7 & 2/7 \\ 3/7 & 3/7 & 3/7 \\ 2/7 & 2/7 & 2/7 \end{pmatrix}.$$
The desired probability is the element of this matrix in the second row and first column (the second component of the vector $x_\infty$). Hence the answer is $3/7$. □
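A quick numeric sanity check of the limit matrix claimed above:
T = matrix(RDF, [[3/4, 0, 1/4], [1/4, 2/3, 1/4], [0, 1/3, 1/2]])
print(T^50)   # every column is approximately (2/7, 3/7, 2/7)^T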
3.F.20. Algorithm for determining the importance of pages. Internet browsers can find (almost) all pages containing a given word or phrase on the Internet. But how can the pages be sorted so that the list is ordered according to their relevance? One possibility is the following algorithm: the collection of all found pages is considered to be a system, and each of the found pages is one of its states. We describe a random walk on these pages as a Markov process. The transition probabilities between web pages are determined by hyperlinks: each link from page A to page B establishes the probability of moving from A to B as $\frac{1}{\text{total number of links from page A}}$. If a page has no outgoing links, it is treated as a page that links to every other page. This creates a probability matrix $M = (m_{ij})$, where the entry $m_{ij}$ represents the probability of moving from the $j$th page to the $i$th page (so the $j$th column of $M$ describes the transitions out of the $j$th page). Assume that a user randomly clicks on links, and chooses a random page when encountering a linkless page. In this case, the probability of being at the $i$th page at a sufficiently large time from the beginning is given by the $i$th component of the unit eigenvector of the matrix $M$ corresponding to the unit eigenvalue. The importance of the individual pages is then determined by the magnitudes of these probabilities.

This algorithm can be modified by assuming that the user eventually stops clicking on links after some time and starts again from a random page. Suppose that the user chooses a new page randomly with probability $d$, and with probability $1-d$ continues clicking on links. In this scenario, the probability $m_{ij}$ of transitioning from the page $S_j$ to the page $S_i$ is given by
$$m_{ij} = \begin{cases} \dfrac{d}{n} + \dfrac{1-d}{\text{total number of links from the page } S_j}, & \text{if from } S_j \text{ there is a link to } S_i, \\[2mm] \dfrac{d}{n}, & \text{otherwise.} \end{cases}$$
Note that $m_{ij} \neq 0$. Now, according to the Perron–Frobenius theorem, the eigenvalue 1 has multiplicity one and is dominant, and thus the corresponding eigenvector is unique (up to scale). For an illustration, consider pages A, B, C and D. The links lead from A to B and to C, from B to C, and from C to A; from D no links lead anywhere. Suppose that the probability that the user chooses a random new page is $1/5$. Then the matrix $M$ looks as follows:
$$M = \begin{pmatrix} 1/20 & 1/20 & 17/20 & 1/4 \\ 9/20 & 1/20 & 1/20 & 1/4 \\ 9/20 & 17/20 & 1/20 & 1/4 \\ 1/20 & 1/20 & 1/20 & 1/4 \end{pmatrix}.$$
We find that the eigenvector corresponding to the eigenvalue 1 is the vector $u = (305/53, 175/53, 315/53, 1)^T \in \mathbb{R}^4$. Hence, the importance of the pages is given by the order of the sizes of the corresponding components, that is, C > A > B > D.
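For this small example the random-surfer distribution can be approximated by plain power iteration, a sketch of which follows (100 steps is an arbitrary choice that is more than enough here):
M = matrix(RDF, [[1/20, 1/20, 17/20, 1/4],
                 [9/20, 1/20, 1/20, 1/4],
                 [9/20, 17/20, 1/20, 1/4],
                 [1/20, 1/20, 1/20, 1/4]])
x = vector(RDF, [1/4, 1/4, 1/4, 1/4])   # start at a uniformly random page
for _ in range(100):
    x = M * x                           # one more click of the surfer
print(x)   # proportional to u above; the ordering is C > A > B > D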
3.F.21. Ruin of a player. Two players, A and B, gamble for money repeatedly in a certain game which can result only in a victory of one of the players. The winning probability for player A in each individual game is $p \in [0, 1/2)$. Both players always bet the amount of €1. Consequently, after each game, player B gives €1 to player A with probability $p$, and player A gives €1 to player B with probability $1-p$. Suppose that at the start player A has €$x$, player B has €$y$, and that they play as long as they both have some money. Determine the probability that player A will lose all of his/her money.

Solution. This is the so-called ruin of a player, a special Markov chain (see also the exercise “Sweet-toothed gambler”) with many important applications. The probability in question is given by
$$\frac{1 - \left(\frac{p}{1-p}\right)^y}{1 - \left(\frac{p}{1-p}\right)^{x+y}}.$$
We can now investigate what this value is for specific choices of $p, x, y$. We give an example. Suppose that player B wants the probability that player A will lose the amount of €1,000,000 to be at least 0.999. Assume for instance that $p = 0.495$. Then for player B it suffices to have the amount of €346. If $p = 0.499$, this amount increases to €1,727. Therefore, it is possible in big casinos that “passionate” players play almost fair games. □
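The casino figures quoted above are easy to reproduce; here is a small sketch evaluating the ruin formula numerically:
def ruin_probability(p, x, y):
    s = p / (1 - p)                      # s < 1 whenever p < 1/2
    return (1 - s^y) / (1 - s^(x + y))   # probability that player A is ruined

print(ruin_probability(0.495, 10^6, 346))    # slightly above 0.999
print(ruin_probability(0.499, 10^6, 1727))   # slightly above 0.999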
3.F.22. In a certain game you can choose one of two opponents. The probability that you beat the better one is $1/4$, while the probability that you beat the worse one is $1/2$. But the opponents cannot be distinguished, so you do not know which one is the better one. You await a large number of games. For each of them you can choose a different opponent. Consider the following two strategies:
1. For the first game choose the opponent randomly. If you win a game, carry on with the same opponent; if you lose a game, change the opponent.
2. For the first two games, choose an opponent randomly. Then for the next two games, if you lost both the previous games, change the opponent; otherwise stay with the same opponent.
Which of the two strategies is better?

Solution. Both strategies define a Markov chain. For simplicity, denote the worse opponent by A and the better opponent by B. In the first case the state space has two states: “game with A” and “game with B”. In this order, we obtain the probabilistic transition matrix
$$\begin{pmatrix} 1/2 & 3/4 \\ 1/2 & 1/4 \end{pmatrix}.$$
This matrix has all of its elements positive. Thus it suffices to find the probabilistic vector $x_\infty$ associated with the eigenvalue 1. We compute $x_\infty = (\frac35, \frac25)^T$. Its components correspond to the probabilities that after a long sequence of games the opponent is A or B, respectively. Therefore, we can expect that 60% of the games will be played against the worse of the two opponents, and because
$$0.4 = \frac35 \cdot \frac12 + \frac25 \cdot \frac14,$$
roughly 40% of the games will be winning ones. For the second strategy, use the states “two games in a row with A” and “two games in a row with B”, which lead to the probabilistic transition matrix
$$\begin{pmatrix} 3/4 & 9/16 \\ 1/4 & 7/16 \end{pmatrix}.$$
In this case we compute $x_\infty = (\frac{9}{13}, \frac{4}{13})^T$. Against the worse opponent one would then play $(9/4)$-times more frequently than against the better one. Recall that for the first strategy it is $(3/2)$-times more frequently, which means that the second strategy is better. Note also that for the second strategy roughly 42.3% of the games are winning ones, as we can see from the relation
$$0.423 \approx \frac{11}{26} = \frac{9}{13}\cdot\frac12 + \frac{4}{13}\cdot\frac14. \quad \square$$

3.F.23. In a certain country there are two television channels. From a public survey it follows that in one year $1/6$ of the viewers of the first channel move to the second one, while $1/5$ of the viewers of the second channel move to the first one. Determine the time evolution of the numbers of viewers watching the channels, using Markov processes. Write down the matrix of the process, and find its eigenvalues and eigenvectors. ⃝

3.F.24. Daily casino routine. A female roulette player has the following strategy: she comes to a casino to play with €10. She always bets everything she has, and she always bets on black (there are 37 numbers in roulette: 18 black, 18 red and zero). The player stops whenever she has either nothing or €80. Consider this problem as a Markov process and write down its transition matrix.

Solution. In the course of the game and at its end, the player can have only one of the following amounts of money (in €): 0, 10, 20, 40, 80. If we view the situation as a Markov process, then these amounts correspond to the states of the process, and we construct the following transition matrix:
$$A = \begin{pmatrix} 1 & a & a & a & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & b & 0 & 0 & 0 \\ 0 & 0 & b & 0 & 0 \\ 0 & 0 & 0 & b & 1 \end{pmatrix},$$
with $a = \frac{19}{37}$ and $b = \frac{18}{37}$, respectively. Note that the matrix is probabilistic and singular, and the eigenvalue $\lambda_1 = 1$ has multiplicity two (it is the unique non-zero eigenvalue of $A$), as we see from the following block in Sage:
a = var('a'); b = var('b')
assume(a, 'real'); assume(b, 'real')
A = matrix(SR, 5, 5, [[1, a, a, a, 0], [0, 0, 0, 0, 0],
[0, b, 0, 0, 0], [0, 0, b, 0, 0], [0, 0, 0, b, 1]])
A.eigenvalues()
Executing this block we obtain the list [0, 0, 0, 1, 1]. Observe that the game does not converge to a single vector $x_\infty$, but ends in one of the eigenvectors associated with the eigenvalue $\lambda_1$, that is, either $(1, 0, 0, 0, 0)^T$ (the player loses it all) or $(0, 0, 0, 0, 1)^T$ (the player wins €80). We verify this as follows (after introducing the matrix $A$ as above):
A.eigenvectors_right()
In this case Sage's output has the form
[(0, [(1, 0, 0, -1/a, b/a)], 3), (1, [(1, 0, 0, 0, 0), (0, 0, 0, 0, 1)], 2)]
Furthermore, using Sage we can easily check that the game ends after three bets, that is, the sequence $\{A^n\}_{n=1}^\infty$ is constant for $n \geq 3$. For instance, you may type
A^3 == A^4
A^3 == A^5
and in both cases Sage prints True (and similarly for any other power $A^n$ with $n \geq 3$). Moreover, we see that $A^3$ is given by
[ 1 (a*b + a)*b + a a*b + a a 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 b^3 b^2 b 1]
which means that
$$A^\infty := A^3 = A^n = \begin{pmatrix} 1 & a + ab + ab^2 & a + ab & a & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & b^3 & b^2 & b & 1 \end{pmatrix}, \quad n \geq 3.$$
We deduce that the game ends as a loss with probability $a + ab + ab^2 \doteq 0.885$, and as a win of €80 with probability roughly $0.115$. This conclusion is obtained by multiplying the matrix $A^\infty$ with the initial vector $(0, 1, 0, 0, 0)^T$, which gives the vector $(a + ab + ab^2, 0, 0, 0, b^3)^T$. Notice that whether the player was female or male makes no difference to the result. □

3.F.25. Consider the situation from the previous case and assume that the probability of both win and loss is $1/2$. Denote by $A$ the matrix of the process. Without using any computational software, determine $A^{100}$. ⃝

3.F.26. Reliable products. A production line is not reliable: individual products differ in quality in a significant way. A certain worker tries to improve the quality of the products and intervenes in the process. The products are distributed into classes I, II and III according to their quality, and a report found out the following:
• After a product of class I, the next product is of the same quality in 80% of the cases, and of quality II in 10% of the cases.
• After a product of class II, the next product is of class II in 60% of the cases, and of quality I in 20% of the cases.
• After a product of class III, the next product is of quality III in 50% of the cases, while in 25% of the cases it is of class II.
Compute the probability that the 18th product is of quality I, given that the 16th product is of quality III.

Solution. There are at least two ways to obtain the answer. First, we solve the problem without using a Markov chain. Since the 16th product has quality III, the event in question is realized by the cases
• the 17th product has quality I and the 18th product has quality I,
• the 17th product has quality II and the 18th product has quality I,
• the 17th product has quality III and the 18th product has quality I,
with probabilities respectively
$$0.25 \cdot 0.8 = 0.2, \quad 0.25 \cdot 0.2 = 0.05, \quad 0.5 \cdot 0.25 = 0.125.$$
Thus the solution is $0.375 = 0.2 + 0.05 + 0.125$.

Let us now view the problem as a discrete Markov process. From the statement it follows that the transition matrix has the form
$$T = \begin{pmatrix} 0.8 & 0.2 & 0.25 \\ 0.1 & 0.6 & 0.25 \\ 0.1 & 0.2 & 0.5 \end{pmatrix}.$$
The situation in which the product is in class III is given by the probabilistic vector $(0, 0, 1)^T$. For the next product we obtain the probabilistic vector
$$\begin{pmatrix} 0.25 \\ 0.25 \\ 0.5 \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 & 0.25 \\ 0.1 & 0.6 & 0.25 \\ 0.1 & 0.2 & 0.5 \end{pmatrix}\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.$$
Finally, for the next product in order, we compute the vector
$$\begin{pmatrix} 0.375 \\ 0.3 \\ 0.325 \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 & 0.25 \\ 0.1 & 0.6 & 0.25 \\ 0.1 & 0.2 & 0.5 \end{pmatrix}\begin{pmatrix} 0.25 \\ 0.25 \\ 0.5 \end{pmatrix},$$
whose first component is the desired probability. Observe that the first method (without the Markov process) led to the result faster and more easily. But notice also how unwieldy it would become if we wanted to compute, say, the 22nd or the 30th product. For the second method one can, in a sense, restrict the computations to the relevant parts of the matrices only, instead of mindlessly multiplying whole matrices. When using the Markov process, we have also directly obtained the probabilities that the 18th product belongs to classes II and III. □
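In Sage the answer to 3.F.26 is a one-liner: the desired probability is the $(1,3)$ entry of $T^2$.
T = matrix(QQ, [[4/5, 1/5, 1/4], [1/10, 3/5, 1/4], [1/10, 1/5, 1/2]])
print((T^2)[0, 2])   # 3/8 = 0.375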
3.F.27. Suppose that there are two boxes, which together contain $n$ white and $n$ black balls, each box holding $n$ of them. At regular time intervals, a ball is drawn from each box and moved to the other box. For this Markov process, find its probabilistic transition matrix $T$.

Solution. This problem is often used in physics as a model for the mixing of two incompressible liquids (introduced by D. Bernoulli already in 1769), or, analogously, as a model of the diffusion of gases. Let the states $0, 1, \dots, n$ correspond to the number of white balls in the first box. This information already determines how many black balls are in the first box (the remaining balls are then in the second box). If, in a given step, the state changes from $j \in \{1, \dots, n\}$ to $j - 1$, then a white ball was drawn from the first box and a black ball from the second. This happens with probability
\[ \frac{j}{n} \cdot \frac{j}{n} = \frac{j^2}{n^2}. \]
The transition from the state $j \in \{0, \dots, n-1\}$ to the state $j + 1$ corresponds to drawing a black ball from the first box and a white ball from the second, with probability
\[ \frac{n-j}{n} \cdot \frac{n-j}{n} = \frac{(n-j)^2}{n^2}. \]
The system stays in the state $j \in \{1, \dots, n-1\}$ if balls of the same colour are drawn from both boxes, which happens with probability
\[ \frac{j}{n}\cdot\frac{n-j}{n} + \frac{n-j}{n}\cdot\frac{j}{n} = \frac{2j(n-j)}{n^2}. \]
Notice that from the state $0$ the process necessarily (with probability $1$) moves to the state $1$, and similarly from the state $n$ it moves with probability one to the state $n - 1$. In summary, ordering the states $0, 1, \dots, n$, we obtain the following $(n+1)\times(n+1)$ transition matrix (the common factor $1/n^2$ is written in front of the matrix):
\[ T = \frac{1}{n^2}\begin{pmatrix}
0 & 1 & 0 & \cdots & 0 & 0 & 0 \\
n^2 & 2\cdot1(n-1) & 2^2 & \cdots & 0 & 0 & 0 \\
0 & (n-1)^2 & 2\cdot2(n-2) & \ddots & 0 & 0 & 0 \\
\vdots & \vdots & \ddots & \ddots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \ddots & 2\cdot(n-2)2 & (n-1)^2 & 0 \\
0 & 0 & 0 & \cdots & 2^2 & 2\cdot(n-1)1 & n^2 \\
0 & 0 & 0 & \cdots & 0 & 1 & 0
\end{pmatrix}. \]
In physics we are of course interested in the distribution of the balls between the boxes after a certain time (number of drawings). If the initial state is, for instance, $0$, we can use this model and the powers of the matrix $T$ to observe with what probability the number of white balls in the first box increases. We can confirm the expected result that the initial distribution of the balls influences their distribution after a long time only in a very negligible way. □
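To get a concrete feeling for the result, here is a small hedged sketch in Sage that builds $T$ for one chosen value of $n$ (the choice $n = 3$ is an assumption made purely for illustration) and checks that each column sums to $1$:

n = 3
T = matrix(QQ, n+1, n+1,
           lambda i, j: j^2/n^2 if i == j-1
                        else ((n-j)^2/n^2 if i == j+1
                        else (2*j*(n-j)/n^2 if i == j else 0)))
print(T)
print(all(sum(c) == 1 for c in T.columns()))   # True: T is probabilistic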
D) Material on linear algebra – unitary spaces, orthogonality, Jordan forms

3.F.28. The Cauchy–Schwarz inequality. Let $(V, \langle\,,\rangle)$ be a unitary space over $\mathbb{R}$. Describe an alternative proof of the Cauchy inequality, in comparison with the one presented in 3.4.3.

Solution. We should prove the inequality $|\langle x, y\rangle| \le \|x\|\,\|y\|$ for any two vectors $x, y \in V$. We may assume that $y \ne 0$, since otherwise the statement is trivial. For a scalar $t \in \mathbb{R}$ define the function
\[ \zeta(t) = \|ty - x\|^2 = \langle ty - x, ty - x\rangle = t^2\|y\|^2 - t\langle y, x\rangle - t\langle x, y\rangle + \|x\|^2. \]
Since $(V, \langle\,,\rangle)$ is a real unitary space we have $\langle y, x\rangle = \langle x, y\rangle$, and hence $\zeta(t) = t^2\|y\|^2 - 2t\langle x, y\rangle + \|x\|^2$. As $y \ne 0$, setting $a = \|y\|^2$, $b = 2\langle y, x\rangle$ and $c = \|x\|^2$ we obtain a quadratic polynomial $\zeta(t) = at^2 - bt + c$. However, $\|\cdot\|$ is a norm and thus $\|ty - x\|^2 \ge 0$, that is, $\zeta(t) \ge 0$ for all $t$. Therefore the discriminant of $\zeta(t)$ must be non-positive: $\Delta = b^2 - 4ac \le 0$, which is equivalent to $4ac \ge b^2$. This means $4\|y\|^2\|x\|^2 \ge 4\langle y, x\rangle^2$, or equivalently $\|y\|\,\|x\| \ge |\langle x, y\rangle|$, and the proof is complete. □

The next problem is about the so-called polarization identities on a unitary space. According to these formulas one can recover the inner product from the norm. The polarization identities can be used in many situations, see for example 3.F.32 (and see also Chapter 4 for a description in terms of quadratic forms).

3.F.29. Polarization identities. Let $(V, \langle\,,\rangle)$ be a unitary space.
(i) If $V$ is defined over $\mathbb{R}$, so that $(V, \langle\,,\rangle)$ is a real unitary space, prove that
\[ 4\langle u, v\rangle = \|u + v\|^2 - \|u - v\|^2, \qquad u, v \in V. \]
(ii) If $V$ is defined over $\mathbb{C}$, so that $(V, \langle\,,\rangle)$ is a complex unitary space, prove that
\[ 4\langle u, v\rangle = \|u + v\|^2 - \|u - v\|^2 + i\|u + iv\|^2 - i\|u - iv\|^2, \qquad u, v \in V. \]

Solution. (i) By assumption, $(V, \langle\,,\rangle)$ is a real inner product space, hence $\langle u, v\rangle = \langle v, u\rangle$ and by bilinearity we get
\[ \|u + v\|^2 - \|u - v\|^2 = \big(\|u\|^2 + 2\langle u, v\rangle + \|v\|^2\big) - \big(\|u\|^2 - 2\langle u, v\rangle + \|v\|^2\big) = 4\langle u, v\rangle. \]
(ii) In this case $(V, \langle\,,\rangle)$ is a complex inner product space and in general $\langle u, v\rangle \ne \langle v, u\rangle$. We compute
\begin{align*}
\|u + v\|^2 &= \|u\|^2 + \langle u, v\rangle + \langle v, u\rangle + \|v\|^2,\\
-\|u - v\|^2 &= -\|u\|^2 + \langle u, v\rangle + \langle v, u\rangle - \|v\|^2,\\
i\|u + iv\|^2 &= i\|u\|^2 + i\langle u, iv\rangle + i\langle iv, u\rangle + i\|iv\|^2 = i\|u\|^2 + \langle u, v\rangle - \langle v, u\rangle + i\|v\|^2,\\
-i\|u - iv\|^2 &= -i\|u\|^2 + \langle u, v\rangle - \langle v, u\rangle - i\|v\|^2,
\end{align*}
where we used $\langle u, iv\rangle = \bar{i}\langle u, v\rangle = -i\langle u, v\rangle$ and $\langle iv, u\rangle = i\langle v, u\rangle$. Adding these relations, we arrive at the desired identity
\[ 4\langle u, v\rangle = \|u + v\|^2 - \|u - v\|^2 + i\|u + iv\|^2 - i\|u - iv\|^2. \] □

Beyond the theory of unitary spaces, the notion of a norm appears in many other areas of mathematics. For instance, one could not imagine the theory of normed vector spaces without it. This is a topic that we will analyze in Chapter 7, but since we are already familiar with scalar products we can present a few exercises on norms induced by scalar products. An important remark at this point is that not all norms are of this type; see Problem 3.F.30, which establishes a very first link with the material that we are going to treat in Chapter 7.

3.F.30. Parallelogram law. Prove that the following relation holds on any unitary vector space $(V, \langle\,,\rangle)$:
\[ \|u + v\|^2 + \|u - v\|^2 = 2\big(\|u\|^2 + \|v\|^2\big), \tag{†} \]
for any $u, v \in V$, where $\|\cdot\|$ denotes the norm induced by $\langle\,,\rangle$. Then show by an example that there are norms which are not induced by an inner product.

Solution. Even a high-school student can solve the first task, at least for the real vector space $\mathbb{R}^n$: it follows directly from the properties of a scalar product, and the details of the proof are left to the reader. The second task is a bit more demanding. Consider for example $\mathbb{R}^2$ and set $\|(x_1, x_2)\|_\infty := \max\{|x_1|, |x_2|\}$ for any $x = (x_1, x_2)^T \in \mathbb{R}^2$. We can easily show that this is a norm, since it satisfies the characteristic properties of a norm. For example, the triangle inequality follows from
\[ \|x + y\|_\infty = \max\{|x_1 + y_1|, |x_2 + y_2|\} \le \max\{|x_1| + |y_1|, |x_2| + |y_2|\} \le \|x\|_\infty + \|y\|_\infty. \]
However, this norm is not induced by an inner product on $\mathbb{R}^2$, since it does not satisfy the relation (†) posed above. To see this, consider for example the vectors $x = (1, 0)^T$ and $y = (0, 1/2)^T$. Then we compute $\|x + y\|_\infty^2 + \|x - y\|_\infty^2 = 2$, but $2(\|x\|_\infty^2 + \|y\|_\infty^2) = 5/2$. □

3.F.31. Let $F : \mathrm{Mat}_2(\mathbb{R}) \to \mathbb{R}_2[x]$ be the mapping defined by
\[ F\begin{pmatrix} a & b \\ c & d \end{pmatrix} = (a + d) + (a + b)x + 2(c + d)x^2, \qquad a, b, c, d \in \mathbb{R}. \]
After verifying that $F$ is linear, find the dimension of the null space $\ker(F) \subset \mathrm{Mat}_2(\mathbb{R})$ of $F$. Next provide an orthonormal basis of $\ker(F)$, with respect to the Frobenius scalar product $B(A, B) = \mathrm{tr}(B^TA)$ on $\mathrm{Mat}_2(\mathbb{R})$. ⃝
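Returning to Problem 3.F.30, the counterexample can also be checked numerically. The following short Sage cell is a sketch of such a verification:

x = vector(QQ, [1, 0]); y = vector(QQ, [0, 1/2])
sup = lambda v: max(abs(c) for c in v)   # the sup-norm on R^2
lhs = sup(x + y)^2 + sup(x - y)^2        # equals 2
rhs = 2*(sup(x)^2 + sup(y)^2)            # equals 5/2
print(lhs, rhs, lhs == rhs)              # the parallelogram law (†) fails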
3.F.32. Linear isometries. Let $(V, \langle\,,\rangle_V)$ and $(W, \langle\,,\rangle_W)$ be two (finite-dimensional) unitary spaces. For a linear mapping $\psi : V \to W$ prove the following:
(a) $\psi$ is a linear isometry if and only if $\|u\|_V = \|\psi(u)\|_W$ for any $u \in V$, i.e., $\psi$ is norm (equivalently, distance) preserving;
(b) $\psi$ is a linear isometry if and only if $\psi^*\psi = I_V$, where $I_V$ is the identity map on $V$.

Solution. (a) Let $u \in V$ and assume that $\psi$ is an isometry. This means that $\langle u, v\rangle_V = \langle\psi(u), \psi(v)\rangle_W$ for any $u, v \in V$, and we see that
\[ \|\psi(u)\|_W^2 = \langle\psi(u), \psi(u)\rangle_W = \langle u, u\rangle_V = \|u\|_V^2, \]
hence $\|\psi(u)\|_W = \|u\|_V$ for any $u \in V$. Conversely, assume that $\|u\|_V = \|\psi(u)\|_W$ for any $u \in V$. Recall from 3.F.29 that the scalar product is determined by the norm via the formula
\[ 4\langle u, v\rangle_V = \|u + v\|_V^2 - \|u - v\|_V^2 + i\|u + iv\|_V^2 - i\|u - iv\|_V^2, \]
for any $u, v \in V$ (with the identity of 3.F.29(i) in the real case), and similarly for $\langle\,,\rangle_W$. Because $\psi$ is linear and preserves norms, a direct calculation via this formula certifies that it preserves the inner products as well, so $\psi$ is an isometry.
(b) Suppose that $\psi$ satisfies $\psi^*\psi = I_V$. Then, for any $u, v \in V$ we have
\[ \|u - v\|_V^2 = \langle u - v, u - v\rangle_V = \langle u - v, (\psi^*\psi)(u - v)\rangle_V = \langle\psi(u - v), \psi(u - v)\rangle_W = \|\psi(u) - \psi(v)\|_W^2. \]
Thus $\|u - v\|_V = \|\psi(u) - \psi(v)\|_W$, which shows that $\psi$ is an isometry. Conversely, if $\psi : V \to W$ is an isometry, then $\langle(\psi^*\psi)(u), v\rangle_V = \langle\psi(u), \psi(v)\rangle_W = \langle u, v\rangle_V$ for any $u, v \in V$. Since $\langle\,,\rangle_V$ is an inner product, this implies $(\psi^*\psi)(u) = u$ for any $u \in V$. Thus $\psi^*\psi = I_V$. □

3.F.33. Consider the vector $v = \big(0, \frac{\sqrt{2}}{2}, -\frac{\sqrt{2}}{2}\big)^T \in \mathbb{R}^3$. Find an orthogonal operator $F : \mathbb{R}^3 \to \mathbb{R}^3$ such that $F(v) = e_1$, where $e_1 = (1, 0, 0)^T$ is the first vector of the standard basis of $\mathbb{R}^3$. Next confirm your result using Sage.

Solution. Let $A$ be the $3\times3$ matrix corresponding to $F$, such that $F(u) = Au$ for all $u \in \mathbb{R}^3$. Since $F$ should be orthogonal, $A$ satisfies $AA^T = E$, where $E$ is the $3\times3$ identity matrix. Recall that the columns of an orthogonal $n\times n$ matrix form an orthonormal basis of $\mathbb{R}^n$. For the given vector $v$ we know that $F(v) = Av = e_1$, from where we get $v = A^Te_1$. Thus $v$ sits in the first column of $A^T$, and we may assume that $A^T = (v\ v_1\ v_2)$ for some vectors $v_1, v_2$ in $\mathbb{R}^3$ that we should specify. These vectors should be orthogonal to $v$, i.e., $v_1, v_2 \in v^\perp$, and it is easy to see that any vector $(x_1, x_2, x_3)^T$ in the orthogonal complement $v^\perp$ satisfies the equation $\frac{\sqrt{2}}{2}x_2 - \frac{\sqrt{2}}{2}x_3 = 0$, or equivalently $x_2 - x_3 = 0$. Thus the solution space has the form
\[ \left\{ \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} t \\ s \\ s \end{pmatrix} = t\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + s\begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} : t, s \in \mathbb{R} \right\}, \]
and we deduce that $v^\perp$ is generated by the vectors $(1, 0, 0)^T$ and $(0, 1, 1)^T$. Obviously, these vectors are orthogonal to each other. Setting $v_1 = (1, 0, 0)^T$ and $v_2 = \big(0, \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\big)^T$, so that $\|v_1\| = 1 = \|v_2\|$ and $v_1 \perp v_2$, we finally obtain
\[ A^T = \begin{pmatrix} 0 & 1 & 0 \\ \frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \\ -\frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \end{pmatrix} \implies A = \begin{pmatrix} 0 & \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \\ 1 & 0 & 0 \\ 0 & \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{pmatrix}. \]
Using this expression we can verify that $Av = e_1$. Consequently, if $u = (x, y, z)^T$ is an arbitrary vector of $\mathbb{R}^3$, we have
\[ F(u) = Au = \begin{pmatrix} \frac{\sqrt{2}}{2}(y - z) \\ x \\ \frac{\sqrt{2}}{2}(y + z) \end{pmatrix}. \]
For a confirmation in Sage, use the cell given here:

A = matrix([[0, sqrt(2)/2, -sqrt(2)/2], [1, 0, 0], [0, sqrt(2)/2, sqrt(2)/2]])
v = vector([0, sqrt(2)/2, -sqrt(2)/2]); e1 = vector([1, 0, 0])
print(A.is_unitary()); print(A*v == e1)

□

Let us now analyze an example related to the notion of matrix groups $G \subset \mathrm{Gl}_n(\mathbb{F})$, where $\mathbb{F} \in \{\mathbb{R}, \mathbb{C}\}$ and $\mathrm{Gl}_n(\mathbb{F})$ is the general linear group.
In this part, to prove the crucial property of "closedness" that characterizes such a group $G$, we require a basic understanding of continuous functions, as well as some familiarity with closed subsets of Euclidean space. These concepts are explored in Chapter 5, and our tasks below establish a first link between analysis and group theory. Notably, we have already encountered the concept of "closedness" in the proof of the lemma presented in 3.3.3, and the reader may find it helpful to refer to a similar explanation below. (Recall that we have an inclusion $\mathrm{Gl}_n(\mathbb{R}) \subset \mathbb{R}^{n^2} \cong \mathrm{Mat}_n(\mathbb{R})$.)

3.F.34. Prove that the orthogonal group $O(n)$ is a matrix group, that is, a closed subgroup of $\mathrm{Gl}_n(\mathbb{R})$.

Solution. Hints: The fact that $O(n)$ is a subgroup was discussed in 3.D.18. Consider the map
\[ \rho : \mathrm{Mat}_n(\mathbb{R}) \to \mathrm{Sym}(n, \mathbb{R}), \qquad A \mapsto AA^T, \]
where $\mathrm{Sym}(n, \mathbb{R})$ is the set of $n\times n$ symmetric matrices with real entries. First, establish a linear isomorphism between $\mathrm{Sym}(n, \mathbb{R})$ and the Euclidean space $\mathbb{R}^{\frac{1}{2}n(n+1)}$. Then deduce that the function $\rho$ is continuous and, moreover, that $O(n) = \rho^{-1}(\{E\})$, where $E$ is the $n\times n$ identity matrix. Since the preimage of a closed set under a continuous map is closed, this shows that $O(n)$ is closed in $\mathrm{Gl}_n(\mathbb{R})$, thereby confirming that $O(n)$ is indeed a matrix group. □

3.F.35. Matrix of a general rotation in $\mathbb{R}^3$. Derive the matrix of a general rotation in $\mathbb{R}^3$.

Solution. Consider an arbitrary unit vector $(x, y, z)^T \in \mathbb{R}^3$. The rotation in the positive sense by an angle $\varphi$ about this vector can be written as a composition of the following rotations, whose matrices we already know:
• The rotation $R_1$ in the negative sense about the $z$ axis through the angle with cosine equal to $x/\sqrt{x^2 + y^2} = x/\sqrt{1 - z^2}$, that is, with sine $y/\sqrt{1 - z^2}$, under which the line with directional vector $(x, y, z)^T$ goes over to the line with directional vector $(\sqrt{1 - z^2}, 0, z)^T$. The matrix of this rotation has the form
\[ R_1 = \begin{pmatrix} x/\sqrt{1 - z^2} & y/\sqrt{1 - z^2} & 0 \\ -y/\sqrt{1 - z^2} & x/\sqrt{1 - z^2} & 0 \\ 0 & 0 & 1 \end{pmatrix}. \]
• The rotation $R_2$ in the positive sense about the $y$ axis through the angle with cosine $\sqrt{1 - z^2}$, that is, with sine $z$, under which the line with directional vector $(\sqrt{1 - z^2}, 0, z)^T$ goes over to the line with directional vector $(1, 0, 0)^T$. The matrix of this rotation is given by
\[ R_2 = \begin{pmatrix} \sqrt{1 - z^2} & 0 & z \\ 0 & 1 & 0 \\ -z & 0 & \sqrt{1 - z^2} \end{pmatrix}. \]
• The rotation $R_3$ in the positive sense about the $x$ axis through the angle $\varphi$, with matrix
\[ R_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & -\sin\varphi \\ 0 & \sin\varphi & \cos\varphi \end{pmatrix}. \]
• The rotation $R_2^{-1}$ with the matrix $R_2^{-1}$,
• the rotation $R_1^{-1}$ with the matrix $R_1^{-1}$.
The matrix of the composition of these mappings, that is, the matrix we are looking for, is given by the product of the rotation matrices in the reverse order:
\[ R_1^{-1} \cdot R_2^{-1} \cdot R_3 \cdot R_2 \cdot R_1 = \begin{pmatrix} 1 - t + tx^2 & txy - zs & txz + ys \\ txy + zs & 1 - t + ty^2 & tyz - xs \\ txz - ys & tyz + xs & 1 - t + tz^2 \end{pmatrix}, \]
where $t = 1 - \cos\varphi$ and $s = \sin\varphi$. □
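Before moving on, we may test the derived formula in Sage. The sketch below uses an arbitrarily chosen unit vector (an assumption made for illustration only) and relies on simplify_full() to reduce the trigonometric entries; it verifies that the resulting matrix is orthogonal and fixes the rotation axis:

var('phi')
x, y, z = 1/3, 2/3, 2/3            # a unit vector: x^2 + y^2 + z^2 = 1
t = 1 - cos(phi); s = sin(phi)
R = matrix([[1 - t + t*x^2, t*x*y - z*s, t*x*z + y*s],
            [t*x*y + z*s, 1 - t + t*y^2, t*y*z - x*s],
            [t*x*z - y*s, t*y*z + x*s, 1 - t + t*z^2]])
print((R*R.T).simplify_full() == identity_matrix(3))               # True
print((R*vector([x, y, z])).simplify_full() == vector([x, y, z]))  # the axis is fixed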
3.F.36. Find the matrix of the rotation in the positive sense by the angle $\pi/3$ about the line passing through the origin with the oriented directional vector $(1, 1, 0)^T$, with respect to the standard basis of $\mathbb{R}^3$. ⃝

3.F.37. Using basic plotting functions in Sage, visualize the 2-dimensional square in $\mathbb{R}^3$ having as vertices the points $[1, 1, 0]$, $[1, -1, 0]$, $[-1, -1, 0]$ and $[-1, 1, 0]$, as it rotates around the $x$, $y$ and $z$ axes. Hint: Use the matrices $R_x$, $R_y$ and $R_z$ presented in the solution of 3.D.19.

Solution. We should define the given square, then apply the rotation matrices from 3.D.19 and finally plot the rotated squares. It is easiest to fix the rotation angle, say $\theta = \pi/4$, in order to visualize our goal by a figure. In the resulting figure, the initial square is shown in black, the square rotated about the $x$-axis is shown in red, the square rotated around the $y$-axis is in green, and the square rotated around the $z$-axis is in blue. Let us present the code and include the necessary explanations within the code block.

# Define the vertices of the square in the xy-plane
square_vertices = [(1, 1, 0), (1, -1, 0), (-1, -1, 0), (-1, 1, 0)]
# Define the edges of the square
square_edges = [(square_vertices[0], square_vertices[1]),
                (square_vertices[1], square_vertices[2]),
                (square_vertices[2], square_vertices[3]),
                (square_vertices[3], square_vertices[0])]
# Define the rotation angle in radians
theta = pi / 4  # 45 degrees
# Rotation matrices
Rx = Matrix([[1, 0, 0], [0, cos(theta), -sin(theta)], [0, sin(theta), cos(theta)]])
Ry = Matrix([[cos(theta), 0, sin(theta)], [0, 1, 0], [-sin(theta), 0, cos(theta)]])
Rz = Matrix([[cos(theta), -sin(theta), 0], [sin(theta), cos(theta), 0], [0, 0, 1]])

# Function to rotate the vertices and create a plot for each axis
def plot_rotation(rotation_matrix, color, label):
    rotated_edges = []
    for start, end in square_edges:
        start_rotated = rotation_matrix * vector(start)
        end_rotated = rotation_matrix * vector(end)
        rotated_edges.append((start_rotated, end_rotated))
    return sum([line([edge[0], edge[1]], color=color, thickness=2, legend_label=label)
                for edge in rotated_edges])

# Plot the original and rotated squares
original_plot = plot_rotation(Matrix.identity(3), "black", "Original")
rotated_x_plot = plot_rotation(Rx, "red", "Rotation around x-axis")
rotated_y_plot = plot_rotation(Ry, "green", "Rotation around y-axis")
rotated_z_plot = plot_rotation(Rz, "blue", "Rotation around z-axis")
# Combine the plots and show them
final_plot = original_plot + rotated_x_plot + rotated_y_plot + rotated_z_plot
final_plot.show(aspect_ratio=1, frame=True, legend=True)

The most advanced part of our program is the definition of the plot_rotation function. This is designed to take a rotation matrix, a colour, and a label as inputs. It then applies the rotation matrix to each edge of the given square, creates lines for these rotated edges, and returns a plot object that can be displayed, making it easier to visualize the effect of the rotation in $\mathbb{R}^3$. The command rotated_edges = [] initializes an empty list that will store the pairs of points (edges) after they have been transformed by the rotation matrix. Next, the syntax for start, end in square_edges: forms a loop over each edge of the square. In particular, square_edges is a list of tuples, where each tuple represents an edge of the square as a pair of vertex coordinates; the loop iterates over each edge, extracting its start and end points (start, end). On the other hand, the lines

start_rotated = rotation_matrix * vector(start)
end_rotated = rotation_matrix * vector(end)

play the following role: each operation multiplies the rotation matrix by a vector created from an endpoint of the edge. The vector() function converts the coordinate tuple into a vector object, allowing for matrix multiplication.
Thus start_rotated holds the new vector representing the rotated position of the start point, while end_rotated holds the rotated position of the end point. Next, the command rotated_edges.append((start_rotated, end_rotated)) adds the newly rotated edge, represented as a tuple of the rotated start and end points, to the rotated_edges list. Finally, the last line within the def environment creates the line plots for each rotated edge: it builds a list of line objects, each representing a line plot of one edge. To summarize, the plot_rotation function efficiently rotates the square's vertices using a specified rotation matrix, then plots each rotated edge with a given colour and label. You can experiment further by changing the rotation angle $\theta$ from the value we used, i.e., $\theta = \pi/4$. □

3.F.38. Let $(V, \langle\,,\rangle)$ be a unitary vector space and suppose that $\varphi : V \to V$ is a linear mapping with the property $\varphi^2 = \varphi$. Prove that there exists a linear subspace $W \subset V$ such that $\varphi$ is the orthogonal projection onto $W$, i.e., $\varphi = \mathrm{proj}_W$, if and only if $\varphi$ is self-adjoint.

Solution. Assume that $\varphi$ is self-adjoint. Then, by linearity and the condition $\varphi^2 = \varphi$, we see that $\varphi(u - \varphi(u)) = \varphi(u) - \varphi^2(u) = 0$ for any $u \in V$. Thus $u - \varphi(u) \in \ker(\varphi)$. Recall now from 3.F.41 that any endomorphism $\varphi$ of $V$ satisfies $\ker(\varphi^*) = (\mathrm{Im}(\varphi))^\perp$ and $\mathrm{Im}(\varphi^*) = (\ker(\varphi))^\perp$. By the second relation we get $\ker(\varphi) = (\mathrm{Im}(\varphi^*))^\perp$, and since $\varphi$ is self-adjoint this gives $\ker(\varphi) = (\mathrm{Im}(\varphi))^\perp$. Thus, writing
\[ u = \varphi(u) + (u - \varphi(u)), \]
we have $\varphi(u) \in \mathrm{Im}(\varphi)$ and $u - \varphi(u) \in (\mathrm{Im}(\varphi))^\perp$. Since $V = \mathrm{Im}(\varphi) \oplus (\mathrm{Im}(\varphi))^\perp$, we deduce that $\varphi(u) = \mathrm{proj}_W u$ for any $u \in V$, where $W = \mathrm{Im}(\varphi) \subset V$. This proves one direction. For the converse, assume that there exists a subspace $W \subset V$ such that $\varphi = \mathrm{proj}_W$. Let $W^\perp$ be the orthogonal complement of $W$ with respect to $\langle\,,\rangle$. Given arbitrary $u, v \in V$, write $u = u_1 + u_2$ and $v = v_1 + v_2$ with $u_1, v_1 \in W$ and $u_2, v_2 \in W^\perp$, respectively. Then, since $\varphi = \mathrm{proj}_W$, we have $\varphi(u) = u_1$, $\varphi(v) = v_1$, and thus
\[ \langle\varphi(u), v\rangle = \langle u_1, v_1 + v_2\rangle = \langle u_1, v_1\rangle = \langle u_1 + u_2, v_1\rangle = \langle u, v_1\rangle = \langle u, \varphi(v)\rangle. \]
Thus $\varphi = \varphi^*$ and $\varphi$ is self-adjoint. □

3.F.39. Show that for any symmetric matrix $A \in \mathrm{Mat}_n(\mathbb{R})$ the operator $L_A : \mathbb{R}^n \to \mathbb{R}^n$ defined by $L_Ax = Ax$ ($x \in \mathbb{R}^n$) is self-adjoint. ⃝

3.F.40. For $A \in \mathrm{Mat}_m(\mathbb{C})$ Hermitian, prove that the linear operator $L_A : \mathrm{Mat}_{m,n}(\mathbb{C}) \to \mathrm{Mat}_{m,n}(\mathbb{C})$ defined by $L_A(B) = AB$ is self-adjoint. ⃝

3.F.41. Let $f : V \to V$ be an endomorphism of a (finite-dimensional) inner product space $(V, \langle\,,\rangle)$. Prove that $\ker(f^*) = (\mathrm{Im}(f))^\perp$ and $\mathrm{Im}(f^*) = (\ker(f))^\perp$. ⃝

3.F.42. Let $\varphi_u : \mathbb{R}^n \to \mathbb{R}^n$ be the linear mapping that reflects $\mathbb{R}^n$ through the line in the direction of the unit vector $u \in \mathbb{R}^n$. Show that its matrix $[\varphi_u] = 2uu^T - E$ is unitary. ⃝

3.F.43. Let $A \in \mathrm{Mat}_n(\mathbb{C})$ be a complex matrix. Prove that the product $A^*A$ has only real eigenvalues. Next suppose that there exists a unitary matrix $U$ such that $A = U^*DU$ for some diagonal matrix $D$. Show that $A$ is normal. ⃝

3.F.44. Find the values of the complex parameters $a, b, c$ such that the matrix $A$ is Hermitian, where
\[ A = \begin{pmatrix} 1 & a & -i \\ 2 - 2i & 0 & b \\ c & 1 + i & 0 \end{pmatrix}. \] ⃝
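Several of the exercises above are easy to probe experimentally before proving them. For instance, the following Sage sketch checks the claim of 3.F.42 for one sample unit vector $u \in \mathbb{R}^3$ (the chosen $u$ is an assumption made purely for illustration):

u = vector(QQ, [2/3, 2/3, 1/3])                  # a unit vector
M = 2*u.outer_product(u) - identity_matrix(QQ, 3)
print(M*M.T == identity_matrix(3))               # M is orthogonal (unitary)
print(M*u == u)                                  # the line through u is fixed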
3.F.45. Given a Hermitian matrix $A$, show that its determinant $\det(A)$ is a real number.

Solution. The matrix $A$ is supposed to be Hermitian, and hence unitarily diagonalizable. Thus $U^*AU = D$ for some unitary matrix $U$ and a diagonal (real) matrix $D$ (since $A$ has only real eigenvalues). Then, since $U^{-1} = U^*$, a direct computation shows that $A = U^*DU$. Therefore,
\[ \det(A) = \det(U^*)\det(D)\det(U) = \det(U^{-1})\det(D)\det(U) = \det(D), \]
which is real, and our claim follows. □

So far, we have explored various methods in Sage for applying the Gram–Schmidt orthogonalization process. However, we have not yet discussed the gram_schmidt() method, which orthogonalizes the rows of a matrix. When the gram_schmidt() method is applied to a matrix, it returns two matrices: a matrix whose rows are mutually orthogonal (but not necessarily of unit length) and a matrix containing the coefficients used in the Gram–Schmidt process to express the original rows as linear combinations of the orthogonalized rows. Since we prefer to work with eigenvectors as columns, in 3.F.46 below we will apply this method to a transposed matrix and use only the first of the two resulting matrices.

3.F.46. Present an orthogonal diagonalization of the following symmetric matrix:
\[ A = \begin{pmatrix} -\frac{1}{3} & 1 & 0 \\ 1 & -\frac{1}{3} & 0 \\ 0 & 0 & -\frac{1}{3} \end{pmatrix}. \]
Next implement the task in Sage using the gram_schmidt() method mentioned above.

Solution. The characteristic polynomial of $A$ is given by
\[ \chi_A(\lambda) = \det\begin{pmatrix} -\frac{1}{3} - \lambda & 1 & 0 \\ 1 & -\frac{1}{3} - \lambda & 0 \\ 0 & 0 & -\frac{1}{3} - \lambda \end{pmatrix} = -\Big(\lambda + \frac{1}{3}\Big)\Big(\lambda^2 + \frac{2}{3}\lambda - \frac{8}{9}\Big) = -\Big(\lambda + \frac{1}{3}\Big)\Big(\lambda + \frac{4}{3}\Big)\Big(\lambda - \frac{2}{3}\Big). \]
Thus, the eigenvalues of $A$ are $\lambda_1 = \frac{2}{3}$, $\lambda_2 = -\frac{1}{3}$ and $\lambda_3 = -\frac{4}{3}$, all with algebraic multiplicity one. The geometric multiplicity of each $\lambda_i$ is then also one, for $i = 1, 2, 3$, and hence the matrix $A$ is diagonalizable. Let us find the corresponding eigenvectors.
• For $\lambda_1 = \frac{2}{3}$ we need to solve the matrix equation $(A - \frac{2}{3}E)u = 0$ for a vector $u = (x_1, x_2, x_3)^T \in \mathbb{R}^3$. We see that
\[ A - \tfrac{2}{3}E = \begin{pmatrix} -1 & 1 & 0 \\ 1 & -1 & 0 \\ 0 & 0 & -1 \end{pmatrix}, \]
and the corresponding linear system has the solution space $\{(t, t, 0)^T : t \in \mathbb{R}\}$. Hence, the eigenvectors corresponding to $\lambda_1$ are multiples of the vector $u_1 = (1, 1, 0)^T$.
• For $\lambda_2 = -\frac{1}{3}$ we have the matrix equation $(A + \frac{1}{3}E)u = 0$, where
\[ A + \tfrac{1}{3}E = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}. \]
The solution space of the corresponding linear system is $\{(0, 0, s)^T : s \in \mathbb{R}\}$. Hence, the eigenvectors corresponding to $\lambda_2$ are multiples of the vector $u_2 = (0, 0, 1)^T$.
• For $\lambda_3 = -\frac{4}{3}$ we have the matrix equation $(A + \frac{4}{3}E)u = 0$, where
\[ A + \tfrac{4}{3}E = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \]
The solution space of the corresponding linear system is $\{(r, -r, 0)^T : r \in \mathbb{R}\}$. Hence, the eigenvectors corresponding to $\lambda_3$ are multiples of the vector $u_3 = (1, -1, 0)^T$.
In Sage we can confirm these results as follows:

A = matrix(QQ, [[-1/3, 1, 0], [1, -1/3, 0], [0, 0, -1/3]])
chi_A = A.characteristic_polynomial()
print(chi_A.factor())
print(A.eigenvalues())
D, Pein = A.eigenmatrix_right()
show(D, Pein)

Executing this cell, Sage prints out the characteristic polynomial of $A$, its eigenvalues, the diagonal matrix $D = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3)$ and the matrix whose columns are the eigenvectors $u_1, u_2, u_3$ (the latter is denoted by Pein inside the program). Next, we see that $u_1 \perp u_2$, $u_1 \perp u_3$ and $u_2 \perp u_3$:
\[ \langle u_1, u_2\rangle = u_2^Tu_1 = 0, \qquad \langle u_1, u_3\rangle = u_3^Tu_1 = 0, \qquad \langle u_2, u_3\rangle = u_3^Tu_2 = 0. \]
Moreover, we compute $\|u_1\| = \sqrt{2} = \|u_3\|$, while $\|u_2\| = 1$. Thus, the eigenvectors given below are orthonormal:
\[ \hat{u}_1 = \frac{u_1}{\|u_1\|} = \Big(\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}, 0\Big)^T, \qquad \hat{u}_2 = u_2 = (0, 0, 1)^T, \qquad \hat{u}_3 = \frac{u_3}{\|u_3\|} = \Big(\frac{\sqrt{2}}{2}, -\frac{\sqrt{2}}{2}, 0\Big)^T. \]
We are now ready to present the matrix $P$:
\[ P = (\hat{u}_1\ \hat{u}_2\ \hat{u}_3) = \begin{pmatrix} \frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & 0 & -\frac{\sqrt{2}}{2} \\ 0 & 1 & 0 \end{pmatrix}. \]
By adding the following cell to the previous block, we can quickly verify that $P$ is orthogonal and, furthermore, that
\[ P^{-1}AP = P^TAP = D = \begin{pmatrix} \frac{2}{3} & 0 & 0 \\ 0 & -\frac{1}{3} & 0 \\ 0 & 0 & -\frac{4}{3} \end{pmatrix}. \]

P = matrix([[sqrt(2)/2, 0, sqrt(2)/2], [sqrt(2)/2, 0, -sqrt(2)/2], [0, 1, 0]])
print(P.is_unitary())
print(D == P.inverse()*A*P)

Let us now implement the solution of the task using the gram_schmidt() method, as suggested in the statement. As mentioned earlier, we will only utilize the first matrix returned by this method. To normalize the column vectors of a given matrix, we will define another function, which we may call normalize_col. This takes a matrix M as input and returns another matrix, with each column normalized. It works as follows:
• it iterates through each column vector u of the matrix M;
• then, each vector u is divided by its norm to normalize it;
• finally the normalized columns are recombined into a new matrix, using the column_matrix() command.
For our task we also need the matrices D and Pein, as introduced in the initial block via the eigenmatrix_right command. Our program has the following form:

A = matrix(QQ, [[-1/3, 1, 0], [1, -1/3, 0], [0, 0, -1/3]])
D, Pein = A.eigenmatrix_right()
# store the transpose of the matrix whose columns are the unnormalized eigenvectors of A as Q
Q = Pein.T
# orthogonalize the rows of Q and take the transpose of the matrix obtained by Gram-Schmidt
R = Q.gram_schmidt()[0].T
show(R)
# construct the function that normalizes the column vectors of a matrix M
def normalize_col(M):
    return column_matrix([v/norm(v) for v in M.columns()])
# normalize the columns of the matrix R and store the resulting matrix as P
P = normalize_col(R)
show(P)
print(P.is_unitary())
print(P.inverse()*A*P == D)

Running this program, Sage displays the matrix $P$ as shown below, along with the message "True" printed twice, verifying that $P$ is an orthogonal matrix that satisfies the relation $P^TAP = D$:
\[ \begin{pmatrix} \frac{1}{2}\sqrt{2} & 0 & \frac{1}{2}\sqrt{2} \\ \frac{1}{2}\sqrt{2} & 0 & -\frac{1}{2}\sqrt{2} \\ 0 & 1 & 0 \end{pmatrix}. \] □

3.F.47. Using the Sage program constructed in 3.F.46, present the orthogonal diagonalization of the following matrices:
\[ A = \begin{pmatrix} 2 & 2 & 2 \\ 2 & 0 & 4 \\ 2 & 4 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} -4 & 4 & 4 \\ 4 & -4 & 4 \\ 4 & 4 & -4 \end{pmatrix}. \]

Solution. For the matrix $A$ the program has the form

A = matrix(QQ, [[2, 2, 2], [2, 0, 4], [2, 4, 0]])
D, Pein = A.eigenmatrix_right()
show(D, Pein)
Q = Pein.T
R = Q.gram_schmidt()[0].T
def normalize_columns(M):
    return column_matrix([v/norm(v) for v in M.columns()])
P = normalize_columns(R); show(P)
print(P*D*P.T == A); print(P.is_unitary())

In this case we get the orthogonal matrix
\[ P = \begin{pmatrix} \frac{\sqrt{3}}{3} & \frac{\sqrt{6}}{3} & 0 \\ \frac{\sqrt{3}}{3} & -\frac{\sqrt{6}}{6} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{3}}{3} & -\frac{\sqrt{6}}{6} & -\frac{\sqrt{2}}{2} \end{pmatrix}, \qquad \text{such that} \qquad P^TAP = \begin{pmatrix} 6 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & -4 \end{pmatrix} = D, \]
where $\lambda_1 = 6$, $\lambda_2 = 0$ and $\lambda_3 = -4$ are the eigenvalues of $A$.
Similarly for the matrix $B$:

B = matrix(QQ, [[-4, 4, 4], [4, -4, 4], [4, 4, -4]])
D, Pein = B.eigenmatrix_right()
show(D, Pein)
Q = Pein.T
R = Q.gram_schmidt()[0].T
show(R)
def normalize_columns(M):
    return column_matrix([v/norm(v) for v in M.columns()])
P = normalize_columns(R)
show(P)
print(P*D*P.T == B)
print(P.is_unitary())

Running this block, Sage prints out the orthogonal matrix
\[ P = \begin{pmatrix} \frac{\sqrt{3}}{3} & \frac{\sqrt{2}}{2} & -\frac{\sqrt{6}}{6} \\ \frac{\sqrt{3}}{3} & 0 & \frac{\sqrt{6}}{3} \\ \frac{\sqrt{3}}{3} & -\frac{\sqrt{2}}{2} & -\frac{\sqrt{6}}{6} \end{pmatrix}, \qquad \text{such that} \qquad P^TBP = \begin{pmatrix} 4 & 0 & 0 \\ 0 & -8 & 0 \\ 0 & 0 & -8 \end{pmatrix} = D, \]
where $\lambda_1 = 4$ and $\lambda_2 = -8$ are the eigenvalues of $B$, the second with an algebraic multiplicity of two. □

3.F.48. Present a unitary diagonalization of the matrix
\[ B = \begin{pmatrix} 0 & i & 1 \\ -i & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}. \]

Solution. The matrix $B$ is Hermitian, so it has real eigenvalues, given by $\lambda_1 = 0$ and $\lambda_{2,3} = \pm\sqrt{2}$. The corresponding eigenvectors have the form
\[ u_1 = (0, i, 1)^T, \qquad u_2 = (\sqrt{2}, -i, 1)^T, \qquad u_3 = (-\sqrt{2}, -i, 1)^T. \]
By running the following commands in sequence in Sage, one can verify that $\langle u_i, u_j\rangle = 0$ for any $1 \le i \ne j \le 3$, and that $\|u_1\| = \sqrt{2}$, $\|u_2\| = \|u_3\| = 2$:

u1 = vector(QQbar, [0, I, 1])
u2 = vector(QQbar, [sqrt(2), -I, 1])
u3 = vector(QQbar, [-sqrt(2), -I, 1])
u2.hermitian_inner_product(u1)
u1.hermitian_inner_product(u3)
u3.hermitian_inner_product(u2)
u1.norm(); u2.norm(); u3.norm()

The normalized eigenvectors $\hat{u}_i = u_i/\|u_i\|$, for $i = 1, 2, 3$, form the columns of the matrix
\[ U = \begin{pmatrix} 0 & \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2}i & -\frac{i}{2} & -\frac{i}{2} \\ \frac{\sqrt{2}}{2} & \frac{1}{2} & \frac{1}{2} \end{pmatrix}. \]
We can verify that $U$ is unitary; for example, in Sage you can type

U = matrix(QQbar, [[0, sqrt(2)/2, -sqrt(2)/2], [(sqrt(2)/2)*I, (-1/2)*I, (-1/2)*I], [sqrt(2)/2, 1/2, 1/2]])
U.is_unitary()

which returns True. This matrix diagonalizes the given matrix $B$ with $D = \mathrm{diag}(0, \sqrt{2}, -\sqrt{2})$. Recall from 3.D.38 that a verification of this final claim in Sage relies on the block

U = matrix(QQbar, [[0, sqrt(2)/2, -sqrt(2)/2], [(sqrt(2)/2)*I, (-1/2)*I, (-1/2)*I], [sqrt(2)/2, 1/2, 1/2]])
B = matrix(QQbar, [[0, I, 1], [-I, 0, 0], [1, 0, 0]])
D = matrix(QQbar, [[0, 0, 0], [0, sqrt(2), 0], [0, 0, -sqrt(2)]])
U.conjugate_transpose()*B*U == D

Run this block yourself to see the output generated by Sage. □

3.F.49. Adapt the method of orthogonal diagonalization presented in 3.F.46 to implement the unitary diagonalization of the matrix $B$ given in 3.F.48 using Sage.

Solution. We only need to make a few adjustments to the program presented in 3.F.46 for it to work with unitary diagonalization. In particular, since unitary diagonalization deals with complex matrices, we need to carefully replace transposes by conjugate transposes. Also, to achieve a more readable display of the matrices printed by Sage, it is useful to pass each matrix to a low-precision ring before printing it. This adjustment does not affect the original matrices or the final results, only their display. Below is our program:

CC20 = ComplexField(prec=20)
MatPrint = MatrixSpace(CC20, 3, 3)
B = matrix(QQbar, [[0, i, 1], [-i, 0, 0], [1, 0, 0]])
show(MatPrint(B))
print(B == B.H)
chi_B = B.characteristic_polynomial()
print(chi_B.factor())
print(B.eigenvalues())
D, Pein = B.eigenmatrix_right()
show(MatPrint(D))
show(MatPrint(Pein))
Q = Pein.H
R = Q.gram_schmidt()[0].H
def normalize_columns(M):
    return column_matrix([v/norm(v) for v in M.columns()])
P = normalize_columns(R)
show(MatPrint(P))
print(P*D*P.H == B); print(P.is_unitary())

Sage prints the solution matrix $P$ as follows:
\[ P = \begin{pmatrix} 0.70711 & 9.5829\times10^{-21}\,i & 0.70711 \\ -0.50000i & 0.70711 & 0.50000i \\ 0.50000 & -0.70711i & -0.50000 \end{pmatrix} = \begin{pmatrix} 0.70711 & 0 & 0.70711 \\ -0.50000i & 0.70711 & 0.50000i \\ 0.50000 & -0.70711i & -0.50000 \end{pmatrix}. \]
This matrix differs slightly from our solution matrix $P$ in 3.F.48, since, for example, Sage enumerates the eigenvalues differently and uses different eigenvectors (recall that $P$ is not uniquely determined). However, Sage verifies that $P$ is unitary and satisfies the unitary diagonalization condition. To understand the importance of using the low-precision ring before printing our matrices, try running the program in Sage without including the initial lines related to matrix display. Then compare the corresponding output with the one presented above. □

Next we will explore several tasks involving the bracket of square matrices of the same size, also known as the commutator $[\,,]$, defined as $[A, B] := AB - BA$. The commutator is a fundamental operation in linear algebra and plays a crucial role in various branches of mathematics and physics, particularly in the study of the so-called "Lie algebras". When $[A, B] = 0$, the matrices commute in the sense that $AB = BA$. Hence, the commutator $[A, B]$ measures the extent to which two matrices $A, B$ fail to commute. The commutator is a central concept in the theory of Lie algebras, which are algebraic structures used to study symmetry, particularly in the context of continuous transformations. A "Lie algebra" is a vector space $V$ over a field $\mathbb{F}$, equipped with a bilinear operation $[\,,] : V \times V \to V$, known as the "Lie bracket". This is skew-symmetric and satisfies the so-called Jacobi identity
\[ \mathfrak{S}_{X,Y,Z}[X, [Y, Z]] = 0, \]
where $\mathfrak{S}_{X,Y,Z}$ denotes the cyclic sum over the elements $X, Y, Z \in V$. We should mention that while the commutator of matrices is a primary example, the concept of a Lie algebra extends far beyond matrices, encompassing a broad range of algebraic structures.

3.F.50. For $\mathbb{F} = \mathbb{R}$ or $\mathbb{F} = \mathbb{C}$, prove that the vector space of square matrices $\mathrm{Mat}_n(\mathbb{F})$ endowed with the commutator $[A, B] := AB - BA$ has the structure of a (finite-dimensional) Lie algebra.

Solution. It suffices to prove that the commutator is a bilinear mapping $[\,,] : \mathrm{Mat}_n(\mathbb{F}) \times \mathrm{Mat}_n(\mathbb{F}) \to \mathrm{Mat}_n(\mathbb{F})$ which is skew-symmetric and satisfies the Jacobi identity. Consider arbitrary matrices $A_1, A_2 \in \mathrm{Mat}_n(\mathbb{F})$ and scalars $\lambda, \mu \in \mathbb{F}$. Then, for any $B \in \mathrm{Mat}_n(\mathbb{F})$, we see that
\[ [\lambda A_1 + \mu A_2, B] = (\lambda A_1 + \mu A_2)B - B(\lambda A_1 + \mu A_2) = \lambda(A_1B - BA_1) + \mu(A_2B - BA_2) = \lambda[A_1, B] + \mu[A_2, B]. \]
This proves linearity in the first argument, and in a similar way we obtain $[A, \lambda B_1 + \mu B_2] = \lambda[A, B_1] + \mu[A, B_2]$ for all $A, B_1, B_2 \in \mathrm{Mat}_n(\mathbb{F})$ and $\lambda, \mu \in \mathbb{F}$. Since the bracket $[\,,]$ is linear in the second argument as well, it is a bilinear mapping. Skew-symmetry is obvious: $[B, A] = BA - AB = -(AB - BA) = -[A, B]$ for any $A, B \in \mathrm{Mat}_n(\mathbb{F})$. Finally, for the Jacobi identity, let $A, B, C$ be arbitrary $n\times n$ matrices over $\mathbb{F}$.
Then, we see that
\begin{align*}
[A, [B, C]] &= A(BC - CB) - (BC - CB)A = ABC - ACB - BCA + CBA,\\
[B, [C, A]] &= B(CA - AC) - (CA - AC)B = BCA - BAC - CAB + ACB,\\
[C, [A, B]] &= C(AB - BA) - (AB - BA)C = CAB - CBA - ABC + BAC,
\end{align*}
and the Jacobi identity $[A, [B, C]] + [B, [C, A]] + [C, [A, B]] = 0$ is now direct. The Lie algebra $(\mathrm{Mat}_n(\mathbb{R}), [\,,])$, also denoted by $\mathfrak{gl}(n, \mathbb{R})$ or $\mathfrak{gl}_n(\mathbb{R})$, underpins a wide range of applications in science, engineering, and technology. Its significance stems from its ability to model and solve real-world problems that involve linear transformations and systems. Note that in a similar way one can prove that $(\mathrm{Mat}_n(\mathbb{C}), [\,,])$ is a Lie algebra over $\mathbb{C}$. □

3.F.51. The dimension of a Lie algebra $(V, [\,,]_V)$ over a field $\mathbb{F}$ is the dimension of the vector space $V$ over $\mathbb{F}$. What is the dimension of the Lie algebra $(\mathrm{Mat}_n(\mathbb{R}), [\,,])$ described above? ⃝

3.F.52. Show that the set of matrices
\[ M = \left\{ \begin{pmatrix} a & b \\ 0 & -a \end{pmatrix} : a, b \in \mathbb{R} \right\} \]
forms a Lie algebra under matrix commutation.

Solution. The given set of matrices is a linear subspace of the vector space $\mathrm{Mat}_2(\mathbb{R})$ of $2\times2$ real matrices. This is because $M$ contains the zero matrix (take $a = b = 0$), and is closed under addition and scalar multiplication:
\[ A + B = \begin{pmatrix} a_1 & b_1 \\ 0 & -a_1 \end{pmatrix} + \begin{pmatrix} a_2 & b_2 \\ 0 & -a_2 \end{pmatrix} = \begin{pmatrix} a_1 + a_2 & b_1 + b_2 \\ 0 & -(a_1 + a_2) \end{pmatrix} \in M, \qquad \lambda A = \begin{pmatrix} \lambda a & \lambda b \\ 0 & -\lambda a \end{pmatrix} \in M, \]
for all $A, B \in M$ and $\lambda \in \mathbb{R}$. Next we should prove that the commutator $[A, B]$, for $A, B \in M$, stays inside $M$; the easiest route is to prove that $M$ is a "Lie subalgebra" of the Lie algebra $(\mathrm{Mat}_2(\mathbb{R}), [\,,])$. A Lie subalgebra of a Lie algebra $(V, [\,,]_V)$ is a subspace $W \subset V$ which is closed under the corresponding Lie bracket operation, i.e., $[W, W]_V \subset W$. It is easy to see that a Lie subalgebra is a Lie algebra itself. For two elements $A, B \in M$ we see that
\[ [A, B] = AB - BA = \begin{pmatrix} a_1a_2 & a_1b_2 - a_2b_1 \\ 0 & a_1a_2 \end{pmatrix} - \begin{pmatrix} a_1a_2 & a_2b_1 - a_1b_2 \\ 0 & a_1a_2 \end{pmatrix} = \begin{pmatrix} 0 & 2(a_1b_2 - a_2b_1) \\ 0 & 0 \end{pmatrix} \in M, \]
which is again of the given form, with $a = 0$ and $b = 2(a_1b_2 - a_2b_1)$. Thus $M$ is closed under the commutator, making it a Lie subalgebra of $\mathrm{Mat}_2(\mathbb{R})$ and hence a Lie algebra in its own right. Note that $[A, B]$ is in general non-zero (take, for instance, $a_1 = b_2 = 1$ and $a_2 = b_1 = 0$), so this 2-dimensional Lie algebra is not Abelian, i.e., not all of its Lie brackets vanish. □
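The bracket just computed can be confirmed symbolically in Sage; here is a minimal sketch:

a1, b1, a2, b2 = var('a1 b1 a2 b2')
A = matrix(SR, [[a1, b1], [0, -a1]])
B = matrix(SR, [[a2, b2], [0, -a2]])
print(A*B - B*A)   # [0  2*a1*b2 - 2*a2*b1]
                   # [0                  0]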
3.F.53. Let us denote by $\mathfrak{so}(3)$ the set of skew-symmetric $3\times3$ matrices.
i) Prove that $\mathfrak{so}(3)$ is a vector subspace of $\mathrm{Mat}_3(\mathbb{R})$ and compute its dimension.
ii) Prove that $\mathfrak{so}(3)$ endowed with the matrix commutator $[\,,]$ is a Lie subalgebra of $(\mathrm{Mat}_3(\mathbb{R}), [\,,])$. Next compute the brackets of the basis elements and use Sage to verify the result.
iii) Show that $\mathbb{R}^3$ endowed with the cross product $\times$ as a Lie bracket forms another 3-dimensional Lie algebra.
iv) Establish a linear isomorphism $\varphi$ between $\mathbb{R}^3$ and $\mathfrak{so}(3)$ that preserves the Lie algebra structures, i.e., $\varphi(u \times v) = [\varphi(u), \varphi(v)]$. (Such isomorphisms are called Lie algebra isomorphisms.)

Solution. (i) An element of $\mathfrak{so}(3)$ has the form
\[ \begin{pmatrix} 0 & -x_3 & x_2 \\ x_3 & 0 & -x_1 \\ -x_2 & x_1 & 0 \end{pmatrix} \]
for some $x_1, x_2, x_3 \in \mathbb{R}$. Thus it is easy to see that $A + B \in \mathfrak{so}(3)$ and $\lambda A \in \mathfrak{so}(3)$ for any two matrices $A, B \in \mathfrak{so}(3)$ and any scalar $\lambda \in \mathbb{R}$. Hence $\mathfrak{so}(3)$ is a vector subspace of $\mathrm{Mat}_3(\mathbb{R})$ with $\dim\mathfrak{so}(3) = 3$, as a basis is given by the matrices
\[ E_{12} = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad E_{23} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad E_{31} = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{pmatrix}. \]
(ii) Let us prove that $\mathfrak{so}(3)$ is closed under the matrix commutator. Let $A, B \in \mathfrak{so}(3)$ be skew-symmetric matrices, i.e., $A^T = -A$ and $B^T = -B$. Then we see that
\[ ([A, B])^T = (AB - BA)^T = B^TA^T - A^TB^T = (-B)(-A) - (-A)(-B) = BA - AB = -[A, B], \]
which implies that $[A, B] \in \mathfrak{so}(3)$. Thus $[\mathfrak{so}(3), \mathfrak{so}(3)] \subset \mathfrak{so}(3)$, and $\mathfrak{so}(3)$ is a Lie subalgebra of $\mathrm{Mat}_3(\mathbb{R})$, hence a Lie algebra itself. Alternatively, one may compute the Lie brackets of the basis elements and prove that they belong to $\mathfrak{so}(3)$:
\[ [E_{12}, E_{23}] = E_{31} \in \mathfrak{so}(3), \qquad [E_{12}, E_{31}] = -E_{23} \in \mathfrak{so}(3), \qquad [E_{23}, E_{31}] = E_{12} \in \mathfrak{so}(3). \]
To verify the brackets in Sage you may use the following block (see also 2.E.66):

E12 = matrix([[0, -1, 0], [1, 0, 0], [0, 0, 0]])
E23 = matrix([[0, 0, 0], [0, 0, -1], [0, 1, 0]])
E31 = matrix([[0, 0, 1], [0, 0, 0], [-1, 0, 0]])
def Lie_bracket(A, B):
    return A*B - B*A
show("\nThe commutator [E12, E23] equals:", Lie_bracket(E12, E23))
show("\nThe commutator [E12, E31] equals:", Lie_bracket(E12, E31))
show("\nThe commutator [E23, E31] equals:", Lie_bracket(E23, E31))
print(Lie_bracket(E12, E23) == E31)
print(Lie_bracket(E12, E31) == -E23)
print(Lie_bracket(E23, E31) == E12)

(iii) Recall that the cross product $v \times w$ of two vectors $v = (v_1, v_2, v_3)^T$ and $w = (w_1, w_2, w_3)^T$ in $\mathbb{R}^3$ is the vector
\[ v \times w = \det\begin{pmatrix} \vec{i} & \vec{j} & \vec{k} \\ v_1 & v_2 & v_3 \\ w_1 & w_2 & w_3 \end{pmatrix} = \begin{pmatrix} v_2w_3 - w_2v_3 \\ v_3w_1 - w_3v_1 \\ v_1w_2 - w_1v_2 \end{pmatrix} \in \mathbb{R}^3. \]
Hence $\mathbb{R}^3$ is closed with respect to the operation $[v, w]_{\mathbb{R}^3} := v \times w$. This operation is bilinear by definition and skew-symmetric, $v \times w = -w \times v$ for all $v, w \in \mathbb{R}^3$. For the Jacobi identity, recall that $u \times (v \times w) = (u \cdot w)v - (u \cdot v)w$. Thus
\begin{align*}
&[u, [v, w]_{\mathbb{R}^3}]_{\mathbb{R}^3} + [v, [w, u]_{\mathbb{R}^3}]_{\mathbb{R}^3} + [w, [u, v]_{\mathbb{R}^3}]_{\mathbb{R}^3} = u \times (v \times w) + v \times (w \times u) + w \times (u \times v)\\
&= (u \cdot w)v - (u \cdot v)w + (v \cdot u)w - (v \cdot w)u + (w \cdot v)u - (w \cdot u)v\\
&= \big((u \cdot w) - (w \cdot u)\big)v - \big((u \cdot v) - (v \cdot u)\big)w + \big((w \cdot v) - (v \cdot w)\big)u = 0.
\end{align*}
Thus $(\mathbb{R}^3, [\,,]_{\mathbb{R}^3})$ is also a Lie algebra.
(iv) The map $\varphi : \mathbb{R}^3 \to \mathfrak{so}(3)$ given by
\[ u = (x_1, x_2, x_3)^T \mapsto A_u := \begin{pmatrix} 0 & -x_3 & x_2 \\ x_3 & 0 & -x_1 \\ -x_2 & x_1 & 0 \end{pmatrix} \]
is clearly a vector space isomorphism. To show that it is a Lie algebra isomorphism it remains to prove that $\varphi([u, v]_{\mathbb{R}^3}) = [\varphi(u), \varphi(v)]$, or equivalently $A_{u\times v} = [A_u, A_v]$, which we leave as an exercise. □

3.F.54. Consider Hermitian matrices $A, B, C$ satisfying $[A, C] = [B, C] = 0$ and $[A, B] \ne 0$. Prove that at least one eigenspace of the matrix $C$ has dimension $> 1$.

Solution. We prove it by contradiction. Assume that all eigenspaces of the matrix $C$ are one-dimensional. Since $C$ is Hermitian, its linearly independent eigenvectors $u_k$, associated with the eigenvalues $\lambda_k$, form an orthonormal basis, so any vector $u$ can be written as $u = \sum_k\langle u, u_k\rangle u_k = \sum_k c_ku_k$. Now a computation shows that
\[ 0 = [A, C]u_k = ACu_k - CAu_k = \lambda_kAu_k - C(Au_k). \]
Therefore $Au_k$ is also an eigenvector of the matrix $C$ corresponding to the eigenvalue $\lambda_k$, and since the eigenspace is one-dimensional we get $Au_k = \lambda_k^Au_k$ for some number $\lambda_k^A$. Similarly, $Bu_k = \lambda_k^Bu_k$ for some number $\lambda_k^B$. Then, for the commutator of the matrices $A$ and $B$, one computes
\[ [A, B]u_k = ABu_k - BAu_k = \lambda_k^A\lambda_k^Bu_k - \lambda_k^B\lambda_k^Au_k = 0, \]
and hence
\[ [A, B]u = [A, B]\sum_k c_ku_k = \sum_k c_k[A, B]u_k = 0. \]
This final relation implies $[A, B] = 0$, a contradiction. □
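Returning to the isomorphism of 3.F.53(iv), which was left as an exercise, the identity $A_{u\times v} = [A_u, A_v]$ can at least be spot-checked in Sage. Below is a sketch with two sample vectors chosen arbitrarily for illustration:

def hat(u):   # the map u |-> A_u from 3.F.53(iv)
    x1, x2, x3 = u
    return matrix([[0, -x3, x2], [x3, 0, -x1], [-x2, x1, 0]])
u = vector([1, 2, 3]); v = vector([-1, 0, 2])
lhs = hat(u.cross_product(v))
rhs = hat(u)*hat(v) - hat(v)*hat(u)
print(lhs == rhs)   # True: phi(u x v) = [phi(u), phi(v)]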
3.F.55. Given the matrix
\[ A = \begin{pmatrix} 0 & -2 & 2 \\ -2 & 0 & 2 \\ 2 & 2 & 0 \end{pmatrix}, \]
compute the trace of the matrix $\exp(A) := \sum_{k=0}^{\infty}\frac{1}{k!}A^k = E + A + \frac{A^2}{2} + \cdots$. Hint: Use the eigenvalues of $A$.

Solution. Recall that the trace of a matrix $A \in \mathrm{Mat}_n(\mathbb{F})$, for $\mathbb{F} = \mathbb{R}$ or $\mathbb{F} = \mathbb{C}$, equals the sum of the eigenvalues of $A$. The given matrix is symmetric, and hence its eigenvalues are real. An application of the following block in Sage gives $\lambda_1 = -4$ with multiplicity $1$ and $\lambda_2 = 2$ with multiplicity $2$:

A = matrix(QQ, 3, 3, [[0, -2, 2], [-2, 0, 2], [2, 2, 0]])
A.eigenvalues()

Thus, the eigenvalues of $\exp(A)$ are $e^{-4}$ and $e^2$, the second again with multiplicity $2$. It follows that $\mathrm{tr}(\exp(A)) = e^{-4} + 2e^2$. □

3.F.56. Show that if $H$ is a Hermitian matrix, then $U = \exp(iH) = \sum_{n=0}^{\infty}\frac{1}{n!}(iH)^n$ is a unitary matrix. Next compute its determinant.

Solution. From the definition of $\exp$ we can show that $\exp(A + B) = \exp(A)\exp(B)$ whenever the matrices $A$ and $B$ commute, just as with the exponential mapping in the domain of real numbers; this applies below, since $iH$ and $-iH$ commute. Because $(u + v)^* = u^* + v^*$ and $(cv)^* = \bar{c}v^*$, we obtain
\[ U^* = \Big(\sum_{n=0}^{\infty}\frac{1}{n!}(iH)^n\Big)^* = \sum_{n=0}^{\infty}\frac{1}{n!}(-iH^*)^n. \]
Since $H^* = H$, we finally see that
\[ U^* = \sum_{n=0}^{\infty}(-1)^n\frac{1}{n!}(iH)^n = \exp(-iH). \]
Thus $U^*U = \exp(-iH)\exp(iH) = \exp(0) = E$, and $\det(U) = e^{\mathrm{tr}(iH)} = e^{i\,\mathrm{tr}(H)}$, a complex number of modulus $1$, since the trace of a Hermitian matrix is real. □

We will now continue with additional tasks related to generalized eigenspaces and the Jordan canonical form.

3.F.57. Determine the algebraic and geometric multiplicities of the eigenvalues of the matrices below, and use Sage to validate your results:
\[ A = \begin{pmatrix} 1 & 1 & 2 \\ 0 & 1 & 2 \\ 0 & 0 & 3 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 1 & 2 \\ 0 & 0 & 3 \end{pmatrix}, \qquad C = \begin{pmatrix} 4 & 0 & 1 \\ 2 & 3 & 2 \\ 1 & 0 & 4 \end{pmatrix}. \]

Solution. The matrices $A$ and $B$ are upper triangular with the same diagonal entries. Thus $A$ and $B$ have the same eigenvalues,
\[ |A - \lambda E| = |B - \lambda E| = \lambda^3 - 5\lambda^2 + 7\lambda - 3 = (\lambda - 1)^2(\lambda - 3) \]
(up to sign), and we get $\lambda_1 = 1$ with algebraic multiplicity $\alpha_A(\lambda_1) = 2 = \alpha_B(\lambda_1)$, and $\lambda_2 = 3$ with $\alpha_A(\lambda_2) = 1 = \alpha_B(\lambda_2)$. Using Sage we can obtain the same conclusion as follows:

A = matrix(QQbar, [[1, 1, 2], [0, 1, 2], [0, 0, 3]])
B = matrix(QQbar, [[1, 0, 2], [0, 1, 2], [0, 0, 3]])
A.characteristic_polynomial() == B.characteristic_polynomial()

Recall also that in Sage the command A.fcp('t') factors the characteristic polynomial of a matrix. For the matrix $C$, the following cell

C = matrix(QQbar, [[4, 0, 1], [2, 3, 2], [1, 0, 4]])
C.fcp('x')
C.eigenvalues()

gives the characteristic polynomial of $C$ as well as its eigenvalues: $\lambda_1 = 5$ with $\alpha_C(\lambda_1) = 1$ and $\lambda_2 = 3$ with $\alpha_C(\lambda_2) = 2$. Let us now compute the geometric multiplicities, using
\[ \gamma_A(\lambda) = \dim\ker(A - \lambda E) = n - \mathrm{rank}(A - \lambda E), \]
where $A$ is an $n\times n$ matrix and $\lambda$ is an eigenvalue of $A$. For the given matrix $A$ we get
\[ A - E = \begin{pmatrix} 0 & 1 & 2 \\ 0 & 0 & 2 \\ 0 & 0 & 2 \end{pmatrix}, \]
which has exactly two linearly independent columns. Thus $\mathrm{rank}(A - E) = 2$ and $\gamma_A(1) = 3 - 2 = 1$. Here is a confirmation using Sage:

A = matrix(QQbar, [[1, 1, 2], [0, 1, 2], [0, 0, 3]])
E = identity_matrix(QQbar, 3)
(A - E).rank()

Or we can type

A.eigenspaces_right()

which returns the eigenvalues of $A$ and their eigenspaces:

(3, Vector space of degree 3 and dimension 1 over Algebraic Field
User basis matrix:
[ 1.000000000000000? 0.6666666666666667? 0.6666666666666667?]),
(1, Vector space of degree 3 and dimension 1 over Algebraic Field
User basis matrix:
[1 0 0])

In a similar way you can verify that $\gamma_B(1) = 2$, $\gamma_A(3) = 1 = \gamma_B(3)$, $\gamma_C(5) = 1$ and $\gamma_C(3) = 2$. □
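Both 3.F.55 and 3.F.56 can also be cross-checked numerically. The sketch below works over the double-precision field CDF, where Sage provides a numerical matrix exponential exp(); the sample Hermitian matrix H is an assumption made purely for illustration:

A = matrix(CDF, [[0, -2, 2], [-2, 0, 2], [2, 2, 0]])
print(A.exp().trace())                    # approx. e^(-4) + 2*e^2 = 14.7964...
H = matrix(CDF, [[1, 1j], [-1j, 0]])      # a sample Hermitian matrix
U = (1j*H).exp()
print((U*U.conjugate_transpose() - identity_matrix(CDF, 2)).norm() < 1e-12)  # U is unitary
print(abs(abs(U.det()) - 1) < 1e-12)      # |det(U)| = 1, since tr(H) is real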
3.F.58. Find the Jordan canonical form of the matrices
\[ A = \begin{pmatrix} -1 & 1 \\ -6 & 4 \end{pmatrix}, \qquad B = \begin{pmatrix} -1 & 1 \\ -4 & 3 \end{pmatrix}. \]
Additionally, provide the geometric interpretation of the Jordan canonical form decomposition corresponding to these matrices.

Solution. The eigenvalues of $A$ are real, given by $\lambda_1 = 1$ and $\lambda_2 = 2$. Since $A$ is of size $2\times2$ and has two distinct eigenvalues, the Jordan form of $A$ is diagonal, i.e.,
\[ J = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}. \]
It is easy to see that an eigenvector $u_1 = (x, y)^T$ associated with the eigenvalue $\lambda_1 = 1$ satisfies the matrix equation
\[ 0 = (A - E)u_1 = \begin{pmatrix} -2 & 1 \\ -6 & 3 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}. \]
This gives the equation $-2x + y = 0$, thus the eigenvectors are multiples of the vector $u_1 = (1, 2)^T$. Similarly, the eigenvectors associated with the eigenvalue $\lambda_2$ are multiples of $u_2 = (1, 3)^T$. The matrix $P$ is then obtained by writing these eigenvectors into the columns,
\[ P = \begin{pmatrix} 1 & 1 \\ 2 & 3 \end{pmatrix}, \qquad P^{-1} = \begin{pmatrix} 3 & -1 \\ -2 & 1 \end{pmatrix}, \]
and for the matrix $A$ we can confirm the relation $A = PJP^{-1}$:
\[ \begin{pmatrix} -1 & 1 \\ -6 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}\begin{pmatrix} 3 & -1 \\ -2 & 1 \end{pmatrix}. \]
This decomposition says that the matrix $A$ determines a linear mapping which has the above diagonal form with respect to the basis $\{u_1, u_2\}$. Geometrically, this means that in the direction of $u_1$ nothing changes, while in the direction of $u_2$ every vector is stretched by a factor of two.

Let us now focus on $B$. This matrix has a unique eigenvalue, $\lambda = 1$, of multiplicity $2$. The corresponding eigenvector $v_1 = (x, y)^T$ satisfies the matrix equation
\[ 0 = (B - E)v_1 = \begin{pmatrix} -2 & 1 \\ -4 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}. \]
We find that the solutions are multiples of the vector $v_1 = (1, 2)^T$. The fact that the system does not have two linearly independent solutions implies the following expression for the Jordan canonical form:
\[ J = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}. \]
The basis for which $B$ has this form consists of the eigenvector $v_1$ and a vector $v_2$ that is mapped to $v_1$ by the linear transformation $B - E$. This amounts to solving the system with the augmented matrix
\[ \left(\begin{array}{cc|c} -2 & 1 & 1 \\ -4 & 2 & 2 \end{array}\right) \sim \left(\begin{array}{cc|c} -2 & 1 & 1 \\ 0 & 0 & 0 \end{array}\right). \]
You can easily check that the solutions are multiples of the vector $v_2 = (1, 3)^T$. Moreover, one obtains the same basis as in the previous case, and we can verify the relation $B = PJP^{-1}$:
\[ \begin{pmatrix} -1 & 1 \\ -4 & 3 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 3 & -1 \\ -2 & 1 \end{pmatrix}. \]
In this case, the mapping acts on vectors as follows: the component in the direction of $v_2$ stays the same, while the new component in the direction of $v_1$ equals the sum of the two original components, in the directions of $v_1$ and $v_2$. □

3.F.59. Consider the matrix
\[ A = \begin{pmatrix} -1 & -1 & -1 & 0 \\ 3 & 2 & 3 & -1 \\ 2 & 1 & 3 & -1 \\ 2 & 1 & 4 & -2 \end{pmatrix}. \]
Solve the following tasks:
i) Show that the eigenvalues of $A$ are given by $\pm1$. Compute their algebraic and geometric multiplicities and find the corresponding eigenvectors of $A$.
ii) Use Sage to verify that the dimensions of the null spaces $\ker((A - E)^j)$ do not increase for $j > 3$. Next verify that $\dim R_{\lambda_1} = 3$ by computing the rank of the matrix $(A - E)^3$ in Sage and applying the rank–nullity theorem.
iii) Derive the generalized eigenvectors of $A$.
iv) Describe the Jordan canonical form $J$ of $A$ and find a matrix $P$ satisfying $P^{-1}AP = J$. Next verify your answer in Sage.
v) Prove the direct sum decomposition $\mathbb{R}^4 = R_{\lambda_1} \oplus R_{\lambda_2}$, where $\lambda_1 = 1$ and $\lambda_2 = -1$ are the eigenvalues of $A$.

Solution. (i) The characteristic polynomial of $A$ is given by
\[ \chi_A(\lambda) = \det\begin{pmatrix} -1 - \lambda & -1 & -1 & 0 \\ 3 & 2 - \lambda & 3 & -1 \\ 2 & 1 & 3 - \lambda & -1 \\ 2 & 1 & 4 & -2 - \lambda \end{pmatrix} = (\lambda - 1)^3(\lambda + 1). \]
To confirm this expression we may use Sage via the block

A = matrix([[-1, -1, -1, 0], [3, 2, 3, -1], [2, 1, 3, -1], [2, 1, 4, -2]])
p = A.characteristic_polynomial(); show(factor(p))

Thus the eigenvalues of $A$ are $\lambda_1 = 1$ with multiplicity $3$ and $\lambda_2 = -1$ with multiplicity $1$.
• Let us first focus on $\lambda_1$ and describe the eigenspace $V_{\lambda_1} = V_1 = \ker(A - E)$. Performing elementary row operations, we bring the matrix $A - E$ to its reduced row echelon form:
\[ A - E = \begin{pmatrix} -2 & -1 & -1 & 0 \\ 3 & 1 & 3 & -1 \\ 2 & 1 & 2 & -1 \\ 2 & 1 & 4 & -3 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \]
You can verify this expression in Sage by adding the following cell to the initial block:

E = identity_matrix(4)
show((A - E).rref())

Thus, the solutions of the matrix equation $(A - E)u_1 = 0$ for a vector $u_1 = (x_1, x_2, x_3, x_4)^T$ correspond to the solutions of the linear system $\{x_1 + x_4 = 0,\ x_2 - x_3 = 0,\ x_3 - x_4 = 0\}$, where $x_4$ is a free parameter. This gives
\[ V_1 = \ker(A - E) = \{t(-1, 1, 1, 1)^T : t \in \mathbb{R}\}, \]
and hence the geometric multiplicity of $\lambda_1 = 1$ equals $1$, i.e., $\gamma(\lambda_1) = 1 = \dim V_1$. Moreover, the vector $u_1 = (-1, 1, 1, 1)^T$ is an eigenvector of $A$ corresponding to $\lambda_1 = 1$.
• The second eigenvalue $\lambda_2 = -1$ has algebraic multiplicity $1$, and thus its geometric multiplicity must also be $1$, i.e., $\gamma(\lambda_2) = \dim V_{-1} = \dim\ker(A + E) = 1$. This also implies that $\dim R_{\lambda_2} = 1$, in particular $R_{\lambda_2} \cong V_{-1}$ (the dimensions of the null spaces $\ker((A + E)^\ell)$ do not increase for $\ell > 1$). Elementary row operations yield the reduced row echelon form of the matrix $A + E$:
\[ A + E = \begin{pmatrix} 0 & -1 & -1 & 0 \\ 3 & 3 & 3 & -1 \\ 2 & 1 & 4 & -1 \\ 2 & 1 & 4 & -1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 0 & -1/3 \\ 0 & 1 & 0 & 1/9 \\ 0 & 0 & 1 & -1/9 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \]
Once more we can verify this expression in Sage quickly, by adding the cell

show((A + E).rref())

This results in the system $\{x_1 - \frac{1}{3}x_4 = 0,\ x_2 + \frac{1}{9}x_4 = 0,\ x_3 - \frac{1}{9}x_4 = 0\}$, where $x_4$ is a free parameter. Thus
\[ V_{\lambda_2} = V_{-1} = \ker(A + E) = \{t(\tfrac{1}{3}, -\tfrac{1}{9}, \tfrac{1}{9}, 1)^T : t \in \mathbb{R}\}. \]
Therefore the corresponding eigenvector is given (up to a scalar) by $v_1 = (\frac{1}{3}, -\frac{1}{9}, \frac{1}{9}, 1)^T$.
(ii) We will use Sage to study the matrices $(A - E)^2$, $(A - E)^3$, and so on.
We proceed with the following block:

A = matrix([[-1, -1, -1, 0], [3, 2, 3, -1], [2, 1, 3, -1], [2, 1, 4, -2]])
E = identity_matrix(4)
bb = (A - E)*(A - E)
show("\nThe matrix (A-E)^2 is given by:", bb)
cc = bb*(A - E)
show("\nThe matrix (A-E)^3 is given by:", cc)
dd = cc*(A - E)
show("\nThe matrix (A-E)^4 is given by:", dd)
ee = dd*(A - E)
show("\nThe matrix (A-E)^5 is given by:", ee)

Executing this block, we obtain
\[ (A - E)^2 = \begin{pmatrix} -1 & 0 & -3 & 2 \\ 1 & 0 & 2 & -1 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & -3 & 4 \end{pmatrix}, \qquad (A - E)^3 = \begin{pmatrix} 0 & 0 & 3 & -3 \\ 0 & 0 & -1 & 1 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 9 & -9 \end{pmatrix}, \]
\[ (A - E)^4 = \begin{pmatrix} 0 & 0 & -6 & 6 \\ 0 & 0 & 2 & -2 \\ 0 & 0 & -2 & 2 \\ 0 & 0 & -18 & 18 \end{pmatrix}, \qquad (A - E)^5 = \begin{pmatrix} 0 & 0 & 12 & -12 \\ 0 & 0 & -4 & 4 \\ 0 & 0 & 4 & -4 \\ 0 & 0 & 36 & -36 \end{pmatrix}. \]
This shows that the matrices $(A - E)^4$ and $(A - E)^5$ are (even) multiples of the matrix $(A - E)^3$. Thus, as we expected, the dimensions of the null spaces $\ker((A - E)^j)$ do not increase for $j > 3 = \alpha(\lambda_1)$. To compute the rank of the matrix $(A - E)^3$ we can continue typing in the previous block the following syntax:

show("The rank of (A-E)^3 is given by:", cc.rank())

Sage's output has the form: The rank of (A-E)^3 is given by: 1. Thus $\mathrm{rank}((A - E)^3) = 1$ and $\dim\ker((A - E)^3) = 4 - 1 = 3 = \dim R_{\lambda_1}$, as it should be, since $\alpha(\lambda_1) = 3$.
(iii) Let us now determine generalized eigenvectors $u_2, u_3 \in R_{\lambda_1}$ for the first eigenvalue, whose algebraic multiplicity exceeds its geometric multiplicity. To do this, start with the matrix equation $(A - E)u_2 = u_1$, i.e.,
\[ \begin{pmatrix} -2 & -1 & -1 & 0 \\ 3 & 1 & 3 & -1 \\ 2 & 1 & 2 & -1 \\ 2 & 1 & 4 & -3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} -1 \\ 1 \\ 1 \\ 1 \end{pmatrix}, \]
where $u_2 = (x_1, x_2, x_3, x_4)^T$ is the unknown. We will use Sage to obtain the reduced row echelon form of the augmented matrix $(A - E \mid u_1)$. It is sufficient to continue typing in our previous block and add the following cell:

u1 = vector([-1, 1, 1, 1])
M = (A - E).augment(u1, subdivide=True)
show(M.rref())

We see that
\[ \left(\begin{array}{cccc|c} -2 & -1 & -1 & 0 & -1 \\ 3 & 1 & 3 & -1 & 1 \\ 2 & 1 & 2 & -1 & 1 \\ 2 & 1 & 4 & -3 & 1 \end{array}\right) \sim \left(\begin{array}{cccc|c} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & -1 & 1 \\ 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{array}\right). \]
Now you may conclude that the solution space of the induced linear system has the form $\{(x_1, x_2, x_3, x_4)^T = (-s, 1 + s, s, s)^T\}$, where $s$ is a free parameter. Therefore, without loss of generality, we may take $u_2 = (0, 1, 0, 0)^T$. Next, using the matrix expression for $(A - E)^3$ provided above, we can easily verify that $(A - E)^3u_2 = 0$. Similarly, by solving the matrix equation $(A - E)u_3 = u_2$ we find that $u_3 = (1, -2, 0, 0)^T$ (up to a scalar); the verification of this result is left as an exercise for the reader. To summarize, the eigenvalue $\lambda_1 = 1$ has the eigenvector $u_1 = (-1, 1, 1, 1)^T$ and the generalized eigenvectors $u_2 = (0, 1, 0, 0)^T$ and $u_3 = (1, -2, 0, 0)^T$.
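Before assembling the Jordan form in (iv), it is worth checking the chain relations in Sage; a minimal sketch:

A = matrix([[-1, -1, -1, 0], [3, 2, 3, -1], [2, 1, 3, -1], [2, 1, 4, -2]])
E = identity_matrix(4)
u1 = vector([-1, 1, 1, 1]); u2 = vector([0, 1, 0, 0]); u3 = vector([1, -2, 0, 0])
print((A - E)*u1 == zero_vector(4), (A - E)*u2 == u1, (A - E)*u3 == u2)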
(iv) It is easy to see that the generalized eigenvector $u_3$ satisfies the additional relation $(A - E)^2u_3 = u_1$. Hence the set $\{(A - E)^2u_3, (A - E)u_3, u_3\} = \{u_1, u_2, u_3\}$ is a Jordan chain of length $3$ corresponding to $R_{\lambda_1}$. Consequently, the Jordan form should contain a Jordan block of size $3$ corresponding to $\lambda_1$, and a Jordan block of size $1$ corresponding to $\lambda_2$, namely $J_{-1} = (-1)$. This yields the following expression:
\[ J = J_1 \oplus J_{-1} = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix} \oplus (-1) = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}. \]
Recall that in Sage a verification of the Jordan canonical form (up to the order of the Jordan blocks) is obtained by adding the cell

J = A.jordan_form(transformation=True); show(J)

Now, a similarity matrix $P$ satisfying the condition $P^{-1}AP = J$ has as its columns the vectors $\{u_1, u_2, u_3, v_1\}$ derived in (i) and (iii), i.e.,
\[ P = \begin{pmatrix} -1 & 0 & 1 & 1/3 \\ 1 & 1 & -2 & -1/9 \\ 1 & 0 & 0 & 1/9 \\ 1 & 0 & 0 & 1 \end{pmatrix}. \]
In particular, $P$ is invertible, since these vectors are linearly independent. While the computation of $P^{-1}$ is left as an exercise, we use Sage to verify that $P^{-1}AP = J$:

J = matrix([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, -1]])
P = matrix([[-1, 0, 1, 1/3], [1, 1, -2, -1/9], [1, 0, 0, 1/9], [1, 0, 0, 1]])
P.inverse()*A*P == J

Sage's output is True. Finally, recall from Chapter 2 that a quick way to verify in Sage the linear independence of the vectors $u_1, u_2, u_3$ and $v_1$ is to compute the determinant of $P$, via the command P.det(). Alternatively, you might prefer to use the cell provided below:

u1 = vector([-1, 1, 1, 1])
u2 = vector([0, 1, 0, 0])
u3 = vector([1, -2, 0, 0])
v1 = vector([1/3, -1/9, 1/9, 1])
V = RR^4
V.linear_dependence([u1, u2, u3, v1]) == []

(v) We have seen that $\dim R_1 = 3$ and $\dim R_{-1} = \dim V_{-1} = 1$. Since the total dimension of $\mathbb{R}^4$ equals $4$, the generalized eigenspaces together have just enough "space" to cover all of $\mathbb{R}^4$. In fact, we saw above that joining the bases $\{u_1, u_2, u_3\}$ of $R_1$ and $\{v_1\}$ of $R_{-1} \cong V_{-1}$ we obtain a basis of $\mathbb{R}^4$, and this already provides a proof. Alternatively, the relations $\mathbb{R}^4 = R_1 + R_{-1}$ and $R_1 \cap R_{-1} = \{0\}$ imply the direct sum decomposition
\[ \mathbb{R}^4 = R_{\lambda_1} \oplus R_{\lambda_2} = R_1 \oplus R_{-1} = \ker((A - E)^3) \oplus \ker(A + E). \] □

E) Material on linear algebra – matrix decompositions

3.F.60. Find the LU-decomposition of the matrix
\[ A = \begin{pmatrix} 2 & 8 & 0 \\ 2 & 2 & -3 \\ 1 & 2 & 6 \end{pmatrix}. \]

Solution. An answer goes as follows:
\[ A = \begin{pmatrix} 2 & 8 & 0 \\ 2 & 2 & -3 \\ 1 & 2 & 6 \end{pmatrix} = LU = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1/2 & 1/3 & 1 \end{pmatrix}\begin{pmatrix} 2 & 8 & 0 \\ 0 & -6 & -3 \\ 0 & 0 & 7 \end{pmatrix}. \] □

3.F.61. Provide the QR-decomposition of the following matrices:
\[ A = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}. \]

Solution. The QR-factorizations in question are given as follows:
\[ A = QR = \begin{pmatrix} 1/2 & -\sqrt{12}/4 & 0 \\ 1/2 & \sqrt{12}/12 & -\sqrt{6}/3 \\ 1/2 & \sqrt{12}/12 & \sqrt{6}/6 \\ 1/2 & \sqrt{12}/12 & \sqrt{6}/6 \end{pmatrix}\begin{pmatrix} 2 & 3/2 & 1 \\ 0 & \sqrt{12}/4 & \sqrt{12}/6 \\ 0 & 0 & \sqrt{6}/3 \end{pmatrix}, \]
\[ B = QR = \begin{pmatrix} \sqrt{2}/2 & \sqrt{6}/6 & -\sqrt{3}/3 \\ \sqrt{2}/2 & -\sqrt{6}/6 & \sqrt{3}/3 \\ 0 & \sqrt{6}/3 & \sqrt{3}/3 \end{pmatrix}\begin{pmatrix} \sqrt{2} & \sqrt{2}/2 & \sqrt{2}/2 \\ 0 & \sqrt{6}/2 & \sqrt{6}/6 \\ 0 & 0 & 2\sqrt{3}/3 \end{pmatrix}. \] □
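Such factorizations are easy to validate in Sage by direct multiplication; for instance, for the LU-decomposition of 3.F.60:

A = matrix(QQ, [[2, 8, 0], [2, 2, -3], [1, 2, 6]])
L = matrix(QQ, [[1, 0, 0], [1, 1, 0], [1/2, 1/3, 1]])
U = matrix(QQ, [[2, 8, 0], [0, -6, -3], [0, 0, 7]])
print(L*U == A)   # True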
From the diagram we can derive directly that
\[ v_R = v - 2\langle v, n\rangle n. \]
In our case, $v_R = (-3, -2, -1)$. □

Solutions to the problems

3.A.3. We see that $h = c^T x$, where $c = (-1, 2)^T$. By multiplying the first inequality by $-1$ we obtain the following equivalent system of constraints:
\[ x_2 - x_1 \le 1, \quad -0.5x_1 + x_2 \le 2, \quad x_1 \ge 0, \ x_2 \ge 0. \]
An illustration of this LP problem goes as follows: The feasible region (the grey region in this diagram) is unbounded. However, the drawn thick semi-line starting from the point $(2, 3)$ gives us infinitely many optimal solutions.

3.A.5. The daily diet should contain 3.9 kg of hay and 4.3 kg of oats. Then the costs per foal are €13.82.

3.A.9. (a) Multiply both sides of the second inequality by $-1$ to get $-x_1 - x_2 \le 1$. Also, replace the relation $x_1 - 2x_2 = 1$ by the inequalities $x_1 - 2x_2 \le 1$ and $x_1 - 2x_2 \ge 1$. A multiplication of the latter by $-1$ yields the standard form of the initial problem: maximize $h = x_1 + 2.5x_2$ with respect to the conditions
\[ 2x_1 + 3x_2 \le 20, \quad -x_1 - x_2 \le 1, \quad x_1 - 2x_2 \le 1, \quad -x_1 + 2x_2 \le -1, \quad x_1 \ge 0, \ x_2 \ge 0. \]
Hence the dual problem is the minimization of $20y_1 + y_2 + y_3 - y_4$ subject to the conditions
\[ 2y_1 - y_2 + y_3 - y_4 \ge 1, \quad 3y_1 - y_2 - 2y_3 + 2y_4 \ge 2.5, \quad y_i \ge 0 \ \text{for all } i = 1, \dots, 4. \]
However, in the primal problem we had a constraint given by an equality, and so in the dual problem this should correspond to an unrestricted variable. Indeed, by setting $y_5 = y_3 - y_4$ we arrive at the minimization of $20y_1 + y_2 + y_5$ subject to
\[ 2y_1 - y_2 + y_5 \ge 1, \quad 3y_1 - y_2 - 2y_5 \ge 2.5, \quad y_1 \ge 0, \ y_2 \ge 0, \]
with $y_5$ unrestricted in sign.

(b) The standard form of the primal problem is the minimization of $h = 2x_1 + 3x_2 + 2x_4$ subject to
\[ -x_1 - 2x_2 - 2x_3 \ge 6, \quad -x_1 - 4x_2 + 2x_4 \ge -5, \quad x_1 + 4x_2 - 2x_4 \ge 5, \quad x_2 - x_3 + 4x_4 \ge 2, \quad x_i \ge 0 \ \text{for all } i = 1, \dots, 4. \]
The dual problem reads as follows: maximize $6y_1 - 5y_2 + 5y_3 + 2y_4$ subject to
\[ -y_1 - y_2 + y_3 \le 2, \quad -2y_1 - 4y_2 + 4y_3 + y_4 \le 3, \quad -2y_1 - y_4 \le 0, \quad 2y_2 - 2y_3 + 4y_4 \le 2, \quad y_i \ge 0 \ \text{for all } i = 1, \dots, 4. \]
By setting $y_5 = y_2 - y_3$ we finally arrive at the maximization of $6y_1 + 2y_4 - 5y_5$ with respect to
\[ -y_1 - y_5 \le 2, \quad -2y_1 + y_4 - 4y_5 \le 3, \quad -2y_1 - y_4 \le 0, \quad 4y_4 + 2y_5 \le 2, \quad y_1 \ge 0, \ y_4 \ge 0, \]
with $y_5$ unrestricted in sign.

3.B.4. Let us proceed with Sage for now, although we recommend that readers follow the discussion in Section 3.2.2 and perform the formal computations on their own. For the given recurrence relation, the corresponding code is as follows:

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n+2) - 3*a(n+1) - 3*a(n)
initial = {a(1): 1, a(2): 3}
rsolve(f, a(n), initial)

As an answer, Sage returns the following:

-sqrt(21)*(3/2 - sqrt(21)/2)**n/21 + sqrt(21)*(3/2 + sqrt(21)/2)**n/21

In other words, the solution is given by
\[ x_n = \frac{1}{\sqrt{21}}\left(\frac{3+\sqrt{21}}{2}\right)^n - \frac{1}{\sqrt{21}}\left(\frac{3-\sqrt{21}}{2}\right)^n. \]

3.B.7. A verification in Sage can be obtained as follows:

from sympy import Function, rsolve
from sympy.abc import n
a = Function('a')
f = a(n+2) - 2*a(n+1) - 3*a(n)
initial = {a(0): 0, a(1): 1}
rsolve(f, a(n), initial)

Sage's output has the form -(-1)**n/4 + 3**n/4.

3.B.8. To treat this task we begin with the characteristic equation, given by $r^4 - r^3 - r + 1 = 0$, or equivalently by $(r-1)^2(r^2 + r + 1) = 0$. This has two complex roots given by
\[ r_1 = -\tfrac{1}{2} + i\tfrac{\sqrt{3}}{2} = \cos(2\pi/3) + i\sin(2\pi/3), \qquad r_2 = -\tfrac{1}{2} - i\tfrac{\sqrt{3}}{2} = \cos(2\pi/3) - i\sin(2\pi/3), \]
and the real root 1, which is double.
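As a quick cross-check of this factorization and of the roots just listed, one may let Sage factor the characteristic polynomial (a small sketch; the ring name R is our own choice):

R.<r> = QQ[]
p = r^4 - r^3 - r + 1
print(p.factor())              # (r - 1)^2 * (r^2 + r + 1)
print((r^2 + r + 1).roots(CC)) # the two complex roots above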
Thus we can find a basis of the solution space; this consists of the sequences
\[ \{a_n\} := \Big\{\Big(-\tfrac12 + i\tfrac{\sqrt3}{2}\Big)^n\Big\}_{n=1}^{\infty}, \qquad \{b_n\} := \Big\{\Big(-\tfrac12 - i\tfrac{\sqrt3}{2}\Big)^n\Big\}_{n=1}^{\infty}, \]
together with $\{n\}_{n=1}^{\infty}$ and the constant sequence $\{1\}_{n=1}^{\infty}$. Our focus now shifts to establishing a real basis for the solution space, by replacing the two complex generators from the previous basis with sequences that are entirely real. These generators are "power series" whose elements are complex conjugates. We will discuss power series in Chapter 5. Now, it suffices to take as generators appropriate linear combinations of $a_n$, $b_n$, as in Problem 3.B.5. This yields the following real basis:
\[ \{1\}_{n=1}^{\infty}, \quad \{n\}_{n=1}^{\infty}, \quad \{\cos(2n\pi/3)\}_{n=1}^{\infty}, \quad \{\sin(2n\pi/3)\}_{n=1}^{\infty}. \]

3.B.9. The answer is given by the sequence $(x_n)$ with general term
\[ x_n = -3(-1)^n - 2\cos(2n\pi/3) - 2\sqrt{3}\,\sin(2n\pi/3). \]

3.B.11. The general solution of the homogeneous equation is of the form $a(-1)^n + b\,2^n$. A particular solution is the constant $-1/2$. Therefore, the general solution of the given non-homogeneous equation without initial conditions has the form $a(-1)^n + b\,2^n - \tfrac12$. Substituting in the initial conditions we obtain the constants $a = -5/6$, $b = 5/6$. Thus, the desired solution is given by the sequence
\[ x_n = -\frac{5}{6}(-1)^n + \frac{5}{6}\,2^n - \frac{1}{2}. \]
To confirm this result by Sage use the block

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n+2) - a(n+1) - 2*a(n) - 1
initial = {a(1): 2, a(2): 2}
rsolve(f, a(n), initial)

3.C.5. In this case we have $b_1 = 0$, $b_2 = 2 = b_3$, $b_4 = 1$, and thus the Leslie matrix has the form
\[ A = \begin{pmatrix} 0 & 2 & 2 & 1\\ 0.4 & 0 & 0 & 0\\ 0 & 0.5 & 0 & 0\\ 0 & 0 & 0.2 & 0 \end{pmatrix}. \]
Its unique dominant eigenvalue is $\lambda_1 \approx 1.09486$. Since $\lambda_1 > 1$, the population increases, with growth rate $1.09486^t$, in contrast with the population of bushbabies in 3.C.3. The normalized eigenvector associated with $\lambda_1$ is given by $X_1 \approx (0.639, 0.233, 0.106, 0.019)^T$; hence the long term trends of the female population are as follows: 63.9% of age class A, 23.3% of age class B, 10.6% of age class C and 1.9% of age class D. Finally, we compute
\[ A^{10} \approx \begin{pmatrix} 1.047 & 2.872 & 2.094 & 0.96\\ 0.384 & 1.047 & 0.761 & 0.348\\ 0.174 & 0.48 & 0.349 & 0.16\\ 0.032 & 0.087 & 0.064 & 0.029 \end{pmatrix}, \qquad p_{10} = A^{10}p_0 \approx (150.813,\ 54.956,\ 25.171,\ 4.598)^T. \]
Therefore, the female population after ten years will consist of approximately 235 members (this is the sum of the entries of $p_{10}$), and the total population of the colony (both females and males) after the same period will approximately reach the level of 470 galagos. Below is the required diagram representing the exponential growth of the female population for a duration of thirty years.

3.D.1. Let us first recall that the norm of an arbitrary vector $z \in \mathbb{C}^n$ with respect to the standard Hermitian form $\langle\ ,\ \rangle$ is given by $\|z\|^2 = \sum_{j=1}^n z_j\bar z_j$. For the vectors $x = (3+2i, 1-i, -i)^T$ and $y = (2-2i, 1-i, 2+i)^T$ in $\mathbb{C}^3$ one computes
\[ \langle x, y\rangle = y^* x = \begin{pmatrix} 2+2i & 1+i & 2-i \end{pmatrix}\begin{pmatrix} 3+2i\\ 1-i\\ -i \end{pmatrix} = (3+2i)(2+2i) + (1-i)(1+i) - i(2-i) = 3 + 8i. \]
Moreover, we see that
\[ \|x\| = \sqrt{(3+2i)(3-2i) + (1-i)(1+i) - i^2} = 4, \qquad \|y\| = \sqrt{(2-2i)(2+2i) + (1-i)(1+i) + (2+i)(2-i)} = \sqrt{15}. \]
On the other hand, the vector $x - y$ has the expression $x - y = (1+4i, 0, -(2+2i))^T$. Therefore,
\[ d(x, y) = \sqrt{(1+4i)(1-4i) + (2+2i)(2-2i)} = 5. \]
Since $\|x\| = 4$ and $\|y\| = \sqrt{15}$, the normalized vectors with $\|\hat x\| = 1 = \|\hat y\|$ are given by $\hat x = \tfrac14(3+2i, 1-i, -i)$ and $\hat y = \tfrac{1}{\sqrt{15}}(2-2i, 1-i, 2+i)$, respectively.
See the Problem 3.D.2 how one can verify these results in Sage. 3.D.3. (a) For the vectors v = (1 + i, 2 − i)T and w = (3 − 2i, 1 + i)T in C2 we compute ⟨v, w⟩ = w∗ v = ( 3 + 2i 1 − i ) ( 1 + i 2 − i ) = (3 + 2i)(1 + i) + (1 − i)(2 − i) = 2 + 2i . Moreover, we see that ∥v∥2 = v∗ v = (1 + i)(1 − i) + (2 − i)(2 + i) = 7 , ∥w∥2 = w∗ w = (3 − 2i)(3 + 2i) + (1 + i)(1 − i) = 15 . Recall that the angle θ between v, w is defined by the equation (compare with the formula given in 2.3.22 for the real case) cos θ = Re(⟨v, w⟩) ∥v∥∥w∥ , where Re(z) denotes the real part of a complex number z ∈ C. Hence it follows that θ = cos−1 ( 2 √ 7 √ 15 ) ≈ 1.37435 ≈ 78.75◦ . To verify all these results in Sage one can use the following cell: 304 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS v = vector(CDF, [1+I, 2-I]) w = vector(CDF, [3-2*I, 1+I]) vinnerw=w.hermitian_inner_product(v) print("The standard Hermitian product of the vectors v=",v,"" "and w=",w,"" "equals to",vinnerw) vn=N(v.norm()^2, digits=3) print("The square norm of the vector v=", v, "" "equals to", vn) wn=N(w.norm()^2, digits=3) print("The square norm of the vector w=", w, "" "equals to", wn) theta=arccos(w.hermitian_inner_product(v).real()/(norm(v)*norm (w))) 305 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS print("The angle between the vectors v and w equals to", N(theta, digits = 4), "radians" ) print("The angle between the vectors v and w equals to", N(theta*180/pi, digits = 4), "degrees") In this block we used the option CDF (Complex Double Field) when introducing the vectors v, w, similarly with the description given in 3.D.2. Recall that CDF approximates the field of complex numbers using double-precision floating point numbers. Alternatively, we could use the option CC (for this task, the result would remain the same). The CDF option helps to save some space in the output, compared to CC. Finally, for your convenience, in this block Sage was programmed to provide more detailed responses, than usual, as one can see in the output below: The standard Hermitian product of the vectors v= (1.0 + 1.0*I, 2.0 - 1.0*I) and w= (3.0 - 2.0*I, 1.0 + 1.0*I) equals to 2.0 + 2.0*I The square norm of the vector v= (1.0 + 1.0*I, 2.0 - 1.0*I) equals to 7.00 The square norm of the vector w= (3.0 - 2.0*I, 1.0 + 1.0*I) equals to 15.0 The angle between the vectors v and w equals to 1.374 radians The angle between the vectors v and w equals to 78.75 degrees Similarly can be treated the cases (b) and (c) for which we only present the results. (b) ⟨v, w⟩ = 5 + 4i, ∥v∥2 = 10, ∥w∥2 = 7, cos θ ≈ 0.59, θ ≈ 0.93 ≈ 53.5◦ . (c) ⟨v, w⟩ = 2 − 4i, ∥v∥2 = 5, ∥w∥2 = 19, cos θ ≈ 0.205, θ ≈ 1.364 ≈ 78.1◦ . 3.D.4. One can easily check the following properties of f; f(u + w, v) = f(u, v) + f(w, v), f(au, v) = af(u, v) and f(v, u) = f(u, v), for all u, v ∈ V = F2 and a ∈ F. Moreover, if u = (x1, x2)T we have f(u, u) = x1 ¯x1 + 4x1 ¯x2 + 4x2 ¯x1 + x2 ¯x2 = |x1|2 + 4(x1 ¯x2 + x2 ¯x1) + |x2|2 , where we used the relation z¯z = |z|2 with z ∈ F. However, x1 ¯x2 + x2 ¯x1 = |x1 + x2|2 − |x1|2 − |x2|2 , and hence the above relation reduces to f(u, u) = −3|x1|2 + 4|x1 + x2|2 − 3|x2|2 . Therefore, we may have f(u, u) < 0 for some u ̸= 0. For example, f(u, u) = −6 for u = (1, −1)T . This shows that f cannot be a scalar product on V . 3.D.6. We see that ρa,b(u, u) = au2 1 + bu2 2 ≥ 0 and the only way that this is zero is when u = 0, that is, u1 = u2 = 0. This proves positive definiteness. Moreover, by the commutativity of real numbers we get ρa,b(v, u) = ρa,b(v, u) = av1u1 + bv2u2 = ρa,b(u, v) . 
Hence ρa,b is Hermitian symmetric. Next, ρa,b(u, v + w) = au1(v1 + w1) + bu2(v2 + w2) = (au1v1 + bu2v2) + (au1w2 + bu2w2) = ρa,b(u, v) + ρa,b(u, w) , and moreover ρa,b(cu, v) = ρa,b(u, cv) = cρa,b(u, v) for any c ∈ R. This proves that ρa,b is a scalar product on R2 . Then the angle between u, v satisfies cos θ = ρa,b(u, v) ∥u∥ρa,b ∥v∥ρa,b . For a = 2 and b = 1 we compute ρ2,1(u, v) = 2 · 1 + 1 · (−1) = 1, and ∥u∥ρ2,1 = √ 2 · 1 + 1 · 1 = √ 3 = ∥v∥ρ2,1 . Thus cos θ = 1/3 and so θ = arccos(1/3). With respect to the dot product the vector u, v are orthogonal, u · v = 0, hence they obviously form a different angle in comparison with the one appearing with respect to ρ2,1. 3.D.7. For any vector x ∈ Cn we see that ρA(x, x) = x∗ A∗ Ax = (Ax)∗ Ax = ⟨Ax, Ax⟩ ≥ 0, where ⟨ , ⟩ is the standard Hermitian form on Cn . Thus ρA(x, x) ≥ 0 with the equality holding if and only if Ax = 0. Let us denote by A1, . . . , An the n columns of A, such that A = ( A1 A2 · · · An ) with Ai = (a1i, a2i, . . . , ami)T , for any 1 ≤ i ≤ n. Recall the column space of A: C(A) = {w ∈ Cm : w = ∑n i=1 xiAi, with xi ∈ C} = {w ∈ Cm : w = Ax} , where x = (x1, . . . , xn)T ∈ Cn , and the second relation holds since the vector Ax ∈ Cm is given by a linear combination of the columns of A with elements of x, i.e., Ax = ∑n i=1 xiAi. Hence our assumption that C(A) is n-dimensional, is equivalent to say that A consists of n linearly independent column vectors, or that the conditions Ax = 0 and x = 0 are equivalent each other. This shows that ρA is positive definite. Also, since ρA(x, y) ∈ C for any x, y ∈ Cn and so (ρA(x, y))∗ = ρA(x, y) we see that ρA(x, y) = (Ay)∗ (Ax) = ( (Ax)∗ (Ay) )∗ = (x∗ A∗ Ay)∗ = (ρA(y, x))∗ = ρA(y, x) , for any x, y ∈ Cn . Moreover, ρA(x, y + z) = (y + z)∗ (A∗ Ax) = y∗ A∗ Ax + z∗ A∗ Ax = ρA(x, y) + ρA(x, z) for any three vectors x, y, z, while for some a ∈ C and any x, y ∈ Cn we get ρA(ax, y) = (Ay)∗ Aax = a(Ay)∗ Ax = aρA(x, y). Thus ρ is an inner product. 306 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS We should mention that by the definition of ρA one gets ρA(x, y) = y∗ A∗ Ax = (Ay)∗ (Ax) = ⟨Ax, Ay⟩, where Ax is viewed as a (column) vector. This provides of course an alternative and much shorter verification of the claim, but for the convenience of the reader we kept both arguments. Finally observe that when A is the identity matrix, then ρA is nothing than the standard Hermitian form on Cn . 3.D.9. (a) For a real scalar product space (V, ⟨ , ⟩) we have ⟨u, w⟩ = ⟨w, u⟩, for all u, w ∈ V and hence ∥u + w∥2 = ⟨u + w, u + w⟩ = ∥u∥2 + ∥w∥2 + 2⟨u, w⟩ , u, w ∈ V . The first claim is now direct. For a complex scalar product space (V, ⟨ , ⟩) the bilinear form ⟨ , ⟩ should satisfy ⟨u, w⟩ = ⟨w, u⟩, for all u, w ∈ V . This requirement prevents the form ⟨ , ⟩ from having the same symmetry, as in the real case. For example, consider C2 endowed with the standard (positive) Hermitian form ⟨u, w⟩ = u1 ¯w1 + u2 ¯w2, with u, w ∈ C2 . Given the vectors u = (0, i)T ∈ C2 an w = (0, 1)T ∈ C2 we see that ⟨u, w⟩ = i and ∥u + w∥2 = 2, ∥u∥2 = i¯i = 1, ∥w∥2 = 1. Hence we have ∥u + w∥2 = ∥u∥2 + ∥w∥2 but u, w are not orthogonal (observe however that if ⟨u, w⟩ = 0, then ∥u + w∥2 = ∥u∥2 + ∥w∥2 still makes sense on a complex inner product space, see the proof of the theorem in 3.4.2). (b) If u + w ⊥ u − w then 0 = ⟨u + w, u − w⟩ = ∥u∥2 − ∥w∥2 . Thus ∥u∥2 = ∥w∥2 , which gives ∥u∥ = ∥w∥. Conversely, let u, w ∈ V such that ∥u∥ = ∥w∥. Then ⟨u + w, u − w⟩ = ∥u∥2 − ∥w∥2 = 0, hence u + w ⊥ u − w. 3.D.10. 
(a) From the definition of a unitary space given in 3.4.1, we see that
\[ \|u + av\|^2 = \langle u + av, u + av\rangle = \langle u, u\rangle + \bar a\langle u, v\rangle + a\langle v, u\rangle + a\bar a\langle v, v\rangle = \|u\|^2 + 2\,\mathrm{Re}(\bar a\langle u, v\rangle) + |a|^2\|v\|^2, \]
and similarly $\|u - av\|^2 = \|u\|^2 - 2\,\mathrm{Re}(\bar a\langle u, v\rangle) + |a|^2\|v\|^2$, for any two vectors $u, v \in V$ and any scalar $a$. Thus, $\|u + av\| = \|u - av\|$ if and only if $\mathrm{Re}(\bar a\langle u, v\rangle) = 0$. Since the latter relation should hold for any scalar $a$, taking $a = 1$ for the real case, and $a = i$ for the complex case, we see that $\langle u, v\rangle = 0$.

(b) The proof is similar and we leave it for practice.

3.D.13. According to part (4) of the main theorem in Section 3.4.2, for any two vectors $x, y \in V$ we have the (Fourier) expansions $x = \sum_j \langle x, E_j\rangle E_j$ and $y = \sum_k \langle y, E_k\rangle E_k$. Thus
\[ \langle x, y\rangle = \Big\langle \sum_j \langle x, E_j\rangle E_j,\ \sum_k \langle y, E_k\rangle E_k \Big\rangle = \sum_{j,k} \langle x, E_j\rangle\overline{\langle y, E_k\rangle}\,\langle E_j, E_k\rangle = \sum_j \langle x, E_j\rangle\overline{\langle y, E_j\rangle}. \]

3.D.14. We can easily check that the given set $\{E_1 = (-1, 1, 2), E_2 = (2, 0, 1), E_3 = (1, 5, -2)\}$ is an orthogonal basis of $\mathbb{R}^3$. We only prove orthogonality by Sage and leave a formal verification of the claim to the reader. Let us program Sage to do this quickly, as follows:

E1 = vector(QQbar, [-1, 1, 2])
E2 = vector(QQbar, [2, 0, 1])
E3 = vector(QQbar, [1, 5, -2])
B = [E1, E2, E3]
ips = [B[i].dot_product(B[j]) for i in range(3) for j in range(i+1, 3)]
all([ip == 0 for ip in ips])

Sage's answer is True. Notice that in the same way we can also verify that the vectors $E_i$ are not of unit length, for $i = 1, 2, 3$. This can be done by adding the cell

ips2 = [B[i].dot_product(B[i]) for i in range(3)]
all([ip2 == 1 for ip2 in ips2])

and in this case Sage prints out False. Now recall that if $\{u_1, \dots, u_m\}$ is an orthogonal basis of a unitary space $(V, \langle\ ,\ \rangle)$, then any vector $u \in V$ is expressed by
\[ u = \sum_{k=1}^m \frac{\langle u, u_k\rangle}{\|u_k\|^2}\,u_k. \]
Let us apply this rule for our task. We compute
\[ \langle u, E_1\rangle = -12, \quad \langle u, E_2\rangle = 8, \quad \langle u, E_3\rangle = 24, \qquad \|E_1\|^2 = 6, \quad \|E_2\|^2 = 5, \quad \|E_3\|^2 = 30. \]
Thus $u = -\frac{12}{6}E_1 + \frac{8}{5}E_2 + \frac{24}{30}E_3 = -2E_1 + \frac{8}{5}E_2 + \frac{4}{5}E_3$.

3.D.15. We will apply a method that relies on the Cauchy–Schwarz inequality, see 3.4.3, as is suggested in the statement. Set $u = (\sqrt{a+2b}, \sqrt{b+2c}, \sqrt{c+2a})^T$, $v = (1, 1, 1)^T$ and view $u, v$ as vectors in $\mathbb{R}^3$. Their dot product is given by
\[ \langle u, v\rangle = v^T u = \sqrt{a+2b} + \sqrt{b+2c} + \sqrt{c+2a}, \]
while $\|u\|^2 = (a+2b)+(b+2c)+(c+2a) = 3(a+b+c)$ and $\|v\|^2 = 3$. An application of the Cauchy–Schwarz inequality $\langle u, v\rangle \le \|u\|\|v\|$ yields the expression
\[ \sqrt{a+2b} + \sqrt{b+2c} + \sqrt{c+2a} \le 3\sqrt{a+b+c}, \]
which gives the result after dividing by $\sqrt{a+b+c}$ (which is non-zero by the assumption in the statement).

3.D.16. The standard basis $e$ is an orthonormal basis of $(\mathbb{C}^3, \langle\ ,\ \rangle)$ and hence we see that
\[ u = \sum_{j=1}^{3} \langle u, e_j\rangle e_j = (2+i)e_1 + (-1+2i)e_2 + (3-i)e_3, \]
that is, $\langle u, e_1\rangle = 2+i$, $\langle u, e_2\rangle = -1+2i$ and $\langle u, e_3\rangle = 3-i$. Now, according to Parseval's equality (see 3.4.3) we should have
\[ \|u\|^2 = \sum_{j=1}^{3} |\langle u, e_j\rangle|^2 = |2+i|^2 + |-1+2i|^2 + |3-i|^2 = (2+i)(2-i) + (-1+2i)(-1-2i) + (3-i)(3+i) = 20, \]
where we applied the rule $|z|^2 = z\bar z$, with $z \in \mathbb{C}$. On the other hand, we also see that
\[ \|u\|^2 = \langle u, u\rangle = (2+i)\overline{(2+i)} + (-1+2i)\overline{(-1+2i)} + (3-i)\overline{(3-i)} = 20, \]
and we are done.

3.D.19. When we rotate a point in $\mathbb{R}^3$ about a particular axis, the coordinate corresponding to that axis remains unchanged. The remaining two coordinates are given by the well known rotation in the plane (see Chapter 1).
This yields the matrices Rx(θ) =   1 0 0 0 cos θ − sin θ 0 sin θ cos θ   , Ry(θ) =   cos θ 0 sin θ 0 1 0 − sin θ 0 cos θ   , Rz(θ) =   cos θ − sin θ 0 sin θ cos θ 0 0 0 1   which represent the rotation about the x-axis, the y-axis and the z-axis, respectively. Note the sign of θ in the rotation matrix about the y-axis. As with any other rotation, we want the rotation about the y-axis to be in the “positive sense”. This means that when viewed from the negative y-axis direction, the rotation should appear anti-clockwise. Consequently, the signs in the rotation matrices depend on the orientation of our coordinate system. Usually, in the 3-dimensional space the “right-handed coordinate system” is chosen, also known as “dextrorotary coordinate system”. In a right-handed coordinate system, if you orient your right hand so that your thumb points along the positive x-axis and curl your fingers naturally, your fingers will point in the direction of the y-axis, followed by the z-axis. In particular, the index finger will point the positive y-axis and the middle finger will point the positive z-axis. For example, a positive rotation Rx(θ) about the x-axis occurs perpendicular to the yz-plane, and a negative rotation will occurs in the yz-plane, with opposite direction to the positive rotation. Notice when looking from positive x towards the origin, a positive rotation is counterclockwise. This configuration visually confirms the order x → y → z and its cyclical nature is evident in how axes are sequentially rotated: x → y → z → x → . . . for a positive cycle and x → z → y → x → . . . for a negative one. 3.D.20. The first task has been analyzed for unitary transformations in 3.4.4. Let us present an alternative proof. Suppose that U ∈ Matn(C) is unitary and consider the scalar product ⟨Ux, y⟩. Based on the relation E = UU−1 = U−1 U, where E is the identity n × n matrix, we get ⟨Ux, y⟩ = ⟨Ux, UU−1 y⟩ = ⟨x, U−1 y⟩. On the other hand, by the definition of the conjugate transpose matrix we have ⟨Ux, y⟩ = ⟨x, U∗ y⟩ for any x, y ∈ Cn . By comparing these two relations and using bilinearity, we arrive at the following relation: ⟨x, (U−1 − U∗ )y⟩ = 0 , (∗) for any x, y ∈ Cn . We can now utilize the fact that ⟨ , ⟩ is a non-degenerate bilinear form. Hence, by the relation (∗) we deduce that U−1 − U∗ = 0, which implies that U−1 = U∗ . Conversely, assume that U∗ U = E. Then, for any x ∈ Cn we have ∥Ux∥2 = ⟨Ux, Ux⟩ = x, U∗ Ux⟩ = ⟨x, x⟩ = ∥x∥2 , and hence U is unitary. To prove that U(n) is a group, we can proceed similarly to the case of O(n). Hence we need to verify that the composition of two unitary transformations and the inverse of a unitary transformation, both retain the same property. Let A, B ∈ U(n). Then A−1 = A∗ and B−1 = B∗ , respectively. Moreover (AB)∗ = B∗ A∗ = B−1 A−1 = (AB)−1 , and hence AB is also unitary. The verification of the second property is left as an exercise for the reader. Finally, the claim for the determinant occurs by the relation det(E) = 1 = det(U∗ U) = det(U∗ ) det(U) = det(U) det(U) = | det(U)|2 . This implies that the absolute value of the determinant of any unitary matrix is equal to one. 3.D.22. Consider linear maps φ, ψ : V → V as in the statement, and let A, B be their matrices. In Problem 2.E.56 we proved that A, B must satisfy all the analogous properties listed in the statement. Since φ, ψ are determined by their matrices, all the formulas follow. 
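These formulas are also easy to spot-check numerically. The following Sage sketch (with random matrices of our own choosing) tests two identities of this kind, namely $(AB)^* = B^*A^*$ and $(A^*)^* = A$, up to floating-point error:

A = random_matrix(CDF, 3, 3)
B = random_matrix(CDF, 3, 3)
print(((A*B).conjugate_transpose() - B.conjugate_transpose()*A.conjugate_transpose()).norm() < 1e-12)
print((A.conjugate_transpose().conjugate_transpose() - A).norm() < 1e-12)

Both lines should print True.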
Alternatively, one can work in terms of endomorphisms and prove all the relations based on the definition of the adjoint map. We leave such a description to the reader. 308 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS 3.D.25. For A we have A∗ = A and hence A is Hermitian (i.e., self-adjoint). We can directly verify this by Sage, as follows: A=matrix(CC, [[sqrt(2), I, 1-I], [-I, 10, sqrt(2)+I], [1+I, sqrt(2)-I, 0]]) A.is_hermitian() This command returns True. Alternatively, we can check this by verifying if A equals its conjugate transpose: A=matrix(CC, [[sqrt(2), I, 1-I], [-I, 10, sqrt(2)+I], [1+I, sqrt(2)-I, 0]]) A == A.conjugate_transpose() which again returns True. Among the remaining matrices, B and C are Hermitian, while D is not. For the matrix B, verifying this property using Sage may take more time compared to a manual computation, but it is useful for handling parametric matrices. The verification process is as follows: a = var("a"); b = var("b") assume(a, "real"); assume(b, "real") B = matrix(SR, 2, 2, [[a, I+b],[-I+b, sqrt(abs(a))]]) B.is_hermitian() We leave the verification for the matrices C and D to the reader. 3.D.26. Under our assumptions for the matrices A, B, we need to show tr(A∗ A) = tr(B∗ B). This is equivalent to proving that ∑n i,j=1 |aij|2 = ∑n i,j=1 |bij|2 , providing an alternative reformulation of the problem. Since B = U∗ AU, we have B∗ = (U∗ AU)∗ = U∗ A∗ U. Therefore tr(B∗ B) = tr(U∗ A∗ UU∗ AU) = tr(U∗ A∗ AU) = tr(A∗ A) . This follows from the property that similar matrices X, Y , i.e., X = P−1 Y P, have the same trace, tr(X) = tr(Y ), and and from the fact that U is unitary, i.e., U−1 = U∗ . 3.D.27. To compute its adjoint operator φ∗ : R3 → R2 , we need to find the matrix representation of φ∗ with respect to the standard bases of R3 and R2 . In matrix form we have φ (( x y )) =   √ 2 1 1 −1 0 2   ( x y ) . Thus, the matrix of φ with respect to the standard orthonormal bases is given by A =   √ 2 1 1 −1 0 2   . Now, to compute φ∗ we can take the conjugate transpose of A, but since A has only real entries, this simplifies to just the transpose of A: We compute A∗ = AT = (√ 2 1 0 1 −1 2 ) . and hence φ∗     x y z     = ( √ 2x + y x − y + 2z ) , with x, y, z ∈ R. For the second task, we need to verify that B = B∗ . The conjugate transpose of the matrix B is given by B∗ = ¯BT =   10 5 + 5i 3 + 2i 5 − 5i 5 √ 2i 3 − 2i − √ 2i 0   . Hence B = B∗ and consequently ψ is a self-adjoint operator. 3.D.28. The adjoint operator of tr : Matm(F) → F is a linear map tr∗ : F → Matm(F), assigning to any scalar z ∈ F a m × m matrix over F. It is determined by the relation B ( tr(A), z ) = B ( A, tr∗ (z) ) , for any z ∈ F and A ∈ Matm(F). Observe here that on the left-hand side B acts on scalars (viewed as trivial 1 × 1 matrices). By the definition of B, we have B(tr(A), z) = ¯z tr(A) . 309 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Setting tr∗ (z) = B for some matrix B ∈ Matm(F), we also obtain B(A, tr∗ (z)) = tr(B∗ A). Hence, we need ¯z tr(A) = tr(B∗ A). Based on the properties of the trace described in 2.C.38, we see that ¯z tr(A) = tr(¯zA) = tr(¯zEA), where E denotes the m × m matrix. By comparing these two relations and using the non-degeneracy of B, we find that B∗ = ¯zE. Therefore B = (B∗ )∗ = (¯zE)∗ = E∗ ¯¯z = Ez , which means that tr∗ (z) = zE from any z ∈ F. 3.D.29. By definition, the linear map projW : V → V sends u ∈ V to its orthogonal projection projW u on W. Let W⊥ be the orthogonal complement of W with respect to ⟨ , ⟩. 
Recall by Chapter 2 the direct sum decomposition V = W ⊕ W⊥ , which for any u ∈ V implies that u = projW u + projW ⊥ u = projW u + (u − projW u) = P(u) + (u − P(u)) , where P(u) := projW u. Then, by linearity we see that ⟨P(u), w⟩ = ⟨P(u), P(w) + (w − P(w))⟩ = ⟨P(u), P(w)⟩ + ⟨P(u), w − P(w)⟩ , for any u, w ∈ V . However, by assumption P(u) ∈ W and (w − P(w)) ∈ W⊥ . Thus, the second component in the right-hand side of the previous relation vanishes, which yields the relation ⟨P(u), w⟩ = ⟨P(u), P(w)⟩. In a similar way one computes ⟨u, P(w)⟩ = ⟨P(u), P(w)⟩, and the result follows. 3.D.30. The first claim is direct. To present a counterexample, use the matrix A =   1 1 0 0 1 1 1 0 1   ∈ Mat3(R). Then a direct computation shows that A∗ = AT =   1 0 1 1 1 0 0 1 1   , AA∗ = A∗ A =   2 1 1 1 2 1 1 1 2   . Hence A is normal, but is not orthogonal since AAT = AT A ̸= E. 3.D.32. We present directly the code including some comments within the code: # Define the matrix phi symbolically phi = Matrix(SR, [[2, 1+I, 0], [1-I, 3, 0], [0, 0, 1]]) # Verify that phi is normal is_normal = phi*phi.conjugate_transpose()== phi.conjugate_transpose() * phi print("Is phi normal?:", is_normal) # Compute eigenvalues and eigenvectors symbolically eigenvalues = phi.eigenvalues() eigenvectors = phi.eigenvectors_right() # Display eigenvalues and eigenvectors print("Eigenvalues:", eigenvalues) print("Eigenvectors:") for eig in eigenvectors: print(f"Eigenvalue: {eig[0]}, Eigenvectors: {eig[1]}") # Extract eigenvectors corresponding to distinct eigenvalues if len(eigenvectors) >= 2: v1 = eigenvectors[0][1][0] # Eigenvector for the first eigenvalue v2 = eigenvectors[1][1][0] # Eigenvector for the second eigenvalue # Compute the inner product of the eigenvectors inner_product = v1.inner_product(v2) print("Inner product of v1 and v2:", inner_product) # Verify if they are orthogonal is_orthogonal = inner_product == 0 print("Are v1 and v2 orthogonal?:", is_orthogonal) else: print("Not enough distinct eigenvalues to check orthogonality.") 310 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Let us also display Sage’s output: Is phi normal?: True Eigenvalues: [1, 1, 4] Eigenvectors: Eigenvalue: 1, Eigenvectors: [(1, 1/2*I - 1/2, 0), (0, 0, 1)] Eigenvalue: 4, Eigenvectors: [(1, -I + 1, 0)] Inner product of v1 and v2: I + 1 Are v1 and v2 orthogonal?: (I + 1) == 0 Notice the comparison (I + 1) == 0 is used to determine if the computed inner product is zero. Since 1 + i ̸= 0 the result of the comparison is False, indicating that the eigenvectors are not orthogonal. We leave a formal computation for practice. 3.D.34. Such an example is given by the matrix A = ( 0 1 − i 1 − i 0 ) . Indeed, since A∗ = ¯AT = ( 0 1 + i 1 + i 0 ) ̸= ±A, the matrix at hand is neither Hermitian, nor skew-Hermitian. However, A is a normal matrix since AA∗ = ( 0 1 − i 1 − i 0 ) ( 0 1 + i 1 + i 0 ) = ( 2 0 0 2 ) = ( 0 1 + i 1 + i 0 ) ( 0 1 − i 1 − i 0 ) = A∗ A . 
Here is a detailed verification with Sage: # Define the matrix A = Matrix([[0, 1 - I], [1 - I, 0]]) # Compute the conjugate transpose (Hermitian transpose) of A A_H = A.H # Check if A is Hermitian is_hermitian = A == A_H # Check if A is skew-Hermitian is_skew_hermitian = A == -A_H # Compute A * A^H A_AH = A * A_H # Compute A^H * A AH_A = A_H * A # Check if A is normal is_normal = A_AH == AH_A # Print the results print("Matrix A:") print(A) print("\nConjugate Transpose of A:") print(A_H) print("\nIs A Hermitian?") print(is_hermitian) print("\nIs A Skew-Hermitian?") print(is_skew_hermitian) print("\nIs A Normal?") print(is_normal) 3.D.35. We will mainly use Sage and leave the formal computations for practice. The given matrix A is symmetric, A = AT . The following cell verifies this statement in Sage: A = matrix(SR, [[1, 2, 6], [2,0, 2], [6, 2, 1]]) A.T == A Recall also that the eigenvalues of A occur by the command A.eigenvalues() They are given by λ1 = 8, λ2 = −5 and λ3 = −1. As for the eigenvectors, use the cell A.eigenvectors_right() Sage’s output is the list [(8, [(1, 1/2, 1)], 1), (-1, [(1, -4, 1)], 1), (-5, [(1, 0, -1)], 1)] 311 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Thus the vectors u1 = (2, 1, 2)T , u2 = (1, 0, −1)T and u3 = (1, −4, 1)T are eigenvectors of A, corresponding to the eigenvalues λ1, λ2, and λ3, respectively. Observe that ⟨u1, u2⟩ = 0, ⟨u1, u3⟩ = 0, and ⟨u2, u3⟩ = 0 where ⟨ , ⟩ is the dot product on R3 . Thus, the eigenvectors of A are orthogonal to each other. We also compute ∥u1∥ = 3, ∥u2∥ = √ 2 and ∥u3∥ = 3 √ 2. Consequently, the orthogonal matrix P has the form P =    2 3 1√ 2 1 3 √ 2 1 3 0 − 4 3 √ 2 2 3 − 1√ 2 1 3 √ 2    . Note that P is not uniquely defined. It remains to verify the relations P−1 = PT , and P−1 AP = diag(λ1, λ2, λ3). In Sage this can be done easily: A = matrix(SR, [[1, 2, 6], [2,0, 2], [6, 2, 1]]) D = diagonal_matrix([8, -5, -1]) P = matrix(SR, [[2/3, 1/sqrt(2), 1/(3*sqrt(2))],[1/3, 0, -4/(3*sqrt(2))], [2/3, -1/sqrt(2), 1/(3*sqrt(2))]]) print(P.inverse()==P.T); print(P.inverse()*A*P==D) 3.D.36. The one direction starting with A orthogonally diagonalizable is easy. Then we have PT AP = D for some orthogonal n × n matrix P and a diagonal n × n matrix D. Thus A = PDPT and since D being diagonal satisfies D = DT , we obtain AT = (PDPT )T = PDT PT = PDPT = A . Hence A is symmetric. The converse is more difficult. Assume that A is a symmetric n × n matrix. If n = 1 the statement is trivial, i.e. the pair (P = 1, D = A) satisfies the statement. We will proceed by induction on the size of A. The induction hypothesis assumes that every (n−1)×(n−1) symmetric matrix is orthogonally diagonalizable. Consider now the symmetric matrix A and let λ ∈ R be an eigenvalue of A with eigenvector u ∈ Rn , i.e., Au = λu. Set p1 = u/∥u∥ and extend {p1} to an orthonormal basis {p1, . . . , pn} of Rn . Let P be the orthogonal matrix whose column vectors are the vectors pi, that is, P = ( p1 p2 . . . pn ) . Then, in terms of block matrices for the multiplication AP we get the expression AP = ( λp1 Ap2 . . . Apn ) = P ( λ C 0 A1 ) , for some C ∈ Mat1,n−1(R) and A1 ∈ Matn−1(R). However, P−1 = PT and hence this equivalently can be rewritten as PT AP = ( λ C 0 A1 ) . (∗) Since A is symmetric, we see that (PT AP)T = PT AP, hence the matrix PT AP is also symmetric. Thus, the relation given in (∗) will make sense only when C = 0 and A1 is symmetric. 
Since A1 has size (n − 1) × (n − 1), the induction hypothesis applies, so there exist an orthogonal matrix B and a diagonal matrix D1 of the same size with A1, satisfying BT A1B = D1. Consider now the block matrix P′ = ( 1 0 0 B ) , whose size is n × n. Since B is orthogonal, P′ is also orthogonal and hence we see that (PP′ )T A(PP′ ) = (P′ )T (PT AP)P′ = ( 1 0 0 BT ) ( λ 0 0 A1 ) ( 1 0 0 B ) = ( λ 0 0 BT A1B ) = ( λ 0 0 D1 ) . Since the last matrix is diagonal and PP′ is orthogonal, the proof is complete. 3.D.38. Both matrices A and B are Hermitian, i.e., A = A∗ and B = B∗ , and hence also normal. Thus, they are unitary diagonalizable. Moreover, since Hermitian matrices have only real eigenvalues, both A and B possess only real eigenvalues, as we will see also below. Let us begin with the matrix A. The characteristic polynomial of A is given by χA(λ) = |A − λ E| = −(1 + λ) 0 i 0 1 − λ 0 −i 0 1 − λ = −(1 + λ) 1 − λ 0 0 1 − λ + i 0 1 − λ −i 0 = −(1 + λ)(1 − λ)2 + i2 (1 − λ) = −(1 + λ)(1 − λ2 + 1) = (1 − λ)(λ2 − 2) = (1 − λ)(λ − √ 2)(λ + √ 2) . Or one may try to express the 3 × 3 determinant with respect to the second row, which has less computations. Hence the eigenvalues of A are given by λ = 1 and λ± = ± √ 2, all with multiplicity one. Hence their geometric multiplicity is also one. Let us derive generators of the corresponding eigenspaces Vλ, Vλ+ and Vλ− , which are subspaces of C3 . • For λ = 1 we have the matrix equation Au = λu = u, for some vector u = (u1, u2, u3)T ∈ C3 , i.e., 312 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS   −u1 + iu3 u2 −iu1 + u3   =   u1 u2 u3   . It is easy to see that the solutions of the corresponding linear system are given by u1 = u3 = 0, with u2 ∈ C free. Thus we may assume that u = (0, 1, 0)T . • For λ+ = √ 2 we need to solve the matrix equation Au+ = √ 2 u+, for some u+ = (w1, w2, w3) ∈ C3 , i.e.,   −w1 + iw3 w2 −iw1 + w3   =   √ 2w1√ 2w2√ 2w3   . This induces the linear system    ( √ 2 + 1)w1 − iw3 = 0 , ( √ 2 − 1)w2 = 0 , iw1 + ( √ 2 − 1)w3 = 0 , whose solutions are given by {w1 = i( √ 2 − 1)r , w2 = 0 , w3 = r ∈ C}. Thus, setting r = 1 we may assume that u+ = (i( √ 2 − 1), 0, 1)T ∈ C3 . • For λ− = − √ 2 we need to solve the matrix equation Au− = − √ 2 u−, for some vector u− = (v1, v2, v3) ∈ C3 . In a similar way we find that the solution of the corresponding linear system is given by {v1 = −i( √ 2 + 1)s , v2 = 0 , v3 = s ∈ C}. Thus, up to a multiple, the eigenvector u− is given by u− = (−i( √ 2 + 1), 0, 1)T ∈ C3 . We recommend formally verifying that the eigenvectors are orthogonal with respect to the standard Hermitian form ⟨ , ⟩ on C3 . We have achieved this using Sage, as presented below: u=vector([0, 1, 0]) u_plus=vector([1-sqrt(2), 0, i]) u_minus=vector([1+sqrt(2), 0, i]) a=u.dot_product(u_plus.conjugate()) print(a.simplify_full()) b=u.dot_product(u_minus.conjugate()) print(b.simplify_full()) c=u_plus.dot_product(u_minus.conjugate()) print(c.simplify_full()) Now, without loss of generality, we can multiply both u± by i and hence consider the eigenvectors ˜u+ =   1 − √ 2 0 i   ∈ V+ , and ˜u− =   1 + √ 2 0 i   ∈ V− . We need to compute the norms of the eigenvectors u, and ˜u± with respect the standard Hermitian form ⟨ , ⟩ on C3 . Obviously, ∥u∥ = 1 and ∥˜u+∥2 = ⟨˜u+, ˜u+⟩ = ˜u∗ + ˜u+ = ( 1 − √ 2 0 −i ) ·   1 − √ 2 0 i   = (1 − √ 2)2 − i2 = 4 − 2 √ 2 , ∥˜u−∥2 = ⟨˜u−, ˜u−⟩ = ˜u∗ − ˜ui = ( 1 + √ 2 0 −i ) ·   1 + √ 2 0 i   = (1 + √ 2)2 − i2 = 4 + 2 √ 2 , where “·” denotes matrix multiplication. 
Thus $\|\tilde u_+\| = \sqrt{4-2\sqrt2}$, $\|\tilde u_-\| = \sqrt{4+2\sqrt2}$, and the normalized eigenvectors are given by
\[ \hat u = u = \begin{pmatrix}0\\1\\0\end{pmatrix}, \qquad \hat u_+ = \frac{\tilde u_+}{\|\tilde u_+\|} = \frac{1}{\sqrt{4-2\sqrt2}}\begin{pmatrix}1-\sqrt2\\0\\i\end{pmatrix}, \qquad \hat u_- = \frac{\tilde u_-}{\|\tilde u_-\|} = \frac{1}{\sqrt{4+2\sqrt2}}\begin{pmatrix}1+\sqrt2\\0\\i\end{pmatrix}. \]
We can now derive the matrices $U$, $U^*$ (and $D$):
\[ U = \begin{pmatrix} 0 & \frac{1-\sqrt2}{\sqrt{4-2\sqrt2}} & \frac{1+\sqrt2}{\sqrt{4+2\sqrt2}}\\ 1 & 0 & 0\\ 0 & \frac{i}{\sqrt{4-2\sqrt2}} & \frac{i}{\sqrt{4+2\sqrt2}} \end{pmatrix}, \qquad U^* = \begin{pmatrix} 0 & 1 & 0\\ \frac{1-\sqrt2}{\sqrt{4-2\sqrt2}} & 0 & \frac{-i}{\sqrt{4-2\sqrt2}}\\ \frac{1+\sqrt2}{\sqrt{4+2\sqrt2}} & 0 & \frac{-i}{\sqrt{4+2\sqrt2}} \end{pmatrix}, \qquad D = \begin{pmatrix} 1 & 0 & 0\\ 0 & \sqrt2 & 0\\ 0 & 0 & -\sqrt2 \end{pmatrix}. \]
Verifying the relation $U^*AU = D$ by hand can be both time-consuming and challenging. To streamline this process we will use Sage in the following straightforward manner:

A = matrix([[-1, 0, i], [0, 1, 0], [-i, 0, 1]])
U = matrix([[0, (1-sqrt(2))/sqrt(4-2*sqrt(2)), (1+sqrt(2))/sqrt(4+2*sqrt(2))], [1, 0, 0], [0, i/sqrt(4-2*sqrt(2)), i/sqrt(4+2*sqrt(2))]])
D = diagonal_matrix([1, sqrt(2), -sqrt(2)])
U.H*A*U == D

Verify that Sage's response is rapid, and effectively confirms the desired relation $U^*AU = D$.

For the matrix $B$ we will rely solely on Sage to obtain a solution. Using the cell

B = matrix(QQbar, [[3, I], [-I, 3]])
B.eigenvalues()
B.eigenvectors_right()

we obtain the eigenvalues of $B$: $\lambda_1 = 2$ and $\lambda_2 = 4$, both with multiplicity one. This code also returns the corresponding eigenvectors: $(1, i)^T$ for $\lambda_1$ and $(1, -i)^T$ for $\lambda_2$. We may multiply the second vector by $i$ and set $u_1 = (1, i)^T$ and $u_2 = (i, 1)^T$. With respect to the standard Hermitian form on $\mathbb{C}^2$ it is easy to verify that $\langle u_1, u_2\rangle = 0$. Thus the eigenvectors are orthogonal, as they should be, and moreover $\|u_1\| = \sqrt2 = \|u_2\|$. By taking the orthonormal vectors $\hat u_1 = \tfrac{1}{\sqrt2}u_1$ and $\hat u_2 = \tfrac{1}{\sqrt2}u_2$ we arrive at the unitary matrix $U$ given by
\[ U = \frac{1}{\sqrt2}\begin{pmatrix} 1 & i\\ i & 1 \end{pmatrix}. \]
Recall that Sage can verify this claim via the command U.is_unitary(), where the same cell should include the definition of the matrix U. To verify that this matrix $U$ diagonalizes the given matrix $B$, i.e., $U^*BU = D$, where $D = \operatorname{diag}(\lambda_1, \lambda_2) = \operatorname{diag}(2, 4)$, use the following block:

B = matrix(QQbar, [[3, I], [-I, 3]])
U = matrix(QQbar, [[1/sqrt(2), I*(1/sqrt(2))], [I*(1/sqrt(2)), 1/sqrt(2)]])
D = matrix(QQbar, [[2, 0], [0, 4]])
U.conjugate_transpose()*B*U == D

3.D.39. Suppose that $A$ is a normal matrix with purely imaginary eigenvalues. Then we can find a unitary matrix $U$ satisfying $U^*AU = D$, where $D$ is a diagonal matrix with purely imaginary (diagonal) entries. From this we get $A = UDU^*$ and since $D^* = -D$, we see that
\[ A^* = (UDU^*)^* = UD^*U^* = -UDU^* = -A. \]

3.D.41. The matrices that represent Jordan canonical forms are $G_1$, $G_2$, $G_5$ and $G_6$.

3.D.43. We saw that only the matrices $G_1$, $G_2$, $G_5$ and $G_6$ represent a Jordan canonical form. Let us denote by $A_k$ the corresponding matrices and by $P_k$ the corresponding similarity matrices. Both $A_k$ and $P_k$ have the same size as $G_k$ for all these $k$. Since $A_k = P_kG_kP_k^{-1}$, i.e., the matrices $A_k$ and $G_k$ are similar for $k = 1, 2, 5, 6$, they have the same characteristic polynomial. Thus $\chi_{A_k}(\lambda) = \chi_{G_k}(\lambda)$ for all these four cases, and from the expressions of the $G_k$ it is direct that
• $\chi_{G_1}(\lambda) = \chi_{A_1}(\lambda) = \lambda^2(\lambda - 2)^2$;
• $\chi_{G_2}(\lambda) = \chi_{A_2}(\lambda) = (\lambda - 3)(\lambda - 1)^2$;
• $\chi_{G_5}(\lambda) = \chi_{A_5}(\lambda) = (\lambda - \pi)$;
• $\chi_{G_6}(\lambda) = \chi_{A_6}(\lambda) = (\lambda - 2)^2(\lambda - 1)^2\lambda^2$.

3.D.45. (a) Let us compute the characteristic polynomial of the given matrix $A = \begin{pmatrix} 3 & 1\\ -1 & 1 \end{pmatrix}$:
\[ \chi_A(\lambda) = |A - \lambda E| = \begin{vmatrix} 3-\lambda & 1\\ -1 & 1-\lambda \end{vmatrix} = (3-\lambda)(1-\lambda) + 1 = \lambda^2 - 4\lambda + 4 = (\lambda - 2)^2. \]
Thus, $\lambda = 2$ is the unique eigenvalue of $A$, and it has algebraic multiplicity $\alpha(\lambda) = 2$.
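A quick cross-check of this characteristic polynomial and of the multiplicity count in Sage (a minimal sketch):

A = matrix(QQ, [[3, 1], [-1, 1]])
print(A.charpoly())    # x^2 - 4*x + 4
print(A.eigenvalues()) # [2, 2]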
To determine the eigenvectors, we need to solve the matrix equation $(A - 2E)u = 0$ for some vector $u = (x, y)^T$. This gives the linear system $\{x + y = 0, \ -x - y = 0\}$ with 1-dimensional solution space
\[ \Big\{ \begin{pmatrix} x\\ y \end{pmatrix} = \begin{pmatrix} -s\\ s \end{pmatrix} : s \in \mathbb{R} \Big\}. \]
Therefore, we may assume that the eigenspace $V_\lambda = V_2$ is generated by the eigenvector $u = (1, -1)^T$ (this occurs for $s = -1$). For the geometric multiplicity of $\lambda$ this fact implies that $\gamma(\lambda) = 1$. Hence $\gamma(\lambda) \ne \alpha(\lambda)$ and $A$ cannot be diagonalizable. Since $A$ is $2 \times 2$, $\alpha(\lambda) = 2$, $\gamma(\lambda) = 1$, and $\lambda$ is the unique eigenvalue of $A$, the Jordan canonical form of $A$ is the matrix
\[ J = \begin{pmatrix} 2 & 1\\ 0 & 2 \end{pmatrix}. \]
Observe now that $(A - 2E)^2 = 0$, where $0$ denotes the zero $2 \times 2$ matrix. To find the generalized eigenvector corresponding to $\lambda$ we need to solve the matrix equation $(A - 2E)w = u$. If we assume that $w$ has the expression $w = (w_1, w_2)^T$, the solution space of the corresponding linear system has the form
\[ \Big\{ \begin{pmatrix} w_1\\ w_2 \end{pmatrix} = \begin{pmatrix} 1-t\\ t \end{pmatrix} : t \in \mathbb{R} \Big\}. \]
Thus we may assume that the generalized eigenvector $w$ has the form $w = (1, 0)^T$ (this occurs for $t = 0$). Consequently, the basis with respect to which $J$ has the above form is given by $\{u, w\}$ and $P$ has the form
\[ P = \begin{pmatrix} 1 & 1\\ -1 & 0 \end{pmatrix}. \]
We also compute $P^{-1} = \begin{pmatrix} 0 & -1\\ 1 & 1 \end{pmatrix}$ and it is straightforward to verify the relation $P^{-1}AP = J$. In Sage this verification can be done as usual, i.e.,

A = matrix(SR, [[3, 1], [-1, 1]])
J = matrix(SR, [[2, 1], [0, 2]])
P = matrix(SR, [[1, 1], [-1, 0]])
Pinv = P.inverse(); Pinv*A*P == J

As mentioned in 3.D.44, an alternative has the form

A = matrix(SR, [[3, 1], [-1, 1]])
J, P = A.jordan_form(transformation=True)
show(J, P)

In this case Sage directly prints out the derived expressions for $J$ and $P$.

(b) For this task, observe that the $k$th power $A^k$ of $A$ satisfies the relation
\[ A^k = \underbrace{(PJP^{-1})(PJP^{-1})\cdots(PJP^{-1})}_{k\ \text{factors}} = PJ^kP^{-1}. \]
Now, we easily compute $J^4 = \begin{pmatrix} 16 & 32\\ 0 & 16 \end{pmatrix}$ and by applying the relation $A^4 = PJ^4P^{-1}$, we obtain
\[ A^4 = \begin{pmatrix} 48 & 32\\ -32 & -16 \end{pmatrix}. \]

3.D.46. The matrix $A$ is upper triangular, meaning that all entries below the main diagonal are zero. For an upper triangular matrix the characteristic polynomial factorizes into linear factors obtained by inspecting the diagonal elements. For our case we get the expression $\chi_A(\lambda) = (5 - \lambda)^4$. Therefore, $\lambda = 5$ is the unique eigenvalue of $A$ and it has algebraic multiplicity 4, $\alpha(\lambda) = 4$. To determine the size of the Jordan blocks, we need to find the geometric multiplicity of $\lambda$. The matrix $A - 5E$ is given by
\[ A - 5E = \begin{pmatrix} 0 & 4 & 2 & 1\\ 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 2\\ 0 & 0 & 0 & 0 \end{pmatrix}, \]
and it is easy to see that its rank equals 3. We can verify this statement in Sage, by the following block:

A = Matrix([[5, 4, 2, 1], [0, 5, 1, 1], [0, 0, 5, 2], [0, 0, 0, 5]])
E = identity_matrix(4)
AA = A - 5*E; show(AA)
show(AA.rref())
print("The rank of A-5E is given by:", AA.rank())

Since $\operatorname{rank}(A - 5E) = 3$, the dimension of $\operatorname{Ker}(A - 5E)$ equals 1 (recall that, by the rank–nullity theorem, $\operatorname{rank}(T) + \dim\operatorname{Ker}(T) = \dim V$ for any linear transformation $T : V \to W$ defined on a finite-dimensional vector space $V$). Hence we conclude that the geometric multiplicity of $\lambda = 5$ equals 1. Since $\gamma(\lambda) = 1$, $\alpha(\lambda) = 4$ and $\lambda = 5$ is the unique eigenvalue of $A$, the Jordan form is composed of a single Jordan block of size 4 (see also below for a Jordan chain of length 4):
\[ J = \begin{pmatrix} 5 & 1 & 0 & 0\\ 0 & 5 & 1 & 0\\ 0 & 0 & 5 & 1\\ 0 & 0 & 0 & 5 \end{pmatrix}. \]
Let us now describe an eigenvector of $\lambda$. The previous block also provides the RREF of $A - 5E$, given by
\[ A - 5E = \begin{pmatrix} 0 & 4 & 2 & 1\\ 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 2\\ 0 & 0 & 0 & 0 \end{pmatrix} \sim \begin{pmatrix} 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 0 \end{pmatrix}. \]
Thus, the general solution of the linear system corresponding to the matrix equation $(A - 5E)u_1 = 0$ is given by $\{(x_1, x_2, x_3, x_4)^T = (t, 0, 0, 0)^T\}$, where $t$ is a free parameter. Hence, without loss of generality, we may assume that the eigenspace $V_\lambda = V_5 = \operatorname{Ker}(A - 5E)$ is generated by the eigenvector $u_1 = (1, 0, 0, 0)^T$. Using Sage we can confirm that $(A - 5E)^4 = 0$ and compute
\[ (A - 5E)^2 = \begin{pmatrix} 0 & 0 & 4 & 8\\ 0 & 0 & 0 & 2\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \end{pmatrix}, \qquad (A - 5E)^3 = \begin{pmatrix} 0 & 0 & 0 & 8\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \end{pmatrix}. \]
Therefore it is easy to derive the dimensions of the null spaces $\operatorname{Ker}\big((A - 5E)^2\big)$ and $\operatorname{Ker}\big((A - 5E)^3\big)$:
\[ \operatorname{rank}\big((A-5E)^2\big) = 2 \implies \dim\operatorname{Ker}\big((A-5E)^2\big) = 4 - 2 = 2, \qquad \operatorname{rank}\big((A-5E)^3\big) = 1 \implies \dim\operatorname{Ker}\big((A-5E)^3\big) = 4 - 1 = 3. \]
Since $(A - 5E)^4 = 0$ it is immediate that $\dim\operatorname{Ker}\big((A - 5E)^4\big) = 4$ and hence $\operatorname{Ker}\big((A - 5E)^4\big)$ is the whole $\mathbb{R}^4$. Let us compute a generalized eigenvector for the unique eigenvalue of $A$, which we will denote by $u_2 = (x_1, x_2, x_3, x_4)^T$. The matrix equation $(A - 5E)u_2 = u_1$ corresponds to the linear system $\{4x_2 + 2x_3 + x_4 = 1, \ x_3 + x_4 = 0, \ 2x_4 = 0\}$ and its solutions are given by $\{(x_1, x_2, x_3, x_4)^T = (s, \tfrac14, 0, 0)^T\}$, where $s$ is a free parameter. Thus, we may assume that $u_2 = (0, \tfrac14, 0, 0)^T$, and it is easy to see that $u_2 \in \operatorname{Ker}\big((A - 5E)^2\big)$ but $u_2 \notin \operatorname{Ker}(A - 5E)$. In particular, the vectors $u_1, u_2$ span the subspace $\operatorname{Ker}\big((A - 5E)^2\big)$. Next we need to solve the matrix equation $(A - 5E)u_3 = u_2$ for some vector $u_3 = (x_1, x_2, x_3, x_4)^T$, which induces the system $\{4x_2 + 2x_3 + x_4 = 0, \ x_3 + x_4 = \tfrac14, \ 2x_4 = 0\}$. Its solutions are given by $\{(x_1, x_2, x_3, x_4)^T = (r, -\tfrac18, \tfrac14, 0)^T\}$, where $r$ is a free parameter. Thus we may assume that $u_3 = (0, -\tfrac18, \tfrac14, 0)^T$ and it is easy to see that $u_3 \in \operatorname{Ker}\big((A - 5E)^3\big)$ but $u_3 \notin \operatorname{Ker}\big((A - 5E)^2\big)$. It follows that the vectors $u_1, u_2, u_3$ generate $\operatorname{Ker}\big((A - 5E)^3\big)$. Finally, solving the matrix equation $(A - 5E)u_4 = u_3$ for some unknown $u_4 = (x_1, x_2, x_3, x_4)^T$, we get the generalized eigenvector $u_4 = (0, \tfrac{3}{32}, -\tfrac14, \tfrac18)^T$. Putting all this information together, we deduce that $\operatorname{Ker}\big((A - 5E)^4\big) = \operatorname{span}\{u_1, u_2, u_3, u_4\}$. To summarize, we saw that the vectors $u_4, u_3, u_2$ and $u_1$ satisfy the relations
\[ (A-5E)^4u_4 = 0, \quad (A-5E)^3u_3 = 0, \quad (A-5E)^2u_2 = 0, \quad (A-5E)u_1 = 0, \]
\[ (A-5E)u_4 = u_3, \quad (A-5E)u_3 = u_2, \quad (A-5E)u_2 = u_1. \]
Moreover, a direct computation shows that
\[ (A-5E)^3u_4 = u_1, \quad (A-5E)^2u_4 = u_2, \quad (A-5E)u_4 = u_3. \]
Thus $\{(A - 5E)^3u_4, (A - 5E)^2u_4, (A - 5E)u_4, u_4\} = \{u_1, u_2, u_3, u_4\}$ is a Jordan chain of full length 4. This implies that the generalized eigenspace $R_\lambda$ corresponding to the eigenvalue $\lambda$ has dimension 4 and encompasses all vectors that map to zero under the fourth power of $(A - 5E)$. Thus, $R_\lambda = \operatorname{Ker}\big((A - 5E)^4\big) \cong \mathbb{R}^4$. Actually $R_\lambda$ is the union of the subspaces $\operatorname{Ker}\big((A - 5E)^\ell\big)$ for all $1 \le \ell \le 4$. As another consequence of this chain we can derive a representative of the matrix $P$:
\[ P = \big(\, u_1 \mid u_2 \mid u_3 \mid u_4 \,\big) = \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1/4 & -1/8 & 3/32\\ 0 & 0 & 1/4 & -1/4\\ 0 & 0 & 0 & 1/8 \end{pmatrix}. \]
Check for yourselves, using Sage and the following block, that $P$ satisfies $P^{-1}AP = J$.

A = matrix(SR, [[5, 4, 2, 1], [0, 5, 1, 1], [0, 0, 5, 2], [0, 0, 0, 5]])
J = matrix(SR, [[5, 1, 0, 0], [0, 5, 1, 0], [0, 0, 5, 1], [0, 0, 0, 5]])
P = matrix(SR, [[1, 0, 0, 0], [0, 1/4, -1/8, 3/32], [0, 0, 1/4, -1/4], [0, 0, 0, 1/8]])
# Verify P^(-1)AP = J
print(P.inverse()*A*P == J)

3.E.5.
(i) In 3.E.1 the matrix $U$ was derived from the matrix $A$ by successively applying the elementary row operations $R_2 \to R_2 - 2R_1$, $R_3 \to R_3 - 3R_1$ and $R_3 \to R_3 + R_2$. To obtain the corresponding elementary matrices, we apply each of these row operations to the $3 \times 3$ identity matrix. This results in the following matrices:
\[ E_1 = \begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}, \qquad E_2 = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ -3 & 0 & 1 \end{pmatrix}, \qquad E_3 = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 1 \end{pmatrix}, \]
and the Gauss elimination in terms of matrices has the form
\[ E_3E_2E_1A = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ -3 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} -2 & 1 & 0\\ -4 & 4 & 2\\ -6 & 1 & -1 \end{pmatrix} = \begin{pmatrix} -2 & 1 & 0\\ 0 & 2 & 2\\ 0 & 0 & 1 \end{pmatrix} = U. \]
Since the elementary matrices are always invertible, we obtain $A = E_1^{-1}E_2^{-1}E_3^{-1}U$, which implies that $L = E_1^{-1}E_2^{-1}E_3^{-1}$. The inverses of the elementary matrices are easily computed, leading to the expected result:
\[ L = E_1^{-1}E_2^{-1}E_3^{-1} = \begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 3 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & -1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 3 & -1 & 1 \end{pmatrix}. \]
We also compute
\[ L^{-1} = E_3E_2E_1 = \begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ -5 & 1 & 1 \end{pmatrix}. \]
For a verification using Sage, type the cell

L = matrix([[1, 0, 0], [2, 1, 0], [3, -1, 1]]); show(L.inverse())

(ii) To find the inverse of the matrix $U$ via the Gauss elimination method, we begin by forming the augmented matrix $(\,U \mid E\,)$, where $E$ is the $3 \times 3$ identity matrix. We then successively apply the following elementary row operations: $R_1 \to -\tfrac12 R_1$, $R_1 \to R_1 + \tfrac14 R_2$, $R_1 \to R_1 - \tfrac12 R_3$, $R_2 \to \tfrac12 R_2$ and $R_2 \to R_2 - R_3$. You should be able to verify on your own that this results in the matrix
\[ U^{-1} = \begin{pmatrix} -1/2 & 1/4 & -1/2\\ 0 & 1/2 & -1\\ 0 & 0 & 1 \end{pmatrix}. \]
Here is a quick verification using Sage:

U = matrix([[-2, 1, 0], [0, 2, 2], [0, 0, 1]]); show(U.inverse())

(iii) We have $\det(A) = \det(L)\det(U) = 1 \cdot (-4) = -4 \ne 0$, and hence $A$ is invertible. Now, using the expressions of $L^{-1}$ and $U^{-1}$ obtained above, the matrix $A^{-1}$ occurs very easily:
\[ A^{-1} = (LU)^{-1} = U^{-1}L^{-1} = \begin{pmatrix} -1/2 & 1/4 & -1/2\\ 0 & 1/2 & -1\\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ -5 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 3/2 & -1/4 & -1/2\\ 4 & -1/2 & -1\\ -5 & 1 & 1 \end{pmatrix}. \]
Verify this result in Sage with the same basic method applied for $L$ or $U$.

3.E.6. It is easy to see that $\operatorname{rank}(A) = 2$ and hence there exist two non-zero singular values. Since
\[ A^TA = \begin{pmatrix} 25 & 15\\ 15 & 25 \end{pmatrix}, \]
the characteristic polynomial of $A^TA$ is given by $\chi_{A^TA}(\lambda) = (40 - \lambda)(10 - \lambda)$. Hence we conclude that the eigenvalues of $A^TA$ are $\lambda_1 = 40$ and $\lambda_2 = 10$, so the singular values of $A$ are $\sigma_1 = \sqrt{40} = 2\sqrt{10}$ and $\sigma_2 = \sqrt{10}$, with $\sigma_1 > \sigma_2$.

3.E.7. We see that $A^TA$ is a diagonal matrix:
\[ A^TA = \begin{pmatrix} 0 & -1 & 0\\ 0 & 0 & 0\\ -\tfrac12 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 & 0 & -\tfrac12\\ -1 & 0 & 0\\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & \tfrac14 \end{pmatrix}. \]
From this we deduce that $A^TA$ has three eigenvalues, $\lambda_1 = 1$, $\lambda_2 = 1/4$ and $\lambda_3 = 0$, and two non-zero singular values (note that $\operatorname{rank}(A) = 2$), the latter given by $\sigma_1 = \sqrt{\lambda_1} = 1$ and $\sigma_2 = \sqrt{\lambda_2} = 1/2$, with $\sigma_1 > \sigma_2$. Therefore, the matrix $\Sigma$ has the form $\Sigma = \operatorname{diag}(1, 1/2, 0)$.

3.E.9. According to 3.5.5 we will have $P := U\Sigma U^T$ and $U_p := UV^T$, where $A = U\Sigma V^T$ is the SVD decomposition of $A$. We compute
\[ P = \begin{pmatrix} 0 & -1 & 0\\ -1 & 0 & 0\\ 0 & 0 & -1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ 0 & 1/2 & 0\\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 & -1 & 0\\ -1 & 0 & 0\\ 0 & 0 & -1 \end{pmatrix} = \begin{pmatrix} 1/2 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{pmatrix}, \]
and
\[ U_p = \begin{pmatrix} 0 & -1 & 0\\ -1 & 0 & 0\\ 0 & 0 & -1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & -1\\ -1 & 0 & 0\\ 0 & -1 & 0 \end{pmatrix}. \]
Thus, the polar decomposition of the matrix $A$ has the form
\[ A = \begin{pmatrix} 0 & 0 & -1/2\\ -1 & 0 & 0\\ 0 & 0 & 0 \end{pmatrix} = PU_p = \begin{pmatrix} 1/2 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 & 0 & -1\\ -1 & 0 & 0\\ 0 & -1 & 0 \end{pmatrix}. \]
A verification using Sage is easy: A=matrix([[0, 0, -1/2], [-1, 0, 0], [0, 0, 0]]) S=diagonal_matrix([1, 1/2, 0]) V=matrix([[1, 0, 0], [0, 0, 1], [0, 1, 0]]) U=matrix([[0, -1, 0], [-1, 0, 0], [0, 0, -1]]) P=U*S*(U.T) show(P) Upolar=U*(V.T) show(Upolar) A==P*Upolar The polar decomposition is not unique, since A is singular. Next it easy to see that P is Hermitian and positive semi-definite. For example, for the second claim we see that P has three non-negative eigenvalues: 1, 1/2 and 0. In Sage you can confirm these claims by the adding the cell show(P.eigenvalues()) P.is_hermitian() 3.F.10. The answer is given by xn = (−1)n (−2n2 + 8n − 7). 3.F.23. The required matrix has the form (5 6 1 5 1 6 4 5 ) . This matrix has the dominant eigenvalue 1, with corresponding eigenvector given by (6 5 , 1)T . Because the eigenvalue is dominant, the ratio of the viewers stabilises on 6 : 5. 3.F.25. The game ends after three bets (the same as in 3.F.24). Thus, all the powers of A, starting with A3 , are identical. Consequently the answer is given as follows: A100 = A3 =       1 7/8 3/4 1/2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1/8 1/4 1/2 1       . 3.F.31. It is easy to see that F is linear. Also we compute ker(F) = {( a −a a −a ) : a ∈ R } hence dimR(ker(F)) = 1 and a basis of ker(F) consists of the matrix u = ( 1 −1 1 −1 ) . Since B(u, u) = 4, we deduce that {ˆu = 1 2 u} is an orthonormal basis of ker(F). 3.F.36. The given rotation is easily obtained by composing the following three mappings: • rotation through the angle π/4 in the negative sense about the axis z (the axis of the rotation goes over on the x axis); • rotation through the angle π/3 in the positive sense about the x axis; • rotation through the angle π/4 in the positive sense about the z axis (the x axis goes over on the axis of the rotation). The matrix of the resulting rotation is the product of the matrices corresponding to the given three mappings, while the order of the matrices is given by the order of application of the mappings – the first mapping applied is in the product the rightmost one. Thus we obtain the desired matrix   √ 2 2 − √ 2 2 0√ 2 2 √ 2 2 0 0 0 1    ·    1 0 0 0 1 2 − √ 3 2 0 √ 3 2 1 2    ·    √ 2 2 √ 2 2 0 − √ 2 2 √ 2 2 0 0 0 1    =    3 4 1 4 √ 6 4 1 4 3 4 − √ 6 4 − √ 6 4 √ 6 4 1 2    . Note that the resulting rotation could be also obtained for instance by taking the composition of the three following mappings: • rotation through the angle π/4 in the positive sense about the axis z (the axis of rotation goes over on the axis y); • rotation through the angle π/3 in the positive sense about the axis y; • rotation through the angle π/4 in the negative sense about the axis z (the axis y goes over to the axis of rotation). Analogously we obtain 318 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS    √ 2 2 √ 2 2 0 − √ 2 2 √ 2 2 0 0 0 1    ·    1 2 0 √ 3 2 0 1 0 − √ 3 2 0 1 2    ·    √ 2 2 − √ 2 2 0√ 2 2 √ 2 2 0 0 0 1    =    3 4 1 4 √ 6 4 1 4 3 4 − √ 6 4 − √ 6 4 √ 6 4 1 2    . 3.F.39. For any x, y ∈ Rn and with respect to the standard dot product on Rn we compute ⟨LAx, y⟩ = ⟨Ax, y⟩ = yT Ax = yT (AT )T x = (AT y)T x = ⟨x, AT y⟩ = ⟨x, Ay⟩ where the last equality follows since by assumption A = AT . 3.F.40. First notice that the operator LA : Matm,n(F) → Matm,n(F) is linear since LA(B+C) = A(B+C) = AB+AC = LA(B) + LA(C) and LA(zB) = A(zB) = z(AB) = zLA(B) for any B, C ∈ Matm,n(F) and z ∈ F. 
Its adjoint operator L∗ A is the linear map L∗ A : Matm,n(F) → Matm,n(F) satisfying B ( LA(B), C ) = B ( B, L∗ A(C) ) , B, C ∈ Matm,n(F) . We will show that for L∗ A = A∗ this holds like an identity. We compute B ( LA(B), C ) = tr ( C∗ (AB) ) = tr(C∗ AB) , B ( B, L∗ A(C) ) = B(B, A∗ C) = tr ( (A∗ C)∗ B ) = tr ( (C∗ (A∗ )∗ )B) = tr(C∗ AB) . Since L∗ A = A∗ and A is by assumption Hermitian, our claim follows. 3.F.41. Let y ∈ V . Then we have the following equivalences: y ∈ ker(f∗ ) ⇐⇒ f∗ (y) = 0 , ⇐⇒ ⟨x, f∗ (y)⟩ = 0 , for any x ∈ V , ⇐⇒ ⟨f(x), y⟩ = 0 , for any x ∈ V , ⇐⇒ y ⊥ f(x) , for any x ∈ V , ⇐⇒ y ∈ (Im(f))⊥ . Thus ker(f∗ ) = (Im(f))⊥ . Similarly is treated the second relation. 3.F.42. Reflections matrices are unitary since they preserve the length of vectors. In particular, [φu][φu]∗ = (2uuT −E)(2uuT −E)∗ = (2uuT −E)(2uuT −E) = 4u(uT u)uT −4uuT +E = 4uuT −4uuT +E = E , since u is a unit vector, i.e., ⟨u, u⟩ = uT u = 1. 3.F.43. Observe that (A∗ A)∗ = A∗ A hence the matrix B = A∗ A is Hermitian, B = B∗ . Then the result follows since any Hermitian matrix has real eigenvalues, see 3.4.7. For the second statement, we have AA∗ = U∗ DU(U∗ DU)∗ = U∗ DUU∗ D∗ U = U∗ DD∗ U and A∗ A = (U∗ DU)∗ U∗ DU = U∗ D∗ UU∗ DU = U∗ D∗ DU. Now the result follows since D is diagonal and hence we have DD∗ = D∗ D (as one can easily verify). 3.F.44. It should be a = 2 + 2i, b = 1 − i and c = i. One can verify this answer in Sage as follows: A=matrix(SR, 3, 3, [[1, 2+2*I, -I], [2-2*I, 0, 1-I], [I, 1+I, 0]]) A.is_hermitian() Could you suggest an alternative method in Sage to verify the same result? 3.F.51. The answer is dim Matn(R) = n2 . We return to the view on geometry that we had when we studied positions of points in the plane in the 5th part of the first chapter, c.f. 1.5.1. We are interested in the properties of objects in the Euclidean space, delimited by points, straight lines, planes etc. The essential point is to clarify how their properties are related to the notion of vectors, and whether they depend on the notion of the length of vectors. Later in the book, we use linear and mutli-linear algebra to study objects defined in a nonlinear way. In order to facilitate this we provide some insight in the so called analytic geometry, based on the application of matrix calculus. This point of view is most useful in discussion of the technique of optimization, e.g. when searching for (constrained) extrema of functions. At the end of this chapter we show how the projectivization of affine spaces helps us to obtain simplification and stability of algorithms typical for computer graphics. 1. Affine and Euclidean geometry While clarifying the structure of solutions of linear equations in the second chapter we find in paragraph 2.3.5 that the set of all solutions of a nonhomogeneous system of linear equations does not form a vector space. However, the solutions always arise in such a way that to one particular solution we can add the vector space of solutions to the corresponding homogeneous system. On the other hand, the difference of any two solutions of the nonhomogeneous system is always a solution of the homogeneous system. This behaviour is similar to the behaviour of linear difference equations. We see this already in paragraph 3.2.6. 4.1.1. Affine spaces. Going back to paragraph 1.5.3 (and further on) provides a hint how to deal with the theory in any finite dimension. There we describe straight lines and points as sets of solutions of systems of linear equations. 
A line is considered as a one-dimensional subspace, although its points are described by two coordinates. Parametrically, the line is defined by the sum of a single point (that is, a pair of coordinates) and multiples of a fixed direction vector. We proceed now in the same way for arbitrary dimensions. CHAPTER 4 Analytic geometry position, incidence, projection – and we return to matrices again... A. Affine geometry 4.A.1. Find a parametric equation for a line in R3 given by the equations x − 2y + z = 2, 2x + y − z = 5. Solution. It is sufficient to solve the equation system. However, there is an alternative approach. Find a non-zero direction vector orthogonal to the normal vectors (1, −2, 1), (2, 1, −1). The cross product (1, −2, 1) × (2, 1, −1) = (1, 3, 5) is such a vector. The triple [x, y, z] = [2, −1, −2] satisfies the respective system, so a solution is [2, −1, −2] + t (1, 3, 5) , t ∈ R. □ 4.A.2. A plane in R4 is given by its parametric equation ϱ : [0, 3, 2, 5] + t (1, 0, 1, 0) + s (2, −1, −2, 2) , t, s ∈ R Find its implicit equation. Solution. The task is to find a system of equations with 4 variables x, y, z, u (the dimension of the space is 4) which are satisfied by the coordinates of those points which lie in the plane. The desired system must contain 2 = 4−2 linearly CHAPTER 4. ANALYTIC GEOMETRY Standard affine space Standard affine space An is a set of all points in Rn = An together with an operation “+” which assigns the point A + v = (a1 + v1, . . . , an + vn) ∈ Rn = An to a point A = (a1, . . . , an) ∈ An and a vector v = (v1, . . . , vn) ∈ Rn = V. This operation satisfies the following three properties: (1) A + 0 = A for all points A ∈ An and the null vector 0 ∈ V , (2) A + (v + w) = (A + v) + w for all vectors v, w ∈ V and points A ∈ An, (3) for every two points A, B ∈ An there exists exactly one vector v ∈ V such that A + v = B. This vector is denoted by v = B − A, sometimes also ⃗AB. The underlying vector space Rn is called the difference space of the standard affine space An. Notice that care is needed about several formal ambiguities. In particular, the symbol “+” is used for two different operations. “+” is used for adding a vector from the difference space to a point in the affine space. “+” is also used for summing vectors in the difference space V = Rn . Further, notice that the operation “−” assigns a vector to a couple of points and cannot be understood as inverse of the addition (athough in the standard affine space it is given by the difference of the n-tuples B and A). We do not introduce specific letters for the set of points in the affine space. An denotes both this set of points as well as the whole structure defining the affine space. Why distinguish between the set of points in the affine space An and its difference space V when both spaces can be viewed as Rn ? It is a fundamental formal step to understanding the geometry in Rn : The issue is that geometric objects, namely straight lines, points, planes etc. do not depend directly on the vector space structure of the set Rn . They do not depend at all on the fact that we work with n–tuples of scalars. We need to know only what it means to move “straight in a given direction”. For instance, we can consider the affine plane as an unbounded board without chosen coordinates, but with the possibility of moving about a given vector. When we switch to such an abstract view, we can discuss the “plane geometry” for two-dimensional subspaces, without the need to work with k–tuples of coordinates. 
4.A.2. A plane in R⁴ is given by its parametric equation

ϱ : [0, 3, 2, 5] + t (1, 0, 1, 0) + s (2, −1, −2, 2), t, s ∈ R.

Find its implicit equation.

Solution. The task is to find a system of equations in the 4 variables x, y, z, u (the dimension of the space is 4) which is satisfied exactly by the coordinates of the points of the plane. The desired system must contain 2 = 4 − 2 linearly independent equations. Solve the problem by elimination of the parameters. The points [x, y, z, u] ∈ ϱ satisfy

x = t + 2s, y = 3 − s, z = 2 + t − 2s, u = 5 + 2s, where t, s ∈ R.

Write the system as a matrix
\[
\left(\begin{array}{cc|cccc|c}
1 & 2 & -1 & 0 & 0 & 0 & 0\\
0 & -1 & 0 & -1 & 0 & 0 & 3\\
1 & -2 & 0 & 0 & -1 & 0 & 2\\
0 & 2 & 0 & 0 & 0 & -1 & 5
\end{array}\right).
\]
The first two columns are the direction vectors of the plane, followed by the negative identity matrix; the last column consists of the coordinates of the point [0, 3, 2, 5]. This is now a system in t, s, x, y, z, u. Transform the matrix by elementary row operations so as to obtain as many rows as possible with only zeros to the left of the first vertical line. Adding (−1)-times the first row and (−4)-times the second row to the third row, and adding twice the second row to the fourth row, gives
\[
\left(\begin{array}{cc|cccc|c}
1 & 2 & -1 & 0 & 0 & 0 & 0\\
0 & -1 & 0 & -1 & 0 & 0 & 3\\
0 & 0 & 1 & 4 & -1 & 0 & -10\\
0 & 0 & 0 & -2 & 0 & -1 & 11
\end{array}\right).
\]
The bottom two rows, both with only zeros to the left of the first vertical line, imply

x + 4y − z − 10 = 0, −2y − u + 11 = 0.

Note that the original system can also be written as
\[
\left(\begin{array}{cccc|cc|c}
1 & 0 & 0 & 0 & 1 & 2 & 0\\
0 & 1 & 0 & 0 & 0 & -1 & 3\\
0 & 0 & 1 & 0 & 1 & -2 & 2\\
0 & 0 & 0 & 1 & 0 & 2 & 5
\end{array}\right),
\]
where x, y, z, u remain on the left-hand sides of the equations. A similar transformation gives
\[
\left(\begin{array}{cccc|cc|c}
1 & 0 & 0 & 0 & 1 & 2 & 0\\
0 & 1 & 0 & 0 & 0 & -1 & 3\\
-1 & -4 & 1 & 0 & 0 & 0 & -10\\
0 & 2 & 0 & 1 & 0 & 0 & 11
\end{array}\right),
\]
from which −x − 4y + z = −10, 2y + u = 11. As seen in this exercise, the elimination of parameters can be long-winded, and it is not difficult to make a mistake along the way.

Another solution. All that is needed are two linearly independent vectors perpendicular to (1, 0, 1, 0) and (2, −1, −2, 2). If we "guessed" that these vectors could be, for example, (0, 2, 0, 1) and (−1, 0, 1, 2), then putting x = 0, y = 3, z = 2, u = 5 into the equations

2y + u = a, −x + z + 2u = b

yields a = 11, b = 12. The desired implicit expression is

2y + u = 11, −x + z + 2u = 12.

Standard affine space

The standard affine space An is the set of all points in Rⁿ = An together with an operation "+", which assigns the point A + v = (a1 + v1, . . . , an + vn) ∈ Rⁿ = An to a point A = (a1, . . . , an) ∈ An and a vector v = (v1, . . . , vn) ∈ Rⁿ = V. This operation satisfies the following three properties:
(1) A + 0 = A for all points A ∈ An and the null vector 0 ∈ V,
(2) A + (v + w) = (A + v) + w for all vectors v, w ∈ V and points A ∈ An,
(3) for every two points A, B ∈ An there exists exactly one vector v ∈ V such that A + v = B. This vector is denoted by v = B − A, sometimes also ⃗AB.

The underlying vector space Rⁿ is called the difference space of the standard affine space An.

Notice that care is needed about several formal ambiguities. In particular, the symbol "+" is used for two different operations: for adding a vector from the difference space to a point of the affine space, and for summing vectors in the difference space V = Rⁿ. Further, notice that the operation "−" assigns a vector to a couple of points and cannot be understood as the inverse of the addition (although in the standard affine space it is given by the difference of the n-tuples B and A). We do not introduce specific letters for the set of points of the affine space: An denotes both this set of points and the whole structure defining the affine space.

Why distinguish between the set of points of the affine space An and its difference space V when both can be viewed as Rⁿ? It is a fundamental formal step towards understanding the geometry in Rⁿ: the issue is that geometric objects, namely straight lines, points, planes etc., do not depend directly on the vector space structure of the set Rⁿ, and they do not depend at all on the fact that we work with n-tuples of scalars. We only need to know what it means to move "straight in a given direction". For instance, we can consider the affine plane as an unbounded board without chosen coordinates, but with the possibility of moving about a given vector. When we switch to such an abstract view, we can discuss the "plane geometry" for two-dimensional subspaces without the need to work with k-tuples of coordinates.

This point of view underlies the following definition:

4.1.2. Definition. The affine space A with difference space V is a set of points P, together with a map P × V → P, (A, v) ↦ A + v, where V is a vector space and the map satisfies the properties (1)–(3) from the definition of the standard affine space. For a fixed vector v ∈ V, the translation τv : A → A is the restricted map τv : P ≃ P × {v} → P, A ↦ A + v.

By the dimension of an affine space A we mean the dimension of its difference space. In the sequel, we do not distinguish accurately between A and the set of points P; we talk instead about points and vectors of the affine space A.

It follows immediately from the axioms that for arbitrary points A, B, C in the affine space A,

(1) A − A = 0 ∈ V,
(2) B − A = −(A − B),
(3) (C − B) + (B − A) = C − A.

Indeed, (1) follows from the fact that A + 0 = A and that such a vector is unique (the first and third defining properties). By adding successively B − A and A − B to A, according to the second defining property we obtain A again; together with the uniqueness of the null vector this proves (2). Similarly, (3) follows from the defining property 4.1.1(2) and the uniqueness.

Notice that the choice of one fixed point A0 ∈ A determines a bijection between V and A. So for a fixed basis u of V there is a unique expression

A = A0 + x1u1 + · · · + xnun

for every point A ∈ A. We talk about an affine coordinate system (A0; u1, . . . , un) given by the origin A0 and the basis u of the corresponding difference space. This is sometimes called an affine frame (A0, u). To summarize:

Affine coordinates of a point A in the frame (A0, u) are the coordinates of the vector A − A0 in the basis u of the difference space V.
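In concrete coordinates, passing between a frame and the standard description is a single linear solve. A minimal Sage sketch, borrowing the frame of exercise 4.A.12 below (the point [11, 5, 12] is its standard-coordinate answer):

A0 = vector(QQ, [1, 2, 3])
u = matrix(QQ, [[1, 1, 1], [1, -1, 2], [2, 1, 1]]).transpose()  # basis vectors as columns
A = vector(QQ, [11, 5, 12])        # a point given in standard coordinates
x = u.solve_right(A - A0)          # its affine coordinates in the frame (A0; u)
print(x)                           # (2, 2, 3)

The inverse direction is just the evaluation A0 + u*x.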
The choice of an affine coordinate system identifies each n-dimensional affine space A with the standard affine space An.

4.1.3. Affine subspaces. If we consider some coordinate system on A and choose only those points of A which have some chosen coordinates equal to zero (for instance the last one), we obtain again a set which behaves as an affine space. This is the spirit of the following definition of affine subspaces.

Subspaces of an affine space

Definition. A nonempty subset Q ⊂ A of an affine space A with difference space V is called an affine subspace in A if the subset W = {B − A; A, B ∈ Q} ⊂ V is a vector subspace and A + v ∈ Q for any A ∈ Q, v ∈ W.

It is important to include both conditions in the definition, since there are sets which satisfy the first condition but not the second. One such set is a straight line in the plane with one point removed.

For an arbitrary set of points M ⊂ A in an affine space with difference space V, we define the vector subspace

Z(M) = span{B − A; B, A ∈ M} ⊂ V

of all vectors generated by the differences of points in M.

Another solution. Since

x = t + 2s, y = 3 − s, z = 2 + t − 2s, u = 5 + 2s,

eliminating t gives

z − x = 2 − 4s, y = 3 − s, u = 5 + 2s.

Eliminating s then yields the two equations

z − x + 2u = 12, u + 2y = 11,

which solves the problem. □

4.A.3. Find a parametric equation of the plane passing through the points A = [2, 1, 1], B = [3, 4, 5], C = [4, −2, 3]. Then find a parametric description of the open half-plane containing the point C and bounded by the line passing through the points A, B.

Solution. We need one point and two (linearly independent) vectors lying in the plane. It is enough to choose A together with the vectors B − A = (1, 3, 4) and C − A = (2, −3, 2), which are clearly independent. A point [x, y, z] lies in the plane if and only if there exist numbers t, s ∈ R such that

x = 2 + 1·t + 2·s, y = 1 + 3·t − 3·s, z = 1 + 4·t + 2·s.

Consequently, a parametric equation is

[x, y, z] = [2, 1, 1] + t (1, 3, 4) + s (2, −3, 2), t, s ∈ R.

Setting s = 0 gives the line passing through the points A and B. Setting t = 0 and s ≥ 0 defines the ray with initial point A passing through C. A particular but arbitrarily chosen t ∈ R together with variable s > 0 gives a ray with initial point on the border line, passing through the half-plane in which the point C lies. So the desired open half-plane can be expressed parametrically as

[x, y, z] = [2, 1, 1] + t (1, 3, 4) + s (2, −3, 2), t ∈ R, s > 0. □

4.A.4. Determine the relative position of the lines

p : [1, 0, 3] + t (2, −1, −3), t ∈ R,
q : [1, 1, 3] + s (1, −1, −2), s ∈ R.

Solution. Search for common points of the given lines (the intersection of the subspaces). We obtain the system

1 + 2t = 1 + s, 0 − t = 1 − s, 3 − 3t = 3 − 2s.

In particular, V = Z(A). Every affine subspace Q ⊂ A itself satisfies the axioms for an affine space, with difference space Z(Q). The intersection of any set of affine subspaces is either an affine subspace or the empty set; this follows directly from the definitions. The affine subspace ⟨M⟩ in A generated by a nonempty set M ⊂ A is the intersection of all affine subspaces which contain all points of M.
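Looking back at the parameter elimination in 4.A.2, the implicit equations can also be produced in one step: the normal vectors of the plane form the kernel of the matrix whose rows are the direction vectors. A small Sage sketch of this approach:

M = matrix(QQ, [[1, 0, 1, 0], [2, -1, -2, 2]])   # rows: the direction vectors of ϱ
K = M.right_kernel()
A = vector(QQ, [0, 3, 2, 5])                     # a point of the plane
for a in K.basis():
    print(a, a.dot_product(A))    # each a gives one equation a·(x,y,z,u) = a·A

With the echelonized kernel basis this yields, up to scaling, the equations x + 4y − z = 10 and 2y + u = 11, in accordance with 4.A.2.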
Affine hull and parametric description

Affine subspaces can be described by their difference spaces after choosing a point A0 ∈ M in a generating set M. Indeed, to generate the affine subspace, take the vector subspace Z(M) ⊂ Z(A) generated by all differences of points in M, and add this vector space to an arbitrary point in M, i.e.,

⟨M⟩ = {A0 + v; v ∈ Z(M) ⊂ Z(A)}.

We talk about the affine hull of the set of points M in A.

On the other hand, whenever a subspace U in the difference space Z(A) and a fixed point A ∈ A are chosen, the subset A + U, formed by all sums of A with vectors in U, is an affine subspace. This approach leads to the notion of a parametrization of subspaces:

Let Q = A + Z(Q) be an affine subspace in An and let (u1, . . . , uk) be a basis of Z(Q) ⊂ Rⁿ. Then the expression

Q = {A + t1u1 + · · · + tkuk; t1, . . . , tk ∈ R}

is called a parametric description of the subspace Q.

In the sequel, we use the same notation ⟨ ⟩ for affine hulls in affine spaces and for linear hulls in their difference vector spaces.

There is another way of prescribing affine subspaces. If we choose affine coordinates, then the difference space may be described by a homogeneous system of linear equations in these coordinates. By inserting the coordinates of one point of the subspace Q into the system, we obtain the right-hand side of a non-homogeneous system with the same matrix. The subspace Q is exactly the set of solutions of this system. The description of a subspace Q by a system of equations in given coordinates is called an implicit description of the subspace Q. The following proposition says that all affine subspaces can be prescribed in this way; it shows the geometric nature of the solutions of systems of linear equations.

4.1.4. Theorem. Let (A0; u) be an affine coordinate system in an n-dimensional affine space A. In these coordinates, the affine subspaces of dimension k in A are exactly the sets of solutions of solvable systems of n − k linearly independent linear equations in n variables.

Proof. Consider an arbitrary solvable system of n − k linearly independent equations αi(x) = bi, where bi ∈ R, i = 1, . . . , n − k.

From the first two equations of the system in 4.A.4, t = 1, s = 2. This does not satisfy the third equation, so the system has no solution. Moreover, the direction vector (2, −1, −3) of the line p is not a multiple of the direction vector (1, −1, −2) of the line q, hence the lines are not parallel. The lines are therefore skew. □

4.A.5. Find all numbers a ∈ R for which the lines

p : [4, −4, 8] + t (2, 1, −4), t ∈ R,
q : [a, 6, −5] + s (1, −3, 3), s ∈ R

intersect.

Solution. The lines intersect if and only if the system

4 + 2t = a + s, −4 + t = 6 − 3s, 8 − 4t = −5 + 3s

has a solution. Express the system as a matrix (the first column corresponding to t, the second to s) and solve:
\[
\left(\begin{array}{cc|c}
2 & -1 & a-4\\ 1 & 3 & 10\\ -4 & -3 & -13
\end{array}\right) \sim
\left(\begin{array}{cc|c}
1 & 3 & 10\\ 2 & -1 & a-4\\ -4 & -3 & -13
\end{array}\right) \sim
\left(\begin{array}{cc|c}
1 & 3 & 10\\ 0 & -7 & a-24\\ 0 & 1 & 3
\end{array}\right).
\]
The system has a solution if and only if the second row is a multiple of the third one, which happens only for a = 3. The point of intersection of the lines is then [6, −3, 4]. □
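Exercises of the type of 4.A.5 reduce to the solvability of a small linear system, which can be left entirely to a symbolic solver. A minimal Sage sketch:

var('t s a')
eqs = [4 + 2*t == a + s, -4 + t == 6 - 3*s, 8 - 4*t == -5 + 3*s]
print(solve(eqs, t, s, a))    # [[t == 1, s == 3, a == 3]]

So the lines meet only for a = 3, at t = 1, i.e. at the point [6, −3, 4], as found above.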
4.A.6. In R³, determine the relative position of the line p given implicitly by

x + y − z = 4, x − 2y + z = −3

and the plane ϱ : y = 2x − 1.

Solution. A normal vector of the plane ϱ is (2, −1, 0) (consider ϱ : 2x − y + 0z = 1). Since (1, 1, −1) + (1, −2, 1) = (2, −1, 0), the normal vector of the plane ϱ is a linear combination of the normal vectors of p. Hence the direction vector of the line p is parallel to the plane ϱ. It remains to discover whether or not they intersect. The system of equations

x + y − z = 4, x − 2y + z = −3, 2x − y = 1

has infinitely many solutions, because the first two equations add up to give the third one. So the line p lies in the plane ϱ. □

The following exercise is a typical exercise on the intersection of vector spaces. The reader should be able to solve it; otherwise we recommend not continuing with this book.

(Continuation of the proof of Theorem 4.1.4.) Suppose A = (a1, . . . , an)ᵀ ∈ Rⁿ is a fixed solution of this (non-homogeneous) system, and suppose U ⊂ Rⁿ is the vector space of all solutions of the homogenized system αi(x) = 0. Then the dimension of U is k, and the set of all solutions of the given system is of the form

{B; B = A + (y1, . . . , yn)ᵀ, y = (y1, . . . , yn)ᵀ ∈ U} ⊂ Rⁿ,

cf. 2.3.5. So the corresponding affine subspace is described parametrically in the initial coordinates (A0; u).

Conversely, consider an arbitrary affine subspace Q ⊂ An. Choose a point B therein, and consider this point as the origin of an affine coordinate system (B, v) for the affine space A. Since Q = B + Z(Q), it is necessary to describe the difference space of the subspace Q as the space of solutions of a homogeneous system of linear equations. Therefore, choose a basis v of Z(A) such that the first k vectors form a basis of Z(Q). In these coordinates, the vectors v ∈ Z(Q) are given by the equations αj(v) = 0, j = k + 1, . . . , n, where the αj are the linear forms of the dual basis to v, i.e. the functions which assign to a vector its corresponding coordinate in the basis v. Hence the vector subspace Z(Q) of dimension k in the n-dimensional space Rⁿ is given as the set of solutions of a homogeneous system of n − k independent equations. The description of the chosen affine subspace in the newly chosen coordinate system (B; v) is therefore given by a system of linear equations. It remains to consider the consequences of the transition from the former coordinate system (A0; u) to the new adapted system (B; v). It follows from the general considerations about transformations of coordinates in the following paragraph that the final description of the subspace is again a system of linear equations, this time non-homogeneous in general. □

4.1.5. Coordinate transformations. Two arbitrarily chosen affine coordinate systems (A0, u), (B0, v) differ in the bases of the difference spaces, and the origin of the latter is translated by the vector (B0 − A0). Hence the equations of the corresponding coordinate transformation can be read off from the transformation rule for a point X ∈ A:

X = B0 + x′1v1 + · · · + x′nvn = B0 + (A0 − B0) + x1u1 + · · · + xnun.

Let y = (y1, . . . , yn)ᵀ denote the column of coordinates of the vector (A0 − B0) in the basis v, and let M = (aij) be the matrix expressing the basis u in terms of the basis v. Then
\[
\begin{aligned}
x'_1 &= y_1 + a_{11}x_1 + \cdots + a_{1n}x_n,\\
&\ \ \vdots\\
x'_n &= y_n + a_{n1}x_1 + \cdots + a_{nn}x_n,
\end{aligned}
\]
in matrix notation x′ = y + M · x.

4.A.7. Find the intersection of the subspaces Q1 and Q2, where

Q1 : [4, −5, 1, −2] + t1 (3, 5, 4, 2) + t2 (2, 4, 5, 1) + t3 (0, 3, 1, 2),
Q2 : [4, 4, 4, 4] + s1 (0, −6, −2, −4) + s2 (−1, −5, −3, −3),

for t1, t2, t3, s1, s2 ∈ R.

Solution. The point X = [x1, x2, x3, x4] ∈ R⁴ lies in Q1 if and only if
\[
\begin{pmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{pmatrix} =
\begin{pmatrix} 4\\ -5\\ 1\\ -2 \end{pmatrix} +
t_1\begin{pmatrix} 3\\ 5\\ 4\\ 2 \end{pmatrix} +
t_2\begin{pmatrix} 2\\ 4\\ 5\\ 1 \end{pmatrix} +
t_3\begin{pmatrix} 0\\ 3\\ 1\\ 2 \end{pmatrix}
\]
for some numbers t1, t2, t3 ∈ R.
The point X = [x1, x2, x3, x4] ∈ R⁴ lies in Q2 if and only if
\[
\begin{pmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{pmatrix} =
\begin{pmatrix} 4\\ 4\\ 4\\ 4 \end{pmatrix} +
s_1\begin{pmatrix} 0\\ -6\\ -2\\ -4 \end{pmatrix} +
s_2\begin{pmatrix} -1\\ -5\\ -3\\ -3 \end{pmatrix}
\]
for some s1, s2 ∈ R. Hence X lies in Q1 ∩ Q2 if and only if the equation
\[
t_1\begin{pmatrix} 3\\ 5\\ 4\\ 2 \end{pmatrix} +
t_2\begin{pmatrix} 2\\ 4\\ 5\\ 1 \end{pmatrix} +
t_3\begin{pmatrix} 0\\ 3\\ 1\\ 2 \end{pmatrix} =
\begin{pmatrix} 4-4\\ 4+5\\ 4-1\\ 4+2 \end{pmatrix} +
s_1\begin{pmatrix} 0\\ -6\\ -2\\ -4 \end{pmatrix} +
s_2\begin{pmatrix} -1\\ -5\\ -3\\ -3 \end{pmatrix}
\]
has a solution in t1, t2, t3, s1, s2. Move the vectors corresponding to s1 and s2 to the left-hand side, write the equations in matrix form and reduce to echelon form:
\[
\left(\begin{array}{ccccc|c}
3 & 2 & 0 & 0 & 1 & 0\\
5 & 4 & 3 & 6 & 5 & 9\\
4 & 5 & 1 & 2 & 3 & 3\\
2 & 1 & 2 & 4 & 3 & 6
\end{array}\right) \sim
\left(\begin{array}{ccccc|c}
3 & 2 & 0 & 0 & 1 & 0\\
0 & 2 & 9 & 18 & 10 & 27\\
0 & 7 & 3 & 6 & 5 & 9\\
0 & -1 & 6 & 12 & 7 & 18
\end{array}\right) \sim \cdots \sim
\left(\begin{array}{ccccc|c}
3 & 0 & 0 & 0 & 0 & 0\\
0 & 2 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 2 & 0 & 3\\
0 & 0 & 0 & 0 & 1 & 0
\end{array}\right).
\]
So t1 = t2 = s2 = 0, and for s1 = t ∈ R we have t3 = 3 − 2t. Note that for the determination of Q1 ∩ Q2 it suffices to know either t1, t2, t3, or s1, s2. So
\[
\begin{pmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{pmatrix} =
\begin{pmatrix} 4\\ 4\\ 4\\ 4 \end{pmatrix} +
s_1\begin{pmatrix} 0\\ -6\\ -2\\ -4 \end{pmatrix} +
s_2\begin{pmatrix} -1\\ -5\\ -3\\ -3 \end{pmatrix} =
\begin{pmatrix} 4\\ 4\\ 4\\ 4 \end{pmatrix} +
t\begin{pmatrix} 0\\ -6\\ -2\\ -4 \end{pmatrix}.
\]
This can be checked using t1 = t2 = 0 and t3 = 3 − 2t. The intersection Q1 ∩ Q2 is thus a line lying in both planes. □

Now we can come back to the proof of the previous theorem. Suppose the system of linear equations in the "new" coordinates (B0; v) has the form S · x′ = b, where S is the matrix of the system. Then, in the original coordinates (A0; u),

S · x′ = S · (y + M · x) = b.

Thus in the original coordinates the system has the form S · M · x = b − S · y. Therefore, if a subspace is described by a system of linear equations in one affine frame, then it is so described in all other affine frames. This completes the proof of the previous proposition. □

4.1.6. Examples of subspaces. (1) The one-dimensional (standard) affine space is the set of all points of the real straight line A1. Its difference space is the one-dimensional vector space R, and the supporting set (i.e. the set of points of A1) is also R. Affine coordinates are obtained by the choice of an origin and a scale (i.e. a basis of the vector space R). All proper affine subspaces are 0-dimensional; they are the individual points of the real straight line R.

(2) The two-dimensional (standard) affine space is the set of all points of the space A2, with difference space R². The supporting set is R². Affine coordinates are obtained by a choice of an origin and two linearly independent vectors (directions and scales). The proper subspaces are then all points and straight lines in the plane (0-dimensional and 1-dimensional). The lines are prescribed by the choice of a point and one vector from the corresponding difference space; the vector generates the direction, as in the parametric definition of a straight line.

(3) The three-dimensional (standard) affine space is the set of all points of the space A3, with difference space R³. Affine coordinates are obtained by the choice of an origin and three linearly independent vectors (directions and scales). The proper affine subspaces are then all points, straight lines and planes (0-dimensional, 1-dimensional and 2-dimensional).

(4) Suppose there are given a nonzero vector of coefficients (a1, . . . , an) and a scalar b ∈ R. Consider the subspace of all solutions of the single linear equation a · x = b for the unknown point [x1, . . . , xn] ∈ An. This is an affine subspace of dimension n − 1. We say that the subspace is of codimension 1; it is called a hyperplane in An.
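The whole computation of 4.A.7 can be delegated to a symbolic solver. A short Sage sketch (the free parameter in the answer is named by Sage, typically r1):

var('t1 t2 t3 s1 s2')
u1 = vector([3,5,4,2]); u2 = vector([2,4,5,1]); u3 = vector([0,3,1,2])
v1 = vector([0,-6,-2,-4]); v2 = vector([-1,-5,-3,-3])
X1 = vector([4,-5,1,-2]) + t1*u1 + t2*u2 + t3*u3
X2 = vector([4,4,4,4]) + s1*v1 + s2*v2
print(solve([X1[i] == X2[i] for i in range(4)], t1, t2, t3, s1, s2))

The one-parameter family of solutions (t1 = t2 = s2 = 0, t3 = 3 − 2r, s1 = r) reproduces the line found above.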
4.1.7. Affine combinations of points. We introduce an analogue of the linear combination of vectors. Let A0, . . . , Ak be points in the affine space A. Their affine hull ⟨{A0, . . . , Ak}⟩ can be written as the set of points

{A0 + t1(A1 − A0) + · · · + tk(Ak − A0); t1, . . . , tk ∈ R}.

4.A.8. Determine whether or not the points [0, 2, 1], [−1, 2, 0], [−2, 5, 2] and [0, 5, 4] in R³ all lie in the same plane.

Solution. Consider the vectors [0, 2, 1] − [−1, 2, 0] = (1, 0, 1), [0, 2, 1] − [−2, 5, 2] = (2, −3, −1) and [0, 2, 1] − [0, 5, 4] = (0, −3, −3). They are linearly dependent, since the matrix
\[
\begin{pmatrix} 1 & 0 & 1\\ 2 & -3 & -1\\ 0 & -3 & -3 \end{pmatrix}
\]
has rank 2. Hence the given points lie in a plane. □

4.A.9. Into how many parts can three planes slice the space R³? Give an example of planes in a suitable position for every case.

4.A.10. Determine whether or not the point [2, 1, 0] lies within the convex hull of the points [0, 2, 1], [1, 0, 1], [3, −2, −1], [−1, 0, 1].

Solution. The point [2, 1, 0] lies in the convex hull (see 4.1.9) if and only if

[2, 1, 0] = t1[0, 2, 1] + t2[1, 0, 1] + t3[3, −2, −1] + t4[−1, 0, 1]

has a solution with t1, t2, t3, t4 all non-negative and t1 + t2 + t3 + t4 = 1. Equivalently, [2, 1, 0] lies in the convex hull if and only if

[2, 1, 0, 1] = t1[0, 2, 1, 1] + t2[1, 0, 1, 1] + t3[3, −2, −1, 1] + t4[−1, 0, 1, 1]

has a solution with t1, t2, t3, t4 all non-negative. Solving these four equations gives (t1, t2, t3, t4) = (1, 0, 1/2, −1/2), so the given point does not lie in the convex hull. □

4.A.11. In R³, a tetrahedron has vertices A = [4, 0, 2], B = [−2, −3, 1], C = [1, −1, −3], D = [2, 4, −2].
a) Determine its volume.
b) Decide whether or not the point X = [0, −3, 0] lies inside the tetrahedron.

Solution. a) The volume of the tetrahedron is one sixth of the volume of the parallelepiped whose three edges from the point A are B − A = (−6, −3, −1), C − A = (−3, −1, −5) and D − A = (−2, 4, −4). That volume is given by the absolute value of the determinant
\[
\begin{vmatrix} -6 & -3 & -1\\ -3 & -1 & -5\\ -2 & 4 & -4 \end{vmatrix} = -124.
\]
Thus, the volume of the tetrahedron is 124/6 = 62/3.

b) Write X as an affine combination of the vertices, by solving the system of four linear equations in four unknowns a, b, c, d given by the equality X = aA + bB + cC + dD together with a + b + c + d = 1. The solution is X = ¼A + ½B + ½C − ¼D. Since the coefficient of D is negative, X does not lie in the convex hull of the points A, B, C, D. Hence the given point does not lie inside the tetrahedron. □

In any affine coordinates, the same set can be written as

{t0A0 + t1A1 + · · · + tkAk; ti ∈ R, Σᵢ₌₀ᵏ ti = 1}.

Affine combinations of points

In general, an expression t0A0 + t1A1 + · · · + tkAk with coefficients satisfying Σᵢ₌₀ᵏ ti = 1 is understood as the point A0 + Σᵢ₌₁ᵏ ti(Ai − A0). Such expressions are called affine combinations of points.

The points A0, . . . , Ak are in general position if they generate a k-dimensional affine subspace. This happens if and only if for each Ai, the vectors arising as the differences of this point Ai and all the other points Aj are linearly independent. Observe that prescribing a sequence of (dim A) + 1 points in general position is equivalent to choosing an affine frame with the origin in the first of them.

4.1.8. Simplexes. For points in an affine space, the affine combination is a construction similar to the linear combination of vectors in a vector space.
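Both parts of exercise 4.A.11 are mechanical enough to verify in Sage; the barycentric coefficients come from one 4 × 4 solve (three coordinate equations plus the affine constraint):

A = vector(QQ, [4,0,2]); B = vector(QQ, [-2,-3,1])
C = vector(QQ, [1,-1,-3]); D = vector(QQ, [2,4,-2])
print(abs(matrix([B-A, C-A, D-A]).det()) / 6)          # 62/3, the volume
M = matrix(QQ, [list(A)+[1], list(B)+[1], list(C)+[1], list(D)+[1]]).transpose()
print(M.solve_right(vector(QQ, [0,-3,0,1])))           # (1/4, 1/2, 1/2, -1/4)

The negative last coefficient confirms that X lies outside the tetrahedron.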
Indeed, the affine subspace generated by the points A0, . . . , Ak equals the set of all affine combinations of its generators.

The notion "to lie on the line between two points" can now be generalized; in the two-dimensional case, imagine the interior of a triangle. In general, proceed as follows:

k-dimensional simplexes

Let A0, . . . , Ak be k + 1 points in general position in an affine space A. The set ∆ = ∆(A0, . . . , Ak) of all affine combinations of the points Ai with nonnegative coefficients only, that is

∆ = {t0A0 + t1A1 + · · · + tkAk; 0 ≤ ti, Σᵢ₌₀ᵏ ti = 1},

is called the k-dimensional simplex generated by the points Ai.

A one-dimensional simplex is a line segment, a two-dimensional simplex is a triangle, and a zero-dimensional simplex is a point. Notice that each k-dimensional simplex has exactly k + 1 faces, defined by the equations ti = 0, i = 0, . . . , k. The faces are also simplexes, of dimension k − 1; together they form the boundary of the simplex. For instance, the boundary of a triangle is formed by its three edges, and the boundary of each edge is formed by two vertices.

The description of a subspace as the set of affine combinations of points in general position is equivalent to the parametric description. We work similarly with parametric descriptions of simplexes.

4.1.9. Convex sets. A subset M of an affine space is called convex if and only if for any two points A, B ∈ M it contains the whole line segment ∆(A, B). Directly from the definition, each convex set containing k + 1 points in general position also contains the entire simplex defined by these points.

4.A.12. Affine transformation of point coordinates. The point X has coordinates [2, 2, 3] in the affine basis {[1, 2, 3], (1, 1, 1), (1, −1, 2), (2, 1, 1)} of R³. Determine its coordinates in the standard basis, i.e. in the basis {[0, 0, 0], (1, 0, 0), (0, 1, 0), (0, 0, 1)}.

Solution. The coordinates [2, 2, 3] in the given basis mean, by definition, that

[1, 2, 3] + 2 · (1, 1, 1) + 2 · (1, −1, 2) + 3 · (2, 1, 1) = [11, 5, 12]

are the coordinates of X in the standard basis. □

4.A.13. Affine transformation of a mapping. Find the expression of the affine mapping f in the coordinate system with basis u = {(1, 1), (−1, 1)} and origin [2, 0], given in the standard basis of R² as
\[
f(x_1, x_2) = \begin{pmatrix} 2 & 1\\ 0 & 1 \end{pmatrix}\begin{pmatrix} x_1\\ x_2 \end{pmatrix} + \begin{pmatrix} 1\\ 1 \end{pmatrix}.
\]
Solution. The change of basis matrix from the basis u to the standard basis is
\[
\begin{pmatrix} 1 & -1\\ 1 & 1 \end{pmatrix}.
\]
The expression of f in the basis ([2, 0], u) is obtained as follows: first transform the coordinates in the basis ([2, 0], u) to the standard basis, i.e. to the basis ([0, 0], (1, 0), (0, 1)); then apply the transformation f in the standard basis; finally transform the result back to the coordinates in the basis ([2, 0], u). The transformation equations for changing the coordinates y1, y2 in the basis ([2, 0], u) to the coordinates x1, x2 in the standard basis are
\[
\begin{pmatrix} x_1\\ x_2 \end{pmatrix} = \begin{pmatrix} 1 & -1\\ 1 & 1 \end{pmatrix}\begin{pmatrix} y_1\\ y_2 \end{pmatrix} + \begin{pmatrix} 2\\ 0 \end{pmatrix}.
\]
Hereby
\[
\begin{pmatrix} y_1\\ y_2 \end{pmatrix} = \begin{pmatrix} 1 & -1\\ 1 & 1 \end{pmatrix}^{-1}\left(\begin{pmatrix} x_1\\ x_2 \end{pmatrix} - \begin{pmatrix} 2\\ 0 \end{pmatrix}\right) = \begin{pmatrix} \tfrac12 & \tfrac12\\ -\tfrac12 & \tfrac12 \end{pmatrix}\begin{pmatrix} x_1\\ x_2 \end{pmatrix} + \begin{pmatrix} -1\\ 1 \end{pmatrix}.
\]
Hence the desired mapping is
\[
f(y_1, y_2) = \begin{pmatrix} \tfrac12 & \tfrac12\\ -\tfrac12 & \tfrac12 \end{pmatrix}\left[\begin{pmatrix} 2 & 1\\ 0 & 1 \end{pmatrix}\left(\begin{pmatrix} 1 & -1\\ 1 & 1 \end{pmatrix}\begin{pmatrix} y_1\\ y_2 \end{pmatrix} + \begin{pmatrix} 2\\ 0 \end{pmatrix}\right) + \begin{pmatrix} 1\\ 1 \end{pmatrix}\right] + \begin{pmatrix} -1\\ 1 \end{pmatrix} = \begin{pmatrix} 2 & 0\\ -1 & 1 \end{pmatrix}\begin{pmatrix} y_1\\ y_2 \end{pmatrix} + \begin{pmatrix} 2\\ -1 \end{pmatrix}. \quad \square
\]
Examples of convex sets are
(1) the empty set,
(2) affine subspaces,
(3) line segments, and rays p = {P + t · v; t ≥ 0},
(4) more generally, k-dimensional objects α = {P + t1 · v1 + · · · + tk · vk; t1, . . . , tk ∈ R, tk ≥ 0},
(5) angles in two-dimensional subspaces β = {P + t1 · v1 + t2 · v2; t1 ≥ 0, t2 ≥ 0}.
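Convexity questions like the membership test in 4.A.10 can also be handled by Sage's polyhedral machinery, which computes the convex hull directly. A small sketch, assuming the Polyhedron class of a standard Sage installation:

P = Polyhedron(vertices=[[0,2,1], [1,0,1], [3,-2,-1], [-1,0,1]])
print(P.contains([2,1,0]))    # False, in accordance with 4.A.10
print(P.contains([1,0,1]))    # True: a vertex certainly lies in the hull

Internally this solves the same kind of nonnegative affine combination problem as the hand computation in 4.A.10.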
The intersection of an arbitrary system of convex sets is again a convex set. The intersection of all convex sets containing a given set M is called the convex hull K(M) of the set M.

Theorem. The convex hull of any subset M ⊂ A is

K(M) = {t1A1 + · · · + tsAs; Σᵢ₌₁ˢ ti = 1, ti ≥ 0, Ai ∈ M}.

Proof. Let S denote the set of all affine combinations on the right-hand side of the equation. To check that S is convex, choose two sets of parameters ti, i = 1, . . . , s1, and t′j, j = 1, . . . , s2, with the desired properties. Without loss of generality, assume that s1 = s2 and that the same points from M appear in both combinations (otherwise simply add summands with zero coefficients). Consider an arbitrary point of the line segment whose endpoints are defined by the two combinations:

ε(t1A1 + · · · + tsAs) + (1 − ε)(t′1A1 + · · · + t′sAs), 0 ≤ ε ≤ 1.

Obviously every point of this line segment lies in S.

It remains to show that the convex hull of the points A1, . . . , As cannot be smaller than S. The points Ai themselves correspond to the choice of parameters tj = 0 for all j ≠ i and ti = 1. Assume the claim holds for all sets with at most s − 1 points. Then the convex hull of the points A1, . . . , As−1 is (by the assumption) formed exactly by the combinations on the right-hand side with ts = 0. Now consider a point A = t1A1 + · · · + tsAs ∈ S with ts < 1, and the affine combinations

ε(t1A1 + · · · + ts−1As−1) + (1 − ε(1 − ts))As, 0 ≤ ε ≤ 1/(1 − ts).

This is a line segment with the endpoints given by the parameter values ε = 0 (the point As) and ε = 1/(1 − ts) (a point in the convex hull of A1, . . . , As−1). The point A is an inner point of this line segment, with the parameter ε = 1. Thus A lies in the convex hull of A1, . . . , As. □

The convex hulls of finite sets are called convex polyhedrons. We have a k-dimensional simplex if and only if the vertices A0, . . . , Ak defining the convex polyhedron are in general position. In the case of a simplex, the expression of any of its points as an affine combination of the defining vertices is unique.

Specific examples are the convex polyhedrons defined by one point and a finite number of vectors:

4.A.14. Let there be the standard coordinate system in the space R³. Agent K lives at the point S with coordinates [0, 1, 2]. The headquarters gave him a coordinate system with origin S and basis {(1, 1, 0), (−1, 0, 1), (0, 1, 2)}. Agent Bond lives at the point D with coordinates [1, 1, 1] and uses the coordinate system with origin D and basis {(0, 0, 1), (−1, 1, 2), (1, 0, 1)}. Agent K has set an appointment with agent Bond in the old brickfield, which is (according to K's coordinate system) at the point [1, 1, 0]. To where should Bond go (in his coordinate system)?

Solution. The change of basis matrix from agent K's basis to Bond's basis (ignoring the different origins for the moment) is
\[
T = \begin{pmatrix} -4 & 2 & -1\\ 1 & 0 & 1\\ 2 & -1 & 1 \end{pmatrix}.
\]
The brickfield has K-coordinates (1, 1, 0), so the corresponding displacement from S has Bond-basis coordinates T · (1, 1, 0)ᵀ = (−2, 1, 1)ᵀ. It remains to translate the origin: the vector S − D = (−1, 0, 1) has coordinates (2, 0, −1) in Bond's basis, so the brickfield has Bond's coordinates (−2, 1, 1) + (2, 0, −1) = (0, 1, 0). Indeed, in standard coordinates the brickfield is the point S + (1, 1, 0) + (−1, 0, 1) = [0, 2, 3] = D + (−1, 1, 2), and (−1, 1, 2) is exactly the second vector of Bond's basis. □

4.A.15. Find a transversal of the lines (that is, a line intersecting both given lines)

p : [1, 1, 1] + t(2, 1, 0), q : [2, 2, 0] + t(1, 1, 1),

such that the point [1, 0, 0] lies on the transversal.

Solution. The transversal lies in the plane ρ defined by the point [1, 0, 0] and the line p, that is, in the plane

[1, 1, 1] + t(2, 1, 0) + s(0, 1, 1).

Let the point Q be the intersection of this plane with the line q.
Q is obtained by solving the system

1 + 2t = 2 + u,
1 + t + s = 2 + u,
1 + s = u.

The left-hand sides of the equations are the three coordinates of an arbitrary point of the plane ρ, while the right-hand sides are the coordinates of an arbitrary point of the line q (whose free variable is denoted u to avoid ambiguity). Solving this system yields s = 2, t = 2, u = 3. Substituting u = 3 into the equation of the line q gives Q = [5, 5, 3]. The desired transversal is thus the line given by Q and the point [1, 0, 0]; its intersection with p is the point P = [7/3, 5/3, 1]. □

4.A.16. Find the common perpendicular of the two skew lines

p : [3, 0, 3] + (0, 1, 2)t, t ∈ R,
q : [0, −1, −2] + (1, 2, 3)s, s ∈ R.

Solution. The direction of the common perpendicular is given by the cross product of the two direction vectors; up to sign, it is (1, −2, 1).

Let u1, . . . , uk be arbitrary vectors of the difference space Rⁿ and A ∈ An a point. A parallelepiped Pk(A; u1, . . . , uk) ⊂ An is the set

Pk(A; u1, . . . , uk) = {A + c1u1 + · · · + ckuk; 0 ≤ ci ≤ 1}.

If the vectors u1, . . . , uk are independent, we talk about a k-dimensional parallelepiped Pk(A; u1, . . . , uk) ⊂ An. It is clear from the definition that parallelepipeds are convex; they are the convex hulls of their vertices.

4.1.10. Examples of standard affine tasks. (1) To find a parametric description of an implicitly given subspace and vice versa: Find a particular solution of the nonhomogeneous system and a fundamental system of solutions of the homogenized system. Together these give (in the coordinates in which the equations were set) the desired parametric description. In the opposite direction, write the parametric description in coordinates and eliminate the free parameters t1, . . . , tk. This results in equations defining the given subspace implicitly.

(2) To find the subspace generated by several subspaces Q1, . . . , Qs (of different dimensions in general; for instance, to find the plane in A3 given by a straight line and a point, or by three points), and to describe this subspace implicitly or parametrically: The resulting subspace Q is always determined by one fixed point Ai in each subspace Qi and by the sum of all the difference spaces, for instance

Q = A1 + (Z({A1, . . . , As}) + Z(Q1) + · · · + Z(Qs)).

If the subspaces are given implicitly, it is possible to convert them into parametric form first; nevertheless, different methods may be advantageous in concrete situations. Notice that it is really necessary to use one point from each of the subspaces. For example, two parallel lines in a plane generate the whole plane, but they share the same one-dimensional difference space.

(3) To find the intersection of the subspaces Q1, . . . , Qs: If they are given in implicit form, it is sufficient to unify all the equations into one system, omitting linearly dependent ones. If the resulting system has no solution, the intersection is empty; otherwise we obtain an implicit description of the intersection we are searching for. If parametric forms are given, we may search directly for common points as solutions of the appropriate equations, as when finding intersections of vector spaces. If the number of subspaces is greater than two, we search for the intersection step by step. If one of the subspaces is defined parametrically and the other implicitly, it suffices to substitute the parametrized coordinates into the implicit equations and solve the resulting system.

(4) To find a crossbar between two skew lines p, q in A3 passing through a given point or having a given direction: By a crossbar we mean a straight line which has a nonempty intersection with both skew lines.
If one of the subspaces is defined parametrically and the other implicitly, it suffices to substitute the parametrized coordinates and to solve the resulting system of equations. (4) To find a crossbar between two skew lines p, q in A3 passing through a given point or having a given direction: By a crossbar we mean a straight line which has nonempty intersection with both skew lines. Thus the 327 expresses that a vector defined by two points, one lying on p, the other on q, is parallel to the direction (1, −2, 1). We get the system P − Q = k(1, −2, 1), or [3, 0, 3] + (0, 1, 2)t P − [0, −1, −2] + (1, 2, 3)s Q = k(1, −2, 1). Treat this equality component-wise to give 3 − s = k 1 + t − 2s = −2k 5 + 2t − 3s = k with the solution t = 1, s = 2, k = 1. Put t = 1 into the line p, to obtain the point [3, 1, 5] on the common perpendicular. Put s = 2 into the line q equation to obtain the point [3, 1, 5]. The common perpendicular is defined by the line joining these two points. □ B. Euclidean geometry 4.B.1. Determine the distance between the lines in R3 . p : [1, −1, 0]+t(−1, 2, 3), and q : [2, 5, −1]+t(−1, −2, 1). Solution. The distance is defined as the distance of the orthogonal projection of arbitrary points on the respective lines to the orthogonal complement of the vector subspace generated by their directions. The orthogonal complement is spanned by the cross product: ⟨(−1, 2, 3), (−1, −2, 1)⟩⊥ = ⟨(−1, 2, 3) × (−1, −2, 1)⟩ = ⟨(8, −2, 4)⟩ = ⟨(4, −1, 2)⟩. A transversal is (for example) the segment joining [1, −1, 0] to [2, 5, −1]. So the vector to be projected is [1, −1, 0] − [2, 5, −1] = (−1, −6, 1). The distance between the lines is therefore: ρ(p, q) = |(−1, −6, 1) · (4, −1, 2)| ∥(4, −1, 2)∥ = 4 √ 21 . □ 4.B.2. Find a point A lying on the line p : x + 2y + z − 1 = 0, 3x − y + 4z − 29 = 0, which is equidistant from both B = [3, 11, 4] and C = [−5, −13, −2]. Solution. First, express the line p parametrically. Solve the system x + 2y + z = 1, 3x − y + 4z = 29. Rewrite the system as an augmented matrix and perform row operations ( 1 2 1 1 3 −1 4 29 ) ∼ ( 1 2 1 1 0 −7 1 26 ) ∼ ( 1 0 9/7 59/7 0 1 −1/7 −26/7 ) . The line p is thus described by p : [ 59 7 , − 26 7 , 0 ] + t ( − 9 7 , 1 7 , 1 ) , t ∈ R. CHAPTER 4. ANALYTIC GEOMETRY resulting crossbar r is a one–dimensional affine subspace. If we are given one point A ∈ r, then the affine subspace generated by p and A is either a straight line (if A ∈ p) or a plane (if A /∈ p). In the first case, there are an infinite number of solutions, one for each point of q. In the second case, it suffices to find the intersection B of the plane ⟨p ∪ A⟩ with q, and r = ⟨{A, B}⟩. There is no solution if the intersection is empty. If q ⊂ ⟨p ∪ A⟩, there are an infinite number of solutions. If the intersection has one element, there is exactly one solution. If a direction u ∈ Rn is given, then we consider the subspace Q generated by p and the difference space Z(p)+⟨u⟩ ⊂ Rn . Again, we obtain an infinite number of solutions if q ⊂ Q. Otherwise we consider the intersection Q with q and we finish as before. The solutions of other practical geometric problems are based mostly on the systematic use of the steps given above. 4.1.11. Remark on linear programming. In the beginning of the third chapter in paragraphs 3.1.1–3.1.7, we dealt with practical problems which are given by systems of linear inequalities. Each single inequality a1x1 + · · · + anxn ≤ b defines a halfspace in the standard affine space Rn . 
4.1.11. Remark on linear programming. In the beginning of the third chapter, in paragraphs 3.1.1–3.1.7, we dealt with practical problems given by systems of linear inequalities. Each single inequality

a1x1 + · · · + anxn ≤ b

defines a halfspace in the standard affine space Rⁿ, bounded by the hyperplane given by the corresponding equation (compare with the definition in paragraph 4.1.9(4)). In particular, the set of all admissible vectors of a linear programming problem is always the intersection of a finite number of convex sets, hence it is either convex or empty. If this intersection is both nonempty and bounded, it is a convex polyhedron.

As justified in 3.1.1 already, each linear form is either increasing, or decreasing, or constant along every parametrized straight line in the affine space. Thus if a given problem of linear programming is solvable and bounded, then it has an optimal solution at one of the vertices of the corresponding convex polyhedron. The reader should be able to visualize this claim for two-dimensional or three-dimensional problems; nevertheless, the straightforward explanation in these low dimensions holds in all finite-dimensional cases. We have thus given a "geometric proof" of the existence part of the fundamental theorem 3.1.5, translating the initial problem into the finite problem of comparing the values of the given cost function at the vertices.

4.1.12. Euclidean point spaces. So far we have not needed the notions of distance and length in our geometric considerations. But the length of vectors and the angle between vectors, as defined in the second chapter (see 2.3.18 and elsewhere), play a significant role in many practical problems.

It is convenient to avoid the fractions by introducing the substitution t = 7s + 26, so that p is described by

p : [−25, 0, 26] + s (−9, 1, 7), s ∈ R.

The point A is obtained by requiring that the vectors

A − B = (−28 − 9s, −11 + s, 22 + 7s), A − C = (−20 − 9s, 13 + s, 28 + 7s)

have the same length. Hence

√((−28 − 9s)² + (−11 + s)² + (22 + 7s)²) = √((−20 − 9s)² + (13 + s)² + (28 + 7s)²),

or equivalently

(−28 − 9s)² + (−11 + s)² + (22 + 7s)² = (−20 − 9s)² + (13 + s)² + (28 + 7s)²,

which has the unique solution s = −3. Therefore

A = [−25, 0, 26] − 3 (−9, 1, 7) = [2, −3, 5]. □

4.B.3. Michael has a stick of length 4. Can he touch the lines p and q simultaneously with this stick, given that the stick must pass through the point [2, 1, 2]?

p : [−1, 4, 1] + t(−1, 2, 0), q : [4, 4, −1] + s(1, 2, −4).

Solution. Compute the transversal of the lines passing through [2, 1, 2]: it is the segment joining [1, 0, 1] and [3, 2, 3]. Its length is √12, which is less than 4, so Michael can touch both lines as required. □

4.B.4. In the Euclidean space R⁴, determine the distance between the point A = [2, −5, 1, 4] and the subspace given by the equations

U : 4x1 − 2x2 − 3x3 − 2x4 + 12 = 0, 2x1 − x2 − 2x3 − 2x4 + 9 = 0.

Solution. First find some point of the subspace U, for example B = [0, 3, 0, 3] ∈ U. The distance between A and U equals the length of the orthogonal projection of the vector A − B onto the orthogonal complement of the difference space of U. This orthogonal complement is spanned by the normal vectors taken from the implicit equations:

V := {t (4, −2, −3, −2) + s (2, −1, −2, −2); t, s ∈ R}.

We need the orthogonal projection P_{A−B} of the vector A − B onto V. It lies in V, and thus

P_{A−B} = a (4, −2, −3, −2) + b (2, −1, −2, −2)

for certain a, b ∈ R. Clearly, (A − B) − P_{A−B} is orthogonal to V.

Euclidean spaces

The standard Euclidean point space En is the affine space An whose difference space is the standard Euclidean vector space Rⁿ with the scalar product ⟨x, y⟩ = yᵀ · x.
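Distance computations such as 4.B.1 follow this scheme literally: project a joining vector onto the orthogonal complement. A small Sage check of 4.B.1:

u = vector(QQ, [-1, 2, 3]); v = vector(QQ, [-1, -2, 1])
n = u.cross_product(v)                       # (8, -2, 4) spans the orthogonal complement
w = vector(QQ, [1, -1, 0]) - vector(QQ, [2, 5, -1])
print(abs(w.dot_product(n)) / n.norm())      # 8/sqrt(84) = 4/sqrt(21)

The printed value agrees with ρ(p, q) = 4/√21 computed above.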
The Cartesian coordinate system is an affine coordinate system (A0; u) with an orthonormal basis u. The Euclidean distance between two points A, B ∈ En is defined as the length of the vector B − A and is denoted by ρ(A, B) = ∥B − A∥. Euclidean subspaces in En are affine subspaces whose difference spaces carry the restricted scalar products.

By a Euclidean point space E of dimension n we mean an affine space whose difference space is a real n-dimensional Euclidean vector space. The notion of a Cartesian coordinate system has the obvious meaning. Since each choice of such a coordinate system identifies E with the standard space En, we deal in the sequel with the standard Euclidean spaces and their subspaces, with no loss of generality.

From the geometric point of view, the simple properties of the scalar product, like the triangle inequality, the Cauchy inequality and the Bessel inequality derived in the previous chapter (see 3.4.3), have useful consequences:

4.1.13. Theorem. For points A, B, C ∈ En the following holds:
(1) ρ(A, B) = ρ(B, A);
(2) ρ(A, B) = 0 if and only if A = B;
(3) ρ(A, B) + ρ(B, C) ≥ ρ(A, C);
(4) in each Cartesian coordinate system (A0; e), the distance between the points A = A0 + a1e1 + · · · + anen and B = A0 + b1e1 + · · · + bnen is √(Σᵢ₌₁ⁿ (ai − bi)²);
(5) given a point A and a subspace Q in En, there exists a point P ∈ Q minimizing the distance between A and the points of Q; the distance between A and P equals the length of the orthogonal projection of the vector A − B into Z(Q)⊥ for an arbitrary B ∈ Q;
(6) more generally, for subspaces Q and R in En there exist points P ∈ Q and P′ ∈ R minimizing the distances between points B ∈ Q and A ∈ R; the distance between P and P′ equals the length of the orthogonal projection of the vector A − B into (Z(Q) + Z(R))⊥ for arbitrary points B ∈ Q and A ∈ R.

Proof. The first three properties follow directly from the properties of the length of vectors in spaces with a scalar product. The fourth follows from the expression of the scalar product in an orthonormal basis.

Consider the relation for the minimal distance ρ(A, B), B ∈ Q, in (5). The vector A − B decomposes uniquely as A − B = u1 + u2 with u1 ∈ Z(Q), u2 ∈ Z(Q)⊥. The component u2 does not depend on the choice of B ∈ Q.

Thus ((A − B) − P_{A−B}) ⊥ (4, −2, −3, −2) and ((A − B) − P_{A−B}) ⊥ (2, −1, −2, −2). Substituting for A − B and P_{A−B},

((2, −8, 1, 1) − a(4, −2, −3, −2) − b(2, −1, −2, −2)) · (4, −2, −3, −2) = 0,
((2, −8, 1, 1) − a(4, −2, −3, −2) − b(2, −1, −2, −2)) · (2, −1, −2, −2) = 0.

Computing these dot products, we obtain the system

19 − 33a − 20b = 0, 8 − 20a − 13b = 0,

with the unique solution a = 3, b = −4. Hence

P_{A−B} = 3 (4, −2, −3, −2) − 4 (2, −1, −2, −2) = (4, −2, −1, 2),

where ∥P_{A−B}∥ = √(4² + (−2)² + (−1)² + 2²) = 5. Hence the distance between A and U equals ∥P_{A−B}∥ = 5. □

4.B.5. In the vector space R⁴, compute the distance v between the point [0, 0, 6, 0] and the vector subspace

U : [0, 0, 0, 0] + t1 (1, 0, 1, 1) + t2 (2, 1, 1, 0) + t3 (1, −1, 2, 3), t1, t2, t3 ∈ R.

Solution. We solve the problem by the least squares method. Write the generating vectors of U as the columns of the matrix
\[
A = \begin{pmatrix} 1 & 2 & 1\\ 0 & 1 & -1\\ 1 & 1 & 2\\ 1 & 0 & 3 \end{pmatrix}.
\]
Represent the point [0, 0, 6, 0] by the corresponding vector b = (0, 0, 6, 0)ᵀ. Now solve A · x = b.
This is the linear equation system

x1 + 2x2 + x3 = 0, x2 − x3 = 0, x1 + x2 + 2x3 = 6, x1 + 3x3 = 0,

to be solved by the least squares method. (Note that the system does not have an exact solution – otherwise the distance would be 0.)

This is because any change of B amounts to adding a vector from Z(Q), which affects only the component u1. Choose P = A + (−u2) = B + u1 ∈ Q. Then

∥A − B∥² = ∥u1∥² + ∥u2∥² ≥ ∥u2∥² = ∥A − P∥².

Hence the minimal distance is realized by the point P, and its value is ∥u2∥.

The general result (6) is obtained in a similar way. For an arbitrary choice of points A ∈ R and B ∈ Q, their difference is the sum of vectors u1 ∈ Z(R) + Z(Q) and u2 ∈ (Z(R) + Z(Q))⊥, and the component u2 does not depend on the choice of the points. By adding suitable vectors from the difference spaces of R and Q, points A′ and B′ are obtained whose distance is ∥u2∥. □

We now consider some elementary problems in affine geometry requiring the concept of distance.

4.1.14. Examples of standard problems. (1) To find the distance between a point A ∈ En and a subspace Q ⊂ En: A method of solving this problem is given in proposition 4.1.13.

(2) In E2, to construct the straight line q through a given point A forming a given angle with a given line p: Recall that we worked with angles between vectors in plane geometry already (see e.g. 2.3.22). Find a vector u ∈ R² lying in the difference space of the line p, then choose a vector v having the prescribed angle with u. The desired line is given by the point A and the difference space ⟨v⟩. The problem has either one or two solutions.

(3) To find the line through a given point, perpendicular to a given line: The procedure is introduced in the proof of the last but one item of proposition 4.1.13.

(4) In E3, to determine the distance between two lines p, q: Choose a point on each of the lines, A ∈ p, B ∈ q. The component of the vector A − B lying in the orthogonal complement (Z(p) + Z(q))⊥ has length equal to the distance between p and q.

(5) In E3, to find the axis of two skew lines p and q: By the axis we mean the crossbar realizing the minimal distance between the given skew lines via its points of intersection with them. The procedure can be derived from the proof of the last item of proposition 4.1.13. Let η be the subspace generated by one point A ∈ p and the sum Z(p) + (Z(p) + Z(q))⊥. Provided the lines p and q are not parallel, η is a plane. The intersection η ∩ q together with the difference space (Z(p) + Z(q))⊥ then gives a parametric description of the desired axis. If the lines are parallel, the problem has infinitely many solutions.

Multiply A · x = b by the matrix Aᵀ from the left-hand side. The augmented matrix of the system Aᵀ · A · x = Aᵀ · b is
\[
\left(\begin{array}{ccc|c} 3 & 3 & 6 & 6\\ 3 & 6 & 3 & 6\\ 6 & 3 & 15 & 12 \end{array}\right).
\]
By elementary row operations, transform the matrix to its normal form,
\[
\left(\begin{array}{ccc|c} 3 & 3 & 6 & 6\\ 0 & 3 & -3 & 0\\ 0 & -3 & 3 & 0 \end{array}\right) \sim
\left(\begin{array}{ccc|c} 1 & 1 & 2 & 2\\ 0 & 1 & -1 & 0\\ 0 & 0 & 0 & 0 \end{array}\right),
\]
and continue with the backward elimination,
\[
\left(\begin{array}{ccc|c} 1 & 0 & 3 & 2\\ 0 & 1 & -1 & 0\\ 0 & 0 & 0 & 0 \end{array}\right).
\]
The solution is

x = (2 − 3t, t, t)ᵀ, t ∈ R.

Note that the existence of infinitely many solutions is caused by the redundancy of the third generating vector of U:

3 (1, 0, 1, 1) − (2, 1, 1, 0) = (1, −1, 2, 3).

An arbitrary (t ∈ R) linear combination

(2 − 3t) (1, 0, 1, 1) + t (2, 1, 1, 0) + t (1, −1, 2, 3) = (2, 0, 2, 2)

corresponds to the point [2, 0, 2, 2] of the subspace U, which is the point of U nearest to [0, 0, 6, 0].
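The same least squares computation can be scripted. A minimal Sage sketch, here working with a reduced (independent) basis of U so that the normal equations are invertible:

B = (QQ^4).span([vector(QQ,[1,0,1,1]), vector(QQ,[2,1,1,0]),
                 vector(QQ,[1,-1,2,3])]).basis_matrix()   # two independent rows
b = vector(QQ, [0, 0, 6, 0])
c = (B*B.transpose()).solve_right(B*b)     # normal equations for the projection
p = B.transpose()*c                        # the point of U nearest to b
print(p, (b - p).norm())                   # (2, 0, 2, 2), 2*sqrt(6)

Sage's span() already discards the redundant third generator, which is exactly the phenomenon noted above.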
The required distance is therefore

v = ∥[2, 0, 2, 2] − [0, 0, 6, 0]∥ = √(2² + 0 + (−4)² + 2²) = 2√6. □

4.B.6. Compute the volume of the parallelepiped in R³ with base in the plane z = 0 and with edges given by the pairs of vertices [0, 0, 0], [−2, 3, 0]; [0, 0, 0], [4, 1, 0] and [0, 0, 0], [5, 7, 3].

Solution. The parallelepiped is determined by the vectors (4, 1, 0), (−2, 3, 0), (5, 7, 3). Its volume is given by the determinant
\[
\begin{vmatrix} 4 & -2 & 5\\ 1 & 3 & 7\\ 0 & 0 & 3 \end{vmatrix} = 3 \begin{vmatrix} 4 & -2\\ 1 & 3 \end{vmatrix} = 3 \cdot 14 = 42.
\]
Note that if the order of the vectors were changed, we could also get the result −42, because the determinant gives the oriented volume of the parallelepiped. Note further that the volume would not change if the third vector were (a, b, 3) for arbitrary a, b ∈ R: the volume depends only on the orthogonal distance between the planes of the upper and lower base and on the area of the base, given by the determinant
\[
\begin{vmatrix} 4 & -2\\ 1 & 3 \end{vmatrix} = 14. \quad \square
\]
4.1.15. Angles. Various geometric notions, like angles, orientation, volume etc., are defined in the point spaces En in terms of suitable notions from the Euclidean vector spaces. The angle between two vectors was defined at the end of the third part of the second chapter, see 2.3.22. From the Cauchy inequality it follows that 0 ≤ |u · v| / (∥u∥∥v∥) ≤ 1, so it makes sense to define the angle φ(u, v) between vectors u, v ∈ V in a real vector space with a scalar product by the equation

cos φ(u, v) = (u · v) / (∥u∥∥v∥), 0 ≤ φ(u, v) ≤ π.

This is completely in accordance with the situation in the two-dimensional Euclidean space R², and with the philosophy that a notion concerning two vectors is an issue of plane geometry: in the Euclidean plane we may use the functions cos and sin defined by purely geometric considerations, and the angle between two vectors in a higher-dimensional space is measured in the plane generated by these two vectors (or it is zero).

In an arbitrary real vector space with a scalar product we then have

∥u − v∥² = ∥u∥² + ∥v∥² − 2(u · v) = ∥u∥² + ∥v∥² − 2∥u∥∥v∥ cos φ(u, v),

which is the well known cosine rule of plane geometry.

Consider an orthonormal basis e of the difference space V and a vector u ∈ V. The square of the length of u is given by the usual formula ∥u∥² = Σᵢ |u · ei|². Dividing this equation by ∥u∥², we arrive at

1 = Σᵢ (cos φ(u, ei))²,

which is known as the law of the direction cosines cos φ(u, ei) of the vector u.

Now we derive the definition of angles between general subspaces of a Euclidean vector space from the definition of angles between vectors. In particular, it must be decided how to deal with the cases where the subspaces have a nontrivial intersection. For the angle between two lines we use the smaller of the two possible angles. In the case of two nonparallel planes in R³ we do not say that the angle is zero, although they intersect and have one direction in common: instead, we take in each of the two planes the line perpendicular to this common direction, and measure the angle between these two lines. The general cases are treated as follows:

4.B.7. Let the points [0, 0, 1], [2, 1, 1], [3, 3, 1], [1, 2, 1] define a parallelogram. Determine the point X on the line p : [0, 0, 1] + t (1, 1, 1) such that the parallelepiped determined by the given parallelogram and the point X has volume 1.

Solution. Form the determinant which gives the (oriented) volume of the parallelepiped as X moves along the line p:
\[
\begin{vmatrix} t & t & t\\ 2 & 1 & 0\\ 1 & 2 & 0 \end{vmatrix} = 3t.
\]
The volume equals 1 for t = 1/3, which determines the point X. □
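Both determinants are instant to verify. A Sage check of 4.B.6 and 4.B.7:

print(matrix(QQ, [[4,-2,5], [1,3,7], [0,0,3]]).det())   # 42
var('t')
print(matrix(SR, [[t,t,t], [2,1,0], [1,2,0]]).det())    # 3*t

So the volume in 4.B.6 is 42, and the condition 3t = 1 in 4.B.7 gives t = 1/3.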
4.B.8. Let ABCDEFGH be a cube (with the common notation, i.e. the vectors E − A, F − B, G − C, H − D are orthogonal to the plane defined by the vertices A, B, C, D) in the Euclidean space R³. Compute the angle φ between the vectors F − A and H − A.

Solution. This can be solved using the formula for the angle between vectors. Alternatively, notice that A, F, H are the vertices of a triangle whose sides are face diagonals of the cube, hence all of the same length. So the triangle is equilateral, and therefore φ = π/3. □

4.B.9. Let S be the midpoint of the edge AB of the cube ABCDEFGH (with the common labelling). Compute the cosine of the angle between the lines ES and BG.

Solution. A dilation (homothety) preserves angles, so without loss of generality the cube has edge of length 1. Place the coordinate system so that A is at the origin, B = [1, 0, 0] and E = [0, 0, 1]. Then S = [1/2, 0, 0], G = [1, 1, 1], and the directions are ES = S − E = (1/2, 0, −1) and BG = G − B = (0, 1, 1). The desired cosine of the angle φ (taking the absolute value, since we deal with lines) is

cos φ = |(1/2, 0, −1) · (0, 1, 1)| / (∥(1/2, 0, −1)∥ ∥(0, 1, 1)∥) = √2/√5. □

4.B.10. Compute the angle between the line p given by the implicit equations

x + 3y + z = 0, −x − y + z = 0

and the plane ϱ : x + y + 2z + 1 = 0.

Solution. A normal vector of the plane ϱ is (1, 1, 2). Keeping the first equation of p and adding the two equations gives the equivalent system

x + 3y + z = 0, 2y + 2z = 0.

Hence y = −z and x = 2z, so (2, −1, 1) is a direction vector of p. In other words, p passes through the origin, and p : [0, 0, 0] + t (2, −1, 1), t ∈ R. For the angle φ between the vectors (1, 1, 2) and (2, −1, 1),

cos φ = (2 − 1 + 2) / (√6 · √6) = 1/2.

Angles between subspaces

4.1.16. Definition. Consider finite-dimensional subspaces U1, U2 of a Euclidean vector space V of arbitrary dimension. The angle between the vector subspaces U1, U2 is the real number α = φ(U1, U2) ∈ [0, π/2] satisfying:

(1) If dim U1 = dim U2 = 1, U1 = ⟨u⟩, U2 = ⟨v⟩, then

cos α = |u · v| / (∥u∥∥v∥).

(2) If the dimensions of U1, U2 are positive and U1 ∩ U2 = {0}, then the angle is the minimum of all angles between one-dimensional subspaces,

α = min{φ(⟨u⟩, ⟨v⟩); 0 ≠ u ∈ U1, 0 ≠ v ∈ U2}.

Such a minimum always exists.

(3) If U1 ⊂ U2 or U2 ⊂ U1 (in particular if one of them is the zero subspace), then α = 0.

(4) If U1 ∩ U2 ≠ {0} and U1 ≠ U1 ∩ U2 ≠ U2, then

α = φ(U1 ∩ (U1 ∩ U2)⊥, U2 ∩ (U1 ∩ U2)⊥).

The angle between affine subspaces Q1, Q2 in a Euclidean point space En is defined as the angle between their difference spaces Z(Q1), Z(Q2).

Notice that the angle is always well defined: in the last case

(U1 ∩ (U1 ∩ U2)⊥) ∩ (U2 ∩ (U1 ∩ U2)⊥) = {0},

so we can determine the angle according to item (2). Notice also that in the case U1 ∩ U2 = {0}, the subspaces U1 and U2 are perpendicular in the sense of the earlier definitions if and only if the angle between them is π/2; if the intersection is nontrivial, however, they can never be perpendicular in the former sense.

In order to justify the definition, it remains to show that vectors u ∈ U1, v ∈ U2 minimizing the expression for the angle in (2) always exist. First a special case:

4.1.17. Lemma. Let v be a vector in a Euclidean vector space V and U ⊂ V an arbitrary subspace. Denote by v1 ∈ U, v2 ∈ U⊥ the (uniquely determined) components of the vector v, i.e. v = v1 + v2. Then the angle φ between the subspace generated by v and the subspace U satisfies

cos φ(⟨v⟩, U) = cos φ(⟨v⟩, ⟨v1⟩) = ∥v1∥/∥v∥.

Proof.
By the Cauchy inequality,

|u · v| / (∥u∥∥v∥) = |u · (v1 + v2)| / (∥u∥∥v∥) = |u · v1| / (∥u∥∥v∥) ≤ (∥u∥∥v1∥) / (∥u∥∥v∥) = ∥v1∥/∥v∥ = ∥v1∥² / (∥v∥∥v1∥) = |v1 · v| / (∥v∥∥v1∥)

for all vectors u ∈ U. This implies that

cos φ(⟨v⟩, ⟨u⟩) ≤ cos φ(⟨v⟩, ⟨v1⟩) = ∥v1∥/∥v∥.

Thus the vector v1 yields the largest possible value of the cosine over all choices of vectors in U. Since the function cos is decreasing on the interval [0, π/2], the smallest possible angle is obtained in this way, and the claim is proved. □

Hence φ = 60°. However, this is the angle between the direction vector of p and the normal vector of ϱ; the desired angle is its complement, 30° = 90° − 60°. □

4.B.11. In the real plane, find a line passing through the point [−3, 0] such that the angle between this line and the line

p : √3 x + 3y + 5 = 0

is 60°.

Solution. The given line has slope −1/√3, i.e. it makes the angle −30° with the positive x axis. Thus the required line makes either the angle −90° or the angle 30° with the positive x axis. The former gives the vertical line x = −3, the latter the line with slope 1/√3 through [−3, 0], with equation y√3 = x + 3. □

Solution. (Alternative) Notice that there are two such lines. The general equation of a line in the plane has the form ax + by + c = 0, and we may choose the parameters so that a² + b² = 1. We look for numbers a, b, c ∈ R satisfying all the conditions. Since the line passes through [−3, 0], we get c = 3a. The condition that the angle between the lines equals 60° gives

1/2 = cos 60° = |√3 a + 3b| / √12, i.e. √3 = |√3 a + 3b|.

Hence ±1 = a + √3 b, and squaring yields 1 = a² + 3b² + 2√3 ab. Using a² + b² = 1, we get 0 = 2b² + 2√3 ab, i.e. 0 = b (b + √3 a). Altogether (remembering that c = 3a and a² + b² = 1),

a = ±1, b = 0, c = ±3; or a = ±1/2, b = ∓√3/2, c = ±3/2.

We can easily check that the lines determined by these coefficients,

x + 3 = 0, (1/2)x − (√3/2)y + 3/2 = 0,

satisfy all the conditions. □

4.B.12. Determine the equations of all planes containing the line p : [1, 0, 0] + t (1, 1, 0) such that the angle between each such plane and the plane x + y + z − 1 = 0 is 60°. ⃝

4.B.13. Determine the angle between the planes

σ : [1, 0, 2] + (1, −1, 1)t + (0, 1, −2)s,
ρ : [3, 3, 3] + (1, −2, 0)t + (0, 1, 1)s.

Solution. The intersection line of the planes has direction vector (1, −1, 1). The plane orthogonal to this vector intersects the given planes in one-dimensional subspaces generated by the vectors (1, 0, −1) and (0, 1, 1), respectively. The angle between these one-dimensional subspaces is 60°. □
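Anticipating the procedure of paragraph 4.1.18 below, the angle in 4.B.13 can also be computed from the matrix A of the mutual projection between orthonormal bases of the two direction planes: its singular values are the cosines of the principal angles, and their squares are the eigenvalues of AᵀA. A numeric Sage sketch (QR over RDF is used to orthonormalize the columns):

U1 = matrix(RDF, [[1,-1,1], [0,1,-2]]).transpose()
U2 = matrix(RDF, [[1,-2,0], [0,1,1]]).transpose()
Q1 = U1.QR()[0].matrix_from_columns([0,1])   # orthonormal basis of the first plane
Q2 = U2.QR()[0].matrix_from_columns([0,1])   # orthonormal basis of the second plane
A = Q1.transpose()*Q2
print(A.SVD()[1].diagonal())                 # approximately [1.0, 0.5]

The singular value 1 reflects the common direction (1, −1, 1) of the two planes (cf. 4.1.16(4)); the remaining value 1/2 gives cos α = 1/2, i.e. α = 60°, as above.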
4.1.18. Calculating angles. The procedure in the previous lemma can be understood as follows: choose the orthogonal projection of the one-dimensional subspace generated by v onto the subspace U, and consider the ratio of the lengths of v and its image. A similar procedure works in higher dimensions; the problem is to recognize the directions whose projections realize the desired (minimal) angle. The previous example suggests the way: project the larger space U onto the one-dimensional ⟨v⟩ first, and then orthogonally back to U. The desired angle corresponds to the direction of an eigenvector of this composed map, and the corresponding eigenvalue is the square of the cosine of the angle.

Let U1, U2 be two arbitrary subspaces of a Euclidean vector space V with U1 ∩ U2 = {0}. Choose orthonormal bases e and e′ of the whole space V such that U1 = ⟨e1, . . . , ek⟩ and U2 = ⟨e′1, . . . , e′l⟩. Consider the orthogonal projection φ of the space V onto U2; its restriction to U1 is denoted by φ : U1 → U2 as before. Similarly, let ψ : U2 → U1 be the restriction of the orthogonal projection onto U1. In the bases (e1, . . . , ek) and (e′1, . . . , e′l), these maps have the matrices
\[
A = \begin{pmatrix} e_1 \cdot e'_1 & \dots & e_k \cdot e'_1\\ \vdots & & \vdots\\ e_1 \cdot e'_l & \dots & e_k \cdot e'_l \end{pmatrix}, \qquad
B = \begin{pmatrix} e'_1 \cdot e_1 & \dots & e'_l \cdot e_1\\ \vdots & & \vdots\\ e'_1 \cdot e_k & \dots & e'_l \cdot e_k \end{pmatrix},
\]
respectively. Since ei · e′j = e′j · ei for all indices i, j, we have B = Aᵀ. The composition ψ ∘ φ : U1 → U1 therefore has the symmetric positive semidefinite matrix AᵀA, and ψ is adjoint to φ. Each such map has only nonnegative real eigenvalues, and in a suitable orthonormal basis it has a diagonal matrix with these eigenvalues on the diagonal, see 3.4.7 and 3.4.9.

Now we can derive a general procedure for computing the angle α = φ(U1, U2).

Theorem. In the previous notation, let λ be the largest eigenvalue of the matrix AᵀA. Then (cos α)² = λ.

Proof. Let u ∈ U1 be an eigenvector of the map ψ ∘ φ corresponding to the eigenvalue λ. Consider all the eigenvalues λ1, . . . , λk (including multiplicities), and let u = (u1, . . . , uk) be a corresponding orthonormal basis of U1 consisting of eigenvectors. Assume λ = λ1 and choose the eigenvector u = u1, ∥u∥ = 1.

4.B.14. A cube ABCDA′B′C′D′ is given with the standard notation, that is, ABCD and A′B′C′D′ are faces while AA′, BB′ are edges. Compute the angle φ between AB′ and AD′.

Solution. It can be assumed that the cube has edge 1 and is placed in R³ so that the vertices A, B, C, D have coordinates [0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0] and the vertices A′, B′, C′, D′ have coordinates [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1], respectively. Then

AB′ = B′ − A = (1, 0, 1), AD′ = D′ − A = (0, 1, 1),

so

cos φ = ((1, 0, 1) · (0, 1, 1)) / (∥(1, 0, 1)∥ ∥(0, 1, 1)∥) = 1/2,

hence φ = 60°. □

For further exercises on angles, see .

4.B.15. Prove that for every n ∈ N and all positive x1, x2, . . . , xn ∈ R,

n² ≤ (1/x1 + 1/x2 + · · · + 1/xn) · (x1 + x2 + · · · + xn).

For what arguments does equality hold?

Solution. It is sufficient to consider the Cauchy inequality |u · v| ≤ ∥u∥ ∥v∥ in the Euclidean space Rⁿ for the vectors

u = (1/√x1, . . . , 1/√xn), v = (√x1, . . . , √xn).

We get

(1) n ≤ √(1/x1 + 1/x2 + · · · + 1/xn) · √(x1 + x2 + · · · + xn),

and squaring (1) gives the desired inequality. The Cauchy inequality becomes an equality exactly when the vector u is a multiple of v, that is, when x1 = x2 = · · · = xn. □

4.B.16. Vectors u = (u1, u2, u3) and v = (v1, v2, v3) are given. Find a third unit vector such that the parallelepiped determined by these three vectors has the greatest possible volume.

Solution. Denote the desired vector by t = (t1, t2, t3). By Proposition ?? the volume of the parallelepiped P3(0; u, v, t) is the absolute value of the determinant
\[
\begin{vmatrix} u_1 & v_1 & t_1\\ u_2 & v_2 & t_2\\ u_3 & v_3 & t_3 \end{vmatrix} =
\begin{vmatrix} t_1 & t_2 & t_3\\ u_1 & u_2 & u_3\\ v_1 & v_2 & v_3 \end{vmatrix} = t \cdot (u \times v) \leq \|t\|\,\|u \times v\| = \|u \times v\|.
\]
The inequality follows from the Cauchy inequality, and it becomes an equality if and only if t = c (u × v), c ∈ R. The volume therefore can be at most the area of the parallelogram determined by the vectors u, v (i.e. the length of the vector u × v), and equality holds if and only if

t = ± (u × v) / ∥u × v∥. □

We need to show that the angle between an arbitrary v ∈ U1 and U2 is at least as large as the angle between u and U2; equivalently, that the cosine of the corresponding angle cannot be greater.
We need to show that the angle between an arbitrary $v \in U_1$ and $U_2$ is at least as large as the angle between $u$ and $U_2$; equivalently, that the cosine of the corresponding angle cannot be greater. By the previous lemma, it is sufficient to discuss the angle between $u$ and $\varphi(u) \in U_2$. Choose $v \in U_1$, $v = a_1u_1 + \dots + a_ku_k$, $\sum_{i=1}^k a_i^2 = \|v\|^2 = 1$. Then
\[
\|\varphi(v)\|^2=\varphi(v)\cdot\varphi(v)=(\psi\circ\varphi(v))\cdot v\le \|\psi\circ\varphi(v)\|\,\|v\|=\|\psi\circ\varphi(v)\| .
\]
Moreover, the previous lemma gives a formula for computing the angle $\alpha$ between the vector $v$ and the subspace $U_2$:
\[
\cos\alpha=\frac{\|\varphi(v)\|}{\|v\|}=\|\varphi(v)\| .
\]
Since $\lambda_1$ is the largest eigenvalue and the sum of the squares $a_i^2$ equals one,
\[
(\cos\alpha)^2=\|\varphi(v)\|^2\le\|\psi\circ\varphi(v)\|=\Big(\sum_{i=1}^k(\lambda_ia_i)^2\Big)^{1/2}=\Big(\lambda_1^2+\sum_{i=1}^k a_i^2(\lambda_i^2-\lambda_1^2)\Big)^{1/2}\le\sqrt{\lambda_1^2}=\lambda_1 .
\]
If $v = u$, then $\|\varphi(u)\|^2 = (\psi\circ\varphi(u))\cdot u = \lambda_1\|u\|^2 = \lambda$, and thus the angle attains its minimal value at this vector. □

4.1.19. Calculating volume. An indication of how to calculate volumes in plane geometry is given at the end of the fifth part of the first chapter (see 1.5.11). There, the notion of orientation played a fundamental role. We can imagine an orientation as the decision whether to look at the plane $\mathbb{R}^2$ from above or from below; the distinction lies in the order in which the standard basis vectors $e_1$ and $e_2$ are selected on the unit circle. We proceed in the same way in general:

Orientation of a vector space
Two bases $u$ and $v$ of a real vector space $V$ are said to determine the same orientation if the transformation matrix between them has a positive determinant. By an orientation of the vector space $V$ we mean an equivalence class of bases with respect to this equivalence, given by the sign of the determinant. Bases equivalent in this sense are called compatible with the chosen orientation. It follows that there exist exactly two orientations on every vector space, and every compatible basis is taken to a non-compatible one by a transformation matrix with a negative determinant. A vector space with a chosen orientation is called an oriented vector space. An oriented Euclidean (point) space is a Euclidean point space whose difference space is oriented.

In the sequel, we consider the standard Euclidean space $E_n$ together with the orientation given by the standard basis of $\mathbb{R}^n$.

4.B.17. Find the foot of the line which passes through the point $[0, 0, 7]$ and is perpendicular to the plane $\varrho : [0, 5, 3] + (1, 2, 1)t + (−2, 1, 1)s$.

4.B.18. In the Euclidean space $\mathbb{R}^5$, determine the distance between the planes
ϱ1 : [7, 2, 7, −1, 1] + t1(1, 0, −1, 0, 0) + s1(0, 1, 0, 0, −1),
ϱ2 : [2, 4, 7, −4, 2] + t2(1, 1, 1, 0, 1) + s2(0, −2, 0, 0, 3),
where $t_1, s_1, t_2, s_2 \in \mathbb{R}$.

Solution. First compute the orthogonal complement of the sum of the directions of the two planes. Form the matrix whose rows are the direction vectors of the planes and transform it into row echelon form:
\[
\begin{pmatrix} 1&0&-1&0&0\\ 0&1&0&0&-1\\ 1&1&1&0&1\\ 0&-2&0&0&3 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 1&0&-1&0&0\\ 0&1&0&0&-1\\ 0&0&1&0&1\\ 0&0&0&0&1 \end{pmatrix}.
\]
This shows that the orthogonal complement is one-dimensional, generated by the vector $(0, 0, 0, 1, 0)$. The distance between the planes is the length of the orthogonal projection of the vector $A_1 − A_2$ onto the subspace $\langle(0, 0, 0, 1, 0)\rangle$, for arbitrary points $A_1 \in \varrho_1$, $A_2 \in \varrho_2$. Choose e.g. $A_1 = [7, 2, 7, −1, 1]$, $A_2 = [2, 4, 7, −4, 2]$. The orthogonal projection of $A_1 − A_2 = (5, −2, 0, 3, −1)$ onto $\langle(0, 0, 0, 1, 0)\rangle$ is $(0, 0, 0, 3, 0)$, and its length gives the desired distance 3. □
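The computation in 4.B.18 can be verified numerically. The following NumPy sketch finds the orthogonal complement as the null space of the matrix of direction vectors (here via the singular value decomposition, a standard numerical substitute for the row reduction used above) and projects A1 − A2 onto it.

    import numpy as np

    # direction vectors of the planes rho_1 and rho_2 (as rows)
    D = np.array([[1, 0, -1, 0, 0],
                  [0, 1, 0, 0, -1],
                  [1, 1, 1, 0, 1],
                  [0, -2, 0, 0, 3]], dtype=float)

    # orthonormal basis of the orthogonal complement = null space of D
    _, s, Vt = np.linalg.svd(D)
    ns = Vt[np.linalg.matrix_rank(D):]        # rows spanning the complement

    diff = np.array([7, 2, 7, -1, 1]) - np.array([2, 4, 7, -4, 2])  # A1 - A2

    # orthogonal projection of A1 - A2 onto the complement, and its length
    proj = ns.T @ (ns @ diff)                 # = (0, 0, 0, 3, 0)
    print(proj.round(6), np.linalg.norm(proj))   # distance 3.0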
4.B.19. In the Euclidean space $\mathbb{R}^5$, determine the distance of the planes
σ1 : [0, 1, 2, 0, 0] + p1(2, 1, 0, 0, 1) + q1(−2, 0, 1, 1, 0),
σ2 : [3, −1, 7, 7, 3] + p2(2, 2, 4, 0, 3) + q2(2, 0, 0, −2, −1),
where $p_1, q_1, p_2, q_2 \in \mathbb{R}$.

Solution. The sum of the directions of σ1, σ2 is generated by the direction vectors
u1 = (2, 1, 0, 0, 1), u2 = (−2, 0, 1, 1, 0), v1 = (2, 2, 4, 0, 3), v2 = (2, 0, 0, −2, −1).
We look for points $X_1 \in \sigma_1$, $X_2 \in \sigma_2$ realizing the distance between σ1 and σ2. Then
X1 − X2 = [0, 1, 2, 0, 0] − [3, −1, 7, 7, 3] + p1u1 + q1u2 − p2v1 − q2v2,
and the vector $X_1 − X_2$ must be orthogonal to all four direction vectors:
⟨X1 − X2, u1⟩ = 0, ⟨X1 − X2, u2⟩ = 0, ⟨X1 − X2, v1⟩ = 0, ⟨X1 − X2, v2⟩ = 0.

Let $u_1, \dots, u_k$ be arbitrary vectors in the difference space $\mathbb{R}^n$ and $A \in E_n$ a point. As an example of a convex set, the parallelepiped $P_k(A; u_1, \dots, u_k) \subset E_n$ is given by
\[
P_k(A; u_1,\dots,u_k)=\{A+c_1u_1+\dots+c_ku_k;\ 0\le c_i\le 1\} .
\]
If $u_1, \dots, u_k$ are linearly dependent, the parallelepiped is degenerate and we set the volume $\mathrm{Vol}\,P_k = 0$. If the vectors $u_1, \dots, u_k$ are linearly independent, we have a $k$-dimensional parallelepiped $P_k(A; u_1, \dots, u_k) \subset E_n$. For given vectors $u_1, \dots, u_k$, the parallelepipeds of lower dimensions $P_1(A; u_1), \dots, P_k(A; u_1, \dots, u_k)$ in the Euclidean subspaces $A + \langle u_1\rangle, \dots, A + \langle u_1, \dots, u_k\rangle$ are also at our disposal. We proceed as in the Gram–Schmidt orthogonalization and consider the decomposition
\[
\langle u_1,\dots,u_k\rangle=\langle u_1,\dots,u_{k-1}\rangle\oplus\big(\langle u_1,\dots,u_{k-1}\rangle^\perp\cap\langle u_1,\dots,u_k\rangle\big).
\]
In this direct sum, $u_k$ is uniquely expressed as $u_k = u'_k + e_k$, where $e_k \perp \langle u_1, \dots, u_{k-1}\rangle$. The absolute value of the volume of a parallelepiped is defined inductively as the product of the volume of the “base” and the “altitude”:
\[
|\mathrm{Vol}|\,P_1(A;u_1)=\|u_1\|,\qquad |\mathrm{Vol}|\,P_k(A;u_1,\dots,u_k)=\|e_k\|\cdot|\mathrm{Vol}|\,P_{k-1}(A;u_1,\dots,u_{k-1}).
\]
If $u_1, \dots, u_n$ is a basis compatible with the orientation of the entire vector space $V$, the (oriented) volume of the parallelepiped is defined by $\mathrm{Vol}\,P_n(A; u_1, \dots, u_n) = |\mathrm{Vol}|\,P_n(A; u_1, \dots, u_n)$; in the case of a non-compatible basis we set $\mathrm{Vol}\,P_n(A; u_1, \dots, u_n) = −|\mathrm{Vol}|\,P_n(A; u_1, \dots, u_n)$.

Theorem. Let $Q \subset E_n$ be a Euclidean subspace, and let $e = (e_1, \dots, e_k)$ be an orthonormal basis of $Z(Q)$. For arbitrary vectors $u_1, \dots, u_k \in Z(Q)$ and $A \in Q$ the following holds:
\[
(1)\quad \mathrm{Vol}\,P_k(A; u_1,\dots,u_k)=\begin{vmatrix} u_1\cdot e_1 & \dots & u_k\cdot e_1\\ \vdots & & \vdots\\ u_1\cdot e_k & \dots & u_k\cdot e_k \end{vmatrix},
\qquad
(2)\quad (\mathrm{Vol}\,P_k(A; u_1,\dots,u_k))^2=\begin{vmatrix} u_1\cdot u_1 & \dots & u_k\cdot u_1\\ \vdots & & \vdots\\ u_1\cdot u_k & \dots & u_k\cdot u_k \end{vmatrix}.
\]

Proof. The matrix
\[
A=\begin{pmatrix} u_1\cdot e_1 & \dots & u_k\cdot e_1\\ \vdots & & \vdots\\ u_1\cdot e_k & \dots & u_k\cdot e_k \end{pmatrix}
\]

Hence
⟨(−3, 2, −5, −7, −3), u1⟩ + p1⟨u1, u1⟩ + q1⟨u2, u1⟩ − p2⟨v1, u1⟩ − q2⟨v2, u1⟩ = 0,
⟨(−3, 2, −5, −7, −3), u2⟩ + p1⟨u1, u2⟩ + q1⟨u2, u2⟩ − p2⟨v1, u2⟩ − q2⟨v2, u2⟩ = 0,
⟨(−3, 2, −5, −7, −3), v1⟩ + p1⟨u1, v1⟩ + q1⟨u2, v1⟩ − p2⟨v1, v1⟩ − q2⟨v2, v1⟩ = 0,
⟨(−3, 2, −5, −7, −3), v2⟩ + p1⟨u1, v2⟩ + q1⟨u2, v2⟩ − p2⟨v1, v2⟩ − q2⟨v2, v2⟩ = 0.
By computing the dot products, we obtain the system of linear equations
6p1 − 4q1 − 9p2 − 3q2 = 7,
−4p1 + 6q1 + 6q2 = 6,
9p1 − 33p2 − q2 = 31,
3p1 − 6q1 − p2 − 9q2 = −11.
We solve it by forming the augmented matrix and performing elementary row operations:
\[
\begin{pmatrix} 6&-4&-9&-3&7\\ -4&6&0&6&6\\ 9&0&-33&-1&31\\ 3&-6&-1&-9&-11 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 1&0&0&0&0\\ 0&1&0&0&-1\\ 0&0&1&0&-1\\ 0&0&0&1&2 \end{pmatrix}.
\]
The solution is $(p_1, q_1, p_2, q_2) = (0, −1, −1, 2)$. Consequently,
X1 − X2 = (−3, 2, −5, −7, −3) − u2 + v1 − 2v2 = (−3, 4, −2, −4, 2).
The length of the vector $(−3, 4, −2, −4, 2)$ equals the distance between the planes σ1, σ2, namely
\[
\sqrt{(-3)^2+4^2+(-2)^2+(-4)^2+2^2}=7 .
\]
We solved this problem by a method different from that of the previous problem; both methods can be used in both cases. Let us try the former method for σ1, σ2: find the orthogonal complement of the vector subspace generated by (2, 1, 0, 0, 1), (−2, 0, 1, 1, 0), (2, 2, 4, 0, 3), (2, 0, 0, −2, −1). We get
\[
\begin{pmatrix} 2&1&0&0&1\\ -2&0&1&1&0\\ 2&2&4&0&3\\ 2&0&0&-2&-1 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 1&0&0&0&3/2\\ 0&1&0&0&-2\\ 0&0&1&0&1\\ 0&0&0&1&2 \end{pmatrix},
\]
so the orthogonal complement is $\langle(−3/2, 2, −1, −2, 1)\rangle$, or rather $\langle(3, −4, 2, 4, −2)\rangle$.

has the coordinates of the vectors $u_1, \dots, u_k$ in the chosen basis as its columns, and
\[
|A|^2=|A|\,|A|=|A^T|\,|A|=|A^TA|=\begin{vmatrix} u_1\cdot u_1 & \dots & u_k\cdot u_1\\ \vdots & & \vdots\\ u_1\cdot u_k & \dots & u_k\cdot u_k \end{vmatrix}.
\]
Hence if (1) holds, then (2) holds as well. Directly from the definition, the unoriented volume equals the product
\[
|\mathrm{Vol}|\,P_k(A;u_1,\dots,u_k)=\|v_1\|\,\|v_2\|\cdots\|v_k\|,
\]
where $v_1 = u_1$, $v_2 = u_2 + a_{21}v_1$, \dots, $v_k = u_k + a_{k1}v_1 + \dots + a_{k,k-1}v_{k-1}$ is the result of the Gram–Schmidt orthogonalization. Thus
\[
(\mathrm{Vol}\,P_k(A;u_1,\dots,u_k))^2=\begin{vmatrix} v_1\cdot v_1 & \dots & 0\\ \vdots & \ddots & \vdots\\ 0 & \dots & v_k\cdot v_k \end{vmatrix}
=\begin{vmatrix} v_1\cdot v_1 & \dots & v_k\cdot v_1\\ \vdots & & \vdots\\ v_1\cdot v_k & \dots & v_k\cdot v_k \end{vmatrix}.
\]
Denote by $B$ the matrix whose columns are formed by the coordinates of the vectors $v_1, \dots, v_k$ in the orthonormal basis $e$. Since $v_1, \dots, v_k$ arise from $u_1, \dots, u_k$ by a linear transformation with an upper-triangular matrix $C$ with ones on the diagonal, we have $B = AC$ and $|B| = |A|\,|C| = |A|$. But then $(\mathrm{Vol}\,P_k)^2 = |B|^2 = |A|^2$, and thus $\mathrm{Vol}\,P_k(A; u_1, \dots, u_k) = \pm|A|$. The resulting volume is zero if and only if the vectors $u_1, \dots, u_k$ are linearly dependent. Provided they are independent, the sign of the determinant is positive if and only if the basis $u_1, \dots, u_k$ defines the same orientation as the basis $e$. □

Consider a parallelepiped in a $k$-dimensional space, spanned by $k$ vectors, and write their coordinates (in an orthonormal basis) into the columns of a matrix. Then the volume of the parallelepiped is the determinant of this matrix. The determinant in the formula (2) above is called the Gram determinant. It is independent of the choice of the basis and is therefore useful in particular when $k$ is less than the dimension of the whole space. We formulate the following important geometric consequence:

4.1.20. Corollary. For each linear map $\varphi : V \to V$ on a Euclidean vector space $V$, $\det\varphi$ equals the (oriented) volume of the image of the parallelepiped determined by the vectors of an orthonormal basis. More generally, the image of a parallelepiped $P$, determined by arbitrary $\dim V$ vectors, has volume equal to the $\det\varphi$-multiple of the original volume.

Note that the distance between σ1 and σ2 equals the size of the orthogonal projection onto this orthogonal complement of the vector
u = (3, −2, 5, 7, 3) = [3, −1, 7, 7, 3] − [0, 1, 2, 0, 0]
(the difference of arbitrary points of the two planes). Denote the orthogonal projection of $u$ by $p_u$, and choose $v = (3, −4, 2, 4, −2)$. Obviously $p_u = a\cdot v$ for some $a \in \mathbb{R}$, and $\langle u − p_u, v\rangle = 0$, i.e. $\langle u, v\rangle − a\langle v, v\rangle = 0$. Computing the products gives $49 − 49a = 0$. Therefore $p_u = 1\cdot v = v$, and the distance between the planes σ1 and σ2 equals
\[
\|p_u\|=\sqrt{3^2+(-4)^2+2^2+4^2+(-2)^2}=7 .
\]
The method computing the distance via the orthogonal complement of the sum of the directions proves to be the “faster way to the solution”. The same applies to the planes ϱ1 and ϱ2.
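The Gram-determinant formula (2) from the theorem in 4.1.19 lends itself to a quick numerical check. In the sketch below, the two spanning vectors in R³ are an arbitrary illustrative choice; for k = 2, n = 3 the result can be compared with the length of the cross product.

    import numpy as np

    # two vectors spanning a 2-dimensional parallelepiped in R^3
    M = np.array([[1.0, 0.0],
                  [2.0, 1.0],
                  [0.0, 1.0]])          # u_1, u_2 as columns

    # formula (2): squared volume = Gram determinant det(M^T M)
    vol_gram = np.sqrt(np.linalg.det(M.T @ M))

    # in R^3 the area of the parallelogram also equals |u_1 x u_2|
    vol_cross = np.linalg.norm(np.cross(M[:, 0], M[:, 1]))
    print(np.isclose(vol_gram, vol_cross))   # True (both are sqrt(6))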
The second method, however, also reveals the points at which the distance is attained (a pair of points where the planes are closest to each other). Let us find such points for the planes ϱ1, ϱ2. Denote
u1 = (1, 0, −1, 0, 0), u2 = (0, 1, 0, 0, −1), v1 = (1, 1, 1, 0, 1), v2 = (0, −2, 0, 0, 3).
The points $X_1 \in \varrho_1$, $X_2 \in \varrho_2$ which are closest (as commented above) are
X1 = [7, 2, 7, −1, 1] + t1u1 + s1u2, X2 = [2, 4, 7, −4, 2] + t2v1 + s2v2,
so that
X1 − X2 = [7, 2, 7, −1, 1] − [2, 4, 7, −4, 2] + t1u1 + s1u2 − t2v1 − s2v2 = (5, −2, 0, 3, −1) + t1u1 + s1u2 − t2v1 − s2v2.
The conditions ⟨X1 − X2, u1⟩ = 0, ⟨X1 − X2, u2⟩ = 0, ⟨X1 − X2, v1⟩ = 0, ⟨X1 − X2, v2⟩ = 0 then lead to the system of linear equations
2t1 = −5, 2s1 + 5s2 = 1, −4t2 − s2 = −2, −5s1 − t2 − 13s2 = −1
with the unique solution t1 = −5/2, s1 = 41/2, t2 = 5/2, s2 = −8. We obtain
\[
X_1=[7,2,7,-1,1]-\tfrac52u_1+\tfrac{41}2u_2=\big[\tfrac92,\tfrac{45}2,\tfrac{19}2,-1,-\tfrac{39}2\big],\qquad
X_2=[2,4,7,-4,2]+\tfrac52v_1-8v_2=\big[\tfrac92,\tfrac{45}2,\tfrac{19}2,-4,-\tfrac{39}2\big].
\]
The distance between the points X1, X2 equals the distance between the planes ϱ1, ϱ2; both are given by
‖X1 − X2‖ = ‖(0, 0, 0, 3, 0)‖ = 3. □

4.1.21. Outer product and cross product. The previous considerations are closely related to the tensor product of vectors. We do not go further into this technically more complicated topic now, but we do mention the outer product of $n = \dim V$ vectors $u_1, \dots, u_n \in V$. Let $(u_{1j}, \dots, u_{nj})^T$ be the coordinate expressions of the vectors $u_j$ in a chosen orthonormal basis of $V$, and let $M$ be the matrix with the entries $u_{ij}$. Then the determinant $|M|$ does not depend on the choice of the basis within the same orientation. Its value is called the outer product of the vectors $u_1, \dots, u_n$ and is denoted by $[u_1, \dots, u_n]$. Hence the outer product is the oriented volume of the corresponding parallelepiped, see 4.1.19. Although the outer product looks like a scalar quantity, the story gets more complicated once we allow for general bases of $V$: the determinant of the matrix $M$ built of the coordinates of the $u_i$ then changes by the determinants of the transition matrices. Such objects are called densities, and we shall come back to them in chapter 9.

Several useful properties of the outer product follow directly from the definition:
(1) The map $(u_1, \dots, u_n) \mapsto [u_1, \dots, u_n]$ is an antisymmetric $n$-linear map: it is linear in all arguments, and the interchange of any two arguments causes a change of sign.
(2) The outer product is zero if and only if the vectors $u_1, \dots, u_n$ are linearly dependent.
(3) The vectors $u_1, \dots, u_n$ form a positive basis if and only if their outer product is positive.

Consider a Euclidean vector space $V$ of dimension $n \ge 2$ and vectors $u_1, \dots, u_{n-1} \in V$. If these $n − 1$ vectors are substituted for the first $n − 1$ arguments of the $n$-linear map defined by the volume determinant as above, then one argument is left over, and we obtain a linear form on $V$. Since the scalar product is available, each linear form corresponds to exactly one vector. This vector $v \in V$ is called the cross product of the vectors $u_1, \dots, u_{n-1}$; it is determined by the property that for each vector $w \in V$,
\[
\langle w, v\rangle=[u_1,\dots,u_{n-1},w] .
\]
We denote the cross product by $v = u_1 \times \dots \times u_{n-1}$. If, in an orthonormal basis, the coordinates of the vectors are $v = (y_1, \dots, y_n)^T$, $w = (x_1, \dots, x_n)^T$ and $u_j = (u_{1j}, \dots, u_{nj})^T$, then the definition can be expressed as
\[
y_1x_1+\dots+y_nx_n=\begin{vmatrix} u_{11} & \dots & u_{1(n-1)} & x_1\\ \vdots & & \vdots & \vdots\\ u_{n1} & \dots & u_{n(n-1)} & x_n \end{vmatrix}.
\]
Hence the vector $v$ is determined uniquely.
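The defining determinant also gives a direct recipe for computing the cross product of n − 1 vectors in Rⁿ: expand along the last column. A small NumPy sketch of this expansion follows; the function name and interface are an illustrative choice, not from the text.

    import numpy as np

    def cross_general(*vectors):
        """Cross product of n-1 vectors in R^n, computed by the cofactor
        expansion of the defining determinant along its last column."""
        U = np.column_stack(vectors)            # n x (n-1) matrix
        n = U.shape[0]
        assert U.shape == (n, n - 1)
        # y_i = (-1)^(i+1+n) * det(U with row i deleted), i zero-based
        return np.array([(-1) ** (i + n + 1)
                         * np.linalg.det(np.delete(U, i, axis=0))
                         for i in range(n)])

    # in R^3 this reduces to the usual cross product
    print(cross_general(np.array([1., 0., 0.]),
                        np.array([0., 1., 0.])))   # [0. 0. 1.]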
Its coordinates are calculated by the formal expansion of this determinant along the last column. The following properties of the cross product are direct consequences of the definition:

Theorem. For the cross product $v = u_1 \times \dots \times u_{n-1}$:
(1) $v \in \langle u_1, \dots, u_{n-1}\rangle^\perp$;
(2) $v$ is nonzero if and only if the vectors $u_1, \dots, u_{n-1}$ are linearly independent;

4.B.20. Find the intersection of the plane passing through the point A = [1, 2, 3, 4] ∈ R⁴ and orthogonal to the plane ϱ : [1, 0, 1, 0] + (1, 2, −1, −2)s + (1, 0, 0, 1)t, s, t ∈ R, with the plane ϱ itself.

Solution. First find the plane orthogonal to ϱ. Its direction is orthogonal to the direction of ϱ, so for the vectors $(a, b, c, d)$ of its direction we get the system of linear equations
(a, b, c, d) · (1, 2, −1, −2) = 0, i.e. a + 2b − c − 2d = 0,
(a, b, c, d) · (1, 0, 0, 1) = 0, i.e. a + d = 0.
The solution is the two-dimensional vector space ⟨(0, 1, 2, 0), (−1, 0, −3, 1)⟩. The plane τ orthogonal to ϱ and passing through A therefore has the parametric equation
τ : [1, 2, 3, 4] + (0, 1, 2, 0)u + (−1, 0, −3, 1)v, u, v ∈ R.
We obtain the intersection of the planes by equating the two parametric expressions, which gives the system of linear equations
1 + s + t = 1 − v,
2s = 2 + u,
1 − s = 3 + 2u − 3v,
−2s + t = 4 + v.
It has a unique solution (necessarily so, since the corresponding matrix columns are linearly independent):
s = −8/19, t = 34/19, u = −54/19, v = −26/19.
Substituting the parameter values s and t into the parametric form of the plane ϱ, we obtain the intersection point [45/19, −16/19, 27/19, 50/19]. (Needless to say, the same point is obtained by substituting the values of u and v into τ.) □

4.B.21. Find a line passing through the point [1, 2] ∈ R² such that the angle between this line and the line p : [0, 1] + t(1, 1) is 30°.

Solution. The angle between two lines is the angle between their direction vectors, so it is sufficient to find the direction vector v of the desired line. One way to do so is to rotate the direction vector of p by 30°. The rotation matrix for the angle 30° is
\[
\begin{pmatrix} \cos 30° & -\sin 30°\\ \sin 30° & \cos 30° \end{pmatrix}=\begin{pmatrix} \frac{\sqrt3}{2} & -\frac12\\[2pt] \frac12 & \frac{\sqrt3}{2} \end{pmatrix}.
\]
The desired vector is therefore
\[
v=\begin{pmatrix} \frac{\sqrt3}{2} & -\frac12\\[2pt] \frac12 & \frac{\sqrt3}{2} \end{pmatrix}\begin{pmatrix}1\\1\end{pmatrix}=\begin{pmatrix} \frac{\sqrt3}{2}-\frac12\\[2pt] \frac{\sqrt3}{2}+\frac12 \end{pmatrix}.
\]
(We could equally well perform the backward rotation.) The line (one of the two possible) has the parametric equation
[1, 2] + t(√3/2 − 1/2, √3/2 + 1/2). □

(3) the length ‖v‖ of the cross product equals the absolute value of the volume of the parallelepiped $P(0; u_1, \dots, u_{n-1})$;
(4) $(u_1, \dots, u_{n-1}, v)$ is a compatible basis of the oriented Euclidean space $V$.

Proof. The first claim follows directly from the defining formula for $v$: substituting an arbitrary vector $u_j$ for $w$ gives the scalar product $v \cdot u_j$ on the left and a determinant with two equal columns on the right. The rank of the matrix with the $n − 1$ columns $u_j$ is given by the maximal size of a nonzero minor; the minors which define the coordinates of the cross product are of degree $n − 1$, and thus claim (2) is proved. If the vectors $u_1, \dots, u_{n-1}$ are linearly dependent, then (3) holds trivially. Suppose the vectors are linearly independent. Let $v$ be their cross product, and choose an orthonormal basis $(e_1, \dots, e_{n-1})$ of the space $\langle u_1, \dots, u_{n-1}\rangle$. It follows from what has been proved that there exists a multiple $(1/\alpha)v$, $0 \ne \alpha \in \mathbb{R}$, such that $(e_1, \dots, e_{n-1}, (1/\alpha)v)$ is an orthonormal basis of $V$. The coordinates of our vectors in this basis are $u_j = (u_{1j}, \dots, u_{(n-1)j}, 0)^T$, $v = (0, \dots, 0, \alpha)^T$. So the outer product $[u_1, \dots, u_{n-1}, v]$ equals (see the definition of the cross product)
\[
[u_1,\dots,u_{n-1},v]=\begin{vmatrix} u_{11} & \dots & u_{1(n-1)} & 0\\ \vdots & & \vdots & \vdots\\ u_{(n-1)1} & \dots & u_{(n-1)(n-1)} & 0\\ 0 & \dots & 0 & \alpha \end{vmatrix}=\langle v,v\rangle=\alpha^2 .
\]
Expanding the determinant along the last column gives $\alpha^2 = \alpha\,\mathrm{Vol}\,P(0; u_1, \dots, u_{n-1})$, which proves the remaining two claims. □

In technical applications in $\mathbb{R}^3$, the cross product, which assigns a vector to any pair of vectors, is used very often.

2. Transformations

This short section backs up a quite wide area of practical considerations displayed in the other column. As usual, we can understand objects well only if we master the mappings preserving the crucial concepts.

4.B.22. A regular octahedron has eight faces which are equilateral triangles. Determine cos α, where α is the angle between two adjacent faces.

Solution. The octahedron is symmetric, so it does not matter which two adjacent faces are selected. By suitable scaling we may assume that the octahedron has edges of length 1 and is placed in the standard Cartesian coordinate system of R³ with its centroid at [0, 0, 0]. Its vertices are then located at the points A = [√2/2, 0, 0], B = [0, √2/2, 0], C = [−√2/2, 0, 0], D = [0, −√2/2, 0], E = [0, 0, −√2/2] and F = [0, 0, √2/2]. We compute the angle between the faces CDF and BCF. We need vectors orthogonal to their common edge CF and lying within the respective faces, namely the altitudes from D and from B onto the edge CF in the triangles CDF and BCF. The altitudes in an equilateral triangle coincide with the medians, so these are the segments SD and SB, where S is the midpoint of CF. Since the coordinates of C and F are known, S = [−√2/4, 0, √2/4], and the vectors are SD = (√2/4, −√2/2, −√2/4) and SB = (√2/4, √2/2, −√2/4). Together,
\[
\cos\alpha=\frac{(\tfrac{\sqrt2}4,-\tfrac{\sqrt2}2,-\tfrac{\sqrt2}4)\cdot(\tfrac{\sqrt2}4,\tfrac{\sqrt2}2,-\tfrac{\sqrt2}4)}{\|(\tfrac{\sqrt2}4,-\tfrac{\sqrt2}2,-\tfrac{\sqrt2}4)\|\,\|(\tfrac{\sqrt2}4,\tfrac{\sqrt2}2,-\tfrac{\sqrt2}4)\|}=-\frac13 .
\]
Therefore α ≐ 109.47°. □

4.B.23. In the Euclidean space R⁵, determine the angle φ between the subspaces U, V, where
(a) U : [3, 5, 1, 7, 2] + t(1, 0, 2, −2, 1), t ∈ R; V : [0, 1, 0, 0, 0] + s(2, 0, −2, 1, −1), s ∈ R;
(b) U : [4, 1, 1, 0, 1] + t(2, 0, 0, 2, 1), t ∈ R; V : x1 + x2 + x3 + x5 = 7;
(c) U : 2x1 − x2 + 2x3 + x5 = 3; V : x1 + 2x2 + 2x3 + x5 = −1;
(d) U : [0, 1, 1, 0, 0] + t(0, 0, 0, 1, −1), t ∈ R; V : [1, 0, 1, 1, 1] + r(1, −1, 2, 1, 0) + s(0, 1, 3, 2, 0) + p(1, 0, 0, 1, 0) + q(1, 3, 1, 0, 0), r, s, p, q ∈ R;
(e) U : [0, 2, 5, 0, 0] + t(2, 1, 3, 5, 3) + s(0, 3, 1, 4, −2) + r(1, 2, 4, 0, 3), t, s, r ∈ R; V : [0, 0, 0, 0, 0] + p(−1, 1, 1, −5, 0) + q(1, 5, 1, 13, −4), p, q ∈ R;
(f) U : [1, 1, 1, 1, 1] + t(1, 0, 1, 1, 1) + s(1, 0, 0, 1, 1), t, s ∈ R; V : [1, 1, 1, 1, 1] + p(1, 1, 1, 1, 1) + q(1, 1, 0, 1, 1) + r(1, 1, 0, 1, 0), p, q, r ∈ R.

Solution. Recall that the angle between affine subspaces is defined as the angle between the vector spaces associated with them, so the translations given by the added points can be ignored.

Case (a). Since U and V are one-dimensional subspaces, the angle φ ∈ [0, π/2] is given by the formula
\[
\cos\varphi=\frac{|(1,0,2,-2,1)\cdot(2,0,-2,1,-1)|}{\|(1,0,2,-2,1)\|\cdot\|(2,0,-2,1,-1)\|}=\frac{5}{\sqrt{10}\cdot\sqrt{10}} .
\]

Affine maps

4.2.1. A map f : A → B between affine spaces is called an affine map if there exists a linear map φ : Z(A) → Z(B) between their difference spaces such that for all A ∈ A, v ∈ Z(A),
\[
f(A+v)=f(A)+\varphi(v) .
\]
The maps f and φ are determined uniquely by this property, and by arbitrarily chosen images of (dim A + 1) points in general position.
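In coordinates (see 4.2.3 below), an affine map is a linear map followed by a translation. The following sketch, with an arbitrarily chosen translation y0 and matrix Y, illustrates numerically the invariance of affine combinations that is proved next.

    import numpy as np

    # an affine map f(x) = y0 + Y x  (arbitrary illustrative data)
    y0 = np.array([1.0, -2.0])
    Y = np.array([[2.0, 1.0],
                  [0.0, 3.0]])
    f = lambda x: y0 + Y @ x

    # points and weights of an affine combination (weights sum to 1)
    A = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    t = [0.5, 0.25, 0.25]

    lhs = f(sum(ti * Ai for ti, Ai in zip(t, A)))   # image of the combination
    rhs = sum(ti * f(Ai) for ti, Ai in zip(t, A))   # combination of the images
    print(np.allclose(lhs, rhs))                    # True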
For an arbitrary affine combination of points $t_0A_0 + \dots + t_sA_s \in \mathcal{A}$ we obtain
\[
f(t_0A_0+\dots+t_sA_s)=f\big(A_0+t_1(A_1-A_0)+\dots+t_s(A_s-A_0)\big)
=f(A_0)+t_1\varphi(A_1-A_0)+\dots+t_s\varphi(A_s-A_0)
=t_0f(A_0)+t_1f(A_1)+\dots+t_sf(A_s).
\]
On the other hand, if a map preserves affine combinations, we may use specific combinations of $n + 1$ fixed points generating an affine frame. Choosing successively the single non-zero coefficients $t_i = 1$, $i = 1, \dots, s$, we define the map φ between the difference spaces by the relation
\[
\varphi(A_i-A_0)=f(A_i)-f(A_0).
\]
The previous computation can be read in the opposite direction: the assumption that the first and the last expressions in the display are equal implies that the second and the third are equal as well, so we can check that φ is well defined and linear. Thus we have an affine map with the corresponding linear map φ between the difference spaces, described in the chosen affine frame by this procedure. Therefore:

Theorem. Affine maps are exactly those maps which preserve affine combinations of points.

It is sufficient to check the invariance of affine combinations for pairs of points, since an arbitrary affine combination can be created from combinations of pairs. Indeed, an affine combination of $k + 2$ points $A_0, \dots, A_{k+1}$ can be expressed as
\[
r(t_0A_0+\dots+t_kA_k)+sA_{k+1},\qquad \textstyle\sum_{i=0}^k t_i=1,\quad r+s=1 .
\]
We first choose the point which is the affine combination of the first $k + 1$ points only, and then combine it with the last one. In this way, any finite affine combination is built step by step from combinations of pairs.

4.2.2. Ratio of collinear points. The affine combinations of pairs of points can also be expressed with the help of the ratio of points on a straight line. If C is given by an affine combination of points A and B ≠ C, C = rA + sB, then we say that the number λ = (C; A, B) = −s/r is the ratio of the point C with respect to the given points A and B. Since we can express C as C = A + s(B − A) = B + r(A − B),

Therefore cos φ = 1/2 and φ = π/3.

Case (b). The subspace U has the direction vector (2, 0, 0, 2, 1), and the hyperplane V has the normal vector (1, 1, 1, 0, 1). The angle ψ = π/3 between these two vectors is derived from the formula
\[
\cos\psi=\frac{(2,0,0,2,1)\cdot(1,1,1,0,1)}{\|(2,0,0,2,1)\|\cdot\|(1,1,1,0,1)\|}=\frac{3}{3\cdot 2} .
\]
Notice that φ = π/2 − ψ = π/6, because the angle between a line and a hyperplane is the complement of the angle between the line and the normal of the hyperplane.

Case (c). The hyperplanes U and V are defined by the normal vectors u = (2, −1, 2, 0, 1) and v = (1, 2, 2, 0, 1), and the angle φ equals the angle between these vectors. Therefore (see (a))
\[
\cos\varphi=\frac{|(2,-1,2,0,1)\cdot(1,2,2,0,1)|}{\|(2,-1,2,0,1)\|\cdot\|(1,2,2,0,1)\|}=\frac12,\qquad\text{i.e.}\quad\varphi=\frac{\pi}{3} .
\]

Case (d). Denote u = (0, 0, 0, 1, −1), v1 = (1, −1, 2, 1, 0), v2 = (0, 1, 3, 2, 0), v3 = (1, 0, 0, 1, 0), v4 = (1, 3, 1, 0, 0), and denote by $p_u$ the orthogonal projection of u into the direction space of V (the subspace generated by v1, v2, v3, v4). Then $p_u = av_1 + bv_2 + cv_3 + dv_4$ for some a, b, c, d ∈ R, and
⟨pu − u, v1⟩ = 0, ⟨pu − u, v2⟩ = 0, ⟨pu − u, v3⟩ = 0, ⟨pu − u, v4⟩ = 0.
Substituting for $p_u$ gives the system of linear equations
7a + 7b + 2c = 1,
7a + 14b + 2c + 6d = 2,
2a + 2b + 2c + d = 1,
6b + c + 11d = 0,
with the solution (a, b, c, d) = (−8/19, 7/19, 13/19, −5/19). Hence
\[
p_u=-\tfrac{8}{19}v_1+\tfrac{7}{19}v_2+\tfrac{13}{19}v_3-\tfrac{5}{19}v_4=(0,0,0,1,0),
\]
and, by the formula
(1) cos φ = ‖pu‖ / ‖u‖,
we get
\[
\cos\varphi=\frac{\|(0,0,0,1,0)\|}{\|(0,0,0,1,-1)\|}=\frac1{\sqrt2}=\frac{\sqrt2}{2} .
\]
Hence φ = π/4.
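The projection in case (d) is a least-squares problem, so it can be reproduced with a standard solver. A brief NumPy check of pu and the angle:

    import numpy as np

    u = np.array([0., 0., 0., 1., -1.])
    V = np.array([[1., -1., 2., 1., 0.],     # v1 ... v4
                  [0., 1., 3., 2., 0.],
                  [1., 0., 0., 1., 0.],
                  [1., 3., 1., 0., 0.]]).T   # columns span the direction of V

    # orthogonal projection of u onto the column span = least squares
    coeffs, *_ = np.linalg.lstsq(V, u, rcond=None)
    pu = V @ coeffs                          # = (0, 0, 0, 1, 0)

    cos_phi = np.linalg.norm(pu) / np.linalg.norm(u)
    print(pu.round(6), np.degrees(np.arccos(cos_phi)))   # 45 degrees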
Case (e). Determine first the intersection of the vector subspaces associated with the given affine subspaces. The vector $(x_1, x_2, x_3, x_4, x_5)$ lies in the direction of U if and only if
(x1, x2, x3, x4, x5) = t(2, 1, 3, 5, 3) + s(0, 3, 1, 4, −2) + r(1, 2, 4, 0, 3)
for some t, s, r ∈ R. Similarly, $(x_1, x_2, x_3, x_4, x_5)$ lies in the direction of V if and only if
(x1, x2, x3, x4, x5) = p(−1, 1, 1, −5, 0) + q(1, 5, 1, 13, −4)
for some p, q ∈ R. We look for t, s, r, p, q ∈ R such that
t(2, 1, 3, 5, 3) + s(0, 3, 1, 4, −2) + r(1, 2, 4, 0, 3) = p(−1, 1, 1, −5, 0) + q(1, 5, 1, 13, −4).
This is a homogeneous system of linear equations, solved in matrix form (the order of the variables is t, s, r, p, q):

the ratio λ is the ratio of the lengths of the oriented vectors C − A and C − B. In particular, λ = −1 if and only if C is the centre of the line segment joining A and B (i.e. r = s = 1/2 in the affine combination). Hence the characterization of affine maps in terms of affine combinations has the following consequence:

Corollary. Affine maps are exactly those maps which keep the ratios invariant.

Needless to say, the collinearity of points must be preserved in order to talk about ratios at all.

4.2.3. Coordinate expression for maps. A general affine map f : A → B, f(X) = f(A0) + φ(X − A0), is viewed in coordinates as follows. First express the image f(A0) of the origin of the frame (A0, u) on A in the frame (B0, v) on B; in other words, the vector f(A0) − B0 has the coordinates y0 in the basis v. Everything else is then given by multiplying by the matrix of the map φ in the chosen bases and adding the outcome. Each affine map therefore has the following form in coordinates:
x ↦ y0 + Y · x,
where y0 is as above and Y is the matrix of the map φ. Of course, the changes of coordinates are special instances of invertible affine mappings on the standard affine space An. Similarly to the case of linear mappings, under the choice of two affine coordinate systems (A0, u) and (B0, v) on A, the coordinate expression of the identity mapping is the requested rule for the change of coordinates, cf. 4.1.5. Next, consider the change of frame x = w + M · x′ on the domain, given by a translation w and a matrix M, and let y′ = z + N · y describe a change of frame on the range space, given by a translation z and a matrix N. Then the coordinate expression x ↦ y0 + Y · x of f : A → B transforms as
(1) y′ = z + N · y = z + N · (y0 + Y · x) = (z + N · y0 + N · Y · w) + (N · Y · M) · x′.
Hence in the new frames the affine map is given by the translation vector z + N · y0 + N · Y · w and the matrix N · Y · M.

Euclidean maps

4.2.4. The Euclidean maps f : E1 → E2 are the affine maps which respect distances, which happens if and only if the associated linear maps φ are orthogonal. In particular, the coordinate description of invertible Euclidean maps involves orthogonal matrices Y. If the dimension of the target space is bigger, we can always complete the image of a chosen orthonormal frame into an orthonormal frame of the codomain, and then the relevant matrix Y contains an orthogonal block, completed by zeros.

\[
\begin{pmatrix} 2&0&1&1&-1\\ 1&3&2&-1&-5\\ 3&1&4&-1&-1\\ 5&4&0&5&-13\\ 3&-2&3&0&4 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 1&3&2&-1&-5\\ 0&2&1&-1&-3\\ 0&0&1&-1&1\\ 0&0&0&0&0\\ 0&0&0&0&0 \end{pmatrix}.
\]
The vectors defining V are thus linear combinations of the vectors defining U. So the direction of V is a subspace of the direction of U, and hence φ = 0.
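The conclusion of case (e) — that the direction of V lies inside the direction of U — can also be read off from matrix ranks, which is exactly what the row reduction above computes. A brief check:

    import numpy as np

    U = np.array([[2, 1, 3, 5, 3],
                  [0, 3, 1, 4, -2],
                  [1, 2, 4, 0, 3]], dtype=float)       # directions of U
    V = np.array([[-1, 1, 1, -5, 0],
                  [1, 5, 1, 13, -4]], dtype=float)     # directions of V

    # V lies inside U iff adjoining its vectors does not raise the rank
    print(np.linalg.matrix_rank(U),
          np.linalg.matrix_rank(np.vstack([U, V])))
    # both ranks are 3, so Z(V) is a subspace of Z(U) and the angle is 0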
Case (f). Find the intersection of the directions of U and V: search for numbers t, s, p, q, r ∈ R such that
t(1, 0, 1, 1, 1) + s(1, 0, 0, 1, 1) = p(1, 1, 1, 1, 1) + q(1, 1, 0, 1, 1) + r(1, 1, 0, 1, 0).
The solution is (t, s, p, q, r) = (−a, a, −a, a, 0), a ∈ R. The intersection Z(U) ∩ Z(V) of the vector subspaces therefore contains the vectors
(0, 0, −a, 0, 0) = −a(1, 0, 1, 1, 1) + a(1, 0, 0, 1, 1) = −a(1, 1, 1, 1, 1) + a(1, 1, 0, 1, 1) + 0·(1, 1, 0, 1, 0),
where a ∈ R. So Z(U) ∩ Z(V) is generated by (0, 0, 1, 0, 0), and its orthogonal complement (Z(U) ∩ Z(V))⊥ is generated by the vectors (1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 0, 0, 1). We obtain Z(U) ∩ Z(V) ≠ {0}, Z(U) ∩ Z(V) ≠ Z(U), Z(U) ∩ Z(V) ≠ Z(V). The angle φ is therefore defined as the angle between the subspaces Z(U) ∩ (Z(U) ∩ Z(V))⊥ and Z(V) ∩ (Z(U) ∩ Z(V))⊥. It is now established that
Z(U) ∩ (Z(U) ∩ Z(V))⊥ = ⟨(1, 0, 0, 1, 1)⟩,
Z(V) ∩ (Z(U) ∩ Z(V))⊥ = ⟨(1, 1, 0, 1, 1), (1, 1, 0, 1, 0)⟩.
(It is enough to express Z(U) in terms of the vectors (0, 0, 1, 0, 0), (1, 0, 0, 1, 1), and Z(V) in terms of (0, 0, 1, 0, 0), (1, 1, 0, 1, 1), (1, 1, 0, 1, 0).) Since the dimension of Z(U) ∩ (Z(U) ∩ Z(V))⊥ is 1, we can use the formula (1), where u = (1, 0, 0, 1, 1) and pu is the orthogonal projection of u onto Z(V) ∩ (Z(U) ∩ Z(V))⊥. Writing pu = a(1, 1, 0, 1, 1) + b(1, 1, 0, 1, 0), the conditions
⟨pu − u, (1, 1, 0, 1, 1)⟩ = 0, ⟨pu − u, (1, 1, 0, 1, 0)⟩ = 0
lead to the system of linear equations 4a + 3b = 3, 3a + 3b = 2 with the unique solution a = 1, b = −1/3. Thus pu = (2/3, 2/3, 0, 2/3, 1). From (1) it follows that

Remark. Notice an amazing fact about the transformations of Euclidean spaces keeping the distances of points invariant, i.e. general distance-preserving mappings f : En → En. A point C lies on the segment determined by the points A and B if and only if ‖C − B‖ + ‖C − A‖ = ‖B − A‖. Consequently, any mapping f preserving distances must map segments to segments, and thus also lines to lines. Moreover, preserving distances, it clearly preserves ratios too, and thus we deal with an affine map. Affine maps preserve distances if and only if they are Euclidean. In 4.D.1, we can see a purely analytic proof that in principle all such (reasonably “smooth”) maps are affine and thus Euclidean. The advantage of the above synthetic approach is that even the smoothness of the mapping follows from the assumptions.

4.2.5. Affine and Euclidean properties. Now we can consider which properties are related to the affine structure and where we really need the scalar product on the difference vector space. All Euclidean transformations, i.e. bijective affine maps which preserve the distances of points, preserve also all the objects we have studied (possibly up to a change of orientation): in particular unoriented angles, unoriented volumes, angles between subspaces, etc. If we want them to preserve also oriented angles, cross products and volumes, then we must assume in addition that the transformations preserve the orientation. Dealing with a general affine transformation T on an m-dimensional Euclidean space, the volumes are multiplied by the determinant of the linear part of T. We ask: which concepts are preserved under affine transformations? Recall first that an affine transformation of an n-dimensional space A is uniquely defined by the images of n + 1 points in general position, that is, by the image of one n-dimensional simplex. In the plane, this means choosing the image of any nondegenerate triangle. The preserved properties are those related to subspaces and ratios. In particular, incidence properties of the type “a line passing through a point” or “a plane containing a line” are preserved, and so is the collinearity of vectors.
For every two collinear vectors, the ratio of their lengths is preserved, independently of the scalar product defining the length. Similarly, the ratio of the volumes of two n-dimensional parallelepipeds is preserved under the transformations, since the determinant of the corresponding matrix changes both volumes by the same multiple. These affine properties can be used to prove geometric statements. For instance, to prove that the medians of a triangle in the plane intersect in a single point, at one third of their lengths, it is sufficient to verify this for an isosceles right-angled triangle only, or for an equilateral triangle only. Then the property holds for all triangles. Think about this argument!

\[
\cos\varphi=\frac{\|(2/3,\,2/3,\,0,\,2/3,\,1)\|}{\|(1,0,0,1,1)\|}=\frac{\sqrt7}{3},\qquad \varphi\doteq 0.49\ (\approx 28°). \quad\square
\]

C. Geometry of quadratic forms

4.C.1. Determine a polar basis of the form f : R³ → R, f(x1, x2, x3) = 3x1² + 2x1x2 + x2² + 4x2x3 + 6x3².

Solution. The matrix of the form is
\[
A=\begin{pmatrix} 3&1&0\\ 1&1&2\\ 0&2&6 \end{pmatrix}.
\]
According to step (1) of the Lagrange algorithm (see Theorem 4.3.1), we perform the following operations:
\[
f(x_1,x_2,x_3)=\tfrac13(3x_1+x_2)^2+\tfrac23x_2^2+4x_2x_3+6x_3^2=\tfrac13y_1^2+\tfrac32\big(\tfrac23y_2+2y_3\big)^2=\tfrac13z_1^2+\tfrac32z_2^2 .
\]
The form has rank 2, and the change to the polar coordinates is obtained by combining the transformations z1 = y1 = 3x1 + x2, z2 = (2/3)y2 + 2y3 = (2/3)x2 + 2x3, z3 = y3 = x3, so the matrix of the change of coordinates is
\[
T=\begin{pmatrix} 3&1&0\\ 0&\frac23&2\\ 0&0&1 \end{pmatrix}.
\]
We computed the polar coordinates, expressed them in terms of the standard ones, and wrote them into the rows of this matrix (the columns of this matrix express the vectors of the standard basis in the polar basis). The coordinates of the polar basis vectors are the columns of the matrix
\[
T^{-1}=\begin{pmatrix} \frac13&-\frac12&1\\ 0&\frac32&-3\\ 0&0&1 \end{pmatrix}.
\]
The polar basis is therefore ((1/3, 0, 0), (−1/2, 3/2, 0), (1, −3, 1)). □

4.C.2. Determine a polar basis of the form f : R³ → R, f(x1, x2, x3) = 2x1x3 + x2².

Solution. The matrix of the form is
\[
A=\begin{pmatrix} 0&0&1\\ 0&1&0\\ 1&0&0 \end{pmatrix}.
\]
Change the order of the variables: y1 = x2, y2 = x1, y3 = x3. Step (1) of the Lagrange algorithm is then trivial (there are no mixed terms containing y1). For the next step, however, case (4) applies: introduce the transformation z1 = y1, z2 = y2, z3 = y3 − y2. Then
\[
f(x_1,x_2,x_3)=z_1^2+2z_2(z_3+z_2)=z_1^2+\tfrac12(2z_2+z_3)^2-\tfrac12z_3^2 .
\]

case of an equilateral triangle — see the theory column above.

4.2.6. Transformations of quadrics. After straight lines (which are mapped to straight lines again by all affine maps), the simplest objects in the analytic geometry of the plane are the conic sections. These are given by quadratic equations in Cartesian coordinates, and a conic is distinguished as a circle, ellipse, parabola or hyperbola by examining the coefficients. There are also two degenerate cases, namely a pair of lines or a point. We cannot distinguish a circle from an ellipse in affine geometry, but they are different in Euclidean geometry. In analogy with the equations of conic sections in the plane, we may discuss quadratic objects in Euclidean point spaces. These are defined in all orthonormal frames by quadratic equations, and they are known as quadrics. Consider a general quadratic equation for the coordinates (x1, . . . , xn)ᵀ of a point A ∈ En,
\[
(1)\qquad \sum_{i,j=1}^n a_{ij}x_ix_j+\sum_{i=1}^n a_ix_i+a=0,
\]
where it may be assumed by symmetry that aij = aji, without loss of generality.
This equation can be written as f(u) + g(u) + a = 0 for a quadratic form f (i.e. the restriction of a symmetric bilinear form F to pairs of equal arguments), a linear form g, and a scalar a ∈ R. We assume that at least one coefficient aij is nonzero; otherwise the equation is linear and describes a hyperplane. Notice that the equation (1) keeps the same shape under every affine coordinate transformation, i.e., it splits again into a nontrivial quadratic part, a linear part and a constant. Recognizing which of the standard types of quadrics is determined by a given equation is an extremely useful tool (we shall see this later, e.g. in multivariate analysis in Chapter 8), and thus we devote the next section to this topic. Notice also that the above observation is true for every fixed order of multivariate polynomial expressions. In particular, there is a well-defined concept of cubics, given by cubic equations and invariant with respect to all affine transformations; similarly quartics, quintics, etc. We shall meet cubic curves in the plane in more detail in Chapter 11, because of their fascinating use in cryptography.

3. Geometry of quadratic forms and quadrics

We shall start with the affine point of view and proceed to the Euclidean classification later.

4.3.1. Linear transformations of quadratic forms. Let us recall the bilinear symmetric forms F : Rⁿ × Rⁿ → R, cf. 2.3.23, and the corresponding mappings f(x) = F(x, x) for all x ∈ Rⁿ.

Together, the diagonal coordinates are w1 = z1 = x2, w2 = 2z2 + z3 = x1 + x3, w3 = z3 = x3 − x1, in which f = w1² + (1/2)w2² − (1/2)w3². The matrix expressing the polar coordinates (w1, w2, w3) in terms of (x1, x2, x3), and its inverse, are
\[
T=\begin{pmatrix} 0&1&0\\ 1&0&1\\ -1&0&1 \end{pmatrix},\qquad
T^{-1}=\begin{pmatrix} 0&\frac12&-\frac12\\ 1&0&0\\ 0&\frac12&\frac12 \end{pmatrix}.
\]
The polar basis is therefore ((0, 1, 0), (1/2, 0, 1/2), (−1/2, 0, 1/2)). □

4.C.3. Find a polar basis of the quadratic form f : R³ → R, which in the standard basis is defined as f(x1, x2, x3) = x1x2 + x1x3.

Solution. We apply the Lagrange algorithm. There are no squares, so we first substitute x1 = y1, x2 = y2 + y1, x3 = y3 (i.e. y2 = x2 − x1), which produces a square:
\[
f=y_1(y_1+y_2)+y_1y_3=y_1^2+y_1y_2+y_1y_3=\big(y_1+\tfrac12y_2+\tfrac12y_3\big)^2-\tfrac14(y_2+y_3)^2 .
\]
With the further substitutions z1 = y1 + (y2 + y3)/2 = (x1 + x2 + x3)/2, z2 = (y2 + y3)/2 = (−x1 + x2 + x3)/2 and z3 = x3, the quadratic form has the diagonal shape f = z1² − z2², which means that the basis associated with the coordinates (z1, z2, z3) is a polar basis of the form. If we want to exhibit the basis itself, we need the matrix of the change from the polar basis to the standard one; by the definition of such a matrix, its columns are the polar basis vectors. We may either express the old variables (x1, x2, x3) in terms of the new ones, or equivalently express the new variables in terms of the old ones (which is easier) and invert the resulting matrix:
\[
T=\begin{pmatrix} \frac12&\frac12&\frac12\\ -\frac12&\frac12&\frac12\\ 0&0&1 \end{pmatrix},\qquad
T^{-1}=\begin{pmatrix} 1&-1&0\\ 1&1&-1\\ 0&0&1 \end{pmatrix}.
\]
Hence one of the polar bases of the given quadratic form is (see the columns of T⁻¹)
{(1, 1, 0), (−1, 1, 0), (0, −1, 1)}. □

For an arbitrary basis e of this vector space, the value f(x) at a vector x = x1e1 + · · · + xnen is given by the equation
\[
(1)\qquad f(x)=F(x,x)=\sum_{i,j}x_ix_jF(e_i,e_j)=x^T\cdot A\cdot x,
\]
where A = (aij) is the symmetric matrix with the entries aij = F(ei, ej).
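Formula (1) and the basis-change rule derived in the next paragraph are easy to experiment with. A small sketch using the matrix A of the form from 4.C.1 (the test vector and the matrix S are arbitrary choices made for illustration):

    import numpy as np

    # matrix of f = 3x1^2 + 2x1x2 + x2^2 + 4x2x3 + 6x3^2  (cf. 4.C.1)
    A = np.array([[3., 1., 0.],
                  [1., 1., 2.],
                  [0., 2., 6.]])

    f = lambda x: x @ A @ x                 # formula (1): f(x) = x^T A x

    x = np.array([1., -2., 1.])
    print(f(x))                             # 3 - 4 + 4 - 8 + 6 = 1

    # under a change of basis x = S x', the matrix transforms as S^T A S
    S = np.array([[1., 1., 0.],
                  [0., 1., 1.],
                  [0., 0., 1.]])
    xp = np.linalg.solve(S, x)              # coordinates of the same vector
    print(np.isclose(f(x), xp @ (S.T @ A @ S) @ xp))   # True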
We call such maps f quadratic forms, and the formula (1), f(x) = Σij aij xi xj, is called the analytic formula for the form f. In general, by a quadratic form on a vector space V we mean the restriction f(u) of a symmetric bilinear form F(u, v) to the arguments of the type (u, u) ∈ V × V. Evidently, the whole bilinear form F can be reconstructed from the values f(u) (unless we work over the scalars Z2), since
\[
f(u+v)=F(u+v,u+v)=f(u)+f(v)+2F(u,v).
\]
This is called the polarization process. Let us come back to the real vector spaces Rⁿ. If we change the basis e to a different basis e′1, . . . , e′n, we get different coordinates x = S · x′ for the same vector (here S is the corresponding transformation matrix), and so
\[
f(x)=(S\cdot x')^T\cdot A\cdot(S\cdot x')=(x')^T\cdot(S^T\cdot A\cdot S)\cdot x' .
\]
Clearly, if we fix the second argument in the bilinear form F, we obtain a linear mapping V → V∗ with the coordinate description x ↦ F(·, x) = xᵀ · A. Thus the rank of the matrix A of the quadratic form f is the dimension of the image of this mapping, and therefore it is independent of the choice of the coordinates. We call it the rank of the quadratic form f.

Our first task is to decide whether two quadratic forms can be transformed one into the other by a linear transformation. We shall easily see that the matrix A of each quadratic form becomes diagonal for a suitable choice of the basis u, and this resolves our task as well. In other words, we request F(ui, uj) = 0 for i ≠ j for the corresponding symmetric bilinear form F. Each such basis is called a polar basis of the quadratic form f. Later we shall see that we could even find a polar orthonormal basis (with respect to any scalar product). Nevertheless, without the use of a scalar product, there is a much simpler algorithm for finding a polar basis among all the bases; at the same time, this algorithm brings relevant information about the affine properties of the quadratic form. The algorithmic procedure in the proof of the next theorem is known as the Lagrange algorithm.

Theorem. Let f : V → R be a quadratic form on a real vector space V of dimension n. Then there exists a polar basis for f on V.

Proof. (1) Let A be the matrix of f in a basis u = (u1, . . . , un) of V, and assume a11 ≠ 0. Then we may write
\[
f(x_1,\dots,x_n)=a_{11}x_1^2+2a_{12}x_1x_2+\dots+a_{22}x_2^2+\dots
=a_{11}^{-1}(a_{11}x_1+a_{12}x_2+\dots+a_{1n}x_n)^2+\text{terms not containing } x_1 .
\]

4.C.4. Determine the type of the conic section defined by 3x1² − 3x1x2 + x2 − 1 = 0.

Solution. Complete the squares:
\[
3x_1^2-3x_1x_2+x_2-1=\tfrac13\big(3x_1-\tfrac32x_2\big)^2-\tfrac34x_2^2+x_2-1
=\tfrac13y_1^2-\tfrac43\big(\tfrac34x_2-\tfrac12\big)^2+\tfrac13-1
=\tfrac13y_1^2-\tfrac43y_2^2-\tfrac23 .
\]
According to the list in 4.3.6, the given conic section is a hyperbola. □

4.C.5. By completing the squares, express the quadric −x² + 3y² + z² + 6xy − 4z = 0 in a form from which its type can be determined.

Solution. Complete the squares, dealing first with all the terms containing x. We obtain the equation
−(x − 3y)² + 9y² + 3y² + z² − 4z = 0.
There are no “unwanted” terms containing y left, so we repeat the procedure for z. This gives
−(x − 3y)² + 12y² + (z − 2)² − 4 = 0.
We conclude that there is a transformation of variables which leads to the equation (dividing by 4 if desired)
−x̄² + ȳ² + z̄² − 1 = 0. □

We can tell the type of a conic section even without transforming its equation to the canonical form listed in 4.3.6. Every conic section can be expressed as
a11x² + 2a12xy + a22y² + 2a13x + 2a23y + a33 = 0.
The determinants
\[
\Delta=\det A=\begin{vmatrix} a_{11}&a_{12}&a_{13}\\ a_{12}&a_{22}&a_{23}\\ a_{13}&a_{23}&a_{33} \end{vmatrix}\qquad\text{and}\qquad \delta=\begin{vmatrix} a_{11}&a_{12}\\ a_{12}&a_{22} \end{vmatrix}
\]
are invariants of the conic section, which means that they are not changed by Euclidean transformations (rotations and translations). Furthermore, the different types of conic sections give different signs of these determinants:
• Δ ≠ 0 for non-degenerate conic sections: an ellipse for δ > 0, a hyperbola for δ < 0 and a parabola for δ = 0. For a real (not imaginary) ellipse, it is moreover necessary that (a11 + a22)Δ < 0.
• Δ = 0 for degenerate conic sections, i.e. pairs of lines.
Let us check that the signs (or the vanishing) of the determinants are indeed invariant under the coordinate transformations. Denote X = (x, y, 1)ᵀ and let A be the matrix of the quadratic form; then the corresponding conic section has the equation XᵀAX = 0.

This suggests transforming the coordinates (i.e., changing the basis) as x′1 = a11x1 + a12x2 + · · · + a1nxn, x′2 = x2, . . . , x′n = xn. This corresponds to the new basis
\[
v_1=a_{11}^{-1}u_1,\quad v_2=u_2-a_{11}^{-1}a_{12}u_1,\ \dots,\ v_n=u_n-a_{11}^{-1}a_{1n}u_1
\]
(as an exercise, compute the transformation matrix). In the new basis, the corresponding symmetric bilinear form satisfies F(v1, vi) = 0 for all i > 1 (compute it!). Thus f has the analytic formula a11⁻¹x′1² + h in the new coordinates, where h is a quadratic form not containing the variable x′1. It is often easier to choose v1 = u1 in the new basis instead; then f = f1 + h, where f1 depends only on x′1 while x′1 does not appear in h, but F(v1, v1) = a11.

(2) Assume that after step (1), the form h (of rank less than n) has a nonzero coefficient at x′2². Then the same procedure can be repeated to obtain the expression f = f1 + f2 + h, where h contains only the variables with index greater than two. We proceed in this way until a diagonal form is obtained after n − 1 steps, or until, in the i-th step say, the corresponding diagonal element aii is zero.

(3) If the latter occurs and there exists some other element ajj ≠ 0 with j > i, then it suffices to interchange the i-th and the j-th vector of the basis, and to continue according to the previous procedure.

(4) Assume that ajj = 0 for all j ≥ i. If there is no element ajk ≠ 0 with j ≥ i, k ≥ i, then we are finished, since the matrix is already diagonal. If ajk ≠ 0, then we use the transformation vj = uj + uk, keeping the other vectors of the basis unchanged (i.e. x′k = xk − xj, the other coordinates remain the same). Then
h(vj, vj) = h(uj, uj) + h(uk, uk) + 2h(uk, uj) = 2ajk ≠ 0,
and we can continue as in case (1). □

4.3.2. Affine classification of quadratic forms. The vectors of the basis obtained from the Lagrange algorithm can be rescaled by scalars so that the coefficients of the squares of the variables are only 1, −1 and 0. Moreover, the following law of inertia says that the number of 1's and the number of −1's do not depend on the choices made in the course of the algorithm. These numbers are called the signature of the quadratic form. As before, we obtain a complete description: two quadratic forms may be transformed one into the other by a linear transformation if and only if they have the same signature.

Theorem. For each nonzero quadratic form of rank r on a real vector space V there exists a natural number p, 0 ≤ p ≤ r, and r independent linear forms φ1, . . . , φr ∈ V∗ such that
\[
f(u)=(\varphi_1(u))^2+\dots+(\varphi_p(u))^2-(\varphi_{p+1}(u))^2-\dots-(\varphi_r(u))^2 .
\]
Otherwise put, there exists a polar basis in which f has the analytic formula
\[
f(x_1,\dots,x_n)=x_1^2+\dots+x_p^2-x_{p+1}^2-\dots-x_r^2 .
\]
The standard form is obtained by a rotation and a translation, i.e. by a transformation to new coordinates x′, y′ satisfying
x = x′ cos α − y′ sin α + c1,
y = x′ sin α + y′ cos α + c2,
or, in matrix form, with the new coordinates X′ = (x′, y′, 1)ᵀ,
\[
(1)\qquad X=\begin{pmatrix} x\\ y\\ 1 \end{pmatrix}=\begin{pmatrix} \cos\alpha & -\sin\alpha & c_1\\ \sin\alpha & \cos\alpha & c_2\\ 0&0&1 \end{pmatrix}\begin{pmatrix} x'\\ y'\\ 1 \end{pmatrix}=MX' .
\]
Substituting X = MX′ into the equation of the conic section, we obtain the equation in the new coordinates:
XᵀAX = 0, (MX′)ᵀA(MX′) = 0, X′ᵀ(MᵀAM)X′ = 0.
Denote by A′ = MᵀAM the matrix of the quadratic form in the new coordinates. The matrix M above has unit determinant, so
det A′ = det Mᵀ · det A · det M = det A = Δ.
Similarly, the subdeterminant A33 (the algebraic complement of a33) is invariant under the coordinate transformation. For a rotation alone, the matrix is
\[
M=\begin{pmatrix} \cos\alpha & -\sin\alpha & 0\\ \sin\alpha & \cos\alpha & 0\\ 0&0&1 \end{pmatrix},
\]
and det A′33 = det A33 = δ. For a translation alone,
\[
M=\begin{pmatrix} 1&0&c_1\\ 0&1&c_2\\ 0&0&1 \end{pmatrix},
\]
and this subdeterminant remains unchanged as well.

4.C.6. Determine the type of the conic section 2x² − 2xy + 3y² − x + y − 1 = 0.

Solution. The determinant
\[
\Delta=\begin{vmatrix} 2&-1&-\frac12\\ -1&3&\frac12\\ -\frac12&\frac12&-1 \end{vmatrix}=-\frac{23}{4}\ne 0,
\]
hence the conic section is non-degenerate. Moreover, δ = 5 > 0, therefore it is an ellipse. Furthermore, (a11 + a22)Δ = (2 + 3) · (−23/4) < 0, so it is a real ellipse. □

4.C.7. Determine the type of the conic section x² − 4xy − 5y² + 2x + 4y + 3 = 0.

The number p of positive diagonal coefficients in the matrix of the given quadratic form (and thus the number r − p of negative coefficients) does not depend on the choice of the polar basis. Two symmetric matrices A, B of dimension n are the matrices of the same quadratic form in different bases if and only if they have the same rank and the same number of positive coefficients in the polar basis.

Proof. By completing the squares, f(x1, . . . , xn) = λ1x1² + · · · + λrxr², λi ≠ 0, in a suitable basis of V. Assume moreover that the first p coefficients λi are positive. Then the transformation
\[
y_1=\sqrt{\lambda_1}\,x_1,\ \dots,\ y_p=\sqrt{\lambda_p}\,x_p,\quad y_{p+1}=\sqrt{-\lambda_{p+1}}\,x_{p+1},\ \dots,\ y_r=\sqrt{-\lambda_r}\,x_r,\quad y_{r+1}=x_{r+1},\ \dots,\ y_n=x_n
\]
yields the desired formula. The forms φi are exactly the forms of the dual basis in V∗ to the obtained polar basis. It remains to prove that p does not depend on the procedure. Assume that the same form f is expressed in two polar bases u, v as
f(x1, . . . , xn) = x1² + · · · + xp² − xp+1² − · · · − xr²,
f(y1, . . . , yn) = y1² + · · · + yq² − yq+1² − · · · − yr².
Denote by P = ⟨u1, . . . , up⟩ the subspace generated by the first p vectors of the first basis, and set Q = ⟨vq+1, . . . , vn⟩. Then f(u) > 0 for each 0 ≠ u ∈ P, while f(v) ≤ 0 for each v ∈ Q. Hence necessarily P ∩ Q = {0}, and therefore dim P + dim Q ≤ n, i.e. p + (n − q) ≤ n, so that p ≤ q. By interchanging the roles of the two bases, q ≤ p, and so p = q. Thus p is independent of the choice of the polar basis. Consequently, for two matrices with the same rank and the same number of positive coefficients in the diagonal form of the corresponding quadratic form, the analytic formulas are the same. □

While discussing symmetric maps, we talked about definite and semidefinite maps. The same discussion has an obvious meaning also for symmetric bilinear forms and quadratic forms.
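The invariants Δ and δ translate directly into a small classification routine. The sketch below (the function name and its interface are, of course, just an illustrative choice) reproduces the conclusion of 4.C.6:

    import numpy as np

    def conic_type(a11, a12, a22, a13, a23, a33):
        """Classify a11 x^2 + 2 a12 xy + a22 y^2 + 2 a13 x + 2 a23 y + a33 = 0
        by the invariants Delta and delta discussed above."""
        A = np.array([[a11, a12, a13],
                      [a12, a22, a23],
                      [a13, a23, a33]])
        Delta = np.linalg.det(A)
        delta = a11 * a22 - a12 ** 2
        if np.isclose(Delta, 0):
            return "degenerate (pair of lines)"
        if delta > 0:
            return "real ellipse" if (a11 + a22) * Delta < 0 else "imaginary ellipse"
        return "hyperbola" if delta < 0 else "parabola"

    # 4.C.6: 2x^2 - 2xy + 3y^2 - x + y - 1 = 0
    print(conic_type(2, -1, 3, -0.5, 0.5, -1))   # real ellipse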
Types of quadratic forms

A quadratic form f on a real vector space V is called
(1) positive definite if f(u) > 0 for all vectors u ≠ 0,
(2) positive semidefinite if f(u) ≥ 0 for all vectors u ∈ V,
(3) negative definite if f(u) < 0 for all vectors u ≠ 0,
(4) negative semidefinite if f(u) ≤ 0 for all vectors u ∈ V,
(5) indefinite if f(u) > 0 and f(v) < 0 for suitable vectors u, v ∈ V.
The same names are used for the symmetric matrices corresponding to the quadratic forms. By the signature of a symmetric matrix we mean the signature of the corresponding quadratic form.

Solution. The determinant
\[
\Delta=\begin{vmatrix} 1&-2&1\\ -2&-5&2\\ 1&2&3 \end{vmatrix}=-34\ne0,
\]
and furthermore
\[
\delta=\begin{vmatrix} 1&-2\\ -2&-5 \end{vmatrix}=-9<0,
\]
so the conic section is a hyperbola. □

4.C.8. Determine the equation and the type of the conic section passing through the points
[−2, −4], [8, −4], [0, −2], [0, −6], [6, −2].

Solution. Substituting the coordinates of the points into the general equation of a conic section,
a11x² + a22y² + 2a12xy + a1x + a2y + a = 0,
gives the system of linear equations
4a11 + 16a22 + 16a12 − 2a1 − 4a2 + a = 0,
64a11 + 16a22 − 64a12 + 8a1 − 4a2 + a = 0,
4a22 − 2a2 + a = 0,
36a22 − 6a2 + a = 0,
36a11 + 4a22 − 24a12 + 6a1 − 2a2 + a = 0.
In matrix form, we perform the operations
\[
\begin{pmatrix} 4&16&16&-2&-4&1\\ 64&16&-64&8&-4&1\\ 0&4&0&0&-2&1\\ 0&36&0&0&-6&1\\ 36&4&-24&6&-2&1 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 4&16&16&-2&-4&1\\ 0&4&0&0&-2&1\\ 0&0&64&-8&12&-9\\ 0&0&0&24&0&3\\ 0&0&0&0&3&-2 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 48&0&0&0&0&-1\\ 0&12&0&0&0&-1\\ 0&0&64&0&0&0\\ 0&0&0&24&0&3\\ 0&0&0&0&3&-2 \end{pmatrix}.
\]
Then (choosing a = 48) a11 = 1, a22 = 4, a12 = 0, a1 = −6, a2 = 32. The conic section has the equation
x² + 4y² − 6x + 32y + 48 = 0.
Completing the terms x² − 6x and 4y² + 32y to squares, the result is
(x − 3)² + 4(y + 4)² − 25 = 0, or rather (x − 3)²/5² + (y + 4)²/(5/2)² − 1 = 0.
The conic section is an ellipse with the centre at [3, −4]. □

Sylvester criterion

4.3.3. Theorem. A symmetric real matrix A is positive definite if and only if all its leading principal minors are positive. A symmetric real matrix A is negative definite if and only if (−1)ⁱ|Ai| > 0 for all leading principal submatrices Ai.

Proof. The claim about negative definite matrices follows immediately from the first part of the theorem: just observe that A is negative definite if and only if −A is positive definite. Suppose that the form f is positive definite. Then A = PᵀEP = PᵀP for a suitable regular matrix P, hence |A| = |P|² > 0. Let u be a chosen basis in which the form f has the matrix A. The restrictions of f to the subspaces Vk = ⟨u1, . . . , uk⟩ are again positive definite forms fk, and the corresponding matrices in the bases (u1, . . . , uk) are the leading principal submatrices Ak. Thus |Ak| > 0, too.

In order to prove the converse implication, let us analyse in detail the transformations used when completing the squares in the Lagrange algorithm. The transformation used in the first step always has an upper triangular matrix T, and by rescaling (see the proof of Theorem 4.3.1) the matrix may be chosen with ones on the diagonal:
\[
T=\begin{pmatrix} 1 & -\frac{a_{12}}{a_{11}} & \dots & -\frac{a_{1n}}{a_{11}}\\ 0&1&\dots&0\\ \vdots & & \ddots & \vdots\\ 0&0&\dots&1 \end{pmatrix}.
\]
Such a matrix of the transformation from the basis u to a basis v has several useful properties. In particular, its leading principal submatrices Tk, formed by the first k rows and columns, are the transformation matrices of the subspace Pk = ⟨u1, . . . , uk⟩ from the basis (u1, . . . , uk) to the basis (v1, . . . , vk). The leading principal submatrices Ak of the matrix A of the form f are the matrices of the restrictions of f to Pk. Therefore, the matrices Ak and A′k of the restrictions to Pk in the bases u and v, respectively, satisfy Ak = Tkᵀ A′k Tk. The inverse matrix of an upper triangular matrix with ones on the diagonal is again an upper triangular matrix with ones on the diagonal, hence we may similarly express A′ in terms of A. By the Cauchy formula, the determinants of the matrices Ak and A′k are thus equal. We may conclude:

Claim. Let f be a quadratic form on V, dim V = n, and let u be a basis of V such that the steps (3) and (4) of the Lagrange algorithm are not needed while finding the polar basis. Then the analytic formula
f(x1, . . . , xn) = λ1x1² + λ2x2² + · · · + λrxr²
is obtained, where r is the rank of the form f, λ1, . . . , λr ≠ 0, and the leading principal submatrices of the (original) matrix A of the quadratic form f satisfy |Ak| = λ1λ2 · · · λk, k ≤ r.
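The Sylvester criterion is straightforward to implement by evaluating the leading principal minors. A sketch, applied to the matrix from 4.C.1 (which has rank 2, so it is positive semidefinite but not positive definite):

    import numpy as np

    def is_positive_definite(A):
        """Sylvester criterion: all leading principal minors positive."""
        A = np.asarray(A, dtype=float)
        return all(np.linalg.det(A[:k, :k]) > 0
                   for k in range(1, A.shape[0] + 1))

    A = np.array([[3., 1., 0.],      # the matrix from 4.C.1
                  [1., 1., 2.],
                  [0., 2., 6.]])
    print([np.linalg.det(A[:k, :k]) for k in (1, 2, 3)])  # [3.0, 2.0, 0.0]
    print(is_positive_definite(A))                        # False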
Therefore, the matrices Ak and A′ k of restrictions to Pk in basis u and v respectively satisfy Ak = TT k A′ kTk, where T is the transformation matrix from u to v. The inverse matrix to an upper triangular matrix with one’s on the diagonal is again an upper triangular matrix with one’s on the diagonal. Hence we may similarly express A′ in terms of A. Thus the determinants of the matrices Ak and A′ k are equal by Cauchy formula. We may conclude: Claim. Let f be a quadratic form on V , dim V = n. Let u be a basis of V such that the items (3) and (4) from the Lagrange algorithm while finding the polar basis are not needed. Then the analytic formula f(x1, . . . , xn) = λ1x2 1 + λ2x2 2 + · · · + λrx2 r is obtained where r is the rank of the form f, λ1, . . . , λr ̸= 0 and for the leading principal submatrices of the (former) matrix A of quadratic form f, |Ak| = λ1λ2 . . . λk, k ≤ r. 346 4.C.9. Other characteristics and concepts of conic sections. The axis of a conic section is a line of reflection symmetry for the conic section. From the canonical form of a conic section in polar basis (4.3.6) it can be shown that an ellipse and a hyperbola both have two axes (x = 0 and y = 0). A parabola has one axis (x = 0). The intersection of a conic section and its axis is called a conic section vertex. The numbers a, b from the canonical form of a conic section (which express the distance between vertices and the origin) are called the length of semi-axes. In the case of an ellipse and hyperbola, the axes intersect at the origin. This is a point of central symmetry for the conic section, called the centre of the conic section. For practical problems involving conic sections, it is often easiest to describe them in parametric form. Often, this avoids contending with messy square roots. Every point P on the parabola y2 = 4ax, a > 0, can be described by P = (x, y) = (at2 , 2at), for real t. The standard parametric form for the parabola is the pair of equations x = at2 y = 2at, (Note that the roles of x and y are interchanged, so that the axis of symmetry is the line y = 0.) The tangent line at at2 , 2at) has slope 1 t and equation t(y − 2at) = (x − at2 ). The point F = (a, 0) on the axis is called the focus of the parabola, and the line x = −a is called the directrix. Each point on the parabola is equidistant from the focus and the directrix. This property can be used to define a parabola. Every point P on the ellipse x2 a2 + y2 b2 = 1 can be described by P = (x, y) = (a cos θ, b sin θ, ) where 0 < b ≤ a. The standard parametric form for the ellipse is the pair of equations x = a cos θ, y = b sin θ. The tangent line at P has slope −b cos θ a sin θ and consequently has equation (a cos θ)(y − b sin θ) = −b cos θ)(x − a cos θ). The positive number e, defined by b2 = a2 (1 − e2 ) is called the eccentricity of the ellipse. If e = 0, the ellipse becomes a circle or radius a = b. Otherwise 0 < e < 1. The two points F1 = (ae, 0) and F2 = (−ae, 0) are the foci of the ellipse, and the lines x = ±a/e are the directrices. Every point P on the hyperbola x2 a2 − y2 b2 = 1, 0 < a, 0 < b, can be described by P = (x, y) = (a cosh θ, b sinh θ). The standard parametric form for the hyperbola is the pair of equations x = a cosh θ, y = b sinh θ. The tangent line at P has slope b cosh θ a sinh θ and consequently has equation (a cosh θ)(y − b sinh θ) = b cosh θ)(x − a cosh θ). The positive number e, defined by b2 = a2 (e2 − 1) is called the eccentricity of the hyperbola. Necessarily, e > 1. 
A hyperbola has two asymptotes; in the standard form, their equations are y = ±(b/a)x.

After each step in this procedure, the resulting matrix contains zeros below the diagonal in the already processed columns, and at the same time all the leading principal minors remain the same. Consequently, if the leading principal minors are nonzero, then the next diagonal term in A is nonzero, and we do not need any steps other than completing the squares. Moreover, λi = |Ai|/|Ai−1|. This proves the following:

Corollary (Jacobi theorem). Let f be a quadratic form of rank r on a vector space V with the matrix A in a basis u. The steps other than completing the square are not required if and only if the leading principal submatrices of A satisfy |A1| ≠ 0, . . . , |Ar| ≠ 0. In that case there exists a polar basis in which f has the analytic formula
\[
f(x_1,\dots,x_n)=|A_1|\,x_1^2+\frac{|A_2|}{|A_1|}\,x_2^2+\dots+\frac{|A_r|}{|A_{r-1}|}\,x_r^2 .
\]
Hence if all the leading principal minors are positive, then f is positive definite by the Jacobi theorem, and the Sylvester criterion is proved. □

4.3.4. Euclidean classification of quadratic forms. We come back to the discussion of the equation 4.2.6(1) in the Euclidean context, and begin with its quadratic part, i.e. the bilinear symmetric form F : V × V → R. Assume now that the real vector space is equipped with a scalar product, and choose an orthonormal basis. Then the matrix of the bilinear form F, which is also the matrix of f, transforms under a change of coordinates in such a way that for orthogonal changes it coincides with the transformation of the matrix of a linear map (because then S⁻¹ = Sᵀ). This result can be interpreted as the following observation:

Proposition. Let V be a real vector space with a scalar product. Then the formula φ ↦ F, F(u, u) = ⟨φ(u), u⟩, defines a bijection between symmetric linear maps and quadratic forms on V.

Proof. Each bilinear form with a fixed second argument becomes a linear form F(·, u). In the presence of a scalar product, it is given by the formula F(v, u) = v · w for a suitable vector w; put φ(u) = w. Directly from the coordinate expression 4.3.1(1) displayed above, φ is the linear map with the symmetric matrix A, hence it is self-adjoint. On the other hand, each symmetric map φ defines a symmetric bilinear form F by the formula F(u, v) = ⟨φ(u), v⟩ = ⟨u, φ(v)⟩, and thus also a quadratic form. □

It is immediate that for each quadratic form f there exists an orthonormal basis in which f has a diagonal matrix. The values on the diagonal are the eigenvalues of the matrix, and they are determined uniquely up to their order. The rank of f equals the dimension of the image of the corresponding map φ.

4.C.10. Existence of foci. For an ellipse with the lengths of the semi-axes a > b, show that the sum of the distances from any point of the ellipse to its two foci is constant, namely 2a.

Solution. If P = (a cos θ, b sin θ) and F1 = (ae, 0), then
\[
|PF_1|^2=(a\cos\theta-ae)^2+b^2\sin^2\theta
=a^2\cos^2\theta-2a^2e\cos\theta+a^2e^2+a^2(1-e^2)\sin^2\theta
=a^2\big[1-2e\cos\theta+e^2-e^2(1-\cos^2\theta)\big]
=a^2(1-e\cos\theta)^2 .
\]
So |PF1| = a(1 − e cos θ). Similarly, |PF2| = a(1 + e cos θ). Hence |PF1| + |PF2| = 2a. □

Solution. (Alternative) Consider the points X = [x, y] which satisfy the property |F1X| + |F2X| = 2a.
Coordinatewise, this implies the equation
$$\sqrt{(x + ae)^2 + y^2} + \sqrt{(x - ae)^2 + y^2} = 2a.$$
Rewrite this as $\sqrt{(x + ae)^2 + y^2} = 2a - \sqrt{(x - ae)^2 + y^2}$. Square, simplify and square again to get $(1 - e^2)x^2 + y^2 = a^2(1 - e^2)$. Substitute $b^2 = a^2(1 - e^2)$ and divide by $b^2$ to obtain
$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1,$$
which is the ellipse in standard form. □

Remark. Similarly, the foci of the hyperbola are the points $F_1$, $F_2$ which satisfy $\bigl||F_2X| - |F_1X|\bigr| = 2a$ for an arbitrary $X$ on the hyperbola. You can check this in the same way as above for the ellipse, with $F_1 = [ae, 0]$, $F_2 = [-ae, 0]$, $ae = \sqrt{a^2 + b^2}$.

Parabola focus. If the parabola has equation $x^2 = 2py$, the focus is the point $F = [0, \frac{p}{2}]$. It is characterized by the fact that the distance between this point and an arbitrary $X$ on the parabola is equal to the distance between $X$ and the line $y = -\frac{p}{2}$.

4.C.11. Find the foci of the ellipse $x^2 + 2y^2 = 2$.

Solution. From the equation, the lengths of the semi-axes are $a = \sqrt{2}$ and $b = 1$. Compute (see 4.C.10): $ae = \sqrt{a^2 - b^2} = 1$. The foci are at $[-1, 0]$ and $[1, 0]$. □

4.C.12. Prove that the product of the distances between the foci of an ellipse and any tangent line is constant. Find the value of the constant.

Solution. Every point $T$ on the ellipse has coordinates $T = (x, y)$ where $x = a\cos\theta$, $y = b\sin\theta$ for some $\theta$. The tangent line to the ellipse at $T$ has equation
$$y - b\sin\theta = -\frac{b\cos\theta}{a\sin\theta}(x - a\cos\theta), \quad\text{i.e.}\quad a(\sin\theta)(y - b\sin\theta) = -b(\cos\theta)(x - a\cos\theta).$$

4.3.5. Classification of quadrics. We return to the equation 4.2.6(1). The above results enable us to rewrite this equation in a suitable orthonormal frame of the difference space as
$$\sum_{i=1}^n \lambda_i x_i^2 + \sum_{i=1}^n b_i x_i + b = 0.$$
Hence we may assume that the quadric is given in this form. In the next step, we "complete the square" for the coordinates $x_i$ with $\lambda_i \neq 0$, which "absorbs" the squares together with the linear terms in the same variable. So only those linear terms are left which correspond to variables for which the coefficient of the quadratic term is zero. We have
$$\sum_{i=1}^n \lambda_i (x_i - p_i)^2 + \sum_j b_j x_j + c = 0,$$
where the summation over $j$ runs only over the $j$ satisfying $\lambda_j = 0$. This corresponds to a translation of the origin by the vector with coordinates $p_i$. For such a choice of basis of the difference space, the quadratic part attains the desired diagonal form. In the identification of quadratic forms with linear maps derived above, this means that $\varphi$ is diagonal on the orthogonal complement of its kernel. If there are also some linear terms left, the orthonormal basis of the difference space can be adjusted on the kernel of $\varphi$ so that the corresponding linear form is a multiple of the first term of the dual basis. Hence the final formula reads
$$\sum_{i=1}^k \lambda_i y_i^2 + b y_{k+1} + c = 0,$$
where $k$ is the rank of the matrix of the quadratic form $f$. If $b \neq 0$, it can be arranged that the constant $c$ in the equation is zero, by a further change of the origin. Hence the linear term may (but does not have to) appear only in the case that the rank of $f$ is less than $n$, while $c \in \mathbb{R}$ may be nonzero only if $b = 0$. The resulting equations are called the canonical analytic formulas for quadrics.

4.3.6. The case of $\mathcal{E}_2$. As an example of the previous procedure, we discuss the simplest case of a nontrivial dimension, namely dimension two. The original equation has the form
$$a_{11}x^2 + a_{22}y^2 + 2a_{12}xy + a_1 x + a_2 y + a = 0.$$
By a suitable choice of a basis of the difference space, and the subsequent completion of the square, it is written in the form (using the same notation $x$, $y$ for the new coordinates)
$$a_{11}x^2 + a_{22}y^2 + a_1 x + a_2 y + a = 0,$$
where $a_i$ is nonzero only in the case that $a_{ii}$ is zero. By the last step of the general procedure, exactly one of the following equations is involved (notice that the equations of quadrics are determined up to a multiple):

This meets the $x$-axis at the point $(a/\cos\theta, 0)$. The distance from the focus $F_1$ to the tangent line is $D_1 = (a/\cos\theta - ae)\sin\varphi$, where $\tan\varphi = \pm\frac{b\cos\theta}{a\sin\theta}$. Eliminate $\varphi$ to get
$$D_1^2 = a^2(1 - e\cos\theta)^2 \frac{b^2}{a^2\sin^2\theta + b^2\cos^2\theta} = (1 - e\cos\theta)^2 \frac{b^2}{\sin^2\theta + (1 - e^2)\cos^2\theta} = (1 - e\cos\theta)^2 \frac{b^2}{1 - e^2\cos^2\theta} = b^2\,\frac{1 - e\cos\theta}{1 + e\cos\theta}.$$
Since $D_2$ is the same as $D_1$ with $e$ replaced by $-e$, it follows that $D_1 D_2 = b^2$. □

Solution. (Alternative.) Consider the polar basis. The ellipse matrix has the diagonal shape $\operatorname{diag}(\frac{1}{a^2}, \frac{1}{b^2}, -1)$ and the tangent equation at $X = [x_0, y_0]$ is $\frac{x_0}{a^2}x + \frac{y_0}{b^2}y = 1$. The distance between $F_{1,2} = [\mp ae, 0]$ and this line equals
$$\frac{1 \pm \frac{e x_0}{a}}{\sqrt{\frac{x_0^2}{a^4} + \frac{y_0^2}{b^4}}}.$$
The product of the two distances is
$$\frac{1 - \frac{e^2 x_0^2}{a^2}}{\frac{x_0^2}{a^4} + \frac{y_0^2}{b^4}}.$$
If we substitute $a^2 e^2 = a^2 - b^2$ and $\frac{y_0^2}{b^2} = 1 - \frac{x_0^2}{a^2}$ (the point $X$ lies on the ellipse), we find that the previous expression equals $b^2$. □

4.C.13. Projective approach to conic sections. Projective space gives us the ability to approach conic sections from a new perspective (compare with 4.4.11). We can understand the conic sections in $\mathcal{E}_2$ defined by the quadratic form
$$f(x, y) = a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + 2a_{13}x + 2a_{23}y + a_{33}$$

$0 = \frac{x^2}{a^2} + \frac{y^2}{b^2} + 1$ … empty set
$0 = \frac{x^2}{a^2} + \frac{y^2}{b^2} - 1$ … ellipse
$0 = \frac{x^2}{a^2} - \frac{y^2}{b^2} - 1$ … hyperbola
$0 = \frac{x^2}{a^2} - 2py$ … parabola
$0 = \frac{x^2}{a^2} + \frac{y^2}{b^2}$ … point
$0 = \frac{x^2}{a^2} - \frac{y^2}{b^2}$ … 2 concurrent lines
$0 = x^2 - a^2$ … 2 parallel lines
$0 = x^2$ … 2 identical lines
$0 = x^2 + a^2$ … empty set

The origin of the Cartesian coordinates is the center of the studied conic. The new orthonormal basis of the difference space gives the directions of the semi-axes. The final coefficients $a$, $b$ then give the lengths of the semi-axes in the nondegenerate directions.

4. Projective geometry

In many elementary texts on analytic geometry, the authors finish with the affine and Euclidean objects described above. The affine and Euclidean geometries are sufficient for many practical problems, but not for all of them. For instance, in processing an image from a camera, angles are not preserved and parallel lines may (but do not have to) intersect. Moreover, it is often difficult to distinguish very small angles from zero angles, and thus it would be convenient to have tools which do not need such a distinction. The basic idea of projective geometry is to extend affine spaces by points at infinity. This permits an easy way of dealing with linear objects such as points, lines, planes, projections, etc.

4.4.1. Projective extension of the affine plane. We begin with the simplest interesting case, namely geometry in a plane. If we imagine the points in the plane $\mathcal{A}_2$ as the plane $z = 1$ in $\mathbb{R}^3$, then each point $P$ in the affine plane is represented by a vector $u = (x, y, 1) \in \mathbb{R}^3$, and so also by the one-dimensional subspace $\langle u \rangle \subset \mathbb{R}^3$. On the other hand, almost every one-dimensional subspace in $\mathbb{R}^3$ intersects the plane in exactly one point $P$, and the vectors of such a subspace are given by coordinates $(x, y, z)$ uniquely up to a common scalar multiple. Only the subspaces corresponding to vectors $(x, y, 0)$ do not intersect the plane.
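This correspondence between affine points and one-dimensional subspaces is easy to experiment with. The following small Sage sketch (entirely our own illustration, not from the text) embeds an affine point into the plane $z = 1$, rescales its representative, and normalizes back:

P = vector(QQ, [3, -2, 1])    # the affine point (3, -2) embedded in the plane z = 1
Q = 5*P                       # another representative of the same one-dimensional subspace
to_affine = lambda v: (v[0]/v[2], v[1]/v[2])   # normalize back to z = 1
print(to_affine(P) == to_affine(Q))            # True: both represent the same point
D = vector(QQ, [1, 1, 0])     # vectors with last coordinate 0 have no affine image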
Projective plane

Definition. The projective plane $\mathcal{P}_2$ is the set of all one-dimensional subspaces in $\mathbb{R}^3$. The homogeneous coordinates of a point $P = (x : y : z)$ in the projective plane are triples of real numbers given up to a common scalar multiple, at least one of which must be nonzero. A straight line in the projective plane is defined as the set of one-dimensional subspaces (i.e. points in $\mathcal{P}_2$) which generate a two-dimensional subspace (i.e. a plane) in $\mathbb{R}^3$.

as a set of points in the projective plane $\mathcal{P}_2$ with homogeneous coordinates $(x : y : z)$ which are the zero points of the homogeneous quadratic form
$$f(x, y, z) = a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + 2a_{13}xz + 2a_{23}yz + a_{33}z^2,$$
or rather $f(v) = v^T A v$, where $v$ is the column vector with coordinates $(x, y, z)$ and $A$ is the symmetric matrix $(a_{ij})$. By theorem 4.3.2, there exists a basis in which this quadratic form has one of the following equations: $f(x, y, z) = x^2 + y^2 + z^2$, or $f(x, y, z) = x^2 + y^2 - z^2$. In the former case, the only solution of $f(x, y, z) = 0$ is the trivial one, and therefore the original form does not represent a real conic section. The second quadratic form represents a cone in $\mathbb{R}^3$. We obtain the corresponding conic section by moving back to inhomogeneous coordinates, that is, by intersecting the cone with the plane which has the equation $z = 1$ in the original basis. Immediately we obtain the conic section classification from 4.29, which corresponds to intersecting the cone in $\mathbb{R}^3$ with different planes. Non-degenerate sections are depicted. Degenerate sections are those which pass through the vertex of the cone.

We define the following useful terms for a conic section in the projective plane. Points $P, Q \in \mathcal{P}_2$ corresponding to one-dimensional subspaces $\langle p \rangle$, $\langle q \rangle$ (generated by vectors $p, q \in \mathbb{R}^3$) are called polar conjugate with respect to the conic section $f$ if $F(p, q) = 0$, or rather $p^T A q = 0$. A point $P = \langle p \rangle$ is called a singular point of the conic section $f$ when it is polar conjugate with respect to $f$ with all points of the plane, so that $F(p, x) = 0$ for all $X \in \mathcal{P}_2$. In other words, $Ap = 0$. Hence the matrix $A$ of the conic section does not have maximal rank and therefore defines a degenerate conic section. Non-degenerate conic sections do not contain singular points.

The set of all points $X = \langle x \rangle$ which are polar conjugate with $P = \langle p \rangle$ is called the polar of the point $P$ with respect to the conic section $f$. It is therefore the set of points for which $F(p, x) = p^T A x = 0$. Because the polar is given by a linear equation in the coordinates, it is always (in the non-singular case) a line. The following explains the geometric interpretation of the polar.

4.C.14. Polar characterization. Consider a non-degenerate conic section $f$. The polar of a point $P \in f$ with respect to $f$ is the tangent to $f$ with the touch point $P$. The polar of

For a concrete example, consider two parallel lines in the affine plane $\mathbb{R}^2$:
$$L_1 : y - x - 1 = 0, \qquad L_2 : y - x + 1 = 0.$$
If the points of the lines $L_1$ and $L_2$ are finite points in the projective space $\mathcal{P}_2$, then their homogeneous coordinates $(x : y : z)$ satisfy the equations
$$L_1 : y - x - z = 0, \qquad L_2 : y - x + z = 0.$$
The intersection $L_1 \cap L_2$ is the point $(1 : 1 : 0) \in \mathcal{P}_2$ in this context. It is the point at infinity corresponding to the common direction vector of the lines.

4.4.2. Affine coordinates in the projective plane. If we begin with the projective plane and we want to see the affine plane as its "finite" part, then instead of the plane $z = 1$ we may take another plane $\sigma$ in $\mathbb{R}^3$ which does not pass through the origin $0 \in \mathbb{R}^3$.
Then the finite points are those one-dimensional subspaces which have a nonempty intersection with the plane $\sigma$. Consider the two parallel lines from the previous paragraph. Let us choose the plane $y = 1$ to obtain two lines in the affine plane:
$$L'_1 : 1 - x - z = 0, \qquad L'_2 : 1 - x + z = 0.$$
The "infinite" points of the former affine plane are given by $z = 0$. The lines $L'_1$ and $L'_2$ intersect at the "finite" point $(x, z) = (1, 0)$. This corresponds to the geometric concept that two parallel lines $L_1$, $L_2$ in the affine plane meet at infinity, at the point $(1 : 1 : 0)$, but this point becomes finite in a different finite affine plane.

4.4.3. Projective spaces. We shall not go for an axiomatic definition of projective spaces here. Instead, we generalize the procedure from the affine plane to each finite dimension. By choosing an arbitrary affine hyperplane $\mathcal{A}_n$ in the vector space $\mathbb{R}^{n+1}$ which does not pass through the origin, we may identify the points $P \in \mathcal{A}_n$ with the one-dimensional subspaces generated by these points. The remaining one-dimensional subspaces determine a hyperplane parallel to $\mathcal{A}_n$. They are called the infinite points in the projective extension $\mathcal{P}_n$ of the affine space $\mathcal{A}_n$.

The set of infinite points in $\mathcal{P}_n$ is always a projective space of dimension one less. An affine straight line has only one infinite point in its projective extension (both ends of the line "intersect" at infinity, and thus the projective line looks like a circle). The projective plane has a projective line of infinite points, the three-dimensional projective space has a projective plane of infinite points, etc.

More generally, we can define the projectivization of a vector space. For an arbitrary vector space $V$ of dimension $n + 1$, we define
$$\mathcal{P}(V) = \{P \subset V;\ P \text{ a vector subspace},\ \dim P = 1\}.$$

the point $P \notin f$ is the line defined by the touch points of the tangents to $f$ passing through $P$.

Solution. First consider $P \in f$. Suppose that the polar of $P$, defined by $F(p, x) = 0$, intersects $f$ in $Q = \langle q \rangle \neq P$. Then $F(p, q) = 0$ and $f(q) = F(q, q) = 0$. An arbitrary point $X = \langle x \rangle$ lying on the line through $P$ and $Q$ satisfies $x = \alpha p + \beta q$ for some $\alpha, \beta \in \mathbb{R}$. Because of the bilinearity and symmetry of $F$,
$$f(x) = F(x, x) = \alpha^2 F(p, p) + 2\alpha\beta F(p, q) + \beta^2 F(q, q) = 0.$$
So every point $X$ of the line lies on the conic section $f$. However, when a conic section contains a line, it has to be degenerate, which is a contradiction. The claim for $P \notin f$ follows from a corollary of the symmetry of the bilinear form $F$: when $Q$ lies on the polar of $P$, then $P$ lies on the polar of $Q$. □

Using polar conjugates we can find the axes and the centre of a conic section without using the Lagrange algorithm. Consider the conic section matrix as a block matrix
$$A = \begin{pmatrix} \bar{A} & a \\ a^T & \alpha \end{pmatrix},$$
where $\bar{A} = (a_{ij})$ for $i, j = 1, 2$, $a$ is the vector $(a_{13}, a_{23})^T$ and $\alpha = a_{33}$. This means that the conic section is defined by the equation $u^T \bar{A} u + 2a^T u + \alpha = 0$ for the vector $u = (x, y)^T$. Now we show that:

4.C.15. The axes of a conic section are the polars of the points at infinity determined by the eigenvectors of the matrix $\bar{A}$.

Solution. Because of the symmetry of $\bar{A}$, in the basis of its eigenvectors it has the diagonal shape
$$D = \begin{pmatrix} \lambda & 0 \\ 0 & \mu \end{pmatrix},$$
where $\lambda, \mu \in \mathbb{R}$, and this basis is orthogonal. Denote by $U$ the matrix changing the basis to the basis of eigenvectors (written in columns); then the conic section matrix transforms as
$$\begin{pmatrix} U^T & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \bar{A} & a \\ a^T & \alpha \end{pmatrix} \begin{pmatrix} U & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} D & U^T a \\ a^T U & \alpha \end{pmatrix}.$$

By choosing a basis $u$ in $V$ we obtain homogeneous coordinates on $\mathcal{P}(V)$. For a point $P \in \mathcal{P}(V)$ we use a nonzero arithmetic representative $u \in V$ and the coordinates of this vector in the basis $u$.
The points of the projective space $\mathcal{P}(V)$ are called geometric points. Their generators in $V$ are called arithmetic representatives. In the chosen homogeneous coordinates, we fix one of the coordinates to be one. Thus we exclude all points of the projective space $\mathcal{P}(V)$ which have this coordinate equal to zero, and we obtain an embedding of the $n$-dimensional affine space $\mathcal{A}_n \subset \mathcal{P}(V)$. This is precisely the construction used in the example of the projective plane.

4.4.4. Perspective projection. Our basic "projective" concepts are now (projective) points, lines, planes, etc., together with their incidences (i.e. a point on a line, a line in a plane, etc.). Thus the morphisms of projective geometry must respect these. The best known example is given by the perspective projections $\mathbb{R}^3 \to \mathbb{R}^2$. Imagine that an observer sitting at the origin observes "one half of the world", that is, the points $(X, Y, Z) \in \mathbb{R}^3$ with $Z > 0$. The observer sees the image "projected" on the screen given by the plane $Z = f > 0$.

picture missing!

Thus a point $(X, Y, Z)$ in the "real world" projects to a point $(x, y)$ on the screen as follows:
$$x = -f\frac{X}{Z}, \qquad y = -f\frac{Y}{Z}.$$
This is a nonlinear formula, and the accuracy of calculations is problematic when $Z$ is small. By extending this transformation to a map between projective spaces, we get $(X : Y : Z : W) \mapsto (x : y : z) = (-fX : -fY : Z)$, which is well defined for all $Z > 0$. That is, a map described by the simple linear formula
$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} -f & 0 & 0 & 0 \\ 0 & -f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} X \\ Y \\ Z \\ W \end{pmatrix}.$$
This simple expression defines the perspective projection for finite points in the open half-space in $\mathbb{R}^3 \subset \mathcal{P}_3$ which we

So in this basis the canonical form is determined by the vector $U^T a$ (up to a translation). Specifically, if we denote the eigenvectors by $v_\lambda$, $v_\mu$, then
$$\lambda\Bigl(x + \frac{a^T v_\lambda}{\lambda}\Bigr)^2 + \mu\Bigl(y + \frac{a^T v_\mu}{\mu}\Bigr)^2 = \frac{(a^T v_\lambda)^2}{\lambda} + \frac{(a^T v_\mu)^2}{\mu} - \alpha.$$
This means that the eigenvectors are the direction vectors of the axes of the conic section (the main directions). The equations of the axes in this basis are $x = -\frac{a^T v_\lambda}{\lambda}$ and $y = -\frac{a^T v_\mu}{\mu}$. The coordinates $u_\lambda$ and $u_\mu$ of the axes in the standard basis satisfy $v_\lambda^T u_\lambda = -\frac{a^T v_\lambda}{\lambda}$ and $v_\mu^T u_\mu = -\frac{a^T v_\mu}{\mu}$, because $v_\lambda^T(\lambda u_\lambda + a) = 0$ and $v_\mu^T(\mu u_\mu + a) = 0$. These equations are equivalent to the equations $v_\lambda^T(\bar{A} u_\lambda + a) = 0$ and $v_\mu^T(\bar{A} u_\mu + a) = 0$, which are the polar equations of the points defined by the vectors $v_\lambda$ and $v_\mu$. □

4.C.16. Remark. A corollary of the previous claim is that the centre of the conic section is polar conjugate with all points at infinity. The coordinates of the centre $s$ then satisfy the equation $\bar{A}s + a = 0$. If $\det(A) \neq 0$, then the equation $\bar{A}s + a = 0$ for the centre coordinates has exactly one solution if $\delta = \det(\bar{A}) \neq 0$, and no solution if $\delta = 0$. That means that, regarding the nondegenerate conic sections, the ellipse and the hyperbola have exactly one centre, while the parabola has no centre (its centre is a point at infinity).

4.C.17. Prove that the angle between the tangent to a parabola (at an arbitrary touch point) and the axis of the parabola is the same as the angle between the tangent and the line connecting the focus and the point of tangency.

Solution. The polar (i.e. the tangent) of a point $X = [x_0, y_0]$ of the parabola defined by the canonical equation in the polar basis is the line satisfying
$$(x_0, y_0, 1)\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & -p \\ 0 & -p & 0 \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = x_0 x - p y - p y_0 = 0.$$
The cosine of the angle between the tangent and the axis of the parabola ($x = 0$) is given by the dot product of the corresponding unit direction vectors.
The unit direction vector of the tangent is $\frac{1}{\sqrt{p^2 + x_0^2}}(p, x_0)$, and therefore
$$\frac{1}{\sqrt{p^2 + x_0^2}}(p, x_0)\cdot(0, 1) = \frac{x_0}{\sqrt{p^2 + x_0^2}}.$$
Now we show that this is the same as the cosine of the angle between the tangent and the line connecting the focus $F = [0, \frac{p}{2}]$ and the touch point $X$. The unit direction vector of the connecting line is $\frac{1}{\sqrt{x_0^2 + (y_0 - \frac{p}{2})^2}}(x_0, y_0 - \frac{p}{2})$. For the cosine of the angle we get
$$\frac{1}{\sqrt{p^2 + x_0^2}}\cdot\frac{1}{\sqrt{x_0^2 + (y_0 - \frac{p}{2})^2}}\Bigl(x_0 y_0 + \frac{p x_0}{2}\Bigr).$$

substitute as points with $W = 1$. In this way we eliminate the problems with points whose image runs to infinity. Indeed, if the $Z$-coordinate of a real point is close to zero, then the value of the third homogeneous coordinate of the image is close to zero, i.e. it corresponds to a point close to infinity.

4.4.5. Affine and projective transformations. Each injective linear map $\varphi : V_1 \to V_2$ between vector spaces maps one-dimensional subspaces to one-dimensional subspaces. Therefore, we get a map on the projectivizations $T : \mathcal{P}(V_1) \to \mathcal{P}(V_2)$. We call such maps projective maps or homographies. A slightly more general concept posits the property that a map $C : \mathcal{P}(V_1) \to \mathcal{P}(V_2)$ is bijective and maps projective lines to projective lines. These are called collineations. Of course, every invertible homography is a collineation.¹

Otherwise put, a projective map is a map between projective spaces such that in each system of homogeneous coordinates on the domain and image, it is given by multiplication by a matrix. More generally, if the auxiliary linear map is not injective, then we need to define the projective map only outside of its kernel, that is, on the points whose homogeneous coordinates do not map to zero. This is what we saw in the previous paragraph when discussing the perspective projections.

In general, a collineation does not have to be a homography. For example, if we work with complex vector spaces, then we may use the field homomorphism $z \mapsto \bar{z}$, i.e. the conjugation, to define the map $F : (z_0 : \cdots : z_n) \mapsto (\bar{z}_0 : \cdots : \bar{z}_n)$ in any fixed homogeneous coordinates. Clearly this is bijective, and since collinearity is computed by subdeterminants (and determinants are equivariant with respect to each field homomorphism), collinearity is preserved. Thus, $F$ is a collineation which does not come from a linear automorphism of the vector space $\mathbb{C}^{n+1}$. Fortunately, there is the so-called fundamental theorem of projective geometry, saying that in dimensions at least two, each collineation is composed of a homography and a collineation coming from a field homomorphism in the above way. Of course, in dimension one, every bijection is a collineation and the concept does not make much sense there. Notice that there are no nontrivial field homomorphisms on the real scalars $\mathbb{R}$, and thus the two concepts coincide for our real projective spaces in dimensions at least two.

Since injective maps $V \to V$ of a vector space to itself are invertible, all projective maps of a projective space $\mathcal{P}_n$ to itself are invertible. They are also called regular collineations or projective transformations. In homogeneous coordinates, they correspond to invertible matrices of dimension $n + 1$. Two such matrices define the same projective transformation if and only if one is a (nonzero) multiple of the other.

¹ Since projective geometry is an old and rich mathematical discipline with very abstract current versions, there is a lot of terminology and different names for similar or identical objects involved.

Substituting $y_0 = \frac{x_0^2}{2p}$, we obtain $\frac{x_0}{\sqrt{p^2 + x_0^2}}$.
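A quick numeric spot-check of the equality just derived may be reassuring; the following Sage lines (with sample values chosen by us) compare the two cosines exactly.

p, x0 = 3/2, 2                 # our sample parameter and touch point abscissa
y0 = x0^2 / (2*p)              # the touch point lies on the parabola x^2 = 2*p*y
cos_axis = x0 / sqrt(p^2 + x0^2)
cos_focal = (x0*y0 + p*x0/2) / (sqrt(p^2 + x0^2) * sqrt(x0^2 + (y0 - p/2)^2))
print(cos_axis == cos_focal)   # True (both equal 4/5 here)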
This example shows that light rays striking parallel to the axis of a parabolic mirror are reflected to the focus and, vice versa, light rays going through the focus are reflected in the direction parallel to the axis of the parabola. This is the principle of many devices, such as parabolic reflectors. □

Solution. (Alternative.) At the point $P = (at^2, 2at)$ on the parabola, the tangent line has slope $1/t$ and the focus is at $(a, 0)$. So the line joining $P$ to the focus $F$ has slope
$$\frac{2at - 0}{at^2 - a} = \frac{2t}{t^2 - 1}.$$
If $\theta$ is the angle between the tangent line and the $x$-axis, then $\tan\theta = 1/t$, so
$$\tan 2\theta = \frac{2\tan\theta}{1 - \tan^2\theta} = \frac{2/t}{1 - 1/t^2} = \frac{2t}{t^2 - 1}.$$
By subtraction, the angle between the tangent line and the line joining $P$ to the focus is $\theta$. Note that the tangent line meets the $x$-axis at the point $Q = (-at^2, 0)$. The result also follows from showing that $|FP| = |FQ|$, and hence the triangle $QFP$ is isosceles. □

You can find many more examples on quadrics starting at 4.D.1.

If we choose the first coordinate as the one whose vanishing defines the infinite points, then the transformations preserving the infinite points are given by matrices whose first row vanishes except for its first element. If we wish to switch to affine coordinates of the finite points (i.e. the first coordinate is fixed at one), the first element in the first row also equals one. Hence the matrices of collineations preserving the finite points of the affine space have the form
$$\begin{pmatrix} 1 & 0 & \cdots & 0 \\ b_1 & a_{11} & \cdots & a_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ b_n & a_{n1} & \cdots & a_{nn} \end{pmatrix},$$
where $b = (b_1, \dots, b_n)^T \in \mathbb{R}^n$ and $A = (a_{ij})$ is an invertible matrix of dimension $n$. The action of such a matrix on a vector $(1, x_1, \dots, x_n)$ is exactly a general affine transformation, where $b$ is the translation and $A$ is its linear part. Thus the affine maps are exactly those collineations which preserve the hyperplane of points at infinity.

4.4.6. Determining collineations. In order to define an affine map, it is necessary and sufficient to define the image of an affine frame. In the above description of affine transformations as special cases of projective maps, this corresponds to a suitable choice of the image of a suitable arithmetic basis of the vector space $V$. In general, it is not true that the image of an arithmetic basis of $V$ determines the collineation uniquely. The basic problem is illustrated by a simple example in the affine plane. Choose four points $A$, $B$, $C$, $D$ in the plane such that no three of them lie on a line. Then choose their images under the collineation as follows: choose arbitrarily four images $A'$, $B'$, $C'$, $D'$ with the same property, and choose homogeneous coordinates $u$, $v$, $w$, $z$, $u'$, $v'$, $w'$, $z'$ of all these points in $\mathbb{R}^3$. The vectors $z$ and $z'$ can be written as linear combinations
$$z = c_1 u + c_2 v + c_3 w, \qquad z' = c'_1 u' + c'_2 v' + c'_3 w',$$
where all six coefficients must be nonzero, since otherwise there would exist three of the points not in general position. Now choose new arithmetic representatives $\tilde{u} = c_1 u$, $\tilde{v} = c_2 v$ and $\tilde{w} = c_3 w$ of the points $A$, $B$ and $C$ respectively, and similarly $\tilde{u}' = c'_1 u'$, $\tilde{v}' = c'_2 v'$ and $\tilde{w}' = c'_3 w'$ for the points $A'$, $B'$ and $C'$. This choice defines a unique linear map $\varphi$ which maps successively $\varphi(\tilde{u}) = \tilde{u}'$, $\varphi(\tilde{v}) = \tilde{v}'$, $\varphi(\tilde{w}) = \tilde{w}'$. But then
$$\varphi(z) = \varphi(\tilde{u} + \tilde{v} + \tilde{w}) = \tilde{u}' + \tilde{v}' + \tilde{w}' = z',$$
and so the constructed collineation maps the points chosen in advance as required. The linear map $\varphi$ is given uniquely by the construction, thus the collineation is given uniquely by this choice.

The argument holds also in the case when some of the chosen points are infinite (i.e.
one or two of them). The same phenomenon can be seen even more easily for the regular collineations of a projective line: these are determined by pairwise different images of three pairwise different points. The procedure works in an arbitrary dimension $n$. We say that $n + 2$ points are in general position if no $n + 1$ of them lie in the same hyperplane. We also call such points linearly independent; they form a geometric basis of the projective space.

Theorem. A regular collineation on an $n$-dimensional projective space is uniquely determined by linearly independent images of $n + 2$ linearly independent points.

Proof. The proof is exactly the same as in dimension two. We recommend writing it out in detail as an exercise. □

4.4.7. Cross-ratio. Recall that affine maps preserve the ratios of lengths of line segments on each line. Technically, we defined this ratio for three points $A$, $B$ and $C \neq B$, $C = rA + sB$, as $\lambda = (C; A, B) = -\frac{s}{r}$. For central projections the ratios are not preserved. Moreover, even the relative position of points on a line is not necessarily preserved. On the contrary, we may determine a projective transformation uniquely by choosing arbitrarily the images of three pairwise different points on a projective line. However, one can show relatively easily that the ratio of two such ratios, for two distinct points $C$, is preserved.

Consider four distinct points $A$, $B$, $C$, $D$ in a projective space, with arithmetic coordinates $x$, $y$, $w$, $z$ respectively, which lie on a projective line. Since all four vectors lie in the subspace generated by $\langle x, y \rangle$, we may write $w$ and $z$ as the linear combinations
$$w = t_1 x + s_1 y, \qquad z = t_2 x + s_2 y.$$
Define the cross-ratio of the four points $(A, B, C, D)$ as
$$\rho = \frac{s_1}{t_1}\cdot\frac{t_2}{s_2}.$$
The definition is valid: although the vectors $x$ and $y$ are determined only up to scalar multiples, this freedom cancels out in the definition. Each projective transformation preserves cross-ratios. Indeed, if the transformation is given in arithmetic coordinates by a matrix $A$, we have the images $A \cdot w = t_1 A \cdot x + s_1 A \cdot y$, and similarly for $A \cdot z$. Therefore the four images have the same cross-ratio.

We discuss the characterization of projective transformations. These are exactly those maps which preserve cross-ratios. But this is not a very practical characterization, since it implicitly contains the claim that these maps map projective lines to projective lines. We shall not go into details here, but one can prove a much stronger statement: a map of an arbitrarily small open area in the affine space $\mathbb{R}^n$ (e.g. a ball without its boundary) into the same affine space which maps lines to lines is actually the restriction of a uniquely determined, globally defined collineation of the projective extension $\mathcal{P}\mathbb{R}^{n+1}$ of the former affine space $\mathbb{R}^n$. As mentioned above, all collineations of real projective spaces of finite dimension at least two are projective transformations in our sense, and thus these transformations also preserve cross-ratios.

4.4.8. Duality. Projective hyperplanes in an $n$-dimensional projective space $\mathcal{P}(V)$ are defined as the projectivizations of the $n$-dimensional vector subspaces in the vector space $V$. Hence in homogeneous coordinates they are defined as the kernels of linear forms $\alpha \in V^*$, which in turn are determined up to a scalar multiple. Thus in a chosen arithmetic basis, a projective hyperplane is given by a row vector $\alpha = (\alpha_0, \dots, \alpha_n)$. But the forms $\alpha$ are given uniquely up to a scalar multiple.
Therefore, each hyperplane in $V$ is identified with exactly one geometric point in the projectivization of the dual space $\mathcal{P}(V^*)$. We call this space the dual projective space, and we talk about a duality between points and hyperplanes.

On forms, the linear map defining a given collineation acts by multiplication of row vectors from the right by the same matrix: $\alpha = (\alpha_0, \dots, \alpha_n) \mapsto \alpha \cdot A$. Viewing the coordinates $\alpha_i$ as columns, this corresponds to the action of the dual map with the matrix $A^T$. But the dual map maps forms in the opposite direction, from the "target space" to the "initial one". Therefore, in order to study the effect of regular collineations on points and their dual hyperplanes, the inverse map of the collineation $f$ is required. The inverse is given by the matrix $A^{-1}$. Hence the matrix for the action of the corresponding collineation on forms is $(A^T)^{-1}$. Since the inverse matrix equals the algebraically adjoint matrix $A^*_{\mathrm{alg}}$ up to multiplication by the inverse of the determinant (see equation (1) on page 106), we can work directly with the projective transformation of the space $\mathcal{P}(V^*)$ given by the matrix $(A^*_{\mathrm{alg}})^T$ (or without transposing, if we multiply row vectors from the right).

The projective point $X$ belongs to the hyperplane $\alpha$ if the arithmetic coordinates satisfy $\alpha \cdot x = 0$. This still holds after acting with an arbitrary collineation, since $(\alpha \cdot A^{-1}) \cdot (A \cdot x) = \alpha \cdot x = 0$.

4.4.9. Fixed points, centers and axes. Consider a regular collineation $f$ given in an arithmetic basis of the projective space $\mathcal{P}(V)$ by a matrix $A$. By a fixed point of the collineation $f$ we mean a point $P$ which is mapped to itself, that is, $f(P) = P$. By a fixed hyperplane of the collineation $f$ we mean a hyperplane $\alpha$ which is mapped to itself, that is, $f(\alpha) \subset \alpha$. Hence the arithmetic representatives of the fixed points are exactly the eigenvectors of the matrix $A$.

In the geometry of the plane, we meet many types of collineations: reflection through a point, reflection across a line, translation, homothety, etc. Perhaps we remember also some types of projections, e.g. the projection of one plane in $\mathbb{R}^3$ to another from a center $S \in \mathbb{R}^3$. Note also that in all these cases of affine maps, fixed lines appear next to the fixed points. For example, the reflection through a point preserves also all lines passing through this point. In the case of a translation, the infinite points behave similarly. Now we discuss this phenomenon in an arbitrary dimension.

First, we define a classical notion related to the incidence of points and hyperplanes. The set of hyperplanes passing through a point $P \in \mathcal{P}(V)$ is the set of all hyperplanes which contain the point $P$. For each point $P$, the corresponding set of hyperplanes is itself a hyperplane in the dual space $\mathcal{P}(V^*)$: it is given by one homogeneous linear equation in the arithmetic coordinates.

For a collineation $f : \mathcal{P}(V) \to \mathcal{P}(V)$, a point $S \in \mathcal{P}(V)$ is called a center of the collineation $f$ if all hyperplanes in the set determined by $S$ are fixed hyperplanes. A hyperplane $\alpha$ is called an axis of the collineation $f$ if all its points are fixed points. It follows that the axis of a collineation is the center of the dual collineation, while the set of hyperplanes defining the center of a collineation is the axis of the dual collineation.
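Since the fixed points are just the eigenvectors of the matrix, they are mechanical to compute. A small Sage sketch (the matrix is our own example, encoding the affine map $(x, y) \mapsto (2x, y + 1)$ in homogeneous coordinates $(z : x : y)$, with the vanishing of the first coordinate defining the infinite points):

A = matrix(QQ, [[1, 0, 0],
                [0, 2, 0],
                [1, 0, 1]])
for eigenvalue, vectors, multiplicity in A.eigenvectors_right():
    # each eigenvector gives the homogeneous coordinates of a fixed point
    print(eigenvalue, vectors)

Here both fixed points turn out to lie at infinity (the directions of the two coordinate axes), in line with the observation above that translations leave the infinite points fixed.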
Since the matrices of a collineation on the original space and on the dual space differ only by transposition, their eigenvalues coincide (the eigenvectors are the column vectors, respectively the row vectors, corresponding to the same eigenvalues). For example, in the projective plane (and for the same reason in each real projective space of even dimension) each collineation has at least one fixed point, since the characteristic polynomials of the corresponding linear maps are of odd degree, and hence have at least one real root.

Instead of discussing a general theory, we illustrate its usefulness in several results for projective planes.

Proposition. A projective transformation of the projective plane other than the identity has either exactly one center and exactly one axis, or it has neither a center nor an axis.

Proof. Consider a collineation $f$ on $\mathcal{P}\mathbb{R}^3$ and assume that it has two distinct centers $A$ and $B$. Denote by $\ell$ the line given by these two centers, and choose a point $X$ in the projective plane outside of $\ell$. If $p$ and $q$ are the lines passing through the pairs of points $(A, X)$ and $(B, X)$ respectively, then $f(p) = p$ and $f(q) = q$. In particular, $X$ is fixed. But then all points of the plane outside of $\ell$ are fixed. Hence each line different from $\ell$ has all its points outside of $\ell$ fixed, and thus also its intersection with $\ell$ is fixed. It follows that $f$ is the identity mapping. So it is proved that every projective transformation other than the identity has at most one center. The same argument for the dual projective plane proves that there is at most one axis.

If $f$ has a center $A$, then all lines passing through $A$ are fixed. They correspond therefore to a two-dimensional subspace of row eigenvectors of the matrix corresponding to the transformation $f$. Therefore, there exists a two-dimensional subspace of column eigenvectors for the same eigenvalue. This represents exactly a line of fixed points, hence the axis. The same consideration in the reversed order proves the opposite statement: if a projective transformation of the plane has an axis, then it also has a center. □

picture missing!

For practical problems it is useful to work with complex projective extensions also in the case of a real plane. Then the geometric behaviour can be easily read off from the potential existence of real or imaginary centers and axes.

4.4.10. Pappus theorem. The following result, known as the Pappus theorem, is a classical result of projective geometry.

Proposition. Let two triples of distinct consecutive collinear points $\{A, B, C\}$ and $\{A', B', C'\}$ lie on two lines that meet at the point $T$, which is closest to $A$ and $A'$, respectively. Define the points $Q$, $R$ and $S$ as
$$Q = [AB'] \cap [BA'], \quad R = [AC'] \cap [CA'], \quad S = [BC'] \cap [CB'].$$
Then $\{Q, R, S\}$ are also collinear.

picture missing!

Proof. Without loss of generality, consider the plane passing through $\{T, A, B, C, A', B', C'\}$ as the 2-dimensional plane in $\mathcal{P}_2$ defined by $z = 1$ in the homogeneous coordinates $(x : y : z)$. The points $\{T, A, B, C, A', B', C'\}$ may be considered as objects in $\mathcal{P}_2$, representing lines through the origin in $\mathbb{R}^3$ with direction vectors $\{t, a, b, c, a', b', c'\}$ respectively. These can be chosen up to a real non-zero factor. The condition $\{z = 1\}$ uniquely identifies those points in $\mathbb{R}^3$, regardless of the choice of $\{t, a, b, c, a', b', c'\}$. Since $\{T, A, B, C\}$ are collinear points (they lie in the same 2-dimensional linear subspace of $\mathbb{R}^3$), we may assume that this plane is generated by $t$ and $a$.
Choose $b = t + a$, $c = \lambda t + a$, and analogously, for $\{T, A', B', C'\}$, $b' = t + a'$, $c' = \lambda' t + a'$ for some real constants $\lambda$ and $\lambda'$. It is only necessary to show that the vectors $q$, $r$, $s$ representing $Q$, $R$, $S$ in $\mathcal{P}_2$ generate a 2-dimensional subspace in $\mathbb{R}^3$. Since $(t + a) + a' = a + (t + a')$, the vector $q = t + a + a'$ represents $Q$. Since
$$\lambda\lambda' t + \lambda' a + \lambda a' = \lambda(\lambda' t + a') + \lambda' a = \lambda'(\lambda t + a) + \lambda a',$$
the vector $r = \lambda\lambda' t + \lambda' a + \lambda a'$ represents $R$. Finally,
$$s = q - r = t + a + a' - \lambda\lambda' t - \lambda' a - \lambda a' = (1 - \lambda')(t + a) + (1 - \lambda)(\lambda' t + a') = (1 - \lambda)(t + a') + (1 - \lambda')(\lambda t + a)$$
represents the point $S$. Thus, the points $\{Q, R, S\}$ lie in the 2-dimensional subspace generated by the vectors $q$ and $r$. Since $Q$, $R$, $S$ also belong to the plane $\{z = 1\}$, these points are collinear. □

4.4.11. Projective classification of quadrics. To end this section, we return to conics and quadrics. A quadric $Q$ in the $n$-dimensional affine space $\mathbb{R}^n$ is defined by the general quadratic equation 4.2.6(1) on page 342. By viewing the affine space $\mathbb{R}^n$ as the affine coordinates in the projective space $\mathcal{P}\mathbb{R}^{n+1}$, we may wish to describe the set $Q$ by homogeneous coordinates in the projective space. The formula in these coordinates should contain only terms of second order, since only a homogeneous formula is independent of the choice of the multiple of the homogeneous coordinates $(x_0, x_1, \dots, x_n)$ of a point. Hence we search for a homogeneous formula whose restriction to the affine coordinates (that is, the substitution $x_0 = 1$) gives the original formula 4.2.6(1). But this is especially easy: simply add the right number of factors $x_0$ to all terms, namely none to the quadratic terms, one to the linear terms, and $x_0^2$ to the constant term in the original affine equation for $Q$. We obtain a well defined quadratic form $f = \sum_{i,j=0}^n a_{ij} x_i x_j$ on the vector space $\mathbb{R}^{n+1}$, whose zero set correctly defines the projective quadric $\bar{Q}$.

The intersection of the "cone" $\bar{Q} \subset \mathbb{R}^{n+1}$, the zero set of this form, with the affine plane $x_0 = 1$ is the original quadric $Q$, whose points are called the proper points of the quadric. The other points $\bar{Q} \setminus Q$ in the projective extension are the infinite points.

The classification of real or complex projective quadrics, up to projective transformations, is a problem already considered: it is all about finding the canonical polar basis, see paragraph 4.3.2. From this classification, given by the signature of the form in the real case and by the rank only in the complex case, we can deduce also the classification of the affine quadrics. We show the essential part of the procedure in the case of conics in the affine and the projective plane. The projective classification gives the following possibilities, described by homogeneous coordinates $(x : y : z)$ in the projective plane $\mathcal{P}\mathbb{R}^3$:
• imaginary regular conic given by $x^2 + y^2 + z^2 = 0$,
• real regular conic given by $x^2 + y^2 - z^2 = 0$,
• pair of imaginary lines given by $x^2 + y^2 = 0$,
• pair of real lines given by $x^2 - y^2 = 0$,
• one double line $x^2 = 0$.

We consider this classification as real, that is, the classification of the quadratic forms is given not only by the rank but also by the signature. Nevertheless, the points of a quadric are considered also in the complex extension. The stated names should be understood in this way; for example, the imaginary conic does not have any real points.

4.4.12. Affine classification of quadrics. For an affine classification, we must restrict the projective transformations to those which preserve the projective subspace of infinite points.
This can be seen also by the converse procedure: for a fixed projective type of conic $Q$, that is, its cone $\tilde{Q} \subset \mathbb{R}^3$, we choose different affine planes $\alpha \subset \mathbb{R}^3$ which do not pass through the origin, and we observe the changes to the set of points $\tilde{Q} \cap \alpha$, which are the proper points of $Q$ in the affine coordinates realized by the plane $\alpha$.

Hence in the case of a regular conic there is the real cone $\tilde{Q}$ given by the equation $z^2 = x^2 + y^2$. As the planes $\alpha$ we may, for instance, choose the tangent planes to the unit sphere. If we begin with the plane $z = 1$, the intersection consists only of finite points forming the unit circle $Q$. By a gradual change of the slope of $\alpha$ we obtain a more and more stretched ellipse, until we reach a slope for which $\alpha$ is parallel to one of the lines of the cone. At that moment, one (double) infinite point of the conic appears, while the finite points still form one connected component; so we have a parabola. Continuing to change the slope gives rise to two infinite points. The set of finite points is then no longer connected, and so we obtain the last regular quadric in the affine classification, a hyperbola.

The method just introduced enables us to continue the classification in higher dimensions. In particular, we notice that the intersection of the conic with the projective line of infinite points is always a quadric in dimension one less: it is either the empty set, or a double point, or two points, these being the types of quadrics on a projective line. Moreover, an affine transformation can transform one of the possible realizations of a fixed projective type into another one only if the corresponding quadrics in the subspace of infinite points are projectively equivalent. In this way, it is possible to continue the classification of quadrics in dimension three and above.

D. Further exercises on this chapter

4.D.1. [To be moved to an appropriate spot in Chapter 8 or 9] Show that all twice differentiable mappings $F : \mathbb{R}^n \to \mathbb{R}^n$ which preserve distances are affine mappings, and thus Euclidean (also the so-called rigid motions).

Solution. Suppose $F : \mathbb{R}^n \to \mathbb{R}^n$ preserves distances, i.e., $\|F(x + tv) - F(x)\| = \|tv\|$, where $t \in \mathbb{R}$, $v \in \mathbb{R}^n$. Then the Taylor expansion says
$$t\|v\| = \|F(x + tv) - F(x)\| = \|DF(x)(tv) + \tfrac{1}{2}D^2F(x + stv)(tv, tv)\| = t\|DF(x)(v) + \tfrac{t}{2}D^2F(x + stv)(v, v)\|,$$
where $x + stv$ is a point between $x$ and $x + tv$, i.e. $0 \le s \le 1$. Now, the limit for $t \to 0$ leads to $\|v\| = \|DF(x)(v)\|$, i.e., the expected fact that the differential of $F$ must be an orthogonal linear map. Consequently, writing $F_{x_i}$ for the partial derivatives, its columns are orthogonal:
$$(1)\qquad F_{x_i} \cdot F_{x_j} = \delta_{ij}.$$
In particular, $F$ is invertible on a neighborhood of $x$ and we may look at $G(y) = DF(x)^{-1} \circ F(y)$ instead of $F$. Now, the differential of $G$ is the identity matrix at the point $x$, and $G$ preserves distances again. Thus, differentiating the equation (1) for $G$ we arrive at
$$G_{x_i x_k}(x) \cdot G_{x_j}(x) + G_{x_i}(x) \cdot G_{x_j x_k}(x) = 0,$$
and taking into account $G^j_{x_i}(x) = \delta^j_i$, the latter equation reduces to $\alpha_{jik} = -\alpha_{ijk}$, where we write $\alpha_{ijk} = G^i_{x_j x_k}(x)$ for the second partial derivatives of the relevant component function. Clearly, we also know $\alpha_{ijk} = \alpha_{ikj}$, since the second derivatives commute. Thus,
$$\alpha_{ijk} = -\alpha_{jik} = -\alpha_{jki} = \alpha_{kji} = \alpha_{kij} = -\alpha_{ikj} = -\alpha_{ijk},$$
and thus all the quantities $\alpha_{ijk}$ have to vanish.
This means that all the second partial derivatives of $G$ at the point $x$ vanish, and by the definition, this also implies that all the second order partial derivatives of $F(y) = DF(x) \circ G(y)$ at the point $x$ vanish (write down the formulae explicitly in case of any doubts!). The point $x$ was arbitrary, and thus we have verified, via the Taylor expansion, that $F$ is affine. □

4.D.2. Find a parametric equation for the intersection of the following planes in $\mathbb{R}^3$:
$$\sigma : 2x + 3y - z + 1 = 0 \quad\text{and}\quad \rho : x - 2y + 5 = 0.$$ ⃝

4.D.3. Find a common perpendicular for the skew lines
$$p : [1, 1, 1] + t(2, 1, 0), \qquad q : [2, 2, 0] + t(1, 1, 1).$$ ⃝

4.D.4. Jarda is standing at $[-1, 1, 0]$ and has a stick of length 4. Can he simultaneously touch the lines $p$ and $q$, where
$$p : [0, -1, 0] + t(1, 2, 1), \qquad q : [3, 4, 8] + s(2, 1, 3)?$$
(The stick must pass through $[-1, 1, 0]$.) ⃝

4.D.5. A cube $ABCDEFGH$ is given. The point $T$ lies on the edge $BF$, with $|BT| = \frac{1}{4}|BF|$. Compute the cosine of the angle between $ATC$ and $BDE$. ⃝

4.D.6. A cube $ABCDEFGH$ is given. The point $T$ lies on the edge $AE$, with $|AT| = \frac{1}{4}|AE|$. $S$ is the midpoint of $AD$. Compute the cosine of the angle between $BDT$ and $SCH$. ⃝

4.D.7. A cube $ABCDEFGH$ is given. The point $T$ lies on the edge $BF$, with $|BT| = \frac{1}{3}|BF|$. Compute the cosine of the angle between $ATC$ and $BDE$. ⃝

4.D.8. What are the lengths of the semi-axes of an ellipse, if the sum of their lengths and the distance between the foci both equal 1?

Solution. It is given that $a + b = 1$ and $2ae = 1$. Also $b^2 = a^2(1 - e^2)$. Eliminating $e$ gives $b^2 = a^2 - \frac{1}{4}$. So $\frac{1}{4} = a^2 - b^2 = (a - b)(a + b) = a - b$. So $a = 5/8$ and $b = 3/8$. □

Solution. (Alternative.) Solve the system $a + b = 1$, $2ae = 2\sqrt{a^2 - b^2} = 1$, and find the solution $a = \frac{5}{8}$, $b = \frac{3}{8}$. □

4.D.9. For what slopes $k$ are the lines passing through $[-4, 2]$ secant and tangent lines of the ellipse defined by $\frac{x^2}{9} + \frac{y^2}{4} = 1$?

Solution. The direction vector of the line is $(1, k)$, and its parametric equations then are $x = -4 + t$, $y = 2 + kt$. The intersections with the ellipse satisfy
$$\frac{(-4 + t)^2}{9} + \frac{(2 + kt)^2}{4} = 1.$$
This quadratic equation has discriminant $D = -\frac{k}{9}(7k + 16)$ (up to a positive multiple). This implies that for $k \in (-\frac{16}{7}, 0)$ there are two solutions, and the line is a secant. For $k = -\frac{16}{7}$ and $k = 0$ there is only one solution, and the line is a tangent to the ellipse. □

4.D.10. Find all lines tangent to the ellipse $3x^2 + 7y^2 = 30$ such that the distance from the centre of the ellipse to the tangent equals 3.

Solution. All lines at distance 3 from the origin are tangent to the circle with centre at $[0, 0]$ and radius 3. They all have an equation $x\cos\theta + y\sin\theta = 3$ for some $\theta$. This line meets the standard ellipse $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$ where
$$\frac{x^2}{a^2} + \frac{(3 - x\cos\theta)^2}{b^2\sin^2\theta} = 1, \quad\text{or}\quad x^2(a^2\cos^2\theta + b^2\sin^2\theta) - 6a^2 x\cos\theta - a^2(b^2\sin^2\theta - 9) = 0.$$
It is a tangent line if the above equation has a double root in $x$. Thus it is required that
$$36a^4\cos^2\theta = 4a^2(a^2\cos^2\theta + b^2\sin^2\theta)(9 - b^2\sin^2\theta).$$
This simplifies to requiring that $a^2\cos^2\theta + b^2\sin^2\theta = 9$, which implies
$$\cos^2\theta = \frac{9 - b^2}{a^2 - b^2}, \qquad \sin^2\theta = \frac{a^2 - 9}{a^2 - b^2}.$$
For the given problem, $a^2 = 10$ and $b^2 = 30/7$. The solution is $x\sqrt{33} + y\sqrt{7} = 3\sqrt{40}$. □

Solution. (Alternative.) The tangent line is $y - b\sin\theta = -\frac{b\cos\theta}{a\sin\theta}(x - a\cos\theta)$, with $a^2 = 10$ and $b^2 = 30/7$. The distance to the origin being 3 implies $3 = (a/\cos\theta)\sin\varphi$, where $\tan\varphi = \frac{b\cos\theta}{a\sin\theta}$. Hence $3\cos\theta = a\sin\varphi$ and, similarly, $3\sin\theta = b\cos\varphi$, so that
$$\frac{9}{a^2}\cos^2\theta + \frac{9}{b^2}\sin^2\theta = 1, \quad\text{i.e.}\quad 9\cos^2\theta + 21\sin^2\theta = 10, \quad\text{giving}\quad 12\sin^2\theta = 1.$$
One of the resulting tangents is
$$y - \sqrt{5/14} = -\sqrt{33/7}\,\bigl(x - \sqrt{55/6}\bigr).$$ □
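The tangency condition from the first solution is also quick to resolve in Sage; the sketch below (our own check, not part of the exercise) solves for $\sin^2\theta$ directly.

s2 = var('s2')                    # s2 stands for sin(theta)^2
a2, b2 = 10, 30/7                 # from 3x^2 + 7y^2 = 30
print(solve(a2*(1 - s2) + b2*s2 == 9, s2))   # [s2 == (7/40)], so cos(theta)^2 == 33/40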
Solution. (Alternative.) The centre of the ellipse is at the origin. The distance $d$ between the line $ax + by + c = 0$ and the origin is $d = \frac{|c|}{\sqrt{a^2 + b^2}}$. The tangent then satisfies $a^2 + b^2 = \frac{c^2}{9}$. The equation of the tangent touching the ellipse at the point $[x_T, y_T]$ is $3x x_T + 7y y_T - 30 = 0$. Hence, for the coordinates of the point of tangency,
$$(3x_T)^2 + (7y_T)^2 = 100, \qquad 3x_T^2 + 7y_T^2 = 30.$$
Its solution is $x_T = \pm\sqrt{\frac{55}{6}}$, $y_T = \pm\sqrt{\frac{5}{14}}$. Considering the symmetry of the ellipse, there are four solutions
$$\pm 3\sqrt{\tfrac{55}{6}}\,x \pm 7\sqrt{\tfrac{5}{14}}\,y - 30 = 0.$$ □

4.D.11. The hyperbola $x^2 - y^2 = 2$ is given. Find an equation of the hyperbola having the same foci and passing through the point $[-2, 3]$.

Solution. The given hyperbola has $a^2 = b^2 = 2$, so $a^2e^2 = a^2 + b^2 = 4$, and the foci are at $(\pm ae, 0) = (\pm 2, 0)$. So the desired hyperbola has the equation
$$\sqrt{(x - 2)^2 + y^2} - \sqrt{(x + 2)^2 + y^2} = k$$
for some constant $k$. Since the hyperbola passes through $[-2, 3]$, $k = 2$. Rewrite this as $\sqrt{(x - 2)^2 + y^2} = \sqrt{(x + 2)^2 + y^2} + 2$ and square:
$$(x - 2)^2 + y^2 = (x + 2)^2 + y^2 + 4\sqrt{(x + 2)^2 + y^2} + 4.$$
Simplifying and squaring again gives $(-2x - 1)^2 = (x + 2)^2 + y^2$, or $3x^2 = y^2 + 3$, which is the required hyperbola. □

Solution. (Alternative.) The equation of the desired hyperbola is $\frac{x^2}{a^2} - \frac{y^2}{b^2} = 1$, with its eccentricity $e$ satisfying $a^2e^2 = a^2 + b^2 = 4$, since the foci are at $[\pm ae, 0] = [\pm 2, 0]$. The point $[-2, 3]$ lies on the hyperbola, so $\frac{4}{a^2} - \frac{9}{b^2} = 1$. It follows that $a^2 = 1$, $b^2 = 3$. The desired hyperbola is $x^2 - \frac{y^2}{3} = 1$. □

4.D.12. Determine the equations of the tangent lines to the hyperbola $4x^2 - 9y^2 = 1$ which are perpendicular to the line $x - 2y + 7 = 0$.

Solution. All lines perpendicular to the given line have an equation $2x + y + c = 0$ for some $c$. Such a line is a tangent if its intersection with the hyperbola is a double root, i.e. the equation $4x^2 - 9(-2x - c)^2 = 1$ has a double root. Hence $(36c)^2 - 4 \cdot 32 \cdot (9c^2 + 1) = 0$, and $c = \pm\frac{2\sqrt{2}}{3}$. □

4.D.13. Determine the tangent to the ellipse $\frac{x^2}{16} + \frac{y^2}{9} = 1$ which is parallel to the line $x + y - 7 = 0$.

Solution. The lines parallel to the given line intersect it at the point at infinity $(1 : -1 : 0)$. Construct the tangents to the given ellipse passing through this point. The point of tangency $T = (t_1 : t_2 : t_3)$ lies on its polar and therefore satisfies $\frac{t_1}{16} - \frac{t_2}{9} = 0$, so $t_2 = \frac{9}{16}t_1$. Substituting into the ellipse equation, we get $t_1 = \pm\frac{16}{5}$. The touch points of the desired tangents are $[\frac{16}{5}, \frac{9}{5}]$ and $[-\frac{16}{5}, -\frac{9}{5}]$. The tangents are the polars of these points, with the equations $x + y = 5$ and $x + y = -5$. □

Solution. (Alternative.) The given line has slope $-1$. The tangent line at $(4\cos\theta, 3\sin\theta)$ has slope $-\frac{3\cos\theta}{4\sin\theta}$, so it is required that $\tan\theta = \frac{3}{4}$. The tangent line has equation $y - 3\sin\theta = -(x - 4\cos\theta)$, where either $\sin\theta = 3/5$ and $\cos\theta = 4/5$, or $\sin\theta = -3/5$ and $\cos\theta = -4/5$. The two solutions are $x + y = \pm 5$. □

4.D.14. Determine the points at infinity and the asymptotes of the conic section
$$2x^2 + 4xy + 2y^2 - y + 1 = 0.$$

Solution. The equation for the points at infinity, $2x^2 + 4xy + 2y^2 = 0$, or rather $2(x + y)^2 = 0$, has the solution $x = -y$. The only point at infinity therefore is $(1 : -1 : 0)$, so the conic section is a parabola. The asymptote is the polar of this point, which here is the line at infinity $z = 0$. □

4.D.15. Prove that the product of the distances between an arbitrary point on a hyperbola and both of its asymptotes is constant. Find its value.

Solution. Denote the point lying on the hyperbola by $P = [p, q]$. The asymptotes of the hyperbola in canonical form are the lines $bx \pm ay = 0$.
Their normal vectors are $(b, \pm a)$, and from here we determine the orthogonal projections $P_1$, $P_2$ of the point $P$ onto the asymptotes. For the distances between the point $P$ and the asymptotes we get $|PP_{1,2}| = \frac{|bp \pm aq|}{\sqrt{a^2 + b^2}}$. The product is therefore equal to
$$\frac{|b^2p^2 - a^2q^2|}{a^2 + b^2} = \frac{a^2b^2}{a^2 + b^2},$$
because $P$ lies on the hyperbola. □

4.D.16. Compute the angle between the asymptotes of the hyperbola $3x^2 - y^2 = 3$.

Solution. For the cosine of the angle between the asymptotes of a hyperbola in canonical form, $\cos\alpha = \frac{b^2 - a^2}{b^2 + a^2}$. In this case ($a^2 = 1$, $b^2 = 3$, so $\cos\alpha = \frac{1}{2}$), the angle is 60 degrees. □

4.D.17. Locate the centers of the conic sections
(a) $9x^2 + 6xy - 2y - 2 = 0$,
(b) $x^2 + 2xy + y^2 + 2x + y + 2 = 0$,
(c) $x^2 - 4xy + 4y^2 + 2x - 4y - 3 = 0$,
(d) $\frac{(x - \alpha)^2}{a^2} + \frac{(y - \beta)^2}{b^2} = 1$.

Solution. (a) The system $\bar{A}s + a = 0$ for computing the centers is
$$9s_1 + 3s_2 = 0, \qquad 3s_1 - 1 = 0.$$
Solve it to obtain the center at $[\frac{1}{3}, -1]$.

(b) In this case,
$$s_1 + s_2 + 1 = 0, \qquad s_1 + s_2 + \tfrac{1}{2} = 0.$$
Therefore there is no proper center (the conic section is a parabola). Moving to homogeneous coordinates we can obtain the center at infinity $(1 : -1 : 0)$.

(c) The coordinates of the center in this case satisfy
$$s_1 - 2s_2 + 1 = 0, \qquad -2s_1 + 4s_2 - 2 = 0.$$
The solution is a whole line of centers. This is so because the conic section is degenerate: it is a pair of parallel lines.

(d) The center is at $(\alpha, \beta)$. The coordinates of the center therefore give the translation of the coordinate system to the frame in which the ellipse has its basic form. □

4.D.18. Find the equations of the axes of the conic section $6xy + 8y^2 + 4y + 2x - 13 = 0$.

Solution. The major and minor axes of the conic section are in the directions of the eigenvectors of the matrix
$$\begin{pmatrix} 0 & 3 \\ 3 & 8 \end{pmatrix}.$$
The characteristic equation has the form $\lambda^2 - 8\lambda - 9 = 0$. The eigenvalues are therefore $\lambda_1 = -1$, $\lambda_2 = 9$, with corresponding eigenvectors $(3, -1)$ and $(1, 3)$. The axes are the polars of the points at infinity defined by these directions. For $(3, -1)$, the axis equation is $-3x + y + 1 = 0$. For $(1, 3)$ it is $9x + 27y + 7 = 0$. □

4.D.19. Determine the equations of the axes of the conic section $4x^2 + 4xy + y^2 + 2x + 6y + 5 = 0$.

Solution. The eigenvalues of the matrix
$$\begin{pmatrix} 4 & 2 \\ 2 & 1 \end{pmatrix}$$
are $\lambda_1 = 0$, $\lambda_2 = 5$, and the corresponding eigenvectors are $(-1, 2)$ and $(2, 1)$. There is one axis, $2x + y + 1 = 0$, and the conic section is a parabola. □

4.D.20. The equation $x^2 + 3xy - y^2 + x + y + 1 = 0$ defines a conic section. Determine its center, axes, asymptotes and foci.

4.D.21. Find the equation of the tangent at $P = [1, 1]$ to the conic section $4x^2 + 5y^2 - 8xy + 2y - 3 = 0$.

Solution. By projectivizing, this is the conic section defined by the quadratic form $(x, y, z)A(x, y, z)^T$ with the matrix
$$A = \begin{pmatrix} 4 & -4 & 0 \\ -4 & 5 & 1 \\ 0 & 1 & -3 \end{pmatrix}.$$
Using the previous theorem, the tangent is the polar of $P$, which has homogeneous coordinates $(1 : 1 : 1)$. It is given by the equation $(1, 1, 1)A(x, y, z)^T = 0$, which in this case gives $2y - 2z = 0$. Moving back to inhomogeneous coordinates, the tangent line equation is $y = 1$. □

4.D.22. Find the coordinates of the intersection of the $y$ axis and the conic section defined by
$$5x^2 + 2xy + y^2 - 8x = 0.$$

Solution. The $y$ axis is the line $x = 0$. It is the polar of a point $P$ with homogeneous coordinates $\langle p \rangle = (p_1 : p_2 : p_3)$. That means that the equation $x = 0$ is equivalent to the polar equation $F(p, v) = p^T A v = 0$, where $v = (x, y, z)^T$. This is satisfied when $Ap = (\alpha, 0, 0)^T$ for some $\alpha \in \mathbb{R}$.
With the conic section matrix
$$A = \begin{pmatrix} 5 & 1 & -4 \\ 1 & 1 & 0 \\ -4 & 0 & 0 \end{pmatrix},$$
this condition gives the equation system
$$5p_1 + p_2 - 4p_3 = \alpha, \qquad p_1 + p_2 = 0, \qquad -4p_1 = 0.$$
We can find the coordinates of $P$ by the inverse matrix, $p = A^{-1}(\alpha, 0, 0)^T$, or solve the system directly by backward substitution. In this case we easily obtain the solution $p = (0, 0, -\frac{1}{4}\alpha)$. So the $y$ axis touches the conic section at the origin. □

4.D.23. Find the touch point of the line $x = 2$ with the conic section from the previous exercise.

Solution. The line has the equation $x - 2z = 0$ in the projective extension, and therefore we get the condition $Ap = (\alpha, 0, -2\alpha)^T$ for the touch point $P$, which gives
$$5p_1 + p_2 - 4p_3 = \alpha, \qquad p_1 + p_2 = 0, \qquad -4p_1 = -2\alpha.$$
Its solution is $p = (\frac{1}{2}\alpha, -\frac{1}{2}\alpha, \frac{1}{4}\alpha)$. These homogeneous coordinates are equivalent to $(2, -2, 1)$, and hence the touch point has coordinates $[2, -2]$. □

4.D.24. Find the equations of the tangents passing through $P = [3, 4]$ to the conic defined by
$$2x^2 - 4xy + y^2 - 2x + 6y - 3 = 0.$$

Solution. Suppose that the point of tangency $T$ has homogeneous coordinates given by a multiple of the vector $t = (t_1, t_2, t_3)$. The condition that $T$ lies on the conic section is $t^T A t = 0$, which gives
$$2t_1^2 - 4t_1t_2 + t_2^2 - 2t_1t_3 + 6t_2t_3 - 3t_3^2 = 0.$$
The condition that $P$ lies on the polar of $T$ is $p^T A t = 0$, where $p = (3, 4, 1)$ are the homogeneous coordinates of the point $P$. In this case, the equation gives
$$(3, 4, 1)\begin{pmatrix} 2 & -2 & -1 \\ -2 & 1 & 3 \\ -1 & 3 & -3 \end{pmatrix}\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} = -3t_1 + t_2 + 6t_3 = 0.$$
Now we can substitute $t_2 = 3t_1 - 6t_3$ into the previous quadratic equation. Then
$$-t_1^2 + 4t_1t_3 - 3t_3^2 = 0.$$
Because the equation is not satisfied for $t_3 = 0$, we may move to inhomogeneous coordinates $(\frac{t_1}{t_3}, \frac{t_2}{t_3}, 1)$, for which we get $-(\frac{t_1}{t_3})^2 + 4(\frac{t_1}{t_3}) - 3 = 0$ and $\frac{t_2}{t_3} = 3(\frac{t_1}{t_3}) - 6$, i.e. either $\frac{t_1}{t_3} = 1$ and $\frac{t_2}{t_3} = -3$, or $\frac{t_1}{t_3} = 3$ and $\frac{t_2}{t_3} = 3$. So the touch points have homogeneous coordinates $(1 : -3 : 1)$ and $(3 : 3 : 1)$. The tangent equations are the polars of these points: $7x - 2y - 13 = 0$ and $x = 3$. □

4.D.25. Find the equations of the tangents passing through the origin to the circle
$$x^2 + y^2 - 10x - 4y + 25 = 0.$$

Solution. The touch point $(t_1 : t_2 : t_3)$ satisfies
$$(0, 0, 1)\begin{pmatrix} 1 & 0 & -5 \\ 0 & 1 & -2 \\ -5 & -2 & 25 \end{pmatrix}\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} = -5t_1 - 2t_2 + 25t_3 = 0.$$
From here (setting $t_3 = 1$) we eliminate $t_2$ and substitute into the circle equation, which $(t_1 : t_2 : t_3)$ has to satisfy as well. We obtain the quadratic equation $29t_1^2 - 250t_1 + 525 = 0$, with solutions $t_1 = 5$ and $t_1 = \frac{105}{29}$. We compute the coordinate $t_2$ and get the touch points $[5, 0]$ and $[\frac{105}{29}, \frac{100}{29}]$. The tangents are the polars of these points, with equations $y = 0$ and $20x - 21y = 0$. □

4.D.26. Find the equations of the tangents to the circle $x^2 + y^2 = 5$ which are parallel to the line $2x + y + 2 = 0$.

Solution. In the projective extension, these tangents intersect at the point at infinity satisfying $2x + y + z = 0$, i.e. at the point with homogeneous coordinates $(1 : -2 : 0)$. They are the tangents from this point to the circle. We can use the same method as in the previous exercise. The conic section matrix is diagonal with the diagonal $(1, 1, -5)$, and therefore the touch point $(t_1 : t_2 : t_3)$ of the tangents satisfies $t_1 - 2t_2 = 0$. Substituting into the circle equation we get $5t_2^2 = 5$. Since $t_2 = \pm 1$, the touch points are $[2, 1]$ and $[-2, -1]$. □

Solution. (Alternative.) The point $P = \sqrt{5}(\cos\theta, \sin\theta)$ lies on the circle for all $\theta$. The tangent line at $P$ is $x\cos\theta + y\sin\theta = \sqrt{5}$. This has slope $-\frac{\cos\theta}{\sin\theta}$, which is $-2$ provided $\tan\theta = \frac{1}{2}$. It follows that $P$ is either $[2, 1]$ or $[-2, -1]$. □
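The touch points of 4.D.26 are easy to double-check in Sage (the check below is our own addition):

for (px, py) in [(2, 1), (-2, -1)]:
    on_circle = (px^2 + py^2 == 5)    # the point lies on x^2 + y^2 = 5
    slope = -px/py                    # slope of the polar px*x + py*y = 5
    print(on_circle, slope == -2)     # True True: tangent parallel to 2x + y + 2 = 0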
A tangent line touching the conic section at a point at infinity is called an asymptote. The number of asymptotes of a conic section equals the number of intersections of the conic section with the line at infinity. So the ellipse has no real asymptotes, the parabola has one (which is, however, the line at infinity itself) and the hyperbola has two.

4.D.27. Find the points at infinity and the asymptotes of the conic section defined by
$$4x^2 - 8xy + 3y^2 - 2y - 5 = 0.$$

Solution. First, rewrite the conic section in homogeneous coordinates:
$$4x^2 - 8xy + 3y^2 - 2yz - 5z^2 = 0.$$
The points at infinity are the homogeneous coordinates $(x : y : 0)$ satisfying this equation, which means $4x^2 - 8xy + 3y^2 = 0$. It follows that either $\frac{x}{y} = \frac{1}{2}$ or $\frac{x}{y} = \frac{3}{2}$. The conic section is therefore a hyperbola with the points at infinity $P = (1 : 2 : 0)$ and $Q = (3 : 2 : 0)$. The asymptotes are the polars of these points:
$$(1, 2, 0)\begin{pmatrix} 4 & -4 & 0 \\ -4 & 3 & -1 \\ 0 & -1 & -5 \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = -4x + 2y - 2 = 0$$
and
$$(3, 2, 0)\begin{pmatrix} 4 & -4 & 0 \\ -4 & 3 & -1 \\ 0 & -1 & -5 \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = 4x - 6y - 2 = 0.$$ □

There are further exercises on conic sections on page 361.

4.D.28. Harmonic cross-ratio. If the cross-ratio of four points lying on a line equals $-1$, we talk about a harmonic quadruple. Let $ABCD$ be a quadrilateral. Denote by $K$ the intersection of the lines $AB$ and $CD$, and by $M$ the intersection of the lines $AD$ and $BC$. Further, let $L$, $N$ be the intersections of the line $KM$ with $AC$ and $BD$ respectively. Show that the points $K$, $L$, $M$, $N$ form a harmonic quadruple. ⃝

CHAPTER 5

Establishing the ZOO

which functions do we need for our models? – a thorough menagerie

In this chapter, we start using tools allowing us to model dependencies which are neither linear nor discrete. Such models are often needed when dealing with time dependent systems. We try to describe them not only at discrete moments of time, but "continuously". Sometimes this is advantageous, for instance in physical models of classical mechanics and engineering. It might also be appropriate and computationally effective to employ an approximation of discrete models in economics, chemistry, or biology. In particular, such ideas may be appropriate in relation to stochastic models, as we shall see in Chapter 10.

The key concept is that of a function, also called a "signal" in practical applications. The larger the class of functions used, the more difficult is the development of effective tools. On the other hand, if there are only a few simple types of functions available, it may be that some real situations cannot be modelled at all. The objective of the following two chapters is thus to introduce explicitly the most elementary functions of real variables, to describe implicitly many more functions, and to build the standard tools for using them. This is the differential and integral calculus of one variable. While the focus has so far been mainly on the part of mathematics called algebra, the emphasis will now be on mathematical analysis. The link between the two is provided by a "geometric approach": if possible, this means building concepts and intuition independently of any choice of coordinates. Often this leads to a discrete (finite) description of the objects of interest, as is immediately the case when working with polynomials now.

1. Polynomial interpolation

In the previous chapters, we often worked with sequences of real or complex numbers, i.e. with scalar functions $\mathbb{N} \to \mathbb{K}$ or $\mathbb{Z} \to \mathbb{K}$, where $\mathbb{K}$ is a given set of numbers. We also worked with sequences of vectors over the real or complex numbers. Recall the discussion from paragraph 1.1.6 about dealing with scalar functions.
This discussion is adequate for working with functions R → R (real-valued functions of one real variable), or R → C (complex-valued functions of one real variable), or sometimes, more generally, with vector-valued functions of one real variable R → V. The results can usually be extended to cases concerning vector values over the considered scalars, rather than just real and complex numbers. We begin with some easily computable functions.

CHAPTER 5

Establishing the ZOO

which functions do we need for our models? – a thorough menagerie

A. Polynomial interpolation

Our first "route" in the zoo of functions is devoted to a beautiful topic concerning the elementary notion of polynomials. Usually, given a polynomial, one wants to find its roots, and in the first chapter we learned to evaluate polynomials via Horner's scheme. Of course, evaluation is an essential part of any root-finding method, and we know that the fundamental theorem of algebra classifies all roots of a polynomial over C (see also paragraph 12.2.8). Below we adopt a slightly different approach and focus on applications related to the notion of polynomial interpolation of real or complex functions. Roughly speaking, this means that we will use polynomials of a certain degree to approximate such functions. This includes the so-called "Lagrange interpolation", but we will also explain how the theory of derivatives gets involved, in terms of the "Hermite interpolation problem" and "cubic splines". We mention that in this part the use of elementary functions is stressed, although they are systematically analyzed only later in this chapter.

5.A.1. Using Sage for plotting graphs. Computer algebra programs, such as Mathematica, Matlab, Maple, or Sage, all include simple syntax for drawing the graph Gf = {(x, f(x)) : x ∈ I} of a real polynomial f(x) = anx^n + · · · + a1x + a0 of some fixed degree n ∈ N, or

5.1.1. Polynomials. We can add and multiply scalars. These operations satisfy a number of properties which we listed in paragraphs 1.1.1 and 1.1.5. If we admit any finite number of these operations, leaving one of the variables as an unknown and fixing the other scalars, we obtain the polynomial functions. We consider the scalars K = R, C, or Q.

Polynomials

A polynomial over a ring of scalars K is a mapping f : K → K given by the expression

f(x) = anx^n + an−1x^{n−1} + · · · + a1x + a0,

where ai, i = 0, . . . , n, are fixed scalars. Multiplication is indicated by juxtaposition of symbols, and "+" denotes addition. If an ≠ 0, the polynomial f is said to have degree n. The degree of the zero polynomial is undefined. The scalars ai are called the coefficients of the polynomial f.

The polynomials of degree zero are exactly the non-zero constant mappings. In algebra, polynomials are more often defined as formal expressions of the aforementioned form of f(x), i.e. a polynomial is defined to be a sequence a0, a1, . . . of coefficients such that only finitely many of them are nonzero. However, we will show shortly that these approaches are equivalent for our choices of scalars. It is easy to verify that the polynomials over a given ring of scalars form a ring. Multiplication and addition of polynomials are given by the corresponding operations in the original ring K, applied to the values of the polynomials.
Hence,

(f · g)(x) = f(x) · g(x),  (f + g)(x) = f(x) + g(x),

where the operations on the left-hand sides are interpreted in the ring of polynomials, whereas the operations on the right-hand sides are those of the ring of scalars (see the second part of Chapter 12 for a detailed algebraic treatment).

5.1.2. Division of polynomials. As already mentioned, the scalar fields used are Q, R, or C. In all of these fields, the following holds:

Euclidean division of polynomials

Proposition. For any two polynomials f of degree n and g of degree m, there is exactly one pair of polynomials q, r such that f = q · g + r, where either r = 0 or the degree of r is less than m.

Proof. This is a special, simple case of the much more general algebraic result in 12.3.6. Write

f(x) = anx^n + an−1x^{n−1} + · · · + a1x + a0

for the polynomial of degree n, and

g(x) = bmx^m + bm−1x^{m−1} + · · · + b1x + b0,

with an ≠ 0 and bm ≠ 0.

more generally, of some real function f : I → R, where I is a suitable interval in the real line. Next we will use Sage and its command plot(f(x), x, a, b), a situation that we encountered already in Chapter 1. It is worth mentioning that Sage computes the plot of a given function f by evaluating the function at a large number of random points lying in the chosen interval [a, b]. Since in general Sage tries to plot the whole graph of f, it is often useful to restrict the "y-range" between certain values ("compatible" with the range of y = f(x), say c ≤ y = f(x) ≤ d).¹ This can be done by including inside the command plot the options ymin = c and ymax = d, respectively. The restriction of the y-range is also useful for treating further specialities, as for example the graphs of functions with (vertical) asymptotes and other situations. We will discuss such details later.

¹This is extremely useful when, for example, we need to sketch the graph of a rational function.

5.A.2. Low degree polynomials. Describe geometrically real polynomials of degree zero, one, two, and three, respectively. Use Sage to plot the corresponding graphs for the most "fancy" cases (i.e., the non-linear cases).

Solution. Let us discuss the cases of degree n = 2, 3 only (whose graphs are more interesting than the straight lines obtained for n = 0, 1). So, recall that a second degree polynomial has the form f(x) = a2x² + a1x + a0 = ax² + bx + c, with a2 = a ≠ 0 and x ∈ R. Its graph is a parabola, symmetric with respect to the vertical line x = −b/(2a) and oriented upwards or downwards, depending on the sign of a. Below we present the graph of the parabola f(x) = 4x² − 2x − 3, constructed in Sage. This indicates also its axis of symmetry and its vertex V = [−b/(2a), f(−b/(2a))] = [1/4, −13/4]. Recall from 1.G.5 that a method to sketch the graph of f(x) = a0 + a1x + a2x² is based on the substitution method, and in particular relies on a combination of the commands plot and subs. This allows us to substitute certain values of the parameters a0, a1, a2 and quickly sketch the corresponding graph Cf (this can be applied to plot the graph of any polynomial).

a=SR.var("a", 3); P=a[0]+a[1]*x+a[2]*x^2
q=plot(P.subs(a[0] == -3, a[1] == -2, a[2] == 4), x, -6, 6); q.show()

As a side remark related to the use of ymin and ymax mentioned in 5.A.1, observe that one may include them also in the command show ending the previous block, to restrict the values of f.
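To illustrate this side remark, here is a tiny sketch of our own (the function 1/(x − 1) is an assumed example, not from the text): its vertical asymptote at x = 1 makes an unrestricted plot useless, and the y-range options repair this.

# Restricting the y-range when a vertical asymptote dominates the picture.
p = plot(1/(x - 1), (x, -4, 6), detect_poles=True)
show(p, ymin=-5, ymax=5)   # the same options could be passed to plot() itself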
If q ̸= q′ , then the term of highest degree in (q − q′ ) · g cannot be replicated in r − r′ . This leads to a contradiction. This proves uniqueness. It remains to prove that f can always be expressed in the desired form. If m > n, then f = 0 · g + f satisfies the requirements. So suppose that n ≥ m. The result is proved by induction on the degree of f. If f is of degree zero, then the statement is trivial. Suppose the statement holds for all polynomials f of degree less than n > 0. Put h(x) = f(x) − an bm xn−m g(x). If h(x) is the zero polynomial, then f is of the desired form. Otherwise h(x) is a polynomial of degree less than that of f and so h be written in the desired form as h(x) = q · g + r. But then f(x) = h(x) + an bm xn−m g(x) = (q + an bm xn−m )g(x) + r and the proof is complete. □ If f(b) equals zero for some element b ∈ K, then the equality f(x) = q(x)(x−b)+r yields 0 = f(b) = q(b)·0+r, and so the constant remainder is r = 0. Consequently f(x) = (x−b)q(x). The value b is called a root of the polynomial f. The degree of q is then n − 1. If q also has a root, we can continue and in no more than n steps we arrive at a constant polynomial. It follows that the number of roots of any non-zero polynomial over the field K is at most the degree of the polynomial. Hence the following observation: Corollary. If the field of scalars K is infinite, then the polynomials f and g are equal as mappings if and only if they are equal as sequences of coefficients. Proof. Suppose that f = g, i.e. f−g = 0, as a mapping. Then the polynomial (f − g)(x) has infinitely many roots, which is possible only if it is the zero polynomial. □ Notice that of course, this statement does not hold for finite fields. A simple counter-example is the polynomial x2 + x over Z2 which represents a constant zero mapping. 5.1.3. Interpolation polynomial. It is often desirable to use an easily computable expression for a function which is given by its values at some given points x0, . . . , xn. Mostly this would be an approximation of an unknown function represented by the finite values only. We look for such polynomials. If the values were all zeros, we can immediately find a polynomial of degree n + 1, namely f(x) = (x − x0)(x − x1) . . . (x − xn). 370 the command show ending the previous block, to restrict the values of f. For n = 3, the graph of a general cubic polynomial f(x) = a3x3 + a2x2 + a1x + a0, with a3 ̸= 0 is a curve running from −∞ to ∞, if a3 > 0, or the opposite way from ∞ to −∞ if a3 < 0. A cubic polynomial can be continuously either increasing or decreasing, or it can have two bumps. Observe also that such a polynomial always meets the x-axis, providing a real root either once or three times (see also 1.A.6 in Chapter 1). Later in Section C we will discuss conditions distinguishing these cases. □ Functions as trigonometric, exponential or logarithmic, have an elementary role in our zoo of functions, hence we will meet them very often. It will be also useful to know how to treat and evaluate them in Sage, a goal that we quickly figure out for the trigonometric one via our next task. 5.A.3. Trigonometric functions via Sage. Use Sage to obtain the exact values of all the elementary trigonometric functions at the points 0, π/6, π/4, π/3, π/2, 2π/3, π, 3π/2, and 2π, in case they are defined there. Solution. All the trigonometric functions are built into Sage. We type sin, cos and tan for sin, cos and tan = sin cos , respectively. 
Functions such as the trigonometric, exponential or logarithmic ones play an elementary role in our zoo of functions, hence we will meet them very often. It will also be useful to know how to treat and evaluate them in Sage, a goal that we quickly accomplish for the trigonometric ones via our next task.

5.A.3. Trigonometric functions via Sage. Use Sage to obtain the exact values of all the elementary trigonometric functions at the points 0, π/6, π/4, π/3, π/2, 2π/3, π, 3π/2, and 2π, in case they are defined there.

Solution. All the trigonometric functions are built into Sage. We type sin, cos and tan for sin, cos and tan = sin/cos, respectively. Similarly, we type cot, sec and csc for the cotangent, secant, and cosecant functions. To save time, we recall that these are the functions defined by

cot(x) = cos(x)/sin(x) = 1/tan(x),  sec(x) = 1/cos(x),  csc(x) = 1/sin(x),

respectively. To obtain the required values in an exact way, you may type sin(pi/6), cos(3*pi/2), and so on.² When a function is not defined at a certain point, Sage returns Infinity. This is the case for the commands cot(0) and sec(pi/2), for example. In this way one may recover the following table:

        sin     cos     tan     cot     sec     csc
0       0       1       0       −       1       −
π/6     1/2     √3/2    √3/3    √3      2√3/3   2
π/4     √2/2    √2/2    1       1       √2      √2
π/3     √3/2    1/2     √3      √3/3    2       2√3/3
π/2     1       0       −       0       −       1
2π/3    √3/2    −1/2    −√3     −√3/3   −2      2√3/3
π       0       −1      0       −       −1      −
3π/2    −1      0       −       0       −       −1
2π      0       1       0       −       1       −

Notice that the inverse trigonometric functions are also built into Sage: we type arcsin, arccos, arctan, arccot, arcsec and arccsc for the inverses of sin, cos, tan, cot, sec and csc, respectively. You may use Sage to evaluate these functions, or to obtain their graphs. For some of them, we will do this later in this chapter. □

²In case you need a decimal approximation, type N(sin(pi/6)), etc. An alternative is the command n(sin(pi/6)), though we should be careful not to have introduced "n" as a symbolic variable in the same cell.

This is zero at these points and only at them. However, there are other polynomials which are zero at the given points, for instance the zero polynomial, which turns out to be the only such polynomial in the vector space of polynomials of degree at most n. The general situation is analogous:

Interpolation polynomials

Let K be an infinite field of scalars. An interpolation polynomial f for the set of (pairwise distinct) points x0, . . . , xn ∈ K and given values y0, . . . , yn ∈ K is either the zero polynomial or a polynomial of degree at most n such that f(xi) = yi for all i = 0, 1, . . . , n.

Theorem. For every set of n + 1 (pairwise distinct) points x0, . . . , xn ∈ K and given values y0, . . . , yn ∈ K, there is exactly one interpolation polynomial f.

Proof. If f and g are interpolation polynomials with the same defining values, then their difference is a polynomial of degree at most n which has at least n + 1 roots, and thus f − g = 0. This proves uniqueness. It remains to prove the existence. Label the coefficients of the polynomial f of degree n: f = anx^n + · · · + a1x + a0. Substituting the desired values leads to a system of n + 1 equations for the same number of unknown coefficients ai:

a0 + x0a1 + · · · + (x0)^n an = y0
...
a0 + xna1 + · · · + (xn)^n an = yn.

The existence of a solution of this system is easily shown by constructing the polynomial using the Lagrange polynomials for the given points x0, . . . , xn, introduced in the next paragraph. However, the proof can be concluded using only basic knowledge from linear algebra. This system of linear equations has a unique solution if the determinant of its matrix is a non-zero scalar (see 2.3.5 and 2.2.11). The determinant is the Vandermonde determinant, which was discussed in exercise 2.E.21 on page 151. Since it is verified that for zero right-hand sides there is exactly one solution, we know that this determinant must be non-zero.

Let us now focus on polynomial interpolation. Recall from 5.1.3 that given some distinct nodes x0, x1, . . . , xn, there is precisely one polynomial P(x) of degree not greater than n which takes a prescribed value yi at xi, for all i = 0, 1, . . . , n.
This is given by

P(x) = y0 · ℓ0(x) + · · · + yn · ℓn(x),

with

ℓi(x) := \frac{(x − x_0) \cdots (x − x_{i−1})(x − x_{i+1}) \cdots (x − x_n)}{(x_i − x_0) \cdots (x_i − x_{i−1})(x_i − x_{i+1}) \cdots (x_i − x_n)}

for any i = 0, 1, . . . , n. Observe that ℓi(xi) = 1 for all i = 0, 1, . . . , n, while ℓi(xj) = 0 for all i ≠ j. The polynomial P(x) is known as the Lagrange interpolation polynomial, while the ℓi are referred to as the elementary Lagrange polynomials; see 5.1.4 for details. Notice that if yi = f(xi) for some function f, then P(x) is referred to as the Lagrange interpolation polynomial for f. Let us now illustrate the situation by examples.

5.A.4. Lagrange interpolation. Consider the nodes x0 = 0, x1 = 1, x2 = 4 and the values y0 = 1, y1 = 2, y2 = 4. Write down the corresponding Lagrange interpolation polynomial P(x), and sketch its graph together with the graphs of ℓ0, ℓ1, ℓ2.

Solution. According to 5.1.4, the Lagrange interpolation polynomial will be of degree at most two. We compute ℓ0(x) = (x−1)(x−4)/4, ℓ1(x) = −x(x−4)/3 and ℓ2(x) = x(x−1)/12, and hence

P(x) = −(1/12)x² + (13/12)x + 1.

For an implementation of the Lagrange interpolation, Sage provides an inbuilt method based on the command lagrange_polynomial. For example, in our case we can give the block

nodes=[(0, 1), (1, 2), (4, 4)]
R=PolynomialRing(QQ, "x")
P=R.lagrange_polynomial(nodes); show(P)

Check yourself that this cell prints out the explicit expression of the polynomial P(x). Notice also that you can use the bool command to confirm Sage's output. This can be done by adding to the cell posed above the following syntax:

l0(x)=(x-1)*(x-4)/4; l1(x)=-x*(x-4)/3
l2(x)=x*(x-1)/12
eq=P(x)-l0(x)-2*l1(x)-4*l2(x)==0; eq
bool(eq)

As for the required graphs, we present them below (check yourself that the reproduction of this figure in Sage is really enjoyable). □

Since polynomials are equal as mappings if and only if they are equal as sequences of coefficients, the theorem is proved. □

5.1.4. Applications of interpolation. At first sight, it may seem that real or rational polynomials, that is, polynomial functions R → R or Q → Q, form a very useful class of functions of one variable. We can arrange for them to attain any set of given values. Moreover, they are easily expressible, so their value at any point can be calculated without difficulties. However, there are a number of problems when trying to use them in practice. The first of the problems is to find quickly the polynomial which interpolates the given data. Solving the aforementioned system of linear equations generally requires time proportional to the cube of the number of given points xi. This is unacceptable for large data. We will demonstrate how to overcome this with one popular type of polynomials related to the fixed points x0, . . . , xn:

Lagrange¹ interpolation polynomials

The Lagrange interpolation polynomial is expressed in terms of the elementary Lagrange polynomials ℓi of degree n with the properties

ℓi(xj) = 1 for i = j, and ℓi(xj) = 0 for i ≠ j.

These polynomials must (up to a constant factor) equal the expressions (x − x0) · · · (x − xi−1)(x − xi+1) · · · (x − xn). So

ℓi(x) = \frac{\prod_{j \neq i}(x − x_j)}{\prod_{j \neq i}(x_i − x_j)}.

The desired Lagrange interpolation polynomial is then given by

f(x) = y0ℓ0(x) + y1ℓ1(x) + · · · + ynℓn(x).

Notice that the elementary Lagrange polynomials can be quite easily expressed by means of derivatives; see the exercise ??. The usage of Lagrange polynomials is especially efficient when working with different values yi for the same set of points xi. For in this case, the elementary polynomials ℓi are already prepared.
One of the disadvantages of this expression is its large sensitivity to inaccuracies in the computation when the differences between the given points xi are small, because division by these differences is required. Another disadvantage (common to all ways of expressing the unique interpolation polynomial) is the poor stability of the values of real or rational polynomials outside of the interval containing all the nodes.

¹Joseph-Louis Lagrange (1736-1813) was a famous Italian mathematician and astronomer, who contributed in particular to celestial mechanics. His famous Mécanique analytique appeared in 1788. His name appears often even in this elementary textbook.

5.A.5. Find a polynomial P satisfying the following conditions: P(2) = 1, P(3) = 0, P(4) = −1, P(5) = 6.

Solution. Initially, one may want to use Sage to plot the given points in R². We can do this by the cell

list_plot({2: 1, 3: 0, 4: -1, 5: 6}, size=30, figsize=4, color="black")

Now we can solve the task in two different ways. Four points are given, so we know that there is exactly one polynomial of degree at most three satisfying the given conditions. Hence we can assume that P(x) = a3x³ + a2x² + a1x + a0 for some a0, . . . , a3 ∈ R, and in such terms it is easy to see that the system {P(2) = 1, P(3) = 0, P(4) = −1, P(5) = 6} consists of the equations

a0 + 2a1 + 4a2 + 8a3 = 1,  a0 + 3a1 + 9a2 + 27a3 = 0,
a0 + 4a1 + 16a2 + 64a3 = −1,  a0 + 5a1 + 25a2 + 125a3 = 6.

Applying the Sage cell

var("a0, a1, a2, a3")
eq1=a0+2*a1+4*a2+8*a3-1; eq2=a0+3*a1+9*a2+27*a3
eq3=a0+4*a1+16*a2+64*a3+1
eq4=a0+5*a1+25*a2+125*a3-6
solve([eq1==0,eq2==0,eq3==0,eq4==0],a0,a1,a2,a3)

we obtain a unique solution given by

[[a0 == -29, a1 == (101/3), a2 == -12, a3 == (4/3)]]

An alternative, more "solid" way to treat this system via Sage is here:

a=SR.var("a", 4)
P(x)=a[0]+a[1]*x+a[2]*x^2+a[3]*x^3
solve([P(2)-1==0, P(3)==0, P(4)+1==0, P(5)-6==0], a[0], a[1], a[2], a[3])

To summarize, the polynomial P has the form (4/3)x³ − 12x² + (101/3)x − 29. Let us now verify this answer by the method described before, based on the Lagrange polynomial:

P(x) = 1 · \frac{(x−3)(x−4)(x−5)}{(2−3)(2−4)(2−5)} + 0 · (\dots) + (−1) · \frac{(x−2)(x−3)(x−5)}{(4−2)(4−3)(4−5)} + 6 · \frac{(x−2)(x−3)(x−4)}{(5−2)(5−3)(5−4)} = \frac{4}{3}x³ − 12x² + \frac{101}{3}x − 29.

Of course, such a computation can be done faster in Sage via the lagrange_polynomial command, as in 5.A.4, and we leave this verification to the reader. Below we present the graph of P(x) together with the given nodes. □

Soon we will develop tools for an exact description of the functions' behaviour. But even without such tools, it is clear that, according to the sign of the coefficient of the term of highest degree, the value of the polynomial rapidly approaches plus or minus infinity as x increases (or decreases). However, this sign is not even stable under small changes of the defining values yi. This is illustrated by the following two diagrams, displaying eleven values of the function sin(x) with two different small changes of the values. The interpolated function sin(x) is the dotted line, the circles are the gently moved values yi, and the uniquely determined interpolation polynomial is the solid line. While the approximation is quite good inside the interval covering the eleven points, it is very poor at the margins.
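The reader can reproduce this phenomenon easily. The following block is our own sketch (the book's exact data is not reproduced here): it interpolates sin(x) at eleven equidistant nodes, once exactly and once with a single value moved by 0.05, and compares the two interpolation polynomials near the margins.

# Our sketch of the instability of interpolation under a small data change.
R = PolynomialRing(RR, "x")
nodes = [(i, sin(i).n()) for i in range(11)]
moved = [(i, sin(i).n() + (0.05 if i == 5 else 0)) for i in range(11)]
P1 = R.lagrange_polynomial(nodes)
P2 = R.lagrange_polynomial(moved)
fig = plot(sin(x), (x, 0, 10), linestyle=":", color="black")
fig += plot(P1, 0, 10, color="gray") + plot(P2, 0, 10, color="red")
show(fig, ymin=-3, ymax=3, figsize=4)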
There is a rich theory about interpolation polynomials. If interested, consult the specialized literature.

5.1.5. Remark. The numerical instability caused by the closeness of (some of) the points xi is clearly seen in the system of equations from the proof of the Theorem in 5.1.3. When solving a system of linear equations, instability is closely related to the size of the determinant of the corresponding matrix, which in our case is the Vandermonde determinant V.

Lemma. For any sequence of pairwise distinct scalars x0, . . . , xn ∈ K,

V(x_0, \dots, x_n) = \prod_{n \ge i > k \ge 0} (x_i − x_k).

Proof. The proof is by induction on the number of the points xi. The result is true for n = 1. (The problem is completely uninteresting for n = 0.) Suppose that the result is true for n − 1, i.e.

V(x_0, \dots, x_{n−1}) = \prod_{n−1 \ge i > k \ge 0} (x_i − x_k).

Consider the values x0, . . . , xn−1 to be fixed, and vary the value of xn. Expand the determinant along the last row (see 2.2.9).

5.A.6. Find the Lagrange interpolation polynomial for the function f(x) = 1/(4 + x²) by dividing the closed interval [−1, 1] ⊂ R into n equal parts, for n = 20. Use Sage to present a figure including the graph of f and that of the interpolation polynomial, along with the interpolated points.

Solution. The procedure of dividing [−1, 1] into n equal parts gives rise to the n + 1 points of the form −1 + 2k/n, k = 0, 1, . . . , n. For instance, for n = 1 we get the points −1, 1; for n = 2 the points −1, 0, 1; for n = 3 the points −1, −1/3, 1/3, 1, etc. We can now use Sage and the command .lagrange_polynomial(), as follows:

f(x)=1/(4+x**2)
fig=plot(f(x), (x, -1, 1),figsize=4,color="black")
show(fig)
R=PolynomialRing(QQ, "x"); n=20
points=[(-1+k*(2/n), f(-1+k*2/n)) for k in [0, 1,.., n]]
P(x)=R.lagrange_polynomial(points)
fig+=plot(P(x),(x, -1, 1),color="purple")
fig+=list_plot(points, size=20, figsize=4, color="blue"); fig

This block includes the commands needed to produce the plots of f and of P, along with the interpolated points. This allows us to illustrate the interpolation, as in the figure below. Notice that to obtain the explicit form of P(x) one should type show(P(x)), which we do not present here to save some space (run the given code in your editor to read P(x)). □

5.A.7. Repeat the task in 5.A.6, replacing f with the function g(x) = x/(1 + 8x²). Next present the graphs of g(x) and of the Lagrange interpolation polynomial, along with the given points, via a programming package of your choice. ⃝

5.A.8. Find a polynomial P of degree two or less, taking the values y0 = 1, y1 = −3, y2 = 4 at the points x0 = −1, x1 = 1, x2 = 2, respectively. ⃝

5.A.9. Find a polynomial P of third degree satisfying P(0) = 1, P(1) = 0, P(2) = 1, and P(3) = 10. ⃝

5.A.10. Find a polynomial P satisfying: (i) P(1 + i) = i, P(2) = 1, P(3) = −i; (ii) P(1) = i, P(−1) = −i, P(i) = −1.

This exhibits the desired determinant as the polynomial

(1)  V(x_0, \dots, x_n) = (x_n)^n V(x_0, \dots, x_{n−1}) + \text{lower degree terms in } x_n.

This is a polynomial of degree n in xn, since its coefficient at (xn)^n is non-zero by the induction hypothesis. Evidently, it vanishes at any point xn = xi for i < n, because in that case the original determinant contains two identical rows. The polynomial is thus divisible by the expression (xn − x0)(xn − x1) · · · (xn − xn−1), which itself is of degree n. It follows that the Vandermonde determinant (as a polynomial in the variable xn) must, up to a multiplicative constant, be given by

V(x_0, \dots, x_n) = c \cdot (x_n − x_0)(x_n − x_1) \cdots (x_n − x_{n−1}).
Comparing the coefficients at the highest power (xn)^n in (1) with this expression yields c = V(x0, . . . , xn−1), and the induction hypothesis completes the proof. □

Notice that the value of the determinant is small if the points xi are close together.

5.1.6. Derivatives of polynomials. The values of polynomials rapidly tend to infinity as the input variable grows. Hence polynomials are unable to describe periodic events, such as the values of the trigonometric functions. One could hope to achieve much better results, at least between the points xi, by looking not only at the function values, but also at the rate of increase of the function at those points. For this purpose, we introduce (only intuitively, for the time being) the concept of a derivative for polynomials. Again, we can work with real, complex or rational polynomials. The rate of increase of a real-valued polynomial f(x) at a point x ∈ R should be related to the values

(1)  \frac{f(x + \delta x) − f(x)}{\delta x},

where δx is a small value in K expressing the increment of the argument x. Since we can calculate (over an arbitrary ring)

(x + \delta x)^k = x^k + \cdots + \binom{k}{l} x^l (\delta x)^{k−l} + \cdots + (\delta x)^k,

Solution. A straightforward computation reveals the expression

P(z) = (−3/5 − (4/5)i)z² + (2 + 3i)z − 3/5 − (14/5)i.

We can compute this directly via Sage, as in 5.A.5, although we have to change the field to C:

R = PolynomialRing(CC, "z")
R.lagrange_polynomial([(1+I,I),(2,1),(3,-I)])

Sage returns the answer as follows:

(-0.600000000000000 - 0.800000000000000*I)*z^2 + (2.00000000000000 + 3.00000000000000*I)*z - 0.600000000000000 - 2.80000000000000*I

In the second case, the solution is easier. This is because the conditions are satisfied by the rotation by the angle π/2 in the complex plane. This means that the polynomial must be of the form f(z) = iz. □

Assuming a bit of experience with elementary functions, we now proceed with tasks highlighting the interplay between elementary functions and polynomial interpolation. Later, in Chapter 7, we will study the so-called Chebyshev polynomials, which may enrich this relationship.

5.A.11. Based on Lagrange interpolation, present an approximate polynomial formula for the sine function, using the known values of sin(x) at the points 0, π/6, π/4, π/3, π/2. Next present the graphs of the interpolation polynomial and of sin(x) for x ∈ [0, π], including in your figure the given nodes.

Solution. According to the statement, we have the table

x       0    π/6    π/4     π/3     π/2
sin(x)  0    1/2    √2/2    √3/2    1

To solve our task we can use this table and apply the same technique in Sage as above. This can be done by the block

nodes=[(0, 0), (pi/6, 1/2), (pi/4, sqrt(2)/2),
(pi/3, sqrt(3)/2), (pi/2, 1)]
R=PolynomialRing(RR, "x")
P=R.lagrange_polynomial(nodes); show(P)

Sage's answer says that the interpolation polynomial P(x) is approximately given by

0.0288x⁴ − 0.2043x³ + 0.0214x² + 0.9956x.

Now, in order to produce the required graphs together with the given points, one may proceed with the code

nodes=[(0, 0), (pi/6, 1/2), (pi/4, sqrt(2)/2),
(pi/3, sqrt(3)/2), (pi/2, 1)]
R=PolynomialRing(RR, "x")
P=R.lagrange_polynomial(nodes)
A=plot(P, 0, pi, color="grey")
B=list_plot({0: 0, pi/6: 1/2, pi/4: sqrt(2)/2,
pi/3: sqrt(3)/2, pi/2: 1}, size=30, figsize=4, color="black")
C=plot(sin(x), 0, pi, color="black")
show(A+B+C, figsize=4)

We leave the implementation of this block to the reader. □
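As a complement, one can also build the elementary Lagrange polynomials ℓi directly from their defining formula, instead of calling lagrange_polynomial. The following block is our own sketch (the helper ell and the reuse of the nodes of 5.A.4 are our assumptions); it checks the characteristic property ℓi(xj) = δij used throughout this section:

# Elementary Lagrange polynomials from scratch, for the nodes of 5.A.4.
x = polygen(QQ, "x")
pts = [0, 1, 4]
def ell(i):
    num = prod(x - pts[j] for j in range(len(pts)) if j != i)
    den = prod(pts[i] - pts[j] for j in range(len(pts)) if j != i)
    return num / den
print([[ell(i)(p) for p in pts] for i in range(3)])
# [[1, 0, 0], [0, 1, 0], [0, 0, 1]] -- each ell(i) is 1 at its own node only
P = sum(y * ell(i) for i, y in enumerate([1, 2, 4]))
print(P)    # -1/12*x^2 + 13/12*x + 1, the polynomial found in 5.A.4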
we get, for the polynomial f(x) = anx^n + · · · + a0, the above quotient (1) in the form

\frac{f(x+\delta x) − f(x)}{\delta x} = a_n \frac{n x^{n−1}\,\delta x + \cdots + (\delta x)^n}{\delta x} + \cdots + a_1 \frac{\delta x}{\delta x} = n a_n x^{n−1} + (n−1)a_{n−1}x^{n−2} + \cdots + a_1 + \delta x(\dots),

where the expression in the parentheses at the end depends polynomially on δx. Clearly, for values δx very close to zero, we obtain a value arbitrarily close to the value in the following definition:

Derivatives of polynomials

The derivative of the polynomial f(x) = anx^n + · · · + a0 with respect to the variable x is the polynomial

f′(x) = nanx^{n−1} + (n−1)an−1x^{n−2} + · · · + a1.

From the definition, it is clear that it is just the value f′(x0) of the derivative which gives a good approximation of the polynomial's behaviour near the point x0. More precisely, the lines

y = \frac{f(x_0+\delta x) − f(x_0)}{\delta x}(x − x_0) + f(x_0),

that is, the secant lines of the graph of the polynomial going through the points [x0, f(x0)] and [x0 + δx, f(x0 + δx)], approach, as δx decreases, the line

y = f′(x_0)(x − x_0) + f(x_0),

which is the "tangent" to the graph of the polynomial f. This is the linear approximation to the polynomial f by its tangent line. An exact meaning to all these concepts is given later.

The derivative of polynomials is a linear mapping which, to polynomials of degree at most n, assigns polynomials of degree at most n − 1. Iterating this procedure, we obtain the second derivative f″, the third derivative f⁽³⁾, and generally, after k-fold iteration, the polynomial f⁽ᵏ⁾ of degree n − k. Thus the (n+1)-st derivative is the zero polynomial. This linear mapping is an example of the cyclic nilpotent mappings, which are more thoroughly examined in paragraph 3.4.10.

The derivative behaves well also with respect to the multiplication of polynomials. A straightforward combinatorial check reveals the derivation property, or Leibniz rule, for this linear operator:

(f(x) · g(x))′ = f′(x) · g(x) + f(x) · g′(x).

Actually, this is a purely algebraic result (which holds over any ring of scalars!), and you may either check it yourself or consult the formal proof in 12.3.7.

5.1.7. Hermite's interpolation problem. Consider m + 1 pairwise distinct real numbers x0, . . . , xm, i.e. xi ≠ xj for all i ≠ j. It is desired not only to place a polynomial through given values at these points, but also to determine the first derivatives of the interpolating polynomial at these points. Set the values yi and y′i for all i.

The Lagrange interpolation method comes with an interpolation error (and an error bound), which we will briefly analyze in the final section E of this chapter (see 5.E.9, 5.E.10 and 5.E.11). It is now reasonable to discuss problems where we should also regard the slope of the tangents to our polynomial at the given points. This amounts to employing the derivatives of polynomials (see 5.1.6 if necessary), and it can be handled by the very same methods as before. Thus, next we will meet the so-called Hermite interpolation, a method introduced in 5.1.7 (see also 5.1.8).

5.A.12. Hermite interpolation. Find a polynomial P satisfying the following conditions: P(1) = 0, P′(1) = 1, P(2) = 3, P′(2) = 3.

Solution. We provide two methods of finding the solution.

1st approach: The given conditions give rise to four linear equations in the coefficients of P. If we seek a polynomial of degree less than four, we get the same number of equations and unknown coefficients. Hence, assuming that P(x) = a3x³ + a2x² + a1x + a0 for some reals a0, . . .
, a3, we get the equations

P(1) = a3 + a2 + a1 + a0 = 0,
P′(1) = 3a3 + 2a2 + a1 = 1,
P(2) = 8a3 + 4a2 + 2a1 + a0 = 3,
P′(2) = 12a3 + 4a2 + a1 = 3.

Quickly solving this linear system with Sage, we get P(x) = −2x³ + 10x² − 13x + 5.

2nd approach: This is based on Hermite's interpolation, which requires the description of the so-called fundamental Hermite interpolation polynomials h(1)i and h(2)i (i = 0, 1); see 5.1.7. By assumption, x0 = 1, x1 = 2, and we may set y0 = P(x0) = 0, y1 = P(x1) = 3, y′0 = P′(x0) = 1 and y′1 = P′(x1) = 3. For an application of Hermite's interpolation method one needs the function

ℓ(x) = (x − x0)(x − x1) = (x − 1)(x − 2) = x² − 3x + 2.

Obviously ℓ′(x) = 2x − 3, with ℓ′(x0) = −1 and ℓ′(x1) = 1. Moreover, the second derivative is a constant function, ℓ″(x) = 2. We also need the elementary Lagrange polynomials induced by the nodes x0, x1:

ℓ0(x) = (x − x1)/(x0 − x1) = 2 − x,  ℓ1(x) = (x − x0)/(x1 − x0) = x − 1.

A polynomial f is wanted which satisfies these conditions on the values and derivatives. As in the case of interpolating the values only, we obtain the following system of 2(m + 1) equations for the coefficients of the polynomial f(x) = anx^n + · · · + a0:

a0 + x0a1 + · · · + (x0)^n an = y0
...
a0 + xma1 + · · · + (xm)^n an = ym
a1 + 2x0a2 + · · · + n(x0)^{n−1} an = y′0
...
a1 + 2xma2 + · · · + n(xm)^{n−1} an = y′m.

One can verify that with the choice n = 2m + 1, the determinant of this system is non-zero, and thus there is exactly one solution. The polynomial f can also be constructed immediately: simply create a set of polynomials with values 0 or 1, respectively, for the derivatives and the values, in order to express the desired values as a linear combination. We sketch briefly how to construct them, leaving the details to the reader. The elementary Lagrange polynomials serve well for this purpose. The derivative of f(x) = (ℓi(x))² is 2ℓ′i(x)ℓi(x), and thus all the xj with j ≠ i are roots of this polynomial, and similarly of its derivative f′(x). But a polynomial of degree 2m + 1 is wanted, so we consider rather g(x) = (x − xi)f(x). Now the values of g at all the xj are zero, while the derivative g′(x) = f(x) + (x − xi)f′(x) has the required properties too. Thus we take

h(2)i(x) = (x − xi)(ℓi(x))².

This is called the fundamental Hermite polynomial² of the second type. Finally, we look for a polynomial which has zero derivatives at all the points xj and the same values as ℓi at the given points. We can apply a very similar trick: look for polynomials of the form

h(1)i(x) = (1 − a(x − xi))(ℓi(x))².

All the xj will be roots of this polynomial, except for j = i, where

²Charles Hermite (1822-1901) was a Frenchman active in many areas of Mathematics. His name is mostly linked to the Hermitian operators and matrices, cf. 3.4.6.

Now, according to the formulas given in 5.1.7, we have

h(1)0(x) := (1 − (ℓ″(x0)/ℓ′(x0))(x − x0))(ℓ0(x))² = (2x − 1)(x − 2)²,
h(1)1(x) := (1 − (ℓ″(x1)/ℓ′(x1))(x − x1))(ℓ1(x))² = (5 − 2x)(x − 1)²,
h(2)0(x) := (x − x0)(ℓ0(x))² = (x − 1)(x − 2)²,
h(2)1(x) := (x − x1)(ℓ1(x))² = (x − 2)(x − 1)².

The polynomial P(x) is then given by the expression y0 · h(1)0(x) + y1 · h(1)1(x) + y′0 · h(2)0(x) + y′1 · h(2)1(x), and we arrive at the same expression as above:

P(x) = 0 · h(1)0(x) + 3 · h(1)1(x) + 1 · h(2)0(x) + 3 · h(2)1(x) = −2x³ + 10x² − 13x + 5.

In fact, one could avoid computing h(1)0 explicitly, since y0 = 0 by assumption.
Notice also that our final computation can be verified quickly in Sage, via the block

h01(x)=(2*x-1)*(x-2)^2; h11(x)=(5-2*x)*(x-1)^2
h02(x)=(x-1)*(x-2)^2; h12(x)=(x-2)*(x-1)^2
y0=0; y1=3; dy0=1; dy1=3
P(x)=y0*h01(x)+y1*h11(x)+dy0*h02(x)+dy1*h12(x)
show(P(x).expand())

□

The next exercises on Hermite interpolation are helpful for a better understanding of the method, but also for improving your computational skills (we suggest treating these tasks with the help of Sage as well). One can find further problems related to Hermite interpolation in Section E, see 5.E.1.

5.A.13. Determine the Hermite interpolation polynomial Q satisfying Q(−1) = −9, Q(1) = −1, Q′(−1) = 10, Q′(1) = 2. ⃝

5.A.14. Replace the function f with a Hermite polynomial, having as initial data the following values of f:

xi      −1    1     2
f(xi)    4    −4    −8
f′(xi)   8    −8    11

⃝

Let us now focus on splines, a notion introduced in 5.1.9. In short, we look for (cubic) polynomials on intervals, prescribing the values and requesting the first and second derivatives to agree at the boundary points. Splines are very popular in visualization and image processing, where smooth interpolation is essential, e.g., for computational animation and image scaling. However, a formal description of the spline through given data can often be a tedious task. For instance, if we are given n ≥ 2 points and values at them, then we need to solve 4n − 4 linear equations. Fortunately, using computers we may establish algorithms which lead to much faster implementations. In order to illustrate this situation, below we proceed by combining the appropriate syntax of Sage. But let us first present a more basic example.

the value is 1. The derivative is

(h(1)i(x))′ = −a(ℓi(x))² + (1 − a(x − xi)) · 2ℓi(x)ℓ′i(x).

All the xj, j ≠ i, are roots of ℓi(x), thus they are also roots of this polynomial. Finally, at the point xi we want 0 = −a + 2ℓ′i(xi), so we choose a = 2ℓ′i(xi). The combinatorial check that 2ℓ′i(xi) = ℓ″(xi)/ℓ′(xi) is left to the reader. We summarize:

Hermite's 1st order interpolation polynomial

The fundamental Hermite polynomials are defined as follows:

h(1)i(x) = (1 − (ℓ″(xi)/ℓ′(xi))(x − xi)) (ℓi(x))²,
h(2)i(x) = (x − xi) (ℓi(x))²,

where ℓ(x) = \prod_{i=0}^{m}(x − x_i) and ℓi(x) are the elementary Lagrange polynomials. These polynomials satisfy

h(1)i(xj) = δij (i.e. 1 for i = j and 0 for i ≠ j),  (h(1)i)′(xj) = 0,
h(2)i(xj) = 0,  (h(2)i)′(xj) = δij.

The Hermite interpolation polynomial is given by the expression

f(x) = \sum_{i=0}^{m} ( y_i h^{(1)}_i(x) + y'_i h^{(2)}_i(x) ).

5.1.8. Examples of Hermite's polynomials. The simplest example is the one prescribing the value and the derivative at a single point. This determines the polynomial of degree one

f(x) = f(x0) + f′(x0)(x − x0),

which is exactly the equation of the straight line given by the value and slope at the point x0. When we set the values and the derivatives at two points, i.e. y0 = f(x0), y′0 = f′(x0), y1 = f(x1), y′1 = f′(x1) for two distinct points xi, we still obtain an easily computable problem. Consider the simple case x0 = 0, x1 = 1. Then the matrix of the system and its inverse are

A = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 \\ 3 & 2 & 1 & 0 \end{pmatrix},  A^{−1} = \begin{pmatrix} 2 & −2 & 1 & 1 \\ −3 & 3 & −2 & −1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}.

The multiplication A^{−1} · (y0, y1, y′0, y′1)^T gives the vector (a3, a2, a1, a0)^T of the coefficients of the polynomial f, i.e.

f(x) = (2y0 − 2y1 + y′0 + y′1)x³ + (−3y0 + 3y1 − 2y′0 − y′1)x² + y′0x + y0.
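This closed formula is easy to check symbolically. The following short block is our own sketch (not from the text): it solves the four interpolation conditions for x0 = 0, x1 = 1 and reproduces the displayed coefficients.

# Solve for the cubic with prescribed values y0, y1 and derivatives dy0, dy1
# at the points 0 and 1, keeping all data symbolic.
var("y0 y1 dy0 dy1 a0 a1 a2 a3 x")
f = a3*x^3 + a2*x^2 + a1*x + a0
df = f.diff(x)
sol = solve([f.subs(x=0) == y0, f.subs(x=1) == y1,
             df.subs(x=0) == dy0, df.subs(x=1) == dy1],
            a0, a1, a2, a3)
show(sol)
# a3 = 2*y0 - 2*y1 + dy0 + dy1, a2 = -3*y0 + 3*y1 - 2*dy0 - dy1,
# a1 = dy0, a0 = y0 -- exactly the coefficients in the formula above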
5.A.15. Splines. Find a natural cubic spline S satisfying S(−1) = 0, S(0) = 1, S(1) = 0, and present its graph for x in the closed interval [−1, 1].

Solution. We have three nodes and the two intervals [x0, x1] = [−1, 0] and [x1, x2] = [0, 1]; that is, n = 2 in the terms of paragraph 5.1.9. Thus the spline that we are looking for consists of two cubic polynomials, say S1(x) = ax³ + bx² + cx + d and S2(x) = ex³ + fx² + gx + h, with domains the intervals [−1, 0] and [0, 1], respectively, and a, b, . . . , h ∈ R. Applying the definition of cubic splines for n = 2, one obtains six linear equations in terms of a, b, . . . , h:

S1(−1) = 0,  S1(0) = 1,  S2(0) = 1,  S2(1) = 0,  S′1(0) = S′2(0),  S″1(0) = S″2(0).

In addition, S must be a natural spline, which corresponds to the vanishing of the second derivatives of S1 and S2 at the points −1 and 1, respectively:

S″1(−1) = 0  and  S″2(1) = 0.

Hence we arrive at a system of eight equations, which in fact we can reduce in the following way. Thanks to the given value at 0, we know that the absolute coefficients of both polynomials equal 1, i.e. d = h = 1. The resulting spline has to be symmetric with respect to the y axis: otherwise, the reflection along this axis would yield a second spline satisfying the same conditions, contradicting the fact that the natural cubic spline is unique. Thus the only possibility for the common value of the first derivatives of S1 and S2 at zero is zero (c = g = 0); further, the second derivatives at zero have to agree, that is b = f, and the symmetry also gives e = −a. So far, we have S1(x) = ax³ + bx² + 1 and S2(x) = −ax³ + bx² + 1. Now the conditions S1(−1) = 0 and S″1(−1) = 0 correspond to the equations −a + b + 1 = 0 and −6a + 2b = 0, respectively. Solving this system, we obtain

S1(x) = −(1/2)x³ − (3/2)x² + 1,  S2(x) = (1/2)x³ − (3/2)x² + 1.

Altogether,

S(x) = −(1/2)x³ − (3/2)x² + 1 if x ∈ [−1, 0],  S(x) = (1/2)x³ − (3/2)x² + 1 if x ∈ [0, 1].

For the graph of the spline via Sage, we obtain the figure presented below (we used blue colour for S1 and green for S2). □
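Instead of the symmetry reduction, one can also let Sage handle all eight conditions directly. The following block is our own sketch (the variable names p0, . . . , q3 are our choice):

# Direct check of 5.A.15: solve the eight spline conditions at once.
var("p0 p1 p2 p3 q0 q1 q2 q3 x")
S1 = p3*x^3 + p2*x^2 + p1*x + p0
S2 = q3*x^3 + q2*x^2 + q1*x + q0
eqs = [S1.subs(x=-1) == 0, S1.subs(x=0) == 1,
       S2.subs(x=0) == 1, S2.subs(x=1) == 0,
       S1.diff(x).subs(x=0) == S2.diff(x).subs(x=0),
       S1.diff(x, 2).subs(x=0) == S2.diff(x, 2).subs(x=0),
       S1.diff(x, 2).subs(x=-1) == 0,   # natural boundary at -1
       S2.diff(x, 2).subs(x=1) == 0]    # natural boundary at +1
show(solve(eqs, p0, p1, p2, p3, q0, q1, q2, q3))
# p3 = -1/2, p2 = -3/2, p1 = 0, p0 = 1, q3 = 1/2, ... -- as computed above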
5.1.9. Spline interpolation. We can prescribe any finite number of derivatives at the particular points, and a suitable choice of the upper bound on the degree of the desired polynomial leads to a unique interpolation. Unfortunately, these interpolations do not solve the problems mentioned already, namely the complexity of the computations and the instability. However, a smarter usage of derivatives allows an improvement. As we have seen, small local changes of the values may dramatically affect the overall behaviour of the resulting polynomial, in particular outside of the interval covered by the points xi. So we try gluing together polynomial pieces of low degree.

The simplest possibility is to interpolate each pair of adjacent points by a polynomial of degree at most one. This corresponds either to the interpolation by values at two points, or to guessing the slope and employing Hermite's first order interpolation at a single point. This is a common way of displaying data. It means that the derivative will be constant on each of the segments, with a discontinuous 'jump' at the given points, and there is no freedom for improvements.

A more sophisticated method is to prescribe the value and the derivative at each point. We then have four values for each pair of neighbouring points. As seen earlier, this uniquely determines Hermite's polynomial of degree three. This polynomial can then be used for all the values of the input variable between the two distinguished points x0 < x1. Such a piece-wise polynomial approximation has the property that the first derivatives are compatible (equal) at the meeting points xi.

In practice, mere compatibility of the first derivatives is often insufficient. Consider, for instance, railway tracks, where the second derivative corresponds to acceleration. Discontinuous jumps in the second derivative would be very undesirable. So, instead of requiring fixed values of the first derivatives, we request the equality of both the first and second derivatives of the adjacent cubic pieces, as well as fixing the values at the given points. This requirement yields the same number of equations and unknowns, and so the problem is solvable similarly to the 1st order Hermite interpolation problem:

Cubic splines

Let x0 < x1 < · · · < xn be real values at which the required values y0, . . . , yn are given. A cubic interpolation spline for this assignment is a function S : R → R which satisfies the following conditions:
• the restrictions of S to the intervals [xi−1, xi] are polynomials Si of degree at most three, for all i = 1, . . . , n;
• Si(xi−1) = yi−1 and Si(xi) = yi for all i = 1, . . . , n;
• S′i(xi) = S′i+1(xi) for all i = 1, . . . , n − 1;
• S″i(xi) = S″i+1(xi) for all i = 1, . . . , n − 1.

5.A.16. Splines via Sage. Splines can be treated in Sage via the command spline. For instance, a graphical implementation of the previous example in Sage relies on the given interpolated points and goes as follows:

pts=[(-1, 0), (0, 1), (1, 0)]; S=spline(pts)
a=plot(S, -1, 1, color="darkgray", figsize=4)
show(points(pts)+a)

Verify yourself that for the spline in question this syntax gives the same graph (so this provides a graphical verification of our previous computation). Changing the points of the spline causes, of course, the spline to be recomputed. Notice also that this syntax does not return the explicit form of S as a cubic polynomial. However, it allows us to compute the values of the spline, e.g., by adding the syntax

show(S(0.4)); show(S(0.5)); show(S(0.6))

This returns the values of S at the chosen points, and one can easily verify that these values coincide with those obtained by applying the solution given in 5.A.15 at the same points. Notice, however, that we cannot compute the value of the spline at points which do not lie between the given interpolated points.

5.A.17. Find a (cubic) spline S which satisfies S(−1) = 0, S(0) = 1, S(1) = 0, S′(−1) = −1, S′(1) = 1. Hint: Apply the same trick as in 5.A.15. ⃝

5.A.18. Interpolate the function f(x) = e^{x²} − e on the interval [−6/5, 6/5] by the (unique) natural cubic spline S corresponding to the partition x0 = −1.2, x1 = −1, x2 = 0, x3 = 1, x4 = 1.2. Next, using Sage, plot the functions f and S (together with the interpolated nodes), and verify your result based on the "spline" method described in 5.A.16.

Solution. By assumption we have five nodes, x0 = −1.2, x1 = −1, x2 = 0, x3 = 1, x4 = 1.2, and so four intervals [x0, x1], [x1, x2], [x2, x3], [x3, x4]. Thus, in terms of the definition of cubic splines given in paragraph 5.1.9, we have

The cubic spline³ for n + 1 points consists of n cubic polynomials. There are 4n free parameters (by the first condition from the definition). The other conditions then yield 2n + (n − 1) + (n − 1) more equalities. Two parameters remain free.
The values of the derivatives at the marginal points may be prescribed explicitly (the complete spline), or the second derivatives there can be set to zero (the natural spline). Unfortunately, the computation of the whole spline is not as easy as the independent computations of Hermite's cubic polynomials, because the data mingles between adjacent intervals. However, ordering the variables and equations properly gives a matrix of the system such that all of its nonzero elements appear only on three diagonals. Such matrices are nice enough to be solved in a time proportional to the number of points, using a suitable numerical method. The results are stunning. For comparison, look at the interpolation of the same data as in the case of the Lagrange polynomial, now using splines. The spline is the solid line; the interpolated function is again the dotted line. Although the diagrams look nearly identical, the data is different.

2. Real numbers and limit processes

Polynomials and splines do not supply a sufficiently large stock of functions to express many dependencies. Actually, the first problem to solve is how to define the values of more general functions at all. In principle, all we can get with a finite number of multiplications and additions is polynomial functions. Perhaps division by polynomial quantities, and some efficient manipulation with rational numbers, can be added. However, we cannot restrict ourselves to rational numbers. For instance, √2 is not a rational number. Thus the first step is a thorough introduction to limit processes. We define precisely what it means for a sequence of numbers to approach another number.

³The name comes from the name of an elastic ruler used by engineers to draw smooth curves through interpolation points. In fact, the requirement on the equality of the first and second derivatives is a good model for natural elasticity behaviour.

n = 4. This means that the spline S(x) has the form

S1(x) = a3x³ + a2x² + a1x + a0,  x ∈ [−6/5, −1],
S2(x) = b3x³ + b2x² + b1x + b0,  x ∈ [−1, 0],
S3(x) = c3x³ + c2x² + c1x + c0,  x ∈ [0, 1],
S4(x) = d3x³ + d2x² + d1x + d0,  x ∈ [1, 6/5],

with ai, bi, ci, di ∈ R for all i = 0, 1, 2, 3. In our terms, we also have y0 = f(x0) ≈ 1.502, y1 = f(x1) = 0, y2 = f(x2) ≈ −1.718, y3 = f(x3) = 0, y4 = f(x4) ≈ 1.502. Let us summarize the sixteen conditions that determine S:

S1(x0) = y0 ⇔ −(6/5)³a3 + (6/5)²a2 − (6/5)a1 + a0 = 1.502,
S1(x1) = y1 ⇔ −a3 + a2 − a1 + a0 = 0,
S2(x1) = y1 ⇔ −b3 + b2 − b1 + b0 = 0,
S2(x2) = y2 ⇔ b0 = −1.718,
S3(x2) = y2 ⇔ c0 = −1.718,
S3(x3) = y3 ⇔ c3 + c2 + c1 + c0 = 0,
S4(x3) = y3 ⇔ d3 + d2 + d1 + d0 = 0,
S4(x4) = y4 ⇔ (6/5)³d3 + (6/5)²d2 + (6/5)d1 + d0 = 1.502,
S′1(x1) = S′2(x1) ⇔ 3a3 − 2a2 + a1 = 3b3 − 2b2 + b1,
S′2(x2) = S′3(x2) ⇔ b1 = c1,
S′3(x3) = S′4(x3) ⇔ 3c3 + 2c2 + c1 = 3d3 + 2d2 + d1,
S″1(x1) = S″2(x1) ⇔ −6a3 + 2a2 = −6b3 + 2b2,
S″2(x2) = S″3(x2) ⇔ b2 = c2,
S″3(x3) = S″4(x3) ⇔ 6c3 + 2c2 = 6d3 + 2d2,
S″1(x0) = 0 ⇔ −(36/5)a3 + 2a2 = 0,
S″4(x4) = 0 ⇔ (36/5)d3 + 2d2 = 0.

Solving this system of linear equations in Sage, we get the unique solution

S1(x) = (4933/380)x³ + (44397/950)x² + (228243/4750)x + 135841/9500,
S2(x) = −(28837/9500)x³ − (3129/2375)x² − 859/500,
S3(x) = (28837/9500)x³ − (3129/2375)x² − 859/500,
S4(x) = −(4933/380)x³ + (44397/950)x² − (228243/4750)x + 135841/9500.

Hence we obtain the plot of S(x) on [−1.2, 1.2] shown below.
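The text does not display the Sage code used for the step "Solving this system of linear equations in Sage". One possible block (our own sketch; the decimal values 1.502 and −1.718 are taken as the exact rationals 751/500 and −859/500, which matches the printed coefficients) is:

# Our sketch for solving the sixteen spline conditions of 5.A.18 exactly.
var("a3 a2 a1 a0 b3 b2 b1 b0 c3 c2 c1 c0 d3 d2 d1 d0 x")
S1 = a3*x^3 + a2*x^2 + a1*x + a0; S2 = b3*x^3 + b2*x^2 + b1*x + b0
S3 = c3*x^3 + c2*x^2 + c1*x + c0; S4 = d3*x^3 + d2*x^2 + d1*x + d0
y0, y1, y2 = 751/500, 0, -859/500   # f(-6/5), f(-1), f(0) as rationals
eqs = [S1.subs(x=-6/5) == y0, S1.subs(x=-1) == y1,
       S2.subs(x=-1) == y1, S2.subs(x=0) == y2,
       S3.subs(x=0) == y2, S3.subs(x=1) == y1,
       S4.subs(x=1) == y1, S4.subs(x=6/5) == y0,
       S1.diff(x).subs(x=-1) == S2.diff(x).subs(x=-1),
       S2.diff(x).subs(x=0) == S3.diff(x).subs(x=0),
       S3.diff(x).subs(x=1) == S4.diff(x).subs(x=1),
       S1.diff(x, 2).subs(x=-1) == S2.diff(x, 2).subs(x=-1),
       S2.diff(x, 2).subs(x=0) == S3.diff(x, 2).subs(x=0),
       S3.diff(x, 2).subs(x=1) == S4.diff(x, 2).subs(x=1),
       S1.diff(x, 2).subs(x=-6/5) == 0, S4.diff(x, 2).subs(x=6/5) == 0]
show(solve(eqs, a3, a2, a1, a0, b3, b2, b1, b0,
           c3, c2, c1, c0, d3, d2, d1, d0))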
An important property of polynomials is the "continuous" dependency of their values on the input variable. Intuitively, we expect that if x is changed by a little, then the value of f(x) also changes only a little. This behaviour is not possessed by piece-wise constant functions f : R → R near the "jump discontinuities". For instance, the Heaviside function⁴

f(x) = 0 for all x < 0,  f(0) = 1/2,  f(x) = 1 for all x > 0,

has this type of "discontinuity" at x = 0. We now formalize these intuitive statements.

5.2.1. Real numbers. We have dealt with the algebraic properties of real numbers, summarized by the claim that R is a field. However, we have also used the relation of the standard (total) ordering of the real numbers, denoted by "≤"; see paragraph 1.6.3 on page 43. The properties (axioms) of the real numbers, including the connections between the relations and the other operations, are enumerated in the following table. The bars indicate how the axioms guarantee that the real numbers form an abelian (commutative) group with respect to addition, that R \ {0} is an abelian group with respect to multiplication, that R is a field, and that the set R together with the operations +, · and the order relation is an ordered field. The last axiom can be considered as claiming that R is "sufficiently dense".

⁴Oliver Heaviside (1850-1925) was an unconventional English electrical engineer with an innovative and very original approach to practical mathematical modelling. His famous sayings include "Mathematics is an experimental science, and definitions do not come first, but later on", or, defending his incomplete argumentation, "I do not refuse my dinner simply because I do not understand the process of digestion". Is this suggestive of the methodology of this textbook?

Using the command spline, one can obtain a graphical verification, as before. Although the code is given below, we leave it to the reader to check the result:

f(x)=e^(x^2)-e
plf=plot(f, x, -1.2, 1.2, color="black")
pts=[(-1.2, f(-1.2)), (-1, f(-1)), (0, f(0)),
(1, f(1)), (1.2, f(1.2))]
S=spline(pts)
sps=plot(S, -1.2, 1.2, color="steelblue")
show(points(pts, size=40)+sps+plf)

□

5.A.19. Without calculation, construct the natural cubic interpolation spline for the points x0 = −1, x1 = 0 and x2 = 2, with the value y0 = y1 = y2 = 1 at these points.

Solution. The natural spline requires the second derivative to vanish at the outer boundary points. Thus the constant spline S1(x) ≡ 1, S2(x) ≡ 1 satisfies all the conditions, and so it must be the unique solution. □

5.A.20. Construct the natural cubic interpolation spline for the function f(x) = 1/(1 + x²), selecting the nodes x0 = 0, x1 = 1, and x2 = 3. ⃝

Additional problems concerning polynomial interpolation can be found in the final section E; see also Chapter 6.

B. Real numbers and limit processes

In this section we treat limits of sequences and of functions. We will also discuss some basic topological notions about subsets of the real line R or the complex plane C. In this way we establish the foundations of calculus, and also become familiar with elementary ideas from mathematical analysis, which we will revisit later, in Chapter 7. Functions and limits are among the most fundamental concepts in mathematics, and they are ideas that came of age in the 17th century.³ The French mathematician Augustin-Louis

³The French mathematician Pierre de Fermat (1607-1665) was one of the first to realize the importance of limits.
When Newton and Leibniz later invented the calculus, they did not use "delta-epsilon" proofs, and it took more than a century to develop them. From this perspective, it should be no wonder that modern students often find an introduction to calculus difficult. Hence you should not worry much whenever some of the notions that we discuss below seem difficult. Mastering limits takes a little while; once it is mastered, however, it becomes easier to understand differentiation and integration.

Axioms of the real numbers

(R1) (a + b) + c = a + (b + c), for all a, b, c ∈ R;
(R2) a + b = b + a, for all a, b ∈ R;
(R3) there is an element 0 ∈ R such that a + 0 = a for all a ∈ R;
(R4) for all a ∈ R, there is an additive inverse (−a) ∈ R such that a + (−a) = 0;
(R5) (a · b) · c = a · (b · c), for all a, b, c ∈ R;
(R6) a · b = b · a, for all a, b ∈ R;
(R7) there is an element 1 ∈ R, 1 ≠ 0, such that 1 · a = a for all a ∈ R;
(R8) for all a ∈ R, a ≠ 0, there is a multiplicative inverse a⁻¹ ∈ R such that a · a⁻¹ = 1;
(R9) a · (b + c) = a · b + a · c, for all a, b, c ∈ R;
(R10) the relation ≤ is a total order, i.e. reflexive, antisymmetric, transitive, and total on R;
(R11) for all a, b, c ∈ R, a ≤ b implies a + c ≤ b + c;
(R12) for all a, b ∈ R, a > 0 and b > 0 implies a · b > 0;
(R13) every non-empty set A ⊂ R which has an upper bound has a least upper bound.

The concept of a least upper bound from axiom (R13), also called the supremum, is very important. It makes sense for any partially ordered set, i.e. a set with a (not necessarily total) ordering relation. Recall that an ordering relation is a binary relation on a set which is reflexive, antisymmetric, and transitive; see paragraph 1.6.3.

Supremum and infimum

Consider a subset A ⊂ B in a partially ordered set B. An upper bound of the set A is any element b ∈ B such that b ≥ a for all a ∈ A. Similarly, a lower bound of the set A is an element b ∈ B such that b ≤ a for all a ∈ A. The least upper bound of the set A, if it exists, is called its supremum and is denoted by sup A. Similarly, the greatest lower bound, if it exists, is called the infimum and is denoted by inf A.

Thus the last axiom (R13) from the table of properties of the real numbers can be reformulated as follows: every non-empty bounded set A of real numbers has a supremum. This means that if there is a number a which is larger than or equal to all numbers x ∈ A, then there is a smallest number with this property. For instance, the choice A = {x ∈ Q, x² < 2} gives the supremum sup A, which is called √2, the square root of two. An immediate consequence of this axiom is the existence of the infimum for any non-empty set of real numbers bounded from below. Observe that changing the sign of all the numbers in a set interchanges suprema and infima. For the formal construction, it is necessary to know whether or not there is a set R with the operations and ordering relation which satisfies the thirteen axioms. So far,

Cauchy (1789-1867) was probably the first to put the calculus on a rigorous basis. To be more specific, Cauchy introduced the "delta-epsilon" notation commonly encountered in the definition of limits. While such proofs are presented here, we have chosen to de-emphasize them after a certain point. This is because many calculus problems can be approached in various alternative ways. Additionally, this approach allows us to conserve space, which is then utilized to implement solutions using Sage.

Our initial tasks involve the concepts of the supremum (sup A) and the infimum (inf A) of a subset A of real or complex numbers, which are elementary notions discussed in 5.2.1. These represent the smallest upper bound and the largest lower bound of A, respectively. They always exist when A is a bounded subset, but they do not necessarily belong to A, as they can be either limit points or isolated points of A; see 5.2.5 and 5.2.7.
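Before turning to the exercises, it may help to experiment numerically. The following little block is our own sketch (the sample set {1/n : n ∈ Z⁺} is an arbitrary choice, not one of the exercises): tabulating finitely many elements suggests candidates for sup and inf, although of course the proofs must still be given by hand.

# Numerical experiments suggest sup and inf; here A = {1/n : n in Z+}.
sample = [1/n for n in range(1, 1001)]
print(max(sample), min(sample))   # 1 and 1/1000; inf A = 0 is not attained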
5.B.1. Find sup A and inf A, if they exist, for

A = \left\{ \frac{n + (−1)^n}{n} : n ∈ \mathbb{N}^* \right\} ⊂ R,

where, as usual, we set N* = N \ {0} = Z⁺. ⃝

5.B.2. Determine the infima and suprema of the sets

B = \left\{ \frac{(−1)^n}{n^2} : n ∈ \mathbb{Z}^+ \right\},  C = (−9, 9) ∩ Q,  X = \left\{ −\frac{1}{n} : n ∈ \mathbb{Z}^+ \right\},  Y = (0, 2] ∪ [3, 5] \ {4},

and decide whether they belong to these sets or not. ⃝

Sequences are functions defined on the set of natural numbers. Next, we are interested in the behaviour of such functions as n tends to infinity; see also the discussion in 5.2.3, 5.2.9 and 5.2.10. To begin with, recall that to any subset A of the real or complex numbers we can assign the notion of the "limit points" of A. These are points x for which each of their ε-neighbourhoods Oε(x) = {z : |z − x| < ε} contains at least one point of A other than x. Be aware that limit points may not belong to A. Having in hand this definition of limit points, we can define a "convergent sequence" (xn) of real or complex numbers as a sequence approaching its only limit point x as n tends to infinity; see 5.2.3. In this case we say that (xn) converges to x (or tends to x), and we write xn → x, or

\lim_{n→∞} x_n = x.

When the limit does not exist as a real or complex number, or is ±∞, the sequence (xn) is called divergent; see also below. Notice that in our definition of limit points the neighbourhood Oε(x) is either an open interval in R or an open disc in C. The related topological notion of open sets is introduced in 5.2.7, see also 5.2.5.

only the rational numbers have been constructed formally. These form an ordered field; that is, Q satisfies the axioms (R1)–(R12), as can easily be verified. We do not go into the details of a consistent construction of the real numbers now. We will be satisfied with an intuitive idea of the real line, and we will work with the axioms (R1) through (R13). But we shall come back to this issue in a more general framework in Chapter 7; see paragraph 5.2.4 and the discussion started in 7.3.6. Actually, we shall see that if the real numbers can be constructed, then the construction is unique up to isomorphism, i.e. up to a bijection preserving all the algebraic structures of two different realizations of the field R.

5.2.2. The complex plane. Recall that the complex numbers are given as pairs of real numbers. We usually write them as z = Re z + i Im z. Therefore, the plane C = R² is an appropriate image of the complex numbers. With addition and multiplication, the complex numbers satisfy the axioms (R1)–(R9) and thus form a field. There is, however, no natural ordering defined on them which would satisfy the axioms (R10)–(R13). Nevertheless, we work with them, since extending real scalars to the complex numbers is highly advantageous and sometimes even necessary. There is an important operation on the complex numbers called complex conjugation. It is the reflection symmetry with respect to the line of real numbers. We denote it by a bar over the number z ∈ C:

\bar{z} = Re z − i Im z.

It changes the sign of the imaginary part.
Since for z = x + iy,
z · ¯z = (x + iy)(x − iy) = x^2 + y^2 ,
this value expresses the squared distance of the complex number z from the origin. The square root of this non-negative real number is called the absolute value of the complex number z; written

(1) |z|^2 = z · ¯z.

The absolute value can be defined on any ordered field of scalars K. Define the absolute value |a| as follows:
|a| = a if a ≥ 0, and |a| = −a if a < 0.
For any numbers a, b ∈ K,

(2) |a + b| ≤ |a| + |b|.

This property is called the triangle inequality. It holds also for the absolute value of the complex numbers. For the fields of rational numbers or real numbers, both of which are subfields of the complex numbers, both definitions of the absolute value coincide. The absolute value must be understood in the context of whichever field K of rational, real, or complex numbers is involved. The triangle inequality holds in all these cases.

To make the description in this column more functional, we decided to present problems related to such (and other) topological notions a little later, after mastering the notion of limits.

5.B.3. Convergence and divergence. Based on the formal definition of the convergence or the divergence of a sequence, prove the following:
(a) an = 1/n → 0, as n → +∞,
(b) an = 1/n^2 → 0, as n → +∞,
(c) an = n/(n+1) → 1, as n → +∞,
(d) if 0 < x < 1 then an = x^n → 0, as n → +∞,
(e) if x > 1 then an = x^n → +∞, as n → +∞,
(f) an = n^2 → +∞, as n → +∞.

Solution. (a) Let ε > 0 be given. We have to find some positive integer N such that |1/n − 0| < ε for all n ≥ N. Clearly, the condition |1/n − 0| < ε is equivalent to 1/n < ε, i.e., n > 1/ε. Hence taking N ∈ Z+ with N > 1/ε we see that 1/n ≤ 1/N < ε, so that |1/n − 0| < ε for all n ≥ N. This means that 1/n → 0.
(b) For any ε > 0 we see that 1/n^2 < ε is equivalent to n^2 > 1/ε, that is, n > 1/√ε. Thus, taking N ∈ Z+ with N > 1/√ε we have 1/n^2 < ε, for all n ≥ N. It follows that 1/n^2 → 0.
(c) Again, for given ε > 0 we wish to find some N ∈ Z+ such that |n/(n+1) − 1| < ε, provided that n ≥ N. Notice that the condition |n/(n+1) − 1| < ε is equivalent to 1/(n+1) < ε, i.e., 1/ε < n + 1. Hence, choosing N with N > 1/ε − 1 we see that the inequality n ≥ N implies n + 1 > 1/ε, and hence |n/(n+1) − 1| < ε. Thus n/(n+1) → 1.
(d) We have 0 < x < 1 and thus 1/x > 1. Hence 1/x = 1 + h, for some h > 0, and by applying the binomial rule (a + b)^n = ∑_(k=0)^n (n over k) a^(n−k) b^k to the power (1/x)^n we get
1/x^n = (1 + h)^n = (1 + nh + · · · + h^n) > 1 + nh .
Thus 0 < x^n < 1/(1 + nh). Since 1/(1 + nh) → 0 we will also have x^n → 0, as n → ∞. For instance, given some ε > 0 we see that 0 < x^n < ε is true whenever 1/(1 + nh) < ε. Hence for n ≥ N, with N being the least positive integer larger than (1/ε − 1)/h, we are done.
(e) Let us describe a proof that again is based on the binomial theorem. Since x > 1 we may write x = 1 + h for some h > 0, and by the binomial rule we deduce that x^n ≥ nh for all n. Since nh → ∞ it follows that x^n → ∞ as well. In fact, in this case we say that the sequence (x^n) "diverges to infinity". It is useful to state this as a general rule: for a sequence (an), saying that an → +∞ means that for any given ζ > 0 there exists some N ∈ Z+ such that an > ζ for all n ≥ N. Similarly, saying that a sequence (an) diverges to −∞ means that for any given ζ > 0 there exists N ∈ Z+ such that an < −ζ for all n ≥ N. Let us finally consider case (f). Given ζ > 0, the condition n^2 > ζ is satisfied if n > √ζ. Thus, for N > √ζ we see that n^2 > ζ for all n ≥ N. This implies that n^2 → +∞. □

5.2.3. Convergence of a sequence. We wish to formalize the notion of a sequence of numbers approaching a limit.
The key object of interest is a sequence of numbers ai, where the index i usually goes through all the natural numbers. Denote the sequences loosely either as a0, a1, . . ., or as infinite vectors (a0, a1, . . .), or as (ai)∞i=1.

Cauchy sequences
Consider a sequence (a0, a1, . . .) of elements of K such that for any fixed positive number ε > 0,
|ai − aj| < ε
for all but finitely many terms ai, aj of the sequence. In other words, for any fixed ε > 0, there is an index N such that the above inequality holds for all i, j > N. Loosely put, the elements of the sequence are eventually arbitrarily close to each other. Such a sequence is called a Cauchy sequence.

Intuitively, either all but finitely many of the sequence's terms are equal (then |ai − aj| = 0 from some index N on), or they "approach" some particular value. This is easily imaginable in the complex plane. Choose an arbitrarily small disc (with radius equal to ε). Suppose a Cauchy sequence is given. It must be possible to put the disc into the complex plane in such a way that it covers all but finitely many of the elements of the infinite sequence ai. Imagine that such discs have very small radii, and all contain a number a; see the diagram. If such a value a ∈ K exists for a Cauchy sequence, we would expect the sequence to have the property of convergence:

Convergent sequences
A sequence (ai)∞i=0 converges to a value a if and only if for any positive real number ε, |ai − a| < ε for all but finitely many indices i.

Notice that the set of those i for which the inequality does not hold may depend on ε. The number a is called the limit of the sequence (ai)∞i=0, and we write lim_{i→∞} ai = a.

If a sequence ai ∈ K, i = 0, 1, . . ., converges to a ∈ K, then for any fixed positive ε, |ai − a| < ε for all i greater than

5 Augustin-Louis Cauchy (1789-1857) was a French mathematician pioneering a rigorous approach to infinitesimal analysis. He was very productive, writing about 800 research articles. There are dozens of concepts and theorems named after him.

5.B.4. Use Sage to plot the first 30 terms of the sequence (an) given in Problem 5.B.3, case (c).

Solution. Sage provides many alternatives for plotting a sequence (an). Here we utilize a method based on the object Graphics(), which is employed when initializing a for loop over various graphics objects (further techniques will be analyzed in the sequel). Therefore, to plot the first 30 terms of the sequence (n/(n+1)), n ∈ Z+, type

p = Graphics()
for n in srange(1, 30+1):
    p = p + points((n, n/(n+1)), color="black")
show(p)

This produces the figure shown here (the plot of the first 30 terms). □

For many of the solutions presented in 5.B.3, we have repeatedly used the fact that for every real number x there exists a natural number n ∈ N∗ such that n > x. This is the so-called Archimedean property of R, and it will be used also below without mentioning it explicitly. In fact, it is not hard to prove that the Archimedean property is equivalent to saying that for each positive real x there exists n ∈ N∗ such that 1/n < x.

5.B.5. Infinite limits. Show that if an → +∞ and bn → +∞, then an · bn → +∞. Also, assuming an → a > 0 and bn → +∞, show that an · bn → +∞. What is the situation for a < 0 and for a = 0?

Solution. (Hints) First, note that in this context the rule of the product of limits must be carefully extended (see 5.2.13 for the product rule).
This is because at least one of the limits in the given statements is infinite, refer to the discussion in 5.2.14 for further details. To prove the statement we will apply the definition given in case (e) in 5.B.3. We will describe the proof for the second statement only, and the proof of the first is left for practice. CHAPTER 5. ESTABLISHING THE ZOO a certain N ∈ N. By the triangle inequality, |ai − aj| = |ai − a + a − aj| < |ai − a| + |a − aj| < 2ε. for all pairs of indices i, j ≥ N. Thus: Lemma. Every convergent sequence is a Cauchy sequence. However, in the field of rational numbers, it can happen that for a Cauchy sequence a corresponding value a does not exist. For instance, the number √ 2 can be approached by a sequence of rational numbers ai, thereby obtaining a sequence converging to √ 2, but the limit is not rational. Ordered fields of scalars in which every Cauchy sequence converges are called complete. The following theorem states that the axiom (R13) guarantees that the real numbers are such a field: Theorem. Every Cauchy sequence of real numbers ai converges to a real number a ∈ R. Proof. The terms of any Cauchy sequence form a bounded set since any choice of ε bounds all but finitely many of them. Let B be the set of those real numbers x for which x < aj for all but finitely many terms aj of the sequence. B has an upper bound, and thus B has a supremum as well, by (R13). Define a = sup B. Fix ε > 0, and choose N so that |ai − aj| < ε for all i, j ≥ N. Then aj > aN − ε and aj < aN + ε for all indices j > N, and so aN − ε belongs to B, while aN + ε does not. Hence |a − aN | ≤ ε, and thus |a − aj| ≤ |a − aN | + |aN − aj| ≤ 2ε for all j > N. So a is the limit of the given sequence. □ Corollary. Every Cauchy sequence of complex numbers zi converges to a complex number z. Proof. Write zi = ai+i bi. Since |ai−aj|2 ≤ |zi−zj|2 and similarly for the values bi, both sequences of real numbers ai and bi are Cauchy sequences. They converge to a and b, respectively. It is easily verified that z = a + i b is the limit of the sequence zi. □ 5.2.4. Remark. The previous discussion proposes a construction method for the real numbers. Proceed similarly to building the integers from the natural numbers (adding in all additive inverses). Build the rational numbers from the integers (adding all multiplicative inverses of non-zero numbers). Then “complete” the rational numbers by adding in all limits of Cauchy sequences. Cauchy sequences (ai)∞ i=0 and (bi)∞ i=0 of rational numbers are equivalent if and only if the distances |ai − bi| converge to zero. This is the same as the condition that merging these sequences into a single sequence also yields a Cauchy sequence. For example, a sequence can be formed by selecting alternately terms from the first sequence and the second sequence. Check the properties of the equivalence relations. Clearly the relation is reflexive, it is symmetric (since the distance of the rational numbers is symmetric in its arguments) 384 By assumption, limn→+∞ an = a > 0, so choosing ε = a/2 > 0 we can find N ∈ Z+ such that |an − a| < a 2 for all n ≥ N. This implies that a 2 < an < 3a 2 , for all n ≥ N, and so an > a 2 for n ≥ N. Also, choosing any ζ > 0 there exists M ∈ Z+ such that bn > ζ for all n ≥ M. Hence, for n ≥ L := max{M, N} both conditions hold and in particular an · bn > a 2 · ζ > 0. Taking ζ = 2˜ζ a for some ˜ζ > 0 we get an · bn > ˜ζ for all n ≥ L. Thus an · bn → +∞, and a·(+∞) = +∞ for a > 0. For a < 0 we have an ·bn → −∞, and thus a·(+∞) = −∞, for a < 0. 
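These two conclusions can be sanity-checked in Sage on concrete sequences of our own choosing, say an = 3 + 1/n → 3 > 0 and an = −3 + 1/n → −3 < 0, multiplied by bn = n^2 → +∞:

n = var("n")
print(lim((3 + 1/n)*n^2, n=oo))    # a > 0: the product diverges to +Infinity
print(lim((-3 + 1/n)*n^2, n=oo))   # a < 0: the product diverges to -Infinity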
However, the case a = 0 will produce a so-called indeterminate form, namely 0 · (+∞), which is excluded. □ 5.B.6. Using the results from 5.B.3 explain why: (a) The sequence ( an = 1 n + 3n ) diverges to +∞. (b) The sequence (bn = n2 + 2n ) diverges to +∞. (c) The sequence (cn = n2 · 3n ) diverges to +∞. (d) The sequence ( dn = n·4n n+1 ) diverges to +∞. ⃝ 5.B.7. (a) Prove that (an = 1 2n ) tends to zero as n → ∞. (b) Prove that (bn = (−1)n+1 ) is divergent. (c) Prove that (cn = sin(nπ/2)) is divergent. (d) Find sequences (xn) and (yn) with infinite limits, such that lim n→∞ (xn + yn) = 1 and lim n→∞ ( xny2 n ) = +∞. ⃝ 5.B.8. Show that lim n→∞ n √ n = 1 (notice n √ n = n 1 n ). Then present a proof via Sage. Solution. Apparently, for all naturals n ≥ 1 we have n √ n ≥ 1. So we can set n √ n = 1+an for certain numbers an ≥ 0. Now, by the binomial theorem we see that n = (1 + an) n = 1 + ( n 1 ) an + ( n 2 ) a2 n + · · · + an n , for all natural numbers n ≥ 2. Hence we have the bound n ≥ ( n 2 ) a2 n = n (n − 1) 2 a2 n, for all naturals n ≥ 2, which leads to 0 ≤ an ≤ √ 2 n−1 for all such n. Having established these inequalities, we can now use the squeeze theorem (see 5.2.12). This implies that limn→∞ an = 0, and hence limn→∞ n √ n = limn→∞ (1 + an) = 1. One should finally mention that it possible to prove the result by means of the so called “l’Hopital’s rule”, introduced in 5.3.10. However, such a procedure is based on the notion of derivatives and hence we will return to this later, see 5.E.109. For computing limits in Sage we use the command limit or its alias lim. We also recall that the mathematical symbol ∞ is represented either by oo or by Infinity (or infinity). Let us now illustrate the implementation of these rules via our example. So, an appropriate cell has the form n=var("n");a(n)=n**(1/n);lim(a(n), n=oo) Or, as we said, one could replace the last command by lim(a(n), n = Infinity). Notice that first one should introduce the variable n, as a symbolic variable. This is because CHAPTER 5. ESTABLISHING THE ZOO and transitivity follows easily from the triangle inequality. Thus, we may define R as the set of equivalence classes on the above set of sequences. We introduce algebraic structures on this set R and check their properties. Of course, the rational numbers can be represented by constant sequences, so that Q ⊂ R, as expected. Next, define the sum and product of equivalence classes by taking the sum and product of sequences representing them, respectively. It is easy to check that the results represent a class independent of the choices. Ordering is dealt with similarly. Here it is required to prove that a ≤ b if and only if there are representatives with ai ≤ bi. Finally it is necessary to show that all Cauchy sequences in R converge. We do not go into details now and advise the reader to return back and check all the details when going through the full discussion of the completion of metric spaces in the paragraph 7.3.6. The arguments used there with the real scalars replaced by rational ones provide an adequate proof. The arguments proving that the axioms (R1)–(R13) define the real numbers uniquely up to isomorphism are also to be found there. 5.2.5. Closed sets. For further work with real or complex numbers, we need to understand the notions of closeness, boundedness, convergence, and so on. These concepts belong to the topic “topology”6 . As before, we work with K = R or K = C. 
We advise the reader to draw many diagrams for all the concepts and their properties, for both the real line and the complex plane. For any subset A of points in K, we are interested not only in the points a ∈ A, but also in the ones which can be approached by limits of sequences in A.

Limit points of a set
Let A be a subset of K. A point x ∈ K is called a limit point of A if and only if there is a sequence a0, a1, . . . of elements in A such that all its terms differ from x, yet its limit is x.

Notice that a limit point of a set may or may not belong to the set. For every non-empty set A ⊂ K and a fixed point x ∈ K, the set of all distances |x − a|, a ∈ A, is a set of real numbers bounded from below, and so it has an infimum d(x, A), which is called the distance of the point x from the set A. Notice that d(x, A) = 0 if and only if either x ∈ A or x is a limit point of A. (We suggest the reader prove this in detail directly from the definitions.)

6 The name of this mathematical discipline comes from the Greek "studying the shape" (topos + logos). The main concepts are built on the formalism of open and closed sets, compactness etc. We use the same names here but only in the realm of real and complex numbers. Later on, we go further in metric spaces in chapter 7.

our goal is to view a(n) as a function of n. Execute the given syntax yourself to practice with Sage. □

5.B.9. Limits of sequences. Compute the following limits, and then verify your answers via Sage:
(a) lim_{n→∞} (2n^2 + 3n + 1)/(n + 1),
(b) lim_{n→∞} (1^2 + 2^2 + · · · + n^2)^(1/n),
(c) lim_{n→∞} (√(4n^2 + n) − 2n),
(d) lim_{n→∞} (n + 1)/(2n^2 + 3n + 1),
(e) lim_{n→∞} (5^n + 1)/(7^n + 1),
(f) lim_{n→∞} √(4n^2 + n)/n.

Solution. (a) Set an = (2n^2 + 3n + 1)/(n + 1). Multiplying both the numerator and the denominator by 1/n, we have
lim an = lim (2n + 3 + 1/n)/(1 + 1/n) = (∞ + 3 + 0)/(1 + 0) = ∞ .
To confirm this result in Sage, give the cell

n = var("n"); a(n) = (2*n^2+3*n+1)/(n+1)
lim(a(n), n=infinity)

Sage's answer is +Infinity.
(b) Let us present a solution based on a very useful theoretical result, the so-called "squeeze theorem" presented in 5.2.12. In particular, set an = (1^2 + 2^2 + · · · + n^2)^(1/n). Then for any positive natural n we have
bn = n^(1/n) ≤ an ≤ cn = (n^2 + n^2 + · · · + n^2)^(1/n) (n terms) = (n^3)^(1/n) .
By 5.B.8 we know that bn → 1. Moreover, since (n^3)^(1/n) = n^(3/n) = (n^(1/n))^3, we also conclude that cn → 1, as n → +∞. The squeeze theorem now applies and gives that an → 1. In Sage one can verify these claims as follows:

n, k = var("n, k")
a(n) = sum(k^2, k, 1, n)**(1/n)  # declare (a_n)
f(n) = (n^3)**(1/n); g(n) = n**(1/n)
print(all(bool(a(j) <= f(j)) for j in range(1, 1000)))
print(all(bool(a(j) >= g(j)) for j in range(1, 1000)))
print(lim(a(n), n=oo))

Notice that in the fourth and fifth lines we test the two posed inequalities for many values of n (bool converts each symbolic inequality into True or False, and all checks them together). However, one could directly compute the limit in question (hence, one may omit these two lines, together with the definitions of the sequences f(n) and g(n)).
(c) Set an = √(4n^2 + n) − 2n. Here the following classical trick applies:
lim an = lim (√(4n^2 + n) − 2n)(√(4n^2 + n) + 2n)/(√(4n^2 + n) + 2n) = lim n/(√(4n^2 + n) + 2n) = lim 1/(√(4n^2 + n)/n + 2) = 1/4 .
(d) We leave this for practice, since it can be treated as case (a). Verify yourself that the given limit equals 0 (informally, "1/∞ = 0").

Closed sets
The closure ¯A of a set A ⊂ K is the set of those points which have zero distance from A (note that the distance from the empty set of points is undefined, therefore the closure of ∅ is ∅).
A closed subset in K is a set which coincides with its closure; equivalently, a set is closed if it contains all of its limit points. On the real line, a closed interval [a, b] = {x ∈ R, a ≤ x ≤ b}, where a and b are fixed real numbers, is a closed set. The sets (−∞, b], [a, ∞), and (−∞, ∞) are also closed sets. A closed set may also be formed by a sequence of real numbers without a limit point, or a sequence with a finite number of limit points together with these points. The unit disc (including its boundary circle) in the complex plane is another example of a closed set.

An arbitrary intersection of closed sets is again a closed set. A finite union of closed sets is again a closed set. Indeed, if all of the points of some sequence belong to the considered intersection of closed sets, then they belong to each of the sets, and so do all the limit points. However, if we wanted to say the same about an arbitrary union, we would get into trouble: singleton sets are closed, but a sequence of points created from them may not be. On the other hand, if we restrict our attention to finite unions and consider a limit point of some converging sequence lying in this union, then the limit point must also be the limit point of any subsequence, especially one lying in only one of the united sets. As this set is assumed to be closed, the limit point lies in it, and thus it lies in the union.

5.2.6. Open sets. There is another useful type of subset of the real numbers: open intervals (a, b) = {x ∈ R; a < x < b}, where, again, a and b are fixed real numbers or the infinite values ±∞. It is an open set, in the following sense:

Open sets and neighbourhoods of points
An open set in K is a set whose complement is a closed set. A neighbourhood of a point a ∈ K is an open set O which contains a. If the neighbourhood is defined as Oδ(a) = {x ∈ K, |x − a| < δ} for some positive number δ, then we call it the δ-neighbourhood of the point a.

For real numbers, the δ-neighbourhood of a point a is the open interval of length 2δ, centered at a. In the complex plane, it is the disc of radius δ, also centered at a. Notice that for any set A, a ∈ K is a limit point of A if and only if every neighbourhood of a contains at least one point b ∈ A, b ≠ a.

Lemma. A set A ⊂ K of numbers is open if and only if every point a ∈ A has a neighbourhood contained in A.

(e) This case can be treated using the conclusions of 5.B.3. Hence we get
lim_{n→∞} (5^n + 1)/(7^n + 1) = lim_{n→∞} ((5/7)^n + 1/7^n)/(1 + 1/7^n) = (0 + 0)/(1 + 0) = 0 .
In Sage, for this result, work as above, i.e.,

n = var("n"); f(n) = (5^n+1)/(7^n+1)
lim(f(n), n=oo)

(f) Set an = √(4n^2 + n)/n. For any positive natural n we have that
bn = √(4n^2)/n < an < cn = √(4n^2 + n + 1/16)/n .
Moreover, it is easy to see that lim_{n→∞} bn = 2 = lim_{n→∞} cn. Thus, by the squeeze theorem it follows that lim_{n→∞} an = 2. □

5.B.10. Given a non-empty bounded set A ⊆ R, set a := sup A. Show that there exists a sequence (an) in A tending to a (notice that a similar statement holds for inf A).

Solution. By assumption a is the supremum of A, so a is an upper bound of A, and for every ζ > 0 there exists x ∈ A such that x > a − ζ. To prove this, fix some ζ > 0 and suppose, on the contrary, that every x ∈ A satisfies x ≤ a − ζ. Then a − ζ must also be an upper bound of A, a contradiction, since a = sup A is the least upper bound. Thus we can find some x ∈ A with x > a − ζ (note that the converse is also true). Choosing now ζ = 1/n for n ∈ Z+, so ζ > 0, we may denote the corresponding element by xn ∈ A, so that xn > a − 1/n.
Since xn ≤ a, we finally arrive to the inequality 0 ≤ a − xn < 1/n. Fix now some ε > 0 and choose N ∈ Z+ with 1/N < ε. Then for all n ≥ N we get |a − xn| < 1/n ≤ 1/N < ε, hence xn → a as n → ∞. □ A sequence (an) is said to be bounded above (respectively, bounded below), if there exists M ∈ R such that an ≤ M (respectively, an ≥ M), for all n that (an) is defined. A sequence that is bounded above and below is called bounded. Obviously, a sequence that diverges to +∞ (respectively, −∞) is not bounded above (respectively, is not bounded below). A sequence (an) is said to be increasing (respectively, decreasing) if an ≤ an+1 (respectively, an ≥ an+1) for all n that (an) is defined. If we have an < an+1 (respectively, an > an+1) for all n, then the sequence (an) is called strictly increasing (respectively, strictly decreasing). For instance, the sequence (n2 ) is strictly increasing, while the sequence (1/n) is strictly decreasing. 5.B.11. Monotone sequence theorem. Show that every bounded monotonic sequence is convergent. Solution. This extremely useful result can be equivalently rephrased as follows: “Every increasing sequence (an) which is bounded above converges to sup{an : n ∈ N∗ }, and similarly every decreasing sequence (bn) that is bounded below CHAPTER 5. ESTABLISHING THE ZOO Proof. Let A be an open set and let a ∈ A. If there is no neighbourhood of the point a inside A, then there is a sequence an /∈ A, |a − an| ≤ 1/n. But then the point a ∈ A is a limit point of the set K\A, which is impossible since the complement of A is closed. Suppose every a ∈ A has a neighbourhood contained in A. This prevents a limit point b of the set K \ A to lie in A. Thus the set K \ A is closed, and so A is open. □ From this lemma, it follows immediately that any union of open sets is an open set. A finite intersection of open sets is also an open set. 5.2.7. Bounded and compact sets. The closed and open sets are basic concepts of topology. Without going into deeper connections, the above material describes the topology of the real line and the topology of the complex plane. The following concepts are extremely useful: Bounded and compact sets A set A of rational, real, or complex numbers is called bounded if and only if there is a positive real number r such that |z| ≤ r for all numbers z ∈ A. Otherwise, the set is called unbounded. A set which is both bounded and closed is called com- pact. An interior point of a set A is a point such that one of its neighbourhoods is contained in A. A boundary point of a set A is a point for which all its neighbourhoods have a non-trivial intersection with both A and its complement K \ A. A boundary point of the set A may or may not belong to it. An open cover of a set A is such a collection of open sets Ui, i ∈ I, that its union contains the whole of A. An isolated point of a set A is a point a ∈ A such that there is a neighbourhood N of a satisfying N ∩ A = {a}. 5.2.8. Theorem. All subsets A ⊂ K of real or complex numbers satisfy: (1) a non-empty set A ⊂ R is open if and only if it is a union of countably (or finitely) many open intervals; similarly 387 converges to inf{bn : n ∈ N∗ }. We will prove the first statement and an analogous method applies for the second one. Let (an) be a sequence and assume for simplicity that n ∈ N∗ . Assume also that (an) is bounded above with an ≤ an+1 for all n ∈ N∗ . Since (an) is bounded above, its range {an : n ∈ N∗ } is also bounded above (as a set), and hence sup{an : n ∈ N∗ } exists. Set a := sup{an : n ∈ N∗ }. 
Recall that if a is the supremum of a non-empty set A, then a is an upper bound of A and for every ζ > 0 there exists x ∈ A such that x > a − ζ (see the proof of 5.B.10). In our case this means that for any given ζ > 0 there exists a natural N with a − ζ < aN, that is, a − aN < ζ. However, (an) is increasing, so for all n ≥ N we have an ≥ aN, and in particular a − ζ < aN ≤ an ≤ a < a + ζ, for all such n. Thus |an − a| < ζ for all n ≥ N, i.e., an → a. □

5.B.12. Present a "counterexample" verifying that the converse of the monotone sequence theorem fails. Moreover, give an example of a monotone sequence which is not convergent. ⃝

5.B.13. Given the sequences (an = sin(n)/n), (bn = n^2 + 1) and (cn = ((−1)^n + n)/n), with n ∈ Z+ in all three cases, determine which of them are bounded. ⃝

5.B.14. Indicate the convergent sequences among those given in the previous problem 5.B.13, and determine their limits. Repeat for the sequences fn = 1/√n and gn = n!/n^n, where n ∈ Z+. ⃝

5.B.15. Let a > 0. Show that lim_{n→∞} a^(1/n) = 1.

Solution. Once more, the squeeze theorem provides a traditional way to treat this limit, as in 5.B.8. Let us, however, discuss a different method. First notice that we may assume that 0 < a < 1, since otherwise the argument applies to 1/a. For such a it is easy to see that the given sequence (an = a^(1/n)), n ∈ Z+, is monotonically increasing. Moreover, it is bounded above by 1. A proof of this fact is left to the reader, but one may like to illustrate the situation in Sage, either via the cell

n = var("n"); a = 0.2; a(n) = a**(1/n)
all(a(j) < 1 for j in range(1, 5000))

which returns True, or by plotting many terms of (an) (continue typing in the previous cell)4

list_plot([a(n) for n in range(1, 100)])

Thus (an) converges, and we may assume that lim_{n→+∞} an = ℓ ∈ R. Assume that ℓ < 1. Using the inequality a^(1/n) ≤ ℓ we get a ≤ ℓ^n, with ℓ^n → 0 by 5.B.3. This gives a contradiction, since a > 0. Taking into account that (an) is bounded above by 1, we deduce that ℓ = 1. For a Sage verification one can continue in the very first cell used above, by adding the following command: lim(a(n), n=oo). □

4 Notice this gives an alternative way to plot sequences via Sage.

A ⊂ C is open if and only if it is a union of countably (or finitely) many open discs.
(2) every point a ∈ A is either an interior or a boundary point,
(3) every boundary point of A is either an isolated point or a limit point of A,
(4) A is compact if and only if every infinite sequence contained in A has a subsequence converging to a point in A.7
(5) A is compact if and only if each of its open covers contains a finite subcover of A.

Proof. (1) Every open set is a union of some neighbourhoods of its points, i.e., we may consider open intervals in the reals, or open discs in C. So the question that remains is whether it suffices to take countably many of them. Let us first prove the claim for the complex plane. For each z ∈ A, there is an open disc Oδ(z) contained in A, with some δ > 0, and let δz be the supremum of the values of such δ. Clearly, A = ∪_{z∈A} Oδz(z). Consider an arbitrary z ∈ A and pick w with both real and imaginary parts rational, such that |w − z| < δz/4. Thus, z ∈ Oδw(w) (draw a picture!) and we have checked that actually A is the union of the countably many open discs Oδw(w) for all w ∈ A with rational real and imaginary coordinates. If A is an open subset in R, then we may repeat the above argument with the discs Oδ(z) replaced by the intervals Oδ(x), x ∈ A. Think about the details!
(2) It follows immediately from the definitions that no point can be both an interior and a boundary point. Let a ∈ A be a point that is not interior. Then there is a sequence of points ai ∉ A with a as its limit point. At the same time, a belongs to each of its neighbourhoods. Thus a is a boundary point.
(3) Suppose that a ∈ A is a boundary point but not isolated. Then, similarly to the reasoning from the previous paragraph, there are points ai, this time inside A, whose limit point is a.
(4) Suppose A ⊂ R is a compact set, i.e., both closed and bounded. Consider an infinite sequence of points ai ∈ A. A has both a supremum b and an infimum a. Divide the interval [a, b] into halves: [a, (a + b)/2] and [(a + b)/2, b]. At least one of them contains infinitely many of the terms ai. Select this half and one of the terms contained in it; then cut the selected interval into halves. Again, select a half which contains infinitely many of the sequence's terms and select one of those points. By this procedure, a Cauchy sequence is established. Cauchy sequences have limit points or are constant up to finitely many exceptions.

7 This result for real numbers is usually referred to as the Bolzano-Weierstrass theorem. Karl Weierstrass was a famous German mathematician (1815-1897) and his name is linked to many theorems in Mathematics. Bernard Bolzano (1781-1848) was a Bohemian mathematician, logician, philosopher, theologian and Catholic priest working in Prague at the beginning of the 19th century. He laid the basis of rigorous mathematical analysis a few decades before the theory was fully worked out by Weierstrass and others. In particular, he was skeptical about the effective use of Leibniz's infinitesimals without the necessary rigour.

Often we are interested in the limit of a sequence sn whose terms are partial sums of some other given sequence an. Let us describe such an example; more applications are described in Section D, which is about infinite series.

5.B.16. Geometric series. Find the limit of the sequence with general term
sn = 1 + x + x^2 + · · · + x^n , n = 0, 1, 2, . . . ,
for x with |x| < 1.

Solution. First observe that
sn(1 − x) = (1 + x + · · · + x^n) − x(1 + x + · · · + x^n) = 1 + x + · · · + x^n − x − · · · − x^n − x^(n+1) = 1 − x^(n+1) .
So for x ≠ 1 we have sn = (1 − x^(n+1))/(1 − x), with n ∈ N. Recall now by 5.B.3 that lim_{n→∞} x^n = 0, provided that |x| < 1. Thus for such x we get
lim_{n→∞} sn = lim_{n→∞} (1 − x^(n+1))/(1 − x) = 1/(1 − x) ,
and it follows that
∑_{n=0}^∞ x^n = lim_{n→∞} sn = 1/(1 − x) , |x| < 1 .
This is the so-called geometric series, and for those x satisfying |x| < 1 its sum is a finite number. For instance,
∑_{n=1}^∞ (1/2)^n = ∑_{n=0}^∞ (1/2)^n − 1 = 1/(1 − 1/2) − 1 = 2 − 1 = 1 ,
that is, 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + · · · = 1.
Very often Sage allows us to compute such sums, by applying for example the following cell (which indeed returns −1/(x − 1)):

x, n = var("x, n"); assume(abs(x) < 1)
print(sum(x^n, n, 0, oo))

This is a preliminary example of the sum method in Sage, which we will analyze more in Section D. On the other hand, the geometric series will appear in the foreground very often, hence we encourage you to devote some additional time to it (e.g., compute in Sage some of its partial sums). □

5.B.17. Consider the sequence (an) with
an = 8/10 + 8/10^2 + · · · + 8/10^n , n = 1, 2, . . . ,
that is, an = 0.88 · · · 8, with 8 repeated n times. Show that an tends to 8/9 as n tends to ∞, and then give a proof via Sage.
Thus there is a subsequence with the desired limit. The fact that A is closed implies that the point obtained lies in A. Now the other direction: if every infinite subset of A has a limit point in A, then all limit points are in A, and so A is closed. If A were not bounded, we would be able to find an increasing or decreasing sequence such that the differences of the absolute values of adjacent numbers would be at least 1, for instance. However, such a sequence of points in A cannot have a limit point at all. Finally, we have to deal with the general case A ⊂ C. The arguments of the latter implication remain the same. Thus we have to show that any sequence zn of complex numbers in A has a limit point in A. Consider the sequences of real and imaginary parts, xn and yn. Since they both have to be in the bounded subsets AR and AiR of the real and imaginary projections of A, there is a subsequence znk = (xnk, ynk) such that xnk → x, ynk → y, with the limits sitting in the closures of AR and AiR, by virtue of the already proved real case. Obviously, znk → z = (x, y), but the latter limit has to sit in A since A is closed.
(5) First, focus on the easier implication. That is, suppose that every open cover contains a finite subcover. It is required to prove that A is both closed and bounded. A ⊂ C can be covered by a countable union of neighbourhoods On(z), with integers n and centers z with integral real and imaginary parts. Any choice of a finite subcover of them witnesses that A is bounded. If A ⊂ R, then the same argument applies with intervals On(x), n, x ∈ Z. Now suppose that a ∈ C \ A is the limit point of a sequence ai ∈ A. Further, assume that |a − an| < 1/n (otherwise select a subsequence satisfying this property). The sets Jn = C \ O_{1/n}(a) for all n ∈ N, n > 0, are open and they also cover our set A. Since it is possible to choose a finite cover of A, the point a lies inside the complement C \ A together with one of its neighbourhoods, and thus it is not a limit point. Therefore, all of A's limit points must again lie in A. Hence A is closed. If A ⊂ R, the same argument applies with closed discs replaced by closed intervals.
Finally, we have to prove the other implication. So assume A ⊂ C is closed and bounded, but there is an open covering Uα, α ∈ I, of A which does not contain any finite covering. Consider the sequence of positive real numbers εn = 1/n converging to 0 and the sets Bn = {z = (k/n, m/n) ∈ A, k, m ∈ Z} of complex numbers with real and imaginary parts in the "1/n-net of coordinates". Clearly all the sets Bn are finite. Further, for each k, consider the system Ak of closed discs with centres in the points of Bk and diameters 2εk. Clearly each such system Ak covers the entire set A. Altogether, there

Solution. We see that
an = 8/10 (1 + 1/10 + · · · + 1/10^(n−1)) = 8/10 · (1 − 1/10^n)/(1 − 1/10) = 8/9 (1 − 1/10^n) = 8/9 − 8/9 · 1/10^n .
This gives |an − 8/9| = 8/9 · 1/10^n, and hence it is easy to see that the condition |an − 8/9| < ε is equivalent to the inequality 10^n > 8/(9ε). Therefore, if N is a natural number satisfying 10^N > 8/(9ε), then we have |an − 8/9| < ε for all n ≥ N. Thus we are done. An alternative relies on the fact that 1/10^n → 0 as n → ∞. To prove this, notice that for given ε > 0 and for all n ≥ N := log10(1/ε) one has |1/10^n − 0| = 1/10^n ≤ 1/10^N = ε. Finally, the following block confirms the result by a less obvious technique (explain why this is true).
var("n, k") ; a=sum(8/(10)^n, n, 1, oo) bool(a==lim(sum(8/(10)^k, k, 1, n), n=oo)) □ The binomial theorem often combines with the monotone sequence theorem and together provide elegant techniques for studying the convergence of sequences. Such an example encodes our next task, which is about a famous number, the so called “Euler number” denoted by e, that is, the base of the natural logarithm, see 5.4.1. This satisfies 2 < e < 3, approximately equals to e ≈ 2.7183, and is the limit of the sequence (en)∞ n=1 featured in 5.B.18.5 5.B.18. (The Euler number) Combining the binomial theorem with the monotone sequence theorem, see 5.B.11, show that the sequence en given below is convergent: en = ( 1 + 1 n )n , n = 1, 2, . . . . ⃝ Roughly speaking, Cauchy sequences are sequences whose all but a finite number of elements sit from each other less than a given distance ε > 0. A general result states that any convergent sequence is Cauchy, while over the reals R or the complex numbers C the converse is also true, see 5.2.3. Thus we say that R or C are complete fields, instead of Q, for example. We can often use this result, to decide about the convergence/divergence of a given sequence of real or complex numbers. Let us describe such an example, and for more advanced tasks check the final section with the extra material, see 5.E.29, 5.E.30, for example. 5Good to know that the proof given here as a solution of 5.B.18, differs from those given in 5.4.1. Can you establish the different points between the two methods? CHAPTER 5. ESTABLISHING THE ZOO must be at least one closed disc C in the system A1 which is not covered by a finite number the sets Uα. Call it C1 and notice that diam C1 = 2ε1. Next, consider the sets C1 ∩C, with discs C ∈ A2 which cover the entire set C1. Again, at least one of them cannot be covered by a finite number of Uα, we call it C2. This way, we inductively construct a sequence of sets Ck satisfying Ck+1 ⊂ Ck, diam Ck ≤ 2εk, εk → 0, and none of them can be covered by a finite number of the open sets Uα. Finally we choose one point zk ∈ Ck in each of these sets. By construction, this must be a Cauchy-sequence. Consequently, this sequence of complex numbers has a limit z. Thus there is Uα0 containing z and containing also some δ-neighbourhood Oδ(z). But now, if diam Ck ≤ 2εk < δ, then Ck ⊂ Oδ(z) ⊂ Uα0 , which is a contradiction. The proof is complete when considering A ⊂ C. Dealing with real subset A ⊂ R, again the same line of arguments applies, just the 2-dimensional nets Bk become 1-dimensional and the open discs are replaced by open intervals. □ 5.2.9. Limits of functions and sequences. For the discussion of limits, it is advantageous to extend the set R of real numbers by the two infinite values ±∞ as we have done when defining intervals. A neighbourhood of infinity is any interval (a, ∞). Similarly, any interval (−∞, a) is a neighbourhood of −∞. Further, we will extend the concept of a limit point so that ∞ is a limit point of a set A ⊂ R if and only if every neighbourhood of ∞ has a non-empty intersection with it, i.e. if the set A is unbounded from above. Similarly for −∞. We talk about the infinite limit points, sometimes also called improper limit points of the set A. “Calculations” with infinities We also introduce rules for calculation with the formally added values ±∞ and arbitrary “finite” numbers a ∈ R: a + ∞ = ∞ a − ∞ = −∞ a · ∞ = ∞, if a > 0 a · ∞ = −∞, if a < 0 a · (−∞) = −∞, if a > 0 a · (−∞) = ∞, if a < 0 a ±∞ = 0, for all a ̸= 0. 
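Sage implements exactly this extended arithmetic through its symbol oo, so the conventions above can be checked directly; the particular finite values below are arbitrary.

print(5 + oo)    # +Infinity
print(-2*oo)     # -Infinity
print(7/oo)      # 0

Indeterminate combinations such as oo - oo are refused by Sage with an error, in line with the excluded expressions discussed in 5.2.14 below.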
The following definition covers many cases of limit processes and needs to be thoroughly understood. Some particular cases are considered in great detail below.

5.B.19. Harmonic numbers. Consider the sequence (Hn)∞n=1 with general term
Hn = ∑_{k=1}^n 1/k , n = 1, 2, . . .
Decide about the convergence of (Hn)∞n=1 in terms of Cauchy sequences.

Solution. The term Hn is the so-called nth harmonic number. The first harmonic numbers are given by H1 = 1, H2 = 3/2 = H1 + 1/2, H3 = 11/6 = H2 + 1/3, and by induction on n we can show that Hn+1 = Hn + 1/(n+1), for all naturals n ≥ 1. Harmonic numbers are the partial sums of the harmonic series that we will meet later in Section D. Now, for convenience set b = |H2n − Hn|. Then
b = |1 + 1/2 + · · · + 1/n + 1/(n+1) + · · · + 1/(2n) − Hn| = 1/(n+1) + 1/(n+2) + · · · + 1/(2n) > 1/(2n) + 1/(2n) + · · · + 1/(2n) (n terms) = 1/2 .
Thus, taking ε = 1/2 we see that there is no positive integer N satisfying |Hm − Hn| < 1/2 for all m, n ≥ N. This means that (Hn) is not a Cauchy sequence. Thus (Hn) diverges (to +∞, since all of its terms are positive). We may verify the situation in Sage, by giving the cell

var("n")
sum(1/n, n, 1, oo)

Sage, after returning some "Runtime Errors", types ValueError: Sum is divergent. □

Let us now present exercises of a somewhat more topological character, concerning limit, interior, boundary, or isolated points of certain subsets of R or C. Here one should refresh the formal definitions given in 5.2.5, 5.2.7, and read carefully the proof of Theorem 5.2.8. Mastering these notions may need some effort, and thus we have included a series of exercises (see also the problems 5.E.36, 5.E.37, 5.E.38, and 5.E.39 in Section E).

5.B.20. Limit, interior and isolated points. Find all limit, isolated, and interior points of the set N∗ ⊂ R of non-zero natural numbers (i.e., positive integers), and of the set Q ⊂ R of rational numbers.

Solution. Any n ∈ N∗ admits a neighbourhood in R containing only the number n, i.e., O1(n) ∩ N∗ = (n − 1, n + 1) ∩ N∗ = {n}. This means that every point n ∈ N∗ is isolated, and there are no interior points (since an isolated point cannot be an interior point). Moreover, N∗ provides an example of an infinite set having no limit points. Indeed, if x0 is a

Real and complex limits
Definition. Consider a subset A ⊂ R and a real-valued function f : A → R or a complex-valued function f : A → C, defined on A. Further, consider a limit point x0 of the set A (i.e. a real number or ±∞). We say that f has limit a ∈ R (or a complex limit a ∈ C) at x0, and write
lim_{x→x0} f(x) = a ,
if and only if for every neighbourhood O(a) of the point a, there is a neighbourhood O(x0) of x0 such that for all x ∈ A ∩ (O(x0) \ {x0}), f(x) ∈ O(a). In the case of a real-valued function, a = ±∞ can also be the limit. Such a limit is called infinite or improper. In the other case, i.e. a ∈ R, we say the limit is finite or proper.

It is important to notice that the value of f at x0 does not occur in the definition, and that the function f may not even be defined at this limit point (and in the case of an improper limit point, it cannot be defined, of course)! We shall not deal with improper limits of complex functions now.

5.2.10. The most important cases of domains. Our definition of a limit covers several very dissimilar situations:
(1) Limits of sequences. If A = N, i.e. the function f is defined for the natural numbers only, we talk about limits of sequences of real or complex numbers.
In this case, the only limit point of the domain is ∞, and we mostly write the values (terms) of the sequence as f(n) = an and the limit in the form lim n→∞ an = a. According to the definition, this means that for any neighbourhood O(a) of the limit value a, there is an index N ∈ N such that an ∈ O(a) for all n ≥ N. Actually, we have only reformulated the definition of convergence of a sequence (see 5.2.3). We have only added the possibility of infinite limits. As before, we also say that the sequence an converges to a. We can easily see from our definition for complex numbers that a sequence of complex values has limit a if and only if the real parts of ai converge to Re a and the imaginary parts converge to Im a. (2) Limits of functions at interior points of intervals. If f is defined on the interval A = (a, b) and x0 is an interior point of this interval, we talk about the limit of a function at an interior point of its domain. Usually, we write lim x→x0 f(x) = a. Let us examine why it is important to require f(x) ∈ O(a) only for the points x ̸= x0 in this case as well. As an example, let us consider the function f : R → R f(x) = { 0 if x ̸= 0 1 if x = 0. 391 limit point of N∗ then for some ε > 0 we should have a natural (and in particular an infinite number of natural numbers) in distance smaller than ε, which is impossible. Now for Q, recall by 1.A.1 that this is a dense subset of R. This means that each neighbourhood Oε(x) of any x ∈ R contains rational numbers, that is for any x ∈ R and every ε > 0 there exists q ∈ Q with q ∈ Oε(x) = (x − ε, x + ε). Thus, for every x ∈ R, there is a sequence of rational numbers an ̸= x, converging to it. For instance, imagine the decimal representation of a real number x /∈ Q, and the corresponding sequence whose kth term will be the representation truncated to the first k decimal digits. If x was rational, we could clearly consider the sequence an = x+ 1 n . The set of all limit points of Q is thus the whole real line R. There are no isolated points in Q, and in particular any rational number is also a limit point of the complement R\Q. Finally, there are no interior points of Q (this is because between any two rational numbers, there is an irrational number, and so it is impossible for a rational a ∈ Q to be an interior point of Q). □ 5.B.21. Find all limit, isolated, boundary and interior points of the sets X = {x ∈ R : 0 ≤ x < 1} ⊂ R and Y = {z ∈ C : 0 ≤ |z| < 1} ⊂ C. Solution. Let a ∈ [0, 1) be an arbitrary number. Apparently, the sequences {a + 1 n }∞ n=1, {1 − 1 n }∞ n=1 converge to a and 1, respectively. So we have easily shown that the set of X’s limit points contains the interval [0, 1]. There are no other limit points: for any b /∈ [0, 1] there is δ > 0 such that Oδ (b)∩[0, 1] = ∅ (for b < 0 it suffices to take δ = −b, and for b > 1 we can choose δ = b−1). Since every point of the interval [0, 1) is a limit point, there are no isolated points. For a ∈ (0, 1), let δa be the smaller one of the two positive numbers a, 1 − a. Considering Oδa (a) = (a − δa, a + δa) ⊆ (0, 1) for a ∈ (0, 1), we see that every point of the interval (0, 1) is an interior point of X. For every δ ∈ (0, 1), we have that Oδ (0) ∩ [0, 1) = (−δ, δ) ∩ [0, 1) = [0, δ), Oδ (1) ∩ [0, 1) = (1 − δ, 1 + δ) ∩ [0, 1) = (1 − δ, 1), so every δ-neighborhood of the point 0 contains some points of the interval [0, 1) and some points of the interval (−δ, 0), and every δ-neighborhood of 1 has a non-empty intersection with the intervals [0, 1), [1, 1 + δ). 
Therefore, 0 and 1 are boundary points. Altogether, we have found that the interior points of X coincide with the interval (0, 1) while the twoelement set {0, 1} consists of the boundary points of X (as we know that no point can be both interior and boundary and that a boundary point must be an isolated or a limit point). The case of Y is very similar and we leave most of the details to the reader. It is the open unit disk in the complex plane, all its points are interior (as with all open sets). The boundary points form the unit circle. Thus, the set of limit points is the closed unit disc and there are no isolated points there. □ CHAPTER 5. ESTABLISHING THE ZOO Apparently, the limit at zero is well-defined, and in accordance with our expectations, limx→0 f(x) = 0 even though the value f(0) = 1 does not belong into small neighbourhoods of the limit value 0. An equivalent definition using ε-neighbourhoods of the limits a and δ-neighbourhoods of the limit points x0 is the following: limx→x0 f(x) = a if for each ε > 0 there is a δ > 0 such, that for all x ̸= x0 satisfying |x − x0| < δ, |f(x) − a| < ε. (3) One-sided limits. If A = [a, b] is a bounded interval and x0 = a or x0 = b, we talk about a one-sided limit of the function f at the point x0, from the right and from the left respectively. If the point x0 is an interior point of the domain of f, we can, in order to determine the limit, consider the domain restricted to [x0, b] or [a, x0]. The resulting limits are also called a right-sided limit and left-sided limit, respectively, of the function f at the point x0. We denote them by limx→x+ 0 f(x) and limx→x− 0 f(x), respectively. As an example, we can consider the one-sided limits at x0 = 0 for Heaviside’s function h from the introduction to this part. Apparently, lim x→0+ h(x) = 1, lim x→0− h(x) = 0. However, the limit limx→0 f(x) does not exist. It follows from the definitions that the limit at an interior point of the domain of an arbitrary function f exists if and only if both one-sided limits exist and are equal. 5.2.11. Further examples of limits. (1) The limit of a complex function f : A → C in a limit point x0 of its domain exists if and only if the limits of both the real part and the imaginary part exist. In this case, we have lim x→x0 f(x) = lim x→x0 (Re f(x)) + i lim x→x0 (Im f(x)). The proof is straightforward and makes direct use of the definitions of distances and neighbourhoods of the points in the complex plane. Indeed, the membership into a δ– neighbourhood of a complex value z is guaranteed by the real (1/ √ 2)δ–neighbourhoods of the real and the imaginary parts of z. Hence the proposition follows immediately. (2) Let f be a real or complex polynomial. Then for every point x ∈ R, lim x→x0 f(x) = f(x0). Really, if f(x) = anxn + · · · + a0, then the identity (x0 + δ)k = xk 0 + kδxk−1 0 + · · · + δk , substituted for k = 0, . . . , n, gives that choosing a sufficiently small δ makes the values arbitrarily close to f(x0). (3) Now consider the following function defined on the whole real line f(x) = { 1 if x ∈ Q 0 if x /∈ Q. It is apparent from the definition that this function cannot have (even one-sided) limits at any point of its domain. 392 If all limit points of a subset A of R or of C belong to A, then the subset is said to be closed. This means that we cannot run away from A by a sequence (xn) ∈ A to a limit not belonging to A. 
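As a tiny Sage illustration of this (our own example, anticipating 5.B.22 below): the sequence (1/n) lies in the set (0, 1], yet its limit escapes the set, so (0, 1] is not closed.

n = var("n")
lim(1/n, n=oo)   # returns 0, and 0 does not belong to (0, 1]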
If the complement of A in the real line or in the complex plane is closed, then clearly each x ∈ A has a neighbourhood Oε(x) ⊂ A, for some ε > 0. In this case A is called open, see 5.2.5, 5.2.7 and 5.2.8 for further details. Observe that the empty set ∅ and R (or C) itself are both open and closed (for instance, C has no boundary points). If a subset A in R or C is closed and bounded (the latter means that A ⊂ Or(x) for some x and big enough r > 0), then each sequence (xn) ∈ A has a limit point in A, i.e., there is a convergent subsequence in A. Such sets are called compact sets. 5.B.22. Closed and open subsets. Verify that an open interval I = (a, b) = {x ∈ R : a < x < b} of the real line is an open set, and moreover that a closed interval J = [a, b] = {x ∈ R : a ≤ x ≤ b} is a closed set. Is the interval (0, 1] a closed set? Solution. Given some x ∈ I we should show that x is an interior point of I. Set ε = 1 2 min{x−a, b−x} > 0 and note that ε depends on x. Then we have a ≤ x − ε < x + ε ≤ b, and so (x−ε, x+ε) ⊂ (a, b). Therefore, we have constructed an open neighbourhood Oε(x) ⊂ I around x, contained in I, and since this works for any x ∈ I, the set I is open. Next, the complement of J is Jc = (−∞, a) ∪ (b, ∞) which is open as a union of open intervals. Thus J is closed. Finally, the interval (0, 1] cannot be a closed set; For instance, the sequence (xn) = (1/n) ∈ (0, 1] satisfies (1/n) → 0 as n → ∞, but 0 /∈ (0, 1]. □ 5.B.23. Decide which of the following sets are open or closed: (a) A = Z ⊂ R; (b) B = [2, +∞) ⊂ R; (c) C = {z ∈ C : |z − 1| + |z + 1| < 5} ⊂ C. Solution. (a) It is not hard to see that the complement of Z in R is given by the union ∪k∈Z(k, k + 1). Hence the set R\Z is open, as the union of open sets. It follows that Z is closed in R. (b) The complement of B in R is the set (−∞, 2), which is open. Thus B itself is a closed subset of R. In a similar way one can show that the following sets are closed: [a, +∞), and (−∞, b], with a, b ∈ R. (c) Let us treat this case based on the notion of continuity, that we will essentially analyze a few below. Consider the function f : C → R given by f(z) = |z − 1| + |z + 1|. As a challenge for you, prove that this function is continuous. Moreover, show that C = f−1 ((−∞, 5)). Then, the claim follows by 5.2.18; The set (−∞, 5) is open and f is continuous, hence C is open, as well. □ 5.B.24. Is the range of f(x) = x2 a closed subset of R? ⃝ CHAPTER 5. ESTABLISHING THE ZOO (4) The following function is even trickier than the previous one. Let f : R → R be the function defined as follows:8 f(x) =    1 if x = 0 1 q if x = p q ∈ Q, p, q ∈ Z, q > 0 are co-prime, 0 if x /∈ Q. Choose any point x0, no matter whether rational or irrational. Our goal is to show that limx→x0 f(x) = 0. Thus fix any ε > 0 and look at the possible values of f(x) close to x0. Notice that f(x) ≥ ε, i.e. 1 q ≥ ε can be true for only finite number of q ∈ N. This behaviour is illustrated on the diagram. In particular, there can be only a finite number of points x in the interval (x0 − 1, x0 + 1) for which f(x) ≥ ε. Label them x1, . . . , xn. Finally, choose δ smaller than the minimum of the distances of any two different points xi. Then f(x) < ε for all x ∈ Oδ(x0) \ {x0}. This finishes the proof. Notice that this limit equals the function value only at the irrational points. 5.2.12. The squeeze theorem. The following result is elementary, but extremely useful. We meet it when computing limits of all types discussed above, i.e. 
limits of sequences, limits of functions at interior points, one-sided limits, and so on.

Theorem. Let f, g, h be three real-valued functions with the same domain A, such that there is a neighbourhood of a limit point x0 ∈ R of the domain where f(x) ≤ g(x) ≤ h(x), for all x ≠ x0. Suppose there are limits
lim_{x→x0} f(x) = f0 , lim_{x→x0} h(x) = h0
and f0 = h0. Then the limit lim_{x→x0} g(x) = g0 exists, and it satisfies g0 = f0 = h0.

8 This function is called the Thomae function after the German mathematician J. Thomae, 1840–1921. You may find it under many other names too: e.g. Riemann function, pop-corn function, raindrop function etc. It illustrates how dense the "discontinuity" points of a function can be even though it has limits everywhere.

5.B.25. Decide which of the following sets are compact:
(a) I = (0, 1); (b) J = [0, 1]; (c) A = N∗; (d) B = {1/n : n ∈ N∗}; (e) C = B ∪ {0}.

Solution. Recall by (4) in Theorem 5.2.8 (the Bolzano-Weierstrass theorem) that a subset X ⊂ R is compact if and only if every infinite sequence contained in X has a subsequence converging to a point in X. We can use the Bolzano-Weierstrass theorem as a criterion to decide which of the given sets are compact.
(a) The open set I = (0, 1) is not compact: consider for example the sequence (xn) = (1/n), as above. This sequence lies in I and converges to 0, but 0 ∉ I. Thus, for (xn) there is no convergent subsequence whose limit belongs to I.
(b) The closed set J = [0, 1] is compact, as is every bounded closed interval [a, b] ⊂ R, see 5.E.39.
(c) The set of non-zero naturals N∗ is not compact. For instance, the sequence (yn = n) has no convergent subsequence. This is because every subsequence diverges to infinity.
(d) The set B is not closed, so it cannot be compact either.
(e) Obviously, the set C is closed and bounded, and hence compact. □

Given a sequence (an), in 5.4.6 we consider the so-called "upper limit" and "lower limit", also called the limes superior and limes inferior of (an), denoted by lim sup_{n→∞} an and lim inf_{n→∞} an. These limits always exist and have many important applications (see for example Section D, and also 5.E.32 and 5.E.34 for more details).

5.B.26. Compute the limes superior and inferior of the sequence
an = (n^2 + 4n − 5)/(n^2 + 9) · sin^2(nπ/4) , n ∈ N∗ .

Solution. Recall that the limes superior/inferior are the largest/smallest limit points of the sequence. To compute them for the given task, we may split the sequence (an) into several subsequences according to the value of sin^2(nπ/4), namely 0, 1/2, and 1. All of these subsequences converge, to different limits, and the result is lim sup_{n→∞} an = 1 and lim inf_{n→∞} an = 0, respectively. □

Additional exercises on the limes superior/inferior and on most of the directions analyzed above are presented in Section E. Later, in Chapter 7, we will revise most of the topological notions analyzed in this chapter (using R or C) in terms of the more general notion of "metric spaces". But to do so, we first need to learn about limits of functions and continuous functions.

Proof. From the assumptions of the theorem, it follows that for any ε > 0, there is a neighbourhood O(x0) of the point x0 ∈ A ⊂ R in which both f(x) and h(x) lie in the interval (f0 − ε, f0 + ε), for all x ≠ x0. From the condition f(x) ≤ g(x) ≤ h(x), it follows that g(x) ∈ (f0 − ε, f0 + ε), and so lim_{x→x0} g(x) = f0. The above reasoning is easily modified for infinite limit values or for limits at infinite points x0.
In the first case, choose a large N instead of ε. The condition on the values reads: both f(x) and h(x) have values larger than N on the neighbourhood O(x0) \ {x0}, and thus the same will be true for g(x). In the second case, the neighbourhood O will be an interval (M, ∞). The other infinite limit point −∞ is dealt with similarly. □ The next theorem reveals the elementary properties of limits, again for all types together. Think about the individual cases, including the limits taken at x0 = ±∞! 5.2.13. Theorem. Let A ⊂ R be the domain of real or complex functions f and g, let x0 be a limit point of A and let the limits lim x→x0 f(x) = a ∈ K, lim x→x0 g(x) = b ∈ K exist. Then: (1) the limit a is unique, (2) the limit of the sum f + g exists and satisfies lim x→x0 (f(x) + g(x)) = a + b, (3) the limit of the product f · g exists and satisfies lim x→x0 (f(x) · g(x)) = a · b. In particular, if f(x) = a is a constant function then limx→x0 a · g(x) = a · b, (4) if b ̸= 0, the limit of the quotient f/g exists and satisfies lim x→x0 f(x) g(x) = a b . Proof. (1) Suppose a and a′ are two values of the limit limx→x0 f(x). If a ̸= a′ , then there are disjoint neighbourhoods O(a) and O(a′ ). However, for sufficiently small neighbourhoods of x0, the values of f should lie in both neighbourhoods. This is a contradiction. Thus a = a′ . (2) Choose a neighbourhood of a+b, for instance O2ε(a+ b). For a sufficiently small neighbourhood of x0 and x ̸= x0, 394 Limits of functions enable us to investigate the behaviour of functions (and hence their graph), in the neighbourhood of some given point. They form the basis of the notion of differentiation that we will discuss in the forthcoming section, and establishes one of the most beautiful themes in mathematical analysis and in many other areas (e.g., numerical analysis). Most of the tasks presented below are based on theoretical results from the sections 5.2.9, 5.2.12, 5.2.13, 5.2.16, 5.2.18, and 5.2.20. All these results will gradually become more evident, as we explore them by examples. We recall however that limits are well defined in the limit points of the given domain of a function, and the function is called continuous at x, if its limit equals the function value there (be aware however, that in general a function does not need to be defined at the point where we seek for he limit). Recall also that from the topological point of view (see 5.2.18), a function f : A ⊂ R → R is continuous if and only if the pre-image f−1 (U) is an open subset of A, for every open set U ⊂ R. To verify that f−1 (U) is open it suffices to show that any point x ∈ f−1 (U) lies on an open interval of f−1 (U). 5.B.27. Limit of functions. Find the following limits or explain why they do not exist: (a) lim x→c x2 , (e) lim x→2+ x2 + 2x − 8 |x − 2| , (b) lim x→c (x6 − x3 − √ 6) , (f) lim x→2− x2 + 2x − 8 |x − 2| , (c) lim x→2 x2 + x − 6 x2 − 3x + 2 , (g) lim x→3 x − 3 √ x + 1 − (x − 1) , (d) lim x→0 sin ( 1 x ) , (k) lim x→0 x cos ( 1 x ) . Solution. (a) We can treat this limit by basic principles, as for example the Cauchy definition for limits of functions. Let us however rely on topology, as it was mentioned a few above, to show that f(x) = x2 is continuous and hence limx→c x2 = c2 , where c is any real number. Let I = (a, b) ⊂ R be an open interval with a < b and a, b ∈ R. By definition, f−1 (I) = {x ∈ R : x2 ∈ (a, b)}, hence obviously for b ≤ 0 it should be f−1 (I) = ∅ (since x2 ≥ 0 for any x), which is open. 
For a < 0 < b, one gets f⁻¹(I) = (−√b, √b) ⊂ R, since any x ∈ (−√b, √b) satisfies f(x) ∈ (a, b). Thus again f⁻¹(I) is open, being an open interval of R. Finally, for 0 ≤ a < b the preimage f⁻¹(I) is the union (−√b, −√a) ∪ (√a, √b), which is again open as the union of two open intervals of R.
(b) For the second limit one simply applies the sum and difference rules (see also below) and benefits from the fact that the n-th power x → x^n (n = 0, 1, 2, . . .) is a continuous function. Or you can rely directly on the fact that polynomials are continuous everywhere in R, see 5.2.11. Hence lim_{x→c} (x⁶ − x³ − √6) = c⁶ − c³ − √6.
(c) The substitution x = 2 leads to both a zero numerator and a zero denominator. Despite that, the problem can be solved

CHAPTER 5. ESTABLISHING THE ZOO

both f(x) and g(x) will lie in ε–neighbourhoods of the points a and b. Hence their sum will lie in the 2ε–neighbourhood of a + b. The proposition is proved.
(3) We have to be a bit more careful here. Let us look at what we can do with the assumption |f(x) − a| < ε, |g(x) − b| < ε for x ∈ O(x0), i.e. choosing a small enough neighbourhood of the limit point x0. Estimate:
|f(x)g(x) − ab| = |f(x)(g(x) − b) + b(f(x) − a)| ≤ |f(x)|ε + |b|ε ≤ |f(x) − a|ε + |a|ε + |b|ε ≤ ε² + ε(|a| + |b|).
Now we easily conclude. Choosing any ε̃ > 0, there is a unique ε > 0 with ε̃ = ε² + ε(|a| + |b|). Thus, using this ε for the choice of O(x0) above, we arrive at the required condition from the definition of the limit. Clearly, the limit of the constant function f(x) = a is a at all limit points of its domain.
(4) In view of the previous results, it suffices to prove lim_{x→x0} 1/g(x) = 1/b for |b| > 0. We need to be careful when considering complex valued functions. We need to estimate
|1/g(x) − 1/b| = |b − g(x)|/(|g(x)| · |b|).
Since |b| > 0, we may restrict ourselves to a neighbourhood U of x0 such that |g(x)| > |b|/2. Then |g(x)| · |b| > |b|²/2. Thus, if |g(x) − b| < ε, then
|1/g(x) − 1/b| < 2|b − g(x)|/|b|² < 2ε/|b|².
This verifies the claim as in the previous case. □

5.2.14. Remarks on infinite values of limits. The statement of the theorem can be extended to some infinite values of the limits of real-valued functions:
For sums, either at least one of the two limits must be finite, or both limits must share the same sign. Then the limit of the sum is the sum of the limits, with the conventions from 5.2.9. However, "∞ − ∞" is excluded.
For products, if one of the limits is infinite, then the other limit must be non-zero. Then the limit of the product is the product of the limits. The case "0 · (±∞)" is excluded.
For a quotient, it may be that a ∈ R and b = ±∞, in which case the resulting limit is zero; or a = ±∞ and b ∈ R, in which case it is ±∞ according to the signs of the numerator and the denominator. The case "∞/∞" is excluded.
The theorem also covers, as a special case, the corresponding statements about the convergence of sequences, as well as about one-sided limits of functions defined on an interval.
The following provides a "convergence test" useful in many situations. It relates to limits of sequences and functions in general.

5.2.15. Proposition. Consider a real or complex valued function f defined on a set A ⊂ R and a limit point x0 of the set A. f has a limit y at x0 if and only if for every sequence of

very easily by factorization:
lim_{x→2} (x² + x − 6)/(x² − 3x + 2) = lim_{x→2} ((x − 2)(x + 3))/((x − 2)(x − 1)) = (2 + 3)/(2 − 1) = 5.
(d) We will prove that this limit does not exist, using ε-neighbourhoods, see 5.2.9 and 5.2.10.
Set f(x) = sin(1/x) and assume, for contradiction, that lim_{x→0} f(x) = ℓ for some real number ℓ. This means that for ε = 1/2 there exists δ > 0 such that for any x with 0 < |x − 0| < δ we have |f(x) − ℓ| < 1/2. Then there exist some n ∈ N and reals x1 = 1/(2nπ) and x2 = 1/(2nπ + π/2) such that 0 < |x1| < δ and 0 < |x2| < δ. Hence |f(x1) − ℓ| < 1/2 and |f(x2) − ℓ| < 1/2, and using the triangle inequality we see that
Δ = |(f(x1) − ℓ) + (ℓ − f(x2))| ≤ |f(x1) − ℓ| + |ℓ − f(x2)| = |f(x1) − ℓ| + |f(x2) − ℓ| < 1/2 + 1/2 = 1,
where for simplicity we set Δ := |f(x1) − f(x2)|. Since f(x1) = sin(2nπ) = 0 and f(x2) = sin(2nπ + π/2) = sin(π/2) = 1, the last inequality gives |f(x1) − f(x2)| = |0 − 1| = 1 < 1, a contradiction.
(e) By the definition of the absolute value we have
lim_{x→2+} (x² + 2x − 8)/|x − 2| = lim_{x→2+} ((x − 2)(x + 4))/(x − 2) = lim_{x→2+} (x + 4) = 6.
(f) Similarly,
lim_{x→2−} (x² + 2x − 8)/|x − 2| = lim_{x→2−} ((x − 2)(x + 4))/(−(x − 2)) = −6.
Thus, a comparison with the result from case (e) shows that the function f(x) = (x² + 2x − 8)/|x − 2| cannot be continuous at x0 = 2 (since the one-sided limits are not equal).
(g) Set for simplicity f(x) = (x − 3)/(√(x + 1) − (x − 1)). When x → 3, both the numerator and the denominator tend to 0, which gives the indeterminate form 0/0 (see also below). One can treat such cases by multiplying both the numerator and the denominator by the "conjugate expression" of the denominator. In our case this means √(x + 1) + (x − 1), with
(√(x + 1) − (x − 1))(√(x + 1) + (x − 1)) = (x + 1) − (x − 1)² = −x(x − 3).
Thus
lim_{x→3} f(x) = lim_{x→3} ((x − 3)(√(x + 1) + (x − 1)))/(−x(x − 3)) = −lim_{x→3} (√(x + 1) + (x − 1))/x = −4/3.
(k) Recall that |cos(x)| ≤ 1 for any x ∈ R, and hence |x cos(1/x)| ≤ |x|. By the squeeze theorem (5.2.12) this gives lim_{x→0} x cos(1/x) = 0. □

CHAPTER 5. ESTABLISHING THE ZOO

points xn ∈ A converging to x0, xn ≠ x0, the sequence of the values f(xn) has limit y.

Proof. Suppose first that the limit of f at x0 is y. Then for any neighbourhood U of the point y, there is a neighbourhood V of x0 such that for all x ∈ V ∩ A, x ≠ x0, f(x) ∈ U. For every sequence xn → x0 of points different from x0, the terms xn lie in V for all n greater than a suitable N. Therefore, the sequence f(xn) converges to y.
Now suppose that the function f does not converge to y as x → x0. Then for some neighbourhood U of y, there is a sequence of points xm ≠ x0 in A which are closer to x0 than 1/m, with f(xm) not belonging to U. In this way, a sequence of points of A different from x0 is constructed, with lim_{m→∞} xm = x0, for which the values f(xm) do not converge to y. The proof is finished. □

5.2.16. Continuity. Continuity was discussed intuitively when polynomials were discussed. Now all the tools for a proper formulation of continuity are prepared. This is the basic class of functions in the sequel.

Continuity of functions

Definition. Let f be a real or complex valued function defined on an interval A ⊂ R. f is continuous at a point x0 ∈ A if and only if lim_{x→x0} f(x) = f(x0). The function f is continuous on an interval A if and only if it is continuous at every point x0 ∈ A.

The diagram explains the meaning of continuity. Firstly, the limit has to exist. Thus, after choosing a neighbourhood U of the limit value f(x0) (the ε-neighbourhood Oε(f(x0)) is shown), there is a neighbourhood of x0 (the δ-neighbourhood is shown) whose images all lie in U. In words, if we decide how close we want to be to f(x0), we may always choose a sufficiently small neighbourhood of x0 where this is guaranteed.
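Parts (e) and (f) of 5.B.27 can also be cross-checked in Sage with one-sided limits, using the dir option mentioned in 5.B.28 below. A minimal sketch (our own cell, not from the text; the expected outputs are indicated in the comments):

f(x) = (x^2 + 2*x - 8)/abs(x - 2)
print(limit(f(x), x=2, dir='right'))   # expected: 6
print(limit(f(x), x=2, dir='left'))    # expected: -6

Since the two one-sided limits differ, the two-sided limit at x = 2 does not exist.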
5.B.28. Compute the limit
lim_{x→1} (3x³ + 2x² − 2x − 3)/(5x³ + 2x² − 2x − 5),
if it is known that ρ = 1 is a root of both the numerator and the denominator. Next use Sage to confirm your computations.
Solution. First notice that the function f(x) = (3x³ + 2x² − 2x − 3)/(5x³ + 2x² − 2x − 5) is not defined at x = 1, and moreover this limit again has the indeterminate form 0/0. However, the statement already motivates a method for solving the task: use Horner's scheme and factorize the numerator and the denominator, i.e.,
3x³ + 2x² − 2x − 3 = (x − 1)(3x² + 5x + 3),
5x³ + 2x² − 2x − 5 = (x − 1)(5x² + 7x + 5).
Then it is easy to see that the limit at hand equals 11/17. In fact, you can easily do all the computations that we omitted via Sage, where you can still use the command limit to compute limits of functions, as in the case of sequences. Here, however, it is necessary to specify the limit point x0; the exact syntax for computing the limit lim_{x→x0} f(x) is limit(f(x), x = x_0). In addition, you can ask Sage to compute the left-side and right-side limits by adding the option dir="left" or dir="right", respectively. For our task, the cell
g(x)=3*x^3+2*x^2-2*x-3; h(x)=5*x^3+2*x^2-2*x-5
show(factor(g(x))); show(factor(h(x)))
lim(g(x)/h(x), x=1)
verifies the factorizations given above and the explicit value of the limit, as you can easily check on your computer.
Finally, it is worth mentioning that one could compute the limit at hand by other methods. For instance, one can apply Proposition 5.2.15. Hence, let (xn) be a sequence of real numbers with xn → 1 as n → ∞, and xn ≠ 1 for all n = 1, 2, . . .. Then, for the induced sequence of values f(xn), we see that
f(xn) = (3xn³ + 2xn² − 2xn − 3)/(5xn³ + 2xn² − 2xn − 5) = ((xn − 1)(3xn² + 5xn + 3))/((xn − 1)(5xn² + 7xn + 5)) = (3xn² + 5xn + 3)/(5xn² + 7xn + 5).
Taking the limit lim_{n→∞} f(xn), we get 11/17, and according to Proposition 5.2.15 this implies that lim_{x→1} f(x) = 11/17. □

5.B.29. Explain why the relation lim_{x→π/3} sin(x) = sin(π/3) is true. Next use Sage to confirm that sin(x) is continuous at the chosen point by combining the limit command with the bool syntax. ⃝

The Cauchy definition of limits of functions in terms of ε and δ can be used to derive a number of elementary results. Let us describe such an example.

CHAPTER 5. ESTABLISHING THE ZOO

Notice that for the boundary points of the interval A, the definition says that the value of f equals the value of the one-sided limit there. The function is then said to be right-continuous or left-continuous at such a point.
Every polynomial is a continuous function on the whole of R, see 5.2.11(2). The Thomae function is continuous at the irrational real numbers only, although it has limits at all rational points as well, see 5.2.11(4).
The previous theorem 5.2.13 about the properties of limits immediately implies all of the following claims. The same properties hold for right-continuity and left-continuity, as is easily checked.

5.2.17. Theorem. Let f and g be (real or complex valued) functions defined on an interval A ⊂ R and continuous at a point x0 ∈ A. Then
(1) the sum f + g is continuous at x0,
(2) the product f · g is continuous at x0,
(3) if g(x0) ≠ 0, then the quotient f/g is well-defined on some neighbourhood of x0 and is continuous at x0,
(4) if a continuous function h is defined on a neighbourhood of the value f(x0) of the real-valued function f, then the composite function h ◦ f is defined on a neighbourhood of x0 and is continuous at x0.

Proof. Statements (1) and (2) are clear.
For property (3): if g(x0) ≠ 0, then the ε–neighbourhood of the number g(x0) does not contain zero for a sufficiently small ε > 0. By the continuity of g, it follows that on a sufficiently small δ-neighbourhood of the point x0, g is non-zero, and the quotient f/g is thus well-defined there. It is continuous at x0 by the previous theorem.
(4) Choose a neighbourhood O of h(f(x0)). By the continuity of h, there is a neighbourhood O′ of f(x0) which is mapped into O by h. The continuous function f maps some sufficiently small neighbourhood of the point x0 into the neighbourhood O′. This is the defining property of continuity, so the proof is finished. □

5.2.18. We consider some basic relations between continuous mappings and the topology of the real numbers. They exploit the highly non-trivial characterization of compact sets in theorem 5.2.8.

5.B.30. Let f : A → R be a function, where A is a subset of R, and let a be a limit point of A. Suppose that the limit lim_{x→a} f(x) exists and is a finite number. Show that f is bounded on the set (a − h, a + h) ∩ A for some h > 0.
Solution. By assumption, the limit lim_{x→a} f(x) exists and is a finite number, say lim_{x→a} f(x) = ℓ. Then for every ε > 0 there exists some δ > 0 such that |f(x) − ℓ| < ε for all x ∈ A satisfying 0 < |x − a| < δ. Hence, taking ε = 1, there exists some h > 0 such that |f(x) − ℓ| < 1 for all x ∈ A with 0 < |x − a| < h. By the triangle inequality, this implies that |f(x)| < 1 + |ℓ| for all such x. Set now L = 1 + |ℓ| if a ∉ A, and L = max{1 + |ℓ|, |f(a)|} otherwise. Then it is easy to see that |f(x)| ≤ L for all x ∈ (a − h, a + h) ∩ A. □

In the previous examples we often applied some basic rules, such as: the limit of a sum of functions is the sum of the limits, the limit of a product is the product of the limits, and the limit of a quotient is the quotient of the limits, provided that the particular limits exist and do not lead to one of the following expressions: 0/0, ∞/∞, 0 · ∞, and ∞ − ∞. These are called "indeterminate forms", see also the discussion in 5.2.14; their full list includes three more types: 0^0, 1^∞ and ∞^0. To address many indeterminate forms (e.g., of type 0/0, ∞/∞, 0 · ∞, etc.), one uses the so-called "l'Hopital's rule", which we will examine later (see 5.3.10). However, let us begin with an example that demonstrates a pedestrian approach to dealing with an indeterminate form of type 0/0.

5.B.31. Show that lim_{x→0} sin(x)/x = 1.
Solution. Consider the first quadrant of the unit circle S¹ and an arbitrary point P(x) = [cos(x), sin(x)] on it, where x runs through the open interval (0, π/2) ⊂ R. The length of the arc joining the points P(x) and C := [1, 0] equals x, so we have sin(x) < x for all x ∈ (0, π/2); see also the figure given at the r.h.s. On the other hand, the value tan(x) is the distance between the points Q(x) = [1, tan(x)] and C, and we see that x < tan(x) for all x ∈ (0, π/2). Altogether,
sin(x) < x < sin(x)/cos(x),
which implies that
cos(x) < sin(x)/x < 1,

CHAPTER 5. ESTABLISHING THE ZOO

Topological characterization of continuity

Theorem. Let f : A ⊂ R → R be a function defined on an interval A. Then:
(1) f is continuous if and only if the inverse image f⁻¹(U) of every open set U ⊂ R is an open set in A.
(2) If f is continuous, then the inverse image f⁻¹(W) of every closed set W ⊂ R is a closed set in A.
(3) If f is continuous, then the image f(K) of every compact set K ⊂ A is a compact set.
(4) If f is continuous, then f attains both its maximum and its minimum on every compact set K.

Proof. (1) Consider a point x0 ∈ f⁻¹(U). There is a neighbourhood O of f(x0) contained in U, since U is open. Hence there is a neighbourhood O′ of x0 which is mapped into O, and thus is contained in the inverse image. Therefore, every point of the inverse image is an interior point, i.e., f⁻¹(U) is open. Conversely, if f⁻¹(U) is open for each open U, then taking any ε-neighbourhood of f(x0), its pre-image is an open neighbourhood of x0 satisfying the condition from the definition of continuity.
(2) Consider a limit point x0 of the inverse image f⁻¹(W) and a sequence xi, f(xi) ∈ W, which converges to x0. From the continuity of f, it follows that f(xi) converges to f(x0) (cf. the convergence test 5.2.15). Since W is closed, f(x0) ∈ W. Thus all limit points of the inverse image of the set W are contained in f⁻¹(W).
(3) Choose any open cover Uα of f(K). The inverse images of all the Uα are open and thus create an open cover of the set K. Select a finite subcover from it. Then finitely many of the corresponding images cover the original set f(K).
(4) Since the image of a compact set is again a compact set, the image must be bounded, and it contains both its supremum and its infimum. Hence these must also be the maximum and the minimum, respectively. □

Notice that a complex valued function is continuous if and only if its real and imaginary components are continuous. The first three claims of the theorem remain valid with a very slight modification of the proof. (Check it yourselves!)

5.2.19. There are two very useful consequences of the previous theorem.

for all x ∈ (0, π/2). Invoking now the squeeze theorem, one deduces that lim_{x→0+} sin(x)/x = 1. Based then on the fact that the function f(x) = sin(x)/x, defined for all x ≠ 0, is even, i.e., f(−x) = f(x), we get
lim_{x→0−} sin(x)/x = lim_{x→0+} sin(x)/x = 1.
This shows that both one-sided limits exist and have the same value, hence our claim follows. Notice that there exist other ways to prove the statement. □

5.B.32. Based on the result from 5.B.31, show that:
(a) lim_{x→0} (1 − cos(x))/x = 0;
(b) lim_{x→0} (1 − cos(x))/(x² sin(x²)) = ∞. ⃝

5.B.33. Compute lim_{x→x0} f(x) for the following cases:
(a) f(x) = (x − 2)/√(x² − 4), and x0 = 2,
(b) f(x) = sin(sin(x))/x, and x0 = 0,
(c) f(x) = sin²(x)/x, and x0 = 0. ⃝

5.B.34. Explain why the limits of the following functions as x → 0 do not exist:
f(x) = 1/x if x > 0, and f(x) = 1 if x ≤ 0; g(x) = 1 if x ≥ 0, and g(x) = 0 if x < 0; h(x) = |sin(x)|/sin(x); k(x) = sign(x). ⃝

5.B.35. Suppose that f : (0, +∞) → R is a function satisfying |e^x f(x) − 2e^x| ≤ |sin(e^x)| for any x ∈ (0, +∞). Show that lim_{x→+∞} f(x) = 2.
Solution. The given relation |e^x f(x) − 2e^x| ≤ |sin(e^x)| can be equivalently written as
−|sin(e^x)| ≤ e^x f(x) − 2e^x ≤ |sin(e^x)|,
or
2e^x − |sin(e^x)| ≤ e^x f(x) ≤ |sin(e^x)| + 2e^x,
or
2 − |sin(e^x)|/e^x ≤ f(x) ≤ |sin(e^x)|/e^x + 2. (∗)
Now, it is easy to prove that lim_{x→+∞} |sin(e^x)|/e^x = 0, which the graph of this function also suggests.

CHAPTER 5. ESTABLISHING THE ZOO

Maxima and minima of continuous functions9

Corollary. Let f : R → R be continuous. Then
(1) the image of every interval is again an interval,
(2) f takes all the values between the maximal and the minimal one on the closed interval [a, b].

Proof.
(1) Consider an open interval A, and suppose there is a point y ∈ R such that f(A) contains points less than y as well as points greater than y, but y ∉ f(A). Put B1 = (−∞, y) and B2 = (y, ∞). These are open sets, and the union of their inverse images A1 = f⁻¹(B1) ⊂ A and A2 = f⁻¹(B2) ⊂ A contains A. A1 and A2 are open, disjoint, and they both have a non-empty intersection with A. Thus there is a point x ∈ A which does not lie in A1 but is a limit point of A1. It lies in A2, which is impossible for two disjoint open sets. Thus it is proved that if there is a point y which does not belong to the image of the interval, then either all of the values must be above y, or they all must be below y. It follows that the image is again an interval. Notice that the boundary points of this interval may or may not lie in the image. If the domain interval A contains one of its boundary points, then the continuous function must map it to a limit point or an interior point of the image of the interior of A. This verifies the statement.
(2) This statement immediately follows from the previous one (and the above theorem), since the image of a closed bounded interval (i.e. a compact set) is again a closed interval. □

5.2.20. We conclude this introductory discussion with two more theorems which provide useful tools for calculating limits. Notice that we assume that the functions are defined on all of R. Actually, we are only interested in f on a neighbourhood of one point a, while g has to be defined on a neighbourhood of one point b only.

Limits of composite functions

Theorem. Let f, g : R → R be two functions and lim_{x→a} f(x) = b.
(1) If the function g is continuous at the point b, then
lim_{x→a} g(f(x)) = g(lim_{x→a} f(x)) = g(b).
(2) If the limit lim_{y→b} g(y) exists and f(x) ≠ b holds for all x ≠ a from some neighbourhood of the point a, then
lim_{x→a} g(f(x)) = lim_{y→b} g(y).

9This result is usually called the Weierstrass theorem, but it is also known (especially in Czech literature) as Bolzano's theorem. Bernard Bolzano apparently used such a result as a technical lemma when proving his Bolzano–Weierstrass theorem mentioned earlier.

For instance, via the squeeze theorem one can show that lim_{y→+∞} sin(y)/y = 0, and then the substitution y = e^x yields our claim; or directly via Sage, type: lim(sin(e^x)/e^x, x=+oo). Hence we also deduce that
lim_{x→+∞} (2 − |sin(e^x)|/e^x) = 2 = lim_{x→+∞} (|sin(e^x)|/e^x + 2),
and by (∗) we are done via the squeeze theorem. □

5.B.36. Let g be the function defined by
g(x) = lim_{α→+∞} (((2x − 3)α² + 4xα + 2)/(x + α)) · sin(1/α).
Show that the graph of g is a line which is a median of the triangle ABC with corners the points A = [1, −1], B = [6, 2], C = [−2, 0]. ⃝

So far we have tried to convince the reader to use common sense when reasoning about the limit behaviour of functions, and to avoid any mindless following of rules. On the other hand, beyond l'Hopital's rule, which is based on the notion of derivatives and will be discussed later (see 5.3.10), there are diverse useful "rules" or "tricks" for dealing with indeterminate expressions which are not necessarily based on derivatives. Let us mention some of them.
a) Recall that if f(x) > 0 for all x in the domain of f, then
f(x)^g(x) = e^(ln(f(x)^g(x))) = e^(g(x)·ln(f(x))).
Since the function e^x is continuous and injective on R, we get
lim_{x→x0} f(x)^g(x) = e^(lim_{x→x0} (g(x) ln(f(x)))),
which equals e^(lim_{x→x0} g(x) · lim_{x→x0} ln(f(x))) whenever the two latter limits exist. Moreover, it is easy to prove that
lim_{x→x0} (g(x) · ln f(x)) = a ∈ R ⟹ lim_{x→x0} f(x)^g(x) = e^a,
lim_{x→x0} (g(x) · ln f(x)) = +∞ ⟹ lim_{x→x0} f(x)^g(x) = +∞,
lim_{x→x0} (g(x) · ln f(x)) = −∞ ⟹ lim_{x→x0} f(x)^g(x) = 0.
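As an illustration of rule a) (a minimal Sage sketch with our own examples), the indeterminate forms 0^0 and 1^∞ can be resolved by computing the limit of g(x)·ln(f(x)) first:

# 0^0 type: x*ln(x) -> 0 as x -> 0+, hence x^x -> e^0 = 1
print(limit(x*ln(x), x=0, dir='right'))   # 0
print(limit(x^x, x=0, dir='right'))       # 1
# 1^oo type: x*ln(1 + 1/x) -> 1, hence (1 + 1/x)^x -> e
print(limit((1 + 1/x)^x, x=oo))           # e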
b) Some further useful rules to remember are these:
lim_{x→+∞} c/x^α = 0, lim_{x→+∞} x^α/x^β = 0, lim_{x→+∞} x^β/a^x = 0, lim_{x→+∞} a^x/b^x = 0,
for c ∈ R, 0 < α < β, 1 < a < b. You may try to verify them (a way to determine the third limit is based on l'Hopital's rule, for example).

5.B.37. Compute the following limits:
(a) lim_{x→+∞} (3^(x+1) + x⁵ − 4^x)/(3^x + 2^x + x²),
(b) lim_{x→+∞} (4^x − 8x⁶ − 2^x − 167)/(3^x − 45x − √11 · π^(x+12)),
(c) lim_{x→0} (√(1 + x) − √(1 − x))/x,
(d) lim_{x→π/4} (cos(x) − sin(x))/cos(2x). ⃝

CHAPTER 5. ESTABLISHING THE ZOO

Proof. The first proposition can be proved similarly to 5.2.17(4). From the continuity of g at the point b, it follows that for any neighbourhood V of the value g(b), we can find a sufficiently small neighbourhood U of the point b whose g-values lie in V. However, if f has limit b at the point a, then f hits U with all its values f(x) for x ≠ a from some sufficiently small neighbourhood of the point a, which verifies the first statement.
(2) Even if we cannot use the continuity of g at the point b, the previous reasoning remains valid if we ensure that all points x ≠ a from sufficiently small neighbourhoods of a are mapped by f into a neighbourhood of b, with f(x) ≠ b for all such points. □

5.2.21. Who or what is in the ZOO. We have begun to build a menagerie of functions with polynomials and functions which can be created from them "piece-wise". Moreover, we have derived many properties for the huge class of continuous functions. However, except for polynomials, we do not have many practically manageable examples at our disposal. We now consider quotients of polynomials. Let f and g be two polynomials in the real variable x with complex coefficients ai ∈ C. The function
h : R \ {x ∈ R, g(x) = 0} → C, h(x) = f(x)/g(x),
is well-defined for all real x except for the roots of the polynomial g. Such functions are called rational functions. From theorem 5.2.17 it follows that rational functions are continuous at all points of their domains. At the points where the denominators of real rational functions vanish, they can have
• a finite limit,
• an infinite limit, supposing the one-sided infinite limits are equal,
• different one-sided infinite limits.
For the case of a finite limit, it is necessary that the point is a common root of both f and g and that its multiplicity in f is at least as large as its multiplicity in g. Then the domain of the rational function can be extended by this point, defining the function to take the value of the limit there. The new function is then continuous at this point, too. The possibilities are illustrated in the diagram showing the values of the function
h(x) = ((x − 0.05a)(x − 2 − 0.2a)(x − 5))/(x(x − 2)(x − 4))

5.B.38. Consider a polynomial function f : R → R whose degree is determined by the result of rolling a die. Find the probability P(A), where A denotes the following event:
lim_{x→+∞} (x⁴ + x² + 1)/f(x) = 0.
Solution. We have f(x) = an x^n + a(n−1) x^(n−1) + · · · + a1 x + a0, where by assumption n ∈ {1, 2, . . . , 6} (with an ≠ 0, so f(x) ≠ 0 for large x). Thus one computes
lim_{x→+∞} (x⁴ + x² + 1)/f(x) = lim_{x→+∞} (x⁴ + x² + 1)/(an x^n + a(n−1) x^(n−1) + · · · + a1 x + a0) = lim_{x→+∞} x⁴/(an x^n).
Hence we seek the probability of having lim_{x→+∞} x⁴/(an x^n) = 0. It is now reasonable to consider the following cases:
1) For n = 1, 2, 3 we get lim_{x→+∞} x⁴/(an x^n) = ±∞.
2) For n = 4 we get lim_{x→+∞} x⁴/(a4 x⁴) = 1/a4 ≠ 0.
3) For n = 5, 6 we obtain lim_{x→+∞} x⁴/(an x^n) = 0.
Therefore, the favourable cases are those with n = 5, 6. Since the sample space is Ω = {1, 2, . . . , 6}, we deduce that the probability in question equals P(A) = 2/6 = 1/3. □

5.B.39. Consider the functions f(x) = e^(4x+2) and g(x) = ln(x²), defined on R\{0} and [1, e⁸], respectively.
(a) Show that the composition φ = g ◦ f is defined, and present its explicit form. Confirm your result via Sage and the command compose.
(b) Evaluate the limit
L = lim_{x→0} (φ(x) − sin³(x) − 4)/x.
Solution. (a) Let us denote by A = R\{0} and B = [1, e⁸] the two given domains. The composition φ(x) = g(f(x)) is defined if the set X := {x ∈ A : f(x) ∈ B} is non-empty, i.e., X ≠ ∅. We compute
X = {x ∈ R∗ : 1 ≤ e^(4x+2) ≤ e⁸} = {x ∈ R∗ : e⁰ ≤ e^(4x+2) ≤ e⁸} = {x ∈ R∗ : 0 ≤ 4x + 2 ≤ 8} = {x ∈ R∗ : −1/2 ≤ x ≤ 3/2} = [−1/2, 0) ∪ (0, 3/2],
where R∗ = R\{0}. Therefore, based on the basic properties of logarithms, we get φ : X → R given by
φ(x) = g(f(x)) = ln(f(x)²) = ln(e^(2(4x+2))) = 2(4x + 2).

CHAPTER 5. ESTABLISHING THE ZOO

for a = 0 on the left (thus it is the simpler rational function (x − 5)/(x − 4)) and for a = 5/3 on the right. Notice that the situation gets more complicated for complex rational functions. Their real and imaginary parts are again (real) rational functions (see the exercise ??), and the above options hold only for each of the components separately.

5.2.22. Power functions and the exponential. We have met the simple power functions x → x^n with natural exponents n = 0, 1, 2, . . . when building the polynomials. The meaning of the function x → x⁻¹, defined for all x ≠ 0, is also clear. Now we extend this definition to a general power function x^y with an arbitrary (constant) y ∈ R and x > 0. Changing our mind about constants and variables, we obtain the values of the exponential function x^y with arbitrary (constant) base x > 0. With natural y and z, we know the rule
(1) x^(y+z) = x^y · x^z
and its consequence
(2) x^(y·z) = (x^y)^z.
We shall extend this definition to all positive real x > 0 and all real y.

Exponential and power functions

Theorem. There is a unique function f(y) = x^y, defined for all x > 0, y ∈ R, separately continuous in both variables x and y (i.e. we consider the other one as a constant when checking the continuity) and satisfying (1), f(0) = 1, and f(1) = x. This function also satisfies (2). We call this function the exponential function with the base x, or the power function with exponent y.

Proof. Of course, we want to extend the well known values of the powers x^n with x rational and n integral, which are also direct consequences of (1) (and then automatically satisfy (2)). For a negative integer −a, we have to define (in view of (1)) x^(−a) = (x^a)⁻¹ = (x⁻¹)^a. Further, we want the equality b^n = x for n ∈ N to imply that b is the n-th root of x, i.e. b = x^(1/n) (again a consequence of

To verify this very last computation in Sage, use the cell
f(x)=e**(4*x+2); ln(f(x)^2).simplify_full()
which returns 8*x + 4. The same is returned by the method suggested in the statement, via the command compose, which we implement by the cell
f(x)=e^(4*x+2); g(x)=ln(x^2)
h = compose(g, f)
show(h(x).simplify_full())
(b) The explicit form of φ is known from (a), which makes it easy to compute the limit at hand:
L = lim_{x→0} (8x − sin³(x))/x = lim_{x→0} (8 − sin²(x) · sin(x)/x) = 8 − lim_{x→0} sin²(x) · lim_{x→0} sin(x)/x = 8 − 0 · 1 = 8.
□

We have exploited the continuity of several functions in an implicit or explicit way many times already. Now it is time to return to the essence and play a little with the concept itself.

5.B.40. The figure below shows the graph of a function y = f(x). Indicate the points where f is discontinuous and explain why.
[figure: graph of y = f(x), with x-axis ticks 0–5]
Solution. There is a discontinuity at x = 2, since f is not defined there (the break in the graph is represented by the small circle). Another discontinuity appears at x = 3 (a jump discontinuity). Indeed, although f(3) is defined (it is a negative number), we see that the left and right limits are different, so lim_{x→3} f(x) does not exist. Finally, according to the figure, at x = 5 the value f(5) is defined and moreover lim_{x→5} f(x) exists. However, we see that lim_{x→5} f(x) ≠ f(5). Thus f is discontinuous also at x = 5 (we may refer to such a discontinuity by the term removable discontinuity). □

5.B.41. Sketch the graph of a function which is continuous everywhere, except for a removable discontinuity at x = 1, a jump discontinuity at x = 3, and a discontinuity at x = 4 which is not of the two previous types.
Solution. There are of course many such functions. An example is given here:

CHAPTER 5. ESTABLISHING THE ZOO

(1), since this requires (x^(1/n))^n = x^(n·(1/n)) = x). We verify that such b's always exist for positive real numbers x. By factoring out y2 − y1 in y2^n − y1^n, or otherwise, we see that the function y → y^n is strictly increasing for y > 0. Choose x > 0 and consider the set B = {y ∈ R, y > 0, y^n ≤ x}. This is a non-empty set bounded from above, so set b = sup B. A power function with natural exponent n is continuous, thus b^n = x. Indeed, surely b^n ≤ x. If the inequality were strict, then there would be a number y such that b^n < y^n < x, which implies that b < y. This contradicts the definition of b as a supremum.
Thus the power function is suitably defined for all rational exponents a = p/q. For x > 0, we set x^a = (x^p)^(1/q) = (x^(1/q))^p. Finally, we notice that for the values 0 < a ∈ Q and fixed x > 1, x^a is strictly increasing in the rational a. Therefore, for general 0 < a ∈ R and 1 < x we can define
x^a = sup{x^y, y ∈ Q, y ≤ a}.
As before, x^(−a) = 1/x^a. For 0 < x < 1, proceed analogously with care for the inequality signs, or set x^a = (1/x)^(−a). For x = 1, define 1^a = 1 for any a, while 0^a = 0. Now the power function x → x^a is defined for all x ∈ [0, ∞) and a ∈ R.
Notice that the requested continuity in both parameters fixed all our choices above. In particular, our function is continuous also in the parameter a, and we have constructed also the exponential functions c^y for constants c > 0. The property (1), used when defining the power function x^a for integral a, implied also (2), and they both manifestly survived through the construction for any choices of the values x, y. □

5.2.23. Logarithmic functions. The exponential function f(x) = a^x is increasing for a > 1 and decreasing for 0 < a < 1. Thus in both cases there is a function f⁻¹(x) inverse to it. This function is called the logarithmic function with base a. We write ln_a(x), and the defining property is ln_a(a^x) = x. The equalities 5.2.22(1),(2) are thus equivalent to
ln_a(x · y) = ln_a(x) + ln_a(y), ln_a(x^y) = y · ln_a(x).
Logarithmic functions are defined only for positive arguments and they are, on the entire domain, increasing if a > 1 and decreasing for 0 < a < 1. Moreover, ln_a(1) = 0 holds for every a.
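Returning to 5.B.41: since the book's picture is reproduced below only as a placeholder, here is one concrete choice of such a function (our own, one of many) together with a Sage cell to draw it. The first branch has a removable discontinuity (a hole) at x = 1, the second creates a jump at x = 3, and the oscillating branch has no one-sided limit from the right of x = 4, so the discontinuity there is of neither of the two previous types.

f1(x) = (x^2 - 1)/(x - 1)    # equals x + 1 away from the hole at x = 1
f2(x) = x                    # jump at x = 3: left limit 4, value f(3) = 3
f3(x) = sin(1/(x - 4))       # oscillates: no limit from the right of x = 4
p  = plot(f1, (x, -1, 3), exclude=[1])
p += plot(f2, (x, 3, 4))
p += plot(f3, (x, 4.001, 6), plot_points=500)
show(p)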
[figure: graph of the example function, with discontinuities at x = 1, 3, 4]
Notice the discontinuity at x = 4: can you analyze why this differs from the previous two? □

5.B.42. Explain why the functions f, g : (0, +∞) → R defined by f(x) = x^x and g(x) = x^cos(x), respectively, are continuous. ⃝

5.B.43. If f, g : R → R are continuous functions with f(2) = 6 and lim_{x→2} (4f(x) − g(x)) = 2, find g(2). ⃝

5.B.44. Present an example of a function defined on R which is everywhere continuous except at x = 0. ⃝

5.B.45. Consider the function f : R → R given by f(x) = x if x ∈ Q, and f(x) = −x if x ∈ R\Q. Determine the points x0 ∈ R, if any, where f is continuous.
Solution. The given function f is continuous only at x0 = 0. For a confirmation, consider any sequence (xn) of R with xn → 0 and xn ≠ 0 for all n. Then obviously (−xn) → 0 as well, and hence f(xn) → 0. The claim now follows by combining the definition of continuity of a given function at a chosen point with Proposition 5.2.15. On the other hand, for any non-zero real number a we can find sequences (xn) and (yn) with xn → a and yn → a which, however, satisfy f(xn) → a and f(yn) → −a, respectively. This implies that f is discontinuous at any a ∈ R\{0}. □

5.B.46. Determine the interval(s) on which each of the following functions is continuous: f(x) = sin(1/x) if x ≠ 0, and f(0) = 0; g(x) = x sin(1/x) if x ≠ 0, and g(0) = 0. ⃝

5.B.47. Consider the function
g(x) = cot(x)/(π − 2x) + x, if x ∈ (0, π/2); g(x) = (2x + 1)/2, if x ∈ [π/2, +∞).
Show that g(x) is continuous at x0, where x0 is the value of the determinant |A|, and A ∈ Mat4(R) is a real 4 × 4 invertible matrix satisfying A = (1/|A|)B, for some square matrix B ∈ Mat4(R) with det(B) = |B| = (π/2)⁵.

CHAPTER 5. ESTABLISHING THE ZOO

There is an extremely important value of a, namely the number e, see the paragraph 5.4.1, also known as Euler's number. The function ln_e(x) is called the natural logarithm and is denoted by ln(x) (i.e., omitting the base e).

3. Derivatives

When talking about polynomials, the rate at which the function changes at a given point of its domain was already discussed (see the paragraph 5.1.6). It is the quotient 5.1.6(1), which expresses the slope of the secant line between the points [x, f(x)] ∈ R² and [x + δx, f(x + δx)] ∈ R² for a (small) change δx of the input variable. This reasoning is correct for any real or complex function f. It is only necessary to work with the concept of the limit, instead of the intuitive "small change" δx.

Derivative of a function of a real variable

5.3.1. Definition. Let f be a real or complex function defined on an interval A ⊂ R and x0 ∈ A. If the limit
lim_{x→x0} (f(x) − f(x0))/(x − x0) = a
exists, the value a of the derivative at x0 is denoted by f′(x0) or df/dx(x0) or (d/dx)f(x0). If a is finite, the derivative is also sometimes called proper. If a is infinite, it is improper. If x0 is one of the boundary points of A, we arrive at one-sided derivatives (i.e., the left-sided derivative and the right-sided derivative).
If a function has a finite derivative at x0, the function is said to be differentiable at x0. A function which is differentiable at every point of a given interval is said to be differentiable on the interval.

Obviously, the derivative of a complex valued function f(x) + i g(x) exists if and only if the derivatives of both the real and imaginary parts f and g exist (see the elementary properties of limits). Then (f(x) + i g(x))′ = f′(x) + i g′(x).

Solution.
By assumption, A and B are real 4 × 4 matrices satisfying the relation B = |A| A. In Chapter 2 we saw that det(λC) = λ^n det(C) for any real n×n matrix C and λ ∈ R, see ??. Thus we get |B| = ||A| A| = |A|⁴ |A| = |A|⁵, that is, (π/2)⁵ = |A|⁵, or equivalently |A| = π/2. Hence x0 = π/2, and we see that
lim_{x→π/2+} g(x) = lim_{x→π/2+} (x + 1/2) = π/2 + 1/2,
lim_{x→π/2−} g(x) = lim_{x→π/2−} cot(x)/(π − 2x) + lim_{x→π/2−} x = (1/2) lim_{x→π/2−} tan(π/2 − x)/(π/2 − x) + π/2 = π/2 + 1/2.
Here we have used the trigonometric identity cot(x) = tan(π/2 − x) (try to prove this), and moreover that
lim_{y→0} tan(y)/y = 1, (∗)
where one should replace π/2 − x by y, so that x → π/2− corresponds to y → 0+. Therefore, the one-sided limits coincide, and since they equal g(π/2) = π/2 + 1/2, g is continuous at x0. To be complete, let us prove (∗). For this, one uses the result from 5.B.31 and the fact that the limit of a product is the product of the limits, if both limits are defined:
lim_{y→0} tan(y)/y = lim_{y→0} (sin(y)/cos(y))/y = lim_{y→0} sin(y)/y · lim_{y→0} 1/cos(y) = 1 · 1/cos(0) = 1. □

Given a real number x ∈ R, its floor (also called the integer part of x), denoted by ⌊x⌋ (or by [x]), is the greatest integer which is less than or equal to x, i.e., ⌊x⌋ = max{a ∈ Z : a ≤ x}. Equivalently, we can view ⌊x⌋ as the unique integer satisfying ⌊x⌋ ≤ x < ⌊x⌋ + 1. Hence, for example, we have ⌊x⌋ = 1 for 1 ≤ x < 2, ⌊x⌋ = 2 for 2 ≤ x < 3, etc. The floor function or greatest integer function is then defined by f(x) = ⌊x⌋ ∈ Z, for all x ∈ R. Obviously, for any integer n ∈ Z we have f(n) = ⌊n⌋ = n. In the same vein one can define the "ceiling function" (also known as the "least integer function") x → ⌈x⌉, where ⌈x⌉ is the smallest integer ≥ x, called the "ceiling of x", i.e., ⌈x⌉ = min{a ∈ Z : a ≥ x}. In addition, {x} = x − ⌊x⌋ is called the "fractional part of x", and the assignment x → {x} induces the so-called "fractional part function". Obviously, 0 ≤ {x} < 1, and for non-negative real numbers the fractional part is just the part of the number after the decimal point (but this is not the case for negative real numbers). Nevertheless, together with the floor function, to which we devote some space in 5.B.48, these functions are among the simplest examples of discontinuous functions, and they can be quite useful in constructing wilder examples. Moreover, they all have important applications in integral calculus, in number theory and in computer science. Hence we will meet them again in the next chapter and in Chapter 11, the latter devoted to number theory. For more details on the functions ⌈x⌉ and {x} we refer to the problem 5.E.57 in the final section.
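As a quick illustration (our own mini-examples), Sage offers the commands floor, ceil and frac directly; the last line below shows the caveat about negative numbers, since frac(x) = x − ⌊x⌋:

print(floor(9/2), ceil(9/2))   # 4 5
print(frac(3/4))               # 3/4, the part after the decimal point
print(frac(-13/10))            # 7/10 = -13/10 - (-2), not 3/10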
CHAPTER 5. ESTABLISHING THE ZOO

Derivatives are handled rather easily, but it takes time to derive the proper formulae for derivatives of the elementary functions in our zoo. Therefore, we present a table of derivatives of several such functions in advance. In the last column, there are references to the corresponding paragraphs where the results are proved. Notice that even though we are unable to express inverse functions to some of our functions by elementary means, we can calculate their derivatives; see 5.3.6.

Derivatives of some functions
function | domain | derivative | ref.
polynomials f(x) | whole R | f′(x) is again a polynomial | 5.1.6
cubic splines h(x) | whole R | continuous second derivatives | 5.1.9
rational functions f(x)/g(x) | whole R, except for roots of g | rational functions: (f′(x)g(x) − f(x)g′(x))/g(x)² | 5.3.5
power functions f(x) = x^a | interval (0, ∞) | f′(x) = a·x^(a−1) | 5.3.7
exponential functions f(x) = a^x, a > 0, a ≠ 1 | whole R | f′(x) = ln(a)·a^x | 5.3.7
logarithmic functions f(x) = ln_a(x), a > 0, a ≠ 1 | interval (0, ∞) | f′(x) = (ln(a))⁻¹ · 1/x | 5.3.7

The initial idea of the definition suggests that f′(x0) allows an approximation of the function f by the straight line y = f(x0) + f′(x0)(x − x0). This is the meaning of the following lemma, which says that replacing the constant coefficient f′(x0) in the line's equation with a suitable continuous function gives exactly the values of f. The difference between ψ(x) and ψ(x0) on a neighbourhood of x0 then says how much the slopes of the secant lines and the tangent line at x0 differ.

Lemma. A real or complex function f(x) has a finite derivative at x0 if and only if there is a neighbourhood O of x0 and a function ψ defined on this neighbourhood which is continuous at x0 and such that for all x ∈ O,
f(x) = f(x0) + ψ(x)(x − x0).
Furthermore, ψ(x0) = f′(x0), and f is continuous at the point x0.

Proof. If ψ exists, it is of the form
ψ(x) = (f(x) − f(x0))/(x − x0) for all x ∈ O \ {x0}.

5.B.48. The floor function. Let f(x) = ⌊x⌋ be the floor function.
(1) Compute f(−0.4), f(0.4), f(0.99), f(3/2), f(√7), f(3), f(π), f(9/2) and f(19/2).
(2) Show that f(x + n) = f(x) + n, for all x ∈ R and n ∈ Z. Next present an example confirming this relation.
(3) Show that the relation f(nx) = nf(x), with x ∈ R and n ∈ Z, is false in general.
(4) Deduce that lim_{x→2+} f(x) ≠ lim_{x→2−} f(x). More generally, prove that f is discontinuous at any integer n ∈ Z.
(5) Use Sage to sketch the graph of the floor function for −5 ≤ x ≤ 5.
Solution. (1) Obviously, ⌊−0.4⌋ = −1, ⌊0.4⌋ = 0 = ⌊0.99⌋, ⌊3/2⌋ = 1, ⌊√7⌋ = 2, ⌊3⌋ = 3 = ⌊π⌋, ⌊9/2⌋ = 4, and ⌊19/2⌋ = 9. Notice that in Sage the floor function corresponds to the command floor. Hence, to verify some of our previous computations, one may type (and similarly for the cases omitted):
print(floor(-0.4)); print(floor(sqrt(7)))
print(floor(pi)); print(floor(19/2))
(2) We need to prove that ⌊x + n⌋ = ⌊x⌋ + n for all x ∈ R and n ∈ Z. By definition, we have ⌊x⌋ ≤ x < ⌊x⌋ + 1, and by adding n to all parts we get ⌊x⌋ + n ≤ x + n < ⌊x⌋ + n + 1. Recall however that for all x ∈ R and m ∈ Z the relations ⌊x⌋ = m and m ≤ x < m + 1 are equivalent. Thus, setting m = ⌊x⌋ + n, the above inequality reads m ≤ x + n < m + 1, and hence ⌊x + n⌋ = m = ⌊x⌋ + n, as required. As an example, we have ⌊9/2⌋ = ⌊0.5 + 4⌋ = ⌊0.5⌋ + 4 = 0 + 4 = 4.
(3) As a counterexample, notice that ⌊3 · (1/3)⌋ = ⌊1⌋ = 1, but 3 · ⌊1/3⌋ = 3 · 0 = 0.
(4) By definition, we have ⌊x⌋ = 2 for 2 ≤ x < 3. Thus lim_{x→2+} ⌊x⌋ = lim_{x→2+} 2 = 2. Similarly, since ⌊x⌋ = 1 for 1 ≤ x < 2, we have lim_{x→2−} ⌊x⌋ = lim_{x→2−} 1 = 1. This shows that f is discontinuous at x = 2, and similarly at any integer n ∈ Z, i.e., lim_{x→n+} ⌊x⌋ = n = f(n) ≠ lim_{x→n−} ⌊x⌋ = n − 1.
(5) Let us now present the graph of the floor function.
For convenience, we will use small circles to indicate the jump discontinuities of f at the integers lying in [−5, 5]. We can do this manually in Sage by combining first the commands floor and plot, to sketch f(x) = ⌊x⌋, and next the commands point and circle to draw the discontinuity points. This is encoded by the following cell:

CHAPTER 5. ESTABLISHING THE ZOO

Suppose f′(x0) is the proper derivative. Then we can define the value at the point x0 as ψ(x0) = f′(x0). Certainly,
lim_{x→x0} ψ(x) = f′(x0) = ψ(x0).
Thus ψ is continuous at x0, as desired. On the other hand, if such a function ψ exists, the same procedure calculates its limit at x0. Thus the derivative f′(x0) exists as well and equals ψ(x0). The function f is expressed in terms of the sum and product of functions continuous at x0. Thus f is continuous at x0. □

5.3.2. Geometrical meaning of the derivative. The previous lemma leads to a geometric interpretation of the derivative in terms of the slope of the secant lines of the graph of a real function f through [x0, f(x0)]. The derivative exists if and only if the slope of the secant line through the points [x0, f(x0)] and [x, f(x)] changes continuously when approaching the argument x = x0. If so, the limit of this slope is the value of the derivative. This observation leads to the important corollary:

Functions increasing and decreasing at a point

Corollary. If a real-valued function f has derivative f′(x0) > 0 at a point x0 ∈ R, then there is a neighbourhood O(x0) such that f(x) > f(x0) for all points x ∈ O(x0), x > x0, and f(x) < f(x0) for all x ∈ O(x0), x < x0. On the other hand, if the derivative satisfies f′(x0) < 0, then there is a neighbourhood O(x0) such that f(x) < f(x0) for all points x ∈ O(x0), x > x0, and f(x) > f(x0) for all x ∈ O(x0), x < x0.

Proof. Suppose f′(x0) > 0. By the previous lemma, f(x) = f(x0) + ψ(x)(x − x0) and ψ(x0) > 0. Since ψ is continuous at x0, there exists a neighbourhood O(x0) on which ψ(x) > 0. If x increases, x > x0, then f(x) increases as well, f(x) > f(x0). Analogously for x < x0. The case of a negative derivative is proved similarly. □

A real function is called increasing at a point x0 of its domain if, for all points x of some neighbourhood of x0, f(x) > f(x0) when x > x0 and f(x) < f(x0) when x < x0. A real function is increasing on an interval A if f(x) − f(y) > 0 for all x > y, x, y ∈ A. Similarly, a function is said to be decreasing at a point x0 if and only if there is a neighbourhood of x0 such that f(x) < f(x0) for all x > x0, while f(x) > f(x0) for all x < x0 from this neighbourhood. A function is decreasing on an interval A if f(x) − f(y) < 0 for all x > y, x, y ∈ A. Thus a function having a non-zero finite derivative at a point is either increasing or decreasing at that point, according to the sign of the derivative.

f=floor(x)
p=plot(f, x, -5, 5, color="steelblue", ticks=[1,1], legend_label=r"$\text{floor function}$")
for x in [-5..5]:
    p+=point([x,x], size=30, color="black")
    p+=circle((x,x-1), 0.08, color="black")
show(p)

CHAPTER 5. ESTABLISHING THE ZOO

A function increasing on an interval is increasing at each of its points. The converse is true as well. In order to see this, assume that f is increasing at all points of the interval A. Consider two points x < y in A with f(y) ≤ f(x). By the assumption, there is a δ-neighbourhood of y on which z < y implies f(z) < f(y). Let δ0 be the supremum of all such δ ≤ y − x, and set w = y − δ0. Then f(w) cannot be larger than f(y) (there would be such a point on the right of it too, which is excluded).
But, unless w = x, w is a limit point of a sequence of points less than w at which the values of f are at least f(y) ≥ f(w). This is a contradiction with f being increasing at w. And if w were x, then f(z) < f(y) ≤ f(x) for points z > x arbitrarily close to x, contradicting the assumption that f is increasing at x. The same arguments work for decreasing functions. The following is now proved:

Proposition. A real function is increasing or decreasing on an open interval A if and only if it is increasing or decreasing at each of its points, respectively.

5.3.3. Examples. (1) There is a function which is increasing at the origin x0 = 0 but is neither increasing nor decreasing on any neighbourhood of x0. Consider the (continuous) function
f(x) = x + 5x² sin(1/x), f(0) = 0.
The choice of f(0) makes f a continuous function on R (sin is a bounded function with values between −1 and 1). Its derivative at zero exists too:
lim_{x→0} (x + 5x² sin(1/x))/x = lim_{x→0} (1 + 5x sin(1/x)) = 1.
For x ≠ 0, f′(x) = 1 + 10x sin(1/x) − 5 cos(1/x) (cf. the rules for computing derivatives in 5.3.4 below). The derivative is not continuous at the origin. f is increasing at x = 0 but is not increasing on any neighbourhood of this point.
(2) As another illustration of a simple usage of the relation between derivatives and the property of being an increasing (or decreasing) function, we can consider the existence of inverses to real polynomials. Polynomials of degree at least two need not be either increasing or decreasing functions. Hence we cannot anticipate a globally defined inverse function for them. On the other hand, the inverse exists to every restriction

Sage prints out the required figure (ignore the vertical lines in it, as they are not part of the graph). Notice that there exists a shorter way to sketch the graph of the floor function, which combines the floor command with the option exclude. The latter automatically removes the vertical lines, and the syntax goes as follows:
plot(floor(x), (x, -5, 5), exclude=[-5..5]) □

To solve problems involving the continuity of piecewise functions, there are generally two steps. First, prove that the individual components of the function are continuous within the intervals into which the domain is divided. Second, examine the continuity at the "gluing points". For the first step, you can often rely on the inherent continuity of elementary functions within their domains (e.g., polynomials, trigonometric functions, power functions, exponentials, logarithms, etc.).

5.B.49. Consider the 2-parameter family fα,β of piecewise functions, defined by
fα,β(x) = (√(2x² − x + 6) − αx)/(x + 2), if x < −2; fα,β(x) = x³ + βx + 4, if x ≥ −2,
with α, β ∈ R. Find the continuous members of fα,β. Next use Sage to confirm your answer, and moreover sketch the graph of these members for −10 ≤ x ≤ 10. ⃝

The continuity of a function has far-reaching consequences, such as Bolzano's theorem, which states that a continuous function which changes its sign within an interval has a zero there, see 5.2.19. The following tasks are based on this important principle, with further examples provided in Section E.

CHAPTER 5. ESTABLISHING THE ZOO

of f to an interval between adjacent roots of the derivative f′, i.e. where the derivative of the polynomial is non-zero and keeps its sign. These inverse functions will never be polynomials, except for the case of polynomials of degree one.
The equation y = ax + b implies x = (1/a)(y − b). For a polynomial of degree two, the equation y = ax² + bx + c leads to the equation
x = (−b ± √(b² − 4a(c − y)))/(2a).
Thus the inverse (given by the above equation) exists only for those x which lie in one of the intervals (−∞, −b/(2a)), (−b/(2a), ∞). It can be shown that the roots of polynomials of degree larger than four cannot, in general, be expressed by means of power functions. Thus piece-wise defined inverses to polynomials may represent new items in our zoo.

5.3.4. Elementary properties of derivatives. We introduce several basic facts about the calculation of derivatives. We shall see that derivatives are quite nicely compatible with the algebraic operations of addition and multiplication of real or complex functions. The last formula below then allows us to efficiently determine the derivatives of composite functions. It is also called the chain rule.
Intuitively, these rules can be understood very easily if we imagine that the derivative of a function y = f(x) is the quotient of the rates of increase of the output variable y and the input variable x: f′ = δy/δx.
Of course, for y = h(x) = f(x) + g(x), the increase in y is given by the sum of the increases of f and g, and the increase of the input variable is still the same. Therefore, the derivative of a sum is the sum of the derivatives.
The derivative of a product is not the product of the derivatives. For y = f(x)g(x), the increase is
δy = f(x + δx)g(x + δx) − f(x)g(x) = f(x + δx)(g(x + δx) − g(x)) + (f(x + δx) − f(x))g(x).
Now, if we make the increase δx small, we actually calculate the limit of a sum of products, which is the sum of the products of the limits. Thus the derivative of a product fg is given by the expression fg′ + f′g, which is called the Leibniz rule10.

10Gottfried Wilhelm von Leibniz (1646–1716) was a great German mathematician and philosopher. He developed the differential and integral calculus in terms of infinitesimal quantities, arguing similarly as above.

5.B.50. Determine whether the equation e^(2x) − x⁴ + 3x³ − 6x² = 5 has a positive solution.
Solution. Let us consider the function
f(x) = e^(2x) − x⁴ + 3x³ − 6x² − 5, x ≥ 0,
for which f(0) = −4. Notice that lim_{x→+∞} f(x) = lim_{x→+∞} e^(2x) = +∞. Obviously, f is continuous on the whole domain, and hence it takes all the values y ∈ [−4, ∞). Moreover, we have f(0) < 0 and f(2) = e⁴ − 21 > 0. Hence, by Bolzano's theorem, its graph necessarily intersects the positive x semi-axis, i.e., the equation f(x) = 0 has a positive solution. To confirm all these claims in Sage, use the cell
f(x)=e^(2*x)-x^4+3*x^3-6*x^2-5
show(f(0)); show(f(2))
plot(f(x), x, 0, 2, ymax=20)
sol=solve(f(x)==0, x); show(sol)
which in fact tries to solve the equation f(x) = 0 symbolically via the last command. However, in this way Sage does not present a numerical value: it returns a complicated symbolic expression for the solution (check it yourself). To avoid this problem, one may instead use the command
f(x)=e^(2*x)-x^4+3*x^3-6*x^2-5
find_root(f, 0, 2)
□

5.B.51. Determine whether the polynomial P(x) = x³⁷ + 5x²¹ − 4x⁹ + 5x⁴ − 2x − 3 has a real root in (−1, 1). ⃝

5.B.52. Use Sage and implement Bolzano's theorem to prove that the equation x³ − cos(x)e^x + x sin(x) = 0 has a positive solution in (0, π/2). Next use the find_root function to obtain a numerical approximation of this root. ⃝
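For 5.B.52, a minimal sketch of the kind of cell one might write (our own code, not the book's solution): the sign change certifies the existence of a root on (0, π/2) via Bolzano's theorem, and find_root then approximates it numerically.

f(x) = x^3 - cos(x)*e^x + x*sin(x)
print(f(0), bool(f(pi/2) > 0))   # -1 and True: a sign change on (0, pi/2)
print(find_root(f, 0, pi/2))     # numerical approximation of the root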
C. Derivatives

Having both a conceptual understanding of the notion of limits and the ability to compute limits, we are now ready to proceed with problems in calculus, the branch of mathematics devoted to derivatives and integrals of functions.6 We will first focus on derivatives. Given a function f depending on a real variable x, the slope of its graph at the point [x0, f(x0)] indicates the rate of change of the value f(x) as x approaches the point x0. This rate of change is the limiting value of the slopes (f(x) − f(x0))/(x − x0) (x ≠ x0), as x tends to x0, that is,

6A very rough notion of calculus, and in particular of integration, goes back to the ancient Greeks. Archimedes already developed the method of exhaustion to compute the volume of a cone, sphere, etc. However, calculus was essentially developed by Isaac Newton (1643–1727) and Gottfried Leibniz (1646–1716), both building on previous works by Fermat (1607–1665) and others. Later the French mathematician Augustin-Louis Cauchy (1789–1857) established a more rigorous approach to infinitesimal calculus, as we mentioned before. On the other hand, the Italian mathematician Joseph-Louis Lagrange (1736–1813) was one of the founders of the "calculus of variations", whose elements will be studied in Chapter 9.

CHAPTER 5. ESTABLISHING THE ZOO

The derivative of a composite function is even more interesting. Consider a function g = h ◦ f, where the domain of the function z = h(y) contains the codomain of the function y = f(x). By writing out the increases,
g′ = δz/δx = (δz/δy)(δy/δx).
Thus we expect the formula to be of the form (h ◦ f)′(x) = h′(f(x))f′(x). Now we provide correct formulations together with proofs:

Rules for differentiation

Theorem. Let f and g be real or complex functions defined on a neighbourhood of a point x0 ∈ R and having finite derivatives at this point. Then
(1) for every real or complex number c, the function x → c · f(x) has a derivative at x0, and (cf)′(x0) = c(f′(x0)),
(2) the function f + g has a derivative at x0, and (f + g)′(x0) = f′(x0) + g′(x0),
(3) the function f · g has a derivative at x0, and (f · g)′(x0) = f′(x0)g(x0) + f(x0)g′(x0),
(4) further, suppose h is a function defined on a neighbourhood of the image y0 = f(x0) with a derivative at y0. Then the composite function h ◦ f also has a derivative at x0, and (h ◦ f)′(x0) = h′(f(x0)) · f′(x0).

Proof. (1) and (2): A straightforward application of the theorem about sums and products of limits of functions yields the result.
(3) Rewrite the quotient of the increases (already mentioned) in the following way:
((fg)(x) − (fg)(x0))/(x − x0) = f(x) · (g(x) − g(x0))/(x − x0) + ((f(x) − f(x0))/(x − x0)) · g(x0).
The limit of this as x → x0 gives the desired result, because f is continuous at x0.
(4) By lemma 5.3.1, there are functions ψ and φ which are continuous at x0 and y0 = f(x0), respectively. Further, they satisfy
h(y) = h(y0) + φ(y)(y − y0), f(x) = f(x0) + ψ(x)(x − x0)
on some neighbourhoods of y0 and x0. They satisfy ψ(x0) = f′(x0) and φ(y0) = h′(y0). Then,
h(f(x)) − h(f(x0)) = φ(f(x))(f(x) − f(x0)) = φ(f(x))ψ(x)(x − x0)
for all x from the neighbourhood of x0. However, the product φ(f(x))ψ(x) is a function which is continuous at x0 and

lim_{x→x0} (f(x) − f(x0))/(x − x0).
If this limit exists and is a finite number, then we say that f is "differentiable" at x0 and denote the limiting value by f′(x0) or df/dx(x0).7 This is the so-called first derivative of f at x0. The uniqueness of limits ensures that the derivative of f at x0, when it exists, is uniquely determined.
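Before moving on, here is a quick numerical sanity check of this definition in Sage (our own toy example, with f(x) = x² at x0 = 1, where f′(1) = 2):

f(x) = x^2
# difference quotients at x0 = 1 approach the derivative f'(1) = 2
print([(f(1 + h) - f(1))/h for h in [0.1, 0.01, 0.001]])   # 2.1, 2.01, 2.001
print(limit((f(x) - f(1))/(x - 1), x=1))                   # 2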
If f is differentiable at any point x0 ∈ (a, b) of a given open interval, we say that f is differentiable on (a, b). We say that f is differentiable on a closed interval [a, b] when f is differentiable at any point x0 ∈ (a, b) and the right-sided and left-sided derivatives at a and b, respectively, exist.

5.C.1. Based on the definition of the derivative of a function of one variable (see 5.3.1), compute the derivatives of the following functions: f(x) = x^n, with n ∈ N constant and x ∈ R; g(x) = sin(x), with x ∈ R; h(x) = √x, with x ∈ [0, +∞). ⃝

5.C.2. Let f : R → R be a function such that f(a + b) = f(a)f(b) for all a, b ∈ R. Suppose also that f(0) = f′(0) = 1. Show that f′(x) = f(x) for all x ∈ R. ⃝

5.C.3. Consider the function f(x) = |x|. Find the derivative of f for all 0 ≠ x ∈ R. Is f differentiable at x0 = 0? ⃝

5.C.4. Show that if f is a differentiable function at a point x0 in its domain, then f is continuous at x0 as well. ⃝

The converse of the statement in 5.C.4 is not true, and there are several reasons why a continuous function f may fail to be differentiable at a given point of its domain. A possible reason is a "corner" on its graph, like that of the absolute value function f(x) = |x| discussed in 5.C.3. Another reason is that f may admit a vertical tangent line at the given point, as does the square root function discussed in 5.C.1. Let us summarize with a useful slogan: "Differentiability is a stronger requirement than continuity."

5.C.5. Consider the function
f(x) = √(x² + 7), if x > 3; f(x) = x² + bx + c, if x ≤ 3.
Determine the reals b, c ∈ R such that f is differentiable for all x ∈ R. For those values, use Sage to plot the graph of f in at least two different ways.
Solution. Obviously, for all x ≠ 3 the given function is differentiable. In order to determine the unknowns b, c ∈ R, we need to establish a system of two equations. Since differentiability requires continuity, these equations are induced by the continuity and the differentiability of f at x0 = 3.

7For y = f(x), the notation dy/dx is due to Leibniz, while f′(x) is the notation of Lagrange.

CHAPTER 5. ESTABLISHING THE ZOO

its value at x0 is just the desired derivative of the composite function, again by lemma 5.3.1. □

5.3.5. Derivative of quotients. Consider first the special case of h(x) = x⁻¹. From the definition of the derivative,
h′(x) = lim_{δx→0} (1/(x + δx) − 1/x)/δx = lim_{δx→0} (x − (x + δx))/(δx(x² + xδx)) = lim_{δx→0} −1/(x² + xδx) = −x⁻².
Thus, the above leads to:

Derivative of a quotient

Corollary. Let f and g be real-valued functions which have finite derivatives at a point x0, with g(x0) ≠ 0. Then the function h(x) = f(x)(g(x))⁻¹ satisfies
h′(x0) = (f/g)′(x0) = (f′(x0)g(x0) − f(x0)g′(x0))/(g(x0))².

Proof. Using the formula (x⁻¹)′ = −x⁻², the chain rule says (g⁻¹)′ = −g⁻² · g′. The Leibniz rule then implies
(f/g)′ = (f · g⁻¹)′ = f′g⁻¹ − fg⁻²g′ = (f′g − fg′)/g². □

5.3.6. Derivatives of inverse functions. In paragraph 1.6.1, while talking about relations and mappings in general, the concept of an inverse function was introduced. If the inverse function f⁻¹ to a given function f : R → R exists (do not confuse this notation with the function x → (f(x))⁻¹), then it is uniquely determined by either of the following identities:
f⁻¹ ◦ f = id_R, f ◦ f⁻¹ = id_R.
Then the other identity is also true. If f is defined on a set A ⊂ R and f(A) = B, the existence of f⁻¹ is conditioned by the same statements with the identity mappings id_A and id_B, respectively, on the right-hand sides.
As seen in the diagram, the graph of the inverse function is obtained simply by interchanging the axes of the input and output variables.

We have $f(3) = 9 + 3b + c$, and the continuity condition $\lim_{x\to 3^-} f(x) = 9 + 3b + c = \lim_{x\to 3^+} f(x)$. The first equality holds as an identity, and the second gives the relation $9 + 3b + c = \sqrt{3^2+7} = 4$, i.e., $3b + c = -5$. As for the differentiability of $f$ at $x_0 = 3$, we require
$$\lim_{x\to 3^-} \frac{f(x)-f(3)}{x-3} = \lim_{x\to 3^+} \frac{f(x)-f(3)}{x-3}. \qquad (*)$$
We see that $f(3) = 9 + 3b + c = 9 - 5 = 4$, and hence
$$\lim_{x\to 3^+} \frac{f(x)-f(3)}{x-3} = \lim_{x\to 3^+} \frac{\sqrt{x^2+7}-4}{x-3} = \lim_{x\to 3^+} \frac{(\sqrt{x^2+7}-4)(\sqrt{x^2+7}+4)}{(x-3)(\sqrt{x^2+7}+4)} = \lim_{x\to 3^+} \frac{x^2-9}{(x-3)(\sqrt{x^2+7}+4)} = \lim_{x\to 3^+} \frac{x+3}{\sqrt{x^2+7}+4} = \frac{6}{8} = \frac{3}{4}.$$
Similarly,
$$\lim_{x\to 3^-} \frac{f(x)-f(3)}{x-3} = \lim_{x\to 3^-} \frac{x^2 + b(x-3) - 9}{x-3} = \lim_{x\to 3^-} \frac{(x-3)(x+3+b)}{x-3} = \lim_{x\to 3^-} (x+3+b) = 6 + b.$$
Thus, by $(*)$ we obtain the equation $6 + b = \frac{3}{4}$, that is, $b = -\frac{21}{4}$ and so $c = -5 - 3b = \frac{43}{4}$. In Sage we can plot $f$ "manually", i.e.,

f1(x)=sqrt(x^2+7); f2(x)=x^2-(21/4)*x+43/4
P=plot(f1(x), x, 3, 12, color="black")
P+=plot(f2(x), x, -6, 3, color="gray")
P+=point([3, 9-(21/4)*3+43/4], color="black", size=30); show(P)

An alternative relies on the command piecewise, whose implementation can be summarized from the cell

f1(x)=sqrt(x^2+7); f2(x)=x^2-(21/4)*x+43/4
F=piecewise([[(-6, 3), f2(x)], [(3, 12), f1(x)]])
p=plot(F(x), x, -6, 12)
p+=point([3, 9-(21/4)*3+43/4], color="black", size=30); show(p)

Execute these cells to generate the required graph (note that the code also includes the junction point $x_0 = 3$). Notably, evaluating the "piecewise" method takes more time, both at https://sagecell.sagemath.org and at cocalc.com, compared with the first approach. Finally, remember that another way to introduce $f$ is based on the method presented in 5.B.49. □

If it is known that the inverse $y = f^{-1}(x)$ of a differentiable function $x = f(y)$ is also differentiable, then the chain rule immediately yields
$$1 = (\mathrm{id})'(x) = (f\circ f^{-1})'(x) = f'(y)\cdot (f^{-1})'(x).$$
Notice that $f'(y)$ must be non-zero. This corresponds to the intuitive idea that for $y = f(x)$ the value of $f'$ is approximately $\frac{\delta y}{\delta x}$, while for $x = f^{-1}(y)$ it is approximately $(f^{-1})'(y) = \frac{\delta x}{\delta y}$. And this indeed is the way the derivatives of inverse functions are calculated.

Derivative of the inverse function

Theorem. If $f$ is a real-valued function differentiable at $y_0$, such that the inverse $f^{-1}(x)$ exists on a neighbourhood of the value $x_0 = f(y_0)$ and $f'(y_0) \neq 0$, then
$$(1)\qquad (f^{-1})'(x_0) = \frac{1}{f'(f^{-1}(x_0))} = \frac{1}{f'(y_0)}.$$

Let us notice that if we assume that the derivative $f'(x)$ is continuous on a neighbourhood of $x_0$, then the condition $f'(x_0) \neq 0$ clearly implies that $f$ is either increasing or decreasing on some neighbourhood, and thus the assumptions of the theorem are fulfilled. On the other hand, the example 5.3.3(1) shows that the existence of $f'(x_0) \neq 0$ alone is not necessarily sufficient.

Proof. To prove the proposition, it suffices to revisit the proof of the fourth statement of theorem 5.3.4, where we work with the composition $f\circ f^{-1} = \mathrm{id}$. The composite function is differentiable. By lemma 5.3.1, there is a function $\varphi$, continuous at $y_0$, such that $f(y) - f(y_0) = \varphi(y)(y - y_0)$ on some neighbourhood of $y_0$. Further, it satisfies $\varphi(y_0) = f'(y_0) \neq 0$, and $\varphi$ has constant sign on a neighbourhood of $y_0$.
Next, notice that the existence of the inverse f−1 around the point x0 and the continuity of f at y0 guarantees the continuity of f−1 at x0, (the ε and δ neighbourhoods map each to the other bijectively). The substitution y = f−1 (x) then yields x − x0 = φ(f−1 (x))(f−1 (x) − f−1 (x0)), for all x lying in some neighbourhood O(x0) of x0. Further, f−1 (x0) = y0, and φ(f−1 (x)) is continuous at x0 and remains non-zero on a neighbourhood O(x0) of x0 with constant sign. Thus f−1 (x) − f−1 (x0) x − x0 = 1 φ(f−1(x)) ̸= 0, for all x ∈ O(x0) \ {x0}. The right-hand side of this expression is continuous at x0. The limit is lim x→x0 1 φ(f−1(x)) = 1 φ(f−1(x0)) = 1 f′(y0) . 410 The fundamental rules of differentiation (see 5.3.4 and 5.3.5) enable us to compute derivatives of functions, even those with complex forms. The upcoming exercises focus on these rules to enhance familiarity with their application. It’s worth noting that while differentiation is a standard procedure, it can involve intricate computations. Therefore, many find it more enjoyable to compute derivatives using computer algebra packages, as these not only simplify the process but also allow for rigorous verification of calculations. Below we will explain the usage of Sage. Similar syntaxes are available in many other computer algebra systems, as Mathematica and Maple. 5.C.6. In 5.C.1 we saw that (sin(x))′ ≡ sin′ (x) = cos(x). Based on this fact and using the chain rule, introduced in 5.3.4, prove that (cos(x))′ = − sin(x). ⃝ 5.C.7. Differentiate the following functions, compute the required values and verify (some of) your answers via Sage. (1) f(x) = x sin(x). Find f′ (π/2); (2) g(x) = sin(x) x , x ̸= 0. Find g′ (π/2); (3) h(x) = ln(x + √ x2 − c2), c ̸= 0, |x| ≥ |c|. (4) k(x) = xx , x > 0; (5) m(x) = xsin(x) , x > 0. Solution. (1) By the product rule (also called Leibniz rule, see 5.3.4), we see that f′ (x) = x′ · sin(x) + x · (sin(x))′ = sin(x) + x cos(x). Thus f′ (π/2) = 1. In order to differentiate a function f(x), Sage provides many alternatives. For instance we can use the command derivative(f, x) or f.derivative(x), as fol- lows: f(x)=x*sin(x); show(derivative(f, x)) or type f(x)=x*sin(x); show(f.derivative(x)) If we want to find the explicit value of the derivative at a point x0, then the command f.derivative(x)(x = x0) is an appropriate one. Thus, to compute f′ (π/2) you may continue typing in the previous cell the following: f.derivative(x)(x=pi/2) This prints out 1, that is, f′ (π/2) = 1, where f(x) = x sin(x). An alternative to differentiate f(x) relies on the command diff. For example, by typing f(x)=x*sin(x); show(diff(f, x)) we obtain the same result. In order to specify the value at a certain point x0, we again type diff(f, x)(x = x0), i.e., f(x)=x*sin(x); show(diff(f, x)(x=pi/2)) The command diff is also useful when one wants to find higher-order derivatives, hence we will meet it again in Chapter 6, see also below and the final Section E of this chapter. CHAPTER 5. ESTABLISHING THE ZOO Therefore, the limit of the left-hand side also exists, and it follows that (f−1 )′ (x0) = 1 f′(y0) as required. □ 5.3.7. Derivatives of the elementary functions. Consider the exponential function f(x) = ax for any fixed real a > 0. If the derivative of ax exists for all x, then f′ (x) = lim δx→0 ax+δx − ax δx = ax lim δx→0 aδx − 1 δx = f′ (0)ax . On the other hand, if the derivative at zero exists, then this formula guarantees the existence of the derivative at any point of the domain and also determines its value. 
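A small Sage experiment (ours, with the sample base a = 2) makes this concrete: the limit defining f'(0) evaluates to log 2, and consequently (2^x)' = (log 2)·2^x, in accordance with the formula f'(x) = f'(0)a^x above.

t, x = var("t x")
a = 2                              # sample base; any a > 0 works
show(limit((a^t - 1)/t, t=0))      # prints log(2), the value of f'(0)
show(diff(a^x, x))                 # prints 2^x*log(2), i.e. f'(x) = f'(0)*a^x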
At the same time, the validity of this formula for one-sided derivatives is also verified. Unfortunately, it takes some time to verify that the derivatives of exponential functions indeed exist (see 5.4.2, 5.4.10, and 6.3.7). There is an especially important base e, also known as Euler’s number, for which the derivative at zero equals one. Remember the formula (ex )′ = ex for a while and draw on its consequences. For the general exponential function, (using standard rules of differentiation), (ax )′ = (eln(a)x )′ = ln(a)(eln(a)x ) = ln(a) · ax . Thus exponential functions are special since their derivatives are proportional to their values. Next, we determine the derivative (lne(x))′ . The definition of the natural logarithm as the inverse to ex , eln x = x, allows the calculation: (1) (ln)′ (y) = (ln)′ (ex ) = 1 (ex)′ = 1 ex = 1 y . The formula (2) (xa )′ = axa−1 for differentiating a general power function can also be derived using the derivatives of the exponential and logarithmic functions: (xa )′ = (ea ln x )′ = ea ln x (a ln x)′ = a xa x = axa−1 . 5.3.8. Mean value theorems. Before continuing the journey of finding new interesting functions, we derive several simple statements about derivatives. The meaning of all of them is intuitively clear from the diagrams. The proofs follow the visual imagi- nation. 411 (2) The rule for the derivative of a quotient (see 5.3.5), gives g′ (x) = (sin(x))′ · x − sin(x) · x′ x2 = x cos(x) − sin(x) x2 . Thus, we easily get g′ (π/2) = − 4 π2 . To confirm this via Sage a recommendation has the form g(x)=sin(x)/x; diff(g(x), x)(x=pi/2) (3) Here the “chain rule” applies. Set a(x) = ln(x) and b(x) = x + √ x2 − c2 such that h(x) = a(b(x)). Then h′ (x) = ( a(b(x) )′ = a′ (b(x)) · b′ (x) = (x + √ x2 − c2)′ x + √ x2 − c2 = 1 + x x2−c2 x + √ x2 − c2 . For a confirmation in Sage you may use one of the methods posed above. (4) Let us recall the identity (ef(x) )′ = f′ (x) ef(x) , where we assume that f′ (x) exists. Because xx = ex ln(x) by applying this rule we obtain k′ (x) = (xx )′ = ( ex ln(x) )′ = ( x ln(x) )′ ex ln(x) = ( ln(x) + 1 ) ex ln(x) = ( ln(x) + 1 ) xx . In Sage you may use the cell show(diff(x ∗ ∗x, x)), which returns the expression xx (log (x) + 1) (recall that in Sage the function log represents the natural logarithm). The same trick used in (4) applies for the final case, too. Hence we leave this as an exercise. □ 5.C.8. Chain rule via Sage. Write in Sage a short code implementing the chain rule for two arbitrary differentiable func- tions. Solution. Let h(x) = f(g(x)) be the composition of two arbitrary differentiable functions f, g : R → R. We want to implement in Sage the rule h′ (x) = g′ (x)f′ (g(x)), for all x ∈ R, and with this goal in mind it is useful to recall about symbolic functions in Sage. For instance, type in your editor the cell x=var("x") f=function(’f’)(x); show(f) This cell introduces an arbitrary function f of one variable x, in particular prints out f(x). This makes the implementation of the chain rule for two arbitrary functions f, g : R → R, really simple. First introduce the composition h, as fol- lows x=var("x"); f=function("f"); g=function("g") h=f(g(x)) # define the composition h show(h) Check yourself that this gives the composition f(g(x)). Next it is sufficient to add the command h.diff(x), which returns D[0](f)(g(x)) ∗ diff(g(x), x). Recall here that D[0] encodes the first derivative, hence we can indeed translate Sage’s output as f′ (g(x))g′ (x), that is, the chain rule. 
A more useful alternative has the following form: function("f")(x) CHAPTER 5. ESTABLISHING THE ZOO Rolle’s theorem11 Theorem. Assume that the function f : R → R is continuous on a closed bounded interval [a, b] and differentiable inside this interval. If f(a) = f(b), then there is a number c ∈ (a, b) such that f′ (c) = 0. Proof. Since the function f is continuous on the closed interval (i.e. on a compact set), it attains its maximum and its minimum there. Either its maximum value is greater than f(a) = f(b), or the minimum value is less than f(a) = f(b), or f is constant. If the third case applies, the derivative is zero at all points of the interval (a, b). If the second case applies, then the first case applies to the function −f. If the first case applies, it occurs at an interior point c. If f′ (c) ̸= 0 then the function f would be either increasing or decreasing at c (see 5.3.2), implying the existence of larger values than f(c) in a neighbourhood of c, contradicting that f(c) is a maximum value. □ 5.3.9. The latter result immediately implies the following corollary. Lagrange’s mean value theorem Theorem. Assume the function f : R → R is continuous on an interval [a, b] and differentiable at all points inside this interval. Then there is a number c ∈ (a, b) such that f′ (c) = f(b) − f(a) b − a . 11The French mathematician Michel Rolle (1652-1719) proved this theorem only for polynomials. The principle was perhaps known much earlier, but the rigorous proof comes from the 19th century only. 412 function("g")(x) show(diff(f(g(x)), x)) The output here has the form D0 (f) (g (x)) ∂ ∂x g (x), and the advantage is that we can “replace” f, g with certain functions, so that we can use this method for verifying our formal computations in such examples. For instance, to compute the derivative of ex2 , it is sufficient to specify f, g by adding in the previous cell the code f(x)=e^x;g(x)=x^2; show(diff(f(g(x)), x)) To compute the case (5) of Problem 5.C.7, add the line f(x) = xx ; g(x) = sin(x); show(diff(f(g(x)), x)). Try some additional examples yourself, and see also in Chapter 6 for further applications of symbolic functions (cf. 6.A.4). □ 5.C.9. Inverse function theorem. (a) Given the function f(x) = (4x + 3)/(x − 6) prove that it is invertible on its domain, with inverse given by the function g(x) = (6x + 3)/(x − 4), with x ∈ R\{4}. (b) According to 5.3.6 if y = g(x) is the inverse of a differentiable function f(x), then for all x satisfying f′ (g(x))) ̸= 0 we have g′ (x) = 1/f′ (g(x))). Based on this result compute the derivative of the function g(x) given in (a) and then compare your result, by applying a direct computation of g′ (x). Solution. (a) Obviously, the domain of f is the set A := R\{6}. For arbitrary x1, x2 ∈ A we see that the relation f(x1) = f(x2) is equivalent to 27x1 = 27x2, that is, x1 = x2. Hence f is one-to-one and its inverse y = f−1 (x) exists. We will show that y = f−1 (x) = g(x) with domain the set B = R\{4}. Setting y = 4x+3 x−6 , we equivalently get y(x − 6) = 4x + 3, or x(y − 4) = 6y + 3, that is x = 6y+3 y−4 for y ̸= 4. Such x are indeed elements of A, and in order to obtain the inverse of f it is now sufficient to reverse the roles of x, y in the previous relation. This gives the desired expression of g = f−1 . An alternative verification occurs due to the relations f(g(x)) = 27x 27 = x and g(f(x)) = 27x 27 = x. (b) A direct computation shows that g′ (x) = (6x + 3 x − 4 )′ = − 27 (x − 4)2 , x ∈ B . 
Let us obtain the same result by applying the method mentioned above. We similarly compute f′ (x) = − 27 (x−6)2 and hence it follows that for all x ̸= 4 we have f′ (g(x)) = − 27 (6x+3 x−4 − 6)2 = − (x − 4)2 27 ̸= 0 . Thus g′ (x) = 1/f′ (g(x)) = − 27 (x−4)2 , for all x ∈ B. □ Let y = f(x) be a differentiable function at x0 ∈ R. The the tangent line of the graph of f at the point [x0, y0 = f(x0)] ∈ R2 has the form y − y0 = f′ (x0)(x − x0), or equivalently y = f′ (x0)x + (y0 − f′ (x0)x0) . CHAPTER 5. ESTABLISHING THE ZOO Proof. The proof is a simple statement of the geometrical meaning of the theorem: The secant line between the points [a, f(a)] and [b, f(b)] has a tangent line which is parallel to it (have a look at the diagram). The equation of the secant line is y = g(x) = f(a) + f(b) − f(a) b − a (x − a). The difference h(x) = f(x) − g(x) determines the (vertical) distance of the graph and the secant line (in the values of y). Surely h(a) = h(b) and h′ (x) = f′ (x) − f(b) − f(a) b − a . By Rolle’s theorem, there is a point c at which h′ (c) = 0. □ The mean value theorem can also be written in the form: (1) f(b) = f(a) + f′ (c)(b − a). In the case of a parametrically given curve in the plane, i.e., a pair of functions y = f(t), x = g(t), similar result about the existence of a tangent line parallel to the secant line going through the boundary points is described by Cauchy’s mean value theorem. Notice we may consider such a curve as a complex valued function f(t) + i g(t). Cauchy’s mean value theorem Corollary. Let functions y = f(t) and x = g(t) be continuous on an interval [a, b] and differentiable inside this interval. Further, let g′ (t) ̸= 0 for all t ∈ (a, b) and g(b) ̸= g(a). Then there is a point c ∈ (a, b) such that f(b) − f(a) g(b) − g(a) = f′ (c) g′(c) . Proof. Put h(t) = (f(b) − f(a))g(t) − (g(b) − g(a))f(t). Now h(a) = f(b)g(a) − f(a)g(b), h(b) = f(b)g(a) − f(a)g(b), so by Rolle’s theorem, there is a number c ∈ (a, b) such that h′ (c) = 0. Finally, g′ (c) ̸= 0 and the desired formula follows. □ Notice that g(b) ̸= g(a) automatically, if g′ (t) is contin- uous. 5.3.10. A reasoning similar to the one in the above proof leads to a supremely useful tool for calculating limits of quotients of functions. 413 Hence, the tangent is the line through the point [x0, y0], having slope f′ (x0). Observe that the geometric condition of having a unique non-vertical tangent to the graph of f at a point [x0, f(x0)] ∈ Gf is equivalent to the existence of f′ (x0), that is, to the differentiability of f at x0. Hence tangent lines provide a more intuitive definition of differentiability: We may always think a differentiable function f at x0, as a function whose graph has a unique (non-vertical) tangent at the point [x0, f(x0)]. In this case the graph of f cannot have breaks, corners or cusps at x0. 5.C.10. Consider the function f(x) = αx2 − 4x ln(x) with x > 0, where α ̸= 0 is some real parameter. Find the tangent of f at the point P = [1, f(1)]. Next specify α such that the origin is a point of this tangent. ⃝ 5.C.11. Consider the function f(x) = x2 ln(x), with x > 0. Prove that there exists a unique point P ∈ R2 belonging on the graph of f and where the tangent of f is parallel to the x-axis. Explain why this happens and verify your computations via Sage. Solution. The derivative of f is given by f′ (x) = 2x ln(x) + x = x(2 ln(x) + 1) , for any x ∈ (0, +∞). 
We see that the equation f′ (x) = 0 has a unique solution, namely x0 = e− 1 2 = 1√ e (since f′ is defined only for x > 0, the solution x = 0 is not acceptable). In Sage we can verify these computations by typing f(x)=x^2*ln(x); assume(x>0) show(solve(diff(f, x)==0, x)) The tangent line of f at this point has zero slope, i.e., f′ (x0) = 0, and hence is horizontal. In particular, the tangent line is given by y = f(x0), and observe that near x0 this line lies under the graph of f. In such a situation we say that at x0 the function f attains a local minimum with value f(x0), see also here: Thus there is a unique point P ∈ R2 satisfying the statement, with coordinates P = [ 1√ e , − 1 2e ] . □ CHAPTER 5. ESTABLISHING THE ZOO L’Hospital’s rule12 Theorem. Suppose f and g are functions differentiable on some neighbourhood of a point x0 ∈ R, yet not necessarily at x0 itself. Suppose lim x→x0 f(x) = 0, lim x→x0 g(x) = 0. If the limit lim x→x0 f′ (x) g′(x) exists, then the limit lim x→x0 f(x) g(x) also exists, and the two limits are equal. Proof. Without loss of generality, the functions f and g are zero at the point x0. The quotient of the values then corresponds to the slope of the secant line between the points [0, 0] and [f(x), g(x)]. At the same time, the quotient of the derivatives corresponds to the slope of the tangent line at the given point. Thus it is necessary to verify that the limit of the slopes of the secant lines exists from the fact that the limit of the slopes of the tangent lines exists. Technically, we can use the mean value theorem in Cauchy’s parametric form. First of all, the existence of the expression f′ (x)/g′ (x) on some neighbourhood of the point x0 (excluding x0 itself) is implicitly assumed. Thus especially for points c sufficiently close to x0, g′ (c) ̸= 0.13 By the mean value theorem, lim x→x0 f(x) g(x) = lim x→x0 f(x) − f(x0) g(x) − g(x0) = lim x→x0 f′ (cx) g′(cx) , 12Guillaume François Antoine, Marquis de l’Hôpital, (1661-1704) became famous for his textbook on Calculus (in modern textbooks, his name is mostly spelled as l’Hospital). This rule was first published there, perhaps originally proved by one of the famous Bernoulli brothers. 13This is not always necessary for the existence of the limit in a general sense. Nevertheless, for the statement of l’Hospital’s rule, it is. A thorough discussion can be found (googled) in the popular article ‘R. P. Boas, Counterexamples to L’Hospital’s Rule, The American Mathematical Monthly, October 1986, Volume 93, Number 8, pp. 644–645.’ 414 5.C.12. Find the equations of the tangent and normal lines to the graph of g(x) = (x + 1) 3 √ 3 − x, with x ∈ R, at the point P = [1, f(1)]. Next sketch the two lines together with the graph of g via Sage. Solution. We have g(x0) = g(1) = 2 · 2 1 3 = 2 4 3 . Moreover g′ (x) = (3 − x) 1 3 − 1 3 (x + 1)(3 − x)− 2 3 , and hence g′ (1) = 2 3 · 2 1 3 = 2 3√ 2 3 . Thus the tangent of g at P is given by y1 = g(x0) + g′ (x0)(x − x0) = 2 4 3 + 2 3 √ 2 3 (x − 1) . The normal line and tangent line are perpendicular, hence they should have slopes that are opposite reciprocals each other. This means that the normal line has the form y2 = g(x0) − 1 g′(x0) (x − x0) = 2 4 3 − 3 2 3 √ 2 (x − 1) . 
In order to derive the equations of the tangent and normal (and sketch them) in Sage, type the block g(x)=(x+1)*(3-x)^(1/3);dg(x)=diff(g(x),x) a=plot(g(x), x, -1.2, 3, xmin=-3, xmax=3, legend_label="curve") x0=1; tangl=g(x0)+dg(x0)*(x-x0) show(tangl) perpl=g(x0)-(1/dg(x0))*(x-x0);show(perpl) b=plot(tangl, xmin=-1.2, xmax=1.5, color="black", legend_label="tangent", aspect_ratio=1) c=plot(perpl, xmin=-1, xmax=2, color="gray", legend_label="normal") m=point([x0, 0], size=30, color="black") mv=point([x0, g(x0)], size=30, color="black") show(a+b+c+m+mv) We leave to the reader the implementation of this block, for practicing with Sage. □ 5.C.13. Find the tangent and normal line to the curve given by the equation x3 + y3 − 2xy = 0, at the point P = [1, 1]. ⃝ 5.C.14. Tangent lines via Sage. Given the polynomial P(x) = x4 − 2x with x ∈ I = [−2, 2], write a short routine in Sage constructing the tangent line ℓx0 (x) of P at a random point x0 lying in the interval I. Then choosing certain x0 ∈ I, produce the graphs of P and ℓx0 for all x ∈ I. Solution. Using the def command we can introduce a subroutine, which we agree to call Tangent. To do so we first need to introduce P and its first derivative. The whole block has the form CHAPTER 5. ESTABLISHING THE ZOO where cx is a number lying between x0 and x, dependent on x. From the existence of the limit lim x→x0 f′ (x) g′(x) , it follows that this value will be shared by the limit of any sequence created by substituting the values x = xn approaching x0 into f′ (x)/g′ (x) (cf. the convergence test 5.2.15). Especially, we can substitute any sequence cxn for xn → x0, and thus the limit lim x→x0 f′ (cx) g′(cx) exist, and the last two limits are equal. Hence the desired limit exists and has the same value. □ From the proof of the theorem, it is true for one-sided limits as well. 5.3.11. Corollaries. L’Hospital’s rule can easily be extended for limits at the improper points ±∞ and for the case of infinite values of the limits. If, for instance, we have lim x→∞ f(x) = 0, lim x→∞ g(x) = 0, then limx→0+ f(1/x) = 0 and limx→0+ g(1/x) = 0. At the same time, from existence of the limit of the quotient of the derivatives at infinity, lim x→0+ (f(1/x))′ (g(1/x))′ = lim x→0+ f′ (1/x)(−1/x2 ) g′(1/x)(−1/x2) = lim x→0+ f′ (1/x) g′(1/x) = lim x→∞ f′ (x) g′(x) . Applying the previous theorem, the limit lim x→∞ f(x) g(x) = lim x→0+ f(1/x) g(1/x) = lim x→∞ f′ (x) g′(x) exists in this case as well. The limit calculation is even simpler in the case when lim x→x0 f(x) = ±∞, lim x→x0 g(x) = ±∞. Then it suffices to write lim x→x0 f(x) g(x) = lim x→x0 1/g(x) 1/f(x) , which is already the case of usage of l’Hospital’s rule from the previous theorem. In fact, the l’Hospital’s rule has the same form for inproper limits as well: Theorem. Let f and g be functions differentiable on some neighbourhood of a point x0 ∈ R, not necessarily at x0 itself. Further, let the limits limx→x0 f(x) = ±∞ and limx→x0 g(x) = ±∞ exist. If the limit lim x→x0 f′ (x) g′(x) exists, then the limit lim x→x0 f(x) g(x) also exists and they equal each other. 415 P(x)=x^4-2*x; P1(x)=diff(P(x), x) def Tangent(x_0): y_0=P(x_0) m=P1(x_0) c=y_0 - m*x_0 l(x)=m*x+c #this defines the tangent line Q = plot(P(x), -2, 2, color="blue", CHAPTER 5. ESTABLISHING THE ZOO Proof. Apply the mean value theorem. 
The key step is to express the quotient in a form where the derivative arises:
$$\frac{f(x)}{g(x)} = \frac{f(x)}{f(x)-f(y)}\cdot\frac{f(x)-f(y)}{g(x)-g(y)}\cdot\frac{g(x)-g(y)}{g(x)},$$
where $y \neq x_0$ is fixed, taken from a neighbourhood of $x_0$, and $x$ approaches $x_0$. Since the limits of $f$ and $g$ at $x_0$ are infinite, we can surely assume that the differences of the values of both functions at $x$ and $y$, having fixed $y$, are non-zero. Using the mean value theorem, replace the fraction in the middle with the quotient of the derivatives at an appropriate point $c$ between $x$ and $y$. The expression of the examined limit thus gets the form
$$\frac{f(x)}{g(x)} = \frac{1 - \frac{g(y)}{g(x)}}{1 - \frac{f(y)}{f(x)}}\cdot\frac{f'(c)}{g'(c)},$$
where $c$ depends on both $x$ and $y$. With $y$ fixed and $x$ approaching $x_0$, the former fraction converges to one. At the same time, if simultaneously $y \to x_0$ and $|y - x_0| \ge |x - x_0|$, the latter fraction becomes arbitrarily close to the limit value of the quotient of the derivatives. Thus, we may choose a sequence $y_n \to x_0$ such that
$$\Big|\lim_{x\to x_0}\frac{f'(x)}{g'(x)} - \frac{f'(c)}{g'(c)}\Big| < \frac{1}{n}$$
for all $c$ with $|c - x_0| < |y_n - x_0|$. At the same time, we may restrict appropriately $|x - x_0| \le \delta_n$, so that the former fraction with $y = y_n$ is closer to 1 than by $1/n$ for all such $x$. Altogether, this implies the requested equality of the limits. □

By making suitable modifications of the examined expressions, one can also apply l'Hospital's rule to forms of the types $\infty - \infty$, $1^\infty$, $0\cdot\infty$, and so on. Often one simply rearranges the expressions or uses some continuous function, for instance the exponential one.

5.3.12. Example. For an illustration of such a procedure, we show the connection between the arithmetic and geometric means of $n$ non-negative values $x_i$. The arithmetic mean
$$M_1(x_1,\dots,x_n) = \frac{x_1+\dots+x_n}{n}$$
is a special case of the power mean with exponent $r$, also known as the generalized mean:
$$M_r(x_1,\dots,x_n) = \Big(\frac{x_1^r+\dots+x_n^r}{n}\Big)^{\frac{1}{r}}.$$
The special value $M_{-1}$ is called the harmonic mean. Calculate the limit value of $M_r$ for $r$ approaching zero. For this purpose, determine the limit by l'Hospital's rule (we treat it as an expression of the form $0/0$ and differentiate with respect to $r$, with the $x_i$ as constant parameters). The following calculation uses the chain rule and knowledge of the derivative of the power function, and must be read in reverse: the existence of the last limit implies the existence of the last-but-one, and so on.

    ymin=-2, ymax=20, gridlines="true")
    Q+=plot(l(x), -2, 2, color="black", ymin=-2, ymax=20)
    Q+=point((x_0, y_0), color="black", size=50)
    Q.show()
    show("x_0=", x_0)
    show("tang.line: l(x)=(", m, ")*x+(", c, ")")
    return

Notice that x_0 enters this code as a real number input. Also, the command gridlines with the option true was used to add grid lines in the background (though this is not necessary). The last two show commands are included to display the chosen value of $x_0$ and the expression of the tangent line $\ell_{x_0}(x)$ at $x_0$. Finally, the command return is used to conclude the routine. Further applications of programming in Sage will be discussed later, as seen for example in 5.D.5. We can now check our routine by testing certain values for $x_0$. We may type

Tangent(0); Tangent(-0.5)
Tangent(0.5); Tangent(1)

etc. For instance, the last command produces the figure and prints

x_0=1
tang. line: l(x)=(2)*x+(-3)

Later in the final section of this chapter we will use the routine constructed here to build an interactive environment for tangent lines, see 5.E.92.
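As a glimpse of that interactive environment, the routine above can be wrapped in Sage's @interact decorator. The following is only a sketch of one possible setup (ours), not the construction of 5.E.92 itself:

@interact
def _(x_0=slider(-2, 2, step_size=0.1, default=1)):
    Tangent(x_0)   # the routine defined above; the slider moves the tangency point

Evaluated in a Sage or Jupyter worksheet, the slider redraws the tangent line of P at the chosen x_0.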
□ We have now only scratched the surface of the theory of derivatives, which encompasses a wide range of theorems and applications. Derivatives offer a powerful mathematical framework that naturally applies to the study of monotone functions, optimization problems, and approximation techniques. CHAPTER 5. ESTABLISHING THE ZOO lim r→0 ln(Mr (x1, . . . , xn)) = lim r→0 ln( 1 n (xr 1 + · · · + xr n)) r = lim r→0 xr 1 ln x1+···+xr n ln xn n xr 1+···+xr n n = ln x1 + · · · + ln xn n = ln n √ x1 . . . xn. Hence lim r→0 Mr (x1, . . . , xn) = n √ x1 . . . xn, which is known as the geometric mean. 4. Infinite sums and power series The last part of this chapter is mainly devoted to infinite sums of numbers, aiming at infinite extension of polynomials – the so called power series. We first complete the basic discussion of the exponential function, expressing it as a limit of polynomial approximations. This illustrates the more general need to develop effective tools to deal with sequences of numbers or functions. If the reader finds the next paragraphs too demanding, we suggest jumping to 5.4.3 starting the general discussion on infinite sums of numbers and maybe return later. 5.4.1. The calculation of ex . For numerical computations, manipulation with limits of sequences is needed as well as addition and multiplication of scalars. Thus it might be a good idea to approximate non-polynomial functions by sequences of numbers that can be calculated easily, keeping control of the approximation errors. We approach the function ex this way. In view of the expected property (ex )′ = ex (cf. 5.3.7), we look for a function whose rate of increase equals the function’s value at every point. This can be imagined as a splendid interest rate equal to the current value of your money. If we apply the interest rate per year once a month, once a day, once an hour, and so on, we obtain the following values for the yield x of the deposit after one year: ( 1 + x 12 )12 , ( 1 + x 365 )365 , ( 1 + x 8760 )8760 , . . . Therefore, we could guess that ex = lim n→∞ ( 1 + x n )n . At the same time, we can imagine that the finer we apply the interest, the higher the yield will be. So the sequence on the right-hand side should be an increasing sequence. In detail, we examine the sequence of numbers an = ( 1 + 1 n )n . Bernoulli’s inequality will come in handy: 417 In particular, when examining monotonicity and optimization problems (including local and absolute extrema), derivatives of the first and second orders are often sufficient to reveal the local behavior of a function defined on an interval of the real line.8 Notice in 5.3.2 we will revise the concepts of increasing and decreasing functions, thereby enriching our approach to studying the local behavior of functions. For further illustration, refer to the examples in 5.3.3. This latter aspect will be further elaborated upon in the initial section of Chapter 6. Therefore, the tasks presented below serve a more foundational purpose. 5.C.15. Discuss the monotonicity of the function f(x) = ln(x) x over its domain. Solution. For any x in the domain A := (0, +∞) of f we compute f′ (x) = (ln(x))′ x−x′ ln(x) x2 = 1−ln(x) x2 . Thus, the equation f′ (x) = 0 has a unique solution, namely x0 = e. This is a “critical” or “stationary point” of f, see 5.3.2 for terminology. Since x2 is always positive, the sign of f′ depends on the numerator 1 − ln(x). Observe now that 1 − ln(x) > 0 ⇔ ln(x) < 1 ⇔ ln(x) < ln(e) ⇔ x < e, and similarly we obtain 1 − ln(x) < 0 ⇔ x > e. 
Thus f′ (x) > 0 for all x ∈ (0, e) and f′ (x) < 0 for all x ∈ (e, +∞), which means that f is strictly increasing in the interval (0, e) and strictly decreasing in (e, +∞). In particular, f takes its maximum value at the point x0, namely f(x0) = f(e) = 1 e , see also the figure below. In section 6.1.2 we will learn about a test checking this final claim, which is based on the second derivative of f. The latter is simply defined by f′′ = (f′ )′ , hence in our case it has the form f′′ (x) = (f′ (x))′ = (1 − ln(x) x2 )′ = 2 ln(x) − 3 x3 . 8In this column we will mainly focus on real functions or on complexvalued functions of a real variable. Applications of complex-valued functions of a complex variable (complex analytic functions) will be analyzed later in Chapter 9. CHAPTER 5. ESTABLISHING THE ZOO Lemma. For every real number b ≥ −1, b ̸= 0, and a natural number n ≥ 2, (1 + b)n > 1 + nb. Proof. For n = 2, (1 + b)2 = 1 + 2b + b2 > 1 + 2b. Proceed by induction on n, supposing b > −1. Assume that the proposition holds for some k ≥ 2 and calculate: (1 + b)k+1 = (1 + b)k (1 + b) > (1 + kb)(1 + b) = 1 + (k + 1)b + kb2 > 1 + (k + 1)b The statement is, of course, true for b = −1 as well. □ Now an an−1 = (1 + 1 n )n (1 + 1 n−1 )n−1 = (n2 − 1)n n n2n(n − 1) = ( 1 − 1 n2 )n n n − 1 > ( 1 − 1 n ) n n − 1 = 1. by using Bernoulli’s inequality with b = − 1 n2 . So an > an−1 for all natural numbers, and it follows that the sequence an is indeed increasing. The following similar calculation (also using Bernoulli’s inequality) verifies that the sequence of numbers bn = ( 1 + 1 n )n+1 = ( 1 + 1 n ) ( 1 + 1 n )n is decreasing. Notice that bn > an. Also, bn bn+1 = n n + 1 ( n+1 n n+2 n+1 )n+2 = n n + 1 ( n2 + 2n + 1 n2 + 2n )n+2 = n n + 1 ( 1 + 1 n(n + 2) )n+2 ≥ n n + 1 ( 1 + n + 2 n(n + 2) ) = 1. Thus the sequence an is increasing and bounded from above, so the set of its terms has a supremum which equals the limit of the sequence. At the same time, this value is also the limit of the decreasing sequence bn because lim n→∞ bn = lim n→∞ (1 + 1 n )an = lim n→∞ an. This limit determines one of the most important numbers in mathematics (besides the numbers 0, 1, and π), namely Euler’s number14 e. Thus e = lim n→∞ ( 1 + 1 n )n . 14The ingenious Swiss mathematician, physicist, astronomer, logician and engineer Leonhard Euler (1707-1783) was behind extremely many inventions, including original mathematical techniques and tools. 418 According to this criterium, whenever we have f′′ (x0) < 0 near a critical point x0 of f, then f must attain a (local) maximum at x0. For our case we indeed compute f′′ (e) = − 1 e3 < 0. Notice finally in Sage one could compute the first and second derivative of f by the cell f(x)=ln(x)/x; d1(x)=diff(f, x).factor() show(d1(x)) D(x)=diff(d1, x).factor(); show(D(x)) Here we have programmed Sage to compute f′′ (x) “manually” (we denote this by D(x)). An alternative approach for the second derivative relies on an application of the command diff(f, x, 2).factor( ). This directly prints out the factorization of the expression given for the second derivative of f, without being necessary to compute the first derivative. We will meet more such applications in Chapter 6. □ 5.C.16. Prove by two different ways that the tangent function f(x) = tan(x) is strictly increasing for all x ∈ (−π/2, π/2). ⃝ 5.C.17. Consider the parabola y = x2 . Determine the x-coordinate of the parabola’s point which is nearest to the point A with coordinates x0 = 1 and y0 = 2. Solution. 
Recall the formula for the Euclidean distance between two points $[x_1, y_1]$, $[x_2, y_2]$ in the plane:
$$d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}.$$
An arbitrary point on the parabola has coordinates $[x, y] = [x, x^2]$ (and in this way we eliminate $y$ from the problem). Since the task requires the minimization of the distance between the points $A = [1, 2]$ and $[x, x^2]$, it is sufficient to find the minimum of the function
$$d(x) = \sqrt{(x-1)^2 + (x^2-2)^2}, \quad x \in \mathbb{R},$$
see also the figure for an illustration of the idea. The function $d(x)$ is non-zero for any $x \in \mathbb{R}$; in particular, the equation $g(x) := (x-1)^2 + (x^2-2)^2 = 0$ has only complex solutions. We have
$$d'(x) = \frac{g'(x)}{2\sqrt{g(x)}} = \frac{g'(x)}{2\,d(x)}, \quad x \in \mathbb{R}.$$

5.4.2. Power series for $e^x$. The exponential function has been defined as the only continuous function satisfying $f(1) = e$ and $f(x+y) = f(x)\cdot f(y)$. The base $e$ is now expressed as the limit of the sequence $a_n$, thus necessarily $e^x = \lim_{n\to\infty}(a_n)^x$. Fix a real number $x \neq 0$. If we replace $n$ with $n/x$ in the numbers $a_n$ from the previous paragraph, we arrive again at the same limit. (Think this out in detail!) Hence
$$e = \lim_{n\to\infty}\Big(1+\frac{x}{n}\Big)^{\frac{n}{x}}, \qquad e^x = \lim_{n\to\infty}\Big(1+\frac{x}{n}\Big)^{n}.$$
Denote the $n$-th term of this sequence by $u_n(x) = (1+\frac{x}{n})^n$ and express it by the binomial theorem:
$$(1)\quad u_n(x) = 1 + n\,\frac{x}{n} + \frac{n(n-1)x^2}{2!\,n^2} + \dots + \frac{n!\,x^n}{n!\,n^n} = 1 + x + \frac{x^2}{2!}\Big(1-\frac{1}{n}\Big) + \frac{x^3}{3!}\Big(1-\frac{1}{n}\Big)\Big(1-\frac{2}{n}\Big) + \dots + \frac{x^n}{n!}\Big(1-\frac{1}{n}\Big)\Big(1-\frac{2}{n}\Big)\cdots\Big(1-\frac{n-1}{n}\Big).$$
Look at $u_n(x)$ for very large $n$. It seems that many of the first summands of $u_n(x)$ will be fairly close to the values $\frac{1}{k!}x^k$, $k = 0, 1, \dots$. Thus it is plausible that the numbers $u_n(x)$ should be very close to $v_n(x) = \sum_{j=0}^{n}\frac{1}{j!}x^j$, and thus both these sequences should have the same limit. The following theorem is perhaps one of the most important results of Mathematics:

The power series for $e^x$

Theorem. The exponential function $e^x$ equals, for each $x \in \mathbb{R}$, the limit $\lim_{k\to\infty} v_k(x)$ of the partial sums in the expression
$$e^x = 1 + x + \frac{1}{2}x^2 + \dots + \frac{1}{n!}x^n + \dots = \sum_{n=0}^{\infty}\frac{1}{n!}x^n.$$
The function $e^x$ is differentiable at each point $x$, and its derivative is $(e^x)' = e^x$.

Proof. The technical proof makes the above idea precise. Although the complete argumentation might look complicated, the strategy is straightforward. We shall go through three steps. 1) Prove that the partial sums $v_n$ converge. 2) Write $u_{n,k}$ for the first $k < n$ summands in $u_n$ and conclude that for given $k$, the difference between $v_k$ and $u_{n,k}$ gets arbitrarily small for large $n > k$. 3) Show that there are subsequences $v_{k_i}$ and $u_{n_i}$ converging to the same limit. This will conclude the proof of the first claim of the theorem.

Because the square root is an increasing function, the function $d$ takes its least value at the same point where the function $g$ does. The derivative of $g$ has the form $g'(x) = 4x^3 - 6x - 2$, for all $x \in \mathbb{R}$. Moreover, the equation $g'(x) = 0$, or equivalently $0 = 2x^3 - 3x - 1$, has three solutions:
$$x_0 = -1, \qquad x_1 = \frac{1-\sqrt{3}}{2}, \qquad x_2 = \frac{1+\sqrt{3}}{2}.$$
We deduce that the $x$-coordinate in question equals $x_2 = \frac{1+\sqrt{3}}{2}$. Can you explain how we reject $x_0$ and $x_1$? □

First-order derivatives, and in particular tangent lines, give us the ability to approximate functions locally by linear functions (recall that linear functions are the easiest functions to work with). In particular, when $f$ is a differentiable function at a point $x_0$ of its domain, we can approximate the values of $f$ near $x_0$ via the formula
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0).$$
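Here is a tiny numerical sketch in Sage (our own illustration, with the sample choices f = exp and x0 = 0) of how quickly this approximation improves as x approaches x0:

f(x) = exp(x)                  # sample function; any differentiable f works
x0 = 0; df = diff(f, x)
for h in [0.1, 0.01, 0.001]:
    approx = f(x0) + df(x0)*h               # first-order approximation of f(x0+h)
    print(h, (f(x0 + h) - approx).n())      # errors ~5e-3, ~5e-5, ~5e-7

The error visibly shrinks roughly like $h^2$, a fact quantified by the Taylor expansions of Chapter 6.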
In the approximation formula above, the right-hand side is a first-degree polynomial with respect to $x$, and we suppose that the difference $x - x_0$ is "small enough". Hence, using first-order derivatives we can obtain a tangent-line approximation of the value $f(x)$ in a small neighbourhood of $x_0$. In Section A of Chapter 6 these ideas will be generalized; there we will explain how to approximate functions by higher-degree polynomials (Taylor polynomials). Next we pose applications related to linear approximations; see also 6.A.37 and 6.A.38 for an alternative interpretation in terms of the so-called "differentials". More exercises on tangential approximations are presented in Section D.

5.C.18. Linear approximations. Approximate linearly the function $f(x) = 1/x$ near $x_0 = 1$, and present the graph of $f$ and that of its tangent line at $x_0$. Then compute the approximations at $x = 0.9$ and $x = 1.1$ and compare their differences with the real values of $f$ at these points.

Solution. The first derivative of $f$ is given by $f'(x) = -1/x^2$, and at $x_0 = 1$ we get $f'(x_0) = -1$. Thus the tangent line of $f$ passing through $x_0$ has the form $f(1) + (x-1)f'(1) = 2 - x$; that is, the linear approximation of $f$ near $x_0$ is given by the line $L(x) = 2 - x$. The figure below illustrates the situation.

1) Fix $x$ and recall that $v_n(x)$ is the sequence defined as the sums of the first $n$ terms of the formal infinite expression
$$\sum_{j=0}^{\infty} c_j = \sum_{j=0}^{\infty}\frac{1}{j!}x^j.$$
For all $m > n$,
$$|v_m(x) - v_n(x)| \le \sum_{k=n+1}^{m}\frac{1}{k!}|x|^k = v_m(|x|) - v_n(|x|).$$
Consequently, in order to prove that $v_n(x)$ is a Cauchy sequence, and thus convergent, it is enough to restrict ourselves to $x > 0$ and to prove that $v_n(x)$ is always bounded, and thus convergent (as a nondecreasing sequence), in this case. The quotient of adjacent terms in the series is $\frac{c_{j+1}}{c_j} = \frac{x}{j+1}$. Thus for every fixed $x > 0$, there is a number $N \in \mathbb{N}$ such that $\frac{c_{j+1}}{c_j} < \frac{1}{2}$ for all $j \ge N$. Such large indices $j$ satisfy $|c_{j+1}| < \frac{1}{2}|c_j| < 2^{-(j-N+1)}|c_N|$. Recall that the sum of a geometric series is computed from the equality $(1-q)(1+q+\dots+q^k) = 1-q^{k+1}$. This means that the partial sums of the first $n > N$ terms of our formal sum with $x > 0$ can be estimated as follows:
$$v_n(x) - \sum_{j=0}^{N-1}\frac{1}{j!}x^j < \frac{1}{N!}x^N \sum_{j=0}^{n-N}\frac{1}{2^j}.$$
In particular, the limit of the expressions on the right-hand side for $n$ approaching infinity surely exists, and so the limit of the increasing sequence $v_n$ also exists.

2) Consider some fixed $k$ and $\varepsilon > 0$. If $N$ is large enough, then clearly for all $n > N$, $|u_{n,k} - v_k| < \varepsilon$. Indeed, there is only a fixed number of brackets in the $k$ summands of $u_{n,k}$, see (1), and they will all be arbitrarily close to 1 if $n$ is large enough.

3) Next, consider $x > 0$ and notice that $|u_n - u_{n,k}| < |v_n - v_k|$. Thus, if we fix some $\varepsilon > 0$, then there is some $M$ such that for $k > M$ we ensure $|u_n - u_{n,k}| < \varepsilon$ for all $n > k$. If $x < 0$, we may first bound the left-hand side by the sum of the absolute values of the individual terms, which is still less than $|v_n - v_k|$ evaluated at $|x|$. Now, take such a $k > M$ and consider the $N$ from the previous step (of course dependent on $k$). Then for $n > N$ we arrive at
$$|v_k - u_n| \le |v_k - u_{n,k}| + |u_{n,k} - u_n| < 2\varepsilon.$$
Finally, choosing $\varepsilon_\ell = \frac{1}{2\ell}$, the previous estimate allows us to find subsequences $v_{k_\ell}$, $u_{n_\ell}$ satisfying $|v_{k_\ell} - u_{n_\ell}| < \frac{1}{\ell}$. Thus the convergent sequences $u_n$ and $v_n$ must converge to the same limit value. This is the first claim we had to prove.

Recall that we already know that $(e^x)' = e^x$ if and only if the derivative equals 1 at the origin, see 5.3.7.
Thus, it remains to compute
$$\lim_{x\to 0}\frac{(1 + x + \frac{1}{2}x^2 + \dots) - 1}{x} = \lim_{x\to 0}\frac{x + \frac{1}{2}x^2 + \dots}{x}.$$
This seems to be tricky, since there are two limit processes involved (notice that an $x$ may be cancelled, since it is a constant with respect to the inner limit):
$$\lim_{x\to 0}\Big(\lim_{n\to\infty}\sum_{k=1}^{n}\frac{1}{k!}x^{k-1}\Big) = 1 + \lim_{x\to 0}\Big(\lim_{n\to\infty}\sum_{k=2}^{n}\frac{1}{k!}x^{k-1}\Big).$$
But now, for each positive $\varepsilon > 0$ we can find $N \in \mathbb{N}$ such that
$$\lim_{n\to\infty}\sum_{k=N}^{n}\frac{1}{(k+1)!}x^{k} < \lim_{n\to\infty}\sum_{k=N}^{n}\frac{1}{(k+1)!} < \varepsilon$$
for all $x \in [-1, 1]$. Finally, we can restrict the interval for $x$ enough to ensure that the remaining first terms are bounded by $\varepsilon$, too:
$$\sum_{k=2}^{N-1}\frac{1}{(k+1)!}x^{k} < \varepsilon.$$
This shows that the limit expression on the right-hand side must be zero. Thus the derivative exists and equals one, as expected. □

Now, we compute $f(0.9) \approx 1.1111$ and $L(0.9) = 1.1$, thus $|L(0.9) - f(0.9)| \approx 0.0111$. Similarly, we have $f(1.1) \approx 0.90909$, $L(1.1) = 0.9$ and $|L(1.1) - f(1.1)| \approx 0.009$. □

5.C.19. Given that $(\arcsin(x))' = \frac{1}{\sqrt{1-x^2}}$ for $x \in (-1, 1)$, approximate linearly the value $\arcsin(0.497)$. ⃝

5.C.20. Approximate linearly the values $\sin\big(\frac{29\pi}{180}\big)$ and $\sin\big(\frac{46\pi}{180}\big)$. ⃝

Rolle's theorem, the mean value theorem and its generalizations (Cauchy's mean value theorem) are important theorems that highlight the significance of derivatives, see 5.3.8 and 5.3.9. Below we illustrate them quickly via examples; further linked applications are presented in the final section of this chapter.

5.C.21. Prove that the equation $x^{2027} + 7x^3 - 5 = 0$ has a unique real solution.

Solution. Consider the function $f(x) = x^{2027} + 7x^3 - 5$. Since $f(0) = -5 < 0$ and $f(1) = 3 > 0$, by the intermediate value theorem there exists some $x_0 \in (0, 1)$ such that $f(x_0) = 0$. Suppose that $x_1 \neq x_0$ is another (positive) root of $f$. Without loss of generality we may assume that $x_0 < x_1$. We have $f(x_0) = f(x_1)$, and $f$ is continuous on $[x_0, x_1]$ and differentiable on $(x_0, x_1)$. Thus by Rolle's theorem the first derivative $f'$ of $f$ must have a root in $(x_0, x_1)$. However, $f'(x) = 2027x^{2026} + 21x^2$, and thus $f'(x) > 0$ for all $x > 0$, a contradiction. Hence $f$ has only one zero in $(0, +\infty)$. On the other hand, we see that $f(x) < 0$ for all $x \in (-\infty, 0]$. You may verify this final claim in Sage, just by the cell

f(x)=x^(2027)+7*x^3-5
assume(x<0); bool(f(x)<0)

Thus the given equation has a unique real solution. □

5.C.22. Using Rolle's theorem prove that the equation $x^3 + px + q = 0$ with $p, q \in \mathbb{R}$, $p > 0$, admits a unique real solution.

Solution. Consider the function $f(x) = x^3 + px + q$ with $x \in \mathbb{R}$, where $p, q \in \mathbb{R}$ and $p > 0$. It is easy to prove that $\lim_{x\to-\infty} f(x) = -\infty$ and $\lim_{x\to+\infty} f(x) = +\infty$. Moreover, $f$ takes both negative and positive values; thus there exists some $\xi \in \mathbb{R}$ with $f(\xi) = 0$. We will show that $\xi$ is unique. Indeed, suppose that $\zeta \in \mathbb{R}$ is another real number satisfying $f(\zeta) = 0$. We may assume that $\zeta < \xi$ (the case $\xi < \zeta$ is treated similarly). Based on Rolle's theorem we can then find some $a \in (\zeta, \xi)$ with $f'(a) = 0$. However, $f'(x) = 3x^2 + p$ for all $x$, and so this gives a contradiction (since $p > 0$, the equation $f'(x) = 0$ does not admit a real solution). □

5.C.23. Decide whether the function $f(x) = x^3 + b$, where $b$ is some constant, satisfies the mean value theorem on the interval $[1, 2]$. In the positive case determine $c$, where $c$ is as in Theorem 5.3.9.

Readers who skipped the preceding paragraphs (it does not matter whether on purpose or in need) can stay calm – we deduce all the results on the exponential function later again, using more general tools.
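A quick numerical comparison in Sage (our own sketch) shows how the two approximations behave in practice: the partial sums v_n converge to e^x much faster than the compound-interest terms u_n of 5.4.1.

n = 50; x = 1.0
u = (1 + x/n)^n                                  # u_n(x) from 5.4.1
v = sum(x^j/factorial(j) for j in range(n+1))    # partial sum v_n(x) from 5.4.2
print(u.n(), v.n(), exp(1).n())                  # u ~ 2.6916, v matches e closely

This is only numerical evidence, of course; the rigorous statements follow from the general theory of power series developed below.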
In particular, we will see that all power series are always differentiable and can be differentiated term by term. We see later that the conditions f′ (x) = f(x) and f(0) = 1 determine a function uniquely. 5.4.3. Number series. When deriving the previous theorems about the function ex , we have automatically used several extraordinarily useful concepts and tools. Now, we come back to them in detail: Infinite number series Definition. An infinite series of numbers is an expression ∞∑ n=0 an = a0 + a1 + a2 + · · · + ak + . . . , where the an’s are real or complex numbers. The sequence of partial sums is given by the terms sk = ∑k n=0 an. The series converges and equals s if the limit s = lim k→∞ sk of the partial sums exists and is finite. If the sequence of partial sums has an improper limit, the series diverges to ∞ or −∞. If the limit of the partial sums does not exist, the series oscillates. An immediate consequence of the properties of limits is the following claim: 421 Solution. The given function is a polynomial, and hence continuous on its domain [1, 2]. It is also differentiable on the open interval (1, 2), with f′ (x) = 3x2 . Therefore, it satisfies the mean value theorem. In particular, by Theorem 5.3.9 there exists c ∈ (1, 2) such that f′ (c) = 3c2 = f(2)−f(1) 2−1 = 8+b−1−b = 7. Thus c2 = 7/3, that is, c = ± √ 7/3. Since − √ 7/3 does not belong on the interval [1, 2], we deduce that c = √ 7/3 is the only possible value. □ 5.C.24. Check which of the following functions satisfy the mean value theorem. In the positive case, find the possible values of c, where c is as in the Theorem 5.3.9: f(x) = x 2 3 , g(x) = x x + 1 , h(x) = 3 √ 8 − x for x ∈ [−2, 2], x ∈ [1, 4] and x ∈ [0, 8], respectively. ⃝ 5.C.25. Using the mean value theorem prove that −1 ≤ cos(2x) < sin(2x) 2x ≤ 1 , for all x ∈ ( 0, π 2 ] . Solution. For x = π 2 the given inequality becomes −1 ≤ −1 < 0 ≤ 1, which is true. Hence we may focus on x ∈ (0, π 2 ). For any x ∈ (0, π 2 ) consider the function f(a) = sin(2a) , with a ∈ [0, x] . Observe that f is continuous on [0, x] as a composition of continuous functions. It is also differentiable on the open interval (0, x) with f′ (a) = 2 cos(2a). Thus, by the mean value theorem there exists c ∈ (0, x) with f′ (c) = f(x)−f(0) x−0 , or equivalently 2 cos(2c) = sin(2x) x ⇔ cos(2c) = sin(2x) 2x . (∗) Next we may assume that 0 < c < x < π 2 , that is, 0 < 2c < 2x < π. Also, for the function g(x) = cos(x) we have g′ (x) = − sin(x) < 0 for all x ∈ (0, π). This means that the function cos is strictly decreasing on (0, π) (having a local maximum at x = 0 and a local minimum at x = π). Thus, the previous inequality implies that 1 = cos(0) ≥ cos(2c) > cos(2x) ≥ cos(π) = −1 . Combining this inequality with (∗), the claim follows. □ 5.C.26. Based on Cauchy’s mean value theorem, show that a < n n + 1 (bn+1 − an+1 bn − an ) < b for all n ∈ N∗ , where a, b ∈ R are such that 0 < a < b ⃝ The theory of derivatives is useful in a plethora of practical tasks, but also for crucial problems appearing in many other sciences and technology, e.g., in physics, chemistry, statistics, economics, biology, engineering, environmental science, informatics and computer science, architecture, science of graphics, and space science, to mention a few of them. But why derivatives can be so useful? Partially this is because the derivative dy dx represents the rate of change of y CHAPTER 5. ESTABLISHING THE ZOO Corollary. Consider two convergent series S = ∑∞ n=0 an and T = ∑∞ n=0 bn. 
Then their sum and any constant multiple by a real or complex number $c$ are convergent too, and
$$S + T = \sum_{n=0}^{\infty}(a_n + b_n), \qquad cS = \sum_{n=0}^{\infty} c\,a_n.$$
In particular, any linear combination of series with constant coefficients is convergent and equals the same linear combination of the sums. Check the details yourself!

5.4.4. Properties of series. For the sequence of partial sums $s_n$ to converge, it is necessary and sufficient that it is a Cauchy sequence; that is,
$$|s_m - s_n| = |a_{n+1} + \dots + a_m|$$
must be arbitrarily small for sufficiently large $m > n$. Since $|a_{n+1}| + \dots + |a_m| \ge |a_{n+1} + \dots + a_m|$, the convergence of the series $\sum_{n=0}^{\infty}|a_n|$ implies the convergence of the series $\sum_{n=0}^{\infty}a_n$.

Absolutely convergent series

A series $\sum_{n=0}^{\infty}a_n$ is absolutely convergent if the series $\sum_{n=0}^{\infty}|a_n|$ converges.

Absolute convergence is introduced because it is often much easier to verify. The following theorem shows that all simple algebraic operations behave "very well" for series that converge absolutely.

Properties of series

Theorem. Let $S = \sum_{n=0}^{\infty}a_n$ and $T = \sum_{n=0}^{\infty}b_n$ be two absolutely convergent series. Then
(1) their sum converges absolutely to the sum
$$S + T = \sum_{n=0}^{\infty}a_n + \sum_{n=0}^{\infty}b_n = \sum_{n=0}^{\infty}(a_n + b_n),$$
(2) their difference converges absolutely to the difference
$$S - T = \sum_{n=0}^{\infty}a_n - \sum_{n=0}^{\infty}b_n = \sum_{n=0}^{\infty}(a_n - b_n),$$
(3) their product converges absolutely to the product
$$S \cdot T = \Big(\sum_{n=0}^{\infty}a_n\Big)\cdot\Big(\sum_{n=0}^{\infty}b_n\Big) = \sum_{n=0}^{\infty}\Big(\sum_{k=0}^{n}a_{n-k}b_k\Big),$$
(4) the value $S$ of the sum does not depend on any rearrangement of the series, i.e., $\sum_{n=0}^{\infty}a_{\sigma(n)} = S$ for any permutation $\sigma: \mathbb{N} \to \mathbb{N}$ of the indices.

Proof. The convergence in the first and the second statements was discussed already. Further, the linear combinations satisfy $|\alpha a_n + \beta b_n| \le |\alpha||a_n| + |\beta||b_n|$, and thus the absolute convergence is obvious, too.

with respect to $x$. Thus, the rate of change of any physical quantity at any time is obtained by differentiating the physical quantity with respect to time. Another reason is the related theory of minima and maxima (optimization problems), and we will meet many such tasks in Chapter 6. The next few examples aim to present how derivatives may be involved in problems of everyday life and other nice applications. A further enrichment of this list, which could easily be made much longer, is postponed to the final Section E of this chapter.

5.C.27. For a company renting city taxis, any driver costs €6 per working hour (h), while the gas costs €1.40 per litre (ℓ). Suppose that inside the city these taxis can travel $(180 - 2u)/6$ kilometres (km) per litre of gas, where $u$ denotes the driving velocity, measured in km/h. Find the cheapest driving speed, i.e., the one minimizing the company's cost per kilometre (counting only the expenses for a single taxi of its fleet), and present the corresponding minimal cost value in €/km per taxi.

Solution. We can focus on the cost of a single taxi (in € per km), which we should minimize. This cost is the sum of the cost of the driver (per km) and of the used gas (per km). According to the statement we have:
• the driver costs $\frac{6\ \text{€/h}}{u\ \text{km/h}} = \frac{6}{u}$ €/km;
• the gas costs $\frac{1.40\ \text{€/ℓ}}{\frac{180-2u}{6}\ \text{km/ℓ}} = \frac{8.4}{180-2u}$ €/km.
Thus, the total cost in € per km is a function of $u$, given by
$$c(u) := \frac{6}{u} + \frac{8.4}{180-2u} = \frac{6}{u} - \frac{4.2}{u-90}.$$
First we see that the driving velocity u, should satisfy 0 < u < 90 (e.g., the cost blows up for u = 90), and for such u the extreme points of c(u) correspond to the solutions of the equation c′ (u) = 0 (we recommend to sketch the graph of the cost function c via Sage). The derivative of c is given by c′ (u) = − 6 u2 + 4.2 (u − 90)2 and it follows that the equation c′ (u) = 0 has two solutions, namely u1,2 = ±30 √ 70 + 300 km/h. These computations can be quickly verified in Sage: u=var("u"); c(u)=(6/u)+8.4/(180-2*u) dc(u)=diff(c(u), u);show(solve(dc==0, u)) Acceptable is only the value u1 = −30 √ 70 + 300 ≈ 49 km/h, and this is the speed with the lowest costs. In particular, with this velocity the cost value c(u1) of a single taxi is approximately 0.225 C/km. □ 5.C.28. Suppose that from a given piece of industrial paper having area α2 we want to construct an open box, by cutting CHAPTER 5. ESTABLISHING THE ZOO The third statement is not so simple. Write cn = n∑ k=0 an−kbk. From the assumptions and from the rule for the limit of a prod- uct, ( k∑ n=0 an ) · ( k∑ n=0 bn ) → ( ∞∑ n=0 an ) · ( ∞∑ n=0 bn ) . Thus it suffices to prove that 0 = lim k→∞ (( k∑ n=0 an ) · ( k∑ n=0 bn ) − k∑ n=0 cn ) . Consider the expressions ( k∑ n=0 an ) · ( k∑ n=0 bn ) = ∑ 0≤i,j≤k aibj, cn = ∑ i+j=n 0≤i,j≤k aibj, k∑ n=0 cn = ∑ i+j≤k 0≤i,j≤k aibj. along with the bound ( k∑ n=0 an ) · ( k∑ n=0 bn ) − k∑ n=0 cn = ∑ i+j>k 0≤i,j≤k aibj ≤ ∑ i+j>k 0≤i,j≤k |aibj|. If the sum of the indices is to be larger than k, then at least one of them must be larger than k/2. The expression does not decrease if more terms are added into it. Take all as in the product and remove only those whose indices are both at most k/2. ∑ i+j>k 0≤i,j≤k |aibj| ≤ ∑ 0≤i,j≤k |aibj| − ∑ 0≤i,j≤k/2 |aibj|. However, both the expressions of the difference are the partial sums for the product S · T. Therefore, they share the same limit and their difference goes to zero. The last claim seems to be a little tricky. Notice, for each small ε > 0 we can find a common bound α such that for all N > α both estimates are true: ∞∑ n=N |an| < ε, N∑ n=0 an − S < ε. Now, consider any permutation σ of the indices and write Iσ = {σ−1 (0), . . . , σ−1 (α)}. Then, for each N > max Iσ 423 four small squares of equal size of its corners and then bending the resulting sides. Determine the total area of paper that we should cut, so that the box has maximum volume. Solution. Let x be the length of the side of any of the squares that we should cut, as it is illustrated in the figure: The volume of the box under construction is given by V (x) = x(α − 2x)2 with 0 < x < α 2 . We want to locate those x for which the function V attains its maximum value V (x). Since α > 0 is a constant, an application of the product rule gives V ′ (x) = dV dx = (α − 2x)2 + 2 · x · (α − 2x) · (−2) = (α − 2x)(α − 6x) . Thus V ′ (x) = 0 if and only if x = α 2 or x = α 6 . Now we will use the theory of second derivatives, which are analyzed in Section 6.1.5. As we said above, the second derivative of a differentiable function f is the derivative of its first derivative. So V ′′ (x) = (V ′ (x))′ = −2 · (α − 6x) − 6(α − 2x) = 24x − 8α = 24 ( x − α 3 ) . According to 6.1.5, if x0 satisfies f′ (x0) = 0 and f′′ (x0) < 0, then x0 is a local maximum of f, while when x0 is such that f′ (x0) = 0 but f′′ (x0) > 0, then x0 is a local minimum of f. In our case we see that V ′′ ( α 2 ) = 4α > 0 , V ′′ ( α 6 ) = −4α < 0 . 
Thus, the volume of the box under construction will be maximal if and only if x = α 6 , so the area that we should cut is 4· α2 36 = α2 9 = (α 3 )2 . Moreover, the maximal volume is given by V (α 6 ) = 2α3 27 = 2(α 3 )3 . An implementation of the whole problem via Sage is easy, and can be done in many ways. For instance, type the cell var("a") #introduce a symbolic variable V(x)=x*(a-2*x)^2 #declare the volume DV(x)=diff(V, x).factor()#declare the 1st #derivative of the volume show(solve(DV(x)==0, x)) #critical point eq. show(diff(V, x, 2)(x=a/6).factor()) show(V(a/6).factor()) #the maximal volume □ CHAPTER 5. ESTABLISHING THE ZOO clearly N∑ n=0 aσ(n) − S = N∑ n≤N, n∈Iσ aσ(n) − S + N∑ 0≤n≤N,n/∈Iσ aσ(n) ≤ α∑ n=0 an − S + N∑ 0≤n≤N, n/∈Iσ |aσ(n)|. Next, notice that n /∈ Iσ means σ(n) > α. Thus, the latter term is at most equal to ∑∞ α+1 an and thus the entire expresssion is bounded by 2ε. This shows that the rearranged series converges to the same value S again. □ 5.4.5. Simple tests. The following theorem collects some useful conditions for deciding on the convergence of series. Convergence tests Theorem. Let S = ∑∞ n=0 an be an infinite series of real or complex numbers. Let T = ∑∞ n=0 bn be another series with all bn ≥ 0 real. (1) If the series S converges, then limn→∞ an = 0. (2) (The comparison test) If T converges and |an| ≤ bn, then S converges absolutely. If bn ≤ |an| and T diverges, then S does not converge absolutely. (3) (The limit comparison test). If both an and bn are positive real numbers and the finite limit lim n→∞ an bn = r > 0 exists, then S converges if and only if T converges. (4) (The ratio test) Suppose that the limit of the quotients of adjacent terms of the series exists and lim n→∞ an+1 an = q. Then the series S converges absolutely for |q| < 1 and does not converge absolutely for |q| > 1. If |q| = 1 the series may or may not converge. (5) (The root test) If the limit lim n→∞ n √ |an| = q exists, then the series converges absolutely for q < 1. It does not converge absolutely for q > 1. If q = 1, the series may or may not converge. Proof. (1) The existence and the potential value of the limit of a sequence of complex numbers is given by the limits of the real parts and the imaginary parts. Thus it suffices to prove the first proposition for sequences of real numbers. If limn→∞ an does not exist or is non-zero, then for a sufficiently small number ε > 0, there are infinitely many terms ak with |ak| > ε. There are either infinitely many positive terms or infinitely many negative terms among them. But then, adding any one of them into the partial sum, the difference of the adjacent terms sn and sn+1 is at least 424 5.C.29. On a spherical ballon we inflate air so that its volume increase with a rate of 16 cm3 /min. Find the rate of change of the radius of the ballon when its volume is 36 cm3 . Hint: The volume of a sphere of radius r is given by 4πr3 3 . Solution. The radius of the ballon can be viewed as a function with respect to the time t. Hence also the volume of the ballon is a function of t (since it depends on the radius), i.e., V (t) = 4π 3 r3 (t) for all t. Now, by assumption at certain time t = t0 the volume is 36 cm3 . Hence we have 36 = 4πr3 (t0) 3 , and thus r3 (t0) = 33 π , that is, r(t0) = 3 3 √ π . We also compute V ′ (t) = dV dt = 4πr2 (t)r′ (t) . By assumption dV dt = 16, hence 16 = 4πr2 (t)r′ (t) and this gives r′ (t) = dr dt = 4 πr2(t) . 
At t = t0 we have r′ (t0) = 4 πr2(t0) and by replacing the value r(t0) = 3 3 √ π we obtain the rate of change of the radius of the balloon at time t = t0. □ 5.C.30. Consider an isosceles triangle with base length b and height (above the base) h. Inscribe the rectangle having the greatest possible area into it (one of the rectangle’s sides will lie on the triangle’s base). Determine the area S of this rectangle. Solution. To solve this problem, it suffices to consider the problem of inscribing the largest rectangle into a right triangle with legs of lengths b/2 and h so that two of the rectangle’s sides lie on the legs of the triangle. Thus we reduce the problem to maximizing the function f(x) = x ( h − 2 h x b ) on the closed interval I = [0, b/2]. Observe that f(x) ≥ 0, for all x ∈ I, with f(0) = f (b 2 ) = 0. We also have f′ (x) = h − 4 h x b and x0 = b/4 is the unique stationary point of f restricted to [0, b/2]. There f takes its greatest value, hence the sides of the required rectangle are b/2 long (i.e., twice x0), and h/2 long (the latter is obtained by substituting b/4 for x into the expression h − 2hx/b). Moreover, S = hb/4. □ 5.C.31. Recall that the velocity of a moving object is the derivative of its position function, and its acceleration is the derivative of its velocity function. If the position of a moving object in time t is given by the function s(t) = −(t − 3)2 + 16, t ∈ [0, 7] , where the time is measured in seconds, and the position is measured in meters, determine the following: (a) the initial velocity of the object (that is, at t = 0); (b) the time and position at which its velocity is zero; (c) its velocity and acceleration at time t = 4 s. ⃝ In the remainder of this section we shall practice the use of the so-called “l’Hopital’s rule” via a variety of examples (for extra material refer to Section E at the end of this chapter). CHAPTER 5. ESTABLISHING THE ZOO ε. Thus the sequence of partial sums cannot be a Cauchy sequence and, therefore, it cannot be convergent. (2) We are dealing with absolute convergence and thus the sequence of partial sums is non-decreasing. Once bounded, such a sequence is convergent and converges to its supremum. Similarly, the sequence of partial sums must be unbounded, if estimated from below by a divergent sequence. (3) Since the limit r = limn→∞ an bn exists, for any given ε > 0 and sufficiently large n > Nε, (r − ε)bn < an < (r + ε)bn. Thus, after choosing ε < r it follows that an < (r +ε)bn and bn < 1 r−ε an. The result follows from the previous claim (2). (4) To prove absolute convergence, it can be assumed that the terms of the series are real numbers an > 0. Suppose q < r < 1 for a real number r. From the existence of the limit of the quotients, for every j greater than a sufficiently large N, aj+1 < r · aj ≤ r(j−N+1) aN . But this means that the partial sums sn are, for large n > N, bounded from above by the sums sn < N∑ j=0 aj + aN n−N∑ j=0 rj = N∑ j=0 aj + aN 1 − rn−N+1 1 − r where the last equality follows from the general equality (1 − r)(1 + r + r2 + · · · + rk ) = 1 − rk+1 . Since 0 < r < 1, the set of all partial sums is an increasing sequence bounded from above, and thus its limit equals its supremum. In the case q > r > 1, a similar technique can be used. However, this time, from the existence of the limit of the quotients, aj+1 > r · aj ≥ r(j−N+1) aN > 0. This implies that the absolute values of the particular terms of the series do not converge to zero, and thus the series cannot be convergent, by the already proved part (1) of the theorem.
(5) The proof is similar to the previous case. From the existence of the limit q < 1, it follows that for any r, q < r < 1, there is an N such that for all n > N, n √ |an| < r holds. Exponentiation then gives |an| < rn , there is a comparison with a geometric series. Thus the proof can be finished in the same way as in the case of the ratio test. □ 5.4.6. Limes superior and inferior. In the proofs of the last two statements of the previous theorem, a much weaker assumption is used than the existence of the limit. It is only necessary to know that the examined sequences of non-negative terms (ratios or roots) are, from a given index on, either all larger or all less than a given number. For this purpose however, it suffices to consider, for a given bounded sequence of terms bn, the supremum of the terms with index higher than n. These suprema always exist and create a non-increasing sequence. Its infimum is then 425 Recall the claim is that for each differentiable curve [g(t), f(t)] emanating from the origin, the limit of the slope f(t) g(t) exists and they are equal, provided that the limit of the tangent lines in the origin exists. This resolves the indeterminate forms 0 0 , and the similar result is available for indeterminate forms ∞ ∞ , see also 5.3.10. It’s important to note that multiple applications of l’Hopital’s rule are quite common. This practice involves stating that the limits of quotients of derivatives obtained through l’Hopital’s rule are equal to the limits of the original quotients, albeit this is sometimes an abuse of notation. To justify this, it is essential to ensure that the limits produced on the right-hand sides actually exist, which may require repeated verification. 5.C.32. l’Hopital’s rule. Apply the l’Hopital’s rule to compute the limit lim x→+∞ f(x), where f(x) = ln x − 4 ex 8 ex αe − ln x , α = constant > 0 . Solution. When x → +∞, both the numerator and denominator of f(x) give the indeterminate form (+∞) − (+∞). However, we may write lim x→+∞ f(x) = lim x→+∞ ln x − 4 ex 8 ex αe − ln x = lim x→+∞ ln x ex − 4 8αe − ln x ex = lim x→+∞ ln x ex − 4 8αe − lim x→+∞ ln x ex . Now, for the limit lim x→+∞ ln x ex we may apply the l’Hopital’s rule. For this we need the derivatives of ln x and ex , and by 5.3.7 we know that (ln x)′ = 1/x and (ex )′ = ex , respectively. Therefore lim x→+∞ f(x) = lim x→+∞ (ln x)′ (ex)′ − 4 8αe − lim x→+∞ (ln x)′ (ex)′ = lim x→+∞ 1/x ex − 4 8αe − lim x→+∞ 1/x ex = lim x→+∞ 1 x ex − 4 8αe − lim x→+∞ 1 x ex = 0 − 4 8αe − 0 = − 1 2αe . □ 5.C.33. Based on the l’Hopital’s rule verify the given type and next verify the given result: (a) lim x→0 sin(2x) − 2 sin(x) 2 ex −x2 − 2x − 2 = −3 , (type 0 0 ); (b) lim x→0+ ln(x) cot(x) = 0 , (type ∞ ∞ ); (c) lim x→1+ ( x x − 1 − 1 ln(x) ) = 1 2 , (type ∞ − ∞); CHAPTER 5. ESTABLISHING THE ZOO called upper limit of the sequence and denoted by lim sup n→∞ bn = lim n→∞ sup{bk; k ≥ n}. The advantage is that the upper limit always exists (either finite if the sequence is bounded, or ±∞ if the sequence is unbounded). Similarly, lim inf n→∞ bn = lim n→∞ inf{bk; k ≥ n}. Therefore, we can reformulate the previous result (without having to change the proof) in a stronger form: Corollary. Let S = ∑∞ n=0 an be an infinite series of real or complex numbers. (1) If q = lim sup n→∞ an+1 an , then the series S converges absolutely for q < 1 and does not converge absolutely for q > 1. For q = 1, it may or may not converge. 
(2) If q = lim sup n→∞ n √ |an|, the series converges absolutely for q < 1 while it does not converge absolutely for q > 1. For q = 1, it may or may not converge. In the literature, the ratio test is often called d’Alembert’s criterion, while the root test is called Cauchy’s criterion. There are many other useful tests, but we do not have space for them here. 5.4.7. Alternating series. The condition an → 0 is a necessary but not sufficient condition for the convergence of the series ∑∞ n=0 an. However, there is Leibniz’s criterion of convergence for special types of series. Leibniz’s criterion for alternating series The series ∑∞ n=0(−1)n an, where an is a non-increasing sequence of non-negative real numbers, is called an alternating series. Theorem. An alternating series converges if and only if limn→∞ an = 0. Its value a = ∑∞ n=0(−1)n an differs from the partial sum s2k by at most a2k+1. Proof. By definition, the partial sums sk of an alternating series satisfy s2(k+1)+1 = s2k+1 + a2k+2 − a2k+3 ≥ s2k+1 s2(k+1) = s2k − a2k+1 + a2k+2 ≤ s2k s2k+1 − s2k = −a2k+1 → 0 s2 ≥ s2k ≥ s2k+1 ≥ s1. Thus, the even partial sums are a non-increasing sequence, while the odd ones are non-decreasing. The last line reveals that the bounded sequence of the odd partial sums converges 426 (d) lim x→1+ (ln(x) · ln(x − 1)) = 0 , (type 0 · ∞); (e) lim x→0+ (cot(x)) 1 ln(x) = 1 e , (type ∞0 ); (f) lim x→0 ( sin(x) x ) 1 x2 = 1 6 √ e , (type 1∞ ); (g) lim x→1− ( cos( πx 2 ) )ln x = 1 , (type 00 ). Solution. (a) The conclusion for the type is immediate. Set f(x) = sin(2x)−2 sin(x) 2 ex −x2−2x−2 . Repeatedly applying l’Hopital’s rule we obtain (observe that at any step the limit on the r.h.s. exists) lim x→0 f(x) 0 0 = lim x→0 (sin(2x) − 2 sin(x)) ′ (2 ex −x2 − 2x − 2) ′ = lim x→0 2 cos(2x) − 2 cos(x) 2 ex −2x − 2 0 0 = lim x→0 (2 cos(2x) − 2 cos(x)) ′ (2 ex −2x − 2) ′ = lim x→0 −4 sin(2x) + 2 sin(x) 2 ex −2 0 0 = lim x→0 (−4 sin(2x) + 2 sin(x)) ′ (2 ex −2) ′ = lim x→0 −8 cos(2x) + 2 cos(x) 2 ex = −3 . (b) Clearly, limx→0+ ln(x) = −∞ and limx→0+ cot(x) = +∞ (see also a sketch of the graphs of these functions below), hence the type of indeterminate form. Otherwise, for a direct verification via Sage, one may type g(x)=cot(x);k(x)=ln(x) lim(g(x), x=0, dir="plus") lim(ln(x), x=0, dir="plus") This time, applying l’Hopital’s rule we arrive at an indeterminate form of a different type, which is again a quite common situation. In particular, recall that cot(x) = 1 tan(x) = cos(x) sin(x) and hence cot′ (x) = −1/ sin2 (x). Setting f(x) = ln(x)/ cot(x) and observing that at any step the limit in the CHAPTER 5. ESTABLISHING THE ZOO to its supremum, while the even ones converge to the infimum. The third line shows that these two limits coincide if and only if limn→∞ an = 0, which proves the first claim. At the same time the limit value a of the series is always at least s2k+1 and at most s2k. Thus, the latter partial sums cannot differ by more than a2k+1 from the limit value. □ Remark (Riemann’s rearrangement theorem). As is obvious from the latter theorem, convergent alternating series often do not converge absolutely. Such series are called conditionally convergent. In contrast to the independence of the order in which we sum up the terms of an absolutely convergent series (cf. (4) of Theorem 5.4.4), there is the famous Riemann theorem saying that a conditionally convergent series can be brought to any finite or infinite value by an appropriate rearrangement of the terms in the sum.
We shall not go into the detailed proof here, but the idea is very simple: if a series S converges only conditionally, we may split it into the series of positive and negative terms, S+ and S−, say ordered by absolute value, and both of them have to be divergent (otherwise, they would both converge absolutely and thus their difference would do as well). Now, prescribing the desired limit q, we shall add the postive terms until getting bigger than q, then adding the negative ones, until getting smaller than q, etc. 5.4.8. Convergence rate. The proofs of the tests derived in the previous two paragraphs allow also for straightforward estimates of the speed of the convergence. Indeed, both the tests for the absolute convergence are based on the comparison with the geometric series either for q = lim supn→∞ an+1 an or q = lim supn→∞ n √ |an|, and 0 < q < 1. In the estimate of the error of approximation of the limit s∞ by the n-th partial sum sn |s∞ − sn| < |aN | ∞∑ j=0 rj = |aN |rn−N 1 1 − r = Crn where N and q < r < 1 are the two related choices from the proof of the test and C is the resulting constant not dependent of n. Thus the convergence rate is quite fast, in particular if r is much smaller than 1 (and we can get r as close to q as necessary). On the other hand, the proof of the alternating series test shows that the convergence rate is at least as fast as the convergence of the terms an. 5.4.9. Power series. If we do not consider a sequence of numbers an, but rather a sequence of functions fn(x) sharing the same domain A, we can use the definition of addition of series “point-wise”, thereby obtaining the concept of the series of functions S(x) = ∞∑ n=0 fn(x). 427 r.h.s. exists, the following makes sense lim x→0+ f(x) +∞ +∞ = lim x→0+ ln′ (x) cot′ (x) = lim x→0+ 1 x − 1 sin2(x) = lim x→0+ − sin2 (x) x 0 0 = lim x→0+ − ( sin2 (x) )′ (x)′ = lim x→0+ (−2 sin(x) cos(x)) = 0 . (c) For a verification of the type, notice that limx→1+ x x−1 = +∞, and limx→1+ 1 ln x = +∞. However, we see that f(x) := ( x x−1 − 1 ln x ) = x ln(x)−x+1 (x−1) ln(x) , thus we obtain the indeterminate form 0/0. One computes the following: lim x→1+ f(x) 0 0 = lim x→1+ (x ln(x) − x + 1)) ′ ((x − 1) ln(x)) ′ = lim x→1+ ln(x) + x x − 1 ln(x) + x−1 x = lim x→1+ x ln(x) x ln(x) + x − 1 0 0 = lim x→1+ (x ln(x)) ′ (x ln(x) + x − 1) ′ = 1 2 . (d) Obviously, this case is of type 0 · (−∞). We transform the given expression into the type −∞ ∞ by writing f(x) := ln(x) · ln(x − 1) = ln(x−1) 1 ln(x) . Then we see that lim x→1+ f(x) 0 0 = lim x→1+ (ln(x − 1)) ′ ( 1 ln(x) )′ = lim x→1+ 1 x−1 −1 x · 1 ln2(x) = lim x→1+ x ln2 (x) 1 − x 0 0 = lim x→1+ ( x ln2 (x) )′ (1 − x)′ = lim x→1+ ln2 (x) + 2 · x x · ln(x) −1 = 0 . (e) Recall from above that limx→0+ cot(x) = +∞. Moreover limx→0+ 1 ln(x) = 1 −∞ = 0, hence indeed we have the indeterminate form (+∞)0 . Moreover, we know that lim x→0+ f(x) = lim x→0+ (cot(x)) 1 ln(x) = e lim x→0+ ln (cot(x)) ln(x) , and hence it suffices to calculate the limit given in the argument of the exponential function. Setting g(x) = ln(cot(x)) ln(x) CHAPTER 5. ESTABLISHING THE ZOO If we consider the monomials fn(x) = anxn , we arrive at the “infinite polynomials”: Power series Definition. A power series is a series of functions given by the expression S(x) = ∞∑ n=0 anxn with coefficients an ∈ C, n = 0, 1, . . . . S(x) has the radius of convergence ρ ≥ 0 if and only if S(x) converges for every x satisfying |x| < ρ and does not converge for |x| > ρ. 5.4.10. Properties of power series. 
Although a significant part of the proof of the following theorem is postponed until the end of the following chapter, the formulation of the basic properties of power series can be considered now. Actually, we should notice that our argument on the convergence of power series works exactly the same way for complex values of z, and even now the reader may enjoy a direct simple proof of all the claimed properties in the complex setting in 9.4.2 on page 873 in Chapter 9. Our real case may be viewed as a special case there, and there is nothing involved but some very straightforward and natural estimates (which could serve as a nice exercise right now). Recall that the upper limit r = lim supn→∞ n √ |an| equals the limit limn→∞ n √ |an|, whenever this limit exists. Similarly with the ratio criterion and lim supn→∞ an+1 an . Convergence and differentiation Theorem. Let S(x) = ∑∞ n=0 anxn be a power series and let r = lim sup n→∞ n √ |an|. Then the radius of convergence of the series S is ρ = r−1 . Equivalently, we may compute r = lim sup n→∞ an+1 an . The power series S(x) converges absolutely on the whole interval of convergence and is continuous on it (including the boundary points, supposing it is convergent there). Moreover, the derivative exists on this interval, and S′ (x) = ∞∑ n=1 nanxn−1 . Proof. To verify the absolute convergence of the series, use the root test from theorem 5.4.5(5), for every value of x. Calculate (if the limit exists) lim n→∞ n √ |anxn| = r|x|. 428 we obtain lim x→0+ g(x) +∞ −∞ = lim x→0+ (ln (cot(x))) ′ ln′ (x) = lim x→0+ 1 cot(x) · (cot(x))′ 1 x = lim x→0+ x · ( tan(x) · −1 sin2 (x) ) = lim x→0+ −x cos(x) · sin(x) 0 0 = lim x→0+ −1 cos2(x) − sin2 (x) = −1 . Thus limx→0+ f(x) = e−1 = 1/ e. The cases (f) and (g) rely on the same trick as in the previous case. Thus, limx→0(sin(x) x ) 1 x2 = e limx→0 ln( sin(x) x ) x2 and limx→1− ( cos(πx 2 ) )ln(x) = e limx→1− ( ln(x)·ln ( cos( πx 2 ) )) , respectively, and now you should prove that the limits appearing as exponents on the right-hand side equal −1/6 and 0, respectively. Check yourself that this leads to the evaluations e−1/6 = 1/ 6 √ e and e0 = 1, respectively. □ 5.C.34. Show with an example that using l’Hopital’s rule for limits which are not indeterminate can lead to wrong results. ⃝ D. Infinite sums and power series In this section we shall investigate whether we can add infinitely many real numbers, which leads us to consider what is known as an infinite series of real numbers. Usually we denote such a series by ∑∞ n=0 an. By saying that an infinite series ∑∞ n=0 an converges to a real number L, we mean that the sequence (Sn = ∑n k=0 ak) of partial sums converges to L, that is, lim n→∞ Sn = L . If the sequence (Sn) does not converge, we usually say that the series diverges. When an infinite series ∑∞ n=0 an is convergent, it is relatively easy to prove that an → 0 as n → +∞, see 5.4.5. Hence, if the sequence (an) does not converge to zero, then the given series is not convergent. However, there are many cases where an → 0 as n → +∞, but the induced series diverges, see our first example below (5.D.1). In 5.4.5 further theorems are presented that help us to determine whether a series converges or diverges, known as “convergence criteria”. Many of these criteria revolve CHAPTER 5. ESTABLISHING THE ZOO If this limit is different from 1, then either the series converges absolutely (when r|x| < 1), or it does not converge (when r|x| > 1). It follows that it converges for |x| < ρ and diverges for |x| > ρ.
If the limit does not exist, use the upper limit in the same way. The same argument based on the ratio test of convergence leads to the other formula for r. The statements about continuity and the derivatives are proved later in a more general context, see 6.3.7–6.3.9, or check the straightforward proof in the complex setting in 9.4.2. □ 5.4.11. Remarks. If the coefficients of the series increase rapidly enough, (for example an = nn ), then r = ∞. Then the radius of convergence is zero, and the series converges only at x = 0. Here are some examples of convergence of power series (including the boundary points of the corresponding interval): Consider S(x) = ∞∑ n=0 xn , T(x) = ∞∑ n=1 1 n xn . The former example is the geometric series, which was already discussed. Its sum is, for every x, |x| < 1, S(x) = 1 1 − x , while |x| > 1 guarantees that the series diverges. For x = 1, we obtain the series 1 + 1 + 1 + . . . , which is divergent. For x = −1, the series is 1 − 1 + 1 − . . . , whose partial sums do not have a limit. The series oscillates. Theorem 5.4.5(2) shows that the radius of convergence of the series T(x) is again 1 because lim n→∞ 1 n+1 xn+1 1 n xn = |x| lim n→∞ n n + 1 = |x|. For x = 1, the series 1 + 1 2 + 1 3 + . . . , is divergent: By summing up the 2k−1 adjacent terms 1/2k−1 , . . . , 1/(2k −1) and replacing each of them by 2−k (thus they total up to 1/2), the partial sums are bounded from below by the sum of these 1/2’s. Since the bound from below diverges to infinity, so does the original series. On the other hand, the series T(−1) = −1+ 1 2 − 1 3 +. . . converges although of course, it cannot converge absolutely. Of course, this is true since we deal here with an alternating series. Notice that the convergence of a power series is relatively fast near x = 0. It is slower near the boundary of the convergence interval. 5.4.12. Trigonometric functions. Another important observation is that a power series is a series of numbers for each fixed x and the individual terms make sense for complex numbers x ∈ C. Thus the domain of convergence of a power series is always a disc in the complex plane C centered at the origin. 429 around “absolute convergence”, a significant concept introduced in 5.4.4. Our goal next is to carefully analyze numerous examples to illustrate these theoretical statements. Additionally, we will demonstrate how Sage can be used for computational assistance. We start with a straightforward example, the so-called “harmonic series”, and progressively introduce more challenging examples and applications. 5.D.1. Explain why the following series are divergent: S = ∞∑ n=1 1 √ n , H = ∞∑ n=1 1 n , Υ = ∞∑ n=1 ln ( n + 1 n ) . Solution. Starting with S, first of all observe that 1√ n → 0 as n → ∞. So indeed we cannot apply (1) of Theorem 5.4.5. However, considering the partial sums we see that Sn = 1 + 1 √ 2 + · · · + 1 √ n > n √ n = √ n for all n ∈ N∗ . This means that the sequence (Sn)n∈N∗ of partial sums of S is not bounded, and this implies that S di- verges. For H observe again that 1 n → 0 as n → ∞. This series is the harmonic series, discussed also in 5.B.19, and here we present a different proof verifying its divergence. 
The newer method is based on the partial sums of H, listed below: H2 = 1 + 1 2 = H1 + 1 2 , H4 = H2 + 1 3 + 1 4 ≥ H2 + 1 4 + 1 4 = H2 + 1 2 , H8 = H4 + 1 5 + 1 6 + 1 7 + 1 8 ≥ H4 + 1 8 + 1 8 + 1 8 + 1 8 = H4 + 1 2 , · · · · · · H2n+1 = H2n + 1 2n + 1 + 1 2n + 2 + · · · + 1 2n + 2n ≥ H2n + 2n 2n+1 = H2n + 1 2 , and more general H2n ≥ 1 + n 2 , n ≥ 1 . Thus, the sequence of partial sums is not bounded, and diverges to infinity. This implies that the harmonic series diverges as well. As we will see below, the harmonic series is a special case of a more general series, the so called “p-series”. For Υ, we have an := ln (n+1 n ) → 0 for n → +∞, as well. To verify this claim, use Sage via the cell var("n");lim(ln((n+1)/n), n=oo) CHAPTER 5. ESTABLISHING THE ZOO More generally, we can write power functions centered at an arbitrary (complex) point x0, S(x) = ∞∑ n=0 an(x − x0)n , which converge absolutely again on the disc of radius ρ, ρ−1 = lim supn→∞ n √ |an|, but this time centered at x0. Earlier we proved explicitly (by a simple application of the ratio test) that the exponential function series converges everywhere. Thus this defines a function for all complex numbers x. Its values are the limits of values of (complex) polynomials with real coefficients and each polynomial is completely determined by finitely many of its values. In particular, the values of each series are completely determined on the complex domain by their values at the real input values x. Therefore, the complex exponential must also satisfy the usual formulae which we have already derived for the real values x. In particular, for all x, y ∈ C ex+y = ex · ey , which can be also easily checked directly by the formula for the product in theorem 5.4.4(3). Substitute the values x = i t, where i ∈ C is the imaginary unit, t ∈ R arbitrary. We get the complex valued function on R eit = 1 + it − 1 2 t2 − i 1 3! t3 + 1 4! t4 + i 1 5! t5 − . . . . The conjugate number to z = eit is the number ¯z = e−it . Hence |z|2 = z · ¯z = eit · e−it = e0 = 1. All the values z = eit lie on the unit circle centered at the origin, in the complex plane. The real and imaginary parts of the points lying on the unit circle are named as the trigonometric functions cos θ and sin θ, where θ is the corresponding angle. Differentiating the parametric description of the points of the circle t → eit , gives the vectors of “velocities” which are easily computed. Differentiating the real and imaginary parts separately (knowing that the real power series can be differentiated term by term) gives : (eit )′ = (1 − 1 2 t2 + 1 4! t4 . . . )′ + i(t − 1 3! t3 + 1 5! t5 . . . )′ = −(t − 1 3! t3 + 1 5! t5 . . . ) + i(1 − 1 2 t2 + 1 4! t4 . . . ) 430 However, we may rely on the properties of natural logarithm, as follows: ∞∑ n=1 an = lim n→∞ ( ln 2 1 + ln 3 2 + ln 4 3 + · · · + ln n + 1 n ) = lim n→∞ ln 2 · 3 · 4 · · · (n + 1) 1 · 2 · 3 · · · n = lim n→∞ ln (n + 1) = +∞. Thus the series Υ also diverges to +∞. □ The infinite number series ζ(p) = ∞∑ n=1 1 np = 1 + 1 2p + · · · + 1 np + · · · is called the p-series, and it is known that converges for all p > 1. For 0 < p ≤ 1, the p-series diverges (observe that for p = 1 coincides with the harmonic series discussed above). For p > 1 the p-series ζ(p) is a monotone decreasing function of p, called the “Riemann zeta function”.9 Euler proved that ∞∑ n=1 1 np = ∏ s 1 ( 1 − 1 sp ) , where s ranges over all prime numbers. This formula, which is now known as the Euler product formula, makes ζ a very important tool in number theory. 
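For a small numerical illustration of the Euler product (our own addition, using only standard Sage functions), one can compare the product truncated to the first thousand primes with the exact value ζ(2) = π2/6; the truncated product already matches ζ(2) to about four decimal places:
P=prod(1/(1-1/p^2) for p in primes_first_n(1000)) #truncated Euler product
print(N(P), N(zeta(2))) #both approximately 1.6449
Including more primes brings the truncated product as close to ζ(2) as desired.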
However, for most values of p > 1 (in particular, for the odd integers p = 3, 5, . . .) no closed-form expression for the value ζ(p) is known, although there are accurate numerical approximations. Next we will use Sage to evaluate ζ(p) for some p. 5.D.2. Evaluate the zeta function ζ(p) for p = 2, . . . , 9 with the aid of Sage. Is this feasible for all the given values of p? ⃝ Infinite sums naturally appear in many practical problems. Perhaps the best known ancient example is the Achilles paradox, due to Zeno and discussed by Aristotle, which we briefly highlight in the next exercise. Later we will discuss more sophisticated problems, starting for example with 5.D.10. 5.D.3. Achilles’ paradox. Although Achilles runs much faster than the turtle on the beach, he needs some time tn to halve their distance. Thus Achilles needs the infinite sum of these time intervals tn to catch the turtle. Explain why this sum is still a finite number. Solution. If the current distance is d, and the velocity of the turtle is 1 while the velocity of Achilles is α, then their distance after time t is d + t − α t. Thus, shortening the distance to the half means t = 1 2 d α−1 . Repeating this n times, we obtain the time intervals tn = d α − 1 · 1 2n . 9Named after G. F. B. Riemann (1826–1866). CHAPTER 5. ESTABLISHING THE ZOO which means (eit )′ = i · eit . So the velocity vectors all have unit length. Hence the entire circle is parametrized as t runs through the interval [0, 2π], where 2π stands for the length of the circle (a thorough definition of the length of a curve needs integral calculus, which we will develop in the next chapter). In particular, this procedure of parameterizing the circle can be used to define the number π, also called Archimedes’ constant or the Ludolphian number,15 half the length of the unit circle in the Euclidean plane R2 . It can be found by computing the first positive zero point of the imaginary part of eit . For example, use the 10th order approximation sin t ≃ t − 1 6 t3 + 1 120 t5 − 1 5040 t7 + 1 362880 t9 . Ask any computer algebra system to find the first positive root. The result is π ≃ 3.148690, for which only the first 2 decimal digits are correct. But with the 20th order approximation this gives 3.141592, with 5 digits correct. The reason for the slow approximation is that π is relatively far from zero. The explicit representation of trigonometric functions in terms of the power series is now apparent: cos t = Re eit = 1 − 1 2 t2 + 1 4! t4 − 1 6! t6 + · · · + (−1)k 1 (2k)! t2k + . . . sin t = Im eit = t − 1 3! t3 + 1 5! t5 − 1 7! t7 + · · · + (−1)k 1 (2k + 1)! t2k+1 + . . . The following diagram illustrates the convergence of the series for the cosine function. It is the graph of the corresponding polynomial of degree 68. Drawing partial sums shows that the approximation near zero is very good. As the order increases, the approximation is better further away from the origin as well. 15This number describes the ratio of the circumference to the diameter of an (arbitrary) circle. It was known to the Babylonians and the Greeks in ancient times. The term Ludolphian number is derived from the name of the German mathematician Ludolph van Ceulen of the 16th century, who produced 35 digits of the decimal expansion of the number, using the method of inscribed and circumscribed regular polygons, invented by Archimedes. 431 The procedure of summing all these tn follows the rule of geometric series (see 5.B.16). Thus the result is ∞∑ n=0 1 2n = 1 1 − 1 2 = 2 and so Achilles needs the time d α−1 (here we sum from n = 1).
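The total catch-up time can also be checked symbolically in Sage; here is a minimal sketch of our own, where the symbols d and a stand for the initial distance d and the velocity α of Achilles in the solution above:
var("n d a")
assume(a>1) #Achilles is faster than the turtle
t(n)=(d/(a-1))*(1/2)^n #the time intervals t_n
show(sum(t(n), n, 1, oo)) #returns d/(a - 1)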
This perfectly matches the fact that their relative mutual velocity is α − 1 (explain why this assertion is true). □ An infinite series, whether real or complex, that is absolutely convergent is also convergent. This is fundamentally due to the completeness of both the real numbers (R) and the complex numbers (C). Conversely, a series that is convergent but not absolutely convergent is termed “conditionally convergent”, see the end of Section 5.4.7. In the following sections we will examine a range of examples that encompass all possible cases of convergence and divergence. 5.D.4. Examine which of the following series are convergent and which are divergent: T1 = ∞∑ n=1 n2 + 1 2n2 − 1 , T2 = ∞∑ n=1 2n n , T3 = ∞∑ n=1 2n (n + 2)! , T4 = ∞∑ n=1 1 n · 22025 , T5 = ∞∑ n=1 n7 + 7n + 2 (7n + 2) · n7 , T6 = ∞∑ n=1 en n2 , T7 = ∞∑ n=0 1 (n + 1) · 3n , T8 = ∞∑ n=1 n2 + 1 n3 , T9 = ∞∑ n=0 2n + 3n n! , T10 = ∞∑ n=1 2n · ( 4 5 )n2 , T11 = ∞∑ n=0 2 + sin3 (n + 1) 4n + n2 , T12 = ∞∑ n=1 nn n2 · n! . Solution. 1) We see that limn→∞ n2 +1 2n2−1 = 1 2 ̸= 0, so by (1) in Theorem 5.4.5 the series T1 is divergent. 2) Similarly we see that limn→∞ 2n n = ∞, hence T2 is divergent. Notice however that when the terms of the series are positive, absolute convergence is the same as convergence. Hence the same result follows by the ratio test, also known as the d’Alembert criterion (see (4) of Theorem 5.4.5). This is because lim n→∞ an+1 an = lim n→∞ 2n+1 n+1 2n n = lim n→∞ 2(n + 1) n = 2 > 1 . One may wish to verify the divergence of T2 (or T1) in Sage as well. This can be done just by typing var("n"); sum(2^n/n, n, 1, oo) and the essential part of the answer that Sage returns is the very final output: Sum is divergent. Thus, the sum method in Sage can be used as a verification CHAPTER 5. ESTABLISHING THE ZOO The well-known formula eit e−it = sin2 t + cos2 t = 1 is immediate. From the derivative (eit )′ = i eit it follows that (sin t)′ = cos t, (cos t)′ = − sin t by considering real and imaginary parts. Let t0 denote the smallest positive number for which e−it0 = − eit0 . The value t0 is the first positive zero point of the function cos t. According to the definition of π, t0 = 1 2 π. If approximating the function cos by the 10th order approximation, then twice the first positive root yields π ∼ 3.1415917 with 5 digits correct, while the 20th order approximation provides 10 correct digits after the decimal point. Squaring yields ei2t0 = e−i2t0 = (e−it0 )2 . So π is a zero point of the function sin t. Of course, for any t, ei(4kt0+t) = (eit0 )4k · eit = 1 · eit . Therefore, both trigonometric functions sin and cos are periodic, with period 2π. This is their prime period. Now the usual formulae connecting the trigonometric functions are easily derived. For illustration, we introduce some of them. First, the definition says that cos t = 1 2 (eit + e−it )(1) sin t = 1 2i (eit − e−it ).(2) Thus the product of these functions can be expressed as sin t cos t = 1 4i (eit − e−it )(eit + e−it ) = 1 4i (ei2t − e−i2t ) = 1 2 sin 2t. Further, by utilizing our knowledge of derivatives: cos 2t = ( 1 2 sin 2t)′ = (sin t cos t)′ = cos2 t − sin2 t. The properties of other trigonometric functions tan t = sin t cos t , cot t = (tan t)−1 can easily be derived from their definitions and the formulae for derivatives. The graphs of the functions sine, cosine, tangent, and cotangent are displayed on the diagrams (they are distinguished as the solid and dashed lines, respectively): 432 of divergence, as well.
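If one prefers a reusable test over reading Sage’s error message, the following small helper (our own sketch, not a built-in command) catches the ValueError which Sage raises for divergent symbolic sums:
var("n")
def appears_divergent(f):
    #True when symbolic summation reports "Sum is divergent"
    try:
        sum(f, n, 1, oo)
        return False
    except ValueError:
        return True
print(appears_divergent(2^n/n)) #True
print(appears_divergent(1/n^2)) #False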
3) We see that an+1 an = 2n+1 (n + 2)! 2n(n + 3)! = 2 n + 3 , which tends to 0 < 1 as n tends to infinity. By the ratio test we deduce that the series T3 is absolutely convergent and hence also convergent. 4) Let ∑∞ n=1 an be a non-convergent series. Taking some constant c ̸= 0 and considering the new series ∑∞ n=1(can), it is easy to see that this is also a non-convergent series. This is because by assumption the sequence of partial sums sn =∑n k=1 ak cannot converge, and hence also the sequence (Sn) of partial sums of the new series cannot converge (since Sn =∑n k=1(cak) = c · sn). We can apply this conclusion for T4: T4 = ∞∑ n=1 1 n · 22025 = 1 22025 · ∞∑ n=1 1 n , that is, the series is a multiple of the harmonic series. Thus, in combination with 5.D.1 we see that T4 is not convergent. 5) Obviously, T5 = ∞∑ n=1 1 n7 + ∞∑ n=1 1 7n + 2 = T1 5 + T2 5 . Hence it is sufficient to decide on the convergence or divergence of the series T1 5 and T2 5 , as defined above. We see that T2 5 = ∑∞ n=1 1 7n+2 < ∑∞ n=1 1 7n and the geometric series ∑∞ n=1 1 7n converges (see 5.B.16). Thus, the series T2 5 converges as well. The series T1 5 = ∑∞ n=1 1 n7 is also convergent (it is a p-series with p = 7 > 1). As a result, the relation T5 = T1 5 + T2 5 implies that T5 is a convergent series (since it is the sum of two convergent series, see Corollary 5.4.3). 6) The series T6 is not convergent. Let us apply for example the d’Alembert criterion (ratio test): lim n→∞ an+1 an = lim n→∞ en+1 ·n2 en ·(n + 1)2 = e · lim n→∞ ( n n + 1 )2 = e ·1 = e > 1 . 7) The series T7 converges, because T7 = ∞∑ n=0 1 (n + 1) · 3n ≤ ∞∑ n=0 ( 1 3 )n = 1 1 − 1 3 = 3 2 < +∞ . The result now follows by the comparison test (see Theorem 5.4.5). 8) The series T8 consists only of non-negative terms, as T7 for example, but it diverges (necessarily to +∞), as a direct computation shows: T8 = ∞∑ n=1 n2 + 1 n3 ≥ ∞∑ n=1 n2 n3 = ∞∑ n=1 1 n = +∞ . 9) Observe that T9 is again the sum of two series: T9 = ∞∑ n=0 2n n! + ∞∑ n=0 3n n! = T1 9 + T2 9 CHAPTER 5. ESTABLISHING THE ZOO Cyclometric functions are the functions inverse to trigonometric functions. Since the trigonometric functions all have period 2π, their inverses can be defined only inside one period, and further, only on the part where the given function is either increasing or decreasing. Two inverse trigonometric functions are arcsin = sin−1 with domain [−1, 1] and range [−π/2, π/2] and arccos = cos−1 with domain [−1, 1] and range [0, π]. See the left-hand illustration. The remaining functions are (displayed in the diagram on the right) arctan = tan−1 with domain R and range (−π/2, π/2), and finally arccot = cot−1 with domain R and range (0, π). The hyperbolic functions are also of great importance. Two basic ones are sinh x = 1 2 (ex − e−x ), cosh x = 1 2 (ex + e−x ). The name indicates that they should have something in common with a hyperbola. From the definition, (cosh x)2 − (sinh x)2 = 2 1 4 (2 ex e−x ) = 1. The points [cosh t, sinh t] ∈ R2 parametrically describe a hyperbola in the plane. For hyperbolic functions, one can easily derive identities similar to the ones for trigonometric functions. By substituting into (1) and (2), one can obtain for example cosh x = cos(ix), i sinh x = sin(ix). 433 and hence it is sufficient to decide on the convergence of T1 9 and T2 9 . By applying the ratio test to T1 9 , we see that lim n→∞ 2n+1 (n+1)! 2n n! = lim n→∞ 2 n + 1 = 0 < 1 so T1 9 converges. Similarly for T2 9 , and hence T9 converges.
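In fact, for T9 Sage evaluates both summands in closed form via the exponential series ∑∞ n=0 xn n! = ex (an optional cross-check of ours):
var("n")
show(sum(2^n/factorial(n), n, 0, oo)) #returns e^2
show(sum(3^n/factorial(n), n, 0, oo)) #returns e^3
so the exact value of T9 is e2 + e3 .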
10) We can apply the root test, see (5) in Theorem 5.4.5: lim n→∞ n √ 2n · ( 4 5 )n2 = lim n→∞ ( 2 · ( 4 5 )n ) = 0 < 1 . Thus T10 is a convergent series. In Sage you can compute this limit as usual: var("n"); lim(2*((4/5)**n), n=oo) 11) For the series T11 we see that 0 ≤ 2 + sin3 (n + 1) 4n + n2 < 3 4n . However, ∑∞ n=1 3 4n = 3 ∑∞ n=1 1 4n and the latter is a convergent geometric series. Thus the series T11 converges as well. 12) For the series T12 an application of the ratio test gives lim n→∞ an+1 an = lim n→∞ ( (n + 1)n+1 (n + 1)2 · (n + 1)! · n2 · n! nn ) = lim n→∞ n2 (n + 1)2 · lim n→∞ (n + 1)n nn = lim n→∞ n2 n2 · lim n→∞ ( 1 + 1 n )n = 1 · e > 1 . Thus, T12 does not converge; since its terms are positive, it diverges to +∞. □ Next, we will explore how to use programming in Sage as a tool for analyzing the convergence or divergence of infinite series. We introduce a syntax using the command def to construct a routine tailored for this purpose. The essence of this routine revolves around implementing the ratio test. Later on, we will demonstrate how this technique can be adapted to apply the root test (see 5.E.136), as well as other convergence criteria. 5.D.5. Let S = ∑∞ n=0 fn be an infinite series for which the limit limn→∞ |fn+1/fn| exists and equals q. Write a routine in Sage appropriate to decide the absolute convergence/divergence of S based on the ratio test. Then apply your program in order to: (a) Prove that the following infinite series converge: S1 = ∞∑ n=1 2n · (n + 1)3 3n , S2 = ∞∑ n=1 6n n! . Next provide a formal proof. (b) Deduce, if possible, the convergence/divergence of S3 = CHAPTER 5. ESTABLISHING THE ZOO 5.4.13. Notes. (1) If a power series S(x) is expressed with the variable x moved by a constant offset x0, we arrive at the function T(x) = S(x − x0). If ρ is the radius of convergence of S, then T will be well-defined on the interval (x0 −ρ, x0 +ρ). We say that T is a power series centered at x0. The power series can be defined in the following way: S(x) = ∞∑ n=0 an(x − x0)n , where x0 is an arbitrary fixed real number. All of the previous reasoning remains valid. It is only necessary to be aware of the fact that it relates to the point x0. In particular, such a power series converges on the interval (x0 −ρ, x0 +ρ), where ρ is its radius of convergence. Further, if a power series y = T(x) has its values in an interval where a power series S(y) is well-defined, then the values of the function S ◦ T are also described by a power series which can be obtained by formal substitution of y = T(x) for y into S(y). (2) As soon as a power series with a suitable center is available, the coefficients of the power series for inverse functions can be calculated. We do not introduce a list of formulae here. It is easily obtained in Maple, for instance, by the procedure “series”. For illustration, here are two examples: Begin with ex = 1 + x + 1 2 x2 + 1 6 x3 + 1 24 x4 + . . . . Since e0 = 1, we search for a power series centered at x = 1 for the inverse function ln x. So assume ln x = a0+a1(x−1)+a2(x−1)2 +a3(x−1)3 +a4(x−1)4 +. . . . Apply the equality x = ln(ex ), regroup the coefficients by the powers of x and substitute. The result is: x = a0 + a1 ( x + 1 2 x2 + 1 6 x3 + 1 24 x4 + . . . ) + a2 ( x + 1 2 x2 + . . . )2 + a3 ( x + 1 2 x2 + . . . )3 + . . . = a0 + a1x + ( 1 2 a1 + a2 ) x2 + ( 1 6 a1 + a2 + a3 ) x3 + ( 1 24 a1 + ( 1 4 + 2 6 ) a2 + 3 2 a3 + a4 ) x4 + . . . .
Comparing the coefficients of the corresponding powers on both sides, gives a0 = 0, a1 = 1, a2 = − 1 2 , a3 = 1 3 , a4 = − 1 4 , . . . . This corresponds to the valid expression (to be verified later): ln x = ∞∑ n=1 (−1)n−1 n (x − 1)n . Similarly, we can begin with the series sin t = t − 1 3! t3 + 1 5! t5 − 1 7! t7 + . . . 434 ∑∞ n=1 nn n2·n! and S4 = ∑∞ n=1 n2 n+1 (notice S3 is the last series in 5.D.4). Solution. As we said, we will use the command def to define a routine, which we will call ratiotest. First it is useful to introduce n as a symbolic variable. It is also convenient to introduce the subset (−1, 1) ⊂ R. This is because of the ratio test (see Theorem 5.4.5), which states that S converges absolutely for |q| < 1, does not converge absolutely for |q| > 1, while for q = 1 the series may or may not converge. The aim now is to encode the above criterion in a Sage routine. For this we may type var("n") s=RealSet(-1, 1) def ratiotest(f): q=lim(abs(f(n+1)/f(n)), n=oo) if q in s : return "converges absolutely" elif q==1 : return "no conclusion" else : return "does not converge absolutely" return Our routine is now ready to be tested.10 (a) First we will prove formally that S1, S2 are both convergent series. Notice that for S1, S2 the defining terms are positive, and hence q = lim n→∞ fn+1 fn = lim n→∞ fn+1 fn . So, for S1 we compute q = lim n→∞ 2n+1 · (n + 2)3 · 3n 3n+1 · 2n · (n + 1)3 = lim n→∞ 2(n + 2)3 3(n + 1)3 = lim n→∞ 2n3 3n3 = 2 3 < 1 . Or you can directly use Sage to compute this limit: var("n") f(n)=(2^(n+1)*(n+2)^3*3^n)/(3^(n+1)*(n+1)^3*2^n) lim(f(n), n=oo) Similarly, for S2: q = lim n→∞ ( 6n+1 (n + 1)! · n! 6n ) = lim n→∞ 6 n + 1 = 0 < 1 , where again one can compute the limit directly in Sage: var("n") g(n)=(factorial(n)*6^(n+1))/(factorial(n+1)*6^n) lim(g(n), n=oo) Notice here the command factorial(n) corresponds to n!. Since for both cases we obtained q < 1, our claim follows by the ratio test. To check our routine using S1, S2, we type 10In general, detailed “return inputs” are more than welcome. However, here we kept them “short” to save some space. CHAPTER 5. ESTABLISHING THE ZOO and the (unknown so far) power series for its inverse centered at zero (since sin 0 = 0) arcsin t = a0 + a1t + a2t2 + a3t3 + a4t4 + . . . . Substitution gives t = a0 + a1 ( t − 1 3! t3 + 1 5! t5 + . . . ) + a2 ( t − 1 3! t3 + 1 5! t5 + . . . )2 + . . . = a0 + a1t + a2t2 + ( − 1 6 a1 + a3 ) t3 + ( − 2 6 a2 + a4 ) t4 + ( 1 120 a1 − 3 6 a3 + a5 ) t5 + . . . , hence arcsin t = t + 1 6 t3 + 3 40 t5 + . . . . (3) Notice that if it is assumed that the function ex can be expressed as a power series centered at zero, and that power series can be differentiated term by term, then the difference equation for the coefficients an is easily obtained since (xn+1 )′ = (n + 1)xn . Therefore, from the condition that the exponential function has its derivative equal to its value at every point, an+1 = 1 n+1 an, a0 = 1 and hence it is clear that an = 1 n! . 435 f(n)=((2^n)*(n+1)^3)/3^n; ratiotest(f) and f(n)=(6^n/factorial(n)); ratiotest(f) respectively. Our routine verifies our result, by printing out the answer “converges absolutely”. On the other hand, in 5.D.4 we saw that S3 satisfies q = e > 1, in particular, S3 cannot converge. The same verifies our routine: f(n)=n^n/(n^2*factorial(n)); ratiotest(f) with answer “does not converge absolutely”. For the final case notice that running the command print(lim((n+1)^3/(n^2*(n+2)), n=oo)) we obtain 1, that is, limn→∞ f(n+1) f(n) = 1, where f(n) = n2 n+1 . 
This means that the ratio test is inconclusive and we cannot determine if the series converges or diverges using this test. Our routine is also able to certify this claim, and we just need to type f(n)=n^2/(n+1) ratiotest(f) which prints out “no conclusion”. To answer however this case, one may add the syntax sum(n2 /(n + 1), n, 1, oo), which shows to us that S4 is divergent. Try to describe a formal proof of this conclusion. □ 5.D.6. Alternating series. Determine whether or not the following series converge: S1 = ∞∑ n=1 (−1)n n2 + 3n − 1 (3n − 2)2 , S2 = ∞∑ n=1 (−1)n−1 3n4 − 3n3 + 9n − 1 (5n3 − 2) · 4n . Solution. We have limn→∞ n2 +3n−1 (3n−2)2 = limn→∞ n2 9n2 = 1 9 ̸= 0. Thus the limit lim n→∞ (−1)n n2 + 3n − 1 (3n − 2)2 does not exist, and the series S1 does not converge. For S2 an application of the ratio (or root) test shows that the polynomials in the numerator or in the denominator do not affect the value of the considered limit, and we may consider the series ∞∑ n=1 (−1)n−1 1 4n For this, we see that lim n→∞ an+1 an = 1 4 < 1. Thus this series (absolutely) converges and as a conclusion the series S2 is also (absolutely) convergent. □ 5.D.7. Decide whether the series S1 = ∞∑ n=1 sin(n) n2 , S2 = ∞∑ n=1 cos (πn) 3 √ n2 CHAPTER 5. ESTABLISHING THE ZOO 436 converges absolutely, converges conditionally, or does not converge at all. ⃝ Computing the exact values of infinite series is generally a challenging task. Sometimes, we can compare these values with known results. However, in many cases, relying on computer algebra software is necessary. To illustrate this situation, let us consider some examples and use Sage to verify our calculations rigorously. 5.D.8. Calculate the series given below and then use Sage to verify your answer. S1 = ∞∑ n=1 ( 1 √ n − 1 √ n + 1 ) , S2 = ∞∑ n=0 5 3n , S3 = ∞∑ n=1 ( 3 42n−1 + 2 42n ) , S4 = ∞∑ n=1 n 3n . Solution. For the first given series we see that S1 = lim n→∞ (( 1 √ 1 − 1 √ 2 ) + ( 1 √ 2 − 1 √ 3 ) + · · · · · · + ( 1 √ n − 1 √ n + 1 )) = lim n→∞ ( 1 + ( − 1 √ 2 + 1 √ 2 ) + · · · · · · + ( − 1 √ n + 1 √ n ) − 1 √ n + 1 ) = lim n→∞ ( 1 − 1 √ n + 1 ) = 1 . As we know, a verification by Sage takes the form n=var("n");sum((1/sqrt(n)-1/sqrt(n+1)),n,1,oo) For S2 we see that this is a quintuple of the standard geometric series with the common ratio q = 1/3. Thus S2 = ∞∑ n=0 5 3n = 5 ∞∑ n=0 ( 1 3 )n = 5 · 1 1 − 1 3 = 15 2 . Or we can directly give in Sage the cell n=var("n"); sum((5/3**n),n,0,oo) The third infinite series is a series of linear combinations which we can express as a sum of infinite series with factoring out the constants. This is a valid modification, since the obtained series are absolutely convergent. In particular, S3 = 3 4 ∞∑ n=1 ( 1 42n−2 ) + 2 16 ∞∑ n=1 ( 1 42n−2 ) m:=n−1 = ( 3 4 + 2 16 ) ∞∑ m=0 1 42m = 14 16 ∞∑ m=0 ( 1 16 )m = 14 16 · 1 1 − 1 16 = 14 15 . CHAPTER 5. ESTABLISHING THE ZOO 437 In Sage after introducing n as a symbolic variable we can compute this sum as above: sum((3/(4**(2*n-1))+2/(4**(2*n))),n,1,oo) Finally, for S4, notice that from the relation of the partial sums, i.e., sn = 1 3 + 2 32 + 3 33 + · · · + n 3n for n = 1, 2, . . ., we can claim that sn 3 = 1 32 + 2 33 + · · · + n − 1 3n + n 3n+1 . Therefore, for all n ∈ Z+ we obtain the relation sn − sn 3 = 1 3 + 1 32 + 1 33 + · · · + 1 3n − n 3n+1 . This, in combination with the relation lim n→∞ n 3n+1 = 0, gives S4 = 3 2 lim n→∞ ( sn − sn 3 ) = 3 2 lim n→∞ n∑ k=1 1 3k = 3 2 ∞∑ k=1 ( 1 3 )k (∗) = 3 2 ( 1 1 − 1 3 − 1 ) = 3 4 . 
Notice that here our geometric series ∑∞ k=1 (1 3 )k starts from k = 1 and thus the given replacement in (∗). A verification in Sage occurs as usual, i.e., var("n"); sum((n/3**n),n,1,oo) □ 5.D.9. Verify the inequality ∞∑ n=1 1 n2 < ∞∑ n=0 1 2n . ⃝ Further exercises concerning infinite series are presented in Section E. Let us now describe a beautiful application that highlights the use of infinite series (see 5.E.130 for another fascinated example). 5.D.10. Koch snowflake (1904). Create a “snowflake” by the following procedure: At the beginning, consider an equilateral triangle with sides of length 1. With each of its three sides, do the following: Cut it into three equally long parts, then build another equilateral triangle above the middle part (this is pointing out from the original triangle), and remove the middle part. This transforms the original equilateral triangle into a six-pointed star. Repeating this step ad infinitum, one arrives to the desired snowflake. Prove that the created figure has infinite perimeter. Then determine its area. Solution. The perimeter of the original triangle is equal to 3. In each step, the perimeter increases by one third since three parts of every line segment are replaced with four equally long ones. Thus the snowflake’s perimeter can be expressed as the limit lim n→∞ dn , dn := 3 ( 4 3 )n , and we see that lim n→∞ dn = +∞. This can be quickly verified in Sage: CHAPTER 5. ESTABLISHING THE ZOO 438 var("n");lim(3*((4/3)**n),n=oo) Now, the figure’s area is apparently increasing during the construction. For its computation it is sufficient to catch the rise between two consecutive steps. The number of the figure’s sides is four times higher at every step (the line segments are divided into thirds and one of them is doubled). Moreover, the new sides are three times shorter. Therefore, the figure’s area grows exactly by the equilateral triangles glued to each side (so, there is the same number of them as of the sides). In the first iteration (when creating the six-pointed star from the original triangle), the area grows by the three equilateral triangles with sides of length 1/3 (one third of the original sides’ length). Let us denote the area of the original equilateral triangle by S0. If we realize that shortening an equilateral triangle’s sides three times makes its area decrease nine times, we get S0+3· S0 9 for the area of the six-pointed star. Similarly, in the next step we obtain the area of the figure as S0 + 3 · S0 9 + 4 · 3 · S0 92 . It is easy now to deduce that the area E of the resulting snowflake equals the limit E = lim n→∞ ( S0 + 3 S0 9 + 4 · 3 S0 92 + · · · + 4n · 3 S0 9n+1 ) = S0 lim n→∞ ( 1 + 1 3 + 1 3 · 4 9 + · · · + 1 3 · ( 4 9 )n ) = S0 ( 1 + 1 3 lim n→∞ ( 1 + 4 9 + · · · + ( 4 9 )n) ) = S0 ( 1 + 1 3 lim n→∞ n∑ k=0 ( 4 9 )k ) = S0 ( 1 + 1 3 ∞∑ k=0 ( 4 9 )k ) = S0 ( 1 + 1 3 · 1 1 − 4 9 ) , that is, E = 8 5 S0. Thus, the snowflake’s area is 8/5 of the area of the original triangle, i.e., 8 5 S0 = 8 5 · √ 3 4 = 2 √ 3 5 . We mention that this snowflake is an example of an infinitely long curve which encloses a finite area. □ So far, we have explored the concept of assigning a value to a sum of infinitely many numbers. Now, we shift our focus to sums involving infinitely many functions. Specifically, we can consider such series for each argument x of the functions fn(x), particularly focusing on sums of polynomials. 
These are known as “power series”, which always converge on an interval (or a disc in the complex plane, for complex polynomials). Moreover, the radius of convergence of these series can be readily determined from their coefficients, which is either a non-negative real number or ∞. CHAPTER 5. ESTABLISHING THE ZOO 439 Later, in Chapters 6 and 7, we will revisit other types of function series and explore additional concepts of conver- gence. 5.D.11. Consider the series S = ∞∑ n=1 (x + 4)n n · 5n . Determine all x ∈ R for which S is convergent. Solution. Having the ratio test in mind, let us compute the limit lim n→∞ fn+1(x) fn(x) , that is, lim n→∞ (x+4)n+1 (n+1)·5n+1 (x+4)n n·5n = lim n→∞ n |x + 4| 5(n + 1) = |x + 4| lim n→∞ n 5(n + 1) = |x + 4| 5 . In case you like confirm this conclusion by Sage, give the cell var(”n”); lim(n/(5 ∗ (n + 1)), n = oo). Now, according to the ratio test the series will converge for |x+4| 5 < 1, that is, −9 < x < 1 and diverges for |x+4| 5 > 1, that is, x < −9 or x > 1. For x = 1 we see that S is the harmonic series, i.e., S = ∑∞ n=1 1 n which diverges by 5.D.1. For x = −9 we get S = ∞∑ n=1 (−5)n n · 5n = ∞∑ n=1 (−1)n n = −1 + 1 2 − 1 3 + 1 4 − · · · which is an alternating series, known as the alternating harmonic series. For this series we compute lim n→∞ (−1)n n = 0, hence S is convergent (see Theorem 5.4.7). As an alternative, in Sage the command sum((−1)n /n, n, 1, oo) provides an explicit evaluation of the alternating harmonic series; it equals to − ln(2). We conclude that the initial series is convergent for all x ∈ [−9, 1). □ 5.D.12. Determine the radius r of convergence of the following power series: A(x) = ∞∑ n=1 2n n xn , B(x) = ∞∑ n=1 1 (1 + i)n xn , C(x) = ∞∑ n=1 (−1) n+1 n · 8n xn , D(x) = ∞∑ n=1 (−4n) n xn , E(x) = ∞∑ n=1 ( 1 + 1 n )n2 xn . Solution. According to the discussion in 5.4.10, for A(x) we have r = 1 lim sup n→∞ an+1 an = 1 2 . Thus, the power series converges exactly for the real numbers x ∈ (−1 2 , 1 2 ). Moreover, the series diverges for x = 1 2 (since it becomes harmonic), and it converges for x = −1 2 (since then it becomes an alternating harmonic series). However, to determine the convergence for any x lying in the complex CHAPTER 5. ESTABLISHING THE ZOO 440 plane on the circle of radius 1 2 , it is a much harder question which goes beyond our lectures. For B(x) we compute r = lim sup n→∞ n √ 1 (1 + i)n = lim sup n→∞ 1 1 + i = √ 2 2 . For C(x) we see that lim n→∞ n √ | an | = lim n→∞ 1 n √ n · 8 = 1 8 . Thus the radius is r = 8. For D(x) we get lim n→∞ n √ | an | = lim n→∞ 4n = +∞, so the radius is r = 0, while for E(x) we compute lim n→∞ n √ | an | = lim n→∞ ( 1 + 1 n )n = e, so r = 1/e in this case. □ 5.D.13. Calculate the radius r of convergence of the power series ∞∑ n=1 ein 3 √ n3 + n · 3n πn · 3 √ n4 + 2n3 + 1 · (x − 2)n . Solution. Here we will apply the following trick: The radius of convergence of any power series does not change if we move its center or alter its coefficients while keeping their absolute values. Therefore, let us determine the radius of convergence of the infinite series ∞∑ n=1 3 √ n3 + n · 3n πn 3 √ n4 + 2n3 + 1 · xn . Since lim n→∞ n √ na = ( lim n→∞ n √ n )a = 1 for all a > 0, we can further move to the series ∞∑ n=1 3n πn xn . with the same radius of convergence r = π/3. □ 5.D.14. Determine the power series centered at the origin which, determines the function 1 x2 − x − 12 , x ∈ (−3, 3) . Solution. 
A quick method relies on the known procedure of partial fraction decomposition, which gives 1 x2 − x − 12 = 1 (x − 4)(x + 3) = 1 7 ( 1 x − 4 − 1 x + 3 ) . This expression can be obtained also in Sage, via the partial_fraction function, as follows: f(x)=1/(x^2-x-12) show(f.partial_fraction()) CHAPTER 5. ESTABLISHING THE ZOO 441 Using now appropriately the known formula of geometric series, we see that 1 x − 4 = − 1 4 ( 1 + x 4 + x2 42 + · · · + xn 4n + · · · ) , 1 x + 3 = 1 3 ( 1 − x 3 + x2 32 + · · · + (−x)n 3n + · · · ) . Thus, altogether we obtain 1 x2 − x − 12 = − 1 28 ∞∑ n=0 xn 4n − 1 21 ∞∑ n=0 (−x)n 3n = ∞∑ n=0 ( (−1)n+1 21 · 3n − 1 28 · 4n ) xn . □ 5.D.15. Determine the radius r of convergence of the power series ∞∑ n=0 22n ·n! (2n)! xn . ⃝ 5.D.16. Calculate the radius of convergence for ∑∞ n=1 2 √ n xn . ⃝ 5.D.17. Find the domain of convergence of the power series ∞∑ n=1 √ n+1 3 √ n xn . ⃝ 5.D.18. Determine for which x ∈ R the power series ∞∑ n=1 (−3)n √ n4+2n3+111 (x − 2)n converges. ⃝ 5.D.19. Determine for which x ∈ R the series ∞∑ i=1 1 2n · n · ln(n) x3n converges. ⃝ 5.D.20. Determine all x ∈ R for which the power series ∞∑ i=1 x2n n2 is convergent. ⃝ The polynomials are functions which we can easily evaluate and thus, the partial sums of power series offer a valuable method for approximating the values of functions. However, as we will see below, to be successful we need some good estimate on the convergence speed. Especially for convergent alternating series S = ∞∑ n=0 (−1)n an = a0 − a1 + a2 − a3 + · · · with a0 ≥ a1 ≥ a2 ≥ · · · ≥ an ≥ · · · ≥ 0 and∑∞ n=0(−1)n an = L, where L is a finite real number, we can prove a very useful estimation error given by |L − Sk| ≤ ak+1 . Here Sk = ∑k n=0(−1)n an denotes the partial sum corresponding to S. Hence, summing only terms up to the kth term ak and omitting all the remaining terms, the approximation CHAPTER 5. ESTABLISHING THE ZOO 442 error will be at most large as ak+1. Let us illustrate this situation via examples. 5.D.21. Approximate cos(1) with an error strictly less that 10−8 . Then use Sage to find the actual approximation error. Solution. From Section 5.4.12 we know the following expression of cos(x) in terms of power series: cos(x) = 1 − x2 2! + x4 4! − x6 6! + · · · = ∞∑ n=0 (−1)n x2n (2n)! , for all x ∈ R. Thus cos(1) = 1 − 1 2 + 1 4! − 1 6! + 1 8! − · · · = ∞∑ n=0 (−1)n (2n)! . According to the previous remark on alternating series, stopping at the term 1 (2n)! we will obtain an error which equals at most 1 (2(n+1))! = 1 (2n+2)! . Thus, to find the number of terms which we need to approximate cos(1) with an error strictly less that 10−8 , it suffices to solve the inequality 1 (2n + 2)! < 10−8 . (∗) There are many ways to solve this inequality, and perhaps, the simplest one is by testing several values of n. As a solution, one gets that n = 5 is the smallest positive integer satisfying (∗). For instance, in Sage a solution goes as follows: n=1; while (1/factorial(2*n+2) >= 10^-8 ): n = n+1 print(n) and this answers 5. Here, with the second line we program Sage to try all values until (∗) is true, starting with n = 1. Hence, the desired solution is cos(1) ≈ 1 − 1 2 + 1 4! − 1 6! + 1 8! − 1 10! , with an error less than 10−8 . We may find the error of the approximation in Sage as follows: a=1-1/2+1/factorial(4)-1/factorial(6) \ +1/factorial(8)-1/factorial(10) print(N(cos(1))-N(a)) bool((N(cos(1))-N(a))<(1/factorial(12))) Notice here we used a slash to break the first line. 
In this cell the print command gives the actual approximation error 2.07625261428035e − 9, which approximately translates to 2.077·10−9 . The final command returns True, verifying that the error is not larger than the next term 1 (2n)! evaluated at n = 5 + 1 = 6, that is, 1/12! ≈ 2.088·10−9 (recall that above we found n = 5). To treat such small numbers reliably, one can again use Sage, e.g., by the cell bool(2.077*(10**(-9))<2.088*(10**(-9))) □ CHAPTER 5. ESTABLISHING THE ZOO 443 5.D.22. Approximate sin(1) with an error strictly less than 10−8 , and then verify your answer in Sage. Solution. In Section 5.4.12 we learned the expression of sin(x) in terms of power series: sin(x) = x − 1 3! x3 + 1 5! x5 − 1 7! x7 + · · · = ∞∑ n=0 (−1)n (2n + 1)! x2n+1 , x ∈ R . Thus sin(1) = 1 − 1 3! + 1 5! − 1 7! + · · · = ∞∑ n=0 (−1)n (2n + 1)! and, as before, we are treating an alternating series. Stopping at the term (−1)n (2n+1)! we will obtain an error of at most 1 (2n+3)! . Hence, to find the number of terms needed to approximate sin(1) with an error strictly less than 10−8 , it suffices to solve the inequality 1 (2n + 3)! < 10−8 , and as a solution we get n = 5 (e.g. by applying the same method as in 5.D.21). This means that sin(1) ≈ 1 − 1 3! + 1 5! − 1 7! + 1 9! − 1 11! with an error less than 10−8 . In Sage you may type b=1-1/factorial(3)+1/factorial(5)-1/factorial(7) \ +1/factorial(9) -1/factorial(11) print(abs(N(sin(1))-N(b))) print(bool(abs(N(sin(1))-N(b))<1/10**8)) which returns the actual approximation error, given by 1.59828483781155e − 10 ≈ 1.599 · 10−10 , and verifies the statement. □ 444 CHAPTER 5. ESTABLISHING THE ZOO E. Additional exercises for the whole chapter As usual, we will now delve into additional material related to the concepts discussed thus far in Chapter 5. Many of the tasks described below rely on the theory of derivatives, making prior experience with derivatives particularly useful. In some instances, we may need to employ higher-order derivatives, which are formally introduced at the beginning of Chapter 6, see 6.1.1. However, these cases primarily pertain to higher-order derivatives of polynomials, a topic with which the reader is already familiar from our discussion in Section 5.1.6. A) Material on polynomial interpolation 5.E.1. Consider the function f(x) = sin(x). Given three points (nodes) x0, x1, x2, write a routine in Sage which will return the Hermite interpolation polynomial P(x) corresponding to these nodes and the values yi = sin(xi) and y′ i = sin′ (xi) = cos(xi), for i = 0, 1, 2, that is, adapted to the following table (for simplicity you may fix some triple (x0, x1, x2), e.g., x0 = −2π, x1 = 0 and x2 = 2π) xi x0 x1 x2 yi y0 = sin(x0) y1 = sin(x1) y2 = sin(x2) y′ i y′ 0 = cos(x0) y′ 1 = cos(x1) y′ 2 = cos(x2) . Then, for a variety of different triples (x0, x1, x2) centered at 0 (that is, with x1 = 0 and with x0 = −x2), present the graphs of the corresponding polynomial P(x) and that of f(x), for x ∈ [x0, x2]. Solution. We present the code in Sage and attach some comments within. Since the Hermite interpolation method includes derivatives, below we will use commands such as derivative(f, x), which gives the derivative of a function f, and derivative(f, x, n), returning the nth derivative of f. So, let us fix the nodes x0 = −2π, x1 = 0, x2 = 2π. Notice the elementary Lagrange polynomials ℓ0, ℓ1, ℓ2 corresponding to these three nodes are all of degree 2.
We have:

x0=-2*pi; x1=0; x2=2*pi;
l(x)=(x-x0)*(x-x1)*(x-x2)
d1l(x)=derivative(l, x)        # first derivative of the function l
d2l(x)=derivative(l(x), x, 2)  # second derivative of the function l
l0(x)=((x-x1)*(x-x2))/((x0-x1)*(x0-x2))  # elementary Lagrange polynomial l_0
l1(x)=((x-x0)*(x-x2))/((x1-x0)*(x1-x2))  # elementary Lagrange polynomial l_1
l2(x)=((x-x0)*(x-x1))/((x2-x0)*(x2-x1))  # elementary Lagrange polynomial l_2
h01(x)=(1-(d2l(x0)/d1l(x0))*(x-x0))*(l0(x))^2  # 1st type Hermite polyn. h^{(1)}_0
h11(x)=(1-(d2l(x1)/d1l(x1))*(x-x1))*(l1(x))^2  # 1st type Hermite polyn. h^{(1)}_1
h21(x)=(1-(d2l(x2)/d1l(x2))*(x-x2))*(l2(x))^2  # 1st type Hermite polyn. h^{(1)}_2
h02(x)=(x-x0)*(l0(x))^2  # 2nd type Hermite polyn. h^{(2)}_0
h12(x)=(x-x1)*(l1(x))^2  # 2nd type Hermite polyn. h^{(2)}_1
h22(x)=(x-x2)*(l2(x))^2  # 2nd type Hermite polyn. h^{(2)}_2
y0=sin(x0); y1=sin(x1); y2=sin(x2)  # introduce the values y_0, y_1, y_2
dy0=derivative(sin(x), x)(x=x0)  # introduce the value y'_0
dy1=derivative(sin(x), x)(x=x1)  # introduce the value y'_1
dy2=derivative(sin(x), x)(x=x2)  # introduce the value y'_2
P(x)=y0*h01(x)+y1*h11(x)+y2*h21(x)+dy0*h02(x)+dy1*h12(x)+dy2*h22(x); show(P(x))
a=plot(sin(x), x, x0, x2, color="red", thickness=8, legend_label="$\\sin(x)$")
b=plot(P(x), x, x0, x2, color="blue", thickness=2, linestyle="-.", legend_label="$P(x)$"); show(a+b)

This returns the explicit form of the Hermite interpolation polynomial,
$$P(x)=\frac{(2\pi+x)^2(2\pi-x)^2x}{16\pi^4}-\frac{(2\pi+x)^2(2\pi-x)x^2}{64\pi^4}+\frac{(2\pi+x)(2\pi-x)^2x^2}{64\pi^4},$$
together with the corresponding figure. The code is written so that one only needs to change the first line, that is, the initial values of $x_0,x_1,x_2$; this yields the Hermite interpolation polynomial of the new triple $(x_0,x_1,x_2)$. As practice, present the graphs for the triples $(-\pi,0,\pi)$, $(-3\pi/2,0,3\pi/2)$ and $(-8\pi,0,8\pi)$, centered at 0, but also for other kinds of triples, e.g., $(-5\pi/2,\pi/2,3\pi/2)$, $(-e,\sqrt{e},\ln(31^2))$, etc. □

5.E.2. Determine the natural cubic spline which interpolates the absolute value function $f(x)=|x|$ for $x\in[-1,1]$, selecting the points $x_0=-1$, $x_1=0$, $x_2=1$. Then use the "spline" method described in 5.A.16 to verify your result.

Solution. By assumption one has $y_0=f(x_0)=1$, $y_1=f(x_1)=0$ and $y_2=f(x_2)=1$. For $x\in[-1,0]$ suppose that $S(x)=S_1(x)=a_3x^3+a_2x^2+a_1x+a_0$, and for $x\in[0,1]$ let $S(x)=S_2(x)=b_3x^3+b_2x^2+b_1x+b_0$, with $a_i,b_i\in\mathbb{R}$ for all $i=0,\dots,3$. The conditions that $S(x)$ must satisfy are eight in total. First, we have $S_1(x_0)=y_0$, $S_1(x_1)=y_1$, $S_2(x_1)=y_1$ and $S_2(x_2)=y_2$, which are equivalent to $-a_3+a_2-a_1+a_0=1$, $a_0=0=b_0$ and $b_3+b_2+b_1+b_0=1$, respectively. Moreover, $S'_1(x_1)=S'_2(x_1)$ and $S''_1(x_1)=S''_2(x_1)$, that is, $a_1=b_1$ and $a_2=b_2$, respectively. Finally, since $S$ should be natural, we need $S''_1(x_0)=0=S''_2(x_2)$, so we also have $-6a_3+2a_2=0$ and $6b_3+2b_2=0$. The system of these eight linear equations has non-zero determinant and hence a unique solution. Using Sage we get $a_0=b_0=0$, $a_1=b_1=0$, $a_2=b_2=3/2$, $a_3=1/2$ and $b_3=-1/2$, that is,
$$S(x)=\begin{cases}S_1(x)=\tfrac12x^3+\tfrac32x^2, & x\in[-1,0],\\[2pt] S_2(x)=-\tfrac12x^3+\tfrac32x^2, & x\in[0,1].\end{cases}$$
Below we include the graphical verification obtained by a combination of the commands spline and plot; we leave it to the reader to also plot $S_1,S_2$ in Sage in the usual way and compare.

f(x)=abs(x); pts=[(-1, f(-1)), (0, f(0)), (1, f(1))]; S=spline(pts)
A=plot(S, -1, 1, color="steelblue", thickness=2, legend_label="$S(x)$")
B=points(pts, size=50, color="darkgrey")
fpl=plot(f(x), x, -1, 1, color="black", legend_label="$f(x)$")
fpl.set_legend_options(loc=(0.6,0.8))
tx0=text("$(x_0, y_0)$", (-0.89, 1), color="darkslategrey", fontsize="12")
tx1=text("$(x_1, y_1)$", (0.1, -0.03), color="darkslategrey", fontsize="12")
tx2=text("$(x_2, y_2)$", (0.89, 1), color="darkslategrey", fontsize="12")
show(A+B+fpl+tx0+tx1+tx2)

□
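In the solution above we only stated that Sage solves the system of eight linear equations; a minimal sketch of how this step might be carried out (the variable names are our choice) is the following:

var("a0 a1 a2 a3 b0 b1 b2 b3")
eqs = [-a3 + a2 - a1 + a0 == 1,   # S1(-1) = 1
       a0 == 0,                   # S1(0) = 0
       b0 == 0,                   # S2(0) = 0
       b3 + b2 + b1 + b0 == 1,    # S2(1) = 1
       a1 == b1,                  # S1'(0) = S2'(0)
       a2 == b2,                  # S1''(0) = S2''(0)
       -6*a3 + 2*a2 == 0,         # S1''(-1) = 0 (natural spline)
       6*b3 + 2*b2 == 0]          # S2''(1) = 0 (natural spline)
show(solve(eqs, a0, a1, a2, a3, b0, b1, b2, b3))

The output reproduces the coefficients found above, i.e., $a_3=1/2$, $a_2=b_2=3/2$ and $b_3=-1/2$.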
5.E.3. Without calculation, determine the Hermite interpolation polynomial if the following data is given: $x_0=0$, $x_1=2$, $x_2=1$, $y_0=0$, $y_1=4$, $y_2=1$, $y'_0=0$, $y'_1=4$, $y'_2=2$. ⃝

5.E.4. Construct the natural cubic interpolation spline for the points $x_0=-3$, $x_1=0$, $x_2=3$ and the values $y_0=-3$, $y_1=0$, $y_2=3$. ⃝

5.E.5. Construct the complete cubic interpolation spline for the points $x_0=-3$, $x_1=-2$, $x_2=-1$ with values $y_0=0$, $y_1=1$, $y_2=2$ and derivatives at the marginal points given by $y'_0=1$, $y'_2=1$. ⃝

5.E.6. Using Lagrange interpolation, approximate $\cos^2(1)$, based on the values of $\cos^2(x)$ at the points $\frac\pi4$, $\frac\pi3$ and $\frac\pi2$. ⃝

5.E.7. Let $P(x)$ be a polynomial with real non-negative coefficients. Assume that $P(\frac1x)P(x)\ge1$ for $x=1$. Show that the same inequality holds for every positive $x$.

Solution. Let $P(x)=a_nx^n+a_{n-1}x^{n-1}+\cdots+a_1x+a_0$. From the statement one has $P(1)^2\ge1$. Thus, for $x>0$,
$$P(x)P\Big(\frac1x\Big)=(a_nx^n+\cdots+a_1x+a_0)\cdot(a_nx^{-n}+\cdots+a_1x^{-1}+a_0)=\sum_{i=0}^{n}a_i^2+\sum_{i<j}a_ia_j\big(x^{i-j}+x^{j-i}\big)\ge\sum_{i=0}^{n}a_i^2+2\sum_{i<j}a_ia_j=\Big(\sum_{i=0}^{n}a_i\Big)^2=P(1)^2\ge1,$$
where we used that $t+t^{-1}\ge2$ for every $t>0$ and that all $a_i\ge0$. □

$\ldots c_n\ge c_1=0$. Thus, $\sup A=\frac32$ and $\inf A=0$, respectively. (b) An example is given by the set $\mathbb{Z}\setminus\mathbb{N}$. (c) The set of natural numbers is such an example. □

5.E.14. Recall that when the supremum of a subset $A\subset\mathbb{R}$ belongs to $A$, it is called the "maximum of $A$"; the "minimum of $A$" is defined similarly. Find, if they exist, the minimum/maximum of the following sets: $A=(-1,2)$, $B=(-1,2]$, $C=\{(-1)^n : n\in\mathbb{Z}^+\}$, and $\mathbb{R}^+=\{x\in\mathbb{R} : x>0\}$.

Solution. The set $A=(-1,2)$ has 2 as its supremum, but $2\notin A$, so $A$ has no maximum. Nor does it have a minimum, since its infimum $-1\notin A$. The set $B=(-1,2]$ has 2 as its supremum, and $2\in B$; hence 2 is the maximum of $B$. But $B$ does not have a minimum. For $C$ both the minimum and the maximum exist: they are $-1$ and $1$, respectively. Finally, $\mathbb{R}^+$ is not bounded above, but it does have an infimum, which equals 0. However, $0\notin\mathbb{R}^+$, hence in this case there is neither a minimum nor a maximum. □

5.E.15. Find a subset $X\subset\mathbb{R}$ such that $\sup X\le\inf X$. ⃝

5.E.16. Find sets $A,B,C\subseteq\mathbb{R}$ such that $A\cap B=\emptyset$, $A\cap C=\emptyset$, $B\cap C=\emptyset$, and $\sup A=\inf B=\inf C=\sup C$. ⃝

5.E.17. Based on the definition of a convergent sequence, show that $a_n\to1$ as $n\to\infty$, where $a_n=\frac{2^n-1}{2^n}$.

Solution. We have
$$|a_n-1|=\Big|\frac{2^n-1}{2^n}-1\Big|=\frac{1}{2^n}.$$
On the other hand, by induction we get $n<2^n$ for any natural number $n$, hence we also have $\frac{1}{2^n}<\frac1n$ for all $n\in\mathbb{Z}^+$. Let $\varepsilon>0$. Using the Archimedean property we can find some $N$ with $\frac1N<\varepsilon$. Then, for $n>N$ we get $\frac1n<\frac1N<\varepsilon$, and hence
$$|a_n-1|=\frac{1}{2^n}<\frac1n<\frac1N<\varepsilon.$$
This shows that $a_n\to1$ as $n\to\infty$. □

5.E.18. Use the binomial theorem to prove that $\lim_{n\to+\infty}\sqrt[n]{\ln(n)}=1$.

Solution. For $x>0$ the function $\ln(x)$ is strictly increasing, thus for $n>3$ it is true that $\ln(n)>\ln(e)=1$, and this implies that $\sqrt[n]{\ln(n)}>1$. Consequently, we may write $\sqrt[n]{\ln(n)}=1+a_n$ for some $a_n>0$, for all such $n$. Taking the $n$-th power of this relation we arrive at $\ln(n)=(1+a_n)^n$, and here is where the binomial theorem applies:
$$\ln(n)=(1+a_n)^n=1+na_n+\frac{n(n-1)}{2!}a_n^2+\cdots+a_n^n.$$
Thus $\ln(n)>1+na_n$, from which we deduce that $a_n<\frac{\ln(n)}{n}-\frac1n$. However, $\frac{\ln(n)}{n}\to0$ as $n\to+\infty$, thus $\lim_{n\to+\infty}a_n=0$. Hence, by the relation $\sqrt[n]{\ln(n)}=1+a_n$, the result follows. □
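One may also like to glance at this limit in Sage, both symbolically and numerically; a small sketch (assuming Sage's limit command handles this expression, which it should; otherwise the numerical values alone are convincing):

var("n")
print(limit(log(n)^(1/n), n=oo))   # should return 1
for k in [10, 10^3, 10^6]:
    print(k, N(log(k)^(1/k)))      # values slowly approaching 1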
5.E.19. Compute the limit $\lim_{n\to\infty}\big(\sqrt2\cdot\sqrt[4]{2}\cdot\sqrt[8]{2}\cdots\sqrt[2^n]{2}\big)$. ⃝

5.E.20. Evaluate the limit $\lim_{n\to\infty}\Big(\frac{1}{1\cdot2}+\frac{1}{2\cdot3}+\frac{1}{3\cdot4}+\cdots+\frac{1}{(n-1)\cdot n}\Big)$. ⃝

5.E.21. Compute the limit $\lim_{n\to\infty}\Big(\frac{1}{n^2}+\frac{2}{n^2}+\cdots+\frac{n-2}{n^2}+\frac{n-1}{n^2}\Big)$. ⃝

5.E.22. Evaluate the limit
$$\lim_{n\to\infty}\Big(\frac{1}{\sqrt{n^2+1}}+\frac{1}{\sqrt{n^2+2}}+\cdots+\frac{1}{\sqrt{n^2+n}}\Big).$$
Also, use Sage to plot several terms of the given sequence to visually illustrate your answer.

Solution. To determine this limit we can invoke the squeeze theorem. The bounds
$$\frac{1}{\sqrt{n^2+1}}+\cdots+\frac{1}{\sqrt{n^2+n}}\ \ge\ \frac{1}{\sqrt{n^2+n}}+\cdots+\frac{1}{\sqrt{n^2+n}}=\frac{n}{\sqrt{n^2+n}}$$
and
$$\frac{1}{\sqrt{n^2+1}}+\cdots+\frac{1}{\sqrt{n^2+n}}\ \le\ \frac{1}{\sqrt{n^2+1}}+\cdots+\frac{1}{\sqrt{n^2+1}}=\frac{n}{\sqrt{n^2+1}},$$
valid for all naturals $n$, imply that
$$1=\lim_{n\to\infty}\frac{n}{\sqrt{n^2+n}}\ \le\ \lim_{n\to\infty}\Big(\frac{1}{\sqrt{n^2+1}}+\cdots+\frac{1}{\sqrt{n^2+n}}\Big)\ \le\ \lim_{n\to\infty}\frac{n}{\sqrt{n^2+1}}=1.$$
Thus $\lim_{n\to\infty}\big(\frac{1}{\sqrt{n^2+1}}+\cdots+\frac{1}{\sqrt{n^2+n}}\big)=1$. This result can be visually confirmed in Sage by the same method used in 5.B.4; one can execute the cell

var("k")
p=Graphics()
for n in srange(1, 50+1):
    p=p+points((n, sum(1/sqrt(n^2+k), k, 1, n)), color="black")
show(p)

This code produces a figure in which one can indeed observe the convergence of the given sequence to 1. □

5.E.23. Evaluate the limits $\lim_{n\to\infty}\big(\frac{n}{n+1}\big)^n$, $\lim_{n\to\infty}\big(1+\frac{1}{n^2}\big)^n$ and $\lim_{n\to\infty}\big(1-\frac1n\big)^{n^2}$. ⃝

5.E.24. Let $(a_n)$ be a non-negative convergent sequence of real numbers.¹² Show that $\lim_{n\to\infty}a_n=a\ge0$.

Solution. On the contrary, suppose that the claim is wrong, i.e., the limit of $(a_n)$ is a negative number $a<0$. Since $a_n\to a$ as $n\to\infty$ and $\varepsilon=-a$ is a positive real number, there is a natural $n_0$ such that $|a_n-a|<\varepsilon$ for all $n\ge n_0$, that is, $a-\varepsilon<a_n<a+\varepsilon$ for all $n\ge n_0$. However, $a=-\varepsilon$, which gives $-2\varepsilon<a_n<0$, a contradiction, as $(a_n)$ consists only of non-negative terms. Thus $a\ge0$. □

¹² By the term "non-negative" we mean $a_n\ge0$ for all naturals $n$.

5.E.25. Let $(a_n)$ be a sequence of real numbers and let $(y_n)$ be a sequence of positive real numbers satisfying $y_n\to0$ as $n\to\infty$. Suppose that for some $N\in\mathbb{N}$, some constant $\mu>0$ and some (finite) real number $a$, we have $|a_n-a|<\mu\,y_n$ for all $n\ge N$. Prove that $(a_n)$ is convergent; in particular, $\lim_{n\to\infty}a_n=a$.

Solution. The sequence $(y_n)$ is by assumption convergent to 0, thus if $\hat\varepsilon>0$ is given, there exists some $n_0$ (depending in general on $\hat\varepsilon$) such that $|y_n-0|=|y_n|=y_n<\hat\varepsilon$ for all $n\ge n_0$. By assumption $\mu>0$, hence we may set $\hat\varepsilon=\varepsilon/\mu$ for some given $\varepsilon>0$. Thus we have $y_n<\frac{\varepsilon}{\mu}$ for all $n\ge n_0$. Then, for those $n$ satisfying both $n\ge n_0$ and $n\ge N$, we get $|a_n-a|\le\mu\,y_n<\mu\cdot\frac{\varepsilon}{\mu}=\varepsilon$. As $\varepsilon>0$ is arbitrary, the claim follows. □
5.E.26. (1) Let $(a_n)$ be a sequence of positive real numbers for which the limit $\lim_{n\to\infty}\frac{a_{n+1}}{a_n}$ exists and is a finite real number $\ell$. Show that if $\ell<1$, then $\lim_{n\to\infty}a_n=0$. (2) More generally, suppose that $a_n\ne0$ for all $n\in\mathbb{N}$ and that the limit $\ell=\lim_{n\to\infty}\big|\frac{a_{n+1}}{a_n}\big|$ exists. Show that if $\ell<1$, then $\lim_{n\to\infty}a_n=0$.

Solution. We will prove the first statement and leave the second one to the reader, since the method follows essentially the same ideas. We have $a_n>0$ for all $n\in\mathbb{N}$, hence the sequence $(y_n)$ with general term $y_n=\frac{a_{n+1}}{a_n}$ also satisfies $y_n>0$ for all $n\in\mathbb{N}$. Also, $y_n\to\ell$ as $n\to\infty$, and hence the requirements of the statement in 5.E.24 are satisfied. Thus $\lim_{n\to\infty}y_n=\ell\ge0$. Let $b$ be a real number with $\ell<b<1$ and set $\varepsilon=b-\ell>0$. Then there exists some $N\in\mathbb{N}$ with $\big|\frac{a_{n+1}}{a_n}-\ell\big|<\varepsilon$, provided that $n\ge N$. This implies that $\frac{a_{n+1}}{a_n}<\varepsilon+\ell=b-\ell+\ell=b$, which in turn gives
$$0<a_{n+1}<b\,a_n<b^2a_{n-1}<\cdots<b^{\,n-N+1}a_N.$$
Thus, if $\mu:=a_Nb^{-N}>0$, we obtain $0<a_{n+1}<\mu\,b^{n+1}$ for all $n\ge N$. In addition, since $0<b<1$ we have $\lim_{n\to\infty}b^n=0$. Thus the result follows by 5.E.25. Alternatively, the inequality $a_{n+1}<b\,a_n$ for all $n\ge N$ implies that $a_{n+1}<a_n$ for all $n\ge N$ (since $0<b<1$). Therefore, the sequence $(a_n)$ is eventually monotone (decreasing from the $N$-th term on). Since $(a_n)$ is also bounded, it is convergent. Let $L=\lim_{n\to\infty}a_n$. Then, using the relation $a_{n+1}=y_na_n$, we see that
$$L=\lim_{n\to\infty}a_{n+1}=\lim_{n\to\infty}(y_na_n)=\lim_{n\to\infty}y_n\cdot\lim_{n\to\infty}a_n=\ell\,L\iff L(1-\ell)=0.$$
But $1-\ell>0$, and thus $L=0$, as required. □

5.E.27. (a) Apply the statement in 5.E.25 to prove that $\lim_{n\to\infty}\frac{1}{1+na}=0$, where $a>0$. (b) For a real number $x$, apply the statement in 5.E.26 to prove that $\lim_{n\to\infty}\frac{x^n}{n!}=0$. ⃝

5.E.28. Recall that if $(a_n)$ is a Cauchy sequence in $\mathbb{R}$, then the difference $a_{n+1}-a_n$ tends to zero, i.e., $(a_{n+1}-a_n)\to0$ as $n\to\infty$. Present a counterexample verifying that the converse of this statement is not in general true.

Solution. Consider the sequence $(a_n=\sqrt{n})$ with $n\in\mathbb{N}$. Then we see that
$$a_{n+1}-a_n=\sqrt{n+1}-\sqrt{n}=\frac{(\sqrt{n+1}-\sqrt{n})(\sqrt{n+1}+\sqrt{n})}{\sqrt{n+1}+\sqrt{n}}=\frac{1}{\sqrt{n+1}+\sqrt{n}},$$
and $\frac{1}{\sqrt{n+1}+\sqrt{n}}\to0$. However, $(a_n)$ is not a Cauchy sequence (try to verify this claim yourself). □

5.E.29. Based on the definition of a Cauchy sequence, show that $(a_n=1/n)_{n=1}^{\infty}$ is such a sequence.

Solution. We need to show that for every $\varepsilon>0$ there exists $N\in\mathbb{N}$ such that $|a_n-a_m|=\big|\frac1n-\frac1m\big|<\varepsilon$ for all $n,m\ge N$. Indeed, for such $n,m\ge N$, by the triangle inequality we have
$$\Big|\frac1n-\frac1m\Big|\le\Big|\frac1n\Big|+\Big|{-}\frac1m\Big|=\frac1n+\frac1m\le\frac1N+\frac1N=\frac2N.\quad(*)$$
Now, for $\varepsilon>0$ we can find a non-zero $N\in\mathbb{N}$ such that $\frac1N<\frac\varepsilon2$. Then by $(*)$ we get the result, i.e., $\big|\frac1n-\frac1m\big|\le\frac2N<\varepsilon$. □

5.E.30. A sequence $(x_n)_{n=1}^{\infty}$ of real numbers is called "contractive" if there exists some real $\alpha$ with $0<\alpha<1$ such that
$$|x_{n+1}-x_n|\le\alpha\,|x_n-x_{n-1}|\quad\text{for all }n\in\mathbb{N},\ n\ge2.\quad(\star)$$
Prove that any contractive sequence is Cauchy.

Solution. Suppose that $(x_n)$ is a sequence of real numbers satisfying $(\star)$ for some $\alpha\in(0,1)$. By $(\star)$ we get
$$|x_3-x_2|\le\alpha|x_2-x_1|,\qquad |x_4-x_3|\le\alpha|x_3-x_2|\le\alpha^2|x_2-x_1|,\qquad\ldots$$
and in general $|x_{n+1}-x_n|\le\alpha|x_n-x_{n-1}|\le\alpha^2|x_{n-1}-x_{n-2}|\le\cdots\le\alpha^{n-1}|x_2-x_1|$, where the final relation follows by induction on $n$. Now, from the geometric series task 5.B.16 recall that
$$1+\alpha+\alpha^2+\cdots+\alpha^n=\frac{1-\alpha^{n+1}}{1-\alpha}.$$
Combining this with our previous observation, for all $m,n\in\mathbb{N}$ with $m>n$ we have
$$|x_m-x_n|\le|x_m-x_{m-1}|+|x_{m-1}-x_{m-2}|+\cdots+|x_{n+1}-x_n|\le\big(\alpha^{m-2}+\alpha^{m-3}+\cdots+\alpha^{n-1}\big)|x_2-x_1|$$
$$=\alpha^{n-1}\big(1+\alpha+\cdots+\alpha^{m-n-1}\big)|x_2-x_1|=\alpha^{n-1}\Big(\frac{1-\alpha^{m-n}}{1-\alpha}\Big)|x_2-x_1|\le\alpha^{n-1}\Big(\frac{1}{1-\alpha}\Big)|x_2-x_1|.\quad(\sharp)$$
If $x_2=x_1$ it is easy to see that $(x_n)$ is a Cauchy sequence, hence we may assume that $x_2\ne x_1$. Recall by 5.B.3 that $\alpha^{n-1}\to0$ as $n\to+\infty$, provided that $0<\alpha<1$. Therefore, given some $\varepsilon>0$ we can find some positive integer $N$ with
$$|\alpha^{n-1}-0|=\alpha^{n-1}<\frac{\varepsilon}{\big(\frac{1}{1-\alpha}\big)|x_2-x_1|}\quad\text{for all }n\ge N.$$
Then, for the same $N$ and for $m>n\ge N$, by $(\sharp)$ we see that
$$|x_m-x_n|\le\alpha^{n-1}\Big(\frac{1}{1-\alpha}\Big)|x_2-x_1|<\frac{\varepsilon}{\big(\frac{1}{1-\alpha}\big)|x_2-x_1|}\cdot\Big(\frac{1}{1-\alpha}\Big)|x_2-x_1|=\varepsilon.$$
Thus $(x_n)$ is a Cauchy sequence. □
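Contractive sequences are also pleasant to explore numerically, since by $(\star)$ the differences $|x_{n+1}-x_n|$ decay at least geometrically. A small sketch (the helper iterate_map and the sample map $x\mapsto1+x/2$, which is contractive with $\alpha=1/2$, are our choices):

def iterate_map(f, x1, steps):
    # return the first `steps` terms of the recursively defined sequence
    xs = [x1]
    for _ in range(steps - 1):
        xs.append(f(xs[-1]))
    return xs

xs = iterate_map(lambda t: 1 + t/2, 0.0, 12)
for n in range(len(xs) - 1):
    print(n + 1, abs(xs[n+1] - xs[n]))   # successive differences halve

The printed differences halve at each step, which is exactly the geometric decay used in the proof; the same experiment applies verbatim to the sequence of 5.E.31 below.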
5.E.31. Consider the sequence $(x_n)$ of real numbers defined recursively as follows:
$$x_1=1,\qquad x_{n+1}=\frac{1}{4+x_n},\quad n\ge1.$$
Deduce that $(x_n)$ is a convergent sequence, by proving that it is contractive (see 5.E.30). ⃝

5.E.32. Let $(a_n)_{n=1}^{\infty}$ be a bounded sequence of real numbers. Set $m_n=\sup\{a_k : k\ge n\}$ and $\ell_n=\inf\{a_k : k\ge n\}$, $n\in\mathbb{N}$. Based on the monotone convergence theorem, show that both $(m_n)_{n=1}^{\infty}$ and $(\ell_n)_{n=1}^{\infty}$ are convergent. ⃝

5.E.33. Equivalent condition for convergence. Let $(x_n)$ be a sequence of real numbers. Is the following claim true?
$$(x_n)\ \text{is convergent}\iff\lim_{n\to\infty}\limsup_{m\to\infty}|x_n-x_m|=0.$$

Solution. The answer is yes. Suppose first that $\lim_{n\to\infty}x_n=a$ for some (finite) real number $a$. Then $\limsup_{m\to\infty}|x_n-x_m|=|x_n-a|$, and in turn this implies that
$$\lim_{n\to\infty}\limsup_{m\to\infty}|x_n-x_m|=\lim_{n\to\infty}|x_n-a|=0.$$
For the opposite direction, we rely on the theory of Cauchy sequences. So, suppose that $\lim_{n\to\infty}\limsup_{m\to\infty}|x_n-x_m|=0$. Then, for every $\varepsilon>0$ we may fix some natural $N$ with $\limsup_{m\to\infty}|x_m-x_N|<\frac\varepsilon2$. Therefore, there exists $m_0\in\mathbb{N}$ with $|x_m-x_N|<\frac\varepsilon2$ for all $m\ge m_0$. Combining this with the triangle inequality, for all $m_1,m_2\ge m_0$ we obtain
$$|x_{m_1}-x_{m_2}|\le|x_{m_1}-x_N|+|x_{m_2}-x_N|<\frac\varepsilon2+\frac\varepsilon2=\varepsilon.$$
This shows that $(x_n)$ is a Cauchy sequence, and hence convergent; see the theorem in 5.2.3. □

5.E.34. Find the limes superior/inferior of the sequences $(a_n)$, $(b_n)$ and $(c_n)$ defined below:
$$a_n=3+(-1)^n,\qquad b_n=2+\frac1n,\qquad c_n=\frac{4n}{n+1}\cos\Big(\frac{n\pi}{2}\Big),\qquad n\in\mathbb{Z}^+.$$

Solution. Recall that the limes superior/inferior of a sequence $(x_n)$ are essentially the biggest/smallest limits over all subsequences of $(x_n)$; thus, to compute them it always suffices to choose appropriate subsequences. Let us begin with $(a_n)$ and consider its subsequence $(a_{2n}=3+(-1)^{2n})$. This obviously tends to 4 as $n\to+\infty$, and hence $\limsup_{n\to\infty}a_n=4$. On the other hand, the subsequence $(a_{2n+1}=3+(-1)^{2n+1})$ of $(a_n)$ tends to 2 as $n\to+\infty$, and hence $\liminf_{n\to\infty}a_n=2$. Obviously the sequence $(b_n)$ converges to 2, so any subsequence of $(b_n)$ has the same limit, and it follows that $\limsup_{n\to\infty}b_n=2=\liminf_{n\to\infty}b_n$. For $(c_n)$ we have the expression $c_n=x_n\cdot y_n$, where $x_n=\frac{4n}{n+1}$ and $y_n=\cos(\frac{n\pi}{2})$. Obviously, $\lim_{n\to\infty}x_n=\lim_{n\to\infty}\frac{4}{1+\frac1n}=4$, so we may focus on $(y_n)$. We consider the following four subsequences:
• The subsequence of $(y_n)$ defined by $y_{4n}=\cos(\frac{4n\pi}{2})=\cos(2n\pi)=1$, which satisfies $\lim_{n\to\infty}y_{4n}=1$. Thus the subsequence $(c_{4n})$ of $(c_n)$ defined by $c_{4n}=x_{4n}\cdot y_{4n}$ tends to $4\cdot1=4$.
• The subsequence of $(y_n)$ defined by $y_{4n+1}=\cos(\frac{(4n+1)\pi}{2})=\cos(2n\pi+\frac\pi2)=\cos(\frac\pi2)=0$ satisfies $\lim_{n\to\infty}y_{4n+1}=0$. Hence in this case the subsequence $(c_{4n+1})$ of $(c_n)$ defined by $c_{4n+1}=x_{4n+1}\cdot y_{4n+1}$ tends to $4\cdot0=0$.
• The subsequence of $(y_n)$ defined by $y_{4n+2}=\cos(\frac{(4n+2)\pi}{2})=\cos(2n\pi+\pi)=\cos(\pi)=-1$ satisfies $\lim_{n\to\infty}y_{4n+2}=-1$. Thus the subsequence $(c_{4n+2})$ of $(c_n)$ defined by $c_{4n+2}=x_{4n+2}\cdot y_{4n+2}$ tends to $4\cdot(-1)=-4$.
• The subsequence of $(y_n)$ defined by $y_{4n+3}=\cos(\frac{(4n+3)\pi}{2})=\cos(2n\pi+\frac{3\pi}{2})=\cos(\frac{3\pi}{2})=0$ tends to 0, and hence the subsequence $(c_{4n+3})$ of $(c_n)$ tends to 0 as well.
To summarize, the largest limit among the subsequences $(c_{4n})$, $(c_{4n+1})$, $(c_{4n+2})$ and $(c_{4n+3})$ is 4 and the smallest one is $-4$. This means that $\limsup_{n\to\infty}c_n=4$ and $\liminf_{n\to\infty}c_n=-4$, respectively. □
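The behaviour of $(c_n)$ can also be made visible in Sage: plotting a couple of hundred terms shows the values clustering along the four limits computed above (a sketch, with our choice of range and styling):

c(n) = 4*n/(n + 1)*cos(n*pi/2)
p = points([(k, c(k)) for k in range(1, 201)], size=10, color="black")
p += line([(0, 4), (200, 4)], linestyle="--", color="red")     # limsup = 4
p += line([(0, -4), (200, -4)], linestyle="--", color="blue")  # liminf = -4
show(p)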
5.E.35. Based on the result from 5.B.18 on the Euler number, compute $\limsup_{n\to\infty}a_n$ and $\liminf_{n\to\infty}a_n$, where $(a_n)_{n=1}^{\infty}$ is the sequence defined by $a_n=\big(1+\frac{(-1)^n}{n}\big)^n$, for $n\in\mathbb{Z}^+$.

Solution. When the definition of a sequence $(a_n)$ involves the expression $(-1)^n$ (or $(-1)^{n+1}$, or $(-1)^{n-1}$), then to compute the limes superior/inferior of $(a_n)$ one may consider the subsequences determined by $2n$ and $2n+1$, respectively. Hence, for the specific task, let us first consider the subsequence of $(a_n)$ defined by $a_{2n}=\big(1+\frac{1}{2n}\big)^{2n}$. We have $2n\to+\infty$ as $n\to+\infty$, and hence by 5.B.18 we see that $\lim_{n\to+\infty}a_{2n}=e$. On the other hand, the subsequence of $(a_n)$ defined by $a_{2n+1}=\big(1+\frac{(-1)^{2n+1}}{2n+1}\big)^{2n+1}$ satisfies $a_{2n+1}=\big(1-\frac{1}{2n+1}\big)^{2n+1}$, and a combination of 5.B.18 with exercise 5.E.23 above gives $\lim_{n\to+\infty}a_{2n+1}=e^{-1}$. As a conclusion, we get that $\limsup_{n\to\infty}a_n=e$ and $\liminf_{n\to\infty}a_n=e^{-1}$. □

Let us now present a few extra tasks on the topological notions that we have met so far in Chapter 5.

5.E.36. Is either of the sets $\mathbb{N}$ or $\mathbb{Q}$ an open or a closed subset of $\mathbb{R}$? ⃝

5.E.37. (a) Given the sets $\mathbb{R}^*=\mathbb{R}\setminus\{0\}$, $\mathbb{R}\setminus\mathbb{Q}$, $\mathbb{R}\setminus\mathbb{Z}$, and $(1,2)\cup\{5\}$, decide which of them are open. (b) Similarly, given the sets $\mathbb{R}\setminus\mathbb{Q}$, $\mathbb{R}\setminus\mathbb{Z}$ and $[1,2]\cup\{5\}$, locate the closed one. ⃝

5.E.38. (a) Consider the set $A=\bigcap_{n\in\mathbb{N}^*}\big(-\frac1n,\,1+\frac1n\big)$, where as usual $\mathbb{N}^*=\mathbb{N}\setminus\{0\}$. Show that $A=[0,1]$. (b) Is the set $B=\bigcup_{n\in\mathbb{N}^*}\big[\frac1n,\,1-\frac1n\big]$ an open or a closed subset of $\mathbb{R}$?

Solution. (a) Clearly, $[0,1]\subset A$, and we need to prove the opposite inclusion. So assume that $x\in A$, that is, $-\frac1n<x<1+\frac1n$ for all $n\in\mathbb{Z}^+$. Then we have $x\ge\sup\{-\frac1n : n\in\mathbb{Z}^+\}$ and $x\le\inf\{1+\frac1n : n\in\mathbb{Z}^+\}$. However, $\sup\{-\frac1n : n\in\mathbb{Z}^+\}=0$ and $\inf\{1+\frac1n : n\in\mathbb{Z}^+\}=1$, which means that $x\in[0,1]$. (b) This can be treated similarly; in particular, one can show that
$$B=\bigcup_{n\in\mathbb{N}^*}\Big[\frac1n,\,1-\frac1n\Big]=(0,1),$$
which we leave for practice. Hence $B$ is open, and this shows that a union of infinitely many closed sets is not necessarily closed. □

5.E.39. Compact sets. Suppose that we define the compactness of a subset $A\subset\mathbb{R}$ via the Bolzano-Weierstrass theorem (see (4) in Theorem 5.2.8), that is, $A$ is said to be compact if for any sequence $(x_n)$ in $A$ there is a convergent subsequence $(x_{n_k})$ of $(x_n)$ whose limit belongs to $A$. Under this definition, show that a compact set $A$ is closed and bounded.

Solution. If $A$ is the empty set, then the result clearly holds; hence assume that $A\ne\emptyset$ and that $A$ is compact. According to our new definition, this means that for any sequence $(x_n)$ in $A$ there is a convergent subsequence $(x_{n_k})$ of $(x_n)$ whose limit, say $\ell$, belongs to $A$. Hence if $(x_n)$ is convergent, then we must have $\lim_{n\to\infty}x_n=\lim_{k\to\infty}x_{n_k}=\ell\in A$. This shows that $A$ is closed, and it remains to show that $A$ is bounded. For each $n\in\mathbb{N}^*=\mathbb{N}\setminus\{0\}$ consider the open set $C_n=(-n,n)$. Obviously $\bigcup_{n\in\mathbb{N}^*}C_n=\mathbb{R}$, hence $A\subseteq\bigcup_{n\in\mathbb{N}^*}C_n$. Since each $C_n$ is open, this means that the family $\{C_n : n\in\mathbb{N}^*\}$ is an open cover of $A$. However, $A$ is compact, and by (5) in Theorem 5.2.8 each of its open covers contains a finite subcover of $A$. Thus there exists $m\in\mathbb{N}^*$ such that
$$A\subseteq\bigcup_{n=1}^{m}C_n,\qquad\text{with }\bigcup_{n=1}^{m}C_n=(-m,m).$$
This shows that $A$ is contained in the bounded interval $(-m,m)$; therefore $A$ is bounded. For instance, any interval of the form $[a,b]\subset\mathbb{R}$ with $a<b$, $a,b\in\mathbb{R}$, is closed (and bounded) and hence compact. □
5.E.40. An open cover. Find an open cover, a subcover and a finite subcover of $A:=\{x\in\mathbb{R} : 0\le x\le1\}=[0,1]$.

Solution. An open cover of $A$ is given by the family $\mathcal{A}:=\{A_a=(a-\varepsilon,a+\varepsilon)\}$, where $a\in A$ and $\varepsilon>0$. As a subcover, for the same $\varepsilon$ take the subfamily $\{A_b=(b-\varepsilon,b+\varepsilon)\}$, where $b\in\{x\in\mathbb{Q} : 0\le x\le1\}$. Notice however that this is not a finite subcover. To obtain a finite subcover, consider the family
$$\{A_{0\cdot\varepsilon},A_{1\cdot\varepsilon},A_{2\cdot\varepsilon},\ldots,A_{n_0\cdot\varepsilon}\}=\{A_0,A_\varepsilon,A_{2\varepsilon},\ldots,A_{n_0\varepsilon}\},$$
where $n_0\in\mathbb{N}$ is the largest natural number with $n_0\varepsilon<1$. This is a finite subfamily of $\{A_a=(a-\varepsilon,a+\varepsilon) : a\in A\}$ (every set in this family is obviously a member of $\mathcal{A}$), and moreover it covers $A=[0,1]$, i.e., $A\subseteq A_0\cup A_\varepsilon\cup A_{2\varepsilon}\cup\cdots\cup A_{n_0\varepsilon}$. □

5.E.41. Show that the collection of open intervals $\mathcal{A}=\{A_k=(k-1,k+1)\}_{k\in\mathbb{N}^*}$ is an open cover of $\mathbb{N}^*=\mathbb{N}\setminus\{0\}$. Deduce, however, that $\mathbb{N}^*$ is not a compact subset of $\mathbb{R}$.

5.E.42. Indeed, we see that $\bigcup_{k=1}^{\infty}A_k=(0,+\infty)$ and $\mathbb{N}^*\subset(0,+\infty)$. Observe however that the family $\{A_k=(k-1,k+1)\}_{k\in\mathbb{N}^*}$ is not a cover of $\mathbb{N}$, since its union does not contain 0.¹³ On the other hand, we see that there is no finite subfamily $\{A_{i_1},\ldots,A_{i_n}\}$ of $\mathcal{A}$ which covers $\mathbb{N}^*$. For, if $\alpha=\max\{i_1,\ldots,i_n\}$, then we see that
$$\bigcup_{k=1}^{n}A_{i_k}\subset(0,\alpha+1),$$
and it is obvious that this union does not contain all naturals. Hence $\mathbb{N}^*$ is a non-compact set. ⃝

¹³ Recall that until Chapter 11 we assume that $\mathbb{N}$ contains 0.

5.E.43. Suppose that $A\subset\mathbb{R}$ is a non-empty compact subset of $\mathbb{R}$ and let $\varepsilon>0$. Specify a finite subset $B\subset\mathbb{R}$ such that $\min\{|x-y| : y\in B\}<\varepsilon$ for all $x\in A$.

Solution. It is easy to see that for any $\varepsilon>0$ the set $\{(x-\varepsilon,x+\varepsilon) : x\in A\}$ is an open cover of $A$. On the other hand, $A$ is by assumption compact. Hence there exist some positive integer $n$ and points $x_1,\ldots,x_n$ of $A$ such that
$$A\subset\bigcup_{i=1}^{n}(x_i-\varepsilon,x_i+\varepsilon).$$
Hence a finite subset $B\subset\mathbb{R}$ with the required property is given by these points, i.e., $B=\{x_1,\ldots,x_n\}$. □

The theory of limits often combines with the theory of matrices, and can be used to solve tasks related to linear algebra. For instance, let $(A_n)_{n=1}^{\infty}$ be a sequence of $2\times2$ matrices, with $A_n=\begin{pmatrix}a_n&b_n\\ c_n&d_n\end{pmatrix}$, where $(a_n)_{n=1}^{\infty}$, $(b_n)_{n=1}^{\infty}$, $(c_n)_{n=1}^{\infty}$ and $(d_n)_{n=1}^{\infty}$ are sequences of real (or complex) numbers. When $a_n\to a$, $b_n\to b$, $c_n\to c$ and $d_n\to d$ for some numbers $a,b,c,d$, we say that $(A_n)_{n=1}^{\infty}$ converges to the matrix $A=\begin{pmatrix}a&b\\ c&d\end{pmatrix}$, i.e., $\lim_{n\to\infty}A_n=A$. In this case $A$ is referred to as the "limit matrix" of $(A_n)_{n=1}^{\infty}$. Let us illustrate such an example.

5.E.44. Given a positive real number $\varphi$, consider the matrix sequence $(A_n)$ whose general term is given by
$$A_n=\begin{pmatrix}1&-\frac{\varphi}{n}\\[2pt] \frac{\varphi}{n}&1\end{pmatrix}^{\!n},\qquad n\in\mathbb{Z}^+.$$
Based on the relation $\lim_{n\to\infty}\big(1+\frac zn\big)^n=e^z$ for all $z\in\mathbb{C}$, prove that the limit matrix $A$ of $(A_n)$ exists and describes a rotation of the plane by the angle $\varphi$, i.e.,
$$A=\begin{pmatrix}\cos(\varphi)&-\sin(\varphi)\\ \sin(\varphi)&\cos(\varphi)\end{pmatrix}.$$

Solution. Let us set $\frac{\varphi}{n}=\tan(\theta_n)$. Based on basic matrix calculus we can then express the matrix sequence $(A_n)$ as follows:
$$A_n=\begin{pmatrix}1&-\tan(\theta_n)\\ \tan(\theta_n)&1\end{pmatrix}^{\!n}=\frac{1}{\cos^n(\theta_n)}\begin{pmatrix}\cos(\theta_n)&-\sin(\theta_n)\\ \sin(\theta_n)&\cos(\theta_n)\end{pmatrix}^{\!n}=\frac{1}{\cos^n(\theta_n)}\begin{pmatrix}\cos(n\theta_n)&-\sin(n\theta_n)\\ \sin(n\theta_n)&\cos(n\theta_n)\end{pmatrix}.$$
Notice that we used the relations
$$\begin{pmatrix}1&-\tan(\theta_n)\\ \tan(\theta_n)&1\end{pmatrix}=\frac{1}{\cos(\theta_n)}\begin{pmatrix}\cos(\theta_n)&-\sin(\theta_n)\\ \sin(\theta_n)&\cos(\theta_n)\end{pmatrix},\qquad
\begin{pmatrix}\cos(\theta_n)&-\sin(\theta_n)\\ \sin(\theta_n)&\cos(\theta_n)\end{pmatrix}^{\!n}=\begin{pmatrix}\cos(n\theta_n)&-\sin(n\theta_n)\\ \sin(n\theta_n)&\cos(n\theta_n)\end{pmatrix}.$$
Although the first is obvious, you may like to confirm the second one by induction. Hence we can write
$$\lim_{n\to\infty}A_n=\begin{pmatrix}\lim_{n\to\infty}\frac{\cos(n\theta_n)}{\cos^n(\theta_n)}&-\lim_{n\to\infty}\frac{\sin(n\theta_n)}{\cos^n(\theta_n)}\\[4pt] \lim_{n\to\infty}\frac{\sin(n\theta_n)}{\cos^n(\theta_n)}&\lim_{n\to\infty}\frac{\cos(n\theta_n)}{\cos^n(\theta_n)}\end{pmatrix},\quad(*)$$
and $A=\lim_{n\to\infty}A_n$ exists if and only if the limits inside the matrix in $(*)$ exist. To examine these limits we rely on de Moivre's theorem, as follows:
$$\frac{\cos(n\theta_n)}{\cos^n(\theta_n)}+i\,\frac{\sin(n\theta_n)}{\cos^n(\theta_n)}=\frac{\cos(n\theta_n)+i\sin(n\theta_n)}{\cos^n(\theta_n)}=\Big(\frac{\cos(\theta_n)+i\sin(\theta_n)}{\cos(\theta_n)}\Big)^{\!n}=\big(1+i\tan(\theta_n)\big)^n=\Big(1+\frac{i\varphi}{n}\Big)^{\!n}.$$
Recalling that $\lim_{n\to\infty}\big(1+\frac{i\varphi}{n}\big)^n=e^{i\varphi}=\cos(\varphi)+i\sin(\varphi)$, we deduce that
$$\lim_{n\to\infty}\frac{\cos(n\theta_n)}{\cos^n(\theta_n)}+i\lim_{n\to\infty}\frac{\sin(n\theta_n)}{\cos^n(\theta_n)}=\cos(\varphi)+i\sin(\varphi).$$
Comparing the real and imaginary parts, we obtain
$$\lim_{n\to\infty}\frac{\cos(n\theta_n)}{\cos^n(\theta_n)}=\cos(\varphi),\qquad\lim_{n\to\infty}\frac{\sin(n\theta_n)}{\cos^n(\theta_n)}=\sin(\varphi),$$
and the result follows by $(*)$. □
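A quick numerical sanity check of this limit can also be performed in Sage; the following sketch (our choice of $\varphi$ and $n$) compares $A_n$ for a large $n$ with the rotation matrix:

phi = 0.7; n = 10^6
An = matrix(RDF, [[1, -phi/n], [phi/n, 1]])^n
Rot = matrix(RDF, [[cos(phi), -sin(phi)], [sin(phi), cos(phi)]])
print(An)
print((An - Rot).norm())   # small, and it shrinks further as n grows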
Let us now discuss additional content concerning the limits of functions and the concept of continuity. This involves various computational and theoretical examples, as well as notable applications of the intermediate value theorem. However, we first address tasks involving sequences defined recursively, such as in problem 5.E.31 above. Let $f:A\subset\mathbb{R}\to\mathbb{R}$ be a continuous function defined on a subset $A$ of real numbers. Suppose that there exists a sequence $(x_n)$ in $A$ satisfying $x_{n+1}=f(x_n)$ for all $n$. Moreover, assume that $\lim_{n\to\infty}x_n$ exists and is equal to some (limit point) $a\in A$. In this situation it is not hard to see that $\lim_{n\to\infty}f(x_n)=f(a)$ and $f(a)=a$. This shows how useful continuous functions can be, and provides a very common way to define sequences, namely by iterating some function $f$ (recursive definition).

5.E.45. Iterations. Consider the sequence $(a_n)_{n=1}^{\infty}$ with $a_1=0$, $a_2=1$, $a_3=\sqrt3$, $\ldots$, $a_n=\sqrt{1+2a_{n-1}}$. (a) Show that $(a_n)_{n=1}^{\infty}$ is strictly increasing. (b) Is this a convergent sequence? In the positive case, find the limit $\lim_{n\to\infty}a_n$. (c) Use Sage to plot some terms of $(a_n)$ and obtain a graphical confirmation of your conclusion in (b).

Solution. We should prove that $a_{n+1}>a_n$ for all $n$. Observe that $a_2=1>0=a_1$. Assuming that $a_n>a_{n-1}$ for $n\ge2$ and proceeding by induction on $n$, one gets $a_{n+1}=\sqrt{1+2a_n}>\sqrt{1+2a_{n-1}}=a_n$. Using mathematical induction on $n$ we can also show that $(a_n)$ is bounded above; in fact $a_n<3$ for all naturals $n\ge1$. This is because $a_1=0<3$, and assuming $a_n<3$ we arrive at the claim: $a_{n+1}=\sqrt{1+2a_n}<\sqrt{1+2\cdot3}<\sqrt9=3$. Since the sequence $(a_n)$ is strictly increasing and bounded above, by the monotone sequence theorem it must be convergent. To determine its limit we rely on the continuity of the square root function $\sqrt{x}$, as follows. Suppose that $a=\lim_{n\to\infty}a_n$. Then $a$ must satisfy the relation $a=\sqrt{1+2a}$, thus $a^2=1+2a$, with solutions $a=1\pm\sqrt2$. However, $a_n\ge0$ for all $n$, so one can discard the negative solution; that is, $\lim_{n\to\infty}a_n=1+\sqrt2$. Let us now use Sage to plot some terms of $(a_n)$. To do so, we will first use the def command to introduce our sequence. This can be done as follows:
def a(n, D={}):
    if n in D.keys():
        return D[n]
    if n == 1:
        sol = 0
    else:
        sol = sqrt(1 + 2*a(n-1))
    D[n] = sol
    return sol

In this way, to test different values of $(a_n)$ one can simply type a(2).n(); a(3).n(); a(10).n(), etc. Notice however that defining $(a_n)$ via the def method does not allow us to compute the limit of $(a_n)$ symbolically, which means that typing lim(a(n), n = oo) returns an error (even if we declare the variable n as symbolic before the def command). Now we can use this routine to plot $(a_n)$. Together with the points we sketch the line $y=1+\sqrt2$, so that the visualization of our conclusion about the limit is easier.

N=50
pts = [(n, a(n)) for n in range(1, N)]
p=points(pts, color="slategray", size=25)
p+=line([(0, 1+sqrt(2)), (51, 1+sqrt(2))], rgbcolor=(0.7,0.2,0.3), linestyle="--")
p+=point([0, 1+sqrt(2)], size=20, color="darkblue")
p+=text(r"$1+\sqrt{2}$", (-4, 1+sqrt(2)), color="darkblue", fontsize="14")
p.show(figsize=6)

The figure that Sage returns shows the points $(n,a_n)$ approaching the dashed line $y=1+\sqrt2$. □

5.E.46. Consider the sequence $(a_n)_{n=1}^{\infty}$ defined by the relation
$$a_{n+1}=\frac12\Big(a_n+\frac{a^2}{a_n}\Big)$$
for all positive integers $n$, with $a_1>a$, for some $a>0$. Show that $\lim_{n\to+\infty}a_n=a$.

Solution. For all naturals $n>1$, by the AM-GM inequality presented in Chapter 1, we have
$$a_n=\frac12\Big(a_{n-1}+\frac{a^2}{a_{n-1}}\Big)\ge\sqrt{a_{n-1}\cdot\frac{a^2}{a_{n-1}}}=a>0.$$
Thus, together with the condition $a_1>a$, this gives $a_n\ge a$ for all $n$, and hence $(a_n)$ is bounded. It is also decreasing, since
$$a_{n+1}-a_n=\frac{a^2-a_n^2}{2a_n}\le0.\quad(\flat)$$
One may like to confirm the equality appearing in $(\flat)$ also via Sage, which can be done quickly by the block

var("n, c"); function("a")(n)
bool((1/2)*(a(n)+(c^2/a(n)))-a(n)==(c^2-a(n)*a(n))/(2*a(n)))

In this way one proves that $(a_n)$ is bounded and monotone, hence its limit $\lim_{n\to+\infty}a_n$ exists and equals $\inf\{a_n : n\in\mathbb{N}\}$. Let us set $b=\inf\{a_n : n\in\mathbb{N}\}$. Then we should have $b=\frac12\big(b+\frac{a^2}{b}\big)$, which can be equivalently expressed as $b^2=a^2$. However, by assumption $a>0$, and thus we get $b=a$. □
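The recursion of 5.E.46 is precisely Heron's method for extracting square roots (here of $a^2$), so the convergence is in fact very fast. A short numerical illustration (our choice $a=2$, $a_1=3$):

a = 2.0
x = 3.0                      # a_1 > a
for k in range(6):
    x = (x + a^2/x) / 2
    print(k + 1, x, x - a)   # the error decays quadratically

Already after a handful of steps the printed error is at machine-precision level, well beyond what the monotonicity argument alone predicts.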
5.E.47. Consider the sequence $(a_n)_{n=1}^{\infty}$ with $a_1=\sqrt2$ and $a_n=\sqrt{2+a_{n-1}}$ for all naturals $n\ge2$. Show that $(a_n)$ is convergent, and in particular compute $\lim_{n\to+\infty}a_n$. ⃝

5.E.48. Let $f:\mathbb{R}\to\mathbb{R}$ be a continuous function satisfying the relation $x^2f(x)=1-\cos(2x)$ for all $x\in\mathbb{R}$. Using the result of 5.B.31, find the explicit form of $f$.

Solution. Obviously, for $x\ne0$ the function at hand satisfies $f(x)=\frac{1-\cos(2x)}{x^2}$. For $x=0$, by the continuity of $f$ we should have $f(0)=\lim_{x\to0}f(x)$. To compute this limit one can use the known identity $\sin^2\big(\frac\theta2\big)=\frac{1-\cos(\theta)}{2}$ (half-angle formula), which can be rephrased as $2\sin^2(\theta)=1-\cos(2\theta)$. This is a very useful trigonometric identity, which in our case applies as follows:
$$f(0)=\lim_{x\to0}f(x)=\lim_{x\to0}\frac{1-\cos(2x)}{x^2}=2\lim_{x\to0}\frac{\sin^2(x)}{x^2}=2\Big(\lim_{x\to0}\frac{\sin(x)}{x}\Big)^2=2\cdot1^2=2.$$
Therefore one can now present the full form of $f$, namely
$$f(x)=\begin{cases}\dfrac{1-\cos(2x)}{x^2}, & x\ne0,\\[4pt] 2, & x=0.\end{cases}\quad\square$$

5.E.49. Using the result of 5.B.31, where adequate, evaluate the following limits:
$$\lim_{x\to0}\frac{\sin^2(x)}{x},\quad\lim_{x\to0}\frac{x}{\sin^2(x)},\quad\lim_{x\to0}\frac{\arcsin(x)}{x},\quad\lim_{x\to0}\frac{3\tan^2(x)}{5x^2},\quad\lim_{x\to0}\frac{\sin(3x)}{\sin(5x)},\quad\lim_{x\to0}\frac{\tan(3x)}{\sin(5x)}.$$ ⃝

5.E.50. Confirm that
$$\lim_{x\to0}\frac{e^{5x}-e^{2x}}{x}=3=\lim_{x\to0}\frac{e^{5x}-e^{-x}}{\sin(2x)}.$$
(Hint: recall that $\lim_{x\to0}\frac{e^x-1}{x}=1$.) Next use Sage to sketch the graphs of the involved functions. ⃝

5.E.51. Compute the limit $\lim_{x\to+\infty}\big(\frac{3x+1}{3x-2}\big)^{4x}$ with the aid of the relation $\lim_{x\to+\infty}\big(1+\frac1x\big)^x=e$. Then verify your answer via Sage.

Solution. Based on the properties of powers, we see that
$$\Big(\frac{3x+1}{3x-2}\Big)^{4x}=\frac{\big[3x\big(1+\frac{1}{3x}\big)\big]^{4x}}{\big[3x\big(1-\frac{2}{3x}\big)\big]^{4x}}=\frac{\big(1+\frac{1}{3x}\big)^{4x}}{\big(1-\frac{2}{3x}\big)^{4x}}=\frac{\Big[\big(1+\frac{1}{3x}\big)^{3x}\Big]^{4/3}}{\Big[\big(1+\frac{1}{-\frac32x}\big)^{-\frac32x}\Big]^{-8/3}}.$$
Thus
$$\lim_{x\to+\infty}\Big(\frac{3x+1}{3x-2}\Big)^{4x}=\frac{\Big[\lim_{x\to+\infty}\big(1+\frac{1}{3x}\big)^{3x}\Big]^{4/3}}{\Big[\lim_{x\to+\infty}\big(1+\frac{1}{-\frac32x}\big)^{-\frac32x}\Big]^{-8/3}}=\frac{e^{4/3}}{e^{-8/3}}=e^{\frac43+\frac83}=e^4.$$
For a confirmation via Sage just run the cell limit(((3*x+1)/(3*x-2))^(4*x), x=oo). □

5.E.52. Evaluate the limits $\lim_{x\to+\infty}\big(2+\frac1x\big)^{\frac1x}$, $\lim_{x\to+\infty}x^{-x}$ and $\lim_{x\to0}e^{\frac1x}$. ⃝

5.E.53. Evaluate the following limits:
$$\lim_{x\to0}\frac{\sin(x)}{x^3},\quad\lim_{x\to+\infty}\big(\sqrt{x^2+x}-x\big),\quad\lim_{x\to+\infty}\big(x\sqrt{1+x^2}-x^2\big),\quad\lim_{x\to0^-}\frac{\sqrt{1+\tan(x)}-\sqrt{1-\tan(x)}}{\sin(x)}.$$ ⃝

5.E.54. Use the binomial theorem to compute the limit
$$\lim_{x\to0}\frac{(1+2nx)^n-(1+nx)^{2n}}{x^2},\quad\text{for all }n\in\mathbb{N}^*.$$
Next confirm your computation via Sage. ⃝

5.E.55. Examine the convergence of the sequence $(a_n)$ with general term $a_n=\sum_{k=1}^{n}\frac{\cos(k)}{2^k}$, $n=1,2,\ldots$ ⃝

5.E.56. Compute the limit $\lim_{n\to+\infty}\frac{\lfloor xn\rfloor}{n}$, where $x\in\mathbb{R}$ and $\lfloor\ \rfloor$ is the floor function introduced in 5.B.48.

Solution. Recall from 5.B.48 that for any $x\in\mathbb{R}$ we have $\lfloor x\rfloor\le x<\lfloor x\rfloor+1$. Since $nx\in\mathbb{R}$ for all $n\in\mathbb{Z}$ and $x\in\mathbb{R}$, we also get $\lfloor nx\rfloor\le nx<\lfloor nx\rfloor+1$, which we may rewrite as
$$0\le x-\frac{\lfloor xn\rfloor}{n}<\frac1n,\quad\text{for all }n\ne0.$$
Combining this with the squeeze theorem we deduce that $\lim_{n\to+\infty}\frac{\lfloor xn\rfloor}{n}=x$, for all $x\in\mathbb{R}$. □
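A numerical glance at this limit, say for $x=\pi$, is immediate in Sage (a sketch):

for k in [10, 100, 10^4, 10^6]:
    print(k, N(floor(pi*k)/k))   # values approaching pi = 3.14159...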
5.E.57. The ceiling function and the fractional part function. (1) Use the command ceil in Sage to compute $\lceil-1.4\rceil$, $\lceil-0.5\rceil$, $\lceil2\rceil$, $\lceil2.1\rceil$, $\lceil\sqrt7\rceil$, $\lceil\pi\rceil$ and $\lceil9/2\rceil$. (2) Construct in Sage the fractional part function and then evaluate it at the values mentioned in (1). (3) Show that $\lceil x\rceil=-\lfloor-x\rfloor$ for all $x\in\mathbb{R}$. (4) If $x=n+d$, with $n\in\mathbb{Z}$ and $0\le d<1$, show that $n=\lfloor x\rfloor$ and $d=\{x\}$. (5) Show that the fractional part function is discontinuous at every integer. (6) Use Sage to sketch the graph of the ceiling function for $-5\le x\le5$ and of the fractional part function for $0\le x\le5$.

Solution. (1) It is not hard to compute
$$\lceil-1.4\rceil=-1,\ \lceil-0.5\rceil=0,\ \lceil2\rceil=2,\ \lceil2.1\rceil=3=\lceil\sqrt7\rceil,\ \lceil\pi\rceil=4,\ \lceil9/2\rceil=5.$$
A confirmation in Sage is obtained with the command ceil, as follows:

print(ceil(-1.4)); print(ceil(-0.5)); print(ceil(2)); print(ceil(2.1))
print(ceil(sqrt(7))); print(ceil(pi)); print(ceil(9/2))

(2) To introduce in Sage the fractional part function $x\mapsto\{x\}=x-\lfloor x\rfloor$, use the floor function floor and type the syntax

fractal(x) = x - floor(x)

Then, to compute the required evaluations, type print(fractal(-1.4)), print(fractal(-0.5)), etc. Notice that Sage has a built-in function for the fractional part, called frac, but it agrees with ours only for positive reals. In this way you can quickly verify the following answers:
$$\{-1.4\}=-1.4-\lfloor-1.4\rfloor=-1.4-(-2)=0.6,\quad\{-0.5\}=-0.5-(-1)=0.5,\quad\{2\}=2-2=0,$$
$$\{2.1\}=2.1-2=0.1,\quad\{\sqrt7\}=\sqrt7-2\approx0.645752,\quad\{\pi\}=\pi-3\approx0.141593,\quad\{9/2\}=4.5-4=0.5.$$
(3) We leave this to the reader. (4) Consider a real $x$ such that $x=n+d$ with $n\in\mathbb{Z}$ and $0\le d<1$. By (2) in 5.B.48 we know that $\lfloor x+n\rfloor=\lfloor x\rfloor+n$ for all $x\in\mathbb{R}$ and $n\in\mathbb{Z}$. Thus we have $\lfloor x\rfloor=\lfloor n+d\rfloor=n+\lfloor d\rfloor=n$, since $\lfloor d\rfloor=0$. Consequently, $x=n+d=\lfloor x\rfloor+d=\lfloor x\rfloor+\{x\}$, that is, $\{x\}=d$. This proves (4). (5) Let $n\in\mathbb{Z}$ be an integer. We see that
$$\lim_{x\to n^+}\{x\}=\lim_{h\to0^+}\{n+h\}=\lim_{h\to0^+}\big(n+h-\lfloor n+h\rfloor\big)=\lim_{h\to0^+}(n+h)-\lim_{h\to0^+}\lfloor n+h\rfloor=n-n=0.$$
On the other hand, we compute
$$\lim_{x\to n^-}\{x\}=\lim_{h\to0^+}\{n-h\}=\lim_{h\to0^+}\big(n-h-\lfloor n-h\rfloor\big)=n-(n-1)=1.$$
It follows that the function $x\mapsto\{x\}$ is not continuous at any integer (see also its graph below). (6) For the ceiling function one may proceed in an analogous way to the floor function presented in 5.B.48. This method includes the jump discontinuities, and is encoded by the following block:

g=ceil(x)
p=plot(g, x, -5, 5, ticks=[1,1])
for x in [-5..5]:
    p+=point([x,x], size=30, color="black")
for x in [-4..5]:
    p+=circle((x-1,x), 0.08, color="black")
show(p)

In a similar way we can sketch the graph of the fractional part function (together with the jump discontinuities):

fractal(x)=x-floor(x)
q=plot(fractal, x, 0, 5)
for x in [0..5]:
    q+=point([x,0], size=30, color="black")
for x in [1..5]:
    q+=circle((x,1), 0.03, color="black")
show(q)

Sage then constructs the graph of the ceiling function and that of the fractional part function (in both figures one should ignore the vertical lines). Notice that the fractional part function is non-negative. □

5.E.58. Examine the continuity of the function $f(x)=(x-1)^{-\operatorname{sgn}(x)}$ at the points 0 and 1. ⃝

5.E.59. Given the points $-\pi,0,1,2,3,\pi$, determine whether the function
$$f(x)=\begin{cases}x, & x<0;\\ 0, & 0\le x<1;\\ x, & x=1;\\ 0, & 1<x<2;\\ x, & 2\le x\le3;\\ \frac{1}{x-3}, & x>3\end{cases}$$
is continuous, left-continuous or right-continuous at these points. ⃝

5.E.60. Find all $p\in\mathbb{R}$ for which the function $f(x)=\frac{\sin(6x)}{3x}$, $x\in\mathbb{R}^*=\mathbb{R}\setminus\{0\}$; $f(0)=p$, is continuous at the origin. ⃝

5.E.61. Choose a real number $a$ so that the function $h(x)=\frac{x^4-1}{x-1}$, $x>1$; $h(x)=a$, $x\le1$, is continuous on $\mathbb{R}$. ⃝

5.E.62. By defining the values at the points $-1$ and $1$, extend the function
$$f(x)=(x^2-1)\sin\frac{2x-1}{x^2-1},\quad x\in\mathbb{R}\setminus\{\pm1\},$$
so that the resulting function is continuous on the whole of $\mathbb{R}$. ⃝

5.E.63. Let $f:\mathbb{R}\to\mathbb{R}$ be a function satisfying the relation
$$f^3(x)+3f^2(-x)f(x)=4x^2,\quad x\in\mathbb{R}.\quad(\sharp)$$
(a) Prove that $f$ is even, and next show that $\lim_{x\to0}f(x)=0$. (b) Show that $\lim_{x\to0^+}\frac{\ln(f(x))}{\ln(\sqrt{x^3})}=4/9$.

Solution. (a) In the defining equation $(\sharp)$ replace $x$ by $-x$. This gives $f^3(-x)+3f^2(x)f(-x)=4x^2$, and by subtracting this relation from $(\sharp)$ we get
$$f^3(x)-3f^2(x)f(-x)+3f(x)f^2(-x)-f^3(-x)=4x^2-4x^2=0\iff\big(f(x)-f(-x)\big)^3=0,\quad x\in\mathbb{R},$$
where we used the identity $(a-b)^3=a^3-3a^2b+3ab^2-b^3$, with $a,b\in\mathbb{R}$. Therefore, we deduce that $f(x)-f(-x)=0$ for all $x\in\mathbb{R}$, i.e., $f$ is even. Now, because $f$ is even, the defining equation takes the form $f^3(x)+3f^2(x)f(x)=4x^2$, or $4f^3(x)=4x^2$, that is, $f^3(x)=x^2$ for all $x\in\mathbb{R}$, from which we get $f(x)=\sqrt[3]{x^2}=x^{2/3}$, $x\in\mathbb{R}$. As for the limit, recall that the function $f(x)=x^{2/3}$ is continuous everywhere on $\mathbb{R}=(-\infty,\infty)$; see also Problem 6.A.21, where the graph of $f$ is presented. Therefore, $\lim_{x\to0}f(x)=\lim_{x\to0}x^{2/3}=f(0)=0$. (b) The form of $f$ is known from the first part, thus one can easily prove the claim, i.e.,
$$\lim_{x\to0^+}\frac{\ln(f(x))}{\ln(\sqrt{x^3})}=\lim_{x\to0^+}\frac{\ln(x^{2/3})}{\ln(x^{3/2})}=\lim_{x\to0^+}\frac{\frac23\ln(x)}{\frac32\ln(x)}=\frac49.\quad\square$$
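Since part (a) gives $f(x)=x^{2/3}$, the limit in (b) can also be checked in Sage; a sketch (if the first command is returned unevaluated, the manually simplified quotient in the second line settles it):

assume(x > 0)
print(limit(log(x^(2/3))/log(sqrt(x^3)), x=0, dir="+"))
print(limit(((2/3)*log(x))/((3/2)*log(x)), x=0, dir="+"))   # 4/9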
5.E.64. Let $f:\mathbb{R}\to(-\infty,1)$ and $g:\mathbb{R}\to(1,+\infty)$ be continuous functions satisfying $f(a)=na$ and $g(b)=nb$ for certain reals $0<a<b$ and some positive integer $n$. Prove the existence of some $\xi\in(a,b)$ satisfying the equation $\xi=\frac{f(\xi)g(\xi)}{n}$.

Solution. Observe that $f(a)=na<1$, since $f(a)\in(-\infty,1)$, and hence $a<\frac1n$; while $g(b)=nb>1$, since $g(b)\in(1,+\infty)$, and hence $b>\frac1n$. Thus, we can indeed form the interval $[a,b]$ and consider the function
$$\ell(x)=f(x)g(x)-nx,\quad x\in[a,b].$$
Then we see that:
• $\ell(a)=f(a)g(a)-na=na\,g(a)-na=na\big(g(a)-1\big)>0$, since $g(a)>1$, so $g(a)-1>0$, and $0<a<1/n$.
• $\ell(b)=f(b)g(b)-nb=nb\,f(b)-nb=nb\big(f(b)-1\big)<0$, since $f(b)<1$, hence $f(b)-1<0$, while $0<\frac1n<b$.
Thus $\ell(a)\ell(b)<0$, and by Bolzano's theorem there exists some $\xi\in(a,b)$ with $\ell(\xi)=0$, that is, $n\xi=f(\xi)g(\xi)$. □

5.E.65. Let $f:\mathbb{R}\to\mathbb{R}$ be a periodic function with period $T>0$, that is, $f(x+T)=f(x)$ for all $x\in\mathbb{R}$. If $\lim_{x\to+\infty}f(x)$ exists and is a real number $\alpha$, show that $f(x)=\alpha$ for all $x\in\mathbb{R}$, i.e., $f$ is constant.¹⁴

Solution. By assumption $f$ is periodic with period $T$, hence it suffices to prove that $f$ is constant on $[0,T)$. Since $\lim_{x\to+\infty}f(x)=\alpha\in\mathbb{R}$, for every $\varepsilon>0$ there exists $C>0$ such that $|f(x)-\alpha|<\varepsilon$ for all $x\ge C$. Moreover, for every $x\in[0,T)$ we can find some positive integer $n$ such that $x+nT\ge C$. Thus we get $|f(x)-\alpha|=|f(x+nT)-\alpha|<\varepsilon$. This shows that $|f(x)-\alpha|<\varepsilon$ for all $x\in[0,T)$ and all $\varepsilon>0$, and hence $f(x)=\alpha$ for all $x\in\mathbb{R}$. □

¹⁴ Later we will see that the condition $\lim_{x\to+\infty}f(x)=\alpha\in\mathbb{R}$ means that the line $y=\alpha$ is a horizontal asymptote of $y=f(x)$.

5.E.66. Let $f:\mathbb{R}\to\mathbb{R}$ be a continuous function which is periodic with period $T>0$. Show that there exists $x_0\in\mathbb{R}$ such that $f\big(x_0+\frac T2\big)=f(x_0)$. Then use Sage to compute $x_0$ for $f(x)=\cos(x)$.

Solution. It is sufficient to consider the function $g(x)=f\big(x+\frac T2\big)-f(x)$, which is continuous on $\mathbb{R}$ (since $f$ is continuous) and satisfies
$$g(0)=f\Big(\frac T2\Big)-f(0),\qquad g\Big(\frac T2\Big)=f(0)-f\Big(\frac T2\Big)=-g(0).$$
Recall that if a continuous function has values of opposite sign inside an interval, then it admits a root in that interval (Bolzano's theorem, see 5.2.19). In particular, the result follows by the intermediate value theorem applied to $g$, the latter restricted to $[0,\frac T2]$. The function $f(x)=\cos(x)$ has period $T=2\pi$, thus it suffices to focus on the interval $(0,\frac T2=\pi)$. The equation $\cos(x_0+\pi)=\cos(x_0)$ is equivalent to $\cos(x_0)=0$ (since $\cos(x+\pi)=-\cos(x)$ for all $x$). Hence obviously $x_0=\frac\pi2\in(0,\pi)$ (see also the table in 5.A.3).¹⁵ In Sage just type solve(cos(x + pi) == cos(x), x). □

¹⁵ Recall that the equation $\cos(x)=0$ has general solution $\frac\pi2+k\pi$, with $k\in\mathbb{Z}$.

5.E.67. Let $f,g,h,k:\mathbb{R}\to\mathbb{R}$ be continuous functions on $\mathbb{R}$ satisfying the following conditions:
• $f(x)>0$ and $k(x)>0$ for all $x\in\mathbb{R}$;
• $f(a)=g(a)$ and $h(b)=k(b)$ for some $a\ne b\in\mathbb{R}$;
• $g(x)<f(x)$ for all $x\in\mathbb{R}\setminus\{a\}$, and $k(x)<h(x)$ for all $x\in\mathbb{R}\setminus\{b\}$.
Prove the existence of (at least) one $x_0\in(a,b)$ for which the vectors $\vec u=\big(g(x_0),f(x_0)\big)^T$ and $\vec v=\big(k(x_0),h(x_0)\big)^T$ (of $\mathbb{R}^2$) are parallel.

Solution. The vectors $\vec u$ and $\vec v$ are parallel if and only if the determinant $\begin{vmatrix}g(x_0)&f(x_0)\\ k(x_0)&h(x_0)\end{vmatrix}$ vanishes, i.e., $g(x_0)h(x_0)=k(x_0)f(x_0)$. Consider the function $\varphi(x)=g(x)h(x)-k(x)f(x)$. This function is continuous on $[a,b]$, and by the assumptions, using $f(a)=g(a)$ and $h(b)=k(b)$, it follows that
$$\varphi(a)\varphi(b)=\big(g(a)h(a)-k(a)f(a)\big)\cdot\big(g(b)h(b)-k(b)f(b)\big)=f(a)\big(h(a)-k(a)\big)\cdot k(b)\big(g(b)-f(b)\big)<0,$$
since $h(a)>k(a)$, $g(b)<f(b)$, $f(a)>0$ and $k(b)>0$. Thus, by Bolzano's theorem there exists $x_0\in(a,b)$ with $\varphi(x_0)=0$, and the claim follows. □
5.E.68. Show that the equation $x^{2024}+3x-e^x-1=0$ has at least one real solution in the open interval $(0,1)$.

Solution. The function $f(x)=x^{2024}+3x-e^x-1$ is the sum of the polynomial $x^{2024}+3x-1$ and the negative of the exponential function, hence by 5.2.17 it is continuous on $\mathbb{R}$, and in particular on the closed interval $[0,1]$. Moreover, we have $f(0)=-2<0$ and $f(1)=3-e>0$. Therefore, the claim follows by Bolzano's theorem. □

5.E.69. Let $f$ be a real-valued continuous function defined on the closed interval $[1,4]$. (a) Explain why the function $g(x)=f(x)+e^{2x-1}$, with the same domain as $f$, has a minimal and a maximal value there. (b) Prove the existence of some $x_0\in[1,4]$ such that $f(x)<f(x_0)+e^{2x_0}$ for all $x\in[1,4]$.

Solution. (a) The function $g$ is continuous for all $x\in[1,4]$, as the sum of two continuous functions defined there; see the basic limit properties in 5.2.17. Thus $g$ attains a maximal and a minimal value on the closed interval $[1,4]$, and takes all the values in between, by the Corollary in 5.2.19. (b) Using the conclusion from the first part, we can find $x_0\in[1,4]$ such that $g(x)\le g(x_0)$ for all such $x$ (i.e., at $x_0$ the function $g$ attains its maximal value). This yields the result, i.e.,
$$f(x)\le f(x_0)+e^{2x_0-1}-e^{2x-1}<f(x_0)+e^{2x_0-1}<f(x_0)+e^{2x_0},\quad x\in[1,4].\quad\square$$

5.E.70. Let $f:I\to\mathbb{R}$ be a continuous function defined on the interval $I=[a,b]$, where $a,b\in\mathbb{R}$ with $a<b$. If $f$ is injective, prove that $f$ is strictly monotone. ⃝

5.E.71. Compute the following limits or explain why they do not exist:
$$\lim_{x\to+\infty}\big(\arccos(1/(x+1))\big)^3,\quad\lim_{x\to-\infty}\arctan(1/x),\quad\lim_{x\to-\infty}\arctan(x^4),\quad\lim_{x\to-\infty}\arctan\big(\sin(x)\big).$$
Next, use Sage to verify your conclusions.

Solution. The function $\arccos(x)$ is continuous on its domain $[-1,1]$, while the function $x^3$ is continuous everywhere. Moreover, $\lim_{x\to+\infty}(1/(x+1))=0$, and $\arccos(0)=\pi/2$. Thus,
$$\lim_{x\to+\infty}\big(\arccos(1/(x+1))\big)^3=\Big[\arccos\Big(\lim_{x\to+\infty}\frac{1}{x+1}\Big)\Big]^3=\Big(\frac\pi2\Big)^3.$$
The same conclusion can be drawn graphically from the graph of the function $f(x)=(\arccos(1/(1+x)))^3$ for $x>0$. Similarly, the function $\arctan(x)$ is continuous and injective on its domain. Thus, according to 5.E.70, it should be strictly monotone, and indeed it is easy to verify that it is strictly increasing; this is also illustrated by the graph of $\arctan(x)$. Now, according to the discussion in 5.2.16, these properties allow us to move the examined limit into the argument of such a function. Therefore, it is possible to write
$$\lim_{x\to-\infty}\arctan(1/x)=\arctan\Big(\lim_{x\to-\infty}\frac1x\Big),\quad\lim_{x\to-\infty}\arctan(x^4)=\arctan\Big(\lim_{x\to-\infty}x^4\Big),\quad\lim_{x\to-\infty}\arctan\big(\sin(x)\big)=\arctan\Big(\lim_{x\to-\infty}\sin(x)\Big).$$
However, $\lim_{x\to-\infty}(1/x)=0$ and $\lim_{x\to-\infty}x^4=+\infty$, while $\lim_{x\to-\infty}\sin(x)$ does not exist. Therefore, using the relations above we obtain $\lim_{x\to-\infty}\arctan(1/x)=\arctan(0)=0$ and $\lim_{x\to-\infty}\arctan(x^4)=\pi/2$, while the last limit does not exist. Finally, for a confirmation via Sage just type:

f(x)=(arccos(1/(1+x)))^3; print(lim(f(x), x=+oo))
g(x)=arctan(1/x); print(lim(g(x), x=-oo))
h(x)=arctan(x^4); print(lim(h(x), x=-oo))
k(x)=arctan(sin(x)); print(lim(k(x), x=-oo))

□
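Returning for a moment to 5.E.68: Bolzano's theorem only asserts the existence of a root, but numerically one can also locate it, e.g., with Sage's find_root (a sketch):

f(x) = x^2024 + 3*x - e^x - 1
print(f(0).n(), f(1).n())   # opposite signs: -2 and 3 - e > 0
print(find_root(f, 0, 1))   # a root inside (0, 1)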
5.E.72. Show by an example that, in general, the intermediate value theorem fails for discontinuous functions.

Solution. For some positive real numbers $\alpha,\delta$, consider the function $f:[-\delta,\delta]\to\mathbb{R}$ with
$$f(x)=\begin{cases}\alpha, & -\delta\le x<0,\\ -\alpha, & 0\le x\le\delta.\end{cases}$$
Obviously, this function is discontinuous at $x=0$, since $\lim_{x\to0^-}f(x)=\alpha\ne\lim_{x\to0^+}f(x)=-\alpha=f(0)$. On the other hand, we see that $f(-\delta)=\alpha>0$ and $f(\delta)=-\alpha<0$, so that $f(-\delta)f(\delta)<0$, but $f(x)\ne0$ for every $x\in[-\delta,\delta]$. □

Let us now present some further applications related to the intermediate value theorem. We first describe the 1-dimensional case of the so-called Borsuk-Ulam theorem. We then study an application related to the temperature at antipodal points of the Earth.

5.E.73. 1-dimensional Borsuk-Ulam theorem. Let $S^1$ be the unit circle. Show that for any continuous function $f:S^1\to\mathbb{R}$ there exists some $x\in S^1$ with $f(-x)=f(x)$.¹⁶

Solution. Consider the function $g:S^1\to\mathbb{R}$ defined by $g(x):=f(x)-f(-x)$. If $g(x_0)=0$ for some $x_0\in S^1$, we are done. Otherwise, we may assume that $g(x)>0$ for some $x$ (the case $g(x)<0$ is treated similarly). Then $g(-x)=f(-x)-f(x)=-\big(f(x)-f(-x)\big)=-g(x)<0$. Thus, Bolzano's theorem certifies the existence of some $x_0\in S^1$ between $-x$ and $x$ satisfying $g(x_0)=0$. This proves the claim. We should mention that the result extends to $n$ dimensions, as follows. Let
$$S^n=\Big\{(x_1,\ldots,x_{n+1})\in\mathbb{R}^{n+1} : \sum_{k=1}^{n+1}x_k^2=1\Big\}\subset\mathbb{R}^{n+1}$$
be the $n$-sphere. Then, for any continuous map $f:S^n\to\mathbb{R}^n$ there exists $x\in S^n$ such that $f(-x)=f(x)$.¹⁷ □

¹⁶ Prove that this is equivalent to saying that any continuous odd function $S^1\to\mathbb{R}$ has a zero.
¹⁷ Another famous result in topology with a very similar flavour is the so-called "Brouwer fixed point theorem", which we will analyze in Chapter 9.

5.E.74. Earth's temperature and antipodal points. Prove that at any time there is a point on the globe where the temperature agrees with the temperature at the antipodal point, assuming that the temperature varies continuously.¹⁸

Solution. Consider the equator (or any great circle of the globe), and denote by $T:[0,2\pi]\to\mathbb{R}$ the temperature, so that $T(x)$ is the temperature at a point $x$ on this great circle. Notice that every great circle through any point also passes through its antipodal point, so there are infinitely many great circles through two antipodal points. Define the function $f(x)=T(x)-T(x+\pi)$, which is continuous on $[0,\pi]$, since $T$ is assumed to be continuous. If $f(0)=0$ we have found our antipodal points with equal temperature. Otherwise we may assume that $f(0)\ne0$. We see that $f(0)=T(0)-T(\pi)$ and $f(\pi)=T(\pi)-T(2\pi)$, and since $T$ is periodic with period $2\pi$, we finally get $f(\pi)=T(\pi)-T(2\pi)=T(\pi)-T(0)=-f(0)$. By Bolzano's theorem this implies that there exists some point $x_0\in(0,\pi)$ such that $f(x_0)=0$. Thus $T(x_0)=T(x_0+\pi)$, which proves the claim. □

¹⁸ Observe that the same result applies to the barometric pressure at antipodal points of the globe.

C) Material on derivatives

For convenience, we divide this paragraph into two subsections: one containing technical exercises for additional practice on derivatives, and the other focusing on applications of derivatives.

C1) Material on derivatives - Practice

5.E.75. Show that the function $f(x)=x|x|$ is differentiable for all $x\in\mathbb{R}$.

Solution.
Based on the definition of the absolute value function, one deduces that $f(x)=x^2$ for $x>0$ and $f(x)=-x^2$ for $x<0$, so for all $x\in\mathbb{R}\setminus\{0\}$ the function $f$ is differentiable (it is easy to see that $(x^2)'=2x$; see also below). We need to check the differentiability of $f$ at $x_0=0$, where $f(0)=0$. We should compute the left and right derivatives, $f'_-(0)$ and $f'_+(0)$, respectively. We see that
$$f'_-(0)=\lim_{x\to0^-}\frac{f(x)-f(0)}{x-0}=\lim_{x\to0^-}\frac{-x^2-0}{x}=0,\qquad f'_+(0)=\lim_{x\to0^+}\frac{f(x)-f(0)}{x-0}=\lim_{x\to0^+}\frac{x^2-0}{x}=0,$$
that is, the left and right derivatives exist and are equal. Hence, $f$ is also differentiable at $x_0=0$, with $f'(0)=0$. □

5.E.76. Compute the derivatives of the functions $f(x)=(\tan(x))^x$ and $g(x)=(\sin(x))^{\ln(x)}$.

Solution. By the quotient rule, and since $\tan(x)=\frac{\sin(x)}{\cos(x)}$, we easily get
$$\tan'(x)\equiv(\tan(x))'\equiv\frac{d}{dx}\tan(x)=\Big(\frac{\sin(x)}{\cos(x)}\Big)'=\frac{1}{\cos^2(x)}.$$
Moreover, we have $\tan(x)=e^{\ln(\tan(x))}$, and hence we deduce that $f(x)=\big(e^{\ln(\tan(x))}\big)^x=e^{x\ln(\tan(x))}$. Combining this with the relations $(e^{h(x)})'=h'(x)\,e^{h(x)}$ and $(\ln(h(x)))'=h'(x)/h(x)$, we get
$$f'(x)=\big(e^{x\ln(\tan(x))}\big)'=\big(x\ln(\tan(x))\big)'\cdot f(x)=\Big(\ln(\tan(x))+x\cdot\frac{\tan'(x)}{\tan(x)}\Big)f(x)=\Big(\ln(\tan(x))+x\cdot\frac{1+\tan^2(x)}{\tan(x)}\Big)f(x),$$
where in the final relation one replaces $1/\cos^2(x)$ by $1+\tan^2(x)$. In Sage a verification is given as usual:

f(x)=tan(x)**x; df=diff(f, x); show(df)

In a similar vein, for the function $g$ we get $g(x)=(\sin(x))^{\ln(x)}=\big(e^{\ln(\sin(x))}\big)^{\ln(x)}=e^{\ln(x)\cdot\ln(\sin(x))}$. Thus,
$$g'(x)=\big(e^{\ln(x)\cdot\ln(\sin(x))}\big)'=\big(\ln(x)\cdot\ln(\sin(x))\big)'\cdot g(x)=\Big(\frac1x\ln(\sin(x))+\ln(x)\cdot\frac{\cos(x)}{\sin(x)}\Big)g(x)=\Big(\frac{\ln(\sin(x))}{x}+\ln(x)\cot(x)\Big)g(x).$$
For a verification in Sage type

g(x)=sin(x)**ln(x); show(g.derivative(x))

Observe that, for your convenience, the given solution highlights two different but familiar methods in Sage that we can always use to compute derivatives. □

5.E.77. Differentiate the expression $\dfrac{\sqrt[4]{x-1}\,(x+2)^3}{e^x(x+132)^2}$, with $x>1$. ⃝

5.E.78. Calculate the first derivative of the functions given below:
(1) $a(x)=(2-x^2)\cos(x)+2x\sin(x)$, with $x\in\mathbb{R}$;
(2) $b(x)=\sin(\sin(x))$, with $x\in\mathbb{R}$;
(3) $c(x)=\sin\big(\ln(x^3+2x)\big)$, with $x\in(0,+\infty)$;
(4) $d(x)=(1+x-x^2)/(1-x+x^2)$, with $x\in\mathbb{R}$;
(5) $\varepsilon(x)=\sqrt{x\sqrt{x\sqrt{x}}}$, with $x\in(0,+\infty)$;
(6) $f(x)=\sin(\sin(\sin(x)))$, with $x\in\mathbb{R}$;
(7) $g(x)=\sqrt[3]{\sin(x)}$, with $x\in\mathbb{R}\setminus\{n\pi : n\in\mathbb{Z}\}$;
(8) $h(x)=\sqrt[3]{\frac{1+x^3}{1-x^3}}$, with $x\in\mathbb{R}\setminus\{\pm1\}$. ⃝

5.E.79. Recall that a function $f$ is called odd if $f(-x)=-f(x)$ for all $x$ in its domain, and even if $f(-x)=f(x)$ for all such $x$. Therefore, an odd function is symmetric about the origin, and an even function is symmetric with respect to the $y$-axis. (a) Present a function defined on $\mathbb{R}$ which is neither even nor odd. (b) Show that the derivative of an even (respectively odd) function is an odd (respectively even) function. ⃝

5.E.80. Extend the discussion of real polynomials of degree at most three, initiated in Problem 5.A.2, using derivatives. ⃝
So far we have discussed most of the transcendental functions: the exponential, the logarithms, the trigonometric functions and their inverses. Another remarkable class consists of the so-called "hyperbolic functions", defined by
$$\sinh(x)=\frac{e^x-e^{-x}}{2},\qquad\cosh(x)=\frac{e^x+e^{-x}}{2},\qquad\tanh(x)=\frac{\sinh(x)}{\cosh(x)},\qquad\coth(x)=\frac{\cosh(x)}{\sinh(x)},$$
where $x\in\mathbb{R}$ for the first three cases, while the domain of $\coth$ is $\mathbb{R}^*=\mathbb{R}\setminus\{0\}$. Notice that
$$\cosh^2(x)-\sinh^2(x)=1,\quad x\in\mathbb{R}.$$
The hyperbolic functions come in handy at times (especially in physics), and the next task is devoted to them; see also 5.4.12.

5.E.81. Hyperbolic functions. (a) Show that $\sinh$ (respectively, $\cosh$) is the odd (respectively, even) part of the exponential function. Moreover, prove that $\tanh(-x)=-\tanh(x)$ for all $x\in\mathbb{R}$ and $\coth(-x)=-\coth(x)$ for all $x\in\mathbb{R}^*$. (b) Determine the derivatives of the hyperbolic functions on their domains. (c) Confirm your answer in Sage and next sketch the graphs of $\sinh$ and $\cosh$ on the interval $I=[-\pi,\pi]$.

Solution. (a) Obviously, one has the relation
$$e^x=\frac{e^x-e^{-x}}{2}+\frac{e^x+e^{-x}}{2}=\sinh(x)+\cosh(x),$$
and it is easy to see that $\sinh(-x)=-\sinh(x)$, while $\cosh(-x)=\cosh(x)$, for all $x\in\mathbb{R}$. The remaining two relations are now direct. (b) This is really easy and relies on the basic rules of differentiation (see 5.3.4), in combination with the formula $(e^x)'=e^x$ for all $x\in\mathbb{R}$. Thus:
$$(\sinh(x))'=\Big(\frac{e^x-e^{-x}}{2}\Big)'=\frac12\big(e^x+e^{-x}\big)=\cosh(x),\qquad(\cosh(x))'=\Big(\frac{e^x+e^{-x}}{2}\Big)'=\frac12\big(e^x-e^{-x}\big)=\sinh(x),$$
$$(\tanh(x))'=\Big(\frac{\sinh(x)}{\cosh(x)}\Big)'=\frac{\cosh^2(x)-\sinh^2(x)}{\cosh^2(x)}=\frac{1}{\cosh^2(x)}=1-\tanh^2(x),$$
$$(\coth(x))'=\Big(\frac{\cosh(x)}{\sinh(x)}\Big)'=\frac{\sinh^2(x)-\cosh^2(x)}{\sinh^2(x)}=-\frac{1}{\sinh^2(x)}.$$
(c) As one might expect, the hyperbolic functions are built into Sage and correspond to the commands sinh, cosh, tanh and coth. Therefore, to confirm the previous computations in Sage one can combine these functions with either the command diff or the command derivative, as follows:

show(sinh(x).derivative()); show(cosh(x).derivative())
show(tanh(x).derivative()); show(coth(x).derivative())

Let us finally present the graphs of $\sinh$ and $\cosh$, but leave the corresponding coding in Sage for practice. □
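The plotting left for practice might look as follows (a sketch; the styling choices are ours):

p = plot(sinh(x), (x, -pi, pi), color="steelblue", legend_label=r"$\sinh(x)$")
p += plot(cosh(x), (x, -pi, pi), color="darkred", legend_label=r"$\cosh(x)$")
show(p)

Note how the two graphs approach each other for large $x$, reflecting the identity $\cosh(x)-\sinh(x)=e^{-x}\to0$.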
A short computation shows that its first derivative is given by f′ (x) =    2x sin ( 1 x ) − cos ( 1 x ) , x ̸= 0 ; 0 , x = 0 . Above, the first expression follows by the chain and product rule, while to verify that f′ (0) = 0, notice that lim x→0 f(x) − f(0) x = lim x→0 x sin ( 1 x ) = 0 (since 0 ≤ x sin (1 x ) ≤ |x|). On the other hand, we see that the limit limx→0 ( 2x sin(1 x ) − cos(1 x ) ) does not exist, so f′ (x) is not continuous at the origin. For another similar example see the description in 5.3.3. Now, to present the graph of f one can apply the piecewise method described in 5.C.5. Thus we can proceed with the cell f1(x)=(x^2)*sin(1/x); f2(x)=0 S=RealSet(x==0) S1=RealSet.closed_open(-1,0) S2=RealSet.open_closed(0,1) S3=S1.union(S2) F=piecewise([[S, f2(x)], [S3, f1(x)]]) p=plot(F(x), x, -1, 1, ymin=-0.1, ymax=0.1, color="black") show(p) In this block we appropriately used the command RealSet to produce the sets {0}, [−1, 0), and (0, 1], and next the command union to introduce the set [−1, 0) ∪ (0, 1]. Below, on the left is the figure produced by this block, while on the right we see a zoomed version of the graph of f near the origin. Can you realize the changes that we did in our code to get the zoomed version? □ 5.E.86. (1) Consider the function f : R → R defined by f(0) = 0 and f(x) = x arctan ( 1 x ) for all x ∈ R\{0}. Determine whether the derivative of f exists at x0 = 0. (2) Consider the function f : R → R defined by f(−1) = 0 and f(x) = ( x2 − 1 ) sin ( 1 x + 1 ) for all x ∈ R\{−1}. Determine whether the derivative of f exits at x0 = −1. (3) Present an example of a function f : R → R which is continuous on R but does not have derivatives at the points x1 = 5 and x2 = 9. (4) Find functions f and g which are not differentiable anywhere, yet their composition f ◦ g is differentiable everywhere on 468 CHAPTER 5. ESTABLISHING THE ZOO R. (5) For x > e, determine the sign of the derivative of the function f(x) = arctan ( ln(x) −1 + ln(x) ) . ⃝ 5.E.87. If f(x) = 5x + 4 sin(x), then compute (f−1 )′ (4) by means of the formula of the derivative of the inverse function presented in 5.3.6. ⃝ 5.E.88. (a) Let P(x) be a polynomial of order n ≥ 2. Prove that P(x) is divided by (x − ρ)2 with quotient π(x), i.e., P(x) = (x − ρ)2 π(x), if only if P(ρ) = P′ (ρ) = 0. (b) Show that (x − 1)2 is a factor of the polynomial P(x) = nxn+1 − xn−1 − (n2 + 1)x + n2 − n + 2, with n ≥ 2. Solution. (a) If P(x) = (x − ρ)2 π(x) then by differentiation we get P′ (x) = 2(x − ρ)π(x) + (x − ρ)2 π′ (x). Thus P′ (ρ) = 0 = P(ρ). Conversely, assume that P(ρ) = 0 = P′ (ρ). From the first relation we get P(x) = (x − ρ)ν(x) for some polynomial ν(x), and hence P′ (x) = ν(x) + (x − ρ)ν′ (x). But then the second relation can be equivalently written as ν(ρ) + (ρ − ρ)ν′ (ρ) = 0, that is ν(ρ) = 0. Therefore ρ is a root of ν(x) and hence ν(x) = (x − ρ)π(x) for some polynomial π(x). It follows that P(x) = (x − ρ)2 π(x), as it is required. (b) We see that P(1) = n − 1 − n2 − 1 + n2 − n + 2 = 0. Moreover, the first derivative of P is given by P′ (x) = n(n + 1)xn − (n − 1)xn−2 − (n2 + 1), and thus P′ (1) = n(n + 1) − (n − 1) − n2 − 1 = n2 + n − n + 1 − n2 − 1 = 0. The claim now follows from the statement in (a). □ 5.E.89. Let P be a cubic polynomial satisfying the conditions P(0) = 1, P′ (0) = 1, P(1) = 2a + 2, P′ (1) = 5a + 1. Find the values of the parameter a ∈ R for which the polynomial P is strictly monotonic on the whole real line. ⃝ 5.E.90. Suppose that f, g : R → R are two functions differentiable at a ∈ R. 
Prove that
$$\lim_{x\to a}\frac{x^2 f(a) - a^2 f(x)}{x - a} = 2af(a) - a^2 f'(a)\,,$$
and, more generally,
$$\lim_{x\to a}\frac{g(x)f(a) - g(a)f(x)}{x - a} = g'(a)f(a) - f'(a)g(a)\,.$$

Solution. For the first limit we have
$$\lim_{x\to a}\frac{x^2 f(a) - a^2 f(x)}{x - a} = \lim_{x\to a}\frac{x^2 f(a) - a^2 f(x) - a^2 f(a) + a^2 f(a)}{x - a} = \lim_{x\to a}\frac{(x^2 - a^2)f(a) - a^2\big(f(x) - f(a)\big)}{x - a}$$
$$= f(a)\lim_{x\to a}\frac{x^2 - a^2}{x - a} - a^2\lim_{x\to a}\frac{f(x) - f(a)}{x - a} = f(a)\lim_{x\to a}(x + a) - a^2 f'(a) = 2af(a) - a^2 f'(a)\,.$$
For the general case we apply the same trick:
$$\lim_{x\to a}\frac{g(x)f(a) - g(a)f(x)}{x - a} = \lim_{x\to a}\frac{\big(g(x) - g(a)\big)f(a) - g(a)\big(f(x) - f(a)\big)}{x - a} = f(a)\lim_{x\to a}\frac{g(x) - g(a)}{x - a} - g(a)\lim_{x\to a}\frac{f(x) - f(a)}{x - a} = f(a)g'(a) - g(a)f'(a)\,. \qquad \square$$

5.E.91. Let $f : \mathbb{R} \to \mathbb{R}$ be a function differentiable at $0$, satisfying $f(x)f(y) \ne 1$ for all $x, y \in \mathbb{R}$ and
$$f(x + y)\big(1 - f(x)f(y)\big) = f(x) + f(y)\,, \quad x, y \in \mathbb{R}\,.$$
Show that $f$ is differentiable on the whole real line. ⃝

5.E.92. Interactive tangent lines via Sage. For the polynomial $P(x) = x^4 - 2x$ use Sage to produce an interactive graph for the tangent line of $P$ at a movable point $x_0$ lying in the interval $[-2, 2]$. (Hints: Combine the routine constructed in 5.C.14 with the method presented in ?? for creating an interactive plot.)

Solution. Let us use the routine introduced in 5.C.14 for the construction of the tangent line and combine this with the commands @interact and slider, which are responsible for the creation and the control of the interactive environment, respectively. In particular, the command @interact always appears on a separate line, usually right before the def command that starts our subroutine. On the other hand, the slider command converts the $x_0$ input from a numerical input into a slider. The coding goes as follows:
P(x)=x^4-2*x; P1(x)=diff(P(x), x)
@interact
def Tangentat(x_0 = slider(-2, 2, 0.1, -1.5, label="x-coordinate")):
    y_0=P(x_0)
    m=P1(x_0)
    c=y_0 - m*x_0
    l(x)=m*x+c
    Q = plot(P(x), -2, 2, color="blue", ymin=-2, ymax=20)
    Q+=plot(l(x), -2, 2, color="black", ymin=-2, ymax=20)
    Q+=point((x_0, y_0), color="black", size=50)
    Q.show()
    show("x_0=", x_0)
    show("tang. line: l(x) = (", m, ")*x + (", c, ")")
Recall that inside the command slider the first two parameters determine the interval where the variable $x_0$ lives, in our case $-2 \le x_0 \le 2$. The third parameter determines the granularity of the slider; our choice represents a movement with step $0.1$. Notice that in CoCalc.com one should call the subroutine by typing Tangentat() at the very end of the block presented above (without specifying $x_0$ inside the parentheses). However, in https://sagecell.sagemath.org this is not necessary and one can simply move the slider around. Test Sage's output yourself. □

5.E.93. Consider the functions $f(x) = 2 + \ln(x - 1)$ and $g(x) = 2 - \ln(x - 1)$, with $x \in A = (1, +\infty)$. (a) Use Sage to find the common points of the graphs of $f$ and $g$. (b) Deduce the monotonicity of $f$, $g$, and show that $x = 1$ is an asymptote without slope (vertical asymptote) of both $C_f$ and $C_g$. Moreover, deduce that $f(A) = g(A) = \mathbb{R}$. (c) Show that the tangent lines of $f$, $g$ at points with the same $y$-coordinate are perpendicular.

Solution. (a) This can be done by solving the equation $f(x) = g(x)$, which is equivalent to $\ln(x - 1) = 0$ and has $x = 2$ as a unique solution. Then we see that $f(2) = g(2) = 2$, hence the common point of $C_f$ and $C_g$ is the point $P = [2, 2]$.
A solution via Sage occurs by the cell
f(x)=2+ln(x-1); g(x)=2-ln(x-1); solve(f(x)==g(x), x)
(b) We see that $f'(x) = \frac{1}{x-1} > 0$ and $g'(x) = -\frac{1}{x-1} < 0$ for all reals $x > 1$. Thus $f$ is strictly increasing on $A$, while $g$ is strictly decreasing on $A$. Moreover, you can compute
$$a = \lim_{x\to 1^+} f(x) = -\infty\,, \quad b = \lim_{x\to\infty} f(x) = +\infty\,, \quad c = \lim_{x\to 1^+} g(x) = +\infty\,, \quad d = \lim_{x\to\infty} g(x) = -\infty\,.$$
These limits can be confirmed in Sage, as usual:
f(x)=2+ln(x-1); g(x)=2-ln(x-1)
a=lim(f(x), x=1, dir="right"); b=lim(f(x), x=oo)
c=lim(g(x), x=1, dir="right"); d=lim(g(x), x=oo)
show(a, ",", b, ",", c, ",", d)
Both functions $f$, $g$ are continuous, hence by the limits above we deduce that $f(A) = \mathbb{R}$ and $g(A) = \mathbb{R}$. Moreover, the limits $a$, $c$ presented above show that the graphs of $f$ and $g$ tend toward $\mp\infty$ as the inputs approach $1$, and hence the line $x = 1$ is a vertical asymptote for both $C_f$ and $C_g$ (we will discuss more details on asymptotes in 6.1.8). To illustrate the behaviour of $f$, $g$, we used Sage to sketch the graphs $C_f$, $C_g$ for $1 \le x \le 50$, which we present together here:

(c) To get points on $C_f$, $C_g$ with the same $y$-coordinate it is sufficient to draw horizontal lines $y = c$, $c \in \mathbb{R}$; this works due to the type of monotonicity of $f$ and $g$. Let $Q$, $R$ be the corresponding intersection points of the line $y = c$ with $C_f$ and $C_g$, respectively. We may assume that $Q = [a, f(a)] = [a, c]$ and $R = [b, g(b)] = [b, c]$. Notice that for $a = 2 = b$ we have $P = Q = R$. To get a better view around the intersection point of $C_f$ and $C_g$, see the figure below (on the l.h.s.). In this figure, to be more precise, we fix the points $Q$, $R$ to be elements of the line $y = 2.5$ (hence one can specify $Q$, $R$ explicitly, as we did); from the mathematical point of view it is not necessary to fix $c$. At the points $Q = [a, c] \in C_f$ and $R = [b, c] \in C_g$ we have $f'(a) = \frac{1}{a-1}$ and $g'(b) = -\frac{1}{b-1}$, and to prove the claim it is sufficient to show that $f'(a)g'(b) = -1$. Because $f(a) = c = g(b)$ we get $a - 1 = e^{c-2}$, $b - 1 = e^{2-c}$, and the result follows (for an illustration see the picture on the r.h.s., which includes the tangent lines of $C_f$, $C_g$ at the points $Q$, $R$, respectively):
$$f'(a)g'(b) = -\frac{1}{(a-1)(b-1)} = -\frac{1}{e^{c-2}\,e^{2-c}} = -\frac{1}{e^0} = -1\,.$$
For your convenience, let us finally present the code used to construct the figure including the tangent lines.
f(x)=2+ln(x-1); g(x)=2-ln(x-1)
p=plot(f(x), x, 1, 5, rgbcolor=(0.2,0.3,0.6))
p+=plot(g(x), x, 1, 5, rgbcolor=(0.6,0.3,0.2))
p+=point((2, 2), color="black", size=30)
p+=text(r"$f(x)$", (4.8, 3.55), rgbcolor=(0.2,0.3,0.6), fontsize=13)
p+=text(r"$g(x)$", (4.8, 0.45), rgbcolor=(0.6,0.3,0.2), fontsize=13)
p+=text(r"$P$", (2.35, 2.1), color="black", fontsize=12)
p+=line([(0.5, 1), (4.5, 1)], rgbcolor=(0.2,0.2,0.2), linestyle="--")
p+=line([(0.5, 2.5), (4.5, 2.5)], rgbcolor=(0.2,0.2,0.2), linestyle="--")
p+=line([(0.5, 2), (4.5, 2)], rgbcolor=(0.2,0.2,0.2), linestyle="--")
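# (the plot continues below: we mark the points Q and R used in part (c)
#  and draw the dashed tangent lines of C_f and C_g at these points)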
p+=point((e^(1/2) + 1, f(e^(1/2) + 1)), color="black", size=30)
p+=text(r"$Q$", (e^(1/2) + 1, f(e^(1/2) + 1)+0.2), color="black", fontsize=13)
p+=point(((e^(1/2) + 1)*e^(-1/2), g((e^(1/2) + 1)*e^(-1/2))), color="black", size=30)
p+=text(r"$R$", ((e^(1/2) + 1)*e^(-1/2)+0.05, g((e^(1/2) + 1)*e^(-1/2))+0.2), color="black", fontsize=13)
p+=text(r"$y=2.5$", (4.7, 2.65), color="black", fontsize=12)
p+=text(r"$y=2$", (4.65, 2.15), color="black", fontsize=12)
p+=point(((e + 1)*e^(-1), f((e + 1)*e^(-1))), color="black", size=30)
p+=point((e + 1, g(e + 1)), color="black", size=30)
p+=text(r"$y=1$", (4.65, 1.15), color="black", fontsize=13)
xQ=e^(1/2) + 1
tangf=f(xQ)+diff(f, x)(x=xQ)*(x-xQ)
p+=plot(tangf, x, 1, 5, rgbcolor=(0.2,0.2,0.2), linestyle="--")
xR=(e^(1/2) + 1)*e^(-1/2)
tangg=g(xR)+diff(g, x)(x=xR)*(x-xR)
p+=plot(tangg, x, 1, 5, rgbcolor=(0.2,0.2,0.2), linestyle="--")
p.show(ymin=-1, ymax=5, ticks=[1, 1], aspect_ratio=1)
□

5.E.94. Consider the function $f(x) = e^{\sqrt{\alpha}\cos(x)}$, where $\alpha > 0$ is some fixed constant. Provide a proof of the following facts: (a) $f$ is continuous, even, and periodic with period independent of $\alpha$; sketch the graph of $f$ for $\alpha = 3$ and $x \in [-2\pi, 2\pi]$. (b) $f$ is differentiable for all $x \in \mathbb{R}$; find the formula of $f'(x)$. (c) $f$ is strictly decreasing in $(0, \pi)$ and strictly increasing in $(\pi, 2\pi)$; in particular, the point $x_0 = \pi$ provides a local minimum of $f$.

Solution. (a) The domain of $f$ is the whole real line $\mathbb{R}$. Since $f$ is a composition of continuous functions, it is also continuous. It is even since $\cos(-x) = \cos(x)$, which implies that $f(-x) = f(x)$ for all $x \in \mathbb{R}$. Suppose that $T \in \mathbb{R}\setminus\{0\}$ satisfies $f(x + T) = f(x)$ for all $x \in \mathbb{R}$. Since the exponential map is an injection, this gives $\sqrt{\alpha}\cos(x + T) = \sqrt{\alpha}\cos(x)$, that is, $\cos(x + T) = \cos(x)$ for all $x \in \mathbb{R}$. Thus $x + T = 2k\pi + x$ or $x + T = 2k\pi - x$, which implies that $T = 2k\pi$ or $T = 2k\pi - 2x$, for some $k \in \mathbb{Z}$. The second possibility cannot hold for all $x$ simultaneously, so $f$ is periodic with period $2k\pi$; for $k = 1$ we obtain $T = 2\pi$, which is the smallest positive period of $f$. A visual verification of this fact occurs by sketching the graph of $f$ for some fixed $\alpha$; below we present the case $\alpha = 3$. (b) The function $f$ is differentiable on $\mathbb{R}$ as a composition of differentiable functions. By the chain rule we see that
$$f'(x) = \big(\sqrt{\alpha}\cos(x)\big)'\, e^{\sqrt{\alpha}\cos(x)} = -\sqrt{\alpha}\sin(x)\, e^{\sqrt{\alpha}\cos(x)}\,, \quad x \in \mathbb{R}\,.$$
Or we can compute this in Sage (writing a for $\alpha$), via the cell
a=var("a"); f(x)=e**(sqrt(a)*cos(x)); show(diff(f, x))
(c) In $[0, 2\pi]$ the equation $f'(x) = 0$ has three solutions, namely the points $x = 0$, $x = \pi$ and $x = 2\pi$; see also the figure on the r.h.s., where we have included the graph of $f'(x)$. Moreover, we see that $f'(x) < 0$ for all $x \in (0, \pi)$ and $f'(x) > 0$ for all $x \in (\pi, 2\pi)$. It follows that $f$ is decreasing in the first interval and increasing in the second one. Hence $x_0 = \pi$ is a local minimum of $f$, giving the minimal value $f(\pi) = e^{-\sqrt{\alpha}}$ (the other two stationary points are local maxima of $f$ in $[0, 2\pi]$). □

5.E.95. (a) Determine all local extrema of the function $f(x) = x\ln^2(x)$ with domain $(0, +\infty)$. (b) Determine the maximum value of the function $f(x) = e^{-x}\cdot\sqrt[3]{3x}$, with $x \in \mathbb{R}$. (c) Find the absolute extrema of the polynomial $p(x) = x^3 - 3x + 2$ on the interval $[-3, 2]$. (d) Is there any $\alpha \in \mathbb{R}$ such that the function $h(x) = \alpha x + \sin(x)$, $x \in [0, 2\pi]$, has a global minimum at $x_0 = 5\pi/4$? ⃝

5.E.96. Show that the function $f(x) = e^{-x}$ is strictly decreasing for all $x \in \mathbb{R}$. ⃝
5.E.97. Find all the solutions of the equation $x^{2025} + 2024x = 2025 - \ln(x)$. ⃝

5.E.98. Let $I = [a, b] \subset \mathbb{R}$ be a closed interval and let $f : I \to \mathbb{R}$ be a continuous function. Suppose that $f$ is differentiable on the open interval $(a, b)$ and satisfies $f(a) = f(b) = 0$. Show that for given $\kappa \in \mathbb{R}$ there exists $x_0 \in (a, b)$ such that $\kappa f(x_0) + f'(x_0) = 0$.

Solution. For fixed $\kappa \in \mathbb{R}$ consider the function $g(x) = e^{\kappa x} f(x)$, with $x \in [a, b]$. By assumption $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, and so is $g$. We see that
$$g'(x) = \kappa\, e^{\kappa x} f(x) + e^{\kappa x} f'(x) = e^{\kappa x}\big(\kappa f(x) + f'(x)\big)\,, \quad x \in (a, b)\,. \qquad (*)$$
Moreover, $g(a) = 0 = g(b)$, and by Rolle's theorem (see 5.3.8) one can find some $x_0 \in (a, b)$ such that $g'(x_0) = 0$. The result is then a simple consequence of $(*)$. □

5.E.99. Let $f : [0, 4] \to \mathbb{R}$ be a differentiable function with $f(0) = f(4) = 0$, and set $g(x) := \frac{x(4-x)}{4} f(2)$, for any $x \in [0, 4]$. Show that there exist $x_1, x_2 \in (0, 4)$ with $x_1 \ne x_2$ satisfying $f'(x_1) = g'(x_1)$ and $f'(x_2) = g'(x_2)$.

Solution. The functions $f$, $g$ are both differentiable on $[0, 4]$, and so is their difference $h(x) := f(x) - g(x) = f(x) - \frac{x(4-x)}{4} f(2)$. Moreover, we see that $h(0) = f(0) - g(0) = 0$, $h(4) = f(4) - g(4) = 0$ and $h(2) = f(2) - \frac{2\cdot 2}{4} f(2) = 0$. Thus $h$ satisfies the conditions of Rolle's theorem on the intervals $[0, 2]$ and $[2, 4]$. This means that there exists $x_1 \in (0, 2)$ such that $h'(x_1) = 0$, which is equivalent to $f'(x_1) = g'(x_1)$, and $x_2 \in (2, 4)$ such that $h'(x_2) = 0$, that is, $f'(x_2) = g'(x_2)$. □

5.E.100. Suppose that $f : I \to \mathbb{R}$ is a continuous function on $I = [a, b]$, differentiable on $(a, b)$, and such that $\lim_{x\to a} f'(x) = \kappa \in \mathbb{R}$. Show that $f$ is also differentiable at $x = a$ with $f'(a) = \kappa$.

Solution. We can apply the mean value theorem to $f$ on the interval $[a, a + h]$ for small enough $h > 0$. Hence there exists $c_h \in (a, a + h)$ such that
$$f'(c_h) = \frac{f(a + h) - f(a)}{h}\,. \qquad (*)$$
Moreover, we see that $c_h \to a$ as $h \to 0$, and hence
$$f'(a) = \lim_{h\to 0}\frac{f(a + h) - f(a)}{h} \overset{(*)}{=} \lim_{h\to 0} f'(c_h) = \lim_{x\to a} f'(x) = \kappa\,. \qquad \square$$

5.E.101. Show that the polynomial $P(x) = x^5 - x^4 + 2x^3 - x^2 + x + 1$ has exactly one real root. Next use Sage to approximate this root.

Solution. Any polynomial of odd degree has at least one real root. Indeed, the polynomial at hand has degree 5, hence by the fundamental theorem of algebra (see 12.4.20) $P$ has five roots over $\mathbb{C}$. Moreover, the non-real roots of a polynomial with real coefficients come in conjugate pairs. Therefore $P$ must have at least one real root, say $x_0$. Suppose now that there exists a second real root $x_0'$. Then, according to Rolle's theorem, there must be some $c \in (x_0, x_0')$ such that $P'(c) = 0$. But
$$P'(x) = 5x^4 - 4x^3 + 6x^2 - 2x + 1 = 2x^2(x - 1)^2 + 3x^4 + 3x^2 + (x - 1)^2 > 0$$
for all $x \in \mathbb{R}$, a contradiction. Therefore $P$ admits exactly one real root. In other words, $P$ is strictly increasing and thus its graph crosses the $x$-axis just once. From the graph of $P$ given below we see that the unique root of $P$ should be near $-0.5$. In Sage we can obtain this root with the find_root function, as follows:
find_root(x^5-x^4+2*x^3-x^2+x+1, -1, 1)
According to Sage's answer, the unique root of $P$ is $x_0 \approx -0.47748$. □

5.E.102. Let $f : \mathbb{R} \to (0, +\infty)$ be a continuously differentiable function with $f(0) \ne 0$. Prove that there exists $\xi \in (0, 1)$ satisfying the relation
$$e^{f'(\xi)}\, f(0)^{f(\xi)} = f(1)^{f(\xi)}\,.$$

Solution. Let us rewrite the equation as $e^{f'(\xi)} = \Big(\frac{f(1)}{f(0)}\Big)^{f(\xi)}$.
By applying the logarithm to both sides we get
$$f'(\xi) = \ln\Big(\frac{f(1)}{f(0)}\Big)^{f(\xi)} \iff f'(\xi) = f(\xi)\ln\Big(\frac{f(1)}{f(0)}\Big) \iff \frac{f'(\xi)}{f(\xi)} = \ln(f(1)) - \ln(f(0))\,.$$
It is now obvious that the existence of such a $\xi$ is guaranteed by applying the Lagrange mean value theorem to the function $h(x) = \ln(f(x))$, which is continuous on $[0, 1]$ and differentiable at all points inside this interval. □

5.E.103. Show that any real $x \ge 0$ satisfies $x \ge \ln(x + 1)$. ⃝

5.E.104. Let $\alpha$, $\beta$ be reals satisfying $0 < \alpha < \beta$. Show that $\alpha x^\beta - \beta x^\alpha > \alpha - \beta$, for all $x > 1$. ⃝

5.E.105. Show that all $x \in \mathbb{R}$ satisfy $e^x - x \ge 1$.

Solution. Consider the function $f(x) = e^x - x + 1$. This is differentiable in $\mathbb{R}$ with $f'(x) = e^x - 1$. We see that the equation $f'(x) = 0$ has a unique solution, given by $x = 0$. In particular, having in mind the graph of $e^x$, we see that $e^x < 1$ for $x < 0$, so $f'(x) < 0$ for $x \in (-\infty, 0)$, and $e^x > 1$ for $x > 0$, which means that $f'(x) > 0$ for $x \in (0, +\infty)$. Hence $f$ is strictly decreasing in the first interval and strictly increasing in the second one, see the figure below. Thus, for $x > 0$ we have $f(x) > f(0)$, so $e^x - x + 1 > 2$, that is, $e^x - x > 1$; for $x < 0$ we again have $f(x) > f(0)$ and so $e^x - x + 1 > 2$. So for $x \ne 0$ we get $e^x - x > 1$. Since for $x = 0$ this holds as an equality, we have finally proved $e^x - x \ge 1$ for all $x \in \mathbb{R}$. Remark: The graphs of $f$, $g$ sketched below indicate that one may use the function $g(x) = e^x - x - 1$ instead of $f$ and apply the same procedure. Try to verify this claim. □

5.E.106. For $x > y > 0$ prove the inequality
$$\frac{x + y}{2} > \frac{x - y}{\ln(x) - \ln(y)}\,.$$

Solution. Rewrite the given inequality as $\frac{\ln(x) - \ln(y)}{x - y} > \frac{2}{x + y}$ and set $t = x - y > 0$. Then we get
$$\frac{\ln(t + y) - \ln(y)}{t} > \frac{2}{t + 2y} \iff \frac{\ln\big(\frac{t+y}{y}\big)}{t} > \frac{2}{t + 2y} \iff \ln\Big(1 + \frac{t}{y}\Big) > \frac{2t}{t + 2y} \iff \ln\Big(1 + \frac{t}{y}\Big) > \frac{2\,\frac{t}{y}}{2 + \frac{t}{y}}\,.$$
Putting now $\psi = \frac{t}{y}$, we have $\psi > 0$ and the initial inequality can be equivalently transformed to
$$\ln(1 + \psi) > \frac{2\psi}{2 + \psi}\,, \quad \psi > 0\,. \qquad (\flat)$$
Thus now you may consider the function $f(\psi) = \ln(1 + \psi) - \frac{2\psi}{2 + \psi}$, with $\psi > -1$. Its first derivative is given by
$$f'(\psi) = \frac{1}{\psi + 1} - \frac{2(2 + \psi) - 2\psi}{(2 + \psi)^2} = \frac{\psi^2}{(\psi + 1)(2 + \psi)^2}\,,$$
which is positive for all $\psi > -1$ except $\psi = 0$. Hence $f$ is strictly increasing on $(-1, +\infty)$, and in particular for $\psi > 0$ we have $f(\psi) > f(0) = 0$, which gives us the inequality presented in $(\flat)$. □

5.E.107. If $p, q \in \mathbb{R}$ are such that $0 < p < q$, show that
$$\ln\Big(\frac{p + q}{2}\Big) < \frac{p}{p + q}\ln(p) + \frac{q}{p + q}\ln(q)\,. \qquad (\sharp)$$

Solution. Consider the 1-parameter family $f_q(x) = \ln\big(\frac{x+q}{2}\big) - \frac{x}{x+q}\ln(x) - \frac{q}{x+q}\ln(q)$, with $q > 0$ and domain $A = (0, +\infty)$. Notice that $f_q(q) = 0$. We will prove that $f_q$ is strictly increasing on $(0, q]$, which for $0 < p < q$ gives the inequality $f_q(p) < f_q(q)$; it is then easy to see that the latter is equivalent to $(\sharp)$. Therefore let us compute the first derivative of $f_q$:
$$f_q'(x) = \frac{1}{x + q} - \frac{q\ln(x)}{(x + q)^2} - \frac{1}{x + q} + \frac{q\ln(q)}{(x + q)^2} = \frac{q\big(\ln(q) - \ln(x)\big)}{(x + q)^2} = \frac{q\ln\big(\frac{q}{x}\big)}{(x + q)^2}\,.$$
The critical points of $f_q$ are the solutions of $f_q'(x) = 0$, and we see that
$$f_q'(x) = 0 \iff \ln\Big(\frac{q}{x}\Big) = 0 \iff \frac{q}{x} = 1 \iff x = q\,.$$
Now, for $x > q$ we have $\ln\big(\frac{q}{x}\big) < 0$, hence $f_q'(x) < 0$ and $f_q$ is strictly decreasing on $[q, +\infty)$, for all $q > 0$. On the other hand, for $0 < x < q$ we get $f_q'(x) > 0$, therefore $f_q$ is strictly increasing on $(0, q]$, for all $q > 0$. Thus, in fact, at $x_0 = q$ the function $f_q$ attains its unique maximum, which equals $f_q(q) = 0$ (hence $f_q$ is non-positive on all of its domain, for all $q > 0$).
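Before plotting, one may let Sage confirm the formula for $f_q'(x)$ obtained above; a small verification cell reads:
var("x, q")
f=ln((x+q)/2)-(x/(x+q))*ln(x)-(q/(x+q))*ln(q)
show(diff(f, x).factor())  # should simplify to q*(log(q)-log(x))/(x+q)^2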
Let us finally sketch the graph of $f_q$ for $q = 1, 2, \ldots, 5$, so that we have an illustrated confirmation of the monotonicity of $f_q$ for these values of the parameter $q$. To obtain this figure we have used Sage and a bit of programming (relying on the command for), which goes as follows:
var("q"); f(x, q)=ln((x+q)/2)-(x/(x+q))*ln(x)-(q/(x+q))*ln(q)
a=plot([f(x, q) for q in [1,2,..,5]], (x, 0, 10), ymax=0.05, ymin=-0.4)
a+=point([(q, f(q, q)) for q in [1,2,..,5]], size=20, color="black")
a+=text(r"$q=1$", (1, 0+0.02), fontsize=12, color="black")
a+=text(r"$q=2$", (2, 0+0.02), fontsize=12, color="black")
a+=text(r"$q=3$", (3, 0+0.02), fontsize=12, color="black")
a+=text(r"$q=4$", (4, 0+0.02), fontsize=12, color="black")
a+=text(r"$q=5$", (5, 0+0.02), fontsize=12, color="black"); show(a)
□

When applying l'Hospital's rule, caution is always necessary. For instance, l'Hospital's rule can yield a non-existent limit even when the original limit exists. Let us illustrate this remarkable scenario with an example.

5.E.108. Consider the function $f(x) = \frac{x + \sin x}{x}$. (a) Which type of indeterminate form corresponds to the limit $\lim_{x\to+\infty} f(x)$? (b) Show that in this case l'Hospital's rule leads to a non-existing limit, although $\lim_{x\to+\infty} f(x) = 1$. ⃝

5.E.109. Using l'Hospital's rule, provide an alternative proof of the relation $\lim_{n\to+\infty}\sqrt[n]{n} = 1$ (see 5.B.8). Moreover, apply the same method to prove that $\lim_{n\to+\infty}\sqrt[n]{\ln(n)} = 1$ (see 5.E.18), and that $\lim_{n\to+\infty}\big(n\sin(\frac{1}{n})\big) = 1$.

Solution. Recall that $\sqrt[n]{n} = n^{1/n}$. Thus, as $n \to +\infty$ this gives the indeterminate form $(+\infty)^0$. Consider the function $f(x) = x^{1/x}$ with $x > 0$. Taking the logarithm of both sides we get $\ln(f(x)) = \ln(x^{1/x}) = \frac{1}{x}\ln(x)$, thus $f(x) = e^{\frac{\ln(x)}{x}}$. Notice that by l'Hospital's rule we get
$$\lim_{x\to+\infty}\frac{\ln(x)}{x} \overset{\frac{+\infty}{+\infty}}{=} \lim_{x\to+\infty}\frac{1}{x} = 0\,.$$
Therefore
$$\lim_{x\to+\infty} f(x) = \lim_{x\to+\infty} e^{\frac{\ln(x)}{x}} = e^{\lim_{x\to+\infty}\frac{\ln(x)}{x}} = e^0 = 1\,,$$
and applying this to our sequence we get the result: $\lim_{n\to+\infty}\sqrt[n]{n} = \lim_{n\to+\infty} n^{\frac{1}{n}} = \lim_{n\to+\infty} e^{\frac{\ln(n)}{n}} = 1$. Let us use l'Hospital's rule to prove also that $\lim_{n\to+\infty}\sqrt[n]{\ln(n)} = 1$, and leave the third limit for practice. Since $\sqrt[n]{\ln(n)} = (\ln(n))^{\frac{1}{n}}$, the limit again has the indeterminate form $(+\infty)^0$. So let us consider the function $g(x) = (\ln(x))^{\frac{1}{x}}$. Then we can write $g(x) = e^{\frac{1}{x}\ln(\ln(x))}$. Hence, to compute $\lim_{x\to+\infty} g(x)$ it is sufficient to compute $\lim_{x\to+\infty}\frac{\ln(\ln(x))}{x}$. This has the indeterminate form $\frac{+\infty}{+\infty}$, so by applying l'Hospital's rule we get
$$\lim_{x\to+\infty}\frac{\ln(\ln(x))}{x} = \lim_{x\to+\infty}\frac{\big(\ln(\ln(x))\big)'}{x'} = \lim_{x\to+\infty}\frac{1}{x\ln(x)} = 0\,.$$
Therefore $\lim_{x\to\infty} g(x) = e^0 = 1$, and this implies that $\lim_{n\to+\infty}\sqrt[n]{\ln(n)} = 1$. □

5.E.110. Use l'Hospital's rule to evaluate the limit $\lim_{x\to 0}\big(\cot(x) - \frac{1}{x}\big)$.

Solution. Since $\cot(x) = \frac{\cos(x)}{\sin(x)}$, by applying l'Hospital's rule repeatedly one has
$$\lim_{x\to 0}\Big(\cot(x) - \frac{1}{x}\Big) = \lim_{x\to 0}\frac{x\cos(x) - \sin(x)}{x\sin(x)} \overset{\frac{0}{0}}{=} \lim_{x\to 0}\frac{\big(x\cos(x) - \sin(x)\big)'}{\big(x\sin(x)\big)'} = \lim_{x\to 0}\frac{-x\sin(x)}{\sin(x) + x\cos(x)}$$
$$\overset{\frac{0}{0}}{=} \lim_{x\to 0}\frac{\big(-x\sin(x)\big)'}{\big(\sin(x) + x\cos(x)\big)'} = \lim_{x\to 0}\frac{-\sin(x) - x\cos(x)}{\cos(x) + \cos(x) - x\sin(x)} = \frac{0 - 0}{1 + 1 - 0} = 0\,. \qquad \square$$

5.E.111. Using l'Hospital's rule, where adequate, determine the limits
$$a = \lim_{x\to 1^-}(1 - x)\tan\Big(\frac{\pi x}{2}\Big)\,, \quad b = \lim_{x\to\frac{\pi}{2}^-}\Big(\frac{\pi}{2} - x\Big)\tan(x)\,, \quad c = \lim_{x\to 1}\Big(\frac{1}{2\ln(x)} - \frac{1}{x^2 - 1}\Big)\,, \quad d = \lim_{x\to+\infty}\Big(\cos\frac{2}{x}\Big)^{x^2}\,. \quad ⃝$$
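After working such limits out by hand, they can be double-checked in Sage with the limit command; for the four limits above one possible cell is:
show(limit((1-x)*tan(pi*x/2), x=1, dir="minus"))
show(limit((pi/2-x)*tan(x), x=pi/2, dir="minus"))
show(limit(1/(2*ln(x)) - 1/(x^2-1), x=1))
show(limit(cos(2/x)^(x^2), x=oo))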
Next we present a result which is analogous to l'Hospital's rule and which can be used to establish the monotonicity of a ratio $f/g$ of two functions $f, g : [a, b] \to \mathbb{R}$, both differentiable on $(a, b)$, with $g'(x) \ne 0$ for all $x \in (a, b)$. In the literature it is commonly referred to as the "l'Hopital Monotone Rule", or "LMR" for short, and despite its numerous applications in various contexts it often seems to be overlooked. Notice that the rule holds true even in cases where $a$ or $b$ is infinite, and in particular when $f$, $g$ are defined only on the open interval $(a, b)$ (provided that $g'$ has constant sign).

5.E.112. Suppose that $a$, $b$ are such that $-\infty \le a < b \le \infty$ and let $f, g : (a, b) \to \mathbb{R}$ be continuous functions which are differentiable on $(a, b)$. Suppose also that $\lim_{x\to a^+} f(x) = \lim_{x\to a^+} g(x) = 0$ (or $\lim_{x\to b^-} f(x) = \lim_{x\to b^-} g(x) = 0$), and that the derivative $g'$ is nonzero and does not change sign on $(a, b)$. Prove that if $f'/g'$ is (strictly) increasing or (strictly) decreasing on $(a, b)$, then so is $f/g$.

Solution. We will analyze the case with $\lim_{x\to a^+} f(x) = \lim_{x\to a^+} g(x) = 0$; the case $f(b^-) = g(b^-) = 0$ is treated similarly. Assume also that $f'/g'$ is strictly increasing; the proof in the remaining cases is analogous. We will prove that
$$\Big(\frac{f}{g}\Big)'(x) > 0\,, \quad \text{for all } x \in (a, b)\,,$$
and hence that the ratio $f/g$ is strictly increasing. Fix some arbitrary $x \in (a, b)$ and consider the function $h_x : (a, b) \to \mathbb{R}$, defined by
$$h_x(y) = f'(x)g(y) - f(y)g'(x)\,, \quad y \in (a, b)\,.$$
Obviously, $h_x$ is differentiable on $(a, b)$ (and hence also continuous), with
$$h_x'(y) = f'(x)g'(y) - f'(y)g'(x) = g'(x)g'(y)\Big(\frac{f'(x)}{g'(x)} - \frac{f'(y)}{g'(y)}\Big)\,. \qquad (*)$$
By the hypothesis, $g'$ is non-zero and has constant sign, that is, only one of the following is possible: $g'(x) > 0$ for all $x \in (a, b)$, or $g'(x) < 0$ for all $x \in (a, b)$; in both cases $g'(x)g'(y) > 0$. Moreover, $f'/g'$ is strictly increasing, so for $x > y$ the inequality $\frac{f'(x)}{g'(x)} > \frac{f'(y)}{g'(y)}$ holds. Therefore, by $(*)$ we deduce that $h_x'(y) > 0$ for all $y \in (a, x)$, which means that the function $h_x$ is strictly increasing on $(a, x)$; in fact, by continuity, $h_x$ is strictly increasing on $(a, x]$. Moreover, by our hypothesis we get
$$\lim_{y\to a^+} h_x(y) = \lim_{y\to a^+}\big(f'(x)g(y) - f(y)g'(x)\big) = 0\,,$$
and hence, all together, this implies that $h_x(x) > 0$. Thus
$$\Big(\frac{f}{g}\Big)'(x) = \frac{f'(x)g(x) - f(x)g'(x)}{g^2(x)} = \frac{h_x(x)}{g^2(x)} > 0\,,$$
and since $x \in (a, b)$ is arbitrary, our claim follows. □

5.E.113. (a) Based on the l'Hopital monotone rule, study the monotonicity of $\kappa = f/g$ on the specified domain $A$, when $f$, $g$, $A$ are given as follows: (1) $f(x) = e^{2x} - 1$, $g(x) = \ln(x + 1)$, $A = (-1, 0)\cup(0, +\infty)$; (2) $f(x) = e^x - 1$, $g(x) = 2x + e^x - 1$, $A = (-\infty, 0)\cup(0, \infty)$. Then confirm your conclusions by specifying the sign of the first derivative $\kappa'$ of $\kappa = f/g$, and rely on Sage to verify all your computations. (b) For both cases confirm that $\lim_{x\to 0}\kappa(x) = \lim_{x\to 0} K(x)$, where $K(x) = f'(x)/g'(x)$.

Solution. (a) Let us begin with (1). We have
$$\kappa(x) = \frac{f(x)}{g(x)} = \frac{e^{2x} - 1}{\ln(x + 1)}\,, \quad \text{and} \quad K(x) = \frac{f'(x)}{g'(x)} = 2(x + 1)\,e^{2x}\,,$$
where $\kappa$ is defined on $A = (-1, 0)\cup(0, +\infty)$ (and we assume the same for both $f$, $g$). Moreover, the given functions satisfy $\lim_{x\to 0^+} f(x) = \lim_{x\to 0^+} g(x) = 0$, hence one can apply the LMR (5.E.112). A direct computation shows that
$$K'(x) = 2\,e^{2x}(3 + 2x) > 0\,, \quad \text{for all } x \in A\,.$$
Because $K = f'/g'$ is differentiable, this implies that $K$ strictly increases on $A$, and according to the LMR the same monotonicity pattern should hold for the function $\kappa = f/g$.
To confirm this result, and because $\kappa$ is differentiable on $A$, we use its first derivative:
$$\kappa'(x) = \Big(\frac{e^{2x} - 1}{\ln(x + 1)}\Big)' = \frac{2\,e^{2x}\ln(x + 1) - (e^{2x} - 1)\cdot\frac{1}{x+1}}{(\ln(x + 1))^2} = \frac{2(x + 1)\,e^{2x}\ln(x + 1) - e^{2x} + 1}{(x + 1)(\ln(x + 1))^2}\,.$$
To determine the sign of $\kappa'(x)$ it is sufficient to determine the sign of the numerator $h(x) := 2(x + 1)\,e^{2x}\ln(x + 1) - e^{2x} + 1$. A direct computation shows that $h'(x) = 2\,e^{2x}\ln(x + 1)(2x + 3)$, so $h'(x) > 0$ on $(0, +\infty)$ while $h'(x) < 0$ on $(-1, 0)$ (there $\ln(x + 1) < 0$). Since $h(0) = 0$, in both cases we deduce that $h(x) > 0$ for all $x \in A$, which implies that $\kappa'(x) > 0$. As a side remark, observe that it is easier to deduce that the ratio $K = f'/g'$ is increasing than to determine the sign of $\kappa' = (f/g)'$. Now let us perform the computations in Sage, including the necessary explanations within the code (since the domain A is a continuum, the code spot-checks the claimed signs at sample points of A):
f(x)=e^(2*x)-1; g(x)=ln(x+1); k(x)=f(x)/g(x) #declare the ratio k=f/g
A=RealSet((-1, 0), x>0); show(A) #declare the domain A of k
K(x)=diff(f, x)/diff(g, x); show(K(x)) #declare and show the ratio K=f'/g'
Kprime(x)=diff(K, x).factor(); show(Kprime(x)) #declare and show the derivative K'
print(all(Kprime(t).n()>0 for t in [-0.9, -0.5, 0.5, 1, 2])) #spot-check that K'>0 on A
kprime(x)=diff(k, x).factor(); show(kprime(x)) #declare and show the derivative k'
print(all(kprime(t).n()>0 for t in [-0.9, -0.5, 0.5, 1, 2])) #spot-check that k'>0 on A
h(x)=2*(x+1)*e^(2*x)*ln(x+1)-e^(2*x)+1; show(diff(h, x).factor()) #declare the numerator h
print(all(h(t).n()>0 for t in [-0.9, -0.5, 0.5, 1, 2])) #spot-check that h is positive on A
(2) In this case we have
$$\kappa(x) = \frac{f(x)}{g(x)} = \frac{e^x - 1}{2x + e^x - 1}\,, \quad \text{and} \quad K(x) = \frac{f'(x)}{g'(x)} = \frac{e^x}{2 + e^x}\,,$$
where $\kappa$ is defined on $A = (-\infty, 0)\cup(0, +\infty)$ (and we assume the same for both $f$, $g$). We also have $\lim_{x\to 0^+} f(x) = \lim_{x\to 0^+} g(x) = 0$, and hence we can apply the LMR. In particular, the ratio $K(x) = \frac{e^x}{2 + e^x}$ is strictly increasing on $A$ (e.g., its first derivative satisfies $K'(x) = \frac{2\,e^x}{(2 + e^x)^2} > 0$ for all $x \in A$), and hence the ratio $\kappa = f/g$ must follow the same monotonicity pattern. It is an easy exercise to see that the first derivative of $\kappa$ satisfies
$$\kappa'(x) = \frac{2\,(x\,e^x - e^x + 1)}{(2x + e^x - 1)^2} > 0\,, \quad \forall\, x \in A\,.$$
On the other hand, a confirmation via Sage proceeds very similarly to (1), i.e.,
f(x)=e^(x)-1; g(x)=2*x+e^(x)-1; k(x)=f(x)/g(x)
A=RealSet(x>0, x<0); show(A) #declare the domain A
K(x)=diff(f, x)/diff(g, x); show(K(x)) #declare and show the ratio K=f'/g'
Kprime(x)=diff(K, x).factor(); show(Kprime(x)) #declare and show the derivative K'
print(all(Kprime(t).n()>0 for t in [-2, -0.5, 0.5, 2])) #spot-check that K'>0 on A
kprime(x)=diff(k, x).factor(); show(kprime(x)) #declare and show the derivative k'
print(all(kprime(t).n()>0 for t in [-2, -0.5, 0.5, 2])) #spot-check that k'>0 on A
(b) The objective of this question is to illustrate that the LMR operates similarly to l'Hospital's rule for limits. According to the latter rule we get
$$\lim_{x\to 0}\kappa(x) = \lim_{x\to 0}\frac{e^{2x} - 1}{\ln(x + 1)} \overset{\frac{0}{0}}{=} \lim_{x\to 0}\frac{f'(x)}{g'(x)} = \lim_{x\to 0} K(x) = \lim_{x\to 0} 2(x + 1)\,e^{2x} = 2\,.$$
Verify yourself that in a similar way in (2) we get $\lim_{x\to 0}\kappa(x) = \lim_{x\to 0} K(x) = 1/3$. □

C2) Material on derivatives - Applications

5.E.114. Thales' theorem and the shadow problem. (a) A man who is 1.7 meters tall walks away from a 4-meter-tall lamppost at a speed of 1.4 m/sec (see also the figure given below). Find the rate at which his shadow is increasing in length.
(b) Suppose now that as a man walks away from a lamppost of height 380 cm, the tip of his shadow moves twice as fast as he does. What is the man's height?

Solution. (a) Let us denote by $x$ the distance of the man from the base of the lamppost at time $t$, and by $y$ the length of his shadow, both measured in meters. By assumption the man moves at a rate of 1.4 m/sec, so that $x = 1.4t$. Consider the triangles $OLB$ and $AMB$. (Figure: the light $L$ sits atop the lamppost with base $O$, the man stands at $A$ with head $M$, and $B$ is the tip of the shadow; $x = |OA|$, $y = |AB|$.) These triangles are similar, so by Thales' theorem we have
$$\frac{x + y}{4} = \frac{y}{1.7} \iff \frac{1.4t + y}{4} = \frac{y}{1.7} \iff y = \frac{119}{115}\,t \approx 1.03478\,t\,.$$
One may solve this equation via Sage and obtain the final equality presented here, which can be done by the syntax
var("t"); solve(4*x==1.7*(1.4*t+x), x)
It follows that the length of the man's shadow increases at the rate $\frac{dy}{dt} = \frac{119}{115}$ m/sec. (b) Let us denote by $\alpha$ the whole distance from the lamppost's base to the tip of the man's shadow (in terms of the previous figure, $\alpha = x + y$), and by $h$ the man's height (now measured in centimeters; this is the unknown of the problem). Notice that both $\alpha$, $x$ are functions of the time $t$, but $h$ is constant. In this case Thales' theorem tells us that
$$\frac{\alpha(t)}{380} = \frac{\alpha(t) - x(t)}{h} \iff 380\big(\alpha(t) - x(t)\big) = h\,\alpha(t)\,,$$
which we differentiate with respect to $t$, using the assumption that $\frac{d\alpha}{dt} = 2\frac{dx}{dt}$. This gives
$$380\Big(\frac{d\alpha}{dt} - \frac{dx}{dt}\Big) = h\,\frac{d\alpha}{dt} = 2h\,\frac{dx}{dt} \iff 380\Big(2\frac{dx}{dt} - \frac{dx}{dt}\Big) = 2h\,\frac{dx}{dt}\,,$$
which is equivalent to $h = \frac{380}{2} = 190$ cm. □

5.E.115. Inscribed rectangle with maximal area. (1) Find the rectangle with the largest area that can be inscribed between the graph of the parabola $y = x^2$ and the line $y = a$, where $a > 0$ is some fixed constant (see the figure below). (2) What is the maximum area for the values 1, 2 and 3 of $a$?

Solution. (1) Let $[x, x^2]$ be the point representing the lower right corner of the rectangle. As we indicate in the figure, the upper right corner is then the point $[x, a]$; in particular the height of the rectangle is the function $h_a(x) = a - x^2$ with $x \in [0, \sqrt{a}]$. Notice that $h_a(\sqrt{a}) = 0$ and $h_a(0) = a$, two cases that we may exclude for obvious reasons (they correspond to zero height and zero width ($x = 0$), respectively, hence they do not determine a rectangle). Thus, the function representing the area of the rectangle is
$$E_a(x) = 2\,x\,h_a(x) = 2x(a - x^2) = 2ax - 2x^3\,,$$
which we want to maximize on the interval $(0, \sqrt{a})$. Differentiating $E_a$ with respect to $x$ gives $E_a'(x) = 2a - 6x^2$. The equation $E_a'(x) = 0$ has a unique solution in $(0, \sqrt{a})$, given by $x_a := \sqrt{\frac{a}{3}}$, with
$$E_a(x_a) = \frac{4\sqrt{3}}{9}\,a^{\frac{3}{2}} = \frac{4\sqrt{3a^3}}{9}\,, \quad a > 0\,. \qquad (*)$$
There are many ways to prove that $x_a$ maximizes the area. For instance, the second derivative of the area function $E_a$ is given by $E_a''(x) = (E_a'(x))' = -12x$, and hence obviously $E_a''(x_a) < 0$ (since $x_a > 0$ for all $a > 0$). You can easily confirm these conclusions in Sage by applying the cell
var("a"); assume(a>0)
h(a, x)=a-x^2; E(a, x)=2*x*h(a, x)
show(diff(E(a, x), x)); show(solve(diff(E(a, x), x)==0, x))
bool(diff(E(a, x), x, 2)(x=sqrt(a/3))<0)
(2) It is sufficient to substitute the required values in $(*)$. This gives $E_1(x_1) = \frac{4\sqrt{3}}{9}$, $E_2(x_2) = \frac{4\sqrt{24}}{9} = \frac{8\sqrt{6}}{9}$ and $E_3(x_3) = \frac{4\sqrt{81}}{9} = 4$.
In Sage, to confirm these values, add to the previous block the cell
show(E(1, sqrt(1/3)).factor()); show(E(2, sqrt(2/3)).factor()); show(E(3, sqrt(3/3)).factor())
Let us finalize this task by illustrating these three solutions. □

5.E.116. Inscribe a rectangle with the greatest perimeter possible into a semidisc with radius $r$. Determine the rectangle's perimeter. ⃝

5.E.117. Among the rectangles with perimeter $4c$, find the one having the greatest area (if such a rectangle exists) and determine the lengths of its sides. ⃝

5.E.118. Find the height $h$ and the radius $r$ of the cone of greatest volume which fits into a ball of radius $R$. ⃝

5.E.119. From all triangles with given perimeter $p$, select the one with the greatest area. ⃝

5.E.120. A parabola is given by the equation $2x^2 - 2y = 9$. Find the points of the parabola which are closest to the origin. ⃝

5.E.121. Regiomontanus' angle maximization problem (1471). In a museum, there is a painting on the wall. Its lower edge is $a$ meters above the ground and its upper edge $b$ meters, that is, its height equals $b - a$. A tourist is looking at the painting, her eyes at height $h < a$ meters above the ground (see the figure below). How far from the wall should the tourist stand if she wants to maximize her angle of view of the painting? This is the so-called "Regiomontanus angle maximization problem" from the 15th century.19 (Figure: the painting spans the heights $a$ to $b$ on the wall; the tourist stands at distance $x$, with angles $\alpha$, $\beta$ to the upper and lower edges.)

19 This optimization problem was first posed and solved in 1471 by the German astronomer Johannes Müller von Königsberg (1436–1476), known as Regiomontanus. Müller solved the problem by elementary geometry, but the solution presented here relies on differentiation.

Solution. First observe that because the tourist's eyes are below the lower edge of the painting, her viewing angle becomes small both very close to the wall and very far from the wall, so the largest viewing angle is attained somewhere in between. We will use the picture above to illustrate the situation. Let us denote by $x$ the distance (in meters) of the tourist from the wall, and the angle of her view of the painting by $\varphi$. For the angles $\alpha, \beta \in (0, \pi/2)$ we have
$$\tan(\alpha) = \frac{b - h}{x}\,, \quad \tan(\beta) = \frac{a - h}{x}\,,$$
with $h < a < b$. Our task is to maximize $\varphi = \alpha - \beta$. Let us add that for $h > b$ one can proceed analogously, while for $h \in [a, b]$ the angle $\varphi$ increases as $x$ decreases ($\varphi = \pi$ for $x = 0$ and $h \in (a, b)$). From the condition $h < a$ it follows that the angle $\varphi$ is acute, that is, $\varphi \in (0, \pi/2)$. Since the function $y = \tan(x)$ is increasing on the interval $(0, \pi/2)$, we can turn our attention to maximizing the value $\tan(\varphi)$. We have that
$$\tan(\varphi) = \tan(\alpha - \beta) = \frac{\tan(\alpha) - \tan(\beta)}{1 + \tan(\alpha)\tan(\beta)} = \frac{\frac{b-h}{x} - \frac{a-h}{x}}{1 + \frac{b-h}{x}\cdot\frac{a-h}{x}} = \frac{x\,(b - a)}{x^2 + (b - h)(a - h)}\,.$$
So it suffices to find the global maximum of the function
$$f(x) = \frac{x\,(b - a)}{x^2 + (b - h)(a - h)}\,, \quad x \in [0, +\infty)\,,$$
whose first derivative is given by
$$f'(x) = \frac{(b - a)\big[(b - h)(a - h) - x^2\big]}{\big[x^2 + (b - h)(a - h)\big]^2}\,, \qquad (*)$$
for all $x \in (0, +\infty)$. This expression can be quickly verified via the following cell in Sage:
var("a, b, h"); f(x)=x*(b-a)/(x^2+(b-h)*(a-h))
df1=diff(f(x), x).factor(); show(df1)
From $(*)$ one can see that $f'(x) > 0$ for $x \in \big(0, \sqrt{(b-h)(a-h)}\big)$, and $f'(x) < 0$ for $x \in \big(\sqrt{(b-h)(a-h)}, +\infty\big)$, and hence $f$ attains its global maximum at the point $x_0 = \sqrt{(b-h)(a-h)}$.
This is the desired point, and the corresponding angle is given by
$$\varphi_0 = \arctan\Big(\frac{x_0\,(b - a)}{x_0^2 + (b - h)(a - h)}\Big) = \arctan\Big(\frac{b - a}{2\sqrt{(b - h)(a - h)}}\Big)\,.$$
Let us describe two examples:
• Suppose an ant looks at the painting, so in this case we have $h = 0$, $x_0 = \sqrt{ab}$ and $\varphi_0 = \arctan\big(\frac{b-a}{2\sqrt{ab}}\big)$. If the painting is 1 meter high and its lower edge is 2 meters above the ground, i.e., $a = 2$ and $b = 3$, then the ant sees the painting at the largest angle $\varphi_0 \approx 0.201$ rad $\approx 11.5°$ at the distance $x_0 = \sqrt{6} \approx 2.45$ meters from the wall.
• For another example, look at the left picture above, where the painting is viewed by a man whose eyes are at the height of 1.8 meters, together with his son whose eyes are 1 meter above the ground. In this case the father should stand at $x_0 \approx 0.49$ meters from the wall, and his son at $x_0 \approx 1.41$ meters from the wall. Thus the father has viewing angle $\varphi_0 \approx 0.795$ rad $\approx 45.6°$, whereas his son has $\varphi_0 \approx 0.339$ rad $\approx 19.5°$; the quotient $\frac{0.795}{0.339} \approx 2.3$ shows how much better the father's view is. □

5.E.122. Determine the point $x_0$ of the previous problem 5.E.121 by an alternative method. ⃝

5.E.123. Snell's law. This task is about light rays, which bend when traveling from one (optical) medium to another, such as from air to water. This causes items placed in or behind water to look shifted, and there is a rule, called "Snell's law", which can be applied to evaluate the bending of a light ray.20 Having as reference the figure given below, determine the refracted light ray between the point $A$ in a homogeneous medium with speed of light $v_1$ and the point $B$ in a homogeneous medium with speed of light $v_2$.

20 An optical medium is a material through which light (and other electromagnetic waves) propagates.

Solution. We can assume that distances are given in meters, the speeds $v_1$, $v_2$ in meters per second, and the time in seconds; for convenience we omit the units of measurement in our notation. The ray is determined by "Fermat's principle of least time", which states that from all the paths between the points $A$ and $B$, the light travels along the one which can be traversed in the least time. In homogeneous media the ray is a straight line, and in our case we consider its segments. So it suffices to determine the point $R$ (given by the value $x$) where the ray refracts. The distance between the points $A$ and $R$ is $\sqrt{h_1^2 + x^2}$, while the distance between $R$ and $B$ is $\sqrt{h_2^2 + (d - x)^2}$. We deduce that the total travel time of the light between the points $A$ and $B$ is represented by the function
$$T(x) = \frac{\sqrt{h_1^2 + x^2}}{v_1} + \frac{\sqrt{h_2^2 + (d - x)^2}}{v_2}\,,$$
On the other hand, on the interval [0, π/2] the sine function is non-negative and increasing, so the quotient (sin φ1)/(sin φ2) is increasing with respect to x. Since T′ (0) < 0 and T′ (d) > 0, there is exactly one stationary point x0. From the inequalities T′ (x) < 0 for x ∈ [0, x0) and T′ (x) > 0 for x ∈ (x0, d], it follows that at the stationary point x0 there is the global minimum. Let us summarize as follows: The ray is given by the point R of refraction (i.e., the value x0), and the point R is determined by the Snell’s law (♯). The ration v1/v2 is constant for the given homogeneous mediums, and determines an important quantity which describes the interface of optical spaces. It is called “refractive index” and denoted by n. Usually, the first optical medium (space) is vacuum, that is, v1 = c, where c = speed of light and v2 = v, such that n = c/v. For vacuum, we get n = 1, of course. This value is also used for the air since its refractive index at standard conditions is n . = 1.000272. Other mediums have n > 1 (n = 1.31 for ice, n = 1.33 for water, n = 1.5 for glass). However, the refractive index also depends on the wave length of the electromagnetic radiation in question (for example, for water and light, it ranges from n . = 1.331 to n . = 1.344), where the index ordinarily decreases as the wave length increases. The speed of light in an optical space having n > 1 depends on its frequency, and hence we talk about the “dispersion of light”. The dispersion causes rays of different colors to refract at different angles. The violet ray refracts the most and the red ray refracts the least. This is also the origin of a rainbow. □ 5.E.124. You are in a boat on a lake at distance d km from the shore. You want to get to a given place on the shore whose straight-line distance is √ d2 + ℓ2 from you, see the figure below. What path will you take if you want to be there as soon as possible, supposing you can row at v1 kph and run along the shore at v2 kph? How long will the journey take? Solution. The optimal strategy is given by first rowing in a straight line to the shore at some point [0, x] for x ∈ [0, ℓ], and then running along the shore to the target point [0, ℓ], see the diagram. So the trajectory consists of two line segments, or only one segment, in the case when x = ℓ. The voyage to the point [0, x] on the shore takes √ d2+x2 v1 hours, and the final run takes ℓ−x v2 hours. Thus, the total time is the function t(x) = √ d2 + x2 v1 + ℓ − x v2 , x ∈ [0, ℓ] . It can be assumed that v1 < v2, for if v1 ≥ v2, the optimal strategy is to row straight to the target point, which corresponds to x = l. Now, the first and the second derivative of t(x) are given by t′ (x) = x v1 √ d2+x2 − 1 v2 and t′′ (x) = d2 v1 √ (d2+x2)3 , respectively, with x ∈ (0, ℓ). You may confirm these relations in Sage as follows: var("d, l, v1, v2"); assume(v1 0 for all x in that interval. If x0 ≥ ℓ, then t′ (x) ≤ 0 for all x ∈ [0, ℓ] and so the global minimum of t(x) occurs at ℓ. In the former case, the fastest journey in hours takes t (x0) = √ d2 + x2 0 v1 + ℓ − x0 v2 = d √ v2 2 − v2 1 v1v2 + ℓ v2 . In the latter case, the fastest journey in hours takes t (ℓ) = √ d2 + ℓ2 v1 . Notice one could follow an even simpler approach for doing the calculations, using the variable θ, instead of the variable x, where x = d tan(θ). The fastest journey occurs when sin(θ) = v1/v2., which is the limiting case of Snell’s law. □ 5.E.125. L’Hospital’s pulley. A rope of length r is tied at one of its ends to the ceiling at a point A. 
A small pulley is attached to its other end. A point $B$ is also on the ceiling, at distance $d > r$ from $A$. Another rope of length $\ell > \sqrt{d^2 + r^2}$ is tied to $B$ at one end, passes over the pulley, and has a weight attached to its other end, see the figure presented below. Neglect the masses and sizes of the ropes and the pulley. At what position does the weight stabilize, so that the system is in a stationary position?

Solution. The system is in a stationary position when its potential energy is minimized, that is, when the distance between the weight and the ceiling is maximal. Let $x$ be the distance between $A$ and the point $P$ on the ceiling vertically above the weight and the pulley. By the Pythagorean theorem, the distance between the pulley and the ceiling is $\sqrt{r^2 - x^2}$. Similarly, the distance between the weight and the pulley is $\ell - \sqrt{(d - x)^2 + r^2 - x^2}$. Hence if $f(x)$ is the distance between the weight and the ceiling, then
$$f(x) = \sqrt{r^2 - x^2} + \ell - \sqrt{(d - x)^2 + r^2 - x^2}\,.$$
The state of the system is fully determined by the value $x \in [0, r]$. Therefore, it suffices to find the global maximum of $f$ on the interval $[0, r]$. The first derivative of $f$ is given by
$$f'(x) = \frac{-x}{\sqrt{r^2 - x^2}} - \frac{-(d - x) - x}{\sqrt{(d - x)^2 + r^2 - x^2}} = \frac{-x}{\sqrt{r^2 - x^2}} + \frac{d}{\sqrt{(d - x)^2 + r^2 - x^2}}\,, \quad x \in (0, r)\,.$$
Let us use Sage once more to confirm this expression:
var("r, d, l"); f(x)=sqrt(r^2-x^2)+l-sqrt((d-x)^2+r^2-x^2)
df1(x)=diff(f, x); show(df1(x))
Now, instead of solving the equation $f'(x) = 0$ directly, it is easier to square it and clear denominators, which leads to the polynomial equation
$$2dx^3 - (2d^2 + r^2)x^2 + d^2r^2 = 0\,, \quad x \in (0, r)\,.$$
This has $x = d$ as an obvious solution, and hence we get the factorization $(x - d)(2dx^2 - r^2x - dr^2) = 0$. Thus the cubic has three real roots. The root $x = d$ does not lie in the interval $[0, r]$, since $d > r$. The root $x = \frac{r^2 - r\sqrt{r^2 + 8d^2}}{4d}$ is negative, hence also outside $[0, r]$. But the root
$$x_0 = \frac{r^2 + r\sqrt{r^2 + 8d^2}}{4d}$$
is positive, and since $r < d$ it satisfies $x_0 < \frac{r}{4} + \frac{3r}{4} = r$. Now, because $f'$ is continuous on the open interval $(0, r)$, it changes sign only at $x_0$. Based on the limits $\lim_{x\to 0^+} f'(x) = \frac{d}{\sqrt{d^2 + r^2}} > 0$ and $\lim_{x\to r^-} f'(x) = -\infty$, we deduce that $f'(x) > 0$ for all $x \in (0, x_0)$, while $f'(x) < 0$ for all $x \in (x_0, r)$. This implies that $f$ attains its global maximum at $x_0$. □

5.E.126. A rectangle is inscribed into an equilateral triangle with sides of length $a$ so that one of its sides lies on one of the triangle's sides and the other two of the rectangle's vertices lie on the remaining sides of the triangle. What is the maximum possible area of the rectangle? ⃝

5.E.127. Determine the dimensions of an (open) swimming pool whose volume is 32 m³ and whose bottom has the shape of a square, so that one would use the least amount of paint possible to prime its bottom and its walls. ⃝

5.E.128. Express the number 28 as a sum of two non-negative numbers such that the sum of the first summand squared and the second summand cubed is as small as possible. ⃝

5.E.129. With the help of the first derivative, find the real number $a > 0$ for which the sum $a + 1/a$ is minimal. Then solve this problem without using differential calculus. ⃝

D) Material on infinite sums and power series

Recall from Section D that infinite sums have several beautiful applications. Below we describe one strongly related to the compelling notion of fractals, the so-called Sierpiński carpet.

5.E.130. Sierpiński carpet.
The unit square is divided into nine equal squares, and then the middle one is removed. Each of the eight remaining squares is again divided into nine equal sub-squares, and the middle sub-square of each is removed again. Having applied this procedure ad infinitum, determine the area of the resulting figure.

Solution. In the first step, a square of area 1/9 is removed. In the second step, eight squares are removed, each of area $9^{-2}$, so in total we remove $8\cdot 9^{-2}$. Every further iteration removes eight times more squares than the previous step, but the squares are nine times smaller. Thus, the sum of the areas of all the removed squares is encoded by the following infinite sum:
$$\frac{1}{9} + \frac{8}{9^2} + \frac{8^2}{9^3} + \cdots = \sum_{n=0}^{\infty}\frac{8^n}{9^{n+1}}\,.$$
Now, to compute the area of the remaining figure (known as the Sierpiński carpet), it is sufficient to consider the difference
$$1 - \sum_{n=0}^{\infty}\frac{8^n}{9^{n+1}} = 1 - \frac{1}{9}\sum_{n=0}^{\infty}\Big(\frac{8}{9}\Big)^n = 1 - \frac{1}{9}\cdot\frac{1}{1 - \frac{8}{9}} = 0\,,$$
where we used the result from 5.B.16 on geometric series. Thus, the Sierpiński carpet has zero area. □

5.E.131. Decide whether the following implications hold: (a) Convergence of the series $\sum_{n=1}^{\infty}a_n$, $\sum_{n=1}^{\infty}b_n$ implies that the series $\sum_{n=1}^{\infty}(6a_n - 5b_n)$ converges as well. (b) If a series $\sum_{n=0}^{\infty}a_n$ satisfies $\lim_{n\to\infty}a_n^2 = 0$, then it is convergent. (c) If a series $\sum_{n=1}^{\infty}a_n^2$ converges, then the series $\sum_{n=1}^{\infty}\frac{a_n}{n}$ converges absolutely. ⃝

5.E.132. Prove that the series below diverge to $+\infty$:
$$T_1 = \sum_{n=2}^{\infty}\frac{1}{\sqrt[n]{\ln(n)}}\,, \quad T_2 = \sum_{n=1}^{\infty}\frac{1}{n - \ln(n)}\,.$$

Solution. Both series consist of non-negative terms only, so each either converges or diverges to $+\infty$. For $T_1$ it is useful to recall that $\lim_{n\to\infty}\sqrt[n]{\ln(n)} = 1$ (see for example 5.E.18), and hence we also get
$$\lim_{n\to\infty}\frac{1}{\sqrt[n]{\ln(n)}} = \frac{1}{\lim_{n\to\infty}\sqrt[n]{\ln(n)}} = 1\,.$$
Thus, by (1) in Theorem 5.1.9 we deduce that $T_1$ is not convergent, and in particular it diverges to $+\infty$. For $T_2$ we have $\lim_{n\to\infty}\frac{1}{n - \ln(n)} = 0$. However, we see that
$$\sum_{n=1}^{\infty}\frac{1}{n - \ln(n)} \ge \sum_{n=1}^{\infty}\frac{1}{n} = +\infty\,.$$
Hence $T_2$ also diverges to $+\infty$. □

5.E.133. Prove that the series
$$S = \sum_{n=0}^{\infty}\arctan\Big(\frac{n^2 + 2n + 3\sqrt{n} + 4}{n + 1}\Big)\,, \quad T = \sum_{n=1}^{\infty}\frac{3^n + 1}{n^3 + n^2 - n}$$
are both divergent.

Solution. For the first case we have
$$\lim_{n\to\infty}\arctan\Big(\frac{n^2 + 2n + 3\sqrt{n} + 4}{n + 1}\Big) = \lim_{n\to\infty}\arctan\Big(\frac{n^2}{n}\Big) = \frac{\pi}{2}\,.$$
We can do this computation in Sage by typing
var("n"); lim(arctan((n^2+2*n+3*sqrt(n)+4)/(n+1)), n=oo)
For the second one we compute
$$\lim_{n\to\infty}\frac{3^n + 1}{n^3 + n^2 - n} = \lim_{n\to\infty}\frac{3^n}{n^3} = +\infty\,.$$
Thus, the necessary condition $\lim_{n\to\infty}a_n = 0$ for a series $\sum_{n=n_0}^{\infty}a_n$ to converge fails in both cases, and so neither $S$ nor $T$ converges. □

5.E.134. Compute explicitly the sum $\sum_{n=0}^{\infty}\frac{1}{(3n+1)(3n+4)}$.

Solution. It suffices to use the partial fraction decomposition
$$\frac{1}{(3n+1)(3n+4)} = \frac{1}{3}\cdot\frac{1}{3n+1} - \frac{1}{3}\cdot\frac{1}{3n+4}\,, \quad n = 0, 1, 2, \ldots$$
This gives
$$\sum_{n=0}^{\infty}\frac{1}{(3n+1)(3n+4)} = \lim_{n\to\infty}\frac{1}{3}\Big(1 - \frac{1}{4} + \frac{1}{4} - \frac{1}{7} + \frac{1}{7} - \frac{1}{10} + \cdots + \frac{1}{3n+1} - \frac{1}{3n+4}\Big) = \lim_{n\to\infty}\frac{1}{3}\Big(1 - \frac{1}{3n+4}\Big) = \frac{1}{3}\,. \qquad \square$$

5.E.135. (a) Using the ratio test, show that $S = \sum_{n=1}^{\infty}\frac{(-2)^{n^2}}{n!}$ is not convergent. (b) Examine the convergence of the series $T = \sum_{n=1}^{\infty}(-1)^{n+1}\arctan\big(\frac{2}{\sqrt{3n}}\big)$. ⃝

5.E.136. Let $S = \sum_{n=0}^{\infty}f(n)$ be an infinite series for which the limit $\lim_{n\to\infty}\sqrt[n]{|f(n)|}$ exists and equals $q$. Present in Sage a routine based on the root test which decides, when possible, the absolute convergence or divergence of $S$. Then apply your program to the following series and state Sage's output.
$$S_1 = \sum_{n=1}^{\infty}2^n\Big(\frac{4}{5}\Big)^{n^2}\,, \quad S_2 = \sum_{n=1}^{\infty}\Big(\frac{\ln(n)}{n}\Big)^n\,, \quad S_3 = \sum_{n=1}^{\infty}\Big(\frac{2n}{4n^2 + 1}\Big)^{\frac{1}{n}}\,, \quad S_4 = \sum_{n=1}^{\infty}(-1)^{n-1}\Big(\frac{3n}{n + 1}\Big)^n\,.$$

Solution. Here one can use once more the command def to create a subroutine, as we did in 5.D.5; this time we call it roottest. Recall that according to the root test the series $S$ converges absolutely if $q < 1$ and does not converge absolutely for $q > 1$, while for $q = 1$ the root test is inconclusive. Thus one can give the cell
var("n")
def roottest(f):
    q=lim(abs(f(n))**(1/n), n=oo)
    if q<1:
        return "converges absolutely"
    elif q==1:
        return "no conclusion"
    else:
        return "does not converge absolutely"
Now, to test our routine on the given series, simply introduce the sequence inside the series first, and then apply the command roottest(f). For instance, for the second series type in your editor
f(n)=(ln(n)/n)^n; roottest(f)
Sage informs us that the series converges absolutely. Check yourself that the same is true for the first series (see also 5.D.4). On the other hand, the series $S_4$ does not converge absolutely, while the root test does not provide a result for $S_3$. □

5.E.137. Prove the convergence of the series $\sum_{n=1}^{\infty}\frac{3^n + 2^n}{6^n}$, and find its value. ⃝

5.E.138. Calculate the sums $\sum_{n=1}^{\infty}\frac{2n - 1}{2^n}$ and $\sum_{n=0}^{\infty}\frac{n + 1}{3^n}$. ⃝

5.E.139. Determine those $\alpha \in \mathbb{R}$, $\beta \in \mathbb{Z}$ and $\gamma \in \mathbb{R}\setminus\{0\}$ which make the following series convergent:
$$\sum_{n=120}^{\infty}\frac{e^{-\alpha n}}{n}\,, \quad \sum_{n=240}^{\infty}\frac{\beta^n\cdot n!}{n^n}\,, \quad \sum_{n=360}^{\infty}\frac{n}{\gamma^n}\,. \quad ⃝$$

5.E.140. Determine whether the series $\sum_{n=21}^{\infty}(-1)^n\frac{n^8 - 5n^6 + 2n}{2^n}$ converges absolutely, converges conditionally, or does not converge at all. ⃝

5.E.141. Find all real numbers $\alpha \ge 0$ for which the series $\sum_{n=1}^{\infty}(-1)^n\ln\big(1 + \alpha^{2n}\big)$ is convergent. ⃝

5.E.142. Explain why the radius of convergence of the power series expressing the function $\frac{1}{1+x^2}$ with center at the origin equals one.

Solution. Recall from 5.B.16 that the geometric series $\sum_{n=0}^{\infty}x^n = \frac{1}{1-x}$ converges for $|x| < 1$. Thus $\frac{1}{1+x} = \sum_{n=0}^{\infty}(-1)^n x^n$, and substituting $x^2$ for $x$ we arrive at $\frac{1}{1+x^2} = \sum_{n=0}^{\infty}(-1)^n x^{2n}$, again converging for $|x| < 1$. At the same time, viewing the same sum over complex $x$, the sum explodes toward infinite absolute value as $x$ approaches the imaginary unit $i$. Thus the power series cannot converge for $|x| > 1$. □

5.E.143. Express the function $y = e^x$, defined on the whole real line, as an infinite polynomial whose terms are of the form $a_n(x - 1)^n$. Then express the function $y = 2^x$ defined on $\mathbb{R}$ as an infinite polynomial with terms $a_n x^n$.

Solution. We know the series for $e^x$; thus writing $e^x = e\cdot e^{x-1}$ leads to
$$e^x = \sum_{n=0}^{\infty}\frac{e}{n!}(x - 1)^n\,.$$
The second task is even simpler, since
$$2^x = e^{x\ln(2)} = \sum_{n=0}^{\infty}\frac{\ln^n(2)}{n!}x^n\,. \qquad \square$$

5.E.144. Supposing $|x| < 1$, determine the sums
$$A(x) = \sum_{n=1}^{\infty}\frac{1}{2n - 1}x^{2n-1}\,, \quad B(x) = \sum_{n=1}^{\infty}n^2 x^{n-1}\,. \quad ⃝$$

5.E.145. Give an example of two divergent series $\sum_{n=1}^{\infty}a_n$, $\sum_{n=1}^{\infty}b_n$ with positive terms for which the series $\sum_{n=1}^{\infty}(3a_n - 2b_n)$ converges absolutely. ⃝

5.E.146. Determine whether the series $\sum_{n=1}^{\infty}(-1)^n\frac{(n!)^2}{(2n)!}$ and $\sum_{n=1}^{\infty}(-1)^n\frac{n^7 - n^4 + n}{n^8 + 2n^6 + n}$ converge absolutely, converge conditionally, or do not converge at all. ⃝

5.E.147. Does the series $\sum_{n=1}^{\infty}(-1)^{n+1}\frac{\sqrt[3]{n} + \sqrt[5]{n} + 1}{n + \sqrt[5]{n}}$ converge? ⃝

5.E.148. Find the values of the parameter $p \in \mathbb{R}$ for which the series $\sum_{n=1}^{\infty}\frac{(-1)^n\sin^n p}{n}$ converges. ⃝

5.E.149. Determine whether the series $\sum_{n=0}^{\infty}\frac{2^n + (-2)^n}{5^n}$ converges. ⃝
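Let us remark that many such series can be evaluated symbolically with Sage's sum command, which is handy for checking one's own hand computation; for instance, for the series in 5.E.149 one may type:
var("n")
show(sum((2^n+(-2)^n)/5^n, n, 0, oo))  # two geometric series in disguise; compare with your answer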
5.E.150. Using the power series $F(x) = \sum_{n=0}^{\infty}(-1)^n(2n + 1)x^{2n}$ with $x \in (-1, 1)$, calculate the infinite sum $S = \sum_{n=1}^{\infty}\frac{2n - 1}{(-2)^{n-1}}$. ⃝

Solutions to the exercises

5.A.7. As usual, we have a taste for Sage, where an appropriate cell reads as follows:
g(x)=(x/(1+8*x**2))
figg=plot(g(x), (x, -1, 1), figsize=4, color="black"); show(figg)
R=PolynomialRing(QQ, "x")
n=20
points=[(-1+k*(2/n), g(-1+k*2/n)) for k in [0, 1,.., n]]
P(x)=R.lagrange_polynomial(points); show(P(x))
figg+=plot(P(x), (x, -1, 1), color="purple")
figg+=list_plot(points, size=20, figsize=4, color="blue"); figg
This returns the interpolation polynomial P(x) in question, and also the plots of g(x) and of P(x), along with the given set of points. Note that the polynomial oscillates wildly near the endpoints, getting worse and worse as n increases.

5.A.8. Running the following cell in Sage prints out the explicit form of P, given by $P(x) = 3x^2 - 2x - 4$.
R = PolynomialRing(QQbar, "x")
R.lagrange_polynomial([(-1, 1), (1, -3), (2, 4)])
For further practice with the method of elementary Lagrange polynomials, we suggest that the reader verify this answer also by a formal computation.

5.A.9. Assume that $P(x) = ax^3 + bx^2 + cx + d$ for some reals $a, b, c, d$ to be specified. The given conditions yield the following system (we present this directly in Sage):
var("a, b, c, d")
eq1=d-1; eq2=a+b+c+d
eq3=8*a+4*b+2*c+d-1; eq4=27*a+9*b+3*c+d-10
solve([eq1==0, eq2==0, eq3==0, eq4==0], a, b, c, d)
This block returns the answer $a = 1$, $b = -2$, $c = 0$, $d = 1$, that is, $P(x) = x^3 - 2x^2 + 1$. Alternatively, one may apply the previous method of Lagrange polynomials:
R = PolynomialRing(QQbar, "x")
R.lagrange_polynomial([(0, 1), (1, 0), (2, 1), (3, 10)])
which verifies the previous answer: $x^3 - 2x^2 + 1$.

5.A.13. The polynomial Q is given by $Q(x) = x^3 - 2x^2 + 3x - 3$.

5.A.14. An answer is given by $x^5 - 2x^4 - 5x + 2$.

5.A.17. The sought spline differs from the one in 5.A.15 only in the values of the derivatives at the points $-1$ and $1$. Similarly to the previous task, we get that the parts $S_1$ and $S_2$ of our spline have the forms $S_1(x) = ax^3 + bx^2 + 1$ and $S_2(x) = -ax^3 + bx^2 + 1$, respectively, where $a$, $b$ are unknown real parameters. Confronting this with the conditions $S_1(-1) = 0$, $S_1'(-1) = -1$, $S_2(1) = 0$, and $S_2'(1) = 1$ yields the system $\{-a + b + 1 = 0,\ 3a - 2b = -1\}$, having the solution $a = -3$, $b = -4$. Hence
$$S(x) = \begin{cases} -3x^3 - 4x^2 + 1\,, & \text{if } x \in [-1, 0]\,, \\ 3x^3 - 4x^2 + 1\,, & \text{if } x \in [0, 1]\,. \end{cases}$$

5.A.20. $S_1(x) = 1 - \frac{11}{20}x + \frac{1}{20}x^3$; $S_2(x) = \frac{1}{2} - \frac{2}{5}(x - 1) + \frac{3}{20}(x - 1)^2 - \frac{1}{40}(x - 1)^3$.

5.B.1. The set $\mathbb{Z}^+$ is unbounded above, hence $A$ is also unbounded and in particular $\sup A$ cannot exist. Observe also that $\big|\frac{(-1)^n}{n}\big| = \frac{1}{n} \le 1$, and thus we have the inequalities $n + \frac{(-1)^n}{n} \ge n - \frac{1}{n} \ge n - 1$ for all $n \in \mathbb{Z}^+$. This implies that $n + \frac{(-1)^n}{n} \ge 0$ for all $n \in \mathbb{Z}^+$, and hence $0$ is a lower bound of $A$. Moreover, $0 \in A$ (take $n = 1$), so $A$ is non-empty and we deduce that $\inf A = 0$.

5.B.2. For the set $B$ we get $\sup B = \frac{1}{4} \in B$ and $\inf B = -1 \in B$. For $C$ we see that $\sup C = 9 \notin C$, $\inf C = -9 \notin C$. For the set $X$, $\inf X = -1 \in X$, while $\sup X = 0 \notin X$. Further, $\inf Y = 0 \notin Y$, while $\sup Y = 5 \in Y$.

5.B.6. (a) By 5.B.3 we know that $\frac{1}{n} \to 0$ and $3n \to +\infty$. Thus their sum diverges to $+\infty$. (b) Here we use the fact that if $a_n \to +\infty$ and $b_n \to +\infty$, then also $a_n + b_n \to +\infty$. We have $n^2 \to +\infty$ and $2n \to +\infty$, so $(n^2 + 2n) \to +\infty$ as well.
(c) For this, use the first assertion in Problem 5.B.5: since $n^2 \to +\infty$ and $3n \to +\infty$, their product $n^2\cdot 3n \to +\infty$ as well. (d) For this, use the second assertion in Problem 5.B.5: since $\frac{n}{n+1} \to 1 > 0$ and $4n \to +\infty$, their product tends to $+\infty$ as well.

5.B.7. (a) For any $\varepsilon > 0$ the inequality $\frac{1}{2^n} < \varepsilon$ is equivalent to $2^n > \frac{1}{\varepsilon}$. Thus, taking $N > \frac{1}{\varepsilon}$ we see that $n \ge N$ implies $2^n \ge 2^N > N > \frac{1}{\varepsilon}$. Hence $\frac{1}{2^n} < \varepsilon$, and this proves the claim. (b) Observe that the terms of $(b_n = (-1)^{n+1})$ alternate between $1$ and $-1$. Such a sequence is said to be alternating, and since its $n$th term is either $1$ or $-1$, it must diverge by oscillation. Indeed, taking $\varepsilon$ with $0 < \varepsilon < \frac{1}{2}$, we see that for any $x \in \mathbb{R}$ infinitely many terms of $(b_n)$ lie outside the open interval $(x - \varepsilon, x + \varepsilon)$; hence the inequality $|b_n - x| < \varepsilon$ fails for infinitely many $n$. (c) The sequence $(c_n = \sin(n\pi/2))$, $n \ge 1$, is of the form $(1, 0, -1, 0, 1, 0, \ldots)$. Therefore, by the same reasoning as in case (b), one deduces that $(c_n)$ diverges. (d) In this case a solution is given by $x_n = n$, $y_n = -n + 1$, with $n \in \mathbb{Z}^+$.

5.B.12. We have to disprove the statement: "Any convergent sequence is bounded and monotone." Take for example the sequence $\big(\frac{(-1)^n}{n}\big)_{n\ge 1}$: it converges to $0$ but is not monotone, as we can see in the figure below. For the second question, an example is given by $(a_n = n)_{n\ge 1}$. Clearly, this sequence is strictly increasing, but it is not bounded and so it cannot be convergent (it is not hard to prove that any convergent sequence is bounded).

5.B.13. Recall that $|\sin(x)| \le 1$ for any $x \in \mathbb{R}$. Thus $|a_n| = \big|\frac{\sin(n)}{n}\big| \le \frac{1}{n} \le 1$ for any natural $n \ge 1$. This is equivalent to saying that $-1 \le a_n \le 1$, and so the sequence $(a_n)$ is bounded. Next, it is easy to see that $(b_n = n^2 + 1)$ is bounded below: $b_n \ge 2$ for any $n$ (since $n^2 \ge 1$ for any natural $n \ge 1$). Suppose that $(b_n)$ is also bounded above. Then there exists $M \in \mathbb{R}$ such that $b_n \le M$ for all $n$. This gives $n^2 \le M - 1$, i.e., $n \le \sqrt{M - 1}$ for all $n$, a contradiction. Finally, for $c_n$ one can verify that $0 \le c_n \le 2$, which we leave as an exercise (we recommend plotting the first terms of $(c_n)$, e.g., via Sage).
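The figure referred to in 5.B.12 is easy to reproduce; one possible Sage cell is:
# first forty terms of the convergent, non-monotone sequence (-1)^n/n
list_plot([((-1)^n)/n for n in [1..40]], size=20, color="black")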
Thus, the monotone sequence theorem ensures its convergence, and in particular we deduce that lim n→+∞ gn = 0 = sup { n! nn : n ∈ Z+ } . This also follows by squeeze theorem, since we have 0 < gn ≤ 1/n for all positive naturals n. This is because n! = n · (n − 1) · · · 2 · 1 < n · n · · · n · 1 = nn−1 (so dividing by 1/nn we obtain the formula). 5.B.18. By the binomial theorem we have en = 1 + ( n 1 ) 1 n + · · · + ( n k ) 1 nk + · · · + ( n n ) 1 nn = 2 + ( n 2 ) 1 n2 + · · · + ( n k ) 1 nk + · · · + ( n n ) 1 nn . Since ( n k ) = n! k!(n − k)! = n · (n − 1) · . . . · (n − k + 1) k! , for the general term we see that ( n k ) 1 nk = 1 k! · n · (n − 1) · . . . · (n − k + 1) nk = 1 k! · n n · n − 1 n · . . . · n − k + 1 n = 1 k! · ( 1 − 1 n ) · . . . · ( 1 − k − 1 n ) . Therefore en = 2 + 1 2! ( 1 − 1 n ) + 1 3! ( 1 − 1 n ) ( 1 − 2 n ) + . . . + 1 k! ( 1 − 1 n ) · . . . · ( 1 − k − 1 n ) + . . . + 1 nn . As n increases, the quantities 1 n , 2 n , . . . , k−1 n decrease and so the expressions (1 − 1 n ), (1 − 2 n ), . . . , (1 − k−1 n ) increase. So the general term of (en) increases, i.e., (en) is increasing. Another way to obtain the monotonicity of (en) is based on the geometric mean inequality and in particular in Bernoulli’s inequality which we presented in Chapter 1. Also, since k! = 1 · 2 · · · k ≥ 1 · 2 · · · 2 = 2k−1 we see that 2 < en < 1 + n∑ k=1 1 k! < 1 + n∑ k=1 1 2k−1 = 1 + 1 − (1/2)n 1 − (1/2) < 3 , since 1 + 1−(1/2)n 1−(1/2) < 1 + 1 1−(1/2) . Thus (en) is also bounded, 2 < en < 3, for any positive natural n, and by 5.B.11 it is convergent. 5.B.24. Yes: The range of f(x) = x2 is the set [0, +∞), which according to the remark in 5.B.23 is a closed subset of R. 5.B.29. This is because the sine function f(x) = sin(x) is continuous at any x ∈ R. Verify this claim yourself, presenting a “delta-epsilon” proof.21 Thus limx→π/3 sin(x) = sin(π/3) = √ 3/2. Let us now use Sage and combine the limit command with the bool function, to verify that the left/right limits coincide with the value of sine function at the limit point x0. This idea is encoded by the following cell: bool(sin(pi/3)==limit(sin(x),x=pi/3, dir="left")) bool(sin(pi/3)==limit(sin(x),x=pi/3, dir="right")) For both these commands Sage’s output is True, as it was expected. 21Hint: Use the identity sin(x) − sin(x0) = 2 cos( x+x0 2 ) sin( x−x0 2 ) and some classical inequalities related to the sine and cosine function. 492 CHAPTER 5. ESTABLISHING THE ZOO 5.B.32. (a) We may multiply the expression inside the given limit by 1+cos(x) 1+cos(x) , so that we can form the difference of two squares in the numerator, that is, lim x→0 1 − cos(x) x = lim x→0 1 − cos(x) x · 1 + cos(x) 1 + cos(x) = lim x→0 1 − cos2 (x) x ( 1 + cos(x) ) = lim x→0 sin2 (x) x ( 1 + cos(x) ) = lim x→0 sin(x) x · lim x→0 sin(x) 1 + cos(x) = 1 · 0 2 = 0 . (b) Here we will use the so called trigonometric “half-angle” identity of the sine function, that is, 2 sin2 (x 2 ) = 1 − cos(x). We have lim x→0 1 − cos x x2 sin(x2) = lim x→0 2 sin2 (x 2 ) x2 sin(x2) = lim x→0 1 2 sin2 (x 2 ) (x 2 )2 sin(x2) = 1 2 ( lim x→0 sin (x 2 ) x 2 )2 · lim x→0 1 sin2 (x2) = 1 2 · ∞ = ∞ . Note that this calculation must be considered “from the back”. Indeed, since the limits on the right-hand side exist (no matter whether finite or infinite) and the expression 1 2 · ∞ is meaningful (see the note after theorem 5.2.13), the original limit exists as well. 
On the other hand, observe that if we split the original limit into the product lim x→0 (1 − cos x) · lim x→0 1 x2 sin(x2) , then we will get the 0 · ∞ type, which is an indeterminate form and tells us nothing about existence of the original limit. Finally, you may like to confirm your computation by Sage, which can be easily done by the cell f(x)=(1-cos(x))/(x^2*sin(x^2)); limit(f(x), x=0) 5.B.33. (a) One can decompose the polynomial in the denominator, and simplify the given expression, i.e., lim x→2 x − 2 √ x2 − 4 = lim x→2 x − 2 √ (x − 2)(x + 2) = lim x→2 √ x − 2 √ x + 2 = 0 4 = 0 . (b) Here one can exploit the rule for limits of compositions of functions (see 5.2.20). Thus lim x→0 sin (sin(x)) x = lim y→0 sin(y) y · lim x→0 sin(x) x = 1 . (c) We see that lim x→0 sin2 (x) x = lim x→0 sin(x) · lim x→0 sin(x) x = 0 · 1 = 0. Notice the original limit exists because both the right-hand side limits exist and their product is well-defined. 5.B.34. For instance, in the first case we have limx→0− f(x) = 1 but limx→0+ f(x) = +∞. Hence limx→0 f(x) does not exist. Let us also explain the situation for k and similarly are treated the function g and h. For the sign function sign : R → R we have k(x) = sign(x) =    1 , if x > 0 , 0 , if x = 0 , −1 , if x < 0 . Taking the sequence (xn = 1/n) we see that sign(1/n) → 1 for n → +∞, while for the sequence (yn = −1/n) we obtain sign(−1/n) → −1, as n → +∞. Since we found two real sequences (xn), (yn) with limn→+∞ xn = c = limn→+∞ yn (where c = 0), and limn→+∞ k(xn) ̸= limn→+∞ k(yn) we deduce that the limit limx→c k(x) does not exist (notice that c = 0 ∈ R is a limit point, see also Proposition 5.2.15). In Sage recall that we can treat limits of functions really quickly. For instance, for the functions h, k to verify the statement type k(x)=sign(x); h(x)=abs(sin(x))/sin(x); lim(k(x), x=0);lim(h(x), x=0) As a side remark notice that the sign function is very similar to the function g : R\{0} → {−1, 1} ∼= Z2, defined by g(x) = |x| /x which is everywhere continuous except at x = 0. In particular, g(0) is not defined and we see that g(x) = { 1 , if x > 0 , −1 , if x < 0 . If you like to sketch the graph of g, say for −2 ≤ x ≤ 2, type f(x)=abs(x)/x; plot(f, x, -2, 2, exclude=[0]) Notice here in our input we added the option exclude, to exclude the point that g is not defined. This syntax produces the known figure, i.e., 493 CHAPTER 5. ESTABLISHING THE ZOO 5.B.36. Let us multiply both the numerator and denominator by 1/α2 , which yields g(x) = lim α→+∞ ( 2x − 3 + 4x 1 α + 2 α2 1 + x α · sin 1 α 1 α ) = lim α→+∞ 2x − 3 + 4x 1 α + 2 α2 1 + x α · lim1 α →0 sin 1 α 1 α = 2x − 3 . This proves that g(x) = 2x − 3, i.e., g(x) represents a line. It remains to prove the second claim. Observe that y = 2x − 3 pass through A. Let M = [xM , yM ] be the middle point of the segment BC. This has coordinates xM = (6 + (−2))/2 = 2 and yM = (2 + 0)/2 = 1. These coordinates also lie on the line y = 2x − 3, and hence y = g(x) is a median of ABC. 5.B.37. (a) We see that lim x→+∞ 3x+1 + x5 − 4x 3x + 2x + x2 = lim x→+∞ 3 · 3x 3x = 3 . (b) Here one computes lim x→+∞ 4x − 8x6 − 2x − 167 3x − 45x − √ 11πx+12 = lim x→+∞ 4x − √ 11π12 · πx = −∞ . (c) In view of the formula (a − b)(a + b) = a2 − b2 , we have lim x→0 √ 1 + x − √ 1 − x x = lim x→0 (1 + x) − (1 − x) x (√ 1 + x + √ 1 − x ) = lim x→0 2 √ 1 + x + √ 1 − x = 2 √ 1 + √ 1 = 1 . 
(d) Similarly we see that lim_{x→π/4} (cos x − sin x)/cos(2x) = lim_{x→π/4} ((cos x + sin x)(cos x − sin x))/((cos x + sin x) cos(2x)) = lim_{x→π/4} (cos^2 x − sin^2 x)/((cos x + sin x) cos(2x)) = lim_{x→π/4} 1/(cos x + sin x) = 1/(√2/2 + √2/2) = √2/2. Notice the reduction above was made thanks to the identity cos(2x) = cos^2(x) − sin^2(x), where x ∈ R. 5.B.42. For the function f we have f(x) = x^x = e^{ln(x^x)} = e^{x ln(x)}. Thus f is continuous as the composition of the continuous functions y1 = e^x and y2 = x ln(x). Similarly, g(x) = x^{cos(x)} = e^{cos(x) ln(x)}, and g is continuous as the composition of two continuous functions, namely y1 = e^x and y3 = cos(x) ln(x). 5.B.43. Since f, g are continuous, the given relation means that 4 · f(2) − g(2) = 2, and so g(2) = 22. 5.B.44. We have already seen such an example in 5.B.34. We mean the sign function f(x) = sign(x), which has a “jump discontinuity” at x = 0. As we explained in 5.B.34, essentially this is because lim_{x→0+} f(x) = 1 ≠ lim_{x→0−} f(x) = −1. Sketch the graph of f to illustrate the situation. 5.B.46. The function 1/x is continuous for all x ∈ R\{0}, and the function sin(x) is continuous for all x ∈ R. Thus the function f is continuous on the set (−∞, 0) ∪ (0, +∞), i.e., everywhere on R except at x0 = 0. In particular, at x0 = 0 it is easy to see that the limit of f does not exist. The function g is similar to f, except that the oscillations are absorbed by the factor x. We see that −|x| ≤ x sin(1/x) ≤ |x|, and thus by the squeeze theorem we get lim_{x→0} g(x) = 0. In particular, g is continuous everywhere on R. 5.B.49. The family has the form fα,β(x) = (√(2x^2 − x + 6) − αx)/(x + 2) for x < −2, and fα,β(x) = x^3 + βx + 4 for x ≥ −2, and we see that for all x < −2 the expression 2x^2 − x + 6 is positive (hence fα,β is well-defined). We require the continuity of fα,β at any x ∈ R, and hence also at x0 = −2, which is equivalent to the condition lim_{x→−2−} fα,β(x) = fα,β(−2) = lim_{x→−2+} fα,β(x). Then we see that ℓ := lim_{x→−2−} (fα,β(x) · (x + 2)) = fα,β(−2) · 0 = 0. On the other hand, from the definition of f and ℓ, we get ℓ = lim_{x→−2−} (√(2x^2 − x + 6) − αx) = √(lim_{x→−2−} (2x^2 − x + 6)) − lim_{x→−2−} (αx) = 4 + 2α. Thus, 4 + 2α = 0, i.e., α = −2. Using this value, we compute lim_{x→−2−} f−2,β(x) = lim_{x→−2−} (√(2x^2 − x + 6) + 2x)/(x + 2) = lim_{x→−2−} (2x^2 − x + 6 − 4x^2)/((x + 2)(√(2x^2 − x + 6) − 2x)) = lim_{x→−2−} (−2x^2 − x + 6)/((x + 2)(√(2x^2 − x + 6) − 2x)) = lim_{x→−2−} ((x + 2)(−2x + 3))/((x + 2)(√(2x^2 − x + 6) − 2x)) = lim_{x→−2−} (−2x + 3)/(√(2x^2 − x + 6) − 2x) = 7/8. On the other hand, fα,β(−2) = −2(β + 2), and we arrive at the equation 7/8 = −2(β + 2), that is, β = −39/16. Thus only the member f_{−2, −39/16} is continuous. Using Sage we can define piecewise functions in many different ways. For instance, one can combine the def command with if and else, as follows: def f(x) : if x<-2 : return((sqrt(2*x^2-x+6)+2*x)/(x+2)) else : return(x^3-(39/16)*x+4) Now, to view f combine either the print or the show command with the assume command. Thus, for example, adding the syntax assume(x < −2); show(f(x)) will give the first component of f, and similarly for the second, where instead one should type assume(x >= −2). To evaluate f at different points x ∈ R just type f(−4), f(−2), f(0), etc.
As for the continuity, you can add the commands show(bool(limit(f(x), x=-2,dir="+")==f(-2))) show(bool(limit(f(x), x=-2, dir="-")==f(-2))) or simply type show(bool(limit(f(x), x=-2)==f(-2))) Finally, to sketch the graph of f add the cell plot(f, x, -10, 10,ymax=8, color="black") The produced figure is here: Another way to determine f and sketch its graph relies on the piecewise function in Sage, which we analyze in 5.C.5. Note that this method currently has some limitations (e.g., in computing limits) and Sage requires a significant amount of time to return an answer (see however 5.C.5 and 6.D.43 for further details). 5.B.51. The given polynomial has at least two roots in (−1, 1). This is because P(−1) > 0, P(0) < 0, P(1) > 0, and thus there must be at least one root in each of the subintervals (−1, 0) and (0, 1); see also 5.2.19. 5.B.52. The solution is based on applying Bolzano’s theorem via Sage, which goes as follows: f(x) = x^3 - cos(x)*e^x + x*sin(x) show(f(0)); show(f(pi/2)) Sage gives f(0) = −1 < 0 and f(π/2) = π^3/8 + π/2 > 0, and hence by Bolzano’s theorem f has at least one root in (0, π/2). As for the implementation of the find_root function, just add the syntax find_root(f, 0, pi/2). Sage’s answer is 0.9221778114841418. 5.C.1. For f(x), using the binomial theorem one computes (x^n)′ = lim_{h→0} ((x + h)^n − x^n)/h = lim_{h→0} ((n 1) x^{n−1} h + (n 2) x^{n−2} h^2 + · · · + h^n)/h = n x^{n−1} + lim_{h→0} ((n 2) x^{n−2} h + · · · + h^{n−1}) = n x^{n−1}. For the derivative of g(x) we rely on identities from trigonometry, such as the one for sin(x + h), and on the results from 5.B.31 and 5.B.32, that is, lim_{h→0} sin(h)/h = 1 and lim_{h→0} (cos(h) − 1)/h = 0, respectively. We have (sin(x))′ = lim_{h→0} (sin(x + h) − sin(x))/h = lim_{h→0} (sin(x) cos(h) + cos(x) sin(h) − sin(x))/h = cos(x) lim_{h→0} sin(h)/h + sin(x) lim_{h→0} (cos(h) − 1)/h = cos(x) · 1 + sin(x) · 0 = cos(x). The square root function h(x) = √x = x^{1/2} is treated similarly, and is left as an exercise. Here one can show that h′(x) = 1/(2√x) for all x ≠ 0, but h is not differentiable at x = 0 (verify this claim by computing the corresponding one-sided derivative). Note that the derivative of the first given function generalizes as (x^a)′ = a x^{a−1}, for all x ∈ R and all a > 0. When a is negative, we can use the relation (x^a)′ = a x^{a−1} only for x ≠ 0. These rules provide a way to compute h′(x) for x ≠ 0: (√x)′ = (x^{1/2})′ = (1/2) x^{1/2 − 1} = (1/2) x^{−1/2} = 1/(2 x^{1/2}) = 1/(2√x). 5.C.2. The claim is based on the properties of f and follows from the definition of the derivative of f: f′(x) = lim_{h→0} (f(x + h) − f(x))/h = lim_{h→0} (f(x)f(h) − f(x))/h = lim_{h→0} f(x)(f(h) − 1)/h = f(x) · lim_{h→0} (f(h) − 1)/h = f(x) · lim_{h→0} (f(0 + h) − f(0))/h = f(x) · f′(0) = f(x) · 1 = f(x). 5.C.3. Recall that f(x) = x if x ≥ 0, and f(x) = −x if x < 0. Thus f′(x) = 1 for x > 0 and f′(x) = −1 for x < 0. For instance, for x > 0 we can choose h small enough such that x + h > 0, and then f′(x) = lim_{h→0} (|x + h| − |x|)/h = lim_{h→0} (x + h − x)/h = 1. Similarly for x < 0. At x0 = 0 the function f cannot be differentiable, since the left-side derivative does not agree with the right-side derivative: lim_{x→0+} (f(x) − f(0))/(x − 0) = lim_{x→0+} (x − 0)/x = 1, lim_{x→0−} (f(x) − f(0))/(x − 0) = lim_{x→0−} (−x − 0)/x = −1. Hence f′(0) does not exist (and this is what we should expect, since the graph of f forms a corner at 0). 5.C.4. If f is differentiable at x0, then the limit lim_{x→x0} (f(x) − f(x0))/(x − x0) exists and equals f′(x0).
Thus we also have (x − x0) · f′ (x0) = (x − x0) · lim x→x0 f(x) − f(x0) x − x0 . Consider now the limits in both sides as x tends to x0. In the l.h.s we obtain a zero, since limx→x0 ( (x − x0) · f′ (x0) ) = f′ (x0) · limx→x0 (x − x0) = f′ (x0) · 0 = 0. For the limit in the r.h.s we write altogether 0 = lim x→x0 (x − x0) · lim x→x0 f(x) − f(x0) x − x0 = lim x→x0 ( (x − x0) · f(x) − f(x0) x − x0 ) = lim x→x0 ( f(x) − f(x0) ) , which implies that limx→x0 f(x) = f(x0). Therefore, f is continuous at x0. 5.C.6. In order to use the derivative sin′ (x) = cos(x) and the chain rule, we first need to use the identity cos(x) = sin(x+ π 2 ). This allows us to view cos(x) as the composition cos(x) = f(g(x)), with f(x) = sin(x) and g(x) = x + π 2 , respectively. Then, by the chain rule we get (cos(x))′ = g′ (x) · f′ (g(x)) = 1 · cos(x + π 2 ) = cos(x) cos(π/2) − sin(x) sin(π/2) = 0 − sin(x) = − sin(x) . 5.C.10. We have f(1) = α and since f′ (x) = 2αx − 4 ln(x) − 4, it follows that f′ (1) = 2α − 4. Thus the tangent of f at the point P = [1, α] is given by y − α = (2α − 4)(x − 1) ⇐⇒ y = (2α − 4)x − α + 4 . We finally see that this line passes through [0, 0] if and only if 0 = (2α − 4) · 0 − α + 4, that is, α = 4. 5.C.13. The answer is y = 2 − x; y = x. 5.C.16. It is useful to recall that the tangent function tan(x) = sin(x)/ cos(x) is not defined at any x = π 2 + κπ, with κ ∈ Z. Let us first plot tan(x) (in fact, let us present the graph of a periodic extension of the tangent function on (−3π/2, 3π/2)). To obtain this we used Sage via the following block: p=plot(tan(x), x, -pi/2, pi/2, ymax=5, ymin=-5, color="black") p+=plot(tan(x-pi), x, -3*pi/2, -pi/2, ymax=5, ymin=-5, color="black", detect_poles="False") p+=plot(tan(x+pi), x, pi/2, 3*pi/2, ymax=5, ymin=-5, color="black", detect_poles="False") p.show(ticks=pi/2, tick_formatter=pi, aspect_ratio="1") 497 CHAPTER 5. ESTABLISHING THE ZOO In this block ensure to include the options ymax=5, ymin=-5 for plotting correctly the tangent function. From the figure, it is evident that tan(x) is strictly increasing for all x ∈ (−π/2, π/2). Let us substantiate this claim rigorously with a mathematical proof. Let x1, x2 be real numbers such that −π 2 < x1 < x2 < π 2 . Based on the identity sin(α − β) = sin(α) cos(β) − sin(β) cos(α) we see that tan(x2) − tan(x1) = sin(x2) cos(x2) − sin(x1) cos(x1) = sin(x2) cos(x1) − sin(x1) cos(x2) cos(x1) cos(x2) = sin(x2 − x1) cos(x1) cos(x2) > 0 , since 0 < x2 − x1 < π. Thus, for x1 < x2 we have tan(x2) − tan(x1) > 0 ⇐⇒ tan(x1) < tan(x2) ⇐⇒ f(x1) < f(x2) , which shows that the tangent function is increasing on (−π/2, π/2). Alternatively, we can compute the first derivative of f, given by f′ (x) = tan′ (x) = 1 cos2(x) (see 5.E.76). Thus for any x ∈ (−π/2, π/2) we get f′ (x) > 0 and our claim follows. 5.C.19. The answer is given by π 6 − 2√ 3 0.003. 5.C.20. The answers are 1 2 − √ 3π 360 and √ 2 2 + √ 2π 360 , respectively. 5.C.24. The function f is continuous on its domain [−2, 2]. We have f′ (x) = 2 3 x− 1 3 = 2 3x1/3 . Thus f′ is not defined at x0 = 0. Since x0 ∈ (−2, 2), this means that we cannot apply the mean value theorem. In the opposite, for the functions g, h the mean value theorem applies. Indeed, g is continuous on [1, 4] with g′ (x) = 1 (x+1)2 . Thus g is also differentiable on the open interval (1, 4) and according to the Theorem 5.3.9 there exists c ∈ (1, 4) satisfying g′ (c) = g(4)−g(1) 4−1 = 4/5−1/2 3 = 1 10 . 
From this we get (1+c)2 = 10, that is, c = ± √ 10−1 and since c ∈ (1, 4) we can accept only the value c = √ 10−1 ≈ 2.163. Similarly is treated the case for h. 5.C.26. It is sufficient to apply Cauchy’s mean value theorem for the functions f(x) = xn+1 and g(x) = xn , which are both continuous on [a, b], and differentiable on (a, b). Also, by assumption 0 ̸∈ (a, b) and hence g′ (x) = nxn−1 ̸= 0 for all x ∈ (a, b). Thus f, g satisfy the requirements of the Cauchy’s mean value theorem. This gives that g(b) ̸= g(a) and moreover there exists c ∈ (a, b) such that f(b) − f(a) g(b) − g(a) = f′ (c) g′(c) ⇐⇒ bn+1 − an+1 bn − an = (n + 1)cn ncn−1 ⇐⇒ n n + 1 (bn+1 − an+1 bn − an ) = c , for any n ∈ N\{0}. Since c ∈ (a, b) we are done. 5.C.31. (a) v(0) = 6 m/s; (b) t = 3 s, s(3) = 16 m; (c) v(4) = −2 m/s, a(4) = −2 m/s2 . 5.C.34. Such an example occurs by considering the limit limx→0 sin(x) x+1 , which obviously equals to 0 and is not of some indeterminate form. Another example is the limit limx→0 sin(x) ex . 5.D.2. Often, we may use Sage to evaluate convergent infinite series, or verify summation identities. Recall from Chapter 1 that this can be done via the command sum, whose general form is sum(f(n), n, a, b) (formally, this corresponds to the sum ∑b n=a f(n)). Of course, one can replace b by Infinity or its alias oo, and this corresponds to infinite sums. This method can be applied for example when one tries to evaluate the zeta function ζ(p), for certain p, as in our example. In particular, an explicit implementation of our task goes as follows: var("n") show(sum(1/n^2, n, 1, oo)) show(sum(1/n^3, n, 1, oo)) show(sum(1/n^4, n, 1, oo)) show(sum(1/n^5, n, 1, oo)) show(sum(1/n^6, n, 1, oo)) show(sum(1/n^7, n, 1, oo)) show(sum(1/n^8, n, 1, oo)) show(sum(1/n^9, n, 1, oo)) Let us put Sage’s output in a table: p ζ(p) p ζ(p) 2 π2 /6 3 ζ(3) 4 π4 /90 5 ζ(5) 6 π6 /945 7 ζ(7) 8 π8 /9450 9 ζ(9) 498 CHAPTER 5. ESTABLISHING THE ZOO Hence Sage is able to evaluate explicitly ζ(p) for p = 2, 4, 6, 8. However, as we expected, Sage is unable to compute the cases p = 3, 5, 7, 9. In these instances, Sage indicates that the sum we are trying to evaluate corresponds to the zeta function. We will encounter further related applications in the sequel (see for example 5.D.8). 5.D.7. The series S1 converges absolutely. For instance, we see that ∞∑ n=1 sin(n) n2 ≤ ∞∑ n=1 1 n2 < ∞∑ n=0 1 2n = 2 . Passing to S2, one observes that this is an alternating series (since cos (πn) = (−1)n , n ∈ N). Moreover, we see that the sequence of the absolute values of its terms is decreasing, and lim n→∞ 1 3 √ n2 = 0. It follows that the series S2 is convergent. In addition, we see that ∞∑ n=1 cos(πn) 3 √ n2 = ∞∑ n=1 1 3 √ n2 ≥ ∞∑ n=1 1 n = +∞ , which implies that S2 converges also conditionally. 5.D.9. We see that 1 ≤ 1 , 1 22 + 1 32 < 2 · 1 22 = 1 2 , 1 42 + 1 52 + 1 62 + 1 72 < 4 · 1 42 = 1 4 , . . . , and more general 1 (2n)2 + · · · + 1 (2n+1 − 1)2 < 2n · 1 (2n)2 = 1 2n , n = 1, 2, . . . Hence, by comparing the terms of both of the series, we get the required inequality. By the way, notice that from this inequality it follows that the series ∑∞ n=1 1 n2 converges absolutely. Eventually, let us specify that ∞∑ n=1 1 n2 = π2 6 < 2 = ∞∑ n=0 1 2n . 5.D.15. The radius of convergence equals to r = +∞. 5.D.16. The radius of convergence equals to 1. 5.D.17. The domain of convergence is the closed interval [−1, 1]. 5.D.18. The answer is x ∈ [ 2 − 1 3 , 2 + 1 3 ] . 5.D.19. The series converges for all x ∈ [− 3 √ 2, 3 √ 2). 5.D.20. 
The series converges for all x ∈ [−1, 1] 5.E.3. The obvious answer is x2 . 5.E.4. The natural cubic spline interpolating the given data is given by S1(x) ≡ x; S2(x) ≡ x. 5.E.5. The complete cubic interpolation spline in question is given by Si(x) = x + 3, x ∈ [−3 + i − 1, −3 + i]; i ∈ {1, 2}. 5.E.6. The mentioned values are given by cos2 (π 4 ) = 1/2, cos2 (π 3 ) = 1/4, cos2 (π 2 ) = 0. Since the third value is zero, we need to compute only the values of the first two elementary Lagrange polynomials at the given points. Based on the definition of such polynomials and the given data, it follows that ℓ0(1) = (1 − π 3 )(1 − π 2 ) (π 4 − π 3 )(π 4 − π 2 ) = 8 (π − 3)(π − 2) π2 , ℓ1(1) = (1 − π 4 )(1 − π 2 ) (π 3 − π 4 )(π 3 − π 2 ) = −9 (π − 4)(π − 2) π2 . Thus, we deduce that P(1) = 1 2 · 8 (π − 3)(π − 2) π2 − 1 4 · 9 (π − 4)(π − 2) π2 + 0 = (7π − 12)(π − 2) 4π2 ≈ 0.288913 . Note that the actual value is approximately cos2 1 ≈ 0.291927. 5.E.8. Such a polynomial is given by x4 + 2x3 − x2 + x − 2. 5.E.15. Consider any singleton (one-element set) X ⊂ R. 5.E.16. The set C must be a singleton. Thus, let us choose C = {0}, for example. Now as A, B we can take the open subsets A = (−1, 0), and B = (0, 1), respectively. 499 CHAPTER 5. ESTABLISHING THE ZOO 5.E.19. Using the properties of powers one can express the general term of the given sequence as an = (√ 2 · 4 √ 2 · 8 √ 2 · · · 2n√ 2 ) = 2 1 2 · 2 1 4 · 2 1 8 · · · 2 1 2n = 2 1 2 + 1 4 + 1 8 +···+ 1 2n . Thus, in combination with the continuity of the exponential function (a property discussed in the paragraph 5.2.22), we get lim n→∞ an = 2 lim n→∞ ( 1 2 + 1 4 + 1 8 +···+ 1 2n ) = 2 ( ∞∑ n=1 1 2n ) . Next we use a known formula for the sum of geometric series: ∞∑ n=1 ( 1 2 )n = 1 2 · 1 1 − 1 2 = 1. Thus finally lim n→∞ an = 2. Notice in Sage we can simply type n=var("n"); b=2**(sum(1/2**n, n, 1, oo)); b which answers 2, and the claim follows. 5.E.20. Notice that every natural number k ≥ 2 satisfies the relation 1 (k−1)k = 1 k−1 − 1 k . (this identity is a special case of the so called partial fraction decomposition). Therefore, we compute lim n→∞ ( 1 1 · 2 + 1 2 · 3 + 1 3 · 4 + · · · + 1 (n − 1) · n ) = lim n→∞ ( 1 1 − 1 2 + 1 2 − 1 3 + 1 3 − 1 4 + · · · + 1 n − 1 − 1 n ) = lim n→∞ ( 1 − 1 n ) = 1 . Let us mention that this limit determines the sum of one of the so-called “telescoping series” (used by J. Bernoulli already about 300 years ago). In Sage we can verify the result as usual, i.e., by the cell var("i, n"); limit(sum(1/(i*(i+1)), i, 1, n), n=oo) 5.E.21. We see that lim n→∞ ( 1 n2 + 2 n2 + · · · + n − 2 n2 + n − 1 n2 ) = lim n→∞ ( 1 + n − 1 n2 · n − 1 2 ) = 1 2 . In Sage we can type limit(sum(i/n ∗ ∗2, i, 1, n − 1), n = oo), which provides a quick confirmation of our computation. 5.E.23. For these tasks one relies on the relation lim n→∞ ( 1 + a n )n = ea , where a ∈ R, see 5.B.18 . This means that e−1 = 1 e = lim n→∞ ( 1 − 1 n )n = lim n→∞ ( n − 1 n )n . Thus, the substitution m = n − 1 gives lim n→∞ ( n − 1 n )n = lim m→∞ ( m m + 1 )m+1 = lim m→∞ ( m m + 1 )m · lim m→∞ m m + 1 . Clearly, the second limit is equal to 1 and replacing n with m we get the result 1 e = lim n→∞ ( n n + 1 )n . In Sage we can easily verify this by typing n = var(”n”); f(n) = (n/(n + 1))n ; lim(f(n), n = oo). For the second limit we see that lim n→∞ ( 1 + 1 n2 )n = lim n→∞ ( 1 + 1 n2 )n2 n = lim n→∞ (( 1 + 1 n2 )n2 ) 1 n = e0 = 1 . For the third limit we will apply only Sage and leave to the reader the formal details. 
Hence type n=var("n"); g(n)=(1-1/n)**(n^2); lim(g(n), n=oo) This returns the value 0. 5.E.27. (a) We have a > 0, thus we also get 0 < na < na + 1 and 0 < 1/(na + 1) < 1/(na), for all positive n. This gives |1/(1 + na) − 0| = 1/(1 + na) < 1/(na) = (1/a) · (1/n), for all n = 1, 2, . . . . As the sequence (1/n) tends to 0 and 1/a > 0, the inequality obtained above allows us to invoke the result from 5.E.25, with n ≥ N = 1. Thus lim_{n→∞} 1/(1 + na) = 0. (b) Obviously, the sequence (an) with general term an = x^n/n! satisfies an ≠ 0 for all n. Moreover, ℓ = lim_{n→∞} |a_{n+1}/a_n| = lim_{n→∞} |x^{n+1}/(n + 1)! · n!/x^n| = lim_{n→∞} |x|/(n + 1) = 0 < 1. Thus, by the second result in 5.E.26 we get lim_{n→∞} x^n/n! = 0, for all x ∈ R. 5.E.31. By definition, x1 = 1 > 0. Assuming that xn > 0 for some n, we get x_{n+1} = 1/(4 + xn) > 0. Thus, by induction we see that (xn) is a positive sequence, that is, xn > 0 for all n. Using this, for all n ≥ 2 one computes |x_{n+1} − xn| = |1/(4 + xn) − 1/(4 + x_{n−1})| = |x_{n−1} − xn|/((4 + xn)(4 + x_{n−1})) < (1/16)|xn − x_{n−1}|. Thus (xn) is a contractive sequence, and by 5.E.30 it is Cauchy, and hence convergent. 5.E.32. Obviously, {ak : k ≥ n + 1} ⊆ {ak : k ≥ n}, and hence s_{n+1} = sup{ak : k ≥ n + 1} ≤ sup{ak : k ≥ n} = sn. This shows that the sequence (sn) is decreasing. Similarly, the sequence (ℓn) is increasing, since ℓn ≤ inf{ak : k ≥ n + 1} = ℓ_{n+1}; see also the discussion in 5.4.6. By assumption, (an) is bounded, hence (sn) and (ℓn) are also bounded, and the result follows by 5.B.11. 5.E.36. The naturals N form a closed subset of R (recall by 5.B.20 that there are no limit points). Another explanation follows from the fact that any b ∉ N has a small neighbourhood disjoint from N. On the other hand, the subset Q ⊂ R is neither closed nor open. Indeed, in 5.B.20 we saw that the set of all limit points of Q is the real line R, so Q cannot be closed. Moreover, we saw that there are no interior points, so Q cannot be open. 5.E.37. (a) The open subsets are R∗ and R\Z. (b) Only the set [1, 2] ∪ {5} is closed. 5.E.47. We will show that the sequence is strictly increasing and bounded, so by the monotonicity theorem (5.B.11) it must be convergent. Obviously, a1 = √2 ≤ a2 = √(2 + √2). Assume that ak < a_{k+1} for some k. Using the induction hypothesis we see that 2 + ak < 2 + a_{k+1}, or √(2 + ak) < √(2 + a_{k+1}), that is, a_{k+1} < a_{k+2}. Hence, an < a_{n+1} for all n ∈ N, which means that (an)_{n∈N} is strictly increasing. Notice now that a1 = √2 < 2. If also a_{n−1} < 2, then we get 2 + a_{n−1} < 4, that is, √(2 + a_{n−1}) < 2, or equivalently an < 2. Thus (an) is bounded and by the monotonicity theorem (5.B.11) it is convergent, lim_{n→+∞} an = sup{an : n ∈ N} = 2. To verify this final conclusion, let x = lim_{n→+∞} an. Then taking limits on both sides of the relation an = √(2 + a_{n−1}) we see that √(x + 2) = x, which has two solutions (x = −1, x = 2), of which only the positive one is acceptable. 5.E.49. We have lim_{x→0} sin^2(x)/x = lim_{x→0} sin(x) · lim_{x→0} sin(x)/x = 0 · 1 = 0. Apparently, lim_{x→0} x/sin(x) = 1^{−1} = 1, while the limit lim_{x→0} 1/sin(x) does not exist. Similarly, using the rule for the limit of a product, we see that the limit lim_{x→0} x/sin^2(x) does not exist. For the calculation of the limit lim_{x→0} arcsin(x)/x use the identity x = sin(arcsin(x)), which makes sense for any x ∈ (−1, 1). This gives lim_{x→0} arcsin(x)/x = lim_{x→0} arcsin(x)/sin(arcsin(x)) = lim_{y→0} y/sin(y) = 1, where we substitute y = arcsin(x).
Observe that y → 0 follows from substituting x = 0 into y = arcsin(x) and from continuity of this function at 0 (this also guarantees that such a substitution can be made). Next, see that lim x→0 3 tan2 (x) 5x2 = lim x→0 ( 3 5 · sin(x) x · sin(x) x · 1 cos2(x) ) = 3 5 · lim x→0 sin(x) x · lim x→0 sin(x) x · lim x→0 1 cos2(x) = 3 5 · 1 · 1 · 1 = 3 5 . For the next case it useful to sketch the graph of the function sin(3x)/ sin(5x), via Sage. This is given by 501 CHAPTER 5. ESTABLISHING THE ZOO This graph occurs via the following cell in Sage: a=plot(sin(3*x)/sin(5*x), x, -2*pi, 2*pi, detect_poles="show", ymin=-10,ymax=10, color="black") b=point((0, 3/5), size = 70, color="black") c=text("3/5", (0.4, 0.3), color="black", fontsize="10"); show(a+b+c) As we see, the limit under question must be equal to 3/5. Let us verify this in a formal way: lim x→0 sin(3x) sin(5x) = lim x→0 ( sin(3x) 3x · 5x sin(5x) · 3 5 ) = lim x→0 sin(3x) 3x · lim x→0 5x sin(5x) · 3 5 = lim y→0 sin y y · lim z→0 z sin z · 3 5 = 3 5 . For a confirmation via Sage, type lim(sin(3 ∗ x)/sin(5 ∗ x), x = 0). We leave to the reader to show that the final limit equals to 3/5, as well. 5.E.50. Based on the fact that lim x→0 ex −1 x = 1, for the first limit we obtain lim x→0 e5x − e2x x = lim x→0 ( e2x e(5−2)x −1 (5 − 2)x (5 − 2) ) = lim x→0 e2x · lim x→0 e3x −1 3x · 3 = e0 · lim y→0 ey −1 y · 3 = 1 · 1 · 3 = 3 . In a similar way one computes the second limit, and we leave the details for practice. We present the solution in Sage and include in a figure both plots. G(x)=(e**(5*x)-e**(2*x))/x; E(x)=(e**(5*x)-e**(-x))/sin(2*x); show(lim(E(x), x=0)) a=plot(E(x), x, -pi, pi, detect_poles="show", ymin=-2, ymax=4, color="black") b=plot(G(x), x, -pi, pi, detect_poles="show", ymin=-2, ymax=4, color="grey") show(a+b) Executing this block we get the confirmation for the limit and desired illustration of the graphs, see here: 502 CHAPTER 5. ESTABLISHING THE ZOO 5.E.52. Obviously, limx→+∞ ( 2 + 1 x ) = 2, limx→+∞ 1 x = 0 and limx→+∞ x = +∞. Thus, limx→+∞ ( 2 + 1 x )1 x = 20 = 1 and limx→+∞ x−x = limx→+∞ (1 x )x = 0. Equally well, we could first compute the limit lim x→+∞ (1 x · ln(2 + 1 x ) ) = 0·ln 2 = 0, and then, as a generalization of the formula lim x→x0 f(x)g(x) = e lim x→x0 (g(x) ln(f(x))) , we obtain limx→+∞ ( 2 + 1 x )1 x = e lim x→+∞ ( 1 x ·ln(2+ 1 x ) ) = e0 = 1. Similarly, we see that lim x→+∞ (−x ln(x)) = −∞, and hence lim x→+∞ x−x = elimx→+∞(−x ln(x)) = e−∞ = 0 . Finally one must be cautious when calculating the final limit. Both one-sided limits exist, but are different, which implies that this limit does not exist: lim x→0+ e 1 x = elimx→0+ 1 x = e∞ = ∞ and lim x→0− e 1 x = elimx→0− 1 x = e−∞ = 0. In Sage one can directly solve the task by the cell f(x)=(2+(1/x))**(1/x); g(x)=x**(-x); j(x)=e**(1/x) show([lim(f(x),x=+oo), lim(g(x),x=+oo), lim(j(x),x=0)]) 5.E.53. The first limit equals to +∞. For the second one, after multiplying by the fraction √ x2+x+x√ x2+x+x it follows that limx→+∞( √ x2 + x − x) = 1/2. For the third case we see that limx→+∞(x √ 1 + x2 − x2 ) = 1/2. Finally, limx→0− √ 1+tan(x)− √ 1−tan(x) sin(x) = 1. 5.E.54. In Sage we can directly type x, n=var("x, n"); f(x)=((1+2*n*x)**n-(1+n*x)**(2*n))/(x**2); lim(f(x), x=0) and this returns the answer −n3 . To verify this in a formal way, we will use the binomial theorem. This gives (1 + 2nx) n = 1 + ( n 1 ) 2nx + ( n 2 ) (2nx) 2 + P (x) x3 , (1 + nx) 2n = 1 + ( 2n 1 ) nx + ( 2n 2 ) (nx) 2 + Q (x) x3 , for some polynomials P, Q. 
Let us emphasize on the fact that this really holds for all n ∈ N∗ (if n = 1, the polynomials P, Q as zero constants). So, for all x ∈ R we obtain the relations (1 + 2nx) n = 1 + 2n2 x + 2n3 (n − 1) x2 + P (x) x3 , (1 + nx) 2n = 1 + 2n2 x + n3 (2n − 1) x2 + Q (x) x3 . Based on these formulas we are now able to present the result, as follows: lim x→0 (1 + 2nx) n − (1 + nx) 2n x2 = lim x→0 ( 2n3 (n − 1) − n3 (2n − 1) ) x2 + (P(x) − Q(x)) x3 x2 = lim x→0 ( −n3 + (P(x) − Q(x)) x ) = −n3 + 0 = −n3 . 5.E.55. By the definition of (an) and based on the trigonometric inequality |cos(x)| ≤ 1, (x ∈ R), we see that |an+1 − an| = cos(n + 1) 2n+1 ≤ 1 2n+1 < 1 2n . Thus |an+1 − an| < 1 2n and it follows that (an) is a Cauchy sequence. Thus it converges (in fact one can show that a sequence (an) satisfying |an+1 − an| < cn , for some c with 0 < c < 1 and for all n ∈ Z+, is a Cauchy sequence). 5.E.58. First, let us calculate the one-sided limits at the point x0 = 0: lim x→0− (x − 1)− sgn(x) = lim x→0− (x − 1) = −1 , lim x→0+ (x − 1)− sgn(x) = lim x→0+ 1 x − 1 = −1 , whence limx→0(x−1)− sgn(x) = −1. However, the function will be continuous at 0 = 0, if and only if f(0) = −1. Recall by the solution given in 5.B.34 that for the sign function we have adopted the convention sgn(0) = 0. Thus f(0) = (−1)0 = 1, and the function at hand is not continuous at x0 = 0. Similarly, for x = 1 we obtain lim x→1− (x − 1)− sgn(x) = lim x→1− 1 x − 1 = −∞ , lim x→1+ (x − 1)− sgn(x) = lim x→1+ 1 x − 1 = ∞ . Hence both one-sided limits at the point 1 exist, yet they differ. Thus the limit lim x→1 f(x) does not exist, and the function is not continuous at x1 = 1, as well. 5.E.59. The function is continuous at the points −π, 0, π, only right-continuous at the point 2, only left-continuous at the point 3, and discontinuous at 1. 5.E.60. The function is continuous if and only if p = 2. 503 CHAPTER 5. ESTABLISHING THE ZOO 5.E.61. The answer is a = 4. 5.E.62. The given function is continuous at every point of its domain R\{±1}. Thus, the extended function will be continuous if and only if f(−1) = lim x→−1 ( ( x2 − 1 ) sin 2x − 1 x2 − 1 ) , f(1) = lim x→1 ( ( x2 − 1 ) sin 2x − 1 x2 − 1 ) , such that the limits in the right hand side exist and are finite (if either of these limits did not exist, or are infinite, then f cannot be extended to a continuous function). Indeed, since any x ∈ R, x ̸= ±1 satisfies sin ( 2x−1 x2−1 ) ≤ 1, it follows that − x2 − 1 ≤ f(x) ≤ x2 − 1 , ∀ x ∈ R , x ̸= ±1 . Clearly limx→±1 x2 − 1 = 0, and by the squeeze theorem we get f(±1) = 0. 5.E.70. If f : [a, b] → R is injective then we have f(a) ̸= f(b), and we may assume that f(a) < f(b) (otherwise consider −f). We will show that f is strictly increasing. First we will show that for any x ∈ (a, b) we have f(a) < f(x) < f(b). Assume in the contrary that f(x) ≤ f(a) or f(x) ≥ f(b). Because f is injective we essentially assume that f(x) < f(a) or f(x) > f(b). Case f(x) < f(a): This means that f(x) < f(a) < f(b) and by applying (the general version of) the intermediate value theorem for the restriction f [x,b] we find some x0 ∈ (x, b) with f(x0) = f(a), a contradiction since f is injective. Case f(x) > f(b): This means that f(a) < f(b) < f(x) and by applying (the general version of) the intermediate value theorem for the restriction f [a,x] we find some x0 ∈ (a, x) with f(x0) = f(b), which is again a contradiction by injectivity. Based now on our claim and taking x1, x2 ∈ R with a < x1 < x2 < b it is easy to see that f(x1) < f(x2), that is, f is strictly increasing. 
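As a quick numerical illustration of the statement in 5.E.70, one may sample an injective continuous function on a grid and check that consecutive values increase. The following cell is only a minimal sketch: the function x + sin(x) and the grid of 101 points on [0, 4] are our own arbitrary choices, not part of the exercise.
xs = [k*4/100 for k in range(101)]            # uniform grid on [0, 4]
vals = [n(x + sin(x)) for x in xs]            # x + sin(x) is injective and continuous on [0, 4]
all(vals[i] < vals[i+1] for i in range(100))
Sage returns True, in accordance with the strict monotonicity guaranteed by 5.E.70 (of course, such a finite check only illustrates the claim; it does not prove it).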
5.E.77. We will solve this problem using logarithmic differentiation. In particular, let f be a positive function for which f′ (x) exists. Then recall that (ln f(x)) ′ = f′ (x) f(x) , that is, f′ (x) = f(x) · (ln f(x))′ . Now, the given function is differentiable as a product of differentiable functions. Thus, by combining the previous rule with basic properties of the natural logarithm, we obtain ( 4 √ x − 1 · (x + 2)3 ex(x + 132)2 )′ = 4 √ x − 1 · (x + 2)3 ex(x + 132)2 · [ ln 4 √ x − 1 · (x + 2)3 ex(x + 132)2 ]′ = 4 √ x − 1 · (x + 2)3 ex(x + 132)2 · [ 3 ln (x + 2) + 1 4 ln (x − 1) − x ln(e) − 2 ln (x + 132) ]′ = 4 √ x − 1 · (x + 2)3 ex(x + 132)2 [ 3 x + 2 + 1 4 (x − 1) − 1 − 2 x + 132 ] . In Sage to verify this complicated computation use the cell f(x)=(((x-1)**(1/4))*(x+2)^3)/(exp(x)*(x+132)^2) show(diff(f(x), x).simplify()) 5.E.78. (1) a′ (x) = x2 sin(x); (2) b′ (x) = cos (sin(x)) cos(x); (3) c′ (x) = 3x2 +2 x3+2x cos ( ln ( x3 + 2x )) ; (4) d′ (x) = 2(1−2x) (1−x+x2)2 ; (5) ε′ (x) = 7 8 x− 1 8 ; (6) f′ (x) = cos(x) cos (sin(x)) cos (sin (sin(x))); (7) g′ (x) = cos x 3 3√ sin2 x ; (8) h′ (x) = 2x2 1−x6 3 √ 1+x3 1−x3 . 5.E.79. (a) Such a function is the floor function f(x) = ⌊x⌋ introduced in 5.B.48. For instance, we have ⌊−2.2⌋ = −3 and ⌊2.2⌋ = 2, thus f(−x) ̸= f(x) and f cannot be even. Moreover, ⌊−2.2⌋ = −3 ̸= −2 = −⌊2.2⌋, and hence the floor function is neither odd. (b) Suppose that f is even, that is, f(−x) = f(x) for all x in the domain of f. Taking the derivatives in both sides of this equation and based on the chain rule, we get f′ (x) = (f(−x))′ = (−x)′ f′ (−x) = −f′ (−x). This means that f′ (−x) = −f′ (x) and hence f′ is odd. Similarly the case with f odd. 5.E.80. The derivative f′ (x) of a polynomial f provides the slope of the tangent line to its graph at the point [x, f(x)] ∈ R2 . In degree zero, f is a non-zero constant a · x0 = a · 1 = a and its derivative vanishes everywhere. This confirms the fact that the graph in this case is the line y = a, parallel to the x-axis. In degree one, f(x) = ax + b for some a, b ∈ R with a ̸= 0, thus its derivative is the constant a. Consequently, the slope of the tangent is a everywhere, which is the property 504 CHAPTER 5. ESTABLISHING THE ZOO characterizing a line. In degree two, f(x) = a x2 + b x + c with a, b, c ∈ R and a ̸= 0. Obviously, f′ (x) = 2a x + b and x = − b 2a is the only point where the slope is zero, i.e., the maximum or the minimum of the quadratic function, as pointed out in 5.A.2. In degree three f(x) = a x3 + b x2 + c x + d with a, b, c, d ∈ R and a ̸= 0. The derivative is the second degree polynomial f′ (x) = 3a x2 + 2b x + c. Clearly, there are points with tangent slope zero if and only if the latter quadratic expressions allows zero values. This happens if and only if its discriminant 4(b2 − 3a c) is non-negative. In particular, there will be two bumps on the graph if and only if b2 > 3ac. The slope of the graph gets to zero only in one point if b2 = 3a c, and then the graph curve looks similarly as the graph of f(x) = x3 at x = 0. 5.E.82. Observe first that (sinh(x))′ > 0 for all x ∈ R, hence f is strictly increasing on R (as we see also in the graph of f above). Thus it has an inverse function, called the “inverse hyperbolic sine function” and denoted by sinh−1 (x). The domain and the range of f are both (−∞, ∞), hence the same applies for sinh−1 . To compute its explicit form, let y = sinh−1 (x). Then x = sinh(y) = ey − e−y 2 , that is, ey − e−y −2x = 0. 
Multiplying both sides of this equation by e^y we arrive at the equation e^{2y} − 2x e^y − 1 = 0, which is obviously a quadratic equation in the variable t = e^y. Its solutions have the form t± = x ± √(x^2 + 1). However, t = e^y > 0, but x − √(x^2 + 1) < 0, from which we deduce that e^y = x + √(x^2 + 1), or equivalently y = ln(x + √(x^2 + 1)), i.e., sinh^{−1}(x) = ln(x + √(x^2 + 1)) for all x ∈ R. You may like to confirm this expression in Sage, based on the command arcsinh which corresponds to the function sinh^{−1}. This, for example, can be done via the cell bool(arcsinh(x)==ln(x+sqrt(x^2+1))) Recall now that (ln f(x))′ = f′(x)/f(x) and (√g(x))′ = g′(x)/(2√g(x)), where f, g are both positive on their domain. Thus, (sinh^{−1}(x))′ = (ln(x + √(x^2 + 1)))′ = (1 + x/√(x^2 + 1))/(x + √(x^2 + 1)) = 1/√(x^2 + 1), x ∈ R. As a confirmation via Sage, just type show(diff(ln(x+sqrt(x^2+1)), x).factor()) 5.E.86. (1) It does not, because the one-sided derivatives differ. In particular, we get π/2 from the right and −π/2 from the left. (2) It does not. (3) To provide an answer one should recall the derivative of the absolute value function. In particular, the function f(x) := |x − 5| + |x − 9| has the desired properties. (4) Suppose that f(x) = g(x) = 1 at the rational numbers and f(x) = g(x) = −1 at the irrational numbers. (5) It is not hard to see that f′(x) < 0, for all x > e. 5.E.87. We see that f′(x) = 5 − 4 sin(x) > 0. This means that f is increasing and hence a bijection. Thus f^{−1} exists, and since f(0) = 4 we have f^{−1}(4) = 0. Moreover, f′(f^{−1}(4)) = f′(0) = 5 ≠ 0, hence we may apply 5.3.6, which gives (f^{−1})′(4) = 1/f′(f^{−1}(4)) = 1/f′(0) = 1/5. 5.E.89. From the conditions P(0) = 1 and P′(0) = 1 it follows that P(x) = bx^3 + cx^2 + x + 1, for some b, c ∈ R. The two remaining conditions determine two equations for the variables b and c: b + c + 2 = 2a + 2, 3b + 2c + 1 = 5a + 1, with the unique solution b = c = a. Therefore, a polynomial satisfying the desired conditions has the form P(x) = ax^3 + ax^2 + x + 1, with a ∈ R. The derivative of P is a parabola given by P′(x) = 3ax^2 + 2ax + 1, and we require P′(x) > 0 or P′(x) < 0, which is equivalent to saying that the discriminant ∆ = 4a^2 − 12a of P′ is negative, ∆ < 0. This gives 4a(a − 3) < 0, which is true for all a ∈ (0, 3). 5.E.91. Let x0 ∈ R be an arbitrary point. By means of the definition of the first derivative of a function on R, and in combination with the relation f(x + y) = (f(x) + f(y))/(1 − f(x)f(y)), we see that f′(x0) = lim_{h→0} (f(x0 + h) − f(x0))/h = lim_{h→0} ((f(x0) + f(h))/(1 − f(x0)f(h)) − f(x0))/h = lim_{h→0} (f(x0) + f(h) − f(x0) + f^2(x0)f(h))/(h(1 − f(x0)f(h))) = lim_{h→0} (1 + f^2(x0))/(1 − f(x0)f(h)) · lim_{h→0} f(h)/h = (1 + f^2(x0))/(1 − f(x0)f(0)) · f′(0) = f′(0) · (1 + f^2(x0)). Here we have used that f(0) = 0, which we get from the relation f(x + y) = (f(x) + f(y))/(1 − f(x)f(y)) by setting y = 0, and moreover that f is differentiable at 0, and hence lim_{h→0} f(h)/h = lim_{h→0} (f(h + 0) − f(0))/(h − 0) = f′(0). Thus f′(x) = f′(0)(1 + f^2(x)) for all x ∈ R, and f is differentiable on R. 5.E.95. (a) The function has a local maximum at the point x1 = e^{−2}. It has a local minimum at the point x2 = 1. (b) The answer is 1/∛e. (c) The answer is 4 = p(−1) = p(2), −16 = p(−3). (d) No. For instance, if α = √2/2, there is only a local extremum at the point. 5.E.96. The given function f has as its domain the whole real line, and we have f′(x) = −e^{−x} = −1/e^x < 0 for any x ∈ R. Thus f is strictly decreasing on R. 5.E.97. Obviously, a solution is given by x = 1.
We will show that it is unique by using the function f(x) = x^{2025} + 2024x + ln x − 2025. It is sufficient to show that f is strictly increasing throughout its domain A = (0, +∞). Obviously, f is differentiable over A and its derivative is given by f′(x) = 2025x^{2024} + 2024 + 1/x. But f′(x) > 0 for all x ∈ A, so we are done. 5.E.103. Consider the function g(x) = x − ln(x + 1), with g(0) = 0. The first derivative of g is given by g′(x) = 1 − 1/(x + 1) = x/(x + 1). Thus g′(x) = 0 if and only if x = 0. For x > 0 we obviously have g′(x) > 0, so g is strictly increasing on the open interval (0, +∞). Hence, for all x > 0 we have g(x) > g(0) = 0, that is, x > ln(x + 1). Since for x = 0 this holds as an equality, we have finally proved x ≥ ln(x + 1) for all x ≥ 0. 5.E.104. Consider the function f(x) = α x^β − β x^α − α + β. We have f′(x) = αβ(x^{β−1} − x^{α−1}) > 0 for all x > 1, that is, f(x) is strictly increasing for all x ∈ (1, +∞). The point x = 1 is a critical point of f, i.e., f′(1) = 0, and we see that f attains there its minimum value f(1) = 0. Thus for all x > 1 we have f(x) > 0, i.e., α x^β − β x^α > α − β. 5.E.108. (a) Obviously, this limit has the indeterminate form ∞/∞. (b) An attempt to apply l’Hospital’s rule leads to the limit lim_{x→+∞} (1 + cos(x))/1 = lim_{x→+∞} (1 + cos(x)), which does not exist (e.g., use Sage and type the syntax limit(1 + cos(x), x = oo)). However, the limit at hand exists. This is because (x − 1)/x ≤ (x + sin x)/x ≤ (x + 1)/x, which implies that lim_{x→+∞} f(x) = 1. 5.E.111. The answers are a = 2/π, b = −1, c = 1/2 and d = e^{−2}. 5.E.116. The perimeter is 2√5 r. 5.E.117. The answer here is the square with sides of length c. 5.E.118. We get h = (4/3)R and r = (2√2/3)R. 5.E.119. This triangle is the equilateral triangle, with area √3 p^2/36. 5.E.120. The desired points have coordinates P = [2, −1/2] and Q = [−2, −1/2], respectively. Try to illustrate the situation via Sage. 5.E.122. To confirm the answer presented for x0 in 5.E.121, we may instead try to locate the global minimum of the function g(x) = 1/f(x) = (x^2 + (b − h)(a − h))/(x(b − a)) = x/(b − a) + (b − h)(a − h)/(x(b − a)), x ∈ (0, +∞). This can be done with the help of the AM-GM inequality that we met in Chapter 1, that is, (y1 + y2)/2 ≥ √(y1 y2), y1, y2 ≥ 0, where the equality occurs if and only if y1 = y2. The choice y1(x) = x/(b − a), y2(x) = (b − h)(a − h)/(x(b − a)) then gives g(x) = y1(x) + y2(x) ≥ 2√(y1(x) y2(x)) = (2/(b − a))√((b − h)(a − h)). Therefore, if there is a number x > 0 for which y1(x) = y2(x), then the function g has its global minimum at x. In particular, the equation y1(x) = y2(x), i.e. x/(b − a) = (b − h)(a − h)/(x(b − a)), has a unique positive solution given by x0 = √((b − h)(a − h)). 5.E.126. The inscribed rectangle has sides of lengths x and (√3/2)(a − x), thus its area is (√3/2)(a − x)x. The maximum occurs for x = a/2, hence the greatest possible area is (√3/8)a^2. 5.E.127. The dimensions of the pool are 4 m × 4 m × 2 m. 5.E.128. The answer is 28 = 24 + 4. 5.E.129. The answer is a = 1. 5.E.131. (a) The first claim is obviously true. We do not need any absolute convergence to handle linear combinations. This is a direct consequence of the properties of limits. (b) This statement is clearly false, in general. Consider the harmonic series ∑_{n=1}^∞ 1/n. We saw in 5.D.9 that the series ∑_{n=1}^∞ 1/n^2 converges, while the harmonic series does not. (c) Let us divide the series into a sum of two.
The first one will collect the terms with |an| > n|an|2 , while the other sum the remaining terms. In the first case, |an| < 1 n , and thus an n < 1 n2 . Consequently, this part of the series converges absolutely. The remaining part must converge, as well, as a consequence of the comparison with the converging series: an n ≤ |an|2 . As a sum of two absolutely convergent series, the initial series is absolutely convergent, too. Thus the claim is true. 5.E.135. (a) The ratio test gives lim n→∞ an+1 an = lim n→∞ 2(n+1)2 · n! (n + 1)! · 2n2 = lim n→∞ 22n+1 n + 1 = lim n→∞ 2 · 4n n + 1 = +∞ . Thus the series does not converge. (b) The sequence ( 2√ 3n ) n∈Z+ is decreasing, while recall that and the function f(x) = arctan(x) is increasing on the whole real axis. So, the sequence ( arctan ( 2√ 3n ) ) n∈Z+ is decreasing. Thus, it is an alternating series such that the sequence of the absolute values of its terms is decreasing. According to the Leibniz criterion, such an alternating series converges if and only if the sequence of its terms converges to zero, and this is satisfied: lim n→∞ arctan ( 2 √ 3n ) = arctan(0) = 0 and hence lim n→∞ ( (−1)n+1 arctan ( 2 √ 3n ) ) = 0 . So the series T is convergent. 5.E.137. The required value is 3/2. 5.E.138. The first series sums to 3 and the second sums to 9/4. 5.E.139. We get α > 0, β ∈ {−2, −1, 0, 1, 2}, and γ ∈ (−∞, −1) ∪ (1, +∞). 5.E.140. This series is absolutely convergent. 5.E.141. The answer is α ∈ [0, 1). 5.E.144. We see that A(x) = 1 2 ln 1+x 1−x and B(x) = 1+x (1−x)3 , respectively. 5.E.145. For example: an = n/3, bn = n/2, with n ∈ N. 5.E.146. The former series converges absolutely; the latter converges conditionally. 5.E.147. Yes, it does. 5.E.148. The answer is p ∈ R. 507 CHAPTER 5. ESTABLISHING THE ZOO 5.E.149. This series is convergent. 5.E.150. The answer is S = 2/9. In the previous chapter, we were working either with an extremely large class of functions (for example, all continuous, all differentiable), or with only particular functions, (for example exponential, trigonometric, polynomial). However we had very few tools. We indicated how to discuss the local behaviour of functions near a given point by linear approximation. We learned how to measure instantaneous changes by differentiation. Now we derive several results that will allow us to work with functions more easily when modeling real problems. We also deal with the task of summing infinitely many “infinitely small” changes, in particular, how to “integrate”. In the last part of the chapter we come back to series of functions. We also add useful techniques, how to deal with extra parameters in the functions, and we introduce some further integration concepts briefly. 1. Differentiation 6.1.1. Higher order derivatives. If the first derivative f′ (x) of a function of one real variable has a derivative (f′ )′ (x0) at the point x0, we say that the second derivative of function f (or second order derivative) exists. Then we write f′′ (x0) = (f′ )′ (x0) or f(2) (x0). A function f is two times differentiable on some interval, if it has a second derivative at each of its points. Derivatives of higher orders are defined inductively: k times differentiable functions A function f of one real variable is differentiable k times at the point x0 for some natural number k > 1, if it is differentiable (k − 1) times on some neighbourhood of the point x0 and its (k − 1)-st derivative has a derivative at the point x0. We write f(k) (x) for the k-th derivative of the function f(x). 
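The inductive definition is easy to experiment with in Sage. The following cell is a minimal sketch (the function sin(x)·e^x is our own illustrative choice): it computes the third derivative both at once and as the derivative of the second derivative, and checks that the two agree.
f(x) = sin(x)*exp(x)
f3 = diff(f(x), x, 3)                 # third derivative computed at once
f3_step = diff(diff(f(x), x, 2), x)   # derivative of the second derivative
bool(f3 == f3_step)
Sage returns True, in accordance with the inductive definition above.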
If derivatives of all orders exist on an interval A, we say the function f is smooth or infinitely differentiable on A. We use the notation Ck(A) for the class of functions with continuous k-th derivative on the interval A, where k can attain the values 0, 1, . . . , ∞. Often we write only Ck, if the domain is known from the context. When k = 0, C0 means continuous functions. CHAPTER 6 Differential and integral calculus we already have the menagerie, but what shall we do with it? – we’ll learn to control it... A. Derivatives of higher orders In the previous chapter we briefly explained the use of second order derivatives in the study of local extremes (see 5.C.28 for example). In this chapter we will study higher order derivatives in more detail and derive such applications in a more systematic way. Notice that a higher order derivative refers to the repeated process of taking derivatives of derivatives, a procedure also known as “successive differentiation”. As in Chapter 5, we will denote the second derivative of a function f by f′′ = (f′)′, and for derivatives of third or higher order we will write f(3) = (f′′)′, f(4) = (f(3))′, etc. The bracket in the notation f(n) = (f(n−1))′ is necessary to distinguish it from the nth power fn of f. When for a function f we can consider arbitrarily many continuous derivatives, the function is said to be smooth; see also the theoretical column for the notion of smooth functions (e.g., 6.1.9). At this point it is also important to recall that in Sage the nth order derivative of a given function f can be computed via the command diff(f, x, n). For the examples described below we advise the reader to use Sage and verify the computations, especially for the problems where a Sage implementation is not included. We illustrate the concept of higher order derivatives with polynomials. Because the derivative of a polynomial is a polynomial whose degree is one less than that of the original, after a finite number of differentiations we obtain the zero polynomial. If k is the degree of the polynomial, then exactly k + 1 differentiations yield zero. Derivatives of all orders then exist, hence f ∈ C∞(R). In the spline construction, see 5.1.9, we took care that the resulting functions belong to the class C2(R). Their third derivatives are piece-wise constant functions. That is why the splines do not belong to C3(R), even though all their higher order derivatives are zero in all of the inner points of all single intervals in the interpolation. Think this example through in detail! The next assertion is a combinatorial corollary of Leibniz’s rule for differentiation of a product of two functions: Lemma. If two functions f and g have derivatives of order k at the point x0, then their product also has the derivative of order k and the following equality holds: (f · g)^{(k)}(x0) = ∑_{i=0}^{k} (k i) f^{(i)}(x0) g^{(k−i)}(x0). Proof. For k = 0, the statement is trivial. For k = 1 it is Leibniz’s product rule. Suppose the equality holds for some k. Differentiate the right hand side and use Leibniz’s rule to obtain the expression ∑_{i=0}^{k} (k i)( f^{(i+1)}(x0) g^{(k−i)}(x0) + f^{(i)}(x0) g^{(k−i+1)}(x0) ). In this new sum, the sum of the orders of the derivatives of products in all summands is k + 1, and the coefficient of f^{(j)}(x0) g^{(k+1−j)}(x0) is the sum of binomial coefficients (k j−1) + (k j) = (k+1 j). □ 6.1.2. The meaning of second derivative.
We have already seen that the first derivative of a function is its linear approximation in the neighbourhood of a given point. The sign of a nonzero derivative determines whether the function is increasing or decreasing at the point x0. The points where the first derivative is zero are called the critical points or stationary points of the given function. If x0 is a critical point of function f, there are several possibilities for the behaviour of the function f in the neighbourhood of x0. Consider the behaviour of the function f(x) = xn in the neighbourhood of zero for different 509 6.A.1. Determine the derivatives given below. (1) ( x2 sin(x) )′′ , x ∈ R (5) (xx )′′ , x > 0 (2) ( 2x 4x+3 )′′ , x ∈ R\{−3 4 } (6) (xn )(n) , x ∈ R (3) ( ln √ x2+1 x2−1 )′′ , with (7) ( x ln(x) )(3) , x > 0 x ∈ (−∞, −1) ∪ (1, +∞) (4) (tan(x))′′ , with (8) (1/ √ x(x − 2))′′ , with x ∈ R\{π 2 + nπ : n ∈ N} 0 < x < 2. Solution. (1) This is based on the product rule (fg)′ = f′ g+ fg′ . Applying this rule successively we obtain (x2 sin(x))′′ = (( x2 sin(x) )′ )′ = ( 2x sin(x) + x2 cos(x) )′ = 2 sin(x) + 4x cos(x) − x2 sin(x) . (2) In this case we will apply twice the quotient rule ( f g )′ = gf′ −fg′ g2 . We get ( 2x 4x+3 )′ = 2(4x+3)−8x (4x+3)2 = 6 (4x+3)2 , hence ( 2x 4x + 3 )′′ = (( 2x 4x + 3 )′ )′ = ( 6 (4x + 3)2 )′ = − 48(4x + 3) (4x + 3)4 = − 48 (4x + 3)3 . (3) For any x ∈ (−∞, −1) ∪ (1, +∞) we can write ln √ x2 + 1 x2 − 1 = ln (x2 + 1 x2 − 1 )1 2 = 1 2 ln (x2 + 1 x2 − 1 ) = 1 2 [ ln(x2 + 1) − ln(x2 − 1) ] . (∗) This relation has many advantages. For instance, one can easily deduce that the given function is differentiable (as the difference of two differentiable functions). On the other hand, recall that a positive differentiable function f satisfies (ln(f(x))′ = f′ (x)/f(x) for all x in its domain. Hence, a combination of this rule with (∗) gives ( ln √ x2 + 1 x2 − 1 )′ = 1 2 [ 2x x2 + 1 − 2x x2 − 1 ] = − 2x x4 − 1 . Now we can easily compute also the second derivative: ( ln √ x2 + 1 x2 − 1 )′′ = ( − 2x x4 − 1 )′ = 2(1 + 3x4 ) (x4 − 1)2 . (4) For this you should apply the quotient rule twice. In particular, recall by 5.E.76 that (tan(x))′ = 1 cos2(x) . Therefore, tan′′ (x) ≡ (tan(x))′′ ≡ d2 d2x tan(x) = ( 1 cos2(x) )′ = 2 sin(x) cos3(x) = 2 tan(x) cos2(x) . CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS n. For odd n > 0, f(x) will be increasing on all R, while for even n it will be decreasing for x < 0 and increasing for x > 0. In the latter case, the function will attain its minimal value among points in the (sufficiently small) neighbourhood of x0 = 0. The same argument applies to the function f′ . If the second derivative is nonzero, its sign determines the behaviour of the first derivative. At the critical point x0 the derivative f′ (x) is increasing if the second derivative is positive and decreasing if the second derivative is negative. If increasing, it is necessarily negative to the left of the critical point and positive to the right of it. In that case, f is decreasing to the left of the critical point and increasing to the right of it. So f attains its minimal value among all points from a (sufficiently small) neighbourhood of x0 at x0. On the other hand, if the second derivative is negative at x0, the first derivative is decreasing. Thus the first derivative is negative to the left of x0 and positive to the right of it. f then attains its maximal value at x0 among all values in a neighbourhood of x0. 
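This discussion translates directly into a small computation. As a minimal sketch in Sage (the function x^3 − 3x is our own illustrative choice, not taken from the text), one can locate the critical points and inspect the sign of the second derivative at each of them:
f(x) = x^3 - 3*x
crit = solve(diff(f(x), x) == 0, x)   # critical points: solutions of f'(x) = 0
[(c.rhs(), diff(f(x), x, 2).subs(x=c.rhs())) for c in crit]
Sage returns [(-1, -6), (1, 6)]: the second derivative is negative at x = −1 (a local maximum) and positive at x = 1 (a local minimum), exactly as the discussion above predicts.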
A function which is differentiable on (a, b) and continuous on [a, b] has an absolute maximum and minimum on this interval. Both can be attained only at the boundary of the interval or at a point where the derivative is zero. Thus critical points may be sufficient for finding extremes. Second derivatives help to determine the type of extreme, if nonzero. For a more precise discussion of the latter phenomena we consider higher order polynomial approximations of the functions. We then return to the qualitative study of the behaviour of functions with new tools. 6.1.3. Taylor expansion. As a surprisingly easy use of Rolle’s theorem we derive an extremely important result. It is called the Taylor expansion with remainder1. Consider the power series centered at a, S(x) = ∑_{n=0}^∞ an(x − a)^n. Recall, power series can be differentiated term by term, cf. 5.4.10. Differentiate the series S(x) repeatedly, to get the power series S^{(k)}(x) = ∑_{n=k}^∞ n(n − 1) · · · (n − k + 1) an (x − a)^{n−k}. Put x = a. Then S^{(k)}(a) = k! ak. We can read the last statement as an equation for ak and rewrite the original series as S(x) = ∑_{n=0}^∞ (1/n!) S^{(n)}(a)(x − a)^n. 1 Brook Taylor was an English mathematician (1685–1731), best known for his formalization of the polynomial approximations of functions, recognized by Lagrange as the “main foundation of differential calculus”. (5) In 5.C.7 (d) we proved that (x^x)′ = (ln(x) + 1) x^x. Thus, for the second derivative we obtain (x^x)′′ = ((x^x)′)′ = ((ln(x) + 1) e^{x ln(x)})′ = (ln(x) e^{x ln(x)})′ + (e^{x ln(x)})′ = (1/x) e^{x ln(x)} + ln(x)(ln(x) + 1) e^{x ln(x)} + (ln(x) + 1) e^{x ln(x)} = x^x/x + x^x (ln^2(x) + 2 ln(x) + 1) = x^{x−1} + x^x (ln(x) + 1)^2. (6) We see that (x^n)^{(n)} = [(x^n)′]^{(n−1)} = (n x^{n−1})^{(n−1)} = · · · = n!. The last step of a formal proof includes an induction on n, which we leave as an exercise. (7) In this case we simply state the result and leave a formal computation for practice: (x/ln x)^{(3)} = 1/(x^2 (ln x)^2) − 6/(x^2 (ln x)^4). (8) For this case we just present the result, and leave the proof for practice: (1/√(x(x − 2)))′′ = (2x^2 − 4x + 3) √(x^2 − 2x)/(x^6 − 6x^5 + 12x^4 − 8x^3) = (2x^2 − 4x + 3)/(x^2 (x − 2)^2 √(x(x − 2))). □ 6.A.2. Use Sage to verify your answers in 6.A.1. Solution. For most of the cases the answer is based on the command diff(f, x, m), mentioned above. For instance, type show(diff(x^2*sin(x), x, 2).full_simplify()) show(diff((2*x)/(4*x+3),x,2).full_simplify( )) f(x)=ln(sqrt((x^2+1)/(x^2-1))) show(diff(f(x), x, 2).full_simplify()) show(diff(tan(x), x, 2)) show(diff(x^x,x,2)) show(diff(x/ln(x),x,3)) g(x)=1/sqrt(x*(x-2)) show(diff(g(x), x, 2).factor()) Notice the function full_simplify() (or factor()) was used to simplify the given expressions and make the verification direct, whenever necessary. Finally, observe also that the cell [diff(x^n,x,n) for n in range(1,100)] gives a verification of the relation (x^n)^{(n)} = n!, presented in (6), for many values of n. However, a more precise program that offers a more practical demonstration of the theoretical result is as follows: # Define the variable x x = var('x') # Define a specific integer value for n n = 4 # Change this to test different cases Suppose f is a smooth function instead of a power series. We search for a good approximation by polynomials in the neighbourhood of a given point a.
Taylor polynomial of function f
For a k times differentiable (real or complex valued) function f of one real variable, define its Taylor polynomial of k-th degree centered at a by the formula
T^k_a f(x) = f(a) + f′(a)(x − a) + (1/2) f′′(a)(x − a)² + (1/6) f^(3)(a)(x − a)³ + · · · + (1/k!) f^(k)(a)(x − a)^k.

The mean value theorem is used to show how good this approximation of f is.

Taylor expansion with a remainder
Theorem. Let f(x) be a function that is k times differentiable on the interval (a, b) and continuous on [a, b]. Then for all x ∈ (a, b) there exists a number c ∈ (a, x) such that
f(x) = T^{k−1}_a f(x) + (1/k!) f^(k)(c)(x − a)^k
= f(a) + f′(a)(x − a) + . . . + (1/(k − 1)!) f^(k−1)(a)(x − a)^{k−1} + (1/k!) f^(k)(c)(x − a)^k.

Proof. For fixed x define the remainder R by f(x) = T^{k−1}_a f(x) + R. Then R = (1/k!) r (x − a)^k for a suitable number r (dependent on x). Consider the function F(ξ) defined by
F(ξ) = ∑_{j=0}^{k−1} (1/j!) f^(j)(ξ)(x − ξ)^j + (1/k!) r (x − ξ)^k.
By the Leibniz rule, its derivative (here x is considered as a constant parameter) is
F′(ξ) = f′(ξ) + ∑_{j=1}^{k−1} ((1/j!) f^(j+1)(ξ)(x − ξ)^j − (1/(j − 1)!) f^(j)(ξ)(x − ξ)^{j−1}) − (1/(k − 1)!) r (x − ξ)^{k−1}
= (1/(k − 1)!) f^(k)(ξ)(x − ξ)^{k−1} − (1/(k − 1)!) r (x − ξ)^{k−1}
= (1/(k − 1)!) (x − ξ)^{k−1} (f^(k)(ξ) − r),
because the expressions in the sum cancel each other out sequentially. Now it suffices to notice that F(a) = F(x) = f(x) (recall that x is an arbitrarily chosen but fixed number from the interval (a, b)). According to Rolle's theorem there exists a number c, a < c < x, such that F′(c) = 0. That is the desired relation. □

# Define the function f(x) = x^n
f = x^n
# Compute the n-th derivative
n_th_derivative = f.diff(x, n)
# Simplify the result
n_th_derivative_s = n_th_derivative.simplify()
# Print the result
print(f"The {n}-th derivative of x^{n} is:", n_th_derivative_s)
# Verify that the result matches n!
expected_result = factorial(n)
print(f"Expected result ({n}!):", expected_result)
# Check if the computed result matches n!
if n_th_derivative_s == expected_result:
    print("The result matches n!")
else:
    print("The result does not match n!.")
□

6.A.3. Use the generalized Leibniz rule (see the lemma in 6.1.1) to compute the fourth-order derivative of the functions:
(a) h(x) = x cos(x),
(b) k(x) = e^{4x} x⁴,
(c) ℓ(x) = x³ ln(x),
(d) m(x) = e^x arctan(x).
Next describe a solution in Sage.
Solution. The generalized Leibniz rule for the 4th-order derivative of a product fg, with the binomial coefficients as weights, gives
(f(x)g(x))^(4) = f(x)g^(4)(x) + 4f′(x)g^(3)(x) + 6f′′(x)g′′(x) + 4f^(3)(x)g′(x) + f^(4)(x)g(x). (♯)
Now, for case (a) we have h(x) = f(x)g(x) with f(x) = x and g(x) = cos(x), respectively. Moreover, g′(x) = −sin(x), g′′(x) = −cos(x), g′′′(x) = sin(x), g^(4)(x) = cos(x), f′(x) = 1, and f^(k)(x) = 0 for all k ≥ 2. Thus, an application of (♯) gives h^(4)(x) = x cos(x) + 4 sin(x). A solution in Sage is based on the cell
h(x)=x*cos(x); show(diff(h,x,4))
Or, you may use the alternative
[diff(h(x),x,i) for i in [1..4]]
which gives us all the derivatives h′(x), h′′(x), h^(3)(x) and h^(4)(x). The remaining cases are treated similarly. □

6.A.4. Generalized Leibniz rule and Sage. For two symbolic functions f, g in Sage write a short program to implement the identity (♯) presented in 6.A.3. Then verify the results for the functions h, k, ℓ, m by using this program, and hence independently of the command diff(f*g, x, 4).
Solution.
Recall from 5.C.8 that in Sage we can easily introduce symbolic functions by typing

A special case of the last theorem is the mean value theorem, as an approximation by the Taylor polynomial of degree zero. See 5.3.9(1).

6.1.4. Estimates for Taylor expansions. A simple case of a Taylor expansion is when f is a polynomial: f(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_1 x + a_0, a_n ≠ 0. Because the (n + 1)-th derivative of f is identically zero, the Taylor polynomial of degree n has zero remainder, therefore for each x0 ∈ R
f(x) = f(x0) + f′(x0)(x − x0) + · · · + (1/n!) f^(n)(x0)(x − x0)^n.
We can compute all the derivatives easily (for example the last term is always of the form a_n(x − x0)^n). This result is a very special case of error estimation in Taylor expansion with the remainder. We know in advance that the remainder can be estimated by the size of the derivative, and for polynomials this is identically zero from some order onwards. More generally, the estimation of the size of the k-th derivative on some interval can be used to estimate the error on the same interval. Good examples of an expansion of an arbitrary degree are provided by the trigonometric functions sin and cos. By iterating the differentiation of the function sin x we always obtain either sine or cosine with some signs. The absolute values do not exceed one. Thus we obtain a direct estimation of the speed of convergence of the power series
|sin x − (T^k_0 sin)(x)| ≤ |x|^{k+1}/(k + 1)!.
This shows that for x much smaller than k the error is small, but for x comparable with k or bigger it may be large. In the figure, compare the approximation of the function cos x by a Taylor polynomial of degree 68 in paragraph 5.4.12 on page 429. As mentioned in the introduction of the discussion of Taylor expansion of functions, if we start with a power series f(x) centered at a, then its partial sums coincide with the Taylor polynomials T^k_a f(x). The next statement is one of the simple formulations of the converse implication. This is the case when the given function f(x) is actually a power series on some neighbourhood of the given point a.

Taylor's theorem
Theorem. Assume that the function f(x) is smooth on the interval (a − b, a + b) and all of its derivatives are bounded uniformly by a constant M > 0, so |f^(k)(x)| ≤ M, k = 0, 1, . . ., x ∈ (a − b, a + b). Then the power series S(x) = ∑_{n=0}^∞ (1/n!) f^(n)(a)(x − a)^n converges on the interval (a − b, a + b) to f(x).

var("x"); f=function("f")(x)
This, in combination with substitute_function, gives rise to an alternative approach to obtain the first derivative of a given function. For instance, for f(x) = e^x we just need to add the code
f1=f.diff(x)
show(f1.substitute_function(f==exp(x)))
Check yourself that this block returns the first derivative of f(x) = e^x. We are now ready to verify (♯), which can be done for example by the cell
f=function("f")(x)
g=function("g")(x)
f1=diff(f, x); f2=diff(f, x, 2)
f3=diff(f, x, 3); f4=diff(f, x, 4)
g1=diff(g, x); g2=diff(g, x, 2)
g3=diff(g, x, 3); g4=diff(g, x, 4)
h4x=f*g4+4*f1*g3+6*f2*g2+4*f3*g1+f4*g
show(h4x)
This block includes all the derivatives that one needs to encode (♯) inside the program, which we did here via the expression named “h4x”. To confirm the expression of h^(4)(x) given in 6.A.3, it now suffices to add the code
show(h4x.substitute_function(f==x, g==cos(x)))
The remaining cases can be confirmed similarly and are left for practice. □
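To see the estimate from 6.1.4 at work, one may compare the actual error of a Taylor polynomial of sin with the bound |x|^{k+1}/(k + 1)!. A minimal Sage sketch follows (the degree k = 5 and the point x0 = 2 are our own choice):
k = 5; x0 = 2.0                     # sample degree and evaluation point
T = taylor(sin(x), x, 0, k)
err = abs(sin(x0) - T(x=x0))        # actual error of the approximation
bound = abs(x0)^(k+1)/factorial(k+1)
print(err, bound, err <= bound)     # the error stays below the bound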
6.A.5. Let n ∈ N be arbitrary. Find the n-th derivative of the function f(x) = ln((1 + x)/(1 − x)), where x ∈ (−1, 1). ⃝

6.A.6. Compute the 12th derivative of the function f(x) = e^{2x} + cos(x) + x^{10} − 5x⁷ + 6x³ − 7x + 3, with x ∈ R. ⃝

6.A.7. Compute the 26th derivative of the function f(x) = sin(x) + x^{23} − x^{18} + 5x^{11} − 3x⁸ + e^{2x}, with x ∈ R. ⃝

Taylor polynomials extend the idea of linearization that we briefly discussed in Chapter 5. They provide high-order approximations of a function f by polynomials in the neighbourhood of a point (and thus locally). This is because such expansions are based on higher order derivatives of f at that point. We remark that as the number of terms of such polynomials (and hence their order) increases, the induced approximations improve. Taylor expansions have many natural applications in mathematics, but also in other sciences (e.g., in physics). In the sequel, we first describe exercises related to the computation of Taylor polynomials; more interesting applications are discussed afterwards. For our description we adopt the notation from the theoretical section 6.1.3, hence for example we denote by T^k_a f(x) the k-th order Taylor expansion of f(x) around a point a.

Proof. The proof is identical with the special case of the function sin x above, except that the universal bound by 1 is replaced by M. Thus the estimate of the remainders is
|f(x) − (T^k_a f)(x)| ≤ (M/(k + 1)!) |x − a|^{k+1}. □

6.1.5. Local behaviour of functions. With the Taylor expansions at hand, let us return to the local or global behaviour of real functions of one real variable. We have seen that the sign of the first derivative of a differentiable function determines whether it is increasing or decreasing on some neighbourhood of the given point. If the derivative is zero, it is one of the so called critical points, but we do not know much about the function without looking at higher derivatives. We encountered the importance of the second derivative when analysing the critical points, see 6.1.2. Now we generalize the discussion of critical points for all orders. First we deal with the local extremes of functions. In the following we consider real functions with a sufficiently high number of continuous derivatives, without specifically stating this assumption. The point a in the domain of f is a critical point of order k if and only if
f′(a) = · · · = f^(k)(a) = 0, f^(k+1)(a) ≠ 0.
Suppose f^(k+1)(a) > 0 and f ∈ C^{k+1}. Then this continuous derivative is positive on a certain neighbourhood O(a) of the point a as well. In that case, the Taylor expansion with the remainder gives
f(x) = f(a) + (1/(k + 1)!) f^(k+1)(c)(x − a)^{k+1}
for all x in O(a). Because of that, the change of values of f(x) in a neighbourhood of a is given by the behaviour of (x − a)^{k+1}. Moreover, if k + 1 is an even number, then the values of f(x) in such a neighbourhood are necessarily larger than the value f(a). So a is a local minimum. But if k + 1 is odd (i.e., k is even), then the values on the left are smaller, while those on the right are larger than f(a). So an extreme does not occur even locally. On the other hand, the graph of the function f(x) intersects its tangent y = f(a) at the point [a, f(a)] in the latter case, as discussed in more detail below. Similarly, if f^(k+1)(a) < 0, then it is a local maximum for odd k, and there is no extreme for even k.
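The discussion in 6.1.5 translates into a small Sage routine: find the first nonvanishing derivative at a critical point and read off the behaviour from its order and sign. This is only a sketch; the sample functions and the cap on the order are our own choice:
def first_nonzero_derivative(f, a, max_order=10):
    # return (order, value) of the first nonvanishing derivative of f at a
    for k in range(1, max_order + 1):
        val = diff(f, x, k)(x=a)
        if val != 0:
            return k, val
    return None

for f in [x^2, x^3, -x^4]:
    print(f, first_nonzero_derivative(f, 0))
An even order with a positive value gives a local minimum, an even order with a negative value a local maximum, while an odd order means there is no extreme; in the notation of 6.1.5, the point is a critical point of order k − 1.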
6.1.6. Convex and concave functions. The differentiable function f is concave at a, if its graph lies completely below the tangent at the point [a, f(a)] in a neighbourhood of a. That is, f(x) ≤ f(a) + f′(a)(x − a). Similarly f is convex at a, if its graph is above the tangent at the point a. That is, f(x) ≥ f(a) + f′(a)(x − a).

6.A.8. For the elementary functions sin(x), cos(x), e^x and ln(x + 1) compute the third order Taylor expansion centered at a = 0. ⃝

6.A.9. Determine the Taylor expansion T³₁f(x) for the functions f(x) = e^x/x and f(x) = arctan(x), with x ∈ R∖{0} and x ∈ R, respectively. Then verify your answers in Sage and moreover check the accuracy of the approximation near the point a.
Solution. The third-order Taylor expansion around the point a = 1 is given by
T³₁f(x) = f(1) + f′(1)(x − 1) + (f′′(1)/2)(x − 1)² + (f′′′(1)/6)(x − 1)³.
For the first case we have f(1) = e, and
f′(1) = (e^x/x − e^x/x²)|_{x=1} = 0,
f^(2)(1) = (e^x/x − 2e^x/x² + 2e^x/x³)|_{x=1} = e,
f^(3)(1) = (e^x/x − 3e^x/x² + 6e^x/x³ − 6e^x/x⁴)|_{x=1} = −2e.
Thus it follows that
T³₁(e^x/x) = e + (e/2)(x − 1)² − (e/3)(x − 1)³.
In Sage, in order to compute the Taylor expansion T^k_a f(x) of a given function f(x) around a point a, you can either apply the command taylor(f(x), x, a, k), or type f.taylor(x, a, k). Thus, a verification of the derivatives and the Taylor expansion presented above is obtained by the cell
f(x)=e^x/x
a=diff(f, x, 2); show(a); show(a(1))
b=diff(f, x, 3); show(b); show(b(1))
show(taylor(f(x), x, 1, 3))
In order to check the accuracy of the approximation near the point a = 1, we can check the values of f and T³₁f at a point close to a = 1, say at x = 1.1. The block
f(x)=exp(x)/x; T(x)=taylor(f(x), x, 1, 3)
show("Value of the function at x=1.1:", round(f(x=1.1), 10))
show("3rd degree Taylor pol. at x=1.1:", round(T(x=1.1), 10))
returns the desired answer, namely:
• Original value of the function at x = 1.1: 2.7310600218,
• 3rd degree Taylor polynomial at x = 1.1: 2.7309671437.
The function arctan(x) can be treated in the same way. For instance, the syntax show(taylor(arctan(x), x, 1, 3)) gives us the following expression:
T³₁ arctan(x) = π/4 + (1/2)(x − 1) − (1/4)(x − 1)² + (1/12)(x − 1)³.

A function is convex or concave on an interval, if it has this property at all its points. Suppose f has continuous second derivatives in a neighbourhood of a. The Taylor expansion of second order with the remainder implies
f(x) = f(a) + f′(a)(x − a) + (1/2) f′′(c)(x − a)².
Then the function is convex, whenever f′′(a) > 0, and concave whenever f′′(a) < 0. If the second derivative is zero, we can use derivatives of higher orders. But we can only make the same conclusion if the first nonzero derivative after the first derivative is of even order. If the first nonzero derivative is of odd order, the points of the graph of the function on opposite sides of some small neighbourhood of the studied point will lie on opposite sides of the tangent at this point.

6.1.7. Inflection points. A point a is called an inflection point of a differentiable function f, if the graph of f crosses from one side of the tangent in the point a to the other. The latter discussion on concave and convex functions shows that the inflections can appear only at points with vanishing second derivative. Suppose f has continuous third derivatives and write the Taylor expansion of third order with the remainder:
f(x) = f(a) + f′(a)(x − a) + (1/2) f′′(a)(x − a)² + (1/6) f′′′(c)(x − a)³.
If a is a zero point of the second derivative such that f′′′(a) ≠ 0, then the third derivative is nonzero on some neighbourhood.
Then a is an inflection point since the second derivative changes the sign at a and thus the tangent crosses the graph. In that case, the sign of the third derivative determines whether the graph of the function crosses the tangent from the top to the bottom or vice versa. Moreover, if a is an isolated zero point of the second derivative and simultaneously an inflection point, then on some small neighbourhood of a the function is concave on one side and convex on the other. Thus the inflection points are points of the change between concave and convex behaviour of the graph of the function.

To check the accuracy, one can apply the same method as for f(x) = e^x/x. □

6.A.10. Determine the Taylor expansion T⁴₁f(x) where f(x) = ln(x²), with x ∈ (0, 2). Next use Sage to plot together f(x) and T⁴₁f(x), for x in the domain of f. ⃝

6.A.11. Consider the function f(x) = cos(x) with x ∈ R. Use Sage to sketch in one diagram the graph of f together with the graphs of the Taylor polynomials T^k_a f(x) for k = 2, 3, 5, 7, 9 and a = π/2, over suitable intervals of R.
Solution. Sage allows us to use different colours to present the required graphs, and this makes their recognition easier. Based on this remark, one can present the following figure. To produce this figure we use the taylor function to introduce the required Taylor polynomials. We then sketch their graphs via the plot command, which we use with some more advanced options (for example, each polynomial comes with a label, which we added via the legend_label option). Let us present the full syntax in a block:
f(x)=cos(x)
p=plot(f(x), x, -2*pi, 2*pi, color="green", legend_label=r"$f(x)=cos(x)$")
T2(x)=taylor(f(x), x, pi/2, 2)
T3(x)=taylor(f(x), x, pi/2, 3)
T5(x)=taylor(f(x), x, pi/2, 5)
T7(x)=taylor(f(x), x, pi/2, 7)
T9(x)=taylor(f(x), x, pi/2, 9)
p2=plot(T2(x), x, -2*pi, 2*pi, legend_label=r"$T^2_{\frac{\pi}{2}}(x)$")
p3=plot(T3(x), x, -pi, 2*pi, color="red", legend_label=r"$T^3_{\frac{\pi}{2}}(x)$")
p5=plot(T5(x), x, -pi, 2*pi, color="orange", legend_label=r"$T^5_{\frac{\pi}{2}}(x)$")
p7=plot(T7(x), -pi, 2*pi, color="black", legend_label=r"$T^7_{\frac{\pi}{2}}(x)$")
p9=plot(T9(x), -pi, 2*pi, color="brown", legend_label=r"$T^9_{\frac{\pi}{2}}(x)$")
show(p+p2+p3+p5+p7+p9)
For further practice with Taylor polynomials, check 6.B.40, and see also Section D for a series of additional tasks (e.g., 6.D.5). □

6.1.8. Asymptotes of graphs of functions. We introduce one more useful utility for understanding or sketching the graph of a function. We consider the asymptotes. These are lines in R² whose distance from the graph of f(x) converges to zero for x → x0. Thus, an asymptote at the improper point ∞ is a line y = ax + b which satisfies
lim_{x→∞} (f(x) − ax − b) = 0.
An asymptote with a slope. If such an asymptote exists, it satisfies
lim_{x→∞} (f(x) − ax) = b.
Consequently the limit
lim_{x→∞} f(x)/x = a
also exists. Conversely, if the last two limits exist, the limit from the definition of the asymptote exists as well, thus these are sufficient conditions, too. The asymptote at the improper point −∞ is defined similarly. In this way we find all the lines satisfying the properties of asymptotes with slope. It remains to consider lines perpendicular to the x axis: the asymptotes at points a ∈ R are lines x = a such that at least one of the one-sided limits of f at a is infinite. They are called asymptotes without slope.
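The two limits above translate directly into Sage. Here is a small sketch on a sample rational function of our own choosing:
f(x) = (2*x^2 - 3)/(x + 1)       # sample function (our own choice)
a = limit(f(x)/x, x=oo)
b = limit(f(x) - a*x, x=oo)
print("asymptote with slope: y =", a, "* x +", b)   # y = 2x - 2
# asymptote without slope: one-sided limits at the zero of the denominator
print(limit(f(x), x=-1, dir="plus"), limit(f(x), x=-1, dir="minus"))
The one-sided limits return −∞ and +∞, so x = −1 is indeed an asymptote without slope.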
The real rational functions have asymptotes at all zero points of the denominator which are not zero points of the numerator as well. We consider a simple illustrative example: Let f(x) = x + 1/x. Then f has two asymptotes, y = x and x = 0. Indeed, the one-sided limits from the right and left at zero are clearly ±∞, while the limit of f(x)/x = 1 + 1/x² is of course 1 at the improper points. Finally the limit of f(x) − x = 1/x is zero at the improper points. By differentiating, f′(x) = 1 − x^{−2}, f′′(x) = 2x^{−3}. The function f′(x) has two zero points ±1. At x = 1, f has a local minimum. At x = −1, f has a local maximum. The second derivative has no zero points in all its domain (−∞, 0) ∪ (0, ∞), so f has no inflection points.

So far we have learned how to use Taylor polynomials to approximate smooth functions. Whenever we use an approximation technique, it is important to have a sense for how accurate our approximations are, as for example the Lagrange interpolation error presented in 5.E.9. For Taylor expansions the situation is encoded by the so called Remainder Theorem, which is described in the theoretical section 6.1.3. Let us illustrate this by examples.

6.A.12. Determine the Taylor polynomial T⁶₀f(x), where f(x) = sin(x), and next estimate the error of the approximation at the point x = π/4, according to the theorem in 6.1.3.
Solution. It is easy to prove that T⁶₀ sin(x) = x − (1/6)x³ + (1/120)x⁵. According to the statement in 6.1.3, the estimate of the remainder R at a = 0 is thus given by (the absolute value of) the expression (f^(k+1)(c)/(k + 1)!) x^{k+1}, for some c ∈ (0, π/4) and k = 6. We compute f^(7)(x) = −cos(x), for all x ∈ R, and so at x = π/4 we obtain
|R(π/4)| = |−cos(c)| π⁷/(7! · 4⁷) < 1/7! ≈ 0.0002. □

6.A.13. Show that e^x ≥ 1 + x + x²/2! + x³/3!. ⃝

6.A.14. Find the estimation of the error of the approximation ln(1 + x) ≈ x − x²/2 for x ∈ (−1, 0). ⃝

6.A.15. Consider the sine function f(x) = sin(x), with x ∈ R. Compute the Taylor polynomial T⁴₀f(x) and use Sage to plot together the graphs of T⁴₀f(x) and of f(x). Finally, based on your answer estimate sin(1°). ⃝

6.A.16. Compute an approximation of cos(π/10) with an error less than 10⁻⁵. ⃝

Taylor series are power series having as partial sums the Taylor polynomials of a given function. Taylor series need not be convergent, and even if the Taylor series of a function f does converge, its limit may not be equal to f(x). Functions f whose Taylor series ∑_{n=0}^∞ (f^(n)(a)/n!)(x − a)^n converges to f(x) for all x in some open neighbourhood of a are said to be “analytic at a”. When f is analytic at every point of an interval I ⊂ R, then f is said to be “analytic on I”, see also 6.1.9. As we will explain below, for analytic functions the power series representation allows us to solve problems as if the function were a polynomial. In general not all functions are analytic; in particular, functions with discontinuities cannot even be expressed as Taylor series. However, most of the functions that

6.1.9. Analytic and smooth functions. If f is smooth at a, the formal power series can be written
S(x) = ∑_{n=0}^∞ (1/n!) f^(n)(a)(x − a)^n.
If this power series has a nonzero radius of convergence and simultaneously S(x) = f(x) on the respective interval, we say that f is an analytic function at a. A function is analytic on an interval, if it is analytic at every point of it. One of the most important functions in Mathematics and Physics is the following.

Gaussian function
The analytic function g(x) = e^{−x²}
is called the Gaussian function. Indeed, the Gaussian is analytic: we simply replace x with −x² in the power series for the exponential e^x, and so it is an analytic function on the entire R. Its derivatives are easily computed, too. In particular, g′(0) = 0 is the only critical point and g′′(0) = −2. Since the limits at ±∞ are zero, the number g(0) = 1 is the global maximum of the Gaussian function (see ?? for more detailed comments). We shall see why the function is so important in Chapter 10 when dealing with probability and statistics. The Gaussian g(x) and the function f(x) = g(1/x) are depicted on the figure; f(x) is the solid line, while the Gaussian function is dashed. Notice how firmly the function f touches the x axis. We are going to explain this.

arise in applications are usually analytic. At the end of Chapter 5 we met for example the analytic functions
e^x = 1 + x + x²/2! + x³/3! + · · · = ∑_{k=0}^∞ x^k/k!,
sin(x) = x − x³/3! + x⁵/5! − · · · = ∑_{k=0}^∞ ((−1)^k/(2k + 1)!) x^{2k+1},
cos(x) = 1 − x²/2! + x⁴/4! − · · · = ∑_{k=0}^∞ ((−1)^k/(2k)!) x^{2k}.
In fact it is not hard to prove that the Taylor series of cos(x) is obtained from the Taylor series of sin(x) around the origin by differentiation. Another example is induced by the equality
ln(1 + x) = x − x²/2 + x³/3 − · · · = ∑_{k=1}^∞ ((−1)^{k+1}/k) x^k.
This will be proved later in Section C, where you will have the chance to learn more details on the convergence of Taylor series (see 6.D.47). Finally, a remarkable common property of all the examples mentioned above is that the Taylor series on the r.h.s. are centered at a = 0. Such Taylor series are referred to as “Maclaurin series”. As we will see, from the Maclaurin series we can build many other examples through substitution and series multiplication.

6.A.17. Consider the functions
f(x) = ln((1 + x)/(1 − x)), g(x) = e^{x²} + x² e^{−2x}
with x ∈ (−1, 1) and x ∈ R, respectively. Expand them into a Taylor series centered at the origin, that is, into a Maclaurin series, and then present a verification in Sage.
Solution. We first analyze the case of f(x). We know that ln(1 + x) = ∑_{n=1}^∞ ((−1)^{n+1}/n) x^n, which implies that
ln(1 − x) = ∑_{n=1}^∞ ((−1)^{n+1}/n)(−x)^n = −∑_{n=1}^∞ (1/n) x^n.
Thus, for all x ∈ (−1, 1) we have the relation
ln((1 + x)/(1 − x)) = ln(1 + x) − ln(1 − x) = 2 ∑_{n=1}^∞ x^{2n−1}/(2n − 1).
In order to verify this computation in Sage we could type
var("n"); assume(x>0)
sum((2/(2*n-1))*x^(2*n-1), n, 1, oo)
and this prints out the desired result, i.e., ln(−(x + 1)/(x − 1)). On the other hand, we could use Sage to derive the Taylor series directly, as follows:
f(x)=ln((1+x)/(1-x))
tf=taylor(f, x, 0, 10)
T=tf.power_series(QQ)
show(T)

Not all smooth functions are analytic. It can be proven that for every sequence of numbers a_k there is a smooth function whose derivatives of order k at a given point x0 are these numbers a_k.² Now, look more closely at our function f(x) = e^{−1/x²}. It is a well defined smooth function at all points x ≠ 0. Its limit at x = 0 exists, and lim_{x→0} f(x) = 0. By defining f(0) = 0, f is a continuous function for all real x. By a direct computation based on L'Hôpital's rule we compute the derivatives of f (the first three are in the picture; guess which line is which!). It suffices to consider only the right derivative, since the function is even.
f′(0) = lim_{x→0+} (e^{−1/x²} − 0)/x = lim_{x→0+} x^{−1}/e^{1/x²} = (1/2) lim_{x→0+} x/e^{1/x²} = 0.
By differentiating f(x) at an arbitrary point x ≠ 0,
f′(x) = e^{−1/x²} · 2x^{−3}.
By repeated differentiation of the results, there is always a sum of finitely many terms of the form C · e^{−1/x²} · x^{−j}, where C is an integer and j is a natural number. Next, assume it is already proven that the derivative of order k of f(x) exists and vanishes at zero. Compute the limit of the expression f^(k)(x)/x for x → 0+. This is a finite sum of limits of the expressions C x^{−j} e^{−1/x²} = C x^{−j}/e^{1/x²}. All these expressions are of type ∞/∞, so L'Hôpital's rule can be used repeatedly on them. After several differentiations of both the numerator and denominator (and a similar adjustment as above) there remains the same expression in the denominator, while in the numerator the power is non-negative. Thus the expression necessarily has a zero limit at zero, just as in the case of the first derivative above. The same holds for a finite sum of such expressions. So each derivative f^(k)(x) at zero exists with value zero.

²This is a special case of the Whitney extension theorem, which says that there is a smooth function on a Euclidean space with prescribed derivatives in all points of a closed set A if and only if the Taylor theorem estimates are true for the prescription. In the case of one single point A, the condition is empty. This is relevant for the Taylor theorem for functions of more than one real variable, as in Chapter 8. Hassler Whitney (1907–1989) was a very influential American mathematician.

This returns the Taylor series corresponding to f(x), in the following form
2x + (2/3)x³ + (2/5)x⁵ + (2/7)x⁷ + (2/9)x⁹ + O(x^{10}).
Here, Sage uses the “big O” notation to encode terms of higher order (in our case of order ≥ 10, since we “asked” the Taylor expansion to be of order 10). For the second function, by the identity
e^x = ∑_{n=0}^∞ x^n/n!
we obtain e^{x²} = ∑_{n=0}^∞ x^{2n}/n!, for all x ∈ R. Moreover,
x² e^{−2x} = x² ∑_{n=0}^∞ (1/n!)(−2x)^n = ∑_{n=0}^∞ ((−2)^n/n!) x^{n+2},
and hence
e^{x²} + x² e^{−2x} = ∑_{n=0}^∞ (x^{2n} + (−2)^n x^{n+2})/n!.
The verification in Sage for this case is left for practice. □

6.A.18. Determine the Taylor series centered at the origin for the function f(x) = 1/(1 + x)², with x ∈ (−1, 1). Next confirm your result by Sage. ⃝

Maclaurin series are also useful when computing limits, and here is a task for you to perform yourself, see also 6.D.6.

6.A.19. Use the appropriate Maclaurin series to compute the limits
lim_{x→0} (x sin(x) − x²)/x⁴, and lim_{x→0} x²(e^{−1/x²} − 1). ⃝

Let us now stress the use of derivatives when studying the “local behaviour” of functions, and hence revise the discussion initiated in Chapter 5 (cf. 5.C.15, 5.C.17). This is based on many new notions introduced in ??, and essentially includes the description of the following features of f:
• the domain and the range;
• parity and periodicity;
• discontinuities and their kind;
• points of intersection with the axes;
• the limits lim_{x→±∞} f(x);
• the first and the second derivatives;
• the critical points (also called stationary points);
• the intervals of monotonicity;
• local and absolute extremes;
• the intervals where f is convex/concave;
• the points of inflection;
• the horizontal and inclined asymptotes;
• the graph.
In order to become familiar with these notions, we present many illustrative examples.

In summary, f(x) is smooth on the whole of R. It is strictly positive everywhere except for x = 0. All its derivatives at this point are zero. It cannot be analytic at x0 = 0. The limit of the function at the improper points ±∞ is 1, while all its derivatives converge quickly to zero.
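The vanishing of these derivatives at zero can also be observed symbolically. A minimal Sage sketch (the range k ≤ 3 is our own choice) computes the one-sided limits of the first few derivatives of e^{−1/x²}:
f = e^(-1/x^2)
for k in [1, 2, 3]:
    # each k-th derivative is a sum of terms C*e^(-1/x^2)*x^(-j)
    print(k, limit(diff(f, x, k), x=0, dir="plus"))
Each limit returns 0, in accordance with the argument above: the Taylor series of f at 0 is identically zero, while f itself is not.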
6.1.10. Smooth jump functions. The smooth functions are very “elastic”: from the local behaviour around one point we cannot deduce anything at all about the global behaviour of such functions. On the other hand, analytic functions are completely determined just by the derivatives at one point. In particular they are completely determined by their behaviour on an arbitrarily small neighbourhood of a single point from their domain. In this sense, analytic functions are very “rigid”. In particular, the smooth functions allow for joining different constant values on disjoint open intervals in a differentiable way. Let us look at such functions more closely now. We can modify f(x) from the previous paragraph in this way:
g(x) = 0 if x ≤ 0, g(x) = e^{−1/x²} if x > 0.
Again it is a smooth function on all of R. By another modification there is another function h, which is nonzero at all inner points of the interval [−a, a], a > 0, and zero elsewhere:
h(x) = 0 if |x| ≥ a, h(x) = e^{1/(x²−a²) + 1/a²} if |x| < a.
This function is again smooth on all of R. The last two functions are in the two figures. On the right, the parameter a = 1/2 is used. Finally we show how to get smooth analogies of the Heaviside functions. For two fixed real numbers a < b, define the function f(x) exploiting the above function g as follows:
f(x) = g(x − a)/(g(x − a) + g(b − x)).
For all x ∈ R the denominator of the fraction is positive (because g is non-negative, and on each of the three intervals determined by the numbers a and b at least one of the summands of the denominator is nonzero). Thus the definition yields a smooth function f(x) on all of R. For x ≤ a the numerator of the fraction is zero according to the definition of g. For x ≥ b

6.A.20. (a) Find the critical points and the local extremes of the function f(x) = ∛(x²) = x^{2/3}, with x ∈ R.
(b) Provide an example of a function g satisfying g′(0) = 0 such that g does not attain a local extreme at x0 = 0.
(c) Provide an example of a function having no local minima or maxima.
Solution. (a) The given function f is continuous in R and for any x > 0 we have (see also 5.C.24)
f′(x) = (x^{2/3})′ = (2/3)x^{2/3−1} = (2/3)x^{−1/3} = 2/(3∛x).
Comparing the left and right derivatives, one can check that at x = 0 the function f is not differentiable. This can be done quickly via Sage, by the syntax
f(x)=x^(2/3)
show(lim((f(x)-f(0))/x, x=0, dir="right"))
show(lim((f(x)-f(0))/x, x=0, dir="left"))
which prints out +∞ and −∞, respectively. Moreover, we see that f′(x) = 0 has no solutions, hence f has no critical points (in terms of 6.1.5). We already know from Chapter 5 how to obtain a confirmation of the critical points via Sage: just add the cell
show(solve(diff(f, x).factor()==0, x))
On the other hand, we see that f′(x) > 0 for x > 0 and f′(x) < 0 for x < 0. Hence f′ changes from negative to positive at x0 = 0, and since f is continuous at x0 = 0 this implies that f attains a local minimum at this point. Thus local extremes of a function may occur at places where the first derivative does not exist.
(b) An example is given by the function g(x) = x³, which obviously satisfies g′(0) = 0, but the origin is neither a local minimum nor a local maximum of g (recall the graph of g, or see below). This shows that stationary points are not necessarily local extremes.
(c) An example is given by the (equilateral) hyperbola y = h(x) = 1/x with x ∈ A := R∖{0} = (−∞, 0) ∪ (0, +∞). Indeed, for x1, x2 ∈ (0, +∞) with x1 < x2 we have 1/x1 > 1/x2, i.e., h(x1) > h(x2), and thus y = h(x) is strictly decreasing on (0, +∞).
Since h is an odd function, h(−x) = −h(x) for all x ∈ A, we immediately deduce that h is strictly decreasing on (−∞, 0) as well, and hence on its whole domain A. This also follows by the first derivative test: h′(x) = −1/x² < 0 for all x ∈ A. Hence there are no extreme points, a conclusion that one can illustrate by sketching the graph of h (h is an odd function, hence its graph is symmetric with respect to the origin, see also the figure below). Moreover, we have lim_{x→0+} h(x) = +∞ and hence the y-axis (x = 0) is a vertical asymptote of h. In addition, the x-axis (y = 0) is a horizontal asymptote, since lim_{x→+∞} h(x) = 0.

the numerator and denominator are equal. In the next two figures there are functions f(x) with parameters a = 1 − α, b = 1 + α. On the left α = 0.8, and on the right α = 0.4. Finally, we can create a smooth analogue of the characteristic function of any interval [c, d]. Write fε(x) for the latter function f(x) with parameters a = −ε, b = +ε. For the interval (c, d) with the length d − c > 2ε define the function
hε(x) = fε(x − c) · fε(d − x).
This function is identically zero on the intervals (−∞, c − ε) and (d + ε, ∞). It is identically one on the interval (c + ε, d − ε). Moreover, it is smooth everywhere. Locally it is either constant or monotonic (you should verify the last claim yourself). The smaller the ε > 0, the faster hε(x) jumps from zero to one around the beginning of the interval or back at the end of it. The diagram shows the choices [c, d] = [1, 2] and ε = 0.6, ε = 0.3.

6.1.11. Differential of a function. In practical use of differential calculus, we often work with dependencies between several variables, say y and x. The choice of dependent and independent variable is not fixed. The explicit relation y = f(x) with some function f is then only one of the possible options. Differentiation then expresses that the immediate change of y = f(x) is proportional to the immediate change of x with the proportion f′(x) = (df/dx)(x). This relation is often written as
dy = f′(x)dx, or df(x) = f′(x)dx,
where we interpret df(x) as a linear map R → R defined on increments of x at the given point, df(x)(v) = f′(x)·v, while the identity map yields dx(x)(v) = v. We talk about the differential of the function f, which is the best linear approximation of the function f around the point

Our figure also contains the graph of the function k(x) = −1/x, which is symmetric to the graph of h(x) = 1/x with respect to the x-axis. Notice that k has the same domain as h and is strictly increasing on A, since k′(x) = 1/x² > 0 for all x ∈ A. Hence it provides another example of a function having no local minima or maxima. A generalization occurs by the hyperbolas y = a/x, with a > 0 and a < 0, respectively. For convenience, let us finally include the code used in Sage to construct these graphs:
h(x)=1/x; k(x)=-1/x
p=plot(h(x), x,-5, 5, ymin=-5, ymax=5, rgbcolor=(0.2,0.2,0.5), thickness=1.2)
p+=text(r"$y=\frac{1}{x}, \ x>0$", (3, 2), fontsize=16, rgbcolor=(0.2,0.2,0.5))
p+=text(r"$y=\frac{1}{x}, \ x<0$", (-3, -2), fontsize=16, rgbcolor=(0.2,0.2,0.5))
p+=plot(k(x), x, -5, 5, ymin=-5, ymax=5, rgbcolor=(0.2,0.5,0.2), thickness=1.2)
p+=text(r"$y=-\frac{1}{x}, \ x<0$", (-3, 2), fontsize=16, rgbcolor=(0.2,0.5,0.2))
p+=text(r"$y=-\frac{1}{x}, \ x>0$", (3, -2), fontsize=16, rgbcolor=(0.2,0.5,0.2)); show(p)
□

6.A.21. An annoying technicality. Generally speaking, Sage is still a work “in progress” and hence its users may face some technicalities.
For instance, try to produce the graph of f(x) = x^{2/3} with x ∈ R (or of x ↦ x^{1/3} = ∛x) in a traditional way. In this case Sage will sketch only the portion corresponding to the positive reals. This problem occurs since Sage returns complex numbers for odd roots of negative numbers, when the latter are numerically approximated. This is the situation when, for example, one tries to sketch the graph of f. To see this, in your editor type the syntax
c1 = (-1)^(2/3); print(c1)
print(float(c1))
c2 = (-1.)^(2/3); print(c2); float(c2)
For c1, by the third command, Sage returns the error “unable to simplify to float approximation”, while for c2 the error

x, i.e. the following approximation property is true:
(1) lim_{v→0} (f(x + v) − f(x) − df(x)(v))/v = 0.
Since we are working in dimension 1, the definition is equivalent to the existence of the derivative f′(x) (all linear maps are just multiplications by a constant) and then the differential is df(x)(v) = f′(x)v. Later, we shall see that the situation is quite different for functions of more variables. If the quantity x is expressed by another quantity t, e.g. x = g(t) and, moreover, g is differentiable, the chain rule for differentiating the composite functions says that f ∘ g has the differential too and
df(t) = d(f ∘ g)(t) = (df/dx)(x) (dx/dt)(t) dt.
We may also write dy = f′(x)dx = f′(g(t))g′(t)dt. Therefore dy can be seen as a linear approximation of the given quantity independently of the choice of the dependent variable.

6.1.12. The numerical derivatives. We shall conclude this section with two straightforward applications of differentials. First we provide a brief introduction to the numerical procedures for differentiation. Then we discuss curves in the plane and space, starting with the graphs of functions. This will also provide first glimpses into the so called vector calculus (working with vector valued functions). In the beginning of this textbook we discussed how to describe the values in a sequence if its immediate differences are known (cf. paragraphs 1.1.5, 1.2.1). Before proceeding the same way with the derivatives we clarify the connections between derivatives and differences. The key to this is the Taylor expansion with remainder. Suppose that for some (sufficiently) differentiable function f(x) defined on the interval [a, b], the values fi = f(xi) at the points x0 = a, x1, x2, . . . , xn = b are given, while xi − xi−1 = h for some constant h > 0 and all indices i = 1, . . . , n. Write the Taylor expansion of the function f in the form
f(xi ± h) = fi ± hf′(xi) + (h²/2)f′′(xi) ± (h³/3!)f^(3)(xi) + . . .
Suppose the expansion is terminated at the term containing h^k, which is of order k in h. Then the actual error is bounded by
(h^{k+1}/(k + 1)!) |f^(k+1)(x)|
on the interval [xi − h, xi + h]. If the (k + 1)-th derivative of f is continuous, it can be bounded by a constant. Then for small h, the error of the approximation by the Taylor polynomial of order k acts like h^{k+1} except for a constant multiple. Such an estimation is called an asymptotic estimation.

from the command float(c2) is about the difficulty of converting a complex number to float (notice the difference in the definition of c1, c2). In this case Sage advises us to use one of the options abs() or real_part(). Motivated by this, to cure our problem for the given f we may instead plot the graph of |x|^{2/3}, see the figure below.
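The approximation property (1) in 6.1.11 above is easy to observe numerically. A small Sage sketch (the function, point and increment are our own choice) compares the actual increment of f with the value of the differential:
f(x) = sin(x); x0 = 1; v = 0.01       # sample point and increment (our own choice)
increment = f(x0 + v) - f(x0)          # actual change of f
differential = diff(f, x)(x0) * v      # df(x0)(v) = f'(x0)*v
print(n(increment), n(differential))
The two printed numbers agree up to an error which is o(|v|), as the property (1) predicts.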
6.A.22. Recall that if x0 is a critical point of a differentiable function f and f′′(x0) > 0, then f′ is strictly increasing near x0. Therefore, the conditions f′(x0) = 0 and f′′(x0) > 0 imply that x0 is a local minimum. Similarly, the conditions f′(x0) = 0 and f′′(x0) < 0 imply that x0 is a local maximum. Show with a counterexample that the converse statements are not true in general. ⃝

6.A.23. Suppose that a function f : R → R admits the line y = 3x + 1 as an asymptote with slope, as x → +∞. Compute the limit
lim_{x→+∞} (x f(x) + 7x² + 2)/(x² f(x) − 3x³ + 5x² + 2).
Solution. By the definition of an asymptote with slope, we have
lim_{x→+∞} f(x)/x = 3, and lim_{x→+∞} (f(x) − 3x) = 1.
Thus, if we denote by A the given limit, by dividing both the numerator and the denominator by x² we get
A = lim_{x→+∞} (f(x)/x + 7 + 2/x²)/((f(x) − 3x) + 5 + 2/x²)
= (lim_{x→+∞} f(x)/x + lim_{x→+∞}(7 + 2/x²))/(lim_{x→+∞}(f(x) − 3x) + lim_{x→+∞}(5 + 2/x²)) = (3 + 7)/(1 + 5) = 5/3. □

Let us describe a problem which requires a bit of integration, but at a very elementary level and hence already treatable. We will analyze further similar problems in Section B and also in the final section of this chapter.

6.A.24. Let f : R → R be a twice differentiable function on R satisfying for all x ∈ R the relation
(√2 x² + 1)f′′(x) + 4√2 x f′(x) + 2√2 f(x) = 0. (∗)
(a) Find those differentiable functions g : R → R satisfying g(x) = 2√2 x f(x) + (√2 x² + 1)f′(x);
(b) Based on your answer in (a), determine the function f, if in addition it is known that the graph Cf of f passes through

Asymptotic estimates
Definition. The expression G(h) is asymptotically equal to F(h) for h → 0, written G(h) = O(F(h)), if the finite limit
lim_{h→0} G(h)/F(h) = a ∈ R
exists. Similarly, we compare the expressions for h → ∞ and use the same notation. If the limit is zero, then we write G(h) = o(F(h)).

This way of denoting the asymptotic behaviour is often called the big O and small o notation. For instance, the differentiability of a function f means that the error of the approximation of f by its differential is o(|h|) for increments h of the argument. Denote the values of the derivatives of f(x) at the points xi as f^(j)_i. Write the Taylor expansion as:
f_{i±1} = f_i ± f′_i h + (f′′_i/2) h² ± (f′′′_i/6) h³ + . . .
Considering combinations of the two expansions and f_i itself, we can express the derivative f′_i as follows:
(f_{i+1} − f_{i−1})/(2h) = f′_i + (h²/3!) f^(3)_i + . . .
(f_{i+1} − f_i)/h = f′_i + (h/2!) f′′_i + . . .
(f_i − f_{i−1})/h = f′_i − (h/2!) f′′_i + . . .
This suggests a basic numerical approximation for derivatives:

Central, forward, and backward differences
The central difference is defined as
f′_i = (f_{i+1} − f_{i−1})/(2h),
the forward difference is
f′_i = (f_{i+1} − f_i)/h,
and the backward difference is
f′_i = (f_i − f_{i−1})/h.

Theorem. The asymptotic estimate of the error of the central difference is O(h²). The errors of the backward and forward differences are O(h).

Proof. If we use the Taylor expansions with remainder of the appropriate order, we obtain an expression of the error of the approximation by the central difference in the form
(1/(2 · 3!)) h² (f^(3)(xi + ξh) + f^(3)(xi − ηh)).

the origin and the tangent line of Cf at the origin is perpendicular to the line determined by the equation x + √2 y − 1 = 0;
(c) Use Sage to confirm that your answer in (b) indeed satisfies the initial condition (∗). Moreover, sketch the graph of f for −10 ≤ x ≤ 10. Where did we meet a similar function earlier?
(d) Use Sage to specify the local extremes of f and characterize their type.
Solution. (a) By differentiating the relation g(x) = 2√2 x f(x) + (√2 x² + 1)f′(x) with respect to x, we obtain
g′(x) = 2√2 f(x) + 2√2 x f′(x) + 2√2 x f′(x) + (√2 x² + 1)f′′(x) = 0, x ∈ R,
where the final equality follows by (∗). This implies that g(x) = c for all x ∈ R, for some constant c ∈ R, i.e., g is constant.
(b) One observes that
g(x) = 2√2 x f(x) + (√2 x² + 1)f′(x) = ((√2 x² + 1)f(x))′, x ∈ R.
In (a) we proved that g(x) = c for all x ∈ R, hence a combination with the previous relation gives ((√2 x² + 1)f(x))′ = c. Hence it is easy to guess that (√2 x² + 1)f(x) = c x + α, for some constant α ∈ R. As we will see in the forthcoming section, this follows by integrating both sides of the relation including the derivative, i.e.,
∫((√2 x² + 1)f(x))′ dx = ∫ c dx ⟹ (√2 x² + 1)f(x) = c x + α, α ∈ R.
The polynomial √2 x² + 1 has only complex roots, hence it is never zero and we can divide by it, which yields the expression
f(x) = (c x + α)/(√2 x² + 1), for all x ∈ R. (∗∗)
To determine the constants c and α we rely on the information specified by the scenario in (b). First, f passes through the origin, hence we need α = 0. Next, the line x + √2 y − 1 = 0 is written as y = −(1/√2)x + 1 = −(√2/2)x + 1, which means that its slope equals −√2/2. Since the tangent line of f at the origin should be perpendicular to this line, we need f′(0) = 2/√2 = √2, such that −(√2/2) · √2 = −1. Therefore, to determine c we need to solve the equation f′(0) = √2. So, let us use (∗∗) (for α = 0) to compute f′(x):
f′(x) = c(−√2 x² + 1)/(√2 x² + 1)², x ∈ R.
Thus f′(0) = √2 if and only if c = √2, which means that
f(x) = √2 x/(√2 x² + 1), for all x ∈ R.
(c) A confirmation in Sage relies on the command bool, as follows (we also include the syntax for sketching Cf)

Here, 0 ≤ ξ, η ≤ 1 are the values from the remainder expression of f_{i+1} and f_{i−1}, respectively. The error in the other two cases, which involves the second derivative, is obtained similarly. □

Surprisingly, the central difference is one order better than the other two. But of course, the constants in the asymptotic estimates are important, too. In the case of the central difference, the bound on the third derivative appears, while in the two other cases second derivatives show up instead. We proceed the same way when approximating the second derivative. To compute f′′(xi) from a suitable combination of the Taylor polynomials, we cancel both the first derivative and the value at xi. The simplest combination cancels all the odd derivatives as well:
(f_{i+1} − 2f_i + f_{i−1})/h² = f^(2)_i + (h²/12) f^(4)(xi) + . . .
This is called the second order difference. Just as in the central first order difference, the asymptotic estimate of the error is
f^(2)_i = (f_{i+1} − 2f_i + f_{i−1})/h² + O(h²).
Notice that the actual bound depends on the fourth derivative of f.
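The orders O(h²) and O(h) are easy to observe experimentally. The following Sage sketch (the test function sin and the point x0 = 1 are our own choice) halves h and prints the errors of the central and forward differences; the former drops roughly by a factor of four, the latter by a factor of two:
f(x) = sin(x); x0 = 1
exact = cos(x0)                          # the true value f'(x0)
for h in [0.1, 0.05, 0.025]:
    central = (f(x0 + h) - f(x0 - h))/(2*h)
    forward = (f(x0 + h) - f(x0))/h
    print(h, n(abs(central - exact)), n(abs(forward - exact)))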
6.1.13. The curvature of the graph of a function. Imagine the graph of a function as a movement in the plane parametrized by the independent variable x. The vector (1, f′(x)) ∈ R² represents the velocity at x of such a movement. The tangent line through [x, f(x)] parametrized by this directional vector then represents a linear approximation of the curve. The goal is to discuss how “curved” the graph is at x. This is a straightforward exercise working with differentials in the setup of elementary plane geometry. It might need some effort to keep track of the details.

If f′′(x) = 0 and simultaneously f′′′(x) ≠ 0, the graph of the function f intersects its tangent line. In such a case, the tangent line is the best approximation of the curve at the point x up to the second order as well. We describe this by saying that the graph of f has zero curvature at the point x.

f(x)=(sqrt(2)*x)/(sqrt(2)*x^2+1)
print(bool((sqrt(2)*x^2+1)*diff(f, x, 2)
+ 4*sqrt(2)*x*diff(f, x)+2*sqrt(2)*f(x)==0))
show(plot(f(x), x, -10, 10,

The nonzero values of the first derivative describe the speed of the growth. Intuitively we expect the second derivative to describe the acceleration, including how “curved” the graph is. As a matter of convention, we want the curvature to be positive if the graph of the function is above its tangent. The tangent at a fixed point P = [x, f(x)] is the limit of the secants, i.e., the lines passing through the points P and Q = [x + ∆x, f(x + ∆x)]. To approximate the second derivative, interpolate the points P and Q ≠ P by the circle CQ, whose center is at the intersection of the perpendicular lines to the tangents at P and Q. It can be seen from the figure that if the angle between the tangent at the fixed point P and the x axis is α and the angle between the tangent at the chosen point Q and the x axis is α + ∆α, then the angle of the latter perpendicular lines is ∆α as well. Denote the radius of the circle by ρ. Then the length of the arc between points P and Q is ρ∆α. As Q approaches the fixed point P, the length ρ∆α of the arc approaches the length ∆s of the curve between P and Q, that is, the graph of the function f(x). At the same time the circle approaches some limit circle CP. This limit circle CP is called the osculating circle. Thus we arrive at the basic relation for the expected radius ρ of the circle CP in terms of the linear approximations of the quantities:
ρ = lim_{∆α→0} ∆s/∆α = ds/dα.
Notice that the quantity on the right hand side is well defined (independently of its rather intuitive justification). Define the curvature of the graph of the function f at the point P as the number 1/ρ. Zero curvature then corresponds to an infinite radius ρ. For computing the radius ρ in terms of f we need to express the length of the arc s by the change of the angle α and express the derivative of this function in terms of the derivative of f. Notice that for an increasing angle α the length of the arc can either increase or decrease, depending on whether the circle CQ has its center above or below the graph of the function f. The sign of ρ then reflects whether the function is concave or convex. There is also the special case when the center “runs off” to infinity in the limit. Instead of a circle there is the tangent line. There is no direct tool to compute the derivative ds/dα. However, tg α = df/dx. By differentiating this equality with respect to x we obtain (using the chain rule for differentials)
(1/(cos α)²) (dα/dx) = f′′.
On the left hand side we can substitute
1/(cos α)² = 1 + (tg α)² = 1 + (f′)²

ymin=-0.8, ymax=0.8, color="black"))
This block verifies our answer and produces the figure presented here:
(d) As shown in the graph above, f has two local extrema: one maximum and one minimum. Recall that critical points are found by solving the equation f′(x) = 0.
Sage performs this calculation quickly for us:
f(x)=(sqrt(2)*x)/(sqrt(2)*x^2+1)
d1=diff(f, x).factor(); print(solve(d1==0, x))
This prints out two solutions, namely x± = ±(1/2)·2^{3/4} = ±⁴√8/2 ≈ ±0.84. An alternative to obtain these solutions is based on the command roots and can be implemented as follows:
f(x)=(sqrt(2)*x)/(sqrt(2)*x^2+1)
d1=diff(f, x).factor(); roots=d1.roots()
sols = []
for i in range(len(roots)):
    sols.append(roots[i][0])
print(sols)
Executing this block one gets the two solutions (without their multiplicities), i.e., [-(1/2)*2**(3/4), (1/2)*2**(3/4)]. To verify that x+ (respectively, x−) is a point where f attains its local maximum (respectively, minimum) we will use the criterion of the second derivative. Hence it suffices to add the syntax
bool(diff(f, x, 2)(x=1/2*2^(3/4))<0)
bool(diff(f, x, 2)(x=-1/2*2^(3/4))>0)
which for both cases returns True. Or we can program Sage to check the critical points itself and print the corresponding result. For this it is sufficient to add in the initial block the following:
d2 = f.diff(2)
for a in sols:
    if d2(a)>0:
        print("{} is a local minimum".format(a))
    if d2(a)<0:
        print("{} is a local maximum".format(a))

which implies (see the rule for differentiating inverse functions)
dx/dα = (1 + (tg α)²)/f′′ = (1 + (f′)²)/f′′.
Now, we are almost finished, because the increment of the length of the arc s dependent on x is given by the formula
ds/dx = (1 + (f′)²)^{1/2}.
Thus, by the chain rule,
ρ = ds/dα = (ds/dx)(dx/dα) = (1 + (f′)²)^{3/2}/f′′.
Taking the reciprocal value, we have computed:

Curvature of the graph
If f is in C², then the curvature of its graph is given by the formula
ρ^{−1}(x) = f′′(x)/(1 + (f′(x))²)^{3/2}.

The result explains the relation between the curvature and the second derivative. The denominator of the fraction is always positive. It equals the third power of the length of the tangent vector of the given graph curve. The sign of the curvature is therefore given only by the sign of the second derivative, which confirms the ideas about concave and convex points of functions. In particular, at critical points, the curvature is just the second derivative. If the second derivative is zero, the curvature 1/ρ is also zero. If f′′ is large, then the radius ρ is small, thus the curvature is large as well. Compute the curvature of simple functions yourself and use osculating circles while sketching their graphs. The computation at the critical points of the function f is easiest. The radius of the osculating circle is the reciprocal value of the second derivative, with the corresponding sign.

6.1.14. Vector differential calculus. As mentioned already in the introduction to chapter five, most considerations related to differentiation are based on the fact that the functions are defined on real numbers and that their values can be added and multiplied by real numbers. That is why functions f : R → V need to have values in a vector space V. We call them vector functions of one real variable or, more briefly, vector functions. Now, we digress to consider functions with values in the plane or in space. Thus, f : R → R² and f : R → R³. We consider (parametrized) curves in plane and space. We could work with values in Rⁿ for any finite dimension n. For simplification, we work with the fixed standard bases ei in R² and R³. So curves are given by pairs or triples of real functions of one real variable, respectively.
The vector function r in plane or space, respectively, is given by
r(t) = x(t)e1 + y(t)e2, r(t) = x(t)e1 + y(t)e2 + z(t)e3.

In this case the output looks as follows:
-1/2*2^(3/4) is a local minimum
1/2*2^(3/4) is a local maximum
Finally, Sage can be used to compute the values of f at x±, which allows us to summarize as follows: The point P+ = [⁴√8/2, ⁴√2/2] is a local maximum of f, while the point P− = [−⁴√8/2, −⁴√2/2] is a local minimum of f. Notice ±⁴√2/2 ≈ ±0.59. □

6.A.25. Local extrema numerically via Sage. We should remark that Sage provides an in-built method which numerically finds the local extremes of a given function on an interval [a, b], along with the points at which these extreme values are attained. This technique relies on the commands find_local_maximum(f, a, b) and find_local_minimum(f, a, b), respectively. Hence, for example, for the function f presented in 6.A.24 one could alternatively type
f(x)=(sqrt(2)*x)/(sqrt(2)*x^2+1);
show(find_local_maximum(f, -10, 10))
show(find_local_minimum(f, -10, 10))
This, indeed, returns the desired answers in the form (f(x±), x±), see here:
(0.5946035575013604, 0.8408963982254773),
(−0.5946035575013604, −0.8408963982254773).

6.A.26. Use Sage to sketch the real functions f, g, h, k for x ∈ [−2, 2], and find numerically their local extrema, if any, where:
(a) f(x) = (1/8)x⁸ − (3/5)x⁵ − 3x + 9,
(b) g(x) = x² − (1/6)x³,
(c) h(x) = x√(x² + 4),
(d) k(x) = ln(x³ + 8). ⃝

6.A.27. Find all intervals on which the function f(x) = e^{−x²}, with x ∈ R, is concave.
Solution. We compute f′(x) = −2x e^{−x²} and f′′(x) = 2(2x² − 1) e^{−x²} for all x ∈ R. The sign of f′′ is given in the following table:
x:       −∞   −√2/2   √2/2   +∞
f′′(x):     +     0    −    0     +
Thus for any x ∈ (−√2/2, √2/2) we get f′′(x) < 0, and the desired answer is given by the open interval (−√2/2, √2/2) ⊂ R. □

6.A.28. Remark. In Sage we might consider using the bool command to verify the inequality f′′(x) < 0, for all x ∈ (−√2/2, √2/2), from the previous problem. Hence, for example, one could type
f(x)=exp(-x**2)
df2=diff(f, x, 2)
s=RealSet(-sqrt(2)/2, sqrt(2)/2)
bool(df2(x)<0 for x in s)

The derivative of such a vector function is a vector, which approximates the map r by a linear map of the real line to the plane or to the space. In the plane it is
dr/dt (t) = r′(t) = x′(t)e1 + y′(t)e2
and similarly in space. The differential of a vector function in this context is:
dr = ((dx/dt) e1 + (dy/dt) e2 + (dz/dt) e3) dt,
where the expression on the right hand side is understood as “selecting” an increment of the scalar independent variable t and mapping it linearly by multiplying the vector of the three derivative components. Thus the corresponding increment of the vector quantity r is obtained (of course, only two components in the plane). The notation r(t) is a convenient way to describe curves in space. For example r(t) = (a cos t, a sin t, bt) or r(t) = a cos t e1 + a sin t e2 + bt e3, for fixed constants a, b, describes a circular helix. Here the parameter t is related to a suitable angle measured around the z-axis. The derivative of r(t) at t = t0 determines the direction of the tangent line at r(t0). In Newtonian mechanics, the parameter t can stand for time, measured in suitable units. In this case the derivative of r(t) at time t = t0 gives the velocity vector at the same time. The second derivative then represents the acceleration vector at the same time.
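Such componentwise differentiation is immediate in Sage. A minimal sketch for the circular helix above, with sample values a = 1, b = 1/2 of the constants chosen by us:
t = var('t')
a, b = 1, 1/2                           # sample constants (our own choice)
r = vector([a*cos(t), a*sin(t), b*t])   # the circular helix
v = r.diff(t)                           # velocity vector r'(t)
w = r.diff(t, 2)                        # acceleration vector r''(t)
print(v.subs(t=0), w.subs(t=0))
print(sqrt((v*v).simplify_full()))      # constant speed sqrt(a^2 + b^2)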
6.1.15. Differentiating composite maps. In linear algebra and geometry there are very useful special maps called forms. They have one or more vectors as their arguments and they are linear in each of their arguments. In this way we defined the length of vectors (the dot product is a symmetric bilinear form) or the volume of a parallelepiped (this is an n-linear antisymmetric form, where n is the dimension of the space), see for example the paragraphs 2.3.23 and 4.1.19. Of course, we insert vectors r(t) dependent on a parameter as the arguments of these operations. By a straightforward usage of the Leibniz rule for differentiation of a product of functions, the following is verified:

Theorem. (1) If r(t) : R → Rⁿ is a differentiable vector function and Ψ : Rⁿ → Rᵐ is a linear map, then the derivative of the map Ψ ∘ r satisfies
d(Ψ ∘ r)/dt = Ψ ∘ (dr/dt).
(2) Consider differentiable vectors r1, ..., rk : R → Rⁿ and a k-linear form Φ : Rⁿ × . . . × Rⁿ → R on the space Rⁿ. Then the derivative of the composed map φ(t) = Φ(r1(t), . . . , rk(t)) satisfies the (generalized) Leibniz rule
dφ/dt = Φ(dr1/dt, r2, . . . , rk) + · · · + Φ(r1, . . . , rk−1, drk/dt).

In this block we used the RealSet command to define the interval of interest. Although Sage returns True, this result can be misleading and we cannot trust this method. This is because if we replace the last line by bool(df2(x) > 0 for x in s), Sage will still return True. In fact, the expression df2(x) < 0 for x in s is a generator, and bool applied to a generator object always returns True, without ever evaluating the condition; so no check of the inequality over the interval is performed at all. Therefore, one should exercise caution when interpreting results from computer algebra systems.

6.A.29. Show that the function f(x) = x^a with 0 < a < 1 is concave for all x > 0. ⃝

6.A.30. Use Sage to determine the asymptotes of the function f(x) = x/(x² − 1), with x ∈ R∖{±1}, and then sketch these asymptotes along with the graph of f. ⃝

6.A.31. Consider the function f(x) = (e^x − 1)/(e^x + 1) with x ∈ R.
(a) Prove that f is increasing on R, find its asymptotes and determine its range. Then use Sage to sketch the asymptotes along with the graph of f.
(b) Determine the extreme points, inflection points and intervals of convexity of f, if any. ⃝

6.A.32. Study the local behaviour of the function f(x) = x/(x + 1)², with x ∈ A := R∖{−1}.
Solution. Let us begin with the asymptotes of f. Obviously, f is continuous on its domain A. Thus, the only place we may have a vertical asymptote is at the endpoint −1. Indeed, we see that the line x = −1 is a vertical asymptote since lim_{x→−1+} f(x) = lim_{x→−1−} f(x) = −∞. We also have lim_{x→−∞} f(x) = lim_{x→∞} f(x) = 0, thus the line y = 0 is a horizontal asymptote. In Sage a computation of these limits can be done as usual, i.e.,
f(x)=x/(x+1)^2
show(lim(f(x), x=-1, dir="left"))
show(lim(f(x), x=-1, dir="right"))
show(lim(f(x), x=-oo)); lim(f(x), x=oo)
Next, by adding the command show(diff(f, x).factor()) we obtain the first derivative of f,
f′(x) = −(x − 1)/(x + 1)³, x ∈ A. (♭)
Try to verify these results by hand. By (♭) we deduce that the equation f′(x) = 0 has a unique solution, given by x = 1, hence f has a unique critical point. There f attains a relative maximum, with value f(1) = 1/4 (see below for an explanation).
To determine the critical points of f via Sage, add in the previous block the syntax solve(diff(f, x)==0, x) (the command diff(f, x).roots() also works, as we explained in 6.A.24). Now, by (♭) we also deduce that

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

(3) The previous statement remains valid even if Φ also has values in a vector space, and is linear in all its k arguments.

Proof. (1) The linear maps are given by a constant matrix of scalars A = (aij), so that Ψ ◦ r(t) = (∑_{i=1}^n a1i ri(t), ..., ∑_{i=1}^n ami ri(t)). Carry out the differentiation separately for the individual coordinates of the result. The derivative acts linearly with respect to scalar linear combinations, see Theorem 5.3.4. That is why the derivative is obtained simply by evaluating the original linear map Ψ on the derivative r′(t). (2) The second statement is obtained analogously. Write out the evaluation of the k-linear form on the vectors r1, ..., rk in coordinates in this way: Φ(r1(t), ..., rk(t)) = ∑_{i1,...,ik=1}^n B_{i1...ik} (r1)_{i1}(t) ··· (rk)_{ik}(t), where the scalars B_{i1...ik} = Φ(e_{i1}, ..., e_{ik}) are given as the values of the given form on the chosen k-tuple of basis vectors, for every choice of indices. The rule for differentiating a product of scalar functions then yields the statement. (3) If Φ has vector values, it is given by finitely many components and the previous result can be used for each of them. □

In the Euclidean space R^3, the scalar product assigns a scalar to two vectors. There is also the vector product, which assigns the vector u × v ∈ R^3 to vectors u and v, see 4.1.21. This vector u × v is orthogonal to both vectors u and v, its length equals the area of the parallelogram determined by u and v (in this order), and the orientation is such that the triple u, v, u × v is a positively oriented basis. The previous ideas immediately imply:

Corollary. Consider the vectors u(t) and v(t) in the space R^3. The derivatives of their scalar product ⟨u(t), v(t)⟩ and their vector product u(t) × v(t) satisfy
(1) d/dt ⟨u(t), v(t)⟩ = ⟨u′(t), v(t)⟩ + ⟨u(t), v′(t)⟩,
(2) d/dt (u(t) × v(t)) = u′(t) × v(t) + u(t) × v′(t).

6.1.16. The curvature of curves. We develop far more powerful tools for studying curves in a more systematic way than when we discussed the curvature of the graphs of functions. We proceed in dimension three. Plane curves are a special case in which the third component is the constant zero. Let r(s) be a curve in the Euclidean space R^3. For our purposes, it is convenient to choose the arc length s as the parameter. It follows that ∥dr/ds∥ = 1, so that the tangent vector has unit length. When s is the parameter, the notation ′ is used for differentiation.

• f′(x) < 0 for all x ∈ (−∞, −1) ∪ (1, +∞);
• f′(x) > 0 for all x ∈ (−1, 1).
Thus f decreases for all x ∈ (−∞, −1) ∪ (1, +∞) and increases for x ∈ (−1, 1). Let us now proceed with the second derivative. By adding the command diff(f, x, 2) we obtain the second derivative of f, given by f′′(x) = 2(x − 2)/(x + 1)^4 for all x ∈ A. Hence x = 2 is the unique solution of f′′(x) = 0, and we see that
• f′′(x) < 0 for all x ∈ (−∞, −1) ∪ (−1, 2);
• f′′(x) > 0 for all x ∈ (2, +∞).
Thus f is convex for all x ∈ (2, +∞) and concave for x ∈ (−∞, −1) ∪ (−1, 2). It also follows that the unique inflection point of f appears at x = 2, with value f(2) = 2/9.
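This sign analysis can also be delegated to Sage; the cell below is a small sketch of ours, sampling the factored second derivative on both sides of x = 2:
f(x)=x/(x+1)^2
d2=diff(f, x, 2).factor()
show(d2)
show(solve(d2(x)==0, x))
show(d2(0), d2(3))
Since the denominator (x + 1)^4 is positive on A, the sign of f′′ is governed by the factor x − 2 alone, which explains the sign pattern above.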
Notice that, in order to verify that a relative maximum appears at the critical point located above, we may use the criterion based on the second derivative (see below) and show that f′′(1) < 0. Let us finally focus on the graph of f. The origin is the unique intersection of the graph of f with the axes, while f is neither even nor odd (so its graph has no particular symmetries). To sketch the graph of f for x ∈ A, together with its vertical asymptote x = −1, we use the cell
f(x)=x/(x+1)^2
pf=plot(f, x, -5, 5, ymin=-5, ymax=1, color="black")
pm=point((1, 1/4), size=30, color="red")
pt=circle((-1, 0), 0.05, color="black")
pnf=point((2, 2/9), size=30, color="blue")
sy=line([(-1,-5), (-1,5)], linestyle="--", thickness=0.8)
(pf+sy+pm+pt+pnf).show(ticks=1, tick_formatter=1)
Execute this block in your editor and observe its small particularities, such as the red and blue colours marking the maximum and the inflection point of f, respectively. □

We will now focus on derivatives of functions defined “implicitly” by equations of the form F(x, y) = 0. In such cases it is often impossible to solve the equation for y (as an explicit function of x). However, as we will see below, it is still possible to find the derivative dy/dx, compute extrema, and decide on the convexity features of y, etc. For instance, to compute dy/dx the trick is simply to differentiate both sides of the given equation, and then solve for the derivative we are seeking. This is the so-called implicit differentiation, which has many useful applications.

6.A.33. Implicit differentiation. Find the extreme points of the real function y = f(x) given in the implicit form xy^2 − x^2 y = 54. Solution. Since y is a function of x, we can differentiate the given relation with respect to x. We get y^2 + 2xyy′ − 2xy − x^2 y′ = 0, and thus, solving with respect to y′, one deduces that y′ = (2xy − y^2)/(2xy − x^2). (♯)

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

So ⟨r′(s), r′(s)⟩ = 1 for all s. The curve r(s) is parametrized by the length s. By another differentiation of this equality we arrive at (using the symmetry of the dot product) 0 = d/ds ⟨r′(s), r′(s)⟩ = 2⟨r′′(s), r′(s)⟩. Thus the vector r′′(s) is always orthogonal to the vector r′(s). This corresponds to the idea that, after the choice of a parametrization with a derivative of constant length, the second derivative in the direction of the movement vanishes. The second derivative lies in the plane orthogonal to the tangent vector. If the second derivative is nonzero, the normed vector n(s) = (1/∥r′′(s)∥) r′′(s) is the (principal) normal of the curve r(s). The scalar function κ(s) satisfying (at the points where r′′(s) ≠ 0) r′′(s) = κ(s)n(s) is called the curvature of the curve r(s). At the zero points of the second derivative, κ(s) is defined as 0. At the nonzero points of the curvature, the unit vector b(s) = r′(s) × n(s) is well defined and is called the binormal of the curve r(s). By direct computation, 0 = d/ds ⟨b(s), r′(s)⟩ = ⟨b′(s), r′(s)⟩ + ⟨b(s), r′′(s)⟩ = ⟨b′(s), r′(s)⟩ + κ(s)⟨b(s), n(s)⟩ = ⟨b′(s), r′(s)⟩, which shows that the derivative of the binormal is orthogonal to r′(s). Further, b′(s) is also orthogonal to b(s) (for the same reason as with r′ above). Therefore it is a multiple of the principal normal n(s). We write b′(s) = −τ(s)n(s). The scalar function τ(s) is called the torsion of the curve r(s). In the case of plane curves, the definitions of binormal and torsion do not make sense.
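As a concrete illustration of these invariants, one can let Sage recompute the curvature of the circular helix r(t) = (a cos t, a sin t, bt) from 6.1.14. The cell below is a rough sketch of ours: since the speed is √(a^2 + b^2), substituting t = s/√(a^2 + b^2) yields the parametrization by arc length, and the expected output is the constant curvature κ = a/(a^2 + b^2).
a, b, s = var("a b s", domain="positive")
c = sqrt(a^2 + b^2)
r = vector([a*cos(s/c), a*sin(s/c), b*s/c])
r2 = diff(r, s, 2)
show(sqrt(r2*r2).simplify_full())
A similar computation with b′(s) = −τ(s)n(s) gives the constant torsion τ = b/(a^2 + b^2); so both invariants of the helix are constant, which fits the theorem below, stating that κ and τ determine a space curve up to a Euclidean transformation.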
One can obtain this formula also in Sage by the cell
var("x"); y = function("y")(x)
eq = x*y**2 - y*x**2 == 54
dy = diff(y, x)               # declare the derivative
sol = solve(diff(eq), dy)     # differentiate and solve the implicit equation
show(dy.subs(sol))            # substitute the solution
Thus Sage can be quickly programmed to implement implicit differentiation for us, see also below. Let us now return to our task. Based on (♯), we see that the equation y′ = 0 is equivalent to y(2x − y) = 0, which gives y = 0 or y = 2x. The first solution is not acceptable (y = 0 does not satisfy the given equation), and hence we have y′ = 0 if and only if y = 2x. For y = 2x the given equation reduces to 2x^3 = 54, i.e. x^3 − 27 = 0, that is, (x − 3)(x^2 + 3x + 9) = 0. Here x = 3 is the unique real solution, hence we get a stationary point of y with value y(3) = 6. Next we need to characterize this critical point, and hence we need the second derivative. A differentiation of both sides in (♯) gives the relation y′′ = A(x)/B(x), where A(x) = 2[(y + xy′ − yy′)(2xy − x^2) − (2xy − y^2)(y + xy′ − x)] and B(x) = (2xy − x^2)^2, respectively. By substituting the coordinates [x = 3, y = 6], and since y′ vanishes at this point, we finally deduce that y′′(3) = 12/27 = 4/9 > 0. Thus x = 3 is a local minimum of y = f(x), with value ymin = 6. □

6.A.34. Suppose that f : R → R is a function which is twice differentiable on R and satisfies f^2(x) − xf(x) + 2x^2 − 7 = 0, x ∈ R. If a ∈ R is a stationary (critical) point of f, prove that a = ±√2/2. Does f admit some inflection point? ⃝

Recall from Chapter 5 that first order derivatives provide linear approximations, which can be used to estimate values of differentiable functions (cf. 5.C.18). This idea is revised in 6.1.11, in terms of the “differential” of a differentiable function y = f(x), defined by df(x) = f′(x)dx, i.e., dy = f′(x)dx. This is of course an equivalent way to interpret the differentiability of f, which is encoded by the relation f′(x) = dy/dx. Notice that dx is an independent variable to which we may assign any non-zero real number, while dy (or df(x)) is necessarily a dependent variable. The differential of f can be viewed as a linear form on R, satisfying df(x)(h) = f′(x)h for all h ∈ R, and it provides the best linear approximation of f around x. Let us illustrate the most basic features of this concept via examples.

6.A.35. Compute the differentials of the functions f(x) = e^(x^2), g(x) = ln(x^2 + 1) and k(x) = arctan(e^x) with x ∈ R. Next present an implementation of these differentials via Sage. ⃝

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

We have not yet computed the rate of change of the principal normal, which can be written as n(s) = b(s) × r′(s): n′(s) = b′(s) × r′(s) + κ(s) b(s) × n(s) = −τ(s) n(s) × r′(s) + κ(s)(−r′(s)) = τ(s)b(s) − κ(s)r′(s). Summarizing, for all points with nonzero second derivative of the curve r(s) parametrized by the arc length, we have constructed the nice orthonormal basis (r′(s), n(s), b(s)), called the Frenet frame in the classical literature. At the same time, this basis is used to express the derivatives of its components in the well known form:

Frenet–Serret formulae
For the Frenet frame (r′(s), n(s), b(s)), the so called Frenet–Serret formulae hold true:
dr′/ds (s) = κ(s)n(s),
dn/ds (s) = τ(s)b(s) − κ(s)r′(s),
db/ds (s) = −τ(s)n(s).

The following theorem tells how crucial the curvature and torsion are. Notice that if the curve r(s) lies in one plane, then the torsion is identically zero. In fact, the converse is true as well. We shall not provide the proofs here. Theorem.
Two curves in the space R^3 parametrized by the length of their arc can be mapped to each other by a Euclidean transformation (i.e. an affine map preserving distances) if and only if their curvature functions and torsion functions coincide, except for a possible constant shift of the parameter. Moreover, for every choice of smooth functions κ and τ there exists a smooth curve with this curvature and torsion.

By a straightforward computation we can check that the curvature of the graph of a function y = f(x) in the plane and the curvature κ of this curve as defined in this paragraph coincide. Indeed, comparing the differentials of the arc length for the graph of a function (as a curve with coordinates x(t), y(t) = f(x(t))): ds = (1 + (fx)^2)^(1/2) dx, dx = (1 + (fx)^2)^(−1/2) ds (we write fx = df/dx), we obtain the following equality for the unit tangent vector of the graph: r′(s) = ((1 + (fx)^2)^(−1/2), fx (1 + (fx)^2)^(−1/2)). Now we have to go through a somewhat messy computation for the second derivative. Let us write briefly r = (x, y), y′ = fx x′, x′ = (1 + fx^2)^(−1/2) (recall that ′ always means the derivative with respect to the arc length parameter s). Then x′′ = −(1/2)(x′)^3 · 2 fx fxx x′ = −(x′)^4 fx fxx, and y′′ = fxx (x′)^2 + fx x′′ = fxx (x′)^2 − fxx fx^2 (x′)^4.

6.A.36. Differentials. Construct in Sage a routine having as input a pair (f, x0) consisting of a function and a point, and as output the differential of f at x0 (assuming that f is differentiable at x0). Next use your program to test some cases explicitly.

Solution. To solve this task we can use the def command to define a routine with the name Dif. Notice that Dif should have two inputs, a function f depending on one variable, and a real number x_0. Since df(x) = f′(x)dx, we should program Sage to view dx as a variable different from x, and it is sufficient to introduce dx as a symbolic variable, similarly to x. In total, one can program Sage via the block
var("x, dx")
def Dif(f, x_0):
    f1 = diff(f, x)(x=x_0)
    show("The differential of f(x)=", f, ", at x=", x_0, ", equals:", (dx)*f1)
    return
Notice that in the body of this routine the syntax diff(f, x)(x=x_0) corresponds to the derivative f′(x0). Let us now test the routine. For this goal, we choose the following pairs (f, x0):
Dif(ln(x), 2*pi)
Dif(exp(x), 0)
Dif(cos(2*x), 2*pi)
Dif(ln(x^2+1), 4*pi)
Dif(arctan(exp(x)-x^3+4*x), pi/2)
In the first case, for the pair (ln(x), 2π), Sage's output has the form:¹ “The differential of f(x) = log(x), at x = 2π, equals: dx/(2π)”, which is obviously true. Execute the remaining commands in your editor and then verify Sage's solutions by hand. As a side remark, observe that when the input is a function f which is not differentiable at x0, our routine returns an error (similar to the error that Sage returns when the derivative of a function at a point does not exist). □

6.A.37. Consider a function y = f(x) differentiable at a point x0. The change in the values of y in a small neighbourhood around x0 is given by ∆(y) = f(x0 + dx) − f(x0). We can approximate ∆(y) linearly using the differential of f at x0, that is, ∆(y) ≈ f′(x0)dx. Based on this formula, compare ∆(y) and dy for the cases: (a) y = x^4 − x^2 + 3x + 2 and x changes from 2 to 2.05; (b) y = x^2 + sin^2(x) cos(x) and x changes from π/2 to π/2 + 0.04; (c) y = ln(x + 1) − e^(√x) and x changes from 1 to 1.03.

Solution. (a) We have x0 = 2 and x0 + dx = 2 + 0.05 = 2.05. Thus ∆(y) = f(x0 + dx) − f(x0) = f(2.05) − f(2) ≈

¹To avoid confusion, recall that in Sage the function log(x) corresponds to the natural logarithm ln(x).
CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Hence (x′′)^2 + (y′′)^2 = fxx^2 (x′)^8 (fx^2 + (1 + fx^2)^2 + fx^4 − 2fx^2(1 + fx^2)) = fxx^2 (1 + fx^2)^(−4) (fx^2 + 1) = fxx^2 (1 + fx^2)^(−3). We have arrived at the expected formula κ^2 = ∥r′′∥^2 = (d^2 f/dx^2)^2 (1 + (fx)^2)^(−3).

2. Integration

6.2.1. Indefinite integral. Now we reverse the procedure of differentiation. We want to reconstruct the actual values of a function using its immediate changes. If we consider the given function f(x) as the (say continuous) derivative of an unknown function F(x), then at the level of differentials we can write dF = f(x) dx. We call the function F the primitive function or the indefinite integral of the function f (the name antiderivative is also found in the literature). Traditionally we write F(x) = ∫ f(x) dx.

Lemma. The primitive function F(x) to the function f(x) is determined uniquely on each interval [a, b], up to an additive constant.

Proof. The statement follows immediately from Lagrange's mean value theorem, see 5.3.9. Indeed, if F′(x) = G′(x) = f(x) on the whole interval [a, b], then the derivative of the function (F − G)(x) vanishes at all points c of the interval [a, b]. The mean value theorem implies that for all points x in this interval, F(x) − G(x) = F(a) − G(a) + 0 · (x − a). Thus the difference of the values of the functions F and G is constant on the interval [a, b]. □

The previous lemma supports another notation for the indefinite integral: F(x) = ∫ f(x) dx + C, with an unknown constant C. The primitive functions are well defined for complex functions f, where the real and the imaginary parts of the indefinite integrals are real primitive functions to the real and the imaginary parts of f. Thus, with no loss of generality, we work only with real functions in the sequel.

1.6085. On the other hand, dy|_(x=x0) = f′(x0)dx, that is, dy = (4x^3 − 2x + 3)|_(x=2) · dx = 31 · 0.05 = 1.55. Thus |∆(y) − dy| = ∆(y) − dy ≈ 1.6085 − 1.55 ≈ 0.0585.

(b) In this case we have x0 = π/2 and x0 + dx = π/2 + 0.04. Based on the following block in Sage,
f(x)=x^2+(sin(x))^2*cos(x)
D=f((pi/2)+0.04)-f(pi/2)
show(N(D))
we compute ∆(y) = f(π/2 + 0.04) − f(π/2) ≈ 0.08733. Also, we compute that f′(π/2) = (2x + 2 sin(x) cos^2(x) − sin^3(x))|_(x=π/2) = π − 1. Hence for the differential we get dy|_(x=π/2) = (π − 1) · 0.04 ≈ 0.08566, and we deduce that |∆(y) − dy| = ∆(y) − dy ≈ 0.00167.

(c) Let us answer this task only with the aid of Sage, and leave the formal computation as an exercise. An appropriate syntax has the form
f(x)=ln(x+1)-e**(sqrt(x))
D=f(1.03)-f(1); show(N(f(1.03)-f(1)))
d=(diff(f(x), x)(x=1))*0.03; show(N(d))
show(N(abs(D-d)))
In this block the first show command implies that ∆(y) ≈ 0.025887, while the second gives dy ≈ 0.025774. After introducing f, D and d, one can directly type the final command, which implies that |∆(y) − dy| ≈ 0.00011. At the end of the day, what we should keep from this task is that the formal computation of dy in all three cases (a), (b) and (c) is easier than computing ∆(y) explicitly. In fact, for complicated functions the computation of ∆(y) can be extremely hard, but it is always easier to compute differentials. Finally, you may like to verify for yourself that a recipe for improving the approximation is to choose smaller increments dx. □

6.A.38. Recall that the formula f(x) ≈ f(x0) + f′(x0)(x − x0) determines the linear (tangential) approximation of a differentiable function f at a point x0.
(a) Show that this is equivalent to the approximation formula ∆(y) ≈ dy; (b) For the function f(x) = 1/x compare the linear approximation of the value f(1.1), obtained in 5.C.18, with the value determined by the relation f(x0 + dx) ≈ f(x0) + dy.

Solution. (a) We have f(x) ≈ f(x0) + f′(x0)(x − x0), and since x − x0 = dx, this can be equivalently expressed as f(x) ≈ f(x0) + f′(x0)dx. Using the definition of the differential of f at x0 (which we simply denote by dy), we finally get the equivalent expression f(x) ≈ f(x0) + dy, that is, f(x0 + dx) ≈ f(x0) + dy, where on the left-hand side the variable x was replaced by x0 + dx. Hence the tangential approximation is equivalent to the relation f(x0 + dx) − f(x0) ≈ dy, i.e., ∆(y) ≈ dy. This proves (a). (b) Recall from 5.C.18 that the tangential approximation of 1/x around x0 = 1 is the line L(x) = 2 − x, and hence

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

6.2.2. Newton integral. We consider the value of a real function f(x) as an immediate increment of the region bounded by the graph of the function f and the x axis, and try to find the area of this region between boundary values a and b of some interval. We relate this idea to the indefinite integral. Suppose we are given a real function f and its indefinite integral F(x), i.e. F′(x) = f(x) on the interval [a, b]. Divide the interval [a, b] into n parts by choosing the points a = x0 < x1 < ··· < xn = b. Approximate the values of the derivatives at the points xi by the forward differences, that is, by the expressions f(xi) = F′(xi) ≃ (F(x(i+1)) − F(xi))/(x(i+1) − xi). Finally, the sum over all the intervals of our partition yields the approximation of the area: ∑_(i=0)^(n−1) f(xi)(x(i+1) − xi) ≃ ∑_(i=0)^(n−1) ((F(x(i+1)) − F(xi))/(x(i+1) − xi)) (x(i+1) − xi) = ∑_(i=0)^(n−1) (F(x(i+1)) − F(xi)) = F(b) − F(a). Therefore we expect that for “nice enough” functions f(x), the area of the region bounded by the graph of the function and the x axis (including the signs) can be calculated as a difference of the values of the primitive function at the boundary points of the interval. This procedure is called the Newton integration.³

³Isaac Newton (1642-1726) was a phenomenal English physicist and mathematician. The principles of integration and differentiation were formulated independently by him and Gottfried Leibniz in the late 17th century. It took nearly another two centuries before Bernhard Riemann introduced the completely rigorous modern version of the integration process.

L(1.1) = 0.9. Using the differential dy at x0, by the claim in (a) we should get the same result. Indeed, we have dy = −(1/x^2) dx. Moreover, 1.1 = x0 + dx = 1 + 0.1, i.e., dx = 0.1, and hence we get dy|_(x0=1) = −1 · 0.1 = −0.1. Then the relation f(x0 + dx) ≈ f(x0) + dy gives f(1.1) ≈ f(1) + (−0.1) = 1 − 0.1 = 0.9, as required. □

The relation df(x)(h) = f′(x)h and the replacement of the increment dx by some small positive number h close to zero allow us to rephrase the approximation formula ∆(y) ≈ dy as f(x + h) − f(x) ≈ f′(x)h, i.e., (f(x + h) − f(x))/h ≈ f′(x). Based on the Taylor expansion of f around x, and using the O-notation (see 6.1.12), we finally arrive at the equation f′(x) = (f(x + h) − f(x))/h + O(h). Obviously, this establishes a first-order accurate approximation of f′(x), since the dominant term in the truncation error is O(h). The expression (f(x + h) − f(x))/h is known as the “forward difference” formula for the first derivative.
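The first-order behaviour of this error is easy to observe experimentally. The following Sage loop is our own small illustration (the test function sin and the point 0.5 are arbitrary choices); dividing h by ten should shrink the error by roughly a factor of ten:
def forward_diff(f, x0, h):
    return (f(x0 + h) - f(x0))/h
f(x) = sin(x)
for h in [0.1, 0.01, 0.001]:
    print(h, N(abs(forward_diff(f, 0.5, h) - cos(0.5))))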
Similarly, one can consider the “backward difference” formula (f(x) − f(x − h))/h, which gives another first-order accurate approximation of f′(x), see 6.1.12.

6.A.39. Remark on the role of h. The figure below illustrates the forward and backward difference formulas for the approximation of f′(0.5), where f(x) = 3x^2 − x + 14 with x ∈ I = [−1, 2] and h = 1. Check for yourself that, since h is rather large, this does not produce a “good” approximation of f′(0.5). Hence, to improve the approximation, small increments h close to zero are in general preferable (see also below).

6.A.40. Consider the function f(x) = e^(x^2) − 9/10. For the steps h = 0.1 and h = 0.01, use Sage to find the forward and backward difference approximations of the derivative f′(0.2).

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Newton integral
If F is the primitive function to the function f on the interval [a, b], then we write ∫_a^b f(x) dx = [F(x)]_a^b = F(b) − F(a) and call it the Newton (definite) integral with the bounds a and b.

We prove later that for all continuous functions f ∈ C^0[a, b] the Newton integral exists and computes the area as expected. This is one of the fascinating theorems in elementary calculus. Before going into this, we discuss how to compute these integrals.

6.2.3. Integration “by heart”. We show several procedures for computing the Newton integral. We exploit the knowledge of differentiation, and look for primitive functions. The easiest case is the one where the given function is known as a derivative. To learn such cases, it suffices to read the tables for function derivatives in the menagerie the other way round. Hence:

Integration table
For arbitrary nonzero a, b ∈ R and n ∈ Z, n ≠ −1:
∫ a dx = ax + C
∫ a x^n dx = (a/(n+1)) x^(n+1) + C
∫ e^(ax) dx = (1/a) e^(ax) + C
∫ (a/x) dx = a ln x + C
∫ a cos(bx) dx = (a/b) sin(bx) + C
∫ a sin(bx) dx = −(a/b) cos(bx) + C
∫ a cos(bx) sin^n(bx) dx = (a/(b(n+1))) sin^(n+1)(bx) + C
∫ a sin(bx) cos^n(bx) dx = −(a/(b(n+1))) cos^(n+1)(bx) + C
∫ a tg(bx) dx = −(a/b) ln(cos(bx)) + C
∫ a/(a^2 + x^2) dx = arctg(x/a) + C
∫ −1/√(a^2 − x^2) dx = arccos(x/a) + C
∫ 1/√(a^2 − x^2) dx = arcsin(x/a) + C
In all the above formulae, it is necessary to clarify the domain on which the indefinite integral is well defined. We leave this to the reader.

Then, estimate the absolute value of the actual error. Which formula provides the better approximation, and for which h?

Solution. We have x = 0.2, hence the forward/backward difference formulas have the form (f(0.2 + h) − f(0.2))/h and (f(0.2) − f(0.2 − h))/h, respectively. It is easy to estimate these expressions for the given h via Sage, and a recommendation is here:
f(x)=exp(x^(2))-0.9
forw01=(f(0.2+0.1)-f(0.2))/0.1; show(forw01)
back01=(f(0.2)-f(0.2-0.1))/0.1; show(back01)
forw001=(f(0.2+0.01)-f(0.2))/0.01; show(forw001)
back001=(f(0.2)-f(0.2-0.01))/0.01; show(back001)
Based on Sage's output, for h = 0.1 we deduce that f′(0.2) ≈ (f(0.3) − f(0.2))/0.1 ≈ 0.53364 by the forward difference, and f′(0.2) ≈ (f(0.2) − f(0.1))/0.1 ≈ 0.30761 by the backward difference, while for h = 0.01 we get f′(0.2) ≈ (f(0.21) − f(0.2))/0.01 ≈ 0.42761 by the forward difference, and f′(0.2) ≈ (f(0.2) − f(0.19))/0.01 ≈ 0.40513 by the backward difference. Of course, the second choice h = 0.01 gives better approximations. Indeed, the derivative of f is given by f′(x) = 2x e^(x^2), thus f′(0.2) ≈ 0.4163243. For a confirmation via Sage, add in the previous cell the line
f1=diff(f(x), x)(x=0.2); show(f1)
Let us now estimate the errors, where it is reasonable to treat only the case h = 0.01, which provides the smaller errors.
So, to evaluate the differences |f′(0.2) − 0.42761| and |f′(0.2) − 0.40513|, we may add to our block the lines
show(N(abs(f1-forw001)))
show(N(abs(f1-back001)))
We deduce that |f′(0.2) − 0.42761| ≈ 0.0112841 and |f′(0.2) − 0.40513| ≈ 0.0111986. Thus, the backward difference with h = 0.01 approximates f′(0.2) with the smallest actual error. □

6.A.41. For h = 0.1, and for the function f given in 6.A.40, use Sage to illustrate the forward and backward difference approximations of f′(0.2), together with the tangent line of f at x = 0.2. ⃝

A combination of the forward and backward differences induces the “central difference” formula, given by (f(x + h) − f(x − h))/(2h). This provides another approximation of f′(x), whose truncation error is in fact of order O(h^2), see 6.1.12.

6.A.42. For the function f introduced in 6.A.40, show that the central difference approximation of the derivative f′(0.2),

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Further rules can be added by observation of the special structure of the given functions. For example, ∫ (f′(x)/f(x)) dx = ln |f(x)| + C for all continuously differentiable functions f on intervals where they are nonzero. Of course, the rules for differentiating a sum of differentiable functions and constant multiples of differentiable functions yield analogous rules for the indefinite integral. So the sum of two indefinite integrals is the indefinite integral of the sum of the integrated functions, up to the freedom in the chosen constant, etc.

6.2.4. Integration by parts. The Leibniz rule for derivatives, (F · G)′(t) = F′(t)G(t) + F(t)G′(t), can be interpreted in the realm of the primitive functions. This observation leads to the following very useful practical procedure. It also has theoretical consequences.

Integration by parts
The formula for computing the integral on the left hand side, ∫ F(x)G′(x) dx = F(x)G(x) − ∫ F′(x)G(x) dx + C, is called integration by parts.

The above formula is useful if we can compute G and at the same time compute the integral on the right hand side. The principle is best shown on an example. Compute I = ∫ x sin x dx. In this case the choice F(x) = x, G′(x) = sin x will help. Then G(x) = − cos x, and therefore I = x(− cos x) − ∫ − cos x dx = −x cos x + sin x + C. Some integrals can be dealt with by inserting the factor 1, so that G′(x) = 1: ∫ ln x dx = ∫ 1 · ln x dx = x ln x − ∫ (1/x) x dx = x ln x − x + C.

6.2.5. Integration by substitution. Another useful procedure is derived from the chain rule for differentiating composite functions. If F′(y) = f(y), y = φ(x), where φ is a differentiable function with nonzero derivative, then dF(φ(x))/dx = F′(y) · φ′(x), and thus F(y) + C = ∫ f(y) dy can be computed as F(φ(x)) + C = ∫ f(φ(x))φ′(x) dx.

with step h = 0.01, is much better than the corresponding forward and backward difference approximations. In particular, show that this approximation coincides with f′(0.2) up to the first four decimal digits. ⃝

Calculus helps us understand various fundamental phenomena, such as the intrinsic geometry of curves and surfaces. We will start with a task that demonstrates the concept of the curvature of the graph of a function, as introduced in 6.1.16. To maintain continuity with our previous discussion (cf. 6.A.33), we recommend solving this task using implicit differentiation, though other methods are also available.

6.A.43. Determine the curvature of the ellipse x^2 + 2y^2 = 2 at its vertices. Also determine the equations of the circles of osculation at these vertices.
⃝ Let us now explore how to work with parametrized curves in both the plane and space within the context of calculus. These curves can be viewed as vector-valued functions with values in R^2 and R^3, respectively, though the concept easily generalizes to R^n (see 6.1.14 for a brief introduction to such functions, which are examples of “vector functions”). These curves have numerous applications in Newtonian mechanics, where velocity and acceleration are described by the first and second derivatives of such parametric functions with respect to the parameter t, representing time. We will start with a simple task involving the derivatives of plane curves, which is left as an easy challenge for you and can be solved using Sage as well.

6.A.44. (a) Use the chain rule to prove that any differentiable plane curve α(t) = [x(t), y(t)] satisfies dy/dx = (dy/dt)/(dx/dt), assuming that x′(t) = dx/dt ≠ 0. (b) Compute, using two alternative methods, the derivative dy/dx for the parametric curves α(t) = [t + 1, 2t^3], β(t) = [1/(1 + t^2), ln(t + 1)], where t ∈ [0, 2] in both cases. Next evaluate dy/dx at t = 0 and t = 2 (if it is defined there). (c) Use Sage to confirm your computations in (b) and then plot the curves α, β via the command parametric_plot. ⃝

6.A.45. Cycloid. Consider the plane curve α given by the parametric equations x(t) = t − sin(t), y(t) = 1 − cos(t), t ∈ [0, 2π]. This is part of the so-called “cycloid”, which is the curve traced by a point on a circle as it rolls along a straight line

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

By substituting x = φ^(−1)(y), we obtain the originally desired primitive function. This is often written as follows:

Integration by substitution
If φ(x) is differentiable with a nowhere vanishing derivative, then F(y) = ∫ f(y) dy = ∫ f(φ(x))φ′(x) dx = F(φ(x)). We talk about substituting the variable y by y = φ(x).

On the level of differentials, the substitution can be easily understood in the sense that the (linearized) increments of the variables y and x are in the mutual relation formally described by dy = φ′(x) dx, which corresponds to the relation between the integrated quantities f(y) dy = f(φ(x))φ′(x) dx. As an illustration, we verify the last but one integral in the list in 6.2.3 using this method. To compute I = ∫ 1/√(1 − x^2) dx, choose the substitution x = sin t, for all t ∈ (−π/2, π/2). Then dx = cos t dt. So I = ∫ (1/√(1 − sin^2 t)) cos t dt = ∫ (1/√(cos^2 t)) cos t dt = ∫ dt = t + C. By substituting t = arcsin x into the result, I = arcsin x + C. While substituting, the actual existence of the inverse function to y = φ(x) is required. To evaluate a definite Newton integral, it is necessary to recalculate the bounds of integration correctly. Problems with the domains of the inverse functions can sometimes be avoided by dividing the integration into several intervals. We return to this point later.

6.2.6. Integration by reduction to recurrences. Often the use of substitutions and integration by parts leads to recurrence relations, from which the desired integrals can be evaluated. We illustrate this by an example. Integrating by parts, we evaluate Im = ∫ cos^m x dx = ∫ cos^(m−1) x cos x dx = cos^(m−1) x sin x − (m − 1) ∫ cos^(m−2) x (− sin x) sin x dx = cos^(m−1) x sin x + (m − 1) ∫ cos^(m−2) x sin^2 x dx. Using the formula sin^2 x = 1 − cos^2 x, we get m Im = cos^(m−1) x sin x + (m − 1) I(m−2). The initial values are I0 = x, I1 = sin x.
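Before using the recurrence, it may be reassuring to let Sage verify it symbolically for a few small values of m; the following loop is our own quick check, and each printed difference should simplify to zero (the integration constants produced by Sage happen to agree here):
for m in [2, 3, 4, 5]:
    lhs = m*integral(cos(x)^m, x)
    rhs = cos(x)^(m-1)*sin(x) + (m-1)*integral(cos(x)^(m-2), x)
    print(m, (lhs - rhs).simplify_full())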
without slipping.² Use calculus to show that the angle θ between the tangent lines of α at the points P = [x(2π/3), y(2π/3)] ∈ α and Q = [x(4π/3), y(4π/3)] ∈ α, respectively, is such that tan(θ) = −√3.

Solution. Recall that sin(2π/3) = √3/2 = − sin(4π/3) and cos(2π/3) = −1/2 = cos(4π/3). Thus we compute P = [xP, yP] = [2π/3 − √3/2, 3/2] and Q = [xQ, yQ] = [4π/3 + √3/2, 3/2], respectively. The tangent line of α passing through P is the line ℓP(x) = yP + k(2π/3) · (x − xP), with k(2π/3) = (dy/dx)|_(t=2π/3). Similarly, the tangent line of α passing through Q is the line ℓQ(x) = yQ + k(4π/3) · (x − xQ), with k(4π/3) = (dy/dx)|_(t=4π/3). We compute dy/dx = (dy/dt)/(dx/dt) = sin(t)/(1 − cos(t)) = 1/tan(t/2) = cot(t/2), and thus it follows that k(2π/3) = √3/3 and k(4π/3) = −√3/3. Hence the angle θ = ∠PRQ, where R is the intersection point of ℓP, ℓQ, is given by tan(θ) = (k(4π/3) − k(2π/3))/(1 + k(2π/3) · k(4π/3)) = (−√3/3 − √3/3)/(1 − 3/9) = −√3. An illustration is given here: This figure and all the computations presented above can be done easily in Sage, and we leave this part as an exercise, see 6.A.46. For more details on the cycloid, see for example 6.D.17 in Section D. □

6.A.46. For the task in 6.A.45, illustrate the situation in Sage and confirm the related computations. ⃝

At the end of the next section we will return to parametric curves and learn how to compute their lengths and the areas of the regions bounded or enclosed by their graphs. Hence we will soon also learn how to deal with integrals involving parametric equations.

²When the circle has radius r, the parametric equations of the cycloid are given by [r · x(t), r · y(t)], and in our case we fixed r = 1. Be aware that the cycloid is not an ellipse.

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Integrals in which the integrated function depends on expressions of the form x^2 + 1 can be reduced to these types of integrals using the substitution x = tg t. For example, to compute Jk = ∫ dx/(x^2 + 1)^k, the latter substitution yields (notice that dx = cos^(−2) t dt) Jk = ∫ (dt/cos^2 t) (sin^2 t/cos^2 t + 1)^(−k) = ∫ cos^(2k−2) t dt. For k = 2, the result is J2 = (1/2)(cos t sin t + t) = (1/2)(tg t/(1 + tg^2 t) + t). After the reverse substitution t = arctg x, J2 = (1/2)(x/(1 + x^2) + arctg x) + C. When evaluating definite integrals, we can compute the whole recurrence after evaluating with the given bounds. For example, while integrating over the interval [0, 2π], the integrals have these values: I0 = ∫_0^(2π) dx = [x]_0^(2π) = 2π, I1 = ∫_0^(2π) cos x dx = [sin x]_0^(2π) = 0, and Im = ∫_0^(2π) cos^m x dx equals 0 for odd m and ((m−1)/m) I(m−2) for even m. Thus for even m = 2n, the result is ∫_0^(2π) cos^(2n) x dx = ((2n − 1)(2n − 3) ··· 3 · 1)/(2n(2n − 2) ··· 2) · 2π. For odd m it is zero (as could be guessed from the graph of the function cos x).

6.2.7. Integration of rational functions. The next goal is the integration of quotients of two polynomials f(x)/g(x). There are several simplifications to start with. If the degree of the polynomial f in the numerator is greater than or equal to the degree of the polynomial g in the denominator, carry out the division with remainder (see the paragraph 5.1.2). This reduces the integration to a sum of two integrals. The division provides f = q · g + h, f/g = q + h/g. Thus ∫ f(x)/g(x) dx = ∫ q dx + ∫ h(x)/g(x) dx, where the first integral is easy and the second one is again an expression of the type h(x)/g(x), but with the degree of g(x) strictly larger than the degree of h(x) (such functions are called proper rational functions).
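The division with remainder from the previous paragraph is readily available in Sage once we work in a polynomial ring; the cell below is a small sketch of ours with an arbitrarily chosen pair f, g, where quo_rem returns the quotient q and the remainder h with f = q·g + h and deg h < deg g:
R.<x> = QQ[]
f = x^3 + 2*x + 1
g = x^2 + 3*x + 2
q, h = f.quo_rem(g)
print(q, h)
Here q = x − 3 and h = 9x + 7. For the subsequent splitting of a proper rational function into simple fractions (discussed below), the symbolic method partial_fraction can be used in a fresh worksheet with the symbolic variable x; for instance, ((4*x+2)/(x^2+3*x+2)).partial_fraction(x) returns −2/(x + 1) + 6/(x + 2).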
B. Integration

In this section we will focus on tasks related to the notion of integration. As we mentioned before, the processes of differentiation and integration are inverse to each other, a result that we will confirm later in terms of the “fundamental theorem of calculus”. This theorem is, in a sense, the main link between so-called “differential calculus” and “integral calculus”, and we will have the chance to analyze many examples below. Let us begin with the notion of indefinite integrals, also called antiderivatives, which are introduced in 6.2.1 (see also 6.2.2).³ We first present a few easy examples, based on basic rules of integration (see the “Integration table” in 6.2.3).

6.B.1. Integration by heart. Using integration “by heart” (see 6.2.3), evaluate the indefinite integrals given below. Next confirm your answers via Sage.
(1) ∫ e^(−x) dx, with x ∈ R;
(2) ∫ 1/√(4 − x^2) dx, with x ∈ (−2, 2);
(3) ∫ 1/(x^2 + 3) dx, with x ∈ R;
(4) ∫ (3x^2 + 1)/(x^3 + x + 2) dx, with x ≠ −1;
(5) ∫ |x| dx, with x ∈ R.

Solution. (1) The primitive of f(x) = e^(−x) is the function F(x) = −e^(−x), since (−e^(−x))′ = e^(−x). Thus, by the definition of the indefinite integral given in 6.2.1, we have ∫ f(x) dx = F(x) + C = −e^(−x) + C, for some constant C.
(2) Use the formula ∫ dx/√(1 − x^2) = arcsin(x) + C, for some constant C ∈ R. This gives ∫ 1/√(4 − x^2) dx = ∫ 1/(2√(1 − (x/2)^2)) dx = arcsin(x/2) + C.
(3) Use the formula ∫ dx/(1 + x^2) = arctan(x) + C, for some constant C ∈ R. We have ∫ 1/(x^2 + 3) dx = (1/3) ∫ 1/(x^2/3 + 1) dx = (1/√3) ∫ (1/√3)/(1 + (x/√3)^2) dx = (1/√3) arctan(x/√3) + C.
(4) Use the formula ∫ (f′(x)/f(x)) dx = ln |f(x)| + C, with C ∈ R. Since (x^3 + x + 2)′ = 3x^2 + 1, we obtain ∫ (3x^2 + 1)/(x^3 + x + 2) dx = ln |x^3 + x + 2| + C.
(5) The function f(x) = |x| is continuous on R, hence it has a primitive function on R. This has the form x^2/2 + c1, for x ≥ 0,

³The notation for the indefinite integral was first introduced by Leibniz in 1675.

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Thus we can assume that the degree of g is strictly larger than the degree of f. We introduce the procedure to integrate proper rational functions by a simple example. Observe that we can integrate (a + x)^(−n), n > 1, and ∫ 1/(a + x) dx = ln |a + x| + C. Summing such simple fractions yields more complicated ones: −2/(x + 1) + 6/(x + 2) = (4x + 2)/(x^2 + 3x + 2), which can be integrated directly: ∫ (4x + 2)/(x^2 + 3x + 2) dx = −2 ln |x + 1| + 6 ln |x + 2| + C. This suggests looking for a procedure to express proper rational functions as a sum of simple ones. In the example, it is straightforward to compute the unknown coefficients A and B, once the roots of the denominator are known: (4x + 2)/(x^2 + 3x + 2) = (4x + 2)/((x + 1)(x + 2)) = A/(x + 1) + B/(x + 2). Multiply both sides by the polynomial x^2 + 3x + 2 from the denominator and compare the coefficients of the individual powers of x in the resulting polynomials: 4x + 2 = A(x + 2) + B(x + 1) =⇒ 2A + B = 2, A + B = 4. This procedure is called decomposition into partial fractions. It is a purely algebraic procedure based on properties of polynomials. Without loss of generality, suppose that the denominator g(x) and the numerator f(x) do not share any real or complex roots and that g(x) has exactly n distinct real roots a1, ..., an. Then the points a1, ..., an are all the discontinuities of the function f(x)/g(x). Split the expression f(x)/g(x) according to the factors of the denominator. Thus, assume g(x) is the product g(x) = p(x)q(x) of two coprime polynomials.
By the Bezout identity (see 12.3.8 on page 1082), which is a corollary of the polynomial division with remainder, there exist polynomials a(x) and b(x) of degrees strictly less than the degree of g such that a(x)p(x) + b(x)q(x) = 1. Multiplying this equality by the quotient f(x)/g(x) gives f(x)/g(x) = f(x)a(x)/q(x) + f(x)b(x)/p(x). Thus, we may restrict our attention to cases where the denominator g(x) cannot be decomposed further into two coprime polynomials. Suppose that the polynomial g(x) has only real roots. Then there is a unique decomposition into factors (x − ai)^(ni), where ni are the multiplicities of the roots ai, i = 1, ..., k. By a sequential use of the latter procedure with coprime

and −x^2/2 + c2, for x ≤ 0, for some constants c1, c2 ∈ R. It is easy to see that c1 = c2 = c ∈ R, and hence ∫ |x| dx = (1/2) x |x| + c, that is, x^2/2 + c if x ≥ 0, and −x^2/2 + c if x ≤ 0.

A powerful capability of Sage, like most of the available computer algebra programs, is its ability to integrate symbolically. In Sage we will learn to integrate functions in many different ways, and also numerically. For antiderivatives (indefinite integrals) one uses the integral function, via the syntax integral(f(x), x), which can be rewritten as f.integral(x). An alternative reads as f.integrate(x). Later we will see that the situation of definite integrals does not differ much. Recall that the indefinite integral (as an infinite set) can be represented by one specific function and its translations. Hence, one could expect that Sage will print out a function with a constant C, as in the relation F(x) = ∫ f(x) dx + C. However, Sage ignores C, and hence you should always assume that this is implicitly part of the answer. Keeping in mind these details, we are now ready to solve our task. This can be done by the cell
show(integral(e^(-x), x))
show(integral(1/(sqrt(4-x^2)), x))
show(integral(1/(x^2+3), x))
show(integral((3*x^2+1)/(x^3+x+2), x))
show(integral(abs(x)))
Check for yourself that this provides the desired answers. Often, when computing integrals via Sage, we may need to add restrictions, via the command assume. For instance, type
m=var("m"); assume(m>1)
show(integral(1/x^m, x))
Notice that without the command assume(m > 1), Sage is not able to produce a result. We will meet further such examples in the sequel. □

6.B.2. Let f be a continuous function on R with f(x) ≠ 0 for all x ∈ R. Suppose that the primitive function F of f satisfies 4F(x) = f(x) for all x ∈ R, and moreover that f(4) = 4. Find the type of f. ⃝

6.B.3. (a) Write a routine in Sage which will print out a primitive of an integrable function f. (b) Use your routine to find the primitives of some of the functions presented in the “Integration table” in 6.2.3.

Solution. This is an easy task that one can implement as follows:
function("f")(x)
def Primitive_function(f):
    F(x) = integral(f, x).factor()
    show("A primitive function of f(x)=", f, " is given by:", F(x))
    return

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

polynomials p(x) and q(x), we obtain a representation of f(x)/g(x) as a sum of fractions of the form f(x)/g(x) = r1(x)/(x − a1)^(n1) + ··· + rk(x)/(x − ak)^(nk), where the degrees of the polynomials ri(x) are strictly smaller than the degrees of the denominators. Finally, each of the summands can be represented as a sum r(x)/(x − a)^n = A1/(x − a) + A2/(x − a)^2 + ··· + An/(x − a)^n. Indeed, we multiply the equation by (x − a)^n and start comparing the coefficients from the highest powers of the polynomial r(x), computing sequentially A1, A2, ... after expanding all the products. This can be done faster by suitable additions and subtractions, starting with the highest orders. For example, (5x − 16)/(x − 2)^2 = 5 (x − 2)/(x − 2)^2 − 6/(x − 2)^2 = 5/(x − 2) + (−6)/(x − 2)^2. Finally, we have to handle the case where there are not enough real roots. There always exists a factorization of g(x) into linear factors with complex roots (see the fundamental theorem of Algebra in 1.A.6 on page 6). The non-real roots always appear in conjugate pairs, since g(z) = g(z̄) for a polynomial with real coefficients. Repeating the above procedure for ratios of complex polynomials gives the same result, but with complex coefficients. If we insist on having real expressions only, we may collect the conjugate pairs together and get quadratic factors expressed as sums of squares (x − a)^2 + b^2, and their powers. The procedure works well and guarantees that it is possible to find summands of the form (Bx + C)/((x − a)^2 + b^2)^n. As in the real roots case, there is always a corresponding decomposition into partial fractions of the form (A1x + B1)/((x − a)^2 + b^2) + ··· + (Anx + Bn)/((x − a)^2 + b^2)^n in the case of a power ((x − a)^2 + b^2)^n of such a quadratic (irreducible) factor as well. The factorization of the polynomials and the further computations might be quite time consuming. The reader could prefer to experiment with computer algebra software instead. This works well in Maple by calling the procedure convert(h, parfrac, x), which decomposes the expression h, rationally dependent on the variable x, into partial fractions. The important point is that we can already integrate all of the above partial fractions. The last mentioned ones lead to integrals discussed in example 6.2.6. In summary, the rational functions f(x)/g(x) can be integrated easily, if the corresponding decomposition of the polynomial in the denominator g(x) is known. The reality is not that simple when computing (definite) Newton integrals. Although we find the primitive functions, the problematic points are the discontinuities of rational functions, in whose
after expanding all the products. This can be done faster by suitable additions and subtractions, starting by the highest orders. For example, 5x − 16 (x − 2)2 = 5 x − 2 (x − 2)2 − 6 1 (x − 2)2 = 5 x − 2 + −6 (x − 2)2 . Finally, we have to handle the case, where there are not enough real roots. There always exists a factorization of g(x) into linear factors with complex roots (see the fundamental theorem of Algebra in 1.A.6 on page 6). The non-real roots always appear in conjugated pairs, since g(z) = g(¯z) for a polynomial with real coefficients. Repeating the above procedure for ratios of complex polynomials gives the same result, but with complex coefficients. If we insist in having real expressions only, we may collect the conjugate pairs together and get quadratic factors expressed as sums of squares (x − a)2 + b2 and their powers. The procedure works well and guarantees that it is possible to find summands in the form of Bx + C ((x − a)2 + b2)n . As in the real roots case, there is always a corresponding decomposition into partial fractions of the form A1x + B1 (x − a)2 + b2 + · · · + Anx + Bn ((x − a)2 + b2)n in the case of a power ((x − a)2 + b2 )n of such quadratic (irreducible) factor as well. The factorization of the polynomials and the further computations might be quite time consuming. The reader could prefer to experiment with computer algebra software instead. This works well in Maple by calling the procedure convert(h, parfrac, x) that decomposes the expression h rationally dependent on the variable x into partial fractions. The important point is that we can already integrate all of the above partial fractions. The last mentioned ones lead to integrals discussed in example 6.2.6. In summary, the rational functions f(x)/g(x) can be integrated easily, if the corresponding decomposition of the polynomial in the denominator g(x) is known. The reality is not that simple when computing (definite) Newton integrals. Although we find the primitive functions, the problematic points are the discontinuities of rational functions, in whose 536 (b) To test the routine it is easy: a,b=var("a, b") Primitive_function(a*x^3) Primitive_function(e^(a*x)) Primitive_function(a/x) Primitive_function(a*sin(x*b)) Primitive_function(a/(a^2+x^2)) Primitive_function(1/sqrt(a^2+x^2)) Primitive_function(x*abs(x)) Primitive_function((3*x^2)*sqrt(1+(1/x^2))) For instance, for the final case Sage’s output has the form: A primitive function of f(x) = 3x2 √ 1 + 1 x2 is given by x3 ( x2 +1 x2 )3 2 Observe that when the input f is an integrable function, but there is no function, built up of addition, subtraction, multiplication, division, roots, exponents, logarithms, trigonometric functions, and inverse trigonometric functions which will have, as its derivative the function f, then Sage will not provide an answer. Such are the functions e−x3 sin(x) , e−x3 sin(x2 ) , or the “Gaussian” (also called the “error function”) that we will meet later (this has a crucial role in statistics) However, one can still numerically compute such integrals, a situation that we will encounter later. □ Trigonometric identities are often usefull when integrating expression involving trigonometric functions. Test your skills on trigonometric identities by evaluating the following indefinite integral. 6.B.4. Compute the integral A = ∫ sin2 (x) cos2 (x) dx. ⃝ Next we present a simple application of antiderivatives, where a given “initial condition” allows us to compute explicitly the constant of integration C. 
In particular, this provides an example of solving a first-order differential equation with an initial condition (also referred to as an “initial value problem”). We will meet further such problems at the end of this section (see 6.B.64, 6.B.65).

6.B.5. An initial value problem. Determine the curve y = f(x) passing through the point P = [1, 3] with slope (2/5)x. Then use the command desolve in Sage to confirm your answer.

Solution. By assumption the slope of y = f(x) is (2/5)x, which means that f′(x) = df/dx = (2/5)x. By integration, we get f(x) = ∫ f′(x) dx = ∫ (2/5)x dx = (2/5) ∫ x dx = x^2/5 + C. To specify the constant C ∈ R, use the equation f(1) = 3, i.e., 1/5 + C = 3. This gives C = 14/5, and hence y = f(x) = (x^2 + 14)/5.

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

neighbourhood these functions are unbounded. We return to this problem later (see paragraph 6.2.14 below).

6.2.8. Riemann integral. We return to the idea of defining the integral as a tool for computing the area of the region bounded by the graph of a function and the x axis. This is our next goal. We prove that for all continuous functions on a closed bounded interval, this definition yields the same result as the Newton integral. Consider a real function f defined on the interval [a, b]. Choose a partition of this interval, along with the choice of representatives ξi of the respective parts, i.e. a = x0 < x1 < ··· < xn = b and ξi ∈ [x(i−1), xi], i = 1, ..., n. The number δ = min_i{xi − x(i−1)} is called the norm of the partition. Define the Riemann sum corresponding to the chosen partition along with the chosen representatives Ξ = (x0, ..., xn; ξ1, ..., ξn) as SΞ = ∑_(i=1)^n f(ξi) · (xi − x(i−1)).

Riemann integral⁴
Definition. The Riemann integral of the function f on the interval [a, b] exists, if for every sequence of partitions with representatives (Ξk), k = 0, 1, ..., with the norms of the partitions δk approaching zero, the limit lim_(k→∞) SΞk = S exists and its value does not depend on the choice of the sequence of partitions and their representatives. Then we write S = ∫_a^b f(x) dx.

This definition does not look very practical, but nonetheless it allows us to formulate and prove several simple properties of the Riemann integral:

⁴Bernhard Riemann (1826-1866) was an extremely influential German mathematician with many contributions to infinitesimal analysis, differential geometry, and in particular complex analysis and analytic number theory.

To solve the system {df/dx = (2/5)x, f(1) = 3} with Sage, we will use the command desolve(F, y, [a, b]), which involves the differential equation that we want to solve (this is denoted by F, and it will necessarily include the first derivative diff(y, x) of y = f(x)), the function y, which we should first introduce as a symbolic function in Sage, and the numbers a, b, which are specified by the initial condition y(a) = b. For our case this technique takes the form
y = function("y")(x)
difeq = diff(y,x) - (2/5)*x == 0
h = desolve(difeq, y, [1, 3])
show(h)
Sage's output has the desired form, i.e., (1/5)x^2 + 14/5. □

A very elementary method of integration is the so-called integration by parts (see 6.2.4). This method is appropriate for computing integrals of the following forms:
∫ P(x) a^(bx) dx, ∫ P(x) sin(bx) dx, ∫ P(x) cos(bx) dx,
∫ P(x) log_a^n(x) dx, ∫ x^b log_a^n(kx) dx,
∫ P(x) arcsin(bx) dx, ∫ P(x) arccos(bx) dx,
∫ P(x) arctan(bx) dx, ∫ P(x) arccot(bx) dx,
∫ a^(bx) sin(cx) dx, and ∫ a^(bx) cos(cx) dx,
where P is an arbitrary polynomial, a ∈ (0, 1) ∪ (1, +∞), b, c ∈ R\{0}, n ∈ N and k > 0.
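Each of the listed patterns is also within reach of Sage's symbolic integrator, which can serve as a check of the hand computations that follow; two quick probes (the integrands are our own choices) are:
show(integral(x*cos(x), x))
show(integral(log(x), x))
The outputs x sin(x) + cos(x) and x log(x) − x match the results obtained by integration by parts in 6.2.4.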
Let us now illustrate this method by a series of examples. 6.B.6. Integration by parts. Using integration by parts, evaluate the integral L = ∫ ( x2 + 1 ) e−x dx, with x ∈ R. Solution. Here we should apply integration by parts twice: L = ∫ ( x2 + 1 ) (− e−x )′ dx = − ( x2 + 1 ) e−x − ∫ ( x2 + 1 )′ (− e−x ) = − ( x2 + 1 ) e−x + 2ℓ , where ℓ = ∫ x e−x dx. Next we have ℓ = ∫ x(− e−x )′ = −x e−x − ∫ (x)′ (− e−x ) = −x e−x + ∫ e−x = −x e−x − e−x +C , for some constant C ∈ R. Thus, all together we obtain L = − e−x (x2 + 2x + 3) + C. □ 6.B.7. Using integration by parts, compute the indefinite integrals given below: (a) K = ∫ x cos(x) dx, with x ∈ R; CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Theorem. (1) Suppose f is a bounded real function defined on the interval [a, b], and c ∈ [a, b] is an inner point of this interval. Then the integral ∫ b a f(x) dx exists if and only if both of the integrals ∫ c a f(x) dx and ∫ b c f(x) dx exist. In that case ∫ b a f(x) dx = ∫ c a f(x) dx + ∫ b c f(x) dx. (2) Suppose f and g are two real functions defined on the interval [a, b], and that both of the integrals ∫ b a f(x) dx and ∫ b a g(x) dx exist. Then the integral of their sum also exists and ∫ b a (f(x) + g(x)) dx = ∫ b a f(x) dx + ∫ b a g(x) dx. (3) Suppose f is a real function defined on the interval [a, b], C ∈ R is a constant, and the integral ∫ b a f(x) dx exists. Then the integral ∫ b a Cf(x) dx also exists and ∫ b a Cf(x) dx = C ∫ b a f(x) dx. Proof. (1) First suppose that the integral over the whole interval exists. When computing it, we can limit ourselves to limits of the Riemann sums whose partitions have the point c among their partitioning points. Each such sum can be obtained as a sum of two partial Riemann sums. If these two partial sums would depend on the chosen partitions and representatives in the limit, then the total sums could not be independent on the choices in limit. (It suffices to keep the sequence of partitions of the subinterval the same, and change the other so that the limit would change). Conversely, if both Riemann integrals on both subintervals exists, they can be approximated with arbitrary precision by the Riemann sums, and moreover independently on their choice. If a partitioning point c is added to any sequence of Riemann sums over the whole interval [a, b], the value of the whole sum is changed. Also the values of the partial sums over the intervals belonging to [a, c] and [c, b] change at most by a multiple of the norm of the partition and possible differences of the bounded function f on all of [a, b]. This is a number arbitrarily close to zero for a decreasing norm of the partition. Necessarily the partial Riemann sums of the function over the two parts of the interval also converge to the limits, whose sum is the Riemann integral over [a, b]. (2) In every Riemann sum, the sum of the functions manifests as the sum of the values in the chosen representatives. Because multiplication of real numbers is distributive, each Riemann sum becomes the sum of the two Riemann sums with the same representatives for the two functions. The statement follows from the elementary properties of limits. (3) Each of the Riemann sums is multiplied by the constant C. So the claim follows from the elementary properties of limits. □ 538 (b) M = ∫ (2x − 1) ln(x) dx, with x > 0; (c) N = ∫ ex sin(βx) dx, with x, β ∈ R. ⃝ 6.B.8. 
Determine the integrals given below and next use Sage to confirm your answers: (a) ∫ x/cos^2(x) dx, with x ≠ π/2 + kπ, k ∈ Z; (b) ∫ x^2 e^(−3x) dx, with x ∈ R; (c) ∫ cos^2(x) dx, with x ∈ R. ⃝

The substitution method, introduced in 6.2.5, is another very important technique to compute integrals. The next series of exercises is based on this method. Later you will meet a more systematic illustration of the substitution method, in the integration of rational and irrational functions, but also in many trigonometric integrals.

6.B.9. Substitution method. Using a suitable substitution, determine the integral ∫ f(x) dx, where f(x) is given below. Moreover, use Sage to confirm your answer.
(1) f(x) = √(2x − 5), with x > 5/2;
(2) f(x) = (7 + ln(x))^7/x, with x > 0;
(3) f(x) = cos(x)/(1 + sin(x))^2, with x ≠ (3 + 4k)π/2, k ∈ Z;
(4) f(x) = (1 − 2x)^2025, with x < 1/2.

Solution. (1) Set t = 2x − 5, so that dt = 2 dx, i.e., dx = dt/2. Thus ∫ √(2x − 5) dx = (1/2) ∫ √t dt = t^(3/2)/3 + C = (1/3)(2x − 5)^(3/2) + C, for some constant C.
(2) Set t = 7 + ln(x), so that dt = dx/x. Then we get ∫ ((7 + ln(x))^7/x) dx = ∫ t^7 dt = t^8/8 + C = (7 + ln(x))^8/8 + C.
(3) Set t = 1 + sin(x), so that dt = cos(x) dx. Then ∫ cos(x)/(1 + sin(x))^2 dx = ∫ dt/t^2 = −1/t + C = −1/(1 + sin(x)) + C.
(4) Set t = 1 − 2x, so that dt = −2 dx, i.e., dx = −dt/2. Then ∫ (1 − 2x)^2025 dx = −(1/2) ∫ t^2025 dt = −(1/2) · t^2026/2026 + C = −(1 − 2x)^2026/4052 + C, C ∈ R.
A verification via Sage proceeds with the same method presented in 6.B.1. Hence one can type
show(integral(sqrt(2*x-5), x))
show(integral(((7+ln(x))^7)/x, x))
show(integral((cos(x)/(1+sin(x))^2), x))

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

6.2.9. The fundamental theorem. The following result is crucial for understanding the relation between the integral and the derivative. The complete proof of this theorem is somewhat longer, so it is broken into several subsections.

Fundamental theorem of integral calculus
Theorem. For every continuous function f on a finite interval [a, b] there exists its Riemann integral ∫_a^b f(x) dx. Moreover, the function F(x) given on the interval [a, b] by the Riemann integral F(x) = ∫_a^x f(t) dt is a primitive function to f on this interval.

6.2.10. Upper and lower Riemann integral. In the first step of proving the existence of the integral, we use an alternative definition, in which the choice of representatives and the corresponding values f(ξi) is replaced by the suprema Mi of the values f(x) in the corresponding subintervals [x(i−1), xi], or by the infima mi of the function f(x) in the same subintervals, respectively. We speak of upper and lower Riemann sums, respectively (in the literature, this process is also called the Darboux integral). Because the function is continuous, it is bounded on a closed interval, hence all the above considered suprema and infima exist and are finite. Then the upper Riemann sum corresponding to the partition Ξ = (x0, ..., xn) is given by the expression SΞ,sup = ∑_(i=1)^n (sup_(x(i−1)≤ξ≤xi) f(ξ)) (xi − x(i−1)) = ∑_(i=1)^n Mi (xi − x(i−1)). The lower Riemann sum is SΞ,inf = ∑_(i=1)^n (inf_(x(i−1)≤ξ≤xi) f(ξ)) (xi − x(i−1)) = ∑_(i=1)^n mi (xi − x(i−1)). For each partition Ξ = (x0, ..., xn; ξ1, ..., ξn) with representatives, there are the inequalities (1) SΞ,inf ≤ SΞ,ξ ≤ SΞ,sup. Moreover, the infima and suprema can be approximated with arbitrary precision by the actual values of terms in the sequences.
Thus, we might suspect that the Riemann integral exists if and only if, for all sequences of partitions with norms approaching zero, the limits of both the upper and lower sums exist and are equal. This is indeed true for all bounded functions:

and similarly for the final case. □

6.B.10. Using a suitable substitution, determine ∫ e^(x^3 + 4) x^2 dx, with x ∈ R. Next use Sage to confirm the answer. ⃝

6.B.11. Compute ∫ cos^5(x) sin(x) dx, with x ∈ R. ⃝

6.B.12. Compute ∫ sin^4(x)/cos^4(x) dx, with x ∈ (−π/2, π/2). ⃝

6.B.13. Use Sage to confirm the answers presented in 6.B.11 and 6.B.12, respectively. ⃝

6.B.14. Evaluate ∫ cos^5(x) sin^2(x) dx. ⃝

Combining integration by parts with the substitution method is a very common situation. Hence we will often meet examples which rely on combining both methods, such as the task presented below.

6.B.15. Evaluate the following integrals: (a) ∫ x^3 e^(−x^2) dx, with x ∈ R; (b) ∫ x arcsin(x^2) dx, with x ∈ (−1, 1); (c) ∫ e^(√x) dx, with x > 0. ⃝

6.B.16. Integration by reduction to recurrences. Let n be a non-negative integer. (a) Set In = ∫ x^n e^x dx, with I0 = e^x. Show that In = x^n e^x − n I(n−1), and next compute I1, I2, I3. (b) Set Jn = ∫ (ln(x))^n dx. Show that Jn = x(ln(x))^n − n J(n−1), and next compute J1, J2, J3.

Solution. (a) Integration by parts gives the result: In = ∫ x^n (e^x)′ dx = x^n e^x − n ∫ x^(n−1) e^x dx = x^n e^x − n I(n−1). It is easy now to compute I1, I2, I3: I1 = x e^x − e^x = e^x (x − 1), I2 = x^2 e^x − 2 e^x (x − 1) = e^x (x^2 − 2x + 2), I3 = x^3 e^x − 3 e^x (x^2 − 2x + 2) = e^x (x^3 − 3x^2 + 6x − 6). You can also prove that In = e^x pn(x), where the polynomial pn is of order n and defined recursively by pn(x) = x^n − n p(n−1)(x), p0(x) = 1. (b) This can be solved similarly and is left for practice. □

6.B.17. Let In = ∫ sin^n(x) dx with n ∈ N. Prove that In = −(1/n) sin^(n−1)(x) cos(x) + ((n − 1)/n) I(n−2). Then use this relation to compute I2 and I3. ⃝

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Theorem. Let the function f be bounded on a closed interval [a, b]. Then Ssup = inf_Ξ SΞ,sup and Sinf = sup_Ξ SΞ,inf are the limits of all sequences of upper and lower sums with norms approaching zero, respectively. The Riemann integral of the function f exists if and only if Ssup = Sinf.

Proof. First, notice that choosing two partitions Ξ1, Ξ2, there is a common refinement Ξ, and we arrive at the inequalities SΞ1,inf ≤ SΞ,inf ≤ SΞ,sup ≤ SΞ2,sup. Thus, Ssup is well defined, since it is the infimum of a set of real values bounded from below by any of the SΞ,inf. Similarly for the value Sinf, which is bounded from above by any of the SΞ,sup. Refine a partition Ξ1 to Ξ2 by adding new points. Then SΞ1,sup ≥ SΞ2,sup and SΞ1,inf ≤ SΞ2,inf. By the definition of the infimum, there are sequences of partitions Ξk for which Ssup is the limit of the sums SΞk,sup. Moreover, every two partitions have a common refinement. Thus it may be assumed that each Ξk in the sequence is a partition obtained by refining the previous one. Hence the sums SΞk,sup form a non-increasing sequence of real numbers converging to Ssup. A similar argument applies to Sinf. Hence the values Ssup = inf_Ξ SΞ,sup and Sinf = sup_Ξ SΞ,inf are good candidates for the limits of upper and lower sums. Next, consider a fixed partition Ξ with n inner partitioning points of the interval [a, b], and another partition Ξ1, whose norm is a small number δ. In the common refinement Ξ2, there will be only n intervals contributing to the sum SΞ2,sup by possibly smaller contributions than in the case of Ξ1.
Now, f is a bounded function on [a, b], and thus each of these contributions is bounded by a universal constant multiplied by the norm δ of the partition. Hence, choosing δ sufficiently small, the distance of S_{Ξ₁,sup} from S_sup will not be larger than twice the distance of S_{Ξ,sup} from S_sup. Finally, return to the sequence of partitions Ξ_k as chosen above, and choose an ε > 0. Then there is some m ∈ N such that the distance of S_{Ξk,sup} from S_sup is less than ε for all k ≥ m. Hence for an arbitrary partition Ξ with appropriately small norm δ > 0, the distance of S_{Ξ,sup} from S_sup does not exceed 2ε. In summary, for an arbitrary ε > 0 there is δ > 0 such that for all partitions with norm at most δ the inequality |S_{Ξ,sup} − S_sup| < 2ε holds. This is exactly the statement that the number S_sup is the limit of all sequences of upper sums with norms of the partitions approaching zero. The statement for lower sums is proved in exactly the same way. It remains to deal with the existence of the Riemann integral ∫_a^b f(x) dx. If S_sup = S_inf, then all Riemann sums of

An important situation in integration occurs with the integration of rational functions, see 6.2.7. The key point here is the decomposition of such a function into a sum of simple rational functions. Integrating the partial fractions corresponding to real roots of the denominator of a rational function is very easy. For instance, the substitution y = x − x₀, dy = dx, gives
∫ A/(x − x₀) dx = ∫ A/y dy = A ln|y| + C₁ = A ln|x − x₀| + C₁,
∫ A/(x − x₀)^n dx = ∫ A/y^n dy = A y^{1−n}/(1 − n) + C₂ = A/((1 − n)(x − x₀)^{n−1}) + C₂,
for some constants C₁, C₂, and for all A, x₀ ∈ R and n ∈ N with n ≥ 2. Rational functions f(x)/g(x) can be integrated easily, assuming that we know the corresponding factorization of the polynomial in the denominator g(x) in terms of its roots and their multiplicities. Next we will cover all the possible cases; to begin with, here are a few easy tasks for you.

6.B.18. Evaluate the integral ∫ R(x) dx, for R(x) = 6/(x − 2), with x ≠ 2, and for R(x) = 6/(x + 4)³, with x ≠ −4. ⃝

6.B.19. (a) Prove that ∫ dx/(ax + b)^n = 1/(a(1 − n)(ax + b)^{n−1}) + C. (b) Compute ∫ dx/(9x² + 6x + 1) by applying the formula in (a). ⃝

6.B.20. Suppose that P(x) = ax² + bx + c (a ≠ 0) is a parabola with a double root, i.e., ∆ = b² − 4ac = 0. Prove that ∫ dx/(ax² + bx + c) = −2/(2ax + b) + C. Use this formula to confirm your result in 6.B.19, (b). ⃝

6.B.21. Evaluate the integral ∫ 1/(5x² + 5x + 2) dx. Next use Sage to confirm your result.

Solution. The given parabola P(x) = 5x² + 5x + 2 has negative discriminant ∆ = −15, and hence two complex conjugate roots, given by −1/2 ± i√15/10. By completing the square we get
P(x) = 5(x² + x + 2/5) = 5((x² + x + 1/4) − 1/4 + 2/5) = 5((x + 1/2)² + 3/20) = 5((x + 1/2)² + (√(3/20))²) = 5((x + 1/2)² + (√15/10)²).

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

sequences of the partitions have the same limit because of the inequalities (1). If the Riemann integral does not exist, then there exist two sequences of partitions Ξ_k and Ξ̄_k and their representatives with different limits of Riemann sums. Suppose the first limit is larger than the other one. Then the upper Riemann sums can be selected for the first sequence and the lower Riemann sums for the second sequence. Their difference will then be at least as large. In particular, in view of the previous part of the proof, this implies S_sup > S_inf. □

6.2.11. Uniform continuity.
Until now, we have only used the continuity of the function f to know that all such functions are bounded on a closed finite interval. It remains to show that for continuous functions S_sup = S_inf. From the definition of continuity, for every point x ∈ [a, b] and every neighbourhood O_ε(f(x)) there exists a neighbourhood O_δ(x) such that f(O_δ(x)) ⊂ O_ε(f(x)). This statement can be rewritten in this way: for y, z ∈ O_δ(x), i.e. |y − z| < 2δ, it is true that f(y), f(z) ∈ O_ε(f(x)), i.e. |f(y) − f(z)| < 2ε. A global variant of such a property is needed; it is called the uniform continuity of a function f:

Uniform continuity

Definition. Let f be a function on a closed finite interval [a, b]. f is uniformly continuous on [a, b] if for every ε > 0 there exists δ > 0 such that for all z, y ∈ [a, b] satisfying |y − z| < δ, the inequality |f(y) − f(z)| < ε holds.

Theorem. Each continuous function on a finite closed interval [a, b] is uniformly continuous.

Proof. Fixing some ε > 0, the definition of continuity of f provides for each x ∈ [a, b] the values δ(x) such that f(y) ∈ O_ε(f(x)) for all y ∈ O_{2δ(x)}(x). Since every finite closed interval is compact, it is covered by finitely many of such neighbourhoods O_{δ(xi)}(x_i), determined by points

Thus ∫ 1/P(x) dx = (1/5) ∫ dx/((x + 1/2)² + (√15/10)²). Next, set u = x + 1/2 with du = dx, and use the known formula (see 6.2.3)
∫ du/(u² + δ²) = (1/δ) arctan(u/δ) + C. (♯)
This gives
(1/5) ∫ dx/((x + 1/2)² + (√15/10)²) = (1/5) ∫ du/(u² + (√15/10)²) = (1/5) · (1/(√15/10)) arctan(u/(√15/10)) + C = (2√15/15) arctan((√15/3)(2x + 1)) + C,
for some constant C. For a confirmation give the cell show(integral(1/(5*x^2 + 5*x + 2), x)). □

Integrals of a rational function Q of the form Q(x) = (a₁x + b₁)/(a₂x² + b₂x + c₂) can be solved by producing in the numerator the derivative of the denominator. Then we decompose the fraction into two rational functions, and hence our integral decomposes into two integrals which are easier to compute. Let us describe such an example.

6.B.22. Evaluate the integral A = ∫ (3x + 7)/(x² − 4x + 15) dx, with x ∈ R.

Solution. The derivative of g(x) = x² − 4x + 15 is the line 2x − 4, with x ∈ R. Let f(x) = 3x + 7, with x ∈ R, as well. We see that f(x) = (3/2)(2x − 4) + 4 · (3/2) + 7 = (3/2)(2x − 4) + 13. Thus
A = ∫ f(x)/g(x) dx = (3/2) ∫ (2x − 4)/g(x) dx + 13 ∫ dx/g(x) = α + β.
For α := (3/2) ∫ (2x − 4)/g(x) dx, by setting u = g(x) = x² − 4x + 15 we get du = (2x − 4) dx, and thus
α = (3/2) ∫ du/u = (3/2) ln(x² − 4x + 15) + C₁, C₁ ∈ R.
For the second integral β := 13 ∫ dx/g(x), by completing the square we get g(x) = (x − 2)² + 11. Thus an application of the relation (♯) appearing in the proof of 6.B.21 gives
β = 13 ∫ dx/((x − 2)² + (√11)²) = (13/√11) arctan((x − 2)/√11) + C₂
for some constant C₂. Hence altogether we get
A = (3/2) ln(x² − 4x + 15) + (13/√11) arctan((x − 2)/√11) + C. □

6.B.23. If K_n = K_n(x₀, a) = ∫ dx/((x − x₀)² + a²)^n, where a, x₀ ∈ R are fixed, prove that
K_n = (1/a²) ( ((2n − 3)/(2n − 2)) K_{n−1} + (x − x₀)/((2n − 2)((x − x₀)² + a²)^{n−1}) )

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

x₁, . . . , x_k. Choose δ as the minimum of all the (finitely many) δ(x_i). Choose any two points y, z ∈ [a, b] with |y − z| < δ; then they both belong to one of the O_{2δ(xi)}(x_i). Thus |f(y) − f(z)| ≤ |f(y) − f(x_i)| + |f(x_i) − f(z)| < 2ε, and f has the desired property. □

6.2.12. Finishing the proof of Theorem 6.2.9. Now we complete the proof of the existence of the Riemann integral. Choose ε and δ as in the definition of the uniform continuity of f.
Consider any partition Ξ with n intervals and norm at most δ. Then, writing Ji = [xi−1, xi], n∑ i=1 sup ξ∈Ji f(ξ)(xi − xi−1) − n∑ i=1 inf ξ∈Ji f(ξ)(xi − xi−1) = n∑ i=1 ( sup ξ∈Ji f(ξ) − inf ξ∈Ji f(ξ) ) (xi − xi−1) ≤ ε · (b − a). For decreasing norm of the partition, the upper and lower sums are arbitrarily close to each other. In particular the upper Riemann integral and the lower Riemann integral coincide. To complete the proof of the fundamental theorem of integral calculus, it is still needed to verify the statement about the existence of a primitive function. For a continuous function f on interval [a, b] there exists the Riemann integral F(x) = ∫ x a f(t) dt for every x ∈ [a, b]. As in the statement about uniform continuity, there is δ > 0, dependent on a fixed small ε > 0, such that |f(x + ∆x) − f(x)| < ε for all 0 ≤ ∆x < δ on the interval [a, b]. The difference of the derivative of F(x) and the integrated function f(x) is expressed by the limit of the expressions α = 1 ∆x (∫ x+∆x a f(t) dt − ∫ x a f(t) dt ) − f(x) = 1 ∆x (∫ x+∆x x f(t) dt ) − f(x) for ∆x approaching zero. Now choose 0 < ∆x < δ and replace the integrated function by the constant value f(x). Then the values f(ξ) at any point ξ ∈ [x, x + ∆x] are distant from f(x) by at most ε. Hence the Riemann integral in question cannot be different from f(x)∆x by more then ε∆x. Thus, we arrived at the following estimate: |α| = 1 ∆x (∫ x+∆x x f(t) dt ) − f(x) < ε. But that means that at the point x, the one-sided right derivative of the function F(x) exists and equals f(x). The result for the left derivative is proved in the same way, just working with the interval [x − ∆x, x]. This finishes the proof of the theorem 6.2.9. 542 for all n ̸= 1, and that K1(x0, a) = 1 a arctan (x−x0 a ) + C. Solution. We will apply integration by parts. In terms of 6.2.4 we have F(x) = 1/ ( (x − x0) 2 + a2 )n and F′ (x) = −2n (x − x0) / ( (x − x0) 2 + a2 )n+1 . If you are not sure for this differentiation, use Sage and the cell var("a, n, x0"); F(x)=1/(((x-x0)^2+a^2)**n) show(diff(F(x), x).factor()) Sage prints out the following expression which agrees with ours: −2 ( a2 + x2 − 2 xx0 + x2 0 )−n−1 n(x − x0). Moreover, G′ (x) = 1 and we may fix G(x) = (x − x0). Thus, Kn(x0, a) = ∫ F(x)G′ (x) dx = x − x0 ( (x − x0) 2 + a2 )n + 2n ∫ ( (x − x0) 2 + a2 ( (x − x0) 2 + a2 )n+1 − a2 ( (x − x0) 2 + a2 )n+1 ) dx = x − x0 ( (x − x0) 2 + a2 )n + 2n ( Kn(x0, a) − a2 Kn+1(x0, a) ) , or equivalently, Kn+1 = 1 a2 ( 2n − 1 2n Kn + 1 2n x − x0 ( (x − x0) 2 + a2 )n ) . Replacing n by n − 1 in this formula, we get the result. On the other hand, the case n = 1 follows easily by the relation (♯), used in the proof of 6.B.21. □ 6.B.24. Using the result from 6.B.23, compute the integral given below and next verify your answer via Sage: I = ∫ 30x − 77 (x2 − 6x + 13) 2 dx , x ∈ R . Solution. This provides an example of partial fractions for multiple complex roots in the form of Ax+B[ (x−x0)2 +a2 ]n , with A, B, x0 ∈ R, a > 0, n ∈ N\{0, 1}, which again can be solved by forming on the numerator the derivative of the expression (x − x0) 2 + a2 appearing in the dominator, that is, A 2 · 2(x−x0)[ (x−x0)2 +a2 ]n + (B + Ax0) · 1[ (x−x0)2 +a2 ]n . Hence one needs to compute the two induced integrals, which can be done by the methods applied above. For our problem we have I = 15 ∫ 2x − 6 (x2 − 6x + 13) 2 dx + 13 ∫ dx (x2 − 6x + 13) 2 . To compute the first integral set u = x2 − 6x + 13 with du = (2x − 6) dx. 
Then 15 ∫ 2x − 6 (x2 − 6x + 13) 2 dx = 15 ∫ du u2 = − 15 u + C1 = − 15 x2 − 6x + 13 + C1 , for some constant C1. For the second integral by completing the square one gets x2 − 6x + 13 = (x − 3) 2 + 22 . Hence 13 ∫ 1 (x2 − 6x + 13) 2 dx = 13 ∫ dx ( (x − 3) 2 + 22 )2 . CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.2.13. Important remarks. (1) Theorems 6.2.9 and 6.2.8 claim that the Riemann integral is a linear map ∫ : C[a, b] → R from the vector space of continuous functions on the interval [a, b] to real numbers. Hence it is a linear form on the vector space C[a, b]. (2) We proved that every continuous function is a derivative of some function. Hence the concepts of the Newton and Riemann integrals coincide for continuous functions. Therefore the Riemann integral of continuous functions can be computed as the difference of values F(b)−F(a) of the primitive function F. (3) In the first step of the proof of the theorem 6.2.9 we proved the important statement that for bounded functions f on finite intervals [a, b] the limits of the upper and lower sums always exist. They are called respectively the upper Riemann integral and the lower Riemann integral and they are also denoted by ∫ b a f(x) dx and ∫ b a f(x) dx. In this way we can define the Riemann integral for continuous functions as in the above proof. (4) We derived the important property of continuous functions called the uniform continuity on finite closed intervals [a, b]. Clearly every uniformly continuous function is continuous as well, but the converse is not true on open intervals. As an example, consider the function f(x) = sin(1/x) on the interval (0, 1). (5) Consider a function f on an interval [a, b], which is only piece-wise continuous. This means that f is continuous in all points c ∈ [a, b] except for finitely many discontinuities ci, a < ci < b, in which it has finite one-sided limits. Because of the additivity of the integral with respect to the interval of integration, see 6.2.8(1), the last theorem implies that in this case the integral F(x) = ∫ x a f(t)dt exists for all x ∈ [a, b] and the derivative of the function F(x) exists at all points x where f is continuous. It can be verified that F(x) is continuous at the remaining points. So it is a continuous function on the whole interval [a, b]. When evaluating the integral by primitive functions, it is necessary to choose its individual parts so that they are connected continuously at the points ci. Then the entire integral can be again computed as a difference of the function F(x) at its boundary values. (6) Lagrange’s mean value theorem for differentiable functions has an analogue which is called the integral mean value theorem. Suppose f(x) is continuous on an interval [a, b] and its primitive function is F(x). The mean value theorem claims that there exists a point c, a < c < b such that ∫ b a f(x) dx = F(b) − F(a) = F′ (c)(b − a) = f(c)(b − a). 543 We can now apply 6.B.23 with x0 = 3, a = 2 and n = 2. In particular we have 13 · K2(3, 2) = 13 ∫ dx ( (x − 3) 2 + 22 )2 = 13 22 ( 1 2 K1(3, 2) + x − 3 2 ( (x − 3)2 + 22 ) ) = 13 4 ( 1 4 arctan ( x − 3 2 ) + C2 + x − 3 2 ( (x − 3)2 + 22 ) ) = 13 16 arctan ( x − 3 2 ) + 13(x − 3) 8 ( x2 − 6x + 13 ) + C2 , for some constant C2. In total we get I = 13 16 arctan ( x − 3 2 ) + 13x − 159 8 (x2 − 6x + 13) +C , C ∈ R . A verification via Sage occurs as usual, i.e., type show(integral((30 ∗ x − 77)/(x2 − 6 ∗ x + 13)2 , x)). □ 6.B.25. Consider the rational function R(x) = x2 − x − 1 x3 + 3x2 − 16x + 12 . 
(a) Find its domain and its discontinuities, and sketch its graph. (b) Find the decomposition of R into partial fractions. (c) Evaluate the integral ∫ R(x) dx.

Solution. (a) Let us express R as R(x) = f(x)/g(x). It is obvious that g(1) = 0, and based on Horner's scheme we get the factorization g(x) = (x − 1)(x − 2)(x + 6). Using Sage one can type factor(x**3 + 3*x**2 - 16*x + 12). Hence the domain of R is the set R\{−6, 1, 2}. Notice the denominator g has three distinct real roots and none of them is a root of the numerator f. Hence all the discontinuities of R = f/g appear at the roots 1, 2 and −6 of g, as we can also see from the graph of R:

(b) Let us assume that
R(x) = (x² − x − 1)/((x − 1)(x − 2)(x + 6)) = A₁/(x − 1) + A₂/(x − 2) + A₃/(x + 6)

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

This statement can be derived directly from the definition of the Riemann integral. It can then be used in the final step of the proof of the fundamental theorem of integral calculus.

6.2.14. Improper integrals. When discussing the integration of rational functions f, there is a need to consider definite integrals over intervals where f(x) has improper (one-sided) limits. Here f is neither continuous nor bounded. Thus earlier definitions and results may not apply. We speak of "improper" integrals. A simple solution is to discuss the definite integral on a smaller sub-interval, and determine whether the limit value of such a definite integral exists when the boundary approaches the problematic point. If it does, the corresponding improper integral exists and equals this limit. We illustrate this procedure by an example:
I = ∫_0^2 dx/(2 − x)^{1/4}.
This is an improper integral, because the integrand f(x) = (2 − x)^{−1/4} has its left-sided limit ∞ at the point b = 2. The integrand is continuous at all other points. Thus, for 0 < δ < 2, consider the integrals (substituting y = 2 − x)
I_δ = ∫_0^{2−δ} dx/(2 − x)^{1/4} = ∫_δ^2 y^{−1/4} dy = [(4/3) y^{3/4}]_δ^2 = (4/3)(2^{3/4} − δ^{3/4}).
Notice that dy = −dx, and x = 2 − δ corresponds to y = δ, while x = 0 corresponds to y = 2. The limit when δ → 0 from the right clearly exists, so the improper integral is evaluated:
I = ∫_0^2 dx/(2 − x)^{1/4} = (4/3) 2^{3/4}.
We proceed in the same way to integrate over an unbounded interval. In this case, we speak of improper Riemann integrals of the first kind. The integrals of unbounded functions on finite intervals are improper Riemann integrals of the second kind. More explicitly, we can define the integrals of both kinds as follows. For a ∈ R and f defined on [a, b) and bounded on each [a, c] ⊂ [a, b),
I = ∫_a^b f(x) dx = lim_{c→b} ∫_a^c f(x) dx,
if the integrals and the limit on the right hand side exist. Here b is either finite or b = ∞. Similarly, we can have a finite fixed upper bound and −∞ ≤ a < c < b with an infinite lower bound. If both a and b are infinite, we can evaluate the integral as a sum of two integrals with a chosen fixed bound in the middle, as in
∫_{−∞}^{∞} f(x) dx = ∫_{−∞}^{a} f(x) dx + ∫_{a}^{∞} f(x) dx.

for some reals A₁, A₂, A₃ to be specified. This relation can be equivalently written (by multiplying by the denominator) as
f(x) = A₁(x − 2)(x + 6) + A₂(x − 1)(x + 6) + A₃(x − 1)(x − 2).
There are many approaches to compute A₁, A₂, A₃. For instance, we may gather together like powers of the variable and equate their coefficients. This gives a set of equations to solve for A₁, A₂, A₃. As an alternative, we can plug x = 1 into this equation, which gives A₁ = 1/7. Likewise, the substitution of x = 2 gives A₂ = 1/8, while that of x = −6 yields A₃ = 41/56.
Thus, the decomposition of R into partial fractions is given by R(x) = 1 7(x − 1) + 1 8(x − 2) + 41 56(x + 6) . (∗) Sage provides a built-in method for computing the decomposition of a rational function into partial fractions, given by the command partial_fraction. For instance, for a Sage confirmation of (∗) one may use the cell f(x)=(x^2-x-1); g(x)=(x-1)*(x-2)*(x+6) R(x)=f(x)/g(x); show(R.partial_fraction()) (c) This is easy and relies on our previous conclusion (∗), i.e., ∫ R(x) dx = 1 7 ∫ 1 x − 1 dx + 1 8 ∫ 1 x − 2 dx + 41 56 ∫ 1 x + 6 dx = 1 7 ln |x − 1| + + 1 8 ln |x − 2| + 41 56 ln |x + 6| + C , for some constant C. Adding in the previous cell the command show(integral(R, x)), Sage confirms our answer. □ 6.B.26. Evaluate the integral ∫ Q(x) dx, where Q(x) = x (x − 1) 2 (x2 + 2x + 2) , x ̸= 1 . Solution. According to the discussion in 6.2.7, we may assume that Q(x) = A x − 1 + B (x − 1)2 + Cx + D x2 + 2x + 2 , for A, B, C, D ∈ R. This can be equivalently expressed as x = A (x − 1) ( x2 + 2x + 2 ) + B ( x2 + 2x + 2 ) + (Cx + D) (x − 1) 2 . By setting x = 1 we immediately get B = 1/5. By comparing the coefficients at the same powers of the polynomials we also get A = 1 25 , C = − 1 25 , and D = − 8 25 . Hence Q(x) = 1 25(x − 1) + 1 5(x − 1)2 − x + 8 25 (x2 + 2x + 2) . (∗) An easy way to verify this fractional decomposotion is by Sage, via the function partial_fraction as in 6.B.25, i.e., CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Its existence and its value do not depend on the choice of such bound, because by changing it, we only change both summands by the same finite value, but with opposite sign. At the same time a limit for which the upper and lower bound would approach ±∞ at the same speed can lead to different results! For example ∫ a −a x dx = [ 1 2 x2 ]a −a = 0, even though the values of the integrals ∫ b a x dx with a fixed and b → ∞ diverge to infinity. (This is the typical behavior for all odd functions.) Clearly, if f is continuous on [a, b), then the Newton integral F(x) = ∫ x a f(y) dy exists for all x ∈ [a, b) and the improper Riemann integral exists if and only if limx→b F(x) exists, and its value is this limit. Thus, for continous f(x) as in our simple example above, we compute the definite Newton integral and take its limit. This is what we did. The integrated functions may have more discontinuities with infinite one-sided limits and the interval of integration may be unbounded. Then the integration intervals must be split in such a way that the individual intervals of integration include only one of the above phenomena. Hence when evaluating the improper integral of a rational function, divide the given interval according to the discontinuities of the integrated function. Then compute all the improper integrals separately. 6.2.15. Mean value of a function. For a finite set of n numbers, the mean value, or arithmetic mean, is obtained by summing the numbers and dividing by n. For a Riemann integrable function f(x) on an interval (finite or infinite) [a, b], the mean value is defined by m(f) = 1 b − a ∫ b a f(x) dx. By definition, m(f) is the altitude of the rectangle over the interval [a, b], which has the same area as that of the region between the x axis and the graph of the f(x) (counted with signs according to being above or below the x-axis). Hence the integral mean value theorem is true in general: Proposition. If f(x) is a Riemann integrable function on an interval [a, b], then there exists a number m(f) satisfying ∫ b a f(x) dx = m(f)(b − a). 6.2.16. 
Integral criterion for series. Using the improper integral, we can also decide the question of convergence for a class of infinite series whose summands are expressed as values of a function at integers:

Q(x)=x/((x-1)^2*(x^2+2*x+2))
show(Q.partial_fraction())

Based now on (∗) we can write
∫ Q(x) dx = ∫ dx/(25(x − 1)) + ∫ dx/(5(x − 1)²) − ∫ (x + 8)/(25(x² + 2x + 2)) dx = (1/25) ln|x − 1| − 1/(5(x − 1)) − (1/50) ln(x² + 2x + 2) − (7/25) arctan(x + 1) + C,
for some constant C. Here, as in 6.B.22, we get
∫ (x + 8)/(x² + 2x + 2) dx = ∫ ((1/2)(2x + 2)/(x² + 2x + 2) + 7/(x² + 2x + 2)) dx = (1/2) ∫ (2x + 2)/(x² + 2x + 2) dx + 7 ∫ dx/((x + 1)² + 1) = (1/2) ln(x² + 2x + 2) + 7 arctan(x + 1) + C. □

6.B.27. Evaluate the integral A = ∫ Q(x) dx, where Q(x) = (2x⁴ + 2x² − 5x + 1)/(x(x² − x + 1)²), x ≠ 0. Next confirm your result in Sage. ⃝

6.B.28. Evaluate ∫ 1/(e^{3x} − 2e^{2x}) dx, with x ≠ ln(2). ⃝

Let us also integrate an improper rational function f(x)/g(x), which means that the degree of the polynomial f in the numerator is greater than or equal to the degree of the polynomial g in the denominator. In such a case we should first carry out the division with remainder.

6.B.29. Evaluate ∫ (x³ + 2x² + x − 1)/(x² − x + 1) dx, with x ∈ R. Can you confirm all the steps of your solution via Sage?

Solution. The degree of the numerator f(x) = x³ + 2x² + x − 1 is greater than the degree of the denominator g(x) = x² − x + 1, hence we proceed by division of polynomials. Subtracting x · g(x) = x³ − x² + x from f(x) leaves 3x² − 1, and subtracting further 3 · g(x) = 3x² − 3x + 3 leaves the remainder 3x − 4, so
x³ + 2x² + x − 1 = (x² − x + 1)(x + 3) + (3x − 4).
This means that f(x) = g(x)(x + 3) + 3x − 4 and implies the following decomposition
f(x)/g(x) = (x + 3) + (3x − 4)/(x² − x + 1). (♭)
The division of polynomials can be successfully performed in Sage; one way goes as follows.

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Integral criterion

Theorem. Let ∑_{n=1}^∞ f(n) be a series such that the function f : R → R is positive and nonincreasing on the interval [1, ∞). Then this series converges if and only if the integral ∫_1^∞ f(x) dx converges.

Proof. If the integral is interpreted as the area of a region under the curve, the criterion is clear. Indeed, notice that the given series diverges or converges if and only if the same is true for the same series without the first summand. Moreover, by the monotonicity of f(x), there are the estimates
s̃_k = ∑_{n=2}^{k} f(n) ≤ ∫_1^k f(x) dx ≤ ∑_{n=1}^{k−1} f(n) = s_{k−1},
because s̃_k is a lower sum of the Riemann integral while s_{k−1} is an upper sum. Thus, the integral converges if and only if the series does, as expected. □

6.2.17. New acquisitions to the ZOO. As a matter of fact, primitive functions (Newton integrals) are very rarely described in terms of known elementary functions. Indeed, nearly all continuous functions lead to integrals which we cannot express in this way. Functions obtained by integration often appear in applications. Many of them have names, and there are efficient methods to approximate them numerically (we shall come to this point briefly in 6.2.23 below). Such functions either appear as primitive functions, or they are given by definite integrals depending on further parameter(s). We present just a few (important) examples now. In the methods of signal processing, a very important function is the so-called sinc function:
(1) sinc(x) = sin(x)/x.
We shall meet the sinc function in our discussion of the Fourier transform in 7.2.6. Check yourself that it is a smooth function with limit values f(0) = 1, f′′(0) = −1/3, . . .
, f(2k) (0) = (−1)k 1 2k + 1 , whilst all f(2k+1) (0) vanish. This is easily seen from the Taylor expansion of the sine function (try to verify this by computing the derivates directly - a good excercise on multiple use of L’Hospital even for very small orderes). The even function sinc has its absolute maximum at the point x = 0 and many local maxima and minima, with inflexions between them. It oscillates with a fast decreasing amplitude as x approaches infinity. The x-axis is the asymptot for both infinities. 546 f(x)=x^3+2*x^2 + x - 1 ; g(x)=x^2-x+1 show(f.maxima_methods().divide(g)) Sage’s output has the form [x + 3, 3x − 4], where the first expression encodes the quotient and the second one the reminder. One may like to certify Sage’s answer by adding the syntax bool(g(x) ∗ (x + 3) + 3 ∗ x − 4 == f(x)), which returns True. Based now on (♭) one has ∫ f(x) g(x) dx = ∫ (x + 3) dx + ∫ 3x − 4 x2 − x + 1 = x2 2 + 3x + 3 2 ln ( x2 − x + 1 ) − 5 √ 3 arctan ( 2x − 1 √ 3 ) + C . Here, to compute the integral ℓ = ∫ 3x − 4 x2 − x + 1 dx you can apply the same method as 6.B.22 (since g(x) has only complex roots). In particular, ℓ = 3 2 ∫ 2x − 1 x2 − x + 1 dx − 5 2 ∫ dx ( x − 1 2 )2 + (√ 3 2 )2 = 3 2 ln ( x2 − x + 1 ) − 5 √ 3 arctan ( 2x − 1 √ 3 ) + c , for some constant c. This agrees with Sage’s output for the command integral((3 ∗ x − 4)/(x2 − x + 1), x). □ Many integrals may initially appear complicated. In such cases, the power of the substitution method becomes evident. For instance, the substitution method often allows us to transform irrational functions into rational ones, which are easier to integrate. The tasks below focus on integrating irrational functions and serve as preparation for the material covered in the final section of this chapter, where more space is dedicated to this topic. But first, let us describe a few useful hints. • Hint 1: This is about integrals of the form ∫ f ( x, p1 √ x, p2 √ x, . . . , pk √ x ) dx for certain numbers p1, p2, . . . , pk ∈ N and a rational function f. In this case we set n √ x = t, i.e., tn = x, where n is the least common multiple of p1, . . . , pk. This substitution reduces the integrated function (integrand) to a rational function, which we can always integrate. When instead of the expressions pj √ x we have pj √ ax + b for all j = 1, . . . , k, where a, b ∈ R, then set tn = ax + b, where n occurs in the same vein as above. • Hint 2: This is about integrals of the type ∫ f ( x, p1 √ ax + b cx + d , p2 √ ax + b cx + d , . . . , pk √ ax + b cx + d ) dx, where the values a, b, c, d ∈ R are such that ad − bc ̸= 0. In this case we set tn = ax+b cx+d , where again n is the least CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Gamma, Si, and more The sine integral function is defined by (2) Si(x) = ∫ x 0 sinc(t) dt. Other important functions are Fresnel’s sine and cosine in- tegrals FresnelS(x) = ∫ x 0 sin (1 2 πt2 ) dt(3) FresnelC(x) = ∫ x 0 cos (1 2 πt2 ) dt.(4) One of the most important mathematical functions ever is the Gamma function. It is defined for all positive real numbers z by (5) Γ(z) = ∫ ∞ 0 e−t tz−1 dt. For all integers n ≥ 0, n! = Γ(n + 1). The function Si(x) is shown in the left figure. Both Fresnel’s functions are shown on the right. The Gamma function is defined via an improper integral dependent on the parameter z. We shall provide some theory for such functions in the next part of this chapter. In particular, it can be proved that this function is analytic at all points 0 < z ∈ R. 
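As a quick numerical cross-check (a sketch of our own, not needed for the theory), one can compare the defining improper integral with Sage's built-in gamma at a few sample points z:

var("t")
for z in [1, 2, 3, 1/2]:       # ad hoc sample values; 1/2 anticipates Γ(1/2) = √π later in this chapter
    approx = numerical_integral(e^(-t)*t^(z-1), 0, +Infinity)[0]
    print(z, approx, gamma(z).n())

The printed pairs agree to the accuracy of the numerical quadrature.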
For small z ∈ N, we can evaluate: Γ(1) = ∫ ∞ 0 e−t t0 dt = [− e−t ]∞ 0 = 1 Γ(2) = ∫ ∞ 0 e−t t1 dt = [− e−t t]∞ 0 + ∫ ∞ 0 e−t dt = 1 Γ(3) = ∫ ∞ 0 e−t t2 dt = 0 + 2 ∫ ∞ 0 e−t t dt = 0 + 2 = 2. Integration by parts reveals immediately Γ(z + 1) = zΓ(z). Hence for all positive integers n this function yields the value of the factorial: Γ(n) = (n − 1)!. The following figure shows the behaviour of the functions f1(x) = ln(Γ(x + 1)), f2(x) = ln(Γ(x)) together with the function x ln x−x+1 (the dashed one, the dots are the actual values of ln(n!)). 547 common multiple of p1, . . . , pk. This converts the integrand to a rational function, as well. 6.B.30. Determine the integrals ∫ f(x) dx, where (a) f(x) = 1 √ x3 + 5 √ x7 , with x > 0; (b) f(x) = x + 1 3 √ 3x + 1 , with x ̸= −1 3 ; (c) f(x) = 1 x √ x + 1 x − 1 , with x ∈ R\[−1, 1]. Solution. (a) For all x > 0 we have f(x) = 1 √ x3 + 5 √ x7 = 1 √ x2 · x + 5 √ x5 · x2 = 1 x( √ x + 5 √ x2) . The least common multiply of 2, 5 is 10. Hence, according to the first hint above to compute ∫ f(x) dx we set 10 √ x = t, that is, t10 = x with 10t9 dt = dx. Then it is easy to see that √ x = t5 and 5 √ x2 = t4 , thus ∫ f(x) dx = ∫ dx x (√ x + 5 √ x2 ) = ∫ 10t9 t10 (t5 + t4) dt = 10 ∫ dt t6 + t5 . Now we see that the function Q(t) = 1 t6+t5 = 1 t5(1+t) admits the following decomposition into simple fractions: Q(t) = − 1 t + 1 + 1 t − 1 t2 + 1 t3 − 1 t4 + 1 t5 . You can verify this expression by Sage and the command var("t"); show((1/(t^6+t^5)).partial_fraction() Thus 10 ∫ Q(t) dt = 10 ∫ ( − 1 t + 1 + 1 t − 1 t2 + 1 t3 − 1 t4 + 1 t5 ) dt = 10 ( − ln(t + 1) + ln(t) + 1 t − 1 2t2 + 1 3t3 − 1 4t4 ) + C = ln x (1 + 10 √ x)10 + 10 10 √ x − 5 5 √ x + 10 3 10 √ x3 − 5 2 5 √ x2 + C for some constant C. (b) Set t = 3 √ 3x + 1, i.e., t3 = 3x + 1 and dx = t2 dt. Then ∫ x + 1 3 √ 3x + 1 dx = ∫ t3 −1 3 + 1 t t2 dt = ∫ t3 − 1 + 3 3 t dt = 1 3 ∫ ( t4 + 2t ) dt = 1 3 ( t5 5 + t2 ) + C = 1 15 (3 x + 1) 5 3 + 1 3 (3 x + 1) 2 3 + C . Remember always to use Sage to verify your computations. For instance, one may find difficult to simplify the result of the back substitution t = 3 √ 3x + 1 in the expression 1 3 ( t5 5 + t2 ) . There are many different ways to implement this, and here we present one based on the function subs. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS The relatively good approximation should not be a surprise, because we may clearly write ln(n!) = ∑n k=1 ln k and the latter expression can be approximated by the integral∫ n 1 ln x = n ln n − n + 1. Hence we came close to a very important and famous formula approximating n! by nn e1−n : Stirling’s formula The asymptotic estimate for the factorial function is ln n! = n ln n − n + 1 + O(ln n) There is the famous Stirling’s approximation formula making this estimate much more precise (6) √ 2πnn+ 1 2 e−n ≤ n! ≤ e nn+ 1 2 e−n . The lower approximation is so good, that the function√ 2πxx+ 1 2 e−x would completely overlap with the black one on the picture above. We shall not provide the proof of the Stirling formula here. In order to understand the qualitative behavior of such functions (e.g. their differentiability etc.) we need to understand the limit processes much better. Before diving into this in the next section, we introduce several direct applications of the Riemann integral. 6.2.18. Riemann measurable sets. The definition of the Riemann integral is motivated by the concept of the area of rectangles in the plane with coordinates x and y. 
The definite integral ∫ b a is designed to correspond to the area of the region bounded by the x axis, the values of the function y = f(x) and boundary lines x = a, x = b. Moreover, the area of the region above the x axis is given with a positive sign, while values under the axis lead to a negative sign. From geometry, the length of an interval on the real line, and the area of a parallelogram determined by two vectors in the plane are basic concepts. This extends to the area of a parallelepiped in Euclidean vector space Rn . The areas/volumes of other subsets are yet to be defined. For some simple objects like triangles, polygons and polyhedrons, their area is given naturally by the generally expected properties of area 548 var("t"); a(t)=(1/3)*((t^5/5) + t^2) show(a.subs(t=(3*x+1)^(1/3))) As for the given integral, just type show(integral((x+1)/(3*x+1)^(1/3), x)) Observe that in our result there are still computations that can be done to arrive to a more solid expression, as 1 5 (3 x + 1) 2 3 (x + 2). The same can be done in Sage, by adding in the last two show commands the function factor, that is, show(a.subs(t = (3 ∗ x + 1)( 1/3)).factor()) and show(integral((x + 1)/(3 ∗ x + 1)( 1/3), x).factor()), respectively. Check yourself Sage’s outputs. (c) In this case set t = √ x+1 x−1 , thus t2 = x+1 x−1 and x = t2 +1 t2−1 . Moreover, it is easy to see that dx = −4t (t2−1)2 dt. This substitution will convert the initial integrand to a rational functional, namely ∫ 1 x √ x + 1 x − 1 dx = ∫ t2 − 1 t2 + 1 −4t2 (t2 − 1)2 dt = − ∫ 4t2 (t2 + 1)(t2 − 1) dt . If Q(t) = − 4t2 (t2 + 1)(t2 − 1) then we see that Q(t) = −2 2t2 (t2 + 1)(t2 − 1) = −2 ((t2 + 1) + (t2 − 1) (t2 + 1)(t2 − 1) ) = −2 ( 1 t2 + 1 + 1 t2 − 1 ) . However, 1 t2 − 1 = 1 2 ( 1 t − 1 − 1 t + 1 ) , and in total we get Q(t) = − 2 t2 + 1 − 1 t − 1 + 1 t + 1 . One can recover this expression in Sage by the cell var("t") (-(4*t^2)/((t^2-1)*(t^2+1))).partial_fraction() Thus now we can rewrite our integral as ∫ 1 x √ x + 1 x − 1 dx = ∫ ( 1 t + 1 − 1 t − 1 − 2 t2 + 1 ) dt = ln | t + 1 | − ln | t − 1 | − 2 arctan(t) + C = ln t + 1 t − 1 − 2 arctan(t) + C = ln √ x+1 x−1 + 1 √ x+1 x−1 − 1 − 2 arctan (√ x + 1 x − 1 ) + C . For a Sage confirmation type the usual show(integral((1/x)*sqrt((x+1)/(x-1)), x)) □ CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS (invariance with respect to Euclidean motions and additivity with respect to finite union of disjoint objects). We shall come to conceptual definitions in the next two chapters, mainly via generalizing the Riemann integral. Now we can start with several answers based on the univariate calculus. We start with the question how to measure the “volume” of one-dimensional subsets. We say that the subset A ⊂ R is (Riemann) measurable, if the function χ : R → R χA(x) = { 1 if x ∈ A 0 if x /∈ A. is Riemann integrable. That is, the (improper) integral m(A) = ∫ ∞ −∞ χA(x) dx exists (the finiteness of its value doesn’t matter). The function χA is called the characteristic function of the set A, the value m(A) is called the Riemann measure of the set A. Notice that for an interval A = [a, b] this yields the value ∫ ∞ ∞ χA(x) dx = ∫ b a dx = b − a, just as expected. The elementary properties of the Riemann integral imply that this definition of “size” has reasonable properties. The measure of a union of finitely many measurable pairwise disjoint subsets is the sum of their measures. In particular, every finite set A has zero measure. If instead we choose a countable union, this property is no longer true. 
For example, consider the set Q of all rational numbers as the union of one-element subsets. While every set containing only finitely many points has a zero measure by our definition, the characteristic function χQ is not Riemann integrable over any finite interval [a, b]. This is an essential flaw of the Riemann approach and we shall comment on how to improve it in the end of this chapter. Notice, the upper Riemann integral of the characteristic set χA corresponds to the infimum of the sums of lengths of finitely many intervals, by which we can cover the given set A. The lower integral is the supremum of the sums of lengths of finitely many disjoint intervals that can be embedded into the set A. We shall proceed in the same way in higher dimensions when defining the Jordan measure. For now, just remark that the area of a plane figure bounded by a graph of a function in the way described above is consistent with expectations and we are going to deduce some straightforward consequences for special higher dimensional concepts. 6.2.19. Length of a curve. The Riemann integral can be effectively used to compute the length of a curve in multidimensional Euclidean vector space Rn . For the sake of simplicity, we deal with a curve in R2 with coordinates x, y. Suppose a parametric description of a curve F : R → R2 , F(t) = [f(t), g(t)] 549 Recall from 6.2.2 that for an arbitrary function f that is continuous and bounded on a bounded interval (a, b), it holds the so called Newton-Leibniz formula, given by b∫ a f(x) dx = [F(x)] b a := lim x→b− F(x) − lim x→a+ F(x) . (⋆) Here, as usual F′ (x) = f(x) is the primitive function of f, with x ∈ (a, b). Under the given conditions, the function F always exists and so do both the proper limits appearing in (⋆). Therefore, to compute the definite integral one just needs to find the antiderivative and determine the respective one-sided limits. We also recall that the computation of definite integrals via Sage is simple, and relies on the command integral(f, x, a, b), or f.integral(x, a, b), see below for examples. 6.B.31. Compute π 3∫ π 6 tan2 (x) dx and π 4∫ 0 x cos2(x) dx. ⃝ 6.B.32. Compute the integrals given below by hand, and next present a confirmation by Sage. (a) ∫ 1 0 x √ 1 − x2 dx , (b) ∫ 2 1 1 √ x2 − 1 dx . ⃝ 6.B.33. Compute the integral ∫ π π/4 sin(3x) cos(x) dx in Sage. Next provide a formal computation. ⃝ 6.B.34. Prove the following inequalities: (a) √ 2 20 ≤ 1∫ 0 x9 √ 1 + x dx ≤ 1 10 ; (b) 1 < ∫ π/2 0 sin(x) x dx < π 2 . ⃝ 6.B.35. Let f : [−a, a] → R be a continuous function, a ∈ R. (a) If f is even show that ∫ a −a f(x) dx = 2 ∫ a 0 f(x) dx. (b) If f is odd show that ∫ a −a f(x) dx = 0. ⃝ 6.B.36. Let f : R → R be a continuous function which is periodic with period T > 0. (1) Show that the integral ∫ a+T a f(x) dx has the same value for every real number a, in particular, ∫ a+T a f(x) dx = ∫ T 0 f(x) dx . (2) More in general, show that ∫ a+nT a f(x) dx = n ∫ T 0 f(x) dx , CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS is given. Look at it as a trajectory of a movement. Assume that f(t) and g(t) have piece-wise continuous derivatives. By differentiating the map F(t) we obtain vectors corresponding to the speed of the movement along this trajectory. Hence the total length of the curve (i.e. distance traveled over time between the values t = a, t = b) is given by the integral over the interval [a, b], with the integrated function h(t) being the length of the vectors F′ (t). 
Therefore the length s is given by the formula
s = ∫_a^b h(t) dt = ∫_a^b √((f′(t))² + (g′(t))²) dt.
The result can be seen intuitively as a corollary of Pythagoras' theorem: the linear increment ∆s of the length of the curve corresponding to the increment ∆t of the variable t is given by the proportions in the right triangle, and thus at the level of differentials ds = √((f′(t))² + (g′(t))²) dt. In the special case when the curve is the graph of a function y = f(x) between points a < b, we obtain
s = ∫_a^b √(1 + (f′(x))²) dx,
and at the level of differentials, ds = √(1 + (y′(x))²) dx, just as expected.

As an example, we calculate the circumference of the unit circle as twice the integral of the function y = √(1 − x²) over [−1, 1]. We know that the result is 2π, because π is defined in this way.
s = 2 ∫_{−1}^{1} √(1 + (y′)²) dx = 2 ∫_{−1}^{1} √(1 + x²/(1 − x²)) dx = 2 ∫_{−1}^{1} dx/√(1 − x²) = 2 [arcsin x]_{−1}^{1} = 2π.
If we instead use y = √(r² − x²) = r√(1 − (x/r)²) and the bounds [−r, r] in the previous calculation, by substituting x = rt we obtain the circumference of the circle with radius r:
s(r) = 2 ∫_{−r}^{r} √(1 + (x/r)²/(1 − (x/r)²)) dx = 2 ∫_{−1}^{1} r/√(1 − t²) dt = 2r [arcsin t]_{−1}^{1} = 2πr.
The result is of course well known from elementary geometry. Nevertheless, by using integral calculus, we derive the important fact that the length of a circle is linearly dependent on its diameter 2r. The number π is exactly the ratio appearing in this dependency.

for any a ∈ R and n ∈ Z. ⃝

6.B.37. Use 6.B.36 to compute the following integrals: (a) ∫_{π/6}^{π/6+2π} |sin(x)| dx; (b) ∫_{2π}^{4π} |sin(x)| dx.

Solution. (a) The function f(x) = |sin(x)| is periodic with period T = π > 0. To confirm this claim in Sage use the bool command, as usual, i.e.,

f(x)=abs(sin(x)); bool(f(x+pi)==f(x))

The first integral corresponds to the grey area in the following figure:

An application of the formula given in the second part of 6.B.36 gives
∫_{π/6}^{π/6+2π} |sin(x)| dx = ∫_0^{2π} |sin(x)| dx = 2 ∫_0^π |sin(x)| dx = 2 ∫_0^π sin(x) dx = 2 [−cos(x)]_0^π = 2 · 2 = 4.
To confirm this computation in Sage add in the previous cell the syntax integral(f(x), x, pi/6, 2*pi + pi/6).

(b) Since ∫_0^π |sin(x)| dx = 2, the second integral is also 4, see the figure given for this case:

Here is a formal computation:
∫_{2π}^{4π} |sin(x)| dx = ∫_0^{4π} |sin(x)| dx − ∫_0^{2π} |sin(x)| dx = 4 ∫_0^π |sin(x)| dx − 2 ∫_0^π |sin(x)| dx = (4 − 2) ∫_0^π |sin(x)| dx = 2 · 2 = 4.
Once more, to confirm the result via Sage just type integral(abs(sin(x)), x, 2*pi, 4*pi). □

According to the fundamental theorem of integral calculus, whenever f : A ⊆ R → R is a continuous function and

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

6.2.20. Areas and volumes. The Riemann integral can be used to compute areas or volumes of shapes defined by a graph of a function. As an example, calculate the area of a circle with radius r. The quarter-circle bounded by the function √(r² − x²) for 0 ≤ x ≤ r determines one quarter of the area. Use the substitution x = r sin t, dx = r cos t dt (cf. the corollary for I₂ in paragraph 6.2.6) to obtain, by symmetry,
a(r) = 4 ∫_0^r √(r² − x²) dx = 4r² ∫_0^{π/2} cos² t dt = 4r² ∫_0^{π/2} sin² t dt = (1/2) · 4r² ∫_0^{π/2} (cos² t + sin² t) dt = 2r² ∫_0^{π/2} dt = πr².
It is worth noticing that this well-known formula is derived from the principles of integral calculus. The area of a circle is not only proportional to the square of the radius; this proportion is again given by the constant π. Notice the ratio of the area to the perimeter of a circle: πr²/(2πr) = r/2.
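The computation can also be verified numerically in Sage; the following lines are a small illustrative sketch of our own (with the ad hoc choice r = 1), comparing four times the quarter-circle integral with πr²:

r = 1                                                   # illustrative radius
quarter = numerical_integral(sqrt(r^2 - x^2), 0, r)[0]  # area under the quarter-circle
print(4*quarter, (pi*r^2).n())                          # both approximately 3.14159...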
The square with the same area has sides of length √π r, and therefore its perimeter is 4√π r. Hence the perimeter of a square with the area of the unit circle is 4√π, compared to the perimeter 2π of the unit circle, which is about 0.8 less. It can be shown that in fact the circle is the shape with the smallest perimeter among all shapes with the same area. We derive such results in the comments about the calculus of variations in chapter 9.

Another analogy of this approach is the computation of the volume or the surface area of a solid of revolution. Such a set in R³ is defined by plotting the graph of a function y = f(x) (for x in an interval [a, b]) in the plane xy and rotating this plane around the x axis. This is exactly what happens when producing pottery on a jigger – the hands shape the clay in the form of y = f(x). When computing the area of the surface, an increment ∆x causes the area to increase by the product of the length ∆s of the curve given by the graph of the function y = f(x) and the circumference of the circle with radius f(x). Hence the surface area A(f) is computed by the formula
A(f) = 2π ∫_a^b f(x) ds = 2π ∫_a^b f(x) √(1 + (f′(x))²) dx,
where the differential ds is given by the increment of the length of the curve y = f(x), see above. If instead we determine the solid of revolution by its boundary parametrized in the xy plane by a pair of functions [x(t), y(t)], then the corresponding differential of the length s has the form ds = √((x′(t))² + (y′(t))²) dt. Thus we obtain
A = 2π ∫_a^b y(t) √((y′(t))² + (x′(t))²) dt.

a ∈ A, the function
F(x) = ∫_a^x f(t) dt, x ∈ A,
satisfies F′(x) = f(x) for any x ∈ A. Therefore F is a primitive of f on A. This function plays a key role in integral calculus and in many cases can be used to simplify our computations. The next series of tasks is based on this scheme and will help us to master the fundamental theorem of calculus. Further similar exercises are presented in Section D.

6.B.38. (a) Find the derivative of the function F(x) = ∫_0^x t⁵ ln(t + 1) dt, with x ∈ (−1, 1). (b) If F(x) = ∫_1^x (t² sin(t) + 4 cos(4t)) dt, with x > 0, show that lim_{x→∞} g(x) = 1, where g(x) := F′(x)/(x² + 2). ⃝

6.B.39. Local extremes. Find the local extremes of the function F(x) = ∫_0^x (sin(t)/t) dt, with x > 0.

Solution. For all x > 0 we have F′(x) = sin(x)/x. Thus the critical points of F, i.e., the solutions of the equation F′(x) = 0, have the form x = kπ for k ∈ Z₊, and they all provide local extremes of F. In particular, the second derivative of F has the form
F′′(x) = (sin(x)/x)′ = (x cos(x) − sin(x))/x², x > 0,
and we see that F′′(kπ) = kπ cos(kπ)/(kπ)² = (−1)^k/(kπ). Thus
• F′′(kπ) < 0 if k is odd ⟹ F attains a local maximum at (2m + 1)π, with m ∈ N.
• F′′(kπ) > 0 if k is even ⟹ F attains a local minimum at (2m)π, with m ∈ Z₊. □

6.B.40. Find the third-order Taylor expansion of the function f(x) = ∫_0^x 1/(1 + t²) dt, x ∈ R, around the point a = 0, both by hand and by Sage.

Solution. Recall from 6.A.8 that we want to determine the polynomial
T^3_0 f(x) = f(0) + f′(0) x + (f′′(0)/2) x² + (f^(3)(0)/6) x³. (♭)
Obviously, f(0) = 0 and
f′(x) = 1/(1 + x²) ⟹ f′(0) = 1,
f′′(x) = (1/(1 + x²))′ = −2x/(1 + x²)² ⟹ f′′(0) = 0,
f^(3)(x) = (−2x/(1 + x²)²)′ = 2(3x² − 1)/(1 + x²)³ ⟹ f^(3)(0) = −2.

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

When computing the volume of the same solid, the increase ∆x causes the volume to increase by a multiple of this increment and the area of the circle with radius f(x).
Hence it is given by the formula V (f) = π ∫ b a (f(x))2 dx. As an example of using the formulas for surface and volume, we derive the well known formulas for the surface of the sphere and volume of the ball with diameter r. Ar = 2π ∫ r −r r √ 1 − (x/r)2 1 √ 1 − (x/r)2 dt = 2πr ∫ r −r dt = 4πr2 Vr = π ∫ r −r (r2 − x2 ) dx = π [ r2 x − 1 3 x3 ]r −r = 4 3 πr3 . Similarly to the circle, the ball is also the object which has the smallest surface among all with the given volume. That is the reason why small soap bubbles almost always assume this shape. 6.2.21. Differential equations. Theorem 6.2.9 can be understood in the following way. Given a continuous function f(x) on a bounded interval [a, b], the set of all functions y of one variable x satisfying the equality y′ = f(x) is given by the formula y(x) = ∫ x a f(t) dt + C with the constant C = y(a). This is the simplest instance of differential equations. More generally, ordinary differential equations of first order are given as y′ = f(x, y), where f(x, y) depends on two unknown variables x and y. A solution to this equation is a function y = y(x), such that the equality is true upon substitution. Similarly, dependence on higher derivatives of y may be included. We return to this concept in Chapter 8, see 8.3.2. For the present, we discuss one very special type of equations with separated variables y′ = f(x)g(y) and add a few observations concerning analytic solutions. Rewrite the equation in terms of the differentials, cf. 6.1.11, 1 g(y) dy = f(x) dx. Find the primitive functions on both sides to determine the unknown function y = y(x) implicitly. 552 Thus, according to (♭) we get T3 0 f(x) = x − x3 3 , x ∈ R. In Sage, as in 6.A.9 one can use the command taylor, though now the variable t should be included as a symbolic variable and assume that x > 0: var("x, t"); assume(x>0) f(x)=integral(1/(1+t^2), t, 0, x) T(x)=taylor(f(x), x, 0, 3); show(T(x)) Sage’s output has the form − 1 3 x3 + x. □ 6.B.41. Ler f : R → R be a differentiable function on R, having continuous second derivative everywhere on R and a local extremum at x0 = 1. If for some constants α, β ∈ R with α ̸= 1 we have the relation ∫ 1 0 ( x f′′ (x) + α f′ (x) ) dx = β (∗) and the graph Cf of f passes through the point P = [0, β] ∈ R2 , compute f(1) in terms of α and β. Solution. By assumption, f is differentiable everywhere on R and hence also at x0 = 1, and moreover attains a local extremum at x0, so we have f′ (x0) = f′ (1) = 0. We also have f(0) = β, since P belongs to the graph of f. Thus, if A is the left-hand-side of (∗), by applying integration by parts we get A = ∫ 1 0 x f′′ (x) dx + α ∫ 1 0 f′ (x) dx = [ xf′ (x) ]1 0 − ∫ 1 0 x′ f′ (x) + a [ f(x) ]1 0 = f′ (1) − [ f(x) ]1 0 + a [ f(x) ]1 0 = 0 − (f(1) − f(0)) + a(f(1) − f(0)) = (α − 1)(f(1) − f(0)) = (α − 1)(f(1) − β) . Thus we should have (α − 1)(f(1) − β) = β from where we get f(1) = β α−1 + β = α β α−1 . □ 6.B.42. Consider the function F(x) = x ∫ x 1 f(t) dt + (1 − x) ∫ x 0 f(t) dt with x ∈ [0, 1], where f : [0, 1] → R is continuous function on R with f(0)f(1) ̸= 0. Show that there exists ξ ∈ (0, 1) with f(ξ) = ∫ 1 0 f(t) dt . (∗) Solution. The function F is differentiable on [0, 1] and it easy to see that it satisfies F(0) = 0 = F(1). Therefore, by Rolle’s theorem there exists ξ ∈ (0, 1) with F′ (ξ) = 0. We will show that ξ satisfies (∗). In particular, a combination of the product rule with the fundamental theorem of calculus CHAPTER 6. 
DIFFERENTIAL AND INTEGRAL CALCULUS Indeed, if G(y) and F(x) are the primitive functions with G′ (y) = 1 g(y) and F′ (x) = f(x), and y(x) satisfies G(y(x)) = F(x), then differentiating both sides with respect to x yields 0 = G′ (y(x))y′ (x) − F′ (x) = y′ (x) g(y(x)) − f(x) as expected. Of course, it is necessary to be careful with the values y for which g(y) = 0, which need to be discussed separately. For example, the equation y′ = y leads to the implicit definition ln |y| = x + C, which for positive y provides y(x) = D ex with positive constant D, the constant solution y = 0 corresponds to D = 0. Negative values of y correspond to negative constants D in the same expression. If y(0) = 1, we recover the exponential y(x) = ex . 6.2.22. Analytic solutions. As we know, the power series are differentiated and integrated term by term, thus the solution y(x) to the equation y′ = f(x) with a known analytic function f(x) = ∑∞ n=0 anxn must be y(x) = ∞∑ n=0 1 n+1 anxn+1 + y(0), where y(0) is the free integration constant. The solution is defined on the covergence domain of the power series (which has to be same as that of f by the lim sup formula). Of course we might use series centered in other points x0 if prescribing the initial value y(x0). (We shall prove much later, that actually there is always the unique solution with the given initial prescribed value y(0) in Chapter 8.) The latter equation y′ = y had the analytic solution ex , too. Let us consider the general case of this type, i.e. equations of the form (1) y′ = f(y) with an analytic right-hand side f(y). Given the initial condition y(x0) = y0, straightforward differentiation with the help of the chain rule and the equation (1) shows y′ (x0) = f(y0) y′′ (x0) = f′ (y)y′ |x=x0 = f′ (y)f(y)|x=x0 = f′ (y0)f(y0) y′′′ (x0) = ( f′′ (y)y′ f(y) + f′ (y)f′ (y)y′ ) |x=x0 = f′′ (y0)(f(y0))2 + (f′ (y0))2 f(y0) ... Two crucial observations are due here. First, giving the initial condition y(x0) = y0, all derivatives y(k) (x0) are given at this point by the equation. Thus, if an analytic solution exists, we know it explicitly. So we have to focus on the convergence of the known formal expression of the series y(x) = ∑∞ n=0 1 n! y(n) (x0)(x − x0)n and we arrive at the theorem below. 553 (see 6.2.9), gives F′ (x) = ( x ∫ x 1 f(t) dt )′ + ( (1 − x) ∫ x 0 )′ = ∫ x 1 f(t) dt + x f(x) − ∫ x 0 f(t) dt + (1 − x)f(x) = − ( ∫ x 0 f(t) dt + ∫ 1 x f(t) dt ) + f(x) = − ∫ 1 0 f(t) dt + f(x) . This expression combined with the equation F′ (ξ) = 0 yields the result. □ 6.B.43. Let f : [0, 1] → R be a continuous function. Show that ∫ π 0 x f ( sin(x) ) dx = π 2 ∫ π 0 f ( sin(x) ) dx . Solution. Set A = ∫ π 0 x f ( sin(x) ) dx. The trick is the substitution u = π − x with du = −dx. For x = 0 we have u = π and for x = π we have u = 0. Recall also that sin(π − u) = sin(π) cos(u) − sin(u) cos(π) = 0 − (−1) sin(u) = sin(u) , for all u ∈ R. Thus A = − ∫ 0 π (π − u)f ( sin(π − u) ) du = ∫ π 0 (π − u)f ( sin(u) ) du = π ∫ π 0 f ( sin(u) ) du − ∫ π 0 u f ( sin(u) ) du , and the result follows easily. □ Along the proof of the fundamental theorem of integral calculus we used the notion of “uniformly continuous functions”, see 6.2.11. Uniform continuity is a stronger condition than continuity, since the real number δ in the relative definition depends only on ε. Thus, every uniformly continuous function defined on a subset A ⊆ R is also continuous, but the converse is not true (think for example of the hyperbola 1/x on (0, 1)). 
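To get a first feeling for how uniform continuity fails for the hyperbola 1/x on (0, 1), here is a small numerical sketch of our own, anticipating the sequence criterion of 6.B.44 and the exercise 6.B.46 below. With x_n = 1/(n + 1) and y_n = 1/(n + 2), both sequences lie in (0, 1) and x_n − y_n → 0, yet f(x_n) − f(y_n) = (n + 1) − (n + 2) = −1 for all n:

f(x) = 1/x
for n in [10, 100, 1000]:         # ad hoc sample indices
    xn = 1/(n+1); yn = 1/(n+2)    # both points lie in (0, 1)
    print((xn - yn).n(), (f(xn) - f(yn)).n())

The first column shrinks to zero while the second stays at −1.000, so no single δ can serve, say, ε = 1/2.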
Let us present an alternative way to think about uniformly continuous functions and discuss some examples.

6.B.44. Show that a function f : A ⊆ R → R is uniformly continuous if and only if for any two sequences (x_n), (y_n) in A with x_n − y_n → 0, we also have f(x_n) − f(y_n) → 0 as n → ∞.

Solution. Let us prove the one direction and leave the converse as an exercise. So, let f : A ⊆ R → R be a uniformly continuous function. Then, for every ε > 0 there exists δ > 0 such that for all x, y ∈ A satisfying |x − y| < δ the inequality |f(x) − f(y)| < ε holds. Let (x_n), (y_n) be sequences in A

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

In its proof, the second observation will be most helpful: the expressions for the derivatives y^(n) are universal polynomials P_n,
y^(n)(x) = P_n(f(x), f′(x), . . . , f^(n−1)(x)),
in the derivatives of the function f, all with non-negative coefficients and independent of the particular equation.⁵

Cauchy–Kovalevskaya theorem in dimension one

Theorem. Assume f(y) is a real analytic function convergent on the interval (x₀ − a, x₀ + a) ⊂ R and consider the differential equation (1) with the condition y(x₀) = y₀. Then the formal power series y(x) = ∑_{n=0}^∞ (1/n!) y^(n)(x₀)(x − x₀)^n converges on a neighborhood of x₀ and provides the solution to (1) satisfying the initial condition.

Proof. The second observation above suggests how to prove the convergence of the "candidate series" similarly to the way we proved the convergence of power series in general, i.e. by finding another convergent series whose partial sums bound ours from above. This was Cauchy's original approach to this theorem, and we speak of the method of majorants. Without loss of generality we fix x₀ = 0 and y(0) = 0 (we may always use the shifted quantities z = y − y₀ and t = x − x₀ to transform the general case). Assume we can find another analytic function g(x) = ∑_{n=0}^∞ (1/n!) b_n x^n with all b_n = g^(n)(0) ≥ 0, i.e. g has all derivatives non-negative at the origin, such that g^(n)(0) ≥ |f^(n)(0)| for all n. Now, replace f in the equation (1) by g and write the formal power series z(x) = ∑_{n=0}^∞ (z^(n)(0)/n!) x^n for the potential solution of this equation as above. In particular, we deduce (recall that the universal polynomials P_n have non-negative coefficients)
z^(n)(0) = P_n(g(z(0)), . . . , g^(n−1)(z(0))) ≥ P_n(|f(y(0))|, . . . , |f^(n−1)(y(0))|) ≥ |y^(n)(0)|,
and consequently the convergence of z(x) implies the absolute convergence of y(x), i.e. the claim of the theorem.

We try to find a majorant in the form of a geometric series. Let us pick r > 0, smaller than the radius of convergence of f. Then, obviously, there is a constant C > 0 such that the derivatives a_n = f^(n)(0) satisfy |(1/n!) a_n r^n| ≤ C for all n, i.e. |a_n| ≤ C n!/r^n (the series would certainly not converge otherwise). We may recognize the derivatives of a geometric series and write
(2) g(z) = C ∑_{n=0}^∞ z^n/r^n = C r/(r − z),
with derivatives g^(n)(0) = C n!/r^n. Finally, we have to prove that the solution of the equation z′ = g(z) is analytic. We can easily integrate this equation

⁵ Although we shall not need the explicit formulae for these polynomials, they are well known under the name of Faà di Bruno's formula. In principle, they are a direct generalization of the Leibniz rule to higher order derivatives.

with x_n − y_n → 0, and let ε > 0 be given. Applying the definition of a convergent sequence with δ in place of ε, there exists some natural n₀ such that |x_n − y_n| < δ for all n ≥ n₀.
But then we also get |f(xn) − f(yn)| < ε for all n ≥ n0, which is equivalent to say that limn→∞ ( f(xn) − f(yn) ) = 0. □ 6.B.45. Let f : A → R be a continuous function defined on subset A of R. Provide examples verifying that: (a) If A is not closed, then f may not be uniformly continuous on A; (b) If A is not bounded, then f may not be uniformly continuous on A. ⃝ 6.B.46. Based on 6.B.44, show that the hyperbola f(x) = 1/x with x ∈ (0, 1) is not uniformly continuous. ⃝ 6.B.47. Show that the sine function sin(x) is uniformly continuous on R. Solution. First we will use the Lagrange’s mean value theorem (see 5.3.9) to show that | sin(x) − sin(y)| ≤ |x − y|, for any x, y ∈ R. If x = y we have nothing to prove, so assume that x ̸= y. Without loss of generality we may also assume that x < y. The sine function sin(x) satisfies the conditions of the Lagrange’s mean value theorem on the interval [x, y]. Thus, there exists some ξ ∈ (x, y) with sin(x) − sin(y) x − y = sin′ (ξ) = cos(ξ) . Since | cos(ξ)| ≤ 1 we get sin(x)−sin(y) x−y ≤ 1, which implies that | sin(x) − sin(y)| ≤ |x − y|, for any x, y ∈ R. Let us now prove that sin(x) is uniformly continuous on R. Given some ε > 0 take δ = ε. If x, y ∈ R satisfy |x−y| < δ, by the previous inequality we will have | sin(x) − sin(y)| ≤ |x − y| < δ = ε, which proves the claim. □ 6.B.48. Show that the cos function cos(x) is uniformly continuous on R. ⃝ 6.B.49. Let f : A → R be a continuous function defined on a subset A ⊆ R. (a) Show with an example that if (xn) is a Cauchy sequence on A, then the sequence (f(xn)) may fail to be Cauchy. (b) If f is in addition uniformly continuous on A, show that the sequence (f(xn)) is Cauchy for any Cauchy sequence (xn) on A. ⃝ 6.B.50. Which of the following functions is uniformly continuous on the given domain? f(x) = x2 on A = [0, 1], g(x) = tan(x) on B = [0, π/2), h(x) = x2 on R, and k(x) = x3 on R. ⃝ In many cases we should compute integrals over an infinite interval, or integrals of functions containing a discontinuity, see 6.2.14. In such cases one speaks for “improper integrals”, which are in general definite integrals that cover an unbounded area. For convenience we summarize the following rules: CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS with separated variables directly. Written with the help of differentials, (r − z) dz = Cr dx. Thus, the implicit equation reads 1 2 (r − z)2 = −Crx + D, where the constant D is determined by z(0) = 0. Consequently D = 1 2 r2 and a simple computation reveals the solution of the implicit equation z(x) = r ( 1 ± √ 1 − 2Cx r ) . The option with the minus sign satisfies our initial condition. This clearly is an analytic function, g provides the requested majorant, and the proof is finished. □ 6.2.23. Numerical approximation of integration. Just as in paragraph 6.1.12, we use the Taylor expansion to propose simple approximations of integration. We deal with an integral I = ∫ b a f(x) dx of an analytic function f(x) and a uniform partition of the interval [a, b] using points a = x0, x1, . . . , xn = b with distances xi − xi−1 = h > 0. Denote the points in the middle of the intervals in the partitions by xi+1/2 and the values of the function at the points of the partition by f(xi) = fi. Compute the contribution of one segment of the partition to the integral by the Taylor expansion (knowing that the power series might be integrated term by term). 
Integrate symmetrically around the middle values so that the derivatives of odd orders cancel each other out while integrating: ∫ h/2 −h/2 f(xi+1/2 + t) dt = ∫ h/2 −h/2 ( ∑∞ n=0 1/n! f(n)(xi+1/2) tn ) dt = ∑∞ k=0 ( ∫ h/2 −h/2 1/k! f(k)(xi+1/2) tk dt ) = ∑∞ k=0 h2k+1/(22k(2k + 1)!) f(2k)(xi+1/2). A simple numerical approximation of integration on one segment of the partition is the trapezoidal rule. This uses the area of the trapezoid given by the points [xi, 0], [xi, fi], [xi+1, 0], [xi+1, fi+1] for approximation. This area is Pi = 1/2 (fi + fi+1)h. In total, the integral I is approximated by Itrap = ∑ n−1 i=0 Pi = h/2 (f0 + 2f1 + · · · + 2fn−1 + fn). Compare Itrap to the exact value of I computed by contributions over the individual segments of the partition. Express the values fi by the middle values fi+1/2 and the derivatives f(k) i+1/2 in the following way: fi+1/2±1/2 = fi+1/2 ± h/2 f′ i+1/2 + h2/(2!22) f′′ i+1/2 ± h3/(3!23) f(3) i+1/2 + . . . . 555 (1) ∫ +∞ a f(x) dx = limc→+∞ ∫ c a f(x) dx, where f is assumed to be continuous on [a, +∞). (2) ∫ b −∞ f(x) dx = limc→−∞ ∫ b c f(x) dx, where f is assumed to be continuous on (−∞, b]. (3) ∫ +∞ −∞ f(x) dx = ∫ 0 −∞ f(x) dx + ∫ +∞ 0 f(x) dx, where f is assumed to be continuous on R. (4) ∫ b a f(x) dx = limt→b− ∫ t a f(x) dx, where f is assumed to be continuous on [a, b). (5) ∫ b a f(x) dx = limt→a+ ∫ b t f(x) dx, where f is assumed to be continuous on (a, b]. (6) ∫ b a f(x) dx = ∫ c a f(x) dx + ∫ b c f(x) dx, where f is assumed to be continuous on [a, c) ∪ (c, b]. 6.B.51. Show that the integral L = ∫ 2 0 1/(x − 1)2 dx is divergent, i.e., L = +∞. Solution. Observe that x0 = 1 is an improper point of the function f(x) = 1/(x − 1)2 and limx→1+ f(x) = limx→1− f(x) = limx→1 f(x) = +∞. Hence the line x = 1 is a vertical asymptote of f. However, the integrand f is continuous at all other points, see also its graph below. Already from the graph we understand that the statement should be true. To prove it we will extend the method presented in 6.2.14 for computing the integral I appearing there; in particular, this improper integral corresponds to case (6) of those listed above. Hence we have L = ∫ 2 0 f(x) dx = ∫ 1 0 f(x) dx + ∫ 2 1 f(x) dx = limδ→1− ∫ δ 0 f(x) dx + limε→1+ ∫ 2 ε f(x) dx. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Thus, the contribution Pi to the approximation is Pi = 1/2 (fi + fi+1)h = h ( fi+1/2 + h2/(2!22) f′′ i+1/2 ) + O(h5). Estimate the error ∆i = I − Itrap over one segment of the partition: ∆i = h ( fi+1/2 + h2/24 f′′ i+1/2 − fi+1/2 − h2/8 f′′ i+1/2 + O(h4) ) = −h3/12 f′′ i+1/2 + O(h5). The total error is thus estimated as |I − Itrap| = 1/12 nh3 |f′′| + n O(h5) = 1/12 (b − a)h2 |f′′| + O(h4), where |f′′| represents an upper estimate for |f′′(x)| of f over the interval of integration. If the linear approximation of the function over the individual segments does not suffice, we can try approximations by quadratic polynomials. To do so, three values are always needed, so work with segments of the partition in pairs. Suppose n = 2m and consider xi with odd indices. We choose fi+1 = f(xi + h) = fi + αih + βih2, fi−1 = f(xi − h) = fi − αih + βih2, which implies βi = 1/(2h2) (fi+1 + fi−1 − 2fi). The approximation of the integral over two segments of the partition between xi−1 and xi+1 is now estimated by the expression (notice we integrate the quadratic polynomial with the requested values fi−1, fi, fi+1 at the points xi−1, xi, xi+1, respectively.
It is not necessary to know the constant αi) Pi = ∫ h −h (fi + αit + βit2) dt = 2hfi + 2/3 βih3 = 2hfi + 2h/6 (fi+1 + fi−1 − 2fi) = h/3 (fi+1 + fi−1 + 4fi). This procedure is called Simpson's rule6. The entire integral is now approximated by ISimp = 1/3 h ( f0 + 4 ∑ m−1 i=0 f2i+1 + 2 ∑ m−1 i=1 f2i + f2m ). As with the trapezoidal rule above, the total error is estimated by |I − ISimp| = 1/180 (b − a)h4 |f(4)| + O(h5), where |f(4)| represents the upper bound for |f(4)(x)| over the interval of integration. 6This way of approximating the integral is attributed to the English mathematician and inventor Thomas Simpson (1710-1761). 556 Now, obviously F(x) = −1/(x − 1) satisfies F′(x) = f(x) and hence is a primitive of f. Applying the fundamental theorem of integral calculus we obtain ∫ δ 0 f(x) dx = F(δ) − F(0) = −δ/(δ − 1), ∫ 2 ε f(x) dx = F(2) − F(ε) = −(ε − 2)/(ε − 1), and hence L = limδ→1− δ/(1 − δ) + limε→1+ (ε − 2)/(1 − ε) = (+∞) + (+∞) = +∞. These computations can be done quickly in Sage, e.g., by the block F(x)=-1/(x-1); var("delta, eps") show((F(delta)-F(0)).factor()) show(lim((F(delta)-F(0)), delta=1, dir="-")) show((F(2)-F(eps)).factor()) show(lim((F(2)-F(eps)), eps=1, dir="+")) Notice the command integral(1/(x-1)^2, x, 0, 2) gives an error which itself confirms the divergence of L; check the very end of Sage's output for this command. □ 6.B.52. Show that ∫ +∞ 1 arctan(x)/(x√x) dx is a finite real number. ⃝ In 6.2.17 one meets several new acquisitions to the ZOO. One of them is the Gamma function, defined for all positive real numbers z by Γ(z) = ∫ ∞ 0 e−x xz−1 dx. The Gamma function is well defined and analytic for z > 0 (and more generally for complex numbers z with positive real part). Moreover, the defining integral converges for all z > 0, and the next task shows that the Gamma function can be seen as an extension of the factorial function to positive real numbers.4 6.B.53. An interpretation of the factorial. (a) Prove the recurrence formula Γ(n + 1) = nΓ(n), n ∈ Z+. (b) Using (a) show that Γ(n + 1) = n!, for all naturals n. (c) For some a > 0 prove that ∫ ∞ 0 e−ax xn−1 dx = (1/an) Γ(n). (d) Using the "Gaussian integral" ∫ ∞ 0 e−u2 du = √π/2, show that Γ(1/2) = √π. 4The gamma function was introduced by Leonhard Euler (1707-1783) in 1729, as a natural extension of the factorial operation n! (in the sense that n is replaced by a real or a complex number). CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 3. Sequences, series and limit processes While building a menagerie of functions, we encountered power series, which extended the collection of all polynomials in a natural way, see 5.4.10. We obtained the class of analytic functions in this way. We have also got the message that they are always smooth and may be integrated and differentiated term by term on the entire domain of convergence. We gave the link to a straightforward technical proof of these statements in 9.4.2 on page 873. Now we shall develop simple tools and show similar results for much more general series of functions. Moreover, functions often depend on further parameters which are dummy when differentiating or integrating, but we need to understand how the result behaves with respect to these parameters. For instance, what about the differentiability of the Gamma function introduced above? Or, when computing a volume or area depending on a free parameter, how to minimize it? Finally, at the end of this chapter we briefly introduce some more advanced concepts of integration.
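Before proceeding, note that the factorial interpretation of the Gamma function from 6.B.53 is easy to check directly in Sage; the following small cell is our own illustration (gamma and factorial are built-in):
# Quick check of 6.B.53: gamma(k+1) = k! and gamma(1/2) = sqrt(pi)
for k in [1, 2, 3, 4, 5]:
    print(k, gamma(k + 1), factorial(k))
show(gamma(1/2))                              # sqrt(pi)
show(integral(e^(-x) * x^(-1/2), x, 0, oo))   # the defining integral for z = 1/2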
6.3.1. How well behaved is a sequence of functions? We return to the discussion of the limits of sequences of functions and the sums of series of functions in view of the methods of differential and integral calculus. Consider a convergent series of functions S(x) = ∑∞ n=1 fn(x) on an interval [a, b]. Natural questions include: • If all functions fn(x) are continuous at some point x0 ∈ [a, b], is the function S(x) also continuous at the point x0? • If all functions fn(x) are differentiable at some point x0 ∈ [a, b], is the function S(x) also differentiable there and does the equality S′(x) = ∑∞ n=1 f′ n(x) hold? • If all functions fn(x) are Riemann integrable on an interval [a, b], is the function S(x) also integrable there and does the equality ∫ b a S(x) dx = ∑∞ n=1 ∫ b a fn(x) dx hold? Notice, it does not matter whether we discuss series or sequences, since the former are just limits of the sequences of partial sums. First, we demonstrate by examples that the answers to all three questions above are "NO!". Then we find additional conditions on the convergence of the series (or sequences) which guarantee the validity of all three statements. Later we shall mention alternative concepts of integration which are more satisfactory than the Riemann integral for even wider classes of functions. 6.3.2. Examples of nasty sequences. (1) Consider the functions fn(x) = (sin x)n on the interval [0, π]. The values of these functions are nonnegative and smaller than one at all points 0 ≤ x ≤ π, except 557 Solution. (a) Integration by parts gives the result, i.e., Γ(n + 1) = ∫ ∞ 0 e−x xn dx = [ −xn e−x ]∞ 0 + n ∫ ∞ 0 e−x xn−1 dx = − limx→∞ xn/ex + nΓ(n) = 0 + nΓ(n) = nΓ(n). Above we used the fact limx→∞ xn/ex = 0. This limit has the indeterminate form ∞/∞ and one way to compute it is by applying l'Hôpital's rule n times: limx→+∞ xn/ex = limx→+∞ n xn−1/ex = · · · = limx→∞ n!/ex = 0. (b) Obviously, Γ(1) = ∫ ∞ 0 e−x dx = [ −e−x ]∞ 0 = 1. Moreover, a direct computation shows that Γ(2) = ∫ ∞ 0 e−x x dx = 1 = Γ(1). Thus we can write Γ(2) = 1 · Γ(1) = 1!, and similarly by integration by parts one can prove that Γ(3) = 2Γ(2) = 2!. Let us apply induction on n to prove the general formula. For the inductive step suppose Γ(n) = (n − 1)! for some n. Then, by (a) we see that Γ(n + 1) = nΓ(n) = n(n − 1)! = n!, for all naturals n. (c) For any positive integer n we have Γ(n) = ∫ ∞ 0 e−x xn−1 dx. For a > 0 set x = ay with dx = a dy. Then we see that Γ(n) = ∫ ∞ 0 e−ay (ay)n−1 a dy = an ∫ ∞ 0 e−ay yn−1 dy, i.e., ∫ ∞ 0 e−ay yn−1 dy = Γ(n)/an. The result now follows. (d) This is based on the substitution x = u2 (with dx = 2u du): Γ(1/2) = ∫ ∞ 0 e−x x−1/2 dx = ∫ ∞ 0 (e−x/√x) dx = 2 ∫ ∞ 0 e−u2 du = 2 · √π/2 = √π. □ Recall that the mean value of a function over a given interval provides a single value that represents the average behaviour of the function on that interval. This concept is fundamental in various areas of mathematics, since it allows the estimation of the overall effect of a function over an interval without needing to consider every individual point. This has many applications in physics and engineering, where average values are often more useful than point values (such as in calculating average velocity, average temperature, or average concentration of a substance). CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS for x = π/2, where the value is 1.
Hence on the whole interval [0, π], these functions converge pointwise to the function f(x) = limn→∞ fn(x) = { 0 for all x ̸= π/2, 1 for x = π/2 }. The limit of the sequence of functions fn is a discontinuous function, even though all the functions fn(x) are continuous. The problematic point is an interior point of the interval. The same phenomenon occurs for a series of functions, because the sum is the limit of partial sums. Hence in the previous example, it suffices to express fn as the n-th partial sum. For example, f1(x) = sin x, f2(x) = (sin x)2 − sin x, etc. The figure plots the functions fm(x) for m = n3, n = 1, . . . , 10. (2) We look at the second question, i.e. badly behaving derivatives. A natural idea based on the same principle as above is to construct a sequence of functions which has the same nonzero derivative at one point, but becomes smaller and smaller. So they converge pointwise to the identically zero function. The next figure plots the functions fn(x) = x(1 − x2)n on the interval [−1, 1] for values n = m2, m = 1, . . . , 10. It is immediate that limn→∞ fn(x) = 0 and that all functions fn(x) are smooth. Their derivative at x = 0 is f′ n(0) = ( (1 − x2)n − 2nx2(1 − x2)n−1 ) |x=0 = 1 for all n. But the limit function of the sequence fn has a zero derivative at every point. (3) The counterexample to the third statement is in 6.2.18 already. The characteristic function χQ of rational numbers can be expressed as a sum of countably many functions, which are numbered exactly by rational numbers. They are zero everywhere except for the single point after which they 558 On the other hand, the integral mean value theorem provides a rigorous way to calculate the average value of a function over an interval. Next we will describe applications related to these theorems, see also 6.2.15. 6.B.54. For the function f(θ) = cos(2θ) e1+sin(2θ), determine its average value on [−π/4, π/4]. Solution. The average value of f on [−π/4, π/4] is given by m(f) = 1/(π/4 − (−π/4)) ∫ π/4 −π/4 f(θ) dθ = 2/π ∫ π/4 −π/4 f(θ) dθ. To compute this integral one may set u = 1 + sin(2θ), such that du = 2 cos(2θ) dθ. We also have u = 2 for θ = π/4, and u = 0 for θ = −π/4, thus we get m(f) = 2/π · 1/2 ∫ 2 0 eu du = (e2 − 1)/π. Here is a confirmation in Sage: var("th"); f(th)=cos(2*th)*e^(1+sin(2*th)) show((2/pi)*integral(f(th), th,-pi/4,pi/4)) □ 6.B.55. Determine the average velocity m(v) of a solid in the time interval [1, 2], if its velocity is given by v(t) = t/√(1 + t2), t ∈ [1, 2]. You can omit the units. Solution. To solve the problem, it suffices to realize that the sought average velocity is the mean value of the function v on the interval [1, 2]. Hence m(v) = 1/(2 − 1) ∫ 2 1 t/√(1 + t2) dt = ∫ 5 2 1/(2√x) dx = √5 − √2, with 1 + t2 = x, t dt = dx/2. □ Further exercises on improper integrals and other applications related to definite integrals are presented in Section D. Next we describe a few tasks about lengths of curves, areas of regions, and volumes. 6.B.56. Arc length. Determine the length of the parametric curve defined by x(t) = et − t, y(t) = 4 et/2 with 0 ≤ t ≤ 1. Then present the solution via Sage. Solution. According to 6.2.18, the length of a curve α(t) = [x(t), y(t)] in R2 defined on the interval [a, b] is given by the integral sα(t) ≡ s(α(t)) = ∫ b a √((x′(t))2 + (y′(t))2) dt. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS are named for, where the value is 1. Riemann integrals of all such functions are zero, but the sum is not a Riemann integrable function.
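The misbehaviour of χQ can also be seen experimentally. The following small Sage cell (our own illustration) computes Riemann sums of the characteristic function of the rationals with rational and with irrational representatives; the two choices give different values no matter how fine the partition, so no limit of Riemann sums can exist:
# Riemann sums of the indicator of the rationals on [0, 1]
def riemann_sum(reps):
    # each representative contributes 1/len(reps) iff it is rational
    N = len(reps)
    return sum(1/N for p in reps if p in QQ)
N = 100
print(riemann_sum([k/N for k in [1..N]]))                  # rational reps: 1
print(riemann_sum([k/N + sqrt(2)/10^6 for k in [0..N-1]])) # irrational reps: 0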
This example illustrates the fundamental flaw of the Riemann integral, to which we return later. We present an example where the limit function f is integrable, all functions fn are continuous, but the value of the integral is not the limit of the integrals of fn. We modify the sequence of the functions x(1 − x2)n used above. They integrate to ∫ 1 0 x(1 − x2)n dx = 1/(2(n + 1)). Thus, we consider the functions fn(x) = 2(n + 1)x(1 − x2)n. These functions with n = m2, m = 1, . . . , 10 are shown in the next diagram. It is quite easy to verify that the values of these functions converge to zero for every x ∈ [0, 1] (for example the reader can check ln(fn(x)) → −∞). But for all n, ∫ 1 0 fn(x) dx = 1 ̸= 0. 6.3.3. Uniform convergence. A reason for the failure in all three previous examples is the fact that the speed of pointwise convergence of the values fn(x) → f(x) varies dramatically from point to point. Hence a natural idea is to confine the problem to cases where the convergence has roughly the same speed over the whole interval: Uniform convergence Definition. We say that the sequence of functions fn(x) converges uniformly on the interval [a, b] to the limit f(x), if for every positive number ε, there exists a natural number N ∈ N such that for all n ≥ N and all x ∈ [a, b] the inequality |fn(x) − f(x)| < ε holds. A series of functions converges uniformly on an interval, if the sequence of its partial sums converges uniformly. Although the choice of the number N depends on the chosen ε, it is independent of the point x ∈ [a, b]. This is the difference from pointwise convergence, where N depends on both ε and x. We visualise the definition graphically in this way: if we consider a zone created by a translation of the limit function f(x) to f(x) ± ε for arbitrarily small, but fixed positive 559 For the given curve we have x′(t) = et − 1 and y′(t) = 2 et/2, with (x′(t))2 + (y′(t))2 = (et + 1)2. Thus s = ∫ 1 0 √((et + 1)2) dt = ∫ 1 0 (et + 1) dt = [ et ]1 0 + [ t ]1 0 = e − 1 + 1 = e. In Sage you can confirm all these computations by the following cell: var('t'); x(t)=e^t-t; y(t)=4*e^(t/2) X(t)=diff(x(t), t); show(X(t)) Y(t)=diff(y(t), t); show(Y(t)) l(t)=(X(t)*X(t)+Y(t)*Y(t)).factor(); show(l(t)) s=integral(sqrt(l(t)), t, 0, 1).simplify_full() show(s) Using the command parametric_plot you can plot the given curve, which can be done just by adding in the previous block the syntax p=parametric_plot((x(t), y(t)), (t, 0, 1)) p.show(aspect_ratio=1/4) □ It is often useful in Sage to pass from classical (symbolic) integration to numerical integration. This can be useful in cases where we need an approximation, or in cases where a "closed form" of the integral does not exist, as for example for the Gaussian f(x) = e−x2, whose antiderivative is expressed via the so-called "error function" erf. This is an interesting situation which we will revisit in many different ways until the end of this chapter. Notice that in Sage an option to approximate a definite integral relies on the command n(integral(f, x, a, b)), or numerical_approx(integral(f, x, a, b)). For instance, for the Gaussian f(x) = e−x2 try the cell f(x)=e^(-x^2) show(integral(f, x, 0, 1)) print(n(integral(f, x, 0, 1))) Sage's output has the form (1/2)√π erf(1) and 0.746824132812427, respectively. Built-in methods that Sage provides for numerical integration will be discussed later.
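One such built-in method can already be previewed: the command numerical_integral avoids symbolic integration entirely and returns the approximate value together with an error estimate. A minimal sketch of our own:
f(x) = e^(-x^2)
val, err = numerical_integral(f, 0, 1)   # numerical quadrature, no erf needed
print(val, err)                          # 0.7468241328... and a tiny error bound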
We begin with the following task about the arc length of parametric curves. 6.B.57. Given a curve c : [a, b] → R2 in the parametric form c(t) = [x(t), y(t)], present a routine in Sage that will compute numerically the length of c on [a, b] and plot its graph. Then, test your routine on the following cases and confirm the routine's result by a formal computation: (1) c(t) = [cos3(t), sin3(t)], t ∈ [0, π/2]; (2) c(t) = [t, ln(cos(t))], t ∈ [0, π/4]; CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS ε, all of the functions fn(x) will fall into this zone, except for finitely many of them. The first and the last of the nasty examples above do not have this property; in the second example, the sequence of derivatives f′ n lacked it. 6.3.4. The three claims in the following theorem say that all three generally false properties discussed in 6.3.1 are true features for uniform convergence (but beware the subtleties when differentiating). Consequences of uniform convergence Theorem. (1) Let fn(x) be a sequence of functions continuous on a closed interval [a, b] and converging uniformly to the function f(x) on this interval. Then f(x) is also continuous on the interval [a, b]. (2) Let fn(x) be a sequence of Riemann integrable functions on a finite closed interval [a, b] which converge uniformly to the function f(x) on this interval. Then f(x) is Riemann integrable, and ∫ b a f(x) dx = ∫ b a ( limn→∞ fn(x) ) dx = limn→∞ ∫ b a fn(x) dx. (3) Let fn(x) be a sequence of functions differentiable on a closed interval [a, b] and assume fn(x0) → f(x0) at some point x0 ∈ [a, b]. Moreover, assume all derivatives gn(x) = f′ n(x) are continuous and converge uniformly to the function g(x) on the same interval. Then the function f(x) = f(x0) + ∫ x x0 g(t) dt is differentiable on the interval [a, b], the functions fn(x) converge to f(x) and f′(x) = g(x). In other words, d/dx f(x) = d/dx ( limn→∞ fn(x) ) = limn→∞ ( d/dx fn(x) ). Proof of the first claim. Fix an arbitrary point x0 ∈ [a, b] and let ε > 0 be given. It is required to show that |f(x) − f(x0)| < ε for all x close enough to x0. From the definition of uniform convergence, |fn(x) − f(x)| < ε for all x ∈ [a, b] and all sufficiently large n. Choose some n with this property and consider δ > 0 such that |fn(x) − fn(x0)| < ε for all x in a δ-neighbourhood of x0. That is possible because fn(x) are continuous for all n. Then |f(x) − f(x0)| ≤ |f(x) − fn(x)| + |fn(x) − fn(x0)| + |fn(x0) − f(x0)| < 3ε for all x in the δ-neighbourhood of x0. This is the desired inequality with the bound 3ε. □ 560 (3) c(t) = [t sin(t), t cos(t)], t ∈ [0, 4π]; (4) c(t) = [sin2(t), cos2(t)], t ∈ [0, π/2]. ⃝ 6.B.58. Compute the length s of a part of the so-called "tractrix", that is, the parametric curve α(t) = [f(t), g(t)] defined by f(t) = r cos t + r ln(tan(t/2)), g(t) = r sin(t), with t ∈ [π/2, a], where r > 0, and a ∈ (π/2, π). Next use Sage to plot α for the values r = 1, 2, . . . , 5 in one figure. ⃝ 6.B.59. Area of a half-ellipse. Consider the curve α(t) = [x(t), y(t)] = [4 cos(t), 3 sin(t)], with 0 ≤ t ≤ π. (a) Show that α(t) is half of a horizontal ellipse centered at the origin, with major axis of length 8 and minor axis of length 6. Use Sage to plot α and color the region bounded by the x-axis and the ellipse. (b) Compute the area of this region. Solution. (a) Obviously, α(t) is half of a horizontal ellipse centered at the origin of R2, since it satisfies the equation x2/a2 + y2/b2 = 1 with a = 4 and b = 3. Thus the major axis has length 2a = 8 and the minor axis has length 2b = 6, respectively.
To plot α we will use the command parametric_plot as above, that is, x(t)=4*cos(t); y(t)=3*sin(t) show(parametric_plot((x(t), y(t)), (t, 0, pi), fill=True, fillcolor="lightgrey", aspect_ratio=1, color="black", thickness=1.5)) which produces the following figure: (b) To find the area under the given parametric curve we first need to compute the integral ∫ π 0 y(t)x′(t) dt = ∫ π 0 3 sin(t)(4 cos(t))′ dt = −12 ∫ π 0 sin2(t) dt = −12/2 ∫ π 0 ( 1 − cos(2t) ) dt = −6 [ t − sin(2t)/2 ]π 0 = −6π. Thus the required area equals |∫ π 0 y(t)x′(t) dt| = 6π. To confirm this result in Sage, it suffices to add to the previous block the cell F(t)=y(t)*diff(x(t), t) show(abs(integral(F(t), t, 0, pi))) □ CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Remark. In fact, the arguments in the proof show a more general claim. Indeed, if the functions fn(x) converge uniformly to f(x) on [a, b], and the individual functions fn(x) have the limits (or one-sided limits) limx→x0 fn(x) = an, then the limit limx→x0 f(x) exists if and only if the limit limn→∞ an = a exists. Then they are equal, that is, a = limn→∞ ( limx→x0 fn(x) ) = limx→x0 ( limn→∞ fn(x) ). The reader should be able to modify the above proof for this situation. 6.3.5. Proof of the second claim. The proof of this part of the theorem is based upon a generalization of the properties of Cauchy sequences of numbers to uniform convergence of functions. In this way we can work with the existence of the limit of a sequence of integrals without needing to know the limit. Uniformly Cauchy sequences Definition. The sequence of functions fn(x) on the interval [a, b] is uniformly Cauchy, if for every (small) positive number ε, there exists a (large) natural number N such that for all x ∈ [a, b] and all m, n ≥ N, |fn(x) − fm(x)| < ε. Every uniformly convergent sequence of functions on the interval [a, b] is also uniformly Cauchy on the same interval. To see this, it suffices to notice the usual bound |fn(x) − fm(x)| ≤ |fn(x) − f(x)| + |f(x) − fm(x)| based on the triangle inequality. Before coming to the proof of 6.3.4(2), we mention the following: Proposition. Every uniformly Cauchy sequence of functions fn(x) on the interval [a, b] uniformly converges to some function f on this interval. Proof. Of course, the condition for a sequence of functions to be uniformly Cauchy implies that for all x ∈ [a, b], the sequence of values fn(x) is a Cauchy sequence of real (or complex) numbers. Hence the sequence of functions fn(x) converges pointwise to some function f(x). Choose N large enough so that |fn(x) − fm(x)| < ε for some small positive ε chosen beforehand and all m, n ≥ N, x ∈ [a, b]. Now choose one such n and fix it; then |fn(x) − f(x)| = limm→∞ |fn(x) − fm(x)| ≤ ε for all x ∈ [a, b]. Hence the sequence fn(x) converges to its limit uniformly. □ Proof of the second claim in 6.3.4. Recall that every uniformly convergent sequence of functions is also uniformly Cauchy and that the Riemann sums of all single terms fn(x) 561 6.B.60. Area. Calculate the area E between the x-axis and the graph of f(x) = x2 − 3x − 4 for 4 ≤ x ≤ 5. Next use Sage to illustrate the area in question. Solution. An easy computation shows that E = 17/6, and this corresponds to the grey region in the figure given here: To obtain the required illustration in Sage, we will use the polygon method, which is the core of our program presented below. Details on this method are postponed to Section D, see (b) in 6.D.43.
f(x)=x^2-3*x-4 p=plot(f(x), x, 3, 5, thickness=2, color="black") p+=line([(5, 0), (5, 6)], rgbcolor=(0.2,0.2,0.2), linestyle="--") p+=polygon([(4,0),(4,f(4))] + [(x, f(x)) for x in [4,4.1,..,5]]+[(5,0),(4,0)], rgbcolor=(0.8,0.8,0.8), aspect_ratio='automatic') p+=text("$f(x)=x^2-3x-4$", (4.5, 5), fontsize=14, color="black") show(p) show(integral(f(x), x, 4, 5)) □ 6.B.61. Compute the volume of a solid created by rotation of a bounded region, whose boundary is the curve x4 − 9x2 + y4 = 0, around the x-axis. You may omit the units. Solution. If [x, y] is a point on the curve x4 − 9x2 + y4 = 0, then clearly the curve also passes through the points [−x, y], [x, −y], [−x, −y]. Thus it is symmetric with respect to both axes x, y. For y = 0, we have x2(x − 3)(x + 3) = 0, i.e. the x axis is intersected by the boundary curve at the points [−3, 0], [0, 0], [3, 0]. In the first quadrant, it can then be expressed as the graph of the function f(x) = (9x2 − x4)1/4, x ∈ [0, 3]. The sought volume is thus twice (here we consider x > 0) the integral ∫ 3 0 πf2(x) dx = π ∫ 3 0 √(9x2 − x4) dx. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS of the sequence converge to ∫ b a fn(x) dx independently of the choice of the partition and the representatives. Hence, if |fn(x) − fm(x)| < ε for all x ∈ [a, b], then also |∫ b a fn(x) dx − ∫ b a fm(x) dx| ≤ ε|b − a|. Therefore the sequence of numbers ∫ b a fn(x) dx is Cauchy, and hence convergent. The Riemann sums of the limit function f(x) can be made arbitrarily close to those of fn(x) for large n, by the same argument as above. So f(x) is integrable. Moreover, |∫ b a fn(x) dx − ∫ b a f(x) dx| ≤ ε|b − a|, so the limit value is as expected. □ 6.3.6. Proof of the third claim. For the corresponding result about derivatives, extra care is needed regarding the assumptions: If the differentiable functions ˜fn(x) = fn(x) − fn(x0) are considered instead of fn(x), the derivatives do not change. Hence without loss of generality it can be assumed that all functions satisfy fn(x0) = 0. Then the first assumption of the theorem is satisfied automatically. For all x ∈ [a, b], we can write fn(x) = ∫ x x0 gn(t) dt. Because the functions gn converge uniformly to g on all of [a, b], the functions fn(x) converge to f(x) = ∫ x x0 g(t) dt. g is a uniform limit of continuous functions, thus g is again continuous. By 6.2.8, on the relation between the Riemann integral and the primitive function, the proof is finished. 6.3.7. Uniform convergence of series. For infinite series, we apply the previous three results to the sequences of partial sums. Thus the following corollary is an immediate consequence: 562 Using the substitution t = √(9 − x2) (x dx = −t dt), we get ∫ 3 0 √(9x2 − x4) dx = ∫ 3 0 x · √(9 − x2) dx = − ∫ 0 3 t2 dt = 9. Thus the final answer is 18π. □ 6.B.62. Torricelli's trumpet (1641). Let a part of a branch of the hyperbola xy = 1 for x ≥ a, where a > 0, rotate around the x axis. Show that the solid of revolution created in this manner has a finite volume V and simultaneously an infinite surface S. Solution. We know that V = π ∫ +∞ a (1/x)2 dx = π ∫ +∞ a 1/x2 dx = π ( limx→+∞ (−1/x) − (−1/a) ) = π/a and S = 2π ∫ +∞ a (1/x) · √(1 + (−1/x2)2) dx = 2π ∫ +∞ a √(x4 + 1)/x3 dx ≥ 2π ∫ +∞ a 1/x dx = 2π ( limx→+∞ ln x − ln a ) = +∞. The fact that the given solid (the so-called Torricelli's trumpet) cannot be painted with a finite amount of colour, but can be filled with a finite amount of fluid, is called "Torricelli's paradox". Realize however that a real coat of paint has a nonzero thickness, which the computation does not take into account. For example, if we painted the trumpet from the inside, a single drop of colour would undoubtedly "block" its infinitely long thin end. □
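Both computations are easy to reproduce in Sage; in the following cross-check of our own, the finite volume comes out symbolically, while the divergent surface integral is refused with an error (the exact message may vary between versions):
var("x, a"); assume(a > 0)
show(pi * integral(1/x^2, x, a, oo))      # the volume: pi/a
try:
    2*pi*integral(sqrt(x^4 + 1)/x^3, x, a, oo)
except ValueError as err:
    print(err)                            # reports a divergent integral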
A differential equation is an equation involving derivatives. Many physical and engineering problems naturally lead to differential equations, making it essential to learn how to formulate and solve them. For instance, in Chapter 3 we encountered differential equations in the context of population growth. Differential equations fall into two main categories: "ordinary differential equations" (ODEs) and "partial differential equations" (PDEs). An ODE involves derivatives with respect to a single independent variable, and in what follows we focus on first-order ODEs. Typically, such equations are written as y′ = F(x, y), where y = f(x), see 6.2.21. If F is of the form F(x, y) = f(x)g(y) for some functions f, g, the ODE is called separable. For g(y) ̸= 0 we can solve it by separating the variables: dy/g(y) = f(x) dx ⇒ ∫ dy/g(y) = ∫ f(x) dx. Let us explore some simple problems, with a more detailed discussion on differential equations postponed to Chapter 8. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Consequences for uniform convergence of series Theorem. Consider a sequence of functions fn(x) on the interval [a, b]. (1) If all the functions fn(x) are continuous on [a, b] and the series S(x) = ∑∞ n=1 fn(x) converges uniformly to the function S(x), then S(x) is continuous on [a, b]. (2) If all the functions fn(x) are Riemann integrable on [a, b] and the series S(x) = ∑∞ n=1 fn(x) uniformly converges to S(x) on [a, b], then S(x) is integrable on [a, b] and ∫ b a ( ∑∞ n=1 fn(x) ) dx = ∑∞ n=1 ∫ b a fn(x) dx. (3) If all the functions fn(x) are continuously differentiable on the interval [a, b], if the series S(x) = ∑∞ n=1 fn(x) converges for some x0 ∈ [a, b], and if the series T(x) = ∑∞ n=1 f′ n(x) converges uniformly on [a, b], then the series S(x) converges, S(x) is continuously differentiable on [a, b], and S′(x) = T(x). That is: d/dx ( ∑∞ n=1 fn(x) ) = ∑∞ n=1 d/dx fn(x). 6.3.8. Test of uniform convergence. A simple way to test that a series of functions converges uniformly is to use a comparison with the absolute convergence of a suitable series of numbers. This is often called the Weierstrass test. Suppose a sequence of functions fn(x) is given on an interval I = [a, b] satisfying |fn(x)| ≤ an ∈ R for suitable real constants an and for all x ∈ [a, b]. Let sk(x) = ∑ k n=1 fn(x) denote the partial sums. For k > m, |sk(x) − sm(x)| = |∑ k n=m+1 fn(x)| ≤ ∑ k n=m+1 |fn(x)| ≤ ∑ k n=m+1 an. 563 6.B.63. Differential equations. Compute the limit limx→+∞ f(x) if it is known that the function f(x) satisfies the differential equation f′(x) + α · ex (f(x))2 = 0, with initial condition f(0) = 1/α, where α > 0 is some constant. Next use Sage to solve the given differential equation for α = 3. Solution. Let us write y = f(x). Then, the given differential equation takes the form y′ = −α · ex y2, or equivalently dy/dx = −α · ex y2 ⇐⇒ −dy/y2 = α · ex dx. Hence, by integrating we obtain −∫ y−2 dy = α ∫ ex dx, that is, y−1 = α ex + C. Thus y = f(x) = 1/(α ex + C), and to find C one relies on the initial condition f(0) = 1/α, which gives C = 0. Thus f(x) = 1/(α ex) and we see that limx→+∞ f(x) = 0. Recall from 6.B.5 that a method to solve (first-order) ODEs in Sage relies on the command desolve.
Since α = 3 in this case, one can type var("x"); y = function("y")(x) a=3; show(desolve(diff(y, x) + a*e^x*y^2, y)) and Sage's output has the form 1/(3 y(x)) = C + ex, which is of course equivalent to our answer for α = 3. □ 6.B.64. Solve the ODE dy/dx = esin(x) cos(x) y1/2 with initial condition y(0) = 16. ⃝ 6.B.65. Solve the task in 6.B.64 using Sage. Moreover, plot the solution together with the "slope field" of the corresponding differential equation. Solution. Recall from 6.B.5 that in order to solve an initial value problem in Sage we use the desolve command and include the option ics = [x0, y0] (or simply [x0, y0]), where y0 = y(x0) represents the initial condition. Thus, in our case one can type var("x") y = function("y")(x) h=desolve(diff(y, x) - e^(sin(x))*cos(x)*sqrt(y), y, ics=[0, 16]) show(h) which prints out the expression 2√(y(x)) = esin(x) + 7, hence confirming our result in 6.B.64. Having a differential equation of the form y′ = F(x, y), plotting the slope field means that at each point (x, y) we plot a short line segment with slope F(x, y). Such details will be analyzed more carefully in Chapter 8. However, we may already discuss the built-in method that Sage provides for this procedure. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS If the series of the (nonnegative) constants ∑∞ n=1 an is convergent, then the sequence of its partial sums is a Cauchy sequence. But then the sequence of partial sums sn(x) is uniformly Cauchy. By 6.3.5 the following is verified: The Weierstrass test Theorem. Let fn(x) be a sequence of functions defined on the interval [a, b] with |fn(x)| ≤ an ∈ R. If the series of numbers ∑∞ n=1 an is convergent, then the series S(x) = ∑∞ n=1 fn(x) converges uniformly. 6.3.9. Consequences for power series. The Weierstrass test allows us to derive the properties of power series in a very straightforward way. Consider a series S(x) = ∑∞ n=0 an(x − x0)n centered at a point x0. We saw earlier in 5.4.9 that each power series converges on an entire interval (x0 − δ, x0 + δ). The radius of convergence δ ≥ 0 can be zero or ∞, see 5.4.13. Moreover, the series obtained by integrating or differentiating a power series term by term must have the same radius of convergence (by the very formula based on lim sup). In the proof of theorem 5.4.9, a comparison with a suitable geometric series is used to verify the convergence of power series. By the Weierstrass test, every power series S(x) converges uniformly on every compact (i.e. bounded closed) interval [a, b] contained in the interval (x0 − δ, x0 + δ). Thus the crucial result follows again: Differentiation and integration of power series Theorem. Every power series S(x) is continuous and is continuously differentiable at all points inside its interval of convergence. The function S(x) is Riemann integrable and can be differentiated or integrated term by term. Abel's theorem states that power series are continuous even at the boundary points of their domain when they converge there (including possible infinite limits). We do not prove it here. The pleasant properties of power series also reveal limitations on their use in practical modelling. In particular, it is not possible to approximate piece-wise continuous or nondifferentiable functions very well by using power series. Of course, it should be possible to find better sets of functions fn(x) than just the powers fn(x) = xn, up to constants. The best known examples are Fourier series and wavelets discussed in the next chapter.
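As a quick illustration of the last theorem, consider the geometric series ∑ xn = 1/(1 − x) on (−1, 1); differentiating its partial sums term by term indeed approaches the derivative 1/(1 − x)2 inside the interval of convergence. A sketch of our own in Sage (note the slower convergence near the endpoints):
var("x")
partial = sum(x^k for k in range(30))   # partial sum of the geometric series
dpartial = diff(partial, x)             # term-by-term derivative
exact = 1/(1 - x)^2
for pt in [0.1, 0.5, 0.9]:
    print(pt, n(dpartial(x=pt)), n(exact(x=pt)))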
564 The latter is based on the syntax plot_slope_field, but first we need to introduce the function F representing the differential equation. One can also plot the solution curves ("integral curves") of the slope field via the command streamline_plot(F(x, y), (x, a, b), (y, a, b)). For our task, we have dy/dx = esin(x) cos(x) y1/2, thus we need to consider the function F(x, y) = esin(x) cos(x) y1/2, and the implementation takes the form var("x, y") c1=implicit_plot(2*sqrt(y)==e^(sin(x))+7, (x, 0, 25), (y, 0, 25)) F(x, y)=e^(sin(x))*cos(x)*sqrt(y) p1=plot_slope_field(F(x, y),(x, 0, 25),(y, 0, 25)) p2=streamline_plot(F(x, y),(x, 0, 25),(y, 0, 25)) show(p1+c1, figsize=6); show(p2, figsize=6) This produces the following figures. On the left-hand side the solution of our initial value problem appears in blue, while the slope field appears in grey in the background. On the right-hand side we see the integral curves of the slope field. □ Numerical integration is a fundamental technique in computational mathematics, used to approximate the area under a curve, especially when an exact analytical solution is challenging or impossible to obtain. It complements the numerical differentiation that we met earlier in 6.1.12 and is invaluable for analyzing functions where direct solutions are impractical. Below, we will explore examples using the trapezoid rule and Simpson's rule, see 6.2.23. 6.B.66. Trapezoid rule. Compute the error of estimating the integral I = ∫ π 0 sin(x) dx by the trapezoid rule with n = 4 intervals. Demonstrate this in the following two ways: i) Compute the actual error given by the difference |I − Itrap|. ii) Apply the formula ((b − a)h2/12) |f′′|, where |f′′| represents an upper estimate of |f′′(x)| over the interval of integration [a, b], where f(x) = sin(x). Solution. The trapezoid rule with n = 4 intervals has the form Itrap = h/2 [ f(x0) + 2f(x1) + 2f(x2) + 2f(x3) + f(x4) ], CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.3.10. Laurent series. We return to the smooth function f(x) = e−1/x2 from paragraph 6.1.9 in the context of Taylor series expansions. It is not analytic at the origin, because all its derivatives are zero there and the function is strictly positive at all other points. At all points x0 ̸= 0 this function is given by its convergent Taylor series with radius of convergence r = |x0|. At the origin the Taylor series converges only at the one point 0. Substitute the expression −1/x2 for x in the power series for ex. The result is the series of functions S(x) = ∑∞ n=0 1/n! (−1)n x−2n = ∑ 0 n=−∞ ((−1)|n|/|n|!) x2n. The series converges at all points x ̸= 0. It gives a good idea about the behaviour near the exceptional point x = 0. Thus we consider the following series, similar to power series but more general: Laurent7 series A series of functions of the form S(x) = ∑∞ n=−∞ an(x − x0)n is called a Laurent series centered at x0. The series is convergent if both its parts with positive and negative exponents converge separately. The importance of Laurent series can be seen with rational functions. Consider such a function S(x) = f(x)/g(x) with coprime polynomials f and g and consider a root x0 of the polynomial g(x). If the multiplicity of this root is s, then after multiplication we obtain the function ˜S(x) = S(x)(x − x0)s, which is analytic on some neighbourhood of x0. Therefore we can write S(x) = a−s/(x − x0)s + · · · + a−1/(x − x0) + a0 + a1(x − x0) + . . . = ∑∞ n=−s an(x − x0)n.
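Such expansions are easy to produce in Sage, whose symbolic series command returns the finitely many negative powers as well; a small sketch of our own for a double pole at x0 = 0:
var("x")
S = 1/(x^2*(1 - x))    # double pole at 0, so the expansion starts at x^(-2)
show(S.series(x, 3))   # x^(-2) + x^(-1) + 1 + x + x^2 + Order(x^3)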
Consider the two parts of the Laurent series separately: S(x) = S− + S+ = ∑ −1 n=−∞ an(x − x0)n + ∑∞ n=0 an(x − x0)n. For the series S+, Theorem 5.4.9 implies that its radius of convergence R is given by R−1 = lim supn→∞ n√|an|. Apply the same idea to the series S− with 1/x substituted for x. It is then apparent that the series S−(x) converges for |x − x0| > r, where r = lim supn→∞ n√|a−n|. 7Pierre Alphonse Laurent (1813-1854) was a French engineer and military officer. He submitted his generalization of the Taylor series to the Grand Prix competition of the French Académie des Sciences. For formal reasons it was not considered. It was published much later, after the author's death. 565 where the step size h satisfies h = (b − a)/n, see 6.2.23. For our case we have f(x) = sin(x), a = 0, b = π and h equals π/4. Thus, x0 = 0, x1 = π/4, x2 = π/2, x3 = 3π/4, and x4 = π, with f(x0) = 0 = f(x4), f(x1) = √2/2 = f(x3) and f(x2) = 1. This gives Itrap = π/8 [ 2(1 + √2) ] ≈ 1.896, where we used Sage to compute this expression by the command N((pi/8)*2*(1+sqrt(2))) An illustration of the trapezoid rule is given here. Let us now derive the actual error of the estimation as suggested in (i). You can easily compute I = 2, thus |I − Itrap| ≈ |∫ π 0 sin(x) dx − 1.896| ≈ 0.104. This error should not differ dramatically from the result in (ii). Indeed, we see that f′′(x) = − sin(x) and over the interval [0, π] the maximum value of |f′′(x)| = |sin(x)| equals 1. This gives us ((b − a)3/(12n2)) maxx∈[0,π] |f′′(x)|, which for n = 4 equals π3/192 ≈ 0.162. Therefore, the actual error 0.104 is less than the theoretical upper bound 0.162, which is consistent with the fact that the theoretical error estimate provides an upper bound on the actual error. We also conclude that the trapezoid rule underestimates the actual integral value. This occurs because f(x) = sin(x) is concave (down) on the interval of integration, i.e., f′′(x) < 0 for all x ∈ (0, π). This characteristic is typical of the trapezoid rule's behaviour with concave functions, as illustrated in the figure above. □ 6.B.67. Create a routine in Sage to demonstrate the computation of the trapezoid rule with n intervals. Apply your program to estimate the integral L = ∫ 3π/2 π/2 cos(x) dx, with n = 4. Additionally, provide a formal proof, and calculate both the actual and theoretical errors of the estimation, as discussed in 6.B.66. ⃝ 6.B.68. Apply Simpson's rule to approximate the integral I = ∫ 1 0 1/(1 + x) dx with h = 1/4 and h = 1/2 respectively. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Notice that the conclusions about convergence remain true even for complex values of x substituted into the expression. Laurent series can be considered as functions defined on a domain in the complex plane. We return to this in Chapter 9. The following theorem has already been proved. Convergence of the Laurent series on the annulus Theorem. The Laurent series S(x) centered at x0 converges for all x ∈ C satisfying r < |x − x0| < R and diverges for all x satisfying |x − x0| < r or |x − x0| > R, where r = lim supn→∞ n√|a−n|, R−1 = lim supn→∞ n√|an|. The Laurent series need not converge at any point, because possibly R < r. If we look for an example of the above case of rational functions expanded into Laurent series at some root of the denominator, then clearly r = 0 and therefore, as expected, the series converges in a punctured neighbourhood of this point x0. R is given by the distance to the closest other root of the denominator.
In the case of the first example, the function e−1/x2, we have r = 0 and R = ∞. 6.3.11. Integrals dependent on parameters. When integrating a function f(x, y1, . . . , yn) of one real variable x depending on further real parameters y1, . . . , yn with respect to the single variable x, the result is a function F(y1, . . . , yn) depending on all the parameters. Such a function F often occurs in practice. For instance, we can look for the volume or area of a body which depends on parameters, and determine the minimal and maximal values (with additional constraints as well). Often it is desirable to interchange the operations of differentiation and integration. That this can be done is proved below. We begin with an examination of continuous dependency on the parameters. For the sake of simplicity, we shall deal with functions f(x, y) depending on two variables, x ∈ [a, b], y ∈ [c, d]. We say f is continuous on I = [a, b] × [c, d] ⊂ R2 = C if for each z = (x, y) from the domain of f and ε > 0 there is some δ > 0 such that |f(w) − f(z)| < ε if w ∈ Oδ(z). (Notice the definition is the same as with univariate functions; we just use the distance in the plane.) The function f(x, y) is called uniformly continuous if for each ε > 0, there is δ > 0 such that for any two points z, w in I ⊂ R2 = C, |z − w| < δ implies |f(z) − f(w)| < ε. Exactly the same argument as with univariate functions, based on the fact that every open cover of a compact set in the complex plane contains a finite subcover, cf. Theorem 5.2.8(5), provides the following lemma (cf. the proof of Theorem 6.2.11). Lemma. Each continuous function f(x, y) on I = [a, b] × [c, d] is uniformly continuous. Now we are ready for the following important claim: 566 Then, compare the accuracy of these results with those obtained using the trapezoidal rule. Which method provides a better approximation? ⃝ 6.B.69. Create a routine in Sage to demonstrate the computation of Simpson's rule with n intervals, where n is an even integer. Next use your program to verify the result presented in 6.B.68 for h = 1/4, and illustrate Simpson's approximation. ⃝ 6.B.70. Calculate the theoretical error ((b − a)5/(180 n4)) |f(4)| of Simpson's approximation with h = 1/4 for the integral I given in 6.B.68, where |f(4)| represents the upper bound of |f(4)(x)| with x ∈ [0, 1]. Then, compare this theoretical error with the actual error computed earlier. ⃝ C. Sequences, series and limit processes We are already familiar with sequences of real numbers, power series and the concept of convergence. Our aim now is to readdress the discussion on series of functions in light of the methods of differential and integral calculus. First we will consider sequences whose terms are functions rather than real or complex numbers. Sequences of functions naturally arise in real analysis and are crucial in approximation theory. We will illustrate the consequences of the uniform convergence of sequences and series of functions through numerous examples. Additionally, we will discuss tasks related to the differentiation and integration of power series, along with other applications. Let us begin, however, with numerical series. Thanks to the integral criterion of convergence (see 6.2.15), one can address the question of convergence for a broader class of series. The next few tasks will highlight this fact. 6.C.1. Applications of the integral criterion of convergence. Decide whether the following sums converge or diverge and confirm your computations in Sage: T1 = ∑∞ n=2 1/(n ln n), T2 = ∑∞ n=1 1/n2.
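For the Sage confirmation requested in the task, one can evaluate the corresponding improper integrals and the second sum directly; a minimal sketch of our own (Sage refuses the divergent integral with an error, and the exact message may vary between versions):
var("x, k")
show(integral(1/x^2, x, 1, oo))        # = 1, so T2 converges
show(sum(1/k^2, k, 1, oo))             # indeed pi^2/6
try:
    integral(1/(x*ln(x)), x, 2, oo)
except ValueError as err:
    print(err)                         # divergent integral, so T1 diverges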
Solution. Observe that we cannot decide the convergence of either of these series by using the ratio or root test, since in both cases limn→∞ |an+1/an| = limn→∞ n√an = 1. However, using the integral criterion for convergence of series one obtains: ∫ ∞ 2 1/(x ln(x)) dx = ∫ ∞ ln 2 1/t dt = limδ→∞ [ln(t)] δ ln 2 = ∞, hence the series T1 diverges. On the other hand, ∫ ∞ 1 1/x2 dx = limδ→∞ [ −1/x ]δ 1 = 1, hence the series T2 converges. □ CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Theorem. Assume f(x, y) is a function defined for all x lying in a bounded interval [a, b] and all y in a bounded interval [c, d], continuous on I = [a, b] × [c, d]. Consider the (Riemann) integral F(y) = ∫ b a f(x, y) dx. Then the function F(y) is continuous on [c, d]. Proof. Fix a point y ∈ [c, d], small ε > 0, and choose a neighbourhood W of y such that for all ¯y ∈ W ⊂ [c, d] and all x ∈ [a, b] (remember f is uniformly continuous) |f(x, ¯y) − f(x, y)| < ε. The Riemann integral of continuous functions is evaluated by approximations of finite sums (equivalently: upper, lower, or Riemann sums with arbitrary representatives ξi, see paragraph 6.2.9). The goal is to establish that the Riemann sums for the integrals with parameters y and ¯y cannot differ much. In the following estimate for any partition with k intervals and representatives ξi, first use the standard properties of the absolute value and then exploit the choice of W: |∑ k−1 i=0 f(ξi, ¯y)(xi+1 − xi) − ∑ k−1 i=0 f(ξi, y)(xi+1 − xi)| ≤ ∑ k−1 i=0 |f(ξi, ¯y) − f(ξi, y)| (xi+1 − xi) < ε(b − a). It follows that the limit values for any sequences of the partitions and representatives F(y) and F(¯y) cannot differ by more than ε(b − a) either, so the function F is continuous. □ 6.3.12. Integrating twice. The fact that the integral F(y) = ∫ b a f(x, y) dx of a continuous function f : [a, b] × [c, d] → R in the plane is again a continuous function F : [c, d] → R allows us to repeat the integration and write (1) I = ∫ d c ∫ b a f(x, y) dx dy = ∫ d c ( ∫ b a f(x, y) dx ) dy. The next theorem is the simplest version of the claim known as the Fubini theorem. Fubini theorem Theorem. Consider a continuous function f : [a, b] × [c, d] → R in the plane R2. The multiple integration (1) is well defined and does not depend on the order of integration, i.e., I = ∫ d c ( ∫ b a f(x, y) dx ) dy = ∫ b a ( ∫ d c f(x, y) dy ) dx. 567 6.C.2. Using the integral criterion, decide on the convergence of the series ∑∞ n=1 1/((n + 1) ln2(n + 1)). Solution. The function f(x) = 1/((x + 1) ln2(x + 1)), x ∈ [1, +∞) is clearly positive and nonincreasing on its whole domain, thus the given series converges if and only if the integral ∫ +∞ 1 f(x) dx converges. By using the substitution y = ln(x + 1) (where dy = dx/(x + 1)), we can compute ∫ +∞ 1 1/((x + 1) ln2(x + 1)) dx = ∫ +∞ ln 2 1/y2 dy = 1/ln 2. Hence the series converges. □ Next we will explore examples related to the concept of "uniform convergence", which is a stronger notion than pointwise convergence. Recall that given a sequence of functions (fn) defined on an interval I = [a, b], we say that (fn) converges uniformly to a function f on [a, b], if for every ε > 0 there exists N ∈ N such that for all n ≥ N and x ∈ [a, b] we have |fn(x) − f(x)| < ε. In other words, we have f(x) − ε < fn(x) < f(x) + ε for all n ≥ N and x ∈ [a, b]; thus uniform convergence means that all of the functions fn(x) are close to f(x) at all points x ∈ [a, b], except for finitely many of them, see the figure given below and see also 6.3.3.
(Figure: the limit function f on [a, b] together with the ε-zone between y = f(x) − ε and y = f(x) + ε; for n large enough, the graphs of the fn lie inside this zone.) Notice that we can equivalently express the condition of uniform convergence as (see 6.D.46) limn→∞ supx∈[a,b] |fn(x) − f(x)| = 0. It is also important to note that uniform convergence implies pointwise convergence (and the uniform limit function is the same as the pointwise limit function), but the converse is not true. Before discussing tasks on uniform convergence, we will first emphasize some natural tasks related to the pointwise CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Proof. We know f is uniformly continuous on the product of intervals [a, b] × [c, d] in the plane. Thus, for each ε > 0 there is δ > 0 such that |f(x1, y1) − f(x2, y2)| < ε whenever |x1 − x2| < δ and |y1 − y2| < δ. We know both Riemann integrals in (1) exist, thus we may fix a sequence Ξk of partitions of the interval [a, b] into k subintervals [xi−1, xi] of equal size 1/k and with representatives ξi,k, i = 1, . . . , k, and similarly for the interval [c, d] with the subintervals [yi−1, yi] and representatives ηj,k, j = 1, . . . , k. Then we may write I = ∫ d c limk→∞ Sk(y) dy, Sk(y) = ∑ k i=1 f(ξi,k, y) (1/k)(b − a). If 1/k < δ, then |∫ b a f(x, y) dx − ∑ k i=1 f(ξi,k, y)(1/k)(b − a)| ≤ ∑ k i=1 |∫ xi xi−1 f(x, y) dx − f(ξi,k, y)(1/k)(b − a)| ≤ ∑ k i=1 ∫ xi xi−1 |f(x, y) − f(ξi,k, y)| dx ≤ ε(b − a). Thus, the convergence of limk→∞ Sk(y) = F(y) = ∫ b a f(x, y) dx is uniform on [c, d]. In particular, we may swap the integral and the limit to obtain I = limk→∞ ∫ d c ( ∑ k i=1 f(ξi,k, y)(1/k)(b − a) ) dy = limk→∞ ∑ k i=1 ( ∫ d c f(ξi,k, y) dy ) (1/k)(b − a) = limk→∞ ( limℓ→∞ ∑ k i=1 ∑ ℓ j=1 f(ξi,k, ηj,ℓ) (1/k)(1/ℓ)(b − a)(d − c) ). This double limit can be clearly rewritten as limk→∞ ( ∑ k i=1 ∑ ℓk j=1 f(ξi,k, ηj,ℓk) (1/k)(1/ℓk)(b − a)(d − c) ) for a suitable sequence of indices ℓk → ∞. Finally, take any sequence of indices kn → ∞ and ℓn → ∞, divide the intervals (a, b) and (c, d) into kn and ℓn equal parts, and choose some representatives (xi,kn, yj,ℓn) in all the corresponding small rectangles. Then the absolute value of the difference |∑ kn i=1 ∑ ℓn j=1 f(ξi,kn, ηj,ℓn) − ∑ kn i=1 ∑ ℓn j=1 f(xi,kn, yj,ℓn)| 568 convergence of sequences of functions. See also the discussion in 6.3.1, where similar problems are addressed. 6.C.3. Provide an example for each of the following scenarios: (a) A sequence of continuous functions (fn) that converges pointwise to a function f which is discontinuous; (b) A sequence of differentiable functions (fn) that converges pointwise to a function f which is non-differentiable; (c) A sequence of integrable functions (fn) that converges pointwise to a function f which is non-integrable. ⃝ 6.C.4. Inspect which of the sequences (fn)n∈Z+ given below converge uniformly on the given domains: (1) fn(x) = sin(nx)/n, with x ∈ R; (2) fn(x) = nx/(n + x), with x ∈ [0, 1]; (3) fn(x) = e^(x4/4n2), with x ∈ R; (4) fn(x) = e^(x/n), with x ∈ [0, 1]; (5) fn(x) = xn, with x ∈ [0, 1]. Solution. (1) In the first case it is easy to see that fn → 0 pointwise on R, that is, the limit function is the zero one, f(x) = 0. The sequence (fn) also converges uniformly on R. This is essentially derived from the inequality |fn(x)| = |sin(nx)/n| = |sin(nx)|/n ≤ 1/n. Thus, given some ε > 0, we will have |fn(x) − 0| < ε for all x ∈ R if n > 1/ε. Since 1/ε does not depend on x, we conclude. (2) We may express fn as fn(x) = x/(1 + x/n), x ∈ [0, 1], and it is obvious that (fn) is pointwise convergent on [0, 1] with limn→∞ fn(x) = f(x) = x, for all x ∈ [0, 1].
Next we see that |fn(x) − f(x)| = x2 n + x ≤ 1 n , x ∈ [0, 1] and using this we deduce that supx∈[0,1] |fn(x) − f(x)| ≤ 1 n . Thus limn→∞ supx∈[0,1] |fn(x) − f(x)| = 0 and (fn) is uniformly convergent in the interval [0, 1]. (3) The sequence (fn)n∈Z+ converges pointwise to the constant function f(x) = 1 on R, since lim n→∞ e x4 4n2 = e0 = 1, x ∈ R . However, we see that fn (√ 2n ) = e > 2, for all n ∈ Z+, which does not allow uniform convergence over R (be aware that in the definition of uniform convergence, it suffices to consider ε ∈ (0, 1)). (4) Obviously, limn→∞ e x n = 1, i.e., the pointwise limit function is the constant function f(x) = 1. Next we see that sup x∈[0,1] |fn(x) − f(x)| = sup x∈[0,1] ( e x n −1 ) = e 1 n −1 , CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS is at most ε(b−a)(c−d) whenever 1 kn < δ and 1 ℓn < δ. Thus, we see that the value of the iterated integral can be expressed as the limit of double sums, where the representatives of the individual rectangles are chosen arbitrarily.8 This expression does not depend on the order of our integration and thus the two posibilities for the iterated integrals must coincide. □ 6.3.13. Differentiation in the integrals. We are ready to discuss the differentiation of integrals with respect to parameters. The following result is extremely useful. For instance we shall use it in the next chapter when examining integral trans- forms. Differentiation with respect to parameters Theorem. Consider a continuous function f(x, y) defined for all x from a finite interval [a, b] and for all y in another finite interval [c, d], a point z ∈ [c, d], and the integral F(y) = ∫ b a f(x, y) dx. If there exists the continuous derivative d dy f on a neighbourhood of the point z, then d dy F(z) exists as well and d dy F(z) = ∫ b a d dy f(x, y)|y=z dx. Proof. By the assumed continuity of all functions and the already known continuous dependence of integrals on parameters, some knowledge about univariate antiderivatives can be used. The result is then a simple consequence of the Fubini theorem. Denote G(y) = ∫ b a d dy f(x, y) dx, F(y) = ∫ b a f(x, y) dx and compute, invoking Fubini theorem, the antiderivative H(y) = ∫ y y0 G(z) dz = ∫ y y0 (∫ b a d dz f(x, z) dx ) dz = ∫ b a (∫ y y0 d dz f(x, z) dz ) dx = ∫ b a (f(x, y) − f(x, y0)) dx = F(y) − F(y0). Finally, differentiating with respect to y yields G(y) = d dy H(y) = dF dy (y), as desired. □ 8ACtually, this is the way how we shall define the Riemann integral in more variables in Chapter 8. 569 and hence lim n→∞ sup x∈[0,1] |fn(x) − f(x)| = lim n→∞ ( e 1 n −1 ) = 1 − 1 = 0 . Thus (fn) is uniformly convergent over [0, 1]. (5) Let us first use Sage to sketch some members of (fn). 
We use the following cell: q=plot(x^5, x, 0, 1, color="steelblue") q+= text (r"$f_5(x)=x^5$ ",(0.84, 0.25), color="steelblue", fontsize ="14") #q+=plot(x^5, x, 0, pi, color="black") q+=plot(x^4, x, 0, 1, color="darkgreen") q+= text (r"$x^4$ ",(0.8, 0.47), color="darkgreen", fontsize ="14") q+=plot(x^3, x, 0, 1, color="magenta") q+= text (r"$x^3$ ",(0.8, 0.57), color="magenta", fontsize ="14") q+=plot(x^2, x, 0, 1, color="orange") q+= text (r"$x^2$ ",(0.8, 0.7), color="orange", fontsize ="14") q+=plot(x, x, 0, 1, color="darkred") q+= text (r"$f_1(x)=x$ ",(0.74, 0.85), color="darkred", fontsize ="14") q+=line([(1, 0), (1, 1^5)], linestyle="--") q.show(ymax=1) Executing this block we obtain the graph of the first 5 members of the sequence (fn), which we present here: An alternative but less informative way to plot some terms of (fn) goes as follows: var("n"); f(n, x)=x^n p=plot([f(n, x) for n in [1, 2..5]], (x, 0, 1)) show(p) As we have already seen in Chapter 5 (see 5.B.3), the limit function of (fn(x) = xn ) on [0, 1] is given by limn→∞(xn ) = f(x) = { 0 , if x ∈ [0, 1) , 1 , if x = 1 . Thus |fn(x) − f(x)| = { 0 , for x = 1 , |xn | , for x ∈ [0, 1) , CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.3.14. The Riemann–Stieltjes integral. To end this chapter, we mention briefly some other concepts of integration. Mostly we confine ourselves to remarks and comments. Readers interested in a thorough explanation can find another source. First, a modification of the Riemann integral, which is useful when discussing probability and statistics. In the discussion of integration, we summed infinitely many linearized (infinitely) small increments of the area given by a function f(x). We omitted the possibility that for different values of x we could take the increments with different weights. This can be arranged at the infinitesimal level by exchanging the differential dx for φ(x)dx for some suitable function φ, or we could even take some single points as adding the weight independentaly of the size of our intervals in the partitions. Imagine that at some point x0, the increment of the integrated quantity is given by αf(x0) independently of the size of the increment of x. For example, we may observe the probability that the amount of alcohol per mille in the blood of a driver at a test will be at most x. We might like to integrate over the possible values in the interval [0, x]. With quite a large probability the value is 0. Thus for any integral sum, the segment containing zero contributes by a constant nonzero contribution, independent of the norm of the partition. We cannot simulate such behaviour by multiplying the differential dx by some real function. Instead we generalize the Riemann integral in the following way: Riemann–Stieltjes integral Choose a real (usually) nondecreasing function g on a finite interval [a, b]. For every partition Ξ with representative ξi and points of the partition a = x0, x1, . . . , xn = b the Riemann–Stieltjes integral sum of function f(x) is SΞ = n∑ i=1 f(ξi) ( g(xi) − g(xi−1) ) . The Riemann–Stieltjes integral I = ∫ b a f(x)dg(x) exists and its value is I, if for every real ε > 0 there exists a norm of the partition δ > 0 such that for all partitions Ξ with norm smaller than δ, |SΞ − I| < ε. For example, choose g(x) on interval [0, 1] as a piecewise constant function with finitely many discontinuities c1, . . . , ck and “jumps” αi = lim x→ci+ g(x) − lim x→ci− g(x), 570 which implies that sup x∈[0,1] |fn(x) − f(x)| = sup{1, 0} = 1 . 
Therefore, limn→∞ supx∈[0,1] |fn(x) − f(x)| = 1 ̸= 0, and (xn )n∈Z+ is not uniformly convergent on [0, 1]. □ 6.C.5. Demonstrate that the sequence (fn)n∈Z+ of nonnegative functions defined by the given graph, pointwise converges to 0, yet it does not converge uniformly on [0, +∞). ⃝ 6.C.6. Show that the sequence (fn)n∈Z+ of functions defined by fn : R → R , fn(x) = √ x2 + 1 n2 , converges uniformly on R. ⃝ 6.C.7. (a) Provide an example of sequences (fn), (gn) that converge uniformly on a set I, but the product sequence (fn · gn) does not converge uniformly on I. (b) Suppose that the sequences (fn), (gn) converge uniformly to f, g, respectively, on a set I. Assume also that there exists some M > 0 such that |f(x)| < M and |g(x)| < M for all x ∈ I. Show that the sequence (fn · gn) uniformly converges to f · g on I. ⃝ 6.C.8. Inspect which of the sequences (fn)n∈Z+ given below, converge uniformly on the given domain. ⃝ Similarly with uniform convergence of sequences of functions, we can explore series of functions which converge uniformly on an interval. This is the case when the sequence of the partial sums converges uniformly. Uniform convergence of sereis has many applications. For instance, in Chapter 7 we will discuss the uniform convergence of trigonometric series, specifically focusing on Fourier series. 6.C.9. Decide whether the series ∞∑ n=1 √ x · n n4 + x2 converges uniformly on the interval (0, +∞). Solution. Using the denotation CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS then the Riemann–Stieltjes integral exists for every continuous f(x) and equals I = ∫ 1 0 f(x)dg(x) = k∑ i=1 αif(ci). By the same technique as used for the Riemann integral, we define upper and lower sums and upper and lower Riemann–Stieltjes integral. For bounded functions they always exist, and their values coincide if and only if the Riemann–Stieltjes integral in the above sense exists. We have already encountered problems with the Riemann integration of functions that are “too jumpy”. For a function g(x) on a finite interval [a, b] define its variation by varb a g = sup Ξ n∑ i=1 |g(xi) − g(xi−1)|, where the supremum is taken over all partitions Ξ of the interval [a, b]. If the supremum is infinite, we say that g(x) has unbounded variation on [a, b]. Otherwise we say that g is a function with bounded variation on [a, b]. A function is of bounded variation if and only if it can be written as the difference of two monotonic functions. We shall not provide a proof here. As in the discussion of the Riemann integral, we derive the following theorem. Properties of the Riemann–Stieltjes integral Theorem. Let f(x) and g(x) be real functions on a finite interval [a, b]. (1) Suppose g(x) is non-decreasing and continuously differentiable. Then the Riemann integral on the left hand side and the Riemann–Stieltjes integral on the right hand side either both exist or do not exist. In the former case, their values are equal ∫ b a f(x)g′ (x)dx = ∫ b a f(x)dg(x) (2) If f(x) is continuous and g(x) is a function with finite variation, then the integral ∫ b a f(x)dg(x) exists. We invite the reader to add the details of its proof. The main tools are the mean theorem, the uniform continuity of continuous functions on closed bounded intervals. The variation of g over the interval [a, b] plays the role of the length of the interval in the earlier proofs dealing with Riemann integra- tion. 6.3.15. Kurzweil-Henstock integral. 
The last topic in this chapter is a modification of the Riemann integral, which fixes the unfortunate behaviour at the third point in the paragraph 6.3.1. That is, the limits of the non-decreasing sequences of integrable functions are again integrable. Then we can interchange the order 571 fn(x) = √ x·n n4+x2 , x > 0, n ∈ N, we have f′ n(x) = n ( n4 −3x2 ) 2 √ x(n4+x2)2 , x > 0, n ∈ N. From now on, let n ∈ N be arbitrary. The inequalities f′ n(x) > 0 for x ∈ ( 0, n2 / √ 3 ) and f′ n(x) < 0 for x ∈ ( n2 / √ 3, +∞ ) imply that the maximum of function fn is attained exactly at the point x = n2 / √ 3. Since fn ( n2 √ 3 ) = 4√ 27 4n2 a ∞∑ n=1 4√ 27 4n2 = 4√ 27 4 ∞∑ n=1 1 n2 < +∞, according to the Weierstrass test, the series ∑∞ n=1 fn(x) converges uniformly on the interval (0, +∞). □ 6.C.10. For x ∈ [−1, 1], add ∞∑ n=1 (−1)n+1 n(n+1) xn+1 . Solution. First notice that by the symbol for an indefinite integral, we’ll denote one specific primitive function (while preserving the variable), which should be understood as a so called function of the upper limit, while the lower limit is zero. Using the theorem about integration of a power series for x ∈ (−1, 1), we’ll obtain ∑∞ n=1 (−1)n+1 n(n+1) xn+1 = ∑∞ n=1 ( (−1)n+1 n ∫ xn dx ) = ∫ ∞∑ n=1 ( (−1)n+1 n xn ) dx = ∫ ∑∞ n=1 ( (−1)n+1 ∫ xn−1 dx ) dx = ∫ ( ∫ ∑∞ n=1(−x)n−1 dx ) dx = ∫ ( ∫ 1 − x + x2 − x3 + · · · dx ) dx = ∫ ( ∫ 1 1+x dx ) dx = ∫ ln (1 + x) + C1 dx . Since ∫ ∞∑ n=1 ( (−1)n+1 n xn ) dx = ∫ ln (1 + x) + C1 dx, we know from the continuity of the given functions that ∞∑ n=1 (−1)n+1 n xn = ln (1 + x) + C1, x ∈ (−1, 1). The choice x = 0 then yields 0 = ln 1 + C1, i.e. C1 = 0. Next, ∫ ln (1 + x) dx = per partes = u = ln (1 + x) u′ = 1 1+x v′ = 1 v = x = x ln (1 + x) − ∫ x 1+x dx = x ln (1 + x) − ∫ 1 − 1 1+x dx = x ln (1 + x) − x + ln (1 + x) + C2 = (x + 1) ln (x + 1) − x + C2. Since the given series converges at the point x = 0 with a sum of 0, analogously as for C1 , 0 = 1 · ln 1 − 0 + C2 implies that C2 = 0. In total, we have for x ∈ (−1, 1): CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS of the limit process and integration in these cases, just as with uniform convergence. Notice what is the essence of the problem. Intuitively we assume that very small sets must have zero size. Thus the changes of values of the functions on such sets should not change the integral. Moreover, a countable union of such sets which are “negligible for the purpose of integration” should also have zero size. We would expect for example that the set of rational numbers inside a finite interval would have this property, hence its characteristic function should be integrable and the value of such an integral should be zero. We say that a set A ⊂ R has zero measure, if for every ε > 0 there is a covering of the set A by a countable system of open intervals Ji, i = 1, 2, . . . such that ∞∑ i=1 m(Ji) < ε. m(Ji) means the length of the interval Ji. In the sequel, the statement “function f has the given property on a set B almost everywhere” means that f has this property at all points except for a subset A ⊂ B of zero measure. For example, the characteristic function of rational numbers is zero almost everywhere. A piece-wise continuous function is continuous almost everywhere. Now we modify the definition of the Riemann integral so that restrictions on the Riemann sums are permitted, eliminating the effect of the values of the integrated function on sets of measure zero. 
This is achieved by a finer control of the size of the segments in the partition in the vicinity of problematic points. A positive real function δ on a finite interval [a, b] is called a gauge. A partition Ξ of interval [a, b] with representatives ξi is δ–gauged, if ξi − δ(ξi) < xi−1 ≤ ξi ≤ xi < ξi + δ(ξi) for all i. The norm δ of the partition used in the Riemann integration is a special case of constant gauges δ(x) = δ > 0. In order to restrict the Riemann sums to a gauged partition with representatives in the definition of the integral, it is necessary to know that for every gauge δ, a δ–gauged partition with representatives exists. Otherwise the condition in the definition could be satisfied in a vacuous way. This statement is called Cousin’s lemma. It is proved by exploiting the standard properties of suprema: For a given gauge δ on [a, b], denote by M the set of all points x ∈ [a, b] such that a δ–gauged partition with representatives can be found on [a, x]. M is nonempty and bounded, thus it has a supremum s. If s ̸= b, then there is a gauged partition with representatives at s, where s is in the interior of the last segment. This leads to a contradiction. Thus the supremum is b, but then the gauge δ(b) > 0 and thus b itself belongs to the set M. 572 ∞∑ n=1 (−1)n+1 n(n+1) xn+1 = (x + 1) ln (x + 1) − x. Moreover, according to Abel’s theorem (see 6.3.9), the sum of the given series equals the (potentially improper) limit of the function (x + 1) ln (x + 1) − x at points −1 and 1. In our case, both limits are proper (at point 1, the function is even continuous and the value of the limit at point 1 then equals the value of the function 2 ln 2 − 1.) For computing the value of the limit at point −1, we’ll use L’Hospital’s rule: lim x→−1+ (x + 1) ln (x + 1) − x = lim t→0+ t ln t + 1 = lim t→0+ ln t 1 t + 1 = lim t→0+ 1 t − 1 t2 + 1 = lim t→0+ −t + 1 = 1. Of course, the convergence of the series at points ±1 can be verified directly. It’s even possible to directly deduce that ∞∑ n=1 1 n(n+1) = 1 (by writing out 1 n(n+1) = 1 n − 1 n+1 . □ 6.C.11. Sum of a series. Using theorem 6.3.5 “about the interchange of a limit and an integral of a sequence of uniformly convergent functions”, we’ll now add the number series ∞∑ n=1 1 n2n . We’ll use the fact that ∞∫ 2 dx xn+1 = 1 n2n . Solution. On interval (2, ∞), the series of functions∑∞ n=1 1 xn+1 converges uniformly. That is implied for example by the Weierstrass test: each of the function 1 xn+1 is decreasing on interval (2, ∞), thus their values are at most 1 2n+1 ; the series ∑∞ n=1 1 2n+1 is convergent though (it’s a geometric series with quotient 1 2 ). Hence according to the Weierstrass test, the series of functions ∑∞ n=1 1 xn+1 converges uniformly. We can even write the resulting function explicitly. Its value at any x ∈ (2, ∞) is the value of the geometric series with quotient 1 x , so if we denote the limit by f(x), we have f(x) = ∞∑ n=1 1 xn+1 = 1 x2 1 1 − 1 x = 1 x(x − 1) . By using (6.3.7) (3), we get ∞∑ n=1 1 n2n = ∞∑ n=1 ∫ ∞ 2 dx xn+1 = ∫ ∞ 2 ( ∞∑ n=1 1 xn+1 ) dx = ∫ ∞ 2 1 x(x − 1) dx = lim δ→∞ ∫ δ 2 1 x − 1 − 1 x dx = lim δ→∞ [(ln(δ − 1) − ln(δ) − ln(1) + ln 2] = lim δ→∞ [ ln ( δ − 1 δ )] + ln(2) = ln ( lim δ→∞ δ − 1 δ ) + ln 2 = ln 2 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Now we can state the following generalization of the Riemann integral. Call it the K-integral9 . Kurzweil-Henstock integral Definition. 
A function f defined on a finite interval [a, b] has a Kurzweil-Henstock integral I = ∫ b a f(x) dx, if for every ε > 0, there exists a gauge δ such that for every δ–gauged partition with representatives Ξ, the inequality |SΞ −I| < ε is true for the corresponding Riemann sum SΞ. 6.3.16. Basic properties. When defining the K-integral, only the set of all partitions is bounded, for which the Riemann sums are taken into account. Hence if the function is Riemann integrable, then it is K-integrable, and the two integrals are equal. For the same reason, the argumentation in Theorem 6.2.8 about simple properties of the Riemann integral applies. This verifies that the K-integral behaves in the same way. In particular, a linear combination of integrable function cf(x) + dg(x) is again integrable and its integral is c ∫ b a f(x)dx + d ∫ b a g(x)dx etc. To prove this, it suffices only to think through some modifications when discussing the refined partitions, which moreover should be δ–gauged. The Kurzweil integral behaves as anticipated with respect to the sets of zero measure: Theorem. Consider a function f, defined on the interval [a, b]. If f is zero almost every, then the K-integral ∫ b a f(x)d(x) exists and is zero. Proof. The proof is an illustration of the idea that the influence of values on a “small” set can be removed by a suitable choice of gauge. Denote by M the corresponding set of zero measure, outside of which f(x) = 0 and write Mk ⊂ [a, b], k = 1, . . . , for the subset of the points for which k − 1 < |f(x)| ≤ k. Because all the sets Mk have zero measure, each of them can be covered by a countable system of pairwise disjoint open intervals Jk,i such that the sum of their lengths is arbitrarily small. Define the gauge δ(x) for x ∈ Jk,i so that the intervals (x−δ(x), x+δ(x)) are still contained in Jk,i. Outside of M, δ is defined arbitrarily. 9There are many equivalent definitions and thus also names for this K-integral. A complicated approach was coined by Arnaud Denjoy around 1912. Thus the space of real functions integrable on an interval [a, b] in this sense is often called Denjoy space. Other people involved were Nikolai Luzin and Oskar Perron. We can find the integral under their names. The simple and beautiful definition was introduced by Jaroslav Kurzweil, a Czech mathematician still living in 1957. Much of the theory was developed by Ralph Henstock (1923-2007), an English mathematician. 573 □ 6.C.12. Consider function f(x) = ∑∞ n=1 ne−nx . Determine ∫ ln 3 ln 2 f(x) dx. Solution. Similarly as in the previous case, the Weierstrass test for uniform convergence implies that the series of functions ∑∞ n=1 ne−nx converges uniformly on interval (ln 2, ln 3), since each of the functions ne−nx is lesser than n 2n on (ln 2, ln 3) and the series ∑∞ n=1 n 2n converges, which can be seen for example from the ratio test for convergence of series: lim n→∞ an+1 an = lim n→∞ (n + 1)2−(n+1 n2n = lim n→∞ 1 2 n + 1 n = 1 2 . In total, according to (6.3.7) (3), we have ∫ ln 3 ln 2 f(x) dx = ∫ ln 3 ln 2 ∞∑ n=1 ne−nx = ∞∑ n=1 ∫ ln 3 ln 2 ne−nx dx = ∞∑ n=1 [−e−nx ]ln 3 ln 2 = ∞∑ n=1 ( 1 2n − 1 3n ) = 1 − 1 2 = 1 2 . □ 6.C.13. Determine the following limit (give reasons for the procedure of computation): lim n→∞ ∫ ∞ 0 cos (x n ) ( 1 + x n )n dx. Solution. First we’ll determine lim n→∞ cos( x n )( 1+ x n )n . The sequence of these functions converges pointwise and we have lim n→∞ cos(x n ) ( 1 + x n )n = 1 lim n→∞ ( 1 + x n )n (??) = 1 ex It can be shown that the given sequence converges uniformly. 
Then according to (6.3.5) , lim n→∞ ∫ ∞ 0 cos (x n ) ( 1 + x n )n dx = ∫ ∞ 0 [ lim n→∞ cos (x n ) ( 1 + x n )n ] dx = ∫ ∞ 0 1 ex = 1 We leave the verification of uniform convergence to the reader (we only point out that the discussion is more complicated than in the previous cases). □ 6.C.14. Find the analytic function whose Taylor series is x − 1 3 x3 + 1 5 x5 − 1 7 x7 + · · · , for x ∈ [−1, 1]. ⃝ 6.C.15. From the knowledge of the sum of a geometric series, derive the Taylor series of function y = 1 5+2x CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS For any δ–gauged partition Ξ of the interval [a, b] the bound on the corresponding Riemann sum is given as n−1∑ j=0 f(ξn)(xi+1 − xi) = n−1∑ j=0 ξi∈M f(ξn)(xi+1 − xi) ≤ ∞∑ k=1 n−1∑ j=0 ξi∈Mk f(ξn) (xi+1 − xi) ≤ ∞∑ k=1 k ( n−1∑ j=0 ξi∈Mk m(Jk,j) ) . To guarantee that this bound is smaller than a given ε, it suffices to choose the covering by the intervals Jk,j so that ∞∑ j=1 m(Jk,j) =≤ ε k2k . Since ∑∞ k=1 2−k = 1, the result follows. □ Corollary. If the values of f(x) are changed on a set of zero measure, the K-integrability of f(x) is not changed, and neither is the value of its integral. 6.3.17. The fundamental theorems of Calculus. We conclude this chapter with a few remarks on the properties of integration procedures from the point of view of expectations and reality.10 In 6.2.9, we deal with the relation between the derivatives f(t) and the antiderivatives (integrals) F(t). Since f(t) is assumed continuous, two essential claims collapse into one, resulting in F(t) = ∫ t t0 f(x) dx up to the choice of the value of F(t0). In particular, ∫ t t0 F′ (dx) dx = F(t) − f(t0) for all choices of F. More generally, this can be split into two claims which hold for the K-integral under much milder conditions: 10A very good and elementary exposition of the K-integral can be found in the short paper Return to the Riemann Integral. The American Mathematical Monthly, Vol. 103, No. 8 (1996), 625-632. by Robert G. Bartle. 574 centered at the origin. Then determine its radius of convergence. ⃝ 6.C.16. Expand the function y = 1 3−2x , x ∈ ( −3 2 , 3 2 ) to a Taylor series centered at the origin. ⃝ 6.C.17. Expand the function cos2 (x) to a power series at the point π/4 and determine for which x ∈ R this series converges. ⃝ 6.C.18. Express the function y = ex defined on the whole real axis as an infinite polynomial with terms of the form an(x − 1)n and express the function y = 2x defined on R as an infinite polynomial with terms anxn . ⃝ 6.C.19. Find a function f such that for x ∈ R, the sequence of functions fn(x) = n2 x3 n2x2+1 , n ∈ N to it. Is this convergence uniform on R? ⃝ 6.C.20. Does the series ∞∑ n=1 n x n4+x2 , kde x ∈ R, converge uniformly on the whole real axis? ⃝ 6.C.21. By using differentiation, obtain the Taylor expansion of function y = cos x from the Taylor expansion of function y = sin x centered at the origin. ⃝ 6.C.22. Approximate (a) cosine of ten degress with a precision of at least 10−5 ; (b) the definite integral ∫ 1/2 0 dx x4+1 with a precision of at least 10−3 . ⃝ 6.C.23. Determine the power expansion centered at x0 = 0 of function f(x) = x∫ 0 et2 dt, x ∈ R. ⃝ 6.C.24. Find the analytic function whose Taylor series is x − 1 3 x3 + 1 5 x5 − 1 7 x7 + · · · , for x ∈ [−1, 1]. ⃝ 6.C.25. From the knowledge of the sum of a geometric series, derive the Taylor series of function y = 1 5+2x centered at the origin. Then determine its radius of convergence. ⃝ CHAPTER 6. 
DIFFERENTIAL AND INTEGRAL CALCULUS 1st and 2nd fundamental theorems of calculus Theorem. (1) Suppose the K-integral ∫ b a f(x) dx exists. Then F(t) = ∫ t a f(x dx) is continuous. The derivative F′ (t) exists and equals f(t) almost everywhere. (2) Suppose F(t) is a continuous function on [a, b] and suppose f(t) = F′ (t) exists for all but countably many exceptional points t in [a, b]. If f(t) is defined arbitrarily at those points, then F(t) = ∫ t a f(x) dx exists for all t ∈ [a, b] and equals to F(t) − F(a). We have no space here to go into proofs. Notice, the statements of the theorem are characteristic for the K-integrals, i.e. this is the only integration concept on R for which these two theorems hold true. 6.3.18. K-integrability and Lebesgue measure. We illustrate the claims in the latter theorem on the indicator function χQ of the rational numbers. Clearly its K-integral F(t) = ∫ t a χQ(x) dx exists (χQ is zero almost everywhere) and equals zero. Its derivative F′ (t) is identically zero, and equals χQ nearly everywhere. This is a good example of a bounded function which is not Riemann integrable, but is integrable in the more general sense. There are many more K-integrable functions than Riemann integrable functions. There is no difference between proper and improper integrals. More precisely, the K-integral ∫ b a f(x) dx exists if and only if the one-sided limit lim t→a− ∫ b t f(x) dx is well defined and their values coincide, and similarly for the upper limit b. This is due to the freedom in the choice of the gauges. There is only an indirect proof that there are bounded functions on a compact interval which are not K-integrable, based on some set-theoretic arguments, but there are no explicit constructions of such functions available. We say that a set of real numbers M is measurable if the K-integral of its indicator function χM exists. The assignment m : M → ∫ b a χM (x) dx for all sets M ⊂ [a, b] has the properties of a measure. The set of such measurable sets M is closed under finite intersections and countable unions. The measure m is additive with respect to unions of at most countable systems of pairwise disjoint sets. This measure coincides with the Lebesgue measure. This measure is used in another concept of integration, which is extremely useful in higher dimensional applications, the Lebesgue integral. We do not go into more details here. We remark that a real function f is Lebesgue integrable if and only if its absolute value is K-integrable. 575 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS A big advantage of the K-integral compared to other concepts is the possibility of integrating many functions which are not integrable in absolute value. Compare the concepts of convergence and absolute convergence of series. A typical example is the sinus integral over all reals. The K-integral of the sinc function ∫ ∞ 0 sin x x dx exists, while the absolute value g(x) = | sinc(x)| is not Lebesgue integrable. Such integrals are important in models for signal processing where it is necessary to aggregate potentially infinite many interferences canceling each other by different signs. 6.3.19. The convergence theorems. We have dealt with uniform convergence and Riemann integrability. With the Kintegral, there is a much nicer and stronger theorem available. A special case is the monotone convergence theorem for uniformly bounded functions f0(x) ≤ f1(x) ≤ . . . . Dominated convergence theorem Theorem. Suppose f0, f1, f2, . . . 
are K-integrable functions on an interval [a, b], converging pointwise to the limit function f. If there are two K-integrable functions g and h satisfying g(x) ≤ fn ≤ h(x), for all n ∈ N and x ∈ [a, b], then f is K-integrable too, and ∫ b a f(t) dt = lim n→∞ ∫ b a fn(t) dt. For monotone convergence, there is a stronger result saying that a sufficient and necessary condition for the Kintegrability of the pointwise limit is supn ∫ b a fn(x) dx < ∞. This theorem could not be applied in our third example in 6.3.2. There the functions fn have a "bump" which gets larger but narrower when close to the origin. The functions cannot be dominated by an integrable function. With the Riemann integral, a similar dominated convergence theorem can be proved, except that we have to guarantee the integrability of the pointwise limit f. 576 577 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS D. Additional exercises for the whole chapter Next we will describe some extra material on the notions that we have analyzed so far in Chapter 6. We begin with additional exercises on higher-order derivatives, Taylor polynomials, Taylor series and other tasks related to differentiation. Some of them are technical, and some other require a bit of our imagination. A) Material on derivatives of higher orders 6.D.1. Prove that nth derivative of the function f(x) = 1 x+1 with x ∈ A = R\{−1}, is given by f(n) (x) = (−1)n n! (x+1)n+1 for all x ∈ A. Solution. For any x in the domain of f we have f′ (x) = − 1 (x+1)2 . Thus for n = 1 the given relation holds. Assume that f(k) (x) = (−1)k k! (x+1)k+1 for some positive integer k. Then we see that f(k+1) (x) = ( (−1)k k! (x + 1)k+1 )′ = (−1)k k! ( 1 (x + 1)k+1 )′ = (−1)k k! ( −(k + 1)(x + 1)k (x + 1)′ (x + 1)2(k+1) ) = (−1)k+1 (k + 1)! (x + 1)k+2 , for all x ∈ A, where we used the identity k!(k + 1) = (k + 1)!. This shows that f(k+1) (x) = (−1)k+1 (k+1)! (x+1)k+2 and the claim follows by the principle of induction. □ 6.D.2. Compute the nth derivative of the function f(x) = 1 x(1−x) with x ∈ R\{0, 1}. Solution. For all x in the domain of f one can write f(x) = g(x) + h(x), where g(x) = 1 x and h(x) = 1 1−x , respectively. Thus it follows that f(n) (x) = g(n) (x) + h(n) (x) for all x, and it suffices to compute the nth derivatives of the functions g, h. For g we see that g′ (x) = −x−2 , g′′ (x) = 2 x−3 , g′′′ (x) = −6 x−4 = −1 · 2 · 3 x−4 , g(4) (x) = 24 x−5 = 1 · 2 · 3 · 4 x−5 . Hence clearly we have g(n) (x) = (−1)n n! x−(n+1) = (−1)n n! xn+1 , which easily follows by induction. Similarly, for h(x) we can prove that h(n) (x) = n! (1 − x)n+1 . Thus finally we deduce that f(n) (x) = n! ((−1)n xn+1 + 1 (1 − x)n+1 ) , x ∈ R\{0, 1} . □ 6.D.3. Consider the function f(x) = 1/x, with x ∈ R\{0}, and let n be a positive integer. Compute the limit lim x→+∞ f(n) (x) , for all x ̸= 0. Next confirm your answer via Sage by combining the commands limit and assume. Solution. First we need to compute the nth derivative. We see that f′ (x) = −x−2 , f′′ (x) = 2 x−3 , f′′′ (x) = −2·3 x−4 , f(4) (x) = 2·3·4 x−5 , . . . , f(n) (x) = (−1)n n! x−n−1 . Hence f(n) (x) = (−1)n n! x−n−1 and we leave as practice the proof by induction over n. To compute the given limit one may use the ratio test: lim n→+∞ f(n+1) (x) f(n)(x) = lim n→+∞ (−1)n+1 (n + 1)! x−n−2 (−1)nn! x−n−1 = lim n→+∞ (n + 1) |x| = +∞ . 
As for a Sage verification, it is necessary to use the command assume as follows var("n"); assume(x>1) limit(abs((-1)^n*(factorial(n))*(x**(-n-1))), n=oo) Otherwise, Sage prints out an error, suggesting the use of further restrictions. □ 6.D.4. Determine the Taylor expansion of third order around the point a = 0 of the following functions: g(x) = 1 cos(x) , h(x) = e− x2 2 , k(x) = sin (sin(x)) . Next verify your answers in Sage. ⃝ 578 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.D.5. For the function f(x) = esin(x) with x ∈ R compute the Taylor polynomials Tk a f(x), where k = 1, 2, 3, 4 and a = 0. Deduce that T3 0 f(x) = T2 0 f(x) for all x. Moreover, use Sage to plot the graphs of these Taylor extensions together with the graph of f. Solution. We have f(0) = e0 = 1, and compute f′ (0) = cos(x) esin(x) x=0 = 1 , f′′ (0) = [ esin(x) ( cos2 (x) − sin(x) )] x=0 = 1 , f′′′ (0) = [ esin(x) ( cos3 (x) − 3 cos(x) sin(x) − cos(x) )] x=0 = 1 − 1 = 0 , f(4) (0) = [ esin(x) ( cos4 (x) − 6 cos2 (x) sin(x) − 4 cos2 (x) + 3 sin2 (x) + sin(x) )] x=0 = 1 − 4 = −3 . Thus T1 0 f(x) = 1 + x , T2 0 f(x) = 1 + x + 1 2 x2 , T3 0 f(x) = T2 0 f(x) , T4 0 f(x) = 1 + x + 1 2 x2 − 1 8 x4 . In the figure below we present the graphs of these polynomials. To produce this figure in Sage (together with a confirmation of the expressions of Tk 0 f(x) for k = 1, . . . , 4) we applied the following block: f(x)=exp(sin(x)) p=plot(f(x), x, -2*pi, 2*pi, thickness=1.5, legend_label=r"$f$") T1(x)=taylor(f(x), x, 0, 1) p1=plot(T1(x), x, -2*pi, 2*pi, color="green", linestyle="--", legend_label=r"$T^1_{0}$") T2(x)=taylor(f(x), x, 0, 2) p2=plot(T2(x), x, -2*pi, 2*pi, color="red", linestyle="-.", legend_label=r"$T^2_{0}$") T3(x)=taylor(f(x), x, 0, 3) p3=plot(T2(x), x, -2*pi, 2*pi, color="purple", linestyle="-.", legend_label=r"$T^3_{0}$") T4(x)=taylor(f(x), x, 0, 4) p4=plot(T4(x), x, -pi, pi, color="black", linestyle=":", legend_label=r"$T^4_{0}$") show(p+p1+p2+p3+p4) □ 6.D.6. Consider the function f(x) = ln(1 + x2 )/(1 − cos(x)). Using the Taylor series centered at 0 of the enumerator and dominator of f, compute limx→0 f(x). Next present a confirmation of your result based on the l’Hopital’s rule. Solution. Recall that the Taylor series centered at 0 of ln(1 + x) and cos(x) are respectively given by ln(1 + x) = ∞∑ n=1 (−1)n−1 xn n , |x| < 1 , cos(x) = ∞∑ n=0 (−1)n x2n (2n)! x ∈ R . Thus, the Taylor series around 0 of ln(1 + x2 ) has the form ln(1 + x2 ) = ∞∑ n=1 (−1)n−1 n x2n = x2 − x4 2 + x6 3 − x8 4 + · · · , |x| < 1 , 579 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS while 1 − cos(x) = 1 − ( 1 − x2 2! + x4 4! − x6 6! + x8 8! − · · · ) = x2 2! − x4 4! + x6 6! − x8 8! + · · · , x ∈ R . We can now use these two formulas to compute the limit at hand: lim x→0 f(x) = lim x→0 ln(1 + x2 ) 1 − cos(x) = lim x→0 x2 − x4 2 + x6 3 − x8 4 + · · · x2 2! − x4 4! + x6 6! − x8 8! + · · · = lim x→0 x2 ( 1 − x2 2 + x4 3 − x6 4 + · · · ) x2 (1 2 − x2 4! + x4 6! − x6 8! + · · · ) = lim x→0 1 − x2 2 + x4 3 − x6 4 + · · · 1 2 − x2 4! + x4 6! − x6 8! + · · · = 1 1 2 = 2 . To confirm this result by an alternative method, apply the l’Hopital rule twice: lim x→0 ln(1 + x2 ) 1 − cos(x) 0 0 = lim x→0 ( ln(1 + x2 ) )′ ( 1 − cos(x) )′ = lim x→0 2x 1 + x2 sin(x) = lim x→0 2x (1 + x2) sin(x) 0 0 = lim x→0 (2x)′ ( (1 + x2) sin(x) )′ = lim x→0 2 2x sin(x) + (1 + x2) cos(x) = 2 0 + 1 = 2 . □ 6.D.7. Consider the function f(x) = ex − e−x −2x x2 − x ln(x + 1) . Using appropriately the theory of Taylor series to evaluate the limit limx→0 f(x). 
Next confirm your computation by Sage. Solution. Observe that limit that we need to compute has the indeterminate form 0/0. We have ex = 1 + x + x2 2! + x3 3! + · · · + xn n! + · · · , e−x = 1 − x + x2 2! − x3 3! + · · · + (−1)n xn n! + · · · Thus, ex − e−x = 2x + 2x3 3! + higher order terms. This implies that the numerator ex − e−x −2x is dominated by the term 2x3 3! , as x → 0. Similarly, x ln(x + 1) = x ( x − x2 2 + x3 3 − x4 4 + · · · ) = x2 − x3 2 + x4 3 − x5 4 + · · · . Thus the dominator x2 − x ln(1 + x) is dominated by the term x3 2 , as x → 0. Combining these two observations we get limx→0 f(x) = 2/3. In Sage to confirm the result give the cell f(x)=(e^x-e^(-x)-2*x)/(x^2-x*ln(1+x)); limit(f(x), x=0) □ 6.D.8. Consider the function f(x) = { 1/x2 , if x ̸= 0, 0 , if x = 0. Determine the intervals of monotonicity and find the extremes points of f, if any. Solution. The domain of f is the whole real line, but f is not continuous at x = 0, since lim x→0+ f(x) = lim x→0− f(x) = ∞ , and f(0) = 0 (try to plot the graph of f). Recall in Sage we can get the same, just by the cell show(lim(1/x^2, x=0, dir="left")); show(lim(1/x^2, x=0, dir="right")) Hence, f is neither differentiable at x = 0, i.e., f′ (0) does not exist. Now, for all x ̸= 0 we have f′ (x) = −2/x3 and hence f is strictly increasing for x < 0 and strictly decreasing for x > 0. However, at x = 0 the function f does not a maximum, since it is discontinuous at this point. □ 6.D.9. Find the asymptotes of the following functions: 580 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS (a) y = 2 arctan ( x x2 − 1 ) , x ∈ R\{±1} , (b) y = ln ( 3e2x + ex + 10 ex + 1 ) , x ∈ R . ⃝ 6.D.10. Let f : R → R be a differentiable function satisfying f(x) + ef(x) = x + 1, for all x ∈ R. (a) Show that f(0) = 0. (b) Does f have some critical points? Show that f is strictly increasing. (c) Show that f is positive for x > 0 and negative for x < 0. (d) Prove the inequality x 2 ≤ f(x) ≤ xf′ (x), for all x ∈ R. Solution. (a) The defining equation f(x) + ef(x) = x + 1 for x = 0 gives f(0) + ef(0) −1 = 0 . (∗) Suppose that f(0) = y ∈ R and consider the function g(y) = y + ey −1. Obviously, y = 0 is root of g, i.e., g(0) = 0. On the other hand, the first derivative of g is given by g′ (y) = 1 + ey and we have g′ (y) > 0 for all y ∈ R. Thus the function g is strictly increasing, which implies that y = 0 is its unique root. Thus f(0) = 0. Another solution: For y = f(0) and g(y) = y + ey −1 the relation (∗) is written as g(f(0)) = 0. However, g is injective since g is strictly increasing. Therefore, the relation g(f(0)) = 0 gives f(0) = 0. (b) By differentiating the relation f(x) + ef(x) = x + 1 with respect to x we get f′ (x)(1 + ef(x) ) = 1. Thus f′ (x) = 1 1 + ef(x) > 0 , x ∈ R . (∗∗) Hence f has no critical points, and in particular is strictly increasing. (c) To obtain this claim one can rely on the monotonicity of f and use (a). In particular, we have x > 0 ⇔ f(x) > f(0) = 0 , x < 0 ⇔ f(x) < f(0) = 0 . (d) Here we will use the second derivative of f, given by f′′ (x) = ( 1 1 + ef(x) )′ = − f′ (x) ef(x) (1 + ef(x) )2 = − ef(x) (1 + ef(x) )3 < 0 , for all ∈ R. Notice also by (∗∗) that f′ (0) = 1/2. Now, for x = 0 the given inequality holds as an equality. For x > 0 consider the set [0, x]. On this set f satisfies the conditions of the mean value theorem, hence there exist ξ ∈ (a, b) such that f′ (ξ) = f(x) − f(0) x − 0 = f(x) x , (♯) where we used (a) to replace f(0) by 0. On the other hand we saw that f′ is strictly decreasing (since f′′ (x) < 0 for all x). 
Thus, in combination with (♯) we obtain the following equivalences: 0 < ξ < x ⇔ f′ (x) > f′ (ξ) > f′ (0) ⇔ f′ (x) > f(x) x > 1 2 ⇔ xf′ (x) > f(x) > x 2 which proves the given formula for x > 0. For x < 0 one proceeds similarly by applying the mean value theorem on the set [x, 0]. □ 6.D.11. Study the local behaviour of the function f(x) = 3 √ | x |3 + 1, with x ∈ R. Solution. It is easy to deduce that f is continuous everywhere on R. Also, f(x) ≥ 1 and f(−x) = f(x) for all x ∈ R, i.e., the function f is positive and even. As for the limit behavior of f at ±∞, we compute (see also below the graph of f) lim x→±∞ 3 √ | x |3 + 1 = lim x→±∞ 3 √ | x |3 = lim x→±∞ | x | = +∞ . For the first derivative of f we compute f′ (x) =    x2 3 √ (x3+1)2 , for x > 0, 0 , for x = 0, − x2 3 √ (−x3+1)2 , for x < 0. Notice for computing f′ (0) we used the one-side limits lim x→0+ x2 3 √ (x3 + 1) 2 = 0 = lim x→0− − x2 3 √ (−x3 + 1) 2 . 581 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS We deduce that f′ (x) > 0 for all x > 0 and f′ (x) < 0 for all x < 0. This implies that f is strictly increasing on the interval (0, +∞), increasing on [0, +∞), strictly decreasing for all x ∈ (−∞, 0), and increasing on the interval (−∞, 0] (recall that f is an even function). It follows that has only one local minimum at point x0 = 0, which is a global minimum with value f(0) = 1. Next we will prove that f is convex for any x ∈ R. Indeed, the second derivative of f is given by f′′ (x) =    2x 3 √ (x3+1)5 , if x > 0; 0 , if x = 0; − 2x 3 √ (−x3+1)5 , if x < 0. Thus we deduce that f′′ (x) > 0 for all x ∈ (−∞, 0) ∪ (0, ∞), so f is strictly convex on this interval. In total, we obtained that f is convex on its whole domain (it doesn’t have some inflection points). Let us finally determine the asymptotes of Cf . Recall that a line y = ax + b is an inclined asymptote for x → ∞, if and only if both (proper) limits lim x→∞ f(x) x = a , lim x→∞ (f(x) − ax) = b , exist. Analogous statement holds for x → −∞. Hence the limits lim x→∞ f(x) x = lim x→∞ 3 √ x3 + 1 x = lim x→∞ 3 √ x3 x = 1 , lim x→∞ (f(x) − 1 · x) = lim x→∞ ( 3 √ x3 + 1 − x ) = lim x→∞   [ 3 √ x3 + 1 − x ] 3 √ (x3 + 1) 2 + x 3 √ x3 + 1 + x2 3 √ (x3 + 1) 2 + x 3 √ x3 + 1 + x2   = lim x→∞ x3 + 1 − x3 3 √ (x3 + 1) 2 + x 3 √ x3 + 1 + x2 = lim x→∞ 1 3x2 = 0 , imply that the line y = x is an asymptote at +∞. Since f is even, we immediately obtain the line y = −x as an asymptote at −∞. Let us finally present the graph of f together with the asymptotes y = ±x: □ 6.D.12. Using Sage as a tool, examine the local behaviour of the function f(x) = arctan ( x 2 − x ) . Solution. The given function is defined on A = R\{2}. Its graph does not have special symmetry. The only point of intersection of Cf with the axes is the origin [0, 0], while f is positive exactly on the open interval (0, 2). In Sage the roots of f occur by the command f(x)=arctan(x/(2-x)); show(solve(f(x)==0, x)) while one can verify that f is not even/odd by adding the cell show(bool(f(-x)==f(x))); show(bool(f(-x)==-f(x))) where for both cases Sage returns False. The function is everywhere continuous on A, and at x0 = 2 the graph of f has a jump of size π. This follows from the limits limx→2− f(x) = π 2 and limx→2+ f(x) = −π 2 , see also the graph Cf of f below: 582 CHAPTER 6. 
DIFFERENTIAL AND INTEGRAL CALCULUS This illustration occurs by adding in the initial block the cell p=plot(f(x), x, -10, 10, exclude=[2], color="black", thickness=2) p+=text(r"$f(x)=\arctan(\frac{x}{2-x})$", (5.1, 1.45), fontsize=14, color="black") p+=line([(2, -1.6), (2, 1.6)], rgbcolor=(0.2,0.2,0.2), linestyle="--") p.show(figsize=6) Moreover we see that limx→−∞ f(x) = −π 4 = limx→+∞ f(x), and thus the range of f is the set (−π/2, π/2)\{−π 4 }. Moreover, the only asymptote of Cf is the line y = −π/4 at ±∞. We also compute f′ (x) = 1 x2 − 2x + 2 , f′′ (x) = 2 (1 − x) (x2 − 2x + 2)2 , x ∈ R\{2} . The limits posed above together with the first two derivatives can be confirmed by adding in the initial block the cell show(lim(f(x), x=2, dir="+")); show(lim(f(x), x=2, dir="-")) show(lim(f(x), x=oo,)); show(lim(f(x), x=-oo)) show(diff(f, x).factor()); show(diff(f, x, 2).factor()) It is easy to see that f′ (x) > 0 for all x ∈ A and hence f is increasing at every point of its domain, see also the graph of f′ (x) above (which is coloured by green). This claim can be confirmed in Sage by adding the cell A=RealSet(x!=2); bool(diff(f, x)>0 for x in A) Notice the first command in this cell declares the domain A of f. Finally, the function f is convex on the interval (−∞, 1), and concave on the interval (1, 2), (2, +∞). In the figure at the right we included the graph of f′′ (x), which provides a graphical proof of these claims. The point x1 = 1 is the unique point of inflection with f(1) = π/4, which we can easily confirmed by adding the cell solve(diff(f, x, 2) == 0, x). □ 6.D.13. Study the local behaviour of the function f(x) = − x2 x + 1 , with x ∈ R\{−1} and use Sage to sketch the graph of f together with its asymptotes. ⃝ 6.D.14. Study the local behaviour of the function f(x) = x3 − 3x2 + 3x + 1 x − 1 and use Sage to confirm most of your computations. ⃝ 6.D.15. Study the local behaviour of the function f(x) = 3 √ x e−x . ⃝ Consider a set of n functions {f1(t), . . . , fn(t)} which are (n − 1) times differentiable. The Wronki matrix associated to this set is defined by W(f1, . . . , fn) =       f1(t) f2(t) . . . fn(t) f′ 1(t) f′ 2(t) . . . f′ n(t) f′′ 1 (t) f′′ 2 (t) . . . f′′ n (t) ... ... ... ... f (n−1) 1 (t) f (n−1) 2 (t) . . . f (n−1) n (t)       . On can prove that the set of functions {f1(t), . . . , fn(t)} is linearly independent if and only if det(W) ̸= 0. Let us describe an application of this result, which relates linear algebra with calculus. 583 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.D.16. Wronki matrix. Consider the following set B = {sin(t), cos(t), et }. (a) Show that B consists of linearly independent functions. (b) If V is the real vector space of functions spanned by B, let D : V → V be the linear operator defined by D(f) = f′ (t) = d f(t) d t . Compute the matrix of D with respect to the basis B. Moreover, find its kernel and its image. Solution. For the set B = {sin(t), cos(t), et } we compute det(W(sin, cos, exp)) = sin(t) cos(t) et cos(t) − sin(t) et − sin(t) − cos(t) et = −2et ̸= 0 , hence B consists of linearly independent vectors. We can easily verify the computation of the determinant in Sage, by typing t = var("t") W = matrix(SR, 3, 3, [[sin(t),cos(t), exp(t)],[cos(t), -sin(t), exp(t)], [-sin(t), -cos(t), exp(t)]]) W; w=W.det ( ); w.full_simplify ( ) which returns the desired answer −2 ∗ et . 
(b) For the matrix of the linear operator D with respect to the basis B we see that D(sin(t)) = 0 sin(t) + 1 cos(t) + 0et , D(cos(t)) = −1 sin(t) + 0 cos(t) + 0et , D(et ) = 0 sin(t) + 0 cos(t) + 1et . Then the coordinates of D(sin(t)), D(sin(t)) and D(et ) give the columns of the matrix of D, that is,   0 −1 0 1 0 0 0 0 1  . Let us now quickly compute the kernel of D via Sage: D = matrix(SR, 3, 3, [[0, -1, 0], [1, 0, 0], [0, 0, 1]]) D.right_kernel ( ) which prints out Vector space of degree 3 and dimension 0 over Symbolic Ring Basis matrix: [] This essentially means that Ker(D) = {0}. For the image use the command D.image (). It gives5 Vector space of degree 3 and dimension 3 over Symbolic Ring Basis matrix: [1 0 0] [0 1 0] [0 0 1] Thus the image of D is spanned by the standard basis of R3 . □ We proceed with material on parametrized curves and surfaces. 6.D.17. Find the points on the cycloid given in 6.A.45 where the tangent lines are vertical/parallel. ⃝ In differential calculus there is a plethora of important inequalities. Among them there is one with many applications, the so called Jensen inequality, This is about convex and concave functions, as we will see below. 6.D.18. Jensen inequality. For a strictly convex function f on interval I and for arbitrary points x1, . . . , xn ∈ I and real numbers c1, . . . , cn > 0 sucht that c1 + · · · + cn = 1, the inequality f ( n∑ i=1 ci xi ) ≤ n∑ i=1 ci f (xi) holds, with equality occuring if and only if x1 = · · · = xn. Notice the Jensen inequality can be also formulated in a more intuitive way: “the centroid of mass points placed upon a graph of a strictly convex function lies above this graph.” Solution. Could be proven easily by induction: for n = 2 it is just the definition of the convex function, for the induction step f (k+1∑ i=1 ci xi ) = f ( c1x1 + (1 − c1) k+1∑ i=2 ci 1 − c1 xi ) ≤ c1f(x1) + (1 − c1)f (k+1∑ i=2 ci 1 − c1 xi ) 5Recall that the basis computed by Sage is “row reduced”. 584 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS ≤ c1f(x1) + (1 − c1) (k+1∑ i=2 ci 1 − c1 f(xi) ) = k+1∑ i=1 cif(xi) . Notice above we used the inequality first for n = 2 and then for n = k. □ 6.D.19. Prove that among all (convex) n-gons inscribed into a circle, the regular n-gon has the largest area (for arbitrary n ≥ 3). Solution. Clearly it suffices to consider the n-gons inside of which lies the center of the circle. We’ll divide each such n-gon inscribed into a circle with radius r to n triangles with areas Si, i ∈ {1, . . . , n} according to the figure. With regard to the fact that sin φi 2 = xi r , cos φi 2 = hi r , i ∈ {1, . . . , n}, we have Si = xihi = r2 sin φi 2 cos φi 2 = 1 2 r2 sin φi, i ∈ {1, . . . , n}. This implies that the area of the hole n-gon is S = n∑ i=1 Si = 1 2 r2 n∑ i=1 sin φi. Thus we want to maximize the sum ∑n i=1 sin φi, while for values φi ∈ (0, π) we clearly have (1) φ1 + · · · + φn = n∑ i=1 φi = 2π. The function y = sin x is strictly concave on the interval (0, π), which means, that the function y = − sin x is strictly convex on this interval. Then according to Jensen’s inequality for ci = 1/n and xi = φi, we have − sin ( n∑ i=1 1 n φi ) ≤ − n∑ i=1 1 n sin φi, tj. sin ( n∑ i=1 1 n φi ) ≥ n∑ i=1 1 n sin φi. Moreover, we know the equality occurs exactly for φ1 = · · · = φn. If we express (using (1)) S = r2 n 2 n∑ i=1 1 n sin φi ≤ r2 n 2 sin ( n∑ i=1 1 n φi ) = r2 n 2 sin 2π n , we can see that S can attain at most the value on the right hand side. But that happens if and only if φ1 = · · · = φn (we chose xi = φi). 
Hence the regular n-gon is the one with the maximum area, because it satisfies φ1 = · · · = φn = 2π/n. □ 6.D.20. Isoperimitric quotient. For a closed curve in plane enclosing a planar region, we define its isoperimetric quotient as the number IQ := S π ( o 2π )2 = 4πS o2 , where S denotes the area of the region and o its perimeter (i.e. the length of the curve). Hence the isoperimetric quotient determines the ratio of the area of the region and the area of a circle with the same perimeter as the given region. The notation IQ is therefore not only an English abbreviation for the isoperimetric quotient, but can be also thought of as the “intelligence of the region”, with which it uses its perimeter for attaining as big area as possible. The isoperimetric theorem then states that for every closed curve, IQ ≤ 1, with equality occuring only for a circle, or (“the circle is the smartest”). Determine IQ for a regular polygon and a circle and find the sector of a circle, for which its boundary has the largest IQ Solution. First notice that the value of IQ doesn’t change with a change of scale on the axes (same on both). Because when the proportions of the region get a times bigger (for arbitrary a > 0), the perimeter also gets a times bigger and the area a2 times (it’s a square measure). Hence IQ doesn’t depend on the size of the region, but only on its shape. Thus we can consider a regular n-gon inscribed into a unit circle. According to the figure, h = cos φ = cos π n , x 2 = sin φ = sin π n , which yields on = n · x = 2n sin π n and Sn = n · 1 2 hx = n cos π n sin π n . Thus for a regular n-gon, we have IQ = 4πn cos π n sin π n 4n2 sin2 π n = π n cotg π n , which we can verify for example for a square (n = 4) with a side of length a, where IQ = 4πa2 (4a)2 = π 4 = π 4 cotg π 4 . 585 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Using the limit transition for n → ∞ and the limit lim x→0 sin x x = 1, we get the isoperimetric quotient for a circle: IQ = lim n→∞ π n cotg π n = lim n→∞ cos π n sin π n π n = cos 0 1 = 1. Of course, for a circle with radius r, we could have also directly computed IQ = 4πS o2 = 4π ( πr2 ) (2πr)2 = 1. For the boundary a of sector of a circle with radius r and central angle φ ∈ (0, 2π), we have IQ = 4πS o2 = 4π φr2 2 (2r+rφ)2 = 2πφ (2+φ)2 . Hence we’re looking for a maximum of the function f(φ) := 2πφ (2+φ)2 , φ ∈ (0, 2π). By computing f′ (φ) = 2π (2+φ)2 −2φ(2+φ) (2+φ)4 = 2π 2−φ (2+φ)3 , φ ∈ (0, 2π) we easily obtain that f′ (φ) > 0, φ ∈ (0, 2), f′ (φ) < 0, φ ∈ (2, 2π). Hence function f attains its maximal value for φ0 = 2 and for a central angle φ0 = 2 (radians), we get the largest IQ = 2πφ0 (2+φ0)2 = π 4 . For the sake of completeness, for a solid in three-dimensional space (more precisely, for the closed surface which is its boundary), we define IQ := V 4π 3 ( S 4π ) 3 2 , where V is the volume and S the surface of the solid. Thus we compare the volume of the solid with a given surface with the volume of the ball with the same space. □ 6.D.21. A string of length l is given. The task is to cut it into n parts so that it’s possible to create boundaries of geometric figures given in advance (for example a square, a triangle, a circle, a halfcircle) with the least sum of areas from the n smaller strings. Solution. To solve this problem, we’ll use the isoperimetric quotient of curves and Jensen’s inequality (stated in previous examples). For the geometric figures given in advance, denote the values of their isoperimetric quotients as 1 λi := 4πSi o2 i , i ∈ {1, . . . 
, n}, where Si is the area and oi the perimeter of the i-th figure. We’ll also use the denotation Λ := n∑ i=1 λi. Recall that the isoperimetric quotient is given only by the shape of the figure and doesn’t depend on its size. In particular, the value Λ is constant (it’s determined by the shapes of the given figures). Our task is to minimize the sum ∑n i=1 Si with ∑n i=1 oi = l. Because Si = o2 i 4πλi , i ∈ {1, . . . , n}, we need to minimize the expression S := 1 4π n∑ i=1 o2 i λi . Using Jensen’s inequality for the strictly convex function y = x2 (on the whole real axis), we obtain ( n∑ i=1 ci xi )2 ≤ n∑ i=1 ci x2 i for xi ∈ R and ci > 0 with the property c1 + · · · + cn = 1. Moreover we know that the equality occurs if and only if x1 = · · · = xn. By choosing ci = λi Λ , xi = oi λi , i ∈ {1, . . . , n}, we then get 586 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS ( n∑ i=1 λi Λ oi λi )2 ≤ n∑ i=1 λi Λ ( oi λi )2 . By several simplifications, we obtain the inequality 1 Λ2 ( n∑ i=1 oi )2 ≤ 1 Λ n∑ i=1 o2 i λi and then (notice that ∑n i=1 oi = l) l2 Λ ≤ n∑ i=1 o2 i λi , with equality again occuring for (1) x1 = · · · = xn, tj. o1 λ1 = · · · = on λn . This implies that S the smallest, if and only if (1) holds. This smallest value of S is l2 /(4πΛ). Now we only need to determine the lengths of the cut parts oi. If (1) holds, then clearly oi = kλi for all i ∈ {1, . . . , n} and certain constant k > 0. From n∑ i=1 oi = l and simultaneously n∑ i=1 oi = k n∑ i=1 λi = kΛ, we can immediately see that k = l/Λ, i.e. oi = λi Λ l, i ∈ {1, . . . , n}. Let’s take a look at a specific situation where we are to cut a string of length 1 m into two smaller ones and then create a square and a circle from them so that the sum of their areas is the smallest possible. For a square and a circle (in order), we have (see the example called Isoperimetric quotient) λ1 = 4 π , λ2 = 1, tj. Λ = λ1 + λ2 = 4+π π . Then the lengths of the respective parts are (in metres) o1 = 4 π 4+π π · 1 = 4 4+π . = 0, 56, o2 = 1 4+π π · 1 = π 4+π . = 0, 44. The area of a square with perimeter 0, 56 m (with a side of length a = 0, 14 m) is 0, 019 6 m2 and the area of a circle with perimeter 0, 44 m (and radius r . = 0, 07 m) is approximately 0, 015 4 m2 . We can verify that (in m2 l2 4πΛ = 1 4(4+π) . = 0, 035 = 0, 019 6 + 0, 015 4. □ A) Material on integration 6.D.22. Based on Sage, evaluate the integrals ∫ f(x) dx, where: (1) f(x) = x4 + ex +5 ln(x), with x > 0; (2) f(x) = √ x(1 + x3 ), with x > 0; (3) f(x) = x/ √ x + 1, with x > −1. ⃝ 6.D.23. Using any basic formula, and your pencil, determine a primitive function for the following functions: (a) f(x) = √ x √ x √ x, with x ∈ (0, +∞); (b) g(x) = (2x + 3x ) 2 , with x ∈ R; (c) h(x) = 1 √ 4 − 4x2 , with x ∈ (−1, 1); (d) k(x) = cos x 1 + sin x , with x ∈ ( −π 2 , 3π 2 ) . Then, confirm your computations via Sage. ⃝ 6.D.24. Find a primitive function for the function f(x) = ex + 3 √ 4 − x2 on the open interval (−2, 2). ⃝ 6.D.25. Evaluate the integral ∫ dx x ( ln(x) )2 + 2025x with an appropriate substitution. ⃝ 6.D.26. Evaluate the integral ∫ ex ex +2024 dx with an appropriate substitution. ⃝ 587 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.D.27. Apply integration by parts to compute the integral Λ = ∫ cos(3x + 2025)(x + 2026) dx, with x ∈ R. Solution. In terms of 6.2.4, it is convenient to set F(x) = (x + 2026), such that F′ (x) = 1. Then G′ (x) = cos(3x + 2025) and G must have the form G(x) = 1 3 sin(3x + 2025), such that G′ (x) = (3x)′ cos(3x+2025) 3 = cos(3x + 2025). 
Then we get Λ = ∫ (x + 2026) cos(3x + 2025) dx = ∫ (x + 2026) ( sin(3x + 2025) 3 )′ dx = = 1 3 (x + 2026) sin(3x + 2025) − 1 3 ∫ (x + 2026)′ sin(3x + 2025) dx = = 1 3 (x + 2026) sin(3x + 2025) − 1 3 ∫ sin(3x + 2025) dx = = 1 3 (x + 2026) sin(3x + 2025) + 1 3 · cos(3x + 2025) 3 + C , for some constant C. A direct verification in Sage occurs as usual: show(integral(cos(3*x+2025)*(x+2026), x).factor()) □ 6.D.28. Compute the integral Λ = ∫ sin(2024x) cos(x) dx, formally first and next in Sage. Solution. Recall the identities sin(α + β) = sin(α) cos(β) + sin(β) cos(α) , sin(α − β) = sin(α) cos(β) − sin(β) cos(α) . Adding these two relations we get 2 sin(α) cos(β) = sin(α + β) + sin(α − β), an identity that one can apply to compute the integral at hand. In particular, Λ = ∫ sin(2024x) cos(x) dx = 1 2 ∫ ( sin(2024x+x)+sin(2024x−x) ) dx = 1 2 ∫ sin(2025x) dx+ 1 2 ∫ sin(2023x) dx = = − 1 4050 cos(2025x) − 1 4046 cos(2023x) + C . In Sage to confirm this just type the command show(integrate(sin(2024 ∗ x) ∗ cos(x), x)). □ 6.D.29. Compute the integral I = ∫ √ 1 − x2 dx, with x ∈ (−1, 1), in two different ways. Solution. In terms of integration by parts (6.2.4), we have F(x) = √ 1 − x2, F′ (x) = −x √ 1 − x2 , G′ (x) = 1 and G(x) = x. Thus, up to a constant we have that I = ∫ √ 1 − x2 dx = x √ 1 − x2 + ∫ x2 √ 1 − x2 dx = x √ 1 − x2 − ∫ 1 − x2 − 1 √ 1 − x2 dx = = x √ 1 − x2 − ∫ √ 1 − x2 dx + ∫ 1 √ 1 − x2 dx = x √ 1 − x2 − ∫ √ 1 − x2 dx + arcsin(x) . This implies that 2 ∫ √ 1 − x2 dx = x √ 1 − x2 + arcsin(x) + C =⇒ I = ∫ √ 1 − x2 dx = 1 2 ( x √ 1 − x2 + arcsin(x) ) + C , for some constant C ∈ R. Let us now compute I by an appropriate substitution. Set x = sin(t) such that dx = cos(t) dt and x = arcsin(t). Notice that t ∈ (−π/2, π/2) for x ∈ (−1, 1), and among other things, one has 0 < cos(t) = |cos(t)| = √ cos2(t) = √ 1 − sin2 (t) . Therefore our integral I can written as I = ∫ √ 1 − x2 dx = ∫ √ 1 − sin2 (t) cos(t) dt = ∫ cos2 (t) dt = 1 2 ( t + sin(t) cos(t) ) + C , for some constant C, where for the final equality one is based on our previous computation from 6.B.8. Thus I = 1 2 ( sin(t) √ 1 − sin2 (t) + t ) + C = 1 2 ( x √ 1 − x2 + arcsin(x) ) + C , C ∈ R . □ 588 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.D.30. Compute the limit lim x→+∞ F′ (x) x4 + 1 , where F(x) = ∫ x 1 ( t sin(t) + 4 cos(4t) ) dt. Solution. Let us express F as F(x) = ∫ x 1 f(t) dt, with f(t) = t sin(t) + 4 cos(4t), for t ∈ R. Since f is continuous the function F is differentiable, with F′ (x) = x sin(x) + 4 cos(4x) for all x ∈ R. Now for all x > 0 the function g(x) = F′ (x) x4 + 1 = x sin(x) + 4 cos(4x) x4 + 1 satisfies |g(x)| = x sin(x) + 4 cos(4x) x4 + 1 ≤ |x sin(x)| + 4 |cos(4x)| x4 + 1 ≤ |x| + 4 x4 + 1 = x + 4 x4 + 1 . It follows that lim x→+∞ g(x) = 0 (since lim x→+∞ x+4 x4+1 = 0). □ 6.D.31. Suppose that the function f : R → R has continuous second derivative on R and attains a local extreme at x0 = 1. Moreover, assume that the graph Cf of f passes through the point P = [0, 3] ∈ R2 . If ∫ 1 0 ( x f′′ (x) + 4 f′ (x) ) dx = α for some real number α, find α such that f(1) = 1/3. Solution. By assumption at x0 = 1 the function f attains a local extreme, thus f′ (1) = 0. Moreover, we have f(0) = 3. Thus, ∫ 1 0 ( x f′′ (x) + 4 f′ (x) ) dx = ∫ 1 0 x f′′ (x) dx + 4 ∫ f′ (x) dx = [ x f′ (x) ]1 0 − ∫ 1 0 (x′ ) f′ (x) dx + 4 [ f(x) ]1 0 = = ( 1 · f′ (1) − 0 · f′ (0) ) + 3 [ f(x) ]1 0 = 0 + 3 ( f(1) − f(0) ) = 3 ( f(1) − 3) = 3f(1) − 9 . 
Thus 3f(1) − 9 = α, from where it follows that f(1) = (α + 9)/3, and since it should be f(1) = 1/3 we get α = −8. □ Suppose that for some rational function f we want to compute the integral ∫ f ( sin(x), cos(x) ) dx. Then, usually we apply the substitution method, and here are some hints: • If f ( sin(x), − cos(x) ) = −f ( sin(x), cos(x) ) , then apply the substitution t = sin(x); • If f ( − sin(x), cos(x) ) = −f ( sin(x), cos(x) ) , then set t = cos(x); • If f ( − sin(x), − cos(x) ) = f ( sin(x), cos(x) ) , then set t = tan(x); • If none of these equalities hold, then try to use the substitution t = tan (x/2). 6.D.32. Integrate the following functions, with x ∈ ( −π 2 , π 2 ) for all the cases: (a) f(x) = sin3 (x) 1 + 4 cos2(x) + 3 sin2 (x) ; (b) g(x) = 1 1 + sin2 (x) ; (c) h(x) = 1 2 − cos(x) . Solution. For f(x), in the denominator it appears the function β(x) = 1 + 4 cos2 (x) + 3 sin2 (x), which can be rewritten as β(x) = 4 + cos2 (x), and in the numerator only the sine function to an odd power. Thus, the substitution t = cos(x) with dt = − sin(x) dx, allows the replacement of all the sines and cosines. Indeed, ∫ sin3 (x) β(x) dx = ∫ sin(x) ( 1 − cos2 (x) ) 4 + cos2(x) dx = ∫ − ( 1 − t2 ) 4 + t2 dt = ∫ (1 − 5 4 + t2 ) dt = t − 5 2 arctan ( t 2 ) + C = cos(x) − 5 2 arctan ( cos(x) 2 ) + C . For g(x), because both the sine and cosine appear to an even power, we may use the substitution t = tan(x), and hence x = arctan(t). This leads to the relations sin2 (x) = tan2 (x) 1 + tan2 (x) = t2 1 + t2 , cos2 (x) = 1 1 + tan2 (x) = 1 1 + t2 . 589 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Combining them with the relation dx = 1 1+t2 dt, we get ∫ dx 1 + sin2 (x) = ∫ 1 1+t2 1 + t2 1+t2 dt = ∫ 1 1 + 2t2 dt = √ 2 2 arctan (√ 2t ) + C = √ 2 2 arctan (√ 2 tan(x) ) + C . Finally, for h(x) one can apply the universal substitution t = tan (x 2 ) , with sin(x) = 2t 1 + t2 , cos(x) = 1 − t2 1 + t2 , dx = 2 1 + t2 dt . In such terms we have ∫ dx 2 − cos x = ∫ 2 1+t2 2 − 1−t2 1+t2 dt = 2 ∫ dt 1 + 3t2 = 2 √ 3 3 arctan (√ 3t ) + C = 2 √ 3 3 arctan (√ 3 tan (x 2 )) + C . □ 6.D.33. Evaluate the integral I = ∫ 1 sin(x) dx with x ∈ ( 0, π 2 ) , in at least two alternative ways. Solution. We may rewrite the integrand as 1 sin(x) = sin(x) sin2 (x) = sin(x) 1 − cos2(x) . Thus, I = ∫ sin(x) 1 − cos2(x) dx and we may set t = cos(x) with dt = − sin(x) dx, which gives I = − ∫ dt 1 − t2 . However, it is easy to prove that − 1 1 − t2 = − 1 2(t + 1) + 1 2(t − 1) , hence we deduce that I = 1 2 ∫ ( 1 t − 1 − 1 t + 1 ) dt = 1 2 ln |t − 1| − 1 2 ln |t + 1| + C = 1 2 ln t − 1 t + 1 + C = 1 2 ln cos(x) − 1 cos(x) + 1 + C = = 1 2 ln tan2 (x 2 ) + C = 1 2 ln tan (x 2 ) 2 + C = ln tan (x 2 ) + C , C ∈ R . Here we applied the formulas tan (x 2 ) = √ 1 − cos(x) 1 + cos(x) and |x| 2 = x2 . Probably, the quickest way to compute the integral at hand is the substitution t = tan (x 2 ) (see also 6.D.32), with sin(x) = 2 tan (x 2 ) 1 + tan2 (x 2 ) = 2t 1 + t2 , dx = 2 1 + t2 dt . In such terms we immediately get I = ∫ 1 2t 1+t2 · 2 1 + t2 dt = ∫ 1 t dt = ln |t| + C = ln tan (x 2 ) + C, C ∈ R. Notice may exist many other alternatives based on different trigonometric identities. For instance, one may use the identity tan (x 2 ) + 1 tan (x 2 ) = tan (x 2 ) + cot (x 2 ) = 2 sin(x) , x ∈ ( − π 2 , π 2 ) , which you may verify either by your pencil, or by Sage and the command bool(2/sin(x)==(tan(x/2)+(1/tan(x/2))) for x in RealSet(-pi/2, pi/2)) □ Then next few tasks involve the integration of rational functions. 6.D.34. 
Consider the rational function Q(x) = 1 x3 − 1 , with x ̸= 1. (a) Present the decomposition of Q into partial fractions, and then verify your answer by Sage. (b) Evaluate the integral A = ∫ Q(x) dx. ⃝ 6.D.35. Determine the integral ∫ 3x + 5 x2 + 4x + 8 dx, where x ∈ R. ⃝ 6.D.36. Compute the indefinite integral of the function f(x) = 1 (x2 + x + 1)2 , with x ∈ R. ⃝ 6.D.37. Determine the indefinite integral ∫ dx x3 + 1 dx, where x ∈ R\{−1}. ⃝ 590 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Next, we present additional material related to the integration of irrational expressions. In particular, for a few, we will focus on integrals of the form ∫ f ( x, √ a x2 + b x + c ) dx , where f is a rational expression and we expect a ̸= 0, b2 − 4ac ̸= 0 for otherwise arbitrary numbers a, b, c ∈ R. In this case one can distinguish two cases, related with the existence/non-existence of real roots for the quadratic polynomial a x2 + b x + c. • If a > 0 and the polynomial a x2 + b x + c has real roots x1, x2, use the representation √ a x2 + b x + c = √ a √ (x − x1)2 x − x2 x − x1 = √ a | x − x1 | √ x − x2 x − x1 and set t2 = x−x2 x−x1 . • If a < 0 and the polynomial a x2 + b x + c has real roots x1 < x2, use the representation √ ax2 + bx + c = √ −a √ (x − x1)2 x2 − x x − x1 = √ −a (x − x1) √ x2 − x x − x1 and set t2 = x2−x x−x1 . • Finally, if the polynomial a x2 + b x + c does not have real roots (necessarily for a > 0), choose the substitution √ ax2 + bx + c = ± √ a · x ± t with any choice of the signs. Of course, here we choose the signs so that we result to as easy expression to integrate, as possible. As one can expect, for all these cases the corresponding substitutions lead to rational functions. 6.D.38. Determine the antiderivative of the following functions: (a) f(x) = 1 (x + 4) √ x2 + 3x − 4 , with x ∈ (−∞, −4) ∪ (1, +∞); (b) g(x) = 1 (x − 1) √ x2 + x + 1 , with x ̸= 1. Next confirm the computations using Sage. Solution. (a) According to the discussion above one can proceed as follows: ∫ dx (x + 4) √ x2 + 3x − 4 = ∫ dx (x + 4) √ (x − 1)(x + 4) = ∫ dx (x + 4) | x + 4 | √ x−1 x+4 = t2 = x−1 x+4 x = 5 1−t2 − 4 dx = 10t (1−t2)2 dt = ∫ 10t (1−t2)2 ( 5 1−t2 ) 5 1−t2 t dt = ∫ 2 5 1 − t2 1 − t2 dt = 2 5 sign ( 1 − t2 ) ∫ 1 dt = 2 5 sign ( 5 x + 4 ) t+C = 2 5 sign (x) √ x − 1 x + 4 +C . In Sage giving the cell f(x)=1/((x+4)*sqrt(x^2+3*x-4)); show(integral(f(x), x).factor()) we get the answer 2 √ x2 + 3 x − 4 5 (x + 4) (recall that x2 + 3x − 4 = (x + 4)(x − 1)). Hence Sage’s answer does not contain the sign of x, but obviously our solution is correct. (b) Here we see that ∫ dx (x − 1) √ x2 + x + 1 = √ x2 + x + 1 = x + t x2 + x + 1 = x2 + 2xt + t2 x = −t2 +2t−2 2t−1 + 1 dx = −2(t2 −t+1) (2t−1)2 dt = ∫ −2(t2 −t+1) (2t−1)2 −t2+2t−2 2t−1 t2−t+1 2t−1 dt = ∫ 2 t2 + 2t − 2 dt = ∫ (√ 3 3 1 t + 1 − √ 3 − √ 3 3 1 t + 1 + √ 3 ) dt = √ 3 3 ln t + 1 − √ 3 − √ 3 3 ln t + 1 + √ 3 + C = √ 3 3 ln t + 1 − √ 3 t + 1 + √ 3 + C = √ 3 3 ln √ x2 + x + 1 − x + 1 − √ 3 √ x2 + x + 1 − x + 1 + √ 3 + C . □ 591 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.D.39. Using a suitable substitution, compute ∫ dx x + √ x2 + x − 1 dx, x ∈ ( −∞, − √ 5 − 1 2 ) ∪ (√ 5 − 1 2 , +∞ ) . Solution. Even though the quadratic polynomial under the root has real roots x1, x2, we won’t solve this problem by substitution t2 = x−x2 x−x1 . We could proceed that way, but we will rather use a method we introduced for the complex roots case. 
This is because this method yields a very simple integral of a rational function, as can be seen from the calculation

∫ dx/(x + √(x² + x − 1)) = | √(x² + x − 1) = x + t, x² + x − 1 = x² + 2xt + t², x = (t² + 1)/(1 − 2t), dx = (−2t² + 2t + 2)/(1 − 2t)² dt | = ∫ (−2t² + 2t + 2)/((t + 2)(1 − 2t)) dt = ∫ ( 1 − 2/(t + 2) − (1/2)·1/(t − 1/2) ) dt = t − 2 ln|t + 2| − (1/2) ln|t − 1/2| + C = √(x² + x − 1) − x − 2 ln( √(x² + x − 1) − x + 2 ) − (1/2) ln| √(x² + x − 1) − x − 1/2 | + C ,

for some constant C. Note, though, that the recommended substitutions have an undeniable advantage, namely their universality: by using them, one can compute all integrals of the respective type. □

6.D.40. For x > 0 determine

(a) ∫ (2 + 5x)³/⁴√x³ dx , (b) ∫ ∛(1 + ⁴√x)/√x dx , (c) ∫ 1/⁴√(1 + x⁴) dx .

Solution. All three given integrals are binomial, i.e. they can be written as ∫ xᵐ (a + b xⁿ)ᵖ dx for some a, b ∈ R and m, n, p ∈ Q. Binomial integrals are usually solved by the substitution method. If p ∈ Z (not necessarily p < 0), we choose the substitution x = tˢ, where s is the common denominator of the numbers m and n; if (m + 1)/n ∈ Z and p ∉ Z, we choose a + b xⁿ = tˢ, where s is the denominator of the number p; and if (m + 1)/n + p ∈ Z (p ∉ Z, (m + 1)/n ∉ Z), we choose a + b xⁿ = tˢ xⁿ, where s is again the denominator of p. In these three cases, a reduction to an integration of a rational function is guaranteed. Hence we can easily compute

(a) ∫ (2 + 5x)³/⁴√x³ dx = ∫ x^(−3/4) (2 + 5x)³ dx = | p ∈ Z, x = t⁴, dx = 4t³ dt | = 4 ∫ (2 + 5t⁴)³ dt = 4 ∫ (8 + 60t⁴ + 150t⁸ + 125t¹²) dt = 4 ( 8t + 12t⁵ + (50/3) t⁹ + (125/13) t¹³ ) + C = 4 ( 8 ⁴√x + 12 ⁴√x⁵ + (50/3) ⁴√x⁹ + (125/13) ⁴√x¹³ ) + C .

(b) ∫ ∛(1 + ⁴√x)/√x dx = ∫ x^(−1/2) (1 + x^(1/4))^(1/3) dx = | p ∉ Z, (m + 1)/n ∈ Z, 1 + x^(1/4) = t³, x = (t³ − 1)⁴, dx = 12t² (t³ − 1)³ dt | = 12 ∫ t³ (t³ − 1) dt = 12 ∫ (t⁶ − t³) dt = 12 ( t⁷/7 − t⁴/4 ) + C = 12 ∛((1 + ⁴√x)⁴) ( (1 + ⁴√x)/7 − 1/4 ) + C .

(c) ∫ 1/⁴√(1 + x⁴) dx = ∫ (1 + x⁴)^(−1/4) dx = | p ∉ Z, (m + 1)/n ∉ Z, (m + 1)/n + p ∈ Z, 1 + x⁴ = t⁴x⁴, x = (t⁴ − 1)^(−1/4), dx = −t³ (t⁴ − 1)^(−5/4) dt | = −∫ t²/(t⁴ − 1) dt = −∫ t²/((t − 1)(t + 1)(t² + 1)) dt = −(1/4) ∫ ( 1/(t − 1) − 1/(t + 1) + 2/(t² + 1) ) dt = −(1/4) ( ln|t − 1| − ln|t + 1| + 2 arctan(t) ) + C = −(1/4) [ ln( (⁴√(1/x⁴ + 1) − 1)/(⁴√(1/x⁴ + 1) + 1) ) + 2 arctan( ⁴√(1/x⁴ + 1) ) ] + C . □

Next, we focus on definite integrals and related applications.

6.D.41. Compute the definite integrals given below:

A = ∫_{π/4}^{π/2} sin(t)/(1 − cos²(t)) dt , B = ∫₀^{ln 2} dx/(e^{2x} − 3 eˣ) , C = ∫₀^{π/2} sin(x) sin(2x) dx . ⃝

6.D.42. Use 6.B.36 to compute the integral ∫₀¹⁰ {x} dx, where {·} denotes the fractional part function.

Solution. Recall by 5.E.57 that the fractional part function is defined by {x} = x − ⌊x⌋, and clearly this function is periodic with period T = 1. The integral at hand corresponds to the grey area in the following figure, which equals 10 · (1/2) = 5 (obviously, this is the area of 10 half square boxes with sides of length 1). Indeed,

∫₀¹⁰ {x} dx = 10 ∫₀¹ {x} dx = 10 ∫₀¹ x dx = 10 [x²/2]₀¹ = 5 .

For your convenience, here is the code used in Sage to produce the previous figure and confirm the result:

fract(x)=x-floor(x)
p=plot(fract, x, 0, 10, color="black", aspect_ratio=1, fill=True, fillcolor="grey", figsize=8); show(p)
integral(fract, x, 0, 10)

□

6.D.43. For some parameter α ∈ R, consider the function

f(x) = { eˣ cos(2x) , if x ∈ [−π/2, 0] ; √(α + sin(4x)) , if x ∈ (0, 3π/8] .

(a) Determine α such that f is continuous for any x in its domain.
Then use your answer and Sage to execute the following: • Indicate the whole domain of f but also the domains of the two components of f. • Sketch the graph of f via the piecewise method in Sage (see also 5.C.5). (b) Compute the area in between the graph of f and the x-axis, bounded from the vertical lines x = ±π 4 . (c) Use Sage in order to • Confirm your result in (b), in a “manual way”. • Compute the error that one obtains, if any, integrating the function f, as introduced via the piecewise method. Solution. At x0 = 0 we see that lim x→0− f(x) = lim x→0− ( ex cos(2x) ) = 1 = f(0). Hence we need lim x→0+ f(x) = lim x→0+ √ α + sin(4x) = 1 ⇐⇒ √ α + lim x→0+ sin(4x) = 1 , that is, √ α + 0 = 1 and hence α = 1. Thus f is continuous everywhere in its domain if and only if α = 1 (obviously, f is continuous on the intervals [−π/2, 0) and (0, 3π/8]). From now fix α = 1 and observe that f(−π/4) = 0 and f(3π/8) = 0. Sage has built-in commands to indicate the domain of a piecewise function and of its components, given by domain and domains, respectively. Hence we may type f1(x)=(e^x)*cos(2*x); f2(x)=sqrt(1+sin(4*x)) f=piecewise([[(-pi/2, 0), f1(x)], [(0, 6*pi/16), f2(x)]]) f.domain(); f.domains() Sage returns the obvious answers, i.e., (−1/2 ∗ pi, 0) ∪ (0, 3/8 ∗ pi), and ((−1/2 ∗ pi, 0), (0, 3/8 ∗ pi)), respectively. Let us now present the graph of f, where we have included the two vertical lines from the question in (b), and shaded the region whose area needs to be computed. 593 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS The code to construct this figure relies on many different steps, and some advanced options from 2D-graphics in Sage, as for example the command polygon to shade the region n question. In general, when one wants to integrate a function f from a to b and compute the area between f and the x-axis, with bounds the vertical lines passing from a, b, then can follow the summary given here: Step 1 introduce the function f; Step 2 declare the points a, b Step 3 sketch the graph of f, e.g., add the command q= plot(f, (x, a-0.5, b+0.5), thickness=2) Step 4 graph the two vertical lines, which can be done by adding the syntax q+= line([(a,0),(a,f(a))], color=’black’) q+= line([(b,0),(b,f(b))], color=’black’) Step 5 shade the area at hand, via the polygon method: q += polygon( [(a,0),(a,f(a))] +[(x, f(x)) for x in [a,a+0.005,..,b]] +[(b,0),(a,0)], rgbcolor=(0.2,0.4,0.7),aspect_ratio="automatic") Step 6 produce the figure, which can be done by adding show(q) (we may specify further options here, as ymin, ymax, ticks, figsize, and other, hence instead one may type q.show( ), and add inside the parentheses the appropriate options). For our example this method becomes a few more complicated, since we have a piecewise function, and hence we use first the piecewise method to introduce f. Here is the full implementation combining both methods. 
f1(x)=(e^x)*cos(2*x); f2(x)=sqrt(1+sin(4*x))
f=piecewise([[(-pi/2, 0), f1(x)], [(0, 6*pi/16), f2(x)]])
p=plot(f(x), x, -pi/2, 6*pi/16, color="black", thickness=1.2)
p+=plot(f2(x), x, 0, 6*pi/16, color="darkgreen", thickness=1.2)
p+=text("$f(x)$", (pi/16, 1.47), color="black", fontsize="12")
a=-pi/4; b=pi/4
p+=line([(a, 0), (a, 1)], color="black")
p+=line([(b, 0), (b, f(N(b)))], color="black")
p+=polygon([(-pi/4,0), (-pi/4,f(N(-pi/4)))]+[(x, f(N(x))) for x in [-pi/4,-pi/4+0.005,..,pi/4]]+[(b,0),(a,0)], rgbcolor=(0.5,0.7,0.8), aspect_ratio="automatic")
p.show(ticks=pi/4, tick_formatter=pi, ymin=-0.25, ymax=1.5)

Notice that this technique requires the values passed to f to be numerical expressions, which is the reason behind the appearance of f(N(b)), f(N(x)), etc. Otherwise Sage returns errors. However, as we will see below, this method produces a small error in the computation of the given area; this is because the piecewise method uses approximations of a, b, which affect the integration.

(b) Let us denote by E the shaded area. By definition we have

E = ∫_{−π/4}^{π/4} |f(x)| dx = ∫_{−π/4}^{π/4} f(x) dx = ∫_{−π/4}^{0} eˣ cos(2x) dx + ∫₀^{π/4} √(1 + sin(4x)) dx = E₁ + E₂ ,

since f is non-negative on [−π/4, π/4]. For the first integral, recall that (sin(2x)/2)′ = cos(2x) and (−cos(2x)/2)′ = sin(2x). Thus, by applying integration by parts twice we get

E₁ = ∫_{−π/4}^{0} eˣ (sin(2x)/2)′ dx = (1/2)[eˣ sin(2x)]_{−π/4}^{0} − (1/2)∫_{−π/4}^{0} eˣ sin(2x) dx
= (1/2)( e⁰ · sin(0) − e^{−π/4} · sin(−π/2) ) + (1/4)∫_{−π/4}^{0} eˣ (cos(2x))′ dx
= (1/2) e^{−π/4} sin(π/2) + (1/4)[eˣ cos(2x)]_{−π/4}^{0} − (1/4)∫_{−π/4}^{0} eˣ cos(2x) dx .

Thus E₁ = (1/2) e^{−π/4} + (1/4)( e⁰ · cos(0) − e^{−π/4} · cos(−π/2) ) − (1/4) E₁, or equivalently E₁ = (2 e^{−π/4} + 1)/5 ≈ 0.382375.

For E₂, we use the identities sin(2y) = 2 sin(y) cos(y) and sin²(y) + cos²(y) = 1, valid for any y ∈ R, where we replace y by 2x. Thus we can write

E₂ = ∫₀^{π/4} √( sin²(2x) + cos²(2x) + 2 sin(2x) cos(2x) ) dx = ∫₀^{π/4} √( (sin(2x) + cos(2x))² ) dx = ∫₀^{π/4} ( sin(2x) + cos(2x) ) dx = [ −cos(2x)/2 + sin(2x)/2 ]₀^{π/4} = 1/2 + 1/2 = 1 .

Thus all together E = E₁ + E₂ = (2 e^{−π/4} + 1)/5 + 1 = (2 e^{−π/4} + 6)/5 ≈ 1.382375.

(c) A confirmation of the result above occurs easily, without using the piecewise method to introduce f; from the mathematical point of view it relies on the relation E = E₁ + E₂. Hence we can type

f1(x)=e^(x)*cos(2*x); f2(x)=sqrt(1+sin(4*x))
show(integral(f1(x), x, -pi/4, 0))
show(integral(f2(x), x, 0, pi/4))

and this is what we mean by "manually". Execute this cell yourself to check Sage's output. On the other hand, introducing the function f (in Sage) via the piecewise method gives us the ability to integrate f directly, as follows:

f1(x)=(e^x)*cos(2*x); f2(x)=sqrt(1+sin(4*x))
f=piecewise([[(-pi/2, 0), f1(x)], [(0, 6*pi/16), f2(x)]])
f.integral(x, N(-pi/4), N(pi/4))

However, since this still requires introducing the particular values a = −π/4 and b = π/4 numerically, some error appears. Indeed, executing this block, Sage prints out 1.16777341450385, which differs from the result in (b). At the same time, in this particular example we cannot replace the last command by f.integral(x, -pi/4, pi/4), which is a pitfall of Sage. □

6.D.44. Compute the area S of a figure composed of two parts of the plane bounded by the lines x = 0, x = 1, x = 4, the x-axis, and the graph of the function y = 1/∛(x − 1).

Solution. First realize that 1/∛(x − 1) < 0 for x ∈ [0, 1), 1/∛(x − 1) > 0 for x ∈ (1, 4], and lim_{x→1⁻} 1/∛(x − 1) = −∞, lim_{x→1⁺} 1/∛(x − 1) = +∞; these one-sided limits are checked in Sage right below.
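Since Sage's default power (x − 1)^(−1/3) is complex-valued for x < 1, a quick check of these limits is easiest with the real-valued rewriting 1/∛(x − 1) = −(1 − x)^(−1/3) for x < 1. A minimal sketch:

show(lim(-(1 - x)^(-1/3), x=1, dir="minus"))   # expect -Infinity
show(lim((x - 1)^(-1/3), x=1, dir="plus"))     # expect +Infinity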
The first part of the figure (below the x-axis) is thus bounded by the curves y = 0, x = 0, x = 1, y = 1/∛(x − 1), with an area given by the improper integral

S₁ = −∫₀¹ 1/∛(x − 1) dx ;

while the second part (above the x-axis), which is bounded by the curves y = 0, x = 1, x = 4, y = 1/∛(x − 1), has an area of

S₂ = ∫₁⁴ 1/∛(x − 1) dx .

Since ∫ 1/∛(x − 1) dx = (3/2) ∛((x − 1)²) + C, the sum S₁ + S₂ equals

S = −lim_{x→1⁻} ( (3/2) ∛((x − 1)²) − 3/2 ) + lim_{x→1⁺} ( (3/2) ∛9 − (3/2) ∛((x − 1)²) ) = (3/2) ( 1 + ∛9 ) .

We have shown, among other things, that the given figure has a finite area even though it is unbounded (both from above and from below): if we approach x = 1 from the right, or from the left, its height grows beyond all bounds. Recall here the indeterminate expression of type 0 · ∞. Note that the figure becomes bounded if we restrict ourselves to x ∈ [0, 1 − δ] ∪ [1 + δ, 4] for an arbitrarily small δ > 0. □

6.D.45. Determine the surface and volume of the circular paraboloid created by rotating the part of the parabola y = 2x² for x ∈ [0, 1] around the y-axis.

Solution. The formulas stated in the theory hold for rotation of curves around the x-axis! Hence it is necessary to integrate with respect to the variable y, i.e. to view x as a function of y, with x² = y/2 and y ∈ [0, 2]. This gives

V = π ∫₀² x² dy = π ∫₀² (y/2) dy = π

and

S = 2π ∫₀² √(y/2) √(1 + 1/(8y)) dy = 2π ∫₀² √(y/2 + 1/16) dy = π (17√17 − 1)/24 . □

A) Material on sequences, series and limit processes

We begin with additional material related to the uniform convergence of sequences of functions and of series.

6.D.46. Uniform convergence. Let (fₙ) be a sequence of functions defined on an interval I such that limₙ→∞ fₙ(x) = f(x) for all x ∈ I. Prove that (fₙ) converges uniformly to f on I if and only if limₙ→∞ aₙ = 0, where aₙ := sup_{x∈I} |fₙ(x) − f(x)|.

Solution. Suppose that (fₙ) converges uniformly to f on I. Then, given ε > 0 there exists an integer N such that for any n ≥ N and x ∈ I we have |fₙ(x) − f(x)| < ε. Therefore we also have aₙ = sup_{x∈I} |fₙ(x) − f(x)| ≤ ε for all n ≥ N, and thus aₙ → 0 as n → ∞. Conversely, assume that limₙ→∞ aₙ = 0, where aₙ is defined as above. Then, for any ε > 0 there exists an integer N such that aₙ < ε for all n ≥ N, that is,

sup_{x∈I} |fₙ(x) − f(x)| < ε , for all n ≥ N .

By the definition of the supremum,

|fₙ(x) − f(x)| ≤ sup_{x∈I} |fₙ(x) − f(x)| < ε ,

for sufficiently large n and for all x ∈ I. Thus (fₙ) converges uniformly to f on I. □

6.D.47. Natural logarithm. Expand the natural logarithm f(x) = ln(1 + x) into a power series around 0 and 1, and next determine all x ∈ R for which these series converge.

Solution. First we determine the expansion around the point 0. To expand a function into a power series at a given point is the same as to determine its Taylor expansion at that point. We can easily see that

f⁽ⁿ⁾(x) = [ln(x + 1)]⁽ⁿ⁾ = (−1)ⁿ⁺¹ (n − 1)!/(x + 1)ⁿ ,

so after computing the derivatives at zero, we have f(x) = ln(x + 1) = ln 1 + Σₙ₌₁^∞ aₙ xⁿ, where the coefficients aₙ are given by aₙ = (−1)ⁿ⁺¹ (n − 1)!/n! = (−1)ⁿ⁺¹/n. Therefore, one can write

f(x) = ln(x + 1) = x − (1/2) x² + (1/3) x³ − (1/4) x⁴ + · · · = Σₙ₌₁^∞ ((−1)ⁿ⁺¹/n) xⁿ .

For the radius of convergence, we can use the limit of the quotient of consecutive coefficients of the power series:

r = 1/limₙ→∞ |aₙ₊₁/aₙ| = 1/limₙ→∞ ( (1/(n + 1))/(1/n) ) = 1 .

Hence the series converges for arbitrary x ∈ (−1, 1).
For x = −1 we get the harmonic series (with a negative sign), while for x = 1 we get the alternating harmonic series, which converges by the Leibniz criterion. Thus the given series converges exactly for x ∈ (−1, 1]. Analogously, for the expansion at point 1, we get f(x) = ln(x + 1) = ln(2) + 1 2 (x − 1) − 1 8 (x − 1)2 + 1 3 · 23 (x − 1)3 − . . . = ln(2) + ∞∑ n=1 (−1)n+1 n · 2n (x − 1)n , and for the radius of convergence of this series we get r = 1 limn→∞ an+1 an = 1 limn→∞ 1 2n+1(n+1) 1 2n n = 1 . □ 6.D.48. Expand the function cos2 (x) into a power series at 0, and determine for which x ∈ R it converges. ⃝ 6.D.49. Expand the function sin2 (x) into a power series at 0 and determine for which x ∈ R it converges. ⃝ 6.D.50. Expand the function ln(x3 + 3x2 + 3x + 1) into a power series at 0 and determine for which x ∈ R it converges. ⃝ 6.D.51. Expand the function ln( √ x) into a power series at 1 and determine for which x ∈ R it converges. ⃝ Now is the time to demonstrate various ways to combine the theory of Taylor series with differentiation and integration. Further similar tasks are presented in Section D (see for example 6.D.58). To make the description more engaging, we will also utilize Sage. 6.D.52. On the interval of convergence (−1, 1), determine the sum of the series ∞∑ n=1 n (n + 1) xn . Next confirm your answer in Sage. 597 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Solution. We have ∞∑ n=1 n (n + 1) xn = ∞∑ n=1 n ( xn+1 )′ = ( ∞∑ n=1 n xn+1 )′ = ( ∞∑ n=1 n xn−1 x2 )′ = [ x2 ∞∑ n=1 (xn ) ′ ]′ = [ x2 ( ∞∑ n=1 xn )′ ]′ = [ x2 ( −1 + ∞∑ n=0 xn )′ ]′ = [ x2 ( −1 + 1 1 − x )′ ]′ = [ x2 · 1 (1 − x)2 ]′ = 2x (1 − x)3 , for all x ∈ (−1, 1). To confirm the result in Sage use the sum command, as follows: var("n") show(sum(n*(n+1)*x^n, n, 1, oo).factor()) □ 6.D.53. Use the Taylor series of f(x) = cos( √ x) centered at 0 to approximate the integral I = ∫ 1 0 cos( √ x) dx. Then use Sage to compute the real value of I, and compare the two results. ⃝ 6.D.54. Taylor series of arctan. Consider the function f(x) = 1/(1+x2 ), x ∈ (−1, 1). Use the Taylor series of f centered at 0 and the notion of definite integrals to compute the power series of arctan(x) around 0. Solution. Recall that for |x| < 1 we have 1 1−x = ∑∞ n=0 xn . Thus the replacement of x by −x2 gives (see also 5.E.142) 1 1 + x2 = ∞∑ n=0 (−1)n x2n , |x| < 1 . Hence, for all t ∈ (−1, 1) we have arctan′ (t) = 1 1 + t2 = ∞∑ n=0 ( −t2 )n = ∞∑ n=0 (−1)n t2n . Now, for all x ∈ (−1, 1) we have x∫ 0 arctan′ (t) dt = arctan(x) − arctan(0) = arctan(x). On the other side x∫ 0 ( ∞∑ n=0 (−1)n t2n ) dt = ∞∑ n=0  (−1)n x∫ 0 t2n dt   = ∞∑ n=0 (−1)n 2n + 1 x2n+1 , and thus arctan(x) = ∞∑ n=0 (−1) n 2n + 1 x2n+1 for all x ∈ (−1, 1). □ 6.D.55. Find the Maclaurin series of the function A(x) = ∫ x 0 t cos(t2 ) dt . Solution. The Maclaurin series of cos(t) is 1 − t2 2! + t4 4! − · · · = ∞∑ n=0 (−1)n (2n)! t2n . Using this we see that t cos(t2 ) = t ∞∑ n=0 (−1)n t4n (2n)! = ∞∑ n=0 (−1)n t4n+1 (2n)! . 598 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Therefore, A(x) = ∫ x 0 t cos(t2 ) dt = ∫ x 0 ∞∑ n=0 (−1)n (2n)! t4n+1 dt = ∞∑ n=0 [(−1)n (2n)! ∫ x 0 t4n+1 dt ] = ∞∑ n=0 (−1)n (4n + 2)(2n)! x4n+2 . Let us write down the first few terms of A, A(x) = 1 2 x2 − 1 12 x6 + 1 240 x10 − 1 10080 x14 + · · · . A confirmation of this expression via Sage is based on the method presented in 6.A.17. However, one should now combine this method with the command integral(g(t), t, 0, x), to introduce a function of the form ∫ x 0 g(t) dt, where g(t) = t cos(t2 ). 
Thus the implementation goes as follows: var("t"); assume(x>0) f(x)=integral(t*cos(t^2), t, 0, x) tf=taylor(f, x, 0, 6) T=tf.power_series(QQ); show(T) □ 6.D.56. For the convergent series ∞∑ n=0 (−1)n √ n + 100 estimate the error of the approximation of its sum by the partial sum s9999. ⃝ 6.D.57. Approximate the expression given below with an error lesser than 1/10: 2∫ 1 ( x − cos10 (x) 10 ) ln(x) dx . ⃝ 6.D.58. Find the Maclaurin series of the function F(x) = ∫ x 0 sin(t) t dt, with x ∈ R. Next check your answer via Sage, and plot the graph of the function F for certain x in the domain of F. Solution. Recall that sin(t) = t − t3 3! + t5 5! − · · · = ∞∑ n=0 (−1)n (2n + 1)! t2n+1 for all t ∈ R, hence sin(t) t = 1 − t2 3! + t4 5! − · · · = ∞∑ n=0 (−1)n (2n + 1)! t2n . By integrating this series we get the series expansion of F(x), i.e., F(x) = ∫ x 0 sin(t) t dt = ∫ x 0 [ ∞∑ n=0 (−1)n (2n + 1)! t2n ] dt = ∞∑ n=0 [ (−1)n (2n + 1)! ∫ x 0 t2n dt ] = ∞∑ n=0 [ (−1)n (2n + 1)! t2n+1 (2n + 1) ]x 0 = ∞∑ n=0 (−1)n (2n + 1)(2n + 1)! x2n+1 . Hence the series expansion of F is given by F(x) = x − x3 3 · 3! + x5 5 · 5! − · · · + (−1)n (2n + 1)(2n + 1)! x2n+1 + · · · We can easily check this result in Sage by the method described in 6.A.17 and 6.D.55, that is, 599 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS var("t"); assume(x>0); F(x)=integral(sin(t)/t, t, 0, x) tF=taylor(F, x, 0, 7); T=tF.power_series(QQ); show(T) This cell prints out the following expression: x − 1 18 x3 + 1 600 x5 − 1 35280 x7 + O(x8 ), hence verifies our answer. In order to plot the graph of F for certain x we can for example add in the above cell the command show(plot(F, x, −10, 10)). □ 6.D.59. Find the function f to which, for x ∈ R, the sequence of functions fn(x) = n2 x3 n2x2+1 , n ∈ N. converges. Is this convergence uniform on R? ⃝ 6.D.60. Does the series ∞∑ n=1 n x n4+x2 , kde x ∈ R, converge uniformly on the real line? ⃝ 6.D.61. Approximate (a) the cosine of ten degrees with accuracy of at least 10−5 ; (b) the definite integral ∫ 1/2 0 dx x4+1 with accuracy of at least 10−3 . ⃝ 6.D.62. Determine the power series centered at x0 = 0 of the function f(x) = x∫ 0 et2 dt, x ∈ R. ⃝ 6.D.63. Using the integral test, find the values a > 0 for which the series ∞∑ n=1 1 na converges. ⃝ 6.D.64. For which x ∈ R does the series ∞∑ n=1 ln(n!) nx converge? ⃝ 6.D.65. Determine whether the series ∞∑ n=1 (−1)n−1 tan 1 n √ n converges absolutely, converges conditionally, diverges to +∞, diverges to −∞, or none of the above. (such a series is sometimes said to be oscillating). ⃝ 6.D.66. Calculate the series ∞∑ n=1 1 n·3n with the help of an appropriate power series. ⃝ 600 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Solutions of the exercises 6.A.5. For all x ∈ (−1, 1) we have ln 1 + x 1 − x = ln (1 + x) − ln (1 − x). Therefore, it is useful to consider the auxiliary function φ(x) := ln (ax + 1) , x ∈ (−1, 1) , a = ±1 . For this function we easily compute φ′ (x) = a ax + 1 , φ′′ (x) = −a2 (ax + 1)2 , φ(3) (x) = 2a3 (ax + 1)3 , φ(4) (x) = −6a4 (ax + 1)4 , for all x ∈ (−1, 1). Motivated by these formulas, one may guess that φ(n) (x) = (−1)n−1 (n − 1)! an (ax + 1)n , x ∈ (−1, 1) , n ∈ N . (∗) Let us use the principle of mathematical induction to prove (∗). As we have seen, it holds for n = 1, 2, 3, 4 and we assume that (∗) holds for some other k ∈ N. Then, a direct computation shows that φ(k+1) (x) = ( φ(k) (x) )′ = ( (−1)k−1 (k − 1)! ak (ax + 1)k )′ = (−1)k−1 (k − 1)! ak (−k) a (ax + 1)k+1 = (−1)k k! 
aᵏ⁺¹/(ax + 1)ᵏ⁺¹ ,

and hence the relation (∗) is true for all n ∈ N. Let us now return to our initial task. By (∗) we may compute the nth derivative of ln(1 ± x), that is,

ln⁽ⁿ⁾(1 + x) = (−1)ⁿ⁻¹ (n − 1)!/(x + 1)ⁿ , ln⁽ⁿ⁾(1 − x) = −(n − 1)!/(−x + 1)ⁿ , x ∈ (−1, 1) .

Therefore, we finally deduce that ( ln (1 + x)/(1 − x) )⁽ⁿ⁾ = (n − 1)! ( 1/(1 − x)ⁿ − (−1)ⁿ/(1 + x)ⁿ ) for all x ∈ (−1, 1) and n ∈ N.

6.A.6. The answer is f⁽¹²⁾(x) = 2¹² e²ˣ + cos(x).

6.A.7. The answer is f⁽²⁶⁾(x) = −sin(x) + 2²⁶ e²ˣ.

6.A.8. Recall that the third-order Taylor expansion of a function f around the point a = 0 is given by

T³₀f(x) = f(0) + f′(0) x + (f′′(0)/2) x² + (f⁽³⁾(0)/6) x³ .

For the sine function f(x) = sin(x) we have f′(0) = cos(0) = 1, f⁽²⁾(0) = −sin(0) = 0, f⁽³⁾(0) = −cos(0) = −1, and f(0) = 0. Thus we get

T³₀ sin(x) = x − (1/6) x³ = x − x³/3! .

In a similar way we obtain

T³₀ cos(x) = 1 − x²/2 , T³₀ eˣ = 1 + x + x²/2 + x³/6 , T³₀ ln(x + 1) = x − x²/2 + x³/3 .

6.A.10. The answer is T⁴₁f(x) = 2(x − 1) − (x − 1)² + (2/3)(x − 1)³ − (1/2)(x − 1)⁴. To obtain this expression via Sage, we may type

f(x)=ln(x^2); T(x)=taylor(f(x), x, 1, 4); show(T(x))

Adding in this cell the commands

p=plot(f(x), x, 0, 2, color="black", thickness=1.1)
p+=plot(T(x), x, 0, 2, linestyle="--", thickness=1.1)
p+=text(r"$f$", (1.95,1.8), fontsize=14, color="black")
p+=text(r"$T$", (1.95,0.8), fontsize=14, color="blue")
show(p)

one can produce the required graphs, which we present here (the graph of the Taylor polynomial is coloured blue). Observe that the approximation is accurate enough. In fact, try to verify yourself that as the degree of the Taylor polynomial increases, the approximations become more accurate.

6.A.13. Consider the exponential function f(x) = eˣ restricted to the closed interval [0, x]. The third-order Taylor polynomial of f(x) = eˣ centered at a = 0 has the form

T³₀ eˣ = 1 + x + x²/2! + x³/3! .

Now, f is a smooth function, and according to the theorem in 6.1.3 there exists some c ∈ (0, x) such that

f(x) = eˣ = T³₀ eˣ + R(x) , R(x) = (x⁴/4!) f⁽⁴⁾(c) .

We have f⁽⁴⁾(x) = eˣ, and hence we see that R(x) = (x⁴/4!) e^c > 0. Thus f(x) − T³₀f(x) ≥ 0, which proves our claim.

6.A.14. In this case we have k = 2, and hence we compute the error −x³/(3(1 + x)³).

6.A.15. We compute T⁴₀ sin(x) = x − x³/6. In Sage the task can be solved by the block

f(x)=sin(x); F=plot(f(x), x, -pi, pi)
T4(x)=taylor(f(x), x, 0, 4); show(T4(x))
T=plot(T4(x), x, -pi, pi, color="black"); show(F+T)

This confirms the expression of T⁴₀ sin(x) and produces the graphs of f and T⁴₀f, which we present in the figure below (in this figure, the graph of f is coloured blue). Next we may approximate sin 1° by the expression sin 1° ≈ π/180 − π³/(6 · 180³).

6.A.16. The answer is 1 − π²/(10² · 2) + π⁴/(10⁴ · 4!).

6.A.18. One can use the known formula 1/(1 + x) = Σₙ₌₀^∞ (−x)ⁿ = Σₙ₌₀^∞ (−1)ⁿ xⁿ, corresponding to a geometric series. By differentiating it, we obtain

−1/(1 + x)² = ( Σₙ₌₀^∞ (−1)ⁿ xⁿ )′ = Σₙ₌₁^∞ (−1)ⁿ n xⁿ⁻¹ .

Thus we see that

1/(1 + x)² = Σₙ₌₁^∞ (−1)ⁿ⁺¹ n xⁿ⁻¹ = 1 − 2x + 3x² − 4x³ + 5x⁴ − 6x⁵ + 7x⁶ − 8x⁷ + · · ·

for all x ∈ (−1, 1). Finally, to confirm this expression by Sage we use the same method as in 6.A.17, i.e.,

f(x)=1/(1+x)^2; tf=taylor(f, x, 0, 10); T=tf.power_series(QQ); show(T)

Sage's answer has the form 1 − 2x + 3x² − 4x³ + 5x⁴ − 6x⁵ + 7x⁶ − 8x⁷ + 9x⁸ − 10x⁹ + 11x¹⁰ + O(x¹¹).

6.A.19.
The first limit equals to −1/6, while the second limit takes the value −1. 6.A.22. An example is given by f(x) = x4 . The point x0 = 0 is a local minimum of f but f′ (0) = 0 and f′′ (0) = 0. 6.A.26. For the solution one can adopt the method presented in 6.A.25, which means that we should use appropriately the commands find_local_maximum(f, a, b) and find_local_minimum(f, a, b). The implementation takes the form f(x)=(1/8)*x^8-(3/5)*x^5-3*x+9 fp=plot(f(x), x, -2, 2, color="black", legend_label=r"$f(x)$") show(fp) show(find_local_maximum(f, -2, 2)) show(find_local_minimum(f, -2, 2)) g(x)=x^2-(1/6)*x^3 gp=plot(g(x), x, -2, 2, color="black", legend_label=r"$g(x)$") show(gp) show(find_local_maximum(g, -2, 2)) show(find_local_minimum(g, -2, 2)) h(x)=x/sqrt(x^2+4) hp=plot(h(x), x, -2, 2, color="black", legend_label=r"$h(x)$") show(hp) show(find_local_maximum(h, -2, 2)) show(find_local_minimum(h, -2, 2)) k(x)=ln(x^3+8) kp=plot(k(x), x, -2, 2, color="black", legend_label=r"$k(x)$") show(kp) show(find_local_maximum(k, -2, 2)) show(find_local_minimum(k, -2, 2)) Sage’s answers look like as follows: For f : (66.19999293835187, −1.9999999605494494) , (3.1326992998650347, 1.525960530898268) . For g : (5.333333096630033, −1.9999999605494494) , ( 9.947977045494631 × 10−19 , 9.97395460544528e-10 ) . For h : (0.7071067742126095, 1.9999999605494494) , (−0.7071067742126095, −1.9999999605494494) . For k : (2.7725886926518686, 1.9999999605494494) , (−14.563311203591136, −1.9999999605494494) . Here, recall that Sage first presents the maximal/minimum value f(x0) at the extreme point x0, and next the point x0 itself. It also produces the plots of the functions at hand, which are listed below. 603 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.A.29. The function f(x) = xa is twice differentiable on (0, +∞) with f′ (x) = a xa−1 and f′′ (x) = a(a − 1) xa−2 . By assumption we have a > 0 and (a − 1) < 0. It is also easy to see that xa−2 > 0 for all x > 0. Thus f′′ (x) < 0 for all x > 0 which means that f is concave on (0, +∞). 6.A.30. Based on the cell f(x)=x/(x^2-1) a=lim(f(x), x=1, dir="right") b=lim(f(x), x=1, dir="left") c=lim(f(x), x=-1, dir="right") d=lim(f(x), x=-1, dir="left") show(a,’,’, b,’,’, c,’,’, d) we see that a := lim x→1+ f(x) = +∞ , b := lim x→1− f(x) = −∞ , c := lim x→−1+ f(x) = +∞ , d := lim x→−1− f(x) = −∞ . Thus, the lines x = ±1 are vertical asymptotes of f. It is also easy to see that the x-axis y = 0 is the unique horizontal asymptote of f (this is because limx→∞ f(x) = 0 = limx→−∞ f(x)). However, asymptotes with slope cannot exist, since f(x) = x/(x2 − 1) = g(x)/h(x), and the degree of g(x) is smaller than the degree of h(x). Recall that using Sage, to plot a rational function it is wise to restrict the y-values. On the other hand, Sage often sketches the vertical asymptotes via the option detect_poles, which we should appropriately include inside the command plot. In particular, the cell p=plot(x/(x^2-1), x, -5, 5, ymin=-10, ymax=10, detect_poles="show") produces the figure posed above. However, one has the chance to set detect_poles to ”True”, (or ”False”), to obtain only the graph of f. For an example with horizontal asymptotes see the 6.A.31. 6.A.31. (a) The given function is continuous and differentiable on R as a fraction of differentiable functions. We see that its first derivative is everywhere positive, f′ (x) = 2ex (ex + 1) 2 > 0, x ∈ R . This implies that f is strictly increasing for all x ∈ R. 
In case you want to verify this claim in Sage, use the cell f(x) = (exp(x) − 1)/(exp(x) + 1); d1(x) = diff(f, x).factor(); bool(d1(x) > 0) 604 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Now, since the domain of f is the whole real line, f cannot have vertical asymptotes. On the other hand, the lines y = ±1 are horizontal asymptotes of f since lim x→+∞ ex − 1 ex + 1 = lim x→+∞ ex ex = 1, lim x→−∞ ex − 1 ex + 1 = 0 − 1 0 + 1 = −1 . Combining these conclusions one deduces that the range is the open interval (−1, 1), see also the graph of f below. It seems that Sage has not an “built-in” method for horizontal asymptotes. But we can proceed “manually” and introduce theese lines via the command line. For instance, to sketch the graph of f, together with its asymptotes y = ±1, use the cell f(x)=(exp(x)-1)/(exp(x)+1) gf=plot(f, x, -5, 5, ymin=-1.1, ymax=1.1, color="black") a1 = line([(5,1) ,(-5,1)], linestyle="--", thickness =0.8) a2 = line([(5,-1) ,(-5,-1)], linestyle="--", thickness =0.8) (gf+a1+a2).show(ticks=1,tick_formatter =1) Executing this block, you will get the following figure In this figure we have restricted the y-values, by the options ymin = −1.1, and ymax = 1.1, though not necessary (more experienced programmers could jump this step). (b) We know that f is strictly increasing in R, hence it does not admit (local/global) extremes. The second derivative of f is given by f′′ (x) = 2(1−ex ) ex (ex +1)3 , for all x ∈ R. Thus, x = ln(1) = 0 is the unique solution of the equation f′′ (x) = 0 and we have • f′′ (x) > 0 for all x ∈ (−∞, 0) , • f′′ (x) < 0 for all x ∈ (0, +∞) . Therefore, f is strictly convex for all x ∈ (−∞, 0) and strictly concave for all x ∈ (0, +∞), and since f is continuous on 0 and its graph there changes from convex to concave, x = 0 is the unique inflection point of f (with value f(0) = 0). Some of the previous conclusions occur really easy in Sage, as well. For instance, by typing f(x)=(exp(x)-1)/(exp(x)+1) d2(x)=diff(f, x, 2); show(solve(d2(x)==0, x)) we confirm the inflection point of f. 6.A.34. By assumption, f is differentiable on R. Hence, if a ∈ R is a stationary point of f we have f′ (a) = 0. On the other hand, a differentiation of the defining equation gives 2f(x)f′ (x) − f(x) − xf′ (x) + 4x = 0 , x ∈ R . (∗) By replacing x = a in this equation, we find that f(a) = 4a. Combining this with the defining equation, the replacement x = a gives f2 (a) − af(a) + 2a2 − 7 = 0, that is, (4a)2 − 4a2 + 2a2 − 7 = 0, which is equivalent to 14a2 = 7. Thus a2 = 1 2 , i.e., a = ± √ 2 2 . Next, f is twice differentiable. Hence, if x = b ∈ R is an inflection point of f we should have f′′ (b) = 0. Now, taking the derivative of (∗), we see that 2 ( f′ (x) )2 + 2f(x)f′′ (x) − 2f′ (x) − xf′′ (x) + 4 = 0 , x ∈ R . 605 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS For x = b this relation reduces to ( f′ (b) )2 − f′ (b) + 2 = 0, which has negative discriminant, ∆ = −7 < 0. This implies that f does not admit inflection points. 6.A.35. We have f′ (x) = 2x ex2 , g′ (x) = 2x x2+1 , and k′ (x) = ex e2x +1 , for all x ∈ R. Thus, we get df(x) = (2x ex2 ) dx, dg(x) = 2x x2+1 dx, and dk(x) = ex e2x +1 dx, respectively. An implementation of these differentials in Sage, essentially relies on the command diff (or derivative). 
For instance, a cell based on this method and which includes dx as a symbolic variable has the form var("x, dx") f(x)=e**(x**2); f1(x)=diff(f, x) df(x)=f1*dx #declare the differential of f g(x)=ln(x**2+1); g1(x)=diff(g, x) dg(x)=g1*dx #declare the differential of g k(x)=arctan(e^x); k1(x)=diff(k, x) dk(x)=k1*dx #declare the differential of k show(df(x), ",", dg(x), ",", dk(x)) Chek yourself Sage’s output. 6.A.41. First we need to introduce f and use the command plot to sketch it. We also need to indicate the nodes x, x+h, f(x) and f(x + h), and draw the slopes corresponding to forward/backward differences, together with the tangent line. Hence, we may appropriately use the commands point and line (recall that these are build-in functions in Sage for the construction of points and lines, respectively). To keep our code simple, we agree to distinguish the slopes via colours. So, let us use green and red colours to illustrate the forward/backward difference slopes, and orange colour for the tangent line of f trough x. We can now encode the solution, by the following block: f(x)=exp(x^(2))-0.9 p=plot(f(x), x, -0.4, 0.5, ticks=[0.1,None]) p+=point((0.2, f(0.2)), size=30) p+=point((0.3, f(0.3)), size=30) p+=point((0.2, 0), size=30) p+=point((0.3, 0), size=30) p+=line([(0.2,f(0.2)),(0.3,f(0.3))],color="green") p+=line([(0.2,0), (0.2,f(0.2))], linestyle="--", rgbcolor=(0.2,0.2,0.2)) p+=line([(0.3,0), (0.3,f(0.3))], linestyle="--", rgbcolor=(0.2,0.2,0.2)) p+=point((0.1, 0), size=30) p+=point((0.1, f(0.1)), size=40) p+=line([(0.1,f(0.1)),(0.2,f(0.2))],color="red") p+=line([(0.1,0), (0.1,f(0.1))], linestyle="--",rgbcolor=(0.2,0.2,0.2)) f1=diff(f(x), x)(x=0.2); l1(x)=f(0.2)+f1*(x-0.2) p+=plot(l1(x), x, 0.01, 0.45,color="orange") p+=text(r"$x$", (0.22, 0.01),fontsize=12) p+=text(r"$x+h$", (0.335, 0.01),fontsize=12) p+=text(r"$x-h$", (0.14, 0.01),fontsize=12) p+=text(r"$f(x)$", (0.19, 0.16),fontsize=12) p+=text(r"$f(x+h)$", (0.24, 0.20),fontsize=12) p+=text(r"$f(x-h)$", (0.07, 0.128),fontsize=12) p+=text(r"$y=f(x)$", (0.35, 0.31),fontsize=12) show(p) This block has as output the following figure: 606 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.A.42. The central difference approximation of the value f′ (0.2) = 0.4163243 with step h = 0.01 is given by f(0.2 + 0.01) − f(0.2 − 0.01) 0.02 ≈ 0.4163671 . This proves the claim, comparing with the results in 6.A.40. 6.A.43. The axes of the given ellipse are the coordinate axes x and y, and its vertices are the points P = [ √ 2, 0], Q = [− √ 2, 0], R = [0, 1], S = [0, −1] , see also the given figure below. Let us first compute the curvature at R. We can consider the coordinate y as a function of x (determined uniquely in a neighbourhood of R ). Thus by differentiating the equation of the ellipse with respect to x we get 2x + 4yy′ = 0. Hence y′ = − x 2y . Differentiating this equation once more with respect to x we obtain the second derivative, y′′ = − 1 2 ( 1 y − xy′ y2 ) . At the point R, we see that y′ (R) = 0 and y′′ (R) = −1 2 . According to 6.1.13, the radius of the osculation circle will be given by the expression (1 + (y′ )2 ) 3 2 (y′′)2 evaluated at the point in question. This gives the values ±2 (the sign tells us the circle will be “below” the graph of the function). Now, the ideas in 6.1.13 and 6.1.16 imply that the centre will be in the direction opposite to the normal line of this curve, i.e., on the y-axis. Since the radius is 2, the centre will be at the point [0, 1 − 2] = [0, −1]. 
In total, the equation of the osculation circle of the given ellipse at R will be x² + (y + 1)² = 4. Analogously, we can determine the equation of the osculation circle at the point S, where we get the equation x² + (y − 1)² = 4. Finally, the curvature of the ellipse at these points equals 1/2 (the absolute value of the curvature of the graph of the function).

For determining the osculation circle at the point [√2, 0], we consider the equation of the ellipse as a formula for the variable x depending on the variable y, i.e., x as a function of y (near this point, y is not determined uniquely as a function of x, so we cannot use the previous procedure; technically, it would end up with division by zero). Then, by differentiation we obtain 2xx′ + 4y = 0, thus x′ = −2y/x, and x′′ = −2(1/x − yx′/x²). Hence at the point [√2, 0] we have x′ = 0 and x′′ = −√2, and the radius of the circle of osculation is ρ = 1/√2 = √2/2 according to 6.1.13. The normal line is heading to −∞ along the x-axis at the point [√2, 0], thus the center of the osculation circle will be on the x-axis, on the other side, at distance √2/2, hence at the point [√2 − √2/2, 0] = [√2/2, 0]. In total, the equation of the circle of osculation at the vertex [√2, 0] will be (x − √2/2)² + y² = 1/2. The curvature at both of these vertices equals √2.

Note that the vertices of an ellipse (more generally, the vertices of a closed smooth curve in the plane) can be defined as the points at which the function of curvature has an extreme. The ellipse having four vertices isn't a coincidence: the so-called "four vertex theorem" states that a closed curve of the class C³ has at least four vertices. A curve of the class C³ is locally given parametrically by points [f(t), g(t)] ∈ R², t ∈ (a, b) ⊂ R, where f and g are functions of the class C³(R). Thus the curvature of the ellipse at any of its points lies between its curvatures at its vertices, i.e. between 1/2 and √2.

6.A.44. (a) Since the functions x = x(t) and y = y(t) are differentiable, by the chain rule one has

y′(t) = dy/dt = (dy/dx)(dx/dt) =⇒ dy/dx = (dy/dt)/(dx/dt) ,

assuming that x′(t) ≠ 0 for all t ∈ I.

(b) For α(t) we have dy/dx = (dy/dt)/(dx/dt) = 6t²/1 = 6t². Thus (dy/dx)|_{t=0} = 0 and (dy/dx)|_{t=2} = 24. Alternatively, we have x = 1 + t, thus t = x − 1 and hence y = 2(x − 1)³. Thus y′(x) = dy/dx = 6(x − 1)² = 6t². Next, for β(t) we have

dy/dx = (dy/dt)/(dx/dt) = −(1 + t²)²/(2t(1 + t)) . (♭)

Note that in this case we do not have x′(t) ≠ 0 for all t ∈ I; in particular, the derivative dy/dx is not defined at t = 0. However, at t = 2 one gets (dy/dx)|_{t=2} = −25/12. To confirm the situation in an alternative way, we need to eliminate t, which can be done via the first parametric equation x = 1/(1 + t²). We get t = √((1 − x)/x) (notice that x = 1 for t = 0 and x = 1/5 for t = 2). Thus y(x) = ln( 1 + √((1 − x)/x) ), and now one can directly differentiate y with respect to x. Let's do this quickly in Sage:

var("t")
f(x)=ln(1+sqrt((1-x)/x))
show(diff(f, x))
show(diff(f, x)(x=1/5))
show(diff(f, x)(x=1))
show(diff(f(x), x)(x=1/(1+t^2)).full_simplify())

The very first show command returns the derivative y′(x) (in terms of x), the second the value y′(1/5) = −25/12, while the third one returns an error, since y′(x) is not defined there. Finally, the last command confirms the expression given in (♭).
(c) Let us now pose the code for solving this task: var("t") F1(t)=(diff(2*t^3, t)/diff(t^2+1, t)).factor(); show(F1) F2(t)=(diff(ln(1+t), t)/diff(1/(1+t^2), t)).factor(); show(F2) p=parametric_plot((t^2+1, 2*t^3), (t,0,2), legend_label=r"$\alpha(t)$", color="black", thickness=2) p+=parametric_plot((1/(1+t^2), ln(1+t)), (t,0,2),legend_label=r"$\beta(t)$",color="grey",thickness=2) p+=plot(F1(t), t, 0, 2, color="black", linestyle="--", thickness=1.5, legend_label=r"$\left(\frac{dy/dt}{dx/dt}\right)_{\alpha}$") p+=plot(F2(t), t, 0, 2, color="darkgrey", linestyle="--", thickness=1.5, legend_label=r"$\left(\frac{dy/dt}{dx/dt}\right)_{\beta}$") p.show(ymin=-4, ymax=4, xmax=4, aspect_ratio=1/2) This block confirms the given expressions of the derivatives dy dx and produces the graphs of the curves α, β, together with the graphs of the corresponding derivatives. 608 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.A.46. Let us present the block with some comments within: var("t, s, r") x(t)=t-sin(t); y(t)=1-cos(t) show(diff(y, t)/diff(x, t)) # declare the slope of the cycloid at general point y0=y(2*pi/3);show(y0) # declare the y-coordinate of the point P x0=x(2*pi/3);show(x0) # declare he x-coordinate of the point P k=diff(y, t)(t=2*pi/3)/(diff(x, t)(t=2*pi/3)); show(k) # declare the slope of the cycloid at 2pi/3 l(s)=y0+k*s-k*x0; show(l) #declare the tangent line passing through P y00=y(4*pi/3); show(y00) # declare the y-coordinate of the point Q x00=x(4*pi/3); show(x00) # declare the x-coordinate of the point Q kk=diff(y, t)(t=4*pi/3)/(diff(x, t)(t=4*pi/3)); show(kk) # declare the slope of the cycloid at 4pi/3 L(r)=y00+kk*r-kk*x00; show(L) #declare the tangent line passing through Q show(solve(l(r)-L(r)==0, r)) #declare the intersection point of the 2 tangent lines p=parametric_plot((x(t), y(t)), (t,0,2*pi), color="black", thickness=2) p+=plot(l(s), -0.5, 4, color="darkgrey") p+=plot(L(r),2, 6, color="grey") p+=point([x(2*pi/3), y(2*pi/3)], size=30) # the point P p+=point([x(4*pi/3), y(4*pi/3)], size=30) # the point Q p+=point([pi, l(pi)], size=30) # the intersection point R p+=text(r"$P$", (x0, y0+0.2),fontsize=14, rgbcolor=(0.1,0.2,0.5)) p+=text(r"$Q$", (x00, y00+0.2),fontsize=14, rgbcolor=(0.1,0.2,0.5)) p+=text(r"$R$", (pi, l(pi)+0.2),fontsize=14, rgbcolor=(0.1,0.2,0.5)) p.show(gridlines="true") 6.B.2. By definition, we have F′ (x) = f(x) for all x ∈ R. Combining this with the relation 4F(x) = f(x) we get F′ (x) = 4F(x), that is, F′ (x) F(x) = 4 (since F(x) = (1/4)f(x) ̸= 0 for all x ∈ R). An integration then gives ∫ F′ (x) F(x) dx = ∫ 4 dx ⇐⇒ ln | F(x) | = 4x + c ⇐⇒ | F(x) | = e4x+c , for some constant c. Thus F(x) = e4x+c or F(x) = − e4x+c . Because F(4) = f(4) 4 = 1 > 0 the second case is omitted. In particular, the relation F(4) = 1 gives e16+c = 1 = e0 , that is, 16 + c = 0 or equivalently c = −16. Thus F(x) = e4(x−4) and f(x) = F′ (x) = 4 e4(x−4) . 6.B.4. One can use the identities sin2 (x) = ( 1 − cos(2x) ) /2 and cos2 (x) = ( 1 + cos(2x) ) /2. In particular, we see that A = ∫ ( 1 − cos(2x) 2 ) ( 1 + cos(2x) 2 ) dx = ∫ 1 − cos2 (2x) 4 dx = 1 4 ∫ dx − 1 4 ∫ cos2 (2x) dx = x 4 − 1 4 ∫ ( (1 + cos(4x) 2 ) dx = x 4 − 1 8 ∫ dx − 1 8 ∫ cos(4x) dx = x 8 − sin(4x) 32 + C . 6.B.7. (a) We have (sin(x))′ = cos(x) and hence 609 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS K = ∫ x(sin(x))′ dx = x sin(x) − ∫ (x)′ sin(x) dx = x sin(x) + cos(x) + C , C ∈ R . 
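As with the other Sage confirmations in this chapter, the result in (a) can be reproduced by a one-line check (the output should match up to the integration constant):

show(integral(x*cos(x), x))   # expect x*sin(x) + cos(x)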
(b) Recall that (x2 − x)′ = 2x − 1 and hence M = ∫ (x2 − x)′ ln(x) dx = (x2 − x) ln(x) − ∫ (x2 − x)(ln(x))′ dx = (x2 − x) ln(x) − ∫ (x − 1) dx = (x2 − x) ln(x) + x − x2 2 + C . (c) In this case we will use the relations (− 1 β cos(βx))′ = sin(βx) and ( 1 β sin(βx))′ = cos(βx), for all x ∈ R. Thus N = ∫ ex sin(βx) dx = ∫ ex (− 1 β cos(βx))′ dx = − 1 β ex cos(βx) + 1 β ν , (∗) where ν = ∫ ex cos(βx) dx. In the same vein we see that ν = ∫ ex ( 1 β sin(βx))′ dx = 1 β ex sin(βx) − 1 β N . (∗∗) Therefore, a combination of (∗) and (∗∗) gives N + 1 β2 N = − 1 β ex cos(βx) + 1 β2 ex sin(βx) + C , that is, N = β β2 + 1 ex ( 1 β sin(βx) − cos(βx) ) + C, C ∈ R. A word of caution: One should remember to identify multiples of C with C itself, as we did in cases (b) and (d), for example. Moreover, we could even write N(x) = β β2+1 ex ( 1 β sin(βx) − cos(βx) ) + C, x ∈ R, and similarly for the previous cases (we avoid this to save some space). 6.B.8. All the solutions can be obtained by applying the rule ∫ F(x)G′ (x) dx = F(x)G(x) − ∫ F′ (x)G(x) dx (integration by parts). Notice the first non-trivial part in this rule is the function G(x), which we should “guess” using its first derivative G′ (x). Next we should be able to compute the integral ∫ F′ (x)G(x) dx. (a) We have F(x) = x, F′ (x) = 1, G′ (x) = 1 cos2(x) and hence G(x) = tan(x), such that (tan(x))′ = 1 cos2(x) . Therefore, an application of the preceding rule gives ∫ x cos2(x) dx = ∫ x(tan(x))′ dx = x tan(x) − ∫ tan(x) dx = x tan(x) + ∫ − sin(x) cos(x) dx = x tan(x) + ∫ (cos(x))′ cos(x) dx = x tan(x) + ln |cos(x)| + C . Above, in the final step we applied the identity ∫ f′ (x) f(x) dx = ln | f(x) | + C, for some constant C (see 6.2.3). (b) In this case we have F(x) = x2 , F′ (x) = 2x, G′ (x) = e−3x and hence G(x) = −1 3 e−3x . Therefore, at a first step we obtain ∫ x2 e−3x dx = ∫ x2 ( − 1 3 e−3x )′ dx = − 1 3 x2 e−3x + 2 3 ∫ x e−3x dx . Next we see that∫ x e−3x = ∫ x ( − 1 3 e−3x )′ dx = − 1 3 x e−3x + 1 3 ∫ e−3x dx = − 1 3 x e−3x − 1 9 e−3x +C , for some constant C ∈ R, where in the final step we used the relation ∫ eax dx = eax a + C (see 6.2.3). A combination of these two relations gives ∫ x2 e−3x = − 1 3 x2 e−3x + 2 3 ( − 1 3 x e−3x − 1 9 e−3x ) + C = − 1 3 e−3x ( x2 + 2 3 x + 2 9 ) + C , for some constant C ∈ R. (c) In this case we have F(x) = cos(x), F′ (x) = − sin(x), G′ (x) = cos(x) and hence G(x) = sin(x). If we set for convenience M = ∫ cos2 (x) dx, we thus have M = ∫ cos2 (x) dx = ∫ cos(x) cos(x) dx = ∫ cos(x) ( sin(x) )′ dx = cos(x) sin(x) + ∫ sin2 (x) dx = cos(x) sin(x) + ∫ ( 1 − cos2 (x) ) dx = cos(x) sin(x) + ∫ 1 dx − ∫ cos2 (x) dx = cos(x) sin(x) + x − M + C , 610 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS that is, 2M = cos(x) sin(x) + x + C or equivalently, M = 1 2 ( cos(x) sin(x) + x ) + C, for some constant C ∈ R. We emphasise that usually suitable simplifications or substitutions, lead to the desired result faster than integration by parts. For instance, by using the identity cos2 (x) = 1 2 ( 1 + cos(2x) ) , x ∈ R we easily obtain ∫ cos2 (x) dx = ∫ 1 2 dx+ ∫ 1 2 cos(2x) dx = x 2 + sin(2x) 4 +C = x 2 + 2 sin(x) cos(x) 4 +C = 1 2 ( x+sin(x)·cos(x) ) +C , for some constant C ∈ R. 6.B.10. Set t = x3 + 4 such that dt = 3x2 dx. Thus dx = dt 3x2 and ∫ ex3 +4 x2 dx = 1 3 ∫ et dt = et 3 + C = ex3 +4 3 + C , for some constant C. In Sage one can type show(integral((e^(x^3+4))*x^2, x)) 6.B.11. 
The main idea of the so called first substitution method is writing the integral in the form of ∫ f ( φ(x) ) φ′ (x) dx , (⋆) for certain functions f and φ. Then the substitution y = φ(x) gives that dy = φ′ (x) dx, and the integral above reads as ∫ f(y) dy . Based on this idea let us set y = cos(x), such that dy = − sin(x) dx. Then we obtain ∫ cos5 (x) sin(x) dx = − ∫ cos5 (x) ( − sin(x) ) dx = − ∫ y5 dy = − y6 6 + C = − cos6 (x) 6 + C , for some arbitrary constant C ∈ R. 6.B.12. We will use the so called second substitution method, which means a reduction of ∫ f(y) dy to the form (⋆) presented in the previous task 6.B.11, for y = φ(x). In particular, in our example we want to determine the primitive function of function f(x) = tan4 (x). Thus it is sensible to consider the substitution u = tan(x), and hence x = arctan(u). This gives dx = du 1+u2 , and hence we get ∫ sin4 (x) cos4(x) dx = ∫ u4 1 + u2 du = ∫ u2 − 1 + 1 u2 + 1 du = u3 3 − u + arctan(u) + C = tan3 (x) 3 − tan(x) + x + C , for some constant C ∈ R. 6.B.13. Let us include the solutions in one cell: show(integral((cos(x))^5*sin(x), x)); show(integral((sin(x))^4/((cos(x))^4), x)) Check yourself that this verifies the given answers in 6.B.11 and 6.B.12, respectively. 6.B.14. This is based on appropriate substitution, in particular ∫ cos5 (x) sin2 (x) dx = ∫ ( cos2 (x) )2 sin2 (x) cos(x) dx = ∫ ( (1 − sin2 (x) )2 sin2 (x) cos(x) dx t=sin(x) = dt=cos(x) dt ∫ ( 1 − t2 )2 t2 dt = ∫ ( 1 − 2t2 + t4 ) t2 dt = ∫ ( t6 − 2t4 + t2 ) dt = t7 7 − 2 t5 5 + t3 3 + C = sin7 (x) 7 − 2 sin5 (x) 5 + sin3 (x) 3 + C , for some constant C. 6.B.15. (a) The substitution method leads to the integral ∫ x3 e−x2 dx = t = −x2 dt = −2x dx = 1 2 ∫ t et dt , which can be easily computed by integrating by parts, yielding 1 2 ∫ t et dt = F(t) = t F′ (t) = 1 G′ (t) = et G(t) = et = 1 2 t et − 1 2 ∫ et dt = 1 2 t et − 1 2 et +C = − 1 2 e−x2 ( x2 + 1 ) + C , C ∈ R . 611 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS (b) Similarly, we obtain ∫ x arcsin(x2 ) dx = t = x2 dt = 2x dx = 1 2 ∫ arcsin(t) dt = F(t) = arcsin(t) F′ (t) = 1√ 1−t2 G′ (t) = 1 G(t) = t = 1 2 t arcsin(t) − 1 2 ∫ t √ 1 − t2 dt = u = 1 − t2 du = −2t dt = 1 2 t arcsin(t) + 1 4 ∫ du √ u = 1 2 t arcsin(t) + 1 2 √ u + C = 1 2 t arcsin(t) + 1 2 √ 1 − t2 + C = 1 2 x2 arcsin(x2 ) + 1 2 √ 1 − x4 + C , C ∈ R . (c) In this case let us first use the substitution y = √ x to get rid of the root from the argument of the exponential function. This leads to the integral ∫ e √ x dx = y2 = x 2y dy = dx = 2 ∫ y ey dy . Based now on integration by parts, we obtain ∫ y ey dy = F(y) = y F′ (y) = 1 G′ (y) = ey G(y) = ey = y ey − ∫ ey dy = y ey − ey +C , C ∈ R . In total this gives ∫ e √ x dx = 2y ey −2 ey +C = 2 e √ x (√ x − 1 ) + C , C ∈ R . 6.B.17. Integration by parts gives ∫ sinn (x) dx = ∫ sinn−1 (x) sin(x) dx = ∫ sinn−1 (x)(− cos(x))′ dx = − sinn−1 (x) cos(x) + (n − 1) ∫ sinn−2 (x) cos2 (x) dx = − sinn−1 (x) cos(x) + (n − 1) ∫ sinn−2 (x) ( 1 − sin2 (x) ) dx = − sinn−1 (x) cos(x) + (n − 1) ∫ sinn−2 (x) dx − (n − 1) ∫ sinn (x) dx = − sinn−1 (x) cos(x) + (n − 1)In−2 − (n − 1)In . Thus for any positive integer n we get nIn = − sinn−1 (x) cos(x) + (n − 1)In−2 ⇐⇒ In = − 1 n sinn−1 (x) cos(x) + n − 1 n In−2 . Now, we have I0 = x and I1 = − cos(x). Thus based on the recurrence relation and using the identity sin(2x) = 2 sin(x) cos(x), we get I2 = ∫ sin2 (x) dx = − 1 2 sin(x) cos(x) + 1 2 I0 = − 1 2 sin(x) cos(x) + 1 2 x = − 1 4 sin(2x) + 1 2 x . 
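Before continuing the recursion, here is a quick Sage sanity check of I₂; differentiating sidesteps the ambiguity in the integration constant (a small sketch, expecting the simplification to return 0):

I2 = -1/4*sin(2*x) + 1/2*x
show((diff(I2, x) - sin(x)^2).simplify_full())   # expect 0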
Similarly, I3 = ∫ sin3 (x) dx = − 1 3 sin2 (x) cos(x) + 2 3 I1 = − 1 3 sin2 (x) cos(x) − 2 3 cos(x) = − 1 3 ( 1 − cos2 (x) ) cos(x) − 2 3 cos(x) = 1 3 cos3 (x) − cos(x) . 6.B.18. We can solve this by substitution, which we will encode in a more “compact” but less informative way, as follows: ∫ 6 x − 2 dx = y = x − 2 dy = dx = ∫ 6 y dy = 6 ln | y | + C1 = 6 ln | x − 2 | + C1 , C1 ∈ R . Similarly, we have ∫ 6 (x + 4)3 dx = y = x + 4 dy = dx = ∫ 6 y3 dy = 6 −2y2 + C2 = − 3 (x + 4)2 + C2 , C2 ∈ R . 6.B.19. Setting w = ax + b we get dw = a dx, that is, dx = dw a . Hence 612 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS ∫ dx (ax + b)n = ∫ dw awn = 1 a ∫ w−n dw = 1 a ( w−n+1 −n + 1 ) + C = − 1 a(n − 1)wn−1 + C , and the result follows. Next observe that 9x2 + 6x + 1 = (3x + 1)2 . Thus ∫ dx 9x2 + 6x + 1 = ∫ dx (3x + 1)2 , and by applying the relation that we proved in (a), it follows that ∫ dx (3x + 1)2 = − 1 3(3x + 1) + C, for some constant C. 6.B.20. Recall that P(x) = ax2 + bx + c = a ( x + b 2a )2 − ∆ 4a , and since ∆ = 0 we get P(x) = a ( x + b 2a )2 . Therefore,∫ dx P(x) = ∫ dx a ( x + b 2a )2 . Let us set u = x + b 2a , such that du = dx. Then we see that ∫ dx a ( x + b 2a )2 = 1 a ∫ du u2 = 1 a ∫ u−2 du = − 1 au + C = − 1 a ( x + b 2a ) + C = − 2 2ax + b + C , for some constant C. We leave the other confirmation to the reader. 6.B.27. By the discussion in 6.2.7 we can assume that Q(x) = 2x4 + 2x2 − 5x + 1 x (x2 − x + 1) 2 = A x + B1x + C1 x2 − x + 1 + B2x + C2 (x2 − x + 1)2 , for some A, B1, C1, B2, C2, to be specified. By equating coefficients and solving the corresponding system one gets A = 1, B1 = 1, C1 = 3, B2 = 1 and C2 = −6, and thus Q(x) = 1 x + x + 3 x2 − x + 1 + x − 6 (x2 − x + 1)2 . Recall that you can use Sage to get a quick confirmation of the obtained fractional decomposition of Q, just by typing Q(x)=(2*x^4+2*x^2-5*x+1)/(x*(x^2-x+1)^2); show(Q.partial_fraction()) The fractional decomposition of Q implies that the integral in question splits in three parts: ∫ Q(x) dx = ∫ 2x4 + 2x2 − 5x + 1 x (x2 − x + 1) 2 dx = ∫ 1 x dx + ∫ x + 3 x2 − x + 1 dx + ∫ x − 6 (x2 − x + 1)2 dx . (⋆) Let us focus on the last two integrals which are more difficult. To simplify our notation, we will compute them up to a constant (as Sage does). We begin with Λ := ∫ x + 3 x2 − x + 1 dx. Since x2 − x + 1 has complex roots one can apply a method similar to those presented in 6.B.22. Hence we get Λ = ∫ x + 3 x2 − x + 1 dx = 1 2 ∫ 2x − 1 x2 − x + 1 dx + 7 2 ∫ 1 x2 − x + 1 dx = Λ1 + Λ2 , and we need to evaluate the integrals Λ1 and Λ2. We have (up to a constant) Λ1 = 1 2 ∫ 2x − 1 x2 − x + 1 dx t=x2 −x+1 = dt=(2x−1) dx 1 2 ∫ dt t = 1 2 ln |t| = 1 2 ln(t) = 1 2 ln(x2 − x + 1) . (α) Next Λ2 = 7 2 ∫ dx x2 − x + 1 = 7 2 ∫ dx (x − 1 2 )2 + 3 4 = 7 2 ∫ dx (x − 1 2 )2 + ( √ 3 2 )2 u=x− 1 2 = du=dx 7 2 ∫ du u2 + ( √ 3 2 )2 = = 7 2 · 1 √ 3 2 arctan ( u √ 3 2 ) = 7 √ 3 arctan ( 2u √ 3 ) = 7 √ 3 3 arctan (√ 3(2x − 1) 3 ) . (β) A combination of (α) and (β) gives Λ = Λ1 + Λ2 = 1 2 ln(x2 − x + 1) + 7 √ 3 3 arctan (√ 3(2x − 1) 3 ) . (A) In the second integral M := ∫ x − 6 (x2 − x + 1)2 dx the dominator appears in the second power. Hence one can apply a similar technique to those presented in 6.B.24. We have M = ∫ x − 6 (x2 − x + 1)2 dx = 1 2 ∫ 2x − 1 (x2 − x + 1)2 dx − 11 2 ∫ dx (x2 − x + 1)2 = M1 + M2 . Then, up to a constant we get 613 CHAPTER 6. 
DIFFERENTIAL AND INTEGRAL CALCULUS M1 = 1 2 ∫ 2x − 1 (x2 − x + 1)2 dx t=x2 −x+1 = dt=(2x−1) dx 1 2 ∫ dt t2 = − 1 2t = − 1 2(x2 − x + 1) , (γ) and M2 = − 11 2 ∫ dx (x2 − x + 1)2 = − 11 2 ∫ dx ( (x − 1 2 )2 + ( √ 3 2 )2 )2 = − 11 2 · K2 ( 1 2 , √ 3 2 ) = = − 11 2 · 1 3 4 (1 2 K1 ( 1 2 , √ 3 2 ) + x − 1 2 2 ( (x − 1 2 )2 + 3 4 ) ) = − 11 2 · 4 3 (1 2 · 1 √ 3 2 arctan ( x − 1 2√ 3 2 ) + 2x−1 2 2(x2 − x + 1) ) = = − 22 3 (√ 3 3 arctan (√ 3(2x − 1) 3 ) + 2x − 1 4(x2 − x + 1) ) = − 22 √ 3 9 arctan (√ 3(2x − 1) 3 ) − 11(2x − 1) 6(x2 − x + 1) . (δ) Therefore by (γ) and (δ) we obtain M = M1 + M2 = − 1 2(x2 − x + 1) − 22 √ 3 9 arctan (√ 3(2x − 1) 3 ) − 11(2x − 1) 6(x2 − x + 1) . (B) Combining now (⋆) with (A) and (B) we finally deduce that ∫ Q(x) dx = ln |x| + 1 2 ln(x2 − x + 1) + 7 √ 3 3 arctan (√ 3(2x − 1) 3 ) − 1 2(x2 − x + 1) − 22 √ 3 9 arctan (√ 3(2x − 1) 3 ) − 11(2x − 1) 6(x2 − x + 1) + C , or equivalently ∫ Q(x) dx = ln x √ x2 − x + 1 − √ 3 9 arctan (√ 3(2x − 1) 3 ) − 11x − 4 3(x2 − x + 1) + C , for some constant C. For a confirmation (and much faster computation) via Sage, just type Q(x)=(2*x^4+2*x^2-5*x+1)/(x*(x^2-x+1)^2); show(Q.integrate(x)) Or you may like to verify the integrals Λ1, Λ2, M1, M2 presented above, which can be done in a similar way: show((1/2)*integral((2*x-1)/(x^2-x+1), x)); show((7/2)*integral(1/(x^2-x+1), x)) show((1/2)*integral((2*x-1)/(x^2-x+1)^2, x)); show((-11/2)*integral(1/(x^2-x+1)^2, x)) 6.B.28. An appropriate substitution is given by t = ex , hence x = ln(t) and dx = 1 t dt. This will allows us to convert the function Q(x) = 1 e2x −4 ex to a rational function, hence we can apply the theory described above. In particular, ∫ Q(x) dx = ∫ 1 e3x −2 e2x dx = ∫ dt (t3 − 2t2)t = ∫ dt t3(t − 2) . Now, suppose that 1 t3(t − 2) = A t + B t2 + C t3 + D t − 2 , for some A, B, C, D ∈ R . This gives the relation 1 = t2 (t − 2)A + t(t − 2)B + (t − 2)C + t3 D ⇐⇒ 1 = t3 (A + D) + t2 (−2A + B) + t(−2B + C) − 2C . Thus we get the system −2C = 1 , −2B + C = 0 , −2A + B = 0 , A + D = 0 , which has the tuple {A = −1/8, B = −1/4, C = −1/2, D = 1/8} as a unique solution. This means that 1 t3(t − 2) = − 1 8t − 1 4t2 − 1 2t3 + 1 8(t − 2) , (∗) which can be easily confirmed in Sage by the usual method, that is, var("t"); show((1/(t^4-2*t^3)).partial_fraction() Based on (∗) we may finally compute the integral at hand: ∫ dt t3(t − 2) = ∫ ( 1 8(t − 2) − 1 8t − 1 4t2 − 1 2t3 ) dt = 1 8 (ln |t − 2| − ln |t|) + 1 4t + 1 4t2 + C = 614 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS = 1 8 (ln |ex −2| − ln(ex )) + 1 4 ex ( 1 + 1 ex ) + C = ln |ex −2| 8 − x 8 + 1 4 e−2x (1 + ex ) + C , for some constant C. For a quick confirmation of this result try the command show(integral(1/(e**(3*x)-2*e**(2*x)), x)) 6.B.31. For all x ̸= π 2 + kπ, with k ∈ Z, we see that ∫ 1 cos2(x) dx = t=tan(x) = dt=dx/ cos2(x) ∫ dt = t + C = tan(x) + C , for some constant C. Therefore, with the help of the table given in 5.A.3 one computes π/3∫ π/6 tan2 (x) dx = π/3∫ π/6 sin2 (x) cos2(x) dx = π/3∫ π/6 1 − cos2 (x) cos2(x) dx = π/3∫ π/6 ( 1 cos2(x) − 1 ) dx = [ tan(x) − x ]π/3 π/6 = √ 3 − π 3 − ( 1 √ 3 − π 6 ) = 2 √ 3 3 − π 6 . An alternative for computing this integral occurs by the substitution u = tan(x), with du = dx cos2(x) , sin2 (x) = tan2 (x) 1 + tan2 (x) = u2 1 + u2 . This is also based on the integral ∫ 1 1 + u2 du = arctan(u) + C and left as an easy challenge. 
Notice, to confirm the result in Sage we may type show(integral(tan(x)*tan(x), x, pi/6, pi/3)) For the second integral recall by 6.B.8 that integration by parts gives ∫ x cos2(x) dx = x tan(x) + ln |cos(x)| + C , for all x ̸= π 2 + kπ, k ∈ Z, where C is some constant. Thus π/4∫ 0 x cos2(x) dx = [ x tan(x) + ln(cos(x)) ]π/4 0 = π 4 + ln √ 2 2 = π 4 − ln(2) 2 . Again we can confirm this result in Sage, as before, i.e., show(integral(x/(cos(x)*cos(x)), x, 0, pi/4)) 6.B.32. (a) Set y = 1 − x2 with dy = −2dx. For x = 0 we have y = 1 and for x = 1 we have y = 0. Thus 1∫ 0 x √ 1 − x2 dx = − 0∫ 1 y−1/2 2 dy = 1∫ 0 y−1/2 2 dy = [√ y ]1 0 = 1 . (b) Set t = x + √ x2 − 1, so that dt = √ x2−1+x√ x2−1 dx For x = 1 we get t = 1 and for x = 2 we get t = 2 + √ 3. Thus, 2∫ 1 dx √ x2 − 1 = 2+ √ 3∫ 1 1 t dt = [ ln(t) ]2+ √ 3 1 = ln ( 2 + √ 3 ) . 6.B.33. Using Sage we see that the integral at hand equals to −1/2. This can be done in the usual way, i.e., the cell integral(sin(3*x)*cos(x), x, pi/4,pi) One method to verify this result formally is based on the identity sin(α + β) + sin(α − β) = 2 sin(α) cos(β) which for our case applies as 2 sin(3x) cos(x) = sin(3x + x) + sin(3x − x). Thus we get ∫ π π/4 sin(3x) cos(x) dx = 1 2 ∫ π π/4 ( sin(3x + x) + sin(3x − x) ) dx = 1 2 ∫ π π/4 sin(4x) dx + 1 2 ∫ π π/4 sin(2x) dx = − 1 8 [ cos(4x) ]π π/4 − 1 4 [ cos(2x) ]π π/4 = − 1 8 (1 − (−1)) − 1 4 (1 − 0) = − 1 4 − 1 4 = − 1 2 . 615 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.B.34. (a) Because 0 ≤ x9 √ 2 ≤ x9 √ 1 + x ≤ x9 for all x ∈ [0, 1] the geometric meaning of the definite integral implies √ 2 20 = 1∫ 0 x9 √ 2 dx ≤ 1∫ 0 x9 √ 1 + x dx ≤ 1∫ 0 x9 dx = 1 10 . (b) Because 0 < x < π/2 we have 0 < 2 π < 1 x . Also we have 0 ≤ sin(x) ≤ 1 for all x ∈ (0, π/2). Thus ∫ π/2 0 2 π dx < ∫ π/2 0 sin(x) x < ∫ π/2 0 1 dx , or equivalently 2 π · π 2 < ∫ π/2 0 sin(x) x < π 2 . 6.B.35. Recall that any continuous function f : [−a, a] → R satisfies the relation ∫ a −a f(x) dx = ∫ 0 −a f(x) dx + ∫ a 0 f(x) dx . (∗) Set u = −x such that du = −dx. For x = −a we have u = a and for x = 0 we have u = 0. Assume now that f is even. Then f(u) = f(−x) = f(x) and thus ∫ 0 −a f(x) dx = ∫ 0 a f(u)(−du) = − ∫ 0 a f(u) du = ∫ a 0 f(u) du = ∫ a 0 f(x) dx and hence (∗) gives ∫ a −a f(x) dx = 2 ∫ a 0 f(x) dx. Similarly is treated the case where f is odd. 6.B.36. (1) Since f is periodic with period T, for any a ∈ R one can write ∫ a+T a f(x) dx = ∫ 0 a f(x) dx + ∫ T 0 f(x) dx + ∫ a+T T f(x − T) dx . (†) Set now u = x−T with du = dx. For x = T we have u = 0 and for x = a+T we have u = a. Thus, ∫ a+T T f(x−T) dx = ∫ a 0 f(u) du, and the result follows by (†). An alternative relies on the primitive function F of f, which allows us to prove that ∫ T 0 f(x) dx = ∫ a 0 f(x) dx + ∫ T a f(x) dx = ∫ a 0 f(x) dx + F(T) − F(a) = ∫ a 0 f(x) dx + ( F(a + T) − F(a) ) − ( F(a + T) − F(T) ) = ∫ a 0 f(x) dx + ∫ a+T a f(x) dx − ∫ a+T T f(x) dx . (‡) Then, as above with the substitution t = x − T one can show that ∫ a+T T f(x) dx = ∫ a 0 f(x) dx and the result follows by (‡). (2) Since f is periodic with period T, using induction over the naturals we can show that f(x) = f(x + nT) = f(x − nT), for any n ∈ N. This easily extends to n ∈ Z (try to prove this claim). Using this periodicity property, for n ≥ 0 we get ∫ a+nT a f(x) dx = ∫ 0 a f(x) dx + n−1∑ k=0 ∫ (k+1)T kT f(x) dx + ∫ a+nT nT f(x) dx = ∫ 0 a f(x) dx + n−1∑ k=0 ∫ (k+1)T kT f(x − kT) dx + ∫ a+nT nT f(x − nT) dx = ∫ 0 a f(x) dx + n−1∑ k=0 ∫ T 0 f(x) dx + ∫ T 0 f(x) dx = n−1∑ k=0 ∫ T 0 f(x) dx = n ∫ T 0 f(x) dx . 
(∗) 616 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Notice for the integral ∫ (k+1)T kT f(x − kT) dx we did the substitution t = x − kT with dt = dx, and t = 0 for x = kT, t = T for x = (k + 1)T. Similarly for ∫ a+nT nT f(x − nT) dx. Suppose now that n < 0. Then we have ∫ a+nT a f(x) dx = − ∫ a a+nT f(x) dx = − ∫ a+nT +(−nT ) a+nT f(x) dx (∗) = − ( − n ∫ T 0 f(x) dx ) = n ∫ T 0 f(x) dx . This proves our claim. Observe, that the identity in (1) for a = 0 reduces to ∫ nT 0 f(x) dx = n ∫ T 0 f(x) dx . This says that the area under Cf for n periods equals n times the area under Cf for one period. 6.B.38. (a) The given function F is clearly the antiderivative of the function f(x) := x5 ln (x + 1) on the interval (−1, 1). Thus, F′ (x) = −x5 ln (x + 1), see also the fundamental theorem of calculus in 6.2.9. (b) We have F′ (x) = x2 sin(x) + 4 cos(4x) and we see that |g(x)| = x2 sin(x) + 4 cos(4x) x2 + 2 ≤ |x2 sin(x)| + 4| cos(4x)| x2 + 2 ≤ |x2 | · 1 + 4 · 1 x2 + 2 = x2 + 4 x2 + 1 , for all x ∈ (0, +∞). Thus, obviously limx→∞ x2 +4 x2+1 = 1, which you can confirm in Sage by the cell var("x"); lim((x^2+4)/(x^2+1), x=oo) 6.B.45. For some real numbers a < b consider the set A = (a, b] and the function f : A → R, with f(x) = 1/(x − a). Though f is continuous on A, it is not uniformly continuous, since A is not closed. Indeed, let (xn) and (yn) be the sequences defined by xn = a + b − a n , yn = a + b − a n + 1 , n ∈ Z+ . Then it is easy to see that (xn − yn) → 0 but f(xn) − f(yn) does not convergent to 0. Hence, by 6.B.44 one deduces that f cannot be uniformly continuous on A. As an example where the domain is not bounded, consider the parabola h : A → R, h(x) = x2 defined on A = [0, +∞). The sequences (xn = n) and (yn = n − 1 n ) with n ∈ Z+, they both belong to A and satisfy xn − yn = 1 n → 0, as n → ∞. However, h(xn)−h(yn) = 2− 1 n2 , which obviously does not converge to 0. By 6.B.44 we deduce that h cannot be uniformly continuous on A 6.B.46. Consider the sequences (xn), (yn) with general terms xn = 1/(n + 1) and yn = 1/n, respectively, with n ≥ 2. They both belong to A = (0, 1) and satisfy xn − yn → 0. However, it is easy to see that f(xn) − f(yn) = 1 and hence limn→∞ ( f(xn) − f(yn) ) = 1 ̸= 0. The assertion now follows by the statement in 6.B.44. Provide an alternative proof based on the ε-δ-definition of uniform continuity, given in 6.2.11. 6.B.48. The proof is analogous to 6.B.47 and we leave it to the reader for practice. 6.B.49. (a) For instance, consider the function f(x) = sin ( 1 x ) with x ∈ A = (0, 1]. It is not hard to see that f is continuous on A, but not uniformly continuous. Now, for any n ∈ N we have n π + π 2 ≥ 1 and hence the sequence (xn) with general term xn = 1 n π + π/2 , n ∈ N , is a sequence in A. Since n π + π/2 → +∞ we have xn → 0, as n → +∞. Thus (xn) is convergent and thus a Cauchy sequence in A. However, the sequence (f(xn)) = ( sin ( 1 xn )) n∈N is not a Cauchy sequence. Why?. (b) Let (xn) be a Cauchy sequence with xn ∈ A for all n. By assumption f is uniformly continuous on A, hence for every ε > 0 there exists δ = δ(ε) > 0 such that |f(x) − f(y)| < ε, for all x, y ∈ A, provided that |x − y| < δ. However, (xn) is Cauchy and since δ > 0 we can find some natural N = N(δ) such that |xn − xm| < δ for all n, m ≥ N. But then we get |f(xn) − f(xm)| < ε, for all n, m ≥ N, and the result follows. 6.B.50. The function f(x) = x2 is continuous on R and hence also on A = [0, 1]. 
Since $[0, 1]$ is a closed and bounded interval, the function $f$ must be uniformly continuous there, see the theorem in 6.2.11.

Suppose that the function $g(x) = \tan(x)$ is uniformly continuous on $B = [0, \pi/2)$. The domain of $g$ is a bounded interval, so $g$ would then have to be bounded as well. (Exercise: show that any uniformly continuous function $f\colon A \to \mathbb{R}$ defined on a bounded interval $A$ is bounded.) However, it is easy to see that $\tan(x_n) \to +\infty$ for $x_n \in (0, \pi/2)$ with $x_n \to \pi/2$, which implies that $g(x) = \tan(x)$ is not bounded (think also of the graph of $\tan(x)$). This gives a contradiction, and hence $g(x) = \tan(x)$ is not uniformly continuous on $B$.

Next, the function $h(x) = x^2$ is not uniformly continuous on $\mathbb{R}$. For instance, consider the sequences $(x_n = n + \frac{1}{n})$ and $(y_n = n)$, with $x_n - y_n \to 0$. Then we see that $h(x_n) - h(y_n) = 2 + \frac{1}{n^2} \ge 2$, and thus $\lim_{n\to+\infty}\big(h(x_n) - h(y_n)\big) \ne 0$. Our assertion now follows by 6.B.44. Finally, the function $k(x) = x^3$ is also not uniformly continuous on $\mathbb{R}$. Here one may consider the same sequences $x_n = n + \frac{1}{n}$ and $y_n = n$. Then obviously $x_n - y_n \to 0$, but we see that $k(x_n) - k(y_n) \ge 3n$ for all $n$, which implies that $\lim_{n\to+\infty}\big(k(x_n) - k(y_n)\big) = +\infty$.

6.B.52. The improper integral represents the area of the region bounded by the graph of the positive function $f(x) = \frac{\arctan(x)}{x\sqrt{x}} = x^{-3/2}\arctan(x)$, $x \ge 1$, and the $x$-axis. From the left, this region is bounded by the line $x = 1$, see also the accompanying figure. Therefore, the integral is either a positive real number or equals $+\infty$. First we see that
$$\int_1^{+\infty} x^{-3/2}\,dx = \lim_{t\to+\infty}\int_1^t x^{-3/2}\,dx = \lim_{t\to+\infty}\Big(2 - \frac{2}{\sqrt{t}}\Big) = 2\,.$$
Moreover, we know that $\frac{\pi}{4} \le \arctan(x) \le \frac{\pi}{2}$ for all $x \in [1, +\infty)$, and hence we get
$$\frac{\pi}{2} = \frac{\pi}{4}\int_1^{+\infty} x^{-3/2}\,dx \le \int_1^{+\infty}\frac{\arctan(x)}{x\sqrt{x}}\,dx \le \frac{\pi}{2}\int_1^{+\infty} x^{-3/2}\,dx = \pi\,.$$
Thus the integral at hand is a finite real number. In fact, with the aid of Sage one can compute this integral via the command integral(arctan(x)/(x*sqrt(x)), x, 1, oo).

6.B.57. As usual, you can use the def method and introduce a routine which you may call "length_of_curve". Since we usually work with the variable $t$ for a parametric curve $c\colon [a, b] \to \mathbb{R}^2$, $t$ should be declared as a symbolic variable, together with the endpoints $a, b$, that is, the limits of integration. Thus the input of our routine can be a tuple $(x, y, a, b)$, corresponding to the curve $c\colon [a, b] \to \mathbb{R}^2$ with $c(t) = [x(t), y(t)]$ for all $t \in [a, b]$. Hence, although $a, b$ are real-number inputs, $x, y\colon [a, b] \to \mathbb{R}$ are real-valued functions of $t$, which must be introduced as symbolic functions. To control also the aspect ratio of the corresponding parametric plot (and obtain a better illustration), we introduce one more variable, say rt, which we can specify for each curve so that the result is the desired one. This means that the input of our routine is finally the tuple (x, y, a, b, rt), and the implementation goes as follows:

var("t, a, b, rt")
function("x")(t)
function("y")(t)
def length_of_curve(x, y, a, b, rt):
    X(t) = diff(x, t)
    Y(t) = diff(y, t)
    l(t) = (X(t)*X(t) + Y(t)*Y(t)).factor()
    s(t) = numerical_approx(integral(sqrt(l), t, a, b), digits=7)
    show("The length of the curve c(t)=", (x, y), " on ", [a, b], " equals: ", s(t))
    p = parametric_plot((x, y), (t, a, b))
    p.show(figsize=6, aspect_ratio=rt)
    return

To test the routine, we first confirm a known result based on the curve $[\mathrm{e}^t - t,\, 4\mathrm{e}^{t/2}]$ of the previous task (see 6.B.56), with $t \in [0, 1]$.
To confirm this case, now one can simply give the cell length_of_curve(e^t-t, 4*e^(t/2), 0, 1, 1/2) where notice that we fixed rt = 1/2. Sage returns the following answer: The length of the curve c(t)= ( −t + et , 4 e ( 1 2 t )) on [0, 1] equals to:2.718282 For the given cases one can proceed similarly (we present them case by case): (1) For the curve c(t) = [cos3 (t), sin3 (t)] on [0, pi/2] type length_of_curve((cos(t))^3, (sin(t))^3, 0, pi/2, 1/2) Here the output is The length of the curve c(t)= ( cos (t) 3 , sin (t) 3 ) on [ 0, 1 2 π ] equals to:1.500000 Indeed, we see that |c′ (t)| = √ 9 cos4(t) sin2 (t) + 9 sin4 (t) cos2(t) = 3 cos(t) sin(t) and thus sc(t) = ∫ π/2 0 3 cos(t) sin(t) dt = 3 2 ∫ π/2 0 sin(2t) dt = 3 2 [ − cos(2t) 2 ]π/2 0 = 3 2 [ − cos(π) 2 + cos(0) 2 ] = 3 2 . As for the graph of c(t), Sage returns the figure given here: (2) For the curve c(t) = [t, ln(cos(t))] on [0, π/4] type length_of_curve(t, ln(cos(t)), 0, pi/4, 1) Sage;s output has the form The length of the curve c(t)= (t, log (cos (t))) on [ 0, 1 4 π ] equals to:0.8813736 and the corresponding plot has the form To confirm Sage’s result we compute 619 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS |c′ (t)| = √ 1 + ( − sin(t) cos(t) )2 = √ 1 + tan2 (t) = √ sec2(t) = sec(t) , where we used the identity tan2 (x) = sec2 (x) − 1. Thus sc(t) = ∫ π/4 0 |c′ (t)| dt = ∫ π/4 0 sec(t) dt . Notice now that ∫ sec(x) dx = ∫ sec(x) ( sec(x) + tan(x) ) sec(x) + tan(x) dx = ∫ sec2 (x) + sec(x) tan(x) sec(x) + tan(x) dx . Setting u = sec(x) + tan(x) we compute du = ( sec(x) + tan(x) )′ dx = ( ( 1 cos(x) )′ + tan′ (x) ) dx = ( sin(x) cos2(x) + (1 + tan2 (x)) ) dx = = ( 1 cos(x) · sin(x) cos(x) + sec2 (x) ) dx = ( sec(x) tan(x) + sec2 (x) ) dx , and hence ∫ sec(x) dx = ∫ 1 u du = ln |u| + C , C ∈ R . (⋆) Since we have u(0) = sec(0) + tan(0) = 1 and u(π/4) = sec(π/4) + tan(π/4) = √ 2 + 1, we finally deduce that sc(t) = ∫ π/4 0 sec(t) dt = ∫ √ 2+1 1 du u = [ ln |u| ]√ 2+1 1 = ln(1 + √ 2) − ln(1) = ln(1 + √ 2) ≈ 0.8813736 . (3) For the curve c(t) = [t sin(t), t cos(t)] on [0, 4π] type length_of_curve(t*sin(t), t*cos(t), 0, 4*pi, 1) which answers The length of the curve c(t)= (t sin (t) , t cos (t)) on [0, 4 π] equals to:80.81931 and prints out the following figure Let us present some hints helpful for verifying this result formally. First we compute |c′ (t)| = √ (sin(t) + t cos(t))2 + (cos(t) − t sin(t))2 = √ 1 + t2 and thus we need to compute sc(t) = ∫ 4π 0 √ 1 + t2 dt. For this integral, using the relation (⋆) from (2) we will first prove that ∫ √ 1 + x2 dx = 1 2 x √ 1 + x2 + 1 2 ln(x + √ 1 + x2) + C . (∗) Indeed, set x = tan(θ) with dx = ( 1 cos2(x) ) dθ = sec2 (x) dθ. Then ∫ √ 1 + x2 dx = ∫ √ 1 + tan2 (θ) sec2 (θ) dθ = ∫ √ sec2(θ) sec2 (θ) dθ = ∫ sec3 (θ) dθ . 620 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS We have (tan(θ))′ = sec2 (x) and (sec(θ))′ = sec(θ) tan(θ) and integration by parts gives ∫ sec3 (θ) dθ = ∫ sec2 (θ) sec(θ) dθ = ∫ (tan(θ))′ sec(θ) dθ = tan(θ) sec(θ) − ∫ tan(θ)(sec(θ))′ dθ = tan(θ) sec(θ) − ∫ tan2 (θ) sec(θ) dθ = tan(θ) sec(θ) − ∫ ( sec2 (θ) − 1 ) sec(θ) dθ = tan(θ) sec(θ) − ∫ sec3 (θ) dθ + ∫ sec(θ) dθ . Thus ∫ sec3 (θ) dθ = 1 2 ( tan(θ) sec(θ) + ∫ sec(θ) dθ ) and hence using (∗) we can write ∫ √ 1 + x2 dx = ∫ sec3 (θ) dθ = 1 2 tan(θ) sec(θ) + 1 2 ln | sec(θ) + tan(θ)| + C . The desired formula (⋆) appears now by replacing x = tan(θ) and sec(θ) = √ 1 + x2. The final computation of sc(t) is now based on (⋆) and left as an easy exercise. 
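As a quick cross-check of this exercise, one may evaluate the closed form $(\star)$ at the endpoints in Sage; the following is a small sketch of ours, with the antiderivative F taken from $(\star)$:

var("t")
F(t) = (1/2)*t*sqrt(1 + t^2) + (1/2)*ln(t + sqrt(1 + t^2))
print(numerical_approx(F(4*pi) - F(0), digits=7))                      # expected: about 80.81932
print(numerical_approx(integral(sqrt(1 + t^2), t, 0, 4*pi), digits=7))

Both values agree with the length 80.81931 reported by our routine, up to rounding in the last digit.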
(4) For the curve $c(t) = [\sin^2(t), \cos^2(t)]$ with $t \in [0, \frac{\pi}{2}]$, similarly to the previous cases we can type

length_of_curve((sin(t))^2, (cos(t))^2, 0, pi/2, 1)

Sage returns (be aware that in this case Sage does not typeset the curve $c(t)$ properly):

The length of the curve c(t)= (sin(t)^2, cos(t)^2) on [0, 1/2*pi] equals: 1.414214

together with an illustration of the curve at hand. For a formal computation, we see that
$$|c'(t)| = \sqrt{\big(2\sin(t)\cos(t)\big)^2 + \big(-2\sin(t)\cos(t)\big)^2} = \sqrt{\big(\sin(2t)\big)^2 + \big(-\sin(2t)\big)^2} = \sqrt{2\sin^2(2t)}\,.$$
Thus
$$s_c(t) = \int_0^{\pi/2} |c'(t)|\,dt = \sqrt{2}\int_0^{\pi/2}\sin(2t)\,dt = \sqrt{2}\Big[-\frac{\cos(2t)}{2}\Big]_0^{\pi/2} = \sqrt{2}\,.$$
In fact, the given curve is a part of the line $y = 1 - x$ (since $\sin^2(t) + \cos^2(t) = 1$), and in particular the segment with boundary points $[0, 1]$ for $t = 0$ and $[1, 0]$ for $t = \frac{\pi}{2}$, see also the figure mentioned above. Hence one can immediately write down its length $s_c(t) = \sqrt{2}$ (by the Pythagorean theorem).

6.B.58. It is easy to see that
$$f'(t) = -r\sin t + \frac{r}{2\tan\frac{t}{2}\cos^2\frac{t}{2}} = -r\sin t + \frac{r}{\sin t} = \frac{r\cos^2 t}{\sin t}\,, \qquad g'(t) = r\cos t\,,$$
for any $t \in [\pi/2, a]$. Thus, for the length $s_\alpha(t)$ we get
$$s_\alpha(t) = \int_{\pi/2}^a \sqrt{\frac{r^2\cos^4 t}{\sin^2 t} + r^2\cos^2 t}\,dt = \int_{\pi/2}^a \sqrt{\frac{r^2\cos^2 t}{\sin^2 t}}\,dt = -r\int_{\pi/2}^a \frac{\cos t}{\sin t}\,dt = -r\big[\ln(\sin t)\big]_{\pi/2}^a = -r\ln(\sin a)\,.$$
To plot the tractrix for the values $r = 1, 2, \ldots, 5$ one can use the block

var("r, t")
f(t, r) = r*cos(t) + r*ln(tan(t/2))
g(t, r) = r*sin(t)
a = parametric_plot((f(t, 1), g(t, 1)), (t, pi/2, pi/2+1), exclude=[pi], rgbcolor=(0.2,0.2,0.5), title="The tractrix for several values of r")
a += parametric_plot((f(t, 2), g(t, 2)), (t, pi/2, pi/2+1), exclude=[pi], rgbcolor=(0.2,0.5,0.2))
a += parametric_plot((f(t, 3), g(t, 3)), (t, pi/2, pi/2+1), exclude=[pi], rgbcolor=(0.5,0.2,0.2))
a += parametric_plot((f(t, 4), g(t, 4)), (t, pi/2, pi/2+1), exclude=[pi], rgbcolor=(0.4,0.4,0.2))
a += parametric_plot((f(t, 5), g(t, 5)), (t, pi/2, pi/2+1), exclude=[pi], rgbcolor=(0.4,0.2,0.8))
a += point([[0, r] for r in [1,2,..,5]], color="black")
a.show(aspect_ratio=4, figsize=6)

which produces the figure shown above.

6.B.64. By separating the variables we can write $y^{-1/2}\,dy = \mathrm{e}^{\sin(x)}\cos(x)\,dx$, and this implies that
$$\int y^{-1/2}\,dy = \int \mathrm{e}^{\sin(x)}\cos(x)\,dx \iff 2y^{1/2} = \int \mathrm{e}^{\sin(x)}\cos(x)\,dx + C\,.$$
To compute the integral on the right-hand side, set $t = \sin(x)$ with $dt = \cos(x)\,dx$. Then we get
$$2y^{1/2} = \int \mathrm{e}^t\,dt + C \iff 2y^{1/2} = \mathrm{e}^t + C = \mathrm{e}^{\sin(x)} + C\,.$$
To compute $C$ we use the initial condition. By assumption $y(0) = 16$, thus $2\sqrt{y(0)} = 2\cdot 4 = 8 = \mathrm{e}^{\sin(0)} + C$, from where we get $C = 7$. Thus
$$y = y(x) = \Big(\frac{\mathrm{e}^{\sin(x)} + 7}{2}\Big)^2\,.$$

6.B.67. Let us refer to our routine as trapezoid_rule. The input should be a function f, the real numbers a, b representing the lower and upper limits of the integral, and the number of steps n. Let us present the program and provide the necessary explanations below.

def trapezoid_rule(f, a, b, n):
    step = (b - a) / n
    valsf = [f(a + i * step) for i in range(n + 1)]
    return (step / 2) * (valsf[0] + 2 * sum(valsf[1:n]) + valsf[-1])

In this block we use the list valsf, which is generated by evaluating the function f at n + 1 evenly spaced points between a and b. Hence, for example, valsf[i] contains the function value at the ith point, i.e., valsf[i] = f(a + i*step). In the return statement, valsf[1:n] selects the sublist containing all the intermediate function values, excluding the first and last points.
This is because the trapezoid rule assigns a weight of 2 to the intermediate points between the start and end of the interval, while the first and last points are treated differently with a weight of 1. In our program, these are accessed separately as valsf[0] and valsf[−1], respectively. Let us now test our routine, by verifying, for instance, the result obtained in 6.B.66. It suffices to add the cell 622 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS n=4; a=0; b=pi; f(x)=sin(x) print(trapezoid_rule(f,a,b,n).n()) Sage’s result is 1.89611889793704. For the estimation of the integral L by the trapezoid rule, we will first present a formal computation. We have a = π/2, b = 3π/2, n = 4 and hence h = π/4. Moreover, x0 = a = π 2 , x1 = x0 + h = 3π 4 , x2 = x0 + 2h = π , x3 = x0 + 3h = 5π 4 , x4 = b = x0 + 4h = 3π 2 . Thus we obtain Ltrap = h 2 [ f(x0) + 2f(x1) + 2f(x2) + 2f(x3) + f(x4) ] = π 8 [ cos (π 2 ) + 2 cos ( 3π 4 ) + 2 cos (π) + 2 cos ( 5π 4 ) + cos ( 3π 2 ) ] = π 8 [ 0 − 2 √ 2 2 − 2 − 2 √ 2 2 + 0 ] = − π(1 + √ 2) 4 ≈ −1.896 . We can now confirm this result by applying the trapezoid_rule routine: n=4; a=pi/2; b=3*pi/2; f(x)=cos(x) print(trapezoid_rule(f,a,b,n).n()) Sage’s output is −1.89611889793704 To compute the actual error we first see that L = ∫ 3π/2 π/2 cos(x) dx = [sin(x)] 3π/2 π/2 = sin(3π/2) − sin(π/2) = −1 − 1 = −2. Thus, |L − Ltrap| = |−2 + 1.896| = |−0.104| = 0.104. For the theoretical error a simple computation gives that (b−a)h2 12 |f′′ | = π3 192 ≈ 0.161. It is easy to see that the cosine function is convex over the integral of integration and the trapezoid rule overestimates the integral (since −1.896 > −2). This overestimation is characteristic of the trapezoid rule when applied to convex functions. In fact, cos(x) is negative on [π/2, 3π/2], and hence the trapezoids do not exceed the area under the curve. This gives the desired overestimation, as illustrated in the figure below. 6.B.68. Simpson’s rule formula for an integral over [a, b] with n intervals, is given by (see 6.2.23) ISimp = h 3 [f(x0) + 4f(x1) + 2f(x2) + 4f(x3) + · · · + 2f(xn−2) + 4f(xn−1) + f(xn)] , where h = b−a n , xi = a + ih for i = 0, 1, . . . , n, and n is even! Hence, except of the starting and ending points x0, xn, for all odd indices i = 1, 3, 5 . . . the function values f(xi) are multiplied by 4, and for all even indices i = 2, 4, 6 . . . the function values f(xi) are multiplied by 2. For our case, for h = 1 4 we get the equation 1 4 = a−b n = 1−0 n , which gives n = 4. Thus we have n + 1 = 5 nodes: x0 = 0, x1 = 1/4, x2 = 1/2, x3 = 3/4 and x4 = 1, with values f(x0) = 1, f(x1) = 4/5, f(x2) = 2/3, f(x3) = 4/7 and f(x4) = 1/2, respectively. Thus ISimp = h 3 [f(x0) + 4f(x1) + 2f(x2) + 4f(x3) + f(x4)] = 1 12 ( 1 + 4 4 5 + 2 2 3 + 4 4 7 + 1 2 ) ≈ 0.69325 , 623 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS where we used Sage via the following basic command to do the final computation: print(N((1/12)*(1+16/5+4/3+16/7+1/2))) Let us now find the approximation by the trapezoid rule, where the corresponding nodes are the same. We have Itrap = h 2 [f(x0) + 2f(x1) + 2f(x2) + 2f(x3) + f(x4)] = 1 8 ( 1 + 2 4 5 + 2 2 3 + 2 4 7 + 1 2 ) ≈ 0.6970 . This result can be verified by using the program constructed in 6.B.67, for example. 
Hence we can type def trapezoid_rule(f, a, b, n): step = (b - a) / n valsf = [f(a + i * step) for i in range(n + 1)] return (step / 2) * (valsf[0] + 2 * sum(valsf[1:n]) + valsf[-1]) n=4; a=0; b=1; f(x)=1/(1+x) print(trapezoid_rule(f,a,b,n).n()) Running this block Sage prints out the number 0.697023809523809. The exact value of I is given by I = ∫ 1 0 1 1 + x dx = ln(1 + 1) − ln(1 + 0) = ln(2) ≈ 0.69314, and we may summarize the results about the errors in a table: Method Approximation Actual error Trapezoidal Rule 0.6970 |0.693147 -0.6970| ≈ 0.0039 Simpson’s Rule 0.693253 |0.693147-0.69325| ≈ 0.000106 Thus, the approximation given by Simpson’s rule is significantly closer to the exact value of I, compared to the trapezoidal rule (Simpson’s rule yields a smaller error). Similarly is treated the case with h = 1/2, which corresponds to n = 2 and n + 1 = 3 nodes: x0 = 0, x1 = 1/2, and x2 = 1. We leave this for practice. 6.B.69. Recall that Simpson’s rule requires n to be an even number, as the method involves pairing subintervals to fit a quadratic polynomial. To solve this task, we will use a method similar to the one presented in 6.B.67, and therefore, we will omit extensive explanations. Here is our routine: def simpson_rule(f, a, b, n): if n % 2 != 0: raise ValueError("n must be even for Simpson’s rule.") step = (b - a) / n valsf = [f(a + i * step) for i in range(n + 1)] return (step / 3) * (valsf[0] + 4 * sum(valsf[i] for i in range(1, n, 2)) + 2 * sum(valsf[i] for i in range(2, n-1, 2)) + valsf[-1]) Let us now use the function f(x) = 1/(1 + x) of the previous example, with h = 1/4 to check our program: n = 4; a = 0; b = 1; f(x) = 1 / (1 + x) print(simpson_rule(f, a, b, n).n()) This verifies our computation, as Sage’s output is the number 0.69325396825396. Note that if n is odd, Sage will display the the error: “n must be even for Simpson’s rule.” You can use this program to demonstrate additional examples of Simpson’s rule in your Sage editor. For h = 1/4 and the interval I given in 6.B.68, the slight overestimation of Simpson’s rule demonstrated above can be illustrated as follows: 624 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS To construct this figure we used the following block in Sage: f(x) = 1 / (1 + x) a = 0; b = 1; h = 1/4 n = int((b - a) / h) # Generate x values and corresponding y values x_values = [a + i*h for i in range(n + 1)] y_values = [f(x) for x in x_values] # Initialize the plot with the original function p = plot(f, a, b, color="blue", legend_label="f(x) = 1/(1+x)", thickness=1.5) # Create the piecewise quadratic approximation for Simpson’s rule for i in range(0, n, 2): x0, x1, x2 = x_values[i], x_values[i+1], x_values[i+2] y0, y1, y2 = y_values[i], y_values[i+1], y_values[i+2] # Shade the area under the curve for this segment using a polygon p += polygon([(x0, 0), (x0, y0), (x2, y2), (x2, 0)], color="darkgreen", alpha=0.3, fill=True) # Highlight the points used in Simpson’s rule p += points([(x, f(x)) for x in x_values], color="black", size=30, marker="o") # Show the plot with a larger range for visibility p.show(legend_loc="upper right") 6.B.70. It is easy to see that the fourth derivative of f(x) = 1 1+x is given by f(4) (x) = 24 (1+x)5 , for all x ∈ [0, 1]. Recall that we can verify this computation in Sage by the cell f(x)=1/(1+x); dif4=diff(f, x, 4); show(dif4) Since the fourth derivative f(4) (x) is decreasing for all x ∈ [0, 1], the maximum value of f(4) (x) occurs at the left endpoint of the interval, i.e., at x = 0, and equals 24. 
Here is a short program in Sage that can verify this claim (you may also plot the graph of $f^{(4)}(x)$):

var("x")
f_4 = 24 / (1 + x)^5
a = 0; b = 1
# Evaluate f^(4) at the endpoints
f_4_a = f_4.subs(x=a)
f_4_b = f_4.subs(x=b)
# Find the maximum value of |f^(4)(x)| on the interval
max_value = max(f_4_a, f_4_b)
# Output the results
f"f^(4)(0) = {f_4_a}, f^(4)(1) = {f_4_b}, Maximum |f^(4)(x)| on [0,1] = {max_value}"

Thus we compute
$$\frac{(b-a)^5}{180\,n^4}\max_{x\in[0,1]}\big|f^{(4)}(x)\big| \overset{n=4}{=} \frac{24}{180\cdot 256} = \frac{1}{1920} \approx 0.00052\,.$$
The actual error $0.000106$ computed above is indeed less than the theoretical error bound $0.00052$. In this way one confirms that the theoretical error bound is valid and provides an upper limit on the actual error.

6.C.3. (a) A typical example is the sequence $(f_n)$ with $f_n(x) = x^n$, $x \in [0, 1]$. In 5.B.3 we learned that for $0 \le x < 1$ the sequence $(f_n(x)) = (x^n)$ converges to $0$, while $(f_n(1))$ converges to $1$. Thus, the limit function of $(f_n)$ on $[0, 1]$ is given by $f(x) = 0$ for $x \in [0, 1)$ and $f(1) = 1$. Although each member $f_n$ is continuous, it is obvious that $f$ is not continuous at $x_0 = 1$ (it has a jump discontinuity there). Another example is given by the sequence $(f_n)_{n\in\mathbb{Z}^+}$ defined by $f_n(x) = 0$ for $x < 0$, $f_n(x) = nx$ for $0 \le x \le \frac{1}{n}$, and $f_n(x) = 1$ for $x > \frac{1}{n}$. The graph of $f_n$ can be made in Sage by a cell such as

n = 2
def f(x):
    if -2 <= x < 0:
        return 0
    elif 0 <= x <= 1/n:
        return n*x
    else:
        return 1
plot(f, -2, 2, figsize=6)

Here the pointwise limit function is $f(x) = 1$, if $x > 0$, and $f(x) = 0$, if $x \le 0$.

(b) Consider for instance the sequence $(f_n(x) = \mathrm{e}^{-nx^2})$ with $n \in \mathbb{N}$ and $x \in \mathbb{R}$. For $x = 0$ we obviously have $f_n(0) = 1$ for all $n$, and hence $\lim_{n\to\infty} f_n(0) = 1$. For $x \ne 0$ we have $\lim_{n\to\infty} f_n(x) = 0$. This is because of the inequality $\frac{1}{\mathrm{e}^t} < \frac{1}{1+t}$, which holds for any $t > 0$ (to see this, use the familiar inequality $\mathrm{e}^t > 1 + t$, $t > 0$). It gives $0 < f_n(x) = \frac{1}{\mathrm{e}^{nx^2}} < \frac{1}{1 + nx^2}$, and our claim follows. Thus, the sequence $(f_n)$ converges on $\mathbb{R}$ to the function $f$ with $f(0) = 1$ and $f(x) = 0$ for $x \ne 0$. Now it is obvious that although all members of $(f_n)$ are differentiable functions over $\mathbb{R}$, the limit function $f$ is not differentiable at $x = 0$ (it is not even continuous there). Another such example is given in 6.C.6.

(c) Let us consider the sequence of functions $f_n\colon [0, 1] \to \mathbb{R}$ defined by $f_n(0) = 0$, $f_n(x) = n$ for $0 < x \le \frac{1}{n}$, and $f_n(x) = \frac{1}{x}$ for $\frac{1}{n} < x \le 1$. It is not hard to prove that the pointwise limit function $f\colon [0, 1] \to \mathbb{R}$ of $(f_n)_{n\in\mathbb{Z}^+}$ is given by $f(0) = 0$ and $f(x) = \frac{1}{x}$ for $0 < x \le 1$. To illustrate this result graphically, we just need to use Sage and plot some members of the sequence $(f_n)$. This can be accomplished in several different ways; for instance, with a bit of programming as demonstrated in the following block:

n = 5
def f(x):
    if x == 0:
        return 0
    elif 0 < x <= 1/n:
        return n
    else:
        return 1/x
plot(f, 0, 1, ymax=6, figsize=6)

Observe that each $f_n$ is bounded by $n$ (notice that $x > \frac{1}{n}$ implies $\frac{1}{x} < n$). Next we need to understand the discontinuities of the functions $f_n$, $n \in \mathbb{Z}^+$. Obviously, at $x = 0$ we have a (jump) discontinuity: $\lim_{x\to 0^+} f_n(x) = n \ne 0 = f_n(0)$, for all $n \in \mathbb{Z}^+$. At all the other points the functions $f_n$ are continuous; e.g., at $x = \frac{1}{n}$ we get
$$\lim_{x\to(1/n)^-} f_n(x) = n = f_n(1/n)\,, \qquad \lim_{x\to(1/n)^+} f_n(x) = \lim_{x\to(1/n)^+} \frac{1}{x} = \frac{1}{1/n} = n\,.$$
To summarize, each member $f_n$ is bounded and has a single discontinuity at $x = 0$ (so a finite number of discontinuities, and thus of measure zero). Therefore, each $f_n$ is Riemann integrable. In fact, we can easily see that
$$\int_0^1 f_n(x)\,dx = \int_0^{1/n} n\,dx + \int_{1/n}^1 \frac{1}{x}\,dx = 1 + \ln(n)\,.$$
In contrast, the limit function $f$ is not bounded on $[0, 1]$, and thus cannot be integrable on $[0, 1]$.

6.C.5.
The sequence of functions (fn) defined by the given graph has the form fn(x) =    nx , for x ∈ [0, 1 n ) , 2 − nx , for x ∈ [1/n, 2/n] , 0 , otherwise. If x = 0 we have fn(x) = 0 for all n, and limn→∞ fn(0) = 0. Suppose now that x > 0 and in particular that x ∈ (0, 1 n ) for some n. Then, we may find a sufficiently large integer n0 such that x /∈ [0, 1 n0 ), which implies again that limn→∞ fn(x) = 0. 627 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS If x ∈ [ 1 n , 2 n ] then similarly there exists a sufficiently large integer n0 such that x /∈ [ 1 n0 , 2 n0 ], which again implies that limn→∞ fn(x) = 0. Hence the claim, i.e., limn→∞ fn(x) = f(x) = 0 for all x ≥ 0. For the uniform convergence, let x0 = 1 2n . Since fn(x0) = fn( 1 2n ) = n · 1 2n = 1 2 is non-zero (regardless how large n becomes), we also get |fn( 1 2n ) − f( 1 2n )| = |1 2 − 0| = 1 2 . Therefore, for any ε < 1 2 does not exist N such that |fn(x)| < ε for all n ≥ N. Consequently, (fn(x)) cannot converge uniformly to f(x) = 0. 6.C.6. Obviously, lim n→∞ fn(x) = lim n→∞ √ x2 + 1 n2 = √ x2 + lim n→∞ 1 n2 = √ x2 + 0 = √ x2 = |x| , that is, the limit function is given by f(x) = |x| with x ∈ R. Now, for given ε > 0, choose N large enough such that 1 N < ε. Then, for any n ≥ N and x ∈ R we have |fn(x) − f(x)| = √ x2 + 1 n2 − |x| = √ x2 + 1 n2 − |x| · √ x2 + 1 n2 + |x| √ x2 + 1 n2 + |x| = 1 n2 √ x2 + 1 n2 + |x| ≤ 1 n2 √ 1 n2 = 1 n < ε . This proves the claim. Observe that each member fn(x) is continuous and differentiable for all x ∈ R, but the limit function f(x) = |x|, though continuous, is not differentiable at x = 0. 6.C.7. (a) On the interval I = (0, 1) consider the sequences (fn), (gn), defined by fn(x) = 1 x , gn(x) = 1 n , n ∈ Z+ . Obviously, fn → f = 1 x and gn → 0, as n → ∞, and it is easy to see that we have uniform convergence on (0, 1). Consider the sequence (fn · gn)n∈Z+ with fn(x)gn(x) = 1 nx for all x ∈ (0, 1). Certainly, 1 nx → 0 as n → ∞, hence its pointwise limit function is the zero one. We will show however that the convergence is not uniform on (0, 1). For uniform convergence, given ε > 0 we need to find some positive integer N such that | 1 nx − 0| < ε for all n ≥ N and x ∈ (0, 1). This is also written as 1 nx < ε or equivalently, n > 1 ε x , which must hold for all x ∈ (0, 1). However, no matter how large n is chosen, there will always be some x ∈ (0, 1) close enough to 0 such that 1 nx ≥ ε (notice that as x → 0+ , the expression 1/x grows without bound, making 1 nx difficult to keep below any positive ε uniformly). In particular, we see that sup x∈(0,1) |fn(x)gn(x) − 0| = sup x∈(0,1) 1 nx = sup x∈(0,1) ( 1 nx ) = +∞ . 6.C.8. 6.D.4. We compute T3 0 (g(x)) = 1 + x2 2 , T3 0 (h(x)) = 1 − x2 2 and T3 0 (k(x)) = x − x3 3 , respectively. In Sage we can verify these results as usual, that is, by typing show(taylor(1/cos(x), x, 0, 3)) , show(taylor(exp(−x ∗ ∗2/2), x, 0, 3)) , show(taylor(sin(sin(x)), x, 0, 3)) , respectively. 6.D.9. (a) The x-axis y = 0 as x → ±∞. (b) The lines y = ln 10, and y = x + ln 3. 6.D.13. The function f(x) = −x2 /(x + 1) with x ∈ A = R\{−1} is not odd, neither even, nor periodic. Its range is the union (−∞, 0]∪[4, +∞). Moreover, the point x0 = −1 is an improper point, i.e., f is not defined there and so it has a single discontinuity with lim x→−1+ f(x) = −∞, lim x→−1− f(x) = +∞ . The function intersects the x axis only at the origin. It is positive for x < −1 and not positive for x > −1. 
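The one-sided limits at $x_0 = -1$ stated above are quickly confirmed in Sage; the following is a minimal sketch of ours, not part of the original solution:

f(x) = -x^2/(x + 1)
print(limit(f(x), x=-1, dir="+"))   # expected: -Infinity
print(limit(f(x), x=-1, dir="-"))   # expected: +Infinity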
It can be shown easily that lim x→−∞ f(x) = +∞, lim x→+∞ f(x) = −∞ , f′ (x) = − x2 + 2x (x + 1)2 , f′′ (x) = − 2 (x + 1)3 , x ∈ R\{−1} . Therefore, f is increasing on the intervals [−2, −1), (−1, 0] and decreasing on the intervals (−∞, −2], [0, +∞), see also a sketch of its graph below. The function f has two stationary points x1 = 0 and x2 = −2. We leave the characterization of these critical points to the reader. Moreover, f is convex on the interval (−∞, −1) and concave on the interval (−1, +∞). 628 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS However, it does not have a point of inflection. The line x = −1 is a vertical asymptote, the inclined asymptote at ±∞ is the line y = −x + 1. For example, f(−3) = 9/2, f′ (−3) = −3/4, f(1) = −1/2, f′ (1) = −3/4. To graph of f together with its asymptotes is given below: To obtain this illustration one may use the block f(x)=(-x^2)/(x+1) p=plot(f(x), x, -10, 10, exclude=[-1], ymin=-11, ymax=11, color="black", figsize=6) p+=text(r"$f(x)=\frac{-x^2}{x+1}$", (-4.1, 9), fontsize=14, color="black") p+=line([(-1, -12), (-1, 12)], rgbcolor=(0.2,0.2,0.5), linestyle="--") p+=line([(-10, 11), (10, -9)], rgbcolor=(0.2,0.5,0.2), linestyle="--"); show(p) Notice in the second line we have used the option exclude = [−1], in order to exclude the point where f is not defined. 6.D.14. The function is defined on A = R\{1} and is everywhere continuous on A. It is not odd, neither even nor periodic. The points of intersection of the graph of f with the axes are the points [ 1 − 3 √ 2, 0 ] and [0, −1]. At x0 = 1, the function has a discontinuity of the second kind and its range is R, which follows from the limits lim x→1− f(x) = −∞, lim x→1+ f(x) = +∞, lim x→±∞ f(x) = +∞. After the arrangement f(x) = (x − 1)2 + 2 x−1 , x ∈ R ∖ {1}, it is not difficult to compute f′ (x) = 2 (x−1)3 −1 (x−1)2 , x ∈ R ∖ {1}, f′′ (x) = 2 (x−1)3 +2 (x−1)3 , x ∈ R ∖ {1}. The only stationary point is x1 = 2. The function f is increasing on the interval [2, +∞), decreasing on the intervals (−∞, 1), (1, 2]. Hence at the point x1 it attains the local minimum y1 = 3. It is convex on the intervals ( −∞, 1 − 3 √ 2 ) , (1, +∞) and concave on the intervals ( 1 − 3 √ 2, 1 ) . The point x2 = 1 − 3 √ 2 is a point of inflection. The line x = 1 is a horizontal asymptote. The function does not have any inclined asymptotes. 6.D.15. The function is defined on R and is everywhere continuous. It is not odd, even nor periodic. It attains positive values on the positive half-axis, negative values on the negative half-axis. The point of intersection of the graph of f with the axes is only at the point [0, 0]. The derivative is: f′ (x) = e−x 3 3√ x2 − 3 √ x e−x , x ∈ R ∖ {0}, f′ (0) = +∞, f′′ (x) = 3 √ x e−x − 2e−x 3 3√ x2 − 2e−x 9 3√ x5 , x ∈ R ∖ {0}. The only zero point of the first derivative is the point x0 = 1/3. The function f is increasing on the interval (−∞, 1/3] and decreasing on the interval [1/3, +∞). Hence at the point x0, it has an absolute maximum y0 = 1/ 3 √ 3e. Since limx→−∞ f(x) = −∞, its range is (−∞, y0]. The points of inflection are x1 = 1− √ 3 3 , x2 = 0, x3 = 1+ √ 3 3 . It is convex on the intervals (x1, x2) and (x3, +∞), concave on the intervals (−∞, x1), (x2, x3). The only asymptote is the line y = 0 at +∞, i.e. limx→+∞ f(x) = 0. 6.D.17. Recall by 6.A.45 that the slope of the cycloid 629 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS x(t) = t − sin(t) , y(t) = 1 − cos(t) , t ∈ [0, 2π] is given by dy dx = dy/dt dx/dt = sin(t) 1 − cos(t) . 
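This slope formula can also be reproduced symbolically in Sage; a short sketch of ours:

var("t")
X(t) = t - sin(t); Y(t) = 1 - cos(t)
show((diff(Y(t), t)/diff(X(t), t)).simplify_full())

which returns an expression equivalent to $\frac{\sin(t)}{1-\cos(t)}$.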
We see that $\frac{dy}{dx} = 0$ if and only if $\frac{dy}{dt} = 0$, that is, $\sin(t) = 0$. This gives $t = \pi$, where we have a horizontal tangent line. The vertical tangents appear at the cusps of the cycloid, that is, at the points $t = 0$ and $t = 2\pi$. These values of $t$ solve the equation $\frac{dx}{dt} = 0$, that is, $\cos(t) = 1$. To calculate the slopes at these points we can use limits:
$$k_0 = \lim_{t\to 0^+}\frac{dy}{dx} = \lim_{t\to 0^+}\frac{\sin(t)}{1-\cos(t)} = +\infty\,, \qquad k_{2\pi} = \lim_{t\to 2\pi^-}\frac{dy}{dx} = \lim_{t\to 2\pi^-}\frac{\sin(t)}{1-\cos(t)} = -\infty\,.$$
Thus the tangents at $t = 0, 2\pi$ are vertical.

6.D.22. Here one should simply give the following block:

show(integral(x^4+e^x+5*ln(x), x))
show(integral(sqrt(x)*(1+x^3), x))
show(integral(x/sqrt(x+1), x))

Check Sage's output yourself.

6.D.23. (a) $F(x) = \frac{8}{15}\,x\sqrt[8]{x^7}$; (b) $G(x) = \frac{4^x}{\ln 4} + \frac{2\cdot 6^x}{\ln 6} + \frac{9^x}{\ln 9}$; (c) $H(x) = \frac{\arcsin(x)}{2}$; (d) $K(x) = \ln(1 + \sin(x))$.

6.D.24. A primitive function is given by $F(x) = \mathrm{e}^x + 3\arcsin\big(\frac{x}{2}\big)$.

6.D.25. Set $t = \ln(x)$ with $dt = \frac{1}{x}\,dx$. Then we see that
$$\int \frac{dx}{x\big(\ln(x)\big)^2 + 2025x} = \int \frac{dx}{x\big(\big(\ln(x)\big)^2 + 2025\big)} = \int \frac{dt}{t^2 + 45^2} = \frac{1}{45}\arctan\Big(\frac{t}{45}\Big) + C = \frac{1}{45}\arctan\Big(\frac{\ln(x)}{45}\Big) + C\,,$$
for some constant $C \in \mathbb{R}$. If you like to verify the result in Sage, just type

show(integral(1/(x*((ln(x))^2+2025)), x))

6.D.26. Set $t = \mathrm{e}^x$ such that $dt = \mathrm{e}^x\,dx = t\,dx$, that is, $dx = \frac{1}{t}\,dt$. Then
$$\int \frac{\mathrm{e}^x}{\mathrm{e}^x + 2024}\,dx = \int \Big(\frac{t}{t+2024}\cdot\frac{1}{t}\Big)\,dt = \int \frac{dt}{t+2024} = \ln|t+2024| + C = \ln(t+2024) + C = \ln(\mathrm{e}^x + 2024) + C\,.$$

6.D.34. To be written.

6.D.35. The answer is $\frac{3}{2}\ln\big(x^2+4x+8\big) - \frac{1}{2}\arctan\frac{x+2}{2} + C$, for some $C \in \mathbb{R}$.

6.D.36. The answer is $\frac{4}{3\sqrt{3}}\arctan\frac{2x+1}{\sqrt{3}} + \frac{2x+1}{3(x^2+x+1)} + C$, with $C \in \mathbb{R}$.

6.D.37. The answer is $\frac{1}{6}\ln\frac{(x+1)^2}{x^2-x+1} + \frac{\sqrt{3}}{3}\arctan\frac{2x-1}{\sqrt{3}} + C$, with $C \in \mathbb{R}$.

6.D.41. We compute $A = \frac{1}{2}\ln\big(\frac{2+\ln 2}{2-\ln 2}\big)$, $B = -\frac{1}{6} - \frac{2}{9}\ln(2)$, and $C = 2/3$.

6.D.48. The answer is $\sum_{n=0}^{\infty} (-1)^n\frac{2^{2n-1}}{(2n)!}x^{2n}$, which converges for all real $x$.

6.D.49. The answer is $\sum_{n=1}^{\infty} (-1)^{n+1}\frac{2^{2n-1}}{(2n)!}x^{2n}$, which converges for all real $x$.

6.D.50. The answer is $f(x) = \sum_{n=1}^{\infty} \frac{3(-1)^{n+1}}{n}x^n$, which converges for all $x \in (-1, 1]$.

6.D.51. One should first realize that we are expanding the function $f(x) = \frac{1}{2}\ln(x)$. Thus, we get $f(x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{2n}(x-1)^n$, which converges on the interval $(0, 2]$.

6.D.53. Recall from Chapter 5 (see for example 5.D.21) that $\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots$, which makes sense for any $x \in \mathbb{R}$. Thus, replacing $x$ by $\sqrt{x}$ we get
$$\cos(\sqrt{x}) = 1 - \frac{(\sqrt{x})^2}{2!} + \frac{(\sqrt{x})^4}{4!} - \frac{(\sqrt{x})^6}{6!} + \cdots = 1 - \frac{x}{2!} + \frac{x^2}{4!} - \frac{x^3}{6!} + \cdots\,.$$
We can now approximate the integral at hand as follows:
$$I = \int_0^1 \Big(1 - \frac{x}{2!} + \frac{x^2}{4!} - \frac{x^3}{6!} + \cdots\Big)\,dx = \Big[x - \frac{x^2}{2\cdot 2!} + \frac{x^3}{3\cdot 4!} - \frac{x^4}{4\cdot 6!} + \cdots\Big]_0^1 = 1 - \frac{1}{2\cdot 2!} + \frac{1}{3\cdot 4!} - \frac{1}{4\cdot 6!} + \cdots$$
Hence we deduce that $I = \int_0^1 \cos(\sqrt{x})\,dx \approx 1 - \frac{1}{2\cdot 2!} + \frac{1}{3\cdot 4!} - \frac{1}{4\cdot 6!} = 1 - \frac{1}{4} + \frac{1}{72} - \frac{1}{2880} = 733/960 \approx 0.7635416$. In Sage the exact value of $I$ is obtained by the cell

integral(cos(sqrt(x)), x, 0, 1)
N(integral(cos(sqrt(x)), x, 0, 1))

Here the first command gives the answer $2\cos(1) + 2\sin(1) - 2$, and the second prints out its decimal expression, $0.763546581352073$. Hence the estimate based on the Maclaurin series has a very small error.

6.D.56. The error belongs to the interval $(0, 1/200)$.

6.D.57. We find that $0 < \int_1^2 \frac{\cos^{10}(x)}{10}\ln(x)\,dx < \frac{1}{10}$ and $\int_1^2 x\ln x\,dx = \ln 4 - \frac{3}{4}$.

In this chapter, we mainly deal with applications of the tools of differential and integral calculus. We consider a variety of problems related to functions of one real variable.
The tools and procedures are similar to the ones shown in Chapter 3, i.e. we consider linear combinations of selected generators and linear transformations. This chapter serves also as a useful consolidation of background material before considering functions of several variables, differential equations, and the calculus of variations. We begin by asking how to approximate a given function by linear combinations from a given set of generators. Approximation considerations lead to the general concept of distance. We illustrate the concepts on rudiments of the Fourier series. Our intuition from the Euclidean spaces of low dimensions is extended to infinite dimensional spaces, particularly the concept of orthogonal projections. The next part of this chapter focuses on integral operators. These are linear mappings on functions which are defined in terms of integrals. Especially, we pay attention to convolutions and Fourier analysis. Throughout all these considerations, we work with real or complex valued functions of one variable. Only then do we introduce the elements of the theory of metric spaces. This should enlighten the concepts of convergence and approximation on infinite dimensional spaces of functions. It will also cover our needs in analysis on Euclidean spaces Rn in the next chapter. 1. Fourier series 7.1.1. Spaces of functions. As usual, we begin by choosing appropriate sets of functions to use. We want enough functions so that our models can conveniently be applied in practice. At the same time, the functions must be sufficiently “smooth” so that we can integrate and differentiate as needed. All functions are defined on an interval I = [a, b] ⊂ R, where a < b. The interval may be bounded, (i.e., both a and b are finite), or unbounded (i.e., either a = −∞, or b = +∞, or both). CHAPTER 7 Continuous tools for modelling How do we manage non-linear objects? – mainly by linear tools again... A. Fourier series If we want to understand three-dimensional objects, we often use (one or more) two-dimensional plane projections of them. The orthogonal projections are special in providing the closest images of the points of the objects in the chosen plane. Similarly, we can understand complicated functions in terms of simpler ones. We consider their projections into the (real) vector space generated by those chosen functions. Recall from Chapter 2 that the orthogonal projections were easily computed in terms of inner products. Now we do the same for the infinite dimensional spaces of functions. CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING Spaces of piecewise smooth functions We denote by S0 = S0 [a, b] the set of all piecewise continuous functions on I = [a, b] with real or complex values. Otherwise put, all functions f in S0 = S0 [a, b] have only finitely many points of discontinuity on bounded intervals. Moreover, f has finite one-sided left and right limits at every point in [a, b]. In particular, f is bounded on all bounded subintervals. For every natural number k ≥ 1, we consider the set of all piecewise continuous functions f such that all their derivatives up to order k (inclusive) lie in S0 . We denote this set by Sk [a, b], or briefly Sk . Note that the derivatives of functions in Sk need not exist at all points, but their onesided limits must exist. If the interval I is unbounded, we often consider only those functions with compact support. A function with compact support means that it is identically zero outside some bounded interval of the real line. 
For unbounded intervals, we denote by Sk c the subset of those functions in Sk which have compact support. Functions in S0 are always Riemann integrable on the bounded interval I = [a, b], with both ∫ b a |f(x)| dx < ∞, and ∫ b a |f(x)|2 dx < ∞. Both integrals are finite for unbounded intervals if the function f has compact support. 7.1.2. Distance between functions. The properties of limits and derivatives ensure that Sk and Sk c are vector spaces. In finite-dimensional spaces, the distance between vectors can be expressed by means of the differences of the coordinate components. In spaces of functions, we proceed analogously and utilize the absolute value of real or complex numbers and the Euclidean distance in the following way: The L1 distance of functions The L1–distance between functions f and g in S0 c is defined by ∥f − g∥1 = ∫ b a |f(x) − g(x)| dx. If g = 0, then the distance from f to the zero function, namely ∥f∥1, is called the L1-norm (i.e., length or size) of f. The L1–distance between functions f and g (when both are real valued) expresses the area enclosed by the graphs of these functions, regardless of which function takes greater values. We observe that ∥f − g∥1 ≥ 0. Since f and g are both piecewise continuous functions, ∥f − g∥1 = 0 only if f and g differ in their values at most at the points of discontinuity, and hence at only finitely many points on any bounded interval. Recall that we can change 632 In this case the inner product mimics the product of scalars and provides the necessary tool to calculate the corresponding projections. The simplest way to define an inner product on appropriate vector spaces of functions takes the form ⟨f, g⟩ = ∫ b a f(x)g(x) dx . We refer to this inner product as L2 and denote the corresponding norm by ∥ ∥2, see 7.1.1-7.1.3 for further details and its extension to complex functions. 7.A.1. Orthogonal systems of functions. Consider the subspace W = ⟨f1, f2⟩ of the space of real-valued functions defined on the interval [1, 2], generated by the functions f1(x) = 1/x and f2(x) = x2 , endowed with the L2 product. (a) Complete the function f1 to an orthogonal basis of W. (b) Determine the orthogonal projection of f(x) = x onto W and compute the distance of f from W. Solution. (a) The vector space W is generated by two linearly independent functions, thus its dimension is 2. All vectors belonging to W are of the form a · f1(x) + b · f2(x) for some a, b ∈ R. According to the Gram–Schmidt process, we seek for a vector of the form h = x2 + k · 1 x , k ∈ R, subject to the orthogonality condition 0 = ⟨ 1 x , x2 + k · 1 x ⟩ , which gives k = − ⟨ 1 x , x2 ⟩ ⟨ 1 x , 1 x ⟩ = − ∫ 2 1 1 x · x2 dx ∫ 2 1 1 x · 1 x dx = −3 . Hence the requested orthogonal basis consists of the functions 1 x and x2 − 3 x . (b) The projection of the function f(x) = x onto W has the form (see ....) px = ⟨x, 1 x ⟩ ⟨1 x , 1 x ⟩ · 1 x + ⟨x, x2 − 3 x ⟩ ⟨x2 − 3 x , x2 − 3 x ⟩ · ( x2 − 3 x ) = 2 x + 15 34 ( x2 − 3 x ) . Let us recall that the distance of a vector from the subspace is given by the norm of the difference between this vector and its projection. In this case we get ∥x − px∥2 = (∫ 2 1 ( x − 2 x − 15 34 ( x2 − 3 x ))2 dx )1 2 = 1 √ 408 ≈ 0.0495. □ 7.A.2. Let W be the the space generated by the functions 1 x , 1 x2 , 1 x3 , defined on the interval [1, 2]. Assume that W is endowed with the L2 product. Complete the function 1 x to an orthogonal basis of W. Next determine the projection of the functions f(x) = 1 x4 and g(x) = x onto W and find their distances from W. 
⃝ CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING the value of any function at a finite number of points, without changing the value of the integral. If in particular, f and g are both continuous on [a, b], then ∥f − g∥1 = 0 implies f(x) = g(x) for all x ∈ [a, b]. Indeed, if f(x0) ̸= g(x0) at a point x0, a ≤ x0 ≤ b, and if f and g are both continuous at x0, then f and g also differ on some small neighbourhood of x0, and this neighbourhood, in turn, contributes a non-zero value into the integral, so that then ∥f − g∥1 > 0. If we have three functions f, g, and h, then, of course, ∫ b a |f(x) − g(x)| dx = ∫ b a |f(x) − h(x) + h(x) − g(x)| dx ≤ ∫ b a |f(x) − h(x)| dx + ∫ b a |h(x) − g(x)| dx, so the usual triangle inequality ∥f − g∥1 ≤ ∥f − h∥1 + ∥h − g∥1 holds. To derive this inequality, we used only the triangle inequality for the scalars; thus it is valid for functions f, g ∈ S0 c with complex values as well. ∥f −g∥1 is not the only way to measure distance between two functions f and g. For another way: The L2–distance The L2–distance between functions f and g in S0 c is defined by ∥f − g∥2 = (∫ b a |f(x) − g(x)|2 dx )1/2 . If g = 0, then ∥f∥2, the distance from f to the zero function, is called the L2 norm of f. Clearly ∥f∥2 ≥ 0. Moreover, ∥f∥2 = 0, implies that f(x) = 0 for all x except for a finite set of points in any bounded interval. As above for the L1 norm, ∥f − g∥2 = 0 only if f and g differ in their values at most at the points of discontinuity, and hence at only finitely many points on any bounded interval. In particular, if f and g are both continuous for all x, then ∥f − g∥2 = 0 implies f(x) = g(x) for all x. The square of ∥f∥2 for a function f is ∥f∥2 2 = ∫ b a |f(x)|2 dx and it is related to the well-defined symmetric bilinear mapping of real or complex functions to scalars ⟨f, g⟩ = ∫ b a f(x)g(x) dx since ⟨f, f⟩ = ∫ b a f(x)f(x) dx = ∫ b a |f(x)|2 dx = ∥f∥2 2 . We can use therefore all the properties of inner products in unitary spaces as described in Chapter 3. In particular, the 633 The next task involves a specific type of polynomials, known as “Chebyshev polynomials”. These are defined by Tn(x) = cos(n arccos(x)), or equivalently, given by Tn(cos(x)) = cos(n x) , n ∈ N . Chebyshev polynomials are described by a plethora of equivalent ways, and have a distinguished role in approximation theory. Moreover, due to their simple definition are easily adapted to symbolic computations, and here we will explain how one can manipulate them via Sage. 7.A.3. Chebyshev polynomials. Show that Tn(x) is a polynomial for all n ∈ N. Solution. By the definition given above it is direct that T0(x) = 1, T1(x) = x . Based now on the trigonometric identity cos(rz) cos(sz) = 1 2 ( cos((r − s)z)+ cos((r + s)z) ) , (⋆) we see that Tn+1(x) + Tn−1(x) = = cos ( (n + 1) arccos(x) ) + cos ( (n − 1) arccos(x) ) = 2 cos ( n arccos(x) ) cos ( arccos(x) ) = 2x Tn(x) , that is, Tn+1(x) = 2x Tn(x) − Tn−1(x) , for all positive integers n. This is the recurrent definition of Chebyshev polynomials, and the result follows. □ 7.A.4. (a) Verify that for each interval I = [a, b] ⊂ R and positive continuous function ω on I, the formula ⟨f, g⟩ω = ∫ b a f(x)g(x)ω(x) dx defines an inner product on the continuous functions on I. (b) Choosing I = (−1, 1) and ω(x) = ω0(x) = (1−x2 )−1/2 , deduce that the Chebyshev polynomials Tk(x), (k ∈ N), form an orthogonal system of polynomials with respect to ⟨ , ⟩ω0 . Solution. 
We compare the defining formula with the L2 inner product above: Consider the substitution x = φ(z), where φ is the inverse function to z = ∫ x a ω(t) dt. The inverse exists, since ω is positive and so z is a strictly increasing function of x. Thus, dz = ω(x)dx and ⟨f, g⟩ω = ∫ b a f(x)g(x)ω(x) dx = ∫ φ−1 (b) 0 f(φ(z))g(φ(z)) dz. In particular, the “coordinate change” x = φ(z) identifies the vector space of continuous functions on I with the space of continuous function on another interval equipped with the L2 inner product and so ⟨ , ⟩ω is an inner product. We leave as an easy exercise to check that ⟨ , ⟩ω satisfies the properties of an inner product. CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING inner product satisfies both linearity in the first argument and the Hermitian symmetry ⟨f, g⟩ = ⟨g, f⟩. It is a symmetric bilinear mapping in the real case. In the sequel, the L2 distance will be most important. Thus, to simplify notation, the norm symbol ∥ ∥ without subscript will mean ∥ ∥2 in the subsequent pargraphs. 7.1.3. Orthogonality. In Chapters 2 and 3, we dealt with finite-dimensional real or complex vector spaces. Most properties derived there concerned pairs or finite sets of vectors. Now, we can do just the same with functions. We restrict our definition of the inner product to any vector subspace generated by only finitely many functions f1, . . . , fk (over real or complex numbers, according to our need). We again obtain a well-defined inner product on this finite-dimensional vector subspace and so all our considerations from the finite dimensional linear algebra apply again. As an example, consider the monomial functions fi(x) = xi , i = 0, . . . , k. In S0 , these generate the (k + 1)– dimensional vector subspace Rk[x] of all polynomials of degree at most k. The inner product of two such polynomials is given by integration. Every polynomial of degree at most k is uniquely expressed as a linear combination of the generators f0, . . . , fk. Moreover, if we can arrange the choice of generators to satisfy (1) ⟨fi, fj⟩ = { 0 for i ̸= j, 1 for i = j, then the computations become much easier than they would otherwise. Recall the Gram–Schmidt orthogonalization procedure, see 2.3.21. This procedure transforms any system of linearly independent generators fi into new (again linearly independent) orthogonal generators gi of the same subspace, i.e. ⟨gi, gj⟩ = 0 for all i ̸= j. We can calculate them step by step. Put g1 = f1, and gℓ+1 = fℓ+1 + a1g1 + · · · + aℓgℓ, ai = − ⟨fℓ+1, gi⟩ ∥gi∥2 for ℓ ≥ 1. To illustrate, we apply this procedure to the three polynomials 1, x, x2 on the interval [−1, 1]. Put g1 = 1, and generate the sequence g1 = 1 g2 = x − 1 ∥g1∥2 (∫ 1 −1 x · 1 dx ) · g1 = x − 0 = x g3 = x2 − 1 ∥g1∥2 (∫ 1 −1 x2 · 1 dx ) · g1− 1 ∥g2∥2 (∫ 1 −1 x2 · x dx ) · g2 = x2 − 1 3 . 634 (b) In this special case, ω(x) = d d x (arccos(x)), and thus the above substitution yields ⟨Tr, Ts⟩ω = ∫ π 0 cos(rz) cos(sz) d z. We are dealing with improper Riemann integrals (integrating the unbounded function ω), but this does not cause any problem. In fact it is easy to evaluate the integral via the trigonometric formula (⋆) mentioned in 7.A.3. We see that it vanishes for all r ̸= s. □ Both Sage and Maple provide built-in functions for generating the Chebyshev polynomials. For instance, in Sage they can be accessed via the function chebyshev_T(n, x), and in our notation this correspond to Tn(x), see also below. 7.A.5. (a) Use Sage and its function chebyshev_T to derive the first five Chebyshev polynomials. 
Next, plot them. (b) Combine the recursive definition of Chebyshev polynomials with the def environment in Sage to present an alternative approach for integrating them into Sage. Solution. (a) To write down the first five Chebyshev polynomials one can type R = PolynomialRing(QQ, "x") show([chebyshev_T(n, x) for n in [0..4]]) which returns the list [ 1, x, 2 x2 − 1, 4 x3 − 3 x, 8 x4 − 8 x2 + 1 ] To plot them add the command plot([chebyshev_T(n, x) for n in [0..4]]) (b) For this task, let us agree to call T(n, x) the desired routine. We can introduce it as follows: def T(n, x): if n==0 : return 1 elif n==1 : return x else : return expand(2*x*T(n-1, x)-T(n-2, x)) To test the routine you may type T(3, x) to get 4x3 − 3x, T(4, x) to get 8x4 − 8x2 + 1, etc. Or you can directly compare the formula of T(n, x) with that of chebyshev_T(n, x), for some n, by typing bool(T(n, x)==chebyshev_T(n, x) for n in [0..10]) In this case Sage’s answer is True. □ 7.A.6. A good programmer is often interested in the efficiency of algorithms and programs. For example, you may like to check the efficiency of the routine in Sage presented in 7.A.5, which can be done via the command timeit. Hence, one may compare the execution time of the commands chebyshev_T(n, x) and T(n, x), for certain values of n. To test the case n = 3, for example, add the cell CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING The corresponding orthogonal basis of the space R2[x] of all polynomials of degree less than three on the interval [−1, 1] is 1, x, x2 − 1/3. Rescaling by appropriate numbers so that the basis elements all have length 1, yields the orthonormal basis h1 = √ 1 2 , h2 = √ 3 2 x, h3 = 1 2 √ 5 2 (3x2 − 1). For example, h1 = g1/∥g1∥ and ∥g1∥2 = ∫ 1 −1 12 dx = 2. We could easily continue this procedure in order to find orthonormal generators of Rk[x]. The resulting polynomials are called Legendre polynomials. Considering all Legendre polynomials hi, i = 0, . . . , we have an infinite orthonormal set of generators such that polynomials of all degrees are uniquely expressed as their finite linear combinations. 7.1.4. Orthogonal systems of functions. Generalizing the latter example, suppose we have three polynonials h1, h2, h3 forming an orthonormal set. For any polynomial h, we can put H = ⟨h, h1⟩h1 + ⟨h, h2⟩h2 + ⟨h, h3⟩h3. We claim that H is the (unique) polynomial which minimizes the L2–distance ∥h − H∥. See 3.4.3. The coefficients for the best approximation of a given function by a function from a selected subspace are obtained by the integration introduced in the definition of the inner product. This example of computing the best approximation of H by a linear combination of the given orthonormal generators suggests the following generalization: Orthogonal systems of functions Every (at most) countable system of linearly independent functions in S0 c [a, b] such that the inner product of each pair of distinct functions is zero is called an orthogonal system of functions. Moreover, if the norm ∥f∥2 = 1 for all f in an orthogonal system, we talk about an orthonormal system of functions. Notice, all continous functions have got compact supports on a finite interval [a, b]. Consider an orthogonal system of functions fn ∈ S0 [a, b] and suppose that for (real or complex) constants cn, the series F(x) = ∞∑ n=0 cnfn(x) converges uniformly on a finite interval [a, b]. Notice that the limit function F(x) does not need to belong to S0 [a, b], but this is not our concern now. 
635 print(timeit(’chebyshev_T(3,x)’, number=20)) print(timeit(’T(3,x)’), number=20) Here the option number=20 restricts the number of loops. The output appears as follows: 20 loops, best of 3: 81.2 µs per loop 20 loops, best of 3: 18 µs per loop and what one looks for is the execution time, which here is measured in microseconds (µs).1 Note that each time the above two commands are executed, Sage returns slightly different execution times. 7.A.7. Show that the choice of the weight function ω(x) = e−x and the interval I = [0, ∞) in Problem 7.A.3, leads to an inner product for which the Laguerre polynomials Ln(x) = n∑ k=0 ( n k ) (−1)k k! xk form an orthonormal system. ⃝ 7.A.8. Check that the orthonormal systems obtained in the previous two examples coincide with the result of the corresponding Gram-Schmidt orthogonalisation procedure applied to the system 1, x, x2 , . . . , xn , . . . , using the inner products ⟨ , ⟩ω, possibly only up to signs. ⃝ 7.A.9. Let S0 ([−π, π]) be the (infinite-dimensional) vector space of F-valued piecewise continuous functions defined on [−π, π], where F ∈ {R, C}, and let us denote by ¯g the complex conjugate of a function g ∈ S0 ([−π, π]). Prove that for any f, g ∈ S0 ([−π, π]) the rule (f, g) := 1 π ∫ π −π f(x)g(x) dx defines an inner product. Next show that the system of func- tions {√ 2 2 , sin(x) , cos(x) , sin(2x) , cos(2x) , · · · : n ∈ Z+ } is ( , )-orthonormal. ⃝ The orthonormal system of functions discussed in 7.A.9 is an orthonormal version of the so-called “Fourier orthogonal system”, introduced in 7.1.6. In fact, a key point in 7.A.9 is the “periodicity” of the functions forming the orthonormal basis (each having period 2π). Periodic functions occur frequently in engineering problems, and usually have a complicated form. Thus, it is often desirable to have a presentation of such periodic functions in terms of the simpler functions of sine and cosine. Building an orthogonal system of the periodic functions sin(nx) and cos(nx) leads to the classical concept of the Fourier series 1For the command timeit Sage’s output can include also the units nanoseconds (ns) and millisecond (ms). Recall that nanoseconds provide the smallest unit of the three, followed by microseconds and then milliseconds. CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING By uniform convergence, the inner product ⟨F, fn⟩ can be expressed in terms of the particular summands (see the corollary 6.3.7), obtaining ⟨F, fn⟩ = ∞∑ m=0 cm ∫ b a fm(x)fn(x) dx = cn∥fn∥2 , since each term in the sum is 0 except when m = n. Exactly as in the example above, each finite sum ∑k n=0 cnfn(x) is the best approximation of the function F(x) among the linear combinations of the first k +1 functions fn in the orthogonal system. Actually, we can generalize the definition further to any vector space of functions with an inner product. See the exercise 7.A.4 for such an example. For the sake of simplicity we confine ourselves to the L2 distance, but the reader can check that the proofs work in general. We extend our results from finite-dimensional spaces to infinite dimensional ones. Instead of finite linear combinations of base vectors, we have infinite series of pairwise orthogonal functions. The following theorem gives us a transparent and very general answer to the question as to how well the partial sums of such a series can approximate a given function: 7.1.5. Theorem. Let fn, n = 1, 2, . . . , be an orthogonal sequence of (real or complex) functions in S0 [a, b] and let g ∈ S0 [a, b] be an arbitrary function. 
Put cn = ∥fn∥−2 ∫ b a g(x)fn(x) dx. Then (1) For any fixed n ∈ N, the expression which has the least L2–distance from g among all linear combinations of functions f1, . . . , fn is hn = n∑ i=1 cifi(x). (2) The series ∞∑ n=1 |cn|2 ∥fn∥2 always converges, and more- over ∞∑ n=1 |cn|2 ∥fn∥2 ≤ ∥g∥2 . (3) The equality ∞∑ n=1 |cn|2 ∥fn∥2 = ∥g∥2 holds if and only if lim n→∞ ∥g − hn∥ = 0. Before presenting the proof, we consider the meaning of the individual statements of this theorem. Since we are working with an arbitrarily chosen orthogonal system of functions, we cannot expect that all functions can be approximated by linear combinations of the functions fi. For instance, if we consider the case of Legendre orthogonal polynomials on the interval [−1, 1] and restrict ourselves to even degrees only, surely we can approximate only even 636 for a 2L-periodic function f that is integrable over an integral of length 2L, say [−L, L]. 2 These are series of the form F(x) = a0 2 + ∞∑ n=1 ( an cos( n π x L ) + bn sin( n π x L ) ) , where an and bn are the Fourier coefficients, given by an = 1 L ∫ L −L f(x) cos( n π x L ) dx , n = 0, 1, . . . , bn = 1 L ∫ L −L f(x) sin( n π x L ) dx , n = 1, 2, . . . , respectively. In many cases the period is 2π, i.e., L = π, but we are also interested in functions with arbitrary periods, which, in practice (applications), is the most common scenario. For instance, if f has period T then it may defined on any interval of the form [0, T) or [−T/2, T/2), i.e., of length T. Defining f on intervals of the form [−π, π) or [0, 2π) is useful however, since simplifies the calculations and aligns with the standard form of the Fourier series, see 7.1.6. For the purpose of Fourier series the function f may defined even on the open interval (−π, π), where the key requirement is that f must be periodic of period 2π, see for example the so called “Heaviside function” in 7.1.9 (also known as the “square wave function”). Below, our goal is to illustrate all of this with a series of examples, most of which are implemented using Sage as well. For convenience, next we will denote by Fk(x) (k > 0) the kth-partial sum of a given Fourier series F(x). 7.A.10. Consider the 2π-periodic extension of the function f(x) = x, with x ∈ (−π, π). Describe the corresponding Fourier series F(x) and next confirm your answer via Sage. Moreover, confirm that as the number of terms in the series increases, the approximation improves significantly, except in a small neighbourhood around the discontinuity point. (Hint: For example, use Sage to plot the partial sums F5(x), F10(x), F20(x) and F100(x)). Solution. We see that f(−x) = −x = −f(x) for all x ∈ (−π, π), which means that f is an odd function. Thus, the coefficients an vanish for all n = 0, 1, . . ., and the Fourier series of f is a sine series, that is, F(x) = ∑∞ n=1 bn sin(nx). On the other hand, the function g(x) = x sin(nx) is even, i.e., g(−x) = g(x) for all x ∈ (−π, π). Hence, according to the statement in 6.B.35 we have bn = 1 π ∫ π −π x sin(nx) dx = 2 π ∫ π 0 x sin(nx) dx . 2The Fourier series are named in honour of the French mathematician and physicist Jean B. J. Fourier, in recognition of his seminal 1807 work “On the Propagation of Heat in Solid Bodies”, which focused on the issue of heat conduction. Fourier introduced the concept of analyzing periodic functions using trigonometric series, a method that remains fundamental to mathematical physics and has numerous critical applications in engineering. CHAPTER 7. 
CONTINUOUS TOOLS FOR MODELLING

functions in a reasonable way. Nevertheless, the first statement of the theorem says that the best approximation possible (in the $L_2$-distance) is by the partial sums as described. The second and third statements can be perceived as an analogy to the orthogonal projections onto subspaces in terms of Cartesian coordinates. Indeed, if for a given function g the series $F(x) = \sum_{n=1}^{\infty} c_n f_n(x)$ converges pointwise, then the function F(x) is, in a certain sense, the orthogonal projection of g into the vector subspace of all such series. The second statement is called Bessel's inequality, and it is an analogy of the finite-dimensional proposition that the size of the orthogonal projection of a vector cannot be larger than the original vector. The equality from the third statement is called Parseval's theorem, and it says that if a given vector does not decrease in length by the orthogonal projection onto a given subspace, then it belongs to this subspace. On the other hand, the theorem does not claim that the partial sums of the considered series need to converge pointwise to some function. There is no analogy to this phenomenon in the finite-dimensional world. In general, the series F(x) need not be convergent for some points x, even under the assumption of the equality in (3). However, if the series $\sum_{n=1}^{\infty} |c_n|$ converges to a finite value, and if all the functions $f_n$ are bounded uniformly on I, then the series $F(x) = \sum_{n=1}^{\infty} c_n f_n(x)$ converges at every point x. Yet it need not converge to the function g everywhere. We return to this problem later. The proof of all three statements of the theorem is similar to the case of finite-dimensional Euclidean spaces. That is to be expected, since the bounds for the distances of g from the partial sum f are constructed in the finite-dimensional linear hull of the functions concerned:

Proof of theorem 7.1.5. Choose any linear combination $f = \sum_{n=1}^{k} a_n f_n$ and calculate its distance from g. We obtain
$$\Big\| g - \sum_{n=1}^{k} a_n f_n \Big\|^2 = \int_a^b \Big| g(x) - \sum_{n=1}^{k} a_n f_n(x) \Big|^2 dx$$
$$= \int_a^b |g(x)|^2\, dx - \int_a^b \sum_{n=1}^{k} g(x)\, \overline{a_n f_n(x)}\, dx - \int_a^b \sum_{n=1}^{k} a_n f_n(x)\, \overline{g(x)}\, dx + \int_a^b \Big| \sum_{n=1}^{k} a_n f_n(x) \Big|^2 dx$$
$$= \|g\|^2 - \sum_{n=1}^{k} \overline{a_n} c_n \|f_n\|^2 - \sum_{n=1}^{k} a_n \overline{c_n} \|f_n\|^2 + \sum_{n=1}^{k} |a_n|^2 \|f_n\|^2$$
$$= \|g\|^2 + \sum_{n=1}^{k} \|f_n\|^2 \big( (c_n - a_n)\overline{(c_n - a_n)} - |c_n|^2 \big) = \|g\|^2 + \sum_{n=1}^{k} \|f_n\|^2 \big( |c_n - a_n|^2 - |c_n|^2 \big).$$
Since we are free to choose $a_n$ as we please, we minimize the last expression by choosing $a_n = c_n$ for each n. This

Keeping in mind the relations cos(nπ) = (−1)^n and sin(nπ) = 0, integration by parts gives
$$b_n = \frac{2}{\pi} \int_0^{\pi} x \sin(nx)\, dx = \frac{2}{\pi} \Big[ -\frac{x\cos(nx)}{n} + \frac{\sin(nx)}{n^2} \Big]_0^{\pi} = \frac{2}{\pi} \cdot \frac{-\pi}{n}\, (-1)^n = \frac{2}{n}\, (-1)^{n+1}.$$
Therefore, the Fourier series of f has the form
$$F(x) = \sum_{n=1}^{\infty} \frac{2(-1)^{n+1}}{n} \sin(nx) = 2\sin(x) - \sin(2x) + \frac{2}{3}\sin(3x) - \frac{1}{2}\sin(4x) + \frac{2}{5}\sin(5x) - \cdots.$$
In Sage it suffices to compute the Fourier coefficients $a_n$ and $b_n$, which can be easily done by the block

f=lambda x:x; f=piecewise([[(-pi,pi),f]])
L=pi; n=var("n")
an=(1/L)*integral(x*cos(n*pi*x/L),x,-pi,pi)
show("a_n=", an)
bn=(1/L)*integral(x*sin(n*pi*x/L),x,-pi,pi)
show("b_n=", bn.expand())

Sage prints out the following expressions
$$a_n = 0, \qquad b_n = -\frac{2\cos(\pi n)}{n} + \frac{2\sin(\pi n)}{\pi n^2},$$
which, with a bit of effort, confirm the presented values of $a_n$, $b_n$, and consequently the given expression of F(x). In Sage, the description of the kth partial sum of the Fourier series corresponding to a periodic function f relies on the syntax f.fourier_series_partial_sum(k, L), where L denotes the half-period of f.
For instance, the 5th partial sum of the Fourier series at hand can be explicitly described by adding in the previous cell the following:

FS = f.fourier_series_partial_sum(5,pi)
show("partial FS = ",FS)

To plot the suggested partial sums, together with f, you may add the block

FS5 = f.fourier_series_partial_sum(5,pi)
FS10 = f.fourier_series_partial_sum(10,pi)
FS20 = f.fourier_series_partial_sum(20,pi)
FS100 = f.fourier_series_partial_sum(100,pi)
a = f.plot(x, -pi, pi)
b5 = plot(FS5, x, -pi, pi, linestyle="--", tick_formatter=[2*pi, None])
b10 = plot(FS10, x, -pi, pi, linestyle="--", tick_formatter=[2*pi, None])
b20 = plot(FS20, x, -pi, pi, linestyle="--", tick_formatter=[2*pi, None])
b100 = plot(FS100, x, -pi, pi, linestyle="--", tick_formatter=[2*pi, None])
(a+b5).show(); (a+b10).show()
(a+b20).show(); (a+b100).show()

This procedure yields the figures presented below, illustrating the improvement in approximation as the number of terms is increased.

CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING

completes the proof of the first statement. With this choice of $a_n$, we obtain Bessel's identity
$$\Big\| g - \sum_{n=1}^{k} c_n f_n \Big\|^2 = \|g\|^2 - \sum_{n=1}^{k} |c_n|^2 \|f_n\|^2.$$
Since the left-hand side is non-negative, it follows that
$$\sum_{n=1}^{k} |c_n|^2 \|f_n\|^2 \le \|g\|^2.$$
Let k → ∞. Since every non-decreasing sequence of real numbers which is bounded from above has a limit, it follows that
$$\sum_{n=1}^{\infty} |c_n|^2 \|f_n\|^2 \le \|g\|^2,$$
which is Bessel's inequality. If equality occurs in Bessel's inequality, then statement (3) follows straight from the definitions and the Bessel identity proved above. □

An orthogonal system of functions is called a complete orthogonal system on an interval I = [a, b] for some space of functions on I if and only if Parseval's equality holds for every function g in this space.

7.1.6. Fourier series. The coefficients $c_n$ from the previous theorem are called the Fourier coefficients of a given function in the (abstract) Fourier series. The previous theorem indicates that we are able to work with countable orthogonal systems of functions $f_n$ in much the same way as with finite orthogonal bases of vector spaces. There are, however, essential differences:
• It is not easy to decide what the set of convergent or uniformly convergent series $F(x) = \sum_{n=1}^{\infty} c_n f_n(x)$ looks like.
• For a given integrable function, we can find only the "best approximation possible" by such a series F(x) in the sense of the $L_2$-distance.
In the case when we have an orthonormal system of functions $f_n$, the formulae mentioned in the theorem are simpler, but still there is no further improvement in the approximations. The choice of an orthogonal system of functions for use in practice should address the purpose for which the approximations are needed. The name "Fourier series" itself refers to the following choice of a system of real-valued functions:

Fourier's orthogonal system
The following functions form an orthogonal system
1, sin x, cos x, sin 2x, cos 2x, . . . , sin nx, cos nx, . . .

(Figures: the partial sums F5(x), F10(x), F20(x) and F100(x), each plotted together with f.) □

7.A.11. Triangle wave. Find the Fourier series F(x) for the 2π-periodic extension of the function f(x) = | x |, with x ∈ (−π, π). Next present the 1st, 3rd and 6th partial sums of the Fourier series and use Sage to illustrate them for x ∈ (−4π, 4π).

Solution. The given function corresponds to a triangle-shaped oscillation. Its expression as a Fourier series is very important in practice. The function f is even on (−π, π), i.e., f(−x) = f(x) for all x ∈ (−π, π). Therefore, we have $b_n = 0$ for all $n \in \mathbb{Z}^+$, and it suffices to determine $a_n$, for $n \in \mathbb{N}$.
Combining the definition of an with 6.B.35, we get a0 = 1 π π∫ −π g(x) dx = 2 π π∫ 0 x dx = 2 π [ x2 2 ]π 0 = π . To compute an for n ∈ Z+, one can apply integration by parts: an = 1 π π∫ −π f(x) cos(nx) dx = 2 π π∫ 0 x cos(nx) dx = 2 π [x n sin(nx) ]π 0 − 2 nπ π∫ 0 sin (nx) dx = 2 n2π [cos(nx)] π 0 = 2 n2π ((−1)n − 1) . So an = { − 4 n2π , for n odd; 0 , for n even , and the Fourier series in question has the form F(x) = π 2 + 2 π ∞∑ n=1 ( (−1)n − 1 n2 cos(nx) ) = π 2 − 4 π ∞∑ n=1 cos((2n − 1)x) (2n − 1)2 = π 2 − 4 π ( cos x + cos(3x) 32 + cos(5x) 52 + · · · ) . CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING An elementary exercise on integration by parts shows that this is an orthogonal system of functions on the interval [−π, π]. These functions are periodic with common period 2π (see the definition below). “Fourier analysis”, which builds upon this orthogonal system, allows us to work with all piecewise continuous periodic functions with extraordinary efficiency. Since many physical, chemical, and biological data are perceived, received, or measured, in fact, by frequencies of the signals (the measured quantities), it is really an essential mathematical tool. Biologists and engineers often use the word “signal” in the sense of “function”. Periodic functions A real or complex valued function f defined on R is called a periodic function with period T > 0 if f(x + T) = f(x) for every x ∈ R. It is evident that sums and products of periodic functions with the same period are again periodic functions with the same period. We note that the integral ∫ x0+T x0 f(x) dx of a periodic function f on an interval whose length equals the period T is independent of the choice of x0 ∈ R. To prove it, it is enough to suppose 0 ≤ x0 < T, using a translation by a suitable multiple of T. Then, ∫ x0+T x0 f(x) dx = ∫ T x0 f(x) dx + ∫ x0+T T f(x) dx = ∫ T x0 f(x) dx + ∫ x0 0 f(x) dx = ∫ T 0 f(x) dx. Next, remind the Fourier’s orthogonal system with the norms over intervals of length 2π equal to ∥ sin nx∥2 = ∥ cos nx∥2 = π (cf. 6.2.6). Using it to built series as in the theorem 7.1.5, we arrive at Fourier series The series of functions F(x) = a0 2 + ∞∑ n=1 ( an cos(nx) + bn sin(nx) ) with coefficients an = 1 π ∫ x0+2π x0 g(x) cos(nx) dx, bn = 1 π ∫ x0+2π x0 g(x) sin(nx) dx, is called the Fourier series of a function g on the interval [x0, x0 + 2π]. The coefficients an and bn are called Fourier coefficients of the function g. 639 Thus F1(x) = π 2 − 4 cos(x) π , F3(x) = F1(x) − 4 cos(3x) 9π , F6(x) = F3(x) − 4 cos(5x) 25π . Below we illustrate F1, F3 and F6 together in a single figure: To better observe the convergence of the Fourier series in a small neighbourhood around the origin, we may zoom in on the graph at that point: Notice the Fourier series described here could be derived by easier means, namely by integrating the Fourier series of the square wave function (Heaviside’s function), see 7.1.9. □ 7.A.12. Use Sage to provide a solution of the task in 7.A.11. Moreover, show that F5(x) = F6(x) for all x. ⃝ 7.A.13. Compute the Fourier series of the 2π-periodic extension of the sign function sgn : [−π, π] → R, where recall that sgn(x) =    −1 , for − π ≤ x < 0 , 0 , for x = 0 , 1 , for 0 < x ≤ π . ⃝ CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING If T is the time taken for one revolution of an object moving round the unit circle at constant speed, then that constant speed is ω = 2π/T. In practice, we often want to work with Fourier series with an arbitrary primary period T of the functions, not just 2π. 
Then we should employ the functions cos(ωnx), sin(ωnx), where ω = 2π T . By substitution t = ωx, we can verify the orthogonality of the new system of functions and recalculate the coefficients in the Fourier series F(x) of a function g on the interval [x0, x0 + T]: F(x) = a0 2 + ∞∑ n=1 ( an cos(nωx) + bn sin(nωx) ) , which have values an = 2 T ∫ x0+T x0 g(x) cos(nωx) dx, bn = 2 T ∫ x0+T x0 g(x) sin(nωx) dx. 7.1.7. The complex Fourier coefficients. Parametrize the unit circle in the form: eiωt = cos ωt + i sin ωt. For all integers m, n with m ̸= n, ∫ π −π eimx e−inx dx = ∫ π −π ei(m−n)x dx = 1 i(m−n) [ ei(m−n)x ]π −π = 0. Thus for m ̸= n, the integral ⟨eimx , einx ⟩ = 0. Fourier’s complex orthogonal system The following functions form a complex valued orthogonal system e−inωt , . . . , e−iωt , 1, eiωt , ei2ωt . . . , einωt , . . . Note that if m = n, then ∫ π −π eimx e−imx dx = ∫ π −π dx = 2π. The orthogonality of this system can be easily used to recover the orthogonality of the real Fourier’s system: Rewrite the above result as ∫ π −π (cos mx + i sin mx)(cos nx − i sin nx) dx = 0 By expanding and separating into real and imaginary parts we get both ∫ π −π (cos mx cos nx + sin mx sin nx) dx = 0 ∫ π −π (sin mx cos nx − cos mx sin nx) dx = 0 640 7.A.14. Use Sage to describe the Fourier series of the 2π-periodic extension of the function f(x) = { 0 , x ∈ [−π, 0) , sin(x) , x ∈ [0, π) . Additionally, plot the graphs of F3, F5, F8 and f in a single figure, and specify the 8th and 48th coefficients of the corresponding Fourier series. Solution. We have T = 2π and we are interested in the interval [−π, π). Let us begin by computing the Fourier coefficients of f, which can be done with the method applied above, involving the commands piecewise and integral: f1(x)=0; f2(x)=sin(x) f=piecewise([[(-pi, 0), f1], [(0, pi), f2]]) L=pi; n=var("n") an=(1/L)*integral(f1(x)*cos(n*pi*x/L),x,-pi, 0) +(1/L)*integral(f2(x)*cos(n*pi*x/L),x,0,pi) show("a_n=", an.expand()) bn=(1/L)*integral(f1(x)*sin(n*pi*x/L),x,-pi, 0) +(1/L)*integral(f2(x)*sin(n*pi*x/L),x,0,pi) show("b_n=", bn.expand()) Running this, we get the answers a_n= − cos (πn) π(n2 − 1) − 1 π(n2 − 1) , b_n= − sin (πn) π(n2 − 1) , which both make sense for all n ̸= 1. Thus, for all positive integers n ̸= 1 we have an = − 1 π(n2 − 1) ( (−1)n + 1) , bn = 0 and moreover a0 = 2 π . For n = 1 we have the integrals a1 = 1 2π π∫ 0 sin(2x) dx , b1 = 1 π π∫ −π f(x) sin(x) dx , and it is easy to see that a1 = 0 and b1 = 1/2. In this way we arrive at the Fourier series F(x) = 1 π + sin x 2 + 1 π ∞∑ n=2 ( (−1)n + 1 1 − n2 cos(nx) ) . Since (−1)n + 1 = 0 when n is odd, and (−1)n + 1 = 2 when n is even, put n = 2m to obtain the expression F(x) = 1 π + sin x 2 − 2 π ∞∑ m=1 cos(2mx) 4m2 − 1 . For the partial sums we see that F3(x) = 1 π + sin(x) 2 − 2 cos(2x) 3π , F5(x) = − 2 cos(4x) 15π + F3(x) , F8(x) = − 2 cos(8x) 63π − 2 cos(6x) 35π + F5(x) . In Sage to obtain these expressions we may continue typing in the preceding block the syntax given here: CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING By replacing n with −n, we have also ∫ π −π (cos mx cos nx − sin mx sin nx) dx = 0 ∫ π −π (sin mx cos nx + cos mx sin nx) dx = 0 and hence, with m ̸= n, ∫ π −π cos mx cos nx dx = 0 ∫ π −π sin mx sin nx dx = 0 ∫ π −π sin mx cos nx dx = 0 which proves again the orthogonality of the real valued Fourier system. Note the case m = n > 0, when we recovered again ∫ π −π cos2 nx dx = ∥ cos(nx)∥2 = π, ∫ π −π sin2 nx dx = ∥ sin(nx)∥2 = π. If n = 0, then ∥1∥2 = 2π. 
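These orthogonality relations can also be confirmed in Sage. The following block is a minimal sketch of our own (not one of the worksheets referenced in this chapter); the concrete values m = 3, n = 5 are an arbitrary choice of two distinct integers, and the first result may need full_simplify() to collapse to 0:

x = var("x")
m = 3; n = 5
# complex system: the integral vanishes for m != n and equals 2*pi for m = n
show(integral(exp(I*m*x)*exp(-I*n*x), x, -pi, pi).full_simplify())  # should print 0
show(integral(exp(I*m*x)*exp(-I*m*x), x, -pi, pi))                  # 2*pi
# the recovered real relations
show(integral(cos(m*x)*cos(n*x), x, -pi, pi))   # 0
show(integral(cos(n*x)^2, x, -pi, pi))          # pi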
For a (real or complex) function f(t) with −T/2 ≤ t ≤ T/2, and all integers n, we can define, in this context, its complex Fourier coefficients by the complex numbers cn = 1 T ∫ T/2 −T/2 f(t) e−iωnt dt. The relation between the coefficients an and bn of the Fourier series (after recalculating the formulae for these coefficients for functions with a general period of length T) and these complex coefficients cn follow from the definitions. Clearly, c0 = a0/2, and for natural numbers n, we have cn = 1 2 (an − ibn), c−n = 1 2 (an + ibn). If the function f is real valued, cn and c−n are complex conjugates of each other. We note here that for real valued functions with period 2π, the Bessel inequality in this notation becomes 1 2 |a0|2 + ∞∑ n=1 (|an|2 + |bn|2 ) ≤ ∥f∥2 = 1 π ∫ π −π |f(t)|2 dt. We have expressed the Fourier series F(t) for a function f(t) in the form F(t) = ∞∑ n=−∞ cn eiωnt . In both cases of real or complex valued functions, the corresponding Fourier series can be written in this form. In general, the coefficients are complex. We return to this expression later, in particular when dealing with Fourier transforms. For fixed T, the expression ω = 2π/T describes how the frequency changes if n is increased by one. 641 FS3 = f.fourier_series_partial_sum(3,pi) FS5 = f.fourier_series_partial_sum(5,pi) FS8 = f.fourier_series_partial_sum(8,pi) show("3rd partial sum = ",FS3) show("5th partial sum = ",FS5) show("8th partial sum = ",FS8) a = f.plot(x, -pi, pi, thickness=2, color="black", legend_label="$f$") b = plot(FS3, x,-pi,pi,linestyle="--", color="blue",legend_label="$F_3$") c= plot(FS5, x,-pi,pi,linestyle="-.", color="green", legend_label="$F_5$") d= plot(FS8, x, -pi, pi, linestyle=":", color="red", legend_label="$F_8$") (a+b+c+d).show() This block also constructs the required graphs, which we present below. Finally, for calculating the Fourier coefficients Sage provides built-in functions. For instance, one can use the syntax f.fourier_series_cosine_coefficient(k), which corresponds to the kth coefficient of the cosine Fourier series of our periodic function f. Notice that if you are interested in “sine coefficients”, then the right function is f.fourier_series_sine_coefficient(k), and in this case one should ensure that uses the sine function correctly, see 7.D.8 for an example. Now you may check yourselves the desired coefficients, by simply adding the following cell: Fc8=f.fourier_series_cosine_coefficient(8) show("8th coefficient = ",Fc8) Fc48=f.fourier_series_cosine_coefficient(48) show("48th coefficient = ",Fc48) In 7.D.11 we will present an alternative method using the Sage functions introduced in this block to determine and illustrate certain partial sums of a Fourier series, without using the Sage function fourier_series_partial_sum. □ CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING We wish to show that the Fourier orthogonal system is complete on S0 [−π, π]. This needs a thorough technical preparation. So now, we only formulate the main result, and add some practical notes. The proof of the theorem will be discussed later, see 7.1.15. 7.1.8. Theorem. Consider a bounded interval [a, b] of length T = b − a. Let f be a real or complex valued function in S1 [a, b] (i.e., a piecewise continuous function with a piecewise continuous first derivative), extended periodically on R. Then: (1) The partial sums sN of its Fourier series converge pointwise to the function g(x) = 1 2 ( lim y→x+ f(y) + lim y→x− f(y) ) . 
(2) Moreover, if the periodic extension of $f \in S^1[a, b]$ is a continuous function on R, then the pointwise convergence of its Fourier series is uniform.
(3) The $L_2$-distance $\|s_N - f\|$ converges to 0 for N → ∞.

7.1.9. Periodic extension of functions. The Fourier series converges, of course, outside the original interval [−T/2, T/2], since it is a periodic function on R. The Heaviside function g is defined by
$$g(x) = \begin{cases} -1 & \text{if } -\pi < x < 0, \\ 1 & \text{if } 0 < x < \pi \end{cases}$$
(We do not need to define the values at zero and at the end points of the interval, because these do not affect the coefficients of the Fourier series.) The periodic extension of the Heaviside function onto all of R is usually called the square wave function. Since g is an odd function, the coefficients of the functions cos(nx) are all zero. The coefficients of sin(nx) are given by
$$b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} g(x) \sin(nx)\, dx = \frac{2}{\pi} \int_0^{\pi} \sin(nx)\, dx = \frac{2}{n\pi} \big( 1 - (-1)^n \big).$$
Thus the Fourier series of g is
$$g(x) = \frac{4}{\pi} \Big( \sin(x) + \frac{1}{3}\sin(3x) + \frac{1}{5}\sin(5x) + \cdots \Big).$$
The partial sums of its first five and fifty terms, respectively, are shown in the following diagrams.

(Two diagrams: the graphs of these two partial sums over the interval [−4, 4].)

7.A.15. Fourier series over other intervals. Consider the function g : [−1, 1) → R, defined by
$$g(x) = \begin{cases} 0, & \text{for } -1 \le x < 0, \\ x + 1, & \text{for } 0 \le x < 1. \end{cases}$$
(a) Determine a 2-periodic extension ĝ of g for x ∈ [−3, 3) and next sketch its graph via Sage.
(b) Find the corresponding Fourier series on [−1, 1), and next confirm your result via Sage.
(c) Use Sage to plot the partial sums F5(x) and F55(x) together with ĝ, for x ∈ (−3, 3).

Solution. (a) By assumption the period is T = 2, and on the interval [−3, 3) a 2-periodic extension of g has the form ĝ(x) = g(x + 2) for −3 ≤ x < −1, ĝ(x) = g(x) for −1 ≤ x < 1, and ĝ(x) = g(x − 2) for 1 ≤ x < 3. Explicitly,
$$\hat{g}(x) = \begin{cases} 0, & -3 \le x < -2, \\ x + 3, & -2 \le x < -1, \\ 0, & -1 \le x < 0, \\ x + 1, & 0 \le x < 1, \\ 0, & 1 \le x < 2, \\ x - 1, & 2 \le x < 3. \end{cases}$$
In Sage we can use the piecewise command to plot ĝ, so one may try the following block:

g1(x)=0;g2(x)=x+1;g3(x)=x+3;g4(x)=x-1
g=piecewise([[(-3,-2),g1],[(-2,-1),g3],
[(-1,0),g1],[(0,1),g2],[(1,2),g1],[(2,3),g4]])
a = g.plot(x, -3, 3,thickness=3, color="steelblue",
exclude=[-2, -1, 0, 1, 2], legend_label=r"$\hat{g}$")
p0=point([-3, 0], size=30, color="black")
p1=point([-2, 1], size=30, color="black")
p2=point([-1, 0], size=30, color="black")
p3=point([0, 1], size=30, color="black")
p4=point([1, 0], size=30, color="black")
p5=point([2, 1], size=30, color="black")
show(a+p0+p1+p2+p3+p4+p5)

Let us present the result:

CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING

If the interval [−T/2, T/2] is chosen for the prime period T of such a square wave function, the resulting Fourier series is
$$g(x) = \frac{4}{\pi} \Big( \sin(\omega x) + \frac{1}{3}\sin(3\omega x) + \frac{1}{5}\sin(5\omega x) + \cdots \Big).$$
The number ω = 2π/T is also called the phase frequency of the wave. As the number of terms of the series increases, the approximation gets much better except in a (still shrinking) neighbourhood of the discontinuity point. There, the maximum of the deviation remains roughly the same. This is a general property of Fourier series and it is called the Gibbs phenomenon. In accordance with 7.1.8(1), the Fourier series of the Heaviside function converges to the mean of the two one-sided limits at the points of discontinuity.
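The two diagrams can be reproduced with the same Sage tools as in the preceding exercises. The block below is a minimal sketch of ours (the truncation orders 5 and 50 merely match the counts mentioned above); zooming in near the jump at x = 0 makes the Gibbs phenomenon clearly visible:

g = piecewise([[(-pi, 0), -1], [(0, pi), 1]])   # the Heaviside (square wave) function
s5 = g.fourier_series_partial_sum(5, pi)        # first five terms
s50 = g.fourier_series_partial_sum(50, pi)      # first fifty terms
a = g.plot(x, -pi, pi, color="black")
(a + plot(s5, x, -pi, pi, linestyle="--")).show()
(a + plot(s50, x, -pi, pi, linestyle="--")).show()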
Of course, we cannot expect that the convergence of Fourier series for functions g with discontinuity points be uniform (then, the function g would have to be continuous itself, being a uniform limit of continuous functions). 7.1.10. Utilizing symmetry of functions. We consider the problem of finding the Fourier series of the function f(x) = x2 on the interval [0, 1]. If we just periodically extend this function from the given interval [0, 1], the resulting function would not be continuous, and so the convergence at integers would be as slow as in the case of a square wave function. However, we can easily extend the domain of f to the interval [−1, 1], so that f(x) = x2 is an even function for −1 ≤ x ≤ 1. If we then extend periodically, the result is continuous. The resulting Fourier series then converges uniformly, and since then f is even, only the coefficients an are non-zero. For n > 0, iterated application of integration by parts yields: an = 2 2 ∫ 1 −1 x2 cos (2πnx 2 ) dx = 2 ∫ 1 0 x2 cos(πnx) dx = 4 π2n2 (−1)n . The remaining coefficient is a0 = 2 2 ∫ 1 −1 x2 dx = 2 ∫ 1 0 x2 dx = 2 3 . The entire series giving the periodic extension of x2 from the interval [−1, 1] thus equals g(x) = 1 3 + 4 π2 ∞∑ n=1 (−1)n n2 cos(πnx). By the Weierstrass criterion, this series converges uniformly. Therefore, g(x) is continuous. By theorem 7.1.8, g(x) = f(x) = x2 on the interval [−1, 1]. Thus our series approximates the function x2 on the interval [0, 1] better (i.e. faster) than we could achieve with the periodic extension of the function from [0, 1] interval only. If we put x = 1 and rearrange, 643 Notice that we have included the points where the jump discontinuities occur within the interval [−3, 3), using the command point. (b) In this task one should apply the general formulae given in 7.1.6. We have ω = 2π/T = π, and an = 2 T x0+T∫ x0 g(x) cos(nωx) dx , ∀ n ∈ N , bn = 2 T x0+T∫ x0 g(x) sin(nωx) dx , ∀ n ∈ Z+ . Now it is easy to see that a0 = 1∫ −1 g(x) dx = 1∫ 0 (x+1) dx = 3 2 and moreover an = 1∫ −1 g(x) cos(nπx) dx = 1∫ 0 (x + 1) cos(nπx) dx = (−1)n − 1 n2π2 , bn = 1∫ −1 g(x) sin(nπx) dx = 1∫ 0 (x + 1) sin(nπx) dx = 1 − 2(−1)n nπ . In Sage to confirm these computations, it suffices to use only g (and not its extension), as follows: g1(x)=0;g2(x)=x+1 g=piecewise([[(-1, 0), g1], [(0, 1), g2]]) T=2; n=var("n"); Om=pi an=(2/T)*integral(g1(x)*cos(n*Om*x),x,-1,0)\ +(2/T)*integral(g2(x)*cos(n*Om*x), x,0,1) show("a_n=", an.expand()) bn=(2/T)*integral(g1(x)*sin(n*Om*x),x,-1,0)\ +(2/T)*integral(g2(x)*sin(n*Om*x), x,0,1) show("b_n=", bn.expand()); integral(x+1, x, 0, 1) In this way we obtain the following expression of the Fourier series in question: 3 4 + ∞∑ n=1 ( (−1)n −1 n2π2 cos(nπx) + 1−2(−1)n nπ sin(nπx) ) , and one can try to obtain refinements by considering odd/even cases. (c) This task can be solved by the following block: CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING we obtain the remarkable result π2 4 (g(1) − 1 3 ) = π2 6 = ∞∑ n=1 (−1)2n n2 = ∞∑ n=1 1 n2 . We proceed with a further illustration. Let us try to differentiate the Fourier series for x2 term by term and, of course we hope to get the Fourier series of x on the interval −1 < x < 1 x = 2 π ∞∑ n=1 (−1)n+1 n sin(πnx). If it was truly the series of x, it could not converge uniformly since the periodic extension of the function x from the interval [−1, 1] is not a continuous function. Thus our differentiation term by term is not guaranteed by the general theorem. 
However, a straightforward computation of the Fourier coefficients shows that this is the Fourier series of x. The series for $x^2$ converges uniformly, and so we can integrate it term by term, obtaining
$$\frac{1}{3}x^3 = \frac{2}{3}x + \frac{4}{\pi^3} \sum_{n=1}^{\infty} \frac{(-1)^n}{n^3} \sin(\pi n x).$$
This is valid for −1 < x < 1. It is not valid for other values of x, since the series is periodic, but the other two terms are not. Of course, we may substitute the above Fourier series of the function x and thus obtain the Fourier series for $x^3$ on the interval [−1, 1] this way. Notice that this series no longer converges uniformly. In this context, we use the following terminology for the Fourier series of even or odd functions:

The sine and cosine Fourier series
For a given (real or complex) valued function f on an interval [0, T) of length T > 0, the Fourier series of its even periodic extension (with period 2T) is called the cosine Fourier series of f, while the Fourier series of the odd periodic extension of f is called the sine Fourier series of the function f.

7.1.11. General Fourier series and wavelets. In the case of a general orthogonal system of functions $f_n$ and the series generated from it, we often talk about the general Fourier series with respect to the orthogonal system of functions $f_n$. Fourier series and further tools built upon them are used for processing various signals, images, and other data. In fact, these mathematical tools also underpin many fundamental models in science, including, for example, the modelling of brain function, as well as much of theoretical physics. The periodic nature of the sine and cosine functions used in classical Fourier series, and their simple scaling by increasing the frequency in unit steps, limit their usability. In many

g1(x)=0;g2(x)=x+1;g3(x)=x+3;g4(x)=x-1
g=piecewise([[(-3, -2), g1], [(-2, -1), g3],
[(-1, 0), g1], [(0, 1), g2], [(1, 2), g1],
[(2, 3), g4]])
partS5 = g.fourier_series_partial_sum(5,1)
partS55 = g.fourier_series_partial_sum(55,1)
a = g.plot(x,-3,3,thickness=3,color="steelblue",
exclude=[-2, -1, 0, 1, 2], legend_label=r"$f$")
s5=plot(partS5, x, -3, 3, linestyle="--",
We do not have space here to go into details, but the readers may find many excellent specialized books in this fascinating area of applied Mathematics. Here we consider one simple example. 7.1.12. The Haar wavelets. Perhaps the first question to start with is, how to effectively approximate any given function with piecewise constant ones. For various reasons, it is good if our mother wavelet ψ has zero mean, too. Thus we want to consider an analogue of the Heaviside function ψ(x) = { 1 x ∈ [0, 1/2) −1 x ∈ [1/2, 1). As a straightforward exercise we may check that, indeed, the resulting system of functions ψj,k is orthonormal. Another exercise shows, using finite linear combinations of these functions, that we may approximate any constant function with given precision over a bounded interval. In an exercise we shall see ?? that this already verifies the density properties required for the orthogonal mother wavelet functions. 1The roots of wavelets go back to various attempts, how to localize the basic signals in both time and frequency, with diverse motivations from engineering and other applications. The name wavelet seems to be related to the idea of having a wave similar signal which begins and ends with zero amplitude. Since late 1970’s, these attempts were related to many names (e.g. Morlet, Meyer) and the wavelet theory became the main tool in signal analysis. Of course, first examples of wavelets are much older, the Haar’s construction goes back to 1909. Actually, many of the wavelet types do not represent orthogonal systems of functions, they rather share the idea of a combination of the high pass filters and low pass filters. The reader is advised to consult extremely rich literature, if interested in more details 645 color="darkgreen", legend_label=r"$F_{5}$") s55=plot(partS55, x, -3, 3, linestyle="--", color="darkgreen", legend_label=r"$F_{55}$") show(a+s5);show(a+s55) Notice the resulting approximations are much slower comparing the situation in the previous examples. By the illustrations one can also demonstrate the “Gibbs phenomenon”. This is the overshooting in the jumps, which is proportional to the magnitudes of the jumps, see the figures here: □ It is often convenient (and usually easier) to express the Fourier series using the complex coefficients cn instead of the real coefficients an and bn. This is a straightforward consequence of the facts einωx = cos(nωx) + i sin(nωx) or, vice versa cos(nωx) = 1 2 (einωx + e−inωx ) , sin(nωx) = 1 2i (einωx − e−inωx ) . The resulting series for a real or complex valued function g on the interval [−π, π] is F(x) = ∑∞ n=−∞ cn einx with cn = 1 2π ∫ π −π e−inx g(x) dx . CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING Now we consider the question of effective treatment. Notice that we can also use the characteristic function φ of the interval [0, 1) and write ψ(x) = φ(2x)−φ(2x−1) = 1 √ 2 √ 2φ1,0(x)− 1 √ 2 √ 2φ1,1(x). The function φ plays the role of the father wavelet function and it itself satisfies φ(x) = φ(2x)+φ(2x−1) = 1 √ 2 √ 2φ1,0(x)+ 1 √ 2 √ 2φ1,1(x). This can be interpreted as differencing and averaging the two consecutive values at the half scale. With these properties, there is no need for an explicit analytic form of ψ and φ, since we can find their values recurrently in all dyadic points x. Indeed, φ(2j−1 n) = φ(2j n) + φ(2j n − 1). The function φ has another useful feature. Namely we can obtain the unit constant function by adding all its integer trans- lations ∞∑ k=−∞ φ0,k(x) = 1 for all x ∈ R. 
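Readers who wish to see the "straightforward exercise" on orthonormality carried out concretely may try the following self-contained sketch (plain Python, runnable in a Sage session; the window [−4, 4] and the resolution J = 8 are our arbitrary choices, sufficient for the small indices tested):

def psi(t):
    # the Haar mother wavelet from 7.1.12
    return 1.0 if 0 <= t < 0.5 else (-1.0 if 0.5 <= t < 1 else 0.0)

def ip(j1, k1, j2, k2, J=8):
    # inner product of psi_{j1,k1} and psi_{j2,k2}; for j >= 0 both factors
    # are constant on dyadic cells of length 2^(-J), so the midpoint
    # Riemann sum below is exact (up to rounding)
    h = 2.0**(-J)
    pts = [-4 + (i + 0.5)*h for i in range(int(8/h))]
    w = lambda t, j, k: 2.0**(j/2) * psi(2.0**j * t - k)
    return h * sum(w(t, j1, k1) * w(t, j2, k2) for t in pts)

print(ip(0, 0, 0, 0))   # 1.0 -- normalized
print(ip(0, 0, 1, 0))   # 0.0 -- orthogonal, overlapping supports
print(ip(1, 0, 1, 1))   # 0.0 -- orthogonal, disjoint supports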
Finally, nearly all the coefficients in the general Fourier series with the base $\psi_{j,k}$ vanish for piecewise constant functions. On the contrary, the function φ "sees" the constants. In engineering terminology, this is an instance of the high pass filter and low pass filter.

7.1.13. Example. To illustrate the above considerations, we approximate the following function f(x) in R by the Haar wavelets,
$$f(x) = \begin{cases} 0.3(x + 3), & -2 \le x \le -1, \\ 0.7(x + 1), & 0 \le x \le 1, \\ 0 & \text{otherwise}. \end{cases}$$
The function in question is not periodic and could not be approximated well by classical Fourier series. The individual functions $\psi_{j,k}$ from 7.1.11 have compact support, but in order to approximate constant or linear behaviour, we still need a large number of them. The following illustrations have been acquired in Maple, working with indices |j| ≤ n and |k| ≤ n, the first one with n = 5, the second one with n = 10. The approximation on the sides of the interval is not as good as in the middle, because we do not include enough shifts, i.e. values of k. One of the motivations for the scaling

Let us discuss a few examples; for more details see 7.1.7 (see also 7.D.10).

7.A.16. Complex Fourier series. (a) Determine the complex version of the Fourier series F(x) of the function f(x) = x with x ∈ (−π, π). Next use the results from 7.A.10 to confirm that
$$c_n = \frac{1}{2}(a_n - i b_n), \qquad c_{-n} = \overline{c_n}.$$
(b) Verify that the complex form of the Fourier series matches the result given in 7.A.10.

Solution. (a) By definition, $F(x) = \sum_{n=-\infty}^{\infty} c_n e^{inx}$, hence one needs to specify the complex Fourier coefficients $c_n$. We have
$$c_0 = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\, dx = \frac{1}{2\pi} \int_{-\pi}^{\pi} x\, dx = 0,$$
since f(x) = x is an odd function. For n ≠ 0, using integration by parts we get
$$c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) e^{-inx}\, dx = \frac{1}{2\pi} \int_{-\pi}^{\pi} x e^{-inx}\, dx = \frac{1}{2\pi} \Big[ -\frac{x e^{-inx}}{in} \Big]_{-\pi}^{\pi} - \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{e^{-inx}}{-in}\, dx$$
$$= -\frac{1}{2i\pi n} \big[ x e^{-inx} \big]_{-\pi}^{\pi} + \frac{1}{2i\pi n} \int_{-\pi}^{\pi} e^{-inx}\, dx = -\frac{1}{2i\pi n} \big( \pi e^{-in\pi} + \pi e^{in\pi} \big) + \frac{1}{2i\pi n} \Big[ -\frac{e^{-inx}}{in} \Big]_{-\pi}^{\pi}$$
$$= -\frac{2\pi \cos(n\pi)}{2\pi i n} - \frac{1}{2\pi (in)^2} \big[ e^{-inx} \big]_{-\pi}^{\pi} = -\frac{(-1)^n}{in} + \frac{1}{2\pi n^2} \big( e^{-in\pi} - e^{in\pi} \big) = -\frac{(-1)^n}{in} + \frac{-2i \sin(n\pi)}{2\pi n^2} = -\frac{(-1)^n}{in}.$$
Here we used the identities $i^2 = -1$, $e^{-in\pi} + e^{in\pi} = 2\cos(n\pi) = 2(-1)^n$ and $e^{-in\pi} - e^{in\pi} = -2i\sin(n\pi) = 0$. Now, according to 7.A.10 we have $a_n = 0$, $b_n = \frac{2}{n}(-1)^{n+1}$, and it should be
$$c_n = \frac{1}{2}(a_n - i b_n) = \frac{1}{2} \Big( 0 - i\, \frac{2}{n}(-1)^{n+1} \Big) = \frac{-i(-1)^{n+1}}{n} = \frac{(-1)^{n+1}}{in},$$
since −i = 1/i. This agrees with the result presented above and can be verified in Sage using the bool function. For example, you may type the cell

var("n")
bn=(2*(-1)^(n+1))/n; cn=(-(-1)^(n))/(i*n)
bool(cn==-(i*bn)/2)

which returns True. In a similar way, for $c_{-n}$ one gets $c_{-n} = \frac{(-1)^n}{in}$, and it is easy to see that $\overline{c_n} = c_{-n}$, that is, $c_{-n}$ is the complex conjugate of $c_n$, see also 7.1.7.

CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING

and shifting in the construction of wavelets is the hope for a small amount of non-zero coefficients. But this does not mean that most of the coefficients would be zero. In our case the percentage of non-zero coefficients for n = 1, 2, …, 10 is 55.6, 44.0, 38.8, 34.6, 32.2, 30.8, 29.8, 28.7, 28.0, 27.4.

7.1.14. Concluding remarks. A series of famous wavelets $D_N$ is named after Ingrid Daubechies. They are constructed by very similar recurrent averaging and differencing relations based on certain natural requirements. Just as an indication, consider the slightly more general recurrent relations
$$\varphi(x) = \sqrt{2} \sum_{k=0}^{N} h_k \varphi(2x - k), \qquad \psi(x) = \sqrt{2} \sum_{k=0}^{N} g_k \varphi(2x - k)$$
with yet unknown constants $g_k$ and $h_k$.
If we want the mother wavelet ψ(x − k) to have zero coefficients in the resulting series for all polynomials up to the order N − 1, then we must ensure that $\int_{-\infty}^{\infty} x^k \psi(x)\, dx = 0$ for all k = 0, 1, …, N − 1. Similar conditions determine the Daubechies wavelets. The standards of JPEG2000 are based on similar wavelets, and such techniques provide tools for professional compression of visual data in the film industry, or the format DjVu for compressed publications. In the diagram below, there are the Daubechies D4 mother and father wavelets. In real applications, the orthogonality of the mother wavelet can be relaxed. As long as the functions $\psi_{k,l}$ are linearly independent and generate the whole space of interest, we always get the dual basis with respect to the $L_2$ inner product.

For the complex expression of F(x) we finally get
$$F(x) = c_0 + \sum_{n=1}^{\infty} c_n e^{inx} + \sum_{n=1}^{\infty} c_{-n} e^{-inx} = 0 + \sum_{n=1}^{\infty} \frac{-(-1)^n}{in} e^{inx} + \sum_{n=1}^{\infty} \frac{(-1)^n}{in} e^{-inx} = \sum_{n=1}^{\infty} \frac{(-1)^n}{in} \big( e^{-inx} - e^{inx} \big). \quad (♭)$$
(b) This follows directly from (♭), combined with the identity $-2i\sin(nx) = e^{-inx} - e^{inx}$, i.e.,
$$F(x) = \sum_{n=1}^{\infty} \frac{(-1)^n}{in} \big( e^{-inx} - e^{inx} \big) = \frac{1}{i} \sum_{n=1}^{\infty} \frac{(-1)^n}{n} \big( -2i\sin(nx) \big) = \sum_{n=1}^{\infty} \frac{2(-1)^{n+1}}{n} \sin(nx),$$
which is the formula presented in 7.A.10. □

7.A.17. Compute the complex Fourier series of the 2π-periodic extension of the exponential map $f(x) = e^x$, with x ∈ (−π, π). Next use Sage to illustrate the 20th partial sum of the Fourier series together with the graph of f. ⃝

When f is a periodic function with period 2L for some L > 0, then the complex Fourier series for $x \in (x_0, x_0 + 2L)$ has the form $F(x) = \sum_{n=-\infty}^{\infty} c_n e^{\frac{i\pi n x}{L}}$, where
$$c_n = \frac{1}{2L} \int_{x_0}^{x_0+2L} f(x)\, e^{-\frac{i\pi n x}{L}}\, dx.$$
The next task focuses on this case and is left as a challenge for you.

7.A.18. For the function $f(x) = e^{-x}$ with x ∈ [−1, 1] and f(x) = f(x + 2) outside the interval [−1, 1], compute the complex Fourier coefficients and derive the corresponding Fourier series. ⃝

Let f : [a, b] → R be a function for which f′ exists. We say that f is piecewise differentiable if both f and f′ are piecewise continuous on [a, b]. Suppose now that f is a 2L-periodic piecewise differentiable function defined on an interval of length 2L, for some L > 0, and let F be the Fourier series of f. Assuming that $\lim_{n\to\infty} F_n(x) = F(x)$, the first part of the theorem in 7.1.8 (known as the "Dirichlet condition", see 7.1.16) implies that at a point of continuity of f, the series F converges to the value of f at that point. Moreover, at a point of discontinuity the series F converges to the average of the left and right limits of f at that point. Thus, at a jump discontinuity $x_0$ of f, the value of the Fourier series F does not depend on the value $f(x_0)$, but only on the left and right limits of f at $x_0$. This means that the behaviour of the Fourier series of f at discontinuity points is not affected

CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING

7.1.15. Proof of theorem 7.1.8 about Fourier series. We return to the detailed proof of the basic properties of the classical Fourier series. We shall need several concepts and technicalities related to abstract metric spaces. Thus, the reader could enjoy reading first the general context of metrics and convergence, which we introduce at the beginning of the last part of this chapter. In this perspective, reading a few paragraphs starting with 7.3.1 first, and returning here later, might be a good idea.
On the other hand, we do not need much from the general theory of metric spaces and so our considerations of various concepts of convergence in the proof could also be of assistance for the abstract developments later. We do not worry here about necessary conditions for convergence, and many other formulations can be found in literature. On the other hand, the statement of theorem 7.1.8 is quite simple and deals with many useful situations, as we have seen already. Although we need only the L1 and L2 norms now, we should observe that for general 1 ≤ p < ∞, the formula ∥f∥p = (∫ b a |f|p )1/p defines also a norm. See the definition in 7.3.1 and the paragraph on the Lp-norms and Hölder inequality in 7.3.4 below. Moreover, there is the L∞ norm given by the suprema of values of f over the interval in question. For the sake of simplicity, we always work in the space S0 c or S1 c with respect to the corresponding norm (which always makes sense there). Hölder’s inequality (applied to the functions f and constant 1) yields the following bound on S0 [a, b]. Namely, for p > 1 and 1/p + 1/q = 1, ∫ b a |f(x)| dx ≤ |b − a|1/q (∫ b a |f(x)|p dx )1/p ≤ |b − a|1/q ∥f∥p. Replace f with fn − f. It is then clear from the above bound that Lp–convergence fn → f implies, for any p > 1, L1–convergence. (The terminology Lp–convergence is stronger than L1–convergence is sometimes used). With a modified bound, we can derive an even stronger proposition, namely that on a bounded interval, Lq–convergence is stronger than Lp–convergence whenever q > p; try this by yourselves. If uniform boundedness of the sequence of functions fn is given, then there is a constant C independent of n, so that ∥fn∥∞ ≤ C. Then we can assert that |fn(x)−f(x)| ≤ 2C, and it then follows that L1–convergence implies Lp–convergence, since 648 by the values of f there. On the other hand, recall that the continuity of f is a necessary and sufficient condition for the uniform convergence of F over the entire real line. Let us describe a few examples and for further material on the convergence of Fourier series see the final section of this chapter (Section D). 7.A.19. Describe the implications of Dirichlet’s condition for the Fourier series obtained in Problem 7.A.10. Solution. Consider the function f(x) = x, with x ∈ (−π, π), as in 7.A.10. The 2π-periodic extension of f has discontinuities at x = ±π, ±3π, ±5π, . . . and provides an example of a periodic piecewise differentiable function. At these discontinuities it is easy to see that the average of the left and right limits of f is zero. Therefore, by the result in 7.A.10 and Dirichlet’s condition one deduces that ∞∑ n=1 2(−1)n+1 n sin(nx) = { x , if − π < x < π , 0 , if x = ±π . □ 7.A.20. Convergence of Fourier series. Consider the square wave function (Heaviside’s function, see also 7.1.9), defined this time on [−1, 1], i.e., f(x) = { −1 , if − 1 ≤ x < 0 , 1 , if 0 < x ≤ 1 , with f(x + 2) = f(x) for all x outside the interval [−1, 1]. Determine the pointwise convergence of the corresponding Fourier series. Solution. To answer the task we do not need the explicit expression of the Fourier series F of f. In particular, f satisfies Dirichlet’s condition and hence F converges to f at all points where f is continuous. The discontinuities of f appear at x = 0, and x = ±1. In particular, for the endpoints we have f(−1) ̸= f(1) and the 2-periodic extension of f has discontinuities at ±1, ±3, . . .. 
At these points the Fourier series converges to the average of the left-hand and right-hand limits, which all equal to zero. For instance, we see that f(0+ ) = limx→0+ f(x) = 1 and f(0− ) = limx→0− f(x) = −1, hence 1 2 ( f(0− ) + f(0+ ) ) = 1 2 (−1 + 1) = 0 . The limits f(0− ) and f(0+ ) can be verified also in Sage, by utilizing the piecewise and limit commands. It is important however to specify the option algorithm = ”giac” within the limit command, as shown below: f = piecewise([[(-1, 0),-1], [(0,1),1]]) show(f(x).limit(x=0, dir="+", algorithm="giac")) show(f(x).limit(x=0, dir="-", algorithm="giac")) Thus F(x) converges to f(x) for all x, except of x = 0, ±1 where it converges to 0. □ CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING then ∫ b a |f(x) − fn(x)|p dx = = ∫ b a |f(x) − fn(x)|p−1 ||f(x) − fn(x)| dx ≤ (2C)p−1 ∫ b a |f(x) − fn(x)| dx which can be written ∥f − fn∥p ≤ (2C)1/q ∥f − fn∥ 1/p 1 . It follows that the Lp–norms on the space S0 [a, b] are equivalent with respect to the convergence of uniformly bounded sequences of functions. 7.1.16. Implications of the Dirichlet condition. The most difficult (and most interesting) problem is to prove the first statement of the theorem 7.1.8, which in the literature is often refered to as the Dirichlet condition, which seems to have been derived as early as in 1824. We begin by proving how this property of pointwise convergence implies the statements (2) and (3) of the theorem. Without loss of generality, we assume that we are working on the interval [−π, π], i. e. with period T = 2π. As the first step, we prepare simple bounds for the coefficients of the Fourier series. One bound is of course |an| ≤ 1 π ∫ π −π |f(x)| dx, and similarly for all the coefficients bn. This is because both cos(x) and sin(x) are bounded by 1 in absolute value. However, if f is a continuous function in S1 [−π, π], we can integrate by parts, thus obtaining (we write an(f) for the corresponding coefficient of the function f, and so on) bn(f) = 1 π ∫ π −π f(x) sin(nx) dx = −1 nπ [f(x) cos(nx)]π −π + 1 nπ ∫ π −π f′ (x) cos(nx) dx = 1 n an(f′ ). Notice f(−π) = f(π) by the continuity and thus the boundary term vanishes. Similarly we compute an(f) = −bn(f′ ). Iterating this procedure, we obtain a bound for functions f in Sk+1 [−π, π] with continuous derivatives up to order k inclusive |an(f)| ≤ 1 nk+1π ∫ π −π |f(k+1) (x)| dx, and similarly for bn(f). Thus we can see that the “smoother” a function is, the more rapidly the Fourier coefficients approach zero. For sufficiently smooth functions f, the nk –multiples of their Fourier coefficients an and bn are bounded by the L1-norm of their k–th derivative f(k) . 649 7.A.21. Using the Fourier series of the 2π-periodic extension of the function g(x) = | x |, with x ∈ [−π, π), sum the series ∞∑ n=1 1 (2n − 1)2 . Solution. To determine the value of this series, one can successfully apply several known Fourier series. The Fourier series of the function g coincides with the Fourier series presented in 7.A.11, i.e., π 2 − 4 π ∞∑ n=1 cos ( (2n − 1)x ) (2n − 1)2 . Since this function g is continuous on [−π, π) and | − π | = | π |, we have | x | = π 2 − 4 π ∞∑ n=1 cos ( (2n − 1)x ) (2n − 1)2 , x ∈ [−π, π] . Substituting x = 0, we get 0 = π 2 − 4 π ∞∑ n=1 1 (2n−1)2 . Therefore ∞∑ n=1 1 (2n − 1)2 = π2 8 . □ 7.A.22. Determine the convergence and uniform convergence of the Fourier series for the function g(x) = e−x , x ∈ [−1, 1). Solution. 
Again, it is unnecessary to calculate the corresponding Fourier series, since we wish only to check convergence. Let us consider the 2-periodic function s(x), defined as follows: s(x) = g(x) = e−x , x ∈ (−1, 1), s(1) = g(−1) + lim x→1− g(x) 2 = e + e−1 2 . According to 7.1.8, this function is the sum of the Fourier series in question. In other words, the Fourier series converges to s(x). Moreover, this convergence is uniform on every closed interval which contains none of the points 2k + 1, k ∈ Z. This follows from the continuity of the functions g and g′ on (−1, 1). On the other hand, the convergence cannot be uniform on any interval (c, d) such that [c, d] contains an odd integer. This is because the uniform limit of continuous functions is always a continuous function, and the periodic extension of s is not continuous at the odd integers. Thus, the series converges to the function g on (−1, 1), yet this convergence is uniform only on the subintervals [c, d] which satisfy the restriction −1 < c < d < 1. □ 7.A.23. Express the function g(x) = π2 −x2 on the interval [−π, π] in the form of a Fourier series. Using this expression, sum the two series CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING Let f be a continuous function in the space S1 [a, b]. By the Dirichlet condition, its Fourier series converge pointwise to f. Then we can assert that |sN (x) − f(x)| = ∞∑ k=N+1 (ak cos(kx) + bk sin(kx)) ≤ ∞∑ k=N+1 (|ak| + |bk|). The right-hand side can further be estimated by the coefficients a′ n and b′ n of the derivative f′ . By applying in succession the inequality above, then Hölder’s inequality for infinite series (with p = q = 2), and then with the arithmeticgeometric inequality, we obtain |sN (x) − f(x)| ≤ ∞∑ k=N+1 1 k (|a′ k| + |b′ k|) ≤ ( ∞∑ k=N+1 1 k2 )1 2 ( ∞∑ k=N+1 (|a′ k|2 + 2|a′ k||b′ k| + |b′ k|2 ) )1 2 ≤ ( ∞∑ k=N+1 1 k2 )1 2 ( ∞∑ k=N+1 (2|a′ k|2 + 2|b′ k|2 ) )1 2 ≤ √ 2 (∫ ∞ N 1 x2 dx )1 2 1 √ π ∥f′ ∥2 = ( √ 2 √ π ∥f′ ∥2 ) · 1 √ N . Thus we have obtained not only a proof of the uniform convergence of our series to the anticipated value, but also a bound for the speed of the convergence: Uniform convergence under the Dirichlet condition If f is a continuous function in S1 [a, b], then sup x∈R |sN (x) − f(x)| ≤ ( √ 2 √ π ∥f′ ∥2 ) · 1 √ N . This proves the statement 7.1.8.(2), supposing the Dirichlet condition 7.1.8.(1) holds. 7.1.17. L2–convergence. In the next step of our proof, we derive L2–convergence of Fourier series. The proof utilizes the common technique of approximation objects which are not continuous by ones which are. We describe it without further details. Interested readers should be able to fill in the gaps by themselves without any difficulties. First, we formulate the statement we need. Lemma. The subset of continuous functions f in S0 [a, b] on a finite interval [a, b] is a dense subset in this space with respect to the L2–norm. 650 (1) ∞∑ n=1 (−1)n+1 n2 , ∞∑ n=1 1 n2 . Solution. We could take advantage of the function g being even, and calculate the non-zero coefficients an by integration by parts. However, in 7.1.10 the Fourier series for the function f(x) = x2 on the interval [−1, 1] is derived. This proves the identity f(x) = 1 3 + 4 π2 ∞∑ n=1 (−1)n cos(nπx) n2 , x ∈ (−1, 1), valid also for x = ±1. By adding π2 and rescaling, it follows that g(x) = π2 − ( 1 3 + 4 π2 ∞∑ n=1 (−1)n cos nπx π n2 ) π2 = 2 3 π2 + 4 ∞∑ n=1 (−1)n+1 cos(nx) n2 , x ∈ [−π, π]. Of course, one can also calculate the Fourier series of the function g directly. 
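In Sage, this direct computation follows the same pattern as the earlier blocks. Here is a sketch of ours under the convention T = 2π, ω = 1; the printed coefficients still contain cos(πn) and sin(πn), which reduce via cos(πn) = (−1)^n and sin(πn) = 0 for integers n:

g(x) = pi^2 - x^2
n = var("n")
a0 = (1/pi)*integral(g(x), x, -pi, pi)            # 4*pi^2/3, so a_0/2 = 2*pi^2/3
an = (1/pi)*integral(g(x)*cos(n*x), x, -pi, pi)
bn = (1/pi)*integral(g(x)*sin(n*x), x, -pi, pi)
show("a_n =", an.expand())   # reduces to 4*(-1)^(n+1)/n^2
show("b_n =", bn.expand())   # reduces to 0, as g is even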
Substituting x = 0 and x = π then gives π2 = 2 3 π2 + 4 ∞∑ n=1 (−1)n+1 n2 , i.e. ∞∑ n=1 (−1)n+1 n2 = π2 12 , and 0 = 2 3 π2 + 4 ∞∑ n=1 (−1)n+1 (−1)n n2 , i.e. ∞∑ n=1 1 n2 = π2 6 . In other words, π2 = 12 ( 1 − 1 22 + 1 32 − 1 42 + · · · ) = 6 ( 1 + 1 22 + 1 32 + 1 42 + · · · ) . □ We have seen that the Fourier series of a 2L-periodic odd function involves only sine functions, whereas Fourier series of a 2L-periodic even function involves only cosine functions. However, Fourier series are employed in many areas and often we are interested in functions that are only defined on intervals of the form [0, L], for some L > 0. In particular, to get the cosine series expansion of f we may extend it to [−L, L] so that the extended function is an even 2L-periodic function. In this case, by the result in 6.B.35 we will have an = 2 L ∫ L 0 f(x) cos (nπx L ) dx , bn = 0 . This gives the cosine extension a0 2 + ∞∑ n=1 an cos (nπx L ) . Similarly, to get the sine series expansion of f, we may extend it to [−L, L] in such a way that the extended function is an odd 2L-periodic function. Then we will have an = 0 , bn = 2 L ∫ L 0 f(x) sin (nπx L ) dx , CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING Sketch of the proof. Here "dense" means that for any g in S0 [a, b] and any ε > 0, there is some continuous f satisfying ∥f − g∥2 < ε. We deal with abstract topological concepts like this in the last part of this chapter. The idea of the proof can be seen easily via the example of approximation of Heaviside’s function h on the interval [−π, π]. We recall that h(x) = −1 for x < 0, and h(x) = 1 for x > 0. For every δ satisfying π > δ > 0, we define the function fδ as x/δ for |x| ≤ δ and fδ(x) = h(x) otherwise. All the functions fδ are continuous, in fact, piecewise linear. It can be calculated easily that ∥h−fδ∥2 → 0 so that h can be approximated in L2 norm by a sequence of continuous func- tions. All discontinuity points of a general function f can be catered for in exactly the same way. There are only finitely many of them, and so all of the considered functions are limit points of sequences of continuous functions. □ Now, our proof of the L2 convergence (under the assumption of the Dirichlet condition) is already simple because for the given function f, the distance between the partial sums of its Fourier series can be bounded by using a sequence of continuous functions fε in this way (all norms in this paragraph are the L2 norms): ∥f−sN (f)∥ ≤ ∥f−fε∥+∥fε−sN (fε)∥+∥sN (fε)−sN (f)∥ and the particular summands on the right-hand side can be controlled. The first one of them is at most ε, and according to the uniform convergence for continuous functions (as just proved above), the second summand can be bounded also by ε, if N is big enough. Notice that the third term has the value of the partial sum of the Fourier series for f−fε. Thus (cf. Theorem 7.1.5(1)), ∥f − fε − sN (f − fε)∥ ≤ ∥f − fε∥. Therefore by the triangle inequality, ∥sN (f−fε)∥ ≤ ∥sN (f−fε)−f+fε∥+∥f−fε∥ ≤ 2∥f−fε∥. Altogether, ∥f − sN (f)∥ ≤ 4ε. This verifies the L2 convergence of the functions sN (f) to f under the Dirichlet condition, which is we wanted to prove. 7.1.18. Dirichlet kernel. Finally, we arrive at the proof of the first statement of theorem 7.1.8. 
It follows from the definition of the Fourier series F(t) for a function f(t), using its expression with the complex exponential in 7.1.7, that the partial sums sN (t) can be written as sN (t) = 1 T N∑ k=−N ∫ T/2 −T/2 f(x) e−iωkx eiωkt dx, 651 which gives the sine extension ∞∑ n=1 bn sin (nπx L ) , see also at the end of the paragraph 7.1.10. Let us describe a few examples and for further material we refer to Section D. 7.A.24. Describe the cosine Fourier series for the sine function f(x) = sin(x), with x ∈ [0, π]. ⃝ 7.A.25. Describe the sine Fourier series for the function f(x) = cos(x), with x ∈ [0, π]. ⃝ We end this section with a tasks on the so called “Paresval’s” identity, and some applications of Fourier series. 7.A.26. Using Parseval’s identity for Fourier’s orthogonal system (part (3) of the theorem 7.1.5), verify that ∞∑ n=1 1 (2n−1)4 = π4 96 . Solution. It is imperative to choose an appropriate Fourier series. For instance, consider the Fourier series π 2 − 4 π ∞∑ n=1 cos((2n−1)x) (2n−1)2 , which we obtained for the function g(x) = | x |, x ∈ [−π, π) in 7.A.15(b). Parseval’s identity a2 0 2 + ∞∑ n=1 a2 n + ∞∑ n=1 b2 n = 2 T x0+T∫ x0 (g(x))2 d x says, substituting for the a’s and b’s from our particular series, that π2 2 + 16 π2 ∞∑ n=1 1 (2n−1)4 = 1 π π∫ −π | x |2 d x = 2 π π∫ 0 x2 d x = 2π2 3 , so, ∞∑ n=1 1 (2n − 1)4 = ( 2π2 3 − π2 2 ) π2 16 = π4 96 . □ There are other ways of obtaining this result, see for example (3) at the page 698. We recommend comparing the solutions of this exercise to the previous one. We started our discussion of Fourier series with the simplest of periodic functions f(t) = a sin(ωt + b) for certain constants a, ω > 0, b ∈ R. They appear as the general solution to the homogeneous linear differential equation (1) y′′ + ω2 y = 0 which arises in mechanics by Newton’s law of force for a moving particle. Recall the brief introduction to the simplest differential equations in 6.2.21 on page 552. Much more follows in Chapter 8. CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING where T is the period we are working with and ω = 2π/T. This expression can be rewritten as (1) sN (t) = ∫ T/2 −T/2 KN (t − x)f(x) dx, where the function KN (y) = 1 T N∑ k=−N eiωky is called the Dirichlet kernel. The sum is a (finite) geometric series with common ratio eiωy . By multiplying by eiωy and then subtracting, we obtain eiωy KN (y) = 1 T N∑ k=−N eiω(k+1)y (1 − eiωy )KN (y) = 1 T ( e−iNωy − ei(N+1)ωy ) . Provided ωy is not a multiple of 2π, we continue to obtain KN (y) = 1 T e−iNωy − ei(N+1)ωy 1 − eiωy = 1 T − e−i(N+1/2)ωy + ei(N+1/2)ωy eiωy/2 − e−iωy/2 = 1 T sin((N + 1/2)ωy) sin(ωy/2) in which the key step was to multiply both the numerator and the denominator by − e−iωy/2 . When y = 0, KN (0) = 1 T (2N + 1). The last expression shows that KN (y) is an even function. By l’Hospital’s rule, applied at y = 0, it is continuous everywhere. Since all the partial sums of the series for the constant function f(x) = 1 also equal 1, we obtain from the definition of the Dirichlet kernel, cf. (1), that ∫ T/2 −T/2 KN (x) dx = 1. In the case of periodic functions, the integrals over intervals whose length equals the period are independent of the choice of the end points. Hence, changing the coordinates, we can also use the expression sN (x) = ∫ T/2 −T/2 KN (y)f(x + y) dy for the partial sums. Finally, we are fully prepared. First, we consider the case when the function f is continuous (and piecewice differentiable) at the point x. 
We want to prove that in this case, the Fourier series F(x) for the function f converges to the value f(x) at the point x. We have sN (x) − f(x) = ∫ T/2 −T/2 (f(x + y) − f(x))KN (y) dy. 652 We mention that the function f has period T = 2π/ω. In mechanics, one often talks about frequency 1/T. The positive value a expresses the maximum displacement of the oscillating point from the equilibrium position and it is called the amplitude. The value b determines the position of the point at the initial time t = 0 and it is called the initial phase, while ω is the angular frequency of the oscillation. Similarly, the function z ≡ g(t) describes the dependence of voltage upon time t in an electrical circuit with inductance L and capacity C and which is the solution of the differential equation (2) z′′ + ω2 z = 0. The only difference between the equations (1) and (2) (besides the dissimilar physical interpretation) is the constant ω. In the equation (1), there is ω2 = k/m where k is the proportionality constant and m is the mass of the point, while in the equation (2), there is ω2 = (LC)−1 . We illustrate how Fourier series can be applied in the theory of differential equations. Consider only the nonhomogeneous (compare to (1)) differential equation (3) y′′ + a2 y = f(x) with y an unknown in variable x ∈ R, with a periodic, continuously differentiable function f : R → R on the right-hand side and a constant a > 0. Let T > 0 be the prime period of the function f and let its Fourier series on [−T/2, T/2] be known, i.e. we assume (4) f(x) = A0 2 + ∞∑ n=1 ( An cos 2πnx T + Bn sin 2πnx T ) . 7.A.27. Prove that if the equation (3) has a periodic solution on R, then the period of this solution is also a period of the function f. Further, prove that the equation (3) has a unique periodic solution with period T if and only if (1) a ̸= 2πn T for every n ∈ N. Solution. Let a function y = g(x), x ∈ R, be a solution of the equation (3) with f(x) ̸≡ 0 and with period p > 0. In order to substitute the function g into a second-order differential equation, its second derivative g′′ must exist. Since the functions g, g′ , g′′ , . . . share the same period, the function g′′ (x) + a2 g(x) = f(x) is also periodic with period p. In other words, the function f is periodic as a linear combination of functions with period p. Thus, we have proved the first statement claiming that p = lT for a certain l ∈ N. Suppose that the function y = g(x), x ∈ R, is a periodic solution of the equation (3) with period T and that it is expressed by a Fourier series as follows: (2) g(x) = a0 2 + ∞∑ n=1 ( an cos (ωnx) + bn sin (ωnx) ) , x ∈ R, CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING The integrand can be rewritten f(x + y) − f(x) sin(ωy/2) sin((N + 1/2)ωy) = = φx(y) ( cos(ωy/2) sin(Nωy)+sin(ωy/2) cos(Nωy) ) , where φx(y) = f(x + y) − f(x) sin(ωy/2) for y ̸= 0. Moreover, we can compute the right and left limits of φx at the point y = 0 using the L’Hospital’s rule. Indeed, writing f′ ±(x) for the right and left derivatives of f, we arrive at lim y→0+ φx(y) = lim y→0+ f(x + y) − f(x) y y sin(ωy/2) = 2 ω f′ +(x), and similiary for the left limit. Thus, φx(y) is piecewise continuous on the entire interval [−T/2, T/2]. Next, we rewrite the integral expression for sN −f in the form reminding coefficients of Fourier series: 2 T ∫ T/2 −T/2 (ψ1(y) sin(Nωy) + ψ2(y) cos(Nωy) dy with the piecewise continuous functions ψ1(y) = T 2 φx(y) cos(ωy/2), ψ2(y) = T 2 φx(y)sin(ωy/2). 
Indeed, we deal with the Fourier coefficients of the functions $\psi_1$ and $\psi_2$, and they have to converge to zero as $N\to\infty$ by virtue of the general theorem on abstract Fourier series, see the Bessel inequality in 7.1.5(2). Hence
\[
\lim_{N\to\infty}\int_{-T/2}^{T/2}\psi_1(y)\sin(N\omega y)\,dy=0,\qquad
\lim_{N\to\infty}\int_{-T/2}^{T/2}\psi_2(y)\cos(N\omega y)\,dy=0.
\]
But this means $\lim_{N\to\infty}s_N(x)-f(x)=0$, as desired.

Now suppose the function $f$ has a discontinuity at $x$. Without loss of generality, we may assume $x=0$ (using a constant shift of coordinates otherwise). Since the function belongs to $S^1$, it is already continuous and differentiable on a neighbourhood of the point $x=0$ (outside the point itself). Split $f$ into its even part $f_1$ and its odd part $f_2$. That is, write $f(x)=f_1(x)+f_2(x)$, where
\[
f_1(x)=\tfrac12\bigl(f(x)+f(-x)\bigr)\ \text{for }x\neq0,\qquad
f_1(0)=\tfrac12\Bigl(\lim_{x\to0^+}f(x)+\lim_{x\to0^-}f(x)\Bigr),\qquad
f_2(x)=\tfrac12\bigl(f(x)-f(-x)\bigr).
\]
Then the even part $f_1(x)$ is continuous and piecewise differentiable at the point $x=0$, because of the existence of the one-sided limits, and so on a neighbourhood of the point $x=0$.

where $\omega=2\pi/T$. If $g$ satisfies the equation (3), it has a continuous second derivative on $\mathbb{R}$. Therefore, for $x\in\mathbb{R}$,
\[
g'(x)=\sum_{n=1}^{\infty}\bigl(\omega nb_n\cos(\omega nx)-\omega na_n\sin(\omega nx)\bigr),
\]
(3) $g''(x)=\sum_{n=1}^{\infty}\bigl(-\omega^2n^2a_n\cos(\omega nx)-\omega^2n^2b_n\sin(\omega nx)\bigr)$.
Substituting (4), (2), and (3) into the equation (3) yields
\[
\frac{a^2a_0}{2}+\sum_{n=1}^{\infty}\Bigl((-\omega^2n^2a_n+a^2a_n)\cos(n\omega x)+\bigl(-\omega^2n^2b_n+a^2b_n\bigr)\sin(n\omega x)\Bigr)
=\frac{A_0}{2}+\sum_{n=1}^{\infty}\bigl(A_n\cos(n\omega x)+B_n\sin(n\omega x)\bigr).
\]
It follows that
(4) $\frac{a^2a_0}{2}=\frac{A_0}{2}$, that is, $a^2a_0=A_0$,
and for $n\in\mathbb{N}$,
(5) $(-\omega^2n^2+a^2)\,a_n=A_n$, $(-\omega^2n^2+a^2)\,b_n=B_n$.
There is exactly one pair of sequences $\{a_n\}_{n\in\mathbb{N}\cup\{0\}}$, $\{b_n\}_{n\in\mathbb{N}}$ satisfying these conditions if and only if
\[
-\omega^2n^2+a^2=-\Bigl(\frac{2\pi n}{T}\Bigr)^2+a^2\neq0\quad\text{for every }n\in\mathbb{N},
\]
i.e., if (1) holds. In this case, the unique solution of (3) with period $T$ is determined by the unique solution
(6) $a_n=\frac{A_n}{-\omega^2n^2+a^2}$, $b_n=\frac{B_n}{-\omega^2n^2+a^2}$, $n\in\mathbb{N}$,
of the system (5). We emphasize that we utilized the uniform convergence of the series in (3). □

7.A.28. Using the solution of the previous problem, find all $2\pi$-periodic solutions of the differential equation
\[
y''+2y=\sum_{n=1}^{\infty}\frac{\sin(nx)}{n^2},\qquad x\in\mathbb{R}.
\]
Solution. The equation is of the form (3) for $a=\sqrt2$ and the continuously differentiable function $f(x)=\sum_{n=1}^{\infty}\frac{\sin(nx)}{n^2}$, $x\in\mathbb{R}$, with prime period $T=2\pi$. According to problem 7.A.27, the condition $\sqrt2\notin\mathbb{N}$ implies that there is exactly one $2\pi$-periodic solution. If we look for it as the value of the series $\frac{a_0}{2}+\sum_{n=1}^{\infty}\bigl(a_n\cos(nx)+b_n\sin(nx)\bigr)$, $x\in\mathbb{R}$, we also know (see (4) and (6)) that $a_0=a_n=0$, $b_n=\frac{1}{n^2(2-n^2)}$, $n\in\mathbb{N}$. Thus, the given equation has the unique $2\pi$-periodic solution
\[
y=\sum_{n=1}^{\infty}\frac{\sin(nx)}{n^2(2-n^2)},\qquad x\in\mathbb{R}. \qquad\square
\]

Also $f_2(0)=0$, and the Fourier series for $f_2$ contains only the terms with $\sin(n\omega x)$ and thus vanishes at $x=0$. Thus we can refer to the previous continuous case and obtain, for the Fourier series $F(x)$ of the function $f$, the equation
\[
F(0)=\tfrac12\Bigl(\lim_{x\to0^+}f(x)+\lim_{x\to0^-}f(x)\Bigr)+0,
\]
which is what we wanted to prove.

If we face more than one jump of $f$ within the basic period of length $T$, we may express $f$ as a sum of functions with exactly one discontinuity each. For example, if the points of discontinuity were $x_0<x_1$, both in $(-T/2,T/2)$, and writing $f(y\pm)=\lim_{x\to y^\pm}f(x)$ for the one-sided limits, we can consider $f=g_1+g_2$ with
\[
g_1(x)=\begin{cases} f(x) & x\in(-T/2,x_0)\\ f(x)-f(x_0+)+f(x_0-) & x\in(x_0,T/2)\end{cases}
\qquad
g_2(x)=\begin{cases} 0 & x\in(-T/2,x_0)\\ f(x_0+)-f(x_0-) & x\in(x_0,T/2).\end{cases}
\]
Clearly $g_1$ has its only discontinuity at $x_1$, while the piecewise constant $g_2$ jumps at $x_0$ only. By the already proved behaviour, the Fourier series of $g_1$ and $g_2$ converge at $x_0$ to $f(x_0-)$ and $\tfrac12\bigl(f(x_0+)-f(x_0-)\bigr)$, respectively. Their sum provides the desired result. Similarly for the other discontinuity point. This completes the proof. This also completes the proof of the statements (2) and (3) of theorem 7.1.8, where we required the Dirichlet condition to hold.

2. Integral operators and Fourier transform

This section provides a few glimpses towards a fascinating and useful area of mathematics exploiting the integration process in many practical ways. Unlike most of this book, we do not provide any general theory, and our constructions are illustrated by examples rather than by rigorous theorems with full proofs.

7.2.1. Functionals. In the case of finite-dimensional vector spaces, we can regard the vectors as mappings from a finite set of fixed generators into the space of coordinates. The sums of vectors and the scalar multiples of vectors were then given by the corresponding operations with such functions. We also worked with the vector spaces of functions of a real variable in the same way when their values were scalars (or vectors as well). The simplest linear mapping $\alpha$ between vector spaces maps vectors to scalars. It is called a linear form, or a linear functional (in particular when dealing with vector spaces of functions). In finite dimension, it is defined as the sum of products of coordinates $x_i$ of vectors with fixed values $\alpha_i=\alpha(e_i)$ at the generators $e_i$, i.e. by one-row matrices:
\[
(x_1,\dots,x_n)^T\mapsto(\alpha_1,\dots,\alpha_n)\cdot(x_1,\dots,x_n)^T.
\]

B. Integral operators and Fourier Transform

Next we focus on problems related to the notion of "convolution", which is a typical integral operation that applies to two functions $f,g$ and produces a new one, denoted by $f*g$ and defined by
\[
(f*g)(y)=\int_{-\infty}^{\infty}f(x)\,g(y-x)\,dx.
\]
The convolution integral is a fundamental concept in various fields, e.g., in statistics, signal processing, computer vision, and others. For its basic features and other basic facts, see 7.2.2.

7.B.1. Find the convolutions $f*g$ and $f*h$ of the following functions, and in each case check the "smoothing" of $f$:
\[
f(x)=\sin(x)+\tfrac25[\sin(6x)]^2-\tfrac15\sin(60x),
\]
\[
g(x)=\begin{cases}\frac{1}{2\varepsilon} & \text{if }-\varepsilon<x<\varepsilon\\ 0 & \text{otherwise,}\end{cases}\quad\text{where }\varepsilon>0,
\qquad
h(x)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}},\quad\text{where }\mu,\sigma\in\mathbb{R},\ \sigma>0.
\]
Solution. The function $g$ is chosen to provide the mean of $f$ over (small) intervals of length $2\varepsilon$, and it is normalized so that the integral of $g$ over all of $\mathbb{R}$ is one. We should expect some smoothing of the oscillations of the function $f(x)$. Drawing the resulting functions by Maple shows: the graph depicted in the upper left is the original $f$, while the graph in the upper right is of $g$ with $\varepsilon=1/10$. The two

More complicated mappings, with values lying in the same space, are given similarly by square matrices. We approach linear operations on spaces of functions in an analogous way. For simplicity, we work with the real vector space $S$ of all piecewise continuous real-valued functions having compact support, defined on the whole $\mathbb{R}$ or on an interval $I=[a,b]$. Linear mappings $S\to\mathbb{R}$ are called (real) linear functionals. Such functionals can be defined in many diverse ways.
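A discrete sketch of the smoothing effect studied in 7.B.1 (the grid step and $\varepsilon$ are arbitrary choices here, and Python replaces the book's Maple session):

```python
# Sketch of the smoothing in 7.B.1: convolve f with the normalized box
# kernel g on a grid; the fast sin(60x) ripple is strongly damped.
import numpy as np

dx = 0.001
x = np.arange(-2 * np.pi, 2 * np.pi, dx)
f = np.sin(x) + 0.4 * np.sin(6 * x) ** 2 - 0.2 * np.sin(60 * x)

eps = 0.1
g = np.where(np.abs(x) < eps, 1 / (2 * eps), 0.0)

smoothed = np.convolve(f, g, mode="same") * dx   # Riemann sum of the integral

# Compare the size of the oscillating remainder before and after smoothing:
print(np.std(f - np.sin(x)), np.std(smoothed - np.sin(x)))
```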
For example, by evaluating the function's values (or its derivatives') at some fixed points, or in terms of integration. We can, for instance, consider the functional $L$ given by evaluating the function at a single fixed point $x_0\in I$, i.e., $L(f)=f(x_0)$. Or, we can have the functional given by integration of the product with a fixed function $g(x)$, i.e.,
\[
L(f)=\int_a^b f(x)\,g(x)\,dx.
\]
The function $g(x)$ in the previous example is a function which weighs the particular values representing the function $f(x)$ in the definition of the Riemann integral, and the functional $L$ is a perfect analogue of the linear forms on finite-dimensional vector spaces mentioned above. The simplest case of such a functional is, of course, the Riemann integral itself, i.e. the case of $g(x)=1$ for all points $x$. A good example is given by
\[
g(x)=\begin{cases}0 & \text{if }|x|\geq\varepsilon\\ \frac{1}{2\varepsilon} & \text{if }|x|<\varepsilon\end{cases}
\]
for any $\varepsilon>0$. The integral of the function $g$ over $\mathbb{R}$ equals one, and our linear functional can be perceived as a (uniform) averaging of the values of the function $f$ over the $\varepsilon$-neighbourhood of the origin. Similarly, we can work with the function
\[
g(x)=\begin{cases}0 & \text{if }|x|\geq\varepsilon\\ e^{\frac{1}{x^2-\varepsilon^2}+\frac{1}{\varepsilon^2}} & \text{if }|x|<\varepsilon,\end{cases}
\]
which we used in the paragraph 6.1.9. This function is smooth on $\mathbb{R}$ with compact support on the interval $(-\varepsilon,\varepsilon)$. Our functional has the meaning of a weighted combination of the values, but this time, the weights of the input values decrease rapidly as their distance from the origin increases. The integral of $g$ over $\mathbb{R}$ is finite, but it is not equal to one. Dividing $g$ by this integral would lead to a functional which would have the meaning of a non-uniform averaging of a given function $f$. Another example is the Gaussian function
\[
g(x)=\frac{1}{\sqrt\pi}\,e^{-x^2},
\]
which also has its integral over $\mathbb{R}$ equal to one (we verify this later). This time, all the input values $x$ in the corresponding "average" have a non-zero weight, yet this weight becomes insignificantly small as the distance from the origin increases.

lower graphs show their convolution with (respectively) parameters for $g$ selected as $\varepsilon=1/10$ and $\varepsilon=13/50$. It is straightforward to compute the convolution explicitly, too:
\[
f*g(y)=\frac{1}{2\varepsilon}\int_{y-\varepsilon}^{y+\varepsilon}\sin(x)+\tfrac25\sin(6x)^2-\tfrac15\sin(60x)\,dx
=\frac{1}{2\varepsilon}\Bigl[-\cos(x)-\tfrac{1}{30}\sin(6x)\cos(6x)+\tfrac15x+\tfrac{1}{300}\cos(60x)\Bigr]_{y-\varepsilon}^{y+\varepsilon}.
\]
Similarly, the other function $h$ is a typical smoothing function which gives much more weight to the values of $f$ near the point $y$, and much less weight to the values of $f$ further from $y$. This is the famous Gaussian function. We meet it frequently. Using Maple again, we see that the integral of $h$ over $\mathbb{R}$ is one (we prove this in 13.2.8). It is not easy to find the convolution analytically, but Maple can do it approximately. The resulting diagrams are as follows. As before, the graph depicted in the upper left is the original $f$, while the graph in the upper right is of $h$ with $\mu=0$ and $\sigma=1/10$. Below that, their convolutions are shown with parameters $\mu=0$, $\sigma^2=1/60$ and (lower right) $\mu=0$, $\sigma^2=5/60$. □

7.B.2. Determine the convolution $f_1*f_2$ where $f_1(x)=\frac1x$ for $x\neq0$ and
\[
f_2(x)=\begin{cases}x & \text{for }x\in[-1,1]\\ 0 & \text{otherwise.}\end{cases}
\]
Solution.

7.2.2. Function convolution.
Integral functionals from the previous paragraph can easily be modified to obtain a "smeared averaging" of the values of a given function $f$ near a given point $y\in\mathbb{R}$:

Convolution of functions of a real variable
\[
L_y(f)=\int_{-\infty}^{\infty}f(x)\,g(y-x)\,dx
\]
The free parameter $y$ in the definition of the functional $L_y(f)$ can be perceived as a new independent variable, and our operation $L_y$ actually maps functions to functions again, $f\mapsto\tilde f$:
\[
\tilde f(y)=L_y(f)=\int_{-\infty}^{\infty}f(x)\,g(y-x)\,dx.
\]
This operation is called the convolution of the functions $f$ and $g$, denoted $f*g$. The convolution is usually defined for real or complex valued functions on $\mathbb{R}$ with compact support, or for those with decay fast enough to ensure integrability.

By the transformation $t=z-x$, we can easily calculate that
\[
(f*g)(z)=\int_{-\infty}^{\infty}f(x)\,g(z-x)\,dx=-\int_{\infty}^{-\infty}f(z-t)\,g(t)\,dt=(g*f)(z).
\]
Thus the convolution, considered as a binary operation $*:S_c\times S_c\to S_c$ on pairs of functions having compact support, is commutative.

Similarly, convolutions can be considered with integration over a finite interval; we only have to guarantee that the functions participating in them are well-defined. In particular, this can be done for periodic functions, integrating over an interval whose length equals the period.

Convolution is an extraordinarily useful tool for modeling the way in which we observe the data of an experiment, or the influence of a medium through which information is transferred. For instance, an analog audio or video signal affected by noise. The input value $f$ is the transferred information. The function $g$ is chosen so that it expresses the influence of the medium or the technical procedure used for the signal processing or the processing of any other data.

7.2.3. Gibbs phenomenon. Actually, we have already seen a very useful case of convolution. In paragraph 7.1.18, we interpreted the partial sum of the Fourier series for a function $f$ as a convolution with the Dirichlet kernel
\[
K_N(y)=\frac{1}{T}\sum_{k=-N}^{N}e^{i\omega ky}.
\]
The figure shows this convolution kernel with $N=5$ and $N=15$.

\[
(f_1*f_2)(t)=\int_{-\infty}^{\infty}f_1(x)\,f_2(t-x)\,dx=\int_{-\infty}^{\infty}\frac1x\,f_2(t-x)\,dx.
\]
Since $f_2(x)=0$ outside the interval $[-1,1]$, necessarily $-1\leq t-x\leq1$, that is, $t-1\leq x\leq t+1$. So
\[
(f_1*f_2)(t)=\int_{t-1}^{t+1}\frac1x(t-x)\,dx=t\int_{t-1}^{t+1}\frac1x\,dx-2.
\]
This last integral is improper if $t-1\leq0\leq t+1$. For $t$ outside that interval, the integration gives
\[
(f_1*f_2)(t)=t\ln\left|\frac{t+1}{t-1}\right|-2.
\]
If instead $-1<t<1$, we can, for small $\varepsilon>0$, replace $\int_{t-1}^{t+1}\frac1x\,dx$ with $\int_{\varepsilon}^{t+1}\frac1x\,dx+\int_{t-1}^{-\varepsilon}\frac1x\,dx$, which computes to $\ln|t+1|-\ln|\varepsilon|+\ln|-\varepsilon|-\ln|t-1|$. The terms in $\varepsilon$ cancel, so when we take the limit $\varepsilon\to0$, we obtain the same answer for the integral as before. Thus
\[
(f_1*f_2)(t)=t\ln\left|\frac{t+1}{t-1}\right|-2
\]
for all values of $t$ except for $t=1$ or $t=-1$. □

We calculate the convolution of two functions both of which have bounded support.

7.B.3. Determine the convolution $f_1*f_2$ where
\[
f_1(x)=\begin{cases}1-x^2 & \text{for }x\in[-1,1],\\ 0 & \text{otherwise,}\end{cases}
\qquad
f_2(x)=\begin{cases}x & \text{for }x\in[0,1],\\ 0 & \text{otherwise.}\end{cases}
\]
Solution. Since the integrand is zero when $f_1(x)=0$,
\[
f_1*f_2(t)=\int_{-\infty}^{\infty}f_1(x)\,f_2(t-x)\,dx=\int_{-1}^{1}(1-x^2)\,f_2(t-x)\,dx.
\]
But the integrand is also zero when $f_2(t-x)=0$, so we need $0\leq t-x\leq1$, i.e. $t-1\leq x\leq t$, for the integrand to be nonzero. So for a nonzero value of $f_1*f_2(t)$, we integrate over the intersection of the intervals $[t-1,t]$ and $[-1,1]$.
Notice that instead of integrals over the entire real line, we employ integration over the basic period $T$ of the periodic functions in question. This interpretation allows us to explain the Gibbs phenomenon mentioned in paragraph 7.1.9. The point is that we know well the behaviour of the Dirichlet kernel near the origin, and thus, taking into account that the function $f$ is bounded over the whole period and has all one-sided limits of values and derivatives at each point of discontinuity, the effect of the convolution must be quite local. Consequently, the convolution with the Dirichlet kernel at a point $x$ where $f$ jumps behaves in the same way as we computed explicitly for the Heaviside function at $x=0$. There the overshooting by the Fourier sums can be computed explicitly, and this explains the Gibbs effect in general. We do not provide more details here. Readers may either work them out themselves (as a truly nontrivial exercise) or look them up in the literature.

7.2.4. Integral operators. In general, integral operators can depend on any number of values and derivatives of the function in their argument. For example, considering a function $F$ depending on $k+2$ free arguments,
\[
L(f)(y)=\int F\bigl(y,x,f(x),f'(x),\dots,f^{(k)}(x)\bigr)\,dx.
\]
Convolution is one of many examples of a special class of such operators on spaces of functions:
\[
L(f)(y)=\int_a^b f(x)\,k(y,x)\,dx.
\]
The function $k(y,x)$, dependent on two variables, $k:\mathbb{R}^2\to\mathbb{R}$, is called the kernel of the integral operator $L$. The theory of integral operators is very useful and interesting. We focus only on an extraordinarily important special case, namely the Fourier transform $\mathcal{F}$, which has deep connections with Fourier series.

7.2.5. Fourier transform. Recall that a function $f(t)$, given by its converging Fourier series, equals
\[
f(t)=\sum_{n=-\infty}^{\infty}c_n\,e^{i\omega_nt},
\]

Consequently,
\[
(f_1*f_2)(t)=
\begin{cases}
0 & \text{if }t>2,\\[2pt]
\int_{t-1}^{1}(1-x^2)(t-x)\,dx=\frac{4t}{3}-t^2+\frac{t^4}{12} & 1\leq t\leq2,\\[2pt]
\int_{t-1}^{t}(1-x^2)(t-x)\,dx=-\frac{t^2}{2}+\frac14+\frac{2t}{3} & 0\leq t\leq1,\\[2pt]
\int_{-1}^{t}(1-x^2)(t-x)\,dx=-\frac{t^4}{12}+\frac{t^2}{2}+\frac14+\frac{2t}{3} & -1\leq t\leq0,\\[2pt]
0 & \text{if }t<-1.
\end{cases}
\]
□

7.B.4. Determine the convolution $f_1*f_2$ of the functions
\[
f_1=\begin{cases}1-x & \text{for }x\in[-2,1],\\ 0 & \text{otherwise,}\end{cases}
\qquad
f_2=\begin{cases}1 & \text{for }x\in[0,1],\\ 0 & \text{otherwise.}\end{cases}
\]
⃝

The next topic is the Fourier transform, which is another example of an integral operator. This time the kernel $e^{-i\omega t}$ is complex (see 7.2.5 for the terminology). Thus the values on real functions are complex functions in general, see 7.2.5. This is a basic operation in mathematics, allowing the time and frequency analysis of signals and also the transitions between local and global behaviour.

7.B.5. Fix $\Omega>0$. Recall that $\operatorname{sgn}t=1$ if $t>0$, $\operatorname{sgn}t=-1$ if $t<0$, and $\operatorname{sgn}0=0$. Find the Fourier transform $\mathcal{F}(f)$ and the inverse Fourier transform $\mathcal{F}^{-1}$ of the functions:
(a) $f(t)=\operatorname{sgn}t$ if $t\in(-\Omega,\Omega)$, and zero otherwise;
(b) $f(t)=1$ if $t\in(-\Omega,\Omega)$, and zero otherwise.

Solution. The case (a). The Fourier transform of the given function is
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(t)\,e^{-i\omega t}\,dt
=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\operatorname{sgn}t\,\bigl(\cos(\omega t)-i\sin(\omega t)\bigr)\,dt
\]
\[
=\frac{1}{\sqrt{2\pi}}\Bigl(\int_0^{\Omega}\bigl(\cos(\omega t)-i\sin(\omega t)\bigr)\,dt-\int_{-\Omega}^{0}\bigl(\cos(\omega t)-i\sin(\omega t)\bigr)\,dt\Bigr).
\]
Since cos and sin are respectively even and odd functions,

where the numbers $c_n$ are complex Fourier coefficients and $\omega_n=2\pi n/T$ with period $T$, see paragraph 7.1.7. After fixing $T$, the expression $\Delta\omega=2\pi/T$ describes the change of the frequency caused by $n$ being increased by one.
Thus it is just the discrete step by which we change the frequencies when calculating the coefficients of the Fourier series. The coefficient $1/T$ in the formula
\[
c_n=\frac1T\int_{-T/2}^{T/2}f(x)\,e^{-i\omega_nx}\,dx
\]
then equals $\Delta\omega/2\pi$, so the series for $f(t)$ can be rewritten as
\[
f(t)=\sum_{n=-\infty}^{\infty}\frac{1}{2\pi}\Bigl(\Delta\omega\int_{-T/2}^{T/2}f(x)\,e^{-i\omega_nx}\,dx\Bigr)e^{i\omega_nt}.
\]
Now imagine the values $\omega_n$ for all $n\in\mathbb{Z}$ as the chosen representatives for small intervals $[\omega_n,\omega_{n+1}]$ of length $\Delta\omega$. Then our expression in the big inner parentheses in the previous formula for $f(t)$, restricting the sum to $n\in[-N,N]$, describes the summands of the Riemann sums for the integrals
\[
\frac{1}{2\pi}\int_{-N}^{N}g(\omega)\,e^{i\omega t}\,d\omega,
\]
where $g(\omega)$ is a function which takes, at the points $\omega_n$, the values
\[
g(\omega_n)=\int_{-T/2}^{T/2}f(x)\,e^{-i\omega_nx}\,dx.
\]
Considering the limit for $N\to\infty$ leads to the improper integral of $g(\omega)\,e^{i\omega t}$. If the function $f(x)\,e^{-i\omega_nx}$ is integrable over the entire $\mathbb{R}$ (this is always the case if $f$ is piecewise continuous with a compact support), then letting $T\to\infty$, the norm $\Delta\omega$ of our subintervals in the Riemann sum decreases to zero. We obtain the integral
\[
g(\omega)=\int_{-\infty}^{\infty}f(x)\,e^{-i\omega x}\,dx.
\]
The previous reasoning indicates that there should be a large set of functions $f$ on $\mathbb{R}$ (e.g. all those with $|f|$ integrable over $\mathbb{R}$) for which we can define a pair of mutually inverse integral operators:

\[
\mathcal{F}(f)(\omega)=\frac{2}{\sqrt{2\pi}}\int_0^{\Omega}-i\sin(\omega t)\,dt
=\frac{2i}{\sqrt{2\pi}}\Bigl[\frac{\cos(\omega t)}{\omega}\Bigr]_0^{\Omega}
=i\sqrt{\frac2\pi}\,\frac{\cos(\omega\Omega)-1}{\omega}.
\]
The inverse Fourier transform is given by almost the same integral, with the kernel $e^{i\omega x}$ instead of $e^{-i\omega x}$. The integration is in the frequency domain with variable $\omega$. Thus, the only difference in the result is the sign:
\[
\mathcal{F}^{-1}(f)(t)=\frac{2}{\sqrt{2\pi}}\int_0^{\Omega}i\sin(\omega t)\,d\omega
=\frac{-2i}{\sqrt{2\pi}}\Bigl[\frac{\cos(\omega t)}{t}\Bigr]_0^{\Omega}
=i\sqrt{\frac2\pi}\,\frac{1-\cos(t\Omega)}{t}.
\]
Case (b) is computed similarly:
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\bigl(\cos(\omega t)-i\sin(\omega t)\bigr)\,dt
=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\cos(\omega t)\,dt
=\frac{1}{\sqrt{2\pi}}\Bigl[\frac{\sin(\omega t)}{\omega}\Bigr]_{-\Omega}^{\Omega}
=\sqrt{\frac2\pi}\,\frac{\sin(\omega\Omega)}{\omega}.
\]
The latter expression is often written by means of the function $\operatorname{sinc}(t)=\sin(t)/t$:
\[
\mathcal{F}(f)(\omega)=\frac{2\Omega}{\sqrt{2\pi}}\operatorname{sinc}(\omega\Omega).
\]
Here, the inverse Fourier transform has exactly the same result, because the sign change in the kernel does not affect the real part. Thus we only need to interchange the time and frequency variables:
\[
\mathcal{F}^{-1}(f)(t)=\frac{2\Omega}{\sqrt{2\pi}}\operatorname{sinc}(t\Omega). \qquad\square
\]
The results are: the first two diagrams below show the imaginary values of the Fourier image of the signum function from 7.B.5(a) with $\Omega=20$ and $\Omega=50$. The next two diagrams do the same for the characteristic function of the interval $|x|<\Omega$

Fourier transform

For every real or complex valued function $f$ on $\mathbb{R}$ for which the following integrals exist (e.g. all those piecewise continuous with compact support or fast decay), we define
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(t)\,e^{-i\omega t}\,dt,\qquad
\mathcal{F}^{-1}(f)(t)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(\omega)\,e^{i\omega t}\,d\omega.
\]
The function $\tilde f(\omega)=\mathcal{F}(f)(\omega)$ is called the Fourier transform of the function $f$.

The previous ideas indicate that $\mathcal{F}^{-1}(\mathcal{F}(f))(t)=f(t)$, whenever both the improper (Riemann) integrals exist. This says that the Fourier transform $\mathcal{F}$ has an inverse operation $\mathcal{F}^{-1}$, which is called the inverse Fourier transform. Notice that the Fourier transform and its inverse are integral operators with almost identical kernels $k(\omega,t)=e^{\pm i\omega t}$.

The right choice of the function space (and integral type) ensuring the existence of the Fourier transform and its inverse is a very subtle question. This is the main reason why we do not formulate theorems with formal proofs and just touch the problems based on examples here. We hope this will motivate the readers to find more detailed answers in specialized literature.
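A numerical sanity check of this transform pair (a sketch; the quadrature grids and the Gaussian test function are our own choices, not from the text):

```python
# Quadrature sketch: F followed by F^{-1} reproduces a fast-decaying f.
import numpy as np

t = np.linspace(-10, 10, 20001)     # "time" grid
w = np.linspace(-10, 10, 2001)      # frequency grid
f = np.exp(-t**2 / 2)               # a convenient test function

# F(f)(w) = (1/sqrt(2 pi)) * integral of f(t) e^{-i w t} dt
F = np.array([np.trapz(f * np.exp(-1j * wi * t), t) for wi in w]) / np.sqrt(2 * np.pi)

# back: f(t0) = (1/sqrt(2 pi)) * integral of F(w) e^{i w t0} dw
for t0 in (0.0, 0.5, 1.5):
    val = np.trapz(F * np.exp(1j * w * t0), w) / np.sqrt(2 * np.pi)
    print(t0, abs(val - np.exp(-t0**2 / 2)))   # tiny residuals
```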
7.2.6. Simple properties. The Fourier transform changes the local and global behaviour of functions in an interesting way. We begin with a simple example in which there is a function $f(t)$ which is transformed to the indicator function of the interval $[-\Omega,\Omega]$, i.e., $\tilde f(\omega)=0$ for $|\omega|>\Omega$, and $\tilde f(\omega)=1$ for $|\omega|\leq\Omega$. The inverse transform $\mathcal{F}^{-1}$ gives
\[
f(t)=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}e^{i\omega t}\,d\omega
=\frac{1}{\sqrt{2\pi}}\Bigl[\frac{1}{it}e^{i\omega t}\Bigr]_{-\Omega}^{\Omega}
=\frac{2}{\sqrt{2\pi}\,t}\,\frac{1}{2i}\bigl(e^{i\Omega t}-e^{-i\Omega t}\bigr)
=\frac{2\Omega}{\sqrt{2\pi}}\,\frac{\sin(\Omega t)}{\Omega t}.
\]
Thus, except for a multiplicative constant and the scaling of the input variable, it is the very important function
\[
f(t)=\operatorname{sinc}(t)=\frac{\sin t}{t}.
\]
This function is not that easy to handle. Our construction of the Fourier transform suggests that $\mathcal{F}(f)$ should be the indicator function of the interval $[-\Omega,\Omega]$. We verify this by direct computation in 7.B.5. Now, just notice that the sinc function is not integrable in absolute value over $\mathbb{R}$. It rather behaves as an alternating series of numbers. But the local behavior of $f$ is easy to see. Calculation of the limit at zero, by l'Hospital's rule or otherwise, gives $f(0)=2\Omega(2\pi)^{-1/2}$. The closest zero points are at

from 7.B.5(b). The longer the interval with the constant values is, the more the image is concentrated around the origin.

We can always use directly the simpler version of the transform for the odd and even functions. If the argument $f$ is odd, then only the sine part of the formula contributes and its Fourier transform is
\[
\mathcal{F}(f)(\omega)=\frac{-2i}{\sqrt{2\pi}}\int_0^{\infty}f(t)\sin(\omega t)\,dt.
\]
Similarly, for even functions $f$,
\[
\mathcal{F}(f)(\omega)=\frac{2}{\sqrt{2\pi}}\int_0^{\infty}f(t)\cos(\omega t)\,dt.
\]
In particular, the odd functions have purely imaginary images, while the images of the even functions are real. More generally, every real function $f$ decomposes into its odd and even parts, $f=f_{\mathrm{even}}+f_{\mathrm{odd}}$, and the real and imaginary components of the Fourier image $\tilde f$ are the images of these two parts, respectively.

7.B.6. Discover how the Fourier transform and its inverse behave under the translation $\tau_a$ in the variable, $\tau_af(x)=f(x+a)$, and the phase shift $\varphi_a$ defined as $\varphi_af(x)=e^{iax}f(x)$, always with $a\in\mathbb{R}$.

Solution. Evaluate the compositions $\mathcal{F}\circ\tau_a$ and $\mathcal{F}\circ\varphi_a$. This is easy:
\[
\mathcal{F}\circ\tau_af(\omega)=\int_{-\infty}^{\infty}f(t+a)\,e^{-i\omega t}\,dt
=\int_{-\infty}^{\infty}f(x)\,e^{-i\omega(x-a)}\,dx=e^{ia\omega}\,\mathcal{F}f(\omega),
\]
\[
\mathcal{F}\circ\varphi_af(\omega)=\int_{-\infty}^{\infty}f(t)\,e^{iat}\,e^{-i\omega t}\,dt
=\int_{-\infty}^{\infty}f(t)\,e^{-i(\omega-a)t}\,dt=\mathcal{F}f(\omega-a).
\]
This proves the formulae
\[
\mathcal{F}\circ\tau_a=\varphi_a\circ\mathcal{F},\qquad \mathcal{F}\circ\varphi_a=\tau_{-a}\circ\mathcal{F}.
\]
Similarly, $\mathcal{F}^{-1}\circ\tau_a=\varphi_{-a}\circ\mathcal{F}^{-1}$ and $\mathcal{F}^{-1}\circ\varphi_a=\tau_a\circ\mathcal{F}^{-1}$. □

The next problem displays the behaviour of the Fourier transform on the Gaussian function. This is a rare example where the time and frequency forms are very similar. Again, we see the feature of exchanging the local and global properties in the time and frequency domains.

7.B.7. Compute the Fourier transform $\mathcal{F}(f)$ of the function $f(t)=e^{-at^2}$, $t\in\mathbb{R}$, where $a>0$ is a fixed parameter.

Solution. The task is to calculate

$t=\pm\pi/\Omega$, and the function drops in value to zero quite rapidly away from the origin $x=0$. This function is shown in the diagram by a wavy curve for $\Omega=20$. Simultaneously, the area where our function $f(t)$ keeps waving more rapidly as $\Omega$ increases is also depicted by a curve. [Figure: the function $f$ for $\Omega=20$.]

The Fourier transform of the indicator function of the interval $[-\Omega,\Omega]$ is also the above function $f$, see the explicit computation in 7.B.5 (the even function sinc does not see the change of sign in the kernel of the transform).
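As a brief aside, the shift formulae of 7.B.6 above can be spot-checked numerically; the test function, the shift $a$, and the grid below are arbitrary choices, not from the text:

```python
# Numeric check of the shift rule from 7.B.6: F(tau_a f) = phi_a F(f).
import numpy as np

t = np.linspace(-15, 15, 30001)
f = lambda x: np.exp(-x**2)          # a convenient test function
a = 0.7

def fourier(g, w):                   # F(g)(w) with the book's normalization
    return np.trapz(g(t) * np.exp(-1j * w * t), t) / np.sqrt(2 * np.pi)

for w in (0.3, 1.1, 2.0):
    lhs = fourier(lambda x: f(x + a), w)          # F(tau_a f)(w)
    rhs = np.exp(1j * a * w) * fourier(f, w)      # (phi_a F f)(w)
    print(w, abs(lhs - rhs))                      # ~1e-12
```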
Clearly, this function $f$ takes significant positive values near zero, and the value taken at zero is a fixed multiple of $\Omega$. Therefore, as $\Omega$ increases, the function $f$ concentrates more and more near the origin.

Next, we derive the Fourier transform of the derivative $f'(t)$ of a function $f$. We continue to suppose that all the integration makes sense (e.g. $f$ has compact support), so that both $\mathcal{F}(f')$ and $\mathcal{F}(f)$ exist. By integration by parts,
\[
\mathcal{F}(f')(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f'(t)\,e^{-i\omega t}\,dt
=\frac{1}{\sqrt{2\pi}}\bigl[e^{-i\omega t}f(t)\bigr]_{-\infty}^{\infty}
+\frac{i\omega}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(t)\,e^{-i\omega t}\,dt
=i\omega\,\mathcal{F}(f)(\omega).
\]
Thus the Fourier transform converts the (limit) operation of differentiation into the (algebraic) operation of multiplication by the variable. Of course, this procedure can be iterated, to obtain
\[
\mathcal{F}(f'')(\omega)=-\omega^2\mathcal{F}(f),\ \dots,\ \mathcal{F}(f^{(n)})=i^n\omega^n\mathcal{F}(f).
\]

7.2.7. The relation to convolutions. There is another extremely important property to consider, namely the relation between convolutions and Fourier transforms. We calculate the Fourier transform of the convolution $h=f*g$, where we again assume that all the integrals exist (e.g. assuming the functions to be piecewise continuous with compact supports). Recall that we may change the order of integration, see 6.3.12. Then we change the variable by the substitution $t-x=u$.

\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-at^2}e^{-i\omega t}\,dt.
\]
A standard trick is to transform the problem into one of solving a (simple) differential equation. Hence, differentiating (with respect to $\omega$) and then integrating by parts gives
\[
\bigl(\mathcal{F}(f)(\omega)\bigr)'=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}-it\,e^{-at^2}e^{-i\omega t}\,dt
=\frac{1}{\sqrt{2\pi}}\Bigl(\frac{i}{2a}\lim_{t\to\infty}e^{-at^2-i\omega t}-\frac{i}{2a}\lim_{t\to-\infty}e^{-at^2-i\omega t}-\int_{-\infty}^{\infty}\frac{\omega}{2a}\,e^{-at^2}e^{-i\omega t}\,dt\Bigr)
\]
\[
=-\frac{\omega}{2a}\Bigl(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-at^2}e^{-i\omega t}\,dt\Bigr)
=-\frac{\omega}{2a}\,\mathcal{F}(f)(\omega),
\]
since both limits vanish. Therefore $y(\omega)=\mathcal{F}(f)(\omega)$ satisfies the differential equation
\[
\frac{dy}{d\omega}=-\frac{\omega}{2a}\,y,\quad\text{i.e.}\quad\frac1y\,dy=-\frac{\omega}{2a}\,d\omega,
\]
unless $y$ equals zero ($y\equiv0$ is a solution of the equation). Integration yields $\ln|y|=-\frac{\omega^2}{4a}+C$, i.e. $y=K\,e^{-\frac{\omega^2}{4a}}$, where $C$ and $K$ are constants. All solutions (including the zero solution) of the differential equation are given by the function $y(\omega)=K\,e^{-\frac{\omega^2}{4a}}$, $K\in\mathbb{R}$. To find $K$, begin with the well known fact (proved in ??)
\[
\int_{-\infty}^{\infty}e^{-x^2}\,dx=\sqrt\pi,
\]
to obtain
\[
\int_{-\infty}^{\infty}e^{-at^2}\,dt=\frac{1}{\sqrt a}\int_{-\infty}^{\infty}e^{-x^2}\,dx=\frac{\sqrt\pi}{\sqrt a}.
\]
Therefore $\mathcal{F}(f)(0)=\frac{1}{\sqrt{2\pi}}\frac{\sqrt\pi}{\sqrt a}=\frac{1}{\sqrt{2a}}$, while $\mathcal{F}(f)(0)=K\,e^0=K$. So $K=\frac{1}{\sqrt{2a}}$ and
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2a}}\,e^{-\frac{\omega^2}{4a}}. \qquad\square
\]

7.B.8. Determine the Fourier transform image of the Gaussian function
\[
f(t)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(t-\mu)^2}{2\sigma^2}}.
\]
Solution. Use the result of the previous problem with $a=\frac{1}{2\sigma^2}$ and the composition with the variable shift $\tau_a$ from the last but one problem. It follows that
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\,e^{-i\mu\omega}\,e^{-\frac{\sigma^2\omega^2}{2}}. \qquad\square
\]

The result is
\[
\mathcal{F}(h)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\Bigl(\int_{-\infty}^{\infty}f(x)\,g(t-x)\,dx\Bigr)e^{-i\omega t}\,dt
=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(x)\Bigl(\int_{-\infty}^{\infty}g(t-x)\,e^{-i\omega t}\,dt\Bigr)dx
\]
\[
=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(x)\Bigl(\int_{-\infty}^{\infty}g(u)\,e^{-i\omega(u+x)}\,du\Bigr)dx
=\frac{1}{\sqrt{2\pi}}\Bigl(\int_{-\infty}^{\infty}f(x)\,e^{-i\omega x}\,dx\Bigr)\cdot\Bigl(\int_{-\infty}^{\infty}g(u)\,e^{-i\omega u}\,du\Bigr)
=\sqrt{2\pi}\,\mathcal{F}(f)\cdot\mathcal{F}(g).
\]
A similar calculation shows that the Fourier transform of a product is the convolution of the transforms, up to a multiplicative constant. In fact,
\[
\mathcal{F}(f\cdot g)=\frac{1}{\sqrt{2\pi}}\,\mathcal{F}(f)*\mathcal{F}(g).
\]
As we mentioned above, the convolution $f*g$ often models the process of the observation of some quantity $f$. Using the Fourier transform and its inverse, the original values of this quantity are easily recognised if the convolution kernel $g$ is known. We just calculate $\mathcal{F}(f*g)$ and divide it by the image $\mathcal{F}(g)$.
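A quick grid check of the convolution formula just derived (a sketch with arbitrarily chosen compactly supported $f$ and $g$; the small residuals are discretization error):

```python
# Sketch: check F(f*g) = sqrt(2 pi) F(f) F(g) by direct quadrature.
import numpy as np

dx = 0.01
x = np.arange(-10, 10, dx)
f = np.where(np.abs(x) < 1, 1.0 - x**2, 0.0)
g = np.where((x >= 0) & (x <= 1), x, 0.0)

conv = np.convolve(f, g) * dx                  # full discrete convolution
xc = 2 * x[0] + dx * np.arange(conv.size)      # grid of the full convolution

def ft(h, grid, w):                            # book's normalization
    return np.trapz(h * np.exp(-1j * w * grid), grid) / np.sqrt(2 * np.pi)

for w0 in (0.5, 1.5, 3.0):
    lhs = ft(conv, xc, w0)
    rhs = np.sqrt(2 * np.pi) * ft(f, x, w0) * ft(g, x, w0)
    print(w0, abs(lhs - rhs))                  # small grid error
```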
This yields the Fourier transform of the original function $f$, which can then be obtained explicitly using the inverse Fourier transform. This is sometimes called deconvolution. In real applications, the procedure often cannot be that straightforward, since the Fourier image of the known convolution kernel might have zero values, and then we can hardly divide by it as above. For example, take the convolution kernel $\operatorname{sinc}(t)$, whose image is an indicator function of some finite interval. So we need some more cunning techniques, and there is a vast literature on them.

7.2.8. The $L_2$-norm. Now we are in a position to verify that the Fourier transform is actually an isometry with respect to the $L_2$-norm. The Fourier transform exists for all functions in $L_1$, i.e., with integrable absolute values. A simple observation reveals that $\mathcal{F}:L_1\to L_\infty$ satisfies $\|\mathcal{F}(f)\|_\infty\leq\|f\|_1$. Indeed, $|e^{-i\omega x}|=1$, and thus
\[
|\mathcal{F}(f)(\omega)|\leq\int_{-\infty}^{\infty}|f(x)|\,dx\quad\text{for all }\omega.
\]
Now, assume $f,g$ are two functions in $L_2$ and write $\hat g$ for the function $\hat g(t)=g(-t)$. Notice
\[
(f*\hat g)(t)=\int_{-\infty}^{\infty}f(x)\,\hat g(t-x)\,dx=\int_{-\infty}^{\infty}f(x)\,g(x-t)\,dx.
\]
In particular, the scalar product is given by the formula $\langle f,g\rangle=(f*\hat g)(0)$.

As mentioned, the most typical use of the Fourier transform is to analyse the frequencies in a signal. The next problem reveals the reason. For technical reasons, we cut the signal off by multiplication with the characteristic function $h_\Omega$ of the interval $(-\Omega,\Omega)$.

7.B.9. Find the Fourier transform of the functions $f(t)=h_\Omega(t)\cos(nt)$ and $g(t)=h_\Omega(t)\sin(nt)$.

Solution. By definition,
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\cos(nt)\,e^{-i\omega t}\,dt
=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\tfrac12\bigl(e^{int}+e^{-int}\bigr)e^{-i\omega t}\,dt
\]
\[
=\frac{1}{2\sqrt{2\pi}}\Bigl[\frac{1}{i(n-\omega)}e^{i(n-\omega)t}\Bigr]_{-\Omega}^{\Omega}
+\frac{1}{2\sqrt{2\pi}}\Bigl[\frac{-1}{i(n+\omega)}e^{-i(\omega+n)t}\Bigr]_{-\Omega}^{\Omega}
=\frac{\Omega}{\sqrt{2\pi}}\bigl(\operatorname{sinc}((n-\omega)\Omega)+\operatorname{sinc}((n+\omega)\Omega)\bigr).
\]
The same computation leads to the image of the sine signal; the only difference is one minus sign and an additional $i$ in the formula:
\[
\mathcal{F}(g)(\omega)=i\,\frac{\Omega}{\sqrt{2\pi}}\bigl(-\operatorname{sinc}((n-\omega)\Omega)+\operatorname{sinc}((n+\omega)\Omega)\bigr). \qquad\square
\]

7.B.10. Find the Fourier transform of the superposition of cosine signals over the interval $(-\Omega,\Omega)$,
\[
f(t)=h_\Omega(t)\bigl(3\cos(5t)+\cos(15t)\bigr).
\]
What happens if $\Omega\to\infty$?

Solution. The Fourier transform is linear over scalars, so we simply add the corresponding images from the previous problem with $n=5$ and $n=15$, multiplied by the proper coefficients. The illustration of the image with $\Omega=20$ is shown in the figure. Each of the peaks behaves like the Fourier image of the characteristic function $h_\Omega$, shifted to the corresponding frequencies.

Further, the definition of the Fourier transform yields
\[
\langle f,g\rangle=\bigl(\mathcal{F}^{-1}\mathcal{F}(f*\hat g)\bigr)(0)
=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathcal{F}(f*\hat g)\,e^{ix\omega}\,d\omega\Big|_{x=0}
=\int_{-\infty}^{\infty}\mathcal{F}f(\omega)\,\mathcal{F}\hat g(\omega)\,d\omega,
\]
while
\[
\mathcal{F}(\hat g)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}g(-x)\,e^{-i\omega x}\,dx=\overline{\mathcal{F}(g)(\omega)}.
\]
Consequently, $\langle f,g\rangle=\langle\mathcal{F}(f),\mathcal{F}(g)\rangle$. Thus, we have verified that the Fourier transform preserves the scalar product, and so it also preserves the $L_2$-norm. This also explains our choice of the constants in the definition.

7.2.9. Dirac delta-function. We return to the first example of the inverse transform of the indicator function $f_\Omega$ of the interval $[-\Omega,\Omega]$. Let $\Omega$ approach infinity and denote by $\sqrt{2\pi}\,\delta(t)$ the desired "limit function" for $\mathcal{F}^{-1}(f_\Omega)(t)$. The inverse image of a product with an arbitrary image $\mathcal{F}(g)$ can be expressed using convolution:
\[
\mathcal{F}^{-1}\bigl(f_\Omega\cdot\mathcal{F}(g)\bigr)(z)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}g(t)\,\mathcal{F}^{-1}(f_\Omega)(z-t)\,dt.
\]
As $\Omega$ increases to $\infty$, the left-hand expression should approach $\mathcal{F}^{-1}(\mathcal{F}(g))(z)=g(z)$, while on the right-hand side, we get
\[
g(z)=\int_{-\infty}^{\infty}g(t)\,\delta(z-t)\,dt.
\]
The desired $\delta(t)$ thus looks like a "function" which is zero everywhere except at the single point $t=0$, where it "has an infinite value". Integrating the product of $\delta(t)$ with any integrable function $g$ gives just the value of $g$ at the point $t=0$. Of course, this is strictly speaking not a function at all. Nevertheless, it is a useful concept. It is called the Dirac function $\delta$, and it can be described correctly as an example of what is known as a distribution. Since we do not have enough space and time, we do not pay further attention to distributions. We mention only that the Dirac $\delta$ can be imagined as a unit impulse at a single point. In fact, we saw similar concepts under the name "measure" when dealing with the Riemann-Stieltjes and Henstock-Kurzweil integrals, cf. 6.3.18, 6.3.14, and we shall come back to them in Chapter 10 in the context of probability. In this sense, the Dirac function is the (probability) measure concentrated in the origin, and it can be realized by the Riemann-Stieltjes integral with the piecewise constant function $g$ with a single unit jump at the origin. Its Fourier transform is the constant function $\mathcal{F}(\delta)(\omega)=\frac{1}{\sqrt{2\pi}}$. On the other hand, many functions which are not strictly integrable on $\mathbb{R}$ are Fourier-transformed to expressions with

If $\Omega$ increases to infinity, the image $\tilde f$ has four peaks at the same positions, corresponding to the frequencies $\pm5$ and $\pm15$, but they become narrower and sharper. In the limit, this is no longer a function, since the width of the peaks becomes zero. This is usually written
\[
\mathcal{F}(\cos(nt))(\omega)=\sqrt{\tfrac\pi2}\bigl(\delta(n-\omega)+\delta(n+\omega)\bigr),
\]
with the special case $\mathcal{F}(1)(\omega)=\sqrt{2\pi}\,\delta(\omega)$. See 7.2.9 for comments on the Dirac delta function. □

7.B.11. Find the Fourier transform image of the convolution of the signals $f(t)$ and $g(t)$ from Problem 7.B.1. Recall that $f(t)=\sin(t)+0.4[\sin(6t)]^2-0.2\sin(60t)$ and $g$ is the characteristic function of the interval $(-\varepsilon,\varepsilon)$. Assume that the signal is nonzero only in the interval $(-\Omega,\Omega)$.

Solution. Once we note that $\mathcal{F}(f*g)=\sqrt{2\pi}\,\mathcal{F}(f)\mathcal{F}(g)$, we have all the ingredients ready. Indeed, in 7.B.5 and in the last two problems above, we already computed the Fourier image of $g$ and of the sine and cosine functions on the interval $(-\Omega,\Omega)$. Instead of writing the explicit formulae for the result, we display illustrations of the real components of $\mathcal{F}(f)$ and $\mathcal{F}(f*g)$ in the first line, and similarly the imaginary components in the second line, all with $\Omega=5$, $\varepsilon=1/10$. The reader should compare the diagrams of $f$ and $f*g$ in 7.B.1 to see that the higher frequencies in $f$ are effectively canceled by this convolution, as expected. □

the Dirac $\delta$. For instance,
\[
\mathcal{F}(\cos(nt))(\omega)=\sqrt{\tfrac\pi2}\bigl(\delta(n-\omega)+\delta(n+\omega)\bigr),
\]
which can be seen from the calculation of the Fourier transform of the function $f_\Omega\cos(nx)$ and then letting $\Omega$ approach $\infty$, see the solution to problem 7.B.10. We can obtain the Fourier transform of the sine function in a similar way. We can take advantage of the fact that the transform of the derivative of this function differs only by a multiple of the imaginary unit and the new variable. Alternatively, we can also use the fact that the sine function is obtained from the cosine function by the phase shift of $\pi/2$. These transforms are a basis for the Fourier analysis of signals (see also problem 7.B.9): if a signal is a pure sinusoid of a given frequency, then this is recognized in the Fourier transform as two single-point impulses exactly at the positive and negative value of the frequency.
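This frequency analysis is easy to reproduce with a discrete transform (a sketch using NumPy's FFT; the sampling parameters are arbitrary choices, not from the text):

```python
# Sketch: the spectrum of 3 cos(5t) + cos(15t) sampled on (-Omega, Omega)
# has its dominant bins at the angular frequencies +-5 and +-15.
import numpy as np

Omega = 20.0
n = 2**14
t = np.linspace(-Omega, Omega, n, endpoint=False)
f = 3 * np.cos(5 * t) + np.cos(15 * t)

F = np.fft.fftshift(np.fft.fft(f))
w = np.fft.fftshift(np.fft.fftfreq(n, d=t[1] - t[0])) * 2 * np.pi  # angular

for lo, hi in ((2, 8), (12, 18)):            # search two positive bands
    band = (w > lo) & (w < hi)
    print(w[band][np.argmax(np.abs(F[band]))])   # ~5 and ~15
```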
If the signal is a linear combination of several such pure signals, then we obtain the same linear combination of single-point impulses. However, since we always process a signal in a finite time interval only, we get not single-point impulses, but rather a wavy curve similar to the function sinc, with a strong maximum at the value of the corresponding frequency. The size of this maximum also yields information about the amplitude of the original signal.

Another good way to approximate the Dirac delta function is to exploit the Gaussian functions. As seen in the solution to problem 7.B.7, the Fourier image of the Gaussian function $f(t)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{t^2}{2\sigma^2}}$ is again a Gaussian, corresponding to the reciprocal value of $\sigma$. In the limit $\sigma\to0$, the image converges fast to a multiple of the constant function, see the illustrations in the picture, with $\sigma=3$ and $\sigma=1/10$. Notice that the rather large $\sigma$ in the first illustration corresponds to a wide Gaussian, while its image is the slim one. The second illustration provides the opposite case: the preimage is the narrow Gaussian and the image is already reasonably close to the constant function. The Gaussians are chosen with $L_1$-norm equal to one, but the Fourier transform preserves the $L_2$-norm of the functions.

7.2.10. Fourier sine and cosine transform. If we apply the Fourier transform to an odd function $f(t)$, where $f(-t)=$

As discussed in 7.2.5, the Fourier transform has an inverse operation. This means that no information is lost when changing from the time behaviour of a signal to its frequency behaviour. This allows us to use the Fourier transform for the solution of functional equations involving differentiation or integration. We stay with elementary observations only and return to differential equations in one and more variables in the following chapters.

7.B.12. By using the inverse Fourier transform, solve the integral equation
\[
\int_0^\infty f(t)\sin(xt)\,dt=e^{-x},\qquad x>0,
\]
for an unknown function $f$.

Solution. Multiply both sides of the equation by $\sqrt{2/\pi}$ to obtain the sine Fourier transform on the left-hand side. Apply the inverse transform to both sides of the equation to get
\[
f(t)=\frac2\pi\int_0^\infty e^{-x}\sin(xt)\,dx,\qquad t>0.
\]
Integrating by parts twice shows that
\[
\int e^{-x}\sin(xt)\,dx=\frac{e^{-x}}{1+t^2}\bigl(-\sin(xt)-t\cos(xt)\bigr)+C.
\]
Hence
\[
\int_0^\infty e^{-x}\sin(xt)\,dx=\lim_{x\to\infty}\Bigl(\frac{e^{-x}}{1+t^2}\bigl(-\sin(xt)-t\cos(xt)\bigr)\Bigr)-\frac{e^0(-t)}{1+t^2}=\frac{t}{1+t^2}.
\]
So $f(t)=\frac2\pi\,\frac{t}{1+t^2}$, $t>0$. □

7.B.13. Use the Fourier transform to find solutions of the non-homogeneous linear differential equation
(1) $y'=ay+f$,
where $a\in\mathbb{R}$ is a non-zero constant and $f$ is a known function. Can all solutions be obtained in this way?

Solution. The key observation for this problem is the relation between the Fourier transform and the derivative, $\mathcal{F}(f')(\omega)=i\omega\mathcal{F}(f)(\omega)$, see 7.2.6. Thus, if the Fourier transform is applied to the equation (1), we get the algebraic equation for $\tilde y=\mathcal{F}(y)$:
\[
i\omega\tilde y=a\tilde y+\tilde f.
\]
If it is assumed that $\mathcal{F}(f)=\tilde f$ exists and there is a solution $y$ with the Fourier image $\tilde y$, then
\[
\tilde y=\frac{1}{i\omega-a}\,\tilde f,
\]

$-f(t)$, the contribution in the integration of the product of $f(t)$ and the function $\cos(\pm\omega t)$ cancels for positive and negative values of $t$. Thus if $f$ is odd, then
\[
\mathcal{F}(f)(\omega)=\frac{-2i}{\sqrt{2\pi}}\int_0^\infty f(t)\sin(\omega t)\,dt.
\]
The resulting function is odd again; hence, for the same reason, the inverse transform can be determined similarly:
\[
\mathcal{F}^{-1}(f)(\omega)=\frac{2i}{\sqrt{2\pi}}\int_0^\infty f(t)\sin(\omega t)\,dt.
\]
Omitting the imaginary unit $i$ (notice, this means we have to multiply the Fourier transform and its inverse by $i$ and $-i$, respectively) gives mutually inverse transforms, which are called the Fourier sine transform for odd functions:
\[
\tilde f_s(\omega)=\sqrt{\tfrac2\pi}\int_0^\infty f(t)\sin(\omega t)\,dt,\qquad
f(t)=\sqrt{\tfrac2\pi}\int_0^\infty\tilde f_s(\omega)\sin(\omega t)\,d\omega.
\]
Similarly, we can define the Fourier cosine transform for even functions:
\[
\tilde f_c(\omega)=\sqrt{\tfrac2\pi}\int_0^\infty f(t)\cos(\omega t)\,dt,\qquad
f(t)=\sqrt{\tfrac2\pi}\int_0^\infty\tilde f_c(\omega)\cos(\omega t)\,d\omega.
\]

7.2.11. Laplace transforms. The Fourier transform can mainly be applied to functions which are integrable in absolute value over $\mathbb{R}$. The Laplace transform is similar to the Fourier transform and applies to all functions whose growth is not too fast:
\[
\mathcal{L}(f)(s)=\int_0^\infty f(t)\,e^{-st}\,dt.
\]
The integral operator $\mathcal{L}$ has a rapidly decreasing kernel if $s$ is a positive real number. Therefore, the Laplace transform is usually perceived as a mapping of suitable functions on the interval $[0,\infty)$ to functions on the same or a shorter interval. The image $\mathcal{L}(p)$ exists, for example, for every polynomial $p(t)$ and all positive numbers $s$. Analogously to the Fourier transform, we obtain the formula for the Laplace transform of a differentiated function for $s>0$ by using integration by parts:
\[
\mathcal{L}(f'(t))(s)=\int_0^\infty f'(t)\,e^{-st}\,dt=\bigl[f(t)\,e^{-st}\bigr]_0^\infty+s\int_0^\infty f(t)\,e^{-st}\,dt=-f(0)+s\mathcal{L}(f)(s).
\]
The properties of the Laplace transform and many other transforms used in technical practice can be found in specialized literature. We provide a few examples in the other column, starting with 7.B.17.

and using the general relation $\mathcal{F}^{-1}(g\cdot h)=\frac{1}{\sqrt{2\pi}}\,\mathcal{F}^{-1}(g)*\mathcal{F}^{-1}(h)$ between products and convolutions from 7.2.7, we arrive at the final formula
\[
y=\frac{1}{\sqrt{2\pi}}\,\mathcal{F}^{-1}\Bigl(\frac{1}{i\omega-a}\Bigr)*f.
\]
So it is necessary to compute the inverse Fourier transform of the simple rational function $(i\omega-a)^{-1}$. Guess the solution in two steps. Assume first $a>0$ and evaluate
\[
\int_{-\infty}^0 e^{at}\,e^{-i\omega t}\,dt=\Bigl[\frac{1}{a-i\omega}e^{(a-i\omega)t}\Bigr]_{-\infty}^0=\frac{1}{a-i\omega}.
\]
Similarly, for $a<0$,
\[
\int_0^\infty e^{at}\,e^{-i\omega t}\,dt=\Bigl[\frac{1}{a-i\omega}e^{(a-i\omega)t}\Bigr]_0^\infty=\frac{1}{i\omega-a}.
\]
This provides the two desired results. Indeed, if the equation (1) comes with $a>0$, we rewrite our rational function as $-(a-i\omega)^{-1}$. Then the function $-\sqrt{2\pi}\,e^{at}$ for negative $t$ (and zero for positive $t$) provides the requested Fourier image. It is immediately seen that the convolution
\[
y(t)=-\int_{-\infty}^0 e^{ax}f(t-x)\,dx
\]
is a solution. (The multiples $\sqrt{2\pi}$ in the expression with the convolution cancel.) Similarly, if $a<0$, then
\[
y(t)=\int_0^\infty e^{ax}f(t-x)\,dx
\]
is a solution.

Not all solutions can be obtained in this way. For example, $y'=y$ leads to $y(t)=C\,e^t$ with an arbitrary constant $C$, but this is not a function with a Fourier image. With $f(t)=0$, our procedure produces the zero function, which is just one of the solutions. Similarly, if we deal with the equation $y'=y+t$, then the particular solution suggested above is
\[
y(t)=-\int_{-\infty}^0 e^x(t-x)\,dx=-t-1. \qquad\square
\]

7.B.14. Check directly that the two functions $y(t)$ found above are indeed solutions to the equation $y'=ay+f$. ⃝

7.B.15. As in the previous problem, solve the second order equation $y''=ay+f$.

Solution. Use the fact that $\mathcal{F}(y'')(\omega)=-\omega^2\mathcal{F}(y)(\omega)$ and deduce the algebraic relation $-\omega^2\tilde y=a\tilde y+\tilde f$ for the Fourier images $\tilde y$ and $\tilde f$. Hence
\[
\tilde y=\frac{-1}{\omega^2+a}\,\tilde f.
\]

7.2.12. Discrete transforms. The Fourier analysis of signals mentioned in the previous paragraph is realized by special analog circuits in, for example, radio technology. Nowadays, we work only with discrete data when processing signals by computer circuits.
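Before the formal treatment in the next paragraph, here is a minimal sketch of the discrete transform pair that will be defined there (Python, not from the text); it verifies the mutual inversion proved below:

```python
# Sketch of the discrete transform pair from 7.2.12: the forward map
# divides by N, the backward map does not, and their composition is the
# identity (cf. the theorem below).
import numpy as np

N = 16
rng = np.random.default_rng(1)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)

r = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(r, r) / N)   # kernel e^{-i 2pi kr/N}

f_tilde = (W @ f) / N                          # forward transform
f_back = W.conj() @ f_tilde                    # backward transform (W is symmetric)

print(np.max(np.abs(f_back - f)))              # ~1e-15
```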
Assume that there is a fixed (small) sampling interval $\tau$ given in a (discrete) time variable and that, for a large natural number $N$, the signal repeats with period $N\tau$, which is the maximal period that can be represented in our discrete model. We should not be surprised that our continuous models allow for a discrete analogy. Consider an $N$-dimensional vector, which can be imagined as the function $r\mapsto f(r)\in\mathbb{C}$, for $r=0,1,\dots,N-1$. Denote $\Delta\omega=\frac{2\pi}{N}$ and $\omega_k=k\Delta\omega$. The simplest discrete approximation of the Fourier transform integral suggests that
\[
\tilde f(k)=\frac1N\sum_{r=0}^{N-1}f(r)\,e^{-i\frac{2\pi}{N}kr}
\]
should be a promising transformation $f\mapsto\tilde f$, whose inverse $f\mapsto\hat f$ should not be far from
\[
\hat{\tilde f}(k)=\sum_{r=0}^{N-1}\tilde f(r)\,e^{i\frac{2\pi}{N}kr}.
\]
Actually, these are already mutually inverse transformations:

Theorem. The transformation above satisfies $\hat{\tilde f}(k)=f(k)$ for all $k=0,1,\dots,N-1$.

Proof. Let $T=\sum_{r=0}^{N-1}e^{ir\frac{2\pi}{N}k}$. Then $e^{i\frac{2\pi}{N}k}T=\sum_{r=1}^{N}e^{ir\frac{2\pi}{N}k}$, and so by subtraction, $(1-e^{i\frac{2\pi}{N}k})T=1-e^{i2\pi k}$. The right-hand side is 0 for all integers $k$. On the left side, the coefficient of $T$ is not zero unless $k$ is a multiple of $N$. Hence
\[
T=\sum_{r=0}^{N-1}e^{ir\frac{2\pi}{N}k}=\begin{cases}N & \text{if $k$ is a multiple of }N,\\ 0 & \text{otherwise.}\end{cases}
\]
With $k$ and $s$ both confined to the range $\{0,1,2,\dots,N-1\}$, $k-s$ can only be a multiple of $N$ when $k=s$. It follows that for such $k$ and $s$,
\[
\sum_{r=0}^{N-1}e^{ir\frac{2\pi}{N}(k-s)}=\delta_{ks}N,
\]
where the Kronecker delta $\delta_{ks}=0$ for $k\neq s$ and $\delta_{ks}=1$ if $k=s$. Finally, we compute:

In order to guess the correct preimage of the rational function in question, first assume $a>0$ and compute
\[
\int_{-\infty}^{\infty}e^{-a|x|}\,e^{-i\omega x}\,dx
=\Bigl[\frac{1}{a-i\omega}e^{(a-i\omega)x}\Bigr]_{-\infty}^0+\Bigl[\frac{-1}{a+i\omega}e^{-(a+i\omega)x}\Bigr]_0^\infty
=\frac{1}{a-i\omega}+\frac{1}{a+i\omega}=\frac{2a}{a^2+\omega^2}.
\]
Thus it is verified that
\[
\mathcal{F}\bigl(e^{-\sqrt a|x|}\bigr)=\sqrt{\frac{2a}{\pi}}\,\frac{1}{a+\omega^2}.
\]
Immediately (the factors $\sqrt{2\pi}$ cancel),
\[
y(t)=-\frac{1}{2\sqrt a}\,e^{-\sqrt a|t|}*f(t)=-\frac{1}{2\sqrt a}\int_{-\infty}^{\infty}e^{-\sqrt a|x|}f(t-x)\,dx.
\]
The case $a<0$ is a little more complicated. But we may ask Maple, or look up in the literature, that the function $g(t)=\sin(b|t|)$ has the Fourier image
\[
\mathcal{F}(g)(\omega)=\frac{1}{\sqrt{2\pi}}\,\frac{2b}{b^2-\omega^2}.
\]
We are nearly finished. The required preimage is $h(t)=\frac{\sqrt{2\pi}}{\sqrt{-4a}}\sin(\sqrt{-a}\,|t|)$, and the resulting convolution is
\[
y(t)=\frac{1}{\sqrt{-4a}}\int_{-\infty}^{\infty}\sin(\sqrt{-a}\,|x|)\,f(t-x)\,dx.
\]
If we write $b=\sqrt{-a}$, i.e. rewrite the equation as $y''+b^2y=f$ with $b>0$, the result says
\[
y(t)=\frac{1}{2b}\int_{-\infty}^{\infty}\sin(b|x|)\,f(t-x)\,dx. \qquad\square
\]

7.B.16. Check directly that the two functions $y(t)$ found above are indeed solutions to the equation $y''=ay+f$. ⃝

The Laplace transform is another integral transform which interchanges differentiation and algebraic multiplication. As with the Fourier transform, it is based on the properties of the exponential function, but this time we take the real exponential, see 7.2.11 for the formula. One advantage is that every polynomial has its Laplace image.

7.B.17. Determine the Laplace transform $\mathcal{L}(f)(s)$ for each of the functions
(a) $f(t)=e^{at}$; (b) $f(t)=c_1e^{a_1t}+c_2e^{a_2t}$; (c) $f(t)=\cos(bt)$; (d) $f(t)=\sin(bt)$; (e) $f(t)=\cosh(bt)$; (f) $f(t)=\sinh(bt)$; (g) $f(t)=t^k$, $k\in\mathbb{N}$,

\[
\hat{\tilde f}(k)=\sum_{r=0}^{N-1}\Bigl(\frac1N\sum_{s=0}^{N-1}f(s)\,e^{-i\frac{2\pi}{N}rs}\Bigr)e^{i\frac{2\pi}{N}rk}
=\frac1N\sum_{s=0}^{N-1}f(s)\Bigl(\sum_{r=0}^{N-1}e^{i\frac{2\pi}{N}r(k-s)}\Bigr)
=\frac1N\sum_{s=0}^{N-1}f(s)\,\delta_{ks}N=f(k). \qquad\square
\]
The computations in the proof also verify that the Fourier image of a periodic complex valued function with a unique period among the chosen sampling periods is just its amplitude at this particular frequency. Thus, if the signal has been created as a superposition of periodic signals with the sampling frequencies only, we obtain the absolutely optimal result.
However, if the transformed signal has a frequency not exactly available among the sampling frequencies, there are nonzero amplitudes at all the sampling frequencies in the Fourier image. This is called frequency leakage in the technical literature.

There is a vast amount of literature devoted to fast implementation and exploitation of the discrete Fourier transform, as well as other similar discrete tools. This is an extremely active area of current research.

3. Metric spaces

At the end of the chapter, we focus on the concepts of distance and convergence in a more abstract way. This also provides the conceptual background for some of the already derived properties of Fourier series and the Fourier transform. We need these concepts in miscellaneous contexts later. It is hoped that the subsequent pages are a useful (and hopefully manageable) trip into the world of mathematics for the competent or courageous!

7.3.1. Metrics and norms. When we discussed Fourier series, the distance between functions in a space of functions was commonly referred to. Now we examine the concept of distance more thoroughly. We are familiar with the distance of points $x,y$ in the Euclidean space $\mathbb{R}^n$, given by the size of the vector $x-y$. A very different example of distance is the so-called discrete metric, defined on any set $X$ as follows: each element $x\in X$ has distance zero from itself and one from all other elements in $X$. Notice that the triangle inequality is strict here – making detours when walking from one element to another always increases the distance travelled. Both of these distances are generalized in the following concept:

where the constants $b\in\mathbb{R}$ and $a,a_1,a_2,c_1,c_2\in\mathbb{C}$ are arbitrary. It is assumed that the positive number $s\in\mathbb{R}$ is greater than the real parts of the numbers $a,a_1,a_2\in\mathbb{C}$, and is greater than $|b|$ in the problems (e) and (f).

Solution. The case (a). It follows directly from the definition of the Laplace transform that
\[
\mathcal{L}(f)(s)=\int_0^\infty e^{at}e^{-st}\,dt=\int_0^\infty e^{-(s-a)t}\,dt
=\lim_{R\to\infty}\frac{e^{-(s-a)R}}{-(s-a)}-\frac{e^0}{-(s-a)}=\frac{1}{s-a}.
\]
The case (b). Using the result of the above case and the linearity of improper integrals, we obtain
\[
\mathcal{L}(f)(s)=c_1\int_0^\infty e^{a_1t}e^{-st}\,dt+c_2\int_0^\infty e^{a_2t}e^{-st}\,dt=\frac{c_1}{s-a_1}+\frac{c_2}{s-a_2}.
\]
The case (c). Since $\cos(bt)=\frac12\bigl(e^{ibt}+e^{-ibt}\bigr)$, the choice $c_1=1/2=c_2$, $a_1=ib$, $a_2=-ib$ in the previous case gives
\[
\mathcal{L}(f)(s)=\int_0^\infty\bigl(\tfrac12e^{ibt}+\tfrac12e^{-ibt}\bigr)e^{-st}\,dt=\frac{1}{2(s-ib)}+\frac{1}{2(s+ib)}=\frac{s}{s^2+b^2}.
\]
The cases (d), (e), (f). Analogously, the choices
(d) $c_1=-i/2$, $c_2=i/2$, $a_1=ib$, $a_2=-ib$;
(e) $c_1=1/2=c_2$, $a_1=b$, $a_2=-b$;
(f) $c_1=1/2$, $c_2=-1/2$, $a_1=b$, $a_2=-b$
lead to
(d) $\mathcal{L}(f)(s)=\frac{b}{s^2+b^2}$; (e) $\mathcal{L}(f)(s)=\frac{s}{s^2-b^2}$; (f) $\mathcal{L}(f)(s)=\frac{b}{s^2-b^2}$.
Finally, the last one is obtained by a straightforward repetition of integration by parts:
\[
\mathcal{L}(t^k)(s)=\int_0^\infty t^k\,e^{-st}\,dt=\Bigl[-t^k\frac1se^{-st}\Bigr]_0^\infty+\frac ks\int_0^\infty t^{k-1}e^{-st}\,dt=\dots=\frac{k!}{s^k}\int_0^\infty e^{-st}\,dt=\frac{k!}{s^{k+1}}. \qquad\square
\]

7.B.18. Use the definition of the Gamma function $\Gamma(t)$ in Chapter 6 in order to prove
\[
\mathcal{L}(t^\alpha)(s)=\Gamma(\alpha+1)\,\frac{1}{s^{\alpha+1}}
\]
for general $\alpha>0$. Compare the result to that of 7.B.17(g). ⃝

7.B.19. For $s>-1$, calculate the Laplace transform $\mathcal{L}(g)(s)$ of the function $g(t)=t\,e^{-t}$. Further, for $s>1$, calculate the Laplace transform $\mathcal{L}(h)(s)$ of the function $h(t)=t\sinh t$.
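These images can be cross-checked symbolically (a sketch; SymPy is our choice of tool here, using its standard laplace_transform call, and the sample exponent is arbitrary):

```python
# Symbolic cross-check of a few cases of 7.B.17 with SymPy.
import sympy as sp

t, s = sp.symbols("t s", positive=True)
a, b = sp.symbols("a b", positive=True)
k = 3  # a sample exponent for case (g)

for f in (sp.exp(a * t), sp.cos(b * t), sp.sin(b * t), t**k):
    F = sp.laplace_transform(f, t, s, noconds=True)
    print(f, "->", sp.simplify(F))
# exp(a*t) -> 1/(s - a),   cos(b*t) -> s/(b**2 + s**2),
# sin(b*t) -> b/(b**2 + s**2),   t**3 -> 6/s**4
```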
Axioms of a metric and a norm

A set $X$ together with a mapping $d:X\times X\to\mathbb{R}$ such that for all $x,y,z\in X$ the following conditions are satisfied:
(1) $d(x,y)\geq0$, and $d(x,y)=0$ if and only if $x=y$,
(2) $d(x,y)=d(y,x)$,
(3) $d(x,z)\leq d(x,y)+d(y,z)$,
is called a metric space. The mapping $d$ is a metric on $X$.
If $X$ is a vector space over $\mathbb{R}$ and $\|\ \|:X\to\mathbb{R}$ is a function satisfying
(4) $\|x\|\geq0$, and $\|x\|=0$ if and only if $x=0$,
(5) $\|\lambda x\|=|\lambda|\,\|x\|$ for all scalars $\lambda$,
(6) $\|x+y\|\leq\|x\|+\|y\|$,
then the function $\|\ \|$ is called a norm on $X$, and the space $X$ is then a normed vector space. Every norm defines a metric by setting $d(x,y)=\|x-y\|$.

The Euclidean distance in the vector spaces $\mathbb{R}^n$ satisfies the above three requirements, and it is given by a norm. The $L_1$ norm $\|\ \|_1$ and the $L_2$ norm $\|\ \|_2$ on functions satisfy the norm properties and thus define a metric on the relevant spaces of functions. Of course, not every metric can be defined by a norm in this way. The discrete metric mentioned above is an example. Metrics given by a norm have very specific properties, since their behaviour on the whole space $X$ can be derived from the properties in an arbitrarily small neighbourhood of the zero element $x=0\in X$.

7.3.2. Convergence. The concepts of (close) neighbourhoods of particular elements, convergence of sequences of elements, and the corresponding "topological" concepts can be defined on abstract metric spaces in much the same way as in the case of the real and complex numbers and their sequences. See the beginning of the fifth chapter, 5.2.3–5.2.8. We can almost copy these paragraphs, although the proof of the theorem 5.2.8 is much harder. We begin with the concept of convergent sequences in a metric space $X$ with metric $d$:

Cauchy sequences

Consider an arbitrary sequence of elements $x_0,x_1,\dots$ in $X$. Suppose that for any fixed positive real number $\varepsilon$, $d(x_i,x_j)<\varepsilon$ for all but finitely many pairs of terms $x_i,x_j$ of the sequence. In other words, for any given $\varepsilon>0$, there is an index $N$ such that the above inequality holds for all $i,j>N$. Loosely put, the elements of the sequence are eventually arbitrarily close to each other. Such a sequence is called a Cauchy sequence.

Just as in the case of the real or complex numbers, we would like every Cauchy sequence of terms $x_i\in X$ to converge to some $x$ in the following sense:

⃝

7.B.20. The basic Laplace transforms are enumerated in the following table:

$y(t)$ | $\mathcal{L}(y)(s)$
$t^k$ | $\dfrac{k!}{s^{k+1}}$
$e^{at}$ | $\dfrac{1}{s-a}$
$t\,e^{at}$ | $\dfrac{1}{(s-a)^2}$
$t^n\,e^{at}$ | $\dfrac{n!}{(s-a)^{n+1}}$
$\sin\omega t$ | $\dfrac{\omega}{s^2+\omega^2}$
$\cos\omega t$ | $\dfrac{s}{s^2+\omega^2}$
$e^{at}\sin\omega t$ | $\dfrac{\omega}{(s-a)^2+\omega^2}$
$e^{at}\bigl(\cos\omega t+\frac{a}{\omega}\sin\omega t\bigr)$ | $\dfrac{s}{(s-a)^2+\omega^2}$
$t\sin\omega t$ | $\dfrac{2\omega s}{(s^2+\omega^2)^2}$
$\sin\omega t-\omega t\cos\omega t$ | $\dfrac{2\omega^3}{(s^2+\omega^2)^2}$

Establish the 5th and 6th rows of the table above using Euler's formula $e^{i\omega t}=\cos\omega t+i\sin\omega t$. ⃝

As expected, using the features of the Laplace transform allows us to find explicit solutions to some differential equations. By 7.D.18, it is straightforward to incorporate the initial conditions into the solution. We present just two such examples in the problems at the conclusion of this chapter, see 7.D.21. We return to this topic in Chapter 8.

C. Metric spaces

The concept of metric is an abstract version of what we understand as distance in Euclidean geometry. It is always based on the triangle inequality. The axioms in Definition 7.3.1 follow the Euclidean experience, saying that our "distance" of two elements has to be strictly positive (except if the two elements coincide), should be symmetric in the arguments, and should satisfy the triangle inequality.
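A random spot-check (a sketch, with arbitrary dimension and sample count) that a metric defined by a norm as above indeed satisfies the triangle inequality, here for the $p$-norms introduced in 7.3.4 below:

```python
# Random spot-check: d(x, y) = ||x - y||_p satisfies the triangle
# inequality for a few p >= 1 (cf. 7.3.1 and 7.3.4).
import numpy as np

rng = np.random.default_rng(42)
for p in (1.0, 1.5, 2.0, 7.0):
    d = lambda u, v: np.sum(np.abs(u - v) ** p) ** (1 / p)
    worst = 0.0
    for _ in range(10000):
        x, y, z = rng.standard_normal((3, 5))
        worst = max(worst, d(x, z) - (d(x, y) + d(y, z)))
    print(p, worst)   # never positive
```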
Other concepts available in the literature are more abstract and might lead to more general objects, the most important ones being pseudometrics, ultrametrics, and semimetrics. (The first axiomatic definition of a "traditional" metric was given by Maurice Fréchet in 1906. However, the name of the metric comes from Felix Hausdorff, who used this word in his work from 1914.)

7.C.1. The discrete metric space $X$ is defined as the set $X$ with the function $d:X\times X\to\mathbb{R}$,
\[
d(x,y)=\begin{cases}1 & x\neq y\\ 0 & x=y.\end{cases}
\]
Show that this is a metric space according to Definition 7.3.1. Show how to introduce a metric on Cartesian products of metric spaces, so that the product of two discrete metric spaces is again discrete.

Convergent sequences

Let $x_0,x_1,\dots$ be a sequence in a metric space $X$ and let $x$ be an element of $X$. We say that the sequence $\{x_i\}$ converges to the element $x$ if, for every positive real number $\varepsilon$, there is an integer $N>0$ such that $i>N$ implies $d(x_i,x)<\varepsilon$.

By the triangle inequality, it follows that for each pair of terms $x_i,x_j$ from a convergent sequence with sufficiently large indices,
\[
d(x_i,x_j)\leq d(x_i,x)+d(x,x_j)<2\varepsilon.
\]
Therefore, every convergent sequence is a Cauchy sequence. Conversely however, not every Cauchy sequence is convergent. Metric spaces where every Cauchy sequence is convergent are called complete metric spaces.

7.3.3. Topology, convergence, and continuity. Just as in the case of the real numbers, we can formulate the convergence in terms of "open neighbourhoods".

Open and closed sets

Definition. The open $\varepsilon$-neighbourhood of an element $x$ in a metric space $X$ (or just $\varepsilon$-neighbourhood for short) is the set
\[
O_\varepsilon(x)=\{y\in X;\ d(x,y)<\varepsilon\}.
\]
A subset $U\subset X$ is open if and only if for all $x\in U$, $U$ contains some $\varepsilon$-neighbourhood of $x$. We define a subset $W\subset X$ to be closed if and only if its complement $X\setminus W$ is an open set.

Instead of an $\varepsilon$-neighbourhood, we also talk about an (open) $\varepsilon$-ball centered at $x$. In the case of a normed space, we can consider $\varepsilon$-balls centered at zero: along with $x$, $\varepsilon$-balls determine an $\varepsilon$-neighbourhood. The limit points of a subset $A\subset X$ are defined as those elements $x\in X$ such that there is a sequence of points in $A$ other than $x$ converging to $x$.

Lemma. A subset in a metric space is closed if and only if it contains all of its limit points.

Proof. Suppose $A$ is closed and $x$ is a limit point of $A$ not belonging to $A$. Then $x\in X\setminus A$, which is open, so there is an $\varepsilon$-neighbourhood of $x$ not intersecting $A$. But in every $\varepsilon$-neighbourhood of $x$ there are infinitely many points of the set $A$, since $x$ is a limit point. This is a contradiction.

Conversely, suppose $A$ contains all of its limit points and suppose $x\in X\setminus A$. If in every $\varepsilon$-neighbourhood of the point $x$ there is a point $x_\varepsilon\in A$, then the choices $\varepsilon=1/n$ provide a sequence of points $x_n\in A$ converging to $x$. But then the point $x$ would have to be a limit point, thus lying in $A$, which again leads to a contradiction. □

For every subset $A$ in a metric space $X$, we define its interior as the set of those points $x$ in $A$ for which a neighbourhood of $x$ also belongs to $A$. We define the closure $\bar A$ of

Solution. All three axioms of a metric from 7.3.1 are obviously satisfied by our definition of the discrete metric space. Consider two metric spaces $X$ and $Y$ with metrics $d_X$ and $d_Y$. The first obvious idea seems to be to add the distances of the components, i.e.
\[
d\bigl((x_1,y_1),(x_2,y_2)\bigr)=d_X(x_1,x_2)+d_Y(y_1,y_2).
\]
Clearly this is a metric (verify in detail!), but if the metric spaces $X$ and $Y$ are discrete, then considering points $u=(x_1,y_1)$ and $w=(x_2,y_2)$ such that $x_1\neq x_2$, $y_1\neq y_2$, we arrive at $d(u,w)=2$. Thus, this is not a discrete metric space. But there is another simple possibility of introducing a metric on $X\times Y$, using the maximum of the distances:
\[
d\bigl((x_1,y_1),(x_2,y_2)\bigr)=\max\{d_X(x_1,x_2),\,d_Y(y_1,y_2)\}.
\]
We call this the product of the metric spaces $X$ and $Y$. The triangle inequality as well as the other axioms are obvious (write down the explicit arguments!). Moreover, if both $X$ and $Y$ are discrete, then $d$ is also a discrete metric. □

7.C.2. Decide whether or not the following sets and mappings form a metric space:
i) $\mathbb{N}$, $d(m,n)=\gcd(m,n)$;
ii) $\mathbb{N}$, $d(m,n)=\frac{\max(m,n)}{\gcd(m,n)}-1$;
iii) the world population, $d(P_1,P_2)=n$, where $P_1=X_0,X_1,\dots,X_{n+1}=P_2$ is the shortest sequence of people such that $X_i$ knows $X_{i+1}$ for $i=0,\dots,n$.

Solution.
i) No. The "distance" $d$ does not satisfy $d(m,m)=0$.
ii) No. The first and second conditions in the definition 7.3.1 are fulfilled, but the triangle inequality (property (3)) is not. The distance of 8 and 9 is 8, the distance of 8 and 6 is 3, and the distance of 6 and 9 is 2, thus $d(8,9)>d(8,6)+d(6,9)$.
iii) No. The "distance" is not symmetric. It would be a metric space if, in the definition, the word "knows" were changed to mean "know each other". □

7.C.3. Consider the set of binary words of length $n$. Define the distance between two words as the number of bits in which they differ. This is called the Hamming distance (see 12.5.2). Show that it defines a metric.

Solution. The first two axioms of a metric are satisfied. For the third one, let the words $x$ and $z$ differ in $k$ bits, and let $y$ be another word. Consider just the $k$ bits in which $x$ and $z$ differ. Clearly, in each of these bits, $y$ differs from exactly one of $x$ and $z$. Thus, considering only the parts of the words $x_p,y_p,z_p$ in these $k$ bits, we have $d(x_p,y_p)+d(y_p,z_p)=d(x_p,z_p)$. In the other bits, the words $x$ and $z$ are the same, while $x$ and $y$, or $y$ and $z$, may differ. Thus $d(x,y)+d(y,z)\geq d(x,z)$, and the third axiom is satisfied as well. □

a set $A$ as the union of the original set $A$ with the set of all limit points of $A$. As easily as in the case of the real numbers, we can verify that the intersection of any system of closed sets, as well as the union of any finite system of closed sets, is also closed. On the other hand, any union of open sets is again an open set, and a finite intersection of open sets is again an open set. Prove these propositions by yourselves in detail! We also advise the reader to verify that the interior of a set $A$ equals the union of all open sets contained in $A$ (alternatively put, the interior of $A$ is the largest open subset of $A$). The closure of $A$ is the intersection of all closed sets which contain $A$ (alternatively put, the closure of $A$ is the smallest closed superset of $A$).

The closed and open sets are the essential concepts of the mathematical discipline called topology. Without pursuing these ideas further, we have just familiarised ourselves with the topology of metric spaces. The concept of convergence can now be reformulated as follows: a sequence of elements $x_i$, $i=0,1,\dots$, in a metric space $X$ converges to $x\in X$ if and only if for every open set $U$ containing $x$, all but finitely many points of our sequence lie in $U$.
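For the Hamming distance of 7.C.3 above, the triangle inequality can even be verified exhaustively for small word lengths (a sketch; word length 6 is an arbitrary choice):

```python
# Exhaustive check of the triangle inequality for the Hamming distance
# on all binary words of length 6 (cf. 7.C.3).
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

words = list(product((0, 1), repeat=6))
ok = all(hamming(x, z) <= hamming(x, y) + hamming(y, z)
         for x in words for y in words for z in words)
print(ok)   # True
```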
Just as in the case of the real numbers, we can define continuous mappings between metric spaces:

Continuous mappings

Let W and Z be metric spaces. A mapping f : W → Z is continuous if and only if the inverse image f⁻¹(V) of every open set V ⊂ Z is an open set in W.

This is equivalent to the statement that f is continuous if and only if, for every z = f(x) ∈ Z and every positive real number ε, there is a positive real number δ such that d_Z(z, f(y)) < ε for all elements y ∈ W with distance d_W(x, y) < δ. Again, as in the case of real-valued functions, a mapping between metric spaces is continuous if and only if it preserves the convergence of sequences (check this yourselves!).

7.3.4. Lp–norms. We now have at our disposal the general tools with which we can look at examples of metric spaces built from finite-dimensional vectors or from functions. We restrict ourselves to an extraordinarily useful class of norms. We begin with the real or complex finite-dimensional vector spaces Rⁿ and Cⁿ: for a fixed real number p ≥ 1 and any vector z = (z₁, …, zₙ), we define
\[ \|z\|_p = \Bigl( \sum_{i=1}^{n} |z_i|^p \Bigr)^{1/p}. \]
We prove that this indeed defines a norm. The first two properties from the definition are clear. It remains to prove the triangle inequality, and for that purpose we use Hölder's inequality, stated and proved below.

7.C.4. Consider a connected subset S ⊂ Rⁿ (any two points in S can be connected by a path lying in S). Define the distance between two points as the length of the shortest path between them. Is this a metric on S?

Solution. It is a metric; all the axioms are easily verified. But this metric has a special significance: the principle of the "shortest way" is often met in reality. Recall, for example, Fermat's principle of least time (see 5.E.123), where the length of a path is measured by the time light needs to travel along it. In general, shortest paths in a metric space are called geodesics. □

7.C.5. Consider the space of integrable functions on the interval [a, b]. Define the (L₁) distance of the functions f, g as
\[ \|f - g\|_1 = \int_a^b |f(x) - g(x)| \, dx. \]
Why is this not a metric space?

Solution. The first axiom of a metric from 7.3.1 is not satisfied: any function which is non-zero only on a set of measure zero has distance 0 from the zero function. But if we consider the equivalence under which two functions are equivalent whenever they differ only on a set of measure zero, then we get the space S⁰(a, b), and the given distance, considered on the equivalence classes, is the L₁ metric. □

7.C.6. Let p be a prime number. Every non-zero rational number r can be written uniquely in the form r = pᵏ · u/v, where k ∈ Z, u ∈ Z and v ∈ N are coprime, and p divides neither the numerator u nor the denominator v. Consider the map ∥·∥_p : Q → R given by ∥r∥_p = p⁻ᵏ (with ∥0∥_p = 0). Show that it is a norm on Q as a vector space over Q. It is called the p-adic norm. ⃝

Solution. It is an exercise in elementary number theory. □
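As a computational companion to 7.C.6 (the solution itself being left as a number-theoretic exercise), the following sketch computes ∥r∥_p via the p-adic valuation and spot-checks the strong triangle inequality ∥r + s∥_p ≤ max(∥r∥_p, ∥s∥_p), which implies the ordinary one. The helper names are our own:

from fractions import Fraction

def vp(n: int, p: int) -> int:
    """Exponent of the prime p in the integer n (n != 0)."""
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k

def p_adic_norm(r: Fraction, p: int) -> Fraction:
    """The p-adic norm |r|_p = p**(-k) for r = p**k * u/v, with |0|_p = 0."""
    if r == 0:
        return Fraction(0)
    k = vp(r.numerator, p) - vp(r.denominator, p)
    return Fraction(1, p**k) if k >= 0 else Fraction(p**(-k))

p = 3
samples = [Fraction(n, d) for n in range(-9, 10) for d in range(1, 10)]
for r in samples:
    for s in samples:
        assert p_adic_norm(r + s, p) <= max(p_adic_norm(r, p), p_adic_norm(s, p))
print("strong triangle inequality verified on", len(samples)**2, "pairs")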
7.C.7. Consider the power set (the set of all subsets) of a given finite set. Determine whether the functions d₁ and d₂, defined for all subsets X, Y by
(a) d₁(X, Y) := |(X ∪ Y) ∖ (X ∩ Y)|,
(b) d₂(X, Y) := |(X ∪ Y) ∖ (X ∩ Y)| / |X ∪ Y| for X ∪ Y ≠ ∅, and d₂(∅, ∅) := 0,
are metrics. (Here |X| denotes the number of elements of the set X; thus the metric d₁ measures the size of the symmetric difference of the sets, while d₂ measures the relative symmetric difference.)

Solution. As usual in exercises on deciding whether a particular mapping is a metric, we omit the verification of the first and second conditions from the definition and analyze the triangle inequality only.

The case (a). For any sets X, Y, Z,
\[ (1)\quad (X \cup Z) \setminus (X \cap Z) \subseteq \bigl( (X \cup Y) \setminus (X \cap Y) \bigr) \cup \bigl( (Y \cup Z) \setminus (Y \cap Z) \bigr). \]
To show this, suppose first that x is an element satisfying x ∈ X and x ∉ Z. Then either x ∈ Y, in which case x ∈ (Y ∪ Z) ∖ (Y ∩ Z), or x ∉ Y, in which case x ∈ (X ∪ Y) ∖ (X ∩ Y).

Hölder inequality

Lemma. For a fixed real number p > 1 and every pair of n–tuples of non-negative real numbers xᵢ and yᵢ,
\[ \sum_{i=1}^{n} x_i y_i \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} \cdot \Bigl(\sum_{i=1}^{n} y_i^q\Bigr)^{1/q}, \]
where 1/q = 1 − 1/p.

Proof. Denote by X and Y the two factors in the product on the right-hand side of the inequality to be proved. If all of the numbers xᵢ or all of the numbers yᵢ are zero, then the statement is clearly true; therefore we can assume that X ≠ 0 and Y ≠ 0. We use the fact that the exponential function is convex: its graph lies below any of its chords (indeed, its second derivative is again the exponential function, thus always positive). Hence, for any a and b, with p, q as above,
\[ e^{a/p + b/q} \le \tfrac{1}{p} e^{a} + \tfrac{1}{q} e^{b} \]
(in fact, this is the simplest case of the Jensen inequality, see 6.D.18). Now, for those k with x_k y_k ≠ 0, we define the numbers v_k and w_k so that
\[ x_k = X e^{v_k/p}, \qquad y_k = Y e^{w_k/q}. \]
Then e^{v_k/p + w_k/q} ≤ (1/p)e^{v_k} + (1/q)e^{w_k}, and by substitution it follows immediately that
\[ x_k y_k \le XY \Bigl( \frac{1}{p} \Bigl(\frac{x_k}{X}\Bigr)^{p} + \frac{1}{q} \Bigl(\frac{y_k}{Y}\Bigr)^{q} \Bigr). \]
Summing over k = 1, …, n gives (notice that adding the terms with x_k y_k = 0 does not spoil the inequality)
\[ \frac{1}{XY} \sum_{i=1}^{n} x_i y_i \le \frac{1}{pX^p} \sum_{i=1}^{n} x_i^p + \frac{1}{qY^q} \sum_{i=1}^{n} y_i^q = \frac{1}{p} + \frac{1}{q} = 1. \]
Multiplying this inequality by XY finishes the proof. □

Now we can prove that ∥·∥_p is indeed a norm:

Minkowski inequality

For every p > 1 and all n–tuples of non-negative real numbers (x₁, …, xₙ) and (y₁, …, yₙ),
\[ \Bigl(\sum_{i=1}^{n} (x_i+y_i)^p\Bigr)^{1/p} \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} + \Bigl(\sum_{i=1}^{n} y_i^p\Bigr)^{1/p}. \]

It follows that x belongs to the union on the right-hand side of (1). By symmetry, the same holds if x ∉ X and x ∈ Z. Since these are all the possibilities for x to belong to the left-hand side of (1), the inclusion (1) is established. But then
\[ d_1(X,Z) = |(X\cup Z)\setminus(X\cap Z)| \le \bigl| \bigl((X\cup Y)\setminus(X\cap Y)\bigr) \cup \bigl((Y\cup Z)\setminus(Y\cap Z)\bigr) \bigr| \le |(X\cup Y)\setminus(X\cap Y)| + |(Y\cup Z)\setminus(Y\cap Z)| = d_1(X,Y) + d_1(Y,Z). \]

The case (b). Proceed similarly to the case of d₁. Denote by X′ the complement of a set X. The equalities
\[ (X\cup Y)\setminus(X\cap Y) = (X\cap Y'\cap Z)\cup(X\cap Y'\cap Z')\cup(X'\cap Y\cap Z)\cup(X'\cap Y\cap Z'), \]
\[ (Y\cup Z)\setminus(Y\cap Z) = (X\cap Y\cap Z')\cup(X\cap Y'\cap Z)\cup(X'\cap Y\cap Z')\cup(X'\cap Y'\cap Z), \]
\[ \bigl((X\cup Z)\setminus(X\cap Z)\bigr)\cup\bigl(Y\setminus(X\cup Z)\bigr) = (X\cap Y\cap Z')\cup(X\cap Y'\cap Z')\cup(X'\cap Y\cap Z)\cup(X'\cap Y'\cap Z)\cup(X'\cap Y\cap Z'), \]
which, again, can be proved by checking the several possibilities, imply a stronger form of (1), namely
\[ \bigl((X\cup Z)\setminus(X\cap Z)\bigr)\cup\bigl(Y\setminus(X\cup Z)\bigr) \subseteq \bigl((X\cup Y)\setminus(X\cap Y)\bigr)\cup\bigl((Y\cup Z)\setminus(Y\cap Z)\bigr). \]
Further, we invoke the inequality
\[ \frac{|(X\cup Z)\setminus(X\cap Z)|}{|X\cup Z|} \le \frac{\bigl|\bigl((X\cup Z)\setminus(X\cap Z)\bigr)\cup\bigl(Y\setminus(X\cup Z)\bigr)\bigr|}{\bigl|X\cup Z\cup\bigl(Y\setminus(X\cup Z)\bigr)\bigr|}, \qquad X\cup Z \neq \emptyset, \]
which involves only non-negative numbers and rests on the general fact that
\[ \frac{x}{z} \le \frac{x+y}{z+y}, \qquad y \ge 0,\ z > 0,\ x \in [0,z]. \]
Since X ∪ Z ∪ (Y ∖ (X ∪ Z)) = X ∪ Y ∪ Z, we obtain
\[ d_2(X,Z) = \frac{|(X\cup Z)\setminus(X\cap Z)|}{|X\cup Z|} \le \frac{\bigl|\bigl((X\cup Z)\setminus(X\cap Z)\bigr)\cup\bigl(Y\setminus(X\cup Z)\bigr)\bigr|}{|X\cup Y\cup Z|} \le \frac{\bigl|\bigl((X\cup Y)\setminus(X\cap Y)\bigr)\cup\bigl((Y\cup Z)\setminus(Y\cap Z)\bigr)\bigr|}{|X\cup Y\cup Z|} \le \frac{|(X\cup Y)\setminus(X\cap Y)| + |(Y\cup Z)\setminus(Y\cap Z)|}{|X\cup Y\cup Z|} \le \frac{|(X\cup Y)\setminus(X\cap Y)|}{|X\cup Y|} + \frac{|(Y\cup Z)\setminus(Y\cap Z)|}{|Y\cup Z|} = d_2(X,Y) + d_2(Y,Z) \]
whenever X ∪ Z ≠ ∅ and Y ≠ ∅. For X = Z = ∅ or Y = ∅, the triangle inequality clearly still holds. Therefore, both mappings are metrics. The metric d₁ is quite elementary, but the metric d₂ has wider applications; in the literature it is known as the Jaccard metric.⁴ □

⁴ It is named after the biologist Paul Jaccard, who described this measure of similarity of insect populations, using the function 1 − d₂, in 1908.
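The combinatorial bookkeeping above is easy to get wrong, so a randomized check is reassuring. The sketch below tests the triangle inequality for both d₁ and d₂ on random subsets of a small ground set, using exact Fractions to avoid rounding; it is a test, not a proof:

import random
from fractions import Fraction

def d1(X, Y):
    """Size of the symmetric difference."""
    return len(X ^ Y)

def d2(X, Y):
    """Relative symmetric difference (the Jaccard metric)."""
    return Fraction(len(X ^ Y), len(X | Y)) if X | Y else Fraction(0)

random.seed(1)
ground = range(8)
def random_subset():
    return {i for i in ground if random.random() < 0.5}

for _ in range(10_000):
    X, Y, Z = random_subset(), random_subset(), random_subset()
    assert d1(X, Z) <= d1(X, Y) + d1(Y, Z)
    assert d2(X, Z) <= d2(X, Y) + d2(Y, Z)
print("triangle inequalities hold on 10000 random triples")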
To verify the Minkowski inequality, we can use the following observation:
\[ \sum_{i=1}^{n} (x_i+y_i)^p = \sum_{i=1}^{n} x_i (x_i+y_i)^{p-1} + \sum_{i=1}^{n} y_i (x_i+y_i)^{p-1}, \]
and by Hölder's inequality (recall p > 1),
\[ \sum_{i=1}^{n} x_i (x_i+y_i)^{p-1} \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} \Bigl(\sum_{i=1}^{n} (x_i+y_i)^{(p-1)q}\Bigr)^{1/q}, \]
\[ \sum_{i=1}^{n} y_i (x_i+y_i)^{p-1} \le \Bigl(\sum_{i=1}^{n} y_i^p\Bigr)^{1/p} \Bigl(\sum_{i=1}^{n} (x_i+y_i)^{(p-1)q}\Bigr)^{1/q}. \]
Adding the last two inequalities, and taking into account that p + q = pq, and so (p−1)q = pq − q = p, we arrive at
\[ \frac{\sum_{i=1}^{n} (x_i+y_i)^p}{\bigl(\sum_{i=1}^{n} (x_i+y_i)^p\bigr)^{1/q}} \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} + \Bigl(\sum_{i=1}^{n} y_i^p\Bigr)^{1/p}, \]
that is,
\[ \Bigl(\sum_{i=1}^{n} (x_i+y_i)^p\Bigr)^{1-1/q} \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} + \Bigl(\sum_{i=1}^{n} y_i^p\Bigr)^{1/p}, \]
which, since 1 − 1/q = 1/p, is the Minkowski inequality we wanted to prove.

Thus we have verified that on every finite-dimensional real or complex vector space there is a class of norms ∥·∥_p for all p > 1; the case p = 1 was considered earlier. We can also allow p = ∞ by setting
\[ \|z\|_\infty = \max\{ |z_i|;\ i = 1, \dots, n \}, \]
which is obviously a norm. Notice that, in the context of these norms, Hölder's inequality can be written for all x = (x₁, …, xₙ), y = (y₁, …, yₙ) as
\[ \sum_{i=1}^{n} |x_i|\,|y_i| \le \|x\|_p \cdot \|y\|_q \]
for all p ≥ 1 and q satisfying 1/p + 1/q = 1, where for p = 1 we set q = ∞.

7.3.5. Lp–norms for sequences and functions. Now we can easily define the Lp-norms on suitable infinite-dimensional vector spaces as well. We begin with sequences. The vector space ℓ_p, p ≥ 1, is the set of all sequences of real or complex numbers x₀, x₁, … such that
\[ \sum_{i=0}^{\infty} |x_i|^p < \infty. \]
If x = (x₀, x₁, …) ∈ ℓ_p, p ≥ 1, then the norm is given by
\[ \|x\|_p = \Bigl(\sum_{i=0}^{\infty} |x_i|^p\Bigr)^{1/p}. \]

7.C.8. Let
\[ d(x,y) := \frac{|x-y|}{1+|x-y|}, \qquad x, y \in \mathbb{R}. \]
Prove that d is a metric on R.

Solution. We prove the triangle inequality only (the rest is clear). Introduce the auxiliary function
\[ (1)\quad f(t) := \frac{t}{1+t}, \qquad t \ge 0. \]
Note that
\[ f(s) - f(r) = \frac{s}{1+s} - \frac{r}{1+r} = \frac{s-r}{(1+s)(1+r)} > 0 \quad \text{whenever } s > r \ge 0. \]
It follows that f is increasing, a fact which can also be verified by examining the first derivative. Therefore,
\[ d(x,z) = \frac{|x-y+y-z|}{1+|x-y+y-z|} \le \frac{|x-y|+|y-z|}{1+|x-y|+|y-z|} = \frac{|x-y|}{1+|x-y|+|y-z|} + \frac{|y-z|}{1+|x-y|+|y-z|} \le \frac{|x-y|}{1+|x-y|} + \frac{|y-z|}{1+|y-z|} = d(x,y) + d(y,z) \]
for all x, y, z ∈ R. □

The metrics in the next problems are defined by norms on vector spaces of functions; see the definitions and discussion in 7.3.1.

7.C.9. Determine the distance between the functions
\[ f(x) = x, \qquad g(x) = -\frac{x}{\sqrt{1+x^2}}, \qquad x \in [1,2], \]
as elements of the normed vector space S⁰[1,2] of (piecewise) continuous functions on the interval [1,2] with the norm
(a) ∥f∥₁ = ∫₁² |f(x)| dx;
(b) ∥f∥_∞ = max{|f(x)|; x ∈ [1,2]}.

Solution. The case (a). We need only compute the norm of the difference of the functions:
\[ \int_1^2 |f(x)-g(x)|\,dx = \int_1^2 \Bigl( x + \frac{x}{\sqrt{1+x^2}} \Bigr) dx = \Bigl[ \frac{x^2}{2} + \sqrt{1+x^2} \Bigr]_1^2 = \frac{3}{2} + \sqrt{5} - \sqrt{2}. \]
The case (b). It is necessary to compute
\[ \max_{x\in[1,2]} |f(x)-g(x)| = \max_{x\in[1,2]} \Bigl( x + \frac{x}{\sqrt{1+x^2}} \Bigr). \]
Since
\[ \Bigl( x + \frac{x}{\sqrt{1+x^2}} \Bigr)' = 1 + \frac{1}{\bigl(\sqrt{1+x^2}\bigr)^3} > 0, \qquad x \in [1,2], \]
the function f − g is increasing, and so it attains its maximum at the right end point of the interval, x = 2. Hence
\[ \max_{x\in[1,2]} \Bigl( x + \frac{x}{\sqrt{1+x^2}} \Bigr) = 2 + \frac{2}{\sqrt{1+2^2}} = 2 + \frac{2}{\sqrt{5}}. \]
□

That ∥x∥_p is a norm on ℓ_p follows immediately from the Minkowski inequality by letting n → ∞. The vector space ℓ_∞ is the set of all bounded sequences of real or complex numbers x₀, x₁, …. If x = (x₀, x₁, …) ∈ ℓ_∞, then its norm is given by
\[ \|x\|_\infty = \sup\{ |x_i|;\ i = 0, 1, 2, \dots \}. \]
It is easily checked that this is indeed a norm.

Finally, we return to the space of functions S⁰[a,b] on a finite interval [a,b], or S⁰_c on an unbounded interval. We have already met the L₁ norm ∥·∥₁. For every p > 1, the Riemann integrals ∫_a^b |f(x)|^p dx surely exist for all functions in such a space, so we can define
\[ \|f\|_p = \Bigl( \int_a^b |f(x)|^p\,dx \Bigr)^{1/p}. \]
The Riemann integral was defined in terms of limits, using the Riemann sums which correspond to splittings Ξ with representatives ξᵢ; in our case, these are the finite sums
\[ S_{\Xi,\xi} = \sum_{i=1}^{n} |f(\xi_i)|^p (x_i - x_{i-1}). \]
Hölder's inequality applied to the Riemann sums of a product |f(x)g(x)| of two functions f and g gives
\[ \sum_{i=1}^{n} |f(\xi_i)|\,|g(\xi_i)|\,(x_i - x_{i-1}) = \sum_{i=1}^{n} |f(\xi_i)| (x_i - x_{i-1})^{1/p} \cdot |g(\xi_i)| (x_i - x_{i-1})^{1/q} \le \Bigl( \sum_{i=1}^{n} |f(\xi_i)|^p (x_i-x_{i-1}) \Bigr)^{1/p} \Bigl( \sum_{i=1}^{n} |g(\xi_i)|^q (x_i-x_{i-1}) \Bigr)^{1/q}, \]
where on the right-hand side there is the product of the Riemann sums for the integrals defining ∥f∥_p and ∥g∥_q. Passing to the limit, we thus verify Hölder's inequality for integrals,
\[ \int_a^b f(x) g(x)\,dx \le \Bigl( \int_a^b f(x)^p dx \Bigr)^{1/p} \Bigl( \int_a^b g(x)^q dx \Bigr)^{1/q}, \]
which is valid for all non-negative real-valued functions f and g in our space of piecewise continuous functions with compact support. In just the same way as in the previous paragraph, we can derive the integral form of the Minkowski inequality from Hölder's inequality:
\[ \|f+g\|_p \le \|f\|_p + \|g\|_p. \]
Thus ∥·∥_p is indeed a norm on the vector space of all continuous functions having compact support, for all p > 1 (we verified this for p = 1 long ago).

The L₁ and L₂ distances, discussed at the beginning of this chapter (cf. 7.1.2), reflect the basic intuition about the distance between graphs of functions. In practice, however, we often need more subtle concepts of distance. The most obvious refinement is to include the derivatives in the same way as the values of the functions.

7.C.10. Consider the space S¹[a,b] of piecewise differentiable (real or complex) functions on the interval [a,b] and show that the formula
\[ \|f\| = \Bigl( \int_a^b |f(x)|^2 + \alpha^2 |f'(x)|^2 \, dx \Bigr)^{1/2}, \]
with any real α ≥ 0, defines a norm on this vector space (up to the identification of functions differing only at points of discontinuity). Compute the distance between the functions f(x) = sin x + 0.1 sin²(6x) − 0.03 sin(60x) and g(x) = sin x on the interval [−π, π] in this norm, and explain its dependence on α.

Solution. The formula
\[ \langle f, g \rangle = \int_a^b f(x)\overline{g(x)} + \alpha^2 f'(x)\overline{g'(x)} \, dx \]
defines an inner product on S¹[a,b]: it is linear in the first argument f, it yields the complex conjugate value when the arguments are exchanged, and it is clearly positive if f = g is non-zero on some interval (we ignore the values at the points of discontinuity, cf. the discussion in 7.1.2). Thus the corresponding quadratic form defines a norm on the complex vector space S¹[a,b]. The distance in this norm is easily computed:
\[ \|f-g\| = \sqrt{0.02639 + 11.3097\,\alpha^2}. \]
Its dependence on α can be seen in the illustration: the values of the function f are nearly equal to sin x, but the very wiggly difference is well apparent in the derivatives. If α = 0, we obtain the usual L₂ distance 0.162; if α = 1, the distance is 3.367.⁵ □

⁵ This is an illustration of the very important concept of Sobolev spaces, in which any number of derivatives can be involved. Moreover, Lp with any p ≥ 1 can be used in the definition of the norm instead of p = 2. There is much literature on this subject.
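The two constants in 7.C.10 are easy to confirm numerically. The following sketch evaluates ∫(f−g)² dx and ∫(f′−g′)² dx on a fine grid (numpy is assumed; the derivative is written in closed form), so the squared distance is the first number plus α² times the second:

import numpy as np

# fine uniform grid on [-pi, pi]; the integrand oscillates at frequency 60
x, dx = np.linspace(-np.pi, np.pi, 400_001, retstep=True)

h  = 0.1 * np.sin(6 * x)**2 - 0.03 * np.sin(60 * x)              # f - g
dh = 1.2 * np.sin(6 * x) * np.cos(6 * x) - 1.8 * np.cos(60 * x)  # (f - g)'

a = np.sum(h**2) * dx    # ~ 0.02639
b = np.sum(dh**2) * dx   # ~ 11.3097
for alpha in (0.0, 1.0):
    print(alpha, np.sqrt(a + alpha**2 * b))   # ~ 0.1624 and ~ 3.3669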
We use the word "norm" on the entire space S⁰[a,b] of piecewise continuous functions in this context; however, we should bear in mind that we have to identify those functions which differ only by their values at points of discontinuity. Among these norms, the case p = 2 is special because of the existence of the inner product; in that case, we could have derived the triangle inequality much more easily using the Schwarz inequality.

For the functions from S⁰[a,b], we can also define an analogue of the L∞–norm on n–dimensional vectors. Since our functions are piecewise continuous, the suprema of their absolute values on a finite closed interval always exist, so we can set
\[ \|f\|_\infty = \sup\{ |f(x)|;\ x \in [a,b] \}. \]
If we consider both the one-sided limits (which always exist by our definition) and the value of the function itself as values of f at a point of discontinuity, we can work with maxima instead of suprema. It is apparent again that this is a norm (except for the problems with values at discontinuity points).

7.3.6. Completion of metric spaces. Both the real numbers R and the complex numbers C form (with the metric given by the absolute value) a complete metric space. This is contained in the axiom of the existence of suprema. Recall that the real numbers were created as a "completion" of the space of rational numbers, which is not complete; evidently, the closure of the set Q ⊂ R is R.

Dense and nowhere-dense subsets

We say that a subset A ⊂ X of a metric space X is dense if and only if the closure of A is the whole space X. A set A is said to be nowhere dense in X if and only if the set X ∖ Ā is dense.

Evidently, A is dense in X if and only if every non-empty open set in X has a non-empty intersection with A. In all the cases of norms on functions from the previous paragraph, the resulting metric spaces are not complete, since the limit of a Cauchy sequence of functions from our vector space S⁰[a,b] may be a function which no longer belongs to this space. Consider, on the interval [0,1], the functions fₙ which vanish on [0, 1/n) and equal sin(1/x) on [1/n, 1]. They converge to the function sin(1/x) in all the Lp norms, but this function does not lie in the space.

Completion of a metric space

Let X be a metric space with metric d which is not complete. A metric space X̃ with metric d̃ such that X ⊂ X̃, d is the restriction of d̃ to the subset X, and the closure X̄ in X̃ is the whole space X̃, is called a completion of the metric space X.

Now we move to more theoretical considerations. Though these exercises may not look particularly practical, they should help in understanding the basic concepts of metric spaces and convergence, as well as their links to the topological concepts.
7.C.11. Show that the definition of a metric as a function d defined on X × X, for a non-empty set X, satisfying
(1) d(x, y) = 0 if and only if x = y, for x, y ∈ X,
(2) d(x, z) ≤ d(y, x) + d(y, z), for x, y, z ∈ X,
is equivalent to the definition given in the theoretical part, in paragraph 7.3.1.

Solution. At first glance, this definition seems to demand fewer requirements of the metric than the definition from the theoretical part. The two definitions are equivalent if and only if the two conditions of non-degeneracy and triangle inequality imply
(3) d(x, y) ≥ 0, for x, y ∈ X,
(4) d(x, y) = d(y, x), for x, y ∈ X.
Setting x = z in (2) and using (1), we get 0 = d(x, x) ≤ 2d(y, x), i.e. the non-negativity (3). Similarly, the choice y = z in (2) together with (1) implies that d(x, y) ≤ d(y, x) for all points x, y ∈ X; interchanging the variables x and y then gives d(y, x) ≤ d(x, y), i.e. (4). Thus it is proved that the definitions are equivalent. □

7.C.12. Describe all sequences in a discrete metric space X which are convergent or Cauchy.

Solution. Since the distance between two points x, y ∈ X is either 1 or zero, a sequence x₁, x₂, … is Cauchy if and only if all its terms xᵢ are equal, except for finitely many of them. But then, the sequence is convergent. □

This problem shows a behaviour quite different from the convergence of sequences in the metric spaces X = R or X = C; sequences of integers, however, would behave in a very similar way. On the other hand, we deal mostly with metrics on spaces of functions, where the intuition gained on the real line R may be useful.

7.C.13. Determine whether or not the sequence {xₙ}, where
\[ x_1 = 1, \qquad x_n = 1 + \frac{1}{2} + \dots + \frac{1}{n}, \quad n \in \mathbb{N} \smallsetminus \{1\}, \]
is a Cauchy sequence in R with the standard metric.

Solution. Recall that
\[ (1)\quad \sum_{k=1}^{\infty} \frac{1}{k} = \infty, \quad \text{and hence} \quad \sum_{k=m}^{\infty} \frac{1}{k} = \infty \ \text{ for every } m \in \mathbb{N}. \]
Therefore,
\[ \lim_{n\to\infty} |x_n - x_m| = \sum_{k=m+1}^{\infty} \frac{1}{k} = \infty, \qquad m \in \mathbb{N}. \]

The following theorem says that the completion of an arbitrary (incomplete) metric space X can be found in essentially the same way as the real numbers were created from the rationals. (Actually, the reader might read the detailed proof below with this completion of the rationals in mind; it verifies that the construction leads to R with the standard complete metric, thus satisfying the axioms of the reals.)

7.3.7. Theorem. Let X be a metric space with metric d which is not complete. Then there exists a completion X̃ of X.

Proof. The idea of the construction is identical to the one used when building the real numbers. Two Cauchy sequences xᵢ and yᵢ of points of X are considered equivalent if and only if d(xᵢ, yᵢ) converges to zero as i approaches infinity. This is convergence of real numbers, so the definition is correct. From the properties of convergence of real numbers, it is clear that the relation defined above is an equivalence; for instance, the transitivity follows from the fact that the sum of two sequences converging to zero converges to zero as well. The reader is advised to verify the details.

We define X̃ as the set of the classes of this equivalence of Cauchy sequences. The original points x ∈ X are identified with the classes of sequences equivalent to the constant sequence xᵢ = x, i = 0, 1, …. It is now easy to define the metric d̃: we put
\[ \tilde{d}(\tilde{x}, \tilde{y}) = \lim_{i\to\infty} d(x_i, y_i) \]
for sequences x̃ = {x₀, x₁, …} and ỹ = {y₀, y₁, …}. First, we have to verify that this limit exists at all and is finite. Notice the following consequence of the triangle inequality.
For x, y, and z ∈ X we have d(x, y) ≤ d(x, z) + d(y, z), and thus d(x, y) − d(x, z) ≤ d(y, z). Swapping y and z, we obtain d(x, z) − d(x, y) ≤ d(y, z). Thus,
\[ (1)\quad |d(x,y) - d(x,z)| \le d(y,z) \]
(draw a picture!). Now, the fact that both the sequences x̃ and ỹ are Cauchy sequences implies that the considered sequence of distances is also a Cauchy sequence of real numbers:
\[ |d(x_i,y_i) - d(x_j,y_j)| \le |d(x_i,y_i) - d(x_i,y_j)| + |d(x_i,y_j) - d(x_j,y_j)| \le d(y_i,y_j) + d(x_i,x_j). \]
Thus, the limit of d(xᵢ, yᵢ) exists. If we select different representatives x̃ = {x′₀, x′₁, …} and ỹ = {y′₀, y′₁, …}, then by a similar argument as above,
\[ |d(x'_i,y'_i) - d(x_i,y_i)| \le |d(x'_i,y'_i) - d(x'_i,y_i)| + |d(x'_i,y_i) - d(x_i,y_i)| \le d(x_i,x'_i) + d(y_i,y'_i). \]
Therefore, the definition is indeed independent of the choice of representatives.

We verify that d̃ is a metric on X̃. The first and second properties are clear, so it remains to prove the triangle inequality. For that purpose, choose three Cauchy representatives of the elements x̃, ỹ, z̃, and obtain
\[ \tilde{d}(\tilde{x},\tilde{z}) = \lim_{i\to\infty} d(x_i,z_i) \le \lim_{i\to\infty} d(x_i,y_i) + \lim_{i\to\infty} d(y_i,z_i) = \tilde{d}(\tilde{x},\tilde{y}) + \tilde{d}(\tilde{y},\tilde{z}). \]
The restriction of the metric d̃ just defined to the original space X is identical to the original metric, because the original points are represented by constant sequences.

Hence the sequence {xₙ} of 7.C.13 is not a Cauchy sequence. Alternatively: if {xₙ} were a Cauchy sequence, it would be convergent in the complete metric space R, which contradicts the divergence shown in (1). □

7.C.14. Repeat the question from the previous problem for the metric d given by (cf. 7.C.8)
\[ d(x,y) := \frac{|x-y|}{1+|x-y|}, \qquad x, y \in \mathbb{R}. \]

Solution. Instead of repeating the arguments, we point out the difference between the given metric and the standard one. The difference is expressed by the function f introduced in (1) of 7.C.8. This is a continuous function and, moreover, a bijection between the sets [0, ∞) and [0, 1) with f(0) = 0. The properties of being Cauchy or convergent are defined in terms of the real numbers describing the distances between the elements of the sequence; since f and its inverse are continuous at zero, these distances tend to zero in the new metric if and only if they do so in the standard one. Hence the solution for the new metric is the same as with the standard one. □

7.C.15. Determine whether or not the metric space C[−1,1] of continuous functions on the interval [−1,1], with the metric given by the norm
(a) ∥f∥_p = (∫_{−1}^{1} |f(x)|^p dx)^{1/p} for p ≥ 1;
(b) ∥f∥_∞ = max{|f(x)|; x ∈ [−1,1]},
is complete.

Solution. The case (a). For every n ∈ N, define a function
\[ f_n(x) = \begin{cases} 0, & x \in [-1, 0), \\ nx, & x \in [0, \tfrac{1}{n}], \\ 1, & x \in (\tfrac{1}{n}, 1]. \end{cases} \]
For all m ≥ n, m, n ∈ N, we have
\[ \Bigl( \int_{-1}^{1} |f_m(x) - f_n(x)|^p \, dx \Bigr)^{1/p} < \Bigl( \int_0^{1/n} 1 \, dx \Bigr)^{1/p} = \Bigl( \frac{1}{n} \Bigr)^{1/p}. \]
It follows that the sequence {fₙ} ⊂ C[−1,1] is a Cauchy sequence. Suppose the sequence {fₙ} had a ∥·∥_p limit f in C[−1,1]. We show that this limit could not be continuous at x = 0. For every ε ∈ (0,1), there exists an n(ε) ∈ N such that fₙ(x) = 0 for x ∈ [−1,0] and fₙ(x) = 1 for x ∈ [ε,1], for all n ≥ n(ε). Imagine that f(y) ≠ 1 at some y ≥ ε; since f is continuous, it would then differ from 1 on a whole interval containing y, and so ∥f − fₙ∥_p ≥ δ for some δ > 0 and all n ≥ n(ε) — a contradiction. Therefore f must satisfy f(x) = 1 for x ∈ [ε,1] with ε > 0 arbitrarily small, and similarly f(x) = 0 for x ∈ [−1,0]. Thus, necessarily,
\[ f(x) = 0,\ x \in [-1,0], \qquad f(x) = 1,\ x \in (0,1]. \]
But this function is not continuous on [−1,1], so it does not belong to the considered metric space. Therefore, the sequence {fₙ} does not have a limit in C[−1,1], and this space is not complete.
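To see the failure of completeness in case (a) concretely, the sketch below approximates ∥f_m − f_n∥_p for the ramp functions above and watches the distances shrink while the pointwise limit develops a jump at 0. Grid-based integration and numpy are assumed; this illustrates the computation, it proves nothing:

import numpy as np

def f(n, x):
    """The ramp function: 0 on [-1,0), nx on [0,1/n], 1 on (1/n,1]."""
    return np.clip(n * x, 0.0, 1.0)

x, dx = np.linspace(-1.0, 1.0, 200_001, retstep=True)
p = 2.0
for n in (10, 100, 1000):
    m = 2 * n
    dist = (np.sum(np.abs(f(m, x) - f(n, x))**p) * dx)**(1 / p)
    print(n, dist)   # decreases roughly like (1/n)**(1/p)

# the pointwise limit is the step function, which is not continuous:
print(f(10**9, np.array([-0.5, 0.0, 1e-6, 0.5])))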
It is required to prove that X is dense in X̃. Let x̃ = {xᵢ} be a fixed Cauchy sequence, and let ε > 0 be given. Since the sequence {xᵢ} is a Cauchy sequence, all pairs of its terms xₙ, xₘ with sufficiently large indices are closer to each other than ε. Then the choice y = xₙ, for one of those indices, necessarily implies that the elements y and xₘ are closer together than ε for all large m, and so d̃(ỹ, x̃) ≤ ε. Hence there is an element y of the original space whose distance from the chosen class x̃ does not exceed ε. This establishes the denseness of X.

It remains to prove that the constructed metric space is complete, that is, that every Cauchy sequence of points of the extended space X̃ with respect to the metric d̃ converges to a point of X̃. This can be done by approximating the points of a Cauchy sequence x̃_k by points z_k from the original space X, so that the resulting sequence z̃ = {z₁, z₂, …} is the limit of the original sequence with respect to the metric d̃. Since X is a dense subset of X̃, we can choose, for every element x̃_k of our fixed sequence, an element z_k ∈ X such that the constant sequence z̃_k satisfies d̃(x̃_k, z̃_k) < 1/k. Now consider the sequence z̃ = {z₁, z₂, …}. The original sequence is Cauchy, so for a fixed real number ε > 0 there is an index n(ε) such that d̃(x̃ₙ, x̃ₘ) < ε/2 whenever both m and n are greater than n(ε). Without loss of generality, the index n(ε) is greater than or equal to 4/ε. Now, for m and n greater than n(ε), we get
\[ d(z_m, z_n) = \tilde{d}(\tilde{z}_m, \tilde{z}_n) \le \tilde{d}(\tilde{z}_m, \tilde{x}_m) + \tilde{d}(\tilde{x}_m, \tilde{x}_n) + \tilde{d}(\tilde{x}_n, \tilde{z}_n) \le \frac{1}{m} + \frac{\varepsilon}{2} + \frac{1}{n} \le 2\cdot\frac{\varepsilon}{4} + \frac{\varepsilon}{2} = \varepsilon. \]
Hence {z_k} is a Cauchy sequence of elements of X, and so z̃ ∈ X̃. From the triangle inequality,
\[ \tilde{d}(\tilde{z}, \tilde{x}_n) \le \tilde{d}(\tilde{z}, \tilde{z}_n) + \tilde{d}(\tilde{z}_n, \tilde{x}_n). \]
By the previous bounds, both terms on the right-hand side converge to zero. Hence the distances d̃(x̃ₙ, z̃) approach zero, thereby finishing the proof. □

7.3.8. Uniqueness. We now consider the uniqueness of the completion of metric spaces.

The case (b) of 7.C.15. Let an arbitrary Cauchy sequence {fₙ} ⊂ C[−1,1] be given. The terms of this sequence are continuous functions fₙ on [−1,1] with the property that for every ε > 0 there is an n(ε) ∈ N such that
\[ (1)\quad \max_{x\in[-1,1]} |f_m(x) - f_n(x)| < \frac{\varepsilon}{2}, \qquad m, n \ge n(\varepsilon). \]
In particular, for every x ∈ [−1,1] we get a Cauchy sequence {fₙ(x)} ⊂ R of numbers. Since the metric space R with the usual metric is complete, every such sequence {fₙ(x)} is convergent. Set
\[ f(x) := \lim_{n\to\infty} f_n(x), \qquad x \in [-1,1]. \]
Letting m → ∞ in (1), we obtain
\[ \max_{x\in[-1,1]} |f(x) - f_n(x)| \le \frac{\varepsilon}{2} < \varepsilon, \qquad n \ge n(\varepsilon). \]
It follows that the sequence {fₙ} converges uniformly (that is, with respect to the given norm) to the function f on [−1,1]. Since a uniform limit of continuous functions is continuous (see 6.3.4), f ∈ C[−1,1]. Therefore, the metric space is complete. □

The same reasoning as above, and hence the same results, apply to the more general metric space C[a,b] of continuous functions on any closed bounded interval [a,b], and to the space C_c of continuous functions with compact support.

7.C.16. Prove that the metric space ℓ₂ is complete.

Solution. Recall that ℓ₂ is the space of sequences of real numbers with the L₂-norm, see 7.3.5.
Consider an arbitrary Cauchy sequence {xₙ} in the space ℓ₂. Every term of this sequence is again a sequence, xₙ = {xₙᵏ}_{k∈N}, n ∈ N. (The exact range of the indices does not matter: whether n, k ∈ N or n, k ∈ N ∪ {0} makes no difference.) For each fixed k ∈ N, introduce the auxiliary sequence y_k = {xₙᵏ}_{n∈N} of k-th components. If {xₙ} is a Cauchy sequence in ℓ₂, then each of the sequences y_k is a Cauchy sequence of real numbers. It follows from the completeness of R (with respect to the usual metric) that all of the sequences y_k are convergent. Denote their limits by z_k, k ∈ N. It suffices to prove that z = {z_k} ∈ ℓ₂ and that the sequence {xₙ} converges to z in ℓ₂ as n → ∞.

The sequence {xₙ} ⊂ ℓ₂ is a Cauchy sequence; therefore, for every ε > 0, there is an n(ε) ∈ N with the property that
\[ \sum_{k=1}^{\infty} (x_m^k - x_n^k)^2 < \varepsilon, \qquad m, n \ge n(\varepsilon). \]
In particular,
\[ \sum_{k=1}^{l} (x_m^k - x_n^k)^2 < \varepsilon, \qquad m, n \ge n(\varepsilon),\ l \in \mathbb{N}. \]

Isometries

A mapping φ : X₁ → X₂ between metric spaces with metrics d₁ and d₂, respectively, is called an isometry if and only if all elements x, y ∈ X₁ satisfy d₂(φ(x), φ(y)) = d₁(x, y).

Of course, every isometry is a bijection onto its image (this follows from the property that the distance between distinct elements is non-zero), and the corresponding inverse mapping is an isometry as well. Now, consider two inclusions of a dense subset, ι₁ : X → X̃₁ and ι₂ : X → X̃₂, into two completions of the space X, and denote the corresponding metrics by d, d₁, and d₂, respectively. The mapping
\[ \varphi = \iota_2 \circ \iota_1^{-1} : \iota_1(X) \to \tilde{X}_2 \]
is well-defined on the dense subset ι₁(X) ⊂ X̃₁. Its image is the dense subset ι₂(X) ⊂ X̃₂ and, moreover, this mapping is clearly an isometry. The dual mapping ι₁ ∘ ι₂⁻¹ works in the same way. Every isometric mapping maps, of course, Cauchy sequences to Cauchy sequences; and two such Cauchy sequences converge to the same element in the completion if and only if this holds for their images under the isometry φ. Thus, if such a mapping φ is defined on a dense subset X of a metric space X̃₁, then it has a unique extension to the whole of X̃₁, with values lying in the closure of the image φ(X), i.e. in X̃₂. By the previous ideas, there is a unique extension of φ to a mapping φ̃ : X̃₁ → X̃₂ which is both a bijection and an isometry. Thus, the completions X̃₁ and X̃₂ are indeed identical in this sense. Thus it is proved:

Theorem. Let X be a metric space with metric d which is not complete. Then the completion X̃ of X with metric d̃ is unique up to bijective isometries.

In the following three paragraphs, we introduce three theorems about complete metric spaces. They are highly applicable both in mathematical analysis and in verifying the convergence of numerical methods.

7.3.9. Banach's contraction principle. A mapping F : X → X on a metric space X with metric d is called a contraction mapping if and only if there is a real constant 0 ≤ C < 1 such that for all elements x, y ∈ X,
\[ d(F(x), F(y)) \le C\, d(x,y). \]

Theorem. If F is a contraction mapping on a complete metric space X, then it has a fixed point, i.e., there is a z ∈ X such that F(z) = z.

Proof. The proof naturally follows the intuitive idea that the iterative application of a contraction mapping, starting from an initial value z₀ ∈ X, should "accumulate" at some point.
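The iteration in this proof is exactly what one implements in practice. Below is a minimal Python sketch for the contraction F(x) = cos x on the complete space [0, 1] ⊂ R (there |F′(x)| = |sin x| ≤ sin 1 ≈ 0.84 < 1, so C = sin 1 works); successive distances d(z_{i+1}, z_i) shrink geometrically, as in the estimate derived below:

import math

def F(x):
    """A contraction on [0, 1]: |F'(x)| = |sin x| <= sin(1) < 1 there."""
    return math.cos(x)

z = 0.0
prev_step = None
for i in range(25):
    z_next = F(z)
    step = abs(z_next - z)
    if prev_step:
        # the ratio of successive steps stays below C = sin(1) ~ 0.8415
        print(i, z_next, step / prev_step)
    prev_step, z = step, z_next

print("fixed point ~", z, "residual", abs(F(z) - z))  # ~ 0.7390851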
Returning to the proof in 7.C.16: letting m → ∞ in the last estimate yields
\[ \sum_{k=1}^{l} (z_k - x_n^k)^2 \le \varepsilon, \qquad n \ge n(\varepsilon),\ l \in \mathbb{N}, \]
i.e. (this time letting l → ∞)
\[ (1)\quad \sum_{k=1}^{\infty} (z_k - x_n^k)^2 \le \varepsilon, \qquad n \ge n(\varepsilon). \]
In particular,
\[ \sum_{k=1}^{\infty} (z_k - x_n^k)^2 < \infty, \qquad n \ge n(\varepsilon), \]
and, at the same time,
\[ \sum_{k=1}^{\infty} (x_n^k)^2 < \infty, \qquad n \in \mathbb{N}, \]
which follows directly from {xₙ} ⊂ ℓ₂. Hence, by the triangle (Minkowski) inequality from 7.3.4, applied to the partial sums and letting the number of terms grow,
\[ \sqrt{\sum_{k=1}^{\infty} z_k^2} \le \sqrt{\sum_{k=1}^{\infty} (z_k - x_n^k)^2} + \sqrt{\sum_{k=1}^{\infty} (x_n^k)^2} < \infty. \]
It is proved that z ∈ ℓ₂. The fact that {xₙ} converges to z in ℓ₂ as n → ∞ follows from (1). □

The next problem addresses the relative power of different metrics on the same space of functions in terms of convergence. We deal with the space S_c of piecewise continuous functions with compact support, equipped with the Lp metrics; we write briefly Lp for these metric spaces. In particular, we show that convergence in Lp for some positive p does not always imply convergence in Lq for another positive q ≠ p.

7.C.17. Let 0 < p < ∞. For each positive integer n, define the function
\[ f_n(x) = \begin{cases} n^{1/p}, & -\tfrac{1}{n} \le x \le \tfrac{1}{n}, \\ 0, & \text{otherwise}. \end{cases} \]
Decide for which q the sequence {fₙ} converges in Lq.

Solution. Let q be any positive real number. Then
\[ \int_{-\infty}^{\infty} |f_n(x)|^q \, dx = \int_{-1/n}^{1/n} n^{q/p} \, dx = \frac{2}{n}\, n^{q/p} = 2\, n^{(q/p)-1}. \]
If 0 < q < p, then ∫ |fₙ(x)|^q dx → 0 as n → ∞; so ∥fₙ∥_q → 0, and the sequence converges to the zero function. Similarly, if 0 < p < q, then ∫ |fₙ(x)|^q dx → ∞ as n → ∞.

The metric space X, of course, needs to be complete; otherwise it could happen that the limit point does not exist in it. Choose an arbitrary z₀ ∈ X and consider the sequence z₀, z₁, z₂, …,
\[ z_1 = F(z_0),\ z_2 = F(z_1),\ \dots,\ z_{i+1} = F(z_i), \dots \]
From the assumptions, we have
\[ d(z_{i+1}, z_i) = d(F(z_i), F(z_{i-1})) \le C\, d(z_i, z_{i-1}) \le \dots \le C^i\, d(z_1, z_0). \]
The triangle inequality then implies that for all natural numbers j,
\[ d(z_{i+j}, z_i) \le \sum_{k=1}^{j} d(z_{i+k}, z_{i+k-1}) \le \sum_{k=1}^{j} C^{i+k-1} d(z_1, z_0) = C^i\, d(z_1,z_0) \sum_{k=1}^{j} C^{k-1} \le \frac{C^i}{1-C}\, d(z_1, z_0). \]
Now, since 0 ≤ C < 1, limₙ→∞ Cⁿ = 0, so for every positive (no matter how small) ε the right-hand expression is surely less than ε for sufficiently large indices i, that is,
\[ d(z_i, z_{i+j}) \le \frac{C^i}{1-C}\, d(z_1, z_0) \le \varepsilon. \]
However, this ensures that the sequence {zᵢ} is a Cauchy sequence. Since X is complete, the sequence has a limit z, and all that remains to be proved is F(z) = z. Every contraction mapping is continuous. Therefore,
\[ F(z) = F\bigl(\lim_{n\to\infty} z_n\bigr) = \lim_{n\to\infty} F(z_n) = \lim_{n\to\infty} z_{n+1} = z. \]
This finishes the proof. □

The next two theorems extend the intuitive understanding of the "density" of closed intervals [a,b] ⊂ R, which do not allow for any "holes". They are essential for the understanding of compactness of metric spaces. In fact, they are both special cases of more general theorems on topological spaces.

7.3.10. Cantor intersection theorem. For any set A in a metric space X with metric d, the real number
\[ \operatorname{diam} A = \sup_{x,y\in A} d(x,y) \]
is called the diameter of the set A. The set A is said to be bounded if and only if diam A < ∞.

Theorem. If A₁ ⊃ A₂ ⊃ ⋯ ⊃ Aᵢ ⊃ ⋯ is a non-increasing sequence of non-empty closed subsets of a complete metric space X and diam Aᵢ → 0, then there is exactly one point x ∈ X belonging to the intersection of all the sets Aᵢ.²

² Georg Cantor is considered the founder of set theory, which he introduced and developed in the last quarter of the 19th century. At that time, the new abstract approach to the foundations of Mathematics met fierce objections; it also led to the severe internal crisis of Mathematics at the beginning of the 20th century. This part of the history of Mathematics is fascinating.

Returning to 7.C.17: in the case 0 < p < q, the norms ∥fₙ∥_q diverge and, in particular, fₙ cannot converge to any limit. Finally, for q = p we have ∫ |fₙ(x)|^p dx = 2 for all positive integers n, and so ∥fₙ∥_p = 2^{1/p} for all n. At the same time, for any g ∈ S_c with g(x) ≠ 0 at some point x ≠ 0 where g is continuous, the distance of g from fₙ cannot converge to zero; so the only candidate for a limit is the zero function, whose distance from fₙ is constantly 2^{1/p}. It follows that fₙ converges to 0 in Lq for 0 < q < p, but it does not converge in Lq with q ≥ p. □
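A quick grid computation makes the three regimes of 7.C.17 visible: with p = 2, the Lq norms of fₙ tend to 0 for q < 2, stay at 2^{1/2} for q = 2, and blow up for q > 2. As before, numpy and a crude Riemann sum are assumed:

import numpy as np

p = 2.0

def lq_norm_fn(n, q, grid_size=2_000_001):
    """Crude Riemann-sum approximation of ||f_n||_q for the box function."""
    x, dx = np.linspace(-1.0, 1.0, grid_size, retstep=True)
    fn = np.where(np.abs(x) <= 1.0 / n, n**(1.0 / p), 0.0)
    return (np.sum(fn**q) * dx)**(1.0 / q)

for q in (1.0, 2.0, 4.0):
    print("q =", q, [round(lq_norm_fn(n, q), 4) for n in (10, 100, 1000)])
# q = 1: decreasing to 0; q = 2: constantly ~ 2**0.5; q = 4: growing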
The next problem deals with the extremely useful Banach fixed point theorem, showing the necessity of all the requirements in Theorem 7.3.9.

7.C.18. Show that the mapping f : [0, ∞) → [0, ∞) given by f(x) = x + e⁻ˣ satisfies, for all x ≠ y, the condition
\[ |f(x) - f(y)| < |x - y|, \]
but it does not have any fixed point, i.e. f(x) ≠ x for every x. (Thus the condition |f(x) − f(y)| ≤ C|x − y|, with a constant C < 1, in the Banach fixed point Theorem 7.3.9 is essential.)

Solution. Clearly, the function f is strictly increasing on the entire domain. Assume y < x. Then e⁻ˣ < e⁻ʸ, and by the mean value theorem applied to e⁻ᵗ (whose derivative is less than 1 in absolute value on (0, ∞)) we get 0 < e⁻ʸ − e⁻ˣ < x − y. Hence
\[ |f(x) - f(y)| = |x - y + e^{-x} - e^{-y}| < |x - y|. \]
Finally, f(x) = x would imply e⁻ˣ = 0, which is impossible. □

When dealing with the convergence of real numbers, we observed that the topological concepts of open neighbourhoods and compactness are most useful. This is of course even more true for metric spaces, where we can work with open balls of radius r, etc. The definitions remain essentially the same.

7.C.19. We have already seen the discrete metric space X ≠ ∅ with the metric d : X × X → R defined by the formula
\[ d(x,y) := \begin{cases} 1, & x \neq y, \\ 0, & x = y. \end{cases} \]
(a) Decide whether (X, d) is complete.
(b) Describe all open, closed, and bounded sets in (X, d).
(c) Describe the interior, boundary, limit, and isolated points of an arbitrary set in (X, d).
(d) Describe all compact sets in (X, d).

Solution. The case (a) was essentially dealt with in 7.C.12. For an arbitrary sequence {xₙ} to be a Cauchy sequence, it is necessary in this space that there is an index n ∈ N such that xₙ = xₙ₊ₘ for all m ∈ N. Any sequence with this property then necessarily converges to the common value xₙ = xₙ₊₁ = ⋯ (we talk about almost stationary sequences). So the metric space (X, d) is complete.

The case (b). The open 1-neighbourhood of any element contains this element only; therefore, every singleton set is open. Since the union of any number of open sets is an open set, every set is open in (X, d).

Proof (of the Cantor intersection theorem). Select one point xᵢ from each set Aᵢ. Since diam Aᵢ → 0, for every positive real number ε we can find an index n(ε) such that all the Aᵢ with indices i ≥ n(ε) have diameters less than ε. For sufficiently large indices i, j, then, d(xᵢ, xⱼ) ≤ ε, and thus our sequence is a Cauchy sequence. Therefore, it has a limit point x ∈ X. Now x must be a limit point of all the sets Aᵢ, thus it belongs to all of them (since they are all closed), and so x belongs to their intersection. This proves the existence of x. Assume that two points x and y both belong to the intersection of all the sets Aᵢ. Then d(x, y) ≤ diam Aᵢ for every i. But diam Aᵢ → 0, so d(x, y) = 0, hence x = y. This proves the uniqueness of x. □

7.3.11. Theorem (Baire theorem). If X is a complete metric space, then the intersection of every countable system of open dense sets Aᵢ is a dense set in the metric space X.³

³ This theorem is a part of the considerations of René-Louis Baire in his 1899 doctoral thesis. More generally, a topological space satisfying the property from the theorem is called a Baire space, and the theorem simply says that every complete metric space is a Baire space.

Proof. Suppose X contains a system of dense open sets Aᵢ, i = 1, 2, …. It is required to show that the set A = ∩_{i=1}^∞ Aᵢ has a non-empty intersection with every non-empty open set U ⊂ X. We proceed inductively, invoking the previous theorem.
Since A₁ is dense, there is a z₁ ∈ A₁ ∩ U, and since the set A₁ ∩ U is open, the closure of an ε₁–neighbourhood U₁ of the point z₁ (for sufficiently small ε₁) is contained in A₁ ∩ U as well. Denote the closure of this ε₁–ball U₁ by B₁. Further, suppose that the points zᵢ and their open εᵢ–neighbourhoods Uᵢ are already chosen for i = 1, …, n, so that
\[ z_k \in B_k = \bar{U}_k, \qquad z_k \in \bigcap_{i=1}^{k} B_i \subset \Bigl(\bigcap_{i=1}^{k} A_i\Bigr) \cap U. \]
Since the set Aₙ₊₁ is open and dense in X, there is a point zₙ₊₁ ∈ Aₙ₊₁ ∩ Uₙ; however, since Aₙ₊₁ ∩ Uₙ is open, the point zₙ₊₁ belongs to it together with a sufficiently small εₙ₊₁–neighbourhood Uₙ₊₁. Then the closures surely satisfy Bₙ₊₁ = Ūₙ₊₁ ⊂ Ūₙ, and so the closed set Bₙ₊₁ is contained in Aₙ₊₁ ∩ Ūₙ. Moreover, we can assume that all εₙ ≤ 1/n. If we proceed in this inductive way from the original point z₁ and the set B₁, we obtain a non-increasing sequence of non-empty closed sets Bₙ whose diameters approach zero. Therefore, there is a point z common to all of these sets; that is,
\[ z \in \bigcap_{i=1}^{\infty} \bar{U}_i = \bigcap_{i=1}^{\infty} B_i \subset \Bigl(\bigcap_{n=1}^{\infty} A_n\Bigr) \cap U, \]
which is the statement to be proved. □

Back in 7.C.19, the case (b) continues: passing to complements, every set is also closed. The fact that the 2-neighbourhood of any element coincides with the whole space implies that every set is bounded in (X, d).

The case (c). Once again, we use the fact that the open 1-neighbourhood of any element contains this element only. It follows that every point of any set is both an interior and an isolated point of that set, and that sets have neither boundary nor limit points.

The case (d). Every finite set in an arbitrary metric space is compact (it defines a compact metric space by restricting the domain of d). It follows from the classification of convergent sequences (see (a)) that no infinite set can be compact in (X, d). □

7.C.20. In the metric space S[−1,1] with the metric given by the norm ∥·∥_∞, consider the sets
\[ A = \{ f \in S[-1,1];\ f(0) \in (0,2) \}, \qquad B = \Bigl\{ f \in S[-1,1];\ \int_{-1}^{1} f(x)\,dx = 0 \Bigr\}. \]
Are these sets open? Are these sets closed?

Solution. The interior of a set M is the set of all interior points of M, usually denoted by M⁰; a set M is open if and only if M = M⁰. Similarly, we define the closure M̄ of M as the set of all points having zero distance from M; M is closed if and only if M = M̄. Since
\[ A^0 = A, \qquad \bar{A} = \{ f \in S[-1,1];\ f(0) \in [0,2] \}, \qquad B^0 = \emptyset, \qquad \bar{B} = B \]
(in particular, Ā contains the functions f for which f(0) attains any value from the whole closed interval [0,2]), the set A is open and not closed, while, the other way around, the set B is closed and not open. □

One of the most important concepts related to complete metric spaces is the principle of nested balls (cf. Theorem 7.3.10). Under an additional condition, it says that a metric space (X, d) is complete if and only if every sequence {Aₙ} of nested (i.e., Aₙ₊₁ ⊆ Aₙ, n ∈ N) non-empty closed sets Aₙ has a non-empty intersection, that is,
\[ (1)\quad \bigcap_{n\in\mathbb{N}} A_n \neq \emptyset. \]

7.C.21. Verify that the additional condition in this theorem,
\[ (2)\quad \lim_{n\to\infty} \sup\{ d(x,y);\ x, y \in A_n \} = 0, \]
cannot be omitted.
Solution. That the requirement (2) cannot be omitted is probably contrary to many readers' expectations. For a counterexample, consider the set X = N with the metric
\[ d(m,n) = 1 + \frac{1}{m+n} \ \text{ for } m \neq n, \qquad d(m,n) = 0 \ \text{ for } m = n. \]
This is indeed a metric. The first and second properties are clearly satisfied, and to prove the triangle inequality it suffices to observe that d(m, n) ∈ (1, 4/3] whenever m ≠ n. Hence the only Cauchy sequences are those which are constant from some index on — constant except for finitely many terms, the almost stationary sequences again. Thus every Cauchy sequence is convergent, so the metric space is complete. Now define
\[ A_n := \Bigl\{ m \in \mathbb{N};\ d(m,n) \le 1 + \frac{1}{2n} \Bigr\}, \qquad n \in \mathbb{N}. \]
As the inequality in their definition is not strict, the sets Aₙ are guaranteed to be closed. Since Aₙ = {n, n+1, …}, the sets {Aₙ} are nested, but with empty intersection, contrary to (1). So if the condition (2) were omitted from the statement of the principle, this complete metric space would contradict it. And indeed, (2) is not met in this case, as
\[ \lim_{n\to\infty} \sup\{ d(x,y);\ x,y \in A_n \} = \lim_{n\to\infty} \Bigl( 1 + \frac{1}{2n+1} \Bigr) = 1 \neq 0. \]
□

7.3.12. Bounded and compact sets. The following concepts facilitated our discussions when dealing with the real and complex numbers. They can be reformulated for general metric spaces with almost no change. An interior point of a subset A in a metric space is an element of A which belongs to it together with some of its ε–neighbourhoods. A boundary point of a set A is an element x ∈ X such that each of its neighbourhoods has a non-empty intersection with both A and the complement X ∖ A; a boundary point may or may not belong to the set A itself. A limit point of a set A is an element x equal to the limit of a sequence xᵢ ∈ A such that xᵢ ≠ x for all i; clearly, a limit point may or may not belong to the set A. An isolated point of a set A is an element a ∈ A such that one of its ε–neighbourhoods in X has the singleton intersection {a} with A. An open cover of a set A is a system of open sets Uᵢ ⊂ X, i ∈ I, such that their union contains A.

Compact sets

A metric space X is called compact if every sequence of points xᵢ ∈ X has a subsequence converging to some point x ∈ X. A subset A ⊂ X of a metric space is called compact if it is compact as the metric space with the restricted metric.

Clearly, the compact subsets of a discrete metric space X are exactly the finite subsets of X. In the case of the real numbers R and the complex numbers C, our definition recovers the compact subsets discussed there, and we would also like to arrive at properties as useful as those derived for real and complex numbers in paragraphs 5.2.7–5.2.8. Just as an appetizer, notice that continuous mappings between general metric spaces behave like real functions:

Theorem. Let f : X → Y be a continuous mapping between metric spaces. Then the images of compact sets are compact.

Proof. Recall that any convergent sequence of points xᵢ → x in X is mapped onto the convergent sequence f(xᵢ) → f(x) in Y. Any sequence in f(X) is the image of a sequence in X, which has a convergent subsequence; the image of that subsequence is then a convergent subsequence of the original one. Thus, the statement follows immediately from our definition of compactness via convergent subsequences. □

In particular, we obtain the most useful consequence concerning the minima and maxima of continuous real-valued functions on compact subsets:

Corollary. Let f : X → R be a real function defined on a compact metric space. Then there are points x₀ and y₀ in X such that
\[ f(x_0) = \max_{x\in X} f(x), \qquad f(y_0) = \min_{x\in X} f(x). \]

Proof. The image f(X) must be a compact subset of R, thus it must achieve both its maximum and its minimum (which are the supremum and the infimum of the bounded and closed image). □
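Returning to the counterexample of 7.C.21: the following sketch spot-checks that d(m, n) = 1 + 1/(m+n) satisfies the triangle inequality on an initial segment of N and prints the diameters of the nested sets Aₙ, which stay above 1 even though ∩Aₙ = ∅. Exact Fractions are used to avoid rounding:

from fractions import Fraction

def d(m, n):
    """The metric of 7.C.21 on the natural numbers."""
    return Fraction(0) if m == n else 1 + Fraction(1, m + n)

N = range(1, 30)
assert all(d(x, z) <= d(x, y) + d(y, z) for x in N for y in N for z in N)

# A_n = {n, n+1, ...}; its diameter is attained at the pair (n, n+1)
for n in (1, 10, 100, 1000):
    print(n, float(1 + Fraction(1, 2 * n + 1)))   # tends to 1, not to 0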
7.C.22. Determine whether the set (known as the Hilbert cube)
\[ A = \bigl\{ \{x_n\}_{n\in\mathbb{N}} \in \ell_2;\ |x_n| \le \tfrac{1}{n},\ n \in \mathbb{N} \bigr\} \]
is compact in ℓ₂. Then determine the compactness of the set
\[ B = \bigl\{ \{x_n\}_{n\in\mathbb{N}} \in \ell_\infty;\ |x_n| < \tfrac{1}{n},\ n \in \mathbb{N} \bigr\} \]
in the space ℓ∞.

Solution. The space ℓ₂ is complete (see 7.C.16), and every closed subset of a complete metric space defines a complete metric space. The set A is evidently closed in ℓ₂, so it suffices to show that it is totally bounded; by Theorem 7.3.13(3), it is then compact. To do that, we construct an ε-net of A for any given ε > 0. Begin with the well-known series
\[ \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6}. \]
For every ε > 0, there is an n(ε) ∈ N satisfying
\[ \sqrt{ \sum_{k=n(\varepsilon)+1}^{\infty} \frac{1}{k^2} } < \frac{\varepsilon}{2}. \]
From each of the intervals [−1/n, 1/n], n ∈ {1, …, n(ε)}, choose finitely many points x₁ⁿ, …, x_{m(n)}ⁿ so that for every x ∈ [−1/n, 1/n],
\[ \min_{j\in\{1,\dots,m(n)\}} |x - x_j^n| < \frac{\varepsilon}{\sqrt{5^n}}. \]
Consider those sequences {yₙ} ∈ ℓ₂ whose terms with indices n > n(ε) are zero and which satisfy
\[ y_1 \in \{x_1^1, \dots, x_{m(1)}^1\},\ \dots,\ y_{n(\varepsilon)} \in \{x_1^{n(\varepsilon)}, \dots, x_{m(n(\varepsilon))}^{n(\varepsilon)}\}. \]
There are only finitely many such sequences, and they create the desired ε-net for A, as we verify below.

The concept of boundedness is a little more complicated in the case of general metric spaces. For any point x and subset B ⊂ X of a metric space X with metric d, we define their distance⁴
\[ \operatorname{dist}(x, B) = \inf_{y\in B} \{ d(x,y) \}. \]

⁴ Notice that the distance between two subsets A, B ⊂ X should express how "different" they are. Thus one defines the (Hausdorff) distance as follows:
\[ \operatorname{dist}(A,B) = \max\bigl\{ \sup\{\operatorname{dist}(x,B);\ x \in A\},\ \sup\{\operatorname{dist}(y,A);\ y \in B\} \bigr\}. \]
This distance is finite for bounded sets, and it is easy to see that it vanishes if and only if the closures of A and B coincide. In particular, the distances of the point x and of the one-point set {x} are in general very different: dist(x, A) ≠ dist({x}, A).

We say that a metric space X is totally bounded if for every positive real number ε there is a finite set A such that dist(x, A) < ε for all points x ∈ X. We call such an A an ε-net of X. Recall that a metric space is called bounded if it has a finite diameter. We can immediately see that a totally bounded space is always bounded. Indeed, the diameter of a finite set is finite, and if A is the finite set corresponding to ε in the definition of total boundedness, then the distance d(x, y) of any two points can be bounded by the sum of dist(x, A), dist(y, A), and diam A, which is a finite number. In the case of a metric on a subset of a finite-dimensional Euclidean space, these concepts coincide, since the boundedness of a set guarantees the boundedness of all coordinates in a fixed orthonormal basis, and this implies total boundedness. (Verify this in detail by yourselves!)

The next theorem provides the promised, very useful alternative characterisations of compactness.

7.3.13. Theorem. The following statements about a metric space X are equivalent:
(1) X is compact;
(2) every open covering of X, X = ∪_{i∈I} Uᵢ, contains a finite covering X = ∪_{k=1}^{n} U_{j_k}, where all j_k ∈ I;
(3) X is complete and totally bounded.

Proof. We show consecutively the implications (1) ⟹ (3) ⟹ (2) ⟹ (1). Start by assuming that X is compact. Then for each Cauchy sequence of points xᵢ there is a subsequence x_{iₙ} converging to a point x ∈ X. We just have to verify that the initial sequence also converges to the same limit, xᵢ → x; this is easy, and we leave it to the reader. So X is complete. Suppose X is not totally bounded. Then there is ε > 0 such that no finite ε-net exists in X, and hence there is a sequence of points xᵢ such that d(xᵢ, xⱼ) ≥ ε for all i ≠ j. (Verify this almost obvious claim — look at the definition of ε-nets!)
No subsequence of such a sequence can be a Cauchy sequence, so X is not compact. This contradicts (1), and we conclude that X is totally bounded.

Returning to 7.C.22: let x = {x_k} ∈ A be arbitrary. According to our choice of the candidate coordinates, there is a sequence y = {y_k} in our finite family such that
\[ d(x, y) = \sqrt{ \sum_{k=1}^{\infty} (x_k - y_k)^2 } \le \sqrt{ \sum_{k=1}^{n(\varepsilon)} (x_k - y_k)^2 } + \sqrt{ \sum_{k=n(\varepsilon)+1}^{\infty} x_k^2 } < \sqrt{ \frac{\varepsilon^2}{5} + \frac{\varepsilon^2}{5^2} + \dots + \frac{\varepsilon^2}{5^{n(\varepsilon)}} } + \frac{\varepsilon}{2} \le \varepsilon \sqrt{ \frac{1}{1-\frac{1}{5}} - 1 } + \frac{\varepsilon}{2} = \varepsilon. \]
Since ε > 0 was arbitrary, the set A is totally bounded, which implies compactness.

The closure of the set B in ℓ∞ is
\[ \bar{B} = \bigl\{ \{x_n\}_{n\in\mathbb{N}} \in \ell_\infty;\ |x_n| \le \tfrac{1}{n},\ n \in \mathbb{N} \bigr\}. \]
Hence B is not closed, and so it is not compact. The closure B̄, however, is compact; the proof of this fact is much simpler than for the set A, so we leave it as an exercise for the reader. □

7.C.23. Prove that on each metric space X, the given metric d is a continuous function X × X → R. ⃝

7.C.24. Show that if F is a continuous mapping on a compact metric space X such that
\[ d(F(x), F(y)) < d(x, y) \quad \text{for all } x \neq y, \]
then F has a fixed point.

Solution. The infimum α of the values of the continuous function x ↦ d(x, F(x)) must be achieved at some point x₀ ∈ X (see 7.3.12 for the concepts and main results, and use the previous result 7.C.23). Since distances are non-negative, α ≥ 0. If α ≠ 0, then
\[ d(F(x_0), F(F(x_0))) < d(x_0, F(x_0)) = \alpha, \]
which is a contradiction. Hence α = 0, i.e. F(x₀) = x₀. □

The next implication, namely (3) ⟹ (2), is more demanding. So assume X is complete and totally bounded, but X does not satisfy (2). Then there is an open covering U_α, α ∈ I, of X which does not contain any finite covering. Choose a sequence of positive real numbers ε_k → 0 and consider the finite ε_k–nets from the definition of total boundedness. Further, for each k, consider the system A_k of closed balls with centres at the points of the ε_k-net and diameters 2ε_k. Clearly, each such system A_k covers the entire space X. Altogether, there must be at least one closed ball C in the system A₁ which is not covered by a finite number of the sets U_α; call it C₁ and notice that diam C₁ ≤ 2ε₁. Next, consider the sets C₁ ∩ C with balls C ∈ A₂, which together cover the entire set C₁. Again, at least one of them cannot be covered by a finite number of the U_α; we call it C₂. This way, we inductively construct a sequence of sets C_k satisfying C_{k+1} ⊂ C_k and diam C_k ≤ 2ε_k, with ε_k → 0, such that none of them can be covered by a finite number of the open sets U_α. Finally, we choose one point x_k ∈ C_k in each of these sets. By construction, this must be a Cauchy sequence; consequently, it has a limit x, since X is complete. Thus there is U_{α₀} containing x, and it contains some δ-neighbourhood B_δ(x) as well. But for large k we have both d(x, x_k) < δ/2 and diam C_k < δ/2, so C_k ⊂ B_δ(x) ⊂ U_{α₀}, which contradicts the choice of the C_k.

The remaining step is to show the implication (2) ⟹ (1). Assume (2) and, considering any sequence of points xᵢ ∈ X, set
\[ C_n = \overline{ \{ x_k;\ k \ge n \} }. \]
The intersection of these closed sets must be non-empty by the following general lemma.

Lemma. Let X be a metric space such that property (2) in the Theorem holds. Consider a system of closed sets D_α, α ∈ I, such that each of its finite subsystems
D_{α₁}, …, D_{α_k} has a non-empty intersection. Then also ∩_{α∈I} D_α ≠ ∅.

This simple lemma is proved by contradiction, again. If the latter intersection were empty, then
\[ X = X \setminus \Bigl( \bigcap_{\alpha\in I} D_\alpha \Bigr) = \bigcup_{\alpha\in I} (X \setminus D_\alpha) = \bigcup_{\alpha\in I} V_\alpha, \]
where the V_α = X ∖ D_α are open sets. Thus, by (2), there must be a finite number of them, V_{α₁}, …, V_{αₙ}, covering X too. Then we obtain
\[ X = \bigcup_{i=1}^{n} V_{\alpha_i} = \bigcup_{i=1}^{n} (X \setminus D_{\alpha_i}) = X \setminus \Bigl( \bigcap_{i=1}^{n} D_{\alpha_i} \Bigr). \]
This is a contradiction with our assumptions on the D_α, and the lemma is proved.

Now, let x ∈ ∩_{n=1}^∞ Cₙ. By construction, there is a subsequence x_{n_k} in our sequence of points xₙ ∈ X such that d(x_{n_k}, x) < 1/k. This is a converging subsequence, and so the proof is complete. □

As an immediate corollary of the latter theorem, each closed subset of a compact metric space is again compact: subsets of a totally bounded set are totally bounded, and closed subsets of a complete metric space are also complete. Another consequence is an alternative proof that a subset K ⊂ Rⁿ is compact if and only if it is closed and bounded. Notice also that while the conditions (1) and (3) are given in terms of the metric, the equivalent condition (2) is purely topological.

7.3.14. Continuous functions. We revisit the questions related to the continuity of mappings between metric spaces. In fact, many ideas understood for functions of one real variable generalize naturally. In particular, as we already showed, every continuous function f : X → R on a compact set X is bounded and achieves its maximum and minimum. Here is another argument for this, using the purely topological concept of compactness: the open intervals Uₙ = (n−1, n+1) ⊂ R, n ∈ Z, cover R. Then their preimages f⁻¹(Uₙ) cover X, so there is a finite number of them covering X as well. Thus f is bounded, and the supremum and infimum of its values exist. Consider sequences f(xₙ) and f(yₙ) converging to the supremum and the infimum, respectively. Then there must be convergent subsequences of the points xₙ and yₙ in X, and their limits x and y are in X too. But then f(x) and f(y) are the supremum and the infimum of the values of f, since f is continuous and thus respects convergence.

We should also appreciate the differences between the "purely topological" concepts, such as continuity (which can be defined merely by means of open sets), and the following stronger concepts, which are "metric" properties.

Uniformly continuous mappings

A mapping f : X → Y between metric spaces is called uniformly continuous if for each ε > 0 there is a δ > 0 such that d_Y(f(x), f(y)) < ε for all x, y ∈ X with d_X(x, y) < δ.

Notice the following simple lemma:

Lemma. A mapping f : X → Y between metric spaces is uniformly continuous if and only if, for each pair of sequences x_k and y_k in X, d_X(x_k, y_k) → 0 implies d_Y(f(x_k), f(y_k)) → 0.

Proof. If f is uniformly continuous, then the claim about the sequences is obvious. Now, assume that the property of sequences holds true but f is not uniformly continuous. Then, employing the negation of the defining property, there is an ε > 0 such that for every δ = 1/n there are xₙ, yₙ with d_X(xₙ, yₙ) < 1/n and d_Y(f(xₙ), f(yₙ)) ≥ ε. But this violates the assumed condition. □

This observation leads to the following generalization of the behaviour of real functions:

Theorem. Each continuous mapping f : X → Y on a compact metric space X is uniformly continuous.

Proof. Assume f is a continuous mapping.
Consider any two sequences x_k and y_k in X with d_X(x_k, y_k) → 0, and assume that their images do not satisfy d_Y(f(x_k), f(y_k)) → 0. Then, for some ε > 0, there must be a subsequence (x_{k_n}, y_{k_n}) for which d_Y(f(x_{k_n}), f(y_{k_n})) > ε. Since X is compact, there is a further subsequence of the x_{k_n} converging to some point x ∈ X. Without loss of generality, we may assume that the original sequence (x_k, y_k) already has all these properties (we are aiming at a contradiction with the assumed behaviour of the images, via the lemma above). Of course,
\[ d_X(x, y_k) \le d_X(x, x_k) + d_X(x_k, y_k) \to 0, \]
and so lim_{k→∞} y_k = x, too. Next, notice that the metric d_Y : Y × Y → R is always a continuous function. Indeed, recall the definition of the product metric as the maximum of the distances of the components, see 7.C.1. With this definition, the continuity of d_Y is obvious from the estimate
\[ |d(x,y) - d(x',y')| \le d(x,x') + d(y,y'), \]
cf. 7.3.8(1). Now, the continuity of f ensures
\[ \lim_{k\to\infty} d_Y(f(x_k), f(y_k)) = d_Y(f(x), f(x)) = 0. \]
But this violates our assumption on the sequences, and the proof is complete. □

A very useful variation on the theme of continuity is the following definition.

Lipschitz continuity

A mapping f : X → Y between metric spaces is called Lipschitz continuous if there is a constant C > 0 such that for all points x, y ∈ X,
\[ d_Y(f(x), f(y)) \le C\, d_X(x, y). \]

Every Lipschitz continuous mapping is uniformly continuous.

7.3.15. Arzelà–Ascoli Theorem. To conclude, we consider some spaces of functions. These provide examples of how much such spaces may differ from the usual Euclidean ones. First, we introduce some terminology. Basically, we want to deal with families of functions which are all uniformly continuous in the very same way:

Equicontinuous functions

Consider a set M of mappings f : X → Y between metric spaces. We say that the functions in M are equicontinuous if for each ε > 0 there is a δ > 0 such that d_Y(f(x), f(y)) < ε for all x, y ∈ X with d_X(x, y) < δ, and for all functions f ∈ M.

Consider the metric space C(X) of all continuous (real or complex) functions on a compact metric space X, with its ∥·∥_∞ norm. This means that the distance between two functions f, g is the maximum of the distances between their values f(x) and g(x) over x ∈ X. We say that a set M ⊂ C(X) of real functions is uniformly bounded if there is a constant K ∈ R such that |f(x)| ≤ K for all functions f ∈ M and all points x ∈ X. Of course, bounded sets M of functions in C(X) are always uniformly bounded, by the definition of the norm.

Theorem. Consider a compact metric space X. A set M ⊂ C(X) in the space of continuous functions with the supremum norm ∥·∥_∞ is compact if and only if it is bounded, closed, and equicontinuous.⁵

⁵ A weaker version, providing a sufficient condition, was first published by Ascoli in 1883; the complete exposition was given by Arzelà in 1895. Again, there are much more general versions of this theorem in the realm of topological spaces.

Proof. Suppose M is compact. Then M is totally bounded (and thus also uniformly bounded, as noticed above). Since every compact subset is closed, it remains to verify the equicontinuity. Given ε > 0, consider the corresponding ε-net (f₁, f₂, …, f_k) ⊂ M from the definition of the total boundedness of M. Recall that all the functions fᵢ are uniformly continuous (as continuous functions on a compact set). Thus, for each fᵢ there is a δᵢ such that d_X(x, y) < δᵢ implies |fᵢ(x) − fᵢ(y)| < ε. Of course, we take δ to be the minimum of the finitely many δᵢ, i = 1, …, k. Then the same inequality holds for all the fᵢ in the ε-net.
7.3.15. Arzelà-Ascoli Theorem. To conclude, we consider some spaces of functions. These provide examples of how much metric spaces may differ from the usual Euclidean spaces. First, we introduce some terminology. Basically, we want to deal with families of functions which are all uniformly continuous in the very same way:
Equicontinuous functions
Consider a space M of mappings f : X → Y between metric spaces. We say that the functions in M are equicontinuous, if for each ε > 0 there is a δ > 0, such that dY(f(x), f(y)) < ε for all x, y ∈ X with dX(x, y) < δ, for all functions f ∈ M.
Consider the metric space C(X) of all continuous (real or complex) functions on a compact metric space X, with its ∥ ∥∞ norm. This means that the distance between two functions f, g is the maximum of the distance between their values f(x) and g(x) for x ∈ X.
We say that a set M ⊂ C(X) of real functions is uniformly bounded, if there is a constant K ∈ R such that |f(x)| ≤ K for all functions f ∈ M and points x ∈ X. Of course, bounded sets M of functions in C(X) are always uniformly bounded, by the definition of the norm.
Theorem. Consider a compact metric space X. A set M ⊂ C(X) in the space of continuous functions with the supremum norm ∥ ∥∞ is compact if and only if it is bounded, closed, and equicontinuous.⁵
Proof. Suppose M is compact. Then M is totally bounded (and thus also uniformly bounded, as noticed above). Since every compact subset is closed, it remains to verify the equicontinuity. Given ε > 0, consider the corresponding ε-net {f1, f2, . . . , fk} ⊂ M from the definition of the total boundedness of M. Recall that all the functions fi are uniformly continuous (as continuous functions on a compact set). Thus there is a δi for each fi, such that dX(x, y) < δi implies |fi(x) − fi(y)| < ε. Of course, we take δ to be the minimum of the finitely many δi, i = 1, . . . , k. Then the same inequality holds for all fi in the ε-net. But now, considering an arbitrary function f ∈ M, there is a function fj in our ε-net with ∥f − fj∥∞ < ε, and so dX(x, y) < δ implies that |f(x) − f(y)| is at most |f(x) − fj(x)| + |fj(x) − fj(y)| + |fj(y) − f(y)| < 3ε, and the equicontinuity has been proved.
Conversely, suppose that M is a bounded, closed, and equicontinuous subset of C(X), with X a compact metric space. First we show that M is complete. Obviously, a Cauchy sequence in the ∥ ∥∞ norm is also a pointwise Cauchy sequence of real or complex numbers. Thus the sequence of functions converges pointwise. Next, the pointwise limit must be bounded, and the equicontinuity property implies that this limit is continuous as well. Since M is closed, the limit belongs to M.
Thus, we need to find a Cauchy (sub)sequence within any sequence of functions fn ∈ M. The compact space X itself is totally bounded and therefore it contains a countable dense set A ⊂ X (we may take the points in all 1/k-nets for k ∈ N). Write A = {a1, a2, . . . } as a sequence. Choose a subsequence of functions f_{1j}, j = 1, 2, . . . , within the functions fn, so that the sequence of values f_{1j}(a1) converges. (This is possible, since the set M is bounded in the ∥ ∥∞ norm.) Similarly, the subsequence f_{2j} can be chosen from f_{1j}, so that f_{2j}(a2) converges. In general, the m-th subsequence is chosen from f_{(m−1)j} so that the values f_{mj}(am) converge (and, by our construction, it converges in all ai, i < m, too).
As a result, we can choose the diagonal sequence of functions gk = f_{kk} for all positive integers k, with the hope that this is a Cauchy sequence. This is where the equicontinuity helps. Start with any ε > 0 and find δε > 0, such that |f(x) − f(y)| < ε for all f ∈ M whenever the arguments x and y are closer than δε. Let Aε ⊂ A be a subset forming a δε-net. This is a finite set, and so there must be an n ∈ N such that for all i, j ≥ n and all a ∈ Aε, we know |gi(a) − gj(a)| < ε. But then, for every x ∈ X, there is some a ∈ Aε with dX(x, a) < δε, and so |gi(x) − gj(x)| can be at most |gi(x) − gi(a)| + |gi(a) − gj(a)| + |gj(a) − gj(x)| < 3ε.
Thus, the sequence gk is a Cauchy sequence in C(X); by the completeness of M it converges in M, and so M is compact. □
⁵A weaker version providing a sufficient condition was first published by Ascoli in 1883; the complete exposition was given by Arzelà in 1895. Again, there are much more general versions of this theorem in the realm of topological spaces.
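To get a feeling for what equicontinuity rules out, here is a small numerical illustration of our own (not from the text): the family fn(x) = xⁿ on [0, 1] is uniformly bounded, but near x = 1 the functions change arbitrarily fast, so the family is not equicontinuous.
# f_n(x) = x^n on [0,1]: the points 1 and 1 - 1/n get arbitrarily close,
# yet |f_n(1) - f_n(1 - 1/n)| stays above 1 - 1/e for all n
for n in [5, 50, 500]:
    y = 1 - 1/n
    print(n, (1 - y).n(), (1 - y^n).n())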
D. Additional exercises to the whole chapter
Let us begin this section with problems similar to those described in the introduction of Chapter 7 (cf. 7.A.1, 7.A.2).
7.D.1. Let S0[0, 1] be the space of F-valued piecewise continuous functions defined on [0, 1], where F ∈ {R, C}, endowed with the L2-inner product ⟨f, g⟩ := ∫_0^1 f(x)g(x) dx. With respect to the induced norm ∥ · ∥2 compute:
(a) ∥f∥2, where f(x) = x + b, with b being some constant;
(b) the distance ∥f − g∥2 between f(x) = cos(2πx) and g(x) = sin(2πx).
Solution. (a) By definition, ∥f∥2 = √⟨f, f⟩ = √(∫_0^1 (x + b)² dx), and we easily compute that ∫_0^1 (x + b)² dx = [x³/3]_0^1 + 2[bx²/2]_0^1 + [b²x]_0^1 = 1/3 + b + b². Thus ∥f∥2 = √(1/3 + b + b²).
For (b) and the distance between the given f, g, we compute ⟨f − g, f − g⟩ = ∫_0^1 (cos(2πx) − sin(2πx))² dx = ∫_0^1 (1 − 2 cos(2πx) sin(2πx)) dx = ∫_0^1 (1 − sin(4πx)) dx = [x + cos(4πx)/(4π)]_0^1 = 1, where the third equality relies on the identity sin(2t) = 2 sin(t) cos(t). Thus ∥f − g∥2 = √1 = 1. □
7.D.2. Compute the L2-norm of the function f(x) = e^{ix}, defined on the interval [−π, π].
Solution. We can express f as f(x) = e^{ix} = cos(x) + i sin(x), for all x ∈ [−π, π]. Then, with respect to the L2-product, we see that ⟨f, f⟩ = ∫_{−π}^{π} (cos(x) + i sin(x))(cos(x) − i sin(x)) dx = ∫_{−π}^{π} (cos²(x) + sin²(x)) dx = ∫_{−π}^{π} dx = 2π. Thus ∥f∥2 = √(2π). □
7.D.3. Consider the vector space R3[x] of polynomials of one variable of degree at most 3 with real coefficients, endowed with the L2-inner product (i.e., the same inner product ⟨ , ⟩ as the one in Problem 7.D.1). Let W be the subspace of R3[x] having as basis the set B = {x, x²}. Find an orthonormal basis of W with respect to ⟨ , ⟩.
Solution. Set E1 = x and E2 = x². The Gram–Schmidt procedure states that an orthogonal basis of W is given by {w1 = E1, w2 = E2 − (⟨E2, w1⟩/∥w1∥2²) w1}. We compute ⟨E2, w1⟩ = ∫_0^1 x³ dx = [x⁴/4]_0^1 = 1/4, and ⟨w1, w1⟩ = ∫_0^1 x² dx = [x³/3]_0^1 = 1/3. Thus w2 = E2 − (3/4)E1 = x² − (3/4)x. As a verification, observe that ⟨w1, w2⟩ = ⟨x, x²⟩ − (3/4)⟨x, x⟩ = 1/4 − (3/4)·(1/3) = 0. Consequently, an orthonormal basis of W is given by the "vectors" {ŵ1 = w1/∥w1∥2, ŵ2 = w2/∥w2∥2}, where ∥w1∥2 = 1/√3 and ∥w2∥2 = √(∫_0^1 (x² − (3/4)x)² dx) = 1/√80, respectively. □
7.D.4. Consider the inner product space (R3[x], ⟨ , ⟩) of the previous task 7.D.3, and let W be the subspace of R3[x] spanned by the basis {1, x²}. Find a basis of the orthogonal complement W⊥ of W with respect to ⟨ , ⟩.
Solution. By assumption, W is spanned by the set {1, x²}. Thus, a vector f(x) = a3x³ + a2x² + a1x + a0 ∈ W⊥ must satisfy ⟨f(x), 1⟩ = 0 and ⟨f(x), x²⟩ = 0. This gives the following system of equations: a3/4 + a2/3 + a1/2 + a0 = 0 and a3/6 + a2/5 + a1/4 + a0/3 = 0. We can solve this in Sage via the cell
var("a3", "a2", "a1", "a0")
solve([(1/4)*a3+(1/3)*a2+(1/2)*a1+a0==0,(1/6)*a3+(1/5)*a2+(1/4)*a1+(1/3)*a0==0], a3, a2, a1, a0)
Sage returns the solution
[[a3 == 16*r1 + 3*r2, a2 == -15*r1 - 15/4*r2, a1 == r2, a0 == r1]]
which means that a3 = 16a + 3b and a2 = −15(a + b/4), with a1 = b ∈ R and a0 = a ∈ R being the free variables. Thus f(x) = b(3x³ − (15/4)x² + x) + a(16x³ − 15x² + 1), and this implies that W⊥ = span{f1(x), f2(x)}, where f1(x) = 3x³ − (15/4)x² + x and f2(x) = 16x³ − 15x² + 1, respectively. It remains to prove that f1(x), f2(x) are linearly independent, which one can easily check (for instance, one may compute the Wronskian as in Problem 6.D.16). Thus {f1, f2} is a basis of W⊥. □
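The arithmetic in 7.D.3 is easy to double-check in Sage; the following cell is our own quick verification sketch (w1, w2 are the Gram–Schmidt vectors computed above):
w1 = x
w2 = x^2 - 3/4*x
print(integral(w1*w2, x, 0, 1))        # 0, so w1 and w2 are orthogonal
print(sqrt(integral(w1^2, x, 0, 1)))   # 1/sqrt(3)
print(sqrt(integral(w2^2, x, 0, 1)))   # 1/sqrt(80), printed as 1/(4*sqrt(5))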
7.D.5. Given the vector subspace ⟨sin(x), x⟩ of the space of continuous real-valued functions defined on the interval [0, π], complete the function x to an orthogonal basis of the subspace and determine the orthogonal projection of the function (1/2) sin(x) onto it. ⃝
7.D.6. Given the vector subspace ⟨cos(x), x⟩ of the space of continuous real-valued functions defined on the interval [0, π], complete the function cos(x) to an orthogonal basis of the subspace and determine the orthogonal projection of the function sin(x) onto it. ⃝
7.D.7. Show that the system of functions 1, sin(x), cos(x), . . . , sin(nx), cos(nx), . . . is orthogonal with respect to the L2 inner product on the interval I = [−π, π]. ⃝
Next we will focus on tasks about Fourier series of periodic functions, and their convergence.
7.D.8. Let f : (−π, π) → R be the piecewise function defined by f(x) = −1 if −π < x < −π/3; f(x) = 0 if −π/3 ≤ x ≤ π/3; f(x) = 1 if π/3 < x < π.
(1) Find the 2π-periodic extension of f and sketch its graph via Sage for x ∈ (−3π, 3π);
(2) Prove that the Fourier series F of the 2π-periodic extension of f is a sine series. Next determine F explicitly;
(3) Use Sage to verify your computations in (2);
(4) Present the 5th partial sum of the approximation. Next confirm your result via Sage and plot F5, F10, F20 and F100 together with f, for x ∈ (−π, π);
(5) Use Sage to specify the 5th, 10th, and 100th coefficient of the Fourier series F corresponding to f.
Solution. (1) The 2π-periodic extension of f is defined by f̂(x) = f(x) for all −π < x < π, with f̂(x) = f̂(x + 2π) for all x ∈ R. To plot this periodic extension in Sage for x ∈ (−3π, 3π), we can use the piecewise function appropriately. In particular, the block
f1(x)=-1; f2(x)=0; f3(x)=1
f=piecewise([[(-3*pi,-7*pi/3),f1],[(-7*pi/3,-5*pi/3),f2], [(-5*pi/3,-pi),f3],[(-pi,-pi/3),f1], [(-pi/3,pi/3),f2],[(pi/3,pi),f3], [(pi,5*pi/3),f1], [(5*pi/3,7*pi/3),f2], [(7*pi/3,3*pi),f3]])
a = f.plot(x, -3*pi, 3*pi, thickness=2, color="red", exclude=[-7*pi/3, -5*pi/3, -pi, -pi/3, pi/3, pi, 5*pi/3, 7*pi/3])
a.show(ticks=pi/3, tick_formatter=pi)
produces the required figure. One may avoid using the exclude option (which removes the jump discontinuities from the plot); in that case Sage produces a figure in which the vertical lines at the jumps should not be considered as part of the required graph.
(2) It is easy to see that f(x) (and hence f̂) is an odd function, thus an = 0 for all n. Consequently, the Fourier series of f is a sine series. As for bn, first notice that the function g(x) := f(x) sin(nx) satisfies g(x) = −sin(nx) if −π < x < −π/3, g(x) = 0 if −π/3 ≤ x ≤ π/3, and g(x) = sin(nx) if π/3 < x < π, while g(−x) attains exactly the same values. Thus g(−x) = g(x) for all x ∈ (−π, π), that is, g is an even function. Therefore,
bn = (1/π) ∫_{−π}^{π} f(x) sin(nx) dx = (2/π) ∫_0^{π} f(x) sin(nx) dx = (2/π) [∫_0^{π/3} f(x) sin(nx) dx + ∫_{π/3}^{π} f(x) sin(nx) dx] = (2/π) ∫_{π/3}^{π} 1 · sin(nx) dx = (2/π) [−cos(nx)/n]_{π/3}^{π} = (2/(nπ)) (−cos(nπ) + cos(nπ/3)) = (2/(nπ)) (−(−1)ⁿ + cos(nπ/3)) = (2/(nπ)) ((−1)^{n+1} + cos(nπ/3)).
Consequently, the Fourier series of the function at hand has the form
F(x) = (2/π) Σ_{n=1}^{∞} (((−1)^{n+1} + cos(nπ/3))/n) sin(nx).
(3) To confirm the previous expression in Sage, it suffices to confirm the expressions of the Fourier coefficients, which can be done via the method presented in 7.A.10, that is,
f1(x)=-1; f2(x)=0; f3(x)=1
g=piecewise([[(-pi, -pi/3), f1], [[-pi/3, pi/3], f2],[(pi/3, pi), f3]]); L=pi; n=var("n")
an=(1/L)*integral(f1(x)*cos(n*pi*x/L),x,-pi,-pi/3)+(1/L)*integral(f2(x)*cos(n*pi*x/L),x,-pi/3,pi/3)\
+(1/L)*integral(f3(x)*cos(n*pi*x/L), x, pi/3, pi); show("a_n=", an.expand())
bn=(1/L)*integral(f1(x)*sin(n*pi*x/L),x,-pi,-pi/3)+(1/L)*integral(f2(x)*sin(n*pi*x/L),x,-pi/3,pi/3)\
+(1/L)*integral(f3(x)*sin(n*pi*x/L), x, pi/3, pi); show( "b_n=", bn.expand())
Executing this block, we obtain the desired result: a_n = 0, b_n = −2 cos(πn)/(πn) + 2 cos(πn/3)/(πn).
(4) An application of the formula given in (2) yields F5(x) = 3 sin(x)/π − 3 sin(2x)/(2π) − 3 sin(4x)/(4π) + 3 sin(5x)/(5π).
To confirm this expression in Sage, simply add the following cell to the block above:
pS5 = f.fourier_series_partial_sum(5,pi); show("5th partial sum = ",pS5)
To generate the required graphs, include the following syntax in the initial block in (3):
pS5 = f.fourier_series_partial_sum(5,pi); pS10 = f.fourier_series_partial_sum(10,pi)
pS20 = f.fourier_series_partial_sum(20,pi); pS100 = f.fourier_series_partial_sum(100,pi)
a = f.plot(x, -pi, pi, thickness=2, color="red", legend_label=r"$f$", exclude=[-pi/3, pi/3])
s1 = plot(pS5, x, -pi, pi, linestyle="--", color="midnightblue", legend_label=r"$F_5$")
s2 = plot(pS10, x, -pi, pi, linestyle="--", color="midnightblue", legend_label=r"$F_{10}$")
s3 = plot(pS20, x, -pi, pi, linestyle="--", color="midnightblue", legend_label=r"$F_{20}$")
s4 = plot(pS100, x, -pi, pi, linestyle="--",color="midnightblue", legend_label=r"$F_{100}$")
(a+s1).show(); (a+s2).show(); (a+s3).show(); (a+s4).show()
Running the code all together, we obtain the four figures showing the partial sums F5(x), F10(x), F20(x) and F100(x).
(5) By the result in (4), we already know the 5th coefficient, which is given by 3/(5π). For a Sage description of the 10th and 100th coefficient of F, one can proceed as described in 7.A.14. In fact, the Fourier series is a sine series, hence we need the function f.fourier_series_sine_coefficient(k), and the implementation of the method is as follows:
f1(x)=-1; f2(x)=0; f3(x)=1
f=piecewise([[(-pi, -pi/3), f1], [[-pi/3, pi/3], f2],[(pi/3, pi), f3]])
Fc5=f.fourier_series_sine_coefficient(5)
show("5th coefficient = ",Fc5)
Fc10=f.fourier_series_sine_coefficient(10)
show("10th coefficient = ",Fc10)
Fc100=f.fourier_series_sine_coefficient(100)
show("100th coefficient = ",Fc100)
This confirms the value of the 5th coefficient and additionally provides the values of the other two coefficients in question, namely: 10th coefficient = −3/(10π), 100th coefficient = −3/(100π). Can you guess which coefficient is the 1000th? □
7.D.9. (Reverse) sawtooth function. Consider the function defined by f(x) = (π − x)/2 if 0 < x ≤ 2π, and f(x) = f(x + 2π) otherwise.
(1) Describe the given function f for x ∈ (−2π, 4π], and next use Sage to plot its graph over this interval.
(2) Find the Fourier series of f and next use Sage to confirm the result.
(3) Use Sage to plot the partial sums F5, F8, F15 and F50, in one single figure.
(4) Examine the convergence of the Fourier series and show that π/4 = 1 − 1/3 + 1/5 − 1/7 + · · · .
Solution. (1) The given function f provides the 2π-periodic extension of the function (π − x)/2, the latter defined on (0, 2π]. Hence, it should be f(x) = f(x + 2π) = (π − (x + 2π))/2 = (−π − x)/2 for all x ∈ (−2π, 0], and f(x) = f(x − 2π) = (π − (x − 2π))/2 = (3π − x)/2 for all x ∈ (2π, 4π]. That is, f(x) = (−π − x)/2 if −2π < x ≤ 0, f(x) = (π − x)/2 if 0 < x ≤ 2π, and f(x) = (3π − x)/2 if 2π < x ≤ 4π. To sketch the graph of this piecewise function, use the piecewise command, as before. For clarity, you may also mark the endpoints where f is defined (e.g., with black dots).
This can be done using the command point, as follows:
f1(x)= (pi-x)/2;f2(x)= (-pi-x)/2;f3(x)= (3*pi-x)/2
f = piecewise([[(-2*pi, 0),f2], [(0, 2*pi),f1], [(2*pi, 4*pi), f3]])
p=f.plot(x, -2*pi, 4*pi, exclude=[0, 2*pi], legend_label=r"$f$")
p0=point([0, -pi/2], size=30, color="black")
p1=point([2*pi, -pi/2], size=30, color="black")
p2=point([4*pi, -pi/2], size=30, color="black")
show(p+p0+p1+p2)
This produces the required plot.
(2) By assumption the period is 2π, so the Fourier series has the form F(x) = a0/2 + Σ_{n=1}^{∞} (an cos(nx) + bn sin(nx)), where a0 = (1/π) ∫_0^{2π} f(x) dx, an = (1/π) ∫_0^{2π} f(x) cos(nx) dx, bn = (1/π) ∫_0^{2π} f(x) sin(nx) dx.
First we will prove that an = 0 for all n. It is very easy to see that a0 = 0. Next, based on integration by parts, we get
an = (1/π) ∫_0^{2π} f(x) cos(nx) dx = (1/(2π)) ∫_0^{2π} (π − x) cos(nx) dx = (1/2) ∫_0^{2π} cos(nx) dx − (1/(2π)) ∫_0^{2π} x cos(nx) dx = (1/2) [sin(nx)/n]_0^{2π} − (1/(2π)) ∫_0^{2π} x (sin(nx)/n)′ dx = (1/(2n)) (sin(2nπ) − sin(0)) − (1/(2π)) ([x sin(nx)/n]_0^{2π} − (1/n) ∫_0^{2π} sin(nx) dx) = 0 − (1/(2nπ)) (2π sin(2nπ) − 0) + (1/(2nπ)) [−cos(nx)/n]_0^{2π} = 0 − 0 − (1/(2n²π)) (cos(2nπ) − cos(0)) = 0,
for all positive integers n. In a similar way, for bn we compute
bn = (1/π) ∫_0^{2π} f(x) sin(nx) dx = (1/(2π)) ∫_0^{2π} (π − x) sin(nx) dx = (1/2) ∫_0^{2π} sin(nx) dx − (1/(2π)) ∫_0^{2π} x sin(nx) dx = −(1/(2n)) [cos(nx)]_0^{2π} + (1/(2π)) ∫_0^{2π} x (cos(nx)/n)′ dx = −(1/(2n)) (cos(2nπ) − cos(0)) + (1/(2π)) ([x cos(nx)/n]_0^{2π} − (1/n) ∫_0^{2π} cos(nx) dx) = −(1/(2n)) (1 − 1) + (1/(2nπ)) (2π cos(2nπ) − 0) − (1/(2nπ)) [sin(nx)/n]_0^{2π} = 2π/(2nπ) − (1/(2n²π)) (sin(2nπ) − sin(0)) = 1/n.
Thus the Fourier series associated to f is given by F(x) = Σ_{n=1}^{∞} sin(nx)/n. To confirm this expression by Sage, one can compute the Fourier coefficients, as usual, i.e.,
var("n"); f(x)=(pi-x)/2; f = piecewise([[(0, 2*pi),f]])
an=(1/pi)*integral(((pi-x)/2)*cos(n*x), x, 0, 2*pi)
bn=(1/pi)*integral(((pi-x)/2)*sin(n*x), x, 0, 2*pi)
show("a_n=", an.expand()); show("b_n=", bn.expand())
(3) Recall that for the partial sums, Sage provides the built-in function fourier_series_partial_sum. For our case, the implementation can be done by continuing to type the following in the previous block:
partS5=f.fourier_series_partial_sum(5,pi); show("5th partial sum = ",partS5)
partS8=f.fourier_series_partial_sum(8,pi); show("8th partial sum = ",partS8)
partS15=f.fourier_series_partial_sum(15,pi); show("15th partial sum = ",partS15)
partS50=f.fourier_series_partial_sum(50,pi); show("50th partial sum = ",partS50)
Run the whole syntax yourselves to obtain the corresponding expressions of the partial sums F5, F8, F15 and F50. To plot them, execute the block given below (notice that this is independent of the previous cells).
f1(x)= (pi-x)/2; f2(x)= (-pi-x)/2; f3(x)= (3*pi-x)/2
f = piecewise([[(-2*pi, 0),f2], [(0, 2*pi),f1], [(2*pi, 4*pi), f3]])
partS5=f.fourier_series_partial_sum(5,pi);partS8=f.fourier_series_partial_sum(8,pi)
partS15=f.fourier_series_partial_sum(15,pi);partS50=f.fourier_series_partial_sum(50,pi)
p=f.plot(x, -2*pi, 4*pi, exclude=[0, 2*pi], legend_label=r"$f$")
s5=plot(partS5, x, -2*pi, 4*pi, linestyle="--", color="midnightblue", legend_label=r"$F_5$")
s8=plot(partS8, x, -2*pi, 4*pi, linestyle="--", color="darkgreen", legend_label=r"$F_8$")
s15=plot(partS15, x, -2*pi, 4*pi, linestyle="--", color="darkorange", legend_label=r"$F_{15}$")
s50=plot(partS50, x, -2*pi, 4*pi, linestyle="--", color="crimson", legend_label=r"$F_{50}$")
(p+s5+s8+s15+s50).show(figsize=8, ticks=pi/2, tick_formatter=pi)
This generates the corresponding figure. We should mention that in this block f was introduced as in the first part (see (1)), with the aim of displaying a larger portion of the graph of f. To verify the computations in part (2), however, it was adequate to use only the function (π − x)/2 on its initial domain.
(4) By the figure above, it is clear that the Fourier series F(x) of f converges to f(x) at each point x where f is continuous, which means that for any x ∈ (0, 2π) we have
(π − x)/2 = F(x) = Σ_{n=1}^{∞} sin(nx)/n = sin(x) + sin(2x)/2 + sin(3x)/3 + sin(4x)/4 + sin(5x)/5 + · · · .
This applies to x = π/2 and gives π/4 = sin(π/2) + sin(π)/2 + sin(3π/2)/3 + sin(2π)/4 + sin(5π/2)/5 + · · · = 1 + 0 − 1/3 + 0 + 1/5 − · · · .
On the other hand, the points of discontinuity of f appear at x = 2kπ, for k ∈ Z. The average value of our function at these points equals 0, and hence the Fourier expansion of f converges to 0 at any jump discontinuity of f. □
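As a quick numerical complement (our own snippet, not part of the text), one can watch the partial sums of the series at x = π/2 approach π/4; the convergence of the Leibniz series is notoriously slow.
# partial sum of sum(sin(n*pi/2)/n) up to n = 199, compared with pi/4
partial = sum(sin(k*pi/2)/k for k in srange(1, 200))
print(partial.n(), (pi/4).n())   # approx. 0.7829 vs 0.7854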
7.D.10. Compute the complex version of the Fourier series of the 2π-periodic extension of the following function (often referred to as the "pulse wave"): g(x) = 0 if −π < x < 0, and g(x) = 1 if 0 < x < π. ⃝
7.D.11. Consider the exponential mapping f(x) = eˣ with x ∈ [0, 1). Use Sage to implement the following:
(1) Determine and illustrate the 1-periodic extension of f for x ∈ [−1, 2) (with at least two different Sage methods);
(2) Compute the Fourier series of f on the interval [0, 1), and then illustrate its 5th, 15th and 25th partial sums (with at least two different Sage methods);
(3) Determine the convergence of the Fourier series obtained above.
Solution. (1) The 1-periodic extension of f on the interval [−1, 2) is given by f(x + 1) for −1 ≤ x < 0, by f(x) for 0 ≤ x < 1, and by f(x − 1) for 1 ≤ x < 2. In Sage, one method utilizes the piecewise function, while another relies on the def procedure for defining piecewise functions. The implementation goes as follows:
f=piecewise([[RealSet.closed_open(-1, 0), e^(x+1)], [(0, 1), e^x], [(1, 2), e^(x-1)]])
a=f.plot(x, -1, 2, thickness=1.5, color="grey", exclude=[0, 1])
p0=point([-1, 1], size=30, color="black")
p1=point([0, 1], size=30, color="black")
p2=point([1, 1], size=30, color="black")
show(a+p0+p1+p2)
var("t")
def g(t):
    if 0 <= t < 1:
        return e^t
    elif -1 <= t < 0:
        return e^(t+1)
    elif 1 <= t < 2:
        return e^(t-1)
b= plot(g, (-1, 2), exclude=[0, 1], thickness=1.5, color="grey")
p0=point([-1, 1], size=30, color="black")
p1=point([0, 1], size=30, color="black")
p2=point([1, 1], size=30, color="black")
show(b+p0+p1+p2)
They both produce the same figure. Notice that, as in 7.D.9, both blocks include code for marking with black dots the endpoints where the periodic extension is defined.
(2) As usual, it is sufficient to compute the Fourier coefficients. However, one should carefully apply the general formulas from 7.1.6, and here is a possible implementation:
f=piecewise([[(0, 1), e^x]]); T=1; n=var("n"); Om=pi
an=(2/T)*integral((e^x)*cos(2*n*Om*x), x, 0,1); show("a_n=", an.expand())
bn=(2/T)*integral((e^x)*sin(2*n*Om*x), x, 0,1); show( "b_n=", bn.expand())
Sage's answer has the form
a_n = 4πn e sin(2πn)/(4π²n² + 1) + 2 e cos(2πn)/(4π²n² + 1) − 2/(4π²n² + 1),
b_n = −4πn e cos(2πn)/(4π²n² + 1) + 4πn/(4π²n² + 1) + 2 e sin(2πn)/(4π²n² + 1).
Recalling that sin(2nπ) = 0 and cos(2nπ) = 1 for all n, this means that an = 2(e − 1)/(1 + 4n²π²) (n ∈ N) and bn = 4nπ(1 − e)/(1 + 4n²π²) (n ∈ Z+), respectively. It follows that the Fourier series of the function at hand has the form
F(x) = e − 1 + 2(e − 1) Σ_{n=1}^{∞} cos(2nπx)/(1 + 4n²π²) + 4π(1 − e) Σ_{n=1}^{∞} n sin(2nπx)/(1 + 4n²π²). (♭)
Now, to plot the required partial sums, we can proceed as in the previous tasks (cf. 7.D.9), i.e.,
f=piecewise([[(0, 1), e^x]])
partS5 = f.fourier_series_partial_sum(5,1/2)
partS15 = f.fourier_series_partial_sum(15,1/2)
partS25 = f.fourier_series_partial_sum(25,1/2)
a = f.plot(x, 0, 1, thickness=1.5, color="black")
s5=plot(partS5, x, -1, 1, linestyle="--", color="grey")
s15=plot(partS15, x, -1, 1, linestyle="--", color="grey")
s25=plot(partS25, x, -1, 1, linestyle="--", color="grey")
show(a+s5);show(a+s15);show(a+s25)
Notice that in the first line of this block we could replace the open set (0, 1) with the correct set on which f is defined, i.e., the set [0, 1). This could be done by the command RealSet.closed_open(0, 1), but it is not essential, in the sense that it does not affect the Fourier series, as previously mentioned in ??. This block produces three figures, showing the partial sums F5, F15 and F25.
An alternative method to derive the partial sums utilizes a combination of certain built-in functions in Sage, namely fourier_series_sine_coefficient and fourier_series_cosine_coefficient. Below we present the case of the 5th partial sum; the remaining cases are treated similarly.
f=piecewise([[(0, 1), e**x]])
a5=[f.fourier_series_cosine_coefficient(n, 1/2) for n in range(6)]
b5=[f.fourier_series_sine_coefficient(n, 1/2) for n in range(6)]
fs5=(a5[0]/2)+(sum([a5[k]*cos(2*k*pi*x) for k in range(1, 6)])) +(sum([b5[k]*sin(2*k*pi*x) for k in range(1, 6)]))
print(bool(fs5.expand()==f.fourier_series_partial_sum(5,1/2).expand()))
a=f.plot(x, 0, 1, thickness=1.5, color="black")
b=fs5.plot(x, 0, 1, linestyle="--")
(a+b).show(figsize=8)
In this block we programmed Sage to compare the expression of the 5th partial sum, denoted by fs5 and defined manually, with the one derived via the usual method, i.e., by the command fourier_series_partial_sum. This comparison is performed by the bool command, and Sage's answer is True. One could also obtain the explicit form of the partial sums in question, which we skip to save some space. As a side remark, we finally mention that for a formal verification of the Fourier coefficients an, bn one can use the following formulas:
∫ eˣ cos(αx) dx = eˣ (α sin(αx) + cos(αx))/(1 + α²) + C, ∫ eˣ sin(βx) dx = eˣ (sin(βx) − β cos(βx))/(1 + β²) + C,
with α, β ∈ R. Both of them can be derived by integration by parts (see 6.B.7 for the second one).
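Such antiderivative formulas are easy to validate symbolically; the following cell (our own sketch) differentiates the right-hand side of the first formula and checks that the integrand is recovered:
var("a")
F = e^x*(a*sin(a*x) + cos(a*x))/(1 + a^2)
print((diff(F, x) - e^x*cos(a*x)).simplify_full())   # prints 0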
(3) The 1-periodic extension of f(x) = eˣ is continuous at every x ∈ (0, 1). Hence, according to Dirichlet's theorem (see ??), we have eˣ = F(x) for all x ∈ (0, 1), where F is given in (♭). At the jump points (the integers, where the extension jumps from e to 1), the series converges to the average value (1 + e)/2. □
7.D.12. Consider again the function g(x) = eˣ (cf. 7.D.11).
(1) Derive a cosine Fourier series of g on the interval [0, 1] and illustrate some of its partial sums;
(2) Derive a sine Fourier series of g on the interval (0, 1] and illustrate some of its partial sums;
Note the differences between the three cases as shown in the diagrams. The approximation differs in relation to the continuity. The first diagram uses n = 5, the second uses n = 3, while the last diagram uses n = 10. We use the formulae (α, β ∈ R)
(1) ∫ eˣ cos(αx) dx = eˣ (α sin(αx) + cos(αx))/(1 + α²) + C,
(2) ∫ eˣ sin(βx) dx = eˣ (sin(βx) − β cos(βx))/(1 + β²) + C,
both of which can be obtained by two integrations by parts. Actually, the second one was computed in detail in ??(d). We obtain
(a) an = 2 ∫_0^1 eˣ cos(2nπx) dx = 2 [eˣ (2nπ sin(2nπx) + cos(2nπx))/(1 + 4n²π²)]_0^1 = 2(e − 1)/(1 + 4n²π²), n ∈ N ∪ {0},
bn = 2 ∫_0^1 eˣ sin(2nπx) dx = 2 [eˣ (sin(2nπx) − 2nπ cos(2nπx))/(1 + 4n²π²)]_0^1 = 4nπ(1 − e)/(1 + 4n²π²), n ∈ N;
(b) an = 2 ∫_0^1 eˣ cos(nπx) dx = 2 [eˣ (nπ sin(nπx) + cos(nπx))/(1 + n²π²)]_0^1 = 2((−1)ⁿ e − 1)/(1 + n²π²), n ∈ N ∪ {0};
(c) bn = 2 ∫_0^1 eˣ sin(nπx) dx = 2 [eˣ (sin(nπx) − nπ cos(nπx))/(1 + n²π²)]_0^1 = 2nπ(1 + (−1)^{n+1} e)/(1 + n²π²), n ∈ N.
Substitution then yields the corresponding Fourier series for g(x):
(a) e − 1 + 2(e − 1) Σ_{n=1}^{∞} cos(2nπx)/(1 + 4n²π²) + 4π(1 − e) Σ_{n=1}^{∞} n sin(2nπx)/(1 + 4n²π²);
(b) e − 1 + 2 Σ_{n=1}^{∞} ((−1)ⁿ e − 1) cos(nπx)/(1 + n²π²);
(c) 2π Σ_{n=1}^{∞} n(1 + (−1)^{n+1} e) sin(nπx)/(1 + n²π²).
7.D.13. Determine the cosine Fourier series for a periodic extension of the function g(x) = 1 for x ∈ [0, 1), g(x) = 0 for x ∈ [1, 4). Determine also the sine Fourier series for f(x) = x − 1 for x ∈ (0, 2), f(x) = 3 − x for x ∈ [2, 4).
Solution. We have already encountered the construction of a cosine Fourier series in ??(b) and also in 7.A.15(b). It is the case of the Fourier series of an even function. Therefore, the first thing to do is to extend the definition of the function g to the interval (−4, 0) so that it is even. This means g(x) := 1 for x ∈ (−1, 0), g(x) := 0 for x ∈ (−4, −1]. Now we can consider its periodic extension onto the whole R, with x0 = −4, period T = 8 and ω = π/4. Necessarily, bn = 0 for all n ∈ N in a cosine Fourier series. We determine the Fourier coefficients an, n ∈ N:
a0 = (2/T) ∫_{x0}^{x0+T} g(x) dx = (2/8) ∫_{−1}^{1} 1 dx = (1/2) ∫_0^{1} 1 dx = 1/2,
an = (2/T) ∫_{x0}^{x0+T} g(x) cos(nωx) dx = (1/2) ∫_0^{1} cos(nπx/4) dx = (2/(nπ)) sin(nπ/4).
Here we use the fact
(1) ∫_{−a}^{a} f(x) dx = 2 ∫_0^{a} f(x) dx,
which is valid for every even function f integrable on the interval [0, a]. The expression sin(nπ/4) is conveniently left as is, rather than the alternative of sorting out when it attains which of its five different values. Thus we write the cosine Fourier series in the form 1/4 + Σ_{n=1}^{∞} (2/(nπ)) sin(nπ/4) cos(nπx/4).
The sine Fourier series of the second function can be determined analogously from the odd extension of the given segment. Again, T = 8 and ω = π/4 for the function f. This time it is the coefficients an, n ∈ N ∪ {0}, which are zero. To find the remaining coefficients, use integration by parts and (1) (the product of two odd functions is an even function).
For all n ∈ N,
bn = (2/T) ∫_{x0}^{x0+T} f(x) sin(nωx) dx = (1/2) (∫_0^{2} (x − 1) sin(nπx/4) dx − ∫_2^{4} (x − 3) sin(nπx/4) dx) = [(1 − x)(2/(nπ)) cos(nπx/4)]_0^{2} + [(8/(n²π²)) sin(nπx/4)]_0^{2} − [(3 − x)(2/(nπ)) cos(nπx/4)]_2^{4} − [(8/(n²π²)) sin(nπx/4)]_2^{4} = (2/(nπ)) ((−1)ⁿ − 1) + (16/(n²π²)) sin(nπ/2).
Immediately, bn = 0 when n is even. So the sine Fourier series can be written as
Σ_{n=1}^{∞} ((2/(nπ)) ((−1)ⁿ − 1) + (16/(n²π²)) sin(nπ/2)) sin(nπx/4) = Σ_{n=1}^{∞} (−4/((2n−1)π) + (−1)^{n−1} 16/((2n−1)²π²)) sin((2n−1)πx/4). □
7.D.14. Write the Fourier series of the π-periodic function which equals cosine on the interval (−π/2, π/2). Then write the cosine Fourier series of the 2π-periodic function y = |cos x|.
Solution. We are looking for one Fourier series only, since the second part of the problem is just a reformulation of the first part. Therefore, we construct the Fourier series for the function g(x) = cos x, x ∈ [−π/2, π/2]. Since g is even, bn = 0, n ∈ N. We compute
an = (2/π) ∫_{−π/2}^{π/2} cos x cos(2nx) dx = (2/π) ∫_{−π/2}^{π/2} (1/2)(cos((2n + 1)x) + cos((2n − 1)x)) dx = (1/π) [sin((2n+1)x)/(2n+1) + sin((2n−1)x)/(2n−1)]_{−π/2}^{π/2} = (2/π) ((−1)ⁿ/(2n+1) + (−1)^{n+1}/(2n−1)) = (4/π) (−1)^{n+1}/(4n²−1)
for every n ∈ N. The calculation is also valid for n = 0, thus a0 = 4/π. The desired Fourier series is
2/π + (4/π) Σ_{n=1}^{∞} ((−1)^{n+1}/(4n²−1)) cos(2nx). □
7.D.15. Sum the two series Σ_{n=1}^{∞} 1/n⁴ and Σ_{n=1}^{∞} (−1)^{n+1}/n⁴.
Solution. We hint at the procedure by which the series Σ_{n=1}^{∞} 1/n^{2k}, Σ_{n=1}^{∞} (−1)^{n+1}/n^{2k} for general k ∈ N can be calculated. Use the identities
(1) x = π − 2 Σ_{n=1}^{∞} sin(nx)/n, x ∈ (0, 2π),
(2) x² = 4π²/3 + 4 Σ_{n=1}^{∞} cos(nx)/n² − 4π Σ_{n=1}^{∞} sin(nx)/n, x ∈ (0, 2π),
which follow from the constructions of the Fourier series for the functions g(x) = x and g(x) = x², respectively, on the interval [0, 2π). By (1), Σ_{n=1}^{∞} sin(nx)/n = (π − x)/2, x ∈ (0, 2π). Substituting into (2) gives Σ_{n=1}^{∞} cos(nx)/n² = (3x² − 6πx + 2π²)/12, x ∈ (0, 2π). Since the values of the series Σ_{n=1}^{∞} 1/n² = π²/6 and Σ_{n=1}^{∞} (−1)^{n+1}/n² = π²/12 have already been determined, substitution then proves the validity of this last equation at the marginal points x = 0, x = 2π. The left-hand series is evidently bounded from above by Σ_{n=1}^{∞} 1/n², thus it converges absolutely and uniformly on [0, 2π]. Therefore, it can be integrated term by term:
Σ_{n=1}^{∞} sin(nx)/n³ = Σ_{n=1}^{∞} [−cos(ny)/n³]_0^{x} = ∫_0^{x} Σ_{n=1}^{∞} cos(ny)/n² dy = ∫_0^{x} (3y² − 6πy + 2π²)/12 dy = (x³ − 3πx² + 2π²x)/12, x ∈ [0, 2π].
In fact, every Fourier series may be integrated term by term. Further integration gives
Σ_{n=1}^{∞} (1 − cos(nx))/n⁴ = Σ_{n=1}^{∞} [−cos(ny)/n⁴]_0^{x} = ∫_0^{x} Σ_{n=1}^{∞} sin(ny)/n³ dy = ∫_0^{x} (y³ − 3πy² + 2π²y)/12 dy = (x⁴ − 4πx³ + 4π²x²)/48, x ∈ [0, 2π].
Substituting x = π leads to Σ_{n=1}^{∞} (1 + (−1)^{n+1})/n⁴ = Σ_{n=1}^{∞} (1 − cos(nπ))/n⁴ = π⁴/48. Since the numerator on the left-hand side is zero for even numbers n and is 2 for odd numbers n, the series can be written as
(3) Σ_{n=1}^{∞} 2/(2n − 1)⁴ = π⁴/48.
From the expression Σ_{n=1}^{∞} 1/n⁴ = Σ_{n=1}^{∞} 1/(2n)⁴ + Σ_{n=1}^{∞} 1/(2n−1)⁴ = (1/16) Σ_{n=1}^{∞} 1/n⁴ + Σ_{n=1}^{∞} 1/(2n−1)⁴, it follows that Σ_{n=1}^{∞} 1/n⁴ = (16/15) Σ_{n=1}^{∞} 1/(2n−1)⁴ = (16/15) · (1/2) · (π⁴/48) = π⁴/90, thereby having summed up the first series. As for the second one,
Σ_{n=1}^{∞} (−1)^{n+1}/n⁴ = Σ_{n=1}^{∞} 1/(2n−1)⁴ − Σ_{n=1}^{∞} 1/(2n)⁴ = Σ_{n=1}^{∞} 1/(2n−1)⁴ − (1/16) Σ_{n=1}^{∞} 1/n⁴ = (1/2) · (π⁴/48) − (1/16) · (π⁴/90) = 7π⁴/720.
One can proceed similarly to sum the series Σ_{n=1}^{∞} 1/n^{2k}, Σ_{n=1}^{∞} (−1)^{n+1}/n^{2k} for other k ∈ N. It is natural to ask for the value of the series Σ_{n=1}^{∞} 1/n³.
This problem has been tackled by mathematicians for centuries without success. The reader may justifiably be surprised by this, since the procedure above is applicable to all the odd powers as well. For instance, one can start with the identity Σ_{n=1}^{∞} cos(nx)/n = −ln(2 sin(x/2)), x ∈ (0, 2π), which, by the way, can be proved by expanding the right-hand function into a Fourier series. If, similarly to the above, we integrate the left-hand series term by term twice and substitute x → 0+ in the limit, we get the series Σ_{n=1}^{∞} 1/n³. Thus, it should suffice to integrate the right-hand function twice and calculate one limit. However, the integration of the right-hand side leads to a non-elementary integral. That is, the antiderivative cannot be expressed in terms of the elementary functions.⁶ □
⁶The function ζ(p) = Σ_{n=1}^{∞} 1/n^p is called the Riemann zeta function.
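For a quick numerical confirmation of the two sums obtained above, one may run the following cell (our own snippet, in the style of the earlier Sage verifications):
# compare partial sums of the two series with pi^4/90 and 7*pi^4/720
s1 = sum(1.0/k^4 for k in srange(1, 5001))
s2 = sum((-1.0)^(k+1)/k^4 for k in srange(1, 5001))
print(s1, (pi^4/90).n())     # both approx. 1.082323
print(s2, (7*pi^4/720).n())  # both approx. 0.947033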
7.D.16. Determine the function f whose Fourier transform is the function f̃(ω) = (1/√(2π)) (sin ω)/ω, ω ≠ 0.
Solution. We might have noticed that the sinc function appeared as the image of the characteristic function hΩ of the interval (−Ω, Ω) in one of the previous problems: h̃Ω(ω) = (2Ω/√(2π)) sinc(ωΩ). In this case Ω = 1, and the function f is the half of h1. The result can also be computed directly. The inverse Fourier transform is
f(t) = (1/(2π)) ∫_{−∞}^{∞} ((sin ω)/ω) e^{iωt} dω = (1/(2π)) (∫_{−∞}^{0} ((sin ω)/ω) e^{iωt} dω + ∫_0^{∞} ((sin ω)/ω) e^{iωt} dω).
Substitute −ω for ω in the first integral, to obtain
f(t) = (1/(2π)) (∫_0^{∞} ((sin ω)/ω) e^{−iωt} dω + ∫_0^{∞} ((sin ω)/ω) e^{iωt} dω) = (1/(2π)) ∫_0^{∞} ((sin ω)/ω) (e^{−iωt} + e^{iωt}) dω = (1/π) ∫_0^{∞} ((sin ω)/ω) cos(ωt) dω.
Continue via trigonometric identities to obtain f(t) = (1/(2π)) (∫_0^{∞} sin(ω(1+t))/ω dω + ∫_0^{∞} sin(ω(1−t))/ω dω). The substitutions u = ω(1 + t), v = ω(1 − t) then give
f(t) = (1/(2π)) (∫_0^{∞} (sin u)/u du − ∫_0^{∞} (sin v)/v dv) = 0 for t > 1;
f(t) = (1/(2π)) (∫_0^{∞} (sin u)/u du + ∫_0^{∞} (sin v)/v dv) = (1/π) ∫_0^{∞} (sin u)/u du for t ∈ (−1, 1);
f(t) = (1/(2π)) (−∫_0^{∞} (sin u)/u du + ∫_0^{∞} (sin v)/v dv) = 0 for t < −1.
Thus the function f is zero for |t| > 1 and constant (necessarily non-zero) for |t| < 1. (Throughout, we assume that the inverse Fourier transform exists.) The constant is f(t) = 1/2 for |t| < 1, from the standard result ∫_0^{∞} (sin u)/u du = π/2. Alternatively, we can 'guess' that the constant is one, i.e. g(t) = 1 for |t| < 1, g(t) = 0 for |t| > 1, and compute F(g)(ω) = (1/√(2π)) ∫_{−1}^{1} e^{−iωt} dt = (2/√(2π)) ∫_0^{1} cos(ωt) dt = (2/√(2π)) (sin ω)/ω. So f(0) = g(0)/2 = 1/2, which also establishes ∫_0^{∞} (sin u)/u du = π/2. □
7.D.17. Using the relation
(1) L(f′)(s) = s L(f)(s) − lim_{t→0+} f(t),
derive the Laplace transforms of both the functions y = cos t and y = sin t.
Solution. Notice first that from (1) it follows that L(f′′)(s) = s L(f′)(s) − lim_{t→0+} f′(t) = s (s L(f)(s) − lim_{t→0+} f(t)) − lim_{t→0+} f′(t) = s² L(f)(s) − s lim_{t→0+} f(t) − lim_{t→0+} f′(t). Therefore,
−L(sin t)(s) = L(−sin t)(s) = L((sin t)′′)(s) = s² L(sin t)(s) − s lim_{t→0+} sin t − lim_{t→0+} cos t = s² L(sin t)(s) − 1,
whence we get −L(sin t)(s) = s² L(sin t)(s) − 1, i.e. L(sin t)(s) = 1/(s²+1). Now, invoking (1), we determine L(cos t)(s) = L((sin t)′)(s) = s · 1/(s²+1) − lim_{t→0+} sin t = s/(s²+1). □
7.D.18. Using the discussion from the previous problem, prove that for a continuous function y with sufficiently many higher order derivatives,
L(y^{(n)})(s) = sⁿ L(y)(s) − Σ_{i=1}^{n} s^{n−i} y^{(i−1)}(0).
Solution. Clearly L(y′)(s) = s L(y)(s) − y(0) and L(y′′)(s) = s² L(y) − s y(0) − y′(0), and the claim is verified by induction. □
7.D.19. Find the Laplace transform of Heaviside's function H(t) and, for real a, the shifted Heaviside's function Ha(t) = H(t − a): H(t) = 0 for t < 0, H(t) = 1/2 for t = 0, H(t) = 1 for t > 0.
Solution. L(H(t))(s) = ∫_0^{∞} H(t) e^{−st} dt = ∫_0^{∞} e^{−st} dt = [−e^{−st}/s]_0^{∞} = −(1/s)(0 − 1) = 1/s,
L(Ha(t))(s) = L(H(t − a))(s) = ∫_0^{∞} H(t − a) e^{−st} dt = ∫_a^{∞} e^{−st} dt = ∫_0^{∞} e^{−s(t+a)} dt = e^{−as} L(H(t))(s) = e^{−as}/s. □
7.D.20. Show that for real a,
(1) L(f(t) · Ha(t))(s) = e^{−as} L(f(t + a))(s).
Solution. L(f(t)Ha(t))(s) = ∫_0^{∞} f(t)H(t − a) e^{−st} dt = ∫_a^{∞} f(t) e^{−st} dt = ∫_0^{∞} f(t + a) e^{−s(t+a)} dt = e^{−as} ∫_0^{∞} f(t + a) e^{−st} dt = e^{−as} L(f(t + a))(s). □
7.D.21. Find a function y(t) satisfying the differential equation y′′(t) + 4y(t) = sin 2t and the initial conditions y(0) = 0 and y′(0) = 0.
Solution. From the example 7.D.18: s² L(y)(s) + 4 L(y)(s) = L(sin 2t)(s). Now, by 7.B.17(d), L(sin 2t)(s) = 2/(s² + 4). It follows that L(y)(s) = 2/(s² + 4)². The inverse transform then gives y(t) = (1/8) sin 2t − (1/4) t cos 2t. □
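Even without carrying out the inverse transform by hand, the answer is easy to check symbolically; the following cell (our own sketch) verifies the differential equation and the initial conditions:
var("t")
y = sin(2*t)/8 - t*cos(2*t)/4
print((diff(y, t, 2) + 4*y - sin(2*t)).simplify_full())  # prints 0
print(y.subs(t=0), diff(y, t).subs(t=0))                 # prints 0 0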
7.D.22. Find a function y(t) satisfying the differential equation and the initial conditions y′′(t) + 4y(t) = f(t), y(0) = 0, y′(0) = −1, where f(t) is the piecewise continuous function f(t) = cos(2t) for 0 ≤ t < π, and f(t) = 0 for t ≥ π.
Solution. This problem is a model for the undamped oscillations of a spring (excluding friction and other phenomena like non-linearities in the toughness of the spring, and so on). It is initiated by an exterior force during the initial period only, which then ceases. The function f(t) can be written as a linear combination of Heaviside's function H(t) and its shift. That is, f(t) = cos(2t)(H(t) − Hπ(t)), up to the values t = 0, t = π, . . . Since L(y′′)(s) = s² L(y) − s y(0) − y′(0) = s² L(y) + 1, we get, making use of the previous example 7.D.20 for the right-hand side,
s² L(y) + 1 + 4 L(y) = L(cos(2t)(H(t) − Hπ(t))) = L(cos(2t) · H(t)) − L(cos(2t) · Hπ(t)) = L(cos(2t)) − e^{−πs} L(cos(2(t + π))) = (1 − e^{−πs}) s/(s² + 4).
Hence, L(y) = −1/(s² + 4) + (1 − e^{−πs}) s/(s² + 4)². The inverse transform then yields the solution in the form
y(t) = −(1/2) sin(2t) + (1/4) t sin(2t) − L^{−1}(e^{−πs} s/(s² + 4)²).
According to (1) of 7.D.20,
L^{−1}(e^{−πs} s/(s² + 4)²) = (1/4) L^{−1}(e^{−πs} L(t sin(2t))) = (1/4)(t − π) sin(2(t − π)) · Hπ(t).
Since the Heaviside function Hπ(t) is zero for t < π and equals 1 for t > π, we get the solution in the form y(t) = ((t − 2)/4) sin(2t) for 0 ≤ t < π, and y(t) = ((π − 2)/4) sin(2t) for t ≥ π. In particular, once the exterior force ceases, the spring keeps oscillating with constant amplitude. □
Key to the exercises
7.A.2. Similarly to 7.A.1, one can use the Gram–Schmidt process with respect to the mentioned scalar product. This gives the functions f1(x) = 1/x, f2(x) = 1/x² − 3/(4x), f3(x) = 1/x³ − 3/(2x²) + 13/(24x), and it is easy to check that the set {f1, f2, f3} forms an orthonormal basis. We also see that:
• The projection of the function 1/x⁴ has the form (15/32) f1 + (69/40) f2 + (9/4) f3, while the distance is √14/2240 ≈ 0.00167.
• The projection of the function x has the form 2 f1 + 96(−3/4 + ln(2)) f2 + 5760(−(3/2) ln(2) + 25/24) f3, while the distance is approximately 0.035.
Let us now illustrate the functions x and 1/x⁴ and their approximations. In the corresponding figure we see that the function 1/x⁴, which is the one whose shape is similar to that of one or more generators, is approximated much better by the projection.
7.A.7. We just have to check the orthogonality of the couples Lm(x), Ln(x) with respect to the inner product ⟨f, g⟩ω = ∫_0^{∞} f(t)g(t) e^{−t} dt. This can be done by integration by parts.
7.A.8. The claim follows from the fact that the powers x^k appear in the polynomials Tk or Lk for the first time. Thus the linear hulls of the first k functions always coincide.
7.A.9. We mention that since f, g are both piecewise continuous functions defined on [−π, π], the product fḡ also belongs to S0([−π, π]). Thus, the function fḡ is Riemann integrable, and (f, g) := ∫_{−π}^{π} f(x)g(x) dx is well-defined. The remaining details proving that ( , ) is an inner product on S0([−π, π]) rely on elementary properties of integrals. Next, we see that (√2/2, √2/2) = (1/π) ∫_{−π}^{π} 1/2 dx = (1/π)[x/2]_{−π}^{π} = π/π = 1. Moreover, for any positive integer n we have
(√2/2, sin(nx)) = (√2/(2π)) ∫_{−π}^{π} sin(nx) dx = −(√2/(2π)) [cos(nx)/n]_{−π}^{π} = −(√2/(2nπ)) (cos(nπ) − cos(−nπ)) = 0,
(√2/2, cos(nx)) = (√2/(2π)) ∫_{−π}^{π} cos(nx) dx = 0,
(sin(nx), sin(nx)) = (1/π) ∫_{−π}^{π} sin²(nx) dx = (1/(2π)) ∫_{−π}^{π} (1 − cos(2nx)) dx = (1/(2π))(π + π) − (1/(2π)) [sin(2nx)/(2n)]_{−π}^{π} = 1,
(cos(nx), cos(nx)) = (1/π) ∫_{−π}^{π} cos²(nx) dx = (1/(2π)) ∫_{−π}^{π} (1 + cos(2nx)) dx = 1.
Finally, using appropriate trigonometric identities, it is not hard to prove that (cos(mx), cos(nx)) = (cos(mx), sin(nx)) = (sin(mx), sin(nx)) = 0 for all positive integers m, n with m ≠ n. Moreover, we have (cos(mx), sin(nx)) = 0 also for m = n, and the claim follows.
7.A.12. A confirmation of the expression presented in 7.A.11 goes as follows:
f = piecewise([[(-pi,pi),abs(x)]]); L=pi; n=var("n")
an=(1/L)*integral(abs(x)*cos(n*pi*x/L), x, -pi, pi); show("a_n=", an.expand())
bn=(1/L)*integral(abs(x)*sin(n*pi*x/L), x, -pi, pi); show( "b_n=", bn.expand())
For the approximation with k = 5, and its graph, add to the previous block the cell
Fs5 = f.fourier_series_partial_sum(5,pi); show("5th partial sum = ",Fs5)
s5 = plot(Fs5, x, -pi-0.5, pi+0.5, linestyle="--", tick_formatter=[2*pi, None])
a=plot(f, x, -pi, pi, color="black", thickness=1.5)
show(a+s5)
Execute this code to obtain the explicit form of F5(x) and confirm graphically that it provides a pretty good approximation of the function at hand. The confirmation of the identity F5(x) = F6(x) can be done easily, and we leave this for practice.
7.A.13. The sign function has discontinuities at x = 0 and x = ±π. Since sgn(−x) = −sgn(x) for all x, we get an = 0 for all n ∈ N, and the Fourier series of sgn(x) is a sine series, F(x) = Σ_{n=1}^{∞} bn sin(nx). The function g(x) = sgn(x) sin(nx) is even, as a product of two odd functions, thus
bn = (1/π) ∫_{−π}^{π} sgn(x) sin(nx) dx = (2/π) ∫_0^{π} sin(nx) dx = (2/π) [−cos(nx)/n]_0^{π} = (2/π) (−cos(nπ)/n + 1/n) = 2(1 − (−1)ⁿ)/(nπ).
We can simplify this by observing that bn = 0 if n = 2m, and bn = 4/(nπ) if n = 2m − 1, where m ∈ Z+. Therefore, the corresponding Fourier series has the form
F(x) = (2/π) Σ_{n=1}^{∞} ((1 − (−1)ⁿ)/n) sin(nx) = (4/π) Σ_{m=1}^{∞} (1/(2m − 1)) sin((2m − 1)x).
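The coefficients bn can be confirmed in Sage in the same spirit as the earlier verification cells; this small block is our own sketch:
n = var("n")
bn = (2/pi)*integral(sin(n*x), x, 0, pi)
show("b_n=", bn.expand())   # an expression equivalent to 2*(1-(-1)^n)/(n*pi) for integer n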
7.A.17. Since f is neither even nor odd, the complex Fourier coefficients will have both real and imaginary parts. In particular, using the properties of the exponential map, we compute
cn = (1/(2π)) ∫_{−π}^{π} eˣ e^{−inx} dx = (1/(2π)) ∫_{−π}^{π} e^{(1−in)x} dx = (1/(2π)) [e^{(1−in)x}/(1 − in)]_{−π}^{π} = (1/(2π(1 − in))) (e^{(1−in)π} − e^{−(1−in)π}).
Now it is easy to see that e^{(1−in)π} = e^{π} cos(nπ) and e^{−(1−in)π} = e^{−π} cos(nπ). Thus, recalling the identity 2 sinh(x) = eˣ − e^{−x}, we arrive at the following expression:
cn = (cos(nπ)/(2π(1 − in))) (e^{π} − e^{−π}) = (sinh(π)/π) ((1 + in)/(1 + n²)) cos(nπ) = (sinh(π)/π) ((1 + in)/(1 + n²)) (−1)ⁿ.
This also gives c0 = sinh(π)/π = (e^{π} − e^{−π})/(2π). As for c−n, we see that c−n = (sinh(π)/π) ((1 − in)/(1 + (−n)²)) cos(nπ) = (sinh(π)/π) ((1 − in)/(1 + n²)) (−1)ⁿ = c̄n, as required. Recalling now that e^{inx} + e^{−inx} = 2 cos(nx) and e^{inx} − e^{−inx} = 2i sin(nx), we finally get
F(x) = c0 + Σ_{n=1}^{∞} cn e^{inx} + Σ_{n=1}^{∞} c−n e^{−inx} = (sinh(π)/π) (1 + Σ_{n=1}^{∞} ((1 + in)/(1 + n²)) (−1)ⁿ e^{inx} + Σ_{n=1}^{∞} ((1 − in)/(1 + n²)) (−1)ⁿ e^{−inx}) = (sinh(π)/π) (1 + Σ_{n=1}^{∞} ((−1)ⁿ/(1 + n²)) [(1 + in) e^{inx} + (1 − in) e^{−inx}]) = (sinh(π)/π) (1 + Σ_{n=1}^{∞} ((−1)ⁿ/(1 + n²)) [e^{inx} + e^{−inx} + in(e^{inx} − e^{−inx})]) = (sinh(π)/π) (1 + 2 Σ_{n=1}^{∞} ((−1)ⁿ/(1 + n²)) [cos(nx) + i² n sin(nx)]) = (sinh(π)/π) (1 + 2 Σ_{n=1}^{∞} ((−1)ⁿ/(1 + n²)) [cos(nx) − n sin(nx)]).
In order to plot f together with the 20th partial sum of the Fourier series, use the cell given here:
f=piecewise([[(-pi,pi), e**x]])
a=plot(f, x, -pi, pi, color="steelblue", thickness=1.5)
Fs20 = f.fourier_series_partial_sum(20,pi)
s20 = plot(Fs20, x, -4*pi, 4*pi,color="grey", linestyle="--")
(a+s20).show(ticks=pi, tick_formatter=pi, figsize=8)
Running this cell we obtain the corresponding figure.
7.A.18. In this case the period is given by T = 2L = 2, i.e., L = 1. Thus F(x) = Σ_{n=−∞}^{∞} cn e^{inπx/L} = Σ_{n=−∞}^{∞} cn e^{inπx}, with cn = (1/(2L)) ∫_{−L}^{L} e^{−x} e^{−inπx/L} dx = (1/2) ∫_{−1}^{1} e^{−x} e^{−inπx} dx. We compute
cn = (1/2) ∫_{−1}^{1} e^{−(1+inπ)x} dx = (1/2) [e^{−(1+inπ)x}/(−(1 + inπ))]_{−1}^{1} = −(1/(2(1 + inπ))) (e^{−(1+inπ)} − e^{(1+inπ)}) = −(1/(2(1 + inπ))) (e^{−1} e^{−inπ} − e^{1} e^{inπ}) = ((−1)ⁿ/(1 + inπ)) · (e^{1} − e^{−1})/2 = ((−1)ⁿ/(1 + inπ)) sinh(1).
Here we used the identities e^{inπ} = e^{−inπ} = cos(nπ) = (−1)ⁿ. Hence, the Fourier series of f has the form
F(x) = Σ_{n=−∞}^{∞} ((−1)ⁿ/(1 + inπ)) sinh(1) e^{inπx} = sinh(1) Σ_{n=−∞}^{∞} ((−1)ⁿ (1 − inπ)/(1 + n²π²)) e^{inπx}.
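As a plausibility check of such closed forms (our own snippet, not part of the text), one can compare the defining integral with the derived expression for a particular coefficient, say c₃ of 7.A.17:
n0 = 3
c = (1/(2*pi))*integral(e^x*exp(-I*n0*x), x, -pi, pi)
claimed = (sinh(pi)/pi)*((1 + I*n0)/(1 + n0^2))*(-1)^n0
print(c.n(), claimed.n())   # the two printed values agree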
7.A.24. We need the even extension of f on (−π, π), which reads fev(x) = f(x) = sin(x) for 0 ≤ x < π, and fev(x) = f(−x) = −sin(x) for −π < x < 0. We may extend fev periodically all over R by setting fev(x + 2π) = fev(x) for all the other x. Notice that the values of this 2π-periodic extension at ±π coincide with f(π) = sin(π) = 0 (and similarly for ±Nπ, N ∈ N). Let us use Sage to illustrate this extension on (−π, π):
f1(x)=sin(x); f2(x)=-sin(x)
f = piecewise([[(-pi, 0),f2], [RealSet.closed_open(0,pi),f1]])
p=f.plot(x, -pi, pi, exclude=[0], color="black")
p.show(figsize=8, ticks=pi/2, tick_formatter=pi)
Now, for the cosine extension we have bn = 0, while the coefficients an are given by an = (2/π) ∫_0^{π} sin(x) cos(nx) dx. Using the identity 2 sin(θ) cos(φ) = sin(θ + φ) + sin(θ − φ), we obtain
an = (1/π) ∫_0^{π} (sin((1 + n)x) + sin((1 − n)x)) dx = (1/π) [−cos((1 + n)x)/(n + 1) − cos((1 − n)x)/(1 − n)]_0^{π} = (1/π) ((1 − cos((n + 1)π))/(n + 1) + (cos((1 − n)π) − 1)/(n − 1)) = (1/π) ((1 − (−1)^{n+1})/(n + 1) + ((−1)^{n+1} − 1)/(n − 1)) = (1/π) ((1 + (−1)ⁿ)/(n + 1) − ((−1)ⁿ + 1)/(n − 1)) = ((1 + (−1)ⁿ)/π) (1/(n + 1) − 1/(n − 1)) = −2(1 + (−1)ⁿ)/(π(n² − 1)),
which makes sense for all integers n > 1. To verify this expression in Sage, one may add to the block presented above the following cell:
L=pi; n=var("n"); an=(2/L)*integral(sin(x)*cos(n*pi*x/L), x, 0, pi); show( "a_n=", an.expand())
Moreover, it is easy to see that a1 = (2/π) ∫_0^{π} sin(x) cos(x) dx = 0 and a0 = (2/π) ∫_0^{π} sin(x) dx = 4/π. All together, we have proved that an = 4/π if n = 0; an = 0 if n ≥ 1 is odd; and an = −4/((n² − 1)π) if n > 1 is even. Therefore, for all x ∈ (−π, π), by Dirichlet's condition we have that
sin(x) = 2/π − (4/π) Σ_{k=1}^{∞} cos(2kx)/(4k² − 1) = 2/π − (4/π) (cos(2x)/3 + cos(4x)/15 + cos(6x)/35 + · · ·).
We may use Sage to illustrate the 3rd and 5th partial sums of the approximation, with brown and blue color respectively, together with the graph of fev for x ∈ (−π, π). It suffices to add to the initial block the following one:
Fs3 = f.fourier_series_partial_sum(3,pi)
Fs5 = f.fourier_series_partial_sum(5,pi)
s3 = plot(Fs3, x, -pi, pi,color="brown", linestyle="--")
s5 = plot(Fs5, x, -pi, pi, linestyle="--", color="steelblue")
(p+s3+s5).show(ticks=pi/2, tick_formatter=pi, figsize=8)
7.A.25. Here we need the odd extension of f, which is given by fodd(x) = f(x) = cos(x) for 0 < x < π, and fodd(x) = −f(−x) = −cos(x) for −π < x < 0. We may extend fodd over the whole real line by setting fodd(x + 2π) = fodd(x) for x outside (−π, π). At x = ±π the values of this extension are ∓1. Let us illustrate the extension for x ∈ (−π, π) via Sage, by the same method as above:
f1(x)=cos(x); f2(x)=-cos(x)
f = piecewise([[(-pi, 0),f2], [(0, pi),f1]])
p=f.plot(x, -pi, pi, exclude=[0])
p.show(figsize=8, ticks=pi/2, tick_formatter=pi)
Now, since we are interested in the sine expansion, we necessarily have an = 0 for all n ∈ N. Moreover, since L = π, we see that bn = (2/π) ∫_0^{π} cos(x) sin(nx) dx. To compute this integral, use the identity 2 sin(nx) cos(x) = sin((1 + n)x) + sin((n − 1)x), which gives
bn = (1/π) ∫_0^{π} (sin((1 + n)x) + sin((n − 1)x)) dx = −(1/π) [cos((n + 1)x)/(n + 1) + cos((n − 1)x)/(n − 1)]_0^{π} = 2n((−1)ⁿ + 1)/((n² − 1)π).
This holds for all n > 1 and can be verified in Sage by adding to the previous block the syntax given here:
L=pi; n=var("n")
bn=(2/L)*integral(cos(x)*sin(n*pi*x/L), x, 0, pi)
show( "b_n=", bn.expand())
On the other hand, for n = 1 we get b1 = (2/π) ∫_0^{π} cos(x) sin(x) dx = (1/π) ∫_0^{π} sin(2x) dx = 0. Thus we have proved that bn = 0 if n = 1 or if n > 1 is odd, and bn = 4n/((n² − 1)π) if n > 1 is even. Therefore, one deduces that
cos(x) = Σ_{k=1}^{∞} (8k/((4k² − 1)π)) sin(2kx) = (8/(3π)) sin(2x) + (16/(15π)) sin(4x) + (24/(35π)) sin(6x) + · · ·
for all x ∈ (0, π). The approximations obtained for n = 10 and n = 30 on the interval (−π, π) are presented below.
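These two plots can be produced with the following block (our own sketch, using the same conventions as the previous cells):
f1(x)=cos(x); f2(x)=-cos(x)
f = piecewise([[(-pi, 0),f2], [(0, pi),f1]])
p = f.plot(x, -pi, pi, exclude=[0])
s10 = plot(f.fourier_series_partial_sum(10, pi), x, -pi, pi, linestyle="--", color="brown")
s30 = plot(f.fourier_series_partial_sum(30, pi), x, -pi, pi, linestyle="--", color="steelblue")
(p+s10).show(figsize=8, ticks=pi/2, tick_formatter=pi); (p+s30).show(figsize=8, ticks=pi/2, tick_formatter=pi)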
7.B.4. f1 ∗ f2(t) = t − t²/2 + 4 for t ∈ [−2, −1]; 3/2 − t for t ∈ [−1, 1]; t²/2 − 2t + 2 for t ∈ [1, 2]; 0 otherwise.
7.B.14. It is a good exercise on differentiation and integration by parts. We may differentiate with respect to t inside the integral, and d/dt f(t − x) can be interpreted as −d/dx f(t − x).
7.B.16. Another good exercise on differentiation and integration by parts. We may differentiate twice with respect to t inside the integral, and f′′(t − x) can be interpreted either as a derivative with respect to t or with respect to x.
7.B.18. The definition of Γ(t) reveals L(t^α) = ∫_0^{∞} e^{−st} t^α dt = (1/s^{α+1}) ∫_0^{∞} e^{−x} x^α dx = Γ(α + 1)/s^{α+1}.
7.B.19. Integrate by parts to obtain
L(g)(s) = ∫_0^{∞} t e^{−t} e^{−st} dt = ∫_0^{∞} t e^{−(s+1)t} dt = lim_{t→∞} (t e^{−(s+1)t}/(−(s+1))) − 0 − ∫_0^{∞} e^{−(s+1)t}/(−(s+1)) dt = −(lim_{t→∞} e^{−(s+1)t}/(s+1)² − e⁰/(s+1)²) = 1/(s+1)².
Differentiating the Laplace transform of a general function −f (i.e., an improper integral) with respect to the parameter s gives
(∫_0^{∞} −f(t) e^{−st} dt)′ = ∫_0^{∞} −f(t) (e^{−st})′ dt = ∫_0^{∞} t f(t) e^{−st} dt.
This means that the derivative of the Laplace transform L(−f)(s) is the Laplace transform of the function t f(t). The Laplace transform of the function y = sinh t has already been determined to be the function y = 1/(s²−1). Therefore, L(h)(s) = (−1/(s²−1))′ = 2s/(s²−1)². We could also have determined L(g)(s) this way.
7.B.20. L(cos ωt)(s) + i L(sin ωt)(s) = L(e^{iωt})(s) = ∫_0^{∞} e^{iωt} e^{−st} dt = ∫_0^{∞} e^{(iω−s)t} dt = −(1/(s − iω)) [e^{(iω−s)t}]_0^{∞} = −(1/(s − iω)) (lim_{t→∞} e^{iωt}/e^{st} − 1) = 1/(s − iω) = (s + iω)/((s − iω)(s + iω)) = s/(s² + ω²) + i ω/(s² + ω²).
7.C.23. Recall the definition of the product metric on the Cartesian product X × X, where the distance is given as the maximum of the distances of the components. The claim follows directly from the triangle inequality and the topological definition of continuity.
7.D.5. The orthogonal basis has the form {x, −(3/π²)x + sin(x)}, while the projection does not change the function (1/2) sin(x), since it already lies in the space.
7.D.6. The orthogonal basis has the form {cos(x), (4/π) cos(x) + x}. Moreover, the required projection is given by 3π/(π⁴ − 24) · (4 cos(x) + πx). Notice that this is a very bad approximation.
7.D.7. We have already checked the orthogonality of the cosine terms in the solution to the example 7.A.4. The sine terms are obtained the same way, since they are just shifts of cosines by π/2 in the argument. The mixed couples provide an odd function to be integrated on a symmetric interval around the origin, and so the integral also vanishes.
7.D.10. For n ≠ 0 we have cn = (1/(2π)) ∫_0^{π} e^{−inx} dx = (1/(2π)) [e^{−inx}/(−in)]_0^{π} = (1/(2π)) (1 − (−1)ⁿ)/(in). Moreover, c0 = (1/(2π)) ∫_0^{π} dx = 1/2. Thus
F(x) = 1/2 + Σ_{n=1}^{∞} cn e^{inx} + Σ_{n=−∞}^{−1} cn e^{inx} = 1/2 + Σ_{n=1}^{∞} (1/(2π)) ((1 − (−1)ⁿ)/(in)) e^{inx} + Σ_{n=−∞}^{−1} (1/(2π)) ((1 − (−1)ⁿ)/(in)) e^{inx}.
Next, in the second of the sums, replace n with −n. This gives
F(x) = 1/2 + (1/(2π)) Σ_{n=1}^{∞} ((1 − (−1)ⁿ)/(in)) (e^{inx} − e^{−inx}) = 1/2 + (1/(2π)) Σ_{n=1}^{∞} ((1 − (−1)ⁿ)/(in)) 2i sin(nx) = 1/2 + (1/π) Σ_{n=1}^{∞} ((1 − (−1)ⁿ)/n) sin(nx).
It is easy to see that when n is even, the terms are all 0. For n odd, put n = 2m − 1 to obtain
F(x) = 1/2 + (2/π) Σ_{m=1}^{∞} (1/(2m − 1)) sin((2m − 1)x) = 1/2 + (2/π) (sin(x) + sin(3x)/3 + sin(5x)/5 + . . .).
Notice that this expression is similar to the problem described in 7.1.9. In particular, we could use the result from 7.1.9 to derive our claim after an easy transformation.
7.D.15. Determine the convolution of the functions f1 and f2, where f1 ∗ f2(t) = (t+1)²/2 for t ∈ [−1, 0]; (1−t²)/2 for t ∈ [0, 1]; 0 otherwise.
CHAPTER 8
Calculus with more variables
one variable is not enough? – never mind, just recall vectors!
At the beginning of our journey through the mathematical landscape, we saw that vectors can be manipulated nearly as easily as scalars. Now, we return to situations where the relations between objects are expressed with the help of more (yet still finitely many) parameters. This is really necessary when modeling processes or objects in practice, where functions R → R of one variable are seldom adequate.
At least, functions dependent on finitely many parameters are necessary, and the dependence of the change of the results on the parameters is often more important than the result itself. There is little need for brand new ideas. Many problems we encounter can be reduced to ones we can solve already. We return to the discussion of situations when the values of functions are described in terms of instantaneous changes. That is, we consider ordinary differential equations. In the next chapter, we consider partial differential equations and provide a gentle introduction to variational problems.
1. Functions and mappings on Rn
8.1.1. The world of functions. In the sequel, the main objects are mappings between Euclidean spaces, f : Rn → Rm. We have seen many such examples already. The complex valued real functions correspond to n = 1, m = 2, while the power series converge inside of a circle in the complex plane, providing examples of f : R² → R². We have also dealt with vector valued real functions, representing parametrized curves c : R → Rn (see e.g. the paragraphs on curvatures and Frenet frames in 6.1.16 on page 526). In linear algebra and geometry, we saw the linear and affine maps f : Rn → Rm defined with the help of matrices A ∈ Mat_{m,n}(R) and constant vectors y ∈ Rm: Rn ∋ x → y + A x ∈ Rm. In coordinates, the value is given by the expression Σ_j a_{ij} x_j + y_i, where A = (a_{ij}) and y = (y_i). Finally, the quadratic forms were mappings Rn → R given by symmetric matrices S = (s_{ij}) and the formula Rn ∋ x → x^T S x ∈ R. In coordinates, the value is Σ_{i,j} s_{ij} x_i x_j. In general, all such mappings f : Rn → Rm are composed of m component functions fi : Rn → R. So we start with this case.
8.1.2. Multivariate functions. We can stress the dependence on the variables x1, . . . , xn by writing the functions as f(x1, x2, . . . , xn) : Rn → R. The goal is to extend methods for monitoring the values of functions and their changes to this situation. We speak about functions of more variables or, more compactly, multivariate functions. We often work with the cases n = 2 or n = 3, so that the concepts being introduced are easier to understand. In these cases, letters like x, y, z are used instead of numbered variables. This means that a function f defined in the "plane" R² is denoted R² ∋ (x, y) → f(x, y) ∈ R, and, similarly, in the "space" R³, R³ ∋ (x, y, z) → f(x, y, z) ∈ R. Just as in the case of univariate functions, the domain A ⊂ Rn on which the function in question is defined needs to be considered. When examining a function given by a concrete formula, the first task is often to find the largest domain on which the formula makes sense.
It is also useful to consider the graph of a multivariate function, i.e., the subset Gf ⊂ Rn × R = R^{n+1} defined by Gf = {(x1, . . . , xn, f(x1, . . . , xn)); (x1, . . . , xn) ∈ A}, where A is the domain of f. For instance, the graph of the function defined in the plane by the formula f(x, y) = (x + y)/(x² + y²) is quite a smooth surface, see the illustration below. The maximal domain of this function consists of all the points of the plane except for the origin (0, 0). When defining the function, and especially when drawing its graph, fixed coordinates are used in the plane. Fixing the value of either of the coordinates implies that only one variable remains. Fixing the value of x, for example, gives the mapping R → R³, y → (x, y, f(x, y)), i.e., a curve in the space R³. Curves are vector functions of one variable, already worked with in chapter six (see 6.1.14). The images of the curves for some fixed values of the coordinates x and y are depicted by lines in the illustration.
A. Multivariate functions
We start this chapter with a couple of easy examples, to get a first "grasp" of multivariate functions.
8.A.1. Solve the following systems of inequalities and mark the resulting areas in the plane.
a) x² + y² ≤ 4, y ≥ 1/x;
b) y ≤ arctan x, y ≤ 1/x²;
c) x² + (y − 1)² ≥ 4, y + x² − 2x ≥ 0, y ≥ 0.
Solution. Whenever you have to solve an inequality of the form f(x, y) ≥ 0 (f : R² → R is a function of two variables, but the same method is valid for inequalities with more variables), you just consider the border curve f(x, y) = 0. This curve divides the plane into several areas, and either all points of an area satisfy the inequality, or none of them does. If we have a system of inequalities, we solve each inequality separately and then intersect the results. In our cases, we obtain the areas depicted in the figures. □
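For instance, system c) can be drawn directly with Sage's region_plot; this cell is our own sketch (the plotting window is an arbitrary choice):
var("x y")
r = region_plot([x^2 + (y-1)^2 >= 4, y + x^2 - 2*x >= 0, y >= 0], (x, -4, 4), (y, -2, 6), incol="lightblue")
r.show(aspect_ratio=1)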
8.A.2. Determine the domains of the following functions:
a) xy/(y(x³ + x² + x + 1)),
b) ln(x² − y²),
c) ln(−x² − y²),
d) arcsin(2χQ(x)), where χQ denotes the indicator function of the rational numbers,
e) f(x, y, z) = √(ln x · arcsin(y²z)).
Solution. a) The formula correctly expresses a value iff the denominator of the fraction is non-zero. Therefore, the formula defines a function on the set R² \ {[x, 0], [−1, y]; x, y ∈ R}.
b) The formula is correct iff the argument of the logarithm is positive, i.e., |x| > |y|. Therefore, the domain of this function is {(x, y) ∈ R²; |x| > |y|}. You can see the graph of this function in the picture.
c) This formula is again a composition of a logarithm and a polynomial of two variables. However, the polynomial −x² − y² takes on only non-positive real values, where the logarithm is undefined (as a function R → R). Hence, the domain is empty.
d) This formula correctly defines a value iff the argument of the arc sine lies in the interval $[-1,1]$, which fails for exactly those pairs $(x,y) \in \mathbb{R}^2$ whose first component is rational. The formula thus defines a function on the set $\{[x,y];\ x \in \mathbb{R}\setminus\mathbb{Q}\}$.

e) The argument of the square root must be non-negative, that is, either the value of the logarithm and the value of the arc sine are both non-negative, or both are non-positive. Thus we get that the domain is the set $$\{[x,y,z] \in \mathbb{R}^3;\ (x \ge 1 \wedge y \ne 0 \wedge 0 \le z \le \tfrac{1}{y^2}) \vee (x \in (0,1) \wedge y \ne 0 \wedge -\tfrac{1}{y^2} \le z \le 0) \vee (x > 0 \wedge y = 0)\}. \qquad\square$$

In the following examples, $k([x,y]; r)$ means the circle with the center $[x,y]$ and the radius $r$.

8.A.3. Determine the domain of the function $f$ and mark the resulting area in the plane:

i) $f(x,y) = \sqrt{(x^2+y^2-1)(4-x^2-y^2)}$,
ii) $f(x,y) = \sqrt{1-x^2} + \sqrt{1-y^2}$,
iii) $f(x,y) = \sqrt{\frac{x^2+y^2-x}{2x-x^2-y^2}}$,
iv) $f(x,y) = \arcsin\frac{x}{y} - \frac{1}{|y|-|x|}$,
v) $f(x,y) = \sqrt{1-x^2-4y^2}$,
vi) $f(x,y,z) = \sqrt{1 - \frac{x^2}{a^2} - \frac{y^2}{b^2} - \frac{z^2}{c^2}}$.

Solution. i) It has to hold that ($x^2+y^2-1 \ge 0$, $4-x^2-y^2 \ge 0$) or ($x^2+y^2-1 \le 0$, $4-x^2-y^2 \le 0$), that is, ($x^2+y^2 \ge 1$, $x^2+y^2 \le 4$) or ($x^2+y^2 \le 1$, $x^2+y^2 \ge 4$); the latter is impossible, so the domain is the annulus between the circles $k([0,0];1)$ and $k([0,0];2)$.

ii) It is the square with the center $[0,0]$ and vertices $[\pm 1, \pm 1]$.

iii) The area between the circles $k([\frac12, 0]; \frac12)$ and $k([1,0]; 1)$; the smaller circle belongs to the area, the bigger one does not.

iv) The area between the lines $y = x$ and $y = -x$ containing the $y$ axis (without these lines).

v) The ellipse (together with its interior) with the center $[0,0]$, with the major axis lying on the $x$-axis with the major semi-axis $a = 1$, and the minor axis on the $y$-axis with the minor semi-axis $b = \frac12$.

vi) The ellipsoid (with its interior) with the center $[0,0,0]$ and semi-axes lying on the $x$, $y$, $z$ axes respectively, with lengths $a$, $b$, and $c$. □

B. The topology of En

In the previous chapter, we defined general metric spaces, and we studied especially metric spaces consisting of sets of functions. As we have already seen there, many metrics can be defined on the space $\mathbb{R}^n$ (or on its subsets). For instance, considering a map of a

See 3.4.3(1) in geometry, or the axioms of metrics in 7.3.1, or the same inequality 5.2.2(2) for complex scalars.
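The domain computations of 8.A.2 and 8.A.3 can be spot-checked numerically: evaluate the formula and see where it fails. A minimal Python sketch for 8.A.2 b), with our own helper name and sample points; it only illustrates the answer $|x| > |y|$, it is not a proof.

```python
# Sample the plane and record where ln(x^2 - y^2) actually produces a value.
import math

def in_domain(x, y):
    try:
        math.log(x**2 - y**2)
        return True
    except ValueError:          # the argument of log was <= 0
        return False

for (x, y) in [(2, 1), (1, 2), (1, 1), (-3, 0.5), (0, 0)]:
    print((x, y), in_domain(x, y), "|x|>|y|:", abs(x) > abs(y))
```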
The concepts defined for real and complex scalars and discussed for metric spaces in detail can be carried over (extended) with no problem to the points $P_i$ of any Euclidean space:

Topology and metric in Euclidean spaces

(1) a Cauchy sequence: a sequence of points $P_i$ such that for every fixed $\varepsilon > 0$, $\|P_i - P_j\| < \varepsilon$ holds for all but finitely many indices $i, j$;
(2) a convergent sequence: a sequence of points $P_i$ converges to a point $P$ if and only if for every fixed $\varepsilon > 0$, $\|P_i - P\| < \varepsilon$ holds for all but finitely many indices $i$; the point $P$ is then called the limit of the sequence $P_i$;
(3) a limit point $P$ of a set $A \subset E_n$: there exists a sequence of points in $A$, different from $P$, converging to $P$;
(4) a closed set: contains all of its limit points;
(5) an open set: its complement is closed;
(6) an open $\delta$–neighbourhood of a point $P$: the set $O_\delta(P) = \{Q \in E_n;\ \|P - Q\| < \delta\}$, $\delta \in \mathbb{R}$, $\delta > 0$;
(7) a boundary point $P$ of a set $A$: every $\delta$–neighbourhood of $P$ has non-empty intersection with both $A$ and the complement $E_n \setminus A$;
(8) an interior point $P$ of a set $A$: there exists a $\delta$–neighbourhood of $P$ which lies inside $A$;
(9) a bounded set: lies inside some $\delta$–neighbourhood of one of its points (for a sufficiently large $\delta$);
(10) a compact set: both closed and bounded;
(11) limit of a mapping: $a \in \mathbb{R}^m$ is the limit of a mapping $f : \mathbb{R}^n \to \mathbb{R}^m$ at a limit point $x_0$ of its domain $A$ if for each $\varepsilon > 0$, there is a $\delta$–neighbourhood $U$ of $x_0$ such that $\|f(x) - a\| < \varepsilon$ for all $x \in U \cap A$ different from $x_0$; this happens if and only if for each sequence $x_n \in A$ converging to $x_0$, the values $f(x_n)$ converge to $a$;
(12) continuity: a mapping $f : A \subset \mathbb{R}^n \to \mathbb{R}^m$ is continuous at $x_0 \in A$ if the limit $\lim_{x\to x_0} f(x)$ exists and equals $f(x_0)$; the mapping $f$ is continuous on $A$ if it is continuous at all points of $A$.

Both the first and second items deal with norms of differences of points approaching zero. Since the square of the norm is the sum of squares of the individual components, this happens if and only if the individual components approach zero. In particular, a sequence of points $P_i$ is Cauchy or convergent if and only if these properties are possessed by the real sequences obtained from the particular coordinates of the points $P_i$ in every Cartesian coordinate system. Therefore, it also follows from Lemma 5.2.3 that every Cauchy sequence of points in $E_n$ is convergent. Especially, $E_n$ is a complete metric space. Similarly, the mappings from item (11) are $m$-tuples of the component functions, and the limits are given as the $m$-tuples of limits of these components.

Recall some further results already discussed at the more general level of metric spaces in chapter seven:

state as a subset of $\mathbb{R}^2$, the distance of two points may be defined as the time necessary to get from one of the points to the other by public transport or on foot. In France, for example, the shortest paths between most pairs of points in this metric are far from line segments. In this chapter, we will focus on the space $E_n$, that is, $\mathbb{R}^n$ with the usual metric (distance) known to mankind for a long time. The property that the shortest path between any two points of this space is the line segment connecting them could be seen as its defining property (for example, the metric above does not satisfy it). Let us examine the space $E_n$ in more detail.

8.B.1. Show that every non-empty proper subset of $E_n$ has a boundary point (which need not lie in it).

Solution.
Let $U \subset E_n$ be a non-empty subset with no boundary point. Consider a point $X \in U$, a point $Y \in U' := E_n \setminus U$, and the line segment $XY \subset E_n$. Intuitively, going from $X$ to $Y$ along this segment, we must at some moment cross from $U$ to $U'$, and this can happen only at a boundary point (everyone who has ever been to a foreign country is surely well acquainted with this fact). Formally, let $A$ be the point of $XY$ for which $|XA| = \sup\{|XZ|;\ Z \in XY,\ XZ \subset U\}$ (clearly, there is exactly one such point on the segment $XY$). This point is a boundary point of $U$: it follows from the definition of $A$ that any line segment $XB$ with $B$ between $X$ and $A$ is contained in $U$; in particular, $B \in U$. However, if there were a neighborhood of $A$ contained in $U$, then there would exist a part of the line segment $XY$ longer than $XA$ which would be contained in $U$, which contradicts the definition of the point $A$. Therefore, any neighborhood of the point $A$ contains a point from $U$ as well as a point from $E_n \setminus U$. □

8.B.2. Prove that the only non-empty clopen (both closed and open) subset of $E_n$ is $E_n$ itself.

Solution. It follows from the above exercise 8.B.1 that every non-empty proper subset $U$ of $E_n$ has a boundary point. If $U$ is closed, then it is equal to its closure; therefore, it contains all of its boundary points. However, an open set (by definition) cannot contain any of its boundary points. □

8.B.3. Show that the space $E_n$ cannot be written as the union of (at least two) disjoint non-empty open sets.

Solution. Suppose that $E_n$ can be expressed thus, i.e., $E_n = \bigcup_{i\in I} U_i$, where $I$ is an index set. Let us fix a set $U$ from this union and denote by $V$ the union of all the others. Then we can write $E_n = U \cup V$, where both $U$ and $V$ (being unions of open sets) are open. However, they are also complements of each other; therefore, they are closed as well. This contradicts the result of the previous exercise 8.B.2. □

8.B.4. Prove or disprove: a union of (even infinitely many) closed subsets of $E_n$ is a closed subset of $E_n$.

Solution. The proposition does not hold. As a counterexample, consider the union $$\bigcup_{i=3}^{\infty} \left[\frac{1}{i},\ 1 - \frac{1}{i}\right]$$

A mapping is continuous if and only if its preimages of open sets are open (check this carefully!). Further, each continuous function on a compact set $A$ is uniformly continuous, bounded, and attains its maximum and minimum, cf. paragraph 7.3.14 on page 682.

The reader should make an appropriate effort to read the paragraphs 3.4.3, 5.2.5–5.2.8, 7.3.3–7.3.5, and 7.3.12, as well as to recall or think through the definitions and connections of all these concepts.

8.1.4. Compact sets. Working with general open, closed, or compact sets could seem useless in the case of the real line $E_1$, since intervals are almost always used. In the case of metric spaces in the last part of chapter seven, the ideas are complicated at first sight. However, the same approach is easy in the case of the Euclidean spaces $\mathbb{R}^n$. It is also very useful and important (and it is, of course, a special case of general metric spaces). Just as in the case of $E_1$ or $E_2$, we deal with the open covers of sets (i.e., systems of open sets containing the given sets), and Theorem 5.2.8 is again true (with mere reformulations):

Theorem.
Subsets $A \subset E_n$ of Euclidean spaces satisfy:
(1) $A$ is open if and only if it is a union of a countable (or finite) system of $\delta$–neighbourhoods;
(2) every point $a \in A$ is either interior or boundary;
(3) every boundary point of $A$ is either an isolated or a limit point of $A$;
(4) $A$ is compact if and only if every infinite sequence contained in it has a subsequence converging to a point in $A$;
(5) $A$ is compact if and only if each of its open covers contains a finite subcover.

Proof. The proof from 5.2.8 can be reused without changes in the case of claims (1)–(3), yet now the concepts have to be perceived in a different way, and the "open intervals" are substituted with multidimensional $\delta$–neighbourhoods of appropriate points. However, the proofs of the fourth and fifth claims have to be adjusted properly. Actually, we proved the claims there for $\mathbb{R}$ and $\mathbb{C}$, thus in dimensions one and two. Thus the reader may either extend the two-dimensional reasoning, or rewrite the proof of the corresponding propositions for general metric spaces in 7.3.12, noticing the parts which can be simplified for Euclidean spaces. □

8.1.5. Curves in En. Almost all the discussion about limits, derivatives, and integrals of functions in chapters 5 and 6 concerned functions of a real variable with real or complex values, since only the triangle inequality, valid for the magnitudes of real and complex numbers, was used. This argument can be carried over to any function of a real variable with values in a Euclidean

of closed subsets of $\mathbb{R}$, which is equal to the open interval $(0,1)$. □

8.B.5. Prove or disprove: an intersection of (even infinitely many) open subsets of $E_n$ is an open subset of $E_n$.

Solution. The proposition does not hold in general. As a counterexample, consider the intersection $$\bigcap_{i=2}^{\infty} \left(1 - \frac{1}{i},\ 1 + \frac{1}{i}\right)$$ of open subsets of $\mathbb{R}$, which is equal to the closed singleton $\{1\}$. □

8.B.6. Consider the graph of a continuous function $f : \mathbb{R}^2 \to \mathbb{R}$ as a subset of $E_3$. Determine whether this subset is open, closed, and compact, respectively.

Solution. The subset is not open, since any neighborhood of a point $[x_0, y_0, f(x_0,y_0)]$ contains a segment of the line $x = x_0$, $y = y_0$. However, there is a unique point of the graph of the function on this segment, namely the point $[x_0, y_0, f(x_0,y_0)]$.

The continuity of $f$ implies that the subset is closed; we show that every convergent sequence of points of the graph of $f$ converges to a point which also lies in the graph. If such a sequence is convergent in $E_3$, then it must converge in every component, so the sequence $\{[x_n, y_n]\}_{n=1}^{\infty}$ is convergent in $\mathbb{R}^2$. Let us denote this limit by $[a,b]$. Then, it follows from the definition of continuity that the function values at the points $[x_n, y_n]$ must converge to the value $f(a,b)$. However, this means that the sequence $\{[x_n, y_n, f(x_n,y_n)]\}_{n=1}^{\infty}$ converges to the point $[a, b, f(a,b)]$, which belongs to the graph of the function $f$. Therefore, the graph is a closed set.

The subset is closed, yet it is not compact, since it is not bounded (its orthogonal projection onto the coordinate plane $xy$ is the whole $\mathbb{R}^2$). Recall that a subset of $E_n$ is compact iff it is both closed and bounded. □

And now, let us study limits of functions (a limit is defined thanks to the topology of $E_n$, see 8.1.3).

C. Limits and continuity of multivariate functions

When approaching limits of multivariate functions, there is one fact we have to deal with: let us emphasize that there is no analogue of l'Hospital's rule for multivariate functions.
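Before the computational tricks, here is a small numeric illustration of the componentwise convergence used in the solution of 8.B.6 above: a sequence in $E^2$ converges exactly when both coordinate sequences do. The particular sequence is our own example.

```python
# P_i = (1/i, 1 + 1/i^2) -> (0, 1): the distances to the limit go to zero
# exactly because each coordinate converges.
import math

P = lambda i: (1 / i, 1 + 1 / i**2)
limit = (0.0, 1.0)
for i in (1, 10, 100, 1000):
    x, y = P(i)
    dist = math.hypot(x - limit[0], y - limit[1])
    print(f"i={i:4d}  P_i=({x:.4f}, {y:.6f})  ||P_i - P|| = {dist:.2e}")
```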
When computing limits of the type $\frac{0}{0}$ or $\frac{\infty}{\infty}$, we have to be "clever". In one dimension, we can approach a point only from the right or from the left (and the limit at the point exists if both one-sided limits exist and are equal to each other). In more dimensions, we can approach a point from infinitely many directions, and the limit at the point exists iff the limits of the function restricted to every path leading to the point exist and are all equal to each other. The easiest way to obtain a limit is (as with functions of one variable) to plug the given point into the formula defining the function; if we get a meaningful expression, we are done.

space $\mathbb{R}^n$. Several tools for the work with curves are introduced in paragraphs 6.1.12–6.1.14. For every (parametrized) curve¹, that is, a mapping $c = (c_1(t), \dots, c_n(t)) : \mathbb{R} \to \mathbb{R}^n$ in an $n$–dimensional space, the concepts simply extend the ideas from univariate functions, with some extra thoughts. First note that both the limit and the derivative of curves make sense in an affine space even without selecting the coordinates (the limit is again a point in the original space, while the derivative is a vector in the modelling vector space!). In the case of an integral, curves are considered in the vector space $\mathbb{R}^n$. The reason for this can be seen even in the case of one dimension, where the origin is needed to be able to see the "area under the graph of a function". It is apparent that limits, derivatives, and integrals have to be considered via the $n$ individual coordinate components in $\mathbb{R}^n$. In particular, their existence is determined in the same way:

Basic concepts for curves

(1) limit: $\lim_{t\to t_0} c(t) = (\lim_{t\to t_0} c_1(t), \dots, \lim_{t\to t_0} c_n(t)) \in \mathbb{R}^n$;
(2) derivative: $c'(t_0) = \lim_{t\to t_0} \frac{1}{t-t_0}\big(c(t) - c(t_0)\big) = (c_1'(t_0), \dots, c_n'(t_0)) \in \mathbb{R}^n$;
(3) integral: $\int_a^b c(t)\,dt = \left(\int_a^b c_1(t)\,dt, \dots, \int_a^b c_n(t)\,dt\right) \in \mathbb{R}^n$.

We can directly formulate the analogy of the connection between the Riemann integral and the antiderivative for curves in $\mathbb{R}^n$ (see 6.2.9):

Proposition. Let $c$ be a curve in $\mathbb{R}^n$, continuous on an interval $[a,b]$. Then its Riemann integral $\int_a^b c(t)\,dt$ exists. Moreover, the curve $$C(t) = \int_a^t c(s)\,ds \in \mathbb{R}^n$$ is well defined, differentiable, and $C'(t) = c(t)$ for all values $t \in [a,b]$.

It is not simple to extend the mean value theorem and, in general, Taylor's expansion with remainder, see 5.3.9 and 6.1.3. They can be applied, in a selected coordinate system, to the particular coordinate functions of a differentiable curve $c(t) = (c_1(t), \dots, c_n(t))$ on a finite interval $[a,b]$. In the case of the mean value theorem, for instance, there are numbers $t_i$ such that $$c_i(b) - c_i(a) = (b-a)\cdot c_i'(t_i), \quad i = 1, \dots, n.$$

¹In geometry, one often makes a distinction between a curve as a subset of $E_n$ and its parametrization $\mathbb{R} \to \mathbb{R}^n$. The word "curve" means exclusively the parametrized curve here.

Otherwise, we may get an "indeterminate" expression.
There are some tricks we can use to compute such a limit:
(1) factor the numerator or the denominator according to some known formula and then cancel;
(2) multiply the numerator and the denominator by an appropriate term and then cancel;
(3) use $\frac{\text{bounded expression}}{\infty} = 0$ and $0 \cdot (\text{bounded expression}) = 0$;
(4) use an appropriate substitution to get a limit of a function of one variable;
(5) try polar coordinates $x = r\cos\varphi$, $y = r\sin\varphi$ (this usually works with the expression $x^2+y^2$: we have $x^2+y^2 = r^2\cos^2\varphi + r^2\sin^2\varphi = r^2(\cos^2\varphi + \sin^2\varphi) = r^2$, which is independent of $\varphi$);
(6) try $y = kx$ or $y = kx^2$, or generally $x = f(k)$ and $y = g(k)$ (to prove the non-existence of the limit: if the limit after the substitution depends on $k$, the original limit does not exist).
Tricks (5) and (6) are illustrated in a short symbolic sketch below.

8.C.1. $\lim_{(x,y)\to(e^2,1)} \frac{\ln x}{y}$ ⃝
8.C.2. $\lim_{(x,y)\to(4,4)} \frac{\sqrt{x}-\sqrt{y}}{x-y}$ ⃝
8.C.3. $\lim_{(x,y)\to(1,\infty)} \frac{\cos y}{x+y}$ ⃝
8.C.4. $\lim_{(x,y)\to(0,2)} \frac{e^{xy}-1}{x}$ ⃝
8.C.5. $\lim_{(x,y)\to(\infty,\infty)} \frac{x^2+y^2}{x^4+y^4}$ ⃝
8.C.6. $\lim_{(x,y)\to(0,0)} \frac{x^2+y^2}{x+y}$ ⃝
8.C.7. $\lim_{(x,y)\to(0,0)} \frac{x^2-y^2}{x^2+y^2}$ ⃝
8.C.8. $\lim_{(x,y)\to(\infty,\infty)} \left(\frac{2xy}{x^2+y^2}\right)^{x^2}$ ⃝
8.C.9. $\lim_{(x,y)\to(1,1)} \frac{x+y}{\sqrt{x^2+y^2}}$ ⃝
8.C.10. $\lim_{(x,y)\to(0,0)} \frac{x^2+y^2}{\sqrt{x^2+y^2+1}-1}$ ⃝
8.C.11. $\lim_{(x,y)\to(0,0)} xy^2\cos\frac{1}{xy^2}$ ⃝
8.C.12. $\lim_{(x,y)\to(0,0)} \frac{\sin xy}{x}$ ⃝
8.C.13. $\lim_{(x,y)\to(0,0)} \frac{x^3+y^3}{x^2+y^2}$ ⃝
8.C.14. $\lim_{(x,y)\to(\infty,\infty)} (x^2+y^2)\,e^{-(x+y)}$ ⃝
8.C.15. $\lim_{(x,y)\to(\infty,1)} \left(1+\frac{1}{x}\right)^{\frac{x^2}{x+y}}$ ⃝
8.C.16. $\lim_{(x,y)\to(0,0)} \frac{xy}{x^2+y^2}$ ⃝
8.C.17. $\lim_{(x,y)\to(0,0)} \frac{1-\cos(x^2+y^2)}{(x^2+y^2)\,xy}$ ⃝
8.C.18. Prove that $\lim_{(x,y)\to(0,0)} \frac{-y}{x^2-y}$ does not exist. ⃝
8.C.19. Prove that $\lim_{(x,y)\to(1,-2)} \frac{2x+xy-y-2}{x^2+y^2-2x+4y+5}$ does not exist. ⃝

These numbers $t_i$ are distinct in general, so the difference vector of the boundary points $c(b) - c(a)$ cannot, in general, be expressed as a multiple of the derivative of the curve at a single point. For example, for a differentiable curve $c(t) = (x(t), y(t))$ in the plane $E_2$, $$c(b) - c(a) = \big(x'(\xi)(b-a),\ y'(\eta)(b-a)\big) = (b-a)\cdot\big(x'(\xi), y'(\eta)\big)$$ for two (in general different) values $\xi, \eta \in [a,b]$. However, this reasoning is still sufficient for the following estimate:

Lemma. If $c$ is a curve in $E_n$ with continuous derivative on a compact interval $[a,b]$, then for all $a \le s \le t \le b$, $$\|c(t) - c(s)\| \le \sqrt{n}\,\Big(\max_{r\in[a,b]} \|c'(r)\|\Big)\,|t-s|.$$

Proof. Direct application of the mean value theorem gives, for appropriate points $r_i$ inside the interval $[s,t]$, $$\|c(t)-c(s)\|^2 = \sum_{i=1}^n \big(c_i(t)-c_i(s)\big)^2 \le \sum_{i=1}^n \big(c_i'(r_i)(t-s)\big)^2 \le (t-s)^2 \sum_{i=1}^n \max_{r\in[s,t]} c_i'(r)^2 \le n\,\Big(\max_{r\in[s,t],\,i=1,\dots,n} |c_i'(r)|\Big)^2 (t-s)^2 \le n\,\max_{r\in[s,t]} \|c'(r)\|^2\,(t-s)^2. \qquad\square$$

Another important concept is the tangent vector to a curve $c : \mathbb{R} \to E_n$ at a point $c(t_0) \in E_n$. It is defined as the vector in the modelling vector space $\mathbb{R}^n$ given by the derivative $c'(t_0) \in \mathbb{R}^n$. Consider $c$ to be the path of an object moving in space in time. Then the tangent vector at $t_0$ can be perceived physically as the instantaneous velocity at this point. The straight line $T$ given parametrically as $$T :\quad c(t_0) + (t-t_0)\cdot c'(t_0)$$ is called the tangent line to the curve $c$ at the point $t_0$. Unlike the tangent vector, the (unparametrized) tangent line $T$ is independent of the parametrization of the curve $c$: the chain rule ensures that changing the parametrization leads to the same tangent vector, up to a multiple.
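As promised after the list of tricks, here is a short symbolic sketch of tricks (5) and (6), applied to 8.C.13 and 8.C.16. It uses the sympy library; the variable names are our own.

```python
# Trick (5): polar coordinates settle 8.C.13.
# Trick (6): a line substitution disproves the limit in 8.C.16.
import sympy as sp

x, y, r, phi, k = sp.symbols('x y r phi k', real=True)

f13 = (x**3 + y**3) / (x**2 + y**2)
polar = sp.simplify(f13.subs({x: r*sp.cos(phi), y: r*sp.sin(phi)}))
print(polar)                  # simplifies to r*(cos(phi)**3 + sin(phi)**3)
print(sp.limit(polar, r, 0))  # 0, independently of phi -> the limit is 0

f16 = x*y / (x**2 + y**2)
along_line = sp.simplify(f16.subs(y, k*x))
print(along_line)             # k/(k**2 + 1): depends on k, so no limit exists
```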
8.1.6. Partial derivatives. If we look at the multivariate function $f(x_1, \dots, x_n) : \mathbb{R}^n \to \mathbb{R}$ as a function of one real variable $x_i$ while the other variables are considered constant, we can consider the derivative of this function. This is called the partial derivative of the function $f$ with respect to $x_i$, and it is denoted $\frac{\partial f}{\partial x_i}$, $i = 1, \dots, n$, or (without referring to the particular function) as the operator $\frac{\partial}{\partial x_i}$ on the functions. More generally, for every function $f : \mathbb{R}^n \to \mathbb{R}$ and an arbitrary curve $c : \mathbb{R} \to \mathbb{R}^n$, their composition $(f \circ c)(t) : \mathbb{R} \to \mathbb{R}$ can be considered. This composite function $F = f \circ c$ expresses the behaviour of the function $f$ along the curve

Solution. The lines through $[1,-2]$ have the equation $y = kx - k - 2$. As we approach $[1,-2]$ along one of these lines, we get the limit $\frac{k}{1+k^2}$, which is different for different $k$; thus the limit does not exist. □

Let us recall that a function is continuous at the points where the limit exists and is equal to the function value.

8.C.20. Find the discontinuity points of $f(x,y) = \frac{2x-5y}{x^2+y^2-1}$. ⃝

8.C.21. Find the discontinuity points of $f(x,y) = \frac{\sin(x^2y+xy^2)}{\cos(x-y)}$. ⃝

8.C.22. Find the discontinuity points of $f(x,y) = \begin{cases} \frac{x^3+y^3}{x^2+y^2} & \text{for } [x,y] \ne [0,0], \\ 0 & \text{for } [x,y] = [0,0]. \end{cases}$ ⃝

D. Tangent lines, tangent planes, graphs of multivariate functions

8.D.1. A car is moving at velocity given by the vector $(0,1,1)$. At the initial time $t = 0$, it is situated at the point $[1,0,0]$. The acceleration of the car at time $t$ is given by the vector $(-\cos t, -\sin t, 0)$. Describe the dependence of the position of the car on the time $t$. (A symbolic re-derivation of this computation follows below.)

Solution. As we discuss in paragraph 8.1.5, we got acquainted with the means of solving this type of problem as early as in chapter 6. Notice that the "integral curve" $C(t)$ from the proposition of paragraph 8.1.5 starts at the point $(0,0,0)$ (in other words, $C(0) = (0,0,0)$). In the affine space $\mathbb{R}^n$, we can move it so that it starts at an arbitrary point, and this does not change its derivative (this is performed by adding a constant to every component in the parametric equation of the curve). Therefore, up to translation, this integral curve is determined uniquely (nothing other than constants can be added to the components without changing the derivative). When we integrate the curve of acceleration, we get the curve of velocity $(-\sin t, \cos t - 1, 0)$. Taking the initial velocity into account as well, we obtain the velocity curve of the car: $(-\sin t, \cos t, 1)$ (we shifted the curve by the vector $(0,1,1)$ so that the velocity curve at time $t = 0$ agrees with the given initial velocity). Further integration leads to the curve $(\cos t - 1, \sin t, t)$. Shifting this by the vector $(1,0,0)$ then fits with the initial position of the car. Therefore, the car moves along the curve $[\cos t, \sin t, t]$ (this curve is called a helix). □

8.D.2. Determine both the parametric and implicit equations of the tangent line to the curve $c : \mathbb{R} \to \mathbb{R}^3$, $c(t) = (c_1(t), c_2(t), c_3(t)) = (t, t^2, t^3)$ at the point which corresponds to the parameter value $t = 1$.

Solution. The value $t = 1$ corresponds to the point $c(1) = [1,1,1]$. The derivatives of the particular components are $c_1'(t) = 1$, $c_2'(t) = 2t$, $c_3'(t) = 3t^2$. The values of

$c$. The simplest case is using parametrized straight lines: choosing the lines $c_i(t) = (x_1, \dots, x_i + t, \dots, x_n)$, the derivative of $f \circ c_i$ yields just the partial derivative $\frac{\partial f}{\partial x_i}$.
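The integration carried out in 8.D.1 above can be reproduced symbolically. A minimal sympy sketch follows; the enforcement of the initial conditions by shifting is exactly the translation argument from the solution.

```python
# Integrate the acceleration twice, fixing the integration constants by the
# initial velocity (0, 1, 1) and the initial position [1, 0, 0].
import sympy as sp

t = sp.symbols('t', real=True)
acc = sp.Matrix([-sp.cos(t), -sp.sin(t), 0])
v0  = sp.Matrix([0, 1, 1])
p0  = sp.Matrix([1, 0, 0])

vel = acc.integrate(t)             # an antiderivative (constants set to 0)
vel = vel - vel.subs(t, 0) + v0    # enforce vel(0) = v0
pos = vel.integrate(t)
pos = pos - pos.subs(t, 0) + p0    # enforce pos(0) = p0
print(sp.simplify(vel.T))          # (-sin(t), cos(t), 1)
print(sp.simplify(pos.T))          # (cos(t), sin(t), t): the helix
```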
More generally, derivatives can be defined in any direction:

Directional and partial derivatives

Definition. $f : \mathbb{R}^n \to \mathbb{R}$ has a derivative in the direction of a vector $v \in \mathbb{R}^n$ at a point $x \in E_n$ if and only if the derivative $d_vf(x)$ of the composite mapping $t \mapsto f(x+tv)$ exists at the point $t = 0$, i.e. $$d_vf(x) = \lim_{t\to 0} \frac{1}{t}\big(f(x+tv) - f(x)\big).$$ The partial derivatives are the values $\frac{\partial f}{\partial x_i} = d_{e_i}f$, where $e_i$ are the elements of the standard basis of $\mathbb{R}^n$.

In other words, the directional derivative expresses the infinitesimal increment of the function $f$ in the direction $v$. For functions in the plane, $$\frac{\partial}{\partial x}f(x,y) = \lim_{t\to 0}\frac{1}{t}\big(f(x+t,y) - f(x,y)\big), \qquad \frac{\partial}{\partial y}f(x,y) = \lim_{t\to 0}\frac{1}{t}\big(f(x,y+t) - f(x,y)\big).$$ In particular, partial differentiation with respect to a given variable is just the usual one-variable differentiation while considering the other variables to be constants.

8.1.7. The differential of a function. Partial or directional derivatives are not always good enough to obtain a fair approximation of the behaviour of a function by linear expressions. There are three concerns for a function $f : \mathbb{R}^n \to \mathbb{R}$ here. First, the directional derivatives at a point $x \in \mathbb{R}^n$ may not exist in all directions, although the partial derivatives are well defined. Second, the dependence of the directional derivatives $d_vf(x)$ on the direction $v$ need not be linear. Third, even if $d_vf(x)$ is a linear mapping in the argument $v$, the function may still not be "well behaved" around the point $x$. As an example, consider the functions in the plane with coordinates $(x,y)$ given by the formulae $$g(x,y) = \begin{cases} 1 & \text{if } xy = 0 \\ 0 & \text{otherwise} \end{cases} \qquad h(x,y) = \begin{cases} x & \text{if } y = 0 \\ y & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases} \qquad k(x,y) = \begin{cases} x & \text{if } y = x^2 \ne 0 \\ 0 & \text{otherwise.} \end{cases}$$ Both partial derivatives of $g$ at $(0,0)$ exist and no other directional derivatives do, and $g$ is not even continuous at the origin. The functions $h$ and $k$ are continuous at $(0,0)$, and $h$ has all its directional derivatives at the origin equal to zero,

the derivatives at the point $t = 1$ are $1, 2, 3$. Therefore, the parametric equations of the tangent line are $$x = c_1'(1)t + c_1(1) = t+1, \quad y = c_2'(1)t + c_2(1) = 2t+1, \quad z = c_3'(1)t + c_3(1) = 3t+1.$$ In order to get the implicit equations (which are not given canonically), we eliminate the parameter $t$, thereby obtaining $$2x - y = 1, \qquad 3x - z = 2. \qquad\square$$

8.D.3. Determine the tangent line $p$ to the curve $c(t) = (\ln t, \arctan t, e^{\sin(\pi t)})$ at the point $t_0 = 1$. ⃝

8.D.4. Find a point on the curve $c(t) = (t^2-1, -2t^2+5t, t-5)$ such that the tangent line at it is parallel to the plane $\varrho : 3x + y - z + 7 = 0$.

Solution. The direction $c'(t_0)$ of the curve $c(t)$ at $t_0$ has to be perpendicular to the normal of $\varrho$, i.e., the scalar product of these two vectors must be $0$. The tangent vector at the point $c(t)$ is $(2t, -4t+5, 1)$; the normal vector of the plane $\varrho$ is $(3,1,-1)$ (just read off the coefficients of $x$, $y$, and $z$ in the equation of $\varrho$). That is, $3\cdot 2t + 1\cdot(-4t+5) - 1\cdot 1 = 0$, i.e. $t = -2$, which gives the point $[3, -18, -7]$. □

8.D.5. Find the parametric equation of the tangent line to the curve given as the intersection of the surfaces $x^2+y^2+z^2 = 4$ and $x^2+y^2-2x = 0$ at the point $[1, 1, \sqrt{2}]$.

Solution. $p = \{[1 - \sqrt{2}s,\ 1,\ \sqrt{2} + s];\ s \in \mathbb{R}\}$. □

8.D.6. The set of differentiable functions. We can notice that multivariate polynomials are differentiable on the whole of their domain. Similarly, the composition of a differentiable univariate function and a differentiable multivariate function leads to a differentiable multivariate function. For instance, the function $\sin(x+y)$ is differentiable on the whole $\mathbb{R}^2$; $\ln(x+y)$ is a differentiable function on the set of points with $x > -y$ (an open half-plane,
i.e., without the boundary line). The proofs of these propositions are left as an exercise on limit compositions.

Remark. Notation of partial derivatives. The partial derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ in the variables $x_1, \dots, x_n$ with respect to the variable $x_1$ will be denoted both by $\frac{\partial f}{\partial x_1}$ and by the shorter expression $f_{x_1}$. In the exercise part of the book, we will rather keep to the latter notation. On the other hand, the notation $\frac{\partial f}{\partial x_1}$ better captures the fact that this is a derivative of $f$ in the direction of the vector field $\frac{\partial}{\partial x_1}$ (you will learn what a vector field is in paragraph 9.1.1).

8.D.7. Determine the domain of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = x^2\sqrt{y}$. Calculate the partial derivatives where they are defined on this domain.

except for the partial derivatives, which are equal to $1$. In particular, $d_vh(0,0)$ is not a linear mapping in the argument $v$. More generally, consider a function $f$ which, along the lines $(r\cos\theta, r\sin\theta)$ with a fixed angle $\theta$, takes the values $\alpha(\theta)r$, where $\alpha(\theta)$ is a periodic odd function of the angle $\theta$, with period $2\pi$. All of its directional derivatives $d_vf$ at $(0,0)$ exist, yet these are not linear expressions depending on the directions $v$ for general functions $\alpha(\theta)$. The graph of such an $f$ can be visualized as a "deformed cone", and we can hardly hope for a good linear approximation at its vertex. Finally, $k$ has all directional derivatives zero, i.e. $d_vk(0) = 0$ for all directions $v$, which is a linear dependence on $v \in \mathbb{R}^2$. But still, the zero mapping is a very bad approximation of $k$ along the parabola $y = x^2$. Check all these claims in detail yourselves!

Therefore, we imitate the case of univariate functions as thoroughly as possible, and avoid such pathological behaviour of functions directly by defining and using the concept of the differential:

Differential of a function

Definition. A function $f : \mathbb{R}^n \to \mathbb{R}$ has the differential at a point $x$ if and only if all of the following three conditions hold:
(1) the directional derivatives $d_vf(x)$ at the point $x$ exist for all vectors $v \in \mathbb{R}^n$;
(2) $d_vf(x)$ depends linearly on the argument $v$;
(3) $\lim_{v\to 0} \frac{1}{\|v\|}\big(f(x+v) - f(x) - d_vf(x)\big) = 0$.
The linear expression $d_vf$ (in a vector variable $v$) is then called the differential $df$ of the function $f$, evaluated at the increment $v$.

In words, it is required that the behaviour of the function $f$ at the point $x$ is well approximated by linear functions of increments of the variable quantities. It follows directly from the definition of directional derivatives that the differential can be defined solely by the property (3). If there is a linear form $df(x)$ such that the increments $v$ at the point $x$ satisfy the property (3) with $d_vf(x) = df(x)(v)$, then $df(x)(v)$ is just the directional derivative of the function $f$ at the point $x$, so the properties (1) and (2) are automatically satisfied. Notice that in dimension one, the only linear functions are multiplications by constant numbers, and if such a number satisfying (3) exists, we call it the derivative $f'(x)$ at the point $x$. Then the first two properties automatically hold true, and thus we did not have to distinguish between the derivative and the differential there.

8.1.8. Examine what can be said about the differential of a function $f(x,y)$ in the plane, supposing both partial derivatives $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$ exist and are continuous in a neighbourhood of a point $(x_0, y_0)$. To this purpose, consider any smooth curve $t \mapsto (x(t), y(t))$ with $x_0 = x(0)$, $y_0 = y(0)$.
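The claims about the function $h$ above can be probed numerically, though of course a finite difference is only an indication, not a proof. A minimal Python sketch with our own helper dir_deriv:

```python
# Directional derivatives of h at (0,0): 1 along the axes, 0 along the
# diagonal -- so v -> d_v h(0,0) cannot be linear (linearity would force 2).
def h(x, y):
    if y == 0:
        return x
    if x == 0:
        return y
    return 0.0

def dir_deriv(fun, v, t=1e-8):
    return (fun(t*v[0], t*v[1]) - fun(0.0, 0.0)) / t

for v in [(1, 0), (0, 1), (1, 1)]:
    print(v, dir_deriv(h, v))   # 1.0, 1.0, 0.0
```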
Solution. The domain of the function in question in $\mathbb{R}^2$ is the half-plane $\{(x,y);\ y \ge 0\}$. In order to determine the partial derivative with respect to a given variable, we consider the other variables to be constants in the formula that defines the function. Then, we simply differentiate the expression as a univariate function. We thus get $$f_x = 2x\sqrt{y} \quad\text{and}\quad f_y = \frac{x^2}{2\sqrt{y}}.$$ The partial derivatives exist at all points of the domain except for the boundary line $y = 0$. □

8.D.8. Determine the derivative of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x,y,z) = x^2yz$ at the point $[1,-1,2]$ in the direction $v = (3,2,-1)$.

Solution. The directional derivative can be calculated in two ways. The first one is to derive it directly from the definition (see paragraph 8.1.6). The second one is to use the differential of the function; see 8.1.7 and 8.1.8. Since the given function is a polynomial, it is differentiable on the whole $\mathbb{R}^3$. Let us follow the definition:
$$f_v(x,y,z) = \lim_{t\to 0}\frac{1}{t}\big[f(x+3t, y+2t, z-t) - f(x,y,z)\big] = \lim_{t\to 0}\frac{1}{t}\big[(x+3t)^2(y+2t)(z-t) - x^2yz\big] = \lim_{t\to 0}\frac{1}{t}\big[t(6xyz + 2x^2z - x^2y) + t^2(\dots)\big] = 6xyz + 2x^2z - x^2y.$$
We have thus derived the derivative in the direction of the vector $(3,2,-1)$ as a function of three real variables which determine the point at which we are interested in the value of the derivative. Evaluating this at the desired point leads to $f_v(1,-1,2) = -7$.

In order to compute the directional derivative from the differential of the function, we first have to determine the partial derivatives of the function: $f_x = 2xyz$, $f_y = x^2z$, $f_z = x^2y$. It follows from the note beyond theorem 8.1.8 that we can express
$$f_v(1,-1,2) = 3f_x(1,-1,2) + 2f_y(1,-1,2) + (-1)f_z(1,-1,2) = 3\cdot(-4) + 2\cdot 2 + (-1)\cdot(-1) = -7. \qquad\square$$

8.D.9. Determine the derivative of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x,y,z) = \frac{\cos(x^2y)}{z}$ at the point $[0,0,2]$ in the direction of the vector $(1,2,3)$.

Solution. The domain of this function is $\mathbb{R}^3$ except for the plane $z = 0$. The following calculations are considered only on this domain. The function in question is differentiable at the point $[0,0,2]$ (this follows from the note 8.D.6). We can determine the value of the examined directional derivative by 8.1.7, using partial derivatives. First, we determine the partial derivatives of the given function (as we have already mentioned in exercise 8.D.7, in

The idea is to use the mean value theorem for univariate functions for differences of function values where only one of the variables changes: $f(x,y) - f(x_0,y) = \frac{\partial f}{\partial x}(x_1, y)(x - x_0)$ for a suitable $x_1$ between $x_0$ and $x$. Apply this in both summands of the following expression separately, to obtain
$$\frac{1}{t}\big(f(x(t),y(t)) - f(x_0,y_0)\big) = \frac{1}{t}\big(f(x(t),y(t)) - f(x_0,y(t))\big) + \frac{1}{t}\big(f(x_0,y(t)) - f(x_0,y_0)\big) = \frac{1}{t}(x(t)-x_0)\,\frac{\partial f}{\partial x}\big(x(\xi), y(t)\big) + \frac{1}{t}(y(t)-y_0)\,\frac{\partial f}{\partial y}\big(x_0, y(\eta)\big)$$
for suitable numbers $\xi$ and $\eta$ between $0$ and $t$. Indeed, by exploiting that the curve $(x(t), y(t))$ is differentiable, there must be such values $\xi$ and $\eta$. Especially, for every sequence of numbers $t_n$ converging to zero, the corresponding sequences of numbers $\xi_n$ and $\eta_n$ also converge to zero (by the squeeze theorem), and they all satisfy the above equality.
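Before finishing the derivation, here is a quick symbolic cross-check of 8.D.8 above, computing the directional derivative both from the limit definition and from the partial derivatives (sympy; the names are ours):

```python
# Both computations give -7 for f = x^2*y*z at [1, -1, 2], v = (3, 2, -1).
import sympy as sp

x, y, z, t = sp.symbols('x y z t', real=True)
f = x**2 * y * z
P = {x: 1, y: -1, z: 2}
v = (3, 2, -1)

# (a) the limit of (f(P + t v) - f(P)) / t as t -> 0
shifted = f.subs({x: 1 + 3*t, y: -1 + 2*t, z: 2 - t})
print(sp.limit((shifted - f.subs(P)) / t, t, 0))    # -7

# (b) the gradient paired with the direction
grad = [sp.diff(f, s).subs(P) for s in (x, y, z)]
print(sum(g*c for g, c in zip(grad, v)))            # -7
```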
If $t$ converges to $0$, the continuity of the partial derivatives, together with the test for convergence of functions using subsequences of the input values (cf. 5.2.15), as well as the properties of the limits of sums and products of functions (cf. Theorem 5.2.13), imply
$$\frac{d}{dt}f(x(t),y(t))\Big|_{t=0} = x'(0)\,\frac{\partial f}{\partial x}(x_0,y_0) + y'(0)\,\frac{\partial f}{\partial y}(x_0,y_0),$$
which is a pleasant extension of the theorem on differentiation of composite functions of one variable to the case $f \circ c$. Of course, with the special choice of parametrized straight lines with direction vector $v = (\xi, \eta)$, $(x(t), y(t)) = (x_0 + t\xi, y_0 + t\eta)$, the calculation leads to the derivative in the direction $v = (\xi, \eta)$ and the equality
$$d_vf(x_0,y_0) = \frac{\partial f}{\partial x}(x_0,y_0)\,\xi + \frac{\partial f}{\partial y}(x_0,y_0)\,\eta.$$
This formula can be expressed in a neat way using the coordinate expressions of linear functions on vector spaces:
$$df = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy,$$
where $dx$ stands for the differential of the function $(x,y) \mapsto x$, i.e. $dx(v) = \xi$, and similarly for $dy$. In other words, the directional derivative $d_vf$ is a linear function $\mathbb{R}^n \to \mathbb{R}$ on the increments, with coordinates given by the partial derivatives.

Now we could similarly prove that the assumption of continuous partial derivatives at a given point guarantees the approximation property of the differential as well. In particular, note that the computation for $f \circ c$ above excluded phenomena like the function $k(x,y)$ above (there $d_vk(0,0) = 0$, but the derivative along the curve $(t,t^2)$ was one). We shall do this for general multivariate functions straightaway.

order to determine the partial derivative with respect to $x$, we differentiate it as a univariate function (in $x$) and use the chain rule; similarly for the other partial derivatives):
$$f_x = -\frac{2xy\sin(x^2y)}{z}, \quad f_y = -\frac{x^2\sin(x^2y)}{z}, \quad f_z = -\frac{\cos(x^2y)}{z^2}.$$
Evaluating the expression gives
$$f_x(0,0,2) + 2\cdot f_y(0,0,2) + 3\cdot f_z(0,0,2) = 1\cdot 0 + 2\cdot 0 + 3\cdot\left(-\frac{1}{4}\right) = -\frac{3}{4}. \qquad\square$$

8.D.10. Having a function $f : \mathbb{R}^n \to \mathbb{R}$ with differential $df(x)$ at a point $x \in \mathbb{R}^n$, determine a unit direction $v \in \mathbb{R}^n$ in which the directional derivative $d_vf(x)$ is maximal.

Solution. According to the note beyond theorem 8.1.8, we are maximizing the function $f_v(x) = v_1f_{x_1}(x) + v_2f_{x_2}(x) + \dots + v_nf_{x_n}(x)$ in dependence on the variables $v_1, \dots, v_n$, which are bound by the condition $v_1^2 + \dots + v_n^2 = 1$. We have already solved this type of problem in chapter 3, when we talked about linear optimization (see 3.A.5). The value $f_v(x)$ can be interpreted as the scalar product of the vectors $(f_{x_1}, \dots, f_{x_n})$ and $(v_1, \dots, v_n)$, and this product is maximal if the vectors have the same direction. The vector $v$ can thus be obtained by normalizing the vector $(f_{x_1}, \dots, f_{x_n})$. We see that the function grows maximally in the direction of $(f_{x_1}, \dots, f_{x_n})$. This vector is called the gradient of the function $f$. In paragraph 8.1.26, we will recall this idea and go into further details. □

Computing the differential of a function is technically very easy: just plug the partial derivatives into the coordinate formula above.

8.D.11. Find the differential of the function $f$ at the point $P$:
i) $f(x,y) = \arctan\frac{x+y}{1-xy}$, $P = [\sqrt{3}, 1]$;
ii) $f(x,y) = \arcsin\frac{x}{\sqrt{x^2+y^2}}$, $P = [1, \sqrt{3}]$;
iii) $f(x,y) = xy + \frac{x}{y}$, $P = [1,1]$.

Solution. i) $df(\sqrt{3}, 1) = \frac{1}{4}dx + \frac{1}{2}dy$; ii) $df(1, \sqrt{3}) = \frac{\sqrt{3}}{4}dx - \frac{1}{4}dy$; iii) $df(1,1) = 2\,dx$. □

Let us note that the differential of a function is a linear map:

8.D.12. Evaluate the differential of the function $f(x,y,z) = 2^x \sin y \arctan z$ at the point $[-4, \frac{\pi}{2}, 0]$ on the increments $dx = 0.05$, $dy = 0.06$, and $dz = 0.08$.

Solution. $df(-4, \frac{\pi}{2}, 0) = 0\,dx + 0\,dy + \frac{1}{16}\,dz = 0.005$. □

The differential can thus be used to approximate the values of a function.
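The chain-rule formula derived above is easy to spot-check symbolically for a concrete function and curve of our own choosing:

```python
# d/dt f(x(t), y(t)) versus f_x * x'(t) + f_y * y'(t) for f = x*y^2 along
# the curve (cos t, t^2); the difference simplifies to zero.
import sympy as sp

t, x, y = sp.symbols('t x y', real=True)
f = x * y**2
xt, yt = sp.cos(t), t**2
lhs = sp.diff(f.subs({x: xt, y: yt}), t)
rhs = (sp.diff(f, x).subs({x: xt, y: yt}) * sp.diff(xt, t)
       + sp.diff(f, y).subs({x: xt, y: yt}) * sp.diff(yt, t))
print(sp.simplify(lhs - rhs))   # 0
```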
8.1.9. The following theorem provides a crucial and very useful observation.

Continuity of partial derivatives

Theorem. Let $f : E_n \to \mathbb{R}$ be a function of $n$ variables with continuous partial derivatives in a neighbourhood of the point $x \in E_n$. Then its differential $df$ at the point $x$ exists, and its coordinate expression is given by the formula
$$df = \frac{\partial f}{\partial x_1}dx_1 + \frac{\partial f}{\partial x_2}dx_2 + \dots + \frac{\partial f}{\partial x_n}dx_n.$$

Proof. This theorem can be derived analogously to the procedure described above for the case $n = 2$. Care is needed in the details to finish the reasoning about the approximation property. As above, consider a curve $c(t) = (c_1(t), \dots, c_n(t))$ with $c(0) = (0, \dots, 0)$ and a point $x \in \mathbb{R}^n$, and express the difference $f(x + c(t)) - f(x)$ for the composite function $f(c(t))$ as the telescoping sum
$$\big(f(x_1+c_1(t), x_2+c_2(t), \dots, x_n+c_n(t)) - f(x_1, x_2+c_2(t), \dots, x_n+c_n(t))\big) + \big(f(x_1, x_2+c_2(t), \dots, x_n+c_n(t)) - f(x_1, x_2, x_3+c_3(t), \dots, x_n+c_n(t))\big) + \dots + \big(f(x_1, x_2, \dots, x_{n-1}, x_n+c_n(t)) - f(x_1, x_2, \dots, x_n)\big).$$
Now, apply the mean value theorem to all of the $n$ summands, obtaining (similarly to the case of two variables)
$$(c_1(t) - c_1(0))\,\frac{\partial f}{\partial x_1}\big(x_1+c_1(\theta_1), x_2+c_2(t), \dots, x_n+c_n(t)\big) + (c_2(t) - c_2(0))\,\frac{\partial f}{\partial x_2}\big(x_1, x_2+c_2(\theta_2), \dots, x_n+c_n(t)\big) + \dots + (c_n(t) - c_n(0))\,\frac{\partial f}{\partial x_n}\big(x_1, x_2, \dots, x_n+c_n(\theta_n)\big),$$
for appropriate values $\theta_i$, $0 \le \theta_i \le t$. This is a finite sum, so the same reasoning as in the case of two variables verifies that
$$\frac{d}{dt}f(x + c(t))\Big|_{t=0} = c_1'(0)\,\frac{\partial f}{\partial x_1}(x) + \dots + c_n'(0)\,\frac{\partial f}{\partial x_n}(x).$$
The special choice of the curves $c(t) = tv$ for a direction vector $v$ verifies the statement about the existence and linearity of the directional derivatives at $x$. Finally, apply the mean value theorem in the same way to the difference
$$f(x+v) - f(x) = d_vf(x+\theta v) = v_1\,\frac{\partial f}{\partial x_1}(x+\theta v) + \dots + v_n\,\frac{\partial f}{\partial x_n}(x+\theta v)$$

8.D.13. Approximate $\sqrt{2.98^2 + 4.05^2}$ with the use of the differential (and not with a calculator).

Solution. We use the differential of the function $f(x,y) = \sqrt{x^2+y^2}$ at the point $[3,4]$. We have $f_x' = \frac{x}{\sqrt{x^2+y^2}}$, $f_y' = \frac{y}{\sqrt{x^2+y^2}}$, thus
$$\sqrt{2.98^2 + 4.05^2} \doteq f(3,4) + df(3,4)(2.98-3,\ 4.05-4) = \sqrt{3^2+4^2} + \frac{3}{\sqrt{3^2+4^2}}(-0.02) + \frac{4}{\sqrt{3^2+4^2}}(0.05) = 5 - \frac{0.06}{5} + \frac{0.2}{5} = 5.028. \qquad\square$$

8.D.14. With the help of the differential, calculate approximately: i) $\arctan\frac{1.02}{0.95}$, ii) $\ln(0.97^2 + 0.05^2)$, iii) $\arcsin\frac{0.48}{1.05}$, iv) $1.04^{2.02}$. ⃝

8.D.15. What is approximately the change (in cm³) of the volume of a cone with base radius $r = 10$ cm and height $h = 10$ cm, if we increase the radius by 5 mm and decrease the height by 5 mm?

Solution. The volume, as a function of the radius $r$ and the height $h$, is $V(r,h) = \frac{1}{3}\pi r^2 h$. The change is approximately given by the differential of $V$ at $[10,10]$, evaluated on $dr = 10.5 - 10 = 0.5$ and $dh = 9.5 - 10 = -0.5$. We get $\frac{50}{3}\pi\ \mathrm{cm}^3$. □

8.D.16. Find the tangent plane to the graph of the function $f : \mathbb{R}^2 \to \mathbb{R}$ at the point $P = [x_0, y_0, f(x_0,y_0)]$:
i) $f(x,y) = \sqrt{1-x^2-y^2}$, $P = [x_0,y_0,z_0] = [\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, ?]$;
ii) $f(x,y) = e^{x^2+y^2}$, $P = [x_0,y_0,z_0] = [0,0,?]$;
iii) $f(x,y) = x^2 + xy + 2y^2$, $P = [x_0,y_0,z_0] = [1,1,?]$;
iv) $f(x,y) = \arctan\frac{y}{x}$, $P = [x_0,y_0,z_0] = [1,-1,?]$.

Solution. i) $f(x_0,y_0) = \sqrt{1 - \frac13 - \frac13} = \frac{1}{\sqrt{3}}$, thus $z_0 = \frac{1}{\sqrt{3}}$. Further, $f_x' = -\frac{x}{\sqrt{1-x^2-y^2}}$ and $f_y' = -\frac{y}{\sqrt{1-x^2-y^2}}$, thus $f_x'(x_0,y_0) = \frac{-1/\sqrt{3}}{1/\sqrt{3}} = -1$ and $f_y'(x_0,y_0) = -1$. The equation of the tangent plane at $[\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}]$ is $z = \frac{1}{\sqrt{3}} - (x - \frac{1}{\sqrt{3}}) - (y - \frac{1}{\sqrt{3}})$, or $x + y + z = \sqrt{3}$;
ii) $z_0 = 1$, $z = 1$;
iii) $z_0 = 4$, $3x + 5y - z = 4$;
iv) $z_0 = -\frac{\pi}{4}$, $x + y - 2z = \frac{\pi}{2}$. □
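A quick numeric companion to 8.D.13 above, comparing the differential approximation with the exact value:

```python
# f(x, y) = sqrt(x^2 + y^2) at [3, 4]: the linear approximation vs. the truth.
import math

fx, fy = 3/5, 4/5                       # the partial derivatives at [3, 4]
approx = 5 + fx*(-0.02) + fy*0.05       # f(3,4) + df(3,4)(dx, dy)
exact  = math.hypot(2.98, 4.05)
print(approx, exact, abs(approx - exact))   # 5.028 vs ~5.02821, error ~2e-4
```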
with an appropriate $\theta$, $0 \le \theta \le 1$, where the latter equality holds according to the formula for directional derivatives derived above, for sufficiently small $v$. Since all the partial derivatives are continuous at the point $x$, for an arbitrarily small $\varepsilon > 0$, there is a $\delta$–neighbourhood $U$ of the origin in $\mathbb{R}^n$ such that for $w \in U$, all the partial derivatives $\frac{\partial f}{\partial x_i}(x+w)$ differ from $\frac{\partial f}{\partial x_i}(x)$ by less than $\varepsilon$. Hence we get the estimate
$$\frac{1}{\|w\|}\Big|f(x+w) - f(x) - d_wf(x)\Big| \le \frac{1}{\|w\|}\Big|f(x+w) - f(x) - d_wf(x+\theta w)\Big| + \frac{1}{\|w\|}\Big|d_wf(x+\theta w) - d_wf(x)\Big| = \frac{1}{\|w\|}\Big|w_1\Big(\frac{\partial f}{\partial x_1}(x+\theta w) - \frac{\partial f}{\partial x_1}(x)\Big) + \dots + w_n\Big(\frac{\partial f}{\partial x_n}(x+\theta w) - \frac{\partial f}{\partial x_n}(x)\Big)\Big| \le \frac{n}{\|w\|}\,\|w\|\,\varepsilon = n\varepsilon,$$
where $\theta$ is the parameter for which the expression on the second line vanishes. Thus, the approximation property of the differential is satisfied as well. □

The approximation property of the differential can be written as $f(x+v) = f(x) + df(x)(v) + \alpha(v)$, where the function $\alpha(v)$ satisfies $\lim_{v\to 0}\frac{\alpha(v)}{\|v\|} = 0$, i.e. $\alpha(v) = o(\|v\|)$ in the asymptotic terminology introduced in 6.1.12 on page 520.

8.1.10. A plane tangent to the graph of a function. The linear approximation of the behaviour of a function by its differential can be expressed in terms of its graph, similarly to the case of univariate functions. We work with hyperplanes instead of tangent lines. In the case of a function on $E_2$ and a fixed point $(x_0,y_0) \in E_2$, consider the plane in $E_3$ given by the equation in the three coordinates $(x,y,z)$:
$$z = f(x_0,y_0) + df(x_0,y_0)(x-x_0, y-y_0) = f(x_0,y_0) + \frac{\partial f}{\partial x}(x_0,y_0)(x-x_0) + \frac{\partial f}{\partial y}(x_0,y_0)(y-y_0).$$
It is already seen that the increase of the values of a differentiable function $f : E_n \to \mathbb{R}$ between points $x + tv$ and $x$ is always expressed in terms of the directional derivative $d_vf$ at a suitable point between them. Therefore, this is the only plane containing the point $(x_0, y_0, f(x_0,y_0))$ with the property that the tangent lines of all curves $c(t) = (x(t), y(t), f(x(t),y(t)))$ lie in it. It is called the tangent plane to the graph of the function $f$.

8.D.17. Find all points on the conic $k : x^2 + 3y^2 - 2x + 6y - 8 = 0$ at which the normal of the conic is parallel to the $y$ axis. For each such point, write the equation of the tangent at that point.

Solution. The normal to $k$ at a point $[x_0, y_0] \in k$ is parallel to the $y$ axis iff the tangent to $k$ at $[x_0, y_0]$ is parallel to the $x$ axis, and this happens iff $y'(x_0) = 0$, where $y$ is the function given implicitly by $k$ in a neighbourhood of $[x_0, y_0]$. Differentiating the equation of $k$ gives $2x + 6yy' - 2 + 6y' = 0$, that is, $y' = \frac{1-x}{3(1+y)}$. Thus $y'(x_0) = 0$ iff $x_0 = 1$. Substituting into the equation of $k$, we get $1 + 3y_0^2 - 2 + 6y_0 - 8 = 0$, thus $y_0 = 1$ or $y_0 = -3$. The sought points are $[1,1]$ and $[1,-3]$; the equations of the tangents at these points are $y = 1$ and $y = -3$, respectively. □

8.D.18. On the conic given by the equation $3x^2 + 6y^2 - 3x + 3y - 2 = 0$, find all points where the normal to the conic is parallel to the line $y = x$. For each such point, give the equation of the tangent at that point. ⃝

8.D.19. On the conic given by the equation $x^2 + xy + 2y^2 - x + 3y - 54 = 0$, find all points where the normal to the conic is parallel to the $x$ axis. For each such point, give the equation of the tangent at that point. ⃝

8.D.20. On the graph of the function $u(x,y,z) = x\sqrt{y^2+z^2}$, find all points where the tangent plane is parallel to the plane $x + y - z - u = 0$. ⃝

8.D.21.
Find the points on the ellipsoid $x^2 + 2y^2 + z^2 = 1$ where the tangent planes are parallel to the plane $x - y + 2z = 0$.

Solution. The equation of the tangent plane is determined by the partial derivatives of $z = z(x,y)$, given implicitly by the equation $x^2 + 2y^2 + z^2 = 1$ of the ellipsoid. The normal vector at $[x_0, y_0, z_0]$ is $(z_x'(x_0,y_0), z_y'(x_0,y_0), -1)$. This vector has to be parallel to the normal $(1,-1,2)$ of the plane, thus $$(-2z_x'(x_0,y_0),\ -2z_y'(x_0,y_0),\ 2) = (1,-1,2),$$ which yields $2x_0 = z_0$ and $4y_0 = -z_0$; after substituting into the ellipsoid's equation, we get the sought points $[\frac{2}{\sqrt{22}}, -\frac{1}{\sqrt{22}}, \frac{4}{\sqrt{22}}]$ and $[-\frac{2}{\sqrt{22}}, \frac{1}{\sqrt{22}}, -\frac{4}{\sqrt{22}}]$.

Another solution. It is useful to realize that the normal vector at $[x_0, y_0, z_0]$ of the surface given implicitly by $F(x,y,z) = 0$ is the vector $(F_x'(x_0,y_0,z_0), F_y'(x_0,y_0,z_0), F_z'(x_0,y_0,z_0))$. □

8.D.22. Determine whether the tangent plane to the graph of the function $f : \mathbb{R}\times\mathbb{R}^+ \to \mathbb{R}$, $f(x,y) = x\cdot\ln(y)$ at the point $[1, \frac{1}{e}]$ goes through the point $[1,2,3] \in \mathbb{R}^3$.

Solution. First of all, we calculate the partial derivatives: $f_x(x,y) = \ln(y)$, $f_y(x,y) = \frac{x}{y}$; their values at the point $[1, \frac1e]$ are $-1$ and $e$; further, $f(1, \frac1e) = -1$. Therefore, the equation of the tangent plane is
$$z = f\big(1, \tfrac1e\big) + f_x\big(1, \tfrac1e\big)(x-1) + f_y\big(1, \tfrac1e\big)\big(y - \tfrac1e\big) = -1 - x + ey.$$

Two tangent planes to the graph of the function $f(x,y) = \sin(x)\cos(y)$ are shown in the illustration, the diagonal line being the image of the curve $c(t) = (t, t, f(t,t))$.

For the case of functions of $n$ variables, the tangent plane is defined as an analogy to the tangent plane to a surface in the three-dimensional space. Instead of being overwhelmed by many indices, it is useful to recall affine geometry, where hyperplanes can be used, see paragraph 4.1.3.

Tangent (hyper)planes

Definition. A tangent hyperplane to the graph of a function $f : \mathbb{R}^n \to \mathbb{R}$ at a point $x \in \mathbb{R}^n$ is the hyperplane containing the point $(x, f(x))$ whose modelling vector space is the graph of the linear mapping $df(x) : \mathbb{R}^n \to \mathbb{R}$, i.e. the differential at the point $x \in E_n$.

The definition takes advantage of the fact that the directional derivative $d_vf$ is given by the increment in the tangent (hyper)plane corresponding to the increment $v$. Many analogies with univariate functions follow from this fact. In particular, a differentiable function $f$ on $E_n$ has zero differential at a point $x \in E_n$ if and only if its composition with any curve going through this point has a stationary point there, i.e., is neither increasing nor decreasing in the linear approximation. In other words, the tangent plane at such a point is parallel to the hyperplane of the variables (i.e., its modelling space is $\mathbb{R}^n \subset \mathbb{R}^{n+1}$, with the last coordinate set to zero). Of course, this does not mean that $f$ should have a local extremum at such a point. Just as in the case of univariate functions, this depends on the values of the higher derivatives. But it is a necessary condition for the existence of an extremum.
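The tangent-plane computation of 8.D.22 can be verified with sympy; the conclusion for the point $[1,2,3]$ is drawn in the text right below.

```python
# Build the tangent plane to f(x, y) = x*ln(y) at [1, 1/e] from the partial
# derivatives and evaluate it at the candidate point.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x * sp.log(y)
x0, y0 = 1, sp.exp(-1)
plane = (f.subs({x: x0, y: y0})
         + sp.diff(f, x).subs({x: x0, y: y0}) * (x - x0)
         + sp.diff(f, y).subs({x: x0, y: y0}) * (y - y0))
print(sp.simplify(plane))                      # e*y - x - 1
print(sp.simplify(plane.subs({x: 1, y: 2})))   # 2*e - 2, which is not 3
```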
8.1.11. Derivatives of higher orders. The operation of differentiation can be iterated, similarly to the case of univariate functions. This time, new directions can be chosen for each iteration. Fix an increment $v \in \mathbb{R}^n$. The evaluation of the differentials at this argument defines a (differential) operation on differentiable functions $f : E_n \to \mathbb{R}$, $$f \mapsto d_vf = df(v),$$ and the result is again a function $df(v) : E_n \to \mathbb{R}$. If this function is differentiable as well, we can repeat this procedure with another increment, and so on.

The given point does not satisfy this equation, so it does not lie in the tangent plane. □

8.D.23. Determine the parametric equation of the tangent line to the intersection of the graphs of the functions $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = x^2 + xy - 6$, and $g : \mathbb{R}\times\mathbb{R}^+ \to \mathbb{R}$, $g(x,y) = x\cdot\ln(y)$ at the point $[2,1]$.

Solution. The tangent line to the intersection is the intersection of the tangent planes at the given point. The plane that is tangent to the graph of $f$ and goes through the point $[2,1]$ is $$z = f(2,1) + f_x(2,1)(x-2) + f_y(2,1)(y-1) = 5x + 2y - 12.$$ The tangent plane to the graph of $g$ is then $$z = g(2,1) + g_x(2,1)(x-2) + g_y(2,1)(y-1) = 2y - 2.$$ The intersection line of these two planes is given parametrically as $[2, t, 2t-2]$, $t \in \mathbb{R}$.

Another solution. The normal to the surface given by the equation $f(x,y) - z = 0$ at the point $b = [2,1,0]$ is $(f_x(b), f_y(b), -1) = (5,2,-1)$; the normal to the surface given by $g(x,y) - z = 0$ at the same point is $(0,2,-1)$. The tangent line is perpendicular to both normals; we can thus obtain a vector parallel to it as the vector product of the normals, which is $(0,5,10)$. Since the tangent line goes through the point $[2,1,0]$, its parametric equation is $[2, 1+t, 2t]$, $t \in \mathbb{R}$. □

8.D.24. Compute all first-order and second-order partial derivatives of the function $f(x,y,z) = x^{\frac{y}{z}}$.

Solution. $f_x' = \frac{y}{z}\,x^{\frac{y}{z}-1}$, $f_y' = x^{\frac{y}{z}}\ln x\cdot\frac{1}{z}$, $f_z' = x^{\frac{y}{z}}\ln x\cdot\frac{-y}{z^2}$; $f_{xx}'' = \frac{y}{z}\big(\frac{y}{z}-1\big)x^{\frac{y}{z}-2}$, $f_{yy}'' = x^{\frac{y}{z}}\ln^2 x\cdot\frac{1}{z^2}$, $f_{zz}'' = x^{\frac{y}{z}}\ln^2 x\cdot\frac{y^2}{z^4} + x^{\frac{y}{z}}\ln x\cdot\frac{2y}{z^3}$, $f_{xy}'' = \frac{1}{z}x^{\frac{y}{z}-1} + \frac{y}{z}x^{\frac{y}{z}-1}\ln x\cdot\frac{1}{z}$, $f_{xz}'' = \frac{-y}{z^2}x^{\frac{y}{z}-1} + \frac{y}{z}x^{\frac{y}{z}-1}\ln x\cdot\frac{-y}{z^2}$, $f_{yz}'' = x^{\frac{y}{z}}\ln^2 x\cdot\frac{-y}{z^3} + x^{\frac{y}{z}}\ln x\cdot\frac{-1}{z^2}$. □

8.D.25. Find all first-order and second-order partial derivatives of $z = f(x,y)$, defined in a neighbourhood of the point $[1, \sqrt{2}, 2]$ by $x^2 + y^2 + z^2 - xz - \sqrt{2}yz = 1$. ⃝

8.D.26. Find all first-order and second-order partial derivatives of $z = f(x,y)$, defined in a neighbourhood of the point $[-2, 0, 1]$ by $2x^2 + 2y^2 + z^2 + 8xz - z + 8 = 0$. ⃝

8.D.27. Determine all second partial derivatives of the function $f$ given by $f(x,y,z) = \sqrt{xy\ln z}$.

Solution. First, we determine the domain of the given function: the argument of the square root must be non-negative, and the argument of the natural logarithm must be positive. Therefore, $D_f = \{(x,y,z) \in \mathbb{R}^3;\ (z \ge 1 \wedge xy > 0) \vee (0 < z < 1 \wedge xy < 0)\}$.

In particular, we work with iterations of partial derivatives. For second-order partial derivatives, we write
$$\left(\frac{\partial}{\partial x_j}\circ\frac{\partial}{\partial x_i}\right)f = \frac{\partial^2}{\partial x_i\partial x_j}f = \frac{\partial^2 f}{\partial x_i\partial x_j}.$$
In the case of the repeated choice $i = j$, we write
$$\left(\frac{\partial}{\partial x_i}\circ\frac{\partial}{\partial x_i}\right)f = \frac{\partial^2}{\partial x_i^2}f = \frac{\partial^2 f}{\partial x_i^2}.$$
We proceed in the same way with further iterations and talk about partial derivatives of order $k$,
$$\frac{\partial^k f}{\partial x_{i_1}\dots\partial x_{i_k}}.$$
More generally, one can iterate (assuming the function is sufficiently differentiable) any directional derivatives, for instance $d_v \circ d_wf$ for two fixed increments $v, w \in \mathbb{R}^n$.

$k$–times differentiable functions

Definition. A function $f : E_n \to \mathbb{R}$ is $k$–times (continuously) differentiable at a point $x$ if and only if all its partial derivatives up to order $k$ (inclusive) exist in a neighbourhood of the point $x$ and are continuous at this point. $f$ is $k$–times differentiable if it is $k$–times (continuously) differentiable at all points of its domain.

From now on, we work with continuously differentiable functions unless explicitly stated otherwise.
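The interchangeability of mixed partial derivatives invoked in 8.D.27 (and proved in 8.1.12 below) can be spot-checked symbolically for the function from 8.D.24:

```python
# f_ab == f_ba for f = x**(y/z), for every pair of variables.
import sympy as sp

x, y, z = sp.symbols('x y z', positive=True)
f = x**(y/z)
for a, b in [(x, y), (x, z), (y, z)]:
    lhs = sp.diff(f, a, b)              # differentiate w.r.t. a, then b
    rhs = sp.diff(f, b, a)              # the opposite order
    print(a, b, sp.simplify(lhs - rhs) == 0)   # True, True, True
```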
To show the basic features of the higher derivatives in the simplest form, we work in the plane $E_2$. In the plane, as well as in the space, iterated derivatives are often denoted by mere indices referring to the variable names, for example
$$f_x = \frac{\partial f}{\partial x}, \quad f_{xx} = \frac{\partial^2 f}{\partial x^2}, \quad f_{xy} = \frac{\partial^2 f}{\partial x\partial y}, \quad f_{yx} = \frac{\partial^2 f}{\partial y\partial x}.$$
Supposing the second-order partial derivatives are continuous at the point $(x,y)$ (and exist in its neighbourhood), we show that the partial derivatives commute, that is, the order in which differentiation is carried out does not matter. By our assumption, the limits
$$f_{xy}(x,y) = \lim_{t\to 0}\frac{1}{t}\big(f_x(x, y+t) - f_x(x,y)\big) = \lim_{t\to 0}\frac{1}{t}\Big(\lim_{s\to 0}\frac{1}{s}\big(f(x+s, y+t) - f(x, y+t) - (f(x+s, y) - f(x,y))\big)\Big)$$
exist. However, since the limits can be expressed by any choice of values $t_n \to 0$ and $s_n \to 0$ and the limits of the corresponding sequences, the second derivative can be expressed

Now, we calculate the first partial derivatives with respect to each of the three variables:
$$f_x = \frac{y\ln(z)}{2\sqrt{xy\ln(z)}}, \quad f_y = \frac{x\ln(z)}{2\sqrt{xy\ln(z)}}, \quad f_z = \frac{xy}{2z\sqrt{xy\ln(z)}}.$$
Each of these three partial derivatives is again a function of three variables, so we can consider (first) partial derivatives of these functions. Those are the second partial derivatives of the function $f$. We write the variables with respect to which we differentiate as subscripts of the function $f$:
$$f_{xx} = -\frac{y^2\ln^2 z}{4(xy\ln z)^{\frac32}}, \quad f_{xy} = -\frac{xy\ln^2 z}{4(xy\ln z)^{\frac32}} + \frac{\ln z}{2\sqrt{xy\ln z}}, \quad f_{xz} = -\frac{xy^2\ln z}{4z(xy\ln z)^{\frac32}} + \frac{y}{2z\sqrt{xy\ln z}},$$
$$f_{yy} = -\frac{x^2\ln^2 z}{4(xy\ln z)^{\frac32}}, \quad f_{yz} = -\frac{x^2y\ln z}{4z(xy\ln z)^{\frac32}} + \frac{x}{2z\sqrt{xy\ln z}}, \quad f_{zz} = -\frac{x^2y^2}{4z^2(xy\ln z)^{\frac32}} - \frac{xy}{2z^2\sqrt{xy\ln z}}.$$
By the theorem about the interchangeability of partial derivatives (see 8.1.12), we know that $f_{xy} = f_{yx}$, $f_{xz} = f_{zx}$, $f_{yz} = f_{zy}$. Therefore, it suffices to compute the mixed partial derivatives (the word "mixed" means that we differentiate with respect to more than one variable) just for one order of differentiation. □

E. Taylor polynomials

8.E.1. Write the second-order Taylor expansion of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = \ln(x^2+y^2+1)$ at the point $[1,1]$.

Solution. First, we compute the first partial derivatives, $$f_x = \frac{2x}{x^2+y^2+1}, \quad f_y = \frac{2y}{x^2+y^2+1},$$ then the Hessian: $$Hf(x,y) = \begin{pmatrix} \frac{2y^2-2x^2+2}{(x^2+y^2+1)^2} & -\frac{4xy}{(x^2+y^2+1)^2} \\ -\frac{4xy}{(x^2+y^2+1)^2} & \frac{2x^2-2y^2+2}{(x^2+y^2+1)^2} \end{pmatrix}.$$ The value of the Hessian at the point $[1,1]$ is $$\begin{pmatrix} \frac29 & -\frac49 \\ -\frac49 & \frac29 \end{pmatrix}.$$

as
$$f_{xy}(x,y) = \lim_{t\to 0}\frac{1}{t^2}\Big(\big(f(x+t, y+t) - f(x, y+t)\big) - \big(f(x+t, y) - f(x,y)\big)\Big),$$
if the limit on the right-hand side exists (notice that we cannot take this for granted without further arguments). Consider the expression from which the last limit is taken as a function $\varphi(x,y,t)$, and try to express it in terms of partial derivatives. For a temporarily fixed $t$, denote $g(x,y) = f(x+t, y) - f(x,y)$. Notice that the partial derivative $g_y$ exists in a neighbourhood of $(x,y)$ and is continuous at $(x,y)$. Thus, for small $t$, we may apply the mean value theorem, viewing $g$ as a function of $y$. Therefore, the expression in the last large parentheses equals
$$g(x, y+t) - g(x,y) = t\cdot g_y(x, y+t_0)$$
for a suitable $t_0$ which lies between $0$ and $t$ (the value of $t_0$ depends on $t$). Next, $g_y(x,y) = f_y(x+t, y) - f_y(x,y)$, so we may rewrite $\varphi$ as
$$\varphi(x,y,t) = \frac{1}{t}\,g_y(x, y+t_0) = \frac{1}{t}\big(f_y(x+t, y+t_0) - f_y(x, y+t_0)\big).$$
Another application of the mean value theorem yields $\varphi(x,y,t) = f_{yx}(x+t_1, y+t_0)$ for a suitable $t_1$ between $0$ and $t$.
Since it is assumed that the second-order partial derivatives of $f$ are continuous at the point $(x,y)$, the requested limit $\lim_{t\to 0}\varphi(x,y,t)$ must exist, and we have arrived at the desired equality $f_{xy} = f_{yx}$.

8.1.12. Schwarz's Theorem. The same procedure for functions of $n$ variables proves the following fundamental result:²

Commutativity of partial derivatives

Theorem. Let $f : E_n \to \mathbb{R}$ be a $k$–times differentiable function with continuous partial derivatives up to order $k$ (inclusive) at the point $x \in \mathbb{R}^n$. Then all partial derivatives of the function $f$ at the point $x$ up to order $k$ (inclusive) are independent of the order of differentiation.

Proof. The proof for the second order is illustrated above in the special case $n = 2$. In fact, it yields the general case as well. Indeed, notice that for every fixed choice of a pair of coordinates $x_i$ and $x_j$, the discussion of their interchanging takes place in a two-dimensional affine subspace (all the other variables are considered to be constant and do not affect the discussion). So neighbouring partial derivatives may be interchanged. This solves the problem in order two. In the case of higher-order derivatives, the proof can be completed by induction on the order: every order of the indices $i_1, \dots, i_k$ can be obtained from a fixed one by several interchanges of adjacent pairs of indices. □

²This is a great example of a result which was used widely for many decades before a complete proof was published in 1873 by Karl Hermann Amandus Schwarz (1843–1921). The result is also known as Clairaut's theorem, following the earlier version for functions requiring continuous partial derivatives on a neighbourhood of $x$.

Altogether, we get that the second-order Taylor expansion at the point $[1,1]$ is
$$T_2(x,y) = f(1,1) + f_x(1,1)(x-1) + f_y(1,1)(y-1) + \frac12\,(x-1,\ y-1)\,Hf(1,1)\begin{pmatrix} x-1 \\ y-1 \end{pmatrix} = \ln(3) + \frac23(x-1) + \frac23(y-1) + \frac19(x-1)^2 - \frac49(x-1)(y-1) + \frac19(y-1)^2 = \frac19\big(x^2 + y^2 + 8x + 8y - 4xy - 14\big) + \ln(3). \qquad\square$$

Remark. In particular, we can see that the second-order Taylor expansion of an arbitrary (sufficiently differentiable) function at a given point is a polynomial of degree at most two.

8.E.2. Determine the second-order Taylor polynomial of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = xy\cos y$ at the point $[\pi, \pi]$. Decide whether the tangent plane to the graph of this function at the point $[\pi, \pi, f(\pi,\pi)]$ goes through the point $[0, \pi, 0]$.

Solution. As in the exercises above, we find out that $$T(x,y) = \frac12\pi^2y^2 - xy - \pi^3y + \frac12\pi^4.$$ The tangent plane to the graph of the given function at the point $[\pi,\pi]$ is given by the first-order Taylor polynomial at that point; its general equation is thus $$z = -\pi y - \pi x + \pi^2,$$ and this equation is satisfied by the given point $[0,\pi,0]$. □

8.E.3. Determine the third-order Taylor polynomial of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x,y,z) = x^3y + xz^2 + xy + 1$ at the point $[0,0,0]$. ⃝

8.E.4. Determine the second-order Taylor polynomial of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = x^2\sin y + y^2\cos x$ at the point $[0,0]$. Decide whether the tangent plane to the graph of this function at the point $[0,0,0]$ goes through the point $[\pi, \pi, \pi]$. ⃝

8.E.5. Determine the second-order Taylor polynomial of the function $\ln(x^2y)$ at the point $[1,1]$. ⃝

8.E.6. Determine the second-order Taylor polynomial of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = \tan(xy + y)$ at the point $[0,0]$. ⃝
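The Taylor polynomial of 8.E.1 can be rebuilt mechanically from the gradient and the Hessian; the following sympy sketch confirms the closed form obtained above.

```python
# Second-order Taylor polynomial of ln(x^2 + y^2 + 1) at [1, 1], compared
# with the closed form from the solution of 8.E.1.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = sp.log(x**2 + y**2 + 1)
P = {x: 1, y: 1}
v = sp.Matrix([x - 1, y - 1])

grad = sp.Matrix([sp.diff(f, x), sp.diff(f, y)]).subs(P)
H = sp.hessian(f, (x, y)).subs(P)
T2 = f.subs(P) + (grad.T * v)[0] + sp.Rational(1, 2) * (v.T * H * v)[0]

claimed = sp.Rational(1, 9)*(x**2 + y**2 + 8*x + 8*y - 4*x*y - 14) + sp.log(3)
print(sp.simplify(sp.expand(T2 - claimed)))   # 0
```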
In the case of higher-order derivatives, the proof can be completed by induction on the order. Every order of the indices i1, . . . , ik can be obtained from a fixed one by several interchanges of adjacent pairs of indices. □ 8.1.13. Hessian. The differential was introduced as the linear form df(x) which approximates the function f at a point x in the best possible way. Similarly, a quadratic approximation of a function f : En → R is possible. Hessian Definition. If f : Rn → R is a twice differentiable function, the symmetric matrix of functions Hf(x) = ( ∂2 f ∂xi∂xj (x) ) =     ∂2 f ∂x1∂x1 (x) . . . ∂2 f ∂x1∂xn (x) ... ... ... ∂2 f ∂xn∂x1 (x) . . . ∂2 f ∂xn∂xn (x)     is called the Hessian of the function f. It is already seen from the previous reasonings that the vanishing of the differential at a point (x, y) ∈ E2 guarantees stationary behaviour along all curves going through this point. The Hessian Hf(x, y) = ( fxx(x, y) fxy(x, y) fyx(x, y) fyy(x, y) ) plays the role of the second derivative. For every parametrized straight line c(t) = (x(t), y(t)) = (x0 + ξt, y0 + ηt), the derivative of the univariate function α(t) = f(x(t), y(t)) can be computed by means of the formula d dt f(t) = fx(x(t), y(t))x′ (t) + fy(x(t), y(t))y′ (t) (derived in 8.1.8) and so the function β (t) = f(x0, y0) + t ∂f ∂x (x0, y0)ξ + t ∂f ∂y (x0, y0)η + t2 2 ( fxx(x0, y0)ξ2 + 2fxy(x0, y0)ξη + fyy(x0, y0)η2 ) shares the same derivatives up to the second order (inclusive) at the point t = 0 (calculate this on your own!). The function β can be written in terms of vectors as β(t) = f(x0, y0) + df(x0, y0)(tv) + 1 2 Hf(x0, y0)(tv, tv), where v = (ξ, η) is the increment given by the derivative of the curve c(t), and the Hessian is used as a symmetric 2–form. This is an expression which looks like Taylor’s theorem for univariate functions, namely the quadratic approximation 724 to zero simultaneously, the system has the following solution: {x = y = 0}, {x = 0, y = 1}, {x = 1, y = 0}, {x = 1/3, y = 1/3}, which are four stationary points of the given function. The Hessian of the function f is( 2y 2x + 2y − 1 2x + 2y − 1 2x ) . Its values at the stationary points are, respectively, ( 0 −1 −1 0 ) , ( 1 1 1 0 ) , ( 0 1 1 1 ) , (2 3 1 3 1 3 2 3 ) . Therefore, the first three Hessians are indefinite, and the last one is positive definite. The point [1/3, 1/3] is thus a local minimum. □ 8.F.2. Determine the point in the plane x+y+3z = 5 lying in R3 which is closest to the origin of the coordinate system. First, do this by applying the methods of linear algebra; then, using the methods of differential calculus. Solution. It is the intersection point of the perpendicular going through the point [0, 0, 0] to the plane. The normal to the plane is (t, t, 3t), t ∈ R. Substituting into the equation of the plane, we get the intersection point [5/11, 5/11, 15/11]. Alternatively, we can minimize the distance (or its square) of the plane’s points from the origin, i. e., the function (5 − y − 3z)2 + y2 + z2 . Setting the partial derivatives equal to zero, we get the system 3y + 10z − 15 = 0 2y + 3z − 5 = 0, whose solution is as above. Since we know that the minimum exists and is the only stationary point, we need not calculate the Hessian any more. □ 8.F.3. Determine the local extrema of the function f(x, y) = x2 + arctan2 x + y3 + y , x, y ∈ R. Solution. The function f can be written as the sum f1 + f2, where f1(x) = x2 + arctan2 x, f2(y) = y3 + y , x, y ∈ R. 
If the function f has a local extremum at a point, then it does so with respect to an arbitrary subset of its domain. In other words, if the function has, for instance, a maximum at a point [a, b] and we set y = b, then the univariate function f(x, b) of x must have a maximum at the point x = a. Let us thus fix an arbitrary y ∈ R. For this fixed value of y, we get a univariate function, which is a shift of the function f1. This means that its maxima and minima are at the same points. However, it is easy to find the extrema of the function f1. We can just realize that this function is even (it is the sum of two even functions, and the function y = arctan2 x is the product of two odd functions) and increasing for x ≥ 0 (the composition as well as the sum of increasing functions is again an increasing function). Therefore, it has a unique extremum, and that is a minimum at the point x = 0. Similarly, for any fixed value of x, f is a shift of the function f2, and f2 has a minimum at the point y = 0, which is its only extremum. We CHAPTER 8. CALCULUS WITH MORE VARIABLES of a function by Taylor’s polynomial of degree two. The following illustration shows both the tangent plane and this quadratic approximation for two distinct points and the function f(x, y) = sin(x) cos(y). popis obrazku 6 5 4 x 3 2 1 0 0 1 2 3 4 y 5 6-2 -1 0 1 2 6 5 4 x 3 2 1 0 0 1 2 3 4 y 5 6-2 -1 0 1 2 8.1.14. Taylor’s expansion. The multidimensional version of Taylor’s theorem is an example of a mathematical statement where the most difficult part is finding the right formulation. The proof is then quite simple. The discussion on the Hessians continues. Write Dk f for the k–th order approximations of the function f : En → Rn . It is always a k–linear expressions in the increments. The differential D1 f = df (the first order) and the Hessian D2 f = Hf (the second order) are already discussed. For functions f : En → R, points x = (x1, . . . , xn) ∈ En, and increments v = (ξ1, . . . , ξn), set Dk f(x)(v) = ∑ 1≤i1,...,ik≤n ∂k f ∂xi1· · ·∂xik (x1, . . . , xn)ξi1· · · ξik . An illustrative example (making use of the symmetry of the partial derivatives) is, for E2, the third-order expression D3 f(x, y)(ξ, η) = ∂3 f ∂x3 ξ3 + 3 ∂3 f ∂x2∂y ξ2 η + 3 ∂3 f ∂x∂y2 ξη2 + ∂3 f ∂y3 η3 , and, in general, Dk f(x, y)(ξ, η) = k∑ ℓ=0 ( k ℓ ) ∂k f ∂xk−ℓ∂yℓ ξk−ℓ ηℓ . Taylor’s expansion with remainder Theorem. Let f : En → R be a k–times differentiable function in a neighbourhood Oδ(x) of a point x ∈ En. For every increment v ∈ Rn of size ∥v∥ < δ, there exists a number θ, 0 ≤ θ ≤ 1, such that f(x + v) = f(x) + D1 f(x)(v) + 1 2! D2 f(x)(v)+ · · · + 1 (k − 1)! Dk−1 f(x)(v) + 1 k! Dk f(x + θ · v)(v). 725 have thus proved that f can have a local extremum only at the origin. Since f(0, 0) = 0, f(x, y) > 0, [x, y] ∈ R2 ∖ {[0, 0]}, the function f has a strict local (even global) minimum at the point [0, 0]. □ 8.F.4. Examine the local extrema of the function f(x, y) = ( x + y2 ) e x 2 , x, y ∈ R. Solution. This function has partial derivatives of all orders on the whole of its domain. Therefore, local extrema can occur only at stationary points, where both the partial derivatives fx, fy are zero. Then, it can be determined whether the local extremum occurs by computing the second derivatives. We can easily determine that fx(x, y) = e x 2 + 1 2 ( x + y2 ) e x 2 , fy(x, y) = 2y e x 2 , x, y ∈ R. A stationary point [x, y] must satisfy fy(x, y) = 0, i. e. y = 0, and, further, fx(x, y) = fx(x, 0) = e x 2 ( 1 + 1 2 x ) = 0, i. e. x = −2. 
We can see that there is a unique stationary point, namely [−2, 0]. Now, we calculate the Hessian Hf at this point. If this matrix (the corresponding quadratic form) is positive definite, the extremum is a strict local minimum. If it is negative definite, the extremum is a strict local maximum. Finally, if the matrix is indefinite, there will be no extremum at the point. We have fxx(x, y) = 1 2 e x 2 ( 2 + 1 2 ( x + y2 )) , fyy(x, y) = 2 e x 2 , fxy(x, y) = fyx(x, y) = y e x 2 , x, y ∈ R. Therefore, Hf (−2, 0) = ( fxx (−2, 0) fxy (−2, 0) fyx (−2, 0) fyy (−2, 0) ) = ( 1/2e 0 0 2/e ) . We should recall that the eigenvalues of a diagonal matrix are exactly the values on the diagonal. Further, positive definiteness means that all the eigenvalues are positive. Hence it follows that there is a strict local minimum at the point [−2, 0]. □ 8.F.5. Find the local extrema of the function f(x, y, z) = x3 + y2 + z2 2 − 3xz − 2y + 2z, x, y, z ∈ R. Solution. The function f is a polynomial; therefore, it has partial derivatives of all orders. It thus suffices to look for its stationary points (the extrema cannot be elsewhere). In order to find them, we differentiate f with respect to each of the three variables x, y, z and set the derivatives equal to zero. We thus obtain 3x2 − 3z = 0, i. e., z = x2 , CHAPTER 8. CALCULUS WITH MORE VARIABLES Proof. Given an increment v ∈ Rn , consider the parametrized straight line c(t) = x + tv in En, and examine the function φ : R → R defined by the composition φ(t) = f ◦ c(t). Taylor’s theorem for univariate functions claims that (see Theorem 6.1.3) φ(t) = φ(0) + φ′ (0)t + . . . + 1 (k − 1)! φ(k−1) (0)tk−1 + 1 k! φ(k) (θ)tk . It remains to verify that computing the derivatives φ(ℓ) yields the desired relation. This can be done quite easily by induction on the order k. For k = 1, Taylor’s theorem coincides with the corollary of the mean value theorem applied to the directional derivative, which is already used several times. When deriving it, the formula d dt φ(t) = ∂f ∂x1 (x(t)) · x′ 1(t) + · · · + ∂f ∂xn (x(t)) · x′ n(t) is used. It holds for every continuously differentiable curve and function f, cf. 8.1.8 and 8.1.9. This means that φ′ (t) = D1 f(c(t))(c′ (t)) = D1 f(c(t))(v) for all t in a neighbourhood of zero. Proceed similarly for functions Dℓ f. Write c′ (t) instead of the increment v, and recall that further differentiation of c(t) leads identically to zero everywhere, i.e. c′′ (t) = 0 for all t (since it is a parametrized straight line). Suppose φ(ℓ) (t) = Dℓ f(x(t))(v) = ∑ i1,...,iℓ ( ∂ℓ f ∂xi1 . . . ∂xiℓ (x1(t), . . . , xn(t))x′ i1 (t) · · · x′ iℓ (t) ) and calculate φ(ℓ+1) (t). By the above formula for first-order differentiation in a given direction and the rule for the derivative of a product (see Theorem 5.3.4), the differentiation of the composite function gives φ(ℓ+1) (t) = d dt Dℓ f(c(t))(c′ (t)) = d dt ∑ i1,...,iℓ ( ∂ℓ f ∂xi1 . . . ∂xiℓ (x1(t), . . . , xn(t)) · x′ i1 (t) · · · x′ iℓ (t) ) = ∑ i1,...,iℓ ( n∑ j=1 ∂ℓ+1 f ∂xi1 . . . ∂xiℓ ∂xj (x1(t), . . . , xn(t)) · x′ j(t) · x′ i1 (t) · · · x′ iℓ (t) ) + 0, which is the required formula for order ℓ+1. Taylor’s theorem now follows from substituting into the equality for φ at the beginning of this proof and the enumeration at the right values of t. □ 726 2y − 2 = 0, i. e., y = 1, and (utilizing the first equation) z − 3x + 2 = 0, i. e., x ∈ {1, 2}. Therefore, there are two stationary points, namely [1, 1, 1] and [2, 1, 4]. 
Now, we compute all second-order partial deriva- tives: fxx = 6x, fxy = fyx = 0, fxz = fzx = −3, fyy = 2, fyz = fzy = 0, fzz = 1. Having this, we are able to evaluate the Hessian at the stationary points: Hf (1, 1, 1) =   6 0 −3 0 2 0 −3 0 1   , Hf (2, 1, 4) =   12 0 −3 0 2 0 −3 0 1   . Now, we need to know whether these matrices are positive definite, negative definite, or indefinite in order to determine whether and which extrema occur at the corresponding points. Clearly, the former matrix (for the point [1, 1, 1]) has eigenvalue λ = 2. Since its determinant equals −6 and it is a symmetric matrix (all eigenvalues are real), the matrix must have a negative eigenvalue as well (because the determinant is the product of the eigenvalues). Therefore, the matrix Hf (1, 1, 1) is indefinite, and there is no extremum at the point [1, 1, 1]. We will use the so-called Sylvester’s criterion for the latter matrix Hf (2, 1, 4). According to this criterion, a realvalued symmetric matrix A =       a11 a12 a13 · · · a1n a12 a22 a23 · · · a2n a13 a23 a33 · · · a3n ... ... ... ... ... a1n a2n a3n · · · ann       is positive definite if and only if all of its leading principal minors A, i. e. the determinants d1 = a11 , d2 = a11 a12 a12 a22 , d3 = a11 a12 a13 a12 a22 a23 a13 a23 a33 , . . . dn = | A |, are positive. Further, it is negative definite iff d1 < 0, d2 > 0, d3 < 0, . . . , (−1)n dn > 0. The inequalities 12 = 12 > 0, 12 0 0 2 = 24 > 0, 12 0 −3 0 2 0 −3 0 1 = 6 > 0, imply that the matrix Hf (2, 1, 4) is positive definite – there is a strict local minimum at the point [2, 1, 4]. □ 8.F.6. Find the local extrema of the function z = ( x2 − 1 ) ( 1 − x4 − y2 ) , x, y ∈ R. CHAPTER 8. CALCULUS WITH MORE VARIABLES 8.1.15. Formula with multi-indices. To simplify the notation, let us introduce the multi-index notation for the polynomials with more variables. Multi-indices A multi-index α of length n is an n-tuple of non-negative integers (α1, . . . , αn). The integer |α| = α1 + · · · + αn is called the size of the multi-index α. Monomials are written shortly as xα instead of xα1 1 xα2 2 . . . xαn n . Real polynomials in n variables can be symbolically expressed in a similar way as univariate polynomials: f = ∑ |α|≤k aαxα , g = ∑ |β|≤ℓ bβxβ ∈ R[x1, . . . , xn]. f is said to have total degree k if at least one coefficient with multi-indices α of size k is non-zero, while all the coefficients with multi-indices of larger sizes vanish. Nice formulae express addition and multiplication of multivariate polynomials of degrees k and ℓ respectively: f + g = ∑ |α|≤max(k,ℓ) (aα + bα)xα , fg = k+ℓ∑ |γ|=0 ( ∑ α+β=γ (aαbβ)xγ ) , where the multi-indices are added componentwise, and the formally non-existing coefficients are assumed to be zero. Moreover we write shortly ∂αf = ∂|α| f ∂x1 α1 . . . ∂xn αn and α! = α1! · · · αn!. In particular, ∂αf = f if |α| = 0. Taylor’s polynomials via multi-indices Taylor’s expansion up to order r of a function f : Rn → R, for an increment v ∈ Rn is the polynomial (1) f(x + v) = f(x) + ∑ 1≤|α|≤k 1 α! ∂αf(x) vα , quite as the formula in dimension one. 8.1.16. Real analytic functions. If the multidimensional power series F(v) = ∑ |α|≥0 1 α! ∂αf(x) vα converges at some neighborhood of v = 0, we call the function f (real) analytic on a neighborhood of x. For instance, this happens if all the partial derivatives are uniformly bounded, i.e., ∂αf(x) < C for all α. Indeed, we may estimate 1 k! ∑ i1,...,ik ∂k f(x) ∂xi1 . . . ∂xik vi1 . . . vik ≤ 1 k! 
nk C∥v∥k and thus the Taylor’s series converges absolutely by the Weierstrass criterion for all v. Think about the details. Actually, our argument shows that the Taylor’s series converges (on a small 727 Solution. Once again, we calculate the partial derivatives zx, zy and set them equal to zero. This leads to the equations −6x5 + 4x3 + 2x − 2xy2 = 0, ( x2 − 1 ) (−2y) = 0, whose solutions [x, y] = [0, 0], [x, y] = [1, 0], [x, y] = [−1, 0]. (In order to find these solutions, it suffices to find the real roots 1, −1 of the polynomial −6x4 + 4x2 + 2 using the substitution u = x2 . Now, we compute the second-order partial derivatives zxx = −30x4 + 12x2 + 2 − 2y2 , zxy = zyx = −4xy, zyy = −2 ( x2 − 1 ) and evaluate the Hessian at the stationary points: Hz (0, 0) = ( 2 0 0 2 ) , Hz (1, 0) = Hz (−1, 0) = ( −16 0 0 0 ) . We can see that the first matrix is positive definite, so the function has a strict local minimum at the origin. However, the second and third matrices are negative semidefinite. Therefore, the knowledge of second partial derivatives in insufficient for deciding whether there is an extremum at the points [1, 0] and [−1, 0]. On the other hand, we can examine the function values near these points. We have z (1, 0) = z (−1, 0) = 0, z (x, 0) < 0 for x ∈ (−1, 1). Further, consider y dependent on x ∈ (−1, 1) by the formula y = √ 2 (1 − x4), so that y → 0 for x → ±1. For this choice, we get z ( x, √ 2 (1 − x4) ) = ( x2 − 1 ) ( x4 − 1 ) > 0, x ∈ (−1, 1). We have thus shown that in arbitrarily small neighborhoods of the points [1, 0] and [−1, 0], the function z takes on both higher and lower values than the function value at the corresponding point. Therefore, these are not extrema. □ 8.F.7. Decide whether the polynomial p(x, y) = x6 + y8 + y4 x4 − x6 y5 has a local extremum at the stationary point [0, 0]. Solution. We can easily verify that the partial derivatives px and py are indeed zero at the origin. However, each of the partial derivatives pxx, pxy, pyy is also equal to zero at the point [0, 0]. The Hessian Hp (0, 0) is thus both positive and negative semidefinite at the same time. However, a simple idea can lead us to the result: We can notice that p(0, 0) = 0 and p(x, y) = x6 ( 1 − y5 ) + y8 + y4 x4 > 0 for [x, y] ∈ R × (−1, 1) ∖ {[0, 0]}. Therefore, the given polynomial has a local minimum at the origin. □ 8.F.8. Determine local extrema of the function f : R3 → R, f(x, y, z) = x2 y + y2 z + x − z on R3 . ⃝ 8.F.9. CHAPTER 8. CALCULUS WITH MORE VARIABLES neighborhood of x) if the partial derivatives grow with the order k slower than k!. 8.1.17. Local extrema. We examine the local maxima and minima of functions on En using the differential and the Hessian. Just as in the case of univariate functions, an interior point x0 ∈ En of the domain of a function f is said to be a (local) maximum or minimum if and only if there is a neighbourhood U of x0 such that for all points x ∈ U, the function value satisfies f(x) ≤ f(x0) or f(x) ≥ f(x0), respectively. If strict inequalities hold for all x ̸= x0, there is a strict extremum. To simplify, suppose that f has continuous both firstorder and second-order partial derivatives on its domain. A necessary condition for the existence of an extremum at a point x0 is that the differential be zero at this point, i.e., df(x0) = 0. If df(x0) ̸= 0, then there is a direction v in which dvf(x0) ̸= 0. However, then the function value is increasing at one side of the point x0 along the line x0 +tv and it is decreasing on the other side, see 5.3.2. 
An interior point x ∈ En of the domain of a function f at which the differential df(x) is zero is called a stationary point of the function f. To illustrate the concept on a simple function in E2, consider f(x, y) = sin(x) cos(y). The shape of this function resembles the well-known egg plates, so it is evident that there are many extrema, and also many stationary points which are not extrema (“saddles” are visible in the picture). 00 -1 22 -0,5 44 0 66 0,5 8 8 1 Calculate the first derivatives, and then the necessary secondorder ones: fx(x, y) = cos(x) cos(y), fy(x, y) = − sin(x) sin(y), and both derivatives are zero for two sets of points (1) cos(x) = 0, sin(y) = 0, that is (x, y) = (2k+1 2 π, ℓπ), for any k, ℓ ∈ Z (2) cos(y) = 0, sin(x) = 0, that is (x, y) = (kπ, 2ℓ+1 2 π), for any k, ℓ ∈ Z. The second partial derivatives are Hf(x, y) = ( fxx fxy fxy fyy ) (x, y) = ( − sin(x) cos(y) − cos(x) sin(y) − cos(x) sin(y) − sin(x) cos(y) ) . So the following Hessians are obtained in two sets of stationary points: 728 Determine the local extrema of the function f : R3 → R, f(x, y, z) = x2 y − y2 z + 4x + z on R3 . ⃝ 8.F.10. Determine the local extrema of the function f : R3 → R, f(x, y, z) = xz2 + y2 z − x + y on R3 . ⃝ 8.F.11. Determine the local extrema of the function f : R3 → R, f(x, y, z) = y2 z − xz2 + x + 4y on R3 . ⃝ 8.F.12. Determine the local extrema of the function f : R2 → R, f(x, y) = x2 y + x2 + 2y2 + y on R2 ⃝ 8.F.13. Determine the local extrema of the function f : R2 → R, f(x, y) = x2 y + 2y2 + 2y on R2 . ⃝ 8.F.14. Determine the local extrema of the function f : R2 → R, f(x, y) = x2 + xy + 2y2 + y on R2 . ⃝ 8.F.15. Determine the local extrema of the function f : R2 → R, f(x, y) = x2 + xy − 2y2 + y on R2 . ⃝ G. Implicitly given functions and mappings 8.G.1. Let F : R2 → R be a function, F(x, y) = xy sin (π 2 xy2 ) . Show that the equality F(x, y) = 1 implicitly defines a function f : U → R on a neighborhood U of the point [1, 1] so that F(x, f(x)) = 1 for x ∈ U. Determine f′ (1). Solution. The function is differentiable on the whole R2 , so it is such on any neighborhood of the point [1, 1]. Let us evaluate Fy at [1, 1]: Fy(x, y) = x sin (π 2 xy2 ) + πx2 y2 cos (π 2 xy2 ) , so Fy(1, 1) = 1 ̸= 0. Therefore, it follows from theorem 8.1.25 that the equation F(x, y) = 1 implicitly determines on a neighborhood of the point (1, 1) a function f : U → R defined on a neighborhood of the point (number) 1. Moreover, we have Fx(x, y) = y sin (π 2 xy2 ) + π 2 xy3 cos (π 2 xy2 ) , so the derivative of the function f at the point 1 satisfies f′ (1) = − Fx(1, 1) Fy(1, 1) = − 1 1 = −1. □ Remark. Notice that although we are unable to explicitly define the function f from the equation F(x, f(x)) = 1, we are able to determine its derivative at the point 1. 8.G.2. Considering the function F : R2 → R, F(x, y) = ex sin(y) + y − π/2 − 1, show that the equation F(x, y) = 0 implicitly defines the variable y to be a function of x, y = f(x), on a neighborhood of the point [0, π/2]. Compute f′ (0). Solution. The function is differentiable in a neighborhood of the point [0, π/2]; moreover, Fy = ex cos y+1, F(0, π/2) = 1 ̸= 0, so the equation indeed defines a function f : U → R on a neighborhood of the point [0, π/2]. Further, we have Fx = ex sin y, Fx(0, π/2) = 1, and its derivative at the point CHAPTER 8. 
CALCULUS WITH MORE VARIABLES (1) Hf(kπ + π 2 , ℓπ) = ± ( 1 0 0 1 ) , where the minus sign occurs when k and ℓ have the same parity (remainder on division by two), and the sign + occurs in the other case; (2) Hf(kπ, ℓπ + π 2 ) = ± ( 0 1 1 0 ) , where again the minus sign occurs when k and ℓ have the same parity, and the sign + occurs in the other case; From the proposition of Taylor’s theorem for order k = 2, there is, in a neighbourhood of one of the stationary points (x0, y0), f(x, y) = f(x0, y0)+ + 1 2 Hf(x0 + θ(x − x0), y0 + θ(y − y0))(ξ, η). Here, Hf is considered to be a quadratic form evaluated at the increment (x − x0, y − y0) = (ξ, η). In the case (1), Hf(x0, y0)(ξ, η) = ±(ξ2 + η2 ), while in the case (2), Hf(x0, y0)(ξ, η) = ±2ξη. While in the first case, the quadratic form is either always positive or always negative on all nonzero arguments, in the second case, there are always arguments with positive values and other arguments with negative values. Since the Hessian of the function is continuous (i.e. all the partial derivatives up to order two are continous), the Hessians in the nearby points are small perturbations of those in (x0, y0) and so these properties of the quadratic form Hf(x, y) remain true on some neighbourhood of (x0, y0). This is obvious in cases (1) and (2), since a small perturbation of the matrices does clearly not change the latter properties of the quadratic forms in question. A general formal proof is presented below. The local maximum occurs if and only if the point (x0, y0) belongs to the case (1) with k and ℓ of the same parity. On the other hand, if the parities are different, then the point from the case (1) happens to be a point of a local minimum. On the other hand, in the case (2) the entire function f behaves similarly to the Hessian and so the “saddle” points are not extrema. 8.1.18. The decision rules. In order to formulate the general statement about the Hessian and the local extrema at stationary points, it is necessary to remember the discussion about quadratic forms from the paragraphs 4.3.2–4.3.3 in the chapter on affine geometry. There are introduced the following types of quadratic forms h : En → R: • positive definite if and only if h(u) > 0 for all u ̸= 0 • positive semidefinite if and only if h(u) ≥ 0 for all u ∈ V • negative definite if and only if h(u) < 0 for all u ̸= 0 • negative semidefinite if and only if h(u) ≤ 0 for all u ∈ V • indefinite if and only if h(u) > 0 and f(v) < 0 for appropriate u, v ∈ V . 729 0 satisfies: f′ (0) = − Fx(0, π/2) Fy(0, π/2) = − 1 1 = −1. □ 8.G.3. Let F(x, y, z) = sin(xy) + sin(yz) + sin(xz). Show that the equation F(x, y, z) = 0 implicitly defines a function z(x, y) : R2 → R on a neighborhood of the point [π, 1, 0] ∈ R3 so that F(x, y, z(x, y)) = 0. Determine zx(π, 1) and zy(π, 1). Solution. We will calculate Fz = y cos(yz) + x cos(xz), Fz(π, 1, 0) = π + 1 ̸= 0, and the function z(x, y) is defined by the equation F(x, y, z(x, y)) = 0 on a neighborhood of the point [π, 1, 0]. In order to find the values of the wanted partial derivatives, we first need to calculate the values of the remaining partial derivatives of the function F at the point [π, 1, 0]. Fx(x, y, z) = y cos(xy) + z cos(xz) Fx(π, 1, 0) = −1, Fy(x, y, z) = x cos(xy) + z cos(yz) Fy(π, 1, 0) = −π, odkud zx(π, 1) = − Fx(π, 1, 0) Fz(π, 1, 0) = 1 π + 1 , zy(π, 1) = − Fy(π, 1, 0) Fz(π, 1, 0) = π π + 1 . □ 8.G.4. 
Having the mapping F : R3 → R2 , F(x, y, z) = (f(x, y, z), g(x, y, z)) = (ex sin y , xyz), show that the equation F(x, c1(x), c2(x)) = (0, 0) defines a curve c : R → R2 on a neighborhood of the point [1, π, 1]. Determine the tangent vector to this curve at the point 1. Solution. We will calculate the square matrix of the partial derivatives of the mapping F with respect to y and z: H(x, y, z) = ( fy fz gy gz ) = ( x cos y ex sin y 0 xz xy ) . Hence, H(1, π, 1) = ( −1 0 1 π ) and det H(1, π, 1) = −π ̸= 0. Now, it follows from the implicit mapping theorem (see 8.1.25) that the equation F(x, c1(x), c2(x)) = (0, 0) on a neighborhood of the point [1, π, 1] determines a curve (c1(x), c2(x)) defined on a neighborhood of the point [1, π]. In order to find its tangent vector at this point, we need to determine the )column) vector (fx, gx) at this point: ( fx gx ) = ( sin y ex sin y yz ) , ( fx(1, π, 1) gx(1, π, 1) ) = ( 0 π ) . The wanted tangent vector is thus ( (c1)x(1) (c2)x(1)) ) = ( fy(1, π, 1) fz(1, π, 1) gy(1, π, 1) gz(1, π, 1) )−1 ( fx(1, π, 1 gx(1, π, 1) ) = = ( −1 0 1 π )−1 ( 0 π ) = ( −1 0 1 π 1 π ) = ( 0 1 ) . CHAPTER 8. CALCULUS WITH MORE VARIABLES There are methods to allow determining whether or not a given form has any of these properties. The Taylor expansion with remainder immediately yields the following rules: Local extrema Theorem. Let f : En → R be a twice continuously differentiable function and x ∈ En be a stationary point of the function f. Then (1) f has a strict local minimum at x if Hf(x) is positive definite, (2) f has a strict local minimum at x if Hf(x) is negative definite, (3) f does not have an extremum at x if Hf(x) is indefinite. Proof. The Taylor second-order expansion with remainder applied to our function f(x1, . . . , xn), an arbitrary point x = (x1, . . . , xn), and any increment v = (v1, . . . , vn), such that all points x+θv, θ ∈ [0, 1], lie in the domain of the function f, says that f(x + v) = f(x) + df(x)(v) + 1 2 Hf(x + θ · v)(v) for an appropriate real number θ, 0 ≤ θ ≤ 1. Since it is supposed that the differential is zero, we obtain f(x + v) = f(x) + 1 2 Hf(x + θ · v)(v). By assumption, the quadratic form Hf(x) is continuously dependent on the point x, and the definiteness or indefiniteness of quadratic forms can be determined by the sign of the major subdeterminants of the matrix Hf, see Sylvester’s criterion in paragraph 4.3.3. However, the determinant itself is a polynomial expression in the coefficients of the matrix, hence a continuous function. Therefore, the non-vanishing and signs of the examined determinants are the same in a sufficiently small neighbourhood of the point x as at the point x itself. In particular, for a positive definite Hf(x), it is guaranteed that at a stationary point x, f(x + v) > f(x) for sufficiently small v. So this is a sharp minimum of the function f at the point x. The case of negative definiteness is analogous. If Hf(x) is indefinite, then there are directions v, w in which f(x + v) > f(x) and f(x + w) < f(x), so there is no extremum at the stationary point in question. □ The theorem yields no result if the Hessian of the function is degenerate, yet not indefinite at the point in question. The reason is the same as in the case of univariate functions. In these cases, there are directions in which both the first and second derivatives vanish, so at this level of approximation, it cannot be determined whether the function behaves like t3 or ±t4 until higher-order derivatives in the necessary directions are calculated. 
At the same time, even at those points where the differential is non-zero, the definiteness of the Hessian Hf(x) has 730 □ H. Constrained optimization We will begin with a somewhat atypical optimization problem. 8.H.1. A betting office accepts bets on the outcome of a tennis match. Let the odds laid against player A winning be a : 1 (i. e., if a bettor bets x dollars on the event that player A wins and this really happens, then the bettor wins ax dollars) and, similarly, let the odds laid against player B winning be b : 1 (fees are neglected). What is the necessary and sufficient condition for (positive real) numbers a and b so that a bettor cannot guarantee any profit regardless the actual outcome of the match? (For instance, if the odds were laid 1.5 : 1 against the win of A and 5 : 1 against the win of B, then the bettor could bet 3 dollars on B winning and 7 dollars on A winning and profit from this bet in either case). Solution. Let the bettor have P dollars. The bet amount can be divided to kP and (1 − k)P dollars, where k ∈ (0, 1). The profit is then akP dollars (if player A wins) or b(1−k)P dollars (if B does). The bettor is always guaranteed to win the lesser of these two amounts; the total profit (or loss) is obtained by subtracting the bet P, then. Since each of a, b, P is a positive real number, the function akP is increasing, and the function b(1 − k)P is decreasing with respect to k. For k = 0, b(1−k)P is greater; for k = 1, akP is. The minimum of the two numbers akP and b(1 − k)P is thus maximal for a k ∈ (0, 1), namely for the value k0 which satisfies ak0P = b(1 − k0)P, whence k0 = b a+b . Therefore, the betting office must choose a, b so that ak0P = b(1 − k0)P < P, which is equivalent to ak0 < 1, i. e., ab < a + b. □ We managed to solve this constrained optimization problem even without using the differential calculus. However, we will not be able to do so in the following problems. 8.H.2. Find the extremal values of the function h(x, y, z) = x3 + y3 + z3 on the unit sphere S in R3 given by the equation F(x, y, z) = x2 + y2 + z2 − 1 as well as on the circle which is the intersection of this sphere with the plane G(x, y, z) = x + y + z. Solution. First, we will look for stationary points of the function h on the sphere S. Computing the corresponding gradients (for instance, grad h(x, y, z) = (3x2 , 3y2 , 3z2 )) , we get the system 0 = 3x2 − 2λx, 0 = 3y2 − 2λy, 0 = 3z2 − 2λz, 0 = x2 + y2 + z2 − 1 CHAPTER 8. CALCULUS WITH MORE VARIABLES similar consequences as the non-vanishing of the second derivative of a univariate function. For a function f : Rn → R, the expression z(x + v) = f(x) + df(x)(v) defines the tangent hyperplane to the graph of the function f in the space Rn+1 . Taylor’s theorem of order two with remainder, as used in the proof above, provides the expression f(x + v) = z(x + v) + 1 2 Hf(x + θv)(v). If the Hessian is positive definite, all the values of the function f lie above the values of the tangent hyperplane for arguments in a sufficiently small neighbourhood of the point x, i.e. the whole graph is above the tangent hyperplane in a sufficiently small neighbourhood. In the case of negative definiteness, it is the other way round. Finally, when the Hessian is indefinite, the graph of the function has values on both sides of the hyperplane. This happens, in general, along objects of lower dimensions in the tangent hyperplane, so there is no straightforward generalization of inflection points. 8.1.19. The differential of mappings. 
The concepts of derivative and differential can be easily extended to mappings F : En → Em. Having selected the Cartesian coordinate system on both sides, this mapping is an ordinary m–tuple F(x1, . . . , xn) = (f1(x1, . . . , xn), . . . , fm(x1, . . . , xn)) of functions fi : En → R. F is defined to be a differentiable or k–times differentiable mapping if and only if the corresponding property is shared by all the functions f1, . . . , fm. The differentials dfi(x) of the particular functions fi give a linear approximation of the increments of their values for the mapping F. Therefore, we can expect that they also give a coordinate expression of the linear mapping D1 F(x) : Rn → Rm between the modelling spaces which linearly approximates the increments of the mapping F. Differential and Jacobi matrix Consider a differentiable mapping F : Rn → Rm with components (f1(x1, . . . , xn), . . . , fm(x1, . . . , xn)) and x in its domain. The matrix D1 F(x) =     df1(x) df2(x) ... dfm(x)     =       ∂f1 ∂x1 ∂f1 ∂x2 . . . ∂f1 ∂xn ∂f2 ∂x1 ∂f2 ∂x2 . . . ∂f2 ∂xn ... ... ... ... ∂fm ∂x1 ∂fm ∂x2 . . . ∂fm ∂xn       (x) is called the Jacobi matrix of the mapping F at the point x. The linear mapping D1 F(x) defined on the increments v = (v1, . . . , vn) by the Jacobi matrix is called the differential of the mapping F at a point x in the domain if and only if lim v→0 1 ∥v∥ ( F(x + v) − F(x) − D1 F(x)(v) ) = 0. Recall that the definition of Euclidean distance guarantees that the limits of values in En exist if and only if the 731 consisting of four equations in four variables. Before trying to solve this system, we can estimate how many local constrianed extrema we should anticipate the function to have. Surely, h(P) is in absolute value equal to at most 1, and this happens at all intersection points of the coordinate axes with S. Therefore, we are likely to get 6 local extrema. Further, inside every eighth of the sphere given by the coordinate planes, there may or may not be another extremum. The particular quadrants can be easily parametrized, and the function h (considered a function of two parameters) can be analyzed by standard means (or we can have it drawn in Maple, for example). Actually, solving the system (no matter whether algebraically or in Maple again) leads to a great deal of stationary points. Besides the six points we have already talked about (two of the coordinates equal to zero and the other to ±1) and which have λ = ±3 2 , there are also the points P± = ± (√ 3 3 , √ 3 3 , √ 3 3 ) , for example, where a local extremum indeed occurs. If we restrict our interest to the points of the circle K, we must give another function G another free parameter η representing the gradient coefficient. This leads to the bigger system 0 = 3x2 − 2λx − η, 0 = 3y2 − 2λy − η, 0 = 3z2 − 2λz − η, 0 = x2 + y2 + z2 − 1, 0 = x + y + z. However, since a circle is also a compact set, h must have both a global minimum and maximum on it. Further analysis is left to the reader. □ 8.H.3. Determine whether the function f : R3 → R, f(x, y, z) = x2 y has any extrema on the surface 2x2 + 2y2 + z2 = 1. If so, find these extrema and determine their types. Solution. Since we are interested in extrema of a continuous function on a compact set (ellipsoid) – it is both closed and bounded in R3 – the given function must have both a minimum and maximum on it. 
Moreover, since the constraint is given by a continuously differentiable function and the examined function is differentiable, the extrema must occur at stationary points of the function in question on the given set. We can build the following system for the stationary points: 2xy = 4kx, x2 = 4ky, 0 = 2kz. This system is satisfied by the points [± 1√ 3 , 1√ 6 , 0] and [± 1√ 3 , − 1√ 6 , 0]. The function takes on only two values at these four stationary points. Ir follows from the above that the first and second stationary points are maxima of CHAPTER 8. CALCULUS WITH MORE VARIABLES limits of the particular coordinate components do. Direct application of Theorem 8.1.6 about the existence of the differential for functions of n variables to the particular coordinate functions of the mapping F thus leads to the following generalization (prove this in detail by yourselves!): Existence of the differential Corollary. Let F : En → Em be a mapping such that all of its coordinate functions have continuous partial derivatives in a neighbourhood of a point x ∈ En. Then the differential D1 F(x) exists, and it is given by the Jacobi matrix D1 F(x). 8.1.20. Lipschitz continuity. Continuous differentiability of mappings allows good control on their variability in the following sense. Assume the estimates of the difference F(y) − F(x) for all x and y from a convex compact subset K in the domain of F are of interest. Applying the Taylor’s theorem with remainder in order one on each of the components of F = (f1, . . . , fn) separately gives the estimate (write v = y − x) ∥F(y) − F(x)∥2 = m∑ i=1 |fi(y) − fi(x)|2 = m∑ i=1 |D1 fi(x + θiv)(v)|2 = m∑ i=1 n∑ j=1 ∂fi ∂xj (x + θiv)vj 2 ≤ ( max z∈K,i,j ∂fi ∂xj (z) 2 ) nm∥v∥2 = C2 ∥v∥2 for an appropriate constant C ≥ 0. The fact that continuous functions are bounded over each compact set is used. This is the property of Lipschitz continuity of F on the compact set K: ∥F(y) − F(x)∥ ≤ C∥y − x∥, for all x, y ∈ K which was considered in 7.3.14 in the end of chapter 7. Proposition. Each continuously differentiable mapping F : Rn → Rm is Lipschitz continuous over convex compact sets. 8.1.21. Differential of composite mappings. The following theorem formulates a very useful generalization of the chain rule for univariate functions. Except for the concept of the differential itself, which is mildly complicated, it is actually the same as the one already seen in the case of one variable. The Jacobi matrix for univariate functions is a single number, namely the derivative of the function at a given point, so the multiplication of Jacobi matrices is simply the multiplication of the derivatives of the outer and inner components of the function. There is, of course, another special case: the formula derived and used several times for the derivative of a composition of multivariate functions with curves. There, the differential is the linear form expressed via the partial derivatives of the outer components, evaluated on the vector of the 732 the function on the given ellipsoid, while the other two are minima. □ Remark. Note that we have used the variable k instead of λ from the theorem 8.1.29. 8.H.4. Decide whether the function f : R3 → R, f(x, y, z) = z − xy2 has any minima and maxima on the sphere x2 + y2 + z2 = 1. If so, determine them. Solution. We are looking for solutions of the system kx = −y2 , ky = −2xy, kz = 1. The second equation implies that either y = 0 or x = −k 2 . The first possibility leads to the points [0, 0, 1], [0, 0, −1]. The second one cannot be satisfied. 
Note that because of the third equation k ̸= 0 and substituting into the equation of the sphere, we get the equation k2 4 + k2 2 + 1 k2 = 1, which has no solution in real numbers (it is a quadratic equation in k2 with the negative discriminant). The function has a maximum and minimum, respectively, at the two computed points on the given sphere. □ 8.H.5. Determine whether the function f : R3 → R, f(x, y, z) = xyz, has any extrema on the ellipsoid given by the equation g(x, y, z) = kx2 + ly2 + z2 = 1, k, l ∈ R+ . If so, calculate them. Solution. First, we build the equations which must be satisfied by the stationary points of the given function on the ellipsoid: ∂g ∂x = λ ∂f ∂x : yz = 2λkx, ∂g ∂y = λ ∂f ∂y : xz = 2λly, ∂g ∂z = λ ∂f ∂z : xy = 2λz. We can easily see that the equation can only be satisfied by a triple of non-zero numbers. Dividing pairs of equations and substituting into the ellipse’s equation, we get eight solutions, namely the stationary points x = ± 1√ 3k , y = ± 1√ 3l , z = ± 1√ 3 . However, the function f takes on only two distinct values at these eight points. Since it is continuous and the given ellipsoid is compact, f must have both a maximum and minimum on it. Moreover, since both f and g are continuously differentiable, these extrema must occur at stationary points. Therefore, it must be that four of the computed stationary points are local maxima of the function (of value 1 3 √ 3kl ) and the other four are minima (of value − 1 3 √ 3kl ). □ CHAPTER 8. CALCULUS WITH MORE VARIABLES derivative of the inner component, again given by the product of the one line (the form) and one column (the vector). The chain rule Theorem. Let F : En → Em and G : Em → Er be two differentiable mappings, where the domain of G contains the whole image of F. Then, the composite mapping G ◦ F is also differentiable, and its differential at any point x in the domain of F is given by the composition of differentials D1 (G ◦ F)(x) = D1 G(F(x)) ◦ D1 F(x). The Jacobi matrix on the left hand side is the product of the corresponding Jacobi matrices on the right hand side. Proof. In paragraph 8.1.6 and in the proof of Taylor’s theorem, it was derived how the differentiation of mappings composed of functions and curves behaves. This proved the theorem in the special case of n = r = 1. The general case can be proved analogously, one just has to work with more vectors. Fix an arbitrary increment v and calculate the directional derivative for the composition G ◦ F at a point x ∈ En. This means to determine the differentials for the particular coordinate functions of the mapping G composed with F. To simplify, write g ◦ F for any one of them. dv(g ◦ F)(x) = lim t→0 1 t ( g(F(x + tv)) − g(F(x)) ) . The expression in parentheses can, from the definition of the differential of g, be expressed as g(F(x + tv)) − g(F(x) = dg(F(x))(F(x + tv) − F(x)) + α(F(x + tv) − F(x)), where α is a function defined on a neighbourhood of the point F(x) which is continuous and limv→0 1 ∥v∥ α(v) = 0. Substitution into the equality for the directional derivative yields dv(g ◦ F)(x) = lim t→0 1 t ( dg(F(x))(F(x + tv) − F(x)) + α ( F(x + tv) − F(x) ) ) = dg(F(x)) ( lim t→0 1 t ( F(x + tv) − F(x) )) + lim t→0 1 t ( α ( F(x + tv) − F(x) ) ) = dg(F(x)) ◦ D1 F(x)(v) + 0. The fact that linear mappings between finite-dimensional spaces are always continuous was used. In the last step the Lipschitz continuity of F, i.e. ∥F(x+tv)−F(x)∥ ≤ C∥v∥|t| was exploited, and the properties of the function α. 
So the theorem for the particular functions g1, . . . , gr of the mapping G is proved. The theorem in general now follows from the definition of matrix multiplication and its links to linear mappings. Think about all details! □ 733 8.H.6. Determine the global extrema of the function f(x, y) = x2 − 2y2 + 4xy − 6x − 1 on the set of points [x, y] that satisfy the inequalities (1) x ≥ 0, y ≥ 0, y ≤ −x + 3. Solution. We are given a polynomial with continuous partial derivatives on a compact (i. e. closed and bounded) set. Such a function necessarily has both a minimum and a maximum on this set, and this can happen only at stationary points or on the boundary. Therefore, it suffices to find stationary points inside the set and the ones on a finite number of open (or singleton) parts of the boundary, then evaluate f at these points and choose the least and the greatest values. Notice that the set of points determined by the inequalities (1) is clearly a triangle with vertices at [0, 0], [3, 0], [0, 3]. Let us determine the stationary points inside this triangle as the solution of the equations fx = 0, fy = 0. Since fx(x, y) = 2x + 4y − 6, fy(x, y) = 4x − 4y, these equations are satisfied only by the point [1, 1]. The boundary suggests itself to be expressed as the union of three line segments given by the choice of pairs of vertices. First, we consider x = 0, y ∈ [0, 3], when f(x, y) = −2y2 − 1. However, we know the graph of this (univariate) function on the interval [0, 3] It is thus not difficult to find the points at which global extrema occur. They are the marginal points [0, 0], [0, 3]. Similarly, we can consider y = 0, x ∈ [0, 3], also obtaining the marginal points [0, 0], [3, 0]. Finally, we get to the line segment y = −x+3, x ∈ [0, 3]. Making some rearrangements, we get f(x, y) = f(x, −x + 3) = −5x2 + 18x − 19, x ∈ [0, 3]. We thus need to find the stationary points of the polynomial p(x) = −5x2 + 18x − 19 from the interval [0, 3]. The equation p′ (x) = 0, i. e., −10x + 18 = 0, is satisfied by x = 9/5. This means that in the last case, we obtained one more point (besides the marginal points), namely [9/5, 6/5], where a global extremum may occur. Altogether, we have these points as “suspects”: [1, 1], [0, 0], [0, 3], [3, 0], [9 5 , 6 5 ] with function values −4, −1, −19, −10, −14 5 , respectively. We can see that the function f takes on the greatest value −1 at the point [0, 0] and the least value −19 at the point [0, 3]. □ 8.H.7. Determine whether the function f : R3 → R, f(x, y, z) = y2 z has any extrema on the line segment given by the equations 2x + y + z = 1, x − y + 2z = 0 and the constraint x ∈ [−1, 2]. If so, find these extrema and determine their types. Justify all of your decisions. Solution. We are looking for the extrema of a continuous function on a compact set. Therefore, the function must have both a minimum and a maximum on this set, and this will happen either at the marginal points of the segment or at those CHAPTER 8. CALCULUS WITH MORE VARIABLES 8.1.22. Transformation of coordinates. A mapping F : En → En which has an inverse mapping G : En → En defined on the entire image of F is called a transformation. Such a mapping can be perceived as a change of coordinates. It is usually required that both F and G be (continuously) differentiable mappings. Just as in the case of vector spaces, the choice of “point of view”, i.e. the choice of coordinates, can simplify or deteriorate comprehension of the examined object. 
The change of coordinates is now being discussed in a much more general form than in the case of affine mappings in the fourth chapter. Sometimes, the term “curvilinear coordinates” is used in this general sense. An illustrative example is the change of the most usual coordinates in the plane to polar coordinates. That is, the position of a point P is given by its distance r = √ x2 + y2 from the origin and the angle φ = arctan(y/x) between the ray from the origin to it and the x-axis (if x ̸= 0). Notice, this is just the transformation between the algebraic and geometric form of a complex num- ber. The illustration shows the the “line” r = φ drawn in the Cartesian coordinates. The change from the polar coordinates to the standard ones is Ppolar = (r, φ) → (r cos φ, r sin φ) = PCartesian It is apparent that it is necessary to limit the polar coordinates to an appropriate subset of points (r, φ) in the plane so that the inverse mapping would exist. The Cartesian image of lines in polar coordinates with constant coordinates r or φ is also shown in the illustration above. Let us discuss an example how to deal with the concept of transformation and the theorem about differentiation of composite mappings. The inverse to the above is the transformation F : R2 → R2 (for instance, on the domain of all points in the first quadrant except for the points having x = 0): r = √ x2 + y2, φ = arctan y x . Consider now the function gt : E2 → R, with free parameter t ∈ R, g(r, φ, t) = sin(r − t) 734 where the gradient of the examined function is a linear combination of the gradients of the functions that give the constraints. First, let us look for the points which satisfy the gradient condition: 0 = 2k + l, 2yz = k − l, y2 = k + 2l, 2x + y + z = 1, x − y + 2z = 0. The solution of the system is [x, y, z] = [2 3 , 0, −1 3 ] and [x, y, z] = [4 9 , 2 9 , −1 9 ] (of course, the variables k and l can also be computed, but we are not interested in them). The marginal points of the given line segment are [−1, 5 3 , 4 3 ] and [2, −4 3 , −5 3 ]. Considering these four points, the function takes on the greatest value at the first marginal point (f(x, y, z) = 100 27 ), which is its maximum on the given segment, and it takes the least value at the second marginal point (f(x, y, z) = −80 27 ), which is thus its minimum there. □ 8.H.8. Find the maximal and minimal values of the polyno- mial p(x, y) = 4x3 − 3x − 4y3 + 9y on the set M = { [x, y] ∈ R2 ; x2 + y2 ≤ 1 } . Solution. This is again the case of a polynomial on a compact set; therefore, we can restrict our attention to stationary points inside or on the boundary of M and the “marginal” points on the boundary of M. However, the only solutions of the equations px(x, y) = 12x2 − 3 = 0, py(x, y) = −12y2 + 9 = 0 are the points [ 1 2 , √ 3 2 ] , [ 1 2 , − √ 3 2 ] , [ −1 2 , √ 3 2 ] , [ −1 2 , − √ 3 2 ] , which are all on the boundary of M. This means that p has no extremum inside M. Now, it suffices to find the maximum and minimum of p on the unit circle k : x2 + y2 = 1. The circle k can be expressed parametrically as x = cos t, y = sin t, t ∈ [−π, π]. Thus, instead of looking for the extrema of p on M, we are now seeking the extrema of the function f(t) := p(cos t, sin t) = 4 cos3 t − 3 cos t − 4 sin3 t + 9 sin t on the interval [−π, π]. 
For t ∈ [−π, π], we have f′ (t) = −12 cos2 t sin t + 3 sin t − 12 sin2 t cos t + 9 cos t, In order to determine the stationary points, we must express the function f′ in a form from which we will be able to calculate the intersection of its graph with the x-axis. To this purpose, we will use the identity 1 cos2 t = 1 + tg2 t, CHAPTER 8. CALCULUS WITH MORE VARIABLES in polar coordinates. Such a function can approximate the waves on a water surface after a point impulse in the origin at the time t, see the illustration (there, t = −π/2). While it was easy to define the function in polar coordinates, it would have been much harder to guess with Cartesian coordinates. Compute the derivative of this function in Cartesian coordinates. Using the theorem, ∂g ∂x (x, y, t) = ∂g ∂r (r, φ) ∂r ∂x (x, y) + ∂g ∂φ (r, φ) ∂φ ∂x (x, y) = cos( √ x2 + y2 − t) x √ x2 + y2 + 0 and, similarly, ∂g ∂y (x, y, t) = ∂g ∂r (r, φ) ∂r ∂y (x, y) + ∂g ∂φ (r, φ) ∂φ ∂y (x, y) = cos( √ x2 + y2 − t) y √ x2 + y2 . 8.1.23. The inverse mapping theorem. If the first derivative of a differentiable univariate function is non-zero, its sign determines whether the function is increasing or decreasing. Then, the function has this property in a neighbourhood of the point in question, and so an inverse function exists in the selected neighbourhood. The derivative of the inverse function f−1 is then the reciprocal value of the derivative of the function f (i.e. the inverse with respect to multiplication of real numbers). For higher dimensions there is the analogous re- sult: The inverse mapping theorem Theorem. Let F : En → En be a differentiable mapping on a neighbourhood of a point x0 ∈ En, and let the Jacobi matrix D1 F(x0) be invertible. Then in some neighbourhood of y0 = F(x0), the inverse mapping F−1 exists, it is differentiable, and its differential at the point F(x0) is the inverse mapping to the differential D1 F(x0). Hence, D1 (F−1 )(F(x0)) is given by the inverse matrix to the Jacobi matrix of the mapping F at the point x0. Interpreting this situation for a mapping E1 → E1 and linear mappings R → R as their differentials, the nonvanishing is a necessary and sufficient condition for the differential 735 which is valid provided both sides are well-defined. We get f′ (t) = cos3 t [ − 12 tg t + 3 ( tg t + tg3 t ) − 12 tg2 t + 9 ( 1 + tg2 t ) ] for t ∈ [−π, π] with cos t ̸= 0. However, this condition does not exclude any stationary points since sin t ̸= 0 if cos t = 0. Therefore, the stationary points of f are those points t ∈ [−π, π] for which −4 tg t + tg t + tg3 t − 4 tg2 t + 3 + 3 tg2 t = 0. The substitution s = tg t leads to s3 − s2 − 3s + 3 = 0, i. e. (s − 1) ( s − √ 3 ) ( s + √ 3 ) = 0. Then, the values s = 1, s = √ 3, s = − √ 3 respectively correspond to t ∈ {−3 4 π, 1 4 π}, t ∈ {−2 3 π, 1 3 π}, t ∈ {−1 3 π, 2 3 π}. Now, we evaluate the function f at each of these points as well as at the marginal points t = −π, t = π. Sorting them, we get f ( −1 3 π ) = −1 − 3 √ 3 < f ( −3 4 π ) = −3 √ 2 < f ( −2 3 π ) = 1 − 3 √ 3 < −1, f (−π) = f (π) = −1 < 0, f (2 3 π ) = 1 + 3 √ 3 > f (1 4 π ) = 3 √ 2 > f (1 3 π ) = −1 + 3 √ 3 > 0. Therefore, the global minimum of the function f is at the point t = −π/3 , while the global maximum is at t = 2π/3. Now, let us get back to the original function p. 
Since we know the values cos ( −1 3 π ) = 1 2 , sin ( −1 3 π ) = − √ 3 2 , cos (2 3 π ) = −1 2 , sin (2 3 π ) = √ 3 2 , we can deduce that the polynomial p takes on the minimal value −1−3 √ 3 (the same as f, of course) at the point [1/2, − √ 3/2] and the maximal value 1 + 3 √ 3 at [−1/2, √ 3/2]. □ 8.H.9. At which points does the function f(x, y) = x2 − 4x + y2 take on global extrema on the set M : | x | + | y | ≤ 1? Solution. Expressing f in the form f(x, y) = (x − 2)2 − 4 + y2 , we can see that the global maximum and minimum occur at the same points as for the function g(x, y) := √ (x − 2)2 + y2, [x, y] ∈ M, since neither shifting the function nor applying the increasing function v = √ u for u ≥ 0 changes the points of extrema (of course, they can change their values). However, we know that the function g gives the distance of a point [x, y] from the point [2, 0]. Since the set M is clearly a square with vertices [1, 0], [0, 1], [−1, 0], [0, −1], the point of M that is closest to [2, 0] is the vertex [1, 0], while the most distant one is [−1, 0]. Altogether, we have obtained that the minimal value of f occurs at the point [1, 0] and the maximal one at [−1, 0]. □ CHAPTER 8. CALCULUS WITH MORE VARIABLES to be invertible as a linear mapping. In general finite dimensions, the non-generacy of the differential is the adequate con- cept. Proof. First, verify that the theorem makes sense and is as expected. If it is supposed that the inverse mapping exists and is differentiable at F(x0), then differentiating the composite mapping F−1 ◦ F enforces the formula idRn = D1 (F−1 ◦ F)(x0) = D1 (F−1 ) ◦ D1 F(x0), which verifies the formula at the conclusion of the theorem. Therefore, it is known at the beginning which differential for F−1 to find. Next, suppose that the inverse mapping F−1 exists in a neighbourhood of the point F(x0) and that it is continuous. Since F is differentiable in a neighbourhood of x0, it follows that (1) F(x) − F(x0) − D1 F(x0)(x − x0) = α(x − x0) with function α : Rn → 0 satisfying limv→0 1 ∥v∥ α(v) = 0. To verify the approximation properties of the linear mapping (D1 F(x0))−1 , it suffices to calculate the following limit for y = F(x) approaching y0 = F(x0): lim y→y0 1 ∥y − y0∥ ( F−1 (y)−F−1 (y0)−(D1 F(x0))−1 (y−y0) ) . Substituting (1) for y − y0 into the latter equality yields lim y→y0 1 ∥y − y0∥ ( x − x0 − (D1 F(x0))−1 (D1 F(x0)(x − x0) + α(x − x0)) ) = lim y→y0 −1 ∥y − y0∥ ( D1 F(x0))−1 (α(x − x0) ) = (D1 F(x0))−1 lim y→y0 (−1) ∥y − y0∥ (α(x − x0)), where the last equality follows from the fact that linear mappings between finite-dimensional spaces are always continuous. Hence performing this linear mapping commutes with the limit process. The proof is almost finished. The limit at the end of the expression is, using the properties of α, zero if the values ∥F(x)−F(x0)∥ are greater than C∥x−x0∥ for some constant C > 0. This can be translated in terms of the inverse as C∥F−1 (y) − F−1 (y0)∥ ≤ ∥y − y0∥, i.e. ∥F−1 (y) − F−1 (y0)∥ ≤ D∥y − y0∥ for the constant D = C−1 > 0. This is Lipschitz continuity, which is a stronger property than F−1 being continuous. So, now it remains “merely” to prove the existence of a Lipschitzcontinuous inverse mapping to the mapping F. 736 8.H.10. Compute the local extrema of the function y = f(x) given implicitly by the equation 3x2 + 2xy + x = y2 + 3y + 5 4 , [x, y] ∈ R2 ∖ {[ x, x − 3 2 ] ; x ∈ R } . Solution. 
In accordance with the theoretical part (see 8.1.25), let us denote F(x, y) = 3x2 + 2xy + x − y2 − 3y − 5 4 , [x, y] ∈ R2 ∖ {[ x, x − 3 2 ] ; x ∈ R } and calculate the derivative y′ = f′ (x) = −Fx(x,y) Fy(x,y) = −6x+2y+1 2x−2y−3 . We can see the this derivative is continuous on the whole set in question. In particular, the function f is defined implicitly on this set (the denominator is non-zero). A local extremum may occur only for those x, y which satisfy y′ = 0, i. e., 6x + 2y + 1 = 0. Substituting y = −3x−1/2 into the equation F(x, y) = 0, we obtain −12x2 + 6x = 0, which leads to [x, y] = [ 0, −1 2 ] , [x, y] = [1 2 , −2 ] . We can also easily compute that y′′ = (y′ ) ′ = − ( 6+2y′ ) (2x−2y−3)−(6x+2y+1) ( 2−2y′ ) (2x−2y−3)2 . Substituting x = 0, y = −1/2, y′ = 0 and x = 1/2, y = −2, y′ = 0, we obtain y′′ = −6(−2)−0 4 > 0 for [x, y] = [ 0, −1 2 ] and y′′ = −6(+2)−0 4 < 0 for [x, y] = [1 2 , −2 ] . We have thus proved that the implicitly given function has a strict local minimum at the point x = 0 and a strict local maximum at x = 1/2. □ 8.H.11. Find the local extrema of the function z = f(x, y) given on the maximum possible set by the equation (1) x2 + y2 + z2 − xz − yz + 2x + 2y + 2z − 2 = 0. Solution. Differentiating (1) with respect to x and y gives 2x + 2zzx − z − xzx − yzx + 2 + 2zx = 0, 2y + 2zzy − xzy − z − yzy + 2 + 2zy = 0. Hence we get that (2) zx = fx(x, y) = z − 2x − 2 2z − x − y + 2 , zy = fy(x, y) = z − 2y − 2 2z − x − y + 2 . We can notice that the partial derivatives are continuous at all points where the function f is defined. This implies that the local extrema can occur only at stationary points. These points satisfy zx = 0, i. e. z − 2x − 2 = 0, zy = 0, i. e. z − 2y − 2 = 0. We have thus two equations, which allow us to express the dependency of x and y on z. Substituting into (1), we obtain the points [x, y, z] = [ −3 + √ 6, −3 + √ 6, −4 + 2 √ 6 ] , CHAPTER 8. CALCULUS WITH MORE VARIABLES To simplify, reduce the general case slightly. Especially, without loss of generality, apply shifts of the coordinates by constant vectors. In particular, it can be assumed that x0 = 0 ∈ Rn , y0 = F(x0) = 0 ∈ Rn . So assume this property of the mapping F. Further, composing the mapping F with any linear mapping G yields a differentiable mapping again, and it is known how the differential changes. The choice G(y) = (D1 F(0))−1 (y) gives D1 (G ◦ F)(0) = idRn and thus we may assume that D1 F(0) = idRn . With these assumptions, consider the mapping K(x) = F(x) − x. This mapping is also differentiable, and its differential at 0 is zero. It is already known that each continuously differentiable mapping is Lipschitz continuous over every δ–neighbourhood Uδ of the origin (in the its domain), ∥K(x) − K(y)∥ ≤ C∥x − y∥, where C is bounded by the maximum of all absolute values of the partial derivatives in the Jacobi matrix of the mapping K in the neighbourhood Uδ, cf. 8.1.20. Since the differential of the mapping K at the point x0 = 0 is zero, one can, by selecting a sufficiently small neighbourhood U of the origin, achieve the bound ∥K(x) − K(y)∥ ≤ 1 2 ∥x − y∥. It follows by the triangle inequality that ∥x − y∥ = ∥(F(x) − K(x)) − (F(y) − K(y))∥ ≤ ∥F(x) − F(y)∥ + ∥K(x) − K(y)∥ ≤ ∥F(x) − F(y)∥ + 1 2 ∥x − y∥ and hence 1 2 ∥x − y∥ ≤ ∥F(x) − F(y)∥. With this estimate, if x ̸= y are both in the neighbourhood U = Uδ, then also F(x) ̸= F(y). Therefore, the mapping is bijective onto its image V = F(U). Write F−1 for its inverse defined on V . 
For this mapping, the latter estimate says ∥F⁻¹(x) − F⁻¹(y)∥ ≤ 2∥x − y∥, so this mapping is not only continuous (as we assumed in our first step of the proof), but also Lipschitz-continuous, as requested at the end of the previous part of the proof.

It could seem that the proof is complete, but this is not so. To finish, it is necessary to show that the mapping F restricted to a sufficiently small neighbourhood U_δ is not only bijective onto its image, but also that it maps open neighbourhoods of zero onto open neighbourhoods of zero.³

³ In the literature, there are examples of mappings which continuously and bijectively map a line segment onto a square. So this is not an obvious requirement.

[x, y, z] = [−3 − √6, −3 − √6, −4 − 2√6].

Now, we need the second derivatives in order to decide whether the local extrema really occur at the corresponding points. Differentiating z_x in (2), we obtain

z_xx = f_xx(x, y) = ( (z_x − 2)(2z − x − y + 2) − (z − 2x − 2)(2z_x − 1) ) / (2z − x − y + 2)²

with respect to x, and

z_xy = f_xy(x, y) = ( z_y(2z − x − y + 2) − (z − 2x − 2)(2z_y − 1) ) / (2z − x − y + 2)²

with respect to y. We need not calculate z_yy, since the variables x and y are interchangeable in (1) (if we swap x and y, the equation is left unchanged). Moreover, the x- and y-coordinates of the considered points are the same; hence z_xx = z_yy. Now, we evaluate these at the stationary points:

f_xx(−3 + √6, −3 + √6) = f_yy(−3 + √6, −3 + √6) = −1/√6,
f_xy(−3 + √6, −3 + √6) = f_yx(−3 + √6, −3 + √6) = 0,
f_xx(−3 − √6, −3 − √6) = f_yy(−3 − √6, −3 − √6) = 1/√6,
f_xy(−3 − √6, −3 − √6) = f_yx(−3 − √6, −3 − √6) = 0.

As for the Hessian, we have

Hf(−3 + √6, −3 + √6) = ( −1/√6 0 ; 0 −1/√6 ),
Hf(−3 − √6, −3 − √6) = ( 1/√6 0 ; 0 1/√6 ).

Apparently, the first Hessian is negative definite, while the second one is positive definite. This means that there is a strict local maximum of the function f at the point [−3 + √6, −3 + √6], and there is a strict local minimum at the point [−3 − √6, −3 − √6]. □

8.H.12. Determine the strict local extrema of the function

f(x, y) = 1/x + 1/y, x ≠ 0, y ≠ 0,

on the set of points that satisfy the equation 1/x² + 1/y² = 4.

Solution. Since both the function f and the function given implicitly by the equation 1/x² + 1/y² − 4 = 0 have continuous partial derivatives of all orders on the set R² ∖ {[0, 0]}, we should look for stationary points, i.e., for the solutions of the equations L_x = 0, L_y = 0 for

L(x, y, λ) = 1/x + 1/y − λ(1/x² + 1/y² − 4), x ≠ 0, y ≠ 0.

We thus get the equations

−1/x² + 2λ/x³ = 0, −1/y² + 2λ/y³ = 0,

which lead to x = 2λ, y = 2λ. Considering the set of points in question, the constraint x = y gives the stationary points

(1) [√2/2, √2/2], [−√2/2, −√2/2].

Now, let us examine the second differential of the function L. We can easily compute that

CHAPTER 8. CALCULUS WITH MORE VARIABLES

Decrease the neighbourhood U = U_δ so that the above estimates remain true on the boundary of U as well, and so that at the same time the Jacobi matrix of the mapping is invertible on all of U. This can be done since the determinant is a continuous function of the matrix entries. Let B denote the boundary of the set U, that is, the corresponding sphere. Since B is compact and F is continuous, the function ρ(x) = ∥F(x)∥ achieves both its maximum and its minimum on B. Denote a = (1/2) min_{x∈B} ρ(x) and consider any y ∈ O_a(0) fixed. Of course, a > 0, because x = 0 is the only point with F(x) = 0 within U_δ.
It is necessary to show that there is at least one x ∈ U such that y = F(x), which completes the proof of the inverse mapping theorem. For this purpose, consider the function (y is a fixed point)

h(x) = ∥F(x) − y∥².

Again, the continuous function h attains a minimum on the compact set U ∪ B. This minimum cannot occur for x ∈ B. Notice that F(0) = 0, hence h(0) = ∥y∥² < a². At the same time, the distance of y from F(x) for x ∈ B is at least a for all y ∈ O_a(0) (since a is selected to be half the minimum of the magnitude of F(x) on the boundary), hence h(x) ≥ a² there. Therefore, the minimum occurs inside U, and it is a stationary point z of the function h. Fixing such z means that for all j = 1, . . . , n,

∂h/∂x_j(z) = Σ_{i=1}^{n} 2 (f_i(z) − y_i) ∂f_i/∂x_j(z) = 0.

This is a system of linear equations with variables ξ_i = f_i(z) − y_i and coefficients given by twice the Jacobi matrix D¹F(z). In particular, for z ∈ U, such a system has a unique solution, and this is zero, since the Jacobi matrix is invertible. In this way the desired point x = z ∈ U is found, satisfying f_i(z) = y_i for all i = 1, . . . , n, i.e., F(z) = y. □

8.1.24. The implicit functions. The next goal is to employ the inverse mapping theorem for clarifying the properties of implicitly defined functions. To start, consider a differentiable function F(x, y) defined in the plane E₂, and look for those points (x, y) where F(x, y) = 0. An example of this can be the usual (implicit) definition of straight lines and circles:

F(x, y) = ax + by + c = 0, a, b, c ∈ R,
F(x, y) = (x − s)² + (y − t)² − r² = 0, r > 0.

While in the first case, the relation between the quantities x and y can be expressed as the function (for b ≠ 0) y = f(x) = −(a/b)x − c/b for all x; in the other case, for any point (x₀, y₀) satisfying the equation of the circle and such that y₀ ≠ t (these are the

L_xx = 2/x³ − 6λ/x⁴, L_xy = 0, L_yy = 2/y³ − 6λ/y⁴, x ≠ 0, y ≠ 0,

whence it follows that

d²L(x, y) = (2/x³ − 6λ/x⁴) dx² + (2/y³ − 6λ/y⁴) dy².

Differentiating the constraint 1/x² + 1/y² = 4, we get

−(2/x³) dx − (2/y³) dy = 0, i.e. dy² = (y⁶/x⁶) dx².

Therefore,

d²L(x, y) = [ 2/x³ − 6λ/x⁴ + (2/y³ − 6λ/y⁴)(y⁶/x⁶) ] dx².

In fact, we are considering a one-dimensional quadratic form whose positive (negative) definiteness at a stationary point means that there is a minimum (maximum) at that point. Recalling that the stationary points had x = 2λ, y = 2λ, mere substitution yields

d²L(√2/2, √2/2) = −4√2 dx², d²L(−√2/2, −√2/2) = 4√2 dx²,

which means that there is a strict local maximum of the function f at the point [√2/2, √2/2], while at the point [−√2/2, −√2/2], there is a strict local minimum. The corresponding values are:

(2) f(√2/2, √2/2) = 2√2, f(−√2/2, −√2/2) = −2√2.

Now, we will demonstrate a quicker way to obtain the result. We know (or we can easily calculate) the second partial derivatives of the function L, i.e., the Hessian with respect to the variables x and y:

HL(x, y) = ( 2/x³ − 6λ/x⁴ 0 ; 0 2/y³ − 6λ/y⁴ ).

The evaluations

HL(√2/2, √2/2) = ( −2√2 0 ; 0 −2√2 ), HL(−√2/2, −√2/2) = ( 2√2 0 ; 0 2√2 )

then tell us that the quadratic form is negative definite at the former stationary point (there is a strict local maximum) and positive definite at the latter one (there is a strict local minimum). We should be aware of a potential trap in this “quicker” method in the case we obtain an indefinite form (matrix). Then, we cannot conclude that there is no extremum at that point: since we have not included the constraint (which we did when computing d²L), we are considering a more general situation. The graph of the function f on the given set is a curve which can be described as a univariate function. This must correspond to a one-dimensional quadratic form. □
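The constrained extrema just found can also be cross-checked by eliminating the constraint: on the set 1/x² + 1/y² = 4 one may put 1/x = 2 cos t, 1/y = 2 sin t (t avoiding the axes). A minimal numerical sketch (the parametrization is ours, not part of the text):

    import numpy as np

    # On the constraint curve, f = 1/x + 1/y = 2(cos t + sin t).
    t = np.linspace(0, 2 * np.pi, 100001)
    f = 2 * (np.cos(t) + np.sin(t))
    print(f.max(), 2 * np.sqrt(2))    # both approx  2.8284
    print(f.min(), -2 * np.sqrt(2))   # both approx -2.8284
    # the maximizer t = pi/4 corresponds to x = y = sqrt(2)/2, as found above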
CHAPTER 8. CALCULUS WITH MORE VARIABLES

marginal points of the circle in the direction of the coordinate x), there is a neighbourhood of the point x₀ in which either

y = f(x) = t + √(r² − (x − s)²), or y = f(x) = t − √(r² − (x − s)²),

according to whether (x₀, y₀) belongs to the upper or the lower semicircle. If a diagram of the situation is drawn, the reason is clear: describing both the semicircles simultaneously by a single function y = f(x) is not possible.

The boundary points of the interval [s − r, s + r] are even more interesting. They also satisfy the equation of the circle with y = t, yet F_y(s ± r, t) = 0, which reflects the fact that the tangent line to the circle at these points is parallel to the y-axis. There are no neighbourhoods of these points in which the circle could be described as a function y = f(x).

Moreover, the derivatives of the function y = f(x) = t + √(r² − (x − s)²) can be easily expressed in terms of partial derivatives of the function F:

f′(x) = (1/2) · (−2(x − s))/√(r² − (x − s)²) = −(x − s)/(y − t) = −F_x/F_y.

If the roles of the variables x and y are interchanged and a relation x = f(y) such that F(f(y), y) = 0 is sought, then neighbourhoods of the points (s ± r, t) are obtained with no problem. Notice that the partial derivative F_x is non-zero at these points.

So it is observed (though for only two examples): for a function F(x, y) and a point (a, b) ∈ E₂ such that F(a, b) = 0, there is a unique function y = f(x) satisfying F(x, f(x)) = 0 on some neighbourhood of a if F_y(a, b) ≠ 0. In this case, f′(a) = −F_x(a, b)/F_y(a, b) can even be computed. We prove below that this proposition is in fact always true. The last statement about the derivative can be remembered (and is quite comprehensible if things are properly understood) from the expression for the differential of the (constant) function g(x) = F(x, y(x)) and the differential dy = f′(x)dx:

0 = dg = F_x dx + F_y dy = (F_x + F_y f′(x)) dx.

One can work analogously with implicit expressions F(x, y, z) = 0, looking for a function g(x, y) such that F(x, y, g(x, y)) = 0. As an example, consider the function f(x, y) = x² + y², whose graph is the rotational paraboloid with vertex at the origin. It can be defined implicitly by the equation 0 = F(x, y, z) = z − x² − y².

Before formulating the result for the general situation, notice which dimensions could/should appear in the problem. If it is desired to find, for this function F, a curve c(x) = (c₁(x), c₂(x)) in the plane such that F(x, c(x)) = F(x, c₁(x), c₂(x)) = 0,

8.H.13. Find the global extrema of the function

f(x, y) = 1/x + 1/y, x ≠ 0, y ≠ 0,

on the set of points that satisfy the equation 1/x² + 1/y² = 4.

Solution. This exercise illustrates that looking for global extrema may be much easier than looking for local ones (cf. the above exercise), even in the case when the function values are considered on an unbounded set. First, we would determine the stationary points (1) and the values (2) in the same way as above. Let us emphasize that we are looking for the function's extrema on a set that is not compact, so we cannot make do with merely evaluating the function at the stationary points.
The reason is that the function f may not attain an extremum on the considered set at all: its range might be an open interval. However, we will show that this is not the case here. Let us thus consider |x| ≥ 10. The equation 1/x² + 1/y² = 4 can then be satisfied only by those values y for which |y| ≥ 1/2. We have thus obtained the bounds

−2√2 < −1/10 − 2 ≤ f(x, y) ≤ 1/10 + 2 < 2√2, if |x| ≥ 10.

At the same time, we have (interchanging x and y leads to the same task)

−2√2 < −1/10 − 2 ≤ f(x, y) ≤ 1/10 + 2 < 2√2, if |y| ≥ 10.

Hence we can see that the function f must have global extrema on the considered set, and this must happen inside the square ABCD with vertices A = [−10, −10], B = [10, −10], C = [10, 10], D = [−10, 10]. The intersection of the “hundred times reduced” square with vertices at Ã = [−1/10, −1/10], B̃ = [1/10, −1/10], C̃ = [1/10, 1/10], D̃ = [−1/10, 1/10] with the given set is clearly empty (on the constraint set, 1/x² ≤ 4 forces |x| ≥ 1/2, and similarly for y). Therefore, the global extrema occur at points inside the compact set bounded by these two squares. Since f is continuously differentiable on this set, the global extrema can occur only at stationary points. We thus must have

f_max = f(√2/2, √2/2) = 2√2, f_min = f(−√2/2, −√2/2) = −2√2. □

8.H.14. Determine the maximal and minimal values of the function f(x, y, z) = xyz on the set M given by the conditions

x² + y² + z² = 1, x + y + z = 0.

Solution. It is not hard to realize that M is a circle. However, for our problem, it is sufficient to know that M is compact, i.e. bounded (by the first condition, M lies on the unit sphere) and closed (the set of solutions of the given equations is closed, since if the equations are satisfied by all terms of a convergent sequence, then they are satisfied by its limit as well). The function f as well as the constraint functions F(x, y, z) = x² + y² + z² − 1, G(x, y, z) = x + y + z

CHAPTER 8. CALCULUS WITH MORE VARIABLES

then this can be done (even for all initial conditions x = a), yet the result is not unique for a given initial condition. It suffices to consider an arbitrary curve on the rotational paraboloid whose projection onto the first coordinate has a non-zero derivative. Then consider x to be the parameter of the curve, and c(x) to be its projection onto the plane yz. Therefore, it is to be expected that one function of m + 1 variables implicitly defines a hypersurface in R^{m+1} which can be expressed (at least locally) as the graph of a function of m variables. It can be anticipated that n functions of m + n variables define an intersection of n hypersurfaces in R^{m+n}, which is expected to be an “m–dimensional” object.

8.1.25. The general theorem. Consider a differentiable mapping F = (f₁, . . . , f_n) : R^{m+n} → Rⁿ. The Jacobi matrix of this mapping has n rows and m + n columns. Write it symbolically as

D¹F = (D¹_x F, D¹_y F),

where (x₁, . . . , x_{m+n}) ∈ R^{m+n} is written as (x, y) ∈ R^m × Rⁿ, D¹_x F is the matrix of n rows and the first m columns of the Jacobi matrix (the partial derivatives ∂f_i/∂x_j, j = 1, . . . , m), while D¹_y F is the square matrix of order n formed by the remaining columns (the partial derivatives ∂f_i/∂x_{m+1}, . . . , ∂f_i/∂x_{m+n}). The multidimensional analogy to the previous reasoning with the non-zero partial derivative with respect to y is the condition that the matrix D¹_y F is invertible.

The implicit mapping theorem

Theorem.
Let F : R^{m+n} → Rⁿ be a differentiable mapping in an open neighbourhood of a point (a, b) ∈ R^m × Rⁿ = R^{m+n} at which F(a, b) = 0 and det D¹_y F ≠ 0. Then there exists a neighbourhood U of the point a ∈ R^m and a unique differentiable mapping G : R^m → Rⁿ defined on U, with G(a) = b and such that F(x, G(x)) = 0 for all x ∈ U. Moreover, the Jacobi matrix D¹G of the mapping G is, in the neighbourhood of the point a, given by the product of matrices

D¹G(x) = −(D¹_y F)⁻¹(x, G(x)) · D¹_x F(x, G(x)).

Proof. For the sake of comprehensibility, first show the proof for the simplest case of the equation F(x, y) = 0 with a function F of two variables. At first sight, it might look complicated, but this situation can be discussed in a way which can be extended for the general dimensions as in the theorem, almost without changes.

have continuous partial derivatives of all orders (since they are polynomials). The Jacobi matrix of the constraints is

( F_x(x, y, z) F_y(x, y, z) F_z(x, y, z) ; G_x(x, y, z) G_y(x, y, z) G_z(x, y, z) ) = ( 2x 2y 2z ; 1 1 1 ).

Its rank is reduced (less than 2) if and only if the vector (2x, 2y, 2z) is a multiple of the vector (1, 1, 1), which gives x = y = z, and thus x = y = z = 0 (by the second constraint). However, the set M does not contain the origin. Therefore, we may look for the stationary points using the method of Lagrange multipliers. For

L(x, y, z, λ₁, λ₂) = xyz − λ₁(x² + y² + z² − 1) − λ₂(x + y + z),

the equations L_x = 0, L_y = 0, L_z = 0 give

yz − 2λ₁x − λ₂ = 0, xz − 2λ₁y − λ₂ = 0, xy − 2λ₁z − λ₂ = 0,

respectively. Subtracting the first equation from the second one and from the third one leads to

xz − yz − 2λ₁y + 2λ₁x = 0, xy − yz − 2λ₁z + 2λ₁x = 0,

i.e.,

(x − y)(z + 2λ₁) = 0, (x − z)(y + 2λ₁) = 0.

The last equations are satisfied in these four cases: x = y, x = z; x = y, y = −2λ₁; z = −2λ₁, x = z; z = −2λ₁, y = −2λ₁; thus (including the constraint G = 0)

x = y = z = 0; x = y = −2λ₁, z = 4λ₁; x = z = −2λ₁, y = 4λ₁; x = 4λ₁, y = z = −2λ₁.

Except for the first case (which clearly cannot happen), including the constraint F = 0 yields

(4λ₁)² + (−2λ₁)² + (−2λ₁)² = 1, i.e. λ₁ = ±1/(2√6).

Altogether, we get the points

[−1/√6, −1/√6, 2/√6], [−1/√6, 2/√6, −1/√6], [2/√6, −1/√6, −1/√6],
[1/√6, 1/√6, −2/√6], [1/√6, −2/√6, 1/√6], [−2/√6, 1/√6, 1/√6].

We will not verify whether these really are points of local extrema. The only important thing is that all stationary points are among these six. We are looking for the global maximum and minimum of the continuous function f on the compact set M. However, the global extrema (we know they exist) can occur only at points of local extrema with respect to M. And the local extrema can occur only at the aforementioned points. Therefore, it suffices to evaluate the function f at these points. Thus

CHAPTER 8. CALCULUS WITH MORE VARIABLES

Extend the function F to F̃ : R² → R², (x, y) ↦ (x, F(x, y)). The Jacobi matrix of the mapping F̃ is

D¹F̃(x, y) = ( 1 0 ; F_x(x, y) F_y(x, y) ).

It follows from the assumption F_y(a, b) ≠ 0 that the same also holds in a neighbourhood of the point (a, b), so the function F̃ is invertible in this neighbourhood, by the inverse mapping theorem. Therefore, there is a uniquely defined differentiable inverse mapping F̃⁻¹ in a neighbourhood of the point (a, 0). Denote by π : R² → R the projection onto the second coordinate, and consider the function f(x) = π ∘ F̃⁻¹(x, 0). This function is well-defined and differentiable.
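The six candidate points of 8.H.14, listed in the practice column above, can also be checked numerically: the constraint circle is parametrized by an orthonormal basis of the plane x + y + z = 0. A minimal sketch (the basis vectors are our own choice):

    import numpy as np

    # orthonormal basis of the plane x + y + z = 0
    u = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)
    v = np.array([1.0, 1.0, -2.0]) / np.sqrt(6)
    t = np.linspace(0, 2 * np.pi, 200001)
    pts = np.outer(np.cos(t), u) + np.outer(np.sin(t), v)  # unit circle in the plane
    f = pts[:, 0] * pts[:, 1] * pts[:, 2]
    print(f.max(), 1 / (3 * np.sqrt(6)))    # both approx  0.13608
    print(f.min(), -1 / (3 * np.sqrt(6)))   # both approx -0.13608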
It must be verified that the expression

F(x, f(x)) = F(x, π(F̃⁻¹(x, 0)))

is zero in a neighbourhood of the point x = a. It follows directly from the definition of F̃(x, y) = (x, F(x, y)) that its inverse is of the form F̃⁻¹(x, y) = (x, π F̃⁻¹(x, y)). Therefore, the previous calculation can be resumed:

F(x, f(x)) = π( F̃(x, π(F̃⁻¹(x, 0))) ) = π( F̃(F̃⁻¹(x, 0)) ) = π(x, 0) = 0.

This proves the first part of the theorem, and it remains to compute the derivative of the function f(x). This derivative can, once again, be obtained by invoking the inverse mapping theorem, using the matrix (D¹F̃)⁻¹. The following equality is easily verified by multiplying the matrices; it can also be computed directly using the explicit formula for the inverse matrix in terms of the determinant and the algebraically adjoint matrix, see paragraph 2.2.11:

( 1 0 ; F_x(x, y) F_y(x, y) )⁻¹ = (F_y(x, y))⁻¹ ( F_y(x, y) 0 ; −F_x(x, y) 1 ).

By the definition, f(x) = π F̃⁻¹(x, 0), and thus the first entry of the second row of this matrix is the derivative f′(x) with y = f(x), i.e. the Jacobi matrix D¹f. In this simple case, it is exactly the desired scalar −F_x(x, f(x))/F_y(x, f(x)).

The general proof is exactly the same; there is no need to change any of the formulae. We obtain the invertible mapping F̃ : R^{m+n} → R^{m+n} and define G(x) = π F̃⁻¹(x, 0), where π : R^{m+n} → Rⁿ, π(x, y) = y. The same check as above reveals that F(x, G(x)) = 0, as requested. Only in the last computation of the derivative do the corresponding blocks D¹_x F and D¹_y F of the Jacobi matrix appear, instead of the particular partial derivatives. For the calculation of the Jacobi matrix of the mapping G, use the computation of the inverse matrix. This time, the algebraic procedure from paragraph 2.2.11 is not very advantageous. It is better to be guided by the case in dimension

we find out that the wanted maximum is

f(−1/√6, −1/√6, 2/√6) = f(−1/√6, 2/√6, −1/√6) = f(2/√6, −1/√6, −1/√6) = 1/(3√6),

while the minimum is

f(1/√6, 1/√6, −2/√6) = f(1/√6, −2/√6, 1/√6) = f(−2/√6, 1/√6, 1/√6) = −1/(3√6). □

8.H.15. Find the extrema of the function f : R³ → R, f(x, y, z) = x² + y² + z², on the plane x + y − z = 1 and determine their types.

Solution. We can easily build the equations expressing that the gradient of f is a multiple of the normal vector (1, 1, −1) of the constraint plane:

x = k, y = k, z = −k, k ∈ R.

Together with the constraint, the only solution is the point [1/3, 1/3, −1/3]. Further, we can notice that the function f grows beyond all bounds along the constraint plane (for instance in the direction (1, −1, 0), which lies in that plane). Therefore, the examined function has a minimum at this point.

Another solution. We will reduce this problem to finding the extrema of a two-variable function on R². Since the constraint is linear, we can express z = x + y − 1. Substituting this into the given function then yields a real-valued function of two variables:

f(x, y) = x² + y² + (x + y − 1)² = 2x² + 2xy + 2y² − 2x − 2y + 1.

Setting both partial derivatives equal to zero, we get the linear equations

4x + 2y − 2 = 0, 4y + 2x − 2 = 0,

whose only solution is the point [1/3, 1/3]. Since the quadratic part of this function is positive definite, the function is convex and grows beyond all bounds on R²; therefore, there is a (global) minimum at the obtained stationary point. Then, we can get the corresponding point [1/3, 1/3, −1/3] in the constraint plane from the linear expression for z. □
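The result of 8.H.15 is easy to confirm with a constrained numerical optimizer. A minimal sketch using SciPy (our own check, not part of the original solution):

    import numpy as np
    from scipy.optimize import minimize

    # minimize x^2 + y^2 + z^2 subject to x + y - z = 1
    res = minimize(lambda p: p @ p, x0=np.zeros(3), method="SLSQP",
                   constraints=[{"type": "eq",
                                 "fun": lambda p: p[0] + p[1] - p[2] - 1}])
    print(res.x)    # approx [ 1/3, 1/3, -1/3 ]
    print(res.fun)  # approx 1/3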
8.H.16. Find the extrema of the function f : R³ → R, f(x, y, z) = x + y, on the circle given by the equations x + y + z = 1 and x² + y² + z² = 4.

Solution. The “suspects” are those points which satisfy

(1, 1, 0) = k · (1, 1, 1) + l · (x, y, z), k, l ∈ R.

Clearly, x = y (subtract the first two component equations and note that l ≠ 0). Substituting this into the equations of the circle then leads to the two solutions

[1/3 ± √22/6, 1/3 ± √22/6, 1/3 ∓ √22/3].

Since every circle is compact, it suffices to examine the function values at these two points. We find out that there is a

CHAPTER 8. CALCULUS WITH MORE VARIABLES

m + n = 2 and to divide the matrix

(D¹F̃)⁻¹ = ( id_{R^m} 0 ; D¹_x F(x, y) D¹_y F(x, y) )⁻¹ = ( A B ; C D )

into blocks of m and n rows and columns (for instance, A is of type m×m, while C is of type n×m). Now, the matrices A, B, C, D can be determined from the defining equality for the inverse:

( id_{R^m} 0 ; D¹_x F(x, y) D¹_y F(x, y) ) · ( A B ; C D ) = ( id_{R^m} 0 ; 0 id_{Rⁿ} ).

Apparently, it follows that A = id_{R^m}, B = 0, D = (D¹_y F)⁻¹, and finally, D¹_x F + D¹_y F · C = 0. The latter equality already implies the desired relation

D¹G = C = −(D¹_y F)⁻¹ · D¹_x F.

This concludes the proof of the theorem. □

8.1.26. The gradient of a function. As seen in the previous paragraph, if F is a continuously differentiable function of n variables, the equation F(x₁, . . . , x_n) = b with a fixed value b ∈ R defines a subset M_b ⊂ Rⁿ which mostly has the properties of an (n−1)–dimensional hypersurface. To be more precise, if the vector of the partial derivatives

D¹F = ( ∂F/∂x₁, . . . , ∂F/∂x_n )

is non-zero, the set M_b can be described locally as the graph of a continuously differentiable function of n − 1 variables. The sets M_b are called the level sets of F. The vector D¹F ∈ Rⁿ is called the gradient of the function F. In technical and physical literature, it is also often denoted as grad F or ∇F.

Since M_b is given by a constant value of the function F, the derivatives of the curves lying in M_b have the property that the differential dF always evaluates to zero along them. For every such curve, F(c(t)) = b, hence

d/dt F(c(t)) = dF(c′(t)) = 0.

On the other hand, we can consider a general vector v = (v₁, . . . , v_n) ∈ Rⁿ and the magnitude of the corresponding directional derivative

|d_v F| = | ∂F/∂x₁ v₁ + · · · + ∂F/∂x_n v_n | = |cos φ| ∥D¹F∥ ∥v∥,

where φ is the angle between the directions of the vector v and the gradient D¹F, see the discussion about angles of vectors and straight lines in the fourth chapter (cf. definition 4.1.15). Thus, the following result is observed:

maximum of the considered function on the given circle at the former point and a minimum at the latter one. □

8.H.17. Find the extrema of the function f : R³ → R, f(x, y, z) = x² + y² + z², on the plane 2x + y − z = 1 and determine their types. ⃝

8.H.18. Find the maximum of the function f : R² → R, f(x, y) = xy on the circle with radius 1 which is centered at the point [x₀, y₀] = [0, 1]. ⃝

8.H.19. Find the minimum of the function f : R² → R, f = xy on the circle with radius 1 which is centered at the point [x₀, y₀] = [2, 0]. ⃝

8.H.20. Find the minimum of the function f : R² → R, f = xy on the circle with radius 1 which is centered at the point [x₀, y₀] = [2, 0]. ⃝

8.H.21. Find the minimum of the function f : R² → R, f = xy on the ellipse x² + 3y² = 1. ⃝

8.H.22. Find the minimum of the function f : R² → R, f = x²y on the circle with radius 1 which is centered at the point [x₀, y₀] = [0, 0]. ⃝
8.H.23. Find the maximum of the function f : R² → R, f(x, y) = x³y on the circle x² + y² = 1. ⃝

8.H.24. Find the maximum of the function f : R² → R, f(x, y) = xy on the ellipse 2x² + 3y² = 1. ⃝

8.H.25. Find the maximum of the function f : R² → R, f(x, y) = xy on the ellipse x² + 2y² = 1. ⃝

I. Volumes, areas, centroids of solids

8.I.1. Find the volume of the solid which lies in the half-space z ≥ 0, inside the cylinder x² + y² ≤ 1, and in the half-space a) z ≤ x, b) x + y + z ≤ 0.

Solution. a) The volume can be calculated with ease using cylindric coordinates. There, the cylinder is determined by the inequality r ≤ 1, and the half-space z ≤ x by z ≤ r cos φ. Altogether, we get

CHAPTER 8. CALCULUS WITH MORE VARIABLES

The maximal growth of a function

Proposition. The gradient D¹F = (∂F/∂x₁, . . . , ∂F/∂x_n) provides the direction of maximal growth of the function F of n variables. Moreover, the vanishing directional derivatives are exactly those in the directions perpendicular to the gradient.

Therefore, it is clear that the tangent plane to a non-empty level set M_b in a neighbourhood of its point with non-zero gradient D¹F is determined by the orthogonal complement of the gradient, and the gradient itself is the normal vector of the hypersurface M_b.

For instance, considering a sphere in R³ with radius r > 0, centered at (a, b, c), i.e. given implicitly by the equation

F(x, y, z) = (x − a)² + (y − b)² + (z − c)² = r²,

the normal vectors at a point P = (x₀, y₀, z₀) are obtained as non-zero multiples of the gradient, i.e. multiples of

D¹F = ( 2(x₀ − a), 2(y₀ − b), 2(z₀ − c) ),

and the tangent vectors are exactly the vectors perpendicular to the gradient. Therefore, the tangent plane to the sphere at the point P can always be described implicitly in terms of the gradient by the equation

0 = (x₀ − a)(x − x₀) + (y₀ − b)(y − y₀) + (z₀ − c)(z − z₀).

This is a special case of the following general formula:

Tangent hyperplanes to level sets

Theorem. For a function F(x₁, . . . , x_n) of n variables and a point P = (p₁, . . . , p_n) in a level set M_b of the function F such that the gradient D¹F is non-vanishing at P, the implicit equation for the tangent hyperplane to M_b is

0 = ∂F/∂x₁(P)(x₁ − p₁) + · · · + ∂F/∂x_n(P)(x_n − p_n).

Proof. The statement is clear from the previous discussions. The tangent hyperplane must be (n − 1)–dimensional, so its direction space is given as the kernel of the linear form given by the gradient (the zero values of the corresponding linear mapping Rⁿ → R given by multiplying the column of coordinates by the row vector grad F). Clearly, the selected point P satisfies the equation. □

8.1.27. Illumination of 3D objects. Consider the illumination of a three-dimensional object where the direction v of the light falling onto the two-dimensional surface M of this object is known. Assume M is given implicitly by an equation F(x, y, z) = 0. The light intensity at a point P ∈ M is defined as I cos φ, where φ is the angle between the normal line to M and the vector which is opposite to the flow of the light. As seen, the normal line is determined by the gradient of the function F.

V = ∫₀¹ ∫_{−π/2}^{π/2} ∫₀^{r cos φ} r dz dφ dr = 2/3.

b) We will reduce this problem to one that is completely analogous to part a) by rotating the solid around the z-axis by the angle π/4 (be it in the positive or the negative direction). Applying the rotation matrix

( √2/2 −√2/2 0 ; √2/2 √2/2 0 ; 0 0 1 ),

the original inequality x + y + z ≤ 0 is transformed to √2 x′ + z′ ≤ 0 in the new coordinates.
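Before finishing part b), the value of part a) can be cross-checked numerically. A minimal sketch using SciPy's dblquad (the check is ours, not part of the original solution); dblquad integrates func(y, x) with the first argument innermost:

    from math import pi, cos
    from scipy.integrate import dblquad

    # part a) in cylindric coordinates: integrand r * (r cos phi),
    # phi in [-pi/2, pi/2], r in [0, 1]
    V, err = dblquad(lambda r, phi: r * r * cos(phi),
                     -pi / 2, pi / 2, lambda phi: 0, lambda phi: 1)
    print(V)  # approx 0.6667 = 2/3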
Now, it is easy to express the integral that corresponds to the volume of the examined solid:

V = ∫₀¹ ∫_{π/2}^{3π/2} ∫₀^{−√2 r cos φ} r dz dφ dr = 2√2/3.

We need not have computed the result as we did; instead, we could notice that the present solid is obtained from the solid of part a) by a rotation together with a stretching by the factor √2 in the direction of the z-axis. See also remark 8.I.11. □

8.I.2. Find the volume of the solid in R³ which is given by x² + y² + z² ≤ 1, 3x² + 3y² ≥ z², x ≥ 0.

Solution. First, we should realize what the examined solid looks like. It is the part of a ball which lies outside a given cone (see the picture). The best way to determine the volume is probably to subtract half the volume of the cone-shaped sector determined by the cone from half the ball's volume (note that the volume of the solid does not change if we replace the condition x ≥ 0 with z ≥ 0; the solid is cut either “horizontally” or “vertically”, but always into halves). We will calculate in spherical coordinates,

x = r cos(φ) sin(ψ), y = r sin(φ) sin(ψ), z = r cos(ψ), φ ∈ [0, 2π), ψ ∈ [0, π), r ∈ (0, ∞).

CHAPTER 8. CALCULUS WITH MORE VARIABLES

The sign of the expression then says which side of the surface is illuminated. For example, consider an illumination with constant intensity I₀ in the direction of the vector v = (1, 1, −1) (i.e. “downward askew”), and let the ball given by the inequality F(x, y, z) = x² + y² + z² − 1 ≤ 0 be the object of interest. Then, for a point P = (x, y, z) ∈ M on the surface, the intensity

I(P) = ( grad F · (−v) ) / ( ∥grad F∥ ∥v∥ ) I₀ = ( (−2x − 2y + 2z) / (2√3) ) I₀

is obtained. Notice that, as anticipated, the point which is illuminated with the (full) intensity I₀ is the point P = (1/√3)(−1, −1, 1) on the surface of the ball, while the antipodal point is fully illuminated with the minus sign (i.e. on the inside of the sphere).

8.1.28. Tangent and normal spaces. Ideas about tangent and normal lines can be extended to general dimensions. With a mapping F : R^{m+n} → Rⁿ and coordinate functions f_i, one can also consider the n equations for the m + n variables,

f_i(x₁, . . . , x_{m+n}) = b_i, i = 1, . . . , n,

expressing the equality F(x) = b for a vector b ∈ Rⁿ. Assuming that the conditions of the implicit function theorem hold, the set of all solutions (x₁, . . . , x_{m+n}) ∈ R^{m+n} is (at least locally) the graph of a mapping G : R^m → Rⁿ. Technically, it is necessary to have some submatrix in D¹F of the maximal possible rank n.

For a fixed choice b = (b₁, . . . , b_n), the set of all solutions is, of course, the intersection of all the hypersurfaces M(b_i, f_i) corresponding to the particular functions f_i. The same must hold for the tangent directions, while the normal directions are generated by the individual gradients. Therefore, if D¹F (the matrix with the rows given by the gradients of the f_i) is the Jacobi matrix of a mapping which implicitly defines a set M, and P = (p₁, . . . , p_{m+n}) ∈ M is a point such that M is the graph of a mapping in a neighbourhood of the point P, then the affine subspace in R^{m+n} which contains exactly all the tangent lines going through the point P is given implicitly by the following equations:

0 = ∂f₁/∂x₁(P)(x₁ − p₁) + · · · + ∂f₁/∂x_{m+n}(P)(x_{m+n} − p_{m+n})
⋮
0 = ∂f_n/∂x₁(P)(x₁ − p₁) + · · · + ∂f_n/∂x_{m+n}(P)(x_{m+n} − p_{m+n}).

This subspace is called the tangent space to the (implicitly given) m–dimensional surface M at the point P.
The normal space at the point P is the affine subspace generated by the point P and the gradients of all the functions

The Jacobian of this transformation R³ → R³ is r² sin(ψ). First of all, let us determine the volume of the ball. As for the integration bounds, it is convenient to express the conditions that bind the solid in the coordinates we will work in. In the spherical coordinates, the ball is given by the inequality x² + y² + z² = r² ≤ 1. First, let us find the integration bounds for the variable φ. If we denote by π_φ the projection onto the φ-coordinate in the spherical coordinates (π_φ(φ, ψ, r) = φ), then the image under π_φ of the solid in question gives the integration bounds for the variable φ. We know that π_φ(ball) = [0, 2π) (the inequality r² ≤ 1 does not contain the variable φ, so there are no constraints on it, and it takes on all possible values; this can also easily be imagined in space). Having the bounds of one of the variables determined, we can proceed with the bounds of the other variables. In general, those may depend on the variables whose bounds have already been determined (although this is not the case here). Thus, we choose arbitrarily a φ₀ ∈ [0, 2π), and for this φ₀ (fixed from now on), we find the intersection of the solid (ball) with the surface φ = φ₀ and its projection π_ψ onto the variable ψ. Similarly as for φ, the variable ψ is not bounded (either by the inequality r² ≤ 1 or by the equality φ = φ₀), so it can take on all possible values, ψ ∈ [0, π). Finally, let us fix a φ = φ₀ and a ψ = ψ₀. Now, we are looking for the projection π_r(U) of the object (line segment) U given by the constraints r² ≤ 1, φ = φ₀, ψ = ψ₀ onto the variable r. The only constraint for r is the condition r² ≤ 1, so r ∈ (0, 1]. Note that the integration bounds of the variables are independent of each other, so we can perform the integration in any order. Thus, we have

V_ball = ∫₀¹ ∫₀^{2π} ∫₀^π r² sin(ψ) dψ dφ dr = 4π/3.

Now, let us compute the volume of the spherical sector inside the cone, given by x² + y² + z² ≤ 1, 3x² + 3y² ≤ z², z ≥ 0. Again, we express the conditions in the spherical coordinates: r² ≤ 1 and 3 sin²(ψ) ≤ cos²(ψ), i.e., tan(ψ) ≤ 1/√3. Just like in the case of the ball, the variables occur independently in the inequalities, so the integration bounds of the variables will be independent of each other as well. The condition r² ≤ 1 implies r ∈ (0, 1]; from tan(ψ) ≤ 1/√3, we have ψ ∈ [0, π/6]. The variable φ is not restricted by any condition, so φ ∈ [0, 2π].

V_sector = ∫₀^{2π} ∫₀¹ ∫₀^{π/6} r² sin ψ dψ dr dφ = ((2 − √3)/3) π;

altogether (the halving corresponds to the condition x ≥ 0, and V_sector is exactly half the volume of the double sector cut out by the cone),

V = V_ball/2 − V_sector = 2π/3 − ((2 − √3)/3) π = π/√3.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

f₁, . . . , f_n at the point P, i.e. by the rows of the Jacobi matrix D¹F.

As an illustrative simple example, calculate the tangent and normal spaces to a conic section in R³. Consider the equation of a cone with vertex at the origin,

0 = f(x, y, z) = z − √(x² + y²),

and a plane given by

0 = g(x, y, z) = z − 2x + y + 1.

The point P = (1, 0, 1) belongs to both the cone and the plane, so the intersection M of these surfaces is a curve (draw a diagram!). Its tangent line at the point P is given by the following equations:

0 = −( x/√(x² + y²) )|_{x=1,y=0} · (x − 1) − ( y/√(x² + y²) )|_{x=1,y=0} · y + 1 · (z − 1) = −x + z,
0 = −2(x − 1) + y + (z − 1) = −2x + y + z + 1,

while the plane perpendicular to the curve, containing the point P, is given parametrically by the expression

(1, 0, 1) + τ(−1, 0, 1) + σ(−2, 1, 1)

with real parameters τ and σ.
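The gradients and the tangent direction of this example are easy to verify symbolically. A minimal sketch with SymPy (the cross product of the two gradients gives a direction vector of the tangent line):

    import sympy as sp

    x, y, z = sp.symbols('x y z')
    f = z - sp.sqrt(x**2 + y**2)   # the cone
    g = z - 2*x + y + 1            # the plane
    P = {x: 1, y: 0, z: 1}
    grad_f = sp.Matrix([f.diff(v) for v in (x, y, z)]).subs(P)  # (-1, 0, 1)
    grad_g = sp.Matrix([g.diff(v) for v in (x, y, z)]).subs(P)  # (-2, 1, 1)
    print(grad_f.T, grad_g.T)
    print(grad_f.cross(grad_g).T)  # (-1, -1, -1): tangent direction at P

One checks directly that the direction (−1, −1, −1) satisfies both implicit equations of the tangent line above.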
8.1.29. Constrained extrema. Now we arrive at the first really serious application of the differential calculus of several variables. The typical task in optimization is to find the extrema of values depending on several (yet finitely many) parameters, under some further constraints on the parameters of the model. The problem often has m + n parameters constrained by n conditions. In the language of differential calculus, it is desired to find the extrema of a differentiable function h on the set M of points given implicitly by a vector equation F(x₁, . . . , x_{m+n}) = 0.

Of course, we might first locally parameterize the solution space of the latter equation by m free parameters, express the function h in terms of these parameters, and look for the local extrema by inspecting the critical points. However, we have already prepared more efficient procedures for this effort. If h has a local extremum on M at a point P, then for every curve c(t) ⊂ M going through P = c(0), the univariate function h(c(t)) has a local extremum at t = 0. Therefore, the derivative must satisfy

d/dt h(c(t))|_{t=0} = d_{c′(0)}h(P) = dh(P)(c′(0)) = 0.

This means that the differential of the function h at the point P vanishes along all tangent increments to M at P. This property is equivalent to stating that the gradient of h lies in the normal subspace (more precisely, in the modelling vector space of the normal subspace). Such points P ∈ M are called stationary points of the function h with respect to the constraints given by F.

We could also have computed the volume directly:

V = ∫₀^π ∫₀¹ ∫_{π/6}^{5π/6} r² sin ψ dψ dr dφ = π/√3.

In the cylindric coordinates x = r cos(φ), y = r sin(φ), z = z, with the Jacobian of this transformation equal to r, the calculation of the volume as the difference of the two solids considered above looks as follows:

V = 2π/3 − ∫₀^{2π} ∫₀^{1/2} ∫_{√3 r}^{√(1−r²)} r dz dr dφ = π/√3.

Note that we cannot compute the volume of the solid by a single iterated integral in the cylindric coordinates. Thus, we must split it into the two solids defined by the conditions r ≤ 1/2 and r ≥ 1/2, respectively:

V = V₁ + V₂ = ∫₀^{2π} ∫₀^{1/2} ∫₀^{√3 r} r dz dr dφ + ∫₀^{2π} ∫_{1/2}^{1} ∫₀^{√(1−r²)} r dz dr dφ = π/√3. □

Another alternative is to compute it as the volume of a solid of revolution, again splitting the solid into the two parts as in the previous case (the part “under the cone” and the part “under the sphere”). However, neither of these parts is directly the solid of revolution of the graph of a single function, so we express each as a difference. The volume of the former part can be calculated as the difference between the volumes of the cylinder x² + y² ≤ 1/4, 0 ≤ z ≤ √3/2, and of the part of the cone 3x² + 3y² ≤ z², 0 ≤ z ≤ √3/2. The volume of the latter part is then the difference between the volume of the solid created by rotating the arc x = √(1 − z²), 0 ≤ z ≤ √3/2, around the z-axis and the volume of the cylinder x² + y² ≤ 1/4, 0 ≤ z ≤ √3/2:

V = V₁ + V₂ = ( π√3/8 − π√3/24 ) + ( π ∫₀^{√3/2} (1 − z²) dz − π√3/8 ) = π√3/12 + π√3/4 = π/√3.

8.I.3. Calculate the volume of the spherical segment of the ball x² + y² + z² ≤ 2 cut by the plane z = 1.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

As seen in the previous paragraph, the normal space to the set M is generated by the rows of the Jacobi matrix of the mapping F, so the stationary points are described equivalently by the following proposition:

Lagrange multipliers

Theorem. Let F = (f₁, . . . , f_n) : R^{m+n} → Rⁿ be a differentiable mapping in a neighbourhood of a point P, F(P) = 0. Further, let M be given implicitly by the equation F(x, y) = 0, and let the rank of the matrix D¹F at the point P be n.
Then P is a stationary point of a continuously differentiable function h : R^{m+n} → R with respect to the constraints F if and only if there exist real parameters λ₁, . . . , λ_n such that

grad h = λ₁ grad f₁ + · · · + λ_n grad f_n.

The procedure suggested by the theorem is called the method of Lagrange multipliers. It is of algorithmic character. Consider the numbers of unknowns and equations: the gradients are vectors of m + n coordinates, so the statement of the theorem yields m + n equations. The variables are, on the one hand, the coordinates x₁, . . . , x_{m+n} of the stationary points P with respect to the constraints, and, on the other hand, the n parameters λ_i in the linear combination. It remains to add that the point P must belong to the implicitly given set M, which represents n further equations. Altogether, there are 2n + m equations for 2n + m variables, so it can be expected that the solution is given by a discrete set of points P (i.e., each one of them is an isolated point).

Very often, the system of equations is a seemingly simple system of algebraic equations, but in fact only rarely can it be solved explicitly. We return to special algebraic methods for systems of polynomial equations in chapter 12. There are also various numerical approaches to such systems. Theoretical details are not discussed here, but there are several solved examples in the other column, including an illustration of how to use the second derivatives to decide about the local extrema under the constraints.

8.1.30. Arithmetic mean versus geometric mean. As an example of practical application of the Lagrange multipliers, we prove the inequality

(1/n)(x₁ + · · · + x_n) ≥ ⁿ√(x₁ · · · x_n)

for any n positive real numbers x₁, . . . , x_n. Equality occurs if and only if all the x_i's are equal.

Consider the sum x₁ + · · · + x_n = c as the constraint for a (non-specified) positive constant c. We look for the maxima and minima of the function f(x₁, . . . , x_n) = ⁿ√(x₁ · · · x_n) with respect to the constraint and the assumption x₁ > 0, . . . , x_n > 0.

Solution. We will compute the integral in spherical coordinates. The segment can be perceived as a spherical sector without the cone (with vertex at the point [0, 0, 0] and the circular base z = 1, x² + y² ≤ 1). In these coordinates, the sector is the product of the intervals [0, √2] × [0, 2π) × [0, π/4]. We thus integrate within the given bounds, in any order:

∫₀^{2π} ∫₀^{√2} ∫₀^{π/4} r² sin(θ) dθ dr dφ = (4/3)(√2 − 1)π.

In the end, we must subtract the volume of the cone. That is equal to (1/3)πR²H (where R is the radius of the cone's base and H is its height; both are equal to 1 in our case), so the total volume is

V_sector − V_cone = (4/3)(√2 − 1)π − (1/3)π = (1/3)π(4√2 − 5).

The volume of a general spherical segment of height h in a ball with radius R could be computed similarly:

V = V_sector − V_cone = ∫₀^{2π} ∫₀^{arccos((R−h)/R)} ∫₀^R r² sin(θ) dr dθ dφ − (1/3)π(2Rh − h²)(R − h) = (1/3)πh²(3R − h). □

8.I.4. Find the volume of the part of the cylinder x² + z² = 16 which lies inside the cylinder x² + y² = 16.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

The normal vector to the part of the hyperplane M defined by the constraint is (1, . . . , 1). Therefore, the function f can have an extremum only at those points where its gradient is a multiple of this normal vector.
Hence there is the following system of equations for the desired points (the components ∂f/∂x_i of the gradient appear on the left-hand sides, a λ-multiple of the normal vector on the right):

(1/n)(1/x_i) ⁿ√(x₁ · · · x_n) = λ, for i = 1, . . . , n and λ ∈ R.

These equations imply x₁ = · · · = x_n on the set M. If the variables x_i are allowed to be zero as well, then the set M becomes compact, so the function f has to attain both a maximum and a minimum on it. However, f is minimal if and only if at least one of the values x_i is zero; so the function necessarily has a strict maximum at the point with x_i = c/n, i = 1, . . . , n, and then λ = 1/n. By substitution, the geometric mean equals the arithmetic mean for these extremal values, but it is strictly smaller at all other points with the given sum c of coordinates, which proves the inequality.

2. Integration for the second time

We return to the process of integration, discussed in the second and third parts of chapter six. We saw that the integration with respect to the diverse coordinates can be iterated. Now we extend the concept of the Riemann integration and Jordan measure to general Euclidean spaces and, again, we shall see that the approaches coincide for many reasonable functions.

8.2.1. Integrals dependent on parameters. Recall that integrating a function f(x, y₁, . . . , y_n) of n + 1 variables with respect to the single variable x, the result is a function F(y₁, . . . , y_n) of the remaining variables. Essentially, we proved the following theorem already in 6.3.11 and 6.3.13. This is an extremely useful technical tool, as we saw when handling the Fourier transforms and convolutions in the last chapter. The previous results about extrema of multivariate functions also have a direct application for the minimization of areas or volumes of objects defined in terms of functions dependent on parameters, etc.

Solution. We will compute the integral in Cartesian coordinates. Since the solid is symmetric, it suffices to integrate over the first octant (interchanging x and −x does not change the defining inequalities of the solid; the same holds for y and for z). The part of the solid that lies in the first octant is the space under the graph of the function z(x, y) = √(16 − x²) and over the quarter-disc x² + y² ≤ 16, x ≥ 0, y ≥ 0. Therefore, the volume of the whole solid is equal to

V = 8 ∫₀⁴ ∫₀^{√(16−x²)} √(16 − x²) dy dx = 8 ∫₀⁴ (16 − x²) dx = 1024/3. □

Remark. Note that the projection of the considered solid onto both the plane y = 0 and the plane z = 0 is a disc with radius 4, yet the solid is not a ball.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

Continuity and differentiation

Theorem. Consider a continuous function f(x, y₁, . . . , y_n) defined for all x from a finite interval [a, b] and for all (y₁, . . . , y_n) lying in some neighbourhood U of a point c = (c₁, . . . , c_n) ∈ Rⁿ, and its integral

F(y₁, . . . , y_n) = ∫_a^b f(x, y₁, . . . , y_n) dx.

Then the function F(y₁, . . . , y_n) is continuous on U. Moreover, if the continuous partial derivative ∂f/∂y_j exists on a neighbourhood of the point c, then ∂F/∂y_j(c) exists as well, and

∂F/∂y_j(c) = ∫_a^b ∂f/∂y_j(x, c₁, . . . , c_n) dx.

Proof. In Chapter 6, we dealt with two variables x, y only, but replacing the absolute value |y| with the norm ∥y∥ of the vector of parameters does not change the argumentation at all. Again, the main point is that continuous real functions on compact sets are uniformly continuous.
Since the partial derivative concerns only one of the variables, the rest of the theorem was proved in 6.3.13, too. □

8.2.2. Integration of multivariate functions. In the case of univariate functions, integration is motivated by the idea of the area under the graph of a given function of one variable. Consider now the volume of the part of the three-dimensional space which lies under the graph of a function z = f(x, y) of two variables, and the multidimensional analogues in general. In chapter six, small intervals [x_i, x_{i+1}] of lengths ∆x_i were chosen which divided the whole interval [a, b]. Then, their representatives ξ_i were selected, and the corresponding part of the area was approximated by the area of the rectangle with height given by the value f(ξ_i) at the representative, i.e. by the expression f(ξ_i)∆x_i. In the case of functions of two variables, we work with divisions in both variables and with values representing the height of the graph above the particular little rectangles in the plane.

The first thing to deal with is the integration domain, that is, the region the function f is to be integrated over. As an example, consider the function z = f(x, y) = √(1 − x² − y²), whose graph, over the unit disc, is half of the unit sphere. Integrating this function over the unit disc yields the volume of the unit semi-ball.

The simplest approach is to consider only those integration domains M which are given by products of intervals, i.e. given by the ranges x ∈ [a, b] and y ∈ [c, d]. In this context, such a product is called a multidimensional interval. If M is a different bounded set in R², we work with a sufficiently large interval [a, b] × [c, d] rather than with the set itself, and adjust the function so that f(x, y) = 0 at all points lying outside M.

8.I.5. Find the volume of the part of the cylinder x² + y² ≤ 4 bounded by the planes z = 0 and z = x + y + 2.

Solution. We will work in cylindric coordinates given by the equations x = r cos(φ), y = r sin(φ), z = z. The Jacobian of this transformation is J = r. The solid can be divided into two parts, above and below the plane z = 0, whose volumes will be denoted by V₁ and V₂, respectively. Further, we can notice that one part of the solid contributing to V₁ is the pyramid with vertices [0, 0, 0], [0, 0, 2], [−2, 0, 0], [0, −2, 0]. Thus, we further split the solid above z = 0 into two parts, whose volumes we calculate separately:

V₁ − V_pyramid = ∫_{−π/2}^{π} ( ∫₀² [r sin φ + r cos φ + 2] r dr ) dφ = 6π + 16/3, V_pyramid = 4/3.

Further,

V₁ − V₂ = ∫_{−π}^{π} ∫₀² r²(sin(φ) + cos(φ)) + 2r dr dφ = 8π,

so V₁ + V₂ = 2V₁ − (V₁ − V₂) = 4π + 40/3. □

Remark. During the calculation, we made use of the fact that integrating a function of two variables over a region in R² yields the difference between the volume of the solid in R³ determined by the graph of the integrated function that lies above the plane z = 0 and the volume of the part lying below it.

8.I.6. Find the volume of the solid in R³ which is given by the intersection of the sphere x² + y² + z² = 4 and the cylinder x² + y² = 1.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

Considering the above case of the unit ball, integrate over the set M = [−1, 1] × [−1, 1] the function

f(x, y) = √(1 − x² − y²) for x² + y² ≤ 1, and f(x, y) = 0 otherwise.

The definition of the Riemann integral then faithfully follows the procedure from paragraph 6.2.8. This can be done for an arbitrary finite number of variables.
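This procedure can be tried out numerically at once: a crude midpoint Riemann sum for the function above already approaches the volume 2π/3 of the unit semi-ball. A minimal sketch (the grid size is an arbitrary choice of ours):

    import numpy as np

    # Riemann sum over I = [-1,1] x [-1,1], with f extended by zero outside M
    n = 1000                                  # subintervals per axis
    xs = (np.arange(n) + 0.5) * 2 / n - 1     # midpoint representatives
    X, Y = np.meshgrid(xs, xs)
    F = np.sqrt(np.clip(1 - X**2 - Y**2, 0, None))
    S = F.sum() * (2 / n) ** 2                # sum of f(xi) * area of cells
    print(S, 2 * np.pi / 3)                   # both approx 2.0944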
Given an n–dimensional interval I and partitions into k_i subintervals in each variable x_i, select the partition of I into k₁ · · · k_n small n–dimensional intervals, and write ∆x_{i₁...i_n} for their volumes. The maximum of the lengths of the sides of the multidimensional intervals in such a partition is called its norm.

Riemann integral

Definition. The Riemann integral of a real-valued function f defined on a multidimensional interval I = [a₁, b₁] × [a₂, b₂] × . . . × [a_n, b_n] exists if, for every choice of a sequence of divisions Ξ (dividing the multidimensional interval in all variables simultaneously) with the norms of the partitions converging to zero, and for every choice of representatives ξ_{i₁...i_n} of the little multidimensional intervals in the partitions, the integral sums

S_{Ξ,ξ} = Σ_{i₁...i_n} f(ξ_{i₁...i_n}) ∆x_{i₁...i_n}

always converge to the value

S = ∫_I f(x₁, . . . , x_n) dx₁ . . . dx_n,

independent of the selected sequence of divisions and representatives. The function f is then said to be Riemann-integrable over I.

As a relatively simple exercise, prove in detail that every Riemann-integrable function over an interval I must be bounded there. The reason is the same as in the case of univariate functions: the control of the norms of the divisions used in the definition is somewhat rough. The situation gets worse when integrating in this way over unbounded intervals, see more remarks in 8.2.6 below. Therefore, we consider integration of functions over Rⁿ mainly for functions whose support is compact, that is, functions which vanish outside a bounded interval I.

A bounded set M ⊂ Rⁿ is said to be Riemann-measurable⁴ if and only if its indicator function, defined by

χ_M(x₁, . . . , x_n) = 1 for (x₁, . . . , x_n) ∈ M, and 0 for all other points in Rⁿ,

is Riemann-integrable over Rⁿ.

⁴ Better to say “measurable via Riemann integration”; the measure itself is commonly called the Peano–Jordan measure in the literature.

Solution. Thanks to symmetry, it suffices to compute the volume of the part that lies in the first octant. We will integrate in cylindric coordinates given by the equations x = r cos(φ), y = r sin(φ), z = z, with Jacobian J = r; the relevant part of the solid is the space between the plane z = 0 and the graph of the function z = √(4 − x² − y²) = √(4 − r²). Therefore, we can directly write the volume as the double integral

V = 8 ∫₀^{π/2} ∫₀¹ r √(4 − r²) dr dφ = (4/3)(8 − 3√3)π. □

8.I.7. Find the volume of the solid in R³ which is given by the intersection of the sphere x² + y² + z² = 2 and the paraboloid z = x² + y².

Solution. Once again, we will work in cylindric coordinates:

V = ∫₀^{2π} ∫₀¹ ∫_{r²}^{√(2−r²)} r dz dr dφ = (4√2/3)π − 7π/6.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

For any Riemann-measurable set M and a function f defined at all points of M, consider the function f̃ = χ_M · f as a function defined on the whole Rⁿ. This function f̃ apparently has a compact support. The Riemann integral of the function f over the set M is defined by

∫_M f dx₁ . . . dx_n = ∫_{Rⁿ} f̃ dx₁ . . . dx_n,

supposing the integral on the right-hand side exists.

8.2.3. Properties of the Riemann integral. This definition of the integral does not provide reasonable instructions for computing the values of Riemann integrals. However, it does lead to the following basic properties of the Riemann integral (cf. Theorem 6.2.8):

Theorem. The set of Riemann-integrable real-valued functions over a Riemann-measurable domain M ⊂ Rⁿ is a vector space over the real scalars, and the Riemann integral is a linear form there.
If the integration domain M is given as a disjoint union of finitely many Riemann-measurable domains M_i, then f is integrable over M if and only if it is integrable over all the M_i, and the integral of f over M is given by the sum of the integrals over the individual subdomains M_i.

Proof. All the properties follow directly from the definition of the Riemann integral and the properties of convergent sequences of real numbers, just as in the case of univariate functions. Work out the details by yourselves. □

For practical use, we rewrite the theorem into the usual equalities:

Finite additivity and linearity

Any linear combination of Riemann-integrable functions f_i : I → R, i = 1, . . . , k (over scalars in R) is again a Riemann-integrable function, and its integral can be computed as follows:

∫_I ( a₁f₁(x₁, . . . , x_n) + . . . + a_k f_k(x₁, . . . , x_n) ) dx₁ . . . dx_n = a₁ ∫_I f₁(x₁, . . . , x_n) dx₁ . . . dx_n + · · · + a_k ∫_I f_k(x₁, . . . , x_n) dx₁ . . . dx_n.

Let M₁ and M₂ be disjoint Riemann-measurable sets, and consider a function f : M₁ ∪ M₂ → R. Then f is Riemann-integrable over both sets M_i if and only if it is integrable over their union, and

∫_{M₁∪M₂} f(x₁, . . . , x_n) dx₁ . . . dx_n = ∫_{M₁} f(x₁, . . . , x_n) dx₁ . . . dx_n + ∫_{M₂} f(x₁, . . . , x_n) dx₁ . . . dx_n.

□

8.I.8. Find the volume of the solid in R³ which is bounded by the elliptic cylinder 4x² + y² = 1 and the planes z = 2y and z = 0, lying above the plane z = 0.

Solution. Thanks to symmetry, it is advantageous to work in the coordinates x = (1/2) r cos(φ), y = r sin(φ), z = z, with Jacobian J = (1/2) r. The equation of the elliptic cylinder in these coordinates is r² = 1. The solid lies above those points where z = 2y = 2r sin(φ) is positive, i.e., for φ ∈ [0, π]. Thus, the wanted volume is

V = ∫₀^π ∫₀¹ 2r sin(φ) · (1/2) r dr dφ = ∫₀^π ∫₀¹ r² sin(φ) dr dφ = ∫₀^π (1/3) sin(φ) dφ = 2/3. □

8.I.9. Find the volume of the solid in R³ which is bounded by the paraboloid 2x² + y² = z and the plane z = 2.

Solution. Similarly to the above problem, we choose “special” coordinates which respect the symmetry of the solid: x = (1/√2) r cos(φ), y = r sin(φ), z = z, with Jacobian

CHAPTER 8. CALCULUS WITH MORE VARIABLES

8.2.4. Multiple integrals. Riemann-measurable domains especially involve the cases when the boundary of the integration domain M can be expressed step by step via continuous dependencies between the coordinates in the following way. The first coordinate x runs within an interval [a, b]. The interval range of the next coordinate can be defined by two functions, i.e. y ∈ [φ(x), ψ(x)]; then the range of the next coordinate is expressed as z ∈ [η(x, y), ζ(x, y)], and so on for all of the other coordinates. For example, this is easy in the case of the ball from the introductory example: for x ∈ [−1, 1], define the range for y as y ∈ [−√(1 − x²), √(1 − x²)]. The volume of the ball can then be computed by integration of the mentioned function f, or we can integrate the indicator function of the ball, i.e. the function which equals one on the subset M ⊂ R³ which is further defined by z ∈ [−√(1 − x² − y²), √(1 − x² − y²)].

The following fundamental theorem transforms the computation of a Riemann integral into a sequence of computations of univariate integrals (while the other variables are considered to be parameters, which can appear in the integration bounds as well). Notice that we could have defined the multiple integral directly via the one-dimensional integration, but then we would face the trouble of ensuring the independence of the result of our way of describing M. The theorem reveals that the two approaches coincide, and there are no unclear points left.
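As a numerical preview of the theorem below, the volume from 8.I.5 can be recomputed as an iterated integral with functional bounds. A minimal sketch using SciPy's dblquad (our own check; the integrand |x + y + 2| is the total height between the two bounding planes):

    from math import sqrt, pi
    from scipy.integrate import dblquad

    # total volume between z = 0 and z = x + y + 2 over the disc x^2 + y^2 <= 4
    V, err = dblquad(lambda y, x: abs(x + y + 2),
                     -2, 2,
                     lambda x: -sqrt(max(4 - x*x, 0.0)),
                     lambda x: sqrt(max(4 - x*x, 0.0)))
    print(V, 4 * pi + 40 / 3)  # both approx 25.8997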
Multiple integrals

Theorem. Let M ⊂ Rⁿ be a bounded set, expressed with the help of continuous functions ψ_i, η_i as

M = {(x₁, . . . , x_n); x₁ ∈ [a, b], x₂ ∈ [ψ₂(x₁), η₂(x₁)], . . . , x_n ∈ [ψ_n(x₁, . . . , x_{n−1}), η_n(x₁, . . . , x_{n−1})]},

and let f be a function which is continuous on M. Then the Riemann integral of the function f over the set M exists and is given by the formula

∫_M f(x₁, x₂, . . . , x_n) dx₁ . . . dx_n = ∫_a^b ( ∫_{ψ₂(x₁)}^{η₂(x₁)} . . . ( ∫_{ψ_n(x₁,...,x_{n−1})}^{η_n(x₁,...,x_{n−1})} f(x₁, x₂, . . . , x_n) dx_n ) . . . dx₂ ) dx₁,

where the individual integrals are one-variable Riemann integrals.

Proof. Consider first the proof for the case of two variables. It can then be seen that no further ideas are needed in the general case. Consider an interval I = [a, b] × [c, d] containing the set M = {(x, y); x ∈ [a, b], y ∈ [ψ(x), η(x)]}, and divisions Ξ of the interval I with representatives ξ_{ij}.

J = (1/√2) r. The equation of the paraboloid in these coordinates is z = r², so the volume of the solid is equal to

V = 4 ∫₀^{π/2} ∫₀^{√2} ∫_{r²}^{2} (1/√2) r dz dr dφ = 2√2 ∫₀^{π/2} ∫₀^{√2} (2r − r³) dr dφ = 2√2 ∫₀^{π/2} dφ = √2 π. □

8.I.10. Calculate the volume of the ellipsoid x² + 2y² + 3z² = 1.

Solution. We will consider the coordinates x = r cos(φ) sin(θ), y = (1/√2) r sin(φ) sin(θ), z = (1/√3) r cos(θ). The corresponding Jacobian is (1/√6) r² sin(θ), so the volume is

V = ∫₀^{2π} ∫₀^π ∫₀¹ (1/√6) r² sin(θ) dr dθ dφ = 4/(3√6) π. □

8.I.11. Remark. Note that if the transformation of the coordinates is linear (or affine), then the space is deformed “uniformly”. This means that the volume of an arbitrary solid is changed proportionally, by the factor expressing the change of the volume of an infinitesimal volume element, which is the Jacobian. Therefore, if we consider the volume of the ball with a given radius to be known (in this case, r = 1), we can infer directly that the volume of the ellipsoid is V = (1/√6) · (4/3)π = 4/(3√6) π.

8.I.12. Find the volume of the solid which is bounded by the paraboloid 2x² + 5y² = z and the plane z = 1.

Solution. We choose the coordinates x = (1/√2) r cos(φ), y = (1/√5) r sin(φ), z = z. The Jacobian determinant is r/√10, so the volume is

V = ∫₀^{2π} ∫₀¹ ∫_{r²}^{1} (r/√10) dz dr dφ = π/(2√10). □

CHAPTER 8. CALCULUS WITH MORE VARIABLES

The corresponding integral sum is

S_{Ξ,ξ} = Σ_{i,j} f(ξ_{ij}) ∆x_{ij} = Σ_i ( Σ_j f(ξ_{ij}) ∆y_j ) ∆x_i,

where ∆x_{ij} is written for the product of the sizes ∆x_i and ∆y_j of the intervals which correspond to the choice of the representative ξ_{ij}. Assume first that we work only with those choices of representatives ξ_{ij} which, for each fixed i, all share the same first coordinate x_i. If the partition of the interval [a, b] is fixed, and only the partition of [c, d] is refined, then the values of the inner sum of the expression approach the value of the integral

S_i = ∫_{ψ(x_i)}^{η(x_i)} f(x_i, y) dy,

which exists since the function f(x_i, y) is continuous. In this way, a function is obtained which is continuous in the free parameter x_i, see 8.2.1. Therefore, further refinement of the partition of the interval [a, b] leads, in the limit, to the desired formula

Σ_i S_i ∆x_i → S = ∫_a^b ( ∫_{ψ(x)}^{η(x)} f(x, y) dy ) dx.

It remains to deal with the case of general choices of representatives in general divisions Ξ. Since M is clearly compact (bounded and closed by definition), and f is a continuous function on M, it is uniformly continuous there.
Therefore, if a small real number ε > 0 is selected beforehand, there is always a bound δ > 0 for the norm of the partitions, so that the values of the function f for the general choices ξ_{ij} differ by no more than ε from the choices used above. Thus, the limit process results in the same value for general Riemann sums S_{Ξ,ξ} as seen above.

Now, the general case can be proved easily by induction. In the case of n = 1, the result is trivial. The presented reasoning can easily be transformed for a general induction step, writing (x_2, …, x_n) instead of y, having x_1 instead of x, and perceiving the particular little cubes of the divisions as (n − 1)-dimensional cubes Cartesian-multiplied by the last interval. In the last-but-one step of the proof, the induction hypothesis is used, rather than the simple one-dimensional integration. The final argument about uniform continuity remains the same. It is advised to write this proof in detail as an exercise. □

8.2.5. Fubini theorem. The latter theorem has a particularly simple shape in the case of a multidimensional interval M. Then all the functions in the bounds for integration are just the constant bounds from the definition of M. But this means that the integration process can be carried out coordinate by coordinate in

8.I.13. Find the volume of the solid which lies in the first octant and is bounded by the surfaces y² + z² = 9 and y² = 3x.

Solution. In cylindric coordinates,

V = ∫_0^{π/2} ∫_0^3 ∫_0^{(r² cos² φ)/3} r dx dr dφ = (27/16)π. □

8.I.14. Find the volume of the solid in R³ which is bounded by the cone part 2x² + y² = (z − 2)², z ≥ 2, and the paraboloid 2x² + y² = 8 − z.

Solution. First of all, we find the intersection of the given surfaces: (z − 2)² = −z + 8, z ≥ 2; therefore, z = 4, and the equation of the intersection is 2x² + y² = 4. The substitution x = (1/√2) r cos φ, y = r sin φ, z = z transforms the given surfaces to the form r² = (z − 2)², z ≥ 2, and r² = 8 − z, i.e., z = r + 2 for the former surface and z = 8 − r² for the latter. Altogether, the projection of the given solid onto the coordinate φ is equal to the interval [0, 2π]. Having fixed a φ_0 ∈ [0, 2π], the projection of the intersection of the solid and the plane φ = φ_0 onto the coordinate r equals (independently of φ_0) the interval [0, 2]. Having fixed both r_0 and φ_0, the projection of the intersection of the solid and the line r = r_0, φ = φ_0, onto the coordinate z is equal to the interval [r_0 + 2, 8 − r_0²]. The Jacobian of the considered transformation is J = (1/√2) r, so we can write

V = ∫_0^{2π} ∫_0^2 ∫_{r+2}^{8−r²} (r/√2) dz dr dφ = (16√2/3)π. □

any order. We have exploited this behavior already in Chapter 6, cf. 6.3.12. In this way, the following important corollary is proved:⁵

Fubini theorem

Theorem. Every continuous function f(x_1, …, x_n) on a multidimensional interval M = [a_1, b_1] × [a_2, b_2] × ⋯ × [a_n, b_n] is Riemann integrable on M, and its integral

∫_M f(x_1, …, x_n) dx_1⋯dx_n = ∫_{a_1}^{b_1} ∫_{a_2}^{b_2} ⋯ ∫_{a_n}^{b_n} f(x_1, …, x_n) dx_1⋯dx_n

is independent of the order in which the multiple integration is performed.

The possibility of changing the order of integration in multiple integrals is extremely useful. We have already taken advantage of this result, namely when studying the relation of Fourier transforms and convolutions, see paragraph 7.1.9.

8.2.6. Unbounded regions and functions. There is no simple concept of an improper integral for unbounded multivariate functions.
The following example of multiple integration of an unbounded function is illustrative in this direction:

∫_0^1 (∫_0^1 (x − y)/(x + y)³ dy) dx = 1/2, while ∫_0^1 (∫_0^1 (x − y)/(x + y)³ dx) dy = −1/2.

The reason can be understood by looking at the properties of non-absolutely converging series. There, rearranging the summands can lead to an arbitrary result.

The situation is better if the Riemann integral of a bounded non-negative function f(x) ≥ 0 with non-compact support over the whole R^n is calculated. Of course, some extra information is needed on the decay of the function f for large arguments. For example, if f is Riemann integrable over each n-dimensional interval I and there is a universal bound ∫_I |f(x)| dx ≤ C with a constant C independent of the choice of the n-dimensional interval I ⊂ R^n, then we may define

∫_{R^n} f(x) dx = lim_{r→∞} ∫_{I_r} f(x) dx, where I_r = {(x_1, …, x_n); |x_j| < r, j = 1, …, n}.

The resulting limit, if it exists, is bounded by the same constant C.

⁵ Guido Fubini (1879–1943) was an important Italian mathematician, active also in applied areas of mathematics. The simple derivation of the Fubini theorem given here builds upon the simple properties of Riemann integration and the continuity of the integrated function. Fubini, in fact, proved this result in a much more general context of integration, while the theorem just introduced was used by mathematicians like Cauchy at least a century before Fubini.

8.I.15. Find the volume of the solid which lies inside the cylinder y² + z² = 4 and the half-space x ≥ 0 and is bounded by the surface y² + z² + 2x = 16.

Solution. In cylindric coordinates,

V = ∫_0^{2π} ∫_0^2 ∫_0^{8 − r²/2} r dx dr dφ = 28π. □

8.I.16. The centroid of a solid. The coordinates (x_t, y_t, z_t) of the centroid of a (homogeneous) solid T with volume V in R³ are given by the following integrals:

x_t = (1/V) ∫∫∫_T x dx dy dz, y_t = (1/V) ∫∫∫_T y dx dy dz, z_t = (1/V) ∫∫∫_T z dx dy dz.

The centroid of a figure in R² or in other dimensions can be computed analogously.

8.I.17. Find the centroid of the part of the ellipse 3x² + 2y² = 1 which lies in the first quadrant of the plane R².

Solution. First, let us calculate the area of the given figure. The transformation x = (1/√3) x′, y = (1/√2) y′ with Jacobian 1/√6 leads to

S = ∫_0^{1/√3} ∫_0^{√((1−3x²)/2)} dy dx = (1/√6) ∫_0^1 ∫_0^{√(1−x′²)} dy′ dx′ = π/(4√6).

The other integrals we need can be computed directly in the Cartesian coordinates x and y:

T_x = ∫_0^{1/√3} ∫_0^{√((1−3x²)/2)} x dy dx = ∫_0^{1/√3} x √((1 − 3x²)/2) dx = (1/2) ∫_0^{1/3} √((1 − 3t)/2) dt = √2/18,

T_y = ∫_0^{1/√3} ∫_0^{√((1−3x²)/2)} y dy dx = (1/2) ∫_0^{1/√3} (1 − 3x²)/2 dx = (1/4) ∫_0^{1/√3} (1 − 3x²) dx = √3/18.

Therefore, the coordinates of the centroid are [4√3/(9π), 2√2/(3π)]. □

8.I.18. Find the volume and the centroid of a homogeneous cone of height h and circular base with radius r.

Solution. Positioning the cone so that the vertex is at the origin, pointing downwards, we have in cylindric coordinates that

V = 4 ∫_0^{π/2} ∫_0^r ∫_{(h/r)ρ}^h ρ dz dρ dφ = (1/3)πhr².

In this case, the Fubini theorem is true in the form

∫_{R^n} f(x) dx = ∫_{−∞}^∞ ⋯ (∫_{−∞}^∞ f(x) dx_1) ⋯ dx_n.

8.2.7. Further remarks on integration. The Riemann integral of multivariate functions behaves even worse than in the case of functions of one variable in the sixth chapter. Therefore, more sophisticated approaches to integration have been developed. They are mainly based on the concept of the measure of a set. We consider this problem briefly now.
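As an aside, the failure of the interchange of the integration order in the example from 8.2.6 above is easy to reproduce by direct computation. The following small sympy sketch (an illustrative aside of ours, with sympy assumed available) returns the two different values:

```python
# Sketch: the two iterated integrals of (x - y)/(x + y)^3 over the unit square disagree.
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = (x - y) / (x + y)**3

dy_first = sp.integrate(sp.integrate(f, (y, 0, 1)), (x, 0, 1))  # inner dy, outer dx
dx_first = sp.integrate(sp.integrate(f, (x, 0, 1)), (y, 0, 1))  # inner dx, outer dy
print(dy_first, dx_first)  # prints 1/2 and -1/2
```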
As we shall see in 8.2.10, the Riemann integration of the indicator functions χ_M of sets M ⊂ R^n leads to a finitely additive measure. In probability theory in chapter 10, even elementary problems require a concept of measure which is additive over countable systems of disjoint sets. Having such a measure, measurable functions f can be defined by the condition that their preimages of bounded intervals, f^{−1}([a, b]), are measurable sets, and the integral is built by approximation via such "horizontal strips". This is the starting point of Lebesgue integration.

We omit further details here, but note the Riesz representation theorem⁶, saying that for each linear functional I (i.e. a linear mapping valued in R) on continuous functions with compact support on a metric space X, there is a unique measure (with certain regularity properties) such that the integral associated to this measure extends I. In the case of the Riemann integral I on functions on R^n, this provides the Lebesgue measure and the Lebesgue integral.

Another point of view is the completion procedure for metric spaces. Consider the vector space X = S_c(R^n) of all continuous functions with compact support. It can be equipped with the L_p norms, similar to the univariate case from the seventh chapter, i.e.

‖f‖_p = (∫_{R^n} |f(x_1, …, x_n)|^p dx_1⋯dx_n)^{1/p}

for any 1 ≤ p < ∞. Since the Riemann integral is defined again in terms of partitions and the representative values, the properties of the norm can be verified in the same way as for univariate functions, using Hölder's and Minkowski's inequalities. This yields the metrics ‖·‖_p on X. The general theory provides its completion X̃, unique up to isometry, and it can be shown that it is again a space of functions. The Lebesgue integral mentioned above defines exactly these norms. Hence the spaces of functions with Lebesgue-integrable powers |f|^p are obtained.

⁶ Frigyes Riesz (1880–1956) was a famous Hungarian mathematician, active in particular in functional analysis. He introduced this theorem in the special case of X being an interval in R in 1909.

Apparently, the centroid lies on the z-axis. For the z-coordinate, we get

z_t = (1/V) ∫_cone z dV = (1/V) · 4 ∫_0^{π/2} ∫_0^r ∫_{(h/r)ρ}^h zρ dz dρ dφ = (3/4)h.

Thus, the centroid lies at a quarter of the cone's height from the center of its base. □

8.I.19. Find the centroid of the solid which is bounded by the paraboloid 2x² + 2y² = z, the cylinder (x + 1)² + y² = 1, and the plane z = 0.

Solution. First, we will compute the volume of the given solid. Again, we use the cylindric coordinates (x = r cos φ, y = r sin φ, z = z), in which the equation of the paraboloid is z = 2r² and the equation of the cylinder reads r = −2 cos φ. Moreover, taking into account the fact that the plane x = 0 is tangent to the given cylinder, we can easily determine the bounds of the integral that corresponds to the volume of the examined solid:

V = ∫_{π/2}^{3π/2} ∫_0^{−2 cos φ} ∫_0^{2r²} r dz dr dφ = ∫_{π/2}^{3π/2} ∫_0^{−2 cos φ} 2r³ dr dφ = ∫_{π/2}^{3π/2} 8 cos⁴ φ dφ = 3π,

where the last integral can be computed using the method of recurrence from 6.2.6. Now, let us find the centroid. Since the solid is symmetric with respect to the plane y = 0, the y-coordinate of the centroid must be zero.
Then, the remaining coordinates x_T and z_T of the centroid can be computed by the following integrals:

x_T = (1/V) ∫∫∫_B x dx dy dz = (1/V) ∫_{π/2}^{3π/2} ∫_0^{−2 cos φ} ∫_0^{2r²} r² cos φ dz dr dφ = (1/V) ∫_{π/2}^{3π/2} ∫_0^{−2 cos φ} 2r⁴ cos φ dr dφ = (1/V) ∫_{π/2}^{3π/2} −(64/5) cos⁶ φ dφ = −4/3,

where the last integral was computed by 6.2.6 again. Analogously for the z-coordinate of the centroid:

z_T = (1/V) ∫_{π/2}^{3π/2} ∫_0^{−2 cos φ} ∫_0^{2r²} zr dz dr dφ = 20/9.

The coordinates of the centroid are thus [−4/3, 0, 20/9]. □

8.2.8. Change of coordinates. When calculating integrals of univariate functions, the "substitution method" is used as one of the powerful tools, cf. 6.2.5. The method works similarly in the case of functions of more variables, once its geometric meaning is understood.

Recall and reinterpret the univariate case. There, the integrated expression f(x) dx infinitesimally describes the two-dimensional area of the rectangle whose sides are the (linearized) increment ∆x of the variable x, i.e. the one-dimensional rectangle, and the value f(x). If the variable x is transformed by the relation x = u(t), then the linearized increment can be expressed with the help of the differential as dx = (du/dt) dt, and so the corresponding contribution to the integral is given by f(u(t)) (du/dt) dt. Here, one either supposes that the sign of the derivative u′(t) is positive, or one interchanges the bounds of the integral, so that the sign does not affect the result.

Intuitively, the procedure for n variables should be similar. It is only necessary to recall the formula for the (change of the) volume of parallelepipeds. The Riemann integrals are approximated by Riemann sums, which are based on the n-dimensional volume (area) of small multidimensional intervals ∆x_{i_1…i_n} in the variables, multiplied by the values of the function at the representative points ξ_{i_1…i_n}. If the coordinates are transformed by means of a mapping x = G(y), not only the function values f(G(ξ̃_{i_1…i_n})) are obtained at the representative points ξ̃_{i_1…i_n} = G^{−1}(ξ_{i_1…i_n}) in the new coordinate expression, but also the change of the volume of the corresponding small multidimensional intervals needs care. Once again, this is the case of a linear approximation of a change, which is well known: the best linear approximation of G(y) is its derivative D¹G(y), which is given by the Jacobi matrix of G, see 8.1.19. The change of the volume is then given (in absolute value) by the determinant of this matrix (see the discussion of this topic in chapter 4 devoted to analytic geometry and linear algebra, especially 4.1.19).

Summarizing, the formulation of the next theorem should not be surprising, and its proof consists in a formalization of the latter ideas. However, this needs some effort, and so the proof is split into several steps.

8.I.20. Find the centroid of the homogeneous solid in R³ which lies between the planes z = 0 and z = 2, bounded by the cones x² + y² = z² and x² + y² = 2z².

Solution. The problem can be solved in the same way as the previous ones. It would be advantageous to work in cylindric coordinates. However, we can notice that the solid in question is an "annular cone": it is formed by cutting a cone K_1 with base radius 2 out of a cone K_2 with base radius 2√2, of common height 2.
The centroid of the examined solid can be determined by the "rule of lever": the centroid of a system of two solids is the weighted arithmetic mean of the particular solids' centroids, weighted by the masses of the solids. We found out in exercise 8.I.18 that the centroid of a homogeneous cone is situated at a quarter of its height, measured from the base. Therefore, the centroids of both cones lie at the same point, and this point must thus be the centroid of the examined solid as well. Hence, the coordinates of the wanted centroid are [0, 0, 3/2]. □

8.I.21. Find the volume of the solid in R³ which is bounded by the cone part x² + y² = (z − 2)² and the paraboloid x² + y² = 4 − z.

Solution. We build the corresponding integral in cylindric coordinates, which evaluates as follows:

V = ∫_0^{2π} ∫_0^1 ∫_{r+2}^{4−r²} r dz dr dφ = (5/6)π. □

8.I.22. Find the volume of the solid in R³ which lies under the cone x² + y² = (z − 2)², z ≤ 2, and over the paraboloid x² + y² = z.

Solution.

V = ∫_0^{2π} ∫_0^1 ∫_{r²}^{2−r} r dz dr dφ = (5/6)π.

Transformation of coordinates

Theorem. Let G(t) : R^n → R^n be a continuously differentiable and invertible mapping, and write t = (t_1, …, t_n), x = (x_1, …, x_n) = G(t_1, …, t_n). Further, let M = G(N) be a Riemann-measurable set, and f : M → R a continuous function. Then, N is also Riemann-measurable and

∫_M f(x) dx_1⋯dx_n = ∫_N f(G(t)) |det(D¹G(t))| dt_1⋯dt_n.

8.2.9. The invariance of the integral. The first thing to be verified is the coincidence of two definitions of the volume of parallelepipeds (taken for granted in the above intuitive explanation of the latter theorem). Volumes and similar concepts were dealt with in chapter 4, and a crucial property was the invariance of the concepts with respect to the choice of Euclidean frames of R^n, cf. 4.1.19 on page 334, which followed directly from the expression of the volumes in terms of determinants. It is needed to show that the same result holds in terms of the Riemann integration as defined above. It turns out that it is easier to deal with invariance with respect to general invertible linear mappings Ψ : R^n → R^n.

Proposition. Let Ψ : R^n → R^n be an invertible linear mapping and I ⊂ R^n a multidimensional interval. Consider a function f such that f ∘ Ψ is integrable on I. Then M = Ψ(I) is Riemann-measurable, f is Riemann-integrable on M, and

∫_M f(x_1, …, x_n) dx_1⋯dx_n = |det Ψ| ∫_I (f ∘ Ψ)(y_1, …, y_n) dy_1⋯dy_n.

Proof. Each linear mapping is a composition of elementary transformations of three types (see the discussion in chapter 2, in particular paragraphs 2.1.7 and 2.1.9). The first one is the multiplication of one of the coordinates by a constant: Ψ(y_1, …, y_n) = (y_1, …, αy_i, …, y_n). In this case, |det Ψ| = |α|. The second one consists of an exchange of two coordinates, i.e. for given 1 ≤ i < j ≤ n, Ψ(y_1, …, y_n) = (y_1, …, y_j, …, y_i, …, y_n). The determinant of Ψ is −1 in this case. The third type of transformations is of the form Ψ(y_1, …, y_n) = (y_1, …, y_i + y_j, …, y_j, …, y_n), with determinant one. Without loss of generality, i = 1 can be chosen in the first case, and i = 1, j = 2 in the second case. Since the determinant of a composition of mappings (i.e. the determinant of the product of the matrices) is the product of the individual determinants, it is enough to prove the proposition for all three special types of Ψ.
Express the right-hand integrals for these three types of Ψ by means of the multiple integrals and the Fubini theorem. Write

Note that the considered solid is symmetric to the solid from the previous exercise 8.I.21 (the center of the symmetry is the point [0, 0, 2]). Therefore, it must have the same volume. □

8.I.23. Find the centroid of the figure bounded by the parabola y = 4 − x² and the line y = 0. ⃝

8.I.24. Find the centroid of the circular sector corresponding to the angle of 60° that was cut out of a disc with radius 1. ⃝

8.I.25. Find the centroid of the semidisc x² + y² ≤ 1, y ≥ 0. ⃝

8.I.26. Find the centroid of the circular sector corresponding to the angle of 120° that was cut out of a disc with radius 1. ⃝

8.I.27. Find the volume of the solid in R³ which is given by the inequalities z ≥ 0, z − x ≤ 0, and (x − 1)² + y² ≤ 1. ⃝

8.I.28. Find the volume of the solid in R³ which is given by the inequalities z ≥ 0, z − y ≤ 0. ⃝

8.I.29. Find the volume of the solid bounded by the surface 3x² + 2y² + 3z² + 2xy − 2yz − 4xz = 1. ⃝

8.I.30. Find the volume of the part of R³ lying inside the ellipsoid 2x² + y² + z² = 6 and in the half-space x ≥ 1. ⃝

8.I.31. The area of the graph of a real-valued function f(x, y) in variables x and y. The area of the graph of a function of two variables over a region S in the plane xy is given by the integral

P = ∫_S √(1 + f_x² + f_y²) dx dy.

Considering the cone x² + y² = z², find the area of the part of its lateral surface which lies above the plane z = 0 and inside the cylinder x² + y² = y.

Solution. The wanted area can be calculated as the area of the graph of the function z = √(x² + y²) over the disc K: x² + (y − 1/2)² ≤ 1/4. We can easily see that

f_x = x/√(x² + y²), f_y = y/√(x² + y²),

so the area is expressed by the integral

∫∫_K √(1 + f_x² + f_y²) dx dy = ∫∫_K √2 dx dy = √2 ∫_0^π ∫_0^{sin φ} r dr dφ = (√2/2) ∫_0^π sin² φ dφ = √2 π/4.

I = [a_1, b_1] × ⋯ × [a_n, b_n] and x = Ψ(y) for the transformation. In the first case (notice that we can deal with the first variable and α > 0 without loss of generality),

|det Ψ| ∫_I f(αy_1, y_2, …, y_n) dy_1⋯dy_n = α ∫_{a_n}^{b_n} ⋯ (∫_{a_1}^{b_1} f(αy_1, y_2, …, y_n) dy_1) ⋯ dy_n = α · α^{−1} ∫_{a_n}^{b_n} ⋯ (∫_{αa_1}^{αb_1} f(x_1, x_2, …, x_n) dx_1) ⋯ dx_n = ∫_{Ψ(I)} f(x_1, x_2, …, x_n) dx_1⋯dx_n.

The second case is even easier, since the order of integration does not matter due to the Fubini theorem. The third case is similar to the first one:

|det Ψ| ∫_I f(y_1 + y_2, y_2, …, y_n) dy_1⋯dy_n = ∫_{a_n}^{b_n} ⋯ (∫_{a_1}^{b_1} f(y_1 + y_2, y_2, …, y_n) dy_1) ⋯ dy_n = ∫_{a_n}^{b_n} ⋯ (∫_{a_1+x_2}^{b_1+x_2} f(x_1, x_2, …, x_n) dx_1) ⋯ dx_n = ∫_{Ψ(I)} f(x_1, x_2, …, x_n) dx_1⋯dx_n.

The reader should check the details that the last multiple integral describes the image Ψ(I). □

As a direct corollary of the proposition, the Riemann integral is invariant with respect to the Euclidean affine mappings. That is, the integral cannot depend on the choice of the orthogonal frame in the Euclidean R^n.

8.2.10. Riemann-measurable sets. It is necessary to understand how to recognize Riemann-measurable domains M. When defining the Riemann integral, a strict analogy of the lower and upper Riemann integrals for univariate functions can be considered. This means taking infima or suprema of the integrated function over the corresponding multidimensional intervals instead of the function values at the representatives in the Riemann sums.
For bounded functions, there are well-defined values of the upper and lower integrals found in this way. If this is done for the indicator function χ_M of a fixed set M, the inner and outer Riemann measure of the set M is obtained. Evidently, the inner measure is the supremum of the areas given by the (finite) sums of the volumes of all multidimensional intervals from the partitions which lie inside M; on the other hand, the outer measure is the infimum of the (finite) sums of the volumes of intervals covering M.

It follows directly from the definition that a set M is Riemann-measurable if and only if its inner and outer measures are equal. The sets whose outer measure is zero are, of course, Riemann-measurable. They are called measure zero sets or null sets. The finite additivity of the Riemann integral makes

□

8.I.32. Find the area of the paraboloid z = x² + y² over the disc x² + y² ≤ 4. ⃝

8.I.33. Find the area of the part of the plane x + 2y + z = 10 that lies over the figure given by (x − 1)² + y² ≤ 1 and y ≥ x. ⃝

In the following exercise, we will also apply our knowledge of the theory of Fourier transforms from the previous chapter.

8.I.34. Fourier transform and diffraction. Light intensity is a physical quantity which expresses the transmission of energy by waves. The intensity of a general light wave is defined as the time-averaged magnitude of the Poynting vector, which is the vector product of the mutually orthogonal vectors of the electric and magnetic fields. A monochromatic plane wave spreading in the direction of the y-axis satisfies

I = cε_0 (1/τ) ∫_0^τ E_y² dt,

where c is the speed of light and ε_0 is the vacuum permittivity. The monochromatic wave is described by the harmonic function E_y = ψ(x, t) = A cos(ωt − kx). The number A is the maximal amplitude of the wave, ω is the angular frequency, and, for any fixed t, the so-called wavelength λ is the primitive period. The number k = 2π/λ is the wave number. We have

I = cε_0 (1/τ) ∫_0^τ E_y² dt = cε_0 (1/τ) ∫_0^τ A² cos²(ωt − kx) dt = cε_0 A² (1/τ) ∫_0^τ (1 + cos(2(ωt − kx)))/2 dt = (1/2) cε_0 A² (1/τ) [t + sin(2(ωt − kx))/(2ω)]_0^τ = (1/2) cε_0 A² (1/τ) (τ + (sin(2(ωτ − kx)) − sin(2(−kx)))/(2ω)) = (1/2) cε_0 A² (1 + (sin(2(ωτ − kx)) − sin(2(−kx)))/(2ωτ)) ≈ (1/2) cε_0 A².

The second term in the parentheses can be neglected, since it is always less than 2/(2ωτ) = T/(2πτ) < 10⁻⁶ for real detectors of light, so it is much smaller than 1. The light intensity is directly proportional to the squared amplitude.

A diffraction is such a deviation from the straight-line propagation of light which cannot be explained as the result of a refraction or reflection (or the change of the ray's direction in a medium with a continuously varying refractive index). The diffraction can be observed when a light beam propagates through a bounded space. The diffraction phenomena are strongest and easiest to see if the light goes through openings or obstacles whose size is roughly the wavelength of the light. In the case of the Fraunhofer diffraction, with which we will deal in the following example, a monochromatic plane wave goes through a very thin rectangular opening and projects on

the measure finitely additive. Hence, a disjoint union of finitely many measurable sets is again a measurable set, and its measure is given by the sum of the measures of the individual sets in the union.

Consider the measurability of any given set M ⊂ I ⊂ R^n inside a sufficiently large multidimensional interval I.
Consider the boundary ∂M, i.e. the set of all boundary points of M. For any partition Ξ of I from the definition of the Riemann integral of χ_M, each of the intervals I_{i_1…i_n} with a non-trivial intersection with ∂M contributes to the upper integral, but might not contribute to the lower integral. On the contrary, for every point in the interior M° ⊂ M, its interval I_{i_1…i_n} contributes to both in the same way as soon as the norm of the partition is small enough. This observation leads to the first part of the following claim:

Proposition. A bounded set M ⊂ R^n is Riemann-measurable if and only if its boundary is of Riemann measure zero. If M is a Riemann-measurable set and G : M ⊂ R^n → R^n is a continuously differentiable and invertible mapping, then G(M) is again Riemann-measurable.

Proof. The first claim is already verified. Since both G and G^{−1} are continuous, G maps internal points of M to internal points of G(M). To finish the proof, it must be verified that G maps the boundary ∂M, which is a set of measure zero, again to a set of measure zero.

Since every Riemann-measurable set M is bounded, its closure M̄ must be compact. It follows that G and all partial derivatives of its components are uniformly continuous on M̄, and in particular on the boundary ∂M. Next, consider a partition Ξ of an interval I containing ∂M and a fixed tiny interval J in the partition including a point t ∈ ∂M. Write

R = G(t) + D¹G(t)(J − t),

which is to be understood as follows: J is first shifted to the origin by a translation, then the derivative of G is applied, obtaining a parallelepiped. This is shifted back to be around G(t). By the uniform continuity of G and D¹G, for each ε > 0 there is a bound δ for the norm of a partition for which

G(J) ⊂ G(t) + (1 + ε)D¹G(t)(J − t)

can be guaranteed. The entire image of J lies inside a slightly enlarged linear image of J by the derivative. Now, the outer measure α of the image G(J) satisfies

α ≤ (1 + ε)^n vol_n R = (1 + ε)^n |det D¹G(t)| vol_n J.

If µ is the upper Riemann sum for the measure of ∂M corresponding to the chosen partition, the outer measure of G(∂M) must be bounded by (1 + ε)^n max_{t∈∂M} |det D¹G(t)| µ. Finally, we exploit the assumption that ∂M has measure zero, and thus, for the same ε, we may further decrease the bound on the norm of the partition so that µ ≤ ε, too. But then the outer measure is bounded by a constant multiple of (1 + ε)^n ε, with the universal constant max_{t∈∂M} |det D¹G(t)|. So the outer measure is zero, as required. □

a distant surface. For instance, we can highlight a spot on the wall with a laser pointer. The image we get is the Fourier transform of the function describing the permeability of the shade (the opening).

Let us choose the plane of the diffraction shade as the coordinate plane z = 0. Let a plane wave A exp(ikz) (independent of the point (x, y) of landing on the shade) hit this plane perpendicularly.
Let s(x, y) denote the function of the permeability of the shade; then the resulting wave falling onto the projection surface at a point (ξ, η) can be described as the integral sum of the waves (the Huygens–Fresnel principle) which have gone through the shade and propagate through the medium from all the points (x, y, 0) (as a spherical wave) into the point (ξ, η, z):

ψ(ξ, η) = A ∫∫_{R²} s(x, y) e^{−ik(ξx+ηy)} dx dy
= A ∫_{−p/2}^{p/2} ∫_{−q/2}^{q/2} e^{−ik(ξx+ηy)} dy dx
= A ∫_{−p/2}^{p/2} e^{−ikξx} dx ∫_{−q/2}^{q/2} e^{−ikηy} dy
= A [e^{−ikξx}/(−ikξ)]_{−p/2}^{p/2} [e^{−ikηy}/(−ikη)]_{−q/2}^{q/2}
= A (2 sin(kξp/2)/(kξ)) (2 sin(kηq/2)/(kη))
= Apq (sin(kξp/2)/(kξp/2)) (sin(kηq/2)/(kηq/2)).

(Illustrations: the graph of the function f(x) = sin x / x, and the graph of the function ψ(ξ, η) = (sin ξ / ξ)(sin η / η).)

8.2.11. Proof of Theorem 8.2.8. A continuous function f and a differentiable change of coordinates are under consideration. So the inverse G^{−1} is continuously differentiable, and the image G^{−1}(M) = N is Riemann-measurable. Hence, the integrals on both sides of the equality exist, and it remains to prove that their values are equal.

Denote the composite continuous function by g(t_1, …, t_n) = f(G(t_1, …, t_n)), and choose a sufficiently large n-dimensional interval I containing N and its partition Ξ. The entire proof is nothing more than a more exact writing of the discussion presented before the formulation of the theorem. Repeat the estimates on the volumes of images from the previous paragraph on Riemann measurability. It is already known that the images G(I_{i_1…i_n}) of the intervals from the partition are again Riemann-measurable sets. For each small part I_{i_1…i_n} of the partition Ξ, the integral of f over J_{i_1…i_n} = G(I_{i_1…i_n}) certainly exists, too.

Further, if the center t_{i_1…i_n} of the interval I_{i_1…i_n} is fixed, then the linear image of this interval,

R_{i_1…i_n} = G(t_{i_1…i_n}) + D¹G(t_{i_1…i_n})(I_{i_1…i_n} − t_{i_1…i_n}),

is obtained. This is an n-dimensional parallelepiped (note that the interval is shifted to the origin, then mapped by the linear mapping given by the Jacobi matrix, and the result is then added to the image of the center). If the partition is very fine, this parallelepiped differs only a little from the image J_{i_1…i_n}. By the uniform continuity of the mapping G, there is, for an arbitrarily small ε > 0, a norm of the partition such that for all finer partitions

G(t_{i_1…i_n}) + (1 + ε)D¹G(t_{i_1…i_n})(I_{i_1…i_n} − t_{i_1…i_n}) ⊃ J_{i_1…i_n}.

However, then the n-dimensional volumes also satisfy

vol_n(J_{i_1…i_n}) ≤ (1 + ε)^n vol_n(R_{i_1…i_n}) = (1 + ε)^n |det D¹G(t_{i_1…i_n})| vol_n(I_{i_1…i_n}).

Now, it is possible to estimate the entire integral:

∫_M f(x_1, …, x_n) dx_1⋯dx_n = Σ_{i_1…i_n} ∫_{J_{i_1…i_n}} f(x_1, …, x_n) dx_1⋯dx_n ≤ Σ_{i_1…i_n} (sup_{t∈I_{i_1…i_n}} g(t)) vol_n(J_{i_1…i_n}) ≤ (1 + ε)^n Σ_{i_1…i_n} (sup_{t∈I_{i_1…i_n}} g(t)) |det D¹G(t_{i_1…i_n})| vol_n(I_{i_1…i_n}).

If ε approaches zero, then the norms of the partitions approach zero too, the left-hand value of the integral remains the same, while on the right-hand side, the Riemann integral

(Illustration: the diffraction pattern we are describing.)

Since lim_{x→0} sin x / x = 1, the intensity at the middle of the image is directly proportional to I_0 = A²p²q².
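The derived pattern is easy to explore numerically. The following numpy sketch is an illustrative aside of ours; the wavelength and the sides p, q of the opening are hypothetical values chosen only for the sake of the example:

```python
# Numerical sketch of the Fraunhofer relative intensity I/I_0 derived above.
import numpy as np

lam = 633e-9            # wavelength of a red laser pointer (assumed value)
k = 2 * np.pi / lam     # wave number k = 2*pi/lambda
p, q = 1e-4, 2e-4       # sides of the rectangular opening (assumed values)

def relative_intensity(xi, eta):
    # I/I_0 = sinc^2(k*xi*p/2) * sinc^2(k*eta*q/2) with sinc(u) = sin(u)/u;
    # numpy's sinc is sin(pi*u)/(pi*u), hence the division by pi.
    u = np.sinc(k * xi * p / 2 / np.pi)
    v = np.sinc(k * eta * q / 2 / np.pi)
    return (u * v) ** 2

print(relative_intensity(0.0, 0.0))                  # 1.0 at the centre of the image
print(relative_intensity(2 * np.pi / (k * p), 0.0))  # 0.0 at the first dark stripe
```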
The Fourier transform can easily be observed if we aim a laser pointer through a subtle opening between the thumb and the index finger; the image will be that of the function of its permeability. The image from the last picture can be seen if we create a good rectangular opening by, for instance, gluing together some stickers with sharp edges.

J. First-order differential equations

8.J.1. Find all solutions of the differential equation

y′ = (√(1 − y²)/cos² x)(1 + cos² x).

Solution. We are given an ordinary first-order differential equation in the form y′ = f(x, y), which is called an explicit form of the equation. Moreover, we can write it as y′ = f_1(x)·f_2(y) for continuous univariate functions f_1 and f_2 (on certain open intervals), i.e., it is a differential equation with separated variables. First, we replace y′ with dy/dx and rewrite the differential equation in the form

(1/√(1 − y²)) dy = ((1 + cos² x)/cos² x) dx.

Since ∫ (1 + cos² x)/cos² x dx = ∫ (1/cos² x + 1) dx, we can integrate using the basic formulae, thereby obtaining

(1) arcsin y = tan x + x + C, C ∈ R.

However, we must keep in mind that the division by the expression √(1 − y²) is valid only if it is non-zero, i.e., only for y ≠ ±1. Substituting the constant functions y ≡ 1, y ≡ −1 into the given differential equation, we can immediately see that they satisfy it. We have thus obtained two more solutions, which are called singular. We do not have to pay attention to the case cos x = 0, since this only removes points from the domains (but does not lose any solutions).

Now, we will comment on several parts of the computation. The expression y′ = dy/dx allows us to make many symbolic manipulations. For instance, we have

of g(t)|det D¹G(t)| is obtained. Instead of the desired equality, the inequality

∫_M f(x) dx_1⋯dx_n ≤ ∫_N f(G(t)) |det(D¹G(t))| dt_1⋯dt_n

is obtained. The same reasoning can be repeated after interchanging G and G^{−1}, the integration domains M and N, and the functions f and g. The reverse inequality is immediately obtained:

∫_N g(t) |det(D¹G(t))| dt_1⋯dt_n ≤ ∫_M f(x) |det(D¹G(G^{−1}(x)))| |det(D¹G^{−1}(x))| dx_1⋯dx_n = ∫_M f(x) dx_1⋯dx_n.

The proof is complete.

8.2.12. An example in two dimensions. The coordinate transformations are quite transparent for the integral of a continuous function f(x, y) of two variables. Consider the differentiable transformation G(s, t) = (x(s, t), y(s, t)). Denoting g(s, t) = f(x(s, t), y(s, t)),

∫_{G(N)} f(x, y) dx dy = ∫_N g(s, t) |(∂x/∂s)(∂y/∂t) − (∂x/∂t)(∂y/∂s)| ds dt

is obtained. As a truly simple example, calculate the integral of the indicator function of a disc with radius R (i.e. its area) and the integral of the function f(r, θ) = cos r defined in polar coordinates inside a circle with radius π/2 (i.e. the volume hidden under such a "cap placed above the origin", see the illustration).

First, determine the Jacobi matrix of the transformation x = r cos θ, y = r sin θ:

D¹G = ( cos θ  −r sin θ
        sin θ   r cos θ ).

Hence, the determinant of this matrix is equal to

det D¹G(r, θ) = r(sin² θ + cos² θ) = r.

dz/dy · dy/dx = dz/dx,  1/(dy/dx) = dx/dy.

The validity of these two formulae is actually guaranteed by the chain rule theorem and the theorem for differentiating an inverse function, respectively. It was just the facility of these manipulations that inspired G. W. Leibniz to introduce this notation, which has been in use up to now.
Further, we should realize why we have not written the general solution (1) in the suggestive form

(2) y = sin(tan x + x + C), C ∈ R.

Since we will not discuss the domains of differential equations (i.e., for which values of x the expressions are well-defined), we will not change them by "redundant" simplifications, either. It is apparent that the function y from (2) is defined for all x ∈ (0, π) ∖ {π/2}. However, for the values of x which are close to π/2 (having fixed C), there is no y satisfying (1). In general, the solutions of differential equations are curves which may not be expressible as graphs of elementary functions (on the whole intervals where we consider them). Therefore, we will not even try to do that. □

8.J.2. Find the general solution of the equation y′ = (2 − y) tan x.

Solution. Again, we are given a differential equation with separated variables. We have

dy/dx = (2 − y) tan x,
−dy/(y − 2) = (sin x/cos x) dx,
−ln|y − 2| = −ln|cos x| − ln|C|, C ≠ 0.

Here, the shift obtained from the integration has been expressed as ln|C|, which is very advantageous (bearing in mind what we want to do next), especially in those cases when we obtain a logarithm on both sides of the equation. Further, we have

ln|y − 2| = ln|C cos x|, C ≠ 0,
|y − 2| = |C cos x|, C ≠ 0,
y − 2 = C cos x, C ≠ 0,

where we should write ±C (after removing the absolute value). However, since we consider all non-zero values of C, it makes no difference whether we write +C or −C. We should pay attention to the fact that we have made a division by the expression y − 2. Therefore, we must examine the case y ≡ 2 separately. The derivative of a constant function is zero, so we have found another solution, y ≡ 2. However, this solution is not singular, since it is contained in the general solution as the case C = 0. Thus, the correct result is

y = 2 + C cos x, C ∈ R. □

Therefore, the calculation can be done directly for the disc S which is the image of the rectangle (r, θ) ∈ [0, R] × [0, 2π] = T. In this way, the area of the disc is obtained:

∫_S dx dy = ∫_0^{2π} ∫_0^R r dr dθ = ∫_0^R 2πr dr = πR².

The integration of the function f is very similar, using multiple integration and integration by parts:

∫_S f(x, y) dx dy = ∫_0^{2π} ∫_0^{π/2} r cos r dr dθ = π² − 2π.

In many real-life applications, a much more general approach to integration is needed, one which allows for dealing with objects over curves, surfaces, and their higher-dimensional analogues. For many simple cases, such tools can be built now with the help of parametrizations of such k-dimensional surfaces, employing the latter theorem to show the independence of the result of such a parametrization. These topics are postponed to the beginning of the next chapter, where a more general and geometric approach is discussed, see e.g. 9.1.9 and 9.1.12.

3. Differential equations

In this section, we return to (vector) functions of one variable, defined and examined in terms of their instantaneous changes.

8.3.1. Linear and non-linear difference models. The concept of the derivative was introduced in order to work with instantaneous changes of the examined quantities. In the introductory chapter, difference equations based on similar concepts in relation to sequences of scalars were discussed. As a motivating introduction to equations containing derivatives of unknown functions, recall first the difference equations. The simplest difference equations are formulated as y_{n+1} = F(y_n, n), with a function F of two variables.
For example, the model describing interest on deposits or loans (this includes the Malthusian model of populations) was considered. The increment was proportional to the value, y_{n+1} = a y_n, see 1.2.2. Growth by 5% is represented by a = 1.05. Considering continuous modelling, the same request leads to an equation connecting the derivative y′(t) of a function with its value,

(1) y′(t) = r y(t),

with the proportionality constant r. Here, the instantaneous growth by 5% corresponds to r = 0.05. It is easy to guess the solution of the latter equation, i.e. a function y(t) which satisfies the equality identically,

y(t) = C e^{rt},

with an arbitrary constant C. This constant can be determined uniquely by choosing the initial value y_0 = y(t_0) at some point t_0. If a part of the increment in a model should be given as a constant independent of the value y or t (like bank charges

8.J.3. Find the solution of the differential equation (1 + eˣ) yy′ = eˣ which satisfies the initial condition y(0) = 1.

Solution. If the functions f : (a, b) → R and g : (c, d) → R are continuous and g(y) ≠ 0, y ∈ (c, d), then the initial problem y′ = f(x)g(y), y(x_0) = y_0 has a unique solution for any x_0 ∈ (a, b), y_0 ∈ (c, d). This solution is determined implicitly as

∫_{y_0}^{y(x)} dt/g(t) = ∫_{x_0}^x f(t) dt.

In practical problems, we first find all solutions of the equation and then select the one which satisfies the initial condition. Let us compute:

(1 + eˣ) y dy/dx = eˣ,
y dy = (eˣ/(1 + eˣ)) dx,
y²/2 = ln(1 + eˣ) + ln|C|, C ≠ 0,
y²/2 = ln(C [1 + eˣ]), C > 0.

The substitution y = 1, x = 0 then gives 1/2 = ln(C · 2), i.e. C = √e/2. We have thus found the solution

y²/2 = ln((√e/2)[1 + eˣ]), i.e., y = √(2 ln((√e/2)[1 + eˣ]))

on a neighborhood of the point [0, 1], where y > 0. □

8.J.4. Find the solution of the differential equation y′ = (y² + 1)/(x + 1) which satisfies y(0) = 1.

Solution. Similarly to the previous example, we get

dy/(y² + 1) = dx/(x + 1),
arctan y = ln|x + 1| + C, C ∈ R.

The initial condition (i.e., the substitution x = 0 and y = 1) gives arctan 1 = ln|1| + C, i.e., C = π/4. Therefore, the solution of the given initial problem is the function

y(x) = tan(ln|x + 1| + π/4)

on a neighborhood of the point [0, 1]. □

or the natural decrease of a stock population as a result of sending some part of it to slaughterhouses), an equation with a constant s on the right-hand side can be used:

(2) y′(t) = r y(t) + s.

The solution of this equation is the function

y(t) = C e^{rt} − s/r.

It is a straightforward matter to produce this solution when it is realized that the set of all solutions of the equation (1) is a one-dimensional vector space, while the solutions of the equation (2) are obtained by adding any one of its solutions to the solutions of the previous equation. The constant solution y(t) = k for k = −s/r is easily found.

Similarly, in paragraph 1.4.1, the logistic model of population growth was created, based on the assumption that the ratio of the change p(n + 1) − p(n) of the population size and its size p(n) is affine with respect to the population size itself. The model behaves similarly to the Malthusian one for small values of the population size, and it ceases growing when reaching a limit value K. Now, the same relation for the continuous model can be formulated for a population p(t) dependent on time t by the equality

(3) p′(t) = p(t) (−(r/K) p(t) + r).
At the value p(t) = K for a (large) constant K, the instantaneous increment of the function p is zero, while for p(t) > 0 near zero, the ratio of the rate of increment of the population and its size is close to r, which is the (small) number expressing the rate of increment of the population in good conditions (e.g. 0.05 would again mean an immediate growth by 5%).

It is not easy to solve such an equation without knowing any theory (although this type of equation will be dealt with in a moment). However, as an exercise on differentiation, it is easily verified that the following function is a solution for every constant C:

p(t) = K/(1 + CK e^{−rt}).

For the continuous and discrete versions of the logistic models, the values K = 100, r = 0.05, and C = 1 are chosen in the left-hand illustration. The discrete model from 1.4.1 yields the result in the right-hand illustration (i.e. with a = 1.05 and p_1 = 1, as expected). The choice C = 1 yields p(0) = K/(1 + K), which is very close to 1 if K is large enough.

8.J.5. Solve

(1) y′ = (x + y + 1)/(2x + 2y − 1).

Solution. Let a function f : (a, b) × (c, d) → R have continuous second-order partial derivatives and f(x, y) ≠ 0, x ∈ (a, b), y ∈ (c, d). Then, the differential equation y′ = f(x, y) can be transformed to an equation with separated variables if and only if

f(x, y) f″_{xy}(x, y) − f′_x(x, y) f′_y(x, y) = 0, x ∈ (a, b), y ∈ (c, d).

With a bit of effort, it can be shown that a differential equation of the form y′ = f(ax + by + c) can be transformed to an equation with separated variables, and this can be done by the substitution z = ax + by + c. Let us emphasize that the variable z replaces y. We thus set z = x + y, which gives z′ = 1 + y′. Substitution into (1) yields

z′ − 1 = (z + 1)/(2z − 1),
dz/dx = (z + 1)/(2z − 1) + 1,
dz/dx = 3z/(2z − 1),
(2/3 − 1/(3z)) dz = dx,
(2/3) z − (1/3) ln|z| = x + C, C ∈ R, or
(2/3) z − (1/3) ln|Cz| = x, C ≠ 0.

Now, we must get back to the original variable y in one of these forms. The general solution can be written as

(2/3) x + (2/3) y − (1/3) ln|x + y| = x + C, C ∈ R, i.e.,
x − 2y + ln|x + y| = C, C ∈ R.

At the same time, we have the singular solution y = −x, which follows from the constraint z ≠ 0 of the operations we have made (we have divided by the value 3z). □

8.J.6. Solve the differential equation x y′ + y ln x = y ln y.

Solution. Using the substitution u = y/x, every homogeneous differential equation y′ = f(y/x) can be transformed to an equation (with separated variables)

u′ = (1/x)(f(u) − u), i.e. u′x + u = f(u).

The name of this differential equation comes from the following definition. A function f of two variables is called homogeneous of degree k iff f(tx, ty) = t^k f(x, y). Then, a differential equation of the form

P(x, y) dx + Q(x, y) dy = 0

In particular, both versions of this logistic model yield quite similar results. For example, the left-hand illustration also contains the dashed line of the graph of the solution of the equation (1) with the same constant r and initial condition (i.e. the Malthusian model of growth).

8.3.2. First-order differential equations. By an (ordinary) first-order differential equation is usually meant a relation between the derivative y′(t) of a function with respect to the variable t, its value y(t), and the variable itself, which can be written in terms of some real-valued function F : R³ → R as the equality

F(y′(t), y(t), t) = 0.
This equation resembles the implicitly defined functions y(t); however, this time, there is a dependency on the derivative of the function y(t). We also often suppress the dependency of y = y(t) on the other variable t and write F(y′, y, t) = 0 instead.

If the implicit equation is solved at least explicitly with regard to the derivative, i.e.

y′ = f(t, y)

for some function f : R² → R, it is clear graphically what this equation defines. For every value (t, y) in the plane, the arrow corresponding to the vector (1, f(t, y)) can be considered. That is the velocity with which the point of the graph of the solution moves through the plane, depending on the free parameter t. For instance, the equation (3) from the previous subsection determines the following picture (illustrating the solution for the initial condition as above).

Such illustrations should invoke the idea that differential equations define a "flow" in the plane, and each choice of the initial value (t_0, y(t_0)) should correspond to a unique flowline expressing the movement of the initial point in the time t. It can be anticipated intuitively that, for reasonably behaved functions f(t, y) in the equations y′ = f(t, y), there is a unique solution for all initial conditions.

is a homogeneous differential equation iff the functions P and Q are homogeneous of the same degree k. For instance, we can discover that the given equation

x dy + (y ln x − y ln y) dx = 0

is homogeneous. Of course, it is not difficult to write it explicitly in the form y′ = (y/x) ln(y/x). The substitution u = y/x then leads to

u′x + u = u ln u,
(du/dx) x = u (ln u − 1),
du/(u(ln u − 1)) = dx/x,

where u(ln u − 1) ≠ 0. Using another substitution, namely t = ln u − 1, we can integrate:

∫ du/(u(ln u − 1)) = ∫ dx/x,
∫ dt/t = ∫ dx/x,
ln|t| = ln|x| + ln|C|, C ≠ 0,
ln|ln u − 1| = ln|Cx|, C ≠ 0,
ln u − 1 = Cx, C ≠ 0,
ln(y/x) = Cx + 1, C ≠ 0,
y = x e^{Cx+1}, C ≠ 0.

The excluded cases u = 0 and ln u = 1 do not lead to two more solutions, since u = 0 implies y = 0, which cannot be put into the original equation. On the other hand, ln u = 1 gives y/x = e, and the function y = e·x is clearly a solution. Therefore, the general solution is

y = x e^{Cx+1}, C ∈ R. □

8.J.7. Compute y′ = −(4x + 3y + 1)/(3x + 2y + 1).

Solution. In general, we are able to solve every equation of the form

(1) y′ = f((ax + by + c)/(Ax + By + C)).

If the system of linear equations

(2) ax + by + c = 0, Ax + By + C = 0

has a unique solution x_0, y_0, then the substitution u = x − x_0, v = y − y_0 transforms the equation (1) to a homogeneous equation dv/du = f((au + bv)/(Au + Bv)).

8.3.3. Integration of differential equations. Before examining the conditions for the existence and uniqueness of the solutions, we present a truly elementary method for finding the solutions. The idea, mentioned briefly already in 6.2.21 on page 552, is to transform the problem to ordinary integration, which usually leads to an implicit description of the solution.

Equations with separated variables

Consider a differential equation in the form

(1) y′ = f(t) · g(y)

for two continuous functions of a real variable, f and g. The solution of this equation can be obtained by integration, finding the antiderivatives

G(y) = ∫ dy/g(y), F(t) = ∫ f(t) dt.

This procedure reliably finds solutions y(t) which satisfy g(y(t)) ≠ 0, given implicitly by the formula

(2) F(t) + C = G(y)

with an arbitrary constant C.
Differentiating the latter equation (2), using the chain rule for the composite function G(y(t)), leads to

(1/g(y)) y′(t) = f(t),

as required.

As an example, find the solution of the equation

y′ = t y.

Direct calculation gives ln|y(t)| = (1/2)t² + C with an arbitrary constant C. Hence it looks (at least for positive values of y) as

y(t) = e^{(1/2)t² + C} = D e^{(1/2)t²},

where D is an arbitrary positive constant. It is helpful to examine the resulting formula and the signs thoroughly. The constant solution y(t) = 0 also satisfies the equation. For negative values of y, the same solution can be used with negative constants D. In fact, the constant D can be arbitrary, and a solution is found satisfying any initial value.

The illustration shows two solutions which demonstrate the instability of the equation with regard to the initial values: for every t_0, if we change a small y_0 from a negative value to a positive one, then the behaviour of the resulting solution changes dramatically. Notice the constant solution y(t) = 0, which satisfies the initial condition y(t_0) = 0.

If the system (2) has no solution or has infinitely many solutions, the substitution z = ax + by transforms the equation (1) to an equation with separated variables (often, the original equation is already such).

In this problem, the corresponding system of equations

4x + 3y + 1 = 0, 3x + 2y + 1 = 0

has a unique solution x_0 = −1, y_0 = 1. The substitution u = x + 1, v = y − 1 then leads to the homogeneous equation dv/du = −(4u + 3v)/(3u + 2v), which can be solved by the further substitution z = v/u. We thus obtain

z′u + z = −(4 + 3z)/(3 + 2z),
(dz/du) u = −(2z² + 6z + 4)/(3 + 2z),
((2z + 3)/(2z² + 6z + 4)) dz = −du/u,

provided z² + 3z + 2 ≠ 0. Integrating, we get

(1/2) ln|z² + 3z + 2| = −ln|u| + ln|C|, C ≠ 0,
(1/2) ln(|z² + 3z + 2| u²) = ln|C|, C ≠ 0,
ln(|z² + 3z + 2| u²) = ln C², C ≠ 0,
(z² + 3z + 2) u² = ±C², C ≠ 0.

We thus have

(z² + 3z + 2) u² = D, D ≠ 0,

and, returning to the original variables,

(v²/u² + 3(v/u) + 2) u² = D, D ≠ 0,
v² + 3vu + 2u² = D, D ≠ 0,
(y − 1)² + 3(y − 1)(x + 1) + 2(x + 1)² = D, D ≠ 0.

Making simple rearrangements, the general solution can be expressed as

(x + y)(2x + y + 1) = D, D ≠ 0.

Now, let us return to the condition z² + 3z + 2 ≠ 0. It follows from z² + 3z + 2 = 0 that z = −1 or z = −2, i.e., v = −u or v = −2u. For v = −u, we have x = u − 1 and y = v + 1 = −u + 1, which means that y = −x. Similarly, for v = −2u, we have y = −2u + 1, hence y = −2x − 1. However, both functions y = −x, y = −2x − 1 satisfy the original differential equation and are included in the general solution for the choice D = 0. Therefore, every solution is known from the implicit form

(x + y)(2x + y + 1) = D, D ∈ R. □

Using separation of variables, the non-linear equation from the previous paragraph, which describes the logistic population model, is easily solved. Try this as an exercise.

8.3.4. First-order linear equations. In the first chapter, we paid much attention to linear difference equations. Their general solution was determined in paragraph 1.2.2 on page 11. Although it is clear beforehand that it is a one-dimensional affine space of sequences, it is a hardly transparent sum, because all the changing coefficients need to be taken into account. Consequently, this can be used as a source of inspiration for the following construction of the solution of a general first-order linear equation

(1) y′ = a(t)y + b(t)

with continuous coefficients a(t) and b(t).
First, find the solution of the homogeneous equation y′ = a(t)y. This can be computed easily by separation of variables: the solution with the initial value y(t_0) = y_0 is

y(t) = y_0 F(t, t_0), where F(t, s) = e^{∫_s^t a(x) dx}.

In the case of difference equations, the solution of the general non-homogeneous equation was "guessed", and then it was proved by induction that it was correct. It is even simpler now, as it suffices to differentiate the correct solution to verify the statement, once we are told what the right result is:

The solution of first-order linear equations

The solution of the equation (1) with the initial value y(t_0) = y_0 is (locally in a neighbourhood of t_0) given by the formula

y(t) = y_0 F(t, t_0) + ∫_{t_0}^t F(t, s) b(s) ds,

where F(t, s) = e^{∫_s^t a(x) dx}.

Verify the correctness of the solution by yourselves (pay proper attention to the differentiation of the integral, where t appears both in the upper bound and as a free parameter in the integrand, cf. 6.3.13). In fact, there is a general method called variation of constants which directly yields this solution, see e.g. the problem 8.J.9. It consists in taking the solution of the homogeneous equation in the form y(t) = c F(t, t_0) and considering instead an ansatz for a solution to the non-homogeneous equation in the form y(t) = c(t) F(t, t_0) with an unknown function c(t). Differentiating yields the equation

c′(t) = e^{−∫_{t_0}^t a(x) dx} b(t),

and integrating this leads to

c(t) = ∫_{t_0}^t e^{∫_s^{t_0} a(x) dx} b(s) ds, i.e. y(t) = c(t) e^{∫_{t_0}^t a(x) dx},

as in the above formula. Check the details! Notice also the similarity to the solution for the equations with constant coefficients, explicitly computed in the form of a convolution in 7.B.13 on page 663, which could serve as an inspiration, too.

As an example, the equation y′ = 1 − xy

8.J.8. Find the general solution of the differential equation

(x² + y²) dx − 2xy dy = 0.

Solution. For y ≠ 0, simple rearrangements lead to

y′ = (x² + y²)/(2xy) = (1 + (y/x)²)/(2(y/x)).

Using the substitution u = y/x, we get to the equation u′x + u = (1 + u²)/(2u). For u ≠ ±1 and D = −1/C, we have

(du/dx) x = (1 + u² − 2u²)/(2u),
(2u/(1 − u²)) du = dx/x,
−ln|1 − u²| = ln|x| + ln|C|, C ≠ 0,
ln(1/|1 − u²|) = ln|Cx|, C ≠ 0,
1/(1 − u²) = Cx, C ≠ 0,
1 = Cx(1 − y²/x²), C ≠ 0,
−D/x = 1 − y²/x², D ≠ 0,
−Dx = x² − y², D ≠ 0.

The condition u = ±1 corresponds to y = ±x. While y ≡ 0 is not a solution, both the functions y = x and y = −x are solutions, and they can be obtained by the choice D = 0. The general solution is thus

y² = x² + Dx, D ∈ R. □

8.J.9. Solve y′ = x − 2y/(x² − 1).

Solution. The given equation is of the form y′ = a(x)y + b(x), i.e., it is a non-homogeneous linear differential equation (the function b is not identically equal to zero). The general solution of such an equation can be obtained using the method of integration factor (the non-homogeneous equation is multiplied by the expression e^{−∫a(x) dx}) or the method of variation of constants (the integration constant that arises in the solution of the corresponding homogeneous equation is considered to be a function of the variable x). We will illustrate both of these methods on this problem.

As for the former method, we multiply the original equation by the expression

e^{∫ 2/(x²−1) dx} = e^{ln|(x−1)/(x+1)|} = (x − 1)/(x + 1),

where the corresponding integral is understood to stand for any antiderivative, and where any non-zero multiple of the obtained function can be considered (that is why we could remove the absolute value). Thus, consider the equation

y′ (x − 1)/(x + 1) + 2y/(x + 1)² = x(x − 1)/(x + 1).
The core of the method of integration factor is the fact that the expression on the left-hand side is the derivative of y(x − 1)/(x + 1). Integrating this leads to

can be solved directly (although the so-called error function appears when integrating the particular solution, and this cannot be expressed via elementary functions), this time encountering stable behaviour, visible in the following illustration.

8.3.5. Transformation of coordinates. The illustrations suggest that differential equations can be perceived as geometric objects (the "directional field of the arrows"), so the solution can be found by conveniently chosen coordinates. We return to this point of view later. Here are three simple examples of typical tricks, as seen from the explicit form of the equations in coordinates.

We begin with homogeneous equations of the form

y′ = f(y/t).

Considering the transformation z = z(t) = y/t and assuming that t ≠ 0, then by the chain rule,

z′ = (1/t²)(t y′ − y) = (1/t)(f(z) − z),

which is an equation with separated variables.

Other examples are the Bernoulli differential equations, which are of the form

y′ = f(t)y + g(t)yⁿ,

where n ≠ 0, 1. The choice of the transformation z = y^{1−n} leads to the equation

z′ = (1 − n) y^{−n} (f(t)y + g(t)yⁿ) = (1 − n)f(t)z + (1 − n)g(t),

which is a linear equation, easily integrated.

We conclude with the extraordinarily important Riccati equation. It is a form of the Bernoulli equation with n = 2, extended by an absolute term,

y′ = f(t)y + g(t)y² + h(t).

This equation can also be transformed to a linear equation, provided that a particular solution x = x(t) can be guessed. Then, use the transformation

z = 1/(y − x).

Verify by yourselves that this transformation leads to the equation

z′ = −(f(t) + 2x(t)g(t)) z − g(t).

y(x − 1)/(x + 1) = ∫ x(x − 1)/(x + 1) dx = x²/2 − 2x + 2 ln|x + 1| + C, C ∈ R.

Therefore, the solutions are the functions

y = ((x + 1)/(x − 1)) (x²/2 − 2x + 2 ln|x + 1| + C), C ∈ R.

As for the latter method, we first solve the corresponding homogeneous equation y′ = −2y/(x² − 1), which is an equation with separated variables. We have

dy/dx = −2y/(x² − 1),
dy/y = −(2/(x² − 1)) dx,
ln|y| = −ln|x − 1| + ln|x + 1| + ln|C|, C ≠ 0,
ln|y| = ln|C (x + 1)/(x − 1)|, C ≠ 0,
y = C (x + 1)/(x − 1), C ≠ 0,

where we had to exclude the case y = 0. However, the function y ≡ 0 is always a solution of a homogeneous linear differential equation, and it can be included in the general solution. Therefore, the general solution of the corresponding homogeneous equation is

y = C(x + 1)/(x − 1), C ∈ R.

Now, we will consider the constant C to be a function C(x). Differentiating leads to

y′ = (C′(x)(x + 1)(x − 1) + C(x)(x − 1) − C(x)(x + 1))/(x − 1)².

Substituting this into the original equation, we get

(C′(x)(x + 1)(x − 1) + C(x)(x − 1) − C(x)(x + 1))/(x − 1)² = x − 2C(x)(x + 1)/((x − 1)(x² − 1)).

It follows that C′(x) = x(x − 1)/(x + 1), i.e.,

C(x) = ∫ x(x − 1)/(x + 1) dx = x²/2 − 2x + 2 ln|x + 1| + C, C ∈ R.

Now, it suffices to substitute:

y = C(x)(x + 1)/(x − 1) = ((x + 1)/(x − 1)) (x²/2 − 2x + 2 ln|x + 1| + C), C ∈ R.

We can see that the result we have obtained here is of the same form as in the former case. This should not be surprising, as the differences between the two methods are insignificant and the computed integrals are the same.

Finally, we can notice that the solution y of an equation y′ = a(x)y can be found in the same way for any continuous function a. We thus always have

y = C e^{∫ a(x) dx}, C ∈ R.
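For instance, the linear equation from problem 8.J.9 can also be handed to a computer algebra system. The following sympy sketch (an illustrative aside of ours, assuming sympy's dsolve) should reproduce, up to rearrangement and the naming of the constant, the solution computed above:

```python
# Sketch: solving y' = x - 2*y/(x^2 - 1) from 8.J.9 with sympy's dsolve.
import sympy as sp

x = sp.symbols('x')
y = sp.Function('y')

ode = sp.Eq(y(x).diff(x), x - 2*y(x)/(x**2 - 1))
print(sp.dsolve(ode, y(x)))
# Expected, up to the form of the integration constant:
# y(x) = (x + 1)/(x - 1) * (x**2/2 - 2*x + 2*log(x + 1) + C)
```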
CALCULUS WITH MORE VARIABLES As seen in the case of integration of functions (the simplest type of equations with separated variables), the equations usually do not have a solution expressible explicitly in terms of elementary functions. As with standard engineering tables of values of special functions, books listing the solutions of basic equations are compiled as well.7 Today, the wisdom concealed in them is essentially transferred to software systems like Maple or Mathematica. Here, any task about ordinary differential equations can be assigned, with results obtained in surprisingly many cases. Yet, explicit solutions are not possible for most prob- lems. 8.3.6. Existence and uniqueness. The way out of this is numerical methods, which try only to approximate the solutions. However, to be able to use them, good theoretical starting points are still needed regarding existence, uniqueness, and stability of the solutions. We begin with the Picard–Lindelöf theorem: Existence and uniqueness of the solutions of ODEs Theorem. Consider a function f(t, y) : R2 → R with continuous partial derivatives on an open set U. Then for every point (t0, y0) ∈ U ⊂ R2 , there exists the maximal interval I = (t0 − a, t0 + b), with positive a, b ∈ R, and the unique function y(t) : I → R which is the solution of the equation y′ = f(t, y) on the interval I. Proof. If a differentiable function y(t) is a solution of an equation satisfying the initial condition y(t0) = y0, then it also satisfies the equation y(t) = y0 + ∫ t t0 y′ (s) ds = y0 + ∫ t t0 f(s, y(s)) ds, where the Riemann integrals exist due to the continuity of f and hence also y′ . However, the right-hand side of this expression is the integral operator L(y)(t) = y0 + ∫ t t0 f(s, y(s)) ds acting on functions y. Solving first-order differential equations, is equivalent to finding fixed points for this operator L, that is, to find a function y = y(t) satisfying L(y) = y. On the other hand, if a Riemann-integrable function y(t) is a fixed point of the operator L, then it immediately follows from the fundamental theorem of calculus that y(t) satisfies the given differential equation, including the initial con- ditions. 7For example, the famous book Differentialgleichungen reeller Funktionen, Akademische Verlagsgesellschaft, Leipzig 1930, by E. Kamke, a German mathematician, contains many hundreds of solved equations. They appeared in many editions in the last century. 767 Similarly, the solution of an equation y′ = a(x)y +b(x) with an initial condition y(x0) = y0 can be determined explicitly as (provided the coefficients, i. e. the functions a and b, are continuous) y = e ∫ x x0 a(t) dt ( y0 + ∫ x x0 b(t) e − ∫ t x0 a(s) ds dt ) . Let us remark that the linear equation has no singular solution, and the general solution contains a C ∈ R. □ 8.J.10. Solve the linear equation (y′ + 2xy) ex2 = cos x. Solution. If we used the method of integration factor, we would only rewrite the equation trivially since it is already of the desired form – the expression on the left-hand side is the derivative of y ex2 . Thus, we can immediately calculate ( y ex2 )′ = cos x, y ex2 = ∫ cos x dx, y ex2 = sin x + C, C ∈ R, y = e−x2 (sin x + C) , C ∈ R. □ 8.J.11. Find all non-zero solutions of the Bernoulli equa- tion y′ − y x = 3xy2 . Solution. The Bernoulli equation y′ = a(x)y + b(x)yr , r ̸= 0, r ̸= 1, r ∈ R can be solved by first dividing by the term yr and then using the substitution u = y1−r , which leads to the linear differential equation u′ = (1 − r) [a(x)u + b(x)] . 
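Both of these recipes (the general solution of the linear equation above and the Bernoulli substitution) can be cross-checked with a computer algebra system of the kind mentioned later in this chapter; a minimal sketch, assuming Python with SymPy rather than the book's Maple or Mathematica, applied to problems 8.J.9 and 8.J.11 (dsolve may parametrize the integration constant differently than the hand computation):

```python
# Cross-check of 8.J.9 (linear) and 8.J.11 (Bernoulli) with SymPy's dsolve.
import sympy as sp

x = sp.symbols('x')
y = sp.Function('y')

# 8.J.9: y' = x - 2y/(x^2-1); expected y = (x+1)/(x-1)*(x^2/2 - 2x + 2 ln|x+1| + C)
linear = sp.Eq(y(x).diff(x), x - 2*y(x)/(x**2 - 1))
sol_linear = sp.dsolve(linear)
print(sol_linear, sp.checkodesol(linear, sol_linear))      # (True, 0) if valid

# 8.J.11: y' - y/x = 3x y^2; expected y = x/(C - x^3)
bernoulli = sp.Eq(y(x).diff(x) - y(x)/x, 3*x*y(x)**2)
sol_bernoulli = sp.dsolve(bernoulli)
print(sol_bernoulli, sp.checkodesol(bernoulli, sol_bernoulli))
```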
In this very problem, the substitution u = y1−2 = 1/y gives u′ + u x = −3x. Similarly to the previous exercise, we have u = e− ln | x | [∫ −3x eln | x | dx ] , where ln | x | was obtained as an (arbitrary) antiderivative to 1/x. Furhter, u = eln 1 | x | [∫ −3x eln | x | dx ] , u = 1 | x | [∫ −3x | x | dx ] . The absolute value can be replaced with a sign that can be canceled, i. e., it suffices to consider u = 1 x [∫ −3x2 dx ] = 1 x [ −x3 + C ] , C ∈ R. Returning to the original variable, we get y = 1 u = x C−x3 , C ∈ R. The excluded case y ≡ 0 is a singular solution (which, of course, is true for every Bernoulli equation with r positive). □ CHAPTER 8. CALCULUS WITH MORE VARIABLES It is easy to estimate how much the values L(y) and L(z) differ for various functions y(t) and z(t). Since both partial derivatives of f are continuous, f is itself locally Lipschitz. This means that restricting the values (t, y) to a neighbourhood U of the point (t0, y0) with compact closure, there is the estimate |f(t, y) − f(t, z)| ≤ C|y − z|, with some constant C depending only on U. This immediately leads to the following bound (for the sake of simplicity, t ≥ t0, but the final conclusion works for t < t0 the same way) | ( L(y) − L(z) ) (t)| = ∫ t t0 f(s, y(s)) − f(s, z(s)) ds ≤ ∫ t t0 |f(s, y(s)) − f(s, z(s))| ds ≤ C ∫ t t0 |y(s) − z(s)| ds ≤ C ( max t0≤s≤t |y(s) − z(s)| ) |t − t0| = D( max t0≤s≤t |y(s) − z(s)| ) , where the constant D comes from substituting the maximum of |t − t0| on U. If the operator L is viewed as an operator on a metric space of continuous functions on a compact interval with the max norm, this yields ∥L(y) − L(z)∥ ≤ D ∥y − z∥. Some further restrictions on the choice of U and the considered functions y and z are required, in order to make the constant D smaller than one. Then the Banach fixed point theorem, based on the notion of a contraction, can be applied. See 7.3.9 on the page 676. At the same time, the operator must leave the chosen space of functions y invariant, i.e. the images L(y) are also there. To begin, choose ε > 0 and δ > 0, both small enough so that [t0 − δ, t0 + δ] × [y0 − ε, y0 + ε] = V ⊂ U, and consider only those functions y(t) which satisfy for J = [t0 −δ, t0 +δ] the estimate maxt∈J |y(t)−y0| < ε. The uniform continuity of f(t, y) on V ensures that fixing ε and further shrinking δ, implies max t∈J |L(y)(t) − y0| < ε. Finally, the above estimate for ∥L(y) − L(z)∥ shows that if δ is decreased sufficiently further, then the latter constant D becomes smaller than one, as required for a contraction. At the same time, L maps the above space of functions into itself. However, for the assumptions of the Banach contraction theorem, which guarantees the uniquely determined fixed point, completeness of the space X of functions on which the operator L works is needed. Since the mapping f(t, y) is continuous, there follows a uniform bound for all of the functions y(t) considered above 768 8.J.12. Interchanging the variables, solve the equation y dx − ( x + y2 sin y ) dy = 0. Solution. When the variable x occurs only in the first power in the differential equation and y occurs in the arguments of elementary functions, we can apply the so-called method of variable interchange, when we look for the solution as for a function x of the independent variable y. First, we write the equation explicitly: y′ = y x+y2 sin y . 
This equation is not of any of the previous types, so we rewrite it as follows: dy dx = y x + y2 sin y , dx dy = ( y x + y2 sin y )−1 = x y + y sin y, x′ = 1 y x + y sin y. We have thus obtained a linear differential equation. Now, we can easily compute its general solution x = −y cos y + Cy, C ∈ R. □ Further problems concerning first-order differential equations can be found on page 795. K. Practical problems leading to differential equations 8.K.1. A water purification plant with volume 2000 m3 was contaminated with lead which is spread in the water with density 10 g/m3 . Water is flowing in and out of the basin at 2 m3 /s. In what time does the amount of lead in the basin decrease below 10 µg/m3 (which is the hygienic norm for the amount of lead in drinkable water by a regulation of the European Community) provided the water keeps being mixed uniformly? Solution. Let us denote the water’s volume in the basin by V (m3 ), the speed of the water’s flow by v (m3 /s). In an infinitesimal (infinitely small) time unit dt, m V · v dt grams of lead runs out of the basin, so we can construct the differential equation dm = − m V · v dt for the change of the lead’s mass in the basin. Separating the variables, we get the equation dm m = − v V dt. Integration both sides of the equation and getting rid of the logarithms, we get the solution in the form m(t) = m0e− v V t , where m0 is the lead’s mass at time t = 0. Substituting the concrete values, we find out that t . = 6 h 35 min. □ CHAPTER 8. CALCULUS WITH MORE VARIABLES and the values t > s in their domain: |L(y)(t) − L(y)(s)| ≤ ∫ t s |f(s, y(s)| ds ≤ A |t − s| with a universal constant A > 0. Besides the conditions mentioned above, there is a restriction to the subset of all equicontinuous functions in the sense of the Definition 7.3.15. According to the Arzelà-Ascoli Theorem proved in the same paragraph at the page 683, this set of continuous functions is already compact, hence it is a complete set of continuous functions on the interval. Therefore, there exists a unique fixed point y(t) of this contraction L by the Theorem 7.3.9. This is the solution of the equation. It remains to show the existence of a maximal interval I = (t0 − a, t0 + b). Suppose that a solution y(t) is found on an interval (t0, t1), and, at the same time, the one-sided limit y1 = limt→t1− y(t) exists and is finite. It follows from the already proven result that there exists a solution with this initial condition (t1, y1), in some neighbourhood of the point t1. Clearly, it must coincide with the discussed solution y(t) on the left-hand side of t1. Therefore, the solution y(t) can be extended on the right-hand side of t1. There are only two possibilities when the extension of the solution behind t1 does not exist: either there is no finite left limit y(t) at t1, or the limit y1 exists, yet the point (t1, y1) is on the boundary of the domain of the function f. In both cases, the maximal extension of the solution to the right of t0 is found. The argumentation for the maximal solution left of t0 is analogous. □ 8.3.7. Iterative approximations of solutions. The proof of the previous theorem can be reformulated as an iterative procedure which provides approximate solutions using step-by-step integration. Moreover, an explicit estimate for the constant C from the proof yields bounds for the errors. Think this out as an exercise (see the proof of Banach fixed-point theorem in paragraph 7.3.9). 
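As a complement to this exercise, the iteration scheme itself is easy to run symbolically; a minimal sketch, assuming Python with SymPy, for the equation $y' = y$, $y(0) = 1$, whose Picard iterates are exactly the partial sums of the exponential series:

```python
# Picard's iterations y_{n+1}(t) = y0 + int_{t0}^{t} f(s, y_n(s)) ds in SymPy.
import sympy as sp

t, s = sp.symbols('t s')

def picard(f, t0, y0, steps):
    y = sp.sympify(y0)                     # y_0(t) is the constant function
    for _ in range(steps):
        y = y0 + sp.integrate(f(s, y.subs(t, s)), (s, t0, t))
    return sp.expand(y)

# y' = y, y(0) = 1: the iterates are the partial sums of the series for e^t.
print(picard(lambda s, y: y, 0, 1, 4))     # 1 + t + t**2/2 + t**3/6 + t**4/24
```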
It can then be shown easily and directly that this is a uniformly convergent sequence of continuous functions, so the limit is again a continuous function (without invoking the complicated theorems from the seventh chapter).

8.K.2. The speed of transmission of a message in a population consisting of $P$ people is directly proportional to the number of people who have not heard the message yet. Determine the function $f$ which describes the dependency of the number of people who have heard the message on time. Is it appropriate to use this model of message transmission for small or large values of $P$?

Solution. We construct a differential equation for $f$. The speed of the transmission $\frac{df}{dt} = f'(t)$ should be directly proportional to the number of people who have not heard of it, i.e. the value $P - f(t)$. Altogether, $$\frac{df}{dt} = k(P - f(t)).$$ Separating the variables and introducing a constant $K$ (the number of people who know the message at time $t = 0$ must be $P - K$), we get the solution $$f(t) = P - K e^{-kt},$$ where $k$ is a positive real constant. Apparently, this model makes sense for large values of $P$ only. □

8.K.3. The speed at which an epidemic spreads in a given closed population consisting of $P$ people is directly proportional to the product of the number of people who have been infected and the number of people who have not. Determine the function $f(t)$ describing the number of infected people in time.

Solution. Just like in the previous problem, we construct a differential equation: $$\frac{df}{dt} = k \, f(t)\,(P - f(t)).$$ Again, separating the variables and introducing suitable constants $K$ and $L$, we obtain $$f(t) = \frac{K}{1 + L e^{-Kkt}}.$$ □

8.K.4. The speed at which a given isotope of a given chemical element decays is directly proportional to the amount of the given isotope. The half-life of the isotope of plutonium $^{239}_{94}\mathrm{Pu}$ is 24,100 years. In what time does a hundredth of a nuclear bomb whose active component is the mentioned isotope disappear?

Solution. Denoting the amount of plutonium by $m$, we can build a differential equation for the rate of the decay: $$\frac{dm}{dt} = -k \cdot m,$$ where $k$ is an unknown positive constant. The solution is thus the function $m(t) = m_0 e^{-kt}$. Substituting into the equation for the half-life ($e^{-kt} = \frac{1}{2}$ with $t = 24{,}100$), we get the constant $k \doteq 2.88 \cdot 10^{-5}$ per year. The wanted time is then approximately 349 years. □

Picard's approximations

Theorem. The unique solution of the equation $y' = f(t, y)$ whose right-hand side $f$ has continuous partial derivatives can be expressed, on a sufficiently small interval, as the limit of step-by-step iterations beginning with the constant function (Picard's approximation): $$y_0(t) = y_0, \qquad y_{n+1}(t) = L(y_n), \quad n = 0, 1, \dots.$$ It is a uniformly converging sequence of differentiable functions with differentiable limit $y(t)$.

Only the Lipschitz condition is needed for the function $f$, so the latter two theorems are true with this weaker assumption as well. It is seen in the next paragraph that continuity of the function $f$ guarantees the existence of the solution. Yet it is insufficient for the uniqueness.

8.3.8. Ambiguity of solutions. We begin with a simple example. Consider the equation $y' = \sqrt{|y|}$. Separating the variables, the solution is $y(t) = \frac{1}{4}(t + C)^2$ for positive values $y$, with an arbitrary constant $C$ and $t + C > 0$. For the initial values $(t_0, y_0)$ with $y_0 \neq 0$, this is an assignment matching the previous theorem, so there is locally exactly one solution.
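The parabolic family just found is easy to verify symbolically; a minimal sketch, again assuming SymPy:

```python
# Check that y(t) = (t + C)^2/4 solves y' = sqrt(|y|) wherever t + C > 0.
import sympy as sp

t, C = sp.symbols('t C', real=True)
y = (t + C)**2 / 4
residual = sp.simplify(y.diff(t) - sp.sqrt(sp.Abs(y)))
print(residual)   # (t + C)/2 - Abs(t + C)/2: vanishes exactly for t + C >= 0
```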
The solution must apparently remain non-decreasing, hence for negative values $y_0$, the solution is the same, only with the opposite sign and $t + C < 0$. However, for the initial condition $(t_0, y_0) = (t_0, 0)$, there is not only the already discussed solution continuing to the left of $t_0$ and to the right, but also the identically zero solution $y(t) = 0$. Therefore, these two branches can be glued arbitrarily (see the diagram, where the thick solution can be continued along the $t$ axis and branch along the parabola at any value $t$). Nevertheless, the existence of a solution is guaranteed by the following theorem, known as the Peano existence theorem:

8.K.5. The acceleration of an object falling in a constant gravitational field with a certain resistance of the environment is given by the formula $\frac{dv}{dt} = g - kv$, where $k$ is a constant which expresses the resistance of the environment. An object was dropped in a gravitational field with $g = 10\ \mathrm{m\,s^{-2}}$ at the initial speed of $5\ \mathrm{m\,s^{-1}}$; the resistance constant is $k = 0.5\ \mathrm{s^{-1}}$. What will the speed of the object be in three seconds?

Solution. $v = \frac{g}{k} - \left(\frac{g}{k} - v_0\right)e^{-kt}$, so $v(3) = 20 - 15\,e^{-3/2}\ \mathrm{m\,s^{-1}}$ after substitution. □

8.K.6. The rate of increase of a population of a certain type of bug is inversely proportional to its size. At time $t = 0$, the population had 100 bugs. In a month, the population doubled. What will the size of the population be in two months?

Solution. Let us consider a continuous approximation of the number of bugs, and let their amount be denoted by $P$. Then we can build the following equation: $\frac{dP}{dt} = \frac{k}{P}$, whence $P = \sqrt{Kt + c}$. Substituting the given values, we get $P(2) = \sqrt{7} \cdot 100$, which is an estimate of the actual number of bugs. □

8.K.7. Find the equation of the curve with the following properties: it lies in the first quadrant, goes through the point $[1, 3/4]$, and its tangent at any point marks on the positive half-axis $y$ a segment whose length is the same as the distance of that point from the origin. ⃝

8.K.8. Consider a chemical compound C isolated in a container. C is unstable, with the half-life of a molecule equal to $q$ time units. If there were $M$ moles of the compound C in the container at the beginning (i.e., at time $t = 0$), how many moles of it will be there at time $t \geq 0$? ⃝

8.K.9. A 100-gram body stretches a spring by 5 cm when hung on it. Express the dependency of its position on time $t$, provided the speed of the body is 10 cm/s when going through the equilibrium point. ⃝

Further practical problems that lead to differential equations can be found on page 795.

L. Higher-order differential equations

8.L.1. Underdamped oscillation. Now, we will describe a simple model for the movement of a solid object attached to a point with a strong spring. If $y(t)$ is the deviation of our object from the point $y_0 = y(0) = 0$, then we can assume that the acceleration $y''(t)$ in time $t$ is proportional to the magnitude of the deviation, yet with the other sign. The proportionality constant $k$ is called the spring constant. Considering the

Theorem. Consider a function $f(t, y) : \mathbb{R}^2 \to \mathbb{R}$ which is continuous on an open set $U$. Then for every point $(t_0, y_0) \in U \subset \mathbb{R}^2$, there exists a solution of the equation $y' = f(t, y)$ locally in some neighbourhood of $t_0$.

Proof. The proof is presented only roughly, with the details left to the reader. We construct a solution to the right of the initial point $t_0$. For this purpose, select a small step $h > 0$ and label the points $t_k = t_0 + kh$, $k = 1, 2, \dots$ The value of the derivative $f(t_0, y_0)$ of the corresponding curve of the solution $(t, y(t))$ is defined at the initial point $(t_0, y_0)$, so a parametrized line with the same derivative can be substituted: $y_{(0)}(t) = y_0 + f(t_0, y_0)(t - t_0)$. Label $y_1 = y_{(0)}(t_1)$. Construct inductively the functions and points $$y_{(k)}(t) = y_k + f(t_k, y_k)(t - t_k), \qquad y_{k+1} = y_{(k)}(t_{k+1}).$$ Now, define $\tilde y_h(t)$ by gluing the particular linear parts, i.e., $$\tilde y_h(t) = y_{(k)}(t) \quad \text{if } t \in [t_0 + kh,\ t_0 + (k+1)h].$$ This is a continuous function, called the Euler approximation of the solution. It "only" remains to prove that the limit of the functions $\tilde y_h$ for $h$ approaching zero exists and is a solution. For this, one must observe (as done already in the proof of the theorem on uniqueness and existence of the solution) that $f(t, y)$ is uniformly continuous on a sufficiently small neighbourhood $U$ where the solution is sought. For any selected $\varepsilon > 0$, a sufficiently small $\delta$ exists such that $|f(t, y) - f(s, z)| < \varepsilon$ whenever $\|(t - s, y - z)\| < \delta$. Especially, all functions $\tilde y_h$ are in the set of uniformly continuous and equicontinuous functions on a sufficiently small interval. By the Arzelà–Ascoli theorem (see paragraph 7.3.15 on page 683), the constructed continuous functions $\tilde y_h$ are all in a compact set of functions. So there exists a sequence of values $h_n \to 0$ such that the corresponding sequence of functions $\tilde y_{h_n}$ converges uniformly to a continuous function $y(t)$. Write $\hat y_n(t) = \tilde y_{h_n}(t)$, i.e. $\hat y_n \to y$ uniformly. For each of the continuous functions $\tilde y_h$, there are only finitely many points in the interval $[t_0, t]$ where it is not differentiable, so $$\hat y_n(t) = y_0 + \int_{t_0}^{t} \hat y_n'(s)\, ds.$$ On the other hand, the derivatives on the particular intervals are constant, so (here, $k$ is the largest such that $t_0 + kh_n \leq$

case $k = 1$, we get the so-called oscillation equation $y''(t) = -y(t)$. This equation corresponds to the system of equations $x'(t) = -y(t)$, $y'(t) = x(t)$ from 1. The solution of this system is given by $x(t) = R\cos(t - \tau)$, $y(t) = R\sin(t - \tau)$ with an arbitrary non-negative constant $R$, which determines the maximum amplitude, and a constant $\tau$, which determines the initial phase. Therefore, in order to determine a unique solution, we need to know not only the initial position $y_0$, but also the speed of the motion at that moment. These two pieces of information uniquely determine both the amplitude and the initial phase. Moreover, let us imagine that as a result of the properties of the spring material, there is another force which is directly proportional to the instantaneous speed of our object, again with the opposite sign. This is expressed by one more term with the first derivative, so our equation is now $$y''(t) = -y(t) - \alpha y'(t),$$ where $\alpha$ is a constant which expresses the magnitude of the damping. In the following picture, there are the so-called phase diagrams for solutions with two distinct initial conditions, namely with zero damping on the left, and for the value of the coefficient $\alpha = 0.3$ on the right.

[Two phase diagrams ("Tlumené oscilace", damped oscillations): the speed $x(t)$ against the deviation $y(t)$ for $t$ from 0 to 20; undamped on the left, damping $\alpha = 0.3$ on the right.]

The oscillations are expressed by the $y$-axis values; the $x$-axis values describe the speed of the motion.

8.L.2. Undamped oscillation. Find the function $y(t)$ which satisfies the following differential equation and initial conditions: $$y''(t) + 4y(t) = f(t), \qquad y(0) = 0, \quad y'(0) = -1,$$ where the function $f(t)$ is piecewise continuous: $$f(t) = \begin{cases} \cos(2t) & \text{for } 0 \leq t < \pi, \\ 0 & \text{for } t \geq \pi. \end{cases}$$
Solution. This problem is a model of undamped oscillation of a spring (omitting friction, non-linearities in the toughness of the spring, and other factors) which is initiated by an outer force. CHAPTER 8. CALCULUS WITH MORE VARIABLES t, while yj and tj are the points from the definition of the function ˜yhn ) ˆyn(t) = y0 + k−1∑ j=0 ∫ tj+1 tj f(tj, yj)ds + ∫ t tk f(tk, yk) ds. Instead, the equation ˆyn(t) = y0 + ∫ t t0 f(s, ˆyn(s)) ds is wanted, but the difference between this integral and the last two terms in the previous expression is bounded by the possible variation of the function values f(t, ˆy) and the lengths of the intervals. By the universal bound for f(t, y) above, the last integral can be used instead of the actual values in the limit process limn→∞ yn(t), thereby obtaining y(t) = lim n→∞ ( y0 + ∫ t t0 f(s, ˆyn(s)) ds ) = y0 + ∫ t t0 ( lim n→∞ f(s, ˆyn(s)) ) ds = y0 + ∫ t t0 f(s, y(s)) ds, where the uniform convergence ˆyn(t) → y(t) is employed. This proves the theorem. □ 8.3.9. Coupled first-order equations. The problem of finding the solution of the equation y′ = f(x, y) can also be viewed as looking for a (parametrized) curve (x(t), y(t)) in the plane where the parametrization of the variable x(t) = t is fixed beforehand. If this point of view is accepted, then this fixed choice for the variable x can be forgotten, and the work can be carried out with an arbitrary (finite) number of variables. In the plane, for instance, such a system can be written in the form x′ = f(t, x, y), y′ = g(t, x, y) with two functions f, g : R3 → R. A simple example in the plane might be the system of equations x′ = −y, y′ = x. It is easily guessed (or at least verified) that there is a solution of this system, x(t) = R cos t, y(t) = R sin t, with an arbitrary non-negative constant R, and the curves of the solution are exactly the parametrized circles with radius R. In the general case, the vector notation of the system can be used in the form x′ = f(t, x) for a vector function x : R → Rn and a mapping f : Rn+1 → Rn . The validity of the theorem on uniqueness and existence of the solution to such systems can be extended: 772 The function f(t) can be written as a linear combination of Heaviside’s function u(t) and its shift, i. e., f(t) = cos(2t)(u(t) − uπ(t)) Since L(y′′ )(s) = s2 L(y) − sy(0) − y′ (0) = s2 L(y) + 1, we get, applying the results of the above exercises 7 and 8 to the Laplace transform of the right-hand side s2 L(y) + 1 + 4L(y) = L(cos(2t)(u(t) − uπ(t))) = L(cos(2t) · u(t)) − L(cos(2t) · uπ(t)) = L(cos(2t)) − e−πs L(cos(2(t + π)) = (1 − e−πs ) s s2 + 4 . Hence, L(y) = − 1 s2 + 4 + (1 − e−πs ) s (s2 + 4)2 . Performing the inverse transform, we obtain the solution in the form y(t) = −1 2 sin(2t) + 1 4 t sin(2t) + L−1 ( e−πs s (s2 + 4)2 ) . However, by formula (1), we have L−1 ( e−πs s (s2 + 4)2 ) = 1 4 L−1 (e−πs L(t sin(2t))) = (t − π) sin(2(t − π)) · Hπ(t). Since Heaviside’s function is zero for t < π and equal to 1 for t > π, we get the solution in the form y(t) = { −1 2 sin(2t) + 1 4 t sin(2t) for 0 ≤ t < π π−2 4 sin(2t) for t ≥ π □ 8.L.3. Find the general solution of the equation y′′′ − 5y′′ − 8y′ + 48y = 0. Solution. This is a third-order linear differential equation with constant coefficients since it is of the form y(n) + a1y(n−1) + a2y(n−2) + · · · + an−1y′ + any = f(x) for certain constants a1, . . . , an ∈ R. Moreover, we have f(x) ≡ 0, i. e., the equation is homogeneous. 
First of all, we will find the roots of the so-called characteristic polynomial λn + a1λn−1 + a2λn−2 + · · · + an−1λ + an. Each real root λ with multiplicity k corresponds to the k so- lutions eλx , x eλx , . . . , xk−1 eλx and every pair of complex roots λ = α ± iβ with multiplicity k corresponds to the k pairs of solutions eαx cos (βx) , x eαx cos (βx) , . . . , xk−1 eαx cos (βx) , eαx sin (βx) , x eαx sin (βx) , . . . , xk−1 eαx sin (βx) . CHAPTER 8. CALCULUS WITH MORE VARIABLES Existence and uniqueness for systems of ODEs Theorem. Consider functions fi(t, x1, . . . , xn) : Rn+1 → R, i = 1, . . . , n, with continuous partial derivatives. Then, for every point (t0, c) ∈ Rn+1 , c = (c1, . . . , cn), there exists a maximal interval (t0−a, t0+b), with positive numbers a, b ∈ R, and a unique (vector) function x(t) : R → Rn which is the solution of the system of equations x′ 1 = f1(t, x1, . . . , xn) ... x′ n = fn(t, x1, . . . , xn) with the initial condition x(t0) = c, i.e. x1(t0) = c1, . . . , xn(t0) = cn. Proof. The proof is almost identical to the one of the existence and uniqueness of the solution for a single equation with a single unknown function as shown in Theorem 8.3.6. The unknown function x(t) = (x1(t), . . . , xn(t)) is a curve in Rn satisfying the given equation, so its components xi(t) are again expressed in terms of integrals xi(t) = xi(t0) + ∫ t t0 x′ i(s) ds = ci + ∫ t t0 fi(s, x(s)) ds. We work with the integral operator y → L(y), this time mapping curves in Rn to curves in Rn . It is desired to find its fixed point. The proof proceeds in much the same way as in the case 8.3.6. It is only necessary to observe that the size of the vector ∥f(t, z1, . . . , zn) − f(t, y1, . . . , yn)∥ is bounded from above by the sum ∥f(t, z1, . . . , zn) − f(t, y1, z2 . . . , zn)∥ + . . . + ∥f(t, y1, . . . , yn−1, zn) − f(t, y1, . . . , yn)∥. It is recommended to go through the proof of Theorem 8.3.6 from this point of view and to think out the details. □ 8.3.10. Example. When dealing with models in practice, it is of interest to consider the qualitative behaviour of the solution in dependence on the initial conditions and free parameters of the system We consider a simple example of a system of first-order equations from this point of view. The standard population model “predator – prey”, was introduced in the 1920s by Lotka and Volterra. Let x(t) denote the evolution of the number of individuals in the prey population and y(t) for the predators. Assume that the increment of the prey would correspond to the Malthusian model (i.e. exponential growth with coefficient α) if they were not hunted. On the other hand, assume that the predator would only naturally die out if there were no prey (i.e. exponential decrease with coefficient γ). Further, consider an 773 Then, the general solution corresponds to all linear combinations of the above solutions. Therefore, let us consider the polynomial λ3 − 5λ2 − 8λ + 48 with roots λ1 = λ2 = 4, λ3 = −3. Since we know the roots, we can deduce the general solution as well: y = C1e4x + C2x e4x + C3e−3x , C1, C2, C3 ∈ R. □ 8.L.4. Compute y′′′ + y′′ + 9y′ + 9y = ex + 10 cos (3x) . Solution. First, we will solve the corresponding homogeneous equation. The characteristic polynomial is equal to λ3 + λ2 + 9λ + 9, with roots λ1 = −1, λ2 = 3i, λ3 = −3i. The general solution of the corresponding homogeneous equation is thus y = C1e−x +C2 cos (3x)+C3 sin (3x) , C1, C2, C3 ∈ R. 
The solution of the non-homogeneous equation is of the form $$y = C_1 e^{-x} + C_2 \cos(3x) + C_3 \sin(3x) + y_p, \qquad C_1, C_2, C_3 \in \mathbb{R},$$ for a particular solution $y_p$ of the non-homogeneous equation. The right-hand side of the given equation is of a special form. In general, if the non-homogeneous part is given by a function $P_n(x)\,e^{\alpha x}$, where $P_n$ is a polynomial of degree $n$, then there is a particular solution of the form $y_p = x^k R_n(x)\,e^{\alpha x}$, where $k$ is the multiplicity of $\alpha$ as a root of the characteristic polynomial and $R_n$ is a polynomial of degree at most $n$. More generally, if the non-homogeneous part is of the form $e^{\alpha x}\left[P_m(x)\cos(\beta x) + S_n(x)\sin(\beta x)\right]$, where $P_m$ is a polynomial of degree $m$ and $S_n$ is a polynomial of degree $n$, there exists a particular solution of the form $y_p = x^k e^{\alpha x}\left[R_l(x)\cos(\beta x) + T_l(x)\sin(\beta x)\right]$, where $k$ is the multiplicity of $\alpha + i\beta$ as a root of the characteristic polynomial and $R_l, T_l$ are polynomials of degree at most $l = \max\{m, n\}$.

In our problem, the non-homogeneous part is a sum of two functions of the special form (see above). Therefore, we will look for (two) corresponding particular solutions using the method of undetermined coefficients, and then we will add up these solutions. This will give us a particular solution of the original equation (as well as the general solution, then). Let us begin with the function $y = e^x$, which has a particular solution $y_{p_1}(x) = A e^x$ for some $A \in \mathbb{R}$. Since $y_{p_1}(x) = y_{p_1}'(x) = y_{p_1}''(x) = y_{p_1}'''(x) = A e^x$, substitution into the original equation, whose right-hand side contains only the function $y = e^x$, leads to $20A e^x = e^x$, i.e. $A = \frac{1}{20}$. For the right-hand side with the function $y = 10\cos(3x)$, we are looking for a particular solution in the form

interaction of the predator and the prey which is expected to be proportional to the number of both, with a certain coefficient $\beta$, which is, in the case of the predator, supplemented by a multiplicative coefficient expressing the hunting efficiency.

Lotka–Volterra model

This is a system of two equations, $x$ models the prey, $y$ the predator, with positive constants $\alpha$, $\beta$, $\gamma$, $\delta$: $$x' = \alpha x - \beta yx, \qquad y' = -\gamma y + \delta\beta xy.$$

The diagram illustrates one of the typical behaviours of such dynamical systems – the existence of closed orbits on which the system moves in time. These are the thick black ovals, while the "comets" indicate the field at the individual points (i.e. their expected movement). The left illustration corresponds to $\alpha = 1$, $\beta = 1$, $\gamma = 0.3$, $\delta = 0.3$ and the initial condition $(x_0, y_0) = (1, 0.5)$ at $t_0 = 0$ for the solution, while the other illustration comes with $\alpha = 1$, $\beta = 2$, $\gamma = 2$, $\delta = 1$ and $(x_0, y_0) = (1, 1.5)$. In both cases, the system is quite stable in the vicinity of the initial condition, and it would be very stable for $(x_0, y_0) = (1, 1)$ or $(1, 0.5)$, respectively. But their development differs in speed: the depicted solution cycles close at times about $t = 12$ in the first case and $t = 5$ in the other one. (A numerical sketch of the first orbit follows below.) It is interesting that the same model captures quite well the development of the unemployment rate in a population, considering the employees to be the predators, while the employers play the role of the prey. Much information about this and other models can be found in the literature.

Next, we are approaching several qualitative results (i.e., featuring properties of the solutions, without knowing them explicitly).
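A minimal numerical sketch of the first Lotka–Volterra orbit depicted above, assuming Python with SciPy (parameters and initial condition as in the left illustration):

```python
# Lotka-Volterra orbit for alpha=1, beta=1, gamma=0.3, delta=0.3, start (1, 0.5).
from scipy.integrate import solve_ivp

a, b, g, d = 1.0, 1.0, 0.3, 0.3

def field(t, z):
    x, y = z
    return [a*x - b*y*x, -g*y + d*b*x*y]

sol = solve_ivp(field, (0, 12), [1.0, 0.5], dense_output=True, rtol=1e-9)
print(sol.sol(0.0), sol.sol(12.0))   # the orbit should nearly close around t = 12
```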
They are all easily understandable as statements, but the complexity of their proofs perhaps would be too demanding, at least in the first reading. Thus, the readers are advised to focus on the Theorems (do not ignore them, they are all absolutely essential!) and their rough explanations, but skip the proofs. The exposition will come back to easily proved topics in 8.3.16 on page 785. 774 yp2 (x) = x [B cos (3x) + C sin (3x)] . Recall that the number λ = 3i was obtained as a root of the characteristic polynomial. We can easily compute the deriva- tives y′ p2 (x) = [B cos (3x) + C sin (3x)] +x [−3B sin (3x) + 3C cos (3x)] , y′′ p2 (x) = 2 [−3B sin (3x) + 3C cos (3x)] +x [−9B cos (3x) − 9C sin (3x)] , y′′′ p2 (x) = 3 [−9B cos (3x) − 9C sin (3x)] +x [27B sin (3x) − 27C cos (3x)] . Substituting them into the equation, whose right-hand side contains the function y = 10 cos (3x), we get −18B cos (3x) − 18C sin (3x) − 6B sin (3x) + 6C cos (3x) = 10 cos (3x) . Confronting the coefficients leads to the system of linear equa- tions −18B + 6C = 10, −18C − 6B = 0 with the only solution B = −1/2 and C = 1/6, i. e., yp2 (x) = x [ −1 2 cos (3x) + 1 6 sin (3x) ] . Altogether, the general solution is y = C1e−x + C2 cos (3x) + C3 sin (3x) + 1 20 ex − 1 2 x cos (3x) + 1 6 x sin (3x) , C1, C2, C3 ∈ R. □ 8.L.5. Determine the general solution of the equation y′′ + 3y′ + 2y = e−2x . Solution. The given equation is a second-order (the highest derivative of the wanted function is of order two) linear (all derivatives are in the first power) differential equation with constant coefficients. First, we solve the homogenized equa- tion y′′ + 3y′ + 2y = 0. Its characteristic polynomial is x2 + 3x + 2 = (x + 1)(x + 2), with roots x1 = −1 and x2 = −2. Hence, the general solution of the homogenized equation is c1e−x + c2e−2x , where c1, c2 are arbitrary real constants. Now, using the method of undetermined coefficients, we will find a particular solution of the original nonhomogeneous equation. According to the form of the non-homogeneity and since −2 is a root of the characteristic polynomial of the given equation, we are looking for the solution in the form y0 = axe−2x for a ∈ R. Substituting into the original equation, we obtain a[−4e−2x +4xe−2x +3(e−2x −2xe−2x )+2xe−2x ] = e−2x , CHAPTER 8. CALCULUS WITH MORE VARIABLES 8.3.11. Stability of systems of equations. In order to illustrate the stability questions, we discuss just one basic theorem only. We are interested in the continuity with respect the to L∞ norm on the space of functions (i.e. the supremum norm, see 7.3.5). According to the theorem below, the assumption that the partial derivatives of the functions defining the system are continuous (in fact, it suffices to have them Lipschitz), guarantees the continuity of the solutions in dependence on the initial conditions as well as the defining equations themselves. Note however, that as the distance of t from the initial value t0 grows, then the error estimates grow exponentially! Therefore, this result is of a strictly local character. It is not in contradiction with the example of the unstably behaving equation y′ = ty illustrated in paragraph 8.3.3.8 Consider two systems of equations written in the vector form (1) x′ = f(t, x), y′ = g(t, y) and assume that the mappings f, g : U ⊂ Rn+1 → Rn have continuous partial derivatives on an open set U with compact closure. 
Such functions must be uniformly continuous and uniformly Lipschitz on U, so there are the finite values C = sup x̸=y; (t,x), (t,y)∈U ∥f(t, x) − f(t, y)∥ ∥x − y∥ B = sup (t,x)∈U ∥f(t, x) − g(t, x)∥ With this notation, the fundamental theorem can be formu- lated: Theorem. Let x(t) and y(t) be two fixed solutions x′ = f(t, x(t)), y′ = g(t, y(t)) of the systems (1) considered above, given by initial conditions x(t0) = x0 and y(t0) = y0. Then, ∥x(t) − y(t)∥ ≤ ∥x0 − y0∥ eC|t−t0| +B C ( eC|t−t0| −1 ) . Proof. Without loss of generality, t0 = 0. From the expression of the solutions x(t) and y(t) as fixed points of the corresponding integral operators follows the estimate ∥x(t)−y(t)∥ ≤ ∥x0 −y0∥+ ∫ t 0 ∥f(s, x(s))−g(s, y(s))∥ ds. The integrand can be further estimated as follows: ∥f(s, x(s)) − g(s, y(s))∥ ≤ ≤ ∥f(s, x(s)) − f(s, y(s))∥ + ∥f(s, y(s)) − g(s, y(s))∥ ≤ C ∥x(s) − y(s)∥ + B If F(t) = ∥x(t) − y(t)∥, α = ∥x0 − y0∥, then F(t) ≤ α + ∫ t 0 (C F(s) + B) ds. 8Much more information can be found for example in Gerald Teschl’s book Ordinary Differential Equations and Dynamical Systems, Graduate Studies in Mathematics, Volume 140, Amer. Math. Soc., Providence, 2012. 775 hence a = −1. We have thus found the function −xe−2x as a particular solution of the given equation. Hence, the general solution is the function space c1e−x + c2e−2x − xe−2x , c1, c2 ∈ R. □ 8.L.6. Determine the general solution of the equation y′′ + y′ = 1. Solution. The characteristic polynomial of the given equation is x2 + x, with roots 0 and −1. Therefore, the general solution of the homogenized equation is c1 + c2e−x , where c1, c2 ∈ R. We are looking for a particular solution in the form ax, a ∈ R (since zero is a root of the characteristic polynomial). Substituting into the original equation, we get a = 1. The general solution of the given non-homogeneous equation is c1 + c2e−x + x, c1, c2 ∈ R. □ 8.L.7. Determine the general solution of the equation y′′ + 5y′ + 6y = e−2x . Solution. The characteristic polynomial of the equation is x2 + 5x + 6 = (x + 2)(x + 3), its roots are −2 and −3. The general solution of the homogenized equation is thus c1e−2x + c2e−3x , c1, c2 ∈ R. We are looking for a particular solution in the form axe−2x , (−2 is a root of the characteristic polynomial), a ∈ R, using the method of undetermined coefficients. Substitution into the original equation yields a = 1. Hence, the general solution of the given equation is c1e−2x + c2e−3x + xe−2x . □ 8.L.8. Determine the general solution of the equation y′′ − y′ = 5. Solution. The characteristic polynomial of the equation is x2 − x, with roots 1, 0. Therefore, the general solution of the homogenized equation is c1 + c2ex , where c1, c2 ∈ R. We are looking for a particular solution in the form ax, a ∈ R, using the method of undetermined coefficients. The result is a = −5, and the general solution is of the form c1 + c2ex − 5x. □ 8.L.9. Solve the equation y′′ − 2y′ + y = ex x2+1 . Solution. We will solve this non-homogeneous equation using the method of variation of constants. We will thus obtain the solution in the form y = C1(x) y1(x) + C2(x) y2(x) + · · · + Cn(x) yn(x), CHAPTER 8. CALCULUS WITH MORE VARIABLES Such an estimate bound can be exploited further, by the following general result, known as Gronwall’s inequality. Note the similarity with the general solution of linear equa- tions. Lemma. 
Assume a real-valued function F(t) satisfies for all t in the interval [0, tmax] F(t) ≤ α(t) + ∫ t 0 β(s)F(s) ds for some real-valued functions α(t), β(t), with β(t) ≥ 0. Then F(t) ≤ α(t) + ∫ t 0 α(s)β(s) e ∫ t s β(r) dr ds for all t ∈ [0, tmax]. Moreover, if additionally α(t) is nondecreasing, then F(t) ≤ α(t) e ∫ t 0 β(s) ds . Proof of the lemma. Write G(t) = e− ∫ t 0 β(s) ds . By the first assumption of the theorem, d dt ( G(t) ∫ t 0 β(s)F(s) ds ) = = β(t)G(t) ( F(t) − ∫ t 0 β(s)F(s) ds ) ≤ α(t)β(t)G(t). Integrating with respect to t and dividing by the non-zero function G(t) gives ∫ t 0 β(s)F(s) ds ≤ ∫ t 0 α(s)β(s) G(s) G(t) ds, which, having added α(t) to both sides of the inequality, gives the first proposition of the lemma. Assuming that α(t) is non-decreasing, there follows: F(t) ≤ α(t) ( 1 + ∫ t 0 β(s) e ∫ t s β(r) dr ds ) . The integrand is a derivative: −β(s) e ∫ t s β(r) dr = d ds ( e ∫ t s β(r) dr ) , so F(t) ≤ α(t) ( 1 − ∫ t 0 d ds e ∫ t s β(r) dr ds ) = α(t) ( 1 + e ∫ t s β(r) dr −1 ) , and the second proposition of the lemma is also proved. □ Now, the proof of the theorem about continuous dependency on the parameters is easily finished. The bound F(t) ≤ α + ∫ t 0 (C F(s) + B) ds is already obtained, and using a slightly modified function ˜F(t) = F(t) + B C , this yields ˜F(t) ≤ B C + α + ∫ t 0 C ˜F (s) ds. 776 where y1, . . . , yn give the general solution of the corresponding homogeneous equation and the functions C1(x), . . . , Cn(x) can be obtained from the system C′ 1(x) y1(x) + · · · + C′ n(x) yn(x) = 0, C′ 1(x) y′ 1(x) + · · · + C′ n(x) y′ n(x) = 0, ... C′ 1(x) y (n−2) 1 (x) + · · · + C′ n(x) y(n−2) n (x) = 0, C′ 1(x) y (n−1) 1 (x) + · · · + C′ n(x) y(n−1) n (x) = f(x). The roots of the characteristic polynomial λ2 − 2λ + 1 are λ1 = λ2 = 1. Therefore, we are looking for the solution in the form C1(x) ex + C2(x) x ex , considering the system C′ 1(x) ex + C′ 2(x) x ex = 0, C′ 1(x) ex + C′ 2(x) [ex + x ex ] = ex x2 + 1 . We can compute the unknowns C′ 1(x) and C′ 2(x) using Cramer’s rule. It follows from ex x ex ex ex + x ex = e2x , 0 x ex ex x2+1 ex + x ex = −x e2x x2 + 1 , ex 0 ex ex x2+1 = e2x x2 + 1 that C1(x) = − ∫ x x2 + 1 dx = − 1 2 ln ( x2 + 1 ) + C1, C1 ∈ R, C2(x) = ∫ dx x2 + 1 = arctan x + C2, C2 ∈ R. Hence, the general solution is y = C1ex + C2x ex − 1 2 ex ln ( x2 + 1 ) + x ex arctan x, C1, C2 ∈ R. □ 8.L.10. Find the only function y which satisfies the linear differential equation y(3) − 3y′ − 2y = 2ex , with initial conditions y(0) = 0, y′ (0) = 0, y′′ (0) = 0. Solution. The characteristic polynomial is x3 − 3x − 2, with roots 2 and −1 (double). We are looking for a particular solution in the form aex , a ∈ R, easily finding out that it is the function −1 2 ex . The general solution of the given equation is thus c1e2x + c2e−x + c3xe−x − 1 2 ex . CHAPTER 8. CALCULUS WITH MORE VARIABLES This is the assumption of Gronwall’s inequality with even constant parameters, so by the second claim of the lemma, F(t) + B C ≤ (α + B C ) e ∫ t 0 C ds , which is the statement F(t) ≤ α eCt +B C (eCt −1) as desired. □ The continuous dependency on both the initial conditions and the potential further parameters in which the function f would be Lipschitz-continuous follows immediately from the statement of the theorem. The extremely simple equations in one variable x′ = ax, where a are small constants, with their exponential solution x(t) = eat show that better general results cannot be ex- pected. 8.3.12. Differentiable dependance. 
In practical problems, the differentiability of the obtained solutions is often of interest, especially with regard to the initial conditions or other parameters of the system. In the general vector notation of the system of ordinary equations y′ = f(t, y), it can always be supposed that the vector function does not depend explicitly on t. If it does, then another variable y0 can be added to the other variables y1, . . . , yn. Then there is the same system of equations for the curve ˜y′ (t) = (y0(t), y1(t), . . . , yn(t)) as y′ 0 = 1 y′ 1 = f1(y0, y1, . . . , yn) ... y′ n = fn(y0, y1, . . . , yn) with the initial conditions y0(t0) = t0, y1(t0) = x1, . . . , yn(t0) = xn. Such systems, which do not explicitly depend on time, are called autonomous systems of ordinary differential equations. Without loss of generality, we deal with autonomous systems in finite dimension n, dependent on parameters λ and with initial conditions (1) y′ = f(y, λ), y(t0) = x. Without loss of generality, consider the initial value t0 = 0, and write the solution with y(0) = x in the form y(t, x, λ) to emphasize the dependency on the parameters. For fixed values of the initial conditions (and the potential parameters λ), the solution is always once more differentiable than the function f. This can be derived inductively by applying the chain rule. If f is continuously differentiable and y(t) is a solution, then (use the matrix notation where 777 Substituting into the original conditions, we get the only satisfactory function, 2 9 e2x + 5 18 e−x + 1 3 xe−x − 1 2 ex . □ Further problems concerning higher-order differential equations can be found on page 798 M. Applications of the Laplace transform Differential equations with constant coefficients can also be solved using the Laplace transform. 8.M.1. Let L(y)(s) denote the Laplace transform of a function y(t). Integrating by parts, prove that Solution. (1) L(y′ )(s) = sL(y)(s) − y(0) L(y′′ )(s) = s2 L(y) − sy(0) − y′ (0) and, by induction: L(y(n) )(s) = sn L(y)(s) −∑n i=1 sn−i y(i−1) (0) . □ 8.M.2. Find the function y(t) which satisfies the differential equation y′′ (t) + 4y(t) = sin 2t as well as the initial conditions y(0) = 0, y′ (0) = 0. Solution. It follows from the above exercise 7.D.18 that s2 L(y)(s) + 4L(y)(s) = L(sin 2t)(s). We also have L(sin 2t)(s) = 2 s2 + 4 , i. e., L(y)(s) = 2 (s2 + 4)2 . The inverse transform leads to y(t) = 1 8 sin 2t − 1 4 t cos 2t . □ 8.M.3. Find the function y(t) which satisfies the differential equation y′′ (t) + 6y′ (t) + 9y(t) = 50 sin t and the initial conditions y(0) = 1, y′ (0) = 4. Solution. The Laplace transform yields s2 L(y)(s)−s−4+6(sL(y)(s)−1)+9L(y)(s) = 50L(sin t)(s), i. e., (s2 + 6s + 9)L(y)(s) = 50 s2 + 1 + s + 10, L(y)(s) = 50 (s2 + 1)(s + 3)2 + s + 10 (s + 3)2 . Decomposing the first term to partial fractions, we obtain 50 (s2 + 1)(s + 3)2 = As + B s2 + 1 + C s + 3 + D (s + 3)2 , CHAPTER 8. CALCULUS WITH MORE VARIABLES the Jacobi matrix D1 f(y) of the mapping f : Rn → Rn is multiplied with the column vector y′ ) y′′ = D1 f(y) · y′ = D1 f(y) · f(y) exists and is continuous. With all the derivatives up to order two continuous, there is an expression for the third derivative: y(3) = D2 f(y) ( f(y), f(y) ) + ( D1 f(y))2 · f(y). Here, the chain rule is used again, starting with the differential of the bilinear mapping of matrix multiplication and viewing the second derivative as a bilinear object evaluated on y′ in both arguments. Think out the argumentation for this and higher orders in detail. 
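The formula $y'' = D^1 f(y) \cdot f(y)$ can be checked mechanically on any concrete field; a minimal sketch, assuming SymPy (the planar field below is an arbitrary illustrative choice):

```python
# Check y'' = D^1 f(y) . f(y) for the autonomous system y' = f(y) in the plane.
import sympy as sp

t, u, v = sp.symbols('t u v')
y1, y2 = sp.Function('y1')(t), sp.Function('y2')(t)

f = sp.Matrix([-v, u + u*v])              # an arbitrary smooth field f(u, v)
J = f.jacobian([u, v])                    # the Jacobi matrix D^1 f

subs = {u: y1, v: y2}
fy = f.subs(subs)                         # f(y(t))
rhs = J.subs(subs) * fy                   # D^1 f(y) . f(y)

# Differentiate f(y(t)) in t and substitute y' = f(y):
lhs = fy.diff(t).subs({y1.diff(t): fy[0], y2.diff(t): fy[1]})
print(sp.simplify(lhs - rhs))             # the zero matrix, as claimed
```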
Assume for a while that there is a solution y(t, x) of the system (1) which is continuously differentiable in the parameters x ∈ Rn , i.e. the initial condition as well, and forget about the further parameters λ for now. Write Φ(t, x) = D1 x(y(t, x)), for the Jacobi matrix of all partial derivatives with respect to the coordinates xi, which depends on the time t as well as the initial condition x. Its derivative Φ′ (t, x) with respect to t can be computed using the symmetry of partial derivatives and the chain rule: Φ′ (t, x) = d dt ( D1 xy(t, x) ) = D1 x ( y′ (t, x) ) = D1 f(y(t, x)) · D1 xy(t, x) = D1 f(y(t, x)) · Φ(t, x). So the derivatives with respect to the initial conditions along the solution y(t, x) of the system (1) are given as the solutions of a system of n2 first-order equations with initial condition (2) Φ′ (t, x) = F(t, x) · Φ(t, x), Φ(0, x) = E, where F(t, x) = D1 f(y(t, x)), and the initial condition comes out from the identity y(0, x) = x. The unique existence of the solution of this (matrix) system and its continuous dependence on the parameters have already been proved. The following theorem says that for systems (1) with continuously differentiable right-hand sides f, the derivatives with respect to the initial condition can be obtained in this way. Differentiability of the solutions Theorem. Consider an open subset U ⊂ Rn+k and a mapping f : U → Rn with continuous first derivatives. Then, a system of differential equations dependent on a parameter λ ∈ Rk with initial condition at a point x ∈ U y′ = f(y, λ), y(0) = x has a unique solution y(t, x, λ), which is a mapping with continuous first derivatives with respect to each variable. 778 so 50 = (As + B)(s + 3)2 + C(s2 + 1)(s + 3) + D(s2 + 1). Substituting s = −3, we get 50 = 10D hence D = 5 and confronting the coefficients at s3 , we have 0 = A + C, hence A = −C. Confronting the coefficients at s, we obtain 0 = 9A + 6B + C = 8A + 6B, hence B = 4 3 C. Finally, confronting the absolute term, we infer 50 = 9B + 3C + D = 12C + 3C + 5 hence C = 3, B = 4, A = −3. Since s + 10 (s + 3)2 = s + 3 + 7 (s + 3)2 = 1 s + 3 + 7 (s + 3)2 , we have L(y)(s) = −3s + 4 s2 + 1 + 3 s + 3 + 5 (s + 3)2 + 1 s + 3 + 7 (s + 3)2 = −3s s2 + 1 + 4 s2 + 1 + 4 s + 3 + 12 (s + 3)2 . Now, the inverse Laplace transform yields the solution in the form y(t) = −3 cos t + 4 sin t + 4e−3t + 12te−3t . □ 8.M.4. Find the function y(t) which satisfies the differential equation y′′ (t) = cos (πt) − y(t), t ∈ (0, +∞) and the initial conditions y(0) = c1, y′ (0) = c2. Solution. First, we should emphasize that it follows from the theory of ordinary differential equations that this equation has a unique solution. Further, we should recall that L (f′′ ) (s) = s2 L (f) (s) − s lim t→0+ f(t) − lim t→0+ f′ (t) and L (cos (bt)) (s) = s s2+b2 , b ∈ R. Applying the Laplace transform to the given differential equation then gives s2 L (y) (s) − sc1 − c2 = s s2+π2 − L (y) (s), i. e., (1) L (y) (s) = s (s2 + 1) (s2 + π2) + c1s s2 + 1 + c2 s2 + 1 . Therefore, it suffices to find a function y which satisfies (1). Performing partial fraction decomposition, we obtain s (s2+1)(s2+π2) = 1 π2−1 ( s s2+1 − s s2+π2 ) . The above expression of L (cos (bt)) (s) and the already proved formula L (sin t) (s) = 1 s2+1 CHAPTER 8. CALCULUS WITH MORE VARIABLES Proof. Consider a general system dependent on parameters, but viewed as an ordinary autonomous system with no parameters. 
More explicitely, consider the parameters to be additional space variables and add (vector) conditions λ′ (t) = 0 and λ(0) = λ. Therefore, the theorem is proved for autonomous systems with no further parameters. There is dependency on the initial condi- tions. Just as in the proof of the fundamental existence theorem 8.3.6, build on the expression of the solutions as fixed points of the integral operators and prove that the expected derivative, as discussed above, enjoys the properties of the differential. Fix a point x0 as the initial condition, together with a small neighbourhood x0 ∈ V , which if necessary can be further decreased during the following estimates, so that ∥f(y) − f(z)∥ ≤ C ∥y − z∥ on this neighbourhood by the Lipschitz property. It is already deduced that if the derivative Φ(t, x) = D1 xy(t, x) of the solution y(t, x) exists, then it must be uniquely given by the equation (2) wit the proper initial conditions. Therefore, define Φ(t, x) by this equation and examine the expression G(t, h) = ∥y(t, x0 + h) − y(t, x0) − Φ(t, x0)(h)∥ with small increments h ∈ Rn . In order to prove that the continuous derivative exists, it is necessary to show that lim h→0 1 ∥h∥ G(t, h) = 0. Several estimates are needed for this purpose. First, from the latter theorem 8.3.11 about continuous dependence on initial conditions, the estimate ∥y(t, x0 + h) − y(t, x0)∥ ≤ ∥h∥ eC|t| follows immediately. In the next step, use Taylor’s expansion of f with remainder: f(y) − f(z) = D1 f(z) · (y − z) + R(y, z), where R(y, z) satisfies R(y, z) ∥y − z∥ → 0 for ∥y − z∥ → 0. This implies the crucial estimate. In the first equality substitute in the expression of solutions in terms of fixed points of the integral operators. Next, exploit the definition of the mapping Φ(t, x0) in terms of its derivative (write F(t, x) = D1 f(y(t, x)) again and notice that its initial condition Φ(0, x)(h) = h implies the vanishing of the h 779 then yield the wanted solution y(t) = 1 π2−1 (cos t − cos (πt)) + c1 cos t + c2 sin t . □ 8.M.5. Solve the system of differential equations x′′ (t) + x′ (t) = y(t) − y′′ (t) + et , x′ (t) + 2x(t) = −y(t) + y′ (t) + e−t with the initial conditions x(0) = 0, y(0) = 0, x′ (0) = 1, y′ (0) = 0. Solution. Again, we apply the Laplace transform. This, using L (e±t ) (s) = 1 s∓1 , transforms the first equation to s2 L (x) (s) − s lim t→0+ x(t) − lim t→0+ x′ (t) + sL (x) (s) − lim t→0+ x(t) = = L (y) (s)− ( s2 L (y) (s) − s lim t→0+ y(t) − lim t→0+ y′ (t) ) + 1 s−1 and the second one to sL (x) (s) − lim t→0+ x(t) + 2L (x) (s) = = −L (y) (s) + sL (y) (s) − lim t→0+ y(t) + 1 s+1 . Evaluating the limits (according to the initial conditions), we obtain the linear equations s2 L (x) (s)−1+sL (x) (s) = L (y) (s)−s2 L (y) (s)+ 1 s−1 and sL (x) (s) + 2L (x) (s) = −L (y) (s) + sL (y) (s) + 1 s+1 with the only solution L (x) (s) = 2s−1 2(s−1)(s+1)2 , L (y) (s) = 3s 2(s2−1)2 . Once again, we perform partial fraction decomposition, get- ting L (x) (s) = 1 8 1 s−1 + 3 4 1 (s+1)2 − 1 8 1 s+1 = 3 4 1 (s+1)2 + 1 4 1 s2−1 . Since we have already computed that L (t e−t ) (s) = 1 (s+1)2 , L (sinh t) (s) = 1 s2−1 , L (t sinh t) (s) = 2s (s2−1)2 , we get x(t) = 3 4 t e−t + 1 4 sinh t, y(t) = 3 4 t sinh t. We definitely advise the reader to verify that these functions of x and y are indeed the wanted solution. The reason is that the Laplace transforms of the functions y = et , y = sinh t and y = t sinh t were obtained only for s > 1). □ 8.M.6. 
Find the solution of the following system of differential equations: x′ (t) = −2x(t) + 3y(t) + 3t2 , y′ (t) = −4x(t) + 5y(t) + et , x(0) = 1, y(0) = −1 Solution. L(x′ )(s) = L(−2x + 3y + 3t2 )(s), L(y′ )(s) = L(−4x + 5y + et )(s). CHAPTER 8. CALCULUS WITH MORE VARIABLES summand). G(t, h) = x0 + h + ∫ t 0 f(y(s, x0 + h))ds − x0 − ∫ t 0 f(y(s, x0))ds − Φ(t, x0)(h) = h + ∫ t 0 ( f(y(s, x0 + h)) − f(y(s, x0)) − F(s, x0)Φ(s, x0)(h) ) ds − h ≤ ∫ t 0 f(y(s, x0 + h)) − f(y(s, x0)) − F(s, x0)Φ(s, x0)(h)∥ ds ≤ ∫ t 0 ∥F(s, x0)∥ ∥y(s, x0 +h) − y(s, x0) − Φ(s, x0)(h)∥ ds + ∫ t 0 ∥R(y(s, x0 + h), y(s, x0))∥ ds, where the norm on the matrices is taken as the maximum of the absolute values of their entries. Since F(t, x) is continuous, there is a uniform bound of its norm in the neighbourhood V given by ∥F(t, x0)∥ ≤ B, for all |t| < T with a sufficiently small T to ensure the solutions remain in the neighbourhood V . At the same time, for any fixed constant ε > 0, there is a bound ∥h∥ < δ for which the remainder R satisfies ∥R(y(t, x0 + h), y(t, x0))∥ ≤ ε∥y(t, x0 + h) − y(t, x0)∥ ≤ ∥h∥ε eCT . Therefore, the estimate on G(t, h) can be improved as fol- lows: G(t, h) ≤ B ∫ t 0 G(s, h) ds + ε∥h∥ eCT . Gronwall’s lemma now gives G(t, h) ≤ ε∥h∥ e(C+B)T . This implies that limh→0 1 ∥h∥ G(t, h) = 0 as requested. □ In the same way, it can be proved that continuous differentiability of the right-hand side up to order k (inclusive) guarantees the same order of differentiability of solutions in all input parameters. 8.3.13. The analytic case. Let us pay additional attention to the case when the right hand side f of the system of equations (1) y′ = f(y), y(t0) = y0 is analytic in all arguments (i.e., a convergent multidimensional power series f(y) = ∑∞ |α|=0 1 α! ∂f|α| ∂yα yα , see 8.1.15). Exactly as in the previous discussion, we may hide the time variable t as well as further parameters in the variables. 780 The left-hand sides can be written using (1), while the righthand sides can be rewritten thanks to linearity of the L operator. Since L(3t2 )(s) = 6 s3 and L(et )(s) = 1 s−1 , we get the system of linear equations sL(x)(s) − 1 = −2L(x)(s) + 3L(y)(s) + 6 s3 , sL(y)(s) + 1 = −4L(x)(s) + 5L(y)(s) + 1 s−1 . In matrices, this is A(s)ˆx(s) = b(s), where A(s) = ( s + 2 −3 4 s − 5 ) , ˆx(s) = ( L(x)(s) L(y)(s) ) and b(s) = ( 1 + 6 s3 −1 + 1 s−1 ) . Cramer’s rule says that L(x)(s) = |A1| |A| , L(y)(s) = |A2| |A| , where |A| = s + 2 −3 4 s − 5 = s2 − 3s + 2, |A1| = 1 + 6 s3 −3 −1 + 1 s−1 s − 5 = (s − 5)(1 + 6 s3 ) + 3(−1 + 1 s−1 ) |A2| = s + 2 1 + 6 s3 4 −1 + 1 s−1 = (s + 2)(−1 + 1 s−1 ) − 4 − 24 s3 . Hence, L(x)(s) = 1 (s − 1)(s − 2) ( (s − 5)(s3 + 6) s3 − 3 s − 2 s − 1 ) , L(y)(s) = 1 (s − 1)(s − 2) ( (s + 2)(2 − s) s − 1 − 4s3 + 24 s3 ) . Decomposing to partial fractions, the Laplace images of the solutions can be expressed as follows: L(x)(s) = − 39 2s2 − 3 (s−1)2 + 28 s−1 − 21 4(s−2) − 15 s3 − 87 4s , L(x)(s) = −18 s2 − 3 (s−1)2 + 27 s−1 − 7 s−2 − 12 s3 − 21 s . Now, the inverse transform yields the solution of this Cauchy problem: x(t) = −39 2 t − 3tet + 28et − 21 4 e2t − 15 2 t2 − 87 4 , y(t) = −18t − 3tet +27et − 7e2t − 6t2 − 21 . □ N. Numerical solution of differential equations Now, we present two simple exercises on applying the Euler method for solving differential equations. 8.N.1. Use the Euler method to solve the equation y′ = −y2 with the initial condition y(1) = 1. Determine the approximate solution on the interval [1, 3]. Try to estimate for which value h of the step is the error less than one tenth. 
Solution. The Euler method for the considered equation is given by yk+1 = yk − h · y2 k for x0 = 1, y0 = 1, xk = x0 + k · h, yk = y(xk). We begin the procedure with step value h = 1 and halve it in each iteration. The estimate for the “sufficiency” of h will be made somewhat imprecisely by comparing two adjacent CHAPTER 8. CALCULUS WITH MORE VARIABLES The famous theorem below says that the solution of the most general system with analytic right-hand side is analytic in all the parameters as well (including the initial conditions). ODE version of Cauchy-Kovalevskaya Theorem Theorem. Assume f(y) is a real analytic vector valued function on a domain in Rn and consider the differential equation (1). Then the unique solution of this initial problem is real analytic, including the dependancy on the initial condition. Proof. The idea of the proof is identical as in the simple one-dimensional case in 6.2.22. As we saw in the beginning of the previous paragraph, there are universal (multidimensional) polynomial expressions for all derivatives of the vector function y(t) in terms of the partial derivatives of the vector function f. If we expand them in terms of the individual partial derivatives of the mapping f all of their coefficients are obviously non-negative. Let us write again y(k) (0) = Pk(f(y(0)), . . . , ∂βf(y(0)), . . . ) for these multivariate vector valued polynomials (the multiindices β in the arguments are all of size up to k − 1). Without loss of generality we may consider the initial condition t0 = 0, y(0) = 0. Indeed, constant shifts of the variables (say z = y − y0, x = t − t0) transform the general case to this one. Once we know that the components of the solution are power series, the transformed quantities will be analytic too, including the dependancy on the values of the incital conditions. In order to prove that the solution to the problem y′ = f(y), y(0) = 0 is analytic on a neighborhood of the origin, we shall again look for a majorant g for the vector equation y′ = f(y), i.e. we want an analytic function on a neighborhood of the origin 0 ∈ Rn with ∂αg(0) ≥ |∂αf(0)|, for all multi-indices α. Then, by the universal computations of all the coefficients of the power series y(t) = ∑∞ k=0 1 k! y(k) (0)tk solving potentially our problem, and similarly for z′ = g(z), the convergence of the series for z implies the same for y: z(k) (0) = Pk ( g(0), . . . , ∂βg(0), . . . ) ≥ Pk ( |f(0)|, . . . , |∂βf(0)| ) ≥ |y(k) (0)|. As usual, knowing already how to find a majorant in a simpler case, we try to apply a straightforward modification. By the analycity of f, for r > 0 small enough there is a constant C such that | 1 α! ∂αfi(0)r|α| | ≤ C, for all i = 1, . . . , n and mutli-indices α. This means |∂αfi(0)| ≤ C α! r|α| . In the 1-dimensional case, we considered the multiple of a geometric series g(z) = C r r−z with the right derivatives g(n) = C n! rn . Now the most similar mapping is g(z1, . . . , zn) = (g1(z1, . . . , zn), . . . , gn(z1, . . . , zn)) with 781 approximate values of the function y at common points, terminating the procedure if the maximum of the absolute difference of these values is not greater than the tolerated error (0.1). The results h0 = 1 y(0) = (1 0 0) h1 = 0.5 y(1) = (1 0.5 0.375 0.3047 0.2583) Maximal difference: 0.375. h2 = 0.25 y(2) = (1.0000 0.7500 0.6094 0.5165 0.4498 0.3992 0.3594 0.3271 0.3004) Maximal difference: 0.1094. 
h₃ = 0.125: y(3) = (1.0000, 0.8750, 0.7793, 0.7034, 0.6415, 0.5901, 0.5466, 0.5092, 0.4768, 0.4484, 0.4233, 0.4009, 0.3808, 0.3627, 0.3462, 0.3312, 0.3175); maximal difference: 0.0322.

Using suitable software, a graphical representation of the results can be obtained. [Figure: the Euler approximations for the steps h = 1, 0.5, 0.25, 0.125 on the interval [1, 3]; the dashed curve corresponds to the exact solution, which is the function y = 1/x.] □

8.N.2. Using the Euler method, solve the equation y′ = −2y with the initial condition y(0) = 1 and step value h = 1. Explain the phenomenon which occurs here and suggest another procedure.

Solution. In this case, the Euler method is given by
\[ y_{k+1} = y_k - h \cdot 2y_k = -y_k. \]
For the initial condition y₀ = 1, we get the alternating values ±1 as the result. This is a typical manifestation of the instability of this method for large step values h. If the step cannot be reduced for some reason (for instance, when processing digital data, the step value is fixed), better results can be achieved by the so-called implicit Euler method. For a general equation y′ = f(x, y), it is given by the formula
\[ y_{k+1} = y_k + h \cdot f(x_{k+1}, y_{k+1}). \]

all the components gᵢ equal to the function h : Rⁿ → R,
\[ h(z_1, \dots, z_n) = C\,\frac{r}{r - z_1 - \dots - z_n}. \]
Then the values of all the partial derivatives with |α| = k at z = 0 are
\[ \partial^{\alpha}h(0) = C r\, k!\,(r - z_1 - \dots - z_n)^{-k-1}\big|_{z=0} = C\,\frac{k!}{r^k}, \]
exactly as suitable. (Check the latter simple computation yourself!)

So it remains to prove that the majorant system z′ = g(z) has a converging power series solution z. Obviously, by the symmetry of g (all components equal to the same h, and h is symmetric in the variables zᵢ), the solution z with z(0) = 0 must also have all components equal (the system does not see any permutation of the variables zᵢ at all). Let us write zᵢ(t) = u(t) for the common solution components. With this ansatz,
\[ u'(t) = h(u(t), u(t), \dots, u(t)) = C\,\frac{r}{r - n\,u(t)}. \]
This is nearly exactly the same equation as the one in 6.2.22, and we can easily see its solution with u(0) = 0:
\[ u = \frac{r}{n}\Big(1 - \sqrt{1 - \frac{2nCt}{r}}\Big). \]
Clearly, this is an analytic solution and the proof is finished. □

8.3.14. Vector fields and their flows. Before going to higher-order equations, pause to consider systems of first-order equations from the geometrical point of view. When drawing illustrations of solutions earlier, we already viewed the right-hand side of an autonomous system as a "field of vectors" f(x) ∈ Rⁿ. This shows how fast and in which direction the solution should move in time. This can be formalized.

A tangent vector with a footpoint x ∈ Rⁿ is a couple (x, v) ∈ Rⁿ × Rⁿ. The set of all vectors with footpoints in an open set U ⊂ Rⁿ is called the tangent bundle TU, with the footpoint projection p : (x, v) ↦ x. A vector field X defined on an open set U ⊂ Rⁿ is a mapping X : U → TU which is a section of the projection p, i.e., p ∘ X = id_U.

The derivative in the direction of the vector field X is defined for all differentiable functions g on U by X(g) : U → R, X(g)(x) = dg(x)(X(x)). So the vector field X is a first order linear differential operator mapping functions to functions. Applying the properties of the directional derivative pointwise, we obtain the derivative rule (also called the Leibniz rule) for products of functions:

(1) X(gh) = hX(g) + gX(h).

In fixed coordinates, X(x) = (X₁(x), …, Xₙ(x)) and
\[ X(g)(x) = X_1(x)\frac{\partial g}{\partial x_1}(x) + \dots + X_n(x)\frac{\partial g}{\partial x_n}(x). \]
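The step-halving experiment from 8.N.1 can also be reproduced programmatically. The following Python sketch is only our own illustration (the helper euler and the printing format are our choices, not part of the text); it reprints the maximal differences 0.375, 0.1094 and 0.0322 obtained above.

    import numpy as np

    def euler(f, t0, y0, t_end, h):
        # the explicit Euler scheme y_{k+1} = y_k + h * f(t_k, y_k)
        ts = np.arange(t0, t_end + h / 2, h)
        ys = np.empty_like(ts)
        ys[0] = y0
        for k in range(len(ts) - 1):
            ys[k + 1] = ys[k] + h * f(ts[k], ys[k])
        return ys

    f = lambda t, y: -y * y                      # the equation y' = -y^2 from 8.N.1
    h, y_prev = 1.0, euler(f, 1.0, 1.0, 3.0, 1.0)
    while True:
        h /= 2
        y = euler(f, 1.0, 1.0, 3.0, h)
        diff = np.max(np.abs(y[::2] - y_prev))   # compare at the common points
        print(f"h = {h:7.4f}, maximal difference = {diff:.4f}")
        if diff <= 0.1:                          # the tolerated error
            break
        y_prev = y

Halving the step until two successive approximations agree within the tolerance is exactly the imprecise stopping rule described in the solution above.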
In general, we thus have to solve a non-linear equation in each step. However, in our problem, we get y_{k+1} = y_k − 2h·y_{k+1}, so for h = 1 we have y_{k+1} = (1/3)y_k. Again, the obtained results can be represented graphically, together with the exact solution of the equation. [Figure: the computed approximations on [0, 10] together with the exact solution y = e^{−2x}.] □

Clearly, there are the special vector fields whose coordinate functions are all zero except for one function Xᵢ which is identically one. Such a field corresponds to the partial derivative with respect to the variable xᵢ. This is also matched by the common notation ∂/∂xᵢ for such vector fields, and in general,
\[ X(x) = X_1(x)\frac{\partial}{\partial x_1} + \dots + X_n(x)\frac{\partial}{\partial x_n}. \]

Remark. Actually, each derivative on functions, i.e., a linear operator D satisfying (1), is given by a unique vector field. This may be seen as follows. First, D(1) = D(1 · 1) = 2D(1), and thus D(c) = 0 for constant functions. Next, each function f(x) can be written on a neighborhood of a point q ∈ Rⁿ as
\[ f(x) = f(q) + \int_0^1 \frac{d}{dt} f(q + t(x - q))\,dt = f(q) + \sum_{i=1}^n \int_0^1 \frac{\partial f}{\partial x_i}(q + t(x - q))\,dt\,(x_i - q_i) = f(q) + \sum_{i=1}^n \alpha_i(x)(x_i - q_i). \]
Thus,
\[ D(f) = 0 + \sum_{i=1}^n D(\alpha_i)(q_i - q_i) + \sum_{i=1}^n \alpha_i(q) D(x_i) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(q)\, D(x_i). \]
Defining the components Xᵢ = D(xᵢ) of the vector field X, we have obtained D as the derivative in the direction of X.

We shall write X(U) for the set of all smooth vector fields on U, i.e. those with all components Xᵢ smooth. The vector fields ∂/∂xᵢ can be perceived as generators of X(U), admitting smooth functions as the coefficients in linear combinations.

We return to the problem of finding the solution of a system of equations. Rephrase it equivalently as finding a curve which satisfies x′(t) = X(x(t)) for each value x(t) in the domain of the vector field X. In words: the tangent vector of the curve is given, at each of its points, by the vector field X. Such a curve is called an integral curve of the vector field X, and the mapping Fl^X_t : Rⁿ → Rⁿ, defined at a point x₀ as the value of the integral curve x(t) satisfying x(0) = x₀, is called the flow of the vector field X.

The theorem about the existence and uniqueness of solutions of systems of equations (cf. 8.3.6) says that for every continuously differentiable vector field X, its flow exists at every point x₀ of the domain for sufficiently small values of t. The uniqueness guarantees that Fl^X_{t+s}(x) = Fl^X_t ∘ Fl^X_s(x) whenever both sides exist. In particular, the mappings Fl^X_s and Fl^X_t always commute. Moreover, the mapping Fl^X_t(x) with a fixed parameter t is differentiable at all points x where it is defined, cf. 8.3.12.

If a vector field X is defined on all of Rⁿ, and if its support is compact (i.e., X(x) = 0 off a compact set K ⊂ Rⁿ), then its flow clearly exists at all points and for all values of t. Vector fields with flows existing for all t ∈ R are called complete. The flow of a complete vector field consists of (mutually commuting) diffeomorphisms Fl^X_t : Rⁿ → Rⁿ with inverse diffeomorphisms Fl^X_{−t}.

A simple example of a complete vector field is the field X(x) = ∂/∂x₁. Its flow is given by Fl^X_t(x₁, …, xₙ) = (x₁ + t, x₂, …, xₙ). On the other hand, the vector field X(x) = x² d/dx on the one-dimensional space R is not complete, since the solutions x(t) of the corresponding equation x′ = x² are of the form x(t) = 1/(C − t), except for the one with initial condition x(0) = 0, so they "run away" towards infinite values in a finite time.
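The incompleteness of the last vector field can be watched numerically, too. A minimal sketch, assuming nothing beyond the example itself (the escape threshold 1e8 and the step 1e-4 are arbitrary choices of ours): integrating x′ = x² with x(0) = 1 by the explicit Euler method, the computed trajectory leaves every bounded region close to the blow-up time t = 1 of the exact solution x(t) = 1/(1 − t).

    # x' = x^2 with x(0) = 1: the exact solution 1/(1 - t) blows up at t = 1
    h, t, x = 1e-4, 0.0, 1.0
    while x < 1e8 and t < 2.0:       # 1e8 is an arbitrary escape threshold
        x += h * x * x               # one explicit Euler step
        t += h
    print(t)                         # prints a value close to 1.0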
The points x₀ in the domain of a vector field X : U ⊂ Rⁿ → Rⁿ where X(x₀) = 0 are called singular points of the vector field X. Clearly Fl^X_t(x₀) = x₀ for all t at all singular points.

8.3.15. Local qualitative description. The description of vector fields as an assignment of a tangent vector in the modelling space to each point of the Euclidean space is independent of the coordinates. It follows that the flows exhibit a geometric concept which must be coordinate-free. It is necessary to know what happens to the fields and their flows when the coordinates are transformed.

Suppose y = F(x) is such a transformation with F : Rⁿ → Rⁿ (or on some smaller domain there). Then the solutions x(t) to a system x′ = X(x) satisfy x′(t) = X(x(t)), and in the transformed coordinates this reads
\[ y'(t) = \big(F(x(t))\big)'(t) = D^1F(x(t)) \cdot x'(t) = D^1F(x(t)) \cdot X(x(t)). \]
This means that the "transformed field" Y in the new coordinates is Y(F(x)) = D¹F(x) · X(x). At the same time, the flows of these vector fields are related as follows:
\[ \mathrm{Fl}^Y_t \circ F(x) = F \circ \mathrm{Fl}^X_t(x). \]
Indeed, fixing x = x₀ and writing x(t) = Fl^X_t(x₀), the curve F(x(t)) is the unique solution of the system of equations y′ = Y(y) with initial condition y₀ = F(x₀), and this is exactly the right-hand side.

The following theorem offers a local qualitative geometric description of all solutions of systems of first order ordinary differential equations in a neighbourhood of each point x which is not singular.

The flowbox theorem

Theorem. If X is a differentiable vector field defined on a neighbourhood of a point x₀ ∈ Rⁿ and X(x₀) ≠ 0, then there exists a transformation of coordinates F such that in the new coordinates y = F(x), the vector field X is given as the field ∂/∂y₁.

Proof. Construct a diffeomorphism F with the required properties step by step. Geometrically, the essence of the proof can be summarized as follows: first select a hypersurface which goes through the point x₀ and is complementary to the directions X(x) near x₀. Then fix the coordinates on it, and finally, extend them to some neighbourhood of the point x₀ using the flow of the field X.

Without loss of generality, move the point x₀ to the origin by a translation. Then, by a suitable linear transformation on Rⁿ, arrange X(0) = ∂/∂x₁(0). Let us write ξ for the vector field in question, independently of any coordinates. In the standard coordinates on Rⁿ, write the flow Fl^ξ_t of the field ξ going through the point (x₁, …, xₙ) at time t = 0 as
\[ x_i(t) = \varphi_i(t, x_1, \dots, x_n), \qquad i = 1, \dots, n. \]
Next, define the new coordinates y = (y₁, …, yₙ) by y = Fl^ξ_{y₁}(0, y₂, …, yₙ), which corresponds to the inverse transformation F⁻¹ of the diffeomorphism F = (f₁, …, fₙ) with the components fᵢ(x₁, …, xₙ) = φᵢ(x₁, 0, x₂, …, xₙ). This follows exactly the strategy using the hypersurface x₁ = 0.

Since ξ(0, …, 0) = ∂/∂x₁, we get
\[ \frac{\partial F}{\partial x_1}(0, \dots, 0) = \frac{d}{dt}\Big|_0 \big(\varphi_1(t, 0, \dots, 0), \dots, \varphi_n(t, 0, \dots, 0)\big) = (1, 0, \dots, 0), \]
while the flow Fl^ξ_0 at the time t = 0 yields (φ₁, …, φₙ)(0, 0, x₂, …, xₙ) = (0, x₂, …, xₙ), and in particular
\[ \frac{\partial F}{\partial x_i}(0, \dots, 0) = (0, \dots, 1, \dots, 0), \qquad i = 2, \dots, n. \]
Therefore, the Jacobi matrix of the mapping F at the origin is the identity matrix E, so F is a transformation of coordinates on some neighbourhood (see the inverse mapping theorem in paragraph 8.1.23). Directly from the definition of the mapping F⁻¹ we can compute
\[ F^{-1}\big(\mathrm{Fl}^{\xi}_t(y)\big) = F^{-1}\big(\mathrm{Fl}^{\xi}_{t+y_1}(0, y_2, \dots, y_n)\big) = F^{-1} \circ F(t + y_1, y_2, \dots, y_n) = (t + y_1, y_2, \dots, y_n), \]
and this is the desired coordinate description of the flow in the new coordinates. □

8.3.16. Higher-order equations. An ordinary differential equation of order k (solved with respect to the highest derivative) is an equation

(1) $y^{(k)} = f(t, y, y', \dots, y^{(k-1)})$,

where f is a known function of k + 1 variables, t is the independent variable, and y(t) is an unknown function of one variable. This type of equation is always equivalent to a system of k first-order equations. Introduce new unknown functions of the variable t as follows:
\[ y_0(t) = y(t),\quad y_1(t) = y_0'(t),\ \dots,\ y_{k-1}(t) = y_{k-2}'(t). \]
Now, the function y(t) is a solution of the original equation (1) if and only if it is the first component of the solution of the system of equations
\[ y_0' = y_1,\quad y_1' = y_2,\quad \dots,\quad y_{k-2}' = y_{k-1},\quad y_{k-1}' = f(t, y_0, y_1, \dots, y_{k-1}). \]
Hence the following direct corollary of the theorems 8.3.9–8.3.12:

Solutions of higher-order ODEs

Theorem. Consider a function f(t, y₀, …, y_{k−1}) : U ⊂ R^{k+1} → R with continuous partial derivatives on an open set U. Then for every point (t₀, z₀, …, z_{k−1}) ∈ U, there exists a maximal interval I_max = (t₀ − a, t₀ + b), with positive numbers a, b ∈ R, and a unique function y(t) : I_max → R which is a solution of the k-th order equation
\[ y^{(k)} = f(t, y, y', \dots, y^{(k-1)}) \]
with the initial conditions y(t₀) = z₀, y′(t₀) = z₁, …, y^{(k−1)}(t₀) = z_{k−1}.

This solution depends differentiably on the initial conditions and on potential further parameters entering the function f differentiably. Moreover, the solution is analytic if the latter dependence is analytic.

In particular, the theorem shows that in order to determine the solution of an ordinary k-th order differential equation unambiguously, the values of the solution and of its first k − 1 derivatives must be prescribed at one point. With a system of ℓ equations of order k, the same procedure transforms the system into a system of kℓ first-order equations. Therefore, an analogous statement about existence, uniqueness, continuity, and differentiability is also true. If the right-hand side f of the equation is differentiable up to order r, or analytic, including the parameters, then the same property is enjoyed by the solutions as well.

8.3.17. Linear differential equations. The operation of differentiation can be viewed as a linear mapping from (sufficiently) smooth functions to functions. Multiplying the derivatives (d/dt)^j of the particular orders j by fixed functions a_j(t) and adding these expressions gives the linear differential operators y(t) ↦ D(y)(t):
\[ D(y)(t) = a_k(t)\,y^{(k)}(t) + \dots + a_1(t)\,y'(t) + a_0(t)\,y(t). \]
To solve the corresponding homogeneous linear differential equation of order k then means finding a function y satisfying D(y) = 0.

The sum of two solutions is again a solution, since for any functions y₁ and y₂,
\[ D(y_1 + y_2)(t) = D(y_1)(t) + D(y_2)(t). \]
A constant multiple of a solution is again a solution. So the set of all solutions of a k-th order homogeneous linear differential equation is a vector space. Applying the previous theorem about existence and uniqueness, we obtain the following:

The space of solutions of linear equations

Theorem. The set of all solutions of a homogeneous linear differential equation of order k with continuously differentiable coefficients is a vector space of dimension k. Therefore, the solutions can be described as linear combinations of any set of k linearly independent solutions.
Such solutions are determined uniquely by linearly independent initial conditions on the value of the function y(t) and its first k − 1 derivatives at a fixed point t0. Proof. Choose k linearly independent initial conditions at a fixed point. For each of them, there is a unique solution. A linear combination of these initial condition then leads to the same linear combination of the corresponding solutions. All of the possible initial conditions are exhausted, so the entire space of solutions of the equation is obtained in this way. □ The same arguments as with the first order linear differential equations in the paragraph 8.3.4 reveal that all solutions of the non-homogeneous k-th order equation D(y) = b(t) with a fixed continuous function b(t) are the sums of one fixed solution y(t) of this problem and all solutions ˜y of the corresponding homogeneous equation. Thus the entire space of solutions is an affine k-dimensional space of functions. The method of variation of constants exploited in 8.3.4 is one of the possible approaches to guess one non-homogeneous solution if we know the complete solution to the homogeneous problem. We shall illustrate the latter results on the most simple case: 8.3.18. Linear equations with constant coefficients. The previous discussion recalls the situation with homogeneous linear difference equations dealt with in paragraph 3.2.1 of the third chapter. The analogy goes further when all of the coefficients aj of the differential operator D are constant. Such first-order equations (1) have solutions as an exponential with an appropriate constant at the argument. Just as in the case of difference equations, it suggests trying whether such a form of the solution y(t) = eλt with an unknown parameter λ can satisfy an equation of order k. Substitution yields D(eλt ) = ( akλk + ak−1λk−1 + · · · + a1λ + a0 ) eλt . The parameter λ leads to a solution of a linear differential equation with constant coefficients if and only if λ is a root of the characteristic polynomial akλk + · · · + a1λ + a0. 787 CHAPTER 8. CALCULUS WITH MORE VARIABLES If the root λ = a + i b ∈ C is not real, then we may consider the complex valued solution eλt . Since the conjugate root is also involved, we may consider the linear combinations of the solutions eλt and e ¯λt , providing the real solutions eat sin(bt) and eat cos(bt). If the characteristic polynomial has k distinct roots, then we have the basis of the whole vector space of solutions. Otherwise, if λ is a multiple root, then direct calculation, making use of the fact that λ is then a root of the derivative of the characteristic polynomial as well, yields that the function y(t) = t eλt is also a solution. Similarly, for higher multiplicities ℓ, There are ℓ distinct solutions eλt , t eλt , . . . , tℓ−1 eλt . In the case of a general linear differential equation, a nonzero value of the differential operator D is wanted. Again, as for systems of linear equations or linear difference equations, the general solution of this type of (non-homogeneous) equa- tions D(y) = b(t), for a fixed function b(t), is the sum of an arbitrary solution of this equation and the set of all solutions of the corresponding homogeneous equation D(y)(t) = 0. The entire space of solutions is a finite-dimensional affine space, hidden in the huge space of functions. The methods for finding a particular solution are introduced in concrete examples in the other column. 
In principle, they are based on looking for the solution in a similar form as the right-hand side is, or the method of variation of the constants. 8.3.19. Matrix systems with constant coefficients. Before leaving the area of differential equations, consider a very special case of first-order systems, whose right-hand side is given by multiplication of a matrix A ∈ Matn(R) of constant coefficients and an n2 -dimensional unknown matrix function Y (t): (1) Y ′ (t) = A · Y (t). Clearly this is a strict analogy to the iterative models in chapter 3 and we also met such a system of n2 equations 8.3.12(2) when preparing the proof of Theorem 8.3.12. Combine knowledge from linear algebra and univariate function analysis to guess the solution. Define the exponential of a matrix by the formula B(t) = etA = ∞∑ k=0 tk k! Ak . The right-hand expression can be formally viewed as a matrix whose entries bij are infinite series created from the mentioned products. If all entries of A are estimated by the maximum of their absolute values ∥A∥ = C, then the absolute value of the k-th summand in bij(t) is at most |t|k k! nk C2k . Hence, every series bij(t) is necessarily absolutely and uniformly convergent, and it is bound above by the value e|t|nC2 . Differentiate the terms of the series one by one, to get a uniformly convergent series with limit A etA . Therefore, by the 788 CHAPTER 8. CALCULUS WITH MORE VARIABLES general properties of uniformly convergent series, the deriva- tive d dt ( etA ) = A etA also equals this expression. The general solution of the system (1) is obtained in the form Y (t) = etA ·Y0, where Y0 ∈ Matn(R) is the arbitrary initial condition Y (0) = Y0. The exponential etA is a well defined invertible matrix for all t. So we have a vector space of the proper dimension, and hence all solutions to the system (1). Notice that in order to get a solution, it is necessary to multiply by Y0 from the right. It is remarkable that dealing with a vector equation with a constant matrix A ∈ Matn(R), (2) y′ = A · y, for an unknown function y : R → Rn , then the columns of the matrix exponential etA provide n linearly independent solutions. The general solution is then given by linear combinations of them. The general solutions of the system (2) may be understood much better by invoking some linear algebra – the Jordan canonical form of linear mappings, see e.g. 3.4.10. In terms of vector fields X, the system has the linear expression X(y) = Φ(y) where Φ is the linear mapping with the matrix A in coordinates. Clearly linear transformations of the system lead to another vector field with such linear description, since the differential of a linear mapping is the mapping itself. Any linear transformation of coordinates with the (constant) matrix T transforms the system into ˜y′ = (Ty)′ = (TAT−1 ) · (Ty) = ˜A · ˜y. In particular, a suitable change of coordinates T provides the matrix ˜A in the Jordan canonical form expressing Φ as a sum of two commuting linear mappings Φ = Φd + Φn with Φd diagonalisable and Φn nilpotent. Moreover, the decomposition of the nilpotent part into the sum of cyclic nilpotent mappings provides the Jordan blocks Jλ =   λ 1 . . . 0 0 λ . . . 0 . . . . . . . . . 1 0 0 . . . λ   = (Jλ)d + (Jλ)n =   λ 0 . . . 0 0 λ . . . 0 . . . . . . . . . 0 0 0 . . . λ   +   0 1 . . . 0 0 0 . . . 0 . . . . . . . . . 1 0 0 . . . 0  . 
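The defining series of e^{tA} from 8.3.19 can also be checked numerically before the Jordan-form analysis below. The following Python sketch is our own illustration (the truncation at 60 terms and the test matrix are arbitrary choices); it sums the series for a rotation generator, for which the exponential is known in closed form, and verifies that d/dt e^{tA} = A e^{tA} by a difference quotient.

    import numpy as np

    def expm_series(A, t, terms=60):
        # truncated power series e^{tA} = sum_k t^k/k! A^k
        result, term = np.eye(A.shape[0]), np.eye(A.shape[0])
        for k in range(1, terms):
            term = term @ (t * A) / k          # now term == (tA)^k / k!
            result = result + term
        return result

    A = np.array([[0.0, 1.0], [-1.0, 0.0]])    # A^2 = -E, a rotation generator
    t = 1.3
    E_t = expm_series(A, t)
    # closed form here: e^{tA} = cos(t) E + sin(t) A
    print(np.allclose(E_t, np.cos(t) * np.eye(2) + np.sin(t) * A))
    # derivative check: (e^{(t+eps)A} - e^{tA}) / eps should be close to A e^{tA}
    eps = 1e-6
    print(np.allclose((expm_series(A, t + eps) - E_t) / eps, A @ E_t, atol=1e-4))

The columns of the computed matrix are then two linearly independent solutions of y′ = A·y, in accordance with the discussion above.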
Splitting the system (2) into a block-wise diagonal form splits also the space of solutions generated by the exponential e^{tA} into the corresponding blocks (all the powers A^k enjoy the same block structure). So we may work with the matrix A already in the form of one such block J_λ = (J_λ)_d + (J_λ)_n.

But for any two commuting matrices C and D, the exponentials e^{tC} and e^{tD} commute and satisfy e^{t(C+D)} = e^{tC}·e^{tD}. So the exponential e^{tD} of the nilpotent D = (J_λ)_n can be computed as the finite sum
\[ e^{tD} = E + t\begin{pmatrix} 0 & 1 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ \vdots & & \ddots & 1 \\ 0 & 0 & \dots & 0 \end{pmatrix} + \dots + \frac{t^{k-1}}{(k-1)!}\begin{pmatrix} 0 & 0 & \dots & 1 \\ 0 & 0 & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & 0 \end{pmatrix}, \]
where k is the size of the block and E is the identity matrix. The solution of the corresponding matrix system is
\[ Y(t) = e^{\lambda t E} \cdot e^{tD} = e^{\lambda t}\, e^{tD} = \begin{pmatrix} e^{\lambda t} & t\,e^{\lambda t} & \dots & \frac{t^{k-1}}{(k-1)!}\,e^{\lambda t} \\ 0 & e^{\lambda t} & \dots & \frac{t^{k-2}}{(k-2)!}\,e^{\lambda t} \\ \vdots & & \ddots & t\,e^{\lambda t} \\ 0 & 0 & \dots & e^{\lambda t} \end{pmatrix}. \]
Finally, k independent solutions can be written down by inspecting the individual columns of Y(t). Notice that the canonical basis (e₁, …, e_k) provides just the chain of vectors with D(eᵢ) = e_{i−1}, i = 2, …, k, while D(e₁) = 0. Now the k independent solutions are:
\[ y_1(t) = e^{\lambda t} e_1, \quad y_2(t) = e^{\lambda t}\big(t e_1 + e_2\big), \quad \dots, \quad y_k(t) = e^{\lambda t}\Big(\frac{t^{k-1}}{(k-1)!}\,e_1 + \frac{t^{k-2}}{(k-2)!}\,e_2 + \dots + t\,e_{k-1} + e_k\Big). \]
The result can be easily transferred back to the original coordinates in which the system (2) was given. Finding the decomposition of the space into Jordan blocks and finding the chains of basis vectors vᵢ realizing the cyclic nilpotent components, we arrive at the independent solutions by replacing the eᵢ by vᵢ.

The findings are summarised in the following theorem, one of many attributed to Euler.

Theorem. All solutions of the system (2) are linear combinations of solutions of the form combining exponential and polynomial expressions,
\[ y(t) = e^{\lambda t} \sum_{j=0}^{k} p_j t^j, \]
where k is the order of nilpotency of the Jordan block corresponding to the eigenvalue λ of the matrix A, and p_j are suitable constant vectors. In particular, if the nilpotent part of A is trivial, then k = 0.

This important result allows many generalizations. For example, the Floquet–Lyapunov theory generalizes this behaviour of solutions to systems with periodically time-dependent matrices A(t).

8.3.20. Return to singular points. Finally, recall the first-order matrix system in paragraph 8.3.12, where the derivative of the solutions of vector equations with respect to the initial conditions was discussed. Consider a differentiable vector field X(x) defined on a neighbourhood of its singular point x₀ ∈ Rⁿ, i.e. X(x₀) = 0. Then the point x₀ is a fixed point of its flow Fl^X_t(x).

The differential Φ(t) = D_x Fl^X_t(x₀) satisfies the matrix system with initial condition (see (2) on page 778)
\[ \Phi'(t) = D^1X(x_0) \cdot \Phi(t), \qquad \Phi(0) = E. \]
The important point is that the differential D¹X is evaluated along the constant flow line x₀, since this is a singular point of the system. The solution is known explicitly, and it describes the evolution of the differential Φ(t) of the vector field's flow at the singular point x₀:
\[ \Phi(t) = e^{tA}, \qquad A = D^1X(x_0). \]
This is a useful step in analysing the qualitative behaviour in a neighbourhood of the stationary point x₀.

Consider the Lotka–Volterra system from this point of view. Use the coordinates (x, y) and the parameters α, β, γ, δ exactly as in 8.3.10. In particular, all these quantities are assumed to be positive. The vector field in question is
\[ X(x, y) = \big(x(\alpha - \beta y),\; y(-\gamma + \delta\beta x)\big). \]
So there is a single singular point (apart from (0, 0)),
\[ (x_0, y_0) = \Big(\frac{\gamma}{\delta\beta},\, \frac{\alpha}{\beta}\Big), \]
and the differential of X at this point is
\[ D^1X(x_0, y_0) = \begin{pmatrix} \alpha - \beta y_0 & -\beta x_0 \\ \delta\beta y_0 & \delta\beta x_0 - \gamma \end{pmatrix} = \begin{pmatrix} 0 & -\frac{\gamma}{\delta} \\ \alpha\delta & 0 \end{pmatrix}. \]
The characteristic polynomial of the latter matrix is λ² + γα, and so there are two complex conjugate roots λ = ±i√(αγ). As is known from linear algebra, such a matrix describes a rotation in suitable coordinates. Computing the real and imaginary components of the eigenvectors corresponding to λ (as developed in linear algebra), we obtain the matrix solution in the form
\[ Y(t) = \begin{pmatrix} \cos\sqrt{\alpha\gamma}\,t & -\sin\sqrt{\alpha\gamma}\,t \\ \delta\sqrt{\tfrac{\alpha}{\gamma}}\,\sin\sqrt{\alpha\gamma}\,t & \delta\sqrt{\tfrac{\alpha}{\gamma}}\,\cos\sqrt{\alpha\gamma}\,t \end{pmatrix}. \]
The columns are two independent solutions y(t) (they differ just by the phase of the linearly distorted rotation). This might be useful information for further analysis of the model around its singular point. For example, the parameter β does not appear explicitly here, while δ influences the distortion of the flow lines from being circles. Compare this with the illustrations on page 774. As a first approximation, we may guess that the sizes of the populations of both the prey and the predator oscillate regularly around the values of the singular point if the initial conditions are near this point.

8.3.21. A note about Markov chains. In the third chapter, we dealt with iterative processes, where the stochastic matrices and the Markov processes determined by them played an interesting role. Recall that a matrix A is stochastic if the sum of each of its columns is one. In other words,
\[ (1 \dots 1) \cdot A = (1 \dots 1). \]

Take the exponential e^{tA} to obtain
\[ (1 \dots 1) \cdot e^{tA} = \sum_{k=0}^{\infty} \frac{t^k}{k!}\,(1 \dots 1) \cdot A^k = e^t\,(1 \dots 1). \]
Therefore, for every t, the invertible matrix B(t) = e^{−t} e^{tA} is stochastic if A is stochastic. Adding stochastic initial conditions B₀, we get the flow B(t) = e^{−t} e^{tA} · B₀, which is a continuous version of the Markov process (infinitesimally) generated by the stochastic matrix A. Differentiating with respect to t yields
\[ \frac{d}{dt} B(t) = -e^{-t} e^{tA} \cdot B_0 + e^{-t} A\, e^{tA} \cdot B_0 = (-E + A)\,B(t), \]
so the matrix B(t) is the solution of the matrix system of equations with constant coefficients
\[ Y'(t) = (A - E) \cdot Y(t) \]
with the stochastic matrix A.

This can be explained intuitively. If the matrix A is stochastic, then the instantaneous increment of the stochastic vector y(t) in the vector system with the matrix A, y′(t) = A · y(t), is again a stochastic vector. However, it is desired that the Markov process keep the vector y(t) stochastic for all t. Hence, the sum of the increments of the particular components of the vector y(t) must be zero, which is guaranteed by subtracting the identity matrix.

As seen above, the columns of the matrix solution Y(t) create a basis of all solutions y(t) of the vector system. Much information about the solutions can be obtained by using some linear algebra. For example, suppose that the matrix A is primitive, that is, suppose one of its powers has only positive entries, see 3.3.3 on page 214. Then its powers converge to a matrix A_∞, all of whose columns are eigenvectors corresponding to the eigenvalue 1.

Next, estimate the difference between the solution Y(t) for large t and the constant matrix A_∞. There are two consequences of the latter convergence. First, there exists a universal constant bound for all powers, ‖A^k − A_∞‖ ≤ C. Second, for every small positive ε, there is an N ∈ N such that ‖A^k − A_∞‖ ≤ ε for all k ≥ N. Hence,
\[ \Big\| e^{-t}\sum_{k=0}^{\infty} \frac{t^k}{k!} A^k - e^{-t}\sum_{k=0}^{\infty} \frac{t^k}{k!} A_\infty \Big\| \le e^{-t}\sum_{k<N} \frac{t^k}{k!}\,\|A^k - A_\infty\| + e^{-t}\sum_{k\ge N} \frac{t^k}{k!}\,\|A^k - A_\infty\| \le C\,e^{-t}\sum_{k<N} \frac{t^k}{k!} + \varepsilon. \]
The first summand is e^{−t} times a polynomial in t, so it tends to zero as t grows; hence the entire difference is bounded (for large t > 0) by the value ε(C + 1).
Summarizing, a very interesting statement is proved, which resembles the discrete version of Markov processes: Continuous processes with a stochastic matrix Theorem. Every primitive stochastic matrix A determines a vector system of equations y′ = (A − E) · y with the following properties: • the basis of the vector space of all solutions is given by the columns of the stochastic matrix Y (t) = e−t etA , • if the initial condition y0 = y(t0) is a stochastic vector, then the solution y(t) is also a stochastic vector for all values of t, • every stochastic solution converges for t → ∞ to the stochastic eigenvector y∞ of the matrix A corresponding to the eigenvalue 1. 8.3.22. Remarks on numerical methods. Except for the exceptionally simple equations, for example, linear equations with constant coefficients, analytically solvable equations are seldom encountered in practice. Therefore, some techniques are required to approximate the solutions of the equations. Approximations have already been considered in many other situations. (Recall the interpolation polynomials and splines, exploitation of Taylor polynomials in methods for numerical differentiation and integration, Fourier series etc.). With a little courage, consider difference and differential equations to be mutual approximations. In one direction, replace differences with differentials (for example, in economical or population models). For other situations the differences may imitate well continuous changes in models. Use the terminology for asymptotic estimates, as introduced in 6.1.12. In particular, an expression G(h) is asymptotically equal to F(h) for h approaching zero or infinity, and write G(h) = O(F(h)), if the finite limit of G(h)/F(h) ex- ists. A good example is the approximation of a multivariate function f(x) by its Taylor polynomial of order k at a point x0. Taylor’s theorem says that the error of this approximation is O(∥h∥k+1 ), where h is the increment of the argument h = x − x0. In the case of ordinary differential equations, the simplest scheme is approximation using Euler polygons. Present this method for a single ordinary equation with two quantities: one independent and one dependent. It works analogously for systems of equations where scalar quantities and their derivatives in time t are replaced with vectors dependent on time and their derivatives. This procedure was used before in the proof of the Peano’s existence theorem, see 8.3.8. 793 CHAPTER 8. CALCULUS WITH MORE VARIABLES Consider an equation y′ = f(t, y) with continuous right-hand f. Denote the discrete increment of time by h, i.e. set tn = t0 + nh. It is desired to approximate y(t). It follows from Taylor’s theorem (with remainder of order two) and the equation that y(tn+1) = y(tn) + y′ (tn)h + O(h2 ) = y(tn) + f(tn, y(tn))h + O(h2 ). Define recurrently the values yj by the first order formula yj+1 = yj + f(tj, yj)h. This leads to the local approximation error O(h2 ), occuring in one step of the recurrence. If n such steps are needed with increment h from t0 to t = tn, the error could be up to nO(h2 ) = 1 h (t−t0)O(h2 ) = O(h). More care is needed, since the function f(t, y) is evaluated in the approximate points (ti, yi) and the already approximate previous values yj. In order to keep control, f(t, y) must be Lipschitz in y. 
Assuming inductively that the estimate is true for all i < j, |f(tj, y(tj)) − f(tj, yj)| ≤ C|y(tj) − yj| ≤ C|t − t0|O(h) where C is the Lipschitz constant, assuming that the error does not exceed O(h) with globally valid constant for yj. Inductively, the expected bound O(h) for the global error estimate is obtained. Think about the details. The Euler procedure is the simplest method within the class of the Runge-Kutta methods. Dealing with higher order equations, we may either view them as vector valued first order systems (as in the theoretical column) and then even Euler method provides results for the initial condition on the necessary number of derivatives in one point. But in practical problems, it is often needed to find solutions passing through more then one prescribed point. For example, with second order equations, prescribe two values y(t1) and y(t2) of the solution. This would need completely different methods. 794 795 CHAPTER 8. CALCULUS WITH MORE VARIABLES O. Additional exercises to the whole chapter 8.O.1. A basin with volume 300 hl contains 100 hl of water in which 50 kg of salt is dissolved. Water with 2 kg of salt per 1 hl starts flowing into the basin at 6 hl/min. The mixture, being kept homogeneous by permanent stirring, leaves the basin at 4 hl/min. Express the amount of salt (in kg) in the basin after t minutes have expired as a function of the variable t ∈ [0, 100]. ⃝ 8.O.2. During a controlled experiment, a small smelting furnace is slowly cooling down while the outer temperature keeps at 300 K. The experiment began at noon. At 1 pm, the temperature in the furnace was estimated at 1300 K. At 3 pm, it was only 550 K. Supposing the measurements were accurate, compute what the temperature in the furnace was at 2 pm. ⃝ 8.O.3. The half-life of the radioactive sulfur isotope 35 S is 87.5 days. After what period are there only 900 grams left of the original amount of 1 kilogram of this isotope? (You may express the result in terms of the natural logarithm.) ⃝ 8.O.4. The half-time of a radioactive element A is 5 years; for an element B, it is 1 year. If we have 5 kg of element B and 1 kg of element A, after what period will we have the same amount of both? (You may express the result in terms of the natural logarithm.) ⃝ 8.O.5. The half-time of a radioactive element A is 8 years; for an element B, it is 2 years. If we have 3 kg of element B and 1 kg of element A, after what period will we have the same amount of both? (You may express the result in terms of the natural logarithm.) ⃝ 8.O.6. The half-life of the radioactive cobalt isotope 60 Co is 5.27 years. Having 4 kg of this isotope, after what period does 1 kg of it decay? (You may express the result in terms of the natural logarithm.) ⃝ 8.O.7. Solve the following differential equation for the function y = y(x): y′ = 1 + y2 1 + x2 . ⃝ 8.O.8. Determine all solutions of the following equation with separated variables: y − y2 + xy′ = 0. ⃝ 8.O.9. Solve the equation 1 + dy dx = ey . ⃝ 8.O.10. Solve the equation 2y = x3 y′ . ⃝ 8.O.11. Determine all solutions of the equation √ 4 − y2 dx + y dy = 0. ⃝ 8.O.12. Solve y′ tan x = y2 + 1 − 2y. ⃝ 8.O.13. Determine the general solution of the differential equation x2 +1 x = y 1−y2 y′ . ⃝ 8.O.14. Find the general solution of the differential equation (x + 1) dy + xy dx = 0. 796 CHAPTER 8. CALCULUS WITH MORE VARIABLES ⃝ 8.O.15. Find the solution of the differential equation sin y cos x dy = cos y sin x dx which satisfies 4y(0) = π. ⃝ 8.O.16. 
Solve the initial problem ( x2 + 1 ) ( y2 − 1 ) + xyy′ = 0, y(1) = √ 2. ⃝ 8.O.17. Determine the particular solution of the equation y′ sin x = y ln y which goes through the point [π/2, e]. ⃝ 8.O.18. Find all solutions of the differential equation 2 (1 + ex ) yy′ = ex , which satisfy y(0) = 0. ⃝ 8.O.19. Solve the homogeneous equation (xy′ − y) cos y x = x. ⃝ 8.O.20. Determine the general solution of the homogeneous differential equation y3 = x3 y′ . ⃝ 8.O.21. Find all solutions of the equation xy′ = √ x2 − y2 + y. ⃝ 8.O.22. Determine the general solution if we are given xy′ = y cos ( ln y x ) . ⃝ 8.O.23. Solve the equation (x + y) dx − (x − y) dy = 0 as homogeneous. ⃝ 8.O.24. Calculate y′ = (x + y)2 . ⃝ 8.O.25. Find the general solution for y′ = x−y+3 x+y−5 . ⃝ 8.O.26. Calculate y′ = x−y+1 x−y . ⃝ 8.O.27. Determine all solutions of the differential equation y′ = 5y−5x−1 2y−2x−1 . ⃝ 8.O.28. Find the general solution of the equation y′ = x−y−1 x+y+3 . 797 CHAPTER 8. CALCULUS WITH MORE VARIABLES ⃝ 8.O.29. Determine the general solution for the equation y′ = 2x−y−5 x−3y−5 . ⃝ 8.O.30. Express the solutions of the equation y′ = x+2y−7 x−3 as explicitly given functions. ⃝ 8.O.31. Using the method of constant variation, calculate y′ + 2y = x. ⃝ 8.O.32. Determine the general solution of the equation y′ = 6x + 2y + 3. ⃝ 8.O.33. Solve the linear equation y′ = 4xy + (2x + 1)e2x2 . ⃝ 8.O.34. Solve the equation y′ x + y = x ln x. ⃝ 8.O.35. Calculate the linear differential equation y′ x = y + x2 ln x. ⃝ 8.O.36. Find all solutions of the equation y′ cos x = (y + 2 cos x) sin x. ⃝ 8.O.37. Find the solution of the equation y′ = 6x − 2y which satisfies the initial condition y(0) = 0. ⃝ 8.O.38. Solve the initial problem y′ + y sin x = sin x, y (π 2 ) = 2. ⃝ 8.O.39. Find the solution of the equation y′ = 4y + cos x which goes through the point [0, 1]. ⃝ 8.O.40. Solve the following equation for any a, b ∈ R: xy′ + y = ex , y (a) = b. ⃝ 8.O.41. Determine the general solution of the equation 3x2 y′ + xy = 1 y2 . ⃝ 8.O.42. Solve the Bernoulli equation y′ = xy − y3 e−x2 . ⃝ 8.O.43. Calculate the Bernoulli equation y′ − y x = y2 sin x. 798 CHAPTER 8. CALCULUS WITH MORE VARIABLES ⃝ 8.O.44. Find all solutions of the equation y′ = 4y x + x √ y. ⃝ 8.O.45. Solve the equation xy′ + 2y + x5 y3 ex = 0. ⃝ 8.O.46. Find the general solution of the following equation provided a, b > 0: y dy = ( a y2 x2 + b 1 x2 ) dx. ⃝ 8.O.47. Interchanging the variables, solve 2y + ( y2 − 6x ) y′ = 0. ⃝ 8.O.48. Solve the equation y′ = y 2y ln y+y−x . ⃝ 8.O.49. Calculate the general solution of the following equation: x dx = ( x2 y − y3 ) dy. ⃝ 8.O.50. Interchanging the variables, calculate (x + y) dy = y dx + y ln y dy. ⃝ 8.O.51. Solve y′ (e−y − x) = 1. ⃝ 8.O.52. Calculate y′ = 1 2x−y2 . ⃝ 8.O.53. Solve the equation 2y dx + x dy = 2y3 dy. ⃝ 8.O.54. Calculate y′′ + 3y′ + 2y = (20x + 29) e3x . ⃝ 8.O.55. Find any solution of the non-homogeneous linear equation 799 CHAPTER 8. CALCULUS WITH MORE VARIABLES y′′ + y′ + 5 2 y = 25 cos (2x) . ⃝ 8.O.56. Determine the solution of the equation y′′ + 2y′ + 2y = 3e−x cos x. ⃝ 8.O.57. Find the solution of the equation y′′ = 2y′ + y + 1, which satisfies y(0) = 0 and y′ (0) = 1. ⃝ 8.O.58. Find the solution of the equation y′′ = 4y − 3y′ + 1 which satisfies y(0) = 0 and y′ (0) = 2. ⃝ 8.O.59. Determine the general solution of the linear equation y′′ − 2y′ + 5y = 5e2x sin x. ⃝ 8.O.60. Taking advantage of the special form of the right-hand side, find all solutions of the equation y′′ + y′ = x2 − x + 6e2x . 
⃝

8.O.61. Solve $y^{(4)} - 2y'' + y = 8(e^x + e^{-x}) + 4(\sin x + \cos x)$. ⃝

8.O.62. Using the method of constant variation, calculate $y'' - 2y' + y = \frac{e^x}{x}$. ⃝

8.O.63. Solve $y'' + 4y' + 4y = e^{-2x} \ln x$. ⃝

8.O.64. Using the method of constant variation, find the general solution of the equation $y'' + 4y = \frac{1}{\sin(2x)}$. ⃝

8.O.65. Solve the equation $y'' + y = \tan^2 x$. ⃝

8.O.66. Find the solution of the differential equation $y^{(3)} = -2y'' - 2y' - y + \sin(x)$ which satisfies $y(0) = -\frac{1}{2}$, $y'(0) = \frac{\sqrt{3}}{2}$, and $y''(0) = -1 - \frac{\sqrt{3}}{2}$. ⃝

8.O.67. Solve the equation $y''' - 2y'' - y' + 2y = 0$. ⃝

8.O.68. Find the general solution of the equation $y^{(4)} + 2y'' + y = 0$. ⃝

8.O.69. Solve $y^{(6)} + 2y^{(5)} + 4y^{(4)} + 4y''' + 5y'' + 2y' + 2y = 0$. ⃝

8.O.70. Find the general solution of the linear equation $y^{(5)} - 3y^{(4)} + 2y''' = 8x - 12$. ⃝

Key to the exercises

This chapter presents several glimpses towards more serious applications of the differential and integral calculus. We cannot be ambitious in covering the displayed topics extensively. Thus, after a reasonably detailed introduction to a more geometric approach to the differential and integral calculus, we present rather quick surveys of and comments on partial differential equations, variational calculus, and complex analysis. We hope the readers will get excited by at least some of them and look for further resources themselves.

1. Exterior differential calculus and integration

We have already seen how to optimize functions on subsets in Rⁿ, but how do we integrate quantities over such domains? For example, if we have a 2-dimensional membrane in R³ and we know the infinitesimal flow of some liquid through it, how do we compute how much went through within a given time interval? In order to understand such questions properly, we formalize the concept of the level sets M_b from paragraph 8.1.24, devoted to implicit functions, and we provide a geometric explanation of the integration process. Then we quite easily arrive at several powerful tools, including the Stokes theorem, which is a higher dimensional extension of the fundamental theorem of (univariate) calculus, and the Frobenius theorem, which generalizes to higher dimensions the integration of prescribed line elements into a solution of an ODE.

9.1.1. Vector fields and differential forms. Let us come back to the concept of tangent vectors and vector fields, cf. 8.3.14, where we introduced the tangent space TU = U × Rⁿ as the set of all possible tangent vectors at the points of an open subset U ⊂ Rⁿ. There is the projection p : TU → U assigning the foot points to the tangent vectors; we write T_xU for the vector space of all vectors X with p(X) = x at a point x ∈ U, and we use the notation X(U) for the set of all smooth vector fields on the open subset U. The linear combinations of the special vector fields ∂/∂xᵢ, admitting smooth functions as the coefficients, generate the entire X(U). Thus we write general vector fields as
\[ X(x) = X_1(x)\frac{\partial}{\partial x_1} + \dots + X_n(x)\frac{\partial}{\partial x_n}. \]

CHAPTER 9

Continuous models – further selected topics

"The World might be discrete in reality. But the continuous models are useful anyhow ..."

A. Exterior differential calculus

B. Applications of Stokes' theorem

9.B.1. Compute ∫_c (x − y) dx + x dy, where c is the positively oriented curve represented by the perimeter of the square ABCD with vertices A = [2, 2], B = [−2, 2], C = [−2, −2], D = [2, −2].

Solution.
Using Green’s theorem (see ??), we reduce the given curve integral to an area (multiple) integral. The integral is of the form ∫ c f(x, y) dx+g(x, y) dy, where f(x, y) = x − y and g(x, y) = x. The needed partial derivatives of the functions f(x, y) and g(x, y) are thus fy(x, y) = −1 and gx(x, y) = 1. All of the functions f(x, y), g(x, y), fy(x, y), and gx(x, y) are continuous on R2 , so we can use Green’s theorem: ∫ c (x − y)dx + xdy = ∫∫ D (1 + 1)dxdy = 2 ∫∫ D dxdy CHAPTER 9. CONTINUOUS MODELS – FURTHER SELECTED TOPICS As we know, every differentiable mapping F : U → V between two open sets U ⊂ Rn , V ⊂ Rm defines the mapping F∗ : TU → TV by applying the differential D1 F to the individual tangent vectors. Thus if y = F(x) = (f1(x), . . . , fm(x)) then F∗ : TxU → TF (x)V , F∗ ( n∑ i=1 Xi(x) ∂ ∂xi ) (y) = m∑ j=1 ( n∑ i=1 ∂fj(x) ∂xi Xi(x) ) ∂ ∂yj (y). When we studied the vector spaces in chapter two, we came accros the useful concept of linear forms. They were defined in paragraph 2.3.17 on page 122. This idea extends naturally now. A scalar valued linear mapping defined on the tangent space TxU is such a linear form at the foot point x. The vector space of all such forms T∗ x U = (TxU)∗ is thus naturally isomorphic to Rn∗ and the collection T∗ U of these spaces comes equipped by the projection to the foot points, let us denote it again by p. Having a mapping η : U ⊂ Rn → T∗ U with values η(x) ∈ T∗ x U on an open subset U, i.e., p ◦ η = idU , we talk about a differential form η on U, or a linear form. Every differentiable function f on an open subset U ⊂ Rn defines the differential form df on U (cf. 8.1.7). We use the notation Ω1 (U) for the set of all smooth linear differential forms on the open set U. In the chosen coordinates (x1, . . . , xn) we can use the differentials of the particular coordinate functions to express every linear form η as η(x) = η1(x)dx1 + · · · + ηn(x)dxn, where ηi(x) are uniquely determined functions. Such a form η evaluates on a vector field X(x) = X1(x) ∂ ∂x1 + · · · + Xn(x) ∂ ∂xn as η(X)(x) = η(x)(X(x)) = η1(x)X1(x)+· · ·+ηn(x)Xn(x). If the form η is the differential of a function f, we get just back the expression X(f)(x) = df(X(x)) = ∂f ∂x1 X1(x) + · · · + ∂f ∂xn Xn(x) for the derivative of f in the direction of the vector field X. 9.1.2. Exterior differential forms. As we discussed already in chapters 1 and 4, the volume of k-dimensional parallelepipeds S, as a quantity depending of the k vectors spanning S, is an antisymmetric k–linear form on the vectors, see 2.3.23 on page 127. Remember also the computation of the volume of parallelepipeds in terms of determinants in 4.1.19 on page 334. Thus, if we want to talk about the (linearized) volume on k-dimensional objects, we need a concept which will be linear in k distinct tangent vector arguments and will assign a scalar quantity to them. Moreover, we will require that interchanging any pair of arguments swaps the sign, in accordance with the orientations. 803 = 2 2∫ −2 2∫ −2 dxdy =2 [ x ]2 −2 · [ y ]2 −2 = 32. □ 9.B.2. Compute ∫ c x4 dx + xy dy, where c is the positively oriented curve going through the vertices A = [0, 0]; B = [1, 0]; C = [0, 1]. Solution. The curve c is the boundary of the triangle ABC. The integrated functions are continuously differentiable on the whole R2 , so we can use Green’s theorem: ∫ c x4 dx + xydy = ∫∫ D y dx dy = 1∫ 0 −x+1∫ 0 y dx dy = 1∫ 0 [ y2 2 ]−x+1 0 dx = 1∫ 0 [ x2 − 2x + 1 2 ] dx = 1 2 [ x3 3 − 2x2 2 + x ]1 0 = 1 6 . □ 9.B.3. 
Calculate ∫_c (xy + x + y) dx + (xy + x − y) dy, where c is the circle with radius 1 centered at the origin.

Solution. Again, the prerequisites of Green's theorem are satisfied, and it now gives
\[ \int_c (xy + x + y)\,dx + (xy + x - y)\,dy = \iint_D (y + 1 - x - 1)\,dx\,dy = \int_0^1\!\!\int_0^{2\pi} r^2(\sin\varphi - \cos\varphi)\,d\varphi\,dr = \int_0^1 r^2\,dr \int_0^{2\pi} (\sin\varphi - \cos\varphi)\,d\varphi = \frac{1}{3}\big[-\cos\varphi - \sin\varphi\big]_0^{2\pi} = 0. \] □

Exterior differential forms

Definition. The vector space of all k–linear antisymmetric forms on a tangent space T_xU, U ⊂ Rⁿ, will be denoted by Λ^k(T_xU)^*. We talk about exterior k–forms at the point x ∈ U. The assignment of a k–form η(x) ∈ Λ^k T^*_xU to every point x ∈ U from an open subset in Rⁿ defines an exterior differential k–form on U. The set of smooth exterior k–forms on U is denoted Ω^k(U).

Next, let us consider a smooth mapping G : V → U between two open sets V ⊂ Rᵐ and U ⊂ Rⁿ, an exterior k–form η(G(x)) ∈ Λ^k(T^*_{G(x)}U), and choose arbitrarily k vectors X₁(x), …, X_k(x) in the tangent space T_xV. Just like in the case of linear forms, we can evaluate the form η at the images of the vectors Xᵢ under the mapping y = G(x) = (g₁(x), …, gₙ(x)). This operation is called the pullback of the form η by G:
\[ G^*\big(\eta(G(x))\big)(X_1(x), \dots, X_k(x)) = \eta(G(x))\big(G_*(X_1(x)), \dots, G_*(X_k(x))\big), \]
which is an exterior form in Λ^k(T^*_xV). In the case of linear forms, this is the dual mapping to the differential D¹G. We can compute directly from the definition that, for instance,
\[ G^*(dy_i)\Big(\frac{\partial}{\partial x_k}\Big) = dy_i\Big(G_*\Big(\frac{\partial}{\partial x_k}\Big)\Big) = \frac{\partial g_i}{\partial x_k}, \]
and so

(1) $G^*(dy_i) = \frac{\partial g_i}{\partial x_1}\,dx_1 + \dots + \frac{\partial g_i}{\partial x_m}\,dx_m$,

which extends to the linear combinations of all dyᵢ over functions. Another immediate consequence of the definition is the formula for pullbacks of arbitrary k–forms by a composition of two diffeomorphisms:

(2) $(G \circ F)^*\alpha = F^*\big(G^*\alpha\big)$.

Indeed, as a mapping on k-tuples of vectors,
\[ (G \circ F)^*\alpha = \alpha \circ \big((D^1G \circ D^1F) \times \dots \times (D^1G \circ D^1F)\big) = G^*(\alpha) \circ (D^1F \times \dots \times D^1F) = F^* \circ G^*\alpha, \]
as expected.

9.1.3. Wedge product of exterior forms. Given a k–form α ∈ Λ^k Rⁿ* and an ℓ–form β ∈ Λ^ℓ Rⁿ*, we can create a (k + ℓ)–form α ∧ β by alternating over all possible permutations σ of the arguments. We just have to alternate the arguments in all possible orders and take the right sign each time:
\[ (\alpha \wedge \beta)(X_1, \dots, X_{k+\ell}) = \frac{1}{k!\,\ell!} \sum_{\sigma \in \Sigma_{k+\ell}} \operatorname{sgn}(\sigma)\, \alpha(X_{\sigma(1)}, \dots, X_{\sigma(k)})\, \beta(X_{\sigma(k+1)}, \dots, X_{\sigma(k+\ell)}). \]

9.B.4. Compute ∫_c (2e^{2x} sin y − 3y³) dx + (e^{2x} cos y + (4/3)x³) dy, where c is the positively oriented ellipse 4x² + 9y² = 36.

Solution. We will use Green's theorem, choosing the linear deformation of polar coordinates x = 3r cos φ, φ ∈ [0, 2π],
Calculate the integral ∫ c (ex sin y − xy2 )dx + ( ex cos y − 1 2 x2 y ) dy, where c is the positively oriented circle x2 +y2 +4x+4y+7 = 0. ⃝ 9.B.7. Compute ∫ c (3y − esin x ) dx + (7x + √ y4 + 1) dy, CHAPTER 9. CONTINUOUS MODELS – FURTHER SELECTED TOPICS It is clear from the definition that α ∧ β is indeed a (k + ℓ)– form. In the simplest case of 1–forms, the definition says that (α ∧ β)(X, Y ) = α(X)β(Y ) − α(Y )β(X). In the case of a 1–form α and a k–form β, we get (α ∧ β)(X0, X1, . . . , Xk) = k∑ j=0 (−1)j α(Xj)β(X0, . . . , ˆXj, . . . , Xk), where the hat indicates omission of the corresponding argument. The wedge product of finitely many forms is defined analogously (either directly by a similar formula, or we can notice that the above wedge product of forms is an associative operation – think this out by yourselves!). Next, remind the generators ∂ ∂xi of all vector fields in X(Rn ), as well as the generators dxi of all linear exterior forms in Ω1 (Rn ). Their wedge products εi1...ik = dxi1 ∧ · · · ∧ dxik with k-tuples of indices i1 < i2 < · · · < ik generate the whole space Ωk (Rn ) by linear combinations with functions standing for coefficients. Indeed, interchanging a pair of adjacent forms in the product merely changes the sign, so the whole expression is identically zero if an index appears twice. Therefore, every k–form α is given uniquely by functions αi1...ik (x) in the expression α(x) = ∑ i1<··· dim U. Thus, Ωk (U) contains only the trivial zero form in this case. Another straightforward consequence of the definition is that the pullback of the wedge product by a smooth mapping G : V → U satisfies G∗ (α ∧ β) = G∗ α ∧ G∗ β. We should also notice that 0–forms Ω0 (Rn ) are just smooth functions on Rn . The wedge product of a 0–form f and a k–form α is just the multiple of the form α by the function f. Similarly, the top degree forms in Ωn (U) are all generated by the single generator ε12...n, since there is just one possibility of n different choices among n coordinates, up to the ordering. This means that actually the n-forms ω are identified with functions via the formula ω(x) = f(x)dx1 ∧ · · · ∧ dxn. At the same time, while the pullback on the functions f ∈ Ω0 (U) by a transformation F : Rn → Rn , y = F(x), is trivial, i.e. F∗ f(x) = f(y) = f ◦ F(x), a straightforward computation reveals (1) F∗ ω(x) = det(D1 F)(x)f(F(x))dx1 ∧ · · · ∧ dxn for all ω = fdy1 ∧ · · · ∧ dyn. 805 where c is the positively oriented circle x2 + y2 = 9. ⃝ 9.B.8. Compute the integral ∫ c ( 1 x + 2xy − y3 3 ) dx + ( 1 y + x2 + x3 3 ) dy, where c is the positively oriented boundary of the set D = {(x, y) ∈ R2 : 4 ≤ x2 + y2 ≤ 9, x√ 3 ≤ y ≤ √ 3x}. ⃝ 9.B.9. Remark. An important corollary of Green’s theorem is the formula for computing the area D that is bounded by a curve c. m(D) = 1 2 ∫ c −y dx + x dy. 9.B.10. Compute the area given by the ellipse x2 a2 + y2 b2 = 1. Solution. Using the formula 9.B.9 and the transformation x = a cos t, y = b sin t, we get for t ∈ [0, 2π] that m(D) = 1 2 ∫ c −y dx + x dy = = 1 2 2π∫ 0 a cos t · b cos tdt − 1 2 2π∫ 0 b sin t · (−a sin t)dt = = 1 2 ab 2π∫ 0 cos2 tdt + 1 2 ab 2π∫ 0 sin2 tdt = = 1 2 ab 2π∫ 0 cos2 t + sin2 tdt = 1 2 ab2π = πab, which is indeed the well-known formula for the area of an ellipse with semi-axes a and b. □ 9.B.11. Find the area bounded by the cycloid which is given parametrically as ψ(t) = [a(t−sin t); a(1−cos t)], for a ≥ 0, t ∈ (0, 2π), and the x-axis. Solution. Let the curves that bound the area be denoted by c1 and c2. 
As for the area, we get m(D) = 1 2 ∫ c1 −y dx + x dy + 1 2 ∫ c2 −y dx + x dy. Now, we will compute the mentioned integrals step by step. The parametric equation of the curve c1 (a segment of the x-axis) is (t; 0); t ∈ [0; 2aπ], so we obtain for the first integral that 1 2 ∫ C1 −y dx + x dy = 1 2 ∫ 2aπ 0 0 · 1 dt + ∫ 2aπ 0 t · 0 dt = 0. The parametric equation of the curve c2 is ψ(t) ∈ (a(t− sin t), a(1 − cos t)); t ∈ [2π; 0]. The formula for the area expects a positively oriented curve, CHAPTER 9. CONTINUOUS MODELS – FURTHER SELECTED TOPICS 9.1.4. Integration of exterior forms on Rn . Once we fix coordinates (x1, . . . , xn) on Rn (e.g. the standard ones), there is the bijection between functions f and top degree forms ω(x) = f(x)dx1 ∧ · · · ∧ dxn. This can be interpreted as defining the scale with which the standard volume in Rn is to be taken pointwise due to the function f. Notice, that changing the coordinates via a tranformation F will rescale this understanding of the forms exactly as in the formula for coordinate substitution in the Riemann integral. We should view this observation as a new interpretation of the integrands in our earlier procedure of integration of functions f on Riemann measurable open subsets U ⊂ Rn , independent of any coordinate choice. Let us check this interpretation in more detail. First, we define the n–form ωRn , giving the standard n–dimensional volume of parallelograms, i.e. in the standard coordinates we obtain ωRn = dx1 ∧ · · · ∧ dxn. If we want to integrate a function f(x) “in the new way”, we consider the form ω = fωRn instead, i.e. ω = f(x)dx1 ∧ · · · ∧ dxn. We define the integral of the form ω as ∫ U ω = ∫ U f(x)dx1 ∧ · · · ∧ dxn = ∫ U f(x) dx1 · · · dxn, where the Riemann integral of a function is considered on the right-hand side. Let us point out, that the n-form ω on the left-hand side is well defined, independently of any choice of coordinates. If we want to express the form ω in different coordinates using a diffeomorphism G : V → U, G = (g1, . . . , gn), it means we will evaluate ω at a point G(y) = x at the values of the vectors G∗(X1), . . . , G∗(Xn). However, this means we will integrate the form G∗ ω in coordinates (y1, . . . , yn), and we already saw in the previous paragraph, cf. 9.1.3(1) that (G∗ ω)(y) = f(G(y)) det(D1 G(u))dy1 ∧ · · · ∧ dyn. Substituting into our interpretation of the integral, we get ∫ V G∗ (fωRn ) = ∫ G−1(U) f(G(y)) det(D1 G(u))dy1 · · · dyn, which is, by the theorem 8.2.8 on the coordinate substitution in the integral, the same value as ∫ U fωRn if the determinant of the Jacobian matrix is positive, and the same value up to the sign if it is negative. Our new interpretation thus provides the geometrical meaning for the integral of an n–form on Rn , supposing the corresponding Riemann integral exists in some (and hence any) coordinates. This integration takes into account the orientation of the area we are integrating over. We shall come back to this point in a moment. 9.1.5. Integrals along curves. Our next goal is to integrate objects over domains which are similar to curves or surfaces in R3 . Let us first shape our mind on the simplest case of the lowest dimension, i.e. the curves in Rn . 806 which means for the considered parametric equation that we are moving against the parametrization direction, i. e. from the upper bound to the lower one. 
We thus get for the area of the cycloid that
\[ \frac{1}{2}\int_{c_2} -y\,dx + x\,dy = \frac{1}{2}\int_{2\pi}^{0} \big(a(t - \sin t)\cdot a\sin t - a(1 - \cos t)\cdot a(1 - \cos t)\big)\,dt = \frac{a^2}{2}\int_0^{2\pi} \big((1 - \cos t)^2 - (t - \sin t)\sin t\big)\,dt \]
\[ = \frac{a^2}{2}\int_0^{2\pi} \big(2 - 2\cos t - t\sin t\big)\,dt = \frac{a^2}{2}\Big(4\pi - 0 + \big[t\cos t - \sin t\big]_0^{2\pi}\Big) = \frac{a^2}{2}\,(4\pi + 2\pi) = 3\pi a^2. \] □

9.B.12. Compute I = ∫∫_S x³ dy dz + y³ dx dz + z³ dx dy, where S is the sphere x² + y² + z² = 1.

Solution. It is advantageous to work in spherical coordinates
\[ x = \rho \sin\varphi \cos\psi, \quad y = \rho\sin\varphi\sin\psi, \quad z = \rho\cos\varphi, \qquad \rho \in [0, 1],\ \varphi \in [0, \pi],\ \psi \in [0, 2\pi]. \]
The Jacobian of this transformation is ρ² sin φ (in absolute value). By the Gauss–Ostrogradsky theorem (cf. 9.B.13), the given integral is then equal to
\[ I = \iiint_V \big(3x^2 + 3y^2 + 3z^2\big)\,dx\,dy\,dz = 3\int_0^1\!\!\int_0^{2\pi}\!\!\int_0^{\pi} \rho^2\sin\varphi\,\big(\rho^2\sin^2\varphi(\cos^2\psi + \sin^2\psi) + \rho^2\cos^2\varphi\big)\,d\varphi\,d\psi\,d\rho \]
\[ = 3\int_0^1 \rho^4\,d\rho \int_0^{2\pi} d\psi \int_0^{\pi} \sin\varphi\,d\varphi = 3 \cdot \Big[\frac{\rho^5}{5}\Big]_0^1 \cdot \big[\psi\big]_0^{2\pi} \cdot \big[-\cos\varphi\big]_0^{\pi} = 3 \cdot \frac{1}{5} \cdot 2\pi \cdot 2 = \frac{12}{5}\pi. \] □

Recall the calculation of the length of a curve in Rⁿ by univariate integrals, which was discussed in paragraph 6.1.10 on page 518. The curve was parametrized as a mapping c(t) : R → Rⁿ, and the size of the tangent vector ‖c′(t)‖ was expressed in the Euclidean vector space. This procedure was given by the universal relation for an arbitrary tangent vector, i.e., we actually found the function ρ : Rⁿ → R which gave the true size when evaluated at c′(t). This mapping satisfied ρ(av) = |a|ρ(v), since we ignored the orientation of the curve given by our parametrization. If we wanted a signed length, respecting the orientation, then our mapping ρ would be linear on every one-dimensional subspace L ⊂ Rⁿ. Of course, we could also have multiplied the Euclidean size by a positive function and integrated this quantity.

In view of our geometric approach to integration, we should rather integrate linear forms along curves, while the size of vectors is given by a quadratic form rather than a linear one. However, in dimension one, we take the square root of the values of the (positive definite) quadratic form in order to get a linear form (up to sign), which is just the size of the vectors.

Let us proceed in a much similar way dealing with linear differential forms η on Rⁿ. The simplest ones are the differentials df of functions f on Rⁿ. In order to motivate our development, let us consider the following task. Imagine we are cycling along a path c(t) in R², and the function f is the altitude of the terrain. If we want to compute the total gain of altitude along the path c(t), we should "integrate" the immediate infinitesimal gains, which are the derivatives of f in the directions of the tangent vectors to the path, i.e. df(c′(t)).

Thus, let us consider a differentiable curve c(t) in Rⁿ, t ∈ [a, b], write M for the image c([a, b]), and assume that a differentiable function f is defined on a neighborhood of M. The differential of this function gives for every tangent vector the increment of the function in the given direction. It is expressed by the differential of the composite mapping f ∘ c,
\[ d(f \circ c)(t) = \frac{\partial f}{\partial x_1}(c(t))\,c_1'(t) + \dots + \frac{\partial f}{\partial x_n}(c(t))\,c_n'(t). \]
We can thus try to define the value of the integral in the following way:
\[ \int_M df = \int_a^b \Big( \frac{\partial f}{\partial x_1}(c(t))\,c_1'(t) + \dots + \frac{\partial f}{\partial x_n}(c(t))\,c_n'(t) \Big)\,dt, \]
and we immediately verify that a change of the parametrization of the curve has no effect upon the value.
Indeed, writing $\tilde c(s) = c(\psi(s))$, $a = \psi(\tilde a)$, $b = \psi(\tilde b)$, our procedure yields
$$\int_{\tilde a}^{\tilde b}\Big(\frac{\partial f}{\partial x_1}(c(\psi(s)))\,c_1'(\psi(s)) + \cdots + \frac{\partial f}{\partial x_n}(c(\psi(s)))\,c_n'(\psi(s))\Big)\frac{d\psi}{ds}\,ds,$$
and the theorem about coordinate transformations for univariate integrals gives just the same value if $\frac{d\psi}{ds} > 0$, i.e. if we keep the orientation of the curve, and the same value up to sign if the derivative of the transformation is negative.

If we extend the same definition to an arbitrary linear form $\eta = \eta_1\,dx_1 + \cdots + \eta_n\,dx_n$, we arrive at the same formula with $\eta_i$ replacing the derivatives $\frac{\partial f}{\partial x_i}$,
$$\int_M \eta = \int_a^b\big(\eta_1(c(t))\,c_1'(t) + \cdots + \eta_n(c(t))\,c_n'(t)\big)\,dt,$$
again independent of the parametrization of the curve $c$ as above. In the above example with $n = 2$, $f$ was the altitude of the terrain, and the integral of $df$ along the path modelled the total gain of elevation. Thus, we should expect that the total gain along the path depends on the values $c(a)$ and $c(b)$ only, while different curves with the same boundary points would produce different integrals of $\eta$ for a general 1-form $\eta$. This will indeed be the special claim of the Stokes theorem below. Before we treat the higher dimensional analogues, we shall look at a more abstract approach to suitable subsets in $\mathbb{R}^n$ and the role of coordinates on them.

9.1.6. Manifolds. The straightforward generalizations of parametrized curves $c(t) : \mathbb{R}\to\mathbb{R}^n$ are the differentiable mappings $\varphi : V\subset\mathbb{R}^k\to\mathbb{R}^n$, $k\le n$, with injective differential $d\varphi(u)$ at every point of the open domain $V$. Such mappings are called immersions. With the curves, we did not care about their self-intersections etc. Now, for technical reasons, we shall be more demanding.

Manifolds in $\mathbb{R}^n$
A subset $M\subset\mathbb{R}^n$ is called a manifold of dimension $k$ if every point $x\in M$ has a neighborhood $U\subset\mathbb{R}^n$ which is the image of a diffeomorphism $\tilde\varphi : V\times\tilde V\to U$, $V\subset\mathbb{R}^k$, $\tilde V\subset\mathbb{R}^{n-k}$, such that
• the restriction $\varphi = \tilde\varphi|_V : V\to M$ is an immersion,
• $\tilde\varphi^{-1}(M) = V\times\{0\}\subset\mathbb{R}^n$.
The manifolds $M$ carry the topology inherited from $\mathbb{R}^n$.

9.B.13. The vector form of the Gauss–Ostrogradsky theorem. The divergence of a vector field $F(x,y,z) = f(x,y,z)\frac{\partial}{\partial x} + g(x,y,z)\frac{\partial}{\partial y} + h(x,y,z)\frac{\partial}{\partial z}$ is defined as $\operatorname{div} F := f_x + g_y + h_z$. Then, the Gauss–Ostrogradsky theorem can be formulated as follows:
$$\iiint_V \operatorname{div}\vec F(x,y,z)\,dx\,dy\,dz = \iint_S \vec F(x,y,z)\cdot\vec n(x,y,z)\,dS,$$
where $\vec n(x,y,z)$ is the outer unit normal to the surface $S$ at the point $[x,y,z]\in S$ ($S$ is the boundary of the normal domain $V$).

9.B.14. Find the flow of the vector field $F = (xy^2, yz, x^2z)$ through the boundary of the solid cylinder given by $x^2+y^2\le 4$, $1\le z\le 3$.

Solution. First of all, we compute the divergence of the vector field:
$$\operatorname{div} F = \nabla\cdot F = \frac{\partial(xy^2)}{\partial x} + \frac{\partial(yz)}{\partial y} + \frac{\partial(x^2z)}{\partial z} = y^2 + z + x^2.$$
Therefore, the flow $T$ of the vector field equals
$$\iiint_V y^2+z+x^2\,dx\,dy\,dz = \int_1^3\!\!\int_0^{2\pi}\!\!\int_0^2 \rho\,\big(\rho^2\sin^2\varphi + z + \rho^2\cos^2\varphi\big)\,d\rho\,d\varphi\,dz = \int_1^3\!\!\int_0^{2\pi}\!\!\int_0^2 \rho^3 + \rho z\,d\rho\,d\varphi\,dz$$
$$= 2\pi\int_1^3\Big[\frac{\rho^4}{4} + \frac{\rho^2}{2}z\Big]_0^2\,dz = 2\pi\int_1^3 4 + 2z\,dz = 2\pi\big[4z + z^2\big]_1^3 = 2\pi(12 + 9 - 4 - 1) = 32\pi. \;\square$$

9.B.15. Find the flow of the vector field $F = (y, x, z^2)$ through the sphere $x^2+y^2+z^2 = 4$.

Solution. The divergence of the given vector field is
$$\operatorname{div} F = \nabla\cdot F = \frac{\partial y}{\partial x} + \frac{\partial x}{\partial y} + \frac{\partial z^2}{\partial z} = 2z.$$
Thus, the wanted flow equals
$$\iiint_V 2z\,dx\,dy\,dz = \int_0^{2\pi}\!\!\int_0^{\pi}\!\!\int_0^2 \rho^2\sin\varphi\cdot 2\rho\cos\varphi\,d\rho\,d\varphi\,d\psi = 2\int_0^2\rho^3\,d\rho\int_0^{2\pi}d\psi\int_0^{\pi}\sin\varphi\cos\varphi\,d\varphi$$
$$= 2\Big[\frac{\rho^4}{4}\Big]_0^2\cdot\big[\psi\big]_0^{2\pi}\cdot\Big[\frac{\sin^2\varphi}{2}\Big]_0^{\pi} = 2\cdot\frac{16}{4}\cdot 2\pi\cdot 0 = 0. \;\square$$

C. Equation of heat conduction

9.C.1. Find the solution of the so-called equation of heat conduction (equation of diffusion)
$$u_t(x,t) = a^2 u_{xx}(x,t),\quad x\in\mathbb{R},\ t > 0,$$
satisfying the initial condition $\lim_{t\to0^+} u(x,t) = f(x)$.

Notes: The symbol $u_t = \frac{\partial u}{\partial t}$ stands for the partial derivative of the function $u$ with respect to $t$ (i.e., differentiating with respect to $t$ while considering $x$ constant), and similarly $u_{xx} = \frac{\partial^2 u}{\partial x^2}$ denotes the second partial derivative with respect to $x$ (i.e., twice differentiating with respect to $x$ while considering $t$ constant). The physical interpretation of this problem is as follows: we are trying to determine the temperature $u(x,t)$ in a thermally insulated and homogeneous bar of infinite length (the range of the variable $x$) if the initial temperature of the bar is given by the function $f$. The cross-section of the bar is constant and the heat can spread in it by conduction only. The coefficient $a^2$ equals the quotient $\frac{\alpha}{c\varrho}$, where $\alpha$ is the coefficient of thermal conductivity, $c$ is the specific heat and $\varrho$ is the density. In particular, we assume that $a^2 > 0$.

Solution. We apply the Fourier transform to the equation with respect to the variable $x$. We have
$$\mathcal F(u_t)(\omega,t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} u_t(x,t)\,e^{-i\omega x}\,dx = \Big(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} u(x,t)\,e^{-i\omega x}\,dx\Big)',$$
where the derivative on the right-hand side is taken with respect to $t$, i.e., $\mathcal F(u_t)(\omega,t) = (\mathcal F(u)(\omega,t))' = (\mathcal F(u))_t(\omega,t)$. At the same time, we know that
$$\mathcal F(a^2u_{xx})(\omega,t) = a^2\,\mathcal F(u_{xx})(\omega,t) = -a^2\omega^2\,\mathcal F(u)(\omega,t).$$
Denoting $y(\omega,t) = \mathcal F(u)(\omega,t)$, we arrive at the equation $y_t = -a^2\omega^2 y$. We already solved a similar differential equation when we were calculating Fourier transforms, so it is now easy to determine all of its solutions: $y(\omega,t) = K(\omega)\,e^{-a^2\omega^2 t}$, $K(\omega)\in\mathbb{R}$.

[figure ./img/0214_eng.png]

This definition is illustrated by the picture above. Manifolds can typically be given (at least locally) implicitly as the level sets of differentiable mappings, see paragraph 8.1.24 and the discussion in 8.1.26. The mapping $\varphi$ from the definition is called the local parametrization or local map of the manifold $M$. The manifolds are a straightforward generalization of curves and surfaces in the plane $\mathbb{R}^2$ or the space $\mathbb{R}^3$. We have excluded curves and surfaces which are self-intersecting, and even those which are self-approaching. For instance, we can surely imagine a curve representing the figure 8 parametrized by a mapping $\varphi$ with everywhere-injective differential. However, we will be unable to satisfy the second property from the manifold definition in a neighborhood of the point where the two branches of the curve meet.

Tangent and cotangent bundles of manifolds
The tangent bundle $TM$ of the manifold $M$ is the collection of the vector subspaces $T_xM\subset T_x\mathbb{R}^n$ which contain all vectors tangent to the curves in $M$. There is the footpoint projection $p : TM\to M$. Similarly, the cotangent bundle $T^*M$ of the manifold $M$ is the collection of the dual spaces $(T_xM)^*$, together with the footpoint projection.

Clearly, every parametrization $\varphi$ defines a diffeomorphism $\varphi_* : TV\to T(\varphi(V))\subset TM$, $\varphi_*(c'(t)) = \frac{d}{dt}\varphi(c(t))$. Due to the chain rule, this definition does not depend on the choice of the representing curve $c(t)$. We shall also write $T\varphi$ for the mapping $\varphi_*$.
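The pushforward just introduced is a concrete computation. A minimal sympy sketch (assuming sympy; the sphere parametrization and the curve are our own illustrative choices), checking that $\varphi_*(c'(t)) = \frac{d}{dt}\varphi(c(t))$ coincides with multiplying $c'(t)$ by the Jacobian of $\varphi$:

```python
# Sketch: pushforward of a curve's velocity under a local parametrization
# of the unit sphere; by the chain rule both computations agree.
import sympy as sp

t, u, v = sp.symbols('t u v')
phi = sp.Matrix([sp.cos(u)*sp.cos(v), sp.sin(u)*sp.cos(v), sp.sin(v)])
c = {u: t, v: t**2}                       # a sample curve in the domain V

pushforward = phi.subs(c).diff(t)         # d/dt of phi(c(t))
via_jacobian = (phi.jacobian([u, v]) * sp.Matrix([1, 2*t])).subs(c)
print(sp.simplify(pushforward - via_jacobian))   # zero vector
```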
In particular, the local maps $\varphi$ (extended to $\tilde\varphi$, as in the above definition) induce the local maps $\varphi_* : TV = V\times\mathbb{R}^k\to TM\subset\mathbb{R}^n\times\mathbb{R}^n$ of the tangent bundle. Thus, the tangent bundle $TM$ is again a manifold, which locally looks like $U\times\mathbb{R}^k$ over sufficiently small open subsets $U\subset M$. But we shall see that $TM$ might be quite different from $M\times\mathbb{R}^k$ globally. Dealing with the cotangent bundle, we can use the dual mappings $(T\varphi^{-1})^*$ on the individual fibers $T_x^*M$ to obtain local parametrizations.

It remains to determine $K(\omega)$. The transformation of the initial condition gives
$$\mathcal F(f)(\omega) = \lim_{t\to0^+}\mathcal F(u)(\omega,t) = \lim_{t\to0^+} y(\omega,t) = K(\omega)\,e^0 = K(\omega),$$
hence $y(\omega,t) = \mathcal F(f)(\omega)\,e^{-a^2\omega^2t}$. Now, using the inverse Fourier transform, we can return to the original differential equation, with the solution
$$u(x,t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y(\omega,t)\,e^{i\omega x}\,d\omega = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathcal F(f)(\omega)\,e^{-a^2\omega^2t}\,e^{i\omega x}\,d\omega$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\Big(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(s)\,e^{-i\omega s}\,ds\Big)e^{-a^2\omega^2t}\,e^{i\omega x}\,d\omega = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(s)\Big(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-a^2\omega^2t}\,e^{-i\omega(s-x)}\,d\omega\Big)ds.$$
Computing the Fourier transform $\mathcal F(f)$ of the function $f(t) = e^{-at^2}$ for $a > 0$, we obtained (after relabeling the variables)
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-cp^2}\,e^{-irp}\,dp = \frac{1}{\sqrt{2c}}\,e^{-\frac{r^2}{4c}},\quad c > 0.$$
According to this formula (consider $c = a^2t > 0$, $p = \omega$, $r = s-x$), we have
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-a^2\omega^2t}\,e^{-i\omega(s-x)}\,d\omega = \frac{1}{\sqrt{2a^2t}}\,e^{-\frac{(s-x)^2}{4a^2t}}.$$
Therefore,
$$u(x,t) = \frac{1}{2a\sqrt{\pi t}}\int_{-\infty}^{\infty} f(s)\,e^{-\frac{(x-s)^2}{4a^2t}}\,ds. \;\square$$

D. Partial Differential Equations

9.D.1. Find the general solution and the solutions for the given boundary conditions of the homogeneous linear equation $xu_x + yu_y = 0$:
(i) $u(\cos\sigma, \sin\sigma) = 1$, (ii) $u(\sigma, 1) = 1 + \sigma^2$, (iii) $u(\sigma, 1-\sigma) = 3\sigma$, (iv) $u(\cos\sigma, \sin\sigma) = \sigma$, (v) $u(\sigma, \sigma) = 1$, (vi) $u(\sigma, \sigma) = \sigma$.

Solution. First, let us look at the problem geometrically. The vector $\nabla u = (u_x, u_y)$ is the gradient of the unknown function $u = u(x,y)$ (it points in the direction of steepest growth of $u$). The equation $a(x,y)u_x + b(x,y)u_y = 0$ tells us that the scalar product of the given vector field $\vec v = (a(x,y), b(x,y))$ and $\nabla u$ is zero, i.e. $\vec v$ is orthogonal to $\nabla u$ at each point, so $\vec v$ has to be tangent to the contour lines (the lines of constant value of $u$); we will call them characteristics.

Notice that two differentiable immersions $\varphi$ and $\psi$ parametrizing the same open subset $U\subset M$ provide the composition $\psi^{-1}\circ\varphi$. We view this as a coordinate change for $U$, and we have just seen that coordinate changes on $M$ induce coordinate changes on $TM$. Further, if $M$ and $N$ are two manifolds and $F : M\to N$ is a mapping, we say that $F$ is differentiable (up to order $r$, or smooth, or analytic) if the compositions $\psi^{-1}\circ F\circ\varphi$ with local parametrizations $\varphi$ of $M$ and $\psi$ of $N$ (of the same order of differentiability as we want to check) are differentiable (up to order $r$, or smooth, or analytic). Again, the chain rule property of differentiation shows that this definition does not depend on the particular choice of the parametrizations. Each differentiable mapping $F : M\to N$ defines the tangent mapping $TF : TM\to TN$ between the tangent spaces, which clearly is differentiable of order one less than the assumed differentiability of $F$.

Vector fields and differential forms on manifolds
Smooth vector fields $X$ on a manifold $M$ are smooth sections $X : M\to TM$ of the footpoint projection $p : TM\to M$, i.e., $p\circ X = \mathrm{id}_M$. Smooth $k$-forms $\eta$ on a manifold $M$ are sections $M\to\Lambda^k(TM)^*$ such that the pullback of this form by any parametrization $V\to M$ yields a smooth exterior $k$-form on $V$.
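As a computational aside to the heat conduction problem 9.C.1 solved above: for a Gaussian initial datum the kernel formula can be evaluated in closed form. A minimal sympy check (assuming sympy; the choice $f(x) = e^{-x^2}$ is ours):

```python
# Sketch: for f(x) = exp(-x**2) the convolution with the heat kernel has a
# closed form, and the result satisfies u_t = a**2 u_xx with u -> f as t -> 0+.
import sympy as sp

x, s = sp.symbols('x s', real=True)
a, t = sp.symbols('a t', positive=True)

u = sp.integrate(sp.exp(-s**2) * sp.exp(-(x - s)**2/(4*a**2*t)),
                 (s, -sp.oo, sp.oo)) / (2*a*sp.sqrt(sp.pi*t))
u = sp.simplify(u)        # exp(-x**2/(4*a**2*t + 1)) / sqrt(4*a**2*t + 1)

residual = sp.diff(u, t) - a**2*sp.diff(u, x, 2)
print(sp.simplify(residual))     # 0
```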
Now it is easy to find the general solution: it can be any function which is constant on the characteristics (the integral curves of the vector field $\vec v$). Let us find them for our case by solving the system of ODEs
$$\dot x(t) = \frac{dx}{dt}(t) = a(x,y) = x,\qquad \dot y(t) = \frac{dy}{dt}(t) = b(x,y) = y,$$
or by solving the equation $y'(x) = \frac{dy}{dx} = \frac{\dot y}{\dot x} = \frac{b}{a}$. The solution yields the curves $x = C_1e^t$, $y = C_2e^t$, i.e. $\frac{y}{x} = C$, and the characteristics are lines through the origin (picture P.1). The general solution can be written as $u(x,y) = \Phi(\frac{y}{x})$, or $u(x,y) = \Psi(\frac{x}{y})$. Test: $u_x = \Phi'\cdot(-\frac{y}{x^2})$, $u_y = \Phi'\cdot\frac{1}{x}$, hence $xu_x + yu_y = 0$.

[figure: the characteristics together with the boundary curves for the cases (i)/(iv), (ii) and (iii)]

On the picture, we see the characteristics together with the given boundary curves. We need to choose the solutions satisfying our boundary condition. The questions are: How many solutions can we find? Does a solution exist for every boundary condition?

(i) The parametrized boundary curve (a circle) intersects every characteristic infinitely many times. But the value $u(\cos\sigma, \sin\sigma) = 1$ is constant. The solution has to be constant on the characteristics, therefore $u(x,y) = 1$ is the unique solution of the equation with the boundary condition (i).

(ii) The parametrized boundary curve (a line) $x = \varphi(\sigma) = \sigma$, $y = \psi(\sigma) = 1$ intersects almost every characteristic (except the line $y = 0$) exactly once. The value $u = f(\sigma) = 1 + \sigma^2$ at the point of intersection should be the same along the whole characteristic: without loss of generality, take $t = 0$, and from $x = C_1e^0 = \varphi(\sigma) = \sigma$, $y = C_2e^0 = \psi(\sigma) = 1$ we get $C_1 = \sigma$, $C_2 = 1$. Then $x = \sigma e^t$, $y = e^t$, hence $\sigma = \frac{x}{y}$, and we have the solution
$$u(x,y) = f(\sigma(x,y)) = 1 + [\sigma(x,y)]^2 = 1 + \frac{x^2}{y^2}.$$
This solution exists for $y\ne 0$. On the connected component of its domain containing the boundary curve, $\Omega = \{(x,y)\in\mathbb{R}^2\mid y > 0\}$, this is the unique continuous solution.

We write $X(M)$ for the space of smooth vector fields on $M$, while $\Omega^k(M)$ stands for the space of all smooth exterior $k$-forms on $M$. Notice that all our coordinate formulae for the vector fields, forms, pullbacks etc. on $\mathbb{R}^m$ hold true in the more abstract setting of manifolds and their local parametrizations.¹

¹ Actually, instead of dealing with manifolds as subsets of $\mathbb{R}^n$, we might use the same concept of local parametrizations of a space $M$ with differentiable transition functions $\psi^{-1}\circ\varphi$. We just need to know what the "open subsets" in $M$ are, thus we could start at the level of topological spaces. On the other hand, there is the general result (the so-called Whitney embedding theorem) that each such abstract $n$-dimensional manifold can be realized as embedded in $\mathbb{R}^{2n}$, so we essentially do not lose any generality here.

9.1.7. Integration of exterior forms on manifolds. Now, we are almost ready for the definition of the integral of $k$-forms on $k$-dimensional manifolds. For the sake of simplicity, we will examine smooth forms $\omega$ with compact support only. First, let us assume that we are given a $k$-dimensional manifold $M\subset\mathbb{R}^n$ and one of its local parametrizations $\varphi : V\subset\mathbb{R}^k\to U\subset M\subset\mathbb{R}^n$. We consider the standard orientation on $\mathbb{R}^k$ given by the standard basis (cf. 4.1.19 for the definition of the orientation of a vector space). The choice of the parametrization $\varphi$ also fixes the orientation of the manifold $U\subset M$. This orientation will be the same for those choices of local parametrizations which differ by diffeomorphisms with positive determinants of their Jacobi matrices. The orientation will be the other one in the case of negative determinants.
The manifold $M$ is called orientable if there is a covering of the entire set $M$ by local parametrizations $\varphi$ such that their orientations coincide. Therefore, we apparently have exactly two orientations on every connected orientable manifold. Fixing either of them, we thereby restrict the set of parametrizations to those compatible with this orientation. From now on, we will always proceed in this fashion, and we will talk about oriented manifolds only.

Next, let us fix a form $\omega$ with compact support inside the image of one parametrization $U\subset M$ of an oriented manifold $M$. The pullback form $\varphi^*(\omega)$ is a smooth $k$-form on $V\subset\mathbb{R}^k$ with compact support. The integral of the form $\omega$ on $M$ is defined in terms of the chosen parametrization which is compatible with the orientation as follows:
$$\int_M \omega = \int_{\mathbb{R}^k}\varphi^*(\omega).$$
If we choose a different compatible parametrization $\tilde\varphi = \varphi\circ\psi$, where $\psi$ is a diffeomorphism $\psi : W\to V\subset\mathbb{R}^k$, we can easily compute the result, following the same definition. Let us denote $\varphi^*(\omega)(u) = f(u)\,du_1\wedge\cdots\wedge du_k$. Invoking the relation 9.1.2(2) for the pullback of a form by a composite mapping, we get
$$\int_M \omega = \int_{\mathbb{R}^k}\tilde\varphi^*(\omega) = \int_{\mathbb{R}^k}\psi^*(\varphi^*\omega) = \int_{\mathbb{R}^k}\psi^*\big(f\,du_1\wedge\cdots\wedge du_k\big) = \int_{\mathbb{R}^k} f(\psi(v))\,\det(D^1\psi)(v)\,dv_1\cdots dv_k.$$
This is again the same value as $\int_{\mathbb{R}^k}\varphi^*\omega$. This proves the correctness of our definition of the integral $\int_M\omega$, provided the integrated $k$-form has compact support lying in the image of a single parametrization.

However, typical manifolds $M$ are given by implicit equations. For example, $x^2+y^2+z^2 = 1$ defines the surface of the unit ball, i.e., the sphere $S^2\subset\mathbb{R}^3$. If we want to integrate an exterior 2-form on $S^2$, we will have to use several parametrizations. Fortunately, our definition of the integral is additive with respect to disjoint unions of integration domains. Therefore, if we can write $M = U_1\cup U_2\cup\cdots\cup U_m\cup B$, where the $U_i$ are pairwise disjoint images of parametrizations $\varphi_i$, and $B$ is a set whose inverse image in any parametrization is a Riemann measurable set with measure zero, we can compute
$$\int_M\omega = \int_{U_1}\omega + \cdots + \int_{U_m}\omega,$$
and we can easily verify that this value is independent of the choice of the sets $U_i$ and the parametrizations (in particular, we need not be worried by the set $B$, since the result of any integration on it is zero). For example, we can imagine splitting a sphere into the upper and lower hemispheres, leaving the equator $B$ uncovered.

There is no "natural" way to extend it for $y < 0$.

(iii) The parametrized boundary curve (a line) $x = \varphi(\sigma) = \sigma$, $y = \psi(\sigma) = 1-\sigma$ intersects almost every characteristic (except the line $y = -x$) exactly once. Again, the value $u = f(\sigma) = 3\sigma$ at the point of intersection should be the same along the whole characteristic: take $t = 0$, and from $x = C_1e^0 = \varphi(\sigma) = \sigma$, $y = C_2e^0 = \psi(\sigma) = 1-\sigma$ we get $C_1 = \sigma$, $C_2 = 1-\sigma$. Then
$$x = \sigma e^t,\quad y = (1-\sigma)e^t \;\Longrightarrow\; \frac{\sigma}{1-\sigma} = \frac{x}{y} \;\Longrightarrow\; \sigma = \frac{x}{x+y},$$
and we have the unique solution $u(x,y) = f(\sigma(x,y)) = 3\sigma(x,y) = \frac{3x}{x+y}$. This solution exists for $y\ne -x$. Let us make a test: $u_x = \frac{3y}{(x+y)^2}$, $u_y = \frac{-3x}{(x+y)^2}$, hence $xu_x + yu_y = 0$ and $u(\sigma, 1-\sigma) = \frac{3\sigma}{\sigma + 1 - \sigma} = 3\sigma$.

(iv) The boundary curve intersects every characteristic infinitely many times, but the values at two points on the same characteristic differ. Globally, this problem has no solution. We can solve it, for example, for $\sigma\in(-\frac{\pi}{2}, \frac{\pi}{2})$: $t = 0$, $x = C_1e^0 = \cos\sigma$, $y = C_2e^0 = \sin\sigma$, thus $C_1 = \cos\sigma$, $C_2 = \sin\sigma$ and $x = \cos\sigma\,e^t$, $y = \sin\sigma\,e^t$, hence $\sigma = \operatorname{arctg}\frac{y}{x}$.
We get the solution for $x > 0$: $u(x,y) = \sigma(x,y) = \operatorname{arctg}\frac{y}{x}$. Make a test.

(v) The boundary curve $y = x$ is one of the characteristics, so the problem is degenerate. The condition $u(\sigma,\sigma) = 1$ admits infinitely many solutions: any function $u = \Phi(\frac{y}{x})$ or $\Psi(\frac{x}{y})$ taking the value 1 for $y = x$ solves this problem (for example $u = \frac{y}{x}$, $u = \frac{y^2}{x^2}$, $u = 1$, ...).

(vi) The boundary curve $y = x$ is again one of the characteristics, and now there is no solution, because the condition $u(\sigma,\sigma) = \sigma$ does not prescribe a constant value along the characteristic $y = x$.

Finally, it seems (for our equation) that if the boundary curve intersects each characteristic at exactly one point, then there exists a unique continuous solution (on some open set containing the boundary curve) satisfying the boundary condition. In general, the situation is more complicated: the requirement that the boundary curve intersect each characteristic at exactly one point is neither necessary nor sufficient. $\square$

When calculating in practice, we usually divide the entire manifold into several disjoint open areas with compact closures, and we integrate on each of them separately. However, this procedure still does not help if we stick with the strict assumption that the entire support of the integrated form has to be inside one parametrization. Thus, we will develop a global definition of the integral, which is more advantageous from the technical/theoretical point of view (although it usually does not help in computations directly).

9.1.8. Partition of unity. Consider a manifold $M\subset\mathbb{R}^n$ and one of its covers by open images $U_i$ of parametrizations $\varphi_i$. We can surely find a countable cover of each manifold $M$ (it suffices to realize that we can make do with parametrizations which map the origin to points with rational coordinates in $\mathbb{R}^n$). Furthermore, we shall assume that any point $x\in M$ belongs to only finitely many sets $U_i$. Such a cover is called a locally finite cover by the parametrizations $\varphi_i$.²

² This property is called paracompactness and, actually, each metric space is paracompact. Thus, in particular, all our manifolds enjoy this property too. But we do not want to go into the details of the proof.

Now, recall the smooth variants of indicator functions from paragraph 6.1.10. For every pair of positive numbers $\varepsilon < r$, we constructed a function $f_{\varepsilon,r}(t)$ of one real variable $t$ such that $f_{\varepsilon,r}(t) = 1$ for $|t| < r-\varepsilon$, while $f_{\varepsilon,r}(t) = 0$ for $|t| > r+\varepsilon$, and $0\le f_{\varepsilon,r}(t)\le 1$ everywhere. At the same time, we had $f(t)\ne 0$ if and only if $|t| < r+\varepsilon$. Next, if we define $\chi_{r,\varepsilon,x_0}(x) = f_{\varepsilon,r}(|x - x_0|)$, then we get a smooth function which takes the value 1 inside the ball $B_{r-\varepsilon}(x_0)$, with support exactly $B_{r+\varepsilon}(x_0)$, and with values between 0 and 1 everywhere.

Lemma (Whitney's theorem). Every closed set $K\subset\mathbb{R}^n$ is the set of all zero points of some smooth non-negative function.

Proof. The idea of the proof is quite simple. If $K = \mathbb{R}^n$, the zero function fulfills the conditions, so we can further assume that $K\ne\mathbb{R}^n$. The open set $U = \mathbb{R}^n\setminus K$ can be expressed as the union of (at most) countably many open balls $B_{r_i}(x_i)$, and for each of them, we choose a smooth non-negative function $f_i$ on $\mathbb{R}^n$ whose support is just $B_{r_i}(x_i)$, see the function $\chi_{r,\varepsilon,x_0}$ above. Now, we add up all these functions into an infinite series
$$f(x) = \sum_{k=1}^{\infty} a_kf_k(x),$$
where the positive coefficients $a_k$ are selected so small that this series converges to a smooth function $f(x)$. To this purpose, it suffices to choose $a_k$ so that all partial derivatives of all the functions $a_kf_k(x)$ up to order $k$ (inclusive) are bounded from above by $2^{-k}$.
9.D.2. Find the general solution of $yu_x + xu_y = 0$ and the solutions for the given boundary conditions: (i) $u(x,0) = |x|$, (ii) $u(0,y) = y^2$, (iii) $u(\sigma,\sigma) = 2\sigma^2$. ⃝

9.D.3. Find the general solution of $yu_x - xu_y = 0$ and the solutions for the given boundary conditions: (i) $u(\sigma,\sigma) = 2\sigma^2$, (ii) $u(0,y) = \frac{1}{1+y^2}$, (iii) $u(x,1) = x^4$. ⃝

9.D.4. Find the solution of $\sin x\sin y\,u_x + \cos x\cos y\,u_y = 0$, $u = \cos 2y$ for $x+y = \frac{\pi}{2}$. ⃝

9.D.5. Consider the general quasilinear first order equation
$$a(x,y,u)u_x + b(x,y,u)u_y = f(x,y,u);$$
its special case is the equation linear in the highest order terms (in some texts, including ours, called semilinear), $a(x,y)u_x + b(x,y)u_y = f(x,y,u)$, and again its special case is the linear equation (generally nonhomogeneous), $a(x,y)u_x + b(x,y)u_y = f(x,y)$. In the following sections we show how to solve each of these types of equations. Find a solution of the quasilinear equation with the given boundary condition:
$$u_x - uu_y = -u,\qquad u(0,y) = 2y.$$

Solution. Let us solve the characteristic system $\dot x = a = 1$, $\dot y = b = -u$, $\dot u = f = -u$. The characteristics are given parametrically:
$$x(t) = t + C_1,\quad y(t) = C_3e^{-t} + C_2,\quad u(t) = C_3e^{-t}.$$
If we want the general solution of the equation, it is given implicitly by $\Phi(ue^x, u-y) = 0$, where $u - y = -C_2$, $ue^x = C_3e^{C_1}$ is some implicit description of the characteristics. From the boundary condition for $t = 0$ we get $x(0) = C_1 = 0$, $y(0) = C_3 + C_2 = \sigma$, $u(0) = C_3 = 2\sigma$, so $C_2 = -\sigma$ and
$$x(t,\sigma) = t,\quad y(t,\sigma) = \sigma(2e^{-t} - 1) \;\Longrightarrow\; \sigma = \frac{y}{2e^{-x} - 1},\quad e^{-t} = e^{-x},$$
and $u(t,\sigma) = 2\sigma e^{-t}$, hence $u(x,y) = \frac{2y}{2 - e^x}$.

Then, the series $\sum_k a_kf_k$ is bounded from above by the series $\sum_k 2^{-k}$; hence, by the Weierstrass criterion, it converges uniformly on the entire $\mathbb{R}^n$. Moreover, we get the same for all series of partial derivatives, since we can always write them as
$$\sum_{k=0}^{r-1} a_k\frac{\partial^r f_k}{\partial x_{i_1}\cdots\partial x_{i_r}} + \sum_{k=r}^{\infty} a_k\frac{\partial^r f_k}{\partial x_{i_1}\cdots\partial x_{i_r}},$$
where the first part is a smooth function, being a finite sum of smooth functions, and the second part can again be bounded from above by an absolutely converging series of numbers, so this expression converges uniformly to $\frac{\partial^r f}{\partial x_{i_1}\cdots\partial x_{i_r}}$. It is apparent from the definition that the function $f(x)$ satisfies the conditions of the lemma. $\square$

Partition of unity on a manifold
Theorem. Consider a manifold $M\subset\mathbb{R}^n$ equipped with a locally finite cover by open images $U_i$ of parametrizations $\varphi_i$. Then, there exists a system of smooth non-negative functions $f_i$ on $M$ such that for every point $x\in M$ we have $\sum_i f_i(x) = 1$, and $f_i(x)\ne 0$ if and only if $x\in U_i$.

The system of functions $f_i$ from the theorem is called the partition of unity subordinated to the locally finite cover of the manifold by the open sets $U_i$.

Proof. First, we extend the sets $U_i$ to open sets $\tilde U_i$ using the extended parametrizations $\tilde\varphi$ from the definition of a manifold and its local parametrizations. We can surely do this in such a way that the sets $\tilde U_i$ keep being a locally finite cover of an open neighborhood $\tilde U = \cup_i\tilde U_i\subset\mathbb{R}^n$ of the manifold $M$. For every open set $\tilde U_i$, we can choose a non-negative function $g_i(x)$ on the whole $\mathbb{R}^n$ so that $g_i(x)\ne 0$ exactly for $x\in\tilde U_i$. This can be done by Whitney's theorem proved in the above Lemma. Now, the function $g(x) = \sum_i g_i(x)$ is well-defined for all $x\in\mathbb{R}^n$ and smooth, thanks to the cover being locally finite (for every fixed point $x$, it is a finite sum of non-zero functions on some of its neighborhoods). The function $g(x)$ is positive for all $x\in M$.
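The normalization step that finishes this proof below can be made concrete in one dimension. A minimal numpy sketch (assuming numpy; the bumps, their supports and the interval $M = [0,1]$ are our own illustrative choices), building two smooth bumps and normalizing them into a partition of unity:

```python
# Sketch: two smooth bumps g1, g2 whose supports cover M = [0, 1],
# normalized into a partition of unity f_i = g_i / (g1 + g2).
import numpy as np

def h(t):
    # smooth on R, zero for t <= 0 (vanishes to infinite order at 0)
    return np.where(t > 0, np.exp(-1.0/np.maximum(t, 1e-12)), 0.0)

def bump(x, a, b):
    # smooth, positive exactly on (a, b), zero outside, as in Whitney's lemma
    return h(x - a) * h(b - x)

x = np.linspace(0.0, 1.0, 5)
g1, g2 = bump(x, -0.5, 0.7), bump(x, 0.3, 1.5)   # supports cover [0, 1]
g = g1 + g2                                       # positive on all of M
f1, f2 = g1/g, g2/g
print(f1 + f2)                                    # [1. 1. 1. 1. 1.]
```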
Thus, instead of the functions $g_i(x)$ restricted to $M$, we may rather consider the functions $f_i(x) = g_i(x)/g(x)$, which already have both of the required properties of the theorem. $\square$

9.1.9. Integration of $k$-forms on manifolds. Now, we are ready for the definition of the integral of $k$-forms on $k$-dimensional manifolds. Let us consider an oriented manifold $M\subset\mathbb{R}^n$ and a form $\omega\in\Omega^k(M)$ with compact support. Let us choose a locally finite cover of the manifold $M$ by parametrizations $\varphi_i : V_i\to U_i$ such that the closures of all images $\varphi_i(V_i)$ are compact, and then choose a partition of unity $f_i$ subordinated to this cover. The integral is defined by the formula
$$\int_M\omega = \int_M\sum_i f_i\omega = \sum_i\int_{U_i} f_i\omega,$$

This is the unique continuous solution on the connected component of its domain containing the boundary curve (for $x < \ln 2$). Make a test. $\square$

9.D.6. Find a solution of the equation $yu_x + xu_y = 2u$ for the boundary conditions (i)–(iii):
(i) $u(x,0) = 1$, (ii) $u(x,0) = x^2$, (iii) $u(0,y) = \sqrt{1+y^2}$.

Solution. This is a semilinear equation; we will solve it as a quasilinear one (in the next section we show another method working for semilinear equations): $\dot x = y$, $\dot y = x$, $\dot u = 2u$. The characteristics are given parametrically:
$$x(t) = C_1e^t + C_2e^{-t},\quad y(t) = C_1e^t - C_2e^{-t},\quad u(t) = C_3e^{2t}.$$
(i) Without loss of generality, put $t = 0$ at the point of intersection of the characteristic curve with the boundary curve $x = \varphi(\sigma) = \sigma$, $y = \psi(\sigma) = 0$, $u = f(\sigma) = 1$ (the boundary curve intersects the characteristics for $x > y$): $x(0) = C_1 + C_2 = \sigma$, $y(0) = C_1 - C_2 = 0$. Then $C_1 = C_2 = \frac{\sigma}{2}$, and from $x = \frac{\sigma}{2}(e^t + e^{-t}) = \sigma\cosh t$, $y = \frac{\sigma}{2}(e^t - e^{-t}) = \sigma\sinh t$ we get
$$x^2 - y^2 = \sigma^2,\quad x + y = \sigma e^t,\quad \sigma = \sqrt{x^2 - y^2},\quad e^t = \frac{x+y}{\sqrt{x^2-y^2}}.$$
$u(0) = C_3 = f(\sigma) = 1$, so $u(x,y) = u(\sigma(x,y), t(x,y)) = e^{2t} = \Big[\frac{x+y}{\sqrt{x^2-y^2}}\Big]^2$. Finally $u(x,y) = \frac{x+y}{x-y}$, for $x > y$. Make a test.

(ii) The boundary curve is the same, $u(0) = C_3 = f(\sigma) = \sigma^2$, and the solution is $u(x,y) = \sigma^2e^{2t} = (x+y)^2$, for $x > y$.

(iii) Again put $t = 0$ at the point of intersection of the characteristic curve with the boundary curve $x = \varphi(\sigma) = 0$, $y = \psi(\sigma) = \sigma$, $u = f(\sigma) = \sqrt{1+\sigma^2}$ (the boundary curve intersects the characteristics for $y > x$): $x(0) = C_1 + C_2 = 0$, $y(0) = C_1 - C_2 = \sigma$. Then $C_1 = -C_2 = \frac{\sigma}{2}$, and from $x = \frac{\sigma}{2}(e^t - e^{-t}) = \sigma\sinh t$, $y = \frac{\sigma}{2}(e^t + e^{-t}) = \sigma\cosh t$

where the right-hand integrals have already been defined, since each of the forms $f_i\omega$ has support inside the image under the parametrization $\varphi_i$ (and they equal $\int_M f_i\omega$ for the same reason). Actually, we can assume that our sum is finite, since it suffices to consider the integral over the images of the parametrizations covering the compact support of $\omega$. Hence, it is a well-defined number, yet it remains to verify that the resulting value is independent of all our choices. To this purpose, let us choose another system of parametrizations $\psi_j : \tilde V_j\to\tilde U_j$, again with compatible orientations, providing a locally finite cover of $M$. Let $g_j$ be the corresponding partition of unity. Then the sets $W_{ij} = U_i\cap\tilde U_j$ form again a locally finite cover, and the functions $f_ig_j$ provide the partition of unity subordinated to this cover. We arrive at the following equalities:
$$\sum_i\int_M f_i\omega = \sum_i\int_M f_i\Big(\sum_j g_j\Big)\omega = \sum_{i,j}\int_M f_ig_j\omega,$$
$$\sum_j\int_M g_j\omega = \sum_j\int_M g_j\Big(\sum_i f_i\Big)\omega = \sum_{i,j}\int_M f_ig_j\omega,$$
where the potentially infinite sums inside of the integrals are all locally finite, while the sums outside of the integrals can be viewed as finite due to the compactness of the support of $\omega$.
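Returning for a moment to the quasilinear problem 9.D.5 solved above: a minimal sympy check (assuming sympy) that the solution found there satisfies both the equation and the boundary condition:

```python
# Sketch: verify u = 2y/(2 - exp(x)) solves u_x - u*u_y = -u with u(0, y) = 2y.
import sympy as sp

x, y = sp.symbols('x y')
u = 2*y / (2 - sp.exp(x))

print(sp.simplify(sp.diff(u, x) - u*sp.diff(u, y) + u))   # 0
print(u.subs(x, 0))                                       # 2*y
```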
Thus, we have checked that the choices of the partition of unity and the parametrizations do not influence the value of the integral.

9.1.10. Exterior differential. As we have seen, the differential of a function can be interpreted as a mapping $d : \Omega^0(\mathbb{R}^n)\to\Omega^1(\mathbb{R}^n)$. By means of parametrizations, this definition extends (in a coordinate-free way) to functions $f$ on manifolds $M$, where the differential $df$ is a linear form on $M$. The following theorem extends this differential to arbitrary exterior forms on manifolds $M\subset\mathbb{R}^n$.

Exterior differential
Theorem. For all $m$-dimensional manifolds $M\subset\mathbb{R}^n$ and $k = 0,\dots,m$, there is the unique mapping $d : \Omega^k(M)\to\Omega^{k+1}(M)$ such that
(1) $d$ is linear with respect to multiplication by real numbers;
(2) for $k = 0$, this is the differential of functions;
(3) if $\alpha\in\Omega^k(M)$, $\beta$ arbitrary, then $d(\alpha\wedge\beta) = (d\alpha)\wedge\beta + (-1)^k\alpha\wedge(d\beta)$;
(4) $d(df) = 0$ for every function $f$ on $M$.
The mapping $d$ is called the exterior differential. The equality $d\circ d = 0$ is valid for all degrees $k$.

Proof. Each $k$-form can be written locally in the form $\alpha = \sum_{i_1<\cdots<i_k} a_{i_1\cdots i_k}\,dx_{i_1}\wedge\cdots\wedge dx_{i_k}$.

and we get
$$y^2 - x^2 = \sigma^2,\quad x + y = \sigma e^t,\quad \sigma = \sqrt{y^2 - x^2},\quad e^t = \frac{x+y}{\sqrt{y^2-x^2}}.$$
$u(0) = C_3 = f(\sigma) = \sqrt{1+\sigma^2}$, so $u(x,y) = u(\sigma(x,y), t(x,y)) = \sqrt{1+\sigma^2}\,e^{2t}$. Finally
$$u(x,y) = \sqrt{1 + y^2 - x^2}\cdot\frac{x+y}{y-x},\quad\text{for } y > x.$$
Make a test. How do we write the general solution of the equation? Take an implicit description of the characteristics: from $x = C_1e^t + C_2e^{-t}$, $y = C_1e^t - C_2e^{-t}$, $u = C_3e^{2t}$ we get that $\Phi = \frac{(x+y)^2}{u}$ and $\Psi = y^2 - x^2$ are constant along the characteristics. The general solution of the equation is any function constant on the characteristics: $u$ given by $\alpha(\Phi, \Psi) = 0$, or $u = (x+y)^2K(\Psi) = (x+y)^2K(y^2-x^2)$. Make a test. In the next example we show another way to solve a semilinear equation. $\square$

9.D.7. Solve the equation (S) and the equation (L),
$$(S)\;\; x^2u_x + xyu_y = u^2,\qquad (L)\;\; x^2u_x + xyu_y = x^2,$$
both generally and for the boundary conditions (i)–(iv):
(i) $u(1,y) = y$, (ii) $u(\cos\sigma, \sin\sigma) = 1$, $\sigma\in(-\frac{\pi}{2}, \frac{\pi}{2})$, (iii) $u(x, 1-x) = x$, (iv) $u(x, 1-x) = x^2$.

Solution. Let us start again with the characteristic equation $y' = \frac{dy}{dx} = \frac{xy}{x^2}$, hence $\frac{y}{x} = C$. Take the new coordinates $\xi(x,y) = \frac{y}{x}$, $\eta(x,y) = y$ and compute
$$u_x = u_\xi\xi_x + u_\eta\eta_x = -\frac{y}{x^2}u_\xi,\qquad u_y = u_\xi\xi_y + u_\eta\eta_y = \frac{1}{x}u_\xi + u_\eta.$$
The equation (S) now reads $u_\eta = \frac{u^2}{\eta^2}\cdot\xi$, with the solution
$$\int\frac{\partial u}{u^2} = \int\frac{\xi\,\partial\eta}{\eta^2} \;\Longrightarrow\; -\frac{1}{u} = -\frac{\xi}{\eta} + K(\xi).$$
We can write the general solution as
$$u(\xi,\eta) = \frac{\eta}{\xi + \eta D(\xi)},\qquad u(x,y) = \frac{x}{1 + xD\big(\frac{y}{x}\big)}.$$
Make a test. Now we can use the general solution to find the function $D$, depending on the given boundary condition:
(i) $u(1,\sigma) = \frac{1}{1 + D(\sigma)} = \sigma$, so $D(\sigma) = \frac{1}{\sigma} - 1$. From $\frac{y}{x} = \sigma$ we get the solution $u(x,y) = \frac{x}{1 + x(\frac{x}{y} - 1)} = \frac{xy}{y + x^2 - xy}$.
(ii) … for $x > 0$.
(iii) Analogously, $u(x,y) = x$.
(iv) Analogously, $u(x,y) = \frac{x^2}{x + y^2 + xy}$.
For the linear equation (L) the situation is much easier. We can see that $u = x$ is one of the solutions of both (S) and (L). For (L) we can use the principle of superposition and write the general solution as the sum of the general solution of the homogeneous case and any particular solution (make a test): $u(x,y) = x + D\big(\frac{y}{x}\big)$. For the boundary conditions we get
(i) $u(x,y) = x + \frac{y}{x} - 1$,
(ii) $u(x,y) = x + 1 - \frac{x}{\sqrt{x^2+y^2}}$, $x > 0$,
(iii) $u(x,y) = x$,
(iv) $u(\sigma, 1-\sigma) = \sigma + D\big(\frac{1-\sigma}{\sigma}\big) = \sigma^2$, so $D\big(\frac{1-\sigma}{\sigma}\big) = \sigma^2 - \sigma$; from $\frac{1-\sigma}{\sigma} = \frac{y}{x}$, i.e. $\sigma = \frac{x}{x+y}$, we get $D\big(\frac{y}{x}\big) = \sigma^2 - \sigma = -\frac{xy}{(x+y)^2}$, and the solution is $u(x,y) = x - \frac{xy}{(x+y)^2}$, $x\ne -y$.
Verify all the solutions by substitution, and use this method also for the previous problem. $\square$

9.D.8. Find the general solution of $yu_x + u_y = -u$. ⃝
9.D.9. Find the general solution of $u_x + u_y = -u(x-y)$ and the solutions for the given boundary conditions: (i) $u(0,y) = y$, (ii) $u(\sigma,-\sigma) = \frac{1}{\sigma}$, (iii) $u(x,0) = x^2$. ⃝

9.D.10. Solve $u_x - y^2u_y = u^2$, $u(x,1) = \frac{1}{2x}$. ⃝

9.D.11. Solve $xu_x + yu_y = 2(x^2+y^2)$, $u(1,y) = 2y^2 + 1$. ⃝

9.D.12. Solve $2xu_x - yu_y = x^2 + y^2$, $u(2,y) = 1 - y^2$. ⃝

The second step in the proof is the verification that the coordinate formula (5) correctly defines a differential operator on general manifolds $M$. In order to achieve this, it is sufficient to show that the coordinate expression of the exterior derivative commutes with the pullbacks of forms. Indeed, we may then define the differential operator around any point in any coordinates, and the results will coincide.³ Thus, consider a change of coordinates $G : U\to V$, $x = G(y) = (g_1(y),\dots,g_m(y))$, and compute $G^*(d\alpha)$ of an exterior form $\alpha = a\,dx_{i_1}\wedge\cdots\wedge dx_{i_k}$ (which gives the result for sums of such expressions too). This is straightforward:
$$G^*(d\alpha) = \sum_i G^*\Big(\frac{\partial a}{\partial x_i}\Big)G^*(dx_i)\wedge G^*(dx_{i_1})\wedge\cdots = \sum_i\Big(\frac{\partial a}{\partial x_i}\circ G\Big)\Big(\frac{\partial g_i}{\partial y_1}dy_1 + \cdots\Big)\wedge\Big(\frac{\partial g_{i_1}}{\partial y_1}dy_1 + \cdots\Big)\wedge\cdots.$$
Now, notice $d(G^*(dx_j)) = G^*(d(dx_j)) = 0$, and thus
$$d\big(G^*(\alpha)\big) = d\big((a\circ G)\,G^*(dx_{i_1})\wedge\dots\big) = d(a\circ G)\wedge G^*(dx_{i_1})\wedge\cdots\wedge G^*(dx_{i_k}) = \sum_i\Big(\Big(\frac{\partial a}{\partial x_i}\circ G\Big)\Big(\frac{\partial g_i}{\partial y_1}dy_1 + \dots\Big)\Big)\wedge G^*dx_{i_1}\wedge\dots,$$
clearly the same expressions. $\square$

³ Such operators are intrinsically defined on all manifolds. Actually, for all $k > 0$, the only operation $d : \Omega^k\to\Omega^{k+1}$ commuting with pullbacks and with values depending only on the behavior of the argument $\alpha$ on any small neighborhood of $x$ (locality of the operator) is the exterior derivative. Thus, even the linearity, as well as the dependence on the first derivatives, are direct consequences of naturality. See the book Natural operations in differential geometry, Springer, 1993, by I. Kolar, P.W. Michor and J. Slovak for the full proof of this astonishing claim.

9.1.11. Manifolds with boundary. In practical problems, we often work with manifolds $M$ like an open ball in the three-dimensional space. At the same time, we are interested in the boundaries of these manifolds $\partial M$, which is a sphere in the case of a ball. The simplest case is that of connected curves. Either it is a closed curve (like a circle in the plane), and then its boundary is empty, or the boundary is formed by two points. These points will be considered including the orientation inherited from the curve, i.e. the initial point will be taken with the minus sign, and the terminal point with the plus sign. The curve integral is the easiest one, and we can notice that when integrating the differential $df$ of a function along the curve $M$ defined as the image of a parametrization $c : [a,b]\to M$, we get directly from the definition that
$$\int_M df = \int_{[a,b]} c^*(df) = \int_a^b\frac{d}{dt}(f\circ c)(t)\,dt = f(c(b)) - f(c(a)).$$
Therefore, the result is not only independent of the selected parametrization, but also of the actual curve. Only the initial and terminal points matter. Splitting the curve into several consecutive disjoint intervals, the integral splits into the sum of the differences of the values at the splitting points.

9.D.13. Solve $x^2u_x + yu_y = 2u$, $u(1,y) = y^3$. ⃝

9.D.14. Solve the equation with the given boundary condition
$$u = u_xu_y,\qquad u(x,0) = x^2.$$

Solution. We show how to use the method of characteristics for the general first order equation $F(p, q, u, x, y) = 0$, where $p = u_x$, $q = u_y$. Our equation can be written in the form $u - p\cdot q = 0$. We need to solve a system of five ODEs:
$$\dot x = F_p = -q,\quad \dot y = F_q = -p,\quad \dot u = pF_p + qF_q = -2pq,\quad \dot p = -F_x - pF_u = -p,\quad \dot q = -F_y - qF_u = -q.$$
The solution of this characteristic system is
$$x = C_1e^{-t} + C_2,\quad y = C_3e^{-t} + C_4,\quad u = C_1C_3e^{-2t} + C_5,\quad p = C_3e^{-t},\quad q = C_1e^{-t}.$$
Without loss of generality, take $t = 0$ at the point of intersection of the characteristic lines with the boundary line $x = \varphi(\sigma) = \sigma$, $y = \psi(\sigma) = 0$, $u = f(\sigma) = \sigma^2$, and find $C_i = C_i(\sigma)$ from
$$x(0) = C_1 + C_2 = \sigma,\quad y(0) = C_3 + C_4 = 0,\quad u(0) = C_1C_3 + C_5 = \sigma^2,\quad p(0) = C_3 = p_0 = 2\sigma,\quad q(0) = C_1 = q_0 = \frac{\sigma}{2}.$$
The values of $p_0$ and $q_0$ are obtained from a system of two algebraic equations:
$$F(p_0, q_0, f(\sigma), \varphi(\sigma), \psi(\sigma)) = 0 \;\Longrightarrow\; \sigma^2 = p_0\cdot q_0,$$
and, differentiating the identity $u(\varphi(\sigma), \psi(\sigma)) = f(\sigma)$ with respect to $\sigma$,
$$u_x\varphi'(\sigma) + u_y\psi'(\sigma) = f'(\sigma) \;\Longrightarrow\; p_0\cdot 1 + q_0\cdot 0 = 2\sigma.$$
Finally, we get $C_1 = \frac{\sigma}{2}$, $C_2 = \frac{\sigma}{2}$, $C_3 = 2\sigma$, $C_4 = -2\sigma$, $C_5 = 0$. From $x(\sigma,t) = \frac{\sigma}{2}(e^{-t} + 1)$, $y(\sigma,t) = 2\sigma(e^{-t} - 1)$ we can write
$$\sigma(x,y) = x - \frac{y}{4},\qquad e^{-t}(x,y) = \frac{4x+y}{4x-y},$$

This sum is telescoping (i.e., the middle terms cancel out), resulting in the same value again. Notice that we have already proved the behavior expected in 9.1.5 when dealing with the elevation gain of a cyclist. We shall discuss this phenomenon in general dimensions now. To be able to do this, we need to formalize the concept of the boundary of a manifold and its orientation. The simplest case is the closed half-space $\bar M = (-\infty, 0]\times\mathbb{R}^{n-1}$. Its boundary is $\partial M = \{(x_1, x_2,\dots,x_n)\in\mathbb{R}^n;\ x_1 = 0\}$. The orientation on this boundary inherited from the standard orientation is the one determined by the form $dx_2\wedge\cdots\wedge dx_n$.

Oriented boundary of a manifold
Let us consider a closed subset $\bar M\subset\mathbb{R}^n$ such that its interior $M\subset\bar M$ is an oriented $m$-dimensional manifold covered by compatible parametrizations $\varphi_i$. Further, let us assume that for every boundary point $x\in\partial M = \bar M\setminus M$, there is a neighborhood in $\bar M$ with a parametrization $\varphi : V\subset(-\infty,0]\times\mathbb{R}^{m-1}\to\bar M$ such that the points $x\in\partial M\cap\varphi(V)$ form just the image of the boundary of the half-space $(-\infty,0]\times\mathbb{R}^{m-1}$. The subset $\bar M\subset\mathbb{R}^n$ covered by the above parametrizations with compatible orientations is called an oriented manifold with boundary. The restrictions of the parametrizations including boundary points to the boundary $\partial M$ define the structure of an $(m-1)$-dimensional oriented manifold on $\partial M$.

Think of the closed unit balls $B(x,r)\subset\mathbb{R}^n$ as such manifolds. Their interiors are $n$-dimensional manifolds, just open subsets in $\mathbb{R}^n$, but their boundaries $S^{n-1}$ are the spheres with the inherited structure of $(n-1)$-dimensional manifolds. The inherited orientations are well understood via the outward normals to the spheres. Another example is a plane disc sitting as a 2-dimensional manifold in $\mathbb{R}^3$, with its 1-dimensional boundary being a circle. Here the chosen position of the normal to the plane defines the orientation of the circle, one way or the other. In practice, we often deal with slightly more general manifolds where we allow for corners in the boundary of all smaller dimensions. A good example is the cube in $\mathbb{R}^3$, having the sides as 2-dimensional parts of the boundary, the edges between them as 1-dimensional parts, and the vertices as 0-dimensional parts of the boundary. Yet another class of examples is formed by all simplices and their curved embeddings in $\mathbb{R}^n$. Since those lower dimensional parts of the boundary have Riemann measure zero, we can neglect them when integrating over $\partial M$.
Thus, we shall not go into the details of this technical extension of our definitions.

and
$$u(\sigma(x,y), t(x,y)) = \sigma^2e^{-2t} = \Big[\Big(x - \frac{y}{4}\Big)\frac{4x+y}{4x-y}\Big]^2,\qquad u(x,y) = \frac{1}{16}(4x+y)^2.$$
Make a test. $\square$

9.D.15. Solve $u_x^2 + u_y^2 = 1$, $u(\cos\sigma, \sin\sigma) = 1$. ⃝

9.D.16. Solve $u_x^2 + u_y^2 + 2u = 0$, $u(\cos\sigma, \sin\sigma) = -\frac12$. ⃝

9.D.17. Solve $\sqrt{u_x^2 + u_y^2} - u = 0$, $u(\cos\sigma, \sin\sigma) = 1$. ⃝

9.D.18. Solve $u_x^2 + u_y^2 - u = 0$, $u(\cos\sigma, \sin\sigma) = 1$. ⃝

9.D.19. Solve $\sqrt{u_x^2 + u_y^2} + u = 0$, $u(\cos\sigma, \sin\sigma) = -1$. ⃝

9.D.20. Solve $u_xu_y + u = 0$, $u(x,0) = x$. ⃝

9.D.21. Solve $xu_x^2 + yu_y^2 = u$, $u(\sigma,\sigma) = 2\sigma$. ⃝

9.D.22. Solve $xu_x^2 + yu_y^2 = u$, $u(1,y) = 1$. ⃝

9.D.23. Solve $u_x^2 - u_y^2 = 4u$, $u(\cos\sigma, \sin\sigma) = \cos 2\sigma$. ⃝

9.D.24. Solve $u_x^2 + yu_y = u$, $u(1,y) = y$. ⃝

9.1.12. Stokes' theorem. Now, we get to a very important and useful result. We shall formulate the main theorem about the multidimensional analogy of curve integrals for smooth forms and smooth manifolds. A brief analysis of the proof shows that actually we need once continuously differentiable exterior forms as integrands on twice continuously differentiable parametrizations of the manifold. In practice, the boundary of the region is often similar to the case of the unit cube in $\mathbb{R}^3$, i.e., we have discontinuities of the derivatives on a Riemann measurable set of measure zero in the boundary. In such a case, we divide the integration into smooth parts and add the results up. We can notice that although new pieces of boundaries appear, they are adjacent and have opposite orientations in the adjacent regions, so their contributions cancel out (just like in the above case of the boundary points of a piecewise differentiable curve).

Stokes' theorem
Theorem. Consider a smooth exterior $(k-1)$-form $\omega$ with compact support on an oriented manifold $\bar M$ with boundary $\partial M$ with the inherited orientation. Then we have
$$\int_M d\omega = \int_{\partial M}\omega.$$

Proof. Using an appropriate locally finite cover of the manifold $\bar M$ and a partition of unity subordinated to it, we can express the integrals on both sides as the sum (even a finite sum, since the support of the considered form $\omega$ is compact) of integrals of forms supported in individual parametrizations. Thus, we can restrict ourselves to just two cases: $\bar M = \mathbb{R}^k$ or the half-space $\bar M = (-\infty,0]\times\mathbb{R}^{k-1}$. In both cases, $\omega$ will surely be the sum of forms
$$\omega_j = a_j(x)\,dx_1\wedge\cdots\wedge\widehat{dx_j}\wedge\cdots\wedge dx_k,$$
where the hat indicates the omission of the corresponding linear form, and $a_j(x)$ is a smooth function with compact support. Their exterior differentials are
$$d\omega_j = (-1)^{j-1}\frac{\partial a_j}{\partial x_j}\,dx_1\wedge\cdots\wedge dx_k.$$
Again, we can verify the claim of the theorem for such forms $\omega_j$ separately. Let us compute the integrals $\int_M d\omega_j$ using Fubini's theorem. This is most simple if $\bar M = \mathbb{R}^k$:
$$\int_{\mathbb{R}^k} d\omega_j = (-1)^{j-1}\int_{\mathbb{R}^{k-1}}\Big(\int_{-\infty}^{\infty}\frac{\partial a_j}{\partial x_j}\,dx_j\Big)dx_1\cdots\widehat{dx_j}\cdots dx_k = (-1)^{j-1}\int_{\mathbb{R}^{k-1}}\big[a_j\big]_{-\infty}^{\infty}\,dx_1\cdots\widehat{dx_j}\cdots dx_k = 0.$$
Notice that we are allowed to use Fubini's theorem for the entire $\mathbb{R}^k$, since the support of the integrated function is in fact compact, and thus we can replace the integration domain by a large multidimensional interval $I$. At the same time, the forms $\omega_j$ are all zero outside of such a large interval $I$, and thus the integrals $\int_{\partial M}\omega_j$ all vanish and the claim of Stokes' theorem is verified in this case.

9.D.25. Find all solutions (if they exist) of the system
$$u_x = f(x,y,u) = y(u - xy + 1),\qquad u_y = g(x,y,u) = x(u - xy + 1).$$

Solution.
A necessary (and sufficient) condition for the existence of a solution is $u_{xy} = u_{yx}$:
$$u_{xy} = f_y + f_ug = u - 2xy + 1 + xyu - x^2y^2 + xy,\qquad u_{yx} = g_x + g_uf = u - 2xy + 1 + xyu - x^2y^2 + xy.$$
Let us start with the first equation, $\frac{\partial u}{\partial x} = yu - xy^2 + y$, which is linear, so we can look for the solution in the form $u(x,y) = K(x,y)\,e^{xy}$. Then $u_x = K_xe^{xy} + yKe^{xy} = yKe^{xy} - xy^2 + y$, hence $K_x = (y - xy^2)\,e^{-xy}$, and by per partes we have $K(x,y) = xy\,e^{-xy} + D(y)$. Substituting this into the second equation,
$$u(x,y) = xy + D(y)\,e^{xy} \;\Longrightarrow\; u_y = x + D'(y)\,e^{xy} + xD(y)\,e^{xy} = g(x,y,u),$$
we get $D'(y) = 0$, so $D$ is constant and all solutions are given by $u(x,y) = xy + De^{xy}$. $\square$

9.D.26. Find the solution (if it exists) of the system $u_x = 2xy^2u$, $u_y = 2x^2yu$. ⃝

9.D.27. Find the solution (if it exists) of the system $xu_x = u - y$, $yu_y = u - x$. ⃝

9.D.28. For the second order semilinear equation we will use the notation
$$A(x,y)u_{xx} + 2B(x,y)u_{xy} + C(x,y)u_{yy} = F(u_x, u_y, u, x, y).$$
Show that the characteristic polynomial of the matrix $\begin{pmatrix} A & B\\ B & C\end{pmatrix}$ has
(E) two nonzero real roots (not necessarily different) of the same sign ($\lambda_1\cdot\lambda_2 > 0$) if and only if $B^2 - AC < 0$; the equation is elliptic;
(H) two nonzero real roots of different signs ($\lambda_1\cdot\lambda_2 < 0$) if and only if $B^2 - AC > 0$; the equation is hyperbolic;
(P) two real roots of which (at least) one is zero ($\lambda_1\cdot\lambda_2 = 0$) if and only if $B^2 - AC = 0$; the equation is parabolic. ⃝

Actually, we may also say that $\partial M = \emptyset$, and thus the integral is zero. Next, let us assume that $\bar M$ is the half-space $(-\infty,0]\times\mathbb{R}^{k-1}$. If $j > 1$, the form $\omega_j$ evaluates identically to zero on the boundary $\partial M$, since $x_1$ is constant there and thus $dx_1$ is identically zero on all tangent directions to $\partial M$. Integration over the interior $M$ yields zero, using the same approach as above:
$$\int_M d\omega_j = (-1)^{j-1}\int_{-\infty}^{0}\int_{\mathbb{R}^{k-2}}\Big(\int_{-\infty}^{\infty}\frac{\partial a_j}{\partial x_j}\,dx_j\Big)dx_1\cdots\widehat{dx_j}\cdots dx_k = (-1)^{j-1}\int_{-\infty}^{0}\int_{\mathbb{R}^{k-2}}\big[a_j\big]_{-\infty}^{\infty}\,dx_1\cdots\widehat{dx_j}\cdots dx_k = 0,$$
since the function $a_j$ has compact support. So the theorem is also true in this case. However, if $j = 1$, then we obtain
$$\int_M d\omega_1 = \int_{\mathbb{R}^{k-1}}\Big(\int_{-\infty}^{0}\frac{\partial a_1}{\partial x_1}\,dx_1\Big)dx_2\cdots dx_k = \int_{\mathbb{R}^{k-1}} a_1(0, x_2,\dots,x_k)\,dx_2\cdots dx_k = \int_{\partial M}\omega_1.$$
This finishes the proof of Stokes' theorem. $\square$

9.1.13. Green's theorem. We have proved an extraordinarily strong result which covers several standard integral relations from the classical vector analysis. For instance, we can notice that by Stokes' theorem, the integral of the exterior differential $d\omega$ of any $k$-form over a compact manifold without boundary is always zero (for example, the integral of any 2-form $d\omega$ over the sphere $S^2\subset\mathbb{R}^3$ vanishes). Let us look step by step at the cases of Stokes' theorem with $k$-dimensional boundaries $\partial M$ in $\mathbb{R}^n$ in low dimensions.

Green's theorem
In the case $n = 2$, $k = 1$, we are examining a domain $M$ in the plane, bounded by a closed curve $C = \partial M$. Differential 1-forms are $\omega(x,y) = f(x,y)\,dx + g(x,y)\,dy$, with the differential
$$d\omega = \Big(-\frac{\partial f}{\partial y} + \frac{\partial g}{\partial x}\Big)\,dx\wedge dy.$$
Therefore, Stokes' theorem yields the formula
$$\int_C f(x,y)\,dx + g(x,y)\,dy = \int_M\Big(-\frac{\partial f}{\partial y} + \frac{\partial g}{\partial x}\Big)\,dx\wedge dy,$$
which is one of the standard forms of Green's theorem.

Using the standard scalar product on $\mathbb{R}^2$, we can identify the vector field $X$ with a linear form $\omega_X$ such that $\omega_X(Y) = \langle Y, X\rangle$. In the standard coordinates $(x,y)$, this just means that the field $X = f(x,y)\frac{\partial}{\partial x} + g(x,y)\frac{\partial}{\partial y}$ corresponds to the form $\omega = f(x,y)\,dx + g(x,y)\,dy$ given above.
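A minimal sympy check of Green's theorem as just stated (assuming sympy; the form $\omega = -y\,dx + x\,dy$ on the unit disc is an illustrative choice):

```python
# Sketch: both sides of Green's theorem for omega = -y dx + x dy on the
# unit disc M with boundary circle C; both equal twice the area of the disc.
import sympy as sp

t, r, phi, x, y = sp.symbols('t r phi x y')
f, g = -y, x

# left-hand side: line integral over C parametrized by (cos t, sin t)
cx, cy = sp.cos(t), sp.sin(t)
lhs = sp.integrate(f.subs({x: cx, y: cy})*sp.diff(cx, t)
                   + g.subs({x: cx, y: cy})*sp.diff(cy, t), (t, 0, 2*sp.pi))

# right-hand side: (-f_y + g_x) dx ^ dy over the disc, in polar coordinates
integrand = -sp.diff(f, y) + sp.diff(g, x)        # the constant 2
rhs = sp.integrate(integrand * r, (r, 0, 1), (phi, 0, 2*sp.pi))
print(lhs, rhs)                                   # 2*pi, 2*pi
```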
The integral of $\omega_X$ over a curve $C$ has the physical interpretation of the work done by movement along this curve in the force field $X$. Green's theorem then says, besides others, that if $\omega_X = dF$ for some function $F$, then the work done along a closed curve is always zero.

9.D.29. Show that the solution of
(H) the hyperbolic equation in canonical form, $u_{xy} = 0$, is any function $u(x,y) = F(x) + G(y)$;
(E) the elliptic equation in canonical form, $\Delta v = v_{xx} + v_{yy} = 0$, is any function $v(x,y) = F(x + iy) + G(x - iy)$;
(P) the parabolic equation in canonical form, $w_{xx} = kw_y$, is any function $w(x,y) = e^{k(Cx + C^2y)}$.

Solution. $u_x = F'$, $u_{xy} = 0$. $v_x = F' + G'$, $v_{xx} = F'' + G''$, $v_y = iF' - iG'$, $v_{yy} = -F'' - G''$, hence $v_{xx} + v_{yy} = 0$. $w_x = kC\,e^{k(Cx+C^2y)}$, $w_{xx} = k^2C^2\,e^{k(Cx+C^2y)}$, $w_y = kC^2\,e^{k(Cx+C^2y)}$, hence $w_{xx} = kw_y$. $\square$

9.D.30. Show that real solutions of the Laplace equation in polar coordinates $x = r\cos\varphi$, $y = r\sin\varphi$ (compute it),
$$u_{rr} + \frac{1}{r^2}u_{\varphi\varphi} + \frac{1}{r}u_r = 0,$$
are the harmonic functions given by $\alpha_n(r,\varphi) = r^n\cos n\varphi$, $\beta_n(r,\varphi) = r^n\sin n\varphi$, $n\in\mathbb{N}_0$. Find their expressions $\alpha_n(x,y)$, $\beta_n(x,y)$ for $n = 0, 1, 2, 3$. ⃝

9.D.31. Find the general solution of the second order equation
$$x^2u_{xx} - 2xyu_{xy} + y^2u_{yy} - x^2u_x + (x+2)yu_y = 0.$$

Solution. We have $A = x^2$, $B = -xy$, $C = y^2$, $B^2 - AC = 0$, and the equation is parabolic. Let us solve the characteristic equation
$$\frac{dy}{dx} = y' = \frac{B\pm\sqrt{B^2 - AC}}{A} = \frac{-xy}{x^2} = -\frac{y}{x}.$$
We have the solution in implicit form $xy = C = \xi$ ($\xi = \xi(x,y)$ will be a new coordinate function). For the second new coordinate function $\eta = \eta(x,y)$ we can take any independent function ($\xi_x\eta_y\ne\xi_y\eta_x$), for example $\eta = x$. The inverse transformation is $x = \eta$, $y = \frac{\xi}{\eta}$. We get
$$u_x = u_\xi\xi_x + u_\eta\eta_x = \frac{\xi}{\eta}u_\xi + u_\eta,\qquad u_y = u_\xi\xi_y + u_\eta\eta_y = \eta u_\xi,$$

Such fields are called potential fields, and the function $F$ is the potential of the field $X$. In other words, the work done when moving in potential fields does not depend on the path; it depends only on the initial and terminal points. With Green's theorem, we have verified once again that integrating the differential of a function along a curve depends solely on the initial and terminal points of the curve.

9.1.14. The divergence theorem. The next case deals with integrating over an open subset in $\mathbb{R}^3$, and it has got a lot of incarnations in practical use. We shall mention a few.

Gauss–Ostrogradsky's theorem
In the case $n = 3$, $k = 2$, we are examining a region $M\subset\mathbb{R}^3$, bounded by a surface $S$. All 2-forms are of the form $\omega = f(x,y,z)\,dy\wedge dz + g(x,y,z)\,dz\wedge dx + h(x,y,z)\,dx\wedge dy$, and we get
$$d\omega = \Big(\frac{\partial f}{\partial x} + \frac{\partial g}{\partial y} + \frac{\partial h}{\partial z}\Big)\,dx\wedge dy\wedge dz.$$
Stokes' theorem says that
$$\int_S f\,dy\wedge dz + g\,dz\wedge dx + h\,dx\wedge dy = \int_M\Big(\frac{\partial f}{\partial x} + \frac{\partial g}{\partial y} + \frac{\partial h}{\partial z}\Big)\,dx\wedge dy\wedge dz.$$
This is the statement of the Gauss–Ostrogradsky theorem.

This theorem has a very illustrative physical interpretation, too. Every vector field $X = f(x,y,z)\frac{\partial}{\partial x} + g(x,y,z)\frac{\partial}{\partial y} + h(x,y,z)\frac{\partial}{\partial z}$ can be plugged into the first argument of the standard volume form $\omega_{\mathbb{R}^3} = dx\wedge dy\wedge dz$ on $\mathbb{R}^3$. Clearly, the result is the 2-form
$$\omega_X(x,y,z) = f(x,y,z)\,dy\wedge dz + g(x,y,z)\,dz\wedge dx + h(x,y,z)\,dx\wedge dy.$$
The latter 2-form infinitesimally describes the volume of the parallelepiped given by the flux caused by the field $X$ through a linearized piece of surface. If we consider the vector field to be the velocity of the flow of the particular points of the space, this infinitesimally describes the volume transported pointwise by the flow through the given surface $S$. Thus, the left-hand side is the total change of volume inside of $S$, caused by the flow of $X$.
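This balance of flux and divergence can be cross-checked symbolically on example 9.B.14 above. A minimal sympy sketch (assuming sympy):

```python
# Sketch: integrate div F over the solid cylinder of 9.B.14 and recover 32*pi.
import sympy as sp

x, y, z, rho, phi = sp.symbols('x y z rho phi', real=True)
F = (x*y**2, y*z, x**2*z)
divF = sum(sp.diff(Fi, w) for Fi, w in zip(F, (x, y, z)))  # y**2 + z + x**2

# cylindrical coordinates with Jacobian rho; x**2 + y**2 <= 4, 1 <= z <= 3
integrand = divF.subs({x: rho*sp.cos(phi), y: rho*sp.sin(phi)}) * rho
T = sp.integrate(integrand, (rho, 0, 2), (phi, 0, 2*sp.pi), (z, 1, 3))
print(sp.simplify(T))   # 32*pi, matching 9.B.14
```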
The integrand of the right-hand side of the integral is related to the so-called divergence of the vector field, which is the expression defined as
$$d(\omega_X) = (\operatorname{div} X)\,dx\wedge dy\wedge dz.$$
The Gauss–Ostrogradsky theorem says
$$\int_S i_X\,\omega_{\mathbb{R}^3} = \int_M\operatorname{div} X\;\omega_{\mathbb{R}^3},$$
i.e. the total volume flow through a surface is given as the integral of the divergence of the vector field over the interior. In particular, if $\operatorname{div} X$ vanishes identically, then the total volume flow through the boundary surface of the region is zero as well.

$$u_{xx} = \frac{\xi^2}{\eta^2}u_{\xi\xi} + 2\frac{\xi}{\eta}u_{\xi\eta} + u_{\eta\eta},\qquad u_{xy} = \xi u_{\xi\xi} + \eta u_{\xi\eta} + u_\xi,\qquad u_{yy} = \eta^2u_{\xi\xi}.$$
Finally, the equation takes the form $\eta^2(u_{\eta\eta} - u_\eta) = 0$, hence $u_{\eta\eta} = u_\eta$. Integrating twice, we have
$$u(\xi,\eta) = G(\xi)\,e^\eta + F(\xi),\qquad u(x,y) = G(xy)\,e^x + F(xy);$$
make a test. Try to solve the equation for a different choice of the function $\eta$. $\square$

9.D.32. Find the general solution of the second order equation $yu_{yy} - xu_{xy} + u_y = 0$.

Solution. We have $A = 0$, $B = -\frac{x}{2}$, $C = y$, $B^2 - AC = \frac{x^2}{4} > 0$ for $x\ne 0$, and this equation is hyperbolic. The characteristic equation (we cannot divide by $A = 0$!) is now
$$\frac{dx}{dy} = x' = \frac{B\pm\sqrt{B^2-AC}}{C} = \frac{1}{y}\Big(-\frac{x}{2}\pm\frac{x}{2}\Big).$$
We have two solutions in implicit form (the new coordinate functions $\xi = \xi(x,y)$, $\eta = \eta(x,y)$):
$$x = C_1 = \xi,\quad xy = C_2 = \eta \;\Longrightarrow\; x = \xi,\; y = \frac{\eta}{\xi}.$$
Finally, $u_x = u_\xi + \frac{\eta}{\xi}u_\eta$, $u_y = \xi u_\eta$, $u_{xy} = \xi u_{\xi\eta} + \eta u_{\eta\eta} + u_\eta$, $u_{yy} = \xi^2u_{\eta\eta}$, and in the new coordinates $\xi, \eta$ the equation has the canonical form $u_{\xi\eta} = 0$. The general solution is $u(\xi,\eta) = F(\xi) + G(\eta)$, $u(x,y) = F(x) + G(xy)$. $\square$

9.D.33. Find the general solution of the second order equation
$$u_{xx} - 2\sin x\,u_{xy} + (2 - \cos^2x)u_{yy} - \cos x\,u_y = 0.$$

Solution. We have $A = 1$, $B = -\sin x$, $C = 2 - \cos^2x$, $B^2 - AC < 0$, and the equation is elliptic. Let us solve the characteristic equation
$$\frac{dy}{dx} = y' = \frac{B\pm\sqrt{B^2-AC}}{A} = -\sin x\pm i.$$
We have the solution in implicit form $y - \cos x - ix = C_1 = \Phi(x,y)$, $y - \cos x + ix = C_2 = \Psi(x,y)$, and the new (real) coordinate functions are
$$\xi = \frac12(\Phi + \Psi) = y - \cos x,\qquad \eta = \frac{1}{2i}(\Psi - \Phi) = x.$$

Such fields, with $\operatorname{div} X = 0$, are called divergence free or solenoidal vector fields. They correspond to dynamics without changes of volumes (e.g. modelling the dynamics of incompressible liquids). In order to reformulate the theorem completely in terms of functions, let us observe that the inherited volume form $\omega_S$ on $S$ is defined by the property $\nu^*\wedge\omega_S = \omega_{\mathbb{R}^3}$ at all points of $S$, where $\nu^*$ is the form dual to the oriented (outward) unit normal to $S$. All forms of degree 2 are multiples of $\omega_S$ by functions. In particular, $i_X(\nu^*\wedge\omega_S) = \nu\cdot X\;\omega_S$, i.e. we have to integrate the scalar product of the vector field $X$ with the unit normal vector with respect to the standard volume on $S$. Thus, we have proved the following result, formulated in the classical vector analysis style. Actually, a simple check reveals that the above arguments work for all open submanifolds $M\subset\mathbb{R}^n$ with boundary hypersurface $S$ and vector fields $X$. The reader should easily verify this in detail.

Divergence theorem
Theorem. Let $X$ be a vector field on an $n$-dimensional manifold $M\subset\mathbb{R}^n$ with hypersurface boundary $S$. Then
$$\int_M\operatorname{div} X\,dx_1\dots dx_n = \int_S X\cdot\nu\,dS,$$
where $\nu$ is the oriented (outward) unit normal to $S$ and $dS$ stands for the volume inherited from $\mathbb{R}^n$ on $S$.

Notice that the 2-dimensional case coincides with Green's theorem above.

9.1.15. The original Stokes theorem. If $\omega$ is any linear form, then the integral of $d\omega$ over a surface depends on the boundary curve only. This is the most classical Stokes' theorem:

The classical Stokes' theorem
In the case $n = 3$, $k = 1$, we deal with a surface $M$ in $\mathbb{R}^3$ bounded by a curve $C$.
The general linear forms are $\omega = f\,dx + g\,dy + h\,dz$, with the integral
$$\int_C f\,dx + g\,dy + h\,dz = \int_M d\omega,$$
where
$$d\omega = \Big(\frac{\partial h}{\partial y} - \frac{\partial g}{\partial z}\Big)dy\wedge dz + \Big(\frac{\partial f}{\partial z} - \frac{\partial h}{\partial x}\Big)dz\wedge dx + \Big(\frac{\partial g}{\partial x} - \frac{\partial f}{\partial y}\Big)dx\wedge dy.$$

Again, we use the standard scalar product to identify the vector field $X = f\frac{\partial}{\partial x} + g\frac{\partial}{\partial y} + h\frac{\partial}{\partial z}$ with the form $\omega_X = f\,dx + g\,dy + h\,dz$. Finally, reverting the above relation between the vector fields and two-forms on $\mathbb{R}^3$, the 2-form $d\omega_X$ can be identified with the vector field $\operatorname{rot} X$, $d\omega_X = \omega_{\mathbb{R}^3}(\operatorname{rot} X,\ ,\ )$.

We have $\xi_x = \sin x$, $\xi_y = 1$, $\xi_{xx} = \cos x$, $\xi_{xy} = 0$, $\xi_{yy} = 0$, $\eta_x = 1$, $\eta_y = \eta_{xx} = \eta_{xy} = \eta_{yy} = 0$, and
$$u_y = u_\xi,\quad u_{xy} = u_{\xi\xi}\sin\eta + u_{\xi\eta},\quad u_{yy} = u_{\xi\xi},\quad u_{xx} = u_{\xi\xi}\sin^2\eta + 2u_{\xi\eta}\sin\eta + u_{\eta\eta} + u_\xi\cos\eta.$$
Finally, the equation becomes canonical: $\Delta u = u_{\xi\xi} + u_{\eta\eta} = 0$. The general solution is $u(\xi,\eta) = C(\xi + i\eta) + D(\xi - i\eta)$ (make a test),
$$u(x,y) = C(y - \cos x + ix) + D(y - \cos x - ix).$$
Real solutions are given by harmonic functions in the variables $\xi, \eta$. $\square$

9.D.34. Solve $x^2u_{xx} - 2xu_{xy} + u_{yy} + xu_x = 0$. ⃝

9.D.35. Solve $xu_{xx} - yu_{xy} + u_x = 0$. ⃝

9.D.36. Solve $u_{xx} - u_{xy} + 2u_{yy} = 0$. ⃝

9.D.37. Find the canonical form of $y^2u_{xx} + x^2u_{yy} = 0$. ⃝

9.D.38. Show that the solution of the nonhomogeneous wave equation with initial conditions (an infinite string),
$$u_{tt}(t,x) = a^2u_{xx}(t,x) + f(t,x),\quad t\in(0,\infty),\ x\in\mathbb{R},\qquad u(0,x) = \varphi(x),\quad u_t(0,x) = \psi(x),\quad x\in\mathbb{R},$$
is given by d'Alembert's formula:
$$u(t,x) = \frac{\varphi(x-at) + \varphi(x+at)}{2} + \frac{1}{2a}\int_{x-at}^{x+at}\psi(\xi)\,d\xi + \frac{1}{2a}\int_0^t\Big(\int_{x-a(t-\sigma)}^{x+a(t-\sigma)} f(\sigma,\xi)\,d\xi\Big)d\sigma. \;⃝$$

This field is called the rotation or curl of the vector field $X$. Stokes' theorem now reads:
$$\int_C\omega_X = \int_M\operatorname{rot} X.$$
Consequently, the fields $X$ with the property $\omega_X = dF$ for some function $F$ (the fields of gradients of functions) have got the property $\operatorname{rot} X = 0$. They are called conservative (or potential) vector fields.

9.1.16. Brouwer's theorem. Among many useful fixed-point theorems, the Brouwer theorem is particularly nice.⁴ We present a special case here. In particular, our formulation clearly must hold true for any homeomorphic image of the domain $K$. In fact, the convexity ensures that there are no "holes" inside of $K$. For example, rotating an annulus $r\le\|x\|\le R$ in the plane clearly does not have any fixed point.

Brouwer's fixed-point theorem
Theorem. Let $K\subset\mathbb{R}^n$ be a compact convex submanifold of dimension $n$ with boundary $\partial K$. Then every continuous mapping $f : K\to K$ has got at least one fixed point $x$, i.e. $f(x) = x$.

Proof. We shall restrict ourselves to the case of a smooth manifold $K$ and a smooth mapping $f$. In fact, if a continuous $f$ had no fixed point, then it would be possible to approximate it by a smooth $\tilde f$ without any fixed point (e.g., by taking a convolution $\tilde f = f*\varphi$ with a suitable smooth kernel $\varphi$ enjoying a very small support). Assume $f : K\to K$ is such a smooth mapping with $f(x)\ne x$ for all $x\in K$. For each $y\in K$, consider the ray $L$ from $f(y)$ through $y$. This ray leaves $K$ at the unique point $F(y)\in\partial K$, and $F : K\to\partial K$ is smooth. Notice that we use the convexity assumption here (it is evident if $K$ is a ball $B$ of diameter $r$, and the general case can be transformed to the ball case by "smoothly expanding" $K$ from an inner point to a big enough ball $B$). In particular, the construction implies $F|_{\partial K} = \mathrm{id}_{\partial K}$. Thus, by our assumptions, $F : K\to\partial K$ is a smooth retraction of $K$ to its boundary. Now, we may consider a smooth exterior form $\omega\in\Omega^{n-1}(K)$ providing the standard (inherited) volume on $\partial K$, and we employ the general Stokes' theorem:
$$0 < \int_{\partial K}\omega = \int_{\partial K} F^*\omega = \int_K d(F^*\omega) = \int_K F^*(d\omega) = \int_K F^*(0) = 0.$$
This is a contradiction, and the theorem is proved. $\square$

⁴ Its 2-dimensional version is attributed to Luitzen Egbertus Jan Brouwer (1881–1966), a Dutch mathematician and philosopher, who is said to have noticed that when stirring a cup of coffee with sugar, at least one point always stays fixed.

9.D.39. Using d'Alembert's formula, solve
$$u_{tt}(t,x) = u_{xx}(t,x) + \cos x,\quad t\in(0,\infty),\ x\in\mathbb{R},\qquad u(0,x) = x^2,\quad u_t(0,x) = \operatorname{arctg} x,\quad x\in\mathbb{R}.$$

Solution.
$$u_1(t,x) = \frac{\varphi(x-at) + \varphi(x+at)}{2} = \frac{(x-t)^2 + (x+t)^2}{2} = x^2 + t^2,$$
$$u_2(t,x) = \frac{1}{2a}\int_{x-at}^{x+at}\psi(\xi)\,d\xi = \frac12\int_{x-t}^{x+t}\operatorname{arctg}\xi\,d\xi = \frac12\Big[\xi\operatorname{arctg}\xi - \frac12\ln(1+\xi^2)\Big]_{x-t}^{x+t}$$
$$= \frac12\Big[(x+t)\operatorname{arctg}(x+t) - (x-t)\operatorname{arctg}(x-t) + \ln\sqrt{\frac{1+(x-t)^2}{1+(x+t)^2}}\Big],$$
$$u_3(t,x) = \frac{1}{2a}\int_0^t\Big(\int_{x-a(t-\sigma)}^{x+a(t-\sigma)} f(\sigma,\xi)\,d\xi\Big)d\sigma = \frac12\int_0^t\Big(\int_{x-t+\sigma}^{x+t-\sigma}\cos\xi\,d\xi\Big)d\sigma = \frac12\int_0^t\big[\sin\xi\big]_{x-t+\sigma}^{x+t-\sigma}\,d\sigma$$
$$= \frac12\int_0^t\big[\sin(x+t-\sigma) - \sin(x-t+\sigma)\big]\,d\sigma = \frac12\big[\cos(x+t-\sigma)\big]_0^t + \frac12\big[\cos(x-t+\sigma)\big]_0^t = \frac12\big[-\cos(x+t) - \cos(x-t)\big] + \cos x.$$
$u(t,x) = u_1(t,x) + u_2(t,x) + u_3(t,x)$. $\square$

9.D.40. Solve ($x\in\mathbb{R}$, $t\in(0,\infty)$):
a) $u_{tt} - u_{xx} = 0$, $u(0,x) = \sin x$, $u_t(0,x) = x\cos x$;
b) $u_{tt} - u_{xx} = 0$, $u(0,x) = 2x$, $u_t(0,x) = \ln(1+x^2)$;
c) $u_{tt} - u_{xx} = \sin x$, $u(0,x) = x$, $u_t(0,x) = \frac{1}{x}$. ⃝

9.D.41. Solve $u_{tt}(t,x) = u_{xx}(t,x) + 2\sin x$, $t\in(0,\infty)$, $x\in\mathbb{R}$, $u(0,x) = x^2 + 2x$, $u_t(0,x) = x\cos x$, $x\in\mathbb{R}$. ⃝

9.D.42. Solve $u_{tt}(t,x) = u_{xx}(t,x) + 2\cos x$, $t\in(0,\infty)$, $x\in\mathbb{R}$, $u(0,x) = 2x^2 - x$, $u_t(0,x) = x\sin x$, $x\in\mathbb{R}$. ⃝

9.1.17. Another kind of integration. As we have seen, solutions to ODEs are flows of vector fields. As a modification, we can prescribe one-dimensional linear subspaces $L_x\subset T_xM$ at each point of a manifold $M$ and look for unparameterized curves $P$ tangent to them at all points. This is a coordinate-free version of the ODE theory. Indeed, locally we may always choose a vector field $X$ generating the spaces $L_x$, and in each coordinate patch, the flow of $X$ provides the parameterized one-dimensional submanifolds $P\subset M$ tangent to $L_x$ at all points. A change of coordinates or of $X$ will change the parameterizations, but not the curves $P$. If we want to describe an $n$-dimensional submanifold $N\subset M$, $1 < n < \dim M$, in a similar way, we define $n$-dimensional subspaces $D_x\subset T_xM$ for all $x\in M$ and seek a submanifold $N$ with $T_yN = D_y$ at all $y\in N$.

Integrability of distributions
The union $D\subset TM$ of individual linear subspaces $D_x\subset T_xM$, $x\in M$, is called a distribution $D$ on $M$. We say that the distribution is $n$-dimensional and smooth if each fixed point $x$ allows for a neighborhood $U$ and $n$ linearly independent smooth vector fields $X_1,\dots,X_n$ generating $D_y$ at all $y\in U$. The distribution is called integrable if for each point $x\in M$, there is a submanifold $N\subset M$ such that $x\in N$ and $T_yN = D_y$ for all $y\in N$.

Our goal is to give necessary and sufficient conditions for smooth distributions to be integrable. Clearly, the case $n = 1$ is trivial, since we already know that the conditions are empty; each such distribution is integrable. The core idea is to use the so-called flow box theorem for vector fields, proved in 8.3.15, and to exploit the individual flows of the chosen generators $X_1,\dots,X_n$ in order to "draw" new coordinates, in which the integral submanifold would appear as given by $x_{n+1} = 0,\dots,x_m = 0$. The problem we face is that the flows do not commute in general, and thus our idea will not work.

9.1.18. Lie bracket of vector fields.
9.1.18. Lie bracket of vector fields. Fortunately, the commutativity of the flows is captured by a simple differential operation. Consider two vector fields on $\mathbb R^m$,
\[ X = X_1(x)\tfrac{\partial}{\partial x_1} + \dots + X_m(x)\tfrac{\partial}{\partial x_m}, \qquad Y = Y_1(x)\tfrac{\partial}{\partial x_1} + \dots + Y_m(x)\tfrac{\partial}{\partial x_m}. \]
The commutator of the derivatives of functions in the directions of these vector fields is
\[ Y(Xf) - X(Yf) = \sum_{i,j}\Big(Y_i\tfrac{\partial}{\partial x_i}\big(X_j\tfrac{\partial f}{\partial x_j}\big) - X_i\tfrac{\partial}{\partial x_i}\big(Y_j\tfrac{\partial f}{\partial x_j}\big)\Big) = \sum_{i,j}\Big(Y_j\tfrac{\partial X_i}{\partial x_j} - X_j\tfrac{\partial Y_i}{\partial x_j}\Big)\tfrac{\partial f}{\partial x_i}, \]
thanks to the commutativity of the second derivatives of $f$. Thus the commutator of the two vector fields behaves as a vector field, denoted $[X,Y]$ and written out below.

9.D.43. $u_{tt} = 4u_{xx} + \cos^2 x$, $t\in(0,\infty)$, $x\in\mathbb R$, $u(0,x) = x^3$, $u_t(0,x) = \frac{1}{\sqrt{1+x^2}}$. ⃝
9.D.44. $u_{tt} = 9u_{xx} - 4\cos x$, $t\in(0,\infty)$, $x\in\mathbb R$, $u(0,x) = x^2$, $u_t(0,x) = x e^x$. ⃝
9.D.45. Solve the parabolic equation $u_t = ku_{xx}$, $t\in(0,\infty)$, $x\in(0,l)$, with Dirichlet boundary conditions $u(t,0) = u(t,l) = 0$ and initial condition $u(0,x) = \varphi(x) = x(l-x)$, $x\in(0,l)$.
Solution. Suppose the solution has the form $u(t,x) = X(x)T(t)$, so $u_t = XT'$, $u_{xx} = X''T$; dividing both sides of the equation by $XT$ we get $\frac{T'}{kT} = \frac{X''}{X} = -\lambda$, where $\lambda$ is a constant, because the left side depends only on $t$ and the right side only on $x$. Solving both ordinary equations $T' + \lambda kT = 0$, $X'' + \lambda X = 0$, we get $T(t) = C e^{-\lambda kt}$, $X(x) = C_1e^{\mu x} + C_2e^{-\mu x}$, where $\mu = \sqrt{-\lambda}$. The boundary conditions give $X(0) = C_1 + C_2 = 0 \Rightarrow C_2 = -C_1$ and $X(l) = C_1(e^{\mu l} - e^{-\mu l}) = 0 \Rightarrow e^{2\mu l} = 1 = e^{2\pi n i}$. Hence $\mu_n = \frac{in\pi}{l}$ and $\lambda_n = \frac{n^2\pi^2}{l^2}$, $n = 0,1,2,\dots$ We have $T_n = C_n e^{-\frac{n^2\pi^2}{l^2}kt}$, $X_n = D_n\big(e^{i\frac{n\pi}{l}x} - e^{-i\frac{n\pi}{l}x}\big) = 2iD_n\sin\big(\frac{n\pi}{l}x\big)$. The solution of the equation with the boundary conditions is the series
\[ u(t,x) = \sum_{n=0}^\infty T_nX_n = \sum_{n=1}^\infty K_n\sin\big(\tfrac{n\pi}{l}x\big)\,e^{-\frac{n^2\pi^2}{l^2}kt}. \]

The vector field
\[ [X,Y] = \sum_{i,j=1}^m\Big(Y_j\tfrac{\partial X_i}{\partial x_j} - X_j\tfrac{\partial Y_i}{\partial x_j}\Big)\tfrac{\partial}{\partial x_i} \]
is called the Lie bracket⁵ of its arguments. It is easy to see that $[\ ,\ ]$ is a bilinear antisymmetric operation (over the real scalars) on differentiable vector fields, and expanding the commutators explicitly we arrive at the so-called Jacobi identity $[X,[Y,Z]] = [[X,Y],Z] + [Y,[X,Z]]$, valid for all triples of vector fields, and the Leibniz derivative property $[X, fY] = (Xf)Y + f[X,Y]$.

Remark. In fact, it is quite straightforward to see that vector fields $X$ and the diffeomorphisms $\operatorname{Fl}^X_t$ in their flows are linked in a manner very similar to square matrices $A$ and their exponential images $e^{tA}$. The Lie bracket encodes the composition of the diffeomorphisms as the commutators of matrices encode matrix multiplication. Thus it is not surprising that the flows of two vector fields commute if and only if their Lie bracket vanishes. We shall not go into the technical proof here, since we shall not need the result explicitly below.
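A direct symbolic computation of the bracket (our sketch, following the coordinate formula of 9.1.18): for the two fields of the numerical example above the bracket is non-zero, while e.g. a rotation field and the radial field commute.

```python
import sympy as sp

x, y = sp.symbols('x y')
coords = (x, y)

def bracket(X, Y):
    # [X, Y]_i = sum_j (Y_j dX_i/dx_j - X_j dY_i/dx_j), as in 9.1.18
    return [sp.simplify(sum(Y[j]*sp.diff(X[i], coords[j]) - X[j]*sp.diff(Y[i], coords[j])
                            for j in range(2))) for i in range(2)]

print(bracket((1, 0), (0, x)))   # [0, -1]: X = d/dx, Y = x d/dy do not commute
print(bracket((y, -x), (x, y)))  # [0, 0]: rotations commute with dilations
```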
9.1.19. Back to distributions. We say that $D \subset TM$ is an involutive distribution if for all vector fields $X, Y$ valued in $D$, their Lie bracket $[X,Y]$ has values in $D$, too.

Frobenius' theorem
Theorem. Let $D \subset TM$ be a smooth $n$-dimensional distribution on an $m$-dimensional manifold $M$. Then $D$ is integrable if and only if it is involutive.

Proof. Recall that integrability means the local existence of integral submanifolds through each point of $M$. One of the implications is nearly trivial. If $D$ is integrable, then through each $x \in M$ there is an integral submanifold $N$. Consider the embedding $i : N \to M$ and any vector fields $\tilde X, \tilde Y$ on $M$ valued in $D$. Since $D_y = T_yN$ for all $y \in N$, $\tilde X$ and $\tilde Y$ are tangent to $i(N) \subset M$. We claim that the restriction of the Lie bracket $[\tilde X, \tilde Y]$ to $i(N)$ is the image $i_*([X,Y])$, where the vector fields $X, Y$ are viewed as the given fields on $N$, i.e., $i_*X(x) = \tilde X(i(x))$, $i_*Y(x) = \tilde Y(i(x))$. Thus the bracket has to be in the image again. The latter claim is a consequence of a more general statement:

⁵ Marius Sophus Lie (1842–1899) was an excellent Norwegian mathematician, the father of Lie theory. Originally invented to deal with systems of partial differential equations via continuous groups of their symmetries, the theory of Lie groups and Lie algebras is nowadays at the core of a vast part of mathematics. It is a pity we do not have time and space to devote more attention to this marvelous mathematical story in this textbook.

The coefficients $K_n$ we get from the initial condition $u(0,x) = \sum_{n=1}^\infty K_n\sin\big(\frac{n\pi}{l}x\big) = \varphi(x) = x(l-x)$. So the $K_n$ are the Fourier coefficients of the odd extension of $\varphi(x)$ to the interval $[-l,l]$: $K_n = \frac{2}{l}\int_0^l x(l-x)\sin\big(\frac{n\pi}{l}x\big)dx$. Integration by parts gives $K_n = \frac{4l^2}{n^3\pi^3}\big[(-1)^{n+1}+1\big]$ and the final solution
\[ u(t,x) = \sum_{n=0}^\infty \frac{8l^2}{(2n+1)^3\pi^3}\sin\Big(\frac{(2n+1)\pi}{l}x\Big)\,e^{-\frac{(2n+1)^2\pi^2}{l^2}kt}. \] $\square$

9.D.46. Solve the elliptic equation on the square, $\Delta u = 0$, $x\in(0,1)$, $y\in(0,1)$, with Dirichlet boundary conditions in $x$, $u(0,y) = u(1,y) = 0$, $y\in(0,1)$, and boundary conditions in $y$, $u(x,0) = \varphi(x) = \sin(\pi x)\cos(\pi x)$, $u(x,1) = \psi(x) = -\sin(\pi x)$, $x\in(0,1)$.
Solution. Suppose again that the solution has the form $u(x,y) = X(x)Y(y) \Rightarrow \Delta u = X''Y + XY'' = 0$. Dividing the equation by $XY$ we get $\frac{X''}{X} = -\frac{Y''}{Y} = -\lambda$. The solution (using the Dirichlet boundary conditions in $x$) is $X_n = C_n\sin(n\pi x)$, $Y_n = K_ne^{n\pi y} + H_ne^{-n\pi y}$, $n\in\mathbb N$, and $u(x,y) = \sum_{n=1}^\infty X_nY_n = \sum_{n=1}^\infty\big[k_ne^{n\pi y} + h_ne^{-n\pi y}\big]\sin(n\pi x)$. Using the boundary conditions in $y$ we get $u(x,0) = \sum(k_n+h_n)\sin(n\pi x) = \varphi(x)$, $u(x,1) = \sum(k_ne^{n\pi} + h_ne^{-n\pi})\sin(n\pi x) = \psi(x)$. The numbers $b_n = k_n + h_n$ are the Fourier coefficients of the odd extension of $\varphi(x)$ to $[-1,1]$: $b_2 = \frac12$ and $b_n = 0$ for $n \ne 2$. The numbers $\beta_n = k_ne^{n\pi} + h_ne^{-n\pi}$ are the Fourier coefficients of the odd extension of $\psi(x)$ to $[-1,1]$: $\beta_1 = -1$ and $\beta_n = 0$ for $n \ne 1$. We need to solve the algebraic equations $k_2 + h_2 = \frac12$, $k_2e^{2\pi} + h_2e^{-2\pi} = 0$, $k_1 + h_1 = 0$, $k_1e^{\pi} + h_1e^{-\pi} = -1$.

Claim. If $\varphi : N \to M$ is a smooth map and two couples of vector fields $X, Y$ and $\tilde X, \tilde Y$ satisfy $T\varphi\circ X = \tilde X\circ\varphi$, $T\varphi\circ Y = \tilde Y\circ\varphi$, then their Lie brackets satisfy the same relation: $T\varphi\circ[X,Y] = [\tilde X,\tilde Y]\circ\varphi$.
Indeed, consider a smooth function $f$ on $M$ and compute, using $X(f\circ\varphi)(x) = (T\varphi X)f = (\tilde X\circ\varphi)(x)f = \tilde X(\varphi(x))f$, and similarly for $Y$:
\[ [X,Y](f\circ\varphi)(x) = XY(f\circ\varphi)(x) - YX(f\circ\varphi)(x) = X((\tilde Yf)\circ\varphi)(x) - Y((\tilde Xf)\circ\varphi)(x) = \tilde X(\tilde Yf)(\varphi(x)) - \tilde Y(\tilde Xf)(\varphi(x)) = ([\tilde X,\tilde Y]f)\circ\varphi(x). \]
Now we employ the latter claim for the inclusion $i$ in the role of $\varphi$, and obviously every integrable distribution must be involutive.
As we already revealed, each one-dimensional distribution is involutive and locally integrable. The main idea of the proof is to start with any set of (locally) generating vector fields for $D$, to use some nice coordinates with respect to the first vector field, and to employ induction on the dimension for the rest of them. Assume the theorem is true for dimensions less than $n$ and consider an involutive smooth distribution $D$ of dimension $n$, generated by fields $X_1, \dots, X_n$. Actually, we shall prove a much stronger version of the theorem.
We claim that if $D$ is involutive, then there are coordinates $(x_1, \dots, x_m)$ around each point $x \in M$ such that the equations $x_{n+1} = a_{n+1}, \dots, x_m = a_m$ with small constants $a_i$ define all the individual integral submanifolds of $D$ through points close to $x$. This is indeed true in dimension $n = 1$. By the flow box theorem 8.3.15, for each point $x \in M$ there are coordinate functions $y_1, \dots, y_m$ on a neighborhood $U$ of $x$ for which $X_1 = \frac{\partial}{\partial y_1}$. Let us consider the submanifold $Q \subset M$ defined by $y_1 = 0$ and the "projections" $Y_j$ of the other fields making them tangent to $Q$. This requires that the $Y_j$ leave the coordinate $y_1$ constant, i.e. we set $Y_j = X_j - X_j(y_1)X_1$, $j = 2, \dots, n$. Indeed, this definition ensures $Y_j(y_1) = 0$, and thus the fields are tangent to $Q$ as required. We leave $Y_1 = X_1$, and clearly $Y_1, \dots, Y_n$ generate the same involutive distribution $D$. Thus $[Y_i, Y_j] = \sum_k c_{ijk}Y_k$ for some set of functions $c_{ijk}$. Moreover, we may view $Q$ as one leaf of the family of slices $y_1 = b_1$ with small constants $b_1$, and there is the projection $p : U \to Q$ forgetting the first coordinate. On the submanifold $Q$ there is the $(n-1)$-dimensional involutive distribution $\tilde D$ generated by the fields $\tilde Y_i = Y_i|_Q$, $i = 2, \dots, n$ (notice we again use the argument from the beginning of the proof about the brackets of restricted fields).

Finally $k_1 = \frac{1}{e^{-\pi}-e^{\pi}}$, $h_1 = \frac{1}{e^{\pi}-e^{-\pi}}$, $k_2 = \frac{1}{2-2e^{4\pi}}$, $h_2 = \frac{1}{2-2e^{-4\pi}}$; all the other coefficients are zero. The solution is
\[ u(x,y) = \sin(\pi x)\Big[\frac{e^{\pi y}}{e^{-\pi}-e^{\pi}} + \frac{e^{-\pi y}}{e^{\pi}-e^{-\pi}}\Big] + \sin(2\pi x)\Big[\frac{e^{2\pi y}}{2-2e^{4\pi}} + \frac{e^{-2\pi y}}{2-2e^{-4\pi}}\Big]. \]
(Check by substitution.) $\square$

9.D.47. Solve the hyperbolic equation $u_{tt} = a^2u_{xx}$, $t\in(0,\infty)$, $x\in(0,\pi)$, with Neumann boundary conditions $u_x(t,0) = u_x(t,\pi) = 0$, $t\in(0,\infty)$, and initial conditions $u(0,x) = \varphi(x) = 0$, $u_t(0,x) = 5\cos(3x)$, $x\in(0,\pi)$.
Solution. Again, looking for a solution in the form $u(t,x) = X(x)T(t)$, we get $XT'' = a^2X''T \Rightarrow \frac{T''}{a^2T} = \frac{X''}{X} = -\lambda$. So $X(x) = C_1e^{\mu x} + C_2e^{-\mu x}$, where $\mu = \sqrt{-\lambda}$, and the Neumann boundary conditions give $X'(0) = \mu(C_1 - C_2) = 0 \Rightarrow C_1 = C_2$, $X'(\pi) = \mu C_1(e^{\mu\pi} - e^{-\mu\pi}) = 0$, $e^{2\mu\pi} = 1 = e^{i2n\pi} \Rightarrow \mu_n = in$, $\lambda_n = n^2$, $n = 0,1,2,\dots$ We have the solutions $X_n(x) = C_1(e^{inx} + e^{-inx}) = 2C_1\cos(nx)$, $T_n(t) = a_n\cos(nat) + b_n\sin(nat)$, and
\[ u(t,x) = \sum_{n=0}^\infty X_nT_n = \sum_{n=0}^\infty\big[A_n\cos(nat) + B_n\sin(nat)\big]\cos(nx). \]
Now we apply the initial conditions: $u(0,x) = \sum_{n=0}^\infty A_n\cos(nx) = \varphi(x)$, $u_t(0,x) = \sum_{n=0}^\infty naB_n\cos(nx) = \psi(x)$. The numbers $A_n$ are the Fourier coefficients of the even extension of $\varphi(x)$ to $[-\pi,\pi]$: $A_0 = \frac1\pi\int_0^\pi\varphi(x)\,dx$, $A_n = \frac2\pi\int_0^\pi\varphi(x)\cos(nx)\,dx$,

Now, our assumption says we find suitable coordinates $(q_2, \dots, q_m)$ on $Q$ around the point $x \in Q$, so that for all small constants $b_{n+1}, \dots, b_m$, the integral submanifolds of $\tilde D$ are defined by $q_{n+1} = b_{n+1}, \dots, q_m = b_m$. Finally, we need to adjust the original coordinate functions $y_i$ all over the neighborhood $U$ of $x$. The obvious idea is to use the flow of $X_1 = Y_1$ to extend the latter coordinates from $Q$. Thus we define the coordinate functions at all $y \in U$ using the projection $p$: $x_1(y) = y_1(y)$, $x_2(y) = q_2(p(y))$, ..., $x_m(y) = q_m(p(y))$. The hope is that all submanifolds $N$ given by the equations $x_{n+1} = b_{n+1}, \dots, x_m = b_m$ (for small $b_j$) will be tangent to all fields $Y_1, \dots, Y_n$. Technically, this means $Y_i(x_j) = 0$ for all $i = 1, \dots, n$, $j = n+1, \dots, m$. By our definition, this is obvious for the restriction to $Q$, and obviously $Y_1(x_j) = 0$ at all other points, too.
Let us look closely at what happens with one of our functions $Y_i(x_j)$ along the flow of the field $X_1$. With the help of the definition of the Lie bracket we easily compute
\[ \tfrac{\partial}{\partial x_1}(Y_i(x_j)) = Y_1(Y_i(x_j)) = Y_i(Y_1(x_j)) + [Y_1,Y_i](x_j) = Y_i(Y_1(x_j)) + c_{1i1}Y_1(x_j) + \sum_{k=2}^n c_{1ik}Y_k(x_j) = \sum_{k=2}^n c_{1ik}Y_k(x_j). \]
This is a system of linear ODEs for the unknown functions $Y_i(x_j)$ in the one variable $x_1$ along the flow lines of $Y_1$. The initial condition at the point in $Q$ is zero, and thus this constant zero value propagates along the flow lines, as requested. The induction step is complete. $\square$

9.1.20. Formulation via exterior forms. As we know from linear algebra, a vector subspace of codimension $k$ is defined by $k$ independent linear forms. Thus, every smooth $n$-dimensional distribution $D \subset TM$ on a manifold $M$ can be (at least locally) defined by $m-n$ linear forms $\omega_j$ on $M$. A direct computation in coordinates reveals that the differential of a linear form $\omega$ evaluates on two vector fields as follows:
\[ (1)\quad d\omega(X,Y) = X(\omega(Y)) - Y(\omega(X)) - \omega([X,Y]). \]
Indeed, if $X = \sum_i X_i\frac{\partial}{\partial x_i}$, $Y = \sum_i Y_i\frac{\partial}{\partial x_i}$, $\omega = \sum_i\omega_i\,dx_i$, then
\[ X(\omega(Y)) - Y(\omega(X)) = \sum_{i,j}\Big(X_i\tfrac{\partial}{\partial x_i}(\omega_jY_j) - Y_i\tfrac{\partial}{\partial x_i}(\omega_jX_j)\Big) = \sum_{i,j}\Big(X_i\tfrac{\partial\omega_j}{\partial x_i}Y_j - Y_i\tfrac{\partial\omega_j}{\partial x_i}X_j + \omega_j\big(X_i\tfrac{\partial Y_j}{\partial x_i} - Y_i\tfrac{\partial X_j}{\partial x_i}\big)\Big) = d\omega(X,Y) + \omega([X,Y]). \]
Thus, the involutivity of a distribution defined by the linear forms $\omega_{n+1}, \dots, \omega_m$ should be closely linked to the properties of the differentials on the common kernel.

The numbers $naB_n$ are the Fourier coefficients of the even extension of $\psi(x)$ to $[-\pi,\pi]$: $B_0 = \frac{1}{na\pi}\int_0^\pi\psi(x)\,dx$, $B_n = \frac{2}{na\pi}\int_0^\pi\psi(x)\cos(nx)\,dx$. Because $\varphi(x) = 0$, all $A_n$ are zero; $\psi(x) = 5\cos(3x)$, so $B_3 = \frac{5}{3a}$ and all other $B_n$ are again zero. Finally, the solution of our problem is $u(t,x) = \frac{5}{3a}\sin(3at)\cos(3x)$. (Check by substitution.) $\square$

9.D.48. Solve $\Delta u = u_{xx} + u_{yy} = 0$, $(x,y)\in(0,\pi)\times(0,\pi)$, $u(x,0) = u(x,\pi) = 0$, $x\in(0,\pi)$, $u(\pi,y) = 0$, $u(0,y) = \sin y$, $y\in(0,\pi)$. ⃝
9.D.49. Solve $u_{xx} = u_{tt}$, $x\in(0,l)$, $t\in(0,\infty)$, $u_x(t,0) = u_x(t,l) = 0$, $u(0,x) = -\cos\frac{\pi x}{l}$, $u_t(0,x) = \cos^2\frac{\pi x}{l} - \sin^2\frac{\pi x}{l}$, $x\in(0,l)$. ⃝
9.D.50. Solve $u_{xx} = u_t$, $x\in(0,\pi)$, $t\in(0,\infty)$, $u(t,0) = u(t,\pi) = 0$, $u(0,x) = 2\cos(3x)\sin x$, $x\in(0,\pi)$. ⃝
9.D.51. Solve $u_{xx} = u_{tt}$, $x\in(0,\pi)$, $t\in(0,\infty)$, $u(t,0) = u(t,\pi) = 0$, $u(0,x) = -6\sin(2x)\cos(2x)$, $u_t(0,x) = -\sin x + 4\sin x\cos x$, $x\in(0,\pi)$. ⃝
9.D.52. Solve $u_t = u_{xx}$, $x\in(0,a)$, $t\in(0,\infty)$, $u(t,0) = u_x(t,a) = 0$, $u(0,x) = x(2a-x)$, $x\in(0,a)$. ⃝
9.D.53. $u_{tt} = u_{xx}$, $x\in(0,A)$, $t\in(0,\infty)$, $u(t,0) = u(t,A) = 0$, $u(0,x) = x(A-x)$, $u_t(0,x) = 0$, $x\in(0,A)$. ⃝

Indeed, there is the following version of the latter theorem:

Frobenius' theorem
Theorem. The distribution $D$ defined on an $m$-dimensional manifold $M$ by $(m-n)$ independent smooth linear forms $\omega_{n+1}, \dots, \omega_m$ is integrable if and only if there are linear forms $\alpha_{k\ell}$ such that $d\omega_k = \sum_\ell \alpha_{k\ell}\wedge\omega_\ell$.

Proof. Let us write $\omega = (\omega_{n+1}, \dots, \omega_m)$ for the $\mathbb R^{m-n}$-valued form. The distribution is $D = \ker\omega$. Now, the formula (1) (applied to all components of $\omega$) implies that involutivity of $D$ is equivalent to $d\omega|_{\ker\omega} = 0$. If the assumption of the theorem on the forms holds true, $d\omega$ clearly vanishes on the kernel of $\omega$, therefore $D$ is involutive, and one of the implications of the theorem is proved. Next, assume $D$ is integrable. By the stronger claim proved in the latter Frobenius theorem, for each point $x \in M$ there are coordinates
$(x_1, \dots, x_m)$ such that $D$ is the common kernel of all $dx_{n+1}, \dots, dx_m$. In particular, our forms $\omega_j$ are linear combinations (over functions) of the latter $(m-n)$ differentials. Moreover, there must be smooth invertible matrices of functions $A = (a_{k\ell})$ such that $dx_k = \sum_\ell a_{k\ell}\omega_\ell$, $k,\ell = n+1, \dots, m$. Finally, $d\omega_k$ includes only terms $dx_i\wedge dx_j$ with $j > n$, and all the $dx_j$ can be expressed via our forms $\omega_\ell$ from the previous equation. Thus the differentials have the requested form. $\square$

2. Remarks on Partial Differential Equations

The aim of our excursion into the landscape of differential equations is modest. We do not have space in this rather elementary guide to come close enough to this subtle, beautiful, and extremely useful part of mathematics dealing with differential equations. Still, we mention a few issues. First, the simplest method reducing the problem to already mastered ordinary differential equations is explained, based on the so-called characteristics. Then we show more simple methods of obtaining some families of solutions. Next, we present a more complicated theoretical approach dealing with the formal solvability of even higher order systems of differential equations and its convergence – the famous Cauchy–Kovalevskaya theorem. This is the only instance of a general existence and uniqueness theorem for differential equations involving partial derivatives. Unfortunately, it does not cover many interesting problems of practical importance. Finally, we display a few classical methods of solving boundary problems involving some of the most common second order equations.

9.D.54. $\Delta u = u_{xx} + u_{yy} = 0$, $x\in(0,a)$, $y\in(0,a)$, $u(0,y) = u(a,y)$, $y\in(0,a)$, $u(x,a) = 0$, $u(x,0) = x(a-x)$, $x\in(0,a)$. ⃝
9.D.55. $u_t = 4u_{xx}$, $x\in(0,1)$, $t\in(0,\infty)$, $u_x(t,0) = u(t,1) = 0$, $u(0,x) = x - 1$, $x\in(0,1)$. ⃝
9.D.56. Find a bounded solution of the Laplace equation with boundary condition on the circle (internal Dirichlet boundary problem): $\Delta u(x,y) = 0$, $x^2 + y^2 < A^2$, $u(A\cos\varphi, A\sin\varphi) = \alpha(\varphi) = \sin^4\varphi$.
Solution. After the transformation to polar coordinates $x = r\cos\varphi$, $y = r\sin\varphi$, the Laplace equation for the function $u = u(r,\varphi)$ becomes $u_{rr} + \frac1r u_r + \frac1{r^2}u_{\varphi\varphi} = 0$. We look for a solution in the form $u(r,\varphi) = R(r)\Phi(\varphi)$, where $\Phi(\varphi+2\pi) = \Phi(\varphi)$, $u(A,\varphi) = \alpha(\varphi) = \sin^4\varphi$, and $u(0,\varphi) = C$ is constant. Compute $R''\Phi + \frac1r R'\Phi + \frac1{r^2}R\Phi'' = 0$; multiplying by $r^2$ and dividing by $R\Phi$ we get $r^2\frac{R''}{R} + r\frac{R'}{R} = -\frac{\Phi''}{\Phi} = -\lambda$. Solving the equation $\Phi'' - \lambda\Phi = 0 \Rightarrow \Phi = B_1e^{\mu\varphi} + B_2e^{-\mu\varphi}$, where $\mu = \sqrt{\lambda}$; the condition $\Phi(\varphi) = \Phi(\varphi+2\pi)$ forces $\Phi$ to be $2\pi$-periodic, so $\mu = in$ and $\lambda_n = -n^2$, $n\in\mathbb N_0$. Finally, for the function $\Phi(\varphi)$ we have $\Phi_n = a_n\cos(n\varphi) + b_n\sin(n\varphi)$. The equation for $R = R(r)$ is the Euler equation $r^2R_n'' + rR_n' - n^2R_n = 0$. Substituting $s = \ln r$, $r = e^s$, $R_n' = \frac1r\dot R_n$, $R_n'' = -\frac1{r^2}\dot R_n + \frac1{r^2}\ddot R_n$ (dots denote $d/ds$), we get $\ddot R_n - n^2R_n = 0$, i.e. $R_n(s) = C_0 + D_0 s$ for $n = 0$ and $R_n(s) = C_ne^{ns} + D_ne^{-ns}$ for $n\in\mathbb N$,

9.2.1. Initial observations. In practical problems, we often meet equations relating unknown functions of more variables and their derivatives. We already handled the very special case where the relations concern functions $x(t)$ of just one variable $t$. More explicitly, we dealt with vector equations $x^{(k)} = F(t, x, \dot x, \ddot x, \dots, x^{(k-1)})$, $F : \mathbb R^{nk+1}\to\mathbb R^n$, where the dots over $x\in\mathbb R^n$ mean the (iterated) derivatives of $x(t) = (x_1(t), \dots, x_n(t))$, up to order $k$.
The goal was to find a (vector) curve $x(t)$ in $\mathbb R^n$ which makes this equation valid. Two more comments are due: 1) we can omit the explicit appearance of $t$ at the cost of adding one more variable and the equation $\dot x_0 = 1$; and 2) giving new names to the iterated derivatives, $x_j = x^{(j)}$, and adding the equations $\dot x_j = x_{j+1}$, $j = 1, \dots, k-1$, we always reduce the problem to a first order system of equations (on a much bigger space). Thus, we would like to work similarly with the equations $F(x, y, u_x, u_{xx}, u_{xy}, u_{yy}, \dots) = 0$, where $u$ is an unknown function (possibly vector valued) of two variables $x$ and $y$ (or even more variables) and, as usual, the indices denote partial derivatives. Even if we expect the implicit equation to be solved in some sense with respect to some of the highest partial derivatives, we cannot hope for a general existence and uniqueness result similar to the ODE case. Let us start with a most simple example illustrating the general problem related to the choice of initial conditions.

9.2.2. The simplest linear case. Consider one real function $u = u(x,y)$, subject to the linear homogeneous equation
\[ (1)\quad a(x,y)u_x + b(x,y)u_y = 0, \]
where $a$ and $b$ are known functions of two variables defined for $x, y$ in a domain $\Omega\subset\mathbb R^2$. We consider the equation in the tubular domain $\Omega\times\mathbb R\subset\mathbb R^3$. Usually, $\Omega$ is an open set with a nice boundary, in our case a curve $\partial\Omega$. An obvious simple idea suggests writing $\Omega$ as a union of non-intersecting curves and looking for $u$ constant along those curves. Moreover, if those curves were transversal to the boundary $\partial\Omega$, then initial conditions along the boundary should extend inside $\Omega$. Thus, consider such a potentially existing curve $c(t) = (x(t), y(t))$ and write $0 = \frac{d}{dt}u(c(t)) = u_x(c(t))\dot x(t) + u_y(c(t))\dot y(t)$. This yields the conditions for the requested curves:
\[ (2)\quad \dot x = a(x,y),\qquad \dot y = b(x,y). \]
Since $u$ is considered constant along the curve, we obtain a unique possibility for the function $u$ along the curves for all initial conditions $x(0)$, $y(0)$, and $u(x(0),y(0))$, if the coefficients $a$ and $b$ are at least Lipschitz in $x$ and $y$. The latter curves are called the characteristics of the first order partial differential equation (1), and they are solutions of its characteristic equations (2).

Hence $R_n(r) = C_0 + D_0\ln r$ for $n = 0$ and $R_n(r) = C_nr^n + D_nr^{-n}$ for $n\in\mathbb N$. The solution has to be bounded for $r \to 0$ (inside the circle), so $D_0 = D_n = 0$ and finally $u(r,\varphi) = R_n(r)\Phi_n(\varphi) = a_0C_0 + \sum_{n=1}^\infty C_nr^n\big[a_n\cos(n\varphi) + b_n\sin(n\varphi)\big]$. The coefficients are computed from the boundary condition
\[ u(A,\varphi) = K_0 + \sum_{n=1}^\infty A^n\big[K_n\cos(n\varphi) + H_n\sin(n\varphi)\big] = \alpha(\varphi) = \sin^4\varphi = \Big(\frac{1-\cos(2\varphi)}{2}\Big)^2 = \frac{1 - 2\cos(2\varphi) + \cos^2(2\varphi)}{4} = \frac14 - \frac{\cos(2\varphi)}{2} + \frac14\Big(\frac{1+\cos(4\varphi)}{2}\Big) = \frac38 - \frac{\cos(2\varphi)}{2} + \frac18\cos(4\varphi), \]
so $K_0 = \frac38$, $K_2 = \frac{-1}{2A^2}$, $K_4 = \frac{1}{8A^4}$, and all other $K_n$, $H_n$ are zero. Hence
\[ u(r,\varphi) = \frac38 - \frac{r^2}{2A^2}\cos(2\varphi) + \frac{r^4}{8A^4}\cos(4\varphi),\qquad u(x,y) = \frac38 - \frac{x^2-y^2}{2A^2} + \frac{x^4+y^4-6x^2y^2}{8A^4}, \]
where we used $\cos(2\varphi) = \cos^2\varphi - \sin^2\varphi = \frac{x^2-y^2}{r^2}$ and $\cos(4\varphi) = \cos^2(2\varphi) - \sin^2(2\varphi)$. $\square$
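A quick symbolic check of the result of 9.D.56 (our addition): the Cartesian form of the solution is harmonic and matches the boundary data.

```python
import sympy as sp

x, y, A, phi = sp.symbols('x y A phi', positive=True)
u = sp.Rational(3, 8) - (x**2 - y**2)/(2*A**2) + (x**4 + y**4 - 6*x**2*y**2)/(8*A**4)

print(sp.simplify(sp.diff(u, x, 2) + sp.diff(u, y, 2)))   # 0: u is harmonic
bnd = u.subs({x: A*sp.cos(phi), y: A*sp.sin(phi)})
print(sp.simplify(bnd - sp.sin(phi)**4))                  # 0: boundary data matched
```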
9.D.57. Find a bounded solution of the Laplace equation with boundary condition on the circle (external Dirichlet boundary problem): $\Delta u(x,y) = 0$, $x^2+y^2 > A^2$, $u(A\cos\varphi, A\sin\varphi) = \alpha(\varphi) = \cos^2\varphi$.
Solution. From the previous exercise we get $\Phi_n = a_n\cos(n\varphi) + b_n\sin(n\varphi)$, $n\in\mathbb N_0$, and $R_n(r) = D_0 + C_0\ln r$ for $n = 0$, $R_n(r) = C_nr^n + D_nr^{-n}$ for $n\in\mathbb N$. The functions $R_n(r)$ have to be bounded for $r\to\infty$ (outside the circle), so $C_0 = C_n = 0$ and we have $u(r,\varphi) = a_0D_0 + \sum_{n=1}^\infty D_nr^{-n}\big[a_n\cos(n\varphi) + b_n\sin(n\varphi)\big]$. The coefficients are computed from the boundary condition
\[ u(A,\varphi) = K_0 + \sum_{n=1}^\infty A^{-n}\big[K_n\cos(n\varphi) + H_n\sin(n\varphi)\big] = \alpha(\varphi) = \cos^2\varphi = \frac{1+\cos(2\varphi)}{2}. \]

If the coefficients are differentiable in all variables, then the solution $u$ will also be differentiable for differentiable choices of the initial conditions on a curve transversal to the characteristics, and we might have solved the problem (1) locally. Still it might fail. Let us look at the homogeneous linear problem
\[ (3)\quad yu_x - xu_y = 0,\qquad u(x,0) = x. \]
We have already seen the solutions of the characteristic equations $\dot x = y$, $\dot y = -x$: the characteristics are circles centred at the origin, $x(t) = R\sin t$, $y(t) = R\cos t$. If we choose any even differentiable function $\psi(x) = u(x,0)$ for the initial conditions at the points $(x,0)$, we are lucky and the solution will work. But for odd functions, e.g. our choice $\psi(x) = x$, there is no solution of our problem in any neighbourhood of the origin. Clearly, this failure is linked to the fact that the origin is a singular point of the characteristic equations.

9.2.3. The quasi-linear case. The situation seems to get more tricky once we add a nontrivial right-hand side $f(x,y,u)$ to the equation (1), i.e. we try to solve the problem (allowing $a$ and $b$ to depend on $u$)
\[ (1)\quad a(x,y,u)u_x + b(x,y,u)u_y = f(x,y,u). \]
But in fact, the very same idea leads to characteristic equations on $\mathbb R^3$, writing $z = u(x,y)$ for the unknown function along the characteristics. Geometrically, we seek a vector field tangent to all graphs of solutions in the tubular domain $\Omega\times\mathbb R$. Recall that $z = u(x,y)$, restricted to a curve in the graph, implies $\dot z = u_x\dot x + u_y\dot y$, and thus we may set $\dot z = f(x,y,z)$, $\dot x = a(x,y,z)$, $\dot y = b(x,y,z)$ in order to get such a characteristic vector field.

Characteristic equations and integrals
The characteristic equations of the equation (1) are
\[ (2)\quad \dot x = a(x,y,z),\quad \dot y = b(x,y,z),\quad \dot z = f(x,y,z). \]
This autonomous system of three equations is uniquely solvable for each initial condition if $a$, $b$, and $f$ are Lipschitz. A function $\psi$ on $\Omega\times\mathbb R$ which is constant on each flow line of the characteristic vector field, i.e., $\psi(x(t),y(t),z(t)) = \text{const}$ for all solutions of (2), is called an integral of the equation (1).

If $\psi_z \ne 0$, then the implicit function theorem guarantees the unique existence of the function $z = u(x,y)$ satisfying the chosen initial conditions. Check yourself that these functions $u$ are solutions of our problem. This approach covers the homogeneous case as well; we just consider the autonomous characteristic equations with $\dot z = 0$ added. Let us come back to our simple equation 9.2.2(3) and choose $f(x,y,u) = y$ for the right-hand side. The characteristic equations yield $x = R\sin t$, $y = R\cos t$ as before, while $\dot z = y = R\cos t$ and hence $z = R\sin t + z(0)$. Thus, we may choose $\psi(x,y,z) = z - x$ as an integral of the equation, and the solution $u(x,y) = x + C$ with any constant $C$.

We get $K_0 = \frac12$, $K_2 = \frac{A^2}{2}$, and all other $K_n$, $H_n$ are zero. Hence
\[ u(r,\varphi) = \frac12 + \frac{A^2}{2r^2}\cos(2\varphi),\qquad u(x,y) = \frac12 + \frac{x^2-y^2}{2(x^2+y^2)^2}A^2. \] $\square$

9.D.58. Solve the internal Dirichlet problem on the circle: $\Delta u(x,y) = 0$, $x^2+y^2 < A^2$, $u(A\cos\varphi, A\sin\varphi) = \alpha(\varphi) = \sin\varphi\cos\varphi$. ⃝
9.D.59. Solve the external Dirichlet problem on the circle: $\Delta u(x,y) = 0$, $x^2+y^2 > A^2$, $u(A\cos\varphi, A\sin\varphi) = \alpha(\varphi) = 2\sin^2\varphi + 3\cos^2\varphi$. ⃝
9.D.60. Solve the internal Dirichlet problem on the circle: $\Delta u(x,y) = 0$, $x^2+y^2 < 1$, $u(\cos\varphi, \sin\varphi) = \alpha(\varphi) = \sin\varphi + \cos^2\varphi$. ⃝
9.D.61. Solve the parabolic equation $u_t = \kappa u_{xx}$, $x\in\mathbb R$, $t\in(0,\infty)$, with initial condition
\[ u(0,x) = \varphi(x) = \begin{cases}1 & \text{for } x\in(-1,1),\\ 0 & \text{otherwise.}\end{cases} \]
Suppose $u(t,x)\to 0$ and $u_x(t,x)\to 0$ for $x\to\pm\infty$.
Solution. Let us compute the Fourier image of $u_{xx}$. Integrating by parts twice, with the boundary terms vanishing since $u\to 0$ and $u_x\to 0$ for $x\to\pm\infty$,
\[ \widetilde{u_{xx}}(\xi) = \int_{-\infty}^\infty u_{xx}e^{-i\xi x}\,dx = \big[u_xe^{-i\xi x}\big]_{-\infty}^\infty + i\xi\big[ue^{-i\xi x}\big]_{-\infty}^\infty + (-i\xi)^2\int_{-\infty}^\infty ue^{-i\xi x}\,dx = 0 + 0 - \xi^2\tilde u. \]
After the Fourier transformation the equation becomes $\tilde u_t(t,\xi) = -\kappa\xi^2\tilde u(t,\xi)$, and together with the initial condition $\tilde u(0,\xi) = \tilde\varphi(\xi)$, where $\tilde\varphi(\xi)$ is the Fourier image of $\varphi(x)$, we have the Fourier image of the solution $\tilde u(t,\xi) = \tilde\varphi(\xi)e^{-\kappa\xi^2t}$ (solve this separable equation with the initial condition thoroughly).

Notice, there will be plenty of solutions here, since we may add any solution of the homogeneous problem, i.e. all functions of the form
\[ (3)\quad u(x,y) = h(x^2+y^2) \]
with any differentiable function $h$. Thus the general solution $u(x,y) = x + h(x^2+y^2)$ depends on one function of one variable (the above constant $C$ is a special case of $h$). We may also conclude that for "reasonable" curves $\partial\Omega\subset\mathbb R^2$ (those transversal to the circles centred at the origin and not containing the origin) and "reasonable" initial values $u|_{\partial\Omega}$ (we have to watch multiple intersections of the circles with $\partial\Omega$) there will be, at least locally, a unique solution extending the initial values to an open neighborhood of $\partial\Omega$. Of course, we may similarly use characteristics and integrals for any finite number of variables $x = (x_1,\dots,x_n)$ and equations of the form
\[ a_1(x,u)\tfrac{\partial u}{\partial x_1} + \dots + a_n(x,u)\tfrac{\partial u}{\partial x_n} = f(x,u) \]
with the unknown function $u = u(x_1,\dots,x_n)$. As we shall see later, typically we obtain generic solutions depending on one function of $n-1$ variables, similarly to the above example.

9.2.4. Systems of equations. Let us look at what happens if we add more equations. There are two quite different ways to couple the equations. We may seek an unknown vector valued function $u = (u_1,\dots,u_m) : \mathbb R^n\to\mathbb R^m$, subject to $m$ equations
\[ (1)\quad A_i(x,u)\cdot\nabla u_i = f_i(x,u),\quad i = 1,\dots,m, \]
where the left-hand side is the scalar product of a vector valued function $A_i : \mathbb R^{m+n}\to\mathbb R^n$ and the gradient vector of the function $u_i$. Such systems behave similarly to the scalar ones, and we shall come back to them later. The other option leads to the so-called overdetermined systems of equations. Actually, we shall not pay more attention to this case in the sequel, so the reader may jump to 9.2.6 if getting lost. Consider a (scalar) function $u$ on a domain $\Omega\subset\mathbb R^n$ and its gradient vector $\nabla u$. For each matrix $A = (a_{ij})$ with $m$ rows and $n$ columns, with differentiable functions $a_{ij}(x,u)$ on $\Omega\times\mathbb R$, and the right-hand side function $F(x,u) : \Omega\times\mathbb R\to\mathbb R^m$, we can consider the system of equations
\[ (2)\quad A(x,u)\cdot\nabla u = F(x,u). \]
Of course, in both cases we have $m$ individual equations of the type from the previous paragraph, and we could apply the same idea of characteristic vector fields to each of them. The problem consists in the coupling of the equations, where the individual characteristic fields may yield inconsistent necessary conditions.
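Returning to the quasi-linear example of 9.2.3, the general solution $u = x + h(x^2+y^2)$ derived above is easy to confirm symbolically. A minimal sketch (our addition), with $h$ an arbitrary differentiable function:

```python
import sympy as sp

x, y = sp.symbols('x y')
h = sp.Function('h')
u = x + h(x**2 + y**2)

# the equation y*u_x - x*u_y = y, i.e. 9.2.2(3) with right-hand side f = y
print(sp.simplify(y*sp.diff(u, x) - x*sp.diff(u, y) - y))  # 0
```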
If the Fourier image of the solution is a product of two functions, the original solution has to be the convolution of the inverse Fourier images of these functions:
\[ u(t,x) = \int_{-\infty}^\infty\varphi(y)\frac{1}{2\sqrt{\pi\kappa t}}e^{-\frac{(x-y)^2}{4\kappa t}}\,dy. \]
We have used the fact that the inverse Fourier image of the Gaussian function $e^{-\kappa\xi^2t}$ is again a Gaussian function, $\frac{1}{2\sqrt{\pi\kappa t}}e^{-\frac{x^2}{4\kappa t}}$, and the formula for the convolution $f*g(x) = \int_{-\infty}^\infty f(y)g(x-y)\,dy$. We express $u(t,x)$ using the error function $\operatorname{erf}(x) = \frac{2}{\sqrt\pi}\int_0^x e^{-a^2}\,da$. Taking $\varphi(x)$ from the assignment, we compute (substituting $a = \frac{x-y}{2\sqrt{\kappa t}}$, $da = -\frac{1}{2\sqrt{\kappa t}}\,dy$):
\[ u(t,x) = \int_{-1}^1\frac{1}{2\sqrt{\pi\kappa t}}e^{-\frac{(x-y)^2}{4\kappa t}}\,dy = \int_{\frac{x+1}{2\sqrt{\kappa t}}}^{\frac{x-1}{2\sqrt{\kappa t}}}\frac{-1}{\sqrt\pi}e^{-a^2}\,da = \frac{1}{\sqrt\pi}\int_0^{\frac{x+1}{2\sqrt{\kappa t}}}e^{-a^2}\,da - \frac{1}{\sqrt\pi}\int_0^{\frac{x-1}{2\sqrt{\kappa t}}}e^{-a^2}\,da, \]
\[ u(t,x) = \frac12\Big[\operatorname{erf}\Big(\frac{x+1}{2\sqrt{\kappa t}}\Big) - \operatorname{erf}\Big(\frac{x-1}{2\sqrt{\kappa t}}\Big)\Big]. \] $\square$

9.D.62. Solve the parabolic equation $u_t = u_{xx}$, $x\in\mathbb R$, $t\in(0,\infty)$, with initial condition $u(0,x) = \varphi(x) = 2$ for $x\in(0,1)$ and $0$ otherwise. Suppose $u(t,x)\to 0$ and $u_x(t,x)\to 0$ for $x\to\pm\infty$, and express the solution using the error function. ⃝

E. Variational Problems

F. Complex analytic functions

9.F.1. Check that the mapping $z\mapsto z^n$, $n\in\mathbb N$, defined on the entire $\mathbb C$, has the complex derivative $z\mapsto nz^{n-1}$.
Solution. We can proceed in two ways: either check the definition of the complex derivative directly, cf. ??, or use the detour via the explicit expression of the mapping $f(x+iy) = (x+iy)^n$ in two real coordinates $x, y$ and see that its derivative is complex. The first, very simple, possibility repeats the computation with polynomials in one real variable and is shown in 9.4.1. $\square$

Let us look at the overdetermined case now. We get closest to the situation with ordinary differential equations if $A$ is invertible; moving it to the right-hand side, we arrive at the system of equations
\[ (3)\quad \nabla u = A^{-1}(x,u)\cdot F(x,u) = G(x,u). \]
The simplest non-trivial case consists of two equations in two variables: $u_x = f(x,y,u)$, $u_y = g(x,y,u)$. Geometrically, we describe the graph of the solution as a surface in $\mathbb R^3$ by prescribing its tangent plane through each point. An obvious necessary condition for the existence of such $u$ is obtained by differentiating the equations and employing the symmetry of the higher order partial derivatives, i.e. the condition $u_{xy} = u_{yx}$. Indeed, $u_{xy} = f_y + f_ug = g_x + g_uf = u_{yx}$, where we substituted the original equations after applying the chain rule. We shall see in a moment that this condition is also sufficient for the existence of solutions. Moreover, if the solutions exist, they are determined by their values at one point, similarly to ordinary differential equations.

9.2.5. Frobenius' theorem again. Similarly, we can deal with the gradient $\nabla u$ of an $m$-dimensional vector valued function $u$. For example, if $m = 2$ and $n = 2$ we are describing the tangent planes to the two-dimensional graph of the solution $u$ in $\mathbb R^4$. In general we face $mn$ equations
\[ (1)\quad \tfrac{\partial u_p}{\partial x_i} = F_{pi}(x,u),\quad i = 1,\dots,n,\ p = 1,\dots,m. \]
The necessary conditions imposed by the symmetry of higher order derivatives then read
\[ (2)\quad \tfrac{\partial^2u_p}{\partial x_i\partial x_j} = \tfrac{\partial F_{pi}}{\partial x_j} + \sum_q\tfrac{\partial F_{pi}}{\partial u_q}F_{qj} = \tfrac{\partial F_{pj}}{\partial x_i} + \sum_q\tfrac{\partial F_{pj}}{\partial u_q}F_{qi} \]
for all $i$, $j$ and $p$. Let us reconsider our problem from the geometric point of view. We are seeking the graph of the mapping $u : \mathbb R^n\to\mathbb R^m$. The equations (1) describe an $n$-dimensional distribution $D$ on $\mathbb R^{m+n}$, and the graphs of possible solutions $u = (u_1,\dots,u_m)$ are just the integral manifolds of $D$. The distribution $D$ is clearly defined by the $m$ linear forms $\omega_p = du_p - \sum_i F_{pi}\,dx_i$, $p = 1,\dots,m$, while the vector fields generating the common kernel of all $\omega_p$ can be chosen as $X_i = \frac{\partial}{\partial x_i} + \sum_p F_{pi}\frac{\partial}{\partial u_p}$.
Now we compute the differentials $d\omega_p$ and evaluate them on the fields $X_i$:
\[ -d\omega_p = \sum_{i,j}\tfrac{\partial F_{pi}}{\partial x_j}\,dx_j\wedge dx_i + \sum_{i,q}\tfrac{\partial F_{pi}}{\partial u_q}\,du_q\wedge dx_i = \sum_{i,j}\Big(\tfrac{\partial F_{pi}}{\partial x_j} + \sum_q\tfrac{\partial F_{pi}}{\partial u_q}F_{qj}\Big)\,dx_j\wedge dx_i, \]
\[ -d\omega_p(X_j,X_i) = \Big(\tfrac{\partial F_{pi}}{\partial x_j} + \sum_q\tfrac{\partial F_{pi}}{\partial u_q}F_{qj}\Big) - \Big(\tfrac{\partial F_{pj}}{\partial x_i} + \sum_q\tfrac{\partial F_{pj}}{\partial u_q}F_{qi}\Big). \]
Thus, vanishing of the differentials on the common kernel is equivalent to the necessary conditions deduced above, and the Frobenius theorem says that these conditions are also sufficient. We have proved the following:

Theorem. The system of equations (1) admits solutions if and only if the conditions (2) are satisfied. Then the solutions are determined uniquely, locally around $x\in\Omega$, by the initial condition $u(x)\in\mathbb R^m$.

Remark. The Frobenius theory deals with the so-called overdetermined systems of PDEs: we have too many equations, and this causes obstructions to their integrability. Although the case in the last paragraph sounds very special, the actual use of the theory consists in considering differential consequences of a given system until we reach a point where the special theorem applies, providing not only further obstructions but also sufficient conditions.

9.2.6. General solutions to PDEs. In a moment, we shall deal with diverse boundary conditions for the solutions of PDEs. In most cases we shall be happy to have good families of simple "guessed" solutions which are not subject to any further conditions. We talk about general solutions in this context. Unlike the situation with ODEs, we should not hope to get a universal expression for all possible solutions this way (although we can come close to that in some cases, cf. 9.2.3(3)). Instead, we often try to find the right superpositions (i.e. linear combinations) or integrals built from suitable general solutions.

Let us look at the simplest linear second order equations in two variables, homogeneous with constant coefficients:
\[ (1)\quad Au_{xx} + 2Bu_{xy} + Cu_{yy} + Du_x + Eu_y + Fu = 0, \]
where $A, B, C, D, E, F$ are real constants and at least one of $A, B, C$ is non-zero. Similarly to the method of characteristics, we try to reduce the problem to ODEs. Let us again assume a solution of the form $u = f(p)$, where $f$ is an unknown function of $p$, and $p(x,y)$ should be nice enough to get close to solutions. The necessary derivatives are $u_x = f'p_x$, $u_y = f'p_y$, $u_{xx} = f''p_xp_x + f'p_{xx}$, $u_{xy} = f''p_xp_y + f'p_{xy}$, $u_{yy} = f''p_yp_y + f'p_{yy}$. Thus (1) becomes too complicated in general, but restricting to affine $p(x,y) = \alpha x + \beta y$ with constants $\alpha, \beta$, we arrive at
\[ (2)\quad (A\alpha^2 + 2B\alpha\beta + C\beta^2)f'' + (D\alpha + E\beta)f' + Ff = 0. \]
This is a nice ODE as soon as we fix the values of $\alpha$ and $\beta$. Let us look at several simple cases of special importance. Assume $D = E = F = 0$, $A\ne 0$. Then, after dividing by $\alpha^2$, we solve the equation $\big(A + 2B\frac{\beta}{\alpha} + C\frac{\beta^2}{\alpha^2}\big)f'' = 0$, and the right choice of the ratio $\lambda = \beta/\alpha \ne 0$ kills the entire coefficient at $f''$. Thus, (2) holds true for any (twice differentiable) function $f$, and we arrive at the general solution $u(x,y) = f(p(x,y))$ with $p(x,y) = x + \lambda y$. Of course, the behavior will very much depend on the number of real roots of the quadratic equation $A + 2B\lambda + C\lambda^2 = 0$.

The wave equation. Put $A = 1$, $C = -\frac{1}{c^2}$, $B = 0$; thus our equation is $u_{xx} = \frac{1}{c^2}u_{yy}$, the wave equation in dimension 1. Then the equation $1 - \frac{\lambda^2}{c^2} = 0$ has two real roots $\lambda = \pm c$, and we obtain $p = x \pm cy$, leading to the general solution $u(x,y) = f(x-cy) + g(x+cy)$ with two arbitrary twice differentiable functions $f$ and $g$ of one variable. In physics, the equation models one-dimensional wave development in the space parametrized by $x$, while $y$ stands for time. Notice $c$ corresponds to the speed of the wave $u(x,0) = f(x) + g(x)$ initiated at time $y = 0$; the $f$ part moves forwards while the other part moves backwards. Indeed, imagine $u(x,y) = f(x-cy)$ describes the displacement of a string at the point $x$ at time $y$. This remains constant along the lines $x - cy = \text{const}$. Thus a stationary observer sees the initial displacement $u(x,0)$ moving along the $x$-axis with speed $c$. In particular, we see that an initial condition along a line in the plane is not enough to determine the solution, unless we request that the solution move only in one of the possible directions (i.e. we posit either $f$ or $g$ to be zero).

The Laplace equation. Now we consider $A = C = 1$, $B = 0$, i.e. the equation $u_{xx} + u_{yy} = 0$. This is the Laplace equation in two dimensions, and its solutions are called harmonic functions. Proceeding as before, we obtain two imaginary solutions of the equation $\lambda^2 + 1 = 0$, and our method produces $p = x \pm iy$, a complex valued function instead of the expected real one. This looks ridiculous, but we may consider $f$ to be a mapping $f : \mathbb C\to\mathbb C$ viewed as a mapping on the complex plane. Recall that some such mappings have differentials $D^1f(p)$ which actually are multiplications by complex numbers at each point, cf. ??. This is in particular true for any polynomial or converging power series. We may request that this property hold true for all iterated derivatives of this kind. In general, we call such functions on $\mathbb C$ holomorphic, and we discuss them in the last part of this chapter. The reader is advised to come back to this exposition on general solutions of the Laplace equation after reading the beginning of the part on complex analytic functions below, starting in 9.4.1. Now, assuming $f$ is holomorphic, we can repeat the above computation and arrive again at $(\lambda^2+1)f''(p) = 0$, independently of the choice of $f$ (here $f'(p)$ means the complex number given by the differential $D^1f$, and $f''(p)$ is the iteration of this kind of derivative). Moreover, the derivatives of vector valued functions are computed for the components separately, and thus both the real and the imaginary part of the general solution $f(x+iy) + g(x-iy)$ are real general solutions. For example, $f(p) = p^2$ leads to $u(x,y) = (x+iy)^2 = (x^2-y^2) + i\,2xy$, and a simple check shows that both terms satisfy the equation separately. Notice the two solutions $x^2-y^2$ and $xy$ provide bases of the 2-dimensional vector space of harmonic homogeneous polynomials of degree two.

The diffusion equation. Next assume $A = \kappa$, $B = C = D = F = 0$, and add the first order term with $E = -1$. This provides the equation $u_y = \kappa u_{xx}$, the diffusion equation in dimension one. Applying the same method again, we arrive at the ODE $\kappa\alpha^2f'' - \beta f' = 0$, which is easy to solve. We know the solutions are found in the form $f(p) = e^{\nu p}$ with $\nu$ satisfying the condition $\kappa\alpha^2\nu^2 - \beta\nu = 0$. The zero solution is not interesting, thus we are left with the general solution to our problem by substituting $p(x,y) = \alpha x + \beta y$ and $\nu = \frac{\beta}{\kappa\alpha^2}$:
\[ u(x,y) = f(p) = e^{\frac1\kappa\big(\frac\beta\alpha x + \frac{\beta^2}{\alpha^2}y\big)}. \]
Again, a simple check reveals that this is a solution. But it is not very "general" – it depends on just two scalars $\alpha$ and $\beta$. We shall have to find much better ways of producing solutions of such equations.
9.2.7. Nonhomogeneous equations. As always with linear equations, the space of solutions of a homogeneous linear equation is a real vector space (or complex, if we deal with complex valued solutions). Let us write the equation as $Lu = 0$, where $L$ is the differential operator on the left-hand side. For instance,
\[ L = A\tfrac{\partial^2}{\partial x^2} + 2B\tfrac{\partial^2}{\partial x\partial y} + C\tfrac{\partial^2}{\partial y^2} + D\tfrac{\partial}{\partial x} + E\tfrac{\partial}{\partial y} + F \]
in the case of the linear equation 9.2.6(1). The solutions of the corresponding non-homogeneous equation $Lu = f$ with a given function $f$ on the right-hand side form an affine space. Indeed, if $Lu_1 = f$, $Lu_2 = f$, $Lu_3 = 0$, then clearly $L(u_1-u_2) = 0$ while $L(u_1+u_3) = f$. Thus, once we succeed in finding a single solution of $Lu = f$, we can add any general solution of the homogeneous equation to obtain a general solution. Let us illustrate this observation on some of our basic examples. The non-homogeneous wave equation $u_{xx} - u_{yy} = x + y$ has the general solution $u(x,y) = \frac16(x^3 - y^3) + f(x-y) + g(x+y)$, depending on two twice differentiable functions. The non-homogeneous Laplace equation is called the Poisson equation. A general complex valued solution of the Poisson equation $u_{xx} + u_{yy} = x + y$ is $u(x,y) = \frac16(x^3 + y^3) + f(x-iy) + g(x+iy)$, depending on two holomorphic functions $f$ and $g$.

9.2.8. Separation of variables. As we have experienced, a straightforward attempt to get solutions is to expect them in a particularly simple form. The method of separation of variables is based on the assumption that the solution appears as a product of functions of the single variables in question. Let us apply this method to our three special examples.

The diffusion equation. We expect to find a general solution of $\kappa u_{xx} = u_t$ in the form $u(x,t) = X(x)T(t)$. Thus the equation says $\kappa X''(x)T(t) = T'(t)X(x)$. Assume further $u \ne 0$ and divide this equation by $u = XT$: $\frac{X''(x)}{X(x)} = \frac{T'(t)}{\kappa T(t)}$. Now the crucial observation comes: the terms on the left and right are functions of different variables, and thus the equation can be satisfied only if both sides are constant. We shall have to distinguish the signs of this separation constant, so let us write it as $-\alpha^2$ (choosing the negative option). Thus we have to solve two independent ODEs, $X'' + \alpha^2X = 0$, $T' + \alpha^2\kappa T = 0$. The general solutions are $X(x) = A\cos\alpha x + B\sin\alpha x$, $T(t) = C e^{-\alpha^2\kappa t}$, with free real constants $A$, $B$, $C$. When combining these solutions in the product, we may absorb the constant $C$ into the other ones, and thus we arrive at the general solution
\[ u(x,t) = (A\cos\alpha x + B\sin\alpha x)\,e^{-\alpha^2\kappa t}. \]
This solution depends on three real constants. If we choose a positive separation constant $\alpha^2$ instead, there is a sign change in our equations and the resulting general solution is $u(x,t) = (A\cosh\alpha x + B\sinh\alpha x)\,e^{\alpha^2\kappa t}$. If the separation constant vanishes, we obtain just $u(x,t) = A + Bx$, independent of $t$.
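The separated solutions can again be tested symbolically. A minimal sketch (our addition) covering both sign choices of the separation constant:

```python
import sympy as sp

x, t, kappa, alpha, A, B = sp.symbols('x t kappa alpha A B', positive=True)
u_osc = (A*sp.cos(alpha*x) + B*sp.sin(alpha*x))*sp.exp(-alpha**2*kappa*t)
u_hyp = (A*sp.cosh(alpha*x) + B*sp.sinh(alpha*x))*sp.exp(alpha**2*kappa*t)
for u in (u_osc, u_hyp):
    print(sp.simplify(kappa*sp.diff(u, x, 2) - sp.diff(u, t)))  # 0, 0
```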
The Laplace equation. Assume $u(x,y) = X(x)Y(y)$ satisfies the equation $u_{xx} + u_{yy} = 0$ and proceed exactly as above. Thus $X''Y + Y''X = 0$, and dividing by $XY$ and choosing the separation constant $\alpha^2$, we arrive at $X'' = \alpha^2X$, $Y'' = -\alpha^2Y$. The general solution depends on four real constants $A$, $B$, $C$, $D$:
\[ u(x,y) = (A\cosh\alpha x + B\sinh\alpha x)(C\cos\alpha y + D\sin\alpha y). \]
If the separation constant is negative, i.e. $-\alpha^2$, the roles of $x$ and $y$ swap.

The wave equation. Let us look at how the method works when more variables are present. Consider a solution $u(x,y,z,t) = X(x)Y(y)Z(z)T(t)$ of the 3D wave equation $\frac{1}{c^2}u_{tt} = u_{xx} + u_{yy} + u_{zz}$. Playing the same game again, we arrive at the equation $\frac{1}{c^2}T''XYZ = X''YZT + Y''XZT + Z''XYT$. Dividing by $u\ne 0$,
\[ \frac{1}{c^2}\frac{T''}{T} = \frac{X''}{X} + \frac{Y''}{Y} + \frac{Z''}{Z}, \]
and since all the individual terms depend on different single variables, they have to be constant. Again, we have to pay attention to the signs of the separation constants. For instance, let us choose all constants negative and look at the individual four ODEs
\[ \frac{1}{c^2}\frac{T''}{T} = -\alpha^2,\quad \frac{X''}{X} = -\beta^2,\quad \frac{Y''}{Y} = -\gamma^2,\quad \frac{Z''}{Z} = -\delta^2, \]
with the constants satisfying $-\alpha^2 = -\beta^2 - \gamma^2 - \delta^2$. The general solution is $u = XYZT$ with the linear combinations $T(t) = A\cos c\alpha t + B\sin c\alpha t$, $X(x) = C\cos\beta x + D\sin\beta x$, $Y(y) = E\cos\gamma y + F\sin\gamma y$, $Z(z) = G\cos\delta z + H\sin\delta z$, with eight real constants $A$ through $H$. If we choose any of the separation constants positive, the corresponding component of the product displays hyperbolic sine and cosine instead; of course, the relation between the constants sees the signs as well. We can also work with complex valued solutions and choose exponentials as our building blocks (i.e. $X(x) = e^{\pm i\beta x}$ or $X(x) = e^{\pm\beta x}$, etc.). For instance, take one of the solutions with all the separation constants negative:
\[ u(x,y,z,t) = e^{i\beta x}e^{i\gamma y}e^{i\delta z}e^{-ic\alpha t} = e^{i(\beta x + \gamma y + \delta z - c\alpha t)}. \]
Similarly to the 1D situation, we can again see a "plane wave" propagating along the direction $(\beta,\gamma,\delta)$ with angular frequency $c\alpha$.

9.2.9. Boundary conditions. We continue with our examples of second order equations and discuss the three most common boundary conditions for them. Let us consider a domain $\Omega\subset\mathbb R^n$, bounded or unbounded, and a differential operator $L$ defined on (real or complex valued) functions on $\Omega$. We write $\partial\Omega$ for the boundary of $\Omega$ and assume it is a smooth manifold. Locally, such a hypersurface in $\mathbb R^n$ is given by one implicit function $\rho : \mathbb R^n\to\mathbb R$, and the unit normal vector $\nu(x)$, $x\in\partial\Omega$, to the hypersurface $\partial\Omega$ is given by the normalized gradient $\nu(x) = \frac{\nabla\rho(x)}{\|\nabla\rho(x)\|}$. We say that a function $u$ is differentiable on $\Omega$ if it is differentiable on its interior and the directional derivatives $D^1_\nu u(x)$ exist at all points of the boundary. Typically we write $\frac{\partial}{\partial\nu}$ for the derivative in the normal direction. For simplicity, let us restrict ourselves to $L$ of the form
\[ L = A(x,y)\tfrac{\partial^2}{\partial x^2} + 2B(x,y)\tfrac{\partial^2}{\partial x\partial y} + C(x,y)\tfrac{\partial^2}{\partial y^2} \]
and look at the equation $Lu = F(x,y,u,\frac{\partial u}{\partial x},\frac{\partial u}{\partial y})$.

Cauchy boundary problem
At each point $x\in\partial\Omega$ of the boundary we prescribe both the value $\varphi(x) = u(x)$ and the derivative $\psi(x) = \frac{\partial u}{\partial\nu}(x)$ in the unit normal direction. The Cauchy problem is to solve the equation $Lu = F$ on $\Omega$, subject to $u = \varphi$ and $\frac{\partial u}{\partial\nu} = \psi$ on $\partial\Omega$.

We shall see that Cauchy problems very often lead locally to unique solutions, subject to certain geometric conditions on the boundary $\partial\Omega$. At the same time, this is often not the convenient setup for practical problems. We shall illustrate this phenomenon on the 2D Laplace equation in the next but one paragraph. An even simpler possibility is to request only the condition on the values of $u$ on the boundary $\partial\Omega$.
Another possibility, often needed in direct applications, is to prescribe the derivatives only. We shall see that this is reasonable for the Laplace and Poisson equations.

Dirichlet and Neumann boundary problems
At each point $x\in\partial\Omega$ of the boundary we prescribe the value $\varphi(x) = u(x)$ or the derivative $\psi(x) = \frac{\partial u}{\partial\nu}(x)$ in the unit normal direction. The Dirichlet problem is to solve the equation $Lu = F$ on $\Omega$, subject to the condition $u = \varphi$ on $\partial\Omega$. The Neumann problem is to solve the equation $Lu = F$ on $\Omega$, subject to the condition $\frac{\partial u}{\partial\nu} = \psi$ on $\partial\Omega$.

9.2.10. Uniqueness for Poisson equations. Because the proof of the next theorem works in all dimensions $n\ge 2$, we shall formulate it for the general Poisson equation
\[ (1)\quad \Delta u = \Big(\tfrac{\partial^2}{\partial x_1^2} + \dots + \tfrac{\partial^2}{\partial x_n^2}\Big)u = F(x_1,\dots,x_n). \]

Theorem. Assume $u$ is a twice differentiable solution of the Poisson equation (1) on a domain $\Omega\subset\mathbb R^n$. If $u$ satisfies the Dirichlet condition $u = \varphi$ on $\partial\Omega$, then $u$ is the only solution of the Dirichlet problem. If $u$ satisfies the Neumann condition $\frac{\partial u}{\partial\nu} = \psi$ on $\partial\Omega$, then $u$ is the unique solution of the Neumann problem, up to an additive constant.

The proof of this theorem relies on a straightforward consequence of the divergence theorem. Recall 9.1.14: for each vector field $X$ on a domain $\Omega\subset\mathbb R^n$ with hypersurface boundary $\partial\Omega$,
\[ (2)\quad \int_\Omega \operatorname{div}X\,dx_1\dots dx_n = \int_{\partial\Omega} X\cdot\nu\,d\partial\Omega, \]
where $\nu$ is the oriented (outward) unit normal to $\partial\Omega$ and $d\partial\Omega$ stands for the volume inherited from $\mathbb R^n$ on $\partial\Omega$.

1st and 2nd Green's identities
Lemma. Let $M\subset\mathbb R^n$ be an $n$-dimensional manifold with boundary hypersurface $S$, and consider two differentiable functions $\varphi$ and $\psi$. Then
\[ (3)\quad \int_M(\varphi\Delta\psi + \nabla\varphi\cdot\nabla\psi)\,dx_1\dots dx_n = \int_S\varphi\,\nabla\psi\cdot\nu\,dS. \]
This version of the divergence theorem is called the 1st Green's identity. Next, let us consider one more differentiable function $\mu$ and $X = \varphi\mu\nabla\psi - \psi\mu\nabla\varphi$. Then the divergence theorem yields the so-called 2nd Green's identity
\[ (4)\quad \int_M\big(\varphi(\nabla\cdot(\mu\nabla))\psi - \psi(\nabla\cdot(\mu\nabla))\varphi\big)\,dx_1\dots dx_n = \int_S\mu(\varphi\nabla\psi - \psi\nabla\varphi)\cdot\nu\,dS, \]
where $\nabla\cdot(\mu\nabla)$ means the formal scalar product of the two vector valued differential operators.

Proof of the Green's identities. The first claim follows by applying (2) to $X = \varphi\nabla\psi$, where $\varphi$ and $\psi$ are differentiable functions and $\nabla\psi$ is the gradient of $\psi$. Indeed, $i_X\omega_{\mathbb R^n} = \varphi(\nabla\psi\cdot\nu)\,dS$ and $\operatorname{div}X = \varphi\Delta\psi + \nabla\varphi\cdot\nabla\psi$, where the dot in the second term denotes the scalar product of the two gradients. Let us also notice that the scalar product $\nabla\psi\cdot\nu$ is just the derivative of $\psi$ in the direction of the oriented unit normal $\nu$. The second identity is computed the same way; the two terms with the scalar products of two gradients cancel each other. The reader should check the details. $\square$

Remark. A special case of the 2nd Green's identity is worth mentioning. Namely, if $\mu = 1$ and both $\psi$ and $\varphi$ vanish on the boundary $\partial\Omega$, we obtain $\int_\Omega(\varphi\Delta\psi - \psi\Delta\varphi)\,dx_1\dots dx_n = 0$. This means that the Laplace operator is self-adjoint with respect to the $L^2$ scalar product on such functions.
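The 1st Green's identity is easy to test on a concrete domain. A sketch (our addition): on the unit square, with the hypothetical choices $\varphi = x^2y$ and $\psi = xy^2$, both sides evaluate to $3/4$.

```python
import sympy as sp

x, y = sp.symbols('x y')
phi, psi = x**2*y, x*y**2

lhs = sp.integrate(phi*(sp.diff(psi, x, 2) + sp.diff(psi, y, 2))
                   + sp.diff(phi, x)*sp.diff(psi, x) + sp.diff(phi, y)*sp.diff(psi, y),
                   (x, 0, 1), (y, 0, 1))

# boundary integral over the four edges, outward normals (1,0),(-1,0),(0,1),(0,-1)
rhs = (sp.integrate((phi*sp.diff(psi, x)).subs(x, 1), (y, 0, 1))
       - sp.integrate((phi*sp.diff(psi, x)).subs(x, 0), (y, 0, 1))
       + sp.integrate((phi*sp.diff(psi, y)).subs(y, 1), (x, 0, 1))
       - sp.integrate((phi*sp.diff(psi, y)).subs(y, 0), (x, 0, 1)))

print(lhs, rhs)  # 3/4 3/4
```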
Proof of the uniqueness. Assume $u_1$ and $u_2$ are solutions of the Poisson equation on $\Omega$; then $u = u_1 - u_2$ is a solution of the homogeneous Laplace equation, $\Delta u = \Delta u_1 - \Delta u_2 = F - F = 0$. At the same time, either $u = u_1 - u_2 = 0$ on $\partial\Omega$, or $\frac{\partial u}{\partial\nu} = 0$ on $\partial\Omega$. Now we exploit the first Green's identity (3) with $\varphi = \psi = u$:
\[ \int_\Omega(u\Delta u + \nabla u\cdot\nabla u)\,dx_1\dots dx_n = \int_{\partial\Omega}u\frac{\partial u}{\partial\nu}\,dS. \]
In both problems, Dirichlet or Neumann, the right-hand side vanishes. The first term in the left-hand integrand vanishes, too. We conclude $\int_\Omega\|\nabla u\|^2\,dx_1\dots dx_n = 0$, but this is possible only if $\nabla u = 0$, since the integrand is continuous. Thus $u = u_1 - u_2$ is constant. If we solve a Dirichlet problem, then $u_1$ and $u_2$ coincide on the boundary, and thus they are equal. $\square$

9.2.11. Well posed problems. Consider the Cauchy boundary problem for $u_{xx} + u_{yy} = 0$, with $\partial\Omega$ given by $y = 0$ and $\varphi(x) = u(x,0) = A_\alpha\sin\alpha x$, $\psi(x) = u_y(x,0) = B_\alpha\sin\alpha x$, with the scalar coefficients $A_\alpha$ and $B_\alpha$ depending on the chosen frequency $\alpha$. Simple inspection reveals that we can find such a solution within the results of the separation method:
\[ u(x,y) = \big(A_\alpha\cosh\alpha y + \tfrac1\alpha B_\alpha\sinh\alpha y\big)\sin\alpha x. \]
Now choose $B_\alpha = 0$ and $A_\alpha = \frac1\alpha$, i.e. $u(x,y) = \frac1\alpha\cosh\alpha y\,\sin\alpha x$. Obviously, when moving $\alpha$ towards infinity, the Cauchy boundary data become arbitrarily small, and still this arbitrarily small perturbation of the zero data causes an arbitrarily big increase of the values of $u$ in any close vicinity of the line $y = 0$. Imagine the equation describes some physical process and the boundary conditions reflect some measurements, including small periodic errors. The results will be horribly unstable with respect to these errors in the derivatives. We should admit that the problem is in some sense ill-posed, even locally. This motivates the following definition.

Well-posed and ill-posed boundary problems
The problem $Lu = F$ on the domain $\Omega$ with boundary conditions on $\partial\Omega$ is called well-posed if all three of the following conditions hold: (1) the boundary problem has a solution $u$ (a classical solution means $u$ is twice continuously differentiable); (2) the solution $u$ is unique; (3) the solution is stable with respect to the initial data, i.e. a "small" change of the boundary conditions results in a "small" change of the solution. The problem is called ill-posed if any of the above conditions fails.

Usually, the stability in the third condition means that the solution depends continuously on the boundary conditions in a suitable topology on the chosen space of functions. Also the uniqueness required in the second condition has to be taken reasonably: for instance, only uniqueness up to an additive constant makes sense for Neumann problems.
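The instability in 9.2.11 is visible numerically. A small sketch (our addition): the Cauchy data of $u = \frac1\alpha\cosh(\alpha y)\sin(\alpha x)$ have amplitude $1/\alpha$, yet already at the height $y = 0.1$ the solution blows up as $\alpha$ grows.

```python
import numpy as np

y = 0.1
for alpha in [1, 10, 100, 1000]:
    # data amplitude ~ 1/alpha, solution amplitude at height y:
    print(alpha, 1/alpha, np.cosh(alpha*y)/alpha)
# the last column explodes although the data tend uniformly to zero
```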
9.2.12. Quasilinear equations. Now we exploit our experience and focus on (local) Cauchy type problems for equations of arbitrary order. Similarly to the ODEs, we shall deal with problems where the highest order derivatives are prescribed (more or less) explicitly and the initial conditions are given on a hypersurface up to order $k-1$. Some notation will be useful. We use multi-indices to express multivariate polynomials and derivatives, cf. 8.1.15. Further, we write $\nabla^ku = \{\partial_\alpha u;\ |\alpha| = k\}$ for the vector of all derivatives of order $k$. In particular, $\nabla u$ again means the gradient vector of $u$.

Quasi-linear PDEs
For an unknown scalar function $u$ on a domain $\Omega\subset\mathbb R^n$ we prescribe its derivatives:
\[ (1)\quad \sum_{|\alpha|=k}a_\alpha(x,u,\dots,\nabla^{k-1}u)\,\partial_\alpha u = b(x,u,\dots,\nabla^{k-1}u), \]
where $b$ and the $a_\alpha$ are functions on the tubular domain $\Omega\times\mathbb R^N$ accommodating all the derivatives, with at least one $a_\alpha$ non-zero. We call such equations the (scalar) quasi-linear partial differential equations (PDEs) of order $k$. We call (1) semi-linear if none of the $a_\alpha$ depends on $u$ and its derivatives (thus all the non-linearity hides in $b$). The principal symbol of a semi-linear PDE of order $k$ is the symmetric $k$-linear form $P$ on $\Omega$, $P(x) : (\mathbb R^n)^k\to\mathbb R$, $P(x,\xi,\dots,\xi) = \sum_{|\alpha|=k}a_\alpha(x)\xi^\alpha$.

For instance, the Poisson equation $\Delta u = f(x,y,u,\nabla u)$ on $\mathbb R^2$ is a semi-linear equation and its principal symbol is the positive definite quadratic form $P(\zeta,\eta) = \zeta^2 + \eta^2$, independent of $(x,y)$. The diffusion equation $\frac{\partial u}{\partial t} = \Delta u$ on $\mathbb R^3$ has the symbol $P(\tau,\zeta,\eta) = \zeta^2 + \eta^2$, i.e. a positive semi-definite quadratic form, while the wave equation $\square u = \frac{\partial^2u}{\partial t^2} - \frac{\partial^2u}{\partial x^2} = 0$ has the indefinite symbol $P(\tau,\zeta) = \tau^2 - \zeta^2$ on $\mathbb R^2$.

We shall focus on the scalar equations and reduce the problem to a special situation which allows a further reduction to a system of first order equations (quite similarly to the ODE theory). Thus we extend the previous definition to systems of equations. Notice these are systems of the first kind mentioned in 9.2.4.

Systems of quasi-linear PDEs
A system of quasi-linear PDEs determines a vector valued function $u : \Omega\subset\mathbb R^n\to\mathbb R^m$, subject to the vector equation
\[ (2)\quad A(x,u,\dots,\nabla^{k-1}u)\cdot\nabla^ku = b(x,u,\dots,\nabla^{k-1}u). \]
Here $A$ is a matrix of type $m\times M$ with functions $a_{i,\alpha} : \Omega\times\mathbb R^N$ as entries, $M = \binom{n+k-1}{k}$ is the number of $k$-combinations with repetition from $n$ objects, $\nabla^ku$ is the vector of vectors of all the $k$-th order derivatives of the components of $u$, $b : \Omega\times\mathbb R^N\to\mathbb R^m$, and $\cdot$ means the scalar products of the individual rows of $A$ with the vectors $\nabla^ku_i$ of the individual components of $u$, matching the individual components of $b$.
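A tiny symbolic sketch (our addition) of the characteristic behavior of the three model symbols just mentioned, found by solving $P(\nu) = 0$ for a normal direction $\nu = (\tau,\zeta)$:

```python
import sympy as sp

tau, zeta = sp.symbols('tau zeta')

P_laplace = tau**2 + zeta**2
P_wave    = tau**2 - zeta**2
P_heat    = zeta**2            # only the second-order part enters the symbol

print(sp.solve(P_laplace, tau))  # [-I*zeta, I*zeta]: no real characteristic normals
print(sp.solve(P_wave, tau))     # [-zeta, zeta]: the "light cone" directions
print(sp.solve(P_heat, zeta))    # [0]: normals (tau, 0), i.e. surfaces t = const
```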
9.2.13. Cauchy data. Next, we have to clarify the boundary condition data. Let us consider a domain $U\subset\mathbb R^n$ and a smooth hypersurface $\Gamma\subset U$, e.g. $\Gamma$ given locally by an implicit equation $f(x_1,\dots,x_n) = 0$. Consider the unit normal vector $\nu(x)$ at each point $x\in\Gamma$ (i.e. $\nu = \frac{1}{\|\nabla f\|}\nabla f$ if given implicitly). We would like to find minimal data along $\Gamma$ determining a solution of 9.2.12(1), at least locally around a given point. To make things easy, let us first assume that $\Gamma$ is prescribed by $x_n = 0$. Then $\nu(x) = (0,\dots,0,1)$ at all $x\in\Gamma$, and knowing the restriction of $u$ to $\Gamma$, we also know all derivatives $\partial_\alpha$ with $\alpha = (\alpha_1,\dots,\alpha_{n-1},0)$, $0\le|\alpha|$. Thus, we have to choose reasonably differentiable functions $c_j$ on $\Gamma$, $j = 0,\dots,k-1$, and posit for all $j$: $\partial_\alpha u(x) = c_j(x)$, $\alpha = (0,\dots,0,j)$, $x\in\Gamma$. All the other derivatives $\partial_\alpha u$ on $\Gamma$, $0\le|\alpha|<\infty$ with $\alpha_n < k$, are computed inductively using the symmetry of partial derivatives. Moreover, if $a_{(0,\dots,0,k)}\ne 0$, we can establish the remaining $k$-th order derivative by means of the equation 9.2.12(1) and hope to be able to continue inductively. Indeed, writing $a = a_{(0,\dots,0,k)}(x,u,\dots,\nabla^{k-1}u)\ne 0$ (and similarly leaving out the arguments of the other functions $a_\alpha$), the equation 9.2.12(1) can be rewritten as
\[ (1)\quad \frac{\partial^k u}{\partial x_n^k} = \frac1a\Big(-\sum_{|\alpha|=k,\,\alpha_n\ne k}a_\alpha\partial_\alpha u + b(x,u,\dots,\nabla^{k-1}u)\Big). \]
Now, on $\Gamma$ we can use the already known derivatives to compute directly all the $\partial_\alpha u$ with $\alpha_n < k+1$. Differentiating the latter equation by $\frac{\partial}{\partial x_n}$, we obtain the missing derivative of order $k+1$ from the known quantities on the right-hand side. By induction, we obtain all the derivatives, as requested. In the general situation we can iterate the derivative $D^1_{\nu(x)}u$ of $u$ in the direction of the unit normal vector $\nu$ to the hypersurface $\Gamma$:

Cauchy data for scalar PDEs
The (smooth or analytic) Cauchy data for the $k$-th order quasi-linear PDE 9.2.12(1) consist of a hypersurface $\Gamma\subset U$ and $k$ (smooth or analytic) functions $c_j$, $0\le j\le k-1$, prescribing the derivatives in the normal directions to $\Gamma$:
\[ (2)\quad (D^1_{\nu(x)})^j u(x) = c_j(x),\quad x\in\Gamma. \]
A normal direction $\nu(x)$, $x\in\Gamma$, is called characteristic for the given Cauchy data if
\[ (3)\quad \sum_{|\alpha|=k}a_\alpha(x,u,\dots,\nabla^{k-1}u)\,\nu(x)^\alpha = 0. \]
The Cauchy data are called non-characteristic if there are no characteristic normals to $\Gamma$.

Notice the situation simplifies for semi-linear equations. There the characteristic directions do not depend on the chosen functions $c_j$ from the Cauchy data, and they are directly related to the properties of the principal symbol of the equation. In the case of the hyperplane $\Gamma = \{x_n = 0\}$ treated above, the Cauchy data are non-characteristic if and only if $a_{(0,\dots,0,k)}\ne 0$. For instance, semi-linear equations of first order always admit characteristic directions, since their principal symbols are linear forms and so must have non-trivial kernels (hyperplanes of characteristic directions). In the three second order examples of the Laplace equation, diffusion equation, and wave equation very different phenomena occur. Since the symbol of the Laplace equation is a positive definite quadratic form, characteristic directions can never appear, independently of our choice of $\Gamma$. On the contrary, there are always non-trivial characteristic directions in the other two cases.

Characteristic cones of semi-linear PDEs
The characteristic directions of a semi-linear PDE on a domain $\Omega\subset\mathbb R^n$ generate the characteristic cone $C(x)\subset T_x\Omega$ in the tangent bundle, $C(x) = \{\xi\in T_x\Omega;\ P(x)(\xi,\dots,\xi) = 0\}$. The Cauchy data on a hypersurface $\Gamma$ are non-characteristic if and only if $(T\Gamma)^\perp\cap C = \{0\}$, i.e. the orthogonal complements of the tangent spaces of $\Gamma$ with respect to the standard scalar product on $\mathbb R^n$ never meet the characteristic cone.

Notice that cones of linear forms are hyperplanes in the tangent space, quadratic cones appear with second order, etc. The tangent vectors to the characteristics of first order quasi-linear equations (as introduced in 9.2.2) are orthogonal to the characteristic normals. We have learned that the first order equations propagate the solutions along the characteristic lines, and so we are not free to prescribe the Cauchy data for the solution in such a case.
9.2.14. Cauchy-Kovalevskaya theorem. As seen so many times already, the analytic mappings are very rigid and most questions related to them boil down to some estimates and smart combinatorial ideas. It is time to recall what happens for analytic equations and Cauchy data in the very special case of ODEs. For a single scalar autonomous ODE of first order, the Cauchy data consist of a single point "hypersurface" Γ = {x} in Ω ⊂ R and the value u(x). In particular, the Cauchy data are always non-characteristic in dimension one. Already in 6.2.22 we gave a complete proof that the induced derivatives of u provide a converging power series and thus the only solution, on a certain neighborhood of x. In 8.3.13 we extended the same proof to autonomous systems of ODEs, which verified the same phenomenon for general systems of ODEs of any order k. Here the Cauchy data again consist of the only point in Γ and all derivatives of u of orders less than k (and again, they are always non-characteristic).

In the subsequent paragraphs we shall comment on how to extend the ODE proof to the following very famous theorem. In particular, the statement says that we have to expect general solutions of kth order scalar equations in n variables to depend on k independent functions of n−1 variables. This is in accordance with our experience from simple examples.

Cauchy-Kovalevskaya theorem

Theorem. The analytic Cauchy problem consisting of the quasi-linear equation 9.2.12(1) with analytic coefficients and right hand side, and analytic non-characteristic Cauchy data 9.2.13(2), has got a unique analytic solution on a neighborhood of each point in Γ.

Notice that we have computed explicitly the formal power series for the solution (by an inductive procedure) for the special case when Γ is defined by xₙ = 0. In this case, the theorem claims that this formal series always converges with non-trivial radius of convergence. The full proof is very technical and we do not have space to bother the readers with all details. In the next paragraphs, we shall provide indications of the steps in the proof. If the thread (or interest) is lost, the reader should rather jump to 9.2.18.

9.2.15. Flattening the Cauchy data. The first step in the proof is to transform the non-characteristic data to the "flat" hypersurface Γ discussed in the beginning of 9.2.13. Recall that for such Γ the non-characteristic condition 9.2.13(3) reads a_{(0,…,0,k)} ≠ 0.

Let us start with the general equation and its analytic Cauchy data on an analytic Γ ⊂ Rⁿ (we omit the arguments of all the functions, and ℓ = 0, …, k−1):

(1) ∑_{|α|=k} a_α ∂^α u = b, ∂^ℓu/∂ν^ℓ(x) = c_ℓ(x), x ∈ Γ.

We shall work locally around some unspecified fixed point in Γ. Since Γ is an analytic hypersurface in Rⁿ, there are new local coordinates y = Ψ(x) such that Γ = {x; Ψₙ(x) = 0}. Moreover, Ψ can be chosen again analytic. Thus, the unit normal vector ν to Γ equals ∇Ψₙ, up to a multiple µ^{−1}, at each point of Γ. Let Φ = Ψ^{−1}, i.e. x = Φ(y), and v(y) = u(Φ(y)). Then Φ is analytic and the equation transforms into another equation in the coordinates y,

(2) ∑_{|α|=k} ã_α ∂^α v = b̃,

with analytic coefficients (they can all be expressed by means of the chain rule and the mutually inverse transformations Φ, Ψ). By the very definition, ∂Φ/∂yₙ is a vector (the last column in the matrix D¹Φ) perpendicular to Γ and thus it must be µν (recall that the product of the Jacobi matrices D¹(Ψ)D¹(Φ) is the identity matrix and the rows ∇Ψ_j, j = 1, …, n−1, generate the tangent space TΓ).

Claim 1. The transformed Cauchy data for the equation (2) are analytic.

The hypersurface Γ̃ given by yₙ = 0, as well as the coefficients of the equation, are analytic. Compute ∂^ℓv/∂yₙ^ℓ on Γ̃:

v = c₀ ∘ Φ
∂v/∂yₙ = ∇u · ∂Φ/∂yₙ = µ∇u · ν = µ ∂u/∂ν = µc₁
∂²v/∂yₙ² = (∂µ/∂yₙ) ∂u/∂ν + µ² ∂²u/∂ν² = (∂µ/∂yₙ) c₁ + µ² c₂
∂³v/∂yₙ³ = (∂²µ/∂yₙ²) ∂u/∂ν + 3µ (∂µ/∂yₙ) ∂²u/∂ν² + µ³ ∂³u/∂ν³ = (∂²µ/∂yₙ²) c₁ + 3µ (∂µ/∂yₙ) c₂ + µ³ c₃.

Inductively, we see that the transformed functions c̃_j are obtained in an analytic way from the functions c_i, i = 0, …, j.

Claim 2. The Cauchy data for the equation (1) are non-characteristic if and only if the transformed Cauchy data for (2) are non-characteristic, i.e. ã_{(0,…,0,k)} ≠ 0.
Compute using the chain rule (recall ∇Ψₙ is a vector, the gradient of the last coordinate function of Ψ, and it is equal to ν up to the non-zero multiple µ^{−1}):

∂^α u = (∂^k v/∂yₙ^k)(∇Ψₙ)^α + terms of lower order in ∂/∂yₙ = µ^{−k} (∂^k v/∂yₙ^k) ν^α + ⋯ .

Substitute into (1):

∑_{|α|=k} a_α ∂^α u = µ^{−k} ( ∑_{|α|=k} a_α ν^α ) ∂^k v/∂yₙ^k + ⋯ .

We have computed ã_{(0,…,0,k)} = µ^{−k} ∑_{|α|=k} a_α ν^α, which is non-zero if and only if the original Cauchy data were non-characteristic.

Since we have already verified that all the partial derivatives of v along Γ̃ can be computed for non-characteristic Cauchy data with the flat hypersurface Γ̃, we have actually proved the following claim.

Proposition. The Cauchy data for (1) allow us to compute all partial derivatives of its solution u along the hypersurface Γ if and only if the data are non-characteristic.

9.2.16. Reduction to a first-order system. Without loss of generality, we may consider only the Cauchy data of the form discussed in 9.2.13, i.e. the quasi-linear equation on a domain in Rⁿ is

(1) ∂^k u/∂xₙ^k = ∑_{|α|=k, αₙ≠k} a_α ∂^α u + b

and Γ is given by xₙ = 0 with prescribed normal derivatives c₀, …, c_{k−1}. Moreover, we can subtract suitable fixed functions from u in order to transform the equation into a new one of the same shape and with all the Cauchy data c_j vanishing on Γ. Indeed, start with v = u − c₀. This transforms the equation and the Cauchy data so that the new c̃₀ = 0. If we have killed the functions c̃₀, …, c̃_{ℓ−1}, then we may subtract

g(x₁, …, xₙ) = (1/ℓ!) xₙ^ℓ c̃_ℓ(x₁, …, x_{n−1}),

which kills the next one.

The final reduction step is to introduce new functions for all components in the vector (u, ∇u, …, ∇^{k−1}u). Write v₁, …, v_N for all these functions, and add one more function v₀(x) = xₙ. Then we can rewrite our equation (1) as a system of quasi-linear equations of first order on the vector function v = (v₀, …, v_N),

(2) ∂vₛ/∂xₙ = ∑_{0≤r≤N} ∑_{i=1}^{n−1} a_{sri} ∂vᵣ/∂xᵢ + bₛ, s = 0, …, N,

where all the coefficients a_{sri}, bₛ are functions of x₁, …, x_{n−1}, v₀, …, v_N, and the boundary condition on Γ = {xₙ = 0} is v|_Γ = 0. Notice two important facts: the coefficients do not depend on xₙ, and all derivatives on the left hand side are ∂/∂xₙ. This is a technicality which makes the problem similar to the autonomous systems of ODEs.

The principle is obvious from a simple example. Consider a 2nd order equation with coefficients a_{xx}, a_{xt}, b,

u_{tt} = a_{xt} u_{xt} + a_{xx} u_{xx} + b,

and view t, u, u_t, u_x as unknown functions. Then our equation rewrites as the system of four equations

∂t/∂t = 1, ∂u/∂t = u_t, ∂u_x/∂t = ∂u_t/∂x, ∂u_t/∂t = a_{xt} ∂u_t/∂x + a_{xx} ∂u_x/∂x + b

with boundary condition t = u = u_t = u_x = 0 on the line t = 0.

9.2.17. The majorant. The rest of the proof is very technical, but it is based on the straightforward idea of a majorant for the Cauchy problem 9.2.16(2). Recall we used this method in 8.3.13 when proving the ODE version of the Cauchy-Kovalevskaya theorem. We shall not go into much detail and only indicate how to generalize the method from 8.3.13.

The first step is easy – we have already seen that all derivatives of the solution vector v at a fixed point x in Γ are computed by the chain rule from the Cauchy data. This proves the uniqueness of the analytic solution, if such a solution exists.
Again, it turns out the derivatives are given via universal polynomials in the derivatives of the coefficients a_{sri}, bₛ of the system 9.2.16(2), and these polynomials have got non-negative real coefficients. The reader may easily fill in the details as an exercise.

Now, a very similar majorant of the coefficients as in the ODE case can be chosen. First, the analyticity of the coefficients ensures the existence of some suitably small r > 0 and a (perhaps big) constant C such that

(1/α!) |∂^α a_{sri}| r^{|α|} ≤ C, (1/α!) |∂^α bₛ| r^{|α|} ≤ C,

and thus

|∂^α a_{sri}| ≤ C |α|! r^{−|α|}, |∂^α bₛ| ≤ C |α|! r^{−|α|},

for all coefficients and multi-indices α. In particular, all the coefficients can be majorized by the function

h(x₁, …, x_{n−1}, v₀, …, v_N) = Cr / ( r − ∑_{j=1}^{n−1} xⱼ − ∑_{s=0}^{N} vₛ ).

Now, the majorizing system for the vector (V₀, …, V_N) is

∂Vₛ/∂xₙ = ∑_{0≤r≤N} ∑_{i=1}^{n−1} h ∂Vᵣ/∂xᵢ + h, s = 0, …, N.

Since the coefficients are completely symmetric in the variables, let us expect the solution in the form V₀ = ⋯ = V_N = W(x₁ + ⋯ + x_{n−1}, xₙ), i.e. W is a real function of two variables, say W(t, y). Substituting into the system, we arrive at the linear first order PDE

∂W/∂y = ( Cr / (r − t − NW(t, y)) ) ( N(n−1) ∂W/∂t + 1 ),

with boundary condition W(t, 0) = 0. The reader may find the solution (e.g. by the method of characteristics)

W(t, y) = (1/(Nn)) ( r − t − √( (r−t)² − 2nNCry ) ),

a real analytic function on a neighborhood of the origin. This concludes the proof.⁶

9.2.18. Back to second order equations. Let us continue our brief excursion into the PDE world with more detailed comments on the second order semi-linear equations. We shall deal with scalar PDEs on a domain Ω ⊂ Rⁿ with one of the usual boundary conditions. Thus consider a general linear operator

(1) L = ∑_{1≤i,j≤n} a_{ij} ∂²/∂xᵢ∂xⱼ + ∑_{1≤i≤n} aᵢ ∂/∂xᵢ + a

written in coordinates (x₁, …, xₙ). If we consider any other coordinate system y = Φ(x), then the chain rule determines how to transform the equation Lu = f in coordinates x into L̃ṽ = f̃, where u(x) = v(Φ(x)). Write ∇ = (∂/∂x₁, …, ∂/∂xₙ) for the gradient operator, the dot for the standard scalar product of vectors, D¹Φ for the Jacobi matrix of Φ, and ∂Φ/∂xᵢ for the ith column of D¹Φ. Then

∂/∂xᵢ = ∇ · ∂Φ/∂xᵢ = ∑_k (∂Φ_k/∂xᵢ) ∂/∂y_k,
∂²/∂xᵢ∂xⱼ = ∑_{k,ℓ} (∂Φ_k/∂xᵢ)(∂Φ_ℓ/∂xⱼ) ∂²/∂y_k∂y_ℓ + ∑_k (∂²Φ_k/∂xᵢ∂xⱼ) ∂/∂y_k.

Thus, the operator (1) transforms into

L̃ = ∑_{1≤i,j,k,ℓ≤n} a_{ij} (∂Φ_k/∂xᵢ)(∂Φ_ℓ/∂xⱼ) ∂²/∂y_k∂y_ℓ + ∑_{1≤i,k≤n} aᵢ (∂Φ_k/∂xᵢ) ∂/∂y_k + a.

In particular, the principal symbol transforms pointwise as a quadratic form under the linearized transformation D¹Φ. As we know from linear algebra, the global behavior of real quadratic forms is classified by their signature, i.e. the numbers of positive and negative entries in the diagonalized matrix, cf. 4.3.2. This is transferred to the following classification.

Classification of 2nd order quasi-linear PDEs

Consider a second order quasi-linear operator (1) with the principal symbol Q. The equation Lu = f and the operator L are called
• elliptic if Q is either positive or negative definite,
• hyperbolic if Q has got the signature (n−1, 1) (or equivalently (1, n−1)),
• parabolic if Q is positive semi-definite with rank n−1 and the equation can be rewritten as ∂u/∂t = L̃(u), where the principal symbol of L̃ depends on the remaining variables only.
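As a quick illustration of the box above, the following sketch (our own, in Python) classifies a constant-coefficient operator by the eigenvalue signs of the symmetric matrix (a_{ij}); note that for the parabolic case it only checks the rank condition, not the additional structural condition on ∂/∂t from the box.

```python
# A sketch (ours): classify sum a_ij d^2/(dx_i dx_j) by the signature of A.
import numpy as np

def classify(A, tol=1e-12):
    eig = np.linalg.eigvalsh(np.asarray(A, dtype=float))  # A assumed symmetric
    pos = int(np.sum(eig > tol))
    neg = int(np.sum(eig < -tol))
    zero = len(eig) - pos - neg
    if zero == 0 and (pos == len(eig) or neg == len(eig)):
        return 'elliptic'
    if zero == 0 and min(pos, neg) == 1:
        return 'hyperbolic'
    if zero == 1 and min(pos, neg) == 0:
        return 'parabolic (rank n-1; the d/dt structure must be checked separately)'
    return f'other: signature ({pos} positive, {neg} negative, {zero} zero)'

print(classify(np.eye(3)))                  # Laplace operator: elliptic
print(classify(np.diag([1., -1., -1.])))    # wave operator:    hyperbolic
print(classify(np.diag([0., -1., -1.])))    # heat operator:    parabolic
```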
Notice that we actually did not include all possibilities in the above list. We omitted the ultra-hyperbolic case, where the rank of Q is maximal, with the remaining possibilities of signatures. Further, the parabolic equations could appear with the minus sign at ∂/∂t (the so called "backward parabolic equations"), and the rank of Q could also drop by more than one. Most of these cases cannot be seen in low dimensions.

⁶ The reader may find a detailed exposition in many basic books; see, for instance, Chapter 1 of the book "Introduction to Partial Differential Equations" by Gerald B. Folland, Princeton, 1995.

If the coefficients a_{ij}, aᵢ and a are constant, then the principal symbol Q is a constant quadratic form, too, and we may choose just linear transformations T = D¹Φ instead of a general Φ. This will always allow us to transform the quadratic form to its canonical form over the entire domain Ω. Many particular examples are discussed explicitly in the other column, see ??, in particular in low dimensions. Let us also stress that, while in dimension n = 2 we may even locally integrate the necessary linearized transformations into a genuine mapping Φ and get the canonical forms of the equations in a more general context, see ??, this is mostly not possible in higher dimensions.

Before coming to a few of the most important examples, let us look at the characteristic directions in the individual cases. If L is elliptic, then of course there cannot be any characteristic direction. So the (local) Cauchy problem prescribing the analytic value and first normal derivative along any analytic hypersurface Γ will have a locally convergent analytic solution around any fixed point. Unfortunately, we have already encountered the fact that this is not a well posed problem even for the two dimensional Laplace equation, cf. 9.2.11. On the contrary, there will be an (n−1)-dimensional cone of characteristic directions at each point for a hyperbolic equation, while the parabolic equations come equipped with a line of characteristic directions.

9.2.19. The wave equation. The wave operator in dimension n is

(1) L = ∂²/∂t² − c²∆,

where ∆ is the Laplace operator ∆ = ∂²/∂x₁² + ⋯ + ∂²/∂xₙ², and c² > 0 is a real constant. The operator L lives on domains in R^{n+1}.

Let us first return to the 2D wave equation u_{tt} = c²u_{xx}. We know the general solution u(x, t) = f(x − ct) + g(x + ct), cf. 9.2.6, which is the superposition of the forward and backward waves u₁(x, 0) = f(x) and u₂(x, 0) = g(x). This perfectly matches our general expectation from the Cauchy-Kovalevskaya theorem that general solutions to quasi-linear second order equations in two variables should depend on two real single variable functions. Moreover, the characteristic lines are x ± ct = const, and thus the line Γ = {t = 0} is non-characteristic.

Setting the Cauchy boundary data u(x, 0) = φ(x), ∂u/∂t(x, 0) = ψ(x) and substituting the general solution, we arrive at

φ(x) = f(x) + g(x), ψ(x) = −cf′(x) + cg′(x),

where the dash stands for the derivative of a single variable function, as usual. Thus, for any s₀ and s,

(1/c) ∫_{s₀}^{s} ψ(x) dx = −f(s) + g(s) + f(s₀) − g(s₀).

Add and subtract the equations:

2g(s) = φ(s) + (1/c) ∫_{s₀}^{s} ψ(x) dx − f(s₀) + g(s₀),
2f(s) = φ(s) − (1/c) ∫_{s₀}^{s} ψ(x) dx + f(s₀) − g(s₀).

Finally, substitute s = x − ct into f(s), and s = x + ct into g(s), in order to get the value of u(x, t) (notice the integrals add nicely, while the constants depending on the choice of s₀ cancel).
(2) u(x, t) = ½( φ(x−ct) + φ(x+ct) ) + (1/2c) ∫_{x−ct}^{x+ct} ψ(y) dy.

This solution is often called d'Alembert's solution. The formula also reveals the continuous dependence of the solution on the boundary conditions. We may conclude that the Cauchy problem seems to be the right boundary value problem for the wave equation (although we have fully proved that it is well posed only in dimension two and in the analytic category).

In higher dimensions, the situation is much more complicated. One useful option is to employ the method of separation of variables, but splitting only the time and space variables. Consider the n-dimensional wave equation and expect the solution in the form u(x, t) = F(x)T(t) (now x ∈ Rⁿ, t ∈ R). Plugging this into (1) and playing the separation method game, we arrive at the two equations

∆F + αF = 0, T″ + αc²T = 0,

where α is the separation constant (usually we consider either α² or −α² to fix the types of the equations). The first equation is called the Helmholtz equation and we shall come back to it below.

9.2.20. The diffusion equation. In general dimension n the diffusion operator is

(1) L = ∂/∂t − κ∆, κ > 0,

and the diffusion equation is considered on domains in Rⁿ × R. Again, let us have a look at the simplest 1D diffusion equation u_t = κu_{xx}. It describes the diffusion process in a one-dimensional object with diffusivity κ (assumed to be constant here) in time.

First of all, let us notice that the usual boundary value prescription of the state at time t = 0 does not match the assumptions of the Cauchy-Kovalevskaya theorem. Indeed, taking Γ = {t = 0}, the normal direction vector ∂/∂t is characteristic. The intuition related to diffusion problems suggests that Dirichlet boundary data should suffice (we just need the initial state and the diffusion then does the rest), or we can combine them with some Neumann data (if we supply heat at some parts of the boundary). Moreover, the process should not be reversible in time, so we should not expect that the solution would extend across the line t = 0.

Let us look at a classical example considered already by Kovalevskaya. Posit

u(0, x) = g(x) = 1/(1 + x²)

on a neighborhood of the origin (perfectly analytic boundary data and equation), and expect u to be a solution of u_t = u_{xx} in the form

u(t, x) = ∑_{k,ℓ≥0} c_{k,ℓ} (t^k/k!)(x^ℓ/ℓ!).

The equation obviously implies the relations c_{k+1,ℓ} = c_{k,ℓ+2} for all k, ℓ. Further, the power series (1 + x²)^{−1} = ∑_ℓ (−1)^ℓ x^{2ℓ} is obtained from the geometric power series with argument −x, with x² substituted for x in the end. Thus, for all ℓ,

c_{0,2ℓ+1} = 0, c_{0,2ℓ} = (−1)^ℓ (2ℓ)!.

By the recurrence, c_{k,2ℓ} = (−1)^{k+ℓ}(2(k+ℓ))!. This is too quick a growth for a converging power series. For example, the coefficients

c_{k,2k}/(k!(2k)!) = ±(4k)!/(k!(2k)!)

grow towards infinity as fast as the expression e^{−k} k^k 8^{2k}, by the Stirling formula for the factorial (cf. 6.2.17). We have learned that there cannot be any analytic solution to our Dirichlet boundary problem at all. This example also shows the relevance of all the assumptions in the Cauchy-Kovalevskaya theorem.
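The divergence can be watched numerically; a tiny sketch (our own):

```python
# The coefficients of t^k x^{2k} in Kovalevskaya's example have absolute
# value (4k)!/(k!(2k)!); their k-th roots blow up, so the formal power
# series has radius of convergence zero.
from math import factorial

for k in range(1, 8):
    term = factorial(4 * k) // (factorial(k) * factorial(2 * k))
    print(k, term, round(term ** (1.0 / k)))   # growing k-th roots => R = 0
```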
9.2.21. Diffusion via Fourier transform. Fortunately, another straightforward method helps us to solve the simplest diffusion equation with Dirichlet data. Let us assume u(x, t) is a solution of

u_t = κu_{xx}, u(x, 0) = φ(x).

Recall that the Fourier transform (with respect to x) transfers the differentiation ∂/∂x to algebraic multiplication by iξ, while the other variable t remains a parameter. Thus, the Fourier image ũ(ξ, t) must obey

ũ_t = −κξ²ũ, ũ(ξ, 0) = φ̃.

This is a quite simple ODE problem with the general solution ũ(ξ, t) = C(ξ) e^{−κtξ²}, and the initial condition implies that the integration constant is just C(ξ) = φ̃.

Now remember the relation between the Fourier transform and convolution, 7.2.7 at page 660. The image of the convolution is the product of the images, up to the factor √(2π). Thus we shall immediately write down the solution u with t > 0, once we find the inverse Fourier image of the Gaussian f(ξ) = e^{−κtξ²}. But Fourier images F(f) of Gaussians f are again Gaussians, up to a constant, see ??:

F(e^{−ax²})(ξ) = (1/√(2a)) e^{−ξ²/4a},

with any real constant a > 0. Thus, we can write for t > 0

ũ(ξ, t) = F(φ) F( (1/√(2κt)) e^{−x²/4κt} ).

Finally, we obtain u as the convolution of the initial condition with the so called heat kernel function,

u(x, t) = (1/(2√(πκt))) ∫_{−∞}^{∞} e^{−(x−y)²/4κt} φ(y) dy.

Obviously, the solution depends continuously on the boundary condition. We may imagine it models the dynamics of temperature in an infinite homogeneous bar, with some initial distribution of temperature and no losses or gains of energy in time. Let us also observe the behavior of the solution for t close to zero. As mentioned in 7.2.9, the Gaussians with variance converging to zero are a good approximation of the so called Dirac delta function, and indeed the limit of the convolution for t → 0+ is exactly the function φ, as expected. We shall come back to such convolution based principles a few pages later, after investigating simpler methods.
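A numerical sketch (our own grid and data choices) of the heat kernel solution: the convolution smooths the initial profile for t > 0 and recovers it in the limit t → 0+.

```python
# Convolve phi with the heat kernel Q(x, t) on a grid (Riemann sum).
import numpy as np

kappa = 1.0
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
phi = 1.0 / (1.0 + x**2)                 # initial profile (our choice)

def heat(t):
    Q = np.exp(-(x[:, None] - x[None, :])**2 / (4 * kappa * t)) \
        / (2 * np.sqrt(np.pi * kappa * t))
    return (Q * phi[None, :]).sum(axis=1) * dx

for t in (1.0, 0.1, 0.001):
    print(f't = {t:5.3f}, max|u - phi| = {np.max(np.abs(heat(t) - phi)):.4f}')
```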
9.2.22. Superposition of the solutions. A general idea for solving boundary value problems is to take a good supply of general solutions and try to take linear combinations of even infinitely many of them. This means we consider the solution in the form of a series. The type of the series is governed by the available solutions. Let us illustrate the method on the diffusion equation discussed above.

Imagine we want to model the temperature of a homogeneous bar of length d. Initially, at time t = 0, the temperature at all points x is zero. At one of its ends we keep the temperature zero, while the other end is heated with some constant intensity. Set the bar as the interval x ∈ [0, d] ⊂ R, and the domain Ω = [0, d] × [0, ∞). Our boundary problem is

(1) u_t = κu_{xx}, u(x, 0) = 0, u(0, t) = 0, ∂u/∂x(d, t) = ρ,

where ρ is a constant representing the effect of the heating.

The idea is to exploit the general solutions

u(x, t) = (A cos αx + B sin αx) e^{−α²κt}

from 9.2.6, with free parameters α, A, and B. We want to consider a superposition of such solutions with properly chosen parameters and get the solution to our boundary problem in a form combining Fourier series terms with the exponentials. This approach is often called the Fourier method.

The condition u(0, t) = 0 suggests restricting ourselves to A = 0. Then u_x(x, t) = Bα cos(αx) e^{−α²κt}. It seems difficult now to guess how to combine such solutions to get something constant in time, as the Neumann part of the boundary conditions requests. But we can help ourselves with a small trick. There are some further obvious solutions to the equation – those with u depending on the space coordinate only. We may consider v(x) = ρx and seek our solution in the form u(x, t) + v(x). Then u must again be a solution of the same diffusion equation (1), but the boundary conditions change to

u(x, 0) = −ρx, u(0, t) = 0, ∂u/∂x(d, t) = 0.

Now we want u_x(d, t) = Bα cos(αd) e^{−α²κt} = 0, i.e. we should restrict to the frequencies α = nπ/(2d) with odd positive integers n. This settles the second of the boundary conditions. The remaining one is u(x, 0) = −ρx, which sets the condition on the coefficients B in the superposition

∑_{k≥0} B_{2k+1} sin( (2k+1)πx/(2d) ) = −ρx

on the interval x ∈ [0, d]. This is a simple task of finding the Fourier series of the function x, which we handled in 7.1.10. Combining all this, we get the requested solution u(x, t) to our problem:

ρx − (8ρd/π²) ∑_{k≥0} ( (−1)^k/(2k+1)² ) sin( (2k+1)πx/(2d) ) e^{−κ(2k+1)²π²t/(4d²)}.

Even though our supply of general solutions was not big, superposing countably many of them helped us to solve our problem. Notice the behavior at the heated end. If t → ∞, all the exponential terms in the sum vanish, the higher ones faster than the very first one; the sine terms are bounded, and thus the entire sum component vanishes quite fast. Thus, for big t, the temperature approaches the stationary linear profile ρx, with slope ρ at the heated end.
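The series is easily summed numerically; in the following sketch (our own parameter choices) one can check both boundary conditions and the nearly linear profile for large t.

```python
# Sum the Fourier-method series for the heated bar and test the boundary data.
import numpy as np

kappa, d, rho = 1.0, 1.0, 1.0               # illustrative choices (ours)
x = np.linspace(0.0, d, 201)

def u(x, t, terms=200):
    s = np.zeros_like(x)
    for k in range(terms):
        n = 2 * k + 1
        s += (-1)**k / n**2 * np.sin(n * np.pi * x / (2 * d)) \
             * np.exp(-kappa * n**2 * np.pi**2 * t / (4 * d**2))
    return rho * x - 8 * rho * d / np.pi**2 * s

for t in (0.01, 0.1, 1.0):
    w = u(x, t)
    slope = (w[-1] - w[-2]) / (x[-1] - x[-2])    # ~ rho (Neumann condition)
    print(f't = {t}: u(0) = {w[0]:.4f}, u_x(d) ~ {slope:.3f}')
```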
9.2.23. Separation in transformed coordinates. As we have seen several times, it is very useful to view a given equation as an independent object expressed in some particular coordinates. Practical problems mostly include some symmetries, and then we should like to find suitable coordinates in order to see the equation in some simple form. As an example, let us look at the Laplace operator ∆ in polar coordinates in the plane, and in cylindrical or spherical coordinates in space. Writing as usual x = r cos φ, y = r sin φ for the polar transformation, the Laplace operator gets the neat form

(1) ∆ = ∂²/∂r² + (1/r²) ∂²/∂φ² + (1/r) ∂/∂r = (1/r) ∂/∂r( r ∂/∂r ) + (1/r²) ∂²/∂φ².

The reader should perform the tedious but straightforward computation. Similarly,

(2) ∆ = (1/r) ∂/∂r( r ∂/∂r ) + (1/r²) ∂²/∂φ² + ∂²/∂z²,
(3) ∆ = (1/r²) ∂/∂r( r² ∂/∂r ) + (1/(r² sin²ψ)) ∂²/∂φ² + (1/(r² sin ψ)) ∂/∂ψ( sin ψ ∂/∂ψ )

in the cylindrical and spherical coordinates, respectively.

Let us illustrate the use on the following problem. Imagine a twisted circular drum whose rim suffers a small vertical displacement. We should model the stabilized position of the drumskin. Intuitively, we should describe the drumskin position by the 2D wave equation, but since we are interested in the state with ∂/∂t vanishing, we actually take u as the vertical displacement in the interior of the unit circle, Ω = {x² + y² ≤ 1} ⊂ R², and request ∆u = 0, subject to the Dirichlet boundary condition prescribing the vertical displacement u(x, y) = f(x, y) of the rim. Obviously, we want to consider the problem in the polar coordinates, where the boundary condition gets the neat form u(1, φ) = g(φ). Say g(φ) = ε sin φ + ε² sin 5φ with some small constant ε ≥ 0.

We shall apply the separation of variables method to these data. Expecting the solution in the form u(r, φ) = R(r)Φ(φ), the equation implies (after dividing by RΦ)

R″/R + (1/r) R′/R + (1/r²) Φ″/Φ = 0.

Thus, multiplying by r² and considering the separation constant α², we arrive at the two ODEs

Φ″ + α²Φ = 0, r²R″ + rR′ − α²R = 0.

Fortunately, they are both easy to solve. From the first equation (with α > 0),

Φ(φ) = A cos αφ + B sin αφ,

while the second equation transforms via S(t) = R(e^t) into an equation with constant coefficients, and its solution yields

R(r) = Cr^α + Dr^{−α}, α ≠ 0.

If α = 0, then the solution for R is R(r) = C ln r + D, and the first equation implies Φ(φ) = Aφ + B. In fact, we insist that the solution u = RΦ be a single valued function in the plane, and so we can allow only integer values of α, including α = 0, when Φ becomes a constant (any non-zero multiple of φ would again lead to multi-valued solutions u). Thus the general solution of the Laplace equation coming from the separation of variables method and superposition is

(4) u(r, φ) = C₀ ln r + D₀ + ∑_{n=1}^{∞} (Aₙ cos nφ + Bₙ sin nφ)(Cₙrⁿ + Dₙr^{−n}).

In our problem, we clearly insist on having u finite at the origin, and thus all the Dₙ and C₀ have to vanish. Now we can employ the boundary condition

D₀ + ∑_{n=1}^{∞} Cₙ(Aₙ cos nφ + Bₙ sin nφ) = ε sin φ + ε² sin 5φ.

This is a very simple case of Fourier series, and we see immediately that all the coefficients have to vanish except B₁ and B₅, and the requested solution is

u(r, φ) = εr sin φ + ε²r⁵ sin 5φ.

The higher the frequency of the twist of the rim, the slower the distortion develops towards the center of the drumskin. The method works for every boundary condition u(1, φ) = g(φ), as long as we are able to find its Fourier series.

9.2.24. The Helmholtz equation. In 9.2.19, we looked for solutions of the nD wave equation u_{tt} − c²∆u = 0 in the form u(x, t) = F(x)T(t), where x ∈ Rⁿ, t ∈ R. This (partial) separation of variables with negative separation constant −α² leads to the Helmholtz equation

∆F + α²F = 0,

together with the easily solved T″ + α²c²T = 0. Let us treat the 2D case in polar coordinates, again using separation of variables. The Helmholtz equation gets the form

(1/r) ∂/∂r( r ∂F/∂r ) + (1/r²) ∂²F/∂φ² + α²F = 0,

and we seek F(r, φ) = R(r)Φ(φ). Writing β² for the separation constant now, we arrive at the two equations (the second one is multiplied by r², for convenience)

Φ″ + β²Φ = 0, r²R″ + rR′ + (α²r² − β²)R = 0.

The angular component equation has got the obvious solutions A cos βφ + B sin βφ, and again we have to restrict β to integers in order to get single-valued solutions. With β = m, the radial equation is the well known Bessel's ODE of order m (notice our equation gets the form we had in ?? once we substitute z = αr), with the general solution

R(r) = CJₘ(αr) + DYₘ(αr),

where Jₘ and Yₘ are the Bessel functions of the first and second kinds. We have obtained a general solution which is very useful in practical problems, cf. ??.
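The Bessel functions Jₘ, Yₘ are available in standard numerical libraries; the following sketch (our own parameter choices, using scipy) evaluates the radial factors and checks Bessel's ODE by finite differences.

```python
# Evaluate R(r) = J_m(alpha r) and verify r^2 R'' + r R' + (a^2 r^2 - m^2) R ~ 0.
import numpy as np
from scipy.special import jv, yv

alpha, m = 5.0, 2                 # illustrative frequency and angular mode
r = np.linspace(0.5, 3.0, 6)
h = 1e-5

R   = jv(m, alpha * r)                                          # regular at r = 0
Rp  = (jv(m, alpha * (r + h)) - jv(m, alpha * (r - h))) / (2 * h)
Rpp = (jv(m, alpha * (r + h)) - 2 * R + jv(m, alpha * (r - h))) / h**2

residual = r**2 * Rpp + r * Rp + (alpha**2 * r**2 - m**2) * R
print(np.max(np.abs(residual)))   # ~ 0, up to finite-difference error
print(yv(m, alpha * 0.01))        # the second kind blows up towards r = 0
```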
9.2.25. Non-homogeneous equations. Finally, we add a few comments on the non-homogeneous linear PDEs. Although we provide arguments for the claims, we shall not go into the technical details of proofs, for lack of space. Still, we hope this limited insight will motivate the reader to seek further sources and learn more.

As always, facing a problem Lu = f, we have to find a single particular solution to this problem, and we may then add all solutions of the homogeneous problem Lu = 0. Thus, if we have to match, say, Dirichlet conditions u = g on the boundary ∂Ω of a domain Ω, and we know some solution w, i.e. Lw = f (not taking care of the boundary conditions), then we should find a solution v of the homogeneous Dirichlet problem with the boundary condition g − w|_{∂Ω}. Clearly the sum u = v + w solves our problem. In principle, we may always consider superpositions of known solutions as in the Fourier method above. We shall now briefly indicate a more conceptual and general approach.

Let us come back to the 1D diffusion equation and our solution of the homogeneous problem by means of the Fourier transform in 9.2.21. The solution of u_t = κu_{xx} with u(x, 0) = φ is a convolution of the boundary values u(x, 0) with the heat kernel

(1) Q(x, t) = (1/√(4πκt)) e^{−x²/4κt}.

Now, the crucial observation is that u(x, t) = Q(x, t) is a solution of L(u) = u_t − κu_{xx} = 0 for all x and t > 0, while on a neighborhood of the origin it behaves as the Dirac delta function in the variable x. (The first part is a matter of direct computation; the second one was revealed in 9.2.21 already.)

The latter observation suggests how to find particular solutions to a non-homogeneous problem. Consider the integral of the convolutions

(2) u(x, t) = ∫₀ᵗ ( ∫_{−∞}^{∞} Q(x−y, t−s) f(y, s) dy ) ds.

The derivative u_t will have two terms. In the first one we differentiate with respect to the upper limit of the outer integral, while the other one is the derivative inside the integrals. The derivatives with respect to x are evaluated inside the integrals. Thus, in the evaluation of L = ∂/∂t − κ∂²/∂x², the terms inside the integral cancel each other (remember Q is a solution for all x and t > 0) and only the first term of u_t survives. It seems obvious that this term is the evaluation of the integrand at s = t. Although these values are not properly defined, we may verify this claim by taking the limit (t−s) → 0+. This leads to

lim_{s→t−} ∫_{−∞}^{∞} Q(x−y, t−s) f(y, s) dy = f(x, t).

Thus, (2) is a particular solution, and clearly u(x, 0) = 0. The solution of the general Dirichlet problem L(u) = f, u(x, 0) = φ on Ω = R × [0, ∞) is

(3) u(x, t) = ∫_{−∞}^{∞} Q(x−y, t) φ(y) dy + ∫₀ᵗ ( ∫_{−∞}^{∞} Q(x−y, t−s) f(y, s) dy ) ds.

Let us summarize the achievements and try to generalize them to arbitrary dimensions. First, we can generalize the heat kernel function Q, writing its nD variant depending on the distance from the origin only. Consider the formula with x ∈ Rⁿ given as the product of the 1D heat kernels for each of the variables in x:

(4) Q(x, t) = (1/√((4πκt)ⁿ)) e^{−∥x∥²/4κt}.

Then taking the n-dimensional (iterated) convolution of Q with the boundary condition φ on the hyperplane t = 0 provides the solution candidate

(5) u(x, t) = ∫_{Rⁿ} Q(x−y, t) φ(y) dy₁ … dyₙ.

Indeed, a straightforward (but tedious) computation reveals that Q is a solution of L(u) = 0 at all points (x, t) with t > 0, and Q behaves again as the Dirac delta at the origin. In particular, (5) is a solution of the Dirichlet problem L(u) = 0, u(x, 0) = φ, and we can also obtain the non-homogeneous solutions similarly to the 1D case.

9.2.26. The Green's functions. The solutions to the (non-homogeneous) diffusion equation constructed in the last paragraph are built on a very simple idea – we find a solution G to our equation which is defined everywhere except at the origin and blows up at the origin at the speed making it into a Dirac delta function there. A convolution with such a kernel G is then a good candidate for the solutions. Let us try to mimic this approach for the Laplace and Poisson equations now. Actually, we shall modify the strategy.
[about 1 page to be finished – the spherically symmetric solution to the Laplace equation => Green's function => solution to the Poisson equation, similarly to the diffusion.]

3. Remarks on Variational Calculus

Many practical problems look for minima or maxima of real functions J : S → R defined on some spaces of functions. In particular, many laws of nature can be expressed as a certain "minimum principle" concerning some space of mappings. The basic idea is exactly the same as in elementary differential calculus: we aim at finding the best linear approximations of J at fixed arguments u ∈ S, we recognize the critical points (those with vanishing linearization), and then we perhaps look at the quadratic approximations at the critical points. However, all these steps are far more intricate, need a lot of care, and may provide nasty surprises.

9.3.1. Simple examples first. If we know the sizes of tangent vectors to curves, we may ask what is the shortest distance between two points. In the plane R², this means we have got a quadratic form g(x) = (g_{ij}(x)), 1 ≤ i, j ≤ 2, at each x ∈ R², and we want to integrate (the dots mean derivatives in time t, and u(t) = (u₁(t), u₂(t)) are differentiable paths)

(1) J(u) = ∫_{t₁}^{t₂} √( g(u(t))(u̇(t)) ) dt

to get the distance between the two given points u(t₁) = (u₁(t₁), u₂(t₁)) = A and u(t₂) = (u₁(t₂), u₂(t₂)) = B. If the size of the vectors is just the Euclidean one, and we consider curves u(t) = (t, v(t)), i.e., graphs of functions of one variable, the length (1) becomes the well known formula

(2) J(u) = ∫_{t₁}^{t₂} √( 1 + v̇(t)² ) dt.

Quite certainly we all believe that the minimum for fixed boundary values v(t₁) and v(t₂) must be a straight line. But so far, we have not formulated the problem itself. What is the space of curves we deal with? If we allowed non-continuous ones, then shorter paths would be available! So we should aim at proving that the lines are the minimal curves among the continuous ones. Do we need them to be differentiable? In some sense we do, since the derivative appears in our formula for J, but we need the integrand to be defined only almost everywhere. For example, this will be true for all Lipschitz curves.

In general, g(u)(u̇) = g₁₁(u)u̇₁² + 2g₁₂(u)u̇₁u̇₂ + g₂₂(u)u̇₂². Such lengths of vectors are automatically inherited from the ambient Euclidean R³ on every hypersurface in the space. Thus, finding the minimum of J means finding the shortest track in a real terrain (with hills and valleys).

If we choose a positive function α on R² and consider g(x) = α(x)² id_{R²}, i.e., the Euclidean size of vectors scaled by α(x) > 0 at each point x ∈ R², we obtain

(3) J(u) = ∫_{t₁}^{t₂} α(t, v(t)) √( 1 + v̇(t)² ) dt.

We can imagine a particle (or light) moving in the plane with speed 1/α depending on the values of α (the smaller is α, the bigger is the speed), and our problem will be to find the shortest path in terms of the time necessary to pass from A to B.

As a warm up, consider α = 1 in the entire plane except the vertical strip V = {(t, y); t ∈ [a, a+b]}, where α = N, and take A = (0, 0), B = (a+b, c), a, b, c > 0. We can imagine V is a lake: you have to get from A to B by running and swimming, and you swim N times slower than you run.
If we believe that straight lines are the minimizers for constant α, then it is clear that we have to find the optimal point P = (a, p) on the bank of the lake where we start swimming. The total time T(p) is then (s is our actual speed when running straight)

|AP|/s + |PB|/(s/N) = (1/s)( √(p² + a²) + N√((c−p)² + b²) ),

and we want to find the minimum of T(p). The critical point is given by

p/√(p² + a²) = N(c−p)/√((c−p)² + b²) ⟹ sin φ = N sin ψ,

where φ is the angle between our running track and the normal to the boundary of V, while ψ is the angle between our swimming track and the normal to the boundary (draw a picture yourself!). Thus we have recovered the famous Snell law of refraction, saying that the ratio of the sines of the angles equals the ratio of the speeds. (Of course, to finish the solution of the problem, the reader should find the solution p of the resulting quartic equation and check that it is a minimum.)

9.3.2. Variational problems. We shall restrict our attention to the following class of problems.

General first order variational problems

Consider an open Riemann measurable set Ω ⊂ Rⁿ, the space C¹(Ω) of all differentiable mappings u : Ω → Rᵐ, and a C² function F = F(x, y, p) : Rⁿ × Rᵐ × R^{nm} → R, and set the functional

(1) J(u) = ∫_Ω F(x, u(x), D¹u(x)) ω_{Rⁿ},

i.e., J(u) is computed as the ordinary integral of the Riemann integrable function f(x) = F(x, u(x), D¹u(x)), where D¹u is the Jacobi matrix (the differential) of u. The function F is called the Lagrangian of the variational problem, and our task is to find the minimum of J and the corresponding minimizer u with prescribed boundary values u on the boundary ∂Ω (and perhaps some further conditions restricting u).

Mostly we shall restrict ourselves to the case n = m = 1, as in the previous paragraph, where u is a real differentiable function defined on an interval (t₁, t₂) and the function is F = F(t, y, p) : R³ → R,

(2) J(u) = ∫_Ω F(t, u(t), u̇(t)) dt.

We saw F = √(1+p²) and F = α(t)√(1+p²) in the previous paragraph. If we take F = y√(1+p²), the functional J computes the area of the rotational surface given by the graph of the function u (up to a constant multiple). In all cases we may set the boundary values u(t₁) and u(t₂).

Actually, our differentiability assumptions are too strict, as we saw already in our last example above, where F was differentiable except on the boundary of the lake V. We can easily extend our space of functions to piecewise differentiable u and request F(t, u(t), u̇(t)) to be piecewise differentiable for all such u's (as always, piecewise differentiable means the one-sided derivatives exist at all points).

A maybe shocking example is the following functional on piecewise differentiable functions on [0, 1] (i.e. F is the neat polynomial (p² − 1)²):

(3) J(u) = ∫₀¹ ( u̇(t)² − 1 )² dt.

Clearly, J(u) ≥ 0 for all u and, if we set u(0) = u(1) = 0, then any zig-zag piecewise linear function u with derivatives ±1 satisfying the boundary conditions achieves the zero minimum. At the same time, there is no minimum among the differentiable functions u (find a quick proof of that!), but we can approximate any of the zig-zag minima by smooth ones to any precision.
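The phenomenon is easy to observe numerically. In the sketch below (our own construction), the zig-zag gives J ≈ 0, while smoothing the corner over a width ε raises the value only to roughly 16ε/15.

```python
# Evaluate J(u) = integral of (u'^2 - 1)^2 for the zig-zag and for smooth
# competitors with the corner replaced by a parabolic cap of width eps.
import numpy as np

t = np.linspace(0.0, 1.0, 200001)
dt = t[1] - t[0]

def J(u):
    du = np.gradient(u, t)
    return np.sum((du**2 - 1.0)**2) * dt

print('u = 0   :', J(np.zeros_like(t)))          # J = 1
print('zig-zag :', J(0.5 - np.abs(t - 0.5)))     # J ~ 0
for eps in (0.1, 0.01, 0.001):
    u = np.where(np.abs(t - 0.5) >= eps,
                 0.5 - np.abs(t - 0.5),
                 0.5 - eps / 2 - (t - 0.5)**2 / (2 * eps))
    print(f'eps = {eps}:', J(u))                 # ~ 16 * eps / 15
```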
9.3.3. More examples. Let us develop a general method for finding the analogy of the critical points from elementary calculus here. We shall find the necessary steps by dealing with a specific set of problems in this paragraph. Let us work with a Lagrangian generalizing the previous examples,

(1) F(t, y, p) = y^r √(1 + p²),

and write F_t, F_y, F_p, etc., for the corresponding partial derivatives. Consider the variational problem on an interval I = (t₁, t₂) with fixed boundary conditions u(t₁) and u(t₂), and assume u ∈ C²(I), u(t) > 0. Let us consider any differentiable v on I with v(t₁) = v(t₂) = 0 (or, even better, v with compact support inside I). Then u + δv fulfills the boundary conditions for all small real δ, and we consider

J(u + δv) = ∫_{t₁}^{t₂} F(t, u(t) + δv(t), u̇(t) + δv̇(t)) dt.

Of course, the necessary condition for u to be a critical point must be (d/dδ)|₀ J(u + δv) = 0, i.e. (recall the derivative with respect to a parameter can be swapped with the integration)

(2) 0 = ∫_{t₁}^{t₂} ( F_y(t, u(t), u̇(t)) v(t) + F_p(t, u(t), u̇(t)) v̇(t) ) dt.

Integrating the second term in (2) per partes immediately yields (remember v(t₁) = v(t₂) = 0)

0 = ∫_{t₁}^{t₂} ( F_y(t, u(t), u̇(t)) − (d/dt) F_p(t, u(t), u̇(t)) ) v(t) dt.

This condition will certainly be satisfied if the so called Euler equation holds true for u (we prove that this is a necessary condition in lemma 9.3.6):

(3) (d/dt) F_p(t, u(t), u̇(t)) = F_y(t, u(t), u̇(t)).

An equivalent form of this equation for u̇(t) ≠ 0 is (we omit the arguments t of u and u̇)

(4) F_t(t, u, u̇) = (d/dt)( F(t, u, u̇) − u̇ F_p(t, u, u̇) ).

In our case of F(t, y, p) = y^r(1+p²)^{1/2}, F_t vanishes identically, F_p = y^r p(1+p²)^{−1/2}, and thus, if we further assume r ≠ 0 and u > 0, the term in the bracket has to be a positive constant C^r:

C^r = u^r(1+u̇²)^{1/2} − u̇ · u^r u̇ (1+u̇²)^{−1/2} = u^r(1+u̇²)^{−1/2}.

We have arrived at the differential equation

(5) u = C(1 + u̇²)^{1/(2r)},

which we are going to solve. Consider the transformation u̇ = tan τ, i.e.,

u = C(1 + tan²τ)^{1/(2r)} = C(cos τ)^{−1/r},

and so du = (C/r)(cos τ)^{−1/r} tan τ dτ. Consequently,

dt = (1/u̇) du = (C/r)(cos τ)^{−1/r} dτ,

and by integration we arrive at the very useful parametrization of the solutions by the parameter τ (which is actually the slope of the tangent to the solution graph):

(6) t = t₀ + (C/r) ∫₀^τ (cos s)^{−1/r} ds, u = C(cos τ)^{−1/r}.

Now we can summarize the results for several interesting values of r.

First, if r = 0 (which we excluded on the way), then the Euler equation (3) reads ü(1+u̇²)^{−3/2} = 0, which implies ü = 0, and thus the potential minimizers should be straight lines, as expected. (Notice that we have not proved yet that the Euler equation is indeed a necessary condition; we shall come to that in the next paragraphs.)

For general r ≠ 0, the Euler equation (3) tells us (a straightforward computation!)

ü = r (1 + u̇²)/u,

and thus the sign of the second derivative coincides with the sign of r. In particular, the potential minimizers are always concave functions (if r < 0) or convex ones (if r > 0).

If r = −1, the parametrization (6) leads to (an easy integration!)

(7) t = t₀ − C sin τ, u = C cos τ,

thus for τ ∈ [−π/2, π/2] our solutions are half-circles with radius C in the upper half-plane, centred at (t₀, 0).

For r = −1/2, the solution is

(8) t = t₀ − (C/2)(2τ + sin 2τ), u = (C/2)(1 + cos 2τ),

which is a parametric description of a fixed point on a circle with diameter C rolling along the t axis, the so called cycloid. Now, τ ∈ [−π/2, π/2] provides t running from t₀ + Cπ/2 to t₀ − Cπ/2, while u is zero at the points t₀ ± Cπ/2 and reaches its highest point at t = t₀. (Draw pictures!)
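Readers who wish to avoid the straightforward computation can let a computer algebra system derive the Euler equation. A sketch with sympy's euler_equations (our own; the positivity assumptions on u and r are imposed to help sympy cancel the powers):

```python
# Derive the Euler equation for F = u^r sqrt(1 + u'^2) and check that it
# reduces to u'' = r (1 + u'^2) / u, as computed above.
import sympy as sp
from sympy.calculus.euler import euler_equations

t = sp.Symbol('t')
r = sp.Symbol('r', positive=True)
u = sp.Function('u', positive=True)

F = u(t)**r * sp.sqrt(1 + u(t).diff(t)**2)
eq = euler_equations(F, [u(t)], [t])[0]        # F_y - d/dt F_p = 0
upp = sp.solve(eq, u(t).diff(t, 2))[0]         # isolate u''
print(sp.simplify(upp - r * (1 + u(t).diff(t)**2) / u(t)))   # expect 0
```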
Next, look at r = 1/2. Another quick integration reveals

t = t₀ + 2C tan τ = t₀ + 2Cu̇,

and we can compute u̇ and substitute into (5) to obtain

u = C + (1/(4C))(t − t₀)².

Thus the potential minimizers are parabolas with the axis of symmetry t = t₀. If we fix A = (0, 1) and a t₀, there are two relevant choices C = ½(1 ± √(1 − t₀²)) whenever |t₀| < 1 (and no options for |t₀| > 1). The two parabolas have two points of intersection, A and another point B. Clearly only one of them should be the actual minimizer. Moreover, the reader could try to prove that the parabola u = ¼t² touches all of them and has them all on the left (it is the so called envelope of the family of parabolas). Thus, there will be no potential minimizer joining the point A = (0, 1) to an arbitrary point on the right of the parabola u = ¼t².

The last case we come to is r = 1, i.e., the case of the area of the surface of the rotational body drawn by the graph. Here we had better use another parametrization of the slope of the tangent: we set u̇ = sinh τ. A very similar computation as above then immediately leads to t = t₀ + Cτ, u = C cosh τ, and we arrive at the result⁷

(9) u(t) = C cosh( (t − t₀)/C ).

⁷ Some more details on the set of examples of this paragraph can be found in the article "Elementary Introduction to the Calculus of Variations" by Magnus R. Hestenes, Mathematics Magazine, Vol. 23, No. 5 (May–Jun., 1950), pp. 249–267.

9.3.4. Critical points of functionals. Now we shall develop a bit of theory verifying that the steps done in the previous examples really provided necessary conditions for solutions of the variational problems. In order to underline the essential features, we shall first introduce the basic tools in the realm of general normed vector spaces, see 7.3.1. The spaces of piecewise differentiable functions on an interval with the L_p norms can serve as typical examples. We shall deal with mappings F : S → R called (real) functionals.

The first differential

Let S be a vector space equipped with a norm ∥ ∥. A continuous linear mapping L : S → R is called a continuous linear functional. A functional F : S → R is said to have the differential D_uF at a point u ∈ S if there is a continuous linear functional L such that

(1) lim_{v→0} ( F(u + v) − F(u) − L(v) ) / ∥v∥ = 0.

In the very special case of the Euclidean S = Rⁿ, we have recovered the standard definition of the differential, cf. 8.1.7 (just notice that all linear functionals are continuous on a finite dimensional vector space). Again, the differential is computed via the directional derivatives.⁸ Indeed, if (1) holds true, then for each fixed v ∈ S the limit

(2) δF(u)(v) = lim_{t→0} ( F(u + tv) − F(u) ) / t = (d/dt)|₀ F(u + tv)

exists and L(v) = δF(u)(v). We call δF(u) the variation of the functional F at u. A point u ∈ S is called a critical point if δF(u) = 0.

⁸ In functional analysis, this directional derivative is usually called the Gâteaux differential, while the continuous functional L satisfying (1) is usually called the Fréchet differential, going back to two of the founders of functional analysis from the beginning of the 20th century.

We say that F has got a local minimum at u if there is an open neighborhood U of u such that F(w) ≥ F(u) for all w ∈ U. Similarly, we define local maxima and talk about local extrema. If u is an extremum of F, then in particular t = 0 must be an extremum of the function F(u + tv) of one real variable t, where v is arbitrary. Thus the extrema have to be at critical points, whenever the variations exist.

Next, let us assume the variations exist at all points in a neighborhood of a critical point u ∈ S. Then, again exactly as in elementary calculus, considering two increments v, w ∈ S, we consider the limit

(3) δ²F(u)(v, w) = lim_{t→0} ( δF(u + tv)(w) − δF(u)(w) ) / t.
If the limits exist for all v, w, then clearly δ²F(u) is a bilinear mapping. Then δ²F(u)(w, w) is a quadratic form, which we can consider as a second order approximation of F at u. We call it the second variation of F. Moreover, again as in elementary calculus,

δ²F(u)(w, w) = (d²/dt²)|₀ F(u + tw),

if the second variation exists. We may summarize:

Theorem. Let F : S → R be a functional with a local extremum at u ∈ S. If the variation δF(u) exists, then it has to vanish. If the second variation δ²F(u) exists (thus, in particular, δF exists on a neighborhood of u), then δ²F(u)(w, w) ≥ 0 for a minimum, while δ²F(u)(w, w) ≤ 0 for a maximum.

Proof. Assume F has got a local minimum at u. We have already seen that f(t) = F(u + tv) has to achieve a local minimum at t = 0 for each v. Thus f′(0) = 0 if f(t) is differentiable, and so δF(u) vanishes. Now assume δ²F(u)(w, w) = f″(0) = τ < 0 for some w. Then the mean value theorem implies

f(t) − f(0) = f′(c)t = (1/c)( f′(c) − f′(0) ) ct

for some t ≥ c > 0. Thus, for t small enough, f(t) − f(0) < 0, which contradicts f(0) being a local minimum. The claim for a maximum follows analogously (or we may apply the already proved result to the functional −F). □

Corollary. On top of all the assumptions of the above theorem, suppose F(v + tw) is twice differentiable at t = 0 and δ²F(v)(w, w) ≥ 0 for all v in a neighborhood of the critical point u and all w ∈ S. Then F has got a minimum at u.

Proof. As before, we consider f(t) = F(u + tw), w = z − u. Thus, for some 0 < c ≤ 1,

F(z) − F(u) = f(1) − f(0) = f′(0) + ½f″(c) = ½ δ²F(u + cw)(w, w) ≥ 0. □

Remark. Actually, the condition from the corollary is far too strong in infinite dimensional spaces. It is possible to replace it by the condition that δ²F be continuous at u and δ²F(u)(w, w) ≥ C∥w∥² for some real constant C > 0 just at the critical point u. In the finite dimensional case, this is equivalent to the requirement that δ²F be continuous and positive definite.

9.3.5. Back to variational problems. As we already noticed, the answer to a variational problem minimizing a functional (we omit the arguments t of the unknown function u)

(1) J(u) = ∫_{t₁}^{t₂} F(t, u, u̇) dt

depends very much on the boundary conditions and the space of functions we deal with. If we posit u(t₁) = A, u(t₂) = B with arbitrary A, B ∈ R, we may deal with spaces of differentiable or piecewise differentiable functions satisfying these boundary conditions. But these subspaces are not vector spaces any more. Thus, strictly speaking, we cannot apply the concepts from the previous paragraph here. However, we may fix any differentiable function v on [t₁, t₂] satisfying v(t₁) = A, v(t₂) = B, e.g. v(t) = A + (B − A)(t − t₁)/(t₂ − t₁), and replace the functional J by

J̃(u) = J(u + v) = ∫_{t₁}^{t₂} F(t, u + v, u̇ + v̇) dt.

Now the initial problem transforms into one with boundary conditions u(t₁) = u(t₂) = 0, and computing the variations, (d/dδ) J̃(u + δw) = (d/dδ) J(u + v + δw), does not change, i.e. we have to request w(t₁) = w(t₂) = 0 and we differentiate in a vector space.
Essentially, we just exploit the natural affine structures on the subspaces of functions defined by the general boundary conditions, and thus the derivatives have to live in their modelling vector subspaces.

The first and second variations

Corollary. Let F(t, y, p) be a twice differentiable Lagrangian and consider the variational problem of finding a minimum of the functional (1) on the space of differentiable functions u ∈ C¹[t₁, t₂] with boundary conditions u(t₁) = A, u(t₂) = B. Then the first and second variations exist and can be computed for all v ∈ S = {v ∈ C¹[t₁, t₂]; v(t₁) = v(t₂) = 0} as follows:

(2) δJ(u)(v) = ∫_{t₁}^{t₂} ( F_y(t, u, u̇) v + F_p(t, u, u̇) v̇ ) dt,

(3) δ²J(u)(v, v) = ∫_{t₁}^{t₂} ( F_{yy}(t, u, u̇) v² + 2F_{yp}(t, u, u̇) v v̇ + F_{pp}(t, u, u̇) v̇² ) dt.

If u is a local minimum of the variational problem, then δJ(u)(v) = 0 for all v ∈ S, while δ²J(u)(v, v) ≥ 0 for all v in a neighborhood of the origin in S.

Proof. Thanks to our strong assumptions on the differentiability of F, u, and v, we may differentiate the real function f(t) = J(u + tv) at t = 0, swapping the integral and the derivative. This immediately provides both formulae. The remaining two claims are straightforward consequences of the theorem and corollary in the previous paragraph 9.3.4. □

9.3.6. Euler-Lagrange equations. We are following the path which we already tried when discussing our first bunch of examples in 9.3.3. Our next step was to guess the consequences of the vanishing of the first variation in terms of a differential equation. Now we complete the arguments. We start with a simple result called the fundamental lemma of the calculus of variations.

Lemma. Assume u is a continuous function on the interval [t₁, t₂] such that, for all compactly supported smooth φ ∈ C_c^∞[t₁, t₂],

∫_{t₁}^{t₂} u(t)φ(t) dt = 0.

Then u vanishes identically on [t₁, t₂].

Proof. Assume there is a c ∈ (t₁, t₂) such that u(c) > 0. Due to continuity, u(t) > u(c)/2 > 0 on a neighborhood (c−s, c+s) ⊂ (t₁, t₂), s > 0. Next, recall the smooth variants of indicator functions constructed in 6.1.10. For every pair of positive numbers 0 < ε < r, we constructed a function φ_{ε,r}(t) of one real variable t such that φ_{ε,r}(t) = 1 for |t| < r−ε, while φ_{ε,r}(t) = 0 for |t| > r+ε, and 0 ≤ φ_{ε,r}(t) ≤ 1 everywhere. Thus, choosing such a function for φ, with r+ε < s and the origin shifted to t = c, we certainly obtain

∫_{t₁}^{t₂} u(t)φ(t) dt ≥ ½u(c) · 2(r−ε) = u(c)(r−ε) > 0.

If we find some negative value u(c), the same argument finishes the proof (notice there is no need to consider the boundary points, due to the continuity of u). □

Euler-Lagrange equations

Theorem. Consider a twice differentiable Lagrangian F(t, y, p) on [t₁, t₂] × R² and a differentiable critical point u of the functional J(u) = ∫_{t₁}^{t₂} F(t, u, u̇) dt with fixed boundary values u(t₁), u(t₂). Then u is a solution of the differential equation

(1) F_y(t, u, u̇) − (d/dt) F_p(t, u, u̇) = 0.

Notice that the derivative in the second term of the Euler-Lagrange equation is the so called total derivative, i.e. we should differentiate the composed mapping via the chain rule. This can be a problem if we do not assume u to be twice differentiable.
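A finite-difference sanity check (our own construction) of the formula (2) above: for F = √(1+p²) a straight line is a critical point, so the difference quotient of J in any admissible direction v should tend to zero.

```python
# For the length functional, J(u + s v) - J(u) = o(s) at the straight line.
import numpy as np

t = np.linspace(0.0, 1.0, 10001)
dt = t[1] - t[0]

def J(u):                              # F = sqrt(1 + p^2)
    du = np.gradient(u, t)
    return np.sum(np.sqrt(1.0 + du**2)) * dt

u = 2.0 * t                            # straight line, u(0)=0, u(1)=2
v = np.sin(np.pi * t)                  # admissible direction, v(0)=v(1)=0
for s in (1e-1, 1e-2, 1e-3):
    print(s, (J(u + s * v) - J(u)) / s)    # -> 0 linearly in s
```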
Proof. We already know that the vanishing of the first variation δJ(u) is a necessary condition for u to be a critical point. Thus we can start with the equality (2) in the previous paragraph 9.3.5 and compute, integrating per partes:

(2) 0 = ∫_{t₁}^{t₂} ( F_y(t, u, u̇) v + F_p(t, u, u̇) v̇ ) dt
= ∫_{t₁}^{t₂} ( F_y(t, u, u̇) − (d/dt) F_p(t, u, u̇) ) v dt + F_p(t, u, u̇)v|_{t=t₂} − F_p(t, u, u̇)v|_{t=t₁}.

Finally, we exploit the above fundamental lemma of the calculus of variations with arbitrary smooth test functions v with compact support inside (t₁, t₂). Thus the last term with the boundary values vanishes and, by the lemma, the Euler-Lagrange equation has to hold true for u. □

9.3.7. Remarks. We made our life comfortable by adopting very strong differentiability assumptions in the theorem above. In the last century, there was a lot of effort to get much more general results with weaker assumptions. This is really important in practice, where we need to deal with piecewise differentiable extremals. On the other hand, we need even twice differentiable critical points in order to write down the Euler-Lagrange equation explicitly.

Another difficult point is to recognize which of the critical points are the minima or maxima of the functional. We saw that the second variation is a very specific quadratic functional, see 9.3.5(3), and there is a rich theory dealing with its properties. We do not have the space to go into details here, but to get a feeling for the topic we mention just one simple necessary condition for an extremum (with a bit tricky proof).

Lemma (Legendre necessary condition). Consider a twice differentiable Lagrangian F(t, y, p) on [t₁, t₂] × R² and a differentiable critical point u of the functional J(u) = ∫_{t₁}^{t₂} F(t, u, u̇) dt with fixed boundary values u(t₁), u(t₂). If u is a local minimum of J, then F_{pp}(t, u(t), u̇(t)) ≥ 0 on the entire interval [t₁, t₂].

Proof. Assume there is t₀ ∈ (t₁, t₂) such that F_{pp}(t₀, u(t₀), u̇(t₀)) = −µ < 0. Similarly as in the proof of the fundamental lemma in the previous paragraph, we choose s > 0 so that F_{pp}(t, u, u̇) < −½µ on (t₀−s, t₀+s), and a smooth analog of an indicator function φ = φ_{ε,r}, centered at t₀ and satisfying r+ε < s. Then the second variation evaluated on v(t) = αφ(t₀ + (t−t₀)/α), with some α > 0, can be estimated as follows (we use a constant C > |F_{yy}(t, u, u̇)|, C > |F_{yp}(t, u, u̇)| on the entire interval, the fact that v̇(t) = φ̇(τ) under the substitution τ = t₀ + (t−t₀)/α, and the fact that |φ̇_{ε,r}| integrates to two over its support):

δ²J(u)(v, v) = ∫_{t₁}^{t₂} ( F_{yy}(t, u, u̇) v² + 2F_{yp}(t, u, u̇) v v̇ + F_{pp}(t, u, u̇) v̇² ) dt
≤ ∫_{t₀−αs}^{t₀+αs} ( Cα² + 2Cα|v̇| ) dt − ½µ ∫_{t₀−αs}^{t₀+αs} v̇² dt
= 2Csα³ + 2Cα² ∫_{t₀−s}^{t₀+s} |φ̇| dτ − ½µα ∫_{t₀−s}^{t₀+s} φ̇² dτ
= 2Csα³ + 4Cα² − ½µ ( ∫_{t₀−s}^{t₀+s} φ̇² dτ ) α.

The integral in the last term is strictly positive and thus, since the negative term is linear in α, the entire expression is negative if α is small enough. This is a contradiction and the proof is complete. □

9.3.8. Special cases. Very often the Lagrangians do not depend on all the variables, and then the variations and the Euler-Lagrange equations take special forms. The following summary is a straightforward consequence of the general equation 9.3.6(1), whose equivalent form we saw already in 9.3.3(4).

Special forms of Lagrangians

Case 1. If the Lagrangian is F(t, y), i.e., does not depend on the derivatives, then the Euler-Lagrange equation says

(1) F_y(t, u) = 0,

which is an implicit equation for u(t). Moreover, the second variation is δ²J(u)(v, v) = ∫_{t₁}^{t₂} F_{yy}(t, u) v² dt.
Case 2. If the Lagrangian is F(t, p), then the Euler-Lagrange equation is

(2) (d/dt) F_p(t, u̇) = 0,

and its solutions are given by the first order differential equation F_p(t, u̇) = C with a constant parameter C. Moreover, the second variation is δ²J(u)(v, v) = ∫_{t₁}^{t₂} F_{pp}(t, u̇) v̇² dt.

Case 3. If the Lagrangian is F(y, p), then there is a consequence of the Euler-Lagrange equation (for u̇ ≠ 0),

(3) (d/dt)( F(u, u̇) − u̇ F_p(u, u̇) ) = 0,

which again reduces the equation to first order, including a free constant parameter.

9.3.9. Remarks on higher dimensional problems.

9.3.10. Problems with free boundary conditions.

9.3.11. Constrained and isoperimetric problems.

4. Complex Analytic Functions

In the rest of the chapter, we shall look at (single complex variable) functions defined on the complex plane C = R². On many occasions we saw how helpful it was to extend objects from the real line into the complex plane. We provide a few glimpses into the rich classical theory, and we hope the readers will enjoy the use of it in the practical column.

9.4.1. Complex derivative. An open and connected subset Ω ⊂ C is called a region, or a domain. A mapping f : Ω → C is called a complex function of a single complex variable. Working with complex numbers, we may repeat the definition of the derivative:

Complex derivative

We say that a complex function f : Ω → C has the complex derivative f′(a) at a point a ∈ Ω if the complex limit

f′(a) = lim_{z→a} ( f(z) − f(a) )/( z − a ) ∈ C

exists. We say that f is differentiable in the complex sense, or holomorphic, on Ω if its complex derivative f′(z) exists at each z ∈ Ω.

Clearly, this definition restricts to the definition of the derivative of functions of one real variable along R ⊂ C, when we restrict the defining limit to real z and a. We shall see that the existence of a complex derivative is much more restrictive than in the real case.

The simplest example of a differentiable complex function is z ↦ zⁿ, n ∈ N. Indeed, exactly as with the real polynomials, we compute

(z+h)ⁿ − zⁿ = h( nz^{n−1} + ½n(n−1)z^{n−2}h + ⋯ + h^{n−1} ),

and thus for all z ∈ C we obtain the limit

(zⁿ)′ = lim_{h→0} ( (z+h)ⁿ − zⁿ )/h = nz^{n−1}.

By the very definition, the mapping f ↦ f′ is linear over the complex scalars, and thus all polynomials f(z) are differentiable this way:

f(z) = ∑_{k=0}^{n} a_k z^k ↦ f′(z) = ∑_{k=0}^{n−1} (k+1)a_{k+1} z^k.
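The direction-independence of the complex limit can be observed numerically; a tiny sketch (our own) for f(z) = z³:

```python
# The complex difference quotient of z^3 approaches 3 z^2 regardless of the
# direction from which h tends to 0.
import numpy as np

z = 1.0 + 2.0j
exact = 3 * z**2

for direction in (1.0, 1.0j, (1 + 1j) / np.sqrt(2)):
    for eps in (1e-2, 1e-5):
        h = eps * direction
        quotient = ((z + h)**3 - z**3) / h
        print(direction, eps, abs(quotient - exact))   # -> 0 in every direction
```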
An important property of every power series is that it is (complex) differentiable within its radius of convergence.⁹

Convergence and derivative of power series

Theorem. Consider an analytic function (1). There exists $R$, either a non-negative real number or infinity, such that
• $f(z)$ converges absolutely if $|z-a|<R$;
• $f(z)$ diverges if $|z-a|>R$;
• $\frac1R=\limsup_{n\to\infty}|c_n|^{1/n}$.
Further, if the radius of convergence is $R>0$, then $f$ is differentiable (in the complex sense) for $|z-a|<R$, and its derivative equals the power series
$f'(z)=\sum_{n=1}^\infty nc_n(z-a)^{n-1},$
obtained by differentiating $f(z)$ term by term. Moreover, the power series representing $f'(z)$ has the same radius of convergence as $f(z)$.

The open disc $|z-a|<R$ is called the disc of convergence of $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$.

Proof. Let $L=\frac1R$. Suppose $|z-a|<R$. We show that $\sum_{n=0}^\infty c_n(z-a)^n$ converges. Note that $L=0$ if $R=\infty$, and the statement is trivially true in this case. For our fixed $z$, there is $\varepsilon>0$ such that $(L+\varepsilon)|z-a|<1$. Since $L=\limsup_{n\to\infty}|c_n|^{1/n}$, we have $|c_n|^{1/n}<L+\varepsilon$, i.e. $|c_n|<(L+\varepsilon)^n$, for sufficiently large $n$. Therefore $|c_n||z-a|^n\le(L+\varepsilon)^n|z-a|^n$ and $\sum_{n=0}^\infty|c_n||z-a|^n$ is majorized by the convergent geometric series $\sum_{n=0}^\infty\rho^n$ with $\rho=(L+\varepsilon)|z-a|$. Therefore $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ converges.

Now suppose $|z-a|>R$. By the definition of $\limsup$, for any $\varepsilon>0$ there exist infinitely many $c_n$ satisfying $|c_n|>(L-\varepsilon)^n$. Choose $\varepsilon>0$ small enough that $(L-\varepsilon)|z-a|>1$. Then $|c_n||z-a|^n>(L-\varepsilon)^n|z-a|^n$ for infinitely many $n$, and because $(L-\varepsilon)|z-a|>1$, the terms $c_n(z-a)^n$ do not converge to $0$ as $n\to\infty$. Therefore $\sum_{n=0}^\infty c_n(z-a)^n$ diverges.

Next, we move to the derivative. First notice that $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ and $g(z)=\sum_{n=1}^\infty nc_n(z-a)^{n-1}$ have the same radius of convergence, because $\lim n^{1/n}=1$.

⁹ Actually the opposite implication is true as well: a holomorphic function on a domain $\Omega$ is analytic on $\Omega$. We shall not provide a full proof of this result, but we come close to it below. The reader may find the full argument in nearly all basic textbooks on complex analysis.

Fix $z_0$ in the disc of convergence, so that $|z_0-a|<r<R$ for some value of $r$. Let $S_N(z)$, $E_N(z)$ be defined by
$S_N(z)=\sum_{n=0}^Nc_n(z-a)^n,\qquad E_N(z)=\sum_{n=N+1}^\infty c_n(z-a)^n.$
We may think of $S_N$, which consists of the lower-order terms of the power series, as the main term, and of $E_N$ as the error term. Notice that $g(z)=\sum_{n=1}^\infty nc_n(z-a)^{n-1}$ is the term-by-term derivative of $f(z)$. We prove that $f'(z_0)=g(z_0)$, which means
$\lim_{h\to0}\Bigl(\frac{f(z_0+h)-f(z_0)}{h}-g(z_0)\Bigr)=0.$
Thus, given any $\varepsilon>0$, we must show that there exists $\delta>0$ such that if $0<|h|<\delta$, then the expression above has absolute value less than $\varepsilon$. To do so, we break the expression into three parts and estimate each of them separately. More precisely, since $f(z)=S_N(z)+E_N(z)$, and we know the derivative $S'_N$ of the polynomial $S_N$, we write
$\frac{f(z_0+h)-f(z_0)}{h}-g(z_0)=\Bigl(\frac{S_N(z_0+h)-S_N(z_0)}{h}-S'_N(z_0)\Bigr)+\bigl(S'_N(z_0)-g(z_0)\bigr)+\frac{E_N(z_0+h)-E_N(z_0)}{h}.$
We analyze the individual terms. The first term contains the main term and its derivative, which exists because $S_N$ is a polynomial. Thus this term approaches $0$ as $h\to0$. In other words, given $\frac\varepsilon3>0$, we can find $\delta>0$ such that $0<|h|<\delta$ implies
$\Bigl|\frac{S_N(z_0+h)-S_N(z_0)}{h}-S'_N(z_0)\Bigr|<\frac\varepsilon3.$
The second term is $S'_N(z_0)-g(z_0)$.
Since $S'_N(z_0)\to g(z_0)$ as $N\to\infty$ (because $g(z)$ is a power series converging absolutely in its disc of convergence centered at $a$, and $S'_N(z)$ is the $N$-th partial sum of this power series), for $\frac\varepsilon3>0$ we can find some $N_1$ such that $N>N_1$ implies $|S'_N(z_0)-g(z_0)|<\frac\varepsilon3$.

The third term is the trickiest to estimate effectively. We can write
$E_N(z_0+h)-E_N(z_0)=\sum_{n=N+1}^\infty\bigl(c_n(z_0+h-a)^n-c_n(z_0-a)^n\bigr).$
Expanding
$(z_0+h-a)^n-(z_0-a)^n=h\bigl((z_0+h-a)^{n-1}+(z_0+h-a)^{n-2}(z_0-a)+\cdots+(z_0-a)^{n-1}\bigr),$
we obtain
$\frac{E_N(z_0+h)-E_N(z_0)}{h}=\sum_{n=N+1}^\infty c_n\bigl((z_0+h-a)^{n-1}+(z_0+h-a)^{n-2}(z_0-a)+\cdots+(z_0-a)^{n-1}\bigr).$
Observe that for $h$ sufficiently small, $|z_0+h-a|<r$ as well as $|z_0-a|<r$. Therefore, replacing all terms by their absolute values and applying the triangle inequality, we obtain
$\Bigl|\frac{E_N(z_0+h)-E_N(z_0)}{h}\Bigr|\le\sum_{n=N+1}^\infty|c_n|nr^{n-1}.$
The series on the right converges, and furthermore its value approaches $0$ as $N\to\infty$. Indeed, $\sum_{n=N+1}^\infty|c_n|nr^{n-1}$ is just the tail of the series $g(r)$ with absolute values on all of its individual terms, and we know that $g(z)$ converges absolutely for $|z-a|<R$. So the series in question does converge, and since it is the tail of a convergent series, it approaches $0$ as $N\to\infty$. Therefore, given $\frac\varepsilon3>0$, we can find $N_2$ such that for all sufficiently small $h$ and $N>N_2$,
$\Bigl|\frac{E_N(z_0+h)-E_N(z_0)}{h}\Bigr|<\frac\varepsilon3.$
Now select $N>\max\{N_1,N_2\}$. Then an application of the triangle inequality yields
$\Bigl|\frac{S_N(z_0+h)-S_N(z_0)}{h}-S'_N(z_0)\Bigr|+\bigl|S'_N(z_0)-g(z_0)\bigr|+\Bigl|\frac{E_N(z_0+h)-E_N(z_0)}{h}\Bigr|\le\varepsilon.$ □

9.4.3. Corollaries. We can apply the above theorem any number of times to obtain the following consequences. In particular, notice the straightforward existence of the antiderivative, which we shall link with integrals in the next subsections.

Corollaries on the derivatives of power series

Corollary. Consider any power series $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ with radius of convergence $R>0$ and write $D$ for its disc of convergence.
(1) $f(z)$ is infinitely (complex) differentiable in $D$, and each of its $k$-th derivatives $f^{(k)}(z)$ can be obtained by differentiating term by term $k$ times. The resulting power series again has radius of convergence $R$.
(2) There exists the (complex) antiderivative
$F(z)=\sum_{n=0}^\infty\frac{1}{n+1}c_n(z-a)^{n+1},$
such that $F'(z)=f(z)$ in the disc of convergence $D$, which is the same for both series.
(3) The coefficients $c_k$ of the power series $f(z)$ are $c_k=\frac{f^{(k)}(a)}{k!}$.

Proof. All the claims are more or less obvious. Differentiating consecutively, we see that $f$ is infinitely differentiable at all $z\in D$, as claimed in (1). Furthermore, $f^{(k)}(z)=\sum_{n=k}^\infty n(n-1)\cdots(n-k+1)c_n(z-a)^{n-k}$, which in particular yields (2) with $k=1$. Finally, substituting $z=a$ gives $f^{(k)}(a)=k!\,c_k$, since all terms containing $(z-a)^{n-k}$ with $n>k$ vanish at $z=a$. Therefore $c_k=\frac{f^{(k)}(a)}{k!}$. □

9.4.4. Links to the real calculus. Each complex valued function $f$ on $\Omega$ can be viewed as a mapping $f:\Omega\subset\mathbb R^2\to\mathbb R^2$. If this mapping is differentiable in the real sense (e.g., if all partial derivatives are continuous), we may write
$f(z)=f(a)+D^1f(a)(z-a)+(z-a)\alpha(z)$
for $a\in\Omega$ and $z$ in a small neighborhood of $a$ in $\Omega$, with $D^1f$ the Jacobi matrix of first partial derivatives in the two real coordinates and $\lim_{z\to a}\alpha(z)=0$. Thus it is legitimate to ask whether the real linear approximation $D^1f(a)$ is complex linear.
Obviously this happens if and only if the complex derivative $f'(a)$ exists. If $f(z)=u(x+iy)+iv(x+iy)$, $z=x+iy$, is the coordinate expression of a complex differentiable function $f$ viewed as a differentiable mapping $\mathbb R^2\to\mathbb R^2$, then clearly
$\frac{\partial f}{\partial x}(z)=f'(z)\cdot1,\qquad\frac{\partial f}{\partial y}(z)=f'(z)\cdot i.$
Thus $\frac{\partial u}{\partial y}+i\frac{\partial v}{\partial y}=i\bigl(\frac{\partial u}{\partial x}+i\frac{\partial v}{\partial x}\bigr)$, and we have arrived at the necessary and sufficient conditions for $D^1f(z)$ to be complex linear, the Cauchy–Riemann equations

(1) $u_x=v_y,\qquad u_y=-v_x.$

Yet another argument goes as follows: the rank-two matrix describing multiplication by a complex number $a+ib$ has $a$ in the diagonal entries, while $-b$ and $b$ are the other two entries; $D^1f$ is of this form exactly when (1) holds. In particular, differentiating the first equation in (1) by $x$, the second by $y$, and adding, we obtain $u_{xx}+u_{yy}=0$. The same Laplace equation holds for the other component function $v$ of any holomorphic function $f=u+iv$.

At the level of differentials, it is useful to consider the two (complex valued) linear forms
$dz=dx+i\,dy,\qquad d\bar z=dx-i\,dy,$
together with the dual basis of the complexified tangent space,
$\frac{\partial}{\partial z}=\frac12\Bigl(\frac{\partial}{\partial x}-i\frac{\partial}{\partial y}\Bigr),\qquad\frac{\partial}{\partial\bar z}=\frac12\Bigl(\frac{\partial}{\partial x}+i\frac{\partial}{\partial y}\Bigr).$
A straightforward check reveals that a differentiable function $f:\Omega\subset\mathbb C=\mathbb R^2\to\mathbb C$ is complex differentiable if and only if $\frac{\partial f}{\partial\bar z}=0$.

9.4.5. Integrals along paths. Another important link concerns integration along paths. A continuous path $\gamma$ in the complex plane is a continuous mapping $\gamma:J\subset\mathbb R\to\mathbb C$ defined on a bounded closed interval $J=[a,b]$. A path is called a simple closed path, or a Jordan curve in the complex plane¹⁰, if it does not intersect itself and $\gamma(a)=\gamma(b)$.

¹⁰ We shall be interested in piecewise differentiable Jordan curves only, and it is quite easy to see that these curves always divide the complex plane into exactly two connected components (this is obvious for piecewise linear curves, and the rest comes via approximation). For general Jordan curves, this is a difficult topological result attributed to the French mathematician Camille Jordan (1838–1922). This is the same Jordan related to the Jordan canonical form of matrices discussed in Chapter 4.

The composition of a path $\gamma$ with any continuous mapping $f$ defined on the image of $\gamma$ is again continuous, and thus the (complex valued) Riemann integral $\int_a^bf\circ\gamma(t)\,dt$ exists; however, it depends on the parametrization of the path. An easy way out is to restrict ourselves to differentiable paths with derivative $\dot\gamma(t)\ne0$ for all $t$, and to define the integral $I_\gamma$ of $f$ along the path $\gamma$ as the Riemann integral
$I_\gamma=\int_a^bf(\gamma(t))\,\dot\gamma(t)\,dt.$
This coincides perfectly with the Riemann integral of real functions of one variable restricted to a reparametrization $\gamma$ of an interval in $\mathbb R\subset\mathbb C$. Writing $f=u+iv$ and $\gamma(t)=x(t)+iy(t)$,
$f(\gamma)\dot\gamma=(u+iv)(\dot x+i\dot y)=(u\dot x-v\dot y)+i(v\dot x+u\dot y).$
Now we may check directly, by the substitution formula for real integrals, that the complex value $I_\gamma$ is independent of the choice of parametrization. We should also notice that the (complex valued) linear form $f(z)\,dz$ on $\mathbb R^2=\mathbb C$ equals
$(u+iv)(dx+i\,dy)=(u\,dx-v\,dy)+i(v\,dx+u\,dy).$
Thus $I_\gamma$ equals the integral of the linear form $f(z)\,dz$ over the (unparametrized) submanifold $\gamma\subset\mathbb C$ in the sense introduced in the first part of this chapter:
$I_\gamma=\int_a^b(u\dot x-v\dot y)\,dt+i\int_a^b(v\dot x+u\dot y)\,dt=\int_\gamma f(z)\,dz.$
In fact, any choice of parametrization ($\dot\gamma\ne0$ on $J$) determines an orientation of $\gamma$; thus the integral is independent of the parametrization, up to sign. If $\gamma$ is a composition $\gamma_2\circ\gamma_1$ of two paths (we simply concatenate the curves $\gamma_1:[a,b]\to\mathbb C$ and $\gamma_2:[b,c]\to\mathbb C$ with $\gamma_2(b)=\gamma_1(b)$), then clearly
$\int_\gamma f(z)\,dz=\int_{\gamma_1}f(z)\,dz+\int_{\gamma_2}f(z)\,dz.$
In particular, if $\gamma_2=\gamma_1^{-1}$, i.e. the same curve with the opposite parametrization, then $\int_\gamma f(z)\,dz=0$.
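The definition of $I_\gamma$ is easy to implement. The following Python sketch (our illustration, not part of the text; names are ours) approximates $I_\gamma=\int_a^bf(\gamma(t))\dot\gamma(t)\,dt$ by a midpoint Riemann sum over the positively oriented unit circle, anticipating the exact computation in 9.4.6 below.

\begin{verbatim}
# Sketch: numerical path integral along gamma(t) = exp(it), 0 <= t <= 2*pi.
import cmath

def path_integral(f, gamma, dgamma, a, b, n=20000):
    h = (b - a) / n
    s = 0j
    for k in range(n):
        t = a + (k + 0.5) * h            # midpoint rule
        s += f(gamma(t)) * dgamma(t) * h
    return s

gamma = lambda t: cmath.exp(1j * t)
dgamma = lambda t: 1j * cmath.exp(1j * t)

for m in (-2, -1, 0, 1, 2):
    val = path_integral(lambda z, m=m: z**m, gamma, dgamma, 0.0, 2 * cmath.pi)
    print(f"integral of z^{m}: {val:.6f}")  # ~ 2*pi*i for m = -1, else ~ 0
\end{verbatim}

For a closed analytic curve the midpoint rule converges very quickly, so already these sums reproduce $2\pi i$ for $m=-1$ and zero otherwise to many digits.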
Clearly, our definition of the integral extends to piecewise differentiable paths. By uniform continuity over compact domains, the value $I_\gamma$ depends continuously on the choice of the path $\gamma$ in the $C^0$ metric on the functions on the interval $J$. Thus we may approximate any integral $I_\gamma$ by integrating the same function over a piecewise linear path $\tilde\gamma$.

9.4.6. Antiderivatives. If $F(z)$ is an antiderivative of $f(z)$, then clearly
$\frac{d}{dt}F(\gamma(t))=F'(\gamma(t))\cdot\dot\gamma(t)=f(\gamma(t))\cdot\dot\gamma(t),$
and therefore we have verified the straightforward generalization of the Newton integral formula from one-variable real calculus:

Newton integral formula

For each piecewise differentiable path $\gamma:[a,b]\to\mathbb C$ and each antiderivative $F(z)$ of the function $f(z)$ defined on a neighborhood of $\gamma$,
(1) $I_\gamma=\int_\gamma f(z)\,dz=F(\gamma(b))-F(\gamma(a)).$
In particular, the value of the integral depends only on the values of $F$ at the endpoints of $\gamma$ and not on the path itself.

Corollary. Antiderivatives of complex functions on connected domains are uniquely defined, up to additive complex constants.

Proof. Assume $F'(z)=G'(z)$, i.e. $(F-G)'(z)=0$. Then for each path $\gamma$ with $\gamma(0)=z$, $\gamma(1)=w$,
$(F-G)(w)-(F-G)(z)=\int_\gamma0\,dz=0,$
and thus $F-G$ is a constant function. □

As an example, consider the paths $\gamma_r:[0,2\pi]\to\mathbb C$, $\gamma_r(t)=r\,e^{it}$, i.e. the positively oriented boundary of the ball $B(0,r)$ centered at the origin, with radius $r>0$. It is easy to compute the integral of $f(z)=z^n$ along these paths, for all $n\in\mathbb Z$:
(2) $\int_{\gamma_r}z^n\,dz=\int_0^{2\pi}r^ne^{nit}\,ir\,e^{it}\,dt=\begin{cases}\frac{r^{n+1}}{n+1}\bigl[e^{(n+1)it}\bigr]_0^{2\pi}=0,&n\ne-1,\\ ir^0\int_0^{2\pi}e^0\,dt=2\pi i,&n=-1.\end{cases}$
In particular, we see that the integral of any polynomial along any circle vanishes (cf. more details in ??).

9.4.7. Cauchy integral theorem. The formula 9.4.6(1) can be applied to closed paths, and we arrive immediately at the important Cauchy integral theorem on the convergence discs of analytic functions. This result is actually available for all holomorphic functions on much more general domains. Recall that a domain $\Omega$ is called simply connected if every simple closed continuous path in $\Omega$ can be shrunk continuously to a point without leaving $\Omega$.

Cauchy integral theorem

Theorem. Let $f:\Omega\to\mathbb C$ be analytic on a simply connected domain $\Omega$ and let $\gamma$ be a closed piecewise differentiable path in $\Omega$. Then
$\int_\gamma f(z)\,dz=0.$

Sketch of proof. The analytic function $f$ has an antiderivative $F$ on each of its convergence discs. Assume first that the entire domain $\Omega$ is contained in one such disc. Then we break the closed path $\gamma:[0,1]\to\Omega$ at points $0=t_0<t_1<t_2<\ldots<t_m=1$ into intervals on which $\gamma$ is differentiable, and
$\int_\gamma f(z)\,dz=\sum_{j=0}^{m-1}\bigl(F(\gamma(t_{j+1}))-F(\gamma(t_j))\bigr)=F(\gamma(1))-F(\gamma(0))=0,$
since $\gamma(0)=\gamma(1)$. In particular, if $T$ is a triangle lying entirely in a convergence disc of the analytic function $f(z)$ inside $\Omega$, then $\int_{\partial T}f(z)\,dz=0$, where $\partial T$ is the oriented boundary of $T$.

Next, without loss of generality, the path $\gamma$ can be viewed as a polygon, since any piecewise differentiable path $\gamma(t)$ can be uniformly approximated by piecewise linear functions $\gamma_n(t)$ which form closed polygons. The integrals of $f$ over $\gamma_n$ approximate the integral in question. Thus, if we show $\int_{\gamma_n}f(z)\,dz=0$, this implies $\int_\gamma f(z)\,dz=0$, too.
It seems clear that the interior of any closed polygon $\gamma_n$ can be triangulated into closed triangles $T_j$ so that all $T_j$, together with their interiors, lie in $\Omega$. (Actually, here we need the assumption that $\Omega$ is simply connected if we want to fill in all the details.) The integral along the path $\gamma_n$ is then equal to the sum of the integrals over all the individual triangles (notice that over each edge which does not belong to $\gamma_n$ we integrate twice, in opposite directions). Finally, possibly refining the polygon $\gamma_n$ and the triangulation, we may assume that each triangle $T_j$ is so small that it lies entirely in some convergence disc of $f(z)$. Therefore,
$\int_{\gamma_n}f(z)\,dz=\sum_j\int_{\partial T_j}f(z)\,dz=0,$
and hence $\int_\gamma f(z)\,dz=0$, as requested. □

9.4.8. Cauchy integral theorem again. We were quite sloppy about the topological issues in the above sketch of the proof. Actually, there is a more general theorem deducing the conclusion of the Cauchy integral theorem under the assumption that the function $f$ is merely complex differentiable (holomorphic). We shall prove this theorem under the additional assumption that $f$ is (continuously) differentiable as a mapping of two real variables. Both conditions are obviously satisfied for analytic functions. We remark that the general claim of the theorem is proved by a procedure similar to the above argumentation, dealing first with the claim for triangles, etc. The reader may find the full proof in any basic textbook on complex analysis.

Theorem (Cauchy integral theorem). If $f(z)$ is holomorphic in a simply connected domain $\Omega\subset\mathbb C$, then for every piecewise differentiable simple closed path $\gamma\subset\Omega$,
$\int_\gamma f(z)\,dz=0.$

Proof of a special case. Without loss of generality, assume that $\gamma$ is a piecewise differentiable path bounding a simply connected region $G$. Write as usual $f(z)=u(x,y)+iv(x,y)$, so that
$\int_{\partial G}f(z)\,dz=\int_{\partial G}u\,dx-v\,dy+i\int_{\partial G}v\,dx+u\,dy.$
Now, assuming $f$ is continuously differentiable as a function of two real variables, Green's version 9.1.13 of the general Stokes theorem 9.1.12 implies
$\int_{\partial G}f(z)\,dz=\int_G\Bigl(-\frac{\partial v}{\partial x}-\frac{\partial u}{\partial y}+i\Bigl(\frac{\partial u}{\partial x}-\frac{\partial v}{\partial y}\Bigr)\Bigr)dx\,dy=2i\int_G\frac{\partial f}{\partial\bar z}\,dx\,dy=0,$
since $\frac{\partial f}{\partial\bar z}=0$ is equivalent to being holomorphic. □

The Cauchy integral theorem has an immediate consequence, ensuring the existence of antiderivatives:

9.4.9. Theorem. Every analytic function $f(z)$ in a simply connected region $\Omega$ has an antiderivative $F(z)$ in that region.

Proof. If $\Omega$ is the convergence disc of a power series expression for $f$, then the claim is obvious, cf. 9.4.3. In general, fix a point $z_0\in\Omega$, consider an arbitrary $\zeta\in\Omega$ and any path $\gamma\subset\Omega$ with initial point $z_0$ and end point $\zeta$, and define
$F(\zeta)=\int_\gamma f(z)\,dz.$
Choose any other path $\tilde\gamma$ with the same beginning and end, and prolong the path $\gamma$ by $\tilde\gamma^{-1}$. This provides a closed path $\mu=\tilde\gamma^{-1}\circ\gamma$, and therefore, by the Cauchy integral theorem, $\int_\mu f(z)\,dz=0$. Thus $F(\zeta)$ is well defined, independently of the choice of $\gamma$.

Next, consider $h$ so small that the entire oriented segment $\nu$ joining $\zeta$ and $\zeta+h$ lies in $\Omega$. Then
$F(\zeta+h)-F(\zeta)=\int_{\nu\circ\gamma}f(z)\,dz-\int_\gamma f(z)\,dz=\int_0^1f(\zeta+ht)\,h\,dt.$
In particular,
$\lim_{h\to0}\frac{F(\zeta+h)-F(\zeta)}{h}=\lim_{h\to0}\int_0^1f(\zeta+ht)\,dt=f(\zeta),$
and thus $F(\zeta)$ is the requested antiderivative. □

Clearly, the antiderivative of an analytic function on a simply connected domain is again analytic.
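The construction in the proof of 9.4.9 can be mimicked numerically. In the following Python sketch (our illustration, not part of the text; all names are ours) we take the entire function $f(z)=e^z$ on the simply connected domain $\mathbb C$, define $F(\zeta)$ as the integral of $f$ along the straight segment from $z_0=0$ to $\zeta$, and check that the difference quotient of $F$ reproduces $f$.

\begin{verbatim}
# Sketch: antiderivative via a path integral, F(zeta) = int_{[0,zeta]} f(z) dz.
import cmath

def segment_integral(f, z0, z1, n=4000):
    h = (z1 - z0) / n                       # midpoint rule on the segment
    return sum(f(z0 + (k + 0.5) * h) * h for k in range(n))

f = cmath.exp
z0 = 0j

def F(zeta):
    return segment_integral(f, z0, zeta)    # here F(zeta) ~ exp(zeta) - 1

zeta, eps = 0.3 + 0.4j, 1e-4
approx = (F(zeta + eps) - F(zeta)) / eps
print("F'(zeta) ~", approx)
print("f(zeta)  =", f(zeta))                # the two values should agree
\end{verbatim}

Since $e^z$ is entire, any other path from $z_0$ to $\zeta$ would give the same value of $F(\zeta)$, exactly as the Cauchy integral theorem predicts.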
This follows immediately from the corollary in 9.4.6 and the local formula for the antiderivative, $F(z)=\sum_{n=0}^\infty\frac{1}{n+1}c_n(z-a)^{n+1}$, of the analytic function $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ on the individual convergence discs. It is related to the much more general concept of analytic extension, which we discuss now.

9.4.10. Uniqueness. At first glance, the power series locally representing an analytic function in a domain $\Omega$ should glue together. We deal first with the uniqueness issues.

Uniqueness theorem

Lemma. Consider an analytic function $f(z)$ in $\Omega$ and a sequence of its zeroes $a_n\in\Omega$ which has a limit point $a\in\Omega$. Then $f(z)=0$ everywhere in $\Omega$.

Proof. We start with a simple observation on the non-vanishing of analytic functions:

Claim. Let $f(z)\not\equiv0$ be analytic in $\Omega$, with $f(a)=0$ for some $a\in\Omega$. Then there exists $\varepsilon>0$ such that $f(z)\ne0$ for $0<|z-a|<\varepsilon$.

Indeed, in some neighbourhood of $a$, the analytic function $f(z)$ is represented by a power series $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$. Since $f(a)=0$, we have $c_0=0$. Let $c_k$ be the first non-zero coefficient in the series. Then $f(z)=(z-a)^kg(z)$, where $g(z)=\sum_{n=k}^\infty c_n(z-a)^{n-k}$ and $g(a)\ne0$. Therefore, by the continuity of $g(z)$, there exists a disc of radius $\varepsilon>0$ centered at $a$ in which $g(z)$ has no zeroes; consequently $f(z)\ne0$ for $0<|z-a|<\varepsilon$.

The lemma is now a simple corollary of the above claim. Under the assumptions, $f(z)$ must vanish identically on a nontrivial disc centered at $a$. Assume $f(w)\ne0$ for some $w\in\Omega$, and choose a path $\gamma$ with $\gamma(0)=a$, $\gamma(1)=w$. Define $t_0$ as the infimum of the nonempty set $\{t\in[0,1];\ f(\gamma(t))\ne0\}$. Then $t_0>0$, $f(\gamma(t))$ is identically zero for $t\in[0,t_0)$, and by continuity also $f(\gamma(t_0))=0$. Thus the above claim applies to $a=\gamma(t_0)$ and provides a punctured neighbourhood of $\gamma(t_0)$ free of zeroes of $f$, contradicting the vanishing of $f(\gamma(t))$ for $t<t_0$ arbitrarily close to $t_0$, and we are done. □

As a corollary, we see that any function $f(z)$ analytic in two concentric discs is represented in those discs by the same power series $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$, where $a$ is the common centre of those discs and the $c_k$ are the Taylor coefficients $c_k=\frac{1}{k!}f^{(k)}(a)$, $k=0,1,\ldots$.

9.4.11. Analytic extension. The basic idea for gluing non-zero power series together is very simple. Consider a power series $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ converging in $D=\{|z-a|<r\}$ and some point $b\in D$. If $|z-b|+|b-a|<r$, then $\sum_{n=0}^\infty|c_n|(|z-b|+|b-a|)^n$ converges, and thus on the smaller disc $D_b=\{|z-b|<s=r-|b-a|\}$ we may rewrite the power series $f(z)$ by expanding $(z-a)^n=(z-b+b-a)^n$:
$\sum_{n=0}^\infty c_n(z-a)^n=\sum_{n=0}^\infty\sum_{k=0}^n\binom nk c_n(b-a)^{n-k}(z-b)^k=\sum_{k=0}^\infty\Bigl(\sum_{n=k}^\infty\binom nk c_n(b-a)^{n-k}\Bigr)(z-b)^k,$
where all the series converge absolutely, so the order of summation is irrelevant.

Thus, writing $d_k=\sum_{n=k}^\infty\binom nk c_n(b-a)^{n-k}$, the new power series $f(z)=\sum_{k=0}^\infty d_k(z-b)^k$ converges at least on the disc $D_b$, and we call it the re-expansion of $\sum_{n=0}^\infty c_n(z-a)^n$ at the centre $b$. The re-expansion is guaranteed to converge for $|z-b|<r-|b-a|$; however, the radius of convergence of $\sum_{n=0}^\infty d_n(z-b)^n$ can be larger. The concept of analytic extension is based on this.

Analytic elements and extensions

An analytic element $\Phi$ with centre $a\in\mathbb C$ is a pair $\Phi=(D,f)$, where $D$ is some disc $D=\{|z-a|\le R\}$ and $f$ is a convergent power series in $D$. The element $\Phi$ is called canonical if $R$ is the radius of convergence of $f$ at $a$.
Elements $\Phi_1=(D_1,f_1)$ and $\Phi_2=(D_2,f_2)$ are called immediate extensions of each other if $D_1\cap D_2\ne\emptyset$ and $f_1(z)=f_2(z)$ on $D_1\cap D_2$. An element $\Psi$ is called an analytic extension of $\Phi$ along the chain $\Phi_0=\Phi,\Phi_1,\ldots,\Phi_n=\Psi$ if $\Phi_{j+1}$ is an immediate analytic extension of $\Phi_j$, $j=0,1,\ldots,n-1$.

A parametric family $\Phi_t=(D_t,f_t)$ of canonical elements is called an analytic extension of $\Phi_0$ along a path $\gamma:[0,1]\to\mathbb C$ if
(i) for all $t\in[0,1]$ the radii of convergence $R_t$ of the discs $D_t$ are positive and the centres of $\Phi_t$ are $a_t=\gamma(t)$; and
(ii) for all $\tau\in[0,1]$ there exist open intervals $U_\tau\subset[0,1]$ containing $\tau$ such that for all $t\in U_\tau$, $\gamma(t)\in D_\tau$ and $\Phi_t$ is an immediate extension of $\Phi_\tau$.

For example, the pair
(1) $\log z=\Bigl(|z-1|<1,\ \sum_{n=1}^\infty\frac{(-1)^{n-1}}{n}(z-1)^n\Bigr)$
is the canonical element which restricts to the standard real logarithm, centered at $1$.

Obviously, if $\Psi=(D_2,f_2)$ is an immediate extension of $\Phi=(D_1,f_1)$ and the centre $a_2$ of $\Psi$ lies in $D_1$, then the series $f_2$ is just the re-expansion of $f_1$ around $a_2$. This follows from the uniqueness theorem and the possibility of re-expansion.

Lemma. Assume that the element $\Phi_1=(D_1,f_1)$ is an immediate extension of $\Phi_0=(D_0,f_0)$, and that $\Phi_2=(D_2,f_2)$ is an immediate extension of $\Phi_1$. Then $\Phi_2$ is also an immediate extension of $\Phi_0$, provided $D_0\cap D_1\cap D_2\ne\emptyset$.

If $\Phi_t$ and $\Psi_t$ are two analytic extensions of a canonical element $\Phi_0=\Psi_0$ along the same path $\gamma:[0,1]\to\mathbb C$, then $\Psi_t=\Phi_t$ for all $t\in[0,1]$.

Proof. Clearly $f_0=f_1=f_2$ on the non-empty open subset $D_0\cap D_1\cap D_2$ of $D_0\cap D_2$. By the uniqueness theorem, $f_2=f_0$ everywhere in $D_0\cap D_2$, which proves the first claim.

Next, consider $E=\{t\in[0,1]:\Psi_t=\Phi_t\}$. Since $0\in E$, the set $E$ is not empty. It is open, because $U^\Phi_\tau\cap U^\Psi_\tau\subset E$ for all $\tau\in E$. Further, if $t_0\in[0,1]$ is a limit point of $E$, then the elements $\Phi_{t_0}$ and $\Psi_{t_0}$ are immediate extensions of $\Phi_{t_1}=\Psi_{t_1}$ for suitable $t_1\in U^\Phi_{t_0}\cap U^\Psi_{t_0}\cap E$, $t_1<t_0$. Since $\Phi_{t_0}$ and $\Psi_{t_0}$ have the common centre $\gamma(t_0)$, they must be equal. So $E$ is closed and thus $E=[0,1]$, i.e. $\Phi_t=\Psi_t$ for all $t$. □

9.4.12. Technical observations. In a sequence of simple observations we show that an analytic extension along a path can always be obtained by analytic extension along some chain of elements, and vice versa. First, consider a family $\Phi_t$ of elements that extend $\Phi_0$ along a path $\gamma:[0,1]\to\mathbb C$, and write $R(t)$ for the corresponding radii of convergence.

Proposition. If $R(\tau)=\infty$ for some $\tau\in[0,1]$, then it is infinite for all $t$. If finite, then $R(t)$ is continuous on $[0,1]$.

Proof. If $R(\tau)=\infty$, then each element $\Phi_t$ is a re-expansion of $\Phi_\tau$, and therefore $R(t)=\infty$ for all $t$. On the other hand, if $R(\tau)<\infty$ for some $\tau$, then for all $t\in U_\tau$ the circles $\{|z-\gamma(t)|=R(t)\}$ and $\{|z-\gamma(\tau)|=R(\tau)\}$ intersect in a pair of points, because neither of these circles lies inside the other one. Let $w$ be one such point of intersection. Then the triangle with vertices $w$, $\gamma(\tau)$, $\gamma(t)$ yields
$|R(t)-R(\tau)|<|\gamma(t)-\gamma(\tau)|.$
Since $\gamma$ is uniformly continuous on $[0,1]$, $R(t)$ is continuous on $U_\tau$, and thus also on $[0,1]$. □

Lemma. Consider an analytic extension $\{\Phi_t,\ t\in[0,1]\}$ of $\Phi_0$ along a path $\gamma$. There exist finitely many intermediate points $0=t_0<t_1<\ldots<t_n=1$ such that $\Phi_1$ is obtained by analytic extension along the chain of elements $\Phi_0=\Phi_{t_0},\Phi_{t_1},\ldots,\Phi_{t_n}=\Phi_1$. Conversely, if a canonical element $\Psi$ extends $\Phi$ along the chain
$\Phi=\Phi_0,\Phi_1,\ldots,\Phi_n=\Psi$ and $\gamma:[0,1]\to\mathbb C$ is the piecewise linear path through the centres of $\Phi_0,\Phi_1,\ldots,\Phi_n$, then there is a family $\{\Psi_t\}$ of canonical elements extending $\Psi_0=\Phi_0$ along $\gamma$ such that $\Psi_1=\Phi_n=\Psi$.

Proof. Since $R(t)>0$ for all $t$ and is continuous, it is separated from zero by the compactness of $[0,1]$; therefore $R(t)>c$ for some constant $c>0$. Uniform continuity of $\gamma(t)$ yields $\delta>0$ such that $|t_2-t_1|<\delta$ implies $|\gamma(t_2)-\gamma(t_1)|<c$. Since the intervals $J_t=U_t\cap(t-\frac\delta2,t+\frac\delta2)$ cover $[0,1]$, one can choose a finite subcover $J_{t_1},\ldots,J_{t_{n-1}}$ for some $t_1<\ldots<t_{n-1}$, to which we append $t_0=0$ and $t_n=1$ if these terminal points are missing from the sequence. Then $|\gamma(t_{j+1})-\gamma(t_j)|<c$, so the centres of $\Phi_{t_{j+1}}$ and $\Phi_{t_j}$ lie in each other's discs of convergence. Hence these elements are re-expansions, and thus immediate extensions, of each other.

For the converse, as there are finitely many straight segments in $\gamma(t)$, it is sufficient to consider the case $n=1$; the rest follows by induction. Thus $\gamma(t)$ is a segment connecting $\gamma(0)$ and $\gamma(1)$, lying in $D_0\cup D_1$ with $D_0\cap D_1\ne\emptyset$. Define $\Phi_t=(D_t,f_t)$ as the re-expansion of either $\Phi_0$ or $\Phi_1$, depending on whether $\gamma(t)$ lies in $D_0$ or in $D_1$. If $\gamma(t)\in D_0\cap D_1$, then the re-expansions of both $\Phi_0$ and $\Phi_1$ at the centre $\gamma(t)$ determine the same canonical element, as noticed above. The interval $U_\tau$ can be chosen so that $\gamma(t)$, $t\in U_\tau$, lies entirely in either $D_0$ or $D_1$; then $\Phi_t$ for $t\in U_\tau$ is an immediate extension of $\Phi_\tau$. □

9.4.13. Monodromy theorem. In general, it is very difficult to say whether there is an analytic extension of a given analytic element along a given path. But we may quite easily find conditions under which existing extensions along paths $\gamma_0$ and $\gamma_1$ with a common beginning and end must coincide.

Homotopic paths

We say that two paths $\gamma_0(t)$ and $\gamma_1(t)$ with common terminal points $\gamma_0(0)=\gamma_1(0)=a$ and $\gamma_0(1)=\gamma_1(1)=b$ are homotopic if there exists a continuous function $\gamma:[0,1]\times[0,1]\to\mathbb C$ such that $\gamma(0,t)=\gamma_0(t)$, $\gamma(1,t)=\gamma_1(t)$, $\gamma(s,0)=a$, $\gamma(s,1)=b$ for all $s\in[0,1]$. We say that the paths $\gamma_s(t)=\gamma(s,t)$ provide a homotopic deformation of $\gamma_0$ to $\gamma_1$.

The following theorem says that homotopic paths lead to the same analytic extensions.

Monodromy theorem

Theorem. Suppose that a canonical element $\Phi_0=(D_0,f_0)$ centred at $a$ can be analytically extended along every path $\gamma_s$ in a homotopic deformation. Then the extensions along all of these paths terminate with the same canonical element $\Phi_1$.

Proof. Write $\Phi_{st}$, $s\in[0,1]$, $t\in[0,1]$, for the canonical elements of the extension of $\Phi_{00}=\Phi_0$ along $\gamma_s(t)$, and let $R_s(t)$ be the radius of convergence of $\Phi_{st}$. Since $R_s(t)>0$ for all $(s,t)\in[0,1]\times[0,1]$, there exists $\rho>0$ such that $R_s(t)>\rho$ for all $s$ and $t$. Notice that $\gamma(s,t)$ is uniformly continuous; thus, fixing $s_0\in[0,1]$, we may choose an interval $V_{s_0}$ around $s_0$ on the $s$-axis such that
$\max_{t\in[0,1]}|\gamma(s,t)-\gamma(s_0,t)|<\frac\rho4$ for all $s\in V_{s_0}$.
Then, for all $s\in V_{s_0}$, the result of the analytic extension remains the same, since every element $\Phi_{st}$ centred at $\gamma(s,t)$ is a re-expansion of $\Phi_{s_0t}$, because the centre $\gamma(s,t)$ lies in the disc of convergence of $\Phi_{s_0t}$. Thus extensions along $\gamma_s(t)$ and $\gamma_{s_0}(t)$ produce the same terminal element.

Consider the set $E=\{s\in[0,1]:\Phi_{s1}=\Phi_{01}\}$. Clearly $E$ is not empty, as it contains $s=0$. By the previous argument, $E$ is open. We shall see that $E$ is also closed. Let $s_0$ be a limit point of $E$ and consider the interval $V_{s_0}$ as constructed above.
Then there is $s'\in V_{s_0}\cap E$, the extensions along $\gamma_{s'}(t)$ and $\gamma_{s_0}(t)$ coincide, and so $s_0\in E$. Hence $E=[0,1]$, which proves the theorem. □

Corollary (Monodromy theorem). Consider a simply connected region $\Omega\subset\mathbb C$ and a canonical element $\Phi=(D,f)$ centred at $a\in\Omega$. Suppose that $\Phi$ extends along any path $\gamma\subset\Omega$ through $a$. Then for any $b\in\Omega$, the extension of $\Phi$ along any path terminating at $b$ is independent of the path, i.e. it produces the same analytic element for each such path. Thus an analytic extension of $\Phi$ to every point in $\Omega$ generates an analytic function which is represented as a convergent power series in any disc inscribed in $\Omega$.

9.4.14. Remarks. We look at some simple examples where analytic extension is crucial. Consider the function $f(z)=\sqrt z$. Clearly we may choose the two different options $f(1)=\pm1$, and each of the choices leads to a canonical element. Their analytic extensions are called the branches of the multivalued complex function $f$. Notice that $\sqrt z$ is not analytic at the origin, since its derivative blows up to infinity there. Intuitively, it seems that two closed paths in $\mathbb C\setminus\{0\}$ are homotopic if and only if they run around the origin the same number of times (the winding number). We may imagine what happens to the values if we move $z$ along a circle $z=r\,e^{i\theta}$. The two initial options lead to
$f_1(z)=\sqrt r\,e^{i\theta/2},\qquad f_2(z)=\sqrt r\,e^{i(\pi+\theta/2)}.$
Once we run $\theta$ from $0$ to $2\pi$, the values of the two branches swap. See ?? for more observations on root functions (more in the other column).

Another very important example is $f(z)=z^{-1}$. Since $f(z)$ integrates to $2\pi i$ over each circle centered at the origin, there cannot exist an antiderivative of $f$ along any such circle. But locally, the antiderivative is the logarithm function $\log z$. The canonical element 9.4.11(1) extends to one of infinitely many branches, and running along a circle changes its value by the constant $2\pi i$.

We return to general analytic functions. As promised a few pages back, the monodromy theorem implies that for any analytic function $f(z)$ in a simply connected region $\Omega$, there exists an analytic function representing its antiderivative. Moreover, on each analytic element of $f$, this is just the antiderivative of the power series representing the function on the disc. Now we may also deduce the Cauchy integral theorem for simply connected regions $\Omega$ in another way: the integral along a closed path is given by the difference of the values of the antiderivative at the terminal points; since the closed path is homotopic to a point, the integral vanishes.

9.4.15. Cauchy theorem for the third time. The Cauchy integral theorem holds also for analytic functions on domains $\Omega$ which are not simply connected. A bounded open domain $\Omega\subset\mathbb C$ is said to have a regular boundary $\partial\Omega$ if the set of boundary points $\partial\Omega$ consists of finitely many piecewise smooth and mutually disjoint Jordan curves $\gamma_0,\gamma_1,\ldots,\gamma_n$. Notice that the Jordan curves in the boundary divide the complex plane into $n+2$ connected components. Just one of them is unbounded, one of them coincides with $\Omega$, and all the others are bounded "holes" inside $\Omega$. We write $\gamma_0$ for the oriented exterior boundary, i.e. the boundary of the connected unbounded component of $\mathbb C\setminus\Omega$, oriented counterclockwise, while $\gamma_1,\ldots,\gamma_n$ form the oriented interior boundary, i.e. these curves are oriented clockwise.

Cauchy integral theorem

Theorem.
Let $\Omega\subset\mathbb C$ be a bounded region with regular boundary and let $f(z)$ be analytic on the closure $\overline\Omega$ (i.e. analytic on some domain containing $\overline\Omega$). Then
$\int_{\partial\Omega}f(z)\,dz=0.$

Proof. With our choice of orientation of the boundary, we must prove
$\int_{\partial\Omega}f(z)\,dz=\int_{\gamma_0}f(z)\,dz+\sum_{j=1}^n\int_{\gamma_j}f(z)\,dz=0.$
We proceed by induction on $n$. If $n=0$, then $\Omega$ is simply connected and the theorem is already proved, see Theorem 9.4.8. If $n=1$, then there is exactly one interior part $\gamma_1$ of the boundary, and we may choose two smooth paths $\mu_1$ and $\mu_2$ joining, say, the leftmost and the rightmost points of $\gamma_0$ and $\gamma_1$, respectively. This splits $\Omega$ into two simply connected regions $\Omega_+$ (say the upper one) and $\Omega_-$ (the lower one), with boundaries $\partial\Omega_+=\gamma_0^+\circ\mu_1\circ\gamma_1^+\circ\mu_2^{-1}$ and $\partial\Omega_-=\gamma_0^-\circ\mu_2\circ\gamma_1^-\circ\mu_1^{-1}$. At the same time,
$\int_{\partial\Omega}f(z)\,dz=\int_{\partial\Omega_+}f(z)\,dz+\int_{\partial\Omega_-}f(z)\,dz,$
since the integrations over $\mu_i$ and $\mu_i^{-1}$, $i=1,2$, cancel each other on the right-hand side. Moreover, the boundaries $\partial\Omega_\pm$ are again piecewise differentiable Jordan curves, and therefore both integrals on the right-hand side vanish.

The general induction step is completely analogous. If $n>1$, we find one of the interior boundaries $\gamma_i$ closest to $\gamma_0$ and choose two cuts $\mu_1$, $\mu_2$ so that one of the two newly created components of $\Omega$ is simply connected. The other component then has one less interior boundary, and the theorem follows by induction. □

[A diagram with the cuts is needed here; see the illustration.]

9.4.16. Cauchy integral formula. Consider an open ball without its center, $\Omega=B(z_0,r)\setminus\{z_0\}$, an analytic function $f(z)$ in $\Omega$, and positively oriented Jordan curves $\gamma\subset\Omega$ containing $z_0$ in their interiors. Due to the Cauchy integral theorem, the integral $\int_\gamma f(z)\,dz$ does not depend on the choice of such $\gamma$. Indeed, the region enclosed between two such choices $\gamma_1$, $\gamma_2$ is bounded by them, but with opposite orientations; thus the vanishing of the integral over its boundary means that the integrals over $\gamma_1$ and $\gamma_2$ are actually equal.

Next, recall that the integral of $z^{-1}$ over any circle centered at the origin is $2\pi i$, see 9.4.6(2). These observations suggest the following essential formula (we may expect that $f(\zeta)$ behaves much like the constant $f(z)$ for very small $\gamma$):

Cauchy integral formula

Theorem. Let $f(z)$ be an analytic function on the closure of a region $\Omega\subset\mathbb C$ with regular boundary. Then for all $z$ in $\Omega$,
$f(z)=\frac{1}{2\pi i}\int_{\partial\Omega}\frac{f(\zeta)}{\zeta-z}\,d\zeta.$

Proof. Fix $z\in\Omega$ and consider an open disc $D_\rho=\{\zeta\in\mathbb C;\ |\zeta-z|<\rho\}$ lying inside $\Omega$. The function $g(\zeta)=\frac{f(\zeta)}{\zeta-z}$ is analytic on the closure of $\Omega\setminus\overline{D_\rho}$. Adopting the counterclockwise orientation of the boundary of $D_\rho$, the Cauchy integral theorem implies
$\int_{\partial\Omega}\frac{f(\zeta)}{\zeta-z}\,d\zeta=\int_{\partial D_\rho}\frac{f(\zeta)}{\zeta-z}\,d\zeta.$
We aim at showing that $\int_{\partial D_\rho}\frac{f(\zeta)}{\zeta-z}\,d\zeta=2\pi if(z)$. We know $2\pi if(z)=\int_{\partial D_\rho}\frac{f(z)}{\zeta-z}\,d\zeta$. Thus we consider
$\int_{\partial D_\rho}\frac{f(\zeta)}{\zeta-z}\,d\zeta-2\pi if(z)=\int_{\partial D_\rho}\frac{f(\zeta)-f(z)}{\zeta-z}\,d\zeta$
and estimate
$\Bigl|\int_{\partial D_\rho}\frac{f(\zeta)-f(z)}{\zeta-z}\,d\zeta\Bigr|\le\max_{|\zeta-z|=\rho}\frac{2\pi\rho\,|f(\zeta)-f(z)|}{\rho}=2\pi\max_{|\zeta-z|=\rho}|f(\zeta)-f(z)|.$
Clearly, the right-hand side approaches zero as $\rho\to0$. Since the left-hand side does not depend on $\rho$, we conclude $\int_{\partial D_\rho}\frac{f(\zeta)-f(z)}{\zeta-z}\,d\zeta=0$ and the formula in the theorem is verified. □

Notice that if we consider $z\in\mathbb C\setminus\overline\Omega$ in the above theorem, then the function $\frac{f(\zeta)}{\zeta-z}$ is analytic in $\Omega$ and thus the integral vanishes by the Cauchy integral theorem.

9.4.17. Corollaries. Taking consecutive derivatives with respect to $z$ in the above formula, we obtain expressions for all derivatives of $f(z)$:

Cauchy integral formula for derivatives

Corollary.
Let $f(z)$ be an analytic function on the closure of a region $\Omega\subset\mathbb C$ with regular boundary. Then for all $z$ in $\Omega$,
$f^{(n)}(z)=\frac{n!}{2\pi i}\int_{\partial\Omega}\frac{f(\zeta)}{(\zeta-z)^{n+1}}\,d\zeta.$

Proof. Indeed, $z$ is an independent argument in the smooth integrand; thus we may differentiate under the integral sign, which yields the formula. □

Applying the Cauchy integral formula to the disc $D_r=\{|z-a|<r\}$, we obtain:

Mean value theorem

Theorem. For $f(z)$ analytic on the closure of $D_r$, the value at the centre of the disc can be evaluated as
$f(a)=\frac{1}{2\pi}\int_0^{2\pi}f(a+r\,e^{i\theta})\,d\theta.$

Proof. By the Cauchy integral formula,
$f(a)=\frac{1}{2\pi i}\int_{|\zeta-a|=r}\frac{f(\zeta)}{\zeta-a}\,d\zeta.$
Substituting $\zeta=a+r\,e^{i\theta}$ and $d\zeta=ir\,e^{i\theta}\,d\theta$, we obtain $f(a)=\frac{1}{2\pi}\int_0^{2\pi}f(a+r\,e^{i\theta})\,d\theta$. □

9.4.18. Laurent series. Already in 6.3.10 we noticed that quotients $\frac{f(z)}{g(z)}$ of two polynomials enjoy quite a nice expansion similar to power series. We called a series of the form $\sum_{n=-\infty}^\infty c_n(z-a)^n$ a Laurent series. Now we do the same with complex arguments and coefficients. The part of the Laurent series with non-negative powers, $\sum_{n=0}^\infty c_n(z-a)^n$, is called the regular part, while the remaining part $\sum_{n=-1}^{-\infty}c_n(z-a)^n$, consisting of the negative powers of $(z-a)$, is called its principal part. A Laurent series is called convergent if both its regular and principal parts converge.

Laurent series

Theorem. Every function $f(z)$ analytic in the annulus $A=\{r<|z-a|<R\}$, with $0\le r<R\le\infty$, admits a representation by a Laurent series
(1) $f(z)=\sum_{n=-\infty}^\infty c_n(z-a)^n,$
where the coefficients $c_n$ can be calculated by
(2) $c_n=\frac{1}{2\pi i}\int_{|\zeta-a|=\rho}\frac{f(\zeta)}{(\zeta-a)^{n+1}}\,d\zeta,\qquad n\in\mathbb Z,\ r<\rho<R.$
The coefficients $c_n$, called the Laurent coefficients of $f(z)$ in $A$, are uniquely determined; in particular they do not depend on $\rho$.

Proof. Choose $z\in A$ and $r'$, $R'$ such that $r<r'<|z-a|<R'<R$. The function $f(z)$ is then analytic on the closure of the annulus $A'=\{r'<|z-a|<R'\}$. Therefore, by the Cauchy integral formula, adopting the counterclockwise orientation on both circles, $f(z)$ is a difference of two integrals, $f(z)=J_1-J_2$,
$f(z)=\frac{1}{2\pi i}\int_{|\zeta-a|=R'}\frac{f(\zeta)}{\zeta-z}\,d\zeta-\frac{1}{2\pi i}\int_{|\zeta-a|=r'}\frac{f(\zeta)}{\zeta-z}\,d\zeta.$
If $|\zeta-a|=R'$, then $|z-a|<|\zeta-a|$ and we may expand
$\frac{f(\zeta)}{\zeta-z}=\frac{f(\zeta)}{\zeta-a}\cdot\frac{1}{1-\frac{z-a}{\zeta-a}}=\sum_{n=0}^\infty\frac{f(\zeta)}{(\zeta-a)^{n+1}}(z-a)^n.$
Next, we can estimate
$\Bigl|\frac{f(\zeta)}{(\zeta-a)^{n+1}}(z-a)^n\Bigr|\le\frac{\max_{|\zeta-a|=R'}|f(\zeta)|}{R'}\Bigl(\frac{|z-a|}{R'}\Bigr)^n,$
and thus the series is uniformly convergent and admits term-by-term integration. Therefore
$J_1=\sum_{n=0}^\infty c_n(z-a)^n,\qquad c_n=\frac{1}{2\pi i}\int_{|\zeta-a|=R'}\frac{f(\zeta)}{(\zeta-a)^{n+1}}\,d\zeta.$
If $|\zeta-a|=r'$, then $|\zeta-a|<|z-a|$ and, similarly to the above, the expansion
$\frac{f(\zeta)}{\zeta-z}=\frac{f(\zeta)}{z-a}\cdot\frac{1}{\frac{\zeta-a}{z-a}-1}=-\sum_{n=0}^\infty\frac{f(\zeta)}{(z-a)^{n+1}}(\zeta-a)^n$
leads (via term-by-term integration) to the equality
$-J_2=\sum_{n=-\infty}^{-1}c_n(z-a)^n,\qquad c_n=\frac{1}{2\pi i}\int_{|\zeta-a|=r'}\frac{f(\zeta)}{(\zeta-a)^{n+1}}\,d\zeta.$
Thus we have obtained the Laurent series representation $f(z)=\sum_{n=-\infty}^\infty c_n(z-a)^n$, as requested.

On the other hand, if we are given a Laurent series (1), then fixing an arbitrary $n\in\mathbb Z$, multiplying the formula by $(z-a)^{-(n+1)}$ and integrating over $|z-a|=\rho$, we obtain
$\int_{|z-a|=\rho}\frac{f(z)}{(z-a)^{n+1}}\,dz=2\pi i\,c_n.$
The circle $\{|z-a|=\rho\}$ with $r<\rho<R$ was chosen arbitrarily; in particular, we see that the integrals $\int_{|z-a|=\rho}\frac{f(z)}{(z-a)^{n+1}}\,dz$ cannot depend on $\rho$. □

9.4.19. Remarks on convergence.
Given a Laurent series, its regular part $\sum_{n=0}^\infty c_n(z-a)^n$ is a power series that converges absolutely, and uniformly on compact sets, in its disc of convergence $\{|z-a|<R\}$ with $\frac1R=\limsup_{n\to\infty}|c_n|^{1/n}$, see the Cauchy–Hadamard formula in the theorem in 9.4.2. The principal part $\sum_{n=-1}^{-\infty}c_n(z-a)^n$ becomes a power series $\sum_{n=1}^\infty c_{-n}w^n$ after the coordinate change $w=\frac{1}{z-a}$, and this series converges for $|w|<\frac1r$ with $r=\limsup_{n\to\infty}|c_{-n}|^{1/n}$. Thus we have verified:

Proposition. For any set of coefficients $\{c_n,\ n\in\mathbb Z\}$, set
$\frac1R=\limsup_{n\to\infty}|c_n|^{1/n},\qquad r=\limsup_{n\to\infty}|c_{-n}|^{1/n}.$
Then the Laurent series $f(z)=\sum_{n=-\infty}^\infty c_n(z-a)^n$ converges absolutely and uniformly on any compact set in the annulus $A=\{r<|z-a|<R\}$, and it is analytic in $A$. If $|z-a|<r$, then the principal part $\sum_{n=-1}^{-\infty}c_n(z-a)^n$ diverges, while the regular part $\sum_{n=0}^\infty c_n(z-a)^n$ diverges for $|z-a|>R$.

9.4.20. Link to Fourier series. There is a very interesting link between Laurent and Fourier series. If $f$ is analytic in $A=\{1-\rho<|z|<1+\rho\}$ for some $\rho>0$, then its $n$-th Laurent coefficient is
$c_n=\frac{1}{2\pi i}\int_{|z|=1}\frac{f(z)}{z^{n+1}}\,dz=\frac{1}{2\pi}\int_0^{2\pi}f(e^{it})\,e^{-int}\,dt.$
Therefore $c_n$ is the $n$-th Fourier coefficient of $\varphi(t)=f(e^{it})$ on $t\in[0,2\pi]$, and the Fourier series of $f(e^{it})$ converges uniformly to $f(e^{it})$ on $[0,2\pi]$.

9.4.21. Liouville theorem. The formula 9.4.18(2) for the Laurent coefficients yields the following Cauchy inequalities:
(1) $|c_n|=\Bigl|\frac{1}{2\pi i}\int_{|z-a|=\rho}\frac{f(z)}{(z-a)^{n+1}}\,dz\Bigr|\le\frac{\max_{|z-a|=\rho}|f(z)|}{\rho^n}$
for all $n\in\mathbb Z$. As a straightforward consequence we obtain the following

Liouville theorem

Theorem. If $f(z)$ is analytic in $\mathbb C$ and bounded, i.e. $|f(z)|\le M$ for some constant $M$ and all $z\in\mathbb C$, then $f(z)$ is constant.

Proof. The Cauchy inequalities applied to $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ yield $|c_n|\le\frac{M}{R^n}$, $n>0$, for any $R>0$. Thus $c_n=0$ for all $n\ge1$, and consequently $f(z)=c_0$. □

9.4.22. Isolated singularities. We look at typical examples of analytic functions around "suspicious" points. Consider the fraction $f(z)=\frac{\sin z}{z}$. The origin is a zero of both $\sin z$ and $z$, and since they behave very similarly for small $z$, we see $\lim_{z\to0}f(z)=1$.

On the other hand, $f(z)=\frac1z$ grows towards infinity, $\lim_{z\to0}\frac1z=\infty$, in the sense of the extended complex plane $\overline{\mathbb C}=\mathbb C\cup\{\infty\}$ (also called the Riemann sphere). We can imagine $\overline{\mathbb C}$ as the sphere with the stereographic projection onto the plane $\mathbb C$, see the picture. [Picture of the Riemann sphere $\overline{\mathbb C}$ missing.] Then clearly $\lim_{z\to a}f(z)=\infty$ if and only if $\lim_{z\to a}|f(z)|=\infty$ in the sense of standard analysis in real variables. It might easily happen that the limit does not exist at all, see the theorem below. For example, take $f(z)=e^{1/z}$ around the point $z=0$. It is given by a Laurent series with infinite principal part,
$f(z)=\sum_{n=0}^\infty\frac{1}{n!}z^{-n}.$
In general, we talk about isolated singular points:

Isolated singularities

If $f(z)$ is analytic in a punctured neighbourhood $V=\{0<|z-a|<\rho\}$, then $a$ is called an isolated singular point of $f(z)$. We say that the singular point is
• removable, if there is a finite limit $\lim_{z\to a}f(z)=b\in\mathbb C$;
• a pole, if $\lim_{z\to a}f(z)=\infty$;
• an essential singularity, if $\lim_{z\to a}f(z)$ does not exist in $\overline{\mathbb C}$.
A function $f(z)$ with only isolated singularities in a domain $\Omega\subset\mathbb C$, and without any essential singularities, is called a meromorphic function in $\Omega$.
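The coefficient formula 9.4.18(2), read through the Fourier link of 9.4.20, invites a quick numerical experiment. The following Python sketch (our illustration, not part of the text; names are ours) approximates the Laurent coefficients of $e^{1/z}$, whose infinite principal part was just noted, by a Riemann sum over the unit circle; the expected values are $c_{-n}=\frac{1}{n!}$ for $n\ge0$ and $c_n=0$ for $n>0$.

\begin{verbatim}
# Sketch: Laurent coefficients as Fourier coefficients of f(e^{it}).
import cmath, math

def laurent_coeff(f, n, m=4096):
    # c_n ~ (1/2pi) * sum over a uniform grid of f(e^{it}) e^{-int}
    s = sum(f(cmath.exp(1j * t)) * cmath.exp(-1j * n * t)
            for t in (2 * math.pi * k / m for k in range(m)))
    return s / m

f = lambda z: cmath.exp(1 / z)
for n in (-3, -2, -1, 0, 1, 2):
    print(n, f"{laurent_coeff(f, n):.6f}")
# expected: 1/3! = 0.1667, 1/2! = 0.5, 1, 1, then zeros for n > 0
\end{verbatim}

Because the integrand is analytic on the unit circle, the uniform-grid sum converges extremely fast, so even modest values of m recover the coefficients to many digits.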
The function $f(z)=\tan\frac1z$ provides an example of a non-isolated singularity at $a=0$, as $0$ is the limit of the poles $\bigl(\frac\pi2+n\pi\bigr)^{-1}$, $n\in\mathbb Z$, of $f(z)$. On the other hand, all rational functions $f(z)/g(z)$ are meromorphic in $\mathbb C$.

The following theorem classifies isolated singularities and poles in terms of Laurent series.

Theorem. The following properties are equivalent:
• the point $z=a$ is a removable singularity of $f(z)$;
• $|f(z)|$ is bounded in some punctured neighbourhood $V=\{0<|z-a|<\rho\}$;
• the Laurent series of $f(z)$ in $V=\{0<|z-a|<\rho\}$ is a Taylor series $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$, i.e. the principal part vanishes;
• $f(a)$ can be defined so that $f(z)$ becomes analytic in $\{|z-a|<\rho\}$.
Further, the point $z=a$ is a pole of $f(z)$ if and only if the principal part of the Laurent series of $f(z)$ in $\{0<|z-a|<\rho\}$ contains only finitely many terms, i.e. $f(z)=\sum_{n=-N}^\infty c_n(z-a)^n$ for some $N\in\mathbb N$ (the smallest $N$ with this property is called the order of the pole at $z=a$).
Finally, the Laurent series $f(z)=\sum_{n=-\infty}^\infty c_n(z-a)^n$ in a punctured neighbourhood of $a$ contains infinitely many terms with non-zero coefficients $c_n$, $n<0$, if and only if $z=a$ is an essential singularity of $f(z)$.

Proof. If $|f(z)|\le M$ for $0<|z-a|<\rho$, then by the Cauchy inequalities 9.4.21(1), $|c_{-n}|\le M\varepsilon^n$, $n>0$, for all $0<\varepsilon<\rho$. Therefore all coefficients with negative indices vanish, and
$f(z)=\sum_{n=0}^\infty c_n(z-a)^n.$
Defining $f(a)=c_0$, we obtain a power series that converges in the entire disc $\{|z-a|<\rho\}$. This implies the equivalence of the four conditions in the first part of the theorem.

By the definition of a pole, $f(z)\ne0$ in some punctured disc $D=\{0<|z-a|<\rho'<\rho\}$ around $a$, since $\lim_{z\to a}f(z)=\infty$. Therefore $g(z)=\frac{1}{f(z)}$ is also analytic in $D$ and $\lim_{z\to a}g(z)=0$. Hence $g(z)$ becomes analytic in $D\cup\{a\}$ after setting $g(a)=0$, and therefore $g(z)=(z-a)^Nh(z)$ for some integer $N$ and an analytic function $h(z)$ with $h(a)\ne0$. Thus $\frac{1}{h(z)}$ is also analytic on a neighborhood of $a$, and therefore
$f(z)=\frac{1}{(z-a)^N}\sum_{n=0}^\infty c_n(z-a)^n.$
Conversely, if $f(z)=\frac{1}{(z-a)^N}h(z)$, where $h(a)\ne0$, then $\lim_{z\to a}f(z)=\infty$ and $a$ is a pole of $f$.

Finally, an isolated singularity of $f(z)$ is neither removable nor a pole if and only if the principal part of its Laurent series is infinite, and this observation finishes the proof. □

9.4.23. Some consequences. There are several straightforward corollaries of our classification of isolated singularities. In particular, if $\lim_{z\to a}f(z)$ does not exist, then $f(z)$ has really chaotic behaviour:

Theorem. If $a\in\mathbb C$ is an essential singularity of $f(z)$, then for any $w\in\overline{\mathbb C}$ there is a sequence $z_n\to a$ such that $\lim_{n\to\infty}f(z_n)=w$.

Proof. Let $w=\infty$. Since the singularity $z=a$ is not removable, $f(z)$ cannot be bounded in any punctured neighbourhood of $a$. So there exists a sequence $z_n\to a$ such that $\lim_{n\to\infty}f(z_n)=\infty$.

For $w\in\mathbb C$, if every punctured neighbourhood of $a$ contains a point $z$ such that $f(z)=w$, then collecting such points we obtain a sequence $z_n\to a$ with $f(z_n)=w$, as required. If there is a punctured neighbourhood of $a$ where $f(z)\ne w$, then $g(z)=\frac{1}{f(z)-w}$ also has an isolated singularity at $z=a$, which can be neither a pole nor removable, as otherwise $f(z)=w+\frac{1}{g(z)}$ would have a limit as $z\to a$. Therefore $z=a$ is an essential singularity of $g(z)$, and thus there is a sequence $z_n\to a$ such that $\lim_{n\to\infty}g(z_n)=\infty$, which implies that $\lim_{n\to\infty}f(z_n)=w$.
□

We say that $\infty\in\overline{\mathbb C}$ is an isolated singularity of $f(z)$ if $f(z)$ is analytic in $\{|z-a|>R\}$ for some $R>0$. The following are straightforward consequences of the Liouville theorem when $\infty$ is the only singularity of $f(z)$:

Corollary. If $f(z)$ is analytic in $\mathbb C$ and $z=\infty$ is a removable singularity of $f(z)$, then $f(z)$ is constant. If $f(z)$ is analytic in $\mathbb C$ and $z=\infty$ is a pole, then $f(z)$ is a polynomial, $f(z)=\sum_{j=0}^Nc_jz^j$.

Proof. The first claim is a simple reformulation of the Liouville theorem, cf. 9.4.21. To deal with the other claim, consider $g(w)=f(\frac1w)$. Then $w=0$ is a pole of $g(w)$. Let $P(w)=\sum_{j=1}^Nc_jw^{-j}$ be the principal part of the Laurent series of $g(w)$. Thus $h(w)=g(w)-P(w)$ is analytic in $\mathbb C$ with a removable singularity at $w=0$. Moreover, $\lim_{w\to\infty}h(w)=\lim_{w\to\infty}g(w)=f(0)$. Thus $|h(w)|$ is bounded, and by the Liouville theorem $h(w)=\mathrm{const}=f(0)=c_0$. Hence $f(z)=g(z^{-1})=\sum_{j=0}^Nc_jz^j$, which is a polynomial in $z$. □

9.4.24. Residues. Next we return to the Cauchy integral theorem, armed with our knowledge of isolated singularities. The residue of an analytic function $f(z)$ at an isolated singular point $a\in\mathbb C$ (with $f$ analytic in $\{0<|z-a|<\rho\}$) is defined as
$\operatorname{res}_af=\frac{1}{2\pi i}\int_{|z-a|=r}f(z)\,dz,$
where $0<r<\rho$. Obviously, the definition does not depend on the choice of $r$.

Residue theorem

Theorem. If $f(z)$ is represented by the Laurent series $\sum_{n=-\infty}^\infty c_n(z-a)^n$, then $\operatorname{res}_af=c_{-1}$. Further, consider a domain $D\subset\mathbb C$ with regular boundary and a function $f(z)$ analytic in $\overline D\setminus\{a_1,\ldots,a_n\}$, where $a_j\in D$, $j=1,\ldots,n$. Then
$\int_{\partial D}f(z)\,dz=2\pi i\sum_{j=1}^n\operatorname{res}_{a_j}f.$

Proof. Integrating the Laurent series $\sum_{n=-\infty}^\infty c_n(z-a)^n$ term by term and using the fact that $\int_{|z-a|=\rho}(z-a)^n\,dz=0$ unless $n=-1$, while $\int_{|z-a|=\rho}(z-a)^{-1}\,dz=2\pi i$, we obtain $\operatorname{res}_af=c_{-1}$.

Next, choose $\rho>0$ such that the open discs $D_j=\{|z-a_j|<\rho\}$, $j=1,\ldots,n$, have pairwise empty intersections and their closures $\overline{D_j}$ belong to $D$. Then the Cauchy integral theorem 9.4.15, applied to $D_\rho=D\setminus\bigcup_{j=1}^n\overline{D_j}$, yields
$0=\int_{\partial D_\rho}f(z)\,dz=\int_{\partial D}f(z)\,dz-\sum_{j=1}^n\int_{\partial D_j}f(z)\,dz=\int_{\partial D}f(z)\,dz-2\pi i\sum_{j=1}^n\operatorname{res}_{a_j}f.$ □

9.4.25. Residues at infinity. Recall that when integrating along the circle $|z-a|=R$, we always assume the counterclockwise orientation of the circle. Thus we use the minus sign in the definition: if $f(z)$ is analytic in the closure of the exterior of a disc, $\{|z|\ge R\}$, then
$\operatorname{res}_\infty f=-\frac{1}{2\pi i}\int_{|z|=R}f(z)\,dz.$
In terms of the Laurent series $f(z)=\sum_{n=-\infty}^\infty c_nz^n$, valid in $\{|z|\ge R\}$, we have $\operatorname{res}_\infty f=-c_{-1}$.

Note that if $f(z)$ is analytic in $\mathbb C\setminus\{a_1,\ldots,a_n\}$, then
$\operatorname{res}_\infty f+\sum_{j=1}^n\operatorname{res}_{a_j}f=0.$
Indeed, by taking a disc $\{|z|<R\}$ of sufficiently large radius that contains all the singularities and has none on its boundary, we conclude that
$\frac{1}{2\pi i}\int_{|z|=R}f(z)\,dz=\sum_{j=1}^n\operatorname{res}_{a_j}f.$

9.4.26. Examples of applications. Residues of analytic functions are used for the evaluation of improper integrals in real analysis. The following lemma turns out to be very useful for such purposes. We write $M(R)$ for the maximum of $|f(z)|$ over the upper half of the circle of radius $R$, i.e. $M(R)=\max_{|z|=R,\ \operatorname{Im}z\ge0}|f(z)|$.

Jordan's lemma

Lemma. Consider a function $f(z)$ continuous on $\{\operatorname{Im}z\ge0,\ |z|=R\}$. Then, for each positive real parameter $t$,
$\Bigl|\int_{|z|=R,\ \operatorname{Im}z\ge0}f(z)\,e^{itz}\,dz\Bigr|\le\frac\pi t\,M(R).$
Consequently, if $f(z)$ is continuous on $\{\operatorname{Im}z\ge0,\ |z|\ge R_0\}$ and $\lim_{R\to\infty}M(R)=0$, then
$\lim_{R\to\infty}\int_{|z|=R,\ \operatorname{Im}z\ge0}f(z)\,e^{itz}\,dz=0.$

Proof.
We estimate the integral from the lemma:
$\Bigl|\int_0^\pi f(R\,e^{i\theta})\,e^{-tR\sin\theta+itR\cos\theta}\,iR\,e^{i\theta}\,d\theta\Bigr|\le R\,M(R)\int_0^\pi e^{-tR\sin\theta}\,d\theta.$
To evaluate the latter integral, we observe that $\sin\theta\ge\frac2\pi\theta$ for $0\le\theta\le\frac\pi2$. Thus, using $t>0$ and the substitution $\tau=\frac{2Rt\theta}{\pi}$, we arrive at
$R\,M(R)\int_0^\pi e^{-tR\sin\theta}\,d\theta=2R\,M(R)\int_0^{\pi/2}e^{-tR\sin\theta}\,d\theta\le2R\,M(R)\int_0^{\pi/2}e^{-2tR\theta/\pi}\,d\theta=\frac\pi t\,M(R)\int_0^{Rt}e^{-\tau}\,d\tau=\frac\pi t\,M(R)\bigl(1-e^{-Rt}\bigr)\le\frac\pi t\,M(R).$
The consequence for $R\to\infty$ is obvious. □

Typically, Jordan's lemma is used to compute improper integrals of real-analytic (complex valued) functions $g(x)=f(x)\,e^{itx}$ along the entire real line (or rather the real or imaginary parts of such integrals). If the corresponding complex analytic function $f(z)$ has only a finite number of poles $a_k$ in the upper half-plane and $\lim_{R\to\infty}M(R)=0$, then we may compute the real integral
$\int_{-\infty}^\infty g(x)\,dx=\lim_{R\to\infty}\int_{\gamma_R}g(z)\,dz=2\pi i\sum_k\operatorname{res}_{a_k}g(z),$
where $\gamma_R$ is the path composed of the interval $[-R,R]$ and the upper half circle of radius $R$. See the diagram and the examples in the other column. [Picture of the half-circle contour missing.]

9.4.27. Concluding remarks. Of course, we have not touched on many important issues in this short introduction. These include the conformal property of all analytic functions (they preserve all angles of curves), and the richness of analytic functions expressed by the Riemann mapping theorem: any simply connected region $\Omega$ other than the whole plane can be mapped bijectively onto the open unit disc, with both the map and its inverse analytic. The proper setup for analytic extensions is that of Riemann surfaces, with their fascinating topological properties. Also, we only commented on the possibility of proving the Cauchy integral theorem for triangles assuming just the existence of the complex derivative; the analyticity of all holomorphic functions then follows from the Cauchy integral formula. Moreover, we have not mentioned functions of several complex variables at all! We hope that all of these interesting issues will challenge the readers to pursue further, more detailed study in the relevant literature.

G. Additional exercises to the whole chapter

9.G.1. Solution. □
9.G.2. Solution. □
9.G.3. Solution. □
9.G.4. Solution. □
9.G.5. Solution. □

Solutions of the exercises

9.B.6. The answer is $4\pi$.
9.B.7. The answer is $36\pi$.
9.B.8. The answer is $\frac{65\pi}{24}$.
9.D.2. The general solution has the form $u=\Phi(x^2-y^2)$. Moreover, $u_{(i)}=\sqrt{x^2-y^2}$, $x>y$, $u_{(ii)}=y^2-x^2$, and the condition (iii) makes no sense.
9.D.3. The general solution has the form $u=\Phi(x^2+y^2)$. Moreover, we see that $u_{(i)}=x^2+y^2$, $u_{(ii)}=\frac{1}{1+x^2+y^2}$, $u_{(iii)}=(x^2+y^2-1)^2$ (unique for $x^2+y^2>1$).
9.D.4. The answer has the form $u(x,y)=2\cos(y)\sin(x)-1$.
9.D.8. The general solution has the form $u=K(y^2-2x)\cdot e^y$.
9.D.9. The general solution has the form $u(x,y)=K(y-x)\cdot e^y$. Moreover, $u_{(i)}=(y-x)\cdot e^{x(y-x)}$, $u_{(ii)}=\frac{2}{x-y}\cdot e^{\frac{y^2-x^2}{2}}$, $u_{(iii)}=(y-x)^2\cdot e^{y(y-x)}$.
9.D.10. The solution is given by $u=\frac{y}{y\,C\bigl(\frac1y-x\bigr)-1}$, $u_p=\frac{y}{2xy+3y-3}$.
9.D.11. The solution has the form $u=x^2+y^2+C\bigl(\frac yx\bigr)$, $u_p=x^2+y^2+\frac{y^2}{x^2}$.
9.D.12. The solution is given by $u=\frac14x^2-\frac12y^2+C(y\sqrt x)$, $u_p=\frac14(x^2-2y^2-xy^2)$.
9.D.13. The solution is given by $u=y^2\,C\bigl(y\,e^{1/x}\bigr)$, $u_p=y^3\,e^{\frac1x-1}$.
9.D.15. The solution is given by $u(x,y)=\sqrt{x^2+y^2}$, $u(x,y)=2-\sqrt{x^2+y^2}$.
9.D.16.
The solution is given by $u(x,y)=\frac{x^2+y^2}{2}$, $u(x,y)=-\frac12\bigl(1-\sqrt{x^2+y^2}\bigr)^2$.
9.D.17. The solution is the function $u(x,y)=e^{\sqrt{x^2+y^2}-1}$.
9.D.18. The solution is the function $u(x,y)=\frac14\bigl(x^2+y^2+2\sqrt{x^2+y^2}+1\bigr)$.
9.D.19. The solution is the function $u(x,y)=-e^{1-\sqrt{x^2+y^2}}$.
9.D.20. The answer is given by $u(x,y)=x(1-y)$.
9.D.21. The answer is the function $u(x,y)=x+y$.
9.D.22. The answer is given as follows: $u(x,y)=(2-\sqrt x)^2$, $u=x$.
9.D.23. The answer is $u(x,y)=x^2-y^2$.
9.D.24. The answer is $u(x,y)=y$.
9.D.26. $u(x,y)=D\,e^{x^2y^2}$.
9.D.27. $u(x,y)=x+y+Dxy$.
9.D.30. We get $\alpha_0=1$, $\beta_0=0$, $\alpha_1=x$, $\beta_1=y$, $\alpha_2=x^2-y^2$, $\beta_2=2xy$, $\alpha_3=x^3-3xy^2$, $\beta_3=3x^2y-y^3$.
9.D.34. Parabolic equation, $\xi=y+\ln x$, $\eta=y$, $u_{\eta\eta}=0$, solution $u=yC(y+\ln x)+D(y+\ln x)$.
9.D.35. Hyperbolic equation, $\xi=xy$, $\eta=y$, $u_{\xi\eta}=0$, $u=F(y)+G(xy)$.
9.D.36. Elliptic equation, $\xi=x+2y$, $\eta=\sqrt7\,x$, $\Delta u=0$, $u(x,y)=C(x+2y+i\sqrt7\,x)+D(x+2y-i\sqrt7\,x)$.
9.D.37. Elliptic equation, $\xi=y^2$, $\eta=x^2$, $\Delta u=-\frac12\bigl(\frac{u_\eta}{\eta}+\frac{u_\xi}{\xi}\bigr)$.
9.D.38. Substitute the formula into the equation and the initial condition.
9.D.40. $u_a=\frac12\bigl[\sin(x-t)+\sin(x+t)+(x+t)\sin(x+t)+\cos(x+t)-(x-t)\sin(x-t)-\cos(x-t)\bigr]$,
$u_b=2x+\frac12\bigl[(x+t)\ln(1+(x+t)^2)-2(x+t)+2\arctan(x+t)-(x-t)\ln(1+(x-t)^2)+2(x-t)-2\arctan(x-t)\bigr]$,
$u_c=x+\ln\sqrt{\frac{x+t}{x-t}}+\sin x-\sin x\cos t$.
9.D.41. $u(t,x)=x^2+2x+t^2+2\sin x-2\sin x\cos t+\frac12\bigl[(x+t)\sin(x+t)-(x-t)\sin(x-t)+\cos(x+t)-\cos(x-t)\bigr]$.
9.D.42. $u(t,x)=2x^2-x+2t^2+2\cos x-2\cos x\cos t+\frac12\bigl[-(x+t)\cos(x+t)+(x-t)\cos(x-t)+\sin(x+t)-\sin(x-t)\bigr]$.
9.D.43. $u(t,x)=x^3+12xt^2+\frac14\ln\frac{x+2t+\sqrt{1+(x+2t)^2}}{x-2t+\sqrt{1+(x-2t)^2}}+\frac14t^2+\frac18\cos2x-\frac18\cos2x\cos4t$.
9.D.44. $u(t,x)=x^2+9t^2+\frac16\bigl[(x+3t)e^{x+3t}-(x-3t)e^{x-3t}+e^{x-3t}-e^{x+3t}\bigr]+\frac49\bigl[\cos x\cos3t-\cos x\bigr]$.
9.D.48. $u(x,y)=\sin y\cdot\Bigl(\frac{1}{1-e^{2\pi}}e^x+\frac{1}{1-e^{-2\pi}}e^{-x}\Bigr)$.
9.D.49. $u(t,x)=-\cos\frac{\pi x}{l}\cos\frac{\pi t}{l}+\frac{l}{2\pi}\cos\frac{2\pi x}{l}\sin\frac{2\pi t}{l}$.
9.D.50. $u(t,x)=-e^{-4t}\sin(2x)+e^{-16t}\sin(4x)$.
9.D.51. $u(t,x)=-3\sin(4x)\cos(4t)-\sin x\sin t+\sin(2x)\sin(2t)$.
9.D.52. $u(t,x)=\sum_{n=1}^\infty c_n\,e^{-\bigl[\frac{(2n-1)\pi}{2a}\bigr]^2t}\sin\Bigl(\frac{2n-1}{2a}\pi x\Bigr)$, $c_n=\frac2a\int_0^a x(2a-x)\sin\Bigl(\frac{2n-1}{2a}\pi x\Bigr)dx=\frac{32a^2}{\pi^3(2n-1)^3}$.
9.D.53. $u(t,x)=\frac{8A^2}{\pi^3}\sum_{n=1,3,5,\ldots}\frac{1}{n^3}\sin\frac{n\pi x}{A}\cos\frac{n\pi t}{A}$.
9.D.54. $u(x,y)=\sum_{n=1}^\infty\sin\frac{n\pi x}{a}\bigl(a_ne^{\frac{n\pi}{a}y}+b_ne^{-\frac{n\pi}{a}y}\bigr)$, with $a_n=\frac{1}{1-e^{2n\pi}}\cdot\frac2a\int_0^a x(a-x)\sin\frac{n\pi x}{a}\,dx$, $b_n=\frac{1}{1-e^{-2n\pi}}\cdot\frac2a\int_0^a x(a-x)\sin\frac{n\pi x}{a}\,dx$; explicitly,
$u(x,y)=\sum_{n=0}^\infty\sin\frac{(2n+1)\pi x}{a}\Bigl(\frac{8a^2}{(2n+1)^3\pi^3\bigl(1-e^{2(2n+1)\pi}\bigr)}e^{\frac{(2n+1)\pi}{a}y}+\frac{8a^2}{(2n+1)^3\pi^3\bigl(1-e^{-2(2n+1)\pi}\bigr)}e^{-\frac{(2n+1)\pi}{a}y}\Bigr)$.
9.D.55. $u(t,x)=-\sum_{n=0}^\infty\frac{8}{(2n+1)^2\pi^2}\cos\Bigl(\frac{2n+1}{2}\pi x\Bigr)e^{-(2n+1)^2\pi^2t}$.
9.D.58. $u(x,y)=\frac{xy}{A^2}$.
9.D.59. $u(x,y)=\frac52+\frac{A^2}{2}\cdot\frac{x^2-y^2}{(x^2+y^2)^2}$.
9.D.60. $u(x,y)=\frac12+\frac{x^2-y^2}{2}+y$.
9.D.62. $u(t,x)=\operatorname{erf}\Bigl(\frac{x}{2\sqrt t}\Bigr)-\operatorname{erf}\Bigl(\frac{x-1}{2\sqrt t}\Bigr)$.

Roughly speaking, statistics is any processing of numerical or other data about a population of objects, and their presentation. In this context, we talk about descriptive statistics. Its objective is thus to process and comprehensibly represent data about the objects of a given "population" — for instance, the annual income of all citizens obtained from the complete data of the revenue authorities, or the quality of hotel accommodation in some region.
In order to achieve this, we focus on simple numerical characterizations and visualizations of the data. [In general, many pictures are missing!]

Mathematical statistics uses mathematical methods to derive conclusions valid for the whole (potentially infinite) population of objects, based on a "small" sample. For instance, we might want to find out how widespread a certain disease is in the population by collecting data about a few randomly chosen people, but we interpret the results with regard to the entire population. In other words, mathematical statistics draws conclusions about a large population of objects from the study of a small (usually randomly selected) sample collection. It also estimates the reliability of the resulting conclusions.

Mathematical statistics is based on the tools of probability theory, which is very useful (and amazing) in itself. Therefore, probability theory is discussed first. This chapter provides an elementary introduction to the methods of probability theory, which should be sufficient for the correct comprehension of the ordinary statistical information all around us. However, for a serious understanding of a mathematical statistician's work, one must look for other resources.

1. Descriptive statistics

Descriptive statistics alone is not a mathematical discipline, although it involves many manipulations with numbers and sometimes even very sophisticated methods. However, it is a good opportunity for illustrating the mathematical approach to building generally useful tools. At the same time, it should serve as a motivation for studying probability theory, in view of the later applications in statistics.

CHAPTER 10

Statistics and probability methods

Is statistics a part of mathematics? — whenever it is so, we need much of mathematics there...!

A. Dots, lines, rectangles

The data obtained from reality can be displayed in many ways. Let us illustrate some of them.

10.A.1. Presenting the collected data. Twenty mathematicians were asked about the number of members of their household. The following table displays the frequency of each number of members.

Number of members: 1 2 3 4 5 6
Number of households: 5 5 1 6 2 1

Create the frequency distribution table. Find the mean, median and mode of the number of members. Build a column diagram of the data.

Solution. Let us begin with the frequency distribution table. There, we write not only the frequencies, but also the cumulative frequencies and relative frequencies (i.e., the probability that a randomly picked household has the given number of members). Let us denote the number of members by $x_i$, the corresponding frequency by $n_i$, the relative frequency by $p_i$ ($=n_i/\sum_{j=1}^6n_j=n_i/20$), the cumulative frequency by $N_i$ ($=\sum_{j=1}^in_j$), and the relative cumulative frequency by $F_i$

In our brief introduction, we first introduce the concepts allowing us to measure the positions of the data values and the variability of the data values (means, percentiles, etc.). We touch on the problem of how to visualize or otherwise present data sets (diagrams). Then we deal with the potential relations between several data sets (covariance and principal components) and, finally, we deal with data without numerical values, relying just on their frequencies of appearance (entropy).

10.1.1. Probability, or statistics? It is not by accident that we return to a part of the motivating hints from the first chapter as soon as we have managed to gather enough mathematical tools, both discrete and continuous.
Nowadays, many communications are of a statistical nature, be it in media, politics, or science. Nevertheless, in order to properly understand the meaning of such a communication and to use particular statistical methods and concepts, one must have a broad knowledge of miscellaneous parts of mathematics. In this subsection, we move away from the mathematical theory and think about the following steps and our objectives. As an example of a population of objects, consider the students of a given basic course. Then, the examined numerical data can be:
• the "mean number of points" obtained during the course in the previous semester and the "variance" of these values,
• the "mean marks" for the examination of this and other courses and the "correlation" (i.e. mutual dependence) of these results,
• the "correlation" of data about the past results of given students,
• the "correlation" of the number of failed exams of a given student and the number of hours spent in a temporary job,
• ...
With regard to the first item, the arithmetic mean itself does not carry enough information about the quality of the lecture or of the lecturer, nor about the results of particular students. Maybe the value which is "in the middle" of the population, or the number of points achieved by the student who was just better than half of the students, is of more concern. Similarly, the first quarter, the last quarter, the first tenth, etc. may be of interest. Such data are called statistics of the population. Such statistics are interesting for the students in question as well, and it is quite easy to define, compute, and communicate them. From general experience or as a theoretical result outside mathematics, a reasonable assessment should be "normally" distributed. This is a concept of probability theory, and it requires quite advanced mathematics to be properly defined. Comparing the collected data about even a small random population of students to theoretical results can serve in two ways: We can estimate the parameters of the distribution as well as draw a conclusion whether the assessment is reasonable.

($= N_i/20 = \sum_{j=1}^{i} p_j$):

x_i  n_i  p_i   N_i  F_i
1    5    1/4   5    1/4
2    5    1/4   10   1/2
3    1    1/20  11   11/20
4    6    3/10  17   17/20
5    2    1/10  19   19/20
6    1    1/20  20   1

Now, we can easily construct the wanted (column) graphs of (relative, cumulative) frequencies. The mean number of members of a household is
$\bar{x} = \frac{5\cdot 1 + 5\cdot 2 + 1\cdot 3 + 6\cdot 4 + 2\cdot 5 + 1\cdot 6}{20} = 2.9.$
The median is the arithmetic mean of the tenth and eleventh values (having been sorted), which are respectively 2 and 3, i.e., $\tilde{x} = 2.5$. The mode is the most frequent value, i.e., $\hat{x} = 4$. The collected data can also be presented using a box plot: The upper and lower sides of the "box" correspond respectively to the first (lower) and the third (upper) quartile, so its height is equal to the interquartile range. The thick horizontal line is drawn at the median level; the lower and upper horizontal lines correspond respectively to the minimum and maximum elements of the data set, or to the value that is 1.5 times the interquartile range less than the lower side of the box (and greater than the upper side, respectively). The data outside this range would be shown as circles. We can also build the histogram of the data.

At the same time, the numerical values of statistics for a given population can yield a qualitative description of the likelihood of our conclusions.
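The mean, median, and mode just computed in 10.A.1 can be checked with Python's standard library. A minimal sketch of ours, expanding the frequency table back into the raw data:

    from statistics import mean, median, mode

    members = [1, 2, 3, 4, 5, 6]
    counts  = [5, 5, 1, 6, 2, 1]
    data = [x for x, ni in zip(members, counts) for _ in range(ni)]

    print(mean(data))    # 2.9
    print(median(data))  # 2.5, the average of the 10th and 11th sorted values
    print(mode(data))    # 4, the most frequent value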
We can compute statistics which reflect the variability of the examined values, rather than where these values are positioned within a given population. For instance, if the assessment does not show enough variability, it may be concluded that it is badly designed, because the students' skills are of course different. The same applies if the collected data seem completely random. In the above paragraph, it is assumed that the examined data is reliable. This is not always the case in practice. On the contrary, the data is often perturbed with errors due to the construction of the experiment and the data collection itself. In many cases, not much is known about the type of the data distribution. Then, methods of non-parametric statistics are often used (to be mentioned at the end of this chapter). Very interesting conclusions can be found if we compare the statistics for different quantities and then derive information about their relations. For example, if there is no evident relation between the history of previous studies and the results in a given course, then it may be that the course is managed wrongly. These ideas can be summarized as follows:
• In descriptive statistics, there are tools which allow the understanding of the structure and nature of even a huge collection of data;
• in mathematics, one works with an abstract mathematical description of probability, which can be used for the analysis of given data, especially when there is a theoretical model to which the data should correspond;
• conclusions of statistical investigation of samples of particular data sets can be given by mathematical statistics;
• mathematical statistics can also estimate how adequate such a description is for a given data set.

10.1.2. Terminology. Statisticians have introduced a great many concepts which need mastering. The fundamental concept is that of a statistical population, which is an exactly defined set of basic statistical units. These can be given by enumeration or, in the case of a larger population, by some rules. On every statistical unit, statistical data is measured, with the "measurement" perceived very broadly. For instance, the population can consist of all students of a given university. Then, each of the students is a statistical unit and much data can be gathered about these units – the numerical values obtainable from the information system, what is their favorite colour, what they had for dinner before their last test, etc. The basic object for examining particular pieces of data is a data set. It usually consists of ordered values. The ordering can be either natural (when the data values are real numbers, for example) or we can define it (for instance, when we observe colours, we can express them in the RGB format and order them with respect to this sign).

Note that the frequencies of one- and two-member households were merged into a single rectangle. This is done in order to make the data "easier to read" – there exist various (and ambiguous) rules for the merging. We simply mention this fact without presenting an exact procedure (the choice is largely a matter of taste). □

10.A.2. Given a data set $x = (x_1, x_2, \dots, x_n)$, find the mean and variance of the centered values $x_i - \bar{x}$ and the standardized values $\frac{x_i-\bar{x}}{s_x}$.

Solution. The mean of the centered values can be found directly using the definition of the arithmetic mean:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}) = \frac{1}{n}\sum_{i=1}^{n} x_i - \frac{\bar{x}}{n}\sum_{i=1}^{n} 1 = \bar{x} - \bar{x} = 0.$
The variance of the centered values is clearly the same as for the original ones ($s_x^2$).
For the standardized values, the mean is equal to zero again, and the variance is
$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)^2 = \frac{1}{s_x^2}\cdot\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 1.$ □

10.A.3. Prove that the variance satisfies $s_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$.

Solution. Using the definitions of variance and arithmetic mean, we get:
$s_x^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \frac{2\bar{x}}{n}\sum_{i=1}^{n} x_i + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2.$ □

We can also work with unordered values. Since statistical description aims at telling comprehensible information about the entire population, we should be able to compare and take ratios of the data values. Therefore, we need to have a measurement scale at our disposal. In most cases, the data values are expressed as numbers. However, the meaning of the data can be quantified variously, and thus we distinguish between the following types of data measurement scales.

Types of data measurement scales

The data values are called:
• nominal if there is no relation between particular values; they are just qualitative names, i.e. possible values (for instance, political parties or lecturers at a university when surveying how popular they are);
• ordinal the same as above, but with an ordering (for example, number of stars for hotels in guidebooks);
• interval if the values are numbers which serve for comparisons but do not correspond to any absolute value (for example, when expressing temperature in Celsius or Fahrenheit degrees, the position of zero is only conventional);
• ratio if the scale and the position of zero are fixed (most physical and economical quantities).

With nominal types, we can interpret only equalities $x_1 = x_2$; with ordinal types, we can also interpret inequalities $x_1 < x_2$ (or $x_1 > x_2$); with interval types, we can also interpret differences $x_1 - x_2$. Finally, with ratio types, we have also ratios $x_1/x_2$ available.

10.1.3. Data sorting. In this subsection, we work with a data set $x_1, x_2, \dots, x_n$, which can be ordered (thus, their type is not nominal) and which has been obtained through measurement on $n$ statistical units. These values are sorted in a sorted data set
(1) $x_{(1)}, x_{(2)}, \dots, x_{(n)}.$
The integer $n$ is called the size of the data set. When working with large data sets where only a few values occur, the simplest way to represent the data set is to enumerate the values' frequencies. For instance, when surveying the political party preference or when presenting the quality of a hotel, write only the number of occurrences of each value. If there are many possible values (or there can even be continuously distributed real values), divide them into a suitable number of intervals and then observe the frequencies in the given intervals. The intervals are also called classes and the frequencies are called class frequencies. We also use cumulative frequencies and cumulative class frequencies which correspond to the sum of frequencies of values not exceeding a given one.

10.A.4. The following values have been collected: 10; 7; 7; 8; 8; 9; 10; 9; 4; 9; 10; 9; 11; 9; 7; 8; 3; 9; 8; 7. Find the arithmetic mean, median, quartiles, variance, and the corresponding box diagram.

Solution. Denoting the individual values by $a_i$ and their frequencies by $n_i$, we can arrange the given data set into the following table.

a_i  3  4  7  8  9  10  11
n_i  1  1  4  4  6  3   1

From the definition of arithmetic mean, we have
$\bar{x} = \frac{3 + 4 + 4\cdot 7 + 4\cdot 8 + 6\cdot 9 + 3\cdot 10 + 11}{1 + 1 + 4 + 4 + 6 + 3 + 1} = \frac{162}{20} = 8.1.$
Since the tenth least collected value is $x_{(10)} = 8$ and the eleventh one is $x_{(11)} = 9$, the median is equal to $\tilde{x} = \frac{8+9}{2} = 8.5$. The first quartile is $x_{0.25} = \frac{x_{(5)}+x_{(6)}}{2} = 7$, and the third quartile is $x_{0.75} = \frac{x_{(15)}+x_{(16)}}{2} = 9$. From the definition of variance, we get
$s_x^2 = \frac{5.1^2 + 4.1^2 + 4\cdot 1.1^2 + 4\cdot 0.1^2 + 6\cdot 0.9^2 + 3\cdot 1.9^2 + 2.9^2}{1 + 1 + 4 + 4 + 6 + 3 + 1} = 3.59.$
The histogram and box diagram are shown in the following pictures, where we have used a "statistical" method to make the histogram "nice" and "clear". You can find a lot of these conventions in the books on statistics, but if you do not know them, you may be lost. This is the default setting of the R program. For example, if you replace just the value 3 by 2, you get a quite different looking histogram:

Most often, the mean $a_i$ of a given class is considered to be its representative, and the value $a_i n_i$ (where $n_i$ is the frequency of the class) is the total contribution of the class. Relative frequencies $n_i/n$, and relative cumulative frequencies, can also be considered. A graph which has the intervals of particular classes on one axis and rectangles above them with height corresponding to the frequency is called a histogram. Cumulative frequency is represented similarly. The following diagram shows histograms of data sets of size $n = 500$ which were randomly generated with various standard distributions (called normal, $\chi^2$, respectively).

10.1.4. Measures of the position of statistical values. If the magnitude of values around which the collected data values gather is to be expressed, then the concepts of the definition below can be used. There, we work with ratio or interval types of scales. Consider an (unsorted) data set $(x_1, \dots, x_n)$ of the values for all examined statistical units and let $n_1, \dots, n_m$ be the class frequencies of $m$ distinct values $a_1, \dots, a_m$ that occur in this set.

Means

Definition. The arithmetic mean (often only mean) is given as
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum_{j=1}^{m} n_j a_j.$
The geometric mean is given as $\bar{x}_G = \sqrt[n]{x_1 x_2\cdots x_n}$ and makes sense for positive values $x_i$ only. The harmonic mean is given as
$\bar{x}_H = \left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}\right)^{-1}$
and is also used for positive values $x_i$ only.

The arithmetic mean is the only one of the three above which is invariant with respect to affine transformations. For all scalars $a$, $b$,
$\overline{(a + b\cdot x)} = \frac{1}{n}\sum_{i=1}^{n}(a + bx_i) = a + \frac{b}{n}\sum_{i=1}^{n} x_i = a + b\cdot\bar{x}.$

□ 10.A.5. 425 carps were fished, and each one was weighed. Then, mass intervals were set, resulting in the following frequency distribution table:

Weight (kg):     0–1  1–2  2–3  3–4  4–5  5–6  6–7
Class midpoint:  0.5  1.5  2.5  3.5  4.5  5.5  6.5
Frequency:       75   90   97   63   48   42   10

Draw a histogram, find the arithmetic, geometric, and harmonic means of the carps' weights. Furthermore, find the median, quartiles, mode, variance, standard deviation, coefficient of variation, and draw a box plot.

Solution. The histogram looks as follows: From the definitions of the corresponding concepts in subsection 10.1.4, we can directly compute that the arithmetic mean is $\bar{x} = 2.7$ kg, the geometric mean is $\bar{x}_G = 2.1$ kg, and the harmonic mean is $\bar{x}_H = 1.5$ kg. By the definitions of subsection 10.1.5, the median is equal to $\tilde{x} = x_{0.5} = 2.5$

Therefore, the arithmetic mean is especially suitable for interval types. The logarithm of the geometric mean is the arithmetic mean of the logarithms of the values. It is especially suitable for those quantities which cumulate multiplicatively, e.g.
interests. If the interest rate for each time period is $x_i\,\%$, then the final result is the same as if the interest rate had the constant value of $\bar{x}_G\,\%$. See 10.A.9 for an example where the harmonic mean is appropriate. In subsection 8.1.30 (page 746), we use the methods invented there to prove that the geometric mean never exceeds the arithmetic mean. The harmonic mean never exceeds the geometric mean, and so $\bar{x}_H \le \bar{x}_G \le \bar{x}$.

10.1.5. Median, quartile, decile, percentile, ... Another way of expressing the position or distribution of the values is to find, for a number $\alpha$ between zero and one, such a value $x_\alpha$ that $100\alpha\,\%$ of the values from the set are at most $x_\alpha$ and the remaining ones are greater than $x_\alpha$. If such a value is not unique, one can choose the mean of the two nearest possibilities. The number $x_\alpha$ is called the $\alpha$-quantile. Thus, if the result of a contestant puts him into $x_{1.00}$, it does not mean that he is better than anyone else yet. However, there is surely no one better than him. The most common values of $x_\alpha$ are the following:
• The median (also sample median) is defined by
$\tilde{x} = x_{0.50} = \begin{cases} x_{((n+1)/2)} & \text{for odd } n, \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{for even } n, \end{cases}$
where $x_{(k)}$ corresponds to the value in the sorted data set 10.1.3(1).
• The first and third quartile are $Q_1 = x_{0.25}$ and $Q_3 = x_{0.75}$, respectively.
• The $p$-th quantile (also sample quantile or percentile) $x_p$, where $0 < p < 1$ (usually rounded to two decimal places).
One can also meet the mode, which is the value $\hat{x}$ that is most frequent in the data set $x$. The arithmetic mean, median (with ratio types), and mode (with ordinal or nominal types) correspond to the "anticipated" values. Note that all $\alpha$-quantiles with interval scales are invariant with respect to affine transformations of the values (check this yourselves!).

10.1.6. Measures of the variability. Surely any measure of the variability of a data set $x \in \mathbb{R}^n$ should be invariant with respect to constant translations. In the Euclidean space $\mathbb{R}^n$, both the standard distance and the sample mean have this property. Therefore, choose the following:

kg, the lower quartile to $x_{0.25} = 1.5$ kg, the upper quartile to $x_{0.75} = 3.5$ kg, and the mode is $\hat{x} = 2.5$ kg. From the definitions of subsection 10.1.6, we compute the variance of the weights, which is $s_x^2 = 2.7$ kg², whence it follows that the standard deviation is $s_x = 1.7$ kg, and the coefficient of variation is $V_x = 0.6$. □

10.A.6. Prove that the entropy is maximal if the nominal values are distributed uniformly, i.e., the frequency of each class is $n_i = 1$.

Solution. By the definition of entropy (see 10.1.11), we are looking for the maximum of the function $H_X = -\sum_{i=1}^{n} p_i\ln p_i$ with respect to the unknown relative frequencies $p_i = \frac{n_i}{n}$, which satisfy $\sum_{i=1}^{n} p_i = 1$. Therefore, this is a typical example of finding constrained extrema, which can be solved using Lagrange multipliers. The corresponding Lagrange function is
$L(p_1, \dots, p_n, \lambda) = -\sum_{i=1}^{n} p_i\ln p_i + \lambda\left(\sum_{i=1}^{n} p_i - 1\right).$
The partial derivatives are $\frac{\partial L}{\partial p_i} = -\ln p_i - 1 + \lambda$, hence the stationary point is determined by the equations $p_i = e^{\lambda-1}$ for all $i = 1, \dots, n$. Moreover, we know that the sum of the relative frequencies $p_i$ is equal to one. This means that $n e^{\lambda-1} = 1$, whence we get $\lambda = 1 - \ln n$. Substitution then yields $p_i = \frac{1}{n}$. □

10.A.7. The following graphs depict the frequencies of particular amounts of points obtained by students of the MB104 lecture at the Faculty of Informatics of Masaryk University in 2012.
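Before examining those graphs, here is a quick numerical check of the conclusion of 10.A.6 – among all vectors of relative frequencies, the uniform one maximizes the entropy, with maximum value $\ln n$. This is a sketch of ours with an arbitrarily chosen non-uniform comparison vector:

    import numpy as np

    def entropy(p):
        # H = -sum p_i ln p_i for a vector of positive relative frequencies
        p = np.asarray(p, dtype=float)
        return float(-(p * np.log(p)).sum())

    k = 5
    uniform = np.full(k, 1 / k)
    skewed = np.array([0.5, 0.2, 0.15, 0.1, 0.05])

    print(entropy(uniform), np.log(k))  # both approx. 1.609
    print(entropy(skewed))              # approx. 1.33, strictly smaller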
The axes of the cumulative graph are "swapped", as opposed to the previous example. The frequencies of particular amounts of points are enumerated in the following table:

Variance and standard deviation

Definition. The variance of a data set $x$ is defined by
$s_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$
The standard deviation $s_x$ is defined to be the square root of the variance.

As requested, the variability of statistical values is independent of constant translation of all values. Indeed, the unsorted data set $y = (x_1 + c, x_2 + c, \dots, x_n + c)$ has the same variance, $s_y = s_x$. Sometimes, the sample variance is used, where there is $(n-1)$ in the denominator instead of $n$. The reason will be clear later, cf. 10.3.2. In case of class frequencies $n_j$ of values $a_j$ for $m$ classes, this expression leads to the value
$s_x^2 = \frac{1}{n}\sum_{j=1}^{m} n_j(a_j - \bar{x})^2$
of the variance. In practice, it is recommended to use Sheppard's correction, which decreases $s_x^2$ by $h^2/12$, where $h$ is the width of the intervals that define the classes. Further, one can encounter the data-set range $R = x_{(n)} - x_{(1)}$ and the interquartile range $Q = Q_3 - Q_1$. The mean deviation is defined as the mean distance of the values from the median:
$D_x = \frac{1}{n}\sum_{i=1}^{n}|x_i - \tilde{x}|.$
The following theorem clarifies why these measures of variability are chosen:

Theorem. The function $S(t) = \frac{1}{n}\sum_{i=1}^{n}(x_i - t)^2$ has the minimum value at $t = \bar{x}$, i.e., at the sample mean. The function $D(t) = \frac{1}{n}\sum_{i=1}^{n}|x_i - t|$ has the minimum value at $t = \tilde{x}$, i.e., the median.

Proof. The minimum of the quadratic polynomial $f(t) = \sum_{i=1}^{n}(x_i - t)^2$ is at the only root of its derivative:
$f'(t) = -2\sum_{i=1}^{n}(x_i - t).$
Since the sum of the distances of all values from the sample mean is zero, $t = \bar{x}$ is the requested root, and the first proposition is proved. As for the second proposition, return to the definition of the median. For this purpose, rearrange the sum so that the first and the last summand are added, then the second and the last-but-one summand, etc. In the first case, this leads to the expression $|x_{(1)} - t| + |x_{(n)} - t|$, and this is equal to the distance $x_{(n)} - x_{(1)}$ provided $t$ lies inside the range, and it is even greater otherwise. Similarly, the other pair in the sum gives $x_{(n-1)} - x_{(2)}$ if $x_{(2)} \le t \le x_{(n-1)}$, and it is greater otherwise. Therefore, the minimality assumption leads to $t = \tilde{x}$. □

# of points:   20.5  20  19  18.5  18  17.5  17  16.5  16  15.5  15  14.5  14  13.5  13  12.5  12  11.5  11  10.5  10
# of students: 1     1   2   1     2   3     2   4     3   5     7   6     14  21    21  19    17  18    31  22    53

# of points:   9.5  9  8.5  8  7.5  7  6.5  6  5.5  5  4.5  4  3.5  3  2.5  2   1.5  1  0.5  0
# of students: 9    9  13   8  13   4  7    4  8    7  9    5  7    8  8    14  8    2  6    9

The corresponding histogram looks as follows: The histogram was obtained from the Information System of Masaryk University. We can see that the data are shown in a somewhat unusual way: individual amounts of points correspond to "double rectangles". It is a matter of taste how to represent the data (it is possible to merge some values, thereby decreasing the number of rectangles, or to use thinner rectangles).
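The theorem quoted above (the sample mean minimizes $S(t)$, the median minimizes $D(t)$) can also be illustrated by brute force on the data set of 10.A.4. This is a sketch of ours; note that for an even number of values, $D(t)$ is minimized by a whole interval, which contains the median:

    import numpy as np

    x = np.array([10, 7, 7, 8, 8, 9, 10, 9, 4, 9,
                  10, 9, 11, 9, 7, 8, 3, 9, 8, 7], dtype=float)

    t = np.linspace(x.min(), x.max(), 2001)       # candidate values of t
    S = ((x[:, None] - t) ** 2).mean(axis=0)      # S(t) = (1/n) sum (x_i - t)^2
    D = np.abs(x[:, None] - t).mean(axis=0)       # D(t) = (1/n) sum |x_i - t|

    print(t[S.argmin()], x.mean())                # both 8.1
    flat = t[np.isclose(D, D.min())]              # minimizers of D form [8, 9]
    print(flat.min(), flat.max(), np.median(x))   # 8.0, 9.0, and 8.5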
If the values of a data set are distributed symmetrically around the mean value, then ¯x = ˜x However, there are distributions where ¯x > ˜x. This is common, for instance, with the distribution of salaries in a population where the mean is driven up by a few very large incomes, while much of the population is below the av- erage. A useful characteristic concerning this is the Pearson coefficient, given by β = 3 ¯x − ˜x sx . It estimates the relative measure (the absolute value of β) and the direction of the skewness (the sign). In particular, note that the standard deviation is always positive, so it is already the sign of ¯x − ˜x which shows the direction of the skewness. Quantile coefficients of skewness More detailed information can be obtained from the quantile coefficients of skewness βp = x1−p + xp − 2˜x x1−p − xp , for each 0 < p < 1/2. Their meaning is clear when the numerator is expressed as (x1−p − ˜x) − (˜x − xp). In particular, the quartile coefficient of skewness is obtained when selecting p = 0.25. 10.1.8. Diagrams. People’s eyes are well suited for perceiving information with a complicated structure. That is why there exist many standardized tools for displaying statistical data or their correlations. One of them is the box diagram. 908 We can notice that the mode of the values is 10, which, accidentally, was also the number of points necessary to pass the course. The mean of the obtained points is 9.48. 10.A.8. Here, we present column diagrams of the amounts of points of MB101 students in autumn 2010 (the very first semester of their studies). The first one corresponds to all students of the course; the second one does to those who (3 years later) successfully finished their studies and got the bachelor’s degree. Again, the results can be depicted in an alternative way: CHAPTER 10. STATISTICS AND PROBABILITY THEORY Box diagram The diagram illustrates a histogram and a box diagram of the same data set (normal distribution with mean equal to 10 and variance equal to 3, n = 500). The middle line is the median; the edges of the box are the quartiles; the “paws” show 1.5 of the interquartile range, but not more than the edges of the sample range. Potential outliers are indicated too. Common displaying tools allow us to view potential dependencies of two data sets. For instance, in the left-hand diagram below, the coordinates are chosen as the values of two independent normal distributions with mean equal to 10 and variance equal to 3. In the right-hand illustration, the first coordinate is from the same data set, and the second coordinate is given by the formula y = 3x + 4. It is also perturbed with a small error. 10.1.9. Covariance matrix. Actually, the depencies between several data sets associated to the same statistical units are at the core of our interest in many real world problems. When definining the variance in 10.1.6 above, we employed the euclidean distance, i.e. we evaluated the scalar product of the values of the square of distances from the mean with itself. Thus, having two vectors of data sets, we may define 909 And these are the graphs of amounts of points obtained by those students who continued their studies: We can see that in the former case, the mode is equal to 0, while in the latter case, it is 10 again. The frequency distribution is close to the one of the MB104 course, which is recommended for the fourth semester. CHAPTER 10. STATISTICS AND PROBABILITY THEORY Covariance and covariance matrix Consider two data sets x = (x1, . . . , xn), y = (y1, . . . 
$x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n)$, and their means $\bar{x}$, $\bar{y}$. We define their covariance by the formula
$\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$
If there are $k$ sample sets $x^{(1)} = (x^{(1)}_1, \dots, x^{(1)}_n)$, ..., $x^{(k)} = (x^{(k)}_1, \dots, x^{(k)}_n)$, then their covariance matrix is the symmetric matrix $C = (c_{ij})$ with $c_{ij} = \operatorname{cov}(x^{(i)}, x^{(j)})$.

Again, the sample covariance and sample covariance matrix are defined by the same formulae with $n$ replaced by $(n-1)$. Clearly, the covariance matrix has the variances of the individual data sets on its diagonal. In order to imagine what the covariance should say, consider two possible behaviours of two data sets: (a) they deviate from their means in a very similar way (comparing $x_i$ and $y_i$ individually), (b) they behave very independently. In the first case, we should expect that the signs of the deviations will mostly coincide, and thus the sum in the definition will lead to a quite big positive number. In the other case, the signs should be rather independent, and thus the positive and negative contributions should effectively cancel each other in the covariance sum. Thus we expect the covariance of data sets expressing independent features to be close to zero, while the covariance of dependent sets should be far from zero. The sign of the covariance shows the character of the dependence. For example, the two sets of data depicted in the left-hand diagram above had covariance about −0.11, while the covariance of the data from the right-hand picture was about 25.9. Similarly to the variance, we are often interested in normalized values. The correlation coefficient takes the covariance and divides it by the standard deviations of the two data sets. In our two latter cases, the correlation coefficients are about −0.01 and 0.99. As expected, they very clearly indicate which of the data are correlated.

10.1.10. Principal components analysis. If we deal with statistics involving many parameters and we need to decide quickly about their similarity (correlation) with some given patterns, we might use a simple idea from linear algebra. Assume we have got $k$ data sets $x^{(i)}$. Since their covariance matrix $C$ is symmetric, there is an orthonormal basis $e$ in $\mathbb{R}^k$ such that, in this basis, the corresponding quadratic form given by $C$ enjoys a diagonal matrix. The relevant basis $e$ consists of the real eigenvectors $e_i \in \mathbb{R}^k$ for the eigenvalues $\lambda_i$. The bigger the absolute value $|\lambda_i|$, the bigger the variation of the orthogonal projection $\hat{x}$ of all the $k$ data sets into the one-dimensional subspace spanned by $e_i$. Thus we may restrict ourselves to just this one data set $\hat{x}$ and consider the statistics concerning this one set as representing the multi-parametric data sets $x^{(i)}$. Similarly, we may

10.A.9. A car was traveling from Brno to Prague at 160 km/h, and then back from Prague to Brno at 120 km/h. What was its average speed?

Solution. This is an example where one might think of using the arithmetic mean, which is incorrect. The arithmetic mean would be the correct result if the car spent the same period of time going at each speed. However, in this case, it traveled the same distance, not time, at each speed. Denoting by $d$ the distance between Brno and Prague and by $v_p$ the average speed, we obtain
$\frac{d}{160} + \frac{d}{120} = \frac{2d}{v_p},$
whence
$v_p = \frac{2}{\frac{1}{160} + \frac{1}{120}} \doteq 137.14.$
Therefore, the average speed is the harmonic mean (see 10.1.4) of the two speeds. □
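The behaviour of the covariance and the correlation coefficient described in 10.1.9 is easy to reproduce numerically. The following sketch is our own (the seed and the size of the error are arbitrary choices) and mimics the two diagrams discussed above; it also re-checks the harmonic-mean answer of 10.A.9:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(10, np.sqrt(3), 500)            # mean 10, variance 3
    y_indep = rng.normal(10, np.sqrt(3), 500)      # independent of x
    y_dep = 3 * x + 4 + rng.normal(0, 0.5, 500)    # y = 3x + 4 plus a small error

    def cov(a, b):
        # the 1/n convention of 10.1.9
        return ((a - a.mean()) * (b - b.mean())).mean()

    print(cov(x, y_indep))              # close to 0
    print(cov(x, y_dep))                # close to 3 * var(x), i.e. about 9
    print(np.corrcoef(x, y_dep)[0, 1])  # close to 1

    # 10.A.9: the average speed is the harmonic mean of 160 and 120
    print(2 / (1 / 160 + 1 / 120))      # 137.14...

The exact magnitude of the covariance depends on the variance chosen for the simulated data; only its order and sign matter here, while the correlation coefficient is scale-free.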
B. Visualization of multidimensional data

The above examples were devoted to displaying one numerical characteristic measured for several objects (the number of points obtained by individual students, for example). Graphical visualization of data helps us understand them better. However, how do we depict the data if we measure $p$ different characteristics, $p \ge 3$, of $n$ objects? Such measurements cannot be displayed using the graphs we have met.

10.B.1. One of the possible methods is the so-called principal component analysis. In this method, we use eigenvectors and eigenvalues (see 2.4.2) of the sample covariance matrix (see 10.2.35). We will use the following notation:
• random vectors of the measurement $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$, $i = 1, \dots, n$,
• the mean of the $j$-th component $m_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, $j = 1, \dots, p$,
• the sample variance of the $j$-th component $s_j = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - m_j)^2$, $j = 1, \dots, p$,
• the vector of means $m = (m_1, \dots, m_p)$,
• the sample covariance matrix $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - m)(x_i - m)^T$ (note that each summand is a $p$-by-$p$ matrix).
The covariance matrix is symmetric, hence all its eigenvalues are real and its eigenvectors are pairwise orthogonal. Moreover, considering the eigenvectors of unit length, we can see that the eigenvalue corresponding to an eigenvector of the covariance matrix yields the variance of (the size of) the projection of the data onto this direction (the projection takes place in the $p$-dimensional space). The goal of this method is to find the direction (in the $p$-dimensional space of the measured characteristics) for which the variance of the projections is as great as possible. Thus, this direction corresponds to the eigenvector of the covariance matrix whose eigenvalue is the greatest one. The linear combination given

also use several biggest eigenvalues instead of one and reduce the dimension of our parameter space in this way. Finally, considering the unit-length eigenvector $(\alpha_1, \dots, \alpha_k)$ corresponding to the chosen eigenvalue $\lambda$, the values $\alpha_j$ provide the right coefficients in the orthogonal projection
$(x^{(1)}, \dots, x^{(k)}) \mapsto \hat{x} = \alpha_1 x^{(1)} + \cdots + \alpha_k x^{(k)}.$
See the exercise 10.B.2 for an illustration, together with another description of how to proceed with the data in 10.B.1. The latter approach is called the principal component analysis.

10.1.11. Entropy. We also need to describe the variability of data sets even with nominal types, for instance in statistical physics or information theory. The only thing at our disposal is the class frequencies, so the principle of classical probability can be used (see the fourth part of chapter one). There, the relative frequency of the $i$-th class, $p_i = \frac{n_i}{n}$, is understood to be the probability that a random object belongs to this class. The variance of ratio-type values with class frequencies $n_j$ was given by the formula (see 10.1.6)
$s_x^2 = \sum_{j=1}^{m}\frac{n_j}{n}(a_j - \bar{x})^2 = \sum_{j=1}^{m} p_j(a_j - \bar{x})^2,$
where $p_j$ denotes the (classical) probability that the value is in the $j$-th class. Therefore, it is a weighted mean of the adjusted values, where the weight of the term $(a_j - \bar{x})^2$ is $p_j$. The variability of nominal values is expressed similarly (denote it by $H_X$). Even though there are no numerical values $a_j$ for the indices $j$, we can be interested in functions $F$ that depend on the relative frequencies $p_j$. For a data set $X$ we can define
$H_X = \sum_{i=1}^{n} p_i F(p_i),$
where $F$ is an unknown function with some reasonable properties. If the data set has only one value, i.e.
$p_k = 1$ for some $k$ and otherwise $p_j = 0$, then we agree that the variability is zero, and so $F(1) = 0$. Moreover, $H_X$ is required to have the following property: If a data set $Z$ consists of pairs of values from data sets $X$ and $Y$ (for example, one can observe eye colour and hair colour of people – statistical units), it is reasonable that the variability of $Z$ be the sum of the variabilities, that is, $H_Z = H_X + H_Y$. The relative class frequencies $p_i$ for the values of the data set $X$ and $q_j$ for those of $Y$ are known. The relative class frequencies for $Z$ are then $r_{ij} = \frac{n_i m_j}{nm} = p_i q_j$, so we demand the equality (the ranges of the sums are clear from the context)
$\sum_{i,j} p_i q_j F(p_i q_j) = \sum_i p_i F(p_i) + \sum_j q_j F(q_j).$

by the components of this vector is called the first principal component. The size of the projection onto this direction estimates the data quite well (the principal component can be viewed as a characteristic which substitutes for the $p$ characteristics, i.e., it is a random vector with $n$ components). If we subtract this projection from the data and consider the direction of the greatest variance again, we get the second principal component. Repeating this procedure further, we obtain the other principal components. The directions of the principal components correspond to the eigenvectors of the covariance matrix in decreasing order with respect to the size of the corresponding eigenvalues.

10.B.2. Find the first principal component of the following simple data and the vector which substitutes for them: Five people had their height, little finger length, and index finger length measured. The collected data are shown in the following table (in centimeters).

Solution.

           Martin  Michael  Matthew  John  Peggy
index f.   9       11       8        8     8
little f.  7.5     8        6.3      6     6.5
height     186     187      173      174   167

The vectors of the collected data are: $x_1 = (9;\ 7.5;\ 186)$, $x_2 = (11;\ 8;\ 187)$, $x_3 = (8;\ 6;\ 173)$, $x_4 = (8;\ 6;\ 174)$, $x_5 = (8;\ 6.5;\ 167)$. The covariance matrices of these vectors are:
$\begin{pmatrix} 0.04 & 0.14 & 1.72 \\ 0.14 & 0.49 & 6.02 \\ 1.72 & 6.02 & 73.96 \end{pmatrix},\quad \begin{pmatrix} 4.84 & 2.64 & 21.12 \\ 2.64 & 1.44 & 11.52 \\ 21.12 & 11.52 & 92.16 \end{pmatrix},\quad \begin{pmatrix} 0.64 & 0.64 & 3.52 \\ 0.64 & 0.64 & 3.52 \\ 3.52 & 3.52 & 19.36 \end{pmatrix},\quad \begin{pmatrix} 0.64 & 0.64 & 2.72 \\ 0.64 & 0.64 & 2.72 \\ 2.72 & 2.72 & 11.56 \end{pmatrix},\quad \begin{pmatrix} 0.64 & 0.24 & 8.32 \\ 0.24 & 0.09 & 3.12 \\ 8.32 & 3.12 & 108.16 \end{pmatrix}.$
The sample covariance matrix is then a quarter of their sum, i.e.,
$S = \begin{pmatrix} 1.70 & 1.075 & 9.35 \\ 1.075 & 0.825 & 6.725 \\ 9.35 & 6.725 & 76.30 \end{pmatrix}.$
The eigenvalues of $S$ are approximately 0.68, 78.05, and 0.10. The unit eigenvector corresponding to the greatest one is approximately $(0.122;\ 0.09;\ 0.989)$. Thus, the first principal component is $(185.5;\ 186.8;\ 172.4;\ 173.4;\ 166.5)$, which is not far from the people's heights. □

10.B.3. The students of a class had the following marks in various subjects:
Sometimes (especially in information theory), the binary logarithm is used instead of the natural logarithm. One often works with the quantity
$e^{H_X} = \prod_i p_i^{-p_i}$
(or with another logarithm base). In this form, for a data set $X$ with $k$ equal class frequencies, compute
$e^{H_X} = \left(\left(\tfrac{1}{k}\right)^{-\frac{1}{k}}\right)^k = k,$
which is independent of the sample size. The next illustration shows the 2-based entropy $y$ for the numbers of occurrences of the letters a, b in 10-letter words consisting of these characters, where $x$ is the number of occurrences of b. Note that the maximum entropy 1 occurs for the same number of a's and b's, and indeed $2^1 = 2$ as computed above. The following illustration displays the entropy of 11 randomly chosen strings of length 10 made of 8 characters. The values are all much less than the theoretical maximal value of 3. This reflects the fact that the numbers of occurrences of the individual 8 characters cannot be equal (or it could happen with a very small probability if the length of the string was 8

Student id  Maths  Physics  History  English  PE
1           1      1        2        2        1
2           1      3        1        1        1
3           2      1        1        1        1
4           2      2        2        2        1
5           1      1        3        2        1
6           2      1        2        1        2
7           3      3        2        2        1
8           3      2        1        1        1
9           4      3        2        3        1
10          2      3        1        2        1

Find the first principal component of these data and the vector which substitutes for them.

Solution. The vectors of observation are $x_1 = (1, 1, 2, 2, 1)$, ..., $x_{10} = (2, 3, 1, 2, 1)$. The corresponding covariance matrices are:
$\begin{pmatrix} 1.21 & 1.10 & -0.33 & -0.33 & 0.11 \\ 1.10 & 1.00 & -0.30 & -0.30 & 0.10 \\ -0.33 & -0.30 & 0.09 & 0.09 & -0.03 \\ -0.33 & -0.30 & 0.09 & 0.09 & -0.03 \\ 0.11 & 0.10 & -0.03 & -0.03 & 0.01 \end{pmatrix},\ \dots,\ \begin{pmatrix} 0.01 & -0.10 & 0.07 & -0.03 & 0.01 \\ -0.10 & 1.00 & -0.70 & 0.30 & -0.10 \\ 0.07 & -0.70 & 0.49 & -0.21 & 0.07 \\ -0.03 & 0.30 & -0.21 & 0.09 & -0.03 \\ 0.01 & -0.10 & 0.07 & -0.03 & 0.01 \end{pmatrix}.$
The sample covariance matrix is
$\begin{pmatrix} 0.99 & 0.44 & -0.078 & 0.26 & -0.01 \\ 0.44 & 0.89 & -0.22 & 0.22 & -0.11 \\ -0.078 & -0.22 & 0.45 & 0.23 & 0.033 \\ 0.26 & 0.22 & 0.23 & 0.45 & -0.078 \\ -0.01 & -0.11 & 0.033 & -0.078 & 0.10 \end{pmatrix}.$
Its dominant eigenvalue is about 1.52, and the corresponding unit eigenvector is approximately $(0.70;\ 0.65;\ -0.13;\ 0.28;\ -0.07)$. Therefore, the principal component is $(1.58;\ 2.73;\ 2.13;\ 2.93;\ 1.45;\ 1.93;\ 4.28;\ 3.48;\ 5.26;\ 3.71)$. □

Another possible method of visualization of multidimensional data is the so-called cluster analysis, but we will not go into further details here.

C. Classical and conditional probability

In the first chapter, we met the so-called classical probability, see 1.4.1. Just to recall it, let us try to solve the following (a bit more complicated) problem:

10.C.1. Aleš wants to buy a new bike, which costs 5100 crowns. He has 2500 crowns left from organizing a camp. Aleš is no dope: he took 50 more crowns from his pocket money and went to the casino to play roulette. Aleš always bets on red. This means that the probability of winning is 18/37 and the amount he wins is equal to the amount he has
Recall that when we talked about geometric probability at the end of the fourth part of chapter one, the sample space for the description of an event was a part of a Euclidean space, and events were suitable subsets of it. All of those sets were uncountable. Begin with a simple (infinite, yet still discrete) example, to which we return from time to time throughout this section.

10.2.1. Why infinite sets of events? Imagine an experiment where a coin is repeatedly tossed until it comes up heads. There are many questions to be asked about this experiment: What is the probability of tossing the coin at least 3 times? (or exactly 35 times, or at most 10 times, etc.) The outcomes of this experiment can be considered in the form $\omega_k \in \mathbb{N}_{\ge 1} \cup \{\infty\}$, which could be read as "the coin comes up heads for the first time in the $k$-th toss". Note that $k = \infty$ is included, since the possibility that the coin always comes up tails must be allowed, too. This problem is solved if the classical probability 1/2 of the coin coming up heads in one toss is used (and the same for tails). In the abstract model, the total number of tosses cannot be bounded by any natural number $N$. On the other hand, the probability that the coin comes up tails in the first $(k-1)$ tosses and heads in the $k$-th one, out of the total number of $n \ge k$ tosses, is given by the fraction
$\frac{2^{n-k}}{2^n} = 2^{-k},$
where in the numerator there is the number of favorable possibilities out of $n$ independent tosses (i.e. the number of possibilities of distributing the two values among the $n-k$ remaining positions), while in the denominator there is the number of

bet. His betting strategy is as follows: The first time, he bets 10 crowns. Each time he has lost, he bets twice the previous bet (if he does not have enough money to make this bet, he leaves the casino, deeply depressed). Each time he has won, he bets 10 crowns again. What is the probability that, using this strategy, he wins the desired 2550 more crowns? (As soon as this happens, he immediately runs to buy the bike.)

Solution. First of all, we calculate how many times Aleš can lose in a row. If he bets 10 crowns the first time, then in order to bet $n$ times, he needs
$10 + 20 + \cdots + 10\cdot 2^{n-1} = 10\cdot\left(\sum_{i=0}^{n-1} 2^i\right) = 10\cdot\frac{2^n - 1}{2 - 1} = 10\cdot(2^n - 1)$
crowns. As we can see, the number 2550 is of the form $10(2^n - 1)$ for $n = 8$. This means that Aleš can bet eight times in a row regardless of the odds. He can never bet nine times in a row, because for that he would have to have $10(2^9 - 1) = 5110$ crowns, which he will never reach (he stops betting as soon as he has 5100 crowns). Therefore, Aleš loses the whole game if and only if he loses eight consecutive bets. The probability of losing one bet is 19/37; hence, the probability of losing eight consecutive (independent) bets is $(19/37)^8$. Thus, the probability that he wins 10 crowns (using his strategy) is $1 - (19/37)^8$. In order to win 2550 crowns, he must win 255 times, and the probability of this is
$\left(1 - \left(\tfrac{19}{37}\right)^8\right)^{255} \doteq 0.29.$
Therefore, the probability of winning using his strategy is much lower than if he bet everything on red straightaway. □

10.C.2. You could try to solve a slight modification of the above problem: Joe stops playing only if he loses all his money; if he still has some money, but not enough to bet twice the previous bet, he bets 10 dollars again. We also met the conditional probability in the first chapter, see 1.4.8.

10.C.3. Let $A$, $B$ be two events such that $B$ is a disjoint union of events $B_1, B_2, \dots, B_n$.
Using the definition of conditional probability (see 10.2.6), prove that
(1) $P(A|B) = \sum_{i=1}^{n} P(A|B_i)P(B_i|B).$

all possible outcomes. As expected, this probability is independent of the chosen $n$, and indeed $\sum_{k=1}^{\infty} 2^{-k} = 1$. Therefore, the probability of tossing only tails is zero. Thus we can define probability on the sample space $\Omega$ with sample points (outcomes) $\omega_k$, whose probability is $2^{-k}$. This leads to a probability according to the definitions below. We return to this example throughout this section.

10.2.2. σ-fields. Work with a fixed non-empty set $\Omega$, which contains the possible outcomes of the experiment and which is called the sample space. The possible outcomes $\omega \in \Omega$ are also called sample points. In probability models, not all subsets of outcomes need be admitted. In particular, the singletons $\{\omega\}$ need not be considered. Those subsets whose probability we want to measure are required to satisfy the axioms of the so-called σ-algebras. The axioms listed below are chosen from a larger collection of natural requirements in a minimal form. The first one is based on the assumption that the universal event should be a measurable set. The second one is forced by the assumption that events can be negated. The third one reflects the necessity to examine the event of the occurrence of at least one event from a countably infinite collection. (For instance, in the example from the previous subsection, the coin is tossed only finitely many times, but there is no upper bound on the number of tosses.)

σ-algebras of subsets

A collection $\mathcal{A}$ of subsets of the sample space is called a σ-algebra or σ-field, and its elements are called events or measurable sets, if and only if
• $\Omega \in \mathcal{A}$, i.e., the sample space is an event;
• if $A, B \in \mathcal{A}$, then $A \setminus B \in \mathcal{A}$, i.e., the set difference of two events is also an event;
• if $A_i \in \mathcal{A}$, $i \in I$, is a countable collection of events, then their union is also an event, i.e., $\cup_{i\in I} A_i \in \mathcal{A}$.

As usual, the basic axioms imply simple corollaries which describe further (intuitively required) properties in the form of mathematical theorems. The reader should check carefully that both following properties hold.
• The complement $A^c = \Omega \setminus A$ of an event $A$ is again an event.
• The intersection of two events is again an event, since for any two subsets $A, B \subset \Omega$, $A \setminus (\Omega \setminus B) = A \cap B$.
Actually, for any countable system of events $A_i$, $i \in I$, the event $\Omega \setminus \cup_{i\in I} A_i^c = \cap_{i\in I} A_i$ is also in the σ-algebra $\mathcal{A}$. Altogether, a σ-algebra is a collection of subsets of the sample space which is closed with respect to set differences, countable unions, and countable intersections.

Solution. First, note that the events $A\cap B_1$, $A\cap B_2$, ..., $A\cap B_n$ are also disjoint. Therefore, we can write
$P(A|B_1 \cup \cdots \cup B_n) = \frac{P\left(A \cap (B_1 \cup \cdots \cup B_n)\right)}{P(B_1 \cup \cdots \cup B_n)} = \frac{P\left((A\cap B_1) \cup (A\cap B_2) \cup \cdots \cup (A\cap B_n)\right)}{P(B)} = \sum_{i=1}^{n}\frac{P(A\cap B_i)}{P(B_i)}\cdot\frac{P(B_i)}{P(B)} = \sum_{i=1}^{n} P(A|B_i)P(B_i|B).$ □

10.C.4. We have four bags with balls: In the first bag, there are four white balls. In the second bag, there are three white balls and one black ball. In the third bag, there are two white and two black balls. Finally, in the fourth bag, there are four black balls. We randomly pick a bag and take two balls out of it (without putting the first one back). Find the probability that
a) the balls are of different colors;
b) the second ball is white provided the first ball was white.

Solution.
Since there is the same number of balls in each of the bags, any ball has the same probability of being taken (similarly for any pair of balls lying in the same bag). Therefore, we can solve this problem using classical probability.
a) Altogether, there are 24 pairs of balls that can be taken. Out of them, 7 consist of balls of different colors. Therefore, the wanted probability is 7/24.
b) Let $A$ denote the event that the first ball is white and $B$ denote the event that the second ball is white. Then, $P(B\cap A)$ is the probability that both balls are white, and this is equal to $10/24 = 5/12$ since there are 10 such pairs. Again, we can use classical probability to calculate $P(A)$: there are 16 balls in total, and 9 of them are white. Altogether, we have
$P(B|A) = \frac{P(B\cap A)}{P(A)} = \frac{\frac{5}{12}}{\frac{9}{16}} = \frac{20}{27}.$

Another solution. The event $A$ can be viewed as the union of three mutually exclusive events $A_1$, $A_2$, $A_3$ that we took a white ball from the first, second, and third bag, respectively. Since there is the same number of balls in each of the bags, the probability of taking any (white) ball is also the same (independent of which ball it is), so we get $P(A) = \frac{9}{16}$ and
$P(A_1|A) = \frac{\frac{4}{16}}{\frac{9}{16}} = \frac{4}{9},\quad P(A_2|A) = \frac{3}{9} = \frac{1}{3},\quad P(A_3|A) = \frac{2}{9}.$
Applying (5), we obtain
$P(B|A) = P(B|A_1)P(A_1|A) + P(B|A_2)P(A_2|A) + P(B|A_3)P(A_3|A) = P(B|A_1)\cdot\frac{P(A_1)}{P(A)} + P(B|A_2)\cdot\frac{P(A_2)}{P(A)} + P(B|A_3)\cdot\frac{P(A_3)}{P(A)} = 1\cdot\frac{4}{9} + \frac{2}{3}\cdot\frac{1}{3} + \frac{1}{3}\cdot\frac{2}{9} = \frac{20}{27}.$ □

10.2.3. Probability space. Now introduce probability in the mathematical model, recalling the concepts used already in the first chapter.

Elementary concepts

Use the following terminology in connection with events:
• the entire sample space $\Omega$ is called the universal event; the empty set $\emptyset \in \mathcal{A}$ is called the null event;
• the singletons $\omega \in \Omega$ are called elementary events (note that $\{\omega\}$ may not even be an event in $\mathcal{A}$);
• the intersection of events $\cap_{i\in I} A_i$ corresponds to the simultaneous occurrence of all the events $A_i$, $i \in I$;
• the union of events $\cup_{i\in I} A_i$ corresponds to the occurrence of at least one of the events $A_i$, $i \in I$;
• if $A\cap B = \emptyset$, then $A, B \in \mathcal{A}$ are called exclusive events or disjoint events;
• if $A \subset B$, then the event $A$ implies the event $B$;
• if $A \in \mathcal{A}$, then the event $B = \Omega \setminus A$ is called the complementary event to $A$ and denoted $B = A^c$.

We have seen an example of probability defined on an infinite sample space in 10.2.1 above. In general, probability is comprehended as follows:

Probability

Definition. A probability space is the σ-algebra $\mathcal{A}$ of subsets of the sample space $\Omega$ on which there is a scalar function $P : \mathcal{A} \to \mathbb{R}$ with the following properties:
• $P$ is non-negative, i.e., $P(A) \ge 0$ for all events $A$;
• $P$ is countably additive, i.e., $P(\cup_{i\in I} A_i) = \sum_{i\in I} P(A_i)$ for every countable collection of mutually exclusive events;
• the probability of the universal event is 1.
The function $P$ is called the probability function on $(\Omega, \mathcal{A})$.

Immediately from the definition, the complementary event satisfies $P(A^c) = 1 - P(A)$. In chapter one, theorems on addition of probabilities were derived. Although dealing with finite sample spaces, the arguments remain the same now. In particular, the inclusion and exclusion principle says for any finite collection of $k$ events $A_i$ that
$P\left(\cup_{i=1}^{k} A_i\right) = \sum_{i=1}^{k} P(A_i) - \sum_{i=1}^{k-1}\sum_{j=i+1}^{k} P(A_i\cap A_j) + \sum_{i=1}^{k-2}\sum_{j=i+1}^{k-1}\sum_{\ell=j+1}^{k} P(A_i\cap A_j\cap A_\ell) - \cdots + (-1)^{k-1} P(A_1\cap A_2\cap\cdots\cap A_k).$

10.C.5. We have four bags with balls: In the first bag, there are four white balls.
In the second bag, there are three white balls and one black ball. In the third bag, there are two white and two black balls. Finally, in the fourth bag, there are one white and three black balls. We randomly pick a bag and take a ball out of it, finding out that it is black. Then we throw away this bag, pick another one, and take a ball out of it. What is the probability that it is white?

Solution. Similarly as in the above exercise, let $A$ denote the event that the very first ball is black. This event can be viewed as the union of mutually exclusive events $A_i$, $i = 2, 3, 4$, where $A_i$ is the event of picking the $i$-th bag and taking a black ball from there. Again, the probability of picking any (black) ball is the same. Hence, $P(A_2|A) = \frac{1}{6}$, $P(A_3|A) = \frac{2}{6} = \frac{1}{3}$, and $P(A_4|A) = \frac{3}{6} = \frac{1}{2}$. Let $B$ denote the event that the second ball is white. If the thrown-away bag is the second one, then there are a total of 7 white balls remaining, so the probability of taking one of them is $P(B|A_2) = \frac{7}{12}$ (we can use classical probability again because each of the bags contains the same number of balls, so any ball has the same probability of being taken). Similarly, $P(B|A_3) = \frac{8}{12}$ and $P(B|A_4) = \frac{9}{12}$. Applying (5), we get that the wanted probability is
$P(B|A) = P(B|A_2)P(A_2|A) + P(B|A_3)P(A_3|A) + P(B|A_4)P(A_4|A) = \frac{7}{12}\cdot\frac{1}{6} + \frac{8}{12}\cdot\frac{1}{3} + \frac{9}{12}\cdot\frac{1}{2} = \frac{25}{36}.$ □

10.C.6. We have four bags with balls: In the first bag, there are a white ball and a black ball. In the second bag, there are three white balls and one black ball. In the third bag, there are one white and two black balls. Finally, in the fourth bag, there are one white and three black balls. We randomly pick a bag and take a ball out of it, finding out that it is white. Then we throw away this bag, pick another one, and take a ball out of it. What is the probability that it is white?

Solution. Similarly as in the above exercise, we view the event $A$ of the first ball being white as the union of four mutually exclusive events $A_1$, $A_2$, $A_3$, and $A_4$ that we take a white ball from the first, second, third, and fourth bag, respectively. The probability of taking a white ball out of the first bag is $P(A_1) = \frac{1}{4}\cdot\frac{1}{2}$ (the probability of $A_1$ is the product of the probability that we pick the first bag and the probability that we take a white ball from there); similarly, $P(A_2) = \frac{1}{4}\cdot\frac{3}{4}$, $P(A_3) = \frac{1}{4}\cdot\frac{1}{3}$, $P(A_4) = \frac{1}{4}\cdot\frac{1}{4}$. Then,
$P(A) = P(A_1) + P(A_2) + P(A_3) + P(A_4) = \frac{11}{24}.$
Note that the probability $P(A)$ cannot be calculated classically, i.e., by simply dividing the number of white balls by the total number of the balls, because, for instance, the probability of taking a white ball from the first bag is twice as great as from the fourth bag. As for the conditional probabilities, we have $P(A_1|A) = P(A_1)/P(A) = \frac{3}{11}$, $P(A_2|A) = \frac{9}{22}$, $P(A_3|A) = \frac{2}{11}$, $P(A_4|A) = \frac{3}{22}$. Now, let $B$ denote the event that we take another white ball after we have thrown away the first bag. We want to apply (5) again. It remains to compute $P(B|A_i)$, $i = 1, \dots, 4$. The probability $P(B|A_1)$ can be

The reader should look back at 1.4.5 and think about the details.

10.2.4. Independent events. The definition of stochastically independent events also remains unchanged. It reflects the intuition that the probability of the simultaneous occurrence of independent events is equal to the product of the particular probabilities.

Stochastic independence

Events $A$, $B$ are said to be stochastically independent if and only if $P(A\cap B) = P(A)P(B)$.
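As an aside, computations such as those of 10.C.5 are mechanical enough to delegate to a computer with exact rational arithmetic. A sketch of ours reproducing $P(B|A) = 25/36$ by enumerating the bags:

    from fractions import Fraction as F

    bags = [(4, 0), (3, 1), (2, 2), (1, 3)]   # (white, black) as in 10.C.5

    num = F(0)   # P(first ball black and second ball white)
    den = F(0)   # P(first ball black)
    for i, (w1, b1) in enumerate(bags):
        p_first_black = F(1, 4) * F(b1, 4)
        den += p_first_black
        # the drawn bag is thrown away; one of the other three is picked uniformly
        for j, (w2, b2) in enumerate(bags):
            if j != i:
                num += p_first_black * F(1, 3) * F(w2, 4)

    print(num / den)   # 25/36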
Of course, the universal event and the null event are stochastically independent of any event. Recall that replacing an event $A_i$ with the complementary event $A_i^c$ in a collection of stochastically independent events $A_1, A_2, \dots$ again results in a collection of stochastically independent events, and (see 1.4.7, page 23)
$P(A_1 \cup \cdots \cup A_k) = 1 - P(A_1^c \cap \cdots \cap A_k^c) = 1 - (1 - P(A_1))\cdots(1 - P(A_k)).$
Classical finite probability remains the fundamental example of probability, used as the inspiration during the creation of the mathematical model. Recall that in this case, $\Omega$ is a finite set, the σ-algebra $\mathcal{A}$ is the collection of all subsets of $\Omega$, and the classical probability is the probability space $(\Omega, \mathcal{A}, P)$ with the probability function
$P : \mathcal{A} \to \mathbb{R},\quad P(A) = \frac{|A|}{|\Omega|}.$
This corresponds precisely to the intuition about the relative frequency $p_A$ of an event $A$ when drawing a random element from the sample set $\Omega$. This definition of probability guarantees reasonable behaviour of monotone sequences of events:

10.2.5. Theorem. Consider a probability space $(\Omega, \mathcal{A}, P)$ and a non-decreasing sequence of events $A_1 \subset A_2 \subset \dots$. Then,
$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \lim_{i\to\infty} P(A_i).$
Similarly, if $A_1 \supset A_2 \supset A_3 \supset \dots$, then
$P\left(\bigcap_{i=1}^{\infty} A_i\right) = \lim_{i\to\infty} P(A_i).$

Proof. The considered union $A = \cup_{i=1}^{\infty} A_i$ can be rewritten in terms of the mutually exclusive events $\tilde{A}_i = A_i \setminus A_{i-1}$, defined for all $i = 2, 3, \dots$. Set $\tilde{A}_1 = A_1$. Then,
$P(A) = P\left(\bigcup_{i=1}^{\infty}\tilde{A}_i\right) = \sum_{i=1}^{\infty} P(\tilde{A}_i) = \lim_{k\to\infty}\sum_{i=1}^{k} P(\tilde{A}_i).$

computed as the sum of the probabilities of the mutually exclusive events $B_2$, $B_3$, $B_4$ (given $A_1$) that the second white ball comes from the second, third, fourth bag, respectively. Altogether, we have
$P(B|A_1) = P(B_2|A_1) + P(B_3|A_1) + P(B_4|A_1) = \frac{1}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot\frac{1}{3} + \frac{1}{3}\cdot\frac{1}{4} = \frac{4}{9}.$
Similarly,
$P(B|A_2) = \frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{1}{3} + \frac{1}{3}\cdot\frac{1}{4} = \frac{13}{36},\quad P(B|A_3) = \frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot\frac{1}{4} = \frac{1}{2},\quad P(B|A_4) = \frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot\frac{1}{3} = \frac{19}{36}.$
Altogether, we get
$P(B|A) = P(B|A_1)P(A_1|A) + P(B|A_2)P(A_2|A) + P(B|A_3)P(A_3|A) + P(B|A_4)P(A_4|A) = \frac{4}{9}\cdot\frac{3}{11} + \frac{13}{36}\cdot\frac{9}{22} + \frac{1}{2}\cdot\frac{2}{11} + \frac{19}{36}\cdot\frac{3}{22} = \frac{19}{44}.$ □

10.C.7. Two shooters shoot at a target; each makes two shots. Their respective accuracies are 80 % and 60 %. We have found two hits in the target. What is the probability that they both belong to the first shooter?

Solution. The probability of hitting the target is 4/5 for the first shooter, and 3/5 for the second one. Consider the events:
A ... there are two hits in the target, both of the first shooter,
B ... there are two hits in the target.
Our task is to find $P(A|B)$. We can divide the event $B$ into six disjoint events according to which shot(s) of each shooter was/were successful. We enumerate the events in a table and, for each of them, we compute its probability. This is easy, as each of the events is the intersection of four independent events (the results of the four shots). A hit is denoted by 1, a miss by 0.

      Shooter 1   Shooter 2   probability
B1    0 1         0 1         (1/5)·(4/5)·(2/5)·(3/5) = 24/625
B2    0 1         1 0         24/625
B3    1 0         1 0         24/625
B4    1 0         0 1         24/625
B5    1 1         0 0         64/625
B6    0 0         1 1         9/625

Adding up the probabilities of these disjoint events, we get:
$P(B) = \sum_{i=1}^{6} P(B_i) = \frac{169}{625}.$

For the finite sums,
$\sum_{i=1}^{k} P(\tilde{A}_i) = P(A_1) + \sum_{i=2}^{k}\left(P(A_i) - P(A_{i-1})\right) = P(A_k)$
by the assumptions $A_{i-1} \subset A_i$. This proves the first part of the theorem. In the second part, consider the complements $B_i = A_i^c$ instead of the events $A_i$. They satisfy the assumptions of the first part of this theorem.
Then, the complement of the considered intersection is
$B = A^c = \left(\bigcap_{i=1}^{\infty} A_i\right)^c = \bigcup_{i=1}^{\infty} B_i.$
The desired statement follows from the fact that
$P(A) = 1 - P(B) = 1 - \lim_{i\to\infty} P(B_i) = \lim_{i\to\infty}\left(1 - P(B_i)\right),$
which completes the proof. □

10.2.6. Conditional probability. Consider the following problem: On average, 40% of students succeed in course X and 80% of students succeed in course Y. If a random student enrolled in both these courses tells us that he has passed one of them (but we did not catch which one), what is the probability that he meant course X? As mentioned in subsection 1.4.8 (page 24), such problems can be formalized in the way described below. (We shall come back to the solution of the latter problem in 10.3.12.)

Conditional probability

Definition. Let $H$ be an event with non-zero probability in the σ-algebra $\mathcal{A}$ of a probability space $(\Omega, \mathcal{A}, P)$. The conditional probability $P(A|H)$ of an event $A \in \mathcal{A}$ with respect to the hypothesis $H$ is defined as
$P(A|H) = \frac{P(A\cap H)}{P(H)}.$

The definition corresponds to the intuition from classical probability that the probability of the events $A$ and $H$ occurring simultaneously, provided the event $H$ has occurred, is $P(A\cap H)/P(H)$. Directly from the definition, the hypothesis $H$ and the event $A$ are independent if and only if $P(A) = P(A|H)$. At first sight, it may seem that introducing conditional probability does not add anything new. Actually, it is a very important type of approach which is needed in statistics as well. The hypothesis can be the a priori probability (i.e. the prior belief assumed beforehand), and the resulting probability is said to be a posteriori (i.e., it is considered to be a consequence of the assumption). This is the core of the Bayesian approach to statistics, as is seen later. The definition also implies the following result.

Now, we can compute the conditional probability, using the formula of subsection 10.2.6:
$P(A|B) = \frac{P(A\cap B)}{P(B)} = \frac{P(B_5)}{P(B)} = \frac{\frac{64}{625}}{\frac{169}{625}} = \frac{64}{169} \doteq 0.38.$ □

10.C.8. We toss a coin. If it comes up heads, we put a white ball into an (initially empty) bag; otherwise, we put a black ball there. This is repeated $n$ times. Then, we take a ball randomly from the bag (without replacement). Suppose it is white. What is the probability that another ball we take randomly from the bag is black?

Solution. We will solve the problem for a general (possibly biased) coin. In particular, we assume that the individual tosses are independent and that there exists a fixed probability of the coin coming up heads, which we denote $p$. The event "a ball in the bag is white" corresponds to the event "the coin came up heads in the corresponding toss". Since the first ball was white, we deduce that $p > 0$. We can also see that the probability space "taking a random ball from the bag" is isomorphic to the probability space "tossing a coin". Since we assume that the individual tosses are independent, we also get the independence of the colors of the selected balls. This leads to the conclusion that the probability in question is $1-p$. Is this reasoning correct? Do we not expect the probability of taking a black ball to be greater than $1-p$? See, there were approximately $np$ white and $n(1-p)$ black balls in the bag, so if we have removed one white ball, the probability of selecting a black one should increase, shouldn't it? Before reading further, try to figure out which (if any) of these two presented reasonings is correct, and whether the probability also depends on $n$ (the number of balls in the bag before any were removed).
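Before turning to the exact computation, the question can be probed by simulation. The following Monte Carlo sketch is ours (the function name and parameters are illustrative); it estimates P(second ball black | first ball white):

    import random

    def estimate(n, p, trials=200_000, seed=0):
        # fill the bag by n independent tosses, then draw two balls
        # without replacement, conditioning on the first being white
        rng = random.Random(seed)
        hits = total = 0
        for _ in range(trials):
            bag = ['W' if rng.random() < p else 'B' for _ in range(n)]
            rng.shuffle(bag)
            if bag[0] == 'W':
                total += 1
                hits += (bag[1] == 'B')
        return hits / total

    print(estimate(10, 0.3))   # close to 1 - p = 0.7
    print(estimate(2, 0.3))    # also close to 0.7

The estimates support the first (isomorphism) reasoning, as the exact computation below confirms.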
Now, we select a more sophisticated approach to the problem. Let Bi denote the event "there were i white balls in the bag" (before any were removed), i ∈ {0, 1, 2, . . . , n}. Further, let A denote the event "the first ball is white" and C the event "the second ball is black". The event Bi says that the coin came up heads i times out of n; hence, its probability is

P(Bi) = (n i) p^i (1 − p)^(n−i).

The conditional probability of taking a white ball, provided there are exactly i white balls in the bag, equals P(A|Bi) = i/n. We are interested in the probability of C, knowing that A has occurred, i.e., we want to know P(C|A). Since the events Bi are pairwise disjoint, so are the events C ∩ Bi. Since C can be decomposed as the disjoint union ∪_{i=0}^{n} (C ∩ Bi), we can write

P(C|A) = P(∪_{i=0}^{n} (C ∩ Bi) | A) = ∑_{i=0}^{n} P((C ∩ Bi) ∩ A)/P(A)
= (1/P(A)) ∑_{i=0}^{n} P(C ∩ (A ∩ Bi))
= (1/P(A)) ∑_{i=0}^{n} P(A ∩ Bi) P(C|A ∩ Bi)
= (1/P(A)) ∑_{i=0}^{n} P(Bi) P(A|Bi) P(C|A ∩ Bi).

We use the law of total probability and substitute for P(A), which leads to

(1) P(C|A) = ∑_{i=0}^{n} P(Bi)P(A|Bi)P(C|A ∩ Bi) / ∑_{i=0}^{n} P(Bi)P(A|Bi).

This formula is sometimes called the second Bayes' formula; it holds in general, provided the space Ω is a disjoint union of the events Bi. Since we tossed the coin at least once, we have n ≥ 1. Now, we can calculate (noting that P(C|A ∩ Bi) = (n − i)/(n − 1) for n > 1):

∑_{i=0}^{n} P(Bi)P(A|Bi) = ∑_{i=0}^{n} (n i) p^i (1 − p)^(n−i) (i/n)
= ∑_{i=1}^{n} ((n−1)!/((i−1)!(n−i)!)) p^i (1 − p)^(n−i)
= ∑_{i=0}^{n−1} ((n−1)!/(i!(n−i−1)!)) p^(i+1) (1 − p)^(n−i−1)
= p ∑_{i=0}^{n−1} (n−1 i) p^i (1 − p)^(n−1−i)
= p (p + (1 − p))^(n−1) = p,

∑_{i=0}^{n} P(Bi)P(A|Bi)P(C|A ∩ Bi) = ∑_{i=0}^{n} (n i) p^i (1 − p)^(n−i) (i/n) ((n − i)/(n − 1))
= ∑_{i=1}^{n−1} ((n−2)!/((i−1)!(n−i−1)!)) p^i (1 − p)^(n−i)
= ∑_{i=0}^{n−2} ((n−2)!/(i!(n−2−i)!)) p^(i+1) (1 − p)^(n−i−1)
= p(1 − p) ∑_{i=0}^{n−2} (n−2 i) p^i (1 − p)^(n−2−i) = p(1 − p)

for n > 1, while for n = 1 the last sum is empty and the value is 0.

Lemma. Let an event B be the union of mutually exclusive events B1, B2, . . . , Bn. Then

(1) P(A|B) = ∑_{i=1}^{n} P(A|Bi)P(Bi|B).

Proof. The events A ∩ B1, A ∩ B2, . . . , A ∩ Bn are also mutually exclusive. Therefore,

P(A|B1 ∪ · · · ∪ Bn) = P(A ∩ (B1 ∪ · · · ∪ Bn))/P(B1 ∪ · · · ∪ Bn)
= P((A ∩ B1) ∪ (A ∩ B2) ∪ · · · ∪ (A ∩ Bn))/P(B)
= ∑_{i=1}^{n} (P(A ∩ Bi)/P(Bi)) (P(Bi)/P(B))
= ∑_{i=1}^{n} P(A|Bi)P(Bi|B). □

Consider the special case B = Ω. Then, the events Bi can be considered the "possible states of the universe", P(A|Bi) expresses the probability of A provided the universe is in its i-th state, and P(Bi|Ω) = P(Bi) is the probability of the universe being in its i-th state. By the above lemma,

P(A) = P(A|Ω) = ∑_{i=1}^{n} P(A|Bi)P(Bi).

This formula is called the law of total probability.

10.2.7. Bayes' theorem. Simple rearrangement of the conditional probability formula leads to

P(A ∩ B) = P(B ∩ A) = P(A)P(B|A) = P(B)P(A|B).

There are two important corollaries:

Bayes' rules

Theorem. The probabilities of events A and B satisfy

(1) P(A|B) = P(A)P(B|A)/P(B),
(2) P(A|B) = P(A)P(B|A)/(P(A)P(B|A) + P(Ac)P(B|Ac)).

The first proposition is called the inverse probability formula. The second proposition is called the first Bayes' formula.

Proof. The first statement is a mere rearrangement of the formula above the theorem. To obtain the second statement, note that P(B) = P(B ∩ A) + P(B ∩ Ac). Applying the law of total probability,

P(B) = P(A)P(B|A) + P(Ac)P(B|Ac)

can be substituted into the inverse probability formula, thereby obtaining the second statement of the theorem. □
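The lemma is exactly what was used in problem 10.C.6 above, with the weights P(Ai|A) summing to one. That arithmetic is easy to verify with exact fractions; a minimal sketch:

    from fractions import Fraction as F

    conds   = [F(4, 9), F(13, 36), F(1, 2), F(19, 36)]   # P(B|Ai) from 10.C.6
    weights = [F(3, 11), F(9, 22), F(2, 11), F(3, 22)]   # P(Ai|A) from 10.C.6
    print(sum(weights))                                  # 1, an exhaustive decomposition
    print(sum(c * w for c, w in zip(conds, weights)))    # 19/44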
Bayes' rule is sometimes formulated in a somewhat more general form, proved similarly as in (??): Let the sample space Ω be the union of mutually exclusive events A1, . . . , An. Then, for any i ∈ {1, . . . , n},

(3) P(Ai|B) = P(B|Ai)P(Ai) / ∑_{k=1}^{n} P(B|Ak)P(Ak).

10.2.8. Example and remarks. Now, the introductory question from 10.2.6 can be dealt with easily. Consider the event A, which corresponds to "the student has passed an exam", and the event B, which corresponds to "the exam in question concerns course X". Assume that the probabilities of the exam concerning either course are the same, i.e., P(B) = P(Bc) = 0.5. While the wanted probability P(B|A) is unclear, the probability P(A|B) = 0.4 is given, as well as P(A|Bc) = 0.8. This is a typical application of Bayes' formula ??(??). There is no need to calculate P(A) at all:

P(B|A) = P(B)P(A|B) / (P(B)P(A|B) + P(Bc)P(A|Bc)) = (0.5 · 0.4)/(0.5 · 0.4 + 0.5 · 0.8) = 1/3.

In order to better understand the role of the prior probability, here is another example. Consider a university using entrance exams with the following reliability: 99% of intelligent people pass them, while among non-intelligent people, only 0.5% are able to pass. It is desired to find the probability that a random student (accepted applicant) of the university is intelligent. Thus, let A be the event "a random person is intelligent" and B the event "the person passed the exams successfully". Using Bayes' formula, the probability that A occurs provided B has occurred can be computed. It is only necessary to supply the overall probability p = P(A) that a random applicant is intelligent:

P(A|B) = 0.99p / (0.99p + 0.005(1 − p)).

The following table presents the result for various values of p. The first column corresponds to the case that every other applicant is intelligent, etc.

p        0.5    0.1    0.05   0.01   0.001   0.0001
P(A|B)   0.99   0.96   0.91   0.67   0.17    0.02

Therefore, if every other applicant is intelligent, then 99% of the students are intelligent. If only 1% of the population meets the expectation of "intelligence" and the applicants form a good random sample, then only about two thirds of the students are intelligent, etc.

Consider similar tests for the occurrence of a disease, say HIV. There may be a test with the same reliability as the one above, used to test all students present at the university. In this case, assume that the parameter p is close to the one for the entire population (say, 1 out of 10000 people is infected, on average), which corresponds to the last column of the table above.

Substituting the two sums computed above into the second Bayes' formula (1), we obtain the wanted probability

P(C|A) = 0 for n = 1, and P(C|A) = 1 − p for n > 1.

Thus, the simple reasoning about the probability spaces being isomorphic led to the correct result. The second reasoning was wrong because it omitted the fact that, since the first ball was white, the expected number of white balls in the bag (before removing the first one) was greater than np. The calculation highlights the singular case n = 1. □
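The table in 10.2.8 is reproduced by evaluating the first Bayes' formula directly; a minimal sketch (the function and parameter names here are ours):

    def posterior(p, passed_if_ok=0.99, passed_if_not=0.005):
        """First Bayes' formula: P(intelligent | passed) for the prior p."""
        return passed_if_ok * p / (passed_if_ok * p + passed_if_not * (1 - p))

    for p in (0.5, 0.1, 0.05, 0.01, 0.001, 0.0001):
        print(p, round(posterior(p), 2))
    # prints 0.99, 0.96, 0.91, 0.67, 0.17, 0.02, the row of the table above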
10.C.9. Once upon a time, there was a quiz show whose first prize was a Ferrari 599 GTB Fiorano. The contestant who won the final round was taken into a room with three identical doors. Behind two of them, there were goats, while behind the third one was the car. In order to win the car, the contestant had to guess the correct door. First, the contestant pointed at one of the three doors. Then, an assistant opened one of the other two doors, behind which there was a goat. Now, the contestant is given the option to change his guess. Should he do so?

Solution. Of course, we assume that the contestant wants to win the car. First of all, try to examine your intuition for random events. For example, you can reason as follows: "One of the two remaining doors contains the car, each with the same probability. Therefore, it does not matter which door we choose." Or: "The probability of choosing the correct door at the beginning is 1/3. The shown goat changes nothing, so the probability that the guess is wrong is 2/3. Therefore, we should change the door, thereby winning with probability 2/3."

Apparently, it is wise to change the door only if the probability of the car being behind the other closed door is greater than behind the initially chosen one. We consider the following events: H stands for "the initial guess is correct", A stands for "we have changed the door", and C for "we have won". We are thus interested in the probabilities P(C|A) and P(C|Ac). First, we choose one of three doors, and the Ferrari is behind one of them, so

P(H) = 1/3, P(Hc) = 1 − 1/3 = 2/3.

We assume that the event of changing the door is independent of the original guess; hence

P(A|H) = P(A|Hc) = P(A), P(Ac|H) = P(Ac|Hc) = P(Ac).

If the original guess is correct and it is changed, then we surely lose, while if it is originally wrong and then changed, we surely win. Therefore, we have

P(C|A ∩ H) = 0 = P(C|Ac ∩ Hc), P(C|Ac ∩ H) = 1 = P(C|A ∩ Hc).

Clearly, the result of the test is catastrophically unreliable: only about 2% of the students who are tested positive are really infected! Note that the problem with both tests is the same one. It is clear that real entrance exams require good selectivity and reliability, so the university marketing must ensure that the actual applicants do not form a good random sample of the population. Perhaps the university should try to discourage "non-intelligent" people from applying and thus secure a sufficiently low number of such applicants. With diseases, even a very rare occurrence of healthy people tested positive can be devastating. If the test is improved so that it is 100% reliable for infected people, this has almost no impact on the resulting probabilities in the table. Thus, if a person is tested positive when diagnosing a rare disease, it is necessary to make further tests. Then, the result P(A|B) of the first test plays the role of the prior probability P(A) during the second test, etc. This approach allows one to "accumulate the experience".

10.2.9. Borel sets. In practice, one is interested in the probability of events expressed by the question whether some numerical quantity falls into a given interval. We illustrate this on the example dealing with the results of students in a given course, measured for instance by the number of points in a written exam (cf. 10.1.1). On the one hand, there is only a finite number of students, and there are only finitely many possible results (say, the numbers of points in the written exam can be the integers 0 through 20). On the other hand, imagining the results of the students as an analogy to independent rolls of a regular die is inappropriate. Even if a regular 21-hedron existed (it cannot, see chapter 13), such a model would be somewhat weird.
Thus it is better to focus on the assessing function X : Ω → R in the sample space Ω of all students and model the probability that its value falls into a fixed interval when a random student is picked. For instance, if the table transferring points into marks A through F is fixed, the probability that the student obtained an A or a B can be modeled. In the case of a reasonable course, we should expect that the most probable results are somewhere in the middle of the “interval of success”, while the ideal result of the full number of points is not very probable. Similarly, if many values of X lie in the interval of failure, this may be at most universities perceived as a significant failure of the lecturer. This is a typical example of the random variables or random vectors, as defined below (it depends whether the result of just one or several students is chosen randomly). One way to proceed is to model the behaviour of X as probability defined for all intervals. This requires the following σ-algebra:1 1In this connection, we also talk about the σ-algebra of Borelmeasurable sets on Rk, and then the following definition says that random variables are Borel-measurable functions. 920 It follows from the second Bayes’ formula (1) that P(C|A) = = P(H)P(A|H)P(C|A ∩ H) + P(Hc )P(A|Hc )P(C|A ∩ Hc ) P(A) = =P(Hc ) = 2 3 and, analogously, P(C|Ac ) = = P(H)P(Ac |H)P(C|Ac ∩ H) + P(Hc )P(Ac |Hc )P(C|Ac ∩ Hc ) P(Ac) = =P(H) = 1 3 . We have thus obtained P(C|A) > P(C|Ac ), which means that it is wise to change the door. Note that the solution is based upon the assumption that the assistant deliberately opens a door behind which there is a goat. If the contestant believes it was an accident or if instead, say, he happens to see (or hear) a goat behind one of the two not chosen doors, then the first reasoning is correct and the probability remains to be 1 2 . □ 10.C.10. We have two bags. The first one contains two white and two black balls, while the second one contains one white and two black balls. We randomly select one of the bags and take two balls out of it (without replacement). What is the probability that the second ball is black provided the first one is white? ⃝ D. What is probability? First of all, recall the geometric probability, which was introduced in ??. 10.D.1. Buffon’s needle. A plane is covered with parallel lines, creating bands of width l. Then, a needle of length l is thrown onto the plane. What is the probability that the needle crosses one of the lines? Solution. The position of the needle is given by two independent parameters: the distance d of the needle’s center from the closest line (d ∈ [0, l/2]) and the angle α (α ∈ [0, π/2]) between the lines and the needle’s direction. The needle crosses one of the lines if and only if l/2 sin α > d. The space of all events (α, d) is a rectangle π/2 × l/2. The favorable events (α, d) (i. e. those for which l/2 sin α > d) correspond to those points in the rectangle which lie under the curve l/2 sin α (α being the variable of the x-axis). By 6.2.20, the area of the figure is ∫ π 2 0 l 2 sin α dα = l 2 . Thus, the wanted probability is (see ??) l 2 π 2 · l 2 = 2 π . □ CHAPTER 10. STATISTICS AND PROBABILITY THEORY Borel sets The Borel sets in R are all those subsets that can be obtained from intervals using complements, countable unions, and countable intersections. More generally, on the sample space Ω = Rk , one considers the smallest σ-algebra B which contains all k– dimensional intervals. The sets in B are called the Borel sets on Rk . 
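The result of 10.C.9 also lends itself to an empirical check. A minimal Python sketch of the quiz (we assume, as in the solution above, that the assistant deliberately opens a goat door; the helper names are ours):

    import random

    def play(switch):
        """One round of the quiz; returns True if the car is won."""
        car, guess = random.randrange(3), random.randrange(3)
        # the assistant opens a goat door different from the current guess
        opened = random.choice([d for d in range(3) if d not in (guess, car)])
        if switch:
            guess = next(d for d in range(3) if d not in (guess, opened))
        return guess == car

    random.seed(3)
    n = 100_000
    print(sum(play(True) for _ in range(n)) / n)    # close to 2/3 when changing
    print(sum(play(False) for _ in range(n)) / n)   # close to 1/3 when staying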
10.2.10. Random variables. The probabilities of the individual intervals in the Borel algebra are usually given as follows. Consider a numerical quantity X on any sample space, that is, a function X : Ω → R. Since it is desired to work with the probability of X taking on values from any fixed interval, the probability space and the properties of the function X have to allow this. Notice that working with finite probability spaces where all subsets are events, every function X : Ω → R is a random variable in the following sense. Random variables and vectors Definition. A random variable X on a probability space (Ω, A, P) is a function X : Ω → R such that the inverse image X−1 (B) lies in A for every Borel set B ∈ B on R. The real-valued function PX(B) = P(X−1 (B)) defined on all intervals B ⊂ R is called the (probability) distribution of a random variable X. A random vector X = (X1, . . . , Xk) on (Ω, A, P) is a k-tuple of random variables Xi : Ω → R defined on the same probability space (Ω, A, P). If intervals I1, . . . , Ik in R are chosen, then the probability of simultaneous occurrence of all of the k events Xi ∈ Ii must exist. Thus, as in the scalar case, there is a real-valued function defined on the k-dimensional intervals B = I1 × · · · × Ik, PX(B) = P(X−1 (B)) (and thus also for all Borel sets B ⊂ Rk ). It is called the probability distribution of the random vector X. 10.2.11. Distribution function. The distribution of random variables is usually given by a rule which shows how the probability grows as the interval B is extended. In particular, consider the intervals I with endpoints a, b, −∞ ≤ a ≤ b ≤ ∞. Denote P(a < X < b) the probability of X lying in I = (a, b), or P(X < b) if a = −∞; and analogously for other types of intervals. In the special case of a singleton, write P(X = a). In the case of a random vector X = (X1, . . . , Xk), write P(a1 < X1 < b1, . . . , ak < Xk < bk) for the probability of simultaneous occurrence of the events where the values of Xi fall into the corresponding intervals (which may also be closed, unbounded, etc.). 921 The following (known) problem, which also deals with geometric probability, illustrates that we must be cautious about what is assumed to be “clear”. 10.D.2. Bertrand’s paradox. What is the probability that a random chord of a given circle is longer than the side of an equilateral triangle inscribed into the circle? Solution. We will show three ways how to find “this” proba- bility. 1) Every chord is determined by its center. Thus, a random choice of the chord is given by a random choice of the center. The chord is greater than the side of the inscribed equilateral triangle if and only if its center lies inside the concentric circle with half radius. The center is chosen “randomly” from the whole inside of the circle. Therefore, the probability that it will lie in the inner disc is given by the ratio of the areas of these discs, which is 1 4 . 2) Unlike above, we claim that the wanted probability does not change if the direction of the chord is fixed. Then, the centers of such chords lie on a fixed diameter of the circle. The favorable centers are those which lie inside the inner circle (see 1)), i. e., inside a fixed diameter of the inner circle. The ratio of the diameters is 1 : 2, hence the wanted probability is 1 2 . 3) Now, we observe that a chord is determined by its endpoints (which must lie on the circle). Let us fix one of the endpoints (call it A)–thanks to the apparent symmetry, this should not affect the resulting probability. 
Then, the chord satisfies the given condition if and only if the other endpoint lies on the shorter arc BC, where ABC is the inscribed equilateral triangle. However, the length of this arc is one third of the length of the entire circle, which means that the wanted probability is equal to 1 3 . How is it possible that we came to three different probabilities? It is caused by a hidden ambiguity in the statement of the problem. It is necessary to specify what exactly it means to choose a chord “randomly”. Each of the three results is correct provided the chord is chosen in the corresponding way. However, these ways are not equivalent; this is apparent not only from the different results, but also from the distribution of the chords’ centers. In the first case, they are distributed uniformly throughout the inside of the circle. In the second and third cases, the centers are concentrated more towards the center of the circle. □ 10.D.3. Two envelopes. There are two envelopes, each contains a certain amount of money. We know that the amount in one of them is twice as great as in the other one. We can choose either of the envelopes (and take its contents). As soon as we choose one, we are allowed to change our mind and take the other envelope instead. Is it advantageous to do so? Solution. At the first sight, it must not matter which envelope we choose. The probability of choosing the one which contains more is 1/2, so it is no good to change our choice. CHAPTER 10. STATISTICS AND PROBABILITY THEORY Distribution function Definition. The distribution function or cumulative distribution function of a random variable X is the function FX : R → [0, 1] defined for all x ∈ R by FX(x) = P(X < x). The distribution function of a random vector (X1, . . . , Xk) is the function FX : Rk → R defined for all vectors x = (x1, . . . , xk) ∈ Rk by FX(x) = P(X1 < x1, . . . , Xk < xk). If it is clear from the context which distribution function is discussed, omit the random variable name and write simply F(x). The following theorem guarantees that, for every random variable, the probability that the value of X falls into any (fixed) interval (and thus into any Borel set B) can be calculated purely from the knowledge of its distribution function.2 10.2.12. Theorem. For every random variable X, its distribution function F : R → [0, 1] has the following properties: (1) F is a non-decreasing function; (2) F has both side-limits at every point x ∈ R, yet these limits may differ; (3) F is left-continuous; (4) at the infinite points, the limits of F are lim x→∞ F(x) = 1, lim x→−∞ F(x) = 0; (5) the probability of X taking on the value x is given by P(X = x) = lim y→x+ F(y) − F(x). (6) The distribution function of a random variable always has only countably many points of discontinuity. Proof. The proof consists of quite simple and straightforward calculations. In particular, note that the events a ≤ X < b and X < a are exclusive, so P(a ≤ X < b) = P(X < b) − P(X < a) = F(b) − F(a). Hence the first property follows immediately from the definition of probability. The next two statements follow from the probability of monotone sequences of events, discussed in 10.2.5. Fix a nonincreasing sequence of numbers rn > 0 which converges to 0, and consider the events An given by X < x − rn. The union of these events is exactly the event A given by X < x. Of course, the event A does not depend on the choice of the sequence rn. By the first proposition of 10.2.5, P(A) = lim n→∞ P(An). 
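Returning for a moment to 10.D.2: each of the three answers corresponds to a different way of sampling a "random" chord, and the difference becomes tangible in a short simulation. A minimal sketch (method numbering follows the solution above; the unit circle is assumed):

    import math, random

    SIDE = math.sqrt(3)                 # side of the inscribed triangle for R = 1

    def chord(method):
        """Length of a 'random' chord under the three sampling schemes."""
        if method == 1:                 # uniform midpoint inside the disc
            d = math.sqrt(random.random())
        elif method == 2:               # uniform midpoint on a fixed diameter
            d = random.random()
        else:                           # two uniform endpoints on the circle
            a = 2 * math.pi * random.random()
            b = 2 * math.pi * random.random()
            return 2 * abs(math.sin((a - b) / 2))
        return 2 * math.sqrt(1 - d * d)

    random.seed(4)
    n = 200_000
    for m in (1, 2, 3):
        print(m, sum(chord(m) > SIDE for _ in range(n)) / n)
    # prints values close to 1/4, 1/2 and 1/3, respectively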
2 In the literature, the definition with the non-strict inequality F(x) = P(X ≤ x) is often met. In this case, the probability P(X = x) is also included in FX(x). The distribution function then has properties similar to those in 10.2.12, only it is right-continuous instead of left-continuous, etc.

However, consider the following reasoning: the envelope we have chosen contains the amount a. Therefore, the other one contains a/2 or 2a, each with probability 1/2. This means that if we change the envelope, then we get a/2 with probability 1/2 and 2a with probability 1/2, i.e., the expected outcome is

(1/2)(a/2) + (1/2)(2a) = (5/4)a.

Therefore, it is wise to change the envelope. What is wrong with this reasoning? There are several issues. Mainly, it is not generally true that if there is the amount a in one of the envelopes, then the second one contains a/2 with probability 1/2. This depends on the initial distribution of the amounts that have been put into the envelopes, which is not precisely stated in the problem. However, the paradox is rooted not only in the concealed a priori distribution. There are (even discrete) distributions for which the choice of changing the envelope always produces a greater expected outcome than that of not changing it. Nevertheless, any distribution with this property must have an infinite expected value (if the expectation is finite, then there is always a value which, when seen in the envelope, is more advantageous to keep), so it is dubious to say that it is better to get a "greater" infinity on average. □

E. Random variables, density, distribution function

10.E.1. Consider rolling a die. The set of sample points is Ω = {ω1, . . . , ω6}, where ωi means that we have rolled the number i. Further, consider the σ-field A = {∅, {ω1, ω2}, {ω3, ω4, ω5, ω6}, Ω}. Find whether the mapping X : Ω → R defined by

i) X(ωi) = i for each i ∈ {1, 2, 3, 4, 5, 6},
ii) X(ω1) = X(ω2) = −2, X(ω3) = X(ω4) = X(ω5) = X(ω6) = 3

is a random variable with respect to A.

Solution. First of all, we should make sure that the set A really satisfies all the axioms of 10.2.2, i.e., that it is a well-defined σ-field. Then, by the definition in 10.2.10, a random variable is any function X : Ω → R such that the preimage of every Borel-measurable set B ⊂ R lies in A. As for the first case, consider the interval [2, 3]. Since X−1([2, 3]) = {ω2, ω3} ∉ A, we can see that the function X is not a random variable. In the second case, we can easily see that X is a random variable: consider any interval in R. Then, exactly one of the following four cases occurs: 1) If the interval contains neither −2 nor 3, then the preimage under X is the empty set. 2) If it contains −2 but not 3, then the preimage is {ω1, ω2}. 3) On the other hand, if it contains 3 but not −2, then the preimage is {ω3, ω4, ω5, ω6}. 4) Finally, if it contains both these numbers, then the preimage is the whole sample space Ω. In each case, the preimage lies in the σ-field A. □

The distribution function is non-decreasing, and thus the left-sided limit equals the supremum. Thus, the left-sided limit of FX at x exists and equals P(A). This proves one half of proposition (2) as well as all of proposition (3). Similarly, the above sequence rn can be used to define the events An by X < x + rn. This time, it is a non-increasing sequence A1 ⊃ A2 ⊃ . . . , and its intersection is the event X ≤ x. By the second property of 10.2.5,

lim_{n→∞} P(An) = P(X ≤ x),

which verifies that the right-sided limit of F at x exists.
At the same time, property (5) is proved. The limit values of property (4) can be derived similarly by applying theorem 10.2.5, as shown for the one-sided limits above. In the first case, use the events An given by X < rn, for an arbitrary increasing sequence rn → ∞; their union is the universal event Ω. In the second case, use the events An given by X < rn, for any decreasing sequence rn → −∞; their intersection is the null event.

It remains to prove the last statement. As already shown, the discontinuity points of the distribution function are exactly those values x which the random variable takes on with non-zero probability, i.e., P(X = x) ≠ 0. Now, let Mn denote the set of points x for which P(X = x) > 1/n. Clearly, the set M of all discontinuity points equals the union of the sets Mn: M = ∪_{n=2}^{∞} Mn. Since the sum of probabilities of mutually exclusive events cannot exceed 1, Mn can contain no more than n − 1 elements. Thus, M is a countable union of finite sets, and hence it is countable. □

10.2.13. Probability measure. The probability that a random variable has a value lying in an arbitrarily chosen interval can be computed purely from the knowledge of its distribution function. The distribution function FX thus defines the entire probability distribution of the random variable X. How a particular random variable X is defined can be ignored; X can be viewed directly as a probability definition on the σ-algebra of all the Borel sets in R. In this sense, every function F : R → R satisfying the first four properties of the latter theorem is the distribution function of a unique random variable. (Check the properties of the probability function defined on all intervals this way!) The probability obtained in this way is also called a probability measure on R. Similarly, one deals with probability measures on the algebra of Borel sets in Rk in terms of the distribution functions of random vectors. In this sense, a random variable or random vector can be considered without any explicit link to a probability space (Ω, A, P).

10.2.14. Discrete random variables. Random variables behave substantially differently according to whether the non-zero probability is "concentrated in isolated points" or "continuously distributed" along (a part of) the real axis.

10.E.2. Consider a σ-field (Ω, A), where Ω = {ω1, ω2, ω3, ω4, ω5} and A = {∅, {ω1, ω2}, {ω3}, {ω4, ω5}, {ω1, ω2, ω3}, {ω1, ω2, ω4, ω5}, {ω3, ω4, ω5}, Ω}. Find a mapping X : Ω → R, as general as possible, which is a random variable with respect to A.

Solution. Since the events ω1, ω2 do not occur individually in A, the random variable X must map them to the same number, i.e., X(ω1) = X(ω2) = a for some a ∈ R. For the same reason, we must have X(ω4) = X(ω5) = b for some b ∈ R. If an interval contains both a and b, then its preimage is {ω1, ω2, ω4, ω5} ∈ A, which is fine. Clearly, the event ω3 may be mapped to an arbitrary c ∈ R. Then, we can easily verify that the X-preimage of every interval is contained in A, i.e., X is a random variable with respect to A. □

10.E.3. Consider a random variable X which takes on the value i with probability P(X = i) = 1/6, for each i = 1, . . . , 6. Find the distribution function FX(x) and draw its graph.

Solution. By definition 10.2.11, the distribution function is FX(x) = P(X < x). With this (left-continuous) convention, FX jumps by 1/6 at each of the points 1, . . . , 6: we have FX(x) = 0 for x ≤ 1, FX(x) = (⌈x⌉ − 1)/6 for 1 < x ≤ 6 (where ⌈x⌉ stands for the ceiling of x), and FX(x) = 1 for x > 6. The graph looks as follows: □
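The left-continuous convention of 10.2.11 is easy to confirm empirically for the die of 10.E.3; note in particular the value F(6) = 5/6. A minimal sketch (the names are ours):

    import math, random

    def F(x):
        """FX(x) = P(X < x) for a fair die, left-continuous as in 10.2.11."""
        if x <= 1:
            return 0.0
        if x > 6:
            return 1.0
        return (math.ceil(x) - 1) / 6

    random.seed(5)
    rolls = [random.randint(1, 6) for _ in range(200_000)]
    for x in (1, 1.5, 2, 3.999, 6, 6.5):
        empirical = sum(r < x for r in rolls) / len(rolls)   # strict inequality
        print(x, round(F(x), 3), round(empirical, 3))        # F(6) = 5/6, not 1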
10.E.4. An archer keeps shooting at a target until he hits it. He has 4 arrows at his disposal. In each attempt, the probability that he hits the target is 0.6. Let X be the random variable which gives the number of unused arrows. Find the probability mass function and the distribution function of X and draw their graphs.

Solution. Clearly, the probability of k consecutive misses followed by a hit is equal to 0.4^k · 0.6. Therefore, fX(x) = P(X = x) = 0.4^(3−x) · 0.6 for x ∈ {1, 2, 3}. If the archer misses three times, then there will be no arrow left at the end no matter how the last shot turns out, so fX(0) = 0.4^3 · 0.6 + 0.4^4 = 0.4^3 = 0.064. The distribution function is the corresponding step function: FX(x) = 0 for x ≤ 0, FX(x) = 1 for x > 3, with a jump of size fX(k) at each point k = 0, 1, 2, 3. The graphs of the probability mass function and the distribution function are as follows: □

Discrete random variables

If a random variable X assumes only finitely many values x1, x2, . . . , xn ∈ R or countably infinitely many values x1, x2, . . . , it is called a discrete random variable. One can define its probability mass function f(x) by

f(x) = P(X = xi) for x = xi, and f(x) = 0 otherwise.

Since the probability is countably additive and the singleton events X = xi are mutually exclusive, the sum of all the values f(xi) is given by either a finite sum or an absolutely convergent series: ∑_i f(xi) = 1. The probability distribution of a random variable X satisfies

P(X ∈ B) = ∑_{xi∈B} f(xi).

In particular, the distribution function is of the form

FX(t) = ∑_{xi<t} f(xi).

10.E.5. The distribution function of a random variable X is

FX(x) = 0 for x ≤ 3, FX(x) = (1/3)x − 1 for 3 < x ≤ 6, FX(x) = 1 for 6 < x.

i) Justify that FX is indeed a distribution function.

Note that the distribution function F(x) of a continuous random variable X is always differentiable. Its derivative is the density function of X, i.e., F′(x) = f(x).

10.2.16. The general case. Of course, there are also random variables with mixed behaviour, where a part of the probability is distributed continuously, while there are also values that are taken on with non-zero probability. This means that the probability measure of some singletons x ∈ R is non-zero and still X is not a discrete random variable. For instance, consider a chaotic lecturer who remains standing at his laptop with probability p throughout the entire lecture, but once he decides to move, he happens to be at any position in front of the lecture room with equal probability. Then, the random variable which corresponds to his position (assume that the desk with the laptop is at position 0 and the lecture room is bounded by the values ±1) has the following distribution function:

F(t) = 0 if t ≤ −1,
F(t) = ((1 − p)/2)(t + 1) if t ∈ (−1, 0],
F(t) = p + ((1 − p)/2)(t + 1) if t ∈ (0, 1),
F(t) = 1 if t ≥ 1.

The distribution functions of all such variables can be expressed directly using the Riemann-Stieltjes integral

F(t) = ∫_{−∞}^{t} f(x) d(g(x)),

developed in subsection 6.3.14 (page 570). In the example above, choose f(x) = 1 and

g(x) = −1 for x ≤ −1,
g(x) = ((1 − p)/2)x for −1 < x ≤ 0,
g(x) = ((1 − p)/2)x + p for 0 < x < 1,
g(x) = (1 + p)/2 for x ≥ 1.

This corresponds again to the idea that the distribution function is equivalent to a probability measure. Thus the measure of any interval is given by integrating its indicator function with respect to this measure. This is what the Riemann-Stieltjes integral achieves. The Riemann integral corresponds to the choice g(x) = x. One could also keep only the jump p at x = 0 in g (i.e., g(x) = x for x < 0, while g(x) = x + p otherwise) and leave the constant density (1 − p)/2 to f(x), which would be nonzero only on [−1, 1].
This corresponds to splitting the probability measure into its discrete part (hidden in g) and continuous part (expressed by the probability density). Notice that any distribution function can have only countably many points of discontinuity. 10.2.17. Basic discrete distributions. The requirements on the properties of probability distributions of random variables are based on the modeled situations. Here is a list of the simplest discrete dis- tributions. 925 ii) Find the density of the random variable X. iii) Compute P(2 < X < 4). Solution. a) Clearly, FX is continuous and non-decreasing. Moreover, we have lim x→−∞ F(x) = 0 and lim x→∞ F(x) = 1, as needed. b) By 10.2.14, the density of a continuous random variable is the derivative of its distribution function. We can see that on the interval (3, 6), the density is equal to f(x) = 1 3 , while on the intervals (−∞, 3) and (6, ∞), it is equal to zero. Therefore, the variable X has uniform distribution, see 10.2.20. c) We have from the definition of the distribution function that P(2 < X < 4) = FX(4) − FX(2) = 4 3 − 1 = 1 3 . □ 10.E.6. Consider a random variable X and a function f : R → R given by f(x) = a 1+x2 for x ∈ R, where a is a parameter. Suppose that f is the density of X. Find i) the value of a, ii) the distribution function of X, iii) P(−1 < X < 1). Solution. a) If the function f is to be a probability density, then its integral over R must be equal to one. This yields the condition 1 = ∫ ∞ −∞ a 1 + x2 dx = a[arctg x]∞ −∞ = aπ. Hence a = 1 π . b) By 10.2.14, the distribution function is given by the following integral: FX(x) = ∫ x −∞ f(t)dt = 1 π ∫ x −∞ dt 1 + t2 = 1 π arctg x + 1 2 . c) By b) and the definition of the distribution function, we have P(−1 < X < 1) = FX(1)−FX(−1) = 1 π · π 4 − 1 π · ( − π 4 ) = 1 2 . □ 10.E.7. The joint probability mass function of a discrete random vector is given by the following table: X Y 2 5 6 1 1 5 1 10 1 20 2 1 10 1 20 0 3 3 10 1 20 3 20 Find i) the marginal distribution and probability mass functions; ii) the joint distribution function and draw it in a suitable way; iii) P(Y > 3X). Solution. a) By 10.2.22, the marginal distribution of the random variable X is obtained by summing up the joint probability mass function over all possible values of Y in each row. Similarly, the marginal distribution of Y is obtained by CHAPTER 10. STATISTICS AND PROBABILITY THEORY Degenerate distribution The distribution which corresponds to a constant random variable X = µ is called the degenerate distribution Dg(µ). Its distribution function FX and probability mass function fX are given by FX(t) = { 0 t ≤ µ 1 t > µ fX(t) = { 1 t = µ 0 otherwise. Here follows a description of an experiment with two possible outcomes called success and failure. If the probability of success is p, then the probability of failure must be 1 − p. It is convenient to take the values 0 and 1 for the two possible results. Bernoulli distribution The distribution of a random variable X which is 0 (failure) with probability q = 1 − p and 1 (success) with probability p is called the Bernoulli distribution A(p). Its distribution function FX and probability mass function fX are given by FX(t) =    0 t ≤ 0 q 0 < t ≤ 1 1 t > 1 fX(t) =    p t = 1 q t = 0 0 t /∈ {0, 1}. Further, consider a random variable X which corresponds to n independent experiments described by the Bernoulli distribution, where X measures the number of successes. Clearly the probability mass function is non-zero exactly at the integers t = 0, . . . 
, n, which correspond to the total number of successes in the experiments (the order does not matter). The probability that the successes occur exactly in t chosen experiments out of n is p^t (1 − p)^(n−t). It is necessary to sum over all the (n t) possibilities. This leads to the binomial distribution of X:

Binomial distribution

The binomial distribution Bi(n, p) has probability mass function

fX(t) = (n t) p^t (1 − p)^(n−t) for t ∈ {0, 1, . . . , n}, and fX(t) = 0 otherwise.

The illustration shows the probability mass functions for Bi(50, 0.2) and Bi(50, 0.9). The distribution of the probability corresponds to the intuition that most outcomes occur near the value np:

summing up the entries in each column. Thus, we get the following:

X     1      2      3
fX    7/20   3/20   1/2

Y     2      5      6
fY    3/5    1/5    1/5

b) The joint distribution function at a point (a, b) is equal to the sum of all values of the joint probability mass function f(X,Y) at the points (x, y) with x ≤ a and y ≤ b. This corresponds to the values of the subtable whose lower-right corner is (a, b). Precisely, the joint distribution function F(X,Y) looks as follows:

             2 ≤ b < 5   5 ≤ b < 6   b ≥ 6
1 ≤ a < 2    1/5         3/10        7/20
2 ≤ a < 3    3/10        9/20        1/2
a ≥ 3        3/5         4/5         1

and on the intervals (−∞, 1) × R and R × (−∞, 2), F(X,Y) is clearly zero.

c) Apparently,

P(Y > 3X) = P(X = 1, Y = 5) + P(X = 1, Y = 6) = 1/10 + 1/20 = 3/20. □

10.E.8. Find the probability P(2X > Y ), provided the density of the random vector (X, Y ) is given by

f(X,Y)(x, y) = (1/6)(4x − y) for 1 ≤ x ≤ 2, 2 ≤ y ≤ 4, and 0 otherwise.

Solution. By definition, we have

P(2X > Y ) = ∫_{−∞}^{∞} ∫_{−∞}^{2x} f(X,Y)(x, y) dy dx = ∫_{1}^{2} ∫_{2}^{2x} (1/6)(4x − y) dy dx
= ∫_{1}^{2} [ (2/3)xy − (1/12)y² ]_{y=2}^{y=2x} dx
= ∫_{1}^{2} ( x² − (4/3)x + 1/3 ) dx
= [ (1/3)x³ − (2/3)x² + (1/3)x ]_{1}^{2} = 2/3. □

10.E.9. Find the marginal distribution functions and the joint and marginal densities of the random vector (X, Y ), provided

F(X,Y)(x, y) = 0 for x < 0 or y < 0; (1/4)x²y² for 0 ≤ x ≤ 1, 0 ≤ y ≤ 2; 1 for x > 1, y > 2.

Solution. The density of the random vector (X, Y ) is obtained by differentiating with respect to x and y. Thus, for 0 ≤ x ≤ 1, 0 ≤ y ≤ 2, we have f(X,Y)(x, y) = xy, and elsewhere the density is zero. The marginal density of the random variable X is then

fX(x) = ∫_{−∞}^{∞} f(X,Y)(x, y) dy = ∫_{0}^{2} xy dy = [ (1/2)xy² ]_{0}^{2} = 2x.

Similarly, for Y, we get fY(y) = y/2. The marginal distribution functions are

FX(x) = ∫_{−∞}^{x} fX(t) dt = ∫_{0}^{x} 2t dt = x²

Next, consider distributions similar to the Bernoulli process referred to in 10.2.1. Consider independent experiments with the Bernoulli distribution A(p), as in the case of the binomial distribution, and fix a positive integer r. Repeat the experiment until r successes occur. The random variable X is defined as the number of failures before the r-th success. In the case of r = 1, it is exactly the example from 10.2.1. The event X = k occurs if and only if there are exactly r − 1 successes in the first k + r − 1 experiments and the (k + r)-th experiment also ends with a success. Thus, the following probability mass function is arrived at:

Geometric distribution

The random variable X which corresponds to the number of failures before reaching the r-th success has probability distribution

P(X = k) = (k+r−1 r−1) p^r (1 − p)^k, k = 0, 1, 2, . . .

This is called the negative binomial distribution. In the case of r = 1, it is the geometric distribution.

Often the same definition is used with the successes and failures interchanged. This results in the same formula for the probability mass function with p and 1 − p interchanged.
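The negative binomial probability mass function can be checked against a direct simulation of the Bernoulli experiments; a minimal Python sketch (with the illustrative parameters p = 0.4 and r = 3):

    import math, random

    def failures_before(r, p):
        """Bernoulli(p) trials until the r-th success; counts the failures."""
        fails = succ = 0
        while succ < r:
            if random.random() < p:
                succ += 1
            else:
                fails += 1
        return fails

    def pmf(k, r, p):
        """(k+r-1 over r-1) p^r (1-p)^k, the negative binomial pmf."""
        return math.comb(k + r - 1, r - 1) * p**r * (1 - p)**k

    random.seed(7)
    r, p, n = 3, 0.4, 200_000
    counts = {}
    for _ in range(n):
        k = failures_before(r, p)
        counts[k] = counts.get(k, 0) + 1
    for k in range(6):
        print(k, round(counts.get(k, 0) / n, 4), round(pmf(k, r, p), 4))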
The geometric distribution appears in physics in connection with Einstein–Bose statistics. 10.2.18. Poisson distribution. In practice, the binomial distribution often leads to further model problems. Consider the situation that r (mutually indistinguishable) objects are to be divided into n (distinguishable) boxes, and each object is equally probable (i.e., has probability 1/n) to fall into any of the boxes. The random variable which describes the number X of objects in one fixed box can be described as follows: The admissible values are X = k, where k = 0, . . . , r, and the individual probabilities are P(X = k) = ( r k ) ( 1 n )k ( 1 − 1 n )r−k = ( r k ) (n − 1)r−k nr . Thus, the distribution of X is of the type Bi(r, 1/n). Such a variable can be encountered, for example, when describing a physical system with a huge number of gas molecules. The boxes represent small volumes of the space. 927 and FY (y) = ∫ y −∞ fY (t)dt = ∫ y 0 1 2 tdt = 1 4 y2 . □ 10.E.10. In a bag, there are 14 balls–4 red, 5 white, and 5 blue ones. We randomly take 6 balls out of the bag (without replacement). Find the distribution of the random vector (X, Y ) where X stands for the number of red balls taken and Y for the number of white balls. In addition, find the marginal distributions of X and Y . Then, compute P(X ≤ 3), P(1 ≤ Y ≤ 4). Solution. The value of the probability mass function at point (x, y) is defined as the probability P(X = x, Y = y), i. e. the probability of taking x red balls and y white balls. The number of ways how to take x red balls is (4 x ) ; for y white balls, it is (5 y ) ; and the remaining 6−x−y blue balls can be selected in ( 5 6−x−y ) ways. Altogether, there are (4 x )(5 y )( 5 6−x−y ) possibilities. The values of this expression for all x, y are in the following table. x\ y 0 1 2 3 4 5 ∑ X 0 0 5 50 100 50 5 210 1 4 100 400 400 100 4 1008 2 30 300 600 300 30 0 1260 3 40 200 200 40 0 0 480 4 10 25 10 0 0 0 45∑ Y 84 630 1260 840 180 9 3003 The values in the last column and row are the sums over all values of y and x, respectively. Then, the values of the probability mass function are obtained after dividing by the number of all possibilities how to take the 6 balls, i. e. (14 6 ) = 3003. The marginal distributions of X and Y correspond to the last column and row, respectively. The probability P(X ≤ 3) can be calculated easily from the marginal distribution of X: P(X ≤ 3) = FX(3) = 1 3003 (210+1008+1260+480) = 0.985. Similarly, for the probability P(1 ≤ Y ≤ 4), we have P(1 ≤ Y ≤ 4) = FY (4) − FY (1) = = 1 3003 (630 + 1260 + 840 + 180) = 0.969. □ 10.E.11. The density of a random vector (X, Y, Z) is f(x, y, z) = { c(x + y + z) for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, 0 ≤ z ≤ 1 0 otherwise. Find the value of the parameter c as well as the distribution function of the vector, and compute P(0 ≤ X ≤ 1 2 , 0 ≤ Y ≤ 1 2 , 0 ≤ Z ≤ 1 2 ). Solution. The integral of the density over the entire space must be equal to one. This gives us 1 = ∫ 1 0 ∫ 1 0 ∫ 1 0 c(x + y + z)dzdydx = c ∫ 1 0 ∫ 1 0 (x + y + 1 2 )dydx = = c ∫ 1 0 (x + 1)dx = 3 2 c. CHAPTER 10. STATISTICS AND PROBABILITY THEORY Observe the distribution of the molecules. Then, the behaviour of Xn as the number n of boxes as well as the number rn of objects increases so that their ratio rn/n = λ remains constant is of interest. In other words, every box is to contain (approximately) the same number λ of elements, on average. We are interested in the asymptotic behaviour of the variables Xn as n → ∞. 
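The limit carried out in the next paragraph can be anticipated numerically: the probability mass functions of Bi(n, λ/n) settle, as n grows, on the values λ^k e^(−λ)/k! derived below. A minimal sketch (with the illustrative choice λ = 2):

    import math

    lam = 2.0                                   # the fixed ratio r_n / n

    def bi(k, n, p):
        return math.comb(n, k) * p**k * (1 - p)**(n - k)

    def po(k):
        return lam**k / math.factorial(k) * math.exp(-lam)

    for n in (10, 100, 1000):
        print(n, [round(bi(k, n, lam / n), 4) for k in range(5)])
    print('Po', [round(po(k), 4) for k in range(5)])
    # the rows approach the Poisson values as n grows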
Letting limn→∞ rn/n = λ, the standard procedure (with details to be added – take it as a challenge to recall the methods from the analysis of univariate functions!) leads to: lim n→∞ P(Xn = k) = lim n→∞ ( rn k ) (n − 1)rn−k nrn = lim n→∞ rn(rn − 1) . . . (rn − k + 1) (n − 1)k 1 k! ( 1 − 1 n )rn = λk k! lim n→∞ ( 1 + −rn n rn )rn = λk k! lim m→∞ ( 1 + −λ m )m = λk k! e−λ , since the functions (1+x/n)n converge uniformly to the function ex on every bounded interval in R. Poisson distribution The Poisson distribution Po(λ) describes the random variables with probability mass function fX(k) = { λk k! e−λ k ∈ N 0 otherwise. Of course, ∞∑ k=0 fX(k) = ∑ k λk k! e−λ = e−λ ∑ k λk k! = e−λ+λ = 1. As seen above, this discrete distribution Po(λ) with an arbitrary λ > 0 (distributed into infinitely many points) is a good approximation of the binomial distributions Bi(n, λ/n), for large values of n. 10.2.19. Two examples. Besides the physical model mentioned above, such a behaviour can be encountered when observing occurrences of events in a space with constant expected density in a unit volume. Observing bacteria under a microscope, when the bacteria are expected to occur in any part of the image with the same probability, provides an example. If the “mean density of occurrence” in a unit area is λ and the whole region is divided into n identical parts, then the occurrence of k events in a fixed part is modeled by a random variable X with the Poisson distribution. When diagnosing in practice, such an observation allows us to compute the total number of bacteria with a relatively good accuracy from the actual numbers in only several randomly chosen samples. 928 Hence, c = 2 3 . By definition, the distribution function is equal to FX(x, y, z) = 2 3 ∫ x 0 ∫ y 0 ∫ z 0 (r + s + t)dtdsdr = = 2 3 ∫ x 0 ∫ y 0 (rz + sz + 1 2 z2 )dsdr = 2 3 ∫ x 0 (rzy + 1 2 y2 z + 1 2 z2 y)dr = = 2 3 (1 2 x2 zy + 1 2 y2 zx + 1 2 z2 yx) = 1 3 (x2 zy + y2 zx + z2 yx), so the wanted probability is P(0 ≤ X ≤ 1 2 , 0 ≤ Y ≤ 1 2 , 0 ≤ Z ≤ 1 2 ) = F(1 2 , 1 2 , 1 2 ) = 1 16 . □ 10.E.12. Find the value of the parameter a so that the func- tion f(x) =    0 for x ≤ 1 a ln(x) for 1 < x < 2 0 for 2 ≤ x would be the probability density of a random variable. Solution. We know that the condition for the function to be a density is ∫ ∞ −∞ f(x) = 1 Thus, we have to calculate ∫ ln(x) dx: ∫ ln(x) dx = x ln(x)− ∫ 1 dx = x ln(x)−x = x(ln(x)−1). Altogether, ∫ ∞ −∞ f(x) = ∫ 2 1 a ln(x) = a[x(ln(x)−1)]2 1 = a(2 ln(2)−1), so a = 1 2 ln(2)−1 . □ 10.E.13. A child has become lost in a forest whose shape is that of a regular hexagon. Suppose that the probability that the child happens to be in a given part of the forest is directly proportional to the size of that part, but independent of its position in the forest. • What is the probability distribution of the distance of the child from a given side (extended to a straight line) of the forest? • What is the probability distribution of the distance of the child from the closest side of the forest? Solution. • Let a be the length of the sides of the hexagon (forest). Then, the probability distribution satisfies f(x) =    0 for x ≤ 0 4 9a2 x + 2 3 √ 3a for 0 < x ≤ 1 2 √ 3a − 4 9a2 x + 2√ 3a for 1 2 √ 3a ≤ x ≤ √ 3a 0 for x > √ 3a , as for the first question. • First, let us compute the distribution function F of the wanted random variable X that corresponds to the distance of the child from the closest side. The distance can CHAPTER 10. 
STATISTICS AND PROBABILITY THEORY The second example is more challenging. We describe events which occur randomly at time t ≥ 0. Here, the probability of an occurrence in the following small time period of length h does not depend on what had happened before and equals the same value hλ for a fixed λ > 0. At the same time, the probability that the event occurs more than once in a given time period is small. Let Xt denote the random variable which corresponds to the number of occurrences of the examined event in the interval [0, t). The requirements are expressed infinitesimally. We want: • the probability of exactly one event in each time period of length h equals hλ + α(h), where the function α(h) satisfies limh→0+ α(h) h = 0; • the probability β(h) of more than one event occurring in a time period of length h to satisfy limh→0+ β(h) h = 0; • the events Xt = j and Xt+h −Xt = k to be independent for all j, k ∈ N and t, h > 0. Use the notation pk(t) = P(Xt = k), k ∈ N, and set the initial conditions p0(0) = 1 and pk(0) = 0 for k > 0. Compute directly p0(t + h) = p0(t)P(Xt+h − Xt = 0) = = p0(t)(1 − hλ − α(h) − β(h)) and similarly, pk(t + h) = P(Xt = k, Xt+h − Xt = 0) + P(Xt = k − 1, Xt+h − Xt = 1) + P(Xt ≤ k − 2, Xt+h = k) = pk(t)P(Xt+h − Xt = 0) + pk−1P(Xt+h − Xt = 1) + k−2∑ i=0 P(Xt = i, Xt+h − Xt = k − i) = pk(t)(1 − hλ − α(h) − β(h)) + pk−1(t)(hλ + α(h)) + k−2∑ i=0 pi(t)P(Xt+h − Xt = k − i). Hence (similar to in 6.1.12, page 520, the symbol o(h) is written for expressions which, when divided by h, approach zero as h → 0+) p0(t + h) − p0(t) h = −λp0(t) + 1 h o(h) pk(t + h) − pk(t) h = −λpk(t) + λpk−1(t) + 1 h o(h). Letting h → 0+, an (infinite!) system of ordinary differential equations is obtained: p′ 0(t) = −λp0(t), p0(0) = 1 p′ k(t) = −λpk(t) + λpk−1(t), pk(0) = 0 for all t > 0 and k ∈ N, with an initial condition. The first equation has a unique solution p0(t) = e−λt , 929 be anywhere in the interval I = ⟨0, √ 3 2 a⟩. Then, for y ∈ I, we have F(y) = P[X < y] = √ 3 4 a2 − ( √ 3 2 a−y)2 3 4 a2 √ 3 4 a2 √ 3 4 a2 = 1− 4( √ 3 2 a − y)2 3a2 Altogether, F(y) =    0 for y ≤ 0 1 − 4( √ 3 2 a−y)2 3a2 for y ∈ ⟨0, √ 3 2 a⟩ 1 for y ≥ √ 3 2 a Thus, the density, being the derivative of the distribution function, satisfies: f(x) =    0 for x ≤ 0 8( √ 3 2 a−y) 3a2 for y ∈ ⟨0, √ 3 2 a⟩ 0 for y ≥ √ 3 2 a □ 10.E.14. Let a random variable X have uniform distribution on an interval ⟨0, r⟩. Find the distribution function and probability density of the volume of the ball whose radius is equal to X. Solution. First, we find the distribution function F (for 0 < d < 4 3 πr3 ) F(d) = P [ 4 3 πX3 ≤ d ] = P [ X ≤ 3 √ 3d 4π ] = 3 √ 3d 4π r . Altogether, F(x) =    0 for x ≤ 0 3 √ 3 4πr3 x 1 3 for 0 < x < 4 3 πr3 1 for x ≥ 4 3 πr3 Differentiating this, we obtain the density: f(x) =    0 for x ≤ 0 3 √ 1 36πr3 x− 2 3 for 0 < x < 4 3 πr3 0 for x ≥ 4 3 πr3 □ 10.E.15. Find the value(s) of the parameter a ∈ R so that the function f(x) =    0 for x ≤ 0 ax2 for 0 < x < 3 0 for x ≥ 3 defines the probability density of a random variable X. Then, find its distribution function, probability density, and the expected value of the volume of the cube whose edge-length has probability density determined by f. Solution. Simply, a = 1 9 . Thus, the distribution function of the random variable X is FX(t) = 1 27 t3 for t ∈ (0, 3), zero for smaller values of t, and one for greater. Let Z = X3 denote the random variable corresponding to the volume of the considered cube. It lies in the interval (0, 27). 
Thus, for CHAPTER 10. STATISTICS AND PROBABILITY THEORY which can be immediately substituted into the second equation. This leads to p1(t) = λt e−λt . A trivial induction argument shows that the system has a unique solution pk(t) = (λt)k k! e−λt , t > 0, k ∈ N. It is thus verified that for every process which satisfies the three properties above, the random variable Xt which corresponds to the number of occurrences in the time period [0, t) has distribution Po(λt). In practice, these processes are connected with the failure rate of machines. 10.2.20. Continuous distributions. The simplest example of a continuous distribution is the uniform distribution of the probability throughout a fixed interval. This is also a good illustration of the fact that a simply formulated requirement does not leave many free choices in the definition. Now, the probability of X taking on a value inside an interval which is included in the sample interval (a, b) ⊂ R is required to be dependent only on the length of the interval, but not on its actual position. This means that the density fX of the random variable X should be constant and the value of this constant is given by the requirement P(a ≤ X < b) = 1. Uniform distribution For any real numbers a, b, −∞ < a < b < ∞, define the density and distribution function as follows: fX(t) =    0 t ≤ a 1 b−a t ∈ (a, b) 0 t ≥ b, FX(t) =    0 t ≤ a t−a b−a t ∈ (a, b) 1 t ≥ b. Here, the random variable X has uniform distribution. The next distribution is similar to the discrete Poisson distribution. Suppose the occurrence of a random event is observed such that its occurrences in non-overlapping intervals are independent. Thus, if p(t) is the probability of the event not occurring during an interval of length t, then of necessity p(t + s) = p(t)p(s) for all t, s > 0. Moreover, assume that p is differentiable and p(0) = 1. Then, ln p(t + s) = ln p(t) + ln p(s). Letting s → 0+ (and applying l’Hospital’s rule), ( ln(p) )′ (t) = lim s→0+ ln p(t + s) − ln p(t) s = lim s→0+ (ln p(s))′ 1 = p′ (0) p(0) = p′ (0). Thus, p′ (0) = −λ ∈ R ( Note: λ > 0, and p′ (0) cannot be positive as p(0) = 1). Then, p(t) satisfies ln p(t) = −λt + C. The initial condition leads to the only solution p(t) = e−λt . 930 t ∈ (0, 27) and the distribution function FZ of the random variable Z, we can write FZ(t) = P[Z < t] = P[X3 < t] = P[X < 3 √ t] = FX( 3 √ t) = 1 27 t. Then, the density is fZ(t) = 1 27 on the interval (0, 27) and zero elsewhere. Since this is the uniform distribution on the given interval, the expected value is equal to 13.5. □ 10.E.16. Find the value(s) of the parameter a ∈ R so that the function f(x) =    0 for x ≤ 0 ax for 0 < x < 3 0 for x ≥ 3 defines the probability density of a random variable X. Then, find its distribution function, probability density, and the expected value of the area of the square whose side-length has probability density determined by f. Solution. We proceed similarly as in the previous example. Again, we can easily find that a = 2 9 . Thus, the distribution function of the random variable X is FX(t) = 1 9 t2 for t ∈ (0, 3), zero for smaller values of t, and one for greater. Let Z = X3 denote the random variable corresponding to the area of the considered square. It lies in the interval (0, 9). Thus, for t ∈ (0, 9) and the distribution function FZ of the random variable Z, we can write FZ(t) = P[Z < t] = P[X2 < t] = P[X < √ t] = FX( √ t) = 1 9 t. Then, the density is fZ(t) = 1 9 on the interval (0, 9) and zero elsewhere. 
Since this is the uniform distribution on the given interval, the expected value is equal to 4.5. □ 10.E.17. Find the value(s) of the parameter a ∈ R so that the function f(x) =    0 for x ≤ 0 ax2 for 0 < x < 2 0 for x ≥ 2 defines the probability density of a random variable X. Then, find its distribution function, probability density, and the expected value of the volume of the cube whose edge-length has probability density determined by f. ⃝ 10.E.18. We randomly cut a line segment of length l into two pieces. Find the distribution function and the density of the area of the rectangle whose side-lengths are equal to the obtained pieces. Solution. Let us compute the distribution function: Let X denote the random variable with uniform distribution on the interval ⟨0, l⟩, which corresponds to the length of one of the pieces (then, the length of the other piece is l − X). The area S = x(l−x) of the rectangle, for x ∈ ⟨0, l⟩, can lie anywhere in the interval ⟨0, l2 /4⟩. Setting d ∈ ⟨0, l2 /4⟩, we can write F(d) = P[S ≤ d] = P[X(l − X) ≤ d] Thus, we are looking for those values of x for which x(l − x) ≤ d, which is a quadratic inequality. The roots of the corresponding quadratic equation are l− √ l2−4d 2 and l+ √ l2−4d 2 . CHAPTER 10. STATISTICS AND PROBABILITY THEORY Now, consider the random variable X which corresponds to a (random) moment when the event occurs for the first time. Apparently, the distribution function of X is given by FX(t) = 1 − p(t) = { 1 − e−λt t > 0 0 t ≤ 0. This function has the desired properties: It has values between zero and one, it is increasing and it has the required behaviour at ±∞. The density of this random variable can be obtained by differentiation of the distribution function. Exponential distribution The distribution corresponding to the continuous random variable X with density fX(t) = { λ e−λt t > 0 0 t ≤ 0. is called the exponential distribution ex(λ). The exponential distribution belongs to the more general family of important distributions with the densities of the form cxa−1 e−bx for x > 0, with given constants a > 0, b > 0, while the constant c is to be computed. The following expression is required to equal one: ∫ ∞ 0 cxa−1 e−bx dx = ∫ ∞ 0 c ( t b )a−1 e−t 1 b dt = c ba Γ(a). Γ is the famous transcendental function providing the analytic extension of the factorial function, discussed in 6.2.17 on the page 546. Gamma distribution The distribution whose density is zero for x ≤ 0, while for x > 0. It is given by f(x) = ba Γ(a) xa−1 e−bx , called the gamma distribution Γ(a, b) with parameters a > 0, b > 0. Thus, the exponential distribution is the special case of this one for the value a = 1. 10.2.21. Normal distribution. Recall the binomial distribution. If the success rate p is left constant, but the number n of experiments is increased, the probability mass function keeps its shape (although the scale changes). As n increases, the values of the probability mass function merges into a curve that should correspond to the density of a continuous distribution which is a good approximation for Bi(n, p) for large values of n. Recall the smooth function y = e−x2 /2 , mentioned in subsection 6.1.9 (page 516) as an appropriate tool for the construction of functions which are smooth but not analytic. The 931 The inequality is satisfied by exactly those values of x which lie outside this interval. 
Therefore,

P[X(l − X) ≤ d] = P[X ∈ ⟨0, l⟩ \ ((l − √(l² − 4d))/2, (l + √(l² − 4d))/2)] = (l − √(l² − 4d))/l = 1 − √(l² − 4d)/l.

Altogether,

F(x) = 0 for x ≤ 0,
F(x) = 1 − √(l² − 4x)/l for 0 ≤ x ≤ l²/4,
F(x) = 1 for x > l²/4.

The density is obtained by differentiation:

f(x) = 0 for x ≤ 0,
f(x) = 2/(l√(l² − 4x)) for 0 ≤ x ≤ l²/4,
f(x) = 0 for x > l²/4. □

10.E.19. Independent random variables X and Y have the following probability densities:

fX(t) = 0 for t ≤ 0, fX(t) = 1 for 0 < t < 1, fX(t) = 0 for 1 ≤ t,
fY(t) = 0 for t ≤ 0, fY(t) = 2t for 0 < t < 1, fY(t) = 0 for 1 ≤ t.

Determine the distribution function of the random variable giving the area of the rectangle with sides X and Y.

Solution. Denoting the area by Z = XY, we get

FZ(t) = 0 for t ≤ 0, FZ(t) = 2t − t² for 0 < t < 1, FZ(t) = 1 for 1 ≤ t. □

10.E.20. Let X, Y be independent random variables, where X has the uniform distribution on the interval (0, 2) and Y is given by its density function:

f(x) = 0 for x ≤ 0, f(x) = 2x for 0 < x < 1, f(x) = 0 for x ≥ 1.

Find the probability that Y is less than X².

Solution. Since X and Y are independent random variables, the joint density f(X,Y) : R² → R of the vector (X, Y ) is given by the product of the densities fX and fY of the individual random variables. Thus, we have

f(X,Y)(u, v) = fX(u) · fY(v) = (1/2) · 2v = v for (u, v) ∈ (0, 2) × (0, 1), and 0 otherwise.

illustration compares this curve (in the right-hand part) to the values of Bi(40, 0.5). This suggests looking for a convenient continuous distribution whose density would be given by a suitably adjusted variation of this function. The function e^(−x²/2) is everywhere positive, so it suffices to compute ∫_{−∞}^{∞} e^(−x²/2) dx. If this results in a finite value, it suffices to multiply the function by the reciprocal value. Unfortunately, this integral cannot be computed in terms of elementary functions. Luckily, multidimensional integration and Fubini's theorem can be used. Transforming to polar coordinates, we obtain

(∫_{−∞}^{∞} e^(−x²/2) dx)(∫_{−∞}^{∞} e^(−y²/2) dy) = ∫_{R²} e^(−(x²+y²)/2) dx dy = ∫_{0}^{∞} ∫_{0}^{2π} e^(−r²/2) r dθ dr = 2π

(cf. the notes at the end of subsection 8.2.5; verify that the integrated function satisfies the conditions given there, and carry out the computation thoroughly!). Hence the integral equals √(2π), so the function

f(x) = (1/√(2π)) e^(−x²/2)

is a well-defined density of a random variable.

Normal distribution

The distribution of the random variable Z with density

φ(z) = (1/√(2π)) e^(−z²/2)

is called the (standard) normal distribution N(0, 1). The corresponding distribution function

Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^(−x²/2) dx

cannot be expressed in terms of elementary functions. It is called the Gaussian function, and the graph of φ(x) is often called the Gaussian curve.

So far, the correct density which approximates the binomial distribution is not found. The diagram comparing the probability mass function of the binomial distribution to the Gaussian curve shows that the position of the maximum must be moved, and the curve must be shrunk or stretched horizontally. The first goal is easily reached by constant
Let X, Y be independent random variables, where X has density function f(x) =    0 for x ≤ 0 2x 9 for 0 < x < 3 0 for x ≥ 1, and Y has density function f(x) =    0 for x ≤ 0 x 2 for 0 < x < 2 0 for x ≥ 2. Find the probability that Y is greater than X3 . ⃝ F. Expected value, correlation Compute the expected value and variance of the binomial distribution. Solution. The direct calculation from the definitions is a nice exercise on combinatorics. We prove this statement using the properties of the expected value and variance. Using the definition of the binomial distribution (see 10.2.17), we can view the random variable X ∼ Bi(n, p) as the sum X = ∑n k=1 Yk, where Y1, . . . , Yn ∼ A(p) are independent random variables saying whether the k–th experiment was successful. Clearly, the Bernoulli distribution has expected value E Yi = p, hence by theorem 10.2.29, we have E X = ∑n k=1 E Yk = np. Similarly, we compute E(Y 2 k ) = 12 · p + 02 · (1 − p) = p, so var Yk = E(Y 2 k ) − (E Yk)2 = p − p2 . By theorem 10.2.33, we have var X = ∑n k=1 var Yk = np(1 − p). □ CHAPTER 10. STATISTICS AND PROBABILITY THEORY shift µ of the variable z, while scaling the difference x − µ by coefficient σ > 0 does the rest. Thus, there are two real parameters µ and σ > 0 and the density function is of the form: gµ,σ(x) = e−(x−µ)2 /(2σ2 ) . Simple variable substitution leads to ∫ ∞ −∞ e−(x−µ)2 /(2σ2 ) dx = √ 2πσ. Thus there is an entire two-parametric class of densities φµ,σ = 1 σ √ 2π e− (x−µ)2 2σ2 of random variables. The corresponding distributions are denoted by N(µ, σ2 ). We return to the asymptotic closeness of the normal and binomial distributions for n → ∞ after creating suitable tools. The following illustration reveals, how well this works. The discrete values correspond to Bi(40, 0.5), while the curve depicts the density of N(20, 10). 10.2.22. Distributions of random vectors. As for the scalar random variables, one defines the distribution functions and the density or the probability mass function for continuous and discrete random vectors. There are joint probability mass functions and densities. For two discrete random variables, i.e. a discrete vector (X, Y ) of random variables, define their (joint) probability mass function f(x, y) = { P(X = xi ∧ Y = yj) x = xi, y = yj 0 otherwise. A random vector (X, Y ) is called continuous, if its distribution function is defined as for continuous random variables. This means, for all a, b ∈ R, F(a, b) = P(X < a, Y < b) = = ∫ b −∞ ∫ a −∞ f(x, y)dxdy, and the function f(x, y) is called the (joint) density of the random vector (X, Y ). 933 10.F.1. An archer shoots five arrows at a target. Each time, the probability he hits is 0.6, and the individual results are independent. Let X be the random variable which corresponds to the number of hits. Determine its distribution and find its expected value and variance. Solution. Clearly, the shots are independent experiments with the Bernoulli distribution A(3 5 ). Thus, by the definition of the binomial distribution, we have X ∼ Bi(5, 3 5 ). By F, the expected value and variance of Bi(n, p) are equal to np and np(1−p), respectively, which gives E X = 3 and var X = 6 5 for our case. □ 10.F.2. Consider the discrete random variable X which takes on the values k = 0, 1, 2, 3, . . . , each with probability P(X = k) = p(1 − p)k (geometric distribution). Find E X (the expected number of failures before the first success) and var X. Solution. 
Using the definition of the expected value and the formula for summing the derivative of a geometric series, we calculate E X = ∞∑ k=0 kp(1 − p)k = p(1 − p) ∞∑ k=0 k(1 − p)k−1 = = p(1 − p) 1 p2 = 1 − p p . Similarly, using the formula for summing the second derivative of a geometric series, we compute E(X2 ) = ∞∑ k=0 k2 p(1 − p)k = (1 − p)(2 − p) p2 , hence the variance is var X = E(X2 ) − (E X)2 = 1−p p2 . □ 10.F.3. A random variable X is defined by its density fX(x) = 3 x4 for x ∈ (1, ∞) and fX(x) = 0 elsewhere. Find its distribution function, expected value, and variance. Solution. By the definition of the distribution function, we have, for x ∈ (1, ∞), FX(x) = ∫ x 1 3 t4 dt = [ − 1 t3 ]x 1 = 1 − 1 x3 . The expected value of X is equal to E X = ∫ ∞ 1 3 x3 dx = [ − 3 2x2 ]∞ 1 = 3 2 and the expected value of X2 is E(X2 ) = ∫ ∞ 1 3 x2 dx = [ − 3 x ]∞ 1 = 3. Therefore, var X = 3 − (3 2 )2 = 3 4 . □ CHAPTER 10. STATISTICS AND PROBABILITY THEORY For a general continuous random vector X = (X1, . . . , Xn), define F(a1, . . . , an) = P(X1 < a1, . . . , Xn < an) = = ∫ an −∞ · · · ∫ a1 −∞ f(x1, . . . , xn) dx1 · · · dxn, and similarly for discrete random vectors with more compo- nents. A random vector (X, Y ) with both X and Y continuous is not always a continuous vector in the above sense. For example, taking a continuous variable X, the random vector (X, 2X) is neither continuous nor discrete, since the entire probability mass is concentrated along the line y = 2x in the plane, but not in individual points. The marginal distribution for one of the variables can be obtained by summation or integration over the others. For instance, in the case of a discrete random vector (X, Y ), the events (X = xi, Y = yj) for all possible values xi and yj with non-zero probabilities for X and Y , respectively, form an exhaustive collection of events for the vector (X, Y ). Thus P(X = xi) = ∞∑ j=1 P(X = xi, Y = yj), which relates the marginal probability distribution of the random variable X to the joint probability distribution of the random vector (X, Y ). In the case of continuous random vectors, proceed similarly using integrals instead of sums. 10.2.23. Stochastic independence. It is known from subsection 10.2.3 what (in)dependence means for events. Random variables X1, · · · , Xn are (stochastically) independent if and only if for any ai ∈ R, the events X1 < a1, . . . , Xn < an are independent. In view of the definition of the distribution function F of the random vector (X1, . . . , Xn), this is equivalent to F(x1, . . . , xn) = FX1 (x1) · · · FXn (xn), where FXi are the distribution functions of the individual components. It follows that the events corresponding to Xk ∈ Ik for arbitrarily chosen intervals Ik is also independent. The probability of X1 ∈ [a, b) and simultaneously Xi ∈ (−∞, ci) for the other components is F(b, c2, . . . , cn) − F(a, c2, . . . , cn) = (FX1 (b) − FX1 (a))FX2 (c2) . . . FXn (cn), and so on. The densities and probability mass functions behave well too: Proposition. For any random vector (X1, . . . , Xn), the following two conditions are equivalent: • The random variables X1 . . . , Xn are stochastically in- dependent. • The joint distribution function F of the of random vector (X1, . . . , Xn) is the product of the marginal distribution functions FXi of the individual components. 934 10.F.4. A random variable X is defined by its density fX(x) = cos x for x ∈ ⟨0, π 2 ⟩ and fX(x) = 0 elsewhere. Find its expected value, variance, and median. Solution. 
Using the definition and integration by parts, we get E X = ∫ π 2 0 x cos xdx = [x sin x + cos x] π 2 0 = π 2 − 1. Using double integration by parts, we obtain E(X2 ) = ∫ π 2 0 x2 cos xdx = = [ x2 sin x + 2x cos x − 2 sin x ]π 2 0 = (π 2 )2 − 2, so the variance is equal to var X = (π 2 )2 −2−(π 2 −1)2 = π− 3. By definition, the distribution function is equal to FX(x) =∫ x 0 cos tdt = sin x, and the median is F−1 (0.5) = π 6 . □ 10.F.5. A random variable X is defined by its density fX(x) = λe−λx for x ≥ 0, and fX(x) = 0 elsewhere (the so-called exponential distribution; λ > 0 is a fixed parameter). Find its expected value, variance, mode (the real number where the density reaches its maximum), and median. Solution. Using the definition and integration by parts, we get E X = ∫ ∞ 0 xλe−λx dx = [ −xe−λx − 1 λ e−λx ]∞ 0 = 1 λ , E(X2 ) = ∫ ∞ 0 x2 λe−λx dx = = [ −x2 e−λx − 2x 1 λ e−λx − 2 λ2 e−λx ]∞ 0 = 2 λ2 , hence var X = E(X2 ) − (E X)2 = 1 λ2 . Since F′ X(x) = −λ2 e−λx < < 0, the density keeps decreasing. Therefore, its maximum is at zero. By definition, we have F(x) = ∫ x 0 λe−λt dt = 1 − e−λx , so the median is equal to F−1 (0.5) = − 1 λ ln(1 2 ) = ln 2 λ . □ 10.F.6. The joint probability mass function of a discrete random vector (X1, X2) is defined by π(0, −1) = c, π(1, 0) = π(1, 1) = π(2, 1) = 2c, π(2, 0) = 3c and zero elsewhere. Find the parameter c and compute the covariance cov(X1, X2). Solution. If π is to be a probability mass function, then the sum of its values over the entire domain must be equal to 1, i. e., ∑ i,j π(i, j) = c + 3.2c + 3c = 10c = 1, so c = 1 10 . The probability mass function π1 of X1 is given by the sum of the joint function over all possible values of X2, i. e., π1(i) = ∑ j π(i, j). Thus, we have π1(0) = c, π1(1) = 4c, π1(2) = 5c and zero elsewhere. Similarly, for CHAPTER 10. STATISTICS AND PROBABILITY THEORY Moreover, if all Xi are discrete random variables, then they are independent if and only if the joint probability mass function f of the random vector (X1, . . . , Xn) is the product of the marginal probability mass functions fXi of the individiual components. Similarly, if all Xi are continuous random variables, then they are independent if and only if the joint density function f of the random vector (X1, . . . , Xn) exists and it is the product of the marginal density functions fXi of the individiual components. In particular, any random vector with independent continuous components is again a continuous random vector. Proof. Many of the claims are already verified. The only nontrivial implication left is the one assuming the product formula for the joint distribution function and deriving the claim on the probability function or the density. The argument for n = 2 is shown below, the general case is analogous. Consider first two discrete independent random variables X, Y . Then fX,Y (xi, yj) = P(X = xi, Y = yi) = P(X = xi)P(Y = yj) = fX(xi)fY (yj). The joint distribution function is FX,Y (x, y) = ∑ xi 0, iii) Y = ln X, x > 0, iv) Y = 1 X , x > 0. Solution. We can simply apply the formula for the density of a transformed random variable, which yields a) fY (y) = f(ln y)1 y , b) fY (y) = 2f(y2 )y, c) fY (y) = f(ey )ey , d) fY (y) = f(1/y) 1 y2 . □ CHAPTER 10. STATISTICS AND PROBABILITY THEORY Z ∼ N(0, 1) in 10.2.21. This is verified easily. FY (y) = P(Y < y) = P(µ + σZ < y) = Φ ( 1 σ (y − µ) ) = 1√ 2π ∫ y−µ σ −∞ e−z2 /2 dz = ∫ y −∞ 1 √ 2πσ e− (x−µ)2 2σ2 dx, where the substitution x = µ + σz is used in the last step. This is exactly what is wanted. 
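The transformation rules just listed are easy to test by simulation. For instance, taking X with the exponential density f(x) = λ e^(−λx) and Y = e^X, case a) predicts fY(y) = f(ln y)/y = λ y^(−λ−1) for y > 1. A small sketch (Python with numpy; the value of λ and the sampling grid are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
lam = 3.0                          # test parameter for the density of X
x = rng.exponential(scale=1/lam, size=200_000)
y = np.exp(x)                      # transformed variable Y = e^X

# empirical histogram of Y on (1, 3) versus the predicted density f(ln y)/y
bins = np.linspace(1, 3, 41)
hist, edges = np.histogram(y, bins=bins, density=True)
mid = (edges[:-1] + edges[1:]) / 2
predicted = lam * mid**(-lam - 1)  # f(ln y) * (1/y) with f(x) = lam * exp(-lam * x)
print(np.max(np.abs(hist - predicted)))  # small, up to sampling noise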
More generally, the above formula (1) has a straightforward analog for the density of Y = ψ(X) for a continuous X in the case when ψ has a non-zero derivative (thus ψ is invertible):

(2) fY(y) = |ψ′(ψ⁻¹(y))|⁻¹ fX(ψ⁻¹(y)).

Check the formula yourself! (Start with the case when the derivative ψ′ is always positive.)

It is more complicated with more general sums of independent random variables. Consider two such continuous random variables X and Y with densities fX and fY, respectively. The distribution function of the random variable V = X + Y is computed directly (exploit the independence of X and Y and write the joint density of (X, Y) as a product):

FV(u) = ∫∫_{x+y≤u} fX(x) fY(y) dx dy.

H. Inequalities and limit theorems

Recall Markov's inequality: for any non-negative random variable X and any a > 0, it holds that P(X ≥ a) ≤ E X / a.

10.H.1. Consider a non-negative random variable X with expected value µ. With no further information about X, bound P(X > 3µ). Then, compute P(X > 3µ) if you know that X ∼ Ex(1/µ).

Solution. If the non-negative random variable X does not take zero with probability 1, then its expected value µ is positive. Therefore, the wanted probability can be bounded using Markov's inequality as P(X ≥ 3µ) ≤ µ/(3µ) = 1/3. If we know that X ∼ Ex(1/µ), then P(X > 3µ) = 1 − P(X ≤ 3µ) = 1 − F(3µ), where F is the distribution function of the exponential distribution. By definition, this is

F(x) = ∫₀ˣ (1/µ) e^(−t/µ) dt = [−e^(−t/µ)]₀ˣ = 1 − e^(−x/µ).

Hence, P(X > 3µ) = 1/e³. □

10.H.2. At a particular place, the average speed of wind is 20 kilometers per hour.
• Regardless of the distribution of the speed as a random variable, bound the probability that in a given observation, the speed does not exceed 60 km/h.
• Find the interval in which the speed lies with probability at least 0.9 if you know that the standard deviation is σ = 1 km/h.

Solution. Let X denote the random variable that corresponds to the speed. In the first case, we can only use Markov's inequality, leading to P(X ≤ 60) = 1 − P(X ≥ 60) ≥ 1 − 20/60 = 2/3. In the second case, we know the variance (or standard deviation) of the speed, so we can use Chebyshev's inequality (see 10.2.32): P(|X − 20| < x) = 1 − P(|X − 20| ≥ x) ≥ 1 − 1/x², and the right-hand side is at least 0.9 as soon as x ≥ √10 ≈ 3.2. Thus, the wanted interval is (16.8 km/h, 23.2 km/h). □

If X is the random variable corresponding to the amount won, it seems that the correct answer is “anything below the expected value E X”. As derived in 10.2.1, P(T = k) = 2⁻ᵏ, provided that the coin is fair. Summing up all the probabilities multiplied by 2ᵏ gives ∑_{k=1}^∞ 2⁻ᵏ · 2ᵏ = ∑_{k=1}^∞ 1 = ∞. Therefore, the expected value does not exist. So it seems that it is advantageous for the gambler to play even if the initial amount is very high... Simulating the game for a while shows that the amount won is somewhere around 2⁴. The reason is that no one is able to play infinitely long, hence the extremely high amounts are not feasible enough to be won, so such amounts cannot be taken seriously. In decision theory, these cases (when the expected value does not directly correspond to the evaluated utility) are called the St. Petersburg paradox, and much literature has been devoted to this topic.³

10.2.29. Properties of the expected value. In the case of simple distributions, compute the expected value directly from the definition. For instance, for the Bernoulli distribution A(p), it is immediate that E X = (1 − p) · 0 + p · 1 = p. Similarly, compute the expected value np of the binomial distribution Bi(n, p). This requires more thought.
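Before the general argument, the claim can be checked empirically; a minimal simulation sketch (Python with numpy; n and p are arbitrary test values):

import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 0.3                           # arbitrary test parameters
sample = rng.binomial(n, p, size=1_000_000)
print(sample.mean(), n * p)              # approx. 12
print(sample.var(), n * p * (1 - p))     # approx. 8.4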
The result is a direct corollary of the following general theorem since Bi(n, p) is the sum of n random variables with the Bernoulli distributions A(p). For any random variables X, Y , real constants a, b, consider the expected values of the functions of random variables X + Y and a + bX, provided the expected values E X and E Y exist. It follows directly from the definition that the constant random variable a has expected value a. Further, E(bX) = b E X, since the constant b can be factored out from the sums or in- tegrals. More generally, the expected value of the product of independent random variables X and Y can be computed as follows. Suppose the components of the vector (X, Y ) are discrete and independent, with probability mass functions fX(xi), fY (yj). Then, E(XY ) = ∑ i ∑ j xiyjfX(xi)fY (yj) = (∑ i xifX(xi) )(∑ j yjfY (yj) ) = E X E Y. Similarly, verify the equality E(XY ) = E X E Y for independent continuous random variables. 3Going back to Bernoulli, 1738, the real value is given by the utility, rather than the price. 940 10.H.3. Each yogurt of an undisclosed company contains a photo of one of 26 ice-hockey world champions. Suppose the players are distributed uniformly at random. How many yogurts must Vera buy if she wants the probability of getting at least 5 photos of Jaromír Jágr to be at least 0.95? Solution. Let X denote the random variable that corresponds to the number of obtained photos of Jaromír Jágr (parametrized by the number n of yogurts bought). Clearly, X ∼ Bi(n, 1 26 ). We are looking for the value of n for which P(X ≥ 5) = 0.95, i. e., FX(4) = P(X ≤ 4) = 0.05. In order to find it, we use the de Moivre-Laplace theorem and approximate the binomial distribution with the normal distribution (we assume that n is large, so the approximation error will be small). By F, the expected value of X is E X = n 26 , and its variance is var X = 25n 262 . Denoting the corresponding standardized variable by Z, we can reformulate the condition as 0.05 = P(X ≤ 4) = P ( Z ≤ 4 − n 26 5 √ n 26 ) = FZ ( 104 − n 5 √ n ) , where by the approximation assumption, FZ ≈ Φ is the distribution function of the normal distribution N(0, 1). Since we must have n > 104, using Φ(−x) = 1 − Φ(x), the above equation gives n − 104 = Φ−1 (0.95) · 5 √ n. Using a table of the normal distribution or appropriate software, we can learn that z(0.95) = 1.65. Solving this quadratic equation, we get n . = 228.8. Thus, Vera must buy at least 229 yogurts. □ 10.H.4. We roll a die 1200 times. Find the probability that the number of 6s lies between 150 and 250 (inclusive) using Chebyshev’s inequality, and then using Moivre-Laplace the- orem. Solution. Let X denote the random variable which corresponds to the number of 6s. Clearly, X ∼ Bi(1200, 1 6 ). By F, we have E X = 1200 · 1 6 = 200 and var X = 200(1 − 1 6 ) = 500 3 . The condition on the number of 6s says that 150 ≤ X ≤ 250, which can be written as |X−200| ≤ 50. Using Chebyshev’s inequality 10.2.32, we get P(|X−200| ≤ 50) = 1−P(|X−200| ≥ 51) ≥ 1− 500 3 · 512 ≈ 0.94. (2) The exact value of the wanted probability is given by the expression P(150 ≤ X ≤ 250) = FX(250) − FX(150), where FX is the distribution function of the binomial distribution. By definition, P(150 ≤ X ≤ 250) = 250∑ k=150 ( 1200 x ) ( 1 6 )x ( 5 6 )1200−x . This expression is hard to evaluate without a computer, so we use Moivre-Laplace theorem. Replacing X with the standardized random variable Z = √ 3(X − 200) 10 √ 5 , CHAPTER 10. 
STATISTICS AND PROBABILITY THEORY Now compute E(X + Y) for arbitrary random variables. For discrete distributions of X and Y,

E(X + Y) = ∑ᵢ ∑ⱼ (xᵢ + yⱼ) P(X = xᵢ, Y = yⱼ) = ∑ᵢ (xᵢ ∑ⱼ P(X = xᵢ, Y = yⱼ)) + ∑ⱼ (yⱼ ∑ᵢ P(X = xᵢ, Y = yⱼ)) = ∑ᵢ xᵢ P(X = xᵢ) + ∑ⱼ yⱼ P(Y = yⱼ),

where absolute convergence of the first double sum follows from the triangle inequality and the absolute convergence of the sums that stand for the expected values of the particular random variables. Absolute convergence is used in order to interchange the sums. Dealing with continuous variables X and Y, whose expected values exist, proceed analogously:

E(X + Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) fX,Y(x, y) dx dy = ∫_{−∞}^{∞} x (∫_{−∞}^{∞} fX,Y(x, y) dy) dx + ∫_{−∞}^{∞} y (∫_{−∞}^{∞} fX,Y(x, y) dx) dy = ∫_{−∞}^{∞} x fX(x) dx + ∫_{−∞}^{∞} y fY(y) dy = E X + E Y,

where absolute convergence of the integrals of the expected values E X and E Y is used to interchange the integrals by Fubini's theorem. Altogether, the expected formula

E(X + Y) = E X + E Y

is obtained, whenever the expected values E X and E Y exist. Straightforward application of this result leads to the following:

Affine nature of expected values

For any constants a, b1, . . . , bk and random variables X1, . . . , Xk,

E(a + b1X1 + · · · + bkXk) = a + b1 E X1 + · · · + bk E Xk.

The following theorem extends this behaviour with respect to affine transformations of random vectors, and shows that the expected value is invariant with respect to affine transformations, as is the arithmetic mean:

Theorem. Let X = (X1, . . . , Xn) be a random vector with expected value E X, a ∈ Rᵐ, and B ∈ Mat_mn(R) a matrix. Then, E(a + B · X) = a + B · E X.

Proof. There is almost nothing remaining to be proved. Since the expected value of a vector is defined as the vector of the expected values of the components, it suffices to restrict attention to a single item in E(a + B · X). Thus, it can be assumed that a is a scalar and B is a matrix with a single row.

Then, by 10.2.40, we have Z ∼ N(0, 1), i.e., FZ ≈ Φ. Thus,

P(150 ≤ X ≤ 250) = P(√3(150 − 200)/(10√5) ≤ Z ≤ √3(250 − 200)/(10√5)) ≈ Φ(√15) − Φ(−√15) = 2Φ(√15) − 1.

We learn that Φ(√15) ≈ 0.99994, so the wanted probability is approximately 99.988 %. □

10.H.5. At the Faculty of Informatics, 10 % of students have a grade average below 1.2 (let us call them successful). How many students must we meet if the probability that there are 8–12 % successful ones among them is to be at least 0.95? Solve this problem using Chebyshev's inequality, and then using the de Moivre-Laplace theorem.

Solution. Let X denote the random variable that corresponds to the number of successful students, parametrized by the number n of students we meet. Since a randomly met student has probability 10 % of being successful, when meeting n students, we have X ∼ Bi(n, 1/10). By F, we have E X = 0.1n and var X = 0.09n. By Chebyshev's inequality 10.2.32, the wanted probability satisfies

P(|X − 0.1n| ≤ 0.02n) = 1 − P(|X − 0.1n| ≥ 0.02n) ≥ 1 − 0.1 · 0.9n/(0.02n)² = 1 − 225/n.

The inequality 1 − 225/n ≥ 0.95, and hence P(|X − 0.1n| ≤ 0.02n) ≥ 0.95, holds for n ≥ 4500. The exact value of the probability is given in terms of the distribution function FX of the binomial distribution: P(0.08n ≤ X ≤ 0.12n) = FX(0.12n) − FX(0.08n).
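The quality of the two estimates obtained in 10.H.4 can be judged against the exact binomial value; a short sketch, assuming scipy.stats is available:

from scipy.stats import binom, norm

n, p = 1200, 1/6
# exact: P(150 <= X <= 250) for X ~ Bi(1200, 1/6)
exact = binom.cdf(250, n, p) - binom.cdf(149, n, p)
# the normal approximation from 10.H.4: 2*Phi(sqrt(15)) - 1
approx = 2 * norm.cdf(15**0.5) - 1
# the Chebyshev bound from 10.H.4
bound = 1 - (500 / 3) / 51**2
print(exact, approx, bound)  # approx. 0.9999, 0.9999, 0.94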
Using the de Moivre-Laplace theorem (see 10.2.40), we can approximate the standardized random variable Z = 10X−n 3 √ n with the standard normal distribution, FZ ≈ Φ, so 0.95 = P(0.08n ≤ X ≤ 0.12n) = P(− √ n 15 ≤ Z ≤ √ n 15 ) ≈ ≈ Φ( √ n 15 ) − Φ(− √ n 15 ) = = 2Φ( √ n 15 ) − 1. Hence √ n = 15z(0.975) and we learn n ≈ 864.4. Thus, we can see that it is sufficient to meet 865 students. □ 10.H.6. The probability that a planted tree will grow is 0.8. What is the probability that out of 500 planted trees, at least 380 trees will grow? Solution. The random variable X that corresponds to the number of trees that will grow has binomial distribution X ∼ Bi(500, 4 5 ). By F, we have E X = 400 and var X = 80. The standardized random variable is Z = X−400√ 80 . By the de CHAPTER 10. STATISTICS AND PROBABILITY THEORY Then, the expected value of a finite sum of random variables is obtained, and by the above results, that exists and is given as the sum of the expected values of the individual items. This is exactly what is wanted to be proved. □ 10.2.30. Quantiles and critical values. Introduce numerical characteristics that are analogous to those from descriptive statistics. There, the next useful characteristics are the quantiles, cf. 10.1.5. Consider a random variable X whose distribution function FX is strictly monotone. This is satisfied by any random variable whose density is nowhere equal to zero, which is the case for the normal distribution, for example. In this case, define the quantile function F−1 X simply as the inverse function (FX)−1 : (0, 1) → R. This means that the value y = F−1 (α) is such that P(X < y) = α. This corresponds precisely to the quantiles from descriptive statistics using relative frequencies for the probabilities. Quantile function For any random variable X with distribution function FX(x), define its quantile function F−1 (α) = inf{x ∈ R; F(x) ≥ α}, α ∈ (0, 1). Clearly, this is a generalization of the previous definition in the case the distribution function is strictly monotone. As seen in descriptive statistics, the most used quantiles are for α = 0.5 (the median), α = 0.25 (the first quartile), α = 0.75 (the third quartile). Similarly for deciles and percentiles when α is equal to (integer) multiples of tenths and hundredths, respectively. It follows directly from the definition that the quantile function for a given random variable X allows the determination of intervals into which the values of X fall with a chosen probability. For instance, the value Φ−1 (0.975), approximately 1.96, corresponds to percentile 97.5 for the normal distribution N(0, 1). This says that with the probability of 2.5 %, the value of such a random variable Z ∼ N(0, 1) is at least 1.96. Since the density of the variable Z is symmetric with respect to the origin, this observation can be interpreted as that there is only a 5% probability that the value of |Z| is greater 1.96. There are similar intervals and values when discussing the reliability of estimates of characteristics of random vari- ables. Critical values For a random variable X and a real number 0 < α < 1, define its critical value x(α) at level α as P(X ≥ x(α)) = α. This means that x(α) = F−1 X (1 − α) where F−1 X is the quantile function of the random variable X. 942 Moivre-Laplace theorem, we have FZ ≈ Φ, so P(X ≥ 380) = P(Z ≥ 380 − 400 √ 80 ) ≈ 1 − Φ(− √ 20 2 ) = = Φ( √ 20 2 ) ≈ 0.987. □ 10.H.7. 
Using the distribution function of the standard normal distribution, find the probability that the absolute difference between the heads and the tails in 1600 tosses of a coin is at least 82.

Solution. Let X denote the random variable that corresponds to the number of times the coin came up heads. Then X has binomial distribution Bi(1600, 1/2) (with expected value 800 and standard deviation 20), so for a large value of n = 1600, by the de Moivre-Laplace theorem, the distribution function of the variable (X − 800)/20 can be approximated with the distribution function Φ of the standard normal distribution. Thus, the wanted probability is

P = 1 − P[759 ≤ X ≤ 841] = 1 − P[−2.05 ≤ (X − 800)/20 ≤ 2.05] ≈ 2Φ(−2.05) ≈ 0.0404. □

10.H.8. Using the distribution function of the standard normal distribution, find the probability that the absolute difference between the heads and the tails in 3600 tosses of a coin is at most 66.

Solution. Let X denote the random variable that corresponds to the number of times the coin came up heads. Then X has binomial distribution Bi(3600, 1/2) (with expected value 1800 and standard deviation 30), so for a large value of n = 3600, the distribution function of the variable (X − 1800)/30 can be approximated, by the de Moivre-Laplace theorem, with the distribution function Φ of the standard normal distribution. Thus, the wanted probability is

P[1767 ≤ X ≤ 1833] = P[−1.1 ≤ (X − 1800)/30 ≤ 1.1] ≈ Φ(1.1) − Φ(−1.1) = 2Φ(1.1) − 1 ≈ 0.7287. □

10.H.9. The probability that a seed will grow is 0.9. How many seeds must we plant if we require that, with probability at least 0.995, the relative number of grown items differs from 0.9 by at most 0.034?

Solution. The random variable X that corresponds to the number of grown seeds, out of n planted ones, has binomial distribution X ∼ Bi(n, 9/10). By F, we have E X = 0.9n and var X = 0.09n, so the standardized variable is Z = (X − 0.9n)/√(0.09n).

10.2.31. Variance and standard deviation. The simple numerical characteristics concerning the variability of sample values in descriptive statistics were the variance and the standard deviation. Define them similarly for random variables.

Variance of a random variable

Given a random variable X with finite expected value, its variance is defined as var X = E((X − E X)²), provided the right-hand expected value exists. Otherwise, the variance of X does not exist. The square root √(var X) of the variance is called the standard deviation of the random variable X.

Using the properties of the expected value, a simpler formula can be derived for the variance of a random variable X whose expected value exists:

var X = E(X − E X)² = E(X² − 2X(E X) + (E X)²) = E X² − 2(E X)² + (E X)² = E X² − (E X)².

Consider how affine transformations change the variance of a random variable. Given real numbers a, b and a random variable X with expected value and variance, consider the random variable Y = a + bX. Compute

var Y = E((a + bX) − E(a + bX))² = E(b(X − E X))² = b² var X.

Thus are derived the following useful formulae:

Properties of variance

(1) var X = E(X²) − (E X)²
(2) var(a + bX) = b² var X
(3) √(var(a + bX)) = |b| √(var X)

Given a random variable X with expected value and nonzero variance, define its standardization as the random variable Z = (X − E X)/√(var X). Thus, the standardized variable is the affine transformation of the original variable whose expected value equals zero and variance equals one.

10.2.32. Chebyshev's inequality.
A good illustration of the usefulness of variance is Chebyshev's inequality. This connects the variance directly to the probability that the random variable assumes values that are distant from its expected value.

The condition in question can be written as

P(|X − 0.9n| ≤ 0.034n) = P(|Z| ≤ 0.034n/√(0.09n)) = P(|Z| ≤ 0.34√n/3) ≥ 0.995.

By the de Moivre-Laplace theorem, for large n, the distribution function can be approximated by the distribution function Φ of the normal distribution. Thus,

P(|Z| ≤ 0.34√n/3) ≈ Φ(0.34√n/3) − Φ(−0.34√n/3) = 2Φ(0.34√n/3) − 1.

Altogether, we get the condition 2Φ(0.34√n/3) − 1 ≥ 0.995. From this, we compute n ≥ (3z(0.9975)/0.34)² ≈ 615. □

10.H.10. The service life (in hours) of a certain kind of gadget has exponential distribution with parameter λ = 1/10. Using the central limit theorem, bound the probability that the total service life of 100 such gadgets lies between 900 and 1050 hours.

Solution. In exercise 10.F.5, we computed that the expected value and variance of a random variable Xi with exponential distribution are equal to E Xi = 1/λ and var Xi = 1/λ², respectively. Thus, the expected service life of each gadget is E Xi = µ = 10 hours, with variance var Xi = σ² = 100 hours². By the central limit theorem, the distribution of the transformed random variable (1/√n) ∑ᵢ₌₁ⁿ (Xi − µ)/σ = (1/100) ∑ᵢ₌₁¹⁰⁰ Xi − 10 approaches the standard normal distribution as n tends to infinity. Thus, the wanted probability for the service life of 100 gadgets

P(900 ≤ ∑ Xi ≤ 1050) = P(−1 ≤ (1/100) ∑ᵢ₌₁¹⁰⁰ Xi − 10 ≤ 0.5)

can be approximated with the distribution function of the normal distribution: P(900 ≤ ∑ Xi ≤ 1050) ≈ Φ(0.5) − Φ(−1) ≈ 0.533. □

10.H.11. We keep putting items into a chest. The expected mass of an item is 3 kg and the standard deviation is 0.8 kg. What is the maximum number of items that we can put into the chest so that with probability at least 99%, the total mass does not exceed one ton?

Solution. Let Xi denote the random variable that corresponds to the mass of the i-th item. Then, we have µ = E Xi = 3 and σ = √(var Xi) = 0.8 (in kilograms), and we want to have P(∑ᵢ₌₁ⁿ Xi ≤ 1000) = 0.99.

Chebyshev's inequality

Theorem. Consider a random variable X with finite variance, and fix an arbitrary ε > 0. Then,

P(|X − E X| ≥ ε) ≤ var X / ε².

Proof. Suppose X is continuous. Set µ = E X and compute, using the definition:

var X = ∫_{−∞}^{∞} (x − µ)² f(x) dx = ∫_{|x−µ|≥ε} (x − µ)² f(x) dx + ∫_{|x−µ|<ε} (x − µ)² f(x) dx ≥ ∫_{|x−µ|≥ε} ε² f(x) dx = ε² P(|X − µ| ≥ ε). □

The analogous proof for discrete random variables is left as an exercise for the reader. Realizing that the variance is the square of the standard deviation σ, the choice ε = kσ yields the probability P(|X − E X| ≥ kσ) ≤ 1/k².

Chebyshev's inequality helps in understanding asymptotic descriptions of limit processes. For instance, consider the sequence of random variables X1, X2, . . . with probability distributions Xn ∼ Bi(n, p), with a fixed value of p, 0 < p < 1. Intuitively, it is expected that the relative frequency of success should approach the probability p as n increases, i.e., that the values of the random variables Yn = (1/n)Xn should approach p. Clearly, E Yn = np/n = p and var Yn = np(1 − p)/n² = p(1 − p)/n. Direct application of Chebyshev's inequality yields, for any fixed ε > 0, that P(|Yn − p| ≥ ε) ≤ p(1 − p)/(nε²). Hence it is clear that, for any fixed ε > 0,

lim_{n→∞} P(|Xn/n − p| ≥ ε) = 0.
This result is known as Bernoulli’s theorem (one of many). This type of limit behaviour is called convergence in probability. Thus it is proved (as a corollary of Chebyshev’s inequality) that the random variables Yn converge in probability to the constant random variable p. 944 By the central limit theorem 10.2.40, the distribution of the random variable Sn = 1 √ n n∑ i=1 ( Xi − 3 0.8 ) = 1 0.8 √ n n∑ i=1 Xi − 3 √ n 0.8 can be approximated by the standard normal distribution. Thus, we get P( n∑ i=1 Xi ≤ 1000) = P(Sn ≤ 1000 0.8 √ n − 3 √ n 0.8 ) ≈ Φ( 1000 0.8 √ n − 3 √ n 0.8 ). We learn that z(0.99) ≈ 2.326, so the wanted n satisfies the quadratic equation 1000 0.8 √ n − 3 √ n 0.8 = 2.326, whence we get n ≈ 322. □ I. Testing samples from the normal distribution In subsection 10.3.4, we introduced the so-called twosided interval estimate of an unknown parameter µ of the normal distribution N(µ, σ2 ). In some cases, we may be interested only in an upper or lower estimate, i.e. a statistic U or L for which P(µ < U) or P(L < µ), respectively. Then, we talk about a one-sided confidence interval (−∞, U) or (L, ∞). The formula for these intervals can be derived similarly as for the two-sided interval. Now, we have for the random variable Z = √ n ¯X−µ σ ∼ N(0, 1) that 1 − α = Φ(z(1 − α)) = P(Z < z(1 − α)). Hence it immediately follows that 1 − α = P( ¯X − σ √ n z(1 − α) < µ), so L = ¯X − σ√ n z(1 − α). Similarly, we find U = ¯X + σ√ n z(1 − α), and for a distribution with unknown variance, µ ≥ ¯X − S√ n tn−1(1 − α) and µ ≤ ¯X + S√ n tn−1(1 − α). If we want to estimate the variance σ2 of a random distribution, then we use theorem 10.3.3, similarly as when we derived it for the expected value. This time, we use the second part of the theorem, by which the random variable n−1 σ2 S2 has distribution χ2 . Then, we can immediately see that 1 − α = P ( χ2 n−1(α/2) ≤ n − 1 σ2 S2 ≤ χ2 n−1(1 − α/2) ) . Thus, the two-sided 100(1 − α)% confidence interval for the variance is ( (n − 1)S2 χ2 n−1(1 − α/2) , (n − 1)S2 χ2 n−1(α/2) ) and similarly for the one-sided upper and lower estimates, we get σ2 ≤ (n − 1)S2 χ2 n−1(α) , resp. (n − 1)S2 χ2 n−1(1 − α) ≤ σ2 . CHAPTER 10. STATISTICS AND PROBABILITY THEORY 10.2.33. Covariance. We return to random vectors. In the case of the expected value, the situation is very simple — just take the vector of expected values. When characterizing the variability, the dependencies between the individual components are also of much interest. We follow the idea from 10.1.9 again. Covariance Given random variables X, Y whose variances exist, Define their covariance as cov(X, Y ) = E ((X − E X)(Y − E Y )) The basic properties of the concept can be derived very easily: Theorem. For any random variables X, Y , Z whose variances exist and real numbers a, b, c, d, cov(X, Y ) = cov(Y, X)(1) cov(X, Y ) = E(XY ) − (E X)(E Y )(2) cov(X + Y, Z) = cov(X, Z) + cov(Y, Z)(3) cov(a + bX, c + dY ) = bd cov(X, Y )(4) var(X + Y ) = var X + var Y + 2 cov(X, Y ).(5) Moreover, if X and Y are independent, then cov(X, Y ) = 0, and consequently (6) var(X + Y ) = var X + var Y. Proof. Directly from the definition, the covariance is symmetric in the arguments. The second proposition follows immediately from the properties of the expected value: cov(X, Y ) = E(X − E X)(Y − E Y ) = E(XY ) − (E Y )X − (E X)Y + E X E Y = E(XY ) − E X E Y. 
The next proposition also follows easily if the definition is expanded and the fact that the expected value of the sum of random variables equals the sum of their expected values is used. The next proposition can be computed directly: cov(a + bX, c + dY ) = = E ( (a + bX − E(a + bX))(c + dY − E(c + dY )) ) = E ( (bX − b E(X))(dY − d E(Y )) ) = E ( bd(X − E(X))(Y − E(Y )) ) = bdE ( (X − E X)(Y − E Y ) ) = bd cov(X, Y ). 945 10.I.1. We roll a die 600 times, obtaining only 45 sixes. Is it possible to say that the die is ideal at level α = 0.01? Solution. For an ideal die, the probability of rolling a six is always p = 1 6 . The number of sixes in 600 rolls is given by a random variable X with binomial distribution X ∼ Bi(600, 1 6 ). By 10.2.40, this distribution can be approximated by the distribution N(100, 250 3 ). The measured value X = 45 can be considered a random sample consisting of one item. Assuming that the variance is known and applying 10.3.4, we get that the 99% (two-sided) confidence interval for the expected value µ equals (45 − √ 250 3 z(0.995), 45 + √ 250 3 z(0.995)). We learn that the quantile is approximately z(0.995) ≈ 2.58, which gives the interval (21, 69). However, for an ideal die, we clearly have µ = 100, so our die is not ideal at level α = 0.01. □ 10.I.2. Suppose the height of 10-years-old boys has normal distribution N(µ, σ2 ) with unknown expected value µ and variance σ2 = 39.112. Taking the height of 15 boys, we get the sample mean ¯X = 139.13. Find i) the 99% two-sided confidence interval for the parameter µ, ii) the lower estimate for µ at significance level 95 %. Solution. a) By 10.3.4, the 100(1 − α)% two-sided confidence interval for the unknown expected value µ of the normal distribution is (1) µ ∈ ( ¯X − σ √ n z(1 − α/2), ¯X + σ √ n z(1 − α/2) ) , where ¯X is the sample mean of n items, σ2 is the known variance, and z(1 − α/2) is the corresponding quantile. Substituting the given values n = 15, σ ≈ 6.254 and the learned z(0.995) ≈ 2.576, we get σ√ n z(α/2) ≈ 4.16, i. e., µ ∈ (134.97, 143.29). b) The lower estimate L for the parameter µ at significance level 95 % is given by the expression L = ¯X − σ√ n z(0.95). We learn that z(0.95) ≈ 1.645, and direct substitution leads to µ ∈ (136.474, ∞). □ 10.I.3. A customer tests the quality of bought products by examining 21 randomly chosen ones. He will accept the delivery if the sample standard deviation does not exceed 0.2 mm. We know that the pursued property of the products has normal distribution of the form N(10 mm; 0.0734 mm2 ). Using statistical tables, find the probability that the delivery will be accepted. How does the answer change if the customer, in order to save expenses, tests only 4 products? Solution. The problem asks for the probability P(S ≤ 0.2). By theorem 10.3.3, when sampling n products, the random variable n−1 σ2 S2 has distribution χ2 n−1. In our case, n = 21 and σ2 = 0.0734, so P(S ≤ 0.2) = P ( 20 0.0734 S2 ≤ 20 0.0734 0.22 ) = χ2 20 ( 20 · 0.22 0.0734 ) CHAPTER 10. STATISTICS AND PROBABILITY THEORY The other propositions about the variance are quite simple corollaries: var(X + Y ) = E ( (X + Y ) − E(X + Y ) )2 = E ( (X − E X) + (Y − E Y ) )2 = E(X − E X)2 + 2 E(X − E X)(Y − E Y ) + E(Y − E Y )2 = var X + 2 cov(X, Y ) + var Y. Furthermore, if X and Y are independent, then E(XY ) = E X E Y , and hence that their covariance is zero. □ Directly from the definition, var(X) = cov(X, X). 
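Property (5) of the theorem above also holds exactly for the sample versions of variance and covariance, which gives a quick numerical illustration; a sketch (Python with numpy, arbitrary dependent test data):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # deliberately correlated with x

lhs = np.var(x + y, ddof=1)
rhs = np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * np.cov(x, y)[0, 1]
print(lhs - rhs)  # zero up to rounding errors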
The latter theorem claims that covariance is a symmetric bilinear form on the real vector space of random variables whose variance exists. The variance is the corresponding quadratic form. The covariance can be computed from the variance of the particular random variables and of their sum, as seen in linear algebra, see the property (5). Notice that the random variable, equal to the sum of n independent and identically distributed random variables Yi behaves, very much differently than the multiple nY . In fact, var(Y1 + · · · + Yn) = n var Y, var(nY ) = n2 var Y. 10.2.34. Correlation of random variables. To a certain extent, covariance corresponds to dependency between the random variables. Its relative version is called the correlation of random variables and, similarly as for the standard deviation, the following concept is defined: Correlation coefficient The correlation coefficient of random variables X and Y whose variances are finite and non-zero is defined as ρX,Y = cov(X, Y ) √ var X √ var Y . As seen from theorem 10.2.33, the correlation coefficient of random variables equals the covariance of the standardized variables 1√ varX (X − E X) and 1√ varY (Y − E Y ). The following equalities hold (here, a, b, c, d are real constants, bd ̸= 0, and X, Y are random variables with finite non-zero variances) ρa+bX,c+dY = sgn(bd)ρX,Y ρX,X = 1. Moreover, if X and Y are independent, then ρX,Y = 0. Note that if the variance of a random variable X is zero, then it assumes the value E X with probability 1. If the value of X falls into an interval I not containing E X with probability p ̸= 0, then the expression var X = E(X − E X)2 is positive. Stochastically, random variables with zero variance behave as constants. 946 The expression in the argument of the distribution function is approximately 10.9, and we can learn from the table of the χ2 distribution that χ2 20(10.9) ≈ 0.05. Thus, the probability that delivery will be accepted is only 5 %. We could have expected the probability to be low: indeed, E S2 = = σ2 = 0.0734 > 0.22 . If the customer tests only 4 products, then the probability of acceptance is given by the expression χ2 3 ( 3·0.22 0.0734 ) ≈ χ2 3(1.63). The value of the distribution function of χ2 in this argument cannot be found in most tables. Therefore, we estimate it using linear interpolation. For instance, if the nearest known points are χ2 3(0.58) = 0.1 and χ2 3(6.25) = 0.9, then χ2 3(1.63) ≈ (1.63 − 0.58) 0.9 − 0.1 6.25 − 0.58 + 0.1 ≈ 0.24. Although this results is only an estimate, we can be sure that the probability of acceptance is much greater than when testing 21 products. □ 10.I.4. From a population with distribution N(µ, σ2 ), where σ2 = 0.06, we have sampled the values 1.3; 1.8; 1.4; 1.2; 0.9; 1.5; 1.7. Find the two-sided 95% confidence interval for the unknown expected value. Solution. We have a random sample of size n = 7 from the normal distribution with known variance σ2 = 0.06. The sample mean is ¯X = 1 7 (1.3 + 1.8 + 1.4 + 1.2 + 0.9 + 1.5 + 1.7) = 1.4 and we can learn for the given confidence level α = 0.05 that z(1 − α/2) = z(0.975) ≈ 1.96. Substituting into (1), we immediately obtain the wanted interval (1.22, 1.58). □ 10.I.5. Let X1, . . . , Xn be a random sample from the distribution N(µ, 0.04). Find the least number of measurements that are necessary so that the length of the 95% confidence interval for µ would not exceed 0.16. Solution. 
Since we have a normal distribution with known variance, we know from (1) that the length of the (1 − α)% confidence interval is 2σ√ n z(1 − α/2). Substituting the given values, we get that the number n of measurements satisfies the inequality 2 · 0.2 √ n z(0.975) ≤ 0.16. Since z(0.975) ≈ 1.96, we obtain n ≥ 24.01. Thus, at least 25 measurements are necessary. □ 10.I.6. Consider a random variable X with distribution N(µ, σ2 ), where µ, σ2 are unknown. The following table shows the frequencies of individual values of this random variable: Xi 8 11 12 14 15 16 17 18 20 21 ni 1 2 3 4 7 5 4 3 2 1 Calculate the sample mean, sample variance, sample standard deviation, and find the 99% confidence interval for the expected value µ. CHAPTER 10. STATISTICS AND PROBABILITY THEORY If the covariance is a positive-definite symmetric bilinear form, then it would follow from the Cauchy-Schwarz inequality (see 3.4.3) that (1) |ρX,Y | ≤ 1 The following theorem claims more. It shows that the full correlation or anti-correlation, i.e. ρX,Y = ±1 of random variables X a Y says that they are bound by an affine relation Y = kX + c, where the sign of k corresponds to the sign in ρX,Y = ±1. On the other hand, a zero correlation coefficient says that the (potential) dependency between the variables is very far from any affine relation of the mentioned type. (Note, however, this does not mean that the variables must be independent). For instance, consider random variables Z ∼ N(0, 1) and Z2 . Then cov(Z, Z2 ) = E Z3 = 0 since the density of Z is an even function. Thus the expected value of an odd power of Z is zero, if it exists. Theorem. If the correlation coefficient is defined, then |ρX,Y | ≤ 1. Equality holds if and only if there are constants k, c such that P(Y = kX + c) = 1. Proof. A stochastic affine relation between Y and X with nonzero coefficient at Y is sought. This is equivalent to Y + sX ∼ D(c) for some fixed value of the parameter s and constant c. In such a case the variance vanishes. Thus one considers the following non-negative quadratic expression: 0 ≤ var ( Y − E Y √ varY + t X − E X √ varX ) = 1 + 2tρX,Y + t2 . The right-hand quadratic expression does not have two distinct real roots; hence its discriminant cannot be positive. So 4(ρX,Y )2 − 4 ≤ 0. Hence the desired inequality is obtained, and also the discriminant vanishes if ρX,Y = ±1. For the only (double) root t0, the corresponding random variable has zero variance; thus it asumes a fixed value with probability 1. This yields the affine relation as expected. □ 10.2.35. Covariance matrix. The variability of a random vector must be considered. This suggests considering the covariances of all pairs of components. The following definition and theorem show that this leads to an analogy of the variance for vectors, including the behaviour of the variance under affine transformations of the random variables. 947 Solution. The sample mean is given by the expression ¯X =∑ niXi/ ∑ ni. Substituting the given values, we get ¯X = 490/32 ≈ 15.3. By definition, the sample variance is S =∑ ni(Xi − ¯X)2 /( ∑ ni − 1). Substituting the given values, we get S2 = 1943/256 ≈ 7.6, so the sample standard deviation is S ≈ 2.8. The formula for the two-sided (1−α)% confidence interval for the expected value µ, when the variance is unknown, was derived at the end of subsection 10.3.4: µ ∈ ( ¯X − S √ n tn−1(1 − α/2), ¯X + S √ n tn−1(1 − α/2) ) . Substitution yields ¯X = 15.3, n = 32, S ≈ 2.8, α = 0.01, and we learn t31(0.995) ≈ 2.75. 
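The computation of 10.I.6 takes only a few lines; a sketch in Python (assuming numpy and scipy are available):

import numpy as np
from scipy.stats import t

values = np.array([8, 11, 12, 14, 15, 16, 17, 18, 20, 21])
counts = np.array([1, 2, 3, 4, 7, 5, 4, 3, 2, 1])

n = counts.sum()                                    # 32
mean = (counts * values).sum() / n                  # 490/32 = 15.3125
s2 = (counts * (values - mean)**2).sum() / (n - 1)  # sample variance
s = np.sqrt(s2)                                     # approx. 2.8
q = t.ppf(0.995, df=n - 1)                          # t_31(0.995), approx. 2.75
half = s / np.sqrt(n) * q
print(mean - half, mean + half)                     # approx. (14.0, 16.7)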
Thus, the 99% confidence interval is µ ∈ (14.0, 16.7). □

10.I.7. Using the following table of the distribution function of the standard normal distribution, find the probability that the absolute difference between the heads and the tails in 3600 tosses of a coin is greater than 90.

[Standard normal distribution table: the entry in row z and column c gives Φ(z + c) − 0.5, for z = 0.0, 0.1, . . . , 3.5 and c = 0.00, 0.01, . . . , 0.09; for example, the entry for z = 1.5, c = 0.00 is .4332, i.e. Φ(1.5) ≈ 0.9332.]

Solution. Let X denote the random variable that corresponds to the number of heads. Then, X has binomial distribution Bi(3600, 1/2) (with expected value 1800 and standard deviation 30), so by the de Moivre-Laplace theorem, for large values of n, the distribution function of the variable (X − 1800)/30 can be approximated by the distribution function Φ of the standard normal distribution.

Covariance matrix

Consider a random vector X = (X1, . . . , Xn)ᵀ all of whose components have finite variances. The covariance matrix of the random vector X is defined in terms of the expected value as (notice the vector X is viewed as a column of random variables now)

var X = E(X − E X)(X − E X)ᵀ.
Using the definition of the expected value of a vector and expanding the matrix multiplication, it is immediate that the covariance matrix var X is the symmetric matrix

( var X1         cov(X1, X2)   · · ·   cov(X1, Xn) )
( cov(X2, X1)    var X2        · · ·   cov(X2, Xn) )
( · · ·          · · ·         · · ·   · · ·       )
( cov(Xn, X1)    cov(Xn, X2)   · · ·   var Xn      )

Theorem. Consider a random vector X = (X1, . . . , Xn)ᵀ all of whose components have finite variances. Further, consider the transformed random vector Y = BX + c, where B is an m-by-n matrix of real constants and c ∈ Rᵐ is a vector of constants. Then,

var(Y) = var(BX + c) = B(var X)Bᵀ.

Proof. The claim follows from direct computation, using the properties of the expected value:

var(Y) = E((BX + c) − E(BX + c))((BX + c) − E(BX + c))ᵀ = E(B(X − E X))(B(X − E X))ᵀ = B E(X − E X)(X − E X)ᵀ Bᵀ = B(var X)Bᵀ. □

The constant part of the transformation has no impact, while with respect to the linear part of the transformation, the covariance matrix behaves as the matrix of a quadratic form.

10.2.36. Moments and moment function. The expected value and variance reflect the square of the deviation of the values of a random variable from the average. In descriptive statistics, one also examines the skewness of the data, and it is natural to examine the variability of random variables in terms of higher powers of the given random variable X. The characteristic E(Xᵏ) is called the k-th moment; the characteristic µₖ = E((X − E X)ᵏ) is called the k-th central moment of a random variable X. What also comes in handy is the k-th absolute moment, given by E |X|ᵏ. From the definition it follows that for a continuous random variable X,

E Xᵏ = ∫_{−∞}^{∞} xᵏ fX(x) dx.

Thus, the wanted probability is

P = 1 − P[1755 ≤ X ≤ 1845] = 1 − P[−1.5 ≤ (X − 1800)/30 ≤ 1.5] = 2Φ(−1.5) ≈ 0.1336,

where the last value was learned from the table. □

10.I.8. The probability that a newborn baby is a boy is 0.515. Find the probability that there are at least the same number of girls as boys among ten thousand babies.

Solution. The number X of boys among ten thousand babies has binomial distribution with expected value 5150 and variance 5150 · 0.485, and the standardized variable (X − 5150)/√(5150 · 0.485) is approximately N(0, 1). Hence

P[X < 5000] = P[(X − 5150)/√(5150 · 0.485) < −150/√(5150 · 0.485)] ≈ Φ(−3.001) ≈ 0.00135. □

10.I.9. Using the distribution function of the standard normal distribution, find the probability that we get at least 3100 sixes out of 18000 rolls of a six-sided die.

Solution. We proceed similarly as in the exercises above. X has binomial distribution Bi(18000, 1/6). We find the expected value ((1/6) · 18000 = 3000) as well as the standard deviation √((1/6)(1 − 1/6) · 18000) = 50. Therefore, the distribution function of the variable (X − 3000)/50 can be approximated by the distribution function Φ of the standard normal distribution:

P[X ≥ 3100] = P[(X − 3000)/50 ≥ (3100 − 3000)/50] = P[(X − 3000)/50 ≥ 2] ≈ 1 − Φ(2) ≈ 0.0228. □

10.I.10. A public opinion agency organizes a survey of preferences of five political parties. How many randomly selected respondents must answer so that the probability that for each party, the survey result differs from the actual preference by no more than 2% is at least 0.95?

Solution. Let pᵢ, i = 1, . . . , 5, be the actual relative frequency of voters of the i-th political party in the population, and let Xᵢ denote the number of voters of this party among n randomly chosen people. Note that given any five intervals, the events corresponding to Xᵢ/n falling into the corresponding interval may be dependent.
If we choose n so that for each i, Xi/n falls into the given interval with probability at least 1 − ((1 − 0.95)/5) = 0.99, then the desired condition is sure to hold even in spite of the dependencies. Thus, let us look for n such that P[|X n − p| < 0.02] ≥ 0.99. First of all, we CHAPTER 10. STATISTICS AND PROBABILITY THEORY Similarly, for a discrete random variable X whose probability is concentrated into points xi, E Xk = ∑ i xi k fX(xi). The next theorem shows that all the moments completely describe the distribution of the random variable, as a rule. For the sake of computations, it is advantageous to work with a power series in which the moments appear in the coefficients. Since the coefficients of the Taylor series of a function at a given point can be obtained using differentiation, it is easy to guess the right choice of such a function: Moment generating function Given a random variable X, consider the function MX(t) : R → R defined by MX(t) = E etX = {∑ i etxi fX(xi) if X is discrete ∫ ∞ −∞ etx fX(x) dx if X is continuous. If this expected value exists, the moment generating function of the random variable X can be discussed. It is clear that this function MX(t) is always analytic in the case of discrete random variables with finitely many values xi. Theorem. Let X be a random variable such that its analytic moment generating function on an interval (−a, a) exists. Then, MX(t) is given on this interval by the absolutely convergent series MX(t) = ∞∑ k=0 tk k! E Xk . If two random variables X and Y share their moment generating functions over a nontrivial interval (−a, a), then their distribution functions coincide. Proof. The verification of the first statement is a simple exercise on the techniques of differential and integral calculus. In the case of discrete variables, there are either finite sums or absolutely and uniformly converging series. In the case of continuous variables, there are absolutely converging integrals. Thus, the limit process and the differentiation can be interchanged. Since d dt etx = x etx , it is immediate that dk dtk MX(t) = E Xk , as expected. The second claim is obvious for two discrete variables X and Y with only a finite number of values x1, . . . , xk for which either fX(xi) ̸= 0 or fY (xi) ̸= 0. Indeed, the functions etxi are linearly independent functions and thus their coefficients in the common moment function M(t) = etx1 f(xi) + · · · + etxk f(xk) must be the shared probability function values for both random variables X and Y . 949 rearrange the expression: P [ X n − p < 0.02 ] = P [ −0.02 < X n − p < 0.02 ] = = P [−0.02 · n < X − pn < 0.02 · n] = = P [ −0.02 · n √ np(1 − p) < X − pn √ np(1 − p) < 0.02 · n √ np(1 − p) ] = = Φ ( 0.02 · n √ np(1 − p) ) − Φ ( − 0.02 · n √ np(1 − p) ) = = 2Φ ( 0.02 · n √ np(1 − p) ) − 1, where Φ is the distribution function of the normal distribution. Thus, let us solve the inequality 2Φ ( 0.02 · n √ np(1 − p) ) − 1 ≥ 0.99 Φ ( 0.02 · n √ np(1 − p) ) ≥ 0.995 Since the distribution function is increasing, the last condition is equivalent to 0.02 · n √ np(1 − p) ≥ Φ−1 (0.995) 0.02 · n √ np(1 − p) ≥ 2.576 √ n ≥ 50 · 2.576 · √ p(1 − p) ≤ 1 2 =⇒ =⇒ n ≥ (25 · 2.276)2 · 4147 Here, we used the fact that the maximum of the function p(1 − p) is 1 4 , and it is reached at p = 1 2 . We can see that if e. g. p . = 0.1, then √ p(1 − p) = 0.3 and the value of the least n is lower. 
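The final bound, n ≥ (25 · 2.576)² ≈ 4148 in the worst case p = 1/2, and the effect of a smaller p, can be computed directly; a sketch (Python with numpy/scipy):

import numpy as np
from scipy.stats import norm

z = norm.ppf(0.995)                  # approx. 2.576
n_worst = (z / (2 * 0.02))**2        # worst case p = 1/2: (25 z)^2
print(int(np.ceil(n_worst)))         # 4148

p = 0.1                              # a less popular party
n_p = (z * np.sqrt(p * (1 - p)) / 0.02)**2
print(int(np.ceil(n_p)))             # approx. 1493, considerably fewer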
This accords with our expectations: for less popular parties, it suffices to have fewer respondents (if the agency estimates the gain of such party to be around 2 % without asking anybody, then the wanted precision is almost guaranteed). □ 10.I.11. Two-choice test. Consider random vectors Y1 and Y2 all of whose components are pairwise independent random variables with normal distribution, and suppose that the components of vector Yi have expected value µi, and the variance σ is the same for all the components of both vectors. Use the general linear model to test the hypothesis whether µ1 = µ2. Solution. We will proceed quite similarly as in subsection 10.3.12 of the theoretical part. This time, we can write both CHAPTER 10. STATISTICS AND PROBABILITY THEORY In the case of continuous variables X and Y sharing their generating function M(t), the argument is more involved and an indication only is provided. Notice that M(t) is analytic and thus it is defined for all complex numbers t, |t| < a. In particular, M(it) = ∫ ∞ −∞ eitx f(x)dx, which is the inverse Fourier transform of f(x), up to the constant multiple √ 2π, see 7.2.5 (on page 657). If this works for all t, then clearly f is obtained by the Fourier transform of ( √ 2π)−1 M(it) and thus must be the same for both X and Y . Further details, in particular covering general random variables, would need much more input from measure theory and Fourier analysis, and thus it is not provided here. □ It can be also shown that the assumptions of the theorem are true whenever both MX(−a) < ∞ and MX(a) < ∞. 10.2.37. Properties of moment function. By the properties of the exponential functions, it is easy to compute the behaviour of the moment function under affine transformations and sums of independent random variables. Proposition. Let a, b ∈ R and X, Y be independent random variables with moment generating functions MX(t) and MY (t), respectively. Then, the moment generating functions of the random variables V = a + bX and W = X + Y are Ma+bX(t) = eat MX(bt) MX+Y (t) = MX(t)MY (t) Proof. The first formula can be computed directly from the definition: MV (t) = E e(a+bX)t = E eat e(bt)X = eat MX(bt). As for the second formula, recall that etX and etY are independent variables. Use the fact that the expected value of the product of independent random variables equals the product of the expected values. MW (t) = E et(X+Y ) = E etX etY = E etX E etY = MX(t)MY (t). □ 10.2.38. Normal and binomial distributions. As an illustrating example, compute the moment function of two random variables X ∼ N(µ, σ) and X ∼ Bi(n, p). Moment generating function for N(µ, σ) Proposition. If X ∼ N(µ, σ), then MX(t) = eµt e σ2t2 2 . In particular, it is an analytic function on all of R. 950 vectors Yi into one column, and we consider the model           Y11 ... Y1n1 Y21 ... Y2n2           =           1 0 ... ... 1 0 1 1 ... ... 1 1           ( β1 β2 ) + σZ. We will work with arithmetic means of the individual vectors ¯Y1 and ¯Y2. Direct application of the general formula from theory gives the estimate b in the form ( b1 b2 ) = ( n1 + n2 n2 n2 n2 )−1 ( n1 ¯Y1 + n2 ¯Y2 n2 ¯Y2 ) = = 1 n1n2 ( n2 −n2 −n2 n1 + n2 ) ( n1 ¯Y1 + n2 ¯Y2 n2 ¯Y2 ) = ( ¯Y1 ¯Y2 − ¯Y1 ) and for the matrix C = (XT X)−1 , where X is the 2-column matrix with zeros and ones from our model, we have C = ( 1 n1 − 1 n1 − 1 n1 1 n1 + 1 n2 ) . Thus, we test the hypothesis µ1 = µ2, which means that we test whether β2 = 0. 
For this, it is suitable to use the statistic
$$T = \frac{\bar Y_2 - \bar Y_1}{S}\Bigl(\frac{n_1 n_2}{n_1+n_2}\Bigr)^{\!1/2},$$
where the standard deviation $S$ is substituted as
$$S^2 = \frac{1}{n_1+n_2-2}\Bigl(\sum_{i=1}^{n_1}(Y_{1i}-\bar Y_1)^2 + \sum_{i=1}^{n_2}(Y_{2i}-\bar Y_2)^2\Bigr).$$
The distribution of this statistic is $t_{n_1+n_2-2}$, so the null hypothesis $\mu_1 = \mu_2$ is rejected at level $\alpha$ if $|T| \ge t_{n_1+n_2-2}(\alpha)$. □

10.I.12. In JZD¹ Tempo, the milk yield of their cows was measured during five days, the results being 15, 14, 13, 16 and 17 hectoliters. In JZD Boj, which had the same number of cows, the same measurement was performed during seven days, the results being 12, 16, 13, 15, 13, 11, 18 hectoliters.
a) Find the 95% confidence interval for the milk yield of JZD Boj's cows, and the 95% confidence interval for the milk yield of JZD Tempo's cows.
b) On the 5% level, test the hypothesis that both farms have cows of the same quality.
Suppose that the milk yield of the cows in each day is given by the normal distribution. Solve these problems assuming that there are no data from previous measurements, and then assuming that the previous measurements showed that the standard deviation was $\sigma = 2$ hl.

¹JZD (jednotné zemědělské družstvo), an agricultural cooperative farm, created by forced collectivization in the 1950s in Czechoslovakia.

Proof. Suppose $Z \sim N(0,1)$. Then
$$M_Z(t) = \int_{-\infty}^{\infty} e^{tx}\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-\frac12(x^2-2tx+t^2-t^2)}\,dx = e^{t^2/2}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-(x-t)^2/2}\,dx = e^{t^2/2},$$
where use is made of the fact that in the last-but-one expression, for every fixed $t$, the density of a continuous random variable is integrated; hence this integral equals one. Substituting into the formula for the moment generating function of $\mu + \sigma Z$ (see proposition 10.2.37), we obtain for $X \sim N(\mu,\sigma^2)$ that
$$M_X(t) = e^{\mu t}\, e^{\frac{\sigma^2 t^2}{2}},$$
again a function analytic on the entire $\mathbb{R}$. □

In particular, the moments of $Z$ of all orders exist. Substitute $\frac12 t^2$ into the power series for the exponential function, and calculate them all:
$$M_Z(t) = \sum_{k=0}^{\infty}\frac{1}{k!}\Bigl(\frac{t^2}{2}\Bigr)^{k} = \sum_{k=0}^{\infty}\frac{1}{k!\,2^k}\,t^{2k} = 1 + 0\,t + \frac12\,t^2 + 0\,t^3 + \frac{3}{4!}\,t^4 + \cdots$$
In particular, the expected value of $Z$ is $\operatorname{E}Z = 0$, and its variance is $\operatorname{var}Z = \operatorname{E}Z^2 - (\operatorname{E}Z)^2 = 1$. Further, all moments of odd orders vanish, $\operatorname{E}Z^4 = 3$, etc. Hence the sum of independent normal variables $X \sim N(\mu,\sigma^2)$ and $Y \sim N(\mu',\sigma'^2)$ has again the normal distribution, $X + Y \sim N(\mu+\mu',\ \sigma^2+\sigma'^2)$; note that it is the variances which add up, since $M_X(t)M_Y(t) = e^{(\mu+\mu')t}\,e^{\frac{(\sigma^2+\sigma'^2)t^2}{2}}$.

Similarly, considering the discrete random variable $X \sim \operatorname{Bi}(n,p)$,
$$M_X(t) = \operatorname{E}e^{tX} = \sum_{k=0}^{n}\binom{n}{k}(p\,e^t)^k(1-p)^{n-k} = \bigl(p\,e^t + (1-p)\bigr)^n = \bigl(p(e^t-1)+1\bigr)^n = 1 + npt + \tfrac12\bigl(n(n-1)p^2 + np\bigr)t^2 + \cdots$$
is computed. Of course, the same can be computed even more easily using proposition 10.2.37, since $X$ is the sum of $n$ independent variables $Y \sim \operatorname{A}(p)$ with the Bernoulli distribution. Therefore, $\operatorname{E}e^{tX} = (\operatorname{E}e^{tY})^n = (p\,e^t + (1-p))^n$. All the moments of the variable $Y$ equal $p$; therefore $\operatorname{E}Y = p$, while $\operatorname{var}Y = p(1-p)$. From the moment generating function $M_X(t)$, $\operatorname{E}X = np$ and $\operatorname{var}X = \operatorname{E}X^2 - (\operatorname{E}X)^2 = np(1-p)$.

Solution. First of all, let us compute the results for the known variance. In order to find the confidence interval, we use the statistic
$$U = \frac{\bar X - \mu}{\sigma/\sqrt n},$$
which has the standardized normal distribution (see 10.2.21). Then, the confidence interval is (see 10.3.4)
$$\Bigl(\bar X - \frac{\sigma}{\sqrt n}\,z(\alpha/2),\ \bar X + \frac{\sigma}{\sqrt n}\,z(\alpha/2)\Bigr),$$
where $\alpha = 0.05$. Now, it suffices to substitute the specific values. For JZD Tempo, we thus get the sample mean
$$\bar X_1 = \frac{15+14+13+16+17}{5} = 15,$$
and using appropriate software, we can learn that $z(0.025) = 1.96$, which gives the interval
$$\Bigl(15 - \frac{2}{\sqrt 5}\,1.96,\ 15 + \frac{2}{\sqrt 5}\,1.96\Bigr) = (13.25;\ 16.75).$$
For JZD Boj, we get
$$\bar X_2 = \frac{12+16+13+15+13+11+18}{7} = 14,$$
so the 95% confidence interval for the milk yield of their cows is $(12.52;\ 15.48)$.

If the variance of the measurements is not known, we use the so-called sample variance for the estimate. In order to find the confidence interval, we use the statistic
$$T = \sqrt n\,\frac{\bar X - \mu}{S},$$
which has Student's distribution with $n-1$ degrees of freedom (see also 10.3.4). Then, we can analogously obtain the 95% confidence interval
$$\Bigl(\bar X - \frac{S}{\sqrt n}\,t_{n-1}(\alpha/2),\ \bar X + \frac{S}{\sqrt n}\,t_{n-1}(\alpha/2)\Bigr).$$
For the values of JZD Tempo, we get the sample variance
$$S_1^2 = \frac{0^2 + (-1)^2 + (-2)^2 + 1^2 + 2^2}{4} = 2.5,$$
i.e., $S_1 \doteq 1.58$. Further, we have $t_4(0.025) \doteq 2.78$, so the 95% confidence interval for JZD Tempo is $(13.03;\ 16.97)$. For JZD Boj, we get the sample variance $S_2^2 = 6$, so the wanted confidence interval is $(11.73;\ 16.27)$.

b) If we compare the expected values of milk yield in both farms, then this is a comparison of the expected values of two independent samples from the normal distribution. In the case of unknown variances, we further assume that the variance is the same for both farms.

10.2.39. Skewness and kurtosis. Since the third central moment is given in terms of third powers of deviations from the expected value, it expresses to a certain extent the symmetry of the distribution of the random variable around the expected value. In descriptive statistics, we describe this by the coefficient of skewness. For random variables, we similarly use the characteristic
$$\gamma_1 = \frac{\operatorname{E}(X - \operatorname{E}X)^3}{(\sqrt{\operatorname{var}X})^3},$$
which is called the coefficient of skewness of a random variable $X$. Another commonly used characteristic is the kurtosis of a random variable $X$, defined as
$$\gamma_2 = \frac{\operatorname{E}(X-\operatorname{E}X)^4}{(\operatorname{var}X)^2} - 3.$$
The standard normal distribution has third central moment equal to zero and the fourth one equal to 3. Thus, the kurtosis is standardized so that its value for the standard normal distribution is zero. For a general distribution, the kurtosis provides a comparison with the normal distribution. In practice, there are other standardizations of skewness coefficients and kurtosis.

10.2.40. Law of large numbers. Now, we can consider the key tools which connect probability and statistics. We start with the generalization of Bernoulli's theorem about the binomial distribution, discussed at the end of subsection 10.2.32. The random variables $\frac1n X_n$, where $X_n \sim \operatorname{Bi}(n,p)$, can be viewed as the arithmetic means of $n$ independent variables with distribution $\operatorname{A}(p)$, and Bernoulli's theorem then says that these means converge to $p$ in probability. Such a proposition holds in general. Independence of the variables is not needed; the mere fact that $\operatorname{cov}(X_i, X_j) = 0$ guarantees that the variances sum up.

The law of large numbers

Proposition. Consider a sequence of pairwise uncorrelated random variables $X_1, X_2, \ldots$ which have the same finite expected value $\operatorname{E}X_i = \mu$. Moreover, assume the variances are bounded, so that $\operatorname{var}X_i \le C$ for a fixed constant $C$. Then for any $\varepsilon > 0$,
$$\lim_{n\to\infty} P\Bigl(\Bigl|\frac1n\sum_{i=1}^n X_i - \mu\Bigr| < \varepsilon\Bigr) = 1.$$

Proof. By Chebyshev's inequality, just as at the end of subsection 10.2.32,
$$P\Bigl(\Bigl|\frac1n\sum_{i=1}^n X_i - \mu\Bigr| \ge \varepsilon\Bigr) \le \frac{\operatorname{var}\bigl(\frac1n\sum_{i=1}^n X_i - \mu\bigr)}{\varepsilon^2} = \frac{\frac{1}{n^2}\sum_{i=1}^n \operatorname{var}X_i}{\varepsilon^2} \le \frac{C}{n\varepsilon^2}.$$

Thus, let us examine the hypothesis assuming the known variances $\sigma_1^2 = \sigma_2^2 = 4$.
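Before carrying out the test, the intervals of part a) can be reproduced with a short script (a sketch of ours, assuming scipy; the helper name ci is ours, not the book's):

```python
# Confidence intervals for a normal mean: z-interval (known sigma)
# and t-interval (sample variance), as in part a) above.
import numpy as np
from scipy.stats import norm, t

def ci(data, alpha=0.05, sigma=None):
    x = np.asarray(data, dtype=float)
    n, xbar = len(x), x.mean()
    if sigma is not None:                     # known variance: z-interval
        half = sigma / np.sqrt(n) * norm.ppf(1 - alpha / 2)
    else:                                     # unknown variance: t-interval
        half = x.std(ddof=1) / np.sqrt(n) * t.ppf(1 - alpha / 2, df=n - 1)
    return xbar - half, xbar + half

tempo = [15, 14, 13, 16, 17]
boj = [12, 16, 13, 15, 13, 11, 18]
print(ci(tempo, sigma=2), ci(boj, sigma=2))   # matches the intervals above
print(ci(tempo), ci(boj))                     # (up to rounding)
```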
We use the statistic
$$U = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{\bar X_1 - \bar X_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0,1),$$
where $\mu_1$ and $\mu_2$ are the unknown expected values of milk yield in the examined farms, and $n_1, n_2$ are the numbers of measurements. This statistic has, as indicated, the standardized normal distribution. We reject the hypothesis at the 5% level if and only if the absolute value of the statistic $U$ is greater than $z(0.025)$, i.e., if and only if 0 does not lie in the 95% confidence interval for the difference of the expected values of milk yield in both farms. For the specific values, we get
$$U = \frac{15-14}{\sqrt{\frac45 + \frac47}} \doteq 0.854.$$
Thus, we have $|U| < z(0.025) = 1.96$, so the hypothesis that the expected values of milk yield are the same in both farms is not rejected at the 5% level. The reached p-value of the test (see 10.3.9) is 39.4%, so we did not get much closer to rejecting the hypothesis (the probability that the value of the examined statistic is less than 0.854, provided the null hypothesis holds, is 60.6%).

If we do not know the variances of the measurements but we know that they must be equal in both farms, we use the statistic
$$K = \frac{(\bar X_1 - \bar X_2) - (\mu_1-\mu_2)}{S_*\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} = \frac{\bar X_1 - \bar X_2}{S_*\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim t_{n_1+n_2-2},$$
where
$$S_*^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}.$$
For the specific values, we get $K \doteq 0.796$ and $|K| < t_{10}(0.025) = 2.2281$, so again, the null hypothesis is not rejected. The reached p-value of the test is 44.6%, which is even greater than in the above test. □

10.I.13. Analysis of variance with one-way classification. For $k \ge 2$ independent samples $Y_i$ of size $n_i$ from normal distributions with equal variance, use a linear model to test the hypothesis that all the expected values of the individual samples are equal.

Solution. The technique is quite similar to that of the above exercise. The hypothesis to be tested is equivalent to stating that a submodel holds in which all the components of the random vector $Y$, created by joining the given $k$ vectors $Y_i$, have the same expected value.

Thus, the probability $P$ is bounded from below by
$$P\Bigl(\Bigl|\frac1n\sum_{i=1}^n X_i - \mu\Bigr| < \varepsilon\Bigr) \ge 1 - \frac{C}{n\varepsilon^2},$$
which proves the proposition. □

Thus, existence and uniform boundedness of the variances suffices for the means of pairwise uncorrelated variables $X_i$ with zero expected value to converge (in probability) to zero.

10.2.41. Central limit theorem. The next goal is more ambitious. In addition to the law of large numbers, the stochastic properties of the fluctuation of the means $\bar X_n = \frac1n\sum_{i=1}^n X_i$ around the expected value $\mu$ need to be understood. We focus first on the simplest case of sequences of independent and identically distributed random variables $X_i$; a more general version of the theorem is formulated afterwards, with comments on the proofs only.

Consider a sequence of normalized random variables $X_i$: assume $\operatorname{E}X_i = 0$ and $\operatorname{var}X_i = 1$. Assume further that the moment generating function $M_X(t)$ exists and is shared by all the variables $X_i$. The arithmetic means $\frac1n\sum_{i=1}^n X_i$ are, of course, random variables with zero expected value, yet their variances are $\frac{n}{n^2} = \frac1n$. Thus, it is reasonable to renormalize them to
$$S_n = \frac{1}{\sqrt n}\sum_{i=1}^n X_i,$$
which are again standardized random variables. Their moment generating functions are (see proposition 10.2.37)
$$M_{S_n}(t) = \operatorname{E}\, e^{\frac{t}{\sqrt n}\sum_i X_i} = \Bigl(M_X\bigl(\tfrac{t}{\sqrt n}\bigr)\Bigr)^{n}.$$
Since it is assumed that the variables Xi are standardized, MX( t √ n ) = 1 + 0 t √ n + 1 t2 2n + o (t2 n ) , where again o(G(n)) is written for expressions which, when divided by G(n), approach zero as n → ∞, see subsection 6.1.12. Thus, in the limit, lim n→∞ MSn (t) = lim n→∞ ( 1 + t2 2n + o ( 1 n ) )n = e t2 2 . This is just the moment generating function of the normal distribution Z ∼ N(0, 1), see the end of subsection 10.2.35. Thus, the standardized variables Sn asymptotically have the standard normal distribution. We have thus proved a special version of the following fundamental theorem. Although the calculation is merely a manipulation of moment generating functions, many special cases were proved in different ways, providing explicit estimates for the speed of convergence, which of course is useful information in practice. Notice that the following theorem does not require the probability distributions of the variables Xi to coincide! 953 Thus, the used model is of the form                Y11 ... Y1n1 Y21 ... Yk1 ... Yknk                =               1 0 · · · 0 ... ... ... 1 0 · · · 0 0 1 · · · 0 ... ... ... 0 0 · · · 1 ... ... ... 0 0 · · · 1                   µ1 µ2 ... µk     + σZ. We can easily compute estimates for the expected values µi using arithmetic means: ¯Yi = 1 ni ni∑ j=1 Yij. Hence we get the estimate ˆYij = ¯Yi, so the residual sum of squares is of the form RSS = k∑ i=1 ni∑ j=1 (Yij − ¯Yi)2 . The estimate of the common expected value in the considered submodel is ¯Y = 1 n k∑ i=1 ni∑ j=1 Yij = 1 n k∑ i=1 ni ¯Yi, where n = n1 + · · · + nk, and the residual sum of squares in this submodel is RSS0 = k∑ i=1 ni∑ j=1 (Yij − ¯Y )2 . In the original model, there are k independent parameters µi, while in the submodel, there is a single parameter µ, so the tested statistic is of the form F = (n − k) (k − 1) (RSS0 − RSS) RSS . □ J. Linear regression We already met the linear regression in chapter three, subsection ??. Now, we will try to apply the same principle to problems which are often studied by statisticians. One standard application of the linear regression is “laying a line” through given data. Thus, we have a sequence of measurements for which we record the values of two variables between which we anticipate linear dependency. A classical example is the dependency of a son’s height on his father’s height. CHAPTER 10. STATISTICS AND PROBABILITY THEORY Central limit theorem Theorem. Consider a sequence of independent random variables Xi which have the same expected value E Xi = µ, variance var Xi = σ2 > 0 and uniformly bounded third absolute moment E |Xi|3 < C. Then, the distribution of the random variable Sn = 1 √ n n∑ i=1 ( Xi − µ σ ) satisfies lim n→∞ P(Sn < x) = Φ(x), where Φ is the distribution function of the standard normal distribution. Note that the central limit theorem gives a result on asymptotic behaviour which says that the distribution functions of certain variables approach the standard normal distribution. Such behaviour is called convergence in distribution. This type of convergence is weaker than convergence in probabil- ity. The assumption that all Xi are independent and identically distributed was not fully exploited in the argumentation above. Only the knowledge of E Xi = 0 and var Xi = 1 was used. The assumption of the uniformly bounded third absolute moments of Xi can be used to prove the existence of the moment generating functions. 
The estimate on $\operatorname{E}|X_i|^3$ can then be used to complete the proof exactly as above. There are many more general results. We mention at least Lyapunov's central limit theorem, formulated as follows: Consider a sequence of random variables $X_i$ with finite expected values $\mu_i$ and variances $\sigma_i^2$. Write
$$s_n^2 = \sum_{i=1}^n \sigma_i^2$$
and assume that for some $\delta > 0$,
$$\lim_{n\to\infty}\frac{1}{s_n^{2+\delta}}\sum_{i=1}^n \operatorname{E}|X_i - \mu_i|^{2+\delta} = 0.$$
Then $\frac{1}{s_n}\sum_{i=1}^n (X_i - \mu_i)$ converges in distribution to $Z \sim N(0,1)$.

The previous version of the central limit theorem is derived by choosing $\delta = 1$. Then $s_n = \sigma\sqrt n$, and the condition of Lyapunov's theorem reads
$$0 \le \lim_{n\to\infty} n^{-3/2}\sigma^{-3}\sum_{i=1}^n \operatorname{E}|X_i - \mu_i|^3 \le C\sigma^{-3}\lim_{n\to\infty} n^{-3/2+1} = 0.$$

10.2.42. De Moivre-Laplace theorem. Historically, the first formulated special case of the central limit theorem was that of variables $Y_n$ with binomial distribution $\operatorname{Bi}(n,p)$. They can be viewed as the sum of $n$ independent variables $X_i$ with Bernoulli distribution $\operatorname{A}(p)$, $0 < p < 1$. These variables have moment generating functions, and $\operatorname{E}|X_i|^3 = p < 1$.

10.J.1. Find the linear regression model for the dependence of $Y$ on $X$, based on the following lists of measured data: $X = [1, 4, 5, 7, 10]$, $Y = [3, 7, 8, 12, 18]$.

Solution. In order to find the parameters of the regression line, use the formulas derived in 10.3.12. Using the method of least squares, we try to minimize the distance of the vector $b_1X + b_0$ from the vector $Y$ with respect to the parameters $b_1$ and $b_0$. This distance, as we know from chapter two, is minimal for the orthogonal projection of the vector $Y$ onto the vector subspace generated by the vectors $(1,\ldots,1)$ and $(x_1,\ldots,x_n)$. For the parameters $b_0, b_1$ of the regression line $Y = b_1X + b_0$, we obtain
$$b_1 = \frac{\sum_{i=1}^n (x_i-\bar x)(Y_i - \bar Y)}{\sum_{i=1}^n (x_i-\bar x)^2} = \frac{(1-5.4)(3-9.6) + \cdots + (10-5.4)(18-9.6)}{(1-5.4)^2 + (4-5.4)^2 + (5-5.4)^2 + (7-5.4)^2 + (10-5.4)^2} = 1.677.$$
Now, we can easily calculate the coefficient $b_0$: $b_0 = \bar Y - b_1\bar x = 0.5442$. Therefore, the wanted linear dependency is $Y = 1.677\cdot X + 0.5442$. Note that the method treats the roles of the variables $X$ and $Y$ symmetrically: in the same way, we could have obtained the dependency of $X$ on $Y$, namely $X = 0.5867\cdot Y - 0.2322$. □

Remark. Think about why the linear regression model of the dependency of $X$ on $Y$ cannot be obtained by merely expressing $X$ from the linear regression model of the dependency of $Y$ on $X$.

Remark. In many real situations, the dependency of the variables is clearly given, for example if one of the variables is time.

10.J.2. An orbital station has measured, at the same instant of five consecutive days, the following velocities of an unknown cosmic object (in km/s): 10, 11.4, 13.1, 15.8, and 18.7. Estimate the object's velocity on the tenth day.

Solution. Here, it is good to notice that the velocity does not change linearly with time (the acceleration is increasing). Thus, we can hypothesize that the object is being attracted to another one by gravitational force; then its velocity would be approximately a quadratic function of time. So let us use the method of least squares to lay a quadratic function (as precise as possible) through the measured data. The procedure is the same as if we made the linear regression of the vector $v = (v_1, \ldots, v_n)$ dependent on $x = (x_1,\ldots,x_n)$ and $x^2 = (x_1^2,\ldots,x_n^2)$. This method is called quadratic regression. Thus, we are looking for a vector of parameters $b = (b_0, b_1, b_2)$ such that the variable $b_2x^2 + b_1x + b_0$ estimates $y$ as well as possible; see also the sketch below.
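The fit can be checked numerically (a sketch of ours, assuming numpy; the book's own computation via the matrix of values follows):

```python
# Quadratic least squares fit of the measured velocities.
import numpy as np

days = np.arange(1.0, 6.0)                  # days 1..5
v = np.array([10, 11.4, 13.1, 15.8, 18.7])

X = np.column_stack([np.ones_like(days), days, days**2])   # columns 1, x, x^2
b, *_ = np.linalg.lstsq(X, v, rcond=None)
print(b)                   # ≈ (9.26, 0.47, 0.29)
print(b @ [1, 10, 100])    # day 10: ≈ 42.49 with unrounded coefficients
```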
Thus, the central limit theorem says in this case that the random variables
$$S_n = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{X_i - p}{\sqrt{p(1-p)}} = \frac{X - np}{\sqrt{np(1-p)}}$$
behave asymptotically as the standard normal distribution. This can be formulated as follows: the random variable $X \sim \operatorname{Bi}(n,p)$ behaves as a random variable with normal distribution $N(np,\ np(1-p))$ as $n$ increases. This behaviour is demonstrated exactly in the illustration at the end of 10.2.21. In practice, the approximation of the binomial distribution by the normal distribution is usually considered appropriate if $np(1-p) > 9$.

We illustrate the result with a concrete example. Suppose it is desired to know what percentage of students like a given course, with an error of at most 5%. The number of people who like the course among $n$ randomly chosen people should behave as the random variable $X \sim \operatorname{Bi}(n,p)$. Further, suppose the result is desired to be correct with confidence (i.e., probability again) of at least 90%. Thus,
$$P\Bigl(\Bigl|\frac1n X - p\Bigr| < 0.05\Bigr) \simeq 0.9$$
is desired, by choosing a high enough number $n$ of students to ask. Approximate
$$0.9 \simeq P\Bigl(\Bigl|\frac1n X - p\Bigr| < 0.05\Bigr) = P\Bigl(-\frac{0.05\,n}{\sqrt{np(1-p)}} < \frac{X-np}{\sqrt{np(1-p)}} < \frac{0.05\,n}{\sqrt{np(1-p)}}\Bigr) \simeq \Phi\Bigl(\frac{0.05\,n}{\sqrt{np(1-p)}}\Bigr) - \Phi\Bigl(-\frac{0.05\,n}{\sqrt{np(1-p)}}\Bigr) = 2\,\Phi\Bigl(\frac{0.05\,n}{\sqrt{np(1-p)}}\Bigr) - 1,$$
where the symmetry of the density function of the normal distribution is exploited. Thus,
$$\Phi\Bigl(\frac{0.05\,n}{\sqrt{np(1-p)}}\Bigr) \simeq \frac12(1+0.9) = 0.95$$
is wanted. This leads to the choice (recall the definition of the critical values $z(\alpha)$ for a variable $Z$ with standard normal distribution in subsection 10.2.30)
$$\frac{0.05\,n}{\sqrt{np(1-p)}} \simeq z(0.05) = 1.64485.$$
Since $p(1-p)$ is at most $\frac14$, the necessary number of students can be bounded by $n > 270$, independently of $p$.

Let us build the matrix $X$ of the values of the independent variables:
$$X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \end{pmatrix},$$
and the vector of parameters $b = (b_0, b_1, b_2)$ can be computed by (1):
$$b = (X^TX)^{-1}X^Tv \doteq (9.26;\ 0.47;\ 0.29).$$
Then, the wanted quadratic estimate is $v = 0.29x^2 + 0.47x + 9.26$, so the estimated velocity on the tenth day is approximately 42.96 km/s (about 42.5 km/s when the unrounded coefficients are used). In the model of classic linear regression, we would get $v = 2.18x + 7.26$, which yields 29.06 km/s for the tenth day. The difference between these estimates is quite large. This illustrates that analysis of the situation is a very important part of statistics. □

K. Bayesian data analysis

10.K.1. Consider the Bernoulli process defined by a random variable $X \sim \operatorname{Bi}(n,\theta)$ with binomial distribution, and assume that the parameter $\theta$ is a random variable with uniform distribution on the interval $(0,1)$. We define the success chance in our process as the variable $\gamma = \frac{\theta}{1-\theta}$. What is the density of this variable $\gamma$?

Solution. Intuitively, we can feel that the distribution is not uniform. Denoting the wanted probability density by $f(s)$, we can use the relation between $\theta$ and $\gamma$ to compute $\theta = \frac{\gamma}{1+\gamma}$. In addition, we can immediately see that the probability density of $\gamma$ is non-zero only for positive values of the variable. Now, we can formulate the statement as the requirement
$$(1)\qquad \Theta = P(\theta < \Theta) = P(\gamma < \Gamma) = \int_0^\Gamma f(s)\,ds, \qquad\text{where } \Gamma = \frac{\Theta}{1-\Theta}.$$
Since $\Theta = \frac{\Gamma}{1+\Gamma}$, differentiation with respect to the variable upper bound $\Gamma$ gives the defining formula for $f(s)$:
$$f(s) = \Bigl(\frac{s}{s+1}\Bigr)' = \frac{1}{(s+1)^2}.$$
Indeed, the wanted density gives much higher probability to low values of the chance than to high ones. □
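The density just derived can be corroborated by simulation (our sketch, assuming numpy):

```python
# Monte Carlo check of f(s) = 1/(1+s)^2 for gamma = theta/(1-theta),
# with theta uniform on (0, 1).
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(size=1_000_000)
gamma = theta / (1 - theta)

print((gamma < 1).mean())              # P(gamma < 1) = P(theta < 1/2) = 0.5
counts, edges = np.histogram(gamma, bins=30, range=(0, 3))
emp = counts / (len(gamma) * (edges[1] - edges[0]))   # empirical density
mid = (edges[:-1] + edges[1:]) / 2
print(np.abs(emp - 1 / (1 + mid) ** 2).max())         # close to zero
```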
We could see in subsection 10.3.7 that when taking the Bayesian approach with the binomial model of the probability distribution of a random variable $X \sim \operatorname{Bi}(n,\theta)$, we are interested in its probability mass function $f_X(k) = \binom{n}{k}\theta^k(1-\theta)^{n-k}$. Viewed as a function of $\theta$, this is (up to a constant multiple) the conditional probability density of $\theta$ given $X = k$, for the uniform a priori probability distribution of the variable $\theta$ on the interval $(0,1)$. Thus, it is just the a posteriori probability distribution of $\theta$ corresponding to the result $X = k$ of the experiment. The following exercise concerns the general class of these probability distributions.

10.2.43. Important distributions. In the sequel, we return to statistics. It should be of no surprise that we work with characteristics of random vectors similar to the sample mean and variance, as well as with relative quotients of such characteristics, etc. We consider several such cases.

Consider a random variable $Z \sim N(0,1)$, and compute the density $f_Y(x)$ of the random variable $Y = Z^2$. Clearly, $f_Y(x) = 0$ for $x \le 0$, while for positive $x$,
$$F_Y(x) = P(Y < x) = P(-\sqrt x < Z < \sqrt x) = \int_{-\sqrt x}^{\sqrt x}\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz = \int_0^x \frac{1}{\sqrt{2\pi}}\,t^{-1/2}e^{-t/2}\,dt.$$
Differentiation leads to
$$f_Y(x) = \frac{d}{dx}F_Y(x) = \frac{1}{\sqrt{2\pi}}\,x^{-1/2}e^{-x/2}.$$
This distribution is called $\chi^2$ with one degree of freedom, written $Y \sim \chi^2$. We work with sums of such independent variables. All fall into a general class of distributions whose densities are of the form $f_X(x) = c\,x^{a-1}e^{-bx}$ for $x > 0$, while $f_X(x) = 0$ for non-positive $x$; the distribution $\chi^2$ corresponds to the choice $a = b = 1/2$. This case is already thoroughly discussed as an example in subsection 10.2.20. Such a function is the density exactly for the constant $c = \frac{b^a}{\Gamma(a)}$. Thus, it is the distribution $\Gamma(a,b)$ with density, for positive $x$,
$$f_X(x) = \frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}.$$
In general, the $k$-th moment of such a variable $X$ is easily computed:
$$\operatorname{E}X^k = \int_0^\infty x^k\,\frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}\,dx = \frac{\Gamma(a+k)}{\Gamma(a)\,b^k}\int_0^\infty \frac{b^{a+k}}{\Gamma(a+k)}\,x^{a+k-1}e^{-bx}\,dx = \frac{\Gamma(a+k)}{\Gamma(a)\,b^k},$$
since the integral of the density of $\Gamma(a+k, b)$ in the last expression must be equal to one. In particular, $\operatorname{E}X = \frac{\Gamma(a+1)}{b\,\Gamma(a)} = \frac{a}{b}$, while
$$\operatorname{var}X = \frac{\Gamma(a+2)}{b^2\,\Gamma(a)} - \frac{a^2}{b^2} = \frac{(a+1)a - a^2}{b^2} = \frac{a}{b^2}.$$
Similarly, the moment generating function can be computed for all values $t$, $-b < t < b$:
$$M_X(t) = \int_0^\infty e^{tx}\,\frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}\,dx = \frac{b^a}{(b-t)^a}\int_0^\infty \frac{(b-t)^a}{\Gamma(a)}\,x^{a-1}e^{-(b-t)x}\,dx = \frac{b^a}{(b-t)^a}.$$

10.K.2. Find the basic characteristics of the so-called beta-distribution $\beta(a,b)$ with probability density of the form
$$f_Y(y) = \begin{cases} C\,y^{a-1}(1-y)^{b-1} & y \in (0,1), \\ 0 & \text{otherwise.}\end{cases}$$

Solution. The constant $C$ must be chosen as the multiplicative inverse of the integral $\int_0^1 y^{a-1}(1-y)^{b-1}\,dy$, which is a function $B(a,b)$, known as the beta-function in mathematical analysis and other sciences (e.g. physics). The gamma-function, which generalizes the discrete values of the factorial, emerges in the following calculation:
$$\Gamma(x)\Gamma(y) = \int_0^\infty e^{-t}t^{x-1}\,dt\cdot\int_0^\infty e^{-s}s^{y-1}\,ds = \int_0^\infty\!\!\int_0^\infty e^{-t-s}\,t^{x-1}s^{y-1}\,dt\,ds$$
(substitution $t = rq$, $s = r(1-q)$)
$$= \int_{r=0}^\infty\int_{q=0}^1 e^{-r}(rq)^{x-1}\bigl(r(1-q)\bigr)^{y-1}\,r\,dq\,dr = \int_0^\infty e^{-r}r^{x+y-1}\,dr\cdot\int_0^1 q^{x-1}(1-q)^{y-1}\,dq = \Gamma(x+y)\,B(x,y).$$
Thus, we get the general formula
$$B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)},$$
and it follows from the properties of the gamma-function that, for positive integers $n, k$,
$$B(n-k+1,\ k+1) = \frac{k!\,(n-k)!}{(n+1)!} = \frac{1}{n+1}\binom{n}{k}^{-1}.$$
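Both formulas are easy to check numerically (our sketch, assuming scipy):

```python
# Check B(a,b) = Gamma(a)Gamma(b)/Gamma(a+b) and the integer identity above.
from math import comb
from scipy.special import beta, gamma

a, b = 2.5, 4.0
print(beta(a, b), gamma(a) * gamma(b) / gamma(a + b))      # coincide

n, k = 10, 3
print(beta(n - k + 1, k + 1), 1 / ((n + 1) * comb(n, k)))  # coincide
```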
We can directly compute that the expected value of the variable $X \sim \beta(a,b)$ with beta-distribution is (applying $\Gamma(z+1) = z\,\Gamma(z)$)
$$\operatorname{E}X = \frac{B(a+1,b)}{B(a,b)} = \frac{a}{a+b}.$$
If $a = b$, then the expected value and the median are $\frac12$. We can also directly calculate the variance:
$$\operatorname{var}X = \operatorname{E}(X - \operatorname{E}X)^2 = \frac{ab}{(a+b)^2(a+b+1)}.$$
Thus, for $a = b$, we get $\operatorname{var}X = \frac{1}{8a+4}$, which shows that the variance decreases as $a = b$ increases. For $a = b = 1$, we get the ordinary uniform distribution on the interval $(0,1)$. □

Thus, for the sum of independent variables $Y = X_1 + \cdots + X_n$ with distributions $X_i \sim \Gamma(a_i, b)$, the moment generating function (for values $|t| < b$) is obtained:
$$M_Y(t) = \Bigl(\frac{b}{b-t}\Bigr)^{a_1+\cdots+a_n},$$
that is, $Y \sim \Gamma(a_1 + \cdots + a_n,\ b)$. It is essential that all of the gamma distributions share the same value of $b$. As an immediate corollary, the density of the variable $Y = Z_1^2 + \cdots + Z_n^2$ is obtained, where $Z_i \sim N(0,1)$. As just shown, this is the gamma distribution $Y \sim \Gamma(n/2, 1/2)$; hence its density is
$$f_Y(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,x^{n/2-1}e^{-x/2}.$$
This special case of a gamma distribution is called $\chi^2$ with $n$ degrees of freedom. Usually, it is denoted by $Y \sim \chi^2_n$.

10.2.44. The F-distribution. In statistics, it is often wanted to compare two sample variances, so we need to consider variables which are given as a quotient $U = \frac{X/k}{Y/m}$, where $X \sim \chi^2_k$ and $Y \sim \chi^2_m$. Suppose $f_X(x)$ and $f_Y(x)$ are the densities of independent random variables $X$ and $Y$, and suppose $f_Y$ is non-zero only for positive values of $x$. Compute the distribution function of the random variable $U = cX/Y$, where $c > 0$ is an arbitrary constant. By Fubini's theorem, the order of integration with respect to the individual variables can be interchanged:
$$F_U(u) = P\bigl(X < (u/c)Y\bigr) = \int_0^\infty\int_{-\infty}^{uy/c} f_X(x)f_Y(y)\,dx\,dy = \int_0^\infty\Bigl(\int_{-\infty}^u \frac{y}{c}\,f_X(ty/c)f_Y(y)\,dt\Bigr)dy = \int_{-\infty}^u\Bigl(\frac1c\int_0^\infty y\,f_X(ty/c)f_Y(y)\,dy\Bigr)dt.$$
This expression for $F_U(u)$ shows that the density $f_U$ of the random variable $U$ equals
$$f_U(u) = \frac1c\int_0^\infty y\,f_X(uy/c)\,f_Y(y)\,dy.$$
Substitute the densities of the corresponding special gamma distributions for $X \sim \chi^2_k$ and $Y \sim \chi^2_m$, and set $c = m/k$. The random variable $U = \frac{X/k}{Y/m}$ has density $f_U(u)$ equal to
$$\frac{(k/m)^{k/2}}{2^{(k+m)/2}\,\Gamma(k/2)\Gamma(m/2)}\int_0^\infty y^{(k+m)/2-1}\,e^{-y(1+ku/m)/2}\,dy.$$
The integrand in the latter integral is, up to the right constant multiple, the density of the distribution of a random variable $Y \sim \Gamma\bigl((k+m)/2,\ (1+ku/m)/2\bigr)$. Hence the multiple can be rescaled (notice $u$ is constant there) in order to get

10.K.3. In the situation of the problem before the previous one (10.K.1), assume that the success probability $\theta$ in the Bernoulli process is a random variable with probability distribution $\beta(a,b)$. What is the probability distribution of the variable $\gamma = \frac{\theta}{1-\theta}$? What is special about it when $a = b = p$?

Solution. We have already discussed the special case of the uniform distribution $\beta(1,1)$. Thus, we can continue with the equality (1), where we used the form of this distribution. Now the left-hand side contains, instead of $\Theta$, the expression
$$\frac{1}{B(a,b)}\int_0^\Theta t^{a-1}(1-t)^{b-1}\,dt.$$
When differentiating, we must use the rule for the differentiation of an integral with a variable upper bound. Thus, we get for the wanted density
$$B(a,b)\,f(s) = \Bigl(\frac{s}{s+1}\Bigr)^{a-1}\Bigl(1 - \frac{s}{s+1}\Bigr)^{b-1}\frac{1}{(s+1)^2} = \frac{s^{a-1}}{(s+1)^{a+b}}.$$
The picture shows the densities for $a = b = p = 2, 5, 15$. This supports the intuition that equal and not too small values of $a = b = p$ correspond to the most probable value $\theta = \frac12$, so the density of the chance is greatest around one.
The higher p, the lower the variance of this variable. □ 10.K.4. Show that the Bernoulli experiment, described by a random variable X ∼ Bi(n, θ), and the a priori probability of a random variable θ with beta-distribution, the a posteriori probability also has beta-distribution with suitable parameters which depend on the experiment results. What is the a posteriori expected value of θ (i. e., the Bayesian point estimate of this random variable)? Solution. As justified in subsection 10.3.7 of the theoretic part, the a posteriori probability density is, up to an appropriate constant, given as the product of the a priori probability density g(θ) = 1 B(a, b) θa−1 (1 − θ)b−1 and the probability of the examined variable X provided the value of θ occurred. Thus, assuming k successes in the CHAPTER 10. STATISTICS AND PROBABILITY THEORY the integral to evaluate to one. The density fU (u) is then expressed as Γ((k + m)/2) Γ(k/2)Γ(m/2) ( k m )k/2 uk/2−1 ( 1 + k m u )−(k+m)/2 . This distribution is called the Fisher-Snedecor distribution with k and m degrees of freedom, or F-distribution in short. 10.2.45. The t-distribution. One encounters another useful distribution when examining the quotient of variables Z ∼ N(0, 1) and √ X/n. Here X ∼ χ2 n. (We are interested in the quotient of Z and the standard deviation of some sample). Compute first the distribution function of Y = √ X (note that X, and hence Y as well, take only positive values with non-zero probability) FY (y) = P( √ X < y) = P(X < y2 ) = ∫ y2 0 1 2n/2Γ(n/2) xn/2−1 e−x/2 dx = ∫ y 0 1 2n/2−1Γ(n/2) tn−1 e−t2 /2 dt. Hence the density of the random variable Y is fY (y) = 1 2n/2−1Γ(n/2) yn−1 e−y2 /2 . The same method can be used as in the previous subsection with the random variable U = cZ/Y , setting c = √ n, Y = √ X. This leads to the random variable T = Z √ X/n . Similar computation as the one above yields that the density fT satisfies fT (t) = Γ((n + 1)/2) Γ(n/2) √ nπ ( 1 + t2 n )−(n+1)/2 . This is called the Student’s t-distribution with n degrees of freedom. 10.2.46. Multidimensional normal distribution. Consider a random vector Z = (Z1, . . . , Zn) with independent components Zi ∼ N(0, 1). Then its covariance matrix is equal to the unit matrix, i.e., var Z = In. Random vectors are often encountered which are an affine transformation U = a + BZ of such a vector Z, where a is an arbitrary constant vector in Rm and B is an m-by-n constant matrix. As derived in theorems 10.2.29 and 10.2.35, these random vectors have expected value E U = a and covariance matrix var U = V = BBT (since the covariance matrix of Z is the identity matrix). Therefore, this covariance matrix is always positive-semidefinite. The random vector U is said to have multivariate normal distribution Nm(a, V ). 958 Bernoulli experiment, we get the a posteriori density (the sign used instead of equality denotes “proportional”) g(θ|X = k) ∝ P(X = k|θ)g(θ) ∝ ∝ θk (1 − θ)n−k θa−1 (1 − θ)b−1 = = θa+k−1 (1 − θ)b+n−k−1 . Thus, we have indeed obtained the density (up to a constant, which we need not evaluate) of the a posteriori distribution for θ with distribution B(a + k, b + n − k). Its a posteriori expected value is ˆθ = a + k a + b + n . For n and k approaching infinity so that k/n → p, our a posteriori estimate also satisfies ˆθ → p. Thus, we can see that for large values of n and k, the observed fraction of successful experiments outweighs the a priori assumption. On the other hand, for small values, the a priori assumption is very important. □ 10.K.5. 
We have data about accident rates for N = 20 drivers in the last n = 10 years (the k-th item corresponds to the number of years when the k-th driver had an accident): 0, 0, 2, 0, 0, 2, 2, 0, 6, 4, 3, 1, 1, 1, 0, 0, 5, 1, 1, 0. We assume that the probabilities pj, j = 1, . . . , N, that the j-th driver has an accident in a given year are constants. For each driver, estimate the probability that s/he has an accident in the following year (in order to determine the individual insurance fee, for instance). 2 Solution. We introduce random variables Xij with value 0 if the i-th driver has no accident in the j-th year, and 1 otherwise. The individual years are considered to be independent. Thus, we can assume that the random variables Sj = ∑n i=1 Xji that correspond to the number of accidents in all the n = 10 years have distribution Bi(n, pj). Of course, we could estimate the probabilities for all drivers altogether, i. e., using the arithmetic mean ˆp = 1 N n∑ j=1 Sj 1 n = 1 20 29 10 = 0.145. However, consider the homogeneity of the distribution of the variables Xj, they can hardly be accounted equal, so such estimate would be misleading. On the other hand, the opposite extreme, i. e., a totally independent and individual estimate ˆpj = 1 n Sj is also inappropriate, since we surely do not want to set zero insurance fee until the first accident happens. The realistic method is to use the same assumption for the a priori distribution of the probabilities pj of accident 2This problem is taken from the contribution M. Friesl, Bayesovské odhady v některých modelech, published in: Analýza dat 2004/II (K. Kupka, ed.), Trilobyte Statistical Software, Pardubice, 2005, pp. 21-33. CHAPTER 10. STATISTICS AND PROBABILITY THEORY For any multivariate normal distribution Nm(a, V ), consider again the affine transformation W = c + DU with a vector of constants c ∈ Rk and an arbitrary k-by-m constant matrix D. Direct calculation leads to W = c + D(a + BZ) = (c + Da) + (DB)Z, which is a random vector W ∼ Nk(c + Da, DBBT DT ). Thus, the covariance matrix of the multivariate normal distribution behaves as a quadratic form with respect to affine transformations. This straightforward idea shows that any linear combination of components of a random vector with the multivariate normal distribution is a random variable with the normal distribution. Similarly, any vector obtained by choosing only some of the components of the vector U is again a random vector with the multivariate normal distribution. Note that when the random vector Z ∼ Nn(0, In) is transformed with an orthogonal matrix QT , then the joint distribution function of the random vector U = QT Z can be computed directly. If the transformation in coordinates as t = QT z is written, then its inverse is z = Qt, and the Jacobian of this transformation is equal to one. Hence (note that∑ i z2 i = ∑ i t2 i . As in chapter 3, write z < u if all components satisfy zi < ui) FU (u) = P(Ui < ui, i = 1, . . . , n) = = ∫ · · · ∫ QT z 0. The experiment of tossing the coin n times allows the adjustment of the distribution within the preferred class. Thus, we build on some assumptions about the distribution and adjust the prior distribution in view of the experiment. This approach is called Bayesian statistics. The first approach is based on the purely mathematical abstraction that probabilities are given by the frequencies of event occurrences in data samples which are so large that they can be approximated with infinite models. 
The central limit theorem can be used to estimate their confidence. From the statistical point of view, the probability is an idealization of the relative frequencies of the cases when an examined result

L. Processing of multidimensional data

Sometimes, we need to process multidimensional data: for each of $n$ objects, we determine $p$ characteristics. For instance, we can examine the marks of several students in various subjects.

10.L.1. In his experiments, J. G. Mendel examined 10 pea plants, and each was examined for the number of yellow and green seeds. The results of the experiment are summarized in the following table:

plant number  1   2   3   4   5   6   7   8   9   10
yellow seeds  25  32  14  70  24  20  32  44  50  44
green seeds   11  7   5   27  13  6   13  9   14  18
total seeds   36  39  19  97  37  26  45  53  64  62

It follows from the genetic models that the probability of occurrence of a yellow seed should be 0.75 (and 0.25 for a green seed). At the asymptotic significance level 0.05, test the hypothesis that the results of Mendel's experiments are in accordance with the model.

Solution. We test the hypothesis with Pearson's chi-squared test. We use the statistic
$$K = \sum_{j=1}^r \frac{(n_j - np_j)^2}{np_j},$$
where $r$ is the number of sorting intervals (measurements; we have $r = 10$), $n_j$ is the actually measured frequency in the given sorting interval (we count the number of yellow seeds), and $p_j$ is the probability of the observed characteristic under the assumed distribution, so that $np_j$ is the expected frequency, with $n$ the total count in the interval; in our case, $p_j = 0.75$, $j = 1, \ldots, 10$. If the results of the experiment were really distributed as assumed in our model, we would have $K \approx \chi^2(r - 1 - p)$, where $p$ is the number of estimated parameters in the assumed probability distribution. In our case, it is especially simple, since our model does not have any unknown parameters, so we have $p = 0$ (the parameters may occur if, e.g., we assume that the probability distribution in our experiment is normal but with unknown variance and expected value; then we would have $p = 2$). Thus, $K \approx \chi^2(9)$. The statistic is recommended to be used if the expected frequency of the characteristic in each of the sorting intervals is at least 5. Let us write the data into a table:

j    n_j   p_j    np_j    (n_j − np_j)²/(np_j)
1    25    0.75   27      0.148148
2    32    0.75   29.25   0.258547
⋮    ⋮     ⋮      ⋮       ⋮
10   44    0.75   46.5    0.134409

The value of the statistic $K$ for the given data is
$$K = 0.148148 + 0.258547 + \cdots + 0.134409 = 1.797495.$$

occurs in many repeated experiments. This seeming advantage/rigor can become a disadvantage as soon as we are interested in the confidence of the data themselves and the suitability of the chosen experiment. The same problem occurs if we want to use frequentist statistics to estimate the probability of one or more outcomes of an experiment that is executed only once. On the other hand, Bayesian statistics is an example of applying mathematics to "common sense" when we want to adjust our belief in the light of new information. It is interesting that, from the historical point of view, the first approach was the Bayesian one (used, for instance, by Laplace as early as in the 18th century), which succumbed to frequentist statistics in the 20th century. In recent decades, Bayesian statistics has been returning, together with further new approaches.

10.3.2. Random sample of a population. We now describe the first of the two approaches mentioned above in more detail.
Thus, assume that there is a (huge) basic statistical set of $N$ units, which is called the population, and each of the units has a numerical characteristic, i.e., there is a set of values $(x_1, \ldots, x_N)$. Only a sample with values $(X_1, \ldots, X_n)$ is drawn from this set. In order to avoid the discussion of the actual size of the basic statistical set of $N$ units, assume that the items of the sample are selected one by one and every item is always put back into the population. In addition, assume that every item has the same probability $1/N$ of being chosen. This is a random sample. The realization of the random sample can then be viewed as working with a vector $(X_1, \ldots, X_n)$ of independent, identically distributed random variables. In particular, they have the same distribution function $F_X(x)$ and moments $\operatorname{E}X_i = \mu$, $\operatorname{var}X_i = \sigma^2$.

The next step must be a derivation of the characteristics of the sample mean $\bar X$ and the sample variance
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$$
The following theorem explains why the coefficient $\frac{1}{n-1}$ is selected instead of $\frac1n$, which is the case with $s^2$ in subsection 10.1.6.

Theorem. The sample mean $\bar X$ computed from a random sample of size $n$ whose distribution has finite expected value $\mu$ and finite variance $\sigma^2$ satisfies $\operatorname{E}\bar X = \mu$ and $\operatorname{var}\bar X = \frac1n\sigma^2$. The sample variance $S^2$ satisfies $\operatorname{E}S^2 = \sigma^2$.

Proof. As derived in subsection 10.2.29,
$$\operatorname{E}\bar X = \frac1n\,\operatorname{E}\sum_{i=1}^n X_i = \frac1n\,n\mu = \mu.$$

This value is less than $\chi^2_{0.95}(9) = 16.9$, so we do not reject the null hypothesis at level 0.05 (i.e., we do not refute the known genetic model). □

Since the variables $X_i$ are independent, the additivity of variance can be used (derived in subsection 10.2.33). The variance behaves as a quadratic form with respect to multiplication by a scalar. Hence
$$\operatorname{var}\bar X = \frac{1}{n^2}\operatorname{var}\sum_{i=1}^n X_i = \frac{1}{n^2}\,n\sigma^2 = \frac1n\sigma^2.$$
The formula
$$\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n (X_i - \bar X)^2 + n(\bar X - \mu)^2$$
can be verified simply by expanding the products. Thus:
$$\operatorname{E}s^2 = \frac1n\,\operatorname{E}\Bigl(\sum_{i=1}^n (X_i-\mu)^2 - n(\bar X - \mu)^2\Bigr) = \frac1n\sum_{i=1}^n \operatorname{var}X_i - \operatorname{var}\bar X = \Bigl(1 - \frac1n\Bigr)\sigma^2.$$
That is why the variance $s^2$ is multiplied by the coefficient $\frac{n}{n-1}$, which leads just to the sample variance $S^2$ and its expected value $\sigma^2$. Of course, this multiplication makes sense only if $n \ne 1$. □

10.3.3. Random sample of the normal distribution. In practice, it is necessary to know not only the numerical characteristics of the sample mean and the variance, but also their entire probability distributions. Of course, these can be derived only if the particular probability distribution of the $X_i$ is known. As a useful illustration, calculate the result for a random sample of the normal distribution. It is already verified, as an example on the properties of moment generating functions in 10.2.37, that the sum of random variables with the normal distribution has again the normal distribution. Hence the sample mean must also have the normal distribution, and since both its expected value and variance are known, $\bar X \sim N(\mu,\ \frac1n\sigma^2)$.

The probability distribution of the sample variance is more complicated. Here, apply the ideas about multivariate normal distributions from subsection 10.2.46. Consider the vector $Z$ of standardized normal variables $Z_i = \frac{X_i - \mu}{\sigma}$. The same property holds for the vector $U = Q^TZ$ with any orthogonal matrix $Q$; in addition, $\sum_i U_i^2 = \sum_i Z_i^2$. Choose the matrix $Q$ so that the first component $U_1$ equals the sample mean $\bar Z$, up to a multiple. This means that the first column of $Q$ is chosen as $(\sqrt n)^{-1}(1, \ldots, 1)$.
Then U2 1 = n ¯Z2 , so we 962 CHAPTER 10. STATISTICS AND PROBABILITY THEORY can compute: n∑ i=1 U2 i = n∑ i=1 Z2 i = n∑ i=1 (Zi − ¯Z)2 + n ¯Z2 n∑ i=2 U2 i = n∑ i=1 (Zi − ¯Z)2 = 1 σ2 n∑ i=1 (Xi − ¯X)2 . Therefore, a multiple of the sample variance n−1 σ2 S2 is the sum of n−1 squares of standardized normal variables, so the following theorem is proved: Theorem. Let (X1, . . . , Xn) be a random sample from the N(µ, σ2 ) distribution. Then, ¯X and S2 are independent variables, and ¯X ∼ N(µ, 1 n σ2 ), n − 1 σ2 S2 ∼ χ2 n−1. Hence, it immediately follows that the standardized sample mean T = √ n ¯X − µ S has Student’s t-distribution with n − 1 degrees of freedom. 10.3.4. Point and interval estimates. Now, we have everything needed to estimate the parameter values in the context of frequentist statistics. Here is a simple example. Suppose there are 500 students enrolled in a course, each of which has a certain degree of satisfaction with the course, expressed as an integer in the range 1 through 10. It may be assumed that the satisfactions Xi of the students are approximated by a random variable with distribution N(µ, σ2 ). Further, suppose a detailed earlier survey showed that µ = 6, σ = 2. In the current semester, 15 students are asked about their opinion about the course, as rumor has it that the evaluation of the new lecturer might be quite different. The results show that 2 students vote 3, 3 vote 4, 3 vote 5, 5 vote 6, and 2 vote 7. Altogether, the sample mean is ¯X = 5.133 and the sample variance is S2 = 1.695. By assumptions, ¯X ∼ N(µ, σ2 /n), so Z = √ n ¯X−µ σ ∼ N(0, 1). In order to express the confidence of the estimate, compute the interval which contains the estimated parameter with an a priori fixed probability 100(1−α)%. We talk about a confidence level α, 0 < α < 1. Consider µ to be the unknown parameter, while the variance can be assumed (be it correct or not) to remain unchanged. It follows that 1 − α = P(|Z| < z(α/2)) = P ( √ n ¯X − µ σ < z(α/2) ) = P ( ¯X − σ √ n z(α/2) < µ < ¯X + σ √ n z(α/2) ) , where z(α/2) means the critical value, cf. 10.2.30. Thus, an interval is found whose endpoints are random variables and which contains the estimated parameter µ with an a priori fixed probability. The middle point of this interval is called the point estimate for parameter µ; the whole interval is called the interval estimate. We can also say that at the confidence 963 CHAPTER 10. STATISTICS AND PROBABILITY THEORY level α, the estimated parameter µ is or is not different from another value µ0. Suppose for instance, the data and levels are α = 0.05 and α = 0.1. Respectively we obtain the intervals µ ∈ (4.121, 6.145), µ ∈ (4.284, 5.983). Considering the confidence level of 5%, we cannot affirm that the opinion of students are worse compared to the previous year because the mentioned interval also contains the value µ0 = 6. We can conclude this if we take the confidence level of 10% since the value µ0 = 6 no more lies in the corresponding interval. On the other hand, if it is assumed that the other (worse) lecturer causes the variance of the answers to change as well (for instance, the students might agree more on the bad assessment), we proceed differently. Instead of the standardized variable Z, deal in a similar way with the variable T = √ n ¯X − µ S . As seen, this random variable has probability distribution T ∼ tn−1, where n = 15 in this case. This leads to the interval estimate ¯X − S √ n tn−1(α/2) < µ < ¯X + S √ n tn−1(α/2). 
Substitute the data at levels $\alpha = 0.05$ and $\alpha = 0.03$ respectively, to obtain
$$\mu \in (4.412,\ 5.854), \qquad \mu \in (4.321,\ 5.945).$$
Therefore, at the confidence level of 3%, the opinion seems to have become worse. This corresponds to our intuition that the sample deviation $S = 1.302$, which is significantly smaller than $\sigma = 2$ from the previous case, should be essential for our thinking.

10.3.5. Likelihood of estimates. From the mathematical point of view, interval and point estimates are simple and easy to understand. It is much worse with their practical interpretation, because it is problematic to verify all the assumptions about the randomness of the sample. In more complicated cases, we encounter problems with the "likelihood" of our estimates. As mathematicians, we can avoid the practical problem by defining the missing concept.

In general, one works with a random sample of size $n$. Implicitly, it is assumed that there are independent random variables $X_i$ with the same probability distribution which depends on an unknown parameter $\theta$ (a vector in general). We are trying to find a sample statistic $T$, i.e., a function of the random variables $X_1, X_2, \ldots$, which, in a mathematical sense, estimates the actual value of the parameter $\theta$. $T$ is said to be an unbiased estimator of $\theta$ if and only if $\operatorname{E}T = \theta$. The expected value $\operatorname{E}(T - \theta)$ is called the bias of the estimator $T$. The asymptotic behaviour of the estimator, that is, what it does as $n$ goes to infinity, is often of interest. $T = T(n)$ is said to be a consistent estimator of the parameter $\theta$ if and only if $T(n)$ converges in probability to $\theta$, i.e., for every $\varepsilon > 0$,
$$\lim_{n\to\infty} P\bigl(|T(n) - \theta| < \varepsilon\bigr) = 1.$$
Chebyshev's inequality immediately yields
$$P\bigl(|T(n) - \operatorname{E}T(n)| < \varepsilon\bigr) \ge 1 - \frac{\operatorname{var}T(n)}{\varepsilon^2}.$$
Assuming $\lim_{n\to\infty}\operatorname{E}T(n) = \theta$, then, for sufficiently large values of $n$,
$$P\bigl(|T(n) - \theta| < 2\varepsilon\bigr) \ge P\bigl(|T(n) - \operatorname{E}T(n)| < \varepsilon\bigr) \ge 1 - \frac{\operatorname{var}T(n)}{\varepsilon^2}.$$
A useful proposition is thus proved:

Theorem. Assume that $\lim_{n\to\infty}\operatorname{E}T(n) = \theta$ and $\lim_{n\to\infty}\operatorname{var}T(n) = 0$. Then $T(n)$ is a consistent estimator of $\theta$.

As a simple example, we can illustrate this theorem on the variance estimate
$$\hat\sigma^2 = \frac1n\sum_{i=1}^n (X_i - \bar X)^2 = \frac{n-1}{n}\,S^2.$$
Since it is known from subsection 10.3.2 that $S^2$ is an unbiased estimator, it follows that $\hat\sigma^2$ is not. However, $\lim_{n\to\infty}\operatorname{E}\hat\sigma^2 = \sigma^2$, and it can be calculated (for a random sample from the normal distribution) that
$$\lim_{n\to\infty}\operatorname{var}\hat\sigma^2 = \lim_{n\to\infty}\operatorname{var}S^2 = \lim_{n\to\infty}\frac{2\sigma^4}{n-1} = 0.$$
Therefore, the statistic $\hat\sigma^2$ is a consistent estimator of the variance.

It is apparent that there may be more unbiased estimators for a given parameter. For instance, it is already shown that the arithmetic mean $\bar X$ is an unbiased estimator of the expected value $\theta$ of random variables $X_i$. The value $X_1$ is, of course, an unbiased estimator of $\theta$ as well. We wish to find the best estimator $T$ in the class of considered statistics, which are unbiased or consistent. Consider as best the one whose variance is as small as possible. Recall that the variance of a vector statistic $T$ is given by the corresponding covariance matrix, which is, in the case of independent components, a diagonal matrix with the individual variances of the components on the diagonal. We have already defined inequalities between positive-definite matrices.

10.3.6. Maximum likelihood. Assume that the density function of the components of the sample is given by a function $f(x,\theta)$ which depends on an unknown parameter $\theta$ (a vector in general). By the assumed independence, the joint density of the vector $(X_1, \ldots, X_n)$ is equal to the product
$$f(x_1, \ldots, x_n, \theta) = f(x_1, \theta)\cdots f(x_n, \theta),$$
which is called the likelihood function. We are interested in the value $\hat\theta$ which maximizes the likelihood function on the set of all admissible values of the parameter. In the discrete case, this means choosing the parameter for which the obtained sample has the greatest probability.

Usually, it is more efficient to work with the log-likelihood function
$$\ell(x_1, \ldots, x_n, \theta) = \ln f(x_1, \ldots, x_n, \theta) = \sum_{i=1}^n \ln f(x_i, \theta).$$
Since the function $\ln$ is strictly increasing, maximization of the log-likelihood function is equivalent to maximization of the original likelihood function. If, for some input, it happens that $f(x_1, \ldots, x_n, \theta) = 0$, set $\ell(x_1, \ldots, x_n, \theta) = -\infty$. In the case of discrete random variables, use the same definition with the probability mass function instead of the density, i.e.,
$$\ell(x_1, \ldots, x_n, \theta) = \sum_{i=1}^n \ln P(X_i = x_i\,|\,\theta).$$

We can illustrate the principle on a random sample from the normal distribution $N(\mu, \sigma^2)$ of size $n$. The unknown parameters are $\mu$ or $\sigma$, or both. The considered density is
$$f(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
Take logarithms of both sides, to obtain
$$\ell(x, \mu, \sigma) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$$
The maximum can be found using differentiation (note that $\sigma^2$ is treated as a symbol for a variable):
$$\frac{\partial\ell}{\partial\mu} = -\frac{1}{2\sigma^2}\sum_{i=1}^n (-2)(x_i - \mu) = \frac{1}{\sigma^2}\Bigl(-n\mu + \sum_{i=1}^n x_i\Bigr),$$
$$\frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i-\mu)^2 = \frac{1}{2(\sigma^2)^2}\Bigl(-n\sigma^2 + \sum_{i=1}^n (x_i - \mu)^2\Bigr).$$
Thus, the only critical point is given by $\hat\mu = \bar X$ and $\hat\sigma^2 = s^2$. Substitute these values into the matrix of second derivatives, to obtain the Hessian of $\ell$:
$$\begin{pmatrix} -\dfrac{n}{\hat\sigma^2} & 0 \\[1ex] 0 & -\dfrac{n}{2(\hat\sigma^2)^2}\end{pmatrix}.$$
This is negative definite, so the critical point is indeed a maximum, and since it is the only critical point, it must be the global maximum (think about the details of this argument!). Thus it is verified that the sample mean and the variance estimate $s^2$ are the maximum likelihood estimates of $\mu$ and $\sigma^2$, as already used.

10.3.7. Bayesian estimates. We return to the example from subsection 10.3.4, now from the point of view of Bayesian statistics. This totally reverses the approach: the collected data $X_1, \ldots, X_{15}$ (i.e., the points which express how much each student is satisfied, using the scale 1 through 10) are treated as constants. On the other hand, the estimated parameter $\mu$ (the expected value of the points of satisfaction) is viewed as the random variable
Indeed, we do not need to know f(x) for the following reason: we have to view f(x) as a constant independent of θ and thus the proper density is obtained from f(x|θ)g(θ) by multiplying with a uniquely given constent in the end. Thus, during the computation, it is sufficient to be precise “up to a constant multiple”. For this purpose, use the notation Q ∝ R, meaning that there is a constant C such that the expressions Q and R satisfy Q = CR. We shall illustrate this procedure on a more explicite example. In order to be as close as possible to the ideas from subsection 10.3.4, work with normal distributions N(µ, σ2 ). Suppose that the satisfaction of individual students in particular lectures is a random variable X ∼ N(θ, σ2 ), while the parameter θ reached by the particular lecturers is a random variable θ ∼ N(a, b). Compute, (up to a constant multiple, ignoring all multiplicative components which do not include any θ), g(θ|x) ∝ f(x|θ)g(θ) ∝ exp ( − (x − θ)2 2σ2 − (θ − a)2 2b2 ) ∝ exp ( − 1 2 ( θ2 ( 1 σ2 + 1 b2 ) − 2θ ( x σ2 + a b2 ))) ∝ exp ( − 1 2 ( θ − b2 x + σ2 a σ2b2 σ2 b2 b2 + σ2 )2( b2 σ2 b2 + σ2 )−1) . This proves already that the distribution for θ is θ ∼ N ( b2 b2 + σ2 x + σ2 b2 + σ2 a, b2 σ2 b2 + σ2 ) . This result can be interpreted so that if the parameters a, b, σ are known from long-run evaluation of surveys and the opinion of another student is learned, then the a priori opinion about the parameters for an individual lecture can be adjusted. In the resulting estimate, the expected value is given by the weighted average of the found value x and the a priori assumed expected value a, in dependence on the standard deviations σ and b. 967 CHAPTER 10. STATISTICS AND PROBABILITY THEORY 10.3.8. Interpretation in Bayesian statistics. We follow the ideas from the previous subsection, compared to the frequentist interpretation from 10.3.4. It may seem odd that a single query can influence an opinion so much. For σ → 0, the relevance of a single opinion is still increasing, and this corresponds to a 100% relevance of x in the case σ = 0. This is in accordance with the interpretation that Bayesian statistics is the probability extension of the standard discrete mathematical logic. If the variance σ is close to zero, then it is almost certain that the opinion of any student precisely describes the opinion of the whole population. In subsection 10.3.4, we worked with the sample mean ¯X of the collected data. This can be used in the previous calculation, since the mean also has a normal distribution, too. The expected value is the same, and the only difference is that σ2 /n is substituted instead of σ2 . To facilitate the notation, define the constant cn = nb2 nb2 + σ2 . The a posteriori estimate for θ based on the found sample mean ¯X has the distribution with parameters θ ∼ N(cn ¯X + (1 − cn)a, cnσ2 /n). As could be expected, for increasing n, the expected value of the distribution for θ approaches the sample mean, and its variance approaches zero. In other words, the higher the value of n, the closer is the point estimate from the frequentist point of view. A contribution of the Bayesian approach is that if the estimated distribution is used, questions of the kind: “What is the probability that the new lecturer is worse than the old one?” can be answered. Use the same data as in 10.3.4 and supplement the necessary a priori data. Assume that the lecturers are assessed quite well (otherwise, they would probably not be teaching at the university at all). 
For concreteness, select the a priori distribution with parameters a = 7.5, b = 2.5, and the standard deviation with σ = 2. Continue with n = 15 and the sample mean of 5.133. Substitute this data, to get the a posteriori estimate for the distribution θ ∼ N(5.230, 0.256). We are interested in P(θ < 6). This is computed by evaluating the distribution function of the corresponding normal distribution for the input value 6 (Excel is capable of this, too). The answer is approximately 93.6 %. This is similar to the material in subsection 10.3.4, where the known variance is assumed constant. Note the influence of the a priori assumption about the distribution of the parameter θ for all lecturers. To a certain extent, this reflects a faith that the lecturers are rather good. If a statistician has a reason for assuming that the actual expected value a for a specific lecturer is shifted, say a = 6 as in the survey about the previous lecturer, (this can be caused, 968 CHAPTER 10. STATISTICS AND PROBABILITY THEORY for example, by the fact that the lecture is hard and unpopular), then the probability of his actual parameter being less than 6 would be approximately 95.0 %. (If the expected value is considered to be significantly worse only when below 5.5, then the value would be only approximately 75 %). When substituting a = 5, the value is already 96.8 %. The variance b2 is also important. For instance, the a priori estimate a = 6, b = 3.5 leads to probability 95.2 %. In the above discussion, another very important point is touched on – sensitivity analysis. It would seem desirable that a small change of the a priori assumption has only a small effect on the a posteriori result. It appears that this is so in this example; however, we omit further discussion here. The same model with exponential distributions is used in practice when judging the relevance of the output of an IQ test of an individual person. It can also be used for another similar exam where it is expected that the normal distribution approximates well the probability distribution of the results. In both cases, there is an a priori assumption to which group he/she should belong. Other good examples (with different distributions) are practical problems from insurance industry, where it is purposeful to estimate the parameters so that both the effects of the experiment upon an individual item and the altogether expectations over the population are included. 10.3.9. Notes on hypothesis testing. We return to deciding whether a given event does or does not occur in the context of frequentist statistics. We build on the approach from interval estimates, as presented above. Thus, consider a random vector X = (X1, . . . , Xn) (the result of a random sample), whose joint distribution function is FX(x). A hypothesis is an arbitrary statement about the distribution which is determined by this distribution function. Usually, one formulates two hypothesis, denoted H0 and HA. The former is traditionally called null hypothesis, and the latter is called alternative hypothesis. The result of the test is then a decision based on a concrete realization of the random vector X (a test) whether the hypothesis H0 is to be rejected or not in favor of the hypothesis HA. During this process, two types of errors may occur. Type I error occurs when H0 is rejected even though it is true. Type II error occurs when H0 is not rejected although it is false. 
10.3.9. Notes on hypothesis testing. We return to deciding whether a given event does or does not occur, in the context of frequentist statistics. We build on the approach of interval estimates, as presented above. Thus, consider a random vector X = (X1, . . . , Xn) (the result of a random sample), whose joint distribution function is F_X(x). A hypothesis is an arbitrary statement about the distribution determined by this distribution function. Usually, one formulates two hypotheses, denoted H0 and HA. The former is traditionally called the null hypothesis, the latter the alternative hypothesis. The result of the test is then a decision, based on a concrete realization of the random vector X (a test), whether the hypothesis H0 is to be rejected or not in favor of the hypothesis HA. During this process, two types of errors may occur. A type I error occurs when H0 is rejected even though it is true. A type II error occurs when H0 is not rejected although it is false.

The decision procedure of a frequentist statistician consists of selecting the critical region W, i.e., the set of test results for which the hypothesis is rejected. The size of the critical region is chosen so that a true hypothesis is rejected with probability not greater than α. This means that a fixed bound for the probability of the type I error is required: the significance level α. The most common choices are α = 0.05 or α = 0.01. It is also useful in practice to determine the least possible significance level at which the hypothesis would still be rejected – the p-value of the test.

It remains to find a reasonable procedure for choosing the critical region. This should be done so that the type II error occurs as rarely as possible. Usually, it is convenient to consider the likelihood function f(x, θ), defined for a random vector X in subsection 10.3.6. For the sake of simplicity, assume there is a one-dimensional parameter θ, and formulate the null hypothesis as X being given by the density f(x, θ0), while the alternative hypothesis is given by the distribution f(x, θ1), for fixed distinct values θ0 and θ1. Ideas about rejecting or accepting the hypotheses suggest that, when substituting the values of a specific test into the likelihood function, the hypothesis can be accepted if f(x, θ0) is much greater than f(x, θ1). This suggests considering, for each constant c > 0, the critical region

$$W_c = \{x;\ f(x,\theta_1) \ge c\,f(x,\theta_0)\}.$$

Having chosen the significance level, choose c so that $\int_{W_c} f(x,\theta_0)\,dx = \alpha$. This guarantees that, when H0 is valid, the test result x ∈ W_c (a type I error) occurs with at most the prescribed probability. The same can be guaranteed by other critical regions W which also satisfy $\int_{W} f(x,\theta_0)\,dx = \alpha$. On the other hand, type II errors are also of interest. That is, it is desired to maximize the probability of HA over the critical region. Thus, consider the difference

$$D = \int_{W_c} f(x,\theta_1)\,dx - \int_{W} f(x,\theta_1)\,dx$$

for an arbitrary W as above. The regions over which integration is carried out can be divided into the common part W ∩ W_c and the remaining set differences. The contributions of the common part cancel, and there remains

$$D = \int_{W_c\setminus W} f(x,\theta_1)\,dx - \int_{W\setminus W_c} f(x,\theta_1)\,dx.$$

Using the definition of the critical region W_c (and putting back the same integrals over the common part),

$$D \ge c\int_{W_c\setminus W} f(x,\theta_0)\,dx - c\int_{W\setminus W_c} f(x,\theta_0)\,dx = c\int_{W_c} f(x,\theta_0)\,dx - c\int_{W} f(x,\theta_0)\,dx = c\alpha - c\alpha = 0.$$

This proves an important statement, the Neyman–Pearson lemma:

Proposition. Under the above assumptions, W_c is the optimal critical region, in the sense that it minimizes the occurrence of the type II error at a given significance level.

10.3.10. Example. The interval estimate, as illustrated on an example in subsection 10.3.4, is a special case of hypothesis testing, where H0 has the form "the expected value of the satisfaction with the course remained µ0", while HA says that it is equal to a different value µ1. The general procedure mentioned above leads in this case to the critical region given by

$$|Z| = \frac{|\bar X - \mu_0|}{\sigma}\sqrt{n} \ge z(\alpha/2).$$

Note that in the definition of the critical region, the actual value µ1 is not essential. In the context of classical probability, the decision at a given level α whether or not there is a change of the expected value µ is thus formalized. To test only whether the satisfaction decreased, assume beforehand that µ1 < µ0.
We analyze this case thoroughly. The critical region from the Neyman–Pearson lemma is determined by the inequality

$$\frac{f(x,\mu_1,\sigma^2)}{f(x,\mu_0,\sigma^2)} = \mathrm{e}^{-\frac{1}{2\sigma^2}\sum_{i=1}^n\left((x_i-\mu_1)^2-(x_i-\mu_0)^2\right)} \ge c.$$

Take logarithms and rearrange to obtain

$$2\bar x(\mu_1-\mu_0) - (\mu_1^2-\mu_0^2) \ge \frac{2\sigma^2}{n}\ln c.$$

Since µ1 < µ0, it follows that

$$\bar x \le \frac{\mu_1+\mu_0}{2} + \frac{\sigma^2}{n(\mu_1-\mu_0)}\ln c = y.$$

For a given level α, the constant c, and thereby the decisive parameter y, are determined so that, under the assumption that H0 is true,

$$\alpha = P(\bar X \le y) = P\Big(\frac{\bar X-\mu_0}{\sigma}\sqrt{n} \le \frac{y-\mu_0}{\sigma}\sqrt{n}\Big).$$

Assuming that H0 is true, $Z = \frac{\bar X-\mu_0}{\sigma}\sqrt{n} \sim N(0,1)$, so the requirement means choosing Z ≤ −z(α), which determines the optimal W_c uniquely. Note that this critical region is independent of the chosen value µ1, and the actual value of y did not have to be expressed at all. It was only essential to assume that µ1 < µ0.

In the illustrative example from subsection 10.3.4, H0 : µ = 6, and the alternative hypothesis is HA : µ < 6. The variance is σ² = 4. The test with n = 15 yielded $\bar x$ = 5.133. Substituting this gives the value $z = \frac{5.133-6}{\sqrt{4}}\sqrt{15} = -1.678$, while −z(0.05) = −1.645. Therefore, if we are testing whether the new teacher is even worse than the previous one, we reject the hypothesis at the level of 5 %, deducing that the students' opinions are really worse.

If the union of the critical regions for the cases µ1 < µ0 and µ1 > µ0 is chosen as the critical region, the same results as for the interval estimate are obtained, as mentioned above.

We remark that in the Bayesian approach, it is also possible to accept or reject hypotheses in a direct connection to the a posteriori probability of events, as was, to a certain extent, indicated in subsection 10.3.8, where our specific example is interpreted.
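The arithmetic of this one-sided test is easy to verify; a small sketch (Python with SciPy, reusing the numbers from the example – a check of ours, not part of the original computation):

    # One-sided z-test: H0: mu = 6 against HA: mu < 6.
    from math import sqrt
    from scipy.stats import norm

    mu0, sigma, n, x_bar = 6.0, 2.0, 15, 5.133   # sigma^2 = 4
    z = (x_bar - mu0) / sigma * sqrt(n)          # approx. -1.678
    print(z, norm.ppf(0.05))                     # critical value -z(0.05), approx. -1.645
    print(z <= norm.ppf(0.05))                   # True: reject H0 at the 5 % level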
10.3.11. Linear models. As is usual in the analysis of mathematical problems, either we deal with linear dependencies and objects, or we discuss their linearizations. In statistics, too, many methods belong to the linear models. We consider a quite general scheme of this type. Consider a random vector Y = (Y1, . . . , Yn)^T and suppose Y = X·β + σZ, where X = (x_{ij}) is a constant real-valued matrix with n rows and k < n columns whose rank is k, β is an unknown constant vector of k parameters of the model, Z is a random vector whose n components are independent with distribution N(0, 1), and σ > 0 is an unknown positive parameter of the model. This is a linear model with full rank.

In practice, the variables x_{ij} are often known, and the problem is to estimate or predict the value of Y. For instance, x_{ij} can express the grade in maths of the i-th student in the j-th semester (j = 1, 2, 3), and we want to know how this student will fare in the fourth semester. For this purpose, the vector β needs to be known. It can be estimated from complete observations, that is, from the knowledge of Y (from the results of past years, for example).

In order to estimate the vector β, the least squares method can often be used. This means looking for the estimate b ∈ R^k for which the vector Ŷ = Xb minimizes the squared length of the vector Y − Xβ. This is a simple problem from linear algebra: we look for the orthogonal projection of the vector Y onto the subspace span X ⊂ R^n generated by the columns of the matrix X, i.e., we minimize the function

$$\|Y - X\beta\|^2 = \sum_{i=1}^n\Big(Y_i - \sum_{j=1}^k x_{ij}\beta_j\Big)^2.$$

Choose an arbitrary orthonormal basis of the vector subspace span X and write it into the columns of a matrix P. For any choice of such a basis, the orthogonal projection is realized as multiplication by the matrix PP^T. On the subspace span X, the mapping given by this matrix is the identity. That is,

$$\hat Y = PP^TY = PP^T(X\beta + \sigma Z) = X\beta + \sigma PP^TZ.$$

The matrix PP^T is positive-semidefinite. Extend the basis consisting of the columns of P to an orthonormal basis of the whole space R^n. In other words, create a matrix Q = (P R) by writing the newly added basis vectors into the matrix R with n − k columns and n rows. Denote by V = P^TZ and U = R^TZ the random vectors with k and n − k components, respectively. Stacked together, they form the vector (V^T U^T)^T = Q^TZ in R^n.

Clearly (see subsection 10.2.46), both vectors V and U have multivariate normal distributions with zero expected value and identity covariance matrix. The random vector Y is decomposed into the sum of the constant vector Xβ and two orthogonal projections,

$$Y = X\beta + \sigma PV + \sigma RU,$$

and the desired orthogonal projection Ŷ is the sum of the first two summands. In subsection 10.2.46, the distribution of such random vectors is also derived.

The quantity ∥Y − Ŷ∥² is called the residual sum of squares, sometimes denoted RSS. Also, the residual variance is defined as

$$S^2 = \frac{\|Y - Xb\|^2}{n-k}.$$

Recall that Ŷ = Xb and that X^TX is invertible, since X is assumed to have full rank. Thus b = (X^TX)^{-1}X^TŶ can be computed. At the same time, X^T(Y − Ŷ) = σX^T(RU) = 0, since the columns of X and R are mutually orthogonal. Therefore,

(1) b = (X^TX)^{-1}X^TY.

The chosen matrix P can be used to advantage. Since its columns generate the same subspace as the columns of X, there is an invertible square matrix T such that X = PT (its columns are the coefficients of the linear combinations expressing the columns of X in the basis given by P). Substituting, and using the fact that P^TP is the identity matrix:

$$b = (T^TP^TPT)^{-1}T^TP^TY = T^{-1}(T^T)^{-1}T^TP^T(PT\beta + \sigma Z) = \beta + \sigma T^{-1}V.$$

This proves the main properties of the linear model:

Theorem. Consider a linear model Y = Xβ + σZ.
(1) For the estimate Ŷ, Ŷ = Xβ + σPV, and Ŷ ∼ N(Xβ, σ²PP^T).
(2) The residuals and the normed residual sum of squares have distributions Y − Ŷ ∼ N(0, σ²RR^T) and ∥Y − Ŷ∥²/σ² ∼ χ²_{n−k}.
(3) The random variable b = β + σT^{-1}V has distribution b ∼ N(β, σ²(X^TX)^{-1}).
(4) The residual variance satisfies (n − k)S²/σ² ∼ χ²_{n−k}.
(5) The expected value of the residual variance is E S² = σ².
(6) The variables b and S² are independent.

Proof. Both the shape and distribution of Ŷ were determined above. It is clear that Y − Ŷ = σRU, which verifies the second proposition. Further, ∥Y − Ŷ∥²/σ² = ∥RU∥² = ∥U∥², where the last equality follows from the fact that, in the construction, U is the vector of coordinates of the projection of Z onto the orthogonal complement of span X, and RU is this projection. The squared size of a vector is exactly the sum of squares of its coordinates in any orthonormal basis. Therefore, the random variable ∥Y − Ŷ∥²/σ² is the sum of n − k squares of independent random variables with distribution N(0, 1), so it has the distribution χ²_{n−k}, which proves the rest of (2).

The next proposition follows directly from the definitions and the calculations above; it suffices to compute the covariance matrix of b. From the general properties, it is the matrix T^{-1}(T^T)^{-1}, which is the same as (X^TX)^{-1} = ((PT)^T(PT))^{-1}. Proposition (4) is a reformulation of the information in (2). The next proposition follows from the fact that the expected value of the χ² distribution equals its number of degrees of freedom. Finally, the independence of the variables b and S² is a consequence of the fact that the former is a function of the vector V, while the latter is a function of the vector U. These vectors are independent, since they are two complementary parts of an orthogonal transformation of the vector Z. □
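Formula (1) is exactly what numerical libraries implement. A small illustration (Python with NumPy; the design matrix and parameters are made up for the sketch):

    # Least squares estimate in the full-rank linear model Y = X beta + sigma Z.
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 50, 3
    X = rng.normal(size=(n, k))            # known design matrix of rank k
    beta, sigma = np.array([2.0, -1.0, 0.5]), 0.3
    Y = X @ beta + sigma * rng.normal(size=n)

    b = np.linalg.solve(X.T @ X, X.T @ Y)  # b = (X^T X)^{-1} X^T Y
    RSS = np.sum((Y - X @ b) ** 2)         # residual sum of squares
    S2 = RSS / (n - k)                     # residual variance; E S^2 = sigma^2
    print(b, S2)                           # b close to beta, S2 close to 0.09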
In practice, the hypothesis whether fewer parameters suffice to estimate the expected value is sometimes tested. A random vector Y is said to satisfy a submodel if and only if both Y = Xβ + σZ and Y = X⁰β⁰ + σZ, where X⁰ has only q < k columns. It is assumed that the columns of X⁰ generate a subspace of span X, i.e., they are all linear combinations of the columns of X. Repeat the above construction, choosing the matrix P so that its first q vectors generate span X⁰. The matrix P is then of the form (P⁰ P¹), and the vector V decomposes similarly:

$$V = \begin{pmatrix}V^0\\ V^1\end{pmatrix} = \begin{pmatrix}(P^0)^TZ\\ (P^1)^TZ\end{pmatrix}.$$

This yields a finer decomposition of the vectors, of their sizes, and of the corresponding residues:

$$\hat Y^0 = P^0(P^0)^TY = X^0\beta^0 + \sigma P^0V^0,$$
$$Y - \hat Y^0 = \sigma P^1V^1 + \sigma RU,$$
$$\|Y - \hat Y^0\|^2 = \sigma^2\|V^1\|^2 + \sigma^2\|U\|^2,$$
$$(\mathrm{RSS}_0 - \mathrm{RSS})/\sigma^2 = \|V^1\|^2.$$

Therefore, the normed difference of the residues has distribution χ²_{k−q}. It follows immediately that the statistic F, given as the relative difference of the residues, has the Fisher–Snedecor distribution:

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/(k-q)}{\mathrm{RSS}/(n-k)} \sim F_{k-q,n-k}.$$

In practice, the parameter σ is seldom known, and so the estimate S² is used. Instead of the individual components b_j ∼ N(β_j, σ²c_{jj}) of the random vector b, where c_{jj} are the diagonal entries of the matrix C = (X^TX)^{-1}, work with the statistics

$$T_j = \frac{b_j - \beta_j}{S\sqrt{c_{jj}}} \sim t_{n-k}.$$

Of course, these variables need not be independent. If the matrix X is not assumed to have full rank, a pseudoinverse matrix can be used instead of C = (X^TX)^{-1}.

10.3.12. Examples of tests. As an illustration, we mention some applications of linear models in the simplest types of tests. The most trivial case is that of a single sample, where the test is whether or not the only parameter β equals a given value β0. For this case, choose the matrix X as a single column consisting of ones. Then the expression Y = Xβ + σZ means that the individual components of Y are independent variables Y_i ∼ N(β, σ²); it is a random sample of size n from the normal distribution. The general formulae give the estimates

$$b = (X^TX)^{-1}X^TY = \frac1n\sum_{i=1}^n Y_i = \bar Y, \qquad S^2 = \frac{1}{n-1}\|Y - X\bar Y\|^2 = \frac{1}{n-1}\sum_{i=1}^n(Y_i - \bar Y)^2,$$

which are exactly the sample mean and the sample variance used before. In this context, the statistic

$$T = \frac{\bar Y - \beta_0}{S}\sqrt{n}$$

may also be of interest. Testing the hypothesis β = β0 is called the one-sample t-test. The hypothesis is rejected at level α if |T| ≥ t_{n−1}(α).

There is another simple application of the general model, called the paired t-test. It is appropriate when pairs of random vectors W1 = (W_{i1}) and W2 = (W_{i2}) are compared. The differences Y_i = W_{i1} − W_{i2} of their components are assumed to have distribution N(β, σ²). In addition, the variables Y_i need to be independent (which does not mean that the individual pairs W_{i1} and W_{i2} have to be independent!). In the context of our illustrative example from 10.3.4, we can imagine the assessment of two lecturers by the same students. To test the hypothesis that E W_{i1} = E W_{i2} for every i, use the statistic

$$T = \frac{\bar W_1 - \bar W_2}{S}\sqrt{n}.$$
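Both tests are routinely available in software. A minimal sketch of the one-sample t-test (Python with SciPy; the data are made up, and SciPy's ttest_1samp performs the same computation):

    # One-sample t-test of H0: beta = beta0 on a small, made-up sample.
    import numpy as np
    from scipy import stats

    Y = np.array([5.8, 6.1, 5.4, 6.3, 5.9, 6.0, 5.6])
    beta0 = 6.0
    n = len(Y)
    T = (Y.mean() - beta0) / Y.std(ddof=1) * np.sqrt(n)
    p_value = 2 * stats.t.sf(abs(T), df=n - 1)   # two-sided p-value
    print(T, p_value)
    print(stats.ttest_1samp(Y, beta0))           # same statistic and p-value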
Finally, we consider an example with more parameters: the classical case of the regression line. Assume that the variables Y_i, i = 1, . . . , n, have distribution N(β0 + β1x_i, σ²), where x_i are given constants. Examine the best approximation Y_i ≈ b0 + b1x_i; the matrix X of the corresponding linear model is

$$X^T = \begin{pmatrix}1 & 1 & \dots & 1\\ x_1 & x_2 & \dots & x_n\end{pmatrix}.$$

Substitute into the general formulae and compute the estimate

$$\begin{pmatrix}b_0\\ b_1\end{pmatrix} = \begin{pmatrix}n & n\bar x\\ n\bar x & \sum_{i=1}^n x_i^2\end{pmatrix}^{-1}\begin{pmatrix}n\bar Y\\ \sum_{i=1}^n x_iY_i\end{pmatrix} = \Big(\sum_{i=1}^n(x_i-\bar x)^2\Big)^{-1}\begin{pmatrix}\frac1n\sum_{i=1}^n x_i^2 & -\bar x\\ -\bar x & 1\end{pmatrix}\begin{pmatrix}n\bar Y\\ \sum_{i=1}^n x_iY_i\end{pmatrix}.$$

It follows that

$$b_1 = \frac{\sum_{i=1}^n(x_i-\bar x)(Y_i-\bar Y)}{\sum_{i=1}^n(x_i-\bar x)^2}.$$

Finally, compute b0 = Ȳ − b1x̄. From the calculations,

$$\operatorname{var} b_1 = \sigma^2\Big/\sum_{i=1}^n(x_i-\bar x)^2.$$

In order to test the hypothesis whether the expected value of the variable Y does not depend on x, that is, whether H0 has the form β1 = 0, use the statistic

$$T = \frac{b_1}{S}\Big(\sum_{i=1}^n(x_i-\bar x)^2\Big)^{1/2} \sim t_{n-2}.$$

The statistical analysis of multiple regression is similar. There are several sets of values x_{ij}, and the statistical relevance of the approximation Y_i ≈ b0 + b1x_{1i} + · · · + b_kx_{ki} is to be evaluated. The individual statistics T_j allow for a t-test of the dependence of the regression on the individual parameters. Software packages often provide a parameter which expresses how well the values Y_i are approximated, the so-called coefficient of determination:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{i=1}^n(Y_i-\bar Y)^2}.$$

10.3.13. In practice, one often meets problems where the distributions of the statistical data sets are either completely unknown, or errors with nonzero expected value and a non-normal distribution are assumed in the model. In these cases, the application of classical frequentist statistics is very hard or even impossible. There are approaches which work directly with the sample set and derive from it the statistics for point or interval estimates, or probability calculations about them, including the evaluation of standard errors. One of the pioneering articles on this topic is the brief work of Bradley Efron of Stanford University, published in 1981: Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods⁴. The keywords of this article are: balanced repeated replications; bootstrap; delta method; half-sampling; jackknife; infinitesimal jackknife; influence function.

The bootstrap method uses software to create, from the given data sample, many new data samples of the same size (drawn with replacement). The desired statistic (sample mean, variance, etc.) is then evaluated for each of them. After a great number of repetitions of this procedure, a data set is obtained which is considered a relevant approximation of the probability distribution of the examined statistic. The characteristics of this data set are then considered good approximations of the characteristics of the examined statistic for point or interval estimates, analysis of variance, etc. There is not enough space here for a more detailed analysis of these techniques, which are the foundation of non-parametric methods in contemporary statistical software tools.

⁴ Biometrika (1981), 68, 3, pp. 589–99.
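The described resampling procedure fits in a few lines. A minimal sketch (Python with NumPy; the data sample is made up for the illustration) bootstraps the standard error of the sample mean:

    # Bootstrap estimate of the standard error of the sample mean.
    import numpy as np

    rng = np.random.default_rng(1)
    sample = rng.normal(loc=6.0, scale=2.0, size=15)   # the observed data

    B = 10_000                                         # number of resamples
    means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(B)
    ])
    print(means.std(ddof=1))                           # bootstrap standard error
    print(sample.std(ddof=1) / np.sqrt(sample.size))   # classical formula, for comparison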
CHAPTER 11
Number theory

God created the integers, all else is the work of man. – Leopold Kronecker

In this chapter, we will deal with problems concerning integers: mainly divisibility, and solving equations whose domain is the set of integers (or natural numbers). Notice that in this chapter, unlike in the other parts of this book, we will not consider zero to be a natural number, as is usual in this field of mathematics.

Although the natural numbers and the integers are, from a certain point of view, the simplest mathematical structures, the examination of their properties has yielded a good deal of tough problems for generations of mathematicians. These are often problems which can be formulated quite easily, yet many of them remain unsolved so far. Let us mention some of the most popular of them:

• twin primes – the problem is to decide whether there are infinitely many primes p such that p + 2 is also a prime,¹
• Sophie Germain primes – the problem is to decide whether there are infinitely many primes p such that 2p + 1 is also a prime,
• existence of an odd perfect number – i.e., an odd integer the sum of whose positive divisors equals twice the integer (the even examples include 6 and 28),
• Goldbach's conjecture – the problem is to decide whether every even integer greater than 2 can be expressed as the sum of two primes.

A jewel among the problems of number theory is Fermat's Last Theorem – the problem to decide whether there are natural numbers n, x, y, z such that n > 2 and $x^n + y^n = z^n$. Pierre de Fermat formulated this problem as early as 1637; the efforts of many generations were devoted to this question, and it was solved (using results of various fields of mathematics) by Andrew Wiles in 1995.

¹ In 2013, Yitang Zhang published a proof of a promising proposition: for some n < 7·10⁷, there are infinitely many pairs of primes which differ by n. See Y. Zhang, Bounded gaps between primes, Annals of Mathematics, 2013. Although the bound was improved to n = 246 already one year later by James Maynard and Terence Tao, the problem is still open.

1. Fundamental concepts

11.1.1. Divisibility. Recall that we say that an integer a divides an integer b (or that b is divisible by a, or that b is a multiple of a) if and only if there exists an integer c satisfying a·c = b. We write this as a | b. The concept of divisibility can be considered much more generally, as we shall see in 12.3.5.

A. Basic properties of divisibility

Let us recall the basic properties of divisibility, whose proofs follow directly from the definition: the integer 0 is divisible by every integer; the only integer that is divisible by 0 is 0; every integer a satisfies a | a; and every triple of integers a, b, c satisfies the following four implications:

a | b ∧ b | c ⟹ a | c,
a | b ∧ a | c ⟹ (a | b + c) ∧ (a | b − c),
c ≠ 0 ⟹ (a | b ⟺ ac | bc),
a | b ∧ b > 0 ⟹ a ≤ b.

The mere knowledge of these basic rules allows us to solve many problems.

11.A.1. Determine the natural numbers n for which the integer n³ + 1 is divisible by the integer n − 1.

Solution. We have n³ − 1 = (n − 1)(n² + n + 1), so the integer n³ − 1 is divisible by n − 1 for any n. If n − 1 is to divide n³ + 1 as well, it must also divide the difference (n³ + 1) − (n³ − 1) = 2 (see the second property of divisibility above). Since n ∈ N, we have n − 1 ≥ 0. Now, n − 1 | 2 implies that n − 1 = 1 or n − 1 = 2, whence n = 2 or n = 3. The wanted property is thus possessed only by the natural numbers 2 and 3. □

We will often take advantage of one of the most important properties of the integers: the unique Euclidean division (i.e., division with remainder).

Unique division with remainder
Theorem. For any integers a ∈ Z, m ∈ N, there exists a unique pair of integers q ∈ Z, r ∈ {0, 1, . . . , m − 1} satisfying a = qm + r.
Proof. First, we prove the existence of the integers q, r. Fix a natural number m and prove the statement for any a ∈ Z. Assume first that a is non-negative and prove the existence of q, r by induction on a. If 0 ≤ a < m, we can choose q = 0, r = a, and the equality a = qm + r holds trivially. Next, suppose that a ≥ m and that the existence of the integers q, r has been proved for all a′ ∈ {0, 1, 2, . . . , a − 1}. In particular, for a′ = a − m ≥ 0, there are q′, r′ such that a′ = q′m + r′ and r′ ∈ {0, 1, . . . , m − 1}. Therefore, if we select q = q′ + 1, r = r′, we obtain a = a′ + m = (q′ + 1)m + r′ = qm + r, which is what we wanted to prove. Now, if a is negative, then we have already proved that for the positive integer −a, there are q′ ∈ Z, r′ ∈ {0, 1, . . . , m − 1} such that −a = q′m + r′. If r′ = 0, we set r = 0, q = −q′; otherwise (i.e., r′ > 0), we put r = m − r′, q = −q′ − 1. In either case, we get a = q·m + r. Therefore, the integers q, r with the wanted properties exist for every a ∈ Z, m ∈ N.

Finally, we prove the uniqueness. Suppose that there are integers q1, q2 ∈ Z and r1, r2 ∈ {0, 1, . . . , m − 1} which satisfy a = q1m + r1 = q2m + r2. A simple rearrangement yields r1 − r2 = (q2 − q1)m, so m | r1 − r2. However, we have 0 ≤ r1 < m and 0 ≤ r2 < m, whence −m < r1 − r2 < m. Therefore, r1 − r2 = 0, and then (q2 − q1)m = 0, hence q1 = q2, r1 = r2. □

The integers q and r from the theorem are called the quotient and the remainder, respectively, of the division of a by m. The choice of this terminology seems more intuitive if we rearrange the equality a = mq + r into the form $\frac{a}{m} = q + \frac{r}{m}$, where $0 \le \frac{r}{m} < 1$.

11.1.2. Greatest common divisor. One of the most needed tools of computational number theory is the algorithm for computing the greatest common divisor. Since it is a relatively fast procedure, as we are going to show, it is used very often in modern algorithms as well.

11.A.2. Prove that for any a ∈ Z, the following holds:
i) a² leaves remainder 0 or 1 when divided by 4;
ii) a² leaves remainder 0, 1, or 4 when divided by 8;
iii) a⁴ leaves remainder 0 or 1 when divided by 16.

Solution. i) It follows from the Euclidean division theorem that every integer a can be written uniquely in either the form a = 2k or the form a = 2k + 1. Squaring this leads to a² = 4k² or a² = 4(k² + k) + 1, which is what we wanted to prove.
ii) Making use of the above result, we immediately obtain the statement for the (even) integers of the form a = 2k. For odd integers a, we arrived at a² = 4k(k + 1) + 1 above; we get the proposition easily if we realize that k(k + 1) is surely even.
iii) Again, we utilize the results of the previous parts, i.e., a² = 4ℓ or a² = 8ℓ + 1. Squaring these equalities once again, we get a⁴ = (a²)² = 16ℓ² for a even, and a⁴ = (a²)² = (8ℓ + 1)² = 64ℓ² + 16ℓ + 1 = 16(4ℓ² + ℓ) + 1 for a odd. □

11.A.3. Prove that if integers a, b ∈ Z leave remainder 1 when divided by an m ∈ N, then so does their product ab.

Solution. By the Euclidean division theorem, there are s, t ∈ Z such that a = sm + 1, b = tm + 1. Multiplying these equalities leads to the expression ab = (sm + 1)(tm + 1) = (stm + s + t)m + 1, where stm + s + t is the quotient, so the remainder of ab upon division by m equals 1. □
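The constructive proof of the division theorem translates directly into code. A small sketch (Python; the function name is ours, and Python's built-in divmod uses the same convention of a non-negative remainder for positive m):

    # Division with remainder, following the constructive proof above:
    # returns (q, r) with a = q*m + r and 0 <= r < m.
    def euclidean_division(a: int, m: int) -> tuple[int, int]:
        assert m > 0
        if a >= 0:
            q = 0
            while a - (q + 1) * m >= 0:   # peel off copies of m (the induction step)
                q += 1
            return q, a - q * m
        # negative a: divide -a and correct the remainder
        q, r = euclidean_division(-a, m)
        return (-q, 0) if r == 0 else (-q - 1, m - r)

    print(euclidean_division(27, 4), divmod(27, 4))     # (6, 3) twice
    print(euclidean_division(-27, 4), divmod(-27, 4))   # (-7, 1) twice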
It follows from the Euclidean division theorem that the greatest common divisor (a, b) of any pair of integers a, b exists, is unique, and can be computed efficiently by the Euclidean algorithm. At the same time, the coefficients in Bézout's identity, i.e., integers k, l such that ka + lb = (a, b), can be determined along the way. It can also be easily proved directly from the properties of divisibility that the integer linear combinations of integers a, b are exactly the multiples of their greatest common divisor.

11.A.4. Find the greatest common divisor of the integers a = 10175, b = 2277 and determine the corresponding coefficients in Bézout's identity.

Solution. We invoke the Euclidean algorithm:
10175 = 4 · 2277 + 1067,
2277 = 2 · 1067 + 143,
1067 = 7 · 143 + 66,
143 = 2 · 66 + 11,
66 = 6 · 11 + 0.
Therefore, 11 is the greatest common divisor.

Greatest common divisor
Consider integers a, b. An integer m satisfying both m | a and m | b is called a common divisor of a and b. A common divisor m ≥ 0 of a and b which is divisible by every common divisor of a and b is called the greatest common divisor of a and b; it is denoted by (a, b) (or gcd(a, b) for the sake of clarity). The concept of the least common multiple is defined dually; it is denoted by [a, b] (or lcm(a, b)).

It follows directly from the definition that for any a, b ∈ Z, we have (a, b) = (b, a), [a, b] = [b, a], (a, 1) = 1, [a, 1] = |a|, (a, 0) = |a|, [a, 0] = 0.

So far, we have not shown that the greatest common divisor and the least common multiple exist for every pair of integers a, b. However, if they exist, then they are unique, because every pair of non-negative integers k, l satisfies (directly from the definition): if k | l and l | k, then k = l. In the general case of divisibility in integral domains, however, the situation is more complicated – see 12.3.8. Even in the case of the so-called Euclidean domains,² which guarantee the existence of greatest common divisors, the result is determined uniquely only up to multiplication by a unit (an element having a multiplicative inverse) – in the case of the integers, the result would be determined uniquely up to sign; the uniqueness is thus guaranteed by the condition that the greatest common divisor be non-negative.

Euclidean algorithm
Theorem. Let a1, a2 be positive integers. For every n ≥ 3 such that a_{n−1} ≠ 0, let a_n denote the remainder of the division of a_{n−2} by a_{n−1}. Then, after a finite number of steps, we arrive at a_k = 0, and it holds that a_{k−1} = (a1, a2).

Proof. By the Euclidean division, a2 > a3 > a4 > . . .. Since these are non-negative integers, this decreasing sequence cannot be infinite, so we get a_k = 0 after a finite number of steps, with a_{k−1} ≠ 0. From the definition of the integers a_n, it follows that there are integers q1, q2, . . . , q_{k−2} such that
a1 = q1 · a2 + a3,
a2 = q2 · a3 + a4,
...
a_{k−3} = q_{k−3} · a_{k−2} + a_{k−1},
a_{k−2} = q_{k−2} · a_{k−1}.
It follows from the last equality that a_{k−1} | a_{k−2}. Further, a_{k−1} | a_{k−3}, ..., a_{k−1} | a2, a_{k−1} | a1. Therefore, a_{k−1} is a common divisor of the integers a1, a2. On the other hand, any common divisor of the given integers a1, a2 divides the integer a3 = a1 − q1a2 as well, hence it also divides a4 = a2 − q2a3, a5, . . . , and especially a_{k−1} = a_{k−3} − q_{k−3}a_{k−2}. We have thus proved that a_{k−1} is the greatest common divisor of the integers a1, a2. □

² Wikipedia, Euclidean domain, http://en.wikipedia.org/wiki/Euclidean_domain (as of July 29, 2017).
We will express this integer from the particular equalities, resulting in a linear combination of the integers a, b:

11 = 143 − 2 · 66
   = 143 − 2 · (1067 − 7 · 143) = −2 · 1067 + 15 · 143
   = −2 · 1067 + 15 · (2277 − 2 · 1067) = 15 · 2277 − 32 · 1067
   = 15 · 2277 − 32 · (10175 − 4 · 2277) = −32 · 10175 + 143 · 2277.

The wanted expression in the form of Bézout's identity is thus 11 = (−32) · 10175 + 143 · 2277. □

11.A.5. The computation of the greatest common divisor using the Euclidean algorithm is quite fast even for relatively large integers. In this example, we try it out with integers A, B, each of which is a product of two 101-digit primes. Notice that the computation of the greatest common divisor of even such huge integers takes an immeasurably small amount of time. A noticeable amount of time is needed only in the second computation, where the input consists of two integers having more than a million digits. An example in the system SAGE:

    sage: p = next_prime(5*10^100)
    sage: q = next_prime(3*10^100)
    sage: r = next_prime(10^100)
    sage: A = p*q; B = q*r
    sage: time G = gcd(A, B); print G
    Time: CPU 0.00 s, Wall: 0.00 s
    300000000000000000000000000000000000\
    000000000000000000000000000000000000\
    00000000000000000000000000223
    sage: time G = gcd(A^10000+1, B^10000+1)
    Time: CPU 2.47 s, Wall: 2.48 s

11.A.6. Find the greatest common divisor of the integers $2^{49} - 1$ and $2^{35} - 1$, and determine the corresponding coefficients in Bézout's identity.

Solution. Again, we use the Euclidean algorithm. We get:

$2^{49} - 1 = 2^{14}(2^{35} - 1) + 2^{14} - 1,$
$2^{35} - 1 = (2^{21} + 2^{7})(2^{14} - 1) + 2^{7} - 1,$
$2^{14} - 1 = (2^{7} + 1)(2^{7} - 1).$

The wanted greatest common divisor is thus $2^7 - 1 = 127$. Notice that 7 = (49, 35) – see also the following exercise 11.A.7. Reversing this procedure, we find the coefficients k, ℓ in Bézout's identity $2^7 - 1 = k(2^{49} - 1) + \ell(2^{35} - 1)$:

$2^7 - 1 = (2^{35} - 1) - (2^{21} + 2^{7})(2^{14} - 1)$
$= (2^{35} - 1) - (2^{21} + 2^{7})\big((2^{49} - 1) - 2^{14}(2^{35} - 1)\big)$
$= (2^{35} + 2^{21} + 1)(2^{35} - 1) - (2^{21} + 2^{7})(2^{49} - 1).$

Therefore, $k = -(2^{21} + 2^{7})$, $\ell = 2^{35} + 2^{21} + 1$. Bear in mind that these coefficients are never determined uniquely. □

It follows from the previous statement and the fact that (a, b) = (a, −b) = (−a, b) = (−a, −b) holds for any a, b ∈ Z that every pair of integers has a greatest common divisor.

11.1.3. Bézout's theorem. The Euclidean algorithm provides another interesting and often used statement.

Theorem (Bézout). For every pair of integers a, b, there exist integers k, l such that (a, b) = ka + lb.

Proof. It surely suffices to prove the theorem for a, b ∈ N. Notice that if integers r, s can be expressed in the form r = r1a + r2b, s = s1a + s2b, where r1, r2, s1, s2 ∈ Z, then we can also express r + s = (r1 + s1)a + (r2 + s2)b in this way, as well as c·r = (c·r1)a + (c·r2)b for any c ∈ Z, and thus also any integer linear combination of the numbers r and s arising in the course of the Euclidean algorithm. It follows from the Euclidean algorithm (for a1 = a, a2 = b) that we can express in this way a3 = a1 − q1a2, a4 = a2 − q2a3, . . . , hence also the integer a_{k−1} = a_{k−3} − q_{k−3}a_{k−2}, which is (a1, a2). Let us emphasize that the wanted numbers k, l are not determined uniquely. □

The Euclidean algorithm and Bézout's identity are fundamental results of elementary number theory and form one of the pillars of the algorithms used in algebra and number theory.
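The back-substitution in 11.A.4 and 11.A.6 is exactly the extended Euclidean algorithm. A compact iterative sketch (Python; ours, not the book's) reproduces both results:

    # Extended Euclidean algorithm: returns (g, k, l) with g = (a, b) = k*a + l*b.
    def egcd(a: int, b: int):
        k0, l0, k1, l1 = 1, 0, 0, 1        # coefficients expressing a and b themselves
        while b != 0:
            q, r = divmod(a, b)
            a, b = b, r
            k0, l0, k1, l1 = k1, l1, k0 - q * k1, l0 - q * l1
        return a, k0, l0

    print(egcd(10175, 2277))               # (11, -32, 143), as in 11.A.4
    g, k, l = egcd(2**49 - 1, 2**35 - 1)
    print(g, k, l)                         # g = 127, k = -(2^21 + 2^7), l = 2^35 + 2^21 + 1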
11.1.4. Least common multiple. We have so far ignored the properties of the least common multiple. However, thanks to the following proposition, they can be derived from the properties of the greatest common divisor.

Lemma. For every pair of integers a, b, their least common multiple [a, b] exists, and it holds that (a, b) · [a, b] = |a · b|.

Proof. The statement is trivially true if either of the integers a, b is zero. Further, we can assume that both of these (from now on, non-zero) integers are positive, since their signs have no effect on the formula in question. We show that q = a·b/(a, b) is the least common multiple of the integers a, b, which will finish the proof. Since (a, b) is a common divisor of a, b, both a/(a, b) and b/(a, b) are integers, hence

$$q = \frac{ab}{(a,b)} = \frac{a}{(a,b)}\cdot b = \frac{b}{(a,b)}\cdot a$$

is a common multiple of a, b. By Bézout's identity, there are integers k, l such that (a, b) = ka + lb. Let us suppose that n ∈ Z is an arbitrary common multiple of the integers a, b; we want to show that it is divisible by q. We have n/a, n/b ∈ Z, hence the number

$$\frac{n}{b}\cdot k + \frac{n}{a}\cdot l = \frac{n(ka+lb)}{ab} = \frac{n(a,b)}{ab} = \frac{n}{q}$$

is an integer as well. This means that q | n, which is what we wanted to prove. □

11.A.7. Now, let us try to generalize the result of the previous exercise, i.e., prove that for any a, m, n ∈ N, a ≠ 1, it holds that $(a^m - 1, a^n - 1) = a^{(m,n)} - 1$.

Solution. This statement follows easily from the fact that any pair of natural numbers k, ℓ satisfies $a^k - 1 \mid a^\ell - 1$ if and only if k | ℓ. This can be proved by dividing the integer ℓ by the integer k with remainder: we set ℓ = kq + r, where q, r ∈ N0, r < k, and observe that

$$a^{kq+r} - 1 = (a^k - 1)(a^{k(q-1)+r} + a^{k(q-2)+r} + \cdots + a^r) + a^r - 1$$

is the division of the integer $a^{kq+r} - 1$ by the integer $a^k - 1$ with remainder (clearly, $a^r - 1 < a^k - 1$). Hence we easily see that the remainder r is zero if and only if the remainder $a^r - 1$ is zero, which is what we wanted to prove. □

11.A.8. Prove that for all n ∈ N, $25 \mid 4^{2n+1} - 10n - 4$.

Solution. This statement can be proved in several ways (the easiest is in terms of congruences, which will be introduced a bit later). Here, we prove it by induction and then as a consequence of the binomial theorem. The proposition is clearly true for n = 0 (even though the problem does not ask about the situation for n = 0, we can surely prove the desired property for all n ∈ N0, thereby simplifying the first step of the induction). As for the second step: assuming $25 \mid 4^{2n-1} - 10(n-1) - 4$, we also have

$$25 \mid 16\big(4^{2n-1} - 10(n-1) - 4\big) = 4^{2n+1} - 10n - 4 - 150n + 100,$$

whence it easily follows that the desired proposition $25 \mid 4^{2n+1} - 10n - 4$ holds.

The second proof uses the binomial theorem. By that,

$$4^{2n+1} = (5-1)^{2n+1} = \sum_{k=0}^{2n+1}\binom{2n+1}{k}5^{2n+1-k}(-1)^k,$$

where all terms of the sum except the last two are clearly multiples of 25, i.e., the only part of the sum which need not be divisible by 25 is $\binom{2n+1}{2n}5^1(-1)^{2n} + 5^0(-1)^{2n+1} = 5(2n+1) - 1 = 10n + 4$. In other words, $4^{2n+1}$ leaves the same remainder when divided by 25 as the integer 10n + 4, which is equivalent to what we were to prove. □

B. Prime numbers

Euclid's theorem 11.2.1 expresses a fundamental property of primes, as can be seen in the following example.
11.1.5. Coprime integers. Analogously to the case of two integers, we can define the greatest common divisor and the least common multiple of more than two integers, and it is easily proved that

(a1, . . . , an) = ((a1, . . . , a_{n−1}), a_n), [a1, . . . , an] = [[a1, . . . , a_{n−1}], a_n].

Integers a1, a2, . . . , an ∈ Z are said to be coprime (also relatively prime) if and only if (a1, a2, . . . , an) = 1. They are said to be pairwise coprime (pairwise relatively prime) if and only if (a_i, a_j) = 1 for every pair of indices i, j satisfying 1 ≤ i < j ≤ n.

Remark. Note that the concepts coprime and pairwise coprime differ. For example, (6, 10, 15) = 1, yet no two of the three integers 6, 10, 15 are coprime.

Lemma. For any natural numbers a, b, c, we have:
(1) (ac, bc) = (a, b) · c;
(2) if a | bc and (a, b) = 1, then a | c;
(3) d = (a, b) if and only if there are k, l ∈ N such that a = dk, b = dl, and (k, l) = 1.

Proof. (1) Since (a, b) is a common divisor of the integers a, b, the number (a, b)·c is a common divisor of the integers ac, bc, hence (a, b)·c | (ac, bc). From Bézout's identity, we obtain k, l ∈ Z such that (a, b) = ka + lb. Since (ac, bc) is a common divisor of the integers ac, bc, it divides the integer kac + lbc = (a, b)·c as well. We have thus proved that (a, b)·c and (ac, bc) are natural numbers which divide each other, hence they are equal.
(2) Suppose that (a, b) = 1 and a | bc. From Bézout's identity again, we get k, l ∈ Z such that ka + lb = 1, whence c = c(ka + lb) = kca + lbc. Since a | bc, it follows that c, too, is a multiple of a.
(3) Let d = (a, b); then there are q1, q2 ∈ N such that a = dq1, b = dq2. By part (1), d = (a, b) = (dq1, dq2) = d·(q1, q2), so (q1, q2) = 1. On the other hand, if a = dq1, b = dq2, and (q1, q2) = 1, then (a, b) = (dq1, dq2) = d·(q1, q2) = d·1 = d (again invoking part (1) of this lemma). □

2. Primes

The concept of a prime is one of the most important in elementary number theory. Its importance is given mainly by the unique factorization theorem, which is a strong and efficient tool for solving miscellaneous problems from number theory.

11.B.1. i) Prove that if natural numbers m, n are coprime, then so are m² + mn + n² and m² − mn + n².
ii) Prove that if odd natural numbers m, n are coprime, then so are m + 2n and m² + 4n².

Solution. i) To reach a contradiction, suppose that there is a prime p which divides both of the integers m² + mn + n² and m² − mn + n². Then it divides their difference 2mn as well, whence p = 2 or p divides one of the integers m, n. If p = 2, then m² + mn + n² is even, so the integers m and n must both be even as well, which contradicts their coprimality. If p divides m as well as m² + mn + n², then it also divides n², whence, by Euclid's theorem (11.2.1), it divides n as well. This also contradicts the coprimality of m, n. The case p | n is analogous.
ii) Just as above, suppose that there is a prime p which divides both m + 2n and m² + 4n². Then it must also divide (m² + 4n²) − (m + 2n)(m − 2n) = 8n², and since p ≠ 2 (if m + 2n were even, then so would m be), we necessarily have p | n. However, since p divides m + 2n as well, we get p | m, which is a contradiction. □
11.B.2. Prove that an integer n > 1 is prime if and only if n is not divisible by any prime p ≤ √n.

Solution. If n is composite, we have n = ab with appropriate a, b > 1. If both a, b > √n, then we would have n = ab > √n · √n = n, a contradiction. Therefore, n has a divisor (and thus also a prime divisor) not greater than √n. □

The theoretical part contains Euclid's proof of the infinitude of primes and deals in detail with the distribution of primes in the set of natural numbers (in some cases, however, we were forced to leave the mentioned theorems unproved). Now we give several exercises on this topic.

11.B.3. Show that for any natural number n ≥ 3, there is at least one prime between the integers n and n!.

Solution. Let p denote an arbitrary prime dividing the integer n! − 1 (by the Fundamental theorem of arithmetic (11.2.2), such a prime exists since n! − 1 > 1). If we had p ≤ n, then p would divide n! as well, so it could not divide n! − 1. Therefore, n < p. Since p | (n! − 1), we have p ≤ n! − 1, hence p < n!. The prime p thus satisfies the conditions of the problem. □

The result of this exercise also implies the infinitude of primes (it suffices to consider the sequence a0 = 3, a_{n+1} = a_n! for n ∈ N). However, this statement is very weak (compared to reality), since the constructed sequence contains only a "tiny" portion of the primes. On the other hand, we are able to construct arbitrarily long runs of consecutive composite numbers, as shown by exercise 11.B.4 below.

Prime numbers
Every natural number n ≥ 2 has at least two positive divisors: 1 and itself. If it has no other divisors, it is called a prime (number). Otherwise (i.e., if other divisors exist), we talk about a composite (number).

In the subsequent paragraphs, we usually denote primes by the letter p. The first few primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, ... (in particular, the number 1 is considered to be neither prime nor composite, as it is a unit in the ring of the integers). As we shall prove shortly, there are infinitely many primes. However, we have rather limited computational resources when it comes to determining whether a given number is prime: the number $2^{82\,589\,933} - 1$, the greatest known prime as of 2018, has only 24 862 048 digits, so its decimal representation would fit into many a prehistoric data storage device. Printing it as a book would, however, take 5 180 pages (assuming 60 rows on a page and 80 digits in a row).

Now we introduce a theorem which gives a necessary and sufficient condition for being a prime and is thus a fundamental ingredient in the proof of the unique factorization theorem.

11.2.1. Theorem (Euclid's theorem on primes). An integer p ≥ 2 is a prime if and only if the following holds: for every pair of integers a, b, p | ab implies that either p | a or p | b (or both).

Proof. "⟹" Suppose that p is a prime and p | ab, where a, b ∈ Z. Since (p, a) is a positive divisor of p, we have either (p, a) = p or (p, a) = 1. In the former case, we get p | a; in the latter, p | b by part (2) of the previous lemma.
"⟸" If p is not a prime, it has a positive divisor distinct from both 1 and p; denote it by a. Then b = p/a ∈ N and p = ab, with 1 < a < p, 1 < b < p. We have thus found integers a, b such that p | ab while p divides neither a nor b. □
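The criterion from 11.B.2 yields the classical trial-division primality test. A minimal sketch in Python (our illustration, not the book's):

    # Trial division: n > 1 is prime iff no prime p <= sqrt(n) divides it.
    # It suffices to test all integers d with d*d <= n (every divisor has a prime factor).
    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    print([n for n in range(2, 40) if is_prime(n)])   # 2, 3, 5, 7, 11, 13, ...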
11.2.2. Fundamental theorem of arithmetic. In arithmetic operations, we rely heavily on expressing numbers (or more complicated objects) as products of simpler ones. The prime property distinguishes the simplest ones, and we call them irreducible. We shall see in 12.3.5 how to define divisibility in an arbitrary integral domain (i.e., a ring without divisors of zero). In some integral domains (e.g. Q), there are no elements with the prime property. Other integral domains have such elements, and yet they do not satisfy the unique factorization theorem. It is quite similar with the generalization of the aforementioned Euclid's theorem on primes – the elements which satisfy p | ab ⟹ (p | a or p | b) are always irreducible, but the converse is not true in general. Let us mention at least one example of an ambiguous factorization – in Z[√−5], we have:³

6 = 2 · 3 = (1 + √−5) · (1 − √−5).

³ The symbol Z[√−5] denotes the integers extended by a root of the equation x² = −5, defined similarly to how the complex numbers are obtained by adjoining the number √−1 to the reals.

11.B.4. Prove that for any natural number n, there exist n consecutive natural numbers, none of which is prime.

Solution. Examine the integers (n + 1)! + 2, (n + 1)! + 3, . . . , (n + 1)! + (n + 1). For any k ∈ {2, 3, . . . , n + 1}, we have k | (n + 1)!, so k | (n + 1)! + k as well; thus (n + 1)! + k cannot be a prime. Therefore, there is no prime among these n consecutive natural numbers. □

Practical notes. As we will show, it is very complicated to decide for certain whether a given large integer is a prime (on the other hand, for most composite numbers, it is really easy to prove that they are indeed composite – see part 11.6.4). Nevertheless, Indian mathematicians¹ managed to prove in 2002 that there is an algorithm, running in polynomial time with respect to the size of the input (i.e., the number of digits of the integer in question), which decides whether the integer is a prime. We are unable to produce such an algorithm for prime factorization (and it is widely believed to be impossible, although no one has been able to prove this so far). The fastest generally applicable factorization algorithm, the so-called general number field sieve, runs in sub-exponential time $O\big(e^{1.9(\log N)^{1/3}(\log\log N)^{2/3}}\big)$. In 1994, Peter Shor invented an algorithm which factors an integer N in cubic time (i.e., it runs in $O(\log^3 N)$) on a quantum computer. However, this algorithm requires computers with a sufficient number of quantum bits. We can see how difficult this is from the fact that in 2001, IBM managed to use a quantum computer to factor the integer 15, and in 2012, another record was achieved by factoring the integer 143 (in fact using another approach, the so-called adiabatic quantum computation). More evidence of the difficulty of this problem can be found in the challenge made in 1991 by RSA Security.² If anyone managed to factor the integers labeled by the number of their digits as RSA-100, ..., RSA-704, RSA-768, ..., RSA-2048, they could receive respectively 1,000, ..., 30,000, 50,000, ..., 200,000 dollars (the integer RSA-100 was factored by Arjen Lenstra that very year; the integer RSA-704 was factored in 2012; many others have not been factored yet).

Thanks to the unique factorization theorem, we are able (provided we know this factorization) to easily answer questions concerning the number or the sum of the divisors of a given integer. Just as easily, we obtain the (intuitively well-known) procedure for computing the greatest common divisor of two integers from their prime factorizations.

¹ M. Agrawal, N. Kayal, N. Saxena. PRIMES is in P. Annals of Mathematics 160 (2): 781–793. 2004.
² See http://www.rsasecurity.com/rsalabs/node.asp?id=2093.
However, it needs a longer discussion to verify that all of the mentioned factors are really irreducible in Z[√−5].

Fundamental theorem of arithmetic
Theorem. Every natural number n can be expressed as a product of primes, and this expression is unique up to the order of the factors. (If n is a prime, then it is the "product" of a single prime; if n = 1, it is the empty product, i.e., the product of the empty set of primes.)

Proof. First, we prove by complete induction on n that every natural number n can be expressed as a product of primes. We have already discussed the validity of this statement for n = 1. Now suppose that n ≥ 2 and that all natural numbers less than n can be factored into primes. If n is a prime, the statement clearly holds. If n is not a prime, then it has a divisor d with 1 < d < n. Setting e = n/d, we also have 1 < e < n. By the induction hypothesis, both d and e can be expressed as products of primes, so their product d · e = n can be expressed in this way as well.

To prove the uniqueness, consider an equality of products n = p1·p2 · · · ps = q1·q2 · · · qt, where p_i, q_j are primes for all i ∈ {1, . . . , s}, j ∈ {1, . . . , t}. We prove by induction on s that s = t and, after a suitable reordering, p1 = q1, . . . , ps = qs. If s = 1, then p1 = q1 · · · qt is a prime. If we had t > 1, the integer p1 would have a divisor q1 with 1 < q1 < p1 (since q2q3 · · · qt > 1), which is impossible. Therefore, t = 1 and p1 = q1. Now suppose that s ≥ 2 and that the proposition holds for s − 1. It follows from the equality p1·p2 · · · ps = q1·q2 · · · qt that ps divides the product q1 · · · qt, which is, by Euclid's theorem, possible only if ps divides some q_j (after relabeling, we may assume j = t). Since q_t is a prime, it follows that ps = qt. Dividing both sides of the original equality by this integer, we obtain p1·p2 · · · p_{s−1} = q1·q2 · · · q_{t−1}, and the induction hypothesis gives s − 1 = t − 1 and p1 = q1, . . . , p_{s−1} = q_{s−1}. Altogether, s = t and p1 = q1, . . . , ps = qs. This proves the uniqueness, and thus the entire theorem. □

11.2.3. Prime distribution. Don Zagier wrote: "There are two facts about the distribution of prime numbers. The first is that, [they are] the most arbitrary and ornery objects studied by mathematicians: they grow like weeds among the natural numbers, seeming to obey no other law than that of chance, and nobody can predict where the next one will sprout. The second fact is even more astonishing, for it states just the opposite: that the prime numbers exhibit stunning regularity, that there are laws governing their behavior, and that they obey these laws with almost military precision."

11.B.5. Number of divisors and sum of divisors. Prove the following formulae: every integer $a = p_1^{\alpha_1}\cdots p_k^{\alpha_k}$ has exactly

$$\tau(a) = (\alpha_1+1)(\alpha_2+1)\cdots(\alpha_k+1)$$

positive divisors, which sum up to

$$\sigma(a) = \frac{p_1^{\alpha_1+1}-1}{p_1-1}\cdots\frac{p_k^{\alpha_k+1}-1}{p_k-1}.$$

Moreover, let p1, . . . , pk be pairwise distinct primes and α1, ..., αk, β1, ..., βk non-negative integers. Denoting γ_i = min{α_i, β_i}, δ_i = max{α_i, β_i} for every i = 1, 2, . . . , k, we have

$$(p_1^{\alpha_1}\cdots p_k^{\alpha_k},\ p_1^{\beta_1}\cdots p_k^{\beta_k}) = p_1^{\gamma_1}\cdots p_k^{\gamma_k}, \qquad [p_1^{\alpha_1}\cdots p_k^{\alpha_k},\ p_1^{\beta_1}\cdots p_k^{\beta_k}] = p_1^{\delta_1}\cdots p_k^{\delta_k}.$$

Solution. Every positive divisor of the integer $a = p_1^{\alpha_1}\cdots p_k^{\alpha_k}$ is of the form $p_1^{\beta_1}\cdots p_k^{\beta_k}$, where β1, . . . , βk ∈ N0 and β1 ≤ α1, β2 ≤ α2, ..., βk ≤ αk. The justification of all the claims is now a simple consequence of this explicit description of the divisors. To find the number of positive divisors, we employ elementary combinatorics (the product rule) to get τ(a) = (α1 + 1)(α2 + 1) · · · (αk + 1). Next, we see that the formula for the sum of the divisors holds if we rewrite it in the form

$$(1 + p_1 + \cdots + p_1^{\alpha_1})\cdots(1 + p_k + \cdots + p_k^{\alpha_k}),$$

realizing that each pair of parentheses contains the sum of a finite geometric series. The other statements follow directly from the definition. □
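These formulas are easy to check numerically. A small sketch (Python, using SymPy's factorint for the prime factorization; the helper name is ours) compares them against brute force:

    # Number and sum of divisors: the formulas of 11.B.5 vs. a brute-force count.
    from sympy import factorint   # prime factorization, e.g. 12 -> {2: 2, 3: 1}

    def tau_sigma(a: int):
        tau, sigma = 1, 1
        for p, alpha in factorint(a).items():
            tau *= alpha + 1
            sigma *= (p**(alpha + 1) - 1) // (p - 1)
        return tau, sigma

    for a in (12, 28, 10175):
        divisors = [d for d in range(1, a + 1) if a % d == 0]
        print(a, tau_sigma(a), (len(divisors), sum(divisors)))   # the pairs agree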
The sum of all positive divisors of an integer is connected to the so-called perfect numbers. We say that an integer a is perfect if and only if σ(a) = 2a, i.e., if and only if the sum of all positive divisors of a, excluding a itself, equals a. Perfect numbers include, for instance, 6 = 1 + 2 + 3, 28 = 1 + 2 + 4 + 7 + 14, 496, and 8128 (this exhausts all perfect numbers less than 10 000). It can be shown that even perfect numbers are in a tight relation with the so-called Mersenne primes, since the following holds:

11.B.6. Show that a natural number a is an even perfect number if and only if it is of the form $a = 2^{q-1}(2^q - 1)$, where $2^q - 1$ is a prime.

Solution. If $a = 2^{q-1}(2^q - 1)$, where $p = 2^q - 1$ is a prime, then the previous statement yields

$$\sigma(a) = \frac{2^q-1}{2-1}\cdot(p+1) = (2^q-1)\cdot 2^q = 2a.$$

Such an integer a is thus a perfect number. For the opposite direction, consider any even perfect number a, and write $a = 2^k \cdot m$, where m, k ∈ N and 2 ∤ m.

In the next paragraphs, we will discuss the following questions: Are there infinitely many primes? Are there infinitely many primes in every (or at least one) arithmetic sequence? How are the primes distributed among the natural numbers?

The first question has an easy answer, and the fundamental theorem was known to Euclid already around 300 BC:

Theorem (Euclid). There are infinitely many primes among the natural numbers.

Proof. Suppose that there are only finitely many, and let them be denoted by p1, p2, . . . , pn. Set N = p1·p2 · · · pn + 1. This integer, being greater than 1, is either a prime or is divisible by a prime different from p1, . . . , pn (since the primes p1, . . . , pn divide the integer N − 1). In either case, we obtain a prime not in the list, which is a contradiction. □

11.2.4. Next, we mention a rather strong statement whose proof is very laborious (which is why we do not present it), yet it can be done by elementary means⁵.

Theorem (Chebyshev, Bertrand's postulate). For every integer n > 1, there is at least one prime p satisfying n < p < 2n.

11.2.5. Primes are distributed quite uniformly, in the sense that in any "reasonable" arithmetic sequence (i.e., one whose first term and common difference are coprime), there are infinitely many of them. For instance, considering the remainders upon division by 4, there are infinitely many primes with remainder 1 as well as infinitely many primes with remainder 3 (of course, there is no prime with remainder 0 and only one prime with remainder 2). The situation is analogous for remainders upon division by other integers, as explained by the following theorem, whose proof is very difficult.

Theorem (Dirichlet's on primes).
If a, m are coprime natural numbers, then there are infinitely many natural numbers k such that mk + a is a prime. In other words, there are infinitely many primes among the integers 1·m + a, 2·m + a, 3·m + a, . . ..

We can at least present a proof of a special case of this theorem, which is a modification of Euclid's proof of the infinitude of primes.

Proposition. There are infinitely many primes of the form 4k + 3, where k ∈ N0.

⁵ See Wikipedia, Proof of Bertrand's postulate, http://en.wikipedia.org/wiki/Proof_of_Bertrand's_postulate (as of July 29, 2017), or M. Aigner, G. Ziegler, Proofs from THE BOOK, Springer, 2009.

Since the function σ is multiplicative (see 11.3.2), we have $\sigma(a) = \sigma(2^k)\cdot\sigma(m) = (2^{k+1}-1)\cdot\sigma(m)$. However, it follows from a being perfect that $\sigma(a) = 2a = 2^{k+1}\cdot m$, whence $2^{k+1}\cdot m = (2^{k+1}-1)\cdot\sigma(m)$. Since $2^{k+1}-1$ is odd, we must have $2^{k+1}-1 \mid m$, so we can write $m = (2^{k+1}-1)\cdot n$ for an appropriate n ∈ N. Rearranging leads to $2^{k+1}\cdot n = \sigma(m)$. Both m and n divide m (and since $m/n = 2^{k+1}-1 > 1$, these integers are different), hence $\sigma(m) \ge m + n = 2^{k+1}\cdot n = \sigma(m)$, and so σ(m) = m + n. However, this means that m is a prime with the sole divisors m and n = 1, whence $a = 2^k\cdot(2^{k+1}-1)$, where $2^{k+1}-1 = m$ is a prime. □

Remark. On the other hand, attempts to describe odd perfect numbers have been unsuccessful; we do not even know whether an odd perfect number exists. Mersenne primes are primes of the form $2^k - 1$. It should not go unnoticed that Mersenne primes are easily recognizable among all primes – for Mersenne numbers (i.e., numbers of this form, without the primality requirement), there is a simple and fast procedure for verifying whether they are primes. It is thus not by chance that the largest known primes are usually of the form $2^k - 1$. Later, we will show how to efficiently verify whether a given Mersenne number is prime (see the Lucas–Lehmer test in part 11.6.9).

Although it may seem a weird and practically useless business to look for primes as large as possible, it pushes the borders of our knowledge of mathematics forward and refines the methods (as well as the hardware) used. Moreover, the discoverers often benefit from this: the Electronic Frontier Foundation issued EFF Cooperative Computing Awards for finding a prime having at least 10⁶, 10⁷, 10⁸, and 10⁹ digits – the rewards of 50,000 and 100,000 dollars for the first two categories were paid in 2000 and 2009, respectively, in both cases to the GIMPS project. Apparently, it will take a while before the other prizes are awarded.

C. Congruences

In this section, we will see in practice how wielding the basic operations with congruences can streamline our reasoning about various problems. We would be able to solve them without congruences, using only the basic properties of divisibility; however, with the help of congruences, our considerations will often be much shorter and clearer.

11.C.1. Show that for any a, b ∈ Z, m ∈ N, the following conditions are equivalent:
i) a ≡ b (mod m),
ii) a = b + mt for an appropriate t ∈ Z,
iii) m | a − b.

Proof. Suppose the contrary, i.e., that there are only finitely many primes of this form, and let them be denoted by p1 = 3, p2 = 7, p3 = 11, ..., pn. Further, set N = 4p2·p3 · · · pn + 3. Factoring N, the product must contain (according to the result of exercise 11.A.3) at least one prime p of the form 4k + 3. If not, N would be a product of only primes of the form 4k + 1, so N as well would give remainder 1 upon division by 4, which is not true. However, none of the primes p1, p2, . . . , pn can play the role of the mentioned p, since if we had p_i | N for some i ∈ {2, . . . , n}, then we would get p_i | 3. Similarly, 3 ∤ N. We thus get a contradiction with the assumption that there are only finitely many primes of the given form. □
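A quick empirical look at the proposition (Python, using SymPy's isprime; a crude count, not a proof, of how the two odd residue classes modulo 4 fill up with primes):

    # Count primes p <= x in the residue classes 1 and 3 modulo 4.
    from sympy import isprime

    for x in (100, 1000, 10000):
        r1 = sum(1 for p in range(2, x + 1) if isprime(p) and p % 4 == 1)
        r3 = sum(1 for p in range(2, x + 1) if isprime(p) and p % 4 == 3)
        print(x, r1, r3)   # both classes keep growing, roughly in balance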
An analogous elementary proof can be used for primes of the form 3k + 2 or 6k + 5; however, it does not work for primes of the form 3k + 1 or 4k + 1 (think this through!). In the latter case, we will be able to remedy this in part 11.4.11 about quadratic congruences.

11.2.6. From the propositions mentioned in this chapter, one can roughly imagine how "densely" the primes appear among the natural numbers. This is described more accurately (although "only" asymptotically) by the following very important theorem, which was proved independently by J. Hadamard and Ch. J. de la Vallée-Poussin in 1896 (we shall not go into the proof here).

Theorem (Prime Number Theorem). Let π(x) denote the number of primes less than or equal to a number x ∈ R. Then

$$\pi(x) \sim \frac{x}{\ln x},$$

i.e., the quotient of the functions π(x) and x/ln x approaches 1 as x → ∞.

The following table illustrates how good the asymptotic estimate π(x) ∼ x/ln x is in several concrete instances:

    x        π(x)     x/ln x      relative error
    100      25       21.71       0.13
    1000     168      144.76      0.13
    10000    1229     1085.73     0.11
    100000   9592     8685.88     0.09
    500000   41538    38102.89    0.08

The density of primes among the natural numbers is also partially described by the following result of Euler. Recall the formula $\sum_{n\in\mathbb N}\frac{1}{n^2} = \frac{\pi^2}{6}$ (see 7.1.10), which indicates, in view of the next proposition, that the primes are distributed more "densely" in N than the squares.

Proposition. Let P denote the set of all primes. Then

$$\sum_{p\in P}\frac{1}{p} = \infty.$$

Solution. i) ⟹ iii): If a = q1m + r, b = q2m + r, then a − b = (q1 − q2)m.
iii) ⟹ ii): If m | a − b, then there is a t ∈ Z such that m·t = a − b, i.e., a = b + mt.
ii) ⟹ i): If a = b + mt, then, expressing b = mq + r, it follows that a = m(q + t) + r. Therefore, a and b share the same remainder r upon division by m, i.e., a ≡ b (mod m). □

11.C.2. Prove the fundamental properties of congruences stated in 11.3.1.

Solution. i) If a ≡ b (mod m) and c ≡ d (mod m), then, by the previous lemma, there are integers s, t such that a = b + ms, c = d + mt. But then a + c = b + d + m(s + t), and, by the lemma again, a + c ≡ b + d (mod m). Adding the congruence a ≡ b (mod m) to the clearly valid congruence mk ≡ 0 (mod m) leads to a + mk ≡ b (mod m).
ii) If a ≡ b (mod m) and c ≡ d (mod m), there are integers s, t such that a = b + ms, c = d + mt. Then ac = (b + ms)(d + mt) = bd + m(bt + ds + mst), whence ac ≡ bd (mod m).
iii) Let a ≡ b (mod m) and let n be a natural number. Since $a^n - b^n = (a-b)(a^{n-1} + a^{n-2}b + \cdots + b^{n-1})$, it follows that $a^n \equiv b^n \pmod m$ as well.
iv) Suppose that a ≡ b (mod m), a = a1·d, b = b1·d, and (m, d) = 1. By the lemma, the difference a − b = (a1 − b1)·d is divisible by the integer m, and since (m, d) = 1, the integer a1 − b1 is also divisible by m (by lemma 11.1.5). Hence a1 ≡ b1 (mod m). Further, if ad ≡ bd (mod md), i.e., md | ad − bd = (a − b)d, we get directly from the definition of divisibility that m | a − b.
v) If a ≡ b (mod m), then a − b is a multiple of m, hence a multiple of any divisor d of m, so a ≡ b (mod d).
vi) Suppose that a ≡ b (mod m), b = b1d, m = m1d. Then there is an integer t such that a = b + mt = b1d + m1dt = (b1 + m1t)d, hence d | a.
vii) If a ≡ b (mod m1), a ≡ b (mod m2), . . . , a ≡ b (mod mk), then the difference a − b is a common multiple of the integers m1, m2, . . . , mk, and so it is divisible by their least common multiple [m1, m2, . . . , mk], whence a ≡ b (mod [m1, . . . , mk]). □
The density of primes among the natural numbers is also partially described by the following result of Euler. Let us recall the formula $\sum_{n \in \mathbb{N}} \frac{1}{n^2} = \frac{\pi^2}{6}$ (see 7.1.10), which indicates, in view of the next proposition, that the primes are distributed more "densely" in $\mathbb{N}$ than the squares.

Proposition. Let $P$ denote the set of all primes. Then $\sum_{p \in P} \frac{1}{p} = \infty$.

Solution. i) $\Rightarrow$ iii) If $a = q_1 m + r$, $b = q_2 m + r$, then $a - b = (q_1 - q_2) m$.
iii) $\Rightarrow$ ii) If $m \mid a - b$, then there is a $t \in \mathbb{Z}$ such that $m \cdot t = a - b$, i.e., $a = b + mt$.
ii) $\Rightarrow$ i) If $a = b + mt$, then, expressing $b = mq + r$, it follows that $a = m(q + t) + r$. Therefore, $a$ and $b$ share the same remainder $r$ upon division by $m$, i.e., $a \equiv b \pmod m$. □

11.C.2. Prove the fundamental properties of congruences stated in 11.3.1.

Solution. i) If $a \equiv b \pmod m$ and $c \equiv d \pmod m$, then, by the previous lemma, there are integers $s, t$ such that $a = b + ms$, $c = d + mt$. But then $a + c = b + d + m(s + t)$, and, by the lemma again, $a + c \equiv b + d \pmod m$. Adding the congruence $a \equiv b \pmod m$ to $mk \equiv 0 \pmod m$, which is clearly valid, leads to $a + mk \equiv b \pmod m$.
ii) If $a \equiv b \pmod m$ and $c \equiv d \pmod m$, there are integers $s, t$ such that $a = b + ms$, $c = d + mt$. Then $ac = (b + ms)(d + mt) = bd + m(bt + ds + mst)$, whence $ac \equiv bd \pmod m$.
iii) Let $a \equiv b \pmod m$ and let $n$ be a natural number. Since $a^n - b^n = (a - b)(a^{n-1} + a^{n-2} b + \cdots + b^{n-1})$, it follows that $a^n \equiv b^n \pmod m$ as well.
iv) Suppose that $a \equiv b \pmod m$, $a = a_1 \cdot d$, $b = b_1 \cdot d$, and $(m, d) = 1$. By the lemma, the difference $a - b = (a_1 - b_1) \cdot d$ is divisible by the integer $m$, and since $(m, d) = 1$, the integer $a_1 - b_1$ is also divisible by $m$ (by lemma 11.1.5). Hence $a_1 \equiv b_1 \pmod m$. Further, if $ad \equiv bd \pmod{md}$, i.e., $md \mid ad - bd$, we get directly from the definition of divisibility that $m \mid a - b$.
v) If $a \equiv b \pmod m$, then $a - b$ is a multiple of $m$, and hence a multiple of any divisor $d$ of $m$, so $a \equiv b \pmod d$.
vi) Suppose that $a \equiv b \pmod m$, $b = b_1 d$, $m = m_1 d$. Then there is an integer $t$ such that $a = b + mt = b_1 d + m_1 d t = (b_1 + m_1 t) d$, hence $d \mid a$.
vii) If $a \equiv b \pmod{m_1}$, $a \equiv b \pmod{m_2}$, ..., $a \equiv b \pmod{m_k}$, then the difference $a - b$ is a common multiple of the integers $m_1, m_2, \ldots, m_k$, and so it is divisible by their least common multiple $[m_1, m_2, \ldots, m_k]$, whence $a \equiv b \pmod{[m_1, \ldots, m_k]}$. □

Remark. We have already used some properties of congruences without explicitly mentioning it – now, the result of exercise 11.A.3 can be reformulated as "if $a \equiv 1 \pmod m$ and $b \equiv 1 \pmod m$, then also $ab \equiv 1 \pmod m$", which is a special case of property (2) of the theorem above. This is no coincidence: any statement which uses congruences can be reformulated in terms of divisibility. The usefulness of congruences thus lies not in the power to solve more problems than without them, but rather in being a very convenient notation which simplifies both expressions and reasonings.

Proof. Let $n$ be an arbitrary natural number and $p_1, \ldots, p_{\pi(n)}$ all the primes less than or equal to $n$. Let us set
$$\lambda(n) = \prod_{i=1}^{\pi(n)} \left(1 - \frac{1}{p_i}\right)^{-1}.$$
The particular factors can be perceived as sums of geometric series, hence
$$\lambda(n) = \prod_{i=1}^{\pi(n)} \left( \sum_{\alpha_i = 0}^{\infty} \frac{1}{p_i^{\alpha_i}} \right) = \sum \frac{1}{p_1^{\alpha_1} \cdots p_{\pi(n)}^{\alpha_{\pi(n)}}},$$
where we sum over all $\pi(n)$-tuples of non-negative integers $(\alpha_1, \ldots, \alpha_{\pi(n)})$. Since every integer not exceeding $n$ factors into primes from the set $\{p_1, \ldots, p_{\pi(n)}\}$ only, all the fractions $\frac{1}{1}, \frac{1}{2}, \ldots, \frac{1}{n}$ occur among the summands. Therefore,
$$\lambda(n) > 1 + \frac{1}{2} + \cdots + \frac{1}{n},$$
and since the harmonic series diverges (see 5.D.1), we also have $\lim_{n \to \infty} \lambda(n) = \infty$. Taking into account the expansion of the function $\ln(1 + x)$ into a power series (see 6.D.47), we further get
$$\ln \lambda(n) = -\sum_{i=1}^{\pi(n)} \ln\left(1 - \frac{1}{p_i}\right) = \sum_{i=1}^{\pi(n)} \sum_{m=1}^{\infty} \frac{1}{m p_i^m} = p_1^{-1} + \cdots + p_{\pi(n)}^{-1} + \sum_{i=1}^{\pi(n)} \sum_{m=2}^{\infty} \frac{1}{m p_i^m}.$$
Since the inner sum can be bounded from above by
$$\sum_{m=2}^{\infty} \frac{1}{m p_i^m} < \sum_{m=2}^{\infty} p_i^{-m} = p_i^{-2} \left(1 - p_i^{-1}\right)^{-1} \leq 2 p_i^{-2},$$
we get $\ln \lambda(n) < \sum_{i=1}^{\pi(n)} p_i^{-1} + 2 \sum_{i=1}^{\pi(n)} p_i^{-2}$. The second sum converges (since the series $\sum_{n=1}^{\infty} n^{-2}$ does), while $\ln \lambda(n)$ diverges, so the first sum $\sum_{i=1}^{\pi(n)} p_i^{-1}$ must diverge, which is what we wanted to prove. □

3. Congruences and basic theorems

This concept was introduced by C. F. Gauss in 1801 in his book Disquisitiones Arithmeticae. Although simple, its contribution to number theory lies mainly in allowing some reasonings (even quite complicated ones) to be written much more compactly and transparently.

Congruence

If integers $a, b$ give the same remainder $r$ (where $0 \leq r < m$) when divided by a natural number $m$, we say that they are congruent modulo $m$ and write $a \equiv b \pmod m$. In the other case, we say that the integers $a, b$ are not congruent modulo $m$, writing $a \not\equiv b \pmod m$.

11.C.3. i) Find the remainder of the integer $7^{30}$ when divided by 50.
ii) Find the last two digits of the decimal representation of the integer $7^{30}$.

Solution. i) Since $7^2 = 49 \equiv -1 \pmod{50}$, the properties of congruences mentioned above yield $7^{30} \equiv (-1)^{15} = -1 \pmod{50}$, so the remainder of $7^{30}$ upon division by 50 is 49.
ii) Our task is actually to determine the remainder of $7^{30}$ upon division by 100. We know from the above that the integer $7^{30}$ leaves remainder 49 when divided by 50, so the last two digits are either 49 or 99. In particular, we already know that $7^{30} \equiv -1 \pmod{25}$, and we can easily calculate that $7^{30} \equiv (-1)^{30} = 1 \pmod 4$.
Since $(4, 25) = 1$, the wanted pair of digits is 49 (as 49 leaves the desired remainders upon division by both 25 and 4). □

We now show how helpful the notion of congruence can be when solving problems similar to one already solved in 11.A.8 (where induction and the binomial theorem were used).

11.C.4. Prove that for any $n \in \mathbb{N}$, the integer $37^{n+2} + 16^{n+1} + 23^n$ is divisible by 7.

Solution. We have $37 \equiv 16 \equiv 23 \equiv 2 \pmod 7$, so by the basic properties of congruences,
$$37^{n+2} + 16^{n+1} + 23^n \equiv 2^{n+2} + 2^{n+1} + 2^n = 2^n (4 + 2 + 1) \equiv 0 \pmod 7. \quad □$$

11.C.5. Prove that the integer $n = (835^5 + 6)^{18} - 1$ is divisible by 112.

Solution. We factor $112 = 7 \cdot 16$. Since $(7, 16) = 1$, it suffices to show that $7 \mid n$ and $16 \mid n$. We have $835 \equiv 2 \pmod 7$, so
$$n \equiv (2^5 + 6)^{18} - 1 = 38^{18} - 1 \equiv 3^{18} - 1 = 27^6 - 1 \equiv (-1)^6 - 1 = 0 \pmod 7,$$
hence $7 \mid n$. Similarly, $835 \equiv 3 \pmod{16}$, so
$$n \equiv (3^5 + 6)^{18} - 1 = (3 \cdot 81 + 6)^{18} - 1 \equiv (3 \cdot 1 + 6)^{18} - 1 = 9^{18} - 1 = 81^9 - 1 \equiv 1^9 - 1 = 0 \pmod{16},$$
hence $16 \mid n$. Altogether, $112 \mid n$, which was to be proved. □
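Claims of this sort are also easy to sanity-check by brute force; a quick check of 11.C.4 and 11.C.5 (plain Python, using nothing beyond the statements themselves):

```python
# Check 11.C.4: 37**(n+2) + 16**(n+1) + 23**n is divisible by 7 for many n
assert all((37**(n + 2) + 16**(n + 1) + 23**n) % 7 == 0 for n in range(1, 200))

# Check 11.C.5: (835**5 + 6)**18 - 1 is divisible by 112 = 7 * 16;
# the three-argument pow keeps all intermediate numbers small
assert pow(835**5 + 6, 18, 112) == 1
```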
11.C.6. Prove that the following relations hold for a prime $p$:
i) If $k \in \{1, \ldots, p-1\}$, then $p \mid \binom{p}{k}$.
ii) If $a, b \in \mathbb{Z}$, then $a^p + b^p \equiv (a + b)^p \pmod p$.

Solution.

Whenever it is apparent that we are working with congruence relations, we usually omit the symbol $\mathrm{mod}$, or write just $a \equiv b \ (m)$.

11.3.1. Fundamental properties. It follows directly from the definition that congruence modulo $m$ is an equivalence relation. Now we state further properties of congruences; they are proven in detail in the practical column, see 11.C.2.

Properties of congruences

(1) Congruences with respect to the same modulus can be added. An arbitrary multiple of the modulus can be added to either side.
(2) Congruences with respect to the same modulus can be multiplied.
(3) Both sides of a congruence can be raised to the power of the same natural number.
(4) Both sides of a congruence can be divided by their common divisor provided it is coprime to the modulus. Both sides of a congruence together with the modulus can be divided by a positive divisor common to all three of them.
(5) If a congruence is valid with respect to a modulus $m$, it is also valid with respect to any modulus $d$ which divides $m$.
(6) If one side of a congruence and the modulus are divisible by an integer, then this integer must divide the other side of the congruence as well.
(7) If a congruence is valid with respect to moduli $m_1, \ldots, m_k$, it is also valid with respect to their least common multiple $[m_1, \ldots, m_k]$.

Remark. We have already used some properties of congruences without explicitly mentioning it – now, the result of exercise 11.A.3 can be reformulated as "if $a \equiv 1 \pmod m$ and $b \equiv 1 \pmod m$, then also $ab \equiv 1 \pmod m$", which is a special case of property (2) above. This is no coincidence: any statement which uses congruences can be reformulated in terms of divisibility. The usefulness of congruences thus lies not in the power to solve more problems than without them, but rather in being a very convenient notation which simplifies both expressions and reasonings.

11.3.2. Arithmetic functions. By an arithmetic function we mean any function whose domain is the set of natural numbers. Good examples are the number of divisors $\tau(a)$ or their sum $\sigma(a)$, computed in 11.B.5. A prominent example is the following function, counting the numbers coprime to a given $a$ and not exceeding it:

i) Since the binomial coefficient satisfies
$$\binom{p}{k} = \frac{p(p-1) \cdots (p-k+1)}{k!},$$
which is an integer, we know that $k!$ divides the product $p(p-1) \cdots (p-k+1)$. However, since the integer $k!$ is coprime to the prime $p$, we get that $k! \mid (p-1) \cdots (p-k+1)$, whence it follows that $p \mid \binom{p}{k}$.
ii) The binomial theorem implies that
$$(a + b)^p = a^p + \binom{p}{1} a^{p-1} b + \cdots + \binom{p}{p-1} a b^{p-1} + b^p.$$
Thanks to the previous item, we have $\binom{p}{k} \equiv 0 \pmod p$ for any $k \in \{1, \ldots, p-1\}$, whence the statement follows easily. □

11.C.7. Prove that for any natural numbers $m, n$ and any integers $a, b$ such that $a \equiv b \pmod{m^n}$, it is true that $a^m \equiv b^m \pmod{m^{n+1}}$.

Solution. Since clearly $m \mid m^n$, the congruence $a \equiv b \pmod m$ holds, invoking property (5) of 11.3.1. Therefore, considering the algebraic identity
$$a^m - b^m = (a - b)(a^{m-1} + a^{m-2} b + \cdots + a b^{m-2} + b^{m-1}),$$
all the summands in the second pair of parentheses are congruent to $a^{m-1}$ modulo $m$, so
$$a^{m-1} + a^{m-2} b + \cdots + b^{m-1} \equiv m \cdot a^{m-1} \equiv 0 \pmod m.$$
Since $m^n$ divides $a - b$ and the sum $a^{m-1} + a^{m-2} b + \cdots + b^{m-1}$ is divisible by $m$, we get that $m^{n+1}$ must divide their product, which means that $a^m \equiv b^m \pmod{m^{n+1}}$. □

11.C.8. Using the result of the previous exercise (see also 11.A.2), prove that:
i) integers $a$ which are not divisible by 3 satisfy $a^3 \equiv \pm 1 \pmod 9$,
ii) odd integers $a$ satisfy $a^4 \equiv 1 \pmod{16}$.

Solution. i) Cubing the congruence $a \equiv \pm 1 \pmod 3$ (and, again, raising the exponent of the modulus), we get $a^3 \equiv \pm 1 \pmod{3^2}$.
ii) This statement was proved already in the third part of exercise 11.A.2. Now we present another proof. Thanks to part (ii) of the mentioned exercise, we know that every odd integer $a$ satisfies $a^2 \equiv 1 \pmod{2^3}$. Squaring this (and recalling the above exercise) leads to $a^4 \equiv 1^2 \pmod{2^4}$. □

11.C.9. Divisibility rules. We can surely recall the basic rules of divisibility (at least by the numbers 2, 3, 4, 5, 6, 9, and 10) in terms of the decimal representation of a given integer. However, how can these rules be proved, and can they be extended to other divisors as well? We already know that we can restrict ourselves to divisibility by powers of primes (for instance, divisibility by 6 can be tested using divisibility by 2 and 3).

Euler's totient function $\varphi$

For a natural number $n$, we define the value of Euler's totient function $\varphi$ as
$$\varphi(n) = |\{a \in \mathbb{N} \mid 0 < a \leq n, \ (a, n) = 1\}|.$$
For example, $\varphi(1) = 1$, $\varphi(5) = 4$, $\varphi(6) = 2$. If $p$ is a prime, then clearly $\varphi(p) = p - 1$ (all natural numbers less than $p$ are coprime to it).

We are going to prove several important properties of the function $\varphi$. Let us start with a quite simple observation:

Lemma. Let $n \in \mathbb{N}$. Then $\sum_{d \mid n} \varphi(d) = n$.

Proof. Let us consider the $n$ fractions
$$\frac{1}{n}, \frac{2}{n}, \frac{3}{n}, \ldots, \frac{n-1}{n}, \frac{n}{n}.$$
Reducing them to lowest terms and grouping them together by their denominators, we get just the statement in question (see the picture illustrating the case $n = 12$). □

11.3.3. Soon we shall see that knowing the values of Euler's totient function may be extremely useful. The following explicit formula is crucial. We might prove it by the inclusion-exclusion principle, determining the number of integers not coprime to $n$ in a given interval. Here we shall employ a more conceptual approach which is of great interest by itself; thus, the proof is postponed to 11.3.8 below.

Theorem. Let $n \in \mathbb{N}$ factor into primes as $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$. Then
$$\varphi(n) = n \cdot \left(1 - \frac{1}{p_1}\right) \cdots \left(1 - \frac{1}{p_k}\right).$$
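Both the lemma and the theorem (whose proof comes only in 11.3.8) are easy to test numerically; a minimal sketch, with `phi` implementing the product formula just stated via naive trial division:

```python
from math import gcd

def phi(n):
    """Euler's totient via the product formula phi(n) = n * prod(1 - 1/p)."""
    result, m, p = n, n, 2
    while p * p <= m:
        if m % p == 0:
            result -= result // p      # multiply by (1 - 1/p)
            while m % p == 0:
                m //= p
        p += 1
    if m > 1:                          # one prime factor > sqrt(n) may remain
        result -= result // m
    return result

# The definition, the product formula, and the lemma sum_{d|n} phi(d) = n agree:
for n in range(1, 500):
    assert phi(n) == sum(1 for a in range(1, n + 1) if gcd(a, n) == 1)
    assert sum(phi(d) for d in range(1, n + 1) if n % d == 0) == n
```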
11.3.4. Multiplicative functions. Dealing with arithmetic functions, we introduce the concept of "multiplicativity", which concerns only coprime arguments:

Definition. A multiplicative function on the natural numbers is an arithmetic function which, for all pairs of coprime natural numbers $a, b$, satisfies $f(a \cdot b) = f(a) \cdot f(b)$.

The rule for divisibility by 9 says that a given integer is divisible by 9 if and only if its digit sum is. We will prove this as a consequence of a much stronger statement: every integer is congruent to its digit sum modulo 9 (in particular, it is congruent to zero if and only if its digit sum is). And this is trivial to prove: the digit sum of an integer $n = a_k 10^k + a_{k-1} 10^{k-1} + \cdots + a_1 10 + a_0$ is equal to $S(n) = a_k + a_{k-1} + \cdots + a_0$, and since $10^\ell \equiv 1^\ell = 1 \pmod 9$ for any $\ell \in \mathbb{N}_0$, we get
$$n = a_k 10^k + \cdots + a_0 \equiv a_k + \cdots + a_0 = S(n) \pmod 9.$$
This derivation remains valid if we replace 9 with 3.

The rule for divisibility by 11, which we have not mentioned yet, works similarly. Here, we have $10^\ell \equiv (-1)^\ell \pmod{11}$, so we get
$$n = a_k 10^k + \cdots + a_0 \equiv a_k (-1)^k + \cdots + a_1 (-1) + a_0 \equiv (a_0 + a_2 + \cdots) - (a_1 + a_3 + \cdots) \pmod{11}.$$
Therefore, an integer is divisible by 11 if and only if the difference of the sum of its digits at even places and the sum of its digits at odd places is.

There is a nice trick for the divisors 7 and 13: we have $1001 = 7 \cdot 11 \cdot 13$; an integer $n = 1000a + b$ thus satisfies $n \equiv -a + b \pmod m$, where $m$ is any of the numbers 7, 11, 13. Therefore, 2015 is divisible by 13 since $015 - 2 = 13$. Similarly, 2016 is divisible by 7 as $016 - 2 = 14$ is a multiple of 7. We could justify that 2013 is a multiple of 11 in the same way, but the aforementioned criterion $11 \mid (3 + 0) - (1 + 2)$ is perhaps slicker.

Using divisibility for error detection. Let us note that divisibility by eleven is often used for making decimal codes which allow us to detect a single-digit error: if we make such a mistake when copying an integer divisible by eleven, then the resulting integer is surely not divisible by eleven (see the aforementioned criterion of divisibility by eleven). More details can be found in chapter 12.5.1 about coding. For instance, the national identification numbers in the Czech Republic and Slovakia contain a check digit which completes the code into an integer divisible by eleven. Similarly, the numbers of bank accounts managed by Czech banks must comply with a similar (only a bit more complicated) procedure. Both the transformed 6-digit prefix $a_5 a_4 a_3 a_2 a_1 a_0$ and the 10-digit account number $b_9 b_8 b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0$ must satisfy the following condition on divisibility by eleven (here, we mention only the one for the number without the prefix):
$$0 \equiv b_9 2^9 + b_8 2^8 + b_7 2^7 + \cdots + b_3 2^3 + b_2 2^2 + b_1 2^1 + b_0 2^0 \equiv -5 b_9 + 3 b_8 - 4 b_7 - 2 b_6 - b_5 + 5 b_4 - 3 b_3 + 4 b_2 + 2 b_1 + b_0 \pmod{11}.$$
This condition can be described briefly as follows: the account number, perceived as being written in binary (though with usage of decimal digits), is to be divisible by eleven.

Clearly, the multiplicative functions include, for instance, the number of divisors $\tau(n)$ and their sum $\sigma(n)$, cf. 11.B.5. It follows directly from Theorem 11.3.3 that Euler's totient function is a multiplicative arithmetic function, too:

Corollary. Let $a, b \in \mathbb{N}$, $(a, b) = 1$. Then $\varphi(a \cdot b) = \varphi(a) \cdot \varphi(b)$.

Remark. The multiplicativity of $\varphi$ can also be derived directly from the observation that $(n, ab) = 1 \iff (n, a) = 1 \wedge (n, b) = 1$.
Then the easy fact about the totient function on prime powers,
$$\varphi(p^\alpha) = p^\alpha - p^{\alpha-1} = (p - 1) \cdot p^{\alpha-1},$$
leads to the formula for the computation of $\varphi$ in yet another alternative way.

11.3.5. Möbius function. Let us start our detour leading to the proof of Theorem 11.3.3. At the same time, the convolution and inversion formulae introduced here should provide a glimpse into a very important conceptual direction.

Definition. Let a natural number $n$ be factored into distinct primes: $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$. The value of the Möbius function $\mu(n)$ is defined to be 0 if $\alpha_i > 1$ for some $i$ (i.e., if $n$ is divisible by a square), and $(-1)^k$ otherwise. Further, we define $\mu(1) = 1$ (in accordance with the convention that 1 factors into the product of zero primes).

For instance, $\mu(4) = \mu(2^2) = 0$, $\mu(6) = \mu(2 \cdot 3) = (-1)^2 = 1$, $\mu(2) = \mu(3) = -1$. By its very definition, $\mu$ is a multiplicative function, too. In the next paragraphs, we prove several important properties of the Möbius function, especially the so-called Möbius inversion formula. Let us start with a simple observation:

Lemma. For all $n \in \mathbb{N} \setminus \{1\}$, it holds that $\sum_{d \mid n} \mu(d) = 0$.

Proof. Writing $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, all divisors $d$ of $n$ are of the form $d = p_1^{\beta_1} \cdots p_k^{\beta_k}$, where $0 \leq \beta_i \leq \alpha_i$ for all $i \in \{1, \ldots, k\}$. Therefore,
$$\sum_{d \mid n} \mu(d) = \sum_{\substack{(\beta_1, \ldots, \beta_k) \\ 0 \leq \beta_i \leq \alpha_i}} \mu\left(p_1^{\beta_1} \cdots p_k^{\beta_k}\right) = \sum_{(\beta_1, \ldots, \beta_k) \in \{0,1\}^k} \mu\left(p_1^{\beta_1} \cdots p_k^{\beta_k}\right) = \binom{k}{0} + \binom{k}{1}(-1) + \binom{k}{2}(-1)^2 + \cdots + \binom{k}{k}(-1)^k = (1 + (-1))^k = 0.$$
In the third equality, we used a combinatorial reasoning – the summand $\binom{k}{\ell}(-1)^\ell$ gives the contribution of the divisors $d = p_1^{\beta_1} \cdots p_k^{\beta_k}$ with the property that exactly $\ell$ of the exponents $\beta_1, \ldots, \beta_k$ are equal to one; there are $\binom{k}{\ell}$ of them, and each satisfies $\mu(p_1^{\beta_1} \cdots p_k^{\beta_k}) = (-1)^\ell$. □

11.C.10. Verify that the account number of Masaryk University, 85636621, is built correctly.

Solution. We test the condition of divisibility by eleven (padding the number with leading zeros to ten digits):
$$-5 b_9 + 3 b_8 - 4 b_7 - 2 b_6 - b_5 + 5 b_4 - 3 b_3 + 4 b_2 + 2 b_1 + b_0 \equiv -4 \cdot 8 - 2 \cdot 5 - 1 \cdot 6 + 5 \cdot 3 - 3 \cdot 6 + 4 \cdot 6 + 2 \cdot 2 + 1 \cdot 1 \equiv 0 \pmod{11}. \quad □$$
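The weighted check from 11.C.10 is a one-liner to automate; a minimal sketch (the function name is ours; the weights $2^i \bmod 11$ are exactly those derived above):

```python
def czech_account_ok(number: str) -> bool:
    """Check the divisibility-by-eleven condition for an (up to 10-digit)
    Czech bank account number: sum of digit * 2**position ≡ 0 (mod 11)."""
    digits = number.zfill(10)            # pad with leading zeros to b9..b0
    total = sum(int(d) * 2**i for i, d in enumerate(reversed(digits)))
    return total % 11 == 0

print(czech_account_ok("85636621"))      # True -- the Masaryk University account
```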
Euler's totient function. The totient function $\varphi$ assigns to a natural number $m$ the number of natural numbers which are less than or equal to $m$ and coprime to $m$, which can be written as
$$\varphi(m) = |\{a \in \mathbb{N} \mid 0 < a \leq m, \ (a, m) = 1\}|.$$
However, to be able to evaluate it efficiently, one needs to know the factorization of the input integer $m$ into primes. In such a case, for $m = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, we have
$$\varphi(m) = (p_1 - 1) p_1^{\alpha_1 - 1} \cdots (p_k - 1) p_k^{\alpha_k - 1}.$$
In particular, we know that $\varphi(p^\alpha) = (p - 1) \cdot p^{\alpha-1}$ and that $\varphi(m \cdot n) = \varphi(m) \cdot \varphi(n)$ holds whenever $m, n$ are coprime.

11.C.11. Calculate $\varphi(72)$.

Solution. $72 = 2^3 \cdot 3^2 \implies \varphi(72) = 72 \cdot (1 - \frac{1}{2}) \cdot (1 - \frac{1}{3}) = 24$; alternatively, $\varphi(72) = \varphi(8) \cdot \varphi(9) = 4 \cdot 6 = 24$. □

11.C.12. i) Determine all natural numbers $n$ for which $\varphi(n)$ is odd.
ii) Prove that $\forall n \in \mathbb{N}: \varphi(4n + 2) = \varphi(2n + 1)$.

Solution. i) We clearly have $\varphi(1) = \varphi(2) = 1$. Every integer $n \geq 3$ is either divisible by an odd prime $p$ (and then $\varphi(n)$ is divisible by $p - 1$, which is an even integer), or $n$ is a (higher-than-first) power of two (and then $\varphi(2^\alpha) = 2^{\alpha-1}$ is even as well). Altogether, $\varphi(n)$ is odd only for $n = 1, 2$.
ii) The integer $2n + 1$ is odd, so $(2, 2n + 1) = 1$, and hence
$$\varphi(4n + 2) = \varphi(2 \cdot (2n + 1)) = \varphi(2) \cdot \varphi(2n + 1) = \varphi(2n + 1). \quad □$$

11.C.13. Find all natural numbers $m$ for which:
i) $\varphi(m) = 30$, ii) $\varphi(m) = 34$, iii) $\varphi(m) = 20$, iv) $\varphi(m) = \frac{m}{3}$.

Solution. In the first three cases, we are looking for the preimages (fibers) of a given integer $a$ under $\varphi$, writing $m = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, and we proceed as follows:
• Since $\varphi(m) = (p_1 - 1) p_1^{\alpha_1 - 1} \cdots (p_k - 1) p_k^{\alpha_k - 1} = a$, every prime $p$ which divides $m$ must satisfy $p - 1 \mid a$.
• Similarly, every prime $p$ whose higher power divides $m$ must divide $a$. More exactly, we must even have $p^{\alpha - 1} \mid a$.
• This procedure results in a finite set of candidates for $m$, which can be eliminated in a convenient way, sometimes also using the fact that any prime dividing $a$ must occur in the factorization of $\varphi(m)$ as a divisor of some $p - 1$ or in a prime power $p^{\alpha-1}$.

11.3.6. Dirichlet convolution. There is another concept which is tightly connected to the Möbius function: the so-called Dirichlet product (also Dirichlet convolution).

Definition. Let $f, g$ be arithmetic functions. Their Dirichlet product is defined as
$$(f \circ g)(n) = \sum_{d \mid n} f(d) \cdot g\left(\frac{n}{d}\right) = \sum_{d_1 d_2 = n} f(d_1) \cdot g(d_2).$$

The next observation is nearly obvious:

Lemma. The Dirichlet product is associative.

Proof. $((f \circ g) \circ h)(n) = \sum_{d_1 d_2 d_3 = n} f(d_1) g(d_2) h(d_3) = (f \circ (g \circ h))(n)$. □

11.3.7. Möbius inversion formula. Let us define two useful functions $i$ and $I$ by $i(1) = 1$, $i(n) = 0$ for all $n > 1$, and $I(n) = 1$ for all $n \in \mathbb{N}$. Then every arithmetic function $f$ satisfies
$$f \circ i = i \circ f = f, \qquad (I \circ f)(n) = (f \circ I)(n) = \sum_{d \mid n} f(d).$$
Further, notice that the Möbius function commutes with $I$, and the result is $i$ (the statement is clear for $n = 1$; for $n > 1$ we use the above Lemma 11.3.5):
$$(I \circ \mu)(n) = \sum_{d \mid n} I(d) \mu\left(\frac{n}{d}\right) = \sum_{d \mid n} I\left(\frac{n}{d}\right) \mu(d) = \sum_{d \mid n} \mu(d) = 0.$$

Theorem (Möbius inversion formula). Let an arithmetic function $F$ be defined in terms of an arithmetic function $f$ by $F(n) = \sum_{d \mid n} f(d)$. Then $f$ can be expressed as
$$f(n) = \sum_{d \mid n} \mu\left(\frac{n}{d}\right) \cdot F(d).$$

Proof. The relation $F(n) = \sum_{d \mid n} f(d)$ can be rewritten as $F = f \circ I$. Therefore, $F \circ \mu = (f \circ I) \circ \mu = f \circ (I \circ \mu) = f \circ i = f$, which is the statement of the theorem. □

11.3.8. Proof of Theorem 11.3.3. Invoking Lemma 11.3.2 and the Möbius inversion formula, we get
$$\varphi(n) = \sum_{d \mid n} \mu(d) \frac{n}{d} = n - \frac{n}{p_1} - \cdots - \frac{n}{p_k} + \cdots + (-1)^k \frac{n}{p_1 \cdots p_k} = n \cdot \left(1 - \frac{1}{p_1}\right) \cdots \left(1 - \frac{1}{p_k}\right),$$
and the explicit formula for $\varphi(n)$ is proved.
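A quick numerical sanity check of the inversion formula, reusing the `phi` sketch above together with a straightforward implementation of $\mu$ (again by naive trial division):

```python
def mobius(n):
    """Möbius function: 0 if a square divides n,
    otherwise (-1)**(number of distinct prime factors)."""
    k, p = 0, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0        # p**2 divides the original n
            k += 1
        p += 1
    if n > 1:
        k += 1
    return (-1) ** k

# Möbius inversion applied to F(n) = sum_{d|n} phi(d) = n must give phi back:
divisors = lambda n: [d for d in range(1, n + 1) if n % d == 0]
for n in range(1, 300):
    assert phi(n) == sum(mobius(n // d) * d for d in divisors(n))
```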
Now, let us solve problems i)–iii):

i) Every prime $p$ from the factorization of $m$ must satisfy $p - 1 \mid 30$, so $p - 1 \in \{1, 2, 3, 5, 6, 10, 15, 30\}$, which is satisfied by the primes $p \in \{2, 3, 7, 11, 31\}$, and only 2 and 3 of them can divide $m$ in a higher power than the first. Therefore, $m = 2^\alpha 3^\beta 7^\gamma 11^\delta 31^\varepsilon$, where $\alpha, \beta \in \{0, 1, 2\}$, $\gamma, \delta, \varepsilon \in \{0, 1\}$. The analysis of the possibilities can be further simplified if we realize that $\varphi(3) = 2$, $\varphi(3^2) = \varphi(7) = 6$, $\varphi(11) = 10$ are all integers which divide 30 with an odd quotient greater than 1. Therefore, if we had, for instance, $m = 7 \cdot m_1$, where $7 \nmid m_1$, then we would also have $\varphi(m_1) = 5$, which is impossible, as we know from the previous exercise. We thus get $\beta = \gamma = \delta = 0$ and $m = 2^\alpha \cdot 31^\varepsilon$, whence we easily obtain the solution $m \in \{31, 62\}$.

ii) Similarly to the above, only the primes $p \in \{2, 3\}$ can divide $m$, and the prime 3 can divide $m$ only in the first power. However, since $34/\varphi(3) = 17$ is odd, the prime 3 cannot divide $m$ at all. The remaining possibility, $m = 2^\alpha$, leads to $34 = 2^{\alpha-1}$, which is also impossible. Therefore, there is no such number $m$.

iii) Now, every prime $p$ dividing $m$ must satisfy $p - 1 \mid 20$, so $p - 1 \in \{1, 2, 4, 5, 10, 20\}$, which is satisfied by the primes $p \in \{2, 3, 5, 11\}$, and only 2 and 5 of those can divide $m$ in a higher power. We thus have $m = 2^\alpha 3^\beta 5^\gamma 11^\delta$, where $\alpha \in \{0, 1, 2, 3\}$, $\gamma \in \{0, 1, 2\}$, $\beta, \delta \in \{0, 1\}$. First, consider $\delta = 1$. Then $\varphi(2^\alpha 3^\beta 5^\gamma) = 2$, whence we easily get that $\gamma = 0$ and $(\alpha, \beta) \in \{(2, 0), (1, 1), (0, 1)\}$, which gives three solutions: $m \in \{44, 66, 33\}$. Further, let $\delta = 0$. If $\gamma = 2$, then $\varphi(2^\alpha 3^\beta) = 1$, whence $(\alpha, \beta) \in \{(1, 0), (0, 0)\}$; we thus obtain two more solutions, $m \in \{50, 25\}$. If $\gamma = 1$, then we get $20/\varphi(5) = 5$, similarly to the above item; this is an odd integer, so there are no solutions in this case. The same holds for $\gamma = 0$, since the equation $\varphi(2^\alpha) = 20$ has no solution either. Altogether, there are five satisfactory values, $m \in \{25, 33, 44, 50, 66\}$.

iv) This problem is of a different kind than the previous ones, so we must approach it differently. The relation $\varphi(m) = \frac{m}{3}$ implies that $m$ must be a multiple of three (since the left-hand side of the equation is an integer). We will thus look for the solution in the form $m = 3^\alpha \cdot n$, where $3 \nmid n$, $\alpha \geq 1$. Then $\varphi(m) = 2 \cdot 3^{\alpha-1} \cdot \varphi(n) = \frac{m}{3} = 3^{\alpha-1} \cdot n$.

11.3.9. Remark. [Placeholder in the source: a remark (by T. Perutka) on the ring of arithmetic functions should appear here; it is also meant to fill the space required by the large number of exercises on Euler's function in the practical column.]

Möbius inversion formula and irreducible polynomials. Above, we proved the properties of Euler's totient function using the Möbius inversion formula. (A caveat: this subsection is somewhat premature – finite fields and polynomial rings are treated in detail only much later.) The standard form of this formula connects the expression of an arithmetic function $F$ in terms of a function $f$,
$$F(n) = \sum_{d \mid n} f(d),$$
to the inverse expression of the function $f$ in terms of the function $F$,
$$f(n) = \sum_{d \mid n} \mu\left(\frac{n}{d}\right) \cdot F(d).$$
The value $\mu(n)$ depends on the prime factorization of the input value $n$ as follows:
• if a square of a prime divides $n$, then $\mu(n) = 0$;
• otherwise, we set $\mu(n) = (-1)^k$, where $k$ is the number of primes which divide $n$.

Reducing this leads to $2 \varphi(n) = n$ or, equivalently, $\varphi(n) = \frac{n}{2}$. Here we must have $2 \mid n$, and writing $n = 2^\beta \cdot k$, where $(k, 6) = 1$ and $\beta \geq 1$, we get $\varphi(k) = k$, which is satisfied only by $k = 1$. To summarize, the problem is solved exactly by the natural numbers of the form $2^\alpha 3^\beta$, where $\alpha, \beta \geq 1$. □

11.C.14. Find all two-digit numbers $n$ for which $9 \mid \varphi(n)$. ⃝

11.C.15. Fermat's (little) theorem. Now we prove Fermat's little theorem 11.3.11 in two more ways: by mathematical induction, and then by combinatorial means. The theorem states that for any integer $a$ and any prime $p$ which does not divide $a$, it holds that $a^{p-1} \equiv 1 \pmod p$.

Solution. First, we prove (by induction on $a$) the apparently equivalent statement that $a^p \equiv a \pmod p$ holds for any $a \in \mathbb{Z}$ and prime $p$. For $a = 1$, there is nothing to prove. Further, let us assume that the proposition holds for $a$ and prove its validity for $a + 1$. It follows from the induction hypothesis and exercise 11.C.6 that
$$(a + 1)^p \equiv a^p + 1^p \equiv a + 1 \pmod p,$$
which is what we were to prove. The statement holds trivially for $a = 0$ as well as in the case $a < 0$, $p = 2$. The validity for $a < 0$ and $p$ odd can be obtained easily from the above: since $-a$ is a positive integer, we get $-a^p = (-a)^p \equiv -a \pmod p$, whence $a^p \equiv a \pmod p$.

The combinatorial proof is a somewhat "cunning" one. Similarly to problems using Burnside's lemma (see exercise 12.G.1), we are to determine how many necklaces can be created by wiring a given number of beads, of which there is a given number of types. Having $a$ types of beads, there are clearly $a^p$ strings of length $p$, $a$ of which consist of a single bead type. From now on, we are interested only in the other ones, of which there are $a^p - a$. Apparently, each necklace is transformed into itself by rotating by $p$ beads. In general, a necklace can be transformed into itself by rotating by another number of beads, but this number can never be coprime to $p$ (for instance, considering $p = 8$ and the necklace ABABABAB, rotations by 2, 4, or 6 beads leave it unchanged). However, if $p$ is a prime, it follows that all rotations lead to different necklaces. Therefore, if we do not distinguish necklaces which differ by rotation only (i.e., in the position of the "knot"), there are exactly $\frac{a^p - a}{p}$ of them, which especially means that $p \mid a^p - a$.

As an example, let us consider the case $a = 2$, $p = 5$, i.e., necklaces of length 5 consisting of 2 bead types (A, B). There are $2^5 = 32$ strings in total, 2 of which consist of a single bead type (AAAAA, BBBBB). Leaving these out and ignoring the position of the knot, there remain $\frac{2^5 - 2}{5} = 6$ necklaces which differ not merely by rotation, namely ABBBB, AABBB, AAABB, AAAAB, ABABB, AABAB. □

This formula can be generalized in many ways – especially to the case where the functions $F$ and $f$ map the natural numbers into an abelian group $(G, \cdot)$. In this case, the formula (considering the operation in $G$ to be written multiplicatively) takes the form
$$f(n) = \prod_{d \mid n} F(d)^{\mu\left(\frac{n}{d}\right)}.$$

Now, we will demonstrate the use of the Möbius inversion formula on a more complex example from the theory of finite fields. Let us consider the $p$-element field $\mathbb{F}_p$ (i.e., the ring of residue classes modulo a prime $p$) and examine the number $N_d$ of monic irreducible polynomials of a given degree $d$ over this field.
Let $S_d(x)$ denote the product of all such polynomials. Now, we borrow a (not very hard) theorem from the theory of finite fields which states that for all $n \in \mathbb{N}$, we have
$$x^{p^n} - x = \prod_{d \mid n} S_d(x).$$
Comparing the degrees of the polynomials on both sides yields
$$p^n = \sum_{d \mid n} d N_d,$$
whence we get, by applying the standard Möbius inversion formula,
$$N_n = \frac{1}{n} \sum_{d \mid n} \mu\left(\frac{n}{d}\right) p^d.$$
In particular, we can see that for any $n \in \mathbb{N}$,
$$N_n = \frac{1}{n}\left(p^n - \cdots + \mu(n) p\right) \neq 0,$$
since the expression in the parentheses is a sum of distinct powers of $p$ multiplied by coefficients $\pm 1$, so it cannot be equal to 0. Therefore, there exist irreducible polynomials over $\mathbb{F}_p$ of an arbitrary degree $n$, and hence there are finite fields $\mathbb{F}_{p^n}$ (having $p^n$ elements) for any prime $p$ and natural number $n$ (in the theory of field extensions, such a field is constructed as the quotient ring $\mathbb{F}_p[x]/(f)$ of the ring of polynomials over $\mathbb{F}_p$ modulo the ideal generated by an irreducible polynomial $f \in \mathbb{F}_p[x]$ of degree $n$, whose existence has just been proved).

11.3.10. Example. By the formula we have proved, the number of monic irreducible polynomials over $\mathbb{F}_2$ of degree 5 is equal to
$$N_5 = \frac{1}{5} \sum_{d \mid 5} \mu\left(\frac{5}{d}\right) 2^d = \frac{1}{5}\left(\mu(1) \cdot 2^5 + \mu(5) \cdot 2\right) = 6.$$
The number of monic irreducible polynomials over $\mathbb{F}_3$ of degree four is then
$$N_4 = \frac{1}{4} \sum_{d \mid 4} \mu\left(\frac{4}{d}\right) 3^d = \frac{1}{4}\left(\mu(1) \cdot 3^4 + \mu(2) \cdot 3^2 + \mu(4) \cdot 3\right) = \frac{1}{4}(81 - 9) = 18.$$
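The count $N_n$ is easy to tabulate; a small sketch reusing the `mobius` helper defined in the earlier snippet:

```python
def monic_irreducible_count(p, n):
    """Number of monic irreducible polynomials of degree n over F_p:
    N_n = (1/n) * sum_{d|n} mu(n/d) * p**d."""
    total = sum(mobius(n // d) * p**d for d in range(1, n + 1) if n % d == 0)
    assert total % n == 0
    return total // n

print(monic_irreducible_count(2, 5))   # 6, as computed in 11.3.10
print(monic_irreducible_count(3, 4))   # 18
```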
11.3.11. The next two theorems belong to the most important results of elementary number theory, and they will often be applied in further theoretical as well as practical problems.

11.C.16. i) Determine the remainder of the integer $2^{50} + 3^{50} + 4^{50}$ when divided by 17.
ii) Determine the remainder of the integer $2^{181} + 3^{181} + 5^{181}$ when divided by 37.

Solution. i) By Fermat's theorem, we have $2^{16} \equiv 3^{16} \equiv 4^{16} \equiv 1 \pmod{17}$. Since $50 \equiv 2 \pmod{16}$, we get
$$2^{50} + 3^{50} + 4^{50} \equiv 2^2 + 3^2 + 4^2 \equiv 12 \pmod{17}.$$
ii) Similarly, $2^{36} \equiv 3^{36} \equiv 5^{36} \equiv 1 \pmod{37}$, and since $181 \equiv 1 \pmod{36}$, we get
$$2^{181} + 3^{181} + 5^{181} \equiv 2 + 3 + 5 \equiv 10 \pmod{37}. \quad □$$

Euler's theorem and orders of integers modulo $m$. Thanks to Euler's theorem, it is guaranteed that every integer $a$ which is coprime to $m$ has an order, i.e., a least natural number $n$ such that $a^n \equiv 1 \pmod m$. The most interesting are those integers $a$ whose order equals $\varphi(m)$; they are called the primitive roots modulo $m$.

11.C.17. Determine the order of 2 modulo 7.

Solution. The order of 2 modulo 7 is equal to 3, as $2^1 = 2 \not\equiv 1 \pmod 7$, $2^2 = 4 \not\equiv 1 \pmod 7$, $2^3 = 8 \equiv 1 \pmod 7$. □

11.C.18. Determine the last two digits of the number $7^{2019}$.

Solution. We can easily see that the order of 7 modulo 100 is equal to 4 – by simple calculations, we have $7^2 = 49$ and $49^2 = (50 - 1)^2 = 50^2 - 2 \cdot 50 + 1 \equiv 1 \pmod{100}$. Therefore, it suffices to determine the remainder $r$ of the integer 2019 when divided by 4, since $7^{2019} \equiv 7^r \pmod{100}$. Apparently, we have $r = 3$, so the wanted last two digits are the same as those of $7 \cdot 49$, i.e., 43. □

Now we mention several statements about the properties of the order of an integer modulo $m$.

11.C.19. Let $m \in \mathbb{N}$, $a, b \in \mathbb{Z}$, $(a, m) = (b, m) = 1$. Prove that if $a \equiv b \pmod m$, then the integers $a, b$ share the same order modulo $m$.

Solution. Raising the congruence $a \equiv b \pmod m$ to the $n$-th power leads to $a^n \equiv b^n \pmod m$, so $a^n \equiv 1 \pmod m \iff b^n \equiv 1 \pmod m$. □

11.C.20. Let $m \in \mathbb{N}$, $a \in \mathbb{Z}$, $(a, m) = 1$. If the order of $a$ modulo $m$ is $r \cdot s$ (where $r, s \in \mathbb{N}$), prove that the order of the integer $a^r$ modulo $m$ is $s$.

Solution. Since none of the integers $a, a^2, a^3, \ldots, a^{rs-1}$ is congruent to 1 modulo $m$, neither is any of the integers $a^r, a^{2r}, a^{3r}, \ldots, a^{(s-1)r}$. On the other hand, we have $(a^r)^s \equiv 1 \pmod m$, so the order of $a^r$ modulo $m$ equals $s$. □

Fermat's little theorem

Theorem. Let $a$ be an integer and $p$ a prime, $p \nmid a$. Then $a^{p-1} \equiv 1 \pmod p$.

Proof. The statement will follow as a simple consequence of Euler's theorem (and together with it, it is a consequence of the more general Lagrange's theorem 12.4.10). However, it can also be proved directly, by mathematical induction or by combinatorial means, as mentioned in exercise 11.C.15. □

Sometimes, Fermat's little theorem is presented in the following form, which is apparently equivalent to the original statement.

Corollary. Let $a$ be an integer and $p$ a prime. Then $a^p \equiv a \pmod p$.

11.3.12. Euler's theorem. We are coming to a theorem which has immediate striking applications in cryptography (we shall come to this point in 11.6.18). Before formulating and proving Euler's theorem, we introduce a few useful concepts.

Residue systems

A complete residue system modulo $m$ is an arbitrary $m$-tuple of integers which are pairwise incongruent modulo $m$ (the most commonly used $m$-tuple is $0, 1, \ldots, m-1$ or, for odd $m$, its "symmetric" variant $-\frac{m-1}{2}, \ldots, -1, 0, 1, \ldots, \frac{m-1}{2}$). A reduced residue system modulo $m$ is an arbitrary $\varphi(m)$-tuple of integers which are pairwise incongruent modulo $m$ and coprime to $m$.

Lemma. Let $x_1, x_2, \ldots, x_{\varphi(m)}$ form a reduced residue system modulo $m$. If $a \in \mathbb{Z}$, $(a, m) = 1$, then the integers $a x_1, \ldots, a x_{\varphi(m)}$ also form a reduced residue system modulo $m$.

Proof. Since $(a, m) = 1$ and $(x_i, m) = 1$, we have $(a x_i, m) = 1$. Further, if we had $a x_i \equiv a x_j \pmod m$ for some distinct indices $i, j$, dividing both sides of the congruence by the integer $a$ (which is coprime to $m$) would lead to $x_i \equiv x_j \pmod m$, meaning that the original $\varphi(m)$-tuple was not a reduced residue system either. □

Euler's theorem

Theorem. Let $a \in \mathbb{Z}$, $m \in \mathbb{N}$, $(a, m) = 1$. Then $a^{\varphi(m)} \equiv 1 \pmod m$.

Proof. Let $x_1, x_2, \ldots, x_{\varphi(m)}$ be an arbitrary reduced residue system modulo $m$. By the previous lemma, $a x_1, \ldots, a x_{\varphi(m)}$ is also a reduced residue system modulo $m$.

11.C.21. Show that the converse of the previous statement need not be true in general.

Solution. Indeed, even if the order of an integer $a^r$ modulo $m$ is $s$, the order of $a$ modulo $m$ may not be $r \cdot s$. For instance, for $m = 13$ and the integers $a = 3$, $b = -4$, we have $a^2 = 9$, $a^3 = 27 \equiv 1 \pmod{13}$, so the order of $a$ modulo 13 is 3. Similarly, $b^2 = 16 \not\equiv 1 \pmod{13}$, $b^3 = -64 \equiv 1 \pmod{13}$, so the order of $b$ modulo 13 is 3, too. On the other hand, $b^2 = (-4)^2 = 16 \equiv 3 = a \pmod{13}$ has the same order (3) as $a$, yet the integer $b$ does not have order $2 \cdot 3$. □

11.C.22. Determine the last digit of the numbers i) $35^{79}$, ii) $37^{37^{37}}$, iii) $12^{13^{14}}$. ⃝

11.C.23. It holds for all odd $n \in \mathbb{N}$ that $n \mid 2^{n!} - 1$. Prove this! ⃝

11.C.24. i) Determine the last digit of the number $7^{9^{573}}$.
ii) Determine the remainder of the number $15^{14^{13}}$ when divided by 11.

Solution. i) The order of 7 modulo 100 is equal to 4 by exercise 11.C.18, so it suffices to find the remainder of the (extremely large) exponent upon division by 4. Since $9 \equiv 1 \pmod 4$, the entire exponent $9^{573}$ leaves remainder 1 as well. Therefore, the wanted digit is $7^1 = 7$.
ii) The order of $15 \equiv 4 \pmod{11}$ is 5 (which can be found by direct computation, or from the fact that 2 is a primitive root modulo 11 (see also 11.C.28); theorem 11.3.14 then yields that the order of $4 = 2^2$ is $\frac{10}{(10, 2)} = 5$). It is thus sufficient to determine the remainder of the exponent modulo 5. We have $14^{13} \equiv (-1)^{13} = -1 \equiv 4 \pmod 5$, so the wanted remainder is $4^4 = 2^8 = 256 \equiv 6 - 5 + 2 = 3 \pmod{11}$ (using the alternating-digit-sum rule for eleven on 256). Alternatively, we could have finished the calculation as follows: $4^4 \equiv 4^{-1} \equiv 3 \pmod{11}$. □
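The order of an integer is found by exactly the direct search used in 11.C.17; a minimal sketch:

```python
from math import gcd

def order(a, m):
    """Least n >= 1 with a**n ≡ 1 (mod m); requires gcd(a, m) == 1.
    By Euler's theorem, the search terminates within phi(m) steps."""
    assert gcd(a, m) == 1
    x, n = a % m, 1
    while x != 1:
        x = x * a % m
        n += 1
    return n

print(order(2, 7))     # 3  (11.C.17)
print(order(7, 100))   # 4  (11.C.18)
print(order(4, 11))    # 5  (11.C.24)
```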
11.C.25. Determine the last two digits of the decimal expansion of the number $14^{14^{14}}$.

Solution. We are interested in the remainder of the number $a = 14^{14^{14}}$ upon division by 100. However, since $(14, 100) > 1$, we cannot consider the order of 14 modulo 100. Instead, we factor the modulus into coprime integers: $100 = 4 \cdot 25$. Apparently, $4 \mid a$, so it remains to find the remainder of $a$ modulo 25. By Euler's theorem, we have $14^{\varphi(25)} = 14^{20} \equiv 1 \pmod{25}$, so we are interested in the remainder of $14^{14}$ upon division by $20 = 4 \cdot 5$. Again, we clearly have $4 \mid 14^{14}$, and further $14^{14} \equiv (-1)^{14} = 1 \pmod 5$, so $14^{14} \equiv 16 \pmod{20}$. Altogether,
$$14^{14^{14}} \equiv 14^{16} = 2^{16} \cdot 7^{16} \pmod{25}.$$

Therefore, for every $i \in \{1, 2, \ldots, \varphi(m)\}$, there is a unique $j \in \{1, 2, \ldots, \varphi(m)\}$ such that $a x_i \equiv x_j \pmod m$. Multiplying all these congruences leads to
$$(a x_1)(a x_2) \cdots (a x_{\varphi(m)}) \equiv x_1 x_2 \cdots x_{\varphi(m)} \pmod m,$$
which can be rearranged to
$$a^{\varphi(m)} \cdot x_1 x_2 \cdots x_{\varphi(m)} \equiv x_1 x_2 \cdots x_{\varphi(m)} \pmod m.$$
Dividing this congruence by the integer $x_1 x_2 \cdots x_{\varphi(m)}$ (which is coprime to $m$) already gives the wanted statement. □

Remark. As we have already mentioned, Euler's theorem is a consequence of Lagrange's theorem (see 12.4.10) applied to the group $(\mathbb{Z}_m^\times, \cdot)$. This proof of Euler's theorem utilized the fact that multiplication by an integer $a$ coprime to $m$ is, in algebraic words, an automorphism of the group $(\mathbb{Z}_m^\times, \cdot)$.

11.3.13. There is an important concept tightly connected to Euler's totient function and Euler's theorem: the so-called order of an integer modulo $m$ – once again, it is nothing else than the order of the corresponding element in the group of invertible residue classes modulo $m$:

Order of an integer

Let $a \in \mathbb{Z}$, $m \in \mathbb{N}$, where $(a, m) = 1$. The order of $a$ modulo $m$ is the least natural number $n$ satisfying $a^n \equiv 1 \pmod m$.

It follows from Euler's theorem that the order of an integer is well defined – the order of any integer coprime to the modulus is surely not greater than $\varphi(m)$. As we will see later, the integers whose order is exactly $\varphi(m)$ are of great interest – they are called primitive roots modulo $m$, and they play an important role in solving binomial congruences, among others. This concept is just another name for a generator of the group $(\mathbb{Z}_m^\times, \cdot)$. Some of the very basic results about the order are demonstrated in 11.C.19, and a complete description of the dependency of the order upon the exponent is given by the subsequent two theorems.

Theorem. Let $m \in \mathbb{N}$, $a \in \mathbb{Z}$, $(a, m) = 1$. Let $r$ denote the order of $a$ modulo $m$. Then, for any $t, s \in \mathbb{N}_0$, we have
$$a^t \equiv a^s \pmod m \iff t \equiv s \pmod r.$$

Proof. Without loss of generality, we can assume that $t \geq s$. Dividing the integer $t - s$ by $r$ with remainder, we get $t - s = qr + z$, where $q, z \in \mathbb{N}_0$, $0 \leq z < r$.
"$\Leftarrow$": Since $t \equiv s \pmod r$, we have $z = 0$, hence $a^{t-s} = a^{qr} = (a^r)^q \equiv 1^q \pmod m$. Multiplying both sides of this congruence by the integer $a^s$ leads to the wanted statement.
"$\Rightarrow$": It follows from $a^t \equiv a^s \pmod m$ that $a^s \cdot a^{qr+z} \equiv a^s \pmod m$.
Since $a^r \equiv 1 \pmod m$, we also have $a^{qr+z} \equiv a^z \pmod m$. Altogether, after dividing both sides of the first congruence by the integer $a^s$ (which is coprime to the modulus), we get $a^z \equiv 1 \pmod m$.

We can simplify the computation to come a lot if we realize that $7^2 \equiv -1 \pmod{25}$ and $2^5 \equiv 7 \pmod{25}$. Then
$$14^{14^{14}} \equiv 2^{16} \cdot 7^{16} \equiv (2^5)^3 \cdot 2 \cdot 7^{16} \equiv 7^3 \cdot 2 \cdot 7^{16} \equiv 2 \cdot 7^{19} \equiv 2 \cdot (-1)^9 \cdot 7 = -14 \equiv 11 \pmod{25}.$$
We are thus looking for a non-negative integer which is less than 100, is a multiple of 4, and leaves remainder 11 when divided by 25 – the only such number is clearly 36. □

11.C.26. Determine the last three digits of the number $12^{10^{11}}$. ⃝

11.C.27. Find all natural numbers $n$ for which the integer $5^n - 4^n - 3^n$ is divisible by eleven.

Solution. The orders of all of the numbers 3, 4, and 5 modulo 11 are equal to five, so it suffices to examine $n \in \{0, 1, 2, 3, 4\}$. It can be seen from the following table

n              0   1   2   3   4
5^n mod 11     1   5   3   4   9
4^n mod 11     1   4   5   9   3
3^n mod 11     1   3   9   5   4

that only the case $n \equiv 2 \pmod 5$ yields $3 - 5 - 9 \equiv 0 \pmod{11}$. The problem is thus satisfied exactly by those natural numbers $n$ which satisfy $n \equiv 2 \pmod 5$. □

11.C.28. Primitive roots. Show that there are no primitive roots modulo 8, and find a primitive root modulo 11.

Solution. Apparently, even integers cannot be primitive roots modulo 8, so it remains to examine the odd ones. We can easily calculate that $3^2 \equiv 5^2 \equiv 7^2 \equiv 1 \pmod 8$, but $\varphi(8) = 4 > 2$. Now, we verify that 2 is a primitive root modulo 11: the order of 2 divides $\varphi(11) = 10$, so it suffices to show that $2^2 \not\equiv 1 \pmod{11}$ and $2^5 = 32 \equiv -1 \not\equiv 1 \pmod{11}$. Therefore, the order of 2 modulo 11 is indeed 10. □

11.C.29. We will now determine (with the help of the propositions proving Theorem 11.3.16) primitive roots modulo 41, $41^2$, and $2 \cdot 41^2$.

Solution. Since $\varphi(41) = 40 = 2^3 \cdot 5$, an integer $g$ coprime to 41 is a primitive root modulo 41 if and only if
$$g^{20} \not\equiv 1 \pmod{41} \quad \wedge \quad g^8 \not\equiv 1 \pmod{41}.$$
Since r is the order of a, we have ar ≡ 1 (mod m), i.e., brδ ≡ 1 (mod m), and so s | rδ. From r being coprime to s, we get s | δ. Analogously, we can get r | δ, so (again utilizing that r, s are coprime) r · s | δ. On the other hand, we clearly have (ab)rs ≡ 1 (mod m), hence δ | rs. Altogether, δ = rs. □ 11.3.15. Primitive roots. Among the integers coprime to a modulus m (i.e., the elements of a reduced residue system modulo m), the most important ones are those whose order is equal to φ(m). Step-by-step exponentiation of such a number yields all possible elements of a reduced residue system (or integers congruent to them). Therefore, in various problems, we can work with powers of a given integer instead of considering random elements of a reduced residue system modulo m, and this is often much simpler (see, for instance, the proof of the theorem 11.4.10 about binomial coefficients). 996 Now, we will go through the potential primitive roots in ascending order: g = 2 : 28 = 25 · 23 ≡ −9 · 8 ≡ 10 (mod 41), 220 = (25 )4 ≡ (−9)4 = 812 ≡ (−1)2 = 1 (mod 41), g = 3 : 38 = (34 )2 ≡ (−1)2 = 1 (mod 41), g = 4 : the order of 4 = 22 always divides the order 2, g = 5 : 58 = (52 )4 ≡ (−24 )4 = 216 = (28 )2 ≡ 102 ≡ 18 (mod 41), 520 = (52 )10 ≡ (−24 )10 = 240 = (220 )2 ≡ 1 (mod 41), g = 6 : 68 = 28 · 38 ≡ 10 · 1 = 10 (mod 41), 620 = 220 · 320 ≡ 220 · (38 )2 · 34 ≡ 1 · 1 · (−1) = −1 (mod 41). We have thus proved that 6 is the least positive primitive root modulo 41 (if we were interested in other primitive roots modulo 41 as well, we would get them as the powers of 6 with exponent taking on values from the range 1 to 40 which are coprime to 40. There are exactly φ(40) = φ(23 · 5) = 16 of them, and the resulting remainders modulo 41 are ±6, ±7, ±11, ±12, ±13, ±15, ±17, ±19). Now, if we prove that 640 ̸≡ 1 (mod 412 ), we will know that 6 is a primitive root modulo any power of 41 (if we had “bad luck” and found out that 640 ≡ 1 (mod 412 ), then a primitive root modulo 412 would be 47 = 6 + 41). To avoid manipulating huge numbers when verifying the condition, we will use several tricks (the so-called residue number system). First of all, we calculate the remainder of 68 upon division by 412 ; this problem can be further reduced to computing the remainders of the integers 28 and 38 : 28 = 256 = 6 · 41 + 10 (mod 412 ), 38 = (34 )2 = (2 · 41 − 1)2 ≡ −4 · 41 + 1 (mod 412 ). Then, 68 = 28 · 38 ≡ (6 · 41 + 10)(−4 · 41 + 1) ≡ −34 · 41 + 10 ≡ 7 · 41 + 10 (mod 412 ) and 640 = (68 )5 ≡ (7 · 41 + 10)5 ≡ (105 + 5 · 7 · 41 · 104 ) = 104 (10 + 35 · 41) ≡ (−2 · 41 − 4)(−6 · 41 + 10) ≡ (4 · 41 − 40) = 124 ̸≡ 1 (mod 412 ). In the calculation, we made use of the fact that 104 = 6 · 412 − 86, i.e., 104 ≡ −2 · 41 − 4 (mod 412 ). Therefore, 6 is a primitive root modulo 412 , and since it is an even integer, we can see that 1687 = 6+412 is a primitive root modulo 2 · 412 (while the least positive primitive root modulo 2 · 412 is the integer 7). □ CHAPTER 11. ELEMENTARY NUMBER THEORY Primitive root Let m ∈ N. An integer g, (g, m) = 1, is said to be a primitive root modulo m if and only if its order modulo m equals φ(m). Lemma. If g is a primitive root modulo m, then for every integer a such that (a, m) = 1, there is a unique xa ∈ Z, 0 ≤ xa < φ(m) with the property that gxa ≡ a (mod m). The mapping a → xa is called the discrete logarithm or index of the integer a (with respect to a given modulus m and a fixed primitive root g), and it is a bijection between the sets {a ∈ Z; (a, m) = 1, 0 < a < m} and {x ∈ Z; 0 ≤ x < φ(m)}. Proof. 
Suppose that for $x, y \in \mathbb{Z}$, $0 \leq x, y < \varphi(m)$, we have $g^x \equiv g^y \pmod m$. From the properties of the order, we get $x \equiv y \pmod{\varphi(m)}$, i.e., $x = y$, so the mapping is injective. Since it is a mapping between two finite sets of the same cardinality, it must be surjective as well. □

If there are primitive roots at all for a natural number $m$, then there are exactly $\varphi(\varphi(m))$ of them among the integers $1, 2, \ldots, m$: if $g$ is a primitive root and $a \in \{1, 2, \ldots, \varphi(m)\}$ is arbitrary, then the order of $g^a$ is $\frac{\varphi(m)}{(a, \varphi(m))}$ (by theorem 11.3.14), which is equal to $\varphi(m)$ if and only if $(a, \varphi(m)) = 1$, and there are exactly $\varphi(\varphi(m))$ such integers in the set $\{1, 2, \ldots, \varphi(m)\}$.

Now, we are about to show that primitive roots exist for a sufficient amount of moduli $m$.

11.3.16. Theorem (Existence of primitive roots). Let $m \in \mathbb{N}$, $m > 1$. The modulus $m$ has primitive roots if and only if at least one of the following conditions holds:
• $m = 2$ or $m = 4$,
• $m$ is a power of an odd prime,
• $m$ is twice a power of an odd prime.

The proof of this theorem will be done in several steps. We can easily see that 1 is a primitive root modulo 2 and 3 is a primitive root modulo 4. Further, we show that primitive roots exist modulo any odd prime (in algebraic words, this is another proof of the fact that the group $(\mathbb{Z}_m^\times, \cdot)$ of invertible residue classes modulo a prime $m$ is cyclic; see also 12.4.8).

Proposition. Let $p$ be an odd prime. Then there are primitive roots modulo $p$.

Proof. Let $r_1, r_2, \ldots, r_{p-1}$ be the orders of the integers $1, 2, \ldots, p-1$ modulo $p$. Let $\delta = [r_1, r_2, \ldots, r_{p-1}]$ be the least common multiple of these orders. We will show that there is an integer of order $\delta$ among $1, 2, \ldots, p-1$ and that $\delta = p - 1$. Let $\delta = q_1^{\alpha_1} \cdots q_k^{\alpha_k}$ be the factorization of $\delta$ into primes. For every $s \in \{1, \ldots, k\}$, there is a $c \in \{1, \ldots, p-1\}$ such that $q_s^{\alpha_s} \mid r_c$ (otherwise, there would be a common multiple

D. Solving congruences

Linear congruences. The following exercise illustrates that the procedure mentioned in the proof of theorem 11.4.3 about the solvability of linear congruences (which invokes Euler's theorem) is usually not the most efficient one – we can also utilize Bézout's theorem, or equivalent modifications of the given congruence.

11.D.1. Solve the congruence $39x \equiv 41 \pmod{47}$.

Solution. i) First, we use Euler's theorem. Since $(39, 47) = 1$, we have $39^{\varphi(47)} = 39^{46} \equiv 1 \pmod{47}$. Multiplying the given congruence by $39^{45}$ and using $39^{46} \equiv 1 \pmod{47}$, we get $x \equiv 39^{45} \cdot 41 \pmod{47}$. To complete the solution, it remains to calculate the remainder of $39^{45} \cdot 41$ when divided by 47, which is left as an exercise to the kind reader, leading to the result $x \equiv 36 \pmod{47}$.

ii) Another option is to make use of Bézout's theorem. The Euclidean algorithm applied to the pair $(39, 47)$ yields
$$47 = 1 \cdot 39 + 8, \quad 39 = 4 \cdot 8 + 7, \quad 8 = 1 \cdot 7 + 1.$$
In the other direction, this leads to
$$1 = 8 - 7 = 8 - (39 - 4 \cdot 8) = 5 \cdot 8 - 39 = 5 \cdot (47 - 39) - 39 = 5 \cdot 47 - 6 \cdot 39.$$
Considering this equality modulo 47 and remembering that we are solving the equation $41 \equiv x \cdot 39$, we obtain
$$1 \equiv -6 \cdot 39 \pmod{47},$$
and, multiplying by 41,
$$41 \equiv 41 \cdot (-6) \cdot 39 \pmod{47}, \quad\text{i.e.,}\quad x \equiv 41 \cdot (-6) \equiv -246 \equiv 36 \pmod{47}.$$
Let us note that this procedure is the one usually used in the corresponding software tools – it is efficient and can easily be turned into an algorithm. It was also important that 41 (the number we multiplied the congruence by) and the modulus 47 are coprime.
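The Bézout approach in ii) is exactly what one implements in code; a minimal sketch (iterative extended Euclidean algorithm; the function names are ours):

```python
def ext_gcd(a, b):
    """Return (g, u, v) with u*a + v*b == g == gcd(a, b)."""
    u0, v0, u1, v1 = 1, 0, 0, 1
    while b:
        q, a, b = a // b, b, a % b
        u0, u1 = u1, u0 - q * u1
        v0, v1 = v1, v0 - q * v1
    return a, u0, v0

def solve_linear(a, b, m):
    """Solutions of a*x ≡ b (mod m): either None, or (x0, step),
    meaning x ≡ x0 (mod step)."""
    g, u, _ = ext_gcd(a, m)
    if b % g:
        return None                     # no solution unless gcd(a, m) | b
    step = m // g
    return (u * (b // g)) % step, step

print(solve_linear(39, 41, 47))         # (36, 47) -- matching 11.D.1
```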
of the integers $r_1, r_2, \ldots, r_{p-1}$ less than $\delta$). Therefore, there exists an integer $b$ such that $r_c = b \cdot q_s^{\alpha_s}$. Since $c$ has order $r_c$, the order of the integer $g_s := c^b$ is equal to $q_s^{\alpha_s}$ (by the theorem 11.3.14 on the orders of powers). Reasoning analogously for every $s \in \{1, \ldots, k\}$, we get integers $g_1, \ldots, g_k$, and we can set $g := g_1 \cdots g_k$. From the properties of the order of a product, the order of $g$ is equal to the product of the orders of the integers $g_1, \ldots, g_k$, i.e., to $q_1^{\alpha_1} \cdots q_k^{\alpha_k} = \delta$.

Now, we prove that $\delta = p - 1$. Since the orders of the integers $1, 2, \ldots, p-1$ divide $\delta$, the congruence $x^\delta \equiv 1 \pmod p$ holds for every $x \in \{1, 2, \ldots, p-1\}$. By theorem 11.4.8, a congruence of degree $\delta$ modulo a prime $p$ has at most $\delta$ solutions (in algebraic words, we are actually looking for the roots of a polynomial over a field, and there cannot be more of them than the degree of the polynomial, as we will see in part 12.3.4). On the other hand, we have just shown that this congruence has $p - 1$ solutions, so necessarily $\delta \geq p - 1$. Still, $\delta$ is (being the order of $g$) a divisor of $p - 1$, whence we finally get the wanted equality $\delta = p - 1$. □

11.3.17. Next, we show that there are primitive roots modulo the powers of odd primes. First, we prove two helpful lemmas; then comes the proposition itself.

Lemma. Let $p$ be an odd prime, $\ell \geq 2$ arbitrary. Then it holds for any $a \in \mathbb{Z}$ that
$$(1 + ap)^{p^{\ell - 2}} \equiv 1 + a p^{\ell - 1} \pmod{p^\ell}.$$

Proof. This follows easily from the binomial theorem using mathematical induction on $\ell$.
I. The statement is clearly true for $\ell = 2$.
II. Let the statement be true for $\ell$; let us prove it for $\ell + 1$. Invoking exercise 11.C.7 and raising the statement for $\ell$ to the $p$-th power, we obtain
$$(1 + ap)^{p^{\ell - 1}} \equiv (1 + a p^{\ell - 1})^p \pmod{p^{\ell + 1}}.$$
It follows from the binomial theorem that
$$(1 + a p^{\ell - 1})^p = 1 + p \cdot a \cdot p^{\ell - 1} + \sum_{k=2}^{p} \binom{p}{k} a^k p^{(\ell - 1)k},$$
and since $p \mid \binom{p}{k}$ for $1 < k < p$ (by exercise 11.C.6), it suffices to show that $p^{\ell + 1} \mid p^{1 + (\ell - 1)k}$, which is equivalent to $1 \leq (k - 1)(\ell - 1)$. Thanks to the assumption $\ell \geq 2$, we also get $p^{\ell + 1} \mid p^{(\ell - 1)p}$ for $k = p$. □

Lemma. Let $p$ be an odd prime, $\ell \geq 2$ arbitrary. Then it holds for any integer $a$ satisfying $p \nmid a$ that the order of $1 + ap$ modulo $p^\ell$ equals $p^{\ell - 1}$.

Proof. By the previous lemma, we have $(1 + ap)^{p^{\ell - 1}} \equiv 1 + a p^\ell \pmod{p^{\ell + 1}}$, and considering this congruence modulo $p^\ell$, we get $(1 + ap)^{p^{\ell - 1}} \equiv 1 \pmod{p^\ell}$. At the same time, it follows directly from the previous lemma and from $p$ not being a divisor of $a$ that

iii) Concerning paper-and-pencil calculations, the most efficient procedure (yet one not easily generalizable into an algorithm) is to gradually modify the congruence so that the set of solutions remains unchanged:
$$39x \equiv 41 \pmod{47},$$
$$-8x \equiv -6 \pmod{47} \quad \text{(divide by } -2\text{)},$$
$$4x \equiv 3 \equiv -44 \pmod{47} \quad \text{(divide by } 4\text{)},$$
$$x \equiv -11 \equiv 36 \pmod{47}. \quad □$$

Systems of congruences. In order to solve systems of (not only linear) congruences, we often utilize the Chinese remainder theorem, which guarantees the uniqueness of the solution provided the moduli of the particular congruences are pairwise coprime.

11.D.2. Solve the system
$$x \equiv 7 \pmod{27}, \quad x \equiv -3 \pmod{11}.$$

Solution. As $(27, 11) = 1$, we are guaranteed by the Chinese remainder theorem that the solution is unique modulo $27 \cdot 11 = 297$. There are two major approaches to finding the solution.

(a) Using the Euclidean algorithm, we can find the coefficients in Bézout's identity: $1 = 5 \cdot 11 - 2 \cdot 27$. Hence, $[11]_{27}^{-1} = [5]_{27}$ and $[27]_{11}^{-1} = [-2]_{11}$.
Therefore, the solution is
$$x \equiv 7 \cdot 11 \cdot 5 - 3 \cdot 27 \cdot (-2) = 547 \equiv 250 \pmod{297}.$$

(b) Using step-by-step substitution, we get $x = 11t - 3$ from the second congruence. Substituting this into the first one leads to $11t \equiv 10 \pmod{27}$. Multiplying this by 5 yields $55t \equiv 50$, i.e., $t \equiv -4 \pmod{27}$. Altogether, $x = 11 \cdot 27 \cdot s - 4 \cdot 11 - 3 = 297 s - 47$ for $s \in \mathbb{Z}$, i.e., $x \equiv -47 \pmod{297}$. □

11.D.3. Solve the following system of congruences:
$$x \equiv 1 \pmod{10}, \quad x \equiv 5 \pmod{18}, \quad x \equiv -4 \pmod{25}.$$

Solution. The integers $x$ satisfying the first congruence are those of the form $x = 1 + 10t$, where $t \in \mathbb{Z}$ is arbitrary. We substitute this expression into the second congruence and solve it (as a congruence in the variable $t$):
$$1 + 10t \equiv 5 \pmod{18}, \quad 10t \equiv 4 \pmod{18}, \quad 5t \equiv 2 \pmod 9, \quad 5t \equiv 20 \pmod 9, \quad t \equiv 4 \pmod 9,$$

$(1 + ap)^{p^{\ell - 2}} \not\equiv 1 \pmod{p^\ell}$, which gives the wanted proposition. □

Proposition. Let $p$ be an odd prime. Then, for every $\ell \in \mathbb{N}$, there is a primitive root modulo $p^\ell$.

Proof. Let $g$ be a primitive root modulo $p$. We show that if $g^{p-1} \not\equiv 1 \pmod{p^2}$, then $g$ is a primitive root modulo $p^\ell$ for any $\ell \in \mathbb{N}$. (If we had $g^{p-1} \equiv 1 \pmod{p^2}$, then $(g + p)^{p-1} \equiv 1 + (p-1) g^{p-2} p \not\equiv 1 \pmod{p^2}$, so we could choose $g + p$ instead of the congruent integer $g$ as the original primitive root.)

Let $g$ satisfy $g^{p-1} \not\equiv 1 \pmod{p^2}$. Then there is an $a \in \mathbb{Z}$, $p \nmid a$, such that $g^{p-1} = 1 + pa$. We show that the order of $g$ modulo $p^\ell$ is $\varphi(p^\ell) = (p-1) p^{\ell-1}$. Let $n$ be the least natural number satisfying $g^n \equiv 1 \pmod{p^\ell}$. By the previous lemma, the order of $g^{p-1} = 1 + pa$ modulo $p^\ell$ is $p^{\ell-1}$. However, then it follows from the corollary in 11.3.13 that
$$(g^{p-1})^n = (g^n)^{p-1} \equiv 1 \pmod{p^\ell} \implies p^{\ell - 1} \mid n.$$
At the same time, the congruence $g^n \equiv 1 \pmod p$ implies that $p - 1 \mid n$. From $p - 1$ and $p^{\ell-1}$ being coprime, we get $(p-1) p^{\ell-1} \mid n$. Therefore, $n = \varphi(p^\ell)$, and $g$ is thus a primitive root modulo $p^\ell$. □

11.3.18. Our next task is to deal with the existence of primitive roots for moduli of the form twice a power of an odd prime.

Proposition. Let $p$ be an odd prime and $g$ a primitive root modulo $p^\ell$ for $\ell \in \mathbb{N}$. Then the odd one of the integers $g$, $g + p^\ell$ is a primitive root modulo $2 p^\ell$.

Proof. Let $c$ be an odd natural number. Then, for every $n \in \mathbb{N}$, we have $c^n \equiv 1 \pmod{p^\ell}$ if and only if $c^n \equiv 1 \pmod{2 p^\ell}$. Since $\varphi(2 p^\ell) = \varphi(p^\ell)$, every odd primitive root modulo $p^\ell$ is also a primitive root modulo $2 p^\ell$. □

The subsequent proposition describes the case of the powers of two. We use helping lemmas similar to those in the case of odd primes.

Lemma. Let $\ell \in \mathbb{N}$, $\ell \geq 3$. Then $5^{2^{\ell - 3}} \equiv 1 + 2^{\ell - 1} \pmod{2^\ell}$.

Proof. Analogous to the corresponding lemma above for odd primes. □

Lemma. Let $\ell \in \mathbb{N}$, $\ell \geq 3$. Then the order of the integer 5 modulo $2^\ell$ is $2^{\ell - 2}$.

Proof. Easily from the above lemma. □

Proposition. Let $\ell \in \mathbb{N}$. There are primitive roots modulo $2^\ell$ if and only if $\ell \leq 2$.

Proof. Let $\ell \geq 3$. Then the set
$$S = \{(-1)^a \cdot 5^b;\ a \in \{0, 1\},\ b \in \mathbb{Z},\ 0 \leq b < 2^{\ell - 2}\}$$
forms a reduced residue system modulo $2^\ell$: it has $\varphi(2^\ell)$ elements, and it can be easily verified that they are pairwise incongruent modulo $2^\ell$.

or $t = 4 + 9s$, where $s \in \mathbb{Z}$ is arbitrary. The first two congruences are thus satisfied exactly by the integers $x$ of the form $x = 1 + 10t = 1 + 10(4 + 9s) = 41 + 90s$. Once again, this can be substituted into the third congruence and solved:
$$41 + 90s \equiv -4 \pmod{25}, \quad 90s \equiv 5 \pmod{25}, \quad 18s \equiv 1 \pmod 5, \quad 3s \equiv 6 \pmod 5, \quad s \equiv 2 \pmod 5,$$
or $s = 2 + 5r$, where $r \in \mathbb{Z}$. Altogether, $x = 41 + 90s = 41 + 90(2 + 5r) = 221 + 450r$. Therefore, the system is satisfied by those integers $x$ with $x \equiv 221 \pmod{450}$. □
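The substitution method from 11.D.2 and 11.D.3 is easy to automate, and it even handles non-coprime (but compatible) moduli such as those in 11.D.3; a minimal sketch, building on `ext_gcd` from the previous snippet (the helper names are ours):

```python
from functools import reduce

def merge(a1, m1, a2, m2):
    """Combine x ≡ a1 (mod m1) and x ≡ a2 (mod m2); the moduli need not be
    coprime -- the system is solvable iff gcd(m1, m2) | a1 - a2."""
    g, u, _ = ext_gcd(m1, m2)
    if (a2 - a1) % g:
        return None                        # incompatible congruences
    lcm = m1 // g * m2
    # x = a1 + m1*t with t ≡ (a2 - a1)/g * u (mod m2/g), since u*m1/g ≡ 1 there
    t = (a2 - a1) // g * u % (m2 // g)
    return (a1 + m1 * t) % lcm, lcm

solve = lambda congs: reduce(lambda acc, c: merge(*acc, *c), congs[1:], congs[0])

print(solve([(7, 27), (-3, 11)]))           # (250, 297)  -- 11.D.2
print(solve([(1, 10), (5, 18), (-4, 25)]))  # (221, 450)  -- 11.D.3
```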
11.D.4. A group of thirteen pirates managed to steal a chest full of gold coins (there were around two thousand of them). The pirates tried to divide the coins evenly among themselves, but ten coins were left over. They started to fight for the remaining coins, and one of the pirates was fatally stabbed during the combat. So they tried to divide the coins evenly once again, and now three coins were left. Another pirate died in a subsequent battle for the three coins. The remaining pirates then tried to divide the coins evenly for the third time, now successfully. How many coins were there in the chest?

Solution. The problem leads to the following system of congruences:
$$x \equiv 10 \pmod{13}, \quad x \equiv 3 \pmod{12}, \quad x \equiv 0 \pmod{11}.$$
Its solution is $x \equiv 231 \pmod{11 \cdot 12 \cdot 13}$. Since the number $x$ of coins is to be around 2000 and $x \equiv 231 \pmod{1716}$, we can easily settle that there were exactly $231 + 1716 = 1947$ coins. □

11.D.5. When gymnasts formed groups of eight people, three were left over. When they formed circles, each consisting of seventeen people, seven remained; and when they grouped into pyramids (each of them containing $21 = 4^2 + 2^2 + 1$ gymnasts), two of the pyramids were incomplete (each missing a person "on the top"). How many gymnasts were there, provided there were at least 2000 and at most 4000?

Solution. We solve the following system of linear congruences in the standard way:
$$c \equiv 3 \pmod 8, \quad c \equiv 7 \pmod{17}, \quad c \equiv -2 \pmod{21},$$

At the same time (utilizing the previous lemma), the order of every element of $S$ apparently divides $2^{\ell - 2}$. Therefore, this reduced system cannot (nor can any other) contain an element of order $\varphi(2^\ell) = 2^{\ell - 1}$. □

11.3.19. The last piece of the jigsaw puzzle of propositions which collectively prove theorem 11.3.16 is the statement about the non-existence of primitive roots for composite numbers which are neither a power of a prime nor twice such a power.

Proposition. Let $m \in \mathbb{N}$ be divisible by at least two primes, and let it not be twice a power of an odd prime. Then there are no primitive roots modulo $m$.

Proof. Let $m$ factor into primes as $m = 2^\alpha p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, where $\alpha \in \mathbb{N}_0$, $\alpha_i \in \mathbb{N}$, $2 \nmid p_i$, and either $k \geq 2$, or both $k \geq 1$ and $\alpha \geq 2$. Denoting $\delta = [\varphi(2^\alpha), \varphi(p_1^{\alpha_1}), \ldots, \varphi(p_k^{\alpha_k})]$, we can easily see that
$$\delta < \varphi(2^\alpha) \cdot \varphi(p_1^{\alpha_1}) \cdots \varphi(p_k^{\alpha_k}) = \varphi(m)$$
and that for any $a \in \mathbb{Z}$, $(a, m) = 1$, we have $a^\delta \equiv 1 \pmod m$. Therefore, there are no primitive roots modulo $m$. □

11.3.20. In general, it is computationally very hard to find a primitive root for a given modulus. The following theorem describes a necessary and sufficient condition for an examined integer to be a primitive root.

Theorem. Let $m$ be an integer such that there are primitive roots modulo $m$, and let us write $\varphi(m) = q_1^{\alpha_1} \cdots q_k^{\alpha_k}$, where $q_1, \ldots, q_k$ are primes and $\alpha_1, \ldots, \alpha_k \in \mathbb{N}$. Then, for every $g \in \mathbb{Z}$, $(g, m) = 1$, the integer $g$ is a primitive root modulo $m$ if and only if none of the following congruences holds:
$$g^{\varphi(m)/q_1} \equiv 1 \pmod m, \quad \ldots, \quad g^{\varphi(m)/q_k} \equiv 1 \pmod m.$$

Proof. If any of the congruences were true, the order of $g$ would be less than $\varphi(m)$. On the other hand, if $g$ fails to be a primitive root, then there is a $d \in \mathbb{N}$, $d \mid \varphi(m)$, with $d < \varphi(m)$ and $g^d \equiv 1 \pmod m$. If $u = \frac{\varphi(m)}{d} > 1$, then there must be an $i \in \{1, \ldots, k\}$ such that $q_i \mid u$. But then we get
$$g^{\varphi(m)/q_i} = g^{d \cdot u / q_i} \equiv 1 \pmod m. \quad □$$
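The criterion of 11.3.20 is exactly how primitive roots are searched for in practice; a small sketch reusing `phi` from the earlier snippet (valid, of course, only for moduli which have primitive roots at all):

```python
def prime_factors(n):
    """Set of primes dividing n (trial division)."""
    ps, p = set(), 2
    while p * p <= n:
        if n % p == 0:
            ps.add(p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        ps.add(n)
    return ps

def is_primitive_root(g, m):
    """Criterion 11.3.20: g (coprime to m) is a primitive root modulo m
    iff g**(phi(m)/q) is not ≡ 1 (mod m) for every prime q | phi(m)."""
    f = phi(m)
    return all(pow(g, f // q, m) != 1 for q in prime_factors(f))

print(min(g for g in range(2, 41) if is_primitive_root(g, 41)))      # 6
print(is_primitive_root(6, 41**2), is_primitive_root(7, 2 * 41**2))  # True True
```

This reproduces the findings of 11.C.29 without any of the clever hand manipulations.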
Solving congruences and systems of them This part will be devoted to the analog to solving equations in a numerical domain. We will actually be solving equations (and systems of equations) in the ring of residue classes (Zm, +, ·); we will, however, talk about solving congruences modulo m and write it in the more transparent way as usual. 1000 leading to the solution c ≡ 1027 (mod 2856), which, together with the additional information, implies that there were exactly 3883 gymnasts. □ 11.D.6. Find which of the following (systems of) linear congruences has a solution. i) x ≡ 1 (mod 3), x ≡ −1 (mod 9); ii) 8x ≡ 1 (mod 12345678910111213); iii) x ≡ 3 (mod 29), x ≡ 5 (mod 47). ⃝ The Chinese remainder theorem can also be used “in the opposite direction”, i.e. to simplify a linear congruence provided we are able to express the modulus as a product of pairwise coprime factors. 11.D.7. Solve the congruence 23 941x ≡ 915 (mod 3564). Solution. Let us factor 3564 = 22 ·34 ·11. Since none of the integers 2, 3, 11 divides 23 941, we have (23 941, 3564) = 1, so the congruence has a solution. Since φ(3564) = 2·(33 ·2)· 10 = 1080, the solution is of the form x ≡ 915 · 23 9411079 (mod 3564). However, it would take much effort to simplify the right-hand integer to a more explicit form. Therefore, we will try to solve the congruence in a different way – we will build an equivalent system of congruences which are easier to solve than the original one. We know that an integer x is a solution of the given congruence if and only if it is a solution of the system 23941x ≡ 915 (mod 22 ), 23941x ≡ 915 (mod 34 ), 23941x ≡ 915 (mod 11). Solving these congruences separately, we get the following, equivalent system: x ≡ 3 (mod 4), x ≡ −3 (mod 81), x ≡ −4 (mod 11). Now, the procedure for finding a solution of a system of congruences yields x ≡ −1137 (mod 3564), which is the solution of the original congruence as well. □ 11.D.8. Solve the congruence 3446x ≡ 8642 (mod 208). ⃝ 11.D.9. Prove that the sequence (2n − 3)∞ n=1 contains infinitely many multiples of 5 as well as infinitely many multiples of 13, yet there is no multiple of 65 in it. ⃝ Residue number system. When calculating with large integers, it is often more advantageous to work not with their decimal or binary expansions, but rather with their representation in a so-called residue number system, which allows for easy parallelization of computations with large integers. Such a system is given by a k-tuple of (usually pairwise coprime) moduli, and each integer which CHAPTER 11. ELEMENTARY NUMBER THEORY Congruence in one variable Let m ∈ N, f(x), g(x) ∈ Z[x]. The notation f(x) ≡ g(x) (mod m) is called a congruence in variable x, and it is understood to be the problem of finding the set of solutions, i.e., the set of all such integers c for which f(c) ≡ g(c) (mod m). Two congruences (in one variable) are called equivalent if and only if they have the same set of solutions. The mentioned congruence is equivalent to the congru- ence f(x) − g(x) ≡ 0 (mod m). The only method which always leads to a solution is trying out all possible values (however, this would, of course, often take too much time). This procedure is formalized by the following proposition. 11.4.1. Proposition. Let m ∈ N, f(x) ∈ Z[x]. Then, it holds for every a, b ∈ Z that a ≡ b (mod m) =⇒ f(a) ≡ f(b) (mod m). Proof. Let f(x) = cnxn +cn−1xn−1 +· · ·+c1x + + c0, where c0, c1, . . . , cn ∈ Z. Since a ≡ b (mod m), ciai ≡ cibi (mod m) holds for every i = 0, 1, . . . , n. 
Adding up these congruences for i = 0, 1, 2, . . . , n leads to cnan + · · · + c1a + c0 ≡ cnbn + · · · + c1b + c0 (mod m), i.e., f(a) ≡ f(b) (mod m). □ Corollary. The set of solutions of an arbitrary congruence modulo m is a union of residue classes modulo m. Definition. The number of solutions of a congruence in one variable modulo m is the number of residue classes modulo m containing the solutions of the congruence. Example. The concept number of solutions of a congruence, which we have just defined, is a bit counterintuitive in that it depends on the modulus of the congruence. Therefore, equivalent congruences (sharing the same integers as solutions) can have different numbers of solutions. (1) The congruence 2x ≡ 3 (mod 3) has exactly one solution (modulo 3). (2) The congruence 10x ≡ 15 (mod 15) has five solutions (modulo 15). (3) The congruences from (1) and (2) are equivalent. 11.4.2. Linear congruence in one variable. Just like in the case of ordinary equations, the easiest congruences are the linear ones, for which we are able not only to decide whether they have a solution, but to efficiently find it (provided they have some). The procedure is described by the following theorem and its proof. 11.4.3. Theorem. Let m ∈ N, a, b ∈ Z, and d = (a, m). Then the congruence (in variable x) ax ≡ b (mod m) 1001 is less than their product is then uniquely representable as a k-tuple of remainders (whose values do not exceed the mod- uli). 11.D.10. The quintuple of moduli 3, 5, 7, 11, 13 can serve to uniquely represent integers which are less than their product (i.e. less than 15015) and to perform standard arithmetic operations efficiently (and in a distributed manner if desired). Now, we will determine the representation of the integers 1234 and 5678 in this residue number system and we will determine their sum and product. Solution. Calculating the remainders of the given integers upon division by the particular moduli, we get their RNS representations, which can be written as the tuples (1, 4, 2, 2, 12) and (2, 3, 1, 2, 10). The sum is computed componentwise (reducing the results modulo the appropriate number), leading to the tuple (0, 2, 3, 4, 9). Using the Chinese remainder theorem, this tuple can then be transformed back to the integer 6912. The product is computed analogously, yielding the corresponding tuple (2, 2, 2, 4, 3), which can be transformed back to 9662 (by the Chinese remainder theorem again). This is indeed congruent to 1234 · 5678 modulo 15015. □ 11.D.11. In practice, the residue number system is often a triple 2n − 1, 2n , 2n + 1 (why are these integers always coprime?), which can uniquely cover integers of 3n bits at the utmost. Consider the case n = 3 and determine the representation of the integer 118 in this residue number system. Solution. We can directly calculate that 118 ≡ 6 (mod 7), 118 ≡ 6 (mod 8), and 118 ≡ 1 (mod 9). The wanted representation is thus given by the triple (6, 6, 1). In practice, however, it is very important that the RNS representation can be efficiently transformed to binary and vice versa. In our concrete case, the remainder of 118 = (1110110)2 when divided by 23 can be found easily – it is the last three bits (110)2 = 6. Computing the remainder upon division by 23 + 1 = 9 or 23 − 1 = 7 is not any more complicated. We can see (splitting the examined integer into three groups of n bits each) that (1110110)2 ≡ (001)2 + (110)2 + (110)2 ≡ 6 (mod 23 − 1), (1110110)2 ≡ (001)2 − (110)2 + (110)2 ≡ 1 (mod 23 + 1). 
A thoughtful reader has surely noticed the similarity with the criteria for divisibility by 9 and 11, which were discussed in paragraph 11.C.9. □ 11.D.12. Higher-order congruences. Using the procedure of theorem 11.4.6, solve the congruence x4 + 7x + 4 ≡ 0 (mod 27). Solution. First, we will solve this congruence modulo 3 (by substitution, for instance) – we can easily find that the solution CHAPTER 11. ELEMENTARY NUMBER THEORY has a solution if and only if d | b. If d | b, then this congruence has exactly d solutions (modulo m). Proof. First, we prove that the mentioned condition is necessary. If an integer c is a solution of this congruence, then we must have m | a · c − b. Since d = (a, m), we get d | m and d | a · c − b, so d | a · c − (a · c − b) = b. Now, we will prove that if d | b, then the given congruence has exactly d solutions modulo m. Let a1, b1 ∈ Z and m1 ∈ N so that a = d · a1, b = d · b1, and m = d · m1. The congruence we are trying to solve is thus equivalent to the congruence a1 · x ≡ b1 (mod m1), where (a1, m1) = 1. This congruence can be multiplied by the integer a φ(m1)−1 1 , which, by Euler’s theorem, leads to x ≡ b1 · a φ(m1)−1 1 (mod m1). This congruence has a unique solution modulo m1, thus it has d = m/m1 solutions modulo m. □ Using the theorem about solutions of linear congruences, we can, among others, prove Wilson’s theorem – an important theorem which gives a necessary (and sufficient) condition for an integer to be a prime. Such conditions are extremely useful in computational number theory, where one needs to efficiently determine whether a given large integer is a prime. Unfortunately, it is not known now how fast modular factorial of a large integer can be computed. That is why Wilson’s theorem is not used for this purpose in practice. Theorem (Wilson). A natural number n > 1 is a prime if and only if (n − 1)! ≡ −1 (mod n). Proof. First, we prove that every composite number n > 4 satisfies n | (n − 1)!, i.e., (n − 1)! ≡ 0 (mod n). Let 1 < d < n be a non-trivial divisor of n. If d ̸= n/d, then the inequality 1 < d, n/d ≤ n − 1 implies what we need: n = d · n/d | (n − 1)!. If d = n/d, i.e., n = d2 , then we have d > 2 (since n > 4) and n | (d · 2d) | (n − 1)!. For n = 4, we easily get (4 − 1)! ≡ 2 ̸≡ −1 (mod 4). Now, let p be a prime. The integers in the set {2, 3, . . . , p − 2} can be grouped by pairs of those mutually inverse modulo p, i.e., pairs of integers whose product is congruent to 1. By the previous theorem, for every integer a of this set, there is a unique solution of the congruence a · x ≡ 1 (mod p). Since a ̸= 0, 1, p − 1, it is apparent that the solution c of the congruence also satisfies c ̸≡ 0, 1, −1 (mod p). The integer a cannot be paired with itself, either: If so, i.e., a · a ≡ 1 (mod p), we would (thanks to p | a2 − 1 = (a + 1)(a − 1)) get the congruence a ≡ ±1 (mod p). The product of the integers of the mentioned set thus consists of products of (p − 3)/2 pairs (whose product is always congruent to 1 modulo p). Therefore, we have (p − 1)! ≡ 1(p−3)/2 · (p − 1) ≡ −1 (mod p). □ 1002 is x ≡ 1 (mod 3). Now, writing the solution in the form x = 1 + 3t, where t ∈ Z, we will solve the congruence modulo 9: x4 + 7x + 4 ≡ 0 (mod 9), (1 + 3t)4 + 7(1 + 3t) + 4 ≡ 0 (mod 9), 1 + 4 · 3t + 7 + 7 · 3t + 4 ≡ 0 (mod 9), 33t ≡ −12 (mod 9), 11t ≡ − 4 (mod 3), t ≡ 1 (mod 3). 
Writing t = 1 + 3s, where s ∈ Z, we get x = 4 + 9s, and substituting this leads to (4 + 9s)4 + 7(4 + 9s) + 4 ≡ 0 (mod 27), 44 + 4 · 43 · 9s + 28 + 63s + 4 ≡ 0 (mod 27), 256 · 9s + 63s ≡ −288 (mod 27), 256s + 7s ≡ − 32 (mod 3), 2s ≡ 1 (mod 3), s ≡ 2 (mod 3). Altogether, we get the solution in the form x = 4 + 9s = 4 + 9(2 + 3r) = 22 + 27r, where r ∈ Z, i.e., x ≡ 22 (mod 27). □ 11.D.13. Knowing a primitive root modulo 41 from exercise 11.C.29 , solve the congruence 7x17 ≡ 11 (mod 41). Solution. Multiplying the congruence by 6, we get an equivalent congruence 42x17 ≡ 66, i.e., x17 ≡ 25 (mod 41). Since 6 is a primitive root modulo 41, the substitution x = 6t leads to the congruence 617t ≡ 25 ≡ 64 (mod 41), which is equivalent to 17t ≡ 4 (mod 40), and this holds if and only if t ≡ 12 (mod 40). Therefore, the congruence is satisfied by exactly those integers x with x ≡ 612 ≡ 4 (mod 41). □ 11.D.14. Solve the congruence x5 + 1 ≡ 0 (mod 11). Solution. Since (5, φ(11)) = 5 and (−1) φ(11) 5 ≡ 1 (mod 11), the congruence x5 ≡ −1 (mod 11) has five solutions. There are several possibilities how to find them. We can either try all (ten) candidates or transform the problem to a linear congruence using the primitive-root trick. Since 210/2 ≡ −1 ̸≡ 1 (mod 11) and 210/5 ≡ 4 ̸≡ 1 (mod 11), 2 is a primitive root modulo 11 (see also exercise 11.C.28), and the substitution x ≡ 2y then transforms the congruence to 25y ≡ 25 (mod 11), CHAPTER 11. ELEMENTARY NUMBER THEORY 11.4.4. Systems of linear congruences. Having a system of linear congruences in the same variable, we can decide whether each of them is solvable by the previous theorem. If at least one of the congruences does not have a solution, nor does the whole system. On the other hand, if each of the congruences is solvable, we can rearrange it into the form x ≡ ci (mod mi). We thus get a system of congruences x ≡ c1 (mod m1), ... x ≡ ck (mod mk). Apparently, it suffices to solve the case k = 2 since the solutions of a system of more congruences can be obtained by repeatedly applying the procedure for a system of two con- gruences. Proposition. Let c1, c2 be integers and m1, m2 be natural numbers. Let us denote d = (m1, m2). The system of two congruences x ≡ c1 (mod m1), x ≡ c2 (mod m2) has no solution if c1 ̸≡ c2 (mod d). On the other hand, if c1 ≡ c2 (mod d), then there is an integer c such that x ∈ Z satisfies the system if and only if it satisfies the congruence x ≡ c (mod [m1, m2]). Proof. If the given system is to have a solution x ∈ Z, we must have x ≡ c1 (mod d), x ≡ c2 (mod d), and thus c1 ≡ c2 (mod d) as well. Hence it follows that the system cannot have a solution when c1 ̸≡ c2 (mod d). From now on, suppose that c1 ≡ c2 (mod d). The first congruence of the system is satisfied by those integers x which are of the form x = c1 + tm1, where t ∈ Z is arbitrary. Such an integer x satisfies the second congruence of the system if and only if c1 + tm1 ≡ c2 (mod m2), i.e., tm1 ≡ c2 − c1 (mod m2). By the theorem about solutions of linear congruences, this congruence (in variable t) is solvable since d = (m1, m2) divides c2 − c1, and t satisfies this congruence if and only if t ≡ c2 − c1 d · (m1 d )φ ( m2 d ) −1 ( mod m2 d ) , i.e., if and only if x = c1 + tm1 = c1 + (c2 − c1) · (m1 d )φ ( m2 d ) + r m1m2 d = c + r · [m1, m2], where r ∈ Z is arbitrary and c = c1 + (c2 − c1) · (m1/d)φ(m2/d) , as m1m2 equals d · [m1, m2]. We have thus found such an integer c that every x ∈ Z satisfies the system if and only if x ≡ c (mod [m1, m2]), as wanted. 
□ 1003 which is equivalent to the linear congruence 5y ≡ 5 (mod 10), y ≡ 1 (mod 2). This congruence is satisfied by y ∈ {−3, −1, 1, 3, 5}; the original congruence is thus (substituting x ≡ 2y (mod 11)) satisfied by x ∈ {−1, 2, −3, −4, −5}. □ 11.D.15. Solve the congruence x3 − 3x + 5 ≡ 0 (mod 105). ⃝ 11.D.16. Determine the number of solutions of the congru- ence x5 ≡ 534 (mod 232 ). Solution. The given congruence is equivalent to x5 ≡ 5 (mod 232 ), and since we have (5, φ(23)) = 1, it follows from the theorem on solvability of binomial congruences that the congruence has a unique solution if considered modulo 23. Furthermore, this solution is surely not a multiple of 23. Therefore, considering the polynomial whose roots we are looking for, its derivative (x5 − 5)′ = 5x4 does not evaluate to a multiple of 23 at the wanted solution, either. Invoking Hensel’s lemma, we can summarize that the original congruence has a unique solution (without having to describe it explicitly). □ 11.D.17. Give an example of a polynomial congruence whose degree is less than the number of its solutions. Solution. Taking into account theorem 11.4.8, we must use either a modulus which is composite or a polynomial all of whose coefficients will be multiples of the modulus. As an example of a congruence of the first kind, we can put x2 ≡ 1 (mod 8), which is a quadratic congruence with four solutions 1, 3, 5, 7. The case if a prime modulus can be exemplified by the quadratic congruence 10x2 − 15 ≡ 0 (mod 5), which has five solutions. □ 11.D.18. Other types of congruences. Prove that for any natural number n, the integer 111 + 2222n−1 is divisible by 127. Solution. We are to prove that the congruence 2222n−1 ≡ −111 (mod 127) is satisfied for every n ∈ N. This congruence is equivalent to 2222n−1 ≡ 222 (mod 127). Since 27 = 128 ≡ 1 (mod 127), the order of 2 modulo 127 equals 7, so the congruence to be proved is (by 11.3.13) equivalent to 222n−1 ≡ 22 (mod 7). CHAPTER 11. ELEMENTARY NUMBER THEORY We can notice that the proof of this theorem is constructive, i.e., it yields a formula for finding the integer c. This theorem thus gives us a procedure how to catch the condition that an integer x satisfies a given system by a single congruence. This new congruence is then of the same form as the original one. Therefore, we can apply this procedure to a system of more congruences – first, we create a single congruence from the first and second congruences of the system (satisfied by exactly those integers x which satisfy the original two); then, we create another congruence from the new one and the third one of the original system, and so on. Each step reduces the number of congruences by one; after a finite number of steps, we thus arrive at a single congruence which describes all solutions of the given system. It follows from the procedure we have just mentioned (supposing the condition from below holds) that a system of congruences always has a solution, and this is unique. Theorem (Chinese remainder theorem). Let m1, , . . . , mk ∈ N be pairwise coprime, a1, . . . , ak ∈ Z. Then, the system x ≡ a1 (mod m1), ... x ≡ ak (mod mk) has a unique solution modulo m1 · m2 · · · mk. Remark. The unusual name of this theorem comes from Chinese mathematician Sun Tzu of the 4th century. In his text, he asked for an integer which leaves remainder 2 when divided by 3, leaves remainder 3 when divided by 5, and again remainder 2 when divided by 7. The answer is rumored to be hidden in the following song: Proof. 
It is a simple consequence of the previous proposition about the form of the solution of a system of two congruences. However, as we show here, this result can also be proved directly. Let us denote M := m1m2 · · · mr and ni = M/mi for every i, 1 ≤ i ≤ r. Then, for any i, mi is coprime to ni, so there is an integer bi ∈ {1, . . . , mi − 1} such that bini ≡ 1 (mod mi). Note that bini is divisible by all the numbers mj, 1 ≤ j ≤ r, i ̸= j. Therefore, the wanted solution of the system is the integer x = a1b1n1 + a2b2n2 + · · · + arbrnr. □ 1004 Similarly, the order of 2 modulo 7 is 3, which leads to the (again equivalent) congruence 22n−1 ≡ 2 (mod 3), (−1)2n−1 ≡ −1 (mod 3), and this is apparently true (we could also have proceed likewise – the order of 2 modulo 3 is 2, and so on). This proves the statement. □ 11.D.19. Determine which natural numbers n satisfy that the integer n · 2n + 1 is divisible by seven. Solution. We are looking for the solution of the con- gruence n · 2n ≡ −1 (mod 7). We should be aware of the fact that we cannot use the theorem 11.4.1 since n · 2n is not a polynomial in variable n, so it is not guaranteed (and it is even not true) that the expression will yield the same remainder modulo 7 when evaluated at integers which are congruent modulo 7. On the other hand, we can notice that the order of 2 modulo 7 is equal to 3, so we can split the problem into three cases according to the remainder of n when divided by 3. For n ≡ 0 (mod 3), we have 2n ≡ 1 (mod 7), so the congruence in question is equivalent to n ≡ −1 (mod 7). Combining the conditions n ≡ 0 (mod 3) and n ≡ −1 (mod 7) in the Chinese remainder theorem leads to the solution n ≡ 6 (mod 21). Now, for n ≡ 1 (mod 3), we have 2n ≡ 2 (mod 7), so the examined congruence is of the form 2n ≡ −1 (mod 7), which is equivalent to n ≡ 3 (mod 7). The conditions n ≡ 1 (mod 3) and n ≡ 3 (mod 7) are satisfied iff n ≡ 10 (mod 21). Finally, for n ≡ 2 (mod 3), we have 2n ≡ 4 (mod 7), and the solution of the congruence 4n ≡ −1 (mod 7) is n ≡ 5 (mod 7). Altogether, n ≡ 5 (mod 21). The problem is satisfied by exactly those natural numbers n with n ≡ 5, 6, 10 (mod 21). □ 11.D.20. Prove that for any natural number n, the integer 2n4 + n3 + 50 is divisible by 6 if and only if the integer 2 · 4n + 3n + 50 is divisible by 13. Solution. The expression f(n) = 2n4 +n3 +50 is a polynomial in variable n, so in this case, we can make use of theorem 11.4.1, i.e., it suffices to go through all possible remainders modulo 6. Since the order of 4 modulo 13 is equal to 6 and the order of 3 modulo 13 equals 3, it is enough (by 11.3.13) to examine the remainder of n upon division by 6 in the latter case as well. In the former case, we calculate n 0 1 2 3 4 5 f(n) mod 6 2 5 0 5 2 3 CHAPTER 11. ELEMENTARY NUMBER THEORY Let us emphasize that this is quite a strong theorem (which is actually valid in much more general algebraic structures), which allows us to guarantee that for any remainders with respect to given (pairwise coprime) moduli, there exists an integer with the given remainders. 11.4.5. Higher-order congruences. Now, let us get back to the more general case of congruences. f(x) ≡ 0 (mod m), where f(x) is a polynomial with integer coefficients and m ∈ N. So far, we have only one method at our disposal, which is tedious, yet universal – to try all possible remainders modulo m. When solving such a congruence, it is sufficient to find out for which integers a, 0 ≤ a < m, it holds that f(a) ≡ 0 (mod m). 
The disadvantage of this method is its complexity, which increases as m does. If m is composite, i.e., m = pα1 1 . . . pαk k , where p1, . . . , pk are distinct primes, and k > 1, we can replace the original congruence by the system of congruences f(x) ≡ 0 (mod pα1 1 ), ... f(x) ≡ 0 (mod pαk k ), which has the same set of solutions. However, we can solve the congruences separately. The advantage of this method is in that the moduli of the congruences of the system are less than the modulus of the original congruence. Example. Consider the congruence x3 − 2x + 11 ≡ 0 (mod 105). If we were to try out all possibilities, we would have to compute the value of f(x) = x3 −2x+11 for the 105 values f(0), f(1), . . . , f(104). Therefore, we better factor 105 = 3 · 5 · 7 and solve the congruences f(x) ≡ 0 for moduli 3, 5, and 7. We evaluate the polynomial f(x) in convenient integers: x −3 −2 −1 0 1 2 3 f(x) −10 7 12 11 10 15 32 . The congruence f(x) ≡ 0 (mod 3) thus has solution x ≡ −1 (mod 3) (only the first one of the integers 12, 11, 10 is a multiple of 3); the congruence f(x) ≡ 0 (mod 5) has solutions x ≡ 1 and x ≡ 2 (mod 5); finally, the solution of the congruence f(x) ≡ 0 (mod 7) is x ≡ −2 (mod 7). It remains to solve two systems of congruences: x ≡ −1 (mod 3), x ≡ 1 (mod 5), x ≡ −2 (mod 7) and x ≡ −1 (mod 3), x ≡ 2 (mod 5), x ≡ −2 (mod 7). Solving these systems, we can find out that the solutions of the given congruence f(x) ≡ 0 (mod 105) are exactly those integers x which satisfy x ≡ 26 (mod 105) or x ≡ 47 (mod 105). 1005 Therefore, the congruence f(n) ≡ 0 (mod 6) is satisfied by exactly those natural numbers n which satisfy n ≡ 2 (mod 6). In the latter case, we gradually compute that n 0 1 2 3 4 5 4n mod 13 1 4 3 −1 −4 −3 3n mod 13 1 3 9 1 3 9 2 · 4n + 3n − 2 mod 13 1 9 0 −3 −7 1 Just like in the former case, the congruence 2·4n +3n +50 ≡ 0 (mod 13) is satisfied if and only if n ≡ 2 (mod 6). □ 11.D.21. Solve the congruence x2 ≡ 18 (mod 63). Solution. Since (18, 63) = 9, it must be that 9 | x2 , i.e., 3 | x. Setting x = 3x1, x1 ∈ Z, we get an equivalent congruence x2 1 ≡ 2 (mod 7), which already satisfies that the modulus is coprime to the integer on the right-hand side. It follows from theorem 11.4.8 that this congruence has at most 2 solutions, and those are clearly x1 ≡ ±3 (mod 7), i.e., x1 ≡ ±3, ±10, ±17, ±24, ±31, ±38, ±45, ±52, ±59 (mod 63). The solution of the original congruence is thus x ≡ 3x1 (mod 63), i.e., x ≡ ±9, ±12, ±30 (mod 63). □ 11.D.22. Solve the congruence x3 ≡ 3 (mod 18). Solution. Since (3, 18) = 3, we must have 3 | x. Making the substitution x = 3·x1, similarly to the above exercise, we get the congruence 27x3 1 ≡ 3 (mod 18), which has no solution since (27, 18) ∤ 3. □ Quadratic congruences. In the theoretical column we state that any quadratic congruence can be transformed to the (possibly system of congruences of) binomial form x2 ≡ a (mod p), and then we can decide about the solvability using the Legendre symbol. Let us illustrate it on several examples. 11.D.23. Determine the number of solutions of the congruence 13x2 + 7x + 1 ≡ 0 (mod 37). Solution. First, we need to normalize the polynomial on the left-hand side, i.e. we have to find the inverse of 13 modulo 37. Using the Eucliean algorithm we find that the inverse is 20, and after multiplication of both sides of the congruence by it and reducing modulo 37 we obtain the congruence x2 +29x+ 20 ≡ 0 (mod 37). 
Now we complete the square (an odd coefficient 29 does not cause any trouble as it can be replaced by −8) and we obtain (x − 4)2 + 4 ≡ 0 (mod 37). After substitution of y for x − 4 we finally obtain the congruence in a binomial form y2 ≡ −4 (mod 37). CHAPTER 11. ELEMENTARY NUMBER THEORY It is not always possible to replace the congruence with a system of congruences modulo primes, as in the above example: if the original modulus is a multiple of a higher power of a prime, then we cannot “get rid” of this power. However, even such a congruence modulo a power of prime need not be solved by examining all possibilities. There is a more efficient tool, which is described by the following theorem. 11.4.6. Theorem (Hensel’s lemma). Let p be a prime, f(x) ∈ Z[x], a ∈ Z such that p | f(a), p ∤ f′ (a). Then, for every n ∈ N, the system x ≡ a (mod p), f(x) ≡ 0 (mod pn ) has a unique solution modulo pn . Proof. We will proceed by induction on n. In the case of n = 1, the congruence f(x) ≡ 0 (mod p1 ) is only another formulation of the assumption that the integer a satisfies p | f(a). Further, let n > 1 and suppose the proposition is true for n − 1. If x satisfies the system for n, then it does so for n−1 as well. Denoting one of the solutions of the system for n − 1 as cn−1, we can look for the solution of the system for n in the form x = cn−1 + k · pn−1 , where k ∈ Z. We need to find out for which k we have f ( cn−1 + k · pn−1 ) ≡ 0 (mod pn ). We know that pn−1 | f ( cn−1 + k · pn−1 ) . Now, we use the binomial theorem for f(x) = amxm + · · · + a1x + a0, where a0, . . . , am ∈ Z. We have ( cn−1 + k · pn−1 )i ≡ ci n−1 + i · ci−1 n−1 · kpn−1 (mod pn ), hence f ( cn−1 + k · pn−1 ) ≡ f(cn−1) + k · pn−1 f′ (cn−1). Therefore, f ( cn−1 + k · pn−1 ) ≡ 0 (mod pn ) ⇐⇒ ⇐⇒ 0 ≡ f(cn−1) pn−1 + k · f′ (cn−1) (mod p). Since cn−1 ≡ a (mod p), we get f′ (cn−1) ≡ f′ (a) ̸≡ 0 (mod p), so (f′ (cn−1), p) = 1. By the theorem about the solutions of linear congruences, we can hence see that there is (modulo p) a unique solution k of this congruence, and since cn−1 was, by the induction hypothesis, the only solution modulo pn−1 , the integer cn−1 +k ·pn−1 is the only solution of the given system modulo pn . □ Example. Consider the congruence 3x2 + 4 ≡ 0 (mod 49). The congruence can be equivalently transformed (by solving the linear congruence 3y ≡ 1 (mod 49) and multiplying both sides of the congruence by the integer y ≡ 33) to the form x2 ≡ 15 (mod 72 ). Then, we proceed as in the constructive proof of Hensel’s lemma. 1006 The fact that this congruence is solvable can be established either using theorem 11.4.10, or with use of the Legendre symbol. The former approach leads to the calculation d = (2, φ(37)) = 2, and (−4) 36 2 ≡ 1 (mod 37), while the latter one gives ( −4 37 ) = ( −1 37 ) · ( 2 37 )2 = 1 by the corollary after theorem 11.4.13 (as 37 ≡ 1 (mod 4)). In any way we have obtained that the given congruence has d = 2 solutions. □ 11.D.24. Solve the congruence 6x2 +x−1 ≡ 0 (mod 29). Solution. Although we have not presented any special method for finding solutions of quadratic congruence yet (apart from the general method for binomial congruences or going through the complete residue system) we will see that in some case the set of solutions can be easily established. Let us first proceed in the usual way: multiplying the congruence by 5 (it is the inverse of 6 modulo 29) we obtain x2 + 5x − 5 ≡ 0 (mod 29), and after completing the square we have (x − 12)2 ≡ 4 (mod 29). 
We immediately see that this congruence is solvable with the pair of solutions x−12 ≡ ±2 (mod 29), and thus x ≡ 10, 14 (mod 29). We could have also seen almost immediately that the given polynomial can be factored as 6x2 + x − 1 = (3x − 1)(2x+1), and thus the prime modulus 29 has to divide either 3x − 1 or 2x + 1. The obtained linear congruences 3x ≡ 1 (mod 29) and 2x ≡ −1 (mod 29) easily yield the same solutions x ≡ 10 (mod 29) and x ≡ 14 (mod 29) as above. □ 11.D.25. Find all integers which satisfy the congruence x2 ≡ 7 (mod 43). Solution. The Legendre symbol evaluates to ( 7 43 ) = − ( 43 7 ) = − ( 1 7 ) = −1. Hence it follows that 7 is a quadratic nonresidue modulo 43, so there is no solution of the given congruence. □ 11.D.26. Find all integers a for which the congruence x2 ≡ a (mod 43) is solvable. Solution. This exercise is a follow-up to the above one, from which we can see that the integer 7 does not meet the requirement. We can test all the remainders modulo 43 in the same way, but there is a simpler method. The congruence is surely solvable if a is a multiple of 43 (then, it has a unique solution); CHAPTER 11. ELEMENTARY NUMBER THEORY First, we solve the congruence x2 ≡ 15 ≡ 1 (mod 7), which has at most 2 solutions, and those are x ≡ ±1 (mod 7). These solutions can be expressed in the form x = ±1 + 7t, where t ∈ Z, and substituted into the congruence modulo 49, whence we get the solution x ≡ ±8 (mod 49) (if we were interested solely in the number of solutions, we would not even have to finish the calculation as it follows straight from Hensel’s lemma that every solution modulo 7 gives a unique solution modulo 49 because for f(x) = x2 − 15, we have 7 ∤ f′ (±1)). 11.4.7. Congruences modulo a prime. The solution of general higher-order congruences has thus been reduced to the solution of congruences modulo a prime. As we will see, this is where the stumbling block is since no (much) more efficient universal procedure than trying out all possibilities is known. We can at least mention several statements describing the solvability and number of solutions of such congruences. We will then prove some detailed results for some special cases in further paragraphs. Theorem. Let p be a prime, f(x) ∈ Z[x]. Every congruence f(x) ≡ 0 (mod p) is equivalent to a congruence of degree at most p − 1. Proof. Since it holds for any a ∈ Z that p | ap − a (simple consequence of Fermat’s little theorem), the congruence xp −x ≡ 0 (mod p) is satisfied by all integers. Dividing the polynomial f(x) by xp − x with remainder, we get f(x) = q(x) · (xp − x) + r(x) for suitable f(x), r(x) ∈ Z, where the degree of r(x) is less than that of the divisor, i.e. p. We thus get that the congruence r(x) ≡ 0 (mod p) is equivalent to the congruence f(x) ≡ 0 (mod p), yet it is of degree at most p − 1. □ 11.4.8. Theorem. Let p be a prime, f(x) ∈ Z[x]. If the congruence f(x) ≡ 0 (mod p) has more than deg(f) solutions, then each of the coefficients of the polynomial f is a multiple of p. Proof. In algebraic words, we are actually interested in the number of roots of a non-zero polynomial over a finite field Zp, and by 12.3.4, there are at most deg(f) of them. □ Corollary (Another proof of Wilson’s theorem). If p is a prime, then (p − 1)! ≡ −1 (mod p). Proof. The statement is clearly true for p = 2, so we can consider only odd primes p from now on. 
By Fermat’s little theorem, the congruence (x−1)(x−2) · · · (x−(p−1))−(xp−1 −1) ≡ 0 (mod p) is satisfied by any integer a which is not divisible by p; i.e., there are p−1 solutions. However, its degree is equal to p−2 (which is less than the number of solutions). It follows from 11.4.7 that all of the coefficients of the left-hand polynomial 1007 and if not, it must be a quadratic residue modulo 43. The quadratic residues can be most simply enumerated by calculating the squares of all elements of a reduced residue system modulo 43. The quadratic residues are thus the integers congruent to (±1)2 , (±2)2 , (±3)2 , . . . , (±21)2 modulo 43, so the problem is satisfied by exactly those integers a which are congruent to any one of 1, 4, 6, 9, 10, 11, 13, 14, 15, 16, 17, 21, 23, 24, 25, 31, 35, 36, 38, 40, 41. □ Using law of quadratic reciprocity 11.4.13 we can calculate the value (a/p) for any integer a and an odd prime p. Moreover, evaluation of the Legendre symbol is fast enough even for high arguments, therefore using it is favourable to verifying criteria of the theorem 11.4.10. 11.D.27. Here, we recall the statement of the Law in the slightly modified way which is more suitable for direct calcu- lations. i) −1 is a quadratic residue for primes p which satisfy p ≡ 1 (mod 4) and it is a quadratic nonresidue for primes p satisfying p ≡ 3 (mod 4). ii) 2 is a quadratic residue for primes p which satisfy p ≡ ±1 (mod 8) and it is a quadratic nonresidue for primes p satisfying p ≡ ±3 (mod 8). iii) If p ≡ 1 (mod 4) or q ≡ 1 (mod 4), then (p/q) = (q/p); for other odd p, q, we have (p/q) = −(q/p). Solution. We simply apply law of quadratic reciprocity in the appropriate cases. i) The integer p−1 2 is even iff 4 | p − 1. ii) We need to know for which odd primes p the exponent is p2 −1 8 is even. Odd primes are congruent to ±1 or ±3 modulo 8, so we have (by 11.C.7) that either p2 ≡ 1 (mod 16) or p2 ≡ 9 (mod 16). iii) This is clear from the law of quadratic reciprocity. □ 11.D.28. Derive by straight calculation from Gauss’s lemma 11.4.14 once again the so-called supplementary laws of quadratic reciprocity: ( −1 p ) = (−1) p−1 2 and ( 2 p ) = (−1) p2 −1 8 . Solution. To evaluate (−1/p) in the former case, we should realize that µ tells the number of least (in absolute value) negative remainders of integers in the set { −1, −2, . . . , −p−1 2 } . However, those are exactly the desired remainders and they are all negative; hence we have µ = p−1 2 and (−1/p) = (−1) p−1 2 . In the latter case, we need to express the number of least (in absolute value) negative remainders of integers in the set { 1 · 2, 2 · 2, 3 · 2 . . . , p−1 2 · 2 } . CHAPTER 11. ELEMENTARY NUMBER THEORY are multiples of p. In particular, this applies to the absolute term, which equals (p−1)!+1. This proves Wilson’s theorem. □ 11.4.9. Binomial congruences. This part will be devoted to solving special types of higher-order polynomial congruences, the so-called binomial congruences. It is an analog to the binomial equations, where the polynomial f(x) is xn − a. It can easily be shown that we can restrict ourselves to the condition that a be coprime with the modulus of the congruence – otherwise, we can always equivalently transform the congruence into this form or decide that it has no solution. Quadratic and power residues Let m ∈ N, a ∈ Z, (a, m) = 1. The integer a is said to be a n-th power residue modulo m, or residue of degree n modulo m if and only if the congruence xn ≡ a (mod m) is solvable. 
Otherwise, we call a a n-th power nonresidue modulo m, or nonresidue of degree n modulo m. For n = 2, 3, 4, we use the adjectives quadratic, cubic, and quartic residue (or nonresidue) modulo m. Now, we will show how to solve binomial congruences modulo m, if there are primitive roots modulo m (in particular, when the modulus is an odd prime or its power). 11.4.10. Theorem. Let m ∈ N be such that there are primitive roots modulo m. Further, let a ∈ Z, (a, m) = 1. Then, the congruence xn ≡ a (mod m) is solvable (i.e., a is an n-th power residue modulo m) if and only if aφ(m)/d ≡ 1 (mod m), where d = (n, φ(m)). And if so, it has exactly d solutions. Proof. Let g be a primitive root modulo m. Then, for any x coprime to m, there is a unique integer y (its discrete logarithm) with the property 0 ≤ y < φ(m) such that x ≡ gy (mod m). Similarly, for a given a, there is a unique b ∈ Z; 0 ≤ b < φ(m) such that a ≡ gb (mod m). After this substitution, the binomial congruence in question is thus equivalent to the congruence (gy )n ≡ gb (mod m) and, invoking theorem 11.3.13, to the linear congruence n · y ≡ b (mod φ(m)) as well. However, this congruence n · y ≡ b (mod φ(m)) is solvable if and only if d = (n, φ(m)) | b (and if so, it has d solutions). It remains to prove that d | b if and only if aφ(m)/d ≡ 1 (mod m). However, the congruence 1 ≡ aφ(m)/d ≡ gbφ(m)/d (mod m) is true if and only if φ(m) | bφ(m) d , which happens if and only if d | b. □ 1008 For any k ∈ { 1, 2, . . . , p−1 2 } , the integer 2k leaves a negative remainder if and only if 2k > p−1 2 , i.e., iff k > p−1 4 . Now, it remains to determine the number of such integers k. If p ≡ 1 (mod 4), then this number is equal to p−1 2 − p−1 4 = p−1 4 , so ( −1 p ) = (−1)µ = (−1) p−1 4 = (−1) p−1 4 · p+1 2 = (−1) p2 −1 8 , since p+1 2 is odd in this case. Similarly, for p ≡ 3 (mod 4), the number of such integers k equals p−1 2 − p−3 4 = p+1 4 , so ( −1 p ) = (−1) p+1 4 = (−1) p+1 4 · p−1 2 = (−1) p2 −1 8 , since p−1 2 is odd in this case as well. □ 11.D.29. Solve the congruence x2 − 23 ≡ 0 (mod 77). Solution. Factoring the modulus, we get the system x2 − 1 ≡ 0 (mod 11), x2 − 2 ≡ 0 (mod 7). Clearly, 1 is a quadratic residue modulo 11, so the first congruence of the system has (exactly) two solutions: x ≡ ±1 (mod 11). Further, (2/7) = (9/7) = 1, and it should not take much effort to notice the solution: x ≡ ±3 (mod 7). We have thus obtained four simple systems of two linear congruences each. Solving them, we will get that the original congruence has the following four solutions: x ≡ 10, 32, 45 or 67 (mod 77). □ 11.D.30. Solve the congruence 7x2 + 112x + 42 ≡ 0 (mod 473). ⃝ Jacobi symbol. Jacobi symbol (a/b) is a generalization of the Legendre symbol to the case where the “lower” argument b need not be a prime, but any odd positive integer. It is defined as the product of the Legendre symbols corresponding to the prime factors of b: if b = pα1 1 · · · pαk k , then ( a b ) = ( a p1 )α1 · · · ( a pk )αk . The primary motivation for introducing the Jacobi symbol is the necessity to evaluate the Legendre symbol (and thus to decide the solvability of quadratic congruences) without having to factor integers to primes. We will illustrate such calculation on an example having in mind that Jacobi symbol shares with the Legendre one not only the notation but also almost all of the (computational) properties. 11.D.31. Decide whether the congruence x2 ≡ 219 (mod 383) is solvable. CHAPTER 11. ELEMENTARY NUMBER THEORY Corollary. 
If the assumptions of the above theorem hold and, moreover, (n, φ(m)) = 1, the congruence xn ≡ a (mod m) always has a unique solution. In other words, exponentiation to the n-th power (where n is coprime to φ(m)) is a bijection on the set Z× m of invertible residue classes modulo m (it is even an automorphism of the group (Z× m, ·)). 11.4.11. Quadratic congruences and the Legendre symbol. Now, our task is to find an efficient condition determining whether a quadratic congru- ence ax2 + bx + c ≡ 0 (mod m) is solvable (and if so, how many solutions it has). It can easily be seen from the presented theory that if we want to decide whether this congruence is solvable, it suffices to decide this for the (binomial) congruence x2 ≡ a (mod p), where p is an odd prime and a is an integer coprime to it. A congruence modulo a composite m can be decomposed to an equivalent system of congruences modulo the particular factors of the integer m, which are powers of primes. Such congruences can be transformed to quadratic congruences with prime modulus using the procedure described in Hensel’s lemma 11.4.6. Norming this congruence and completing the square then results in the aforementioned form. To decide the solvability of a congruence, we can, of course, use the theorem 11.4.10 about the solvability of binomial congruences. Its application is, however, often limited by time resources; we will thus try to find a criterion which will be computationally easier in (not only) the quadratic case. Example. Let us determine the number of solutions of the congruence x2 ≡ 219 (mod 383). Since 383 is a prime and (2, φ(383)) = 2, it follows from theorem 11.4.10 that the given congruence is solvable (and it has 2 solutions) if and only if 219 φ(383) 2 = 219191 ≡ 1 (mod 383). It is not easy to verify this proposition without some computational power (though, this can still be calculated on a “piece of paper”). However, we will show that this condition can be verified much more easily using the properties of the so-called Legendre symbol. Legendre symbol Let p be an odd prime and a an integer. The Legendre symbol is defined by ( a p ) =    1 for p ∤ a, a is a quadratic residue modulo p, 0 for p | a, −1 if a is a quadratic nonresidue modulo p. The Legendre symbol is also often written as (a/p) and usually read as “a on p”. Example. Since the congruence x2 ≡ 1 (mod p) is solvable for an arbitrary odd prime p, we have (1/p) = 1. Further, 1009 Solution. Since 383 is a prime, the congruence will be solvable if the Legendre symbol will satisfy (219/383) = 1. ( 219 383 ) = − ( 383 219 ) (Jacobi) as 383 ≡ 219 ≡ 3 (mod 4) = − ( 164 219 ) = − ( 41 219 ) 164 = 22 · 41 = − ( 219 41 ) (Jacobi) as 41 ≡ 1 (mod 4) = − ( 14 41 ) = − ( 2 41 )( 7 41 ) = − ( 7 41 ) as 41 ≡ 1 (mod 8) = − ( 41 7 ) as 41 ≡ 1 (mod 4) = − ( −1 7 ) = 1 as 7 ≡ 3 (mod 4). □ Now, we introduce several exercises proving that the Jacobi symbol has properties similar to the Legendre one, which relieves us of the necessity to factor the integers that appear when working purely with the Legendre symbol. 11.D.32. Prove that all odd positive numbers b, b′ and all integers a, a1, a2 satisfy (the symbols used here are always the Jacobi ones): i) if a1 ≡ a2 (mod b), then (a1 b ) = (a2 b ) , ii) (a1a2 b ) = (a1 b )(a2 b ) , iii) ( a bb′ ) = (a b )(a b′ ) . ⃝ 11.D.33. Prove that if a, b are odd natural numbers, then i) ab−1 2 ≡ a−1 2 + b−1 2 (mod 2), ii) a2 b2 −1 8 ≡ a2 −1 8 + b2 −1 8 (mod 2). Solution. 
i) Since the integer (a − 1)(b − 1) = (ab − 1) − (a − 1) − (b−1) is a multiple of 4, we get (ab−1) ≡ (a−1)+(b− 1) (mod 4), which gives what we want when divided by two. ii) Similarly to above, (a2 − 1)(b2 − 1) = (a2 b2 − 1) − − (a2 − 1) − (b2 − 1) is a multiple of 16. Therefore, (a2 b2 − 1) ≡ (a2 − 1) + (b2 − 1) (mod 16), which gives the wanted statement when divided by eight (see also exercise 11.A.2). □ 11.D.34. Prove that if a1, . . . , ak are odd natural numbers, then i) ∏k ℓ=1 aℓ−1 2 ≡ ∑k ℓ=1 aℓ−1 2 (mod 2), ii) ∏k ℓ=1 a2 ℓ −1 8 ≡ ∑k ℓ=1 a2 ℓ −1 8 (mod 2). ⃝ CHAPTER 11. ELEMENTARY NUMBER THEORY (−1/5) = (4/5) = 1, because the congruence x2 ≡ −1 (mod 5) is equivalent to the congruence x2 ≡ 4 (mod 5), whose solutions are x ≡ ±2 (mod 5). The statement of the following lemma will be very often used when evaluating the Legendre symbol in practice. 11.4.12. Lemma. Let p be an odd prime, a, b ∈ Z arbitrary. Then: (1) (a p ) ≡ a p−1 2 (mod p). (2) (ab p ) = (a p )(b p ) . (3) If a ≡ b (mod p), then (a p ) = (b p ) . Proof. (1) The statement is clear for p | a; if a is a quadratic residue modulo p, then the statement follows from the theorem about the solvability of quadratic congruences, which claims (in this case, we have (φ(p), 2) = 2) that the necessary and sufficient condition for the congruence x2 ≡ a (mod p) to be solvable that a p−1 2 ≡ 1 (mod p). The same theorem implies for the case of a quadratic nonresidue as well that we have a p−1 2 ̸≡ 1 (mod p). However, then (since we have p | ap−1 − 1 = (a p−1 2 − 1)(a p−1 2 + 1) by Fermat’s theorem), necessarily p | a p−1 2 + 1, i.e., a p−1 2 ≡ −1 (mod p). (2) From (1), we have ( ab p ) ≡ (ab) p−1 2 = a p−1 2 b p−1 2 ≡ ( a p )( b p ) (mod p). However, since the values of the Legendre symbol belong to the set {−1, 0, 1}, this congruence immediately implies that the left and right sides are equal. (3) Apparent from the definition. □ Corollary. (1) Any reduced residue system modulo p contains the same number of quadratic residues as non- residues. (2) The product of two quadratic residues as well as the product of two quadratic nonresidues is a residue; the product of a residue and a nonresidue is a nonresidue. (3) (−1/p) = (−1) p−1 2 , i.e., the congruence x2 ≡ −1 (mod p) is solvable if and only if p ≡ 1 (mod 4). Proof. (1) Considering the elements of a reduced residue system modulo p (we can take, for instance, the set {−p−1 2 , . . . , −1, 1, . . . , p−1 2 }), the quadratic residues are exactly those integers which are congruent to one of (±1)2 , . . . , (±p−1 2 )2 . Thus there are exactly p−1 2 of quadratic residues, so there are p − 1 − p−1 2 = p−1 2 of the other ones (the quadratic nonresidues). (2) This follows immediately from part (2) and the previous lemma. 1010 11.D.35. Prove the law of quadratic reciprocity for the Jacobi symbol, i.e., prove that if a, b are odd natural numbers, then i) (−1 a ) = (−1) a−1 2 , ii) (2 a ) = (−1) a2−1 8 , iii) (a b ) = (b a ) · (−1) a−1 2 b−1 2 . Solution. Let (just like in the definition of the Jacobi symbol) a factor to (odd) primes as p1p2 · · · pk. i) The properties of the Legendre symbol and the aforementioned statement imply that ( −1 a ) = ( −1 p1 ) · ( −1 p2 ) · · · ( −1 pk ) = = (−1) p1−1 2 · · · (−1) pk−1 2 = = (−1) ∑k i=1 pi−1 2 = = (−1) ∏k i=1 pi−1 2 = (−1) a−1 2 . ii) Analogously to above. iii) Further, let b factor to (odd) primes as q1q2 · · · qℓ. If we have pi = qj for some i and j, then the symbols on both sides of the equality are equal to zero. 
Otherwise, the law of quadratic reciprocity for the Legendre symbol implies that for all pairs (pi, qj), we have ( pi qj ) = ( qi pj ) · (−1) pi−1 2 qj −1 2 . Therefore, ( a b ) = k∏ i=1 ℓ∏ j=1 ( pi qj ) = = k∏ i=1 ℓ∏ j=1 ( qj pi ) · (−1) pi−1 2 qj −1 2 = = k∏ i=1 (−1) pi−1 2 ∑ℓ j=1 qj −1 2 ℓ∏ j=1 ( qj pi ) = = k∏ i=1 (−1) pi−1 2 ∏ℓ j=1 qj −1 2 ℓ∏ j=1 ( qj pi ) = = k∏ i=1 (−1) pi−1 2 b−1 2 ℓ∏ j=1 ( qj pi ) = = (−1) b−1 2 ∑k i=1 pi−1 2 k∏ i=1 ℓ∏ j=1 ( qj pi ) = = (−1) a−1 2 b−1 2 ( b a ) . We utilized the result of part (i) of the previous exercise in the calculations. □ 11.D.36. Determine whether the congruence x2 ≡ 38 (mod 165) is solvable. CHAPTER 11. ELEMENTARY NUMBER THEORY (3) It follows from part (1) of the lemma that (−1/p) ≡ (−1) p−1 2 (mod p); both sides, however, take on the values ±1, so they must be equal. □ These basic statements about the values of the Legendre symbol are already sufficient for proving the theorem on the infinitude of primes of the form 4k + 1 (see paragraph 11.2.3). Proposition. There are infinitely many primes of the form 4k + 1. Proof. We will proceed by contradiction. Suppose that p1, p2, . . . , pℓ is the enumeration of all primes of the form 4k + 1, and consider the integer N = (2p1 · · · pℓ)2 + 1. This integer is of the form 4k+1 as well. The assumption that N is a prime would lead to an immediate contradiction, since N is surely greater than any of the integers p1, p2, . . . , pℓ. Therefore, from now on, let us suppose that it is thus composite. Then, there must exist a prime p which divides N. Apparently, none of the primes 2, p1, p2, . . . , pℓ divides N, so we will be finished if we prove that p is also of the form 4k + 1. It follows from the congruence (2p1 · · · pℓ)2 ≡ −1 (mod p), that (−1/p) = 1, and this is true (by the previous corollary) if and only if p ≡ 1 (mod 4). Altogether, we have reached a contradiction (a prime p not belonging to the original list of all primes of the form 4k + 1) in the case of composite N as well, which proves that there are infinitely many such primes. □ The most important theorem which allows us to efficiently compute the value of the Legendre symbol (and thus determine the solvability of a quadratic congruence), is the so-called law of quadratic reciprocity. Law of quadratic reciprocity 11.4.13. Theorem. Let p, q be odd primes. Then, (1) (−1 p ) = (−1) p−1 2 , (2) (2 p ) = (−1) p2−1 8 , (3) (q p ) = (p q ) · (−1) p−1 2 q−1 2 . The theorem is put this way mainly because we can calculate the value (a/p) for any integer a using these three formulae and the basic rules for the Legendre symbol. Example. Let us calculate the value (79/101) using the properties of the Legendre symbol. 1011 Solution. The Jacobi symbol is equal to ( 38 165 ) = ( 2 165 ) · ( 19 165 ) = (2 3 ) · (2 5 ) · ( 2 11 ) · (19 3 ) · (19 5 ) · (19 11 ) = (−1)3 (1 3 ) · (−1 5 ) · ( 2 11 )3 = 1. This result does not answer the question of the existence of a solution. However, if we split the congruence to a system of congruences according to the factors of the modulus, we obtain x2 ≡ −1 (mod 3), x2 ≡ 3 (mod 5), x2 ≡ 5 (mod 11), whence we can easily see that the first and second congruences have no solution. In particular,(−1 3 ) = −1 and (3 5 ) = (5 3 ) = (2 3 ) = −1 . Therefore, neither the original congruence has a solution. □ 11.D.37. Find all primes p such that the integer below is a quadratic residue modulo p: i) 3, ii) − 3, iii) 6. Solution. i) We are looking for primes p ̸= 3 such that x2 ≡ 3 (mod p) is solvable. 
Since p = 2 satisfies the above, we will consider only odd primes p ̸= 3 from now on. For p ≡ 1 (mod 4), it follows from the law of quadratic reciprocity that 1 = (3/p) = (p/3), which occurs if and only if p ≡ 1 (mod 3). On the other hand, if p ≡ −1 (mod 4), then 1 = (3/p) = −(p/3), which holds for p ≡ −1 (mod 3). Putting the conditions of both cases together, we arrive at p ≡ ±1 (mod 12), which, together with p = 2, completes the set of all primes satisfying the given condition. ii) The condition 1 = (−3/p) = (−1/p)(3/p) is satisfied if either (−1/p) = (3/p) = 1 or (−1/p) = (3/p) = −1. In the former case (using the result of the previous item), this means that p ≡ 1 (mod 4) and p ≡ ±1 (mod 12). In the latter case, we must have p ≡ −1 (mod 4) and p ≡ ±5 (mod 12), at the same time – we can take, for instance, the set {−5, −1, 1, 5} for a reduced residue system modulo 12, and since (3/p) = 1 for p ≡ ±1 (mod 12), we surely have (3/p) = −1 whenever p ≡ ±5 (mod 12). We have thus obtained four systems of two congruences each. Two of them have no solution, and the remaining two are satisfied by p ≡ 1 (mod 12) and p ≡ −5 (mod 12), respectively. iii) In this case, (6/p) = (2/p)(3/p) and once again, there are two possibilities: either (2/p) = (3/p) = 1 or (2/p) = (3/p) = −1. The former case occurs if p satisfies p ≡ ±1 (mod 8) as well as p ≡ ±1 (mod 12). Solving the corresponding systems of linear congruences leads to the condition p ≡ ±1 (mod 24). In the latter case, we get p ≡ ±3 (mod 8) as well as p ≡ ±5 (mod 12), which together gives p ≡ ±5 (mod 24). Let us remark that thanks to Dirichlet’s theorem 11.2.5, the number of primes we were interested in is infinite in each of the three problems. □ CHAPTER 11. ELEMENTARY NUMBER THEORY ( 79 101 ) = ( 101 79 ) since 101 is congruent to 1 modulo 4 = ( 22 79 ) = ( 2 79 ) · ( 11 79 ) = ( 11 79 ) since 79 is congruent to −1 modulo 8 = (−1) ( 79 11 ) since 11 ≡ 79 ≡ 3 (mod 4) = (−1) ( 2 11 ) = 1 since 11 ≡ 3 (mod 8). Many proofs of the the quadratic reciprocity law can be found in literature6 . However, many of them (especially the shorter ones) usually make use of deeper knowledge from algebraic number theory. We will present an elementary proof of this theorem here. Let S denote the reduced residue system of the least residues (in absolute value) modulo p, i.e., S = { −p−1 2 , −p−3 2 , . . . , −1, 1, . . . , p−3 2 , p−1 2 } . Further, for a ∈ Z, p ∤ a, let µp(a) denote the number of negative least residues (in absolute value) of the integers 1 · a, 2 · a, . . . , p − 1 2 · a, i.e., we decide for each of these integers to which integer from the set S it is congruent and count the number of the negative ones. If it is clear from context which values a, p we mean, we will usually omit the parameters and write only µ instead of µp(a). Example. We determine µp(a) for the prime p = 11 and the integer a = 3. Now, the reduced residue system we are interested in is S = {−5, . . . , −1, 1, , . . . , 5}, and for a = 3, we calculate 1 · 3 ≡ 3 (mod 11) 2 · 3 ≡ −5 (mod 11) 3 · 3 ≡ −2 (mod 11) 4 · 3 ≡ 1 (mod 11) 5 · 3 ≡ 4 (mod 11), whence µ11(3) = 2. We will show in the following statement that this integer is tightly connected to the Legendre symbol – the value of the symbol (3/11) can be determined in terms of the µ function as (−1)µ11(3) = (−1)2 = 1. 6In 2000, F. Lemmermeyer stated 233 proofs – see F. Lemmermeyer, Reciprocity laws. From Euler to Eisenstein, Springer. 2000 1012 11.D.38. 
The following exercise illustrates that if the modulus of a quadratic congruence is a prime p satisfying p ≡ 3 (mod 4), then we are able not only to decide the solvability of the congruence, but also to describe all of its solutions in a simple way. Consider a prime p ≡ 3 (mod 4) and an integer a such that (a/p) = 1. Prove that the solution of the congruence x2 ≡ a (mod p) is x ≡ ±a p+1 4 (mod p). Solution. It can be easily verified (using lemma 11.4.12) that ( a p+1 4 )2 ≡ a p+1 2 ≡ a · (a p ) ≡ a (mod p) . □ 11.D.39. Determine whether the congruence x2 ≡ 3 (mod 59) is solvable. If so, find all of its solutions. Solution. Calculating the Legendre symbol ( 3 59 ) = − ( 59 3 ) = − ( 2 3 ) = −(−1) = 1, we find out that the congruence has two solutions. Thanks to the statement above, we can immediately see (59 ≡ 3 (mod 4)) that the congruence is satisfied by x ≡ ±3 59+1 4 = ±315 ≡ (35 )3 ≡ ≡ ±73 = ±343 ≡ ∓11 (mod 59), since 35 = 243 ≡ 7 (mod 59). □ E. Diophantine equations Here, we limit ourselves only to the small class of equations which can be solved using divisibility or can be reduced to solving congruences. 11.E.1. Linear Diophantine equations. Decide whether it is possible to use a balance scale to weigh 50 grams of given goods provided we have only (an arbitrary number of) three kinds of masses; their weights are 770, 630, and 330 grams, respectively. If so, how to do that? Solution. Our task is to solve the equation 770x + 630y + 330z = 50, where x, y, z ∈ Z (a negative value in the solution would mean that we put the corresponding masses on the other scale). Dividing both sides of the equation by (770, 630, 330) = 10, we get an equivalent equation 77x + 63y + 33z = 5. Considering this equation modulo (77, 63) = 7, we get the following linear congruence: 33z ≡ 5 (mod 7), 5z ≡ 5 (mod 7), z ≡ 1 (mod 7). CHAPTER 11. ELEMENTARY NUMBER THEORY 11.4.14. Lemma (Gauss). If p is an odd prime, a ∈ Z, p ∤ a, then the value of the Legendre symbol satisfies ( a p ) = (−1)µp(a) . Proof. For each integer i ∈ { 1, 2, . . . , p−1 2 } , we set a value mi ∈ { 1, 2, . . . , p−1 2 } so that i · a ≡ ±mi (mod p). We can easily see that if k, l ∈ { 1, 2, . . . , p−1 2 } are different, then the values mk, ml are also different (the equality mk = ml would imply that k · a ≡ ±l · a (mod p), and hence k ≡ ±l (mod p), which cannot be satisfied unless k = l). Therefore, the sets {1, 2, . . . , p−1 2 } and {m1, m2, . . . , mp−1 2 } coincide, which is also illustrated by the above example. Multiplying the congruences 1 · a ≡ ±m1 (mod p), 2 · a ≡ ±m2 (mod p), ... p − 1 2 · a ≡ ±mp−1 2 (mod p) leads to p−1 2 ! · a p−1 2 ≡ (−1)µ · p−1 2 ! (mod p), since there are exactly µ negative values on the right-hand sides of the congruences. Dividing both sides by the integer p−1 2 !, we get the wanted statement, making use of lemma 11.4.12, whence (a/p) ≡ a p−1 2 (mod p). □ Now, with the help of Gauss’s lemma, we will prove the law of quadratic reciprocity. Proof of the law of quadratic reciprocity. The first part has been already proven; for the rest, we first derive a lemma which will be utilized in the proof of both of the remaining parts. Let a ∈ Z, p ∤ a, k ∈ N and let [x] and ⟨x⟩ denote the integer part (i.e. floor) and the fractional part, respectively, of a real number x. Then, [ 2ak p ] = [ 2 [ ak p ] + 2 ⟨ ak p ⟩] = 2 [ ak p ] + [ 2 ⟨ ak p ⟩] . 
This expression is odd if and only if ⟨ak p ⟩ > 1 2 , which is if and only if the least residue (in absolute value) of the integer ak modulo p is negative (a watchful reader should notice the return from the calculations of (ostensibly) irrelevant expressions back to the conditions close to the Legendre symbol). The integer µp(a) thus has the same parity (is congruent to, modulo 2) as ∑p−1 2 k=1 [2ak p ] , whence (thanks to Gauss’s lemma) we get that ( a p ) = (−1)µp(a) = (−1) ∑ p−1 2 k=1 [ 2ak p ] . 1013 This congruence is thus satisfied by those integers z of the form z = 1 + 7t, where t is an integer parameter. Substituting the form of z into the original equation, we get 77x + 63y = 5 − 33(1 + 7t), 11x + 9y = −4 − 33t. Now, we consider this (parametrized) equation modulo 11: 9y ≡ −4 − 33t (mod 11), −2y ≡ −4 (mod 11), y ≡ 2 (mod 11). Therefore, this congruence is satisfied by integers y = 2+11s for any s ∈ Z. Now, it only remains to calculate x: 11x = −4 − 33t − 9(2 + 11s), 11x = −22 − 33t − 9 · 11s, x = −2 − 3t − 9s. We have found out that the equation is satisfied if and only if (x, y, z) is in the set {(−2 − 3t − 9s, 2 + 11s, 1 + 7t); s, t ∈ Z}. Particular solutions can be obtained by evaluating the triple at concrete values of t, s. For instance, setting t = s = 0 gives the triple (−2, 2, 1); putting t = −4, s = 1 leads to (1, 13, −27). Of course, the unknowns can be eliminated in any order – the result may seem “syntactically” different, but it must still describe the same set of solutions (that is given by a particular coset of an appropriate subgroup (in our case, it is the subgroup (2, 2, 1)+(3, 0, 7)Z+(−9, 11, 0)Z) in the commutative group Z3 , which is an apparent analog to the fact that the solution of such an equation over a field forms an affine subspace of the corresponding vector space). □ Other types of Diophantine equations reducible to congruences. Some Diophantine equations are such that one of the unknowns can be expressed explicitly as a function of the other ones. In this case, it makes sense to examine for which integer arguments it holds that the value of the function is also an integer. For instance, having an equation of the form mxn = f(x1, . . . , xn−1), where m is a natural number and f(x1, . . . , xn−1) ∈ Z[x1, . . . , xn−1] is a polynomial with integer coefficients, an n-tuple of integers x1, . . . , xn is a solution of it if and only if f(x1, . . . , xn−1) ≡ 0 (mod m). 11.E.2. Solve the Diophantine equation x(x + 3) = 4y − 1. Solution. The equation can be rewritten as 4y = x2 +3x+1. Now, we will solve the congruence x2 + 3x + 1 ≡ 0 (mod 4). This congruence has no solution since for any integer x, the polynomial x2 + 3x + 1 evaluates to an odd integer (the CHAPTER 11. ELEMENTARY NUMBER THEORY Furthermore, if a is odd, then a + p is even and we get ( 2a p ) = ( 2a + 2p p ) = ( 4a+p 2 p ) = ( 2 p )2 · (a+p 2 p ) = (−1) ∑ p−1 2 k=1 [(a+p)k p ] = (−1) ∑ p−1 2 k=1 [ ak p ] · (−1) ∑ p−1 2 k=1 k . Since the sum of the arithmetic series ∑p−1 2 k=1 k is 1 2 p−1 2 p+1 2 = p2 −1 8 , we get (for a odd) the relation ( 2 p ) · ( a p ) = (−1) ∑ p−1 2 k=1 [ ak p ] · (−1) p2−1 8 , which, for a = 1, gives the wanted statement of item 2. By part (2), which we have already proved, and the previous equality, we now get for odd integers a that (1) ( a p ) = (−1) ∑ p−1 2 k=1 [ ak p ] . Now, let us consider, for given primes p ̸= q, the set T = {q · x; x ∈ Z, 1 ≤ x ≤ (p − 1)/2}× × {p · y; y ∈ Z, 1 ≤ y ≤ (q − 1)/2}. We apparently have |T| = p−1 2 · q−1 2 . 
We show that we also have
$(-1)^{|T|} = (-1)^{\sum_{y=1}^{(q-1)/2}\left[\frac{py}{q}\right]} \cdot (-1)^{\sum_{x=1}^{(p-1)/2}\left[\frac{qx}{p}\right]},$
which is sufficient thanks to the above. Since the equality $qx = py$ can happen for no pair $x, y$ from the permissible domain, the set $T$ can be partitioned into disjoint subsets $T_1$ and $T_2$:
$T_1 = T \cap \{(u, v);\ u, v \in \mathbb{Z},\ u < v\},\qquad T_2 = T \setminus T_1.$
Clearly, $|T_1|$ is the number of pairs $(qx, py)$ for which $x < \frac{p}{q}y$. Since $\frac{p}{q}y \le \frac{p}{q} \cdot \frac{q-1}{2} < \frac{p}{2}$, we have $\left[\frac{p}{q}y\right] \le \frac{p-1}{2}$. For a fixed $y$, the pairs $(qx, py)$ lying in $T_1$ are thus exactly those with $1 \le x \le \left[\frac{p}{q}y\right]$; hence $|T_1| = \sum_{y=1}^{(q-1)/2}\left[\frac{py}{q}\right]$. Analogously, $|T_2| = \sum_{x=1}^{(p-1)/2}\left[\frac{qx}{p}\right]$. By (1), we thus have $\left(\tfrac{p}{q}\right) = (-1)^{|T_1|}$ and $\left(\tfrac{q}{p}\right) = (-1)^{|T_2|}$, which finishes the proof of the law of quadratic reciprocity. □

11.E.3. Solve the following equation in integers:
$379x + 314y + 183y^2 = 210.$

Solution. The equation is linear in $x$, so the other unknown, $y$, must satisfy the congruence
$183y^2 + 314y - 210 \equiv 0 \pmod{379}.$
We would like to complete the left-hand polynomial to a square in order to get rid of the linear term. To that end, we first find a $t \in \mathbb{Z}$ such that $183 t \equiv 1 \pmod{379}$ (in other words, we determine the inverse of the integer 183 modulo 379). For this purpose, we use the Euclidean algorithm: $379 = 2 \cdot 183 + 13$, $183 = 14 \cdot 13 + 1$, whence
$1 = 183 - 14 \cdot 13 = 183 - 14 \cdot (379 - 2 \cdot 183) = 29 \cdot 183 - 14 \cdot 379.$
Therefore, we can take $t = 29$. Multiplying both sides of the congruence by $t = 29$ and rearranging, we get the equivalent congruence
$y^2 + 10y - 26 \equiv 0 \pmod{379}.$
Completing the square and substituting $z = y + 5$ leads to
$(y+5)^2 - 5^2 - 26 \equiv 0 \pmod{379}$, i.e. $z^2 \equiv 51 \pmod{379}.$
Invoking the law of quadratic reciprocity, we calculate the Legendre symbol $(51/379)$:
$\left(\tfrac{51}{379}\right) = \left(\tfrac{3}{379}\right)\left(\tfrac{17}{379}\right) = \left(\tfrac{379}{3}\right)(-1) \cdot \left(\tfrac{379}{17}\right)(+1) = \left(\tfrac{1}{3}\right)(-1)\left(\tfrac{5}{17}\right) = (-1)\left(\tfrac{17}{5}\right) = (-1)\left(\tfrac{2}{5}\right) = (-1)(-1) = 1,$
whence it follows that the congruence is solvable and, in particular, has two solutions modulo 379. The proposition of exercise 11.D.38 implies that the solutions are of the form $z \equiv \pm 51^{380/4} = \pm 51^{95} \pmod{379}$. Since $51^3 \equiv 1 \pmod{379}$, we get $51^{95} = (51^3)^{31} \cdot 51^2 \equiv 51^2 \equiv -52 \pmod{379}$. The solution is thus $z \equiv \pm 52 \pmod{379}$, which gives for the original unknown
$y \equiv 47 \pmod{379}$ or $y \equiv -57 \pmod{379}.$
Therefore, the given Diophantine equation is satisfied exactly by the pairs $(x, y)$ with $y \in \{47 + 379k;\ k \in \mathbb{Z}\} \cup \{-57 + 379k;\ k \in \mathbb{Z}\}$ and $x = \frac{1}{379}\left(210 - 314y - 183y^2\right)$; e.g. $(-1105, 47)$ or $(-1521, -57)$ (which are the only solutions with $|x| < 10^5$). □

As the example above shows, evaluating the Legendre symbol via the law of quadratic reciprocity alone, which applies only to primes, forces us to factor integers into primes – a very hard operation from the computational point of view. This can be mended by extending the definition of the Legendre symbol to the so-called Jacobi symbol with similar properties.

Definition. Let $a \in \mathbb{Z}$, $b \in \mathbb{N}$, $2 \nmid b$. Let $b$ factor as $b = p_1 p_2 \cdots p_k$ into (odd) primes (here, we exceptionally do not group equal primes into a power of the prime; rather, we write each one explicitly, e.g. $135 = 3 \cdot 3 \cdot 3 \cdot 5$). The symbol
$\left(\tfrac{a}{b}\right) = \left(\tfrac{a}{p_1}\right)\left(\tfrac{a}{p_2}\right)\cdots\left(\tfrac{a}{p_k}\right)$
is called the Jacobi symbol. We show in the practical column that the Jacobi symbol has similar properties as the Legendre one.
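The following sketch (ours, not from the text) computes the Jacobi symbol without any factoring, using only the reciprocity law stated below together with the supplementary laws – which is exactly the point of introducing the symbol:

    def jacobi(a, b):
        # Jacobi symbol (a/b) for odd b > 0, via the reciprocity law;
        # no factorization of a or b is needed (a sketch)
        assert b > 0 and b % 2 == 1
        a %= b
        result = 1
        while a != 0:
            while a % 2 == 0:            # pull out (2/b) = (-1)^((b^2-1)/8)
                a //= 2
                if b % 8 in (3, 5):
                    result = -result
            a, b = b, a                  # reciprocity: swap and fix the sign
            if a % 4 == 3 and b % 4 == 3:
                result = -result         # (-1)^((a-1)/2 * (b-1)/2)
            a %= b
        return result if b == 1 else 0

    print(jacobi(2, 15))    # 1, although x^2 = 2 (mod 15) is unsolvable (see below)
    print(jacobi(51, 379))  # 1, matching the computation in 11.E.3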
However, there is a substantial aberration – it is not generally true that $(a/b) = 1$ implies that the congruence $x^2 \equiv a \pmod b$ is solvable.

Example. $\left(\tfrac{2}{15}\right) = \left(\tfrac{2}{3}\right)\left(\tfrac{2}{5}\right) = (-1)(-1) = 1$, but the congruence $x^2 \equiv 2 \pmod{15}$ has no solution (the congruence $x^2 \equiv 2$ is solvable neither modulo 3 nor modulo 5).

Theorem (Law of quadratic reciprocity for the Jacobi symbol). Let $a, b \in \mathbb{N}$ be odd integers. Then,
(1) $\left(\tfrac{-1}{a}\right) = (-1)^{\frac{a-1}{2}}$,
(2) $\left(\tfrac{2}{a}\right) = (-1)^{\frac{a^2-1}{8}}$,
(3) $\left(\tfrac{a}{b}\right) = \left(\tfrac{b}{a}\right) \cdot (-1)^{\frac{a-1}{2}\cdot\frac{b-1}{2}}$.

Proof. The proof is simple, utilizing the law of quadratic reciprocity for the Legendre symbol. See exercise 11.D.35. □

There is another application of the law of quadratic reciprocity, in a certain sense in the opposite direction – we can consider the question: modulo which primes is a given integer $a$ a quadratic residue? (We are already able to answer this question for $a = 2$, for example.) The first step is to answer it for prime values of $a$, since the answer for composite values of $a$ depends on the factorization of the integer $a$.

Theorem. Let $q$ be an odd prime.
• If $q \equiv 1 \pmod 4$, then $q$ is a quadratic residue modulo exactly those primes $p$ which satisfy $p \equiv r \pmod q$, where $r$ is a quadratic residue modulo $q$.
• If $q \equiv 3 \pmod 4$, then $q$ is a quadratic residue modulo exactly those primes $p$ which satisfy $p \equiv \pm b^2 \pmod{4q}$, where $b$ is odd and coprime to $q$.

11.E.4. Solve the equation $2^x = 1 + 3^y$ in integers.

Solution. If $y < 0$, then $1 < 1 + 3^y < 2$, whence $0 < x < 1$, so $x$ could not be an integer. Therefore, $y \ge 0$, hence $2^x = 1 + 3^y \ge 2$ and $x \ge 1$. We show that we must also have $x \le 2$. If not (i.e., if $x \ge 3$), then we would have $1 + 3^y = 2^x \equiv 0 \pmod 8$, whence it follows that $3^y \equiv -1 \pmod 8$. However, this is impossible since the order of 3 modulo 8 equals 2, so the powers of three are congruent to 3 and 1 only. Now, it remains to examine the possibilities $x = 1$ and $x = 2$. For $x = 1$, we get $3^y = 2^1 - 1 = 1$, hence $y = 0$. If $x = 2$, we have $3^y = 2^2 - 1 = 3$, whence $y = 1$. Thus, the equation has two solutions: $x = 1$, $y = 0$; and $x = 2$, $y = 1$. □

F. Primality tests

11.F.1. Mersenne primes. The following problems are in deep connection with testing Mersenne numbers for primality. For any $q \in \mathbb{N}$, consider the integer $M_q = 2^q - 1$ and prove:
i) If $q$ is composite, then so is $M_q$.
ii) If $q$ is a prime, $q \equiv 3 \pmod 4$, then $2q + 1$ divides $M_q$ if and only if $2q + 1$ is a prime (hence it follows that if $q > 3$, $q \equiv 3 \pmod 4$, is a Sophie Germain prime [3], then $M_q$ is not a prime).
iii) If a prime $p$ divides $M_q$, then $p \equiv \pm 1 \pmod 8$ and $p \equiv 1 \pmod q$.

[3] See Wikipedia, Sophie Germain prime, http://en.wikipedia.org/wiki/Sophie_Germain_prime (as of July 28, 2013, 14:43 GMT).
Proof (of the theorem above). The first part follows directly from the law of quadratic reciprocity. Let us consider $q \equiv 3 \pmod 4$, i.e., $\left(\tfrac{q}{p}\right) = (-1)^{\frac{p-1}{2}}\left(\tfrac{p}{q}\right)$. First, let $p \equiv +b^2 \pmod{4q}$, where $b$ is odd, and hence $b^2 \equiv 1 \pmod 4$. Then, $p \equiv b^2 \equiv 1 \pmod 4$ and $p \equiv b^2 \pmod q$. Therefore, $(-1)^{\frac{p-1}{2}} = 1$ and $(p/q) = 1$, whence $(q/p) = 1$. Now, if $p \equiv -b^2 \pmod{4q}$, then we similarly get that $p \equiv -b^2 \equiv 3 \pmod 4$ and $p \equiv -b^2 \pmod q$. Therefore, $(-1)^{\frac{p-1}{2}} = -1$ and $(p/q) = -1$, whence we again get $(q/p) = 1$.

For the opposite direction, suppose that $(q/p) = 1$. There are two possibilities – either $(-1)^{\frac{p-1}{2}} = 1$ and $(p/q) = 1$, or $(-1)^{\frac{p-1}{2}} = -1$ and $(p/q) = -1$. In the former case, we have $p \equiv 1 \pmod 4$ and there is a $b$ such that $p \equiv b^2 \pmod q$. We can assume without loss of generality that $b$ is odd (if not, we could have taken $b + q$ instead). But then $b^2 \equiv 1 \equiv p \pmod 4$, and altogether $p \equiv b^2 \pmod{4q}$. In the latter case, we have $p \equiv 3 \pmod 4$ and $\left(\tfrac{-p}{q}\right) = \left(\tfrac{-1}{q}\right)\left(\tfrac{p}{q}\right) = (-1)(-1) = 1$. Therefore, there is a $b$ (which can again be chosen odd) such that $-p \equiv b^2 \pmod q$. We thus get $-b^2 \equiv 3 \equiv p \pmod 4$, and altogether $p \equiv -b^2 \pmod{4q}$. □

5. Diophantine equations

It was as early as the third century AD that Diophantus of Alexandria dealt with miscellaneous equations while admitting only integers as solutions. And no wonder – in many practical problems that lead to equations, non-integer solutions may fail to have a meaningful interpretation. As an example, we can consider the problem of how to pay an exact amount of money with coins of given values. In honor of Diophantus, equations for which we are interested in integer solutions only are called Diophantine equations. Another nice example of a Diophantine equation is Euler's relation $v - e + f = 2$ from graph theory, connecting the numbers of vertices, edges, and faces of a planar graph. Furthermore, if we restrict ourselves to regular graphs only, we arrive at the problem of the existence of the so-called Platonic solids, which can be smartly described just as solutions of this Diophantine equation – for more information, see 13.1.22.

Unfortunately, there is no universal method for solving this kind of equations. There is not even a method (algorithm) to decide whether a given polynomial Diophantine equation has a solution. This question is well known as Hilbert's tenth problem, and the proof of the algorithmic unsolvability of this problem was given by Yuri Matiyasevich in 1970. [7] However, there are cases in which we are able to find the solution of a Diophantine equation, or – at least – to reduce the problem to solving congruences, which is, besides the applications already mentioned, another motivation for studying them.

[7] See the elementary text M. Davis, Hilbert's Tenth Problem is Unsolvable, The American Mathematical Monthly 80(3): 233–269, 1973.

Solution (of 11.F.1).
i) If $n \mid q$, then it follows from exercise 11.A.7 that $2^n - 1 \mid 2^q - 1$, i.e. $M_n \mid M_q$. Therefore, if $q$ has a divisor $n$ with $1 < n < q$, then $M_n$ is a non-trivial divisor of $M_q$, so $M_q$ is composite.
ii) Let $n = 2q + 1$ be a divisor of $M_q$. We show that $n$ is a prime by invoking Lucas' theorem 11.6.10. Since $n - 1 = 2q$ has only two prime divisors, it suffices to find a primality witness for each of the primes 2 and $q$; we show that $a = -2$ works for both. We have
$(-2)^{\frac{n-1}{q}} = (-2)^2 = 4 \not\equiv 1 \pmod n$ and $(-2)^{\frac{n-1}{2}} = -2^q \equiv -1 \not\equiv 1 \pmod n,$
thanks to the assumption $n \mid M_q = 2^q - 1$. Further, since $(-2)^{n-1} - 1 = 2^{2q} - 1 = (2^q + 1)M_q \equiv 0 \pmod n$, it follows from Lucas' theorem that $n$ is a prime.
Now, let $p = 2q + 1$ be a prime; since $q \equiv 3 \pmod 4$, we have $p \equiv -1 \pmod 8$. Hence $(2/p) = 1$, and there exists an $m$ such that $2 \equiv m^2 \pmod p$. Therefore, $2^q \equiv 2^{\frac{p-1}{2}} \equiv m^{p-1} \equiv 1 \pmod p$, so $p \mid 2^q - 1 = M_q$.
iii) If $p \mid M_q = 2^q - 1$, then the order of 2 modulo $p$ must divide the prime $q$; since it is clearly not 1, it equals $q$. Therefore, $q \mid p - 1$, and since $p - 1$ is even and $q$ odd, there exists a $k \in \mathbb{Z}$ such that $2qk = p - 1$. Altogether, we get $\left(\tfrac{2}{p}\right) \equiv 2^{\frac{p-1}{2}} = 2^{qk} \equiv 1 \pmod p$, i.e., $p \equiv \pm 1 \pmod 8$. □
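Parts (ii) and (iii) are easy to confirm numerically; a small sketch of ours (with a naive trial-division primality check, fine for inputs of this size):

    def is_prime(n):
        # trial division -- sufficient for the small inputs used here
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    # part (ii): for a prime q = 3 (mod 4) with 2q+1 prime, 2q+1 | M_q
    for q in (11, 23, 83, 131):
        assert is_prime(q) and q % 4 == 3 and is_prime(2 * q + 1)
        assert pow(2, q, 2 * q + 1) == 1

    # a Sophie Germain prime q = 1 (mod 4) need not give a divisor:
    print(pow(2, 29, 59))   # 58, not 1, so 59 does not divide M_29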
11.F.2. For each of the following Mersenne numbers, determine whether it is prime or composite: $2^{11} - 1$, $2^{15} - 1$, $2^{23} - 1$, $2^{29} - 1$, and $2^{83} - 1$.

Solution. In the case of the integer $2^{15} - 1$, the exponent is composite; therefore, the whole integer is composite as well (we even know that it is divisible by $2^3 - 1$ and $2^5 - 1$). In the other cases, the exponent is always a prime. We can notice that these primes, namely $q = 11, 23, 29$, and $83$, are even Sophie Germain primes (i.e., $2q + 1$ is also a prime). It thus follows from part (ii) of the previous exercise that $23 \mid 2^{11} - 1$, $47 \mid 2^{23} - 1$, and $167 \mid 2^{83} - 1$. We cannot use this proposition for $q = 29$ since $29 \not\equiv 3 \pmod 4$; and indeed, $59 \nmid 2^{29} - 1$. However, it follows from part (iii) of the above exercise that if a prime $p$ divides $2^{29} - 1$, then it must satisfy $p \equiv \pm 1 \pmod 8$ and $p \equiv 1 \pmod{29}$, i.e., $p \equiv 1 \pmod{232}$ or $p \equiv 175 \pmod{232}$. If we are looking for a prime divisor of the integer $n = 2^{29} - 1 = 536\,870\,911$, it suffices to check the primes (of the above form) up to $\sqrt n \approx 23\,170$. There are 50 of them, so we are able to decide whether $n$ is a prime quite easily (even with paper and pencil). In this case, fortunately, $n$ is divisible already by the least candidate, 233. □

11.F.3. Show that the integer 341 is a Fermat pseudoprime to base 2, yet it is not an Euler–Jacobi pseudoprime to base 2. Further, prove that the integer 561 is an Euler–Jacobi pseudoprime to base 2, but not to base 3. Prove that, on the other hand, the integer 121 is an Euler–Jacobi pseudoprime to base 3, but not to base 2.

We now describe several such types of Diophantine equations.

Linear Diophantine equations. A linear Diophantine equation is an equation of the form
$a_1x_1 + a_2x_2 + \cdots + a_nx_n = b,$
where $x_1, \dots, x_n$ are unknowns and $a_1, \dots, a_n, b$ are given non-zero integers. We can see that the ability to solve Diophantine equations is sometimes important in "practical" life as well, as is proved by Bruce Willis and Samuel Jackson in Die Hard with a Vengeance, where they have to defuse a bomb using exactly 4 gallons of water, having only 3- and 5-gallon containers at their disposal. A mathematician would say that the gentlemen were to find a solution of the Diophantine equation $3x + 5y = 4$.

One can use congruences in order to solve these equations. Apparently, it is necessary for the equation to be solvable that the integer $d = (a_1, \dots, a_n)$ divides $b$. Provided it does, dividing both sides of the equation by $d$ leads to the equivalent equation $a_1'x_1 + a_2'x_2 + \cdots + a_n'x_n = b'$, where $a_i' = a_i/d$ for $i = 1, \dots, n$ and $b' = b/d$. Here, we have $d \cdot (a_1', \dots, a_n') = (da_1', \dots, da_n') = (a_1, \dots, a_n) = d$, so $(a_1', \dots, a_n') = 1$.

Further, we show that the equation $a_1x_1 + a_2x_2 + \cdots + a_nx_n = b$, where $a_1, \dots, a_n, b$ are integers such that $(a_1, \dots, a_n) = 1$, always has a solution in integers, and that all such solutions can be described in terms of $n - 1$ integer parameters. We prove this proposition by mathematical induction on $n$, the number of unknowns. The situation is trivial for $n = 1$ – there is a unique solution (which does not depend on any parameters).

Solution (of 11.F.3). The integer 341 is a Fermat pseudoprime to base 2 since $2^{10} \equiv 1 \pmod{341}$ implies $2^{340} \equiv 1 \pmod{341}$. It is not an Euler–Jacobi pseudoprime since $2^{170} \equiv 1 \pmod{341}$, but $\left(\tfrac{2}{341}\right) = -1$, which follows from the fact that $341 \equiv -3 \pmod 8$. For the integer 561, we have $2^{280} \equiv 1 \pmod{561}$ and $\left(\tfrac{2}{561}\right) = 1$, since $561 \equiv 1 \pmod 8$; therefore, it is an Euler–Jacobi pseudoprime to base 2. But not to base 3, since $3 \mid 561$. On the other hand, the integer 121 satisfies $3^5 \equiv 1 \pmod{121}$, hence $3^{60} \equiv 1 \pmod{121}$, and $\left(\tfrac{3}{121}\right) = 1$, but $2^{60} \equiv 89 \not\equiv 1 \pmod{121}$. □
11.F.4. Prove that the integers 2465, 2821, and 6601 are Carmichael numbers, i.e., denoting any of them by $n$, every integer $a$ coprime to $n$ satisfies $a^{n-1} \equiv 1 \pmod n$.

Solution. We have $2465 = 5 \cdot 17 \cdot 29$, $2821 = 7 \cdot 13 \cdot 31$, $6601 = 7 \cdot 23 \cdot 41$, and the proposition follows from Korselt's criterion 11.6.6, since all of the integers 4, 16, 28 divide $2464 = 2^5 \cdot 7 \cdot 11$, all of the integers 6, 12, 30 divide $2820 = 2^2 \cdot 3 \cdot 5 \cdot 47$, and 6, 22, 40 divide $6600 = 2^3 \cdot 3 \cdot 5^2 \cdot 11$. □

11.F.5. Prove that the integer 2047 is a strong pseudoprime to base 2, but not to base 3. Further, prove that the integer 1905 is an Euler–Jacobi pseudoprime to base 2, but not a strong pseudoprime to this base.

Solution. To verify that 2047 is a strong pseudoprime to base 2, we factor $2^{2046} - 1 = (2^{1023} - 1)(2^{1023} + 1)$. Since $2^{1023} \equiv 1 \pmod{2047}$, the statement is true. However, 2047 is not a strong pseudoprime to base 3, as $3^{1023} \equiv 1565 \not\equiv \pm 1 \pmod{2047}$. Notice that for the integer 2047, the strong pseudoprimality test is identical to the Euler one (this is because the integer 2046 is not divisible by four).

The integer 1905 is an Euler–Jacobi pseudoprime to base 2 since $2^{1904/2} \equiv 1 \pmod{1905}$ and the Jacobi symbol $(2/1905)$ is equal to 1. Since $1904 = 2^4 \cdot 7 \cdot 17$, the integer 1905 would be a strong pseudoprime to base 2 only if at least one of the following congruences held:
$2^{952} \equiv -1$, $2^{476} \equiv -1$, $2^{238} \equiv -1$, or $2^{119} \equiv \pm 1 \pmod{1905}.$
However, $2^{952} \equiv 2^{476} \equiv 1 \pmod{1905}$, $2^{238} \equiv 1144 \pmod{1905}$, and $2^{119} \equiv 128 \pmod{1905}$. Therefore, 1905 is not a strong pseudoprime to base 2. □

Further, let $n \ge 2$ and suppose that the statement holds for equations with $n - 1$ unknowns. Denoting $d = (a_1, \dots, a_{n-1})$, any $n$-tuple $x_1, \dots, x_n$ that satisfies the equation must also satisfy the congruence
$a_1x_1 + a_2x_2 + \cdots + a_nx_n \equiv b \pmod d.$
Since $d$ is the greatest common divisor of the integers $a_1, \dots, a_{n-1}$, this congruence is of the form $a_nx_n \equiv b \pmod d$, which (since $(d, a_n) = (a_1, \dots, a_n) = 1$) has a unique solution $x_n \equiv c \pmod d$, where $c$ is a suitable integer, i.e., $x_n = c + d t$, where $t \in \mathbb{Z}$ is arbitrary. Substituting into the original equation and rearranging leads to the equation
$a_1x_1 + \cdots + a_{n-1}x_{n-1} = b - a_nc - a_nd t$
with $n - 1$ unknowns and one parameter, $t$. The number $(b - a_nc)/d$ is an integer, so we can divide the equation by $d$. This leads to $a_1'x_1 + \cdots + a_{n-1}'x_{n-1} = b'$, where $a_i' = a_i/d$ for $i = 1, \dots, n-1$ and $b' = (b - a_nc)/d - a_nt$, satisfying
$(a_1', \dots, a_{n-1}') = (da_1', \dots, da_{n-1}') \cdot \tfrac1d = (a_1, \dots, a_{n-1}) \cdot \tfrac1d = 1.$
By the induction hypothesis, this equation has, for any $t \in \mathbb{Z}$, a solution which can be described in terms of $n - 2$ integer parameters (different from $t$), which together with the condition $x_n = c + dt$ gives what we wanted. □

11.5.1. Pythagorean equation. In this section, we deal with the enumeration of all right triangles with integer side lengths. This is a Diophantine equation where we only seldom use the methods described above; nevertheless, we look at it in detail. The task is to solve the equation
$x^2 + y^2 = z^2$
in integers.

Solution. Clearly, we can assume that $(x, y, z) = 1$ (otherwise, we simply cancel the integer $d = (x, y, z)$ from the whole triple). Further, we can show that the integers $x, y, z$ are then pairwise coprime: if there were a prime $p$ dividing two of them, then we can easily see that it would have to divide the third one as well, which it may not, according to our assumption.
Therefore, at most one of the integers $x, y$ is even. If neither of them were, we would get $z^2 \equiv x^2 + y^2 \equiv 1 + 1 \pmod 8$, which is impossible (see exercise 11.A.2). Altogether, we get that exactly one of the integers $x, y$ is even. However, since the roles of these integers in the equation are symmetric, we can, without loss of generality, select $x$ to be even and set $x = 2r$, $r \in \mathbb{N}$. Hence, we have $4r^2 = z^2 - y^2$, so
$r^2 = \frac{z + y}{2} \cdot \frac{z - y}{2}.$
Now, let us denote $u = \frac12(z + y)$, $v = \frac12(z - y)$ (then, the inverse substitution is $z = u + v$, $y = u - v$). Since $y$ is coprime to $z$, so is $u$ to $v$ (if there were a prime $p$ dividing both $u$ and $v$, then it would divide their sum as well as their difference, i.e., the integers $z$ and $y$). It follows from $r^2 = u \cdot v$ that there are coprime positive integers $a, b$ such that $u = a^2$, $v = b^2$. Moreover, since $u > v$, we must have $a > b$. Altogether, we get
$x = 2r = 2ab,\qquad y = u - v = a^2 - b^2,\qquad z = u + v = a^2 + b^2,$
which indeed satisfies the given equation for any coprime $a, b \in \mathbb{N}$ with $a > b$. Further solutions can be obtained by interchanging $x$ and $y$. Finally, relinquishing the condition $(x, y, z) = 1$, each solution yields infinitely many more if we multiply each of its components by a fixed positive integer $d$. □

11.F.6. Applying the Pocklington–Lehmer test 11.6.11, show that 1321 is a prime.

Solution. Let us set $N = 1321$; then $N - 1 = 1320 = 2^3 \cdot 3 \cdot 5 \cdot 11$. For the sake of simplicity, we assume that the trial division is executed only for primes below 10; then $F = 2^3 \cdot 3 \cdot 5 = 120$, $U = 11$, where $(F, U) = (120, 11) = 1$. In order to prove the primality of 1321 by the Pocklington–Lehmer test, we need to find a primality witness $a_p$ for each $p \in \{2, 3, 5\}$. Since $\left(2^{1320/3} - 1, 1321\right) = 1$ and $\left(2^{1320/5} - 1, 1321\right) = 1$, we can take $a_3 = a_5 = 2$. However, for $p = 2$, we have $\left(2^{1320/2} - 1, 1321\right) = 1321$, so we have to look for another primality witness. We can take $a_2 = 7$ since $\left(7^{1320/2} - 1, 1321\right) = 1$. In both cases, we have $2^{1320} \equiv 7^{1320} \equiv 1 \pmod{1321}$. The primality witnesses of the integer 1321 are thus $a_2 = 7$, $a_3 = a_5 = 2$. Instead, we could also have chosen the same number for all primes $p$ (e.g. 13, which is a primitive root modulo 1321). □

11.F.7. Factor the integer 221 into primes by Pollard's ρ-method. Use the function $f(x) = x^2 + 1$ with initial value $x_0 = 2$.

Solution. Let us set $x = y = 2$. The procedure from 11.6.14 gives (all values reduced modulo 221):

x := f(x) | y := f(f(y)) | (|x − y|, 221)
5 | 26 | 1
26 | 197 | 1
14 | 104 | 1
197 | 145 | 13

We have thus found a non-trivial divisor, so it is now easy to calculate $221 = 13 \cdot 17$. □

11.F.8. Find a non-trivial divisor of the integer 455459.

Solution. Consider the function $f(x) = x^2 + 1$ (we silently assume that this function behaves randomly modulo an unknown prime divisor $p$ of the integer $n$ and has the required properties). In the particular iterations, we compute $a \leftarrow f(a) \pmod n$, $b \leftarrow f(f(b)) \pmod n$, while evaluating $d = (a - b, n)$:

a | b | d
5 | 26 | 1
26 | 2871 | 1
677 | 179685 | 1
2871 | 155260 | 1
44380 | 416250 | 1
179685 | 43670 | 1
121634 | 164403 | 1
155260 | 247944 | 1
44567 | 68343 | 743

We have found the divisor 743, and we can now easily compute that $455459 = 613 \cdot 743$. □
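The parametrization derived in 11.5.1 above is easy to turn into a generator of all primitive Pythagorean triples; a small sketch of ours (the condition that $a, b$ have opposite parity is implicit in the derivation, $y = a^2 - b^2$ being odd):

    from math import gcd

    def primitive_triples(limit):
        # x = 2ab, y = a^2 - b^2, z = a^2 + b^2 with a > b coprime,
        # a - b odd, as in 11.5.1 (a sketch)
        for a in range(2, limit):
            for b in range(1, a):
                if gcd(a, b) == 1 and (a - b) % 2 == 1:
                    yield 2 * a * b, a * a - b * b, a * a + b * b

    for t in primitive_triples(5):
        print(t)    # (4, 3, 5), (12, 5, 13), (8, 15, 17), (24, 7, 25)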
11.5.2. Fermat's Last Theorem for n = 4. Thanks to the parametrization of Pythagorean triples, we are able to prove that the famous Fermat equation $x^n + y^n = z^n$ has no solution for $n = 4$ in positive integers. For this task, it is sufficient to prove that the equation $x^4 + y^4 = z^2$ has no solution in $\mathbb{N}$.

Solution. We use the so-called method of infinite descent, which was introduced by Pierre de Fermat. This method utilizes the fact that every non-empty set of natural numbers has a least element (in other words, $\mathbb{N}$ is a well-ordered set). Therefore, suppose that the set of solutions of the equation $x^4 + y^4 = z^2$ is non-empty, and let $(x, y, z)$ denote (any) solution with $z$ as small as possible. The integers $x, y, z$ are then pairwise coprime (a common prime divisor of two of them would allow us to cancel it and obtain a solution with smaller $z$). Since the equation can be written in the form $(x^2)^2 + (y^2)^2 = z^2$, it follows from the previous paragraph that there exist $r, s \in \mathbb{N}$ such that
$x^2 = 2rs,\qquad y^2 = r^2 - s^2,\qquad z = r^2 + s^2.$
Hence, $y^2 + s^2 = r^2$, where $(y, s) = 1$ (if there were a prime $p$ dividing both $y$ and $s$, then it would divide $x$ as well as $z$, which contradicts that they are coprime). Making the Pythagorean substitution once again, we get natural numbers $a, b$ with ($y$ being odd)
$y = a^2 - b^2,\qquad s = 2ab,\qquad r = a^2 + b^2.$
The inverse substitution leads to $x^2 = 2rs = 4ab(a^2 + b^2)$, and since $x$ is even, we get
$\left(\tfrac{x}{2}\right)^2 = ab(a^2 + b^2).$
The integers $a, b, a^2 + b^2$ are pairwise coprime (which can be derived easily from the fact that $y$ is coprime to $s$). Therefore, each of them is a square of a natural number: $a = c^2$, $b = d^2$, $a^2 + b^2 = e^2$, whence $c^4 + d^4 = e^2$, and since $e \le a^2 + b^2 = r < z$, we get a contradiction with the minimality of $z$. □

G. Encryption

11.G.1. RSA. We have overheard that the integers 29, 7, 21 were sent by means of RSA with public key (7, 33). Try to break the cipher and find the messages (integers) that were originally sent.

Solution. In order to find the private key $d$, we need to solve the congruence $7d \equiv 1 \pmod{\varphi(33)}$. Since the integer 33 is quite small, we can factor it and easily compute $\varphi(33) = (3-1)(11-1) = 20$. We are thus looking for a $d$ such that $7d \equiv 1 \pmod{20}$, which is satisfied by $d \equiv 3 \pmod{20}$. Since $29^3 \equiv (-4)^3 \equiv 2$, $7^3 \equiv 13$, and $21^3 \equiv 21 \pmod{33}$, the messages that were encrypted are 2, 13, and 21. □

Attacks against RSA. Using the so-called Fermat factorization method, we can try to factor $n = p \cdot q$ if we suspect that the difference between $p$ and $q$ is small. We have
$n = \left(\tfrac{p+q}{2}\right)^2 - \left(\tfrac{p-q}{2}\right)^2,$
where $s = (p - q)/2$ is small and $t = (p + q)/2$ is only a bit greater than $\sqrt n$. Therefore, it suffices to check [4]
$t = \lceil\sqrt n\rceil,\ t = \lceil\sqrt n\rceil + 1,\ t = \lceil\sqrt n\rceil + 2,\ \dots,$
until $t^2 - n$ is a square (this condition can, of course, be checked efficiently).

[4] The symbol $\lceil x\rceil$ denotes the ceiling of a real number $x$, i.e., the integer which satisfies $\lceil x\rceil - 1 < x \le \lceil x\rceil$.

11.G.2. Now, we try to factor the integer $n = 23104222007$ this way. (We anticipate that it is a product of two close primes.)

Solution. We compute $\sqrt n \approx 152000.731$ and check the candidates for $t$: for $t = 152001$, we have $\sqrt{t^2 - n} \approx 286.345$; for $t = 152002$, it is $\sqrt{t^2 - n} \approx 621.287$; for $t = 152003$, $\sqrt{t^2 - n} \approx 830.664$. Finally, for $t = 152004$, we get $\sqrt{t^2 - n} = 997 \in \mathbb{Z}$. Therefore, $s = 997$ and we can easily calculate the prime divisors of $n$: $p = t + s = 153001$, $q = t - s = 151007$. □

11.G.3. The RSA modulus $n = p \cdot q$ can also be easily factored if the integer $\varphi(n)$ is known (compromised). Then, $\varphi(n) = (p-1)(q-1) = pq - (p+q) + 1$, whence $p + q = n + 1 - \varphi(n)$. We are thus to find two integers whose sum and product are known, which can be done using Viète's formulas relating the roots and the coefficients of a polynomial: $p$ and $q$ are the roots of the polynomial
$x^2 - (n + 1 - \varphi(n))x + n.$
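Both attacks are easy to express in a few lines; a sketch of ours (the function names are not from the text):

    from math import isqrt

    def fermat_factor(n):
        # Fermat factorization (11.G.2): works well when p and q are close
        t = isqrt(n)
        if t * t < n:
            t += 1                      # t = ceil(sqrt(n))
        while True:
            s2 = t * t - n
            s = isqrt(s2)
            if s * s == s2:
                return t + s, t - s
            t += 1

    def factor_from_phi(n, phi):
        # recover p, q from n = p*q and phi(n) via Viete's formulas (11.G.3)
        s = n + 1 - phi                 # s = p + q
        d = isqrt(s * s - 4 * n)        # discriminant of x^2 - s*x + n
        assert d * d == s * s - 4 * n
        return (s + d) // 2, (s - d) // 2

    print(fermat_factor(23104222007))                 # (153001, 151007)
    print(factor_from_phi(23104222007, 23103918000))  # (153001, 151007)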
6. Applications – calculation with large integers, cryptography

11.6.1. Computational aspects of number theory. In many practical problems which utilize the results of number theory, it is necessary to execute one or more of the following computations fast:
• common arithmetic operations (sum, product, division with remainder) on integers;
• to determine the remainder of a (natural) $n$-th power of an integer $a$ when divided by a given $m$;
• to determine the multiplicative inverse of an integer $a$ modulo $m \in \mathbb{N}$;
• to determine the greatest common divisor of two integers (and the coefficients of the corresponding Bézout's identity);
• to decide whether a given integer is a prime or a composite number;
• to factor a given integer into primes.

Basic arithmetic operations are usually executed on large integers in the same way as we were taught at primary school, i.e., we add in linear time and multiply and divide with remainder in quadratic time. Multiplication, which is a basis for many other operations, can be performed asymptotically more efficiently (there exist algorithms of the divide-and-conquer type) – for instance, the Karatsuba algorithm (1960), running in time $\Theta\left(n^{\log_2 3}\right)$, or the Schönhage–Strassen algorithm (1971), which runs in $\Theta(n \log n \log\log n)$ and uses Fast Fourier Transforms – see also 7.2.5. Although the latter is asymptotically much better, in practice it becomes advantageous only for integers of at least about ten thousand digits (it is thus used, for example, when looking for large primes in the GIMPS project).

11.G.4. Consider (as above) the integer $n = 23104222007$ and factor it with the additional knowledge that $\varphi(n) = 23103918000$.

Solution. Following the procedure described above, we get the quadratic equation $x^2 - 304008x + 23104222007 = 0$, whose solutions are
$p = \tfrac12\left(304008 + \sqrt{304008^2 - 4 \cdot 23104222007}\right) = 153001,$
$q = \tfrac12\left(304008 - \sqrt{304008^2 - 4 \cdot 23104222007}\right) = 151007.$ □

11.G.5. ElGamal. Martin and John want to communicate using the ElGamal encryption (designed by the Egyptian mathematician Taher ElGamal, who was inspired by the Diffie–Hellman key exchange protocol). Martin chose the prime 41 and its primitive root $g = 11$ as well as the integer 10. Then, he published the triple $(41, 11, A)$, where $A \equiv 11^{10} \pmod{41}$; he kept the integer 10 to himself – it is his private key. John used a public channel to send the pair (22, 6) to him. What is the original message John sent?

Solution. For completeness, we first compute the whole public key: $A = 9$ (however, this integer was only needed by John when he encrypted the message for Martin; it is no longer necessary for decryption). The message $M$ can be obtained as $M \equiv 6/22^{10} \pmod{41}$. First, we compute
$22^{10} = \left(22^2\right)^5 \equiv (-8)^5 = (-8) \cdot (-8)^2 \cdot (-8)^2 \equiv (-8) \cdot 23 \cdot 23 \equiv -9 \pmod{41}$
and $(-9)^{-1} \equiv 9 \pmod{41}$. Therefore, the decrypted message is the integer $M \equiv 9 \cdot 6 \equiv 13 \pmod{41}$. □

11.G.6. Rabin cryptosystem. Alice has chosen $p = 23$, $q = 31$ as her private key in the Rabin cryptosystem; the public key is then $n = pq = 713$. Encrypt the message $M = 327$ for Alice and show how Alice will decrypt it.

Solution. We compute $C = 327^2 \equiv 692 \pmod{713}$ and send this cipher to Alice. Following the decryption procedure, we determine
$r \equiv C^{(p+1)/4} \equiv 692^{\frac{23+1}{4}} \equiv 18 \pmod{23},\qquad s \equiv C^{(q+1)/4} \equiv 692^{\frac{31+1}{4}} \equiv 14 \pmod{31},$
and further the coefficients $a, b$ in Bézout's identity $23a + 31b = 1$ (using the Euclidean algorithm). We get $a = -4$, $b = 3$; the candidates for the original message are thus the integers $\mp 4 \cdot 23 \cdot 14 \pm 3 \cdot 31 \cdot 18 \pmod{713}$. We thus know that one of the integers 386, 603, 110, 327 is the message that was sent. □
11.G.7. Show how to encrypt and decrypt the message $M = 321$ in the Rabin cryptosystem with $n = 437$.

Solution. The encrypted text is obtained as the square modulo $n$: $C = 321^2 \equiv (-116)^2 = 13456 \equiv 346 \pmod{437}$. When decrypting, we use the factorization $n = 437 = 19 \cdot 23$ (its knowledge is the private key of the receiver), and we compute $r = 346^{\frac{19+1}{4}} = 346^5 \equiv 17 \equiv -2 \pmod{19}$ and $s = 346^{\frac{23+1}{4}} = 346^6 \equiv 1 \pmod{23}$. Applying the Euclidean algorithm to the coprime pair (19, 23), we determine the coefficients in Bézout's identity: $19 \cdot (-6) + 23 \cdot 5 = 1$. The message is then one of the integers $\pm 6 \cdot 19 \cdot 1 \pm 5 \cdot 23 \cdot (-2) \pmod{437}$, i.e., $M = \pm 116$ or $M = \pm 344$. Indeed, $M = -116 \equiv 321 \pmod{437}$. □

11.6.2. Greatest common divisor and modular inverses. As we have already shown, the computation of the solution of the congruence $a \cdot x \equiv 1 \pmod m$ in the variable $x$ can be easily reduced (thanks to Bézout's identity) to the computation of the greatest common divisor of the integers $a$ and $m$ and to looking for the coefficients $k, l$ in Bézout's identity $k a + l m = 1$ (the integer $k$ is then the wanted inverse of $a$ modulo $m$).

    def extended_gcd(a, m):
        # returns (k, l) such that k*a + l*m = gcd(a, m)
        if m == 0:
            return (1, 0)
        q, r = divmod(a, m)
        k, l = extended_gcd(m, r)
        return (l, k - q * l)

A thorough analysis [8] shows that the problem of computing the greatest common divisor has quadratic time complexity.

[8] See, for example, D. Knuth, Art of Computer Programming, Volume 2: Seminumerical Algorithms, Addison-Wesley 1997, or Wikipedia, Euclidean algorithm, http://en.wikipedia.org/wiki/Euclidean_algorithm (as of July 29, 2017).

11.6.3. Modular exponentiation. The algorithm for modular exponentiation is based on the idea that when computing, for instance, $2^{64} \bmod 1000$, one need not calculate $2^{64}$ and only then divide it with remainder by 1000; it is better to multiply the 2's gradually and reduce the temporary result modulo 1000 whenever it exceeds this value. More importantly, there is no need to perform such a huge number of multiplications: in this case, 63 naive multiplications can be replaced with six squarings, as $2^{64} = (((((2^2)^2)^2)^2)^2)^2$.

    def modular_pow(base, exp, mod):
        result = 1
        while exp > 0:
            if exp % 2 == 1:
                result = (result * base) % mod
            exp >>= 1
            base = (base * base) % mod
        return result

The algorithm squares the base modulo $n$ for every binary digit of the exponent (which can be done in quadratic time in the worst case), and it performs a multiplication for every one in the binary representation of the exponent. Altogether, we are able to do modular exponentiation in cubic time in the worst case. We can also notice that the complexity depends a good deal on the binary form of the exponent.

Example. Let us compute $2^{560} \bmod 561$. Since $560 = (1000110000)_2$, the mentioned algorithm gives:

exp | base | result | last bit of exp
560 | 2 | 1 | 0
280 | 4 | 1 | 0
140 | 16 | 1 | 0
70 | 256 | 1 | 0
35 | 460 | 1 | 1
17 | 103 | 460 | 1
8 | 511 | 256 | 0
4 | 256 | 256 | 0
2 | 460 | 256 | 0
1 | 103 | 256 | 1
0 | 511 | 1 | 0

Therefore, $2^{560} \equiv 1 \pmod{561}$.
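A quick use of the two routines above (a sketch of ours; note that Python's built-in pow covers both tasks):

    k, l = extended_gcd(7, 20)       # 7k + 20l = 1, so k is 7^(-1) mod 20
    print(k % 20)                    # 3 -- the private key of exercise 11.G.1
    print(modular_pow(2, 560, 561))  # 1, as in the example above
    print(pow(2, 560, 561))          # the built-in three-argument pow agrees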
11.6.4. Primality testing. Although we have the Fundamental theorem of arithmetic, which guarantees that every natural number can be uniquely factored into a product of primes, this operation is very hard from the computational point of view. In practice, it is usually done in the following steps:
(1) finding all divisors below a given threshold (by trying all primes up to the threshold, which is usually somewhere around $10^6$);
(2) testing the remaining factor for compositeness (deciding whether some necessary condition for primality holds);
(a) if the compositeness test did not find the integer to be composite, i.e., it is likely to be a prime, then we test it for primality to verify that it is indeed a prime;
(b) if the compositeness test proved that the integer is composite, then we try to find a non-trivial divisor.
The mentioned steps are executed in this order because the corresponding algorithms are gradually (and strongly) increasing in time complexity. In 2002, Agrawal, Kayal, and Saxena published an algorithm for primality testing in polynomial time, but it is still more efficient to use the above procedure in practice.

11.6.5. Compositeness tests – how to recognize composite numbers with certainty? The so-called compositeness tests check some necessary condition for primality. The easiest of such conditions is Fermat's little theorem.

Proposition (Fermat's test). Let $N$ be a natural number. If there is an $a \not\equiv 0 \pmod N$ such that $a^{N-1} \not\equiv 1 \pmod N$, then $N$ is not a prime.

Unfortunately, for a composite $N$, it still may not be easy to find an integer $a$ which reveals the compositeness of $N$. There are even exceptional integers $N$ for which the only integers $a$ with the mentioned property are those not coprime to $N$; finding such an $a$ is thus equivalent to finding a divisor, and hence to factoring $N$ into primes. These ugly (or extremely nice?) composite numbers $N$, for which every integer $a$ coprime to $N$ satisfies $a^{N-1} \equiv 1 \pmod N$, are called Carmichael numbers. The least of them [9] is $561 = 3 \cdot 11 \cdot 17$, and it was only in 1992 that it was proved [10] that there are even infinitely many of them.

[9] The first seven Carmichael numbers were apparently first discovered by the Czech priest and mathematician Václav Šimerka (1819–1887), who occupied himself with them much earlier than the American mathematician R. D. Carmichael (1879–1967), whose name they bear.
[10] W. R. Alford, A. Granville, C. Pomerance, There are Infinitely Many Carmichael Numbers, Annals of Mathematics, Vol. 139, No. 3 (1994), pp. 703–722.

Example. We prove that 561 is a Carmichael number, i.e., that every $a \in \mathbb{N}$ coprime to $3 \cdot 11 \cdot 17$ satisfies $a^{560} \equiv 1 \pmod{561}$. Thanks to the properties of congruences, it suffices to prove this congruence modulo 3, 11, and 17. However, this follows straight from Fermat's little theorem, since such an integer $a$ satisfies
$a^2 \equiv 1 \pmod 3,\qquad a^{10} \equiv 1 \pmod{11},\qquad a^{16} \equiv 1 \pmod{17},$
where all of 2, 10, and 16 divide 560; hence $a^{560} \equiv 1$ modulo 3, 11 as well as 17 for all integers $a$ coprime to 561 (see also Korselt's criterion mentioned below).
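A direct numerical confirmation (a brute-force sketch of ours, not part of the text):

    from math import gcd

    n = 561
    assert all(pow(a, n - 1, n) == 1
               for a in range(1, n) if gcd(a, n) == 1)  # 561 is Carmichael

    print(pow(2, 340, 341))  # 1: 341 is a Fermat pseudoprime to base 2 ...
    print(pow(3, 340, 341))  # ... but base 3 reveals it (result is not 1)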
11.6.6. Proposition (Korselt's criterion). A composite number $n$ is a Carmichael number if and only if both of the following conditions hold:
• $n$ is square-free (i.e., divisible by the square of no prime),
• $p - 1 \mid n - 1$ holds for all primes $p$ which divide $n$.

Proof. "⇐" We show that if $n$ satisfies the above two conditions and is composite, then every $a \in \mathbb{Z}$ coprime to $n$ satisfies $a^{n-1} \equiv 1 \pmod n$. Let $n$ factor into a product of distinct odd primes: $n = p_1 \cdots p_k$, where $p_i - 1 \mid n - 1$ for all $i \in \{1, \dots, k\}$. Since $(a, p_i) = 1$, we get from Fermat's little theorem that $a^{p_i - 1} \equiv 1 \pmod{p_i}$, whence (thanks to the condition $p_i - 1 \mid n - 1$) it also follows that $a^{n-1} \equiv 1 \pmod{p_i}$. This is true for all indices $i$, hence $a^{n-1} \equiv 1 \pmod n$, so $n$ is indeed a Carmichael number.

"⇒" A Carmichael number $n$ cannot be even, since then we would get for $a = -1$ that $a^{n-1} \equiv -1 \pmod n$, which would (since $a^{n-1} \equiv 1 \pmod n$) mean that $n$ equals 2 (and thus is not composite). Therefore, let $n$ factor as $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, where $p_i$ are distinct odd primes and $\alpha_i \in \mathbb{N}$. Thanks to theorem 11.3.16, we can choose for every $i$ a primitive root $g_i$ modulo $p_i^{\alpha_i}$, and the Chinese remainder theorem then yields an integer $a$ which satisfies $a \equiv g_i \pmod{p_i^{\alpha_i}}$ for all $i$ and which is apparently coprime to $n$. Further, we know from the assumption that $a^{n-1} \equiv 1 \pmod n$, so this holds modulo $p_i^{\alpha_i}$, and thus $g_i^{n-1} \equiv 1 \pmod{p_i^{\alpha_i}}$ as well. Since $g_i$ is a primitive root modulo $p_i^{\alpha_i}$, the integer $n - 1$ must be a multiple of its order, i.e. of $\varphi(p_i^{\alpha_i}) = p_i^{\alpha_i - 1}(p_i - 1)$. At the same time, we have $(p_i, n - 1) = 1$ (since $p_i \mid n$), so necessarily $\alpha_i = 1$ and $p_i - 1 \mid n - 1$. □

Fermat's primality test can be slightly improved to Euler's test, or even further with the help of the Jacobi symbol, yet this still does not mend the presented problem completely.

Proposition (Euler's test). Let $N$ be an odd natural number. If there is an integer $a \not\equiv 0 \pmod N$ such that $a^{\frac{N-1}{2}} \not\equiv \pm 1 \pmod N$, then $N$ is not a prime.

Proof. This follows directly from Fermat's test and the fact that, for $N$ odd, we have
$a^{N-1} - 1 = \left(a^{\frac{N-1}{2}} - 1\right)\left(a^{\frac{N-1}{2}} + 1\right).$ □

Proposition (Euler–Jacobi test). Let $N$ be an odd natural number. If there is an integer $a \not\equiv 0 \pmod N$ such that $a^{\frac{N-1}{2}} \not\equiv \left(\tfrac{a}{N}\right) \pmod N$, then $N$ is not a prime.

Proof. This follows immediately from lemma 11.4.12. □

Example. Let us consider $N = 561 = 3 \cdot 11 \cdot 17$ as before and let $a = 5$. Then, we have $5^{280} \equiv 1 \pmod 3$ and $5^{280} \equiv 1 \pmod{11}$, but $5^{280} \equiv -1 \pmod{17}$, so surely $5^{280} \not\equiv \pm 1 \pmod{561}$. Here, it did not hold that $a^{(N-1)/2} \equiv \pm 1 \pmod N$, so we did not even need to check the value of the Jacobi symbol $(5/561)$. However, the Euler–Jacobi test can often reveal a composite number even in the case when this power is equal to $\pm 1$.

Example. Euler's test cannot detect the compositeness of the integer $N = 1729 = 7 \cdot 13 \cdot 19$, since the integer $\frac{N-1}{2} = 864 = 2^5 \cdot 3^3$ is divisible by 6, 12, and 18, and so it follows from Fermat's little theorem that $a^{(N-1)/2} \equiv 1 \pmod N$ holds for all integers $a$ coprime to $N$. On the other hand, already for $a = 11$ we get $(11/1729) = -1$, so the Euler–Jacobi test is able to recognize the integer 1729 as composite.

Let us notice that the value of the Legendre or Jacobi symbol $(a/n)$ can be computed very efficiently thanks to the law of quadratic reciprocity [11], namely in time $O((\log a)(\log n))$.

[11] See H. Cohen, A Course in Computational Algebraic Number Theory, Springer, 1993.

Pseudoprimes

A composite number $n$ is called a pseudoprime if it passes the corresponding compositeness test without being revealed. We thus have
(1) Fermat pseudoprimes to base $a$,
(2) Euler (or Euler–Jacobi) pseudoprimes to base $a$,
(3) strong pseudoprimes to base $a$, which are composite numbers passing the compositeness test described below.

The following test is simple, yet (as shown in theorem 11.6.8) very efficient. It is a further refinement of Fermat's test, which we introduced at the beginning.

11.6.7. Theorem. Let $p$ be an odd prime. Let us write $p - 1 = 2^t \cdot q$, where $t$ is a natural number and $q$ is odd.
Then, every integer $a$ which is not a multiple of $p$ satisfies
$a^q \equiv 1 \pmod p$, or there exists an $e \in \{0, 1, \dots, t-1\}$ such that $a^{2^e q} \equiv -1 \pmod p.$

Proof. It follows from Fermat's little theorem that
$p \mid a^{p-1} - 1 = \left(a^{\frac{p-1}{2}} - 1\right)\left(a^{\frac{p-1}{2}} + 1\right) = \left(a^{\frac{p-1}{4}} - 1\right)\left(a^{\frac{p-1}{4}} + 1\right)\left(a^{\frac{p-1}{2}} + 1\right) = \cdots = \left(a^q - 1\right)\left(a^q + 1\right)\left(a^{2q} + 1\right)\cdots\left(a^{2^{t-1}q} + 1\right),$
whence the statement follows easily, since $p$ is a prime. □

Proposition (Miller–Rabin compositeness test). Let $N, t, q$ be natural numbers such that $N$ is odd and $N - 1 = 2^t \cdot q$, $2 \nmid q$. If there is an integer $a \not\equiv 0 \pmod N$ such that
$a^q \not\equiv 1 \pmod N$ and $a^{2^e q} \not\equiv -1 \pmod N$ for all $e \in \{0, 1, \dots, t-1\},$
then $N$ is not a prime.

Proof. The correctness of the test follows directly from the previous theorem. □

Miscellaneous types of pseudoprimes

In practice, this easy test rapidly increases the ability to recognize composite numbers. The least strong pseudoprime to base 2 is 2047 (while the least Fermat pseudoprime to base 2 was already 341), and considering the bases 2, 3, and 5, the least strong pseudoprime is 25326001. In other words, if we are to test integers below $2 \cdot 10^7$, it is sufficient to execute this compositeness test for the bases 2, 3, and 5; if the tested integer is not revealed to be composite, then it is surely a prime. On the other hand, it has been proved that no finite set of bases is sufficient for testing all natural numbers.

The Miller–Rabin test is a practical application of the previous statement, and we are even able to bound the probability of failure thanks to the following theorem, which we present without proof. [12]

[12] Schoof, René (2004), "Four primality testing algorithms", Algorithmic Number Theory: Lattices, Number Fields, Curves and Cryptography, Cambridge University Press, ISBN 978-0-521-80854-5.

11.6.8. Theorem. Let $N > 10$ be an odd composite number. Let us write $N - 1 = 2^t \cdot q$, where $t$ is a natural number and $q$ is odd. Then, at most a quarter of the integers from the set $\{a \in \mathbb{Z};\ 1 \le a < N,\ (a, N) = 1\}$ satisfy the following condition: $a^q \equiv 1 \pmod N$, or there is an $e \in \{0, 1, \dots, t-1\}$ satisfying $a^{2^e q} \equiv -1 \pmod N$.

In practical implementations, one usually tests about 20 random bases (or the least prime bases). In this case, the above theorem states that the probability of failing to reveal a composite number is less than $2^{-40}$. The time complexity of the algorithm is the same as in the case of modular exponentiation, i.e. cubic in the worst case. However, we should realize that the test is non-deterministic, and the reliability of its deterministic version depends on the so-called generalized Riemann hypothesis (GRH). [13]

[13] Wikipedia, Riemann hypothesis, http://en.wikipedia.org/wiki/Riemann_hypothesis (as of July 29, 2017).

11.6.9. Primality tests. Primality tests are usually applied when the used compositeness test claims that the examined integer is likely to be a prime, or they are executed straightaway for special types of integers. Let us first give a list of the best known tests, which includes historical tests as well as very modern ones.
(1) AKS – a general polynomial-time primality test discovered by the Indian mathematicians Agrawal, Kayal, and Saxena in 2002.
(2) Pocklington–Lehmer test – a primality test of subexponential complexity.
(3) Lucas–Lehmer test – a primality test for Mersenne numbers.
(4) Pépin's test – a primality test for Fermat numbers, from 1877.
(5) ECPP – a primality test based on the so-called elliptic curves.
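Before turning to these, here is the Miller–Rabin compositeness test from the proposition above as runnable code (a minimal sketch of ours, not a hardened implementation):

    import random

    def miller_rabin(N, rounds=20):
        # False: N is certainly composite; True: N passed all rounds,
        # so N is prime with probability of error below 4^(-rounds)
        if N < 2:
            return False
        if N % 2 == 0:
            return N == 2
        t, q = 0, N - 1
        while q % 2 == 0:             # write N - 1 = 2^t * q with q odd
            t += 1
            q //= 2
        for _ in range(rounds):
            a = random.randrange(2, N) if N > 3 else 2
            x = pow(a, q, N)
            if x == 1 or x == N - 1:
                continue
            for _ in range(t - 1):
                x = pow(x, 2, N)
                if x == N - 1:
                    break
            else:
                return False          # a witnesses the compositeness of N
        return True

    print(miller_rabin(561))          # False (almost surely): Carmichael
                                      # numbers are caught by this test
    print(miller_rabin(2**31 - 1))    # True: the Mersenne prime M_31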
Now, we introduce a standard primality test for Mersenne numbers.

Proposition (Lucas–Lehmer test). Let $q \ne 2$ be a prime, and let the sequence $(s_n)_{n=0}^{\infty}$ be defined recursively by $s_0 = 4$, $s_{n+1} = s_n^2 - 2$. Then, the integer $M_q = 2^q - 1$ is a prime if and only if $M_q$ divides $s_{q-2}$.

Proof. We will be working in the ring $R = \mathbb{Z}[\sqrt3] = \{a + b\sqrt3;\ a, b \in \mathbb{Z}\}$, where division with remainder behaves similarly as in the integers (see also 12.3.5). Let us set $\alpha = 2 + \sqrt3$, $\beta = 2 - \sqrt3$ and note that $\alpha + \beta = 4$, $\alpha \cdot \beta = 1$. First, we prove by induction that for all $n \in \mathbb{N}_0$,
(1) $s_n = \alpha^{2^n} + \beta^{2^n} = \beta^{2^n}\left(1 + \alpha^{2^{n+1}}\right).$
The statement is true for $n = 0$ since $s_0 = 4 = \alpha + \beta$. Now, suppose that it is true for $n - 1$; then, by the induction hypothesis,
$s_n = s_{n-1}^2 - 2 = \left(\alpha^{2^{n-1}} + \beta^{2^{n-1}}\right)^2 - 2 = \alpha^{2^n} + \beta^{2^n},$
since $\alpha^{2^{n-1}}\beta^{2^{n-1}} = (\alpha\beta)^{2^{n-1}} = 1$.

Further, since $M_q \equiv -1 \pmod 8$, we have $(2/M_q) = 1$, and it follows from the law of quadratic reciprocity that
$\left(\tfrac{3}{M_q}\right) = -\left(\tfrac{M_q}{3}\right) = -\left(\tfrac{2^q - 1}{3}\right) = -\left(\tfrac{1}{3}\right) = -1,$
since we have $2^q - 1 \equiv 1 \pmod 3$ for $q$ odd. Both of these expressions are valid even if $M_q$ is not a prime (in that case, they are Jacobi symbols). Let us note that in the last part of the proof, we use the extension of the congruence relation to the elements of the domain $\mathbb{Z}[\sqrt3]$: just like in the case of the integers, we write, for $\gamma, \delta \in \mathbb{Z}[\sqrt3]$, that $\gamma \equiv \delta \pmod p$ if $p \mid \gamma - \delta$. An analogue of proposition (ii) from 11.C.6 holds as well – if $p$ is a prime, then $(\gamma + \delta)^p \equiv \gamma^p + \delta^p \pmod p$ (the proof is identical to the one for the integers).

"⇒" Suppose that $M_q$ is a prime. We prove that $\alpha^{2^{q-1}} \equiv -1 \pmod{M_q}$, which, thanks to (1), implies $M_q \mid s_{q-2}$. Since $2^{(M_q-1)/2} \equiv (2/M_q) = 1 \pmod{M_q}$, there is a $y \in \mathbb{Z}$ such that $2y^2 \equiv 1 \pmod{M_q}$. We have $\left(y(1 + \sqrt3)\right)^2 = y^2(4 + 2\sqrt3) \equiv \alpha \pmod{M_q}$, whence, invoking Fermat's theorem and the relation $2^{q-1} = \frac{M_q + 1}{2}$, we get
$\alpha^{2^{q-1}} \equiv \left(y(1 + \sqrt3)\right)^{M_q + 1} \equiv y^2 \cdot y^{M_q - 1}\left(1 + \sqrt3\right)\left(1 + \sqrt3\right)^{M_q} \equiv y^2\left(1 + \sqrt3\right)\left(1 - \sqrt3\right) = -2y^2 \equiv -1 \pmod{M_q}.$
When deriving this, we made use of the fact that 3 is a quadratic nonresidue modulo $M_q$, so
$\left(1 + \sqrt3\right)^{M_q} \equiv 1 + \left(\sqrt3\right)^{M_q} = 1 + 3^{(M_q - 1)/2}\sqrt3 \equiv 1 - \sqrt3 \pmod{M_q}.$

"⇐" For the other direction, let $M_q \mid s_{q-2}$. Then, $M_q \mid s_{q-2} \cdot \alpha^{2^{q-2}} = 1 + \alpha^{2^{q-1}}$. If $p \ne 2, 3$ is a prime divisor of $M_q$, then $\alpha^{2^{q-1}} \equiv -1 \pmod p$ as well, and $\alpha^{2^q} \equiv 1 \pmod p$. Hence, $2^q$ is the order of $\alpha$ in the multiplicative group $T_p = \{a + b\sqrt3;\ 0 \le a, b < p\} \setminus \{0\}$. If we had $(3/p) = 1$, then we would get
$\alpha^{p-1} = \beta \cdot \alpha^p \equiv \beta\left(2^p + \left(\sqrt3\right)^p\right) \equiv \beta\left(2 + \sqrt3 \cdot 3^{(p-1)/2}\right) \equiv \beta\left(2 + \sqrt3\right) = 1 \pmod p,$
whence $p - 1$ would be a multiple of the order of $\alpha$, i.e. of $2^q$. However, this would mean that $p > p - 1 \ge 2^q > 2^q - 1 = M_q$, which contradicts the fact that $p$ is a divisor of $M_q$. Therefore, we have $(3/p) = -1$ and
$\alpha^{p+1} \equiv \left(2 + \sqrt3\right)\left(2 + \sqrt3\right)^p \equiv \left(2 + \sqrt3\right)\left(2 - \sqrt3\right) \equiv 1 \pmod p.$
The order of $\alpha$ modulo $p$ is $2^q$, hence $2^q \mid p + 1$, and especially $p \ge 2^q - 1 = M_q$. At the same time, $p$ is a prime divisor of $M_q$; therefore, $M_q = p$ is a prime. □

Unlike the proof, the implementation of this algorithm is very easy.

Algorithm (Lucas–Lehmer primality test):

    def LL_is_prime(q):
        s, M = 4, 2**q - 1
        for _ in range(q - 2):        # repeat q - 2 times
            s = (s * s - 2) % M
        return s == 0                 # True = PRIME, False = COMPOSITE
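For instance (a quick check of ours):

    for q in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31):
        print(q, LL_is_prime(q))
    # M_11 = 2047 = 23 * 89 and M_23, M_29 come out composite,
    # in accord with 11.F.2; the remaining exponents give Mersenne primes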
The time complexity of this test is asymptotically the same as in the case of the Miller–Rabin test; it is, however, more efficient in concrete instances.

Fermat numbers are integers of the form $F_n = 2^{2^n} + 1$. Pierre de Fermat conjectured in the 17th century that all integers of this form are primes (apparently driven by the effort to generalize the observation for $F_0 = 3$, $F_1 = 5$, $F_2 = 17$, $F_3 = 257$, and $F_4 = 65537$). However, in the 18th century, Leonhard Euler found out that $F_5 = 641 \cdot 6700417$, and we have not been able to discover any other Fermat primes so far. Since the size of Fermat numbers increases rapidly, computing with them takes many resources (so the following test is not used much). Nowadays, the least Fermat number which has not been tested is $F_{33}$, which has 2 585 827 973 digits and is thus much greater than the largest discovered prime.

Proposition (Pépin's test). A necessary and sufficient condition for the $n$-th Fermat number $F_n$ to be a prime is
$3^{\frac{F_n - 1}{2}} \equiv -1 \pmod{F_n}.$

We can see that this is a very simple test, which is actually a mere instance of Euler's compositeness test.

Proof of correctness of Pépin's test. First, suppose that $3^{(F_n - 1)/2} \equiv -1 \pmod{F_n}$. Then, $3^{F_n - 1} \equiv 1 \pmod{F_n}$. Since $F_n - 1$ is a power of two, $F_n - 1$ is necessarily the order of 3 modulo $F_n$. However, the order of every integer modulo $F_n$ is at most $\varphi(F_n) \le F_n - 1$; hence, in this case, we have $\varphi(F_n) = F_n - 1$, which means that $F_n$ is a prime.

For the other direction, let $F_n$ be a prime. From part (i) of lemma 11.4.12, we get that $3^{(F_n - 1)/2} \equiv (3/F_n) \pmod{F_n}$, so it suffices to determine the value $(3/F_n)$. This is easy: $F_n \equiv 2 \pmod 3$, thus $(F_n/3) = -1$; further, $F_n \equiv 1 \pmod 4$, and the law of quadratic reciprocity thus yields $(3/F_n) = (F_n/3) = -1$, which is what we wanted to prove. □

Now, we introduce a primality test which is a bit old, yet still widely used in modern computation systems – the so-called Pocklington–Lehmer test. First of all, however, we describe a simpler primality test for illustration, the so-called Lucas's test.

11.6.10. Theorem (Lucas). If for every prime divisor $q$ of $N - 1$, there is an $a$ such that
$a^{N-1} \equiv 1 \pmod N$ and $a^{\frac{N-1}{q}} \not\equiv 1 \pmod N,$
then $N$ is a prime.

Proof. It suffices to prove that $N - 1$ divides $\varphi(N)$ (a condition which apparently no composite number satisfies; it forces $\varphi(N) = N - 1$). If not, then there is a prime $q$ and an $r \in \mathbb{N}$ such that $q^r$ divides $N - 1$ but does not divide $\varphi(N)$. The order $e$ of the corresponding integer $a$ divides $N - 1$ (by the first condition) and does not divide $(N-1)/q$ (by the second condition); hence $q^r$ divides $e$. Furthermore, $e$ divides $\varphi(N)$, and so does $q^r$ – a contradiction. □

The integer $a$ from the previous theorem is called a primality witness for the integer $N$ (in this as well as in the other primality tests). Another general primality test is based on the above one. It is useful if we want to turn the high probability given by the Miller–Rabin compositeness test into certainty.

11.6.11. Theorem (Pocklington–Lehmer). Let $N$ be a natural number, $N > 1$. Let $p$ be a prime which divides $N - 1$. Further, suppose that there is an integer $a_p$ such that
$a_p^{N-1} \equiv 1 \pmod N$ and $\left(a_p^{\frac{N-1}{p}} - 1,\ N\right) = 1.$
Let $p^{\alpha_p}$ be the highest power of $p$ which divides $N - 1$. Then, every positive divisor $d$ of the integer $N$ satisfies $d \equiv 1 \pmod{p^{\alpha_p}}$.

Proof of the Pocklington–Lehmer theorem. Every positive divisor $d$ of the integer $N$ is a product of prime divisors of $N$, so it suffices to prove the theorem for prime values of $d$.
The condition $a_p^{N-1} \equiv 1 \pmod N$ implies that the integers $a_p$ and $N$ are coprime (any divisor they have in common must divide the right-hand side of the congruence as well). Then, $(a_p, d) = 1$ too, and we have $a_p^{d-1} \equiv 1 \pmod d$ by Fermat's theorem. Since $\left(a_p^{(N-1)/p} - 1, N\right) = 1$, we get $a_p^{(N-1)/p} \not\equiv 1 \pmod d$. Let $e$ denote the order of $a_p$ modulo $d$. Then, $e \mid d - 1$, $e \mid N - 1$, and $e \nmid (N-1)/p$. If we had $p^{\alpha_p} \nmid e$, then $e \mid N - 1$ would imply $e \mid \frac{N-1}{p}$, a contradiction. Therefore, $p^{\alpha_p} \mid e$, and so $p^{\alpha_p} \mid d - 1$. □

11.6.12. Theorem. Let $N \in \mathbb{N}$, $N > 1$. Suppose that we can write $N - 1 = F \cdot U$, where $(F, U) = 1$ and $F > \sqrt N$, and that we are familiar with the prime factorization of $F$. Then:
• if we can find, for every prime $p \mid F$, an integer $a_p \in \mathbb{Z}$ as in the above theorem, then $N$ is a prime;
• if $N$ is a prime, then for every prime $p \mid N - 1$, there is an integer $a_p \in \mathbb{Z}$ with the desired properties.

Proof. By theorem 11.6.11, any divisor $d > 1$ of the integer $N$ satisfies $d \equiv 1 \pmod{p^{\alpha_p}}$ for all prime factors $p$ of $F$, hence $d \equiv 1 \pmod F$, and so $d > \sqrt N$. If $N$ has no non-trivial divisor less than or equal to $\sqrt N$, then it is necessarily a prime.

On the other hand, it suffices to choose for $a_p$ a primitive root modulo the prime $N$ (independently of $p$). Then, it follows from Fermat's theorem that $a_p^{N-1} \equiv 1 \pmod N$, and since $a_p$ is a primitive root, we get $a_p^{(N-1)/p} \not\equiv 1 \pmod N$ for any $p \mid N - 1$. The integers $a_p$ are again called primality witnesses for the integer $N$. □

Remark. The previous test also contains Pépin's test as a special case (there, for $N = F_n$, we have $p = 2$, which is satisfied by the primality witness $a_p = 3$).

11.6.13. The polynomial test. (An editorial note from the draft, translated from the Czech: a description of the AKS algorithm is to be added here – considering whether to include the proof, and checking that the algebra chapter contains all the necessary prerequisites; alternatively, a proof of the statement on Miller–Rabin from Schoof's article cited above could be added. See Radan (ATC 2014) for a detailed account and, very briefly, McAndrew, Cryptography in Sage.)
11.6.14. Looking for divisors. If one of the compositeness tests verifies that a given integer is indeed composite, we usually want to find one of its non-trivial divisors.
However, this task is much more difficult than merely revealing that the integer is composite – let us recall that the compositeness tests guarantee the compositeness, yet they provide us with no divisors (which is, on the other hand, advantageous for RSA and similar cryptographic protocols). Therefore, we present here only a short summary of the methods used in practice and one sample for inspiration.
(1) Trial division
(2) Pollard's ρ-algorithm
(3) Pollard's p − 1 algorithm
(4) Elliptic curve method (ECM)
(5) Quadratic sieve (QS)
(6) Number field sieve (NFS)

For illustration, we demonstrate in the exercises (11.F.7, 11.F.8) one of these algorithms – Pollard's ρ-method – on concrete instances. This algorithm is especially suitable for finding relatively small divisors (since its expected complexity depends on the size of these divisors), and it is based on the following idea: given a random function $f: S \to S$, where $S$ is a finite set with $n$ elements, the sequence $(x_n)_{n=0}^{\infty}$ defined by $x_{n+1} = f(x_n)$ must eventually run into a cycle, and the expected length of the tail as well as of the period is $\sqrt{\pi n/8}$. The algorithm described below is again a straightforward implementation of these observations.

Algorithm (Pollard's ρ-method):

    from math import gcd

    def pollard_rho(n, f):
        # n: the integer to be factored, f: an appropriate function
        a, b, d = 2, 2, 1
        while d == 1:
            a = f(a) % n
            b = f(f(b) % n) % n
            d = gcd(a - b, n)
        return None if d == n else d   # None signals FAILURE

11.6.15. Public-key cryptography. In present-day practice, the most important application of number theory is the so-called public-key cryptography. We shall only briefly touch this fascinating area of mathematics. The main objectives are to provide
• encryption: a message encrypted with the public key of the receiver can be decrypted by no one else (to be precise, by no one who does not know the corresponding private key);
• signatures: the integrity of a message signed with the private key of the sender can be verified by anyone with access to the sender's public key.

The most basic and most often used protocols in public-key cryptography are:
• RSA (encryption) and the derived system for signing messages,
• Digital Signature Algorithm (DSA) and its variant based on elliptic curves (ECDSA),
• Rabin cryptosystem (and signature scheme),
• ElGamal cryptosystem (and signature scheme),
• elliptic curve cryptography (ECC),
• Diffie–Hellman key exchange protocol (DH).

11.6.16. Encryption – RSA. First, we describe the best known public-key cipher – RSA. The simple principle of the RSA protocol [14] is based on Euler's theorem 11.3.12:
• The user A needs a pair of keys – a public one ($V_A$) and a private one ($S_A$).
• Key generation: the user selects two large primes $p, q$ and calculates $n = pq$, $\varphi(n) = (p-1)(q-1)$. The integer $n$ is part of the public key; the main idea is that it is too hard to compute $\varphi(n)$ without knowing the decomposition of $n$.
• Then, the user chooses the second part of the public key, $e$, and verifies that $(e, \varphi(n)) = 1$.
• Using the Euclidean algorithm, the private key $d$ is computed so that $e \cdot d \equiv 1 \pmod{\varphi(n)}$.

[14] Ron Rivest, Adi Shamir, Leonard Adleman (1977); C. Cocks of the secret service GCHQ (not publicly) as early as 1973.

The principle of RSA

The secret communication runs in the following steps (for the sake of simplicity, we identify the encryption procedure with the public key $V_A$ and the decryption procedure with the private key $S_A$):
• Encrypting the numerical code of a message M addressed to the user A (by any other participant who has access to the public key $V_A$): $C = V_A(M) \equiv M^e \pmod n$.
11.6.15. Public-key cryptography. In present-day practice, the most important application of number theory is so-called public-key cryptography. We shall only briefly touch this fascinating area of mathematics. The main objectives are to provide
• encryption: a message encrypted with the public key of the receiver can be decrypted by no one else (more precisely, by no one who does not know the corresponding private key);
• signature: the integrity of a message signed with the private key of the sender can be verified by anyone with access to the sender's public key.
The most basic and most often used protocols in public-key cryptography are:
• RSA (encryption) and the derived system for signing messages,
• Digital Signature Algorithm – DSA and its variant based on elliptic curves (ECDSA),
• Rabin cryptosystem (and signature scheme),
• ElGamal cryptosystem (and signature scheme),
• elliptic curve cryptography (ECC),
• Diffie-Hellman key exchange protocol (DH).

11.6.16. Encryption – RSA. First, we describe the best-known public-key cipher – RSA. The simple principle of the protocol RSA14 is based on Euler's theorem 11.3.12:
• The user A needs a pair of keys – a public one (V_A) and a private one (S_A).
• Key generation: the user selects two large primes p, q, and calculates n = pq and φ(n) = (p − 1)(q − 1). The integer n is part of the public key; the main idea is that it is too hard to compute φ(n) without knowing the decomposition of n.
• Then, the user chooses the second half of the public key, e, and verifies that (e, φ(n)) = 1.
• Using the Euclidean algorithm, the private key d is computed so that e · d ≡ 1 (mod φ(n)).

The principle of RSA
The secret communication runs in the following steps (for the sake of simplicity, we further identify the encryption procedure with the public key V_A and the decryption procedure with the private key S_A):
• Encrypting the numerical code of a message M addressed to the user A (by any other participant who has access to the public key V_A): C = V_A(M) ≡ M^e (mod n).
• Decrypting the cipher C by the user A: the plaintext is M = S_A(C) ≡ C^d (mod n).

14 Ron Rivest, Adi Shamir, Leonard Adleman (1977); C. Cocks of the secret service GCHQ, not publicly, as early as 1973.

The proof of correctness of this protocol (i.e., that A indeed receives what was meant) is a straightforward application of Euler's theorem: thanks to 11.3.13, it holds for any message M which is coprime to n that (M^e)^d ≡ M^1 = M (mod n). In the (extremely unlikely) case that the message M is not coprime to n, the statement holds as well, although the proof needs to be modified with the help of the Chinese remainder theorem (however, we should realize that if a message M with 0 < M < n is not coprime to n, then (M, n) is a non-trivial divisor of n, so the key of the receiver is actually compromised).
The security of RSA has been tested since it was invented in 1977, and no meaningful weakness (except for side channels or some singular keys) has been discovered yet (provided a sufficiently large key is used; nowadays it is recommended to use at least 2048 bits). Nevertheless, it has not been proved that breaking RSA really relies on the hardness of integer factorization. For practical reasons, the requirements on a secure choice of the key are:
• d is large enough (defense against the so-called Wiener's attack),
• p and q are not too close to each other (see the exercise 11.G.1),
• the public exponent is selected to be at least e = 65537 (although no direct attack against a small public exponent e has been noticed).
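To make the key generation and the two steps above concrete, here is a toy Python computation (our own illustration) with absurdly small primes; as noted above, real keys must have at least 2048 bits.

from math import gcd

p, q = 61, 53                 # two "large" primes (toy values)
n = p * q                     # n = 3233, part of the public key
phi = (p - 1) * (q - 1)       # φ(n) = 3120, must remain secret
e = 17                        # second half of the public key
assert gcd(e, phi) == 1       # (e, φ(n)) = 1
d = pow(e, -1, phi)           # private key: e·d ≡ 1 (mod φ(n)), here d = 2753
M = 65                        # numerical code of the message
C = pow(M, e, n)              # encryption: C ≡ M^e (mod n)
assert pow(C, d, n) == M      # decryption: C^d ≡ M (mod n)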
11.6.17. Rabin cryptosystem. Further, we mention a simplified variant of the protocol named the Rabin cryptosystem15, which was the first public-key cryptosystem where one demonstrably needs to factorize the modulus in order to break it (unlike RSA, for which this has not been proved):
• Every participant A needs a pair of keys – a public one (V_A) and a private one (S_A).
• Key generation: A chooses two large primes of roughly the same size, p, q ≡ 3 (mod 4), and computes n = pq.
• The public key is V_A = n; the private key is the pair S_A = (p, q).
The secret communication then runs as follows:
• Encryption of the numerical code of the message M: C = V_A(M) ≡ M^2 (mod n).
• Decryption of the cipher C: the (four) square roots of C modulo n are computed, and it is easily found out which one of them is the original message (for instance, the other three make no sense or do not contain the agreed identification).

15 Rabin, Michael. Digitalized Signatures and Public-Key Functions as Intractable as Factorization. MIT Laboratory for Computer Science, January 1979.

As we can see from the description of the protocol, the process of decryption requires the computation of the square root of C modulo n = pq, where p ≡ q ≡ 3 (mod 4). This can be done as follows:
• The values r ≡ C^{(p+1)/4} (mod p) and s ≡ C^{(q+1)/4} (mod q) are computed.
• Further, we need to determine the coefficients a, b in Bézout's identity, i.e., integers for which ap + bq = 1.
• We set x ≡ aps + bqr (mod n), y ≡ aps − bqr (mod n).
• The square roots of C modulo n are then ±x, ±y.
Let us mention that this is in fact an application of the Chinese remainder theorem and of the fact that we are able to easily find the solutions of the quadratic congruence x^2 ≡ a (mod p) provided p ≡ 3 (mod 4) (see exercise 11.D.38). Indeed, it holds that
(±x)^2 = (aps + bqr)^2 ≡ (bqr)^2 ≡ r^2 ≡ C^{(p+1)/2} ≡ C (mod p),
where we made use of the fact that bq ≡ 1 (mod p) and that C ≡ M^2 (mod p) is a quadratic residue modulo p, hence C^{(p−1)/2} ≡ (C/p) = 1 (mod p). Similarly, we have (±x)^2 ≡ C (mod q) as well; thus ±x is a square root of C modulo n. The derivation of the same result for y is nearly identical.

11.6.18. Digital signature. Now, let us briefly describe the principle of the digital signature.

Principle of the digital signature
Creating the signature:
(1) A digest (hash) H_M of the message is generated; the length of the hash is fixed (160 or 256 bits, for instance) – we should realize that such a mapping is surely not injective (there will be many messages sharing the same hash).
(2) The signature of the message, S_A(H_M), is created from this hash using the knowledge of the private key of the signer (similarly to decryption of a message's text).
(3) The message M is sent (optionally encrypted with the public key of the receiver) together with the created signature.
The signature verification then runs as follows:
(1) A digest H′_M is generated for the received message M (after decryption, if it has been encrypted).
(2) Using the public key of the (declared) sender of the message, the original digest of the message is reconstructed: V_A(S_A(H_M)) = H_M.
(3) The digests are then compared, i.e., it is found out whether H_M = H′_M.

The (cryptographic) hash function mentioned above must have the following properties:
• It is easy to find the hash of any message.
• It is computationally infeasible to find any message with a given hash.
• It is computationally infeasible to find two messages with the same hash (the function must be collision-resistant).
• Every change of the message changes the hash as well.
The best-known examples of such functions are:
• MD5 (128 bits, Rivest 1992) – not collision-resistant,
• SHA-1 (160 bits, NSA 1995) – since 2005 considered insufficiently collision-resistant,
• RIPEMD-320,
• SHA-3.

11.6.19. Diffie-Hellman key exchange system. Another important type of protocol, which is very often used in practice, is a protocol for key exchange in symmetric cryptography – the Diffie-Hellman key exchange16, whose discovery was a breakthrough in this discipline: it made it possible to replace one-time keys, couriers with briefcases, and the like by purely mathematical means, in particular without the necessity of prior communication of both sides. The protocol for the agreement of two sides (Alice, Bob) on a common key (an integer) is as follows:

Principle of the DH key exchange protocol
• Both sides agree on a prime p and a primitive root g modulo p (this need not be done secretly).
• Alice chooses a random integer a and sends g^a (mod p).
• Bob chooses a random integer b and sends g^b (mod p).
• The common key for the communication is then g^{ab} (mod p).

The security of this protocol relies on the fact that it is hard to compute the discrete logarithm (the so-called discrete logarithm problem) – see also part 11.3.15.

16 Whitfield Diffie, Martin Hellman (1976); M. Williamson of the secret service GCHQ as early as 1974 (not published).
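A toy run of the exchange in Python (our own illustration; the prime p = 23 and the primitive root g = 5 are tiny demonstration values, not parameters anyone should use):

import random

p, g = 23, 5                      # public parameters: a prime and a primitive root mod p
a = random.randrange(2, p - 1)    # Alice's secret exponent
b = random.randrange(2, p - 1)    # Bob's secret exponent
A = pow(g, a, p)                  # Alice sends g^a mod p
B = pow(g, b, p)                  # Bob sends g^b mod p
keyA = pow(B, a, p)               # Alice computes (g^b)^a
keyB = pow(A, b, p)               # Bob computes (g^a)^b
assert keyA == keyB               # both obtained the common key g^(ab) mod p

An eavesdropper sees p, g, g^a and g^b, and recovering the key amounts to the discrete logarithm problem mentioned above.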
There is another encryption algorithm based on the Diffie-Hellman key exchange protocol – the ElGamal algorithm – which we also describe in short:
• Every participant chooses a prime p with a primitive root g.
• Further, they choose a private key x, compute h ≡ g^x (mod p), and publish the public key (p, g, h).
The secret communication with a participant A then runs as follows:
• Encryption of the numerical code of the message M: choosing a random y and computing C_1 ≡ g^y (mod p) and C_2 ≡ M · h^y (mod p), then sending the pair (C_1, C_2) to the participant A.
• The participant A then decrypts the message by computing C_2 / C_1^x (mod p).
Remark. The mechanism of digital signature can be derived from the ElGamal algorithm just as in the case of RSA.

H. Additional exercises to the whole chapter

11.H.1. Find all three-digit n ∈ N for which 11 | n and n/11 equals the sum of the squares of the digits of n. ⃝
11.H.2. Someone falsely memorized Fermat's little theorem and claims that for p ∤ a we have a^p ≡ 1 (mod p). Find all pairs (a, p) for which this is true. ⃝
11.H.3. Prove that the last four digits of 5^n (n = 4, 5, 6, . . .) form a periodic sequence. Find this period. ⃝
11.H.4. Prove that there are infinitely many odd natural numbers k such that the integer 2^{2^n} + k is composite for every n ∈ N. ⃝
11.H.5. Prove that for every integer k ̸= 1, there are infinitely many natural numbers n such that the integer 2^{2^n} + k is composite. ⃝
11.H.6. Consider the sequence (a_n)_{n≥1} defined by a_n = 2^n + 3^n + 6^n − 1. Prove that for every prime p, this sequence contains a multiple of p. ⃝
11.H.7. Prove that no natural number n greater than 1 satisfies n | 2^n − 1. ⃝
11.H.8. Prove that for every odd prime p, there are infinitely many natural numbers n such that p | n · 2^n + 1. ⃝
11.H.9. Let a function f : N → N satisfy (f(a), f(b)) = (f(a), f(|a − b|)) for all a, b ∈ N. Prove that (f(a), f(b)) = f((a, b)). Show that this implies the result of exercise 11.A.7 as well as the fact that (F_a, F_b) = F_{(a,b)}, where F_a denotes the a-th term of the Fibonacci sequence. ⃝
11.H.10. Let the RSA parameters be n = 143 = 11 · 13, e = 7, d = 103. Sign the message m = 8, and verify this signature. Decide whether s = 42 is the signature of the message m = 26. ⃝

Key to the exercises
10.C.10. The answer is (3/5) · (2/3) + (2/5) · 1 = 4/5.
10.E.17. Simply, a = 3/8. Thus, the distribution function of the random variable X is F_X(t) = t^3/8 for t ∈ (0, 2), zero for smaller values of t, and one for greater. Let Z = X^3 denote the random variable corresponding to the volume of the considered cube. It lies in the interval (0, 8). Thus, for t ∈ (0, 8) and the distribution function F_Z of the random variable Z, we can write F_Z(t) = P[Z < t] = P[X^3 < t] = P[X < t^{1/3}] = F_X(t^{1/3}) = t/8. Then, the density is f_Z(t) = 1/8 on the interval (0, 8) and zero elsewhere. Since this is the uniform distribution on the given interval, the expected value is equal to 4.
10.E.21. f_{(X,Y)}(u, v) = uv for (u, v) ∈ (0, 1) × (0, 2), f_{(X,Y)}(u, v) = 0 otherwise. Then, the wanted probability is P = ∫_0^1 ∫_{x^2}^2 xy dy dx = 11/12.
10.E.22. P = ∫_0^{∛2} ∫_{x^3}^2 xy dy dx = ∛4/12.
10.F.9. EU = 1 · 0.6 + 2 · 0.4 = 1.4, EU^2 = 0.4 + 4 · 0.6 = 2.8, EV = 0.3 + 0.6 + 1.2 = 2.1, EV^2 = 0.3 + 1.2 + 3.6 = 5.1, E(UV) = 2.8, var(U) = 2.8 − 1.96 = 0.84, var(V) = 5.1 − 4.41 = 0.69, cov(U, V) = 2.8 − 1.4 · 2.1 = −0.14, ρ_{U,V} = −0.14/√(0.84 · 0.69).
10.F.10. EX = 1/3, var X = 4/45.
10.F.11. ρ_{X,Y} = −1.
10.F.12. ρ_{U,V} ≈ −0.421.

CHAPTER 12
Algebraic structures

The more abstraction, the more chaos? – no, it is often the other way round...

In this chapter, we begin a seemingly very formal study. But the concepts reflect many properties of things and phenomena surrounding us.
This is one of the parts of the book which is not in the prerequisites of any other chapter. Large parts serve as a quick illustration of interesting uses of mathematical tools and models. The simplest properties of real objects are used for encoding in terms of algebraic operations. Thus, "algebra" considers algorithmic manipulations with letters which usually correspond to computations or descriptions of processes. Strictly speaking, this chapter builds only on the first and sixth parts of chapter one, where abstract views on numbers and relations between objects are introduced. But it is a focal point for abstract versions of many concepts already met.
The first two sections aim at direct generalizations of the familiar algebraic structure of numbers. This leads to a discussion of rings of polynomials. Only then do we provide an introduction to group theory, where there is only a single operation. The last two sections provide some glimpses of direct applications. The construction of (self-correcting) codes often used in data transfer is considered. The last section explains the elementary foundations of computer algebra. This includes solving polynomial equations and algorithmic methods for manipulation and calculations with formal expressions.

1. Posets and Boolean algebras

Familiarity with the properties of addition and multiplication of scalars and matrices is assumed, and likewise with the binary operations of set intersection and union in elementary set theory, as indicated at the end of the first chapter. We proceed to work with symbols which stand for miscellaneous objects, resulting in the universal applicability of the results. This allows the relating of the basic set operations to propositional logic, which formalizes methods for expressing propositions and evaluating truth values.

12.1.1. Algebraic operations. For any set M, there is the set K = 2^M consisting of all subsets of M, together with the operations of union ∨ : K × K → K and intersection ∧ : K × K → K. This is an instance of an algebraic structure on the set K with two binary operations. In general, we write (K, ∨, ∧). In the special case of sets, these binary operations are denoted rather by ∪ and ∩, respectively.
To every set A ∈ K, its complement A′ = M \ A can be assigned. This is another operation ′ : K → K with only one argument. Such operations are called unary operations. In general, there are algebraic structures with k operations μ_1, . . . , μ_k, each μ_j : K × · · · × K → K (with i_j factors of K) having i_j arguments, and we write (K, μ_1, . . . , μ_k) for such a structure. The number i_j of arguments is called the arity of the operation ("unary", "binary", etc.). If i_j = 0, then the operation has no arguments, which means it is a distinguished element in K.
With subsets in K = 2^M, there is the unique "greatest object", namely the entire set M, which is neutral for the ∧ operation. Similarly, the empty set ∅ ∈ K is the only neutral element for ∨. Notice that if M is empty, then K contains the only element ∅.
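The set algebra 2^M can be explored directly on a computer; the following Python sketch (our own illustration) models subsets of a three-element set M as frozensets, with join as union, meet as intersection, and complement A′ = M \ A, and verifies one of the distributivity laws below by brute force.

from itertools import chain, combinations

M = frozenset({1, 2, 3})
# the underlying set K = 2^M of all subsets of M
K = [frozenset(s) for s in
     chain.from_iterable(combinations(M, r) for r in range(len(M) + 1))]

join = lambda a, b: a | b      # supremum = union
meet = lambda a, b: a & b      # infimum = intersection
comp = lambda a: M - a         # complement A' = M \ A

# check A ∧ (B ∨ C) = (A ∧ B) ∨ (A ∧ C) for all triples
assert all(meet(a, join(b, c)) == join(meet(a, b), meet(a, c))
           for a in K for b in K for c in K)

The remaining axioms listed in the next subsection can be checked in exactly the same way.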
12.1.2. Set algebra. View the algebraic structure on the set K = 2^M from the previous paragraph as (K, ∨, ∧, ′, 1, 0), with two binary operations, one unary operation (the complement), and two special elements 1 = M, 0 = ∅. It is easily verified that all elements A, B, C ∈ K satisfy the following properties:

Axioms of Boolean algebras
(1) A ∧ (B ∧ C) = (A ∧ B) ∧ C,
(2) A ∨ (B ∨ C) = (A ∨ B) ∨ C,
(3) A ∧ B = B ∧ A, A ∨ B = B ∨ A,
(4) A ∧ (B ∨ C) = (A ∧ B) ∨ (A ∧ C),
(5) A ∨ (B ∧ C) = (A ∨ B) ∧ (A ∨ C),
(6) there is a 0 ∈ K such that A ∨ 0 = A,
(7) there is a 1 ∈ K such that A ∧ 1 = A,
(8) A ∧ A′ = 0, A ∨ A′ = 1.

Compare these properties with those of the scalars (K, +, ·, 0, 1): properties (1) and (2) say that both the operations ∧ and ∨ are associative, and property (3) says that both operations are also commutative. So far, this is the same as for the addition and multiplication of scalars, and there are neutral elements for both operations there as well. However, the properties (4) and (5) are stronger now: they require the distributivity of ∧ over ∨ as well as of ∨ over ∧. Of course, this cannot be the case for addition and multiplication of numbers: there, multiplication distributes over addition, but not vice versa. Properties (6)–(8) require the existence of neutral elements for both operations as well as the existence of an analogy to the "inverse" of each element. (Note, however, that the meet with the complement results in the neutral element for ∨, and the join with the complement in the neutral element for ∧. This is the other way round for numbers.)

A. Boolean algebras and lattices

12.A.1. Find the (complete) disjunctive normal form of the proposition (B′ ⇒ C) ∧ [(A ∨ C) ∧ B]′.
Solution. If the propositional formula contains only a few variables (in our case, three), the most advantageous procedure is to build the truth table of the formula and read the disjunctive normal form off it. The table consists of 2^3 = 8 rows. The examined formula is denoted φ.

A B C | B′ ⇒ C | [(A ∨ C) ∧ B]′ | φ
0 0 0 |   0    |       1        | 0
0 0 1 |   1    |       1        | 1
0 1 0 |   1    |       1        | 1
0 1 1 |   1    |       0        | 0
1 0 0 |   0    |       1        | 0
1 0 1 |   1    |       1        | 1
1 1 0 |   1    |       0        | 0
1 1 1 |   1    |       0        | 0

The resulting complete disjunctive normal form is the disjunction of the formulas that correspond to the rows with 1 in the last column (the formula is true for the given valuation of the atomic propositions). Each such row corresponds to the conjunction of the variables (if the corresponding value is 1) or their negations (if it is 0). In our case, it is the disjunction of the conjunctions corresponding to the second, third, and sixth rows, i.e., the result is
(A′ ∧ B′ ∧ C) ∨ (A′ ∧ B ∧ C′) ∨ (A ∧ B′ ∧ C).
We can also rewrite the formula by expanding the connective ⇒ in terms of ∧ and ∨, using the De Morgan laws and distributivity:
(B′ ⇒ C) ∧ [(A ∨ C) ∧ B]′
⇐⇒ (B ∨ C) ∧ [(A ∨ C)′ ∨ B′]
⇐⇒ (B ∨ C) ∧ [(A′ ∧ C′) ∨ B′]
⇐⇒ [(B ∨ C) ∧ (A′ ∧ C′)] ∨ [(B ∨ C) ∧ B′]
⇐⇒ [(B ∧ A′ ∧ C′) ∨ (C ∧ A′ ∧ C′)] ∨ [(B ∧ B′) ∨ (C ∧ B′)]
⇐⇒ (B ∧ A′ ∧ C′) ∨ (C ∧ B′),
which is an (incomplete) disjunctive normal form of the given formula. Clearly, it is equivalent to our result above. (The word "complete" means that each disjunct (called a clause in this context) contains each of the three variables or their negations (these are called literals).) □

12.A.2. Find a disjunctive normal form of the formula ((A ∧ B) ∨ C)′ ∧ (A′ ∨ (B ∧ C′ ∧ D)). ⃝

We know several logical connectives: ∧, ∨, ⇒, ≡ and the unary ′. Any propositional formula with these connectives can be equivalently written using only some of them, for instance ∨ and ′. There are also connectives which alone suffice to express any propositional formula. Among binary connectives, these are NAND and NOR (A NAND B = (A ∧ B)′, A NOR B = (A ∨ B)′). Try to express each of the known connectives using only NAND, and then only NOR. These connectives are implemented in electric circuits as so-called "gates".

12.A.3. Express the propositional formula A ⇒ B using only NAND gates. ⃝
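The truth-table procedure of 12.A.1 is easy to mechanize. The Python sketch below (our own illustration) tabulates the formula φ = (B′ ⇒ C) ∧ [(A ∨ C) ∧ B]′, using the equivalence B′ ⇒ C ⇔ B ∨ C, and prints the clauses of the complete disjunctive normal form.

from itertools import product

def phi(a, b, c):
    # B' => C is equivalent to B ∨ C; the whole formula is (B ∨ C) ∧ ¬((A ∨ C) ∧ B)
    return (b or c) and not ((a or c) and b)

for a, b, c in product([0, 1], repeat=3):
    if phi(a, b, c):
        clause = " ∧ ".join(v if bit else v + "'"
                            for v, bit in zip("ABC", (a, b, c)))
        print(clause)
# prints: A' ∧ B' ∧ C,  A' ∧ B ∧ C',  A ∧ B' ∧ C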
There are only a few structures that possess such restrictive properties.

Boolean algebras
Definition. A set K together with two binary operations ∧, ∨, a unary operation ′, and two special elements 1, 0 which satisfy the properties (1)–(8) is called a Boolean algebra. The ∧ operation is called infimum or meet, and the ∨ operation is called supremum or join. The element A′ is called the complement of A.

Note that the axioms of Boolean algebras are symmetric with respect to interchanging the operations ∧ and ∨ together with the interchange of 0 and 1. This means that any proposition that can be derived from the axioms has a valid dual proposition, created by interchanging ∧ with ∨ and 0 with 1. This is the principle of duality.

12.1.3. Properties of Boolean algebras. As usual, we derive several elementary corollaries of the axioms. In particular, note that in the special case of the Boolean algebra of all subsets of a given set, this proves all the elementary properties known from set algebra.
The special elements 1 and 0 are unique as the neutral elements for ∧ and ∨: if there is a ˜0 with the same properties, then ˜0 = ˜0 ∨ 0 = 0, and similarly 1 is unique. Next, if B, C ∈ K satisfy the properties of A′ (axiom (8) in the above definition), then
B = B ∨ 0 = B ∨ (A ∧ C) = (B ∨ A) ∧ (B ∨ C) = 1 ∧ (B ∨ C) = B ∨ C,
and similarly C = C ∨ B = B ∨ C. Therefore, B = C, and so the complement of any A ∈ K is determined uniquely by its properties.
This observation means that given (K, ∧, ∨), there exists at most one operation ′ which completes it to a Boolean algebra (K, ∧, ∨, ′, 1, 0), together with unique elements 1 and 0. Therefore, one usually writes just (K, ∧, ∨), omitting the other three symbols.
The properties in the following proposition have their own names in set algebra: (2) are the absorption laws, (3) is the idempotency of the operations ∧ and ∨, and the equalities of (4) are the De Morgan laws.

12.A.4. Write down the truth table of the Boolean proposition ((A ∧ B) ∨ C)′.
Solution. Using the De Morgan and distributive laws, we express ((A ∧ B) ∨ C)′ = (A ∧ B)′ ∧ C′ = (A′ ∨ B′) ∧ C′ = (A′ ∧ C′) ∨ (B′ ∧ C′). Setting 1 for the value True and 0 for False for A, B, C, we obtain the table for φ(A, B, C) = (A′ ∧ C′) ∨ (B′ ∧ C′):

A B C | φ
0 0 0 | 1
0 0 1 | 0
0 1 0 | 1
1 0 0 | 1
0 1 1 | 0
1 0 1 | 0
1 1 0 | 0
1 1 1 | 0
□

12.A.5. For A = 0, B = 0, C = 1, find the value of the Boolean proposition A ∧ B ∨ C.
Solution. A ∧ B ∨ C = (0 ∧ 0) ∨ 1 = 0 ∨ 1 = 1, which is the value True. □

12.A.6. A switching circuit is an electrical network that consists of switches which are open or closed, as in the figure. If switch a is open, we set a = 0; if switch a is closed, we set a = 1, and so on. Switches labeled with the same letter must be either all open or all closed simultaneously. Switch a′ is open if and only if switch a is closed. If current can flow between the extreme left and right ends of the circuit, the output of the circuit is 1, and 0 otherwise. The switching circuit representing a ∧ b is called a series circuit, while the circuit representing a ∨ b is called a parallel circuit. Represent the switching circuit in the figure as a Boolean proposition.
Solution. We can write this circuit as the proposition φ = a ∧ X ∧ b, where X = b ∨ d′ ∨ (c ∧ (a ∨ d ∨ c′)). Thus, φ = a ∧ (b ∨ d′ ∨ (c ∧ (a ∨ d ∨ c′))) ∧ b. □

12.A.7. Draw a switching circuit representing the Boolean proposition (a ∧ ((b ∧ c′) ∨ (b′ ∧ c))) ∨ (a ∧ b ∧ c).
Solution. See the figure. □
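Switching circuits such as those in 12.A.6 and 12.A.7 are just Boolean functions, so they can be analyzed exhaustively; the Python sketch below (our own illustration) tabulates the circuit φ = a ∧ (b ∨ d′ ∨ (c ∧ (a ∨ d ∨ c′))) ∧ b of 12.A.6.

from itertools import product

def circuit(a, b, c, d):
    # series connection = and, parallel connection = or, primed switch = not
    x = b or (not d) or (c and (a or d or (not c)))
    return a and x and b

for a, b, c, d in product([False, True], repeat=4):
    if circuit(a, b, c, d):
        print(a, b, c, d)   # the switch settings under which current flows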
Further properties of Boolean algebras
Proposition. In every Boolean algebra (K, ∧, ∨),
(1) A ∧ 0 = 0, A ∨ 1 = 1,
(2) A ∧ (A ∨ B) = A, A ∨ (A ∧ B) = A,
(3) A ∧ A = A, A ∨ A = A,
(4) (A ∧ B)′ = A′ ∨ B′, (A ∨ B)′ = A′ ∧ B′,
(5) (A′)′ = A
for all A, B ∈ K.

Proof. By the principle of duality, it suffices to prove only one of the claims in each item. Begin with (3), of course using just the axioms of Boolean algebras:
A = A ∧ 1 = A ∧ (A ∨ A′) = (A ∧ A) ∨ 0 = A ∧ A.
Now, (1) is proved easily:
A ∧ 0 = A ∧ (A ∧ A′) = (A ∧ A) ∧ A′ = A ∧ A′ = 0.
(2) is also easy (read the second equality from right to left):
A ∧ (A ∨ B) = (A ∨ 0) ∧ (A ∨ B) = A ∨ (0 ∧ B) = A ∨ 0 = A.
In order to prove the De Morgan laws, it suffices to verify that A′ ∨ B′ has the properties of the complement of A ∧ B; by the uniqueness proved above, it then must be the complement. Using (1), compute
(A ∧ B) ∧ (A′ ∨ B′) = ((A ∧ B) ∧ A′) ∨ ((A ∧ B) ∧ B′) = (0 ∧ B) ∨ (A ∧ 0) = 0.
Similarly,
(A ∧ B) ∨ (A′ ∨ B′) = (A ∨ (A′ ∨ B′)) ∧ (B ∨ (A′ ∨ B′)) = (1 ∨ B′) ∧ (1 ∨ A′) = 1.
Finally, from the definition, A′ ∧ A = 0 and A′ ∨ A = 1. Hence, A has the required properties of the complement of A′, which means that A = (A′)′. □

12.1.4. Examples of Boolean algebras. The intersection and union on all subsets of a given set M always define a Boolean algebra. The smallest one arises when M is a singleton: then K = 2^M contains two elements, namely 0 = ∅ and 1 = M, with the obvious equalities 0 ∧ 1 = 0, 0 ∨ 1 = 1, etc. The operations ∧ and ∨ are the same as multiplication and addition in the remainder class ring Z_2 of even and odd numbers. This is called the Boolean algebra Z_2. This is the only case when a Boolean algebra is a field of scalars at the same time!
As in the case of rings of scalars or vector spaces, the algebraic structure of a Boolean algebra can be extended to all spaces of functions whose codomain is a Boolean algebra. For the set S = {f : M → K} of all functions from a set M to a Boolean algebra (K, ∧, ∨), the necessary operations and the distinguished elements 0 and 1 on S can be defined as functions of an argument x ∈ M as follows:
(f_1 ∧ f_2)(x) = (f_1(x)) ∧ (f_2(x)) ∈ K,
(f_1 ∨ f_2)(x) = (f_1(x)) ∨ (f_2(x)) ∈ K,
(1)(x) = 1 ∈ K,
(0)(x) = 0 ∈ K,
(f)′(x) = (f(x))′ ∈ K.
It is easy and straightforward to verify that these new operations define a Boolean algebra.
Recall that the subsets of a given set M can be viewed as mappings M → Z_2 (the elements of the subset in question are mapped to 1, while all others go to 0). Then, the union and intersection can be defined in the above manner — for instance, evaluating the expression (A ∨ B)(x) for a point x ∈ M determines whether x lies in A, whether it lies in B, and takes the join of the two results in Z_2. The result is 1 if and only if x lies in the union.

12.A.8. Verify the equivalence of the following two Boolean propositions: (a ∨ b) ∧ (c ∨ d) and (a ∧ c) ∨ (a ∧ d) ∨ (b ∧ c) ∨ (b ∧ d).
Solution. The distributive laws yield (a ∨ b) ∧ (c ∨ d) = ((a ∨ b) ∧ c) ∨ ((a ∨ b) ∧ d) = (a ∧ c) ∨ (b ∧ c) ∨ (a ∧ d) ∨ (b ∧ d). □

12.A.9. Determine whether the following two Boolean propositions are equivalent: ((a′ ∧ b) ∨ (a ∧ c′))′ and (a ∨ b′) ∧ (a ∨ c′).
Solution. Setting a = 1, b = 1, c = 0 gives ((a′ ∧ b) ∨ (a ∧ c′))′ = 0 and (a ∨ b′) ∧ (a ∨ c′) = 1. Hence, these propositions are not equivalent. □

12.A.10. Determine whether the following two Boolean propositions are equivalent: (a ∧ b ∧ c) ∨ (a ∨ c)′ and (a ∧ c) ∨ (a′ ∧ c′).
Solution. Setting a = 1, b = 0, c = 1 gives (a ∧ b ∧ c) ∨ (a ∨ c)′ = 0 and (a ∧ c) ∨ (a′ ∧ c′) = 1. Hence, these propositions are not equivalent. □

12.A.11. Show that the negation a′ cannot be obtained by a combinatorial circuit that consists of ∧ and ∨ operations only.
Solution. We prove the claim by induction on the number n of operations in the circuit. For n = 1, it is clear that neither proposition b ∧ c nor b ∨ c is equivalent to a′. Suppose that a′ cannot be realized by any circuit with at most n operations built from ∨ and ∧ only, and consider a circuit with n + 1 operations and the first operation into which the input a enters; it is either ∧ or ∨. If it is ∧ with the second argument either a or 1, then its output equals the input a, and this ∧ can be deleted from the circuit, with the input a going directly to the next available operation. If the second argument is 0, then a ∧ 0 = 0, and the operation can again be deleted, with 0 going as input to the next available operation. If a enters the circuit at an ∨ whose second argument is either a or 0, then the ∨ can be deleted, with a entering the next operation. If the second argument equals 1, then the result of the ∨ is 1, and 1 can be entered into the next available operation. In each case we obtain an equivalent circuit with at most n operations, so the induction hypothesis applies. □
12.A.12. Simplify the formula ((A ∧ B) ∨ (A ⇒ B)) ∧ ((B′ ⇒ C) ∨ (B ∧ C′)).
Solution. In Boolean algebra notation (writing + for ∨ and · for ∧), we obtain (a · b + a′ + b) · (b + c + b · c′) = (a′ + b) · (b + c) = a′ · c + b. This means that the given formula is equivalent to (A′ ∧ C) ∨ B. □

12.1.5. Propositional logic. The latter simple observation brings us close to the calculus of elementary logic. View the notations for operations in a Boolean algebra as creating "words" from the elements A, B, . . . ∈ K, the operations ∨, ∧, ′, and parentheses, which clarify the desired precedence of the operations. The axioms of Boolean algebras and their corollaries say how different words may produce the same result in K. This is clear in the case of K = 2^M, the set of all subsets of a given set; it is just equality of subsets.
Now, another interpretation is mentioned, in terms of operations in formal logic. We work with words as above, but view them as propositions composed from elementary (atomic) propositions A, B, . . . and the logical operations AND (the binary operation ∧), OR (the binary operation ∨), and the negation NOT (the unary operation ′). These words are called propositions. They are assigned a truth value depending on the truth values of the individual atomic propositions. The truth value is an element of the trivial Boolean algebra Z_2, i.e., either 0 or 1.
The truth value of a proposition is completely determined by assigning the truth values for the simplest propositions A ∧ B, A ∨ B and A′: A ∧ B is defined to be true if and only if both A and B are true; A ∨ B is false if and only if both A and B are false; the value of A′ is complementary to that of A.
A proposition with n elementary propositions defines a function (Z_2)^n → Z_2. Two propositions are called logically equivalent if and only if they define the same function. By the previous paragraphs, the set of all classes of logically equivalent propositions has the structure of a Boolean algebra, so propositional logic satisfies everything proved for general Boolean algebras. Next, we consider how other usual simple propositions of propositional logic are represented as elements of the Boolean algebra.
Expressions always correspond to a class of logically equivalent propositions:

The standard logical operators
(1) A AND B corresponds to A ∧ B,
(2) A OR B corresponds to A ∨ B,
(3) the implication A ⇒ B can be obtained as A′ ∨ B,
(4) the equivalence A ⇔ B corresponds to (A ∧ B) ∨ (A′ ∧ B′),
(5) the exclusive OR, known as A XOR B, is given as (A ∧ B′) ∨ (A′ ∧ B),
(6) the negation of OR, A NOR B, is expressed as A′ ∧ B′,
(7) the negation of AND, A NAND B, is given as A′ ∨ B′,
(8) a tautology (a proposition which is always true) is given in terms of an arbitrary atomic proposition as A ∨ A′; its negation (always false) is A ∧ A′.

Note that in set algebra, XOR corresponds to the symmetric difference.

12.A.13. Design a logical proposition in three Boolean variables that takes the value True if and only if the majority of the variables are True.
Solution. For this purpose, a disjunctive normal form is quite suitable. Consider the proposition (a ∧ b ∧ c′) ∨ (a ∧ b′ ∧ c) ∨ (a′ ∧ b ∧ c) ∨ (a ∧ b ∧ c). □

12.A.14. Let A be the free Boolean algebra generated by countably many generators a_1, . . . , a_n, . . . with the operations ∧, ∨ and ′, whose elements are finitely many of these generators linked by the operations. Prove that this algebra is atomless.
Solution. Suppose f is an atom in A. Without loss of generality, f can be expressed in terms of the first n generators a_1, . . . , a_n. Then f ∧ a_{n+1} is neither f nor 0, so f cannot be an atom in A. □

12.A.15. Prove the monotonicity laws for Boolean algebras: if A ≤ Ã and B ≤ B̃, then
i) A ∨ B ≤ Ã ∨ B̃;
ii) A ∧ B ≤ Ã ∧ B̃;
iii) (Ã)′ ≤ A′.
Solution. Since (A ∨ B) ∨ (Ã ∨ B̃) = (A ∨ Ã) ∨ (B ∨ B̃) = Ã ∨ B̃, the first assertion follows. The second follows by duality from the fact that A ≤ Ã if and only if A ∧ Ã = A. An application of the De Morgan laws, (Ã)′ ∨ A′ = (A ∧ Ã)′ = A′, verifies the third. □

12.A.16. For Boolean algebras, prove that
i) A ≤ B if and only if B′ ≤ A′;
ii) A ≤ B if and only if A ∧ B′ = 0;
iii) C ∧ A ≤ B if and only if A ≤ C′ ∨ B.
Solution. By the previous exercise, A ≤ B yields B′ ≤ A′, and vice versa, A = (A′)′ ≤ (B′)′ = B. If A ≤ B, then A = A ∧ B and A ∧ B′ = A ∧ B ∧ B′ = 0. Conversely, from A ∧ B′ = 0 it follows that A = A ∧ (B ∨ B′) = (A ∧ B) ∨ (A ∧ B′) = A ∧ B, so A ≤ B. For the third assertion: since C ∧ A ≤ B and C′ ∧ A ≤ C′, the monotonicity of ∨ gives A = (C ∧ A) ∨ (C′ ∧ A) ≤ B ∨ C′. On the other hand, if A ≤ C′ ∨ B, then by the monotonicity of ∧, C ∧ A ≤ C ∧ (C′ ∨ B) = (C ∧ C′) ∨ (C ∧ B) = C ∧ B ≤ B. □

12.1.6. Switch boards as Boolean algebras. A switch is a black box with only two states – it is either on (and the signal goes through) or off (and the signal does not go through). One or more switches may be interconnected in a series circuit or a parallel circuit. The series circuit corresponds to the binary operation ∧, while the parallel circuit corresponds to ∨. The unary operation A′ defines a switch whose state is always the opposite of that of A.
Every finite word created from the switches A, B, . . . and the operations ∧, ∨, and ′ can be transformed into a diagram that represents a system of switches connected by wires, similarly as in the above subsection, where each choice of states of the individual switches gives the value "on/off" for the entire system. We thus speak of switchboards and their logical evaluation function. Again, it is easy to verify all the axioms of Boolean algebras for this system. The diagram illustrates one of the distributivity axioms. The circuit without a switch corresponds to 1.
When the endpoints are not connected, this corresponds to 0 (consider a series circuit of A and A′). Draw diagrams for all axioms of Boolean algebras and verify them! We return to this example shortly, showing that each expression in propositional logic can be modeled by a switchboard.

12.A.17. For any two elements A, B of a Boolean algebra, the symmetric difference of A and B is defined as A △ B = (A ∧ B′) ∨ (B ∧ A′). This is analogous to the symmetric difference of sets, A △ B = (A \ B) ∪ (B \ A). Show that
i) A = B if and only if A △ B = 0;
ii) A △ (B △ C) = (A △ B) △ C;
iii) A ∧ (B △ C) = (A ∧ B) △ (A ∧ C).
Solution. i) If A = B, then A ≤ B and B ≤ A, and therefore, by the previous exercise, A ∧ B′ = 0 and B ∧ A′ = 0; thus A △ B = 0. On the other hand, A △ B = 0 implies that both A ∧ B′ = 0 and B ∧ A′ = 0, which yield A ≤ B and B ≤ A, thus A = B.
ii) Expanding (B △ C)′ using distributivity and the De Morgan laws, we obtain (B △ C)′ = (B′ ∨ C) ∧ (C′ ∨ B) = (B ∧ C) ∨ (B′ ∧ C′). Further expanding A △ (B △ C), we get
A △ (B △ C) = (A ∧ ((B ∧ C) ∨ (B′ ∧ C′))) ∨ (A′ ∧ ((B ∧ C′) ∨ (B′ ∧ C))) = (A ∧ B ∧ C) ∨ (A ∧ B′ ∧ C′) ∨ (A′ ∧ B ∧ C′) ∨ (A′ ∧ B′ ∧ C),
which equals (A △ B) △ C by the obvious symmetry.
iii) This follows from A ∧ (B △ C) = A ∧ ((B ∧ C′) ∨ (C ∧ B′)) = (A ∧ B ∧ C′) ∨ (A ∧ C ∧ B′) and (A ∧ B) △ (A ∧ C) = (A ∧ B ∧ (A′ ∨ C′)) ∨ (A ∧ C ∧ (A′ ∨ B′)) = (A ∧ B ∧ C′) ∨ (A ∧ C ∧ B′). □

12.A.18. The Sheffer stroke is defined as A | B = A′ ∨ B′ (i.e., (A ∧ B)′), and the Peirce arrow as A ↓ B = A′ ∧ B′ (i.e., (A ∨ B)′). Prove that each of | and ↓ alone suffices to define A ∨ B, A ∧ B and A′.
Solution. It is clear from the De Morgan laws that A′ = A ↓ A and A ∧ B = (A ↓ A) ↓ (B ↓ B). Since A ∨ B = (A′ ∧ B′)′, we obtain that ↓ generates all three Boolean operations. Analogously, A′ = A | A and A ∨ B = A′ | B′ = (A | A) | (B | B). The De Morgan law A ∧ B = (A′ ∨ B′)′ proves that | also generates all three Boolean operations. □

12.A.19. The exclusive-or binary operation in a Boolean algebra is defined as A ⊕ B = (A ∨ B) ∧ (A′ ∨ B′). Prove that ⊕ and ∧ do not generate all Boolean operations.
Solution. Any expression built only from the operations ⊕ and ∧ yields the output False whenever all its inputs are False. Thus, A′ (which is True for A False) cannot be obtained from a proposition A using only ⊕ and ∧. □

12.1.7. Algebra of divisors. There are other natural examples of Boolean algebras. Choose a positive integer p ∈ N. The underlying set D_p is the set of all divisors q of p. For two such divisors q, r, define q ∧ r to be the greatest common divisor of q and r, and q ∨ r to be their least common multiple (cf. the previous chapter for the definitions and context). The distinguished element 1 ∈ D_p is defined to be p itself. The neutral element 0 for the join on D_p is the integer 1 ∈ N. The unary operation ′ is defined using division as q′ = p/q.

Proposition. The set D_p together with the above operations ∧, ∨, and ′ is a Boolean algebra if and only if the factorization of p contains no squares (i.e., in the unique factorization p = q_1 · · · q_n, the prime numbers q_i are pairwise distinct).

Proof. It is easy to verify the axioms of Boolean algebras under the assumptions of the proposition. It might be interesting to see where the assumption that p is squarefree is needed.
The greatest common divisor of a finite number of integers is independent of their order, and the same holds for the least common multiple. This corresponds to the axioms (1) and (2) in 12.1.2. The commutativity (3) is clear. For any three elements a, b, c, write their factorizations without loss of generality as
a = q_1^{n_1} · · · q_s^{n_s}, b = q_1^{m_1} · · · q_s^{m_s}, c = q_1^{k_1} · · · q_s^{k_s}.
Zero powers are allowed, and all q_j are pairwise coprime. Thus, a ∧ b ∈ D_p corresponds to the element in which each q_i occurs with the power that is the minimum of its powers in a and b; this holds analogously for a ∨ b and the maximum. The distributivity laws (4) and (5) of 12.1.2 now follow easily.
There is no problem with the existence of the distinguished elements 0 and 1. These are already defined directly and clearly satisfy the axioms (6) and (7). However, if there are squares in the factorization, then this prevents the existence of complements. For instance, in D_12 = {1, 2, 3, 4, 6, 12}, the equality 6 ∧ 6′ = 1 cannot be achieved, since 6 has a non-trivial divisor which is common to all other elements of D_12 except for 1, but 6 ∨ 1 = 6 ̸= 12. (The number 1 is the potential least element in D_12; it plays the role of 0 from the axioms.) Nevertheless, if there are no squares in the factorization of the integer p, then the complement can be defined as q′ = p/q, and it is easily verified that this definition satisfies the axiom 12.1.2(8). □
If there are no squares in the decomposition of p, then the number of all divisors is a power of 2. This suggests that these Boolean algebras are very similar to the set algebras we started with. We return to the classification of all finite Boolean algebras later. Before that, we consider structures like the divisors above for general p.

12.1.8. Partial order. There is a much more fundamental concept, the partial order; see the end of chapter 1. Recall that a partial order is a reflexive, antisymmetric, and transitive relation ≤ on a set K. A set with a partial order (K, ≤) is called a partially ordered set, or poset for short. The adjective "partial" means that, in general, this relation does not say whether a ≤ b or b ≤ a for every two different elements a, b ∈ K. If it does for each pair, it is called a linear order or a total order.

12.A.20. Translate the following sentence as a logical proposition, taking into account the ambiguity of the English language: Either school X improves its performance and continues to perform well, or school Y wins the academic competition.
Solution. Modern English is sometimes vague about the meaning of "or". Therefore, the sentence can be translated using either ∨ or ⊕ in place of the "or". Let A, B, C represent the propositions "school X improves its performance", "school X continues to perform well", and "school Y wins the academic competition". Then the sentence can be interpreted as
i) (A ∧ B) ∨ C, or
ii) (A ∧ B) ⊕ C.
These two propositions differ when all three propositions A, B, C have the value True. □

12.A.21. Let the logical implication → be defined by A → B = A′ ∨ B. Prove the following:
i) A′ → (B → C) = B → (A ∨ C);
ii) (A → B′)′ ∧ (A ∨ B)′ = 0, that is, the proposition is always False;
iii) (A ∨ B) ∧ (B′ ∨ C) → (A ∨ C) = 1, that is, the proposition is always True.

12.A.22. Show that the following divisibility relations are partial orders on X. Is any one of them a linear order?
i) X = N, with the relation | defined by m | n if m divides n.
ii) X is the set of all positive divisors of 36, again with the relation |.
Solution. i) As n | n for each n, the relation is reflexive. If m | n and n | m, then m = n, so it is antisymmetric. Since r | m and m | n imply r | n, it is also transitive. The order is not linear, since, for example, 4 and 5 are not divisors of each other.
ii) X = {1, 2, 3, 4, 6, 9, 12, 18, 36}. By the previous item, | is reflexive, antisymmetric, and transitive. However, it is again not linear: for instance, the divisors 2 and 3 are not related. □
12.A.23. Given a partial order ≤ on a set A, one can also define a corresponding strict order <, which is also useful in various situations. A relation R on A is called irreflexive if (x, x) ̸∈ R for all x ∈ A. R is called asymmetric if (x, y) ∈ R implies (y, x) ̸∈ R. A relation R on A is a strict order if it is transitive, irreflexive, and asymmetric. Let A be a set.
i) If ≤ is a partial order on A, then the relation < defined by a < b if a ≤ b and a ̸= b is a strict order on A.
ii) Given a strict order < on A, a partial order ≤ can be defined by a ≤ b if a < b or a = b.
Solution. Suppose ≤ is a partial order, with < defined by a < b if a ≤ b and a ̸= b. Then < is irreflexive, since a < b requires a ̸= b. Now suppose that a < b and b < c; then a ≤ c by the transitivity of ≤. If a = c, then the antisymmetry of ≤ (applied to a ≤ b and b ≤ a) would give a = b, a contradiction. Hence a < c, and < is transitive. Asymmetry follows similarly: a < b and b < a would yield a = b by antisymmetry. The second part is verified analogously. □

There is always a partial order on the set K = 2^M of all subsets of a given set M – the inclusion. In terms of meets or joins, the inclusion can be defined as A ⊆ B if and only if A ∧ B = A, or equivalently, A ⊆ B if and only if A ∨ B = B. In general, each Boolean algebra is a very special poset:

Lemma. Let (K, ∧, ∨) be a Boolean algebra. Then the relation ≤ defined by A ≤ B if and only if A ∧ B = A is a partial order. Moreover, for all A, B, C ∈ K:
(1) A ∧ B ≤ A,
(2) A ≤ A ∨ B,
(3) if A ≤ C and B ≤ C, then A ∨ B ≤ C,
(4) A ≤ B if and only if A ∧ B′ = 0,
(5) 0 ≤ A and A ≤ 1.

Proof. All the properties to be proved are results of simple calculations in the Boolean algebra K. Begin with the properties of a partial order for ≤. Reflexivity is a direct corollary of idempotency: A ∧ A = A, i.e., A ≤ A. Similarly, the commutativity of ∧ guarantees the antisymmetry of ≤, since if both A ∧ B = A and B ∧ A = B, then A = A ∧ B = B ∧ A = B. Finally, if A ∧ B = A and B ∧ C = B, then
A ∧ C = (A ∧ B) ∧ C = A ∧ (B ∧ C) = A ∧ B = A,
which verifies the transitivity of ≤.
Similarly, (A ∧ B) ∧ A = (A ∧ A) ∧ B = A ∧ B, that is, A ∧ B ≤ A, proving (1). It follows from A ∧ (A ∨ B) = A (see 12.1.3(2)) that A ≤ A ∨ B, which proves the claim (2). Distributivity together with the assumptions of (3) provides
(A ∨ B) ∧ C = (A ∧ C) ∨ (B ∧ C) = A ∨ B,
so (3) holds. The claim (5) follows directly from the axioms for the distinguished elements 1 and 0. It remains to prove (4). If A ≤ B, then A ∧ B′ = A ∧ B ∧ B′ = 0. On the other hand, if A ∧ B′ = 0, then A = A ∧ 1 = A ∧ (B ∨ B′) = (A ∧ B) ∨ (A ∧ B′) = (A ∧ B) ∨ 0 = A ∧ B. Hence A ≤ B, and the proof is finished. □

Note that, as for the algebra of subsets, in all Boolean algebras A ∧ B = A if and only if A ∨ B = B: if A ∧ B = A, then the absorption laws imply that A ∨ B = (A ∧ B) ∨ B = B, and vice versa. Therefore, the operation ∨ can also be used in the definition of the partial order.
Every poset (K, ≤) corresponds to an (oriented) graph (cf. the beginning of chapter 13 for definitions if necessary): the vertex set is K, and there is an edge leading from a to b if and only if a ≤ b. This is a convenient way to represent finite posets. A Hasse diagram of a poset is a drawing of this graph in the plane so that greater elements are drawn above lower ones. Since the edge orientation is implicitly given by this, it need not be drawn explicitly. Furthermore, loops and edges which are implied by transitivity and reflexivity are omitted in the diagram. Especially when K has only a few elements, this is a very transparent way of discussing several cases; see the examples in the exercise column.

12.A.24. Prove that a finite poset A can be embedded into a totally ordered set whose order extends the original order on A.
Solution. Let (A, ≤) be a finite poset with a minimal element a_1. The set A \ {a_1} remains a poset, with some minimal element a_2.
Choose a_3, . . . , a_n similarly until A is exhausted. Defining a_i ≤ a_j for i ≤ j gives a total order on A which extends the original one. □

12.A.25. Convert the poset of divisors of 36 into a linearly ordered set in different ways.
Solution. One order is the natural order in N: 1, 2, 3, 4, 6, 9, 12, 18, 36. Another order can be chosen as 1, 2, 4, 3, 6, 12, 9, 18, 36. □

12.A.26. Prove that lattices of the diamond and pentagon types are not distributive.
Solution. i) For the diamond lattice, with pairwise incomparable elements a, b, c between 0 and 1, we get a ∧ (b ∨ c) = a ∧ 1 = a, while (a ∧ b) ∨ (a ∧ c) = 0 ∨ 0 = 0.
ii) For the pentagon lattice, labelled so that 0 < b < c < 1 and a is incomparable to both b and c, we get c ∧ (a ∨ b) = c ∧ 1 = c, while (c ∧ a) ∨ (c ∧ b) = 0 ∨ b = b ̸= c. □

12.A.27. Prove the uniqueness of complements in distributive lattices, i.e., if (A, ≤) is a bounded distributive lattice with minimum 0 and maximum 1, then complements (when they exist) are unique.
Solution. Let x′ and z both denote complements of x ∈ A. Then
x′ = x′ ∧ 1 = x′ ∧ (x ∨ z) = (x′ ∧ x) ∨ (x′ ∧ z) = 0 ∨ (x′ ∧ z) = x′ ∧ z,
and, symmetrically, z = z ∧ (x ∨ x′) = (z ∧ x) ∨ (z ∧ x′) = z ∧ x′. Hence x′ = x′ ∧ z = z. □

12.A.28. Anne, Brenda, Kate, and Dana want to set out on a trip. Find out which of the girls will go if the following must hold: At least one of Brenda and Dana will go; at most one of Anne and Kate will go; at least one of Anne and Dana will go; at most one of Brenda and Kate will go; Brenda will not go unless Anne goes; and Kate will go if Dana goes.
Solution. Transforming the problem to Boolean algebra, simplifying, and transforming back, we find out that either Anne and Brenda will go, or Kate and Dana will go. □

12.1.9. Lattices. Not every poset arises in the latter way from a Boolean algebra. For instance, the trivial partial order is defined on any set by A ≤ A for each A, with all pairs of different elements incomparable. Such a poset cannot arise from a Boolean algebra if K contains more than one element (as we have seen, the least and greatest elements of a Boolean algebra are comparable to every element).
Think about to what extent the operations ∧ and ∨ can be built from a partial order. They are the suprema and infima of the following definition:

Lower and upper bounds, suprema, infima
Consider a fixed poset (K, ≤). An element C ∈ K is said to be a lower bound for a subset L ⊆ K if and only if C ≤ A for all A ∈ L. An element C ∈ K is said to be the greatest lower bound (or infimum) of a subset L ⊆ K if and only if it is a lower bound and D ≤ C for every lower bound D of L. Replacing ≤ with ≥ in the above, we obtain the definitions of an upper bound and of the least upper bound (or supremum) of a subset L.

If the suprema and infima exist for all couples A, B, they define the binary operations ∨ and ∧, respectively.

Lattices
Definition. A lattice is a poset (K, ≤) where every two-element set {A, B} has a supremum A ∨ B and an infimum A ∧ B. The poset (K, ≤) is said to be a complete lattice if and only if every subset of K has a supremum and an infimum.

The binary operations ∧ and ∨ on a lattice (K, ≤) are clearly commutative and associative (prove this in detail!). These properties ensure that all finite non-empty subsets of K possess infima and suprema. Note that any element of a lattice K is an upper bound for the empty set. Thus, in a complete lattice, the supremum of the empty set is the least element 0 of K.
Similarly, the infimum of the empty set is the greatest element 1 of K. Of course, a finite lattice (K, ≤) is always complete (with 1 being the supremum of all elements in K and 0 the infimum of all elements in K).
Remark. The poset of real numbers, completed with the greatest and least elements ±∞, is a complete lattice (with the standard ordering). We may view it as a completion of the poset of rational numbers. The classical Dedekind cut construction of this completion of the rational numbers extends to all posets. Indeed, the so-called Dedekind–MacNeille completion provides the unique smallest complete lattice containing a given poset. We shall not go into any details here.
A lattice is said to be distributive if and only if the operations ∧ and ∨ satisfy the distributivity axioms (4) and (5) of subsection 12.1.2. There are lattices which are not distributive; see the Hasse diagrams of the two simple lattices (the diamond and the pentagon) in exercise 12.A.26, and check that in both cases x ∧ (y ∨ z) ̸= (x ∧ y) ∨ (x ∧ z).
Now, Boolean algebras can be defined in terms of lattices: a Boolean algebra is a bounded distributive lattice in which each element has a complement (i.e., the axiom 12.1.2(8) is satisfied). It is already verified that the latter requirement determines the complements uniquely (see the ideas at the beginning of subsection 12.1.3), which means that this alternative definition of Boolean algebras is correct. During the discussion of the divisors of a given integer p, the distributive lattices D_p were encountered; these are Boolean algebras if and only if p is squarefree, see 12.1.7.

12.A.29. Solve the following problem by transforming it to Boolean algebra: Tom, Paul, Sam, Ralph, and Lucas are suspected of having committed a murder. It is certain that at the crime scene, there were: at least one of Tom and Ralph, at most one of Lucas and Paul, and at least one of Lucas and Tom. Sam could be there only if so was Ralph. However, if Sam was there, then so was Tom. Paul could never cooperate with Ralph, but Paul and Tom are an inseparable pair. Who committed the murder?
Solution. Transforming into Boolean algebra, using the first letter of each name (and writing + for ∨ and juxtaposition for ∧), we get
(t + r)(l′ + p′)(l + t)(r + s′)(s′ + t)(p′ + r′)(pt + p′t′),
and thanks to x² = x, xx′ = 0, x + x′ = 1, we can rearrange the above to s′r′ptl′ + s′rp′t′l. Thus, the murder was committed either by Tom and Paul or by Ralph and Lucas. □

12.A.30. A vote box for three voters is a box which processes three votes and outputs "yes" if and only if the majority of the voters are for. Design this box using switching circuits.
Solution. The majority function of 12.A.13 can be realized as the parallel connection of the three series circuits a ∧ b, a ∧ c, and b ∧ c. □

12.A.31. Find a finite subset of the set of positive integers which is not a lattice with respect to divisibility. ⃝

12.A.32. Find the number of partial orders on a given 4-element set. Draw the Hasse diagram of each isomorphism class and determine whether it is a lattice. Is one of them a Boolean algebra?
Solution. We go through all Hasse diagrams of the partial orders on a 4-element set M and, for each diagram, we count the number of partial orders (i.e., subsets of M × M) that correspond to it; see the picture. Altogether, there are 219 partial orders on a given 4-element set.
Note that the condition of the existence of suprema and infima of any pair of elements in a lattice implies (by induction) their existence for any finite non-empty subset. In particular, this means that every non-empty finite lattice has a greatest element as well as a least element. Using this criterion, we can see that only the last two Hasse diagrams may be lattices. Indeed, they are lattices; the first one is even a Boolean algebra. □
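The divisor lattices D_p of 12.1.7 also invite experiment. In the Python sketch below (our own illustration), the meet is gcd, the join is lcm, and the candidate complement is q′ = p/q; the test succeeds for the squarefree p = 30 and fails for p = 12, in accordance with the proposition of 12.1.7.

from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def is_boolean_algebra(p):
    divisors = [q for q in range(1, p + 1) if p % q == 0]
    # check q ∧ q' = 1 and q ∨ q' = p for the candidate complement q' = p/q
    return all(gcd(q, p // q) == 1 and lcm(q, p // q) == p for q in divisors)

print(is_boolean_algebra(30))   # True:  30 = 2·3·5 is squarefree
print(is_boolean_algebra(12))   # False: 12 contains the square 2^2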
12.1.10. Homomorphisms. When dealing with mathematical structures, most information about the objects can be obtained and understood via homomorphisms — mappings which preserve the corresponding operations. The linear mappings between vector spaces, or the continuous mappings on R^n or on any metric space with the given topology of open neighbourhoods, represent very good examples. This concept is particularly simple for posets:

Poset homomorphisms
Let (K, ≤_K) and (L, ≤_L) be posets. A mapping f : K → L is called a poset homomorphism (also an order-preserving, monotone, or isotone mapping) if A ≤_K B implies f(A) ≤_L f(B).

Although the structure of a Boolean algebra is completely determined by its subordinate poset structure, an isotone mapping does not necessarily respect suprema and infima. Non-comparable elements A, B can be mapped to the same image f(A) = f(B), while their supremum could be mapped to a strictly larger element f(A ∨ B).

12.A.33. Find the number of partial orders on the set {1, 2, 3, 4, 5} such that there are exactly two pairs of incomparable elements. ⃝

12.A.34. Draw the Hasse diagram of the lattice of all (positive) divisors of 36. Is this lattice distributive? Is it a Boolean algebra?
Solution. The lattice is distributive (it does not contain a sublattice isomorphic to the diamond or the pentagon), but it is not a Boolean algebra: 36 is not squarefree, so complements are missing (cf. 12.1.7). □

12.A.35. Draw the Hasse diagram of the lattice of all (positive) divisors of 30. Is this lattice distributive? Is it a Boolean algebra?
Solution. This lattice is a Boolean algebra, and it has 8 elements. All finite Boolean algebras are of size 2^n for an appropriate n, and for a fixed n, they are all isomorphic (see 12.1.16). This Boolean algebra is a "cube": its graph can be drawn as the projection of a cube onto the plane. □

In the case of Boolean algebras, homomorphisms are defined as follows:

Lattice and Boolean-algebra homomorphisms
A mapping f : (K, ∧, ∨) → (L, ∧, ∨) is a homomorphism of Boolean algebras if and only if for all A, B ∈ K,
(1) f(A ∧ B) = f(A) ∧ f(B),
(2) f(A ∨ B) = f(A) ∨ f(B),
(3) f(A′) = f(A)′.
Moreover, if f is bijective, it is an isomorphism of Boolean algebras. Similarly, lattice homomorphisms are defined as mappings which satisfy the properties (1) and (2).

It is easily verified that if a homomorphism f is bijective, then f^{−1} is also a homomorphism. It is clear from the definition of the partial order on Boolean algebras or lattices that every homomorphism f : K → L also satisfies f(A) ≤ f(B) for all A, B ∈ K with A ≤ B; i.e., it is in particular a poset homomorphism. Moreover, a homomorphism of Boolean algebras satisfies f(0) = 0 and f(1) = 1. The converse of the above is generally not true; that is, a poset homomorphism need not be a lattice homomorphism.

12.1.11. Fixed-point theorems. Many practical problems lead to a discussion of the existence and properties of fixed points of a mapping f : K → K on a set K, i.e., of elements x ∈ K such that f(x) = x. The concepts of infima and suprema allow the derivation of very strong propositions of this type surprisingly easily. Here follows a classical theorem proved by Knaster and Tarski1:

Tarski's fixed-point theorem
Theorem. Let (K, ∧, ∨) be a complete lattice and f : K → K a poset homomorphism.
Then f has a fixed point, and the set of all fixed points of f, together with the ordering restricted from K, is again a complete lattice.

Proof. Denote M = {x ∈ K; x ≤ f(x)}. Since K has a least element, M is non-empty. Since f is order-preserving, f(M) ⊆ M. Denote z_1 = sup M. Then, for x ∈ M, x ≤ z_1, which means that f(x) ≤ f(z_1). At the same time, x ≤ f(x); hence f(z_1) is an upper bound for M. Then z_1 ≤ f(z_1), so z_1 ∈ M; consequently f(z_1) ∈ M, and hence f(z_1) ≤ z_1. It follows that f(z_1) = z_1, so a fixed point is found.

1 Knaster and Tarski proved this in the special case of the Boolean algebra of all subsets of a given set already in 1928, cf. Ann. Soc. Polon. Math. 6: 133–134. Much later, in 1955, Tarski published the general result, cf. Pacific Journal of Mathematics 5:2, 285–309. Alfred Tarski (1901–1983) was a renowned and influential Polish logician, mathematician and philosopher, who worked most of his active career in Berkeley, California. His elder colleague Bronisław Knaster (1893–1980) was also a Polish mathematician.

12.A.36. Decide whether every lattice on a 3-element set is a chain, i.e., whether each pair of elements is necessarily comparable.
Solution. As we noticed in exercise 12.A.32, every finite non-empty lattice must contain a greatest element and a least element. Each of these is comparable to any other element, so the remaining element is comparable with these two; and there are no other elements. □

12.A.37. Find an example of two lattices and a poset homomorphism between them which is not a lattice homomorphism.
Solution. Again, we return to exercise 12.A.32 and consider the mapping in the figure. □

12.A.38. Decide whether every lattice homomorphism between finite non-empty lattices K, L maps the least element of K to the least element of L.
Solution. No: any constant mapping between two lattices is a lattice homomorphism. Thus, sending everything to an element different from the least one gives the desired counterexample. □

12.A.39. Decide whether every chain which has a greatest element and a least element is a complete lattice.
Solution. No. Consider the set of non-zero integers and order it as follows: any positive integer is greater than any negative integer, but the ordering among the positive integers is reversed, as well as among the negative integers. Then, 1 is the greatest element of the resulting chain, and −1 is the least element. However, the subset of all positive integers does not have an infimum in this poset. Formally, we define the linear order ≺ on Z \ {0} by
a ≺ b ⇐⇒ [(sgn(a) · sgn(b) = 1 ∧ a > b) ∨ (sgn(a) < sgn(b))]. □

It is more difficult to verify the last statement of the theorem, namely that the set Z ⊆ K of all fixed points of f is a complete lattice. The greatest element z_1 = max Z has already been found. Analogously, using the infimum and the property f(x) ≤ x in the definition of M, we find the least element z_0 = min Z.
Consider any non-empty set Q ⊆ Z and denote y = sup Q (in K). This supremum need not lie in Z. However, as we shall see shortly, Q has a supremum with respect to the partial order of K restricted to Z. For that purpose, denote R = {x ∈ K; y ≤ x}. It is clear from the definitions that this set, together with the partial order of K restricted to R, is again a complete lattice, and that the restriction of f to R is again a poset homomorphism f|_R : R → R. By the above, f|_R has a least fixed point ȳ.
Of course, ȳ ∈ Z, and ȳ is the supremum of the set Q with respect to the inherited order on Z. Note that it is possible that ȳ > y. Analogously, the infimum of any non-empty subset of Z can be found. Since the least and greatest elements have already been found, the proof is finished. □

Remark. In the literature, one may find many variants of fixed-point theorems, in various contexts. One very useful variant is Kleene's recursion theorem, which can be derived from the theorem just proved and formulated as follows. Consider a poset homomorphism f (using the notation of Tarski's fixed-point theorem) and the countable subset of K formed by the Kleene chain 0 ≤ f(0) ≤ f(f(0)) ≤ · · · . Then, the supremum z of this chain cannot be greater than any fixed point of f: if y is a fixed point of f, then it follows from 0 ≤ y that f(0) ≤ f(y) = y, etc. Moreover, if f is assumed continuous in a certain sense of reasonably preserving suprema, then it can be shown that f(z) is also the supremum of this chain and hence is a fixed point. Therefore, it is the smallest fixed point. This statement is called Kleene's fixed-point theorem. It has many applications in recursion theory, when discussing the termination of algorithms, etc. We omit details about the necessary "continuity" of mappings between posets and further generalizations.2 We point out the added value compared to the general formulation of Tarski's theorem — Kleene's theorem provides an iterative computational process approaching the least fixed point from the given "seed", the minimal point.

2 Stephen Cole Kleene (1909–1994) was a famous American mathematician working with Church, Turing, Post and others. The interested reader may consult a full exposition of the above-mentioned theorem in chapter 1 of the book: V. Stoltenberg-Hansen, I. Lindström, E. R. Griffor, Mathematical Theory of Domains, Cambridge Tracts in Theoretical Computer Science, Cambridge University Press, 1994.

12.A.40. Give an example of an infinite chain which is a complete lattice.
Solution. We can take the set of real numbers together with −∞, ∞, where −∞ is the least element (and thus the supremum of the empty set) and ∞ is the greatest element (and thus the infimum of the empty set). The lattice suprema and infima are thus defined in accordance with these concepts for the real numbers. Moreover, −∞ is the infimum of the subsets which are not bounded from below, and similarly, ∞ is the supremum of the subsets which are not bounded from above. □

12.A.41. Decide whether the set of all convex subsets of R^3 is a lattice (with respect to suitably defined operations of suprema and infima). If so, is this lattice complete? Distributive?
Solution. It is a lattice. The infimum is simply the intersection, since the intersection of convex subsets is again convex. The supremum is the convex hull of the union. It is clear that the lattice axioms are indeed satisfied for these operations (think this out!). The lattice is complete, since the above operations work for infinite systems as well, and clearly, the lattice has both a least element (the empty set) and a greatest element (the entire space). However, the lattice is not distributive. For example, consider three unit balls B_1, B_2, B_3 centered at [3, 0, 0], [−3, 0, 0], [0, 0, 0], respectively. Then, B_1 ∨ (B_2 ∧ B_3) = B_1 ̸= (B_1 ∨ B_3) ∧ (B_1 ∨ B_2). □
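The iterative flavour of Kleene's theorem can be seen on a tiny finite lattice. The following Python sketch (our own toy example) iterates an order-preserving map on the power-set lattice of {1, 2, 3}, starting from the least element ∅, until the Kleene chain 0 ≤ f(0) ≤ f(f(0)) ≤ · · · stabilizes at the least fixed point.

M = frozenset({1, 2, 3})

def f(x):
    # an order-preserving map on the lattice 2^M: adjoin the element 1
    return x | {1}

x = frozenset()      # the least element 0 = ∅, the seed of the Kleene chain
while f(x) != x:     # iterate until a fixed point is reached
    x = f(x)
print(x)             # frozenset({1}) -- the least fixed point of f

In a finite lattice the chain must stabilize; the "continuity" assumption discussed above is what guarantees the analogous behaviour in the infinite case.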
Solution. This is a lattice, infima correspond to intersections and suprema to sums of vector spaces (it is easy to verify that these operations satisfy the lattice axioms). This lattice is complete (the operations work for infinite subsets as well, the least element is the zero-dimensional subspace, and the greatest element is the entire space). However, it is not distributive (consider three lines in a plane). □ CHAPTER 12. ALGEBRAIC STRUCTURES 12.1.12. Back to Boolean algebras. When discussing propositional logic, there is the problem of what exactly are the elements of the corresponding Boolean algebra. Formally, they are defined as the classes of equivalent propositions. In other words, we work with truth-value functions for propositions with a given number of arguments. There is the problem of recognizing propositions which are equivalent in this sense. There is the question of whether every function (Z2)n → Z2 can be defined in terms of the basic logical operations. Clearly all such functions form a Boolean algebra, since their values are in the Boolean algebra Z2. Similarly, there is the problem of deciding whether or not two systems of switches can have the same function. Just as for propositions, a system consisting of n switches corresponds to a function (Z2)n → Z2. There are 22n such functions. A Boolean algebra can be naturally defined on these functions (again using the fact that the function values are in the Boolean algebra Z2). We summarize a few such questions: some basic questions Question 1: Are all finite Boolean algebras (K, ∧, ∨) defined on sets K with 2n elements? Question 2: Can each function (Z2)n → Z2 be the truth function of some logical expression built of n elementary propositions and the logical operators? Question 3: How to recognize whether two such expressions represent the same function? Question 4: Can each function (Z2)n → Z2 be realized by some switch board with n switches? Question 5: How to recognize whether two switchboards represent the same function? All these questions are answered by finding the normal form of every element of a general Boolean algebra. This is achieved by writing it as the join of certain particularly simple elements. By comparing the normal forms of any pair of elements, it is easily determined whether or not they are the same. This helps to classify all finite Boolean algebras, giving the affirmative answer to question 1. 12.1.13. Atoms and normal forms. First, define the “simplest” elements of a Boolean algebra: Atoms in a Boolean algebra Let K be a Boolean algebra. An element A ∈ K, A ̸= 0, is called an atom if and only if for all B ∈ K, A ∧ B = A or A ∧ B = 0. In other words, A ̸= 0 is an atom if and only if there are only two elements B such that B ≤ A, namely B = 0 and B = A. Note that 0 is not considered an atom, just as the integer 1 is not considered a prime. Let us remark that infinite Boolean algebras may contain no atoms at all. 1052 B. Rings 12.B.1. Decide whether the set R with the operations ⊕, ⊙ form a ring, a commutative ring, an integral domain or a field: i) R = Z, a ⊕ b = a + b + 3, a ⊙ b = −3, ii) R = Z, a ⊕ b = a + b − 3, a ⊙ b = a · b − 1, iii) R = Z, a ⊕ b = a + b − 1, a ⊙ b = a + b − a · b, iv) R = Q, a ⊕ b = a + b, a ⊙ b = b, v) R = Q, a ⊕ b = a + b + 1, a ⊙ b = a + b + a · b, vi) R = Q, a ⊕ b = a + b − 1, a ⊙ b = a + b + a · b. ⃝ Solution. i) not a ring (but is a commutative rng), ii) not a ring, iii) an integral domain, iv) not a ring, v) a field, vi) not a ring. □ 12.B.2. 
Prove that the subset Z[i] = {a + bi | a, b ∈ Z} of the complex numbers is an integral domain. Is it a field? Solution. Any subring of an integral domain must be an integral domain again. In this case, we are talking about a subset of the field C (thus also an integral domain). Since the subset is closed with respect to all the operations (sum, additive inverse, multiplication) and contains both 0 and 1, it is indeed a subring. However, multiplicative inverses exist only for the numbers 1, i, −1, −i (these form the so-called subgroup of units – invertible elements), so it is not a field. □ 12.B.3. In the ring of 2-by-2 matrices over the real numbers, consider the subring of matrices of the form ( a −b b a ) . Prove that this subring is isomorphic to C. Solution. We will show that the isomorphism is given by the mapping φ : ( a −b b a ) → a + ib. The multiplication in the subring works as follows: ( a −b b a ) · ( c −d d c ) = ( ac − bd −bc − ad bc + ad ac − bd ) , and in, C, we have (a + ib)(c + id) = ac − bd + i(bc + ad). Hence we can see that φ is a homomorphism with respect to multiplication. Since addition is defined componentwise, φ is a homomorphism to it as well. Moreover, this mapping is clearly both injective and surjective, thus it is an isomorphism. □ 12.B.4. Prove that the identity is the only automorphism of the field of real numbers. Solution. Consider an automorphism φ : R → R. Clearly, it must satisfy φ(0) = 0 and φ(1) = 1. Since φ respects addition, we must have for all positive integers n that φ(n) = φ(1 + 1 + · · · + 1) = nφ(1) = n and φ(−n) = −n. Since it respects multiplication, we must have for any integers CHAPTER 12. ALGEBRAIC STRUCTURES The situation is very simple in the Boolean algebra of all subsets of a given finite set M. Clearly, the atoms are precisely the singletons A = {x}. For every subset B, either A ∧ B = A (if x ∈ B) or A ∧ B = 0 (if x /∈ B). The requirements fail whenever there is more than one element in A. Next, consider which elements are atoms in the Boolean algebra of functions of the switch boards with n switches A1, . . . , An. It can be easily verified that there are 2n atoms, which are of the form Aσ1 1 ∧· · · ∧Aσn n , where either Aσi i = Ai or Aσi i = A′ i. The infimum φ ∧ ψ of functions φ and ψ is the function whose values are given by the products of the corresponding values in Z2. Therefore, φ ≤ ψ if φ takes the value 1 ∈ Z2 only on arguments where ψ also has value 1. Hence in the Boolean algebra of truth-value functions, a function φ is an atom if and only if φ returns 1 ∈ Z2 for exactly one of the 2n possible choices of arguments. All these functions can be created in the above mentioned manner. Now, the promised theorem can be formulated. While this one is called the disjunctive normal form, there is also the opposite version with the suprema and infima interchanged (the conjunctive normal form). Disjunctive normal form Theorem. Each element B of a finite Boolean algebra (K, ∧, ∨) can be written as a supremum of atoms B = A1 ∨ · · · ∨ Ak. This expression is unique up to the order of the atoms. The proof takes several paragraphs, but the basic idea is quite simple: Consider all atoms A1, A2, . . . , Ak in K which are less or equal to B. From the properties of the order on K, (see 12.1.8(3)) it follows that Y = A1 ∨ · · · ∨ Ak ≤ B. The main step of the proof is to verify that B∧Y ′ = 0, which by 12.1.8(4) guarantees that B ≤ Y . That proves the equality B = Y . 12.1.14. Three useful claims. 
We derive several technical properties of atoms, in order to complete the proof of the theorem on disjunctive normal form. We retain the notation of the previous subsection. Proposition. (1) If Y, X1, . . . , Xℓ are atoms in K, then Y ≤ X1 ∨ . . . ∨ Xℓ if and only if Y = Xi for some i = 1, . . . , ℓ. (2) For each Y ∈ K, Y ̸= 0, there is an atom X ∈ K such that X ≤ Y . (3) If X1, . . . , Xr are precisely all the atoms of K, then Y = 0 if and only if Y ∧ Xi = 0 for all i = 1, . . . , r. Proof. (1) If the inequality of the proposition holds, then Y ∧ (X1 ∨ · · · ∨ Xℓ) = Y. 1053 p, q (q ̸= 0) that φ(p) = φ(q · p q ) = φ(p)cdotφ(p q ). Hence, φ(p q ) = p q , i. e., φ(r) = r for all rational numbers r. Consider a positive number x ∈ R. Then, φ(x) = φ (√ x 2 ) = φ ( √ x) 2 ≥ ≥ 0. Thus, for any x, y ∈ R such that x < y, we must have φ(x) < φ(y). Now, assume that φ is not the identity, i. e., there exists a z ∈ R such that φ(z) ̸= z. We can assume without loss of generality that φ(z) < z. Since Q is dense in R, there exists an r ∈ Q for which φ(z) < r < z. However, we know that φ(r) = r, which means that r < z implies φ(r) < φ(z). Altogether, we get the wanted contradiction φ(z) < φ(r) < φ(z). □ 12.B.5. Let p be a prime and R a ring which contains p2 elements. Prove that R is commutative. Solution. Since (R, +) is a finite commutative group with p2 elements, it is by 12.4.8 isomorphic to either Zp2 or Zp × Zp. In the first case, (R, +) is cyclic, so there exists an element x ∈ R such that each element of R is of the form nx for some 1 ≤ n ≤ p2 . Since all these elements commute, we get that the entire R is commutative. In the second case, each element (except 0) must have order p with respect to addition. Let x ∈ R be any element that is not in the additive subgroup generated by 1. Then, each element of R is of the form m + nx, where 1 ≤ m, n ≤ p. Again, all these elements commute, so R is commutative. □ 12.B.6. Find the inverses of 17, 18, and 19 in (Z∗ 131, ·) (the group of all invertible elements in Z131 with multiplication). Solution. Applying the Euclidean algorithm, we get 131 = 7 · 17 + 12, 17 = 1 · 12 + 5, 12 = 2 · 5 + 2, 5 = 2 · 2 + 1. Therefore, 1 = 5−2·2 = 5−2(12−2·5) = 5·5−2·12 = 5·(17−12)−2·12 = 5·17−7·12 = 5·17−7·(131−7·17) = 54·17−7·131. The inverse of 17 is 54. Similarly, [18]−1 = 51 and [19]−1 = 69. □ 12.B.7. Find the inverse of [49]Z253 in Z253 ⃝ 12.B.8. Find the inverse of [37]Z208 in Z208. ⃝ 12.B.9. Find the inverse of [57]Z359 in Z359. ⃝ 12.B.10. Find the inverse of [17]Z40 in Z40. ⃝ C. Polynomial rings 12.C.1. Eisenstein’s irreducibility criterion This criterion provides a sufficient condition for a polynomial over Z to be irreducible over Q (which is the same as to be irreducible over Z): Let f(x) = anxn + an−1xn−1 + · · · + a1x + a0 be a polynomial over Z and p be a prime such that • p divides aj, j = 0, . . . , n − 1, CHAPTER 12. ALGEBRAIC STRUCTURES By distributivity, the equality can be rewritten as (Y ∧ X1) ∨ · · · ∨ (Y ∧ Xℓ) = Y. However, for all i either Y ∧ Xi = 0 or Y ∧ Xi = Xi. If all these intersections are 0, then Y = 0. Thus, there is an i for which Y ∧ Xi = Xi. Since Y is also an atom, the desired equality Y = Xi is proved. The other implication is trivial. (2) If Y is an atom itself, choose X = Y . If Y is not an atom, then it follows from the definition that there must exist a non-zero element Z1 ̸= Y for which Z1 ≤ Y . 
If Z1 is not an atom either, then similarly find a Z2 ≤ Z1, etc., leading to a sequence of pairwise distinct elements · · · ≤ Zk ≤ Zk−1 ≤ · · · ≤ Z1 ≤ Y, which cannot be infinite since the entire Boolean algebra K is finite. Therefore, it must end with an atom Zk. (3) Assume that Y ∧ Xi = 0 for all indices i. If Y ̸= 0, then due to the above claim, there must exist an atom Xj for which Xj ∧ Y = Xj, which is a contradiction. The other implication is trivial. □ 12.1.15. Completion of the proof of theorem 12.1.13. Write Y = A1 ∨ · · · ∨ Ak ≤ B, where Ai are all the atoms in K which are less then or equal to B. Compute B ∧ Y ′ = B ∧ (A1 ∨ · · · ∨ Ak)′ = B ∧ A′ 1 ∧ · · · ∧ A′ k. If an atom A = Ai is contained in the join Y , then B ∧ Y ′ ∧ A = 0. However, if A is an atom which does not occur in Y , then also B ∧ Y ′ ∧ A = 0, since Y contains exactly those atoms which are ≤ B. Hence B ∧ Y ′ ∧ A = 0 for all atoms A in K. Thus it is proved that the intersection of B ∧ Y ′ and any atom is zero, which means that it must be zero itself, by the third claim in the latter proposition. Therefore, B ≤ Y (cf. 12.1.8(4)). The definition of Y implies Y ≤ B, so the antisymmetry of the order implies that B = Y . It remains to prove the uniqueness of the expression, up to order. Thus, suppose B can be written in two ways as B = A1 ∨ · · · ∨ Ak = ˜A1 ∨ · · · ∨ ˜Aℓ. Since each Ai satisfies Ai ≤ B, the first claim in the proposition above ensures it must equal one of the ˜Aj. Repeating this argument gives the desired uniqueness and finishes the proof. 12.1.16. Classification. To end the discussion of Boolean algebras, we prove that all the examples of finite Boolean algebras (of given size) are isomorphic. In particular, each of the 22n truth-value functions for n atomic propositions can be expressed as an appropriate proposition, just like each of the 22n switch board functions can be defined in terms of n suitably arranged switches. In both cases, the algebra in question behaves the 1054 • p does not divide an, • p2 does not divide a0. Then, f(x) is irreducible over Z (Q). Prove this criterion. ⃝ 12.C.2. Factorize over C and R the polynomial x4 + 2x3 + 3x2 + 2x + 1. Solution. This polynomial can be factorized either by looking for multiple roots or as a reciprocal equation: • Let us compute the greatest common divisor of the polynomial and its derivative 4x3 + 6x2 + 6x + 2, using the Euclidean algorithm. The greatest common divisor is given in any ring up to a multiple by a unit, and during the Euclidean algorithm, we may multiply the partial results by units of the ring. In the case of a polynomial ring over a field of scalars, the units are exactly the nonzero scalars. We perform the multiplication in the way to avoid calculations with fractions as much as possible. 2x4 + 4x3 + 6x2 + 4x + 2 : 2x3 + 3x2 + 3x + 1 = x + 1 2 2x4 + 3x3 + 3x2 + x x3 + 3x2 + 3x + 2 x3 + 3 2 x2 + 3 2 x + 1 2 3 2 x2 + 3 2 x + 3 2 Further, we divide the polynomial 2x3 + 3x2 + 3x + 1 by the remainder 3 2 x2 + 3 2 x + 3 2 (multiplied by the unit 2 3 ) 2x3 + 3x2 + 3x + 1 : x2 + x + 1 = 2x + 1 2x3 + 2x2 + 2x x2 + x + 1 The roots of the greatest common divisor of the original polynomial and its derivative are exactly the multiple roots of the original polynomial. In this case, the roots of x2 + x + 1 are −1 2 ± i √ 3/2, which are thus double roots of the original polynomial. 
The factorization over C is thus to root factors (this is always the case over C, as stated by the fundamental theorem of algebra): x4 + 2x3 + 3x2 + 2x + 1 = = ( x + 1 2 − i √ 3 2 )2 · ( x + 1 2 + i √ 3 2 )2 . The factorization over R can be obtained by multiplying the factors corresponding to pairs of complex-conjugated roots of the polynomial (verify that such a product must always result in a polynomial with real coefficients!): x4 + 2x3 + 3x2 + 2x + 1 = ( x2 + x + 1 )2 . • Let us solve the equation x4 + 2x3 + 3x2 + 2x + 1 = 0. CHAPTER 12. ALGEBRAIC STRUCTURES same way as the Boolean algebra of all subsets of a given 2n -element set. Moreover, each of these expressions can be written in a unique normal form, so it can be decided algorithmically whether two switch boards have the same behaviour without comparing their values for all 2n possible inputs (which on the other hand might still be faster, in particular the resulting normal formula tends to be exponentially large). Theorem. Every finite Boolean algebra is isomorphic to the Boolean algebra K = 2M where M is the set of atoms in K. Proof. The idea of the proof is quite straightforward. Every isomorphism of a Boolean algebra (K, ∧, ∨) must map atoms to atoms. Let M be the set of all atoms in K and consider the Boolean algebra (2M , ∩, ∪). This defines a natural correspondence between the atoms of K and the atoms of 2M . Next, use the disjunctive normal form to extend the mapping to all of K. Each element X ∈ K can be written uniquely (up to order) as a join of atoms: X = A1 ∨ · · · ∨ Ak Define the function f : K → 2M by f(X) = f(A1) ∪ · · · ∪ f(Ak) = {A1, . . . , Ak}, as the union of the singletons Ai ⊆ M that occur in the ex- pression. The uniqueness of the normal form implies that f is a bijection. It remains to show that it is a homomorphism of the Boolean algebras. Let X, Y ∈ K. The normal form of their supremum contains exactly the atoms which occur in at least one of X, Y ; while the infimum involves just those atoms which occur in both. This verifies that f preserves the operations ∧ and ∨. As for the complements, note that an atom A occurs in the normal form of X′ if and only if X ∧ A = 0. Hence f preserves complements, which finishes the proof. □ The classification of infinite Boolean algebras is far more complicated. It is not the case that each would be isomorphic to the Boolean algebra of all subsets of an appropriate set M. However, every Boolean algebra is isomorphic to a Boolean subalgebra of a Boolean algebra 2M for an appropriate set M. This result is known as Stone’s representation theorem3 . 2. Elements of Logic This section provides a quick introduction to logic. Mastering the calculus of Boolean algebras will become useful, but we have to start from scratch. 3The American mathematician Marshall Harvey Stone (1903 – 1989) proved this theorem in 1936 when dealing with the spectral theory of operators on Hilbert spaces, required for analysis and topology. Nowadays, it belongs to standard material in advanced textbooks. 1055 Dividing by x2 and substituting t = x + 1 x , we get the equation t2 + 2t + 1 = 0 with double root −1. Now, substituting this into the definition of t, we get the known equation x2 + x + 1 = 0, which was solved above. □ Remark. Let us remark that the only irreducible polynomials over R are linear polynomials and quadratic polynomials with negative discriminant. This also follows from the reasonings in the above exercise. 12.C.3. 
Factorize the polynomial x5 +3x3 +3 to irreducible factors over i) Q, ii) Z7. Solution. i) By Eisenstein’s criterion, the given polynomial is irreducible over Z and Q (we use the prime 3). ii) (x − 1)2 (x3 + 2x2 − x + 3). Using Horner’s scheme, for instance, we find the double root 1. When divided by the polynomial (x − 1)2 , we get (x3 + 2x2 − x + 3), which has no roots over Z7. Since it is only of degree 3, this means that it must be irreducible (if it were reducible, one of the factors would have to be linear, which means that the cubic polynomial (x3 +2x2 −x+3) would have a root). □ 12.C.4. Factorize the polynomial x4 + 1 over • Z3, • C, • R. Solution. • (x2 + x + 2)(x2 + 2x + 2) • The roots are the fourth roots of −1, which lie in the complex plane on the unit circle, and their arguments are π/4, π/4 + π/2, π/4 + π, and π/4 + 3π/2 i. e., they are the numbers ± √ 2/2 ± i √ 2/2. Thus, the factorization is ( x − √ 2 2 − i √ 2 2 ) ( x − √ 2 2 + i √ 2 2 ) ( x + √ 2 2 − i √ 2 2 ) ( x − √ 2 2 + i √ 2 2 ) . • Multiplying the root factors of complex-conjugated roots in the factorization over C, we get the factorization over R: ( x2 − √ 2x + 1 ) ( x2 + √ 2x + 1 ) .. □ 12.C.5. Find a polynomial with rational coefficients of the lowest degree possible which has 2007 √ 2 as a root. Solution. P(x) = x2007 −2. Let us show that there is no polynomial of lower degree with root 2007 √ 2: Let Q(x) be a nonzero polynomial of the lowest degree with root 2007 √ 2. Then, deg Q(x) ≤ 2007. Let us divide P(x) by Q(x) with remainder: P(x) = Q(x)·D(x)+R(x), where D(x) is the quotient and R(x) is the remainder, and either deg R(x) < st Q(x) or R(x) = 0. Substituting the number 2007 √ 2 into the last equation, we can see that 2007 √ 2 is also a root of R(x). By the definition of Q(x), this means that R(x) must be the zero polynomial, which means that Q(x) divides P(x). However, P(x) is irreducible (by Eisenstein’s criterion for 2), so its only CHAPTER 12. ALGEBRAIC STRUCTURES 12.2.1. From set algebra to logical propositions. Let us recall that in 12.1.2, we abstracted Boolean algebras from the properties of set operations; in particular, the ∨ operation corresponded to the union of sets. We understand sets as collections of elements with some property. So, if an element x is in the union of sets C and B, then it has ”property C” or ”property B.” Furthermore, if we know that element x does not have ”property C,” then we know for sure that it has ”property B.” In other words, from the fact that element x has ”property C” or ”property B,” but not ”property C,” we can conclude that element x definitely has ”property B.” Symbolically, x ∈ C ∪ B x ̸∈ C x ∈ B , or more succinctly using Boolean operations, C ∨ B C′ B This simple observation can be reformulated as follows: ”From C ∨ B and C′ , it logically follows that B.” For the sake of simplicity, let’s denote A = C′ , which means C = A′ , and rewrite the previous observation as: (1) A′ ∨ B A B and read it as: ”From A′ ∨ B and A, it necessarily implies B.” Next, let’s perform three simple Boolean calculations: A′ ∨ (B′ ∨ A) = (A′ ∨ A) ∨ B = 1 ∨ B = 1, (A′ ∨ B)′ ∨ ( (A′ ∨ B′ )′ ∨ A ) = (A′ ∨ B)′ ∨ ( (A ∧ B) ∨ A′ ) = (A′ ∨ B)′ ∨ (B ∨ A′ ) = 1, (A′ ∨ B)′ ∨ (A′ ∨ C) = (A ∧ B′ ) ∨ A′ ∨ C =(A ∨ A′ ∨ C) ∧ (B′ ∨ A′ ∨ C) = 1 ∧ (A′ ∨ B′ ∨ C) = ( A′ ∨ (B′ ∨ C) ) . Now, we will use the axioms (8) of Boolean algebras, and we obtain the following formulas: (2) A′ ∨ (B′ ∨ A) = 1, A′ ∨ A = 1 (A′ ∨ B)′ ∨ ( (A′ ∨ B′ )′ ∨ A ) = 1, ( A′ ∨ (B′ ∨ C) )′ ∨ ( (A′ ∨ B)′ ∨ (A′ ∨ C) ) = 1 . 
We already noticed the connections between Boolean algebras and propositional logic in paragraph 12.1.5. In this sense, we can see that the left sides of the previous equations express true propositions. Now, let’s transition more thoroughly from the ”Boolean world” to the ”world of logic.” We will do this by replacing uppercase letters with lowercase letters, replacing the symbol ′ ∨ with the symbol →, and changing the ”postfix” symbol ′ to the ”prefix” symbol ¬. So, instead of A′ ∨B, we will write a → b, and instead of A′ , we will write ¬a.” 1056 non-trivial divisor is itself (up to multiplication by a unit of the polynomial ring over Q, i. e. a non-zero rational constant). Thus, we have Q(x) = P(x) up to multiplication by a unit. For instance, the polynomial 1 3 x2007 − 2 3 also satisfies the stated conditions. However, if we require the polynomial to be monic (i. e., with leading coefficient 1), then the only solution is the mentioned polynomial P(x). □ 12.C.6. Find all irreducible polynomials of degree at most 2 over Z3. Solution. By definition, all linear polynomials are irreducible. As for quadratic irreducible polynomials, the easiest way is to simply enumerate them all and leave out the reducible ones, i. e. those which are a product of two linear polynomials. The reducible polynomials are (x + 1)2 = x2 + 2x + 1, (x + 2)2 = x2 + x + 1, (x + 1)(x + 2) = x2 + 2, x2 , x(x + 1) = x2 + x, x(x + 2) = x2 + 2x. (It suffices to consider monic polynomials, since the non-monic can be obtained by multiplication by 2.) The remaining quadratic polynomials over Z3 are irreducible; these are x2 + 2x + 2, x2 + x + 2, x2 + 1. □ 12.C.7. Decide whether the following polynomial is irreducible over Z3; if not, factorize it: x4 + x3 + x + 2. Solution. Evaluating the polynomial at 0, 1, 2, we find that it has no root in Z3. This means that it is either irreducible or a product of two quadratic polynomials. Assume it is reducible. Then, we may assume without loss of generality that it is a product of two monic polynomials (the only other option is that it is a product of two polynomials with leading coefficients equal to 2 – then both can be multiplied by 2 in order to become monic). Thus, let us look for constants a, b, c, d ∈ Z3 so that x4 + x3 + x + 2 = (x2 + ax + b)(x2 + cx + d) = = x4 + (a + c)x3 + (ac + b + d)x2 + + (ad + bc)x + bd. Comparing the coefficients of individual power of x, we get the following system of four equations in four variables: 1 = a + c, 0 = ac + b + d, 1 = ad + bc, 2 = bd. From the last equation, we get that one of the numbers b, d is equal to 1 and the other one to 2. Thanks to symmetry of the system in the pairs (a, b) and (c, d), we can choose b = 1, d = 2. From the second equation, we get ac = 0, i. e., at least one of the numbers a, c is 0. From the first equation, we get that the other one is 1. From the third equation, we get 2a + c = 1, i. e., a = 0, c = 1. Altogether, x4 + x3 + x + 2 = (x2 + 1)(x2 + x + 2). □ CHAPTER 12. ALGEBRAIC STRUCTURES 12.2.2. Classical Propositional Logic. The alphabet of classical propositional logic consists of lowercase Latin letters a, b, c, . . . possibly with subscripts a1, a2, a3, . . ., special symbols ¬, →, and parentheses (, ). These letters are called ”atomic propositions” or ”propositional variables.” We can think of them as declarative statements that can be either true or false. (Truth in this context is understood in the usual vague sense; the concept of truth has not been precisely defined yet.) 
The symbols ¬, → are called ”operators” or ”connectives,” sometimes also ”propositional functors.” What are the propositions? Definition. ”Propositions” are defined by the following rules: (i) Every atomic proposition is a proposition. (ii) If A is a proposition, then ¬A is also a proposition. (iii) If A and B are propositions, then (A → B) is also a proposition. In this definition, uppercase letters are actually parameters that represent any proposition. Example. a, b, ¬a, (a → b) are propositions; A, B, C are parameters that represent arbitrary propositions. A statement created according to rule (ii) will be called negation of statement A. We can read it as ‘not A,’ ‘non-A,’ ‘A is not true,’ and similar expressions. A statement created according to rule (iii) is called an implication, where statement A is called the antecedent of the implication, and statement B is called the consequent of the implication. We can read it as ‘A logically implies B,’ ‘if A, then B,’ ‘A is a sufficient condition for B,’ ‘B is a necessary condition for A,’ ‘A implies B,’ and similar expressions. If there is no risk of misunderstanding, we will omit the outer parentheses in statements created according to rule (iii). The last statement from the previous example will then have the form ¬ ( (b → a) → ¬a ) → (b → a). We consider the alphabet of propositional calculus as primitive concepts. From them, we construct, delineate, and define further concepts using rules. In Kantian terminology, atomic statements are synthetic, while other statements are analytical—they can be broken down into atomic statements. Statements are understood as declarative sentences that are clear enough to decide whether they are true or not. Since ‘truth’ is still a very vague concept, we will use the word ‘validity.’ However, we do not yet know any valid statements. We will now declare as valid statements those that are prescribed by the Boolean expressions on the left sides of equations (2). 1057 12.C.8. For any odd prime p, find all roots of the polynomial P(x) = xp−2 + xp−3 + · · · + x + 2 over the field Zp. Solution. Considering the equality xp−1 − 1 = (x − 1)(P(x) − 1), we can see that all numbers of Zp, except 0 and 1, are roots of P(x) − 1, so they cannot be roots of P(x) + 1. Clearly, 0 is never a root of P(x), and 1 is always a root, which means that it is the only root.. □ 12.C.9. Factorize the polynomial p(x) = x2 + x + 1 in Z5[x] and Z7[x]. Solution. Irreducible in Z5[x]; p(x) = (x − 2)(x − 4) in Z7[x]. □ 12.C.10. Factorize the polynomial p(x) = x6 −x4 −5x2 −3 in C[x], R[x], Q[x], Z[x], Z5[x], Z7[x], knowing that it has a multiple root. Solution. Applying the Euclidean algorithm, we find out that the greatest common divisor of p and its derivative p′ is x2 +1. Dividing the polynomial p(x) twice by this factor, we get p(x) = (x2 + 1)2 (x2 − 3). Clearly, these factors are irreducible in the rings Q[x] and Z[x]. In C[x], we can always factorize a polynomial to linear factors. In this case, it suffices to factorize x2 + 1, which is easy: x2 + 1 = (x + i)(x − 1). The factor x2 − 3 is equal to (x − √ 3)(x + √ 3) even in R[x]. Thus, in C[x], we have p(x) = (x + i)2 (x − i)2 ( x − √ 3 ) ( x + √ 3 ) , while in R[x], we have p(x) = (x2 + 1)2 ( x − √ 3 ) ( x + √ 3 ) . In Z5[x], the polynomial x2 + 1 has roots ±2, and the polynomial x2 − 3 has no roots, which means that p(x) = (x − 2)2 (x + 2)2 (x2 − 3). In Z7[x], neither polynomial has a root, so the factorization to irreducible factors is identical to that in Q[x] and Z[x]. 
p(x) = (x2 + 1)2 (x2 − 3). □ 12.C.11. Knowing that the polynomial p = x6 +x5 +4x4 + 2x3 +5x2 +x+2 has multiple root x = i, factorize it to irreducible polynomials over C[x], R[x], Z2[x], Z5[x], and Z7[x]. Divide the polynomial q = x2 y2 + y2 + xy + x2 y + 2y + 1 by the irreducible factors of p in R[x], and use the result to solve the system of polynomial equations p = q = 0 over C. Solution. p = (x2 +1)2 (x2 +x+2), in Z2: p = x(x+1)5 , in Z5: p = (x−2)2 (x+2)2 (x2 +x+2), in Z7: p = (x2 +1)2 (x+ 4)2 . For the second polynomial, we get q = (y2 +y)(x2 +x+ 2) − y2 (x + 1) + 1 and q = (y2 + y)(x2 + 1) + y(x + 1) + 1. Thus, if x = α is a root of x2 + x + 2, i. e., α = −1 2 ± 1 2 i √ 7, CHAPTER 12. ALGEBRAIC STRUCTURES Axioms of classical propositional calculus Definition. Axioms of classical propositional calculus are: (kvp1) A → (B → A), (kvp2) (A → B) → ( (A → ¬B) → ¬A ) , (kvp3) ( A → (B → C) ) → ( (A → B) → (A → C) ) , (kvp4) ¬¬A → A, where A, B, C denote statements. Axiom (kvp1) can be interpreted as follows: if statement A is valid, then we don’t need to know where it follows from; a valid statement follows from anything. Axiom (kvp2) can be read as follows: if statement B follows from statement A, then if the negation of statement B also follows from statement A, statement A must necessarily not be valid. In other words, an implication is valid if its antecedent is not valid. In other words, a sufficient (but not necessary!) condition for the validity of an implication is the invalidity of the antecedent. Axiom (kvp3) is called the axiom of deduction, and the reason for this name will become clear later. Axiom (kvp3), together with axiom (kvp1), expresses the transitivity of implication. However, this might not be immediately evident (if it is, I apologize to the reader for underestimating them). Therefore, we will also address the transitivity of implication later. Axiom (kvp4) simply states that when it’s not true that statement A is not valid, then statement A is valid. The selection of axioms for classical propositional calculus was arbitrary, and nothing forced us to choose them. However, as we will see later, this selection is very useful. Although it’s not the only possible one. We can now construct statements and know four valid statements. What remains is to clarify how to derive new valid statements from valid ones. For this purpose, we will use the rule (1). Definition. The modus ponens inference rule is of the form: A → B A B. This rule states that if an implication is valid and its antecedent is valid, then its consequent is also valid. In other words, from the implication, we deduce the valid antecedent, and the valid consequent remains. Hence the name of this rule, the method of detachment. We can summarize the first part of our considerations: The propositional logic system consists of three components. Countably infinite alphabets together with finitely many syntactic rules tell us how to construct words and sentences, i.e., statements, from individual letters of the alphabet. Further, a finite number of axioms, i.e., statements that we consider valid. The third component consists of a finite number of inference rules. 1058 then y = 1√ 1+α . If x = β is a root of x2 + 1, i. e., β = ±i, then y = − 1 1+β □ 12.C.12. Factorize the following polynomial to irreducible polynomials in R[x] and in C[x]: 4x5 − 8x4 + 9x3 − 7x2 + 3x − 1. ⃝ 12.C.13. Factorize the following polynomial to irreducible polynomials in R[x] and in C[x]: x5 + 3x4 + 7x3 + 9x2 + 8x + 4. ⃝ 12.C.14. 
Factorize x4 −4x3 +10x2 −12x+9 to irreducible polynomials in R[x] and in C[x]: ⃝ 12.C.15. Decide whether the following polynomial over Z3 is irreducible; if not, factorize it to irreducible polynomials: x5 + x2 + 2x + 1. ⃝ 12.C.16. Decide whether the following polynomial over Z3 is irreducible; if not, factorize it to irreducible polynomials: x4 + 2x3 + 2. ⃝ 12.C.17. Find all monic quadratic irreducible polynomials over Z5. Solution. We write out all monic quadratic polynomials over Z5 and exclude those which are not irreducible, i. e., have a root: x2 ±2, x2 ±x+2, x2 ±2x−2, x2 −x±1, x2 ±2x−1. □ D. Rings of multivariate polynomials 12.D.1. Find the remainder of the polynomial x3 y + x + yz + yz4 with respect to the basis (x2 y + z, y + z) and the orderings 0, ∅ ε < 0. We will write zj = xj + iyj = xj + √ −1yj. Therefore, Xε is given as a subset in R4 by a system of two real equations: Re(z2 1 + z2 2 − ε) = x2 1 + x2 2 − y2 1 − y2 2 − ε = 0, Im(z2 1 + z2 2 − ε) = 2(x1y1 + x2y2) = 0. Thus, we can assume that Xε will be a “two-dimensional surface” in R4 . We will try to imagine it as a surface in R3 in a suitable projection R4 → R3 . For this purpose, we choose the mapping φ+ : (x1, x2, y1, y2) → ( x1, x2, x1y2 − x2y1 √ x2 1 + x2 2 ) Denote by V the subset of R4 which is given by our second equation, i. e., V = {(x1, x2, y1, y2); x1y1 + x2y2 = 0, (x1, x2) ̸= (0, 0)}. The restriction of φ+ to V is invertible, and its inverse ψ+ is given by ψ+ : (u, v, w) → ( u, v, − vw √ u2 + v2 , uw √ u2 + v2 ) . Now, note that ( x1y2 − x2y1 √ x2 1 + x2 2 )2 = y2 1 + y2 2, and hence it follows that φ+(V ∩ Xε ) = Hε = {(u, v, w); u2 + v2 − w2 − |ε| = 0}. Now, we can compose the constructed mappings φε : Xε → V @ > φ+ >> R3 \ {(0, 0, 0)} ⊇ Hε , and for every ε > 0, we get a bijection φε : Xε → Hε . The real part of this variety is the “thinnest circle” on the one-part rotational hyperboloid Hε ; see the picture. CHAPTER 12. ALGEBRAIC STRUCTURES 1. A → B premise 2. (A → B) → ( (B → C) → (A → B) ) (kvp1) 3. (B → C) → (A → B) 1., 2. 4. (B → C) → ( A → (B → C) ) (kvp1) 5. ( A → (B → C) ) → ( (A → B) → (A → C) ) (kvp3) 6. (B → C) → ( (A → B) → (A → C) ) 4., 5., TI. 7. ( (B → C) → ( (A → B) → (A → C) )) → (( (B → C) → (A → B) ) → ( (B → C) → (A → C) )) (kvp3) 8. ( (B → C) → (A → B) ) → ( (B → C) → (A → C) 6., 7. 9. (B → C) → (A → C) 3., 8. 2) Theorem [3]: ⊢ ( A → (B → C) ) → ( B → (A → C) ) . We will show {A → (B → C), B, A} ⊢ C: 1. A → (B → C) premise 2. A premise 3. B → C 1., 2. 4. B premise 5. C 3., 4. Now we use the Deduction Theorem three times to get {A → (B → C), B} ⊢ A → C, {A → (B → C)} ⊢ B → (A → C), ⊢ ( A → (B → C) ) → ( B → (A → C) ) . Semantics of Classical Propositional Logic Classical propositional logic was introduced purely formally and axiomatically. An alphabet and syntax (rules for constructing ”letters,” ”words,” and ”sentences”) were introduced, some statements (axioms) were declared valid, and a rule was specified for deriving other valid statements from valid ones. The idea that statements are or represent ”declarative sentences that can be designated as true or false” and that we can somehow ”read” propositional connectives is entirely unnecessary when creating a propositional logic system. Now we will ”fill the form with content,” show ”what it is about.” More precisely, we will give meanings to statements. Let’s start somewhat broadly: We somehow understand linguistic expressions; they express some ”meaning” for us. 
This meaning helps us determine or identify a fact or reality in the real (or fictional, or virtual) world that the respective linguistic expression denotes. This fact or reality is the ”meaning” of the linguistic expression. Schematically: expression meaning denotes senseexpresses identifies A classic example, mentioned by Gottlob Frege, is the linguistic expression ”Morning Star.” Its sense is the ”brightest star in the morning sky.” Its meaning in the real world is the planet Venus. The sense of another expression, ”Evening Star,” is the ”brightest star in the evening sky.” But its meaning is the same—Venus. 1062 For ε < 0, we can repeat the above reasoning, merely interchanging x and y and the signs in the definition of φ+: φ− : (x1, x2, y1, y2) → ( −y1, −y2, −y1x2 + y2x1 √ y2 1 + y2 2 ) , which changes the inversion ψ− ψ+ : (u, v, w) → ( − vw √ u2 + v2 , uw √ u2 + v2 , −u, −v ) . Now, Hε is again a one-part rotational hyperboloid, but its real part is Xε R = ∅. In the complex case, we can observe that when continuously changing the coefficients, the resulting variety changes only a bit, except for certain “catastrophic” points, where a qualitative leap may occur. This is called the principle of permanence. In the real case, this principle does not hold at all. 12.D.8. The projective extension of the line and the plane. The real projective space P1(R) is defined as the set of all directions in R2 , i. e., its points are one-dimensional subspaces of R2 . The complex projective space P1(C) is defined as the set of all directions in C2 , i. e., its points are one-dimensional subspaces of C2 . Similarly, the points of the real and complex twodimensional projective spaces are defined as directions in R3 and C3 , respectively. CHAPTER 12. ALGEBRAIC STRUCTURES The example shows that different expressions can have the same sense; conversely, we can also encounter expressions that have no sense. Statements in classical propositional logic are represented by various expressions. However, the meaning of a statement will be exactly one of the possibilities ”true” or ”false.” These meanings are symbolically denoted as T and F, sometimes as P and N in Czech, alternatively as 1 and 0. We will use this last notation. Sometimes it will be appropriate to interpret the symbols ”0” and ”1” directly as numbers, between which an order and arithmetic operations are naturally defined. It is reasonable to consider the negation of a statement to be true precisely when the original statement is false, and vice versa. From the modus ponens rule, it follows that if an implication is true, and its antecedent is also true, then the consequent must also be true. Axiom (kvp2) implies that if the antecedent of an implication is false, then the entire implication is true. These considerations allow us to formulate a definition. Definition. An interpretation of the classical propositional logic system is a mapping I from the set of statements to the set {0, 1} such that (i1) I(¬A) = 1 if and only if I(A) = 0, (i2) I(A → B) = 0 if and only if I(A) = 1 and simultaneously I(B) = 0. The set {0, 1} is called the set of truth values. The interpretations of primitive statements ¬A and A → B can be written in tables: A ¬A 0 1 1 0 , A B A → B 0 0 1 0 1 1 1 0 0 1 1 1 . We will use the notation |A| = I(A), which is often found in logic textbooks. Then we can express the interpretation of primitive statements as follows: |¬A| = 1−|¬A|, |A → B| = max{1−|A|, |B|} = max{|¬A|, |B|}. 
From the definition of interpretation, it is clear that the truth value of any statement can be computed for any interpretation of atomic statements by a finite number of subtractions and searching for the greater of the two values. Example. Let |a| = 1, |b| = 0. Then ¬ ( (b → a) → ¬a ) → (b → a) = max { 1 − ¬ ( (b → a) → ¬a ) , |b → a max {|(b → a) → ¬a| , max {1 − |b|, |a|}} = max {max(1 − |b → a| , |¬a|), 1} = 1. This ordered computation is, of course, not very efficient. A much shorter evaluation would be achieved using a truth ta- ble. Definition. 1063 E. Algebraic structures First of all, we practice general properties of operations and we find out what structures the known sets and operations actually are. 12.E.1. Decide about the following sets and operations what algebraic structures they are (groupoid, semigroup (with potential one-sided neutral elements), monoid, group): i) the set of all subsets of the integers with union, ii) the set of positive integers with the greatest common divisor as the binary operation, iii) the set of positive integers with the least common multiple as the binary operation, iv) the set of all 2-by-2 invertible matrices over R with addi- tion, v) the set of all 2-by-2 matrices over R with multiplication, vi) the set of all 2-by-2 matrices over R with subtraction, vii) the set of all 2-by-2 invertible matrices over Z2 with mul- tiplication, viii) the set Z6 with multiplication (modulo 6), ix) the set Z7 with multiplication (modulo 7). Construct the table of the operation for the last-but-two struc- ture. Solution. i) a monoid (the empty set being neutral), ii) a semigroup (with no neutral elements), iii) a monoid (1 being neutral), iv) not even a groupoid (consider A+(−A) for an invertible matrix A), v) a monoid, vi) a groupoid (not associative), vii) a group, viii) a monoid (the class [1] being neutral), ix) a monoid (the class [1] being neutral). The group in vii) consists of the following elements: A = ( 1 0 0 1 ) , B = ( 0 1 1 0 ) , C = ( 1 1 0 1 ) , D = ( 1 1 1 0 ) , E = ( 0 1 1 1 ) , F = ( 1 0 1 1 ) . The table of the matrix multiplication looks as follows: A B C D E F A A B C D E F B B A E F C D C C D A B F E D D C F E A B E E F B A D C F F E D C B A Note that each row and column (disregarding the heading ones) contains each element exactly once (why is it so?). Thus, we do not have to calculate each product and instead we can play “sudoku” as soon as we have filled enough entries of the table. □ CHAPTER 12. ALGEBRAIC STRUCTURES • We say that a statement A is a tautology if, for every interpretation of the statements from which A is composed, |A| = 1. • We say that a statement A is a contradiction if, for every interpretation of the statements from which A is composed, |A| = 0. • We say that a statement A follows from a set of statements {A1, A2, . . . } (finite or countable) if, for an interpretation of the statements A1, A2, . . . such that |A1| = |A2| = · · · = 1, it holds that |A| = 1. • We say that a set of statements {A1, A2, . . . } is semantically consistent if there exists an interpretation such that |A1| = |A2| = · · · = 1. A tautology follows from an empty set of statements. Examples. (1) The statement A → ¬¬A is a tautology: If |A| = 1, then |A → ¬¬A| = max {1 − |A|, |¬¬A|}= max {0, 1 − (1 − 1)} = 1, If |A| = 0, then |A → ¬¬A| = max {1 − |A|, |¬¬A|}= max {1, 1 − (1 − 0)} = 1. The given statement is also a theorem in classical vector calculus. (2) The statements A → B and B → C are semantically consistent. 
We construct a table A B C A → B B → C 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1 and from it, we can see that both statements have an interpretation of 1 in four cases, specifically in rows 1, 2, 4, and 8. (3) The statement A → C follows from the set of statements {A → B, B → C} . From the previous example, we already know that the given set of statements is semantically consistent. In the interpretation table, we only need to focus on the rows where both statements have a truth value of 1. A B C A → B B → C A → C 0 0 0 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 All truth values in the last column of the table are equal to 1. Recall that the statement A → C is also derivable from the set of statements {A → B, B → C} ; this is the transitivity of implication. 1064 12.E.2. Let X be a set and P(X) denote the set of all subsets of X. Decide whether the set P(X) together with each of the following operations forms a groupoid, semigroup, monoid, group and whether the operation is commutative: i) set intersection, ii) set union, iii) set symmetric difference (xor). Solution. If the set X is empty, then P(X) together with any of the mentioned operations is the trivial (1-element) group. Otherwise: i) with the set intersection, the resulting structure is a commutative monoid, ii) with the set union, the resulting structure is a commutative monoid, iii) with the set xor, the resulting structure is a commutative group where the empty set is neutral and each element is self-inverse. A−1 = A. □ 12.E.3. Decide about the following sets and operations what algebraic structures they are (groupoid, semigroup, group), whether they have one-sided or two-sided neutral elements, and whether the operation is commutative: i) the set of all 3-by-3 invertible matrices over R with addi- tion, ii) the set of all 3-by-3 matrices over R with multiplication, iii) the set of all 3-by-3 matrices over R with addition, iv) the set of all 3-by-3 invertible matrices over Z2 with mul- tiplication, v) (Z9, +), vi) (Z9, ·). ⃝ 12.E.4. Decide about the following subsets G of the complex numbers what algebraic structures they form together with multiplication (groupoid, semigroup, group), and whether the operation is commutative: i) G = {a + bi | a, b ∈ Z}, ii) G = {a + bi | a, b ∈ R, a2 + b2 = 1}, iii) G = {a + b · √ 5 | a, b ∈ Q, a2 + b2 ̸= 0}. ⃝ 12.E.5. Decide whether Z together with the operation ♡ forms a groupoid, semigroup, monoid, group, and whether ♡ is commutative, provided it is defined by: i) a♡b = (a, b), ii) a♡b = a|b| , iii) a♡b = 2a + b, iv) a♡b = |a|, v) a♡b = a + b + a · b, vi) a♡b = a + b − a · b, vii) a♡b = a + (−1)a b. ⃝ CHAPTER 12. ALGEBRAIC STRUCTURES There are four possible truth tables for unary logical connectors. One of them is the table for the negation interpretation. We will show that the remaining three possible tables also express interpretations of statements formed using the unary connector ¬ and the binary connector →. A ♠A A → A 0 1 1 1 1 1 , A ♣A ¬(A → A) 0 1 0 1 1 0 , A ♡A A 0 1 0 1 1 1 . The first connector is called unary verum, the second unary falsum, and the third assertion; the symbols used are not stan- dard. Similarly, any truth table for binary logical connectors is satisfied by a statement created using the connectors ¬ and →. We will demonstrate this with at least one example. A B A ↔ B ¬ ( (A → B) → ¬(B → A) ) 0 0 1 1 0 1 0 0 1 0 0 0 1 1 1 1 . You probably know that the connector ↔ B is called equiva- lence. 
Basic Properties of Classical Propositional Logic First, let’s emphasize that it’s essential to distinguish between concepts that are axiomatic, i.e., introduced using axioms and rules of inference, and those that are semantic, i.e., introduced using interpretation. (And let’s also note that the concepts referred to as axiomatic are commonly called syntactic, but that’s misleading because it leads to mixing axioms with syntax in the narrower sense—rules for forming state- ments.) However, we’ve already seen that both the axiomatics and semantics include concepts that are related to each other—a ”theorem” corresponds to a ”tautology,” ”derivation” corresponds to ”entailment.” But this is the case only in classical propositional logic; in other propositional logic systems, it may not hold. We’ve already formulated the basic properties of classical propositional logic, or at least hinted at them. Now let’s summarize them without proofs: Deduction.: The concept of a ”theorem” is defined using the concept of ”derivation” and is thus reducible to this concept. Conversely, the deduction theorem implies that the concept of ”derivation from a finite set of statements” is reducible to the concept of a ”theorem.” The set of theorems coincides with the set of derivable statements. The semantic counterpart of the deduction theorem also holds: Statement A follows from a set of statements {A1, . . . , An−1, An} if and only if statement An → A follows from the set of statements {A1, . . . , An−1}. 1065 12.E.6. In how many ways can we fill the following table so that ({a, b, c}, ⋇) would be a i) groupoid ii) commutative groupoid iii) a groupoid with a neutral element iv) a monoid v) a group ⋇ a b c a c b a b b c Solution. i) 35 ii) 9 iii) 9 iv) 1 v) 0 □ 12.E.7. Find the number of groupoids on a given threeelement set. Solution. Since the set is given, it remains to define the binary operation. In a groupoid, there is no restriction except that the result of the operation must be an element of the underlying set. Thus, for any pair of elements, there are three possibilities for the result. By the product rule, this gives 33·3 = 19683 groupoids. □ 12.E.8. Decide whether the set G = (R ∖ {0} × R) together with the operation △ defined by (x, y)△(u, v) = (xu, xv+y) for all (x, y), (u, v) ∈ G is a groupoid, semigroup, monoid, group, and whether △ is commutative. ⃝ 12.E.9. Let Ω be a set with the multiplication operation defined by ab = a for all a, b in Ω Prove that such multiplication is associative. ⃝ Solution. For a, b, c ∈ Ω we have: a(bc) = ab = a = ac = (ab)c □ 12.E.10. Suppose that abc = e for some a, b, c in a group G. Show that bca = e as well. ⃝ Solution. Set x = bc. Then ax = e i.e. x−1 = a since x is invertible and so xa = (bc)a = e. □ F. Groups We begin with recalling permutations and their properties. We have already met permutations in chapter two, see ??, where we used them to define the determinant of a matrix. 12.F.1. For each of the following conditions, find all permutations π ∈ S7 which satisfy it: i) π4 = (1, 2, 3, 4, 5, 6, 7) CHAPTER 12. ALGEBRAIC STRUCTURES Compactness.: A trivial consequence of the definition of derivation is that a statement is derivable from any set of statements if and only if it is derivable from some of its finite subsets. (This follows from the fact that derivation is a finite sequence.) This also holds for entailment: A statement follows from an infinite set of statements if and only if it follows from some of its finite subsets. 
It should be noted that proving this statement is not trivial. Correctness.: Classical propositional logic is correct, meaning that every theorem is a tautology. ”What can be proven is true.” Completeness.: Classical propositional logic is complete, meaning that every tautology is a theorem. ”What is true can be proven.” Strong Correctness and Completeness.: Classical propositional logic is strongly correct, meaning that whenever a statement is derivable from a set of statements, it follows from it. Classical propositional logic is also strongly complete, meaning that whenever a statement follows from a set of statements, it is derivable from it. Therefore, a statement in classical propositional logic follows from a set of statements if and only if it is derivable from it. Denotational Saturation.: An operator with any truth table can be expressed using the operators ¬ and →; that is, for any n-ary operator defined by a truth table, there exists a statement constructed solely using ¬ and → that has the same truth table. Decidability.: For any (correctly formulated) statement, it can be decided whether it is a tautology or not. For any statement, it can be decided whether it is a theorem or not. This statement follows directly from the definition of interpretation. Every statement is composed of only finitely many parts, that is, only finitely many atomic statements. And then, it can be decided by finite computation or a finite truth table whether the value is 1 for all evaluations of atomic parts. Alternative Formulations of Propositional Logic The primitive operations of propositional logic are, by definition, negation ¬ and implication →. The first is unary, the second is binary. Using them, we introduce several other binary operations on the set of statements, i.e., propositional functors: 1066 ii) π2 = (1, 2, 3) ◦ (4, 5, 6) iii) π2 = (1, 2, 3, 4) ⃝ 12.F.2. Find the signature (parity) of each of the following permutations: i) ( 1 2 3 4 5 6 . . . 3n − 2 3n − 1 3n 2 3 1 5 6 4 . . . 3n − 1 3n 3n − 2 ) , ii) ( 1 2 3 . . . n n + 1 n + 2 . . . 2n 2 4 6 . . . 2n 1 3 . . . 2n − 1 ) . Solution. The parity of a permutation corresponds to the number of transpositions from which it is built or, equivalently, to the number of its inversions, see 2.2.2. The number of inverses can be read easily from the two-row representation of the permutation. For each number of the second row, we count the number of numbers that are less and lie more to the right than the current number. Thus, the first permutation is even (the signature is 1), and in the second case, the signature depends on n and is equal to (−1) n·(n+1) 2 . □ 12.F.3. Find all permutations ρ ∈ S9 such that [ρ ◦ (1, 2, 3)] 2 ◦ [ρ ◦ (2, 3, 4)] 2 = (1, 2, 3, 4). Solution. No such permutation exists, since the left-hand side is always an even permutation, while the right-hand side is an odd one. □ 12.F.4. Find all permutations ρ ∈ S9 such that ρ2 ◦ (1, 2) ◦ ρ2 = (1, 2) ◦ ρ2 ◦ (1, 2). ⃝ 12.F.5. Consider the permutation σ = ( 1 2 3 4 5 6 7 3 6 5 7 1 2 4 ) . Find the order of σ in the group (S7, ◦), the inverse of σ and compute σ2013 . Show that σ does not commute with the transposition τ = (2, 3). Solution. σ = (1, 3, 5) ◦ (2, 6) ◦ (4, 7). Therefore, the order of σ is the least common multiple of the cycle lengths 3, 2, 2, which is 6. Furthermore, σ−1 = (1, 5, 3) ◦ (2, 6) ◦ (4, 7) and σ2013 = (σ3 35)6 ◦ σ3 = σ3 = (2, 6) ◦ (4, 7). Finally, we have σ ◦ τ = (1, 3, 6, 2, 5) ◦ (4, 7), but τ ◦ σ = (1, 2, 6, 3, 5) ◦ (4, 7). □ 12.F.6. 
Find σ−1 and σ2013 , where • (a) σ = ( 1 2 3 4 5 6 7 4 5 7 6 1 2 3 ) in the symmetric group (S7, ◦). • (b) σ = [4]11 in the group (Z× 11, ·). Solution. (a) σ = (1, 4, 6, 2, 5)◦(3, 7), σ−1 = (1, 5, 2, 6, 4)◦ (3, 7), since the order of (1, 4, 6, 2, 5) is 5 and the order of the transposition (3, 7) is 2, we get that the order of σ is the least CHAPTER 12. ALGEBRAIC STRUCTURES statement definition name A ∧ B ¬(A → ¬B) conjunction A ∨ B ¬A → B [inclusive] disjunction A ⊻ B (A ∧ ¬B) ∨ (¬A ∧ B) exclusive disjunction A | B ¬(A ∧ B) Sheffer’s function, incompatibility A ↓ B ¬(A ∨ B) Peirce’s function A ↔ B (A → B) ∧ (B → A) equivalence All the connectors introduced in this way are not strictly part of propositional logic; they are meta-symbols. However, this strict distinction is not significant for our purposes, and we can freely add the symbols ∧, ∨, ⊻, | , ↔, and ↓ to the propositional alphabet. The conjunction A ∧ B can be read as ”A and B” or ”A simultaneously with B” and so on. The disjunction A ∨ B can be read as ”A or B.” The grammatical connector ”or” here has a similar meaning to the Latin ”vel,” indicating possible but not always necessary alternatives, so at least one of the statements A, B is true. Similarly, in the Czech sentence: ”Tonight I will go to the pub or study mathematics.” I can study mathematics with my classmates in the pub. In contrast, the exclusive disjunction A⊻B, which can be read as ”either A or B,” expresses ”or” in an exclusionary sense, meaning that exactly one of the statements A, B is true, as in the Czech sentence: ”Tonight I will go to the pub or somewhere else.” The Czech language has another meaning for the word ”or,” which is that at most one of the options will occur, for example, in the sentence: ”Tonight I will sleep either in Brno or in New York.” This possibility is formalized by Sheffer’s connector, and the statement A | B is suitable to read as ”A is incompatible with B.” Statement A ↓ B can be read as ”neither A nor B,” and the equivalence A ↔ B can be read as ”A if and only if B” or ”A is a necessary and sufficient condition for B.” Furthermore, it should be noted that the exclusive disjunction A ⊻ B is sometimes denoted as A ↔| B, and the negation of implication ¬(A → B) is sometimes called ”inhibition” and denoted as A →| B. Alternative names for the Peirce function are ¯∨, .|., or NOR. The semantics of these propositional connectors can be easily derived from their definitions. Let’s summarize them in a truth table: A B A ∧ B A ∨ B A ⊻ B A | B A ↓ B A ↔ B 0 0 0 0 0 1 1 1 0 1 0 1 1 1 0 0 1 0 0 1 1 1 0 0 1 1 1 1 0 0 0 1 As an exercise, you can verify some ”standard” tautolo- gies: 1067 common multiple of 2 and 5, which is 10, i. e., σ10 = 1. Then, σ2013 = (σ1 0)201 ◦ σ3 = σ3 = (1, 2, 4, 5, 6) ◦ (3, 7) . (b) For the sake of simplicity, we will write only k for the residue class [k]11, k ∈ Z. Then, 45 ≡ 1 (mod 1)1 ⇒ σ−1 = 44 ≡ 3 (mod 1)1 σ2013 = 42013 ≡ 43 ≡ 9 (mod 1)1. □ 12.F.7. Prove that every group whose number of elements is even contains a non-trivial (i. e., different from the identity) element which is self-inverse. Solution. Since each element which is not self-inverse can be paired with its inverse, we see that there are an even number of elements which are not-self inverse. Thus, there remain an even number of elements which are self-inverse, and one of them is the identity, so there must be at least one more such element. □ 12.F.8. Prove that there exists no non-commutative group of order 4. Solution. 
By Lagrange’s theorem (see 12.4.10), the nontrivial elements of a 4-element group are of order 2 or 4. If there is an element of order 4, then the group is cyclic, and thus commutative. So the only remaining case is that there are (besides the identity e) three elements of order 2, call them a, b, c. We are going to show that we must have ab = c: It cannot be that ab = e, since the inverse of a is a itself, and not b. It cannot be that ab = a, since this would mean that b = e, and similarly, it cannot be that ab = b, since this would mean that a = e. Therefore, the only remaining possibility is that indeed ab = c, and it can be shown analogously that the product of any two non-trivial elements, regardless of the order, must be equal to the third one, so this group is commutative, too. Altogether, we have shown that there are exactly two groups of order 4, up to isomorphism. The latter is called the Klein group, and one instance of it is the group Z2 × Z2. □ 12.F.9. Show that there exists no non-commutative group of order 5. Solution. By Lagrange’s theorem (see 12.4.10), the nontrivial elements of a 5-element group are of order 5, so the group must be cyclic, and thus commutative. □ Remark. The same argumentation show that each group of prime order must be cyclic, and thus commutative. In particular, there are neither 2-element nor 3-element noncommutative groups. As we have shown (see 12.F.8), there is even no 4-element non-commutative group. Therefore, the smallest non-commutative group may be of order 6. As we have seen (see 12.E.1(vii)), this is indeed the case. CHAPTER 12. ALGEBRAIC STRUCTURES A ∨ ¬A, ¬(A ∧ ¬A) Law of excluded middle (A ∧ ¬A) → B, ¬A → (A → B) Law of Duns Scotus( (A → B) ∧ A ) → B Modus ponens (¬B → ¬A) → (A → B) Reductio ad absurdum, indirec( (A ∧ ¬B) → (C ∧ ¬C) ) → (A → B) Proof by contradiction( (A → B) ∧ (B → C) ) → (A → C) Transitivity of implication( (A ∧ B) → C ) → ( A → (B → C) ) Deduction We can declare other connectors, such as ¬, ∧, and ∨, as primitive operations alongside →. In this sense, Propositional Calculus can be treated as discussed in Section 11.48. For constructing propositional calculus, one of either ∧ or ∨ alongside ¬ is sufficient, as implication → can be expressed using them: A → B = ¬A ∨ B, or A → B = ¬(A ∧ ¬B). Axioms can then be rewritten using the respective connector. The inference rules will take the following form: A ∨ B ¬A B, A ∨ B ¬B A, or ¬(A ∧ B) A ¬B, ¬(A ∧ B) B ¬A. Each of the Sheffer and Peirce functions can serve as a single propositional connector. Specifically: ¬A = A | A, A → B = A | (B | B), and ¬A = A ↓ A, A → B = ( (A ↓ A) ↓ B ) ↓ ( (A ↓ A) ↓ B ) . The inference rules can then be written as: A | B A B | B, A | B B A | A and A ↓ B A ↓ A B ↓ B. As previously mentioned, Propositional Calculus can be built using other connectors than ¬ and →. However, any combination of these connectors will suffice, as shown in Section 11.48. The construction of Propositional Calculus will only require one of ∧ or ∨ in addition to negation, as implication can be expressed using them. Therefore, the axioms for classical propositional calculus can be either: (kvp1) A → (B → A) (kvp2′ ) (¬A → ¬B) → ( (¬A → B) → A ) (kvp3) ( A → (B → C) ) → ( (A → B) → (A → C) ) Or, a single axiom known as the ”Meredith’s Axiom”: ((( (A → B) → (¬C → ¬D) ) → C ) → E ) → ( (E → A) → (D → A) ) . 
12.F.10. Prove that any group G where each element is self-inverse must be commutative.

Solution. Let a, b ∈ G. Since each of ba, b, a is assumed to be self-inverse (so that, in particular, (ba)(ba) = e), we get

ab = ab((ba)(ba)) = a(bb)aba = (aa)ba = ba. □

12.F.11. Prove that every group G of order 6 is isomorphic to Z_6 or S_3.

Solution. By Lagrange's theorem (see 12.4.10), the non-trivial elements of a 6-element group are of order 2, 3, or 6. If there is an element of order 6, then G is cyclic, and thus isomorphic to Z_6. Therefore, assume from now on that the order of each non-trivial element is 2 or 3. Since an element a of order 3 is not self-inverse (we have a^{-1} = a^2, since a · a^2 = a^3 = e), we get from exercise 12.F.7 that there must be at least one element of order 2. As we are going to show, there must also be an element of order 3. For the sake of contradiction, assume that each element of G is self-inverse, and let a ≠ b be any two elements different from the identity e. The same argument as in 12.F.8 shows that the product ab cannot be any of e, a, b. Thus, H = {e, a, b, ab} is a 4-element subset of G. Thanks to the self-inverseness, we can see that H is closed under the operation, with the possible exception of the products b · a, b · ab, and ab · a. However, we get from the above exercise that G is commutative, so these three products also lie in H, and it follows that H is actually a subgroup of G. However, this contradicts theorem 12.4.10, by which a 6-element group cannot have a 4-element subgroup.

The only remaining case is that there is an element of order 2 (call it a) as well as an element of order 3 (call it b). Then, b^2 is also of order 3 (and different from b), so G contains the four elements e, a, b, b^2. Furthermore, G must also contain ab, ba, ab^2, b^2a, and by the uniqueness of inverses, none of these is equal to e. Moreover, none of these may be equal to any of a, b, b^2 (e.g., if we had a = ab, then multiplication by a^{-1} from the left would yield e = b; the other equalities can be refuted similarly). Since G contains only 6 elements, the set {ab, ba, ab^2, b^2a} has at most two elements. Again, we can have neither ab = ab^2 nor ba = b^2a. If ab = ba, then (ab)^2 = a^2b^2 = b^2 ≠ e and (ab)^3 = a^3b^3 = a ≠ e, so the order of ab is greater than 3, which contradicts our assumption. Therefore, it must be that ab = b^2a and ba = ab^2, so that G is indeed isomorphic to S_3 (a corresponds to a transposition and b to a cycle of length 3). This group can also be viewed as the group of symmetries of an equilateral triangle (a corresponds to a reflection and b to a rotation by 120°), see also 12.4.3. We have discussed all possibilities, so the proof is finished. □

On the other hand, if we consider ¬, →, ∧, and ∨ as primitive connectors (i.e., without defining conjunction and disjunction in terms of implication and negation), we need a more extensive system of axioms. Kleene introduced such axioms in 1967:

(Kkvp1) A → (B → A)
(Kkvp2) (A → (B → C)) → ((A → B) → (A → C))
(Kkvp3) (A ∧ B) → A
(Kkvp4) (A ∧ B) → B
(Kkvp5) A → (B → (A ∧ B))
(Kkvp6) A → (A ∨ B)
(Kkvp7) B → (A ∨ B)
(Kkvp8) (A → C) → ((B → C) → ((A ∨ B) → C))
(Kkvp9) (A → B) → ((A → ¬B) → ¬A)
(Kkvp10) ¬¬A → A

Notably, axioms (Kkvp1), (Kkvp2), (Kkvp9), and (Kkvp10) are the same as axioms (kvp1), (kvp3), (kvp2), and (kvp4).

Alternatives to Classical Propositional Calculus

Classical propositional calculus possesses beautiful properties, particularly decidability and strong completeness and correctness.
This means that in classical propositional calculus, the axiomatic system essentially aligns with the semantics. However, this can also pose a danger: one might blur the line between semantics and axiomatics and perceive all of mathematics as precise formal calculation.

One of the problems with classical propositional calculus is the primitive concept of →. It is meant to correspond to the intuitive notion of implication. However, consider the sentences: "If it is dry in the Sahara, then beer is brewed in the Czech Republic" (which is true due to the truth of the consequent) and "If the Illuminati rule the world, then they caused a tsunami in Sri Lanka" (which is true due to the falsity of the antecedent). These are certainly not exemplary true statements. Constructions like these, which are consequences of axiom (kvp1) and of the theorem ¬A → (A → B), are sometimes referred to as "paradoxes of implication".

Another problem is posed by axiom (kvp4), which is semantically interpreted as saying that if a statement is not false, then it is true. This appears self-evident, but it conceals the unspoken assumption that there are no truth values other than "true" and "false". A statement could have another truth value, such as "possible", or it could have no truth value at all, expressing uncertainty or "I don't know". Accepting this possibility means relinquishing the "view from God's eye". Alternative approaches to propositional calculus may involve "improving implication" or introducing multiple truth values. We explore some significant cases.

(1) Intuitionistic propositional calculus. This arises through a slight modification of classical propositional calculus. Axiom (kvp4) is replaced by a weaker axiom:

(ivp4) A → (¬A → B).

This axiom is weaker in the sense that it is a theorem of classical propositional calculus (which can easily be verified by showing that (ivp4) is a tautology of classical propositional calculus), while (kvp4) is not a theorem of intuitionistic propositional calculus.

12.F.12. Find all commutative groups of order 8 (up to isomorphism). Then, for each of the following groups, decide which of the found ones it is isomorphic to (the operation is always multiplication):
• Z_15^×,
• Z_16^×,
• the quotient of Z_17^× by the subgroup {[1], [−1] = [16]},
• the complex roots of the polynomial z^8 − 1.

Solution. By theorem 12.4.8, every commutative group is a product of cyclic groups. By 12.4.10, their orders divide 8. This means that there are only 3 possibilities: Z_8, Z_2 × Z_4, and Z_2 × Z_2 × Z_2.
• The group Z_15^× contains the residue classes which are coprime to 15. There are φ(15) = (5 − 1)(3 − 1) = 8 of them, so indeed |Z_15^×| = 8. In particular, these are 1, 2, 4, 7, 8, 11, 13, 14. Their orders are either 2 (for 4, 11, 14) or 4 (for 2, 7, 8, 13), which means that Z_15^× is isomorphic to Z_2 × Z_4.
• Z_16^× = {1, 3, 5, 7, 9, 11, 13, 15}. Again, this group contains 8 elements, and their orders are either 2 (for 7, 9, 15) or 4 (for 3, 5, 11, 13), which means that Z_16^× is also isomorphic to Z_2 × Z_4.
• Z_17^× = {±1, ±2, . . . , ±8}. Thus, the quotient Z_17^×/(±1) = {1, 2, . . . , 8} has 8 elements. We can easily calculate that the order of 3 is 8. Therefore, 3 generates the entire group, which means that Z_17^×/(±1) ≅ Z_8.
• The complex roots of the polynomial z^8 − 1 are e^{nπi/4}, where n = 1, 2, . . . , 8. Clearly, these form a cyclic group of order 8, isomorphic to Z_8. □

12.F.13. Let G be a commutative group and denote H = {g ∈ G | g^2 = e}, where e is the identity of G. Prove that H is a subgroup of G.

Solution. Clearly, e ∈ H. If a ∈ H, then we also have a^{-1} ∈ H, because a = a^{-1} (since a^2 = e). Moreover, if a, b ∈ H, then (ab)^2 = a^2 b^2 = e (this is where we use the commutativity of G), which means that ab ∈ H. Thus, H is closed under the operation, and it is indeed a subgroup. □
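The classification in 12.F.12 is easy to cross-check by brute force. The following minimal Python sketch (ours; the function name is arbitrary) computes the order of every element of Z_n^×:

from math import gcd

def unit_orders(n):
    """Order of each element of the group Z_n^x (residues coprime to n)."""
    units = [a for a in range(1, n) if gcd(a, n) == 1]
    orders = {}
    for a in units:
        x, k = a, 1
        while x != 1:
            x = (x * a) % n
            k += 1
        orders[a] = k
    return orders

print(unit_orders(15))   # orders 1, 2, 4 only, so Z_15^x ≅ Z_2 x Z_4
print(unit_orders(16))   # the same multiset of orders, so Z_16^x ≅ Z_2 x Z_4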
This means that anything provable or derivable in intuitionistic propositional calculus can also be proven in classical propositional calculus. All the theorems of classical propositional calculus proven without using axiom (kvp4) are also theorems of intuitionistic propositional calculus. The Deduction Theorem also holds in intuitionistic propositional calculus, since its proof does not use axiom (kvp4).

The connection between classical and intuitionistic propositional calculus is further demonstrated by the following Reflection Theorem: If a statement A is derivable from statements A_1, . . . , A_n in classical propositional calculus, then it is also derivable from the statements ¬¬A_1, . . . , ¬¬A_n in intuitionistic propositional calculus. Classical reasoning is thus mirrored in intuitionistic calculus through double negation. In this sense, intuitionistic propositional calculus is stronger than the classical one.

These observations lead to the conclusion that the → operator has a different meaning in classical and in intuitionistic logic, even though it is denoted by the same symbol. (The meaning of an expression can depend on the context in which it is used.) Therefore, the semantics of intuitionistic logic is different from the classical one. It cannot be defined using truth tables; it is introduced using the concept of possible worlds. However, intuitionistic propositional calculus is just as "beautiful" as the classical one, being strongly correct and complete, compact, and decidable.

(2) Łukasiewicz's three-valued propositional calculus. The set of truth values includes not only 0 and 1 but also the value x, which can be interpreted as "I don't know". The semantics of the connectors is again defined by truth tables. The axioms of this propositional calculus are:

(l3vp1) A → (B → A),
(l3vp2) (A → B) → ((B → C) → (A → C)),
(l3vp3) (¬A → ¬B) → (B → A),
(l3vp4) ((A → ¬A) → A) → A.

The modus ponens rule is used for inference. Łukasiewicz's three-valued propositional calculus is decidable, compact, and strongly correct and complete. However, the Deduction Theorem does not hold in it.

(3) Łukasiewicz's fuzzy propositional calculus. The set of truth values is the closed interval [0, 1]. The values 0 and 1 represent falsehood and truth, just as in classical propositional calculus.

12.F.14. Let GL_n(R) denote the set of all regular n-by-n matrices with real coefficients. Prove that G = GL_2(R) with multiplication is a group, and decide for each of the following subsets H of G whether it is a subgroup of G:
i) H = GL_2(Q),
ii) H = GL_2(Z),
iii) H = {A ∈ GL_2(Z) | |A| = 1},
iv) H = { ( 0 a ; a b ) ∈ G | a, b ∈ Q },
v) H = { ( 1 0 ; a 1 ) ∈ G | a ∈ Z },
vi) H = { ( 1 a ; a 1 ) ∈ G | a ∈ Q },
vii) H = { ( 0 a ; b c ) ∈ G | a, b, c ∈ R },
viii) H = { ( 1 a ; b c ) ∈ G | a, b, c ∈ R }. ⃝

12.F.15. i) Decide whether the set H = {a ∈ R^* | a^2 ∈ Q} is a subgroup of the group (R^*, ·).
ii) Decide whether the set H = {a ∈ R | a^2 ∈ Q} is a subgroup of the group (R, +). ⃝

12.F.16. Find all positive integers m ≠ 5 such that the group Z_m^× is isomorphic to Z_5^×. ⃝

12.F.17. How many cycles of length p (1 < p ≤ n) are there in S_n?

Solution. The elements of the cycle (i.e., the non-fixed points of the permutation) can be selected in (n choose p) ways. Now, without loss of generality, we can proclaim one of the p elements to be the first one in the cycle representation (for instance the least one, if we are working with numbers). This element can be mapped to any of the p − 1 remaining elements, that one to any of the p − 2 remaining elements, etc. Altogether, the product rule yields (n choose p) · (p − 1)! cycles of length p. □
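The counting argument of 12.F.17 can be verified experimentally for small n. A short Python sketch (ours) compares a brute-force count of the p-cycles in S_n with the formula C(n, p) · (p − 1)! just derived:

from itertools import permutations
from math import comb, factorial

def cycle_type(perm):
    """Sorted cycle lengths of a permutation given as a tuple with perm[i] = image of i."""
    seen, lengths = set(), []
    for i in range(len(perm)):
        if i not in seen:
            j, L = i, 0
            while j not in seen:
                seen.add(j)
                j = perm[j]
                L += 1
            lengths.append(L)
    return sorted(lengths)

def count_p_cycles(n, p):
    target = sorted([1] * (n - p) + [p])   # a single p-cycle, rest fixed
    return sum(cycle_type(s) == target for s in permutations(range(n)))

for n, p in [(4, 2), (4, 3), (4, 4), (5, 3)]:
    assert count_p_cycles(n, p) == comb(n, p) * factorial(p - 1)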
12.F.18. Let G be the set of real 3-by-3 matrices with ones on the diagonal and zeros above it. Prove that G with matrix multiplication forms a group, i.e., a subgroup of GL(3, R), and find the center of G (i.e., the subgroup Z(G) = {z ∈ G | ∀g ∈ G : zg = gz}).

Solution. We can either verify all the group axioms or make use of the known fact that GL(3, R) is a group and verify only that G is closed under multiplication and inverses. Clearly, the neutral element (the identity matrix) lies in G. Further (writing the rows of each matrix separated by semicolons),

( 1 0 0 ; a 1 0 ; b c 1 ) ( 1 0 0 ; a1 1 0 ; b1 c1 1 ) = ( 1 0 0 ; a + a1 1 0 ; b + c·a1 + b1 c + c1 1 ) ∈ G,

( 1 0 0 ; a 1 0 ; b c 1 )^{-1} = ( 1 0 0 ; −a 1 0 ; −b + ac −c 1 ) ∈ G.

It follows from the form of the products in G that the center consists precisely of the matrices of the form ( 1 0 0 ; 0 1 0 ; b 0 1 ). □

12.F.19. For any subset X ⊆ G, define its centralizer as C_G(X) = {y ∈ G | xy = yx for all x ∈ X}. Prove that if X ⊆ Y, then C_G(Y) ⊆ C_G(X). Further, prove that X ⊆ C_G(C_G(X)) and C_G(X) = C_G(C_G(C_G(X))).

Solution. The first proposition is clear: the elements of G which commute with everything from Y also commute with everything from X. We have from the definition that C_G(C_G(X)) = {y ∈ G | xy = yx for all x ∈ C_G(X)}, and this is in particular satisfied by the elements y ∈ X, which proves the second proposition. The last statement follows simply from the two above: substituting X := C_G(X) into the second one, we get C_G(X) ⊆ C_G(C_G(C_G(X))), and applying the first one to the second one, we obtain C_G(X) ⊇ C_G(C_G(C_G(X))). □

The values in between can be seen as the degree of certainty about the statement. The interpretation is governed by the rules:

|¬A| = 1 − |A|,
|A → B| = max{1 − |A|, |B|} = max{|¬A|, |B|},
|A ∧ B| = min{|A|, |B|},
|A ∨ B| = max{|A|, |B|}.

The first two rules are the same as in classical propositional calculus. Note that the degree of certainty of a statement cannot be interpreted as its probability, even when the statements are about random events. In such a case, the conjunction ∧ would express the simultaneous occurrence of events; but the degree of certainty is calculated directly from the degrees of certainty of the individual events, while the probability of a simultaneous occurrence also depends on the stochastic dependence or independence of the relevant events.

The axioms of this propositional calculus are:

(lFvp1) A → (B → A),
(lFvp2) (A → B) → ((B → C) → (A → C)),
(lFvp3) (¬A → ¬B) → (B → A),
(lFvp4) ((A → B) → B) → ((B → A) → A).

The modus ponens rule is used for inference. The first three axioms of fuzzy propositional calculus are the same as the axioms of the three-valued propositional calculus. Fuzzy propositional calculus is decidable and compact, but it is not strongly correct or complete. The Deduction Theorem does not hold in it.
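The four interpretation rules are straightforward to express in code. A minimal Python sketch (ours), assuming exactly the rules above; it illustrates that, for intermediate degrees, the law of excluded middle A ∨ ¬A does not reach the degree 1:

# degrees of certainty lie in the interval [0, 1]
def neg(a):      return 1 - a
def imp(a, b):   return max(1 - a, b)
def con(a, b):   return min(a, b)      # conjunction
def dis(a, b):   return max(a, b)      # disjunction

a = 0.3
print(dis(a, neg(a)))    # 0.7: A ∨ ¬A has degree < 1, unlike in the two-valued case
print(imp(a, a))         # 0.7 as well: even A → A is not constantly 1 under these rules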
(4) Paraconsistent four-valued propositional calculus. We previously interpreted the truth value x in Łukasiewicz's three-valued propositional calculus as "I don't know". However, we can also interpret it as meaning that a given statement has no truth value, i.e., it is "neither true nor false". Now, to the three truth values 0, 1, and x, we add a fourth value, denoted g, and interpret it as "simultaneously true and false". This may sound nonsensical at first, but these truth values acquire meaning when "seeking the truth" in the real world. For example, imagine that we have witness statements about a situation that can be expressed by a statement (declarative sentence). If the witnesses agree that the situation occurred, we assign the truth value 1 to the statement. If they agree that the situation did not occur, we assign the value 0. However, the witness statements may differ; in such a case, we assign the truth value g to the statement. And if there is no testimony at all, the statement has the value x. The value x is also referred to as a "truth-value gap", while the value g is called a "truth-value glut".

12.F.20. Suppose that a group G has a non-trivial subgroup H which is contained in every non-trivial subgroup of G. Prove that H is contained in the center of G.

Solution. For each g ∈ G, the centralizer C_G(g) = {x ∈ G | xg = gx} is a non-trivial subgroup, since g ∈ C_G(g) and C_G(e) = G. Thus, the group H is contained in every C_G(g). Therefore, it is contained in their intersection (over all g ∈ G), which is exactly the center of G. □

12.F.21. Let G be a finite group. The conjugation class of a ∈ G is the set Cl(a) = {xax^{-1} | x ∈ G}. Prove that:
i) the set of conjugation classes of all elements of G is a partition of G,
ii) the size of each conjugation class divides the order of G,
iii) if G has only two conjugation classes, then its order is 2.

Solution. (i) Since a = eae^{-1} ∈ Cl(a), the classes cover G, so it suffices to show that for any a, b ∈ G, either Cl(a) = Cl(b) or Cl(a) ∩ Cl(b) = ∅. Thus, assume that the intersection of Cl(a) and Cl(b) is non-empty. Then, by definition, there are x, y ∈ G such that xax^{-1} = yby^{-1}. Multiplying this equality by y^{-1} from the left and by y from the right leads to y^{-1}xax^{-1}y = b. However, (y^{-1}x)^{-1} = x^{-1}y, which means that b is of the form zaz^{-1} for z = y^{-1}x, and thus lies in Cl(a). Analogously, we get a ∈ Cl(b), so both conjugation classes coincide.
(ii) Note that the elements of Cl(a) are in one-to-one correspondence with the cosets of the centralizer C_G(a) = {x ∈ G | xax^{-1} = a}. Indeed, if elements b and c lie in the same coset (i.e., they satisfy b = cz for some z ∈ C_G(a)), then bab^{-1} = cza(cz)^{-1} = czaz^{-1}c^{-1} = czz^{-1}ac^{-1} = cac^{-1}. By 10.2.1, we have |G| = |C_G(a)| · |G/C_G(a)|, which means that |Cl(a)| = |G/C_G(a)| divides |G|.
(iii) The neutral element always forms its own conjugation class Cl(e) = {e}. Therefore, if there are only two conjugation classes, then all the other elements a ≠ e must lie in one class. Thus, its size is |G| − 1, and by (ii), this integer must divide |G|. But then |G| − 1 also divides |G| − (|G| − 1) = 1, which means that |G| = 2. □
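Parts (i) and (ii) of 12.F.21 can be observed concretely in a small group. The following Python sketch (ours) computes the conjugation classes of S_3, with permutations represented as tuples, and checks that their sizes divide the group order:

from itertools import permutations

def compose(p, q):              # (p ∘ q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

G = list(permutations(range(3)))         # the group S_3
classes, remaining = [], set(G)
while remaining:
    a = next(iter(remaining))
    cl = {compose(compose(x, a), inverse(x)) for x in G}
    classes.append(cl)
    remaining -= cl

sizes = sorted(len(c) for c in classes)
print(sizes)                             # [1, 2, 3]: identity, the 3-cycles, the transpositions
assert all(len(G) % s == 0 for s in sizes)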
12.F.22. Let G be a commutative group. Suppose that the order r of an element a ∈ G and the order s of an element b ∈ G are coprime. Prove that the order of ab is rs.

Solution. We have (ab)^{rs} = a^{rs} b^{rs} = (a^r)^s (b^s)^r = e^s e^r = e, so the order is at most rs. For the sake of contradiction, assume that (ab)^q = e for some q < rs. Since q is less than the least common multiple of r and s (recall that r, s are coprime), at least one of them does not divide q. Assume that it is r (the other case can be refuted analogously). Taking the s-th power of the equality (ab)^q = e, we get e = ((ab)^q)^s = (ab)^{qs} = a^{qs} b^{qs} = a^{qs} (b^s)^q = a^{qs} e^q = a^{qs}. Since r does not divide q and is coprime to s, we get that r (the order of a) does not divide qs, but a^{qs} = e, which is a contradiction. □

We can define the interpretation as an ordered pair (I_1, I_2) of interpretations of the atomic statements in Łukasiewicz's three-valued logic, with two witnesses interpreting the statement. Each of them can respond with "yes", "no", or "I don't know". To these pairs of interpretations, we assign truth values in the interpretation I according to the table:

(I_1, I_2)   (0,0)  (0,x)  (x,0)  (x,x)  (0,1)  (1,0)  (x,1)  (1,x)  (1,1)
I              0      0      0      x      g      g      1      1      1

We create the interpretation of negation by taking the interpretations of the negation in I_1 and I_2 (recall that if |A| = x, then |¬A| = x in Łukasiewicz's three-valued logic). This results in the table:

A    ¬A
0     1
x     x
g     g
1     0

Furthermore, on the set of truth values, we introduce a partial ordering in which 0 is the least value, 1 is the greatest one, and the intermediate values x and g are mutually incomparable, and we define the interpretation of the propositional connectives by the rules:

|A → B| = sup{|¬A|, |B|},  |A ∨ B| = sup{|A|, |B|},  |A ∧ B| = inf{|A|, |B|}.

The semantics is thus a straightforward generalization of the semantics of Łukasiewicz's fuzzy propositional calculus. Within this semantics, we define a tautology as a statement which, for any interpretation of the statements it is composed of, has the interpretation 1 or g. Furthermore, we say that a statement A follows from a set of statements {A_1, . . . , A_n} if, for any interpretation of the atomic statements such that the statements A_1, . . . , A_n have interpretations 1 or g, the statement A also has the interpretation 1 or g.

If a statement A has the interpretation x, then the statements A ∨ ¬A and ¬(A ∧ ¬A) also have the interpretation x and are not tautologies. Hence, the Law of Excluded Middle does not hold in the considered paraconsistent logic. If a statement A has the interpretation g and a statement B has the interpretation 1, then the statement A → B has the interpretation 1 and the statement A → ¬B has the interpretation g. This means that in paraconsistent logic, "local inconsistency" can occur (a case where both a statement and its negation follow from another statement). If a statement A has the interpretation g, then the statement A ∧ ¬A also has the interpretation g, and a statement B with the interpretation 0 or x does not follow from it. This means that the Law of Non-Contradiction does not hold in paraconsistent logic, while "global inconsistency" (a case where every statement follows from some set of statements) cannot occur. A propositional logic with "local inconsistencies" but without "global inconsistencies" is interesting because people sometimes hold beliefs that are factually contradictory and yet manage to work with them relatively successfully.

12.F.23. Prove that every finite group G whose order is greater than 2 has a non-trivial automorphism.

Solution. If G is not commutative and a is an element that does not lie in the center, then the conjugation x ↦ axa^{-1} defines a non-trivial automorphism. For a cyclic group of order m, we have, for any n coprime to m, the automorphism x ↦ x^n. If G is commutative, then it is a product of cyclic groups (see 10.1.8). If the order of at least one of the factors is greater than 2, then we can use the above automorphism for cyclic groups. If the order of each factor is 2, then permuting any pair of factors is a non-trivial automorphism. □
12.F.24. Consider the group (Q, +) of the rational numbers with addition and the group (Q^+, ·) of the positive rational numbers with multiplication. Find all homomorphisms (Q, +) → (Q^+, ·).

Solution. There is only one homomorphism, the trivial one. For the sake of contradiction, assume that there exists a non-trivial homomorphism φ, i.e., φ(a) = b ≠ 1 for some a, b ∈ Q. Then, for all n ∈ N, we have b = φ(a) = φ(n · a/n) = φ(a/n)^n. This is a contradiction, since a rational number b ≠ 1 does not have a rational n-th root for every n (cf. ??). □

12.F.25. Let G be the group of matrices of the form ( a 0 ; b a^{-1} ), where a, b ∈ R and a > 0, and let N be the set of matrices of the form ( 1 0 ; b 1 ), where b ∈ R. Show that N is a normal subgroup of G and prove that G/N is isomorphic to R.

Solution. The key to the proof is the formula for multiplication in G:

( a 0 ; b a^{-1} ) ( a1 0 ; b1 a1^{-1} ) = ( a·a1 0 ; b·a1 + a^{-1}·b1 a^{-1}·a1^{-1} ).

Hence we can see that the mapping ( a 0 ; b a^{-1} ) ↦ a is a homomorphism with kernel N. Thus, N is a normal subgroup of G. Moreover, G/N is isomorphic to the multiplicative group R^+, which is isomorphic to the additive group R. □

3. Polynomial rings

The operations of addition and multiplication are fundamental for scalars as well as for vectors, and there are further similar structures. Besides the integers Z, the rational numbers Q, and the complex numbers C, there are polynomials over similar scalars K to be considered. Among others, the abstract algebraic theory can in many aspects be viewed as a straightforward generalization of the divisibility properties of the integers.

12.3.1. Rings and fields. Recall that the integers and all other scalars K have the following properties:

Commutative rings and integral domains

Definition. Let (M, +, ·) be an algebraic structure with two binary operations + and ·. It is a commutative ring if it satisfies
• (a + b) + c = a + (b + c) for all a, b, c ∈ M;
• a + b = b + a for all a, b ∈ M;
• there is an element 0 such that 0 + a = a for all a ∈ M;
• for each a ∈ M, there is a unique element −a ∈ M such that a + (−a) = 0;
• (a · b) · c = a · (b · c) for all a, b, c ∈ M;
• a · b = b · a for all a, b ∈ M;
• there is an element 1 such that 1 · a = a for all a ∈ M;
• a · (b + c) = a · b + a · c for all a, b, c ∈ M.
If, moreover, c · d = 0 implies that c or d is zero, then the ring is called an integral domain.

The first four properties define the algebraic structure of a commutative group (M, +). Groups are considered in more detail in the next part of this chapter. The last property in the list of ring axioms is called distributivity of multiplication over addition. (There are similar axioms for Boolean algebras, where each of the operations is distributive over the other.) A general ring is defined by the same axioms with the commutativity of "·" omitted; if the operation "·" is commutative, the ring is called commutative, otherwise non-commutative. In the sequel, rings are commutative unless otherwise stated. Traditionally, the operation "+" is called addition and the operation "·" multiplication, even if they are not the standard operations on one of the known rings of numbers. In the literature, there are also structures without the assumption of the existence of the identity for multiplication. These are not discussed here, so it is always assumed that a ring has a multiplicative identity, denoted by 1. The identity for addition is denoted by 0.

Fields

A non-trivial ring in which all non-zero elements are invertible with respect to multiplication is called a division ring. If the multiplication is commutative, it is called a field.
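The difference between fields, integral domains and general rings is easy to explore in the residue class rings Z_n. A small Python sketch (ours; the function name is arbitrary) lists the units and the zero divisors:

def analyse(n):
    """Units and zero divisors among the non-zero elements of Z_n."""
    units = [a for a in range(1, n)
             if any(a * b % n == 1 for b in range(1, n))]
    zero_divisors = [a for a in range(1, n)
                     if any(a * b % n == 0 for b in range(1, n))]
    return units, zero_divisors

print(analyse(7))   # all of 1..6 are units, no zero divisors: Z_7 is a field
print(analyse(6))   # units [1, 5], zero divisors [2, 3, 4]: Z_6 is not an integral domain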
12.F.26. Let G be the group, with the operation of matrix multiplication, of 2-by-2 matrices ( a b ; c d ) where ad − bc ≠ 0 and a, b, c, d are integers from Z_3. Show that:
i) |G| = 48;
ii) the matrices with ad − bc = 1 form a subgroup H with |H| = 24.

Solution. i) For the first row (a, b) of a matrix M ∈ G, the values a, b ∈ Z_3 can be arbitrary except for a = 0, b = 0. Hence, there are 3 · 3 − 1 = 8 possibilities for the first row. The second row must not be a multiple of the first one; the multiplication factor can be 0, 1, or 2. Hence, there are 3 · 3 − 3 = 6 possibilities for the second row once the first row is filled. Thus, |G| = 8 · 6 = 48.
ii) The set H of matrices ( a b ; c d ) with ad − bc = 1 forms a normal subgroup of G. Note that for every g ∈ G, the value det g ∈ Z_3 is either 1 or 2; e.g., det g = 2 for g = ( 2 2 ; 1 2 ) and det h = 1 for h = ( 2 1 ; 1 1 ). Hence, det g′ = 2 for all g′ ∈ Hg and, conversely, if det g′ = 2, then g′ ∈ Hg. Multiplication by g maps H bijectively onto Hg. Hence, the index of H in G equals 2, and therefore |H| = |G|/2 = 24. □

12.F.27. If N is a normal subgroup of G and H is any subgroup of G, then NH is also a subgroup of G.

Solution. Let n1·h1 and n2·h2 be two elements of NH. Then:
i) n1h1n2h2 = n1(h1n2h1^{-1})h1h2 = n1n2′h1h2 = n′h′ ∈ NH;
ii) (nh)^{-1} = h^{-1}n^{-1} = (h^{-1}n^{-1}h)h^{-1} = n′h^{-1} ∈ NH. □

12.F.28. Suppose that N and M are two normal subgroups of G with N ∩ M = {e}. Show that nm = mn for any n ∈ N and m ∈ M.

Solution. On the one hand, mnm^{-1}n^{-1} = m(nm^{-1}n^{-1}) = mm′ ∈ M. On the other hand, mnm^{-1}n^{-1} = (mnm^{-1})n^{-1} = n′n^{-1} ∈ N. Hence mnm^{-1}n^{-1} ∈ M ∩ N, so mnm^{-1}n^{-1} = e, and thus mn = nm. □

12.F.29. Show that the intersection of two normal subgroups of G is also a normal subgroup of G.

Solution. It is clear that the intersection of two subgroups of G is also a subgroup of G. Let N and M be normal subgroups of G. Then, for any g ∈ G, any n ∈ N and any m ∈ M, we have gng^{-1} ∈ N and gmg^{-1} ∈ M. Let w ∈ N ∩ M. Since w ∈ N, we have gwg^{-1} ∈ N, and since w ∈ M, we have gwg^{-1} ∈ M. Hence gwg^{-1} ∈ N ∩ M for any g ∈ G, i.e., N ∩ M is a normal subgroup of G. □

12.F.30. Prove that if G is a group and H ⊂ G is a subgroup of index 2, then H is a normal subgroup of G.

Solution. Let H and aH be the left cosets of H in G, and let H and Hb be the right cosets of H in G. Since there are only two cosets, aH = G \ H and Hb = G \ H, whence aH = Hb. In order to show that H is normal in G, we need xH = Hx for any x ∈ G.

Typical examples of fields are the rational numbers Q, the real numbers R, and the complex numbers C. Furthermore, every remainder class set Z_m is a commutative ring, while only the Z_p with prime p are also fields. We shall come back to fields in the next part of this chapter, too.

Recall the useful example of a non-commutative ring: the set Mat_k(K) of all k-by-k matrices over a ring K, k ≥ 2. As can be checked already for K = Z_2 and k = 2, these rings are never integral domains (see 2.1.5 on page 87 for the full argument).

As an example of a division ring which is not a field, consider the ring of quaternions H. This is constructed as an extension of the complex numbers by adding another imaginary unit j, i.e., H = C ⊕ jC ≅ R^4, just as the complex numbers are obtained from the reals. Another "new" element i · j is usually denoted by k. It follows from the construction that ij = −ji. This structure is a division ring. Think out the details as a not completely easy exercise!
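As a hint for that exercise, the multiplication in H can be experimented with numerically. A minimal Python sketch (ours), assuming the usual Hamilton product on quadruples (a, b, c, d) = a + bi + cj + dk:

def qmul(p, q):
    """Hamilton product of two quaternions given as 4-tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
assert qmul(i, j) == k and qmul(j, i) == (0, 0, 0, -1)    # ij = k = -ji

def qinv(q):
    # every non-zero quaternion is invertible: q^{-1} = conjugate(q) / |q|^2
    a, b, c, d = q
    n2 = a*a + b*b + c*c + d*d
    return (a/n2, -b/n2, -c/n2, -d/n2)

q = (1.0, 2.0, -1.0, 0.5)
print(qmul(q, qinv(q)))     # (1.0, 0.0, 0.0, 0.0) up to rounding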
12.3.2. Elementary properties of rings. The following lemma collects properties which all seem obvious for the rings of scalars. But the properties need proofs if an abstract theory is to be built:

Lemma. In every commutative ring K, the following holds:
(1) 0 · c = c · 0 = 0 for all c ∈ K;
(2) −c = (−1) · c = c · (−1) for all c ∈ K;
(3) −(c · d) = (−c) · d = c · (−d) for all c, d ∈ K;
(4) a · (b − c) = a · b − a · c;
(5) the entire ring K collapses to the trivial set {0} = {1} if and only if 0 = 1.

Proof. All of the propositions are direct consequences of the definition axioms. In the first case, for any c, a,

c · a = c · (a + 0) = c · a + c · 0,

and since 0 is the only element that is neutral with respect to addition, c · 0 = 0. In the second case, it suffices to compute

0 = c · 0 = c · (1 + (−1)) = c + c · (−1).

This means that c · (−1) is the additive inverse of c, as desired. The following two propositions are direct consequences of the second one and the axioms. Finally, if the ring contains only one element, then 0 = 1. On the other hand, if 1 = 0, then for any c ∈ K, necessarily c = 1 · c = 0 · c = 0. □

12.3.3. Polynomials over rings. The definition of a commutative ring uses precisely the properties that are expected for multiplication and addition. The concept of polynomials can now be extended. A polynomial is any expression built from (known) constant elements of K and an (unknown) variable using a finite number of additions and multiplications. Formally, polynomials are defined as follows. (It is not by accident that the symbol K is used for the ring: you can imagine, e.g., any of the number rings behind it.)

Let x ∈ G. If x ∈ H, then obviously xH = H = Hx. If x ∈ G \ H = aH = Hb, then there exist h1, h2 ∈ H such that x = ah1 = h2b. Then xH = (ah1)H = aH = Hb = H(h2b) = Hx. Thus xH = Hx for any x ∈ G, i.e., x^{-1}Hx = H. □

12.F.31. Show that the intersection of two normal subgroups of G is a normal subgroup of G.

Solution. This is proved in 12.F.29 above: for normal subgroups N, M of G and any w ∈ N ∩ M, g ∈ G, we have gwg^{-1} ∈ N and gwg^{-1} ∈ M, so gwg^{-1} ∈ N ∩ M. □

12.F.32. If N and M are normal subgroups of G, show that NM is also a normal subgroup of G.

Solution. By 12.F.27, NM is a subgroup. Let nm ∈ NM; then gnmg^{-1} = (gng^{-1})(gmg^{-1}) ∈ NM, since gng^{-1} ∈ N and gmg^{-1} ∈ M. □

12.F.33. Let N be a normal subgroup of a finite group G such that gcd(|G : N|, |N|) = 1. Show that N contains every element x ∈ G satisfying x^{|N|} = e.

Solution. Let x ∈ G be such that x^{|N|} = e. Since gcd(|G : N|, |N|) = 1, there exist m, n ∈ Z such that m|G : N| + n|N| = 1. Then x = x^{m|G:N| + n|N|} = x^{m|G:N|} x^{n|N|} = x^{m|G:N|}, using x^{|N|} = e. Consider the element xN ∈ G/N. Since (xN)^k = x^k N for any k ∈ Z and the order of any element of G/N divides |G/N| = |G : N|, we get (xN)^{|G:N|} = x^{|G:N|} N = N, the identity element of G/N. This means x^{|G:N|} ∈ N, and so x = (x^{|G:N|})^m ∈ N. □

12.F.34. If H is a subgroup of G such that the product of any two right cosets of H in G is again a right coset of H in G, show that H is normal in G.

Solution. Let Ha and Hb be two right cosets of H in G. By the assumption, HaHb is a right coset of H in G. The set HaHb contains the element eaeb = ab. Since there is only one right coset of H in G containing ab, namely Hab, we have HaHb = Hab for all a, b ∈ G. Hence HaH = Ha for all a ∈ G, i.e., h1ah2 equals h3a for some h3 ∈ H. But h1ah2 = h3a implies ah2a^{-1} = h1^{-1}h3 = h′ ∈ H for any h2 ∈ H and a ∈ G. So, H is normal in G. □
12.F.35. Let Z(G) be the center of a group G and suppose that xy = z ∈ Z(G) for some x, y ∈ G. Show that x and y commute.

Solution. As zx = xz, we have xyx = zx = xz = xxy. Multiplying by x^{-1} on the left, we obtain yx = xy. □

Polynomials

Definition. Let K be a commutative ring. A polynomial over K is a finite expression

f(x) = a_0 + a_1 x + · · · + a_k x^k,

where the a_i ∈ K, i = 0, 1, . . . , k, are the coefficients of the polynomial. If a_k ≠ 0, then, by definition, f(x) has degree k, written deg f = k. The zero polynomial is not assigned a degree. Polynomials of degree zero (called constant polynomials) are exactly the non-zero elements of K. Polynomials f(x) and g(x) are equal if they have the same coefficients. The set of all polynomials over a ring K is denoted by K[x].

Every polynomial defines a mapping f : K → K by substituting an argument c for the variable x and evaluating the resulting expression, i.e., f(c) = a_0 + a_1 c + · · · + a_k c^k. Note that the constant polynomials define constant mappings in this manner. A root of a polynomial f(x) is an element c ∈ K for which f(c) = 0 ∈ K.

It may happen that different polynomials define the same mapping. For instance, the polynomial x^2 + x ∈ Z_2[x] defines the mapping which is constantly equal to zero. More generally, for every finite ring K = {a_0, a_1, . . . , a_k}, the polynomial f(x) = (x − a_0)(x − a_1) · · · (x − a_k) defines the constant-zero mapping.

Polynomials f(x) = ∑_i a_i x^i and g(x) = ∑_i b_i x^i can be added and multiplied in a natural way (just think of introducing the structure of a ring again and invoke the expected distributivity of multiplication over addition):

(f + g)(x) = (a_0 + b_0) + (a_1 + b_1) x + · · · + (a_k + b_k) x^k,
(f · g)(x) = a_0 b_0 + (a_0 b_1 + a_1 b_0) x + · · · + (a_0 b_r + a_1 b_{r−1} + · · · + a_r b_0) x^r + · · · + a_k b_ℓ x^{k+ℓ},

where k ≥ ℓ are the degrees of f and g, respectively. Zero coefficients are assumed wherever there is no coefficient in the original expression. (To avoid this formal hassle, a polynomial can alternatively be defined as an infinite expression, like a formal power series over the ring in question, with the condition that only finitely many coefficients are non-zero. Concerning the degree, the zero polynomial is then attributed the degree −∞.)

This definition corresponds to the addition and multiplication of the function values of f, g : K → K, by the properties of the "coefficients" in the original ring K. It follows directly from the definitions that the set K[x] of polynomials over a commutative ring K is again a commutative ring, where the multiplicative identity is the element 1 ∈ K, perceived as a polynomial of degree zero, and the additive identity is the zero polynomial. You should check all the axioms carefully!

12.F.36. Let G be the group of 2-by-2 matrices M = ( a b ; c d ) with ad − bc ≠ 0 and a, b, c, d ∈ R. Let H = { ( 1 0 ; x 1 ) | x ∈ Z } be a subgroup of G and let g = ( 1 0 ; 0 2 ). Show that gHg^{-1} is a proper subset of H, so that H ≠ gHg^{-1} ⊂ H.

Solution. Obviously, the matrix ( 1 0 ; 1 1 ) lies in H. However,

( 1 0 ; 0 2 ) ( 1 0 ; x 1 ) ( 1 0 ; 0 1/2 ) = ( 1 0 ; 2x 2 ) ( 1 0 ; 0 1/2 ) = ( 1 0 ; 2x 1 ).

Thus, gHg^{-1} = { ( 1 0 ; 2x 1 ) | x ∈ Z } does not contain ( 1 0 ; 1 1 ) ∈ H. □
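The addition and multiplication formulas above translate directly into code. A short Python sketch (ours), with polynomials represented by their coefficient lists:

def poly_add(f, g):
    """Add two polynomials given as coefficient lists, f[i] = coefficient of x^i."""
    n = max(len(f), len(g))
    f = f + [0] * (n - len(f))
    g = g + [0] * (n - len(g))
    return [a + b for a, b in zip(f, g)]

def poly_mul(f, g):
    # c_r = sum over i + j = r of a_i * b_j, exactly the formula in the text
    c = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            c[i + j] += a * b
    return c

# (1 + x) * (1 - x) = 1 - x^2
print(poly_mul([1, 1], [1, -1]))    # [1, 0, -1]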
12.F.37. Let G be a group of even order. Prove that it contains a subgroup of order 2.

Solution. Consider the set of pairs {g, g^{-1}}, where g ≠ g^{-1}. These pairs engage an even number of elements of G. The elements g ∈ G that do not occur in such a pair are exactly those satisfying g = g^{-1}, i.e., g^2 = e. Therefore, counting modulo 2, we can ignore the pairs {g, g^{-1}} and obtain |G| ≡ |{g ∈ G | g^2 = e}| (mod 2). One solution of g^2 = e is e itself. If it were the only one, then |G| ≡ 1 (mod 2), which is false. Therefore, some g ≠ e satisfies g^2 = e, which provides an element of order 2 and hence also a subgroup of G of order 2. □

12.F.38. Let G be a finite abelian group whose order n is divisible by a prime p. Show that G contains an element of order p, and hence a cyclic subgroup of that order. This claim is known as the abelian case of Cauchy's theorem.

Solution. We use induction on n. The case n = p is trivial. Let n > p with p | n, and suppose that the proposition is true for all abelian groups of order divisible by p and less than n. Assume no element of G has order p. Then no element has order divisible by p, because if g ∈ G had order r with p | r, then g^{r/p} would have order p. Let G = {g_1, g_2, . . . , g_n} and let g_i have order m_i, so that m_i is not divisible by p. Set m to be the least common multiple of all the m_i, i = 1, 2, . . . , n; then m is not divisible by p and g_i^m = e for all i. Because G is abelian, the function f : (Z/m)^n → G given by

f(a_1, . . . , a_n) = g_1^{a_1} · · · g_n^{a_n}

is a homomorphism, f(a_1, . . . , a_n) f(b_1, . . . , b_n) = f(a_1 + b_1, . . . , a_n + b_n), since, by the commutativity of G, g_1^{a_1} · · · g_n^{a_n} g_1^{b_1} · · · g_n^{b_n} = g_1^{a_1+b_1} · · · g_n^{a_n+b_n}. This homomorphism is surjective: each g_i equals f(a_1, . . . , a_n) with a_i = 1 and a_j = 0 for j ≠ i. Moreover, the set of elements where f takes a particular value is a coset of ker f, so |G| equals the number of cosets of ker f, which is a factor of |(Z/m)^n| = m^n. Since p divides |G| and m^n is not divisible by p, this is a contradiction. □

Lemma. A polynomial ring over an integral domain is again an integral domain.

Proof. The task is to show that K[x] can contain non-trivial divisors of zero only if they lie in K. However, this is clear from the expression for polynomial multiplication: if f(x) and g(x) are polynomials of degrees k and ℓ as above, then the coefficient at x^{k+ℓ} in the product f(x) · g(x) is the product a_k · b_ℓ, which is non-zero unless there are zero divisors in K. □

12.3.4. Multivariate polynomials. Some objects can be described using polynomials in several variables. For instance, consider a circle in the plane R^2 with center S = (x_0, y_0) and radius R. This circle can be defined by the equation

(x − x_0)^2 + (y − y_0)^2 − R^2 = 0.

Rings of polynomials in variables x_1, . . . , x_r can be defined similarly as in the case of K[x]. Instead of the powers x^k of a single variable x, consider the monomials x_1^{k_1} · · · x_r^{k_r} and their formal linear combinations with coefficients a_{k_1···k_r} ∈ K. However, it is simpler, both formally and technically, to define them inductively by

K[x_1, . . . , x_r] := (K[x_1, . . . , x_{r−1}])[x_r].

For instance, K[x, y] = K[x][y]: one can consider polynomials in the variable y over the ring K[x]. It can be shown (check this in detail!) that polynomials in the variables x_1, . . . , x_r can be viewed, even with this definition, as expressions created from the variables x_1, . . . , x_r and the elements of the ring K by a finite number of (formal) additions and multiplications in a commutative ring. For example, the elements of K[x, y] are of the form

f = a_n(x) y^n + a_{n−1}(x) y^{n−1} + · · · + a_0(x)
  = (a_{mn} x^m + · · · + a_{0n}) y^n + · · · + (b_{p0} x^p + · · · + b_{00})
  = c_{00} + c_{10} x + c_{01} y + c_{20} x^2 + c_{11} xy + c_{02} y^2 + · · · .
To simplify the notation, we use multi-index notation (as we did with real polynomials and partial derivatives in infinitesimal analysis).

12.F.39. Let G be a finite non-abelian group whose order n is divisible by a prime p. Show that G contains an element of order p, and hence a cyclic subgroup of that order. This is the non-abelian case of Cauchy's theorem.

Solution. If a proper subgroup H of G has order divisible by p, then by induction there is an element of order p in H, which gives an element of order p in G. Thus, we may assume that no proper subgroup of G has order divisible by p. For any proper subgroup H ⊂ G, we have |G| = |H| · [G : H] and |H| is not divisible by p, so p | [G : H] for every proper subgroup H. Denote the conjugacy class of an element a ∈ G by C(a) = {b ∈ G | b = g^{-1}ag for some g ∈ G}. Let the conjugacy classes in G with size greater than 1 be represented by g_1, g_2, . . . , g_k. The conjugacy classes of size 1 are exactly the elements of the center Z(G). Since the conjugacy classes partition G, counting |G| through the classes yields

|G| = |Z(G)| + ∑_{i=1}^{k} |C(g_i)| = |Z(G)| + ∑_{i=1}^{k} [G : Z(g_i)],

where Z(g_i) is the centralizer of g_i. Every [G : Z(g_i)] > 1, since each |C(g_i)| is greater than 1. Since |G| = |Z(g_i)| · [G : Z(g_i)] for every i, it follows that p | [G : Z(g_i)]. Thus, |Z(G)| is divisible by p, and so the center is non-trivial. Since proper subgroups of G do not have order divisible by p, the center Z(G) has to be all of G. That means G is abelian, which is a contradiction. □

12.F.40. Prove that any group of order 15 is cyclic.

Solution. Let G be a group of order 15. By Cauchy's theorem, there exists a subgroup M of order 5 and a subgroup N of order 3; [G : M] = |G|/|M| = 3. Both M and N are cyclic. Let x be a generator of M and y a generator of N. Let us prove that xy generates the entire G. (NEED TO COME UP WITH AN ELEGANT PROOF..) □

12.F.41. Let G be a group of order 14 which has a normal subgroup N of order 2. Prove that G is commutative.

Solution. Clearly, the order of the group G/N is |G/N| = |G|/|N| = 7. By Lagrange's theorem 12.4.10, the orders of its elements are 1 or 7. Since only the identity has order 1, there is an element of order 7, so the group G/N is cyclic. Let N = {e, n}, where e is the identity of G, and let [a] be a generator of G/N. Since N is normal, we have ana^{-1} ∈ N; but ana^{-1} = e would imply n = e, so we must have ana^{-1} = n, i.e., na = an. Since [a] generates G/N, each element of G/N is of the form [a]^k = [a^k], k = 0, . . . , 6. Then, each element of G is of the form a^k or a^k n, and since a and n commute, we get that actually all elements of G commute. □

Multi-indices

A multi-index α of length r is an r-tuple of non-negative integers (α_1, . . . , α_r). The integer |α| = α_1 + · · · + α_r is called the size of the multi-index α. Monomials are written shortly as x^α instead of x_1^{α_1} x_2^{α_2} · · · x_r^{α_r}.

Polynomials in r variables can then be expressed symbolically in a similar way to univariate polynomials:

f = ∑_{|α|≤n} a_α x^α,   g = ∑_{|β|≤m} b_β x^β ∈ K[x_1, . . . , x_r].

f is said to have total degree n if at least one coefficient with a multi-index α of size n is non-zero, while all the coefficients with multi-indices of larger sizes vanish. Analogous formulae define the addition and multiplication of multivariate polynomials of total degrees m and n, respectively:

f + g = ∑_{|α|≤max(m,n)} (a_α + b_α) x^α,
f · g = ∑_{|γ|≤m+n} ( ∑_{α+β=γ} a_α b_β ) x^γ,

where the multi-indices are added componentwise, and the formally non-existing coefficients are assumed to be zero.
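The multi-index formulae suggest a natural data structure: a dictionary mapping multi-indices to coefficients. A minimal Python sketch (ours):

def madd(f, g):
    """Add polynomials given as dicts {multi-index tuple: coefficient}."""
    h = dict(f)
    for alpha, b in g.items():
        h[alpha] = h.get(alpha, 0) + b
    return {a: c for a, c in h.items() if c != 0}

def mmul(f, g):
    h = {}
    for alpha, a in f.items():
        for beta, b in g.items():
            gamma = tuple(i + j for i, j in zip(alpha, beta))  # componentwise sum
            h[gamma] = h.get(gamma, 0) + a * b
    return {a: c for a, c in h.items() if c != 0}

# (x + y)^2 = x^2 + 2xy + y^2 in K[x, y]
f = {(1, 0): 1, (0, 1): 1}
print(mmul(f, f))    # {(2, 0): 1, (1, 1): 2, (0, 2): 1}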
Clearly, in the case of univariate polynomials, we recover the formula

(fg)(x) = ∑_{i=0}^{k+ℓ} ( ∑_{p+q=i} a_p b_q ) x^i.

Lemma. These formulae describe the addition and multiplication in the inductively defined ring of polynomials in r variables. In particular, in a polynomial ring over an integral domain, the total degree of a sum or difference of two polynomials (if defined) is at most the maximum of their total degrees, while the total degree of a product of two polynomials is the sum of their total degrees.

Proof. The proposition is easily proved by induction on the number of variables. Suppose that the formulae are valid in K[x_1, . . . , x_{r−1}], and calculate the sum of

f = a_k(x_1, . . . , x_{r−1}) x_r^k + · · · + a_0(x_1, . . . , x_{r−1}) = ( ∑_α a_{k,α} x^α ) x_r^k + · · · ,
g = b_l(x_1, . . . , x_{r−1}) x_r^l + · · · + b_0(x_1, . . . , x_{r−1}) = ( ∑_β b_{l,β} x^β ) x_r^l + · · · ,

namely

f + g = ( a_0(x_1, . . . , x_{r−1}) + b_0(x_1, . . . , x_{r−1}) ) + ( a_1(x_1, . . . , x_{r−1}) + b_1(x_1, . . . , x_{r−1}) ) x_r + · · ·

12.F.42. Decide whether the following holds: if the quotient G/N of a group G by a normal subgroup N is commutative, then G itself is commutative. ⃝

12.F.43. Prove that any subgroup H of the symmetric group S_n contains either only even permutations, or the same number of even and odd permutations.

Solution. Consider the homomorphism p : H → Z_2 which maps each permutation to its parity (0 for even, 1 for odd). Then p^{-1}(0) = Ker(p) is a normal subgroup of H: for h ∈ Ker(p),

p(ghg^{-1}) = p(g)p(h)p(g^{-1}) = p(g)p(g^{-1}) = p(gg^{-1}) = p(e) = 0,

which means that ghg^{-1} ∈ Ker(p), i.e., Ker(p) is normal. Since Z_2 has only two elements, it follows that H/Ker(p) has either only one coset (i.e., all permutations in H are even) or two cosets, which must be of equal size (i.e., there are the same number of even and odd permutations). □

12.F.44. Describe the group of symmetries of a regular tetrahedron and find all of its subgroups.

Solution. Let us denote the vertices of the tetrahedron by a, b, c, d. Each symmetry can be described as a permutation of the vertices (telling to which vertex each one goes). Thus, the group of symmetries of the tetrahedron is isomorphic to a certain subgroup of the symmetric group S_4. Given any pair of vertices, there exists a symmetry which swaps this pair and keeps the other two vertices fixed (the reflection with respect to the plane that is perpendicular to the line segment joining the pair and goes through its center). Thus, the wanted subgroup is generated by all transpositions in S_4. However, this is the group S_4 itself.

Thus, let us describe all subgroups of the group S_4. This group has 24 elements, which means that the order of any subgroup must be one of 1, 2, 3, 4, 6, 8, 12, 24 (see 12.4.10). Clearly, the only subgroup of order 1 is the trivial subgroup {id}. Similarly, the only subgroup of order 24 is the entire group S_4. Now, let us look at the remaining orders of a potential subgroup H ⊆ S_4.

(i) |H| = 2. H must consist of the identity and another self-inverse element (x^2 = id). These are the transpositions and the double transpositions (compositions of two independent transpositions). Geometrically, the double transpositions correspond to the rotations by 180° around an axis that goes through the centers of opposite edges. Thus, we get nine subgroups:

{id, (a, b)}, {id, (a, c)}, {id, (a, d)}, {id, (b, c)}, {id, (b, d)}, {id, (c, d)},
{id, (a, b) ∘ (c, d)}, {id, (a, c) ∘ (b, d)}, {id, (a, d) ∘ (b, c)}.
(ii) |H| = 3. By Lagrange's theorem, such a subgroup must be cyclic, i.e., of the form {id, p, p^2} with p^3 = id. Thus, the factorization of p into independent cycles must contain a cycle of length 3, which means that p cannot contain anything else. By 12.F.17, there are 4 · 2 cycles of length 3, which give rise to the following four subgroups:

{id, (a, b, c), (a, c, b)}, {id, (a, c, d), (a, d, c)}, {id, (a, b, d), (a, d, b)}, {id, (b, c, d), (b, d, c)}.

The cycles of length 3 correspond to the rotations by 120° around an axis that goes through a vertex and the center of the opposite side.

(iii) |H| = 4. Such a subgroup must be isomorphic to Z_4 or Z_2 × Z_2. Considering the factorization into independent cycles, we find out that the only permutations of order 4 are the cycles of length 4. Thus, a cyclic subgroup of order 4 must contain a cycle of length 4, namely exactly two of them, since if p has order 4, then p^{-1} = p^3 is also of order 4, i.e., a cycle of length 4. Then, the permutation p^2 has order 2, so it must be a double transposition (it is not a single transposition, since p^2 clearly has no fixed point).

= ( ∑_γ (a_{k,γ} + b_{k,γ}) (x_1, . . . , x_{r−1})^γ ) x_r^k + · · · + ( ∑_γ (a_{0,γ} + b_{0,γ}) (x_1, . . . , x_{r−1})^γ )
= ∑_{(γ,j)} (a_{j,γ} + b_{j,γ}) (x_1, . . . , x_{r−1})^γ x_r^j.

The proof for multiplication is similar (do it yourselves!). □

The definition, or the above formulae, for polynomials over general commutative rings yields the following corollary:

Corollary. If a ring K is an integral domain, then the ring K[x_1, . . . , x_r] is also an integral domain.

Proof. Proceed by induction on the number r of variables. (Alternatively, proceed directly using the multi-index formulae for the product, provided an appropriate ordering on the monomials is defined; see the last part of this chapter.) We know the result for the univariate polynomials already. In particular, the product of non-zero polynomials is again non-zero. If the proposition holds for r − 1 variables, then the result follows from the inductive definition of the polynomials. □

Let us notice that each polynomial f ∈ K[x_1, . . . , x_n] also represents a mapping K × · · · × K → K in its n arguments.

12.3.5. Divisibility and irreducibility. The next goal is to understand how polynomials over a general integral domain can be expressed as products of simpler polynomials. In the case of univariate polynomials, this is related to finding the roots of a polynomial. Since multivariate polynomials can be defined inductively, it suffices to consider univariate polynomials over a general integral domain. This leads to a generalization of the concept of divisibility, which forms the basis of elementary number theory in chapter eleven.

Consider an integral domain K (for instance, the integers Z or the ring Z_p for a prime p).

Divisibility in rings

For a, c ∈ K, we say that a divides c in K if and only if there is some b ∈ K such that a · b = c. This is written a | c. The divisors of 1 (the multiplicative identity), i.e., the invertible elements of K, are called units.

The units of a commutative ring always form a commutative group (in the sense used for the properties of addition in the definition of rings). It is called the group of units of K. The group of units of Z is {−1, 1}, while in a field all non-zero elements are units.

In an integral domain, divisors are determined uniquely: if b = a · c and b ≠ 0, then c is determined by the choice of a and b, since if b = ac = ac′, then 0 = a · (c − c′), and being in an integral domain, a ≠ 0 implies c = c′. The reader should check that this claim fails, e.g., in Z_4.
There are six cycles of length 4 (see 12.F.17), and they pair up to the following three subgroups of this type:

{id, (a, b, c, d), (a, c) ∘ (b, d), (a, d, c, b)},
{id, (a, c, b, d), (a, b) ∘ (c, d), (a, d, b, c)},
{id, (a, b, d, c), (a, d) ∘ (b, c), (a, c, d, b)}.

As for the subgroups isomorphic to Z_2 × Z_2, they must contain (besides the identity) only elements of order 2, which are the transpositions and the double transpositions. By 12.F.43, such a subgroup must contain either no or exactly two transpositions. Moreover, it cannot contain two dependent transpositions, since their composition is a cycle of length 3. Thus, the subgroup contains (besides the identity) either two independent transpositions together with the double transposition which is their composition (this gives rise to three subgroups), or the three double transpositions. Altogether, we have found:

{id, (a, b), (a, b) ∘ (c, d), (c, d)}, {id, (a, c), (a, c) ∘ (b, d), (b, d)}, {id, (a, d), (a, d) ∘ (b, c), (b, c)}

and

{id, (a, b) ∘ (c, d), (a, c) ∘ (b, d), (a, d) ∘ (b, c)}.

(iv) |H| = 6. By 12.F.11, such a subgroup is isomorphic to S_3 (it cannot be isomorphic to Z_6, since there is no element of order 6 in S_4), so it contains (besides the identity) two elements x, x^{-1} of order 3 and three elements of order 2. Thus, x and x^{-1} are cycles of length 3 which fix the same vertex (say a). What are the other three elements? There cannot be a double transposition, since its composition with x yields another cycle of length 3. There cannot be a transposition which does not fix a, since its composition with x yields a cycle of length 4. Thus, the only possibility is that the subgroup contains the three transpositions which also fix a. Since there are four possibilities for the fixed vertex, we obtain four subgroups of order 6.

(v) |H| = 8. The subgroup cannot consist of even permutations only (it would be a subgroup of the group A_4 of even permutations, but there are 12 of those and 8 does not divide 12). Thus, by 12.F.43, H must contain four even and four odd permutations, and the even ones must form a subgroup of A_4. We saw in (iii) that the only such 4-element subgroup is {id, (a, b) ∘ (c, d), (a, c) ∘ (b, d), (a, d) ∘ (b, c)}, which is normal. Considering any odd permutation and the coset (with respect to the above normal subgroup) which contains it, we can see that this coset together with the above 4 elements forms a subgroup of S_4. We thus get three subgroups of S_4. It is not hard to realize that each of them is isomorphic to the group of symmetries of a square (the so-called dihedral group D_4). From the geometrical point of view, we can describe them as follows: consider the orthogonal projection of the tetrahedron onto a plane perpendicular to the line that goes through the centers of two opposite edges. The boundary of this projection is a square. Out of all the symmetries of the tetrahedron, we take only those which induce a symmetry of this square (for instance, we exclude a symmetry which only swaps adjacent vertices of the resulting square). Since there are three pairs of opposite edges in the tetrahedron, we get three 8-element subgroups, isomorphic to the dihedral group D_4.

(vi) |H| = 12. By 12.F.43, such a subgroup contains either only even permutations, or six even and six odd permutations, where the six even permutations would have to form a subgroup of S_4. However, we saw in (iv) that there is no 6-element subgroup of S_4 consisting only of even permutations. Thus, the only possibility is the alternating group A_4 of all even permutations in S_4. From the geometric point of view, these are the so-called direct symmetries, which are realized by rotations (not reflections), and thus can be performed in space. □

Remark. In general, the group of symmetries of a solid with n vertices is a subgroup of the symmetric group S_n.

Just as for the integers, the following propositions are direct corollaries of the definitions (check the details yourself!):

Lemma. Let a, b, c ∈ K. Then,
(1) if a | b and b | c, then a | c;
(2) if a | b and a | c, then a | (αb + βc) for all α, β ∈ K;
(3) a | 0 (since a · 0 = 0; in particular, 0 | 0);
(4) each a ∈ K is divisible by every unit e ∈ K and by its a-multiple a · e (this follows from the existence of e^{-1}).

Unique factorization domains

An element a ∈ K is said to be irreducible if and only if it is divisible only by units e ∈ K and their a-multiples. A ring K is called a unique factorization domain if and only if the following conditions hold:
• For every non-zero element a ∈ K, there are irreducible elements a_1, . . . , a_r ∈ K such that a = a_1 · a_2 · · · a_r. This is called a factorization of a.
• If a = a_1 a_2 · · · a_r = b_1 b_2 · · · b_s are two factorizations of a into irreducible non-unit elements, then r = s and, up to a permutation of the order of the factors b_j, we have a_j = e_j b_j for suitable units e_j, j = 1, . . . , r.
Z is a unique factorization domain. So is every field, since every non-zero element there is a unit. There are also examples of integral domains without the unique factorization property. The construction is similar to polynomials; instead of the powers of the unknown variable x, consider conveniently combined roots of its powers. Let the integral domain K consist of the finite expressions of the form

a_0 + ∑_{i=1}^{k} a_i (x^{m_i})^{1/2^{n_i}},

where a_0, . . . , a_k ∈ Z and m_i, n_i ∈ Z_{>0}, with multiplication and addition defined as for polynomials, assuming the standard behaviour of the rational powers of x. Then, the only units in K are ±1, and all elements with a_0 = 0 are reducible, but the expression x, for example, cannot be expressed as a product of irreducible elements. There are simply very few irreducible elements in K.

12.3.6. Euclidean division and roots of polynomials. The fundamental tool for the discussion of divisibility, common divisors, etc. in the ring of integers Z is the procedure of division with remainder, together with the Euclidean algorithm for the greatest common divisor. These procedures can be generalized. Consider univariate polynomials a_k x^k + · · · + a_0 over a general integral domain K. The monomial a_k x^k is called the leading monomial, and a_k the leading coefficient.

12.F.45. Which subgroups of the group S_4 are normal?

Solution. By definition, a subgroup H ⊆ S_4 is normal iff it is closed under conjugation, i.e., ghg^{-1} ∈ H for any g ∈ S_4, h ∈ H. Since conjugation in symmetric groups only renames the permuted items and preserves the permutation structure (i.e., the cycle lengths in the factorization into independent cycles), we can see that H is normal if and only if it contains either no or all permutations of each type. Examining all the subgroups found in the previous exercise, we conclude that the normal ones are the trivial group {id}, the so-called Klein group (consisting of the identity and the three double transpositions, which we already met in 12.F.8), the alternating group A_4 of all even permutations, and the entire group S_4. □
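The normality criterion of 12.F.45 can be confirmed by a direct computation in S_4. A short Python sketch (ours), with permutations represented as tuples on {0, 1, 2, 3}:

from itertools import permutations

def compose(p, q):                   # (p ∘ q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(4))

def inverse(p):
    inv = [0] * 4
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def parity(p):                       # 0 for even, 1 for odd permutations
    return sum(p[i] > p[j] for i in range(4) for j in range(i + 1, 4)) % 2

S4 = list(permutations(range(4)))

def is_normal(H):
    return all(compose(compose(g, h), inverse(g)) in H for g in S4 for h in H)

e = (0, 1, 2, 3)
klein = {e, (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)}    # the three double transpositions
A4 = {p for p in S4 if parity(p) == 0}

assert is_normal(klein) and is_normal(A4) and is_normal(set(S4))
assert not is_normal({e, (1, 0, 2, 3)})   # a subgroup generated by one transposition is not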
12.F.46. Find the group of symmetries of a cube (describe all symmetries). Is this group commutative?

Solution. The group has 48 elements: 24 of them are rotations (the so-called direct symmetries); the other 24 are compositions of a direct symmetry and a reflection. The group is not commutative (consider the composition of the reflection with respect to the plane containing the centers of four parallel edges and the rotation by 90° around the axis that lies in this plane and goes through the centers of two opposite faces). The group is isomorphic to S_4 × Z_2. □

12.F.47. In the group of symmetries of a cube, find the subgroup generated by the reflection with respect to the plane containing the centers of four parallel edges and the rotation by 180° around the axis that lies in this plane and goes through the centers of two opposite faces. Is this subgroup normal? ⃝

Lemma (An algorithm for division with remainder). Let K be an integral domain and let f, g ∈ K[x] be polynomials, g ≠ 0. Then, there exist an element a ∈ K, a ≠ 0, and polynomials q and r such that af = qg + r, where either r = 0 or deg r < deg g. Moreover, if K is a field, or if the leading coefficient of g is one, then the choice a = 1 can be made, and the polynomials q and r exist and are unique.

Proof. If deg f < deg g or f = 0, then the choice a = 1, q = 0, r = f satisfies all the conditions. If g is constant, set a = g, q = f, r = 0. Continue by induction on the degree of f. Suppose deg f ≥ deg g > 0, and write f = a_0 + · · · + a_n x^n, g = b_0 + · · · + b_m x^m. Either b_m f − a_n x^{n−m} g = 0 or deg(b_m f − a_n x^{n−m} g) < deg f. In the former case, the proof is finished. In the latter case, it follows by the induction hypothesis that there exist a′, q′, r′ satisfying

a′ (b_m f − a_n x^{n−m} g) = q′ g + r′,

where either r′ = 0 or deg r′ < deg g. This means that

a′ b_m f = (q′ + a′ a_n x^{n−m}) g + r′.

If b_m = 1 or K is a field, then the induction hypothesis allows the choice a′ = 1, and then q′, r′ are unique. In this case, b_m f = (q′ + a_n x^{n−m}) g + r′, and if K is a field, this equation can further be multiplied by b_m^{-1}.

For the uniqueness, assume that there is another solution f = q_1 g + r_1. Then, 0 = f − f = (q − q_1) g + (r − r_1), and either r = r_1 or deg(r − r_1) < deg g. In the former case, it follows that q = q_1 as well, since there are no zero divisors in K[x]. In the latter case, let a x^s be the term of the highest degree in q − q_1 ≠ 0 (it must exist). Then, its product with the term of the highest degree in g must be zero (since the term of the highest degree of (q − q_1)g is just the product of the terms of the highest degrees, and there are no other terms of this degree there). However, this means that a = 0. Since a x^s is the non-zero term with the highest degree, q − q_1 contains no non-zero monomials, so it equals zero. But then r = r_1. □
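For K a field, the division algorithm from the proof is easily implemented. A minimal Python sketch (ours), working over Q via exact fractions:

from fractions import Fraction

def poly_divmod(f, g):
    """Divide f by g over a field: f = q*g + r, with f[i] = coefficient of x^i."""
    f = [Fraction(c) for c in f]
    g = [Fraction(c) for c in g]
    q = [Fraction(0)] * max(1, len(f) - len(g) + 1)
    r = f[:]
    while len(r) >= len(g) and any(r):
        # cancel the leading term of r against the leading term of g
        k = len(r) - len(g)
        c = r[-1] / g[-1]
        q[k] = c
        r = [ri - c * (g[i - k] if 0 <= i - k < len(g) else 0)
             for i, ri in enumerate(r)][:-1]
    return q, r

# x^3 - 1 divided by x - 1: quotient x^2 + x + 1, remainder 0
q, r = poly_divmod([-1, 0, 0, 1], [-1, 1])
print(q, r)    # [1, 1, 1] [0]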
It follows that the element b ∈ K is a root of the polynomial f if and only if $(x - b) \mid f$. Since division by a polynomial of degree one decreases the degree of the original polynomial by at least one, the following proposition is proved:

Corollary. Every polynomial $f \in K[x]$ has at most $\deg f$ roots, including multiplicities. In particular, polynomials over an infinite integral domain define the same mapping K → K if and only if they are the same polynomial.

12.F.48. For each of the following permutations, decide whether the subgroup it generates is normal in the corresponding group:
• (1, 2, 3) in S3,
• (1, 2, 3, 4) in S4,
• (1, 2, 3) in A4.
For the last case, find the right cosets of A4 by the considered subgroup. Find all n ≥ 3 for which the subset of all cycles of length n together with the identity is a subgroup of Sn. Show that if this is so, then it is even a normal subgroup.

Solution.
• It generates the alternating group A3, which is a normal subgroup of S3.
• It is not a normal subgroup: $(1, 2) \circ (1, 3) \circ (2, 4) \circ (1, 2) = (1, 4) \circ (2, 3)$, which does not lie in the generated subgroup.
• It is not a normal subgroup. The right cosets are {(1, 2, 4), (2, 4, 3), (1, 3) ◦ (2, 4)}, {(1, 4, 2), (1, 4, 3), (1, 4) ◦ (2, 3)}, {(2, 3, 4), (1, 2) ◦ (3, 4), (1, 3, 4)}, {id, (1, 2, 3), (1, 3, 2)}.
The mentioned subset is a subgroup only for n = 3. In this case, it is the alternating group A3 of all even permutations in S3, which is a normal subgroup. (For greater values of n, we can find two cycles of length n whose composition is neither a cycle of length n nor the identity.) □

12.F.49. Find the subgroup of S6 that is generated by the permutations (1, 2) ◦ (3, 4) ◦ (5, 6), (1, 2, 3, 4), and (5, 6). Is this subgroup normal? If so, describe the set of (two-sided) cosets S6/H.

Solution. First of all, note that all of the generating permutations lie in the subgroup S4 × S2 ⊆ S6 (considering the natural inclusion of S4 × S2, i. e., for s ∈ S4 × S2, the restriction of s to {1, 2, 3, 4} is a permutation of this set and so is the restriction of s to {5, 6}). This means that the group they generate is also a subgroup of S4 × S2. Moreover, since (5, 6) is among the generators, we can see that the subgroup is of the form H × S2, where H ⊆ S4. Thus, it suffices to describe H. This group is generated by the elements (1, 2) ◦ (3, 4) and (1, 2, 3, 4) (the projections of the generators to S4). We have
$(1, 2, 3, 4)^2 = (1, 3) \circ (2, 4)$, $(1, 2, 3, 4)^3 = (4, 3, 2, 1)$, $(1, 2, 3, 4)^4 = \mathrm{id}$,
$[(1, 2) \circ (3, 4)]^2 = \mathrm{id}$,
$[(1, 2) \circ (3, 4)] \circ (1, 2, 3, 4) = (2, 4)$, $(1, 2, 3, 4) \circ [(1, 2) \circ (3, 4)] = (1, 3)$,
$[(1, 2) \circ (3, 4)] \circ (4, 3, 2, 1) = (1, 3)$, $(4, 3, 2, 1) \circ [(1, 2) \circ (3, 4)] = (2, 4)$,
$[(1, 2) \circ (3, 4)] \circ [(1, 3) \circ (2, 4)] = (1, 4) \circ (2, 3)$, $[(1, 3) \circ (2, 4)] \circ [(1, 2) \circ (3, 4)] = (1, 4) \circ (2, 3)$,

Indeed, if two polynomials over an integral domain define the same mapping K → K, then their difference has any element of K as a root. This means that if their difference is not the zero polynomial, then K has at most as many elements as the maximum of the degrees of the polynomials in question.

12.3.7. Multiple roots and derivatives. In continuous modelling, we mostly work over infinite integral domains K, and so we identify the algebraic expressions for the polynomials with the mappings they define.
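The infiniteness assumption in the corollary cannot be dropped. A two-line check over the finite domain Z2 (a hypothetical quick illustration, not from the text):

```python
# Over K = Z_2, the non-zero polynomial x^2 + x defines the zero mapping,
# so the two distinct polynomials x^2 + x and 0 give the same function.
print([(x**2 + x) % 2 for x in (0, 1)])   # [0, 0]
```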
However, the properties of differentiation of polynomials are of a purely algebraic character in general: differentiation of polynomials over the real or complex numbers is an algebraic operation which makes sense over every commutative ring K, and it still satisfies the Leibniz rule:

Derivative of polynomials
Let $f(x) = a_0 + a_1 x + \cdots + a_n x^n$ and $g(x) = b_0 + b_1 x + \cdots + b_m x^m$ be polynomials of degrees n and m over a commutative ring K. The derivative $'\colon f(x) \mapsto f'(x) = a_1 + 2 a_2 x + \cdots + n a_n x^{n-1}$ respects the addition of polynomials and their multiplication by the elements of K. Moreover, it satisfies the Leibniz rule
(1) $(f(x) g(x))' = f'(x) g(x) + f(x) g'(x)$.

While the claim on the additive structure is obvious, let us check the Leibniz rule:
$$f(x) \cdot g(x) = \sum_{k=0}^{m+n} c_k x^k, \qquad c_k = \sum_{i+j=k} a_i b_j,$$
and thus expanding $f'(x) \cdot g(x) + f(x) \cdot g'(x)$ yields exactly the expression for the derivative of the product, $\sum_{k=1}^{m+n} k c_k x^{k-1}$.

In particular, in view of (1), the derivative is not a homomorphism K[x] → K[x] of the ring of polynomials. In a much more general context, the homomorphisms of the additive structure of a ring satisfying the Leibniz rule (1) are called derivations. For polynomial rings, we see inductively that the only derivation with $x' = 1$ is our operation ′.

Differentiation can be exploited easily for discussing multiple roots of polynomials. Consider a polynomial f(x) ∈ K[x] over an integral domain K, with a root c ∈ K of multiplicity k. Thus, in view of the division of polynomials discussed in the previous paragraph 12.3.6, $f(x) = (x - c)^k g(x)$, with the unique polynomial g, $g(c) \neq 0$. Differentiating f(x) and applying the Leibniz rule, we obtain
$$f'(x) = k (x - c)^{k-1} g(x) + (x - c)^k g'(x) = (x - c)^{k-1} \bigl(k g(x) + (x - c) g'(x)\bigr).$$
Clearly, the polynomial $h(x) = k g(x) + (x - c) g'(x)$ does not admit c as a root, i.e. $h(c) = k g(c) \neq 0$ (note that this uses that k is non-zero as an element of K, which holds, e.g., in characteristic zero).

$[(1, 2) \circ (3, 4)] \circ (4, 2) = (1, 2, 3, 4)$, $(1, 3) \circ (4, 2) = (1, 3) \circ (2, 4)$.
Now, we can note that the generating permutations (1, 2, 3, 4) and (1, 2) ◦ (3, 4) are symmetries of a square on the vertices 1, 2, 3, 4. Therefore, they cannot generate more than the 8-element dihedral group D4, and we have already found 8 elements. This means that no more permutations can be obtained by further compositions. Thus, the subgroup H ⊆ S4 has 8 elements (which is possible by Lagrange's theorem, since 8 divides 24):
H = {id, (1, 2, 3, 4), (1, 3) ◦ (2, 4), (4, 3, 2, 1), (1, 2) ◦ (3, 4), (1, 3), (2, 4), (1, 4) ◦ (2, 3)}.
Altogether, the examined subgroup in S6 has 16 elements: for each h ∈ H, it contains (h, id) and (h, (5, 6)). □

12.F.50. Find the subgroup in S4 that is generated by the permutations (1, 2) ◦ (3, 4), (1, 2, 3).

Solution. Since both the generating permutations are even, they can generate only even permutations. Thus, the examined group is a subgroup of the alternating group A4 of all even permutations. We have
$[(1, 2) \circ (3, 4)]^2 = \mathrm{id}$, $(1, 2, 3)^2 = (3, 2, 1)$,
$[(1, 2) \circ (3, 4)] \circ (1, 2, 3) = (2, 4, 3)$, $(1, 2, 3) \circ [(1, 2) \circ (3, 4)] = (1, 3, 4)$,
$[(1, 2) \circ (3, 4)] \circ (3, 2, 1) = (3, 1, 4)$, $(3, 2, 1) \circ [(1, 2) \circ (3, 4)] = (2, 3, 4)$,
and now, we already have eight elements of the examined subgroup of A4. Since A4 has 12 elements and the order of a subgroup must divide that of the group, it is clear that the subgroup is the whole A4. □

12.F.51. Find all subgroups of the group of invertible 2-by-2 matrices over Z2 (with matrix multiplication). Is any of them normal?

Solution. In exercise 12.E.1, we built the table of the operation in this group.
By Lagrange’s theorem (12.4.10), the order of any subgroup must divide the order of the group, which is six. Thus, besides the trivial subgroup {A} and the entire group, each subgroup must have two or three elements. The identity is A. In a 2-element subgroup, the non-trivial element must be self-inverse, which is also sufficient for the subset to be a subgroup. We thus get the subgroups {A, B}, {A, C}, {A, F}, which are not normal, as can be easily verified. Since B, C, F have order 2, they cannot lie in a 3-element subgroup. Thus, the only remaining possibility is P = {A, D, E}, which is indeed a subgroup. Moreover, checking the conjugations BDB = E, CDC = E, FDF = E (whence it follows that BEB = D, CEC = D, FEF = D), we find out that this subgroup is normal. □

On the contrary, if c is a joint root of f and its derivative, then, reading the previous computation backwards (with k = 1), we see that c is also a root of g(x), and thus a root of f with multiplicity at least two. Inductively, we arrive at the following very useful claim:

Proposition. A polynomial f(x) over an integral domain K admits a root c ∈ K of multiplicity k if and only if c is a root of f′(x) of multiplicity k − 1.

12.3.8. The greatest common divisor. Consider a polynomial ring K[x] over an integral domain K. A polynomial h ∈ K[x] is called the greatest common divisor of polynomials f and g ∈ K[x] if and only if the following hold:
• h|f and h|g,
• for any k, if both k|f and k|g, then k|h.
As a direct corollary of the existence of an algorithm for unique division with remainder, there is the very important Bezout's identity (it is proved using the Euclidean division similarly as in the case of the integers in Chapter 11).

Theorem. Let K be a field and f, g ∈ K[x]. Then, there exists a greatest common divisor h of the polynomials f and g. The polynomial h is unique up to a multiple by a non-zero scalar. In addition, there exist polynomials A, B ∈ K[x] such that h = Af + Bg.

Proof. The polynomials h, A, B can be constructed directly using the Euclidean algorithm. Continue dividing with remainder (since K is a field, there is always a unique way to do this; see the above lemma 12.3.6):
$$f = q_1 g + r_1,$$
$$g = q_2 r_1 + r_2,$$
$$r_1 = q_3 r_2 + r_3,$$
$$\vdots$$
$$r_{p-1} = q_{p+1} r_p + 0.$$
In this procedure, the degrees of the polynomials $r_i$ are strictly decreasing; hence the equality from the last line must occur (for some p), and this says that $r_p \mid r_{p-1}$. It follows from the line above that $r_p \mid r_{p-2}$, etc. Continue sequentially up to the first and second lines, to obtain $r_p \mid g$ and $r_p \mid f$. If h|f and h|g, then the same equalities imply that h divides all the $r_i$; in particular, it divides $r_p$. In this way, the greatest common divisor $h = r_p$ of the polynomials f and g is obtained. Substitute upwards, starting with the last equation:
$$h = r_p = r_{p-2} - q_p r_{p-1}$$
$$= r_{p-2} - q_p (r_{p-3} - q_{p-1} r_{p-2})$$
$$= -q_p r_{p-3} + (1 + q_{p-1} q_p) r_{p-2}$$
$$= -q_p r_{p-3} + (1 + q_p q_{p-1})(r_{p-4} - q_{p-2} r_{p-3})$$
$$\vdots$$

12.F.52. Find all subgroups of the group (Z10, +).

Solution. The subgroups are isomorphic to (Zd, +), where d|10, i. e., $\{0\} \cong \mathbb{Z}_1$, $\{0, 5\} \cong \mathbb{Z}_2$, $\{0, 2, 4, 6, 8\} \cong \mathbb{Z}_5$, and $\mathbb{Z}_{10}$. □

12.F.53. Find the orders of the elements 2, 4, 5 in $(\mathbb{Z}_{35}^{\times}, \cdot)$ and in $(\mathbb{Z}_{35}, +)$.

Solution. By definition, the order of x in the group $(\mathbb{Z}_{35}^{\times}, \cdot)$ is the least positive integer k such that $x^k \equiv 1 \pmod{35}$. By Euler's theorem, the orders of x = 2 and of x = 4 divide φ(35) = 24. Computing the corresponding modular powers, we find out that the order of x = 2 is 12.
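Such modular orders are easy to check mechanically; a small Python sketch under the obvious conventions (the helper name mult_order is ours):

```python
from math import gcd

def mult_order(x, n):
    # least k >= 1 with x^k == 1 (mod n); x must be invertible mod n
    assert gcd(x, n) == 1, "x must be invertible modulo n"
    k, power = 1, x % n
    while power != 1:
        power = power * x % n
        k += 1
    return k

print(mult_order(2, 35))   # 12
print(mult_order(4, 35))   # 6
print(35 // gcd(35, 5))    # 7 -- the additive order of 5 in (Z_35, +)
```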
Hence it immediately follows that the order of x = 4 is 6. The number x = 5 does not lie in the group $(\mathbb{Z}_{35}^{\times}, \cdot)$. Specifically, we have (modulo 35):
$$\begin{array}{c|cccccccccccc} & x & x^2 & x^3 & x^4 & x^5 & x^6 & x^7 & x^8 & x^9 & x^{10} & x^{11} & x^{12} \\ \hline 2 & 2 & 4 & 8 & 16 & 32 & 29 & 23 & 11 & 22 & 9 & 18 & 1 \\ 4 & 4 & 16 & 29 & 11 & 9 & 1 & & & & & & \end{array}$$
In the group $(\mathbb{Z}_{35}, +)$, the order of x is the least positive integer k such that $k \cdot x \equiv 0 \pmod{35}$. This can be calculated simply as $k = 35/(35, x)$. Therefore, the order of 2 and of 4 is 35, while the order of 5 is 7. □

12.F.54. Find all finite subgroups of the group $(\mathbb{R}^{*}, \cdot)$.¹

Solution. If a given subgroup of the group $(\mathbb{R}^{*}, \cdot)$ contains an element a with $|a| \neq 1$, then the elements $a, a^2, a^3, \dots$ form an infinite geometric progression of pairwise distinct elements, all of which must lie in the considered subgroup, so it is infinite. Thus, a finite subgroup may contain only the numbers 1 and −1, which means that there are two finite subgroups: {1} and {−1, 1}. □

12.F.55. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism, and if so, find its kernel. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Z}_4 \times \mathbb{Z}_3 \to \mathbb{Z}_{12}$, $\varphi(([a]_4, [b]_3)) = [a - b]_{12}$,
ii) $\varphi: \mathbb{Z}_4 \times \mathbb{Z}_3 \to \mathbb{Z}_{12}$, $\varphi(([a]_4, [b]_3)) = [6a + 4b]_{12}$,
iii) $\varphi: \mathbb{Z}_4 \times \mathbb{Z}_3 \to \mathbb{Z}_{12}$, $\varphi(([a]_4, [b]_3)) = [0]_{12}$.

Solution. i) Not a mapping. For instance, if we take the two representatives $([6]_4, [1]_3) = ([2]_4, [1]_3)$ of the same element of $\mathbb{Z}_4 \times \mathbb{Z}_3$, then we get $\varphi(([6]_4, [1]_3)) = [5]_{12}$ but $\varphi(([2]_4, [1]_3)) = [1]_{12}$, so this is not a correct definition of a mapping.
ii) A homomorphism, neither injective nor surjective. Its kernel Ker(φ) is the set $\{([2]_4, [0]_3), ([0]_4, [0]_3)\}$.
iii) A homomorphism, neither injective nor surjective. Its kernel is the entire group $\mathbb{Z}_4 \times \mathbb{Z}_3$. □

¹ The group of all invertible elements of R and C is denoted by $\mathbb{R}^{*}$ and $\mathbb{C}^{*}$, respectively, and by $\mathbb{Z}_n^{\times}$ for $\mathbb{Z}_n$.

$$\cdots = Af + Bg. \qquad \square$$

12.3.9. Unique factorization. Now follows a very useful (and elegant) statement, the proof of which is straightforward, yet it requires many technical details (and it also concerns the so-called field of rational functions). It is recommended to read the following paragraph carefully. Then maybe, at the first reading, skip the technical lemmas of the proof.

Theorem. Let K be a unique factorization domain. Then, K[x] is also a unique factorization domain.

Proof. The idea of the proof is very simple. Consider a polynomial $f \in K[x]$. If f is reducible, then $f = f_1 \cdot f_2$, where neither of the polynomials $f_1, f_2 \in K[x]$ is a unit. Moreover, assume for a while that if f is divisible by an irreducible polynomial h, then so is $f_1$ or $f_2$. If this is always the case, this procedure can be applied step by step to reach a unique factorization. If $f_1$ is further reducible, then $f_1 = g_1 \cdot g_2$, where $g_1, g_2$ are not units, and either both the polynomials $g_1$ and $g_2$ have degree less than that of f, or the number of irreducible factors in the leading coefficients of $g_1$ and $g_2$ decreases (for instance, over the integers Z, $2x^2 + 2x + 2 = 2(x^2 + x + 1)$). After a finite number of steps, a factorization $f = f_1 \cdots f_r$ is obtained, where the polynomials $f_1, \dots, f_r$ are irreducible. It follows from the additional assumption that every irreducible polynomial h which divides f also divides one of $f_1, \dots, f_r$. Therefore, for every other factorization $f = f'_1 f'_2 \cdots f'_s$, each factor $f_i$ divides one of the $f'_j$, and in this case, $f'_j = e f_i$ for an appropriate unit e.
Cancel such pairs step by step, to conclude that r = s and that the individual factors differ only by unit multiples. □

To conclude, we still have to deal with the assumption made in the first paragraph of the above proof. This will require some technical preparation, and we shall come back to it soon.

The direct consequence of the latter theorem for multivariate polynomials can be formulated (due to their inductive definition):

Corollary. Let K be a unique factorization domain. Then, $K[x_1, \dots, x_r]$ is also a unique factorization domain.

Every polynomial over a unique factorization domain can be factored in a similar way to the case of polynomials with real or complex coefficients. In particular, this holds for polynomials over every field of scalars.

12.3.10. Fields of fractions. When dealing with integer calculations, it is often more advantageous to work with rational numbers and verify only at the end of the procedure that the result is an integer. This method is useful in the case of polynomials, too.

12.F.56. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism, and if so, find its kernel. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Z}_4 \to \mathbb{C}^{*}$, $\varphi([a]_4) = i^a$,
ii) $\varphi: \mathbb{Z}_5 \to \mathbb{C}^{*}$, $\varphi([a]_5) = i^a$,
iii) $\varphi: \mathbb{Z}_4 \to \mathbb{C}^{*}$, $\varphi([a]_4) = (-1)^a$,
iv) $\varphi: \mathbb{Z} \to \mathbb{C}^{*}$, $\varphi(a) = i^a$.

Solution. i) We have $\varphi([a]_4 + [b]_4) = i^{a+b} = i^a \cdot i^b = \varphi([a]) \cdot \varphi([b])$, and $\varphi([4]) = i^4 = 1$, which means that if $[c]_4 = [d]_4$, i. e., $c = d + 4k$, $k \in \mathbb{Z}$, then $\varphi([c]_4) = i^c = i^{d+4k} = i^d = \varphi([d]_4)$, so the mapping is a well-defined homomorphism. It is injective (which is equivalent to saying that $\mathrm{Ker}(\varphi) = \{[0]_4\}$), but it is clearly not surjective.
ii) Not a mapping, since we have $[0]_5 = [5]_5$ and $\varphi([0]_5) = i^0 = 1$, but $\varphi([5]_5) = i^5 = i$.
iii) A homomorphism, neither injective (we have $\varphi([1]_4) = -1 = (-1)^3 = \varphi([3]_4)$) nor surjective. The kernel is $\mathrm{Ker}(\varphi) = \{[0]_4, [2]_4\}$.
iv) A homomorphism, neither injective nor surjective. The kernel is $\mathrm{Ker}(\varphi) = 4\mathbb{Z} = \{4k \mid k \in \mathbb{Z}\}$. □

12.F.57. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Q}^{*} \to \mathbb{Q}^{*}$, $\varphi\left(\frac{p}{q}\right) = \frac{q}{p}$,
ii) $\varphi: \mathbb{Q}^{*} \to \mathbb{Q}^{*}$, $\varphi\left(\frac{p}{q}\right) = \frac{p^2}{q^2}$,
iii) $\varphi: \mathbb{Q}^{*} \to \mathbb{Q}^{*}$, $\varphi\left(\frac{p}{q}\right) = \frac{p^2 + q^2}{pq}$. ⃝

12.F.58. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{C} \to \mathbb{R}$, $\varphi(a + bi) = a + b$,
ii) $\varphi: \mathbb{C} \to \mathbb{R}$, $\varphi(a + bi) = a$,
iii) $\varphi: \mathbb{C}^{*} \to \mathbb{R}^{*}$, $\varphi(a + bi) = a^2 + b^2$,
iv) $\varphi: \mathbb{C}^{*} \to \mathbb{R}^{*}$, $\varphi(c) = 2|c|$,
v) $\varphi: \mathbb{C}^{*} \to \mathbb{R}^{*}$, $\varphi(c) = |c|^3$,
vi) $\varphi: \mathbb{C}^{*} \to \mathbb{R}^{*}$, $\varphi(c) = 1/|c|$. ⃝

12.F.59. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathrm{GL}_2(\mathbb{R}) \to \mathbb{R}^{*}$, $\varphi(A) = |A|$,
ii) $\varphi: \mathrm{GL}_2(\mathbb{R}) \to \mathbb{R}^{*}$, $\varphi\begin{pmatrix} a & b \\ c & d \end{pmatrix} = a^2 + b^2$,
iii) $\varphi: \mathrm{GL}_2(\mathbb{R}) \to \mathbb{R}^{*}$, $\varphi\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ac + bd$.

Let K be an integral domain. Its field of fractions is defined as the set of equivalence classes of the pairs $(a, b) \in K \times K$, $b \neq 0$. These classes are written $\frac{a}{b}$, and the equivalence is defined by
$$\frac{a}{b} = \frac{a'}{b'} \iff ab' = a'b.$$
Addition and multiplication are defined in terms of representatives:
$$\frac{a}{b} + \frac{c}{d} = \frac{ad + bc}{bd}, \qquad \frac{a}{b} \cdot \frac{c}{d} = \frac{ac}{bd}.$$
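These formulas translate directly into code. A minimal sketch of the construction for K = Z (the class name is ours; Python's fractions.Fraction is the standard-library version of the same idea):

```python
class Frac:
    """Equivalence classes of pairs (a, b), b != 0, over K = Z."""
    def __init__(self, a, b):
        assert b != 0
        self.a, self.b = a, b
    def __eq__(self, other):            # a/b = a'/b'  iff  a*b' == a'*b
        return self.a * other.b == other.a * self.b
    def __add__(self, other):           # a/b + c/d = (ad + bc)/(bd)
        return Frac(self.a * other.b + other.a * self.b, self.b * other.b)
    def __mul__(self, other):           # (a/b)(c/d) = ac/(bd)
        return Frac(self.a * other.a, self.b * other.b)

print(Frac(3, 5) * Frac(5, 3) == Frac(1, 1))   # True: (a/b)(b/a) = 1/1
print(Frac(1, 2) + Frac(1, 3) == Frac(5, 6))   # True
```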
It is easily verified that this definition is correct and that the resulting structure satisfies all the field axioms. In particular, $\frac{0}{1}$ is the additive identity, and $\frac{1}{1}$ is the multiplicative identity. If $a \neq 0$, $b \neq 0$, then $\frac{a}{b} \cdot \frac{b}{a} = \frac{1}{1}$. All the details of the arguments are in fact identical with the discussion of the rational numbers in 1.6.6.

The field of fractions of a ring $K[x_1, \dots, x_r]$ is called the field of rational functions (of r variables) and denoted $K(x_1, \dots, x_r)$. In software systems like Sage, Maple or Mathematica, all algebraic operations with polynomials are performed in the corresponding field of fractions, i.e. in the field of rational functions, usually using K = Q.

12.3.11. Completion of the proof. It remains to prove that if a polynomial $f = f_1 f_2$ is divisible by an irreducible polynomial h, then h divides either $f_1$ or $f_2$ or both. This statement is proved in the following three lemmas.

Lemma. Let K be a unique factorization domain. Then:
(1) If $a, b, c \in K$, a is irreducible and $a \mid bc$, then either $a \mid b$ or $a \mid c$.
(2) If a constant polynomial $a \in K[x]$ divides $f \in K[x]$, then a divides all coefficients of f.
(3) If a is an irreducible constant polynomial in K[x] and $a \mid fg$, $f, g \in K[x]$, then $a \mid f$ or $a \mid g$.

Proof. (1) By the assumption, $bc = ad$ for a suitable $d \in K$. Let $d = d_1 \cdots d_r$, $b = b_1 \cdots b_s$, $c = c_1 \cdots c_q$ be the factorizations into irreducible factors. This means that $a d_1 \cdots d_r = b_1 \cdots b_s c_1 \cdots c_q$. Since ad factors in a unique way, it follows that $a = e b_j$ or $a = e c_i$ for a suitable unit e.

(2) Let $f = b_0 + b_1 x + \cdots + b_n x^n$. Since $a \mid f$, there must exist a polynomial $g = c_0 + c_1 x + \cdots + c_k x^k$ such that $f = ag$. Hence it immediately follows that $k = n$ and $a c_0 = b_0, \dots, a c_n = b_n$.

(3) Consider f, g ∈ K[x] as above and suppose that a divides neither f nor g. By the previous claim, there exists an i such that a does not divide $b_i$, and there exists a j such that a does not divide $c_j$. Choose the least such i and j. The coefficient of $x^{i+j}$ in the polynomial fg is
$$b_0 c_{i+j} + b_1 c_{i+j-1} + \cdots + b_{i+j} c_0.$$
By the choice of i and j, a divides all of $b_0 c_{i+j}, \dots, b_{i-1} c_{j+1}, b_{i+1} c_{j-1}, \dots, b_{i+j} c_0$. At the same

⃝

12.F.60. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Z}_3 \to A_4$, $\varphi([a]_3) = (1, 2, 4) \circ (1, 3, 2)^a \circ (1, 4, 2)$,
ii) $\varphi: \mathbb{Z}_3 \to A_4$, $\varphi([a]_3) = (1, 2) \circ (1, 3, 2)^a$. ⃝

12.F.61. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = 2a$,
ii) $\varphi: \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = a + 1$,
iii) $\varphi: \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = 3|a|$,
iv) $\varphi: \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = 1$. ⃝

12.F.62. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Z} \times \mathbb{Z} \times \mathbb{Z} \to \mathbb{Q}^{*}$, $\varphi((a, b, c)) = 2^a 3^b 12^c$,
ii) $\varphi: \mathbb{Z}_3^{*} \times \mathbb{Z}_5 \to \mathbb{Z}_5$, $\varphi((a, b)) = b^a$,
iii) $\varphi: \mathbb{Z}_2 \times \mathbb{Z} \to \mathbb{Z}$, $\varphi(([a]_2, b)) = b$. ⃝

12.F.63. Prove that there exists no isomorphism of the multiplicative group of non-zero complex numbers onto the multiplicative group of non-zero real numbers.

Solution. Every homomorphism must map the identity of the domain to the identity of the codomain (see 12.4.5). Thus, 1 must be mapped to itself. And what about −1? We know that $f(-1)^2 = f((-1)^2) = f(1) = 1$. Therefore, the image of −1 is a square root of 1.
Since we are interested in bijective homomorphisms only, we must have $f(-1) = -1$. However, then $f(i)^2 = f(i^2) = f(-1) = -1$, so that f(i) would be a square root of −1 in R; however, no such real number exists. Therefore, no bijective homomorphism may exist. □

Remark. The mapping which assigns to each non-zero complex number its absolute value is a surjective homomorphism of $\mathbb{C}^{*}$ onto $\mathbb{R}^{+}$.

G. Burnside's lemma

12.G.1. How many necklaces can be created from 3 black and 6 white beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection.

time, it does not divide $b_i c_j$. Therefore, it cannot divide the coefficient. □

12.3.12. Lemma. Consider the field of fractions L of a unique factorization domain K. If a polynomial f is irreducible in K[x], then it is irreducible in L[x], too.

Proof. Each coefficient $a \in K$ can be considered as the element $\frac{a}{1} \in L$. Therefore, every non-zero polynomial $f \in K[x]$ can be considered a polynomial in L[x]. Suppose that $f = g'h'$ for some $g', h' \in L[x]$, where the polynomials $g', h'$ are not units in L[x] (i.e. they are not constant polynomials, since L is a field). Let a be a common multiple of the denominators of the coefficients in $g'$, and b a common multiple of the denominators of the coefficients in $h'$. Then, $bh', ag' \in K[x]$, and so $abf = (bh')(ag')$. Let c be an irreducible factor in the factorization of ab. Then, c divides $(bh')(ag')$, and hence c divides $bh'$ or $ag'$ (by the previous lemma). This means that c can be canceled out. After a finite number of such cancellations, the conclusion is that $f = gh$ for polynomials $g, h \in K[x]$. Since the degrees of the polynomials are not changed, neither g nor h is constant. Thus, if f is reducible in L[x], then it is also reducible in K[x], which is the contrapositive of the claimed implication. □

12.3.13. Lemma. Let K be a unique factorization domain and $f, g, h \in K[x]$. Suppose that f is irreducible and $f \mid gh$. Then, either $f \mid g$ or $f \mid h$.

Proof. This statement is already proved in one of the previous lemmas for the case that f is a constant polynomial (i.e. an element of K). Suppose that $\deg f > 0$. Then f is irreducible in L[x] as well, where L is the field of fractions of the ring K.

Suppose first that K itself is a field (and as such equals its field of fractions). Moreover, suppose that $f \mid gh$ and f does not divide g. The greatest common divisor of the polynomials g and f must then be a constant polynomial in K. Therefore, there are $A, B \in K[x]$ such that $1 = Af + Bg$. Hence, $h = Afh + Bgh$. Since $f \mid gh$, it follows that $f \mid h$ as well.

Return to the general case. It follows from the assumptions that $f \mid g$ or $f \mid h$ in the polynomial ring L[x] over the field of fractions L of the ring K. For instance, let $h = kf$ in L[x], and choose an $a \in K$ so that $ak \in K[x]$. Then, $ah = akf$, and it must hold for every irreducible factor c of a that $c \mid ak$, because f is irreducible and not constant. It follows that c can be canceled. After a finite number of such cancellations, a becomes a unit, i.e. $h = k'f$ for an appropriate $k' \in K[x]$. □

The proof of this lemma completes the proof of theorem 12.3.9.

Solution. Let us view the necklace as a coloring of the vertices of a regular 9-gon. Let S denote the set of all such colorings. Since each coloring is determined by the positions of the 3 black beads, we get that S has $\binom{9}{3} = 84$ elements. We know that the group of symmetries is D9, which contains 9 rotations (including the identity) and 9 reflections.
Two colorings are the same if and only if they lie in the same orbit of the action of D9 on the set S. Thus, we are interested in the number of orbits (let us denote it by N). In order to find N, it suffices to compute the sizes of the sets $S_g$ of fixed colorings for all elements g of D9:

The identity is the only element of order 1; we have $|S_{\mathrm{id}}| = 84$, so its contribution to the sum is 84. There are 9 reflections g, each of order 2. Clearly, we have $|S_g| = 4$, so the total contribution is 4 · 9 = 36. There are 2 rotations g, by 2π/3 and by 4π/3, both of order 3, with $|S_g| = 3$. Their contribution is 6. Finally, there are 6 rotations of order 9, and no coloring is kept unchanged by them, so they do not contribute to the sum. Altogether, we get by the formula of Burnside's lemma that
$$N = \frac{1}{|D_9|} \sum_{g \in D_9} |S_g| = \frac{126}{18} = 7.$$
Draw the seven necklaces! □

12.G.2. Find the number of colorings of a 3-by-3 grid with three colors. Two colorings are considered the same if they can be transformed to each other by rotation and/or reflection.

Solution. The group of symmetries of the table is the same as for a square, i. e., it is the dihedral group D4. Without any identification, there are clearly $3^9$ colorings of the table. Now, the group G = D4 acts on these colorings. For each symmetry g ∈ G, we find the number of colorings that g keeps unchanged:
• g = Id: $|S_g| = 3^9$.
• g is a rotation by 90° or 270° (= −90°): in this rotation, every corner tile is sent to an adjacent corner tile. This means that if the coloring is to be unchanged, all the corner tiles must be of the same color. Similarly, all the edge tiles must be of the same color. Then, the center tile may be of any color. Altogether, we get that there are $3^3$ colorings which are not changed by the considered rotations.
• g is the rotation by 180°: there are four pairs of tiles that are sent to each other by this symmetry, which means that the two tiles of each pair must be of the same color. Then, the center tile may be of any color. Altogether, we have $|S_g| = 3^5$.
• g is one of the four reflections: there are three pairs of tiles that are sent to each other by the reflection, so again the tiles within each pair must be of one color. The three tiles that are fixed by the reflection may each be of an arbitrary color. Altogether, we get $|S_g| = 3^6$.

4. Groups, rings, and fields

As an illustration of the most abstract approach to an algebraic theory, concepts enjoying just one operation are considered. The focus is on objects and situations where equations of the form a · x = b always have a unique solution (as usual with linear equations, the objects a and b are given, while x is what is sought). This is group theory. Note that nothing is known about the “nature” of the objects, nor even what the dot stands for. The only assumption is that any two objects a and x are assigned an object a · x. In a previous part of this chapter, such operations appeared as addition or multiplication in rings. The concepts and vocabulary concerning such operations are now extended. We first meet such “group” objects among numbers and transformations of the plane and space; then the foundations of a general theory follow.

12.4.1. Examples and concepts. Let A be a set. A binary operation on A is defined to be any mapping A × A → A. The result of such an operation is often denoted $(a, b) \mapsto a \cdot b$ and called the product of a and b. A set together with a binary operation is called a groupoid or a magma.
Further assumed properties of the operations are needed in order to be able to say something interesting.

Binary operations and semigroups
A binary operation is said to be associative if and only if a · (b · c) = (a · b) · c for all a, b, c ∈ A. A groupoid where the operation is associative is called a semigroup.
A binary operation is said to be commutative if and only if a · b = b · a for all a, b ∈ A.

The natural numbers N = {0, 1, 2, . . . } together with either addition or multiplication form a groupoid. These operations are both commutative and associative. The integers Z = {. . . , −2, −1, 0, 1, 2, . . . } form a groupoid with any of addition, subtraction, and multiplication. Subtraction is neither associative, for example (5 − 3) − 2 = 0 ≠ 5 − (3 − 2) = 4, nor commutative, since a − b = −(b − a), which is in general different from b − a.

By Burnside's lemma, the wanted number of colorings is equal to
$$\frac{1}{8}\left(3^9 + 2 \cdot 3^3 + 3^5 + 4 \cdot 3^6\right) = 2862.$$
(A brute-force cross-check of this count is sketched below.) □

12.G.3. a) Find all rotational symmetries of a regular octahedron. b) Find the number of colorings of its sides. Two colorings are considered the same if they can be transformed to each other by rotation.

Solution. a) Placing the octahedron into the Cartesian coordinate system so that the pairs of opposite vertices lie on the axes and the center of the octahedron lies at the origin, every rotational symmetry is given by which of the six vertices is on the positive x-semiaxis and which of the four adjacent vertices is on the positive y-semiaxis. Thus, the group has 24 elements. These are (besides the identity) rotations by ±90° and 180° around axes going through opposite vertices, rotations by 180° around axes going through the centers of opposite edges, and finally rotations by ±120° around axes going through the centers of opposite sides.

b) Without any identifications, there are $3^8$ colorings. For each rotational symmetry g, we compute the number of colorings that are kept unchanged by it:
– g is a rotation by ±90° around an axis going through opposite vertices. Then, g fixes $3^2$ colorings, and there are 6 such rotations.
– g is a rotation by 180° around an axis going through opposite vertices or through the centers of opposite edges. Then, g fixes $3^4$ colorings. There are 3 + 6 = 9 of these.
– g is a rotation by ±120°. Then, g also fixes $3^4$ colorings, and there are 8 such rotations.
Together with $3^8$ for the identity, we get that the number of colorings is
$$\frac{1}{24}\left(3^8 + 6 \cdot 3^2 + 17 \cdot 3^4\right) = 333. \qquad \square$$

12.G.4. How many necklaces can be created from 9 white, 6 red, and 3 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection.

Neutral elements, inverses, and groups⁷
A left identity (or left neutral element) in a groupoid (A, ·) is an element e ∈ A such that e · a = a for all a ∈ A. Similarly, e ∈ A is a right identity (right neutral element) iff a · e = a for all a ∈ A. If e satisfies both these properties, it is called an identity (or a neutral element).
In a groupoid (A, ·) with identity e, an element b is a left inverse of an element a if and only if b · a = e; it is a right inverse of a if and only if a · b = e. If b satisfies both these properties, it is called an inverse of a.
A monoid (M, ·) is a semigroup which has a neutral element. A group (G, ·) is a monoid where each element has an inverse.
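Before continuing, here is the promised brute-force cross-check of the Burnside count from 12.G.2 (a verification aid only; the helper names rot and mir are ours). Cells of the 3×3 grid are indexed 0–8 row by row, and D4 is generated by a quarter turn and a left-right mirror.

```python
# Orbits of 3-colorings of the 3x3 grid under the dihedral group D4.
from itertools import product

def rot(c):   # rotate by 90 degrees: new[r][col] = old[2-col][r]
    return tuple(c[3 * (2 - col) + r] for r in range(3) for col in range(3))

def mir(c):   # left-right mirror: new[r][col] = old[r][2-col]
    return tuple(c[3 * r + (2 - col)] for r in range(3) for col in range(3))

seen, orbits = set(), 0
for c in product(range(3), repeat=9):
    if c in seen:
        continue
    orbits += 1
    imgs = [c]
    for _ in range(3):
        imgs.append(rot(imgs[-1]))
    imgs += [mir(x) for x in imgs]     # the full dihedral orbit of c
    seen.update(imgs)
print(orbits)                          # 2862, matching Burnside's lemma
```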
A commutative semigroup is a semigroup where the operation is commutative; similarly for a commutative monoid or a commutative group. A commutative group is also often called an Abelian group.

Consider direct consequences of the definitions. A groupoid cannot have both a left identity and a different right identity (if it had, what would their product be equal to?). Thus, if a groupoid has a (two-sided) identity, then it is the only identity element, called the identity. Similarly, in a monoid, an element x cannot have both a left inverse a and a different right inverse b, since if a · x = x · b = e, then also a = a · (x · b) = (a · x) · b = b. Note that the associativity of the operation is needed here. It follows that if x has an inverse, then it is unique. It is usually denoted by $x^{-1}$.

As an example, consider again subtraction on the integers. This operation is not associative. There is a right identity (zero), i.e. a − 0 = a for any integer a, but it is not a left identity. There is no left identity for subtraction. The integers are a semigroup with respect to either addition or multiplication. They form a group only with addition, since with respect to multiplication, only the integers ±1 have an inverse.

If (A, ·) is a group, then any subset B ⊆ A which is closed with respect to the restriction of · (i.e. a · b ∈ B for any a, b ∈ B) and forms a group with this operation is called a subgroup. Both conditions are essential. For instance, consider the integers as a subset of the rational numbers with multiplication.

Let G be a group and M ⊂ G. The subgroup generated by M is the smallest (with respect to set inclusion) subgroup of G which contains all the elements of M. Clearly, this is the intersection of all subgroups containing M.

⁷ The name “Abelian” is in honour of the young mathematician Niels Henrik Abel. The adjective is so widely used that it is common to write it with a lower-case ‘a’, abelian, although it is derived from a surname.

Solution. The group of symmetries of the necklace is the dihedral group D18, which has 36 elements. It acts on the set of necklaces, where we can number the places (1 through 18), resulting in $\frac{18!}{9! \, 6! \, 3!} = 4084080$ necklaces (without any identification). The only symmetries that fix a non-zero number of necklaces are the rotations by 120° and 240°, the reflections whose axis passes through two opposite beads, and of course the identity. By Burnside's lemma, the wanted number of necklaces is equal to
$$\frac{1}{36}\left(4084080 + 2 \cdot \binom{6}{3}\binom{3}{2} + 9 \cdot \binom{8}{4}\binom{4}{3}\right) = \frac{1}{36}\left(4084080 + 120 + 2520\right) = 113520. \qquad \square$$

12.G.5. How many necklaces can be created from 6 white, 6 red, and 6 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection. ⃝

12.G.6. How many necklaces can be created from 8 white, 8 red, and 8 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection. ⃝

12.G.7. How many necklaces can be created from 3 white and 6 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection. ⃝

H. Codes

12.H.1. Consider the (5, 3)-code over Z2 generated by the polynomial $x^2 + x + 1$. Find all codewords as well as the generating matrix and the check matrix.

Solution. $p(x) = x^2 + x + 1$.
The code words are precisely the multiples of the generating polynomial:
$$0 \cdot p,\ 1 \cdot p,\ x \cdot p,\ (x+1) \cdot p,\ x^2 \cdot p,\ (x^2+1) \cdot p,\ (x^2+x) \cdot p,\ (x^2+x+1) \cdot p,$$
or
$$0,\ x^2+x+1,\ x^3+x^2+x,\ x^3+1,\ x^4+x^3+x^2,\ x^4+x^3+x+1,\ x^4+x,\ x^4+x^2+1,$$

Here are a few very well known examples of groups. The rational numbers Q are a commutative group with respect to addition. The integers are one of their subgroups. The non-zero rational numbers are a commutative group with respect to multiplication. For every positive integer k, the set of all k-th roots of unity, i.e. the set $\{z \in \mathbb{C};\ z^k = 1\}$, is a finite commutative group with respect to multiplication of complex numbers. For k = 2, this is the two-element group {−1, 1}, both of whose elements are self-inverse. For k = 4, this is the group G = {1, i, −1, −i}.

The set Matn, n > 1, of all square matrices of order n is a (non-commutative) monoid with respect to multiplication and a commutative group with respect to addition (see subsections 2.1.2–2.1.5). The set of all linear mappings Hom(V, V) on a vector space is a monoid with respect to mapping composition and a commutative group with respect to addition (see subsection 2.3.12). In every monoid, the subset of all invertible elements forms a group. In the former of the above examples, it was the group of invertible matrices. In the latter case, it was the group of linear isomorphisms of the corresponding vector space. In previous chapters, there are several (semi)group structures, sometimes met quite unexpectedly. For example, recall various subgroups of the group of matrices or the group structure on elliptic curves.

12.4.2. Permutation groups. Groups and semigroups often arise as sets of mappings on a fixed set M, which are closed with respect to mapping composition. This is easily seen on finite non-empty sets M, where every subset of invertible mappings generates a group with respect to composition. Such a set M consisting of $m = |M| \in \mathbb{N}$ elements allows for $m^m$ possible mappings (each of the m elements can be sent to an arbitrary element of M), and all of these mappings can be composed. Since mapping composition is associative, this yields a semigroup. If a mapping $\alpha: M \to M$ is required to have an inverse $\alpha^{-1}$, then α must be a bijection. The composition of two bijections is again a bijection; hence the set Σm of all bijections on an m-element set M is a group. This is called the symmetric group (on m elements). It is an example of a finite group.⁸

The name of the group Σm brings another connection: instead of bijections on a finite set, permutations can be viewed as rearrangements of distinguished objects. Permutations are encountered in this sense when studying determinants; see subsection 2.2.1 on page 96 for a few elementary results.

⁸ It can be proved that every finite group is a subgroup of an appropriate finite symmetric group. This can be interpreted so that the groups Σm are as non-commutative and complex as possible.

or 00000, 11100, 01110, 10010, 00111, 11011, 01001, 10101. The basis vectors multiplied by $x^{5-3} = x^2$ yield, mod p: $x^2 \equiv x + 1$, $x^3 = x \cdot x^2 \equiv x(x+1) = x^2 + x \equiv 1$, $x^4 \equiv x$. This means that the basis vectors are encoded as follows:
$1 \mapsto x^2 + x + 1$, i. e., 100 → 11100,
$x \mapsto x^3 + 1$, i. e., 010 → 10010,
$x^2 \mapsto x^4 + x$, i. e., 001 → 01001.
Thus, the generating matrix is
$$G = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
and the check matrix is
$$H = \begin{pmatrix} 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 \end{pmatrix}. \qquad \square$$
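The whole computation of 12.H.1 can be reproduced mechanically. A hedged Python sketch (the helper names are ours; numpy is assumed to be available) — it rebuilds the columns of G by systematic encoding of the basis messages and checks that H annihilates the code:

```python
# Reconstructing the (5, 3)-code of 12.H.1 over Z_2 (names ours).
# Polynomials are int lists, p[i] = coefficient of x^i.
import numpy as np

p = [1, 1, 1]                      # the generating polynomial x^2 + x + 1

def polymod(f, g):
    """Remainder of f modulo g over Z_2."""
    f = f[:]
    while len(f) >= len(g):
        if f[-1]:                  # cancel the leading term with a shift of g
            d = len(f) - len(g)
            for i, gi in enumerate(g):
                f[d + i] ^= gi
        f.pop()
    return f

# Systematic encoding of the basis messages 100, 010, 001:
# message x^j  ->  x^{2+j} + (x^{2+j} mod p), giving the columns of G.
cols = []
for j in range(3):
    shifted = [0] * (2 + j) + [1]                 # the polynomial x^{2+j}
    rem = polymod(shifted, p)
    word = [(rem[i] if i < len(rem) else 0) ^ (1 if i == 2 + j else 0)
            for i in range(5)]
    cols.append(word)

G = np.array(cols).T                              # 5x3 generating matrix
P = G[:2, :]                                      # parity block
H = np.hstack([np.eye(2, dtype=int), P])          # check matrix (I_2 | P)
print(G.T)                                        # rows 11100, 10010, 01001
print((H @ G) % 2)                                # zero: H annihilates the code
```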
12.H.2. Find the generating matrix and the check matrix of the (7, 4)-code over Z2 generated by the polynomial $x^3 + x + 1$. ⃝

12.H.3. A 7-bit message $a_0 a_1 \dots a_6$, considered as the polynomial $a_0 + a_1 x + \cdots + a_6 x^6$, is encoded using the polynomial code generated by $x^4 + x + 1$.
i) Encode the message 1100011.
ii) You have received the word 10111010001. What was the message if you assume that at most one bit was flipped?
iii) What was the message in ii) if you assume that exactly two bits were flipped?

Solution. i) $x^4 \equiv x + 1$, $x^5 \equiv x^2 + x$, $x^9 \equiv x^3 + x$, $x^{10} \equiv x^2 + x + 1$, whence
$$1 + x + x^5 + x^6 \mapsto x^4 + x^5 + x^9 + x^{10} + (x + 1) + (x^2 + x) + (x^3 + x) + (x^2 + x + 1) = x^3 + x^4 + x^5 + x^9 + x^{10}.$$
Thus, the code is 00011100011.

Let us briefly recollect them now in view of the general concepts of groups and their homomorphisms. What the operation in this group looks like needs more thought. In the case of a (small) finite group, one can build a complete table of the operation results for all pairs of operands. Consider the group Σ3 on the numbers {1, 2, 3} and denote the particular permutations by the ordering of the numbers (not to be confused with the notation for cycles!):
a = (1, 2, 3), b = (2, 3, 1), c = (3, 1, 2), d = (1, 3, 2), e = (3, 2, 1), f = (2, 1, 3).
Then the composition is given by the following table:
$$\begin{array}{c|cccccc} \cdot & a & b & c & d & e & f \\ \hline a & a & b & c & d & e & f \\ b & b & c & a & f & d & e \\ c & c & a & b & e & f & d \\ d & d & e & f & a & b & c \\ e & e & f & d & c & a & b \\ f & f & d & e & b & c & a \end{array}$$
Note that there is a fundamental difference between the permutations a, b, c and the other three. The former three form a cycle, generated by either b or c: $b^2 = c$, $b^3 = a$, $c^2 = b$, $c^3 = a$. It follows that these three permutations form a commutative subgroup. Here (as well as in the whole group), a is the neutral element, and b and c are inverses of each other. Therefore, this subgroup is the same as the group Z3 of residue classes modulo 3, or as the group of third roots of unity. The other three permutations are self-inverse, which means that any one of them together with the identity a creates a subgroup, the same one as Z2. Further, b and c are elements of order 3, i.e. the third power is the first one equal to the identity a, while d, e, and f are of order 2. Since the table is not symmetric with respect to the main diagonal, the composition · is not commutative.

Other permutation groups Σm of finite m-element sets behave similarly. Each permutation σ partitions the set M into a disjoint union of maximal invariant subsets, which are obtained by taking unprocessed elements x ∈ M step by step and putting all iteration results $\sigma^k(x)$, k = 1, 2, …, into the class $M_x$ until $\sigma^k(x) = x$. Each permutation is obtained as a composition of the cycles, which behave as the identity outside $M_x$ and as σ on $M_x$. If the elements of $M_x$ are numbered as $(1, 2, \dots, |M_x|)$ so that i corresponds to $\sigma^i(x)$, then the permutation is simply a one-place shift in the cycle (i.e. the last element is mapped back to the first one). Hence the name cycle. These cycles commute, so it does not matter in which order the permutation σ is composed from them. (Of course, if we pick two arbitrary cycles on M, they do not have to commute.)

The simplest cycles are one-element fixed points of σ and two-element subsets (x, σ(x)), where σ(σ(x)) = x. The latter are called transpositions. Since every cycle can be composed from transpositions of adjacent elements (just let the last element “bubble” back to the beginning), every permutation can be written as a composition of transpositions of adjacent elements.

ii) $1 + x^2 + x^3 + x^4 + x^6 + x^{10}$ divided by $x^4 + x + 1$ gives remainder $x^2 + 1 \equiv x^8$.
Thus, the ninth bit was flipped and the original message was 1010101.
iii) Either the first and third bits were flipped ($x^2 + 1$), or the fifth and sixth were ($x^4 + x^5 \equiv x^2 + 1$). In the first case, the message was 1010001, while in the second case, it was 0110001. □

12.H.4. A 7-bit message $a_0 a_1 \dots a_6$, considered as the polynomial $a_0 + a_1 x + \cdots + a_6 x^6$, is encoded using the polynomial code generated by $x^4 + x^3 + 1$.
i) Encode the message 1101011.
ii) You have received the word 01001011101. What was the message if you assume that at most one bit was flipped?
iii) What was the message in ii) if you assume that exactly two bits were flipped?

Solution. i) $x^4 \equiv x^3 + 1$, $x^5 \equiv x^3 + x + 1$, $x^7 \equiv x^2 + x + 1$, $x^9 \equiv x^2 + 1$, $x^{10} \equiv x^3 + x$, thus we get
$$1 + x + x^3 + x^5 + x^6 \mapsto x^4 + x^5 + x^7 + x^9 + x^{10} + (x^3 + 1) + (x^3 + x + 1) + (x^2 + x + 1) + (x^2 + 1) + (x^3 + x) = x + x^3 + x^4 + x^5 + x^7 + x^9 + x^{10}.$$
Therefore, the codeword is 0101 1101011, the first four bits being the redundancy and the remaining seven the message.

ii) $x + x^4 + x^6 + x^7 + x^8 + x^{10}$ divided by $x^4 + x^3 + 1$ gives remainder $x^2 + x + 1 \equiv x^7$. Thus, the eighth bit was flipped, and the original message was 1010101.

iii) Either the second and tenth bits were flipped ($x + x^9 \equiv x^2 + x + 1$), or the fourth and seventh ($x^3 + x^6 \equiv x^2 + x + 1$), or the fifth and ninth ($x^4 + x^8 \equiv x^2 + x + 1$). The respective corrected codewords are 00001011111, 01011010101, and 01000011001. □

12.H.5. Consider the (15, 11)-code generated by the polynomial $1 + x^3 + x^4$. We have received the word 011101110111001. Find the original 11-bit message, provided exactly one bit was flipped.

Solution. The word is a codeword if and only if it is divisible by the generating polynomial $1 + x^3 + x^4$. The received word corresponds to the polynomial $x + x^2 + x^3 + x^5 + x^6 + x^7 + x^9 + x^{10} + x^{11} + x^{14}$. When divided by $1 + x^3 + x^4$, it leaves remainder $x + 1$. This means that an error has occurred. If we assume that only one bit was flipped, then there must be a power of x which is equal to this remainder modulo $1 + x^3 + x^4$. Thus, we compute $x^4 \equiv x^3 + 1$, $x^5 \equiv x^3 + x + 1$, …, $x^{12} \equiv x + 1$, and find out that the thirteenth bit was flipped, so the original message was 01110111101.

Return to the case of Σ3. Two elements, b, c, represent cycles which include all the three elements; each of them generates {a, b, c} = Z3. Besides those, d, e, f are composed of cycles of length 2 and 1; finally, a is composed of three cycles of length one. There are no more possibilities. However, it is clear from the procedure that for more elements, there are very many possibilities.

In general, there are many ways of expressing a permutation as a composition of transpositions. However, for a given permutation, the parity of the number of transpositions is fixed and independent of the choice of particular transpositions. This can be seen from the number of inversions of a permutation, since each transposition changes the number of inversions by an odd number (see the discussion in subsection 2.2.2 on page 97). It follows that there is a well-defined mapping sgn: Σm → Z2 = {±1}, the permutation parity. This recovers the proposition crucial for building the determinants (see 2.2.1 and on):

Theorem. Every permutation of a finite set can be written as a composition of cycles. A cycle of length ℓ can be expressed as a composition of ℓ − 1 transpositions. The parity of this cycle is $(-1)^{\ell - 1}$.
The parity of the composition σ ◦ τ is equal to the product of the parities of the composed permutations σ and τ.

The last proposition says that the mapping sgn transforms permutation composition σ ◦ τ to the product sgn σ · sgn τ in the commutative group Z2.

(Semi)group homomorphisms
In general, a mapping $f: G_1 \to G_2$ is a (semi)group homomorphism if and only if it respects the operation, i.e. $f(a \cdot b) = f(a) \cdot f(b)$.

In particular, the permutation parity is a homomorphism sgn: Σm → Z2. In a moment, we shall see that group inversions and units are also preserved by homomorphisms. Before discussing the theory, let us look at more examples of groups.

12.4.3. Symmetries of plane figures. In the fifth part of chapter one, the connections between invertible 2-by-2 matrices and linear transformations in the plane are thoroughly considered. A matrix in Mat2(R) defines a linear mapping $\mathbb{R}^2 \to \mathbb{R}^2$ that preserves standard distances if and only if its columns form an orthonormal basis of $\mathbb{R}^2$ (which is a simple condition on the matrix entries, see subsection 1.5.7 on page 33). Combining the orthogonal linear mappings with translations, we arrive at the group of all Euclidean transformations of the plane.

Let us look at the exercise more thoroughly. Computing all powers of x, we obtain
$x^4 \equiv x^3 + 1$, $x^5 \equiv x^3 + x + 1$, $x^6 \equiv x^3 + x^2 + x + 1$, $x^7 \equiv x^2 + x + 1$, $x^8 \equiv x^3 + x^2 + x$, $x^9 \equiv x^2 + 1$, $x^{10} \equiv x^3 + x$, $x^{11} \equiv x^3 + x^2 + 1$, $x^{12} \equiv x + 1$, $x^{13} \equiv x^2 + x$, $x^{14} \equiv x^3 + x^2$,
so the generating matrix is
$$G = \begin{pmatrix} P \\ I_{11} \end{pmatrix}, \qquad P = \begin{pmatrix} 1&1&1&1&0&1&0&1&1&0&0 \\ 0&1&1&1&1&0&1&0&1&1&0 \\ 0&0&1&1&1&1&0&1&0&1&1 \\ 1&1&1&0&1&0&1&1&0&0&1 \end{pmatrix},$$
where $I_{11}$ denotes the identity matrix of order 11. We can verify that multiplication by 01110111101 yields the codeword 011101110111101, which differs from the received word 011101110111001 exactly in the thirteenth bit. □

Now, we begin to use the check matrix efficiently.

12.H.6. Find the generating matrix and the check matrix of the (7, 2)-code (i. e., there are 2 bits of the message and 5 redundant bits) generated by the polynomial $x^5 + x^4 + x^2 + 1$. Decode the received word 0010111 (i. e., find the message that was sent) assuming that the least number of errors occurred.

Solution. The generating matrix of the code is
$$G = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

In fact, it is possible to prove that every mapping of the plane into itself which preserves distances is affine, and hence such a Euclidean transformation.⁹ As observed, the linear part of this mapping is orthogonal. Thus, all these mappings form the group of all orthogonal transformations (also called Euclidean transformations) in the plane. Moreover, it is shown that besides the translations $T_a$ by a vector a, these are only the rotations $R_\varphi$ around the origin by any angle φ, and the reflections $F_\ell$ with respect to any line that goes through the origin (also note that the central inversion is the same as the rotation by π).

Now, general group concepts are illustrated on the problem of symmetries of plane figures. For example, consider tiles. First, consider them individually, in the form of a bounded diagram in the plane. Then consider them with the condition of tiling a band, and then the entire plane.
As an example, consider a line segment and an equilateral triangle. It is of interest how symmetric these objects are; that is, with respect to which distance-preserving transformations they are invariant. In other words, we want the image of the figure to be identical to the original one (unless some significant points are labeled, for example the vertices of the triangle A, B, C or the endpoints of the line segment). It is clear that all symmetries of a fixed object form a group (usually with only one element, the identity).

In the case of the line segment, the situation is very simple – it is clear that the only non-trivial symmetries are the rotation by π around the center of the segment, the reflection FH with respect to the axis of the segment, and the reflection FV with respect to the line itself. All these symmetries are self-inverse. Hence the group of symmetries has four elements. Its table looks as follows:

⁹ If a mapping $F: \mathbb{R}^2 \to \mathbb{R}^2$ preserves distances, then this must also hold for the mapped vectors of velocity, i.e. the Jacobi matrix DF(x, y) must be orthogonal at every point. Expanding this condition for the given mapping $F = (f(x, y), g(x, y)): \mathbb{R}^2 \to \mathbb{R}^2$ leads to a system of differential equations which has only affine solutions, since all second derivatives of F must be zero (and then, the proposition is an immediate consequence of Taylor's remainder theorem). Try to think out the details! The same procedure leads to the result for Euclidean spaces of arbitrary dimension. Note that the condition to be proved is independent of the choice of affine coordinates. Composing F with a linear mapping does not change the result. Hence, for a fixed point (x, y), compose $(DF)^{-1} \circ F$ and assume, without loss of generality, that DF(x, y) is the identity matrix. Differentiation of the equations then yields the desired proposition.

The generating matrix is of the form $G = \begin{pmatrix} P \\ I_k \end{pmatrix}$, where
$$P = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}.$$
The check matrix is of the form $(I_{n-k} \mid P)$, i. e., in our case,
$$H = \begin{pmatrix} 1&0&0&0&0&1&1 \\ 0&1&0&0&0&0&1 \\ 0&0&1&0&0&1&1 \\ 0&0&0&1&0&0&1 \\ 0&0&0&0&1&1&1 \end{pmatrix}.$$
Multiplying the received word by the check matrix, we get the syndrome (error) of the word:
$$Hz = \begin{pmatrix} 1&0&0&0&0&1&1 \\ 0&1&0&0&0&0&1 \\ 0&0&1&0&0&1&1 \\ 0&0&0&1&0&0&1 \\ 0&0&0&0&1&1&1 \end{pmatrix} \begin{pmatrix} 0\\0\\1\\0\\1\\1\\1 \end{pmatrix} = \begin{pmatrix} 0\\1\\1\\1\\1 \end{pmatrix}.$$
The syndrome corresponding to the received word is 01111. Now, we find all words corresponding to this syndrome. This can be done by adding all codewords to the received word. There are four codewords, corresponding to the four possible messages. They are obtained by multiplying the messages (00, 01, 10, 11) by the generating matrix. Thus, we get the codewords 0000000, 1111101, 1010110, 0101011. The space of words corresponding to a given syndrome is an affine space whose direction is the vector space of all codewords (see 12.5.8). Thus, we get the words 0010111, 1101010, 1000001, 0111100. The least number of errors is equal to the least number of ones in the obtained words. In our case, this is achieved by the word 1000001, which contains only two ones and thus is the so-called leading representative of the class of words with syndrome 01111. The original message can be obtained by subtracting (or adding – this is equivalent in Z2) the received word and the leading representative of the class with the given syndrome. In our case, we get 0010111 − 1000001 = 1010110.
Therefore, assuming the least number of errors, the sent word was 1010110, where the last two bits are the original message, i. e., 10. □

12.H.7. Consider the (7, 3)-code generated by the polynomial $x^4 + x^3 + x + 1$. Find its generating matrix and check matrix. Using the method of leading representatives, decode the received word 1110010. ⃝

$$\begin{array}{c|cccc} \cdot & R_0 & R_\pi & F_H & F_V \\ \hline R_0 & R_0 & R_\pi & F_H & F_V \\ R_\pi & R_\pi & R_0 & F_V & F_H \\ F_H & F_H & F_V & R_0 & R_\pi \\ F_V & F_V & F_H & R_\pi & R_0 \end{array}$$
This group is commutative.

For the equilateral triangle, there are more symmetries: one can rotate by 2π/3, or one can mirror with respect to the axes of the sides. In order to obtain the entire group, all compositions of these transformations must be added in. In 1.5.9 it is shown that the composition of two reflections is always a rotation. At the same time, it is clear that changing the order of composition of two fixed reflections leads to a rotation by the same angle but with the other orientation. It follows that the reflections around two axes generate all the symmetries, of which there are six altogether. Placing the triangle as is shown in the diagram, the six transformations are given by the following matrices:
$$a = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad b = \begin{pmatrix} -\frac{1}{2} & \frac{\sqrt{3}}{2} \\ -\frac{\sqrt{3}}{2} & -\frac{1}{2} \end{pmatrix}, \quad c = \begin{pmatrix} -\frac{1}{2} & -\frac{\sqrt{3}}{2} \\ \frac{\sqrt{3}}{2} & -\frac{1}{2} \end{pmatrix},$$
$$d = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}, \quad e = \begin{pmatrix} \frac{1}{2} & -\frac{\sqrt{3}}{2} \\ -\frac{\sqrt{3}}{2} & -\frac{1}{2} \end{pmatrix}, \quad f = \begin{pmatrix} \frac{1}{2} & \frac{\sqrt{3}}{2} \\ \frac{\sqrt{3}}{2} & -\frac{1}{2} \end{pmatrix}.$$
A comparison of the table of the operation with that of the permutation group Σ3 shows that it is the same. For the sake of clarity, the vertices are labeled with numbers, so the corresponding permutations can be easily understood.

Similarly, there are groups of symmetries with k rotations and k reflections. It suffices to consider a regular k-gon. These groups are usually denoted Dk and called the dihedral groups; Dk has 2k elements. They are not commutative for k ≥ 3 (D2 is commutative). The name comes from the fact that D2 is the group of symmetries of the hydrogen molecule H2, which contains two hydrogen atoms and can be imagined as a line segment.

Similarly, there are figures whose only symmetries are rotations, and hence the corresponding groups are commutative. They are denoted Ck and called cyclic groups of order k. For that, it suffices to consider a regular polygon whose sides are changed non-symmetrically, but in the same manner (see the extension of the triangle in the diagram). Note that the group C2 can be realized in two ways: either using the rotation by π or a single reflection.

As the first illustration of the power of abstraction, we prove the following theorem. A figure is said to have a discrete group of symmetries if and only if the set of images of an arbitrary point over all the symmetries is a discrete subset of the plane (i.e. each of its points has a neighbourhood in which there is no other point of the set). Note that every discrete group of symmetries of a bounded figure is necessarily finite.

Theorem. Let M be a bounded set in the plane $\mathbb{R}^2$ with a discrete group of symmetries G. Then G is either trivial or one of the groups Ck, Dk for k > 1.

12.H.8. Consider the linear (7, 4)-code (i. e., the message has length 4) over Z2 defined by the matrix
$$\begin{pmatrix} 0&1&1&0 \\ 1&1&0&1 \\ 1&0&1&1 \\ 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{pmatrix}.$$
Decode the received word 1010001 (i. e., find the sent message) assuming that the least number of errors occurred.

Solution. There are $2^4 = 16$ possible messages. All codewords can be obtained by multiplying the possible messages (0000, 0001, …, 1111) by the generating matrix of the code.
Thus, we get:
0110001, 1010010, 1100100, 0111000, 1100011, 1010101, 0001001, 1011100, 1101010, 0110110, 0001110, 1101101, 1011011, 0000111, 0111111, 0000000.
Now, we construct the check matrix of the given code:
$$H = \begin{pmatrix} 1&0&0&0&1&1&0 \\ 0&1&0&1&1&0&1 \\ 0&0&1&1&0&1&1 \end{pmatrix}$$
(we remove the block of the generating matrix that consists of the identity matrix, and to the left of the remaining block, we write the identity matrix of fitting size). Now, multiplying the vector of the received word $z = (1010001)^T$ by H, we get the syndrome $s = Hz = (110)^T$. One word with this syndrome is 1100000 (we fill the syndrome with zeros to the appropriate length). All words with syndrome 110 are obtained by adding this word to all codewords. Thus, we get the words
1000001, 0110010, 0000100, 1011000, 0000011, 0110101, 1101001, 0111100, 0001010, 1010110, 1101110, 0001101, 0111011, 1100111, 1011111, 1100000.
Out of these words with syndrome 110, only the word 0000100 contains a single one, so this is the leading representative of the class of words with syndrome 110. Subtracting the leading representative from the received word, we get the word that was sent, assuming the least number of bit flips (1 in this case), i. e., the word (101)0101, where the last four bits are the message, i. e., 0101. □

12.H.9. Consider the linear (7, 4)-code (i. e., the message has length 4) over Z2 defined by the matrix
$$\begin{pmatrix} 1&1&0&1 \\ 0&0&1&1 \\ 1&0&1&0 \\ 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{pmatrix}.$$

Proof. If there were a set M with a translation as one of its symmetries, then it could not be bounded. If M had, as one of its symmetries, a rotation by an angle which is an irrational multiple of 2π, then iterating this rotation would lead to a dense subset of images on the corresponding circle.¹⁰ It follows that the group is not discrete.

If M had non-trivial rotations with different centers as symmetries, then again it could not be bounded. To see this, write the corresponding rotations in the complex plane as $R: z \mapsto (z - a)\zeta + a$, $Q: z \mapsto z\eta$ for complex units $\zeta = e^{2\pi i/k}$, $\eta = e^{2\pi i/\ell}$ and an arbitrary $a \neq 0 \in \mathbb{C}$. Then, it is immediate (a straightforward computation with complex numbers) that
$$Q \circ R \circ Q^{-1} \circ R^{-1}: z \mapsto z + a(-1 + \zeta + \eta - \zeta\eta),$$
which is a translation by a non-trivial vector unless the angle of one of the rotations is zero. It follows that M is not bounded. The same holds for the case of a rotation and a reflection with respect to a line which does not go through the center of the rotation. Check this case yourself!

The only symmetries available are rotations with a common center and reflections with respect to lines which pass through this center. It remains to prove that the entire group is composed either only of rotations, or of the same number of rotations and reflections. Recall that the composition of two different reflections yields a rotation whose angle doubles the angle enclosed by the corresponding axes (see 1.5.9). Therefore, composing a reflection with respect to a line p with a rotation by angle φ is again a reflection, with respect to the line which is at angle φ/2 from p (draw a diagram!). The proof is almost complete. Observe that the subgroup of all rotations in the group of symmetries contains a rotation by the smallest positive angle $\varphi_0$ (there are only finitely many of them there).
12.4.4. Symmetries of plane tilings. There is more complicated behaviour in the case of plane figures in bands or in the entire plane (for example, symmetries of various tilings).

¹⁰ The argument is subtle but straightforward: if there were an interval of length ε on the circle not hit by the orbit of a point under the rotation, then all the points of the orbit would have to be at distance at least ε from each other. Thus, there could be only finitely many of them, which contradicts the irrationality of the angle.

Decode the received word 1101001 (i.e., find the sent message) assuming that the least number of errors occurred.

Solution. Syndrome 101, leading representative 0001000, sent message (110)0001. □

12.H.10. Consider the linear (7, 4)-code (i.e., the message has length 4) over Z2 defined by the matrix

\begin{pmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}

Decode the received word 0000011 (i.e., find the sent message) assuming that the least number of errors occurred.

Solution. Syndrome 011, leading representative 0000100, sent message (000)0111. □

12.H.11. Consider the linear (7, 4)-code (i.e., the message has length 4) over Z2 defined by the matrix

\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}

Decode the received word 0001100 (i.e., find the sent message) assuming that the least number of errors occurred.

Solution. Syndrome 110, leading representative 0000010, sent message (000)1110. □

12.H.12. We want to transfer one of four possible messages with a binary code which should be able to correct all single errors. What is the minimum possible length of the codewords (all codewords have to be of the same length)? Why?

Solution. Let us denote the desired length by n. The minimum Hamming distance of any two codewords must be at least three. This means that if we take two different codewords and flip one bit in each, the resulting words must still be different (and also different from every codeword). There are n + 1 words that can be obtained from a given one by flipping at most one bit (this includes the original word itself). Thus, we need at least 4(n + 1) possible words. On the other hand, there are 2^n words of length n, so 4(n + 1) ≤ 2^n. This inequality is satisfied only for n ≥ 5, so the codewords must be at least 5 bits long. And indeed, there are four codewords of length 5 with minimum Hamming distance 3, for instance 00111, 01001, 10100, 11010. □

Consider first the set of points lying between two fixed parallel lines. Suppose that this band is covered by disjoint images of a bounded subset M under some translation. Of course, this translation is a symmetry of the chosen tiling of the band, so the group of symmetries is necessarily infinite. Such a set allows for no rotation symmetries other than Rπ, and the only possible reflections are either horizontal, with respect to the axis of the band, or vertical, with respect to a line perpendicular to the boundary lines. In addition, there are translations given by vectors parallel to the axis of the band.
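Returning briefly to the decoding exercises 12.H.8–12.H.11 above: the whole recipe (syndrome, leading representative, correction) is mechanical and easy to automate. A minimal sketch in Python with NumPy, assuming a generating matrix in the standard form used above, with the identity block in the last four rows (the helper names are ours):

    import numpy as np
    from itertools import product

    # the (7, 4)-code of 12.H.8; the last four rows form the identity block
    G = np.array([[0,1,1,0], [1,1,0,1], [1,0,1,1],
                  [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]])
    H = np.hstack([np.eye(3, dtype=int), G[:3]])   # check matrix [I | P], cf. 12.H.8

    def decode(received):
        z = np.array(received)
        s = H @ z % 2                                  # the syndrome
        # the leading representative: a minimum-weight word with the same syndrome
        e = min((np.array(w) for w in product([0, 1], repeat=7)
                 if np.array_equal(H @ np.array(w) % 2, s)), key=sum)
        return (z + e) % 2                             # the corrected codeword

    print(decode([1, 0, 1, 0, 0, 0, 1]))   # -> [1 0 1 0 1 0 1]; message 0101, as in 12.H.8

Swapping in the matrices of 12.H.9–12.H.11 reproduces the remaining solutions.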
A not-too-complicated discussion leads to a description of all discrete groups of symmetries of such bands. Such a group is generated by some of the following symmetries: a translation T, a shifted reflection G (i.e., the composition of the horizontal reflection with a translation), a vertical reflection V, the horizontal reflection H, and the rotation R by π.

Theorem. Every discrete group of symmetries of a band in the plane is isomorphic to one of the groups generated by the following symmetries:
(1) a single translation T,
(2) a single shifted reflection G,
(3) a single translation T and a vertical reflection V,
(4) a single translation T and the rotation R,
(5) a single shifted reflection G and the rotation R,
(6) a single translation T and the horizontal reflection H,
(7) a single translation T, the horizontal reflection H, and a vertical reflection V.

The proof is not presented here. The following diagram shows examples of schematic patterns with the corresponding symmetries:

It is even more complicated with symmetries of tilings which cover the entire plane; there is insufficient space here to consider further details. It can be shown that there are 17 such groups of symmetries, known as the two-dimensional crystallographic groups. A similar complete discussion is known even for three-dimensional discrete groups of symmetries. The rich theory was developed mainly in the 19th century in connection with the study of symmetries of crystals and molecules.

12.H.13. We want to transfer 4-bit messages with a binary code which should be able to correct all single and double errors. What is the minimum possible length of the codewords (all codewords have to be of the same length)? Why?

Solution. We proceed similarly as in the above exercise. If the code is to correct double errors as well, then the minimum Hamming distance of any two codewords must be at least five. This means that if we take two different codewords and flip up to two bits in each, the resulting words must still be different. Denoting by n the length of the words, we get the inequality

2^4 (1 + n + \binom{n}{2}) ≤ 2^n.

The least value of n for which it is satisfied is n = 10, so the codewords must be at least 10 bits long. □

I. Extension of the stereographic projection

Let us try to extend the definition of the stereographic projection so that the circle is parametrized by the points of P1(R). We look at the corresponding mapping P1(R) → P2(R). The points of projective extensions are described in the so-called homogeneous coordinates, which are given up to a non-zero multiple. For instance, the points of P2(R) are written (x : y : z). The circle in the plane z = 1 is given as the intersection of the cone of directions defined by x^2 + y^2 − z^2 = 0 with this plane. The inverse of the stereographic projection (i.e., our parametrization of the circle) can be described as

(t : 1) ↦ ( 2t/(1 + t^2) : (t^2 − 1)/(t^2 + 1) : 1 ) = ( 2t : t^2 − 1 : t^2 + 1 ).

For t ≠ 0, we have (t : 1) = (2t^2 : 2t), and the original stereographic projection (i.e., the inverse of the above mapping) can be written linearly as (x : y : z) ↦ (y + z : x), which extends our parametrization to the improper point: (1 : 0) ↦ (0 : 1 : 1). Then, the mapping of P1(R) onto the circle takes the form

P1(R) ∋ (x : y) ↦ (2xy : x^2 − y^2 : x^2 + y^2) ∈ P2(R).
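A quick symbolic check of these formulas (a sympy sketch of ours, not part of the text):

    from sympy import symbols, simplify

    t = symbols('t')
    x, y, z = 2*t, t**2 - 1, t**2 + 1      # the parametrization (2t : t^2 - 1 : t^2 + 1)
    print(simplify(x**2 + y**2 - z**2))    # 0: the image lies on the cone
    print(simplify((y + z) / x))           # t: the projection (x : y : z) -> (y + z : x)
                                           #    recovers the parameter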
Now, let us look at how simply the formula for the stereographic projection can be calculated directly in the projective extensions (see 4.4.1): We include P1(R) as the points with homogeneous coordinates (t : 0 : 1), and among the linear combinations of the point (0 : 1 : −1) (i.e., the pole from which we project) and (x : y : z) (a general point of the circle), we must find the one whose coordinates are (u : 0 : v). The only possibility is the point (x : 0 : z + y), which recovers our previous formula.

12.4.5. Group homomorphisms. Recall that a mapping f : G → H from a group G to a group H is called a group homomorphism if and only if it respects the operation, i.e., f(a · b) = f(a) · f(b) for all a, b ∈ G. Note that the operation on the left-hand side is the operation in G, before f is applied, while the operation on the right-hand side is the operation in H, after f is applied. The following properties of homomorphisms follow easily from the definition:

Proposition. Every group homomorphism f : G → H satisfies:
(1) the identity of G is mapped to the identity of H,
(2) the inverse of an element of G is mapped to the inverse of its image, i.e., f(a^{-1}) = f(a)^{-1},
(3) the image of a subgroup K ⊆ G is a subgroup f(K) ⊆ H,
(4) the preimage f^{-1}(K) ⊆ G of a subgroup K ⊆ H is again a subgroup,
(5) if f is also a bijection, then the inverse mapping f^{-1} is also a homomorphism,
(6) f is injective if and only if f^{-1}(e_H) = {e_G}.

Proof. (3), (2), and (1). If K ⊆ G is a subgroup, then for each y = f(a), z = f(b) in H with a, b ∈ K, the product y · z = f(a · b) also lies in the image. In particular, f(b) = f(e · b) = f(e) · f(b) and similarly f(b) = f(b) · f(e). Thus f(e) is the unit in the image f(K). Finally, f(e) = f(a · a^{-1}) = f(a) · f(a^{-1}), so f(a^{-1}) is a right inverse of f(a). Similarly, it is a left inverse, too, and the first three claims are proved.

We proceed similarly in the case of preimages: if a, b ∈ G satisfy f(a), f(b) ∈ K ⊆ H, then also f(a · b) ∈ K.

Suppose there exists an inverse mapping g = f^{-1}. Fix arbitrary y = f(a), z = f(b) ∈ H. Then f(a · b) = y · z = f(a) · f(b), which is equivalent to g(y) · g(z) = a · b = g(y · z). Thus the inverse mapping is also a homomorphism.

If f(a) = f(b), then f(a · b^{-1}) = e_H. Therefore, if the only element mapped to e_H is e_G, then a · b^{-1} = e_G, i.e., a = b. The other implication is trivial. □

The subgroup f^{-1}(e_H) (the preimage of the identity in H) is called the kernel of the homomorphism f and is denoted ker f. A bijective group homomorphism is called a group isomorphism. It follows directly from the above that a homomorphism f : G → H with a trivial kernel is an isomorphism onto the image f(G).

J. Elliptic curves

A singular point of a hypersurface in P^n, defined by a homogeneous polynomial F(x_0, x_1, . . . , x_n) = 0, is a point of the hypersurface which satisfies ∂F/∂x_i = 0 for i = 0, 1, . . . , n. From the geometric point of view, "something weird" happens at such a point. In the case of a curve in the projective space P2(R), the condition that all the partial derivatives vanish means that there is no tangent line to the curve at the given point: the curve has a so-called cusp there or intersects itself. A "nice" singularity can be seen in the "quatrefoil", i.e., the variety given by the zero set of the polynomial (x^2 + y^2)^3 − 4x^2y^2 in R^2. A cusp can be found on the curve in R^2 given by x^3 − y^2 = 0.
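Singular points of affine curves are easy to test for: the polynomial and its gradient must vanish simultaneously. A small sympy sketch (our own illustration) checking the cusp above, and previewing exercise 12.J.1 below:

    from sympy import symbols, diff

    x, y = symbols('x y')

    def check(f, point):
        """Evaluate f and its gradient at the given point."""
        grad = [diff(f, v) for v in (x, y)]
        return f.subs(point), [g.subs(point) for g in grad]

    # the cuspidal cubic x^3 - y^2 = 0: the origin is singular
    print(check(x**3 - y**2, {x: 0, y: 0}))
    # the curve y^2 = x^3 - 3x + 2 has 4a^3 + 27b^2 = 0 (a = -3, b = 2);
    # cf. 12.J.1 below: the point (1, 0) is singular
    print(check(y**2 - x**3 + 3*x - 2, {x: 1, y: 0}))
    # both print (0, [0, 0]): point on the curve, vanishing gradient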
12.4.6. Examples. The additive group Zk of residue classes modulo k is isomorphic to the group of k-th roots of unity, and also to the group of rotations by integer multiples of 2π/k. Draw a diagram; calculating with the complex units e^{2πi/k} is very efficient.

The mapping exp : R → R^+ is an isomorphism of the additive group of the real numbers onto the multiplicative group of the positive real numbers. This isomorphism extends naturally to a homomorphism exp : C → C \ {0} of the additive group of the complex numbers onto the multiplicative group of the non-zero complex numbers. However, this homomorphism has a non-trivial kernel. The restriction of exp to the purely imaginary numbers (a subgroup isomorphic to R) is the homomorphism it ↦ e^{it} = cos t + i sin t. This means that the numbers 2kπi, k ∈ Z, lie in the kernel. Nothing else lies in the kernel: if e^{s+it} = e^s · e^{it} = 1 for real numbers s and t, then e^s = 1, i.e., s = 0, and then t = 2kπ for an integer k.

The determinant of a matrix is a mapping which assigns, to each square matrix of scalars in K, a scalar in K (the cases K = Z, Q, R, C have already been worked with). The Cauchy theorem about the determinant of a product of square matrices, det(A · B) = (det A) · (det B), can also be read as the fact that for the group G = GL(n, K) of invertible matrices, the mapping det : G → K \ {0} is a group homomorphism.

12.4.7. Group product. Given any two groups, a more complicated group can be constructed using the following construction:

Group product

For any groups G, H, the group product G × H is defined as follows: the underlying set is the Cartesian product G × H, and the operation is defined componentwise, that is, (a, x) · (b, y) = (a · b, x · y), where the left-hand operation is the one being defined, while the right-hand operations are those in G and H, respectively.

The projections onto the components G and H of the product, p_G : G × H ∋ (a, b) ↦ a ∈ G, p_H : G × H ∋ (a, b) ↦ b ∈ H, are surjective homomorphisms, whose kernels are ker p_G = {(e_G, b); b ∈ H} ≃ H, ker p_H = {(a, e_H); a ∈ G} ≃ G.

The group Z6 is isomorphic to the product Z2 × Z3. This can be seen easily in the multiplicative realization of the groups Zk as the complex k-th roots of unity. Z6 consists of the points of the unit circle that form the vertices of a regular hexagon. Then, Z2 corresponds to ±1, while Z3 corresponds to the equilateral triangle one of whose vertices is the number 1.

An elliptic curve C is the set of points in K^2, where K is a given field, which satisfy an equation of the form y^2 = x^3 + ax + b, where a, b ∈ K. In addition, we require that there are no singularities, which means, over the field of real numbers, that Δ = −16(4a^3 + 27b^2) ≠ 0. The expression Δ is called the discriminant of the equation. Note that the right-hand side is a cubic polynomial without the quadratic term. This form of the equation is called the Weierstrass equation of an elliptic curve.

12.J.1. Prove that the curve y^2 = x^3 + ax + b in R^2 has a singularity if and only if 4a^3 + 27b^2 = 0.

Solution. The equation of the curve in homogeneous coordinates (see 4.4.1) is F(x, y, z) = 0, where

(1) F(x, y, z) = y^2 z − x^3 − axz^2 − bz^3.

We have ∂F/∂x = −3x^2 − az^2, ∂F/∂y = 2yz, ∂F/∂z = y^2 − 2axz − 3bz^2. Let [x, y, z] be a singular point of the given curve. If z = 0, then, since the partial derivatives of F with respect to x and z must be zero, we get x = 0 and y = 0, respectively. However, this is excluded, because the point [0, 0, 0] does not lie in the considered projective space P2(R).
Thus, a singular point has z ≠ 0, so ∂F/∂y = 0 implies y = 0. Denoting γ = x/z, the equation −3x^2 − az^2 = 0 implies 3γ^2 = −a, and y^2 − 2axz − 3bz^2 = 0 implies 2aγ = −3b. We can see that a = 0 implies b = 0, i.e., the equality 4a^3 + 27b^2 = 0 is satisfied trivially. If a ≠ 0, then we can express γ from the two obtained equations: from the second one, γ = −3b/(2a), and from the first one, γ^2 = −a/3. Altogether,

γ^2 = −a/3 = 9b^2/(4a^2)  ⟹  4a^3 + 27b^2 = 0.

Thus, we have proved one of the implications. On the other hand, if 4a^3 + 27b^2 = 0, then defining γ = −3b/(2a) (so that γ^2 = 9b^2/(4a^2) = −a/3), we can see that the point [γ, 0, 1] satisfies the equation of the elliptic curve:

γ^3 + aγ + b = γ · γ^2 + aγ + b = (−3b/(2a)) · (−a/3) + a · (−3b/(2a)) + b = b/2 − 3b/2 + b = 0.

Thanks to the choice of γ, all three partial derivatives of F at the point [γ, 0, 1] are zero. □

In order to define a group operation on the points of an elliptic curve, it is useful to consider the curve in the projective extension of the plane (see 4.4.1), and we define a point

If each point is identified with the rotation that maps 1 to that point, then the composition of such rotations is always commutative. Composing a rotation from Z2 with a rotation from Z3 yields exactly all the rotations of Z6. Draw a diagram! This leads to the following isomorphism (using additive notation, as is common for residue classes):

[0]_6 ↦ ([0]_2, [0]_3), [1]_6 ↦ ([1]_2, [2]_3), [2]_6 ↦ ([0]_2, [1]_3), [3]_6 ↦ ([1]_2, [0]_3), [4]_6 ↦ ([0]_2, [2]_3), [5]_6 ↦ ([1]_2, [1]_3).

Similar constructions are available for finite commutative groups in complete generality.

12.4.8. Commutative groups. Any element a of a group G is contained in the minimal subgroup {. . . , a^{-2}, a^{-1}, e, a, a^2, a^3, . . .} which contains it. Clearly, this subgroup is commutative. If G is finite, then a^k = e for some positive integer k. The least positive integer k with this property is called the order of the element a in G.

A cyclic group G is one which is generated by one of its elements a in the above manner. If the order k of the generator is finite, the result is one of the groups Ck, known from the discussion of symmetries of plane figures. It follows directly from the definition that every cyclic group is isomorphic either to the group of integers Z (if it is infinite) or to one of the groups of residue classes Zk (if it is finite). These simple building stones suffice to create all finite commutative groups.

Theorem. Every finite commutative group G is isomorphic to a product of cyclic groups Ck. The orders of the components Ck are always powers of the prime divisors of the number of elements n = |G|, and this product decomposition is unique up to order. If n = p_1^{k_1} · · · p_r^{k_r} is the prime factorization of n, then the group Cn is isomorphic to the product

C_n = C_{p_1^{k_1}} × · · · × C_{p_r^{k_r}}.

Incomplete proof. We are going to prove only the second claim now and return to the first claim later, see 12.4.12, 12.4.13. For a simpler case, suppose n = pq with p coprime to q. Fix a generator a of the group Cn, a generator b of Cp, and a generator c of Cq. Define the mapping f : C_n → C_p × C_q by f(a^k) = (b^k, c^k). Since a^k · a^ℓ = a^{k+ℓ}, and similarly for b and c, it follows that f(a^k · a^ℓ) = (b^{k+ℓ}, c^{k+ℓ}) = (b^k, c^k) · (b^ℓ, c^ℓ), so the mapping f is a homomorphism. If the image is the identity, then k must be a multiple of p as well as of q.
O ∈ C as the direction (0, 1) (i.e., the point [0 : 1 : 0] in homogeneous coordinates). Then, the addition of two points A, B ∈ C is geometrically defined as the point −C, where C is the third intersection point of the line AB with the elliptic curve. If A = B, the role of the line AB is played by the tangent line to the elliptic curve at A.

12.J.2. Prove that the above definition correctly defines an operation on the points of an elliptic curve.

Solution. The intersections of a line with the elliptic curve are obtained as the roots of a cubic equation. If it has two real roots, corresponding to the points A and B, then it must have a third real root as well (the complex roots of a real cubic come in conjugate pairs), i.e., the line AB must have another intersection point with the curve. In the case of a tangent line, the point A corresponds to a double root, so there again exists another intersection point. As for the improper points (the last homogeneous coordinate is zero; they correspond to directions in the plane), the only improper point that belongs to the curve given by equation (1) is the point O = [0, 1, 0]. Addition with the point O means looking for the second intersection of the elliptic curve (besides the point A itself) with the line which goes through A and is parallel to the y-axis. The improper line z = 0 has a triple intersection point O with the curve, i.e., O + O = O. □

Remark. Thus, the operation is well-defined. Moreover, it follows directly from the definition that it is commutative, and it even follows from the above that O is a neutral element of the operation. However, the proof of associativity is far from trivial.

12.J.3. Define the above operation algebraically.

Solution. For any point A ∈ C, we define A + O = O + A = A. For a point A = [α, β, 1] ∈ C, the point B = [α, −β, 1] clearly lies on C as well, and we define A + B = O, i.e., B = −A. For points A = [α, β, 1] and B = [γ, δ, 1] on C with A ≠ −B, we set

k = (β − δ)/(α − γ) for A ≠ B,    k = (3α^2 + a)/(2β) for A = B,
σ = k^2 − α − γ,    τ = −β + k(α − σ).

Since p and q are coprime, k is a multiple of n, so f is injective. Moreover, the group Cn has the same number of elements as Cp × Cq, so f is an isomorphism. Finally, the proposition about the decomposition of cyclic groups of order k into smaller cyclic groups follows by induction on the number of distinct primes p_i in the factorization of n. □

Notice that C_{p^2} is never isomorphic to the product C_p × C_p: while C_{p^2} is generated by an element of order p^2, the highest order of an element in C_p × C_p is only p.

Since every finite commutative group is isomorphic to a product of cyclic groups, it is possible, for a given number of elements, to enumerate all commutative groups of that order up to isomorphism. For instance, there are only two commutative groups of order 12: C12 = C4 × C3 and C2 × C2 × C3 = C2 × C6. Notice similarly that if all elements (except the identity) of a finite commutative group G have order 2, then G has the form (C2)^n for an integer n. In particular, such a group G has 2^n elements. If the decomposition of G into cyclic groups contained a group C_{p^k} with p^k > 2, then G would contain elements of higher order.

12.4.9. Subgroups and cosets. Selecting any subgroup H of a group G gives further information about the structure of the whole group. A binary relation ∼_H on G can be defined as follows: a ∼_H b if and only if b^{-1} · a ∈ H. This relation expresses when two elements of G are "the same" up to multiplication by an element of H from the right.
It is easily verified that this relation is an equivalence: clearly, a^{-1} · a = e ∈ H, so it is reflexive. If b^{-1} · a = h ∈ H, then a^{-1} · b = (b^{-1} · a)^{-1} = h^{-1} ∈ H, so it is symmetric as well. Finally, if c^{-1} · b ∈ H and b^{-1} · a ∈ H, then c^{-1} · a = (c^{-1} · b) · (b^{-1} · a) ∈ H, so it is transitive, too.

It follows that G partitions into the left cosets of mutually equivalent elements with respect to the subgroup H. The coset corresponding to an element a is denoted a · H, and a · H = {a · h; h ∈ H}, since an element b is equivalent to a if and only if it can be expressed this way. The corresponding partition of G (i.e., the set of all left cosets) with respect to H is denoted G/H. Similarly, right cosets H · a can be defined. The corresponding equivalence relation is given by a ∼ b if and only if a · b^{-1} ∈ H. Hence, H \ G = {H · a; a ∈ G}.

Proposition. Let G be a group and H a subgroup of G. Then:
(1) The left cosets with respect to H coincide with the right cosets with respect to H if and only if for each a ∈ G,

Then, we define A + B = [σ, τ, 1]. We leave it to the reader to verify that this is indeed the operation that we defined geometrically. □

K. Gröbner bases

12.K.1. Is the basis g1 = x^2, g2 = xy + y^2 a Gröbner basis for the lexicographic ordering x > y? If not, find one.

Solution. Clearly, the leading monomials are LT(g1) = x^2, LT(g2) = xy, so the S-polynomial is equal to S(g1, g2) = y·g1 − x·g2 = −xy^2. By theorem 12.6.12, g1, g2 is a Gröbner basis if and only if the remainder in the multivariate division of this S-polynomial by the basis polynomials is zero. Performing this division (see 12.6.6), we obtain S(g1, g2) = −y·g2 + y^3. The remainder y^3 shows that g1, g2 do not form a Gröbner basis. By 12.6.13, in order to get one, we must add the remainder polynomial g3 = y^3 to g1, g2. Now, we calculate that S(g1, g3) = y^3·g1 − x^2·g3 = 0 and S(g2, g3) = y^2·g2 − x·g3 = y^4 = y·g3. Hence it follows by theorem 12.6.12 that g1, g2, g3 is already a Gröbner basis. □

12.K.2. Is the basis g1 = xy − 2y, g2 = y^2 − x^2 a Gröbner basis for the lexicographic ordering y > x? If not, find one.

Solution. Since LT(g1) = xy and LT(g2) = y^2, the corresponding S-polynomial is S(g1, g2) = y·g1 − x·g2 = x^3 − 2y^2 = −2g2 + x^3 − 2x^2. The leading term x^3 is a multiple of neither xy nor y^2, which means that g1, g2 do not form a Gröbner basis. We can obtain one by adding the polynomial g3 = x^3 − 2x^2. Then, we have S(g1, g3) = x^2·g1 − y·g3 = 0 and S(g2, g3) = x^3·g2 − y^2·g3 = 2y^2x^2 − x^5 = (4y + 2xy)g1 − (x^2 + 2x + 4)g3 + 8g2, which reduces to zero, so g1, g2, g3 is a Gröbner basis. □

12.K.3. Eliminate variables in the ideal I = ⟨x^2 + y^2 + z^2 − 1, x^2 + y^2 + z^2 − 2x, 2x − y − z⟩.

Solution. The variable elimination is obtained by finding a Gröbner basis with respect to the lexicographic monomial ordering. Let us denote the generating polynomials of I by g1, g2, g3, respectively. The reduction g2 = g1 + 1 − 2x yields the reduced polynomial f1 = 2x − 1. Now, we use this polynomial to reduce g3 = f1 + 1 − y − z to f2 = y + z − 1. Finally, we reduce g1, dividing it by f1 and f2, which leads to g1 = (x/2 + 1/4)f1 + y^2 + z^2 − 3/4 and y^2 + z^2 − 3/4 = (y − z + 1)f2 + 2z^2 − 2z + 1/4.

h ∈ H, a · h · a^{-1} ∈ H.
(2) Each coset (left or right) has the same cardinality as the subgroup H.

Proof. Both properties are direct consequences of the definition. In the first case, for any a ∈ G, h ∈ H, an element h′ ∈ H is required so that h · a = a · h′. This occurs if and only if a^{-1} · h · a = h′ ∈ H.
In the second case, if a · h = a · h′, then multiplication by a^{-1} from the left yields h = h′. □

As an immediate corollary of the above statement, we get the following extremely useful results:

12.4.10. Theorem. Let G be a finite group with n elements and H a subgroup of G. Then:
(1) the cardinality n = |G| is the product of the cardinality of H and the cardinality of G/H, i.e., |G| = |G/H| · |H|,
(2) the integer |H| divides n,
(3) if a ∈ G is of order k, then k divides n,
(4) for each a ∈ G, a^n = e,
(5) if n is prime, then G is isomorphic to the cyclic group Zn.

The second proposition is called Lagrange's theorem, and the fourth proposition is called Fermat's little theorem; special cases are discussed in the previous chapter on number theory.

Proof. Each left coset has exactly |H| elements, and different cosets are disjoint. Hence the first proposition follows, and the second is its direct corollary. Each element a ∈ G generates the cyclic subgroup {a, a^2, . . . , a^k = e}, and the order of this subgroup is exactly the order of a. Therefore, the order of a divides the number of elements of G. Since the order k of any element a divides n and a^k = e, also a^n = (a^k)^s = e, where n = ks. Finally, if n > 1 is prime, then there exists an element a ∈ G different from e. Its order k is an integer greater than one which divides n, so k must equal n. This means that all the elements of G are of the form a^ℓ for ℓ = 1, . . . , n. □

12.4.11. Normal subgroups and quotient groups. A subgroup H which satisfies a · h · a^{-1} ∈ H for all a ∈ G, h ∈ H, is called a normal subgroup. For normal subgroups, an operation on G/H can be defined by (a · H) · (b · H) = (a · b) · H. Choosing other representatives a · h, b · h′ leads to the same result: (a · h · b · h′) · H = ((a · b) · (b^{-1} · h · b) · h′) · H = (a · b) · H.

Hence, f3 = 8z^2 − 8z + 1. We see that we could make do with polynomial reductions alone and did not have to add any further polynomials. The basis of I with eliminated variables is I = ⟨2x − 1, y + z − 1, 8z^2 − 8z + 1⟩. □

12.K.4. Solve the following system of polynomial equations: x^2 y − z^3 = 0, 2xy − 4z = 1, z − y^2 = 0, x^3 − 4yz = 0.

Solution. Using appropriate software, we find out that the corresponding ideal ⟨x^2 y − z^3, 2xy − 4z − 1, z − y^2, x^3 − 4yz⟩ has the Gröbner basis ⟨1⟩ with respect to the lexicographic monomial ordering, which means that the system has no solution. □

12.K.5. Find the Gröbner basis of the variety in R^3 defined parametrically as x = 3u + 3uv^2 − u^3, y = 3v + 3u^2 v − v^3, z = 3u^2 − 3v^2. This is the so-called Enneper surface, depicted in the picture on page 1119.

Solution. Applying the elimination procedure (e.g., in the MAPLE system, using gbasis with the plex ordering), we obtain the corresponding implicit representation, i.e., an equation with a single polynomial of degree nine:

−59049z − 104976z^2 − 6561y^2 − 72900z^3 − 18954y^2 z − 23328z^4 + 32805z^2 x^2 + 14580z^3 x^2 + 3645z^4 x^2 − 1296y^4 z − 16767y^2 z^2 − 6156y^2 z^3 − 783y^2 z^4 + 39366z x^2 + 19683x^2 − 1296y^4 − 2430z^5 + 432z^6 + 108z^7 + 486z^5 x^2 − 432y^4 z^2 + 54y^2 z^5 + 27z^6 x^2 − 48y^4 z^3 + 15y^2 z^6 − 64y^6 − z^9 = 0. □
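Computations like those in 12.K.1–12.K.5 can be reproduced with any computer algebra system. A small sketch of ours using sympy's groebner (sympy returns the reduced basis, so the outputs below are what we expect, up to scaling):

    from sympy import groebner, symbols

    x, y, z = symbols('x y z')

    # 12.K.1: the reduced basis gains the remainder polynomial y^3
    print(groebner([x**2, x*y + y**2], x, y, order='lex'))
    # expected: GroebnerBasis([x**2, x*y + y**2, y**3], ...)

    # 12.K.3: elimination in lex order; cf. <2x - 1, y + z - 1, 8z^2 - 8z + 1>
    print(groebner([x**2 + y**2 + z**2 - 1, x**2 + y**2 + z**2 - 2*x, 2*x - y - z],
                   x, y, z, order='lex'))
    # expected: GroebnerBasis([2*x - 1, y + z - 1, 8*z**2 - 8*z + 1], ...)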
As we illustrate with the following simple exercise, Gröbner bases can also be used for solving integer optimization problems.

12.K.6. What is the minimum number of banknotes that are necessary to pay 77700 CZK? Solve this problem for three scenarios: First, assume that the banknotes at our disposal have the values 100 CZK, 200 CZK, 500 CZK, and 1000 CZK. Then, assume that there are also banknotes of value 2000 CZK. Finally, assume that there are no banknotes of value 2000 CZK, but there are banknotes of value 5000 CZK.

Solution. Let us denote the respective banknotes by the variables s, d, p, t, D, P. The banknotes to be used will be represented as a polynomial in these variables, where the exponent of each variable determines the number of the corresponding banknotes. For instance, if we decide to use only the 100 CZK banknotes, then the polynomial will be q = s^777. If we pay with ten 1000 CZK banknotes, ten 500 CZK banknotes,

Moreover, the cosets can be written as H · a · H, and the equation (H · a) · (b · H) = H · (a · b) · H is straightforward. On the other hand, the definition of the product on the cosets fails if H is not normal.

Clearly, this new operation on G/H satisfies all the group axioms: the identity is the group H itself (formally, it is the coset e · H that corresponds to the identity e of G), the inverse of a · H is a^{-1} · H, and the associativity is clear from the definition. This is called the quotient group G/H of G by the normal subgroup H. Of course, in commutative groups, every subgroup is normal. The subset nZ = {na; a ∈ Z} ⊆ Z is a subgroup of the integers, and the corresponding quotient group is the (additive) group Zn of residue classes.

It is clear from the definition that the kernel of every homomorphism is a normal subgroup. On the other hand, if a subgroup H ⊆ G is normal, then the mapping p : G → G/H, a ↦ a · H, is a surjective homomorphism whose kernel is H. It can be seen directly from the definition of the operation on G/H that p is a homomorphism, and it is clearly surjective. It follows that normal subgroups are precisely the kernels of homomorphisms. Moreover, for any group homomorphism f : G → K with kernel H = ker f, there is a well-defined homomorphism ˜f : G/H → K, ˜f(a · H) = f(a), which is injective. If H is any normal subgroup of G contained in ker f, the latter homomorphism is still well defined, but not necessarily injective.

There is a seemingly paradoxical example of a group homomorphism C* → C*, defined on the non-zero complex numbers by z ↦ z^k, where k is a fixed positive integer. Clearly, this is a surjective homomorphism, and its kernel is the set of the k-th roots of unity, i.e., the cyclic subgroup Zk. Reasoning as above, there is an isomorphism ˜f : C*/Zk → C* for any positive integer k. This example illustrates that in the case of infinite groups, calculations with cardinalities are not as intuitive as in the case of finite groups and theorem 12.4.10.

12.4.12. Exact sequences. A normal subgroup H of a group G yields the short exact sequence of groups

e → H → G → G/H → e,

where the arrows respectively correspond to the only homomorphism of the trivial group {e} into the group H, the inclusion ι of the subgroup H ⊆ G, the projection ν onto the quotient group G/H, and the only homomorphism of the group G/H onto the trivial group {e}. In each case, the image of one arrow is precisely the kernel of the following one. This is the definition of exactness of a sequence of homomorphisms.

and the remaining amount with 100 CZK banknotes, then the polynomial will be q = t^{10} p^{10} s^{627}. In the former case, the number of banknotes is 777; in the latter, it is 10 + 10 + 627 = 647. If we have only the banknotes s, d, p, t, then the ideal that describes the relations among the individual banknotes is I1 = ⟨s^2 − d, s^5 − p, s^{10} − t⟩.
In order to minimize the number of banknotes used, we compute the Gröbner basis with respect to the graded reverse lexicographic ordering (we want to eliminate the small banknotes): G1 = (p^2 − t, s^2 − d, d^3 − sp, sd^2 − p). Now, we take any polynomial that represents a given choice of banknotes. Reducing this polynomial with respect to the basis G1, we get a polynomial whose degree is minimal for our monomial ordering, and it is easy to show that it is the polynomial corresponding to the optimal choice. For instance, take q = s^777. Reduction with respect to G1 yields t^{77} p d. This means that the optimal choice is seventy-seven 1000 CZK banknotes, one 500 CZK banknote, and one 200 CZK banknote. Altogether, it is 79 banknotes.

In the second scenario, when we also have the banknote D, the ideal is I2 = ⟨s^2 − d, s^5 − p, s^{10} − t, s^{20} − D⟩ and its Gröbner basis is G2 = (t^2 − D, p^2 − t, s^2 − d, d^3 − sp, sd^2 − p). Reduction of q = s^777 with respect to G2 gives D^{38} t p d, so this time we pay with 41 banknotes. In the third scenario, we have I3 = ⟨s^2 − d, s^5 − p, s^{10} − t, s^{50} − P⟩ and G3 = (t^5 − P, p^2 − t, s^2 − d, d^3 − sp, sd^2 − p), and the reduction equals P^{15} t^2 p d. In this case, we need only 19 banknotes.

Of course, this simple problem can be solved quickly with common sense. However, the presented method of Gröbner bases gives a universal algorithm which can be applied automatically to higher amounts and other, more complicated cases. □

Gröbner bases have applications in robotics as well. In particular, they appear in inverse kinematics, where one must find how to set the individual joints of a robot so that it reaches a given position. This problem often leads to a system of non-linear equations which can be solved by finding a Gröbner basis, as in the following problem.

12.K.7. Consider a simple robot, as shown in the picture, which consists of three straight parts of length 1, connected by independent joints that allow arbitrary angles α, β, γ. We want this robot to grasp, from above, an object which lies on the ground at distance x. What values should the angles α, β, γ be set to? Draw the configuration of the robot for x = 1, 1.5, and √3.

Solution. Consider the natural coordinate system where the initial end of the robotic arm lies at the origin and the

If there exists a homomorphism σ : G/H → G such that ν ◦ σ = id_{G/H}, it is said that the exact sequence splits.

Lemma. Every split short exact sequence of commutative groups defines an isomorphism G → H × G/H.

Proof. Define a mapping f : H × G/H → G by f(a, b) = a · σ(b). Since the groups are commutative, f is a homomorphism: f(aa′, bb′) = aa′σ(b)σ(b′) = (aσ(b))(a′σ(b′)). If f(a, b) = e, then σ(b) = a^{-1} ∈ H, i.e., b = ν(σ(b)) is the identity in G/H. However, its image is then σ(b) = e, so a = e. Since the left and right cosets of commutative groups coincide, the mapping f is surjective. Hence f is an isomorphism. □

Now, the main idea of the proof of theorem 12.4.8 can be indicated. If it is known that every short exact sequence created by choosing a cyclic subgroup H of a finite commutative group G splits, then it is easy to proceed with the proof by induction. If G is a group of order n which is not cyclic, then an element a of order p, p < n, can be selected. The cyclic subgroup H generated by a can be found, as well as the splitting of the corresponding short exact sequence. This expresses the group G as the product of the selected cyclic subgroup H and the group G/H of order n/p.
The main technical point of the proof is the verification that in each finite commutative group there are elements of order p^r for appropriate powers of the primes p, and that the short exact sequences for these groups really split.

12.4.13. Return to finite Abelian groups. Below is a brief exposition of the complete proof of the classification theorem, broken into several steps. The following lemma provides the cyclic subgroups of prime order that are required.

Lemma (Claim 1). Let G be a finite Abelian group with n elements. If p is a prime which divides n, then there is an element g ∈ G of order p.¹¹

Proof. The claim is obvious if n is prime, i.e., G = Zp (as proved above). If n is not prime, proceed by induction on n. Clearly, G must have a proper non-trivial subgroup H, |H| = m < n. Either p | m or p | (n/m). In the former case, the claim follows directly from the induction hypothesis. Otherwise, assume p | (n/m) (recall that n/m is the order of G/H). Then there is an element g ∈ G such that the order of g · H in the quotient group G/H is p. Thus g^p ∈ H, and therefore the order of g in G divides p|H|. Since p is a prime, the order of g is ℓp for some integer ℓ. Hence the element g^ℓ has the required order p. □

¹¹ This is a special version of a more general result valid for all finite groups, called the Cauchy theorem. The formulation remains the same, with the word Abelian omitted.

ground corresponds to the x-axis. It follows from elementary trigonometry that the total x-range of the robot at angles α, β, γ is equal to

x = sin α + sin(α + β) + sin(α + β + γ).

Similarly, the range of the robot in the vertical direction is

y = cos α + cos(α + β) + cos(α + β + γ).

The condition of grasping the object from above is clearly equivalent to the condition α + β + γ = π, so the statement of the problem leads to the system

sin α + sin(α + β) = x,
cos α + cos(α + β) − 1 = 0.

In order to transform this into a system of polynomial equations, we introduce the new variables s1 = sin α, c1 = cos α, s2 = sin β, c2 = cos β, which of course satisfy s1^2 + c1^2 = 1 and s2^2 + c2^2 = 1. Using the basic trigonometric identities for sums of arguments, we obtain the following equivalent system of polynomial equations:

s1 + s1 c2 + c1 s2 − x = 0,
c1 + c1 c2 − s1 s2 − 1 = 0,
s1^2 + c1^2 − 1 = 0,
s2^2 + c2^2 − 1 = 0.

The Gröbner basis of the corresponding ideal can be found in a moment using appropriate software. For the graded reverse lexicographic ordering with s1 > c1 > s2 > c2, we get the basis

(2c2 + 1 − x^2, 2c1(1 + x^2) − 2s2 x − 1 − x^2, 2s1(1 + x^2) + 2s2 − x − x^3, 4s2^2 − 3 − 2x^2 + x^4),

and hence it is easy to calculate the values of the variables in dependence on x. For example, we can immediately see that c2 = (x^2 − 1)/2, i.e., β = arccos((x^2 − 1)/2). In particular, it is clear that the problem has no solution for |x| > √3. Specifically, for |x| < √3, there are 2 solutions, and for |x| = √3, there is one solution (α = π/3, β = 0, γ = 2π/3 for positive x; α = −π/3, β = 0, γ = 4π/3 for negative x). For x = 1, we get the solution α = 0, β = π/2, γ = π/2 and the degenerate solution α = π/2, β = −π/2, γ = π. The case x = −1 is similar. It is good to realize that for |x| < 1, one of the solutions always corresponds to a configuration where some parts of the robot intersect; for these values of x, there is only one realizable configuration. □

For any prime number p, G is called a p-group if each of its elements has order p^k for some k.
Claim 1 has an obvious corollary:

Lemma (Claim 2). A finite Abelian group G is a p-group if and only if its number of elements n is a power of p.

Proof. One implication follows straight from Lagrange's theorem, since all divisors of a power of the prime p are just smaller powers of p. On the other hand, if n is not a power of a prime, it has another prime divisor q, and so there is an element of order q by Claim 1. □

Now it can be shown that a given finite Abelian group G can always be decomposed into a product of p-groups.

Lemma (Claim 3). If G is a finite Abelian group, then it is isomorphic to a product of p-groups. This decomposition is unique up to order.

Proof. Consider a prime p dividing n = |G|. Define Gp to be the subgroup of all elements whose orders are powers of p, while G′p is the subgroup of all elements whose orders are not divisible by p (check yourself that these are indeed subgroups). By Claim 1, the subgroup Gp is not trivial. Next, consider an element g of order qp^ℓ with q not divisible by p. Then g^{p^ℓ} has order q, so this element belongs to G′p, while g^q ∈ Gp. The Bezout equality guarantees that there are integers r and s such that rp^ℓ + sq = 1. Hence g = g^{rp^ℓ} · g^{sq} is a decomposition of g into a product of elements of G′p and Gp. This verifies G ≃ Gp × G′p, and Gp is a p-group. This process can be repeated for the subgroup G′p and the remaining primes in the decomposition, which completes the proof. The uniqueness claim is obvious. □

It remains to consider the p-groups only. The next claim shows that the p-groups which are not cyclic must have more than one subgroup of order p.

Lemma (Claim 4). If a finite Abelian p-group G has just one subgroup H with |H| = p, then it is cyclic.

Proof. The case p = n = |G| is obvious. Proceed by induction on n. Assume H is the only subgroup of order p, consider σ : G → G, σ(g) = g^p, and write K = ker(σ). Then H ⊆ K, and since p is prime, all non-identity elements of K have order p. For any e ≠ g ∈ K, the cyclic group generated by g has order p and so coincides with H; consequently, H = K. If G ≠ K, then σ(G) is a non-trivial subgroup of G which must be isomorphic to G/K. By Claims 1 and 2, there is a subgroup of σ(G) of order p. This yields a subgroup of G, and by the assumption, it is again H. Finally, apply the induction hypothesis to the group σ(G) ≃ G/H, which therefore has to be cyclic. Choosing a generator g · H of the latter group, even in G the

Gröbner bases can also be used in software engineering when looking for loop invariants, which are needed for the verification of algorithms, as in the following problem.

12.K.8. Verify the correctness of the algorithm for the product of two integers a, b.

    (x, y, z) := (a, b, 0);
    while not (y = 0) do
        if y mod 2 = 0 then
            (x, y, z) := (2*x, y/2, z)
        else
            (x, y, z) := (2*x, (y-1)/2, x+z)
        end if
    end while
    return z

Solution. Let X, Y, Z denote the initial values of the variables x, y, z, respectively. Then, by definition, a polynomial p is an invariant of the loop if and only if we have p(x, y, z, X, Y, Z) = 0 after each iteration. Such a polynomial can be found using Gröbner bases as follows: Let f1, f2 denote the assignments of the then- and else-branches, respectively, i.e., f1(x, y, z) = (2x, y/2, z) and f2(x, y, z) = (2x, (y − 1)/2, x + z). For n iterations of the first one, we immediately obtain the explicit formula f1^n(x, y, z) = (2^n x, y/2^n, z). In order to transform this iterated function into a polynomial mapping, we introduce the new variables u := 2^n, v := 1/2^n.
Then, f1^n is given by the polynomial mapping F1 : x ↦ ux, y ↦ vy, z ↦ z,

cyclic subgroup generated by g must have a subgroup of order p (again by Claim 1). The uniqueness assumption ensures that this is again the subgroup H. Clearly, |G/H| = |G|/p is the smallest exponent with g^{|G|/p} ∈ H. At the same time, this power is not equal to the unit, since H ⊂ ⟨g⟩. Consequently, the order of g in G is bigger than |G|/p, and so the group G is cyclic. □

Finally, a splitting condition for the p-groups is proved, which provides the property discussed at the end of the previous paragraph on exact sequences. This completes the entire proof of the classification theorem.

Lemma (Claim 5). Let G be a finite Abelian p-group and let C be a cyclic subgroup of maximal order in G. Then G = C × L for some subgroup L.

Proof. If G is cyclic, then C = G, and of course G = C × L with L = {e}. Proceed by induction on n = |G|. Assume G is not cyclic. Then it contains more than one cyclic subgroup of order p, and the subgroup C contains one of them. Choose H to be another subgroup of order p which is not contained in C. Since p is prime, the intersection of H and C is trivial. Consequently, the quotient group (C × H)/H ⊂ G/H is isomorphic to C.

Now consider the induction step. The order of the cyclic subgroup (C × H)/H in G/H must be maximal, since the orders of the elements g · H in the quotient group are divisors of the orders of the elements g in the group G. By the induction hypothesis, G/H = (C × H)/H × K for some subgroup K ⊂ G/H. Clearly, the preimage of K under the quotient projection is a group L satisfying H ⊂ L ⊂ G. Now, the latter identification of G/H with the product implies G = (C · H) · L = C · (H · L) = C · L. At the same time, L ∩ (C · H) = H, and so L ∩ C = {e}. So G = C × L. □

The proof is complete up to the uniqueness claim. It is known already that the decomposition into p-groups is unique. Assume that a p-group G decomposes into two products of cyclic groups H1 × . . . × Hk and H′1 × . . . × H′ℓ with non-increasing orders of the Hi and the H′j. Then the orders of H1 and H′1 coincide, since these are the maximal orders of elements in G. By induction, all the orders coincide, and the work is complete.

The classification theorem is a special case of a more general result on finitely generated Abelian groups. In additive notation, if g1, . . . , gt are generators of the entire G, then all elements of G are of the form a1g1 + · · · + atgt with integer coefficients ai. The general theorem provides a severe restriction on the possible relations between such combinations. In fact, it says that all finitely generated Abelian groups are products of cyclic groups, hence G = Z^ℓ × Z_{p1} × . . . × Z_{pk}. This means that there is always a finite number of completely independent generators of G, and each of them generates a cyclic subgroup in

where the new variables satisfy uv = 1. Clearly, the invariant polynomial must lie in the ideal I1 = ⟨ux − X, vy − Y, z − Z, uv − 1⟩. In order to find such a polynomial, it suffices to eliminate the variables u and v, which can be done just with the Gröbner basis with respect to the graded reverse lexicographic ordering with u > v > x > y > z. This basis equals (xy − XY, z − Z, x − vX, y − uY). Hence F1(xy − XY) = xy − XY and F1(z − Z) = z − Z, and all the polynomials invariant with respect to any number n of applications of f1 are given by polynomials in xy − XY and z − Z. Now, we proceed similarly for f2.
For n iterations, we derive the formula f2^n(x, y, z) = (2^n x, (y + 1)/2^n − 1, (2^n − 1)x + z), and introducing the variables u and v, we get the equivalent polynomial mapping F2 : x ↦ ux, y ↦ v(y + 1) − 1, z ↦ (u − 1)x + z. The invariant polynomial for F2 can be obtained similarly as above, using the Gröbner basis of the corresponding ideal. However, we are interested in those polynomials which are invariant for both F1 and F2. Clearly, these must lie in the ideal I2 = ⟨F2(xy − XY), F2(z − Z), uv − 1⟩. Substituting for F2, we obtain I2 = ⟨uxv(y + 1) − ux − XY, (u − 1)x + z − Z, uv − 1⟩, and with the Gröbner basis of this ideal, we eliminate the variables u and v, thus finding the polynomial xy − XY + z − Z, which is invariant for both F1 and F2, so it is an invariant of the given loop. Since at the beginning we have X = a, Y = b, Z = 0, we see that xy − ab + z = 0 holds at every step of the algorithm. Since the loop terminates only when y = 0, we get that indeed z = ab. □

Now, we present several exercises where we use Gröbner bases to solve various polynomial systems. The primary goal will not be to find the Gröbner basis, but rather to solve the given system.

12.K.9. Using a Gröbner basis, solve the polynomial system

x^3 − 2xy = 0,
x^2 y + x − 2y^2 = 0.

Solution. Let us denote f1 := x^3 − 2xy, f2 := x^2 y + x − 2y^2. The basis (f1, f2) is not a Gröbner basis since, e.g., LM(y f1 − x f2) = x^2 ∉ ⟨x^3, x^2 y⟩ = ⟨LM(f1), LM(f2)⟩. Thus, we have to add the polynomial y f1 − x f2 = −x^2.

G. (Compare this to the description of finite-dimensional vector spaces via their bases, as discussed in chapter 2.)

12.4.14. Group actions. Groups can be considered as sets of transformations of a fixed set. All the transformations are invertible, and the set of transformations must be closed under composition. The idea is to work with a fixed group whose elements are represented as mappings on a fixed set, but the mappings corresponding to different elements of the group need not be different. For instance, the rotations around the origin by all possible angles correspond to the group of real numbers, while the rotation by 2π is the identity as a mapping. Formally, this situation is described as follows:

Group actions

A left action of a group G on a set S is a homomorphism of the group G to the subgroup of invertible mappings in the monoid S^S of all mappings S → S. Such a homomorphism can be viewed as a mapping φ : G × S → S which satisfies φ(a · b, x) = φ(a, φ(b, x)); hence the name "left action". Often, the notation a · x is used for the result of a ∈ G applied to a point x ∈ S (although this is a different dot than the operation inside the group). Then, the defining property reads (a · b) · x = a · (b · x).

The image of a point x ∈ S under the action of the entire group G is called the orbit S_x of x, i.e., S_x = {y = φ(a, x); a ∈ G}. For each point x ∈ S, we define the isotropy subgroup G_x ⊆ G of the action φ (also called the stabilizer subgroup): G_x = {a ∈ G; φ(a, x) = x}. If for every two points x, y ∈ S there is an a ∈ G such that φ(a, x) = y, then the action φ is said to be transitive.

Choosing any two points x, y ∈ S and a g ∈ G which maps x to y = g · x, the set {ghg^{-1}; h ∈ G_x} is clearly the isotropy subgroup G_y. In addition, the mapping h ↦ ghg^{-1} is a group homomorphism G_x → G_y. In the case of transitive actions, the entire space forms a single orbit and all the isotropy subgroups have the same cardinality.
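The orbits and isotropy subgroups just defined are easy to enumerate in small cases. A tiny Python sketch of ours for the symmetric group of {0, 1, 2} acting on this set; it also confirms the counting formula |G| = |G_x| · |S_x| proved in 12.4.15 below:

    from itertools import permutations

    S = [0, 1, 2]
    G = list(permutations(S))                 # the symmetric group of S; g[i] is the image of i
    x = 0
    orbit = {g[x] for g in G}                 # S_x, the orbit of x
    stabilizer = [g for g in G if g[x] == x]  # G_x, the isotropy subgroup of x
    print(len(G), len(stabilizer) * len(orbit))   # prints: 6 6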
As an example of a transitive action of a finite group, consider the evident action of the symmetric group of a fixed set X on X itself. The natural action of all invertible linear transformations on the non-zero elements of a vector space V is also transitive. However, if the entire space V is considered, then the zero vector forms an orbit of its own. The mentioned action of the additive group of real numbers by rotations around a fixed center O in the plane is not transitive: the orbit of each point M is the circle centered at O passing through M.

The resulting basis can be reduced using division of the polynomials f1, f2 by x^2. This leads to the basis (xy, x − 2y^2, x^2). Now, we can divide the first polynomial by the second one, with remainder 2y^3, and the third one by the second one, with remainder 4y^4. Thus, we get the basis (x − 2y^2, y^3), and this is a Gröbner basis: by the naive algorithm (see 12.6.13), it suffices to verify that the polynomial S(x − 2y^2, y^3) = y^3 (x − 2y^2) − x y^3 = −2y^5 gives zero remainder with respect to the basis (x − 2y^2, y^3), and this holds for any monomial ordering. Clearly, the only solution of the system is the point (0, 0). □

12.K.10. Consider the following system of polynomial equations:

x^2 yz^2 + x^2 y^2 + yz − xyz^2 − z^2 = 0,
x^2 y + z = 0,
xyz + z + 1 = 0.

Sort the monomials of the polynomials with respect to the lexicographic ordering with x > y > z, then divide the first polynomial by the second one and the third one, and use the result to solve the system in the real numbers.

Solution. We have

x^2 y^2 + x^2 yz^2 − xyz^2 + yz − z^2 = (y + z^2)(x^2 y + z) − z(xyz + z + 1) − z^3 + z.

Since the remainder −z^3 + z must vanish on the solutions, we get z = 0, ±1. Then, e.g.,

0 = z(x^2 y + z) − x(xyz + z + 1) = z^2 − zx − x.

Hence x = z^2/(z + 1), and we get from the third equation that y = −(1 + z)^2/z^3. The values z = 0 and z = −1 lead to contradictions, so the system is satisfied by the sole point (1/2, −4, 1). □

12.K.11. Using a Gröbner basis, solve the polynomial system

x^2 + y + z = 1,
x + y^2 + z = 1,
x + y + z^2 = 1.

Solution. Let us denote f1 := x + y + z^2 − 1. The division of x + y^2 + z − 1 by f1 gives f2 = y^2 − y − z^2 + z. The division of x^2 + y + z − 1 by f1 yields y^2 + 2yz^2 − y + z^4 − 2z^2 + z, and further division by f2 produces the remainder f3 = 2yz^2 + z^4 − z^2.

A typical example of a transitive action of a group G is the natural action on the set G/H of left cosets, for any subgroup H. It is defined by g · (aH) = (ga)H. This is the form of every transitive group action: for any transitive action G × S → S and a fixed point x ∈ S, the set S can be identified with the set G/G_x of left cosets via gG_x ↦ g · x. Clearly, this mapping is surjective, and the images g · x = h · x coincide if and only if h^{-1}g ∈ G_x, which is equivalent to gG_x = hG_x. Finally, note that this identification transforms the original action of G on S exactly into the mentioned action of G on G/G_x.

12.4.15. Theorem. Let an action of a finite group G on a finite set S be given. Then:
(1) for each point x ∈ S, |G| = |G_x| · |S_x|,
(2) (Burnside's lemma) if N denotes the number of orbits, then

|G| = (1/N) ∑_{g∈G} |S^g|,

where S^g = {x ∈ S; g · x = x} denotes the set of fixed points of the action of the element g.

Proof. Consider any point x ∈ S and its isotropy subgroup G_x ⊆ G. The same argument as the one at the end of the previous paragraph for transitive group actions can be applied to each action of the group G. This gives the mapping G/G_x → S_x, g · G_x ↦ g · x. If g · x = h · x, then clearly g^{-1}h ∈ G_x, so this mapping is injective.
Clearly, it is also surjective, which means that the cardinalities of these finite sets satisfy |G/G_x| = |S_x|. The first proposition follows, because |G| = |G/G_x| · |G_x|.

The second proposition is proved by counting the cardinality of the set of fixed points in two different ways: F = {(x, g) ∈ S × G; g · x = x} ⊆ S × G. Since these are finite sets, the elements of the Cartesian product S × G can be considered as the entries of a matrix (columns indexed by the elements of S, rows indexed by the elements of G). Summing this matrix up, either by rows or by columns, yields

|F| = ∑_{g∈G} |S^g| = ∑_{x∈S} |G_x|.

For the sake of clarity, choose one representative for each orbit (denote them x_1, . . . , x_N), and recall that the cardinalities of the isotropy groups of points lying in the same orbit coincide. Using the already proved statement (1), we easily obtain

|F| = ∑_{g∈G} |S^g| = ∑_{i=1}^{N} ∑_{x∈S_{x_i}} |G_x| = ∑_{i=1}^{N} |S_{x_i}| · |G_{x_i}| = N · |G|,

which completes the proof. □

However, (f1, f2, f3) is not a Gröbner basis yet. One is constructed by choosing g1 := f1, g2 := f2, g3 := f3 and considering the S-polynomial 2z^2 f2 − y f3 = −yz^4 − yz^2 − 2z^4 + 2z^3, whose division by f3 leaves (up to a scalar multiple) the remainder

g4 = z^6 − 4z^4 + 4z^3 − z^2 = z^2 (z − 1)^2 (z^2 + 2z − 1).

Now, (g1, g2, g3, g4) is a Gröbner basis, so we can solve the system by elimination. We get from g4 = 0 that z = 0, 1, −1 ± √2. Substituting this into g2 = 0 and g1 = 0 gives the solutions

(1, 0, 0), (0, 1, 0), (0, 0, 1), (−1 + √2, −1 + √2, −1 + √2), (−1 − √2, −1 − √2, −1 − √2). □

12.K.12. Solve the following system of polynomial equations in R:

x^2 − 2xz − 4, x^2 y^2 z + yz^3, 2xy^2 − 3z^3.

Solution. A basis suitable for variable elimination is the Gröbner basis with respect to the lexicographic monomial ordering with x > y > z. Using Maple, we find the basis

144z^5 + 35z^7 + 12z^9, 23z^6 + 12z^8 + 44yz^4, yz^3 + 3z^5 + 4zy^2, 9z^4 + 4y^3, −8y^2 − 6z^4 + 3xz^3, 2xy^2 − 3z^3, x^2 − 2xz − 4.

Since the first polynomial divided by z^5 is quadratic in z^2 and its discriminant satisfies 35^2 − 4 · 12 · 144 < 0, we must have z = 0. Substituting this into the other polynomials, we immediately obtain y = 0, x = ±2. □

12.K.13. Solve the following system of polynomial equations in R:

xy + yz − 1, yz + zw − 1, zw + wx − 1, wx + xy − 1.

Solution. In this case, it is a good idea to take the graded lexicographic ordering with w > x > y > z. Using the algorithm 12.6.13 or appropriate software, we find the corresponding Gröbner basis (x − z, w − y, 2yz − 1). Thus, the system is satisfied exactly by the points (1/(2t), t, 1/(2t), t) for arbitrary non-zero t ∈ R. □

It is recommended to think through how useful these propositions are for solving combinatorial problems, cf. 12.G.1 through 12.G.7.

Next, we come back to fields: having the complete classification of finite Boolean algebras and finite Abelian groups, we focus on finite fields. As a first step, we derive some elementary results on rings and univariate polynomial rings.

12.4.16. Rings, ideals, quotients. We have been meeting rings on our journey since the very beginning. Now we add some structural understanding of the objects and homomorphisms in this category. Similarly to groups, the inclusion of a subring K ⊂ L allows us to construct the residue class space L/K = {a + K; a ∈ L}, i.e., we consider only the (commutative) additive group structure.
If we want to equip L/K with a ring structure by applying the operations to representatives, we face no problems with the addition (since all subgroups are normal in the Abelian case), but we need the properties a · K ⊂ K and K · a ⊂ K for all a ∈ L to get the multiplication right, too. Indeed, (a + K) · (b + K) = a · b + a · K + K · b + K = a · b + K under this assumption. Such a subring K ⊂ L is called an ideal in L, and we call L/K the residue class ring, or quotient ring. The verification of all the ring properties for the addition and multiplication defined by means of representatives is straightforward.

For every homomorphism of rings f : K → L, the kernel ker f = {a ∈ K; f(a) = 0} is clearly an ideal, since all multiples of zero are zero. Similarly to groups, the ideals K ⊂ L come with the short exact sequences of rings

(1) 0 → K → L → L/K → 0.

By the very definition, an intersection of ideals is again an ideal. Thus we may talk about the ideals generated by subsets of L. In commutative rings L, we distinguish the ideals generated by a single element and call them the principal ideals, ⟨a⟩ = {b · a; b ∈ L}. If L is an integral domain in which every ideal is principal, we call L a principal ideal domain.

The integer ring Z is a principal ideal domain. Indeed, if K ⊂ Z is a non-zero ideal and n is the smallest positive integer in K, then for any m ∈ K, the division yields m = qn + r with the remainder r again in K. Thus r = 0 (it cannot be positive and smaller than n), and we have proved K = ⟨n⟩. If n = 1, then K = Z. The quotient ring Z/⟨n⟩ is the ring of residue classes Zn. Notice that there are no proper non-trivial ideals in any field K.

12.4.17. Characteristic of rings and fields. Consider a ring K. Its additive structure provides the cyclic subgroups generated by single elements a ∈ K, G_a = {a, 2a, 3a, . . .}. If there is a minimal positive integer n such that na = 0, then we say that the order of a is n. If there is such a (minimal) n valid for all a ∈ K, we say that the characteristic of the ring K is n.

12.K.14. Solve the following system of polynomial equations in R:

x^2 + yz + x, z^2 + xy + z, y^2 + xz + y.

Solution. Using the algorithm 12.6.13 or appropriate software, we find the corresponding Gröbner basis for the lexicographic monomial ordering with x > y > z, consisting of six polynomials:

z^2 + 3z^3 + 2z^4, z^2 + z^3 + 2yz^2 + 2yz^3, y − yz − z − z^2 − 2yz^2 + y^2, yz + z + z^2 + 2yz^2 + xz, z^2 + xy + z, x^2 + yz + x.

The roots of the first polynomial are z = 0, −1, −1/2. Discussing the individual cases, we find out that the system is satisfied exactly by the points (0, 0, 0), (−1, 0, 0), (0, −1, 0), (0, 0, −1), and (−1/2, −1/2, −1/2). □

If there are elements of infinite order, or if the orders are not bounded, we say that the characteristic of K is infinite. Clearly, the characteristic of Zn is n, while the scalars Z, Q, R, C all have characteristic ∞.

Lemma. If K is a finite ring with |K| = m elements, then the characteristic of K is a divisor of m.

Proof. The orders of the cyclic subgroups G_a must be finite (if ka = ℓa, then (k − ℓ)a = 0, and there are only finitely many elements in K), and they must be divisors of m (as known for all finite Abelian groups). □
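The quotient constructions above can be experimented with directly. A minimal hand-rolled sketch of ours (Python) of arithmetic in Z2[x]/⟨x^3 + x + 1⟩, an eight-element field of the kind appearing in 12.4.18 and 12.4.19 below; its characteristic 2 indeed divides 8, as the lemma requires:

    MOD = 0b1011                        # the polynomial x^3 + x + 1 as a bit mask

    def add(a, b):
        return a ^ b                    # coefficient-wise addition mod 2

    def mul(a, b):
        res = 0
        for i in range(3):              # schoolbook multiplication of polynomials
            if (b >> i) & 1:
                res ^= a << i
        for i in (4, 3):                # reduction modulo x^3 + x + 1
            if (res >> i) & 1:
                res ^= MOD << (i - 3)
        return res

    for a in range(1, 8):               # every non-zero element has a unique inverse,
        print(a, [b for b in range(1, 8) if mul(a, b) == 1])   # so the quotient is a field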
Let p be the characteristic of K. Suppose p = kℓ is not a prime (clearly, 1 < k ≤ ℓ < p). Then, looking at 1 ∈ K, 0 = p · 1 = (kℓ) · 1 = (k · 1) · (ℓ · 1), and this is impossible without divisors of zero. Thus the characteristic of K is a prime p. Next, for any 0 ≠ a ∈ K, all the elements a · b, b ∈ K, must be mutually different, since a · b1 = a · b2 would imply a · (b1 − b2) = 0, i.e., b1 = b2. In particular, the unit 1 is expressed as 1 = a · b for some b, and so b is the inverse of a. Thus K is a finite field. Finally, the additive subgroup G1 = {1, 2 · 1, . . . , p · 1} is closed under the multiplication, too (and is isomorphic to the field Z_p). Now, K can be viewed as a vector space over G1, and it must be generated by a finite number of elements a1, . . . , an (with much freedom in the choices, but always the same number, as known from elementary linear algebra). But then it is clear that |K| = pⁿ, as required. □
The theorem suggests that all finite fields are built from the simplest bricks — the so-called Galois fields GF(p) isomorphic to Z_p with p prime. They are often denoted F_p in the literature.
12.4.18. Univariate polynomials over fields. The ring of univariate polynomials K[x] over any field K is a principal ideal domain. Indeed, the same argument as for the integers applies. Given a non-zero ideal J, there must be a polynomial f of lowest degree in J. If the degree is zero, J = K[x]. Otherwise, each g ∈ J is expressed as g = qf + r by the division with remainder (cf. 12.3.6). Then r ∈ J, and it must be either of lower degree than f (which is impossible) or zero. Thus, J = ⟨f⟩.
Recall that each ideal J = ⟨f⟩ ⊂ K[x], deg f > 0, provides the quotient ring L = K[x]/J.
Lemma. The ring L = K[x]/⟨f⟩, d = deg f > 0, contains K as a subfield, and L is a vector space over K of dimension at most d.
Proof. Since d = deg f > 0, no non-zero constant polynomial can appear in ⟨f⟩. Thus a ↦ a + J embeds K into L as a subring. Finally, the elements 1 + J, x + J, . . . , x^{d−1} + J must generate L (recall that we deal with univariate polynomials and the division with remainder). □
Proposition. The ring L = K[x]/⟨f⟩, deg f > 0, is a field if and only if the polynomial f is irreducible over K (in particular, f then has no roots in K). If K is finite, then |L| = p^{nm}, where pⁿ = |K| and m is the dimension of the vector space L over K.
Proof. If f = gh with 0 < deg g, deg h < deg f, then (g + J) · (h + J) = 0 in L although both factors are non-zero, so L cannot be a field. Conversely, if f is irreducible and g ∉ J, then GCD(f, g) = 1, and Bézout's identity 1 = af + bg shows that b + J is the inverse of g + J. The formula for |L| follows from the lemma by elementary linear algebra. □
In the situation of the proposition, we call L the algebraic extension of the field K with respect to the polynomial f. Clearly, after finitely many iterations of this construction, the polynomial f becomes completely reducible, f = a(x − α1) . . . (x − αd). Thus the construction adds all the missing roots αi to K and extends the field this way.
12.4.19. Theorem (Classification of finite fields). For each prime number p and each power pⁿ, n ∈ N, there is a field with pⁿ elements. Such a field F is obtained as the extension of Z_p by all roots of the polynomial x^{pⁿ} − x. All fields with the same number of elements are isomorphic.
Proof. □
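As a concrete illustration of the extension construction (a sketch of ours, not from the text), the following Python fragment builds the four-element field GF(4) = Z_2[x]/⟨x² + x + 1⟩, representing each class a0 + a1·x by the pair (a0, a1), and checks that every non-zero element is invertible.

# Elements of GF(4) = Z_2[x]/<x^2 + x + 1> as pairs (a0, a1) meaning a0 + a1*x.
def add(p, q):
    return ((p[0] + q[0]) % 2, (p[1] + q[1]) % 2)

def mul(p, q):
    # (a0 + a1 x)(b0 + b1 x) = a0b0 + (a0b1 + a1b0) x + a1b1 x^2,
    # and x^2 = x + 1 modulo the irreducible polynomial x^2 + x + 1.
    c0 = p[0] * q[0] + p[1] * q[1]                 # a1b1 contributes 1 from x^2
    c1 = p[0] * q[1] + p[1] * q[0] + p[1] * q[1]   # and x from x^2
    return (c0 % 2, c1 % 2)

elems = [(a, b) for a in (0, 1) for b in (0, 1)]
one = (1, 0)
assert mul((0, 1), (1, 1)) == one      # x * (1 + x) = 1, so x is invertible
# Every non-zero element has a multiplicative inverse, hence this is a field.
for p in elems:
    if p != (0, 0):
        assert any(mul(p, q) == one for q in elems)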
12.4.20. The fundamental theorem of algebra. While it may happen that a polynomial over the real numbers has no roots, every polynomial over the complex numbers has a root. This is the statement of the so-called fundamental theorem of algebra, which is presented here with an (essentially) complete proof. By this result, every polynomial in C[x] has as many roots (counted with multiplicity) as its degree deg f = k. Hence it always admits a factorization of the form f(x) = b(x − a1)(x − a2) · · · (x − ak) for the complex roots ai and the appropriate leading coefficient b.
Theorem. The field C is algebraically closed, i.e. every polynomial of degree at least one has a root.
Proof. We are going to provide an elementary proof based on simple real analysis, in particular on the concept of continuity (the reader should be familiar with the techniques developed in Chapters 5 and 6). Suppose that f ∈ C[z] is a non-zero polynomial with no root, i.e. f(z) ≠ 0 for all z ∈ C. Consider the mapping φ : C → C, z ↦ f(z)/|f(z)|. Then φ maps the entire plane C into the unit circle K1 = {e^{it}; t ∈ R} ⊆ C. By the assumption that f(z) is never zero, this mapping is well-defined.
Next, we shall consider the restrictions of φ to the individual circles Kr ⊆ C with center at zero and radius r ≥ 0. We can parameterize these circles by the mappings ψr : R → Kr, t ↦ ψr(t) = r e^{it}. The composition κ : (0, ∞) × R → K1, κ(r, t) = φ ◦ ψr(t), is continuous in both r and t. Thus, for each r, there exists a continuous mapping αr : R → R which is uniquely given by the conditions 0 ≤ αr(0) < 2π and κ(r, t) = e^{iαr(t)}; moreover, the obtained mapping αr depends continuously on r. Altogether, there is a continuous mapping α : (0, ∞) × R → R, (r, t) ↦ αr(t). It follows from its construction that, for every r, (1/2π)(αr(2π) − αr(0)) = nr ∈ Z. Since α is continuous in r, the integer nr is a constant independent of r.
In order to complete the proof, it suffices to note that if f = a0 + · · · + ad z^d and ad ≠ 0, then for small values of r, αr behaves nearly as a constant mapping, while for large values of r, it behaves almost as if f = z^d. First, calculate nr for f = z^d; then make this statement precise.
The complex functions z ↦ z^d, z ↦ z^d/|z^d| can be expressed easily using the trigonometric form of the complex numbers z = r(cos θ + i sin θ):
z^d = r^d(cos dθ + i sin dθ) = r^d e^{idθ},  z^d/|z^d| = cos dθ + i sin dθ = e^{idθ}.
In this case, the mapping φ winds each circle Kr d times around the origin, followed by the central projection onto the unit circle. Then κ(r, t) = e^{idt}, and so αr(t) = dt, regardless of r. It follows that nr = d for the choice f = z^d. If f = a z^d is chosen, a ≠ 0, then there is no impact on the above result (verify this yourselves!).
Consider a general polynomial f = a0 + · · · + ad z^d with no root. Then a0 ≠ 0 (a0 = 0 implies that 0 is a root). For z ≠ 0, f(z)/(ad z^d) = 1 + (1/ad)(a0 z^{−d} + · · · + a_{d−1} z^{−1}). Hence, lim_{|z|→∞} f(z)/(ad z^d) = 1. Knowing this, calculate
$$\lim_{|z|\to\infty}\Bigl(\frac{f(z)}{|f(z)|} - \frac{a_d z^d}{|a_d z^d|}\Bigr) = \lim_{|z|\to\infty}\Bigl(\frac{f(z)}{a_d z^d}\cdot\frac{a_d z^d}{|a_d z^d|}\cdot\frac{|a_d z^d|}{|f(z)|} - \frac{a_d z^d}{|a_d z^d|}\Bigr) = 0.$$
Hence, nr = d for large values of r. A similar computation can be done for small values of r. Recall that a0 ≠ 0: f(z)/a0 = 1 + (1/a0)(a1 z + · · · + ad z^d). Thus, lim_{|z|→0} f(z)/a0 = 1. In addition, f(z)/|f(z)| = (f(z)/a0) · (a0/|a0|) · (|a0|/|f(z)|). Hence, lim_{|z|→0} f(z)/|f(z)| = a0/|a0|, i.e. nr = 0 for small values of r. Altogether, d = 0, i.e. a polynomial with no root must be constant. □
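The continuity argument can be probed numerically. The following sketch (ours; a numerical illustration, not a proof) samples the argument of f(z)/|f(z)| along a circle of radius r and accumulates its total change, recovering nr = deg f for large r and nr = 0 for small r.

import cmath
import math

def winding(f, r, steps=20000):
    # total change of arg f(z) along the circle |z| = r, divided by 2*pi
    total, prev = 0.0, cmath.phase(f(r))
    for k in range(1, steps + 1):
        z = r * cmath.exp(2j * math.pi * k / steps)
        cur = cmath.phase(f(z))
        d = cur - prev
        # unwrap the argument so jumps across the branch cut are corrected
        if d > math.pi:
            d -= 2 * math.pi
        elif d < -math.pi:
            d += 2 * math.pi
        total += d
        prev = cur
    return round(total / (2 * math.pi))

f = lambda z: z**3 - 2*z + 5      # degree 3; note f(0) = 5, so 0 is no root
print(winding(f, 100.0))   # 3 = the degree, for a circle enclosing all roots
print(winding(f, 0.01))    # 0, near the origin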
5. Coding theory
It is often needed to transfer data while guaranteeing that they are transferred correctly. In some cases, it suffices to recognize whether the data is unchanged; if not, it can be transferred again. In other cases, retransmission might not be reasonable, so the data needs to be recoverable after a small number of errors in the transfer. This is the goal of coding theory. Some of the algorithms are explored now. Notice that coding is quite different from encrypting: if no one but the addressee is meant to be able to read the message, then it should be encrypted. This topic is discussed briefly at the end of the previous chapter.
12.5.1. Codes. Data transfer is usually prone to errors. Since any information may be encoded as a sequence of bits (zeros and ones), we work with Z_2, although the theory may be developed over any finite field. Furthermore, the length of the message to be transferred is assumed to be known in advance. Thus, one transfers k-bit words, where k ∈ N is fixed. It is desired to detect potential errors and, if possible, to recover the original data. For this reason, further n − k bits are added to the k-bit word, where n is also fixed (and of course n > k). These are called (n, k)-codes. There are 2^k binary words of length k, and each should be mapped to one of the 2^n possible words of length n. For an (n, k)-code, there remain 2^n − 2^k = 2^k(2^{n−k} − 1) words which are not codewords (if such a word is received, then an error has occurred). Thus, even for a large value of k, only a few added bits provide much redundant information.
The simplest example is the parity check code. Having a message of length k, the codeword is created by adding a bit whose value is determined so that the total number of ones is even. This is an example of a (k + 1, k)-code. If an odd number of errors occurs during the transfer, then it is detected by this simple code. Every two codewords differ in at least two bits, but an error word differs from at least two codewords in only one bit. Therefore, this code is unable to recover the original message, even under the assumption that only one bit was changed. The following diagram illustrates all 2-bit words with the parity bit added; the codewords are marked with a bold dot.
[Diagram: the eight 3-bit words 000, 001, 010, 011, 100, 101, 110, 111 at the vertices of a cube; the codewords 000, 011, 101, 110 are marked with bold dots.]
Moreover, the parity check code is unable to detect the error of interchanging a pair of adjacent bits, which often happens.
12.5.2. Word distance. In the diagram of the parity check (3, 2)-code, each error word is at the “same” distance from three codewords – those which differ from it in exactly one bit. The other words are farther. Formally, this observation can be described by the following definition of distance:
Word distance
The Hamming distance of a pair of words (of equal length) is the number of bits in which they differ.
Consider words x, y, z such that x and y differ in r bits, and y and z differ in s bits. Then x and z differ in at most r + s bits, which verifies the triangle inequality for this distance. If the code is to detect errors in up to r bits, then the minimum distance between each pair of codewords must be at least r + 1. If the code is to recover errors in up to r bits, then there must exist only one codeword whose distance from the received word is at most r. Thus, the following propositions are verified:
Theorem. (1) A code reliably detects at most r errors if and only if the minimum Hamming distance of the codewords is at least r + 1.
(2) A code reliably detects and recovers at most r errors if and only if the minimum Hamming distance of the codewords is at least 2r + 1.
12.5.3. Construction of polynomial codes. For practical applications, the codewords should be constructed efficiently, so that they can be easily recognized among all the words. The parity check code is one example; another trivial possibility is to simply repeat the bits – for instance, the (3, 1)-code which triplicates each bit.
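A minimal sketch (ours) of the (k + 1, k) parity check code and of the Hamming distance; it confirms that a single bit flip is detected while a double flip escapes detection, exactly as discussed above.

def encode_parity(word):
    # append one bit so that the total number of ones is even
    return word + [sum(word) % 2]

def is_codeword(w):
    return sum(w) % 2 == 0

def hamming(u, v):
    # number of positions in which u and v differ
    return sum(a != b for a, b in zip(u, v))

u = encode_parity([1, 0, 1, 1])               # -> [1, 0, 1, 1, 1]
assert is_codeword(u)

single = u.copy(); single[2] ^= 1             # one error: detected
assert not is_codeword(single)

double = u.copy(); double[0] ^= 1; double[3] ^= 1   # two errors: undetected
assert is_codeword(double) and hamming(u, double) == 2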
A systematic way of code construction is to use division of polynomials. A message b0 b1 . . . b_{k−1} is understood as the polynomial m(x) = b0 + b1 x + · · · + b_{k−1} x^{k−1} over the field Z_2. The encoded message should be another polynomial v(x) of degree at most n − 1.
Polynomial codes
Let p(x) = a0 + · · · + a_{n−k} x^{n−k} ∈ Z_2[x] be a polynomial with coefficients a0 = 1, a_{n−k} = 1. The polynomial code generated by p(x) is the (n, k)-code whose codewords are the polynomials of degree less than n divisible by p(x). A message m(x) is encoded as v(x) = r(x) + x^{n−k} m(x), where r(x) is the remainder in the division of the polynomial x^{n−k} m(x) by p(x).
Check the claimed properties first. Writing x^{n−k} m(x) = q(x)p(x) + r(x), the definition of the codeword gives
v(x) = r(x) + x^{n−k} m(x) = r(x) + q(x)p(x) + r(x) = q(x)p(x),
since the sum of two identical polynomials is always zero over Z_2. Therefore, all codewords are divisible by p(x). On the other hand, if v(x) is divisible by p(x), the above calculation can be read from right to left (setting r(x) = x^{n−k} m(x) − q(x)p(x)), and so it is a codeword created by the above procedure. By the definition, the codeword is created by adding the n − k bits given by r(x) at the beginning of the word (the message itself is simply shifted to the right by that). It follows that the original message is contained in the polynomial v(x), and the decoding is easy.
Consider the two simple examples already mentioned. First, note that p(x) = 1 + x divides v(x) if and only if v(1) = 0. This occurs if and only if v(x) has an even number of non-zero coefficients. So the polynomial p(x) = 1 + x generates the parity check (n, n − 1)-code for any n ≥ 2. Similarly, it is easily verified that the polynomial p(x) = 1 + x + · · · + x^{n−1} generates the (n, 1)-code of n-fold bit repetition: dividing the polynomial b0 x^{n−1} by p(x) gives the remainder b0(1 + · · · + x^{n−2}), so the corresponding codeword is b0 p(x).
12.5.4. Error detection. Let e(x) denote the error vector, that is, the difference between the transmitted codeword v ∈ (Z_2)^n and the received data u: u(x) = v(x) + e(x). The error is detected if and only if the generator of the code (i.e. the polynomial p(x)) does not divide e(x). Therefore, polynomials in Z_2[x] which only rarely occur as divisors are of interest.
Definition. An irreducible polynomial p(x) ∈ Z_2[x] of degree m is said to be primitive if and only if p(x) divides 1 + x^k for k = 2^m − 1, but not for any smaller value of k.
Theorem. Let p(x) be a primitive polynomial of degree m and n ≤ 2^m − 1. Then the polynomial (n, n − m)-code generated by p(x) detects all simple and double errors.
Proof. If exactly one error occurs, then e(x) = x^i for some i, 0 ≤ i < n. Since p(x) is irreducible, it cannot have a root in Z_2; in particular, it cannot divide x^i, since the factorization of x^i is unique. It follows that every simple error is detected. If exactly two errors occur, then e(x) = x^i + x^j = x^i (1 + x^{j−i}) for some 0 ≤ i < j < n. p(x) does not divide any x^i, and since it is primitive, it does not divide 1 + x^{j−i} either, because j − i < 2^m − 1. At the same time, p(x) is irreducible, which means that it does not divide the product e(x) = x^i (1 + x^{j−i}), which completes the proof. □
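The encoding v(x) = r(x) + x^{n−k} m(x) is easy to mechanize. The sketch below (ours) represents polynomials over Z_2 as lists of bits (index i = coefficient of x^i), encodes a message with a generator polynomial, and detects a corrupted word by checking divisibility; it is run here on the (6, 3)-code generated by p(x) = 1 + x + x³, which reappears later in this section.

def poly_mod(a, p):
    # remainder of the division of a(x) by p(x) over Z_2
    a = a[:]
    for i in range(len(a) - 1, len(p) - 2, -1):
        if a[i]:                          # cancel the leading term with p(x)
            for j, c in enumerate(p):
                a[i - len(p) + 1 + j] ^= c
    return a[:len(p) - 1]

def encode(msg, p, n):
    k = n - len(p) + 1
    assert len(msg) == k
    shifted = [0] * (n - k) + msg         # x^(n-k) * m(x)
    r = poly_mod(shifted, p)              # remainder r(x)
    return r + msg                        # v(x) = r(x) + x^(n-k) m(x)

def is_codeword(w, p):
    return not any(poly_mod(w, p))

p = [1, 1, 0, 1]                          # p(x) = 1 + x + x^3
v = encode([0, 0, 1], p, 6)               # message m(x) = x^2
assert is_codeword(v, p)
corrupted = v[:]; corrupted[1] ^= 1
assert not is_codeword(corrupted, p)      # a single error is detected
print(v)                                  # [1, 1, 1, 0, 0, 1], i.e. 1+x+x^2+x^5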
12.5.5. Corollary. Let q(x) be a primitive polynomial of degree m and n ≤ 2^m − 1. Then the polynomial (n, n − m − 1)-code generated by the polynomial p(x) = q(x)(1 + x) detects all double errors as well as all errors with an odd number of bit flips.
Proof. The codewords generated by the chosen polynomial p(x) are divisible by both x + 1 and the primitive polynomial q(x). As verified above, the factor x + 1 is responsible for parity checking, i.e. all codewords have an even number of non-zero bits. This detects any odd number of errors. By the above theorem, the second factor is able to detect all double errors. □
The following table illustrates the power of the above theorems for several primitive polynomials of low degrees. For instance, the last row says that by adding only 11 redundant bits to a message of length 1012 and employing the polynomial (x + 1)p(x), all single, double, triple, and odd-numbered errors in the transfer can be detected. These are already quite large numbers, with over 300 decimal digits.
primitive polynomial p(x)      redundant bits   codeword length
1 + x                                1                  1
1 + x + x²                           2                  3
1 + x + x³                           3                  7
1 + x + x⁴                           4                 15
1 + x² + x⁵                          5                 31
1 + x + x⁶                           6                 63
1 + x³ + x⁷                          7                127
1 + x² + x³ + x⁴ + x⁸                8                255
1 + x⁴ + x⁹                          9                511
1 + x³ + x¹⁰                        10               1023
Note that quite strong results on divisibility and the decomposition of polynomials, derived in the second part of this chapter, are used here. But tools which would assist in constructing primitive polynomials have not been mentioned. Such tools come from the theory of finite fields. The name "primitive" reflects the connection to the primitive elements in the Galois fields GF(2^m). This theory also provides a convenient way of applying the Euclidean division, that is, of verifying whether or not the received word is a codeword, using the delayed registers. This is a simple circuit with as many elements as the degree of the polynomial.¹²
¹² More about the beautiful theory and its connection with codes can be found in the book Gilbert, W., Nicholson, K., Modern Algebra and its Applications, John Wiley & Sons, 2nd edition, 2003, 330+xvii pp., ISBN 0-471-41451-4.
12.5.6. Linear codes. Polynomial codes can also be described using elementary matrix calculus. Recall that when working over the field Z_2, caution is required when applying the results of elementary linear algebra: over the real numbers, the property that v = −v implies v = 0 is often used, and this is not available now. However, the basic definition of vector spaces, the existence of bases, and the descriptions of linear mappings by matrices are still valid. It is useful to recall the general theory and its applicability here.
Start with a more general definition of codes, which only requires linear dependency of the codeword on the original message:
Linear codes
Any injective linear mapping g : (Z_2)^k → (Z_2)^n is a linear code. The n-by-k matrix G that corresponds to this mapping (in the canonical bases) is called the generating matrix of the code. For each message v, the corresponding codeword is given by u = G · v.
Theorem. Every polynomial (n, k)-code is a linear code.
Proof. Use elementary properties of the Euclidean division. Apply the assignment of the polynomial v(x) = r(x) + x^{n−k} m(x), determined by the original message m(x), to the sum of two messages m(x) = m1(x) + m2(x). The remainder in the division of x^{n−k}(m1(x) + m2(x)) by p(x) is, by uniqueness, given as the sum r1(x) + r2(x) of the remainders for the individual messages. It follows that v(x) = r1(x) + r2(x) + x^{n−k}(m1(x) + m2(x)), which is the desired additivity. Since the only non-zero scalar in Z_2 is 1, the linearity of the mapping of the message m(x) to the longer codeword v(x) is proved. Moreover, this mapping is clearly injective, since the original message m(x) is simply copied beyond the redundant bits. □
For instance, consider the (6, 3)-code generated by the polynomial p(x) = 1 + x + x³, for encoding 3-bit words.
Evaluate it on the individual basis vectors mi(x) = x^i, i = 0, 1, 2, to get v0 = (1 + x) + x³, v1 = (x + x²) + x⁴, v2 = (1 + x + x²) + x⁵. It follows that the generating matrix of this (6, 3)-code is
$$G = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
Polynomial codes always copy the original message beyond the redundant bits. So the generating matrix can be split into two blocks P and Q consisting of n − k and k rows, respectively; the block Q then equals the identity matrix I_k.
12.5.7. Theorem. Let g : (Z_2)^k → (Z_2)^n be a linear code with generating matrix G, written in blocks as
$$G = \begin{pmatrix} P \\ I_k \end{pmatrix}.$$
Then the mapping h : (Z_2)^n → (Z_2)^{n−k} with the matrix H = (I_{n−k} P) has the following properties:
(1) Ker h = Im g;
(2) a received word u is a codeword if and only if H · u = 0.
Proof. The composition h ◦ g : (Z_2)^k → (Z_2)^{n−k} is given by the product of matrices (computing over Z_2):
$$H \cdot G = \begin{pmatrix} I_{n-k} & P \end{pmatrix}\cdot\begin{pmatrix} P \\ I_k \end{pmatrix} = P + P = 0.$$
Hence it is proved that Im g ⊆ Ker h. Since the first n − k columns of H are the basis vectors of (Z_2)^{n−k}, the image Im h has the maximum dimension n − k, which means that this image contains 2^{n−k} vectors. Vector spaces over Z_2 are finite commutative groups, so the formula relating the orders of subgroups and quotient groups from subsection 12.4.10 can be used, thus obtaining |Ker h| · |Im h| = |(Z_2)^n| = 2^n. Therefore, the number of vectors in Ker h is equal to 2^n · 2^{k−n} = 2^k. In order to complete the proof of the first proposition, it suffices to note that the image Im g also has 2^k elements. The second proposition is a trivial corollary of the first one. □
The matrix H from the theorem is called the parity check matrix of the corresponding linear (n, k)-code. For instance, the matrix H = (1 1 1) is the parity check matrix for the parity check (3, 2)-code, encoding 2-bit words. It is easily obtained from the matrix
$$G = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}$$
that generates this code. For the (6, 3)-code mentioned above, the parity check matrix is
$$H = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix}.$$
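A quick numerical confirmation (ours, assuming numpy is available) that the two matrices above fit together, i.e. H · G = 0 over Z_2, so that the image of the code lies in the kernel of the parity check:

import numpy as np

G = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
H = np.array([[1, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 0, 1, 1]])

assert not ((H @ G) % 2).any()     # H . G = 0 over Z_2

msg = np.array([0, 0, 1])
u = (G @ msg) % 2                  # the codeword of the message 001
assert not ((H @ u) % 2).any()     # the syndrome of a codeword is zero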
12.5.8. Error-correcting codes. As seen, transferring a message u gives the result v = u + e. Over Z_2, this is equivalent to e = u + v. It follows that if the error e to be detected is fixed, all the received words determined by the correct codewords u fill one of the cosets in the quotient space (Z_2)^n/V, where V ⊆ (Z_2)^n is the vector subspace of the codewords. The mapping h : (Z_2)^n → (Z_2)^{n−k} corresponding to the parity check matrix has V as its kernel. Therefore, it induces an injective linear mapping h̃ : (Z_2)^n/V → (Z_2)^{n−k}. Clearly, the value h̃(v + V) on the coset generated by v is determined uniquely by the value H · v.
Syndromes
The expression H · v, where H is the parity check matrix of the considered linear code, is called the syndrome of v.
The following claim is a direct corollary of the construction and the above observations:
Theorem. Two words are in the same class u + V if and only if they have the same syndrome.
It follows that self-correcting codes can be constructed by choosing, for every syndrome, the element of the corresponding coset which is most likely to be the sent codeword. Naturally, when choosing the code, it is desirable to maximize the probability that it can correct single errors (and possibly even more errors).
Try it on the example of the (6, 3)-code for which the matrices G and H are already computed. Build the table of all syndromes and the corresponding words. The syndrome 000 is possessed exactly by the codewords. All words with a given different syndrome are obtained by choosing one of them and adding all the codewords. The following two tables display the syndromes in the first rows; the second rows then display the vector which has the least number of ones among the vectors of the corresponding coset. In almost all cases, there is just a single one there; only in the last column of the second table there are two ones, and the element is chosen where the ones are adjacent (because, for instance, multiple errors are more likely to be adjacent).
000      100      010      001
000000   100000   010000   001000
110100   010100   100100   111100
011010   111010   001010   010010
111001   011001   101001   110001
101110   001110   111110   100110
001101   101101   011101   000101
100011   000011   110011   101011
010111   110111   000111   011111

110      011      111      101
000100   000010   000001   000110
110000   110110   110101   110010
011110   011000   011011   011100
111101   111011   111000   111111
101010   101100   101111   101000
001001   001111   001100   001011
100111   100001   100010   100101
010011   010101   010110   010001
All the columns in the tables are affine subspaces whose modelling vector space is always the first column of the first table. This is because the code is linear, so the set of all codewords forms a vector space, and the individual cosets of the quotient space are consequently affine subspaces. In particular, the difference of each pair of words in the same column is a codeword. The words in the second rows are the leading representatives of the cosets (affine spaces) that correspond to the given syndromes. These are the words with the least number of ones in their column. They correspond to the least number of bit flips which must be made to any word in the given column in order to get a codeword.
For instance, if the word 111101 is received, compute that its syndrome is 110. The leading representative of the coset of this syndrome is 000100. Subtract it from the received word to obtain the codeword 111001. This is the codeword with the least Hamming distance from the received word. So the original message is most likely to be 001.
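The decoding procedure just described is mechanical: precompute a leading representative for each syndrome, then subtract it from the received word. A sketch of ours for the same (6, 3)-code (assuming numpy; ties between representatives of equal weight are resolved arbitrarily here):

from itertools import product
import numpy as np

H = np.array([[1, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 0, 1, 1]])

def syndrome(w):
    return tuple((H @ w) % 2)

# leading representative = a word of minimal weight with the given syndrome
leader = {}
for bits in sorted(product((0, 1), repeat=6), key=sum):
    leader.setdefault(syndrome(np.array(bits)), np.array(bits))

received = np.array([1, 1, 1, 1, 0, 1])
corrected = (received + leader[syndrome(received)]) % 2
print(corrected)       # [1 1 1 0 0 1], the codeword nearest to 111101
print(corrected[3:])   # the decoded message: [0 0 1]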
6. Systems of polynomial equations
In practical problems, objects or actions are often encountered which are described in terms of polynomials or systems of polynomial equations. For instance, the set of points in R³ defined by the two equations x² + y² − 1 = 0 and z = 0 is the circle which is centered at (0, 0, 0), has radius 1, and lies in the xy-plane. Similarly, the equations xz = 0 and yz = 0, considered in R³, define the union of the line x = 0, y = 0 and the plane z = 0. Notice that we have to specify the space carefully, since x² + y² = 1 defines a circle in R², but a cylinder if viewed in R³. Deciding whether or not a given point lies within a given body, finding extrema over algebraically defined subsets of multidimensional spaces, analyzing movements of parts of some machine, etc., are examples of such problems.
12.6.1. Affine varieties. For the sake of simplicity (existence of roots of polynomials), we work mainly with the field of complex numbers. Some ideas are extended to the case of a general field K. An affine n-dimensional space over a field K is understood to be K^n = K × · · · × K (n factors) with the standard affine structure (see the beginning of Chapter 4). As seen, a polynomial f = ∑_α a_α x^α ∈ K[x1, . . . , xn] can be viewed naturally as a mapping f : K^n → K, defined by f(u1, . . . , un) := ∑_α a_α u^α, where u^α = u1^{α1} · · · un^{αn}.
In dimension n = 1, the equation f(x) = 0 describes only finitely many points of K. In higher dimensions, the similar equation f(x1, . . . , xn) = 0 describes subsets similar to curves in the plane or surfaces in the space. However, they may be of quite complicated and self-intersecting shapes. For instance, the set given by the equation (x² + y²)³ − 4x²y² = 0 looks like a quatrefoil (see the illustration at the beginning of part J in the other column). Another illustration of a two-dimensional surface is given by Whitney's umbrella x² − y²z = 0, which, besides the part shown in the diagram, also includes the line {x = 0, y = 0}. The diagram was drawn using the parametric description x = uv, y = v, z = u², whence the implicit description x² − y²z = 0 is easily guessed.
In the following illustration, there is the Enneper surface with the parametrization x = 3u + 3uv² − u³, y = 3v + 3u²v − v³, z = 3u² − 3v². It is hard to imagine how to obtain the implicit description with this parametrization in hand. Nevertheless, there is an algorithm to eliminate the variables u and v from these three equations. Quite a complex theory needs to be developed for that. As usual, begin by formalizing the objects of interest.
Affine varieties
Let f1, . . . , fs ∈ K[x1, . . . , xn]. The affine variety in K^n corresponding to the set of polynomials f1, . . . , fs is the set
V(f1, . . . , fs) = {(a1, . . . , an) ∈ K^n; fi(a1, . . . , an) = 0, i = 1, . . . , s}.
Affine varieties include conic sections, quadrics, and hyperquadrics, both singular and regular. Many curves and surfaces can be easily described as affine varieties. The variety corresponding to a set of polynomials is the intersection of the varieties corresponding to the individual polynomials. For instance, V(x² + y² − 1, z) ⊂ R³ is the circle which is centered at (0, 0, 0), has radius 1, and lies in the xy-plane. Similarly, V(xz, yz) ⊂ R³ is the union of the line x = 0, y = 0 and the plane z = 0, since it is exactly at the points of these two objects that both the polynomials xz, yz vanish.
These examples illustrate that it is not easy to deal with the concept of dimension. Is the mentioned line, added to the plane, enough for the variety to be considered three-dimensional, or should one keep considering it two-dimensional with a certain anomaly?
Verify the following straightforward proposition:
Theorem. Let V = V(f1, . . . , fs) and W = V(g1, . . . , gt) ⊆ K^n be affine varieties. Then V ∪ W and V ∩ W are also affine varieties, namely
V ∩ W = V(f1, . . . , fs, g1, . . . , gt),  V ∪ W = V(fi gj; 1 ≤ i ≤ s, 1 ≤ j ≤ t).
In the following subsections, some questions which arise in the context of varieties are answered:
Q1. Is the set V(f1, . . . , fs) empty?
Q2. Is the set V(f1, . . . , fs) finite?
Q3. How to understand the concept of dimension for varieties?
All of these problems can be “reasonably” solved for varieties over the complex numbers (as well as over any algebraically closed field); it is more difficult for the real numbers and nearly impossible for general fields.
For instance, over the rational numbers, the question whether V(x^n + y^n − z^n) = ∅ leads to the well-known Fermat's last theorem, mentioned many times in Chapter 11.
12.6.2. Parametrization. For some purely practical operations with varieties, it is convenient to use the implicit representation (the one used so far). For instance, deciding whether a given point lies in a given variety, or inside the space enclosed by it, is quite easy using the implicit description. On the other hand, the parametric description may also come in handy in many situations (for example, it was used to draw the diagrams above). The variety V(x + y + z − 1, x + 2y − z − 3) is a line (the intersection of two planes). If the system x + y + z − 1 = 0, x + 2y − z − 3 = 0 is solved, the parametric description of this line is immediate: x = −1 − 3t, y = 2 + 2t, z = t, as known from affine geometry. One needs to be more careful in general:
Rational parametrization
Definition. A rational parametric representation of a variety V(f1, . . . , fr) ⊆ K^n is a set of rational functions r1, . . . , rn ∈ K(t1, . . . , ts) such that:
• the entire image of the mapping r = (r1, . . . , rn) is contained in V(f1, . . . , fr);
• V(f1, . . . , fr) is the minimal affine variety which contains these points (x1, . . . , xn).
Note that the parametrization is not required to describe all the points of the variety. This is important, as seen in a simple example of a parametrization of a circle in the plane:
x = 2t/(1 + t²),  y = (−1 + t²)/(1 + t²),
which can be obtained using the stereographic projection. (Verify this in detail!) Note that this parametrization describes all points except for the point (0, 1), from which we project, since this point is not reachable for any value of the parameter t. This is nobody's fault; it follows from the different topological properties of the circle and the line that there exists no global bijective rational parametrization.
In this connection, two more questions arise:
Q4. Does there exist a parametrization of a given variety, and how to find it?
Q5. Is there an implicit description of a parametrically defined variety?
The general answer to Q4 is negative. In fact, most affine varieties cannot be parametrized; or at least, there is no algorithm for finding a parametrization from the implicit description. It is clear at first sight that a given variety may admit many implicit and parametric descriptions. In the case of implicit descriptions, the variety is given by several “generating” polynomials, and there is clearly much freedom in their choice. Once a parametrization is found, it can be composed with any rational bijection on the parameters in order to obtain another one.
12.6.3. Ideals. In order to avoid the dependence on the chosen equations that define a variety, consider all consequences of the given equations. This leads to the following algebraic concept of subsets in rings (which play a role similar to that of normal subgroups):
Ideals
Definition. Let K be a commutative ring. A subset I ⊆ K is called an ideal if and only if 0 ∈ I and
f, g ∈ I ⟹ f + g ∈ I;  f ∈ I, h ∈ K ⟹ f · h ∈ I.
Since the definition contains only universal quantifiers (the properties are required for all elements in K or I), the intersection of two ideals is also an ideal. Consequently, ideals generated by subsets can be considered. Use the notation I = ⟨a1, . . . , an⟩. It is easy to prove that such an ideal is I = {∑_i a_i b_i; b_i ∈ K}, where only finite sums are considered.
(Check that this is the intersection of all ideals containing the set of generators!) The set of generators may be infinite, too. If there are only finitely many generators, the ideal is said to be finitely generated.
It is easy to verify that each variety defines an ideal in the ring of polynomials in the following way:
The ideal of a variety
For a variety V = V(f1, . . . , fs), set
I(V) := {f ∈ K[x1, . . . , xn]; f(a1, . . . , an) = 0 for all (a1, . . . , an) ∈ V}.
Lemma. Let f1, . . . , fs, g1, . . . , gt ∈ K[x1, . . . , xn] be polynomials. Then:
(1) if ⟨f1, . . . , fs⟩ = ⟨g1, . . . , gt⟩, then V(f1, . . . , fs) = V(g1, . . . , gt);
(2) I(V) is an ideal, and ⟨f1, . . . , fs⟩ ⊆ I(V).
Proof. If a point a = (a1, . . . , an) lies in the variety V(f1, . . . , fs), then any polynomial of the form f = h1 f1 + · · · + hs fs (i.e. any member of the ideal I = ⟨f1, . . . , fs⟩) vanishes at a. In particular, this means that all the polynomials gi vanish at a. Hence V(f1, . . . , fs) ⊆ V(g1, . . . , gt). The other inclusion is proved similarly. In order to verify the second proposition, choose g, g′ ∈ I(V) and h ∈ K[x1, . . . , xn]. Then, for any point a ∈ V, (gh)(a) = 0 and (g + g′)(a) = 0, so gh ∈ I(V) and g + g′ ∈ I(V). Hence I(V) is an ideal in K[x1, . . . , xn]. Moreover, any polynomial f = h1 f1 + · · · + hs fs ∈ ⟨f1, . . . , fs⟩ satisfies f(a) = 0 for every a ∈ V, which proves the desired inclusion. □
The simplest examples are trivial varieties – a single point and the entire affine space:
I({(0, 0, . . . , 0)}) = ⟨x1, . . . , xn⟩,  I(K^n) = {0}
for any infinite field K. The other inclusion in the second part of the lemma does not hold in general. For instance, the variety V(x², y²) contains only the single point (0, 0). This means that I(V) = ⟨x, y⟩ ⊃ ⟨x², y²⟩. If V, W ⊆ K^n are varieties, then V ⊆ W ⟹ I(V) ⊇ I(W). In other words, a polynomial which vanishes at each point of a given variety clearly vanishes at each point of any of the variety's subsets.
Now, further natural problems can be formulated:
Q6. Is every ideal I ⊆ K[x1, . . . , xn] finitely generated?
Q7. Is there an algorithm which decides whether f ∈ ⟨f1, . . . , fs⟩?
Q8. What is the precise relation between ⟨f1, . . . , fs⟩ and I(V(f1, . . . , fs))?
12.6.4. Dimension 1. Consider first univariate polynomials f = a0 x^n + a1 x^{n−1} + · · · + an, where a0 ≠ 0. The leading term of such a polynomial is defined to be LT(f) := a0 x^n. Clearly, deg f ≤ deg g ⟺ LT(f) | LT(g).
Let K be a field and g a non-zero polynomial. Every polynomial f ∈ K[x] can be written in a unique way as f = q · g + r, where r = 0 or deg r < deg g. In fact, the quotient q and the remainder r can be computed by the following algorithm:
(1) q := 0, r := f
(2) while r ≠ 0 ∧ LT(g) | LT(r):
(a) q := q + LT(r)/LT(g)
(b) r := r − (LT(r)/LT(g)) · g
When checking the loop condition, the invariant f = q · g + r holds, so the algorithm answers correctly as soon as the loop condition becomes false. Since the degree of r decreases in each step, the algorithm eventually terminates.
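The algorithm transcribes directly into code. A small Python version of ours (dense coefficient lists over the rationals, index = power of x):

def degree(f):
    return len(f) - 1

def divide(f, g):
    # returns (q, r) with f = q*g + r and deg r < deg g;
    # polynomials are coefficient lists, f[i] = coefficient of x^i
    q = [0.0] * max(len(f) - len(g) + 1, 1)
    r = f[:]
    while any(r) and degree(r) >= degree(g):
        shift = degree(r) - degree(g)
        coef = r[-1] / g[-1]            # LT(r)/LT(g)
        q[shift] += coef
        for i, c in enumerate(g):       # r := r - (LT(r)/LT(g)) * g
            r[i + shift] -= coef * c
        r.pop()                         # the leading term has cancelled
    return q, r

# (x^3 - 1) / (x - 1) = x^2 + x + 1 with remainder 0
q, r = divide([-1.0, 0.0, 0.0, 1.0], [-1.0, 1.0])
print(q, r)   # [1.0, 1.0, 1.0] [0.0]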
Corollary. Let K be a field. Then every ideal in the polynomial ring K[x] is of the form ⟨f⟩.
Proof. Consider an ideal I ⊆ K[x]. If I = {0}, then the ideal is generated by the zero polynomial. If I contains a non-zero polynomial f, then choose one with the lowest degree. Clearly ⟨f⟩ ⊆ I. For any polynomial g ∈ I, consider the Euclidean division of g by f, i.e. g = qf + r. Clearly, qf ∈ I, which means that r ∈ I as well. However, the degree of f is as small as possible, so r = 0. Therefore, g is a multiple of f, and I = ⟨f⟩. □
An ideal which is generated by a single element is called a principal ideal. A ring in which every ideal is principal, as in the corollary, is called a principal ideal domain.
Recall the Euclidean algorithm for the greatest common divisor h = GCD(f, g) of polynomials f and g (the variable h contains the desired greatest common divisor when the algorithm terminates):
(1) h := f, s := g
(2) while s ≠ 0:
(a) r := h mod s
(b) h := s
(c) s := r
To see that the algorithm is correct, let f = q · g + r and h = GCD(f, g). Then h divides both g and r; conversely, every p ∈ K[x] which divides both g and r also divides f, and hence divides h. Therefore, GCD(f, g) = GCD(g, r). Trivially, GCD(h, 0) = h, so the algorithm computes GCD(f, g) correctly. Since the degree of r is strictly decreasing, the algorithm eventually terminates.
It follows that each pair of polynomials has a greatest common divisor, and it is unique up to multiplication by a scalar. Indeed, if two polynomials are greatest common divisors of a given pair of polynomials, then they must divide each other, which is, for polynomials, possible only in the mentioned case. The greatest common divisor of more polynomials is defined recursively: if s > 2, then GCD(f1, . . . , fs) := GCD(f1, GCD(f2, . . . , fs)).
Lemma. Let f1, . . . , fs be polynomials. Then ⟨GCD(f1, . . . , fs)⟩ = ⟨f1, . . . , fs⟩.
Proof. GCD(f1, . . . , fs) divides all the polynomials fi. Hence the principal ideal ⟨GCD(f1, . . . , fs)⟩ contains the ideal ⟨f1, . . . , fs⟩. The other inclusion follows immediately from Bézout's identity. □
Earlier, eight questions were formulated. Here are some answers for dimension 1:
• Since V(f1, . . . , fs) = V(GCD(f1, . . . , fs)), the problem of emptiness of a given variety reduces to the problem of existence of a root of a single polynomial.
• For the same reason, each variety is a finite set of isolated points – the roots of the polynomial GCD(f1, . . . , fs) – except for the case GCD(f1, . . . , fs) = 0. This can happen only if f1 = f2 = · · · = fs = 0, and then the variety is the entire K.
• The concept of dimension is not of much interest in this case; each variety has dimension zero, being a discrete set of points.
• Each ideal can be generated by a single polynomial.
• f ∈ ⟨f1, . . . , fs⟩ ⟺ GCD(f1, . . . , fs) | f.
• Writing ⟨f⟩ := I(V(f1, . . . , fs)), the polynomials f and GCD(f1, . . . , fs) may differ only in the multiplicities of their roots.
12.6.5. Monomial ordering. In order to generalize the Euclidean division of polynomials to more variables, one must first find an appropriate analogy of the degree of a polynomial and of its leading term. The Euclidean division of a polynomial f ∈ K[x1, . . . , xn] by polynomials g1, . . . , gs should be an expression of the form f = a1 g1 + · · · + as gs + r, where no term of the remainder r is divisible by the leading term of any of the polynomials gi.
Try this with f = x²y + xy² + y², g1 = xy − 1, and g2 = y² − 1. The first division yields f = (x + y) · g1 + (x + y² + y). Now LT(y² − 1) does not divide x (the leading term of the remainder), so, theoretically, continuation is not possible. However, x can be moved into the remainder, thus obtaining the result f = (x + y) · g1 + g2 + (x + y + 1). No term of this remainder is divisible by either LT(g1) or LT(g2). How are the leading terms determined?
Monomial ordering
A monomial ordering on K[x1, . . . , xn] is a well-ordering < on N^n (every non-empty subset has a least element) which satisfies: for all α, β, γ ∈ N^n, α < β ⟹ α + γ < β + γ.
An ordering on N^n induces an ordering on monomials as soon as the order of the variables x1, x2, . . . , xn is fixed. Each polynomial can then be rearranged as a decreasing sequence of monomials (ignoring the coefficients for now). The following three definitions introduce the most common monomial orderings. Each ordering assumes that the order of the variables is fixed, usually x1 > x2 > · · · > xn.
Definition. Let α, β ∈ N^n.
The lexicographic ordering: α >lex β ⟺ the left-most non-zero entry of α − β is positive.
The graded lexicographic ordering: α >grlex β ⟺ |α| > |β|, or |α| = |β| and α >lex β.
The graded reverse lexicographic ordering: α >grevlex β ⟺ |α| > |β|, or |α| = |β| and the right-most non-zero entry of α − β is negative.
If x > y > z, then x >grevlex y >grevlex z, and x²yz² >grlex xy³z, yet x²yz² <grevlex xy³z. Verify that >lex, >grlex, >grevlex are indeed monomial orderings!
12.6.6. Multivariate division with remainder. Consider a non-zero polynomial f = ∑_{α∈N^n} a_α x^α in K[x1, . . . , xn] and a monomial ordering <. Then define the degree, leading coefficient, leading monomial, and leading term of f as follows:
• multideg f := max{α ∈ N^n; a_α ≠ 0},
• LC f := a_{multideg f},
• LM f := x^{multideg f},
• LT f := LC f · LM f.
Of course, these concepts depend on the underlying monomial ordering.
Lemma. Let f, g ∈ K[x1, . . . , xn] and < be a monomial ordering. Then
(1) multideg(f · g) = multideg f + multideg g,
(2) f + g ≠ 0 ⟹ multideg(f + g) ≤ max{multideg f, multideg g}.
Proof. Both claims are straightforward corollaries of the definitions. □
Theorem. Let < be a monomial ordering and F = (f1, . . . , fs) an s-tuple of polynomials in K[x1, . . . , xn]. Then every polynomial f ∈ K[x1, . . . , xn] can be expressed as f = a1 f1 + · · · + as fs + r, where ai, r ∈ K[x1, . . . , xn] for all i = 1, 2, . . . , s. Moreover, either r = 0 or r is a linear combination of monomials none of which is divisible by any of LT f1, . . . , LT fs; and if ai fi ≠ 0, then multideg f ≥ multideg(ai fi) for each i. The polynomial r is called the remainder of the multivariate division f/F.
Proof. The theorem says nothing about the uniqueness of the result. The following algorithm produces a possible solution and thus proves the theorem. In the sequel, consider the output of this algorithm to be the result of the division.
(1) a1 := 0, . . . , as := 0, r := 0, p := f
(2) while p ≠ 0:
(a) i := 1
(b) d := false
(c) while i ≤ s ∧ not d:
(i) if LT fi | LT p, then ai := ai + LT p/LT fi; p := p − (LT p/LT fi) · fi; d := true
(ii) else i := i + 1
(d) if not d:
(i) r := r + LT p
(ii) p := p − LT p
In every iteration of the outer loop, exactly one of the commands 2(c)i, 2(d)ii is executed, so the degree of p decreases. Therefore, the algorithm eventually terminates. When checking the loop condition, the invariant f = a1 f1 + · · · + as fs + p + r holds, and each term of each ai is a quotient LT p/LT fi from some moment of the computation. The degrees of the corresponding products with fi equal the degree of p at that moment, which is at most the degree of f. Altogether, the degree of each ai fi is at most the degree of f. □
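Rather than re-implementing the loop, one can let a computer algebra system perform the multivariate division. A sketch of ours, assuming sympy is available: its function reduced returns the quotients ai and the remainder r, and it reproduces the example discussed right below (the assert holds regardless of the exact quotients returned).

from sympy import symbols, reduced

x, y = symbols('x y')
f  = x*y**2 - x
f1 = x*y + 1
f2 = y**2 - 1

# divide f by the ordered tuple (f1, f2) with the lex ordering, x > y
quotients, r = reduced(f, [f1, f2], x, y, order='lex')
print(quotients, r)   # expected: [y, 0] and -x - y
assert f.expand() == (sum(q*g for q, g in zip(quotients, [f1, f2])) + r).expand()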
In the ring K[x1, . . . , xn], the following implication clearly holds: f = a1 f1 + · · · + as fs + 0 ⟹ f ∈ ⟨f1, . . . , fs⟩. However, the converse is generally not true for multivariate division. Consider f = xy² − x, f1 = xy + 1, f2 = y² − 1. The algorithm outputs f = y(xy + 1) + 0 · (y² − 1) + (−x − y), but f = x(y² − 1), so that f ∈ ⟨f1, f2⟩.
The next goal is to find some distinguished generators of the ideal I = ⟨f1, . . . , fs⟩ which would behave better. In a certain sense, this is a procedure similar to the Gaussian elimination of variables for systems of linear equations. Begin with some special assumptions about the ideals.
12.6.7. Monomial ideals. An ideal I ⊆ K[x1, . . . , xn] is called monomial if and only if there is a set of multi-indices A ⊆ N^n such that I is generated by the monomials x^α with α ∈ A. This means that all polynomials in I are of the form ∑_{α∈A} h_α x^α, where h_α ∈ K[x1, . . . , xn]. Clearly, for a monomial ideal I, we have x^β ∈ I if and only if there exists an α ∈ A such that x^α divides x^β.
Lemma. Let I ⊆ K[x1, . . . , xn] be a monomial ideal and f ∈ K[x1, . . . , xn] a polynomial. Then the following propositions are equivalent:
(1) f ∈ I;
(2) each term of f lies in I;
(3) the polynomial f is a linear combination of monomials from I with coefficients from K.
Proof. The implications (3) ⟹ (2) ⟹ (1) are obvious. It remains to prove (1) ⟹ (3). Write the polynomial f as f = ∑_α a_α x^α, where a_α ∈ K. It follows from the assumption f ∈ I that f = ∑_{β∈A} h_β x^β, where x^β ∈ I and h_β ∈ K[x1, . . . , xn]. Each term a_α x^α must equal the sum of the terms of the same multidegree on the right-hand side; hence it can be expressed as a sum of expressions d·x^{β+δ}, where d ∈ K and x^β ∈ I. However, this means that x^α ∈ I, so that (3) holds. □
Corollary. Two monomial ideals coincide if and only if they contain the same monomials.
The following theorem goes much further. It says that every monomial ideal is finitely generated and, moreover, the finite set of generators may be chosen from any given set of generators.
12.6.8. Theorem (Dickson's lemma). Every monomial ideal I = ⟨x^α; α ∈ A⟩ ⊆ K[x1, . . . , xn] can be written in the form I = ⟨x^{α1}, . . . , x^{αs}⟩, where α1, . . . , αs ∈ A.
Proof. Proceed by induction on the number of variables. If n = 1, then I ⊆ K[x], I = ⟨x^α; α ∈ A ⊆ N⟩. The set of exponents A has a minimum; denote it by β := min A. Then x^β divides each monomial x^α with α ∈ A, so I = ⟨x^β⟩.
Now suppose n > 1 and assume that the proposition is true for fewer variables. Denote the variables x1, . . . , x_{n−1}, y, and write monomials in the form x^α y^m, where α ∈ N^{n−1}, m ∈ N. Suppose that I ⊆ K[x1, . . . , x_{n−1}, y] is monomial, and define J ⊆ K[x1, . . . , x_{n−1}] by J := ⟨x^α; ∃m ∈ N, x^α y^m ∈ I⟩. Clearly, J is a monomial ideal in n − 1 variables, so by the induction hypothesis, J = ⟨x^{α1}, . . . , x^{αs}⟩. It follows from the definition of J that there are minimal integers mi ∈ N such that x^{αi} y^{mi} ∈ I. Denote m := max{mi} and define an analogous system of ideals Jk ⊆ K[x1, . . . , x_{n−1}] for 0 ≤ k ≤ m − 1: Jk := ⟨x^β; x^β y^k ∈ I⟩. Again, all the ideals Jk satisfy the induction hypothesis, so they can be expressed as Jk = ⟨x^{α_{k,1}}, . . . , x^{α_{k,s_k}}⟩. It remains to show that I is generated by the following finite set of monomials:
x^{α1} y^m, . . . , x^{αs} y^m,
x^{α_{0,1}} y^0, . . . , x^{α_{0,s_0}} y^0,
. . .
x^{α_{m−1,1}} y^{m−1}, . . . , x^{α_{m−1,s_{m−1}}} y^{m−1}.
Consider a monomial x^α y^p ∈ I. One of the following cases occurs:
• p ≥ m. Then x^α ∈ J, so one of x^{α1} y^m, . . . , x^{αs} y^m divides x^α y^p.
• p < m. Then, analogously, x^α ∈ J_p, and one of x^{α_{p,1}} y^p, . . . , x^{α_{p,s_p}} y^p divides x^α y^p.
By the previous lemma, each polynomial f ∈ I can be expressed as a linear combination of monomials from I. Each of these is divisible by one of the above generators; hence f lies in the ideal generated by them, and I is contained in that ideal. The other inclusion is trivial, which completes the proof of Dickson's lemma. □
12.6.9. Hilbert's theorem. Everything is now at hand for the discussion of ideal bases in polynomial rings. The main idea is the maximal utilization of the information about the leading terms among the generating polynomials and in the ideal. For a nonzero ideal I ⊆ K[x1, . . . , xn], denote LT I := {a x^α; ∃f ∈ I : LT f = a x^α}. Clearly, ⟨LT I⟩ is a monomial ideal, so by Dickson's lemma, ⟨LT I⟩ = ⟨LT g1, . . . , LT gs⟩ for appropriate g1, . . . , gs ∈ I.
Theorem. Every ideal I ⊆ K[x1, . . . , xn] is finitely generated.
Proof. The statement is trivial for I = {0}. So suppose I ≠ {0}. By Dickson's lemma and the above note, there are g1, . . . , gs ∈ I such that ⟨LT I⟩ = ⟨LT g1, . . . , LT gs⟩. Clearly, ⟨g1, . . . , gs⟩ ⊆ I. Choose any polynomial f ∈ I and divide it by the s-tuple g1, . . . , gs. This yields f = a1 g1 + · · · + as gs + r, where no term of r is divisible by any of LT g1, . . . , LT gs. Since r = f − a1 g1 − · · · − as gs, we have r ∈ I, and also LT r ∈ LT I. This means that LT r ∈ ⟨LT I⟩. Suppose that r ≠ 0. Since ⟨LT I⟩ is monomial, LT r must be divisible by one of its generators, i.e. by one of LT g1, . . . , LT gs. This contradicts the result of the multivariate division algorithm. Therefore, r = 0 and I is generated by g1, . . . , gs. □
12.6.10. Gröbner bases. The basis used in the proof of Hilbert's theorem has the properties stated in the following definition:
Gröbner bases of ideals
Definition. A finite set of generators g1, . . . , gs of an ideal I ⊆ K[x1, . . . , xn] is called a Gröbner basis if and only if ⟨LT I⟩ = ⟨LT g1, . . . , LT gs⟩.
Corollary. Every ideal I ⊆ K[x1, . . . , xn] has a Gröbner basis. Every set of polynomials g1, . . . , gs ∈ I such that ⟨LT I⟩ = ⟨LT g1, . . . , LT gs⟩ is a Gröbner basis of I.
Example. Return to the remark on the similarity with the Gaussian elimination of variables for systems of linear equations. That is, illustrate the general results above on the simplest case of polynomials of degree one with the lexicographic ordering. Denote the generators fi = ∑_j a_{ij} x_j + a_{i0}. Consider the matrix A = (a_{ij}), where i = 1, . . . , s and j = 0, . . . , n, and apply the Gaussian elimination to it. This gives a matrix B = (b_{ij}) in echelon form, from which zero rows can be omitted. Hence there is a new basis g1, . . . , gt, where t ≤ s. Due to the performed steps, each fi can be expressed as a linear combination of g1, . . . , gt, which means that ⟨f1, . . . , fs⟩ = ⟨g1, . . . , gt⟩.
Now, verify that these polynomials g1, . . . , gt form a Gröbner basis. Without loss of generality, assume that the variables are labeled so that LM gi = xi for i = 1, . . . , t. Any polynomial f ∈ I can be written as f = h1 f1 + · · · + hs fs = h′1 g1 + · · · + h′t gt. It is required that LT f ∈ ⟨LT g1, . . . , LT gt⟩, that is, that LT f is divisible by one of x1, . . . , xt. Suppose, on the contrary, that f contains only the variables x_{t+1}, . . . , xn. Then h′1 = 0, since x1 occurs only in g1 by the echelon form of B. Analogously, h′2 = · · · = h′t = 0, and so f = 0.
The existence of these very special bases is now proved. However, they cannot yet be constructed algorithmically. This is the goal of the following subsections.
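In practice, one simply asks a computer algebra system for a Gröbner basis. For instance, the basis from exercise 12.K.13 can be reproduced as follows (a sketch of ours, assuming sympy; note that sympy normalizes the generators to be monic, so 2yz − 1 appears as yz − 1/2):

from sympy import symbols, groebner

w, x, y, z = symbols('w x y z')
F = [x*y + y*z - 1, y*z + z*w - 1, z*w + w*x - 1, w*x + x*y - 1]

# graded lexicographic ordering with w > x > y > z
G = groebner(F, w, x, y, z, order='grlex')
print(list(G.exprs))   # expected: a basis equivalent to (w - y, x - z, 2*y*z - 1)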
12.6.11. Theorem. Let G = {g1, . . . , gt} be a Gröbner basis of an ideal I ⊆ K[x1, . . . , xn] and f a polynomial in K[x1, . . . , xn]. Then there is a unique r = ∑_α a_α x^α ∈ K[x1, . . . , xn] such that:
(1) no term of r is divisible by any of LT g1, . . . , LT gt, i.e. ∀α ∀i: LT gi ∤ a_α x^α;
(2) ∃g ∈ I : f = g + r.
Proof. The algorithm for multivariate division produces f = a1 g1 + · · · + at gt + r, where r satisfies condition (1). Select g as a1 g1 + · · · + at gt, which of course lies in I. It remains to prove uniqueness. Suppose that f = g + r = g′ + r′, where r ≠ r′. Clearly, r − r′ = g′ − g ∈ I. Since G is a Gröbner basis, LT(r − r′) is divisible by one of LT g1, . . . , LT gt. There are two possibilities:
• LM r ≠ LM r′. Then the one of them with the higher degree equals LM(r − r′), so it must be divisible by one of LT g1, . . . , LT gt, which contradicts condition (1).
• LM r = LM r′ and LC r ≠ LC r′. Then both the monomials LM r and LM r′ equal LM(r − r′), so they must be divisible by one of LT g1, . . . , LT gt, which is again a contradiction.
It follows that LT r = LT r′, and the same argument applied inductively to the lower terms shows that r = r′. □
The previous theorem generalizes the Euclidean division, with an ideal playing the role of the divisor. In the univariate case, this is no generalization, since every ideal is generated by a single polynomial. If only the remainder is of interest, the order of the polynomials in the Gröbner basis does not matter. Hence it makes sense to introduce the notation f̄^G for the remainder in the division f/G, provided G = (g1, . . . , gs) is a Gröbner basis.
Corollary. Let G = {g1, . . . , gt} be a Gröbner basis of an ideal I ⊆ K[x1, . . . , xn] and f a polynomial in K[x1, . . . , xn]. Then f ∈ I if and only if the remainder f̄^G is zero.
12.6.12. Syzygies. The next step is to find a sufficient “testing set” of polynomials of a given ideal which allows us to verify whether the considered system is a Gröbner basis. Again, we wish to test this by means of multivariate division only. For α = multideg f and β = multideg g, consider γ := (γ1, . . . , γn), where γi = max{αi, βi}. The monomial x^γ is called the least common multiple of the monomials LM f and LM g, denoted LCM(LM f, LM g) := x^γ. The expression
S(f, g) := (x^γ / LT f) · f − (x^γ / LT g) · g
is called the S-polynomial (also syzygy, or pair) of the polynomials f, g. This is a tool for the elimination of leading terms. The Gaussian elimination is a special case of this procedure for polynomials of degree one. However, during the general procedure, it may happen that the degrees of the resulting polynomials are higher, even though the original leading terms are removed. For instance, consider the polynomials f = x³y² − x²y³ + x and g = 3x⁴y + y², of degree 5 in R[x, y] with the lexicographic ordering x >lex y.
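A sketch of ours computing the S-polynomial of these two polynomials with sympy (the top-level functions LT and lcm are standard sympy); note how the degree grows from 5 to 6 even though the leading terms cancel.

from sympy import symbols, LT, lcm, expand

x, y = symbols('x y')
f = x**3*y**2 - x**2*y**3 + x
g = 3*x**4*y + y**2

ltf = LT(f, x, y, order='lex')        # x**3*y**2
ltg = LT(g, x, y, order='lex')        # 3*x**4*y
m = lcm(x**3*y**2, x**4*y)            # LCM of the leading monomials: x**4*y**2
S = expand(m/ltf * f - m/ltg * g)     # the leading terms cancel
print(S)                              # -x**3*y**3 + x**2 - y**3/3, of degree 6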
Theorem (Elimination of variables). Let G be a Gröbner basis of an ideal I ⊆ K[x1, . . . , xn] with respect to the lexicographic ordering x1 >lex x2 >lex · · · >lex xn. Then, for each p = 0, . . . , n, Gp := G ∩ K[x_{p+1}, . . . , xn] is a Gröbner basis for the elimination ideal Ip := I ∩ K[x_{p+1}, . . . , xn]. If G is minimal or reduced, then Gp is again minimal or reduced, respectively.
Proof. Without loss of generality, assume that Gp = {g1, . . . , gr}. Since G ⊆ I, it follows that Gp ⊆ Ip, so the inclusion ⟨Gp⟩ ⊆ Ip is trivial. For the other inclusion, it needs to be verified for each polynomial f ∈ Ip that f = h1 g1 + · · · + hr gr. To do this, perform the multivariate division by the original Gröbner basis G. Since f ∈ I, it follows that f̄^G = 0, i.e. f = h1 g1 + · · · + hr gr + h_{r+1} g_{r+1} + · · · + hm gm. Each of the polynomials g_{r+1}, . . . , gm must contain at least one of the variables x1, . . . , xp; otherwise it would lie in Gp. By the properties of the lexicographic ordering, this variable must then also be contained in LT g_{r+1}, . . . , LT gm. Recalling the individual steps of the algorithm for multivariate division, and the fact that f contains no monomial with any of x1, . . . , xp, we obtain h_{r+1} = · · · = hm = 0, thus verifying that f ∈ ⟨Gp⟩. This proves not only the desired inclusion, but also that on Ip, the division f/G gives the same result as f/Gp. For 1 ≤ i < j ≤ r, consider the S-polynomials S(gi, gj); their remainders satisfy S(gi, gj)‾^{Gp} = S(gi, gj)‾^{G} = 0, so Gp is a Gröbner basis of Ip. It is clear that the property of the basis being minimal or reduced is preserved. □
The only property of the lexicographic ordering used in the proof is that if a variable occurs in a polynomial, then it occurs in its leading term as well. However, this condition is much weaker than the full lexicographic ordering. Therefore, in actual implementations, one may use any ordering with the mentioned property. This usually leads to more efficient computations, since the pure lexicographic ordering tends to cause an unpleasant increase of the polynomials' degrees.
12.6.18. Back to parametrized varieties. The above theorem suggests an algorithm for finding an implicit representation of a variety defined in terms of a polynomial parametrization. The tools necessary for working with the smallest varieties containing the points given by a parametrization are not available here, so a detailed discussion is omitted. When the parametrization of a variety is given by polynomial relations x1 = f1(u1, . . . , uk), . . . , xn = fn(u1, . . . , uk), the reduced Gröbner basis of the ideal ⟨x1 − f1, . . . , xn − fn⟩ can be computed in the lexicographic ordering where each ui precedes each xj. From this basis, the reduced Gröbner basis of the elimination ideal Ik is obtained. This is precisely the required ideal together with its implicit representation. It suffices to use any ordering which guarantees that each ui comes before each xj, so that the computation of the Gröbner basis eliminates the ui; otherwise the ordering may be arbitrary. There is a chance that this gives a more efficient computation than the pure lexicographic ordering.
When the parametrization is rational, i.e. xi = fi(t1, . . . , tm)/gi(t1, . . . , tm), it is perhaps natural to think of substituting the ideal ⟨x1 g1 − f1, . . . , xn gn − fn⟩ into the above theorem. However, the result of this is usually not good. For instance, consider x = u²/v, y = v²/u, z = u. Here, I = ⟨vx − u², uy − v², z − u⟩, and the elimination yields I2 = ⟨z(x²y − z³)⟩. However, the correct result is V(x²y − z³): the computation has added an entire plane. The problem is that the entire variety of zero points of the denominators in the parametrizations of the individual variables is included in W = V(g1 · · · gn). Instead, perceive the parametrization F as a mapping F : (K^m − W) → K^n. For the implicitization, use the ideal I = ⟨g1 x1 − f1, . . . , gn xn − fn, 1 − g1 · · · gn y⟩ ⊆ K[y, t1, . . . , tm, x1, . . . , xn], where the additional variable y enables the avoidance of the zero sets of the denominators. It can be shown that V(I_{m+1}) is the minimal affine variety which contains F(K^m − W).
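The elimination recipe, including the extra variable killing the denominators, can be tried in sympy on the example above (a sketch of ours; we keep only the basis elements free of the auxiliary variables, and the expected outputs are the ones stated in the text):

from sympy import symbols, groebner

t, u, v, x, y, z = symbols('t u v x y z')

# naive ideal for x = u^2/v, y = v^2/u, z = u: picks up the extra plane z = 0
naive = groebner([v*x - u**2, u*y - v**2, z - u], u, v, x, y, z, order='lex')
print([g for g in naive.exprs if not g.has(u, v)])
# expected: [x**2*y*z - z**4], i.e. z*(x**2*y - z**3)

# corrected ideal: 1 - (u*v)*t forces the denominators u, v to be nonzero
fixed = groebner([v*x - u**2, u*y - v**2, z - u, 1 - u*v*t],
                 t, u, v, x, y, z, order='lex')
print([g for g in fixed.exprs if not g.has(t, u, v)])
# expected: [x**2*y - z**3]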
Like number theory, combinatorics is a field of mathematics where the problems can often be formulated very easily. On the other hand, solutions can be much more difficult to find. We begin with graph theory, and display a collection of useful algorithms based on this theory. At the end of the chapter methods of combinatorial computations are considered. 1. Elements of Graph theory 13.1.1. Two examples. Several people come to a party; some pairs of people know each other, while other people know nobody. (Acquaintance is assumed to be symmetric). How many people must be there in order to guarantee that there are either three people who all know each other, or there are three people with no mutual acquaintance? Such situations can be aptly illustrated by a diagram. The points (or vertices) stand for the particular people of the party, the full lines represent pairs who know one another, while the dashed lines stand for pairs who do not know one another. Note that every pair of vertices is connected by either a full or a dashed line. The question is now reformulated as: how many vertices must be there in order that either there is a triangle whose sides are all full or a triangle whose sides are all dashed? 00 0000 11 1111 0000 00 1111 11 0000 00 1111 11000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000 111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000 111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 00 0000 11 1111 0000000000 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000 11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 0011 000 111 0011 0011 00 0 11 100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 00000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000000000 111111111111111111111111111111 11111111111111111111 
There is no such triangle in the left-hand diagram with four vertices. The example of a regular pentagon, in which all the outside edges are full while all the diagonals are dashed (draw a picture!), shows that at least six vertices are required. Such a triangle always exists if the number of vertices is at least six. To show this, consider a set of six vertices, each pair of which is joined by either a dashed line or a full line. If v is one of the vertices, then it is joined to the others by five lines. At least three of these lines are of one type; without loss of generality, let the lines joining v to the vertices vA, vB, vC be full. Then either the triangle formed by the vertices vA, vB, vC contains only dashed lines, in which case it is the desired triangle, or one of its edges is full, in which case this edge together with v forms a full triangle.

A. Fundamental concepts

One of the motives for creating graph theory was the visualization of certain problems concerning relations. A human brain likes thinking about entities it can imagine. Therefore, we like representing a binary relation with a graph whose vertices correspond to the elements, while the edges (lines between the elements) correspond to the fact that the given pair is related. Optionally, we can encode a relation in a more complicated way – using a Hasse diagram (see 12.1.8), for instance. Partially ordered sets are almost always depicted this way. The relation of friendship or acquaintance between people can also be translated to graphs, which gives rise to a good deal of "relaxing" problems.

13.A.1. Prove that the number of odd-degree vertices in an undirected graph is always even. Solution. Let G = (V, E) be an arbitrary undirected graph, and write d(v) for the degree of a vertex v ∈ V. Adding up the degrees of all the vertices, we count every edge twice; therefore, ∑_{v∈V} d(v) = 2|E|. Let V1 ⊂ V be the set of vertices with odd degrees and V2 ⊂ V the set of vertices with even degrees. Then

2|E| = ∑_{v∈V} d(v) = ∑_{v∈V1} d(v) + ∑_{v∈V2} d(v),

so ∑_{v∈V1} d(v) = 2|E| − ∑_{v∈V2} d(v) is an even number. A sum of |V1| odd numbers can be even only if |V1| is even. □

As another example, consider a black box which consumes one bit after another and shines blue or red according to whether the last bit was zero or one. Imagine this could be a light over a toilet door, recognizing whether the last person came out (0) or went in (1). Again, this scheme can be illustrated by a diagram:

(Diagram: three states S, BLUE, RED; every arrow labeled 0 leads to BLUE, every arrow labeled 1 leads to RED.)

The third vertex, which has only two outgoing arrows, represents the beginning of the system (before the first bit is sent).
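For illustration, the black box can be written down directly as a transition table and simulated. A minimal sketch in Python; the state names and the function name are our own choice:

    # the three states of the diagram; on input 0, every state switches to
    # BLUE, and on input 1 to RED
    transitions = {
        ('S', 0): 'BLUE', ('S', 1): 'RED',
        ('BLUE', 0): 'BLUE', ('BLUE', 1): 'RED',
        ('RED', 0): 'BLUE', ('RED', 1): 'RED',
    }

    def light_after(bits):
        state = 'S'  # the initial state, before the first bit is sent
        for b in bits:
            state = transitions[(state, b)]
        return state

    print(light_after([0, 1, 1, 0]))  # BLUE, since the last bit was 0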
Both situations share the same scheme: there is a finite set of objects represented by vertices, and a set of their properties represented by connecting lines between particular vertices. The scheme can be modified by distinguishing the directions of the connecting lines by arrows. Such a situation can be described in terms of relations; see the text from subsection 1.6.1 onwards in the sixth part of chapter one. But this is a complicated terminology for describing simple situations: in the first case, there is one set of people with two complementary symmetric and non-reflexive relations; in the second case, there are two antisymmetric relations on three elements.

13.1.2. Fundamental concepts of graphs. We use the terminology which corresponds to the latter diagrams.

Graphs and directed graphs

Definition. A graph (also an undirected graph) is a pair G = (V, E), where V is the set of its vertices and E is a subset of the set \binom{V}{2} of all 2-element subsets of V. The elements of E are called the edges of the graph.

The vertices of an edge e = {v, w}, v ≠ w, are called the endpoints of e. An endpoint of an edge is said to be incident to that edge. Two edges which share a vertex are called adjacent, and so are two vertices which are the endpoints of a common edge.

13.A.2. Does there exist a graph with the degree sequence (3, 3, 2, 2, 2, 1)? Solution. In this sequence, the number of odd-degree vertices equals 3, which is odd; by 13.A.1, such a graph is impossible. □

13.A.3. Let G be a graph with minimum degree d(G) > 1. Prove that G contains a cycle of length at least d(G) + 1. Solution. Let v1 . . . vk be a maximal path in G, i.e., a path that cannot be extended. Then every neighbour of v1 must lie on the path, since otherwise we could extend it. Since v1 has at least d(G) neighbours, the set {v2, . . . , vk} must contain at least d(G) elements, hence k ≥ d(G) + 1. Now, the neighbour of v1 that is furthest from v1 along the path is some vi with i ≥ d(G) + 1, and v1 . . . vi v1 forms a cycle of length at least d(G) + 1. □

13.A.4. Show that any graph with |V| ≥ 2 contains at least two vertices of equal degree. Solution. We use the pigeonhole principle. Let |V| = n. The degree d(v) of a vertex v ∈ V can take values from 0 to n − 1, i.e., n possible distinct values. However, if some vertex has degree n − 1, which means that it is connected to every other vertex, then no degree can be 0. Thus, the degrees take at most n − 1 distinct values, and by the pigeonhole principle, among the n values of d(v), at least two coincide. □

13.A.5. Show that if n people attend a party and some shake hands with others, then at the end, there are at least two people who have shaken hands with the same number of people. Solution. Let G be the graph whose vertices are the people, with an edge between two of them whenever they shake hands. By the previous problem, there are at least two vertices of equal degree. □
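The statements of 13.A.1 and 13.A.4 are easy to test numerically. A minimal sketch in Python over a random graph; the number of vertices n = 8 and the edge probability 0.5 are arbitrary choices of ours:

    import random
    from itertools import combinations

    n = 8
    edges = [e for e in combinations(range(n), 2) if random.random() < 0.5]

    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1

    assert sum(deg) == 2 * len(edges)        # every edge is counted twice
    assert sum(d % 2 for d in deg) % 2 == 0  # 13.A.1: evenly many odd degrees
    assert len(set(deg)) < n                 # 13.A.4: two equal degrees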
13.A.6. Show that if every connected component of a graph is bipartite, then the graph is bipartite. Solution. A graph is bipartite if we can divide its vertices into two subsets A and B such that every edge of the graph connects a vertex in A to a vertex in B. If the vertex sets of the components are divided into such sets Ai and Bi, we set A = ∪iAi and B = ∪iBi. The subsets A and B endow the whole graph with a bipartite structure. □

A directed graph is a pair G = (V, E), where V is as above, but now E ⊆ V × V. The first of the vertices that define an edge e = (v, w) is called the tail of the edge, and the other vertex is called its head. From the vertices' point of view, e is an outgoing edge of v and an ingoing edge of w. Directed edges are also called arcs or arrows. The head and the tail of a directed edge may be the same vertex; such an edge is called a loop. Two directed edges are called consecutive if the tail of one of them is the head of the other. Similarly, two vertices which are the head and the tail of an edge are called consecutive.

To every directed graph G = (V, E), its symmetrization can be assigned. This is an undirected graph with the same set of vertices as G; it contains an edge e = {v, w} if and only if at least one of the edges e′ = (v, w) and e′′ = (w, v) belongs to E.

Graph theory provides an extraordinarily good language for thinking about procedures and deriving properties that concern finite sets of objects. Graphs are a good example of a compromise between the tendency to "think in diagrams" and precise mathematical formulations. The language of graph theory allows adding information about the vertices or edges in particular problems. For instance, the vertices of a graph can be "coloured" according to the membership of the corresponding objects in several (pairwise disjoint) classes. Or the edges can be labeled with several values, and so on. The existence of an edge between differently coloured vertices can indicate a "conflict". For example, if the vertices are coloured red and blue according to membership in two groups of people with different interests, and the edges represent adjacency at a dining table, then an edge connecting two differently coloured vertices can mean a potential conflict.

Our first example from the previous subsection can thus be perceived as a graph with coloured edges. The statement verified there reads, in the language of graph theory: the graph Kn = (V, \binom{V}{2}) with n ≥ 6 vertices and all possible edges, labeled with two colours, always contains a triangle whose sides are of the same colour.

The directed graph in the second example above, whose edges are labeled with zero or one, represents a simple finite automaton. This name reflects the idea that the graph describes a process which is, at any moment, in a state represented by the corresponding vertex, and which changes to another state in a step represented by one of the outgoing edges of that vertex. The theory of finite automata is not considered here.

13.1.3. Examples of useful graphs. The simplest graphs are those which contain no edges; there is no special notation for them. At the other extreme is a graph which contains all possible edges. This is called a complete graph, denoted by Kn, where n is the number of its vertices. The graphs K4 and K6 are presented in the introductory subsection.

13.A.7. Show that a graph is bipartite if and only if it contains no cycles of odd length. Solution. Suppose there are no cycles of odd length in the graph G. Choose any vertex of the graph and put it in set A. Follow every edge from that vertex and put all the vertices at the other end in set B, marking the vertices already used. Now, for every vertex in B, follow all the edges from it and put the vertices at the other end in A, again marking the used vertices. Alternate back and forth in this manner until we can no longer proceed. This happens either when we exhaust all the vertices, or when we encounter a vertex that is already in one set and should be moved to the other. In the latter case, there is a closed walk of odd length through that vertex, and hence a cycle of odd length. If the graph is not connected, there may still be vertices that have not been assigned; we repeat the same process until all vertices are assigned either to set A or to set B. Thus, the graph G is bipartite. Suppose now that the graph G is bipartite, and let c be a cycle vi1 . . . vik of length k. The vertices along this cycle alternate between the subsets A and B of V, and the cycle returns to its first vertex. Therefore, there is the same number of A-vertices and B-vertices on c, and so k is even. □
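The alternating construction of the sets A and B in this solution is exactly breadth-first 2-colouring. A minimal sketch in Python (the function name is ours); it returns None as soon as an odd cycle is detected:

    from collections import deque

    def two_colouring(n, edges):
        adj = [[] for _ in range(n)]
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        colour = [None] * n
        for start in range(n):                # handle every component
            if colour[start] is not None:
                continue
            colour[start] = 0
            queue = deque([start])
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if colour[v] is None:
                        colour[v] = 1 - colour[u]
                        queue.append(v)
                    elif colour[v] == colour[u]:
                        return None           # odd cycle: not bipartite
        return colour

    print(two_colouring(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # [0, 1, 0, 1]
    print(two_colouring(3, [(0, 1), (1, 2), (2, 0)]))          # None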
13.A.8. Show that any tree with at least two vertices is bipartite. Solution. Since a tree does not contain any cycles, it contains no cycles of odd length in particular. Therefore, it is bipartite by 13.A.7. □

13.A.9. Prove that if u is a vertex of odd degree in a graph, then there exists a path from u to some other vertex v of odd degree. Solution. We build a trail, i.e., a walk which does not reuse any edge; as we build it, we erase the edges already used. Begin at the vertex u and select an arbitrary edge emanating from it. If, at any point, the trail reaches a vertex of odd degree, we are done. Each time we arrive at a vertex t of even degree, we are guaranteed that there is another edge along which we can leave; passing through t erases two edges at t, so the number of unused edges at t remains even, and there is always a way to continue. (The trail can never get stuck at u either: the initial departure used one edge, so whenever we arrive at u, an odd, hence positive, number of unused edges remains there.) Since there are only finitely many edges, the trail must end eventually, and the only way it can end is by arriving at a vertex v of odd degree. Finally, a trail from u to v contains a path from u to v. □

13.A.10. If the distance d(u, v) between two vertices u and v that can be connected by a path in a graph is defined to be the length of the shortest path connecting them, show that the distance function satisfies the triangle inequality: d(u, v) + d(v, w) ≥ d(u, w). Solution. Concatenating a shortest path from u to v with a shortest path from v to w, one obtains a walk from u to w of length d(u, v) + d(v, w), and this walk contains a path from u to w of no greater length. Since d(u, w) is the minimum of the lengths of all paths from u to w, the triangle inequality follows. □

The graph K3 is called a triangle. An important type of graph is a path. This is a graph whose vertices can be ordered as (v0, . . . , vn) so that E = {e1, . . . , en}, where ei = {vi−1, vi} for all i = 1, . . . , n. A path graph of length n is denoted by Pn. If the first and the last vertices coincide (for n ≥ 3), the graph is called a cycle graph of length n, denoted by Cn. The graphs K3 = C3, C5, and P5 are shown in the following diagram.

(Diagram: the graphs K3 = C3, C5, and P5.)
Another type of graph is the complete bipartite graph. Its vertices can be coloured with two (distinct) colours so that all possible edges between vertices of different colours are present, but no other edges. Such a graph is denoted by Km,n, where m and n are the numbers of vertices of the particular colours. The diagram below illustrates the graphs K1,3, K2,3, and K3,3.

(Diagram: the complete bipartite graphs K1,3, K2,3, and K3,3.)

Another interesting example of a graph is the hypercube Hn in dimension n, whose vertices are the integers 0, . . . , 2^n − 1 and whose edges join those pairs of vertices whose binary expansions differ in exactly one bit. The diagram below depicts the hypercube H4, with the labels of the vertices indicated. From the definition it follows that a hypercube of a given dimension can always be composed from two hypercubes of dimension one lower, connected by edges in an appropriate way. These new edges between the two disjoint copies of H3 are the dashed ones in the diagram. The hypercube H4 can be decomposed in this way with respect to any fixed bit position (in the diagram, it is the very first position).

(Diagram: the hypercube H4 with the vertex labels 0000, 0001, 0010, 0100, 0110, 1110, 1111, 1011, etc.; the dashed edges connect the two copies of H3.)
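The one-bit-difference rule makes hypercubes easy to generate: the vertices u and u XOR 2^i are adjacent for every bit position i. A minimal sketch in Python (the function name is ours):

    def hypercube_edges(n):
        # vertices are 0, ..., 2^n - 1; flipping the i-th bit of u gives
        # a neighbour; the condition u < u ^ (1 << i) counts each edge once
        return [(u, u ^ (1 << i))
                for u in range(2 ** n)
                for i in range(n)
                if u < u ^ (1 << i)]

    print(len(hypercube_edges(4)))  # 32 edges in H_4, i.e. 4 * 2^4 / 2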
Here are two more examples. The first is the cycle ladder graph CLn with 2n vertices. This consists of two cycle graphs Cn whose vertices are connected by edges according to their order in the cycles. The second is the Petersen graph. It is somewhat similar to CL5, yet it is actually the simplest counterexample to many propositions about graphs.

(Diagram: the Petersen graph.)

13.A.11. Show that in a directed graph where every vertex has the same number of incoming and outgoing edges, there exists an Eulerian path. Solution. Suppose a trail starts at some vertex v. The first edge of the trail can be any edge emanating from v; we delete it from the set of edges available for the rest of the trail. Whenever we arrive at another vertex u, we choose an edge that is still in the set of available edges outgoing from u. We stop when we can no longer choose such an edge. This can happen only if we arrive again at v with all the outgoing edges of v already engaged in the trail. Indeed, at any other vertex w, every arrival engages one incoming edge and can be followed by an unused outgoing edge, since their numbers are the same. Hence, the trail starts and ends at v. If some edges of a connected graph remain unused, the same argument yields a closed trail through a vertex already visited, and the two trails can be spliced into one; repeating this, we obtain a closed trail containing all the edges of the graph. □

13.A.12. An n-cube Qn is a cube in n dimensions: its vertices are the points with all coordinates equal to 0 or 1, and its edges connect the neighbouring vertices, i.e., those which differ from each other in exactly one coordinate. Show that Qn possesses a Hamiltonian circuit. Solution. We proceed by induction. The (k + 1)-dimensional cube Q_{k+1} can be considered as two copies of Qk, with every vertex of the first copy given the first coordinate 0, and 1 for the second copy. Consider some Hamiltonian cycle vi1 . . . vik vi1 in Qk. It yields the cycle 0vi1 . . . 0vik 0vi1 in the first copy and the cycle 1vi1 . . . 1vik 1vi1 in the second one. Then the path 0vi1 . . . 0vik, 1vik 1vik−1 . . . 1vi1, 0vi1 forms a Hamiltonian cycle on Q_{k+1}. □

13.A.13. Show that a tree with n vertices has exactly n − 1 edges. Solution. We proceed by induction; a tree with one vertex clearly has no edges. Suppose that every tree with k vertices has precisely k − 1 edges, and let T be a tree with k + 1 vertices. We first show that T contains a vertex with a single edge connected to it. If not, start at any vertex and keep following edges, marking each vertex as we pass it. Since every vertex is assumed to have more than one incident edge, there is never a reason to stop, so we eventually encounter a marked vertex; but then the traversed edges contain a cycle, which is impossible in a tree. Take a vertex with a single edge connecting to it, and delete it together with its edge from the tree T. The new graph T′ has k vertices. It is connected, since no path between two other vertices can pass through the removed vertex. If there were no cycles before, removing an edge certainly cannot produce a cycle, so T′ is a tree. By the induction hypothesis, T′ has k − 1 edges. To recover T from T′, we add one vertex and one edge, so T also satisfies the formula for the number of edges. □
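The induction step in 13.A.12 is precisely the construction of the reflected Gray code: traverse the first copy of Qk forwards and the second copy backwards. A minimal sketch in Python (the function name is ours):

    def gray_cycle(n):
        # returns the vertices of Q_n (as bit strings) in the order of a
        # Hamiltonian cycle; consecutive strings differ in exactly one bit,
        # and so do the last and the first one
        if n == 1:
            return ['0', '1']
        prev = gray_cycle(n - 1)
        return ['0' + v for v in prev] + ['1' + v for v in reversed(prev)]

    print(gray_cycle(3))
    # ['000', '001', '011', '010', '110', '111', '101', '100']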
13.1.4. Morphisms of graphs. Mappings between the sets of vertices or edges which respect the considered structure are of great importance in graph theory. It is enough to consider mappings between the vertices only.

Morphisms of graphs

Definition. Let G = (V, E) and G′ = (V′, E′) be two given graphs. A morphism (or homomorphism) f : G → G′ is a mapping fV : V → V′ between the sets of vertices such that if e = {v, w} is an edge in E, then e′ = {fV(v), fV(w)} is an edge in E′.

In practice, there is no need to distinguish between the morphism f and the mapping fV. The definition is the same for directed graphs, using ordered pairs e = (v, w) as edges. In the case of undirected graphs, the definition implies that if f(v) = f(w) for distinct v, w ∈ V, then v and w are not connected by an edge. On the other hand, such an edge is admissible for directed graphs, provided the common image of the two vertices has a loop.

An important special case is a morphism of a graph G whose codomain is Km. Such a morphism is equivalent to a labeling of the vertices of G with the m colours (or any other names) of the vertices of Km so that vertices of one colour are never adjacent. In this case, it is called a (vertex) colouring of the graph G with m colours.

If a morphism f : G → G′ is a bijection between the sets of vertices such that the inverse mapping f⁻¹ is also a morphism, then f is called an isomorphism of graphs. Two graphs are isomorphic if they differ only in the labeling of their vertices.

Every morphism of directed graphs is also a morphism of their symmetrizations; the converse is not true in general.

There are simple and extraordinarily useful examples of graph morphisms: namely a path, a walk, and a cycle in a graph.

13.A.14. If u and v are two vertices of a tree T, show that there is a unique path connecting them. Solution. Since T is a tree, it is connected, and therefore there is at least one path connecting u and v. Suppose there are two different paths P and Q connecting u to v. Reverse Q to obtain a path Q′ leading from v to u; the concatenation of P and Q′ leads from u back to itself. This closed walk PQ′ need not itself be a cycle, since it may use some edges in both directions, but we assume that there are some differences between P and Q. From PQ′, we can extract a cycle: begin at u and continue one vertex at a time until the paths P and Q′ first differ; at this point, the paths split. Continue along both paths beyond the bifurcation point until they join again for the first time, which must occur eventually, since we know they meet at the end. The two fragments of P and Q′ between these points form a cycle in the tree, which is impossible, since T is a tree. □

13.A.15. If G is a connected graph and k ≥ 2 is the maximum path length, then any two paths in G of length k share at least one common vertex. Solution. Suppose not, and let P = vi1 . . . vik and Q = vj1 . . . vjk be two paths of maximal length k that do not share any vertex. Since G is connected, there is a path R connecting a vertex of P to a vertex of Q. Consider the fragment of R between the last vertex it shares with P and the first vertex it shares with Q; denote it by vir = um um+1 . . . um+t = vjℓ, where t ≥ 1, so that its inner vertices avoid both P and Q. Let P′ be the longer of the two parts of P ending at vir, and let Q′ be the longer of the two parts of Q starting at vjℓ; each of them has length at least k/2. Then P′, followed by the fragment um . . . um+t and by Q′, is a path of length at least k/2 + 1 + k/2 > k, contradicting the maximality of k. □
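The morphism condition of 13.1.4 can be checked mechanically: a vertex map is a morphism if and only if it sends every edge to an edge. A minimal sketch in Python for undirected graphs (the function name and the dictionary encoding of f are our choices); the example realizes a 2-colouring of C4 as a morphism C4 → K2:

    def is_morphism(edges_G, edges_H, f):
        # images of edges; a frozenset of size 1 signals two adjacent
        # vertices mapped to the same vertex, which is forbidden
        image = {frozenset((f[u], f[v])) for u, v in edges_G}
        allowed = {frozenset(e) for e in edges_H}
        return all(len(e) == 2 for e in image) and image <= allowed

    C4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
    K2 = [('a', 'b')]
    print(is_morphism(C4, K2, {0: 'a', 1: 'b', 2: 'a', 3: 'b'}))  # True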
13.A.16. Solution. □

13.A.17. In a dormitory, there is a party held every night. Every time, the organizer of the party invites all of his/her acquaintances, and at the end of the party, all of the guests know each other. Suppose that each member of the dormitory has organized a party at least once, yet there are still two students who do not know each other. Show that they will not meet at the next party. Solution. Consider the acquaintance graph of the students at the beginning (the vertices correspond to the students, the edges to the acquaintances). We are going to show that if two students lie in the same connected component of this graph (i.e., there exists a chain of acquaintances beginning with one of the considered students and ending with the other one, see 13.1.10), then they know each other as soon as each member of the dormitory has held a party. Indeed, consider a shortest path (acquaintance chain) between two students lying in the same connected component. Every time someone from this path organizes a party, the path becomes shorter by one (the organizer falls out). Since we assume that each of the students on the path has organized a party, the two marginal students must know each other as well. Therefore, if there are two students who do not know each other even after everyone has held a party, then they lie in different connected components of the graph, so they will never meet at a party (in particular, not at the upcoming one). □

Walks, Paths, Trails, and Cycles

A walk of length n in a graph G is a morphism s : Pn → G; both vertices and edges may repeat in the image. A trail is a walk in which vertices are allowed to repeat, but edges are not. A path of length n in a graph G is a morphism p : Pn → G which is injective, so the images of the vertices v0, . . . , vn of Pn are pairwise distinct. A cycle of length n in a graph G is a morphism c : Cn → G which is injective on the vertices.

For simplicity, the morphism is often identified with its image. Walks are often written explicitly in the form (v0, e1, v1, . . . , en, vn), where ei = {vi−1, vi} for i = 1, . . . , n. A walk can be thought of as the trajectory of a "pilgrim" moving from the vertex f(v0) to the vertex f(vn) of an (undirected) graph, not stopping at any vertex. Pn always contains an edge connecting the adjacent vertices vi−1 and vi, while loops are not admitted in undirected graphs. The pilgrim may enter a vertex more than once, or even go along an edge already visited. The pilgrim making a "trail" is a little wiser – he never goes along any edge for the second time on his way from the initial vertex f(v0) to the terminal vertex f(vn).

13.1.5. Subgraphs. The images of paths, walks, and cycles are examples of subgraphs, but not in the same way.

Subgraphs

Definition. A graph G′ = (V′, E′) is a subgraph of a graph G = (V, E) if and only if V′ ⊆ V and E′ ⊆ E.

Consider a graph G = (V, E) and choose a subset V′ ⊆ V. The largest subgraph (with respect to the number of edges) with V′ as its set of vertices is called the induced subgraph. It is the graph G′ = (V′, E′) where an edge e ∈ E belongs to E′ if and only if both of its endpoints lie in V′; that is, E′ = E ∩ \binom{V′}{2}.

A spanning subgraph (also a factor) of a graph G = (V, E) is any graph G′ = (V, E′) with the same vertex set as G, while the set of edges E′ ⊆ E may be arbitrary.

A clique in the graph G is a subgraph which is isomorphic to a complete graph.

Every subgraph can be constructed by a step-by-step application of these two constructions – first select V′ ⊆ V, then choose the target edge set E′ inside the subgraph induced on V′. Every image of a homomorphism (vertices as well as edges) forms a subgraph.
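The description E′ = E ∩ \binom{V′}{2} of the induced subgraph translates directly into code. A minimal sketch in Python (the function name is ours):

    from itertools import combinations

    def induced_edges(edges, W):
        # the edge set of the subgraph induced on the vertex subset W
        W = set(W)
        return [e for e in edges if set(e) <= W]

    K4 = list(combinations(range(4), 2))   # the complete graph on 4 vertices
    print(induced_edges(K4, {0, 1, 2}))    # [(0, 1), (0, 2), (1, 2)]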
Now, we are going to practice the fundamental concepts of graph theory on simple combinatorial problems.

13.A.18. Determine the number of edges of each of the graphs K6, K5,6, C8. Solution. The complete graph K6 on 6 vertices has \binom{6}{2} = 15 edges. The complete bipartite graph K5,6 (see 13.1.3) has 5 · 6 = 30 edges. Finally, the cycle graph C8 has 8 edges. □

13.A.19. Degree sequences. Verify whether each of the following sequences is the degree sequence (see 13.1.7) of some graph. If so, draw one of the corresponding graphs.
i) (1, 2, 3, 4, 5, 6, 7, 8, 9),
ii) (1, 1, 1, 2, 2, 3, 4, 5, 5).
Solution. First of all, we check the necessary condition from (1). In the former case, we have 1 + 2 + · · · + 9 = 45, which is odd, so the condition is not satisfied and the first sequence does not correspond to any graph. As for the latter sequence, the sum of the wanted degrees equals 24, so the necessary condition is satisfied. Now, we proceed by the Havel–Hakimi theorem from subsection 13.1.7:

(1, 1, 1, 2, 2, 3, 4, 5, 5) ←→ (1, 1, 1, 1, 1, 2, 3, 4)
←→ (1, 1, 1, 0, 0, 1, 2) ←→ (0, 0, 1, 1, 1, 1, 2)
←→ (0, 0, 1, 1, 0, 0) ←→ (0, 0, 0, 0, 1, 1)
←→ (0, 0, 0, 0, 0).

Of course, it was not necessary to execute the procedure to the very end; we could have stopped as soon as we saw that the obtained sequence is indeed the degree sequence of some graph. Now, we construct the corresponding graph "backwards" (however, we must take care to always add edges to vertices of the appropriate degrees – it is at this place that we have options and can obtain non-isomorphic graphs with the same degree sequence). One of the possible outcomes is the following graph (the order in which each vertex was selected is written inside it):

(Diagram: the constructed graph on nine vertices; the numbers 1, 0, 0, 0, 0, 0, 3, 2, 4 inside the vertices indicate the selection order.) □

13.1.6. How many non-isomorphic graphs are there? It is easy to draw all graphs (up to isomorphism) with a predetermined small number of vertices (three or four, for instance). In general, however, this is a complicated combinatorial problem, and it is often difficult to decide whether or not two given graphs are isomorphic.

Remark. This problem, known as the graph isomorphism problem, is a somewhat peculiar member of the class NP¹ – it is known neither whether it is NP-complete nor whether it can be solved in polynomial time. It is a special case of the problem of deciding whether or not a given graph is isomorphic to a subgraph of another graph; this subgraph isomorphism problem is known to be NP-complete.

¹ Wikipedia, NP (complexity), http://en.wikipedia.org/wiki/NP_(complexity) (as of Aug. 7, 2013, 13:44 GMT).

Just to get some feeling, let us find a rough lower estimate of the total number of non-isomorphic graphs. There are as many graphs on a given set of n vertices as there are subsets of the set of all possible edges, and a k-element set has 2^k subsets. There are at most n! graphs isomorphic to a given one, since this is the number of bijections between two n-element sets. It follows that there are at least

k(n) = 2^{\binom{n}{2}} / n!

pairwise non-isomorphic graphs on n vertices. Hence,

log2 k(n) = \binom{n}{2} − log2 n! ≥ (n²/2) (1 − 1/n − (2 log2 n)/n),

since n! ≤ n^n. For large n, the asymptotic estimate log2 k(n) ≥ n²/2 − O(n log2 n) follows (see the notation for asymptotic bounds in subsection 6.1.12 on page 520). Thus, the logarithm of the number of non-isomorphic graphs grows at least as fast as n².
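For very small n, the number of isomorphism classes can also be found by brute force and compared with the lower bound k(n). A minimal sketch in Python (the function name is ours); every graph is replaced by a canonical relabeling, and distinct canonical forms are counted:

    from itertools import combinations, permutations

    def nonisomorphic_count(n):
        slots = list(combinations(range(n), 2))
        classes = set()
        for mask in range(2 ** len(slots)):
            edges = [e for i, e in enumerate(slots) if mask >> i & 1]
            # canonical form: the lexicographically smallest relabeling
            canon = min(
                tuple(sorted(tuple(sorted((p[u], p[v]))) for u, v in edges))
                for p in permutations(range(n))
            )
            classes.add(canon)
        return len(classes)

    print(nonisomorphic_count(3), nonisomorphic_count(4))  # 4 11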
13.1.7. Vertex degree and degree sequence. It is relatively easy to verify that two given graphs are not isomorphic. Since isomorphic graphs differ only in the relabeling of their vertices, they share all numerical and other characteristics which are not changed by relabeling. Simple data of this type include, for instance, the numbers of edges incident to the particular vertices.

13.A.20. Find the number of pairwise non-isomorphic complete bipartite graphs with 1001 edges. Solution. A complete bipartite graph Km,n has m · n edges. Therefore, the problem can be restated as follows: in how many ways can we write the integer 1001 as a product of two integers? Since 1001 = 7 · 11 · 13, we get 1001 = 1 · 1001 = 7 · (11 · 13) = 11 · (7 · 13) = 13 · (7 · 11). Thus, there are four non-isomorphic complete bipartite graphs having 1001 edges: K1,1001, K7,143, K11,91, and K13,77. □

13.A.21. Find the number of graph homomorphisms (see 13.1.4) a) from P2 to K5, b) from K3 to K5. Solution. We can see from the definition of a graph homomorphism that the only condition to be satisfied is that adjacent vertices must not be mapped to the same vertex. a) 5 · 4 · 4 = 80. b) 5 · 4 · 3 = 60. □

13.A.22. Number of walks. Using the adjacency matrix (see 13.1.8), find the number of walks of length 4 from vertex 1 to vertex 2 in the following graph:

(Diagram: a graph on the vertices 1, 2, 3, 4, 5.)

Solution. The adjacency matrix of the given graph is

AG =
( 0 1 1 0 0 )
( 1 0 1 1 1 )
( 1 1 0 1 1 )
( 0 1 1 0 1 )
( 0 1 1 1 0 ).

Vertex degree and degree sequence

The degree of a vertex v ∈ V in a graph G = (V, E) is the number of edges of E incident to v; it is denoted by deg v. The degree sequence of a graph G with vertices V = (v1, . . . , vn) is the sequence (deg v1, deg v2, . . . , deg vn). Sometimes it is required that the sequence be sorted in ascending or descending order rather than corresponding to the selected order of the vertices. In the case of directed graphs, we distinguish between the indegree deg⁺ v of a vertex v and its outdegree deg⁻ v. A directed graph is said to be balanced if and only if deg⁻ v = deg⁺ v for all vertices v.

The degree sequence of a graph (and of all its isomorphic copies) is unique up to permutation. Therefore, if the degree sequences of two graphs differ by more than a permutation, then the graphs are not isomorphic. The converse statement is not true in general: the graphs G = C3 ∪ C3 and C6 are non-isomorphic, yet both have the degree sequence (2, 2, 2, 2, 2, 2). Indeed, C6 contains a path of length 5, but C3 ∪ C3 does not, so these two graphs cannot be isomorphic.

Since every edge has two endpoints, it is counted twice in the sum of the degrees (this observation is sometimes known as the handshaking lemma). It follows that

(1) ∑_{v∈V} deg v = 2|E|.

In particular, the sum of the degree sequence must be even.
The following theorem² illustrates how useful invariants like the degree sequence can be. The proof is constructive: it describes an algorithm which constructs a graph with a given degree sequence, or shows that there is no such graph.

Deciding about a given degree sequence

Theorem. For any natural numbers 0 ≤ d1 ≤ · · · ≤ dn, there exists a graph G on n vertices with these values as its degree sequence if and only if there exists a graph on n − 1 vertices with the degree sequence

(d1, d2, . . . , d_{n−dn−1}, d_{n−dn} − 1, . . . , d_{n−1} − 1).

Proof. If there exists a graph G′ on n − 1 vertices with the degree sequence as stated in the theorem, then a new vertex vn can be added to G′ and connected by edges to the last dn vertices of G′, thereby obtaining a graph G with the desired degree sequence.

² Proved independently by Václav J. Havel in 1955 in the Časopis pro pěstování matematiky (in Czech) and by S. L. Hakimi in 1962 in the Journal of the Society for Industrial and Applied Mathematics.

The number of walks of length 4 from vertex 1 to vertex 2 is the entry at position [1, 2] of the matrix A^4_G. Since

A^2_G =
( 2 1 1 2 2 )
( 1 4 3 2 2 )
( 1 3 4 2 2 )
( 2 2 2 3 2 )
( 2 2 2 2 3 ),

we have (A^4_G)1,2 = (2, 1, 1, 2, 2) · (1, 4, 3, 2, 2)^T = 17. Therefore, there are 17 walks of length 4 between the vertices 1 and 2. □

13.A.23. A cut edge (also called a bridge) in a graph is an edge whose removal increases the number of connected components of the graph. Similarly, a cut vertex (also called an articulation point) is a vertex with this property, i.e., when it is removed (together with the edges incident to it, of course), the graph splits into more connected components. Find all the cut edges and cut vertices of the following graph:

(Diagram: a graph on the vertices 0–17.) ⃝

13.A.24. Prove that a Hamiltonian graph (see 13.1.13) must be 2-vertex-connected. Give an example of a graph which is 2-vertex-connected yet does not contain a Hamiltonian cycle. Solution. For any pair of vertices in a Hamiltonian graph, there are two paths between them which are disjoint except for the two vertices (the "arcs" of the Hamiltonian cycle). Therefore, if we remove any single vertex, the graph clearly remains connected (the removed vertex lies on one of the two paths only). As for the example of a non-Hamiltonian graph which is 2-vertex-connected, we can recall the Petersen graph (see the picture earlier in this chapter). □

13.A.25. Determine the number of cycles (see 13.1.3) in the graph K5. Solution. We sort the cycles by their lengths, i.e., we count separately the numbers of cycles on three, four, and five vertices. A cycle of length three is determined uniquely by its three vertices, and there are \binom{5}{3} ways to choose them. A cycle of length four is determined by its four vertices (which can be chosen in \binom{5}{4} ways) and by the pair of neighbors of one fixed vertex (the pair can be selected from the remaining three vertices in \binom{3}{2} ways). Finally, a cycle of length five is given by the pair of neighbors of a fixed vertex as well as the other neighbor (from the two remaining vertices) of a fixed vertex of this pair.

The reverse implication is more difficult. The following needs to be proved: given a fixed degree sequence (d1, . . . , dn) with 0 ≤ d1 ≤ · · · ≤ dn, there exists a graph whose vertex vn is adjacent to exactly the last dn vertices v_{n−dn}, . . . , v_{n−1}.
The idea is simple – if any of the last dn vertices, say vk, is not adjacent to vn, then vn must be adjacent to one of the prior vertices. We want to interchange the endpoints of two edges so that the vertices vn and vk become adjacent while the degree sequence remains unchanged. Technically, this can be done as follows. Consider all graphs G with the given degree sequence and let, for each G, ν(G) denote the greatest index of a vertex which is not adjacent to the vertex vn. Fix G such that ν(G) is as small as possible. Then either ν(G) = n − dn − 1 (and the desired graph is obtained), or ν(G) ≥ n − dn. If the latter is true, then vn is adjacent to some vertex vi with i < ν(G). Since deg v_{ν(G)} ≥ deg vi, there exists a vertex vℓ which is adjacent to v_{ν(G)} but not to vi. Replace the edge {vℓ, v_{ν(G)}} with {vℓ, vi}, and the edge {vi, vn} with {v_{ν(G)}, vn}, to get a graph G′ with the same degree sequence but with ν(G′) < ν(G), which contradicts the choice of G. (Draw a diagram!) Therefore, the former possibility holds, and the graph is created by adding the last vertex and connecting it by edges to the last dn vertices. □

The theorem describes an exact procedure for constructing a graph with a given degree sequence; if there is no such graph, the algorithm indicates this during the computation. Begin with the degree sequence in (say) ascending order. Delete the largest value d and subtract one from each of the d remaining values on the very right. Then sort the obtained sequence and repeat the above step until either the current sequence is evidently the degree sequence of some graph, or it evidently corresponds to no graph. If a graph is eventually constructed after a number of steps, then one can reverse the procedure, adding one vertex in each step and connecting it to those vertices whose degrees had one subtracted during the procedure. (Try some examples yourself!) The algorithm constructs only one of the possibly many graphs which share the same degree sequence.

13.1.8. Matrix representation. The efficiency of graph representations is of importance for running algorithms. One of them is useful in theoretical considerations:

Altogether, there are

\binom{5}{3} + \binom{5}{4} · \binom{3}{2} + \binom{5}{5} · \binom{4}{2} · \binom{2}{1} = 10 + 15 + 12 = 37

cycles. □

13.A.26. Determine the number of subgraphs (see 13.1.5) of the graph K5. Solution. Again, we count the subgraphs separately by the number v of their vertices:
• v = 0: there is a unique graph on 0 vertices, the empty graph.
• v = 1: there are 5 ways of selecting 1 vertex, resulting in 5 subgraphs.
• v = 2: two vertices can be chosen in \binom{5}{2} ways, and there may or may not be an edge between them. Altogether, we get \binom{5}{2} · 2 such subgraphs.
• v = 3: three vertices can be selected in \binom{5}{3} ways, and each of the \binom{3}{2} pairs of them may or may not be joined by an edge. This results in \binom{5}{3} · 2^{\binom{3}{2}} subgraphs.
• v = 4: here, we calculate \binom{5}{4} · 2^{\binom{4}{2}} subgraphs.
• v = 5: finally, in this case, there are \binom{5}{5} · 2^{\binom{5}{2}} subgraphs.
Altogether, we have found 1 + 5 + 20 + 80 + 320 + 1024 = 1450 subgraphs of the graph K5. □

13.A.27. Determine the number of paths between a fixed pair of distinct vertices in the graph K7. Solution. We sort the paths by their lengths. There is a unique path of length one (it consists of the edge connecting the selected vertices). There are five paths of length two (the path may lead through any of the remaining five vertices). There are 5 · 4 paths of length three (we select the two intermediate vertices, and their order matters). Similarly, there are 5 · 4 · 3 paths of length four, 5 · 4 · 3 · 2 paths of length five, and 5! paths of length six. Clearly, there are no longer paths in K7. Altogether, we have 1 + 5 + 5 · 4 + 5 · 4 · 3 + 5 · 4 · 3 · 2 + 5! = 326 paths. □
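The count in 13.A.27 can be confirmed by direct enumeration: a path between the two fixed vertices is determined by the ordered tuple of its interior vertices. A minimal sketch in Python:

    from itertools import permutations

    total = 0
    others = range(2, 7)   # the five vertices other than the fixed pair 0, 1
    for k in range(6):     # k interior vertices give a path of length k + 1
        total += sum(1 for _ in permutations(others, k))
    print(total)           # 326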
At the end of this subsection, we present one more amusing problem.

13.A.28. The towns of a certain country are connected by roads so that each town is directly connected to exactly three other towns. Prove that there exists a town from which one can make a sightseeing round trip such that the number of roads used is not divisible by three. Solution. First of all, we reformulate this problem in the language of graph theory: our task is to prove that every 3-regular graph (i.e., one where the degree of every vertex equals three) contains a cycle whose length is not divisible by three. We proceed by induction, and we actually prove a stronger proposition: every graph each of whose vertices has degree at least three contains a cycle whose length is not divisible by three. In fact, the original proposition could not be proved by induction directly, since the induction hypothesis would be too weak. The induction is carried on the number k of vertices of the graph. Clearly, the statement holds for k = 4. Now, consider a graph where the degree of each vertex is at least three, and suppose that the statement is true for any such graph on fewer vertices. The reader should be able to prove that there exists a cycle in the graph. If its length is not divisible by three, we are done. Thus, suppose from now on that C = v1v2 . . . v3n. Each vertex of this cycle is connected to at least one more vertex (different from its neighbors on the cycle) in the graph. If there is a vertex vi on the cycle that is connected to a vertex vj on the cycle (j > i + 1), then the lengths of the cycles v1v2 . . . vivjvj+1 . . . v3n and vivi+1 . . . vj total 3n + 2, so the length of at least one of them is not divisible by three, as wanted. The situation is similar if there are two vertices vi and vj, 1 ≤ i < j ≤ 3n, which are connected to the same vertex outside the cycle. Therefore, suppose that each vertex of the cycle is connected to some vertices outside C and that no two vertices of the cycle are adjacent to the same vertex outside. Then, we can consider the graph which is obtained from the original one by replacing the vertices v1, v2, . . . , v3n with a single vertex V. In this new graph, there are again at least three edges leading from each vertex (including V), so we can apply the induction hypothesis to it. Therefore, there is a cycle w1w2 . . . wk with 3 ∤ k. If it does not contain the vertex V, then it is a cycle in the original graph as well. If it does, then we proceed analogously as above: we consider two cycles whose lengths sum to 3n + 2k, so the length of at least one of them is not divisible by three. We have found the wanted cycle in every case, which finishes the proof. □

Adjacency matrix

The adjacency matrix of an (undirected) graph G = (V, E) is defined as follows. Fix a (total) ordering of its vertices, V = (v1, . . . , vn), and define the matrix AG = (aij) over Z2 (the entries are zeros and ones) by

aij = 1 if the edge eij = {vi, vj} ∈ E, and aij = 0 otherwise.

It is recommended to write out explicitly the adjacency matrices of the graphs mentioned at the beginning of this chapter! By definition, adjacency matrices are symmetric. There are straightforward generalizations of this concept for more general graphs: for oriented edges, the direction may be indicated by the sign; multiple edges might be encoded by appropriate integers, etc.

If the matrix is stored in a two-dimensional array, then this method of graph representation is very inefficient: it consumes O(n²) memory. If the graph is sparse, i.e. it has only a few edges, then almost all the entries of the matrix are zeros, and there are many methods of storing such matrices more efficiently.

The matrix representation of graphs is suggestive of linear algebra considerations. For example, there is the following beautiful theorem:

Theorem. Let G = (V, E) be a graph with vertices ordered as V = (v1, . . . , vn), and let AG be its adjacency matrix. Further, let A^k_G = (a^(k)_ij) denote the k-th power of the matrix AG = (aij). Then a^(k)_ij is the number of walks of length k between the vertices vi and vj.

Proof. The proof is by induction on the length of the walks. For k = 1, the statement is simply a reformulation of the definition of the adjacency matrix. Suppose the proposition holds for a fixed positive integer k, and examine the number of walks of length k + 1 between the vertices vi and vj for some fixed indices i and j. Each such walk is obtained by attaching an edge from vi to some vertex vℓ to a walk of length k between vℓ and vj, and each walk of length k + 1 is obtained uniquely in this way. Therefore, if a^(k)_ℓj denotes the number of walks of length k from vℓ to vj, then the number of walks of length k + 1 is

a^(k+1)_ij = ∑_{ℓ=1}^{n} aiℓ · a^(k)_ℓj.

This is exactly the formula for the product of the matrix AG and the power A^k_G. It follows that the entries of the matrix A^(k+1)_G are the integers a^(k+1)_ij. □
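The theorem is easy to check numerically on the graph of 13.A.22. A minimal sketch in Python, assuming NumPy is available:

    import numpy as np

    # the adjacency matrix from 13.A.22
    A = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 1, 1],
                  [1, 1, 0, 1, 1],
                  [0, 1, 1, 0, 1],
                  [0, 1, 1, 1, 0]])

    # the entry [0, 1] of A^4 counts the walks of length 4 between the
    # vertices labeled 1 and 2
    print(np.linalg.matrix_power(A, 4)[0, 1])  # 17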
Corollary. If G = (V, E) and AG are as above, then each pair of vertices in G is connected by a path if and only if the matrix (AG + En)^(n−1) has only positive entries (En is the n-by-n identity matrix).

Proof. By the binomial theorem,

(AG + En)^(n−1) = A^(n−1)_G + \binom{n−1}{1} A^(n−2)_G + · · · + \binom{n−1}{n−2} AG + En.

The entries of the resulting matrix are (using the notation as above)

a^(n−1)_ij + · · · + \binom{n−1}{ℓ} a^(n−1−ℓ)_ij + · · · + (n − 1) aij + δij,

where δii = 1 for all i, and δij = 0 for i ≠ j. This gives the sum of the numbers of walks of lengths 0, . . . , n − 1 between the vertices vi and vj, multiplied by positive constants. Therefore, the entry is non-zero if and only if there is a path between these vertices. □

13.1.9. Remark. Observe how permuting the vertices of V affects the adjacency matrix of the corresponding graph. It is not hard to see that each such permutation permutes both the rows and the columns of the matrix AG in the same way. Such a permutation can be given uniquely by a permutation matrix, each of whose rows and columns contains only zeros except for one entry, which equals 1. If P is such a permutation matrix, then the adjacency matrix of the isomorphic graph G′ is AG′ = P · AG · P^T, where the dots stand for matrix multiplication.
The transposed matrix P^T is also the inverse matrix of P, since permutation matrices are orthogonal. Every permutation can be written as a composition of transpositions; hence every permutation matrix can be obtained as the product of the matrices corresponding to transpositions. Of course, this is exactly how the matrices of linear mappings change under a change of basis. Understanding the adjacency matrix as a linear mapping is often useful. For example, the adjacency matrix may be thought of as acting on vectors of zeros and ones (imagine the ones indicating the active vertices of interest) and yielding vectors of integers (showing how many times the given vertices are reached from the active vertices along the edges in one step). This observation also shows that the question whether two adjacency matrices describe isomorphic graphs is equivalent to asking whether the matrices are equivalent via a permutation matrix P.

13.1.10. Connected components of a graph. Every graph G = (V, E) naturally partitions into disjoint subgraphs Gi such that two vertices v ∈ Gi and w ∈ Gj are connected by a path if and only if i = j. This can be formalized as follows. Let G = (V, E) be an undirected graph. Define a relation ∼ on the set V by setting v ∼ w for vertices v, w ∈ V if and only if there exists a path from v to w in G. This relation is clearly a well-defined equivalence relation. Every class [v] of this equivalence determines the induced subgraph G[v] ⊆ G, and the (disjoint) union of these subgraphs gives the original graph G. According to the definition of the equivalence relation, no edge of the original graph can connect vertices that belong to different components.

B. Fundamental algorithms

Let us begin with breadth-first search and depth-first search, which serve as a basis for more sophisticated algorithms. Their actual implementations may differ; therefore, the answers to the following problems may be ambiguous.

13.B.1. Consider a graph on six vertices which are labeled 1, 2, . . . , 6. A pair of vertices is connected with an edge if and only if the sum of their labels is odd. Describe the run of the breadth-first search algorithm on this graph. Which edge is visited last, provided the search is initiated at vertex 5 and the neighbors of a given vertex are visited in ascending order? Solution. The algorithm starts at vertex 5 and goes along the edges (5, 2), (5, 4), (5, 6), thereby visiting the vertices 2, 4, 6 (the queue of vertices to be processed is 2, 4, 6). The first vertex to have been visited is 2, so the algorithm continues the search from there, i.e., vertex 5 becomes processed and vertex 2 becomes active. The algorithm goes along the edges (2, 1), (2, 3), (2, 5) (the last one has already been used), thereby visiting the vertices 1 and 3 (the queue of vertices to be processed is 4, 6, 1, 3). Now, vertex 2 becomes processed and the first unprocessed vertex to have been visited becomes active; that is vertex 4. The algorithm discovers the edges (4, 1) and (4, 3), yet no new vertices. Vertex 4 becomes processed and vertex 6 becomes active. This leads to the discovery of the edges (6, 1) and (6, 3). If the algorithm knows the number of edges in the graph, it terminates at this moment. Otherwise, it goes through the vertices 1 and 3, finding out that there are no new edges or vertices, and then it terminates. In either case, the last edge to have been discovered is (3, 6). □
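The graph of 13.B.1 is the complete bipartite graph K3,3 on the odd and even labels, and the run of the search is easy to reproduce. A minimal sketch in Python of the breadth-first variant; note that it records only the tree edges by which new vertices are discovered, not every inspected edge:

    from collections import deque

    # vertices 1..6, an edge iff the sum of the labels is odd;
    # adjacency lists are kept in ascending order
    adj = {u: [v for v in range(1, 7) if v != u and (u + v) % 2 == 1]
           for u in range(1, 7)}

    def bfs(start):
        visited, order = {start}, []
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    order.append((u, v))   # tree edges in discovery order
                    queue.append(v)
        return order

    print(bfs(5))  # [(5, 2), (5, 4), (5, 6), (2, 1), (2, 3)]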
13.B.3. Let the vertices of the graph K6 be labeled 1, 2, . . . , 6. Write the order of the edges of K6 in which they are visited by the depth-first search algorithm, supposing the search is initiated

According to the definition of an equivalence relation, no edge of the original graph can connect vertices that belong to different components. The subgraphs G[v] are called connected components of the graph G. A graph G = (V, E) is said to be connected if and only if it has exactly one connected component.

If the graph G is directed, then the definition is analogous to the case of undirected graphs – it is only required that there exist both paths from v to w and from w to v in order for the pair (v, w) to be related. Using this definition, strongly connected components can be discussed. On the other hand, it may only be required that the symmetrization of the graph be connected (in the undirected sense); then weak connectedness can be discussed.

13.1.11. Multiply connected graphs. It is useful to consider the concept of connectedness in a much stronger sense, i.e. to enforce a certain redundancy in the number of paths between vertices.

Definition. A graph G = (V, E) is said to be
• k-vertex-connected if and only if it has at least k + 1 vertices and remains connected whenever any k − 1 vertices are removed;
• k-edge-connected if and only if it has at least k edges and remains connected whenever any k − 1 edges are removed.

In the case k = 1, the definition simply says that the graph is connected (in both cases) since the condition is vacuously true. Stronger graph connectedness is desirable with any networks supporting some services (roads, pipelines, internet connection, etc.) where the clients prefer considerable redundancy of the provided service for the case that several connections in the network (i.e. edges in a graph) or nodes in the network (vertices in a graph) break down.

In general, Menger's theorem³ holds. It says that for every pair of vertices v and w, the number of pairwise edge-disjoint paths from v to w equals the minimum number of edges that must be removed so as to leave v and w in different components of the new graph.
Similarly, the number of pairwise vertex-disjoint paths from v to w equals the number of vertices that must be removed in order to disconnect v from w. We return to this topic in subsection 13.2.13. Right now, we consider the simplest interesting case in detail. These are graphs (on at least three vertices) such that deleting any one vertex does not destroy the connectedness.

Theorem. If G = (V, E) has at least three vertices, then the following conditions are equivalent:
• G is 2-vertex-connected;
• every pair of vertices v and w in G lies on a common cycle;
• the graph G can be constructed from the triangle K3 by repeatedly adding and splitting edges.

³Karl Menger proved this as early as in 1927; that is, before graph theory came into being.

from vertex 3 and the neighbors of a given vertex are visited in ascending order. ⃝

13.B.4. Let the vertices of the graph K6 be labeled 1, 2, . . . , 6. Write the order of the edges of K6 in which they are visited by the breadth-first search algorithm, supposing the search is initiated from vertex 3 and the neighbors of a given vertex are visited in ascending order. ⃝

13.B.5. Apply Dijkstra's algorithm to find the shortest path from vertex number 9 to each of the other vertices.

picture skipped

⃝

13.B.6. Give an example of
i) a graph on at least 4 vertices which does not contain a negative cycle, yet Dijkstra's algorithm fails on it;
ii) a graph on at least 4 vertices which contains a negative edge, yet Dijkstra's algorithm succeeds on it.

Solution. In both cases, we must be well aware of how Dijkstra's algorithm works. Then, it is easy to find the wanted examples (apparently, there are many more possibilities). As for the first problem, we can consider the following graph (where S is the initial vertex):

picture skipped

If Dijkstra's algorithm is run from S, then it visits the vertex A and fixes its distance from S to 1. However, there is a shorter path, namely the path (S, B, C, A) of length 0. As for the second problem, consider the following:

picture skipped

□

Bellman–Ford algorithm. This algorithm is based on the same principle as Dijkstra's. However, instead of going through particular vertices, it processes them "simultaneously" – the relaxation loop (i.e., finding out whether the temporary distances of the vertices can be improved using a given edge) is iterated (|V| − 1) times over all edges.

Proof. If the second proposition is true, there are at least two different paths between any two vertices. So deleting a vertex cannot destroy the connectedness, and the first proposition follows. Conversely, suppose the first proposition is true. Proceed by induction on the minimal length of a path between v and w. Suppose first that the vertices are the endpoints of an edge e, and that the shortest path is of length 1. If removing the edge e splits the graph into two components, then this would also occur if the vertex v is removed or if the vertex w is removed. Therefore, the graph is connected even without the edge e, so there is a path between v and w. This path, together with the edge e, forms a cycle. For the induction hypothesis, assume that such a shared cycle is constructed for all pairs of vertices connected by a path whose length does not exceed k.
Consider vertices v, w and one of the shortest paths between them: (v = v0, e1, v1, e2, . . . , vk+1 = w) of length k + 1. Then, v1 and w can be connected by a path of length at most k, hence they lie on a common cycle. Denote by P1 and P2 the corresponding two disjoint paths between v1 and w. Now, the graph G \ {v1} is also connected, so there exists a path P from v to w which does not go through the vertex v1, and this path must at some point meet one of the paths P1, P2. Without loss of generality, suppose that this occurs on the path P1, at a vertex z. Now, the cycle can be built: it consists of the part of P from v to z, the part of P1 from z to w, and P2 (directed the other way) from w to v1, closed by the edge e1 back to v (draw a diagram!). It follows that the second proposition is a consequence of the first proposition, and hence the first condition is equivalent to the second one.

Suppose the third proposition is true. Neither splitting an edge nor adding a new one in a 2-vertex-connected graph destroys the 2-connectedness. So the first proposition follows from the third proposition. It remains to prove that the third proposition follows from the first proposition. From the first proposition, G is 2-connected, so there exists a cycle, which can be obtained from K3 by splitting edges. Consider the subgraph G′ = (V′, E′) determined by this cycle, and consider an edge e = {v, w} ∉ E′ such that one of its endpoints lies in V′. If both of its endpoints lie there, a new edge can simply be added to the graph G′, which leads to the subgraph (V′, E′ ∪ {e}) in G, which contains more edges than the graph G′. Consider the remaining possibility, i.e. v ∈ V′ while w ∉ V′. Since G is 2-connected, it remains connected even if the vertex v is removed, and it contains a shortest path P between the vertex w and some vertex (denote it v′) in G′ (apart from the removed vertex v) and containing no other vertex from V′.

The advantage is that this approach works even with negative edges, and it is able to detect negative cycles (if another iteration of the relaxation loop leads to a change, then there must be a negative cycle in the graph). However, we pay for that with increased time complexity.

13.B.7. Use the Bellman–Ford algorithm to find the shortest paths from the vertex S to all other vertices. Assume that the edges are ordered by the number of the tail (or head) and the initial vertex is the least one. Then, change the value of the edge (8, 6) from 18 to −18, execute the algorithm on this new graph, and show the detection of negative cycles.

picture skipped

Solution. According to the conditions, the edges are visited in the following order: (S,4), (S,7), (1,2), (1,5), (2,1), (2,3), (2,6), (3,7), (4,7), (4,8), (5,1), (5,6), (6,2), (6,5), (7,8), (8,6). The vertex distances develop as follows (potentially higher values computed earlier during the same iteration are written in parentheses):

    it. |  S    1       2    3    4   5    6    7     8
     1  |  0    ∞       ∞    ∞    1   ∞   22   3(6)   4
     2  |  0    ∞      23    ∞    1  24   22   3      4
     3  |  0   25(30)  23   26    1  24   22   3      3
     4  |  0   25      23   26    1  24   22   3      3

Since the fourth iteration does not lead to any change, we can terminate the algorithm at this moment.

In the changed graph, the execution is as follows (for the sake of clarity, we do not write the values of vertices that are untouched by the change):

    it. 1 |  0   ∞   ∞   ∞   1   ∞   −14   3(6)   4
    it. 2 |  −13   −12
    it. 3 |  −11(−6)   −10   −19   −2   −1
    it. 4 |  −18   −17
    it. 5 |  −16   −15   −24   −7   −6
    it. 6 |  −23   −22
    it. 7 |  −21   −20   −29   −12   −11
    it. 8 |  −28   −27
    it. 9 |  −26   −25   −34   −17   −16

The graph has 9 vertices, and since the ninth iteration changed the distance of one of the vertices, there is a negative cycle. Of course, we could have terminated the algorithm much earlier if we had noticed exactly what changes took place between the particular steps. Clearly, the values of the vertices 1, 2, 3, 5, 6, 7, 8 keep decreasing below all bounds. The algorithm can also be implemented so that it produces the tree of shortest paths and also finds the vertices lying on a negative cycle if there is one. □
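A minimal sketch of the procedure just described (our own code, with an invented toy graph): |V| − 1 rounds of the relaxation loop, plus one extra round whose success signals a negative cycle, exactly as used in 13.B.7.

    INF = float("inf")

    def bellman_ford(n, edges, source):
        """n vertices labeled 0..n-1, edges given as (tail, head, weight)."""
        dist = [INF] * n
        dist[source] = 0
        for _ in range(n - 1):                    # the relaxation loop
            for u, v, w in edges:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
        # Detection round: any further improvement means a negative cycle.
        has_negative_cycle = any(dist[u] + w < dist[v] for u, v, w in edges)
        return dist, has_negative_cycle

    # The cycle 1-2-3 has total weight -1, so it is reported as negative.
    edges = [(0, 1, 1), (1, 2, -3), (2, 3, 1), (3, 1, 1)]
    print(bellman_ford(4, edges, 0))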
Adding this path to the graph G′, together with the edge e (which can be done by adding the edge {v, v′} and splitting it into the desired number of "new" vertices and edges), a new subgraph is obtained which satisfies the requirements and contains more vertices than the considered graph G′. After a finite number of these steps, the entire graph G is built from the triangle K3, as desired. The proof is complete. □

13.1.12. Eulerian graphs. There are problems of the type "draw a graph without removing the pencil from the paper". In the language of graph theory, this can be stated as follows:

Eulerian trails
Definition. A trail which visits every edge exactly once and whose initial and terminal vertices are the same is called an Eulerian trail. Connected graphs that admit such a trail are called Eulerian graphs.

Of course, an Eulerian trail goes through every vertex at least once, but it can visit a vertex more than once. To draw a graph without removing the pencil from the paper while ending at the same point where one started means to find an Eulerian trail. The terminology refers to the classical story about the seven bridges in Königsberg. There, the task was to go for a walk and visit each of the bridges exactly once. The first proof that this is impossible is by Leonhard Euler, in 1736. The situation is depicted in the diagram. On the left, there is a sketch of the river with the islands and bridges. The corresponding multigraph is caught in the right-hand diagram. The vertices of this graph correspond to the "connected land", while the edges correspond to the bridges. If it is desired to do without the multiple edges (which have not been admitted so far), it would suffice to place another vertex inside each bridge (i.e. to split the edges with new vertices). Surprisingly, the general solution of this problem is quite simple, as shown by the following theorem. Of course, this also shows that Euler could not design the desired walk.

Eulerian graphs
Theorem. A graph G is Eulerian if and only if it is connected and all vertices of G have even degree.

Proof. If a graph is Eulerian, then for every entrance to a vertex there is an exit. Therefore, the degree of every vertex is even. More formally: consider a trail that begins and ends at a vertex v0 and passes through all edges.
Paths between all pairs of vertices. We often need to know the shortest paths between all pairs of vertices. Of course, we could apply the above algorithms to all initial vertices. However, there is a more effective method. One of the possibilities is to use the similarity with matrix multiplication, which is the basis of the Floyd–Warshall algorithm (the best-known among algorithms of the "all pairs shortest paths" type), which:
• computes the distances between all pairs of vertices in time O(n³);
• starts with the matrix U0 = A = (aij) of edge lengths (setting uii = 0 for each vertex i) and then iteratively computes the matrices U0, U1, . . . , U|V|, where uk(i, j) is the length of the shortest path from i to j such that all of its inner vertices are among {1, 2, . . . , k};
• the matrices are computed using the formula uk(i, j) = min{uk−1(i, j), uk−1(i, k) + uk−1(k, j)}.

In other words, considering the shortest path from i to j which can go only through the vertices 1, . . . , k, we can ask whether it uses the vertex k. If so, then this path consists of the shortest path from i to k and the shortest path from k to j (and these two paths use only the vertices 1, . . . , k − 1). Otherwise, the wanted path is also the shortest path from i to j which can go only through the vertices 1, . . . , k − 1. Clearly, for k = |V|, we get the shortest paths between all pairs of vertices without any restrictions. Moreover, we can maintain the so-called predecessor matrix (i.e., the predecessor of each vertex on the shortest path from each vertex) and update it as follows:
• Initialization: (P0)ij = i for i ≠ j and aij < ∞;
• In the k-th step, we update
$$(P_k)_{ij} = \begin{cases} (P_{k-1})_{kj}, & \text{if the path through } k \text{ is better,} \\ (P_{k-1})_{ij}, & \text{otherwise.} \end{cases}$$

As soon as the algorithm terminates, we can easily construct the shortest path between any pair of vertices u, v: we derive it from the matrix P = Pn = (pij) (in the reverse order) as v, w = puv, puw, . . .

13.B.8. Apply the Floyd–Warshall algorithm to the graph in the picture. Write the intermediate results into matrices. Show the detection of negative cycles. Maintain all information necessary for the construction of the shortest paths.

picture skipped

Solution. We proceed according to the algorithm, obtaining the following shortest-length matrices and predecessor matrices:

$$U_0 = \begin{pmatrix} 0 & 4 & -3 & \infty \\ -3 & 0 & -7 & \infty \\ \infty & 10 & 0 & 3 \\ 5 & 6 & 6 & 0 \end{pmatrix}, \quad P_0 = \begin{pmatrix} - & 1 & 1 & - \\ 2 & - & 2 & - \\ - & 3 & - & 3 \\ 4 & 4 & 4 & - \end{pmatrix};$$

$$U_1 = \begin{pmatrix} 0 & 4 & -3 & \infty \\ -3 & 0 & -7 & \infty \\ \infty & 10 & 0 & 3 \\ 5 & 6 & 2 & 0 \end{pmatrix}, \quad P_1 = \begin{pmatrix} - & 1 & 1 & - \\ 2 & - & 2 & - \\ - & 3 & - & 3 \\ 4 & 4 & 1 & - \end{pmatrix};$$

$$U_2 = \begin{pmatrix} 0 & 4 & -3 & \infty \\ -3 & 0 & -7 & \infty \\ 7 & 10 & 0 & 3 \\ 3 & 6 & -1 & 0 \end{pmatrix}, \quad P_2 = \begin{pmatrix} - & 1 & 1 & - \\ 2 & - & 2 & - \\ 2 & 3 & - & 3 \\ 2 & 4 & 2 & - \end{pmatrix};$$

Every vertex occurs once or more on this trail and its degree equals twice the number of its occurrences. Now suppose that all vertices of a graph G have even degree. Consider the longest possible trail (v0, e1, . . . , vk) in G where no edge occurs twice or more. First, suppose for a moment that vk ≠ v0. This would mean that the number of edges of the trail that enter or leave the vertex v0 is odd, so there must be an edge which is incident to v0 and not contained in the trail. However, then the trail can be prolonged while still using every edge of the graph at most once, which is a contradiction. Therefore, v0 = vk. Define a subgraph G′ = (V′, E′) of G as follows: It contains the vertices and edges of our fixed trail and nothing else. If V′ ≠ V, then (since the graph G is connected) there exists an edge e = {v, w} such that v ∈ V′ and w ∉ V′. However, then the trail can be "rotated" so that it begins and ends at the vertex v. It can be prolonged with the edge e, which contradicts the assumption of the greatest possible length. Therefore, V′ = V. It remains to show that E′ = E. So suppose there is an edge e = {v, w} ∉ E′. As above, the trail can be rotated so that it begins and ends at the vertex v, and then it can be continued with the edge e – a contradiction. □
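The computation of 13.B.8 can be replayed mechanically. The sketch below is our own illustration of the update formula together with the predecessor matrix; the input is the matrix U0 above (with 0-based indices in the code).

    INF = float("inf")

    def floyd_warshall(a):
        # a[i][j]: edge length, a[i][i] == 0, INF where there is no edge.
        n = len(a)
        u = [row[:] for row in a]
        pred = [[i if i != j and a[i][j] < INF else None for j in range(n)]
                for i in range(n)]
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    if u[i][k] + u[k][j] < u[i][j]:
                        u[i][j] = u[i][k] + u[k][j]    # the path through k is better
                        pred[i][j] = pred[k][j]
        return u, pred

    a = [[0, 4, -3, INF],
         [-3, 0, -7, INF],
         [INF, 10, 0, 3],
         [5, 6, 6, 0]]
    u, pred = floyd_warshall(a)
    print(u[2][0], pred[2][0])                    # 6 and 1, i.e. U4[3,1] = 6, predecessor 2
    print(all(u[i][i] >= 0 for i in range(4)))    # True: no negative cycle on the diagonal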
Corollary. A connected graph can be drawn without removing the pencil from the paper if and only if there are either no vertices of odd degree or exactly two of them.

Proof. Let G be a connected graph with exactly two odd-degree vertices. Construct a new graph G′ by attaching a new vertex w to the original graph G and connecting it to both of the odd-degree vertices. This graph is Eulerian, and an Eulerian trail in it leads to the desired result. On the other hand, if a graph G can be drawn in the desired way, then the graph G′ is necessarily Eulerian, so the degrees of the vertices in G are as stated. □

The situation for directed graphs is similar. A directed graph is called balanced if and only if the outgoing and incoming degrees coincide, i.e. deg⁺(v) = deg⁻(v) for all vertices v.

Proposition. A directed graph G is Eulerian if and only if it is balanced and its symmetrization is connected (i.e. the graph G is weakly connected).

Proof. The proof is analogous to the undirected case. (Work out the details yourself!) □

13.1.13. Hamiltonian cycles. Find a walk or cycle that visits every vertex of a graph G exactly once. Of necessity, such a walk can visit every edge at most once. Such a cycle is called a Hamiltonian cycle in the graph G. A graph is called Hamiltonian if and only if it contains a Hamiltonian cycle. This problem seems to be very similar to the above one of visiting every edge exactly once. But while the problem of deciding the existence of an Eulerian trail is trivial, the problem of deciding whether a graph is Hamiltonian is NP-complete. Of course, this problem can be solved by "brute force". Given a graph on n vertices, generate all n! possible orders of the n vertices, and for each of them, verify whether it is a cycle in G.

$$U_3 = \begin{pmatrix} 0 & 4 & -3 & 0 \\ -3 & 0 & -7 & -4 \\ 7 & 10 & 0 & 3 \\ 3 & 6 & -1 & 0 \end{pmatrix}, \quad P_3 = \begin{pmatrix} - & 1 & 1 & 3 \\ 2 & - & 2 & 3 \\ 2 & 3 & - & 3 \\ 2 & 4 & 2 & - \end{pmatrix};$$

$$U_4 = \begin{pmatrix} 0 & 4 & -3 & 0 \\ -3 & 0 & -7 & -4 \\ 6 & 9 & 0 & 3 \\ 3 & 6 & -1 & 0 \end{pmatrix}, \quad P_4 = \begin{pmatrix} - & 1 & 1 & 3 \\ 2 & - & 2 & 3 \\ 2 & 4 & - & 3 \\ 2 & 4 & 2 & - \end{pmatrix}.$$

Since there is no negative number on the diagonal of U4, there is no negative cycle in the graph. Suppose we would like to find the shortest path from vertex 3 to vertex 1, for instance: The predecessor of 1 is P4[3, 1] = 2, and the predecessor of 2 is P4[3, 2] = 4. Therefore, the wanted path is 3, 4, 2, 1 and its length is U4[3, 1] = 6. □

Hamiltonian graphs. Deciding whether a given graph is Hamiltonian is an NP-complete problem. Therefore, it might be useful to have some simpler necessary or sufficient conditions for this property at our disposal. We mention three sufficient conditions: Dirac's, Ore's, and the Bondy–Chvátal theorem.

Dirac: Let a graph G with n ≥ 3 vertices be given. If each vertex of G has degree at least n/2, then G is Hamiltonian.

Ore: Let a graph G with n ≥ 3 vertices be given. If the sum of the degrees of each pair of non-adjacent vertices is at least n, then G is Hamiltonian.

The closure of a graph G is the graph cl(G) obtained from G by repeatedly adding an edge {u, v} such that u, v have not been adjacent and deg(u) + deg(v) ≥ n, until no such pair of vertices u, v exists.

Bondy, Chvátal: A graph G is Hamiltonian if and only if cl(G) is.

13.B.9. Prove the Bondy–Chvátal theorem.

Solution. Clearly, it suffices to prove that if G is Hamiltonian after the addition of an edge {u, v} such that u, v have not been adjacent and deg(u) + deg(v) ≥ n, then it is already Hamiltonian without this edge. Suppose that G + {u, v} is Hamiltonian, but G is not. Then, there exists a Hamiltonian path from u to v in G.
It must hold for each vertex adjacent to u that its predecessor on this path is not adjacent to v (otherwise, there would be a Hamiltonian cycle in G). Therefore, deg(u) + deg(v) ≤ n − 1, which contradicts the assumption deg(u) + deg(v) ≥ n. □

13.B.10. i) Prove that the Bondy–Chvátal theorem implies Ore's and that Ore's implies Dirac's.
ii) Give an example of a Hamiltonian graph which satisfies Ore's condition but not Dirac's.
iii) Give an example of a Hamiltonian graph whose closure is not a complete graph.

Solution.

This problem forms a vital field of research. For instance, in 2010, A. Björklund published a randomized algorithm based on the Monte Carlo method, which counts the number of Hamiltonian cycles in a graph on n vertices in time O(1.657ⁿ).⁴ Finding Hamiltonian cycles is desired in many problems related to logistics. For example, finding optimal paths for goods delivery.

13.1.14. Trees. Sometimes it is desired to minimize the number of edges in the graph while keeping it connected. Of course, this is possible if and only if there is at least one cycle in the graph. The graphs without cycles are extremely important, as we shall see below.

Forests, trees, leaves
A connected graph which does not contain a cycle is called a tree. A graph which does not contain a cycle is called a forest (a forest is not required to be connected). Every vertex of degree one in any graph is called a leaf.

The definition suggests an easily memorable "theorem": A tree is a connected forest.

Lemma. Every tree with at least two vertices contains at least two leaves. For any graph G with a leaf v, the following propositions are equivalent:
• G is a tree;
• G \ v is a tree.

Proof. Let P = (v0, . . . , vk) be (any) longest possible path in a tree G. If the vertex v0 is not a leaf, then there is an edge e incident to it whose other endpoint v is not in P (otherwise, this would form a cycle in the tree). Then the path P could be prolonged with this edge, which contradicts "longest". So the vertex v0 is a leaf. The proof for the vertex vk is similar. Thus, if the longest path is not trivial, then it must contain two leaves v0 and vk.

Next, consider any leaf v of a tree G. Consider any two other vertices w, z in G. There exists a path between them, and no vertex on this path has degree one. Therefore, this path remains the same in G \ v. Hence the graph remains connected even after removing the vertex v. There is no cycle, since G \ v is constructed by removing a vertex from a tree. Conversely, if G \ v is a tree, then adding a vertex of degree 1 cannot create a cycle. The resulting graph is evidently connected. □

Trees can be characterized by many equivalent and useful properties. Some of them appear in the following theorem, which is more difficult to formulate than to prove.

⁴Björklund, Andreas (2010), "Determinant sums for undirected Hamiltonicity", Proc. 51st Annual Symposium on Foundations of Computer Science (FOCS '10), pp. 173–182, arXiv:1008.0541, doi:10.1109/FOCS.2010.24.

i) If a graph G satisfies Ore's condition, then its closure is a complete graph, which is Hamiltonian, of course. By the Bondy–Chvátal theorem, the original graph is Hamiltonian as well. Further, if G satisfies Dirac's condition, then it clearly satisfies Ore's as well and thus is Hamiltonian.

ii) Consider the following example:

picture skipped

The degree of vertex 5 is 2, which is less than 5/2. The sum of the degrees of any pair of (not only non-adjacent) vertices is at least 5.

iii) The wanted conditions are satisfied by the cycle graphs Cn, n > 4, for which cl(Cn) = Cn. □
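The closure from the Bondy–Chvátal theorem is easy to compute. The following sketch (our own code, on 0-based vertices) verifies the claim of part iii) for C5 and, for contrast, closes C4 into K4.

    from itertools import combinations

    def closure(n, edges):
        # Repeatedly join non-adjacent u, v with deg(u) + deg(v) >= n.
        E = {frozenset(e) for e in edges}
        changed = True
        while changed:
            changed = False
            deg = {v: sum(v in e for e in E) for v in range(n)}
            for u, v in combinations(range(n), 2):
                if frozenset((u, v)) not in E and deg[u] + deg[v] >= n:
                    E.add(frozenset((u, v)))
                    changed = True
                    break          # recompute the degrees before continuing
        return E

    c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    print(len(closure(5, c5)))     # 5 -- cl(C5) = C5, as in iii)
    c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(len(closure(4, c4)))     # 6 -- cl(C4) = K4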
Planar graphs.

13.B.11. Decide whether the graph in the picture is planar.

Solution. By the Kuratowski theorem (see page 1162), this graph is not planar, since one of its subgraphs is a subdivision of K3,3. □

13.B.12. Decide whether there is a graph with degree sequence (6, 6, 6, 7, 7, 7, 7, 8, 8, 8). If so, is there a planar graph with this degree sequence? ⃝

13.B.13. What is the minimum number of edges of a hexahedron?

Solution. In any polyhedron, every face is bounded by at least three edges. At the same time, every edge belongs to two faces. If f is the number of faces and e the number of edges of the polyhedron, then we have 3f ≤ 2e (see also 13.1.20). For a hexahedron, this bound yields 18 ≤ 2e, i.e., e ≥ 9. Indeed, there exists a hexahedron with nine edges. It can be obtained by "gluing" two identical regular tetrahedra together along one face. Therefore, the minimum number of edges of a hexahedron is nine. □

13.1.15. Theorem. Let G = (V, E) be a graph. The following conditions are equivalent:
(1) G is a tree.
(2) For every pair of vertices v, w, there is exactly one path from v to w.
(3) G is connected but ceases to be so if any edge is removed.
(4) G does not contain a cycle, but the addition of any edge creates one.
(5) G is a connected graph, and the numbers of its vertices and edges satisfy |V| = |E| + 1.

Proof. The properties 2–5 are satisfied in every tree. Indeed, by the previous lemma, every tree which has at least two vertices has a leaf v, and it continues to be a tree when this leaf v is removed. Therefore, it suffices to show that if any of the statements 2–5 is true for a given tree, then it remains true when a leaf is added to the tree as well. This is clear.

In the case of properties 2 and 3, the graph is connected, and their formulation directly excludes the existence of cycles. As for the fourth property, it suffices to verify that G is connected. However, any two vertices v, w in G are either connected with an edge, or adding this edge to the graph creates a cycle, so there exists a path between them even without this edge. The last implication can be proved by induction on the number of vertices. Suppose that all connected graphs on n vertices and n − 1 edges are trees. The sum of the vertex degrees of any connected graph on n + 1 vertices and n edges is 2n, so the graph must contain a leaf. It follows from the induction hypothesis that this graph can be constructed by attaching a leaf to a tree; hence it is also a tree. □

13.1.16. Rooted trees, binary trees, and heaps. Trees are often suitable structures for data storage. They permit basic database operations (e.g. finding a particular piece of information) to be performed efficiently. Since there is no cycle in a tree, fixing one vertex vr defines the orientation of all edges. For every vertex v, there is exactly one path from vr to v, so the orientation can be defined accordingly. Since there are no cycles, it is impossible for two such paths to force both orientations of a particular edge. If one of the vertices of a tree is fixed, the situation is similar to a real tree in nature – there is a distinctive vertex which "grows from the ground". Trees with a fixed distinguished vertex vr are called rooted trees, and vr is said to be the root of the tree.
In a rooted tree, the terms successor and predecessor of a vertex are defined as follows: a vertex w is a successor of v (and v is a predecessor of w) if and only if the path from the root of the tree to the vertex w goes through v and v ≠ w. If the vertices are directly connected with an edge, we can talk about a direct successor and a direct predecessor.

13.B.14. Decide whether the given planar graph is maximal. Add as many edges as possible while keeping the graph planar.

Solution. The graph has 14 vertices and 20 edges, hence 3|V| − 6 − |E| = 16. Therefore, it is not maximal, and 16 edges can be added so that it is still planar.

picture skipped

Ten (dashed) edges have been added. For the sake of clarity, the other 6 edges that connect the vertices of the "outer" 9-gon are not drawn. □

13.B.15. Prove or disprove each of the following propositions.
i) Every graph with fewer than 9 edges is planar.
ii) Every graph which is not planar is not Hamiltonian.
iii) Every graph which is not planar is Hamiltonian.
iv) Every graph which is not planar is not Eulerian (see 13.1.12).
v) Every graph which is not planar is Eulerian.
vi) Every Hamiltonian graph is planar.
vii) No Hamiltonian graph is planar.
viii) Every Eulerian graph is planar.
ix) No Eulerian graph is planar.
⃝

Trees.

13.B.16. Determine the code of the following graph as a
i) plane tree,
ii) tree.

picture skipped

More often, they are called a child and a parent (motivated by genealogical trees). The most common data structures are the binary trees, which are special cases of rooted trees: there, every vertex has at most two children (sometimes, the term binary tree implies that every vertex is either a leaf or has exactly two children; to avoid ambiguity, such trees are often called full binary trees). Such trees are very useful in search procedures. If the vertices are associated with keys from a totally ordered set (e.g. the integers), the search for the vertex with a given key is performed by following the path from the root of the tree towards that vertex. At every vertex, we compare its key to the desired one. This decides whether one continues to the left or to the right, or stops the search if the key is found. If this algorithm is to be correct, one of the children with all its successors must have lower keys than the other child and all its successors. In order for the search to be efficient, some effort must be made to keep the binary trees balanced, with the lengths of the paths from the root to the leaves differing by at most one. The most unfortunate example of a binary tree on n vertices is the path graph Pn (which may be formally considered a binary tree), while the most desired case is the perfect complete binary tree, where every vertex that is not a leaf has exactly two children, and all leaves are at the same level. Such a tree can be constructed only when the number of vertices is of the form n = 2^k − 1, k = 1, 2, . . . . Therefore, in a balanced tree, finding the vertex with a given key can be done in O(log₂ n) steps. Such trees are often called binary search trees. Work out as an exercise how to efficiently perform basic operations over binary search trees (additions and removals of the vertex with a given key, as well as how to keep the tree balanced).

An extraordinarily useful example of binary trees is the structure of a heap. It is a full balanced binary tree, where the keys are either strictly decreasing along each path from the root (the so-called max heap), or they are increasing (the min heap). Because of this ordering along the paths in a max heap, the maximum key of the heap can be found in constant time and removed in logarithmic time (similarly with the minimum in the min heap). The desired maximum sits at the root, and after removing it, we need to rebalance the shape of the heap. Prove that this is possible in logarithmic time yourself!
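Python's standard heapq module implements exactly such an array-based binary min heap; a quick illustration (ours, with made-up keys) of the promised complexities:

    import heapq

    h = []
    for key in [17, 3, 25, 1, 100, 19, 36, 2, 7]:
        heapq.heappush(h, key)       # O(log n) per insertion
    print(h[0])                      # 1 -- the minimal key sits at the root, O(1)
    print(heapq.heappop(h))          # 1 -- removal rebalances one root-leaf path, O(log n)
    print(heapq.heappop(h))          # 2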
picture skipped

The left-hand diagram shows a binary search tree. In the right-hand diagram, there is a max heap.

Much literature is devoted to trees, their applications and miscellaneous variations.

Solution.
i) Using the procedure from 13.1.18, we get the following code of the plane tree: 0 0 0001100100101111 1 0 0101000010101111 1 1. The highlighted vertex in the graph is indeed the appropriate candidate to be the root, since it is the only element of the center of the tree.
ii) As for the unique construction of a plane tree, we sort the descendants lexicographically in ascending order. Thus, the wanted code is 0000001010111101011 0000010110110011111. □

13.B.17. For each of the following codes, decide whether there exists a tree with this code. If so, draw the corresponding tree.
• 00011001111001,
• 00000110010010111110010100001010111111.
⃝

Huffman coding. We are working with plane binary trees where every edge is colored with a symbol of an output alphabet A (we often have A = {0, 1}). The codewords C are those words over the alphabet A to which we translate the symbols of the input alphabet. Our task is to represent a given text using suitable codewords over the output alphabet. We can easily see that it makes sense to require that the set of codewords be a prefix code (i.e., no codeword is a prefix of another one); otherwise, we could get into trouble when decoding. We will use binary trees for the construction of binary prefix codes (i.e. over the alphabet A = {0, 1}). We label the edges going from each vertex by 0 and 1. Further, we label the leaves of our tree with the symbols of the input alphabet. This results in a prefix code over A for these symbols by concatenating the edge labels along the path from the root to the corresponding leaf. Clearly, this code is a prefix code. Moreover, if we take into account the frequencies of particular symbols of the input alphabet in the input text, we obtain a lossless data compression.

13.1.17. Remarks on sorting. Suppose it is required to distinguish all different sortings of n elements, thus distinguishing among n! different objects. If there is no information other than comparing the order of two single elements, then the tree of all possible decision paths can be written down. The sorting provides a path through this binary tree. As seen, any binary tree of depth h (i.e. h − 1 is the length of the longest path from the root to a leaf) has at most 2^(h−1) leaves. It follows that a tree of depth h satisfying 2^(h−1) ≥ n! is needed. Consequently, the depth h satisfies h log 2 > log n!, and
$$\log n! = \log 1 + \log 2 + \cdots + \log n > \int_1^n \log x \,\mathrm{d}x = n \log n - (n-1),$$
hence
$$h > \frac{n \log n - n}{\log 2} > n \log n - n.$$
It is proved that the depth of the necessary binary tree is bounded from below by an expression of size n log n. Hence no algorithm based only on the comparison of two elements of the ordered set can have a better worst-case run time than O(n log n).

The latter claim is not true if there is further relevant information. For example, if it is known that only a finite number of k values may appear among our n elements, then one may simply run through the list, counting how many occurrences of the individual values there are, and hence write the correctly ordered list from scratch. This all happens in linear time!
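A sketch of this linear-time trick (our own code): with only k admissible values, one counting pass replaces all comparisons.

    def counting_sort(items, values):
        """values: the k admissible values, listed in the desired order."""
        counts = {v: 0 for v in values}
        for x in items:                 # one linear pass to count occurrences
            counts[x] += 1
        out = []
        for v in values:                # rewrite the ordered list from scratch
            out.extend([v] * counts[v])
        return out

    print(counting_sort([2, 0, 1, 2, 0, 2], [0, 1, 2]))   # [0, 0, 1, 2, 2, 2]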
13.1.18. Tree isomorphisms. Simple features of trees are exploited in order to illustrate the (generally difficult) problem of graph isomorphism on this special class of graphs. First, we strengthen the structure to be preserved by the isomorphisms. Then we show that the obtained procedure is also applicable to the most general trees. In order to keep more information about the structure of rooted trees, remember the parent–child relations. Also have the children of every node sorted in a specific order (for instance, from left to right if drawn on a diagram). Such trees are called ordered trees or plane trees. They are formally defined as a tuple T = (V, E, vr, ν), where ν is a partial order on the edges such that a pair of edges is comparable if and only if they have the same tail (i.e. they all go from one parent vertex to all its children).

A homomorphism of rooted trees T = (V, E, vr) and T′ = (V′, E′, v′r) is a graph morphism φ : T → T′ such that vr is mapped to v′r; similarly for isomorphisms. For plane trees, it is further required that the morphism preserve the partial orders ν and ν′.

Let M be the list of frequencies of the symbols of the input alphabet in the input text. The algorithm constructs the optimal binary tree (the so-called minimum-weight binary tree) and the assignment of the symbols to the leaves.
• Select the two least frequencies w1, w2 from M. Create a tree with two leaves labeled by the corresponding symbols and the root labeled by w1 + w2; then replace the values w1, w2 with the new value w1 + w2 in M.
• Repeat the above step; if the selected value from M is a sum, then simply "connect" the existing subtree.
• The code of each symbol is determined by the path from the root to the corresponding leaf (left edge = "0", right edge = "1", for instance).

13.B.18. Find the Huffman code for the input alphabet with the frequencies ['A': 16, 'B': 13, 'C': 9, 'D': 12, 'E': 45, 'F': 5].

Solution. If we naively assign a 3-bit code to each letter of the alphabet, then this message of length 100 consumes 300 bits. We show that the Huffman code is more succinct. We build the tree according to the algorithm.

picture skipped (the tree merges F:5 and C:9 into FC:14, then FC with A:16 into AFC:30, D:12 and B:13 into BD:25, BD with AFC into ABCDF:55, and finally E:45 with ABCDF:55)

We have thus obtained the codes A: 111, B: 100, C: 1101, D: 101, E: 0, F: 1100. Multiplying the code lengths by the frequencies, we can see that a 100-letter message with the given distribution of letters is encoded into only 3·16 + 3·13 + 4·9 + 3·12 + 1·45 + 4·5 = 224 bits. □
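The construction of 13.B.18 can be scripted directly; the sketch below (our own code) uses a heap for the repeated selection of the two least frequencies. Tie-breaking, and hence the individual codewords, may differ from the tree above, but the optimal total of 224 bits is the same.

    import heapq
    from itertools import count

    def huffman(freqs):
        tiebreak = count()                    # avoids comparing trees on ties
        heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)   # the two least frequencies
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")   # left edge = "0"
                walk(tree[1], prefix + "1")   # right edge = "1"
            else:
                codes[tree] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    freqs = {"A": 16, "B": 13, "C": 9, "D": 12, "E": 45, "F": 5}
    codes = huffman(freqs)
    print(sum(len(codes[s]) * w for s, w in freqs.items()))   # 224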
C. Minimum spanning tree

13.C.1. How many spanning trees (see 13.2.6) of the graph K5 are there? And how many are there if we do not distinguish isomorphic ones?

Solution. There are three pairwise non-isomorphic spanning trees (with degree sequences (1, 2, 2, 2, 1), (1, 2, 3, 1, 1), and (4, 1, 1, 1, 1)). The corresponding classes of isomorphic spanning trees have $5 \cdot \binom{4}{2} \cdot 2$, 5·4·3, and 5 elements, respectively. Altogether, there are 125 = 5³ spanning trees, which is in accordance with Cayley's formula for the number of spanning trees of a complete graph (see 13.4.11). □

Coding the plane trees
Given a plane tree T = (V, E, vr, ν), it has a code W by strings of ones and zeros, defined recursively as follows: Start with the word 01 for the root v0 and write W = 0W₁ . . . Wℓ1, where the Wi are the ℓ still unknown words for the subtrees rooted by the children of v0. In particular, the code of the tree with just one vertex is W = 01. Applying the same procedure recursively over the children and concatenating the results defines the code.

The tree in the left-hand diagram above (at the end of 13.1.16) is encoded as follows (the children of a vertex are ordered from left to right, Wr stands for the code of the child with key r):

0W₃W₈1 → 0 0W₁W₅1 0W₉1 1 → 0 0 01 0W₄W₆1 1 0 0W₁₁1 1 1 → 000100101110001111.

Imagine drawing the entire plane tree with one move of the pencil, starting with an arrow ending in the root and going downwards with arrows towards the leaves and then upwards to the predecessors, reaching consecutively all the leaves from the left to the right, and writing 0 when going down and 1 when going up. The very last arrow then leaves the root upwards.

Theorem. Two plane trees are isomorphic if and only if their codes are the same.

Proof. By construction, two isomorphic trees are assigned the same code. It remains to show that different codes lead to non-isomorphic plane trees. This is proved by induction on the length of the code (i.e. the number of zeros and ones). This length is 2(|E| + 1), i.e., twice the number of vertices; therefore, the proof can be viewed as an induction on the number of vertices of the tree T. The shortest code corresponds to the smallest tree on one vertex. Assume that the proposition holds for all trees up to n vertices, i.e. for codes of length up to k = 2n, and consider a code of the form 0W1, where W is a code of length 2n. Find the shortest prefix of W which contains the same number of zeros and ones (when drawing a diagram of the tree, this is the first moment when we return to the root of the tree that corresponds to the code 0W1). Similarly, find the next part of the code W that contains the same number of zeros and ones, etc. Hence the code W can be written as W = W₁W₂ . . . Wℓ. By the induction hypothesis, the codes Wi correspond uniquely (up to isomorphism) to plane trees, and the order of their roots, being the children of the root of our tree T, is given uniquely by the order in the code. Therefore, the tree T is determined uniquely by the code 0W1 up to isomorphism. □

13.C.2. Let the vertices of K6 be labeled 1, 2, . . . , 6, and let every edge {i, j} be assigned the integer [(i + j) mod 3] + 1. How many minimum spanning trees are there in this graph?

Solution. There are five edges whose value is 1: four of them lie on the cycle 12451 and the remaining one is the edge 36. Therefore, they form a disconnected subgraph of the complete graph, so the spanning tree must contain at least one edge of value 2. Thus, the total weight of a minimum spanning tree is at least 4·1 + 2 = 6. And indeed, there exist spanning trees with this weight. We select all the edges of value 1 except for one that lies on the mentioned cycle and connect the resulting components 1245 and 36 with any edge of value 2. There are four such edges. Altogether, there are 4·4 = 16 minimum spanning trees. □

13.C.3. Find a minimum spanning tree of the following graph using
i) Kruskal's algorithm,
ii) Jarník's (Prim's) algorithm.
Explain why we cannot apply Borůvka's algorithm directly.

picture skipped

Solution. The spanning tree is

picture skipped

Borůvka's algorithm cannot be applied directly, since the mapping of the weights to the edges is not injective. However, this can be fixed easily by slight modifications of the weights. □
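A minimal Kruskal sketch with a union-find structure (our own code, on an invented toy graph rather than the one in the picture): edges are scanned in the order of increasing weight and accepted whenever they join two different components.

    def kruskal(n, edges):
        """edges: (weight, u, v) triples on the vertices 0..n-1."""
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x
        tree = []
        for w, u, v in sorted(edges):           # edges in order of weight
            ru, rv = find(u), find(v)
            if ru != rv:                        # joining two components: no cycle
                parent[ru] = rv
                tree.append((w, u, v))
        return tree

    edges = [(1, 0, 1), (3, 0, 2), (2, 1, 2), (2, 1, 3), (4, 2, 3)]
    print(kruskal(4, edges))   # [(1, 0, 1), (2, 1, 2), (2, 1, 3)]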
13.C.4. Consider the following procedure for finding the shortest path between a given pair of vertices in an undirected weighted graph: First, we find a minimum spanning tree. Then, we proclaim that the only path between the pair of vertices in the obtained spanning tree is the shortest one. Prove the correctness of this method, or disprove it by providing a counterexample. ⃝

13.C.5. We are given the following table of distances between world metropolises: London, Mexico City, New York, Paris,

Use the encoding of plane trees to encode any tree. Deal first with the case of rooted trees. Determine the order of the children of every vertex uniquely up to isomorphism. The order is unimportant if and only if the subgraphs determined by the respective children are isomorphic. The same construction can be used as for the plane trees, ordering the vertices lexicographically with respect to their codes (see 12.6.5 for the related concepts). This means that codes W1, W2 satisfy W1 > W2 if and only if W1 contains a one at an earlier position than W2, or W2 is a prefix of W1. The rooted tree as a whole is described by the same recursive procedure: if the children of a vertex v are coded by W1, . . . , Wℓ, then the code of the vertex v is 0W1 . . . Wℓ1, where the order is selected so that W1 ≤ W2 ≤ · · · ≤ Wℓ.

If no vertex is designated to be the root of a tree, the root can be chosen so that it is almost "in the middle" of the tree. This can be realized by assigning to every vertex of the tree an integer which describes its eccentricity. The eccentricity exT(v) of a vertex v in a graph T is defined to be the greatest possible distance between v and some vertex w in T. This concept is meaningful for all graphs; however, by the absence of cycles in trees, it is guaranteed that there are at most two vertices with the minimal eccentricity.

Lemma. Let C(T) be the set of those vertices of a tree T whose eccentricity is minimal. Then C(T) contains either a single vertex, or exactly two vertices, which are connected by an edge in T.

Proof. The claim is proved by induction, using the trivial fact that the vertex most distant from any vertex v must be a leaf. Therefore, the center of T coincides with the center of the tree T′ which is created from the tree T by removing all its leaves and the corresponding edges. After a finite number of such steps, there remains either just one vertex, or a subtree with two vertices. □

The set C(T) determined by the latter lemma is called the center of the graph, and the minimal eccentricity is called the radius of the graph. A unique (up to isomorphism) code can now be assigned to every tree. If the center of T contains only one vertex, use it as the root. Otherwise, create the codes for the two rooted subtrees of the original tree without the edge that connects the vertices of the center, and the code of T is the code of the rooted tree (T, x), where x is the vertex of the center whose subtree has the lexicographically smaller code.

Corollary. Trees T and T′ are isomorphic if and only if they are assigned the same code.
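The whole procedure fits in a few lines. The sketch below (ours, with an invented example tree) computes the center by peeling leaves, as in the proof of the lemma, and the canonical code of a rooted tree by sorting the children's codes; for simplicity, only the single-vertex-center case is rooted here. Plain string comparison of the 0/1 words agrees with the lexicographic order described above.

    def rooted_code(tree, root, parent=None):
        # 0 W_1 ... W_l 1 with the children's codes sorted in ascending order.
        subcodes = sorted(rooted_code(tree, w, root)
                          for w in tree[root] if w != parent)
        return "0" + "".join(subcodes) + "1"

    def center(tree):
        # Peel the leaves layer by layer until one or two vertices remain.
        deg = {v: len(ws) for v, ws in tree.items()}
        layer = [v for v in tree if deg[v] <= 1]
        remaining = len(tree)
        while remaining > 2:
            remaining -= len(layer)
            nxt = []
            for v in layer:
                deg[v] = 0
                for w in tree[v]:
                    if deg[w] > 1:
                        deg[w] -= 1
                        if deg[w] == 1:
                            nxt.append(w)
            layer = nxt
        return layer

    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}  # a path on 5 vertices
    c = center(path)
    print(c, rooted_code(path, c[0]))   # [2] 0001100111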
The above ideas imply that the algorithm for verifying plane tree isomorphism can be implemented in linear time with respect to the number of vertices of the trees.

Peking, and Tokyo:

          L     MC     NY      P     Pe      T
    L     –   5558   3469    214   5074   5959
    MC           –   2090   5725   7753   7035
    NY                  –   3636   6844   6757
    P                          –   5120   6053
    Pe                                –   1307

What is the least total length of wire used for interconnecting these cities (assuming the length necessary to connect a given pair of cities is equal to the distance in the table)? ⃝

13.C.6. Using the matrix variant of the Jarník–Prim algorithm, find a minimum spanning tree of the graph given by the following matrix:

$$\begin{pmatrix} - & 12 & - & 16 & - & - & - & 13 \\ 12 & - & 16 & - & - & - & 14 & - \\ - & 16 & - & 12 & - & 14 & - & - \\ 16 & - & 12 & - & 13 & - & - & - \\ - & - & - & 13 & - & 14 & - & 15 \\ - & - & 14 & - & 14 & - & 15 & - \\ - & 14 & - & - & - & 15 & - & 14 \\ 13 & - & - & - & 15 & - & 14 & - \end{pmatrix}$$

⃝

D. Flow networks

13.D.1. An example of bad behavior of the Ford–Fulkerson algorithm. The worst-case time complexity of the Ford–Fulkerson algorithm is O(|E| · |f|), where |f| is the size of a maximum flow. Consider the following network:

picture skipped (four edges of capacity 100 and a middle edge of capacity 1, with all flows initially zero)

The bad behavior of the algorithm is due to the fact that it uses depth-first search to find unsaturated paths.

Solution. We proceed strictly by depth-first search (examining the vertices from left to right and then top down):

picture skipped (the snapshots show the flow on the middle edge alternating between 1/1 and 0/1, while the flows on the outer edges grow by 1 in each iteration)

The trees form a special class of graphs. They are often used in miscellaneous variations and with additional requirements. We return to them later, in connection with practical applications. Now follows another extraordinarily important class of graphs.

13.1.19. Planar graphs. Some graphs are drawn in the plane in such a way that their edges do not "cross" one another. This means that every vertex of the graph is identified with a point of the plane, and an edge between vertices v, w corresponds to a continuous curve c : [0, 1] → R² that connects the vertices c(0) = v and c(1) = w. Furthermore, suppose that edges may intersect only at their endpoints. This describes a planar graph G. The question whether a given graph admits a realization as a planar graph often emerges in practical problems. Here is an example: Providers of water, electricity, and gas have their connection spots near three houses (each provider has one spot). Each house wants to be connected to each resource so that the connections do not cross (they might want not to dig too deep, for instance). Is it possible to do this? The answer is "no". In this particular case, it is clear from the diagram. There is a complete bipartite graph K3,3, where three of the vertices correspond to the connection spots, while the other three represent the houses. The edges are the connections between the spots and the houses. All edges can be placed except the last one – see the diagram, where the dashed edge cannot be drawn without crossing any more:

picture skipped

For a complete proof, more mathematical tools are needed. A complete explanation is not provided here, but an indication of the reasoning follows.
One of the basic results from the topology of the plane (the Jordan curve theorem) states that every closed continuous curve c in the plane that is not self-intersecting (i.e. it is a "crooked circle") divides the plane into two distinct parts. In other words, every other continuous curve which connects a point inside the curve c and a point outside c must intersect c. If the edges are realized as piecewise linear curves (every edge composed of finitely many adjacent line segments), then it is quite easy to prove the Jordan curve theorem (you might do it yourself!). The general theorem can be proved by approximating the continuous curves by piecewise linear ones (quite difficult to do, but it is much easier if the curve is assumed to be piecewise differentiable).

picture skipped

We can see that 200 iterations were needed in order to find the maximum flow. □

13.D.2. Find the size of a maximum flow in the network given by the following matrix A, where vertex 1 is the source and vertex 8 is the sink. Further, find the corresponding minimum cut.

$$A = \begin{pmatrix} - & 16 & 24 & 12 & - & - & - & - \\ - & - & - & - & 30 & - & - & - \\ - & - & - & - & 9 & 6 & 12 & - \\ - & - & - & - & - & - & 21 & - \\ - & - & - & - & - & 9 & - & 15 \\ - & - & - & - & - & - & - & 9 \\ - & - & - & - & - & - & - & 18 \\ - & - & - & - & - & - & - & - \end{pmatrix}$$

Solution. The following augmenting semipaths are found:
1–2–5–8 with residual capacity 15,
1–2–5–6–8 with residual capacity 1,
1–3–5–6–8 with residual capacity 8,
1–4–7–8 with residual capacity 12,
1–3–7–8 with residual capacity 6.
The total size of the found flow is 42. We can see that it is indeed of maximum size from the fact that the cut consisting of the edges (5, 8), (6, 8), and (7, 8) also has size 42 (and it is thus a minimum cut). □

13.D.3. The following picture depicts a flow network (the numbers f/c define the actual flow and the capacity of a given edge, respectively). Decide whether the given flow is maximum. If so, justify your answer. If not, find a maximum flow and describe the used procedure in detail. Find a minimum cut in the network.

picture skipped

Solution. In the given network, there exists an augmenting (semi)path 1–2–3–4–8 with residual capacity 4. Its saturation results in a flow of size 32. Since the cut (3, 8), (5, 8), (2, 4), (6, 4) is of the same size, we have found a maximum flow. □

13.D.4. Find a maximum flow and a minimum cut in the following flow network (source = 1, sink = 14).

Consider the graph K3,3. The triples of vertices that are not connected with edges are indistinguishable up to order. Therefore, the thick cycle can be considered the general case of a cycle with four points in the graph. The position of the remaining two vertices can then be discussed. In order for the graph to be planar, either both of the vertices must lie inside the cycle, or both outside. Again, these possibilities are equivalent, so it can be assumed without loss of generality that they are on opposite sides, as are the black vertices in the diagram. Now, their position with respect to a suitable cycle with two thick edges and two thin black edges can be discussed (i.e. through three gray vertices and one black one). Then, we can discuss the position of the remaining black vertex with respect to this cycle. This leads to the impossibility of drawing the last (dashed) edge without crossing the thick cycle. It can be shown similarly that the complete graph K5 is not planar either. We provide a purely combinatorial argument why K5 and K3,3 cannot be planar graphs below; see the Corollary at the end of the next subsection. Notice that if a graph G is "expanded" by dividing some of its edges (i.e. adding new vertices inside the edges), then the new graph is planar if and only if G is planar. The new graph is a subdivision of G. Planar graphs must not contain any subdivision of K3,3 or K5. The reverse implication is also true:

Kuratowski theorem
Theorem. A graph G is planar if and only if none of its subgraphs is isomorphic to a subdivision of K3,3 or K5.

The proof is complicated, so it is not discussed here. Much attention is devoted to planar graphs both in research and in practical applications. There are algorithms which are capable of deciding whether or not a given graph is planar in linear time. Direct application of the Kuratowski theorem would lead to a worse time complexity.

13.1.20. Faces of planar graphs. Consider a planar graph G embedded in the plane R². Let S be the set of those points x ∈ R² which do not belong to any edge of the graph (nor are vertices). In this way, the set R² \ G is partitioned into connected subsets Si, called the faces of the planar graph G. Since the graphs are finite, there is exactly one unbounded face S0. The set of all faces is denoted by S = {S0, S1, . . . , Sk}, and the planar graph by G = (V, E, S).

The simplest case of a planar graph is a tree. Every tree is a planar graph, since it can be constructed by step-by-step addition of leaves, starting with a single vertex. Of course, the Kuratowski theorem can also be applied – when there is no cycle in a graph G, then there cannot be a subdivision of K3,3 or K5 either.
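Returning to the flow networks of 13.D.1–13.D.3: the bad behavior of 13.D.1 disappears if the unsaturated paths are found by breadth-first search (the Edmonds–Karp variant of the algorithm). The following sketch is our own code, with the network of 13.D.1 encoded as a capacity matrix.

    from collections import deque

    def max_flow(cap, s, t):
        n = len(cap)
        res = [row[:] for row in cap]            # residual capacities
        flow = 0
        while True:
            parent = [None] * n
            parent[s] = s
            q = deque([s])
            while q and parent[t] is None:       # BFS for an augmenting path
                u = q.popleft()
                for v in range(n):
                    if parent[v] is None and res[u][v] > 0:
                        parent[v] = u
                        q.append(v)
            if parent[t] is None:
                return flow                      # no augmenting path is left
            v, bottleneck = t, float("inf")      # find the bottleneck ...
            while v != s:
                u = parent[v]
                bottleneck = min(bottleneck, res[u][v])
                v = u
            v = t                                # ... and augment along the path
            while v != s:
                u = parent[v]
                res[u][v] -= bottleneck
                res[v][u] += bottleneck          # allow later cancellation
                v = u
            flow += bottleneck

    # The four-vertex network of 13.D.1: two iterations suffice with BFS.
    cap = [[0, 100, 100, 0], [0, 0, 1, 100], [0, 0, 0, 100], [0, 0, 0, 0]]
    print(max_flow(cap, 0, 3))   # 200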
We provide a pure combinatorial argument why K5 and K3,3 cannot be planar graphs below, see the Corollary in the end of the next subsection. Notice that if a graph G is“expanded” by dividing some of its edges (i.e. adding new vertices in the edges), then the new graph is planar if only if G is planar. The new graph is a subdivision of G. Planar graphs must not contain any subdivision of K3,3 or K5. The reverse implication is also true: Kuratowski theorem Theorem. A graph G is planar if and only if none of its subgraphs is isomorphic to a subdivision of K3,3 or K5. The proof is complicated, so it is not discussed here. Much attention is devoted to planar graphs both in research and practical applications. There are algorithms which are capable of deciding whether or not a given graph is planar in linear time. Direct application of the Kuratowski theorem would lead to a worse time complexity. 13.1.20. Faces of planar graphs. Consider a planar graph G embedded in the plane R2 . Let S be the set of those points x ∈ R2 which do not belong to any edge of the graph (nor are vertices). In this way, the set R2 \ G is partitioned into connected subsets Si, called the faces of the planar graph G. Since the graphs are finite, there is exactly one unbounded face S0. The set of all faces are denoted by S = {S0, S1, . . . , Sk}, and the planar graph by G = (V, E, S). The simplest case of a planar graph is a tree. Every tree is a planar graph since it can be constructed by step-by-step addition of leaves, staring with single vertex. Of course, the Kuratowski theorem can also be applied– when there is no cycle in a graph G, then there cannot be a subdivision of K3,3 or K5, either. Since a tree G cannot contain a cycle, there is 1162 51 10 2 7 3 8 4 6 14 12 13 11 9 12/30 4/4 4/6 6/12 12/18 22/32 4/6 4/4 6/8 18/18 0/2 2/2 6/24 4/4 2/2 0/2 0/14 2/2 10/10 12/18 6/6 8/8 8/20 0/8 4/4 4/4 14/28 2/2 0/4 Solution. The paths are saturated in the following order: 1 18 −−→ 2 18 −−→ 7 14 −−→ 10 12 ←−− 5 12 −−→ 8 4 −→ 11 2 −→ 13 10 −−→ 14 r.2 1 16 −−→ 2 16 −−→ 7 12 −−→ 10 10 ←−− 5 10 −−→ 8 14 −−→ 13 8 −→ 14 r.8 We have found a flow of size 50. And indeed, it is a maximum flow since there is no further unsaturated path. If we look for the reachable vertices, we can also find a cut with capacity 50, consisting of edges [2, 4] : 4, [7, 9] : 2, [7, 12] : 4, [10, 12] : 2, [10, 14] : 6, [13, 14] : 32. □ 13.D.5. Find a maximum flow in the following network on the vertex set {1, 2, . . . , 9} with source 1 and sink 9 using the Ford-Fulkerson algorithm (during the depth-first search, choose the vertices in ascending order). Find a minimum cut in this network. Describe the steps of the procedure in detail. The edges e ∈ E as well as the lower and upper bounds on the flow (l(e) and u(e)) and the current flow f(e) are given in the table: CHAPTER 13. COMBINATORIAL METHODS, GRAPHS, AND ALGORITHMS only one face S0 there (the unbounded face). Since the number of edges of a tree is related to the number of its vertices, cf. the formula 13.1.15(5), it follows that |V | − |E| + |S| = 2 for all trees. Surprisingly, the latter formula linking the number of edges, faces, and vertices can be derived for all planar graphs. The formula is named after Leonhard Euler. Especially, the number of faces is independent of the particular embedding of the graph in the plane: Euler’s formula Theorem. Let G = (V, E, S) be a connected planar graph. Then, |V | − |E| + |S| = 2. Proof. The proof is by induction on the number of edges. 
The graph with zero or one edge satisfies the formula. Consider a graph G with |E| > 1. If G does not contain a cycle, then it is a tree, and the formula is already proved for this case. Suppose that there is an edge e of G that is contained in a cycle. Then, the graph G′ = G \ e is connected, and it follows from the induction hypothesis that G′ satisfies Euler's formula: |V| − (|E| − 1) + (|S| − 1) = 2, since removing an edge necessarily leads to merging two faces of G into one face in G′. Hence Euler's formula is valid for the graph G. □

Corollary. Let G = (V, E, S) be a planar graph with |V| = n ≥ 3 and |E| = e. Then:
• e ≤ 3n − 6, and this becomes an equality if and only if G is a maximal planar graph (adding any edge to G would violate planarity).
• If G does not contain a triangle (i.e. the graph K3 is not a subgraph), then e ≤ 2n − 4.

Proof. Continue adding edges to a given graph until it is maximal. If the obtained maximal graph G satisfies the inequality with equality, then the inequality holds for the original graph as well. Similarly, if the graph G is not connected, two of its components can be connected with a new edge, so such a graph cannot be maximal. Even if it were connected but not 2-connected, there would exist a vertex v ∈ V such that when it is removed, the graph G collapses into several components G1, . . . , Gk, k ≥ 2. However, then an edge can be added between these components without destroying the planarity of the original graph G (draw a diagram!). Therefore, it can be assumed from the beginning that the original graph G is a maximal planar 2-connected graph. As shown in theorem 13.1.11, every 2-connected graph can be constructed from the triangle K3 by splitting edges and attaching new ones. It is easily proved by induction that every face of a planar graph must be bounded by a cycle (which seems intuitively apparent). However, if there is a face of our maximal planar graph G that is not bounded by a triangle, then this face can be split with another edge (a "diagonal" in geometrical terminology), so G would not be maximal. It follows that all faces of G are bounded by triangles K3. Hence 3|S| = 2|E|. It suffices to substitute |S| = (2/3)|E| for the number of faces in Euler's formula. The second proposition is proved analogously; now, the faces of the maximal planar graph without triangles are bounded by either four or five edges, whence it follows that 4|S| ≤ 2|E|, with equality if and only if there are just quadrangles there. □

    e     | l(e) | u(e) | f(e)
    (1,2) |  0   |  6   |  0
    (1,3) |  0   |  6   |  0
    (1,6) |  0   |  4   |  0
    (2,3) |  0   |  2   |  0
    (2,4) |  0   |  3   |  0
    (3,4) |  0   |  4   |  0
    (3,5) |  0   |  4   |  0
    (4,5) |  3   |  5   |  4
    (4,8) |  0   |  3   |  0
    (5,1) |  0   |  3   |  0
    (5,6) |  0   |  6   |  0
    (5,7) |  0   |  5   |  4
    (5,8) |  0   |  5   |  0
    (6,9) |  0   |  5   |  0
    (7,4) |  1   |  6   |  4
    (7,9) |  0   |  3   |  0
    (8,9) |  0   |  9   |  0

⃝

13.D.6. A cut in a network (V, E, s, t, w) can also be viewed as a set C ⊂ E of edges such that in the network (V, E \ C, s, t, w), there is no path from the source s to the sink t, but if any edge e is removed from C, then the resulting set does not satisfy this property, i.e., there is a path from s to t in (V, (E \ C) ∪ {e}, s, t, w). Find all cuts (and their sizes) in the following network:

picture skipped

Solution. Let us fix the following edge labeling:

picture skipped

Then, the cuts are: {f, i}, {f, h, j, a}, {f, j, c, a, d, e}, {f, j, c, a, d, g}, {b, j, c}, {b, j, h}, {b, i}. Their capacities are 12, 9, 20, 18, 15, 10, and 15, respectively. □

13.D.7. Find a maximum flow in the network given in the above exercise. ⃝
□

The corollary implies (even without the Kuratowski theorem) that neither K5 nor K3,3 is planar: in the former case, |V| = 5 and |E| = 10 > 3|V| − 6 = 9, while in the latter, |V| = 6 and |E| = 9 > 2|V| − 4 = 8, which is again a contradiction, since K3,3 does not contain a triangle.

13.1.21. Convex polyhedra in the space. Planar graphs can be imagined as drawn on the sphere instead of in the plane. The sphere can be constructed from the plane by attaching one point "at infinity". Again, faces of such graphs can be discussed, and the faces are now equivalent to one another (even the face S0 is bounded). Conversely, every convex polyhedron P ⊆ R³ can be imagined as a graph drawn on the sphere (project the vertices and edges of the polyhedron onto a sufficiently large sphere from any point inside P). Removing a point inside one of the faces (that face becomes the unbounded face S0) then leads to a planar graph as above: the sphere with a hole can be spread out into the plane. The planar graphs formed from convex polyhedra are clearly 2-connected, since every pair of vertices of a convex polyhedron lies on a common cycle. Moreover, every face is the interior of its boundary cycle, and the graphs of convex polyhedra are always 3-connected. In fact, they are exactly the graphs described in the following Steinitz theorem (we omit the proof):

Steinitz's polyhedra theorem

Theorem. A graph G is the graph of a convex polyhedron if and only if it is planar and 3-vertex-connected.

13.1.22. Platonic solids. As an illustration of the combinatorial approach to polyhedral graphs, we classify all the regular polyhedra: those built up from one type of regular polygon in such a way that the same number of them meet at every vertex. It was known as early as in the epoch of the ancient philosopher Plato that there are only five of them:

(diagrams of the five regular polyhedra)

Further exercises on maximum flows and minimum cuts can be found on page 1220.

E. Classical probability and combinatorics

In this section, we recall the methods we learned as early as in the first chapter.

13.E.1. We throw n dice. What is the probability that none of the values 1, 3, 6 is cast? Solution. We can also see the problem as throwing one die n times. The probability that none of the values 1, 3, 6 is cast in the first throw is 1/2. The probability that they are cast neither in the first throw nor in the second one is clearly 1/4 (the result of the first throw has no impact on the result of the second one). Since this holds generally, i.e., the results of different throws are (stochastically) independent, the wanted probability is 1/2^n. □

13.E.2. We have a pack of ten playing cards, exactly one of which is an Ace. Each time, we randomly draw one of the ten cards and then put it back. How many times do we have to repeat this experiment if we require the probability of getting the Ace at least once to be at least 0.9? Solution. Let Ai be the event "the Ace is picked in the i-th draw". The events Ai are (stochastically) independent. Hence, we know that P(∪_{i=1}^{n} Ai) = 1 − (1 − P(A1)) · (1 − P(A2)) ⋯ (1 − P(An)) for every n ∈ N. We are looking for an n ∈ N such that P(∪_{i=1}^{n} Ai) = 1 − (1 − P(A1)) ⋯ (1 − P(An)) > 0.9. Apparently, we have P(Ai) = 1/10 for any i ∈ N. Therefore, it suffices to solve the inequality 1 − (9/10)^n > 0.9, whence n > log_a 0.1 / log_a 0.9 for any base a > 1. Evaluating this, we find out that we have to repeat the experiment at least 22 times. □
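An aside to the theory column: the corollary applied at the beginning of 13.1.21 is easy to check mechanically. The following small Python sketch (an illustration added here, not part of the original text) verifies that K5 and K3,3 violate the respective edge bounds:

    from itertools import combinations

    k5 = list(combinations(range(5), 2))                   # all 10 edges on 5 vertices
    k33 = [(a, b) for a in range(3) for b in range(3, 6)]  # 9 edges, parts {0,1,2} and {3,4,5}

    print(len(k5), '>', 3 * 5 - 6)    # 10 > 9: K5 violates e <= 3n - 6
    print(len(k33), '>', 2 * 6 - 4)   # 9 > 8: the triangle-free K3,3 violates e <= 2n - 4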
13.E.3. We randomly draw six cards from a pack of 32 cards (containing four Kings). Calculate the probability that the sixth card is a King and, at the same time, it is the only King drawn. Solution. By the theorem on the product of probabilities, the result is (28/32) · (27/31) · (26/30) · (25/29) · (24/28) · (4/27) ≈ 0.0723. □

13.E.4. We randomly draw two cards from a pack of 32 cards (containing four Aces). Calculate the probability that the second card drawn is an Ace if: a) the first card is put back; b) the first card is not put back. Solution. If the first card is put back in the pack, then we clearly repeat an experiment with 32 possible outcomes (of the same probability), 4 of which are favorable. Therefore, the wanted probability is 1/8. However, even if we do not put the first card back, the probability is the same: clearly, the probability of a given card being drawn the first time is the same as for the second time. Of course, we can also apply conditional probability, which leads to (4/32) · (3/31) + (28/32) · (4/31) = 1/8. □

13.E.5. Combinatorial identities. Use combinatorial means to derive the following important identities (in particular, do not use induction):
Arithmetic series: ∑_{k=0}^{n} k = n(n+1)/2 = \binom{n+1}{2}
Geometric series: ∑_{k=0}^{n} x^k = (x^{n+1} − 1)/(x − 1)
Binomial theorem: (x + y)^n = ∑_{k=0}^{n} \binom{n}{k} x^k y^{n−k}
Upper binomial theorem: ∑_{k=0}^{n} \binom{k}{m} = \binom{n+1}{m+1}
Vandermonde's convolution¹: \binom{m+n}{r} = ∑_{k=0}^{r} \binom{m}{k} \binom{n}{r−k}. ⃝

13.E.6. An urn contains 30 red balls and 70 green balls. What is the probability of getting exactly k red balls, 0 ≤ k ≤ 20, in a sample of size 20 if the sampling is done with replacement (repetition allowed)?

Translate the condition of regularity into properties of the corresponding graphs: every vertex needs the same degree d ≥ 3, and the boundary of every face must contain the same number k ≥ 3 of vertices. Let n, e, s denote the total numbers of vertices, edges, and faces, respectively. Firstly, the relation between the vertex degrees and the number of edges requires dn = 2e. Secondly, every edge lies in the boundary of exactly two faces, so 2e = ks. Thirdly, Euler's formula states that 2 = n − e + s = 2e/d − e + 2e/k. Putting this together (divide by 2e), the constants d and k must satisfy 1/d − 1/2 + 1/k = 1/e. Since d, k, e, n are positive integers (in particular, 1/e > 0), this equality restricts the possibilities. The left-hand side is maximal for d = 3. Substituting this value, we obtain the inequality −1/6 + 1/k = 1/e > 0. It follows that k ∈ {3, 4, 5}, for any d. The roles of k and d in the original equality are symmetric, so also d ∈ {3, 4, 5}. Checking each of the remaining possibilities yields all the solutions:

d  k  n   e   s
3  3  4   6   4
3  4  8   12  6
4  3  6   12  8
3  5  20  30  12
5  3  12  30  20

It remains to show that the corresponding regular polyhedra exist. This is suggested by the above diagrams, but that is not a mathematical proof. The existence of the first three is apparent. Let us concentrate on the geometrical construction of the regular dodecahedron (draw a diagram!). Begin with a cube, building "A-tents" on all its sides simultaneously. The upper horizontal poles are set on the level of the cube's sides so that those of adjacent sides are perpendicular to each other. Their length is chosen so that the trapezoids of the lateral sides have three sides of the same length.
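Another aside to the theory column: the table of solutions above can be reproduced by a mechanical search. A small illustrative Python sketch (not from the original text), using exact rational arithmetic:

    from fractions import Fraction

    # solve 1/d + 1/k - 1/2 = 1/e > 0 over the admissible degrees d and face
    # sizes k, then recover e, n = 2e/d and s = 2e/k
    for d in range(3, 6):
        for k in range(3, 6):
            inv_e = Fraction(1, d) + Fraction(1, k) - Fraction(1, 2)
            if inv_e > 0:
                e = 1 / inv_e
                print(d, k, 2 * e / d, e, 2 * e / k)   # columns d, k, n, e, s

This prints exactly the five rows of the table.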
Solution. Any time we take a sample from the urn, we put it back before the next draw (sampling with replacement). Thus, each time we sample, the probability of choosing a red ball is 30/100, and we repeat this in 20 independent trials. Using the binomial formula, we obtain P(k red balls) = \binom{20}{k} (0.3)^k (0.7)^{20−k}. □

13.E.7. An urn contains 30 red balls and 70 green balls. What is the probability of getting exactly k red balls, 0 ≤ k ≤ 20, in a sample of size 20 if the sampling is done without replacement (repetition not allowed)? Solution. Let A be the event of getting exactly k red balls. To find P(A), we need to find |A| and the total number of possibilities |S|. Here, |S| = \binom{100}{20}. Next, to find |A|, we need to find out in how many ways one can choose k red balls and 20 − k green balls. Thus, |A| = \binom{30}{k} \binom{70}{20−k}, and P(A) = |A|/|S| = \binom{30}{k} \binom{70}{20−k} / \binom{100}{20}. □

13.E.8. Assume that there are k people in a room and we know that:
• k = 5 with probability 1/4;
• k = 10 with probability 1/4;
• k = 15 with probability 1/2.

¹ Also called the hockey-stick identity.

Now, simultaneously raise all the tents while keeping the ratio of the three sides of the trapezoids. There is a position at which the adjacent trapezoid and triangle sides are coplanar. At that position, the regular dodecahedron is created. The regular icosahedron can then be constructed via the so-called dual graph construction. The dual graph G′ of a planar graph G = (V, E, S) has the faces in S as its vertices, and there is an edge between faces S1 and S2 if and only if they share an edge (i.e., were neighbours) in G. Clearly, the dual graph of the dodecahedron is the icosahedron, exactly as the cube and the octahedron are dual, while the tetrahedron is dual to itself.

2. A few graph algorithms

In this part, we consider several applications of graph concepts and the algorithms built upon them.

13.2.1. Graph representations. As already indicated, algorithms are often formulated with the help of the language of graphs. The concept of an algorithm can be formalized as a procedure dealing with a (directed) graph whose vertices and/or edges are equipped with further information. The procedure consists in walking through the graph along its edges, while processing the information associated with the visited vertices and edges. Of course, processing the information also includes the decision which outgoing edges must be investigated in a further walk, and in which order. In the case of an undirected graph, each (undirected) edge can be replaced with a pair of directed edges. The graph may also be changed during the run of the algorithm, i.e., vertices and/or edges may be added or removed. In order to execute such algorithms efficiently (usually on a computer), it is necessary to represent the graph in a suitable way. The adjacency matrix representation is one possibility, cf. 13.1.8. There are many other options based on various lists with suitable pointers. The edge list (also the adjacency list) of the graph G = (V, E) consists of two lists V and E that are interconnected by pointers so that every vertex points to the edges it is incident with, and every edge points to its endpoints. The memory necessary to represent the graph as an edge list is O(|V| + |E|), since every edge is pointed at twice and every vertex is pointed at d times, where d is its degree, and the sum of the degrees of all vertices equals twice the number of edges. Therefore, up to a constant multiple, this is an optimal way of representing the graph in memory.
It is of interest how the basic graph operations are processed in both representations. By the basic operations, we mean:
• removal of an edge,
• addition of an edge,
• removal of a vertex,
• addition of a vertex,
• splitting an edge with a new vertex.

i) What is the probability that at least two of them have been born in the same month? Assume that all months are equally likely. ii) Given that we already know there are at least two people who celebrate their birthday in the same month, what is the probability that k = 10? Solution. Let Ak be the event that at least 2 people out of k have their birthday in the same month. The complementary event Bk is that no two out of the k people have their birthday in the same month, i.e., there are exactly k distinct months in which these people were born. Thus, P(Ak) = 1 − (12 · 11 ⋯ (12 − k + 1))/12^k for 2 ≤ k ≤ 12. For k > 12, obviously, P(Ak) = 1. Therefore, the required probability is P = (1/4)P(A5) + (1/4)P(A10) + (1/2)P(A15) = (1/4)(1 − (12 · 11 · 10 · 9 · 8)/12^5) + (1/4)(1 − (12!/2!)/12^{10}) + 1/2. The second part of the question asks for the conditional probability P(k = 10 | A). According to Bayes' rule: P(k = 10 | A) = P(A | k = 10)P(k = 10)/P(A) = P(A10)/(4P(A)) = (1 − (12!/2!)/12^{10}) / ((1 − (12 · 11 · 10 · 9 · 8)/12^5) + (1 − (12!/2!)/12^{10}) + 2). □

13.E.9. Hat matching problem. N guests arrive at a party. Each person is wearing a hat. All hats are collected and then randomly redistributed at the departure. What is the probability that at least one person receives his/her own hat? Solution. Let Ai be the event that person i receives his own hat. Then the task is to find P(E), where E = ∪_{i=1}^{N} Ai. To find P(E), we use the inclusion-exclusion principle: P(E) = ∑_{i=1}^{N} P(Ai) − ∑_{i<j} P(Ai ∩ Aj) + ∑_{i<j<k} P(Ai ∩ Aj ∩ Ak) − ⋯. Since any k given guests all receive their own hats with probability (N − k)!/N! and there are \binom{N}{k} such groups of guests, the k-th sum equals \binom{N}{k}(N − k)!/N! = 1/k!, so P(E) = 1 − 1/2! + 1/3! − ⋯ + (−1)^{N−1}/N! ≈ 1 − 1/e. □

(5) if dG(v, w) > 1, then there exists a vertex z distinct from v and w such that dG(v, w) = dG(v, z) + dG(z, w).

The following is true: every function dG on V × V (for a finite set V) satisfying the five properties listed above allows us to define the edges E so that G = (V, E) is a graph with metric dG. Prove this yourself as an exercise! (It is quite clear how to construct the corresponding graph. It remains "merely" to show that the given function dG can be achieved as the metric on the constructed graph.)

13.2.4. Dijkstra's shortest-path algorithm. One may suspect that the shortest path between a given vertex v and another given vertex w can be found by breadth-first searching the graph. With this approach, first discuss the vertices which are reachable by one edge from the initial vertex v, then those which are two edges distant, and so on. This is the fundamental idea of one of the most often used graph algorithms: Dijkstra's algorithm⁵. This algorithm is able to find the shortest paths even in problems from practice, where each edge e is assigned a weight w(e), which is a positive real number. When looking for shortest paths, the weights represent the lengths of the edges.

⁵ Edsger Wybe Dijkstra (1930-2002) was a famous Dutch computer scientist, being one of the fathers of this discipline. Among others, he is credited as one of the founders of concurrent computing. He published the above algorithm in 1959.

13.E.11. Four players are given two cards each from a pack consisting of four Aces and four Kings. What is the probability that at least one of the players is given a pair of Aces? Express the result as a ratio of two-digit integers. ⃝
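The computation in 13.E.8 above is easy to evaluate numerically. A small illustrative Python sketch (not part of the original solution):

    import math

    # P(A_k) = 1 - (12 * 11 * ... * (12 - k + 1)) / 12^k for k <= 12, else 1
    def p_shared(k):
        return 1.0 if k > 12 else 1 - math.perm(12, k) / 12**k

    p = 0.25 * p_shared(5) + 0.25 * p_shared(10) + 0.5 * p_shared(15)
    print(p)                         # part i)
    print(0.25 * p_shared(10) / p)   # part ii), by Bayes' rule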
13.E.12. Alex owns two special dice: one of them has only 6's on its sides; the other one has two 4's, two 5's, and two 6's. Martin has two ordinary dice. Each of the players throws his dice, and the one whose sum is higher wins. What is the probability that Alex wins? Express the result as a ratio of two-digit integers. ⃝

13.E.13. In how many ways can we place n rooks on an n × n chessboard so that every unoccupied square is guarded by at least one of the rooks? Solution. Clearly, the condition is satisfied if and only if at least one of the following holds: there is at least one rook in each rank (which implies that there must be exactly one rook in each rank; there are n^n such placements, since the particular squares can be selected independently for each rank); or there is at least one rook in each file (again resulting in n^n placements). However, there are n! placements which satisfy both (we have n squares where to put a rook in the first rank, n − 1 squares for the second rank since one of the files is already occupied, etc.). By the inclusion-exclusion principle, the wanted number of placements is 2n^n − n!. □

13.E.14. We flip a coin five times. Every time it comes up heads, we put a white ball into a hat. Every time it comes up tails, we put a black ball into the hat. Express the probability that there are more black balls than white ones provided there is at least one black ball. Solution. Let us define the events A: there are more black balls than white ones, and H: there is at least one black ball. We want to compute P(A | H). Clearly, the probability P(H^C) of the event complementary to H is 2^{−5}. Further, the probability of A is the same as the probability P(A^C) (by the symmetry between black and white; with five balls, there cannot be equally many of each color). Therefore, P(H) = 1 − 2^{−5} and P(A) = 1/2. Moreover, P(A ∩ H) = P(A), since the event H contains the event A (the event A implies the event H). Altogether, we obtain P(A | H) = P(A ∩ H)/P(H) = (1/2)/(1 − (1/2)^5) = 16/31. □

F. More advanced problems from combinatorics

In the first chapter, we met the fundamental methods used in combinatorics. Even using merely these ideas, we are able to solve relatively complicated problems.

However, in general, the weights may have other meanings: they may stand for profits or costs, network flows, and so on. The input of the algorithm consists of an edge-weighted graph G = (V, E) and an initial vertex v0. The output consists of the numbers dw(v), which give the least possible sum of the weights of edges along a path from the vertex v0 to the vertex v. This procedure works in undirected graphs as well as in directed ones. In order to ensure the termination and the correctness of the algorithm, it is important that all of the weights are positive – see example 13.B.6. Dijkstra's algorithm needs only a little modification of the general breadth-first search:
• For every vertex v, keep the information d(v), which is an upper bound for the actual distance of v from the initial vertex v0.
• At every stage, the already processed vertices are those for which the shortest path is already known; for these, d(v) = dw(v).
• When some sleeping vertices are to be made active, choose exactly those vertices y from the set Z of sleeping vertices for which d(y) = min{d(z); z ∈ Z}.
Suppose that the graph G has at least two vertices. More formally, Dijkstra's algorithm can be described as follows:

Dijkstra's Algorithm

Input: a vertex v0 in the graph G = (V, E) with weights on all edges.
Output: the distances from v0 within G, associated to all vertices.
(1) Initialization step: for all v ∈ V, set d(v) = 0 for v = v0 and d(v) = ∞ for v ≠ v0. Set Z = V, W = ∅.
(2) Cycle condition: if every vertex y ∈ Z is assigned ∞, the algorithm terminates; otherwise the algorithm continues with another iteration. (In particular, the algorithm terminates if Z = ∅.)
(3) Update of the vertex statuses:
• Find the set N of those vertices v ∈ Z for which d(v) = δ is as small as possible, δ = min{d(y); y ∈ Z}.
• All vertices which have been in W are removed and marked as processed; the new set of active vertices is W = N, and all these vertices are removed from Z, i.e., they are no longer sleeping.
(4) Cycle body: for each edge e ∈ E_{WZ} (i.e., whose tail is an active vertex v and whose head is a sleeping vertex y): if d(v) + w(e) < d(y), then update d(y) to d(v) + w(e). Move back to check the cycle condition (step 2).

13.F.1. There are n (n ≥ 3) fortresses positioned on a circle, numbered 1 through n. At a given moment, every fortress shoots at one of its neighbors (i.e., fortress 1 shoots at n or 2, fortress 2 shoots at 1 or 3, etc.). We refer to the set of hit fortresses as a result (i.e., we are only interested in whether each fortress was hit or not; it does not matter whether it was hit once or twice). Let P(n) denote the number of possible results. Prove that the integers P(n) and P(n + 1) are coprime. Solution. First of all, note that a set of hit fortresses is a possible result if and only if no pair of adjacent-but-one fortresses (i.e., whose numbers differ by 2 modulo n) is unhit. Therefore, if n is odd, then P(n) is equal to the number K(n) of results where no pair of adjacent fortresses is unhit (consider the order 1, 3, 5, . . . , n, 2, 4, . . . , n − 1). If n is even, then P(n) equals K(n/2)², since the fortresses at even positions and those at odd positions can be considered independently. We can easily derive the following recurrence for K(n): K(n) = K(n − 1) + K(n − 2). (Well, on the other hand, it is not so trivial... It is left as an exercise for the reader.) Further, we can easily calculate that K(2) = 3, K(3) = 4, K(4) = 7, so K(2) = F(4) − F(0), K(3) = F(5) − F(1), K(4) = F(6) − F(2), and a simple induction argument shows that K(n) = F(n + 2) − F(n − 2), where F(n) denotes the n-th term of the Fibonacci sequence (F(0) = 0, F(1) = F(2) = 1). Moreover, since (K(2), K(3)) = 1, we have for n ≥ 3 (similarly as with the Fibonacci sequence) that (K(n), K(n − 1)) = (K(n) − K(n − 1), K(n − 1)) = (K(n − 2), K(n − 1)) = ⋯ = 1. Now, we are going to show that, for every n = 2a, P(n) = K(a)² is coprime to both P(n + 1) = K(2a + 1) and P(n − 1) = K(2a − 1). It suffices to realize that for a ≥ 2, we have
(K(a), K(2a + 1)) = (K(a), F(2)K(2a) + F(1)K(2a − 1)) = (K(a), F(3)K(2a − 1) + F(2)K(2a − 2)) = ⋯ = (K(a), F(a + 1)K(a + 1) + F(a)K(a)) = (K(a), F(a + 1)) = (F(a + 2) − F(a − 2), F(a + 1)) = (F(a + 2) − F(a + 1) − F(a − 2), F(a + 1)) = (F(a) − F(a − 2), F(a + 1))

13.2.5. Theorem. For a given vertex v0, Dijkstra's algorithm finds the distance dw(v) of each vertex v in G that lies in the connected component of the vertex v0. For the vertices v of the other connected components, d(v) = ∞ remains. The algorithm can be implemented in such a way that it terminates in time O(n log n + m), where n is the number of vertices and m is the number of edges in G.

Proof. The algorithm is correct, since:
• it terminates after a finite number of steps;
• when it does, its output has the desired properties.
The cycle condition guarantees that in each iteration, the number of sleeping vertices decreases by at least one, since N is always non-empty. Therefore, the algorithm necessarily terminates after a finite number of steps. After going through the initialization step,
(1) dw(v) ≤ d(v)
for all vertices v of the graph. Now assume that this property holds when the algorithm enters the main cycle, and show that it holds when it leaves the cycle as well. Indeed, if d(y) is changed during step 4, then it is caused by finding a vertex v such that dw(y) ≤ dw(v) + w({v, y}) ≤ d(v) + w({v, y}) = d(y), where the new value is on the right-hand side. The inequality (1) is thus satisfied when the algorithm terminates. It remains to verify that the opposite inequality holds as well. For this purpose, consider what is actually done in steps 3 and 4 of the algorithm. Let 0 = d0 < ⋯ < dk denote all the (distinct) finite distances dw(v) of the vertices in G from the initial vertex v0. At the same time, this partitions the vertex set of the graph G into clusters Vi of vertices whose distance from v0 is exactly di. During the first iteration of the main cycle, N = V0 = {v0}, the number δ is just d1, and the set of sleeping vertices is changed to V \ V0. Suppose this holds up to the j-th iteration (inclusive), i.e., the algorithm enters the cycle with N = Vj, δ = dj, and the set of sleeping vertices Z = V \ ∪_{i=0}^{j} Vi. Consider a vertex y ∈ V_{j+1}, i.e., dw(y) = d_{j+1} < ∞, and there exists a path (v0, e1, v1, . . . , vℓ, e_{ℓ+1}, y) with total length d_{j+1}. However, then
(2) dw(vℓ) ≤ d_{j+1} − w({vℓ, y}) < d_{j+1}.
It follows from the assumption that the vertex vℓ was active during an earlier iteration of the main cycle, with dw(vℓ) = d(vℓ) = di for some i ≤ j. Therefore, after the current iteration of the main cycle has been finished, d(y) = dw(vℓ) + w({vℓ, y}) = d_{j+1}, and this does not change any more. It follows that the inequality (1) holds with equality when the algorithm terminates.

= (F(a − 1), F(a + 1)) = (F(a − 1), F(a)) = 1,
(K(a), K(2a − 1)) = (K(a), F(2)K(2a − 2) + F(1)K(2a − 3)) = (K(a), F(3)K(2a − 3) + F(2)K(2a − 4)) = ⋯ = (K(a), F(a)K(a) + F(a − 1)K(a − 1)) = (K(a), F(a − 1)) = (F(a + 2) − F(a − 2), F(a − 1)) = (F(a + 2) − F(a), F(a − 1)) = (F(a + 2) − F(a + 1), F(a − 1)) = (F(a), F(a − 1)) = 1.
This proves the proposition. □

G. Probability in combinatorics

Classical probability is tightly connected to combinatorics, as we have already seen in the first chapter. Now, we present another example, which is a bit more complicated. Combinatorics is hidden even in the following "probabilistic" problem.

13.G.1. There are 100 prisoners in a prison, numbered 1 through 100. The chief guard has placed 100 chests (also numbered 1 through 100) into a closed room and randomly put balls numbered 1 through 100 into the chests so that each chest contains exactly one ball. He has decided to play the following game with the prisoners: he calls them one by one into the room, and the invited prisoner is allowed to gradually open 50 chests. Then, he leaves without any possibility to talk to the other prisoners, the guard closes all the chests, and another prisoner is let in. The guard has promised to free all the prisoners provided each of them finds the ball with his number in one of the 50 opened chests. However, if any of the prisoners fails to find his ball, all will be executed. Before the game begins, the prisoners are allowed to agree on a strategy. Does there exist a strategy that gives the prisoners a "reasonable" chance of winning?
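Returning to the theory column for a moment: a minimal Python sketch of Dijkstra's algorithm from 13.2.4, in the heap-based variant suggested in the proof of 13.2.5. It is only an illustration under simplifying assumptions (adjacency lists with positive weights; vertices are processed one at a time instead of in the level sets N):

    import heapq

    def dijkstra(graph, v0):
        # graph: vertex -> list of (neighbour, weight), all weights positive
        d = {v: float('inf') for v in graph}
        d[v0] = 0
        heap, done = [(0, v0)], set()
        while heap:
            dist, v = heapq.heappop(heap)
            if v in done:
                continue               # an outdated heap entry
            done.add(v)                # d(v) is final here: d(v) = dw(v)
            for y, w in graph[v]:
                if dist + w < d[y]:    # the update of step (4)
                    d[y] = dist + w
                    heapq.heappush(heap, (d[y], y))
        return d

    graph = {'z': [('a', 2), ('b', 5)], 'a': [('b', 1)], 'b': []}
    print(dijkstra(graph, 'z'))        # {'z': 0, 'a': 2, 'b': 3}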
Solution. Clearly, if the prisoners choose to open the chests randomly (where the choices of the particular prisoners are independent), the chance for one prisoner to find his ball is 1/2, so the total probability of success is merely 1/2^100. Therefore, it is necessary to look for a strategy where the successes of the prisoners are as dependent as possible. First of all, we should realize that the invited prisoner has no information from the other prisoners and does not know the positions of the particular balls in the chests. However, once he opens a chest, he knows the number of the ball it contains and may choose the next chest accordingly. This suggests the following simple strategy: every prisoner starts with the chest that bears his number. If it contains the corresponding ball, the prisoner succeeds and can open the remaining chests at random. If not, he opens the chest with the number of the ball just found. He continues this way until he eventually finds his ball or opens the fiftieth chest. Since every chest "points" at another chest according to the described procedure, let us call this strategy the pointer strategy.

The analysis of the main cycle just made also determines a bound for the running time of the algorithm (i.e., the number of elementary operations with the graph and other corresponding objects). The main cycle is iterated as many times as there are (distinct) distances di in the graph. Every vertex, when processed during step 3, is considered exactly once. The vertices that are still sleeping must be sorted. This gives the bound O(n log n) for this part of the algorithm, provided the graph is stored as a list of vertices and weighted edges, such that the sleeping vertices are kept in a suitable data structure that allows finding the set N of active vertices in time O(log n + |N|). This can be achieved if a heap is used. Every edge is processed exactly once in step 4, since the vertices are active only during one iteration of the cycle. □

Note that the inequality (2), essential for the analysis of the algorithm, need not hold if the weights of the edges are allowed to be negative. In practice, many heuristic improvements of the algorithm are applied. For instance, it is not necessary to compute the distances of all vertices if only the distance between a given pair of vertices is of interest: when a vertex is excluded from the active ones, its distance is final. Further, it is not necessary to initialize the distances with the value of infinity. Of course, this is technically impossible anyway, and a sufficiently large constant would be needed in the implementation.

13.2.6. Spanning trees. In practical applications, graphs often encode all possibilities of connections between particular objects, as in road or electrical networks. If it is only required that each pair of vertices is connected by a path, using as few edges as possible, then what is needed is a subgraph T which is a tree. This corresponds to the problem of finding a minimal network.

Spanning tree of a graph

Definition. Any tree T = (V, E′), E′ ⊆ E, in a graph G = (V, E) is called a spanning tree of the graph G.

A graph can have a spanning tree if and only if it is connected. Indeed, a graph with a spanning tree is connected, since all trees are. Conversely, the following algorithm finds a spanning tree for any given connected graph.

Probability of success. The guard's possible placements of the balls bijectively correspond to permutations of the numbers 1 through 100.
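Before deriving this probability exactly, one can estimate it empirically. The following Monte Carlo sketch in Python (an illustration, not part of the original argument) uses the fact, established below, that the pointer strategy succeeds if and only if no cycle of the random permutation is longer than n/2:

    import random

    def all_succeed(n=100):
        perm = list(range(n))
        random.shuffle(perm)          # chest j contains ball perm[j]
        seen = [False] * n
        for start in range(n):
            if seen[start]:
                continue
            length, j = 0, start
            while not seen[j]:        # walk the cycle through 'start'
                seen[j] = True
                length += 1
                j = perm[j]
            if length > n // 2:       # some prisoner would open > n/2 chests
                return False
        return True

    trials = 10_000
    print(sum(all_succeed() for _ in range(trials)) / trials)   # about 0.31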
In order to find the probability of success, we must realize for which permutations the pointer strategy works. Recall that every permutation can be expressed as the composition of pairwise disjoint cycles. If each prisoner were allowed to open an arbitrary number of chests, he would find his ball as the last one of the corresponding cycle, since he begins with the chest with his number, which is pointed at just by the chest with his ball. It follows that the strategy fails if and only if there is a cycle of length greater than 50, because then no prisoner of this cycle finds his ball in time. Thus, we must count the number of such permutations. In general, the number of permutations of length n containing a cycle of length r > n/2 (there could be more occurrences of shorter cycles; however, there can be at most one cycle of length greater than n/2, which simplifies the calculation) is computed as follows: we must choose the r elements of the cycle, order them cyclically, and then choose an arbitrary permutation of the remaining n − r numbers. This leads to \binom{n}{r} (r − 1)! (n − r)! = n!/r. Therefore, the probability that such a permutation is selected (among all the n! permutations) is 1/r. Thus, the probability that our 100 prisoners succeed is 1 − ∑_{k=51}^{100} 1/k ≈ 0.311828. As we can see, this is much higher than the original 1/2^100. Now, let us look at the behavior of this probability for a general number n of prisoners (each prisoner is then allowed to open at most n/2 chests). In general, the probability that a random permutation of length n contains a cycle of length greater than n/2 is equal to p = ∑_{k=n/2+1}^{n} 1/k. Recall that ∑_{k=1}^{n} 1/k behaves as ln(n) + γ for n → ∞, where γ is Euler's constant. Thus, we have:

Spanning forest algorithm

Input: a graph G = (V, E).
Output: a forest T = (V, E′) consisting of spanning trees of the components of G.
(1) Sort all the edges e0, . . . , em ∈ E in any order.
(2) Start with E0 = {e0} and gradually build the sets of edges Ei, so that in the i-th step, the edge ei is added to E_{i−1} unless this creates a cycle in the graph Gi = (V, E_{i−1} ∪ {ei}). If this edge creates a cycle, leave Ei = E_{i−1} unchanged.
(3) The algorithm terminates if the graph Gi = (V, Ei) has exactly n − 1 edges at some step i, or if i = m, and it produces the graph T = (V, Ei).

If the algorithm terminates for the latter reason, then the graph is not connected and no spanning tree exists (but T still consists of the spanning trees of all the individual components).

Proof. It follows from the rules of the algorithm that the resulting subgraph T of G never contains a cycle. Therefore, it is a forest. If the resulting number of edges is n − 1, then it must be a tree; see theorem 13.1.15. It remains to show that the connected components of the graph T have the same sets of vertices as the connected components of the original graph G. Every path in T is also a path in G; therefore, all vertices that lie in one tree of T must lie in the same component of G. If there were a path in G from v to w whose endpoints lie in different trees of T, then one of its vertices vi would be the last one lying in the component determined by v (in particular, v_{i+1} does not lie in this component). The corresponding edge {vi, v_{i+1}} must have created a cycle when examined by the algorithm, since otherwise it would be in T. Since the edges are never removed from T, there is a path between vi and v_{i+1} in T, which contradicts the assumptions. Therefore, v and w cannot lie in different trees of T.
The number of components of T is given by the fact that the numbers of vertices and edges differ by one in every tree. The difference increases by one with every component, so if there are n vertices and k edges in the forest, then there are n − k components. □

Remark. As always, the time complexity of the algorithm is of interest. The addition of an edge creates a cycle if and only if its endpoints lie in the same connected component of the forest T under construction, so knowledge of the connected components of the current forest T is helpful. To implement the algorithm, it is needed to unite two equivalence classes of a given equivalence relation on a given set (the vertex set) and to find out whether two vertices are in the same class or not. The union requires time O(k), where k is the number of elements to be united; k can be bounded from above by n, the total number of vertices. However, for each equivalence class, it can be noted how many vertices it contains. If, for each vertex, the information to which class it belongs is kept, then the union operation means to relabel the vertices of one of the united classes.

p = ∑_{k=1}^{n} 1/k − ∑_{k=1}^{n/2} 1/k → ln(n) + γ − ln(n/2) − γ = ln 2, for n → ∞.

Hence it follows that, for large values of n, the probability of success approaches 1 − p ≈ 1 − ln 2 = 0.30685 . . .. Now, we are going to show that the pointer strategy is optimal.

Optimality of the pointer strategy. In order to prove the optimality of the pointer strategy, we merely modify the rules of the game and further define another game. Consider the following rules: every prisoner keeps opening the chests until he finds his ball. The prisoners win iff each opens at most 50 chests. Clearly, this modification does not change the probability of success, but it will help us prove the optimality. We will refer to this game as game A. Now, consider another game (game B) with the following rules: first, prisoner number 1 enters the chest room and keeps opening the chests until he finds his ball. Then, the guard leaves the opened chests as they are and immediately invites the prisoner with the least undiscovered number. The game proceeds this way until all chests are opened. The prisoners win iff none of them has opened more than 50 (n/2 in the general case) chests. Suppose that the guard notes the ball numbers in the order they were discovered by the prisoners. This results in a permutation of the numbers 1 through 100, from which he can see whether the prisoners won or not. The probability of discovering a particular ball is at every moment independent of the selected strategy. There are 100! permutations which correspond to the possible strategies (no matter whether they are random or sophisticated), since they are merely the orders in which the ball numbers are discovered. In order to compute the probability of success in game B, we should note that any order can be written as the composition of cycles where each cycle contains the ball numbers discovered by a given prisoner. For the sake of clarity, consider a game with 8 prisoners. If the guard has noted the permutation (2, 5, 7, 1, 6, 8, 3, 4), then we can see that the prisoners win, since prisoner 1 discovered the numbers (2, 5, 7, 1), then prisoner 3 discovered (6, 8, 3), and finally prisoner 4 discovered only his number (4). In this case, we can write: (2, 5, 7, 1, 6, 8, 3, 4) → (2, 5, 7, 1)(6, 8, 3)(4). Further, any such permutation corresponds to a unique order of the numbers 1 through 8.
Having any permutation in the cyclic notation, we first rearrange each cycle so that its least number is the last one, and then we sort the cycles by their last numbers in ascending order. For instance, we have: (7, 5, 8)(2, 4)(1, 6, 3) → (6, 3, 1)(4, 2)(8, 7, 5) → (6, 3, 1, 4, 2, 8, 7, 5). We have thus constructed a bijection between the winning orders of discovered numbers and the permutations of the numbers 1 through 8 that do not contain a cycle of length greater than 4 (n/2 in the general case). It follows that the probability of success in game B is the same as the probability that a random permutation does not contain such a cycle. This corresponds to the probability of success in the original game using the pointer strategy. Now, this implies an important conclusion for game A.
Indeed, the prisoners may apply any strategy from game A to game B as follows: each prisoner behaves like in game A, but he considers open chests to be closed, i.e., if he wants to open a chest which has already been opened, he just "passes" this move and further behaves as if he had just discovered the ball number in the considered chest. Therefore, any strategy that succeeds for a given placement of the balls in game A must succeed for the same placement in game B as well. Therefore, if there existed a better strategy for game A, we could apply it to game B and obtain a higher chance of winning there. However, this is impossible, since all strategies in game B lead to the same probability of success. Therefore, the pointer strategy is better than or equally good as any other strategy. □

13.G.2. In a competition, there are m contestants and n officials, where n is an odd integer greater than two. Each official judges each contestant as either good or bad. Suppose that any pair of officials agree on at most k contestants. Prove that k/m ≥ (n − 1)/(2n). Let us look at two possible approaches to this problem. Solution. Let us count the number N of pairs ({official, official}, contestant), where the officials are distinct and agree on the contestant. Altogether, there are \binom{n}{2} pairs of officials, and each pair can agree on at most k contestants. Therefore, N ≤ k\binom{n}{2}. Now, let us fix a contestant X and count the number of pairs of officials who agree on X. Say that x officials said X was good. Then, there are \binom{x}{2} pairs who agree that X is good and \binom{n−x}{2} pairs who agree that X is bad. Altogether, there are \binom{x}{2} + \binom{n−x}{2} = x(x − 1)/2 + (n − x)(n − x − 1)/2 pairs that agree on X. We have x(x − 1)/2 + (n − x)(n − x − 1)/2 = (2x² − 2nx + n² − n)/2 = (x − n/2)² + n²/4 − n/2 ≥ n²/4 − n/2 = (n − 1)²/4 − 1/4. Since n is odd, the expression (n − 1)²/4 is an integer. Thus, the number of pairs that agree on X, being an integer, is at least (n − 1)²/4. Hence N ≥ m(n − 1)²/4. Combining these two inequalities, we get k/m ≥ (n − 1)/(2n). An alternative solution, using probabilities: let us choose a pair of officials at random, and let X be the random variable which tells the number of contestants on which this pair agrees. We are going to prove the contrapositive implication, i.e., if k/m < (n − 1)/(2n), then X is greater than k with probability greater than zero, which will be denoted P(X > k) > 0.

If the smaller class is always selected to be relabeled, then the total number of operations of the algorithm is O(n log n + m). (As an exercise, complete the details of these considerations yourself!) The above reasoning shows that a slightly better time might be achieved if only the spanning tree of the connected component of a given starting vertex is of interest:

Another spanning tree algorithm

Input: G = (V, E) with n vertices and m edges, and a vertex v ∈ V.
Output: the tree T spanning the connected component of v.
(1) Initialize T0 = ({v}, ∅).
(2) In the i-th step, build the tree Ti as follows: look for edges e which are not in T_{i−1} but whose tail vertex ve is. Take one of them and add it to T_{i−1}, i.e., add the head vertex to V_{i−1} and e to E_{i−1}.
(3) The algorithm terminates as soon as no such edge exists.

Apparently, the resulting graph T is connected. The count of its vertices and edges shows that it is a tree.

Proof. The vertices of T coincide with the vertices of the connected component of the graph G containing the starting vertex v. Indeed, suppose there is a path from v to a vertex w. If w did not lie in T, then label by vi the last of the path's vertices that lies in T (just like in the proof of the previous lemma). However, the subsequent edge of the path would have to be added to T by the algorithm before it terminated, which is a contradiction. Consequently, this algorithm finds a spanning tree of the connected component that contains a given initial vertex v in time O(n + m). □

13.2.7. Minimum spanning tree. All spanning trees of a given graph G have the same number of edges, since this is a general property of trees. Just as the shortest path in graphs with weighted edges was found, a spanning tree with the minimum sum of its edges' weights is now desired.

Definition. Let G = (V, E, w) be a connected graph whose edges e are labeled by non-negative weights w(e). A minimum spanning tree of G is a spanning tree whose total weight does not exceed that of any other spanning tree.

This problem has many applications in practice, for instance in networks of electricity, gas, water, etc. Surprisingly, it is quite simple to find a minimum spanning tree (supposing all edge weights w(e) of G are non-negative) by the following procedure⁶:

⁶ Joseph Bernard Kruskal (1928-2010) was a famous American mathematician, statistician, computer scientist, and psychometrician. There are other famous mathematicians of the same surname: his two brothers and one nephew. Martin David co-invented solitons and surreal numbers, William was active in statistics, and Clyde was a computer scientist too. The above algorithm dates from 1956.

Kruskal's algorithm

Input: a graph G = (V, E, w) with non-negative weights on the edges.
Output: the minimum spanning trees for all components of G.
(1) Sort the m edges in E so that w(e1) ≤ w(e2) ≤ ⋯ ≤ w(em).
(2) For this order of the edges, call the "Spanning forest algorithm" from the previous subsection.

This is a typical example of the "greedy approach", where maximizing profits (or minimizing expenses) is attempted by always choosing the option which is the most advantageous at each stage. In many problems, this approach fails, since low expenses at the beginning may be the cause of much higher ones at the end. Therefore, greedy algorithms are often a base for very useful heuristic algorithms but seldom yield optimal solutions. However, in the case of the minimum spanning tree, this approach works:

Theorem. Kruskal's algorithm finds a minimum spanning tree for every connected graph G with non-negative edge weights. The algorithm runs in O(m log m) time, where m is the number of edges of G.

Proof.
Let T = (V, E(T)) denote the spanning tree generated by Kruskal's algorithm, and let T̃ = (V, E(T̃)) be an arbitrary minimum spanning tree. The minimality implies that ∑_{e∈E(T̃)} w(e) ≤ ∑_{e∈E(T)} w(e), so the goal is to show that also ∑_{e∈E(T)} w(e) ≤ ∑_{e∈E(T̃)} w(e). If E(T) = E(T̃), then nothing further is needed. So assume there exists an edge e ∈ E(T) such that e ∉ E(T̃). From all such edges, choose one, call it e, with weight w(e) as small as possible. The addition of e into T̃ creates a cycle e e1 e2 ⋯ ek in T̃, and at least one of its edges ei is not in E(T). The choice of the edge e implies that if w(ei) < w(e), then the edge ei would have been among the candidate edges in Kruskal's algorithm after a certain subtree T′ ⊆ T ∩ T̃ had been created, so its addition to the gradually constructed tree T would not create a cycle. Therefore, if w(ei) < w(e), the edge ei would have been chosen by the algorithm. It follows that w(ei) ≥ w(e). However, now the edge ei can be replaced with e in T̃ (by the choice of ei, this results in a spanning tree again) without increasing the total weight. So the resulting T̃ is again a minimum spanning tree, and it differs from T in fewer edges than before. Therefore, in a finite number of steps, T̃ is changed to T without increasing the total weight. □

13.2.8. Two more algorithms. The second algorithm for finding a spanning tree, presented in 13.2.6, also leads to a minimum spanning tree:

Consider the random variables Xi for i = 1, 2, . . . , m with values in {0, 1}, denoting whether the pair agrees on the i-th contestant: let Xi = 1 when they agree, and let Xi = 0 otherwise. Hence we have X = X1 + X2 + ⋯ + Xm. Using the linearity of expectation, we obtain E[X] = E[X1] + E[X2] + ⋯ + E[Xm]. Now, let us calculate E[Xi] = ∑_{xi∈{0,1}} xi · P(Xi = xi). Since Xi can be only 0 or 1, we have directly E[Xi] = P(Xi = 1). Let us examine the probability P(Xi = 1), i.e., that the officials agree on the i-th contestant. There are \binom{n}{2} pairs of officials. Let ti denote the number of officials who say the i-th contestant is good, and n − ti the number of those who do not. Then, there are \binom{ti}{2} pairs who agree that the i-th contestant is good and \binom{n−ti}{2} pairs who agree on the contrary. Altogether, there are \binom{ti}{2} + \binom{n−ti}{2} pairs that agree on the i-th contestant. Therefore, E[Xi] = P(Xi = 1) = (\binom{ti}{2} + \binom{n−ti}{2}) / \binom{n}{2}. Hence, E[X] = ∑_{i=1}^{m} (\binom{ti}{2} + \binom{n−ti}{2}) / \binom{n}{2}. We are going to show that for odd values of n, we have \binom{ti}{2} + \binom{n−ti}{2} ≥ (n − 1)²/4. Rearranging this leads to (n − 2ti)² ≥ 1 ⟺ ti ≤ (n − 1)/2 or ti ≥ (n + 1)/2, which is clearly true, since (n − 1)/2 and (n + 1)/2 are adjacent integers. Using the inequality \binom{ti}{2} + \binom{n−ti}{2} ≥ (n − 1)²/4, we obtain E[X] ≥ m((n − 1)/2)² / (n(n − 1)/2) = m(n − 1)/(2n). Thanks to the assumption m(n − 1)/(2n) > k, we have E[X] > k, and thus P(X > k) > 0, which finishes the proof. □

Further, we demonstrate an application of probabilities to an interesting problem.

13.G.3. Let S be a finite set of points in the plane which are in general position (i.e., no three of them lie on a straight line). For any convex polygon P all of whose vertices lie in S, let a(P) denote the number of its vertices and b(P) the number of points from S which are outside P. Prove that for any real number x, we have ∑_P x^{a(P)} (1 − x)^{b(P)} = 1, where the sum runs over all convex polygons P with vertices in S. (A line segment, a singleton, and the empty set are considered to be a convex 2-gon, 1-gon, and 0-gon, respectively.)
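An aside to the theory column: a compact Python sketch of Kruskal's algorithm proved above, combined with the relabel-the-smaller-class bookkeeping from the remark in 13.2.6 (names are illustrative; the sorting step already accounts for the O(m log m) bound):

    def kruskal(n, edges):
        # edges: list of (weight, u, v); vertices are 0, ..., n - 1
        comp = list(range(n))              # comp[v] = label of v's class
        members = [[v] for v in range(n)]
        tree = []
        for w, u, v in sorted(edges):
            cu, cv = comp[u], comp[v]
            if cu == cv:
                continue                   # the edge would close a cycle
            if len(members[cu]) < len(members[cv]):
                cu, cv = cv, cu            # always relabel the smaller class
            for x in members[cv]:
                comp[x] = cu
            members[cu].extend(members[cv])
            members[cv] = []
            tree.append((u, v, w))
        return tree

    print(kruskal(4, [(1, 0, 1), (4, 1, 2), (3, 0, 2), (2, 2, 3)]))
    # [(0, 1, 1), (2, 3, 2), (0, 2, 3)]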
Jarník-Prim's algorithm⁷

Input: a connected graph G = (V, E, w) with n vertices and m edges, with non-negative weights on the edges.
Output: a minimum spanning tree T of G.
(1) Initialize T0 = ({v}, ∅) with some vertex v ∈ V.
(2) In the i-th step, look for all edges e which are not in T_{i−1} but whose tail vertex ve is. Take the one of them with minimal weight and add it to T_{i−1}, i.e., add the head vertex to V_{i−1} and e to E_{i−1}.
(3) The algorithm terminates when the number of added edges reaches n − 1.

Borůvka's algorithm is similar. It constructs as many connected components as possible simultaneously: it begins with the singleton components in the graph T0 = (V, ∅), and in each step, it connects every component to another component with the shortest edge possible. It is easy to prove that (provided the edge weights are pairwise distinct) this results in a minimum spanning tree.

Borůvka's algorithm

Input: a connected graph G = (V, E, w) with non-negative weights on the edges.
Output: the minimum spanning tree for G.
(1) Initialization: create the graph S with the same vertex set as G and no edges, and an empty auxiliary set F of edges.
(2) The main loop: while S contains more than one component, do:
• for every tree T in S, find the shortest edge that connects T to G \ T, and add this edge into F;
• add all edges of F into the graph S and clear F.

Note that Borůvka's algorithm can be executed using parallel computation, which is why it is used in many practical modifications. The proofs that both of these algorithms are correct are similar to that of Kruskal's. The details are omitted.

13.2.9. Traveling salesman problem. So far, our short excursion through graph-based algorithms could give the feeling that simple and straightforward algorithms can always be found for the problems considered. So far, however, only the easy problems have been considered. In all but very few cases, the contrary is true: mostly, there are no algorithms running in polynomial time, so one needs to use algorithms which do not always find the optimal solution but give one which is as good as possible. This is called a heuristic approach.

⁷ Robert Clay Prim (born 1921) is an American mathematician and computer scientist. While he published his work already in the realm of computer science, and hence the most common name of the algorithm is "Prim's", earlier works by Otakar Borůvka (1899-1995) and Vojtěch Jarník (1897-1970) appeared before those by Prim. Borůvka's algorithm was designed when consulting the construction of a new electricity network in Moravia, a region in central Europe, in 1926, and Jarník published the algorithm (rediscovered much later by Prim) in 1930, motivated by Borůvka.

Solution. First of all, we prove the wanted equality for x ∈ [0, 1]. Let us color each point from S white with probability x and black with probability 1 − x (in other words, we consider a random choice of the size |S| with the binomial probability distribution Bi(n, x), where success corresponds to white and failure corresponds to black). We can note that for any such coloring, there must exist a polygon such that all of its vertices are white and all points outside are black (this polygon is the convex hull of the white points). The above shows that the probability that the random choice realizes a polygon with all vertices white and all exterior points black is equal to one.
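An aside to the theory column: a heap-based Python sketch of the Jarník-Prim algorithm above (an illustration only; the heap replaces the explicit search for the minimal edge leaving the tree):

    import heapq

    def prim(graph, v0):
        # graph: vertex -> list of (weight, neighbour); undirected, connected
        in_tree, total = {v0}, 0
        heap = list(graph[v0])
        heapq.heapify(heap)
        while heap and len(in_tree) < len(graph):
            w, v = heapq.heappop(heap)     # lightest edge leaving the tree
            if v in in_tree:
                continue
            in_tree.add(v)
            total += w
            for edge in graph[v]:
                heapq.heappush(heap, edge)
        return total

    graph = {0: [(1, 1), (3, 2)], 1: [(1, 0), (4, 2)],
             2: [(3, 0), (4, 1), (2, 3)], 3: [(2, 2)]}
    print(prim(graph, 0))                  # 1 + 3 + 2 = 6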
However, we can compute this probability in a different way. The event of a polygon having this property is the union of k disjoint events, where k is the number of convex polygons: namely, the events that a given polygon has the desired property (note that the property cannot be shared by different convex polygons). For every given convex polygon P, the probability that its vertices are white and the points outside it are black is equal to x^{a(P)} (1 − x)^{b(P)}, where a(P) is the number of vertices of P and b(P) is the number of points from S outside P. Since the probability of a union of disjoint events is equal to the sum of the particular events' probabilities, we get ∑_P x^{a(P)} (1 − x)^{b(P)} = 1. This proves the equality for all numbers in the interval [0, 1]. However, we can also perceive this fact as follows: any real number from the interval [0, 1] is a root of the polynomial ∑_P x^{a(P)} (1 − x)^{b(P)} − 1. As we know, a nonzero polynomial over the (infinite) field of real numbers can have only finitely many roots (see 12.3.6). Therefore, ∑_P x^{a(P)} (1 − x)^{b(P)} − 1 is the zero polynomial, and the equality ∑_P x^{a(P)} (1 − x)^{b(P)} = 1 thus holds for all real numbers x. □

Remark. This equality holds even if we define the numbers a(P) and b(P) in another way: the definition of a(P) is the same, but now let b(P) denote the number of points from S which are not the vertices of P (thus, we always have a(P) + b(P) = |S|). Then, the given equality is a corollary of the binomial theorem for (x + (1 − x))^{|S|}.

13.G.4. A competition with n players is called an (n, k)-tournament iff it has k rounds and satisfies the following:
i) every player competes in every round, and any pair of players competes at most once;
ii) if A plays with B in the i-th round, C plays with D in the i-th round, and A plays with C in the j-th round, then B plays with D in the j-th round.
Find all pairs (n, k) for which there exists an (n, k)-tournament. Solution. There exists an (n, k)-tournament if and only if 2^⌈log₂(k+1)⌉ divides the integer n. First of all, we are going to show the "if" part. We construct a (2^t, k)-tournament

One of the most important combinatorial problems of this class is the problem of finding a minimum Hamiltonian cycle. This is a Hamiltonian cycle with the minimum sum of the weights of its edges among all Hamiltonian cycles. This problem arises in many practical applications. For instance:
• goods or post delivery (via a given network),
• network maintenance (electricity, water pipelines, IT, etc.),
• request processing (parallel requests for reading from a hard disk, for instance),
• measuring several parts of a system (for example, when studying the structure of a protein crystal using X-rays, the main expenses are due to the movements and focusing for particular measurements),
• material division (for instance, when covering a wall with wallpaper, one tries to keep the pattern continuous while minimizing the amount of unused material).
The greedy approach can be applied in the case of looking for a minimum Hamiltonian cycle as well. The algorithm begins in an arbitrary vertex v1, which is set active, and the other vertices are labeled as sleeping. In each step, it examines the sleeping vertices adjacent to the active one and selects the one which is connected by the shortest edge. The active vertex is labeled as processed, and the selected vertex becomes active. The algorithm terminates either with a failure, when there is no edge going from the active vertex to a sleeping one, or it successfully finds a Hamiltonian path. In the latter case, if there exists an edge from the last vertex vn to v1, a Hamiltonian cycle is obtained.
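For illustration, a Python sketch of this greedy nearest-neighbour heuristic on a complete graph (names are illustrative; as noted below, the result is a Hamiltonian cycle, but rarely a minimal one):

    def greedy_cycle(dist, start=0):
        # dist: symmetric matrix of edge weights of a complete graph
        n = len(dist)
        path, sleeping = [start], set(range(n)) - {start}
        while sleeping:
            v = path[-1]                              # the active vertex
            nxt = min(sleeping, key=lambda y: dist[v][y])
            path.append(nxt)                          # shortest edge to a sleeping vertex
            sleeping.remove(nxt)
        cost = sum(dist[u][v] for u, v in zip(path, path[1:]))
        return path, cost + dist[path[-1]][start]     # close the cycle

    dist = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 8], [10, 4, 8, 0]]
    print(greedy_cycle(dist))                         # ([0, 1, 3, 2], 23)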
This algorithm seldom produces a minimal Hamiltonian cycle. At least, it always finds some (and a relatively small) Hamiltonian cycle in a complete graph.

13.2.10. Flow networks. Another group of applications of the language of graph theory concerns moving some amount of a measurable material in a fixed network. The vertices of a directed graph represent places between which one transports material up to predetermined limits, which are given as assessments of the edges (called capacities). There are two important types of vertices: the sources and the sinks of the network. A network is a directed graph with valued edges, where some of the vertices are labeled as sources or sinks. Without loss of generality, assume that the graph is directed and has only one source and one sink: in the general case, an artificial source and an artificial sink can always be added, connected with directed edges to the original sources and sinks. The capacities of the added edges would then cover the maximum capacities of the particular sources and sinks. The situation is depicted in the diagram. There, the black vertices on the left correspond to the given sources, while the black vertices on the right stand for the given sinks. On the left,

where k ≤ 2^t − 1 (the general case 2^t | n can then be easily derived from that). There are thus 2^t players in the tournament, so we assign to each player a (unique) number from the set {0, 1, . . . , 2^t − 1}. In the i-th round, player α competes with player α ⊕ i (where ⊕ is the binary XOR operation, i.e., the j-th bit of a ⊕ b is one if and only if the j-th bit of a is different from the j-th bit of b). This schedule is correct, since every player is engaged in every round and different players have different opponents (for α ≠ β, we have α ⊕ i ≠ β ⊕ i). Further, the opponent of the opponent of α is indeed α (since (α ⊕ i) ⊕ i = α). Moreover, the second tournament rule is also satisfied: if α plays with β and γ plays with δ in the i-th round (i.e., β = α ⊕ i and δ = γ ⊕ i) and if j is such that α plays with γ in the j-th round, then we have β ⊕ j = (α ⊕ i) ⊕ j = (α ⊕ j) ⊕ i = γ ⊕ i = δ, so β indeed plays with δ in the j-th round. Any (2^t · s, k)-tournament, where s is odd, can be obtained as s parallelized (2^t, k)-tournaments. Now, we are going to show that the condition 2^⌈log₂(k+1)⌉ | n is necessary as well. Consider the graph Gi whose vertices correspond to the players and whose edges are between the pairs who have played in or before the i-th round. Consider players A and B who play together in round i + 1. We want to show that we must have |Γ| = |Δ|, where Γ is the component of A and Δ is the component of B. Actually, we show that any player of Γ competes with a player of Δ in round i + 1. Thus, let C ∈ Γ, i.e., in Gi, there exists a path A = X1, X2, . . . , Xm = C such that Xj has played with X_{j+1}, j = 1, . . . , m − 1, in or before the i-th round. Consider the sequence Y1, Y2, . . . , Ym, where Yk is the opponent of Xk in round i + 1, k = 1, . . . , m (thus Y1 = B). Then, for any 1 ≤ j ≤ m − 1, we have that Xj competes with Yj and X_{j+1} competes with Y_{j+1} in round i + 1 (by the definition of the sequence Y1, . . . , Ym), and in a certain round r (1 ≤ r ≤ i), Xj played with X_{j+1} (by the definition of the sequence X1, . . . , Xm).
However, by the second tournament rule, this means that Yj also played with Y_{j+1} in the r-th round, so the edge YjY_{j+1} is contained in Gi for any 1 ≤ j ≤ m − 1. Thus, Y1, Y2, . . . , Ym is a path in Gi, so B = Y1 and Ym lie in the same component (Δ). It can be deduced analogously that any player from Δ competes with a player of Γ in round i + 1, and since every player plays exactly once in a given round, we must have |Γ| = |Δ|. By the definition of a component, the component of A in G_{i+1} is equal to Γ ∪ Δ. Then, we have either Γ = Δ (in which case the component of A in G_{i+1} is Γ), or Γ ∩ Δ = ∅ (in this case, the component of A in G_{i+1} is the disjoint union Γ ∪ Δ). Altogether, the component of A in G_{i+1} is either the same or twice as great as in Gi. Now, consider the components Γ1, Γ2, . . . , Γk of A in the respective graphs G1, G2, . . . , Gk. We have |Γ1| = 2 (since A had exactly one opponent in the first round), and for 1 ≤ i ≤ k − 1, we have either |Γi| = |Γ_{i+1}| or 2|Γi| = |Γ_{i+1}|. Therefore, the number of vertices (players) of every component is a power of two, i.e., |Γk| = 2^l for some l, and |Γk| ≥ k + 1 (A had different opponents in the k rounds). Hence, 2^l ≥ k + 1, i.e., 2^l is at least 2^⌈log₂(k+1)⌉, so the number of players in each component is divisible by 2^⌈log₂(k+1)⌉. Thus, so must be the total number n. □

there is an artificial source (a white vertex), and there is an artificial sink on the right. The edge values are not shown in the diagram.

Flow networks

A network is a directed graph G = (V, E) with a distinguished vertex z, called the source, and another distinguished vertex s, called the sink, together with a non-negative assessment of the edges w : E → R, which represents their capacities. A flow in a network S = (V, E, z, s, w) is an assessment of the edges f : E → R such that, for each vertex v except for the source and the sink, the total input is equal to the total output, i.e., ∑_{e∈IN(v)} f(e) = ∑_{e∈OUT(v)} f(e). This rule is often called Kirchhoff's law (referring to the terminology used in physics). The size of a flow f is given by the total balance of the source values, |f| = ∑_{e∈OUT(z)} f(e) − ∑_{e∈IN(z)} f(e).

It follows directly from the definition that the size of a flow f can also be computed as |f| = ∑_{e∈IN(s)} f(e) − ∑_{e∈OUT(s)} f(e). The left-hand part of the following diagram shows a simple network with the source in the white circled vertex and the sink in the black bold vertex. The labels over the edges determine the maximal capacities. Looking at the sum of the capacities that enter the sink, the maximum flow in this network is 5 (the sum of the capacities leaving the source is larger).

(diagram: on the left, the network with edge capacities; on the right, the same network with a flow of size 5, each edge labeled flow/capacity)

13.2.11. Maximum flow problem. The next task is to find the maximum possible flow for a given network on a graph G. The right-hand side of the above diagram shows a flow of size five, and the size of any flow cannot exceed this.
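For illustration, a small Python sketch (not from the original text) checking Kirchhoff's law and computing the size of a flow given as a dictionary over the directed edges:

    def flow_size(flow, z):
        # |f| = flow out of the source minus flow into the source
        return (sum(f for (u, v), f in flow.items() if u == z)
                - sum(f for (u, v), f in flow.items() if v == z))

    def kirchhoff_ok(flow, vertices, z, s):
        for x in vertices - {z, s}:       # conservation at the inner vertices
            into = sum(f for (u, v), f in flow.items() if v == x)
            out = sum(f for (u, v), f in flow.items() if u == x)
            if into != out:
                return False
        return True

    flow = {('z', 'a'): 2, ('z', 'b'): 3, ('a', 's'): 2, ('b', 's'): 3}
    print(flow_size(flow, 'z'), kirchhoff_ok(flow, {'z', 'a', 'b', 's'}, 'z', 's'))
    # 5 True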
The player who takes the last token wins. Is there a winning strategy for one of the players?

Solution. Note that this game is the sum of four games corresponding to one-pile games in which an arbitrary (positive) number of tokens can be removed (the sum of combinatorial games is both commutative and associative, so we can talk about the sum of those games without having to specify the order). A simple induction argument shows that the value of the Sprague–Grundy function (the SG-value) of such a one-pile game equals the number of tokens: Suppose that a natural number n is such that for all k < n, the SG-value of the game with k tokens is k. According to the rules of the game, we can remove an arbitrary (positive) number of tokens, i.e., we can leave an arbitrary number from 0 to n − 1. By the induction hypothesis, this means that for any number k < n, we can reach a position whose SG-value is k, and we cannot reach a position whose SG-value would be n. By the definition of the SG-function, the value of the game with n tokens is n.

It follows from the theorem of subsection 13.2.17 that the SG-value of the initial position of our game equals the xor of the SG-values of the initial positions of the particular games, namely $9 \oplus 10 \oplus 11 \oplus 14 = 6$. Since this value is non-zero, there exists a winning strategy for the first player: he always moves to a position whose SG-value is zero; such a position must exist by the definition of the SG-function. For instance, the first move would be to remove 6 tokens from the pile containing 14. (We look at the highest one in the binary expansion of the SG-value and find a pile where the corresponding bit is also one. Then, we set this bit to zero, thereby surely decreasing the number of tokens, and adjust the lower bits so that there is an even number of ones in each position, resulting in zero SG-value.) □

13.H.2. Consider the following game for two players: On the table, there is a pile of tokens. Players alternate moves, where a move consists of either splitting one pile into two (non-empty) piles or removing an arbitrary (positive) number of tokens from a pile. The player who takes the last token wins. Find the SG-value of the initial position of this game if the pile contains n tokens.

The fundamental principle is to add up the capacities of a set of edges through which every path from z to s must go. In the diagram, there are three such choices, providing the limits 12, 8, 5 (from left to right). At the same time, in such a simple case, the flow realizing the maximal possible value is easily found. This idea can be formalized as follows:

Cut in a network
A cut in a network S = (V, E, z, s, w) is a minimal set of edges C ⊆ E such that when these edges are removed, there remains no path from the source z to the sink s in the graph G = (V, E \ C). The number $|C| = \sum_{e \in C} w(e)$ is called the capacity of the cut C.

Clearly, there is no flow whose size is greater than the capacity of any cut. We present the Ford–Fulkerson algorithm⁸, which finds a cut with the minimum possible capacity as well as a flow which realizes this value. This proves the following theorem:

Theorem. In any network S = (V, E, z, s, w), the maximum size of a flow equals the minimum capacity of a cut in S.

The idea of the algorithm is quite simple. It looks for paths between the vertices of the graph, trying to “saturate” them with the flow.
For this purpose, define the following terminology: An undirected path from the vertex v to the vertex v′ in a network S = (V, E, z, s, w) is called unsaturated if and only if all edges e directed along the path from v to v′ satisfy f(e) < w(e), and the edges e in the other direction satisfy f(e) > 0 (sometimes, one tries to saturate the flow in the other direction, yielding a semipath, or the augmenting semipath). The residual capacity of an edge e is the number w(e) − f(e) if the edge is directed along the path from v to v′, and it is the number f(e) otherwise. The residual capacity of a path is defined to be the minimum residual capacity of its edges. For the sake of simplicity, assume that all the edge capacities are rational numbers.

⁸ Ford, L. R.; Fulkerson, D. R. (1956), “Maximal flow through a network”, Canadian Journal of Mathematics 8, 399–404.

Solution. We are going to prove by induction that every non-negative integer k satisfies:
g(4k + 1) = 4k + 1,
g(4k + 2) = 4k + 2,
g(4k + 3) = 4k + 4,
g(4k + 4) = 4k + 3.
Clearly, we have g(0) = 0. The following picture shows how we can deduce the value of the SG-function for one-, two-, and three-token piles. However, it is apparent that this would be much harder for a general number of tokens. Now, assume that the above holds for all positive integers below 4k + 1, and let us prove that we indeed have g(4k + 1) = 4k + 1. By definition, the SG-value is the least non-negative integer l such that there is no move to a position with SG-value l. Moreover, this property (together with the fact that the terminal positions have zero value) determines the Sprague–Grundy function uniquely. Therefore, it suffices to prove that, for each l < 4k + 1, we can move to a position with SG-value l, and that we cannot get into a position with SG-value 4k + 1. The former is clear, since by the induction hypothesis, the SG-values of the one-pile games of 0, 1, . . . , 4k tokens take all the integers 0, 1, . . . , 4k (although not in this order), so we can just remove the corresponding number of tokens from the pile. Now, we show that we cannot reach a position with SG-value 4k + 1: the only moves that could possibly lead to this SG-value are those splitting the pile into two. If we examine the resulting amounts modulo 4, there are two possibilities: either the number of tokens in one of the resulting piles is divisible by 4 (i.e. 4a) and the other one leaves remainder 1 (i.e. 4b + 1), or the numbers leave remainders 2 and 3, respectively. In the former case, the SG-values of the resulting piles are, by the induction hypothesis, 4a − 1 and 4b + 1 (the numbers of tokens in the particular piles are non-zero and less than 4k + 1, so we may use the induction hypothesis).

Ford–Fulkerson algorithm
Input: A network S = (V, E, z, s, w).
Output: A maximal possible flow f : E → R and a minimal cut C, given by the edges leading from the set U ⊆ V of the vertices to which there exists an unsaturated path, to V \ U.
(1) Initialization: Set f(e) = 0 for each edge e ∈ E and, using depth-first search from z, find the set U ⊆ V of the vertices to which there exists an unsaturated path.
(2) The main loop: While s ∈ U, do
• select an unsaturated path P from the source z to the sink s; then increase the flow f along the path P by the value of the residual capacity of P (i.e., increase f on the edges directed along P and decrease it on the remaining ones);
• update U.

Proof. As seen above, the size of any flow cannot exceed the capacity of any cut.
Therefore, it suffices to show that when the algorithm terminates, the capacity of the generated cut equals the size of the constructed flow. The algorithm terminates at the first moment when there is no unsaturated path from the source z to the sink s. This means that U does not contain s, and all edges e starting in U and ending outside of U have f(e) = w(e) (otherwise, the other endpoint of e would be added to U when updating U). For the same reason, all edges e leading from V \ U to U must have f(e) = 0. Clearly, at each beginning of the main loop, the total size of the flow satisfies
$$|f| = \sum_{\text{edges from } U \text{ to } V\setminus U} f(e) \;-\; \sum_{\text{edges from } V\setminus U \text{ to } U} f(e).$$
When the algorithm terminates, this expression equals the capacity of the final cut C, which is the desired result. It remains to show that the algorithm always terminates. Since the edges are assessed with rational numbers, it can be assumed, after rescaling, that the capacities are integers. Then every flow constructed during the run of the algorithm has an integer size, and every iteration of the main loop increases the size of the flow by a positive integer. Since any cut bounds the maximum size of any flow from above, the algorithm must terminate after a finite number of steps. □

(Diagram: two stages of the run of the algorithm on the sample network, with the current flow written as flow/capacity on each edge.)

In the latter case, i.e., if we split the pile into 4a + 2 and 4b + 3 tokens, we get that their SG-values are 4a + 2 and 4b + 4. Furthermore, a two-pile game is the sum of the two corresponding one-pile games, so the SG-value of the two-pile game is the xor (nim-sum) of their SG-values. In both cases, the resulting SG-value leaves remainder 2 upon division by 4 (consider the last two bits). In particular, it is surely not equal to 4k + 1. This proves the induction step for positive integers of the form 4k + 1. The proof for integers of the form 4k + 2 is analogous.

The situation is more interesting in the 4k + 3 case: Similarly as above, it follows from the induction hypothesis that the SG-values of the one-pile positions we can move to exhaust all the non-negative integers up to 4k + 2. However, note that if we split the pile into two piles containing 1 and 4k + 2 tokens, respectively, then their SG-values are also 1 and 4k + 2 by the induction hypothesis, and the xor of these integers is 4k + 3. It remains to prove that there is no move into a position with SG-value 4k + 4: Again, the only remaining possibility is to split the existing pile. Then, the resulting remainders modulo 4 are either 0 and 3, or 1 and 2. By the induction hypothesis, the remainders of the corresponding SG-values are respectively 3 and 0 in the former case, and 1 and 2 in the latter. In either case, the xor of these integers (and thus the SG-value of the resulting position) leaves remainder 3, so it is not equal to 4k + 4. This proves the induction step for positive integers of the form 4k + 3. The proof for integers of the form 4k + 4 is analogous. □
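The pattern just proved can also be confirmed by brute force, directly from the definition of the SG-function. A minimal sketch in Python (the function name sg and the memoization via lru_cache are our own choices):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def sg(n):
        # SG-value of one pile of n tokens in the game of 13.H.2:
        # a move either removes a positive number of tokens,
        # or splits the pile into two non-empty piles.
        reachable = {sg(k) for k in range(n)}                  # remove tokens
        reachable |= {sg(a) ^ sg(n - a) for a in range(1, n)}  # split the pile
        m = 0                                                  # mex of the set
        while m in reachable:
            m += 1
        return m

    print([sg(n) for n in range(1, 13)])
    # [1, 2, 4, 3, 5, 6, 8, 7, 9, 10, 12, 11]

The printed values match the formulas g(4k + 1) = 4k + 1, g(4k + 2) = 4k + 2, g(4k + 3) = 4k + 4, g(4k + 4) = 4k + 3.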
I. Generating functions

13.I.1. In how many ways can we buy 12 packs of coffee if we can choose from 5 kinds? Further, solve this problem with the following modifications: i) we want to buy at least 2 packs of each kind; ii) we want to buy an even number of packs of each kind; iii) there are only 3 packs of one of the kinds.

Solution. The basic problem is a classical example of a combinatorial problem on the number of 5-combinations with repetition; the answer is $\binom{12+5-1}{5-1} = \binom{16}{4}$. The modifications can also be solved by combinatorial reasoning with a bit of invention. However, we want to demonstrate how these problems can be solved (almost without any need to think) using generating functions. The wanted number corresponds to the coefficient at $x^{12}$ in the expansion of the function
$$(1 + x + x^2 + \cdots)^5 = (1 + x + \cdots)(1 + x + \cdots) \cdots (1 + x + \cdots)$$
into a power series. The number of packs of the first kind determines which term is selected from the first parenthesis, and similarly for the other kinds. (Note that we need not pay special attention to the fact that there cannot be more than 12 packs of a given kind; it turns out that infinite series are usually simpler to work with than finite polynomials.)

The run of the algorithm is illustrated in two diagrams. On the left, there are two shortest unsaturated paths from the source to the sink, drawn in gray (the upper one has two edges, while the lower one has three). On the right, another path is saturated (taking the first turn in the upper path), also drawn in gray. Now, it is apparent that there can be no other unsaturated path from the source to the sink. Therefore, the algorithm terminates at this moment.

13.2.12. Further remarks. The algorithm allows further conditions to be incorporated in the problem. For instance, capacity limits can be set for the vertices of the network as well, and there may be not only upper limits for the flows along particular edges or through vertices, but also lower ones. It is easy to add vertex capacities: just double every vertex (one copy for the incoming edges, the other for the outgoing edges), connecting each pair with an edge of the corresponding capacity. The lower limits for the flow can be included in the initialization part of our algorithm; however, one then needs to check whether such a flow exists at all. Many other variations can be found in the literature. On the other hand, the algorithm does not necessarily terminate if the edge capacities are irrational. Moreover, the flows constructed during the run may not even converge to the optimal solution in such a case. Still, it holds that if the algorithm terminates, then a maximum flow is found. If the capacities are integers (equivalently, rational numbers), the running time of the algorithm can be bounded by O(f|E|), where f is the size of a maximum flow in the network and |E| is the number of edges. The worst case occurs if every iteration increases the size of the flow by one. In the proof of correctness, no explicit way of searching the graph when looking for an unsaturated path is used. A variation of the Ford–Fulkerson algorithm is to use breadth-first search; the resulting algorithm is called Edmonds–Karp, and its running time is O(|V||E|²).⁹ We also mention Dinic's algorithm, which simplifies the search for an unsaturated path by constructing the level graph, where augmenting edges are considered only if they lead between vertices whose distances from the source differ by one. The time complexity of this algorithm is O(|V|²|E|), which is much better for dense graphs than the Edmonds–Karp algorithm.
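For illustration, here is a minimal sketch of the breadth-first (Edmonds–Karp) variant in Python. The encoding of the network as a nested dictionary of residual capacities and the function name max_flow are our own choices, and every edge is assumed to be accompanied by its reverse edge with capacity 0:

    from collections import deque

    def max_flow(capacity, z, s):
        # capacity[u][v] is the residual capacity of the edge u -> v;
        # reverse edges with capacity 0 must be present in the dictionary.
        total = 0
        while True:
            # breadth-first search for a shortest unsaturated path z -> s
            parent = {z: None}
            queue = deque([z])
            while queue and s not in parent:
                u = queue.popleft()
                for v, c in capacity[u].items():
                    if c > 0 and v not in parent:
                        parent[v] = u
                        queue.append(v)
            if s not in parent:          # no unsaturated path remains
                return total
            # collect the path and its residual capacity
            path, v = [], s
            while parent[v] is not None:
                path.append((parent[v], v))
                v = parent[v]
            aug = min(capacity[u][v] for u, v in path)
            for u, v in path:            # update the residual capacities
                capacity[u][v] -= aug
                capacity[v][u] += aug
            total += aug

For integer capacities, termination follows exactly as in the proof above; applied to the sample network of 13.2.10 (encoded as such a dictionary), the computation returns the maximum flow of size 5.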
13.2.13. Problems related to flow networks. A good application of flow networks is the problem of bipartite matching. The task is to find a maximum matching in a bipartite graph, i.e. a set of as many edges as possible such that each vertex of the graph is an endpoint of at most one of the selected edges.

⁹ Edmonds, Jack; Karp, Richard M. (1972), “Theoretical improvements in algorithmic efficiency for network flow problems”, Journal of the ACM 19 (2), 248–264, doi:10.1145/321694.321699.

Since $\frac{1}{1-x} = 1 + x + x^2 + \cdots$ (see 13.4.3), the function we are considering is $(1-x)^{-5}$. Our task is thus to expand $(1-x)^{-5}$ into a power series. By the generalized binomial theorem from 13.4.3, the coefficient at $x^k$ is the number $\binom{k+5-1}{5-1}$, which is $\binom{16}{4}$ in our case. Note that using generating functions, we have answered the question not only for 12, but for an arbitrary number of packs of coffee. The modifications can be solved analogously:

i) The generating function is $(x^2 + x^3 + \cdots)^5 = \left(\frac{x^2}{1-x}\right)^5 = \frac{x^{10}}{(1-x)^5}$, hence the coefficient at $x^{12}$ equals $\binom{2+5-1}{5-1}$.

ii) An even number of packs of each kind corresponds to the generating function $(1 + x^2 + x^4 + \cdots)^5 = \frac{1}{(1-x^2)^5}$. The coefficient at $x^{12}$ can be found by many means; the easiest seems to be the substitution $y = x^2$ and looking for the coefficient at $y^6$ (which can be perceived as joining the packs into pairs in the shop). This leads to the answer $\binom{6+5-1}{5-1}$.

iii) In this case, the generating function equals $(1 + x + x^2 + x^3)(1 + x + x^2 + \cdots)^4$, and the wanted result is thus $\binom{12+4-1}{4-1} + \binom{11+4-1}{4-1} + \binom{10+4-1}{4-1} + \binom{9+4-1}{4-1}$. □

13.I.2. In how many ways can we use coins of values 1, 2, 5, 10, 20, and 50 crowns to pay exactly 100 crowns?

Solution. We are looking for non-negative integers $a_1, a_2, a_5, a_{10}, a_{20}, a_{50}$ such that $a_i$ is a multiple of i for all $i \in \{1, 2, 5, 10, 20, 50\}$ and, at the same time, $a_1 + a_2 + a_5 + a_{10} + a_{20} + a_{50} = 100$. The wanted number of ways can be obtained as the coefficient at $x^{100}$ in the product
$$(1 + x + x^2 + \cdots)(1 + x^2 + x^4 + \cdots)(1 + x^5 + x^{10} + \cdots)(1 + x^{10} + x^{20} + \cdots)(1 + x^{20} + x^{40} + \cdots)(1 + x^{50} + x^{100} + \cdots)$$
$$= \frac{1}{1-x} \cdot \frac{1}{1-x^2} \cdot \frac{1}{1-x^5} \cdot \frac{1}{1-x^{10}} \cdot \frac{1}{1-x^{20}} \cdot \frac{1}{1-x^{50}}.$$
The result can be obtained using software, for instance SAGE (the names of the used commands are pretty self-descriptive, aren't they?):

    sage: f = 1/(1-x) * 1/(1-x^2) * 1/(1-x^5) * 1/(1-x^10) * 1/(1-x^20) * 1/(1-x^50)
    sage: r = taylor(f, x, 0, 100)
    sage: r.coeff(x, 100)
    4562

□

This is an abstract variation of a quite common problem. For instance, it may be needed to match boys and girls in dancing lessons, provided information is given about which pairs would be willing to dance together. This problem is easily reduced to the problem of maximum flow: add an artificial source to the graph and connect it with edges to all vertices of one part of the bipartite graph, while the vertices of the other part are connected to an artificial sink. The capacity of each edge is set to one, and the resulting graph is searched for a maximum flow. Then, the edges used in the flow correspond to the selected pairs. Of course, additional information on which pairs may be put together can be included by leaving some of the edges out. Another important application of flow networks is the proof of Menger's theorem (mentioned as a theorem in 13.1.10).
It can be understood as follows: Given a directed graph, set the capacity of each edge as well as of each vertex to one. Further, select an arbitrary pair of vertices v and w, which are considered to be the source and the sink, respectively. Then, the size of a maximum flow in this graph equals the maximum number of disjoint paths from v to w (the paths may share only the source and the sink). Every cut separates v and w into different connected components of the remaining graph (since they are chosen to be the source and sink). The desired statements then follow from the fact that the size of a maximum flow equals the capacity of a minimum cut.

13.2.14. Game trees. We turn our attention to a very broadly used application of tree structures: the analysis of possible strategies or procedures. Such analyses can be encountered in the theory of artificial intelligence as well as in game theory, and they play an important role in economics and many other social fields. This is about games. In the mathematical sense, game theory examines models in which one or more players take turns in playing moves according to predetermined and generally known rules. Usually, the moves are assessed with profits or losses for the given player. The task is then to find a strategy for each player, i.e. an algorithmic procedure which maximizes the profits or minimizes the losses. We use an extensive description of the games. This means that a complete and finite analysis of all possible states of the game is given, and the resulting analysis gives an exact account of the profits and losses, supposing that the other players also play the moves which are best for them. A game tree is a rooted tree whose vertices are the possible states of the game, labeled according to whose turn it is. The outgoing edges of a vertex correspond to the possible moves of the player from that state. This complete description of a game using the game tree may be used for common games like chess, noughts and crosses (also known as tic-tac-toe), etc.

13.I.3. Expand the following functions into power series: i) $\frac{x}{x+2}$, ii) $\frac{x^2+x+1}{2x^3-3x^2+1}$.

Solution. i)
$$\frac{x}{x+2} = \frac{x/2}{1 - (-x/2)} = \frac{x}{2} - \frac{x^2}{4} + \frac{x^3}{8} - \cdots = \sum_{n=1}^{\infty} (-1)^{n-1} \frac{x^n}{2^n}.$$

ii) We perform a partial fraction decomposition:
$$\frac{x^2+x+1}{2x^3-3x^2+1} = \frac{x^2+x+1}{(x-1)^2(2x+1)} = \frac{A}{2x+1} + \frac{B}{x-1} + \frac{C}{(x-1)^2},$$
finding that $A = B = \frac{1}{3}$ and $C = 1$; hence
$$\frac{x^2+x+1}{2x^3-3x^2+1} = \frac{1/3}{1+2x} - \frac{1/3}{1-x} + \frac{1}{(1-x)^2} = \sum_{n=0}^{\infty} \left[\tfrac{1}{3}\bigl((-2)^n - 1\bigr) + (n+1)\right] x^n.$$
□

13.I.4. Find the generating functions of the following sequences: i) (1, 2, 3, 4, 5, . . . ), ii) (1, 4, 9, 16, . . . ), iii) (1, 1, 2, 2, 4, 4, 8, 8, . . . ), iv) (9, 0, 0, 2 · 16, 0, 0, 4 · 25, 0, 0, 8 · 36, . . . ), v) (9, 1, −9, 32, 1, −32, 100, 1, −100, . . . ). ⃝

13.I.5. In how many ways can we buy n pieces of the following five kinds of fruit if we do not distinguish between particular pieces of a given kind, we need not buy all kinds, and:
• there is no restriction on the number of apples we buy,
• we want to buy an even number of bananas,
• the number of pears we buy must be a multiple of 4,
• we can buy at most 3 oranges, and
• we can buy at most 1 pomelo.

As a simple example, consider a simple variation of the game known as Nim.¹⁰ There are k tokens on the table (the tokens may be sticks or matches), where k > 1 is an integer, and players take turns at removing one or two tokens.
The player who manages to take the last token(s) wins. There is a variation of the game in which the player who is forced to take the last token loses. The tree of this game, including all necessary information about the game, can be constructed as follows:
• The state with ℓ tokens on the table and the first player to move corresponds to the subtree rooted at Fℓ. The state with the same number of tokens but the second player to move is represented by the subtree rooted at Sℓ.
• The vertex Fℓ has Sℓ−1 as its left-hand son and Sℓ−2 as its right-hand son. Similarly, the sons of the vertex Sℓ are Fℓ−1 and Fℓ−2.
• The leaves are always F0 or S0. (In the variation where the player to take the last token loses, these would be the states F1 and S1.)
Every run of the game starting at the root Fk corresponds to exactly one leaf of the resulting tree. Therefore, the total number p(k) of possible runs for Fk satisfies p(k) = p(k − 1) + p(k − 2) for k ≥ 3, and clearly p(1) = 1 and p(2) = 2. This difference equation was already considered. It is satisfied by the Fibonacci numbers, which can be computed by an explicit formula (see the subsection on generating functions at the end of this chapter, or the corresponding part about difference equations in chapter three, cf. 3.B.1). Hence a formula is known for the number of possible runs of the game. The number of possible states equals the number of all vertices of the tree. The game always ends in a win for one of the players; we can also consider games where a tie is possible.

13.2.15. Game analysis. The tree structure allows an analysis of the game so that an algorithmic strategy for each player can be built. This is done with a simple recursive procedure for assessing the root of a subtree. Each vertex is given a label: W for vertices where the first player can force a win, L for those where the first player loses if the other one plays optimally, and, optionally, T for vertices where optimal play of both players results in a tie. The procedure is as follows:
(1) The leaves are labeled directly according to the rules of the game (in the case of our Nim, the leaves S0 are labeled by W, and the leaves F0 by L).
(2) Considering the vertex Fℓ: label it W if it has a son labeled by W. If there is no such son, but there is a son labeled by T, then Fℓ is given the label T. Otherwise, i.e. if all sons are labeled by L, then Fℓ also gets L.

¹⁰ The game was given this name by Charles Bouton in his analysis of this type of games from 1901. It refers to the German word “Nimm!”, meaning “Take!”.

Solution. The generating function for the sequence (an), where an is the wanted number of ways to buy n pieces of fruit, is
$$(1 + x + x^2 + \cdots)(1 + x^2 + x^4 + \cdots)(1 + x^4 + x^8 + \cdots)(1 + x + x^2 + x^3)(1 + x)$$
$$= \frac{1}{1-x} \cdot \frac{1}{1-x^2} \cdot \frac{1}{1-x^4} \cdot \frac{1-x^4}{1-x} \cdot (1+x) = \frac{1}{(1-x)^3}.$$
By the generalized binomial theorem, we have $(1-x)^{-3} = \sum_{n=0}^{\infty}\binom{n+2}{2}x^n$. Therefore, the wanted number of ways satisfies $a_n = \binom{n+2}{2}$. □

13.I.6. Using the generalized binomial theorem, prove again the following combinatorial identities:
• $\sum_{k=0}^{n} \binom{n}{k} = 2^n$,
• $\sum_{k=0}^{n} (-1)^k \binom{n}{k} = 0$,
• $\sum_{k=0}^{n} k \binom{n}{k} = n2^{n-1}$.

Solution. Substituting the numbers x = 1 and x = −1 into the binomial theorem
$$(1+x)^n = \binom{n}{0} + \binom{n}{1}x + \binom{n}{2}x^2 + \cdots + \binom{n}{n}x^n,$$
we obtain the first and the second identities, respectively. The third one can be obtained by viewing both sides of the binomial theorem “continuously” (as functions of a real variable) and using the properties of derivatives, as spelled out below. □
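For completeness, the differentiation step for the third identity is the following routine computation:
$$\frac{d}{dx}(1+x)^n = n(1+x)^{n-1} = \sum_{k=1}^{n} k\binom{n}{k}x^{k-1},$$
and substituting x = 1 yields $\sum_{k=0}^{n} k\binom{n}{k} = n2^{n-1}$ (the k = 0 term vanishes).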
13.I.7. In a box, there are 30 red, 40 blue, and 50 white balls. Balls of one color are indistinguishable. In how many ways can we select 70 balls?

Solution. The wanted number equals the coefficient at $x^{70}$ in the product
$$(1 + x + \cdots + x^{30})(1 + x + \cdots + x^{40})(1 + x + \cdots + x^{50}).$$
This product can be rearranged to $(1-x)^{-3}(1-x^{31})(1-x^{41})(1-x^{51})$, whence, using the generalized binomial theorem, we obtain
$$\left(\binom{2}{2} + \binom{3}{2}x + \binom{4}{2}x^2 + \cdots\right)(1 - x^{31} - x^{41} - x^{51} + x^{72} + \cdots).$$

(3) Similarly, a vertex Sℓ is labeled L if it has a son labeled by L. Otherwise, if it has a son labeled by T, it receives T. Otherwise (i.e. if it has only W-sons), it is labeled by W.
Calling this procedure on the root of the tree gives the labeling of each vertex as well as an optimal strategy for each player:
• The first player tries to move to a vertex labeled by W; if this cannot be achieved, he moves to a T-vertex at least.
• Similarly, the second player tries to move to a vertex labeled by L; if this cannot be achieved, he moves to a T-vertex at least.
The depth of the recursion is given by the depth of the tree. For instance, for a Nim game with k tokens, the depth is k. This analysis is not very useful yet: in order to use it in the mentioned form, the entire game tree must be at our disposal. This can be a great amount of data (for instance, in the case of noughts and crosses on a 3 × 3 playground, the corresponding tree has tens of thousands of vertices). Usually, the analysis with a game tree is used when only a small part of the whole tree is examined, applying appropriate heuristic methods, and the corresponding part is created dynamically during the game. This is a fascinating field of the modern theory of artificial intelligence; the details are omitted here. There is a more compact representation of the tree structure for our purposes of complete formal analysis. If the game tree for Nim is drawn, then one state of the game is represented by many vertices which correspond to different histories of the game. However, the strategies depend only on the actual state (i.e. the number of tokens and the player to move) rather than on the history of the game. Therefore, the same game can be described by a graph where, for each number of tokens, there is only one vertex, and the whole strategy is determined by identifying who is winning (whether it is the player on move or the other one). Directed edges are used for the description of possible moves, and the result is always an acyclic graph.

(Diagram: on the left, the complete game tree for three tokens with the vertices Fℓ, Sℓ labeled by W and L; on the right, the directed acyclic graph for the game with seven tokens, with vertices labeled 0P, 1N, 2N, 3P, 4N, 5N, 6P, 7N.)

The example of the game Nim is displayed in the diagram. On the left, there is a complete game tree corresponding to three tokens. The directed graph on the right represents the game with seven tokens. A complete tree for this game would already have 21 leaves, and the number of leaves grows exponentially with the number of tokens.

Hence, the coefficient at $x^{70}$ is clearly
$$\binom{70+2}{2} - \binom{70+2-31}{2} - \binom{70+2-41}{2} - \binom{70+2-51}{2} = 1061.$$
□

13.I.8. Prove that $\sum_{k=1}^{n} H_k = (n+1)(H_{n+1} - 1)$.

Solution. The necessary convolution can be obtained as the product of the series $\frac{1}{1-x}$ and $\frac{1}{1-x}\ln\frac{1}{1-x}$. Hence
$$[x^n]\,\frac{1}{(1-x)^2}\ln\frac{1}{1-x} = \sum_{k=1}^{n} \frac{1}{k}(n+1-k),$$
whence the wanted identity follows easily. □

13.I.9. Solve the recurrence $a_0 = a_1 = 1$, $a_n = a_{n-1} + 2a_{n-2} + (-1)^n$.

Solution.
As always, it may be a good idea to write out a few terms of the sequence (this will not help us much in this case; still, it can serve as a verification of the result).²

Step 1: $a_n = a_{n-1} + 2a_{n-2} + (-1)^n[n \ge 0] + [n = 1]$.
Step 2: $A(x) = xA(x) + 2x^2A(x) + \frac{1}{1+x} + x$.
Step 3: $A(x) = \frac{1 + x + x^2}{(1-2x)(1+x)^2}$.
Step 4: $a_n = \frac{7}{9}2^n + \left(\frac{1}{3}n + \frac{2}{9}\right)(-1)^n$. □

13.I.10. Quicksort – analysis of the average case. Our task is to determine the expected number of comparisons made by Quicksort, a well-known algorithm for sorting a (finite) sequence of elements. An example of a simple divide-and-conquer implementation:

    def qsort(L):
        if L == []:
            return []
        # divide by the first element (the pivot), then conquer recursively
        return (qsort([x for x in L[1:] if x < L[0]])
                + [L[0]]
                + qsort([x for x in L[1:] if x >= L[0]]))

It is not too difficult to construct a formula for the number of comparisons (we assume that the particular orders of the sequence to be sorted are distributed uniformly). The following parameters are important for the analysis of the algorithm:
i) The number of comparisons in the divide phase: n − 1.
ii) The uniformness assumption: the probability that L[0] is the k-th greatest element of the sequence is $\frac{1}{n}$.
iii) The sizes of the sorted subsequences in the conquer phase: k − 1 and n − k.

² Despite the statement in Concrete Mathematics, this sequence can already be found in The On-Line Encyclopedia of Integer Sequences.

The individual vertices in the directed acyclic graph on the right-hand side of the diagram indicate the number of tokens left and the information whether the game at that state is won by the player who is to move (letter N as “next”) or by the other one (letter P as “previous”). Altogether, for a game with k tokens, this graph always has only k + 1 vertices. At the same time, the complete strategy is encoded in the graph: the players always try to move from the current state into a vertex labeled by P, if such one exists. In fact, every directed acyclic graph can be seen as a description of a game. The initial situations are represented by those vertices which have no incoming edges (there can be one or more of them), and the game ends in leaves, i.e. vertices with no outgoing edges (again, there can be one or more of them). The strategy for each player can be obtained by a simple recursive procedure as above (for the sake of simplicity, it is assumed that there is no tie):
• The leaves are labeled by P (the player who is to move from a leaf loses).
• A non-leaf vertex of the graph is labeled by N if there is an edge leading to a P-vertex. Otherwise, it is labeled P.
In the case of our variation of Nim, the situation is very simple. It follows from the strategy described that the player who is to move loses if and only if the number of tokens is divisible by three; a short computation confirming this is sketched below. The games that can be represented by a directed acyclic graph are called impartial. These are exactly the games which satisfy:
• in every state, both players choose from the same set of moves;
• the number of possible states is finite;
• the game has “zero sum”, i.e. the better the outcome for one of the players, the worse for the other one.
An example of an impartial game is tic-tac-toe. Although the players use different symbols in this game, they can place them in any of the unoccupied squares. On the other hand, chess is not an impartial game in this sense, since the set of possible moves in every situation depends on the number of pieces the players have at their disposal.
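The promised computation: labeling the positions of the take-one-or-two Nim by N and P takes only a few lines (a sketch in Python; the function name is ours):

    def np_labels(k):
        # label[n] is 'P' if the player to move from n tokens loses,
        # and 'N' if the player to move wins (with optimal play)
        label = ['P']                # 0 tokens: the player to move has lost
        for n in range(1, k + 1):
            moves = [n - 1] + ([n - 2] if n >= 2 else [])
            # a position is N iff some move leads to a P-position
            label.append('N' if any(label[m] == 'P' for m in moves) else 'P')
        return label

    print(np_labels(7))   # ['P', 'N', 'N', 'P', 'N', 'N', 'P', 'N']

The P-positions are exactly the multiples of three, in agreement with the labels 0P, 1N, . . . , 7N in the diagram above.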
13.2.16. Sum of combinatorial games. The rules of the real classical game Nim are somewhat more complicated: There are three piles of tokens. A move consists of selecting one of the piles and removing an arbitrary (positive) number of tokens from that pile. The player who manages to take the last token wins. There is a variation of the game in which the player who is forced to take the last token loses. If this game is played with one pile, the situation is easy: the first player takes all the tokens and wins immediately. However, it is not that easy with three piles. Whether the analysis of the one-pile game is of any use for this more complicated game is a good question.

We thus get the following recurrent formula for the expected number of comparisons:
$$C_n = n - 1 + \sum_{k=1}^{n} \frac{1}{n}\left(C_{k-1} + C_{n-k}\right).$$
It is possible to solve this recurrence (using certain tricks which can be learned to some extent) even without generating functions:
$$C_n = n - 1 + \frac{2}{n}\sum_{k=1}^{n} C_{k-1} \quad\text{(symmetry of both sums)}$$
$$nC_n = n(n-1) + 2\sum_{k=1}^{n} C_{k-1} \quad\text{(multiply by } n\text{)}$$
$$(n-1)C_{n-1} = (n-1)(n-2) + 2\sum_{k=1}^{n-1} C_{k-1} \quad\text{(the same expression for } C_{n-1}\text{)}$$
$$nC_n = (n+1)C_{n-1} + 2(n-1) \quad\text{(subtract and rearrange)}$$
We have thus obtained the much simpler recurrence $nC_n = (n+1)C_{n-1} + 2(n-1)$. On the other hand, this equation contains non-constant coefficients. Note also that the recurrence has been simplified to the extent that the values $C_n$ can be computed iteratively. Nevertheless, it is advantageous to express these values explicitly as a function of n (or at least to approximate them). First, we use a slight trick, dividing both sides by n(n + 1):
$$\frac{C_n}{n+1} = \frac{C_{n-1}}{n} + \frac{2(n-1)}{n(n+1)}.$$
Now, we “expand” this expression (telescoping; we can also use the substitution $B_n = C_n/(n+1)$):
$$\frac{C_n}{n+1} = \frac{2(n-1)}{n(n+1)} + \frac{2(n-2)}{(n-1)n} + \cdots + \frac{2 \cdot 1}{2 \cdot 3} + \frac{C_1}{2}.$$
Hence
$$\frac{C_n}{n+1} = 2\sum_{k=1}^{n-1} \frac{k}{(k+1)(k+2)}.$$
This can be summed using partial fraction decomposition, for instance: $\frac{k}{(k+1)(k+2)} = \frac{2}{k+2} - \frac{1}{k+1}$. This leads to
$$\frac{C_n}{n+1} = 2\left(H_{n+1} - 2 + \frac{1}{n+1}\right),$$
whence $C_n = 2(n+1)H_{n+1} - 4(n+1) + 2$ ($H_n = \sum_{k=1}^{n}\frac{1}{k}$ is the sum of the first n terms of the harmonic progression). At the same time, we have the bound $H_n \sim \int_1^n \frac{dx}{x} + \gamma = \ln n + \gamma$, whence $C_n \sim 2(n+1)(\ln(n+1) + \gamma - 2) + 2$.

13.I.11. Using the generating function $F(x) = x/(1 - x - x^2)$ for the Fibonacci sequence, find the generating function for the “semi-Fibonacci” sequence $(F_0, F_2, F_4, \dots)$. ⃝

For this purpose, introduce a new concept, the sum of impartial games: A situation in the game composed of two simpler games is a pair of possible situations in the particular games. A move then consists of selecting one of the two games and performing a move in that game (the other game is left unchanged). Therefore, the sum of impartial games is an operation which assigns to a pair of directed acyclic graphs a new one. Considering graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, their sum $G_1 + G_2$ is the graph $G = (V, E)$, where $V = V_1 \times V_2$ and
$$E = \{(v_1v_2, w_1v_2);\ (v_1, w_1) \in E_1\} \cup \{(v_1v_2, v_1w_2);\ (v_2, w_2) \in E_2\}.$$
In the case of one game, the vertices can be labeled step by step by the letters N and P in an upwards manner, according to whether one can get to a P-vertex along some of the edges. However, in the sum of games, the movement along the edges combines in a much more complicated way. Therefore, finer tools are needed for expressing the reachability of vertices labeled by P from other vertices.
This needs some preparation, which might seem like strange magic (but the proof of the theorem below shows that all of it is quite natural). Define the Sprague–Grundy function $g : V \to \mathbb{N}$ on a directed acyclic graph $G = (V, E)$ recursively as follows:¹¹
(1) for a leaf v, set g(v) = 0;
(2) for a non-leaf vertex $v \in V$, define $g(v) = \operatorname{mex}\{g(w);\ (v, w) \text{ is an edge}\}$, where the minimum excluded value function mex is defined on subsets S of the natural numbers $\mathbb{N} = \{0, 1, \dots\}$ by $\operatorname{mex} S = \min(\mathbb{N} \setminus S)$.
The value g(v) is thus the result of the mex operation applied to the set S of the values g(w) over those vertices w which can be reached from v along an edge. Note that this definition is correct, since the formula clearly defines a unique function assigning a natural number to every vertex of the acyclic graph in question. Yet another operation on the natural numbers is needed: the binary XOR operation $(a, b) \mapsto a \oplus b$, performing the exclusive-or operation bit-wise on the binary expansions of a and b. This operation can also be viewed as follows: consider the binary expansions of a and b to be vectors in the vector space $(\mathbb{Z}_2)^k$ over $\mathbb{Z}_2$ (for a sufficiently large k), and add them there. The resulting vector is the binary expansion of $a \oplus b$. Now the main result can be formulated.

¹¹ We are presenting the theory which was developed in combinatorial game theory independently by R. P. Sprague in 1935 and P. M. Grundy in 1939.

13.I.12. The fan of order n is a graph on n + 1 vertices, labeled 0, 1, . . . , n, with the following edges: vertex 0 is connected to all other vertices, and for each k satisfying 1 ≤ k < n, vertex k is connected to vertex k + 1. How many spanning trees does this graph have?

Solution. Denoting by $v_n$ the wanted number of spanning trees, we clearly have $v_1 = 1$, and since the fan of order 2 is the triangle graph $K_3$, we have $v_2 = 3$. Further, we are going to show that for n > 1, the following recurrence³ holds:
$$v_n = v_{n-1} + \sum_{k=0}^{n-1} v_k + 1, \qquad v_0 = 0.$$
For a fixed spanning tree of the fan of order n, let k be the greatest integer of the set {1, . . . , n − 1} such that the spanning tree contains all edges of the path (0, 1, 2, 3, . . . , k). This spanning tree cannot contain the edges {0, 2}, . . . , {0, k}, {k, k + 1}; therefore, for a fixed k, there are the same number of spanning trees as in the fan of order n − k with vertices 0, k + 1, k + 2, . . . , n, i.e. $v_{n-k}$. Further, we must count one spanning tree for k = n and those spanning trees which do not contain the edge {0, 1} (thus they must contain the edge {1, 2}): they are obtained from fans of order n − 1 on the vertices 0, 2, . . . , n. We have thus obtained the wanted recurrence $v_n = v_{n-1} + v_{n-1} + v_{n-2} + \cdots + v_0 + 1$. Now, we have the general formula
$$v_n = v_{n-1} + \sum_{k=0}^{n-1} v_k + 1 - [n = 0],$$
whence the usual procedure for finding the generating function V(x) of this sequence yields
$$V(x) = xV(x) + \sum_{n=0}^{\infty}\Bigl(\sum_{k<n} v_k\Bigr)x^n + \frac{x}{1-x} = xV(x) + \Bigl(\sum_{k=0}^{\infty} v_kx^k\Bigr)\cdot\Bigl(\sum_{n>k} x^{n-k}\Bigr) + \frac{x}{1-x} = xV(x) + V(x)\cdot\frac{x}{1-x} + \frac{x}{1-x}.$$
The solution of the equation $V(x) = xV(x) + \frac{x}{1-x}V(x) + \frac{x}{1-x}$ is
$$V(x) = \frac{x}{1 - 3x + x^2},$$
whence the standard method (partial fraction decomposition) or the previous problem leads to the result $v_n = F_{2n}$. □

³ Using this recurrent formula to calculate more values $v_n$, we find $v_3 = 8$, $v_4 = 21$, which suggests a hypothesis about a connection with the Fibonacci sequence in the form $v_n = F_{2n}$. This can be proved easily by induction.
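Both the closed form $V(x) = x/(1 - 3x + x^2)$ and the equality $v_n = F_{2n}$ are easily double-checked with a computer algebra system, in the same spirit as the Sage computation in 13.I.2 (a sketch using the sympy library; the names are ours):

    from sympy import symbols, series, fibonacci

    x = symbols('x')
    V = x / (1 - 3*x + x**2)
    print(series(V, x, 0, 6))
    # x + 3*x**2 + 8*x**3 + 21*x**4 + 55*x**5 + O(x**6)
    print([fibonacci(2*n) for n in range(1, 6)])
    # [1, 3, 8, 21, 55]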
Sprague–Grundy theorem
13.2.17. Theorem. Consider a directed acyclic graph G = (V, E). Its vertices v are labeled by P if and only if g(v) = 0, where g is the Sprague–Grundy function. For any two directed acyclic graphs $G_1 = (V_1, E_1)$, $G_2 = (V_2, E_2)$ and their Sprague–Grundy functions $g_1$, $g_2$, the Sprague–Grundy function g of their sum is given by $g(v_1v_2) = g_1(v_1) \oplus g_2(v_2)$.

Proof. The first proposition of the theorem follows directly by induction from the definition of the Sprague–Grundy function g. The proof of the other part is more complicated. Let $v_1v_2$ be a position of the game $G_1 + G_2 = (V, E)$, and consider any $a \in \mathbb{N}_0$ such that $a < g_1(v_1) \oplus g_2(v_2)$. We claim that there exists a state $x_1x_2$ of the game $G_1 + G_2$ such that $g_1(x_1) \oplus g_2(x_2) = a$ and $(v_1v_2, x_1x_2) \in E$ and, at the same time, that there is no edge $(v_1v_2, y_1y_2) \in E$ such that $g_1(y_1) \oplus g_2(y_2) = g_1(v_1) \oplus g_2(v_2)$. This justifies the recursive definition of the Sprague–Grundy function and proves the rest of the theorem.

To show this, first find a vertex $x_1x_2$ with a given value $a < g_1(v_1) \oplus g_2(v_2)$ of the Sprague–Grundy function. Consider the integer $b := a \oplus g_1(v_1) \oplus g_2(v_2)$. Refer to the bit of value $2^i$ as the i-th bit of an integer. Clearly, $b \ne 0$. Let k be the position of the highest one in the binary expansion of b, i.e. $2^k \le b < 2^{k+1}$. This means that the k-th bit of exactly one of the integers a, $g_1(v_1) \oplus g_2(v_2)$ is one and that these integers do not differ in higher bits. It follows from the assumption $a < g_1(v_1) \oplus g_2(v_2)$ that it is the integer $g_1(v_1) \oplus g_2(v_2)$ whose k-th bit is one. Therefore, the k-th bit of exactly one of the integers $g_1(v_1)$, $g_2(v_2)$ is one. Assume without loss of generality that it is the integer $g_1(v_1)$. Further, consider the integer $c := g_1(v_1) \oplus b$. Recall that the highest one of b is at position k, so the integers c and $g_1(v_1)$ do not differ in higher bits, and the k-th bit of c is zero. Therefore, $c < g_1(v_1)$. Then, by the definition of the value $g_1(v_1)$, there must exist a state $w_1$ of the game $G_1$ such that $(v_1, w_1) \in E_1$ and $g_1(w_1) = c$. Now, $(v_1v_2, w_1v_2) \in E$ and
$$g_1(w_1) \oplus g_2(v_2) = c \oplus g_2(v_2) = g_1(v_1) \oplus b \oplus g_2(v_2) = g_1(v_1) \oplus a \oplus g_1(v_1) \oplus g_2(v_2) \oplus g_2(v_2) = a.$$
This fulfills the first part of our plan. Further, consider any edge $(v_1v_2, y_1y_2) \in E$ in G, where $(v_1, y_1) \in E_1$, and hence $v_2 = y_2$. Suppose that $g_1(y_1) \oplus g_2(y_2) = g_1(v_1) \oplus g_2(v_2)$. Then, $g_1(y_1) \oplus g_2(v_2) = g_1(v_1) \oplus g_2(v_2)$. Clearly, the terms $g_2(v_2)$ can be canceled (it is an operation in a vector space), leading to $g_1(y_1) = g_1(v_1)$. This contradicts the properties of the Sprague–Grundy function $g_1$ of the game $G_1$. This proves the second part of the theorem. □

Recursively connected sequences. Sometimes, we are able to express the wanted numbers of ways or events only in terms of several mutually connected sequences.

13.I.13. In how many ways can we cover a 3 × n rectangle with 1 × 2 domino pieces? Evaluate this number for n = 20.

Solution. We can easily find out that $c_1 = 0$, $c_2 = 3$, $c_3 = 0$, and it is reasonable to set $c_0 = 1$ (this is not merely a convention; there is indeed a unique empty covering). We are looking for a recursive formula. Discussing the behavior “on the edge”, we find that
$$c_n = 2r_{n-1} + c_{n-2}, \qquad r_n = c_{n-1} + r_{n-2}, \qquad r_0 = 0,\ r_1 = 1,$$
where $r_n$ is the number of coverings of the 3 × n rectangle without one of its corner tiles. The values of $c_n$ and $r_n$ for the first few non-negative integers n are:

    n:    0  1  2  3   4   5   6   7
    c_n:  1  0  3  0  11   0  41   0
    r_n:  0  1  0  4   0  15   0  56

• Step 1: $c_n = 2r_{n-1} + c_{n-2} + [n = 0]$, $r_n = c_{n-1} + r_{n-2}$.
• Step 2: $C(x) = 2xR(x) + x^2C(x) + 1$, $R(x) = xC(x) + x^2R(x)$.
• Step 3: $C(x) = \frac{1 - x^2}{1 - 4x^2 + x^4}$, $R(x) = \frac{x}{1 - 4x^2 + x^4}$.
• Step 4: We can see that both are functions of $x^2$. We can thus save much work by considering the function $D(z) = 1/(1 - 4z + z^2)$. Then, we have $C(x) = (1 - x^2)D(x^2)$, i.e., $[x^{2n}]C(x) = [x^{2n}](1 - x^2)D(x^2) = [x^n](1 - x)D(x)$, so $c_{2n} = d_n - d_{n-1}$. The roots of $1 - 4x + x^2$ are $2 + \sqrt{3}$ and $2 - \sqrt{3}$, whence the standard procedure yields
$$c_{2n} = \frac{(2+\sqrt{3})^n}{3-\sqrt{3}} + \frac{(2-\sqrt{3})^n}{3+\sqrt{3}}.$$
Just as with the Fibonacci sequence, the second term is negligible for large values of n, and it is always between 0 and 1. Therefore,
$$c_{2n} = \left\lceil \frac{(2+\sqrt{3})^n}{3-\sqrt{3}} \right\rceil.$$
For instance, $c_{20} = 413403$. □

13.I.14. Using generating functions, find the expected number of ones in a random bit string.

Solution. Let B be the set of bit strings and, for b ∈ B, let |b| denote the length of b and j(b) the number of ones in it. The generating function counting the strings is
$$B(x) = \sum_{b \in B} x^{|b|} = \sum_{n \ge 0} 2^n x^n = \frac{1}{1 - 2x}.$$

The following useful result is a direct corollary of this theorem:

Corollary. A vertex $v_1v_2$ in the sum of games is labeled by P if and only if $g_1(v_1) = g_2(v_2)$.

For example, if three piles of tokens are combined in the simplified Nim game (where it is only allowed to take one or two tokens), the first player always wins if all three piles have the same number of tokens, not divisible by three. The individual functions $g_i(k)$ for the individual piles equal the remainder of k after division by 3. It follows that, when summing the first two pile games, the value g(v) = 0 is obtained for the initial position. Summing again with the third pile game gives $g(v) \ne 0$. In the original game, the individual piles are described by g(k) = k (any number of tokens can be taken, hence the function g grows in this way). The losing positions are those where the binary sum (xor) of the numbers of tokens is zero. For example, if two of the initial piles are of equal size, then a simple winning strategy is to remove the third pile completely and then always restore the equality of the remaining two piles after the opponent's move.

Remark. Further details are omitted in this text. It can be proved that every finite directed acyclic graph is isomorphic to a finite sum of suitably generalized games of Nim. In particular, the analysis of this simple game and the construction of the function g basically (at least implicitly) gives a complete analysis of all impartial games.

3. Remarks on Computational Geometry

A large amount of practical problems consists in constructing or analyzing finite geometrical objects in Euclidean spaces, mainly in 2D or 3D. This is a very busy area of both applications and research. At the same time, most of the algorithms and their complexity analysis are based on graph-theoretical and further combinatorial concepts. We provide several glimpses into this beautiful topic.¹² We discuss convex hulls, triangulations, and Voronoi diagrams, and focus on a few basic approaches only.

13.3.1. Convex hulls. We start with a simple and practical problem. In the plane $\mathbb{R}^2$, suppose n points $X = \{v_1, \dots, v_n\}$ are given, and the task is to find their convex hull CH(X). As learned in Chapter 4, CH(X) is given by a convex polygon, and it is desired to find it effectively. First we have to decide how CH(X) should be encoded as a data structure. Choose the connected list of edges.
There is the cyclic list of the vertices $v_i$ of the polygon, sorted in the counter-clockwise order, together with pointers towards the oriented segments between the consecutive vertices (the edges).

¹² The beautiful book Computational Geometry, Algorithms and Applications by de Berg, M., Cheong, O., van Kreveld, M., Overmars, M., published by Springer (1997), can be warmly recommended, http://www.springer.com/us/book/9783540779735

The generating function for the number of ones is $C(x) = \sum_{b \in B} j(b)x^{|b|}$. A string b can be obtained from a string b′ shorter by one bit by appending either a zero or a one; the two extensions of b′ contain j(b′) and j(b′) + 1 ones, respectively. Therefore,
$$C(x) = \sum_{b' \in B} (1 + 2j(b'))x^{|b'|+1} = \sum_{b' \in B} x^{|b'|+1} + 2\sum_{b' \in B} j(b')x^{|b'|+1} = xB(x) + 2xC(x).$$
Hence $C(x) = \frac{x}{(1-2x)^2} = x(1-2x)^{-2}$, and the n-th coefficient is $c_n = (-2)^{n-1}\binom{-2}{n-1} = n2^{n-1}$. This number gives the total number of ones over all strings of length n, and there are $b_n = 2^n$ such strings. Therefore, the expected number of ones in such a string is $\frac{c_n}{b_n} = \frac{n}{2}$, which is, of course, what we anticipated. □

13.I.15. Find the generating function and an explicit formula for the n-th term of the sequence $\{a_n\}$ defined by the recurrent formula $a_0 = 1$, $a_1 = 2$, $a_n = 4a_{n-1} - 3a_{n-2} + 1$ for $n \ge 2$.

Solution. The universal formula which holds for all $n \in \mathbb{Z}$ is $a_n = 4a_{n-1} - 3a_{n-2} + 1 - 3[n = 1]$. Multiplying by $x^n$ and summing over all n, we get the following equation for the generating function $A(x) = \sum_{n=0}^{\infty} a_nx^n$:
$$A(x) = 4xA(x) - 3x^2A(x) + \frac{1}{1-x} - 3x.$$
Hence, we can express
$$A(x) = \frac{3x^2 - 3x + 1}{(1-x)^2(1-3x)} = \frac{3}{4}\cdot\frac{1}{1-x} - \frac{1}{2}\cdot\frac{1}{(1-x)^2} + \frac{3}{4}\cdot\frac{1}{1-3x}.$$
Therefore, the coefficient at $x^n$ is
$$a_n = \frac{3}{4} - \frac{1}{2}(n+1) + \frac{3}{4}3^n = \frac{1 - 2n + 3^{n+1}}{4}.$$
□

13.I.16. Solve the following recurrence using generating functions: $a_0 = 1$, $a_1 = 2$, $a_n = 5a_{n-1} - 4a_{n-2}$ for $n \ge 2$.

Solution. The universal formula is of the form $a_n = 5a_{n-1} - 4a_{n-2} - 3[n = 1] + [n = 0]$. Multiplying both sides by $x^n$ and summing over all n, we obtain $A(x) = 5xA(x) - 4x^2A(x) - 3x + 1$. Hence
$$A(x) = \frac{1 - 3x}{(1-4x)(1-x)} = \frac{2}{3}\cdot\frac{1}{1-x} + \frac{1}{3}\cdot\frac{1}{1-4x}$$

Moreover, there is the list of edges, each pointing to its tail and head vertices. There is a simple way to get CH(X). Namely, create the oriented edges $e_{ij}$ for all pairs $(v_i, v_j)$ of the points in X, and decide whether $e_{ij}$ belongs to CH(X) by testing whether all the other points of X are to the left of $e_{ij}$ (in the obvious sense). It is known already from chapter one that this is tested in constant time by means of a determinant. Clearly, $e_{ij}$ belongs to CH(X) if and only if all the latter tests are positive. In the end, the order in which to sort the edges and vertices in the output is found. This does not look like a good algorithm, since $O(n^2)$ edges need to be tested against O(n) points; hence, cubic time complexity is expected. But there is a strong and simple improvement available. Consider the lexicographic order on the points $v_i$ with respect to their coordinates. Then build the convex hull consecutively, running the tests only for the edges having the last added vertex as their tail.

Gift Wrapping Convex Hull Algorithm
Input: A set of points X = {v1, . . . , vn} in the plane.
Output: The requested edge list for CH(X).
(1) Initialization. Take the smallest vertex $v_0$ in the lexicographic order with respect to the coordinates, and set $v_{\mathrm{active}} = v_0$.
(2) Main cycle.
• Test the edges with tail $v_{\mathrm{active}}$ until an edge e belonging to CH(X) is found.
• Add e to CH(X) and set $v_{\mathrm{active}}$ to be its head.
• If $v_{\mathrm{active}} \ne v_0$, repeat the cycle.

Obviously, the leftmost and lowest vertex $v_0$ in X is in CH(X). Since CH(X) is a cycle (as a directed graph), the algorithm works correctly. It is necessary to be careful about possible collinear edges in CH(X) and about the lack of robustness of the test for those nearly collinear. This simple improvement reduces the worst-case running time of the algorithm to $O(n^2)$. The worst case can occur if all the points $v_i$ lie on one circle and, unluckily, the right next point is always found as the very last one in the partial tests. But the actual running time is much better, at most O(ns), where s is the size of CH(X).

whence $a_n = \frac{2}{3} + \frac{1}{3}4^n = \frac{4^n + 2}{3}$. □

13.I.17. A cash dispenser can provide us with banknotes of values 200, 500, and 1,000 crowns. In how many ways can we pick 7,000 crowns? Use generating functions to find the solution.

Solution. The problem can be reformulated as looking for the number of integer solutions of the equation $2a + 5b + 10c = 70$, $a, b, c \ge 0$. This number equals the coefficient at $x^{70}$ in the function
$$G(x) = (1 + x^2 + x^4 + \cdots)(1 + x^5 + x^{10} + \cdots)(1 + x^{10} + x^{20} + \cdots) = \frac{1}{1-x^2} \cdot \frac{1}{1-x^5} \cdot \frac{1}{1-x^{10}},$$
and since $\frac{1-x^{10}}{1-x^5} = 1 + x^5$ and $\frac{1-x^{10}}{1-x^2} = 1 + x^2 + x^4 + x^6 + x^8$, we can transform it into the form
$$G(x) = \frac{(1 + x^2 + x^4 + x^6 + x^8)(1 + x^5)}{(1 - x^{10})^3}.$$
By the generalized binomial theorem, we have $(1 - x^{10})^{-3} = \sum_{k=0}^{\infty} (-1)^k \binom{-3}{k} x^{10k}$. Therefore, G(x) equals
$$(1 + x^2 + x^4 + x^5 + x^6 + x^7 + x^8 + x^9 + x^{11} + x^{13}) \sum_{k=0}^{\infty} (-1)^k \binom{-3}{k} x^{10k}.$$
The term $x^{70}$ can be obtained only as $7 \cdot 10 + 0$, i.e., the coefficient at $x^{70}$ equals $[x^{70}]G(x) = -\binom{-3}{7} = \binom{3+7-1}{7} = \frac{9 \cdot 8}{2} = 36$. □

13.I.18. Find the probability of getting exactly k heads after tossing a coin n times.

Solution. Represent the outcome H as x and T as y. All possible outcomes of n tosses are represented by the expansion of $f(x) = (x + y)^n$. The coefficient at $x^ky^{n-k}$ is the number of outcomes with exactly k heads. The required probability is, therefore, $\binom{n}{k}/2^n$. □

13.I.19. Find the probability of getting exactly k heads after tossing a coin n times if the probabilities of a head and a tail equal p and q, respectively, p + q = 1.

Solution. Represent the outcome H as px and T as q. All possible outcomes of n tosses are represented by the expansion of $f(x) = (px + q)^n$. The coefficient at $x^k$ is the probability of getting exactly k heads; it equals $\binom{n}{k}p^kq^{n-k}$. □

For example, in situations where the distribution of the points in the plane is random with the normal distribution (see chapter 10 for what this means), it is known that the expected size would be logarithmic. At the same time, finding CH(X) for X distributed on a circle is equivalent to sorting the points along the circle. So the worst-case running time cannot be better than O(n log n) for any such algorithm (cf. 13.1.17).

13.3.2. The sweep line paradigm. We illustrate several main approaches to computational geometry algorithms on the same convex hull problem. The latter algorithm is close to the idea of having a special object L running through all the objects in the input X consecutively, taking care of constructing the relevant partial output of the algorithm on the run.
This is organized via the event structure, describing all the events needing consideration, and the sweep line structure, carrying all the information needed to deal with the individual events. The procedure is similar to the search over a graph discussed earlier. As a benefit, this may reduce the dimension of the problem (e.g. from 2D to 1D), at the cost of implementing dynamical structures. To start, initialize the queue of the events and begin dealing with them. At each step, there are the still sleeping events (those further in the queue, not yet treated), the current active events (those under consideration), and the already processed ones. With CH(X), this idea can be implemented as follows. Initialize the lexicographically ordered points $v_i$ in X. Notice that the first and the last ones necessarily appear in CH(X). This way, CH(X) splits into two disjoint chains of edges between them, which we call the upper and the lower convex hulls. Hence the entire construction can be split into building the upper convex hull and the lower convex hull. As in the diagram, moving through the events one by one, it is only necessary to check whether the edge joining the last but one active vertex with the current one makes a left or a right turn (as usual, right means the clockwise orientation). If right, then add the edge to the list; if left, omit the recent edges one by one until a right turn is obtained.

13.I.20. There are coins C1, . . . , Cn. For each k, the coin Ck is biased so that, when tossed, the probability that it falls heads is $\frac{1}{2k+1}$. If the n coins are tossed, what is the probability of getting an odd number of heads?

Solution. Let $p_k = \frac{1}{2k+1}$ and $q_k = 1 - p_k$. The generating function can be written as
$$f(x) = \prod_{i=1}^{n} (q_i + p_ix) = \sum_{m=0}^{n} a_mx^m,$$
with $a_m$ the probability of getting exactly m heads. Observe that $f(1) = \sum a_m = \sum a_{2k} + \sum a_{2k+1}$ and $f(-1) = \sum a_m(-1)^m = \sum a_{2k} - \sum a_{2k+1}$. Hence,
$$\sum a_{2k+1} = \frac{1}{2}\bigl(f(1) - f(-1)\bigr) = \frac{1}{2}\left(1 - \frac{1}{2n+1}\right) = \frac{n}{2n+1},$$
as f(1) = 1 and $f(-1) = \frac{1}{3} \cdot \frac{3}{5} \cdot \frac{5}{7} \cdots \frac{2n-1}{2n+1} = \frac{1}{2n+1}$. □

13.I.21. Let n be a positive integer. Find the number of polynomials P(x) with coefficients from the set {0, 1, 2, 3} such that P(2) = n.

Solution. Let $P(x) = a_0 + a_1x + a_2x^2 + \cdots + a_kx^k + \cdots$ be such a polynomial. Then its coefficients satisfy the equation $a_0 + 2a_1 + 4a_2 + \cdots + 2^ka_k + \cdots = n$ with integer coefficients $0 \le a_k \le 3$. The number of such solutions is given by the coefficient at $x^n$ in
$$f(x) = (1 + x + x^2 + x^3)(1 + x^2 + x^4 + x^6)(1 + x^4 + x^8 + x^{12}) \cdots = \prod_{k=0}^{\infty} \left(1 + x^{2^k} + x^{2 \cdot 2^k} + x^{3 \cdot 2^k}\right)$$
$$= \prod_{k=0}^{\infty} \frac{1 - x^{2^{k+2}}}{1 - x^{2^k}} = \frac{1}{(1-x)(1-x^2)} = \frac{1}{4}\cdot\frac{1}{1-x} + \frac{1}{2}\cdot\frac{1}{(1-x)^2} + \frac{1}{4}\cdot\frac{1}{1+x} = \sum_{n=0}^{\infty} \left(\frac{1}{4} + \frac{(-1)^n}{4} + \frac{n+1}{2}\right)x^n.$$
Hence, the required number is $\lfloor n/2 \rfloor + 1$. □

13.I.22. Find the number of subsets of {1, . . . , 2017} such that the sum of their elements is divisible by 5.

Solution. For a subset $S = \{s_1, s_2, \dots, s_k\}$, define $\sigma(S) = x^{s_1}x^{s_2}\cdots x^{s_k}$. Then
$$f(x) := \sum_{S \subseteq \{1,\dots,2017\}} \sigma(S) = (1+x)(1+x^2)\cdots(1+x^{2017}).$$
The coefficient at $x^n$ in f(x) is the number of subsets of {1, . . . , 2017} with the sum of elements being exactly n. Let $f(x) = \sum a_nx^n$. We are looking for the sum $A_5 = a_0 + a_5 + a_{10} + \cdots$. Let $\omega_j = \exp\left(\frac{2j\pi i}{5}\right)$, j = 0, . . . , 4, be the 5-th roots of unity, with $\omega_0 = 1$. Then
$$A_5 = \frac{1}{5}\bigl(f(\omega_0) + f(\omega_1) + f(\omega_2) + f(\omega_3) + f(\omega_4)\bigr).$$

Sweep Line Upper Convex Hull
Input: A set of points X = {v1, . . . , vn} in the plane.
Output: The directed path UCH(X).
(1) Initialization.
• Set the event structure to be the lexicographically ordered list of points $v_{\mathrm{first}}, \dots, v_{\mathrm{last}}$. There is no special sweep line structure, only the indicator distinguishing the stage of each event.
• Set the active event to be $v_{\mathrm{active}} = v_{\mathrm{first}}$ and initiate UCH(X) as the trivial path with the single vertex $v_{\mathrm{first}}$ (this is the current last vertex of the path in construction).
(2) Main cycle.
• Set the active event to the next point v in the queue, and consider the potential edge e having $v_{\mathrm{active}}$ as its tail and the last vertex in UCH(X) as its head.
• Check whether UCH(X) is to the left of e (it suffices to check this against the last edge in the current UCH(X)). If so, add e and $v_{\mathrm{active}}$ to UCH(X). If not, remove the edges in UCH(X) one by one, until the test turns positive.
• Repeat the cycle until the next event is $v_{\mathrm{last}}$.

It is easy to check that the algorithm is correct. Exactly n events need to be considered, and at each of them, up to O(n) vertices can be removed from the current UCH(X). This suggests $O(n^2)$ time but, in fact, no vertex is ever added to UCH(X) again after its removal. It follows that the asymptotic estimate for the main cycle is O(n) in total, and it is the ordering in the initialization which dominates with its O(n log n) time. Clearly, linear O(n) memory is sufficient, and so the optimal solution is achieved again for the convex hull problem.

13.3.3. The divide and conquer paradigm. Another very standard idea is to divide the entire problem into pieces, apply the same procedure recursively to them, and merge the partial results together. These are the two phases of the divide and conquer approach. This paradigm is common in many areas, cf. 13.I.10. With convex hulls, adopt the gift wrapping approach for the conquer phase. The idea is to split the task recursively, producing disjoint “left CH(X)” and “right CH(X)”, and to merge them by finding the upper and lower “tangent edges” of those two parts.

Obviously, $f(\omega_0) = f(1) = 2^{2017}$. For $1 \le j \le 4$,
$$f(\omega_j) = \left((1+\omega_j^0)(1+\omega_j)(1+\omega_j^2)(1+\omega_j^3)(1+\omega_j^4)\right)^{403}(1+\omega_j)(1+\omega_j^2) = 2^{403}(1 + \omega_j + \omega_j^2 + \omega_j^3).$$
Therefore,
$$\sum_{j=1}^{4} f(\omega_j) = 2^{403}(4 - 1 - 1 - 1) = 2^{403}.$$
Therefore, the answer is $\frac{1}{5}(2^{2017} + 2^{403})$, which is indeed an integer, as 2 is the last digit of $2^{2017}$ and 8 is the last digit of $2^{403}$. □

13.I.23. Using generating functions, solve the recursive equation $a_{k+1} = 2a_k + 4^k$, $a_1 = 3$.

Solution. From $a_1 = 2a_0 + 4^0$, it follows that $a_0 = 1$. Multiplying both sides of the recursion by $x^k$ and summing over k, we obtain
$$\sum_{k=0}^{\infty} a_{k+1}x^k = 2\sum_{k=0}^{\infty} a_kx^k + \sum_{k=0}^{\infty} 4^kx^k.$$
Therefore, for the generating function $f(x) = \sum_{k=0}^{\infty} a_kx^k$, we have
$$\frac{1}{x}\sum_{k=0}^{\infty} a_{k+1}x^{k+1} = 2f(x) + \sum_{k=0}^{\infty} (4x)^k, \quad\text{that is,}\quad \frac{1}{x}\bigl(f(x) - 1\bigr) = 2f(x) + \frac{1}{1-4x}.$$
Therefore,
$$f(x) = \frac{1}{1-2x} + \frac{x}{(1-2x)(1-4x)} = \frac{1/2}{1-2x} + \frac{1/2}{1-4x}.$$
Expanding f(x), we obtain $f(x) = \frac{1}{2}\sum_{k=0}^{\infty}(2x)^k + \frac{1}{2}\sum_{k=0}^{\infty}(4x)^k$, and so $a_k = \frac{2^k + 4^k}{2}$. □

13.I.24. In how many ways can n balls be distributed among 4 boxes if the first box must contain at least two balls?

Solution. The generating function for the first box is $b_1(x) = x^2 + x^3 + x^4 + \cdots = \frac{x^2}{1-x}$. The generating function for each of the other boxes is $b(x) = 1 + x + x^2 + x^3 + \cdots = \frac{1}{1-x}$. For all 4 boxes together, this gives $f(x) = b_1(x)(b(x))^3 = \frac{x^2}{(1-x)^4}$. The coefficient at $x^n$ in f(x), which is the wanted number of ways to distribute the n balls, equals $\binom{n-2+4-1}{4-1} = \binom{n+1}{3}$. □
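Answers of this kind are quickly verified by expanding the generating function numerically; for instance, for 13.I.24 (a sketch using the sympy library, with our own names):

    from sympy import symbols, series, binomial

    x = symbols('x')
    f = x**2 / (1 - x)**4
    g = series(f, x, 0, 8).removeO()
    print([g.coeff(x, n) for n in range(2, 8)])       # [1, 4, 10, 20, 35, 56]
    print([binomial(n + 1, 3) for n in range(2, 8)])  # [1, 4, 10, 20, 35, 56]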
Output: The directed path CH(X).
(1) Divide. If n ≤ 3, return CH(X). Otherwise, split X = X1 ∪ X2 into two subsets of (roughly) the same size, respecting the order (i.e. all vertices in X1 smaller than those in X2).
(2) Merge.
• Start with the edge joining the largest point of CH(X1) and the smallest point of CH(X2), and iteratively balance it into the lower tangent segment el of CH(X1) and CH(X2).
• Proceed similarly to get the upper tangent eu.
• Merge the relevant parts of CH(X1) and CH(X2) with the help of el and eu.

Perhaps the merge step requires some more explanation; the situation is illustrated in the diagram. For the upper tangent, first fix the right-hand vertex of the initial edge joining the two convex polygons, and find the tangent to the left polygon from this vertex. Then fix the head of the moving edge and find the right-hand touch point of the potential tangent. After a finite number of exchanges like this, the edge stabilizes; this is the upper tangent edge eu. Observe that during the balancing we move only clockwise on the right-hand polygon and counter-clockwise on the other one. Notice also that it is the smart divide strategy which prevents any of the points of the input X1 from appearing inside CH(X2), and vice versa.

Again, the analysis of the algorithm is easy. The typical merge time is asymptotically linear, and the recursion has O(log n) levels, so the total estimated time is again O(n log n). Notice there is no initialization in the procedure itself; just assume that the points of X are already ordered. Hence another O(n log n) time must be added to prepare for the very first call. The memory necessary for running the algorithm is estimated by O(n), if the recursion is implemented properly.

13.3.4. The incremental paradigm. This approach consists in taking the input objects one by one and consecutively building the required resulting structure. This is particularly useful if the application does not allow all the points to be available at the beginning. Imagine incrementally building the convex hull of shots into a target, as they occur. Another good use is in randomized algorithms, where all the data is known at the beginning, but treated in a random order. Typically, the expected running time is then very good, while there might be much less effective, but improbable, worst case runs.

13.I.25. How many sequences over $\{0, 1, 2, 3\}$ of length $n$ have at least 2 zeroes?

Solution. The exponential generating function for the 0-entry is $b_0(x) = \frac{x^2}{2!} + \frac{x^3}{3!} + \dots = e^x - 1 - x$. The exponential generating function for each of the other entries is $e^x$, as there are no restrictions. Hence, the exponential generating function in question is
$$f(x) = (e^x - 1 - x)\,e^{3x} = e^{4x} - e^{3x} - x\,e^{3x}.$$
The answer to the question is $n!$ times the coefficient at $x^n$ in $f(x)$, which is
$$n!\Big(\frac{4^n}{n!} - \frac{3^n}{n!} - \frac{3^{n-1}}{(n-1)!}\Big) = 4^n - 3^n - n\,3^{n-1}. \;\square$$

The former case is easy to illustrate on the convex hull problem. In each step, employ the merge step of a very degenerate version of the divide and conquer algorithm, merging the hull CH(X1) of the previously known points X1 with X2 = CH(X2) = {vk}, where vk is just the new point. An extra step is needed to check whether vk is inside CH(X1) or not. If it is, then skip the new point and wait for the next one. If not, then merge.
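A sketch of this skip-or-merge step follows. The inside test against a counter-clockwise hull is again just the orientation predicate; rebuilding the hull from scratch is the lazy fallback (the text's merge via the two tangents would update in O(k) instead). All the function names are ours, not part of the text's pseudocode.

```python
def cross(o, a, b):
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def convex_hull(points):
    """Full hull (counter-clockwise) via two monotone chains."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_convex(hull, p):
    """p is inside (or on) a ccw convex polygon iff it is left of every edge."""
    n = len(hull)
    return all(cross(hull[i], hull[(i+1) % n], p) >= 0 for i in range(n))

def add_point(hull, p):
    """One incremental step: skip p if inside, otherwise merge (here: rebuild)."""
    return hull if inside_convex(hull, p) else convex_hull(hull + [p])

hull = convex_hull([(0, 0), (2, 0), (1, 1)])
hull = add_point(hull, (1, 3))      # outside: the hull grows
hull = add_point(hull, (1, 1))      # inside: skipped
print(hull)                         # [(0, 0), (2, 0), (1, 3)]
```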
The worst case time of this algorithm is O(n²), but as with the gift wrapping method, it depends on the actual size of the output as well as on the quality of the algorithm checking whether vk is inside CH(X1).

We illustrate the second case with a more elaborate convex hull algorithm. The main idea is to keep track of the position of all points of X with respect to the convex hull CH(Xk) of the first k of them (in the fixed order chosen at the beginning). With this goal in mind, keep the dynamical structure of a bipartite graph G whose first group of vertices consists of the points not yet processed, while the other group contains all the faces of the current convex polygon S = CH(Xk) (call them faces, so as not to confuse them with the edges of the graph G). Remember the faces of S are oriented. Such a face e is in conflict with a point v if the face is "visible" from v, i.e. v lies in the right-hand half-plane determined by e. In the bipartite graph, join each point to each of its faces in conflict; call G the graph of conflicts. The algorithm can now be formulated:

Randomized Incremental Convex Hull Algorithm
Input: A set X = {v1, . . . , vn}, n > 3, of at least four points in the plane.
Output: The edge list R of the convex hull CH(X).
(1) Initialization. Fix a random order on X. Choose the first three points as X0, create the list of conflicts for the edge list R = CH(X0) (i.e. state which of the three faces are seen from which points), and remove the three points from X.
(2) Main cycle. Repeat until the list X is empty:
• choose the first point v ∈ X;
• if there are some conflicts of v in G, then
– remove all the faces in conflict with v from both R and G,
– find the two new faces (the upper and lower tangents from the new point v to the existing CH(X) – they are easily found by taking care of the "vertices with missing edges"),
– add the two new faces to both R and G and find all their conflicts;
• remove the point v from the list X and from the graph G.

The complete analysis of this algorithm is omitted. Notice that finding the newly emerging conflicts is easy, since it is only necessary to check the points which were in conflict, before the update, with the faces incident to the two vertices to which the two new faces are attached. It can be proved that the expected time of this algorithm is O(n log n), while the worst case time is O(n²). The complete framework for the analysis of randomized algorithms is nicely explained in the book mentioned at the very beginning of this part, see page 1188.

13.3.5. Convex hull in 3D. Many applications need convex hulls of finite sets of points in higher dimensions, in particular in 3D. There are several ways of adjusting the 2D algorithms to 3D versions. First, it needs to be stated what the right structure for CH(X) is. As seen in 13.1.21, the convex polyhedra in R³ can be perfectly described by planar graphs. In order to modify the algorithms to 3D, a good encoding for them is needed. We want to find all vertices, edges, or faces which are incident or neighbouring in time proportional to the output. This is nicely achieved by the double connected edge lists.

Double connected edge list – DCEL
Let G = (V, E, S) be a planar graph.
The double connected edge list is the list E such that each edge is represented by two oriented twin-edges e, ẽ, equipped with the pointers
• Vt, Vh to the tail and head of e,
• F to the incident face (the left one with respect to the directed edge e), and the pointer to the twin ẽ,
• P to the edge following along the face F (in the counter-clockwise direction).
At the same time, keep the list of vertices and the list of faces, always with just one pointer towards one of the incident edges.

Clearly, we can remove or add faces, edges, or vertices, and find all kinds of neighbours, in time proportional to the output (notice the use of the twins and think about the details!).

Next, look at the 2D incremental algorithm above and try to imagine what needs changing there to make it work in 3D. First, we have to deal with the 2D faces S of the DCEL of the convex hull, and instead of their boundary vertices, deal with their boundary edges. Again, all the faces in conflict with the point just processed have to be removed (see the picture). This leaves a directed cycle of edges with the pointers F missing (call them "unsaturated edges"). Finally, instead of adding two new faces as in 2D, add the tangent cone of faces joining the point v to the unsaturated edges. Of course, the graph of conflicts must also be updated.

Randomized Incremental 3D Convex Hull Algorithm
Input: A set X = {v1, . . . , vn}, n > 4, of at least five points in the space R³.
Output: The DCEL R for the convex hull CH(X).
(1) Initialization. Fix a random order on X. Choose the first four points as X0, create the list of conflicts G and the DCEL R for CH(X0) (i.e. state which of the four faces are seen from which points), and remove the four points from X.
(2) Main cycle. Until the list X is empty, repeat:
• take the first point v ∈ X;
• if there are some conflicts of v in G, then
– remove all the faces of R in conflict with v (from both R and G), and take care of the (oriented) edges e in R left without their incident faces,
– build the "tangent cone" from the new point v to the current R by connecting v to the latter "unsaturated" edges,
– add the new faces to both R and G and find all their conflicts (again, note that the check for new conflicts can be restricted to the points which were in conflict with the faces incident to the edges where the cone has been attached to the previous R);
• remove the point v from the list X and from the graph G.

A detailed analysis is omitted. As in the 2D case, the expected running time of this algorithm is O(n log n). Thanks to the very well adapted DCEL data structure for the convex hull, it is a very good algorithm.
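A minimal Python rendering of the DCEL records may be helpful. It stores the head implicitly as `twin.tail`, one common simplification of the Vt/Vh pair of pointers; the class and field names are ours, not fixed by the text.

```python
from dataclasses import dataclass

@dataclass
class Vertex:
    point: tuple               # coordinates
    edge: "HalfEdge" = None    # one outgoing half-edge

@dataclass
class Face:
    edge: "HalfEdge" = None    # one half-edge on its boundary

@dataclass
class HalfEdge:
    tail: Vertex               # pointer Vt; the head Vh is twin.tail
    twin: "HalfEdge" = None    # the oppositely oriented copy of the same edge
    nxt:  "HalfEdge" = None    # pointer P: next edge along the same face (ccw)
    face: Face = None          # pointer F: incident face on the left

def face_edges(face):
    """Walk the boundary cycle of a face in time proportional to its length."""
    e = start = face.edge
    while True:
        yield e
        e = e.nxt
        if e is start:
            break
```

Removing a face amounts to clearing the `face` pointers of its boundary cycle (producing exactly the "unsaturated" edges of the algorithm above), and the twins give the neighbouring faces in constant time.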
The divide and conquer algorithm from the 2D case can be easily adapted, too. Skipping details, the initial lexicographic ordering of the input points allows us to recursively call the same procedure, producing the DCELs of two disjoint convex polytopes. This allows us to apply a more sophisticated "gift wrapping" approach when merging the results. A sort of "tubular collar" wrapping the two polytopes is to be built to create their convex hull. Imagine rotating planes, similarly to the lines in 2D, in order to get the first edge of the tubular set of faces to be added. Then the first plane containing one of the missing new faces is obtained. Continue breaking the plane along the new edges, until both directed cycles, along which the collar is attached to the two polytopes, are closed. All of this is done by bending the planes by the smallest possible angle in each step and checking that the right position is reached. Of course, the DCEL structure is essential to update all the data properly in time proportional to the size of the changes. With a reasonably smart implementation, this algorithm achieves the optimal O(n log n) running time.

Both of the latter algorithms can be generalized to all higher dimensions, too.

13.3.6. Voronoi diagrams. The next stop is at one of the most popular and useful planar divisions (and searching within them). For a finite set of points X = {v1, . . . , vn} in the plane R², write VR(vi) for the set of points in R² having vi ∈ X as their unique closest point in X. Define:

Voronoi diagram
For a given set of points X = {v1, . . . , vn} (not all collinear), the Voronoi regions are
$$VR(v_i) = \{x \in \mathbb{R}^2;\ \|x - v_i\| < \|x - v_k\| \text{ for all } v_k \in X,\ k \ne i\}.$$
This is an intersection of n − 1 open half-planes bounded by lines, so it is an open convex set, and its boundary is a convex polygon. The Voronoi diagram VD(X) is the planar graph whose faces are the open regions VR(vi), while the boundaries of the VR(vi) yield the edges and vertices.

Care is needed about collinearity, since if all the points vi lie on the same line in R², then their Voronoi regions are strips in the plane bounded by parallel lines. Under all other circumstances, the planar graph VD(X) from the latter definition is well defined and connected. By definition, the vertices p of VD(X) are the points in the plane such that at least three points v, w, u ∈ X are at the same distance from p, with no points of X inside the circle through v, w, u. If there are no further points of X on the latter circle, then the degree of the vertex p is 3. The most degenerate situation occurs if all the points of X lie on one circle. Then, obviously, the Voronoi regions are each delimited by two half-lines, all emanating from the center of the circle and cutting the angles properly. The construction of VD(X) is then equivalent to ordering the points by angles, so at least O(n log n) time is needed for the worst case of any algorithm building the Voronoi diagrams.

Some of the Voronoi regions are unbounded, others bounded. If just two points v and w are considered, then the axis of the segment vw is the boundary between the regions VR(v) and VR(w). In particular, the region VR(v) must be bounded for each v in the interior of the convex hull of X. On the contrary, consider an edge of CH(X) with incident vertices v and w, and the "outer" half-axis of the segment vw. If one considers any interior point u of CH(X), then it lies in the other half-plane with respect to the segment vw. Sooner or later, the points on the latter half-axis are closer to v and w than to u. It follows that both VR(v) and VR(w) are unbounded. Summarizing:

Lemma. Each Voronoi region VR(v) of the Voronoi diagram VD(X) is an open convex polygonal region. It is unbounded if and only if v lies on the boundary of the convex hull of X.

13.3.7. An incremental algorithm. Each Voronoi diagram represents a planar division, and when adding a new point p, it is quite obvious how to update the diagram. First assume we know which region VR(v) the point p hits.
Then choose the center v, and split the region VR(v) by the relevant part of the axis of the segment pv. Add this new edge e into the updated VD(X), simultaneously creating the two new faces and removing the old VR(v). The new edge e hits the boundary of the current VR(v) in either two points or one point (if the new edge is unbounded). These hits show which region of the updated diagram is to be split next. "Walk" further, with each new hit at the boundary revealing the next region's center playing the role of v above. Ultimately, this walk consecutively splits the visited old regions and creates the new directed cycle of edges bounding the new region, or an unbounded path of boundary edges, if the new region is unbounded. See the diagram for an illustration. If the new point p is on the boundary, i.e. hits one of the edges or vertices of VD(X), then the same algorithm works; just start with one of the incident regions.

So far this looks easy, but how does one find the relevant region hit by the new point? An efficient structure to search for it during the run of the algorithm is desired. Build an acyclic directed graph G for that purpose. The vertices of G are all the temporary Voronoi regions as they were created in the individual incremental steps. Whenever a region is split by the above procedure, new leaves of G are created. Draw edges towards these leaves from all the earlier regions which have some nontrivial overlap. Of course, care must be taken as to how the old regions overlap with the new ones, but this is not difficult. We illustrate the procedure in the diagram, updating from one point to four points.

Incremental Voronoi Diagram
Input: The set of points X = {v1, . . . , vn} in the plane, not all collinear.
Output: The DCEL of VD(X) and the search graph G.
(1) Initialization. Consider the first two points X0 = {v1, v2} and create the DCEL for VD(X0) with two regions. Create the acyclic directed graph G (just the root and two leaves).
(2) Main cycle. Repeat until there are no new points z ∈ X:
• localize the region VR(v) hit by z (by searching in G);
• perform the walk finding the boundary of the new region VR(z) in VD(X);
• update the DCEL for VD(X) and the acyclic directed search graph G.

This algorithm is easy to implement, and it directly produces a search structure for finding the Voronoi regions of given points. Unfortunately, it is very far from optimal in both respects – the worst case running time is O(n²), and the worst case depth of the search graph is O(n). If this is treated as a randomized incremental algorithm, the expected values are better, but not optimal either. Below is a useful modification via triangulations.

13.3.8. Delaunay triangulation. One remarkable feature of the Voronoi diagrams should not remain unnoticed. Right after the definition of the Voronoi diagram, an important fact was mentioned: the vertices of the planar graph VD(X) are centers of circles containing at least three points of X, with no other points of X inside the circle. The dual graph of VD(X) (see 13.1.22 for the definition) can be realized by taking the vertices V = X, with the edges indicating the neighbouring faces. Then this is again a tessellation of the plane. Actually, the convex hull of X is tessellated into convex regions, and one unbounded region remains. It is called the Delaunay tessellation DT(X).
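The "no points of X inside the circle" condition, which drives everything that follows, is decided by the standard in-circle determinant predicate. A sketch follows (assuming integer coordinates and counter-clockwise order of a, b, c; the function name is ours):

```python
def in_circle(a, b, c, d):
    """> 0 iff d lies strictly inside the circle through a, b, c,
    assuming a, b, c are in counter-clockwise order
    (the classical 3x3 determinant test, exact for integer input)."""
    m = [[a[0]-d[0], a[1]-d[1], (a[0]-d[0])**2 + (a[1]-d[1])**2],
         [b[0]-d[0], b[1]-d[1], (b[0]-d[0])**2 + (b[1]-d[1])**2],
         [c[0]-d[0], c[1]-d[1], (c[0]-d[0])**2 + (c[1]-d[1])**2]]
    return (m[0][0]*(m[1][1]*m[2][2] - m[1][2]*m[2][1])
          - m[0][1]*(m[1][0]*m[2][2] - m[1][2]*m[2][0])
          + m[0][2]*(m[1][0]*m[2][1] - m[1][1]*m[2][0]))

# (0,0), (2,0), (0,2) lie on the circle centred at (1,1); (1,1) is inside:
print(in_circle((0, 0), (2, 0), (0, 2), (1, 1)) > 0)   # True
```

The same predicate decides the edge flips in the triangulation algorithms discussed next.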
In the generic case (i.e. when no four points of X lie on one circle), the degrees of all vertices of VD(X) are 3. Then DT(X) is the Delaunay triangulation of the convex hull of X.13 Notice that it is easy to turn any Delaunay tessellation into a triangulation by adding the edges necessary to triangulate the convex regions with more edges. Any of these refined tessellations is called a Delaunay triangulation associated to VD(X).

In general, a planar graph T is called a triangulation of its vertices X ⊂ R², |X| = n, if all its bounded faces have just 3 vertices. It is easy to see that each triangulation T has τ = 2n − 2 − k triangles and ν = 3n − 3 − k edges, where k is the number of vertices on the boundary of the unbounded face. Indeed, by the Euler formula (13.1.20), n − ν + τ + 1 = 2 (there is the unbounded face on top of all the triangles). Now, every triangle has 3 edges, there are k edges around the unbounded face, and each edge is incident to exactly two faces. It follows that 3τ + k = 2ν. It remains to solve the two linear equations for τ and ν.

Triangulations are extremely useful in numerical mathematics and in computer graphics as the typical background mesh for processing approximate values of functions. Of course, there are many triangulations on a given set, and one of the qualitative requests is to aim at triangles as close to equilateral triangles as possible. This can be phrased as the goal to maximize the minimal angles inside the triangles. A practical way to do this is to write down the angle vector of the triangulation, A(T) = (γ1, γ2, . . . , γ3τ), where γj are the angles of all the triangles in T sorted by their value, γj ≤ γk for all j < k. A triangulation T on X is said to be angle optimal if A(T) ≥ A(T′) in the lexicographic ordering, for all triangulations T′ on the same set of vertices X. In particular, an angle optimal triangulation achieves the maximum over the minimal angles of the triangles.

Surprisingly, there is a very simple (though not very effective) procedure producing (one of) the angle optimal triangulations. Consider any two adjacent triangles and check the six angle sequences of their interior angles. If the current position of the diagonal edge provides the worse sequence, flip it. See the diagram.

13 Although the name sounds French, Boris Nikolaevich Delone (1890–1980) was a Russian mathematician using the French transcription of his name in his publications. His name is associated with the triangulation because of his important work on it from 1934.

The flip is necessary if and only if one of the vertices off the diagonal is inside the circle drawn through the remaining three vertices. Since each such flip of an edge inside a triangulation T definitively increases the angle vector, the following algorithm must stop, and it achieves an angle optimal triangulation:

Edge Flipping Delaunay
Input: Any triangulation T̃ of points X in the plane.
Output: An angle optimal triangulation T of the same set X.
(1) Main cycle. Repeat until there are no edges to flip:
• Find an edge which should be flipped and flip it.

Theorem. A triangulation T on a set of points X in the plane R² is angle optimal if and only if it is a Delaunay triangulation associated to the Voronoi diagram VD(X).

Proof. Consider any Delaunay triangulation T associated to VD(X) and one of the vertices p of VD(X).
If the four vertices of two neighbouring triangles are not all on the same circle, then by the very definition of VD(X), the two circles in question do not contain the remaining points, and thus there is no need for any flips. Further, let v1, . . . , vk be all the points of X lying on the circle determining p. Fix an edge with two neighbouring endpoints on the circle. All triangles with the third vertex on the circle above the edge share the same angle at that vertex (by the inscribed angle theorem). A simple check now verifies that different ways of triangulating the same region of VD(X) with more than 3 boundary edges always lead to the same angle vector. In particular, no flips at all are necessary in the above algorithm if one starts with a Delaunay triangulation T. Hence the angle optimal triangulation is arrived at.

In order to prove the other implication, recall the comments on the diagram above. All triangles in an angle optimal triangulation T have the following two properties: (1) the circle drawn through their three vertices does not include any other point in its interior; (2) the circle having any of their edges as diameter does not have any other point in its interior. Consider the dual graph G of T and its realization in the plane obtained by drawing the vertices as the centers of the circles circumscribed to the individual triangles, while the edges are the segments joining them. If there are no more than 3 points on any of those circles, then G = VD(X) is obtained. In the degenerate situations, all the triangles sharing a circle produce the same vertex in the plane, and some of the relevant edges degenerate completely. Identifying those collapsing elements in G, the right VD(X) is obtained. □

13.3.9. Incremental Delaunay. We return to the general idea for the Voronoi diagram, namely to design an algorithm which constructs both VD(X) and DT(X) and which behaves very well in its randomized implementation. The idea is straightforward: use the incremental approach as with the Voronoi diagrams to refine the consecutive Delaunay triangulations, employing the edge flipping method. Looking at the diagram, the Voronoi algorithm is easily modified. Care must be taken of three different cases for the new point – hitting the unbounded face, hitting one of the internal triangles, or hitting one of the edges.

Incremental Delaunay Triangulation
Input: The set of points X = {v1, . . . , vn} in the plane, not all collinear.
Output: The DCEL of DT(X) and the search graph G for this triangulation.
(1) Initialization. Consider the first three points X0 = {v1, v2, v3}. Create the DCEL for DT(X0) with two regions, and create CH(X0) (the connected edge list). Create the acyclic directed graph G (just the root and two leaves).
(2) Main cycle. Repeat until there are no new points z ∈ X:
• Localize the face ∆ in DT(Xk) hit by z (by searching in G).
• If z is in the unbounded face, then
– add the new triangles ∆1, . . . , ∆ℓ to DT(Xk) by joining z to the visible edges of CH(Xk),
– update the CH(X).
• If z hits a (bounded) triangle ∆, then split it into the three new triangles ∆1, ∆2, ∆3.
• If z hits an edge e, then split the adjacent bounded triangles into ∆1, . . . , ∆4 (only two, if an edge of CH(Xk) is hit).
• Create a queue Q of the not yet checked modified triangles and repeat as long as Q is not empty:
– take the first ∆ from the queue Q, look for its neighbour not in Q and not yet checked, and flip the edge if necessary;
– if an edge is flipped, put the newly modified triangles into Q.

A detailed analysis of the algorithm is omitted. It is almost obvious that the algorithm is correct. It is only necessary to prove that the proposed version of the edge flipping update ensures that, after each addition of a new point z in the k-th step, the correct Delaunay triangulation of Xk arises. Once an edge is flipped, it need not be considered again later. Finally, if the Voronoi diagram is needed instead, it can be obtained from DT(X) in linear time. Obviously, the search structures can be used directly. Surprisingly enough, it turns out that the expected number of total flips over the whole run is of size O(n log n). Hence the algorithm achieves the perfect expected time O(n log n). A detailed analysis of this beautiful example of results in computational geometry can be found in section 9.4 of the book by Berg et al., cf. the link on page 1188.

13.3.10. The beach line Voronoi. The Voronoi algorithm provides a perfect example for the sweep line paradigm, where the extra structure keeping track of the events has to be quite smart.14

14 This is mostly called the Fortune Voronoi algorithm – not because the construction is so lucky, but because the algorithm was published by Steven Fortune of Bell Laboratories in 1986.

Imagine a horizontal line L (parallel to the x axis) flowing in the top-down direction and meeting the points in X = {v1, . . . , vn}. Of course, the part of VD(X) involving all points above the current position of L cannot simply be drawn, since it depends also on the points below the line. It is better to look at the part RL of the plane,
$$R_L = \{p \in \mathbb{R}^2;\ \mathrm{dist}(p, L) \ge \mathrm{dist}(p, v_i) \text{ for at least one } v_i \in X \text{ above } L\}.$$
This is exactly the part of R² which can be tessellated into the Voronoi diagram with the information collected at the current position of L. Obviously, RL is bounded by a continuous curve consisting of arcs of parabolas, since for one point vi this is the case, and RL is the union of these parabolic regions. Call the boundary of RL the beach line BL. The break points on BL draw the VD(X) as L moves. Since the Voronoi diagram consists of straight edges, we do not even compute the parabolas; we only take care of the arrangement of the still active arcs of parabolas on the beach line, as determined by the individual points.

New arcs of the beach line arise when the line L meets one of the points. Add all the points to an ordered list in the obvious lexicographic order, and call them the point events. An active arc of the beach line disappears when the line L meets the bottom of the circle drawn through the three points determining a vertex of the Voronoi diagram. Such an event is called a circle event. Both types of events are illustrated in the diagram above. There is a striking difference between them. The point events always initiate a new arc and start "drawing" two edges of the Voronoi diagram; they initiate previously unknown circle events. The circle events might disappear without creating a genuine vertex of VD(X). Look at the diagram at the s point event: the new s, r, q circle event is encountered there.
But this would not create a vertex of the diagram if there were a next point u somewhere close enough to the indicated vertex. This is found out as soon as such a point event u is met. Such "ineffectively disappearing" circle events are called false alarms. On the contrary, the p, q, r circle event shown in the diagram gives rise to the indicated vertex. Summarizing, the emerging circle events must be inserted properly into the ordered queue of events and handled properly at each of the point events.

Further details are not considered here. When implemented properly, this algorithm runs in the optimal O(n log n) time and O(n) storage. See again the above mentioned book by Berg et al. (section 7) for details.

13.3.11. Geometric transformations. Various geometric transformations of the ambient space can often help to transform one problem into another. This is illustrated by a beautiful construction relating convex hulls and Voronoi diagrams. Of course, transformations which behave well on lines and planes and preserve incidences should be sought. The affine and projective transformations behave well in this respect, as seen in the fourth chapter. We introduce a more interesting one – the spherical inversion.

In the plane R², consider the unit circle x² + y² = 1. For arbitrary v = (x, y) ≠ (0, 0) define
$$\varphi(v) = \frac{1}{\|v\|^2}\, v = \frac{1}{x^2 + y^2}\,(x, y).$$
Clearly, φ is a bijection of R² \ {(0, 0)}. The geometric meaning of the transform is clear from the formula, see the diagram: a "general" point v is sent to the point on the same ray through the origin, but at the reciprocal distance. The unit circle is the set of fixed points. The same principle works in all dimensions, so we may equally well (and more interestingly) consider v ∈ R³ in the sequel.

Next comes the crucial property of φ.

Lemma. The mapping φ maps the spheres and planes in R³ onto spheres and planes. The image of a sphere is a plane if and only if the sphere contains the origin.

Proof. Consider a sphere C with center c and radius r. The equation for its general points p reads ∥p − c∥² = r². By drawing a few images as in the diagram above, it is easily guessed that the image is the sphere with center $s = \frac{1}{\|c\|^2 - r^2}\,c$ (i.e. again on the same line through the origin). Now consider q = φ(p) and compute (using $2\,p\cdot c = \|p\|^2 + \|c\|^2 - r^2$, which follows from the latter equation):
$$\|q - s\|^2 = \Big\|\frac{p}{\|p\|^2} - \frac{c}{\|c\|^2 - r^2}\Big\|^2 = \frac{1}{\|p\|^2} + \frac{\|c\|^2}{(\|c\|^2 - r^2)^2} - \frac{2\,p\cdot c}{(\|c\|^2 - r^2)\,\|p\|^2} = \frac{\|c\|^2}{(\|c\|^2 - r^2)^2} - \frac{1}{\|c\|^2 - r^2} = \Big(\frac{r}{\|c\|^2 - r^2}\Big)^2.$$
The latter computation assumes ∥c∥ ≠ r. Fix the center c ≠ 0 and consider radii r approaching ∥c∥ from below or above. Then the images are spheres with centers s approaching one or the other infinite point of the line span{c}, with fast growing radii. In the limit position, the plane is obtained, as requested. (Check this asymptotic computation directly yourself, if in any doubt.) □
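A quick numerical sanity check of the lemma (in the plane, for simplicity) is easy: sample a circle avoiding the origin, invert the samples, and measure their distances from the predicted center. The helper names are ours.

```python
import math

def invert(p):
    """Spherical inversion in the unit circle: v -> v / |v|^2."""
    n2 = p[0]**2 + p[1]**2
    return (p[0]/n2, p[1]/n2)

c, r = (3.0, 0.0), 1.0                       # a circle avoiding the origin
images = [invert((c[0] + r*math.cos(t), c[1] + r*math.sin(t)))
          for t in (0.1*k for k in range(63))]
# predicted centre s = c / (|c|^2 - r^2) and radius r / (|c|^2 - r^2)
d = c[0]**2 + c[1]**2 - r**2
s = (c[0]/d, c[1]/d)
radii = [math.hypot(q[0]-s[0], q[1]-s[1]) for q in images]
print(min(radii), max(radii))                # both approx 1/8, as predicted
```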
The continuity of φ has important consequences. Consider a general plane µ (not containing the origin). The inversion φ maps one of the half-spaces determined by µ to the interior of the image sphere; the other half-space maps to the unbounded complement of the sphere. The latter is, of course, the half-space containing the origin.

The efficient link between the Voronoi diagrams and convex hulls can now be explained. Assume a set of points X = {v1, . . . , vn} in the plane is given. View them as points in the plane z = 1 in R³, i.e. add the coordinate z = 1 to all the points (x, y) of X. For simplicity, assume that no three of them are collinear and no four of them lie on the same circle. The spherical inversion φ maps the entire plane z = 1 to the sphere S with center c = (0, 0, 1/2) and radius 1/2. Write w1, . . . , wn for the images wi = φ(vi). Now, consider CH(Y) for the set of images Y = {w1, . . . , wn}. This is a convex polytope with all vertices on the sphere S. All its faces span planes not containing the origin (this is due to the assumption that no three points of X are collinear). Split the faces of CH(Y) into those "visible" from the origin and the "invisible" ones. In the latter case, all points of Y are on the same side of the plane µ generated by the face as the origin. This implies that all the other points are outside of the image sphere Sµ = φ(µ). In particular, there are no points of X inside the intersection of Sµ with the plane z = 1. This is the defining condition for obtaining one of the vertices of the Voronoi diagram. Since the map φ preserves incidences, the entire DCEL for VD(X) is easily reconstructed from the DCEL of CH(Y), and vice versa. This resembles the construction of the dual graph, i.e. the Delaunay triangulation DT(X), from the Voronoi diagram, with a further geometric transformation in the background.

Last but not least, the faces of CH(Y) visible from the origin are worth mentioning too. For the same reason as above, all the points of Y appear on the other side from the origin, and so all the points of X are inside the image sphere. This means that the diagram of furthest points, instead of the Voronoi diagram of the closest ones, is obtained. This is a very useful tool in several areas of mathematics, see some of the exercises (??) for further illustration.

4. Remarks on more advanced combinatorial calculations

13.4.1. Generating functions. The worlds of discrete and continuous mathematics meet all the time, and there have already been many instances of useful interactions. With some slight exaggeration, we can claim that all results in analysis are achieved by an appropriate reduction of the continuous tasks to combinatorial problems (for instance, integration of rational functions is reduced to partial fraction decomposition, the solution of analytic differential equations may boil down to recurrences, etc.). In the opposite direction, we demonstrate how handy continuous methods can be in purely combinatorial problems.

We begin with a simple combinatorial question: There are four 1-crown coins, five 2-crown coins, and three 5-crown coins at our disposal. Suppose we want to buy a bottle of coke which costs 22 crowns. In how many ways can we pay the exact amount with the given coins? We are looking for integers i, j, k such that i + j + k = 22 and i ∈ {0, 1, 2, 3, 4}, j ∈ {0, 2, 4, 6, 8, 10}, k ∈ {0, 5, 10, 15}. Consider the product of polynomials (over the real numbers, for instance)
$$(x^0 + x^1 + x^2 + x^3 + x^4)(x^0 + x^2 + x^4 + x^6 + x^8 + x^{10})(x^0 + x^5 + x^{10} + x^{15}).$$
It should be clear that the number of solutions equals the coefficient at x²² in the resulting polynomial. This corresponds to the four possibilities of choosing the values i, j, k: 3·5 + 3·2 + 1·1, 3·5 + 2·2 + 3·1, 2·5 + 5·2 + 2·1, and 2·5 + 4·2 + 4·1. This simple example deserves more attention.
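The computation is easy to reproduce by multiplying the three coefficient lists; a short sketch with a helper of our own:

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists."""
    r = [0]*(len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i+j] += a*b
    return r

one  = [1]*5                                          # 1 + x + x^2 + x^3 + x^4
two  = [1 if i % 2 == 0 else 0 for i in range(11)]    # x^0 + x^2 + ... + x^10
five = [1 if i % 5 == 0 else 0 for i in range(16)]    # x^0 + x^5 + x^10 + x^15

f = poly_mul(poly_mul(one, two), five)
print(f[22])   # 4 ways to pay 22 crowns
```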
The coefficients of the particular polynomials represent sequences of numbers, recording in how many ways the given value can be achieved with one type of coin only. To avoid a prior bound on how many values are available, work with infinite sequences. Encode the possibilities by the infinite sequences
(1, 1, 1, 1, 1, 0, 0, . . . ) — 1-crowns,
(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, . . . ) — 2-crowns,
(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, . . . ) — 5-crowns.
Each such sequence with only finitely many non-zero terms can be assigned a polynomial, and the solution of the problem is given by the product of these polynomials, as noted before. This is an instance of a general procedure for handling sequences effectively.

Generating function of a sequence
Definition. An (ordinary) generating function for an infinite sequence a = (a0, a1, a2, . . . ) is the (formal) power series
$$a(x) = a_0 + a_1 x + a_2 x^2 + \cdots = \sum_{i=0}^{\infty} a_i x^i.$$
The values ai are considered in some fixed field K, normally the rational, real, or complex numbers.

In practice, there are several standard ways of defining and using generating functions:
- to find an explicit formula for the n-th term of a sequence;
- to derive new recurrent relations between values (although generating functions are often based on recurrent formulae themselves);
- to calculate means or other statistical quantities (for instance, the average time complexity of an algorithm);
- to prove miscellaneous combinatorial identities;
- to find an approximate formula or the asymptotic behaviour when the exact formula is too hard to get.
We shall see examples of some of these.

13.4.2. Operations with generating functions. Several basic operations with sequences correspond to simple operations over power series (as is easily proved by performing the relevant operation on the power series):
• Componentwise, the sum (ai + bi) of two sequences corresponds to the sum a(x) + b(x) of the generating functions.
• Multiplication (α · ai) of all terms by a given scalar α corresponds to the multiplication α · a(x) of the generating function.
• Multiplication of the generating function a(x) by the monomial x^k corresponds to shifting the sequence k places to the right and filling the first k places with zeros.
• In order to shift the sequence k places to the left (i.e. omit the first k terms), subtract the polynomial bk(x) corresponding to the sequence (a0, . . . , ak−1, 0, . . . ) from a(x), and then divide the generating function by x^k.
• Substitution of a polynomial f(x) for x leads to a specific combination of the terms of the original sequence. It can be expressed easily for f(x) = αx, which corresponds to multiplying the k-th term of the sequence by the scalar α^k. The substitution f(x) = x^n inserts n − 1 zeros between each pair of adjacent terms.
The first and second rules express the fact that the assignment of the generating function to a sequence is an isomorphism of the two vector spaces (over the field in question).

There are other important operations which often appear when working with generating functions:
• Differentiation with respect to x: the function a′(x) generates the sequence (a1, 2a2, 3a3, . . . ); the term at index k is (k + 1)a_{k+1} (i.e. the power series is differentiated term by term).
• Integration: the function $\int_0^x a(t)\,dt$ generates the sequence $(0, a_0, \frac12 a_1, \frac13 a_2, \frac14 a_3, \dots)$; for k ≥ 1, the term at index k equals $\frac{1}{k}a_{k-1}$ (clearly, differentiating the corresponding power series term by term leads back to the original function a(x)).
• Product of power series: the product a(x)b(x) is the generating function of the sequence (c0, c1, c2, . . . ), where $c_k = \sum_{i+j=k} a_i b_j$; i.e., the terms of the product agree, up to the index k, with those of the product $(a_0 + a_1 x + a_2 x^2 + \dots + a_k x^k)(b_0 + b_1 x + b_2 x^2 + \dots + b_k x^k)$. The sequence (cn) is also called the convolution of the sequences (an), (bn).
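A tiny illustration of the convolution rule: multiplying by the generating function of (1, 1, 1, . . . ), i.e. by $\frac{1}{1-x}$, produces the partial sums, a fact used in 13.4.5 below. The helper is ours; the sequences are truncated lists.

```python
def convolve(a, b):
    """c_k = sum over i+j=k of a_i * b_j, for two truncated sequences."""
    c = [0]*(len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i+j] += x*y
    return c

a = [1, 2, 3, 4, 5]            # some sequence a_n
ones = [1]*5                   # the sequence of 1/(1-x)
print(convolve(a, ones)[:5])   # [1, 3, 6, 10, 15]: the partial sums of a_n
```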
13.4.3. More links to continuous analysis. There are many useful examples of generating functions; most of them appear when working with power series in chapters five and six. Perhaps the reader recognizes the generating function given by the geometric series,
$$a(x) = \frac{1}{1-x} = 1 + x + x^2 + \dots,$$
which corresponds to the constant sequence (1, 1, 1, . . . ). As we know, this power series converges for x ∈ (−1, 1) and equals 1/(1 − x). It works the other way round as well: expanding this function into its Taylor series at the point 0, the original series is obtained. This "encoding" of a sequence into a function, and decoding it back, is the key idea in both the theory and practice of generating functions.

Generally, consider any sequence $a_n$ with $\sqrt[n]{a_n}$ bounded. Then there is a neighbourhood of zero on which its generating function converges (see 5.4.10 on page 428). For example, an easy check shows that this happens whenever $|a_n| = O(n^k)$ with a constant exponent k ≥ 0. On this neighbourhood, the generating functions can be worked with as with ordinary functions; in particular, one can add, multiply, compose, differentiate, and integrate them. All the equalities obtained carry over to the corresponding sequences. Recall several very useful basic power series and their sums:
$$\frac{1}{1-x} = \sum_{n\ge 0} x^n, \qquad \ln(1+x) = \sum_{n\ge 1} (-1)^{n+1}\frac{x^n}{n}, \qquad \ln\frac{1}{1-x} = \sum_{n\ge 1} \frac{x^n}{n},$$
$$e^x = \sum_{n\ge 0} \frac{x^n}{n!}, \qquad \sin x = \sum_{n\ge 0} (-1)^n \frac{x^{2n+1}}{(2n+1)!}, \qquad \cos x = \sum_{n\ge 0} (-1)^n \frac{x^{2n}}{(2n)!}.$$

13.4.4. Binomial theorem. Recall the standard finite binomial formula $(a+b)^r = a^r(1+c)^r = a^r\sum_{n=0}^{r}\binom{r}{n}c^n$, where $r \in \mathbb{N}$, $0 \ne a, b \in \mathbb{C}$, $c = b/a$. Even if the power r is not a natural number, the Taylor series of $(1+x)^r$ can still be computed. This yields the following generalization:

Generalized binomial theorem
Theorem. For any $r \in \mathbb{R}$, $k \in \mathbb{N}$, write
$$\binom{r}{k} = \frac{r(r-1)(r-2)\cdots(r-k+1)}{k!}$$
(in particular $\binom{r}{0} = 1$, the empty product divided by 1 in the latter formula). The power series expansion
$$(1+x)^r = \sum_{k\ge 0}\binom{r}{k}x^k$$
converges on a neighbourhood of zero, for each $r \in \mathbb{R}$.

The latter formula is called the generalized binomial theorem. In particular, the function $\frac{1}{(1-x)^n}$, $n \in \mathbb{N}$, can be expanded into the series
$$1 + \binom{1+n-1}{n-1}x + \dots + \binom{k+n-1}{n-1}x^k + \dots.$$

Proof. The theorem is obvious if $r \in \mathbb{N}$, since it is then the finite binomial formula. So assume r is not a natural number, and thus zero is never obtained when evaluating $\binom{r}{k}$. First, differentiate the function $a(x) = (1+x)^r$ and evaluate all the derivatives at x = 0. Obviously,
$$a^{(k)}(0) = r(r-1)\cdots(r-k+1)\,(1+x)^{r-k}\big|_{x=0} = r(r-1)\cdots(r-k+1),$$
which provides the coefficients $a_k = \binom{r}{k}$ of the series. In 5.4.5, there are several simple tests to decide on the convergence of a number series. The ratio test helps here:
$$\frac{a_{k+1}x^{k+1}}{a_k x^k} = \frac{\;\frac{r(r-1)\cdots(r-k)}{(k+1)!}\;}{\;\frac{r(r-1)\cdots(r-k+1)}{k!}\;}\,x = \frac{r-k}{k+1}\,x.$$
By the ratio test, the radius of convergence is 1 for all $r \notin \mathbb{N}$. The generalized binomial formula for negative integers is a straightforward consequence: if $r = -n$, we may write
$$\binom{-n}{k} = (-1)^k\,\frac{1}{k!}\,(n+k-1)(n+k-2)\cdots n = (-1)^k\binom{n+k-1}{n-1},$$
and substituting −x for the argument just kills the signs, as requested. □

13.4.5. Examples. The formulae with r a negative integer are very useful in practice. The simplest one is the geometric series, with r = −1. Write down two more of them:
$$\frac{1}{(1-x)^2} = \sum_{n\ge 0}(n+1)x^n, \qquad \frac{1}{(1-x)^3} = \sum_{n\ge 0}\binom{n+2}{2}x^n.$$
The same results can be obtained by consecutive convolutions. Indeed, for the generating function a(x) of a sequence (a0, a1, a2, . . . ), $\frac{1}{1-x}a(x)$ is the generating function of the sequence of all the partial sums (a0, a0 + a1, a0 + a1 + a2, . . . ). For instance,
$$\frac{1}{1-x}\,\ln\frac{1}{1-x}$$
is the generating function of the harmonic numbers $H_n = 1 + \frac12 + \dots + \frac1n$.

13.4.6. Difference equations. Typically, generating functions can be very useful when sequences are defined by relations between their terms. An instructive example of such an application is the complete discussion of the solutions of linear difference equations with constant coefficients. These are examined in the second part of chapter one, see 1.2.4. Back there, a formula is derived for first-order equations, while the uniqueness and existence of the solution is justified after only "guessing" the solution. Now it can be truly derived, in a straightforward way which works also for more complex and non-linear problems.

First, sort out the well-known example of the Fibonacci sequence, given by the recurrence
$$F_{n+2} = F_n + F_{n+1}, \qquad F_0 = 0,\ F_1 = 1,$$
and write F(x) for the (yet unknown) generating function of this sequence. We want to compute F(x) and so obtain an explicit expression for the n-th Fibonacci number. The defining equality can be expressed in terms of F(x) using our operations for shifting the terms of the sequence. Indeed, xF(x) corresponds to the sequence (0, F0, F1, F2, . . . ), and x²F(x) to (0, 0, F0, F1, . . . ). Therefore, the generating function G(x) = F(x) − xF(x) − x²F(x) represents the sequence (F0, F1 − F0, 0, 0, . . . , 0, . . . ). Substituting the values F0 = 0, F1 = 1 (the initial condition), obviously G(x) = x, and hence
$$(1 - x - x^2)\,F(x) = x.$$
So F(x) is a rational function, and it can be rewritten as a linear combination of simple rational functions. This is helpful, since a linear combination of generating functions corresponds to the same combination of the sequences. Rational functions can be decomposed into partial fractions, see 6.2.7. Using this procedure, write
$$F(x) = \frac{x}{1 - x - x^2} = \frac{A}{x - x_1} + \frac{B}{x - x_2} = \frac{a}{1 - \lambda_1 x} + \frac{b}{1 - \lambda_2 x},$$
where A, B are suitable (generally complex) constants, and x1, x2 are the roots of the polynomial in the denominator. The ultimate constants a, b, λ1, and λ2 can be obtained by a simple rearrangement of the particular fractions. This leads to the general form of the generating function,
$$F(x) = \sum_{n=0}^{\infty}\big(a\lambda_1^n + b\lambda_2^n\big)x^n,$$
and so the general solution of the recurrence is known as well.

In the present case, the roots of the quadratic polynomial are $\frac{-1\pm\sqrt5}{2}$, hence the reciprocal values of the roots are $\lambda_{1,2} = \frac{1\pm\sqrt5}{2}$. The partial fraction decomposition equality gives
$$x = a\Big(1 - \frac{1-\sqrt5}{2}\,x\Big) + b\Big(1 - \frac{1+\sqrt5}{2}\,x\Big),$$
and so $a = -b = \frac{1}{\sqrt5}$. Finally, the requested solution
$$F_n = \frac{1}{\sqrt5}\left(\Big(\frac{1+\sqrt5}{2}\Big)^{\!n} - \Big(\frac{1-\sqrt5}{2}\Big)^{\!n}\right)$$
is obtained. Compare this procedure to the approach in 3.2.2 and 3.B.1. This expression, full of irrational numbers, is an integer. The second summand is the power of (1 − √5)/2 ≃ −0.618, whose value is negligible for large n. Hence Fn can be computed by evaluating just the first summand and approximating to the nearest integer.
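It is worth checking the closed formula against the recurrence numerically; a short sketch follows (floating point suffices for small n, together with the rounding trick just mentioned):

```python
from math import sqrt

def fib_closed(n):
    s5 = sqrt(5)
    return ((1 + s5)/2)**n / s5 - ((1 - s5)/2)**n / s5

# compare with F_{n+2} = F_n + F_{n+1}, F_0 = 0, F_1 = 1
F = [0, 1]
for _ in range(18):
    F.append(F[-2] + F[-1])
print(all(round(fib_closed(n)) == F[n] for n in range(20)))   # True
```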
Of course, the same procedure can be applied to general k-th order homogeneous linear difference equations. Consider the recurrence
$$F_{n+k} = \alpha_0 F_n + \dots + \alpha_{k-1} F_{n+k-1}.$$
The generating function of the resulting sequence is
$$F(x) = \frac{g(x)}{1 - \alpha_{k-1}x - \dots - \alpha_0 x^k},$$
where the polynomial g(x), of degree at most k − 1, is determined by the chosen initial conditions. Using partial fraction decomposition, the general result follows as in subsection 3.2.4.

13.4.7. The general method. Power series are a much stronger tool for solving recurrences. The point is that one is not restricted to linearity and homogeneity. Using the following general approach, recurrences that seem intractable at first sight can quite often be managed. The first steps are just algorithmic, while the final solution of the equation for the generating function may need very diverse approaches.

In order to write down the necessary equations efficiently, adopt the convention of the logical predicate [δ(n)], attached before the expression it governs: simply multiply by the coefficient 1 if δ(n) is true, and by zero otherwise. For instance, the equation
$$F_n = F_{n-2} + F_{n-1} + [n = 1]\,1 + [n = 0]\,1$$
defines the above Fibonacci recurrence with the initial conditions F0 = 1 and F1 = 2.

Method to resolve recurrences
Recurrent definitions of sequences (a0, a1, . . . ) may be solved in the following 4 steps:
(1) Write the complete dependence between the terms of the sequence as a single equation, expressing an in terms of the terms with smaller indices. This universal formula must hold for all n ∈ N (supposing a−1 = a−2 = · · · = 0).
(2) Multiply both sides of the equation by x^n and sum the resulting expressions over all n ∈ N. One of the summands is $\sum_{n\ge 0} a_n x^n$, which is the generating function A(x) of the sequence. Rearrange the other summands so that they contain only A(x) and some further polynomial expressions.
(3) Solve the resulting equation for A(x) explicitly.
(4) Expand the function A(x) into a power series. Its coefficients at x^n are the requested values an.

As an example, consider a second-order linear difference equation with constant coefficients, but with a non-linear right-hand side. The recurrence is
$$a_n = 5a_{n-1} - 6a_{n-2} - n$$
with the initial conditions a0 = 0, a1 = 1. The individual steps of the latter procedure are as follows.

Step 1. The universal equation is clear, up to the initial conditions. First check n = 0, which yields no extra term; then n = 1 enforces the extra value 2 to be added. Hence,
$$a_n = 5a_{n-1} - 6a_{n-2} - n + [n = 1]\,2.$$

Step 2.
$$\sum_{n\ge 0} a_n x^n = 5x\sum_{n\ge 0} a_{n-1}x^{n-1} - 6x^2\sum_{n\ge 0} a_{n-2}x^{n-2} - \sum_{n\ge 0} n x^n + 2x.$$
Next, one of the terms is nearly the power series of $(1-x)^{-2}$; thus remove one x there, in order to get the equality for A(x) in the required form (ignore the negative values of the indices, since all of a−1, a−2, . . . vanish by assumption):
$$A(x) = 5xA(x) - 6x^2 A(x) - x\,\frac{1}{(1-x)^2} + 2x.$$

Step 3.
Find the reciprocal values, 2 and 3, of the roots of the polynomial $1 - 5x + 6x^2 = (1-2x)(1-3x)$. An elementary calculation yields
$$A(x) = \frac{2x^3 - 4x^2 + x}{(1-2x)(1-3x)(1-x)^2}.$$

Step 4. Partial fraction decomposition directly leads to the result
$$A(x) = -\frac{1}{4}\cdot\frac{1}{1-3x} + 2\cdot\frac{1}{1-2x} - \frac{1}{2}\cdot\frac{1}{(1-x)^2} - \frac{5}{4}\cdot\frac{1}{1-x}.$$
This corresponds to the solution (notice that the last but one term yields $\sum (n+1)x^n$, and thus the constant term in our formula is the sum $-\frac12 - \frac54$)
$$a_n = -\frac{1}{4}\,3^n + 2^{n+1} - \frac{1}{2}\,n - \frac{7}{4}.$$
The first eight terms of the sequence are 0, 1, 3, 6, 8, −1, −59, −296, all integers of course.
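Again, the closed formula is easily checked against the recurrence; a minimal sketch:

```python
def a_closed(n):
    return -3**n / 4 + 2**(n+1) - n/2 - 7/4

a = [0, 1]                                   # a_0, a_1
for n in range(2, 20):
    a.append(5*a[-1] - 6*a[-2] - n)          # the recurrence itself
print(all(a_closed(n) == a[n] for n in range(20)))   # True
print(a[:8])                                 # [0, 1, 3, 6, 8, -1, -59, -296]
```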
13.4.8. Plane binary trees and Catalan numbers. The next application of generating functions answers the question about the number bn of nonisomorphic plane binary trees on n vertices (cf. 13.1.18 for plane trees). Treat these trees in the form of the root (of a subtree) together with the pair [left binary subtree, right binary subtree]. Examine the initial values of n, namely b0 = 1, b1 = 1, b2 = 2, b3 = 5. It is more or less obvious that for n ≥ 1, the sequence bn satisfies the recurrent formula
$$b_n = b_0 b_{n-1} + b_1 b_{n-2} + \dots + b_{n-1} b_0,$$
and this is actually close to a convolution of two equal sequences. Rearranged so that it holds for all n ∈ N0,
$$b_n = \sum_{0 \le k < n} b_k\, b_{n-1-k} + [n = 0]\,1.$$

(2) Divide phase. Distribute the remaining elements into the list L1 of those smaller than the pivot x = L[0], while putting the other elements into the list L2.
(3) Conquer phase. Combine the lists L = Qsort(L1) + (L[0]) + Qsort(L2) and return the list L.

We analyze how many comparisons are needed. Assume that all possible orderings of the list L to be sorted are distributed uniformly. The following parameters are crucial:
• The number of comparisons in the divide phase is n − 1.
• The assumption of uniformity ensures that the probability of L[0] being the k-th greatest element of the sequence is 1/n.
• The sizes of the sublists to be sorted in the conquer phase are k − 1 and n − k.
There is the following recurrent formula for the expected number of comparisons Cn:
$$\text{(1)} \qquad C_n = n - 1 + \sum_{k=1}^{n} \frac{1}{n}\,\big(C_{k-1} + C_{n-k}\big).$$
One could work through the steps of the general method directly, but the symmetry of the two summands allows us to rewrite (1), multiplying by n at the same time:
$$\text{(2)} \qquad n\,C_n = n(n-1) + 2\sum_{k=1}^{n} C_{k-1}.$$
In the first step, care is needed concerning n = 0. The defining recurrence (1) does not treat n = 0 at all (the equation does not make sense there), so the convention must be extended to include C0 = 0 in the computation. Then the equation (2) defines C1 = 0 properly, and it is not necessary to add any terms in view of the initial conditions. Next, multiply both sides by x^n and sum:
$$\sum_{n\ge 0} n\,C_n x^n = \sum_{n\ge 0} n(n-1)x^n + 2\sum_{n\ge 0}\sum_{k=1}^{n} C_{k-1}x^n.$$
All the terms look familiar. The left hand side is the derivative of the generating function $C(x) = \sum_{n\ge 0} C_n x^n$, with one x removed. The first term on the right is the series of $(1-x)^{-3}$, up to a constant and a shift of powers by 2. Finally, the last term is the convolution with $(1-x)^{-1}$, up to one x and the coefficient 2. Hence the equation
$$x\,C'(x) = \frac{2x^2}{(1-x)^3} + 2\,\frac{x\,C(x)}{1-x}.$$
The third step is straightforward, see 8.3.3. Divide by x to obtain
$$C'(x) = \frac{2}{1-x}\,C(x) + \frac{2x}{(1-x)^3}.$$
The corresponding integrating factor is $e^{-\int\frac{2}{1-x}\,dx} = (1-x)^2$. Hence
$$\big((1-x)^2\,C(x)\big)' = \frac{2x}{1-x},$$
and finally
$$C(x) = 2\left(\frac{1}{(1-x)^2}\,\ln\frac{1}{1-x} - \frac{x}{(1-x)^2}\right).$$
The first term in the bracket corresponds to the convolution of two known sequences, so it contributes to Cn by
$$\sum_{k=1}^{n} \frac{1}{k}\,(n-k+1) = (n+1)\sum_{k=1}^{n}\frac{1}{k} - n = (n+1)H_n - n = (n+1)(H_{n+1} - 1),$$
where Hn are the harmonic numbers. The result is
$$C_n = 2(n+1)(H_{n+1} - 1) - 2n.$$
Notice that in 13.I.10, the very same recurrence is solved by different (more direct and simpler) tricks, without any differential equations involved. Since the harmonic numbers Hn are easily approximated by $\ln n = \int_1^n \frac{1}{x}\,dx$, the analysis shows that the expected time cost of quicksort is O(n log n). But it is easy to see that the worst case time is O(n²) (in this version it happens if the list is already properly ordered – then L1 is always empty and the depth of the recursion is linear).

13.4.10. Exponential generating functions. Another approach to generating functions is to take the exponential $e^x = \sum_{n\ge 0}\frac{1}{n!}x^n$ as the power series corresponding to the constant sequence (1, 1, . . . ). In general, this leads to the exponential generating functions
$$A(x) = \sum_{n\ge 0} a_n\,\frac{x^n}{n!}.$$
Here are a few elementary examples:
$e^x$ ←→ (1, 1, 1, . . . ),
$\frac{1}{1-x}$ ←→ (1, 1, 2, 6, 24, . . . , n!, . . . ),
$\ln\frac{1}{1-x}$ ←→ (0, 1, 1, 2, 6, 24, . . . ).
The slight modification of the definition (just the extra coefficient $\frac{1}{n!}$) is responsible for a very different behaviour, compared to the ordinary generating functions. Indeed, now the elementary operations are:
• Multiplication of A(x) by x yields the sequence with terms ãn = n·a_{n−1}.
• Differentiation of A(x) shifts the sequence to the left.
• Integration of A(x) shifts the sequence to the right.
• The product of functions A(x) and B(x) corresponds to the sequence with terms $h_n = \sum_k \binom{n}{k} a_k b_{n-k}$, the binomial convolution of an and bn.

As before, the exponential generating functions might become useful when resolving recurrences. Here is a simple example. Define the sequence by the initial conditions g0 = 0, g1 = 1 and the formula
$$g_n = -2n\,g_{n-1} + \sum_{k\ge 0}\binom{n}{k}\,g_k\,g_{n-k}.$$
At first glance, seeing the binomial convolution suggests trying the exponential version. Write G(x) for the corresponding power series and proceed in the usual four steps again.

Step 1. Complete the formula to accommodate the initial conditions:
$$g_n = -2n\,g_{n-1} + \sum_{k=0}^{n}\binom{n}{k}\,g_k\,g_{n-k} + [n = 1].$$
There seems to be a subtle point about g0 here, because the equation gives g0 = g0², with the two solutions 0 and 1. The proper choice of g0 now yields the correct value for g1, but the right solution G is chosen only later.

Step 2. Multiply by $\frac{x^n}{n!}$ and sum over all n, to obtain
$$G(x) = -2x\,G(x) + G(x)^2 + x.$$

Step 3. Now solve the easy quadratic equation, arriving at $G(x) = \frac12\big(1 + 2x \pm \sqrt{1 + 4x^2}\big)$. The evaluation at zero provides g0, hence the right choice for g0 = 0 is the minus sign, and
$$G(x) = \frac{1 + 2x - \sqrt{1 + 4x^2}}{2}.$$

Step 4. Apply the generalized binomial theorem to expand G(x) into a power series, see 13.4.8:
$$\sqrt{1 + 4x^2} = 1 + \sum_{k\ge 1}\frac{(-1)^{k-1}}{k}\cdot 2\cdot\binom{2k-2}{k-1}\,x^{2k}.$$
Further, since $G(x) = \sum_{n\ge 0} g_n\frac{x^n}{n!} = \frac{1+2x-\sqrt{1+4x^2}}{2}$, we get $g_{2k+1} = 0$ for k ≥ 1, and
$$g_{2k} = (-1)^k\cdot\frac{1}{k}\binom{2k-2}{k-1}\cdot(2k)! = (-1)^k\,(2k)!\;C_{k-1},$$
where Cn is the n-th Catalan number.
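The claimed values are easy to confirm by running the recurrence directly. Note that the k = 0 and k = n terms of the sum vanish, because g0 = 0. A sketch (using math.comb from Python 3.8+):

```python
from math import comb, factorial

def catalan(n):
    return comb(2*n, n) // (n + 1)

g = [0, 1]                                   # g_0 = 0, g_1 = 1
for n in range(2, 13):
    # the endpoint terms k = 0 and k = n drop out since g_0 = 0
    g.append(-2*n*g[n-1] + sum(comb(n, k)*g[k]*g[n-k] for k in range(1, n)))

# closed form: g_{2k+1} = 0 (k >= 1) and g_{2k} = (-1)^k (2k)! C_{k-1}
ok = all(g[2*k+1] == 0 for k in range(1, 6)) and \
     all(g[2*k] == (-1)**k * factorial(2*k) * catalan(k-1) for k in range(1, 7))
print(ok)   # True
```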
13.4.11. Cayley's formula. We conclude this chapter with a more complicated example. Cayley's formula computes the number of trees (i.e. graphs with unique paths between all pairs of vertices) on n given vertices,
$$\kappa(K_n) = n^{n-2}.$$
The notation refers to the equivalent formulation: find the number of all spanning trees in the complete graph Kn. Equivalently, in how many ways can a tree be realized on n vertices with the vertices labeled? For example, already the path Pn can be realized in n!/2 ways, so there must be very many of them.

This result is proved with the help of exponential generating functions. Write Tn = κ(Kn) for the unknown values. It is easily shown that T1 = T2 = 1, T3 = 3, T4 = 16. For instance, consider trees on 4 vertices. Out of the $\binom{6}{3} = 20$ potential graphs with exactly three edges, those where the edges form a triangle must not be counted, and there are $\binom{4}{3} = 4$ of them. In the diagram, there are four different possibilities, and each of them can be rotated into another three; hence the solution is 16.

A recurrent formula can be obtained by fixing one of the vertices and adding together the possibilities for all available degrees of this vertex. This suggests looking rather at the number Rn of rooted trees. It is clear that Rn = nTn, because there are n possibilities for placing the root in each of the trees. Also, one can work with one fixed ordering of the vertices of Kn and multiply the result by n! in the end. In this way, go through the possible degrees m of the first vertex, and for each m find the different possibilities for the sizes k1, . . . , km of the corresponding subtrees. Obviously k1 + · · · + km = n − 1, all ki > 0, and since the labeling of all vertices is fixed, all the orders of the subtrees must be considered equivalent; so multiply the contribution by $\frac{1}{m!}$, and similarly for each of the possibilities for the subtrees. The recurrent formula is
$$R_n = n!\sum_{m>0}\frac{1}{m!}\sum_{k_1+\dots+k_m = n-1}\frac{1}{k_1!\cdots k_m!}\,R_{k_1}\cdots R_{k_m}.$$
Of course, R0 = 0, R1 = 1, and, already using the formula, R2 = 2R1 = 2. Next, R3 = 3R2 + 3R1² = 9 and R4 = 4R3 + 12R1R2 + 4R1³ = 64, all as expected. The first step of the standard procedure is accomplished.

Next, write $R(x) = \sum_{n\ge 0} R_n\frac{x^n}{n!}$. The inner sum above is the coefficient at $x^{n-1}$ in the m-th power of the series R(x). Therefore,
$$R_n\,\frac{1}{n!} = [x^{n-1}]\sum_{m\ge 0}\frac{1}{m!}\,R(x)^m,$$
and hence we have the required equation for R:
$$R(x) = x\,e^{R(x)}.$$
There are several ways of solving such functional equations. Here is one such tool, stated without proof.

Theorem (Lagrange inverse formula). Consider an analytic function f with f(0) = 0 and f′(0) ≠ 0. Then there is (locally) an analytic inverse of f, i.e. $w = g(z) = \sum_{n\ge 1} g_n\frac{z^n}{n!}$ with z = f(g(z)). Moreover, for all n > 0,
$$g_n = \lim_{w\to 0}\,\frac{d^{n-1}}{dw^{n-1}}\Big(\frac{w}{f(w)}\Big)^{\!n}.$$

In this case, solve the equation $x = R(x)\,e^{-R(x)}$, so that the latter theorem applies with g = R and $f(w) = w\,e^{-w}$. It follows that
$$[x^n]\,R(x) = \frac{1}{n}\,[w^{n-1}]\Big(\frac{w}{w\,e^{-w}}\Big)^{\!n} = \frac{1}{n}\,[w^{n-1}]\,e^{wn} = \frac{1}{n}\cdot\frac{n^{n-1}}{(n-1)!} = \frac{n^{n-1}}{n!}.$$
In particular, $R_n = n^{n-1}$, and so $T_n = \frac{R_n}{n} = n^{n-2}$.
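For small n, Cayley's formula is easy to confirm by brute force, testing all (n − 1)-element edge sets of Kn for connectivity. A sketch with a tiny union-find (all names ours):

```python
from itertools import combinations

def count_labeled_trees(n):
    """Brute force: choose n-1 of the C(n,2) edges and test connectivity."""
    edges = list(combinations(range(n), 2))
    count = 0
    for choice in combinations(edges, n - 1):
        parent = list(range(n))
        def find(v):
            while parent[v] != v:
                v = parent[v]
            return v
        merged = 0
        for u, v in choice:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                merged += 1
        if merged == n - 1:      # n-1 successful unions = spanning tree
            count += 1
    return count

print([count_labeled_trees(n) for n in range(2, 6)])   # [1, 3, 16, 125]
```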
J. Additional exercises to the whole chapter

13.J.1. Determine the number of edges that must be added into i) the cycle graph $C_n$ on $n$ vertices, ii) the complete bipartite graph $K_{m,n}$, in order to obtain a complete graph. ⃝

13.J.2. Let the vertices of $K_6$ be labeled $1, 2, \dots, 6$ and let every edge $\{i,j\}$ be assigned the integer $[(i+j) \bmod 3] + 1$. How many maximum spanning trees are there in this graph? ⃝

13.J.3. Let the vertices of $K_7$ be labeled $1, 2, \dots, 7$ and let every edge $\{i,j\}$ be assigned the integer $[(i+j) \bmod 3] + 1$. How many minimum spanning trees are there in this graph? ⃝

13.J.4. Let the vertices of $K_5$ be labeled $1, 2, \dots, 5$ and let every edge $\{i,j\}$ be assigned the integer: 1 if $i+j$ is odd; 2 if $i+j$ is even. How many maximum spanning trees are there in this graph? ⃝

13.J.5. Let the vertices of $K_5$ be labeled $1, 2, \dots, 5$ and let every edge $\{i,j\}$ be assigned the integer: 1 if $i+j$ is odd; 2 if $i+j$ is even. How many minimum spanning trees are there in this graph? ⃝

13.J.6. Let the vertices of $K_6$ be labeled $1, 2, \dots, 6$ and let every edge $\{i,j\}$ be assigned the integer: 1 if $i+j$ leaves remainder 1 upon division by 3; 2 if $i+j$ leaves remainder 2 upon division by 3; 3 if $i+j$ is divisible by 3. How many minimum spanning trees are there in this graph? ⃝

13.J.7. Let the vertices of $K_6$ be labeled $1, 2, \dots, 6$ and let every edge $\{i,j\}$ be assigned the integer: 1 if $i+j$ leaves remainder 1 upon division by 3; 2 if $i+j$ leaves remainder 2 upon division by 3; 3 if $i+j$ is divisible by 3. How many maximum spanning trees are there in this graph? ⃝

13.J.8. Icosian game – find a Hamiltonian cycle in the graph consisting of the vertices and edges of the regular dodecahedron.
Solution. See Wikipedia, Icosian game, http://en.wikipedia.org/wiki/Icosian_game (as of Aug. 8, 2013, 13:24 GMT). □

13.J.9. Does there exist a Hamiltonian cycle in the Petersen graph?
Solution. No (however, when any one of the vertices is removed, the resulting graph is already Hamiltonian). This can be shown by enumerating all 3-regular Hamiltonian graphs on 10 vertices and finding a cycle of length less than 5 in each of them. □

13.J.10. Show that if $G = (V, E)$ is Hamiltonian and $\emptyset \ne W \subsetneq V$, then $G \setminus W$ has at most $|W|$ connected components. Give an example of a graph where the converse does not hold.
Solution. □

13.J.11. Find a maximum flow and the corresponding minimum cut in the given weighted directed graph (a network on the vertices Z, S, A, B, C, D, E, F; the diagram with the edge capacities is not reproduced here). ⃝

13.J.12. Find a maximum flow and the corresponding minimum cut in the given weighted directed graph (second network on the vertices Z, S, A, B, C, D, E, F; diagram not reproduced). ⃝

13.J.13. Find a maximum flow and the corresponding minimum cut in the given weighted directed graph (third network on the vertices Z, S, A, B, C, D, E, F; diagram not reproduced). ⃝

13.J.14. Find a maximum flow and the corresponding minimum cut in the given weighted directed graph (fourth network on the vertices Z, S, A, B, C, D, E, F; diagram not reproduced). ⃝

13.J.15. Find the generating functions of the following sequences:
i) $(1, 2, 1, 4, 1, 8, 1, 16, \dots)$
ii) $(1, 1, 0, 1, 1, 0, 1, 1, \dots)$
iii) $(1, -1, 2, -2, 3, -3, 4, -4, \dots)$
Solution. i) $(1, 2, 1, 4, 1, 8, 1, 16, \dots) = (1, 0, 1, 0, \dots) + (0, 2, 0, 4, 0, 8, \dots)$. Thus, we find the generating functions for each summand separately. As for the first one, consider the sequence $(1, 1, 1, 1, \dots)$, generated by $\frac{1}{1-x}$; the zeros can be inserted by substituting $x^2$ for $x$. As for the second one, we proceed similarly, starting with $(1, 2, 4, 8, 16, \dots)$, i.e. $\frac{1}{1-2x}$, then multiplying by two, inserting zeros, and finally shifting to the right by multiplying by $x$.
ii) $(1, 1, 0, 1, 1, 0, 1, 1, \dots) = (1, 0, 0, 1, 0, 0, 1, \dots) + (0, 1, 0, 0, 1, 0, 0, 1, \dots)$.
The results are:
i) $\frac{1}{1-x^2} + \frac{2x}{1-2x^2}$
ii) $\frac{1+x}{1-x^3}$
iii) $\frac{1}{(1-x^2)^2} - \frac{x}{(1-x^2)^2} = \frac{1-x}{(1-x^2)^2}$ □
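The spanning-tree counts asked for in 13.J.2–13.J.7 concern only $K_5$–$K_7$, so they can be checked by exhaustive enumeration. A Python sketch of such a check (our own verification code; the expected counts in the comments are those from the key at the end of the chapter):

from itertools import combinations

def spanning_tree_weights(n, weight):
    """Total weights of all spanning trees of K_n with vertices 1..n."""
    vertices = list(range(1, n + 1))
    all_edges = list(combinations(vertices, 2))
    weights = []
    for tree in combinations(all_edges, n - 1):
        parent = {v: v for v in vertices}
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        acyclic = True
        for u, v in tree:
            ru, rv = find(u), find(v)
            if ru == rv:
                acyclic = False  # the chosen edges close a cycle
                break
            parent[ru] = rv
        if acyclic:  # n-1 acyclic edges form a spanning tree
            weights.append(sum(weight(u, v) for u, v in tree))
    return weights

w6 = spanning_tree_weights(6, lambda i, j: (i + j) % 3 + 1)
print(w6.count(max(w6)))  # 13.J.2: 16 maximum spanning trees
w7 = spanning_tree_weights(7, lambda i, j: (i + j) % 3 + 1)
print(w7.count(min(w7)))  # 13.J.3: 72 minimum spanning trees
# 13.J.4-13.J.7 can be checked analogously with the other edge-weight rules.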
13.J.16. Find the coefficient at $x^{17}$ in $(x^3 + x^4 + x^5 + \cdots)^3$.
Solution. $(x^3 + x^4 + x^5 + \cdots)^3 = \frac{x^9}{(1-x)^3} = x^9 \cdot \frac{1}{(1-x)^3}$. We are thus looking for the coefficient at $x^8$ in $\frac{1}{(1-x)^3}$. This is equal to $\binom{10}{2}$, i.e. 45. □

13.J.17. There are 30 red, 40 blue, and 50 white balls in a box (balls of the same color are indistinguishable). In how many ways can we pick up 70 balls from the box?
Solution. Clearly, the number of possibilities is equal to the coefficient at $x^{70}$ in the expression $(1 + x + \cdots + x^{30})(1 + x + \cdots + x^{40})(1 + x + \cdots + x^{50})$. Mere rearrangements lead to
\[
(1 + x + \cdots + x^{30})(1 + x + \cdots + x^{40})(1 + x + \cdots + x^{50}) = \frac{1}{(1-x)^3}\,(1-x^{31})(1-x^{41})(1-x^{51}).
\]
Applying the generalized binomial theorem, we obtain the solution $\binom{72}{2} - \binom{41}{2} - \binom{31}{2} - \binom{21}{2}$. □

13.J.18. What is the probability that a roll of 12 dice results in the sum of 30? Hint: Express the number of possibilities when the sum is 30. Consider $(x + x^2 + x^3 + x^4 + x^5 + x^6)^{12}$. ⃝

13.J.19. A fruit grower wants to plant 25 new trees, having four species at his disposal. However, his wife insists that there be at most 1 walnut, at most 10 apples, at least 6 cherries, and at least 8 plums. In how many ways can he fulfill his beloved’s wishes? Hint: We are interested in the coefficient at $x^{25}$ in the expression $(1+x)(1 + x + \cdots + x^{10})(x^6 + x^7 + \cdots)(x^8 + x^9 + \cdots)$.
Solution.
\[
(1+x)(1 + x + \cdots + x^{10})(x^6 + x^7 + \cdots)(x^8 + x^9 + \cdots) = \frac{x^{14}\,(1-x^2)(1-x^{11})}{(1-x)^4}.
\]
Therefore, we are looking for the coefficient at $x^{11}$ in $(1 - x^2 - x^{11} + x^{13})\cdot\frac{1}{(1-x)^4}$, which is equal to $\binom{14}{3} - \binom{12}{3} - \binom{3}{3}$. □

13.J.20. Express the general term of the sequences defined by the following recurrences:
i) $a_1 = 3$, $a_2 = 5$, $a_{n+2} = 4a_{n+1} - 3a_n$ for $n = 1, 2, 3, \dots$
ii) $a_0 = 0$, $a_1 = 1$, $a_{n+2} = 2a_{n+1} - 4a_n$ for $n = 0, 1, 2, 3, \dots$
Solution. i) $a_n = 2 + 3^{n-1}$. ii) $a_n = \frac{1}{2\sqrt{-3}}\bigl((1+\sqrt{-3})^n - (1-\sqrt{-3})^n\bigr)$. □

13.J.21. Solve the recurrence where each term of the sequence $(a_0, a_1, a_2, \dots)$ is equal to the arithmetic mean of the preceding two terms. ⃝

13.J.22. Solve the recurrence $a_{n+2} = \sqrt{a_{n+1} a_n}$ with the initial conditions $a_0 = 2$, $a_1 = 8$. Hint: Create a new sequence $b_n = \log_2 a_n$. ⃝

13.J.23. Solve the recurrence given by $a_n = \sum_{k\ge 0} \binom{n}{k} \frac{a_k}{2^k}$, $a_0 = 1$. Hint: Multiply both sides by $\frac{x^n}{n!}$ and sum it up. Note that $A(x)$ is the exponential generating function for the sequence $(a_n)$. ⃝

13.J.24. Find the number of triangulations of a convex $n$-gon. Hint: Select any diagonal that goes through a fixed vertex; this splits the polygon in two.
Solution. $t_n = C_{n-2}$, where $C_n$ denotes the $n$-th Catalan number. □

13.J.25. Find the number of walks in a square grid of size $n \times n$ from the lower left-hand corner $A$ to the upper right-hand corner $B$ which go only upwards or rightwards and intersect the diagonal $AB$ at exactly one point (besides $A$ and $B$). Hint: Catalan numbers. ⃝

13.J.26. Prove that the Fibonacci numbers satisfy:
i) $F_2 + F_4 + \cdots + F_{2n} = F_{2n+1} - 1$
ii) $F_1 + F_3 + \cdots + F_{2n-1} = F_{2n}$ ⃝

13.J.27. Recall the well-known puzzle Tower of Hanoi and let $H_n$ denote the minimum number of steps necessary to move a tower consisting of $n$ disks from one rod to another one. Find a recurrent formula for $H_n$ as well as its general solution.
Solution. $H_{n+1} = 2H_n + 1$, $H_n = 2^n - 1$. □
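The coefficient extractions in 13.J.16–13.J.18 can be double-checked by multiplying out truncated polynomials. A short Python sketch of such a check (our own code; poly_mul is a hypothetical helper, not from the text):

def poly_mul(p, q, deg):
    """Product of coefficient lists p and q, truncated beyond degree deg."""
    r = [0] * (deg + 1)
    for i, a in enumerate(p):
        if a:
            for j, b in enumerate(q):
                if i + j <= deg:
                    r[i + j] += a * b
    return r

# 13.J.16: coefficient at x^17 in (x^3 + x^4 + ...)^3 (truncating at degree 17 suffices)
factor = [0, 0, 0] + [1] * 15         # x^3 + x^4 + ... + x^17
p = [1]
for _ in range(3):
    p = poly_mul(p, factor, 17)
print(p[17])                           # 45

# 13.J.17: coefficient at x^70 in (1+...+x^30)(1+...+x^40)(1+...+x^50)
q = poly_mul(poly_mul([1] * 31, [1] * 41, 70), [1] * 51, 70)
print(q[70])                           # = C(72,2) - C(41,2) - C(31,2) - C(21,2) = 1061

# 13.J.18: number of rolls of 12 dice summing to 30
die = [0] + [1] * 6                    # x + x^2 + ... + x^6
d = [1]
for _ in range(12):
    d = poly_mul(d, die, 30)
print(d[30], 6 ** 12)                  # favorable cases and all cases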
At the very end of the book, we present one problem from practice.

13.J.28. A volleyball team (with a libero, i.e. 7 people) is sitting in a pub, drinking their favorite and well-deserved beer. However, there are only 7 beer mugs available. What is the probability that in the next round,
i) exactly one volleyball player is not given the same mug as the last time,
ii) no one is given the same mug as the last time,
iii) exactly three players are given the same mug as the last time?
Solution. i) If six of the seven people are given the same mug, then so must be the last one. Therefore, the probability is zero.
ii) Let $M$ be the set of all assignments of the 7 mugs to the 7 players and let $A_i$ be the event that the $i$-th player is given his mug. We want to calculate $|M \setminus \bigcup_i A_i|$. By inclusion–exclusion, we get $7! \sum_{k=0}^{7} \frac{(-1)^k}{k!} = 1854$, so the probability is $\frac{1854}{5040} = \frac{103}{280} \doteq 0.37$.
iii) There are $\binom{7}{3} = 35$ ways to select the three people who are to get the same mug. The remaining four people must all be given different mugs than the last time. Again, we can apply the formula from above, i.e., there are $4! \sum_{k=0}^{4} \frac{(-1)^k}{k!} = 9$ possibilities. Altogether, there are $9 \cdot 35 = 315$ favorable cases, so the probability is $\frac{315}{5040} = \frac{1}{16}$. □
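The inclusion–exclusion counts above are small enough to confirm by brute force over all $7! = 5040$ mug assignments; a Python sketch (our own check, not part of the text):

from itertools import permutations
from math import factorial

# Distribution of the number of fixed points among all 7! mug assignments.
fixed = [0] * 8
for p in permutations(range(7)):
    fixed[sum(1 for i in range(7) if p[i] == i)] += 1

total = factorial(7)                  # 5040
print(fixed[6] / total)               # i) exactly six players keep their mug: 0.0
print(fixed[0], fixed[0] / total)     # ii) derangements: 1854, about 0.37
print(fixed[3] / total)               # iii) exactly three keep their mug: 0.0625 = 1/16

# Inclusion-exclusion closed form for the derangement number D_7
assert fixed[0] == sum((-1) ** k * factorial(7) // factorial(k) for k in range(8))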
Key to the exercises

13.A.23. The cut vertices are 0, 1, 9, 10; the cut edges are (0, 1), (0, 12), (9, 10).

13.B.3. (3, 1), (3, 2), (3, 4), (3, 5), (3, 6), (6, 1), (6, 2), (6, 4), (6, 5), (5, 1), (5, 2), (5, 4), (4, 1), (4, 2), (2, 1).

13.B.4. (3, 1), (3, 2), (3, 4), (3, 5), (3, 6), (1, 2), (1, 4), (1, 5), (1, 6), (2, 4), (2, 5), ..., (5, 6).

13.B.5. Solution is missing.

13.B.12. It can be shown using the Havel–Hakimi procedure that such a graph indeed exists. However, it cannot be planar: $|V| = 10$, $|E| = 35$, but if it were planar, we would have $3|V| - 6 \ge |E|$, i.e., $24 \ge 35$.

13.B.15. i) Yes. This follows immediately from the Kuratowski theorem ($K_5$ has 10 edges and $K_{3,3}$ has 9). ii) No. Consider $K_5$ or $K_{3,3}$. iii) No. There are many counterexamples, for instance $K_{3,3}$ with another vertex and an edge leading to it. iv) No. Consider $K_5$. v) No. Consider $K_{3,3}$. vi) The same as (ii). vii) No. Consider $C_n$. viii) No. Consider $K_5$. ix) No. Consider $C_n$.

13.B.17. The first code does not represent a tree (it has a proper prefix with the same number of zeros and ones). There is a tree corresponding to the second code.

13.C.4. The procedure is incorrect. As a counterexample, consider a cycle graph with one edge of length two and all other edges of length one.

13.C.5. Applying any algorithm for finding a minimum spanning tree, we find out that the wanted length is 12154 (the spanning tree consists of the edges LPe, LP, LNY, PeT, MCNY).

13.C.6. Solution is missing.

13.D.5. We find a maximum flow of size 15 and the cut [1, 6], [1, 3], [2, 4], [2, 3] of the same capacity.

13.D.7. We know from the theory and the result of the above exercise that the minimum capacity of a cut is 9. There are more maximum flows in the network. For instance, we can set $f(a) = 2$, $f(b) = 4$, $f(c) = 1$, $f(h) = 1$, $f(j) = 4$, $f(f) = 2$, $f(i) = 7$, and $f(v) = 0$ for all other edges $v$ of the graph.

13.E.5. Solution is missing.

13.E.11. $1 - \frac{4! \cdot 4! \cdot 2^4}{8!} = \frac{27}{35}$.

13.E.12. $\frac{49}{54}$.

13.I.4. i) We know from the exercise of subsection 13.4.3 that the generating function of the sequence $(1, 2, 3, 4, \dots)$ is $\frac{1}{(1-x)^2}$.
ii) Since we have (by the previous exercise as well)
\[
\frac{x}{(1-x)^2} \;\overset{\text{o.g.f.}}{\longleftrightarrow}\; (0, 1, 2, 3, \dots),
\]
we have for the derivative of this function
\[
\left(\frac{x}{(1-x)^2}\right)' = \frac{1+x}{(1-x)^3} \;\overset{\text{o.g.f.}}{\longleftrightarrow}\; (1\cdot 1,\; 2\cdot 2,\; 3\cdot 3,\; \dots).
\]
Let us emphasize that this problem could also be solved using the fact that $\frac{1}{(1-x)^3} \overset{\text{o.g.f.}}{\longleftrightarrow} \binom{n+2}{n}$.
iii) We have
\[
\frac{1}{1-x} \leftrightarrow (1, 1, 1, 1, \dots), \quad
\frac{1}{1-2x} \leftrightarrow (1, 2, 4, 8, \dots), \quad
\frac{1}{1-2x^2} \leftrightarrow (1, 0, 2, 0, 4, 0, \dots), \quad
\frac{x}{1-2x^2} \leftrightarrow (0, 1, 0, 2, 0, 4, \dots),
\]
whence we get the result $\frac{1+x}{1-2x^2} \leftrightarrow (1, 1, 2, 2, 4, 4, 8, 8, \dots)$.
iv) We know from the above that $f(x) = \frac{1+x}{(1-x)^3} \leftrightarrow (1^2, 2^2, 3^2, \dots)$, hence $\frac{f(x) - (1+4x)}{x^2} \leftrightarrow (3^2, 4^2, 5^2, \dots)$. Substituting $2x^3$ for $x$, we obtain
\[
\frac{f(2x^3) - (1+8x^3)}{4x^6} \leftrightarrow (9, 0, 0, 2\cdot 16, 0, 0, 4\cdot 25, \dots).
\]
v) If we denote the result of the previous problem by $F(x)$, then the result of this one is $F(x) - x^2 F(x) + \frac{x}{1-x^3}$.

13.I.11. $\frac{x}{1 - 3x + x^2}$.

13.J.1. i) The complete graph on $n$ vertices has $\frac{n(n-1)}{2}$ edges, the cycle graph on $n$ vertices has $n$ edges. Therefore, $\frac{n(n-1)}{2} - n$ edges must be added to the cycle graph. ii) Similarly as above, we get the result $\frac{(m+n)(m+n-1)}{2} - m\cdot n$.

13.J.2. There are five edges whose value is 3: four of them lie on the cycle 23562 and the remaining one is the edge 14. Therefore, they form a disconnected subgraph of the complete graph, so the spanning tree must contain at least one edge of value 2. Thus, the total weight of a maximum spanning tree is at most $4 \cdot 3 + 2 = 14$. And indeed, there exist spanning trees with this weight. We select all the edges of value 3 except for one that lies on the mentioned cycle and connect the resulting components 2356 and 14 with any edge of value 2. There are four such edges. Altogether, there are $4 \cdot 4 = 16$ maximum spanning trees.

13.J.3. The edges of value 1 form a subgraph with two connected components, namely $\{1, 2, 4, 5, 7\}$ and $\{3, 6\}$. Further, there are six edges of value 2 that lead between these two components. Therefore, the total weight of a minimum spanning tree is $5 \cdot 1 + 2 = 7$ (five edges of value 1 and one connecting edge of value 2). Moreover, there are exactly three cycles in the former component, each of length 4, and each of the 6 edges of this component belongs to exactly two of the three cycles. In order to obtain a tree from this component, we must omit two edges, which can be done in $6 \cdot 4 / 2 = 12$ ways. Altogether, we get $12 \cdot 6 = 72$ minimum spanning trees.

13.J.4. 18. 13.J.5. 12. 13.J.6. 16. 13.J.7. 16.

13.J.11. The minimum cut is given by the set $\{Z, A, E\}$. Its value is 32.
13.J.12. The minimum cut is given by the set $\{B, D, S\}$. Its value is 40.
13.J.13. The minimum cut is given by the set $\{F, S, D\}$. Its value is 29.
13.J.14. The minimum cut is given by the set $\{F, S\}$. Its value is 39.

13.J.18. The resulting probability is the ratio of the number of favorable cases to the number of all cases. Clearly, the latter is $6^{12}$. Now, let us compute the number of favorable cases, i.e., the coefficient at $x^{30}$ in $(x + x^2 + x^3 + x^4 + x^5 + x^6)^{12}$. We have
\[
(x + x^2 + \cdots + x^6)^{12} = \left(\frac{x(1-x^6)}{1-x}\right)^{\!12} = x^{12}\left(\frac{1-x^6}{1-x}\right)^{\!12}.
\]
Therefore, we are interested in the coefficient at $x^{18}$ in
\[
\left(\frac{1-x^6}{1-x}\right)^{\!12} = \bigl(1 - 12x^6 + 66x^{12} - 220x^{18} + \cdots\bigr)\cdot\frac{1}{(1-x)^{12}}.
\]
It follows from the generalized binomial theorem that the number of favorable cases is
\[
\binom{29}{11} - 12\binom{23}{11} + 66\binom{17}{11} - 220\binom{11}{11}.
\]

13.J.21. $a_n = k\left(-\frac{1}{2}\right)^n + l$.

13.J.22. Solution is missing.
13.J.23. Solution is missing.
13.J.25. Solution is missing.
13.J.26. Solution is missing.
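The closed forms in 13.J.20 and in the key to 13.J.21 can be confirmed numerically. A brief Python sketch (our own check; the constants k = -2/3, l = 2/3 below are our sample choice fitting the initial values a_0 = 0, a_1 = 1, not from the text):

import cmath

# 13.J.20 i): a_1 = 3, a_2 = 5, a_{n+2} = 4 a_{n+1} - 3 a_n; claim a_n = 2 + 3^(n-1)
a = [3, 5]
for _ in range(20):
    a.append(4 * a[-1] - 3 * a[-2])
assert all(a[n - 1] == 2 + 3 ** (n - 1) for n in range(1, len(a) + 1))

# 13.J.20 ii): a_0 = 0, a_1 = 1, a_{n+2} = 2 a_{n+1} - 4 a_n (complex characteristic roots)
r, s = 1 + cmath.sqrt(-3), 1 - cmath.sqrt(-3)
b = [0, 1]
for _ in range(20):
    b.append(2 * b[-1] - 4 * b[-2])
for n, bn in enumerate(b):
    closed = (r ** n - s ** n) / (2 * cmath.sqrt(-3))
    assert abs(bn - closed) < 1e-6 * max(1.0, abs(closed))

# 13.J.21: a_{n+2} = (a_{n+1} + a_n)/2; general solution k(-1/2)^n + l
k, l = -2 / 3, 2 / 3     # fits a_0 = 0, a_1 = 1 (our sample choice)
c = [0.0, 1.0]
for _ in range(20):
    c.append((c[-1] + c[-2]) / 2)
assert all(abs(c[n] - (k * (-0.5) ** n + l)) < 1e-9 for n in range(len(c)))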
Based on the earlier textbook
Matematika drsně a svižně
Jan Slovák, Martin Panák, Michal Bulant and collaborators
published by Masarykova univerzita in 2013

1st edition, 2013
500 copies

Typography, LaTeX and more: Tomáš Janoušek
Print: Tiskárna Knopp, Černčice 24, 549 01 Nové Město nad Metují