Brisk guide to Mathematics

Jan Slovák and Michal Bulant, Ioannis Chrysikos, Martin Panák
with help of Ray Booth, Vladimir Ejov, Radek Suchánek, Vojtěch Žádník, ...

Brno, 2024

Work on this text was financially supported by the project NPO MUNI 3.2.1 DataAnalytics NPO-MUNI-MSMT-16606/2022

Authors: Michal Bulant, Ioannis Chrysikos, Martin Panák, Jan Slovák
With further help of: Ray Booth, Vladimir Ejov, Radek Suchánek, Vojtěch Žádník
Graphics and illustrations: Petra Rychlá

2023 Masaryk University

Contents – theory

Chapter 1. Initial warmup 4
  1. Numbers and functions 4
  2. Difference equations 10
  3. Combinatorics 14
  4. Probability 18
  5. Plane geometry 26
  6. Relations and mappings 41
Chapter 2. Elementary linear algebra 83
  1. Vectors and matrices 83
  2. Determinants 96
  3. Vector spaces and linear mappings 107
  4. Properties of linear mappings 128
Chapter 3. Linear models and matrix calculus 193
  1. Linear optimization 193
  2. Difference equations 202
  3. Iterated linear processes 211
  4. More matrix calculus 221
  5. Decompositions of the matrices and pseudoinversions 244
Chapter 4. Analytic geometry 319
  1. Affine and Euclidean geometry 319
  2. Transformations 338
  3. Geometry of quadratic forms and quadrics 342
  4. Projective geometry 349
Chapter 5. Establishing the ZOO 368
  1. Polynomial interpolation 368
  2. Real numbers and limit processes 379
  3. Derivatives 403
  4. Infinite sums and power series 417
Chapter 6. Differential and integral calculus 508
  1. Differentiation 508
  2. Integration 529
  3. Sequences, series and limit processes 557
Chapter 7. Continuous tools for modelling 631
  1. Fourier series 631
  2. Integral operators and Fourier transform 654
  3. Metric spaces 666
Chapter 8. Calculus with more variables 709
  1. Functions and mappings on Rn 709
  2. Integration for the second time 747
  3. Differential equations 761
Chapter 9. Continuous models – further selected topics 802
  1. Exterior differential calculus and integration 802
  2. Remarks on Partial Differential Equations 827
  3. Remarks on Variational Calculus 858
  4. Complex Analytic Functions 873
Chapter 10. Statistics and probability theory 901
  1. Descriptive statistics 901
  2. Probability 913
  3. Mathematical statistics 959
Chapter 11. Elementary number theory 978
  1. Fundamental concepts 978
  2. Primes 982
  3. Congruences and basic theorems 987
  4. Solving congruences and systems of them 1000
  5. Diophantine equations 1016
  6. Applications – calculation with large integers, cryptography 1020
Chapter 12. Algebraic structures 1040
  1. Posets and Boolean algebras 1040
  2. Elements of Logic 1055
  3. Polynomial rings 1073
  4. Groups, rings, and fields 1086
  5. Coding theory 1110
  6. Systems of polynomial equations 1117
Chapter 13. Combinatorial methods, graphs, and algorithms 1140
  1. Elements of Graph theory 1140
  2. A few graph algorithms 1166
  3. Remarks on Computational Geometry 1188
  4. Remarks on more advanced combinatorial calculations 1206

Contents – practice

Chapter 1. Initial warmup 4
  A. Numbers and functions 4
  B. Difference equations 12
  C. Combinatorics 16
  D. Probability 20
  E. Plane geometry 28
  F. Relations and mappings 43
  G. Additional exercises for the whole chapter 47
Chapter 2. Elementary linear algebra 83
  A. Vectors and matrices 83
  B. Determinants 100
  C. Vector spaces and linear mappings 107
  D. Properties of linear mappings 133
  E. Additional exercises for the whole chapter 142
Chapter 3. Linear models and matrix calculus 193
  A. Linear optimization 193
  B. Recurrence relations 204
  C. Models of growth and iterated processes 210
  D. More matrix calculus 224
  E. Matrix decompositions 248
  F. Additional exercises for the whole chapter 262
Chapter 4. Analytic geometry 319
  A. Affine geometry 319
  B. Euclidean geometry 327
  C. Geometry of quadratic forms 341
  D. Further exercises on this chapter 361
Chapter 5. Establishing the ZOO 368
  A. Polynomial interpolation 368
  B. Real numbers and limit processes 380
  C. Derivatives 407
  D. Infinite sums and power series 428
  E. Additional exercises for the whole chapter 444
Chapter 6. Differential and integral calculus 508
  A. Derivatives of higher orders 508
  B. Integration 534
  C. Sequences, series and limit processes 566
  D. Additional exercises for the whole chapter 577
Chapter 7. Continuous tools for modelling 631
  A. Fourier series 631
  B. Integral operators and Fourier Transform 654
  C. Metric spaces 667
  D. Additional exercises to the whole chapter 686
Chapter 8. Calculus with more variables 709
  A. Multivariate functions 709
  B. The topology of En 711
  C. Limits and continuity of multivariate functions 713
  D. Tangent lines, tangent planes, graphs of multivariate functions 715
  E. Taylor polynomials 722
  F. Extrema of multivariate functions 723
  G. Implicitly given functions and mappings 728
  H. Constrained optimization 730
  I. Volumes, areas, centroids of solids 742
  J. First-order differential equations 759
  K. Practical problems leading to differential equations 768
  L. Higher-order differential equations 770
  M. Applications of the Laplace transform 777
  N. Numerical solution of differential equations 780
  O. Additional exercises to the whole chapter 795
Chapter 9. Continuous models – further selected topics 802
  A. Exterior differential calculus 802
  B. Applications of Stokes’ theorem 802
  C. Equation of heat conduction 808
  D. Partial Differential Equations 809
  E. Variational Problems 830
  F. Complex analytic functions 830
  G. Additional exercises to the whole chapter 897
Chapter 10. Statistics and probability methods 901
  A. Dots, lines, rectangles 901
  B. Visualization of multidimensional data 910
  C. Classical and conditional probability 912
  D. What is probability? 920
  E. Random variables, density, distribution function 922
  F. Expected value, correlation 932
  G. Transformations of random variables 937
  H. Inequalities and limit theorems 939
  I. Testing samples from the normal distribution 944
  J. Linear regression 953
  K. Bayesian data analysis 955
  L. Processing of multidimensional data 960
Chapter 11. Number theory 978
  A. Basic properties of divisibility 978
  B. Prime numbers 981
  C. Congruences 985
  D. Solving congruences 997
  E. Diophantine equations 1012
  F. Primality tests 1015
  G. Encryption 1019
  H. Additional exercises to the whole chapter 1038
Chapter 12. Algebraic structures 1040
  A. Boolean algebras and lattices 1040
  B. Rings 1052
  C. Polynomial rings 1053
  D. Rings of multivariate polynomials 1058
  E. Algebraic structures 1063
  F. Groups 1065
  G. Burnside’s lemma 1084
  H. Codes 1087
  I. Extension of the stereographic projection 1094
  J. Elliptic curves 1095
  K. Gröbner bases 1098
Chapter 13. Combinatorial methods, graphs, and algorithms 1140
  A. Fundamental concepts 1140
  B. Fundamental algorithms 1149
  C. Minimum spanning tree 1158
  D. Flow networks 1160
  E. Classical probability and combinatorics 1164
  F. More advanced problems from combinatorics 1169
  G. Probability in combinatorics 1171
  H. Combinatorial games 1178
  I. Generating functions 1180
  J. Additional exercises to the whole chapter 1220

Index 1228

Preface

This textbook is a follow-up to the Czech course material “Matematika drsně a svižně”, reflecting many years of lecturing Mathematics at the Faculty of Informatics of Masaryk University in Brno. The programme there required an introduction to genuine mathematical thinking and precision, but within a quite limited time-frame for lectures. This endeavor has been undertaken by Jan Slovák and Martin Panák since 2004, with further collaborators joining later. Our goal was to cover seriously, but quickly, about as much of the mathematical methods as is usually seen in bigger courses in the classical Science and Technology programmes. At the same time, we did not want to give up the completeness and correctness of the mathematical exposition. We wanted to introduce and explain the more demanding parts of Mathematics together with elementary explicit examples of how to use the concepts and results in practice. But we did not want to decide how much theory or practice the reader should enjoy, nor in which order.

Now, we have tried to accommodate all the above features in one book, providing a more or less complete account of the basics of Mathematics, as taught in typical Mathematics Bachelor programmes, but in a way relevant to the coming era of computational power and artificial intelligence resources, including GPT and other widely available chatbots. Thus, we chose the two-column format of the textbook, where the theoretical explanation on one side and the practical procedures and exercises on the other side are split. Moreover, we suppose the practical learning will be heavily supported by computer aided mathematics tools; we chose mainly Sage for our illustrations. This way, we want to encourage and help the readers to find their own way: either to go through the examples and algorithms first (perhaps with the help of Sage or GPT-like chatbots), and then to come to more serious thinking on why the things work, or the other way round. We also hope to overcome the usual stress of readers horrified by the amount of the material. With our text, they are not supposed to read through the book in a linear order. On the contrary, the readers should enjoy browsing through the text and finding their own thrilling paths through the new mathematical landscapes.

In both columns, we intend to present a rather standard exposition of basic Mathematics, but focusing on the essence of the concepts and their relations. The exercises address simple mathematical problems, but we also try to show the exploitation of mathematical models in practice as much as possible.

We are aware that the text is written in a very compact and non-homogeneous way. A lot of details are left to the readers, in particular in the more difficult paragraphs, while we try to provide a lot of simple intuitive explanation when introducing new concepts or formulating important theorems. Similarly, the examples display a variety from very simple ones to those requiring independent thinking. We would very much like to help the reader

• to formulate precise definitions of basic concepts and to prove simple mathematical results;
• to perceive the meaning of roughly formulated properties, relations and outlooks for exploring mathematical tools;
• to understand the instructions and algorithms underlying mathematical models and to appreciate their usage.

These goals are ambitious and there are no simple paths reaching them without failures on the way.
This is one of the reasons why we come back to basic ideas and concepts several times, with growing complexity and width of the discussions. Of course, this might also look chaotic, but we very much hope that this approach gives a better chance to those who will persist in their efforts. We also hope this textbook should be a perfect beginning and help for everybody who is ready to think and who is ready to return to earlier parts again and again.

To make the task simpler and more enjoyable, we have added what we call "emotive icons". We hope they will enliven the dry mathematical text and indicate which parts should be read more carefully, or better left out in the first round. They should work as a sort of switch: when losing ground, the reader is advised to find the next appealing icon and go on. The usage of the icons follows the feelings of the authors, and we tried to use them in a systematic way. We hope the readers will assign the meaning to the icons individually. Roughly speaking, there are icons indicating complexity, difficulty, etc. Further icons indicate unpleasant technicality and the need for patience, or possible entertainment and pleasure. Some icons are related to feelings when solving problems and appear mainly in the practical column.

The practical column with the solved problems and exercises should be readable nearly independently of the theory. Without the ambition to know the deeper reasons why the algorithms work, it should be possible to read mainly just this column. In order to help such readers, some definitions and descriptions in the theoretical text are marked in order to catch the eye easily when reading the exercises. The exercises and theory are partly coordinated to allow jumping back and forth, but the links are not tight. The numbering in the two columns is distinguished by using different numberings of sections, i.e., those like 1.2.1 belong to the theoretical column, while 1.A.14 points to the practical column. The equations are numbered within subsections and their quotes include the subsection numbers if necessary.

In general, our approach stresses the fact that the methods of so-called discrete Mathematics seem to be more important for mathematical models nowadays. They also seem simpler to perceive and grasp, supported by the computational tools. However, the continuous methods are strictly necessary too. First of all, the classical continuous mathematical analysis is essential for understanding the convergence and robustness of computations. It is hard to imagine how to deal with error estimates and computational complexity of numerical processes without it. Moreover, the continuous models are often the efficient and effectively computable approximations to discrete problems coming from practice.

As usual with textbooks, there are numerous figures completing the exposition. We very much advise the readers to draw their own pictures whenever necessary, in particular in the later chapters, where we provide only a few.

The rough structure of the book and the dependencies between its chapters are depicted in the diagram below. The darker the color is, the more demanding is the particular chapter (or at least its essential parts). In particular, chapters 7 and 9 include a lot of material which would perhaps not be covered in regular course activities or required at exams in great detail. The solid arrows mean strong dependencies, while the dashed links indicate only partial dependencies.
In particular, the textbook could support courses starting with any of the white boxes, i.e. aiming at standard linear algebra and geometry (chapters 2 through 4), discrete chapters of mathematics (11 through 13), and the rudiments of Calculus (5, 6, 8). All topics covered in the book are now included (with more or less detail) in teaching within our Mathematics Minor programme, complemented by numerical seminars. In this block of four courses, the first semester covers chapters 1 and 2 and selected topics from chapters 3 and 4. The second semester essentially includes chapters 5, 6, and 7. The third semester is now split into two parts: the first one is covered by chapter 8 (with only a few glimpses towards the more advanced topics from chapter 9), while the rest of the semester is devoted to the rudiments of graph theory in chapter 13. The last semester provides large parts of chapters 11 through 13.

This textbook project got a new impulse when developing the main background material for a new professional programme Data Analytics, a cross-disciplinary endeavor aiming at the formal education of top-tier young programmers in the era of GPT and further AI resources. There, the two elementary maths courses in the first semester are covered by Chapters 1-7, while in the second semester, another two courses are based on Chapters 8-9, and 13. Finally, one course in the third semester is covered by Chapters 11-12. Two more courses are devoted to Probability and Statistics, and they already go much beyond Chapter 10 here.

CHAPTER 1

Initial warmup

“value, difference, position” – what is it and how to comprehend it?

The goal of this first chapter is to introduce the reader to the fascinating world of mathematical thinking. The name of this chapter can also be understood as an encouragement for patience. Even the simplest tasks and ideas are easy only for those who have already seen similar ones.

We start with the simplest thing: numbers. They will also serve as the first example of how mathematical objects and theories are built. The entire first chapter will be a quick tour through various mathematical landscapes (including germs of analysis, combinatorics, probability, geometry, and the language of Mathematics itself). Perhaps sometimes our definitions and ideas will look too complicated and not practical enough. The simpler the objects and tasks are, the more difficult the mastering of the depth and all the nuances of the relevant tools and procedures might be. We shall come back to all of the notions again and again in the further chapters, and hopefully this will be the crucial step towards the ultimate understanding. Thus the advice: do not worry if you find some particular part of the exposition too formal or otherwise difficult – come back later for another look.

1. Numbers and functions

Since the dawn of time, people have wanted to know “how much” of something they have, “how much” something is worth, “how long” a particular task will take, etc. The answer to such questions is usually some kind of “number”. We consider something to be a number if it behaves according to the usual rules – either according to all the rules we accept, or maybe only to some of them. For instance, the result of multiplication does not depend on the order of the multiplicands. We have the number zero whose addition to another number does not change the result. We have the number one whose product with another number does not change the result. And so on. The simplest example of numbers are the positive integers, which we denote $Z^+ = \{1, 2, 3, \dots\}$.
The natural numbers consist of either just the positive integers, or the positive integers together with the number zero. The number zero is either considered to be a natural number, as is usual in computer science, or not a natural number, as is usual in some other contexts.

A. Numbers and functions

People mostly think that “Mathematics” equals “counting”. Indeed, we start the introduction with number systems including integers, rationals, reals, and complex numbers, denoted by Z, Q, R and C, respectively. Counting is based on functions, and numbers are their arguments and values. We approach these elementary and well known concepts now in order to invoke a more realistic view of Mathematics as a way of thinking. The reader is always advised to keep an eye on the right-hand column if the concepts and methods get unclear. In the next pages, we also use basic properties of numbers which we discuss carefully only in Chapter 12 (e.g., the unique decomposition of integers into products of primes, etc.).

As soon as mathematical tasks approach counting, they can easily be done in computer aided mathematics software in a few lines of code. Therefore, our exposition will often include Sage cells, but we avoid presenting preliminaries on programming with Sage. Our experience indicates that the Sage (Python based) interface is user-friendly, and even unfamiliar readers should quickly feel comfortable with basic coding in Sage.

1.A.1. Integers are discrete, in the sense that for any m ∈ Z there is no integer between m and m + 1. Show by means of an example that this fails for every pair of distinct rational numbers.

Thus the set of natural numbers is either $Z^+$, or the set $N = \{0, 1, 2, 3, \dots\}$. To count “one, two, three, ...” is learned already by children in their pre-school age. Later on, we meet all the integers $Z = \{\dots, -2, -1, 0, 1, 2, \dots\}$ and finally we get used to floating-point numbers. We know what a 1.19-multiple of the price means if we have a 19% tax.

1.1.1. Properties of numbers. In order to be able to work properly with numbers, we need to be careful with their definition and properties. In mathematics, the basic statements about properties of objects, whose validity is assumed without the need to prove them, are called axioms. We list the basic properties of the operations of addition and multiplication for our calculations with numbers, which we denote by letters a, b, c, . . . . Both operations work by taking two numbers a, b. By applying addition or multiplication we obtain the resulting values a + b and a · b.

Properties of numbers

Properties of addition:
(CG1) (a + b) + c = a + (b + c), for all a, b, c;
(CG2) a + b = b + a, for all a, b;
(CG3) there exists 0 such that for all a, a + 0 = a;
(CG4) for all a there exists b such that a + b = 0.

The properties (CG1)-(CG4) are called the properties of a commutative group. They are called respectively associativity, commutativity, the existence of a neutral element (when speaking of addition we usually say zero element), and the existence of an inverse element (when speaking of addition we also say the negative of a and denote it by −a).
Properties of multiplication:
(R1) (a · b) · c = a · (b · c), for all a, b, c;
(R2) a · b = b · a, for all a, b;
(R3) there exists 1 such that for all a, 1 · a = a;
(R4) a · (b + c) = a · b + a · c, for all a, b, c.

The properties (R1)-(R4) are called respectively associativity, commutativity, the existence of a unit element, and distributivity of addition with respect to multiplication. The sets with operations +, · that satisfy the properties (CG1)-(CG4) and (R1)-(R4) are called commutative rings.

Two further properties of multiplication are:
(F) for every a ≠ 0 there exists b such that a · b = 1;
(ID) if a · b = 0, then either a = 0 or b = 0 or both.

The property (F) is called the existence of an inverse element with respect to multiplication (this element is then denoted by $a^{-1}$). For ordinary arithmetic, this is called the reciprocal of a, the same as 1/a. The property (ID) then says that there exist no “divisors of zero”. A divisor of zero is a number a, a ≠ 0, such that there is a number b, b ≠ 0, with a · b = 0.

Solution (to 1.A.1). Let p = a/b and q = c/d be two rationals, where a, b, c, d ∈ Z, and suppose for instance that p < q. Consider the average α of p, q, i.e., $\alpha = \frac{1}{2}(p+q)$. Then $\alpha = \frac{ad+cb}{2bd}$, and this is a rational number, being the quotient of the integers ad + cb and 2bd. Now, adding p to both sides of the inequality p < q we get 2p < p + q, whereas adding q to both sides gives p + q < 2q. Hence we have 2p < p + q < 2q, and so p < α < q. □

1.A.2.

1.A.3. Show that the integer 2 does not have a rational square root (hint: think about the number of copies of 2 appearing in the decompositions of integers). ⃝

1.A.4. An irrational number is a real number which cannot be written as the ratio of two integers. Using 1.A.3, prove that between any two different rational numbers there is an irrational number. ⃝

Perhaps the people reading this book are familiar with naturals, integers, rationals, and reals. However, there are larger sets of numbers, such as the complex numbers, or the quaternions, the latter introduced by William R. Hamilton already in 1843. Next we provide examples on complex numbers, see their introduction in 1.1.3. The algebra of quaternions is a topic requiring some knowledge from linear algebra. Hence it will be discussed later, in paragraph 2.E.66, in terms of the so-called Pauli matrices.¹

1.A.5. Consider the function $f(x) = x^2 - 4x + 8$. Use its graph to convince yourself that we need to extend the real numbers in order to solve every quadratic equation with real coefficients. Recall the formula for such a universal solution.

Solution. Given a function f with domain X and codomain Y, its graph is the set $G_f = \{(x, f(x)) : x \in X\}$. Thus $G_f$ is a subset of the Cartesian product X × Y of X, Y, i.e.,
$$G_f \subseteq X \times Y = \{(x, y) : x \in X,\ y \in Y\},$$
where here (x, y) are ordered pairs, see also 1.6.1. To illustrate the graph of the given f, we first note that $G_f \subseteq R \times R$. This means that $G_f$ is a subset of the real plane $R^2$, also called the Euclidean plane (see the discussion in 1.5.1 and 1.5.2 for more details on $R^2$). To visualize $G_f$ or parts of it, there are many alternatives. Next we will demonstrate how to use SageMath, which is a free mathematical software system that can be accessed at https://sagecell.sagemath.org .

¹ Complex numbers appear already in Ars Magna by Gerolamo Cardano (1501-1576, Italian), around 1545, in relation to solving cubic equations (see 1.A.21 below). Another known number system consists of the octonions, also referred to as “Cayley numbers”. Roughly speaking, octonions provide a generalization of quaternions. In a similar fashion, quaternions can be thought of as a generalization of complex numbers, and complex numbers of reals. William Rowan Hamilton (1805-1865) was an Irish mathematician, astronomer, and physicist. Arthur Cayley (1821-1895) was a British mathematician working mainly in algebra. Actually, the octonions were introduced slightly earlier and independently by John T. Graves in 1843, the same year as his friend Hamilton discovered the quaternions.
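Readers curious about the quaternions already now can experiment with them in Sage, which has a built-in constructor for quaternion algebras. The following cell is only our own quick illustration of the non-commutative arithmetic mentioned in the footnote (the choice of the rationals QQ as base field is ours and plays no role later):

Q.<i,j,k> = QuaternionAlgebra(QQ, -1, -1)     # Hamilton quaternions: i^2 = j^2 = -1, k = i*j
print(i*j == k, j*i == -k)                    # True True: multiplication is not commutative
print((1 + 2*i + 3*j + 4*k).reduced_norm())   # 30 = 1^2 + 2^2 + 3^2 + 4^2

The failure of commutativity observed here is the main difference from the scalars discussed in 1.1.1: the quaternions satisfy all the properties listed there except (R2).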
1.1.2. Remarks. The integers Z are a good example of a commutative group. The natural numbers are not such an example, since they do not satisfy (CG4) (and possibly do not even contain the neutral element, if one does not consider zero to be a natural number).

If a commutative ring also satisfies the property (F), we speak of a field (often also called a commutative field). The last stated property (ID) is automatically satisfied if (F) holds. However, the converse statement is false. Thus we say that the property (ID) is weaker than (F). For example, the ring of integers Z does not satisfy (F) but does satisfy (ID). In such a case we use the term integral domain.

Notice that the set of all non-zero elements in a field, together with the operation of multiplication, satisfies (R1), (R2), (R3), (F), and thus is also a commutative group. However, in this case we speak of multiplication instead of addition. As an example, the set of all non-zero real numbers forms a commutative group under multiplication.

The elements of some set with operations + and · satisfying (not necessarily all of) the stated properties (for example, a commutative field, or an integral domain) may be called scalars. To denote them we usually use lowercase Latin letters, either from the beginning or from the end of the alphabet. We will use only these properties of scalars, and thus our results will hold for any objects with such properties. This is the true power of mathematical theories – they do not hold just for a specific solved example. Quite the opposite: when we build ideas in a rational way, they are always universal. We will try to emphasise this aspect, although our ambitions are modest due to the limited size of this book.

Before coming to any use of scalars, we should make a short formal detour and pay attention to their existence. We shall come back to this at the very end of this chapter, when we deal with the formal language of Mathematics in general, cf. the constructions starting in 1.6.5. There we indicate how to get the natural numbers N, the integers Z, and the rational numbers Q, while the real numbers R will be treated much later, in chapter 5. At this point, let us just remark that it is not enough to pose the axioms of objects. We have to be sure that the given conditions are not in conflict and that such objects might exist. We suppose the readers are sure about the existence of the domains N, Z, Q and can handle them easily. The real numbers are usually understood as a denser and better behaved version of Q, but what about the domain of complex numbers? As is usual in mathematics, we will use variables (letters of the alphabet or other symbols) to denote numbers, and it does not matter whether we know their value beforehand or not.

1.1.3. Complex numbers. We are forced to extend the domain of real numbers as soon as we want to see solutions of equations like $x^2 = b$ for all real numbers b. We know that this equation always has a solution x in the domain of real numbers whenever b is non-negative.
The graph of the given function f is displayed in the image on the right. The Sage cell that generates this graph uses the plot command, which requires three parameters: “what to plot”, “what the variable is”, and “the range of plotting”. Thus, in general, the syntax we use is as follows:

plot(f(x), (x, x_min, x_max))

In our case this takes the very simple form

f(x) = x**2 - 4*x + 8; plot(f(x), x, -10, 10)

In this block observe that we used “;” to type different commands on the same line, while the values −10, 10 specify where x is evaluated. In the graph we see that f does not intersect the x-axis. Thus the equation $f(x) = x^2 - 4x + 8 = 0$ cannot admit a solution over the reals. In general, this reflects the negative value of the discriminant $\Delta = b^2 - 4ac$ of an equation $ax^2 + bx + c = 0$. Indeed, ∆ = −16 in our case. Recall the well known formula for the solutions,
$$x_{1,2} = \frac{-b \pm \sqrt{\Delta}}{2a},$$
giving us the (possibly) complex solutions $x_{1,2}$ once we introduce $\sqrt{-1} = i$. Indeed, in our case $\sqrt{\Delta} = \sqrt{-1}\,\sqrt{-\Delta} = \sqrt{-1}\,\sqrt{16} = i\sqrt{16} = 4i$, and the solutions are $x_{1,2} = 2 \pm 2i$. □

1.A.6. A comment on power functions. Suppose that n ∈ N is a positive natural number.² The nth root $\sqrt[n]{x} = x^{1/n}$ of some positive real x > 0 is the inverse of the power function $f(x) = x^n$, so $(\sqrt[n]{x})^n = \sqrt[n]{x^n} = x$. A polynomial function is the sum of a finite number of (constant) multiples of power functions with natural exponents, as for example the function $f(x) = 2(x-4)^3 - 16$. Note that $x^0 = 1$ holds for each real or complex number x. Consider the equation f(x) = 0, i.e., $2(x-4)^3 = 16$. We have $x - 4 = \sqrt[3]{2^3} = 2$, which gives x = 6 as a quick solution. This is visible also in the graph of f, given below (this was obtained in Sage via the command plot(f, -2, 10), as before). However, since our equation is of degree 3, one expects three solutions over C (based on the fundamental theorem of algebra, see Chapter 12, and also the end of this section).

² In this book we mainly adopt the notation N = {0, 1, 2, 3, . . .} for the (ordered) set of naturals, so we view 0 as a natural number (except in Chapter 11 on number theory, where 0 is not considered a natural number).

If b < 0, then such a real x cannot exist. Thus we need to find a larger domain, where this equation has a solution. The crucial idea is to add to the real numbers the new number i, the imaginary unit, for which we require $i^2 = -1$. Next we try to extend the definitions of addition and multiplication in order to preserve the usual behaviour of numbers (as summarised in 1.1.1). Clearly we need to be able to multiply the new number i by real numbers and to sum it with real numbers. Therefore we need to work in our newly defined domain of complex numbers C with formal expressions of the form z = a + i b, called the algebraic form of z. The real number a is called the real part of the complex number z, the real number b is called the imaginary part of the complex number z, and we write Re(z) = a, Im(z) = b.

It should be noted that if z = a + i b and w = c + i d, then z = w implies both a = c and b = d. In other words, we can equate both real and imaginary parts. For positive x we then get $(i \cdot x)^2 = -1 \cdot x^2$, and thus we can solve the equations as requested. In order to satisfy all the properties of associativity and distributivity, we define the addition so that we add the real parts and the imaginary parts independently.
Similarly, we want the multiplication to behave as if we multiply pairs of real numbers, with the additional rule that $i^2 = -1$. Thus
$$(a + i b) + (c + i d) = (a + c) + i\,(b + d),$$
$$(a + i b) \cdot (c + i d) = (ac - bd) + i\,(bc + ad).$$
Next, we have to verify all the properties (CG1-4), (R1-4) and (F) of scalars from 1.1.1. But this is an easy exercise: zero is the number 0 + i 0, one is the number 1 + i 0; both these numbers are for simplicity denoted as before, that is, 0 and 1. For non-zero z = a + i b we easily check that $z^{-1} = (a^2 + b^2)^{-1}(a - i b)$. All other properties are obtained by direct calculations.

1.1.4. The complex plane and polar form. A complex number is given by a pair of real numbers, therefore it corresponds to a point in the real plane $R^2$. Our algebraic form of the complex numbers z = x + i y corresponds in this picture to understanding the x-coordinate axis as the real part and the y-coordinate axis as the imaginary part of the number. The absolute value of the complex number z is defined as its distance from the origin, thus $|z| = \sqrt{x^2 + y^2}$. The reflection with respect to the real axis then corresponds to changing the sign of the imaginary part. We call this operation $z \mapsto \bar z = x - i y$ the complex conjugation.

Let us now consider complex numbers of the form z = cos φ + i sin φ, where φ is a real parameter giving the angle between the real axis and the line from the origin to z (measured in the positive, i.e. counter-clockwise, sense). These numbers describe all points on the unit circle in the complex plane. Every non-zero complex number z can then be written as z = |z|(cos φ + i sin φ).

In order to determine all the solutions, we may use the command solve in Sage. This is an appropriate tool for solving equations and systems of equations. In the sequel we will encounter the solve command in various contexts and analyze its possible implementations in more detail. For the function f of our example, we can proceed as follows:

x = var("x"); f = 2*(x-4)**3 - 16
sol = solve([f==0], x); sol

and as an output we obtain a list of the three solutions, two of them being complex conjugate, i.e.,

[x == -I*sqrt(3) + 3, x == I*sqrt(3) + 3, x == 6]

A traditional way to find the roots of polynomials relies on the so-called Horner’s scheme. This provides a quick way to evaluate a given polynomial f(x), and arises by rewriting the polynomial as follows:
$$f(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n = a_0 + x\bigl(a_1 + x\bigl(a_2 + x(a_3 + \dots + x(a_{n-1} + x a_n)\dots)\bigr)\bigr).$$
Observe that such a procedure requires n multiplications and n additions (and this turns out to be optimal). In fact, we can treat the division of polynomials by divisors of the form x − ρ (ρ ∈ R) using Horner’s table:³

  ρ  |  a_n  |  a_{n-1}          |  · · ·  |  a_0
     |       |  ρ a_n            |  · · ·  |
     |  a_n  |  ρ a_n + a_{n-1}  |  · · ·  |  f(ρ)

In this table, each entry in the second row is the product of ρ with the bottom-row entry immediately to the left. The bottom line is the sum of the previous two lines. For example, if $f(x) = x^3 + x^2 - x + 2$ and ρ = −2, we obtain

  ρ = −2  |  1  |   1  |  −1  |   2
          |     |  −2  |   2  |  −2
          |  1  |  −1  |   1  |   0

The zero in the last entry (the remainder) verifies that ρ = −2 is a root of our polynomial, i.e., f(ρ) = 0.

³ The motivation behind the notation ρ in Horner’s scheme is the Greek word “ρίζα”, which means “root”.

1.A.7. Use Horner’s scheme to divide $p(x) = 5x^4 + x^2 - 2x + 2$ by x − 4. Then verify your answer by the classical algorithm of long division. ⃝
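The two rows of Horner’s table are easy to reproduce programmatically. The following short sketch is our own illustration (the names horner_eval and synthetic_division are our choices, not built-in Sage commands); since only ring operations are used, it works verbatim for complex coefficients as well (cf. 1.A.12 below):

def horner_eval(coeffs, rho):
    # coeffs = [a_n, ..., a_1, a_0], highest power first
    value = 0
    for a in coeffs:
        value = value*rho + a     # one multiplication and one addition per step
    return value                  # this is exactly f(rho)

def synthetic_division(coeffs, rho):
    # bottom row of Horner's table: quotient coefficients plus remainder f(rho)
    bottom = [coeffs[0]]
    for a in coeffs[1:]:
        bottom.append(bottom[-1]*rho + a)
    return bottom[:-1], bottom[-1]

# the example from the text: f(x) = x^3 + x^2 - x + 2 and rho = -2
print(synthetic_division([1, 1, -1, 2], -2))   # ([1, -1, 1], 0), so f(-2) = 0

The returned quotient [1, −1, 1] encodes $x^2 - x + 1$, in agreement with the bottom row of the table above.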
Polynomials can also be defined over the complex numbers, and Horner’s scheme is applicable in this case as well. Recall that to each complex number z = x + iy ∈ C we can associate the ordered pair (x, y) in the Euclidean plane. Thus, complex numbers can be viewed as ordered pairs of real numbers $(x, y) \in R^2$, and such pairs correspond to vectors in $R^2$ with initial point at the origin of $R^2$. As usual, we will use the x-axis for the reals Re z, and the y-axis for the purely imaginary numbers i Im z.

We proceed with basic exercises on the arithmetic of complex numbers. Recall that both the addition and the multiplication of complex numbers admit a geometric interpretation: addition corresponds to the vector sum, while multiplication by $i = \sqrt{-1}$, for example, is equivalent to a counterclockwise rotation by a right angle.

1.A.8. Arithmetic in C. Compute Re(z1), Im(z1), $\bar z_1$, |z2|, z1 + z2, and z1 z2, for the cases:
(a) z1 = 1 − 2i, z2 = 4i − 3;
(b) z1 = 3 + 4i, z2 = 3 − 4i.

Solution. (a) We have Re(z1) = 1, Im(z1) = −2, $\bar z_1 = 1 + 2i$, $|z_2| = \sqrt{4^2 + (-3)^2} = 5$, z1 + z2 = −2 + 2i, z1 z2 = 5 + 10i.

For given z ≠ 0, φ is unique if 0 ≤ φ < 2π. The number φ is called the argument of the complex number z, and this form of z is called the polar form of the complex number. This way of writing the complex numbers is very convenient for understanding multiplication. Consider the numbers z = |z|(cos φ + i sin φ) and w = |w|(cos ψ + i sin ψ) and calculate their product:
$$z \cdot w = |z|(\cos\varphi + i\sin\varphi)\,|w|(\cos\psi + i\sin\psi) = |z||w|\bigl(\cos\varphi\cos\psi - \sin\varphi\sin\psi + i\,(\cos\varphi\sin\psi + \sin\varphi\cos\psi)\bigr) = |z||w|\bigl(\cos(\varphi+\psi) + i\sin(\varphi+\psi)\bigr).$$
The last equality is a result of the addition formulas for trigonometric functions (we shall deal with them in more detail later in our discussion of rotations in the plane, see page 37). Division is equally easy. If z = |z|(cos φ + i sin φ) ≠ 0, then $w = |z|^{-1}(\cos\varphi - i\sin\varphi)$ satisfies zw = wz = 1, hence we can write $w = z^{-1} = 1/z$. We can summarize (and iterate the application of the previous formula on the product of the number z with itself):

Polar form and de Moivre Theorem

Consider two complex numbers z = |z|(cos φ + i sin φ) and w = |w|(cos ψ + i sin ψ) in polar form. Then, if n is an integer, positive or negative,
$$z\,w = |z|\,|w|\bigl(\cos(\varphi + \psi) + i\sin(\varphi + \psi)\bigr),$$
$$z^n = |z|^n\bigl(\cos(n\varphi) + i\sin(n\varphi)\bigr).$$

1.1.5. Functions. In most tasks we do not deal just with numbers, i.e. with individual values of scalars. More often, the values are associated to each of the elements in a set of objects. Formally, we talk about a mapping f : A → B assigning to each element x in the domain set A the value f(x) in the codomain set B. The set of all images f(x) ∈ B is called the range of f. The set A or B can be a set of numbers, but there is nothing to stop them being sets of other objects. The mapping f, however it is described, must unambiguously determine a unique member of B for each member of A. In another terminology, the member x ∈ A is often called the independent variable, and y = f(x) ∈ B is called the dependent variable. We also say that the value y = f(x) is a function of the independent variable x in the domain of f. For now, we shall restrict ourselves to the case where the codomain B is a subset of scalars, and we shall talk about scalar functions.

The simplest way to define a function appears if A is a finite set. Then we can describe the function f by a table or a listing showing the image of each member of A. We have certainly seen many examples of such functions:
Let f denote the pay of a worker in some company in a certain year. The values of the independent variable, that is, the domain of the function, are the individual workers x from the set of all considered workers. The value f(x) is their pay for the given year. Similarly, we can talk about the age of students or their teachers in years, the litres of beer and wine consumed by individuals from a given group, etc. Another example is a food dispensing machine. The domain of a function f would be the button pushed together with the money inserted to determine the selection of the food item.

Let A = {1, 2, 3} = B. The set of equalities f(1) = 1, f(2) = 3, f(3) = 3 defines a function f : A → B. Generally, as there are 3 possible values for f(1), and the same for f(2) and f(3), there are 27 possible functions from A into B in total.

But there are other ways to define a function than as a table. For example, the function f can denote the area of a planar region. Here, the domain consists of subsets of the plane (e.g. all triangles, circles or other planar regions with a defined area). The range of f consists of the respective areas of the regions. Rather than providing a list of areas for a finite number of regions, we hope for a formula allowing us to compute the functional value f(P) for any given planar region P from a suitable class. Of course, there are many simple functions given by formulae, e.g. f(x) = 3x + 7 with A = B = R or A = B = N. Not all functions can be given by a formula or list.

(b) Here we use Sage, where the field of complex numbers is denoted by CC, while, as we saw above, the square root of −1 is denoted by I (or i). To introduce a complex number there are many alternatives: e.g., we simply type z1 = 3 + 4*I, where * is the general multiplication operator in Sage,⁴ or type z1 = CC(3, 4). As for the additional code that gives the computations, type and run successively the following:

z1 = CC(3, 4); z2 = CC(3, -4)
z1.real(); z1.imag()
z1.conjugate(); abs(z2); z1+z2; z1*z2

Note that in case (b) we have $\bar z_2 = z_1$. In Sage, a verification of this takes the form z1 == conjugate(z2), after introducing z1, z2 as above. As is usual in Python (the mother of Sage), one uses “=” to determine a quantity, while “==” is used to state an equation between two quantities (in the latter case Sage returns True or False). It follows that $|z_1| = |z_2| = \sqrt{25} = 5$ (see also 1.A.10 below), and the product z1 z2 can also be obtained from the usual formula $z\bar z = |z|^2$, valid for any z ∈ C (verify this). □

⁴ As we will see later, in Sage the multiplication of matrices and vectors is also denoted by *.

1.A.9. Show that the Euclidean distance between two arbitrary complex numbers z = x + iy and w = u + iv is given by the formula $d = |z - w| = \sqrt{(x-u)^2 + (y-v)^2}$, with x, y, u, v ∈ R. ⃝
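As a quick Sage cross-check of the distance formula in 1.A.9, one can take the numbers z1, z2 from 1.A.8(a); this little cell is our own addition, not part of the exercise:

z = CC(1, -2); w = CC(-3, 4)                  # z1 = 1 - 2i and z2 = 4i - 3
print(abs(z - w))                             # 7.2111..., the distance |z1 - z2|
print(sqrt((1 - (-3))**2 + (-2 - 4)**2).n())  # the same value, sqrt(52)

Both commands return $\sqrt{52} \approx 7.2111$, as the formula predicts.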
1.A.10. Given the complex number $1 + i\sqrt{3}$, make use only of the figure given below to calculate its distance from its complex conjugate. Next, consider the complex number
$$z = \frac{2 + 2i\sqrt{3} + 3i - 3\sqrt{3}}{1 - i\sqrt{3}}.$$
Relate a verification of the equality $|z| = \sqrt{13}$ to the given figure, and find another z′ ∈ C with the same property, i.e., $|z'| = \sqrt{13}$.

Solution. Complex conjugates are symmetric with respect to the x-axis. Moreover, the distance of a complex number from the x-axis equals the absolute value of its imaginary part. Thus, as we also deduce from the given figure, the distance between $1 + i\sqrt{3}$ and its conjugate $1 - i\sqrt{3}$ equals $2\sqrt{3}$. Now, from the Pythagorean Theorem it follows that a complex number x + iy and its complex conjugate x − iy have the same length (absolute value), given by $r := \sqrt{x^2 + y^2}$. For our task this applies to $1 + i\sqrt{3}$. On the other hand, recall that the absolute value of the product of any two complex numbers is the product of their absolute values. Thus, if z1, z2 are two complex numbers, provided that z2 ≠ 0, we see that
$$\Bigl|\frac{z_1 z_2}{\bar z_2}\Bigr| = \frac{|z_1 z_2|}{|\bar z_2|} = \frac{|z_1|\,|z_2|}{|\bar z_2|} = |z_1|.$$
Observe now that the numerator of the given z can be expressed as
$$2 + 2i\sqrt{3} + 3i - 3\sqrt{3} = 2(1 + i\sqrt{3}) + 3i(1 + i\sqrt{3}) = (2 + 3i)(1 + i\sqrt{3}).$$
This means that our z has the form $z_1 z_2/\bar z_2$ with z1 = 2 + 3i and z2 = $1 + i\sqrt{3}$, respectively, and hence the result: $|z| = |z_1| = |2 + 3i| = \sqrt{2^2 + 3^2} = \sqrt{13}$. Because z2 does not play any essential role in this computation, one can find infinitely many complex numbers with the same property as the given z. □

1.A.11. Determine the distance d of the numbers z, $\bar z$ in the complex plane for $z = \sqrt{13} - i\,(\sqrt{13}/2)$. ⃝

1.A.12. Use Horner’s scheme to solve the equation
$$z^3 + (1 + i)z^2 + az + 2 = 0, \qquad z \in C,$$
when it is already known that $z_0 = i$ is a root of the given equation. ⃝

1.A.13. Solve the equation $x^2 + 2(1 + i)x = 14i$ (observe that this is a quadratic equation with complex coefficients).

Solution. Add to both sides of the equation the term $(1+i)^2$. This gives
$$x^2 + 2(1+i)x + (1+i)^2 = 14i + (1+i)^2,$$
or equivalently, $(x + (1+i))^2 = 16i$. Therefore, $x + (1+i) = \pm 4\sqrt{i}$, that is, $x = -(1+i) \pm 4\sqrt{i}$, and it remains to compute $\sqrt{i}$. Suppose that $a + ib = \sqrt{i}$ for some reals a, b ∈ R. Then $(a + ib)^2 = i$, or in other words, $a^2 + 2iab - b^2 = i = 0 + 1i$. By comparing the real and imaginary parts, we obtain the equations $a^2 - b^2 = 0$ and $2ab = 1$. The first relation gives a = ±b.

For example, let f(t) denote the speed of the car at time t. For any given car and time t, there will be functional values f(t) denoting its speed, which can of course be measured approximately, but usually not given by a formula. Another example: let f(n) be the nth digit in the decimal expansion of $\pi \doteq 3.1415\ldots$. So for example f(4) = 5. The value of f(n) is defined but unknown if n is large enough.

The mathematical approach in modelling real problems often starts from the indication of certain dependencies between some quantities and aims at explicit formulas for functions which describe them. Often a full formula is not available, but we may obtain the values f(x) at least for some instances of the independent variable x, or we may be able to find a suitable approximation. We shall see all of the following types of expressions of the requested function f in this book:
• exact finite expression (like the function f(x) = 3x + 7 above);
• infinite expression (we shall come to that only much later in chapter 5 when introducing the limit processes);
• description of how the function’s values change under a given change of the independent variable (this behaviour will be displayed under the name difference equation in a moment and under different circumstances later on);
• approximation of a not computable function with a known one (usually including some error estimates – this could be the case with the car above: say we know it goes with some known speed at the time t = 0, we brake as much as possible on a known surface, and we compute the decrease of speed with the help of some mathematical model);
• finding only the probability of possible values of the function, for example the function giving the length of life of a given group of still living people, in dependence on some health related parameters.
1.1.6. Functions defined explicitly. Let us start with the most desirable case, when the function values are defined by a computable finite formula. Of course, we shall also be interested in the efficiency of the formulas, i.e. how fast the evaluations would be. In principle, real computations can involve only a finite number of summations and multiplications of numbers. This is how we define the polynomials, i.e. functions of the form
$$f(x) = a_n x^n + \dots + a_1 x + a_0,$$
where $a_0, \dots, a_n$ are known scalars, and x is the unknown variable whose value we can insert. Here $x^n = 1 \cdot x \cdots x$ means the n-times repeated multiplication of the unit by x (in particular, $x^0 = 1$), and f(x) is the value of the indicated sum of products. This is a fairly well computable formula for each n ∈ N. The choice n = 0 provides the constants (i.e. constant functions) $a_0$.

The next example is more complicated.

Factorial function

Let A = $Z^+$ be the set of positive integers. For each n ∈ $Z^+$, define the factorial function by
$$n! = n(n-1)(n-2)\cdots 3 \cdot 2 \cdot 1.$$
For convenience we also define 0! = 1. (We will see later on why this is sensible.) It is easy to see that n! = n · (n − 1)! for all n ≥ 1. So 1! = 1, 2! = 2 · 1 = 2, 3! = 3 · 2 · 1 = 6, 6! = 720, etc.

The latter example deserves more attention. Notice that we could have defined the factorial by setting A = B = N and giving the equation f(n) = n · f(n − 1) for all n ≥ 1. This does not yet define f, but for each n it does determine what f(n) is in terms of its predecessor f(n − 1). This is sometimes called a recurrence relation. After choosing f(0) = 1, the recurrence determines f(1) and hence successively f(2), etc., and so a function is defined. It is the factorial function as described above.

2. Difference equations

The factorial function is one example of a function which can be defined on the natural numbers by means of a recurrence relation. Such a situation can often be seen when formulating mathematical models that describe real systems in practice. We will observe here only a few simple examples and return to this topic in chapter 3.

If a = b, then the second relation reduces to $a^2 = 1/2$, that is, $a = \pm 1/\sqrt{2} = \pm\sqrt{2}/2 = b$. We also see that the case a = −b is impossible, since it gives $2a^2 = -1$, which contradicts the fact that a ∈ R. Thus
$$\sqrt{i} = \pm\frac{\sqrt{2}}{2}(1 + i) \quad\text{and}\quad x = -(1+i) \pm 2\sqrt{2}(1+i) = (-1 \pm 2\sqrt{2})(1+i).$$
For convenience, let us verify the answer via Sage. The next commands, successively written, yield the computations of the real/imaginary parts of the two complex solutions:

eq = x**2 + 2*(1+I)*x - 14*I == 0
sols = solve([eq], x); sols
sols[0].rhs().real_part()
sols[0].rhs().imag_part()
sols[1].rhs().imag_part()
sols[1].rhs().real_part()

For example, the last command returns 2*sqrt(2) - 1, as it should, and similarly for the previous commands. As a useful remark, keep in mind that in Sage the command sols[0] returns the first solution, and sols[1] the second one. This means that Sage assigns labels to these solutions automatically, as the first two integers “0” and “1”, respectively. □

1.A.14. Polar form and algebraic form. Express z1 = 2 + 3i in polar form. Next, express z2 = 3(cos(π/3) + i sin(π/3)) in algebraic form.

Solution. As we learned from the figure in 1.A.10, the absolute value |z1| of the complex number z1 equals $r = \sqrt{13}$. Next, we need the angle (argument) φ of z1, indicated in the figure.
Applying the well-known rules for the trigonometric functions cos and sin yields $\sin\varphi = 3/\sqrt{13}$ and $\cos\varphi = 2/\sqrt{13}$, respectively. Therefore, with the aid of a calculator we get
$$\varphi = \arcsin(3/\sqrt{13}) = \arccos(2/\sqrt{13}) \doteq 56.3^\circ.$$
This computation is also easy in Sage: running the cell

z = 2 + 3*I; arg(z)

we obtain arctan(3/2). Then, to translate the angle into radians or degrees we just type

phi = arctan(3/2)
print(N(phi, digits=5), "radians")
print(N(phi*180/pi, digits=4), "degrees")

The given answer is 0.98279 radians and 56.31 degrees, respectively. Now we are able to present the polar form of z1 = x + iy: we have $x = \sqrt{13}\cos\varphi$ and $y = \sqrt{13}\sin\varphi$, which gives
$$z_1 = \sqrt{13}\Bigl(\frac{2}{\sqrt{13}} + i\,\frac{3}{\sqrt{13}}\Bigr) = \sqrt{13}\Bigl(\cos\bigl(\arccos\tfrac{2}{\sqrt{13}}\bigr) + i\,\sin\bigl(\arcsin\tfrac{3}{\sqrt{13}}\bigr)\Bigr).$$
Transition from polar form to algebraic form is even simpler:
$$z_2 = 3\Bigl(\cos\frac{\pi}{3} + i\sin\frac{\pi}{3}\Bigr) = 3\Bigl(\frac{1}{2} + i\,\frac{\sqrt{3}}{2}\Bigr). \qquad \square$$

1.A.15. Remark. In principle, the argument arg(z) = φ of a complex number z is only defined up to “modulo 2π”. That is, for any integer k ∈ Z, the angle φ + 2kπ would serve as well (since full circle rotations do not change the point). Therefore, in the previous exercise we have essentially defined the so-called principal (value of the) argument of z.

1.2.1. Linear difference equations of first order. A general difference equation of the first order (or first order recurrence) is an expression of the form
$$f(n+1) = F(n, f(n)),$$
where F is a known function with two arguments (independent variables). If we know the “initial” value f(0), we can compute f(1) = F(0, f(0)), then f(2) = F(1, f(1)), and so on. Using this, we can compute the value f(n) for arbitrary n ∈ N. An example of such an equation is provided by the factorial function f(n) = n!, where
$$(n+1)! = (n+1) \cdot n!, \qquad f(0) = 1.$$
In this way, the value of f(n + 1) depends on both n and the value of f(n), and formally we would express this recurrence in the form F(x, y) = (x + 1)y. A very simple example is f(n) = C for some fixed scalar C and all n. Another example is the linear difference equation of first order
$$(1)\qquad f(n+1) = a \cdot f(n) + b,$$
where a ≠ 0 and b are fixed scalars.

Such a difference equation is easy to solve if b = 0. Then it is the well-known recurrent definition of the geometric progression. We have f(1) = a f(0), f(2) = a f(1) = $a^2 f(0)$, and so on. Hence for all n we have $f(n) = a^n f(0)$. This is also the relation for the Malthusian population growth model, which is based on the assumption that the population size grows with a constant rate when measured at a sequence of fixed time intervals.

It is time to prove our first mathematical theorem. We deduce a general result for first order equations with variable coefficients, namely
$$(2)\qquad f(n+1) = a_n \cdot f(n) + b_n.$$
We use the usual notation ∑ for sums and the similar notation ∏ for products. We also use the convention that when the index set is empty, the sum is zero and the product is one.

1.2.2. Proposition. The general solution of the first order difference equation (2) from the previous paragraph with the initial condition f(0) = $y_0$ is, for n ∈ N, given by the formula
$$(1)\qquad f(n) = \Bigl(\prod_{i=0}^{n-1} a_i\Bigr) y_0 + \sum_{j=0}^{n-2}\Bigl(\prod_{i=j+1}^{n-1} a_i\Bigr) b_j + b_{n-1}.$$
Proof. We use mathematical induction. The result clearly holds for n = 1, since $f(1) = a_0 y_0 + b_0$. Assuming that the statement holds for some fixed n, we compute (do not be upset by the many brackets, and remember the conventions about the empty sums and products):
$$f(n+1) = a_n\Biggl(\Bigl(\prod_{i=0}^{n-1} a_i\Bigr) y_0 + \sum_{j=0}^{n-2}\Bigl(\prod_{i=j+1}^{n-1} a_i\Bigr) b_j + b_{n-1}\Biggr) + b_n = \Bigl(\prod_{i=0}^{n} a_i\Bigr) y_0 + \sum_{j=0}^{n-1}\Bigl(\prod_{i=j+1}^{n} a_i\Bigr) b_j + b_n,$$
as can be seen directly by multiplying out. □

Note that for the proof we did not use anything about the scalars except for the properties of a commutative ring. There is another approach to the proof, which explains the name "linear". First observe that with all $b_n = 0$, the formula from the proposition is obvious. Next observe that, dealing with two sequences $y_n$, $y'_n$ solving 1.2.2(2), their difference $z_n = y_n - y'_n$ must be a solution to the “homogenized” equation $f(n+1) = a_n \cdot f(n)$. Conversely, knowing such sequences $z_n$ and $y_n$, their sum $y'_n = z_n + y_n$ must be a solution of 1.2.2(2). Thus, we could just guess and check that the second and third terms of (1) solve our equation. In particular, the following corollary can be proved very easily this way. Try to complete this line of arguments in detail!

1.2.3. Corollary. The general solution of the linear difference equation (1) from 1.2.1, with a ≠ 1 and initial condition f(0) = $y_0$, is
$$(1)\qquad f(n) = a^n y_0 + (1 + \dots + a^{n-1})\,b = a^n y_0 + \frac{1 - a^n}{1 - a}\,b.$$

Proof. If we set $a_i$ and $b_i$ to be constants and use the general formula 1.2.2(1), we obtain
$$f(n) = a^n y_0 + b\Bigl(1 + \sum_{j=0}^{n-2} a^{n-j-1}\Bigr).$$
We observe that the expression in the bracket is $(1 + a + \dots + a^{n-1})$. The sum of this geometric progression follows from $1 - a^n = (1 - a)(1 + a + \dots + a^{n-1})$. □

The proof of the former proposition is a good example of a mathematical result where the verification is quite easy, as soon as someone tells us the theorem. Mathematical induction is a natural method of proof. Note that for calculating the sum of a geometric progression we silently assumed the existence of the inverse element for non-zero scalars. But actually, we derived the formula in a way manifesting that the division is always possible if our scalars do not allow for divisors of zero. In particular, notice that the formula (1) is valid with integer coefficients a, b and integer initial conditions. Here, we know in advance that each f(n) is an integer. Thus our formula necessarily gives correct integer solutions, although living in the extension of Z to the rational numbers Q.

1.A.16. (a) Is it true that the (principal) argument of a positive real number is zero and of a negative one is π?
(b) Is it true that the (principal) argument of the imaginary number i is π/2, while of −2i it is −π? ⃝

1.A.17. Express $z = 1 + \cos\frac{\pi}{3} + i\sin\frac{\pi}{3}$ in polar form. ⃝

1.A.18. De Moivre’s theorem applied. Simplify the expression $(5\sqrt{3} + 5i)^n$ for n = 2 and n = 12.

Solution. Using the binomial theorem for n = 2, we see that
$$(5\sqrt{3} + 5i)^2 = 75 + 10\sqrt{3} \cdot 5i - 25 = 50 + 50\sqrt{3}\,i.$$
For n = 12 it is shorter (and wiser) to express the complex number in polar form first:
$$5\sqrt{3} + 5i = 10\Bigl(\frac{\sqrt{3}}{2} + \frac{i}{2}\Bigr) = 10\Bigl(\cos\frac{\pi}{6} + i\sin\frac{\pi}{6}\Bigr).$$
Then, an application of the de Moivre formula presented in 1.1.4 gives the following:
$$(5\sqrt{3} + 5i)^{12} = 10^{12}\Bigl(\cos\frac{12\pi}{6} + i\sin\frac{12\pi}{6}\Bigr) = 10^{12}. \qquad \square$$

1.A.19. Compute the expression $\bigl(\cos\frac{\pi}{6} + i\sin\frac{\pi}{6}\bigr)^{31}$ by applying de Moivre’s theorem. ⃝

1.A.20. Use de Moivre’s theorem to prove the identities
$$\cos(3\varphi) = 4\cos^3(\varphi) - 3\cos(\varphi), \qquad \sin(3\varphi) = 3\sin(\varphi) - 4\sin^3(\varphi). \qquad ⃝$$
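Computations with de Moivre’s theorem such as those in 1.A.18 and 1.A.19 can be cross-checked in Sage; the cell below is our own sketch (note that it also reveals the expected outcome of 1.A.19):

z = 5*sqrt(3) + 5*I
print((z**2).expand())                  # 50*I*sqrt(3) + 50
print((z**12).expand())                 # 1000000000000, i.e. 10^12
print(cos(31*pi/6) + I*sin(31*pi/6))    # -1/2*I - 1/2*sqrt(3)

The last line uses the right hand side of de Moivre’s formula with n = 31 and φ = π/6.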
The beauty of complex numbers is also linked to the “fundamental theorem of algebra”, see 12.4.20. In fact, complex numbers arose from the need to solve cubic equations (without meaning that the solutions must be complex). Next we describe a procedure which was developed during the 16th century by S. del Ferro (1465-1526), G. Cardano (1501-1576), N. F. Tartaglia (1500-1557), and possibly others. Unlike the quadratic equations, here we need the complex numbers even if all three solutions are real.

1.A.21. Express a solution of the cubic equation $x^3 + ax^2 + bx + c = 0$ in terms of the real coefficients a, b and c.

Solution. The first step reduces the quadratic dependence by setting x := t − a/3. The original equation then becomes $t^3 + pt + q = 0$, with $p = b - a^2/3$ and $q = c + (2a^3 - 9ab)/27$, respectively. The next step requires us to be more creative. We introduce new variables u and v satisfying the conditions
$$u + v = t, \qquad 3uv + p = 0.$$
Using these, we can substitute the first condition into the previous equation to obtain
$$u^3 + v^3 + (3uv + p)(u + v) + q = 0.$$
Next, use the second equation to eliminate v. This yields
$$u^6 + qu^3 - \frac{p^3}{27} = 0,$$
which is a quadratic equation in the unknown $s = u^3$. Hence
$$u = \sqrt[3]{-\frac{q}{2} \pm \sqrt{\frac{q^2}{4} + \frac{p^3}{27}}}.$$
Finally, by back substitution we obtain the desired answer:
$$x = -\frac{p}{3u} + u - \frac{a}{3}.$$
Observe that in order to obtain all three solutions, one has to work with complex roots. This is because the equation $x^3 = a$, with a ≠ 0, has exactly three solutions over C (according to the fundamental theorem of algebra). If one needs all three solutions, the Sage cell given below provides an answer (though the expressions are complicated enough, and hence not presented here).

x, a, b, c = var("x, a, b, c")
assume(a, b, c, "real")
solve(x**3 + a*x**2 + b*x + c == 0, x)

□

1.A.22. Solve the equation $x^3 + x^2 - 2x - 1 = 0$. ⃝

A series of additional tasks related to complex numbers is presented at the end of this chapter. In fact, once we provide a systematic treatment of basic linear algebra in Chapter 2, we will be able to demonstrate how to treat complex numbers via matrices. This takes place in paragraph ??.

B. Difference equations

Difference equations (also called recurrence relations) are rules which determine the values of the elements of a sequence in terms of previous elements. Solving a difference equation means finding an explicit formula for an arbitrary element of the sequence. If each element of the sequence is determined only by the previous element, we talk about first order difference equations. As we will see below, and also in Chapter 3, such recurrence relations are often induced by problems appearing in our everyday life.

The linear difference equation 1.2.1(1) can be neatly interpreted as a mathematical model for finance, e.g. savings or loan payoff with a fixed interest rate a and fixed repayment b. (The cases of savings and loans differ only in the sign of b.) With varying parameters a and b we obtain a similar model with varying interest rate and repayment. We can imagine, for instance, that n is the number of months, $a_n$ is the interest rate in the nth month, and $b_n$ the repayment in the nth month.

1.2.4. A nonlinear example. When discussing linear difference equations, we mentioned a very primitive population growth model which depends directly on the momentary population size p. At first sight, it is clear that such a model with a > 1 leads to very rapid and unbounded growth. A more realistic model has such a population change ∆p(n) = p(n + 1) − p(n) only for small values of p, that is, ∆p/p ∼ r > 0. Thus, if we want to let the population grow by 5% per time interval only for small p, then we choose r to be 0.05. For some limiting value p = K > 0 the population may not grow.
For even greater values it may even decrease, for instance if the resources for feeding the population are limited, or if the individuals in a large population are obstacles to each other, etc. Assume that the values y_n = ∆p(n)/p(n) change linearly in p(n). Graphically, we can imagine this dependence as a line in the plane of the variables p and y. This line passes through the point [0, r], so that y = r when p = 0. It also passes through [K, 0], which encodes the second condition, namely that the population does not change when p = K. Thus we set y = −(r/K) p + r. Setting y = y_n = ∆p(n)/p(n) and p = p(n), we obtain (p(n+1) − p(n))/p(n) = −(r/K) p(n) + r. Multiplying out, we obtain a first order difference equation with p(n) present in both its first and second powers: (1) p(n+1) = p(n) (1 − (r/K) p(n) + r). 13 type of recurrence relations are often induced by problems appearing in our everyday life. 1.B.1. Michael wants to buy a great new car, which costs €30,000. He wants to take out a loan and pay it off under a fixed monthly repayment agreement. The car company offers him a loan with a yearly interest rate of 6%, with the first repayment at the end of the first month of the loan. Specify the exact amount of the monthly payment, under the assumption that Michael aims to redeem the loan in three years. Solution. Let us denote by P the sum that Michael has to pay per month, and by d_k the amount of the remaining loan after k months. Set also C = 30000 and u = 0.06/12 (the latter represents the monthly interest rate). We have d_0 = C = 30000, and after the first month we see that d_1 = C − P + u·C. After the kth month we compute d_k = d_{k−1} − P + u d_{k−1} = (1 + u) d_{k−1} − P. (∗) In order to solve (∗) we may apply the statement of Corollary 1.2.3. In terms of this result we have a = 1 + u, b = −P, and thus the solution takes the form d_k = d_0 a^k − P (a^k − 1)/(a − 1). Paying off the loan in three years means d_36 = 0. This gives P = 30000 · u(1 + u)^36 / ((1 + u)^36 − 1) ≈ 912.7. Thus Michael should pay approximately €913 per month. □ 1.B.2. Consider the task in 1.B.1. For how long will Michael pay, if the monthly payment is €500? ⃝ 1.B.3. Determine the sequence {y_n}_{n=1}^∞ which satisfies the recurrence relation y_{n+1} = (3/2) y_n + 1, n ≥ 1, with y_1 = 1. Then verify your answer via Sage. Solution. We can apply Corollary 1.2.3 again. We have a = 3/2, b = 1 and y_0 = 0 (so that y_1 = 1). Hence, according to the given formula, we deduce that y_n = (3/2)^n · 0 + ((1 − (3/2)^n)/(1 − 3/2)) · 1 = −2 + 2(3/2)^n. Let us explain the procedure via Sage, which has a very friendly package for treating recurrence relations. An appropriate function to load is rsolve (from the pure Python package sympy, which is for symbolic computations). It can deal with linear recurrence relations, and initially we type the following code:
from sympy import Function, rsolve
from sympy.abc import n
y = Function("y")
The next step is to type the relation that we want to solve and the initial conditions: CHAPTER 1. INITIAL WARMUP Try to think through the behaviour of this model for various values of r and K. In the diagram we can see the results for the parameters r = 0.05 (that is, five percent growth in the ideal state) and K = 100 (resources limit the population to the size 100), with p(0) = 2, so that initially there are two individuals. Note that the original almost exponential growth slows down later, and the population size approaches the desired limit of 100 individuals.
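The experiment from the diagram is easy to reproduce; a minimal Sage sketch iterating the recurrence (1) with the parameters of the text (r = 0.05, K = 100, p(0) = 2):
# Iterate p(n+1) = p(n)*(1 - (r/K)*p(n) + r), cf. (1)
r, K, p = 0.05, 100, 2.0
values = [p]
for _ in range(300):
    p = p*(1 - r/K*p + r)
    values.append(p)
print(values[60], values[300])   # the growth slows down and approaches K = 100
# list_plot(values) reproduces the picture
Larger values of r lead to the much more diverse behaviour mentioned in the footnote below.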
For p close to one and K much greater than r, the right side of the equation (1) is approximately p(n)(1 + r). That is, the behaviour is similar to that of the Malthusian model. On the other hand, if p is almost equal to K, the right-hand side of the equation is approximately p(n). For an initial value of p greater than K, the population size will decrease; for an initial value of p less than K, it will increase.1 3. Combinatorics A typical "combinatorial" problem is to count in how many ways something can happen. For instance, in how many ways can we choose two different sandwiches from the daily offering of a grocery shop? In this situation we first need to decide what we mean by different. Do we then allow the choice of two "identical" sandwiches? Many such questions occur in the context of card games and other games. The solution of particular problems usually involves either some multiplication of partial results (if the individual possibilities are independent) or some addition (if their appearances are disjoint). This is demonstrated in many examples in the problem column (cf. several problems starting with 1.D.2). 1.3.1. Permutations. Suppose we have a set of n (distinguishable) objects, and we wish to arrange them in some order. We can choose the first object in n ways, then the second in n − 1 ways, the third in n − 2 ways, and so on, until we choose the last object, for which there is only one choice. The total number of possible arrangements is the product of these; hence there are exactly n! = n(n − 1)(n − 2) ··· 3 · 2 · 1 distinct orders of the objects. Each ordering of the elements of a set S is called a permutation of the elements of S. The number of permutations of a set with n elements is n!. We can identify the elements of S by numbering them (using the digits from one to n), that is, we identify S with the set S = {1, . . . , n} of n natural numbers. Then the permutations correspond to the possible orderings of the numbers from one to n. Thus we have an example of a simple mathematical theorem, and this discussion can be considered to be its proof. 1This model is called the discrete logistic model. Its continuous version was introduced already in 1845 by Pierre François Verhulst. Depending on the proportions of the parameters r, K and p(0), the behaviour can be very diverse, including chaotic dynamics. There is much literature on this model. 14
f = y(n+1) - 3*y(n)/2 - 1
initial = {y(1):1, y(2):5/2}
To get the solution we finally type rsolve(f, y(n), initial). This gives the answer5 2*(3/2)**n − 2. □ In Chapter 3 we will study recurrences in a more systematic way. Now we continue our observations with a very characteristic example, the Fibonacci numbers, which we will meet several times, see e.g. 3.B.1. These numbers are defined by the recurrence relation involving two preceding values, F_n = F_{n−1} + F_{n−2} (n ≥ 3), with the initial conditions F_1 = 1 and F_2 = 1 (or the same with F_0 = 0). Hence the very first numbers in the Fibonacci sequence (F_n)_{n∈N} are 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, etc.6 We shall call such equations (second order) linear difference equations with constant coefficients in Chapter 3, and we will learn how to solve them. 1.B.4. Fibonacci numbers. A 2-composition of a positive integer n is a representation of n as an ordered sum of the numbers 1 and 2. Notice that the order indeed matters; for instance, 3 = 1 + 1 + 1, 3 = 1 + 2 and 3 = 2 + 1, hence the integer 3 has three distinct 2-compositions.
Let C_n be the number of 2-compositions of n ∈ N. Show that C_n = F_{n+1} for n ∈ N. Having Sage as a tool, test (or prove) the claim, too. Solution. Notice that C_0 = 1 = F_1 due to our conventions, since the empty sum is the only one providing the result. Thus we may omit n = 0 from our considerations. Let us denote by C_n(1) and C_n(2) the number of 2-compositions of n ending with 1 and 2, respectively. For n = 1 we have C_1(1) = 1 and C_1(2) = 0, so C_1 = C_1(1) + C_1(2) = 1 + 0 = 1 = F_2. For n = 2 we have C_2(1) = 1 and C_2(2) = 1, so C_2 = C_2(1) + C_2(2) = 1 + 1 = 2 = F_3. Hence the claim holds for n = 1, 2. Suppose now that the integer n satisfies n ≥ 3. Assume first that a 2-composition of n ends with 1. In this case, we obtain a 2-composition of the integer n − 1 by deleting this 1. Conversely, by appending a 1 to a 2-composition of n − 1, we get a 2-composition of n that ends with 1. Thus we conclude that C_n(1) = C_{n−1}. Similarly, if a 2-composition of n ends with 2, then we get a 2-composition of the integer n − 2 by deleting this 2. However, by appending either a 2 or two 1s to the end of a 2-composition of n − 2, we obtain a 2-composition of n ending with 2 or with 1, respectively. Since the latter case has been counted already above, we conclude that C_n(2) = C_{n−2}. Altogether we 5Recall that a**n in Sage means a^n. 6The Fibonacci numbers have a long and rich history, and they appear in many applications. These numbers were first discussed by Acharya Pingala, an Indian poet and mathematician (around 200 BC), when counting the possible patterns of poetry forms based on syllables of two lengths. The Italian mathematician Leonardo di Pisa (also called Bonacci, Pisano, or Fibonacci) explained them to the western community in his famous book Liber Abaci in 1202. There, these numbers were used to explain the rabbit reproduction question. They also arise in combinatorics and graph theory, and we shall meet them several times, e.g. in Chapters 3 and 13. CHAPTER 1. INITIAL WARMUP Number of permutations Proposition. The number p(n) of distinct orderings of a finite set with n elements is given by the factorial function: (1) p(n) = n! Suppose S is a set with n elements and we wish to choose and arrange in order just k of the members of S, where 1 ≤ k ≤ n. This is called a k-permutation without repetition of the n elements. The same reasoning as above shows that this can be done in v(n, k) = n(n − 1)(n − 2) ··· (n − k + 1) = n!/(n − k)! ways. The right-hand side of this result also makes sense for k = 0 (there is just one way of choosing nothing), and for k = n, since 0! = 1. Now we modify the problem to the case where the order of selection is immaterial. 1.3.2. Combinations. Consider a set S with n elements. A k-combination of the elements of S is a selection of k elements of S, 0 ≤ k ≤ n, where the order does not matter. For k ≥ 1, the number of possible results of a sequential choice of our k elements is n(n − 1)(n − 2) ··· (n − k + 1) (a k-permutation). We obtain the same k-tuple in k! distinct orders. Hence the number of k-combinations is n(n − 1)(n − 2) ··· (n − k + 1)/k! = n!/((n − k)! k!). If k = 0, the same formula is still true, since 0! = 1, and there is just one way to select nothing. Combinations Proposition. The number c(n, k) = (n k) of combinations of the k-th degree among n elements, where 0 ≤ k ≤ n, is (1) c(n, k) = n(n − 1) ··· (n − k + 1)/(k(k − 1) ··· 1) = n!/((n − k)! k!). We pronounce the binomial coefficient (n k) as "n over k" or "n choose k".
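Both counts are available in Sage directly; a quick hedged check of the formulas above, with n = 10, k = 3 chosen arbitrarily:
# c(10,3) in two ways, and v(10,3) = 10!/7!
n, k = 10, 3
print(binomial(n, k), factorial(n)/(factorial(k)*factorial(n - k)))   # 120 120
print(factorial(n)/factorial(n - k))                                  # 720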
The name stems from the binomial expansion, which is the expansion of (a + b)^n. If we expand (a + b)^n, the coefficient of a^k b^{n−k} is the number of ways to choose a k-tuple of parentheses from the n parentheses in the product (from these parentheses we take a, from the others we take b). Therefore we have (2) (a + b)^n = ∑_{k=0}^{n} (n k) a^k b^{n−k}. Note that only distributivity, commutativity and associativity of multiplication and addition were necessary. The formula (2) therefore holds in every commutative ring. 15 get C_n = C_n(1) + C_n(2) = C_{n−1} + C_{n−2}, and this coincides with the Fibonacci recurrence. Thus the difference may be only in the initial conditions; but we have already verified C_1 = F_2 and C_2 = F_3, and this concludes the proof. To test C_n = F_{n+1} via Sage we will use the observed recurrence relation together with the initial conditions, which we present via Sage. Notice how Sage imports tools from Python packages; here Sage learns how to solve recurrences. Of course, c(n) will be listed very fast, after being resolved in the 6th line:
from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n) - a(n-1) - a(n-2)
initial = {a(1):1, a(2):2}
c(n) = rsolve(f, a(n), initial)
all(c(k) == fibonacci(k+1) for k in (0 .. 120))
This returns True and so verifies our claim. Inside the cell we used the function fibonacci(n), which returns the nth Fibonacci number, and the function all, which checks that all of the stated equalities hold. Later, in 1.G.20, the reader can find a visual presentation of C_n. □ Recurrence relations can be much more complicated than those of first or second order. To highlight the case, we present an example of order 3, and another one involving more equations in more variables. In the latter case, although we are not able to evaluate an arbitrary term of the presented sequence P(k,l) explicitly, we succeed in answering the question using the relevant recurrence relations. This is an example of the so-called partial difference equations, since the terms of the sequence are indexed by two independent variables k, l. Further applications of recurrence relations appear in the final section of this chapter. 1.B.5. How many words of length 12 can be constructed using only the letters A and B, under the assumption that they do not contain the string BBB? Solution. Let a_n denote the number of words of length n consisting of the letters A and B, but without BBB as a substring. The words of length n > 3 that satisfy the given condition end either with the letter A, or with the two letters AB, or with the three letters ABB. (Notice that endings in AA are covered by the first case, ABB is the only admissible ending of the form ·BB, and the remaining possibility BAB of the last three letters is covered already.) There are a_{n−1} admissible words ending with A (observe that preceding the last A there can be an arbitrary word of length n − 1 satisfying our condition), and analogously for the two remaining groups. Thus a_n (n > 3) satisfies the recurrence relation a_n = a_{n−1} + a_{n−2} + a_{n−3} CHAPTER 1. INITIAL WARMUP
If we expand the right-hand side of (2), we obtain (n k) + (n k+1) = n!/(k!(n − k)!) + n!/((k + 1)!(n − k − 1)!) = ((k + 1) n! + (n − k) n!)/((k + 1)!(n − k)!) = (n + 1)!/((k + 1)!(n − k)!), which is the left-hand side of (2). In order to prove (3), we use mathematical induction. Mathematical induction consists of two steps. In the initial step, we establish the claim for n = 0 (in general, for the smallest n for which the claim should hold). In the inductive step, we assume that the claim holds for some n (and all smaller numbers), and use this to prove the claim for n + 1. The principle of mathematical induction then asserts that the claim holds for every n. The claim (3) clearly holds for n = 0, since (0 0) = 1 = 2^0. It holds also for n = 1. Now assume that the claim holds for some n ≥ 1. We must prove the corresponding claim for n + 1 using the claims (2) and (3). We calculate ∑_{k=0}^{n+1} (n+1 k) = ∑_{k=0}^{n+1} ((n k−1) + (n k)) = ∑_{k=−1}^{n} (n k) + ∑_{k=0}^{n+1} (n k) = 2^n + 2^n = 2^{n+1}. Note that the formula (3) gives the number of all subsets of an n-element set, since (n k) is the number of all its subsets of size k. Note also that (3) follows from 1.3.2(2) by choosing a = b = 1. To prove (4) we again employ induction, as we did for (3). For n = 0 the claim clearly holds. The inductive assumption says that (4) holds for some n. We calculate the corresponding sum for n + 1 using (2) and the inductive assumption. We obtain 16 with a_1 = 2, a_2 = 4, and a_3 = 7. Using this relation we can now compute a_12. The answer is 1705. Of course, we can get this by a simple modification of the previous Sage cell; give it a try! □ 1.B.6. After the first quarter, the score of a basketball match between the national teams of Russia and the Czech Republic is 12 : 9 for the Czech team. In how many ways could the score have developed? Solution. We can divide all possible evolutions of a quarter with final score k : l into six mutually exclusive possibilities, according to which team scored last, and how many points that score was worth (1, 2 or 3). If we denote by P(k,l) the number of ways in which the score could have developed for a quarter that ended with k : l, then for k, l ≥ 3 the following recurrence relation holds: P(k,l) = P(k−3,l) + P(k−2,l) + P(k−1,l) + P(k,l−1) + P(k,l−2) + P(k,l−3). By the symmetry of the problem, P(k,l) = P(l,k). In addition, for k ≥ 3 we see that P(k,2) = P(k−3,2) + P(k−2,2) + P(k−1,2) + P(k,1) + P(k,0), P(k,1) = P(k−3,1) + P(k−2,1) + P(k−1,1) + P(k,0), P(k,0) = P(k−3,0) + P(k−2,0) + P(k−1,0). These relations, along with the initial conditions P(0,0) = 1, P(1,0) = 1, P(2,0) = 2, P(3,0) = 4, P(1,1) = 2, give P(2,1) = P(1,1) + P(0,1) + P(2,0) = 5, P(2,2) = P(0,2) + P(1,2) + P(2,1) + P(2,0) = 14. Therefore, by repeatedly using the above equations, we arrive at nearly 500 million options, since P(12,9) = 497178513. □ C. Combinatorics In this section we use natural numbers to label items that may or may not be distinguishable, and address questions about the number of ways certain events can occur. Let us begin with a straightforward problem involving permutations, which will demonstrate the usefulness of mathematical software packages such as Matlab, Maple, Sage, etc. We expect the reader to be familiar with the basic concepts such as permutations and combinations, as well as with adding or multiplying the numbers of independent options. If not, have a look into the other column in 1.3.1.
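The identities being proved in the other column are easy to test numerically; a small hedged Sage check for the first few n:
# Check identities (3) and (4) of 1.3.3 for n = 0,...,7
for n in range(8):
    assert sum(binomial(n, k) for k in range(n + 1)) == 2^n
    assert sum(k*binomial(n, k) for k in range(n + 1)) == n*2^(n - 1)
print("identities (3) and (4) hold for n < 8")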
CHAPTER 1. INITIAL WARMUP ∑_{k=0}^{n+1} k (n+1 k) = ∑_{k=0}^{n+1} k ((n k−1) + (n k)) = ∑_{k=−1}^{n} (k+1) (n k) + ∑_{k=0}^{n+1} k (n k) = ∑_{k=0}^{n} (n k) + ∑_{k=0}^{n} k (n k) + ∑_{k=0}^{n} k (n k) = 2^n + n 2^{n−1} + n 2^{n−1} = (n + 1) 2^n. This completes the inductive step, and the claim is proved for all natural n. □ The second property above allows us to write down all the binomial coefficients in the Pascal triangle.2 Here, every coefficient is obtained as the sum of the two coefficients situated right "above" it:
n = 0 : 1
n = 1 : 1 1
n = 2 : 1 2 1
n = 3 : 1 3 3 1
n = 4 : 1 4 6 4 1
n = 5 : 1 5 10 10 5 1
Note that the individual rows contain the coefficients of the individual powers in the expansion (2). For instance, the last row given says (a + b)^5 = a^5 + 5a^4 b + 10a^3 b^2 + 10a^2 b^3 + 5ab^4 + b^5. 1.3.4. Choice with repetitions. An ordering of n elements, some of which are indistinguishable, is called a permutation with repetitions. Among n given elements, suppose there are p_1 elements of the first kind, p_2 elements of the second kind, ..., p_k of the k-th kind, where p_1 + p_2 + ··· + p_k = n. Then the number of permutations with repetitions of these elements is denoted by P(p_1, . . . , p_k). We consider orderings which differ only in the order of indistinguishable elements to be identical. Elements of the ith kind can be ordered in p_i! ways, thus we have Permutations with repetitions The number of permutations with repetitions is P(p_1, . . . , p_k) = n!/(p_1! ··· p_k!). Let S be a set with n distinct elements. We wish to select k elements, 0 ≤ k ≤ n, from S with repetition permitted. This is called a k-permutation with repetition. Since the first selection can be done in n ways, the second again in n ways, and so on, the total number V(n, k) of k-permutations with repetitions is n^k. Hence 2Although the name goes back to Blaise Pascal's treatise from 1653, such a neat triangle configuration of the numbers c(n, k) was known for centuries earlier in China, India, Greece, etc. 17 1.C.1. Presentations of permutations. List all permutations of 4 digits using any computer algebra system of your choice. Think of them as mappings. Solution. There exist 4! = 4 · 3 · 2 · 1 = 24 such permutations, e.g. [1, 2, 3, 4], [1, 2, 4, 3], [1, 3, 2, 4], etc., and with some care one can list all the possible cases. We appreciate Sage, which can do this for us much faster, or at least verify our computation, via the cell
per4 = Permutations(4); per4.list()
The answer is [[1, 2, 3, 4], [1, 2, 4, 3], [1, 3, 2, 4], [1, 3, 4, 2], [1, 4, 2, 3], [1, 4, 3, 2], [2, 1, 3, 4], [2, 1, 4, 3], [2, 3, 1, 4], [2, 3, 4, 1], [2, 4, 1, 3], [2, 4, 3, 1], [3, 1, 2, 4], [3, 1, 4, 2], [3, 2, 1, 4], [3, 2, 4, 1], [3, 4, 1, 2], [3, 4, 2, 1], [4, 1, 2, 3], [4, 1, 3, 2], [4, 2, 1, 3], [4, 2, 3, 1], [4, 3, 1, 2], [4, 3, 2, 1]]. Note that the command per4[12] returns the 13th permutation (Sage counts from zero), which is [3, 1, 2, 4]; similarly one can access any of the 24 permutations directly, without needing to list them all (hence, in this case we do not need the command per4.list()). Recall also that permutations of the elements of a finite set S are essentially bijections f : S → S. To realize this perspective via Sage and interpret permutations as functions, let us fix the 13th permutation as above. Then, the cell
Then, the cell per4=Permutations(4) Pm = Permutation(per4[12]) for i in Pm: print(Pm.index(i)+1,"->",i) returns the desired result, i.e., 1 -> 3; 2 -> 1; 3 -> 2; 4 -> 4 Think of the syntax of these commands in Sage (e.g., the command per4[12] provides simply a list, while the second line keeps Pm to be the permutation object in Sage; the index method returns the index of the first appearance of the value - which is OK with permutations, but the method enumerate applied to lists works in general; the +1 is necessary since the indices run from 0). Experiment with additional examples yourself! For instance, composition of permutations is encoded by the standard ∗ operator – what would be the Pm ∗ Pm permutation? Hint: follow the above arrows to compose the mappings, and check in Sage. □ 1.C.2. During a conference, 8 speakers are scheduled. Determine the number of all possible orderings in which two given speakers do not speak one right after the other. Solution. Denote the two given speakers by A and B, respectively. If B follows directly after the speaker A, we can consider it as a speech by a single speaker AB. The number of all orderings where B speaks directly after A is therefore 7!, CHAPTER 1. INITIAL WARMUP k-permutations with repetitions V (n, k) = nk . If we are interested in a choice of k elements without taking care of order, we speak of k-combinations with repetitions. At first sight, it does not seem to be easy to determine the number. We reduce the problem to another problem we have already solved, namely combinations without repetitions: Combinations with repetitions Theorem. The number of k-combinations with repetitions from n elements equals for every k ≥ 0 and n ≥ 1 ( n + k − 1 k ) . Proof. Label the n elements as a1, a2, · · · , an. Suppose each element labeled ai is selected ki times, 0 ≤ ki ≤ k, so that k1+k2+· · ·+kn = k. Each such selection can be paired with the sequence of symbols ∗ and | where each ∗ represents one selection of an element and individual boxes are separated by | (therefore there are n − 1 of them). The number of ∗ in the ith box is equal to ki, so we obtain the sequence ∗ · · · ∗ k1 | ∗ · · · ∗ k2 | · · · | ∗ · · · ∗ kn . The other way around, from any such sequence we can determine the number of selections of any element (e.g. the number of ∗ before first | determines k1). Having altogether k symbols ∗ and n − 1 separators | we see that there are ( n + k − 1 n − 1 ) = ( n + k − 1 k ) possible sequences and therefore also the same number of the required selections. □ 4. Probability Now we are going to discuss the last type of function description, as listed in the very end of the subsection 1.1.5. Thus, instead of assigning explicit values of a function, we shall try to describe the probabilities of the individual options. 1.4.1. What is probability? As a simple example we can use common six-sided dice throwing, with sides labelled as 1, 2, 3, 4, 5, 6. If we describe the mathematical model of such throwing with a “fair” dice, we expect by symmetry that every side occurs with the same frequency. We say that “every side occurs with the probability 1/6”. 18 the number of permutations of seven elements. By symmetry, the number of all orderings where A speaks directly after B is also 7!. Since the number of all possible orderings of eight speakers is 8!, the solution is 8! − 2 · 7!. □ 1.C.3. 
1.C.3. How many rearrangements of the letters of the word "problem" are there, such that: (a) the letters b, r are next to each other; (b) the letters b, r are not next to each other. ⃝ 1.C.4. k-permutations and k-combinations. Determine the number of codes consisting of 5 digits in a specific order, under the assumption that every digit can be used only once. Deduce that 5-digit codes of this type are much more secure than the corresponding 4-digit codes. Finally, what is the situation if the order of the digits in the code does not matter? Solution. We are interested in codes with 5 digits, where the order of the digits is crucial and every digit appears only once. In other words, we have the set of digits, say S = {0, 1, . . . , 9}, and we want to choose and arrange in order 5 elements of S. In terms of the paragraph 1.3.1, this is a 5-permutation of 10 elements without repetition, which can be done in u(10, 5) = 10!/(10 − 5)! = 10 · 9 · 8 · 7 · 6 = 30240 ways. In Sage we can compute u(10, 5) by typing
factorial(10)/factorial(5)
For a 4-digit code with the same characteristics, the corresponding number of permutations is u(10, 4) = 5040, hence the claim. (Of course, we see directly from the formulas that the latter number is 6 times smaller than the former!) Assume now that the order plays no role. Then codes of the form 12345 and 54123, for instance, represent the same code, hence one finally gets far fewer possibilities (codes of this type are very unsafe). In particular, this case yields the definition of a 5-combination of 10 elements, whose number, according to 1.3.2, is given by c(10, 5) = (10 5) = 10!/(5!(10 − 5)!) = 30240/120 = 252. Sage provides the command binomial(n, k) for the k-combinations of n elements, which in our case applies as
binomial(10, 5)
In these terms, the very first result about u(10, 5) is also obtained by the command binomial(10, 5)*factorial(5). □ 1.C.5. Remark. Notice that Sage does not have a direct command for k-permutations. Let us construct one, by introducing a function called uperm(n, k):7 7Note that usually the cells that we provide in Sage can be copy-pasted without any extra editing in the Sage editor. However, when we introduce functions, as below, the reader should be careful to type the commands exactly as they appear here, since the indentation plays a role. CHAPTER 1. INITIAL WARMUP But when throwing some less symmetric six-faced die, the actual probabilities of the individual results might be quite different. Let us build a simple mathematical model for this. We shall work with parameters p_i for the probabilities of the individual sides, with two requirements: these probabilities have to be non-negative real numbers, and their sum must be one, i.e. p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1. At this time, we are not concerned about the particular choice of the specific values p_i; they are given to us. Later on, in Chapter 10, we shall link probability with mathematical statistics, and then we shall introduce methods for discussing the reliability of such a model for a specific real die. 1.4.2. Classical probability. Let us come back to the mathematical model of the fair die. We consider the sample space Ω = {1, 2, 3, 4, 5, 6} of all possible elementary events (each of them corresponding to one possible result of the experiment of throwing the die). Then we can consider any event as a given subset A of Ω.
For example, A = {1, 3, 5} describes the result of getting an odd number on the resulting side (we count the labels on the sides of the die). Similarly, the set B = A^c = {2, 4, 6} = Ω \ A is the complementary event of getting an even number of points. The probability of each of A and B will be 1/2. Indeed, |A|/|Ω| = 1/2, where |A| means the number of elements of a set A. This leads to the following obvious generalization: Classical probability Let Ω be a finite set with n = |Ω| elements. The classical probability of the event corresponding to any subset A ⊂ Ω is defined as P(A) = |A|/|Ω|. Such a definition immediately allows us to solve problems related to throwing several fair dice simultaneously. Indeed, we may treat this as throwing one die independently several times, and thus multiplying the probabilities. For example, the probability of getting an odd sum of points on two dice is given by adding the probabilities of having an even number on the first one and an odd number on the second one, and vice versa. Thus the probability will be twice 1/2 · 1/2, which is 1/2, as expected. 1.4.3. Probability space. Next, we formulate a more general concept of probability, covering also the unfair die example above. We shall need a finite set Ω of all possible states of a system (e.g. results of an experiment), which we call the sample space. Further, the space of all possible events is given as the set A of all subsets of Ω. Finally, we need the function describing the probabilities of occurrence of the individual events: 19
def uperm(n, k):
    a = factorial(k) * binomial(n, k)
    return a
In these terms we can directly compute u(n, k) for any n and k = 0, 1, 2, . . ., by typing the command uperm(n, k) in the Sage editor. For instance, uperm(10, 5) returns 30240, uperm(10, 1) returns 10, uperm(10, 2) returns 90, uperm(10, 11) returns 0 (as it should, since k > n in this case), etc. 1.C.6. Determine the number of 4-digit codes composed of the digits 1, 3, 5, 6, 7 and 9, under the condition that no digit occurs more than once. Solution. We have 6 distinct digits at our disposal, and we ask: how many distinct (ordered) 4-tuples can be chosen from them? Obviously the result is u(6, 4) = 6!/(6 − 4)! = 6 · 5 · 4 · 3 = 360. The reader may verify the computation in Sage with the function uperm(n, k) constructed above, that is, via the command uperm(6, 4). □ 1.C.7. Six men at a meeting shake hands with one another. How many handshakes happen in total? Solution. We understand that each pair of men shakes hands once. Thus the number of handshakes equals the number of ways of choosing an unordered pair among 6 elements, that is, the number of combinations c(6, 2). Let us present the answer via Sage:
c62 = binomial(6, 2); c62
which gives 15. □ To summarize, k-permutations u(n, k) take care of the order, while combinations c(n, k) do not. Moreover, u(n, k) = c(n, k) k!. We have noticed that permutations and combinations are most useful when counting possibilities. Let us conclude this paragraph by treating examples where repetitions occur. Notice how the principle of "inclusion and exclusion" may be useful. 1.C.8. k-permutations with repetition. The Greek alphabet consists of 24 letters. How many words of exactly five letters can be composed from it? (Disregarding whether the words have an actual meaning or not.) Solution. For each of the five positions in the word we have 24 possibilities, since the letters may repeat. According to the discussion in 1.3.4, this is nothing but a 5-permutation with repetitions.
Hence the total number of words that can be composed is given by V(24, 5) = 24^5 = 7962624. In Sage we compute this by typing 24^5 or 24**5. □ CHAPTER 1. INITIAL WARMUP Probability function Let us consider a non-empty fixed sample space Ω. The probability function P : A → R satisfies: (1) P(Ω) = 1; (2) 0 ≤ P(A) for all events A; (3) P(A ∪ B) = P(A) + P(B) whenever A ∩ B = ∅. Notice that the intersection A ∩ B describes the simultaneous occurrence of both events, while the union A ∪ B means that at least one of the events A and B occurs. The event A^c = Ω \ A is called the complementary event. There are some further straightforward consequences of the definition, valid for all events A, B: (4) P(A) = 1 − P(A^c); (5) P(∅) = 0; (6) P(A) ≤ 1 for all events A; (7) P(A) ≤ P(B) whenever A ⊂ B; (8) P(A ∪ B) = P(A) + P(B) − P(A ∩ B). The proofs are all elementary. For example, A ∪ A^c = Ω, and thus (3) implies (4). Similarly, we can write A = (A \ B) ∪ (A ∩ B) and A ∪ B = (A \ B) ∪ (B \ A) ∪ (A ∩ B) with disjoint unions of sets on the right-hand sides. Thus P(A) = P(A \ B) + P(A ∩ B) and P(A ∪ B) = P(A \ B) + P(B \ A) + P(A ∩ B) by (3), which implies the last equality. The remaining three claims are even simpler. All these properties correspond exactly to our intuition of how probability should behave. Probability should always be a real number between zero and one. The event Ω includes all possible results of the experiment, so it must have probability one. The empty event of "no result at all" appears with probability zero, the probabilities of disjoint events add up, etc. Of course, the classical probability on the sample space Ω is an example of a probability function. In our more general model, the set A of all subsets is closed under union, intersection and taking complements, and this has been essential 20
Next we have to subtract the choices with exactly one envelope empty, and finally with two empty ones (we still may distribute the stamps freely). We thus obtain (in the second term we choose the empty envelop and distribute the letters, and we have to correct the multiple appearances) C(3, 5) − 3(C(2, 5) + 3) − 3C(3, 5) = (7 2 ) − 3 (6 1 ) + 3 = 6. Thus the resulting number is 6C(3, 5) = 126. If our interpretation was that the stamps are glued only on the non-empty envelops, the result would be the square of the latter number of choices for envelops, i.e., 36. □ D. Probability We proceed with simple exercises related to the concept of classical probability. In many cases we are dealing with experiments having only a finite number of outcomes and we are interested in whether or not the outcome belongs to a subset of favourable outcomes. Then, the probability we are trying to determine equals the number of favourable outcomes divided by the total number of all possible outcomes. Classical probability can be used when we assume, or know, that each CHAPTER 1. INITIAL WARMUP in our exposition above. This will continue in all our discussion on probability in the sequel, where we shall have to allow for more general spaces of events A in the sets of all subsets in the sample space. We will return to this and more serious generalizations in chapter 10. 1.4.4. Summing probabilities. By using mathematical induction, the additivity of probability is easily extended to any (finite) number of mutually exclusive events Ai ⊂ Ω, i = 1, . . . , n. That is, P(∪i∈IAi) = n∑ i=1 P(Ai), whenever Ai ∩ Aj = ∅, for all i ̸= j, i, j = 1, . . . , n. Indeed, 1.4.3(3) is the result for n = 2. If we assume the validity of the formula for some fixed n, then the union of any n + 1 events A0, A1, . . . , An can be split into the union of A0 and A1 ∪. . . An. Then by the induction assumption, together with 1.4.3(3) again, the result follows. In general, the summing of probabilities of event occurrences is much more difficult. The problem is that whenever the events are mutually compatible, the possible results in their intersection are counted multiple times. We have seen the simplest case of two mutually compatible events A and B in 1.4.3(8). For classical probability, it reduces just to counting elements in subsets. Indeed, those elements that belong to both the sets A and B count in the formula P(A ∪ B) = P(A) + P(B) − P(A ∩ B) twice and thus we have to subtract them once. Now, we look at the general case. The approach of interactive inclusion and exclusion (potentially too many) elements in some count is a standard method in combinatorics known as the inclusion-exclusion principle. We shall exploit this method in our general finite probability spaces. As we shall see, this is an example of a mathematical theorem, where the hard part is to understand (and find) the formulation of the result. The proof is then relatively simple. The diagram explains the situation for three sets A, B, C for classical probability: P(A∪B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C). Clearly, the probabilities are given by first counting the elements in each set and adding. Then we subtract the sum 21 outcome has the same probability of happening (for instance, fair dice throwing, etc). 1.D.1. 
1.D.1. Prove that the following two events happen with the same probability (independently of each other): (a) the roll of a die results in a number greater than 4; (b) when throwing two dice, and assuming that their total sum equals 7, at least one of them resulted in a four. Solution. For the first case, there are six possible outcomes (the set {1, 2, 3, 4, 5, 6}), two of which are favourable ({5, 6}). Thus the probability is 2/6 = 1/3. For the second event, one can again apply classical probability (where the condition is interpreted as a restriction of the probability space). The restricted space has 6 elements, and exactly 2 of those are favourable to the given event. Thus the answer is again 2/6 = 1/3. Notice that if we treat the rolling of two dice as two independent rolls of one die and view the results independently of the order (i.e., consider just the total), then the probabilities of the individual sums are not equal. For instance, 12 and 2 each have probability 1/36, while 10 appears with probability 3/36, etc. Thus, throwing two dice is not a case of classical probability on the totals in general, but we were still right to treat the resulting sum 7 via classical probability. Why? □ 1.D.2. Helen wants to give John and Mary five pears and six apples. If we consider the pears indistinguishable, and likewise the apples, what is the probability that either John or Mary gets nothing? Solution. Let us assume that the pears and apples are distributed independently and that each option has the same probability. Of course, one of the two mentioned persons may then get nothing. We have to count in how many ways the distribution can happen. The five pears can be divided in six ways (determined by the number of pears given to John; the rest goes to Mary). Similarly, the six apples can be divided in seven ways. These divisions are independent, hence we can apply the product rule, which states that the number of ways two independent events can occur together is obtained by multiplying the individual counts; i.e., we have 6 · 7 = 42 options in total. Two of them result in John or Mary getting nothing. Thus the probability is 2/42 = 1/21 (relatively high). Notice how different this is from dealing with 11 distinguishable objects, where 11 independent tosses of the same coin all giving the same result would have probability 2 · 2^{−11} (extremely low). □ 1.D.3. We choose randomly a group of five people from a group of eight men and four women. What is the probability that there are at least three women in the chosen group? ⃝ 1.D.4. Remark. The classical probability concept can often be applied, but we have to be careful. Imagine we should compute the probability that the reader of this remark will win at least 25 million euro in EuroLotto during the next week. First, notice that the formulation of the problem is incomplete. For CHAPTER 1. INITIAL WARMUP of those in the intersections of pairs of sets, since those elements are counted twice. But we must then add back the number of elements in the intersection of all three. We shall now follow the same idea in order to write down the formula in the following theorem. It seems plausible that such a formula should work with proper coefficients on the sums of probabilities of intersections of more and more events among A_1, . . . , A_k, at least in the case of classical probability. The reader will perhaps appreciate that a quite straightforward mathematical induction verifies the theorem in full generality. 1.4.5. Theorem. Let A_1, . . . , A_k ∈ A be arbitrary events over the sample space Ω with a set of events A.
Then P(∪_{i=1}^{k} A_i) = ∑_{i=1}^{k} P(A_i) − ∑_{i=1}^{k−1} ∑_{j=i+1}^{k} P(A_i ∩ A_j) + ∑_{i=1}^{k−2} ∑_{j=i+1}^{k−1} ∑_{ℓ=j+1}^{k} P(A_i ∩ A_j ∩ A_ℓ) − ··· + (−1)^{k−1} P(A_1 ∩ A_2 ∩ ··· ∩ A_k). Proof. For k = 1 the claim is obvious. The case k = 2 is the same as the equality 1.4.3(8), which we have already proved. Assume that the theorem holds for any number of events up to k, where k ≥ 1. In the induction step we can work with the formula for k + 1 events, where the union of the first k of them plays the role of A in the equation 1.4.3(8), and the remaining event plays the role of B: P(∪_{i=1}^{k+1} A_i) = P((∪_{i=1}^{k} A_i) ∪ A_{k+1}) = ∑_{j=1}^{k} (−1)^{j+1} ∑_{1≤i_1<···} ... > 0, then P(A_1 ∩ A_2) = P(A_2) P(A_1|A_2) = P(A_1) P(A_2|A_1). All these numbers express (in different manners) the probability that both events A_1 and A_2 occur. For instance, in the last case we first look whether the first event occurred; then, assuming it has, we look whether the second also occurs. Similarly, for three events A_1, A_2, A_3 satisfying P(A_1 ∩ A_2 ∩ A_3) > 0 we obtain P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2). The probability that three events occur simultaneously can thus be computed as follows: compute the probability that the first occurs; then the probability that the second occurs under the assumption that the first has occurred; then the probability that the third occurs under the assumption that both the first and the second have occurred; finally, multiply the results together. In general, if we have k events A_1, . . . , A_k satisfying P(A_1 ∩ ··· ∩ A_k) > 0, then the theorem says P(A_1 ∩ ··· ∩ A_k) = P(A_1) P(A_2|A_1) ··· P(A_k|A_1 ∩ ··· ∩ A_{k−1}). Notice that our condition P(A_1 ∩ ··· ∩ A_k) > 0 implies that all the hypotheses in the latter formula have non-zero probabilities, and thus all the conditional probabilities make sense. Indeed, each of the events in the hypotheses is at least as big as the full intersection, and thus its probability is at least as big, see 1.4.3(7). 1.4.9. Geometric probability. In practical problems, the sample space may not be a finite set, and the set A of all events need not be the entire set of all subsets of Ω. Generalising probability to such situations is beyond our scope now, but we can at least give a simple illustration. Consider the plane R^2 of pairs of real numbers and a subset Ω with known area. Events are represented by subsets A ⊂ Ω (again with known areas); for the event set A we consider some suitable system of subsets for which we can determine the area. An event A then occurs if a randomly chosen point from Ω belongs to the subregion determined by A; otherwise the event does not occur. Consider the problem of randomly choosing two numbers a < b in the interval [0, 1] ⊂ R, where all values a and b are chosen with equal probability. The question is: what is the probability that the interval (a, b) has length at least one half? The choice of the pair (a, b) is actually the choice of a point [a, b] inside the triangle Ω with vertices [0, 0], [0, 1], [1, 1] (see the diagram below). We can imagine this as a description of a problem where a very tired guest at a party tries to divide a sausage by two cuts into three pieces for himself and his two friends. What is the probability that the middle part will be at least half of the sausage? 25 none of the given 500 people will die, occurs by the product rule. This is because the events are independent; hence we obtain (1 − 12/10^5)^5000.
It follows that the probability of the complementary event, that is, that some of the chosen people will die, equals 1 − (1 − 12/10^5)^5000 ≈ 0.4512. □ Remark. The model we have used in the previous exercise to describe the given situation is only approximate. The complication arises from the condition that every person in the sample has the same probability of dying, with the probabilities derived from the total number of deaths per year. However, the number of deaths changes yearly, and the population changes as well. We may handle one of the possible inaccuracies as follows. Suppose that 1200 persons per year die, so in ten years 12000 persons die. The probability that a certain person dies within ten years is then estimated by 12000/10^7. Thus, the probability that a specific person will not die within ten years is 1 − 12/10^4 (the first two terms of the binomial expansion of (1 − 12/10^5)^10). In total, we obtain the following estimate of the probability: 1 − (1 − 12/10^4)^500 ≈ 0.4514. Observe that both estimates are very close to each other. We have already met probabilities based on the assumption that other events have happened; we say "under a certain hypothesis". We talk about conditional probability, see 1.4.8. As we will see below, such estimates can also be successfully derived via Sage (this is related to the "Monte Carlo" computational method, cf. 1.4.10). 1.D.11. Sage simulation. What is the probability that when rolling two dice the sum is 7, if we know that neither of the rolls resulted in a 2? Estimate the solution via Sage. Solution. Let B be the event that neither of the rolls results in a 2, and let A be the event "the sum is 7". The set of all possible outcomes is again denoted by Ω. Then P(A|B) = P(A ∩ B)/P(B) = (|A ∩ B|/|Ω|)/(|B|/|Ω|) = |A ∩ B|/|B|. Now, the number 7 can appear as a sum in four ways if there is no 2; that is, |A ∩ B| = 4 and |B| = 5 · 5 = 25. Thus P(A|B) = 4/25 = 0.16. Observe that P(A) = 1/6, so A and B are not stochastically independent. Let us now run a simulation to estimate the resulting probability via Sage. Initially we use the parameter num_rolls and the method random_element (after copying and pasting the cells below, be careful to keep the line breaks and indentation exactly as they appear here):
num_rolls = 100_000
die1 = [IntegerRing().random_element(x=1, y=7) for _ in range(num_rolls)]
die2 = [IntegerRing().random_element(x=1, y=7) for _ in range(num_rolls)]
CHAPTER 1. INITIAL WARMUP Thus we need to determine the area of the subset corresponding to points with b ≥ a + 1/2, that is, the interior of the triangle A bounded by the points [0, 1/2], [0, 1], [1/2, 1]. We find P(A) = (1/8)/(1/2) = 1/4. Similarly, if we ask for the probability that some of the three guests will get at least half of the sausage, then we have to add the probabilities of two other events: B, saying a ≥ 1/2, and C, given by b ≤ 1/2. Clearly they correspond to the lowest and the rightmost top triangles, and thus they also have probability 1/4 each. Thus the requested probability is 3/4. Equivalently, we could have asked for the complementary event "all of them get less than a half", which clearly corresponds to the middle triangle and thus has probability 1/4. Try to answer on your own the question: what is the minimal prescribed length ℓ such that the probability of choosing an interval (a, b) of length at least ℓ is one half? 1.4.10. Monte Carlo methods. One efficient method for computing approximate values is simulation of the relative occurrence of a chosen event.
We present an example. Let Ω be the unit square with vertices [0, 0], [1, 0], [0, 1], and [1, 1], and let A be the intersection of Ω with the unit disc centred at the origin. Then area(A) = π/4. Suppose we have a reliable generator of random numbers a and b between zero and one. We then compute the relative frequency of how often a^2 + b^2 < 1, that is, of [a, b] ∈ A. The result (after a large number of attempts) should approximate the area of the quarter disc, that is, π/4, quite well. (Draw a picture yourself!) Of course, the well-known formula for the area of a circle with radius r is πr^2, where π = 3.14159... It is an interesting question why the area of a circle should be a constant multiple of the square of its radius. We will be able to prove this later, but experimentally we can hint at it by the above approach, using squares of different sizes. Numerical approaches based on such probabilistic principles are called Monte Carlo methods. 5. Plane geometry So far we have been using elementary notions from the geometry of the real plane in an intuitive way. Now we will investigate in more detail how to deal with the need to describe the "position in the plane", and to find relations between the positions of distinct points in the plane. 26 We select only the rolls where neither die resulted in a 2:
good_rolls = []
for x, y in zip(die1, die2):
    if x != 2 and y != 2:
        good_rolls.append((x, y))
Next we calculate the number of rolls with sum 7:
count_sum_7 = 0
for x, y in good_rolls:
    if x + y == 7:
        count_sum_7 += 1
The final probability is given by the ratio:
result = count_sum_7/len(good_rolls)
print(n(result))
Sage's output is 0.162572964225857 (this will differ slightly each time you run the code!), hence the estimate given by Sage is close to the correct answer. □ 1.D.12. Michael has two mailboxes, one at gmail.com and the other at hotmail.com. His username is the same on both servers, but the passwords are different. He does not remember which password corresponds to which server. When typing the password to access his mailbox, he makes a typo with probability 5% (i.e., if he tries to type a specific password, he types what he intended with probability 95%). At the server hotmail.com, Michael typed in the username and a password, but the server told him that something was wrong. What is the probability that he chose the correct password but just mistyped it? (We assume that the username is always typed correctly, and that a typo cannot turn a wrong password into the right one.) Solution. Let A be the event that Michael typed a wrong password at hotmail.com. This event is the union of two disjoint events: • A_1: he wanted to type the correct password and mistyped; • A_2: he wanted to type the wrong password (the one from gmail.com), whether he mistyped it or not. We are looking for the conditional probability P(A_1|A), which according to paragraph 1.4.8 is given by P(A_1 ∩ A)/P(A) = P(A_1)/P(A_1 ∪ A_2) = P(A_1)/(P(A_1) + P(A_2)). Here we have used the fact that P(A_1 ∪ A_2) = P(A_1) + P(A_2), since A_1 and A_2 are disjoint. So we only need to compute the probabilities P(A_1) and P(A_2). The event A_1 is the intersection of two independent events: Michael wanted to type the correct password, and Michael mistyped. According to the problem statement, the probability of the first event is 1/2 and the probability of the second is 1/20. In total, since the events are independent, we get P(A_1) = (1/2) · (1/20) = 1/40. In addition, P(A_2) = 1/2, and thus
CHAPTER 1. INITIAL WARMUP Our tools will be mappings. We will mainly consider mappings F which, to (ordered) pairs of values (x, y), assign pairs (w, z) = F(x, y). Such a mapping consists of two functions w(x, y) and z(x, y), each depending on the two arguments x, y. This will also serve as a gentle introduction to the part of mathematics called linear algebra, with which we will deal in the subsequent three chapters. 1.5.1. Vector space R^2. We view the "plane" as the set of pairs of real numbers (x, y) ∈ R^2. We will call these pairs vectors in R^2. For such vectors we can define addition "coordinate-wise", that is, for vectors u = (x, y) and v = (x′, y′) we set u + v = (x + x′, y + y′). Since all the properties of commutative groups hold for the individual coordinates, they hold for our new vector addition too. In particular, there exists a zero vector 0 = (0, 0) such that v + 0 = v. We use the same symbol 0 for the vector and for the number zero on purpose; the context will always make it clear which "zero" it is. Next we define scalar multiplication of vectors. For a ∈ R and u = (x, y) ∈ R^2, we set a · u = (ax, ay). Usually we will omit the symbol · and use the juxtaposition a v to denote the scalar multiple of a vector. We can directly check further properties of scalar multiplication by a or b and addition of vectors u and v, for instance a(u + v) = a u + a v, (a + b)u = a u + b u, a(b u) = (ab)u. We use the same symbol + for both vector addition and scalar addition. Now we take a very important step. Define the vectors e_1 = (1, 0) and e_2 = (0, 1). Every vector can then be written uniquely as u = (x, y) = x e_1 + y e_2. The expression on the right is called a linear combination of the vectors e_1 and e_2 (with coefficients x and y). The pair of vectors e = (e_1, e_2) is called the standard basis of the vector space R^2. As shown in the picture, these operations are easy to imagine if we consider the vectors v to be arrows starting at the origin 0 = (0, 0) and ending at the position (x, y) in the plane. 27 P(A) = P(A_1) + P(A_2) = 1/40 + 1/2 = 21/40. This gives P(A_1|A) = P(A_1)/P(A) = (1/40)/(21/40) = 1/21. With the Sage cell
PA1 = 1/2*1/20; PA2 = 1/2; PA = PA1 + PA2
PA1condA = PA1/PA
print("P(A1|A) =", PA1condA)
we obtain the verification P(A_1|A) = 1/21. □ 1.D.13. Consider a deck of 32 cards. If we draw one card twice, what is the probability that the second drawn card is an ace, first if we return the first card to the deck, and then if we do not return it (so that there are 31 cards in the deck for the second draw)? Solution. If we return the card to the deck, we are just repeating the experiment, which has 32 possible results (all with the same probability), and exactly four of them are favourable. Thus we see that the probability is p = 1/8. In the second case, when we do not return the card, the probability is the same. Indeed, notice that when drawing all the cards one by one, the probability of getting an ace as the first card is identical to the probability of getting an ace in the second draw. We can also apply conditional probability, splitting the event into the two disjoint options of drawing or not drawing an ace as the first card: p = (4/32) · (3/31) + (28/32) · (4/31) = 1/8. □ The concept of geometric probability is useful if we can identify the individual events with positions on a line, in a plane, in a space, etc., while the sample space Ω is a region with known length, area, or volume, respectively. The favourable positions are then represented by measurable objects, too.
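Such ratios of measures can also be estimated experimentally, in the spirit of 1.4.10; a minimal hedged Sage sketch estimating the quarter-disc area π/4 by random sampling:
# Monte Carlo estimate of pi/4, the area of the quarter disc in the unit square
import random
trials = 10^5
hits = sum(1 for _ in range(trials)
           if random.random()^2 + random.random()^2 < 1)
print(n(4*hits/trials), n(pi))   # the first number approximates pi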
In analogy to the classical probability, we define the probability as the ratio of the length, area, volume, etc., to that of Ω, see ??. This reflects the idea that all subregions of the same measure are hit with the same probability. We present one example (and we include more of them in the final section of this chapter, starting at page 61). 1.D.14. In a certain country, a bus departs from town A to town B once a day, at a random time between eight a.m. and eight p.m. Once a day, in the same time interval, another bus departs in the opposite direction. The trip in either direction takes five hours. Compute the probability that the buses meet, assuming they use the same route. Solution. The sample space is a square 12 × 12. If we denote the departure times of the buses by x and y respectively, then they meet on the route if and only if |x − y| ≤ 5. This inequality determines the region of "favourable events" in the square, the complement of the union of two right-angled isosceles triangles with legs of length 7; see also the figure below. CHAPTER 1. INITIAL WARMUP The addition of two such arrows is then given by the parallelogram law: given two arrows starting at the origin, their sum is the diagonal arrow (also starting at the origin) of the parallelogram with the two given arrows as adjacent sides. Multiplication by a scalar a corresponds to stretching the arrow to its a-multiple. This includes negative scalars, where the direction of the vector is reversed. If we choose any two non-zero vectors u, v such that neither of them is a multiple of the other, then they form a basis of R^2, too, i.e. every other vector can be written (uniquely) as a linear combination of them. This seems clear from the picture, and we may easily write down the two equations for the coefficients and solve them (as we shall deduce in a while). 1.5.2. Points in the plane. In geometry, we should distinguish between the points in the plane (as for instance the chosen origin O above) and the vectors, the arrows describing the difference of two such points. We will work in fixed standard coordinates, that is, with pairs of real numbers; but for better understanding we will now distinguish vectors, written in parentheses and denoted for a moment by boldface letters like u, v, from points, whose coordinates are written in brackets. Points themselves are denoted by capital Latin letters. Even if we view the entire plane as pairs of real numbers in R^2, we may understand the addition of two such pairs as follows. The first pair of coordinates describes a point P = [x, y], while the other denotes a vector u = (u_1, u_2). Their sum P + u corresponds to adding the (arrow) vector u to the point P. If we fix the vector u, we call the resulting mapping P = [x, y] → P + u = [x + u_1, y + u_2] the shift of the plane (or translation) by the vector u. Thus the vectors in R^2 can be understood more abstractly as the shifts of the plane (sometimes called free vectors in elementary geometry texts). The standard coordinates on R^2, understood as pairs of real numbers, are not the only ones. We can put a coordinate system on the plane as we choose. 28 Its total area is 49, so the area of the "favourable part" is 144 − 49 = 95. The probability is p = 95/144 ≈ 0.66. □ E. Plane geometry In this section we handle elementary geometric objects in the real plane R^2, including points, vectors, lines, circles, and polygons. Our explanation will be supported by Sage.
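Sage handles the coordinates of points and of shift vectors in the same way; a small hedged illustration of the point-plus-vector arithmetic of 1.5.2, with toy data of our own:
# The point P = [2, 0] shifted by the vector u = (3, 2)
P = vector([2, 0]); u = vector([3, 2])   # both modelled as coordinate vectors
print(P + u)                             # (5, 2)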
E. Plane geometry

In this section we handle elementary geometric objects in the real plane R^2, including points, vectors, lines, circles, and polygons. Our explanation will be supported by Sage. We begin with a series of standard tasks related to the fundamental notion of lines. Recall that a line in the plane is given "implicitly" by an equation ax + by = c (the line is the set of all points [x, y] given by the solutions, with fixed parameters a, b, c). We shall mostly write ℓ, p, q for lines. Lines can also be "parametrized" by equations of the form x = x0 + t v1 and y = y0 + t v2, where P = [x0, y0] is a point on the line, (v1, v2) is the direction vector, and t ∈ R is a parameter, see 1.5.3 for more details.

1.E.1. Points versus vectors. Points in the plane will be denoted by capital letters, as P = [x, y], where x, y ∈ R are the coordinates, while the vectors shifting P to Q = [x + u1, y + u2] will be written as u = (u1, u2). This reflects the need to distinguish the "position in the plane" P from the "change of position" u. Notice that the same notation is used for closed and open intervals on the real line, cf. Chapter 5, but we do not see much space for confusion here. The reader is advised to look carefully at the explanation starting in 1.5.2, to fully appreciate working with both concepts as couples of reals, as soon as we fix our coordinates. This approach will be crucial when dealing with affine and Euclidean geometries in Chapter 4. We shall also adopt the matrix calculus, which brings the notation for both points and vectors into the form of columns, i.e., instead of u = (x, y) or P = [x, y], we often write the column vector (x, y)^T.

1.E.2. Determine the line ℓ given by {x = 2 − t, y = 1 + 3t : t ∈ R} implicitly. Then present an illustration of ℓ using Sage for those x with −5 ≤ x ≤ 5.

Solution. We need to find the implicit representation of ℓ, and this can be done by eliminating the parameter t. Solving the first equation for t gives t = 2 − x. Substituting this into the second equation gives y = 1 + 3(2 − x), that is, y + 3x − 7 = 0, which is the solution. To plot a line in Sage, there are many alternatives. Here we use the explicit form y = −3x + 7, and then the code can be

Coordinates in the plane R^2: Choose any point in the plane and call it the origin O. All other points P in the plane can be identified with the vectors (arrows) \overrightarrow{OP} with their tails at the origin. Choose any point other than O and call it E1. This defines the vector e1 = \overrightarrow{OE_1} = (1, 0). Choose another point E2 so that O, E1, E2 are distinct and not collinear. This defines the vector e2 = \overrightarrow{OE_2} = (0, 1). Then every point P = (a, b) in the plane can be described uniquely as P = O + a e1 + b e2 for real a, b, or in vector notation, \overrightarrow{OP} = a e1 + b e2.

Translation, by adding a fixed vector, can be used either to shift the coordinate system (including the origin), or to shift sets of points in the plane. Notice that the vector corresponding to the shift of the point P into the point Q is given as the difference Q − P (in any coordinates). Thus we shall also use this notation for the vector \overrightarrow{PQ} = Q − P. For each choice of coordinates, we have two distinct lines, the two axes; the origin is their point of intersection. The other way round, each choice of two non-parallel lines, together with scales on each of them, defines coordinates in the plane. These are called affine coordinates. Clearly each nontrivial triangle in the plane with vertices O, E1, E2 defines coordinates in which this triangle is given by the points [0, 0], [1, 0], [0, 1]. Thus we may say that in the geometry of the plane, "all nontrivial triangles are the same, up to a choice of coordinates".

1.5.3.
Lines in the plane. Every line is parallel to a (unique) line through the origin. To define a line, we therefore need two ingredients. One is a nonzero vector which describes the direction of the line; call it v = (v1, v2). The other is a point P0 = [x0, y0] on the line. Every point on the line is then of the form P(t) = P0 + t v, t ∈ R.

Parametric description of a line: We may understand the line p as the set of all multiples of the vector v, shifted by the vector (x0, y0). This is called the parametric description of the line:
p = {P ∈ R^2 ; P = P0 + t v, t ∈ R}.
The vector v is called the direction vector of the line p.

ell=-3*x+7
plot(ell,-5,5,color="black",thickness=2)

Or simply type ell = -3*x + 7; plot(ell, -5, 5), or plot(-3*x+7, -5, 5), without the options for the colour and the thickness of the line. The figure that Sage returns via the very first choice is given here: □

1.E.3. Consider the lines p, q given in parametric form as [2, 0] + t(3, 2) and [−1, 2] + s(1, 3), with t, s ∈ R, respectively. Present them in implicit form, and examine whether they intersect. If they do, determine the (unique) intersection point explicitly. Verify your answer via Sage.

Solution. The coordinates of the points on the first line are given by the parametric equations {x = 2 + 3t, y = 0 + 2t}. Eliminating t from the equations, we obtain the implicit form of p: 2x − 3y − 4 = 0. Similarly, the points on the line q are given by {x = −1 + s, y = 2 + 3s}. By eliminating s, we get the implicit form of q, namely 3x − y + 5 = 0. The solution (x*, y*) of the system of equations 2x − 3y − 4 = 0, 3x − y + 5 = 0, if any, determines the coordinates of the unique intersection point P = [x*, y*] of p and q. We compute P = [−19/7, −22/7]. As we have seen, Sage offers the command solve; here its application comes with the cell

x, y=var("x, y")
eq1=2*x-3*y-4; eq2=3*x-y+5
solve([eq1==0, eq2==0], x, y)

which returns [[x == (-19/7), y == (-22/7)]]. As for the plot that joins both lines, we have applied the Sage cell

In the chosen coordinates, the point P(t) = [x(t), y(t)] is given as x = x(t) = x0 + t v1, y = y(t) = y0 + t v2. We can eliminate t from these two equations to obtain
−v2 x + v1 y = −v2 x0 + v1 y0.
Since the vector v = (v1, v2) is non-zero, at least one of the numbers v1, v2 is non-zero. If one of the coordinates v1 or v2 is zero, then the line is parallel to one of the coordinate axes.

Implicit description of a line: The general equation of a line in the plane is
(1) ax + by = c,
with a and b not both zero. The relation between the pair of numbers (a, b) and the direction vector v = (v1, v2) of the line is
(2) a v1 + b v2 = 0.
We can view the left hand side of the equation (1) as a function z = f(x, y) mapping each point [x, y] of the plane to a scalar; the line then corresponds to a prescribed constant value of this function. We shall soon see that (2) says the vector (a, b) is perpendicular to the direction of the line. Suppose we have two lines p and q. We ask about their intersection p ∩ q. That is a point [x, y] which satisfies the equations of both lines simultaneously; we write them as
(3) ax + by = r, cx + dy = s.
Again, we can view the left hand sides as a mapping F, which to every pair of coordinates [x, y] of a point P in the plane assigns the vector of values of the two scalar functions f1 and f2 given by the left hand sides of the particular equations in (3).
Hence we can write our two scalar equations as one vector equation F(v) = w, where v = (x, y) and w = (r, s). Notice that the two lines are not parallel if and only if they have a unique point in their intersection.

z=var("z")
y1=(2/3)*z-(4/3); y2=3*z+5
P = plot([],figsize=(4,4))
P += plot(y1,z,(-4,4), color="black")
P += plot(y2,z,(-4,4), color="black")
P += point((-19/7, -22/7), size=50, color="black"); show(P)

□

1.E.4. Remark. In 1.E.3 you can also solve the problem by substituting the parametric equations of the line q into the implicit equation of p, which gives 2(−1 + s) − 3(2 + 3s) − 4 = 0. This equation has the unique solution s = −12/7; returning to the parametric equation of q we obtain the coordinates of P as previously mentioned. In Sage you can directly plot a line given in parametric form using the parametric_plot command. For example, to plot the lines from 1.E.3 you can use the following Sage cell (check the resulting figure in your editor). Sage can also find the intersection point by solving the system of the two equations for the parameters s, t; try it out yourselves!

t = var("t")
P1= parametric_plot((2+3*t,2*t),(t,-3,3),
    color="blue",legend_label="line p")
P2= parametric_plot((-1+t,2+3*t),(t,-3,3),
    color="black",legend_label="line q")
Q= point((-19/7,-22/7),size=70)
PL=P1+P2+Q; show(PL)

1.E.5. Determine the intersection of the lines x + y − 4 = 0 and {x = −1 + 2t, y = 2 + t : t ∈ R}. ⃝

1.E.6. Find the line p which passes through the point P = [2, 3] ∈ R^2 and is parallel to the line q given by x − 3y + 2 = 0. Provide a figure in Sage illustrating the situation.

Solution. In the real plane there are two possibilities for two parallel lines: either they coincide, or they have no intersection. Since the line p should pass through the point [2, 3], and this point does not belong to q (it does not satisfy its equation), we deduce that p and q have no intersection. This means that p is of the form x − 3y + c = 0 for some c ∈ R with c ≠ 2. To find c, we substitute the coordinates of the point P into this equation. This reveals c = 7, which means that p is implicitly given by y = (1/3)x + 7/3. For the figure presented above, we have used the cell

1.5.4. Linear mappings and matrices. The mappings F with which we have worked when describing the intersection of lines have one very important property in common: they preserve the operations of addition of vectors and multiplication by scalars, that is, they preserve linear combinations:
F(a·v + b·w) = a·F(v) + b·F(w)
for all a, b ∈ R, v, w ∈ R^2. We say that F is a linear mapping from R^2 to R^2, and write F : R^2 → R^2. This can also be described in words: a linear combination of vectors maps to the same linear combination of their images, that is, linear mappings are exactly those mappings which preserve linear combinations. We have already encountered the same behaviour in the equation 1.5.3(1) for the line, where the linear mapping in question was f : R^2 → R together with its prescribed value c. That is also the reason why the values of the mapping z = f(x, y) are depicted in the picture as a plane in R^3.

We can write such mappings using matrices. By a matrix we mean a rectangular array of numbers, for instance
A = ( a b ; c d ) or v = ( x ; y )
(here and below we display a matrix row by row, separating its rows by semicolons). We speak of a (square 2 × 2) matrix A and a (column) vector v. Multiplication of a matrix by a vector, row by column, is defined as follows:
A·v = ( a b ; c d )·( x ; y ) = ( ax + by ; cx + dy ).
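This row-times-column rule is exactly what Sage's matrix arithmetic implements, so we can let Sage reproduce the general formula symbolically (a small sketch of ours):

a, b, c, d, x, y = var("a, b, c, d, x, y")
A = matrix([[a, b], [c, d]])
v = vector([x, y])
print(A*v)   # (a*x + b*y, c*x + d*y)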
We introduce some more tools for vectors and matrices. Our goal is to compute with matrices in a similar way as we do with scalars. We define the product C = A·B of two square matrices A and B by applying the above formula to the individual columns of the matrix B and writing the resulting column vectors as the columns of the matrix C. In order to multiply two vectors v and w in a similar way, we can write the vector w as a row of numbers (the transposed vector) w^T. Then the product of w^T and v is
w^T·v = ( r s )·( x ; y ) = rx + sy.
This is the scalar product of the vectors v and w introduced a while ago. We can easily check the associativity of multiplication (do it for general matrices A, B and a vector v in detail):
(A·B)·v = A·(B·v).
Instead of a vector v we can write any 2 × 2 matrix C. In a similar way, distributivity also holds:
A·(B + C) = A·B + A·C, (A + B)·C = A·C + B·C.

1.E.7. Consider the following lines in R^2:
p1 : 2x + 3y − 4 = 0,
p2 : x − y + 3 = 0,
p3 : −2x + 2y = −6,
p4 : −x − (3/2)y + 2 = 0,
p5 : x = 2 + t, y = −2 − t, t ∈ R.
Determine which lines are parallel to each other. ⃝

1.E.8. Find a parametric equation of the line ℓ which goes through the points P1 = [1, 3] and P2 = [−2, 1] in R^2. What is the corresponding implicit equation of ℓ?

Solution. The parametric equation of a line ℓ in the Euclidean plane passing through two points P1 = [u1, v1] and P2 = [u2, v2] is given by P1 + t·\overrightarrow{P_1P_2} = P1 + t(P2 − P1), where \overrightarrow{P_1P_2} := P2 − P1 = (u2 − u1, v2 − v1) is the displacement vector from P1 to P2. Applying this rule to the given points, we get P2 − P1 = (−3, −2), and so the parametric equation of ℓ is [1, 3] + t(−3, −2) = [1, 3] − t(3, 2), with t ∈ R. Hence x = 1 − 3t and y = 3 − 2t. By eliminating t we get y = (2/3)x + 7/3. As a verification, one can check that both points P1, P2 satisfy the equation of ℓ. □

1.E.9. A planar soccer player shoots a ball from the point F = [1, 0] in the direction of the vector u = (3, 4), hoping to hit the goal, which is the line segment from the point A = [23, 36] to B = [26, 30]. Does the ball fly towards the goal?

Solution. The ball travels along the line [1, 0] + t(3, 4). The line segment joining A and B has the parametrization [23, 36] + s(3, −6). The intersection of these lines is given by the equations 1 + 3t = 23 + 3s and 4t = 36 − 6s, with the solution t = 8, s = 2/3. As 0 < 2/3 < 1, the intersection lies in the line segment between A and B, and so the ball hits the goal. Another approach (close to how our vision works) is based on the slopes of the vectors \overrightarrow{FA}, u = (3, 4), and \overrightarrow{FB} (i.e., the ratios of the coordinates; we shall come back to that later). Since 36/22 > 4/3 > 30/25, the player scores. □

Next, we shall touch on one of the basic computational tools in the whole of mathematics, the matrix calculus. For now, we restrict our attention to tasks in the plane only; the reader can find the introductory concepts of the so-called "linear algebra" in 1.5.4, also restricted to the real plane R^2. Below we first experience the operations of addition and multiplication of matrices, and exercise how to compute with them more or less exactly as with scalars. We shall return to the geometry of the plane with the new tools at hand later.
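Before moving on, the small linear system from 1.E.9 is a pleasant one-liner for Sage's solve command (a quick sketch of ours):

t, s = var("t, s")
print(solve([1 + 3*t == 23 + 3*s, 4*t == 36 - 6*s], t, s))
# [[t == 8, s == (2/3)]], so the ball crosses the goal segment, since 0 < 2/3 < 1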
But commutativity does not hold. For example,
( 0 1 ; 0 0 )·( 0 0 ; 0 1 ) = ( 0 1 ; 0 0 ), ( 0 0 ; 0 1 )·( 0 1 ; 0 0 ) = ( 0 0 ; 0 0 ).
The last product also shows the existence of divisors of zero. Notice that the mapping defined by multiplication of vectors by a fixed matrix is a linear mapping, i.e. it respects linear combinations. On the other hand, every linear mapping F is completely determined by its values on the two vectors e1 = (1, 0) and e2 = (0, 1) of the standard basis, and these values appear in the columns of the matrix A which expresses F as matrix multiplication (check this observation carefully!). With matrices and vectors we can write the equations for lines and points, respectively, as
u^T·v = ( a b )·( x ; y ) = c,
A·v = ( a b ; c d )·( x ; y ) = ( r ; s ) = w.

1.5.5. Determinant of a matrix. The procedure for finding the intersection of lines described in 1.5.3 fails in some special cases. For instance, the intersection of two parallel lines is either empty (when the lines are parallel but distinct) or the line itself (when the lines are identical). This occurs exactly when the ratios a/c and b/d are the same, that is,
(1) ad − bc = 0.
Note that this expression already takes care of the cases where either c or d is zero. The expression on the left in (1) is called the determinant of the matrix A. We write it as
det A = | a b ; c d | = ad − bc.
Our discussion can now be expressed as follows:

Proposition. The determinant is a real valued function det A defined for all square 2 × 2 matrices A. The (vector) equation A·v = u has a unique solution for v if and only if det A ≠ 0.

So far, we have worked with pairs of real numbers in the plane. Equally well we might pose exactly the same questions for points with integer coordinates and lines with equations with integer coefficients. Notice that the latter requirement is equivalent to considering rational coefficients in the equations. We have to be careful which properties of the scalars we exploit. In fact, we needed all the properties of the field of scalars when discussing the solvability of the system of two equations; try to think it through. At least, we can be sure that the intersection of two non-parallel lines with rational coefficients is again a point with rational coordinates. The case of integer coefficients and coordinates is more difficult; we shall come back to this in the next chapter.

1.E.10. Quick comments on notation. In order to simplify our notation, for the multiplication of two matrices A, B (of appropriate sizes) we just write AB instead of A·B (the complete notation used in the other column, but only in this chapter). Later on we shall use the operator "·" for the scalar product u·v of two vectors u, v (as we do in the other column as well). In the sequel, we shall write Mat_{m,n}(R) for the set of all m × n real matrices (so in this section both m and n will be either 1 or 2). For m = n, i.e., square matrices, we use the notation Mat_n(R). If m = 1, n = 2, the object is a couple of scalars written in a row; if n = 1 and m = 2, we deal with a vector written as a column (while Mat_1(R) coincides with R). Moreover, for A = ( α β ; γ δ ) ∈ Mat_2(R) we denote by det(A) and tr(A) the determinant and the trace of A, which are the real numbers defined by det(A) = αδ − βγ and tr(A) = α + δ, respectively. On all matrices, there is the transposition operation A → A^T, swapping rows and columns, and the product by scalars cA, for c ∈ R.
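These conventions are easy to exercise in Sage; for instance, the failure of commutativity and the zero divisors exhibited above come out directly (a tiny sketch of ours):

A = matrix([[0, 1], [0, 0]])
B = matrix([[0, 0], [0, 1]])
print(A*B)   # equals A again
print(B*A)   # the zero matrix, although neither A nor B is zero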
In Chapter 2, we will extend the matrix calculus to general positive integers m and n, and to general scalars.

1.E.11. Let A, B be two real square matrices. How many arithmetic operations do we need to compute the product AB?

Solution. Given A, B ∈ Mat_2(R), each entry of the product AB requires 2 multiplications and 2 − 1 = 1 addition. Since there are 2^2 = 4 entries in AB, we need 4(2 + 1) = 12 operations in total. We may notice that the same operation of "multiplying" rows and columns for general A, B ∈ Mat_n(R) requires n multiplications and n − 1 additions for each entry of the result. Thus, as we shall see in the next chapter, in total we need n^2(n + n − 1) = n^2(2n − 1) arithmetic operations to compute AB for two real square matrices A, B.8 □

1.E.12. Compute v = 2(A − B)^T C u, for A = ( 0 5 ; −2 2 ), B = ( 2 0 ; −1 1 ), C = ( 2 −2 ; 4 5 ), and u = (3, 2)^T, respectively. ⃝

1.E.13. Matrices in Sage. Our aim is to compute with matrices and vectors just like with scalars, and Sage can do this. Let us look at simple computations as in 1.E.12. We have to tell Sage to use objects of the right class; the commands matrix and vector do that for us. Let us type

A = matrix([[0,5],[-2,2]])
B = matrix([[2,0],[-1,1]])
C = matrix([[2,-2],[4,5]])
u = vector([3,2])
v = 2*transpose(A-B)*C*u; print(v)

8 This means that the complexity of matrix multiplication is "polynomial", essentially of the size of n^3. With big matrices, it is an essential issue in theoretical computer science to lower this power. Volker Strassen introduced his famous algorithm reaching the complexity of the size n^2.807 in 1969, and as of July 2023, the fastest known algorithm performs with complexity of the size n^2.371866.

In particular, we shall see in the next chapter that the equation 1.5.3(3) with fixed integer coefficients a, b, c, d has a unique integer solution for all integer values (r, s) if and only if the determinant equals ±1.

1.5.6. Affine mappings. We now investigate how the matrix notation allows us to work with simple mappings in the affine plane. We have seen that matrix multiplication defines a linear mapping. Shifting R^2 by a fixed vector w = (r, s) ∈ R^2 in the affine plane can also easily be written in matrix notation:
P = ( x ; y ) → P + w = ( x + r ; y + s ).
If we add a fixed vector to the result of a linear mapping, then we have the expression
v = ( x ; y ) → A·v + w = ( ax + by + r ; cx + dy + s ).
In this way we have described all affine mappings of the plane to itself. Such mappings allow us to recompute coordinates which arise from different choices of origins and bases. We shall come back to this in detail later.

1.5.7. The distance and angle. Now we consider distance. We define the length of the vector v = (x, y) to be ∥v∥ = √(x^2 + y^2). Immediately we can define the notions of distance, angle and rotation in the plane.

Distance in the plane: The distance between the points P, Q in the plane is given as the length of the vector \overrightarrow{PQ}, i.e. ∥Q − P∥. Obviously, the distance does not depend on the ordering of P and Q, and it is invariant under shifts of the plane by any fixed vector w. The Euclidean plane is an affine plane with the distance defined as above.

Notice that Sage does all the multiplications properly (including multiplying all the components by 2 correctly), and returns the components (−52, 64) of the vector v. The methods of addition and multiplication are appropriately chosen for the objects, with the same symbols as for scalars.
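As a sanity check of the arithmetic against a hand computation, one can also print the intermediate objects (our own sketch, redefining the data from 1.E.12):

A = matrix([[0,5],[-2,2]]); B = matrix([[2,0],[-1,1]])
C = matrix([[2,-2],[4,5]]); u = vector([3,2])
print((A - B).transpose())           # rows (-2 -1) and (5 1)
print(C*u)                           # (2, 22)
print(2*(A - B).transpose()*(C*u))   # (-52, 64), matching v above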
Of course, we may introduce matrices with other scalar types of entries. For example, real floating point entries are invoked by A = matrix(RR, [[0, 5], [-2, 2]]) (recall that RR denotes the field of floating point real numbers in Sage), and similarly CC for complex floating point entries. Sage also offers the methods transpose(A) and determinant(A) (or write A.transpose() and A.determinant(), respectively), which return the transpose and the determinant of a matrix A. Further methods and comments will appear as we use Sage below.

1.E.14. Prove the statements listed below, where A, B, C are the matrices from 1.E.12:
(a) (A − B)^2 ≠ A^2 − 2AB + B^2;
(b) D := 2ABC − BCA − CAB ≠ 0;9
(c) (c(AB) − (c(AB))^T)/√2 = (11√2 c/2)·( 0 1 ; −1 0 ), for c ≠ 0;
(d) det(AB) − det(A)·det(B) = 0;
(e) the determinant of the matrix F := det(A)B − det(B)A − tr(C)C is an integer divisible by the numbers 1, 2, 19 and 38. ⃝

1.E.15. Give an example of matrices A and B for which
(a) (A + B)(A − B) ≠ A^2 − B^2;
(b) (A + B)(A + B) ≠ A^2 + 2AB + B^2. ⃝

9 Here we denote by 0 the zero 2 × 2 matrix.
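These identities are finger exercises on paper, but Sage is happy to do the bookkeeping; e.g. for item (e) of 1.E.14 (a sketch of ours):

A = matrix([[0,5],[-2,2]]); B = matrix([[2,0],[-1,1]]); C = matrix([[2,-2],[4,5]])
F = A.det()*B - B.det()*A - C.trace()*C
print(F.det())   # -38, indeed divisible by 1, 2, 19 and 38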
The so-called "affine geometry" of the real plane is characterized by the concept of "going straight with constant velocity" from a given point, i.e., along a line. The mappings F : R^2 → R^2 of the plane which preserve this feature are called "affine". Those fixing the origin O = [0, 0] are just the "linear" ones,
F(a u + b v) = a F(u) + b F(v), for all a, b ∈ R and u, v ∈ R^2.
Remarkably, all affine maps are obtained by additionally allowing the "shifts" P → P + u, P ∈ R^2, for fixed vectors u. In coordinates, the linear mappings are given by matrix multiplication, while the affine mappings also include the addition of constant vectors. See the detailed explanation in the other column, e.g., 1.5.6. Due to the associativity of matrix multiplication, composition of mappings corresponds to multiplication of matrices, too. Moreover, invertible affine mappings can also be understood as changes of coordinates, see 1.5.6; thus we call them affine transformations.

Angles are a matter of vectors rather than points in Euclidean geometry. Let u be a vector of length 1, at angle φ measured counter-clockwise from the vector (1, 0). In coordinates, u lies on the unit circle and has first and second coordinates cos φ and sin φ, respectively (this is one of the elementary definitions of the sine and cosine functions). That is, u = (cos φ, sin φ), with −1 ≤ sin φ ≤ 1, −1 ≤ cos φ ≤ 1, satisfying (cos φ)^2 + (sin φ)^2 = 1.

Angle between vectors: The angle φ between two vectors v and v′ can in general be described using their coordinates v = (x, y), v′ = (x′, y′) as
(1) cos φ = (xx′ + yy′)/(∥v∥ ∥v′∥), with 0 ≤ φ ≤ π.
In the special case v = (1, 0), this general equation gives cos φ = x′/∥v′∥, which is just the definition of the function cos φ. The general case can always be reduced to this special one. First we notice that the angle φ between two vectors u, v is always the same as the angle between the normalized vectors (1/∥u∥)u and (1/∥v∥)v. Thus we can restrict ourselves to two vectors on the unit circle. Then we can rotate our coordinates in such a way that the first of the vectors becomes (1, 0). This means it is enough to show that the scalar product xx′ + yy′, as well as the lengths of vectors, are invariant with respect to rotations. We shall come back to this in a moment. In the special case when the scalar product is zero, we say that the vectors are perpendicular; this corresponds to φ = π/2, as expected. Of course, the best example of perpendicular vectors of length 1 is the pair of standard basis vectors (1, 0) and (0, 1). Notice that our formula for the angle between vectors is symmetric in the two arguments; thus we take the smaller of the possible angles between the two vectors, and φ is always between 0 and π.

We can easily imagine that not all affine coordinates are adequate for expressing distances, and thus for use in the Euclidean plane. Indeed, although we may again choose any point O as the origin, we also want the basis vectors e1 = \overrightarrow{OE_1} and e2 = \overrightarrow{OE_2} to be perpendicular and of length one. Such a basis will be called orthonormal. We shall see that the angles and distances computed in such coordinates are always the same, no matter which such coordinates are used.

1.E.16. Decide whether the mappings F, G : R^2 → R^2 given by
F : (x, y) → (7x − 3y, −2x + 5y), G : (x, y) → (2x + 2y − 4, 4x − 9y + 3),
are linear. Write them via matrix multiplication, analyze the meaning of the individual columns, and check that their composition is expressed via matrix multiplication, too. ⃝

Affine transformations play a crucial role in computer graphics and computer vision. Computer graphics involves creating and manipulating images, displayed or animated on a computer screen. Modern computer-aided design (CAD) programs and tools like LibreCAD have a wide range of applications, from designing satellites to creating computer games and educational software. Thus these tools are particularly valuable for engineers and designers working in various industries. Below we explore how to manipulate affine transformations of the plane, which will enhance our understanding of their mathematical significance in practical applications. Using Sage to illustrate these concepts, we focus on two-dimensional graphics (2D graphics) through the problems analyzed below. For additional learning and Sage examples, refer to the tasks outlined in 1.G.52, 1.G.53, and ??.

1.E.17. Affine transformations. Find the matrix representations of simple affine transformations of the plane, such as: shifts, homotheties, stretching in one direction only, shears, rotations, and reflections.

Solution. In order to talk about matrix representations, we need to fix the coordinates, see 1.5.2. We shall work with the standard ones, that is, with the vectors e1 = (1, 0), e2 = (0, 1), identifying the plane with R^2 explicitly.

• "Identity transformation." The simplest affine transformation is the identity; the relevant matrix is the identity matrix E = ( 1 0 ; 0 1 ). Clearly E u = u for all u ∈ R^2.

• "Shifts." The shift (or translation) τ_w by a vector w = r e1 + s e2 = (r, s)^T ∈ R^2 is given by applying the matrix E and adding the vector w, i.e., (x, y)^T → (x, y)^T + (r, s)^T = (x + r, y + s)^T, or in other words τ_w(u) = u + w. For instance, if w = (3, 0)^T, then applying τ_w to an image shifts it to the right by 3. Notice that when we want to shift a point P to the origin, we write τ_{−P} for the relevant translation. To illustrate a translation (and also the remaining transformations), we use the triangle ABC with vertices at the points A = [1, 0], B = [4, −2], and C = [2, −2]. In the figures, these vertices are coloured blue, green and red, respectively, and the new coordinates are painted in the corresponding colours too, hopefully giving a better visualization of how the given transformation acts. For instance, the accompanying figure illustrates the translation τ_{(3,0)} applied to ABC.
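Such illustrations are produced with Sage; a minimal sketch of this kind (with our own choice of names and colours) might read:

A = vector([1, 0]); B = vector([4, -2]); C = vector([2, -2])
w = vector([3, 0])   # the translation vector of tau_(3,0)
before = polygon([A, B, C], fill=False, color="gray")
after  = polygon([A + w, B + w, C + w], fill=False, color="black")
show(before + after, aspect_ratio=1)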
1.5.8. Rotation around a point in the plane. As we saw, the matrix of any given linear mapping F : R^2 → R^2 is easy to guess. If the mapping works by applying the matrix with columns (a, c) and (b, d), then the first column (a, c) is obtained by multiplying this matrix with the basis vector (1, 0), and the second is the evaluation at the second basis vector (0, 1). We can see from the picture that the columns of the matrix corresponding to rotating counter-clockwise through the angle ψ are computed as follows:
( a b ; c d )·( 1 ; 0 ) = ( cos ψ ; sin ψ ), ( a b ; c d )·( 0 ; 1 ) = ( − sin ψ ; cos ψ ).
The counter-clockwise direction is called the positive direction, the other one the negative direction.

Rotation matrix: Rotation through a given angle ψ in the positive direction about the origin is given by the matrix Rψ:
v = ( x ; y ) → Rψ·v = ( cos ψ − sin ψ ; sin ψ cos ψ )·( x ; y ).

Now that we know what the matrix of a rotation in the plane looks like, we can check that rotations preserve distances and angles (defined by equation (1) in 1.5.7). Denote the image of a vector v by
v′ = ( x′ ; y′ ) = Rψ·v = ( x cos ψ − y sin ψ ; x sin ψ + y cos ψ ),
and similarly w′ = Rψ·w for w = (r, s)^T and w′ = (r′, s′)^T. A straightforward expansion (exploiting (cos ψ)^2 + (sin ψ)^2 = 1) checks the expected equalities ∥v′∥^2 = ∥v∥^2 and x′r′ + y′s′ = xr + ys. The latter can be written using vectors and matrices as
(Rψ·w)^T (Rψ·v) = w^T v.

• "Homotheties." A homothety h_c is a constant rescaling of all vectors by a factor c ∈ R. Thus the relevant matrix is a multiple of E and hence diagonal, A = cE = ( c 0 ; 0 c ). As a result, a homothety with c > 1 stretches a figure. Let us illustrate this using the triangle ABC and the homothety h_c((x, y)^T) = (3x/2, 3y/2)^T, i.e., c = 3/2. Conversely, a homothety h_c with 0 < c < 1 shrinks the figure; try to produce examples yourselves. In case we want to apply a homothety with another fixed point P = [a, b] in R^2 (thus rescaling the differences Q − P for all points Q), we can first translate P to the origin, then employ the usual homothety, and then shift the plane back:
( x ; y ) → ( x − a ; y − b ) → ( c(x − a) ; c(y − b) ) → ( c(x − a) + a ; c(y − b) + b ).
Imagine what happens with the plane under the three transformations along the way.

• "Rotations." If we rotate by the angle θ counter-clockwise, we get e1 → (cos θ, sin θ) (this is the elementary definition of the trigonometric functions), while e2 → (− sin θ, cos θ) (look at the picture in 1.5.8). Thus, the rotation matrix has the form
Rθ = ( cos θ − sin θ ; sin θ cos θ ).
For instance, for θ = π/2 we get R_{π/2} = ( 0 −1 ; 1 0 ), and hence in this case the rotation takes (x, y)^T to (−y, x)^T:
( x ; y ) → R_{π/2}( x ; y ) = ( 0 −1 ; 1 0 )( x ; y ) = ( −y ; x ).
The figure below illustrates the result of R_{π/2} on the triangle ABC used above.10

10 Notice that a rotation preserves both the lengths of vectors and the angles.

The transposed vector (Rψ·w)^T equals w^T·Rψ^T, where Rψ^T is the so-called transpose of the matrix Rψ. That is the matrix whose rows consist of the columns of the original matrix, and similarly whose columns consist of the rows of the original matrix. Therefore we see that the rotation matrices satisfy the relation Rψ^T·Rψ = I, where the matrix I (sometimes we denote this matrix just by 1, meaning the unit in the ring of matrices) is the unit matrix I = ( 1 0 ; 0 1 ).
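This relation is also a pleasant symbolic exercise for Sage (a sketch of ours; simplify_trig lets Sage apply the identity cos^2 ψ + sin^2 ψ = 1):

psi = var("psi")
R = matrix([[cos(psi), -sin(psi)], [sin(psi), cos(psi)]])
print((R.transpose()*R).simplify_trig())   # the unit matrix ( 1 0 ; 0 1 )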
This leads us to a remarkable observation: the matrix X with the property that X·Rψ = I (we call such a matrix the inverse matrix of the rotation matrix Rψ) is the transpose of the original matrix. This makes sense, since the inverse mapping to the rotation through the angle ψ is again a rotation, through the angle −ψ. That is, the inverse matrix is the transpose Rψ^T, which equals the matrix
R_{−ψ} = ( cos(−ψ) − sin(−ψ) ; sin(−ψ) cos(−ψ) ) = ( cos ψ sin ψ ; − sin ψ cos ψ ).
It is easy to write the rotation around a point P = O + w, P = [r, s], using matrices again. One just has to note that instead of rotating around the given point P, we can first shift P into the origin, then do the rotation, and then do the inverse shift. We calculate:
v = ( x ; y ) → v − w → Rψ·(v − w) → Rψ·(v − w) + w = ( cos ψ (x − r) − sin ψ (y − s) + r ; sin ψ (x − r) + cos ψ (y − s) + s ).

1.5.9. Reflections. Another well-known example of a length preserving mapping is reflection through a line. It is enough to understand reflection through a line that goes through the origin O; all other reflections can be derived using shifts. Thus, we first look for the matrix Zψ of the reflection with respect to the line through the origin and through the point (cos ψ, sin ψ). Notice that Z0 = ( 1 0 ; 0 −1 ).

• "Stretching by a parameter c." Let us stretch the plane only in the direction of the first basis vector e1. Then the relevant matrix A_c is diagonal, obtained by multiplying the first row of the identity matrix by the stretching parameter c, i.e., A_c = ( c 0 ; 0 1 ). The figure below shows the stretching of our triangle ABC in the direction of e1 with c = 3/2; the particular linear transformation has the form
( x ; y ) → ( 3/2 0 ; 0 1 )( x ; y ) = ( 3x/2 ; y ).
Applying the same stretching but in the direction of e2 forces only the y-coordinates to be transformed, as we see in the next figure. If we want to employ the same operation with a general fixed point P lying on a line ℓ, and along this line, we may again compose three transformations: first shift by τ_{−P}, then find the angle θ of ℓ with the direction of e1 and use the matrix R_{−θ}, then apply the above stretching, and finally move the plane back, i.e., apply Rθ and τ_P.

• "Shears." The horizontal shear is a linear transformation keeping the y-coordinate of each point fixed, so that the entire "x-coordinate axis" (the line y = 0) stays fixed, too. In other words, a horizontal shear takes the point (x, y) to the point (x + ay, y) for some a. Similarly, vertical shears are the transformations with the roles of the components swapped. Thus, the relevant matrices have the following

Any line going through the origin can be rotated so that it has the direction (1, 0), and thus we can write the general reflection matrix as Zψ = Rψ·Z0·R_{−ψ}, where we first rotate via the matrix R_{−ψ} so that the line is in the "zero" position, reflect with the matrix Z0, and return back with the rotation Rψ. Therefore we can calculate (by the associativity of matrix multiplication):
Zψ = ( cos ψ − sin ψ ; sin ψ cos ψ )·( 1 0 ; 0 −1 )·( cos ψ sin ψ ; − sin ψ cos ψ )
= ( cos ψ sin ψ ; sin ψ − cos ψ )·( cos ψ sin ψ ; − sin ψ cos ψ )
= ( cos^2 ψ − sin^2 ψ 2 sin ψ cos ψ ; 2 sin ψ cos ψ −(cos^2 ψ − sin^2 ψ) )
= ( cos 2ψ sin 2ψ ; sin 2ψ − cos 2ψ ).
The last equality follows from the usual double angle formulas for the trigonometric functions:
(1) sin 2ψ = 2 sin ψ cos ψ, cos 2ψ = cos^2 ψ − sin^2 ψ.
Notice that the product Zψ·Z0 gives
( cos 2ψ sin 2ψ ; sin 2ψ − cos 2ψ )·( 1 0 ; 0 −1 ) = ( cos 2ψ − sin 2ψ ; sin 2ψ cos 2ψ ).
This observation can be formulated as follows:

Proposition. A rotation through the angle ψ can be obtained as two subsequent reflections through lines that have the angle ψ/2 between them.

form: H_a = ( 1 a ; 0 1 ), V_a = ( 1 0 ; a 1 ).
Let us illustrate a shear in the x-direction for the values a = ±3. We get the transformations
( x ; y ) → ( 1 ±3 ; 0 1 )( x ; y ) = ( x ± 3y ; y ),
so the triangle ABC is transformed as in the accompanying figures (cases a = 3 and a = −3).

• "Reflections." Finally, let us deal with reflections. They actually appeared above as stretchings along one direction, with the parameter c = −1. Thus, stretching along the direction of e1 with c = −1 gives the reflection with respect to the y-axis, with matrix representation ( −1 0 ; 0 1 ). Repeating this with the vector e2 instead of e1, we get the reflection with respect to the x-axis, with matrix representation ( 1 0 ; 0 −1 ). Below we illustrate the reflection of our triangle ABC with respect to the y-axis; for an illustration of a reflection in the plane with respect to the x-axis, see 1.G.51. On the other hand, the reflection in the plane through the line y = x sends the x-axis to the y-axis, and vice versa; hence it is also referred to as the "axial symmetry". Obviously, its matrix representation is given by ( 0 1 ; 1 0 ), and for example it transforms the given triangle as in the accompanying figure (see 1.G.52 for another example). □
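The proposition above is easy to test symbolically as well: composing the reflections Zψ and Z0 in Sage should return the rotation matrix through 2ψ, in line with the computation in the other column (a sketch of ours):

psi = var("psi")
Z0 = matrix([[1, 0], [0, -1]])
Rpsi = matrix([[cos(psi), -sin(psi)], [sin(psi), cos(psi)]])
Zpsi = (Rpsi*Z0*Rpsi.transpose()).simplify_trig()   # note R(-psi) = R(psi)^T
print((Zpsi*Z0).simplify_trig())   # should reduce to the rotation matrix through 2*psi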
In fact, we can prove the previous proposition purely by geometric argumentation, as shown in the picture above (try to be a "synthetic geometer" when reflecting A to A′ and then A′ to A′′). If we believe in this proof "by picture", then the computational derivation of the proposition provides a proof of the standard double angle formulas (1).

1.5.10. The following is a recapitulation of the previous ideas.

Mappings that preserve length
Theorem. A linear mapping of the Euclidean plane is composed of one or more reflections if and only if it is given by a matrix R which satisfies
R = ( a b ; c d ), ab + cd = 0, a^2 + c^2 = b^2 + d^2 = 1.
This happens if and only if the mapping preserves length. Such a mapping is a rotation if and only if the determinant of the matrix R equals one, which corresponds to an even number of reflections. When there is an odd number of reflections, the determinant equals −1.

Proof. We calculate what a general matrix A might look like when the corresponding mapping preserves length. That is, we have the mapping
( x ; y ) → ( a b ; c d )·( x ; y ) = ( ax + by ; cx + dy ).
Preserving length means that for every x and y we have
x^2 + y^2 = (ax + by)^2 + (cx + dy)^2 = (a^2 + c^2)x^2 + (b^2 + d^2)y^2 + 2(ab + cd)xy.
Since this equation must hold for every x and y, the coefficients of the individual powers x^2, y^2 and xy on the left and right sides must be equal. Thus we have calculated that the conditions imposed on the matrix R in the first part

1.E.18. Show that every rotation can be obtained as a composition of three shears.11

Solution. This is a slightly more demanding exercise, but we may experiment with the abilities of Sage, which makes our life really easy; we would add the lines of the following session one by one, following the partial results. Composing horizontal shears gives again a horizontal shear (and similarly for the vertical ones):
( 1 a ; 0 1 )( 1 b ; 0 1 ) = ( 1 a + b ; 0 1 ).
Thus we try to combine horizontal and vertical shears to get a rotation by an angle θ.

# introduce three shear matrices with
# variable parameters a,b,c
a,b,c=var("a,b,c")
A = matrix([[1,a],[0,1]])
B = matrix([[1,0],[b,1]])
C = matrix([[1,c],[0,1]])
# take the general transformation X
# obtained by composition
X = C*B*A; print(X)
# notice that one of the conditions on X
# to be a rotation indicates a=c
X = X.substitute(c=a)
eq1 = X[0,0]^2+X[1,0]^2-1; eq2 = X[1,0]+X[0,1]
# eq3 = X[0,0]-X[1,1] == 0 is already
# fulfilled; solve the system
sol = solve([eq1,eq2],[a,b]); print(sol)

The final line returns the two solutions
a = (√(1 − r1^2) − 1)/r1, b = r1, or a = −(√(1 − r2^2) + 1)/r2, b = r2.
This means that b is a free parameter, and from the printed form of X we learn that b plays the role of sin θ. The graphical illustration of this composition is given in ??. □

11 This is actually of high interest for pixel based computer graphics: a horizontal or vertical shear is a very quick operation consisting of a constant shifting of the individual rows or columns of pixels, while implementing a rotation directly is much more time consuming! It seems that the first algorithm based on this was introduced by Alan Paeth in 1986.

1.E.19. Change of coordinates. Look at the picture in 1.5.6 and assume that, in the standard coordinates, O′ = [1, 5], P = O′ + e′1 = [2, 3], Q = O′ + e′2 = [2, 7], and that these define the "girl's" coordinate system. Consider the two options that the meeting point M = [2, 1] is given (a) by the boy, or (b) by the girl, in their own coordinate system. What does it mean for the other one? Find the answer first for the situation where at least the origins of the coordinate systems coincide!

Solution. If the boy and the girl share at least the same origin of the coordinate systems, then the transformation of coordinates is simple. The girl's basis is already expressed in the standard coordinates (the boy's ones). Thus the matrix T with columns

of the theorem we are proving are equivalent to the property that the given mapping preserves length. Because a^2 + c^2 = 1, we can assume that a = cos φ and c = sin φ for a suitable angle φ. As soon as we choose the first column of the matrix R, the relation ab + cd = 0 determines the second column up to a multiple. But we also know that the length of the vector in the second column is one, and thus we have only two possibilities for the matrix R, namely
( cos φ − sin φ ; sin φ cos φ ), ( cos φ sin φ ; sin φ − cos φ ).
In the first case, we have a rotation through the angle φ; in the second case we have a rotation composed with the reflection through the first coordinate axis. As we have seen in the proposition of 1.5.9, every rotation corresponds to two reflections. The determinant of the matrix R is in these two cases either one or minus one, and it distinguishes between the two cases by the parity of the number of reflections. □

Notice that we have now proved our earlier claim on the invariance of the formulae for distance and angle in any orthonormal coordinates. Moreover, we have seen that all Euclidean affine mappings are generated by translations and reflections.

1.5.11. Area of a triangle. At the end of our little trip to geometry we focus on the area of planar objects. For us, triangles will be sufficient. Every triangle is determined by a pair of vectors v and w which, if translated so that they start from one vertex P of the triangle, determine the remaining two vertices.
We would like to find a formula (a scalar function area) which assigns to such a pair the number area ∆(v, w) equal to the area of the triangle ∆(v, w) defined in the aforementioned way. By translating, we can place P at the origin, since translation should not change the area. We can see from the picture that the desired value is half of the area of the parallelogram spanned by the vectors v and w. It is easy to calculate (using the well-known formula: base times corresponding height), or simply observe from the picture, that the following holds:
area ∆(v + v′, w) = area ∆(v, w) + area ∆(v′, w),
area ∆(a v, w) = a·area ∆(v, w).

e′1 and e′2 yields the linear mapping providing the boy's coordinate expression of the girl's vectors. In order to go in the opposite direction, we just take the inverse matrix T^{−1}. Thus,
T = ( 1 1 ; −2 2 ) and T^{−1} = ( 1/2 −1/4 ; 1/2 1/4 ),
and the boy knows that the girl means him to be at [3, −2], while the girl knows that the boy's [2, 1] would be [3/4, 5/4]. In the general case, we must consider the shifts of the origin, evaluated in the right coordinates. We shall again use Sage to compute:

# go to the new affine coordinates
O=vector([0,0]); e1=vector([1,0]); e2=vector([0,1])
OO=vector([1,5]); ee1=vector([2,3])-OO; ee2=vector([2,7])-OO
# define the relevant transformations
T=matrix([ee1,ee2]); T=T.transpose()
print(T); print(T.inverse())
def change_girl_to_boy(coordinates):
    position = vector(coordinates)
    return T*position + OO
def change_boy_to_girl(coordinates):
    # [0,0] = a*[1,-2]+b*[1,2]+[1,5] =>
    # a-b=5/2, a+b=-1 => a=3/4, b=-7/4
    position = vector(coordinates)
    return T.inverse()*position + vector([3/4,-7/4])
print(change_girl_to_boy(vector([2,1])))
print(change_boy_to_girl(vector([2,1])))
# tests
OO+2*ee1+1*ee2 == O+4*e1+3*e2 and O+2*e1+1*e2 == OO+3/2*ee1-1/2*ee2

The linear parts of the transformations are still T and T^{−1}; the right shifts are added. The last two prints give the answer: the girl's meeting point [2, 1] means the boy should go to the position [4, 3], while if it is the boy who tells his meeting point, then the girl should look for him at [3/2, −1/2] in her coordinates. The final line confirms this by returning "True". □

Let us now move to Euclidean tasks. In addition, there is the concept of the norm (size) of a vector, allowing us to measure distances between points. We have already seen this in the case of complex numbers, viewing C as the real plane with standard coordinates. As we will see below, we are then also able to measure angles between vectors and areas of polygonal objects.

1.E.20. (a) Find the norms, the scalar product and the angle of the vectors u = (−3, −2) and v = (−2, 3).
(b) Prove that the triangle with vertices P = [2, 2], Q = [3, 0], R = [4, 3] is isosceles.

Solution. (a) We shall work with the standard Euclidean norm on R^2 (which we may identify with C), as we did in 1.A.8. The norm of a vector w = (x, y) is given by ∥w∥ = √(x^2 + y^2).

Finally we add to the formulation of our problem the condition
area ∆(v, w) = −area ∆(w, v),
which corresponds to the idea that we give a sign to the area, according to the order in which we take the vectors. If we write the vectors v and w into the columns of a matrix A, then the mapping A = (v, w) → det A satisfies all three conditions we wanted. How many such mappings could there possibly be? Every vector can be expressed using the two basis vectors e1 = (1, 0) and e2 = (0, 1). By linearity, area ∆ is uniquely determined by its values on these vectors. We want area ∆(e1, e2) = 1/2.
In other words, we have chosen the orientation and the scale through the choice of the basis vectors, and we choose the unit square to have area equal to one. Thus we see that the determinant gives the area of the parallelogram determined by the columns of the matrix A. The area of the triangle is then one half of the area of the parallelogram.

1.5.12. Visibility in the plane. The previous description of the value of the oriented area gives us an elegant tool for determining the position of a point relative to oriented line segments. By an oriented line segment we mean two points in the plane R^2 with a selected order. We can imagine it as an arrow from one point to the other. Such an oriented line segment divides the plane into two half-planes; let us call them "left" and "right". We want to be able to determine whether a given point is in the left or the right half-plane. Such tasks are often met in computer graphics when dealing with the visibility of objects. We can imagine that an oriented line segment can be "seen" from the points to the right of it and cannot be seen from the points to the left of it.

Thus ∥u∥ = √(9 + 4) = √13 = ∥v∥. Their scalar product (dot product) is given by u·v ≡ ⟨u, v⟩ = (−3)(−2) + (−2)·3 = 0, and hence u and v are orthogonal to each other, i.e., their angle is φ = π/2. In Sage, we have the method dot_product, which is applied to the first vector and takes the second one as its argument, see the cell

u = vector([-3, -2]); v = vector([-2, 3])
u.dot_product(v)

which returns 0. As for the norms of the vectors u, v, type u.norm(); v.norm(), which in both cases returns √13.

(b) In this case we compute
|\overrightarrow{PQ}| = ∥Q − P∥ = √((2 − 3)^2 + (2 − 0)^2) = √5,
|\overrightarrow{QR}| = ∥R − Q∥ = √((3 − 4)^2 + (0 − 3)^2) = √10,
|\overrightarrow{PR}| = ∥R − P∥ = √((2 − 4)^2 + (2 − 3)^2) = √5.
Thus |\overrightarrow{PQ}| = |\overrightarrow{PR}|, and the triangle is isosceles. □

1.E.21. Given the line ℓ : 7y = 6x + 13, determine the line ℓ′ which is perpendicular to ℓ and passes through the point P = [−6, 7] ∈ R^2. Use at least two methods.

Solution. (1) Given a line ℓ in parametric form P0 + t u = [x0, y0] + t(x1, y1) (t ∈ R), it is easy to determine a line ℓ′ perpendicular to ℓ which goes through another point Q = [a, b], as follows. The direction vector of ℓ′ must be orthogonal to the direction vector of ℓ. So, if we assume that Q + t v is the parametric form of ℓ′, then we should have u·v = 0, where · is the usual dot product on R^2. Taking v = (−y1, x1), we see that (x1, y1)·(−y1, x1) = 0, that is, u ⊥ v. Thus the parametric equation of ℓ′ is simply Q + t v = [a, b] + t(−y1, x1), with t ∈ R. To apply this in our case, we need to express the initial line ℓ in parametric form. The points [−1, 1] and [0, 13/7] belong to the line ℓ, so the parametric form of ℓ is (cf. 1.E.8)
[−1, 1] + t(0 − (−1), 13/7 − 1) = [−1, 1] + t(1, 6/7).
As a result, the direction vector of ℓ′ is (−6/7, 1), with (1, 6/7)·(−6/7, 1) = 0, and ℓ′ reads [−6, 7] + t(−6/7, 1), or in other words x = −6 − (6/7)t and y = 7 + t, with t ∈ R. For a verification, we can transfer the given parametric form of ℓ′ to implicit form. Eliminating t from these equations gives x = −6 − (6/7)(y − 7), which is equivalent to y = −(7/6)x.

(2) Actually, writing the equation for ℓ as 6x − 7y + 13 = 0, it is obvious that the coefficients (6, −7) provide just a vector perpendicular to ℓ. Thus we may write the implicit equation for ℓ′ immediately.

(3) We may also work with the slopes. Writing ℓ as y = (6/7)x + 13/7, we get its slope m = 6/7.
The line ℓ′, perpendicular to the line ℓ, must then have the slope m′ = −7/6, so that mm′ = −1.

We have the oriented line segment AB and are given some point C. We calculate the oriented area of the corresponding triangle determined by the vectors A − C and B − C. If the point C is to the left of the line segment, then with the usual positive orientation (counter-clockwise) we obtain a positive sign of the oriented area, while the negative sign corresponds to the points to the right. This approach is often used for testing relative positions in 2D graphics.

6. Relations and mappings

In the final part of this introductory chapter, we return to the formal description of mathematical structures. We will try to illustrate them on examples we already know. We can consider this part to be an exercise in the formal approach to the objects and concepts of mathematics, i.e. the "language of mathematics".

1.6.1. Relations between sets. First we define the Cartesian product A × B of two sets A and B. It is the set of all ordered pairs (a, b) such that a ∈ A and b ∈ B. A binary relation between the two sets A and B is then a subset R of the Cartesian product A × B. We write a ≃_R b to mean (a, b) ∈ R, and say that a is related to b. The domain of the relation is the subset D = {a ∈ A : ∃b ∈ B, (a, b) ∈ R}. Here the symbol ∃b means that there is at least one such b satisfying the rest of the claim. Similarly, the codomain of the relation is the subset I = {b ∈ B : ∃a ∈ A, (a, b) ∈ R}. A special case of a relation between sets is a mapping from the set A to the set B. This is the case when every element of the domain of the relation is related to exactly one element of the codomain. Examples of mappings known to us are all functions, where the codomain of the mapping is a set of numbers, for instance the set of integers or the set of real numbers, or the linear mappings in the plane given by matrices. We write f : D ⊆ A → I ⊆ B, f(a) = b, to express the fact that (a, b) belongs to the relation, and we say that b is the value of f at a. Furthermore we say that
• the mapping f of the set A to the set B is surjective (or onto) if D = A and I = B,
• the mapping f of the set A to the set B is injective (or one-to-one) if D = A and for every b ∈ I there exists exactly one preimage a ∈ A with f(a) = b.
Expressing a mapping f : A → B as a relation f ⊆ A × B, f = {(a, f(a)); a ∈ A}, is also known as the graph of the mapping f.

Thus, ℓ′ is of the form y = −(7/6)x + b, and substituting the coordinates of P implies b = 0, i.e., y = −(7/6)x. □
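All three approaches can be cross-checked in Sage in a few lines (a sketch of ours): the normal vectors of ℓ and ℓ′ must be orthogonal, and P must satisfy the equation of ℓ′.

x, y = var("x, y")
ell_prime = -7/6*x - y   # the line ℓ′: y = -(7/6)x
print(vector([6, -7]).dot_product(vector([7, 6])))   # 0, the normal vectors are orthogonal
print(ell_prime.subs(x=-6, y=7))                     # 0, so P = [-6, 7] lies on ℓ′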
1.E.22. Verify that every affine mapping maps parallel lines to parallel lines, but that angles and distances are changed in general. ⃝

1.E.23. Remark on plots. Sage, as well as other software, provides extensive 2D plotting functionality. However, even for drawing simple lines in the plane, we should pay attention to several details. First, let us look at the so-called "aspect ratio" of the figures, which is governed by the option aspect_ratio in Sage. The aspect ratio reflects the apparent height/width ratio of a unit square. In Sage one can explicitly ask for an "automatic setup" by typing aspect_ratio = "automatic". However, this does not always produce what we normally expect. Instead, the option aspect_ratio = 1 works much better, applying the same scale to both coordinates. This is relevant when illustrating Euclidean properties, where we want to see the true angles and lengths, while in the affine setup we may prefer to choose the scales independently. We may also adjust the size of a figure via the option figsize. The default figsize is 4, while setting figsize = 8 makes a figure roughly twice as big. We can also request separate horizontal and vertical sizes, by typing for example figsize = [4, 8] (both are measured in inches). Usually, small figsize values work well.

1.E.24. To illustrate the above discussion, provide plots of the lines appearing in Problem 1.E.21. ⃝

In the next series of exercises we focus on applications of linear and affine mappings. In applications, they appear either directly or as "linear approximations".12 We already saw the simple building blocks of most of them in 1.E.17, see also the other column starting in 1.5.8. We shall also include the concept of area and its link to the properties of the determinant function on matrices. We refer to the paragraphs 1.5.5 and 1.5.11 for details.

1.E.25. Compute the area of the triangle PQR, if P = [−8, 1], Q = [−2, 0] and R = [5, 9]. What would we have to do if the points were given in a coordinate system different from the standard Euclidean one? ⃝

1.E.26. Consider the quadrilateral S given by its vertices A = [1, 1], B = [6, 1], C = [11, 4], and D = [2, 4]. Illustrate S via Sage and deduce that S is a trapezoid whose bases have lengths 5 and 9, respectively, and whose height equals 3. Then compute its area in at least two ways. ⃝
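For 1.E.25, the determinant from 1.5.11 turns the area computation into a one-liner; a quick Sage sketch of ours (the oriented area of the triangle is half of the determinant, so we take the absolute value):

P = vector([-8, 1]); Q = vector([-2, 0]); R = vector([5, 9])
print(abs(matrix([Q - P, R - P]).det()) / 2)   # 61/2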
12 While the linear approximations are the core of building models in science and engineering (we start developing the relevant methods in Chapter 5), in the arts the concept of linearity is essential in finding the right proportions, helping to create structure, balance, and rhythm in compositions, guiding the viewer's eye and affecting emotions.

1.6.2. Composition of relations and mappings. For mappings, the concept of composition is clear. Suppose we have two mappings f : A → B and g : B → C. Then their composition g ∘ f : A → C is defined as (g ∘ f)(a) = g(f(a)). Composition can also be expressed in the notation used for relations:
f ⊆ A × B, f = {(a, f(a)); a ∈ A},
g ⊆ B × C, g = {(b, g(b)); b ∈ B},
g ∘ f ⊆ A × C, g ∘ f = {(a, g(f(a))); a ∈ A}.
The composition of relations is defined in a very similar way; we just add existential quantifiers to the statements, since we have to consider all possible "preimages" and all possible "images". Let R ⊆ A × B, S ⊆ B × C be relations. Then
S ∘ R ⊆ A × C, S ∘ R = {(a, c); ∃b ∈ B, (a, b) ∈ R, (b, c) ∈ S}.
A special case of a relation is the identity relation id_A = {(a, a) ∈ A × A; a ∈ A} on the set A. It is a neutral element with respect to composition with any relation that has A as its codomain or domain. For every relation R ⊆ A × B, we define the inverse relation R^{−1} = {(b, a); (a, b) ∈ R} ⊆ B × A. Beware, the same term is used for mappings in a more specific situation. Of course, for every mapping there is its inverse relation, but this relation is in general not a mapping. Therefore we speak about the existence of an inverse mapping only if every element b ∈ B is the image of exactly one element in A. In such a case the inverse mapping is exactly the inverse relation. Note that the composition of a mapping and its inverse mapping (if it exists) is the identity mapping. In general, this is not true for relations.

1.E.27. (a) An equilateral triangle with vertices [1, 0] and [0, 1] lies entirely in the first quadrant. Find the coordinates of its third vertex with the help of Sage.
(b) An equilateral triangle has vertices at P = [1, 1] and Q = [2, 3], and its third vertex lies in the same half-plane as the point S = [0, 0]. The triangle is rotated by 60° in the positive direction around the point S, producing a new triangle. Determine the coordinates of the vertices of the new triangle.

Solution. (a) Let us denote by A = [1, 0] and B = [0, 1] the given points. To find the coordinates [x, y] of the third vertex, we rotate the point A around the point B by the angle 60° = π/3 in the positive (counter-clockwise) direction. Thus, in Sage we can find the third vertex by typing

a=pi/3
M=matrix([[cos(a),-sin(a)],[sin(a),cos(a)]])
A=vector([1, 0]); B=vector([0, 1])
rot=M*(A-B)+B; print(rot)

Hence we have
[x, y] = ( cos(π/3) − sin(π/3) ; sin(π/3) cos(π/3) )(A − B) + B,
which gives [x, y] = [1/2 + √3/2, 1/2 + √3/2].

(b) Using the general rotation transformation as found in Sage above, we easily obtain the wanted points (check yourselves):
[−(3/2)√3, √3 − 1/2], [1/2 − (1/2)√3, (1/2)√3 + 1/2], [1 − (3/2)√3, √3 + 3/2]. □

1.E.28. Consider a regular hexagon with vertices labeled in the positive direction, with the first vertex A = [0, 2] and centre at the point S = [1, 0]. Determine the coordinates of the third vertex C. ⃝

1.E.29. Which sides of the quadrangle given by the vertices [−2, −2], [1, 4], [3, 3] and [2, 1] are "visible" from the position of the point X = [3, π − 2]?

Solution. In the first step we order the vertices so that their order corresponds to the counter-clockwise direction. We choose the vertex A = [−2, −2]; the order of the remaining vertices is then B = [2, 1], C = [3, 3], D = [1, 4] (think about how to order the points without a picture; you can actually use a procedure similar to what follows). First consider the side AB. Together with the point X = [3, π − 2] it determines the matrix
( −2 − 3 2 − 3 ; −2 − (π − 2) 1 − (π − 2) ),
whose first column is the difference A − X and whose second column is B − X. Whether the side can be "seen" from the point [3, π − 2] (i.e., whether X is left or right of the vector \overrightarrow{AB}, see 1.5.12) is then determined by the sign of the determinant
| −5 −1 ; −π 3 − π | = 4π − 15 < 0.

1.6.3. Relation on a set. In the case when A = B we speak about a relation on the set A. We say that the relation R is:
• reflexive, if id_A ⊆ R, that is, (a, a) ∈ R for every a ∈ A,
• symmetric, if R^{−1} = R, that is, if (a, b) ∈ R, then also (b, a) ∈ R,
• antisymmetric, if R^{−1} ∩ R ⊆ id_A, that is, if (a, b) ∈ R and also (b, a) ∈ R, then a = b,
• transitive, if R ∘ R ⊆ R, that is, if (a, b) ∈ R and (b, c) ∈ R, then also (a, c) ∈ R.
A relation is called an equivalence relation if it is reflexive, symmetric and transitive. A relation is called an ordering if it is reflexive, transitive and antisymmetric. Orderings are usually denoted by the symbol ≤; the fact that the element a is related to the element b is then written as a ≤ b. Notice that the relation <, "to be strictly smaller than", is not an ordering of the set of real numbers, since it is not reflexive. The common practice (also used in this book) is to use the signs < or ⊂ for orderings, too. A good example of an ordering is set inclusion. Consider the set 2^A of all subsets of a finite set A. We have the relation ⊆ on the set 2^A given by the property of "being a subset".
A good example of an ordering is set inclusion. Consider the set 2^A of all subsets of a finite set A. We have the relation ⊆ on the set 2^A given by the property of “being a subset”. Thus X ⊆ Z if X is a subset of Z. Clearly all three conditions from the definition of an ordering are satisfied: if X ⊆ Y and Y ⊆ X then necessarily X and Y must be identical; if X ⊆ Y ⊆ Z then also X ⊆ Z; and reflexivity is clear from the definition. (As mentioned above, we shall mostly use the symbol ⊂ without excluding equality.)

We say that an ordering ≤ on a set A is complete, if every two elements a, b ∈ A are comparable, that is, either a ≤ b or b ≤ a. If A contains more than one element, there exist subsets X and Y with neither X ⊆ Y nor Y ⊆ X, so the ordering ⊆ is not complete on the set of all subsets of A. The set of real numbers with the usual ≤ is complete. Thus the subdomains N, Z, Q come equipped with a complete ordering, too. On the other hand, there is no such natural complete ordering on C. The absolute value provides at least a preordering there (comparing the radii of circles in the complex plane).

1.6.4. Partitions of an equivalence. Every equivalence relation R on a set A also defines a partition of the set A, consisting of subsets of mutually equivalent elements, namely the equivalence classes. For any a ∈ A we consider the set of elements which are equivalent with a, that is, [a] = Ra = {b ∈ A; (a, b) ∈ R}. Clearly a ∈ Ra by reflexivity. If (a, b) ∈ R, then Ra = Rb by symmetry and transitivity. Furthermore, if Ra ∩ Rb ≠ ∅ then there is an element c in both Ra and Rb, so that Ra = Rc = Rb. It follows that for every pair a, b, either Ra = Rb, or Ra and Rb are disjoint. That is, the equivalence classes are pairwise disjoint. Finally, A = ∪a∈A Ra. That is, the set A is partitioned into equivalence classes. We write [a] = Ra, and by the above, we can represent an equivalence class by any one of its elements.

For the sides BC, CD and DA we analogously obtain

det ( 2 − 3, 3 − 3 ; 1 − (π − 2), 3 − (π − 2) ) < 0,
det ( 3 − 3, 1 − 3 ; 3 − (π − 2), 4 − (π − 2) ) > 0,
det ( 1 − 3, −2 − 3 ; 4 − (π − 2), −2 − (π − 2) ) > 0.

The determinants differ in sign, thus the point X is outside the given quadrangle. According to our convention of putting the vectors XA, XB, XC, XD into the determinants, a side is visible exactly when the corresponding determinant is negative (i.e., X is right of the oriented side). Thus, from the point X exactly the sides determined by the pairs of vertices A = [−2, −2], B = [2, 1] and B = [2, 1], C = [3, 3] are visible. □

1.E.30. Determine which sides of the pentagon with vertices at the points [−2, −2], [−2, 2], [1, 4], [3, 1] and [2, −11/6] are visible from the point [300, 1]. ⃝

1.E.31. Consider the triangle with vertices P = [5, 6], Q = [7, 8], and R = [5, 8]. Determine which of its sides are visible from the point X = [0, 1]. ⃝
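The sign test of 1.E.29 is easy to automate. A minimal Sage sketch (assuming, as above, that the vertices are already ordered counter-clockwise; the variable names are ours):

X = vector([3, pi - 2])
V = [vector([-2, -2]), vector([2, 1]), vector([3, 3]), vector([1, 4])]
for i in range(len(V)):
    P, Q = V[i], V[(i + 1) % len(V)]
    d = matrix([P - X, Q - X]).transpose().det()   # columns P - X and Q - X
    print(P, Q, "visible" if bool(d < 0) else "hidden")

This reports the sides AB and BC as visible, in agreement with the computation above.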
F. Relations and mappings

We close this chapter by considering briefly some aspects of the language of mathematics. Thus, instead of the usual intuitive introduction, we ask the reader to have a quick look at the formal approach to the relevant concepts, beginning in 1.6.1. Note that digesting the abstract concepts of equivalence and ordering (see 1.6.3) is one of the crucial steps towards mathematical thinking.

1.F.1. Determine whether the following relations on the given set M are equivalence relations:
i) M = {f : R → R}, where f ∼ g if f(0) = g(0).
ii) M = {f : R → R}, where f ∼ g if f(0) = g(1).
iii) M is the set of lines in the plane, where two lines are related if they do not intersect.
iv) M is the set of lines in the plane, where two lines are related if they are parallel.
v) M = N, where m ∼ n if S(m) + S(n) = 20, where S(n) stands for the sum of the digits of the integer n.
vi) M = N, where m ∼ n if C(m) = C(n), where C(n) = S(n) if the sum of the digits S(n) is less than 10, and otherwise we define C(n) = C(S(n)) (thus we always have C(n) < 10).

Solution. For each case we must check the three defining properties of equivalence relations.
i) a) Reflexivity: for any real function f, f(0) = f(0).
b) Symmetry: if f(0) = g(0), then also g(0) = f(0).

1.6.5. Existence of scalars. As before, we assume we know what sets are, and we indicate the construction of the natural numbers. We denote the empty set by ∅ (notice the difference between the symbol 0 for zero and the empty set ∅) and define

(1) 0 := ∅, n + 1 := n ∪ {n},

in other words 0 := ∅, 1 := {0}, 2 := {0, 1}, . . . , n + 1 := {0, 1, . . . , n}. This notation says that if we have already defined the numbers 0, 1, 2, . . . , n, then the number n + 1 is defined as the set of all previous numbers. We have defined the set of natural numbers N.3

Next, we should construct the operations + and · and deduce their required properties. In order to do that in detail, we would have to pay more attention to the basic understanding of sets. For example, once we know what a disjoint union of sets is, we may define the natural number c = a + b as the unique natural number having the same number of elements as the disjoint union of a and b. Of course, formally speaking, we need to explain what it means for two sets to have the same number of elements. Let us notice that, in general, two sets A and B having the same “size” should mean that there exists a bijection A → B. This is completely in accordance with our intuition for finite sets. However, it is much less intuitive with infinite sets. For example, there are exactly as many natural numbers as there are squares of natural numbers (via the bijection a → a²), although the picture in the solution to 1.A.3 could be read as “many natural numbers do not have a rational square root”. We say that each set which is bijective to the natural numbers N is countable. Sets bijective to some natural number n (as defined above) are called finite (with number of elements n), while sets which are neither finite nor countable are called uncountable.

So adding a + b is understood as increasing a iteratively by one exactly as many times as there are elements in b. Similarly, we multiply a·b by adding a to zero as many times as there are elements in b. Now we should check the axioms for scalar operations for our operations + and ·. The (routine but tedious) check is left to the reader.

We can also define a relation ≤ on N as follows: m ≤ n if either m ∈ n or m = n. Clearly this is a complete ordering. For instance 2 ≤ 4, since

2 = {∅, {∅}} ∈ {∅, {∅}, {∅, {∅}}, {∅, {∅}, {∅, {∅}}}} = 4.

3 The concept of natural numbers based on the principle of “increasing by one” was known to all ancient civilisations; however, their smallest natural number was always one. The set-theoretical approach was developed in the 19th century, and there zero became the logical smallest natural number, as the counterpart of the empty set.

c) Transitivity: if f(0) = g(0) and g(0) = h(0), then also f(0) = h(0).
We conclude that the relation is an equivalence relation.
ii) The relation is not reflexive, since for instance for the function f(x) = sin(x) we have sin(0) ≠ sin(1).
Moreover, it is not transitive.
iii) The relation is neither reflexive (every line intersects itself) nor transitive.
iv) This is an equivalence relation, whose equivalence classes are well represented by the (unoriented) lines through the origin.
v) The relation is clearly not reflexive. Nor is it transitive.
vi) In this case the answer is positive, and we leave the verification of the details to the reader. □

1.F.2. Consider the set A = {0, 1, 2, 3} and let R be the relation on A defined by

R = {(0, 0), (0, 1), (0, 3), (1, 0), (1, 1), (2, 3), (3, 3)}.

Show that R is not an equivalence relation by explaining which of the defining properties of an equivalence relation are not satisfied.

Solution. It is easy to see that R is not reflexive, since 2 ∈ A but (2, 2) ∉ R. It is not symmetric, since for example (2, 3) ∈ R but (3, 2) ∉ R. Finally, it is not transitive, since (1, 0) and (0, 3) both belong to R but (1, 3) ∉ R. □

1.F.3. Over the set A = {a, b, c, d} consider the relation R1 = {(a, a), (b, b), (c, c), (d, d), (b, a), (b, c), (b, d)}. Is R1 an equivalence relation?

Solution. Since (k, k) ∈ R1 for any k ∈ A, the relation is reflexive. It is also transitive. However, it is not symmetric, since for example (b, a) ∈ R1 but (a, b) ∉ R1. Hence R1 is not an equivalence relation. □

1.F.4. Verify the result of the previous task in Sage. ⃝

1.F.5. Let the relation R be defined on R² such that ((a, b), (c, d)) ∈ R for arbitrary a, b, c, d ∈ R if and only if b = d. Determine whether or not this is an equivalence relation. In the positive case, describe geometrically the partitioning determined by R. ⃝

1.F.6. Present the domain D and the image I of the relation R = {(a, v), (b, x), (c, x), (c, u), (d, v), (f, y)} defined between the sets A = {a, b, c, d, e, f} and B = {x, y, u, v, w}. Is the relation R a mapping? Next attempt to solve the problem using Sage. ⃝

1.F.7. Determine for each of the following relations on the set {a, b, c, d} whether it is an ordering and whether it is complete:

R1 = {(a, a), (b, b), (c, c), (d, d), (b, a), (b, c), (b, d)},
R2 = {(a, a), (b, b), (c, c), (d, d), (d, a), (a, d)},
R3 = {(a, a), (b, b), (c, c), (d, d), (a, b), (b, c), (b, d)}.

In other words, the recursive definition itself gives the relation n ≤ n + 1, and transitivity then gives n ≤ k for all k obtained later in this manner. This ordering of the positive integers or natural numbers (the number a is strictly smaller than b if a ∈ b) obviously has the following striking property: every nonempty subset of N or Z+ has a smallest element.

1.6.6. Integers and rational numbers. With the set N of positive integers together with zero, we can always add two numbers together. Also, adding zero to a number does not change it. We can also define subtraction, but the result does not always belong to N. The basic idea of the construction of the integers from the natural numbers or positive integers is to add to N these missing results. This can be done as follows: instead of subtraction, we work with ordered pairs of numbers (a, b) representing the desired result a − b. It just remains to define which such pairs are equivalent (with respect to the result of subtraction). The necessary relation is then:

(a, b) ∼ (a′, b′) ⟺ a − b = a′ − b′ ⟺ a + b′ = a′ + b.

Note that the expression in the middle equation may not belong to N, but the expression on the right always does. It is easy to check that it really is an equivalence, and we denote its classes as the integers Z.
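The construction just described is easy to play with in code. A small sketch (our own illustration, runnable in Sage or Python) representing an integer by a pair (a, b) that stands for the would-be difference a − b:

def equivalent(p, q):
    # (a, b) ~ (c, d) exactly when a + d = c + b
    (a, b), (c, d) = p, q
    return a + d == c + b

def add(p, q):
    # [(a, b)] + [(c, d)] = [(a + c, b + d)]
    return (p[0] + q[0], p[1] + q[1])

print(equivalent((2, 5), (0, 3)))   # True: both pairs represent -3
print(add((2, 5), (7, 1)))          # (9, 6), i.e. the class of 3

Independence of the choice of representatives can be tested in the same way.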
We define addition and subtraction on Z using representatives. For instance, [(a, b)] + [(c, d)] = [(a + c, b + d)], which is clearly independent of the choice of representatives. It is always possible to choose the representative (a, 0) for a natural number a, and the representative (0, a) for a negative number −a. This is probably the simplest choice.

Next, we define multiplication of integers similarly to addition (i.e., using representatives we aim at a representative of (a − b) · (c − d)):

(a, b) · (c, d) = (ac + bd, bc + ad).

This is clearly commutative and obviously does not depend on the choice of representatives. Moreover, choosing positive or negative multiplicands with their special representatives as above, we get just the standard multiplication in N, taking proper care of the signs.

Now it is easy to see that we have all the properties (CG1)–(CG4) and (R1)–(R4), see the paragraph 1.1.1. For multiplication, the neutral element is one, but no integer a other than ±1 has an inverse, that is, an integer a−1 with the property a · a−1 = 1. Thus, for multiplication, we are missing the inverse elements. However, the property of the integral domain (ID) holds. This means that if the product of two integers equals zero, then at least one of them has to be zero. We can construct the rational numbers Q by adding all the missing multiplicative inverses, by a method analogous to the construction of Z from N.

Solution. The relation R1 is an ordering, which is not complete (for instance, neither (a, c) ∈ R1 nor (c, a) ∈ R1). The relation R2 is not antisymmetric, as both (a, d) ∈ R2 and (d, a) ∈ R2 while a ≠ d; therefore it is not an ordering. The relation R3 does not define an ordering either, since it is not transitive (for instance (a, b), (b, c) ∈ R3, but (a, c) ∉ R3). □

1.F.8. In the following three figures, icons are connected with lines as people in different parts of the world could have assigned them. Determine whether the connection is a mapping, and whether it is injective, surjective or bijective.

Solution. In the first figure the connection is a mapping which is surjective but not injective, because both the snake and the spider are labeled as poisonous. The second figure is not a mapping but only a relation, since the dog is labeled both as a pet and as a meal. The third connection is again a mapping. This time it is neither injective nor surjective. □

1.F.9. Determine whether or not the mapping f is injective (one-to-one) or surjective (onto), when f is given by:
(a) f : Z × Z → Z, f((x, y)) = x + y − 10x²;
(b) f : N → N × N, f(x) = (2x, x² + 10). ⃝

1.F.10. Determine the number of equivalence relations on the set A = {1, 2, 3, 4}.

Solution. We divide the sought equivalences according to the types of the corresponding partitions (given by the number and cardinalities of the equivalence classes), and we count the number of partitions of each type:

type of partition     number of partitions of this type
1,1,1,1               1
2,1,1                 (4 choose 2) = 6
2,2                   (1/2)·(4 choose 2) = 3
3,1                   (4 choose 1) = 4
4                     1

Thus in total we have 15 different equivalences. □

1.F.11. Determine the number of orderings of a four-element set.

Solution. We will consider all possible Hasse diagrams of orderings on a four-element set M. For each Hasse diagram we count how many different orderings (recall that an ordering is a subset of the set M × M) it yields. See the diagram (the figure with the Hasse diagrams is omitted here):
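Since an ordering on a four-element set M is just a reflexive, antisymmetric and transitive subset of M × M, the total found in 1.F.11 can also be double-checked by brute force over all 2¹² choices of off-diagonal pairs. A sketch of our own (a verification, not part of the original solution):

from itertools import combinations

M = range(4)
diagonal = {(a, a) for a in M}
off = [(a, b) for a in M for b in M if a != b]

def is_ordering(R):
    if any((b, a) in R for (a, b) in R if a != b):   # antisymmetry
        return False
    # transitivity
    return all((a, d) in R for (a, b) in R for (c, d) in R if b == c)

count = 0
for k in range(len(off) + 1):
    for extra in combinations(off, k):
        if is_ordering(diagonal | set(extra)):
            count += 1
print(count)   # 219

The output agrees with the total of 219 orderings stated at the end of the solution.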
On the set of all ordered pairs (p, q), q ≠ 0, of integers, we define a relation ∼ so that it models our expectation of the fractions p/q:

(p, q) ∼ (p′, q′) ⟺ p/q = p′/q′ ⟺ p · q′ = p′ · q.

Again, we are not able to formulate the expected behaviour via the middle equation when we work in Z, but for the equation on the right this is indeed possible. This relation is a well-defined equivalence (think it through!). If we formally write p/q instead of the pairs (p, q), we can define the operations of multiplication and addition by the well-known formulas

p/q · r/s = pr/qs,
p/q + r/s = ps/qs + qr/qs = (ps + qr)/qs.

1.6.7. Remainder classes. Another example of equivalence classes is the remainder classes of integers. For a fixed natural number k we define an equivalence ∼k so that two numbers a, b ∈ Z are equivalent if they have the same remainder when divided by k. The resulting set of equivalence classes is denoted by Zk. This procedure is simplest for k = 2. This yields Z2 = {[0], [1]}, where zero stands for the even numbers and one for the odd numbers. It is easy to see that addition and multiplication can be defined using representatives.

Remainder class rings and fields

Theorem. The set of remainder classes Zk is always a commutative ring of scalars. It is a commutative field of scalars (that is, the property (F) from the paragraph 1.1.1 is also satisfied) if and only if k is a prime. If k is not prime, then Zk contains a divisor of zero, thus it is not an integral domain.

Proof. The second part is easy to see: if x · y = k for natural numbers x, y, then the result of multiplying the corresponding classes [x] · [y] is zero. On the other hand, if x and k are relatively prime, then according to the Bezout equality (which we derive later, see 11.1.2) there are integers a and b satisfying ax + bk = 1, which for the corresponding equivalence classes gives [a] · [x] + [0] = [a] · [x] = [1], and thus [a] is the inverse element to [x]. □

In total, there are 219 orderings on a four-element set. □

Many combinatorics problems involve relations, and examples can be found in the final section with additional material (see Section G).

G. Additional exercises for the whole chapter

We will now present supplementary material related to the concepts discussed so far. We begin by reviewing some key equalities and inequalities that appear throughout the book, due to their versatility in various applications.

A) Material on numbers and functions

1.G.1. (Basic arithmetic and the triangular numbers) Recall from high school that the principle of mathematical induction can be used to prove the following in just a few steps:

∆n := 1 + 2 + · · · + n = n(n + 1)/2, n ∈ Z+.

The numbers ∆n are called triangular numbers. Provide a shorter proof of this summation without using induction, and illustrate your method with a figure. Next, show that the number of distinct handshakes or wine glass clinks among n people equals the triangular number ∆n−1. Finally, prove that ∆n − ∆n−1 = n and ∆n + ∆n−1 = n² for any n. ⃝

1.G.2. (Average speed) A bicyclist ascends a hill at 25 km/h, and descends the same hill at 75 km/h. Find the bicyclist’s average speed for the entire trip (it turns out that the length of the hill is not needed for the calculation). ⃝
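The dichotomy in the theorem on remainder classes in 1.6.7 (Zk is a field if and only if k is prime) is easy to probe in Sage; the following small cell of our own lists, for each k, whether Zk is a field together with its nonzero non-units:

for k in [2, 3, 4, 5, 6, 7, 12, 13]:
    R = Zmod(k)        # the ring Z_k of remainder classes
    zero_divisors = [a for a in R if a != 0 and not a.is_unit()]
    print(k, R.is_field(), zero_divisors)

For composite k the last list is non-empty: in the finite ring Zk the nonzero non-units are exactly the divisors of zero.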
1.G.3. The arithmetic–geometric mean inequality (AM-GM inequality).
(a) For any two non-negative real numbers a, b prove the inequality

√(ab) ≤ (a + b)/2, or equivalently (ab)^{1/2} ≤ (a + b)/2, (†)

with equality if and only if a = b (recall that √(ab) is called the geometric mean of a, b, while (a + b)/2 is the arithmetic mean of a, b). Next use (†) to prove that for any four non-negative real numbers a, b, c, d the following inequality holds:

(abcd)^{1/4} ≤ (a + b + c + d)/4, (‡)

with equality if and only if a = b = c = d.
(b) Present a proof using mathematical induction for the general arithmetic–geometric mean inequality

(a1 · a2 · . . . · an)^{1/n} ≤ (a1 + a2 + · · · + an)/n,

with equality if and only if a1 = a2 = . . . = an, where the ai are non-negative real numbers for all i = 1, . . . , n.

Solution. (a) There are numerous alternative proofs of the AM-GM inequality. For instance, take the square of both sides of the relation given in (†). We obtain ab ≤ (a + b)²/4, which simplifies to 4ab ≤ a² + 2ab + b². This is equivalent to stating that (a − b)² ≥ 0. This last inequality is obviously true, as the square of a real number is non-negative. Moreover, equality holds if and only if a = b. Since all steps are reversible, the statement is proven.

A more geometric proof can be described using the figure below. It illustrates four rectangles, each of area ab, placed inside a square of area (a + b)². We observe that 4ab does not exceed (a + b)², with the “wasted” area being the shaded region in the figure. By demonstrating that the shaded area is (a − b)², we can conclude that equality holds if and only if a = b.

The second inequality follows directly from (†), when combined with the known properties of powers of real numbers:

(abcd)^{1/4} = ((abcd)^{1/2})^{1/2} = (√(ab) · √(cd))^{1/2} ≤ (√(ab) + √(cd))/2 ≤ ((a + b)/2 + (c + d)/2)/2 = (a + b + c + d)/4,

where (†) is used in both estimates.

(b) To simplify the notation, suppose that Υ(n) stands for the hypothesis that the given inequality is valid for n numbers. Above we proved the result for n = 2, 4, that is, we have confirmed Υ(2) and Υ(4), and in a similar way one can show that the result is valid for n = 8, that is,

(a1 · a2 · . . . · a8)^{1/8} = ((a1 · . . . · a4)^{1/4} · (a5 · . . . · a8)^{1/4})^{1/2} ≤ ((a1 · . . . · a4)^{1/4} + (a5 · . . . · a8)^{1/4})/2 ≤ (a1 + · · · + a8)/8,

using (†) and then (‡). Hence we have confirmed Υ(2), Υ(4) and Υ(8), and one can obtain the statement for any power 2^k with k ≥ 1, that is, verify Υ(2^k) for all integers k ≥ 1 (we leave this verification to the reader).

Let us now explain how Υ(4) can be used to derive Υ(3). Set µ := (a1 + a2 + a3)/3 for the arithmetic mean of some positive real numbers a1, a2, a3. A crucial but simple observation is that µ is also the arithmetic mean of the numbers a1, a2, a3, µ, that is,

(a1 + a2 + a3)/3 = (a1 + a2 + a3 + µ)/4,

or in other words µ = (a1 + a2 + a3 + µ)/4. A proof of this is easy: by definition we have 3µ = a1 + a2 + a3, or equivalently 4µ = a1 + a2 + a3 + µ. A division by 4 then gives the claim. Now we can apply Υ(4) to the numbers a1, a2, a3, µ:

(a1a2a3µ)^{1/4} ≤ (a1 + a2 + a3 + µ)/4 = µ.

Considering the fourth power of both sides of this inequality, we obtain a1a2a3µ ≤ µ⁴, which can be rewritten as a1a2a3 ≤ µ³. Thus, after taking cube roots on both sides, we arrive at the result, i.e., (a1a2a3)^{1/3} ≤ µ. Before describing the general step, keep in mind that a1a2a3µ can be rewritten as a1a2a3µ^{2²−3} and a1 + a2 + a3 + µ = a1 + a2 + a3 + (2² − 3)µ, since 2² − 3 = 1. Now one can use Υ(8) to obtain also Υ(5), Υ(6), and Υ(7), and so forth. However, let us explicitly present the general argument.
Observe that in order to compute Υ(3) from Υ(4), we extended the triple (a1, a2, a3) into the quadruple (a1, a2, a3, µ) = (â1, â2, â3, â4), so that we could apply Υ(4). We will use the analogous technique to prove Υ(n) for all n < 2^k. Hence, the goal is to extend (a1, a2, . . . , an) to (â1, . . . , â_{2^k}), so that we can apply Υ(2^k) to the latter tuple. A question that may arise is how to perform this extension, and the case n = 3 provides the necessary motivation. Indeed, consider the following:

âj := aj for j = 1, 2, . . . , n, and âj := µ := (a1 + a2 + . . . + an)/n for j = n + 1, . . . , 2^k

(for n = 3 this formula yields the expressions used above, and in particular we have k = 2 for this case). Apply now Υ(2^k) to (â1, . . . , â_{2^k}). This gives

(a1 · · · an µ^{2^k − n})^{1/2^k} ≤ (a1 + · · · + an + (2^k − n)µ)/2^k = µ.

Obviously (µ^{2^k − n})^{1/2^k} = µ · µ^{−n/2^k}, therefore the previous formula can also be written as

(a1 · · · an µ^{2^k − n})^{1/2^k} = (a1 · · · an)^{1/2^k} · µ · µ^{−n/2^k} ≤ µ.

Thus we can eliminate µ from both sides, i.e., we have proved so far that (a1 · · · an)^{1/2^k} µ^{−n/2^k} ≤ 1. Raising both sides of this inequality to the power 2^k/n yields the result:

(a1 · · · an)^{1/n} µ^{−1} ≤ 1, or equivalently (a1 · · · an)^{1/n} ≤ µ.

This completes the main proof, and the equality case can be easily derived. □

The proof given above goes back almost two centuries and is due to the French mathematician A.-L. Cauchy (1789–1857). Note that this is a special kind of mathematical induction, often referred to as “forward-backward” induction. This is because we first derived the result for all powers of 2, and next proceeded with another induction to establish it for all positive integers. As mentioned, the AM-GM inequality admits many different proofs, most of them based on induction. There are also methods arising from calculus, which allow for shorter proofs. For example, much later in Chapter 8, we will provide a proof of the AM-GM inequality in terms of the so-called Lagrange multipliers.

1.G.4. Bernoulli inequality. For any x ∈ R with x > −1 one can easily prove, based on mathematical induction, that (1 + x)ⁿ ≥ 1 + nx for any n ∈ N, see also 5.4.1. This is the so-called Bernoulli inequality. Prove that the Bernoulli inequality yields the AM-GM inequality.

Solution. For n = 1 we have nothing to prove, so we may assume that n ≥ 2. Setting y = x + 1 with x > −1, so that y > 0, the Bernoulli inequality takes the form

yⁿ ≥ 1 + n(y − 1). (∗)

Now, to simplify the description, let us set A(n) = (a1 + · · · + an)/n and G(n) = (a1 · . . . · an)^{1/n}, so that the AM-GM inequality simply reads G(n) ≤ A(n). Obviously A(n)/A(n − 1) > 0, and as y we may consider the ratio A(n)/A(n − 1) with n ≥ 2. Then, using the relation (∗), we deduce that

(A(n)/A(n − 1))ⁿ ≥ 1 + n(A(n)/A(n − 1) − 1) = (A(n − 1) + nA(n) − nA(n − 1))/A(n − 1) = an/A(n − 1),

and this can be equivalently rephrased as A(n)ⁿ ≥ an A(n − 1)^{n−1}. Using this inequality successively, we obtain

A(n)ⁿ ≥ an A(n − 1)^{n−1} ≥ an · an−1 A(n − 2)^{n−2} ≥ . . . ≥ an · an−1 · . . . · a2 A(1)¹ = an · an−1 · . . . · a2 · a1 = G(n)ⁿ,

that is, A(n)ⁿ ≥ G(n)ⁿ for any n ≥ 2. Raising both sides to the power 1/n we obtain the desired A(n) ≥ G(n). □
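A quick numerical sanity check of the AM-GM inequality in Sage or Python, on random data of our own choosing:

import random
from math import prod
for _ in range(5):
    a = [random.uniform(0, 10) for _ in range(6)]
    am = sum(a) / len(a)
    gm = prod(a) ** float(1 / len(a))
    print(am >= gm, am - gm)   # always True; the gap vanishes only for equal entries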
1.G.5. Symbolic computations with Sage and polynomials. In SageMath, for a range of applications such as doing symbolic computations, solving equations, plotting functions, etc., the so-called symbolic variables are available. Symbolic variables are often used to represent “indeterminates”, which are variables that do not have a fixed value. This concept is frequently employed in mathematical problem-solving to allow for general solutions and manipulations, such as simplifying expressions, solving equations, and performing symbolic differentiation and integration (see Chapter 5 and Chapter 6, respectively). By using indeterminates, we can explore mathematical relationships and derive formulas that hold true for any value of the variable, providing a powerful tool for both theoretical and applied mathematics.

On the other hand, polynomials form a fundamental component of algebra and calculus, and symbolic computation provides a powerful toolset for examining their properties and behavior. Let us explain how, using symbolic variables, we can manipulate polynomials easily. First, to establish a symbolic variable we type t = var("t"), or simply var("t"), or

t = SR.var("t"); t

which returns a symbolic variable named t. We stress that Sage does this automatically for the variable x, but not for other variables. Thus, given the cell

y = var("y"); x - y

Sage will understand x − y as an expression of the two variables x, y. In contrast, typing x − y directly, an error will be printed out, declaring that the name “y” is not defined. It is also possible to create more than one symbolic variable in a row. This allows us, for example, to use expressions depending on many variables, or on both variables and parameters, as in the examples below:

x, y, z = var("x, y, z"); x + y + z
t, a, b, c = var("t, a, b, c"); a*t**2 + b*t + c

The cell

var("a, b, c"); f = a*x**2 + b*x + c
print([f.coefficient(x**2), f.coefficient(x)])

returns the coefficients of the terms x² and x, respectively, as [a, b], which can be useful in complicated expressions. Of course, if you want to solve the equation ax² + bx + c = 0 with a > 0, then type

x, a, b, c = var("x, a, b, c"); assume(a>0)
show(solve(a*x**2 + b*x + c == 0, x))

In this case Sage returns the known solutions in terms of a, b, c; check the output yourselves.

An alternative way to introduce symbolic variables is the command x = SR.var("x", n). This creates the symbolic variables x0, . . . , xn−1 for some positive integer n, and each of them can be accessed by typing x[i], for some i ∈ {0, . . . , n − 1}. As an example, type

x = SR.var("x"); a = SR.var("a", 4)
a[0] + a[1]*x + a[2]*x**2 + a[3]*x**3

Executing this cell we obtain a3*x^3 + a2*x^2 + a1*x + a0, hence fixing the parameters a3, a2, a1, a0 we can define a real polynomial of degree three. For such substitutions one may apply the following code:

x = SR.var("x"); a = SR.var("a", 4)
f = a[0] + a[1]*x + a[2]*x**2 + a[3]*x**3
f.subs(a[0] == 6, a[1] == 4, a[2] == 2, a[3] == 1/8)

This prints out the polynomial 1/8*x^3 + 2*x^2 + 4*x + 6, which we may plot by typing (produce the figure in your editor)

plot(1/8*x^3 + 2*x^2 + 4*x + 6, (x, -50, 50))

Notice that to factor a polynomial we may use the command factor:

p(x) = x^5 + x^4 - 8*x^3 + 11*x^2 - 15*x + 2; show(factor(p(x)))

In this case Sage prints out the expression (x⁴ + 3x³ − 2x² + 7x − 1)(x − 2). We will encounter more options for manipulating polynomials in Sage later in this text.
1.G.6. Suppose that P(x) is a polynomial (with constant coefficients) satisfying the following: divided by x + 2 it leaves remainder −1, divided by x + 3 it leaves remainder 1, while divided by x − 3 it leaves remainder 4. Find the remainder υ(x) of the division of P(x) by the polynomial (x + 2)(x + 3)(x − 3).

Solution. Let us write P(x) = q(x) · π(x) + υ(x), where q(x) = (x + 2)(x + 3)(x − 3) is the divisor, π(x) is the quotient and υ(x) the remainder. Since the divisor is of degree 3, the remainder υ(x) is a polynomial of degree at most 2, say υ(x) = ax² + bx + c. By assumption we have P(−2) = −1, P(−3) = 1 and P(3) = 4. Since P(x) = (x + 2)(x + 3)(x − 3) · π(x) + ax² + bx + c, we get the following system of equations:

4a − 2b + c = −1, 9a − 3b + c = 1, 9a + 3b + c = 4.

Solving this system we obtain a = 1/2 = b and c = −2. To verify this, use for example Sage via the block

var("a, b, c")
f1 = 4*a - 2*b + c + 1; f2 = 9*a - 3*b + c - 1; f3 = 9*a + 3*b + c - 4
solve((f1==0, f2==0, f3==0), a, b, c)

□

1.G.7. Find a polynomial P(x) of fourth degree satisfying P(1) = 1, such that P(x + 1) is divisible by (x − 1)² and P(x − 1) is divisible by (x + 1)².

Solution. By assumption, we have

P(x + 1) = (x − 1)² · π1(x), and P(x − 1) = (x + 1)² · π2(x),

where π1(x), π2(x) are the corresponding quotients. Using these equations, one can obtain the relations

P(x) = (x − 2)² · π1(x − 1), and P(x) = (x + 2)² · π2(x + 1),

respectively. This means that P(x) is divisible by (x − 2)² and also by (x + 2)². But then P(x) is also divisible by the product (x − 2)²(x + 2)² = (x² − 4)².13 Now, P(x) is of degree 4, hence the previous conclusion implies that P(x) = c · (x² − 4)² for some constant c ∈ R\{0}. Based on the condition P(1) = 1 we compute c = 1/9, and hence P(x) = (1/9)(x² − 4)². □

1.G.8. Given the polynomial P(x) = x² + x + 2, show that the polynomial Q(x) := P(P(x)) − P(x) − 3 is divisible by P(x) − 1. What is the quotient? ⃝

1.G.9. Prove that the polynomial P(x) of degree 4 satisfying P(0) = 0 and P(x + 1) − P(x) = 4x³ admits x0 = 1 as a double root. ⃝

1.G.10. (a) Factor the polynomial p(x) = 4x⁴ − 8x³ − 3x² + 7x − 2 using Horner’s scheme and indicate the multiplicities of its roots. Verify your answer via Sage and present the graph of p(x) for x ∈ [−2, 2].
(b) Consider the polynomials p1(x) = x⁴ + 2x³ + x² + x + 12, p2(x) = 2x⁵ + x⁴ − 53x³ − 7x + 3, q1(x) = x + 2 and q2(x) = x² − 5x + 1. Show that the Euclidean divisions of pi(x) by qi(x) for i = 1, 2 satisfy υ2(x) = −62x + υ1(x) and moreover π2(x) = 2π1(x) + 11x² − 2x − 9, where we write pi(x) = qi(x)πi(x) + υi(x) with i = 1, 2.

Solution. (a) Dividing the divisors of 2 by the divisors of 4, one gets a list of all possible rational roots. In particular, we see that ρ1 = −1 is a root: p(ρ1) = 4 + 8 − 3 − 7 − 2 = 0. Thus, using Horner’s table we get

ρ1 = −1 | 4   −8   −3    7   −2
        |     −4   12   −9    2
        | 4  −12    9   −2    0

Hence p(x) = (x − ρ1)(4x³ − 12x² + 9x − 2) = (x + 1)(4x³ − 12x² + 9x − 2). Next we see that ρ2 = 2 is a root of 4x³ − 12x² + 9x − 2, thus using Horner’s method for this polynomial we get

ρ2 = 2 | 4  −12    9   −2
       |      8   −8    2
       | 4   −4    1    0

13 Here we rely on the general fact that if a polynomial P(x) (with constant coefficients) is divisible by the linear factors (x − ρ1), (x − ρ2), . . . , (x − ρk), where ρi ≠ ρj for 1 ≤ i ≠ j ≤ k, then P(x) is also divisible by the product (x − ρ1) · (x − ρ2) · . . . · (x − ρk). The converse is also true.

Thus p(x) = (x − ρ1)(x − ρ2)(4x² − 4x + 1) = (x + 1)(x − 2)(2x − 1)², or equivalently p(x) = 4(x − ρ1)(x − ρ2)(x − ρ3)², where ρ1 = −1, ρ2 = 2 and ρ3 = 1/2.
Note that ρ3 is a double root. In Sage we simply type

p(x) = 4*x^4 - 8*x^3 - 3*x^2 + 7*x - 2; factor(p(x))

which returns (2*x - 1)^2*(x + 1)*(x - 2). In this block one can specify the ring over which the factorization is implemented by typing x = QQ["x"] at the beginning of the code. This defines x as a polynomial variable over the rational numbers. However, in our case this approach yields the same factorization as the default setting. Additionally, by typing p.roots() Sage will return the roots of the polynomial p, along with their multiplicities:

[(2, 1), (-1, 1), (1/2, 2)]  # in the pair (a, b), a denotes a root, b its multiplicity

As for the graph of p(x) for x ∈ [−2, 2], one may add the syntax

plot(4*x^4 - 8*x^3 - 3*x^2 + 7*x - 2, x, -2, 2)

(b) The first division gives

        x³ + x − 1
x + 2 ) x⁴ + 2x³ + x² + x + 12
      −(x⁴ + 2x³)
              x² + x
            −(x² + 2x)
                 −x + 12
               −(−x − 2)
                      14

Thus p1(x) = q1(x)π1(x) + υ1(x), where π1(x) = x³ + x − 1 and υ1(x) = 14 (a constant). For the second division one gets

              2x³ + 11x² − 11
x² − 5x + 1 ) 2x⁵ + x⁴ − 53x³ − 7x + 3
            −(2x⁵ − 10x⁴ + 2x³)
                  11x⁴ − 55x³
                −(11x⁴ − 55x³ + 11x²)
                       −11x² − 7x + 3
                     −(−11x² + 55x − 11)
                            −62x + 14

hence p2(x) = q2(x)π2(x) + υ2(x) with π2(x) = 2x³ + 11x² − 11 and υ2(x) = −62x + 14, respectively. Thus υ2(x) = −62x + υ1(x), and the second relation follows since

              2
x³ + x − 1 ) 2x³ + 11x² − 11
           −(2x³ + 2x − 2)
               11x² − 2x − 9

□

1.G.11. (Polynomial division via SageMath) Given a polynomial p(x) = anxⁿ + an−1x^{n−1} + · · · + a1x + a0 and a number x0, use Sage to implement the division of p(x) by (x − x0), resulting in a quotient polynomial q(x) and a remainder r. Next use your program to confirm that dividing p(x) = 4 − 3x + 5x² − 2x³ + x⁴ by (x − 3) results in the quotient q(x) = 21 + 8x + x² + x³ and the remainder r = 67. ⃝
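Sage's built-in quo_rem offers a convenient cross-check for Euclidean divisions such as those in 1.G.10(b); it is of course not the hand-made implementation requested in 1.G.11, only a way to verify the answers:

R.<x> = QQ[]                       # polynomials over the rationals
p1 = x^4 + 2*x^3 + x^2 + x + 12
print(p1.quo_rem(x + 2))           # (x^3 + x - 1, 14)
p = x^4 - 2*x^3 + 5*x^2 - 3*x + 4
print(p.quo_rem(x - 3))            # (x^3 + x^2 + 8*x + 21, 67)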
The omission of basic features of integers, such as their divisibility and the theory of prime numbers, has been intentional up to this point. It is more practical to explore these topics in detail in Chapter 11, which is entirely dedicated to number theory. For this reason, we delay discussing additional tasks related to the beautiful theory of integers and prime numbers, and instead focus on problems involving complex numbers and their geometry for now.

1.G.12. (True or False?) Answer the following True or False questions, providing a proof to support your statements.
(1) “The expression of the complex number z = (1 + 2i)/(4 + 5i) in the form z = x + iy is given by −6/41 + i·13/41.”
(2) “The ratio (1 + i)^{2023}/(1 − i)^{2020} equals 2i − 2.”
(3) “The product zw of two elements z, w ∈ S¹ on the unit circle S¹ = {z ∈ C : |z| = 1} also belongs to S¹.”
(4) “Any element z ∈ S¹ has an inverse which also belongs to S¹.”
(5) “The cubic equation z³ + i = 0 has three solutions over C, given by z1 = i, z2 = √3/2 − i/2, z3 = −√3/2 − i/2.” ⃝

1.G.13. Consider the complex numbers z1 = 2 + 6i and z2 = −4 − 2i. Suppose that there is another complex number w = x + iy which is at a distance of 2√2 from z2. Determine the positions of w in the plane R², such that the triangle formed by the points P1 = [2, 6], P2 = [−4, −2] and P3 = [x, y] (representing z1, z2 and w, respectively) is a right triangle. In particular, show that there are two possible solutions for w, and illustrate both of them using SageMath. ⃝

1.G.14. Let a, b, c, d ∈ R with ac > 0 and ad = bc. Show that the equation P(x) = 0 for the polynomial P(x) = ax³ + bx² + cx + d has a unique real solution and two purely imaginary solutions.

Solution. The condition ac > 0 implies that a ≠ 0, and so d = bc/a. Therefore we see that

P(x) = ax³ + bx² + cx + bc/a = (1/a)(a²x³ + abx² + acx + bc) = (1/a)(ax²(ax + b) + c(ax + b)) = (1/a)(ax + b)(ax² + c).

As a consequence, the equation P(x) = 0 is equivalent to (ax + b)(ax² + c) = 0, that is, x = −b/a or x² = −c/a < 0. So we have the real solution x0 = −b/a and two purely imaginary ones given by x1,2 = ±i√(c/a). □

Euler’s identity for complex numbers states that any complex number z = r(cos(φ) + i sin(φ)), where r = |z| and φ is the argument of z, can be expressed as z = r e^{iφ}. Euler’s identity thus provides an alternative to the polar form of z. For example, given z ∈ S¹ we get z = cos(φ) + i sin(φ) = e^{iφ}, with |z| = 1. It is worth noting that the exponential e^{iy} of an imaginary number iy is given by the power series Σ_{n=0}^{∞} (iy)ⁿ/n!. We will present a proof of this claim later, in Section 5.4.12, which discusses the relationship between exponentials and trigonometric functions. However, it is useful to get an initial understanding of this concept at this point.

1.G.15. Based on Euler’s identity, provide an answer to the following tasks:
(a) Express e^{iπ/2}, e^{iπ}, and e^{i2π} in algebraic form.
(b) Show that cos(φ) = (e^{iφ} + e^{−iφ})/2 and sin(φ) = (e^{iφ} − e^{−iφ})/(2i), respectively. Deduce that tan(φ) = −i (e^{iφ} − e^{−iφ})/(e^{iφ} + e^{−iφ}).
(c) Show that |e^{iφ}| = 1.
(d) Using the formula e^{i(α+β)} = e^{iα} e^{iβ}, prove the following classical trigonometric identities:

cos(α + β) = cos(α) cos(β) − sin(α) sin(β),
sin(α + β) = sin(α) cos(β) + cos(α) sin(β).

Solution. (a) Using Euler’s identity we obtain

e^{iπ/2} = cos(π/2) + i sin(π/2) = i, e^{iπ} = cos(π) + i sin(π) = −1, e^{i2π} = cos(2π) + i sin(2π) = 1.

(b) We have e^{iφ} = cos(φ) + i sin(φ) and cos(φ) − i sin(φ) = e^{−iφ}, the complex conjugate of e^{iφ} (see also 5.4.12). By adding these two equations and solving for cos(φ), we derive the formula for cos(φ). In a similar way (subtracting and dividing by 2i) we obtain the formula for sin(φ), and the final claim follows in combination with the usual definition of the trigonometric function tan, that is, tan(φ) := sin(φ)/cos(φ).
(c) Indeed, |e^{iφ}| = |cos(φ) + i sin(φ)| = √(cos²(φ) + sin²(φ)) = √1 = 1.
(d) We see that

e^{i(α+β)} = e^{iα} e^{iβ} = (cos(α) + i sin(α)) · (cos(β) + i sin(β)) = (cos(α) cos(β) − sin(α) sin(β)) + i (sin(α) cos(β) + cos(α) sin(β)).

On the other hand, e^{i(α+β)} = cos(α + β) + i sin(α + β). By comparing the real and imaginary parts of these two expressions we obtain the given formulas. □

1.G.16. (a) Show that the following equation does not admit a real solution:

(1 + iz)^{2025} = (2025 + i)/(1 + i·2025), z ∈ C. (∗)

(b) Consider the complex number w = (1/2)(z + 1/z) with z ∈ C such that arg(z) = φ ≠ kπ for all k ∈ Z and Im(w) = 0. Show that |z| = 1, i.e., z lies on the unit circle.

Solution. (a) Suppose on the contrary that the equation (∗) has a real solution z = x ∈ R, that is, (1 + ix)^{2025} = (2025 + i)/(1 + 2025i). Taking absolute values of both sides we have the following equivalent relations:

|1 + ix|^{2025} = |2025 + i|/|1 + 2025i| ⟺ (√(1 + x²))^{2025} = √(2025² + 1)/√(1 + 2025²) = 1 ⟺ √(1 + x²) = 1 ⟺ 1 + x² = 1 ⟺ x = 0.

However, x = 0 cannot be a solution of the equation (∗), as follows from the computation below:

1^{2025} = (2025 + i)/(1 + i·2025) ⟺ 1 + i·2025 = 2025 + i,

a contradiction. Hence the given equation does not admit a real solution.
(b) By assumption, φ is the (principal) argument of z, and we may use the polar form z = |z|(cos φ + i sin φ).
Thus we can write

w = (1/2)( |z|(cos φ + i sin φ) + (1/|z|)·1/(cos φ + i sin φ) )
  = (1/2)( |z|(cos φ + i sin φ) + (1/|z|)(cos(−φ) + i sin(−φ)) )
  = (1/2)( |z|(cos φ + i sin φ) + (1/|z|)(cos(φ) − i sin(φ)) )
  = (1/2)( |z| + 1/|z| ) cos φ + i (1/2)( |z| − 1/|z| ) sin φ.

Here we have used that the complex number e^{iφ} = cos φ + i sin φ satisfies (see also the previous problem)

1/e^{iφ} = e^{−iφ} = cos(−φ) + i sin(−φ) = cos(φ) − i sin(φ).

Thus the requirement Im(w) = 0 (in combination with the condition φ ≠ kπ, which guarantees sin φ ≠ 0) gives

(1/2)( |z| − 1/|z| ) sin φ = 0 ⟺ |z| − 1/|z| = 0 ⟺ |z|² = 1 ⟺ |z| = 1. □

1.G.17. Roots of unity and regular n-polygons. (a) Prove that the cube roots of 1 are the vertices of an equilateral triangle inscribed in the unit circle S¹ (in the plane), with side length √3. Verify your computation via Sage.
(b) As a generalization, deduce that for any positive integer n the equation zⁿ = 1, where z ∈ C, has n roots z1, z2, . . . , zn, called the nth roots of unity. Prove that these solutions form the vertices of a regular n-polygon in the plane.

Solution. (a) We are interested in the solutions of the equation z³ = 1, where z ∈ C. By the fundamental theorem of algebra we expect three solutions over C. To find them explicitly, we can use de Moivre’s formula, which provides an easy way of solving equations of the form zⁿ = z0. Indeed, in polar form the equation z³ = 1 is expressed as r³(cos(3φ) + i sin(3φ)) = 1·(cos 0 + i sin 0). Verify by simple computations that this equation is satisfied if and only if r = 1 and 3φ = 0 (modulo 2π). This means that the three solutions are of the form z0 = cos 0 + i sin 0, z1 = cos(2π/3) + i sin(2π/3), and z2 = cos(4π/3) + i sin(4π/3), or in algebraic form

z0 = 1, z1 = −1/2 + i√3/2, z2 = −1/2 − i√3/2.

An appropriate Sage cell that gives directly all three solutions in algebraic form reads as

z = var("z"); solve(z**3 - 1 == 0, z)

It is easy to verify that |zi| = 1 for any i = 0, 1, 2, so each of them lies on the unit circle S¹, see the figure below. In addition, we see that z1² = z̄1 = z2. It follows that the coordinate points of z0, z1, z2 form an equilateral triangle with sides of length √3, as we can see in the figure (for instance, the points z1, z2 are conjugate to each other, so their distance is 2·(√3/2) = √3).

(b) The first claim follows from the fundamental theorem of algebra. Now, as in the case n = 3, de Moivre’s formula provides the arguments of the roots. In particular, we see that the argument multiplied by n has to be a multiple of 2π. Moreover, the absolute value has to be one, so the roots are of the form

zk = cos(2kπ/n) + i sin(2kπ/n), k = 0, . . . , n − 1.

Recall now that polygons are geometric objects in the plane composed of points and line segments connected together to close and form a single shape. A regular polygon is a polygon having equal angles and sides of equal length. For the roots derived above we see that the arguments of any two consecutive roots differ by 2π/n. Moreover, the absolute value of any root equals 1. Together these claims prove that the points in question are the vertices of a regular n-polygon. □

1.G.18. Polygons via Sage. In Sage one can apply the following cell to produce the second figure given above (this already includes the code producing the first, simpler figure).
n = 3; A = [exp(2*pi*I*k/n) for k in range(n)]
a = plot([], xmin=-1, xmax=1, ymin=-1, ymax=1, figsize=(4,4), rgbcolor=(1/4,1/8,3/4))
a += list_plot(A, color="black", size=70, xmin=-1, xmax=1)
a += circle((0.0, 0.0), 1, color="black")
a += line([(-1/2, (sqrt(3)/2)), (-1/2, -(sqrt(3)/2))], color="black", linestyle="--")
a += line([(-1/2, (sqrt(3)/2)), (1, 0)], color="black", linestyle="--")
a += line([(1, 0), (-1/2, -(sqrt(3)/2))], color="black", linestyle="--")
a += text("$z_2$", (-0.65, (sqrt(3)/2)+0.05), color="black", fontsize="16")
a += text("$z_3$", (-0.65, -(sqrt(3)/2)-0.05), color="black", fontsize="16")
a += text("$z_1$", (1.10, 0.07), color="black", fontsize="16")
show(a)

In fact, the commands in the first five lines of the previous code, with the only difference being the initial n, can produce similar plots for any favourite n (though for large n Sage needs some time to answer). Below we present the figures for n = 5, 15, 25, 50.

Finally, it is also possible to produce the regular polygons directly, using the command polygon2d. Let us present a Sage cell adapted to the case n = 6.

n = 6; A = [exp(2*pi*I*k/n) for k in range(n)]
a = plot([], xmin=-1, xmax=1, ymin=-1, ymax=1, figsize=(4,4))
a += list_plot(A, color="black", size=10, xmin=-1, xmax=1)
a += circle((0.0, 0.0), 1, color="black")
L = [[cos(pi*i/3), sin(pi*i/3)] for i in range(6)]
b = polygon2d(L, color="lightgrey")
show(a + b)

This produces the hexagon.

1.G.19. Enumerate all ordered pairs (x, y) ∈ R × R satisfying the relation (x + iy)^{2024} = x − iy.

Solution. Setting z = x + iy, the given relation becomes z^{2024} = z̄. Recall that z·z̄ = |z|², and by mathematical induction (or by de Moivre’s formula) one can prove that |zⁿ| = |z|ⁿ for any n ∈ Z. Thus, taking absolute values in the relation z^{2024} = z̄, we compute

|z|^{2024} = |z^{2024}| = |z̄| = |z| ⟺ |z|(|z|^{2023} − 1) = 0.

From this equation we see that either |z| = 0 or |z| = 1. The first condition implies that (x, y) = (0, 0). If |z| = 1, then using the relation z^{2024} = z̄ we see that z^{2025} = z·z^{2024} = z·z̄ = |z|² = 1. However, the equation z^{2025} = 1 has 2025 distinct roots. Hence, all together we can enumerate 2025 + 1 ordered pairs (x, y) ∈ R × R satisfying the given condition. □

B) Material on difference equations

1.G.20. Tilings. For Problem 1.B.4 on 2-compositions of natural numbers and for the given solution, provide a visual interpretation.

Solution. To provide a visual presentation of Cn (and so also of the Fibonacci numbers), one may think of the 1s as squares and the 2s as dominoes. Then Cn counts the distinct ways to tile a board of length n with squares and dominoes, and the result proved in 1.B.4 means that we have Fn+1 ways to do so. Below we illustrate the idea for n = 4 and n = 5. For n = 4 one has the 2-compositions 1 + 1 + 1 + 1, 1 + 1 + 2, 1 + 2 + 1, 2 + 1 + 1, 2 + 2, that is, C4 = 5 = F5. Notice that C4 = C3 + C2 = 3 + 2. For n = 5 we can form 8 such decompositions, i.e., C5 = 8 = F6, as we can also see in the figure below (notice that C5 = C4 + C3 = 5 + 3). □
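The tiling interpretation can be checked by a few lines of code: the recursion below mirrors the first tile being either a square or a domino (the function name is our own):

from functools import lru_cache

@lru_cache(maxsize=None)
def C(n):
    # number of tilings of a 1 x n board by squares and dominoes
    if n in (0, 1):
        return 1
    return C(n - 1) + C(n - 2)

print([C(n) for n in range(1, 9)])   # [1, 2, 3, 5, 8, 13, 21, 34]

In particular C(4) = 5 and C(5) = 8, matching the decompositions listed above.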
1.G.21. Basic properties of Fibonacci numbers. Show that the Fibonacci numbers Fn satisfy the following:
(a) the sum of the first n Fibonacci numbers equals Fn+2 − 1;
(b) F1² + F2² + · · · + Fn² = Fn Fn+1;
(c) Fm+n = Fn−1 Fm + Fn Fm+1;
(d) Fn−1 Fn+1 − Fn² = (−1)ⁿ.

Solution. (a) We have F1 = F3 − F2, F2 = F4 − F3, . . . , Fn−1 = Fn+1 − Fn and Fn = Fn+2 − Fn+1. Adding these relations (side by side), we get F1 + F2 + · · · + Fn = Fn+2 − F2, and the claim follows.
(b) First notice that using the recurrence relation of the Fibonacci numbers, one may extend these numbers to zero and negative indices; here we need to know that F0 = F2 − F1 = 0. Now, since we have Fk Fk+1 − Fk−1 Fk = Fk(Fk+1 − Fk−1) = Fk², we obtain the relations

F1² = F1F2, F2² = F2F3 − F1F2, · · · , Fn² = FnFn+1 − Fn−1Fn.

Adding these relations (side by side) we arrive at the desired formula.
(c)+(d) The last two cases can be proved by induction on n; we shall prove only (d), which is known as the Cassini identity. For n = 1 it becomes F0F2 − F1² = 0 · 1 − 1² = −1 = (−1)¹, which is true. The induction hypothesis says that the result is true for some arbitrary positive integer k, which means that

Fk−1 Fk+1 − Fk² = (−1)ᵏ. (∗)

Then, for n = k + 1 we compute

Fk Fk+2 − Fk+1² = (Fk+1 − Fk−1)(Fk + Fk+1) − Fk+1²
= Fk+1Fk + Fk+1² − Fk−1Fk − Fk−1Fk+1 − Fk+1²
= Fk+1Fk − Fk−1Fk − (Fk² + (−1)ᵏ)
= Fk+1Fk − Fk(Fk−1 + Fk) + (−1)ᵏ⁺¹
= Fk+1Fk − FkFk+1 + (−1)ᵏ⁺¹ = (−1)ᵏ⁺¹,

where we have applied repeatedly the definition of the Fibonacci numbers and, in the third equality, the hypothesis (∗). This means that the claim is true for n = k + 1, and hence it is valid for all n ≥ 1. □

Recurrence relations have a wide range of applications across various fields, often in surprising ways. For example, linear recurrences naturally arise in simple geometric problems, requiring creative problem-solving skills. For your convenience, we have included some examples of these problems below. In Chapter 3 we will cover difference equations again, this time with more advanced applications, expanding upon the introduction given in this chapter.

1.G.22. Consider n lines dividing a plane into regions. Find the maximum number of regions that are formed.

Solution. Let us denote the number of regions by pn. If there is no line in the plane, then the whole plane is one region and p0 = 1. If there are n lines, then adding an additional line increases the number of regions exactly by the number of regions that the new line intersects. If no lines are parallel and no three lines intersect at the same point, then the number of regions intersected by the (n + 1)th line is one more than the number of intersections it has with the existing n lines. Each intersection divides an existing region into two, resulting in an increase of one region for each intersection. The new line can have at most n intersections with the existing n lines. Each segment of the line between two intersections crosses exactly one region, so the new line can cross at most n + 1 regions. Thus we obtain the recurrence relation pn+1 = pn + (n + 1), with p0 = 1. We can derive an explicit formula for pn via Corollary 1.2.3, which we leave to the reader. A more straightforward approach relies on the observation that pn = pn−1 + n. This gives

pn = pn−1 + n = pn−2 + (n − 1) + n = pn−3 + (n − 2) + (n − 1) + n = . . . = p0 + Σ_{i=1}^{n} i = 1 + n(n + 1)/2 = (n² + n + 2)/2.

Or we can use Sage, as we did in 1.B.3:

from sympy import Function, rsolve
from sympy.abc import n
p = Function("p")
eq = p(n) - p(n+1) + n + 1
initial = {p(0): 1, p(1): 2}
rsolve(eq, p(n), initial)

□

1.G.23. What is the maximum number of areas that a 3-dimensional space can be divided into by n planes?

Solution. Let rn be the desired number. Obviously, r0 = 1. As in 1.G.22, let us consider n planes in the space. Now, add an additional plane and determine the maximum number of new regions it can create. The maximum number of new regions is precisely the number of regions that the new plane intersects.
What is the maximum number of new regions that can be created? The number of regions intersected by the (n + 1)th plane equals the number of regions created on this new plane by its intersections with the n existing planes. These intersections are lines in the new plane, so according to the previous exercise in the plane, there can be at most (n² + n + 2)/2 such regions. Therefore, we obtain the recurrence formula

rn+1 = rn + (n² + n + 2)/2.

This equation can be solved directly as follows:

rn = rn−1 + ((n−1)² + (n−1) + 2)/2 = rn−1 + (n² − n + 2)/2
   = rn−2 + ((n−1)² − (n−1) + 2)/2 + (n² − n + 2)/2
   = . . .
   = r0 + (1/2) Σ_{i=1}^{n} i² − (1/2) Σ_{i=1}^{n} i + Σ_{i=1}^{n} 1
   = 1 + n(n + 1)(2n + 1)/12 − n(n + 1)/4 + n = (n³ + 5n + 6)/6.

Above we used the formula Σ_{i=1}^{n} i² = n(n + 1)(2n + 1)/6, which you may want to prove using mathematical induction. □

1.G.24. Find the maximum number of areas that the plane can be divided into by n circles.

Solution. A solution to this geometric problem again relies on a recurrence, which can be described as follows. First observe that the (n + 1)th circle intersects the n existing circles in at most 2n points. For example, see the figure presented at the right-hand side for the case of 2 + 1 circles; the third circle intersects the previous two in 4 points (hence 4 = 2n for n = 2). Thus, for the maximum number pn of areas one can derive the recurrence formula pn+1 = pn + 2n. Clearly p1 = 2. Thus for pn we obtain

pn = pn−1 + 2(n − 1) = pn−2 + 2(n − 2) + 2(n − 1) = · · · = p1 + Σ_{i=1}^{n−1} 2i = n² − n + 2.

Or we may directly solve the recurrence in Sage, as before:

from sympy import Function, rsolve
from sympy.abc import n
p = Function("p")
eq = p(n) - p(n+1) + 2*n
initial = {p(1): 2, p(2): 4}
rsolve(eq, p(n), initial)

□

1.G.25. Determine the maximum number of regions that a 3-dimensional space can be divided into by n balls. ⃝

1.G.26. Find the number of regions formed when n distinct planes intersect at a single point in a 3-dimensional space. ⃝

C) Material on combinatorics

1.G.27. Pascal’s triangle via Sage. Recall that Pascal’s triangle is formed by the binomial coefficients, as discussed in the section at the end of the paragraph 1.3.2. In particular, the entry in the nth row and kth column of Pascal’s triangle is given by the binomial coefficient (n choose k). For example, for n = 6 we get the corresponding illustration (figure omitted here). Note that in the second figure, the colored “diagonals” represent the natural numbers and the triangular numbers, respectively, the latter as illustrated in the first exercise of this section. Present Pascal’s triangle via Sage for n = 9.

Solution. It is sufficient to use the command binomial(n, i) and type the cell:

[[binomial(n, i) for i in range(n+1)] for n in range(9)]

Alternatively, we can give the cell

n = 8
[[binomial(i, j) for j in range(i + 1)] for i in range(n + 1)]

Sage prints out the following:

[[1], [1, 1], [1, 2, 1], [1, 3, 3, 1], [1, 4, 6, 4, 1], [1, 5, 10, 10, 5, 1], [1, 6, 15, 20, 15, 6, 1], [1, 7, 21, 35, 35, 21, 7, 1], [1, 8, 28, 56, 70, 56, 28, 8, 1]]

□

1.G.28. For any fixed n ∈ N, determine the number of all solutions of the equation

x1 + x2 + · · · + xk = n

for the following two cases: (a) over the set Z≥ := {a ∈ Z : a ≥ 0} of non-negative integers; (b) over the set Z+ := {a ∈ Z : a > 0} of strictly positive integers.

Solution. (a) Every solution (r1, . . . , rk)
to the equation Σ_{i=1}^{k} xi = n can be uniquely encoded as a sequence of 1s and separators. In particular, the sequence is constructed by writing r1 1s, followed by a separator, then r2 1s, followed by another separator, and so on. Consequently, this sequence contains exactly n 1s and k − 1 separators. Each such sequence uniquely corresponds to a solution of the given equation. Therefore, the number of solutions is equal to the number of such sequences, and the latter is given by (n + k − 1 choose n).
(b) We now look for solutions in the domain of positive integers. We see that the natural numbers x1, . . . , xk provide a solution of the given equation if and only if the non-negative integers yi = xi − 1, i = 1, . . . , k, form a solution of the equation

y1 + y2 + · · · + yk = n − k.

Using the result obtained in (a), we deduce that there are (n − 1 choose k − 1) of them. □

1.G.29. (Trolleybus) (a) In how many ways can five people be seated in a car for five people, under the assumption that only two of them have a driving licence? (b) In how many ways can 20 passengers and two drivers be seated in a trolleybus for 25 people? ⃝

1.G.30. (Flipping coins – I) We flip a coin six times. (a) How many distinct sequences of heads and tails are there? (b) How many sequences with exactly four heads are there? (c) How many sequences with at least two heads are there? ⃝

1.G.31. (Playing with divisions) Determine the number of all possible ways in which the following can happen:
(a) dividing 40 identical balls among 4 boys;
(b) dividing 33 distinct coins among three people A, B and C such that A and B together have twice as many coins as C;
(c) dividing 9 girls and 6 boys into two groups such that each group contains at least two boys. ⃝

1.G.32. According to quality, we divide food products into groups I, II, III, IV. Determine the number of all possible divisions of 9 food products into these groups, such that the numbers of products in the groups are all distinct.

Solution. If we directly write the considered groups from the elements of I, II, III, IV, we create combinations with repetition of the ninth order from four elements. The number of such combinations is (12 choose 9) = 220. □

1.G.33. (Handshakes) New players meet in a volleyball team (6 people). How many handshakes are there when everybody shakes hands once with everybody else? How many handshakes are there if everybody shakes hands once with each opponent after playing a match? ⃝

1.G.34. We need to accommodate 9 people in one four-bed room, one three-bed room and one two-bed room. In how many ways can this be done?

Solution. We may assign the number 1 to the people in the four-bed room, the number 2 to those in the three-bed room and the number 3 to those in the two-bed room. In this way we create permutations with repetition from the elements 1, 2, 3, where 1 occurs four times, 2 three times and 3 twice. The number of such permutations is 9!/(4!·3!·2!) = 1 260. □

1.G.35. In a long-distance race, where the racers start one after another at given time intervals, there were k racers, among them 3 friends. Determine the number of starting schedules in which no two of the 3 friends start next to each other. For simplicity assume k ≥ 5.

Solution. The remaining k − 3 racers can be ordered in (k − 3)! ways. For the three friends there are then k − 2 places (the start, the end and the k − 4 gaps), where we can place them in v(k − 2, 3) ways. Using the rule of (combinatorial) product, we obtain

(k − 3)! · (k − 2) · (k − 3) · (k − 4) = (k − 2)! · (k − 3) · (k − 4). □
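The stars-and-bars counts of 1.G.28 can be cross-checked by brute-force enumeration for small parameters (the values of n and k below are our own choice); the cell runs in Sage:

from itertools import product

n, k = 6, 3
nonneg = sum(1 for xs in product(range(n + 1), repeat=k) if sum(xs) == n)
positive = sum(1 for xs in product(range(1, n + 1), repeat=k) if sum(xs) == n)
print(nonneg, binomial(n + k - 1, n))     # 28 28
print(positive, binomial(n - 1, k - 1))   # 10 10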
1.G.36. There are 32 participants in a tournament. The organisers have stated that the participants must be divided arbitrarily into four groups, such that the first one has size 10, the second and the third 8, and the fourth 6. In how many ways can this be done?

Solution. We can imagine that from the 32 participants we create a row, where the first 10 form the first group, the next 8 the second group, and so on. On the other hand, there are 32! orderings of all participants. Note that the division into groups is not influenced if we change the order of the people within the same group. Therefore the number of distinct divisions equals 32!/(10!·8!·8!·6!). □

D) Material on probabilities

Next we will provide supplementary material on basic and conditional probability. Probability theory will be explored in depth in Chapter 10, where we will build on this foundational material by introducing additional key results, such as Bayes’ theorem, and discussing its significance in statistics. Additionally, the rich interplay between combinatorics and probability will be examined in Chapter 13, which concludes the book.

1.G.37. For the final exams in the 6th class of all the high schools in a European country, it is known that:
• 25% of the students fail the exams in mathematics;
• 18% of the students fail the exams in physics;
• 10% of the students fail both the exams in mathematics and in physics.
Choose randomly a student belonging to this class.
(a) Find the probability that the student has failed at least one of these two subjects.
(b) Find the probability that the student has failed the exams in mathematics, but not in physics.
(c) If the student has failed the exams in physics, find the probability that he also failed the exams in mathematics.
(d) If the student has failed the exams in mathematics, find the probability that he also failed the exams in physics.

Solution. We define two events: the event M, described by “the candidate has failed the exams in mathematics”, and the event P, described by “the candidate has failed the exams in physics”.
(a) We should compute P(M ∪ P) = P(M) + P(P) − P(M ∩ P), which gives 0.25 + 0.18 − 0.1 = 0.33.
(b) In this case we should compute P(M ∩ Pᶜ), which equals P(M) − P(M ∩ P) = 0.25 − 0.1 = 0.15.
(c) We are looking for the probability P(M|P). By definition, we have

P(M|P) = P(M ∩ P)/P(P) = 0.1/0.18 = 5/9 ≈ 0.56.

(d) In this final case we are looking for the probability P(P|M), which is given by

P(P|M) = P(P ∩ M)/P(M) = 0.1/0.25 = 0.4. □

1.G.38. From ten cards, where exactly one is an ace, we randomly draw a card and put it back. How many times must we do this, so that the probability that the ace is drawn at least once is greater than 0.9?

Solution. Let Ai be the event “at the ith drawing the ace is drawn”. The individual events Ai are independent, so we know that

P(∪_{i=1}^{n} Ai) = 1 − (1 − P(A1)) · (1 − P(A2)) · · · (1 − P(An)), n ∈ N.

We are looking for some n ∈ N satisfying

P(∪_{i=1}^{n} Ai) = 1 − (1 − P(A1)) · (1 − P(A2)) · · · (1 − P(An)) > 0.9.

Clearly, P(Ai) = 1/10 for all i ∈ N. Therefore, it is sufficient to solve the inequality

1 − (9/10)ⁿ > 0.9,

from which one gets n > log_a 0.1 / log_a 0.9, for any base a > 1. Evaluating, we deduce that one must do the drawing at least twenty-two times. In particular, a Sage cell verifying this has the following form:

x = var("x")
ineq = 1 - 0.9**x > 0.9
sol = solve(ineq, x, algorithm="sympy")
print(sol)

Sage prints out x > 21.8543453267828. □
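Alternatively (a small cell of our own), the smallest such n can be computed directly from the equivalent inequality (9/10)ⁿ < 1/10:

m = ceil(log(0.1) / log(0.9))
print(m, (1 - (9/10)^m).n())    # 22 0.901...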
1.G.39. (Playing with dice) (a) Throwing a die eleven times in a row, the result was 4 each time. Determine the probability that the twelfth roll results in 4. (b) Throwing n dice, compute the probability that the values 1, 3 and 6 are not present among the numbers that appear. (c) Throwing two dice, determine the conditional probability that the first die resulted in five, under the condition that the sum is 9. Based on this result, decide whether the events “the first die results in five” and “the sum is 9” are independent. ⃝
1.G.40. We flip a coin five times. For every head, we put a white ball in a hat; for every tail, we put a black ball in the same hat. Compute the probability that the hat contains more black balls than white balls, given that there is at least one black ball in the hat.
Solution. We introduce the following two events:
A − “There are more black than white balls in the hat”;
H − “There is at least one black ball in the hat”.
Notice that the event A has H as a consequence. Denote by $A^c$, $H^c$ the complementary events of A and H, respectively. The goal is to compute the conditional probability P(A|H). The probability $P(H^c)$ is $2^{-5}$, so $P(H) = 1 - 2^{-5}$. Moreover, since five balls can never split evenly, the probability of A is the same as the probability $P(A^c)$ of the complementary event of A, and hence P(A) = 1/2. Furthermore, P(A ∩ H) = P(A), since the event H contains the event A, as mentioned above. This finally gives
$P(A|H) = \frac{P(A \cap H)}{P(H)} = \frac{1/2}{1 - (1/2)^5} = \frac{16}{31}$. □
1.G.41. In a painting exhibition there are 15 paintings, of which 12 are authentic and 3 are copies. A visitor chooses a painting at random and asks the opinion of an art expert about the authenticity of the artwork. The expert gives a correct opinion, for an original painting as well as for a copy, on average 9 times out of 10. If the expert decides that the painting is authentic, find the probability that the expert is right.
Solution. Let us consider the events:
A − “The painting is authentic”;
B − “The expert considers the painting authentic”.
We are looking for the conditional probability P(A|B). By the statement we have P(A) = 12/15 = 4/5 and $P(A^c) = 3/15 = 1/5 = 1 - P(A)$. We also see that the probability of the event B occurring, given that A is true, equals 9/10, i.e., P(B|A) = 9/10. Similarly $P(B^c|A^c) = 9/10$ and moreover $P(B^c|A) = 1/10 = P(B|A^c)$. Thus, by the definition of the conditional probability and the product rule we get
$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) \cdot P(B|A)}{P(B)}$. (∗)
However, by Bayes’ formula (see 10.2.7) we have
$P(B) = P(A) \cdot P(B|A) + P(A^c) \cdot P(B|A^c) = \frac{4}{5} \cdot \frac{9}{10} + \frac{1}{5} \cdot \frac{1}{10} = \frac{37}{50}$.
Thus, in combination with (∗) we arrive at P(A|B) = 36/37. □
1.G.42. A rod of length two meters is randomly divided into three parts. Determine the probability that at least one part is at most 20 cm long.
Solution. A random division of the rod into three parts is given by two cut points x and y (we first cut the rod at distance x from the origin; we do not move it and cut it again at distance y from the origin). The sample space is a square C with side 200 cm (2 meters), see also the figure below. If we place the square C so that two of its sides lie on the axes in the plane, then the condition that at least one part is at most 20 cm determines in the square a subregion O, defined by
O = { (x, y) ∈ C : x ≤ 20 ∨ x ≥ 180 ∨ y ≤ 20 ∨ y ≥ 180 ∨ |x − y| ≤ 20 }.
A straightforward computation shows that this subregion has area 51/100 times the area of the square.
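Indeed (the following check is our addition), the complement of O in C consists of the points with 20 < x, y < 180 and |x − y| > 20, i.e., two congruent right triangles with legs 160 − 20 = 140 cm, and a two-line Sage cell carries out the resulting exact arithmetic:

# The complement of O consists of two right triangles with legs 140,
# so the favourable ratio is 1 - (2 * 140^2 / 2) / 200^2.
square_area = 200^2
complement_area = 2 * (140^2 / 2)
print(1 - complement_area / square_area)  # prints 51/100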
Let us now explain how one can treat this task via Sage. We first use a cell to generate some random points (working in metres, so that a piece is short when it is at most 0.2 m long).
num_points = 100_000  # generate num_points points
x = [RealField().random_element(min=0, max=2) for _ in range(num_points)]
y = [RealField().random_element(min=0, max=2) for _ in range(num_points)]
Next we divide the points into two groups, according to whether at least one piece of the rod is at most 20 cm long or not.
short_rod_cases = []
other_cases = []
for i, j in zip(x, y):
    if i <= 0.2 or i >= 1.8 or j <= 0.2 or j >= 1.8 or abs(i - j) <= 0.2:
        short_rod_cases.append((i, j))
    else:
        other_cases.append((i, j))
Finally we plot these two sets using blue and red colours (as we see in the figure shown above) via the cell
scatter1 = scatter_plot(short_rod_cases, facecolor="blue", marker="o", markersize=10, edgecolor="blue")
scatter2 = scatter_plot(other_cases, facecolor="red", marker="o", markersize=10, edgecolor="red")
show(scatter1 + scatter2, axes=True, aspect_ratio=1, ticks=[0.2, 0.2])
This returns the figure shown above. Recall that we are interested in the ratio of the blue area to the total area of the square. To this end, we can compute its approximate value by typing
len(short_rod_cases) / (len(short_rod_cases) + len(other_cases))
Sage prints out the estimate 0.51018. □
1.G.43. Michael and Alex have lunch at the school canteen, which is open from 11 to 14. Each of them spends 30 minutes at lunch, and the arrival times are random. What is the probability that they meet on a given day, if they always sit at the same table? ⃝
E) Material on plane geometry
1.G.44. Compute the area S of a quadrilateral with vertices A = [0, −2], B = [1, −1], C = [1, 5], and D = [−1, 1]. ⃝
1.G.45. Determine the relative position of the lines p, q in the plane for p : 2x − y − 5 = 0, q : x + 2y − 5 = 0. If they are not parallel, determine the intersection point. ⃝
1.G.46. Determine the sum of the three angles between the vectors (1, 1), (2, 1) and (3, 1), respectively, and the x-axis in the plane $R^2$. ⃝
1.G.47. Compute the area of a parallelogram with vertices at [5, 5], [6, 8] and [6, 9].
Solution. Although such a parallelogram is not uniquely determined (the fourth vertex is not given), the triangle with vertices at [5, 5], [6, 8] and [6, 9] is necessarily half of every parallelogram with these three vertices (one of the sides of the triangle becomes the diagonal of the parallelogram). Therefore the area equals the determinant
$\begin{vmatrix} 6-5 & 6-5 \\ 8-5 & 9-5 \end{vmatrix} = \begin{vmatrix} 1 & 1 \\ 3 & 4 \end{vmatrix} = 1 \cdot 4 - 1 \cdot 3 = 1$. □
1.G.48. Determine the angle φ between the two diagonals $A_3A_7$ and $A_5A_{10}$ of a regular dodecagon $A_0A_1A_2 \dots A_{11}$.
Solution. The angle depends neither on the size nor on the position of the given dodecagon. Consider a dodecagon inscribed in a circle of radius 1, and let $A_0$ be the point [1, 0]. The vertices of the dodecagon can then be represented by the twelfth roots of unity in the complex plane. Thus we can write $A_k = \cos(2k\pi/12) + i\sin(2k\pi/12)$ and hence
$A_3 = \cos(\pi/2) + i\sin(\pi/2) = i \sim [0, 1]$,
$A_5 = \cos(5\pi/6) + i\sin(5\pi/6) = -\frac{\sqrt{3}}{2} + \frac{1}{2}i \sim [-\frac{\sqrt{3}}{2}, \frac{1}{2}]$,
$A_7 = \cos(7\pi/6) + i\sin(7\pi/6) = -\frac{\sqrt{3}}{2} - \frac{1}{2}i \sim [-\frac{\sqrt{3}}{2}, -\frac{1}{2}]$,
$A_{10} = \cos(5\pi/3) + i\sin(5\pi/3) = \frac{1}{2} - i\frac{\sqrt{3}}{2} \sim [\frac{1}{2}, -\frac{\sqrt{3}}{2}]$.
Combining now with the description given in 1.5.7, we deduce that $\cos\varphi = \frac{1}{2\sqrt{2+\sqrt{3}}}$. This gives φ = 75°. □
1.G.49. Consider the matrices
$A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$, $B = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}$, $X = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, $D = \begin{pmatrix} 5 & -b \\ 4 & -4d \end{pmatrix}$, $E = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
with a, b, c, d ∈ R. Solve the matrix equation $X(E - A^2) + BB^T = D$. ⃝ 1.G.50.
Illustrate the expression of a rotation as a composition of shears on an interactive plot, using Sage (use the method from 1.E.18).
Solution. A solution goes as follows:
# import the right objects and define the shear functions
from sage.plot.point import point
from sage.plot.line import line

def shear_transformation(x, y, shear_factor):
    new_x = x + shear_factor * y
    new_y = y
    return new_x, new_y

def shear_vert_transformation(x, y, shear_factor):
    new_x = x
    new_y = y + shear_factor * x
    return new_x, new_y

# create an illustrative interactive plot
@interact
def plot_shear(b=(0.0, 1.0, 0.02)):
    # choose points and lines to show
    points1 = [(0, 0), (1, 1), (1, 2), (2, 3)]
    points2 = [(1, 1), (2, 2), (2, 3), (3, 4)]
    # get the shear parameter a and apply the shears
    a = (sqrt(-b^2 + 1) - 1)/b
    sheared_points11 = [shear_transformation(x, y, a) for x, y in points1]
    sheared_points12 = [shear_transformation(x, y, a) for x, y in points2]
    sheared_points21 = [shear_vert_transformation(x, y, b) for x, y in sheared_points11]
    sheared_points22 = [shear_vert_transformation(x, y, b) for x, y in sheared_points12]
    sheared_points31 = [shear_transformation(x, y, a) for x, y in sheared_points21]
    sheared_points32 = [shear_transformation(x, y, a) for x, y in sheared_points22]
    # prepare the objects for the final plot
    p1 = point(points1, color='blue', size=50)
    p2 = point(points2, color='blue', size=50)
    p_sheared11 = point(sheared_points11, color='red', size=25)
    p_sheared12 = point(sheared_points12, color='red', size=25)
    p_sheared21 = point(sheared_points21, color='green', size=25)
    p_sheared22 = point(sheared_points22, color='green', size=25)
    p_sheared31 = point(sheared_points31, color='red', size=50)
    p_sheared32 = point(sheared_points32, color='red', size=50)
    l1 = line(points1, thickness=2)
    l2 = line(points2, thickness=2)
    l1_sheared1 = line(sheared_points11, thickness=1, color='black')
    l2_sheared1 = line(sheared_points12, thickness=1, color='black')
    l1_sheared2 = line(sheared_points21, thickness=1, color='green')
    l2_sheared2 = line(sheared_points22, thickness=1, color='green')
    l1_sheared3 = line(sheared_points31, thickness=2, color='red')
    l2_sheared3 = line(sheared_points32, thickness=2, color='red')
    # combine the plots and show the result (choose options)
    combined_plot = p1 + p_sheared11 + p_sheared21 + p_sheared31
    combined_plot += p2 + p_sheared12 + p_sheared22 + p_sheared32
    combined_plot += l1 + l1_sheared1 + l1_sheared2 + l1_sheared3
    combined_plot += l2 + l2_sheared1 + l2_sheared2 + l2_sheared3
    combined_plot.show(axes_labels=['x', 'y'], gridlines=True, aspect_ratio=1, figsize=8)
Executing this code (for one chosen value of b = sin θ, e.g., b = 0.3) we obtain the following picture: □
1.G.51. (a) Let ABC be the triangle in the plane with vertices A = [1, 0], B = [4, −2] and C = [2, −3]. Reflect ABC with respect to the x-axis and illustrate the initial triangle and the result of the reflection via Sage.
Solution. On $R^2$ the reflection with respect to the x-axis is the linear mapping $f : R^2 \to R^2$ with $u = (x, y)^T \mapsto f(u) := (x, -y)^T$. Obviously, the x-axis is the fixed-point set of f, which means that f preserves A = [1, 0]. However, it transforms B and C to B′ = [4, 2] and C′ = [2, 3], respectively, see the figure given below. Alternatively, recall that the matrix of f with respect to the standard basis of $R^2$ is $\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$, so that
$f(u) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ -y \end{pmatrix}$.
Viewing A, B, C as vectors we obtain
$f(A) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, $f(B) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} 4 \\ -2 \end{pmatrix} = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$, $f(C) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} 2 \\ -3 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$.
Thus indeed, reflecting the triangle ABC via f, we arrive at the triangle AB′C′, where B′ = [4, 2] and C′ = [2, 3], respectively. An illustration in Sage can be seen below. Can you reproduce it in your editor? For help, see the cell given in 1.E.26. □
1.G.52. Let ABCD be the trapezoid in the plane with vertices A = [0, 0], B = [4, 0], C = [3, 3/2] and D = [3/2, 3/2]. Let {e1 = (1, 0), e2 = (0, 1)} be the standard basis, identifying the plane with $R^2$.
(a) Transform ABCD via the homothety $h_{1/2}$;
(b) Stretch ABCD in the direction of e1 with stretching parameter c = 3/4;
(c) Stretch ABCD in the direction of e2 with stretching parameter c = 4/3;
(d) Apply to ABCD a horizontal shear with parameter a = 2;
(e) Apply to ABCD a vertical shear with parameter a = 1/2;
(f) Transform ABCD via the axial symmetry.
Illustrate all the cases.
Solution. (a) The matrix of $h_{1/2}$ is $\begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}$, and we have $h_{1/2}((x, y)^T) = (x/2, y/2)^T$. Viewing A, B, C, D as vectors in the plane, one computes
$h_{1/2}(A) = [0, 0] = A$, $h_{1/2}(B) = [2, 0]$, $h_{1/2}(C) = [3/2, 3/4]$, $h_{1/2}(D) = [3/4, 3/4]$.
Thus, the new coordinates of the points B, C, D are B′ = [2, 0], C′ = [3/2, 3/4] and D′ = [3/4, 3/4], respectively, and so $h_{1/2}$ has shrunk the trapezium (because 0 < c = 1/2 < 1), as you can see here:
(b) The matrix of the stretching in the direction of e1 only, with stretching parameter c = 3/4, is again diagonal, given by $\begin{pmatrix} 3/4 & 0 \\ 0 & 1 \end{pmatrix}$. This means that the corresponding transformation is given by $(x, y)^T \mapsto (\frac{3x}{4}, y)^T$, so it preserves the y-coordinates. Therefore
A = [0, 0] → A, B = [4, 0] → [3, 0], C = [3, 3/2] → [9/4, 3/2], D = [3/2, 3/2] → [9/8, 3/2],
and the initial trapezium ABCD has been transformed to AB′C′D′, with B′ = [3, 0], C′ = [9/4, 3/2] and D′ = [9/8, 3/2], respectively. Hence in this case the illustration looks as follows:
(c) The stretching here preserves x and scales y by 4/3, i.e., it has the form $(x, y)^T \mapsto (x, \frac{4y}{3})^T$. We compute
A = [0, 0] → A, B = [4, 0] → B, C = [3, 3/2] → [3, 2], D = [3/2, 3/2] → [3/2, 2],
thus the initial trapezium ABCD has been transformed to ABC′D′, with C′ = [3, 2] and D′ = [3/2, 2], respectively. As for the illustration, we get
(d) As a transformation, the horizontal shear in the plane with parameter a = 2 has matrix $\begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix}$, and thus we get the assignment $(x, y)^T \mapsto (x + 2y, y)^T$. We compute
A = [0, 0] → A, B = [4, 0] → B, C = [3, 3/2] → [6, 3/2], D = [3/2, 3/2] → [9/2, 3/2].
This means that the new coordinates of the initial trapezium are now ABC′D′ with C′ = [6, 3/2] and D′ = [9/2, 3/2], respectively. Let us postpone its illustration for a moment, and present it below together with the illustration of the vertical shear.
(e) The vertical shear in the plane with parameter a = 1/2 has matrix $\begin{pmatrix} 1 & 0 \\ 1/2 & 1 \end{pmatrix}$, so it corresponds to the mapping $(x, y)^T \mapsto (x, \frac{x}{2} + y)^T$. We compute
A = [0, 0] → A, B = [4, 0] → [4, 2], C = [3, 3/2] → [3, 3], D = [3/2, 3/2] → [3/2, 9/4],
hence in this case the new coordinates of ABCD are AB′C′D′ with B′ = [4, 2], C′ = [3, 3] and D′ = [3/2, 9/4], respectively. Below, the figure on the left illustrates the horizontal shear, while the vertical shear corresponds to the figure on the right.
(f) Recall that the axial symmetry is the linear transformation $(x, y)^T \mapsto (y, x)^T$.
Let us present only the illustration of its action on ABCD, and leave the mathematical explanation of the figure to the reader. Can you predict which vertices remain fixed (without an illustration or computation)? □
1.G.53. Let ABCD be the rhombus in the plane with vertices A = [0, 0], B = [5, 0], C = [8, 4] and D = [3, 4], respectively.
(a) Rotate ABCD counter-clockwise around the origin through θ1 = π/3.
(b) Then rotate the rhombus obtained in (a) counter-clockwise around the origin through θ2 = π/3.
(c) Finally rotate the rhombus obtained in (b) counter-clockwise around the origin through another θ3 = π/3.
(d) Determine directly the final position of ABCD, after imposing the rotations described in (a), (b) and (c).
Solution. First observe that the given polygon ABCD is indeed a rhombus. This is because the vectors $u_1 = \overrightarrow{AB}$ and $u_2 = \overrightarrow{AD}$ are of the same length (with respect to the dot product on $R^2$), $\|u_1\| = \|u_2\| = 5$, and moreover the opposite sides are parallel to each other. The angle between $u_1, u_2$ equals arccos(3/5) ≈ 53.13°. For a moment, think why after applying a rotation to ABCD the new polygon will again be a rhombus. This is because rotations preserve lengths and angles (we will essentially prove this in Chapter 2, studying orthogonal transformations). Thus, rotating ABCD around the origin through any possible angle, we obtain a rhombus with the same characteristics (same lengths and angles as those described for ABCD). The rotation matrix for θ = π/3 is given by $\begin{pmatrix} 1/2 & -\sqrt{3}/2 \\ \sqrt{3}/2 & 1/2 \end{pmatrix}$, and hence, as a linear transformation of $R^2$, the rotation $R_{\pi/3}$ has the form $(x, y)^T \mapsto \big(\frac{x - \sqrt{3}y}{2}, \frac{\sqrt{3}x + y}{2}\big)^T$. Based on direct computations, one can now verify the following:
(a) Under the action of $R_{\pi/3}$ the points A, B, C, D (seen as vectors in $R^2$) are mapped to:
A → A, B → $[5/2, 5\sqrt{3}/2]$, C → $[4 - 2\sqrt{3}, 4\sqrt{3} + 2]$, D → $[\frac{3}{2} - 2\sqrt{3}, \frac{3\sqrt{3}}{2} + 2]$.
Hence, a counter-clockwise rotation around the origin through θ = π/3 transforms the rhombus ABCD to $AB_1C_1D_1$, where $B_1 = [5/2, 5\sqrt{3}/2]$, $C_1 = [4 - 2\sqrt{3}, 4\sqrt{3} + 2]$ and $D_1 = [\frac{3}{2} - 2\sqrt{3}, \frac{3\sqrt{3}}{2} + 2]$, respectively (see the figure below, where we have included all the rotations together).
(b) Next we need to apply $R_{\pi/3}$ to $AB_1C_1D_1$. We compute
A → A, $B_1$ → $[-5/2, 5\sqrt{3}/2]$, $C_1$ → $[-4 - 2\sqrt{3}, 4\sqrt{3} - 2]$, $D_1$ → $[-\frac{3}{2} - 2\sqrt{3}, \frac{3\sqrt{3}}{2} - 2]$.
This means that $AB_1C_1D_1$ has been transformed to the rhombus $AB_2C_2D_2$, where $B_2 = [-5/2, 5\sqrt{3}/2]$, $C_2 = [-4 - 2\sqrt{3}, 4\sqrt{3} - 2]$ and $D_2 = [-\frac{3}{2} - 2\sqrt{3}, \frac{3\sqrt{3}}{2} - 2]$, respectively.
(c) Let us finally apply the third rotation, using this time the rhombus $AB_2C_2D_2$ obtained in (b). We compute
A → A, $B_2$ → [−5, 0], $C_2$ → [−8, −4], $D_2$ → [−3, −4].
Thus a counter-clockwise rotation around the origin by π/3 transforms the rhombus $AB_2C_2D_2$ to $AB_3C_3D_3$, where $B_3 = [-5, 0] = -B$, $C_3 = [-8, -4] = -C$ and $D_3 = [-3, -4] = -D$, respectively. Let us illustrate all these rotations together in one figure (via Sage), as follows:
The code behind this figure is based on slightly more advanced options of 2D-graphics in Sage than what we have seen so far (such as automatically assigning names to the points in a list).
For the interested reader we present it here:
pts = [(0, 0), (5, 0), (8, 4), (3, 4), (5/2, 5*sqrt(3)/2),
       (4-2*sqrt(3), 2+4*sqrt(3)), (3/2-2*sqrt(3), (3/2)*sqrt(3)+2),
       (-5/2, 5*sqrt(3)/2), (-4-2*sqrt(3), 4*sqrt(3)-2),
       (-3/2-2*sqrt(3), (3/2)*sqrt(3)-2), (-5, 0), (-8, -4), (-3, -4)]
pt_names = ['$A$', '$B$', '$C$', '$D$', '$B_1$', '$C_1$', '$D_1$',
            '$B_2$', '$C_2$', '$D_2$', '$B_3$', '$C_3$', '$D_3$']
pt_opt = {'color': 'black', 'horizontal_alignment': 'left', 'vertical_alignment': 'top'}
plt = point2d(pts, color='blue', size=20)
plt += sum(text(name, vector(pt), **pt_opt) for name, pt in zip(pt_names, pts))
plt += point([0, 0], size=40, color="blue", aspect_ratio=1)
plt += point([5, 0], size=40, color="green")
plt += point([8, 4], size=40, color="red")
plt += point([3, 4], size=40, color="purple")
plt += polygon([(0, 0), (5, 0), (8, 4), (3, 4)], fill=False, rgbcolor=(0, 1/2, 1))  # the rhombus ABCD
plt += point([5/2, 5*sqrt(3)/2], size=40, color="green")
plt += point([4-2*sqrt(3), 2+4*sqrt(3)], size=40, color="red")
plt += point([3/2-2*sqrt(3), (3/2)*sqrt(3)+2], size=40, color="purple")
plt += polygon([(0, 0), (5/2, 5*sqrt(3)/2), (4-2*sqrt(3), 2+4*sqrt(3)), (3/2-2*sqrt(3), (3/2)*sqrt(3)+2)], fill=False, rgbcolor=(1/8, 1/4, 1/2), linestyle="--")  # the rhombus AB_1C_1D_1
plt += point([-5/2, 5*sqrt(3)/2], size=40, color="green")
plt += point([-4-2*sqrt(3), 4*sqrt(3)-2], size=40, color="red")
plt += point([-3/2-2*sqrt(3), (3/2)*sqrt(3)-2], size=40, color="purple")
plt += polygon([(0, 0), (-5/2, 5*sqrt(3)/2), (-4-2*sqrt(3), 4*sqrt(3)-2), (-3/2-2*sqrt(3), (3/2)*sqrt(3)-2)], fill=False, rgbcolor=(1/8, 1/4, 1/2), linestyle="--")  # the rhombus AB_2C_2D_2
plt += point([-5, 0], size=40, color="green")
plt += point([-8, -4], size=40, color="red")
plt += point([-3, -4], size=40, color="purple")
plt += polygon([(0, 0), (-5, 0), (-8, -4), (-3, -4)], fill=False, rgbcolor=(1/8, 1/4, 1/2), linestyle="--")  # the rhombus AB_3C_3D_3
plt.show(aspect_ratio=1, figsize=8)
(d) This task relies on the principle that composing two rotations results in another rotation, which can be easily determined. In other words, applying a rotation by an angle θ1 followed by a rotation by an angle θ2 is equivalent to a single rotation by the angle θ1 + θ2. In terms of matrices, this implies that
$\begin{pmatrix} \cos\theta_1 & -\sin\theta_1 \\ \sin\theta_1 & \cos\theta_1 \end{pmatrix}\begin{pmatrix} \cos\theta_2 & -\sin\theta_2 \\ \sin\theta_2 & \cos\theta_2 \end{pmatrix} = \begin{pmatrix} \cos(\theta_1+\theta_2) & -\sin(\theta_1+\theta_2) \\ \sin(\theta_1+\theta_2) & \cos(\theta_1+\theta_2) \end{pmatrix}$,
that is, $R_{\theta_1} R_{\theta_2} = R_{\theta_1+\theta_2}$. To prove this identity, you can use the trigonometric identities provided in part (d) of Problem 1.G.15. For our case we compute $R_{\theta_1} R_{\theta_2} = R_{2\pi/3}$, and $(R_{\theta_1} R_{\theta_2})R_{\theta_3} = R_{2\pi/3} R_{\pi/3} = R_{\pi}$. To determine the final position of the initial rhombus ABCD it is now sufficient to apply a rotation by π. This results in the linear transformation
$\begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} \cos\pi & -\sin\pi \\ \sin\pi & \cos\pi \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -x \\ -y \end{pmatrix} = -\begin{pmatrix} x \\ y \end{pmatrix}$.
Hence ABCD will finally be transformed to AB′C′D′, where B′ = −B, C′ = −C and D′ = −D, respectively. As a verification, we see that these are the points obtained in step (c), that is, B′ = $B_3$, C′ = $C_3$ and D′ = $D_3$, respectively. □
F) Material on relations and mappings
1.G.54. Consider the relation R on the set {1, 2, 3, 4} where (a, b) ∈ R if and only if a − b is divisible by 2. Show that R is an equivalence relation.
Solution. To verify reflexivity we need to show that (a, a) ∈ R for all a ∈ {1, 2, 3, 4}. This is obvious, since 0 is divisible by 2. Therefore, R is reflexive.
To verify symmetry we need to show that if (a, b) ∈ R then (b, a) ∈ R. If (a, b) ∈ R, then a − b is divisible by 2, and hence b − a = −(a − b) is also divisible by 2. Thus (b, a) ∈ R and R is symmetric.
To verify transitivity we need to show that if (a, b) ∈ R and (b, c) ∈ R then (a, c) ∈ R. If (a, b) ∈ R and (b, c) ∈ R, then both a − b and b − c are divisible by 2 and their sum (a − b) + (b − c) = a − c is also divisible by 2. Thus (a, c) ∈ R and R is transitive.
For example, (1, 3) ∈ R since 1 − 3 = −2 is divisible by 2, and since R is symmetric we also have (3, 1) ∈ R. Then we see that (1, 1) ∈ R as well.
Since R is reflexive, symmetric, and transitive, it is an equivalence relation. □
1.G.55. Determine the number of mappings from the set {1, 2} to the set {a, b, c}. How many of them are surjective and how many are injective?
Solution. The element 1 can be mapped to any of a, b, c, and the same holds for 2. Thus there are exactly $3^2 = 9$ mappings from the set {1, 2} to the set {a, b, c}. None of them can be surjective, because the set {a, b, c} has more elements than the set {1, 2}. Injectivity requires that the images of 1 and 2 are different. There are three possibilities for the image of 1, and once the image of 1 is given, there remain two possibilities for the image of 2. Thus, the number of injective mappings from the set {1, 2} to the set {a, b, c} is 6. □
1.G.56. Let {a, b, c, d} be a set with the relation {(a, a), (b, b), (a, b), (b, c), (c, b)}. What is the minimal number of elements we have to add to the relation in order to make it an equivalence?
Solution. Let us successively ensure the three properties that define an equivalence. First there is reflexivity: we must add the tuples (c, c), (d, d). Second is symmetry: we must add (b, a). For the third step we must take the so-called transitive closure: since a is in relation with b and b is in relation with c, we must add (a, c) and (c, a). □
1.G.57. Determine how many distinct binary relations can be defined between a set X and the set of all subsets of X, if the set X has exactly 3 elements.
Solution. First, notice that the set of all subsets of X has exactly $2^3 = 8$ elements, and thus its Cartesian product with X has 8 · 3 = 24 elements. Binary relations correspond to subsets of this Cartesian product, and hence the total number of such relations is $2^{24} = 16\,777\,216$. □
1.G.58. Find the number of surjective mappings f from the set {1, 2, 3, 4, 5} to the set {1, 2, 3} such that f(1) = f(2).
Solution. Every such mapping is uniquely given by the images of the elements {1, 3, 4, 5}, so there are exactly as many such mappings as there are surjective mappings from the set {1, 3, 4, 5} to the set {1, 2, 3}, that is, 36. □
1.G.59. Determine the number of surjective mappings of the set {1, 2, 3, 4} to the set {1, 2, 3}. ⃝
1.G.60. List all the relations on the two-element set {1, 2} that are symmetric, but neither reflexive nor transitive.
Solution. The reflexive relations are exactly those which contain both pairs (1, 1), (2, 2). Hence this excludes the following relations:
{(1, 1), (2, 2)}, {(1, 1), (2, 2), (1, 2)}, {(1, 1), (2, 2), (2, 1)}, {(1, 1), (2, 2), (1, 2), (2, 1)}.
We claim that the remaining relations, which are symmetric but not transitive, must contain (1, 2), (2, 1). If such a relation contains one of these two (ordered) pairs, by symmetry it must contain the other as well. If it contains neither of these pairs, then it is clearly transitive.
Thus, from the total number of 16 relations on a two-element set we select
{(1, 2), (2, 1)}, {(1, 2), (2, 1), (1, 1)}, {(1, 2), (2, 1), (2, 2)}.
It is now clear that each of these 3 relations is symmetric, but neither reflexive nor transitive. □
1.G.61. Determine the number of relations on the set {1, 2, 3, 4} which are both symmetric and transitive.
Solution. A relation with the given properties is an equivalence on some subset of the set {1, 2, 3, 4}. Since the numbers of equivalences on sets with 0, 1, 2, 3, 4 elements are 1, 1, 2, 5, 15, respectively, we obtain in total
$1 + 4 \cdot 1 + \binom{4}{2} \cdot 2 + \binom{4}{3} \cdot 5 + 15 = 52$.
Thus, there are 52 relations on the set {1, 2, 3, 4} that satisfy the required properties. □
1.G.62. Determine all the elements in S ◦ R, if
R = {(2, 4), (4, 4), (4, 5)} ⊂ N × N, and S = {(3, 1), (3, 2), (3, 5), (4, 1), (4, 4)} ⊂ N × N.
Solution. Consider all choices of two ordered tuples — (2, 4) with (4, 1); (2, 4) with (4, 4); (4, 4) with (4, 1); (4, 4) with (4, 4) — such that the second element of the first ordered tuple (which is a member of R) is equal to the first element of the second ordered tuple (which is a member of S). Then we obtain S ◦ R = {(2, 1), (2, 4), (4, 1), (4, 4)}.
This task can also be solved using Sage. First, define a function to compose two given relations as follows:
def RelCompose(Rel1, Rel2):
    RS = set()
    for (a, b) in Rel1:
        for (c, d) in Rel2:
            if b == c:
                RS.add((a, d))
    return RS
Now we should introduce R, S, as follows:
R = {(2, 4), (4, 4), (4, 5)}; S = {(3, 1), (3, 2), (3, 5), (4, 1), (4, 4)}
print(RelCompose(R, S))
Executing the whole block, Sage returns {(4, 4), (2, 4), (4, 1), (2, 1)}. □
1.G.63. Let R be the binary relation between the sets A = Z and B = R, defined by R = {(0, 4), (−3, 0), (5, π), (5, 2), (0, 2)}. Express explicitly $R^{-1}$ and $R \circ R^{-1}$. ⃝
1.G.64. Is there an equivalence relation on the set of all lines in the plane that also serves as an ordering?
Solution. An equivalence relation (or an ordering relation) must be reflexive, therefore every line must be in relation with itself. Furthermore, we require that the relation is both symmetric (equivalence) and antisymmetric (ordering). This implies that each line can only be related to itself. If we define the relation such that two lines are in relation if and only if they are identical, we obtain a very natural relation, which is both an equivalence relation and an ordering. We just need to check that it is transitive, which is easy. Therefore, the only relation that meets the criteria is the identity relation on the set of all lines in the plane. □
1.G.65. We have the set {3, 4, 5, 6, 7}. Write explicitly the following relations and explore their properties: i) a divides b; ii) either a divides b or b divides a; iii) a and b have a common divisor greater than one. ⃝
Solutions to the problems
1.A.3. Already the ancient Greeks knew that if we prescribe the area of a square as $a^2 = 2$, then we cannot find a rational a to satisfy it. Why? Suppose that $(p/q)^2 = 2$ for some natural numbers p and q which have no common divisor greater than 1 (otherwise we can further reduce the fraction p/q). Then $p^2 = 2q^2$ is an even number, so $p^2$ is even and therefore so is p. Now p is even, thus $p^2$ must be divisible by 4. But then $q^2$ is even and so q must be even too. This certifies that p and q both have 2 as a common factor, a contradiction.
1.A.4. Let p, q be two arbitrary rational numbers with p ≠ q. We may assume that p < q; the other case is treated similarly. Set $\beta := p - \frac{p - q}{\sqrt{2}}$.
We will show that β is irrational, that is, β ∈ R\Q, and in addition that p < β < q. The first claim follows because $1/\sqrt{2}$ is irrational, in combination with the following two facts:
• The product of a rational and an irrational number is an irrational number.
• Adding a rational and an irrational number, we obtain an irrational number.
Let us prove the first statement; the second one is proved in a very similar way. Let $p = \frac{a}{b} \in Q$ be a rational number, and let x ∈ R\Q be irrational. Then $xp = px = \frac{xa}{b}$. Assume that xp is rational, that is, $xp = \frac{xa}{b} = \frac{r}{s}$ for some integers r, s. Multiplying both sides by $\frac{b}{a}$ we obtain $x = \frac{rb}{sa}$. Being the product of the rationals $\frac{r}{s}$ and $\frac{b}{a}$, the number x would be rational, a contradiction.
Now it remains to prove that p < β < q. We begin with $0 < \frac{1}{\sqrt{2}} < 1$ and multiply both sides by q − p > 0. This gives $0 < \frac{q-p}{\sqrt{2}} < q - p$, and by adding the number p to both sides we get $p < p - \frac{p-q}{\sqrt{2}} < q$. This proves the assertion.
1.A.7. By assumption, $p(x) = 5x^4 + x^2 - 2x + 2$, so Horner's table has the form
ρ = 4 | 5    0    1    −2    2
      |      20   80   324   1288
      | 5    20   81   322   1290
This means that $p(x) = (x - 4)(5x^3 + 20x^2 + 81x + 322) + 1290$. In terms of the long division method, dividing $5x^4 + x^2 - 2x + 2$ by x − 4 produces the quotient $5x^3 + 20x^2 + 81x + 322$ and the remainder 1290, through the successive subtractions of $5x^4 - 20x^3$, $20x^3 - 80x^2$, $81x^2 - 324x$ and $322x - 1288$.
1.A.9. The horizontal distance between z, w is the absolute value of the difference of their real parts, that is, hor = |x − u|. For an illustration, see the figure below. Note that here one uses the absolute value of x − u since, depending on whether z lies to the left or to the right of w, the horizontal distance is given by ±(x − u). Similarly, the vertical distance between z, w is the absolute value of the difference of their imaginary parts, that is, ver = |y − v|. According to the Pythagorean theorem, the distance d between z, w satisfies $d^2 = \mathrm{hor}^2 + \mathrm{ver}^2 = |x - u|^2 + |y - v|^2$, and the assertion follows.
1.A.11. The distance in question equals $\sqrt{13}$.
1.A.12. By assumption, i is a solution of the given equation. Hence we have that $i^3 + (1 + i)i^2 + ai + 2 = 0$. From this equation we compute a = 2 + i. Thus, we need to solve the equation f(z) = 0, where $f(z) = z^3 + (1 + i)z^2 + (2 + i)z + 2$. Since f(z) is of degree three, over C we can specify three roots. In particular, we know that i is a solution of f(z) = 0, and we can use this to obtain the corresponding Horner table. Set ρ = i. We have
ρ = i | 1    1 + i     2 + i    2
      |      i         i − 2    −2
      | 1    1 + 2i    2i       0
The last entry of the bottom row is 0, since i is a root of f(z) = 0. Hence we can write
$f(z) = z^3 + (1 + i)z^2 + (2 + i)z + 2 = (z - i)(z^2 + (1 + 2i)z + 2i) = (z - \rho)\pi(z)$
with $\pi(z) := z^2 + (1 + 2i)z + 2i$. Now, the discriminant of π(z) is given by
$\Delta = (1 + 2i)^2 - 4 \cdot 1 \cdot 2i = 1 + 4i - 4 - 8i = -3 - 4i = (1 - 2i)^2$.
Thus, the remaining two roots are given by $z_{2,3} = \frac{-(1 + 2i) \pm \sqrt{(1 - 2i)^2}}{2}$, that is, $z_2 = -2i$ and $z_3 = -1$.
1.A.16. (a) This is true, as we can see in the figure below. In particular, the argument of a real number is kπ, with k ∈ Z.
(b) The principal argument of i equals φ = π/2 and is given correctly. However, −2i has the same principal argument as −i, i.e., ϑ = −π/2, see also the figure. Hence the last statement in (b) is false. One can show that a complex number is purely imaginary if and only if its argument is given by ±π/2 + nπ, with n ∈ Z.
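These claims are easy to experiment with in Sage (this cell is our addition); the built-in function arg returns the principal argument:

# Principal arguments of a few complex numbers.
for z in [2, -3, I, -I, -2*I]:
    print(z, arg(z))
# prints 0, pi, 1/2*pi, -1/2*pi, -1/2*pi respectively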
1.A.17. One can apply the method presented in Problem 1.A.14. This means that for the given z ∈ C we need to compute its magnitude r = |z| and also its argument. A direct calculation shows that
$|z| = \sqrt{\big(1 + \cos\frac{\pi}{3}\big)^2 + \sin^2\frac{\pi}{3}} = \sqrt{3}$, $\cos\varphi = \frac{\mathrm{Re}(z)}{|z|} = \frac{1 + \frac{1}{2}}{\sqrt{3}} = \frac{\sqrt{3}}{2}$, $\sin\varphi = \frac{\mathrm{Im}(z)}{|z|} = \frac{1}{2}$.
Therefore we see that φ = π/6. Altogether, we obtain $z = \sqrt{3}\big(\cos\frac{\pi}{6} + i\sin\frac{\pi}{6}\big)$.
1.A.19. Recall that de Moivre’s formula states that $z^n = |z|^n\big(\cos(n\varphi) + i\sin(n\varphi)\big)$ for any n ∈ Z and z ∈ C, see 1.1.4. We are interested in $z^{31}$, where $z = \cos\frac{\pi}{6} + i\sin\frac{\pi}{6}$ (note that $z \in S^1 = \{z \in C : |z| = 1\}$). In such tasks it is often useful to draw the diagram corresponding to z, shown in the figure for the particular case at hand. Therefore, we obtain
$\big(\cos\frac{\pi}{6} + i\sin\frac{\pi}{6}\big)^{31} = \cos\frac{31\pi}{6} + i\sin\frac{31\pi}{6} = \cos\frac{7\pi}{6} + i\sin\frac{7\pi}{6} = -\frac{\sqrt{3}}{2} - i\frac{1}{2}$.
1.A.20. This task highlights a different use of de Moivre’s theorem. Indeed, having in mind the result in 1.1.4 and the identities $(a + b)^3 = a^3 + 3a^2b + 3ab^2 + b^3$, $i^2 = -1$, $i^3 = -i$, we compute
$\cos(3\varphi) + i\sin(3\varphi) = (\cos\varphi + i\sin\varphi)^3 = \cos^3\varphi + 3i\cos^2\varphi\sin\varphi + 3i^2\cos\varphi\sin^2\varphi + i^3\sin^3\varphi = \big(\cos^3\varphi - 3\cos\varphi\sin^2\varphi\big) + i\big(3\cos^2\varphi\sin\varphi - \sin^3\varphi\big)$.
A comparison of the real and the imaginary parts now yields the result:
$\cos(3\varphi) = \cos^3\varphi - 3\cos\varphi\sin^2\varphi = \cos^3\varphi - 3\cos\varphi(1 - \cos^2\varphi) = 4\cos^3\varphi - 3\cos\varphi$,
and
$\sin(3\varphi) = 3\cos^2\varphi\sin\varphi - \sin^3\varphi = 3(1 - \sin^2\varphi)\sin\varphi - \sin^3\varphi = 3\sin\varphi - 4\sin^3\varphi$.
1.A.22. Note that this equation has no rational roots. Substitution into the formulas obtained in 1.A.21 yields $p = b - a^2/3 = -7/3$, $q = -7/27$. It follows that $u = \frac{1}{6}\sqrt[3]{28 \pm 12\sqrt{-147}}$. We can theoretically choose up to six possibilities for u (two for the choice of the sign and three independent choices of the cubic root). But we obtain only three distinct values for x. By substitution into the formulas, one of the roots is of the form
$x = \frac{14}{3\sqrt[3]{28 - 84i\sqrt{3}}} + \frac{\sqrt[3]{28 - 84i\sqrt{3}}}{6} - \frac{1}{3} \doteq 1.247$,
and similarly for the other two (approximately −0.445 and −1.802). Finally, observe that even though we have used complex numbers during the computation, all the solutions are real.
1.B.2. Set as before $a = 1 + \frac{0.06}{12} = 1.005$ and C = 30 000. The condition $d_k = 0$ gives the equation
$a^k = \frac{P/(a-1)}{P/(a-1) - C} = \frac{200P}{200P - C}$.
By taking logarithms of both sides, we arrive at
$k = \frac{\ln(200P) - \ln(200P - C)}{\ln a}$.
For P = 500 this gives approximately k = 71.5. Therefore, Michael will pay for 72 months, with the last repayment being less than 500.
1.C.3. (a) The pair of letters b and r can be considered as a single indivisible “double-letter”. In total we then have six distinct letters, and there are 6! words of six indivisible letters. We have to multiply this by two, since the double-letter can be either br or rb. So the solution is 2 · 6!.
(b) The events in this case form the complement of part (a) in the set of all rearrangements of the seven letters. The solution is therefore 7! − 2 · 6!.
1.C.10. First, we can place the white rook on any of $8^2$ positions. Then we have at our disposal $7^2$ positions (the remaining 7 rows and columns) in which we can place the black rook. Therefore, the total number of ways equals $8^2 \cdot 7^2 = 3136$. “Inclusion-exclusion” would, of course, provide the same result: $8^2(8^2 - 1) - 8^2 \cdot 7 \cdot 2 = 3136$ (the first term counts all possibilities, the second one the forbidden ones).
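The count in 1.C.10 is small enough to verify by brute force; the following cell (our addition) enumerates all ordered placements of the two rooks on distinct squares and keeps those that do not share a row or a column:

# Brute-force check of 1.C.10 on an 8x8 board; squares are numbered 0..63.
count = 0
for w in range(64):
    for b in range(64):
        if w != b and w // 8 != b // 8 and w % 8 != b % 8:
            count += 1
print(count)  # prints 3136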
1.D.3. First divide the favourable cases according to the number of men in the chosen group: 2 or 1. Now, there are eight groups of five people of which exactly one is a man (all four women have to be present in such a group, so the group depends only on which man is chosen). Also, there are $c(8, 2) \cdot c(4, 3) = \binom{8}{2}\binom{4}{3}$ groups with two men. This is because we choose two men from eight and then independently three women from four. These two choices can be combined independently, and thus using the rule of product we obtain the number of such groups. The total number of groups with five people is $c(12, 5) = \binom{12}{5}$. Therefore, the desired probability, being the quotient of the number of favourable outcomes and the total number of outcomes, equals
$\frac{8 + \binom{4}{3}\binom{8}{2}}{\binom{12}{5}} = \frac{5}{33}$.
1.D.8. We solve this exercise using the theorem about multiplication of probabilities, which is explained in 1.4.8 and based on the concept of conditional probability. Here it seems to be obvious: first we require a red ball, which happens with probability 9/16. If a red ball was drawn, then in the second round we draw a red ball with probability 8/15 (there are 15 balls in the box, 8 of them red). Finally, if two red balls were drawn, the probability that a white ball is drawn is 7/14 (there are 7 white balls and 7 red balls in the box). Thus we obtain
$\frac{9}{16} \cdot \frac{8}{15} \cdot \frac{7}{14} = 0.15$.
1.E.5. Eliminate t to obtain q : x − 2y = −5. Then solve for x and y. The intersection has coordinates x = 1, y = 3.
1.E.7. It is clear that $-2 \cdot \big(-x - \frac{3}{2}y + 2\big) = 2x + 3y - 4$. Thus $p_1$ and $p_4$ describe the same line. Moreover, note that $p_2$ can be rewritten as −2x + 2y − 6 = 0, thus the lines $p_2$ and $p_3$ are parallel and distinct. Also, by eliminating t, the line $p_5$ has the equation x + y = 0, which is not parallel to any of the other lines.
1.E.12. We compute
$A - B = \begin{pmatrix} -2 & 5 \\ -1 & 1 \end{pmatrix}$, $(A - B)^T = \begin{pmatrix} -2 & -1 \\ 5 & 1 \end{pmatrix}$,
and by matrix multiplication we obtain
$v = 2\begin{pmatrix} -2 & -1 \\ 5 & 1 \end{pmatrix}\begin{pmatrix} 2 & -2 \\ 4 & 5 \end{pmatrix}\begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} -52 \\ 64 \end{pmatrix}$.
1.E.14. Of course, all these results can be easily computed via Sage and the methods mentioned in the previous subsection. But it is all so easy with 2 by 2 matrices that we could count on our fingers, too:
(a) We have $A - B = \begin{pmatrix} -2 & 5 \\ -1 & 1 \end{pmatrix}$, so we easily compute
$(A - B)^2 = \begin{pmatrix} -2 & 5 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} -2 & 5 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} -1 & -5 \\ 1 & -4 \end{pmatrix}$, $A^2 = \begin{pmatrix} -10 & 10 \\ -4 & -6 \end{pmatrix}$, $B^2 = \begin{pmatrix} 4 & 0 \\ -3 & 1 \end{pmatrix}$
and
$AB = \begin{pmatrix} -5 & 5 \\ -6 & 2 \end{pmatrix} \neq BA = \begin{pmatrix} 0 & 10 \\ -2 & -3 \end{pmatrix}$.
Thus, computing $(A - B)^2$ we may use the formula $A^2 - AB - BA + B^2$ (but we see that it is false to claim that $(A - B)^2 = A^2 - 2AB + B^2$). Hence in matrix calculus the multiplication is in general not commutative.
(b) We compute ABC, BCA, and CAB, which yields:
$D = 2\begin{pmatrix} 10 & 35 \\ -4 & 22 \end{pmatrix} - \begin{pmatrix} 8 & 12 \\ -14 & 24 \end{pmatrix} - \begin{pmatrix} 2 & 6 \\ -50 & 30 \end{pmatrix} = \begin{pmatrix} 10 & 52 \\ 56 & -10 \end{pmatrix} \neq \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$.
Therefore, in general, given three arbitrary matrices A, B, C, we have ABC ≠ BCA ≠ CAB.
(c) We compute (in the result, we write $E = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$):
$\frac{c(AB) - (c(AB))^T}{\sqrt{2}} = \frac{c}{\sqrt{2}}\big(AB - (AB)^T\big) = \frac{\sqrt{2}c}{2}\Big\{\begin{pmatrix} -5 & 5 \\ -6 & 2 \end{pmatrix} - \begin{pmatrix} -5 & -6 \\ 5 & 2 \end{pmatrix}\Big\} = \frac{11\sqrt{2}}{2}cE$.
(d) This is a direct computational consequence of the formula for the determinant. In Chapter 2, we will see that this holds true for square matrices of all sizes.
(f) Let us better handle this one by Sage, where it is convenient to use the cell presented in 1.E.12 and add the code
D = det(A)*B - det(B)*A - C.trace()*C; print(det(D)); print(divisors(det(D)))
This gives −38, which is the determinant of D, and [1, 2, 19, 38], which are its divisors.
1.E.15. (a) For any two square matrices A and B we have $(A + B)(A - B) = A^2 - AB + BA - B^2$. Therefore, the identity $(A + B)(A - B) = A^2 - B^2$ is valid if and only if AB = BA.
Thus any pair of matrices which do not commute will do. A choice is given for instance by
$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, $B = \begin{pmatrix} 4 & 3 \\ 2 & 1 \end{pmatrix}$. Indeed, $AB = \begin{pmatrix} 8 & 5 \\ 20 & 13 \end{pmatrix} \neq BA = \begin{pmatrix} 13 & 20 \\ 5 & 8 \end{pmatrix}$.
(b) Similarly, any two square matrices A, B satisfy $(A + B)(A + B) = A^2 + AB + BA + B^2$. It follows that $(A + B)(A + B) = A^2 + 2AB + B^2$ if and only if AB = BA, as in the first case. Hence the pair of matrices presented above provides an example also for this case.
1.E.16. The definition of linear mappings shows that knowing the values at u = (1, 0) and v = (0, 1) determines the values at all the other vectors: $(x, y) \mapsto F(xu + yv) = xF(u) + yF(v)$. This corresponds to placing the vectors F(u) and F(v) into the columns of the appropriate matrix A, and then expressing the value as $\begin{pmatrix} x \\ y \end{pmatrix} \mapsto A\begin{pmatrix} x \\ y \end{pmatrix}$. This answers the question about the role of the columns in the matrix A, once we have it. Moreover, affine mappings differ from linear ones only by having a constant vector added. Thus, an affine mapping is linear if and only if it keeps the origin fixed. In our case, clearly
$F\Big(\begin{pmatrix} x \\ y \end{pmatrix}\Big) = \begin{pmatrix} 7 & -3 \\ -2 & 5 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = A\begin{pmatrix} x \\ y \end{pmatrix}$, $G\Big(\begin{pmatrix} x \\ y \end{pmatrix}\Big) = \begin{pmatrix} 2 & 2 \\ 4 & -9 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} -4 \\ 3 \end{pmatrix} = B\begin{pmatrix} x \\ y \end{pmatrix} + w$,
so $F\big((0, 0)^T\big) = (0, 0)^T$, $G\big((0, 0)^T\big) = (-4, 3)^T$, and we deduce that the mapping F is linear, but G is not. The final claim is obvious from the associativity of matrix multiplication, i.e.,
$B\Big(A\begin{pmatrix} x \\ y \end{pmatrix}\Big) + w = (BA)\begin{pmatrix} x \\ y \end{pmatrix} + w$, $A\Big(B\begin{pmatrix} x \\ y \end{pmatrix} + w\Big) = (AB)\begin{pmatrix} x \\ y \end{pmatrix} + Aw$.
Check these computations yourselves and see also the right column in 1.5.4.
1.E.22. This is a typical task to be discussed without any explicit coordinates in mind. Let us start with the linear mappings. In the plane, lines are parallel if and only if their directional vectors u in the parametric representations P(t) = P + tu are parallel. The images of the lines can be parametrized by F(P(t)) = F(P) + tF(u). Now recall that in the plane, two non-zero vectors are parallel if and only if one is a scalar multiple of the other. Thus, the images of parallel lines are clearly parallel as well, except in the case where the image F(u) = 0. So, strictly speaking, the claim we should prove is false. But we have proved that invertible linear mappings transform parallel lines to parallel lines. Adding translations in order to deal with general affine mappings does not spoil this property. On the other hand, for example, a shear transformation clearly changes both the angles and the distances, cf. 1.E.17.
1.E.24. An “admissible picture” that effectively illustrates the result of the task in 1.E.21 should resemble the figure shown below:
This figure allows you to visually verify the two requirements of the task: first, that the two lines are perpendicular, and second, that the line in question passes through the point P = [−6, 7]. To create this plot, you can use a Sage cell with specific options to ensure the correct aspect ratio and figure size. For example, you can type:
x = var("x")
y2 = -7*x/6; y1 = (1/7)*(6*x + 13)
fig = plot([], figsize=[4, 4])
fig += plot(y1, x, (-7, 7), aspect_ratio=1)
fig += plot(y2, x, (-7, 7), aspect_ratio=1)
fig += point((-6, 7), size=70)
show(fig)
To verify the effect of the different options on the plot, try the following: modify the Sage cell to use aspect_ratio="automatic". You will observe that the two lines may appear non-perpendicular, which can be misleading. Additionally, changing the figsize to [4, 8] can produce a distorted or “ugly” figure, further complicating the visual interpretation.
By experimenting with these settings in your editor, you can see firsthand how altering the aspect ratio and figure size impacts the clarity and accuracy of your plot.
1.E.25. We know that the area equals the absolute value of half the determinant of the matrix whose first column is given by the vector Q − P and whose second column by the vector R − P, that is, the determinant of the matrix
$A = \begin{pmatrix} -2 - (-8) & 5 - (-8) \\ 0 - 1 & 9 - 1 \end{pmatrix} = \begin{pmatrix} 6 & 13 \\ -1 & 8 \end{pmatrix}$.
A simple calculation yields $\frac{1}{2}|\det(A)| = \frac{1}{2}|6 \cdot 8 - 13 \cdot (-1)| = \frac{61}{2}$.
Alternatively, in Sage we can type the cell
A = matrix([[6, 13], [-1, 8]]); A.det()
which returns the determinant of A. As we will prove in Chapter 2, changing the order of the vectors leads to a change of the sign of the determinant, but the absolute value remains unchanged. Similarly, the transposed matrix (writing the vectors in the rows instead) yields the same determinant. Finally, if the vertices P, Q, R are ordered in the anti-clockwise direction, the determinant formed by the vectors Q − P and R − P is always positive. If the given coordinates were not in the standard Euclidean coordinate system, we would have to look at their defining frame $e_1$, $e_2$. If their norms are one and they are perpendicular (an orthonormal frame, see the end of 1.5.7), then we do not need to do anything. Otherwise, we should first transform the coordinates into the standard coordinate system, see the methods discussed in 1.E.22 (although area-preserving transformation matrices are those with determinant one, e.g., all shears, so this might not be necessary at all – think about it!).
1.E.26. Let us denote the quadrilateral S by ABCD. Being a polygon, to sketch it in Sage we can apply the command polygon or polygon2d. Here we use the first command in the following cell:
P = polygon([(1, 1), (6, 1), (11, 4), (2, 4)], fill=False, color="black")
A = point([1, 1], size=40, color="black")
B = point([6, 1], size=40, color="black")
C = point([11, 4], size=40, color="black")
D = point([2, 4], size=40, color="black")
Al = text("$A$", (0.8, 0.8), color="black", fontsize="12")
Bl = text("$B$", (6.2, 0.8), color="black", fontsize="12")
Cl = text("$C$", (11.2, 4.2), color="black", fontsize="12")
Dl = text("$D$", (1.8, 4.2), color="black", fontsize="12")
l = line([(1, 1), (11, 4)], color="black", linestyle="--")
show(P + A + B + C + D + Al + Bl + Cl + Dl + l)
Executing this block we obtain the following illustration:
From this figure we observe that S is a trapezoid, with bases of length 5 and 9 and a height of 3. Thus, the area of S is calculated as $\mathrm{Area}(S) = \frac{5 + 9}{2} \cdot 3 = 21$. This formula can be derived by noting that shearing transformations do not alter the area of a shape. By selecting an appropriate trapezoid that simplifies the calculations, you can verify this result. Can you identify a suitable trapezoid to use for this purpose?
Alternatively, we can divide S into two triangles, △ABC and △ACD, as shown in the figure. We can then find the area of S by summing the areas of these two triangles, which can be calculated using the determinants of the corresponding matrices:
$d_1 = \begin{vmatrix} 6-1 & 11-1 \\ 1-1 & 4-1 \end{vmatrix} = \begin{vmatrix} 5 & 10 \\ 0 & 3 \end{vmatrix}$, $d_2 = \begin{vmatrix} 11-1 & 2-1 \\ 4-1 & 4-1 \end{vmatrix} = \begin{vmatrix} 10 & 1 \\ 3 & 3 \end{vmatrix}$,
where the columns are the vectors B − A, C − A (for $d_1$) and C − A, D − A (for $d_2$). In such terms we see that $\mathrm{Area}(S) = \frac{1}{2}(|d_1| + |d_2|)$, and this gives 21 as well. Note that the second approach works for all polygonal objects in the plane.
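This triangulation idea is exactly the classical “shoelace” formula, which we can wrap into a small Sage routine (this cell is our addition, not part of the original text):

# Shoelace formula: half the absolute value of the sum of the cross
# products of consecutive vertex pairs gives the area of a simple polygon.
def shoelace_area(vertices):
    n = len(vertices)
    s = sum(vertices[i][0] * vertices[(i + 1) % n][1]
            - vertices[(i + 1) % n][0] * vertices[i][1]
            for i in range(n))
    return abs(s) / 2

print(shoelace_area([(1, 1), (6, 1), (11, 4), (2, 4)]))  # prints 21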
1.E.28. The coordinates of the vertex C can be obtained by rotating the point A around the centre S through the angle 2π/3 in the positive direction. This gives $C = [\frac{3}{2} - \sqrt{3}, -1 - \frac{\sqrt{3}}{2}]$. As a verification in Sage you may type
a = 2*pi/3; x1 = 0; y1 = 2; x2 = 1; y2 = 0
M = matrix([[cos(a), -sin(a)], [sin(a), cos(a)]])
A = vector([x1, y1]); S = vector([x2, y2])
rot = (M*(A - S) + S); show(rot)
1.E.30. For simplicity, set A = [−2, −2], B = [2, −11/6], C = [3, 1], D = [1, 4], and E = [−2, 2]. The sides BC and CD are clearly visible from the position of the point [300, 1]. On the other hand, DE and EA cannot be seen. For the side AB, we compute
$\begin{vmatrix} -2 - 300 & 2 - 300 \\ -2 - 1 & -\frac{11}{6} - 1 \end{vmatrix} = -302 \cdot \big(-\frac{17}{6}\big) - (-298) \cdot (-3) < 0$.
This implies that the side can be seen from the point [300, 1].
1.E.31. Order the vertices in the positive direction, that is, counter-clockwise: P = [5, 6], Q = [7, 8], R = [5, 8]. Using the corresponding determinants we can determine whether the point X = [0, 1] lies to the “left” or to the “right” of the sides of the triangle, when we view them as oriented line segments. We have
$\begin{vmatrix} Q - X \\ R - X \end{vmatrix} = \begin{vmatrix} 7 & 7 \\ 5 & 7 \end{vmatrix} > 0$, $\begin{vmatrix} R - X \\ P - X \end{vmatrix} = \begin{vmatrix} 5 & 7 \\ 5 & 5 \end{vmatrix} < 0$, $\begin{vmatrix} P - X \\ Q - X \end{vmatrix} = \begin{vmatrix} 5 & 5 \\ 7 & 7 \end{vmatrix} = 0$.
We see that the determinants are not all positive, so X is outside the triangle. In particular, if X lies to the left of some oriented segment (a side of the triangle), the segment is not visible from X. Because the last determinant is zero, the points [0, 1], [5, 6] and [7, 8] lie on a line, and thus the side joining P and Q is not visible. The side joining Q and R is also not visible, unlike the side joining P and R, for which the determinant is negative.
1.F.4. We should first express the three defining properties of an equivalence relation as functions, and next examine whether R satisfies them. We can do this as follows:
def is_reflexive(Rel, A=None):
    if not A:
        A = set(x[0] for x in Rel)  # in this way we define the domain
    return all({(a, a) in Rel for a in A})
def is_symmetric(Rel):
    return all({(b, a) in Rel for (a, b) in Rel})
def is_transitive(Rel):
    return all({(a, d) in Rel for (a, b) in Rel for (c, d) in Rel if b == c})
def relation_summary(Rel):
    A = set(x[0] for x in Rel)
    reflexive = is_reflexive(Rel, A)
    symmetric = is_symmetric(Rel)
    transitive = is_transitive(Rel)
    print(f'Reflexive: {reflexive}')
    print(f'Symmetric: {symmetric}')
    print(f'Transitive: {transitive}')
Rel1 = {('a','a'), ('b','b'), ('c','c'), ('d','d'), ('b','a'), ('b','c'), ('b','d')}
relation_summary(Rel1)
Sage returns the desired result:
Reflexive: True
Symmetric: False
Transitive: True
Notice the use of the so-called “f-strings” in the print – we are allowed to use expressions whose values should be printed inside the string.
1.F.5. From the relationship ((a, b), (a, b)) ∈ R for all a, b ∈ R it follows that the relation is reflexive. It is also easy to see that the relation is symmetric, since in the equality of the second coordinates we can interchange the left and right side. If ((a, b), (c, d)) ∈ R and ((c, d), (e, f)) ∈ R, that is, b = d and d = f, then we get b = f, that is, ((a, b), (e, f)) ∈ R. Hence R is also transitive, and so it is an equivalence relation. Notice that two points in the plane are related if and only if they have the same second coordinate (the line they determine is perpendicular to the y-axis). Thus the corresponding partition divides the plane into the lines parallel with the x-axis.
1.F.6. Directly from the definition of the domain and the codomain of a relation we obtain
D = {a, b, c, d, f} ⊂ A, I = {x, y, u, v} ⊂ B.
To check if a relation is a mapping, we need to ensure that each element in the domain is assigned exactly one element of the codomain. In our case we deduce that R is not a mapping, since (c, x), (c, u) ∈ R, that is, c ∈ D has two images. In Sage, the domain and codomain can be found like this:
A = set(["a", "b", "c", "d", "e", "f"])
B = set(["x", "y", "u", "v", "w"])
Rel = set([("a","v"), ("b","x"), ("c","x"), ("c","u"), ("d","v"), ("f","y")])
# to compute the domain we iterate over R taking the first element of each pair
domain = set(x[0] for x in Rel)
print(f"domain = {domain}")
# to compute the codomain, we take the second element
codomain = set(x[1] for x in Rel)
print(f"codomain = {codomain}")
1.F.9. In the first case, the map f is surjective (it is enough to set x = 0) but not injective (it is enough to set (x, y) = (0, −9) and (x, y) = (1, 0)). In the second case, f is an injective mapping (both its coordinates, that is, the functions y = 2x and y = x² + 10, are clearly increasing over N). The mapping is not surjective (for instance the pair (1, 1) has no preimage).
1.G.1. An easy way is given by Sage, using the command sum. Type in your editor the code
# Define symbolic variables
n, k = var("n k")
# Define the function to sum
f = k
# Compute the symbolic sum
sum1 = sum(f, k, 1, n)
# Show the symbolic sum
show(sum1)
Sage’s output is $\frac{1}{2}n^2 + \frac{1}{2}n$, which represents the evaluated form of the sum $\sum_{k=1}^{n} k$. However, since the task requires an illustration, we need to explore a different approach. The idea that we will present below likely dates back to the Pythagoreans (6th century BC), but it is also closely associated with the German mathematician C. F. Gauss (1777–1855), one of the most influential scientists of his time. Let $\Delta_n = 1 + 2 + \cdots + (n - 1) + n$ be the sum that we want to compute, and rewrite it with the summands in reverse order, that is, $\Delta_n = n + (n - 1) + \cdots + 2 + 1$. Adding these two relations, we get
$2\Delta_n = (n + 1) + (n - 1 + 2) + (n - 2 + 3) + \cdots + (2 + (n - 1)) + (1 + n) = (n + 1) + (n + 1) + \cdots + (n + 1)$,
where the sum on the r.h.s. has n summands. Thus $2\Delta_n = n(n + 1)$, which completes the proof. Graphically, the triangular number $\Delta_n = \frac{1}{2}n(n+1)$ (n = 1, 2, ...) can be represented by the triangular grid of points (black dots) figured below, where the first row contains a single dot ($\Delta_1 = 1$) and each subsequent row contains one more dot than the previous one (we always move the first dot upwards). In the figure the first five triangular numbers are presented.14 To illustrate the proof given above, one converts the triangle arrangements to half-square arrangements, as in the figure below. Let us illustrate the case of $\Delta_7$. The shaded triangle is the triangle arrangement corresponding to $\Delta_7$, while the white triangle occurs after rotating the shaded diagram, so as to obtain the rectangle of size 7 × (7 + 1). Hence $2\Delta_7 = 56 = 7(7 + 1)$, that is, $\Delta_7 = 28$. Finally, as in 1.C.7 we see that the number of distinct handshakes or wine glass clinks that can be made in a group of n people is given by $\binom{n}{2}$, and this can be rephrased as $\Delta_{n-1}$.15 This last claim is based on the relation $\Delta_n = \binom{n+1}{2}$. As for the two given recurrence relations, they follow very easily, and it is worth mentioning that adding them returns us to the initial definition of $\Delta_n$.
1.G.2. Suppose that the length of the hill is ℓ km. Then the bicyclist needs ℓ/25 hours to go up the hill, and ℓ/75 hours to go down the same hill.
For instance, if ℓ = 25 km, then the cyclist needs one hour to go uphill and a third of an hour to go downhill. Since the total distance is 2ℓ km, we deduce that the average speed is
$\frac{2\ell}{\frac{\ell}{25} + \frac{\ell}{75}} = \frac{75 \cdot 2\ell}{4\ell} = \frac{75}{2} = 37.5$ km/h.
The average speed here is the harmonic mean of the two given rates.
14 Many mathematicians include 0 as the very first element in the sequence of triangular numbers, but we do not adopt this here.
15 Each person shakes hands with n − 1 others, but each handshake involves two people, so we count each handshake twice.
1.G.8. By assumption, $P(x) = x^2 + x + 2$, hence we have
$Q(x) = P(P(x)) - P(x) - 3 = (P(x))^2 + P(x) + 2 - P(x) - 3 = (P(x))^2 - 1 = (P(x) - 1) \cdot (P(x) + 1)$.
This shows that Q(x) is divisible by P(x) − 1, with quotient $\pi(x) = P(x) + 1 = x^2 + x + 3$.
1.G.9. Suppose that $P(x) = ax^4 + bx^3 + cx^2 + dx + e$. Since P(0) = 0 we get e = 0 and so $P(x) = ax^4 + bx^3 + cx^2 + dx$. Now, we see that
$P(x + 1) - P(x) = a(x + 1)^4 + b(x + 1)^3 + c(x + 1)^2 + d(x + 1) - (ax^4 + bx^3 + cx^2 + dx) = a(x^4 + 4x^3 + 6x^2 + 4x + 1) + b(x^3 + 3x^2 + 3x + 1) + c(x^2 + 2x + 1) + d(x + 1) - ax^4 - bx^3 - cx^2 - dx = 4ax^3 + 3(2a + b)x^2 + (4a + 3b + 2c)x + (a + b + c + d)$.
Hence, by the relation $P(x + 1) - P(x) = 4x^3$ we get the following system of equations:
4a = 4, 2a + b = 0, 4a + 3b + 2c = 0, a + b + c + d = 0.
This has the unique solution a = 1, b = −2, c = 1 and d = 0, thus $P(x) = x^4 - 2x^3 + x^2 = x^2(x^2 - 2x + 1) = x^2(x - 1)^2$, and the claim follows.
1.G.11. This task can be challenging, especially at this point. However, we will make every effort to explain the code thoroughly, including our comments within the block, to ensure a clear understanding of each step. We will use the def keyword to define a function (or routine), which we name horner_division. This function has two inputs: a list of coefficients starting from degree 0, which we may think of as $[a_0, \dots, a_n]$, and the real number $x_0$. The length of the coefficient list is obtained in Sage by the command len. To perform the division we will combine the for loop with the range() function. Such combinations are essential in SageMath for handling repetitive tasks, as demonstrated in our example. Recall that the range() function generates a sequence of numbers, commonly used in for loops. We can customize such sequences using different arguments:
• One argument: range(stop) generates numbers from 0 to stop − 1.
• Two arguments: range(start, stop) generates numbers from start to stop − 1.
• Three arguments: range(start, stop, step) generates numbers from start towards stop with increments of step.
With all the necessary tools in place, we can now construct our routine. The code is as follows:
def horner_division(p, x0):
    n = len(p)  # the number of coefficients in the polynomial
    q = [0] * (n - 1)  # initialize a list to hold the coefficients of the quotient
    r = p[n - 1]  # start with the leading coefficient (highest degree term)
    for i in range(n - 2, -1, -1):  # loop from the second last coefficient to the first
        q[i] = r  # assign the current remainder as a coefficient in the quotient
        r = p[i] + x0 * r  # update the remainder using the current coefficient and x0
    return q, r  # return the list of quotient coefficients and the remainder
Our routine is now ready to be tested. Let us use it to prove the second claim.
# Example polynomial
# Define the polynomial p(x) = 4 - 3x + 5x^2 - 2x^3 + x^4
p = [4, -3, 5, -2, 1]; x0 = 3
q, r = horner_division(p, x0)
print("Quotient coefficients:", q)
print("Remainder:", r)
Sage’s output has the form
Quotient coefficients: [21, 8, 1, 1]
Remainder: 67
Thus, dividing $p(x) = 4 - 3x + 5x^2 - 2x^3 + x^4$ by (x − 3) we obtain the quotient $q(x) = 21 + 8x + x^2 + x^3$ and the remainder r = 67.
1.G.12. (1) Fractions of the form $z := \frac{z_1}{z_2}$, where $z_1, z_2$ are complex numbers with $z_2 \neq 0$, can be written in the form z = x + iy with x, y ∈ R, after multiplying z by $\frac{\bar{z}_2}{\bar{z}_2}$. This is because $z_2\bar{z}_2 = |z_2|^2 \in R$. For the given z we compute
$\frac{1 + 2i}{4 - 5i} = \frac{1 + 2i}{4 - 5i} \cdot \frac{4 + 5i}{4 + 5i} = \frac{4 + 5i + 8i + 10i^2}{|4 + 5i|^2} = \frac{-6 + 13i}{4^2 + 5^2} = -\frac{6}{41} + i\frac{13}{41}$.
This can be quickly verified in Sage by executing the following cell:
real((1+2*I)/(4-5*I)) == -6/41; imag((1+2*I)/(4-5*I)) == 13/41
For both commands Sage prints out True. You can also obtain the result directly by typing the following command:
z = (1+2*I)/(4-5*I); simplify(z)
(2) Denote by α the quantity that we want to compute. We have
$\alpha := \frac{(1 + i)^{2023}}{(1 - i)^{2020}} = \Big(\frac{1 + i}{1 - i}\Big)^{2020}(1 + i)^3$, and moreover $\frac{1 + i}{1 - i} = \frac{1 + i}{1 - i} \cdot \frac{1 + i}{1 + i} = \frac{(1 + i)^2}{2} = i$.
Hence $\alpha = i^{2020}(1 + i)^3 = i^{2020}(2i - 2)$, and it suffices to compute $i^{2020}$. A direct computation shows that $i^4 = 1$ and moreover $i^8 = 1$. By induction it follows that $i^{4n} = 1$ for all n ∈ N. Thus $i^{2020} = i^{4 \cdot 505} = 1$ and so α = 2i − 2. A verification in Sage is given by the cell
a = ((1+I)**(2023))/((1-I)**(2020)); simplify(a)
(3-4) Both claims are true and indicate that $S^1$ has the structure of a commutative group under multiplication, as discussed in 1.1.1. Indeed, the first assertion is true because the absolute value function is multiplicative, which implies that $S^1$ is closed under multiplication. Specifically, for any $z \in S^1$ we have |z| = 1, and hence $z\bar{z} = |z|^2 = 1$. Moreover, since $|\frac{1}{z}| = \frac{1}{|z|} = 1$, we confirm that $\frac{1}{z} \in S^1$ for any $z \in S^1$, ensuring that $S^1$ is closed under taking reciprocals. Below, we will explore a slightly different but equivalent interpretation of the unit circle, which provides an alternative way to verify these properties.
(5) The statement is true. We will verify this using Sage, and encourage the reader to explore and present alternative methods.
z = var("z"); solve(z**3 + I == 0, z)
1.G.13. The complex numbers that satisfy the conditions of the task lie on a circle in $R^2$ with centre at $P_2 = [-4, -2]$ and radius ρ, where ρ is the distance between $z_2$ and w, i.e., $\rho = 2\sqrt{2} = |z_2 - w|$. Let us illustrate this circle along with the complex numbers $z_1, z_2$ in a figure (see the figure on the left-hand side). Now, $|z_2 - w| = |w - z_2| = |(x + 4) + i(y + 2)|$, and we obtain
$\rho^2 = (2\sqrt{2})^2 = 8 = (x + 4)^2 + (y + 2)^2 = (x^2 + y^2) + 4(2x + y) + 20$, i.e., $(x^2 + y^2) + 4(2x + y) + 12 = 0$. (†)
Moreover, if $b = |z_1 - w| = |(2 - x) + i(6 - y)|$ is the distance between $z_1$ and w, then we compute
$b^2 = (2 - x)^2 + (6 - y)^2 = (x^2 + y^2) - 4(x + 3y) + 40$. (‡)
Finally, the distance between $z_1, z_2$ is given by $d = |z_1 - z_2| = |6 + 8i| = \sqrt{36 + 64} = 10$. Recall that in Sage this can be verified by the command abs(6 + 8*I). We can now proceed with the Pythagorean theorem, which yields the condition $d^2 = \rho^2 + b^2$. Due to (‡) and the above observations, this translates to
$100 = 8 + (x^2 + y^2) - 4(x + 3y) + 40 \iff 52 = (x^2 + y^2) - 4(x + 3y)$.
Hence, together with (†), we arrive at the following system of equations:
$(x^2 + y^2) + 4(2x + y) = -12$, $(x^2 + y^2) - 4(x + 3y) = 52$.
We may multiply the first equation by $-1$ and then add it to the second one. This gives the equation
$$y = -\left(\tfrac{3}{4}x + 4\right).$$
Substituting this expression for $y$ into either of the two equations of the system (or even into the equation (‡)), one arrives at the following quadratic equation determining $x$, namely
$$\tfrac{25}{16}x^2 + 11x + 12 = 0,$$
with discriminant $\Delta = 46 > 0$. Thus, there are two solutions that result in valid triangles, which are illustrated in the figure on the right above. Let us present the explicit form of the solutions:
$$x_{1,2} = \frac{8(-11 \pm \sqrt{46})}{25}, \qquad y_{1,2} = -\left(\frac{3}{4} \cdot \frac{-88 \pm 8\sqrt{46}}{25} + 4\right).$$
Using Sage one may verify this by typing

x, y=var("x, y")
solve([x**2+y**2+4*(2*x+y)+12==0, x**2+y**2-4*(x+3*y)-52==0], x, y)

1.G.25. Consider first the division of a plane by $n$ circles. As we know from 1.G.24, the maximum number $y_n$ of areas a plane can be divided into by $n$ circles satisfies $y_n = y_{n-1} + 2(n-1)$, $y_1 = 2$, that is, $y_n = n^2 - n + 2$. For the maximum number $p_n$ of areas a space can be divided into by $n$ balls, we obtain the recurrence $p_{n+1} = p_n + y_n$, $p_1 = 2$, that is, $p_n = \frac{n}{3}(n^2 - 3n + 8)$.

1.G.26. Let $x_n$ denote the number of regions in question. When $n = 1$, a single plane divides the space into two regions. This means that $x_1 = 2$. Now, each additional plane intersects the already placed planes along lines which divide the new plane into regions. In more detail, the $n$-th plane intersects the existing $n-1$ planes along $n-1$ lines, and these divide the new plane into $2(n-1)$ sectors, each of which splits an existing region in two. Thus the $n$-th plane adds $2(n-1)$ new regions. Therefore we derive the recurrence relation $x_n = x_{n-1} + 2(n-1)$, with the initial condition $x_1 = 2$. It is now straightforward to see that the solution has the form $x_n = n(n-1) + 2$.

1.G.29. (a) For the driver's place we have two choices and the other places are then arbitrary, that is, for the second seat we have four choices, for the third three choices, then two and then one. That makes $2 \cdot 4! = 48$ ways.
(b) Similarly in the bus we have two choices for the driver, and then the other driver plus the passengers can be seated among the 24 seats arbitrarily. First choose the seats to be occupied, that is, $\binom{24}{21}$. Among these seats the people can be seated in $21!$ ways. The solution is $2 \cdot \binom{24}{21} \cdot 21! = \frac{24!}{3}$ ways.

1.G.30. (a) $2^6 = 64$. (b) $\binom{6}{4} = 15$. (c) No head is one possibility, $\binom{6}{0} = 1$; one head gives $\binom{6}{1} = 6$. Thus there are 7 sequences with at most one head and the result is $64 - 7 = 57$.

1.G.31. (i) Let us add three matches to the 40 balls. If we order the balls and matches in a row, the matches divide the balls into 4 sections. We order the boys at random, give the first boy all the balls from the first section, give the second boy all the balls from the second section and so on. It is now evident that the result is $\binom{43}{3} = 12\,341$.
(ii) From the problem statement it is clear that C must receive 11 coins. That can be done in $\binom{33}{11}$ ways. Each of the remaining 22 coins can be given either to A or to B, which gives $2^{22}$ ways. Using the rule of product we obtain the result $\binom{33}{11} \cdot 2^{22}$.
(iii) We divide the boys and the girls independently. Thus the answer is $2^9(2^5 - 7) = 12\,800$.

1.G.33. Each pair of players shakes hands at the introduction. As we know, the number of handshakes is the combination $c(6,2) = \binom{6}{2} = 15$. After a match each of the six players shakes hands six times (with each of six opponents). Hence the required number is $6^2 = 36$.
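Counts of this kind are easy to double-check by machine. The following small Sage cell is our own verification of the numbers in 1.G.31 and is not part of the original solution; binomial is Sage's built-in binomial coefficient:

print(binomial(43, 3))           # 12341, the number of distributions in (i)
print(binomial(33, 11) * 2^22)   # the number of distributions in (ii)
print(2^9 * (2^5 - 7))           # 12800, the number of divisions in (iii)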
1.G.39. (a) The answer is $1/6$.
(b) One can reformulate this task as if we were throwing the dice $n$ times. The probability that the first roll does not result in 1, 3 or 6 is $1/2$. The probability that neither the first nor the second roll does is clearly $1/4$. This is because the result of the first roll does not influence the result of the second roll. Because the event determined by the result of a given roll and the event determined by the result of another roll are always (stochastically) independent, we deduce that the probability is $1/2^n$.
(c) If we denote the event "first dice resulted in five" by $A$ and the event "the sum is 9" by $H$, then it holds
$$P(A|H) = \frac{P(A \cap H)}{P(H)} = \frac{1/36}{4/36} = \frac{1}{4}.$$
Note that the sum 9 occurs when the first die is 3 and the second is 6, the first is 4 and the second is 5, the first is 5 and the second is 4, or the first is 6 and the second is 3. Of those four results (which have the same probability) only one is favourable to the event $A$. Since the probability of $A$ is clearly $1/6 \neq 1/4$, the events are not mutually independent.

1.G.43. The space of all possible events can be viewed as a square $3 \times 3$. Denote by $x$ the arrival time of Michael and by $y$ the arrival time of Alex. In such terms we can claim that these two persons will meet if and only if $|x - y| \leq \frac{1}{2}$. This inequality determines in the square of all possible events the region whose area is $11/36$ of the area of the whole square. Thus, this is also the probability of the event.

1.G.44. Dividing the quadrilateral into two triangles $ABC$ and $ACD$ with areas $S_1$ and $S_2$ we obtain
$$S = S_1 + S_2 = \frac{1}{2}\begin{vmatrix} 1-0 & 1-0 \\ -1+2 & 5+2 \end{vmatrix} + \frac{1}{2}\begin{vmatrix} 1-0 & -1-0 \\ 5+2 & 1+2 \end{vmatrix} = \frac{1}{2}(7-1) + \frac{1}{2}(3+7) = 8.$$

1.G.45. Eliminating $y$ yields $(4x - 2y - 10) + (x + 2y - 5) = 0$, from which $x = 3$, and hence $y = 1$. Hence the lines intersect at the point $P = [3, 1]$.

1.G.46. The given vectors correspond to complex numbers $1+i$, $2+i$ and $3+i$. We are to find the sum of their arguments. According to de Moivre's formula this equals the argument of their product. Their product is $(1+i)(2+i)(3+i) = (1+3i)(3+i) = 10i$, which is a purely imaginary number with argument $\pi/2$. So the sum we are looking for is $\pi/2$.

1.G.49. Clearly, $A^2 = A$ and hence $E - A^2 = E - A = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$. Also, $B$ is symmetric, i.e., $B = B^T$, and hence $BB^T = B^2 = \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix}$. Thus, the equation $X(E - A^2) + BB^T = D$ is equivalent to
$$\begin{pmatrix} 5 & b+4 \\ 4 & d+5 \end{pmatrix} = \begin{pmatrix} 5 & -b \\ 4 & -4d \end{pmatrix} \iff \begin{pmatrix} 0 & 2b+4 \\ 0 & 5d+5 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}.$$
Its solution is given by $b = -2$, $d = -1$, and $a, c \in \mathbb{R}$ free. In Sage we can directly compute the expression $X(E - A^2) + BB^T - D$ as follows:

var("a, b, c, d")
X=matrix([[a, b], [c, d]]); D=matrix([[5, -b], [4, -4*d]])
A=matrix([[1, 0], [0, 0]]); B=matrix([[1, 2], [2, 1]])
E = identity_matrix(2)
show(X*E-X*A^2+B*(B.transpose())-D)

1.G.59. The number we need to determine is obtained by subtracting the number of non-surjective mappings from the number of all mappings. The number of all mappings is $V(3,4) = 3^4$. Non-surjective mappings have either a one-element or a two-element codomain. There are just three mappings with a one-element codomain. The number of mappings with a two-element codomain is $\binom{3}{2}(2^4 - 2)$ (there are $\binom{3}{2}$ ways to choose the codomain and for a fixed two-element codomain there are $2^4 - 2$ ways to map four elements onto it). Therefore, the number of surjective mappings is $3^4 - \binom{3}{2}(2^4 - 2) - 3 = 36$.
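The inclusion–exclusion count in 1.G.59 can also be confirmed by brute force enumeration. The following cell is our own sketch, not part of the original text: it runs through all $3^4$ mappings of a four-element set into a three-element set and counts the surjective ones.

from itertools import product
# every f below is a 4-tuple of values in {0, 1, 2}, i.e., a mapping of a 4-set into a 3-set
count = sum(1 for f in product(range(3), repeat=4) if set(f) == {0, 1, 2})
print(count)  # prints 36, in agreement with the computation above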
1.G.63. We see that
$$R^{-1} = \{(4,0), (0,-3), (\pi,5), (2,5), (2,0)\}.$$
Moreover,
$$R \circ R^{-1} = \{(4,4), (0,0), (\pi,\pi), (2,2), (4,2), (\pi,2), (2,\pi), (2,4)\}.$$

1.G.65. Here are the solutions:
i) $(3,3), (4,4), (5,5), (6,6), (7,7), (3,6)$; check that it is an ordering relation.
ii) again $(i,i)$ for $i = 1, \ldots, 7$ and additionally $(3,6), (6,3)$; check that it is an equivalence relation.
iii) $(i,i)$ for $i = 1, \ldots, 7$ and also $(3,6), (6,3), (4,6), (6,4)$. Check that it is not an equivalence, since transitivity does not hold.

In the previous chapter we warmed up by considering relatively simple problems which did not require any sophisticated tools. It was enough to use addition and multiplication of scalars. In this and subsequent chapters we shall add more sophisticated thoughts and concepts. First we restrict ourselves to concepts and operations built from finitely many additions and multiplications of finitely many scalars. This will take us three chapters and only then will we move on to infinitesimal concepts and tools.

Typically we deal with finite collections of scalars of a given size. We speak about "linear objects" and "linear algebra". Although it might seem to be a very special tool, we shall see later that even more complicated objects are studied mostly using their "linear approximations". In this chapter we will work with finite sequences of scalars. Such sequences arise in real-world problems whenever we deal with objects described by several parameters, which we shall call coordinates. Do not try too hard to imagine a space with more than three coordinates. You have to live with the fact that we are able to depict only one, two or three dimensions. However, we will deal with an arbitrary number of dimensions. For example, observing any parameter in a group of 500 students (for instance, their study results), our data will have 500 elements and we would like to work with them. Our goal is to develop tools which will work well even if the number of elements is large.

Do not be afraid of terms like field or ring of scalars $\mathbb{K}$. Simply imagine any specific domain of numbers. Rings of scalars are for instance the integers $\mathbb{Z}$ and all residue classes $\mathbb{Z}_k$. Among fields we have seen only $\mathbb{R}$, $\mathbb{Q}$, $\mathbb{C}$ and the residue classes $\mathbb{Z}_k$ for $k$ prime. $\mathbb{Z}_2$ is very specific among them, because the equation $x = -x$ does not imply $x = 0$ there, whereas in nearly every other field it does.

1. Vectors and matrices

In the first two parts of this chapter, we will work with vectors and matrices in the simple context of finite sequences of scalars. We can imagine working with integers or residue classes as well as real or complex numbers. We hope to illustrate how easily a concise and formal reasoning can lead to strong results valid in a much broader context than just for real numbers.

CHAPTER 2

Elementary linear algebra

Can't you count with scalars yet? – no worry, let us go straight to matrices...

A. Vectors and matrices

The vectors ($n$-tuples of scalars) and matrices (2-dimensional arrays of scalars) are the backbone of most of the computational power in practical tasks in all disciplines relying on any kind of data analytics or numerics (e.g., physics, biology, chemistry, engineering, or economics). In this chapter we shall focus on matrix calculus and illustrate some of its simple applications. To start with, we display a few elementary tasks on matrices and vectors, extending the discussion in the first chapter (see, e.g., 1.E.12 and 1.E.14). Recall that the matrix multiplication works only if the matrix on the left has got the same number of columns as the number of rows of the matrix on the right; the number of rows in the result coincides with that of the left matrix, while its number of columns equals the number of columns of the right-hand matrix.
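To get a feeling for this rule, one can let Sage do the bookkeeping. The following toy cell is our own illustration (the matrices are ours, not from the text): it multiplies a $2 \times 3$ matrix by a $3 \times 2$ matrix and back; both products exist here, but they have different types.

A = matrix([[1, 2, 3], [4, 5, 6]])    # a matrix of type 2/3
B = matrix([[1, 0], [0, 1], [2, 2]])  # a matrix of type 3/2
show(A*B)  # type 2/2: the 3 columns of A match the 3 rows of B
show(B*A)  # type 3/3: both products exist, but are of different types
# A*A would raise a TypeError: 3 columns cannot be matched with 2 rows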
Later, we follow the general terminology where the notion of vectors is related to fields of scalars only.

2.1.1. Vectors over scalars. For now, a vector is for us an ordered $n$-tuple of scalars from $\mathbb{K}$, where the fixed $n \in \mathbb{N}$ is called the dimension. We can add and multiply scalars. We will be able to add vectors, but multiplying a vector will be possible only by a scalar. This corresponds to the idea we have already seen in the plane $\mathbb{R}^2$. There, addition is realized as vector composition (as composition of arrows having their direction and size and compared when emanating from the origin). Multiplication by a scalar is realized as stretching the vectors.

A vector $u = (a_1, \ldots, a_n)$ is multiplied by a scalar $c$ by multiplying every element of the $n$-tuple $u$ by $c$. Addition is defined coordinate-wise.

Basic vector operations

For all vectors $u, v$ and scalars $c$ we define
$$u + v = (a_1, \ldots, a_n) + (b_1, \ldots, b_n) = (a_1 + b_1, \ldots, a_n + b_n),$$
$$c \cdot u = c \cdot (a_1, \ldots, a_n) = (c \cdot a_1, \ldots, c \cdot a_n).$$

For vector addition and multiplication by scalars we shall use the same symbols as for scalars, that is, respectively, plus and either dot or juxtaposition, i.e., $cu = c(a_1, \ldots, a_n) = (ca_1, \ldots, ca_n)$.

The vector notation convention. We shall not, unlike many other textbooks, use any special notation for vectors and leave it to the reader to pay attention to the context. For scalars, we shall mostly use letters from the beginning of the alphabet, for the vectors from the end of the alphabet. The middle part of the alphabet can be used for indices of variables or components and also for summation indices.

In the general theory at the end of this chapter and later, we will work exclusively with fields of scalars when talking about vectors. Now we will work with the more relaxed properties of scalars as listed in 1.1.1. For vector addition in $\mathbb{K}^n$, the properties (CG1)–(CG4) (see 1.1.1) clearly hold with the zero element being (notice we define the addition coordinate-wise)
$$0 = (0, \ldots, 0) \in \mathbb{K}^n.$$
We are purposely using the same symbol for both the zero vector and the zero scalar. Next, let us notice the following basic properties of vectors:

Vector properties

For all vectors $v, w \in \mathbb{K}^n$ and scalars $a, b \in \mathbb{K}$ we have
(V1) $a \cdot (v + w) = a \cdot v + a \cdot w$
(V2) $(a + b) \cdot v = a \cdot v + b \cdot v$
(V3) $a \cdot (b \cdot v) = (a \cdot b) \cdot v$
(V4) $1 \cdot v = v$

2.A.1. Matrix multiplication. Check the computations:
(a) $\begin{pmatrix} 1 & 2 \\ -1 & 3 \end{pmatrix} \begin{pmatrix} 1 & -1 \\ 2 & 1 \end{pmatrix} = \begin{pmatrix} 5 & 1 \\ 5 & 4 \end{pmatrix}$,
(b) $\begin{pmatrix} 1 & -1 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ -1 & 3 \end{pmatrix} = \begin{pmatrix} 2 & -1 \\ 1 & 7 \end{pmatrix}$,
(c) $\begin{pmatrix} 1 & 2 & 3 \\ 1 & -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & -1 & 2 & 1 \\ 1 & 1 & -2 & -3 \\ 3 & 2 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 12 & 7 & 1 & -5 \\ 3 & 0 & 5 & 4 \end{pmatrix}$,
(d) $\begin{pmatrix} 1 & 3 & -3 \end{pmatrix} \begin{pmatrix} 1 & -2 & 3 \\ 3 & 2 & 1 \\ 1 & -1 & -4 \end{pmatrix} = \begin{pmatrix} 7 & 7 & 18 \end{pmatrix}$,
(e) $\begin{pmatrix} 1 & 2 & -2 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} = (-2) = -2$.

Remark. As we already mentioned in Chapter 1, the multiplication of matrices is not commutative in general. Parts (a) and (b) in 2.A.1 illustrate this fact for $2 \times 2$ matrices. On the other hand, part (e) illustrates that multiplying rows and columns produces the so-called scalar products of vectors (also called dot products), and our approach to distances and angles will rely on them. For two vectors like $u = (2, 1, 3)^T$ and $v = (1, 2, -2)^T$, we write $\langle u, v \rangle$ or $u \cdot v$ for $v^T u = -2$.
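The same number is produced in Sage by the dot_product method; here is a quick check of our own:

u = vector([2, 1, 3])
v = vector([1, 2, -2])
print(u.dot_product(v))  # prints -2, the scalar product computed above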
As usual, we use the superscript $T$ to indicate the transposition of matrices (i.e., we write the rows of $A$ as columns in $A^T$), and $E$ for the identity matrix, whose dimension will be clear from the context.

2.A.2. For the following matrices and vectors
$$A = \begin{pmatrix} 1 & 1 \\ 0 & \sqrt{2} \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 4 \\ 4 & 0 \end{pmatrix}, \quad u = \begin{pmatrix} 0 \\ \sqrt{2} \end{pmatrix}, \quad v = \begin{pmatrix} 1 \\ 4 \end{pmatrix}$$
compute the given expressions:
1) $(A - E)(A + E)$, 6) $4A^3 - B^2$,
2) $A(Bu)$, 7) $\sqrt{2}\,u - \frac{1}{2}v$,
3) $(A - B)(A + B)u$, 8) $\frac{\sqrt{2}}{2}u - ABu + 2v$,
4) $aA^2 + aB^2$ ($a \in \mathbb{R}$), 9) $A^2 u - B^2 v$,
5) $u \cdot v - 4\sqrt{2}$, 10) $(B^2 u) \cdot (4v)$. ⃝

2.A.3. Compute all tasks in 2.A.2 in Sage.

Solution. A solution goes as follows:

The properties (V1)–(V4) of our vectors are easily checked for any specific ring of scalars $\mathbb{K}$, since we need just the corresponding properties of scalars as listed in 1.1.1, applied to the individual components of the vectors. In this way we shall work with, for instance, $\mathbb{R}^n$, $\mathbb{Q}^n$, $\mathbb{C}^n$, but also with $\mathbb{Z}^n$, $(\mathbb{Z}_k)^n$, $n = 1, 2, 3, \ldots$

2.1.2. Matrices over scalars. Matrices are slightly more complicated objects, useful when working with vectors.

Matrices of type m/n

A matrix of the type $m/n$ over scalars $\mathbb{K}$ is a rectangular schema $A$ with $m$ rows and $n$ columns
$$A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{pmatrix}$$
where $a_{ij} \in \mathbb{K}$ for all $1 \leq i \leq m$, $1 \leq j \leq n$. For a matrix $A$ with elements $a_{ij}$ we also use the notation $A = (a_{ij})$.

The vector $(a_{i1}, a_{i2}, \ldots, a_{in}) \in \mathbb{K}^n$ is called the ($i$-th) row of the matrix $A$, $i = 1, \ldots, m$. The vector $(a_{1j}, a_{2j}, \ldots, a_{mj}) \in \mathbb{K}^m$ is called the ($j$-th) column of the matrix $A$, $j = 1, \ldots, n$.

Matrices of the type $1/n$ or $n/1$ are actually just vectors in $\mathbb{K}^n$. All general matrices can be understood as vectors in $\mathbb{K}^{mn}$; we just consider all the columns in a row. In particular, matrix addition and matrix multiplication by scalars are defined:
$$A + B = (a_{ij} + b_{ij}), \qquad a \cdot A = (a \cdot a_{ij}),$$
where $A = (a_{ij})$, $B = (b_{ij})$, $a \in \mathbb{K}$. The matrix $-A = (-a_{ij})$ is called the additive inverse to the matrix $A$ and the matrix
$$0 = \begin{pmatrix} 0 & \ldots & 0 \\ \vdots & & \vdots \\ 0 & \ldots & 0 \end{pmatrix}$$
is called the zero matrix. By considering matrices as $mn$-dimensional vectors, we obtain the following:

Proposition. The formulas for $A + B$, $a \cdot A$, $-A$, $0$ define the operations of addition and multiplication by scalars for the set of all matrices of the type $m/n$, which satisfy properties (V1)–(V4).

2.1.3. Matrices and equations. Many mathematical models are based on systems of linear equations. Matrices are useful for the description of such systems. In order to see this, let us introduce the notion of the scalar product of two vectors, assigning to the vectors $(a_1, \ldots, a_n)$ and $(x_1, \ldots, x_n)$ their product
$$(a_1, \ldots, a_n) \cdot (x_1, \ldots, x_n) = a_1 x_1 + \cdots + a_n x_n.$$
This means, we multiply the corresponding coordinates of the vectors and sum the results.

E=matrix.identity(2)
A=matrix([[1, 1], [0, sqrt(2)]])
B=matrix([[1, 4], [4, 0]])
u=vector([0, sqrt(2)])
v=vector([1, 4])
show((A-E)*(A+E))
show(A*B*u)
show((A-B)*(A+B)*u)
show(A^2+B^2)
u.dot_product(v)
show(4*A^3-B^2)
show(sqrt(2)*u-(1/2)*v)
show((sqrt(2)/2)*u-A*B*u+2*v)
show(A^2*u-B^2*v)
(B^2*u).dot_product(4*v)
□

Doing computations with matrices, we always recommend using Sage to verify your formal computations, especially for matrices of large size.
2.A.4. Determine explicitly each of the following vectors (by hand), and next verify your answer in Sage:
1) $u_1 := (A - E)^2(4B)u$, 3) $u_3 := A(Bu)$,
2) $u_2 := (A - E)^T(B^2 u)$, 4) $u_4 := B(Au)$,
where $E$ is the identity $3 \times 3$ matrix and $A$, $B$, $u$ are respectively given by
$$A = \begin{pmatrix} 1 & \sqrt{2} & \pi \\ 0 & 1 & 1 \\ -\pi & 0 & 2 \end{pmatrix}, \quad B = \begin{pmatrix} 0 & 1 & 1 \\ -1 & 0 & \pi \\ -1 & 0 & 1 \end{pmatrix}, \quad u = \begin{pmatrix} \pi \\ 0 \\ \sqrt{2} \end{pmatrix}. ⃝$$

While explicit computations in high dimensions (i.e., dealing with vectors with a large number of components) can hardly be imagined without the use of computer-aided mathematics software like Sage, the understanding of structure and properties is often based on visual perception. Of course, we are then limited to dimensions two or three, but even there we can gain essential insight into how things may work in general (and this fact might be viewed as one of the strategies to build a mathematical mindset). Thus, we should enjoy the Sage tools designed to plot vectors and similar objects.¹

2.A.5. Use Sage for plotting:
(a) The vectors $u = (1, 2, 3)^T$, $v = (2, -3, 4)^T$, $u + v$, $u - v$, which are all vectors of the 3-dimensional Euclidean space $\mathbb{R}^3$.
(b) Repeat for the vectors $x = (4, 0, 2)^T$, $y = (-3, 0, 1)^T$, $x + y$, and $x - y$.

Solution. (a) The simplest method is to introduce the vectors, and then use the function plot, as follows:

u = vector([1, 2, 3])

¹ Notice that so far Sage allows us to introduce vectors and make operations between them, without a reference to the choice of scalars. Later in 2.C.12 we will see how to treat vectors in Sage, when they are viewed as elements of some specific vector space.

Every system of $m$ linear equations in $n$ variables
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\ &\ \,\vdots \\ a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m \end{aligned}$$
can be seen as a constraint on values of $m$ scalar products with one unknown vector $(x_1, \ldots, x_n)$ (called the vector of variables, or vector variable) and the known vectors of coordinates $(a_{i1}, \ldots, a_{in})$.

The vector of variables can also be seen as a column in a matrix of the type $n/1$, and similarly the values $b_1, \ldots, b_m$ can be seen as a vector $u$, that is again a single column of a matrix of the type $m/1$. Our system of equations can then be formally written as $A \cdot x = u$ as follows:
$$\begin{pmatrix} a_{11} & \ldots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \ldots & a_{mn} \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix}$$
where the left-hand side is interpreted as $m$ scalar products of the individual rows of the matrix (giving rise to a column vector) with the vector variable $x$, whose values are prescribed by the equations. That means that the identity of the $i$-th coordinates corresponds to the original $i$-th equation
$$a_{i1}x_1 + \cdots + a_{in}x_n = b_i,$$
and the notation $A \cdot x = u$ gives the original system of equations.

2.1.4. Matrix product. In the plane, that is, for vectors of dimension two, we developed a matrix calculus which we noticed was effective to work with (see 1.5.4). Now we generalize this calculus and develop all the tools we already know from the plane case to deal with higher dimensions $n$. It is possible to define matrix multiplication only when the dimensions of the rows and columns allow it, that is, when the scalar product is defined for them as before:

Matrix product

For any matrix $A = (a_{ij})$ of the type $m/n$ and any matrix $B = (b_{jk})$ of the type $n/q$ over the ring of scalars $\mathbb{K}$ we define their product $C = A \cdot B = (c_{ik})$ as a matrix of the type $m/q$ with the elements
$$c_{ik} = \sum_{j=1}^{n} a_{ij} b_{jk}, \quad \text{for arbitrary } 1 \leq i \leq m,\ 1 \leq k \leq q.$$
That is, the element $c_{ik}$ of the product is exactly the scalar product of the $i$-th row of the matrix on the left and the $k$-th column of the matrix on the right. For instance we have
$$\begin{pmatrix} 2 & 1 \\ 1 & -1 \end{pmatrix} \cdot \begin{pmatrix} 2 & 1 & 1 \\ -1 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 3 & 2 & 3 \\ 3 & 1 & 0 \end{pmatrix}.$$

v = vector([2, -3, 4])
u_plus_v = u + v
u_minus_v = u - v
uu=plot(u, color="blue")
vv=plot(v, color="green")
uplusv=plot(u_plus_v, color="green")
uminusv=plot(u_minus_v, color="purple")
show(uu+vv+uplusv+uminusv)

However, Sage offers an alternative method for plotting vectors in three dimensions, using the arrow3d function. For instance, to visualize the vectors in (a) you can use the following block:

u=vector([1, 2, 3])
v=vector([2, -3, 4])
u_plus_v=u + v
u_minus_v=u - v
u_ar=arrow3d([0, 0, 0], u, color="red")
v_ar=arrow3d([0, 0, 0], v, color="blue")
u_plus_v_ar=arrow3d([0, 0, 0], u_plus_v, color="green")
u_minus_v_ar=arrow3d([0, 0, 0], u_minus_v, color="purple")
p=u_ar+v_ar+u_plus_v_ar+u_minus_v_ar
p.show()

In this block we used red, blue, green and purple to represent the vectors $u$, $v$, $u+v$ and $u-v$, respectively. Run the code in your Sage environment to generate and view the vector plot. Next apply the same method to analyze the second set of vectors. □

As our first real task essentially involving matrices, we are going to develop a systematic approach to solving systems of linear equations. Our first aim is to introduce the beautiful Gauss elimination method (see 2.1.7 for details), treating a few examples. You might also look into 2.1.3. Later, in Section C, we shall relate the description given here with the important notions of "affine spaces" and "vector spaces".

2.A.6. A colourful example. A company of painters orders 810 litres of paint, to contain 270 litres each of red, green and blue coloured paint. The provider can satisfy this order by mixing the colours he usually sells, namely:
• reddish colour – it contains 50% of red, 25% of green and 25% of blue colour;
• greenish colour – it contains 12.5% of red, 75% of green and 12.5% of blue colour;
• bluish colour – it contains 20% of red, 20% of green and 60% of blue colour.
How many litres of each of the colours at the warehouse have to be mixed in order to satisfy the order?

Solution. Let us denote by
• x – the number of litres of reddish colour to be used;
• y – the number of litres of greenish colour to be used;

2.1.5. Square matrices. If there is the same number of rows and columns in the matrix, we speak of a square matrix. The number of rows or columns is then called the dimension of the matrix. The matrix
$$E = (\delta_{ij}) = \begin{pmatrix} 1 & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & 1 \end{pmatrix}$$
is called the unit matrix, or alternatively, the identity matrix. The numbers $\delta_{ij}$ defined in this way are also called the Kronecker delta. When we restrict ourselves to square matrices over $\mathbb{K}$ of a fixed dimension $n$, the matrix product is defined for any two matrices, so there is a well-defined multiplication operation there. Its properties are similar to those of scalars:

Proposition. On the set of all square matrices of dimension $n$ over an arbitrary ring of scalars $\mathbb{K}$, the multiplication operation is defined with the following properties of rings (see 1.1.1):
(R1) Multiplication is associative.
(R3) The unit matrix $E = (\delta_{ij})$ is the unit element for multiplication.
(R4) Multiplication and addition are distributive.
In general, neither the property (R2) nor (ID) holds.
Therefore, the square matrices for $n > 1$ do not form an integral domain, and consequently they cannot be a (commutative or non-commutative) field.

Proof. Associativity of multiplication – (R1): Since scalars are associative, distributive and commutative, we can compute for any three matrices $A = (a_{ij})$ of type $m/n$, $B = (b_{jk})$ of type $n/p$ and $C = (c_{kl})$ of type $p/q$:
$$A \cdot B = \Big(\sum_j a_{ij} b_{jk}\Big), \qquad B \cdot C = \Big(\sum_k b_{jk} c_{kl}\Big),$$
$$(A \cdot B) \cdot C = \Big(\sum_k \Big(\sum_j a_{ij} b_{jk}\Big) c_{kl}\Big) = \Big(\sum_{j,k} a_{ij} b_{jk} c_{kl}\Big),$$
$$A \cdot (B \cdot C) = \Big(\sum_j a_{ij} \Big(\sum_k b_{jk} c_{kl}\Big)\Big) = \Big(\sum_{j,k} a_{ij} b_{jk} c_{kl}\Big).$$
Note that while computing, we relied on the fact that it does not matter in which order we perform the sums and products, that is, we were relying on the properties of scalars.

We can easily see that multiplication by a unit matrix has the property of a unit element:
$$A \cdot E = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mm} \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = A$$
and similarly from the left, $E \cdot A = A$. It remains to prove the distributivity of multiplication and addition. Again using the distributivity of scalars we can

• z – the number of litres of bluish colour to be used.
By mixing the colours, we want the final colour to contain 270 litres of red. By assumption, the reddish contains 50% of red, the greenish contains 12.5% of red and the bluish 20% of red. Hence, we get the equation
$$0.5x + 0.125y + 0.2z = 270.$$
Similarly, for green and blue colour, respectively, we get
$$0.25x + 0.75y + 0.2z = 270, \qquad 0.25x + 0.125y + 0.6z = 270.$$
From the first equation we have $x = 540 - 0.25y - 0.4z$, and then the second and third equations give $2.75y + 0.4z = 540$ and $0.25y + 2z = 540$, respectively. Hence, $z = 270 - 0.125y$ and after substituting into the first of these, we obtain $2.7y = 432$, that is, $y = 160$. Therefore $z = 270 - 0.125 \cdot 160 = 250$ and $x = 540 - 0.25 \cdot 160 - 0.4 \cdot 250 = 400$. This means that it is necessary to mix 400 litres of reddish, 160 litres of greenish and 250 litres of bluish colour.

Once you have set up the system that describes the problem, you can quickly find the solution using Sage. For example, enter the following block:

var("x, y, z")
eq1=0.5*x+0.125*y+0.2*z-270
eq2=0.25*x+0.75*y+0.2*z-270
eq3=0.25*x+0.125*y+0.6*z-270
solve([eq1==0, eq2==0, eq3==0], x, y, z)

Sage's output is the list

[[x == 400, y == 160, z == 250]]
□

2.A.7. Matrix notation. Approach the previous problem using the matrix notation and solve the matrix equation in Sage.

Solution. A practical approach to solving linear systems of equations relies on matrix notation. The first step is to construct the corresponding coefficient matrix, where each row represents the coefficients of the variables in a specific equation. In particular, the first row contains the coefficients from the first equation, the second row from the second equation, and so forth. In our case, the system in 2.A.6 leads to the $3 \times 3$ matrix
$$A = \begin{pmatrix} 0.5 & 0.125 & 0.2 \\ 0.25 & 0.75 & 0.2 \\ 0.25 & 0.125 & 0.6 \end{pmatrix}.$$
Then we write the system of equations as $Ax = b$, where $x = (x_1, x_2, x_3)^T$ and $b = (270, 270, 270)^T$, respectively. We call $x$ the vector of unknowns, and $b$ the vector of constants on the right-hand side of the system. Merging $A$ and the vector $b$, we arrive at the so-called extended matrix², which in our example has got the form
$$(A \mid b) = \begin{pmatrix} 0.5 & 0.125 & 0.2 & 270 \\ 0.25 & 0.75 & 0.2 & 270 \\ 0.25 & 0.125 & 0.6 & 270 \end{pmatrix}. \qquad (♯)$$

² The extended matrix is also referred to as the augmented matrix.
easily calculate for matrices $A = (a_{ij})$ of the type $m/n$, $B = (b_{jk})$ of the type $n/p$, $C = (c_{jk})$ of the type $n/p$, $D = (d_{kl})$ of the type $p/q$:
$$A \cdot (B + C) = \Big(\sum_j a_{ij}(b_{jk} + c_{jk})\Big) = \Big(\sum_j a_{ij} b_{jk}\Big) + \Big(\sum_j a_{ij} c_{jk}\Big) = A \cdot B + A \cdot C,$$
$$(B + C) \cdot D = \Big(\sum_k (b_{jk} + c_{jk}) d_{kl}\Big) = \Big(\sum_k b_{jk} d_{kl}\Big) + \Big(\sum_k c_{jk} d_{kl}\Big) = B \cdot D + C \cdot D.$$
As we have seen in 1.5.4, two matrices of dimension two do not necessarily commute: for example
$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}.$$
This gives us immediately a counterexample to the validity of (R2) and (ID). For matrices of type $1/1$ both axioms clearly hold, because the scalars themselves have them. For matrices of greater dimension the counterexamples can be obtained similarly: simply place the counterexamples for dimension 2 in the left upper corner, and select the rest to be zero. (Verify this on your own!) □

In the proof we have actually worked with matrices of more general types, thus we have proved the properties in greater generality:

Associativity and distributivity

Matrix multiplication is associative and distributive, that is,
$$A \cdot (B \cdot C) = (A \cdot B) \cdot C, \qquad A \cdot (B + C) = A \cdot B + A \cdot C,$$
whenever all the given operations are defined. The unit matrix is a unit element for multiplication (both from the right and from the left).

2.1.6. Inverse matrices. With scalars we can do the following: from the equation $a \cdot x = b$ with a fixed invertible $a$, we can express $x = a^{-1} \cdot b$ for any $b$. We would like to be able to do this for matrices too. So we need to solve the problem – how to tell that such a matrix exists, and if so, how to compute it? We say that $B$ is the inverse of $A$ if
$$A \cdot B = B \cdot A = E.$$
Then we write $B = A^{-1}$. From the definition it is clear that both matrices must be square and of the same dimension $n$. A matrix which has an inverse is called an invertible matrix or a regular square matrix.

Observe that the extended matrix completely encodes the original linear system of equations (just leaving out the variable names). We shall come back to the extended matrix in the next task. In order to find a solution of $Ax = b$ in Sage, we can use the command A.solve_right(b). For our case:

A=matrix(QQ, [[0.5, 0.125, 0.2], [0.25, 0.75, 0.2], [0.25, 0.125, 0.6]])
b=vector([270, 270, 270])
x = A.solve_right(b); print(x)

Or, instead of A.solve_right(b), we could type A\b, i.e.,

A=matrix(QQ, [[0.5, 0.125, 0.2], [0.25, 0.75, 0.2], [0.25, 0.125, 0.6]])
b=vector([270, 270, 270])
A \ b  # this is a Matlab-like command
□

In the sequel we will refer to A\b as the "backslash" operator in Sage. If we are interested in a system of the form $xA = b$, in order to solve for $x$ we use the command A.solve_left(b). Let us notice that if the matrix equation does not have a solution, Sage returns an error (see also 2.A.20 for a similar case), while if it does have solutions, Sage returns just one. We shall come back to this phenomenon later.

2.A.8. Using the echelon form. Solve the previous problem again, this time using the transformation to the echelon form of the extended matrix.

Solution. The mathematical approach suggests looking for "equivalent" systems of equations with the same solution sets, perhaps much nicer than the original one. The crucial observation is that elementary row transformations allow us to transform the extended matrix into a row echelon form, see 2.1.7.
Recall, the elementary row operations correspond to interchanging any two rows, multiplying each element of a row by a non-zero number, or adding (a multiple of) a row to another one. Dealing with an extended matrix $B = (A \mid b)$, in Sage the elementary row operations are implemented via the following functions:³

B.swap_rows() – interchange two rows,
B.rescale_row() – scale a row by a factor,
B.add_multiple_of_row() – add a multiple of one row to another row and replace.

Recall, a matrix is in row echelon form if it meets the following requirements:
• The first non-zero number from the left in a non-zero row, known as the leading coefficient or pivot, is always to the right of the leading coefficient of the row above it.
• Rows containing only zeros are located at the bottom of the matrix.

³ In Sage there are similar functions encoding the column operations.

In the subsequent paragraphs we derive (among other things) that $B$ is actually the inverse of $A$ whenever just one of the above required equations holds. The other is then a consequence.

We easily check that if $A^{-1}$ and $B^{-1}$ exist, then there also is the inverse of the product $A \cdot B$:
$$(1) \qquad (A \cdot B)^{-1} = B^{-1} \cdot A^{-1}.$$
Indeed, because of the associativity of matrix multiplication proved a while ago, we have
$$(B^{-1} \cdot A^{-1}) \cdot (A \cdot B) = B^{-1} \cdot (A^{-1} \cdot A) \cdot B = E,$$
$$(A \cdot B) \cdot (B^{-1} \cdot A^{-1}) = A \cdot (B \cdot B^{-1}) \cdot A^{-1} = E.$$
Because we can calculate with matrices similarly as with scalars (they are just a little more complicated), the existence of an inverse matrix can really help us with the solution of systems of linear equations: if we express a system of $n$ equations for $n$ unknowns as a matrix product
$$A \cdot x = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mm} \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} = u,$$
and when the inverse of the matrix $A$ exists, then we can multiply from the left by $A^{-1}$ to obtain
$$A^{-1} \cdot u = A^{-1} \cdot A \cdot x = E \cdot x = x,$$
that is, $A^{-1} \cdot u$ is the desired solution. On the other hand, expanding the condition $A \cdot A^{-1} = E$ for the unknown scalars in the matrix $A^{-1}$ gives us $n$ systems of linear equations with the same matrix on the left and different vectors on the right. Thus we should think about methods for solving systems of linear equations.

2.1.7. Equivalent operations with matrices. Let us gain some practical insight into the relation between systems of equations and their matrices. Clearly, searching for the inverse can be more complicated than finding the direct solution to the system of equations. But note that whenever we have to solve several systems of equations with the same matrix $A$ but different right sides $u$, then obtaining $A^{-1}$ can be really beneficial for us.

From the point of view of solving systems of equations $A \cdot x = u$, it is natural to consider the matrices $A$ and vectors $u$ equivalent whenever they give a system of equations with the same solution set. Let us think about possible operations which would simplify the matrix $A$ such that obtaining the solution is easier. We begin with simple manipulations of rows of equations which do not influence the solution, and similar modifications of the right-hand side vector. If we are able to change a square matrix into the unit matrix, then the right-hand side vector is a solution of the original system. If some of the rows of the system vanish during the course of manipulations (that is,

These two conditions ensure that all entries in a column below a leading coefficient are zeros.
Paragraph 2.1.8 explains how to interpret the elementary row transformations as matrix multiplication. This presentation is often very convenient, and we shall come back to it later in Section E; see for example the task in 2.E.3.

We will now apply this method to solve the system posed in 2.A.6. Let $R_1, R_2, R_3$ denote the rows of the matrix obtained at each step of the procedure. We begin by considering the extended matrix $B = (A \mid b)$, as outlined in relation (♯) of 2.A.7, and apply the row operations $R_1 \to 2R_1$, $R_2 \to 4R_2$ and $R_3 \to 4R_3$. This yields
$$B = (A \mid b) \sim \begin{pmatrix} 1 & 0.25 & 0.4 & 540 \\ 1 & 3 & 0.8 & 1080 \\ 1 & 0.5 & 2.4 & 1080 \end{pmatrix}.$$
In Sage these operations are implemented by the block

B=matrix(QQ,[[0.5,0.125,0.2,270],[0.25,0.75,0.2,270],[0.25,0.125,0.6,270]])
B.rescale_row(0,2)  # multiply the 1st row by 2 (Sage labels rows as 0,1,...)
B.rescale_row(1,4)  # the 2nd row by 4
B.rescale_row(2,4)  # the 3rd row by 4
show(B)

where the last command is used to verify the matrix obtained along this first step. Notice that in B.rescale_row the first argument is the index of the row and the second is the scale factor. Next, B.add_multiple_of_row(k, l, c) will replace the $k$-th row $R_k$ by $R_k + cR_l$; here the first argument is the index of the row to be replaced, the second argument is the row to form a multiple of, and the final argument is the scale factor. Using the resulting matrix, we proceed with the row operations $R_2 \to R_2 - R_1$ and $R_3 \to R_3 - R_1$, while keeping $R_1$ untouched:
$$(A \mid b) \sim \begin{pmatrix} 1 & 0.25 & 0.4 & 540 \\ 0 & 2.75 & 0.4 & 540 \\ 0 & 0.25 & 2 & 540 \end{pmatrix}.$$
These row operations also admit an easy description in Sage, simply by adding to the previous cell the code

B.add_multiple_of_row(1,0,-1)  # replace the 2nd row R2 by R2-R1
B.add_multiple_of_row(2,0,-1)  # replace the 3rd row R3 by R3-R1
show(B)

Notice, in Sage we do not need to introduce new matrices when doing row operations successively. Next we multiply both $R_2$ and $R_3$ by 4 and then interchange them, i.e., $R_2 \to 4R_2$, $R_3 \to 4R_3$ and next $R_2 \leftrightarrow R_3$. This provides
$$B \sim \begin{pmatrix} 1 & 0.25 & 0.4 & 540 \\ 0 & 11 & 1.6 & 2160 \\ 0 & 1 & 8 & 2160 \end{pmatrix} \sim \begin{pmatrix} 1 & 0.25 & 0.4 & 540 \\ 0 & 1 & 8 & 2160 \\ 0 & 11 & 1.6 & 2160 \end{pmatrix}.$$
In order to describe this step in Sage, we should continue typing in the previous cell the code given below:

they become zero), then we get some direct information about the solution. Our simple operations are:

Elementary row transformations

• interchanging two rows,
• multiplication of any given row by a non-zero scalar,
• adding another row to any given row.

These operations are called elementary row transformations. It is clear that the corresponding operations at the level of the equations in the system do not change the set of solutions whenever we deal with a field of scalars; they still look reasonable if our ring of coordinates is an integral domain (but multiplying by a divisor of zero might be a problem). Analogously, elementary column transformations of matrices are
• interchanging two columns,
• multiplication of any given column by a non-zero scalar,
• adding another column to any given column.
These do not preserve the solution set, since they change the variables themselves.

Systematically we can use elementary row transformations for subsequent elimination of variables. This gives an algorithm which is usually called the Gaussian elimination method. Henceforth, we shall assume that our scalars come from an integral domain (e.g. integers are allowed, but not, say, $\mathbb{Z}_4$).

Gaussian elimination of variables

Proposition.
Any non-zero matrix over an arbitrary integral domain of scalars $\mathbb{K}$ can be transformed, using finitely many elementary row transformations, into row echelon form:
• For each $j$, if $a_{ik} = 0$ for all columns $k = 1, \ldots, j$, then $a_{kj} = 0$ for all $k \geq i$,
• if $a_{(i-1)j}$ is the first non-zero element of the $(i-1)$-st row, then $a_{ij} = 0$.

Proof. The matrix in row echelon form looks like
$$\begin{pmatrix} 0 & \ldots & 0 & a_{1j} & \ldots & \ldots & \ldots & a_{1m} \\ 0 & \ldots & 0 & 0 & \ldots & a_{2k} & \ldots & a_{2m} \\ \vdots & & & & & & & \vdots \\ 0 & \ldots & \ldots & \ldots & \ldots & 0 & a_{lp} & \ldots \end{pmatrix}.$$
The matrix can (but does not have to) end with some zero rows. In order to transform an arbitrary matrix, we can use a simple algorithm, which will bring us, row by row, to the resulting echelon form:

B.rescale_row(1,4)  # multiply the 2nd row R2 by 4
B.rescale_row(2,4)  # multiply the 3rd row R3 by 4
B.swap_rows(1,2)  # interchange the 2nd and the 3rd rows
show(B)

Observe now that in the matrix obtained in this step, the coefficient of $y$ in the third row is not yet zero. Therefore, it remains to apply one more row operation, namely $R_3 \to R_3 - 11R_2$:
$$B \sim \begin{pmatrix} 1 & 0.25 & 0.4 & 540 \\ 0 & 1 & 8 & 2160 \\ 0 & 0 & -86.4 & -21600 \end{pmatrix}.$$
This matrix is in row echelon form, and in Sage for this last row operation we proceed with the code

B.add_multiple_of_row(2,1,-11)  # replace the 3rd row R3 by R3-11R2
show(B)

Finally, by the so-called backward substitution, we can describe the solution of the initial system. In particular, from the last row we get $z = \frac{-21600}{-86.4} = 250$. Using this, from the second row we obtain $y = 2160 - 8 \cdot 250 = 160$ and finally from the first row we have $x = 540 - 0.4 \cdot 250 - 0.25 \cdot 160 = 400$. □

2.A.9. Based on elementary row transformations solve the linear system posed below. Then use Sage to find an echelon form of the corresponding extended matrix, and verify your solution:
$$x_1 + 2x_2 + 3x_3 = 2, \qquad 2x_1 - 3x_2 - x_3 = -3, \qquad -3x_1 + x_2 + 2x_3 = -3.$$

Solution. Let us express the system in the matrix form $Ax = b$:
$$\begin{pmatrix} 1 & 2 & 3 \\ 2 & -3 & -1 \\ -3 & 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 2 \\ -3 \\ -3 \end{pmatrix}.$$
The extended matrix is given by
$$B := (A \mid b) = \begin{pmatrix} 1 & 2 & 3 & 2 \\ 2 & -3 & -1 & -3 \\ -3 & 1 & 2 & -3 \end{pmatrix}.$$
Let us now perform elementary row operations to transform the extended matrix into row echelon form. We have:
$$(A \mid b) \sim \begin{pmatrix} 1 & 2 & 3 & 2 \\ 0 & -7 & -7 & -7 \\ 0 & 7 & 11 & 3 \end{pmatrix} \sim \begin{pmatrix} 1 & 2 & 3 & 2 \\ 0 & -7 & -7 & -7 \\ 0 & 0 & 4 & -4 \end{pmatrix} \sim \begin{pmatrix} 1 & 2 & 3 & 2 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{pmatrix}.$$
Above, we first applied the operations $R_2 \to R_2 - 2R_1$, $R_3 \to R_3 + 3R_1$ on $B$ to obtain the first mentioned matrix. Using this, we proceeded with $R_3 \to R_3 + R_2$, to get the

Gaussian elimination algorithm

(1) By a possible interchange of rows we can obtain a matrix where the first row has a non-zero element in the first non-zero column. In other words, if that column is column $j$, then $a_{1j} \neq 0$, while $a_{iq} = 0$ for all $i$ and all $q$ with $1 \leq q < j$.
(2) For each $i = 2, \ldots$, multiply the first row by the element $a_{ij}$, multiply the $i$-th row by the element $a_{1j}$ and subtract, to obtain $a_{ij} = 0$ on the $i$-th row.
(3) By repeated application of the steps (1) and (2), always on the not-yet-echelon part of rows and columns of the matrix, we reach, after a finite number of steps, the final form of the matrix.

This algorithm clearly stops after a finite number of steps and it provides the proof of the proposition. □

The given algorithm is really the usual elimination of variables used in the systems of linear equations.
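The algorithm can be turned into code almost word for word. The following Sage routine is our own sketch of the procedure (the name echelon_integral and the test matrix are ours, not part of the text); following step (2), it multiplies rows instead of dividing, so it stays inside an integral domain such as $\mathbb{Z}$:

def echelon_integral(A):
    # bring A to row echelon form using only the transformations above
    M = copy(A)
    row = 0
    for col in range(M.ncols()):
        # step (1): find a row with a non-zero entry in the current column
        pivot = next((i for i in range(row, M.nrows()) if M[i, col] != 0), None)
        if pivot is None:
            continue  # the whole column is zero, move to the next one
        M.swap_rows(row, pivot)
        # step (2): clear the entries below the pivot without any division
        for i in range(row + 1, M.nrows()):
            c = M[i, col]
            if c != 0:
                M.rescale_row(i, M[row, col])      # multiply row i by the pivot
                M.add_multiple_of_row(i, row, -c)  # subtract c times the pivot row
        row += 1
    return M

show(echelon_integral(matrix(ZZ, [[2, -1, 3], [3, 16, 7], [3, -5, 4]])))

On this integer matrix the routine returns the echelon form with rows $(2, -1, 3)$, $(0, 35, 5)$, $(0, 0, 0)$, in agreement with the hand computation in 2.A.10 below.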
In a completely analogous manner we define the column echelon form of matrices, and considering column elementary transformations instead of the row ones, we obtain an algorithm for transforming matrices into the column echelon form.

Remark. Although we could formulate the Gaussian elimination for general scalars from any ring, this does not make much sense in view of solving equations. Clearly, having divisors of zero among the scalars, we might get zeros during the procedure and lose information this way. Think carefully about the differences between the choices $\mathbb{K} = \mathbb{Z}$, $\mathbb{K} = \mathbb{R}$ and possibly $\mathbb{Z}_2$ or $\mathbb{Z}_4$. On the other hand, if we are dealing with fields of scalars, we can always arrive at a row echelon form where the non-zero entries on the "diagonal" are ones. This is done by applying the appropriate scalar multiplication to each individual row. However, this is not possible in general – think for instance of the integers $\mathbb{Z}$.

2.1.8. Matrices of elementary row transformations. Let us now restrict ourselves to fields of scalars $\mathbb{K}$, that is, every non-zero scalar has an inverse. Note that elementary row or column transformations correspond respectively to multiplication from the left or right by the following matrices (only the differences from the unit matrix are indicated):

(1) Interchanging the $i$-th and $j$-th row (column): the unit matrix with the $i$-th and $j$-th diagonal entries replaced by the anti-diagonal block
$$\begin{pmatrix} \ddots & & & & \\ & 0 & \ldots & 1 & \\ & \vdots & \ddots & \vdots & \\ & 1 & \ldots & 0 & \\ & & & & \ddots \end{pmatrix} \begin{matrix} \\ \leftarrow i\text{-th row} \\ \\ \leftarrow j\text{-th row} \\ \end{matrix}$$

second matrix, and the final step consists of the operations $R_2 \to -\frac{1}{7}R_2$ and $R_3 \to \frac{1}{4}R_3$, which give the third matrix. This one is in row echelon form and implies directly $x_3 = -1$, and moreover
$$x_1 + 2x_2 + 3x_3 = 2, \qquad x_2 + x_3 = 1,$$
from where we specify $x_2 = 2$ and $x_1 = 1$.

Sage provides an easy way to compute a row echelon form corresponding to a matrix $B$, via the command B.echelon_form(). For our example we can type

A=matrix([[1,2,3],[2,-3,-1],[-3,1,2]])
b=vector([2, -3, -3])
B=A.augment(b)  # the augmented matrix
show(B); show(B.echelon_form())

This code returns the extended matrix $B$ and a row echelon form associated to $B$:⁴
$$\begin{pmatrix} 1 & 2 & 3 & 2 \\ 0 & 7 & 3 & 11 \\ 0 & 0 & 4 & -4 \end{pmatrix}.$$
Note that Sage's answer differs from ours. However, there is no need to worry, as there are various methods for performing row reduction, and thus the row echelon form of a matrix is not unique! From the last row of this matrix we obtain $4x_3 = -4$, that is, $x_3 = -1$, and so the second row, which reads $7x_2 + 3x_3 = 11$, becomes $7x_2 = 14$, so $x_2 = 2$. From the first row we have $x_1 + 2x_2 + 3x_3 = 2$, and by replacing the values found for $x_2, x_3$ we get $x_1 = 1$, the same solution as above. □

In summary, the most efficient method for solving large systems of linear equations involves applying elementary row transformations to achieve row echelon form. This technique, known as Gaussian elimination, can be optimized through various sophisticated schemes for selecting the best pivots in numerical contexts. Subsequent examples will illustrate that systems of linear equations may have infinitely many solutions, exactly one solution, or none at all. For a theoretical exploration of the classification of solutions using matrix invariants and other methods, see 2.1.13 and 2.3.5.

2.A.10. Using the Gauss elimination method solve the following linear system:
$$2x_1 - x_2 + 3x_3 = 0, \quad 3x_1 + 16x_2 + 7x_3 = 0, \quad 3x_1 - 5x_2 + 4x_3 = 0, \quad -7x_1 + 7x_2 - 10x_3 = 0.$$

⁴ If we want the vertical line in the presentation of the extended matrix, we can type B = A.augment(b, subdivide=True).
(2) Multiplication of the $i$-th row (column) by the scalar $a$: the unit matrix with the $i$-th diagonal entry replaced by $a$,
$$\begin{pmatrix} \ddots & & \\ & a & \\ & & \ddots \end{pmatrix} \leftarrow i\text{-th row}.$$

(3) Adding the $j$-th row (column) to the $i$-th row (column): the unit matrix with one additional unit entry at the position $(i, j)$, i.e., in the $i$-th row and $j$-th column,
$$\begin{pmatrix} \ddots & & & \\ & 1 & \ldots & 1 \\ & & \ddots & \vdots \\ & & & 1 \\ & & & & \ddots \end{pmatrix}.$$

This trivial observation is actually very important, since the product of invertible matrices is invertible (recall 2.1.6(1)) and all elementary transformations over a field of scalars are invertible (the definition of the elementary transformation itself ensures that the inverse transformations are of the same type, and it is easy to determine the corresponding matrix). Thus, the Gaussian elimination algorithm tells us that for an arbitrary matrix $A$ we can obtain its equivalent row echelon form $A' = P \cdot A$ by multiplying from the left with a suitable invertible matrix $P = P_k \cdots P_1$ (that is, by sequential multiplication with $k$ matrices of the elementary row transformations).

If we apply the same elimination procedure to the columns, we can transform any matrix $B$ into its column echelon form $B'$ by multiplying it from the right by a suitable invertible matrix $Q = Q_1 \cdots Q_\ell$. If we start with the matrix $B = A'$ in row echelon form, this procedure eliminates only the still non-zero elements out of the "diagonal" of the matrix, and in the end we can transform the remaining elements to be units. Thus we have verified a very important result which we will use many times in the future:

2.1.9. Theorem. For every matrix $A$ of the type $m/n$ over a field of scalars $\mathbb{K}$, there exist square invertible matrices $P$ and $Q$ of dimensions $m$ and $n$, respectively, such that the matrix $P \cdot A$ is in row echelon form and
$$P \cdot A \cdot Q = \begin{pmatrix} 1 & \ldots & 0 & 0 & \ldots & 0 \\ \vdots & \ddots & & & & \vdots \\ 0 & \ldots & 1 & 0 & \ldots & 0 \\ 0 & \ldots & 0 & 0 & \ldots & 0 \\ \vdots & & & & & \vdots \end{pmatrix}.$$
The number of ones on the diagonal is independent of the particular choice of $P$ and $Q$.

Proof. We have already proved everything but the last sentence. We shall see this last claim below in 2.1.11. □

Solution. This is a so-called homogeneous linear system, since the vector $b$ is the zero vector. In matrix notation it reads as
$$\begin{pmatrix} 2 & -1 & 3 \\ 3 & 16 & 7 \\ 3 & -5 & 4 \\ -7 & 7 & -10 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},$$
that is, $Ax = 0$, where $0$ denotes the zero vector. A homogeneous system of linear equations has either exactly one solution, the zero one, or infinitely many solutions, see 2.3.5 for details. By transforming the matrix into a row echelon form via elementary row operations, we will show that the given system has infinitely many solutions. Initially we obtain
$$A \sim \begin{pmatrix} 2 & -1 & 3 \\ 0 & 35/2 & 5/2 \\ 0 & -7/2 & -1/2 \\ 0 & 7/2 & 1/2 \end{pmatrix}.$$
From there, we see that the second, third and fourth equations are all multiples of the equation $7x_2 + x_3 = 0$. We continue as follows:
$$\begin{pmatrix} 2 & -1 & 3 \\ 0 & 35/2 & 5/2 \\ 0 & -7/2 & -1/2 \\ 0 & 7/2 & 1/2 \end{pmatrix} \sim \begin{pmatrix} 2 & -1 & 3 \\ 0 & 35/2 & 5/2 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \sim \begin{pmatrix} 2 & -1 & 3 \\ 0 & 7 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
We can translate this literally as
$$2x_1 - x_2 + 3x_3 = 0, \qquad 7x_2 + x_3 = 0,$$
where the equations induced by the last two rows are redundant. In particular, $x_3$ is a free variable to which we assign the real parameter $t \in \mathbb{R}$. This gives infinitely many solutions of the form
$$x_2 = -\frac{1}{7}x_3 = -\frac{1}{7}t, \qquad x_1 = \frac{1}{2}(x_2 - 3x_3) = -\frac{11}{7}t.$$
We may verify this result in Sage, as before:

var("x1, x2, x3")
eq1=2*x1-x2+3*x3
eq2=3*x1+16*x2+7*x3; eq3=3*x1-5*x2+4*x3
eq4=-7*x1+7*x2-10*x3
solve([eq1==0,eq2==0,eq3==0,eq4==0], x1, x2, x3)

which answers

x1 == -11/7*r1, x2 == -1/7*r1, x3 == r1

Notice the substitution $t = -7s$ brings our solution to the form $(11s, s, -7s)$, with $s \in \mathbb{R}$, that is,
$$(x_1, x_2, x_3)^T = s \cdot (11, 1, -7)^T, \quad s \in \mathbb{R},$$
where "·" is simply the scalar multiplication. This expression can be obtained very quickly in Sage, by the command A.right_kernel(), where A is the matrix of the linear homogeneous system, as discussed in the next task. □

2.1.10. Algorithm for computing inverse matrices. In the previous paragraphs we almost obtained the complete algorithm for computing the inverse matrix. Using the simple modification below, we either find that the inverse does not exist, or we compute it. Keep in mind that we are still working over a field of scalars.

Equivalent row transformations of a square matrix $A$ of dimension $n$ lead to an invertible matrix $P'$ such that $P' \cdot A$ is in row echelon form. If $A$ has an inverse, then there exists also the inverse of $P' \cdot A$. But if the last row of $P' \cdot A$ is zero, then the last row of $P' \cdot A \cdot B$ is also zero for any matrix $B$ of dimension $n$. Thus, the existence of a zero row in the result of (row) Gaussian elimination excludes the existence of $A^{-1}$.

Assume now that $A^{-1}$ exists. As we have just seen, the row echelon form of $A$ will have exclusively non-zero rows. In particular, all diagonal elements of $P' \cdot A$ are non-zero. But now, we can employ row elimination by the elementary row transformations from the bottom-right corner backwards and also transform the diagonal elements to units. In this way, we obtain the unit matrix $E$. Summarizing, we find another invertible matrix $P''$ such that for $P = P'' \cdot P'$ we have $P \cdot A = E$.

Now observe that we could clearly work with columns instead of row transformations and thus, under the assumption of the existence of $A^{-1}$, we would find a matrix $Q$ such that $A \cdot Q = E$. From this we see immediately that
$$P = P \cdot E = P \cdot (A \cdot Q) = (P \cdot A) \cdot Q = E \cdot Q = Q.$$
That is, we have found the inverse matrix $A^{-1} = P = Q$ for the matrix $A$. Notice that at the point of finding the matrix $P$ with the property $P \cdot A = E$, we do not have to do any further computation, since we have already obtained the inverse matrix.

In practice, we can work as follows:

Computing the inverse matrix

Write the unit matrix $E$ to the right of the matrix $A$, producing an augmented matrix $(A, E)$. Transform the augmented matrix using the elementary row transformations to row echelon form. This produces an augmented matrix $(PA, PE)$, where $P$ is invertible and $PA$ is in row echelon form. By the above, either we may achieve $PA = E$, in which case $A$ is invertible and $P = PE = A^{-1}$, or $PA$ has a row of zeros, in which case we conclude that the inverse matrix for $A$ does not exist.

2.1.11. Linear dependence and rank. In the previous practical algorithms dealing with matrices, we worked all the time with row and column additions and scalar multiplications, seeing the rows and columns as vectors. Such operations are called linear combinations. We shall return to such operations in an abstract sense later on in 2.3.1.
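The procedure in the box is immediate to replay in Sage. The following cell is our own minimal sketch over the rationals (the example matrix is ours): we augment $A$ by $E$, row reduce, and read off the inverse in the right-hand block, comparing with Sage's built-in inverse.

A = matrix(QQ, [[1, 2], [3, 5]])
E = identity_matrix(QQ, 2)
M = A.augment(E).rref()  # row reduce the augmented matrix (A, E)
if M[:, :2] == E:        # the left block became E, so A is invertible
    print(M[:, 2:])                 # the right block is the inverse of A
    print(M[:, 2:] == A.inverse())  # sanity check against Sage: True
else:
    print("A is not invertible")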
Viewing matrices as linear mappings x → Ax, the kernel of a (real) m × n matrix A, also known as the null space of A, is the set of vectors x ∈ Rn satisfying Ax = 0, where 0 denotes the zero vector of degree m. We saw in the previous task that the solution space, i.e., the kernel of A, was spanned by one vector (11, 1, −7)T . To compute the null space of A in Sage we can just type A.right_kernel( ). However, be aware that Sage also provides the command A.kernel( ), which refers to the left kernel of A and it is an alias of the command A.left_kernel(). In our text we will use only the left/right kernel commands to avoid confusions. Now, to obtain a verification of the previous expression you can type A=matrix([[2, -1, 3], [3, 16, 7], [3, -5, 4], [-7, 7, -10]]) sol=A.right_kernel() show(sol) The output has the form RowSpan(11 1 − 7), which essentially means that the corresponding solution is spanned by the vector (11, 1, −7)T , as we described above. Alternatively, you can simply type A.right_kernel(), and then Sage returns a more complex information: Free module of degree 3 and rank 1 over Integer Ring Echelon basis matrix: [11 1 -7] In the first line of this output, the “degree 3” part refers to the “dimension” of the vectors in the kernel, that is 3 × 1 vectors. The “rank 1” part indicates the “dimension” of the right kernel of A, i.e., of the null space of A. Notice, Sage automatically considered A over the ring Z, and therefore it talks about free modules instead of vector spaces. The answer would be different if we enforced the field Q (by adding QQ in the definition of A).5 We will learn about vector spaces, subspaces and the notion of dimension of a vector space in Section C, where we will meet further examples on null spaces. □ 2.A.12. Based on Gauss elimination method, solve the linear system below. Next present the solution in Sage.    3x1 + 3x3 − 5x4 = −8 , x1 − x2 + x3 − x4 = −2 , −2x1 − x2 + 4x3 − 2x4 = 0 , 2x1 + x2 − x3 − x4 = −3 . 5Beware that declaring A to be over the field R, we obtain a completely different result since the numerical errors in the 53 bit arithmetic prevent us to have the echelon form with two zero rows. Thus Sage would answer that there is only the zero solution there! CHAPTER 2. ELEMENTARY LINEAR ALGEBRA But it will be useful to understand their core meaning right now. A linear combination of rows of a matrix A = (aij) of type m/n is understood as an expression of the form c1ui1 + · · · + ckuik , where ci are scalars, uj = (aj1, . . . , ajn) are rows of the matrix A. Similarly, we can consider linear combinations of columns by replacing the above rows uj by the columns uj = (a1j, . . . , amj). If the zero row can be written as a linear combination of some given rows with at least one non-zero scalar coefficient, we say that these rows are linearly dependent. In the alternative case, that is, when the only possibility of obtaining the zero row is to select all the scalars cj equal to zero, the rows are called linearly independent. Analogously, we define linearly dependent and linearly independent columns. Lemma. The number of non-zero “steps” in the row echelon form always equals to the number of linearly independent rows of the matrix. In particular, the elementary row transformations do not change the maximal number of linearly independent rows. Similarly for columns and the column echelon form, and the maximal number of linearly independant rows and columns coincides (and equals to the number of ones in Theorem 2.1.9) Proof. 
Indeed, the claims all follow once we prove that the number $h$ of ones in the matrix $E_h$ from Theorem 2.1.9 does not depend on our choice of the row or column transformations in our procedure. Thus, assume that by two different transformation procedures into the echelon form we obtain two different numbers $h' < h$. But then according to our algorithm there are invertible matrices $P'$, $P''$, $Q'$, and $Q''$ such that
$$E_h = P' \cdot A \cdot Q', \qquad E_{h'} = P'' \cdot A \cdot Q''.$$
In particular, $E_h = P' \cdot P''^{-1} \cdot E_{h'} \cdot Q''^{-1} \cdot Q'$, and so there are invertible matrices $P$ and $Q$ such that $E_{h'} \cdot Q = P \cdot E_h$. Now, on the left-hand side, there are the first $h'$ rows of $Q$, complemented by zero rows. At the same time, on the right-hand side, there are the first $h$ columns of $P$, complemented by zero columns. Thus, the equality implies that taking the first $h$ columns of $P$, only the first $h'$ of the rows there are non-zero. But this shows that our Gaussian elimination algorithm applied to $P$ would show that $P$ is not invertible. Therefore $h' = h$, and we have proved that the number of ones in the matrix $P \cdot A \cdot Q$ in Theorem 2.1.9 is independent of the choice of our elimination procedure, and it is always equal to the number of linearly independent rows in $A$, which must be the same as the number of linearly independent columns in $A$. □

Definition. The maximal number $h(A)$ of linearly independent rows (or columns) of $A$ is called the rank of the matrix. We have the following theorem:

Solution. The corresponding extended matrix has the form
$$B := (A \mid b) = \begin{pmatrix} 3 & 0 & 3 & -5 & -8 \\ 1 & -1 & 1 & -1 & -2 \\ -2 & -1 & 4 & -2 & 0 \\ 2 & 1 & -1 & -1 & -3 \end{pmatrix}.$$
By changing the order of rows (equations) we obtain
$$B \sim \begin{pmatrix} 1 & -1 & 1 & -1 & -2 \\ 2 & 1 & -1 & -1 & -3 \\ -2 & -1 & 4 & -2 & 0 \\ 3 & 0 & 3 & -5 & -8 \end{pmatrix}$$
which we transform into a row echelon form:
$$\begin{pmatrix} 1 & -1 & 1 & -1 & -2 \\ 2 & 1 & -1 & -1 & -3 \\ -2 & -1 & 4 & -2 & 0 \\ 3 & 0 & 3 & -5 & -8 \end{pmatrix} \sim \begin{pmatrix} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & -3 & 6 & -4 & -4 \\ 0 & 3 & 0 & -2 & -2 \end{pmatrix} \sim \begin{pmatrix} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & 0 & 3 & -3 & -3 \\ 0 & 0 & 3 & -3 & -3 \end{pmatrix} \sim \begin{pmatrix} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & 0 & 3 & -3 & -3 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
Notice that via the command B.echelon_form(), Sage computes a different echelon form. In any case, we obtain three equations in four variables, and thus the initial linear system has infinitely many solutions (as in the previous example, although this is not a homogeneous system). These three equations have exactly one solution for each choice of the free variable $x_4 \in \mathbb{R}$. So we may set $x_4 = t \in \mathbb{R}$, and translate the final matrix of the Gauss elimination as
$$x_1 - x_2 + x_3 - t = -2, \qquad 3x_2 - 3x_3 + t = 1, \qquad 3x_3 - 3t = -3.$$
The solution now can be easily described by back substitution:
$$x_3 = t - 1, \qquad x_2 = \frac{1}{3}(2t - 2), \qquad x_1 = \frac{1}{3}(2t - 5).$$
Of course, the same solution occurs if we use the echelon form that Sage provides. Now, to have a Sage verification of the given solution, it is sufficient to use the cell

Eq1=x1-x2+x3-x4+2
Eq2=3*x2-3*x3+x4-1
Eq3=3*x3-3*x4+3
show(solve([Eq1==0, Eq2==0, Eq3==0], x1, x2, x3))

Notice for $t = 3s$ our solution $(x_1, x_2, x_3, x_4)$ takes the form
$$\left(2s - \frac{5}{3},\ 2s - \frac{2}{3},\ 3s - 1,\ 3s\right), \quad s \in \mathbb{R}. \ □$$

2.A.13. Sage and general systems. Analyze the solutions of the system of linear equations from the previous task in view of the vector operations and Sage tools.

Solution. We could approach the linear system of 2.A.12 by using the backslash operator in Sage:

Theorem. Let $A$ be a matrix of type $m/n$ over a field of scalars $\mathbb{K}$. The matrix $A$ has the same maximal number $h(A)$ of linearly independent rows and of linearly independent columns. In particular, the rank is always at most the minimum of the dimensions of the matrix $A$.
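In Sage the rank is available directly via the rank() method; here is a quick illustration of the theorem on a toy example of our own:

A = matrix(QQ, [[1, 2, 3], [2, 4, 6], [1, 0, 1]])
print(A.rank())              # 2: the second row depends on the first
print(A.transpose().rank())  # also 2, the row and column ranks coincide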
The algorithm for computing the inverse matrix also says that a square matrix A of dimension m has an inverse if and only if its rank equals m.

2.1.12. Matrices as mappings. Similarly to the way we worked with matrices in the geometry of the plane (see 1.5.7), we can interpret every matrix A of the type m/n as a mapping

A : Kn → Km , x → A · x.

By the distributivity of matrix multiplication, it is clear how linear combinations of vectors are mapped by such mappings: A · (a x + b y) = a (A · x) + b (A · y). Straight from the definition we see, by the associativity of multiplication, that the composition of mappings corresponds to matrix multiplication in the given order. Thus invertible matrices of dimension n correspond to bijective mappings A : Kn → Kn.

Remark. From this point of view, the theorem 2.1.9 is very interesting. We can see it as follows: the rank of the matrix determines how large the image of the whole Kn under this mapping is. In fact, if A = P·Ek·Q, where the matrix Ek has k ones as in 2.1.9, then the invertible Q first bijectively “shuffles” the n-dimensional vectors in Kn, the matrix Ek then “copies” the first k coordinates and completes them with the remaining m − k zeros. This “k-dimensional” image then cannot be enlarged by multiplying with P. Multiplying by P can only bijectively reshuffle the coordinates in the image.

2.1.13. Back to linear equations. We shall return to the notions of dimension, linear independence and so on in the third part of this chapter. But we should notice now what our results say about the solutions of the systems of linear equations. If we consider the matrix of the system of equations and add to it the column of the required results, we speak about the extended matrix of the system. The above Gaussian elimination approach corresponds to the sequential elimination of variables in the equations and the deletion of the linearly dependent equations (these are simply consequences of the other equations). Thus we have derived complete information about the size of the set of solutions of the system of linear equations, based on the rank of the matrix of the system. If we are left with more non-zero rows in the row echelon form of the extended matrix than in the original matrix of the system, then there cannot be a solution (simply, we cannot obtain the given vector value with the corresponding linear mapping). If the

A = matrix([[3, 0, 3, -5], [1, -1, 1, -1], [-2, -1, 4, -2], [2, 1, -1, -1]])
b = vector([-8, -2, 0, -3])
x = A.solve_right(b); x

The output is the solution given above with t = 0 only:

(-5/3, -2/3, -1, 0)

As we have seen before, this provides just one solution of the system. But the matrix form Ax = b of the equation reveals that having two such solutions x and y ensures that A(x − y) = Ax − Ay = b − b = 0 (notice we can compute with vectors and matrices just as if they were scalars). On the other hand, if y is a solution of the homogeneous system, Ay = 0, then A(x + y) = b + 0 = b. Hence we can describe all the solutions by picking one solution of the original system and combining the backslash operator with the right_kernel() method, as follows:

A = matrix([[3, 0, 3, -5], [1, -1, 1, -1], [-2, -1, 4, -2], [2, 1, -1, -1]])
b = vector([-8, -2, 0, -3])
x = A.solve_right(b)
k = A.right_kernel().matrix()  # basis of solutions in rows
nrows = k.nrows()              # counts the kernel dimension
if nrows > 0:  # non-unique solution
    t = vector(var("t", n=nrows))  # row of parameters t0,...
    show(x + t*k)
else:  # unique solution
    show(x)

Executing this block, Sage’s output is as at the end of the previous task. Actually, we add the vector subspace of the solutions of the homogeneous system to one particular solution of the original system. □

If we work over a field of scalars, e.g., Q or R, the extended matrices B = ( A | b ) of the linear systems can be transformed further, aiming at having all the pivots equal to one, and all the other entries in the columns of the pivots equal to zero. Such a matrix is in a special row echelon form, known as the reduced row echelon form (often referred to as RREF). It is not difficult to prove that this reduced row echelon form of any matrix is unique, once the pivot columns are fixed.

2.A.14. Come back to the task in 2.A.12 and solve it using the RREF method. Then check with Sage.

Solution. A few elementary row transformations of the augmented matrix lead to

( 1 −1 1 −1 | −2; 0 3 −3 1 | 1; 0 0 3 −3 | −3; 0 0 0 0 | 0 )
∼ ( 1 −1 1 −1 | −2; 0 1 −1 1/3 | 1/3; 0 0 1 −1 | −1; 0 0 0 0 | 0 )

rank of both matrices is the same, then the backwards elimination provides exactly as many free parameters as the difference between the number of variables n and the rank h(A). In particular, there will be exactly one solution if and only if the matrix is invertible. All this will be stated explicitly in terms of abstract vector spaces in the important Kronecker-Capelli theorem, see 2.3.5.

2. Determinants

In the fifth part of the first chapter, we introduced the scalar function det on square matrices of dimension 2 over the real numbers, called the determinant, see 1.5.5. We saw that the determinant assigned a non-zero number to a matrix if and only if the matrix was invertible. We did not say it in exactly this way, but you can check for yourself in the previous paragraphs, starting with 1.5.4 and formula 1.5.5(1). We saw also that determinants were useful in another way, see the paragraphs 1.5.10 and 1.5.11. There we showed that the area of the parallelogram should depend linearly on each of the two vectors defining it. It was useful to require the change of the sign when changing the order of these vectors. Because determinants (and only determinants) have these properties, up to a constant scalar multiple, we concluded that the determinant computes the area. Now we will see that we can proceed similarly for every finite dimension. We work again with arbitrary scalars K and matrices over these scalars. Our results about determinants will thus hold for all commutative rings, notably also for integer matrices or matrices over any residue classes.

2.2.1. Definition of the determinant. Each bijective mapping from a set X to itself is called a permutation of the set X, cf. 1.3.1. If X = {1, 2, . . . , n}, a permutation can be written by putting the resulting ordering into a table:

( 1    2    . . . n    )
( σ(1) σ(2) . . . σ(n) ) .

In this way, we shall view a permutation as a bijection or as an ordering. The element x ∈ X is called a fixed point of the permutation σ if σ(x) = x. If there exist exactly two distinct elements x, y ∈ X such that σ(x) = y, while all other elements z ∈ X are fixed points, then the permutation σ is called a transposition, and we denote it by (x, y). Of course, then σ(y) = x holds for such a transposition.
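For readers who want to experiment right away (a small illustration of ours, ahead of the systematic treatment of permutations in Sage in 2.B.1 below), a permutation is entered in Sage by listing the images of 1, . . . , n, so a transposition looks as follows:

sigma = Permutation([3, 2, 1])  # the transposition (1, 3) on {1, 2, 3}
print(sigma(1), sigma(3))       # 3 1, i.e. sigma swaps 1 and 3
print(sigma.fixed_points())     # [2], all the other elements stay fixed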
For dimension 2, the formula for a determinant was simple – take all possible products of two elements, one from every column and every row of the matrix, give them a sign such that interchanging two columns leads to the change of the sign of the whole result, and sum all of them (that is, both):

A = ( a b; c d ) , det A = ad − bc.

∼ ( 1 −1 0 0 | −1; 0 1 0 −2/3 | −2/3; 0 0 1 −1 | −1; 0 0 0 0 | 0 )
∼ ( 1 0 0 −2/3 | −5/3; 0 1 0 −2/3 | −2/3; 0 0 1 −1 | −1; 0 0 0 0 | 0 ) .

The final matrix is in reduced row echelon form, which evidently directly provides the solution

( x1, x2, x3, x4 )T = ( −5/3, −2/3, −1, 0 )T + t · ( 2/3, 2/3, 1, 1 )T , t ∈ R ,

as above. Notice again that this solution is the sum of a particular solution of our initial system and the general solution of the corresponding homogeneous system Ax = 0, cf. the Kronecker-Capelli Theorem in 2.3.5. We also mention that in the Gauss elimination method, the free variables are the variables whose column in the reduced row echelon form does not contain any pivot (e.g., x4 is the free variable of this specific example, and indeed, the fourth column in the RREF is the unique column which does not contain any pivot).

The reduced echelon form of a given matrix B can be quickly computed in Sage by the command B.rref(). Thus, the following cell verifies the (necessarily unique) expression presented above.

A = matrix([[3, 0, 3, -5], [1, -1, 1, -1], [-2, -1, 4, -2], [2, 1, -1, -1]])
b = vector([-8, -2, 0, -3])
B = A.augment(b)
show(B.rref())

□

2.A.15. Remarks on Sage methods. Although we noticed the importance of B.echelon_form() in 2.A.10 and 2.A.12 (mimicking our handling of matrices by hand), the Sage method B.rref() seems to be much more efficient. We should also mention that Sage provides methods for directly identifying the pivots of the augmented matrix. The relevant command is B.pivots() (or we can use the command B.pivot_rows()). For example, adding B.pivots() in the previous cell, Sage answers (0, 1, 2), which means that the pivots are in the first, second and third columns. Recall now that a linear system of equations is inconsistent (does not have a solution) if there is a pivot in the last column of an echelon form of the augmented matrix. Therefore, we can combine this with the previous commands in Sage to decide quickly whether a linear system of equations is consistent or not.

2.A.16. Use Sage to find the reduced echelon form and the pivot entries of the following matrices:

B± = ( 1 1 −2; 3 2 0; 2 ±1 2 ) , C = ( 2 1 7 0; −1 4 10 5; 3 2 12 10; 0 0 0 10 ) .

Next, decide on the consistency of (a) the linear systems having as extended matrix one of B±; (b) the homogeneous linear system Cx = 0. ⃝

Consider now square matrices A = (aij) of dimension n over K. The formula for the determinant of the matrix A is also composed of all possible products of elements from the individual rows and columns, with properly chosen signs. In dimension 3 we can guess the correct signs easily. The product of the elements on the diagonal should come with a positive sign, and we want anti-symmetry when interchanging two columns or rows.
This gives the so-called Sarrus rule:

Sarrus rule

| a11 a12 a13 |
| a21 a22 a23 | = a11a22a33 + a13a21a32 + a12a23a31 − a13a22a31 − a11a23a32 − a12a21a33
| a31 a32 a33 |

The general definition can be formulated via a sum over all permutations:

Definition of determinant

The determinant of the matrix A is a scalar det A = |A| defined by the relation

|A| = Σ_{σ∈Σn} sgn(σ) a1σ(1) · a2σ(2) · · · anσ(n) ,

where Σn is the set of all possible permutations over {1, . . . , n}, and the symbol sgn(σ) for a permutation σ, called the parity of σ, will be described below. Each of the expressions sgn(σ) a1σ(1) · a2σ(2) · · · anσ(n) is called a term in the determinant |A|.

2.2.2. Parity of permutation. How should we define the sign of a permutation? We say that a pair of elements a, b ∈ X = {1, . . . , n} forms an inversion in the permutation σ if a < b and σ(a) > σ(b). A permutation σ is called even or odd if it contains an even or odd number of inversions, respectively. Thus, the parity sgn(σ) of the permutation σ is (−1)^(number of inversions), and we denote it by sgn(σ). This amounts to our definition of sign for computing the determinant. But we would like to know how to calculate the parity. The following theorem reveals that the Sarrus rule really defines the determinant in dimension 3.

Theorem. Over the set X = {1, 2, . . . , n} there are exactly n! distinct permutations. These can be ordered in a sequence such that every two consecutive permutations differ in exactly one transposition. Every transposition changes parity. For any chosen permutation σ there is such a sequence starting with σ.

Proof. For n = 1 or n = 2, the claim is trivial. We prove the theorem by induction on the size n of the set X. Assume that the claim holds for all sets with n − 1 elements and consider a permutation σ(1) = a1, . . . , σ(n) = an.

Our next focus is on computing inverses A−1 of matrices A ∈ Matn(K) for fields K. Viewing X = A−1 as an unknown matrix, we aim at AX = E (the identity matrix). This provides n systems of linear equations with the same coefficient matrix A (we view the matrix equation column-wise). We already touched this in the case n = 2. We realize that the general case is equivalent to finding the RREF of the augmented matrix (A | E).

2.A.17. Solve the following matrix equations:

( 1 3; 3 8 ) · X1 = ( 1 2; 3 4 ) , X2 · ( 1 3; 3 8 ) = ( 1 2; 3 4 ) . ⃝

2.A.18. Compute the inverse of the matrices

A = ( 4 3 2; 5 6 3; 3 5 2 ) , B = ( 1 0 1; 3 3 4; 2 2 3 ) .

Then determine the matrix ( AT · B )−1.

Solution. As just explained, an n×n matrix A is invertible if and only if A is row-equivalent to the n×n identity matrix E. Hence, to find the inverse we first form the augmented matrix ( A | E ), which in our case reads as

( A | E ) = ( 4 3 2 | 1 0 0; 5 6 3 | 0 1 0; 3 5 2 | 0 0 1 ) .

The goal now is to transform the submatrix A of the augmented matrix into the identity matrix, using elementary row operations on ( A | E ). This process illustrates how Gaussian elimination is applied to compute the inverse of a matrix. Note that this transformation may not always be possible, depending on the rank of A. (Later we shall see that having the maximal rank is equivalent to having a non-zero determinant.) However, whenever we can achieve the unit matrix in the place of A, the original unit submatrix gets transformed into the desired A−1. Notice, this is exactly solving all the given n systems simultaneously, using the RREF method with the shared coefficient matrix A. In this case the final (augmented) matrix will be ( E | A−1 ).
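The whole procedure can also be carried out directly in Sage (a sketch of ours, using only commands met earlier): we augment A with the identity matrix, pass to the reduced row echelon form, and read off the right block. For the matrix A of 2.A.18 this looks as follows:

A = matrix(QQ, [[4, 3, 2], [5, 6, 3], [3, 5, 2]])
AE = A.augment(identity_matrix(QQ, 3))  # the augmented matrix (A | E)
R = AE.rref()                           # reduced row echelon form (E | A^-1)
Ainv = R[:, 3:]                         # the right 3x3 block
print(Ainv == A.inverse())              # True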
According to the induction assumption, all the permutations that end with an can be obtained in a sequence, where every two consecutive permutations differ in one transposition. There are (n − 1)! such permutations. In order to proceed further, we select the last of them, and use the transposition of σ(n) = an with some element ai which has not been at the last position yet. Once again, we form a sequence of all permutations that end with ai. After doing this procedure n times, we obtain n(n−1)! = n! distinct permutations – that is, all permutations on n elements. The resulting sequence satisfies the condition. Note that the last sentence of the theorem does not seem to be useful in practice. But it is a very important part for proving the theorem by induction over the size of X.

It remains to prove the part of the theorem about parities. Consider the ordering (a1, . . . , ai, ai+1, . . . , an), containing r inversions. Then in the ordering (a1, . . . , ai+1, ai, . . . , an) there are either r − 1 or r + 1 inversions. Every transposition (ai, aj) is obtainable by doing (j−i)+(j−i−1) = 2(j−i)−1 transpositions of neighbouring elements. Therefore any transposition changes the parity. Also, we already know that all permutations can be obtained by applying transpositions. □

We found that applying a transposition changes the parity of a permutation, and any ordering of the numbers {1, 2, . . . , n} can be obtained through transpositions of neighbouring elements. Therefore we have proven:

Corollary. On every finite set X = {1, . . . , n} with n elements, n > 1, there are exactly n!/2 even permutations and n!/2 odd permutations.

If we compose two permutations, it means first doing all the transpositions forming the first permutation and then all the transpositions forming the second one. Therefore for any two permutations σ, η : X → X we have sgn(σ ◦ η) = sgn(σ) · sgn(η) and also sgn(σ−1) = sgn(σ).

2.2.3. Decomposing permutations into cycles. A good tool for practical work with permutations is the cycle decomposition, which is also a good exercise on the concept of equivalence.

We compute

( A | E ) = ( 4 3 2 | 1 0 0; 5 6 3 | 0 1 0; 3 5 2 | 0 0 1 )
  R1→R1−R3 ∼ ( 1 −2 0 | 1 0 −1; 5 6 3 | 0 1 0; 3 5 2 | 0 0 1 )
  R2→R2−5R1, R3→R3−3R1 ∼ ( 1 −2 0 | 1 0 −1; 0 16 3 | −5 1 5; 0 11 2 | −3 0 4 )
  R2→R2−R3 ∼ ( 1 −2 0 | 1 0 −1; 0 5 1 | −2 1 1; 0 11 2 | −3 0 4 )
  R3→R3−2R2 ∼ ( 1 −2 0 | 1 0 −1; 0 5 1 | −2 1 1; 0 1 0 | 1 −2 2 )
  R1→R1+2R3, R2→R2−5R3 ∼ ( 1 0 0 | 3 −4 3; 0 0 1 | −7 11 −9; 0 1 0 | 1 −2 2 )
  R2↔R3 ∼ ( 1 0 0 | 3 −4 3; 0 1 0 | 1 −2 2; 0 0 1 | −7 11 −9 ) ,

where the row operations performed at each step are indicated for clarity. This means that

A−1 = ( 3 −4 3; 1 −2 2; −7 11 −9 ) .

As a verification it is easy to see that AA−1 = A−1A = E, or we can directly use the command A.inverse() in Sage:

A = matrix([[4, 3, 2], [5, 6, 3], [3, 5, 2]])
show(A); show(A.inverse())

Observe now that when calculating A−1 we did not have to cope with fractions, thanks to the suitably chosen row transformations (and the fact that A−1 is an integer matrix, too!). We leave the similar steps leading to B−1 to the reader (again, the result happens to be an integer matrix):

B−1 = ( 1 2 −3; −1 1 −1; 0 −2 3 ) .

Based now on the identity ( AT · B )−1 = B−1 · ( AT )−1 = B−1 · ( A−1 )T we finally obtain

( AT · B )−1 = ( −14 −9 42; −10 −5 27; 17 10 −49 ) .

To describe this result in Sage one can continue typing in the previous cell, by adding the code

B = matrix([[1, 0, 1], [3, 3, 4], [2, 2, 3]])
show(B.inverse())
show((transpose(A)*B).inverse())
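As a quick extra check (our own addition to the solution above), the identity used in the last step can also be verified directly in Sage:

A = matrix(QQ, [[4, 3, 2], [5, 6, 3], [3, 5, 2]])
B = matrix(QQ, [[1, 0, 1], [3, 3, 4], [2, 2, 3]])
lhs = (A.transpose()*B).inverse()
rhs = B.inverse()*A.inverse().transpose()
print(lhs == rhs)  # True, confirming (A^T B)^-1 = B^-1 (A^-1)^T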
Cycles

A permutation σ over the set X = {1, . . . , n} is called a cycle of length k if we can find elements a1, . . . , ak ∈ X, 2 ≤ k ≤ n, such that σ(ai) = ai+1 for i = 1, . . . , k − 1, while σ(ak) = a1, and the other elements in X are fixed points of σ. Cycles of length two are transpositions.

Proposition. Every permutation is a composition of cycles. Cycles of even length have parity −1, cycles of odd length have parity 1.

Proof. Fix a permutation σ and define a relation R such that two elements x, y ∈ X are R-related if and only if σℓ(x) = y for some iteration ℓ ∈ Z of the permutation σ (notice σ−1 means the inverse bijection to σ). Clearly, it is an equivalence relation (check it carefully!). Because X is a finite set, for some ℓ it must be that σℓ(x) = x. If we pick one equivalence class {x, σ(x), . . . , σℓ−1(x)} ⊂ X and declare the other elements to be fixed points, we obtain a cycle. Evidently, the original permutation σ is then the composition of all these cycles for the individual equivalence classes, and it does not matter in which order we compose the cycles. For determining the parity we just have to note that cycles of even length can be written as a composition of an odd number of transpositions, therefore their parity is −1. Analogously, a cycle of odd length can be obtained using an even number of transpositions, and therefore it has parity 1. □

2.2.4. Expansion of determinant. Our understanding of the permutations allows us to find the expansion method of computing determinants. The simple idea is to collect, in the determinant sum, the terms containing an element in a fixed row, and to add these contributions along the row. Consider a matrix A = (aij) and let us look at all terms in |A| containing the element a11. By the very definition, these terms correspond to all permutations σ with σ(1) = 1. Thus, the contribution of all these terms to |A| is a11A11, where A11 is the determinant of the matrix obtained from A by omitting the first row and the first column. Similarly, we can take any other fixed element aij in A and look for the contribution of all terms containing it. Again, we could write Aij for the determinant of the matrix obtained from A by omitting the i-th row and the j-th column, and the latter contribution must have terms like in aijAij, but we have to be very careful about the signs. While the actual terms of |A| are of the form sgn(σ) aij a1σ(1) · · · anσ(n) with the i-th factor omitted and σ(i) = j, the signatures of the permutations in Aij, with i and j omitted, might be different. In order to compare it to the previous case i = 1, j = 1, we can change the initial ordering of the elements in the domain and target of the permutations σ. Clearly, i − 1 changes on the domain and j − 1 changes on the target do the job

which also verifies the given expression for B−1. □

2.A.19. Compute the inverse of the following matrices and next provide a verification by Sage:

A = ( 1 0 −2; 2 −2 1; 5 −5 2 ) , B = ( 8 3 0 0 0; 5 2 0 0 0; 0 0 −1 0 0; 0 0 0 1 2; 0 0 0 3 5 ) ,
C = ( 1 1 1 1; 1 1 −1 1; 1 −1 1 −1; 1 −1 −1 1 ) , D = ( 1 i; −i 3 ) ,

where as usual i = √−1. ⃝

We have seen that a system of linear equations Ax = b may have: (1) no solution; (2) a unique solution; (3) infinitely many solutions, depending on one or more free parameters. In the homogeneous case, where b = 0, there is always (at least) one solution.
The collection of all solutions forms a vector space (computed as the kernel of the mapping x → Ax; we shall come back to these concepts in a more abstract way later). If the right-hand side vector b has at least one non-zero entry, we say that the system is non-homogeneous. Then, the space of all solutions is closely related to the ranks of the matrices A and (A | b). The rank of a matrix is defined as the maximal number of its linearly independent rows (which is the same as the maximal number of linearly independent columns in A, cf. 2.1.11). This corresponds to the fact that solutions x of the system Ax = b are the coefficients in the expression of b as a linear combination of the columns of A. We have seen that the case (1) happens if the rank of the augmented matrix is bigger than that of A (which of course cannot happen if b = 0). The number of free parameters in the cases (2) and (3) is the difference between the number of variables and the rank of A. Later we will see that the rank is closely related to the nullity of the matrix, see for example the description in ??.

2.A.20. Determine the rank of the matrix

A = ( 1 −3 0 1; 1 −2 2 −4; 1 −1 0 1; −2 −1 1 −2 ) .

Then, determine the number of solutions of the system

x1 + x2 + x3 − 2x4 = 4 ,
−3x1 − 2x2 − x3 − x4 = 5 ,
2x2 + x4 = 1 ,
x1 − 4x2 + x3 − 2x4 = 3 .

(by “bubbling” the index in question to the first position by consecutive swaps of neighboring positions). Thus, the sign correction is (−1)^(i+j−2), and we have to adjust the value of Aij as in the following algorithm, which is the simplest version of the more general Laplace expansion formula, see 2.2.9 below. The readers not sure about the details of our argumentation here may wait for the detailed proof in the more general situation.

Expansion of determinant

The algebraic complement Aij of the element aij in a matrix A is the (−1)^(i+j)-multiple of the determinant of the matrix obtained from A by omitting the i-th row and the j-th column. Fixing the i-th row or the j-th column,

|A| = Σ_{j=1..n} aij Aij , |A| = Σ_{i=1..n} aij Aij .

The latter formulae correspond to splitting the determinant sum into parts containing the terms with the individual elements in the row or column. For example, an easy application derives the Sarrus rule from the formula in dimension 2 now.

2.2.5. Simple properties. Knowing the properties of permutations and their parities from the previous paragraphs allows us to derive quickly the basic properties of determinants. For every matrix A = (aij) of the type m/n over scalars from K we define the transpose of A as the matrix AT = (a′ij) with elements a′ij = aji. The matrix AT is of the type n/m. A square matrix A with the property A = AT is called symmetric. If A = −AT, then A is called antisymmetric.

Simple properties of determinants

Theorem. Every square matrix A = (aij) satisfies the following conditions:
(1) |AT| = |A|.
(2) If one of the rows contains only zero elements from K, then |A| = 0.
(3) If a matrix B was obtained from A by transposing two rows, then |A| = −|B|.
(4) If a matrix B was obtained from A by multiplying one row by a scalar a ∈ K, then |B| = a |A|.
(5) If all elements of the k-th row in A are of the form akj = ckj + bkj and all the remaining rows in the matrices A, B = (bij), C = (cij) are identical, then |A| = |B| + |C|.
(6) A determinant |A| does not change if we add to any row of A a linear combination of the other rows.
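Before reading the proof, the claims are easy to test on examples. Here is a minimal Sage sketch of ours, checking (1), (3) and (6) on a sample matrix (a numerical test on one instance, of course, not a proof):

A = matrix(QQ, [[1, 2, 3], [1, -1, 2], [3, 2, 2]])
print(A.det() == A.transpose().det())      # (1) |A^T| = |A|
B = A.with_swapped_rows(0, 1)
print(B.det() == -A.det())                 # (3) a row swap flips the sign
C = A.with_added_multiple_of_row(0, 1, 5)  # add 5 times row 2 to row 1
print(C.det() == A.det())                  # (6) the determinant is unchanged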
Moreover, find all solutions of the corresponding homogeneous system, and in addition of the system

x1 − 3x2 = 1 ,
x1 − 2x2 + 2x3 = −4 ,
x1 − x2 = 1 ,
−2x1 − x2 + x3 = −2 . ⃝

B. Determinants

A key object in matrix calculus is the scalar function det, the so-called determinant, which we already came across when discussing the dimension two calculus (see for example 1.E.25, 1.E.29). Now we shall handle the determinant concept for general square matrices of size n. In dimension two, the formula was det A = det(aij) = a11a22 − a12a21, i.e., each term picked up just one entry from each row and column, and there are just two options there. Again, we want to see a function linear in all the individual columns and changing the sign whenever two columns are exchanged. Thus, we want to include all terms with any permutation of columns, to be chosen for the individual rows. This leads us back to the permutations, i.e., bijections σ : {1, . . . , n} → {1, . . . , n}, see the 3rd section of Chapter 1, and 2.2.1. In particular, by 2.2.1 we know that a convenient way to denote permutations is the so-called two-row notation (a table form for the function σ). This notation is adopted in this column, as well.

2.B.1. Decompose the permutation

σ = ( 1 2 3 4 5 6 7 8 9 )
    ( 3 1 6 7 8 9 5 4 2 )

into a product of cycles and a product of transpositions.

Solution. The term “cycle” is closely related to invariant subsets in {1, . . . , n} under the iterated action of σ. We may easily see them from the defining table: start with the first element and look at the second row, iterating σ. We see that 1 is mapped to 3, while the second iteration sends 3 to 6, and so on. We continue until we again reach the starting element (which must happen since the set is finite). We obtain the following sequence of elements, which map to each other under the given permutation: 1 → 3 → 6 → 9 → 2 → 1. Such a sequence is called a cycle; better, we extend it to a bijection by leaving all the other elements as fixed points (see 2.2.3), and we may denote it by (1, 3, 6, 9, 2). Next, choose any element which belongs to the set {1, 2, . . . , 9} but does not belong to our cycle (1, 3, 6, 9, 2), e.g., the integer 4. Applying the same procedure as before, we obtain the cycle (4, 7, 5, 8). Observe that each element from the set {1, 2, . . . , 9} appears now in one of the previous two cycles. Thus, we can express σ as the composition of these two cycles, i.e.,

σ = (1, 3, 6, 9, 2) ◦ (4, 7, 5, 8) = (4, 7, 5, 8) ◦ (1, 3, 6, 9, 2),

Proof. (1) The terms of the determinants |A| and |AT| are in bijective correspondence, where the term sgn(σ) a1σ(1) · a2σ(2) · · · anσ(n) corresponds to the following term of |AT| (notice, multiplication does not depend on the order of the scalars):

sgn(σ) aσ(1)1 · aσ(2)2 · · · aσ(n)n = sgn(σ) a1σ−1(1) · a2σ−1(2) · · · anσ−1(n),

and we have to ensure that this member has the correct sign. But the parities of σ and σ−1 are the same, and so this is really a term in the determinant |AT|, and the first claim is proved.

(2) This comes straight from the definition of the determinant, because all its terms contain exactly one member from every row. Thus, if one of the rows is zero, all terms of the determinant are also zero.

(3) The only change in the terms of |B| compared to |A| is the addition of one transposition in all the permutations, therefore all the signs are reversed.

(4) This follows straight from the definition, because the terms of |B| are just the terms of |A| multiplied by the scalar a.
(5) In every term of |A|, there is exactly one element from the k-th row of the matrix A. By the distributive law for multiplication and addition in K, the claim follows directly from the definition of the determinant.

(6) If there are two identical rows in A, then the terms in the determinant come in pairs which are identical up to the sign, and so they cancel each other. Therefore in this case |A| = 0. Thus, by (5), we can add any other row to the given row without changing the value of the determinant. In view of the claims (4) and (5), we can in fact add a scalar multiple of any other row. □

2.2.6. Computational corollaries. Let us note a nice corollary of the first claim of the previous theorem about the equality of the determinants of the matrix and its transpose. It ensures that whenever we prove some claim about determinants formulated in terms of the rows of the corresponding matrix, we immediately obtain an analogous claim in terms of the columns. For instance, we can immediately formulate all the claims (2)–(6) for linear combinations of columns.

For a while, let us assume we work with a field of scalars K. Then, by the previous theorem, we can use elementary row transformations to bring any square matrix A into row echelon form, without changing the value of its determinant. We just have to be careful and add only linear combinations of other rows to a given one. Thus let us look at the distribution of the elements in the individual terms of a determinant |A| with the dimension of A equal to n > 1. There is just one term with all of its elements on the diagonal. In all other terms, there must be elements both above and below the diagonal (if we place one element outside of the diagonal, we block two diagonal entries and we leave only n − 2 diagonal positions for the other n − 1 elements).

where the second equality occurs since independent cycles commute. Next, we notice how to decompose general cycles into the simplest ones – the transpositions (i, j) (i.e., i → j → i and all other elements remain fixed). A simple check reveals (do it carefully yourselves!) that any cycle (i1, i2, . . . , ik) is a composition of transpositions of the neighbouring numbers, starting from the back. Thus, in our case

(1, 3, 6, 9, 2) = (1, 3) ◦ (3, 6) ◦ (6, 9) ◦ (9, 2) ≡ (1, 3)(3, 6)(6, 9)(9, 2) .

For example, evaluating this resulting bijection on 2:

((1, 3)(3, 6)(6, 9)(9, 2))(2) = ((1, 3)(3, 6)(6, 9))(9) = ((1, 3)(3, 6))(6) = (1, 3)(3) = 1,

as expected. This means σ = (1, 3)(3, 6)(6, 9)(9, 2)(4, 7)(7, 5)(5, 8) (notice that this is also a decomposition into the minimal number of transpositions).

As we saw in Chapter 1, Sage offers a rich portfolio of methods available for solving combinatorial tasks. To treat the given task in Sage, we need to define a permutation p of a finite set, which can be done via the command Permutation(p). Also, we can determine the cycles of p by the command Permutation(p).to_cycles(). Let us illustrate the situation for p = σ, where σ is as above:

p = Permutation([3, 1, 6, 7, 8, 9, 5, 4, 2])
c = p.to_cycles()
ts = []  # collecting transpositions
for x in c:
    for i in range(len(x) - 1):
        ts.append((x[i], x[i + 1]))
print(ts)

This block returns the following list, verifying our answer:

[(1, 3), (3, 6), (6, 9), (9, 2), (4, 7), (7, 5), (5, 8)] □

We are ready to summarize: the right formula for the determinant of a matrix A = (aij) of size n is

det(A) = |A| = Σ_{σ∈Σn} sgn(σ) a1σ(1) . . . anσ(n) ,

where the parity sgn(σ) is ±1, according to the even or odd number of transpositions constituting σ.
This is well defined. Clearly this definition fulfills our wish to have an expression linear in each of the columns, while changing its sign with each transposition of columns. Note that the minimal number of transpositions in the decomposition of a given permutation can be obtained as follows: we first decompose the permutation into a product of independent cycles, and then the cycles canonically into transpositions (actually the length of each cycle determines its parity, and the parities multiply). You may find more details in 2.2.2, where the parity is defined in terms of the so-called inversions.

Therefore, if the matrix A is in a row echelon form, then every term of |A| is zero, except the term with exclusively diagonal entries. This proves the following algorithm:

Computing determinants using elimination

Lemma. If A is in the row echelon form, then |A| = a11 a22 · · · ann.

This observation gives an effective method for computing determinants using the Gauss elimination method for matrices over a field of scalars K, see the paragraph 2.1.7. Notice that the very same argumentation allows us to stop the elimination once the first k columns are in the requested form, and to find the determinant of the matrix B of dimension n − k in the right bottom corner of A in another way. The result will then be |A| = a11 a22 · · · akk |B|.

As a useful (theoretical) illustration of this principle, we shall derive the following formula for the direct calculation of solutions of systems of linear equations. For the sake of simplicity, we still work with a field of scalars now. (But we shall see later that the Cramer rule actually works for all scalars.)

Cramer rule

Proposition. Consider the system of n linear equations in n variables with the matrix of the system A = (aij) and the column of values b = (b1, . . . , bn). In matrix notation this means we are solving the equation A · x = b. If there exists the inverse |A|−1, then the individual components of the unique solution x = (x1, . . . , xn) are given as xi = |Ai| |A|−1, where the matrices Ai arise from the matrix A of the system by replacing the i-th column by the column b of values.

Proof. Even for general scalars, if A−1 exists, then there must be a unique solution. As we have already seen, working over a field of scalars the inverse of the matrix of the system exists if and only if the system has a unique solution, and this in turn happens if and only if |A| ̸= 0. If we have such a solution x, we can express the column b in the matrix Ai by the corresponding linear combination of the columns of the matrix A, that is, the values bk = ak1 x1 + · · · + akn xn. Then, by subtracting the xℓ-multiples of all the other ℓ-th columns from this i-th column in Ai, we arrive at just the xi-multiple of the original column of A. The number xi can thus be brought in front of the determinant to obtain the equation |Ai| = xi |A|, and thus |Ai| |A|−1 = xi |A| |A|−1 = xi, which is our claim. □

Notice also that the properties (3)–(5) from the previous theorem say that the determinant (considered as a mapping which assigns a scalar to n vectors of dimension n) is an

In the two-row notation for permutations, the number of inversions is obtained by going through the columns and counting, for each column, the subsequent columns with a smaller entry in the second row.

2.B.2. Determine the parity of the permutation σ in 2.B.1, and of the permutation

τ = ( 1 2 3 4 5 6 )
    ( 2 4 6 1 5 3 ) . ⃝

2.B.3. Check the previous result in Sage.

Solution.
It is very easy to obtain the parity of a permutation in Sage. This is done via the command p.sign(), where p is the given permutation. For instance, for the given σ and τ in 2.B.2 we can type:

s = Permutation([3, 1, 6, 7, 8, 9, 5, 4, 2])
t = Permutation([2, 4, 6, 1, 5, 3])
print(s.sign()); print(t.sign())

Sage says that σ, τ are odd by returning −1 in both cases. □

2.B.4. Compute the determinant of the matrices

A = ( 1 2; 2 1 ) , B = ( 1 2 3; 1 −1 2; 3 2 2 ) , C = ( 1 1 1; 1 0 0; −2 0 1 ) .

Solution. For A we compute det(A) = 1·1 − 2·2 = −3. As for the 3 × 3 matrices, the definition yields the so-called Sarrus rule, see 2.2.1. In particular, for B we get

det(B) = 1·(−1)·2 + 2·2·3 + 3·1·2 − 3·(−1)·3 − 1·2·2 − 2·1·2 = 17.

Another way to remember this rule:

det ( a b c; d e f; ℓ m n ) = a·α − b·β + c·γ ,

where

α = det ( e f; m n ) , β = det ( d f; ℓ n ) , γ = det ( d e; ℓ m ) .

Similarly, for the matrix C we obtain

det(C) = 1·0·1 + 1·0·(−2) + 1·1·0 − 1·0·(−2) − 1·0·0 − 1·1·1 = −1 .

Of course, all cases can be done in Sage, as in Chapter 1. Let us recall the relevant code:

B = matrix([[1, 2, 3], [1, -1, 2], [3, 2, 2]])
det(B)

□

There are several methods for computing determinants of big matrices. One of them is to use the elementary row (or column) transformations. Remember though that we are allowed to add linear combinations of other rows to the transformed one, since multiplying a row clearly results in the same multiple of the resulting determinant. If A is in a row-echelon (column-echelon) form, then det(A) is just the product of all the diagonal elements (straight from the definition!). See a more detailed review of the elementary properties in 2.2.5.

antisymmetric mapping linear in every argument, exactly as we required in analogy to the 2-dimensional case.

2.2.7. Further properties of the determinant. Later we will see that, exactly as in dimension 2, the determinant of a matrix equals the (oriented) volume of the parallelepiped determined by the columns of the matrix. We shall also see that considering the mapping x → A · x given by the square matrix A on Rn, we can understand the determinant of this matrix as expressing the ratio between the volume of the parallelepiped given by the vectors x1, . . . , xn and that of their images A · x1, . . . , A · xn. Because the composition x → A·x → B·(A·x) of mappings corresponds to the matrix multiplication, the Cauchy theorem below is easy to understand:

Cauchy theorem

Theorem. Let A = (aij), B = (bij) be square matrices of dimension n over the ring of scalars K. Then |A · B| = |A| · |B|.

In the next paragraphs, we derive this theorem in a purely algebraic way, in particular because the previous argumentation based on geometric intuition could hardly work for arbitrary scalars. The basic tool is the determinant expansion using one or more rows or columns, which we have seen in the simplest case of single rows or columns in 2.2.4. We will also need a little technical preparation. The reader who is not fond of too much abstraction can skip these paragraphs and note only the statement of the Laplace theorem and its corollaries. Notice also that the claims (2), (3) and (6) from the theorem 2.2.5 are easily deduced from the Cauchy theorem and the representation of the elementary row transformations as multiplication by suitable matrices (cf. 2.1.8).

2.2.8. Minors of the matrix. When investigating matrices and their properties, we often work only with parts of the matrices.
Therefore we need some new concepts.

Submatrices and minors

Let A = (aij) be a matrix of the type m/n and let 1 ≤ i1 < . . . < ik ≤ m, 1 ≤ j1 < . . . < jℓ ≤ n be fixed natural numbers. Then the matrix

M = ( ai1j1 ai1j2 . . . ai1jℓ ; . . . ; aikj1 aikj2 . . . aikjℓ )

of the type k/ℓ is called a submatrix of the matrix A determined by the rows i1, . . . , ik and the columns j1, . . . , jℓ. The remaining (m − k) rows and (n − ℓ) columns determine a matrix M∗ of the type (m − k)/(n − ℓ), which is called the complementary submatrix to M in A.

The other option is to use the Laplace expansion with respect to one or more rows or columns, see the description in 2.2.9.

2.B.5. Compute det(B) for the matrix B from the previous task by the Gauss elimination method.

Solution. We aim at transforming the matrix into a row echelon form, and then multiplying the numbers on the diagonal. However, we must remember that multiplying a row by a scalar changes the determinant of the matrix by the same multiple. Moreover, interchanging two rows changes the sign of the determinant of the matrix.

det ( 1 2 3; 1 −1 2; 3 2 2 ) = det ( 1 2 3; 0 −3 −1; 0 −4 −7 )
= (1/(−4))·(1/3)·det ( 1 2 3; 0 12 4; 0 −12 −21 ) = −(1/12)·det ( 1 2 3; 0 12 4; 0 0 −17 ) .

The remaining step is the computation of the determinant of an upper triangular matrix. The latter equals the product of the diagonal entries; thus, det(B) = −(1/12)·(1·12·(−17)) = 17. Of course, the direct application of the Sarrus rule was faster. □

2.B.6. Compute the determinant of the matrix

A = ( 1 3 5 6; 1 2 2 2; 1 1 1 2; 0 1 2 1 ) .

Solution. A standard way to compute det(A) is the reduction of the size of the matrices by the expansion along the first column (see also 2.3.10). This gives

det(A) = 1·det ( 2 2 2; 1 1 2; 1 2 1 ) − 1·det ( 3 5 6; 1 1 2; 1 2 1 ) + 1·det ( 3 5 6; 2 2 2; 1 2 1 ) = −2 − 2 + 6 = 2 ,

where we have exploited the fact that the last entry of the first column is zero. An alternative way is to convert the matrix to row echelon form, which we can do as follows:

det(A) = −det ( 1 1 1 2; 1 2 2 2; 1 3 5 6; 0 1 2 1 ) = −det ( 1 1 1 2; 0 1 1 0; 0 2 4 4; 0 1 2 1 )
= −det ( 1 1 1 2; 0 1 1 0; 0 0 2 4; 0 0 1 1 ) = det ( 1 1 1 2; 0 1 1 0; 0 0 1 1; 0 0 2 4 )
= det ( 1 1 1 2; 0 1 1 0; 0 0 1 1; 0 0 0 2 ) = 2 .

Observe that here we have interchanged rows twice. Specify the rest of the applied rules, according to 2.2.5! □

2.B.7. Use the Laplace theorem (see 2.2.9) to compute the determinant of the matrix D given below. Next verify your

When k = ℓ we call the determinant |M| a subdeterminant or minor of order k of the matrix A. If m = n and k = ℓ, then M∗ is also a square matrix and |M∗| is called the minor complement to |M|, or the complementary minor of the submatrix M in the matrix A. The scalar (−1)^(i1+···+ik+j1+···+jℓ) · |M∗| is then called the algebraic complement of the minor |M|. The submatrices formed by the first k rows and columns are called leading principal submatrices, and their determinants are called leading principal minors of the matrix A. If we choose k sequential rows and columns starting with the i-th row, we speak of principal submatrices and principal minors. In particular, when k = ℓ = 1 and m = n, we call the corresponding algebraic complementary minor the algebraic complement Aij of the element aij of the matrix A, which we met already in 2.2.4.

2.2.9. Laplace determinant expansion. If the leading principal minor |M| of the matrix A is of order k, then, directly from the definition of the determinant, each of the individual k!(n − k)!
terms in the product of |M| with its algebraic complement is a term of |A|. In general, consider a square submatrix M, that is, a square matrix given by the rows i1 < i2 < · · · < ik and the columns j1 < · · · < jk. Then using (i1 − 1) + · · · + (ik − k) exchanges of neighbouring rows and (j1 − 1) + · · · + (jk − k) exchanges of neighbouring columns in A, we can transform this submatrix M into a leading principal submatrix, and the complementary matrix gets transformed into its complementary matrix. The whole matrix A gets transformed into a matrix B satisfying (cf. 2.2.5 and the definition of the determinant) |B| = (−1)^α |A|, where α = (i1 + j1) + · · · + (ik + jk) − 2(1 + · · · + k). But (−1)^α = (−1)^β with β = (i1 + j1) + · · · + (ik + jk). Therefore we have checked:

Proposition. If A is a square matrix of dimension n and |M| is its minor of order k < n, then the product of any term of |M| with any term of its algebraic complement is a term in the determinant |A|.

This claim suggests that we could perhaps express the determinant of the matrix by using some products of smaller determinants. We see that |A| contains exactly n! distinct terms, exactly one for each permutation. These terms are mutually distinct as polynomials in the components of a general matrix A. If we can show that there are exactly that many mutually distinct expressions from the previous claim, we obtain the determinant |A| as their sum. It remains to show that the terms of the product |M|·|M∗| contain exactly n! distinct members from |A|. From the chosen k rows we can choose (n choose k) minors M, and using the previous lemma each of the k!(n − k)!

answer in Sage.

D = ( 1 0 1 0 1; 0 2 0 2 0; 0 0 3 0 3; 4 0 0 4 4; 0 0 0 0 5 ) . ⃝

2.B.8. This is a theoretical task. Let us fix any matrices A, B ∈ Matn(K), over any ring K.
(a) If det(A) = 0, show that the product AB also has determinant zero.
(b) Prove the identity det(cA) = c^n det(A), for each scalar c ∈ K.
(c) Show by a counterexample that in general det(A + B) ̸= det(A) + det(B). ⃝

The upcoming series of exercises will explore various methods for computing the inverse of an invertible matrix, including Cramer’s rule, and other topics. As we will demonstrate, Sage can efficiently handle these computations, leveraging its extensive matrix calculus functionality. Before diving into the problems, let us review some key points about matrix manipulation in Sage. There are several convenient shortcuts for matrix creation. As we mentioned, to specify the ring of the matrix entries use QQ for rationals, RR for reals, CC for complex numbers, and ZZ for integers. For exact computations with rational entries the results are precise. For numerical calculations with real or complex matrices, we recommend the use of RDF and CDF, which work with a floating-point representation, instead of RR and CC. After creating a matrix A in your favourite Sage editor, you can access its entries in a straightforward manner using the square bracket notation A[i, j]. Here, i and j represent the row and column indices, respectively. Note that both Python and Sage use zero-based indexing. For example, to access the entry a33 of a 4 × 4 matrix A, we type A[2, 2]. Notice that Sage also allows indexing from the end of the matrix, using negative indices. For example, A[-1, :] refers to the last row of A, and A[-2, :] refers to the second to last row of A. Additionally, indices can be specified as intervals using the format a : b, which selects the indices from a to b − 1. For instance, A[2 : 3] is valid and will be useful when discussing submatrices later.
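The following small cell (our own illustration, on an arbitrary 3 × 4 rational matrix) demonstrates the negative and slice-based access just described:

A = matrix(QQ, [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(A[-1, :])     # the last row of A
print(A[0:2, 1:3])  # the 2 x 2 submatrix of rows 0..1 and columns 1..2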
Another useful tip is the notation p : q : r, which generates all the indices from p to q − 1, in steps of r.

2.B.9. Given the matrix

A = ( 1 1 3 0 8; 2 2 −4 −8 4; −1 −5 2 4 6; 7 −1 10 0 9 )

use Sage to access the entries a33, a12, a11, a45 and a44.

Solution. We can use the cell

A = matrix(QQ, [[1, 1, 3, 0, 8], [2, 3, -4, -8, 4], [-1, -5, 2, 4, 6], [7, -1, 10, 0, 9]])
print(A[2, 2]); print(A[0, 1])
print(A[0, 0]); print(A[3, 4])
print(A[3, 3])

terms in the products of |M| with their algebraic complements is a term in |A|. But for distinct choices of M we can never obtain the same terms, and the individual terms in (−1)^(i1+···+ik+j1+···+jk) · |M| · |M∗| are also mutually distinct. Therefore we have exactly the required number k!(n − k)!·(n choose k) = n! of terms, and we have proved:

Laplace theorem

Theorem. Let A = (aij) be a square matrix of dimension n over an arbitrary ring of scalars, with k rows fixed. Then |A| is the sum of all the (n choose k) products (−1)^(i1+···+ik+j1+···+jk) · |M| · |M∗| of the minors of order k chosen among the fixed rows, with their algebraic complements.

The Laplace theorem transforms the computation of |A| into the computation of determinants of lower dimension. This method of computation is called the Laplace expansion along the chosen rows (or columns). For instance, the expansion along the i-th row or the j-th column reads

|A| = Σ_{j=1..n} aij Aij = Σ_{i=1..n} aij Aij ,

where Aij are the algebraic complements of the elements aij (the algebraic complements of the minors of order one), as deduced in 2.2.4 already. In practical computations, it is often efficient to combine the Laplace expansion with the direct method of Gaussian elimination.

2.2.10. Proof of the Cauchy theorem. The theorem is based on a clever but elementary application of the Laplace theorem. We just use the Laplace expansion twice on a particular arrangement of a well chosen matrix. Consider first the following matrix H of dimension 2n (we are using the so-called block symbolics, that is, we write the matrix as if composed of the (sub)matrices A, B, and so on):

H = ( A 0; −E B ) = ( a11 . . . a1n 0 . . . 0; . . . ; an1 . . . ann 0 . . . 0; −1 0 . . . 0 b11 . . . b1n; . . . ; 0 . . . 0 −1 bn1 . . . bnn ) .

The Laplace expansion along the first n rows gives |H| = |A| · |B|. Now in sequence, we add linear combinations of the first n columns to the last n columns in order to obtain a matrix

Sage returns 2, 1, 1, 9 and 0, respectively. Notice that entering A[4, 3], Sage will produce an error message: matrix index out of range, which is expected. □

2.B.10. For the matrices A, B given below, compute:
(a) the corresponding null spaces;
(b) the submatrix of A obtained by canceling the first row, and the first and third columns; is it diagonal?
(c) the submatrix of B obtained by canceling the first and last rows, and the last column;
(d) the submatrix of B obtained by canceling the first and third rows, and the second and fourth columns; is it symmetric?

A = ( 1 −1 π 0; 0 π −1 0; 1 0 −1 π ) , B = ( 1 2 3 4; 2 1 4 3; 3 4 1 2; 4 3 2 1 ) . ⃝

2.B.11. Using Cramer’s rule, solve the system posed in 2.A.9. Next demonstrate the situation in Sage.

Solution. Consider a linear system Ax = b of n equations in n unknowns, with det(A) ̸= 0. Recall from 2.2.6 that Cramer’s rule describes the unique solution of such a linear system as the unique vector x = (x1, . . . , xn)T of Rn with entries xi = det(Ai)/det(A), for all i = 1, . . . , n,
where the Ai are the matrices obtained from the coefficient matrix A by replacing the i-th column by the vector b, for i = 1, . . . , n. The matrix A of the system in 2.A.9 is 3 × 3, and Cramer’s rule gives

x1 = det ( 2 2 3; −3 −3 −1; −3 1 2 ) / det ( 1 2 3; 2 −3 −1; −3 1 2 ) = 1 ,
x2 = det ( 1 2 3; 2 −3 −1; −3 −3 2 ) / det ( 1 2 3; 2 −3 −1; −3 1 2 ) = 2 ,
x3 = det ( 1 2 2; 2 −3 −3; −3 1 −3 ) / det ( 1 2 3; 2 −3 −1; −3 1 2 ) = −1 .

It is easy to demonstrate Cramer’s rule in Sage. For a 3 × 3 system this can be done really quickly, and an illustration is here:

A = matrix([[1, 2, 3], [2, -3, -1], [-3, 1, 2]])
b = vector([2, -3, -3])
A1 = copy(A); A2 = copy(A); A3 = copy(A)
A1[:, 0] = b  # construct the matrix A_1
A2[:, 1] = b  # construct the matrix A_2
A3[:, 2] = b  # construct the matrix A_3
show(A1); show(A2); show(A3)
x1 = A1.det()/A.det(); x2 = A2.det()/A.det()
x3 = A3.det()/A.det(); print([x1, x2, x3])

with zeros in the bottom right corner. We obtain

K = ( a11 . . . a1n c11 . . . c1n; . . . ; an1 . . . ann cn1 . . . cnn; −1 0 . . . 0 0 . . . 0; . . . ; 0 . . . 0 −1 0 . . . 0 ) .

The elements of the submatrix in the top right part must satisfy cij = ai1 b1j + ai2 b2j + · · · + ain bnj, that is, they are exactly the components of the product A · B, and |K| = |H|. The expansion along the last n columns gives us

|K| = (−1)^n (−1)^(1+···+2n) |A·B| = (−1)^(2n(n+1)) |A · B| = |A · B|.

This proves the Cauchy theorem.

2.2.11. Determinant and the inverse matrix. Assume first that there is an inverse matrix of the matrix A, that is, A · A−1 = E. Since the unit matrix always satisfies |E| = 1, it follows that for every invertible matrix its determinant is an invertible scalar, and by the Cauchy theorem we have |A−1| = |A|−1. But we can say more, combining the Laplace and Cauchy theorems.

Inverse matrix determinant formula

For any square matrix A = (aij) of dimension n we define the matrix A∗ = (a∗ij), where a∗ij = Aji are the algebraic complements of the elements aji in A. The matrix A∗ is called the algebraically adjoint matrix of the matrix A (or the adjugate matrix A∗).

Theorem. For every square matrix A over a ring of scalars K we have

(1) AA∗ = A∗A = |A| · E.

In particular, (i) A−1 exists as a matrix over the ring of scalars K if and only if |A|−1 exists in K. (ii) If A−1 exists, then A−1 = |A|−1 · A∗.

Proof. As already mentioned, the Cauchy theorem shows that the existence of A−1 implies the invertibility of |A| ∈ K. For an arbitrary square matrix A we can directly compute A · A∗ = (cij), where

cij = Σ_{k=1..n} aik a∗kj = Σ_{k=1..n} aik Ajk .

If i = j, this is exactly the Laplace expansion of |A| along the i-th row. If i ̸= j, then we may imagine we expand the determinant along the j-th row, but plug in the values of the i-th row instead of the ajk’s. This is the expansion of the determinant

Notice that the copy() function provides a way to make a copy of a matrix before we make changes to it. This block returns the matrices A1, A2, A3 and the solution in the form [1, 2, −1]. In Section E we will meet another implementation of Cramer’s rule via Sage, see 2.E.23. □

2.B.12. Is the matrix A given below invertible? In the positive case find its algebraically adjoint matrix and its inverse:

A = ( 1 0 2 0; 0 3 0 4; 5 0 6 0; 0 7 0 8 ) .

Solution. Recall that a square matrix A is invertible if and only if det(A) ̸= 0. Expanding along the first row of the given A, we see that

det(A) = 1·det ( 3 0 4; 0 6 0; 7 0 8 ) + 2·det ( 0 3 4; 5 0 0; 0 7 8 ) = −24 + 2·20 = 16 .

Hence there exists A−1 such that AA−1 = E = A−1A, where E is the 4 × 4 identity matrix.
To compute A−1 we will use the algebraic adjoint matrix of A, the latter given by

adj(A) = ( A11 A12 A13 A14; A21 A22 A23 A24; A31 A32 A33 A34; A41 A42 A43 A44 )T .

Here, Aij is the algebraic complement of the element aij of the matrix A, that is, the product of the number (−1)^(i+j) and the determinant of the matrix obtained from A by deleting the i-th row and the j-th column. We compute:

A11 = det ( 3 0 4; 0 6 0; 7 0 8 ) = −24 , A12 = −det ( 0 0 4; 5 6 0; 0 0 8 ) = 0 ,
A13 = det ( 0 3 4; 5 0 0; 0 7 8 ) = 20 , A14 = −det ( 0 3 0; 5 0 6; 0 7 0 ) = 0 ,
A21 = −det ( 0 2 0; 0 6 0; 7 0 8 ) = 0 , A22 = det ( 1 2 0; 5 6 0; 0 0 8 ) = −32 ,
A23 = −det ( 1 0 0; 5 0 0; 0 7 8 ) = 0 , A24 = det ( 1 0 2; 5 0 6; 0 7 0 ) = 28 ,
A31 = det ( 0 2 0; 3 0 4; 7 0 8 ) = 8 , A32 = −det ( 1 2 0; 0 0 4; 0 0 8 ) = 0 ,
A33 = det ( 1 0 0; 0 3 4; 0 7 8 ) = −4 , A34 = −det ( 1 0 2; 0 3 0; 0 7 0 ) = 0 ,
A41 = −det ( 0 2 0; 3 0 4; 0 6 0 ) = 0 , A42 = det ( 1 2 0; 0 0 4; 5 6 0 ) = 16 ,

of a matrix where the i-th and the j-th rows are the same; therefore cij = 0. This implies that A · A∗ = |A| · E, and we have proven one of the equalities in (1). In particular, if |A|−1 exists, then A · (|A|−1 A∗) = E. If |A| is an invertible scalar, we may repeat the previous computation for A∗ · A, and we obtain (|A|−1 A∗) · A = E. Therefore our computation really gives the inverse matrix of A, as claimed in the theorem. □

Notice that for fields of scalars we have already proved that the right inverse of a matrix is automatically the left inverse, and thus the inverse, too. Here we have obtained the same result for all rings of scalars, together with a strong and effective existence condition. On the other hand, the exact formula for the inverse has become rather theoretical, with little practical value. As a direct corollary of this theorem we can once again prove the Cramer rule for solving systems of linear equations, see 2.2.6. Really, for the solution of the system A·x = b we just need to read, in the equation x = A−1 · b = |A|−1 A∗ · b, the individual components of the expression A∗ · b as the Laplace expansions of the determinant of the matrix Ai which arose through the exchange of the i-th column of A for the column b.

3. Vector spaces and linear mappings

2.3.1. Abstract vector spaces. Let us go back for a while to the systems of m linear equations in n variables from 2.1.3 and further, and let us assume that the system is homogeneous, A · x = 0, i.e.,

( a11 . . . a1n; . . . ; am1 . . . amn ) · ( x1; . . . ; xn ) = ( 0; . . . ; 0 ) .

By the distributivity of the matrix multiplication it is clear that the sum of two solutions x = (x1, . . . , xn) and y = (y1, . . . , yn) satisfies A · (x + y) = A · x + A · y = 0, and thus it is also a solution. Similarly, a scalar multiple a · x is also a solution. The set of all solutions of a fixed system of equations is therefore closed under vector addition and scalar multiplication. These are the basic properties of vectors of dimension n in Kn, see 2.1.1. Now we have the vectors in the solution space with n coordinates. The “dimension” of this space is given by the difference of the number of variables and the rank of the matrix A. Thus we can easily deal with the solution of a system of 1000 equations in 1000 variables which needs only one or two free parameters. The whole solution space will then behave as a plane or a line, as we have already seen in 1.5.3 at page 29, although the vectors themselves are given by so many components.

A43 = −det ( 1 0 0; 0 3 4; 5 0 0 ) = 0 , A44 = det ( 1 0 2; 0 3 0; 5 0 6 ) = −12 .
Thus, by substitution we obtain

adj(A) = ( −24 0 20 0; 0 −32 0 28; 8 0 −4 0; 0 16 0 −12 )T = ( −24 0 8 0; 0 −32 0 16; 20 0 −4 0; 0 28 0 −12 ) .

As a verification of the given expression in Sage, you can use either the command A.adjugate(), or its alias A.adjoint_classical(), as follows:

A = matrix([[1, 0, 2, 0], [0, 3, 0, 4], [5, 0, 6, 0], [0, 7, 0, 8]])
show(A.adjugate())

The inverse matrix A−1 is now obtained by the rule A−1 = (1/det(A)) · adj(A), that is,

A−1 = ( −3/2 0 1/2 0; 0 −2 0 1; 5/4 0 −1/4 0; 0 7/4 0 −3/4 ) .

Recall that we can directly compute either the determinant of A or the inverse of A in Sage by adding the commands det(A) and A.inverse(), respectively. □

C. Vector spaces and linear mappings

Intuitively, a vector space is a landscape (of any dimension n, finite or infinite) where each position corresponds to a unique set of scalar coordinates (again n of them), and we add and multiply these exactly as we already saw in Chapter 1 when working with R2 and C. Vector spaces can become quite abstract and unusual (as the scalars themselves already can be), and they are fundamental building blocks in mathematics, with an amazingly wide range of applications in many fields, including chemistry, physics, image processing, computer science, economics, etc. For an informative introduction to the concepts, see 2.3.1. Next we will primarily focus on finite-dimensional vector spaces.

2.C.1. Vector spaces. Decide whether the following sets form a vector space over the given field:

(a) The set of real solutions of the linear system:

x1 + x2 + · · · + x8 + x9 + x10 = 10x1 ,
x1 + x2 + · · · + x8 + x9 = 9x1 ,
x1 + x2 + · · · + x8 = 8x1 ,
. . .
x1 + x2 = 2x1 .

We go further. Already in paragraph 1.2.1 we encountered an interesting example of the space of all solutions of a homogeneous linear difference equation of first order. All solutions were obtained from a single one by scalar multiplication, and they are also closed under addition and scalar multiples. These “vectors” of solutions are infinite sequences of numbers, although we intuitively expect that the “dimension” of the whole space of solutions should be one. We shall understand such phenomena with the help of a more general definition of a vector space and its dimension.

Vector space definition

A vector space V over a field of scalars K is a set where we define the operations
• addition, which satisfies the axioms (CG1)–(CG4) from paragraph 1.1.1 on page 5,
• scalar multiplication, for which the axioms (V1)–(V4) from paragraph 2.1.1 on page 84 hold.

Recall our simple notational convention: scalars are usually denoted by letters from the beginning of the alphabet, that is, a, b, c, . . . , while for vectors we shall use letters from the end, that is, u, v, w, x, y, z. Usually, x, y, z will denote n-tuples of scalars. For completeness, letters from the middle of the alphabet, for instance i, j, k, ℓ, will mostly denote indices.

In order to gain some practice in the formal approach, we check some simple properties of vectors. These are trivial for n-tuples of scalars, but not so evident for general vectors in our new abstract sense.

2.3.2. Proposition. Let V be a vector space over a field of scalars K. Suppose a, b, ai ∈ K, and u, v, uj ∈ V. Then
(1) a · u = 0 if and only if a = 0 or u = 0,
(2) (−1) · u = −u,
(3) a · (u − v) = a · u − a · v,
(4) (a − b) · u = a · u − b · u,
(5) ( Σ_{i=1..n} ai ) · ( Σ_{j=1..m} uj ) = Σ_{i=1..n} Σ_{j=1..m} ai · uj.

Proof.
We can expand
a · u = (a + 0) · u = a · u + 0 · u (by (V2)),
which, according to the axiom (CG4), implies 0 · u = 0. Now
u + (−1) · u = (1 + (−1)) · u = 0 · u = 0 (by (V2)),
and thus −u = (−1) · u. Further,
a · (u + (−1) · v) = a · u + (−a) · v = a · u − a · v (by (V2), (V3)),
which proves (3). It follows that
(a − b) · u = a · u + (−b) · u = a · u − b · u (by (V2), (V3)),
which proves (4). Property (5) follows by induction using (V2) and (V1). It remains to prove (1):
a · 0 = a · (u − u) = a · u − a · u = 0,
which along with the first equality derived in this proof proves one implication. For the other implication, we use an axiom for the field of scalars and axiom (V4) for vector spaces: if p · u = 0 and p ≠ 0, then u = 1 · u = (p^{-1} · p) · u = p^{-1} · 0 = 0. □

(b) The set of real solutions of the equation x_1 + x_2 + · · · + x_{10} = 0.
(c) The set of real solutions of the equation x_1 + 2x_2 + 3x_3 + · · · + 10x_{10} = 1.
(d) The set of all real or complex sequences (recall that a real or a complex sequence can be viewed as a mapping f : N → R or f : N → C, respectively).
(e) The set F = {f | f : X → R} of real-valued functions whose domain is a non-empty set X.
(f) The set of real solutions of a non-homogeneous difference equation.
(g) The set Mat_{m,n}(K) of m × n matrices with entries in K, where K ∈ {R, C}.
(h) The set Q(√2) = {a + √2 b : a, b ∈ Q} over Q.

Solution. (a) This set is a vector space, since the solutions are exactly the real multiples of the vector u := (1, . . . , 1)^T (with 10 entries). A sum of two multiples of the same vector is again a multiple of that vector, and so is the additive inverse. All other axioms are trivially satisfied.
(b) This provides an example of a vector space of dimension 9 (its dimension is determined by the number of free parameters of the solution). More generally, recall that the set of solutions of any homogeneous linear system is a vector space, see 2.3.5.
(c) Taking twice the solution x_1 = 1, x_i = 0 for i = 2, . . . , 10, we do not obtain a solution. Hence, the given set cannot be a vector space. However, the set of solutions forms an affine space, see 4.1.1.
(d) The set of all real (respectively, complex) sequences clearly forms a real (respectively, complex) vector space. Notice that the addition and scalar multiplication are defined termwise. Moreover, the zero vector is represented by the zero sequence (0, 0, 0, . . . ).
(e) This provides an example of an infinite-dimensional vector space, where the addition and multiplication are defined in the usual way, i.e., (f + g)(x) = f(x) + g(x) and (c · f)(x) = c · f(x), respectively, for all x ∈ X. Notice the zero vector is simply the constant function with zero value everywhere on X, i.e., the zero function defined on X.
(f) Consider two solutions of a non-homogeneous equation
\[
\begin{aligned}
a_n x_{n+k} + a_{n-1} x_{n+k-1} + \cdots + a_0 x_k &= c,\\
a_n y_{n+k} + a_{n-1} y_{n+k-1} + \cdots + a_0 y_k &= c,
\end{aligned}
\]
where c is some non-zero scalar. Their sum satisfies
\[
a_n(x_{n+k} + y_{n+k}) + a_{n-1}(x_{n+k-1} + y_{n+k-1}) + \cdots + a_0(x_k + y_k) = 2c,
\]
and obviously this cannot be a solution of the given non-homogeneous equation. Hence we do not obtain a vector space. However, the set of solutions in this case forms an affine space (see 4.1.1).
(g) This is one of the most fundamental examples of finite-dimensional vector spaces. For simplicity, let us fix K = R; the complex case is treated similarly, see also the task

2.3.3. Linear (in)dependence. In paragraph 2.1.11 we worked with linear combinations of rows of a matrix.
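Such dependencies among rows can also be tested quickly in Sage. A hedged sketch of ours (the matrix below is an ad hoc example, not from the text):

# the third row equals the first row plus twice the second,
# so the rows are linearly dependent
M = matrix(QQ, [[1, 2, 0], [0, 1, 1], [1, 4, 2]])
print(M.rank())                                   # 2 < 3
print((QQ^3).linear_dependence(M.rows()) != [])   # True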
With vectors we work analogously:

Linear combination and independence
An expression of the form a_1 v_1 + · · · + a_k v_k is called a linear combination of vectors v_1, . . . , v_k ∈ V. A finite sequence of vectors v_1, . . . , v_k is called linearly independent if the only zero linear combination is the one with all coefficients zero. That is, for any scalars a_1, . . . , a_k ∈ K,
a_1 v_1 + · · · + a_k v_k = 0 implies a_1 = a_2 = · · · = a_k = 0.

It is clear that in an independent sequence of vectors, all vectors are mutually distinct and nonzero. The set of vectors M ⊂ V in a vector space V over K is called linearly independent if every finite k-tuple of vectors v_1, . . . , v_k ∈ M is linearly independent. The set of vectors M is linearly dependent if it is not linearly independent.

A nonempty subset M of vectors in a vector space over a field of scalars K is dependent if and only if one of its vectors can be expressed as a finite linear combination of other vectors in M. This follows directly from the definition: indeed, at least one of the coefficients in the corresponding linear combination must be nonzero, and since we are over a field of scalars, we can multiply the whole combination by the inverse of this nonzero coefficient and thus express its corresponding vector as a linear combination of the others.

Every subset of a linearly independent set M is clearly also linearly independent (we require the same conditions on a smaller set of vectors). Similarly, we can see that M ⊂ V is linearly independent if and only if every finite subset of M is linearly independent.

2.3.4. Generators and subspaces. A subset M ⊂ V is called a vector subspace if it forms, together with the restricted operations of addition and scalar multiplication, a vector space. That is, we require
∀a, b ∈ K, ∀v, w ∈ M : a · v + b · w ∈ M.

We investigate a couple of cases: The space of m-tuples of scalars R^m with coordinate-wise addition and multiplication is a vector space over R, but also a vector space over Q. For instance for m = 2, the vectors (1, 0), (0, 1) ∈ R^2 are linearly independent, because from a · (1, 0) + b · (0, 1) = (0, 0) it follows that a = b = 0. Further, the vectors (1, 0), (√2, 0) ∈ R^2 are linearly dependent over R, because √2 · (1, 0) = (√2, 0), but over Q they are linearly independent! Over R these two

in 2.E.56 for more details on complex matrices. The vector addition on Mat_{m,n}(R) is given by matrix addition, i.e., A + B = (a_{ij} + b_{ij}) ∈ Mat_{m,n}(R) for all A = (a_{ij}), B = (b_{ij}) ∈ Mat_{m,n}(R), and scalar multiplication is defined by multiplying each entry of the given matrix by the scalar, i.e., cA = (c · a_{ij}) ∈ Mat_{m,n}(R) for all c ∈ R and A = (a_{ij}) ∈ Mat_{m,n}(R), see also 2.1.2. Let us check the axioms (CG1)–(CG4) presented in 1.1.1:
• (CG1) – associativity: Indeed, (A + B) + C = A + (B + C) for all A, B, C ∈ Mat_{m,n}(R). This property follows from the associativity of the addition of real numbers: (a_{ij} + b_{ij}) + c_{ij} = a_{ij} + (b_{ij} + c_{ij}), where we assume that A = (a_{ij}), B = (b_{ij}) and C = (c_{ij}), respectively.
• (CG2) – commutativity: In an analogous way we can prove that A + B = B + A for all A, B ∈ Mat_{m,n}(R).
• (CG3) – existence of a neutral element: This is the zero matrix, which we denote by 0 ∈ Mat_{m,n}(R). Obviously, it satisfies A + 0 = A = 0 + A for all A ∈ Mat_{m,n}(R).
• (CG4) – existence of additive inverses: For A ∈ Mat_{m,n}(R), B = −A is the unique matrix satisfying A + B = 0.
These axioms ensure that the pair (Mat_{m,n}(R), +) has the structure of a “commutative group”.
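These axioms can also be spot-checked numerically. The following minimal Sage sketch (our own illustration, not from the text) tests them on random elements of Mat_{2,3}(Q); of course, random instances only illustrate the axioms, while the argument above is what establishes them:

M = MatrixSpace(QQ, 2, 3)
A = M.random_element(); B = M.random_element(); C = M.random_element()
print((A + B) + C == A + (B + C))   # (CG1) associativity
print(A + B == B + A)               # (CG2) commutativity
print(A + M.zero() == A)            # (CG3) neutral element
print(A + (-A) == M.zero())         # (CG4) additive inverse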
Finally, it is very easy to check the axioms (V1)–(V4) from paragraph 2.1.1, which the scalar multiplication must satisfy. For completeness, we list them:
• (V1) – α(A + B) = αA + αB for all α ∈ R and A, B ∈ Mat_{m,n}(R).
• (V2) – (α + β)A = αA + βA for all α, β ∈ R and A ∈ Mat_{m,n}(R).
• (V3) – α(βA) = (αβ)A for all α, β ∈ R and A ∈ Mat_{m,n}(R).
• (V4) – 1A = A for all A ∈ Mat_{m,n}(R).
(h) The set Q(√2) = {a + √2 b : a, b ∈ Q} provides an example of a vector space over Q. Indeed, let x = a + √2 b, y = c + √2 d be two elements of Q(√2). Then x + y = (a + c) + √2(b + d) ∈ Q(√2). Moreover, for q ∈ Q we have qx ∈ Q(√2), since qa, qb ∈ Q for all a, b ∈ Q. □

2.C.2. By the discussion in 2.3.4 we know that the space R_m[x] of real polynomials of degree at most m is a vector space over R. Can we claim the same for the set of real polynomials of degree exactly m? ⃝

2.C.3. Vector spaces in Sage. (a) Use Sage to create the vector spaces R^4 and Q^4. Next check whether the vector u = 2v + 3w is an element of Q^4, where v = (1, −2, −3, 0)^T and w = (1, 0, 3, −2)^T, respectively. (b) Use Sage to create the matrix spaces Mat_4(R) of 4 × 4 real matrices, and Mat_{4,1}(C) of 4 × 1 complex matrices.

Solution. (a) An easy way to introduce a vector space in Sage is to start by specifying the field over which the vectors are defined, and then use an exponent to indicate the dimension of the vector space. This process is compact and mimics the notation we use when working on paper. For instance, to introduce R^4, type

vectors “generate” a one-dimensional subspace, while over Q the subspace is “larger”.

Polynomials with real coefficients and of degree at most m form a vector space R_m[x]. We can consider the polynomials as mappings f : R → R and define the addition and scalar multiplication like this: (f + g)(x) = f(x) + g(x), (a · f)(x) = a · f(x). Polynomials of all degrees also form a vector space R[x] (or R_∞[x]), and R_m[x] ⊂ R_n[x] is a vector subspace for any m ≤ n ≤ ∞. Further examples of subspaces are given by all even polynomials or all odd polynomials, that is, polynomials satisfying f(−x) = ±f(x).

In complete analogy with polynomials, we can define a vector space structure on the set of all mappings R → R, or of all mappings M → V of an arbitrary fixed set M into the vector space V.

Because the condition in the definition of subspace consists only of universal quantifiers, the intersection of subspaces is still a subspace. We can see this also directly: Let W_i, i ∈ I, be vector subspaces in V, a, b ∈ K, u, v ∈ ∩_{i∈I} W_i. Then a · u + b · v ∈ W_i for all i ∈ I, hence a · u + b · v ∈ ∩_{i∈I} W_i.

It can be noted that the intersection of all subspaces W ⊂ V that contain some given set of vectors M ⊂ V is a subspace. It is called the linear span or linear hull of M, and we write span M. We say that a set M generates the subspace span M, or that the elements of M are generators of the subspace span M.

We formulate a few simple claims about subspace generation:

Proposition. For every nonempty set M ⊂ V, we have
(1) span M = {a_1 · u_1 + · · · + a_k · u_k; k ∈ N, a_i ∈ K, u_j ∈ M, j = 1, . . . , k};
(2) M = span M if and only if M is a vector subspace;
(3) if N ⊂ M then span N ⊂ span M is a vector subspace; the subspace span ∅ generated by the empty subset is the trivial subspace {0} ⊂ V.

Proof. (1) The set of all linear combinations a_1u_1 + · · · + a_ku_k on the right-hand side of (1) is clearly a vector subspace, and of course it contains M.
On the other hand, each of the linear combinations must be in span M, and thus the first claim is proved. Claim (2) follows immediately from claim (1) and the definition of a vector subspace. Analogously, (1) implies most of the third claim. Finally, the smallest possible vector subspace is {0}. Notice that the empty set is contained in every subspace, and each of them contains the vector 0. This proves the last claim. □

V=RR^4; V # we can also type V=RR**4

Sage returns the following:

Vector space of dimension 4 over Real Field with 53 bits of precision

Similarly, to introduce the vector space Q^4 type QQ^4, etc. We can also use the VectorSpace() function, which requires the name of the number system for the entries and the number of entries in each vector. For instance the cell

V = VectorSpace(QQ, 4); print(V)

prints out

Vector space of dimension 4 over Rational Field

Vectors can be constructed in the usual way, that is,

V=QQ^4; v=vector(QQ, [1, -2, -3, 0])
w=vector(QQ, [1, 0, 3, -2])

It is also easy to perform computations with vectors (vector addition and scalar multiplication):6

V=QQ^4; v=vector(QQ, [1, -2, -3, 0])
w=vector(QQ, [1, 0, 3, -2])
u=2*v+3*w
print(u) # we could type show(u)
u in V

Here Sage returns the vector u, i.e., (5, −4, 3, −6), and the word True, the latter verifying that u ∈ V.
(b) Sage provides built-in functions for working with matrix spaces. This cell creates the space Mat_4(R):

M = MatrixSpace(RR,4); M

Similarly, to introduce the space Mat_{4,1}(C) use the cell

M = MatrixSpace(CC,4, 1); M

Notice that when the number of columns is omitted, it defaults to the number of rows. □

2.C.4. Describe the linear combination 3v_1 + 2v_2 + 5v_3 + v_4 of the vectors v_1, . . . , v_4 given below in Sage, using at least two different methods:
\[
v_1 = (\tfrac12, \tfrac23, 1, 0)^T,\quad v_2 = (2, 0, \tfrac17, 0)^T,\quad v_3 = (0, \tfrac15, \tfrac25, 1)^T,\quad v_4 = (-1, 2, 0, 3)^T. \qquad ⃝
\]

2.C.5. Consider the system Ax = b given below, with a solution given by (−2/15, 1/5, 2/15, −11/15)^T:
\[
\begin{pmatrix} 3 & -9 & 9 & 0\\ 9 & -3 & 6 & 0\\ 9 & -6 & 0 & -6\\ -9 & 12 & 6 & 6 \end{pmatrix} \begin{pmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{pmatrix} = \begin{pmatrix} -1\\ -1\\ 2\\ 0 \end{pmatrix}.
\]

6 Be aware that Sage displays vectors using parentheses (to distinguish them from lists) and presents them horizontally. In particular, in Sage there is no inherent distinction between a “row vector” and a “column vector”. However, when dealing with matrices, it is crucial to make this distinction.

Basis and dimension
A subset M ⊂ V is called a basis of the vector space V if span M = V and M is linearly independent. A vector space with a finite basis is called finitely dimensional; the number of elements of the basis is called the dimension of V. If V does not have a finite basis, we say that V is infinitely dimensional. We write dim V = k, k ∈ N, or k = ∞.

In order to be satisfied with such a definition of dimension, we must know that different bases of the same space will always have the same number of elements. We shall show this below, cf. 2.3.9. But we note immediately that the trivial subspace is generated by the empty set, which is an “empty” basis, and thus it has dimension zero.

The linearly independent vectors e_i = (0, . . . , 1, . . . , 0) ∈ K^n, i = 1, . . . , n (all zeros, except one value 1 at the i-th position), are the most useful example of a basis in the vector space K^n. We call it the standard basis of K^n.

2.3.5. Linear equations again. It is a good time now to recall the properties of systems of linear equations in terms of abstract vector spaces and their bases.
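Before doing so, note that the span construction from the proposition proved above is directly available in Sage. A minimal hedged sketch (the generating vectors below are chosen ad hoc for illustration):

V = QQ^3
M = [vector(QQ, [1, 0, 1]), vector(QQ, [2, 0, 2]), vector(QQ, [0, 1, 0])]
W = V.span(M)                        # span M: all linear combinations of M
print(W.dimension())                 # 2, since (2, 0, 2) is a multiple of (1, 0, 1)
print(vector(QQ, [3, 5, 3]) in W)    # True: 3*(1,0,1) + 5*(0,1,0)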
As we have already noted in the introduction to this section (cf. 2.3.1), the set of all solutions of the homogeneous system A · x = 0 is a vector space. If A is a matrix with m rows and n columns, and the rank of the matrix is k, then using the row echelon transformation (see 2.1.7) to solve the system, we find that the dimension of the space of all solutions is exactly n − k. Indeed, the left-hand side of the equation can be understood as a linear combination of the columns of A with coefficients given by x, and the rank k of the matrix gives the number of linearly independent columns of A, thus the dimension of the subspace of all possible linear combinations of the given form. Therefore, after transforming the system into row echelon form, exactly m − k zero rows remain. In the next step, we are left with exactly n − k free parameters. By setting one of them to the value one while all others are zero, we obtain exactly n − k linearly independent solutions. Then all solutions are given by all the linear combinations of these n − k solutions. Every (n − k)-tuple of linearly independent solutions is called a fundamental system of solutions of the given homogeneous system of equations. We have proved:

Proposition. The set of all solutions of the homogeneous system of equations A · x = 0, for n variables with the matrix A of rank k, is a vector subspace in K^n of dimension n − k. Every basis of this space forms a fundamental system of solutions of the given homogeneous system.

Next, consider the general system of equations A · x = b. Notice that the columns of the matrix A are actually images of the vectors of the standard basis in K^n under the mapping

Use Sage to illustrate that the solution gives us scalars that yield the vector of constants b as a linear combination of the columns of the coefficient matrix A. ⃝

2.C.6. Describe the vector space structure of Z_5^3. What is the total number of vectors in Z_5^3? Use the cardinality command in Sage to support your answer. ⃝

Next we will explore problems concerning the concept of linear subspaces. A vector subspace of a vector space V is a non-empty subset W of V which is closed under addition and scalar multiplication. It can be given either as a span of some vectors (i.e., all of their linear combinations), or by some (linear) conditions.7 Therefore, any subspace of V is itself a vector space and shares the same fundamental properties and significance; see also the discussion in 2.3.4. In Sage we have numerous options to explore the properties and characteristics of subspaces. Once we have introduced a vector space V, a subspace W ⊂ V is defined by specifying a set of generators that span the subspace. This can be achieved with the subspace() function or with the span command.

2.C.7. Linear subspaces. Decide whether the following statements are true or false:
(a) The subset U_1 = {(x, y, z) ∈ R^3 : 2x + 7y + z = 1} is a linear subspace of R^3.
(b) The subset U_2 = {(x, y, z) ∈ R^3 : 2x^2 + 7y + z = 0} is a linear subspace of R^3.
(c) The subset U_3 = {(x, y, z) ∈ R^3 : 2x + 7y + z = 0} is a linear subspace of R^3.
(d) The subset U_4 = {ax^2 + c : a, c ∈ R} is a linear subspace of the vector space R_2[x] of polynomials with real coefficients and degree at most 2.
(e) The subset U_5 = {(x_1, x_2, x_3) ∈ R^3 : |x_1| = |x_2| = |x_3|} is a linear subspace of R^3.
(f) The subset U_6 = {p ∈ R_3[x] : p(1) = 0} is a linear subspace of the vector space R_3[x] of polynomials with real coefficients and degree at most 3.
(g) The set of real solutions of a homogeneous difference equation is a linear subspace of the vector space of all real sequences.
(h) The set \(\bigl\{\begin{pmatrix} a & b\\ 0 & c \end{pmatrix} : a, b, c \in \mathbb{R}\bigr\}\) of 2 × 2 upper triangular matrices is a subspace of Mat_2(R).
(i) The subset V_λ = {x ∈ R^n : Ax = λx}, where A ∈ Mat_n(R) and λ ∈ R, is a linear subspace of R^n.
(j) The subset {A ∈ Mat_n(R) : A = −A^T} of skew-symmetric n × n matrices is a subspace of Mat_n(R).

7 This implies that W contains the zero vector. Indeed, any w ∈ W will have an additive inverse −w ∈ W and hence 0 = w + (−w) ∈ W.

assigning the vector A · x to each vector x. If there is to be a solution, b must be in the image under this mapping, and thus it must be a linear combination of the columns of A. If we extend the matrix A by the column b, the number of linearly independent columns, and thus also rows, might increase (but does not have to). If this number increases, then b is not in the image and the system of equations does not have a solution. If, on the other hand, the number of linearly independent rows does not change after adding the column b to the matrix A, it means that b must be a linear combination of the columns of A. The coefficients of such combinations are then exactly the solutions of our system. Consider now two fixed solutions x and y of our system and some solution z of the homogeneous system with the same matrix. Then clearly
A · (x − y) = b − b = 0, A · (x + z) = b + 0 = b.
Thus we can summarise in the form of the so-called Kronecker–Capelli theorem1:

Kronecker–Capelli Theorem
Theorem. A solution of a non-homogeneous system of linear equations A · x = b exists if and only if adding the column b to the matrix A does not increase the number of linearly independent rows. In such a case the space of all solutions is given by all sums of one fixed particular solution of the system and all solutions of the homogeneous system with the same matrix.

2.3.6. Sums of subspaces. Since we now have some intuition about generators and the subspaces generated by them, we should understand the possibilities of how some subspaces can generate the whole space V.

Sum of subspaces
Let V_i, i ∈ I, be subspaces of V. Then the subspace generated by their union, that is, span ∪_{i∈I} V_i, is called the sum of the subspaces V_i. We denote it by W = ∑_{i∈I} V_i. Notably, for a finite number of subspaces V_1, . . . , V_k ⊂ V we write
W = V_1 + · · · + V_k = span(V_1 ∪ V_2 ∪ · · · ∪ V_k).

We see that every element in the considered sum W can be expressed as a linear combination of vectors from the subspaces V_i. Because vector addition is commutative, we can

1 A common formulation of this fact is "a system has a solution if and only if the rank of its matrix equals the rank of its extended matrix". Leopold Kronecker was a very influential German mathematician, who dealt with algebraic equations in general and in particular pushed forward Number Theory in the middle of the 19th century. Alfredo Capelli, an Italian, worked on algebraic identities. This theorem is equally often called by different names, e.g. the Rouché–Frobenius theorem or the Rouché–Capelli theorem. This is a very common feature in Mathematics.

Solution. (a) The subset U_1 cannot be a subspace of R^3, since the zero vector does not belong to U_1.
(b) This is also false, and U_2 cannot be a subspace of R^3. For example, the vector u = (1, 0, −2) ∈ U_2, but −u = (−1, 0, 2) ∉ U_2, since 2 · (−1)^2 + 7 · 0 + 2 = 4 ≠ 0.
(c) This is true and we leave the verification for practice.
(d) The set U_4 is a linear subspace of R_2[x], since
(a_1x^2 + c_1) + (a_2x^2 + c_2) = (a_1 + a_2)x^2 + (c_1 + c_2),
k(ax^2 + c) = (ka)x^2 + kc,
for all real numbers a_1, c_1, a_2, c_2, a, c, k ∈ R. Notice also that for a = c = 0 we obtain the zero polynomial.
(e) The subset U_5 is not a linear subspace of R^3, since for example (1, 1, 1) + (−1, 1, 1) = (0, 2, 2) ∉ U_5.
(f) The subset U_6 is clearly a subspace of R_3[x]. Indeed, if p, q ∈ U_6 and c ∈ R, then we see that
(p + q)(1) = p(1) + q(1) = 0, so (p + q) ∈ U_6,
(cp)(1) = c p(1) = c · 0 = 0, so (cp) ∈ U_6.
(g) Consider two sequences (x_j)_{j=0}^∞ and (y_j)_{j=0}^∞ satisfying the same homogeneous difference equation, that is,
a_n x_{n+k} + a_{n-1} x_{n+k-1} + · · · + a_0 x_k = 0,
a_n y_{n+k} + a_{n-1} y_{n+k-1} + · · · + a_0 y_k = 0.
By adding these equations, we obtain
a_n(x_{n+k} + y_{n+k}) + a_{n-1}(x_{n+k-1} + y_{n+k-1}) + · · · + a_0(x_k + y_k) = 0.
This means that the sequence (x_j + y_j)_{j=0}^∞ also satisfies the given equation. Analogously, if the sequence (x_j)_{j=0}^∞ satisfies the given equation, then so does (u x_j)_{j=0}^∞ for any u ∈ R. Thus the assertion is true.
(h) This claim is again true, as you can easily verify yourself.
(i) If x, y ∈ V_λ, then we see that
A(x + y) = Ax + Ay = λx + λy = λ(x + y),
A(cx) = c(Ax) = c(λx) = λ(cx),
and hence x + y ∈ V_λ and cx ∈ V_λ, for all x, y ∈ V_λ and c ∈ R. Hence the claim is true. Non-zero vectors lying in V_λ are called eigenvectors of A corresponding to the eigenvalue λ. Thus we have just proved that the eigenspace V_λ corresponding to the eigenvalue λ is a subspace of R^n. We will learn more about these notions later (see Section 2.D.1). Notice that when λ = 0, then V_0 = Ker(A) is the null space (kernel) of the square matrix A.
(j) Taking A, B ∈ Mat_n(R) with A^T = −A and B^T = −B, we see that (A + B)^T = −(A + B) and (cA)^T = −(cA) for all c ∈ R. Hence the claim is true. Show that the subset of n × n symmetric matrices is also a subspace of Mat_n(R). □

2.C.8. Construct in Sage the 2-dimensional subspaces of Q^3 generated over Q by two vectors of the standard basis e = {e_1 = (1, 0, 0)^T, e_2 = (0, 1, 0)^T, e_3 = (0, 0, 1)^T}. ⃝

Dealing with vector spaces, it is crucial to understand the linear dependence or independence of vectors. A basis is an independent set of generators. In finite-dimensional cases, this is the same as a maximal set of linearly independent vectors. Every finite-dimensional K-vector space V admits a basis; however, bases are not at

aggregate summands that belong to the same subspace, and for a finite sum of k subspaces we obtain
V_1 + V_2 + · · · + V_k = {v_1 + · · · + v_k; v_i ∈ V_i, i = 1, . . . , k}.
The sum W = V_1 + · · · + V_k ⊂ V is called the direct sum of subspaces if the intersection of each V_i with the sum of the other subspaces is trivial, that is, V_i ∩ ∑_{j≠i} V_j = {0}. We show that in such a case every vector w ∈ W can be written in a unique way as the sum w = v_1 + · · · + v_k, where v_i ∈ V_i. Indeed, if we could simultaneously write w as w = v′_1 + · · · + v′_k, then
0 = w − w = (v_1 − v′_1) + · · · + (v_k − v′_k).
If v_i − v′_i is the first nonzero term on the right-hand side, then this vector from V_i can be expressed using vectors from the other subspaces. This is a contradiction to the assumption that V_i has trivial intersection with the sum of the other subspaces. The only possibility is that all the vectors on the right-hand side are zero, and thus the expression of w is unique. For direct sums of subspaces we write
W = V_1 ⊕ · · · ⊕ V_k = ⊕_{i=1}^{k} V_i.

2.3.7. Basis.
Now we have everything prepared for understanding minimal sets of generators, as we understood them in the plane R^2, and for proving the promised independence of the number of basis elements of any choices. A basis of a k-dimensional space will usually be denoted as a k-tuple v = (v_1, . . . , v_k) of basis vectors. This is just a matter of convention: with finitely dimensional vector spaces we shall always consider the bases along with a given order of the elements, even if we have not defined it that way (strictly speaking).

Clearly, if (v_1, . . . , v_n) is a basis of V, then the whole space V is the direct sum of the one-dimensional subspaces
V = span{v_1} ⊕ · · · ⊕ span{v_n}.
An immediate corollary of the derived uniqueness of the decomposition of any vector w in V into components in the direct sum gives the unique decomposition
w = x_1v_1 + · · · + x_nv_n.
This allows us, after choosing a basis, to see the abstract vectors again as n-tuples of scalars. We shall return to this idea in paragraph 2.3.11, when we finish the discussion of the existence of bases and sums of subspaces in the general case.

2.3.8. Theorem. From any finite set of generators of a vector space V we can choose a basis. Every basis of a finitely dimensional space V has the same number of elements.

Proof. The first claim is easily proved using induction on the number of generators k.

all unique, cf. 2.3.7, 2.3.10. At the same time, the number of vectors in any basis is unique, and we call it the dimension dim_K(V) of V. We write just dim(V) if the field K is clear from the context. Notice that vector spaces (over a fixed field K) of the same finite dimension are isomorphic, cf. 2.3.13.

2.C.9. Consider any matrix A of size m × n over a field K. Besides the kernel Ker(A) = {x ∈ K^n : Ax = 0} of A, there are several other fundamental subspaces associated with A:
• The column space C(A) = {Ax : x ∈ K^n}, which is the span of the columns of A.
• The row space C(A^T) = {A^T y : y ∈ K^m}, which is the span of the rows of A.
• The left null space Ker(A^T) = {y ∈ K^m : A^T y = 0}, also known as the cokernel of A.
Prove that C(A^T) is a subspace of K^n, while Ker(A^T) and C(A) are subspaces of K^m. Moreover, show that dim_K C(A) = rank(A) = dim_K C(A^T). ⃝

Although vector spaces over infinite fields are always infinite sets, Sage can effectively handle them as mathematical objects. The main tool for this is proper work with the concept of a basis. See our discussion in 2.C.14 for the finite case, and see also the discussion in 2.3.10.

2.C.10. Determine whether or not the vectors u_1 = (1, 2, 3, 1)^T, u_2 = (1, 0, −1, 1)^T, u_3 = (2, 1, −1, 3)^T and u_4 = (0, 0, 3, 2)^T are linearly independent. Do they provide a basis of R^4? Describe an answer using Sage as well.

Solution. Given n vectors in R^n, say x_1 = (x_{11}, . . . , x_{1n})^T, . . . , x_n = (x_{n1}, . . . , x_{nn})^T, it is easy to check that the condition det(A) ≠ 0 is equivalent to their linear independence, where A is the coefficient matrix
\[
A = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1n}\\ x_{21} & x_{22} & \dots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \dots & x_{nn} \end{pmatrix}.
\]
For the given vectors in R^4, we compute
\[
\det(A) = \begin{vmatrix} 1 & 2 & 3 & 1\\ 1 & 0 & -1 & 1\\ 2 & 1 & -1 & 3\\ 0 & 0 & 3 & 2 \end{vmatrix} = 10 \neq 0.
\]
Thus the set u = (u_1, u_2, u_3, u_4) consists of linearly independent vectors. Since dim R^4 = 4, any four linearly independent vectors span the whole space, so u is indeed a basis. To check the linear independence/dependence in Sage, we can rely on built-in functions, as follows:

u1=vector(RR, [1, 2, 3, 1])
Only the zero subspace does not need a generator, and thus we are able to choose the empty basis. On the other hand, we cannot choose the zero vector (the generators would then be linearly dependent), and there is nothing else in the subspace. In order to make the inductive step more natural, we deal with the case k = 1 first. We have V = span{v} and v ≠ 0, because {v} is a linearly independent set of vectors. Then {v} is also a basis of the vector space V, and any other vector is a multiple of v, so all bases of V must contain exactly one vector, which can be chosen from any set of generators.

Assume that the claim holds for k = n and consider V = span{v_1, . . . , v_{n+1}}. If v_1, . . . , v_{n+1} are linearly independent, then they form a basis. If they are linearly dependent, there exists i such that
v_i = a_1v_1 + · · · + a_{i-1}v_{i-1} + a_{i+1}v_{i+1} + · · · + a_{n+1}v_{n+1}.
Then V = span{v_1, . . . , v_{i-1}, v_{i+1}, . . . , v_{n+1}}, and we can choose a basis using the inductive assumption.

It remains to show that bases always have the same number of elements. Consider a basis v = (v_1, . . . , v_n) of the space V and, for an arbitrary nonzero vector u = a_1v_1 + · · · + a_nv_n ∈ V, suppose a_i ≠ 0 for some i. Then
v_i = (1/a_i)(u − (a_1v_1 + · · · + a_{i-1}v_{i-1} + a_{i+1}v_{i+1} + · · · + a_nv_n)),
and therefore also span{u, v_1, . . . , v_{i-1}, v_{i+1}, . . . , v_n} = V. We show that this is again a basis. For if adding u to the linearly independent vectors v_1, . . . , v_{i-1}, v_{i+1}, . . . , v_n led to a set of linearly dependent vectors, then V = span{v_1, . . . , v_{i-1}, v_{i+1}, . . . , v_n}, which would yield a basis of n − 1 vectors chosen from v, which is not possible. Thus we have proved that for any nonzero vector u ∈ V there exists i, 1 ≤ i ≤ n, such that (u, v_1, . . . , v_{i-1}, v_{i+1}, . . . , v_n) is again a basis of V.

Similarly, instead of one vector u, we can consider a linearly independent set u_1, . . . , u_k. We will sequentially add u_1, u_2, . . . , always exchanging for some v_i using our previous approach. We have to ensure that there always is such a v_i to be replaced (that is, that the vectors u_i do not end up replacing each other). Assume thus that we have already placed u_1, . . . , u_ℓ instead of some of the v_j's. Then the vector u_{ℓ+1} can be expressed as a linear combination of the latter vectors u_i and the remaining v_j's. As we have seen, u_{ℓ+1} may replace any vector with a non-zero coefficient in this linear combination. If only the coefficients at u_1, . . . , u_ℓ were nonzero, then the vectors u_1, . . . , u_{ℓ+1} would be linearly dependent, which is a contradiction. Summarizing, for every k ≤ n we can arrive after k steps at a basis in which k vectors from the original basis have been exchanged for the new u_i's. If k > n, then in the n-th step we

u2=vector(RR, [1, 0, -1, 1])
u3=vector(RR, [2, 1, -1, 3])
u4=vector(RR, [0, 0, 3, 2])
V=RR^4; V.linear_dependence([u1, u2, u3, u4]) ==[]

Sage's output is True. In case you want to compute only the determinant of A, replace the last line by the syntax given here:

A=column_matrix([u1, u2, u3, u4])
show(det(A))

□

2.C.11. Consider the vector space V = Q^4 and the vectors v_1 = (1, 0, 1, 2)^T and v_2 = (0, 1, 0, 0)^T, both lying in V. What is the dimension of the vector subspace of V = Q^4 spanned by v_1, v_2? ⃝

2.C.12. (a) Consider Q^3 over the rationals and let W be the subspace of Q^3 generated by the vectors v_1 = (2, −1, 3)^T and v_2 = (4, 0, 1)^T. Determine the expression of a general vector of W and next use Sage to determine the dimension of W.
What happens if we replace v_2 by a scalar multiple of v_1?
(b) Consider Q^4 over the rationals and let W be the subspace spanned by the vectors u_1 = (1, 1, 0, 0)^T and u_2 = (0, 1, 1, 0)^T. Describe the quotient space V/W (i.e., "what is left" of V when viewing vectors "up to W", cf. 3.4.13) and next verify your answer via Sage. ⃝

2.C.13. For the second task in 2.C.12, find a basis of the quotient space V/W. Confirm your answer via Sage by using the method vector in subspace. ⃝

2.C.14. (a) Show that the following set of vectors is a subspace of Z_5^3:
\[
U = \left\{ \begin{pmatrix} a\\ 2b\\ 3a + 4b \end{pmatrix} : a, b \in \mathbb{Z}_5 \right\}.
\]
(b) Find the reduced row echelon form of the matrix A having as columns the generators of U. Next deduce that dim_{Z_5} U = 2 by finding a basis of U. ⃝

We can often rely on Sage to perform mathematical "experiments". In linear algebra, for example, this can be done using matrices or vectors with random coefficients. Such experiments can be useful to test theoretical statements, or to illustrate them in a random way. Let us describe such an example.

2.C.15. Provide in Sage a short code that generates a random (ordered) tuple S of vectors within V = Q^4 and checks their linear independence. Moreover, create a random 4 × 4 matrix A whose columns are the vectors in S, and use the matrix's rank to establish an alternative criterion for the linear independence of S (recall the following basic criterion for the invertibility of a square matrix A ∈ Mat_n(K): A is invertible if and only if rank(A) = n).

would obtain a basis consisting only of the new vectors u_i, which means that the original set could not be linearly independent. In particular, it is not possible for two bases to have a different number of elements. □

In fact, we have proved a much stronger claim, the Steinitz exchange lemma:

Steinitz Exchange Lemma
For every finite basis v of a vector space V and every set of linearly independent vectors u_i, i = 1, . . . , k in V, we can find a subset of the basis vectors v_j which completes the set of the u_i's into a new basis.

2.3.9. Corollaries of the Steinitz lemma. Because of the possibility of freely choosing and replacing basis vectors, we can immediately derive nice (and intuitively expected) properties of bases of vector spaces:

Proposition. (1) Every two bases of a finite dimensional vector space have the same number of elements, that is, our definition of dimension is basis-independent.
(2) If V has a finite basis, then every linearly independent set can be extended to a basis.
(3) A basis of a finite dimensional vector space is a maximal linearly independent set of vectors.
(4) The bases of a vector space are the minimal sets of generators.

A little more complicated, but now easy to deal with, is the situation of dimensions of subspaces and their sums:

Corollary. Let W, W_1, W_2 ⊂ V be subspaces of a space V of finite dimension. Then
(1) dim W ≤ dim V,
(2) V = W if and only if dim V = dim W,
(3) dim W_1 + dim W_2 = dim(W_1 + W_2) + dim(W_1 ∩ W_2).

Proof. It remains to prove only the last claim. This is evident if the dimension of one of the spaces is zero. Assume dim W_1 = r ≥ 1, dim W_2 = s ≥ 1 and let (w_1, . . . , w_t) be a basis of W_1 ∩ W_2 (or the empty set, if the intersection is trivial). According to the Steinitz exchange lemma, this basis of the intersection can be extended to a basis (w_1, . . . , w_t, u_{t+1}, . . . , u_r) of W_1 and to a basis (w_1, . . . , w_t, v_{t+1}, . . . , v_s) of W_2. The vectors w_1, . . . , w_t, u_{t+1}, . . . , u_r, v_{t+1}, . . . , v_s clearly generate W_1 + W_2.
We show that they are linearly independent. Let
a_1w_1 + · · · + a_tw_t + b_{t+1}u_{t+1} + · · · + b_ru_r + c_{t+1}v_{t+1} + · · · + c_sv_s = 0.

Solution. To introduce a random vector in Sage we can use the command V.random_element(), where V is the underlying vector space. For instance, the cell

V=QQ^4
u=V.random_element(); show(u)

prints a random vector of Q^4. Each time the code is run, a new vector appears on the screen. To construct simultaneously four random vectors lying in Q^4 and check their linear independence, we can instead type

V=QQ^4
S=[V.random_element() for i in range(4)]
show(S); V.linear_dependence(S)==[]

which returns an ordered random tuple S of vectors in Q^4, and moreover the message "True" or "False", depending on the linear independence/dependence of S. Or type

V=QQ^4
S=[V.random_element() for i in range(4)]
W=V.span(S); print(W)

In this case, if Sage confirms that W is a 4-dimensional subspace, then the set S of four random vectors is linearly independent. Remarkably, in most cases Sage returns a linearly independent set, and hence this method can be used when you need to locate many different (ordered) bases of a given vector space. We recommend practicing with further examples in your editor. Finally, to introduce the required matrix A and specify a rank-based criterion for linear independence or dependence, you can add the following code to either of the previous two cells:

A=column_matrix(S)
if rank(A) < 4 : # then linearly dependent
    print(False)
else : # then linearly independent
    print(True)

Thus, when A is of full rank, i.e., rank(A) = 4, S consists of linearly independent vectors and Sage prints True. When rank(A) < 4, S consists of linearly dependent vectors and the message is False. □

2.C.16. Given arbitrary linearly independent vectors u, v, w, z in a vector space V, decide whether or not the vectors u − 2v, 3u + w − z, u − 4v + w + 2z, 4v + 8w + 4z are also linearly independent. ⃝

2.C.17. Determine the coordinates of the vector w = (1, 1, 1)^T with respect to the following ordered basis of R^3:
u = {u_1 = (1, 2, 1)^T, u_2 = (−1, 1, 0)^T, u_3 = (0, 1, 1)^T}.
Next, verify your answer in Sage, based for example on the command coordinates, or any other method.

Then necessarily
−(c_{t+1} · v_{t+1} + · · · + c_s · v_s) = a_1 · w_1 + · · · + a_t · w_t + b_{t+1} · u_{t+1} + · · · + b_r · u_r
must belong to W_2 ∩ W_1. Since this intersection is spanned by the w_i's and the expression in the basis of W_1 is unique, this implies that b_{t+1} = · · · = b_r = 0. But then also
a_1 · w_1 + · · · + a_t · w_t + c_{t+1} · v_{t+1} + · · · + c_s · v_s = 0,
and because the corresponding vectors form a basis of W_2, all the coefficients are zero. Claim (3) now follows by directly counting the generators. □

2.3.10. Examples. (1) K^n has (as a vector space over K) dimension n. The n-tuple of vectors
((1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, . . . , 0, 1))
is clearly a basis, called the standard basis of K^n. Note that in the case of a finite field of scalars, say Z_k with k prime, the whole space K^n has only a finite number k^n of elements.
(2) C as a vector space over R has dimension 2. A basis is for instance the pair of numbers 1 and i, or any other two complex numbers which are not real multiples of each other, e.g. 1 + i and 1 − i.
(3) K_m[x], that is, the space of all polynomials with coefficients in K of degree at most m, has dimension m + 1. A basis is for instance the sequence 1, x, x^2, . . . , x^m.
The vector space of all polynomials K[x] has dimension ∞, but we can still find a basis (although infinite in size): 1, x, x^2, . . . .
(4) The vector space R over Q has dimension ∞; it does not have a countable basis.
(5) The vector space of all mappings f : R → R also has dimension ∞; it does not have any countable basis.

2.3.11. Vector coordinates. If we fix a basis (v_1, . . . , v_n) of a finite dimensional space V, then every vector w ∈ V can be expressed as a linear combination w = a_1v_1 + · · · + a_nv_n in a unique way. Indeed, assume that we can do it in two ways:
w = a_1v_1 + · · · + a_nv_n = b_1v_1 + · · · + b_nv_n.
Then 0 = (a_1 − b_1) · v_1 + · · · + (a_n − b_n) · v_n, and thus a_i = b_i for all i = 1, . . . , n, because the vectors v_i are linearly independent. We have reached the concept of coordinates:

Solution. When a vector u ∈ R^n is given in coordinates, such as w, it is always understood to be expressed with respect to the standard basis, unless stated otherwise. The assumption that u is an ordered basis, with a fixed and well-defined order as indicated by the positions of the vectors u_1, u_2, u_3, is also crucial. To solve the problem we need to determine reals a, b and c ∈ R which satisfy the following matrix equation:
\[
a \cdot \begin{pmatrix} 1\\ 2\\ 1 \end{pmatrix} + b \cdot \begin{pmatrix} -1\\ 1\\ 0 \end{pmatrix} + c \cdot \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ 1 \end{pmatrix}.
\]
This is equivalent to (a − b, 2a + b + c, a + c)^T = (1, 1, 1)^T, which induces the following system of equations:
a − b = 1, 2a + b + c = 1, a + c = 1.
It is easy to see that this system has a unique solution, given by a = 1/2, b = −1/2, c = 1/2. Thus the coordinates of w with respect to u are given by (1/2, −1/2, 1/2).

There are several different methods to demonstrate a verification via Sage; here we describe two. The first method uses the coordinates command, which is designed to determine the coordinates of a vector with respect to a given basis. Its implementation goes as follows:

V=RR^3
u1=vector([1,2,1]);u2=vector([-1,1,0])
u3=vector([0, 1, 1]);w=vector([1, 1, 1])
L=[u1, u2, u3]
W=V.subspace_with_basis(L)
cord=W.coordinates(w); show(cord)
sum([cord[i]*L[i] for i in range(3)])==w

This cell returns the coefficients of w with respect to u, along with the message True. The latter indicates that the command in the last row verifies the accuracy of the result from show(cord) (i.e., Sage is used to verify its own output). The second method relies on computing the reduced row echelon form of the matrix having as columns the vectors of the given basis together with w, that is, of the extended matrix corresponding to the linear system posed above. For this, add to the previous cell the code

B=column_matrix([u1, u2, u3, w])
B1=B.rref(); show(B1[:,3])

By executing the block, Sage prints out the fourth column of the reduced row echelon form, which indeed consists of the coefficients of w with respect to the ordered basis u. □

So far we have demonstrated a straightforward algorithm to identify a maximal linearly independent set within a given collection of vectors. The steps are as follows:
• Arrange the given vectors as columns in a matrix.
• Transform the matrix into reduced row echelon form using row operations.
• Identify the vectors corresponding to the pivot columns; these form a maximal linearly independent set.

Coordinates of vectors
Definition. The coefficients of the unique linear combination expressing the given vector w ∈ V in the chosen basis v = (v_1, . . . , v_n) are called the coordinates of the vector w in this basis.
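Since this linear combination is unique, the coordinates can also be read off by inverting the matrix whose columns are the basis vectors. A hedged Sage sketch for the basis u of 2.C.17 (the matrix name B is our own choice):

# columns of B are the basis vectors u1, u2, u3; solving B*(a,b,c)^T = w
B = column_matrix(QQ, [[1, 2, 1], [-1, 1, 0], [0, 1, 1]])
w = vector(QQ, [1, 1, 1])
print(B.inverse() * w)       # (1/2, -1/2, 1/2), the coordinates of w in u
print(B.solve_right(w))      # the same, without forming the inverse explicitly

In practice solve_right is preferable to computing the inverse, but both recover exactly the coefficients found above.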
Whenever we speak about coordinates (a_1, . . . , a_n) of a vector w, which we express as a sequence, we must have a fixed ordering of the basis vectors v = (v_1, . . . , v_n). Although we have defined a basis as a minimal set of generators, in reality we work with bases as with sequences (that is, with completely ordered sets).

Assigning coordinates to vectors
The mapping assigning to the vector v = a_1v_1 + · · · + a_nv_n its coordinates in the basis v will be denoted by the same symbol v : V → K^n. It has the following properties:
(1) v(u + w) = v(u) + v(w), ∀u, w ∈ V,
(2) v(a · u) = a · v(u), ∀a ∈ K, ∀u ∈ V.

Note that the operations on the two sides of these equations are not identical; quite the opposite, they are operations on different vector spaces!

Sometimes it is really useful to understand vectors as mappings from a fixed set of independent generators to coordinates (without having the generators ordered). In this way, we may think about a basis M of an infinite dimensional vector space V: even though the set M will be infinite, there can be only a finite number of non-zero values for any mapping representing a vector. The vector space of all polynomials K_∞[x], with the basis M = {1, x, x^2, . . . }, is a good example.

2.3.12. Linear mappings. The above properties of the assignment of coordinates are typical for what we have called linear mappings in the geometry of the plane R^2. For any vector spaces (of finite or infinite dimension) we define "linearity" of a mapping between spaces in a similar way to the case of the plane R^2:

Linear mappings
Let V and W be vector spaces over the same field of scalars K. The mapping f : V → W is called a linear mapping, or homomorphism, if the following holds:
(1) f(u + v) = f(u) + f(v), ∀u, v ∈ V,
(2) f(a · u) = a · f(u), ∀a ∈ K, ∀u ∈ V.

We have seen such mappings already in the case of matrix multiplication:
f : K^n → K^m, x ↦ A · x,
with a fixed matrix A of the type m/n over K. Analogously to the abstract definition of vector spaces, it is again necessary to prove seemingly trivial claims that follow from the axioms:

This approach is essentially what was utilized in the second method described in 2.C.17. For a theoretical validation and a more in-depth discussion, see the theoretical section 2.3.5.

2.C.18. Determine the vector subspace of the Euclidean space R^4 generated by the vectors u_1 = (−1, 3, −2, 1)^T, u_2 = (2, −1, −1, 2)^T, u_3 = (−4, 7, −3, 0)^T, and u_4 = (1, 5, −5, 4)^T.

Solution. Write the vectors u_i into the columns of a matrix and transform it using elementary row transformations. This gives
\[
\begin{pmatrix} -1 & 2 & -4 & 1\\ 3 & -1 & 7 & 5\\ -2 & -1 & -3 & -5\\ 1 & 2 & 0 & 4 \end{pmatrix} \sim
\begin{pmatrix} 1 & 2 & 0 & 4\\ -1 & 2 & -4 & 1\\ 3 & -1 & 7 & 5\\ -2 & -1 & -3 & -5 \end{pmatrix} \sim
\begin{pmatrix} 1 & 2 & 0 & 4\\ 0 & 4 & -4 & 5\\ 0 & -7 & 7 & -7\\ 0 & 3 & -3 & 3 \end{pmatrix} \sim
\begin{pmatrix} 1 & 2 & 0 & 4\\ 0 & 1 & -1 & 5/4\\ 0 & 1 & -1 & 1\\ 0 & 0 & 0 & 0 \end{pmatrix} \sim
\begin{pmatrix} 1 & 2 & 0 & 4\\ 0 & 1 & -1 & 5/4\\ 0 & 0 & 0 & -1/4\\ 0 & 0 & 0 & 0 \end{pmatrix} \sim
\begin{pmatrix} 1 & 0 & 2 & 0\\ 0 & 1 & -1 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 0 \end{pmatrix}.
\]
Now, according to the algorithm above, a maximal linearly independent set (basis) consists of those vectors corresponding to the columns with the pivots. For our case, the pivots in the reduced row echelon form sit in the first, second and fourth columns8, and hence a maximal linearly independent set consists of the vectors u_1, u_2 and u_4. This means that the vector subspace generated by {u_1, u_2, u_3, u_4} is only 3-dimensional. Finally, let us verify the output of our algorithm via Sage (where for convenience we view the vectors as elements of Q^4).
Let us check first that the given vectors are indeed linearly dependent:

v1=vector(QQ, [-1, 3, -2, 1])
v2=vector(QQ, [2, -1, -1, 2])
v3=vector(QQ, [-4, 7, -3, 0])
v4=vector(QQ, [1, 5, -5, 4])
V=QQ^4; L=[v1, v2, v3, v4]
V.linear_dependence(L)==[]

Next we apply the trick with the pivot elements:

M=column_matrix([v1, v2, v3, v4])
M1=M.rref(); show(M1); M.pivots()

with the last command returning the desired (0, 1, 3). □

2.C.19. Derive in Sage a method analogous to the one presented in 2.C.18, with the aim of completing the following set of vectors to a basis of the vector space Q^4:
L = {v_1 = (2, 0, 1, 0)^T, v_2 = (1, 0, 1, 0)^T}. ⃝

8 We will adopt this tactic especially in Chapter 3.

Proposition. Let f : V → W be a linear mapping between two vector spaces over the same field of scalars K. The following holds for all vectors u, u_1, . . . , u_k ∈ V and scalars a_1, . . . , a_k ∈ K:
(1) f(0) = 0,
(2) f(−u) = −f(u),
(3) f(a_1 · u_1 + · · · + a_k · u_k) = a_1 · f(u_1) + · · · + a_k · f(u_k),
(4) for every vector subspace V_1 ⊂ V, its image f(V_1) is a vector subspace in W,
(5) for every vector subspace W_1 ⊂ W, the set f^{-1}(W_1) = {v ∈ V; f(v) ∈ W_1} is a vector subspace in V.

Proof. We rely on the axioms, definitions and already proved results (in case you are not sure what has been used, look it up!):
f(0) = f(u − u) = f((1 − 1) · u) = 0 · f(u) = 0,
f(−u) = f((−1) · u) = (−1) · f(u) = −f(u).
Property (3) is derived easily from the definition for two summands, using induction on the number of summands. Next, (3) implies span f(V_1) = f(V_1), thus it is a vector subspace. On the other hand, if f(u) ∈ W_1 and f(v) ∈ W_1, then for any scalars we arrive at f(a · u + b · v) = a · f(u) + b · f(v) ∈ W_1. □

The image of a linear mapping, Im f = f(V) ⊂ W, is always a vector subspace, since for any set of vectors u_i, the linear combination of the images f(u_i) is the image of the linear combination of the vectors u_i with the same coefficients. Analogously, the set of all vectors Ker f = f^{-1}({0}) ⊂ V is a subspace, since any linear combination of vectors with zero images is again mapped to the zero vector. The subspace Ker f is called the kernel of the linear mapping f. A linear mapping which is a bijection is called an isomorphism.

2.3.13. Proposition (Simple corollaries). (1) The composition g ◦ f : V → Z of two linear mappings f : V → W and g : W → Z is again a linear mapping.
(2) The linear mapping f : V → W is an isomorphism if and only if Im f = W and Ker f = {0} ⊂ V. The inverse mapping of an isomorphism is again an isomorphism.
(3) For any two subspaces V_1, V_2 ⊂ V and a linear mapping f : V → W,
f(V_1 + V_2) = f(V_1) + f(V_2),
f(V_1 ∩ V_2) ⊂ f(V_1) ∩ f(V_2).
(4) The "coordinate assignment" mapping u : V → K^n given by an arbitrarily chosen basis u = (u_1, . . . , u_n) of a vector space V is an isomorphism.
(5) Two finitely dimensional vector spaces are isomorphic if and only if they have the same dimension.
(6) The composition of two isomorphisms is an isomorphism.

2.C.20. Consider the matrix
\[
A = \begin{pmatrix} 2 & 4\\ 1 & 3\\ 0 & 5 \end{pmatrix}.
\]
i) Find the column space C(A), the row space C(A^T), the kernel Ker(A) and the cokernel Ker(A^T) of A.
ii) Compute the dimensions of these subspaces, using Sage as well. Hint: Use the commands column_space, row_space, right_kernel and left_kernel.

Solution. (i) The given matrix A has size 3 × 2, and thus the column space should be a subspace of R^3. Let us check the linear independence of the two column vectors of A.
By performing elementary row operations we see that
\[
A = \begin{pmatrix} 2 & 4\\ 1 & 3\\ 0 & 5 \end{pmatrix}
\xrightarrow[R_3 \to \frac15 R_3]{R_1 \to \frac12 R_1}
\begin{pmatrix} 1 & 2\\ 1 & 3\\ 0 & 1 \end{pmatrix}
\xrightarrow{R_2 \to R_2 - R_1}
\begin{pmatrix} 1 & 2\\ 0 & 1\\ 0 & 1 \end{pmatrix}
\xrightarrow{R_3 \to R_3 - R_2}
\begin{pmatrix} 1 & 2\\ 0 & 1\\ 0 & 0 \end{pmatrix}
\xrightarrow{R_1 \to R_1 - 2R_2}
\begin{pmatrix} 1 & 0\\ 0 & 1\\ 0 & 0 \end{pmatrix}.
\]
Thus rank(A) = 2 and the column vectors are linearly independent. Hence C(A) = span{(2, 1, 0)^T, (4, 3, 5)^T} with dim_R C(A) = 2. From the reduced row echelon form of A we also deduce that the first two rows of A are linearly independent. For instance, it is easy to see that the third row is a linear combination of the previous two: (0, 5)^T = a(2, 4)^T + b(1, 3)^T with a = −5/2 and b = 5. Thus C(A^T) = span{(2, 4)^T, (1, 3)^T} with dim_R C(A^T) = 2 as well.

Recall that null spaces are the solution spaces of homogeneous linear systems Au = 0. For our case, this system has the form
\[
\begin{pmatrix} 2 & 4\\ 1 & 3\\ 0 & 5 \end{pmatrix} \begin{pmatrix} x_1\\ x_2 \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ 0 \end{pmatrix},
\]
and it is easy to see that the unique solution is given by x_1 = 0 = x_2. Thus Ker(A) is trivial, Ker(A) = {(0, 0)^T} ⊂ R^2. In other words, the linear map F : R^2 → R^3 corresponding to A is an injection. On the other hand, the left null space is the kernel of the transposed matrix A^T. This space consists of vectors w = (w_1, w_2, w_3)^T ∈ R^3 such that A^T w = 0, i.e.,
\[
\begin{pmatrix} 2 & 1 & 0\\ 4 & 3 & 5 \end{pmatrix} \begin{pmatrix} w_1\\ w_2\\ w_3 \end{pmatrix} = \begin{pmatrix} 0\\ 0 \end{pmatrix}.
\]
Solving the corresponding system we obtain Ker(A^T) = span{(5, −10, 2)^T} ⊂ R^3, and hence the cokernel of A is 1-dimensional.
(ii) In Sage, you can easily execute the commands outlined in the statement as follows:

A=matrix([[2, 4], [1, 3], [0, 5]])
rank(A)
print(A.column_space())
print(dim(A.column_space())==rank(A))
print(A.row_space())

Proof. Proving the first claim is a very easy exercise left to the reader. In order to verify (2), notice that f is surjective if and only if Im f = W. If Ker f = {0}, then f(u) = f(v) ensures f(u − v) = 0, that is, u = v; in this case f is injective. Finally, if f is a linear bijection, then the vector w is the preimage of a linear combination au + bv, that is, w = f^{-1}(au + bv), if and only if f(w) = au + bv = f(a · f^{-1}(u) + b · f^{-1}(v)). Thus we also get w = a f^{-1}(u) + b f^{-1}(v), and therefore the inverse of a linear bijection is again a linear bijection. The third property is obvious from the definition, but try finding an example showing that the inclusion in the second relation can indeed be sharp. The remaining claims all follow immediately from the definitions. □

2.3.14. Coordinates again. Consider any two vector spaces V and W over K with dim V = n, dim W = m, and consider some linear mapping f : V → W. For every choice of a basis u = (u_1, . . . , u_n) on V and v = (v_1, . . . , v_m) on W, there are the following linear mappings, as shown in the diagram:
\[
\begin{array}{ccc}
V & \xrightarrow{\;f\;} & W\\
u\downarrow\simeq & & \simeq\downarrow v\\
K^n & \xrightarrow{\;f_{u,v}\;} & K^m
\end{array}
\]
The bottom arrow f_{u,v} is defined by the remaining three, i.e., as the composition of linear mappings f_{u,v} = v ◦ f ◦ u^{-1}.

Matrix of a linear mapping
Every linear mapping is uniquely determined by its values on an arbitrary set of generators, in particular on the vectors of a basis u. Denote by
\[
\begin{aligned}
f(u_1) &= a_{11}v_1 + a_{21}v_2 + \cdots + a_{m1}v_m,\\
f(u_2) &= a_{12}v_1 + a_{22}v_2 + \cdots + a_{m2}v_m,\\
&\ \ \vdots\\
f(u_n) &= a_{1n}v_1 + a_{2n}v_2 + \cdots + a_{mn}v_m,
\end{aligned}
\]
that is, the scalars a_{ij} form a matrix A, whose columns are the coordinates of the values f(u_j) of the mapping f on the basis vectors, expressed in the basis v on the target space W. The matrix A = (a_{ij}) is called the matrix of the mapping f in the bases u, v.
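As a small hedged Sage illustration of this definition (the mapping f and all names here are ad hoc choices of ours, not from the text), the matrix in the standard bases is assembled column by column from the images of the basis vectors:

# f(x, y) = (x + y, x - y, 2x) as a linear mapping Q^2 -> Q^3
f = lambda v: vector(QQ, [v[0] + v[1], v[0] - v[1], 2*v[0]])
A = column_matrix([f(e) for e in (QQ^2).basis()])   # columns = f(e1), f(e2)
print(A)                                            # the 3x2 matrix of f
v = vector(QQ, [3, 5])
print(A * v == f(v))                                # True: f acts as A on coordinates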
print(dim(A.row_space())==rank(A))
print(A.right_kernel())
print((A.T).right_kernel())
print(A.left_kernel())
print((A.T).right_kernel()==A.left_kernel())

In this block the final line confirms that the command (A.T).right_kernel() can be used as an alternative to A.left_kernel(). □

In linear algebra, there are several classical constructions that generate new vector spaces from existing ones. Examples include intersections, direct sums (see e.g. 2.3.6), and quotients of vector spaces. While intersection means just solving systems of linear equations on coordinates, sums require merging all generators (and keeping only a maximal independent subset). We already touched on quotient spaces in 2.C.12, although they are formally introduced in Chapter 3, see 3.4.13. Since we are short of space, we display only a few tasks; further discussion appears later, see also Section E.

2.C.21. In R^4 consider the three-dimensional linear subspaces U = span{u_1, u_2, u_3} and V = span{v_1, v_2, v_3}, where the vectors u_i and v_i (i = 1, 2, 3) are respectively given by
u_1 = (1, 1, 1, 0)^T, v_1 = (1, 1, −1, −1)^T,
u_2 = (1, 1, 0, 1)^T, v_2 = (1, −1, 1, −1)^T,
u_3 = (1, 0, 1, 1)^T, v_3 = (1, −1, −1, 1)^T.
Determine explicitly the subspace W := U ∩ V and verify your answer via Sage. Moreover, compute dim(W) over R.

Solution. The intersection U ∩ V contains exactly those vectors which are linear combinations of the vectors u_i and also linear combinations of the vectors v_i, for i = 1, 2, 3. Thus we search for some x_1, x_2, x_3, y_1, y_2, y_3 ∈ R satisfying
x_1u_1 + x_2u_2 + x_3u_3 = y_1v_1 + y_2v_2 + y_3v_3.
This means that we are looking for a solution of the following system of linear equations:
\[
\begin{aligned}
x_1 + x_2 + x_3 &= y_1 + y_2 + y_3,\\
x_1 + x_2 &= y_1 - y_2 - y_3,\\
x_1 + x_3 &= -y_1 + y_2 - y_3,\\
x_2 + x_3 &= -y_1 - y_2 + y_3.
\end{aligned}
\]
We convert this to a homogeneous system of linear equations by moving all variables to the left-hand side. Its matrix reads
\[
A = \begin{pmatrix}
1 & 1 & 1 & -1 & -1 & -1\\
1 & 1 & 0 & -1 & 1 & 1\\
1 & 0 & 1 & 1 & -1 & 1\\
0 & 1 & 1 & 1 & 1 & -1
\end{pmatrix}.
\]
Let us now apply row operations to find an echelon form:
\[
A \sim \begin{pmatrix}
1 & 1 & 1 & -1 & -1 & -1\\
0 & 0 & -1 & 0 & 2 & 2\\
0 & -1 & 0 & 2 & 0 & 2\\
0 & 1 & 1 & 1 & 1 & -1
\end{pmatrix}
\]

For a general vector u = x_1u_1 + · · · + x_nu_n ∈ V we calculate (recall that vector addition is commutative and distributive with respect to scalar multiplication)
\[
\begin{aligned}
f(u) &= x_1f(u_1) + \cdots + x_nf(u_n)\\
&= x_1(a_{11}v_1 + \cdots + a_{m1}v_m) + \cdots + x_n(a_{1n}v_1 + \cdots + a_{mn}v_m)\\
&= (x_1a_{11} + \cdots + x_na_{1n})v_1 + \cdots + (x_1a_{m1} + \cdots + x_na_{mn})v_m.
\end{aligned}
\]
Using matrix multiplication we can now very easily and clearly write down the values of the mapping f_{u,v}(w) defined uniquely by the previous diagram. Recall that vectors in K^ℓ are understood as columns, that is, matrices of the type ℓ/1:
f_{u,v}(u(w)) = v(f(w)) = A · u(w).
On the other hand, if we have fixed bases on V and W, then every choice of a matrix A of the type m/n gives a unique linear mapping K^n → K^m, and thus also a mapping f : V → W. We have found a bijective correspondence between matrices of the fixed type (determined by the dimensions of V and W) and linear mappings V → W.

2.3.15. Coordinate transition matrix. If we choose V = W to be the same space, but with two different bases u, v, and take the identity mapping for f, then the approach from the previous paragraph expresses the vectors of the basis u in coordinates with respect to the basis v. Let the resulting matrix be T. Thus, we are applying the concept of the matrix of a linear mapping to the special case of the identity mapping id_V.
\[
\begin{array}{ccc}
V & \xrightarrow{\;\mathrm{id}_V\;} & V\\
u\downarrow\simeq & & \simeq\downarrow v\\
K^n & \xrightarrow{\;T = (\mathrm{id}_V)_{u,v}\;} & K^n
\end{array}
\]
The resulting matrix T is called the coordinate transition matrix for changing the basis from u to the basis v. The fact that the matrix T of the identity mapping yields exactly the transformation of coordinates between the two bases is easily seen. Consider the expression of u in the basis u,
u = x_1u_1 + · · · + x_nu_n,
and replace the vectors u_i by their expressions as linear combinations of the vectors v_i in the basis v. Collecting the terms properly, we obtain the coordinate expression x̄ = (x̄_1, . . . , x̄_n) of the same vector u in the basis v. It is enough just to reorder the summands and express the individual scalars at the vectors of the basis. But this is exactly what we do when forming the matrix of the identity mapping; thus x̄ = T · x. We have arrived at the following instruction for building the coordinate transition matrix:

\[
\sim \begin{pmatrix}
1 & 1 & 1 & -1 & -1 & -1\\
0 & 1 & 1 & 1 & 1 & -1\\
0 & 0 & -1 & 0 & 2 & 2\\
0 & 0 & 1 & 3 & 1 & 1
\end{pmatrix}
\sim \begin{pmatrix}
1 & 1 & 1 & -1 & -1 & -1\\
0 & 1 & 1 & 1 & 1 & -1\\
0 & 0 & 1 & 0 & -2 & -2\\
0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
\sim \begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0\\
0 & 1 & 1 & 0 & 0 & -2\\
0 & 0 & 1 & 0 & -2 & -2\\
0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
\sim \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 2\\
0 & 1 & 0 & 0 & 2 & 0\\
0 & 0 & 1 & 0 & -2 & -2\\
0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}.
\]
The final matrix is in row echelon form and gives
x_1 = −2t, x_2 = −2s, x_3 = 2s + 2t, y_1 = −s − t, y_2 = s, y_3 = t, with t, s ∈ R.
One can verify the previous computation in Sage by typing

A=matrix(QQ, [[1, 1, 1, -1, -1, -1],
[1, 1, 0, -1, 1, 1],
[1, 0, 1, 1, -1, 1],
[0, 1, 1, 1, 1, -1]])
show(A.echelon_form())

Hence we can claim that a general vector of W is written as
\[
\begin{pmatrix} x_1 + x_2 + x_3\\ x_1 + x_2\\ x_1 + x_3\\ x_2 + x_3 \end{pmatrix} = \begin{pmatrix} 0\\ -2t - 2s\\ 2s\\ 2t \end{pmatrix},
\]
that is, W = span_R{(0, −2t − 2s, 2s, 2t)^T : s, t ∈ R}. Obviously, W is generated by the vectors w_1 = (0, −1, 1, 0)^T and w_2 = (0, −1, 0, 1)^T, and it is easy to see that they are linearly independent. Thus dim_R(W) = 2. In Sage we can determine the intersection U ∩ V of two linear subspaces of a given vector space via the command U.intersection(V). For our example, give the cell

u1=vector([1, 1, 1, 0])
u2=vector([1, 1, 0, 1])
u3=vector([1, 0, 1, 1])
U=(RR**4).span([u1, u2, u3], QQ)
v1=vector([1, 1, -1, -1])
v2=vector([1, -1, 1, -1])
v3=vector([1, -1, -1, 1])
V=(RR**4).span([v1, v2, v3], QQ)
W=U.intersection(V)
show(W); print(dim(W))

□

2.C.22. Referring to the results presented in 2.C.20, demonstrate that for the given matrix A the following direct sum decompositions are valid:
R^3 = C(A) ⊕ Ker(A^T), R^2 = C(A^T) ⊕ Ker(A).

Solution. In 2.C.20 we obtained the expressions
C(A) = span{u_1 = (2, 1, 0)^T, u_2 = (4, 3, 5)^T}, Ker(A^T) = span{u_3 = (5, −10, 2)^T}.
In particular, we have seen that dim C(A) + dim Ker(A^T) = 2 + 1 = 3 = dim R^3. This, along with the relation C(A) ∩ Ker(A^T) = {0} that we are going to prove below, precisely establishes the conditions for a direct sum, i.e., R^3 = C(A) ⊕ Ker(A^T). Hence, suppose that v ∈ C(A) ∩ Ker(A^T). Then there exist scalars a, b, c ∈ R such that v = au_1 + bu_2 and v = cu_3, respectively,

Calculating the matrix for changing the basis
Proposition. The matrix T for the transition from the basis u to the basis v is obtained by taking the coordinates of the vectors of the basis u expressed in the basis v and writing them as the columns of the matrix T. The new coordinates x̄ in terms of the new basis v are then x̄ = T · x, where x is the coordinate vector in the original basis u.
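A hedged Sage sketch of this proposition (the two bases below are our own ad hoc choices): if the columns of U and V hold the bases u and v in standard coordinates, then the columns of T = V^{-1}U are exactly the v-coordinates of the u basis vectors.

U = column_matrix(QQ, [[1, 1], [1, -1]])   # basis u as columns
V = column_matrix(QQ, [[1, 0], [1, 1]])    # basis v as columns
T = V.inverse() * U                        # transition matrix from u to v
x = vector(QQ, [2, 3])                     # coordinates of a vector in u
# the same vector: U*x in standard coordinates, so V^{-1}*(U*x) in v-coordinates
print(T * x == V.inverse() * (U * x))      # True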
Because the inverse mapping to the identity mapping is again the identity mapping, the coordinate transition matrix is always invertible, and its inverse T−1 is the coordinate transition matrix in the opposite direction, that is, from the basis v to the basis u (just have a look at the diagram above and invert all the arrows).

2.3.16. More coordinates. Next, we are interested in the matrix of a composition of linear mappings. Consider another vector space Z over K of dimension k with basis w, a linear mapping g : W → Z, and denote the corresponding matrix by g_{v,w}:
\[
\begin{array}{ccccc}
V & \xrightarrow{\;f\;} & W & \xrightarrow{\;g\;} & Z\\
{\scriptstyle u}\downarrow\simeq & & {\scriptstyle v}\downarrow\simeq & & {\scriptstyle w}\downarrow\simeq\\
K^n & \xrightarrow{\;f_{u,v}\;} & K^m & \xrightarrow{\;g_{v,w}\;} & K^k
\end{array}
\]
The composition g ◦ f on the upper row corresponds to the matrix of the mapping Kn → Kk on the bottom, and we calculate directly (writing A for the matrix of f and B for the matrix of g in the chosen bases)
\[
g_{v,w} \circ f_{u,v}(x) = (w \circ g \circ v^{-1}) \circ (v \circ f \circ u^{-1})(x) = B \cdot (A \cdot x) = (B \cdot A) \cdot x = (g \circ f)_{u,w}(x)
\]
for every x ∈ Kn. By the associativity of matrix multiplication, the composition of mappings corresponds to the multiplication of the corresponding matrices. Note that isomorphisms correspond exactly to invertible matrices and that the matrix of the inverse mapping is the inverse matrix.

The same approach shows how the matrix of a linear mapping changes if we change the coordinates on both the domain and the codomain:
\[
\begin{array}{ccccccc}
V & \xrightarrow{\;\mathrm{id}_V\;} & V & \xrightarrow{\;f\;} & W & \xrightarrow{\;\mathrm{id}_W\;} & W\\
{\scriptstyle u'}\downarrow\simeq & & {\scriptstyle u}\downarrow\simeq & & {\scriptstyle v}\downarrow\simeq & & {\scriptstyle v'}\downarrow\simeq\\
K^n & \xrightarrow{\;T\;} & K^n & \xrightarrow{\;f_{u,v}\;} & K^m & \xrightarrow{\;S^{-1}\;} & K^m
\end{array}
\]
where T is the coordinate transition matrix from u′ to u and S is the coordinate transition matrix from v′ to v. If A is the original matrix of the mapping, then the matrix of the new mapping is given by A′ = S−1AT.

In the special case of a linear mapping f : V → V, that is, when the domain and the codomain are the same space V, we usually express f in terms of a single basis u of the space V. Then the change from the old basis to the new basis u′ with the coordinate transition matrix T leads to the new matrix A′ = T−1AT.

and we should have the relation au1 + bu2 = cu3. This gives the matrix equation
\[
a\begin{pmatrix} 2\\ 1\\ 0 \end{pmatrix} + b\begin{pmatrix} 4\\ 3\\ 5 \end{pmatrix} - c\begin{pmatrix} 5\\ -10\\ 2 \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ 0 \end{pmatrix}.
\]
To solve the corresponding homogeneous system, you can use Sage in the following straightforward manner:

var("a, b, c")
eq1=2*a+4*b-5*c; eq2=a+3*b+10*c; eq3=5*b-2*c
solve([eq1==0, eq2==0, eq3==0], a, b, c)

Sage’s output has the form [[a == 0, b == 0, c == 0]], and hence v = (0, 0, 0)T, i.e., C(A) ∩ Ker(AT) = {0}. This proves the first direct sum decomposition; you are encouraged to prove the second one, which is actually easier. □

In mathematics, structures are best understood via the mappings which preserve them. For vector spaces, these are the linear mappings, cf. 2.3.12. We have already met homotheties, rotations, and reflections in Chapter 1. By its very definition, a linear mapping f : V → W is uniquely determined by its action on the basis vectors of its domain.9 This results in the unique matrix representation of f, once bases on V and W are chosen. Thus, we are back at matrix calculus, and composition of mappings is given by products of matrices.

Two special subspaces are associated to a linear map f : V → W, the kernel Ker(f) = f−1({0}) ⊂ V and the image Im(f) = f(V) ⊂ W. Isomorphisms are those f with trivial kernel Ker(f) = {0} and full image Im(f) = W. Such mappings are both injective and surjective, and their matrices are invertible. Any transformation of coordinates is such an example. Although we usually write vectors as columns, in the sequel we shall often write just x = (x1, . . . , xn) for vectors in Kn.
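The statement that composition of mappings corresponds to the product of matrices is easy to test in Sage. A minimal sketch, with illustrative matrices A, B of our own choosing (side="right" makes the matrices act on columns, as in the text):

A = matrix(QQ, [[1, 2], [0, 1], [1, 1]])   # matrix of f: QQ^2 -> QQ^3
B = matrix(QQ, [[1, 0, 2], [3, 1, 0]])     # matrix of g: QQ^3 -> QQ^2
f = linear_transformation(QQ^2, QQ^3, A, side="right")
g = linear_transformation(QQ^3, QQ^2, B, side="right")
x = vector(QQ, [5, -2])
print(g(f(x)) == B * A * x)   # True: the composition acts by the product B*A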
2.C.23. Check which of the following mappings F are linear transformations:
i) F : R3 → R2, F((x, y, z)) = (y, z);
ii) F : R3 → R3, F((x, y, z)) = (x + z, 2x − 3 + z, −2y);
iii) F : R2 → R2, F((x, y)) = (x + a, y), 0 ≠ a ∈ R;
iv) F : R → R, F(x) = x + a, where a ∈ R;
v) F : R2 → R2, F((x, y)) = (x2, y);
vi) F : R2 → R1[t], F((x, y)) = y + x t;
vii) F : Matn(R) → Matn(R), F(A) = AB − BA, where B ∈ Matn(R) is fixed;
viii) F : Matn(R) → Matn(R), F(A) = AT;
ix) F : Z3^2 → Z3^3, F((x, y)) = (x + y, 2x + y, x), where as usual Z3^n = Z3 × · · · × Z3 (n factors);
x) F : R → R, F(t) = e^t, where e is the base of the natural logarithm. ⃝

9 Linear mappings T : V → W are also called linear transformations or operators. Linear mappings T : V → V are also called endomorphisms of V.

2.3.17. Linear forms. A simple but very important case of linear mappings on an arbitrary vector space V over the scalars K appears when the codomain is the scalars themselves, i.e., for mappings f : V → K. We call them linear forms. If we are given coordinates on V, then the assignment of the i-th coordinate to a vector is an example of a linear form. More precisely, for every choice of a basis v = (v1, . . . , vn) there are the linear forms v∗i : V → K such that v∗i(vj) = δij, that is, v∗i(vj) = 1 when i = j, and v∗i(vj) = 0 when i ≠ j.

The vector space of all linear forms on V is denoted by V∗ and we call it the dual space of the vector space V. Let us now assume that the vector space V has finite dimension n. The basis v∗ = (v∗1, . . . , v∗n) of V∗, composed of the assignments of the individual coordinates as above, is called the dual basis to v. Clearly this is a basis of the space V∗, because these forms are evidently linearly independent (prove it!) and if α ∈ V∗ is an arbitrary form, then for every vector u = x1v1 + · · · + xnvn
\[
\alpha(u) = x_1\alpha(v_1) + \cdots + x_n\alpha(v_n) = \alpha(v_1)v^*_1(u) + \cdots + \alpha(v_n)v^*_n(u),
\]
and thus the linear form α is a linear combination of the forms v∗i.

Taking into account the standard basis {1} on the one-dimensional space of scalars K, any choice of a basis v on V identifies the linear forms α with matrices of the type 1/n, that is, with rows y. The components of these rows are the coordinates of the general linear forms α in the dual basis v∗. Evaluating such a form on a vector is then given by multiplying the corresponding row vector y with the column of the coordinates x of the vector u ∈ V in the basis v:
\[
\alpha(u) = y \cdot x = y_1x_1 + \cdots + y_nx_n.
\]
Thus we can see that for every finite-dimensional space V, the dual space V∗ is isomorphic to the space V. The choice of the dual basis provides such an isomorphism. In this context we meet again the scalar product of a row of n scalars with a column of n scalars. We have worked with it already in paragraph 2.1.3 on page 85.

The situation is different for infinite-dimensional spaces. For instance, the simplest example, the space of all polynomials K[x] in one variable, is a vector space with a countable basis with elements vi = xi. As before, we can define the linearly independent forms v∗i. Every formal infinite sum \(\sum_{i=0}^{\infty} a_iv^*_i\) is now a well-defined linear form on K[x], because it is evaluated only on finite linear combinations of the basis polynomials xi, i = 0, 1, 2, . . . . The countable set of all the v∗i is thus not a basis. Actually, it can be proved that this dual space cannot have a countable basis.
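In coordinates, a dual basis is easy to compute: if M is the matrix with the basis vectors ui as its columns, then the i-th row of M−1 satisfies (row i)·uj = δij, so the rows of M−1 are exactly the forms u∗i. A small Sage sketch (the basis below is an illustrative choice of ours; it happens to reappear in task 2.C.35):

u1 = vector(QQ, [1, 1]); u2 = vector(QQ, [3, 1])
M = column_matrix([u1, u2])
Minv = M.inverse()
# Evaluate each candidate dual form on each basis vector:
print([[Minv.row(i) * u for u in [u1, u2]] for i in range(2)])
# [[1, 0], [0, 1]] -- the defining property u*_i(u_j) = delta_ij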
2.C.24. Show that the map T : R2 → R3 defined by T((x, y)) = (x, y, x − y) is a linear transformation. Next, find the matrix A corresponding to T with respect to the standard bases of R2 and R3, respectively.

Solution. Let u = (u1, u2)T and v = (v1, v2)T be two arbitrary vectors in R2. Then we see that
\[
T(au + bv) = T\!\begin{pmatrix} au_1 + bv_1\\ au_2 + bv_2 \end{pmatrix}
= \begin{pmatrix} au_1 + bv_1\\ au_2 + bv_2\\ au_1 + bv_1 - au_2 - bv_2 \end{pmatrix}
= a\begin{pmatrix} u_1\\ u_2\\ u_1 - u_2 \end{pmatrix} + b\begin{pmatrix} v_1\\ v_2\\ v_1 - v_2 \end{pmatrix}
= aT(u) + bT(v)
\]
for all a, b ∈ R. Thus T is a linear transformation. Let e = {e1 = (1, 0)T, e2 = (0, 1)T} be the standard basis of R2 and denote by ε = {ε1 = (1, 0, 0)T, ε2 = (0, 1, 0)T, ε3 = (0, 0, 1)T} the standard basis of R3. According to the discussion in 2.3.14, the columns of the matrix A = (aij) corresponding to T consist of the coordinates of the values T(ej), for j = 1, 2, expressed in the basis ε on the target space R3. We compute
\[
T(e_1) = (1, 0, 1)^T = 1\cdot\varepsilon_1 + 0\cdot\varepsilon_2 + 1\cdot\varepsilon_3,\qquad
T(e_2) = (0, 1, -1)^T = 0\cdot\varepsilon_1 + 1\cdot\varepsilon_2 - 1\cdot\varepsilon_3,
\]
hence the matrix A has the form
\[
A = \begin{pmatrix} 1 & 0\\ 0 & 1\\ 1 & -1 \end{pmatrix}.
\]
As a simple verification, show that T(u) = Au = (x, y, x − y)T for all vectors u = (x, y)T ∈ R2. In fact, the matrix presentation T(u) = Au provides an easier way to verify the linearity of T. Indeed, assuming that a, b, u, v are as above, we get T(au + bv) = A(au + bv) = a(Au) + b(Av) = aT(u) + bT(v). □

2.C.25. Remark. In Sage we can treat linear mappings via the command linear_transformation; see also the task in 2.C.27 for a complementary application. Let us use the linear mapping T : R2 → R3 given in 2.C.24 to illustrate the situation:

V=RR^2; W=RR^3
var("x, y"); f(x, y)=[x, y, x-y]
T=linear_transformation(V, W, f)
show(T)

Executing this block will display, among other information, the matrix representation of T, which we present here:
\[
\begin{pmatrix} 1.0 & 0.0 & 1.0\\ 0.0 & 1.0 & -1.0 \end{pmatrix}.
\]
Notice this matrix acts from the “left”, in terms of Sage, and hence it is the transpose of the matrix A presented above.10 To ensure that Sage prints the correct matrix, add the following code:

10 This means that A acts on a vector x as xT A.

2.3.18. The length of vectors and the scalar product. When dealing with the geometry of the plane R2 in the first chapter, we also needed the concepts of the length of vectors and of their angles, see 1.5.7. For defining these concepts we used the scalar product of two vectors u = (x, y) and v = (x′, y′) in the form u · v = xx′ + yy′. Indeed, the expression for the length of v = (x, y) is given by
\[
\|v\| = \sqrt{x^2 + y^2} = \sqrt{v \cdot v},
\]
while the (oriented) angle φ of two vectors u = (x, y) and v = (x′, y′) is in planar geometry given by the formula
\[
\cos\varphi = \frac{xx' + yy'}{\|u\|\,\|v\|}.
\]
Note that this scalar product is linear in each of its arguments; we denote it by u · v or by ⟨u, v⟩. The scalar product defined in this way is symmetric in its arguments and, of course, ∥v∥ = 0 if and only if v = 0. We also see immediately that two vectors in the Euclidean plane are perpendicular whenever their scalar product is zero.

Now we shall mimic this approach in higher dimensions. First, observe that the angle between two vectors is always a two-dimensional concept (we want the angle to be the same in the two-dimensional subspace containing the two vectors u and v). In the subsequent paragraphs, we shall consider only finite-dimensional vector spaces over the real scalars R.
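Before formalizing, here is a quick numerical illustration of these planar formulas in Sage (the two vectors are an arbitrary choice of ours):

u = vector(QQ, [1, 0]); v = vector(QQ, [1, 1])
print(u.norm(), v.norm())                        # 1, sqrt(2)
cosphi = u.dot_product(v) / (u.norm() * v.norm())
print(arccos(cosphi))                            # 1/4*pi, i.e. 45 degrees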
Scalar product and orthogonality
A scalar product on a vector space V over the real numbers is a mapping ⟨ , ⟩ : V × V → R which is symmetric in its arguments, linear in each of them, and such that ⟨v, v⟩ ≥ 0, with ∥v∥2 = ⟨v, v⟩ = 0 if and only if v = 0. The number ∥v∥ = √⟨v, v⟩ is called the length, or norm, of the vector v. Vectors v, w ∈ V are called orthogonal or perpendicular whenever ⟨v, w⟩ = 0; we also write v ⊥ w. The vector v is called normalised whenever ∥v∥ = 1. A basis of the space V composed exclusively of mutually orthogonal vectors is called an orthogonal basis. If the vectors in such a basis are all normalised, we call the basis orthonormal.

A scalar product is very often denoted by the common dot, that is, ⟨u, v⟩ = u · v. It is then necessary to recognize from the context whether the dot means the product of two vectors (the result is a scalar) or something different (e.g., we often denote the product of matrices and the product of scalars in the same way). Because the scalar product is linear in each of its arguments, it is completely determined by its values on pairs of basis vectors. Indeed, choose a basis u = (u1, . . . , un) of the space V and denote sij = ⟨ui, uj⟩.

A=T.matrix(side="right"); show(A)

Check yourselves that now Sage prints out the matrix A posed in 2.C.24.

2.C.26. Let T : R3 → R3 be the linear mapping given by
\[
T\!\begin{pmatrix} x\\ y\\ z \end{pmatrix} = \begin{pmatrix} x + y + z\\ x + y + 2z\\ x + y + 3z \end{pmatrix}.
\]
Determine its matrix with respect to the standard basis of R3, and next present an answer via Sage. ⃝

2.C.27. Consider the matrix
\[
A = \begin{pmatrix} -1 & 2 & 3\\ 4 & 2 & 0 \end{pmatrix}.
\]
Find the value f(u) of the linear mapping f : R2 → R3 induced by A, where u = (1, 2)T ∈ R2. Next, using Sage, find a basis for the kernel and the image of f. Is f injective or surjective?

Solution. The mapping f has domain R2 and target R3, thus its matrix should be 3 × 2. By our assumption, this means that the matrix of f is the transpose of the given A. Thus we have
\[
f(u) = \begin{pmatrix} -1 & 4\\ 2 & 2\\ 3 & 0 \end{pmatrix}\begin{pmatrix} 1\\ 2 \end{pmatrix} = \begin{pmatrix} 7\\ 6\\ 3 \end{pmatrix}.
\]
As before, use the command linear_transformation() in Sage to verify this result:

A = matrix(RR, [[-1, 2, 3], [4, 2, 0]])
f = linear_transformation(A); f

This gives:

Vector space morphism represented by the matrix:
[-1.00000000000000  2.00000000000000  3.00000000000000]
[ 4.00000000000000  2.00000000000000 0.000000000000000]
Domain: Vector space of dimension 2 over Real Field with 53 bits of precision
Codomain: Vector space of dimension 3 over Real Field with 53 bits of precision

We can now evaluate f at a vector, simply by typing f([1, 2]). This prints the vector (7, 6, 3). We can also compute the kernel and the image of f by adding the commands

f.kernel(); f.image()

respectively. For them, the output is

Vector space of degree 2 and dimension 0 over Real Field with 53 bits of precision
Basis matrix: []

and

Vector space of degree 3 and dimension 2 over Rational Field
Basis matrix:
[   1    0 -3/5]
[   0    1  6/5]

Then from the symmetry of the scalar product we know sij = sji, and from the linearity of the product in each of its arguments we get
\[
\Big\langle \sum_i x_iu_i,\ \sum_j y_ju_j \Big\rangle = \sum_{i,j} x_iy_j\langle u_i, u_j\rangle = \sum_{i,j} s_{ij}x_iy_j.
\]
If the basis is orthonormal, the matrix S = (sij) is the unit matrix. This proves the following useful claim:

Scalar product in coordinates
Proposition. For every orthonormal basis, the scalar product is given by the coordinate expression
(1) ⟨x, y⟩ = yT · x.
For every basis of the space V there is a symmetric matrix S such that the coordinate expression of the scalar product is
(2) ⟨x, y⟩ = yT · S · x.

Notice that, with a symmetric matrix S, it is just a matter of convention in which order we insert the vectors: the formula
\[
x^T \cdot S \cdot y = (x^T \cdot S \cdot y)^T = y^T \cdot S^T \cdot x = y^T \cdot S \cdot x
\]
produces the same value. However, we shall later consider the second argument as a linear form, so it seems more convenient to use the expression yT · S · x.

2.3.19. Orthogonal complements and projections. For every fixed subspace W ⊂ V in a space with a scalar product, we define its orthogonal complement as
\[
W^\perp = \{u \in V;\ u \perp v \text{ for all } v \in W\}.
\]
It follows directly from the definition that W⊥ is a vector subspace. If W ⊂ V has a basis (u1, . . . , uk), then W⊥ is described by k homogeneous equations in n variables. Thus W⊥ has dimension at least n − k. Moreover, u ∈ W ∩ W⊥ means that ⟨u, u⟩ = 0, and thus also u = 0 by the definition of the scalar product. Clearly then, V is the direct sum V = W ⊕ W⊥.

A linear mapping f : V → V on any vector space is called a projection if f ◦ f = f. In such a case we can write, for every vector v ∈ V,
\[
v = f(v) + (v - f(v)) \in \operatorname{Im}(f) + \operatorname{Ker}(f) = V,
\]
and if v ∈ Im(f) and f(v) = 0, then also v = 0. Thus the above sum of subspaces is direct. We say that f is the projection onto the subspace W = Im(f) along the subspace U = Ker(f). In words, the projection can be described naturally as follows: we decompose the given vector into a component in W and a component in U, and forget the second one.

respectively. This means that Ker(f) = {0}, and so f is injective, but Im(f) is 2-dimensional, with a basis determined by the two vectors that Sage indicates in the solution. Hence f is not surjective. Note that Sage allows us to verify the injectivity and surjectivity of f directly, as follows:

f.is_injective()
f.is_surjective()

Sage’s output is True and False, respectively. □

2.C.28. Consider R3 endowed with its standard basis e = {e1, e2, e3}, and let f : R3 → R3 be a linear mapping satisfying f(e1) = (0, 1, 2)T, f(e2) = (1, 0, 1)T, f(e3) = (−1, 1, 1)T.
(a) Determine the explicit form of f;
(b) Determine the matrix of f with respect to the standard basis e;
(c) Present bases of the kernel and the image of f;
(d) Find the inverse f−1, provided it exists.

Solution. (a) By assumption we have f(e1) = (0, 1, 2)T, f(e2) = (1, 0, 1)T and f(e3) = (−1, 1, 1)T. However, any vector u = (x, y, z)T ∈ R3 is written as u = x·e1 + y·e2 + z·e3 (here “·” is just the scalar multiplication), and since f is linear we get
\[
f(u) = x\cdot f(e_1) + y\cdot f(e_2) + z\cdot f(e_3) = x\cdot(0, 1, 2)^T + y\cdot(1, 0, 1)^T + z\cdot(-1, 1, 1)^T = (y - z,\ x + z,\ 2x + y + z)^T,
\]
that is, f((x, y, z)) = (y − z, x + z, 2x + y + z)T.
(b) We have f(e1) = 0·e1 + 1·e2 + 2·e3, f(e2) = 1·e1 + 0·e2 + 1·e3, and f(e3) = −1·e1 + 1·e2 + 1·e3. Thus, the matrix of f with respect to the standard basis of R3 is given by
\[
A = \begin{pmatrix} 0 & 1 & -1\\ 1 & 0 & 1\\ 2 & 1 & 1 \end{pmatrix}.
\]
In Sage we can verify this result by executing the block

V=W=RR**3; var("x, y, z")
f(x, y, z)=[y-z, x+z, 2*x+y+z]
T=linear_transformation(V, W, f)
A=T.matrix(side="right"); show(A)

As an alternative, compute first the matrix A, and then determine the explicit form of f by the rule f(u) = Au, for all u ∈ R3.
(c) Let us first compute a basis of the image Im(f) of f. By the relation f(u) = x·f(e1) + y·f(e2) + z·f(e3) it is obvious that Im(f) = spanR{f(e1), f(e2), f(e3)} = spanR{(0, 1, 2)T, (1, 0, 1)T, (−1, 1, 1)T}.
We should now check the linear independence of these vectors. By elementary row operations on the matrix A presented above, we obtain

If V has a scalar product, we say that the projection is orthogonal if the kernel is orthogonal to the image. Every subspace W ≠ V thus defines an orthogonal projection onto W. It is the projection onto W along W⊥, given by the unique decomposition of every vector u into components uW ∈ W and uW⊥ ∈ W⊥; that is, the linear mapping which maps uW + uW⊥ to uW.

2.3.20. How to compute projections. The orthogonal projections projv onto one-dimensional subspaces span{v} spanned by v ∈ V are the simplest. Indeed, for any u ∈ V we require the projection of u to be a scalar multiple cv such that u − cv ⊥ v, i.e., ⟨u − cv, v⟩ = 0 (see part (a) in the figure below). Solving this equation yields the correct value for c, and we deduce
\[
\operatorname{proj}_v u = cv = \frac{\langle u, v\rangle}{\langle v, v\rangle}\,v = \frac{\langle u, v\rangle}{\|v\|^2}\,v.
\]
Consequently, the vector u decomposes into a pair of orthogonal vectors: u = projv u + (u − projv u).

Similarly, for linear subspaces W ⊂ V of (V, ⟨ , ⟩), there is the direct decomposition V = W ⊕ W⊥, and clearly any vector u ∈ V can be uniquely expressed as u = w + z, where w ∈ W and z ∈ W⊥. The vector w is called the orthogonal projection of u onto W; we also write projW u. The other part of u, orthogonal to W, is denoted by projW⊥ u, and we have (see part (b) in the figure below) z = projW⊥ u = u − projW u.

2.3.21. Existence of orthonormal bases. It is easy to see that on every finite-dimensional real vector space there exist scalar products. Just choose any basis, define lengths so that each basis vector has unit length, and call the basis orthonormal. Immediately we have a scalar product: in this basis, the scalar products of vectors are computed as in the formula (1) of the Proposition in 2.3.18.

More often we are given a scalar product on a vector space V, and we want to find an appropriate orthonormal basis for it. We present an algorithm which uses suitable orthogonal projections in order to transform any basis into an orthogonal one. It is called the Gram-Schmidt orthogonalization process. The point of this procedure is to transform a given sequence of independent generators v1, . . . , vk of a finite-dimensional space V into an orthogonal set of independent generators of V.

\[
A = \begin{pmatrix} 0 & 1 & -1\\ 1 & 0 & 1\\ 2 & 1 & 1 \end{pmatrix}
\sim \begin{pmatrix} 1 & 0 & 1\\ 0 & 1 & -1\\ 2 & 1 & 1 \end{pmatrix}
\sim \begin{pmatrix} 1 & 0 & 1\\ 0 & 1 & -1\\ 0 & 1 & -1 \end{pmatrix}
\sim \begin{pmatrix} 1 & 0 & 1\\ 0 & 1 & -1\\ 0 & 0 & 0 \end{pmatrix}.
\]
The final matrix is in reduced row echelon form (recall that a quick computation of the RREF in Sage is obtained by adding to the previous cell the code A1 = A.rref(); show(A1)). Therefore, the pivots lie in the first two columns (use the command A.pivots() to verify this). This means that only the vectors f(e1), f(e2) are linearly independent (in fact it is easy to see that f(e3) = f(e1) − f(e2)), and so a basis of Im(f) has the form {(0, 1, 2)T, (1, 0, 1)T}. Thus dim Im(f) = 2, and in particular f is not surjective. For the kernel, we should have dim R3 = dim Ker(f) + dim Im(f), from which we get dim Ker(f) = 3 − 2 = 1. We may use Sage to find a basis of Ker(f), by one of the commands

print(A.right_nullity())
A.right_kernel()

The first command prints out the dimension of Ker(A), while the output of the second one also includes a basis, Ker(A) = spanR{(1, −1, −1)T}. Of course, this also gives the required basis of Ker(f).
Indeed, recall that the computation of Ker(f) relies on solving the homogeneous system Au = 0, where u = (x, y, z)T ∈ R3 is an arbitrary vector. This exploits the identification Ker(A) = Ker(f) and gives
\[
\begin{pmatrix} 0 & 1 & -1\\ 1 & 0 & 1\\ 2 & 1 & 1 \end{pmatrix}\begin{pmatrix} x\\ y\\ z \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ 0 \end{pmatrix},
\]
or in other words {y − z = 0, x + z = 0, 2x + y + z = 0}. The next step relies on the reduced row echelon form posed above, which first implies that z is a free variable, say z = t ∈ R. Then we get y = z = t and x = −z = −t, that is, there are infinitely many solutions, of the form t·(−1, 1, 1)T with t ∈ R. Of course, one can replace t by −t to obtain the answer that Sage provides.
(d) The given linear transformation is not injective, since we saw that Ker(f) ≠ {0}. Therefore, the inverse f−1 does not exist. □

2.C.29. Consider the endomorphism T : C3 → C3 with T(e1) = (1, 0, i)T, T(e2) = (0, 1, 1)T, T(e3) = (i, 1, 0)T, where {e1, e2, e3} denotes the standard basis of C3. Is T invertible? Describe a solution using Sage, as well. ⃝

2.C.30. (a) Consider the complex numbers C as a real vector space with its standard basis u = {1, i}. In this basis determine the matrix of the following linear mappings: 1) conjugation, 2) multiplication by the number (2 + i).

Gram-Schmidt orthogonalization
Proposition. Let (u1, . . . , uk) be a linearly independent k-tuple of vectors of a space V with a scalar product. Then there exists an orthogonal system of vectors (v1, . . . , vk) such that vi ∈ span{u1, . . . , ui} and span{u1, . . . , ui} = span{v1, . . . , vi}, for all i = 1, . . . , k. We obtain it by the following procedure:
• The independence of the vectors ui ensures that u1 ≠ 0; we choose v1 = u1.
• If we have already constructed the vectors v1, . . . , vℓ with the required properties and if ℓ < k, we choose vℓ+1 = uℓ+1 + a1v1 + · · · + aℓvℓ, where ai = −⟨uℓ+1, vi⟩/∥vi∥2.

Proof. We begin with the first (nonzero) vector v1 = u1 and calculate the orthogonal projection v2 of u2 onto span{v1}⊥ ⊂ span{v1, u2}, i.e., we find the right constant a1 for which v2 = u2 + a1v1 is perpendicular to v1. The result is nonzero if and only if u2 is independent of v1. All the other steps are similar: in step ℓ, ℓ ≥ 1, we seek the vector vℓ+1 = uℓ+1 + a1v1 + · · · + aℓvℓ satisfying ⟨vℓ+1, vi⟩ = 0 for all i = 1, . . . , ℓ. This implies
\[
0 = \langle u_{\ell+1} + a_1v_1 + \cdots + a_\ell v_\ell,\ v_i\rangle = \langle u_{\ell+1}, v_i\rangle + a_i\langle v_i, v_i\rangle,
\]
and we can see that the vectors with the desired properties are determined uniquely up to a scalar multiple. □

Whenever we have an orthogonal basis of a vector space V, we just have to normalise the vectors in order to obtain an orthonormal basis. Thus, starting the Gram-Schmidt orthogonalization with any basis of V, we have proven:

Corollary. On every finite-dimensional real vector space with a scalar product there exists an orthonormal basis.

In an orthonormal basis, coordinates and orthogonal projections are very easy to calculate. Indeed, suppose we have an orthonormal basis (e1, . . . , en) of a space V. Then every vector v = x1e1 + · · · + xnen satisfies ⟨ei, v⟩ = ⟨ei, x1e1 + · · · + xnen⟩ = xi, and so we can always express
(1) v = ⟨e1, v⟩e1 + · · · + ⟨en, v⟩en.
If we are given a subspace W ⊂ V and its orthonormal basis (e1, . . . , ek), then we can extend it to an orthonormal basis (e1, . . . , en) of V. The orthogonal projection of a general vector v ∈ V onto W is then given by the expression
\[
v \mapsto \langle e_1, v\rangle e_1 + \cdots + \langle e_k, v\rangle e_k.
\]
In particular, we need only consider an orthonormal basis of the subspace W in order to write the orthogonal projection onto W explicitly.

(b) Determine the matrix of these mappings also in the basis f = ((1 − i), (1 + i)). ⃝

2.C.31. Consider the vector space Matm,n(K) of m × n matrices with coefficients in K, where K is R or C. Show that dimK Matm,n(K) = mn. Next establish an isomorphism φ : Matm,n(K) → Kmn between Matm,n(K) and Kmn. ⃝

2.C.32. The vector spaces Rn[x] and Rn+1 have the same dimension. Find an isomorphism between them. ⃝

2.C.33. Prove that the vectors u1 = (1, 2, 0, 0)T, u2 = (0, 1, 0, 1)T, u3 = (1, 0, 0, 0)T generate a subspace U of R4 which is isomorphic to R3. Can you find an explicit linear isomorphism? ⃝

2.C.34. List all subspaces of R3, up to isomorphism. ⃝

The simplest linear mappings α on Kn are of the form x ↦ α(x) = c1x1 + · · · + cnxn ∈ K, where c = (c1, . . . , cn) is a fixed n-tuple of scalars from the field K and x = (x1, . . . , xn)T ∈ Kn. Actually, these are all the linear mappings Kn → K, and they can be expressed as α(x) = c · x through matrix multiplication, with c viewed as a row. Such linear mappings are called linear forms on Kn. In fact, any choice of coordinates on an n-dimensional vector space V over a field K is actually an n-tuple of independent linear forms on V, i.e., of linear mappings V → K. All the linear forms together constitute the dual vector space V∗ of V, which is again a vector space over K, see 2.3.17. Clearly, V and V∗ are isomorphic as vector spaces over K.

2.C.35. Given the basis u = {u1 = (1, 1)T, u2 = (3, 1)T} of R2, find the dual basis in (R2)∗.

Solution. We seek linear forms φi : R2 → R satisfying
\[
\varphi_1(u_1) = 1,\ \varphi_1(u_2) = 0,\qquad \varphi_2(u_1) = 0,\ \varphi_2(u_2) = 1,
\]
respectively. Suppose that φ1(x, y) = ax + by and φ2(x, y) = cx + dy for some reals a, b, c, d. These equations induce the systems {a + b = 1, 3a + b = 0} and {c + d = 0, 3c + d = 1}, respectively. Solving them, we deduce that the dual basis to u consists of the linear forms φ1(x, y) = −(1/2)x + (3/2)y and φ2(x, y) = (1/2)x − (1/2)y. □

2.C.36. Prove that the set u = {u1 = (1, 0, 0)T, u2 = (0, i, 0)T, u3 = (1, 1, i)T} is a basis of C3. Next find the dual basis. ⃝

2.C.37. Let {u1, . . . , un} be a basis of a vector space V over a field K, and let {φ1, . . . , φn} be the dual basis in V∗. Prove that
(a) any u ∈ V is written as u = \(\sum_{i=1}^{n} \varphi_i(u)u_i\);
(b) any ξ ∈ V∗ is written as ξ = \(\sum_{i=1}^{n} \xi(u_i)\varphi_i\). ⃝

Note that, in general, the projection f onto the subspace W along U and the projection g onto U along W are related by the equality g = idV − f. Thus, when dealing with orthogonal projections onto a given subspace W, it is always more efficient to calculate an orthonormal basis of whichever of the spaces W and W⊥ has the smaller dimension.

Note also that the existence of an orthonormal basis guarantees that for every real space V of dimension n with a scalar product there exists a linear mapping which is an isomorphism between V and the space Rn with the standard scalar product (i.e., respecting the scalar products as well). We saw already in 2.3.18 that the desired isomorphism is exactly the coordinate assignment. In words: in every orthonormal basis, the scalar product is computed by the same formula as the standard scalar product in Rn. We shall return to the questions of the length of a vector and of projections in the following chapter, in a more general context.
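The projection formula above is straightforward to try in Sage. The following sketch uses the built-in method gram_schmidt() to produce an orthogonal (not normalised) basis, so each term is divided by the squared norm, exactly as in 2.3.20; the subspace and the vector are illustrative choices of ours:

B = matrix(QQ, [[1, 1, 0], [0, 1, 1]])   # rows span a plane W in QQ^3
G, mu = B.gram_schmidt()                 # rows of G: orthogonal basis of W
v = vector(QQ, [1, 2, 3])
projW = sum((v.dot_product(g) / g.dot_product(g)) * g for g in G.rows())
print(projW)                                           # (1/3, 8/3, 7/3)
print([(v - projW).dot_product(g) for g in G.rows()])  # [0, 0]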
2.3.22. Angle between two vectors. As we have already noted, the angle between two linearly independent vectors must be the same as when we consider them in the two-dimensional subspace they generate. Basically, this is the reason why the notion of angle is independent of the dimension of the original space. If we choose an orthogonal basis such that its first two vectors generate the same subspace as the two given vectors u and v (whose angle we are measuring), we can simply take the definition from planar geometry. Independently of the choice of coordinates, we can formulate the definition as follows:

Angle between two vectors
The angle φ between two vectors v and w in a vector space with a scalar product is given by the relation
\[
\cos\varphi = \frac{\langle v, w\rangle}{\|v\|\,\|w\|}.
\]
The angle defined in this way does not depend on the order of the vectors v, w, and it is chosen in the interval 0 ≤ φ ≤ π.

We shall return to scalar products and angles between vectors in further chapters.

2.3.23. Multilinear forms. The scalar product was given as a mapping from the product of two copies of a vector space V into the space of scalars which was linear in each of its arguments. Similarly, we will work with mappings from the product of k copies of a vector space V into the scalars which are linear in each of their k arguments. We speak of k-linear forms. Most often we will meet bilinear forms, that is, the case α : V × V → K, where for any four vectors u, v, w, z and scalars a, b, c, d we have
\[
\alpha(au + bv, cw + dz) = ac\,\alpha(u, w) + ad\,\alpha(u, z) + bc\,\alpha(v, w) + bd\,\alpha(v, z).
\]

Endomorphisms F on V are represented by square matrices A in a given basis. If we leave the choice of bases free on both the domain and the target, then we may always achieve a diagonal matrix A with just ones and zeros on the diagonal, the rank being equal to the dimension of the image, see 2.1.9. But when dealing with endomorphisms, we rather want to fix just one basis on V. Then the available change of the matrix is A ↦ P · A · P−1 = B, and we call such matrices A and B similar. Of course, P provides the relevant change of basis on V, see 2.3.16. Much of our future effort will aim at understanding linear forms and other concepts within the matrix calculus which are invariant under the above similarity and thus reveal properties of the linear mappings themselves. The trace of matrices, discussed in the subsequent tasks, is an example.

2.C.38. Show that the trace tr : Matm(K) → K, tr(A) = \(\sum_{i=1}^{m} a_{ii}\), is a linear functional, i.e., tr ∈ Mat∗m(K) ≅ (Km²)∗. Next show that tr(AB) = tr(BA) for any A, B ∈ Matm(K). Moreover, provide an example verifying that in general tr(AB) ≠ tr(A) tr(B). ⃝

2.C.39. Demonstrate with a low-dimensional example that the trace of an endomorphism remains invariant under a change of basis. ⃝

2.C.40. Prove that if A, B are two similar square matrices, then tr(A) = tr(B) and det(A) = det(B). Next decide which of the given pairs of matrices are similar:
(a) A = \(\begin{pmatrix} 1 & -2\\ 2 & 3 \end{pmatrix}\), B = \(\begin{pmatrix} 2 & 2\\ -1 & 2 \end{pmatrix}\);
(b) A = \(\begin{pmatrix} 0 & 1\\ 0 & 0 \end{pmatrix}\), B = \(\begin{pmatrix} \gamma\delta & \delta^2\\ -\gamma^2 & -\gamma\delta \end{pmatrix}\), where γ, δ are non-zero real numbers;
(c) A = \(\begin{pmatrix} 2 & 1\\ 1 & 2 \end{pmatrix}\), B = \(\begin{pmatrix} 3 & 0\\ 0 & 1 \end{pmatrix}\). ⃝

2.C.41. Transition matrix. Find the transition matrix from the standard basis e = {e1, e2, e3} of R3 to the basis
\[
u = \{u_1 = (1, 1, 3)^T,\ u_2 = (1, -1, 1)^T,\ u_3 = (3, 1, 5)^T\}.
\]
Next, find the coordinates of the vector w = (1, 2, 3)T ∈ R3 in the new ordered basis u.

Solution. We see that u1 = e1 + e2 + 3e3, u2 = e1 − e2 + e3, u3 = 3e1 + e2 + 5e3.
Thus, the transition matrix T from u to e has as its columns exactly the vectors of the basis u:
\[
T = \begin{pmatrix} 1 & 1 & 3\\ 1 & -1 & 1\\ 3 & 1 & 5 \end{pmatrix}.
\]

If additionally we always have α(u, w) = α(w, u), then we speak of a symmetric bilinear form. If interchanging the arguments leads to a change of sign, we speak of an antisymmetric bilinear form. Already in planar geometry we defined the determinant as a bilinear antisymmetric form α, that is, α(u, w) = −α(w, u). In general, due to Theorem 2.2.5, we know that the determinant in dimension n can be seen as an n-linear antisymmetric form.

As with linear mappings, it is clear that every k-linear form is completely determined by its values on all k-tuples of basis elements in a fixed basis. In analogy to linear mappings, we can view these values as k-dimensional analogues of matrices. We show this in the example k = 2, where the values correspond to matrices as we have defined them.

Matrix of a bilinear form
If we choose a basis u of V and define, for a given bilinear form α, the scalars aij = α(ui, uj), then we obtain for vectors v, w with coordinates x and y (as columns of coordinates)
\[
\alpha(v, w) = \sum_{i,j=1}^{n} a_{ij}x_iy_j = y^T \cdot A \cdot x,
\]
where A is the matrix A = (aij).

Directly from the definition of the matrix of a bilinear form, we see that the form is symmetric or antisymmetric if and only if the corresponding matrix has this property.

Every bilinear form α on a vector space V defines a mapping V → V∗, v ↦ α( , v). That is, by placing a fixed vector in the second argument, we obtain a linear form which is the image of this vector. If we choose a fixed basis on a finite-dimensional space V and the dual basis on V∗, then this is the mapping y ↦ (x ↦ yT · A · x). All this is a matter of convention; we might equally fix the first vector and again obtain a linear form.

4. Properties of linear mappings

In order to exploit vector spaces and linear mappings in modelling real processes and systems in other sciences, we need a more detailed analysis of the properties of diverse types of linear mappings.

2.4.1. We begin with four examples in the lowest dimension of interest. With the standard basis of the plane R2 and the standard scalar product, we consider the following matrices of mappings f : R2 → R2:
\[
A = \begin{pmatrix} 1 & 0\\ 0 & 0 \end{pmatrix},\quad
B = \begin{pmatrix} 0 & 1\\ 0 & 0 \end{pmatrix},\quad
C = \begin{pmatrix} a & 0\\ 0 & b \end{pmatrix},\quad
D = \begin{pmatrix} 0 & -1\\ 1 & 0 \end{pmatrix}.
\]

The transition matrix from e to u is now given by the inverse T−1 of T, see 2.3.15. We compute
\[
T^{-1} = \begin{pmatrix} -\tfrac{3}{2} & -\tfrac{1}{2} & 1\\ -\tfrac{1}{2} & -1 & \tfrac{1}{2}\\ 1 & \tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix},
\]
hence the new coordinates of w are given by the matrix multiplication
\[
T^{-1}w = \begin{pmatrix} -\tfrac{3}{2} & -\tfrac{1}{2} & 1\\ -\tfrac{1}{2} & -1 & \tfrac{1}{2}\\ 1 & \tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix}\begin{pmatrix} 1\\ 2\\ 3 \end{pmatrix} = \begin{pmatrix} \tfrac{1}{2}\\ -1\\ \tfrac{1}{2} \end{pmatrix}. \qquad\square
\]

2.C.42. Solve the problem in 2.C.41 using Sage. ⃝

2.C.43. Consider the ordered basis u1 = {E1 = (1, 0, 1)T, E2 = (1, 1, 0)T, E3 = (0, 1, 1)T} of R3, and suppose that the transition matrix T from another ordered basis u2 of R3 to u1 is given by
\[
T = \begin{pmatrix} 1 & 1 & 1\\ 1 & 2 & 1\\ -1 & 1 & 2 \end{pmatrix}.
\]
Find the basis u2. ⃝

2.C.44. Suppose that a linear mapping F : R3 → R3 has the following matrix with respect to the standard basis:
\[
A = \begin{pmatrix} 1 & -1 & 0\\ 0 & 1 & 1\\ 2 & 0 & 0 \end{pmatrix}.
\]
Compute the matrix of this mapping in the new basis f := {f1, f2, f3} := {(1, 1, 0)T, (−1, 1, 1)T, (2, 0, 1)T}. ⃝

In applications, we often know the size of vectors, i.e., there is the so-called norm on the vector space. We saw how to deal with this in Chapter 1 already: the norm is usually defined via the scalar product (see 1.E.20). This is closely linked with the concept of the angle between vectors, which is a 2-dimensional concept and works in all dimensions, even infinite ones, without changes. The scalar product allows us to employ orthogonal projections, orthogonal bases, and other important concepts; see the theoretical column for details. In the following section, we shall come to mappings and matrices compatible with scalar products, which leads to the spectral theory discussed properly in the next chapter. Therefore, consider the remainder of this section as a gentle preparation.
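As a gentle warm-up for what follows, the coordinate formula ⟨x, y⟩ = yT · S · x from 2.3.18 can be tested in Sage. A minimal sketch with an illustrative (non-orthonormal) basis of QQ2 of our own choosing:

u1 = vector(QQ, [1, 1]); u2 = vector(QQ, [3, 1])      # a basis of QQ^2
S = matrix(QQ, [[u1.dot_product(u1), u1.dot_product(u2)],
                [u2.dot_product(u1), u2.dot_product(u2)]])
x = vector(QQ, [1, 2]); y = vector(QQ, [3, -1])       # coordinates in u
v = x[0]*u1 + x[1]*u2; w = y[0]*u1 + y[1]*u2          # the actual vectors
print(v.dot_product(w) == y * S * x)                  # True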
2.C.45. Remark. Consider the vectors u = (1, 2, 2)T and v = (0, 1, −1)T in R3. Then we see that u · v = ⟨u, v⟩ = 0, that is, u and v are orthogonal. To encode this we often write u ⊥ v. As we know from 1.E.20, to compute the scalar product ⟨u, v⟩ = \(\sum_i u_iv_i\) of two vectors u, v ∈ Rn in Sage, we use the command u.dot_product(v). For instance, the block

The matrix A describes the orthogonal projection along the subspace W = {(0, a); a ∈ R} ⊂ R2 onto the subspace V = {(a, 0); a ∈ R} ⊂ R2, that is, the projection onto the x-axis along the y-axis. Evidently, for this f : R2 → R2 we have f ◦ f = f, and thus the restriction f|V of the given mapping to its image is the identity mapping. The kernel of f is exactly the subspace W.

The matrix B has the property B2 = 0, therefore the same holds for the corresponding mapping f. We can envision it as the differentiation of polynomials R1[x] of degree at most one in the basis (1, x) (we shall come to differentiation in chapter five, see 5.1.6).

The matrix C gives a mapping f which rescales the first vector of the basis a-times and the second one b-times. Therefore the whole plane splits into two subspaces which are preserved under the mapping and on which f is merely a homothety, that is, scaling by a scalar multiple (the first case was the special case a = 1, b = 0). For instance, the choice a = 1, b = −1 corresponds to the axial symmetry (mirror symmetry) with respect to the x-axis, which is the same as the complex conjugation x + iy ↦ x − iy on the two-dimensional real space R2 ≃ C in the basis (1, i). This is a linear mapping of the two-dimensional real vector space C, but not of the one-dimensional complex space C.

The matrix D is the matrix of the rotation by 90 degrees (the angle π/2) around the origin in the standard basis. We can see at first glance that no one-dimensional subspace is preserved under this mapping. Such a rotation is a bijection of the plane onto itself, therefore we can surely find distinct bases in the domain and codomain for which its matrix is the unit matrix E: simply take any basis of the domain and its image in the codomain. But we are not able to do this with one and the same basis for both the domain and the codomain. Consider the matrix D as the matrix of a mapping g : C2 → C2 in the standard basis of the complex vector space C2. Then we can find vectors u = (i, 1)T, v = (−i, 1)T, for which
\[
g(u) = \begin{pmatrix} 0 & -1\\ 1 & 0 \end{pmatrix}\begin{pmatrix} i\\ 1 \end{pmatrix} = \begin{pmatrix} -1\\ i \end{pmatrix} = i\cdot u,\qquad
g(v) = \begin{pmatrix} 0 & -1\\ 1 & 0 \end{pmatrix}\begin{pmatrix} -i\\ 1 \end{pmatrix} = \begin{pmatrix} -1\\ -i \end{pmatrix} = -i\cdot v.
\]
That means that in the basis (u, v) of C2, the mapping g has the matrix
\[
K = \begin{pmatrix} i & 0\\ 0 & -i \end{pmatrix}.
\]
Notice that by extending the scalars to C, we arrive at an analogy to the matrix C with the diagonal elements a = cos(π/2) + i sin(π/2) and its complex conjugate ā. In other words, the

u = vector(RDF, [1, 2, 2])
v = vector(RDF, [0, 1, -1])
u.dot_product(v)

returns the result 0.0 and verifies that u ⊥ v. We also know how to compute the norms of vectors.
For instance, for the given u, v, we can add the code

u.norm(); v.norm()

returning 3.0 and 1.4142135… ≈ √2, respectively. Scalar products over complex vector spaces need a different approach; we come to them in Chapter 3. They are linked to Hermitian forms on Cn and are fundamental in quantum computing.

Let us now turn our attention to vectors in our 3-dimensional world. In the solution to the next task we shall see the general concepts and procedures working independently of the (finite) dimension. One of the main concepts is that of orthogonal projections; see 2.3.20 for instructions on how to compute them.

2.C.46. Consider the following four vectors in R3:
i) u1 = (1, 3, √2)T, u2 = (−1, 1, −√2)T;
ii) v1 = (0, 1, 2)T, v2 = (−1, 2, 3)T.
Compute their dot products, norms, and angles. Check that the product is symmetric, and next use Sage to solve the tasks. ⃝

2.C.47. Given the linear form α : R3 → R with α(u) = 4x + 6y − 2z for all u = (x, y, z)T ∈ R3, determine the unique vector v = (a, b, c)T ∈ R3 satisfying α(u) = ⟨u, v⟩, where ⟨ , ⟩ denotes the usual dot product on R3.

Solution. This is an application of the fact that the standard basis {e1, e2, e3} of R3 is orthonormal with respect to the dot product, i.e., ⟨ei, ej⟩ = δij for 1 ≤ i, j ≤ 3. Then we have
\[
v = \sum_{i=1}^{3} \langle v, e_i\rangle e_i = \sum_{i=1}^{3} \alpha(e_i)e_i,
\]
where we have used that ⟨ , ⟩ is symmetric and moreover that α(ei) = ⟨ei, v⟩ for all i. Since α(e1) = 4, α(e2) = 6 and α(e3) = −2, we get v = 4e1 + 6e2 − 2e3, that is, v = (4, 6, −2)T ∈ R3. Now one can easily verify that α(u) = ⟨u, v⟩ for all u ∈ R3. □

2.C.48. Orthogonal projections via Sage. It is very easy to compute orthogonal projections of vectors in Sage. For instance, suppose that we want to compute the projection of u = (1, 2, 4)T ∈ R3 onto w = (1, −2, 1)T ∈ R3. We have projw(u) = (⟨u, w⟩/∥w∥2)w, so it is reasonable to type

u = vector([1, 2, 4])
w = vector([1, -2, 1])
proj = u.dot_product(w)/(norm(w)^2)*w
proj

sage: (1/6, -1/3, 1/6)

Hence projw(u) = (1/6, −1/3, 1/6)T ∈ R3. As a verification, we can check that projw(u) is orthogonal to u − projw(u):

argument of the number a in polar form provides the angle of the rotation. This is easy to understand if we denote the real and imaginary parts of the vector u as follows:
\[
u = x_u + iy_u = \operatorname{Re} u + i\operatorname{Im} u = \begin{pmatrix} 0\\ 1 \end{pmatrix} + i\cdot\begin{pmatrix} 1\\ 0 \end{pmatrix}.
\]
The vector v is the complex conjugate of u. We are interested in the restriction of the mapping g to the real vector subspace V = R2 ∩ spanC{u, v} ⊂ C2. Evidently,
\[
V = \operatorname{span}_{\mathbb{R}}\{u + \bar{u},\ i(u - \bar{u})\} = \operatorname{span}_{\mathbb{R}}\{x_u, -y_u\}
\]
is the whole plane R2. The restriction of g to this plane is exactly the original mapping given by the matrix D (notice this matrix is real, thus it preserves the real subspace). It is immediately seen that this is the rotation through the angle π/2 in the positive sense with respect to the chosen basis xu, −yu. Work it out yourself by a direct calculation. Note also why exchanging the order of the vectors u and v leads to the same result, although in a different real basis!
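The behaviour of the rotation matrix D just discussed can be replayed in Sage, foreshadowing the next paragraph; a small sketch over the algebraic closure QQbar:

D = matrix(QQ, [[0, -1], [1, 0]])
print(D.charpoly())                   # x^2 + 1, no real roots
DC = D.change_ring(QQbar)
print(DC.eigenvalues())               # I and -I
u = vector(QQbar, [QQbar(I), 1])
print(DC * u == QQbar(I) * u)         # True: u = (i, 1) is an eigenvector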
2.4.2. Eigenvalues and eigenvectors of mappings. A key to the description of the mappings in the previous examples was the answer to the question “what are the vectors satisfying the equation f(u) = a · u for some suitable scalar a?”. We consider this question for any linear mapping f : V → V on a vector space of dimension n over scalars K. If we imagine such an equality written in coordinates, i.e., using the matrix A of the mapping in some basis, we obtain a system of linear equations
\[
A \cdot x - a \cdot x = (A - a \cdot E) \cdot x = 0
\]
with an unknown parameter a. We know already that such a system of equations has only the solution x = 0 if the matrix A − aE is invertible. Thus we want to find those values a ∈ K for which A − aE is not invertible, and for that the necessary and sufficient condition reads (see Theorem 2.2.11)
(1) det(A − a · E) = 0.
If we consider λ = a as a variable in the previous scalar equation, we are actually looking for the roots of a polynomial of degree n. As we have seen in the case of the matrix D, the roots may exist in an extension of our field of scalars if they are not in K.

Eigenvalues and eigenvectors
Scalars λ ∈ K satisfying the equation f(u) = λ · u for some nonzero vector u ∈ V are called the eigenvalues of the mapping f. The corresponding nonzero vectors u are called the eigenvectors of the mapping f.

If u, v are eigenvectors associated with the same eigenvalue λ, then for every linear combination of u and v,
\[
f(au + bv) = af(u) + bf(v) = \lambda(au + bv).
\]
Therefore the eigenvectors associated with the same eigenvalue λ, together with the zero vector, form a nontrivial vector

proj.dot_product(u-proj)

which gives 0. Recall now that a nice feature of Sage is that we can easily extend its capabilities by introducing new commands. Here we create a function that computes the orthogonal projection of one vector onto another, a procedure which allows us to view (in Sage) the orthogonal projection as a function of two vectors. For this, use the cell

def proj(u, w):
    a = u.dot_product(w)/(norm(w)^2)*w
    return a

which defines the projection of u onto w. In these terms, the following cell in Sage gives the same result as above.

u1 = vector([1, 2, 4])
u2 = vector([1, -2, 1])
def proj(u, w):
    a = u.dot_product(w)/(norm(w)^2)*w
    return a
proj(u1, u2)

sage: (1/6, -1/3, 1/6)

Try implementing this block on your own. The function defined here will be useful for implementing the Gram-Schmidt procedure, which we are going to describe very soon.

2.C.49. Determine the matrix A which, in the standard basis of R3, gives the orthogonal projection onto the vector subspace generated by the vectors u1 = (−1, 1, 0)T and u2 = (−1, 0, 1)T.

Solution. First observe that the given subspace is a plane containing the origin of R3, with normal vector u3 = (1, 1, 1)T. This is because the ordered triple (1, 1, 1) is clearly a solution of the system {−x1 + x2 = 0, −x1 + x3 = 0}, that is, the vector u3 is perpendicular to the vectors u1, u2. Under the given projection, the vectors u1 and u2 must map to themselves, and the vector u3 to the zero vector. Thus, with respect to the basis {u1, u2, u3}, the matrix of the projection is
\[
P = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{pmatrix}.
\]
If the transition matrix from the basis {u1, u2, u3} to the standard basis is denoted by T, then the transition matrix from the standard basis to {u1, u2, u3} is the inverse T−1. We compute
\[
T = \begin{pmatrix} -1 & -1 & 1\\ 1 & 0 & 1\\ 0 & 1 & 1 \end{pmatrix},\qquad
T^{-1} = \begin{pmatrix} -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{3}\\ -\tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{2}{3}\\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix},
\]
and thus we obtain
\[
A = TPT^{-1} = \begin{pmatrix} \tfrac{2}{3} & -\tfrac{1}{3} & -\tfrac{1}{3}\\ -\tfrac{1}{3} & \tfrac{2}{3} & -\tfrac{1}{3}\\ -\tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix}.
\]

subspace Vλ ⊂ V. We call it the eigenspace associated with λ. For instance, if λ = 0 is an eigenvalue, the kernel Ker f is the eigenspace V0.

We have seen how to compute eigenvalues in coordinates. The independence of the eigenvalues from the choice of coordinates is clear from their definition.
But let us look explicitly at what happens if we change the basis. As a direct corollary of the transformation properties from paragraph 2.3.16 and the Cauchy theorem 2.2.7 for the determinant of a product, the matrix A′ in the new coordinates is A′ = P−1AP with an invertible matrix P. Thus
\[
|P^{-1}AP - \lambda E| = |P^{-1}AP - P^{-1}\lambda EP| = |P^{-1}(A - \lambda E)P| = |P^{-1}|\,|A - \lambda E|\,|P| = |A - \lambda E|,
\]
because scalar multiplication is commutative and we know that |P−1| = |P|−1. For these reasons we use the same terminology for matrices and mappings:

Characteristic polynomials
For a matrix A of dimension n over K, we call the polynomial |A − λE| ∈ Kn[λ] the characteristic polynomial of the matrix A. The roots of this polynomial are the eigenvalues of the matrix A. If A is the matrix of a mapping f : V → V in a certain basis, then |A − λE| is also called the characteristic polynomial of the mapping f.

Because the characteristic polynomial of a linear mapping f : V → V is independent of the choice of the basis of V, the coefficients of the individual powers of the variable λ are scalars expressing properties of f; in particular, they cannot depend on the choice of the basis. Suppose dim V = n and A = (aij) is the matrix of the mapping in some basis. Then
\[
|A - \lambda E| = (-1)^n\lambda^n + (-1)^{n-1}(a_{11} + \cdots + a_{nn})\lambda^{n-1} + \cdots + |A|\lambda^0.
\]
The coefficient at the highest power says whether the dimension of the space V is even or odd. The most interesting coefficient is the sum of the diagonal elements of the matrix. We have just proved that it does not depend on the choice of the basis, and we call it the trace of the matrix A, denoted by Tr A. The trace of a mapping f is defined as the trace of its matrix in an arbitrary basis.

In fact, this is not so surprising once we notice that the trace is actually the linear approximation of the determinant in the neighbourhood of the unit matrix in the direction A. We shall deal with such concepts only in Chapter 8. But since the determinant is a polynomial, we may easily see that the only terms in det(E + tA) which are linear in the real parameter t

□

A choice of basis of a given (finite-dimensional) vector space corresponds to a coordinate system on it (the basis comes with a fixed ordering of its elements). If the basis u = {u1, . . . , um} is orthonormal, i.e., ⟨ui, uj⟩ = δij (the Kronecker delta) for all i, j, then the coordinates are given just by the scalar products, i.e., v = ⟨v, u1⟩u1 + · · · + ⟨v, um⟩um for all v ∈ V. With an orthogonal basis, we merely request that its elements are mutually perpendicular; in the latter expression we then have to divide by the squared norms of the elements, i.e., the coordinates are ⟨v, ui⟩/⟨ui, ui⟩. The figure below compares a general basis and an orthonormal basis in R2. Notice that the formula for the scalar product is the same in all orthonormal bases: just the standard Euclidean dot product, as we know it from the matrix calculus. Orthogonal or orthonormal bases are easily obtained by the Gram-Schmidt procedure, see 2.3.21.
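The coordinate formula just recalled is easy to check in Sage; the orthonormal basis of R2 below is an illustrative choice of ours (a rotation of the standard one):

e1 = vector(QQ, [3/5, 4/5]); e2 = vector(QQ, [-4/5, 3/5])   # orthonormal
v = vector(QQ, [2, 1])
x1 = v.dot_product(e1); x2 = v.dot_product(e2)
print(x1, x2)                       # 2, -1
print(x1*e1 + x2*e2 == v)           # True: v = <v,e1>e1 + <v,e2>e2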
2.C.50. (a) Prove that the vectors E1 = (1, 1, 1, 1)T, E2 = (1, 1, 1, 0)T and E3 = (1, 1, 0, 0)T of R4 are linearly independent.
(b) Next find an orthonormal basis of the 3-dimensional subspace W = spanR{E1, E2, E3} of R4.
(c) Find the new coordinates of the vector w = (4, 4, 2, 1)T ∈ W with respect to the orthonormal basis that you described in part (b).

Solution. (a) We check the linear independence in Sage:

V = RR^4
E1 = vector(RR, [1, 1, 1, 1])
E2 = vector(RR, [1, 1, 1, 0])
E3 = vector(RR, [1, 1, 0, 0])
V.linear_dependence([E1, E2, E3]) == []

Sage’s output is indeed True.
(b) Let us apply the Gram-Schmidt method to obtain an orthogonal basis of W. Set w1 = E1 and
\[
w_2 = E_2 - \operatorname{proj}_{w_1}(E_2) = E_2 - \frac{\langle E_2, w_1\rangle}{\|w_1\|^2}w_1,\qquad
w_3 = E_3 - \operatorname{proj}_{w_2}(E_3) - \operatorname{proj}_{w_1}(E_3) = E_3 - \frac{\langle E_3, w_2\rangle}{\|w_2\|^2}w_2 - \frac{\langle E_3, w_1\rangle}{\|w_1\|^2}w_1.
\]
We compute ∥w1∥2 = 4 and ⟨E2, w1⟩ = 3, hence w2 = (1/4, 1/4, 1/4, −3/4)T with ∥w2∥2 = 3/4. We proceed by computing ⟨E3, w2⟩ = 1/2 and ⟨E3, w1⟩ = 2, and hence

are just the trace. We shall see the relation to the matrix exponential later, in Chapter 8. The coefficient at λ0 is the determinant |A|, and we shall see later that it describes the rescaling of volumes by the mapping.

2.4.3. Basis of eigenvectors. We now discuss a few important properties of eigenspaces.

Theorem. Eigenvectors of a linear mapping f : V → V associated to different eigenvalues are linearly independent.

Proof. Let a1, . . . , ak be distinct eigenvalues of the mapping f, and let u1, . . . , uk be eigenvectors with these eigenvalues. The proof is by induction on the number of linearly independent vectors among the chosen ones. We can start with ℓ = 1, because eigenvectors are nonzero. Assume that u1, . . . , uℓ are linearly independent and that uℓ+1 = \(\sum_{i=1}^{\ell} c_iu_i\) is their linear combination. Then
\[
f(u_{\ell+1}) = a_{\ell+1}\cdot u_{\ell+1} = \sum_{i=1}^{\ell} a_{\ell+1}c_iu_i,
\quad\text{and also}\quad
f(u_{\ell+1}) = \sum_{i=1}^{\ell} c_if(u_i) = \sum_{i=1}^{\ell} c_ia_iu_i.
\]
By subtracting the two expressions we obtain
\[
0 = \sum_{i=1}^{\ell} (a_{\ell+1} - a_i)c_iu_i.
\]
All the differences between the eigenvalues are nonzero and at least one coefficient ci is nonzero. This is a contradiction with the assumed linear independence of u1, . . . , uℓ; therefore the vector uℓ+1 must be linearly independent of the others. □

The latter theorem can be seen as a decomposition of a linear mapping f into a sum of much simpler mappings. If there are n = dim V distinct eigenvalues λi, we obtain the entire V as a direct sum of one-dimensional eigenspaces Vλi. Each of them describes a projection onto this invariant one-dimensional subspace, on which the mapping is just multiplication by the eigenvalue λi. Furthermore, this decomposition can be easily calculated:

w3 = (1/3, 1/3, −2/3, 0)T. The orthogonality is verified easily: ⟨w1, w2⟩ = ⟨w1, w3⟩ = ⟨w2, w3⟩ = 0. We also compute ∥w3∥2 = 2/3, hence the orthonormal basis {ŵi = wi/∥wi∥ : i = 1, 2, 3} of W has the explicit form
\[
\hat{w}_1 = \begin{pmatrix} 1/2\\ 1/2\\ 1/2\\ 1/2 \end{pmatrix},\quad
\hat{w}_2 = \begin{pmatrix} \sqrt{3}/6\\ \sqrt{3}/6\\ \sqrt{3}/6\\ -\sqrt{3}/2 \end{pmatrix},\quad
\hat{w}_3 = \begin{pmatrix} \sqrt{6}/6\\ \sqrt{6}/6\\ -\sqrt{6}/3\\ 0 \end{pmatrix}.
\]
(c) The given vector w = (4, 4, 2, 1)T lies in W, since we have the expression w = E1 + E2 + 2E3. To find its coordinates with respect to the orthonormal basis {ŵi : i = 1, 2, 3} of W constructed above, we apply the formula
\[
w = \sum_{i=1}^{3} \langle w, \hat{w}_i\rangle\hat{w}_i.
\]
We compute ⟨w, ŵ1⟩ = 11/2, ⟨w, ŵ2⟩ = 7√3/6 and ⟨w, ŵ3⟩ = 2√6/3, thus
\[
w = \frac{11}{2}\hat{w}_1 + \frac{7\sqrt{3}}{6}\hat{w}_2 + \frac{2\sqrt{6}}{3}\hat{w}_3.
\]
Hence the initial coordinates (1, 1, 2) of w with respect to the basis {E1, E2, E3} change to the new coordinates (11/2, 7√3/6, 2√6/3) with respect to the basis {ŵ1, ŵ2, ŵ3}. □

2.C.51. Use Sage, via the function proj(v, u) introduced in 2.C.48, to confirm the expression of the orthonormal basis {ŵ1, ŵ2, ŵ3} presented in 2.C.50. ⃝
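One possible route for 2.C.51 (a sketch only, using exact rational vectors so that the square roots stay symbolic):

def proj(u, w):
    return (u.dot_product(w) / (norm(w)^2)) * w
E1 = vector(QQ, [1, 1, 1, 1])
E2 = vector(QQ, [1, 1, 1, 0])
E3 = vector(QQ, [1, 1, 0, 0])
w1 = E1
w2 = E2 - proj(E2, w1)
w3 = E3 - proj(E3, w2) - proj(E3, w1)
print(w2, w3)    # (1/4, 1/4, 1/4, -3/4) (1/3, 1/3, -2/3, 0)
show([w1/w1.norm(), w2/w2.norm(), w3/w3.norm()])   # the basis of 2.C.50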
2.C.52. Apply the Gram-Schmidt orthogonalisation process to obtain an orthogonal basis of the linear subspace U ⊂ R4,
\[
U = \{(x_1, x_2, x_3, x_4)^T \in \mathbb{R}^4;\ x_1 + x_2 + x_3 + x_4 = 0\}. \qquad\text{⃝}
\]

2.C.53. Consider the linear mapping φ : R4 → R4 whose matrix with respect to the standard basis of R4 is given by
\[
A = \begin{pmatrix} 1/2 & 2 & -1/2 & -1/2\\ 1 & 1/2 & 1 & 3/2\\ 2 & 9/2 & 0 & 1/2\\ 2 & 2 & 0 & 0 \end{pmatrix}.
\]
Find an orthonormal basis of the kernel of φ. ⃝

2.C.54. Consider the Euclidean space R4 with its standard dot product ⟨ , ⟩. Find a basis of the linear subspace of R4 consisting of the vectors which are orthogonal to:
i) u = (1, 1, 1, 1)T ∈ R4;
ii) w = (1, 0, 0, −1)T and z = (−1, 0, 1, 0)T ∈ R4. ⃝

2.C.55. Orthogonal complement. Let W be the subspace of R4 spanned by the vectors u1 = (−1, 1, 3, 0)T and u2 = (0, 0, 0, 1)T. Find the orthogonal complement of W.

Solution. Any vector x ∈ W is written as x = au1 + bu2 = (−a, a, 3a, b)T for some a, b ∈ R. Vectors y ∈ W⊥ should satisfy x · y = 0, i.e., −ay1 + ay2 + 3ay3 + by4 = 0 for all a, b ∈ R. For a = 1, b = 0 (this corresponds to the condition u1 · y = 0) and a = 0, b = 1 (this corresponds to the condition u2 · y = 0) we obtain the following system of equations: {−y1 + y2 + 3y3 = 0, y4 = 0}. Viewing y2 = c ∈ R and y3 = d ∈ R as free variables, we obtain the solution (c + 3d, c, d, 0)T. It follows that W⊥ = spanR{w1, w2}, where the vectors w1, w2 are given by w1 = (1, 1, 0, 0)T and w2 = (3, 0, 1, 0)T, respectively. □

Basis of eigenvectors
Corollary. If there exist n mutually distinct roots λi of the characteristic polynomial of the mapping f : V → V on the n-dimensional space V, then there is a decomposition of V into the direct sum of eigenspaces of dimension one. This means that there exists a basis of V consisting only of eigenvectors, and in this basis the matrix of f is the diagonal matrix with the eigenvalues on the diagonal. This basis is uniquely determined up to the order of its elements and the scaling of the vectors.

The corresponding basis (expressed in coordinates with respect to an arbitrary basis of V) is obtained by solving n systems of homogeneous linear equations in n variables with the matrices (A − λi · E), where A is the matrix of f in the chosen basis.

2.4.4. Invariant subspaces. We have seen that every eigenvector v of a mapping f : V → V generates a subspace span{v} ⊂ V which is preserved by the mapping f. More generally, we say that a vector subspace W ⊂ V is an invariant subspace for a linear mapping f if f(W) ⊂ W. If V is a finite-dimensional vector space and we choose a basis (u1, . . . , uk) of a subspace W, we can always extend it to a basis (u1, . . . , uk, uk+1, . . . , un) of the whole space V. For every such basis, f(W) ⊂ W implies that the mapping f has a matrix A of the form
\[
(1)\qquad A = \begin{pmatrix} B & C\\ 0 & D \end{pmatrix},
\]
where B is a square matrix of dimension k, D is a square matrix of dimension n − k, and C is a matrix of the type k/(n − k). On the other hand, if for some basis (u1, . . . , un) the matrix of the mapping f is of the form (1), then W = span{u1, . . . , uk} is invariant under the mapping f. By the same argument, the mapping with the matrix A as in (1) leaves the subspace span{uk+1, . . . , un} invariant if and only if the submatrix C is zero. From this point of view, the eigenspaces of the mapping are special cases of invariant subspaces. Our next task is to find conditions under which there are invariant complements of invariant subspaces.

2.4.5. We illustrate some typical properties of mappings on the spaces R3 and R2 in terms of eigenvalues and eigenvectors.
(1) Consider the mapping f : R3 → R3 given in the standard basis by the matrix
\[
A = \begin{pmatrix} 0 & 0 & 1\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{pmatrix}.
\]
2.C.56. Given the dot products on Rm and Rn, prove that for any m × n real matrix A the column space C(A) is orthogonal to the left null space Ker(AT) and the row space C(AT) is orthogonal to the kernel Ker(A) of A. Next illustrate these statements using the matrix A described in 2.C.20. ⃝

D. Properties of linear mappings

In numerical matrix calculus, central attention is devoted to the simplest possible representation of linear mappings, i.e., to the optimal choice of coordinates allowing us to understand the mappings best. The most perfect understanding of f : V → V comes if we can represent the mapping by a diagonal matrix. This leads to the concepts of the “eigenvectors” and “eigenvalues” of f. Thus, an eigenvector of an n × n matrix A is a non-trivial solution x of the equation Ax = λx with an unknown scalar λ. Obviously, this homogeneous system of equations has a non-trivial solution if and only if λ is a root of the characteristic polynomial χA(λ) := det(A − λE), where E is the n × n identity matrix. We talk about the eigenvalues of A; they may come with multiplicities, and the spectrum of A consists of the eigenvalues of A including their multiplicities (called algebraic multiplicities). By the very definition, all this corresponds to f(u) = λu, i.e., these concepts are independent of the choice of coordinates.

If we find enough eigenvalues and eigenvectors, the mapping f enjoys a diagonal matrix! This may fail for two reasons: the polynomial might not have enough roots in the chosen field K, or the number of independent eigenvectors for a given eigenvalue λ might be smaller than the algebraic multiplicity of λ, i.e., the dimension of the eigenspace Vλ spanned by all such eigenvectors is too small (we call it the geometric multiplicity). We met examples of both in the first chapter in the plane (rotations and derivatives of first-order polynomials; see also the task presented in 2.E.57).

2.D.1. Eigenvalues and eigenvectors. Find the eigenvalues and the eigenvectors of the matrix
\[
A = \begin{pmatrix} -1 & 1 & 0\\ -1 & 3 & 0\\ 2 & -2 & 2 \end{pmatrix}.
\]
Solution. We begin with the characteristic polynomial of A:
\[
\chi_A(\lambda) = \begin{vmatrix} -1-\lambda & 1 & 0\\ -1 & 3-\lambda & 0\\ 2 & -2 & 2-\lambda \end{vmatrix} = -(\lambda^3 - 4\lambda^2 + 2\lambda + 4).
\]
The eigenvalues of A are the roots of χA(λ), i.e., of λ3 − 4λ2 + 2λ + 4. Using, for example, the Horner’s scheme presented in Chapter 1, we compute

We compute
\[
|A - \lambda E| = \begin{vmatrix} -\lambda & 0 & 1\\ 0 & 1-\lambda & 0\\ 1 & 0 & -\lambda \end{vmatrix} = -\lambda^3 + \lambda^2 + \lambda - 1,
\]
with roots λ1 = 1, λ2 = 1, λ3 = −1. The eigenvectors with the eigenvalue λ = 1 can be computed from
\[
\begin{pmatrix} -1 & 0 & 1\\ 0 & 0 & 0\\ 1 & 0 & -1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & -1\\ 0 & 0 & 0\\ 0 & 0 & 0 \end{pmatrix};
\]
a basis of the space of solutions, that is, of all eigenvectors with this eigenvalue, is u1 = (0, 1, 0), u2 = (1, 0, 1). Similarly for λ = −1 we obtain the third independent eigenvector:
\[
\begin{pmatrix} 1 & 0 & 1\\ 0 & 2 & 0\\ 1 & 0 & 1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 1\\ 0 & 2 & 0\\ 0 & 0 & 0 \end{pmatrix} \quad\Rightarrow\quad u_3 = (-1, 0, 1).
\]
In the basis u1, u2, u3 (note that u3 must be linearly independent of the remaining two because of the previous theorem, and u1, u2 were obtained as two independent solutions), f has the diagonal matrix
\[
A = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & -1 \end{pmatrix}.
\]
The whole space R3 is the direct sum of eigenspaces, R3 = V1 ⊕ V2, with dim V1 = 2 and dim V2 = 1. This decomposition is uniquely determined and says much about the geometric properties of the mapping f.
The eigenspace V1 is furthermore a direct sum of one-dimensional eigenspaces, which can be selected in other ways (thus such a decomposition has no further geometrical meaning).
(2) Consider the linear mapping f : R2[x] → R2[x] defined by polynomial differentiation, that is, f(1) = 0, f(x) = 1, f(x^2) = 2x. The mapping f thus has in the usual basis (1, x, x^2) the matrix A = ( 0 1 0 ; 0 0 2 ; 0 0 0 ). The characteristic polynomial is |A − λ·E| = −λ^3, thus it has only one eigenvalue, λ = 0. We compute the eigenvectors:
( 0 1 0 ; 0 0 2 ; 0 0 0 ) ∼ ( 0 1 0 ; 0 0 1 ; 0 0 0 ).
The space of the eigenvectors is thus one-dimensional, generated by the constant polynomial 1. The striking property of this mapping is that there is no basis in which its matrix would be diagonal. There is the “chain” of vectors mapping the independent generators as follows: (1/2)x^2 → x → 1 → 0. This builds a sequence of invariant subspaces without invariant complements.
134
λ1 = 2, λ± = 1 ± √3, all with multiplicity one.11 Hence the algebraic multiplicities of λ1 and λ± are all one. It follows that each has a 1-dimensional eigenspace, i.e., the geometric multiplicity of each eigenvalue is also one (see 3.4.10). The eigenvector associated to λ1 is determined by solving the matrix equation (A − λ1E)x = 0, with x = (x1, x2, x3)T ∈ R3. This reduces to the system {−3x1 + x2 = 0 , −x1 + x2 = 0 , x1 − x2 = 0}, and in order to find a solution we may apply what we learned in Section A. We see that x3 is a free variable, x3 ∈ R, and the other two variables must satisfy x1 = x2 = 0.12 Thus, the eigenvector associated to λ1 is the vector u1 := (0, 0, 1)T (or any multiple of it), and in particular Vλ1 is 1-dimensional (spanned by u1).
Similarly, the eigenvector corresponding to λ+ = 1 + √3 arises by solving the matrix equation (A − λ+E)x = 0. This gives the system {−(2 + √3)x1 + x2 = 0 , −x1 + (2 − √3)x2 = 0 , 2x1 − 2x2 + (1 − √3)x3 = 0}, whose solution has the form {(2 − √3, 1, −2)t : t ∈ R}. This implies that the eigenspace Vλ+ corresponding to λ+ is 1-dimensional as well: Vλ+ = spanR{(2 − √3, 1, −2)T}. In an analogous way we obtain that the eigenspace associated to the third eigenvalue λ− is 1-dimensional, Vλ− = spanR{(2 + √3, 1, −2)T}. □
2.D.2. Eigentheory via Sage. In Sage, given a square matrix it is easy to compute its eigenvalues and eigenvectors. Let us use the matrix A from the previous task to illustrate the situation. To introduce the characteristic polynomial and find its roots, we can type
A=matrix(SR, [[-1, 1, 0], [-1, 3, 0], [2, -2, 2]])
p(t) = A.characteristic_polynomial(t)
show(p(t)); show(p.roots())
The output here is the characteristic polynomial (with respect to the variable t), i.e., t^3 − 4t^2 + 2t + 4, and its roots, presented in a list as follows: [(−√3 + 1, 1), (√3 + 1, 1), (2, 1)].
11 Use also Sage to solve the equation χA(λ) = 0, and then read below for a variety of techniques in Sage for finding the eigenvalues of a given matrix A.
12 Although this case is extremely easy and the fact that x3 is a free variable (as is the solution) is obvious, recall that a good technique for solving the homogeneous system (A − λE)x = 0, where λ is an eigenvalue of A, relies on the reduced row echelon form of the matrix A − λE, especially when A is an n × n matrix with large n.
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
2.4.6. Orthogonal mappings.
We consider the special case of the mapping f : V → W between spaces with scalar products, which preserve lengths for all vectors u ∈ V . Orthogonal mappings A linear mapping f : V → W between spaces with scalar product is called an orthogonal mapping, if for all u ∈ V ⟨f(u), f(u)⟩ = ⟨u, u⟩. The linearity of f and the symmetry of the scalar product imply that for all pairs of vectors the following equality holds: ⟨f(u + v), f(u + v)⟩ = ⟨f(u), f(u)⟩ + ⟨f(v), f(v)⟩ + 2⟨f(u), f(v)⟩. Therefore all orthogonal mappings satisfy also the seemingly stronger condition for all vectors u, v ∈ V : ⟨f(u), f(v)⟩ = ⟨u, v⟩, i.e. the mapping f leaves the scalar product invariant if and only if it leaves invariant the length of the vectors. (We should have noticed that this is true for all fields of scalars, where 1 + 1 ̸= 0, but it does not hold true for Z2.) In the initial discussion about the geometry in the plane we proved in the Theorem 1.5.10 that a linear mapping R2 → R2 preserves lengths of the vectors if and only if its matrix in the standard basis (which is orthonormal with respect to the standard scalar product) satisfies AT · A = E, that is, A−1 = AT . In general, orthogonal mappings f : V → W must be always injective, because the condition ⟨f(u), f(u)⟩ = 0 implies ⟨u, u⟩ = 0 and thus u = 0. In such a case, the dimension of the range is always at least as large as the dimension of the domain of f. But then both dimensions are equal and f : V → Im f is a bijection. If Im f ̸= W, we extend the orthonormal basis of the image of f to an orthonormal basis of the range space and the matrix of the mapping then contains a square regular submatrix A along with zero rows so that it has the required number of rows. Without loss of generality we can assume that W = V . Our condition for the matrix of an orthogonal mapping in any orthonormal basis requires that for all vectors x and y in the space Kn : (A · x)T · (A · y) = xT · (AT · A) · y = xT · y. Special choice of the standard basis vectors for x and y yields directly AT · A = E, that is, the same result as for dimension two. Thus we have proved the following theorem: Matrix of orthogonal mappings Theorem. Let V be a real vector space with scalar product and let f : V → V be a linear mapping. Then f is orthogonal if and only if in some orthonormal basis (and then consequently in all of them) its matrix A satisfies AT = A−1 . 135 In case we simply type A.characteristic_polynomial( ), then Sage returns the characteristic polynomial as a polynomial with respect to the variable x. The previous list indicates the three eigenvalues of A, together with their algebraic multiplicity. Of course, one can proceed in a more elementary way, e.g., by the block A=matrix(SR, [[-1, 1, 0], [-1, 3, 0], [2, -2, 2]]) E=identity_matrix(3) var("c"); ch(c)=det(A-c*E) ch.factor(); show(ch.roots()) Another alternative to compute the eigenvalues of A has the form A=matrix(SR, [[-1, 1, 0], [-1, 3, 0], [2, -2, 2]]) eigen=A.eigenvalues() show(eigen) In this case, Sage’s output looks like as [− √ 3+1, √ 3+1, 2], so the command A.eigenvalues( ) computes only the eigenvalues, without their algebraic multiplicities. As for the eigenvectors, add in some of the previous cells the code A.eigenvectors_right() Here Sage prints out the eigenvectors of A, together with the corresponding eigenvalues and algebraic multiplicities. 2.D.3. Find bases of the eigenspaces of the matrix A =   0 1 1 1 0 1 1 1 0   . Solution. 
We first compute the determinant det(A − λE), expanding along the first row:
det( −λ 1 1 ; 1 −λ 1 ; 1 1 −λ ) = −λ · det( −λ 1 ; 1 −λ ) − 1 · det( 1 1 ; 1 −λ ) + 1 · det( 1 −λ ; 1 1 ),
and hence the characteristic polynomial is given by χA(λ) = −λ^3 + 3λ + 2. Applying the same method as in 2.D.1, we see that χA(λ) = −(λ + 1)^2(λ − 2), hence λ1 = −1 is a double root, while the other root is given by λ2 = 2. These are the eigenvalues of A, with algebraic multiplicities two and one, respectively. Let us describe the associated eigenspaces.
• For λ1 = −1 we get the matrix equation ( 1 1 1 ; 1 1 1 ; 1 1 1 )(x1 ; x2 ; x3) = (0 ; 0 ; 0). Obviously, the reduced row echelon form of the matrix A + E is given by ( 1 1 1 ; 0 0 0 ; 0 0 0 ), hence we have two free variables, namely x2 = a ∈ R and x3 = b ∈ R. The general solution has the form (−a − b, a, b)T, thus the eigenspace V−1 is a 2-dimensional subspace of R3 given by V−1 = {(−a − b, a, b)T : a, b ∈ R} = spanR{(−1, 1, 0)T, (−1, 0, 1)T}.
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
Proof. Indeed, if f preserves lengths, it must have the claimed property in every orthonormal basis. On the other hand, the previous calculations show that this property of the matrix in one such basis ensures length preservation. □
Square matrices which satisfy the equality AT = A−1 are called orthogonal matrices. The shape of the coordinate transition matrices between orthonormal bases is a direct corollary of the above theorem. Each such matrix must provide a mapping Kn → Kn which preserves lengths and thus satisfies the condition S−1 = ST. When changing from one orthonormal basis to another, the matrix of any linear mapping changes according to the relation A′ = ST A S.
2.4.7. Decomposition of an orthogonal mapping. We take a more detailed look at eigenvectors and eigenvalues of orthogonal mappings on a real vector space V with scalar product. Consider a fixed orthogonal mapping f : V → V with the matrix A in some orthonormal basis. We continue as with the matrix D of rotation in 2.4.1.
We think first about invariant subspaces of orthogonal mappings and their orthogonal complements. Namely, given any subspace W ⊂ V invariant with respect to an orthogonal mapping f : V → V, for all v ∈ W⊥ and w ∈ W we immediately see that ⟨f(v), w⟩ = ⟨f(v), f ◦ f−1(w)⟩ = ⟨v, f−1(w)⟩ = 0, since f−1(w) ∈ W, too. But this means that also f(W⊥) ⊂ W⊥, and we have proved a simple but very important proposition:
Proposition. The orthogonal complement of a subspace invariant with respect to an orthogonal mapping is also invariant.
If all eigenvalues of an orthogonal mapping are real, this claim ensures that there always exists a basis of V composed of eigenvectors. Indeed, the restriction of f to the orthogonal complement of an invariant subspace is again an orthogonal mapping, therefore we can add one eigenvector to the basis after another, until we obtain the whole decomposition of V.
However, mostly the eigenvalues of orthogonal mappings are not real. We need to make a detour into complex vector spaces. We formulate the result right away:
136
These two eigenvectors form a basis of V−1.
• For λ2 = 2 we obtain the matrix equation ( −2 1 1 ; 1 −2 1 ; 1 1 −2 )(x1 ; x2 ; x3) = (0 ; 0 ; 0). It is easy to verify that the matrix ( 1 0 −1 ; 0 1 −1 ; 0 0 0 ) is the reduced row echelon form of the matrix A − 2E. Hence we deduce that the solution has the form (a, a, a)T, where a = x3 ∈ R is the free variable. Thus, V2 is 1-dimensional, V2 = spanR{(1, 1, 1)T}. □
2.D.4. Present the solution of the task in 2.D.3 in Sage. ⃝
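One possible sketch (try your own version first; the names below are arbitrary, and Sage may choose a different, but equivalent, basis of each eigenspace):
A = matrix(QQ, [[0, 1, 1], [1, 0, 1], [1, 1, 0]])
show(A.characteristic_polynomial().factor())  # (x - 2)*(x + 1)^2
for ev, vecs, alg_mult in A.eigenvectors_right():
    print(ev, alg_mult)  # the eigenvalue and its algebraic multiplicity
    show(vecs)           # a basis of the corresponding eigenspace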
In 2.D.3 we observed that the eigenspace of a multiple eigenvalue, such as the double root λ1 = −1, can have dimension greater than one. In fact, for this particular eigenvalue, the algebraic and geometric multiplicities are both equal to 2. However, be aware that, in general, the algebraic and geometric multiplicities of an eigenvalue need not match. In particular, the algebraic multiplicity of an eigenvalue is always greater than or equal to its geometric multiplicity.
2.D.5. Determine the eigenvalues of the following matrix, together with their algebraic and geometric multiplicities. A = ( 1 1 0 ; −1 3 0 ; 2 −2 2 ). ⃝
2.D.6. Compute tr(A) and det(A) for the matrix A = ( −11 5 4 1 ; −3 0 1 0 ; −21 11 8 2 ; −9 5 3 1 ), based on the formulas det(A) = λ1λ2λ3λ4 , tr(A) = λ1 + λ2 + λ3 + λ4 , where λ1, λ2, λ3, λ4 are the roots of the characteristic polynomial χA(λ) of A. Moreover, do the computation in Sage.
Solution. The characteristic polynomial associated to A is of degree four, χA(λ) = λ^4 + 2λ^3 − 2λ − 1. It has two roots, namely λ+ = 1 with multiplicity one and λ− = −1 with multiplicity three, that is, χA(λ) = (λ − 1)(λ + 1)^3. It follows that det(A) = λ+ · λ−^3 = −1 and tr(A) = λ+ + 3λ− = 1 − 3 = −2. To perform these computations in Sage, use the following block:
A=matrix(SR, [[-11, 5, 4, 1], [-3, 0, 1, 0], [-21, 11, 8, 2], [-9, 5, 3, 1]])
p(t) = A.characteristic_polynomial(t)
show(p(t).factor())
show(p.roots())
print(A.det()); print(A.trace())
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
Orthogonal mapping decomposition
Theorem. Let f : V → V be an orthogonal mapping on a real vector space V with scalar product. Then all the (in general complex) roots of the characteristic polynomial of f have length one. There exists a decomposition of V into one-dimensional eigenspaces corresponding to the real eigenvalues λ = ±1 and two-dimensional subspaces Pλ,λ̄ with λ ∈ C∖R, where f acts by the rotation through the angle equal to the argument of the complex number λ, in the positive sense. All these subspaces are mutually orthogonal.
Proof. Without loss of generality we can work with the space V = Rm with the standard scalar product. The mapping is thus given by an orthogonal matrix A, which can equally well be seen as the matrix of a (complex) linear mapping on the complex space Cm (which just happens to have all of its coefficients real). There exist exactly m (complex) roots of the characteristic polynomial of A, counting their algebraic multiplicities (see the fundamental theorem of algebra, 1.A.6).
Furthermore, because the characteristic polynomial of the mapping has only real coefficients, its roots are either real or come in pairs of complex conjugates λ and λ̄. The associated eigenvectors in Cm for such pairs of complex conjugates are actually solutions of two systems of linear homogeneous equations which are also complex conjugate to each other – the corresponding matrices of the systems have real components, except for the eigenvalues λ. Therefore the solutions of these systems are also complex conjugates (check this!).
Next, we exploit the fact that for every invariant subspace its orthogonal complement is also invariant. First we find the eigenspaces V±1 associated with the real eigenvalues, and restrict the mapping to the orthogonal complement of their sum. Without loss of generality we can thus assume that our orthogonal mapping has no real eigenvalues and that dim V = 2n > 0.
Now choose an eigenvalue λ and let uλ be the eigenvector in C2n associated to the eigenvalue λ = α + iβ, β ̸= 0. Analogously to the case of rotation in the plane discussed in paragraph 2.4.1 in terms of the matrix D, we are interested in the real part of the sum of two one-dimensional (complex) subspaces W = span{uλ} ⊕ span{¯uλ}, where ¯uλ is the eigenvector associated to the conjugated eigenvalue ¯λ. Now we want the intersection of the 2-dimensional complex subspace W with the real subspace R2n ⊂ C2n , which is clearly generated (over R) by the vectors uλ + ¯uλ and i(uλ − ¯uλ). We call this real 2-dimensional subspace Pλ,¯λ ⊂ R2n and notice, this subspace is generated by the basis given by the real and imaginary part of uλ xλ = Re uλ, −yλ = − Im uλ. 137 eign=A.eigenvalues(); print(eign) bool(sum(eign)==A.trace()) bool(prod(eign)==A.det()) □ 2.D.7. Let A be a n × n triangular matrix over K ∈ {R, C}, either upper or lower. Prove that the eigenvalues of A are its diagonal elements. ⃝ 2.D.8. (a) Prove that a n×n matrix A is invertible if and only if 0 is not an eigenvalue of A. (b) Compute the characteristic polynomial of the n×n matrix An given below. Then combine your result with the statement in (a), to determine those n for which the matrix An is invert- ible. An =           1 0 0 0 · · · 0 0 1 1 1 0 0 · · · 0 0 0 0 1 1 0 · · · 0 0 0 0 0 1 1 · · · 0 0 0 ... ... ... ... ... ... ... ... 0 0 0 0 · · · 1 1 0 0 0 0 0 · · · 0 1 1           . ⃝ As described in 2.4.3 the eigenvectors of a squared matrix corresponding to distinct eigenvalues, are linearly independent. This property enables us to use an “eigenvector basis” to represent the matrix as a diagonal matrix. Matrices (or endomorphisms) with this property are commonly referred to as diagonalizable. Summarizing, a matrix A is diagonalizable if it is similar to a diagonal matrix D, meaning that there exists an invertible matrix P such that A = PDP−1 . This is equivalent to saying that the algebraic and geometric multiplicities of each eigenvalue of A are equal. Diagonalizable matrices are particularly useful because they simplify many matrix computations, such as raising a matrix to a power, or solving systems of linear differential equations. As we will explore in Chapter 3, significant examples of diagonalizable endomorphisms include symmetric matrices. These matrices are important in various fields, including physics, engineering, and computer science. 2.D.9. Diagonalization. Consider the matrix A introduced in the tasks 2.D.1, 2.D.3 and 2.D.5, respectively. Specify the cases where A is diagonalizable, and for these cases find matrices P, D such that A = PDP−1 . Solution. (a) Let us consider first the matrix A given in 2.D.1. There we saw that this matrix admits three eigenvalues, all distinct each other. Hence in this case A is diagonalizable. In particular, observe that for any eigenevalue of A, the corresponding algebraic and geometric multiplicity coincide (they are all equal to one). Thus we have A = PDP−1 , with D =   λ1 0 0 0 λ+ 0 0 0 λ−   , P =   0 2 − √ 3, 2 + √ 3 0 1 1 1 −2 −2   , CHAPTER 2. ELEMENTARY LINEAR ALGEBRA Because A · (uλ + ¯uλ) = λuλ + ¯λ¯uλ and similarly with the second basis vector, it is clearly an invariant subspace with respect to multiplication by the matrix A and we obtain A · xλ = αxλ + βyλ, A · (−yλ) = −αyλ + βxλ. Consequently, ∥A · xλ∥2 + ∥A · yλ∥2 = (α2 + β2 )(∥x∥2 + ∥y∥2 ) and, since our mapping preserves lengths, the absolute value of the eigenvalue λ must equal one. 
But that means that the restriction of our mapping to Pλ,λ̄ is the rotation through the argument of the eigenvalue λ. Note that the choice of the eigenvalue λ̄ instead of λ leads to the same subspace with the same rotation; we would just express it in the basis xλ, yλ, so in these coordinates the rotation goes through the same angle, but with the opposite sign, as expected.
The proof of the whole theorem is completed by restricting the mapping to the orthogonal complement and finding another 2-dimensional subspace, until we get the required decomposition. □
We return to the ideas in this proof once again in chapter three, where we study complex extensions of the Euclidean vector spaces, see 3.4.4.
Remark. The previous theorem is very powerful in dimension three. Here at least one eigenvalue must be real, equal to ±1, since three is odd. But then the associated eigenspace is an axis of the rotation of the three-dimensional space, through the angle given by the argument of the other eigenvalues. Try to think how to detect in which direction the space is rotated. Note also that the eigenvalue −1 means an additional reflection through the plane perpendicular to the axis of the rotation.
We shall return to the discussion of such properties of matrices and linear mappings in more detail at the end of the next chapter, after illustrating the power of the matrix calculus in several practical applications. We close this section with a general and quite widely used definition:
138
respectively, where λ1 = 2, λ± = 1 ± √3 are the eigenvalues of A. The columns of P are the eigenvectors of A, described in 2.D.1. The block below provides a quick verification of the relation A = PDP−1 in Sage:
A=matrix(SR, [[-1, 1, 0], [-1, 3, 0], [2, -2, 2]])
u1=vector([0, 0, 1])
u2=vector([2-sqrt(3), 1, -2])
u3=vector([2+sqrt(3), 1, -2])
P=column_matrix([u1, u2, u3])
D=diagonal_matrix([2, 1+sqrt(3), 1-sqrt(3)])
bool(A==P*D*P.inverse())
Note that we can replace any column of the matrix P with a non-zero scalar multiple of the corresponding eigenvector. As a result, there are infinitely many possible matrices P that satisfy the diagonalization condition.
(b) Let us now consider the matrix A given in 2.D.3. There we saw that this matrix has two eigenvalues: λ1 = −1 with algebraic multiplicity 2, and λ2 = 2 with algebraic multiplicity 1. We also proved that the geometric multiplicity of λ1 coincides with its algebraic multiplicity, and similarly for λ2. Hence in this case the matrix A is also diagonalizable: A = PDP−1. To explicitly determine D and P we can follow the same method outlined in (a). We leave this case as an exercise for practice.
(c) Finally, the matrix A presented in 2.D.5 is not diagonalizable. This is because its unique eigenvalue λ = 2 has algebraic multiplicity 3, but geometric multiplicity 2. Consequently, it is impossible to construct a complete basis of eigenvectors for this matrix. □
2.D.10. Determine whether the matrices A1 and A2 given below are diagonalizable. If diagonalizable, determine the matrices Di, Pi such that Ai = PiDiPi−1: A1 = ( 2 0 0 ; 4 2 2 ; −2 0 1 ), A2 = ( 2 0 0 ; −4 2 2 ; −2 0 1 ). ⃝
Consider a real vector space V endowed with a scalar product ⟨ , ⟩, and an orthogonal endomorphism f : V → V. Thus, f preserves the scalar product, the lengths and the angles of vectors in V. Prime examples of orthogonal endomorphisms are rotations, in R2 or R3 (cf. 1.E.17).
It is most remarkable that, in matrix terms, A ∈ Matn(R) is the matrix of an orthogonal transformation f (in suitable orthonormal coordinates) if and only if AAT = AT A = E. We call such matrices orthogonal, and they are particularly easy to invert. This property leads to several interesting results, such as the fact that the determinant of an orthogonal matrix is always ±1.
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
Spectrum of linear mapping
2.4.8. Definition. The spectrum of a linear mapping f : V → V, or the spectrum of a square matrix A, is the sequence of roots of the characteristic polynomial of f or of A, along with their multiplicities. The algebraic multiplicity of an eigenvalue means the multiplicity of the root of the characteristic polynomial, while the geometric multiplicity of the eigenvalue is the dimension of the associated subspace of eigenvectors. The spectral diameter of a linear mapping (or matrix) is the greatest of the absolute values of the eigenvalues.
In this terminology, our results about orthogonal mappings can be formulated as follows: the spectrum of an orthogonal mapping is always a subset of the unit circle in the complex plane. Thus only the values ±1 may appear in the real part of the spectrum, and their algebraic and geometric multiplicities are always the same. Complex values of the spectrum then correspond to rotations in suitable two-dimensional subspaces which are mutually perpendicular.
139
The next task illustrates that orthogonal endomorphisms on Rn correspond to orthogonal n × n real matrices, and conversely. For more information, see the discussion in 2.4.6.
2.D.11. On V = Rn endowed with the dot product consider the linear mapping φA : V → V defined by φA(u) = Au for all u ∈ V, where A ∈ Matn(R). Prove that A is orthogonal if and only if φA is an orthogonal endomorphism of V. ⃝
2.D.12. Determine which of the matrices below are orthogonal:
A = ( 1/√2 −1/√2 ; −1/√2 −1/√2 ), B = ( 2/3 −2/3 1/3 ; 2/3 1/3 −2/3 ; 1/3 2/3 2/3 ), C = ( 3/√11 −1/√6 −1/√66 ; 1/√11 2/√6 −4/√66 ; 1/√11 1/√6 7/√66 ).
Solution. All the given matrices A, B and C are orthogonal. There are many ways to verify this fact. For A notice that AT = A and AAT = ( 1/√2 −1/√2 ; −1/√2 −1/√2 )( 1/√2 −1/√2 ; −1/√2 −1/√2 ) = E.
On the other hand, later in Chapter 3 we will learn that a square matrix is orthogonal if and only if its columns (or rows) form an orthonormal basis (see 3.4.4). Given our familiarity with orthonormal bases, we can use this criterion to determine whether the matrix B is orthogonal. We need to verify, for example, that the (column) vectors v1 = (2/3, 2/3, 1/3)T, v2 = (−2/3, 1/3, 2/3)T, v3 = (1/3, −2/3, 2/3)T form an orthonormal basis of R3 with respect to the dot product. We compute ∥v1∥ = ∥v2∥ = ∥v3∥ = 1 and v1 · v2 = −4/9 + 2/9 + 2/9 = 0, v1 · v3 = 2/9 − 4/9 + 2/9 = 0, v2 · v3 = −2/9 − 2/9 + 4/9 = 0. This proves the claim.
You can also use the command A.is_unitary( ) in Sage. This checks whether a given real matrix A is orthogonal (or whether a complex matrix is unitary, as we will see in Chapter 3). Let us apply this method to verify the orthogonality of the matrix C:
C = matrix(SR, [[3/sqrt(11), -1/sqrt(6), -1/sqrt(66)],
[1/sqrt(11), 2/sqrt(6), -4/sqrt(66)],
[1/sqrt(11), 1/sqrt(6), 7/sqrt(66)]])
C.is_unitary()
Sage's output is True. □
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
140
Proper rotations are orthogonal matrices with determinant 1, and play a crucial role in both applied and pure mathematics.
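As a quick illustration of both facts at once, here is a minimal symbolic sketch checking the generic rotation of the plane by the angle θ:
var("theta")
R = matrix(SR, [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]])
show((R.transpose()*R).simplify_full())  # the identity matrix E
show(R.det().simplify_full())            # 1, so R is a proper rotation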
Orthogonal matrices with determinant −1 describe orientation-reversing transformations, i.e., compositions of a rotation with a reflection; these are used to model symmetry operations, such as those seen in particle physics or optics. Recall that in low dimensions, eigentheory provides a nuanced geometric perspective on both rotations and reflections (see the remark in 2.4.7, and the tasks 1.E.17, 2.E.57).
In a three-dimensional space, a general proper rotation is characterized by its axis and its angle θ. Conventionally, a positive angle corresponds to a counterclockwise rotation when viewed from the tip of the axis, see also the discussion in ??. The rotation matrix in this case necessarily has three eigenvalues: a real one λ1 = 1, and two complex conjugate eigenvalues of absolute value one. As explained in 2.4.7, the eigenvector u1 associated with the eigenvalue 1 represents the rotation axis, which remains invariant under the rotation. The direction of this axis can be determined using the right-hand rule: curl the fingers of your right hand around the axis of rotation, with the fingers pointing in the direction of the rotation; your thumb will then point in the direction of u1. Additionally, the angle θ is given by the argument of one of the two complex conjugate eigenvalues, which represent a 2-dimensional rotation within the plane orthogonal to the rotation axis.
2.D.13. Consider the matrix A = ( 3/5 16/25 −12/25 ; −16/25 93/125 24/125 ; 12/25 24/125 107/125 ).
(a) Show that the linear mapping induced by A is a rotation. (b) Find its axis, the plane of rotation, and the rotation angle.
Solution. (a) To prove that the induced linear mapping is a rotation, it is sufficient to show that A is an orthogonal matrix with determinant 1. In Sage you may apply the method described above, i.e.,
A=matrix(SR, [[3/5, 16/25, -12/25],
[-16/25, 93/125, 24/125],
[12/25, 24/125, 107/125]])
A.is_unitary()
Sage's output is True. Adding the command det(A), Sage gives that det(A) = 1. Compute the determinant by hand, as well. Thus, A is an orthogonal matrix with determinant 1. Let us also compute the eigenvalues of A by adding the command A.eigenvalues(). Sage returns the list
[-4/5*I + 3/5, 4/5*I + 3/5, 1]
but for our convenience we set λ1 = 1, λ2 = 3/5 + (4/5)i and λ3 = 3/5 − (4/5)i. All three eigenvalues have absolute value one. Combined with the fact that the matrix is orthogonal, this confirms that the matrix represents a rotation.
(b) In this part we need to find the eigenvectors of A. Let
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
141
us use Sage to specify them. By adding the command A.eigenvectors_right() to the previous block, Sage will return the eigenvectors corresponding to the eigenvalues λ1, λ2, λ3. These are given by v1 = (0, 1, 4/3)T, v2 = (1, 4i/5, −3i/5)T, and v3 = (1, −4i/5, 3i/5)T, respectively. The axis of the rotation is given by the eigenvector v1. The plane of rotation is the real plane E in R3 which is defined by the intersection of the 2-dimensional complex subspace of C3 spanned by the remaining eigenvectors v2, v3, with R3. A direct computation shows that E = spanR{E1 = (1, 0, 0)T, E2 = (0, −4, 3)T} ⊂ R3, where the first generator is a real multiple of v2 + v3 and the second one is a real multiple of i(v2 − v3), see also 2.4.7. We are now ready to determine the rotation angle in this plane: it is a rotation through the angle arccos(3/5) ≈ 0.295π, which is the argument of the eigenvalue 3/5 + (4/5)i (or minus that number, if we chose the other eigenvalue). □
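As an independent cross-check of the axis and the angle, one can avoid the complex eigenvectors altogether: the axis must be fixed by A, and for any rotation matrix in R3 the trace satisfies tr(A) = 1 + 2 cos θ. A small sketch:
A = matrix(QQ, [[3/5, 16/25, -12/25], [-16/25, 93/125, 24/125], [12/25, 24/125, 107/125]])
v1 = vector([0, 1, 4/3])
print(A*v1 == v1)                # True: v1 spans the axis of the rotation
show(arccos((A.trace() - 1)/2))  # arccos(3/5), the angle found above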
You can find more exercises on eigenvalues, eigenvectors, orthogonal matrices and rotations in Section E. In Chapter 3 we will return to these topics and expand our perspective by exploring applications of eigenvalues and orthogonal diagonalization.
142
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
E. Additional exercises for the whole chapter
Below we provide additional material for most of the concepts discussed in the second chapter. The goal is to gain further computational experience with matrix calculus, become more familiar with basic notions of linear transformations, and hopefully enrich our approach to solving linear problems.
A) Material on vectors and matrices
2.E.1. Sage's vocabulary in linear algebra. This chapter has provided extensive information on using Sage with vectors and matrices. Before we begin this supplementary section, it is helpful to summarize this knowledge. Below, we list most of the Sage commands we have explored so far, along with instructions for manipulating matrices and linear mappings. We strongly recommend using Sage to aid your understanding of the problems presented below, especially in cases where our solution does not include Sage. This practice will also help expand your Sage vocabulary on linear algebra and its applications, as we will see later in Chapter 3.
Sage vocabulary for matrix calculus, vector spaces and linear mappings:
zero m/n matrix: zero_matrix(KK, m, n)
identity n/n matrix: identity_matrix(KK, n)
diagonal matrix: diagonal_matrix(KK, [a, b, c])
random matrix: random_matrix(KK, m, n)
matrix multiplication: A*b, A*B or B*A
matrix exponentiation: e.g., A^3 or A**3
determinant, trace: A.det( ), A.trace( )
row echelon form: A.echelon_form( )
RREF: A.rref( )
pivot columns: A.pivots( )
inverse: A.inverse( )
minors of order k: A.minors(k)
adjoint: A.adjugate( )
transpose: A.transpose( )
conjugate: A.conjugate( )
conjugate transpose: A.conjugate_transpose( )
rank: A.rank( )
left kernel: A.left_kernel( )
right kernel: A.right_kernel( )
dimension of the left kernel: A.left_nullity( )
dimension of the right kernel: A.right_nullity( )
characteristic polynomial: A.characteristic_polynomial( )
eigenvalues: A.eigenvalues( )
eigenvectors: A.eigenvectors_right( )
orthogonal (real case) or unitary: A.is_unitary( )
symmetric: A.is_symmetric( )
rescale a row: A.rescale_row( )
add a multiple of one row to another: A.add_multiple_of_row( )
interchange rows: A.swap_rows( )
accessing the entry a12: A[0, 1] or A[0][1]
accessing the kth column: A[:, k−1], e.g. the 3rd column A[:, 2]
accessing the nth row: A[n−1, :], e.g. the 2nd row A[1, :]
finding the size of A: A.nrows( ), A.ncols( )
introducing a vector u: u = vector(KK, [u1, . . . , un])
size of u: u.degree( )
submatrices of B: e.g., B[0:2, 1:3], B[:, 1:3], B[1:3, :]
block matrix ( A 0 ; 0 B ): block_diagonal_matrix([A, B])
stacking A on top of B: A.stack(B)
augmented matrix: A.augment(b)
a solution of Ax = b: A\b
As we learned in the main part, many of the commands listed above come with various options, which we have not included here for simplicity. Above, KK represents a numerical system K, e.g., ZZ for the integers Z, etc.
2.E.2. Consider the finite field Z7 and the matrices A = ( 2 3 ; 4 5 ) and B = ( 6 1 ; 0 2 ) with elements in Z7. Describe the multiplication AB and then confirm your result using Sage. How many arithmetic operations do we need to compute the product AB over Z7, and why does this differ from the count presented in 1.E.11?
Solution. The finite field Z7 consists of the integers {0, 1, 2, 3, 4, 5, 6} with addition and multiplication defined modulo 7.
Thus, to multiply the matrices A, B over Z7 we follow the standard procedure for matrix multiplication but reduce each entry modulo 7. Let us denote by C = AB = (cij) the product of A, B. Then, the entries cij are given by cij = ∑2 k=1 aikbkj(mod7). We compute c11 = (2 · 6 + 3 · 0)(mod7) = 12(mod7) = 5 , c12 = (2 · 1 + 3 · 2)(mod7) = 8(mod7) = 1 , c21 = (4 · 6 + 5 · 0)(mod7) = 24(mod7) = 3 , c22 = (4 · 1 + 5 · 2)(mod7) = 14(mod7) = 0 . Thus C = AB = ( c11 c12 c21 c22 ) = ( 5 1 3 0 ) . A verification in Sage follows the standard procedure. However, we first need to introduce the finite field Z7 and define the matrices A and B over this field. 143 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA Z7=GF(7) A=Matrix(Z7, [[2, 3], [4, 5]]); B=Matrix(Z7, [[6, 1], [0, 2]]); show(A*B) Finally, according to 1.E.11 the multiplication AB of two 2 × 2 real matrices requires 12 arithmetic operations: For each matrix element, we perform two multiplications and one addition. Over Z7 for each matrix element we need one more operation, the modulo operation. Thus in total we will have 12 + 4 = 16 operations. □ Elementary matrices are fundamental tools in linear algebra, particularly useful for understanding and performing row operations on matrices. An elementary matrix is derived from the identity matrix by performing a single elementary row operation, see 2.1.8. These operations are essential for matrix manipulations, solving systems of linear equations, and finding matrix inverses. Let us describe related tasks. 2.E.3. Consider the matrix A =   1 0 2 3 1 −1 2 4 2  . i) Determine the elementary matrix representing the row operation R2 → R2 − 3R1. Next, confirm that the inverse of this elementary matrix is the elementary matrix of the row operation R2 → R2 + 3R1. ii) Determine the elementary matrix representing the row operation R2 → R2 − 3 2 R3. Next, confirm that the inverse of this elementary matrix is the elementary matrix of the row operation R2 → R2 + 3 2 R3. iii) Determine the elementary matrix representing the row operation R3 → 1 2 R3. Next, confirm that the inverse of this elementary matrix is the elementary matrix of the row operation R3 → 2R3. ⃝ 2.E.4. Consider the matrix A =   2 1 3 1 0 1 4 2 1  . (a) Show that A is invertible. (b) Apply elementary row operations to determine the inverse of A. (c) Express A as a product of inverses of elementary matrices. Solution. (a) Using the Sarrus rule (see 2.2.1) it is straightforward to verify that the determinant of A equals 5 ̸= 0. Therefore, the matrix is invertible. (b) Consider the augmented matrix ( A E ) (where E is the 3 × 3 identity). The goal is to use elementary row operations to transform ( A E ) into ( E A−1 ) . We compute: ( A E ) =   2 1 3 1 0 0 1 0 1 0 1 0 4 2 1 0 0 1   R1→ 1 2 R1 −→    1 1 2 3 2 1 2 0 0 1 0 1 0 1 0 4 2 1 0 0 1    R2→R2−R1 −→    1 1 2 3 2 1 2 0 0 0 −1 2 −1 2 −1 2 1 0 4 2 1 0 0 1    R3→R3−4R1 −→    1 1 2 3 2 1 2 0 0 0 −1 2 −1 2 −1 2 1 0 0 0 −5 −2 0 1    R2→−2R2 −→    1 1 2 3 2 1 2 0 0 0 1 1 1 −2 0 0 0 −5 −2 0 1    R3→− 1 5 R3 −→    1 1 2 3 2 1 2 0 0 0 1 1 1 −2 0 0 0 1 2 5 0 −1 5    R2→R2−R3 −→    1 1 2 3 2 1 2 0 0 0 1 0 3 5 −2 1 5 0 0 1 2 5 0 −1 5    R1→R1− 3 2 R3 −→    1 1 2 0 − 1 10 0 3 10 0 1 0 3 5 −2 1 5 0 0 1 2 5 0 −1 5    R1→R1− 1 2 R2 −→    1 0 0 −2 5 1 1 5 0 1 0 3 5 −2 1 5 0 0 1 2 5 0 −1 5    . Thus A−1 =   −2/5 1 1/5 3/5 −2 1/5 2/5 0 −1/5  . To confirm this expression, you can either use Sage, or verify the relation AA−1 = E = A−1 A by hand. 
Let us present the verification in Sage: A=Matrix(SR, [[2, 1, 3], [1, 0, 1], [4, 2, 1]]); print(det(A)) A_inv=A.inverse(); show(A_inv) 144 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA (c) In part (b) we obtained A−1 by applying eight elementary row operations. These operations correspond to eight elementary matrices E1, . . . , E8, such that E8E7E6E5E4E3E2E1A = E. Thus, A = (E8E7E6E6E5E4E3E2E1)−1 = E−1 1 E−1 2 E−1 3 E−1 4 E−1 5 E−1 6 E−1 7 E−1 8 . Let us specify E1, . . . , E8 and their inverses. row operation R1 → 1 2 R1 : E =   1 0 0 0 1 0 0 0 1   R1→ 1 2 R1 −→   1/2 0 0 0 1 0 0 0 1   =: E1 =⇒ E−1 1 =   2 0 0 0 1 0 0 0 1   , row operation R2 → R2 − R1 : E =   1 0 0 0 1 0 0 0 1   R2→R2−R1 −→   1 0 0 −1 1 0 0 0 1   =: E2 =⇒ E−1 2 =   1 0 0 1 1 0 0 0 1   , row operation R3 → R3 − 4R1 : E =   1 0 0 0 1 0 0 0 1   R3→R3−4R1 −→   1 0 0 0 1 0 −4 0 1   =: E3 =⇒ E−1 3 =   1 0 0 0 1 0 4 0 1   , row operation R2 → −2R2 : E =   1 0 0 0 1 0 0 0 1   R2→−2R2 −→   1 0 0 0 −2 0 0 0 1   =: E4 =⇒ E−1 4 =   1 0 0 0 −1 2 0 0 0 1   , row operation R3 → − 1 5 R3 : E =   1 0 0 0 1 0 0 0 1   R3→− 1 5 R3 −→   1 0 0 0 1 0 0 0 −1/5   =: E5 =⇒ E−1 5 =   1 0 0 0 1 0 0 0 −5   , row operation R2 → R2 − R3 : E =   1 0 0 0 1 0 0 0 1   R2→R2−R3 −→   1 0 0 0 1 −1 0 0 1   =: E6 =⇒ E−1 6 =   1 0 0 0 1 1 0 0 1   , row operation R1 → R1 − 3 2 R3 : E =   1 0 0 0 1 0 0 0 1   R1→R1− 3 2 R3 −→   1 0 −3 2 0 1 0 0 0 1   =: E7 =⇒ E−1 7 =   1 0 3 2 0 1 0 0 0 1   , row operation R1 → R1 − 1 2 R2 : E =   1 0 0 0 1 0 0 0 1   R1→R1− 1 2 R2 −→   1 −1 2 0 0 1 0 0 0 1   =: E8 =⇒ E−1 8 =   1 1 2 0 0 1 0 0 0 1   . It is now a straightforward (though lengthy) computation to verify that A = E−1 1 E−1 2 E−1 3 E−1 4 E−1 5 E−1 6 E−1 7 E−1 8 . To speed up this process, we can program Sage to perform the calculation for us, by adding the following cell to the previous code block: E1inv=Matrix(SR, [[2, 0, 0], [0, 1, 0], [0, 0, 1]]) E2inv=Matrix(SR, [[1, 0, 0], [1, 1, 0], [0, 0, 1]]) E3inv=Matrix(SR, [[1, 0, 0], [0, 1, 0], [4, 0, 1]]) E4inv=Matrix(SR, [[1, 0, 0], [0, -1/2, 0], [0, 0, 1]]) E5inv=Matrix(SR, [[1, 0, 0], [0, 1, 0], [0, 0, -5]]) E6inv=Matrix(SR, [[1, 0, 0], [0, 1, 1], [0, 0, 1]]) E7inv=Matrix(SR, [[1, 0, 3/2], [0, 1, 0], [0, 0, 1]]) E8inv=Matrix(SR, [[1, 1/2, 0], [0, 1, 0], [0, 0, 1]]) bool(A==E1inv*E2inv*E3inv*E4inv*E5inv*E6inv*E7inv*E8inv) Execute this program yourselves to verify that Sage’s output is True. □ Sage provides built-in functions to explore elementary matrices, which we can list as follows: • The matrix which multiplies the kth row by c: elementary_matrix(R, n, row1 = k, scale = c) • The matrix which multiplies the kth row by c and adds it to the mth row: elementary_matrix(R, n, row1 = k, row2 = m, scale = c) • The matrix which swaps the kth and mth rows: elementary_matrix(R, n, row1 = k, row2 = m) In each case, R is the base ring, and is optional, while n denotes the size of the square matrix. 145 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA 2.E.5. Use Sage to describe the elementary matrices E1, . . . , E8 obtained in 2.E.4, using the commands mentioned above. Compute their inverses, as well. ⃝ 2.E.6. Solve the system of linear equations given by the extended matrix B = ( A b ) =     3 3 2 1 3 2 1 1 0 4 0 5 −4 3 1 5 3 3 −3 5     . ⃝ 2.E.7. Provide a formal solution of the linear system given below. 
Next use Sage to verify your solution:
{ x1 + x2 + x3 + x4 − 2x5 = 3 , 2x2 + 2x3 + 2x4 − 4x5 = 5 , −x1 − x2 − x3 + x4 + 2x5 = 0 , −2x1 + 3x2 + 3x3 − 6x5 = 2 }.
Solution. The extended matrix of the system is
B = ( A b ) = ( 1 1 1 1 −2 | 3 ; 0 2 2 2 −4 | 5 ; −1 −1 −1 1 2 | 0 ; −2 3 3 0 −6 | 2 ).
Adding the first row to the third, adding its 2-multiple to the fourth, and then adding the (−5/2)-multiple of the second row to the fourth, we obtain
B ∼ ( 1 1 1 1 −2 | 3 ; 0 2 2 2 −4 | 5 ; 0 0 0 2 0 | 3 ; 0 5 5 2 −10 | 8 ) ∼ ( 1 1 1 1 −2 | 3 ; 0 2 2 2 −4 | 5 ; 0 0 0 2 0 | 3 ; 0 0 0 −3 0 | −9/2 ).
The last row is clearly a multiple of the previous one, and thus we can omit it. The pivots are located in the first, second and fourth columns, while x3 and x5 are free variables, which we substitute by the real parameters t and s. Thus we consider the system
x1 + x2 + t + x4 − 2s = 3 , 2x2 + 2t + 2x4 − 4s = 5 , 2x4 = 3 .
We see that x4 = 3/2, and hence the second equation gives x2 = 1 − t + 2s. From the first equation we now obtain x1 = 1/2. This gives the expression (x1, x2, x3, x4, x5) = (1/2, 1 − t + 2s, t, 3/2, s) with t, s ∈ R. Note that setting t0 = −t + 2s and t1 = t − s, the solution takes the form (1/2, 1 + t0, t0 + 2t1, 3/2, t0 + t1), with t0, t1 ∈ R.
As an alternative method, one could transform the extended matrix B to reduced row echelon form. In this procedure, we can omit the fourth equation since it is a combination of the first three. By sequentially multiplying the second and third rows by 1/2, subtracting the third row from the second and first rows, and finally subtracting the second row from the first, we obtain the following:
B ∼ ( 1 1 1 1 −2 | 3 ; 0 2 2 2 −4 | 5 ; 0 0 0 2 0 | 3 ) ∼ ( 1 1 1 1 −2 | 3 ; 0 1 1 1 −2 | 5/2 ; 0 0 0 1 0 | 3/2 ) ∼ ( 1 1 1 0 −2 | 3/2 ; 0 1 1 0 −2 | 1 ; 0 0 0 1 0 | 3/2 ) ∼ ( 1 0 0 0 0 | 1/2 ; 0 1 1 0 −2 | 1 ; 0 0 0 1 0 | 3/2 ).
This confirms that x3, x5 are free variables, i.e., x3 = t, x5 = s with t, s ∈ R. Moreover, a short computation yields the solution presented above. These computations, particularly the (unique) reduced row echelon form of the extended matrix B and the given solution, can be obtained in Sage by applying the method proposed in 2.A.13. In particular, the commands B.rref() and B.pivots() appearing in the block below return the reduced row echelon form of B and the pivot columns, respectively.
A=matrix([[1, 1, 1, 1, -2], [0, 2, 2, 2, -4], [-1, -1, -1, 1, 2], [-2, 3, 3, 0, -6]])
b=vector([3, 5, 0, 2]); B=A.augment(b)
Br=B.rref(); show(Br); B.pivots()
x=A.solve_right(b)
k=A.right_kernel().matrix()
nrows = k.nrows()
if nrows > 0:
    t = vector(var("t", n=nrows))
    show(x + t*k)
else:
    show(x)
146
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
Sage's output has the form (1/2, t0 + 1, t0 + 2t1, 3/2, t0 + t1), with t0, t1 ∈ R, and this verifies our formal computations. □
2.E.8. Find all the solutions of the following homogeneous system of four linear equations in five unknowns x, y, z, u, v: {x + y = 2z + v , z + 4u + v = 0 , −3u = 0 , z = −v}. Moreover, specify a basis of the vector space of solutions.
Solution. Let us first rewrite the system into the matrix form Ax = 0, where here x = (x, y, z, u, v)T ∈ R5, and, as usual, the first column of A consists of the coefficients of x, the second column of the coefficients of y, and so on. This gives
A = ( 1 1 −2 0 −1 ; 0 0 1 4 1 ; 0 0 0 −3 0 ; 0 0 1 0 1 ).
Such homogeneous systems, with fewer (linear) equations than unknowns, are called underdetermined and admit an infinite number of solutions.
Indeed, let us compute the reduced row echelon form of A. For this, let us denote by R1, R2, R3, R4 the rows at any step of the procedure. We see that
( 1 1 −2 0 −1 ; 0 0 1 4 1 ; 0 0 0 −3 0 ; 0 0 1 0 1 ) (R2 → R2 + (4/3)R3) ∼ ( 1 1 −2 0 −1 ; 0 0 1 0 1 ; 0 0 0 −3 0 ; 0 0 1 0 1 ) (R4 → R4 − R2) ∼ ( 1 1 −2 0 −1 ; 0 0 1 0 1 ; 0 0 0 −3 0 ; 0 0 0 0 0 ) (R3 → −(1/3)R3) ∼ ( 1 1 −2 0 −1 ; 0 0 1 0 1 ; 0 0 0 1 0 ; 0 0 0 0 0 ) (R1 → R1 + 2R2) ∼ ( 1 1 0 0 1 ; 0 0 1 0 1 ; 0 0 0 1 0 ; 0 0 0 0 0 ).
The final matrix is in reduced row echelon form. Recall that in Sage we can achieve this form in two ways: by applying the command A.rref( ), or by manually performing the necessary elementary row operations. Let us demonstrate both methods using matrices over Q instead of R, for convenience, as this does not affect the result.
A=matrix(QQ, [[1, 1, -2, 0, -1], [0, 0, 1, 4, 1], [0, 0, 0, -3, 0], [0, 0, 1, 0, 1]])
show(A.rref())
The output is the same as the one obtained by executing the following block:
A=matrix(QQ, [[1, 1, -2, 0, -1], [0, 0, 1, 4, 1], [0, 0, 0, -3, 0], [0, 0, 1, 0, 1]])
A.add_multiple_of_row(1, 2, 4/3); show(A)
A.add_multiple_of_row(3, 1, -1); show(A)
A.rescale_row(2, -1/3); show(A)
A.add_multiple_of_row(0, 1, 2); show(A)
where recall that the elementary row operations should be applied successively. Running any of these blocks verifies that our computations are accurate. Therefore, we can now pose the system {x + y + v = 0, z + v = 0, u = 0}, where the variables y, v are free (they correspond to the 2nd and 5th columns of the RREF, since these columns do not contain a pivot). Hence, let us set y = t and v = s with t, s ∈ R and derive the solution by back substitution:
(x, y, z, u, v) = (−t − s, t, −s, 0, s) , t, s ∈ R .
Obviously, we can rewrite the solution as
(x, y, z, u, v)T = t (−1, 1, 0, 0, 0)T + s (−1, 0, −1, 0, 1)T , t, s ∈ R .
147
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
It is easy to prove that the vectors on the right-hand side are linearly independent, and thus they form a basis of the solution space. In other words, the kernel of the matrix A is a 2-dimensional subspace of R5. □
2.E.9. Determine all solutions of the following linear system: { x2 + x4 = 1 , 3x1 − 2x2 − 3x3 + 4x4 = −2 , x1 + x2 − x3 + x4 = 2 , x1 − x3 = 1 }. ⃝
2.E.10. Solve the following linear system: { 3x − 5y + 2u + 4z = 2 , 5x + 7y − 4u − 6z = 3 , 7x − 4y + 3z = 4 , x + 6y − 2u − 5z = 2 }. ⃝
2.E.11. Consider the following linear system: { x1 + x2 − x3 = 1 , x1 + 2x2 + κx3 = 2 , 2x1 + κx2 + 2x3 = 3 }. Determine the values of the real parameter κ such that (a) the system has a unique solution; (b) the system has infinitely many solutions; (c) the system has no solution. For the cases (a) and (b) determine the corresponding solutions.
Solution. The extended matrix of the system has the form B = ( A b ) = ( 1 1 −1 | 1 ; 1 2 κ | 2 ; 2 κ 2 | 3 ). By applying elementary row operations we obtain
B = ( 1 1 −1 | 1 ; 1 2 κ | 2 ; 2 κ 2 | 3 ) (R2 → R2 − R1, R3 → R3 − 2R1) ∼ ( 1 1 −1 | 1 ; 0 1 κ+1 | 1 ; 0 κ−2 4 | 1 ) (R3 → R3 − (κ−2)R2) ∼ ( 1 1 −1 | 1 ; 0 1 κ+1 | 1 ; 0 0 4 − (κ+1)(κ−2) | 3−κ ).
The above matrix is in row echelon form and we can rewrite it as
B ∼ ( 1 1 −1 | 1 ; 0 1 κ+1 | 1 ; 0 0 −κ^2+κ+6 | 3−κ ) = ( 1 1 −1 | 1 ; 0 1 κ+1 | 1 ; 0 0 (3−κ)(κ+2) | 3−κ ). (∗)
Hence we deduce that:
(a) The system has a unique solution when κ ≠ 3 and κ ≠ −2. Indeed, in this case we obtain { x1 + x2 − x3 = 1 , x2 + (κ+1)x3 = 1 , (2+κ)(3−κ)x3 = 3 − κ }.
Therefore, by back substitution we obtain the solution (x1, x2, x3)T = (1, 1/(2+κ), 1/(2+κ))T.
(b) The system has infinitely many solutions when κ = 3, since then the last row of the row echelon form presented in (∗) contains only zeros, the pivots belong to the first and second columns, and x3 is a free variable. Hence in this case we obtain the system { x1 + x2 − x3 = 1 , x2 + 4x3 = 1 }, with x3 = t ∈ R, having infinitely many solutions given by (x1, x2, x3)T = (5t, 1 − 4t, t)T, (t ∈ R). Note that we may express these solutions as (x1, x2, x3)T = t(5, −4, 1)T + (0, 1, 0)T. Thus, when κ = 3 the space of all solutions is given by summing all solutions of the corresponding homogeneous system, given by t(5, −4, 1)T, with a particular solution of the original system, given by (0, 1, 0)T.
148
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
(c) The system is inconsistent when κ = −2. Indeed, for this value of κ the third row of the matrix on the right-hand side of (∗) gives us 0 · x3 = 5, which is impossible. □
2.E.12. Solve the following parametric system of linear equations in terms of the parameter µ ∈ R: { µx1 + 4x2 + 2x3 = 0 , 2x1 + 3x2 − x3 = 0 }. ⃝
2.E.13. Determine the number of solutions of the following parametric linear system, in terms of the parameter a ∈ R: ( 4 1 4 a ; 2 3 6 8 ; 3 2 5 4 ; 6 −1 2 −8 )(x1 ; x2 ; x3 ; x4) = (2 ; 5 ; 3 ; −3). ⃝
2.E.14. Consider the following parametric system of linear equations: { x1 − a x2 − 2x3 = b , x1 + (1 − a)x2 = b − 3 , x1 + (1 − a)x2 + a x3 = 2b − 1 }. Find the values of the parameters a, b ∈ R such that: (a) the system has exactly one solution; (b) the system has no solution; (c) the system has infinitely many solutions. ⃝
Let us now explore an important application of linear systems in the analysis of electric circuits. This description relies on Ohm's law and on Kirchhoff's voltage and current laws, providing an opportunity to revisit some concepts from high school physics. Readers not interested in these applications can skip this task.
2.E.15. Kirchhoff's Circuit Laws. Consider an electric circuit as in the figure and write down the values of the currents there if you know the values V1 = 20, V2 = 120, V3 = 50, R1 = 10, R2 = 30, R3 = 4, R4 = 5, R5 = 10. Notice that the quantities Ii denote the electric currents, while Rj are resistances and Vk are voltages.
Solution. There are two closed loops, namely ABEF and EBCD, and two branching vertices B and E of degree no less than 3. On every segment of the circuit bounded by branching points, the electric current is constant. Let us denote it by I1 for the segment EFAB, I2 for EB, and I3 for BCDE. Applying Kirchhoff's current law to the branching points B and E we obtain I1 + I2 = I3 and I3 − I1 = I2, which are, of course, the same equation. In case there are many branching vertices, we include all the Kirchhoff current law equations in the system, at least one of those equations being redundant. Choose the counterclockwise orientation of both loops ABEF and EBCD. Applying Kirchhoff's voltage law and Ohm's law to the loop ABEF we obtain the equation:
V1 + I1R3 − I2R5 + V3 + I1R1 + I1R4 = 0 .
149
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
It is easy to see that this has a unique solution, given by I1 = −80 53 ≈ −1.509, I2 = 219 53 ≈ 4.132, I3 = 139 53 ≈ 2.623. □ 2.E.16. The general case. In general, the method for electrical circuit analysis can be formulated along the following steps: (α) Identify all branching vertices of the circuit, i.e., vertices of degree no less than 3; (β) Identify all closed loops of the circuit; (γ) Introduce variables Ik, denoting the oriented currents on each segment of the circuit between two branching vertices; (δ) Write down the Kirchhoff’s current conservation law for each branching vertex. The total incoming current equals the total outgoing current; (ε) Choose an orientation on every closed loop of the circuit and write down the Kirchhoff’s voltage conservation law, according to the chosen orientation. Here, in case you find an electric charge of voltage Vj and you go from the short bar to the long bar, then the contribution of this charge is Vj. It should be −Vj in case you go from the long bar to the short one. Notice also that if you go in the positive direction of a current I and find a resistor with resistance Rj, then the contribution is −RjI, and it is RjI if the orientation of the loop is opposite to the direction of the current I. The total voltage change along each closed loop must be zero. (ζ) Compose the system collecting all equations, representing the Kirchhoff’s current and voltage laws and solve it with respect to the variables, representing the currents. Notice that some equations may be redundant, however, the solution should be unique! To illustrate this general approach, consider the circuit example in the diagram below. Solution. Let us apply the steps posed above: (α) The set of branching vertices is {B, C, F, G, H}. (β) The set of closed loops is {ABHG, FHBC, GHF, CDEF}. (γ) Let I1 be the current on the segment GAB, I2 on the segment GH, I3 on the segment HB, I4 on the segment BC, I5 on the segment FC, I6 on the segment FH, I7 on GF, and I8 on CDEF. (δ) Next we write the Kirchhoff’s current conservation laws for the branching vertices: B : I1 + I3 = I4 , C : I4 + I5 = I8 , F : I8 = I5 + I6 − I7 , G : −I7 = I1 + I2 , H : I2 + I6 = I3 . Notice that the second and third equations give I8 − I5 = I4 and I8 − I5 = I6 − I7, respectively. Hence we can replace any of them by the equation I4 = I6 − I7. (ε) Let us now write Kirchhoff’s voltage conservation for each of the closed loops traversed counter-clockwise: loop ABHG : −R1I2 + V3 + R2I1 − V2 = 0 , loop GHF : R1I2 − V1 = 0 , loop FHBC : V4 + R3I4 − V3 = 0 , loop CDEF : R4I8 − V4 = 0 . (ζ) To pass to the final step we need to use some specific values for Ri, Vi for i = 1, . . . , 4. Setting R1 = 4, R2 = 7, R3 = 9, R4 = 12, V1 = 10, V2 = 20, V3 = 60 and V4 = 120, we get a unique solution (although the system is overdetermined!) 150 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA I1 = − 30 7 , I2 = 5 2 , I3 = − 50 21 , I4 = − 20 3 , I5 = 50 3 , I6 = − 205 42 , I7 = 25 14 , I8 = 10 . Below we present the solution in Sage, where first we should introduce the unknowns I1, . . . , I8 as symbolic vari- ables. var("I1, I2, I3, I4, I5, I6, I7, I8") eq1=I1+I3-I4; eq2=I4+I5-I8 eq3=I5+I6-I7-I8; eq4=I1+I2+I7 eq5=I2-I3+I6; eq6=7*I1-4*I2+40 eq7=9*I4+60; eq8=4*I2-10 eq9=12*I8-120 show(solve([eq1, eq2, eq3, eq4, eq5, eq6, eq7, eq8, eq9], [I1, I2, I3, I4, I5, I6, I7, I8])) □ B) Material on determinants 2.E.17. 
Factor the following permutations into a product of transpositions: σ = ( 1 2 3 4 5 6 7 7 6 5 4 3 2 1 ) , τ = ( 1 2 3 4 5 6 7 8 6 4 1 2 5 8 3 7 ) , ρ = ( 1 2 3 4 5 6 7 8 9 10 4 6 1 10 2 5 9 8 3 7 ) . ⃝ 2.E.18. Determine the parity of the following permutations: σ = ( 1 2 3 4 5 6 7 7 5 6 4 1 2 3 ) , τ = ( 1 2 3 4 5 6 7 8 6 7 1 2 3 8 4 5 ) , ρ = ( 1 2 3 4 5 6 7 8 9 10 9 7 1 10 2 5 4 9 3 6 ) . ⃝ 2.E.19. For the matrix A shown below, determine all values of the complex parameter a ∈ C that satisfy det(A) = 1: A =     a 1 1 1 0 a 1 1 0 1 a 1 0 0 0 −a     . Solution. Let us compute the determinant of A by expanding the first column of the matrix: det(A) = a 1 1 1 0 a 1 1 0 1 a 1 0 0 0 −a = a · a 1 1 1 a 1 0 0 −a = −a4 + a2 . In Sage we can verify this computation by the cell var("a") A=matrix([[a, 1, 1, 1], [0,a, 1, 1], [0, 1, a, 1],[0, 0, 0, -a]]); det(A) Thus, the equation det(A) = 1 is equivalent to a4 − a2 + 1 = 0. Substituting t = a2 we obtain t2 − t + 1 with roots t1 = 1 + i √ 3 2 = cos(π/3) + i sin(π/3) , t2 = 1 − i √ 3 2 = cos(π/3) − i sin(π/3) = cos(−π/3) + i sin(−π/3) . Therefore, there are four possible values for the parameter a: a1 = cos(π/6) + i sin(π/6) = √ 3/2 + i/2 , a2 = cos(7π/6) + i sin(7π/6) = − √ 3/2 − i/2 , a3 = cos(−π/6) + i sin(−π/6) = √ 3/2 − i/2 , a4 = cos(5π/6) + i sin(5π/6) = − √ 3/2 + i/2 . Of course, solving the equation det(A) = 1 with Sage is straightforward. Simply add the code solve(det(A) == 1, a) to the previous cell. This provides a fast way to verify the result. Note that there is an alternative way to specify the possible values of a. For this, multiply the equation det(A) = 1, i.e., the equation a4 − a2 + 1 = 0 by a2 + 1. This gives 0 = (a2 + 1)(a4 − a2 + 1) = a6 + 1, and the idea is that we can easier treat the equation a6 + 1, comparing the original equation. Indeed, recall by 1.G.17 that the equation a6 = −1 has six complex solutions, which can be presented as a = cos φ + i sin φ , where φ = π/6 + kπ/3 = (2k + 1)π/6 , k = 0, 1, 2, 3, 4, 5 . 151 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA Between them, we must discard the two choices k = 1 and k = 4, since these choices solve a2 + 1 = 0, and not our equation a4 − a2 + 1 = 0. Hence we conclude that a = cos φ + i sin φ with φ = (2k + 1)π/6, for k = 0, 2, 3, and k = 5. □ 2.E.20. Establish whether the following matrix is invertible: A =     3 2 −1 2 4 1 2 −4 −2 2 4 1 2 3 −4 8     . Solution. We just need to compute det(A). We will apply Laplace theorem (see 2.2.9) by expanding the first row of the given matrix. This gives det(A) = 3 · 1 2 −4 2 4 1 3 −4 8 − 2 · 4 2 −4 −2 4 1 2 −4 8 + (−1) · 4 1 −4 −2 2 1 2 3 8 − 2 · 4 1 2 −2 2 4 2 3 −4 = 3 · 90 − 2 · 180 − 1 · 110 − 2 · (−100) = 0. Thus det(A) = 0 and A cannot be invertible. As a verification by Sage, give the cell A=matrix([[3, 2, -1, 2], [4, 1, 2, -4], [-2, 2, 4, 1],[2, 3, -4, 8]]) det(A) Note that giving the command A.inverse(), Sage will return an error, since A is singular, i.e., det(A) = 0. □ We now present the determinant of the famous “Vandermonde matrix”. The Vandermonde matrix Vn = (Vij) is the square matrix of size n, with columns formed by the powers of a given vector x = (x1, . . . , xn)T ∈ Rn , i.e., Vij = xj−1 i for i, j = 1, . . . , n, see below. This matrix is fundamental in both pure and applied mathematics. For example, in Chapter 5, we will explore its role in polynomial interpolation (see 5.1.5). Beyond mathematics, Vandermonde matrices are also significant in the natural sciences, particularly in economics and statistics. 
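Before the formal statement and proof in the next task, the reader may find it instructive to test the determinant of this matrix in the case n = 3 symbolically in Sage; a minimal sketch:
x1, x2, x3 = var("x1 x2 x3")
V3 = matrix(SR, [[1, 1, 1], [x1, x2, x3], [x1^2, x2^2, x3^2]])
show(V3.det().factor())  # a product of the differences xj - xi, up to the order of the factors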
2.E.21. Vandermonde determinant. The matrix
Vn := ( 1 1 . . . 1 ; x1 x2 . . . xn ; x1^2 x2^2 . . . xn^2 ; . . . ; x1^(n−1) x2^(n−1) . . . xn^(n−1) ),
where x1, . . . , xn ∈ R, is the so-called Vandermonde matrix. Prove that det(Vn) = ∏_{1≤i<j≤n} (xj − xi). ⃝
for n > 1 we see that det(A + B) = det(2A) = 2^n det(A) > 2 det(A) = det(A) + det(B).
2.B.10. (a) We present the solutions mainly via Sage, and leave to the reader the description of formal proofs. To find the null space of the matrix A, give the cell
A=matrix([[1, -1, pi, 0], [0, pi, -1, 0], [1, 0, -1, pi]])
show(A.right_kernel())
The output has the form RowSpanSR( 1, −1/(π(π − 1/π)), −1/(π − 1/π), −(π − 1/π + 1)/(π(π − 1/π)) ). Therefore, Ker(A) is 1-dimensional. In a similar way we compute Ker(B) = span{(1, −1, −1, 1)T}. We will learn about vector subspaces and the notion of the “dimension” of a vector (sub)space very soon, in Section C.
(b) Obviously, the required submatrix has the form ( π 0 ; 0 π ), which is diagonal. In Sage there are many ways to determine submatrices. For this specific case, where we should cancel both rows and columns, an appropriate command is A.matrix_from_rows_and_columns( ):
171
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
d=A.matrix_from_rows_and_columns([1, 2], [1, 3]); show(d)
#verify that d is diagonal
print(bool(d==diagonal_matrix([pi, pi])))
In this command the brackets [1, 2] and [1, 3] specify the rows and columns that are retained in the submatrix. Another method relies on the bracket operation that we briefly mentioned above. Hence, for example, add the cell
suba=A[1:3, 1:4:2]; show(suba)
We mention that in the expression A[1:3, 1:4:2] the final index 2 is the slice step between the columns.
(c) The submatrix obtained from B after canceling the first and last rows, and the fourth column, has the form ( 2 1 4 ; 3 4 1 ). In Sage use the block
B=matrix([[1, 2, 3, 4], [2, 1, 4, 3], [3, 4, 1, 2], [4, 3, 2, 1]])
show(B.matrix_from_rows_and_columns([1, 2], [0, 1, 2]))
or, as an alternative, you may type
B=matrix([[1, 2, 3, 4], [2, 1, 4, 3], [3, 4, 1, 2], [4, 3, 2, 1]])
show(B[1:3, 0:-1])
In fact, one can replace the final line by the command show(B[1:3, 0:3]).
(d) In this case the required submatrix has the form ( 2 4 ; 4 2 ), which is obviously symmetric. In Sage add to the previous block the code
subb=B.matrix_from_rows_and_columns([1, 3], [0, 2])
bool(subb==subb.transpose())
2.C.2. The set of real polynomials of degree exactly m is not a real vector space, because it fails to satisfy one of the key properties of a vector space: closure under addition. This is because the sum of two polynomials of degree m does not necessarily have degree m. Can you provide an explicit example?
2.C.4. One can use the following block:
v1=vector(QQ, [1/2, 2/3, 1, 0])
v2=vector(QQ, [2, 0, 1/7, 0])
v3=vector(QQ, [0, 1/5, 2/5, 1])
v4=vector(QQ, [-1, 2, 0, 3])
show(3*v1+2*v2+5*v3+v4)
Another approach involves placing the vectors and their coefficients into two separate lists and then manipulating these lists as follows:
vectors=[v1, v2, v3, v4]
scalars=[3, 2, 5, 1]
lin_comb=sum([scalars[i]*vectors[i] for i in range(len(vectors))])
show(lin_comb)
In this block we used the len() function to determine the number of elements in the list of vectors, which can be very useful when working with large lists. Check yourselves that both cells print out the following answer: (9/2, 5, 37/7, 8). There are further methods to describe the linear combination (span) of vectors via Sage, presented in 2.C.12.
2.C.5.
A solution goes as follows: A=matrix(QQ, [[3, -9, 9, 0],[9, -3, 6, 0],[9, -6, 0,-6],[-9, 12, 6, 6]]) b = vector(QQ, [-1, -1, 2, 0]) cols=A.columns(); show(cols) soln = [-2/15, 1/5, 2/15, -11/15] bool(sum([soln[i]*cols[i] for i in range(len(cols))])==b) In this block we used the function .columns() to access the columns of A. Note also that the final line verifies that linear combination of the columns of A with coefficients from soln (this represents the given solution), equals the vector b. Executing this block we obtain True. In fact, in our solution we could replace the final line by show(sum([soln[i]*cols[i] for i in range(len(cols)))) so that we can directly compare this expression with the constant vector b. 172 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA 2.C.6. The finite field Z5 is an important mathematical structure used in various areas (such as algebra, cryptography, and computer science). It consists of the integers from 0 to 4, Z5 = {0, 1, 2, 3, 4}, with arithmetic operations performed modulo 5. For your convenience we present the addition and multiplication table for Z5: + 0 1 2 3 4 0 0 1 2 3 4 1 1 2 3 4 0 2 2 3 4 0 1 3 3 4 0 1 2 4 4 0 1 2 3 · 0 1 2 3 4 0 0 0 0 0 0 1 0 1 2 3 4 2 0 2 4 1 3 3 0 3 1 4 2 4 0 4 3 2 1 . For instance, 3 + 4 = 7 = 2(mod5) and 2 · 3 = 6 = 1(mod5). The vector space Z3 5 = Z5 × Z5 × Z5 consists of vectors (a, b, c)T , where each entry a, b, c is an element of the finite field Z5. Hence, Z3 5 is 3-dimensional over Z5. The addition and scalar multiplication are respectively defined by u + v =   (u1 + v1) mod 5 (u2 + v2) mod 5 (u3 + v3) mod 5   , cu =   (cu1) mod 5 (cu2) mod 5 (cu3) mod 5   , for any two vectors u = (u1, u2, u3)T , v = (v1, v2, v3)T and scalar c ∈ Z5. For instance:   3 4 2   +   2 3 4   =   (3 + 2) mod 5 (4 + 3) mod 5 (2 + 4) mod 5   =   0 2 1   , 3   1 4 2   =   3 mod 5 12 mod 5 6 mod 5   =   3 2 1   . The zero vector is the additive identity in Z3 5. Can you derive the expression of the additive inverse of some vector u = (u1, u2, u3)T ∈ Z3 5? About the total number of vectors in Z3 5 we see that each entry of a vector in Z3 5 can independently take one of 5 possible values: {0, 1, 2, 3, 4}. Thus, the total number of vectors in this space is 53 = 125. Here is a verification via Sage (to create the finite field Z5 we will use the function GF) # Define the finite field Z_5 F = GF(5) # Define the 3-dimensional vector space over Z_5 V = VectorSpace(F, 3) # Calculate the total number of vectors in the vector space total_vectors = V.cardinality() # Output the result total_vectors 2.C.8. Here is a block that we can use to answer the task: V=QQ^3 e1=vector([1, 0, 0]) e2=vector([0, 1, 0]) e3=vector([0, 0, 1]) Vxy=V.span([e1, e2]); print(Vxy) Vxz=V.span([e1, e3]); print(Vxz) Vyz=V.span([e2, e3]); print(Vyz) Let us also present what Sage prints out: Vector space of degree 3 and dimension 2 over Rational Field Basis matrix: [1 0 0] [0 1 0] Vector space of degree 3 and dimension 2 over Rational Field Basis matrix: [1 0 0] [0 0 1] Vector space of degree 3 and dimension 2 over Rational Field Basis matrix: [0 1 0] [0 0 1] 173 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA 2.C.9. For a matrix A of size m × n the column space is defined by C(A) = {Ax : x ∈ Rn }. Hence, obviously, C(A) is a subset of Fm . Note that C(A) contains the zero vector, since A0 = 0. To prove that is a linear subspace of Fm we should prove that C(A) is closed under vector addition and scalar multiplication. Let u, v ∈ C(A). 
Then, there exist vectors x, y ∈ Fn such that u = Ax and v = Ay, respectively. Thus, considering the linear combination αu + βv of u, v for some scalars α, β ∈ F, we see that αu + βv = α(Ax) + β(Ay) = A(αx + βy). This shows that αu + βv ∈ C(A), and hence C(A) is a subspace of Fm . Recall that the rank of A equals the number of (nonzero) pivots in elimination, hence it coincides with the maximum number of linearly independent columns (or linearly independent rows). The relation now between the dimension of C(A) and the rank of A follows. A similar analysis applies for the row space and the left null space, which we leave for practice. 2.C.11. The given vectors are linearly independent: The equation a · v1 + b · v2 = 0 for some a, b ∈ R, can be equivalently written in matrix form as (a, 0, a, 2a) T + (0, b, 0, 0)T = (0, 0, 0, 0)T , from where we immediately get a = 0 = b. We can verify the independence of v1, v2 in Sage as in 2.C.10: V=QQ^4;v1=vector(QQ, [1, 0, 1, 2]); v2=vector(QQ, [0, 1, 0, 0]) V.linear_dependence([v1, v2]) ==[] It follows that the linear span W = spanQ{v1, v2} is a 2-dimensional linear subspace of Q4 . As we said an alternative is given by using the command span. For example, adding in the previous cell the syntax W=V.span([v1, v2]); W} verifies that W is 2-dimensional subspace of Q4 . In Sage we can directly find a basis of a vector space V by the command W.basis(). Hence for example we can type V=QQ^4 v1=vector(QQ, [1, 0, 1, 2]) v2=vector(QQ, [0, 1, 0, 0]) W=V.subspace([v1, v2]) B=W.basis();show(B) By this block Sage will directly print the basis vectors v1 and v2. Other techniques to create subspaces rely on the creation of row or column spaces of matrices, a situation that we will encounter a bit later. 2.C.12. (a) It is easy to see the vector equation av1 + bv2 = 0 for some scalars a, b, implies that a = b = 0. Thus, the vectors v1, v2 are linearly independent and automatically form a basis of W = spanQ{v1, v2}. Therefore, W is 2-dimensional, see also the discussion in 2.3.7-2.3.10 for the concept of dimension of a vector space. A general element of W has the form w = c1v1 + c2v2 = (2c1 + 4c2, −c1, 3c1 + c2)T with c1, c2 ∈ Q. There are many ways to program Sage to verify the linear independence of v1, v2. Here, we present a method that uses the rank of the matrix A, which has v1, v2 as its columns. For a built-in method in Sage to check linear independence we refer to 2.C.10. A = matrix(QQ, [[2, 4], [-1, 0], [3, 1]]) rank_of_A = A.rank() print(f"The rank of matrix A is: {rank_of_A}") if rank_of_A == 2: print("The vectors are linearly independent.") else: print("The vectors are linearly dependent.") Sage’s output has the form The rank of matrix A is: 2 The vectors are linearly independent. To introduce W in Sage we can type V = VectorSpace(QQ, 3); v1 = V([2, -1, 3]); v2 = V([4, 0, 1]) W = V.subspace([v1, v2]) print("\nSubspace W spanned by v1 and v2:") print(W) An alternative to create W is to use the span command, as in 2.C.8: W.span([v1, v2]); print(W) For both cases Sage’s output has the form 174 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA Subspace W spanned by v1 and v2: Vector space of degree 3 and dimension 2 over Rational Field Basis matrix: [ 1 0 1/4] [ 0 1 -5/2] We will analyze the terms used in this answer in more detail soon. For now, keep only that W is a 2-dimensional subspace of Q3 , over the rationals. To illustrate W we can leverage Sage’s 3D-graphics functions, because we are working within a three-dimensional space. 
We want to generate a figure displaying v1 in red, v2 in green and W as the shaded area. With this goal in mind one can execute the following block: # Define the vectors v1 = vector(QQ, [2, -1, 3]) v2 = vector(QQ, [4, 0, 1]) # Create a meshgrid for the plane using linear combinations of v1 and v2 u, v = var("u v") plane = u*v1 + v*v2 # Define the plot range for u and v plot_range = (-1, 1) # Plot the plane plane_plot = parametric_plot3d(plane, (u, *plot_range), (v, *plot_range), color="lightblue", opacity=0.5) # Plot the vectors v1 and v2 starting from the origin v1_plot = arrow3d((0, 0, 0), v1, color="red", thickness=0.05) v2_plot = arrow3d((0, 0, 0), v2, color="green", thickness=0.05) # Combine the plots show(plane_plot + v1_plot + v2_plot, figsize=8) This produces the following illustration of W: We will meet further applications of the Sage function parametric_plot3d in Chapter 4. Finally, if v2 is a multiple of v1, then W will be reduced to a line (dimension 1). (b) This task is a bit more challenging compared to part (a). By assumption, W = spanQ{u1 = (1, 1, 0, 0)T , u2 = (0, 1, 1, 0)T } = {au1 + bu2 : a, b ∈ Q} = { (a, a + b, b, 0)T : a, b ∈ Q } ⊂ Q4 . 175 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA It is easy to see that u1, u2 are linearly independent, hence dimQ W = 2, with a basis given by these vectors. For instance, the rank of the matrix A =     1 0 1 1 0 1 0 0     formed by u1, u2 is 2. This is because by row reducing A we get     1 0 1 1 0 1 0 0     →     1 0 0 1 0 1 0 0     →     1 0 0 1 0 0 0 0     , and the final matrix has 2 pivot columns. Thus rank(A) = 2 and u1, u2 are linearly independent. The quotient space V/W is the set of equivalence classes of vectors in V , under the equivalence relation ∼ defined by the subspace W ⊂ V , i.e., v1 ∼ v2 if and only if v1 − v2 ∈ W, where v1, v2 ∈ V . Thus, we can view V/W as V/W = {[u] : u ∈ V }, where the equivalence class of u ∈ V is denoted by [u] = {u + w : w ∈ W}, see 3.4.13 in Chapter 3. The quotient space V/W inherits a vector space structure from V , with operations defined by [u] + [v] = [u + v] and a[u] = [au], for any two vectors u, v and scalar a, try to verify this claim yourselves. It is also easy to prove that the dimension of V/W is given by dim(V/W) = dim(V ) − dim(W). Four our case, V = Q4 and dimQ W = 2, thus dimQ V/W = 4 − 2 = 2. Since W is a 2-dimensional plane in Q4 , geometrically the quotient space V/W represents the remaining “directions”, after factoring out the influence of W. To introduce the quotient space V/W in Sage we can use the command V.quotient(W), as follows: V = VectorSpace(QQ, 4) u1 = vector(QQ, [1, 1, 0, 0]) u2 = vector(QQ, [0, 1, 1, 0]) W = V.subspace([u1, u2]) print(V.quotient(W)) Let us present Sage’s output: Vector space quotient V/W of dimension 2 over Rational Field where V: Vector space of dimension 4 over Rational Field W: Vector space of degree 4 and dimension 2 over Rational Field Basis matrix: [ 1 0 -1 0] [ 0 1 1 0] Notice in this answer the basis matrix refers to W. For a basis of V/W see 2.C.13. 2.C.13. The quotient space V/W is a 2-dimensional vector space over Q, as shown in 2.C.12. Thus, a basis of V/W will consists of two vectors. These vectors can be chosen to extend the basis of W to a basis of V = Q4 . Fo example, consider the vector e3 = (0, 0, 1, 0)T of the standard basis of Q4 , and suppose that e3 = au1 + bu2 for ome a, b ∈ Q. 
Then we get the following system of equations {a = 0 , a + b = 0 , b = 1}, which does not admit a consistent solution for a, b ∈ Q. This means that e3 /∈ W. Similarly we can show that e4 = (0, 0, 0, 1)T /∈ W. Thus a basis of V/W is given by u = {e3, e4}. To verify with Sage that u forms a basis of V/W we will use the vector in subspace command, which provides a convenient way to check if a given vector is an element of a specified subspace. Our program has the following form: # Define the vectors u1 = vector(QQ, [1, 1, 0, 0]) u2 = vector(QQ, [0, 1, 1, 0]) e3 = vector(QQ, [0, 0, 1, 0]) e4 = vector(QQ, [0, 0, 0, 1]) # Define the space W spanned by u1 and u2 V = VectorSpace(QQ, 4) W = V.span([u1, u2]) # Check if e3 and e4 are in the span of W def is_in_subspace(vector, subspace): return vector in subspace is_e3_in_W = is_in_subspace(e3, W) is_e4_in_W = is_in_subspace(e4, W) # Check linear independence of e3 and e4 176 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA # Construct the matrix with e3 and e4 as rows matrix_e3_e4 = Matrix(QQ, [e3, e4]) is_independent = matrix_e3_e4.rank() == 2 # Construct the span of e3 and e4 span_V = V.span([e3, e4]) span_check = span_V.dimension() == 2 # Display results is_e3_in_W, is_e4_in_W, is_independent, span_check Confirm yourselves that Sage’s output appears as (False, False, True, True), verifying that e3, e4 do not lie on W and that u is indeed a basis of V/W. Try exploring additional examples by presenting an alternative basis. 2.C.14. (a) Taking a = b = 0 we see that U contains the zero vector (0, 0, 0)T . Thus, to show that U is a subspace of Z3 5, we need to prove closure under addition and scalar multiplication. Let u = (a1, 2b1, 3a1 + 4b1)T and v = (a2, 2b2, 3a2 + 4b2)T be arbitrary vectors of U, where ai, bi ∈ Z5 for all i = 1, 2. Then, we see that u + v = ( a1 + a2, 2(b1 + b2), 3(a1 + a2) + 4(b1 + b2) )T ∈ U , cu = ( ca1, 2cb1, c(3a1 + 4b1) ) ∈ U , with c ∈ Z5. Therefore, U is a vector subspace of Z3 5. (b) Notice that   a 2b 3a + 4b   = a   1 0 3   + b   0 2 4   = au1 + bu2 , where u1 =   1 0 3   , u2 =   0 2 4   . Thus the vectors u1, u2 span U, i.e., U = spanZ5 {u1, u2}, and to obtain a basis it remains to prove linear independence. Let A =   1 0 0 2 3 4   be the matrix having as columns the vectors u1, u2. To find the RREF of A we mention that in Z5 the inverse of 2 equals 3 and the inverse of 4 equals 4 itself. For example, for the case of 2 one needs to find an integer y such that 2 · y = 1 mod 5. Since 6 = 1 mod 5 our claim follows. Similarly for 4, we have 4 · 4 = 16 = 1 mod 5. Performing elementary row operations with respect to the arithmetic in Z5 (see the tables in 2.C.6), we obtain the following reduced row echelon form:   1 0 0 1 0 0  . For a confirmation of this matrix in Sage, add the cell F = GF(5) A=matrix(F, [[1, 0], [0, 2], [3, 4]]); A.rref() It is now immediate that rank(A) = 2. Hence the vectors u1, u2 are linearly independent. and so they form a basis of U. This implies that dimZ5 U = 2. To confirm the linear independence of u1, u2 in Sage, we use the following cell: F = GF(5); V = VectorSpace(F, 3) u1=vector(F, [1, 0, 3]); u2=vector(F, [0, 2, 4]) V.linear_dependence([u1, u2]) ==[] By adding U=span([u1, u2]); U.dimension() we also get a direct verification of the dimension of U. 2.C.16. Considered vectors are linearly independent if and only if the vectors (1, −2, 0, 0)T , (3, 0, 1, −1)T , (1, −4, 1, 2)T , and (0, 4, 8, 4)T are linearly independent in R4 . 
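In Sage, the determinant test carried out below is a one-liner:
M = matrix(QQ, [[1, -2, 0, 0], [3, 0, 1, -1], [1, -4, 1, 2], [0, 4, 8, 4]])
print(M.det())  # -36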
We see that 1 −2 0 0 3 0 1 −1 1 −4 1 2 0 4 8 4 = −36 ̸= 0 , thus the vectors are linearly independent. 2.C.19. In this task, first check that the given vectors v1, v2 are linearly independent. Next, if A is the matrix having as columns these two vectors, construct the extended matrix ( A I4 ) , where I4 is the 4 × 4 identity matrix. Then compute its reduced row echelon form. From this matrix, the columns with the pivots will form the desired basis. All these in Sage can be done as follows: 177 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA V=QQ^4; v1=vector(QQ, [2, 0, 1, 0]); v2=vector(QQ, [1, 0, 1, 0]); L=[v1,v2] V.linear_dependence(L)==[] I4=identity_matrix(4); A=column_matrix(L); A1=A.augment(I4) M=A1.rref() [A1.column(p) for p in A1.pivots( )] Sage’s output is here: [(2, 0, 1, 0), (1, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 1)] Thus the vectors {v1, v2, v3 = (0, 1, 0, 0)T , v4 = (0, 0, 0, 1)T form the desired basis of Q4 . The linear independence of these vectors can be confirmed by adding in the previous cell the next few lines of code: v3=vector(QQ, [0, 1, 0, 0]); v4=vector(QQ, [0, 0, 0, 1]); Lext=[v1,v2,v3,v4] V.linear_dependence(Lext)==[] 2.C.23. (i) This is the projection to the yz-plane and a short computation ensures its linearity. (ii) We see that F((0, 0, 0)) = (0, −3, 0) ̸= (0, 0, 0). However, any linear mapping maps the zero vector of its domain to the zero vector of the target space. Hence, this F is not linear. (iii) By assumption a ̸= 0. Hence, for two vector u = (x1, y1) and v = (x2, y2) lying on R2 , we see that F(u + v) = (x1 + x2 + a, y1 + y2) ̸= F(u) + F(v) = (x1 + a, y1) + (x2 + a, y2) = (x1 + x2 + 2a, y1 + y2) . This shows that F is not linear. (iv) When a = 0 this is the identity map, which is obviously a linear mapping. For a ̸= 0 the mapping F(x) = x + a is an affine mapping. This is because the mapping G : R → R defined by G(x) := F(x) − F(0) is linear, for all x ∈ R. (v) The mapping F((x, y)) = (x2 , y) is not linear. This is because F(c (x, y)) = F((c x, c y)) = (c2 x2 , c y) and c F((x, y)) = (c x2 , c y). Hence in general we have F(c (x, y)) ̸= c F((x, y)), except if c = 0 or c = 1. Since the relation F(c (x, y)) = c F((x, y)) must hold for any c ∈ R and u = (x, y) ∈ R2 we conclude. (vi) The mapping F : R2 → R1[t] with F((x, y)) = y + x t is linear. Indeed, consider two arbitrary vectors u = (x1, y1) and v = (x2, y2) on R2 . Then for any two scalars a, b ∈ R we have F(a u + b v) = F((a x1, a y1) + (b x2, b y2)) = F((a x1 + b x2, a y1 + b y2)) = (a y1 + b y2) + (a x1 + b x2) t = a (y1 + x1 t) + b (y2 + x2 t) = a F((x1, y1)) + b F((x2, y2)) . (vii) Fix some B ∈ Matn(R). The mapping F(A) = AB − BA is also linear, and hence an endomorphism of Matn(R). In order to prove its linearity, consider some λ, µ ∈ R and A1, A2 ∈ Matn(R). Then we have F(λ A1 + µ A2) = (λ A1 + µ A2)B − B(λ A1 + µ A2) = λ A1B + µ A2B − λ BA1 − µ BA2 = λ (A1B − BA1) + µ (A2B − BA2) = λ F(A1) + µ F(A2) . (viii) The mapping F(A) = AT is another linear endomorphism of Matn(R). Perform the brief computation yourselves. (ix) Let u = (x1, y1) and v = (x2, y2) two vectors on Z2 3 = Z3 × Z3. Then we have F(u + v) = F((x1 + x2, y1 + y2)) =   (x1 + x2) + (y1 + y2) 2(x1 + x2) + (y1 + y2) x1 + x2   =   x1 + y1 2x1 + y1 x1   +   x2 + y2 2x2 + y2 x2   = F(u) + F(v) , and F(c u) = F((c x1, c y1)) =   c x1 + c y1 2c x1 + c y1 c x1   = c   x1 + y1 2x1 + y1 x1   = c F(u) , for all c ∈ Z3. Thus, F is linear. 
(x) The exponential map is, of course, not linear because, as we will see in Chapter 5, $e^{t_1+t_2} = e^{t_1}\cdot e^{t_2} \neq e^{t_1} + e^{t_2}$.

2.C.26. Let $e = \{e_1 = (1, 0, 0)^T, e_2 = (0, 1, 0)^T, e_3 = (0, 0, 1)^T\}$ be the standard basis of R³. We see that
$$T(e_1) = (1, 1, 1)^T = e_1 + e_2 + e_3,\quad T(e_2) = (1, 1, 1)^T = e_1 + e_2 + e_3,\quad T(e_3) = (1, 2, 3)^T = e_1 + 2e_2 + 3e_3.$$
The columns of the matrix A corresponding to T are given by the images of the basis vectors. Thus we get
$$A = \begin{pmatrix} 1&1&1\\ 1&1&2\\ 1&1&3 \end{pmatrix}.$$
To work with this example in Sage, use the following code block:
# Define the variables and vector space
x, y, z = var("x y z")
V = W = RR^3
# Define the linear transformation
f(x, y, z) = [x + y + z, x + y + 2*z, x + y + 3*z]
T = linear_transformation(V, W, f)
# Get the matrix representation of the transformation
A = T.matrix(side="right")
# Display the matrix
show(A)
Verify that this confirms the matrix A obtained above.

2.C.29. The matrix of T with respect to the standard basis is given by
$$A = \begin{pmatrix} 1&0&i\\ 0&1&1\\ i&1&0 \end{pmatrix}.$$
The endomorphism T is invertible if and only if A is invertible. However, A is not of full rank; in particular, rank(A) = 2 and hence A is not invertible. Alternatively, we may compute the determinant of A, which is zero, det(A) = 0. Hence A is singular and T⁻¹ does not exist. A verification in Sage is given in the usual way:
v1 = vector([1, 0, i]); v2 = vector([0, 1, 1]); v3 = vector([i, 1, 0])
A = column_matrix([v1, v2, v3])
print(rank(A)); print(det(A))

2.C.30. (a) We begin with the conjugation. This maps 1 → 1 and i → −i, written in the coordinates as (1, 0) → (1, 0) and (0, 1) → (0, −1), respectively. By writing the images into the columns we obtain the matrix $\begin{pmatrix} 1&0\\ 0&-1 \end{pmatrix}$. For the second map, using the basis u = {1, i} we obtain 1 → 2 + i, i → 2i − 1, that is, (1, 0) → (2, 1), (0, 1) → (−1, 2). Thus, the matrix of multiplication by the number 2 + i under the basis u has the form $\begin{pmatrix} 2&-1\\ 1&2 \end{pmatrix}$.
(b) We will only determine the matrix of the second linear map in the basis f, and leave the first case for practice. Multiplication by (2 + i) gives us: (1 − i) → (1 − i)(2 + i) = 3 − i, (1 + i) → (1 + 3i). The coordinates (a, b)_f of the vector 3 − i in the basis f are given, as we know, by the equation a·(1 − i) + b·(1 + i) = 3 − i, that is, (3 − i)_f = (2, 1). Analogously (1 + 3i)_f = (−1, 2). Altogether, we obtain the matrix $\begin{pmatrix} 2&-1\\ 1&2 \end{pmatrix}$. Observe that the matrices for multiplication by 2 + i are the same in both bases. Why does this happen? Would the matrices be identical in both bases for multiplication by any complex number?¹⁶

¹⁶ This similarity arises because multiplication by a complex number c corresponds to a linear transformation that is invariant under the basis changes considered here. Essentially, c acts as a scalar multiplication on the vector space, and scalar multiplication is insensitive to the choice of a basis.

2.C.31. To simplify the explanation, we will prove the statement for a low-dimensional case, as the idea extends easily to the general case. For example, let us consider proving that the space Mat₂,₃(K) of 2 × 3 matrices over K is isomorphic to $K^{2\cdot 3} = K^6$. Any element of Mat₂,₃(K) has the form
$$\begin{pmatrix} a&b&c\\ d&e&f \end{pmatrix},\qquad \text{for some } a, b, c, d, e, f \in K.$$
From this expression we see that the matrices
$$E_1 = \begin{pmatrix} 1&0&0\\ 0&0&0 \end{pmatrix},\ E_2 = \begin{pmatrix} 0&1&0\\ 0&0&0 \end{pmatrix},\ E_3 = \begin{pmatrix} 0&0&1\\ 0&0&0 \end{pmatrix},\ E_4 = \begin{pmatrix} 0&0&0\\ 1&0&0 \end{pmatrix},\ E_5 = \begin{pmatrix} 0&0&0\\ 0&1&0 \end{pmatrix},\ E_6 = \begin{pmatrix} 0&0&0\\ 0&0&1 \end{pmatrix}$$
generate Mat₂,₃(K). It is also easy to see that these matrices are linearly independent, and hence they form a basis of Mat₂,₃(K). This implies that dim_K Mat₂,₃(K) = 6. For a verification, use Sage and its command MatrixSpace, as follows:
MS = MatrixSpace(SR, 2, 3)
MS.dimension()
Sage prints out the number 6, which is the dimension of Mat₂,₃(K) over K. Adding the following command, one can also obtain the basis mentioned above:
B = MS.basis(); list(B)
Consider finally the mapping φ defined by
$$\varphi: \mathrm{Mat}_{2,3}(K) \to K^6,\qquad \begin{pmatrix} a&b&c\\ d&e&f \end{pmatrix} \mapsto (a, b, c, d, e, f)^T.$$
This is linear and maps the basis E₁, ..., E₆ to the standard basis e₁ = (1, 0, ..., 0)^T, ..., e₆ = (0, ..., 0, 1)^T of K⁶. Hence φ is a linear isomorphism (you should be able to prove that a linear mapping having this property is always an isomorphism).

2.C.32. An isomorphism f: Rₙ[x] → R^{n+1} is given by $a_0 + a_1x + a_2x^2 + \cdots + a_nx^n \mapsto (a_0, a_1, \ldots, a_n)$. We leave it to the reader to prove that f is a well-defined linear bijection. Can you specify the inverse f⁻¹: R^{n+1} → Rₙ[x]?

2.C.33. To prove the claim it is sufficient to show that U = span_R{u1, u2, u3} is a 3-dimensional subspace of R⁴, and hence that the vectors u1, u2, u3 are linearly independent. Indeed, for some a, b, c ∈ R the matrix equation au1 + bu2 + cu3 = 0 is equivalent to the system {a + c = 0, 2a + b = 0, b = 0}, which has the unique solution a = b = c = 0. Or, in Sage we can type
u1 = vector([1, 2, 0, 0]); u2 = vector([0, 1, 0, 1]); u3 = vector([1, 0, 0, 0]); L = [u1, u2, u3]
V = RR**4; V.linear_dependence(L) == []
Let e = {e1, e2, e3} be the standard basis of R³. Check yourselves that a linear isomorphism f: R³ → U is defined by f(e1) = u1, f(e2) = u2, f(e3) = u3.

2.C.34. A subspace of R³ can be 0-, 1-, 2- or 3-dimensional. Thus, up to linear isomorphisms, such a subspace is isomorphic to (a) the origin of R³ (the single zero vector), or (b) a line passing through the origin of R³, or (c) a plane passing through the origin of R³, or (d) R³ itself.

2.C.36. We can easily see that u is a basis of C³, e.g., via Sage. To find its dual, we need to specify linear forms φᵢ: C³ → C for i = 1, 2, 3, satisfying the relations
φ₁(u1) = 1, φ₁(u2) = 0 = φ₁(u3), φ₂(u1) = 0, φ₂(u2) = 1, φ₂(u3) = 0, φ₃(u1) = 0 = φ₃(u2), φ₃(u3) = 1,
respectively. We may suppose that
$$\varphi_1(z_1, z_2, z_3) = \sum_{i=1}^{3} a_iz_i,\quad \varphi_2(z_1, z_2, z_3) = \sum_{i=1}^{3} b_iz_i,\quad \varphi_3(z_1, z_2, z_3) = \sum_{i=1}^{3} c_iz_i,$$
for some complex numbers aᵢ, bᵢ, cᵢ (i = 1, 2, 3). From the first set of equations we get a₁ = 1, a₂ = 0 and a₃ = i, thus φ₁(z₁, z₂, z₃) = z₁ + iz₃. From the second set of equations we get b₁ = 0, b₂ = −i and b₃ = 1, thus φ₂(z₁, z₂, z₃) = −iz₂ + z₃. Finally, for φ₃ we get c₁ = 0 = c₂ and c₃ = −i, that is, φ₃(z₁, z₂, z₃) = −iz₃. You may try to verify on your own that {φ₁, φ₂, φ₃} is indeed a basis of (C³)*.

2.C.37. (a) Suppose that $u = \sum_{i=1}^n c_iu_i$ for some real numbers cᵢ, i = 1, ..., n. Then, by linearity, we see that
$$\varphi_1(u) = \varphi_1\Big(\sum_i c_iu_i\Big) = \sum_i c_i\varphi_1(u_i) = c_1\varphi_1(u_1) + c_2\varphi_1(u_2) + \cdots + c_n\varphi_1(u_n) = c_1\cdot 1 + c_2\cdot 0 + \cdots + c_n\cdot 0 = c_1,$$
and in an analogous way we can show that φᵢ(u) = cᵢ for all i = 1, ..., n. Based on this remark we obtain the first relation: $u = \sum_{i=1}^n c_iu_i = \sum_{i=1}^n \varphi_i(u)u_i$.
(b) To prove the second relation, let us apply ξ to both sides of the relation in (a). Since the φᵢ(u) are scalars, i.e., φᵢ(u) ∈ K, this gives
$$\xi(u) = \xi\Big(\sum_i \varphi_i(u)u_i\Big) = \sum_i \varphi_i(u)\xi(u_i) = \Big(\sum_i \xi(u_i)\varphi_i\Big)(u).$$
This relation holds for all u ∈ V and hence the result follows, i.e., $\xi = \sum_i \xi(u_i)\varphi_i$.
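As a quick sanity check of the reproduction formula in (a), take V = Q³ with its standard basis, in which case the forms φᵢ are simply the coordinate functions (a toy example of our own; the nontrivial dual basis case was treated in 2.C.36):
u1 = vector(QQ, [1, 0, 0]); u2 = vector(QQ, [0, 1, 0]); u3 = vector(QQ, [0, 0, 1])
u = vector(QQ, [3, -1, 2])
# here phi_i(u) is just the i-th coordinate u[i], so the formula u = sum phi_i(u) u_i reads:
print(sum(u[i]*[u1, u2, u3][i] for i in range(3)) == u)  # True

2.C.38.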
If A = (aij), B = (bij) ∈ Matm(K) and c ∈ K, then we have tr(A + B) = (a11 + b11) + · · · + (amm + bmm) = (a11 +· · ·+amm)+(b11 +· · ·+bmm) = tr(A)+tr(B) and tr(cA) = ca11 +· · ·+camm = c(a11 +· · ·+amm) = c tr(A). Thus tr is a linear transformation, and since the output is a scalar, i.e., tr(A) ∈ K, it is a linear form of Matm(K). Now, for A, B ∈ Matm(K) we have AB = ((ab)ij) and BA = ((ba)ij). Thus, tr(AB) = m∑ i=1 (ab)ii = m∑ i=1 m∑ k=1 aikbki = m∑ k=1 m∑ i=1 bkiaik = m∑ k=1 (ba)kk = tr(BA) . 180 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA As a counterexample of the relation tr(AB) = tr(A) tr(B), consider any non-zero matrix A ∈ Matm(R) and the identity matrix E ∈ Matm(R). Then we see that tr(AE) = tr(A), but tr(E) tr(A) = m tr(A). Hence, except of the trivial case m = 1 where the relation actually holds, this provides an answer. 2.C.39. Consider for example the endomorphism f : R2 → R2 represented by the matrix A = ( 3 1 0 2 ) with respect to the standard basis e of R2 . Obviously, tr(A) = 3 + 2 = 5. Another basis of R2 is given by u = { (1, 1)T , (1, −1)T } (use Sage, as indicated below, to verify that these two vectors are linearly independent). V=RR^2; v1=vector(RR, [1,1]) v2=vector(RR, [1, -1]) V.linear_dependence([v1, v2]) ==[] The matrix that changes from the new basis u to the standard one e is formed by placing the new basis vectors as columns, see also below. Let us denote this matrix by P = ( 1 1 1 −1 ) . The matrix presentation of f in the new basis is given by A′ = P−1 AP = ( 3 0 1 2 ) , see the discussion in the end of 2.3.16. We see that tr(A′ ) = 5 = tr(A), hence the trace is invariant under a change of basis. You may like to confirm this result in Sage (and the expression of A′ ). This can be done by the usual methods, see the cell given here: # Define the matrix A in the standard basis A = Matrix(RR, [[3, 1], [0, 2]]) # Define the change of basis matrix P P = Matrix(RR, [[1, 1], [1, -1]]) # Compute the inverse of P P_inv = P.inverse() # Compute the matrix representation of the endomorphism in the new basis A_prime = P_inv * A * P show(A_prime) # Compute the traces bool(A.trace()==A_prime.trace()) 2.C.40. In 2.C.38 we proved the relation tr(AB) = tr(BA). Hence, for any invertible m × m matrix P over K we obtain tr(P−1 AP) = tr(P−1 (AP)) = tr((AP)P−1 ) = tr(APP−1 ) = tr(AE) = tr(A) . Thus, A and P−1 AP have the same trace and our claim follows. For the determinant observe that det(B) = det(P−1 AP) = det(P−1 ) det(A) det(P) = det(A) det(P) det(P) = det(A) . Here, we utilized the fact that the determinant of the product of matrices equals the product of their determinants, and that for any invertible matrix P, the determinant of its inverse is given by det(P−1 ) = 1/ det(P). Now, in (a) the given matrices A, B are such that tr(A) = tr(B) = 4 but det(A) = 7 ̸= 6 = det(B). Thus, according to our result above, the matrices A, B cannot be similar. For the matrices A, B given in (b), we have tr(A) = tr(B) = 0 and det(A) = det(B) = 0, hence we cannot conclude by applying our criterium. However, let M = ( α β γ δ ) be an invertible 2 × 2 real matrix. Thus we assume that det(M) = αδ − βγ ̸= 0, so that M−1 = 1 αδ−βγ ( δ −β −γ α ) . For simplicity, we may fix α, β so that det(M) = 1. Then we see that M−1 AM = ( δ −β −γ α ) ( 0 1 0 0 ) ( α β γ δ ) = ( 0 δ 0 −γ ) ( α β γ δ ) = ( γδ δ2 −γ2 −γδ ) . Therefore, in this case the matrices A, B are similar. Notice that for γ = δ = 0, we have det(M) = 0, hence we should exclude these values. 
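Before turning to part (c), we can let Sage redo the computation above symbolically (a small sketch of our own; instead of normalizing det(M) = 1 we clear the determinant, using that det(M)·M⁻¹AM equals the matrix displayed above):
var("alpha beta gamma delta")
A = matrix(SR, [[0, 1], [0, 0]])
M = matrix(SR, [[alpha, beta], [gamma, delta]])
lhs = (M.det()*M.inverse()*A*M).simplify_full()
rhs = matrix(SR, [[gamma*delta, delta^2], [-gamma^2, -gamma*delta]])
print((lhs - rhs).simplify_full().is_zero())  # True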
The matrices A, B given in (c) are also similar. Let us verify this in Sage via the command B.is_similar(A), as follows:
A = matrix([[2, 1], [1, 2]])
B = matrix([[3, 0], [0, 1]])
b, P = B.is_similar(A, transformation=True)
b; show(P)
The output here is True, which means that A, B are indeed similar, while show(P) prints the matrix P satisfying P⁻¹AP = B. A verification of this relation can be done by adding the code
bool(P.inverse()*A*P == B)
In fact, later we will explain that the diagonal matrix B in (c) consists of the eigenvalues of A, and so the similarity of A, B is related to the study of the eigenvectors of A, see also below.

2.C.42. To solve in Sage the task given in 2.C.41, we first introduce the vector space V = R³ and confirm the linear independence of the vectors u1, u2 and u3. Next, we use these vectors to form the transition matrix T, which can be done easily via the column_matrix command. Finally, we compute the inverse of T and display the new coordinates of w by applying the rule T⁻¹w. Here is the corresponding code:
V = RR^3; u1 = vector([1, 1, 3])
u2 = vector([1, -1, 1]); u3 = vector([3, 1, 5])
V.linear_dependence([u1, u2, u3]) == []
T = column_matrix([u1, u2, u3])
Tinv = T.inverse()
w = vector([1, 2, 3]); wcor = Tinv*w
show(wcor)

2.C.43. Suppose that u2 consists of the vectors V1, V2, V3. By the given expression of T we have
V1 = E1 + E2 − E3 = (2, 0, 0)^T, V2 = E1 + 2E2 + E3 = (3, 3, 2)^T, V3 = E1 + E2 + 2E3 = (2, 3, 3)^T.
Thus u2 = {(2, 0, 0)^T, (3, 3, 2)^T, (2, 3, 3)^T}. As a verification in Sage, you may type
V = RR**3; V1 = vector(RR, [2, 0, 0]); V2 = vector(RR, [3, 3, 2]); V3 = vector(RR, [2, 3, 3])
V.linear_dependence([V1, V2, V3]) == []

2.C.44. Again the transition matrix T for changing the basis from the basis f = {f1, f2, f3} to the standard basis e can be obtained by writing the coordinates of the vectors f1, f2, f3 in the standard basis as the columns of the matrix T. On the other hand, the transition matrix for changing the basis from e to f is the inverse of T. We compute
$$T = \begin{pmatrix} 1&-1&2\\ 1&1&0\\ 0&1&1 \end{pmatrix},\qquad T^{-1} = \begin{pmatrix} 1/4&3/4&-1/2\\ -1/4&1/4&1/2\\ 1/4&-1/4&1/2 \end{pmatrix}.$$
Now we can derive the matrix of the mapping in the basis f, which is given by (see also 2.2.11)
$$T^{-1}AT = \begin{pmatrix} -1/4&2&-3/4\\ 5/4&0&7/4\\ 3/4&-2&9/4 \end{pmatrix}.$$
A method in Sage that can be used to derive the matrix of F: R³ → R³ using built-in functions is described in 2.E.46.

2.C.46. (i) As always, the expressions of the given vectors u1, u2 are with respect to the standard basis of R³. To compute their (standard) scalar product we have
$$u_1 \cdot u_2 = \langle u_1, u_2\rangle = u_2^Tu_1 = \begin{pmatrix} -1 & 1 & -\sqrt{2} \end{pmatrix}\begin{pmatrix} 1\\ 3\\ \sqrt{2} \end{pmatrix} = -1 + 3 - 2 = 0,$$
and the same results from the computation $u_2 \cdot u_1 = \langle u_2, u_1\rangle = u_1^Tu_2 = -1 + 3 - 2 = 0$. This verifies that ⟨ , ⟩ is symmetric. Having u1 · u2 = 0 means that the vectors u1, u2 are orthogonal to each other, u1 ⊥ u2. Moreover, $\|u_1\| = \sqrt{u_1\cdot u_1} = \sqrt{1 + 9 + 2} = \sqrt{12} = 2\sqrt{3}$ and $\|u_2\| = \sqrt{u_2\cdot u_2} = \sqrt{1 + 1 + 2} = \sqrt{4} = 2$, respectively. A verification of these computations in Sage relies on the block
u1 = vector([1, 3, sqrt(2)])
u2 = vector([-1, 1, -sqrt(2)])
sp1 = u1.dot_product(u2); sp2 = u2.dot_product(u1); bool(sp1 == sp2)
print(sp1); print(sp2)
print(norm(u1)); print(norm(u2))
(ii) Let us present the solution in Sage:
v1 = vector([0, 1, 2])
v2 = vector([-1, 2, 3])
spr1 = v1.dot_product(v2)
print(spr1); print(norm(v1))
print(norm(v2))
In order to compute the angle between v1, v2, add the code
theangle = spr1/(norm(v1) * norm(v2))
arccos(theangle).n()
Notice the answer printed out by Sage is in radians.

2.C.51. It is straightforward to verify the orthonormal basis obtained in 2.C.48, using the function proj(v, u) developed there. The corresponding code block is as follows:
E1 = vector([1, 1, 1, 1])
E2 = vector([1, 1, 1, 0])
E3 = vector([1, 1, 0, 0])
def proj(v, u):
    p = v.dot_product(u)/(norm(u)^2)*u
    return p
w1 = E1; w2 = E2 - proj(E2, w1)
w3 = E3 - proj(E3, w2) - proj(E3, w1)
g = [w1, w2, w3]; g
W1 = w1/norm(w1); W2 = w2/norm(w2)
W3 = w3/norm(w3); G = [W1, W2, W3]; G
Verify on your own that running this block, Sage will print the basis {w1, w2, w3}, denoted by g, and the orthonormal basis {ŵ1, ŵ2, ŵ3}, denoted by G.

2.C.52. The set of solutions of a homogeneous linear equation is always a vector space. For our case, a computation shows that a basis of this vector space consists of the vectors u1 = (−1, 1, 0, 0)^T, u2 = (−1, 0, 1, 0)^T, u3 = (−1, 0, 0, 1)^T. Let us now denote by {v1, v2, v3} the orthogonal basis obtained using the Gram–Schmidt orthogonalization process. We have v1 = u1, and for the remaining two vectors we respectively compute
$$v_2 = u_2 - \frac{u_2^T\cdot v_1}{\|v_1\|^2}v_1 = u_2 - \frac{1}{2}v_1 = \Big(-\frac{1}{2}, -\frac{1}{2}, 1, 0\Big)^T,$$
$$v_3 = u_3 - \frac{u_3^T\cdot v_1}{\|v_1\|^2}v_1 - \frac{u_3^T\cdot v_2}{\|v_2\|^2}v_2 = u_3 - \frac{1}{2}v_1 - \frac{1}{3}v_2 = \Big(-\frac{1}{3}, -\frac{1}{3}, -\frac{1}{3}, 1\Big)^T.$$

2.C.53. The kernel Ker(φ) of φ is a linear subspace of R⁴. To specify Ker(φ) we need to solve the matrix equation Ax = 0 for an arbitrary vector x = (x1, x2, x3, x4)^T ∈ R⁴. Let us use Sage to quickly perform this task:
A = matrix(SR, 4, 4, [[1/2, 2, -1/2, -1/2], [1, 1/2, 1, 3/2], [2, 9/2, 0, 1/2], [2, 2, 0, 0]])
A.right_kernel()
Sage's output has the form
Vector space of degree 4 and dimension 1 over Symbolic Ring
Basis matrix:
[ 1 -1 -8  5]
This means that Ker(φ) is a 1-dimensional subspace of R⁴, spanned by the vector E1 = (1, −1, −8, 5)^T ∈ R⁴. For the norm of this vector we compute ∥E1∥ = √91, hence an orthonormal basis of Ker(φ) consists of the vector (1/√91)E1.

2.C.54. (i) Let x = (x1, x2, x3, x4)^T be a vector orthogonal to u, with respect to the dot product on R⁴. Then x · u = x1 + x2 + x3 + x4 = 0. Thus, if U is the subspace of vectors orthogonal to u, then
U = {(x1, x2, x3, x4)^T ∈ R⁴ : x4 = −(x1 + x2 + x3)} = {(x1, x2, x3, −(x1 + x2 + x3))^T : x1, x2, x3 ∈ R}.
It follows that U is 3-dimensional. In particular, vectors of U can be expressed as
$$(x_1, x_2, x_3, -(x_1 + x_2 + x_3))^T = x_1E_1 + x_2E_2 + x_3E_3,$$
where E1, E2, E3 are the vectors defined by E1 = (1, 0, 0, −1)^T, E2 = (0, 1, 0, −1)^T, and E3 = (0, 0, 1, −1)^T, respectively. We easily check that these are linearly independent, and hence they provide a basis of U.
(ii) For the second task we get two conditions from orthogonality, namely:
x · w = x1 − x4 = 0, x · z = −x1 + x3 = 0.
This means that the vector x = (x1, x2, x3, x4)^T is orthogonal to both the vectors w and z if and only if x1 = x4 and x1 = x3. Therefore, if V denotes the subspace of vectors orthogonal to w and z, then we see that V = {(x1, x2, x1, x1)^T : x1, x2 ∈ R}.
It turns out that V is 2-dimensional, with a basis given by the vectors v1 = (1, 0, 1, 1)T and v2 = (0, 1, 0, 0)T . 2.C.56. We will prove that C(A) is orthogonal to the cokernel Ker(AT ) and leave the second case for practice. Recall that for a m × n matrix A, both C(A) and Ker(AT ) are subspaces of Rm given by C(A) = {Ax : x ∈ Rn } and Ker(AT ) = {y ∈ Rm : AT y = 0}, respectively. Hence, for u ∈ C(A) and w ∈ Ker(AT ) we see that ⟨u, w⟩ = wT u = wT Ax = (AT w)T x = 0x = 0 , where the first equality is the definition of the dot product, the second occurs by replacing u ∈ C(A) by Ax and the third relies on the relation (AB)T = BT AT . Thus wT u = 0 and C(A) ⊥ Ker(AT ) with respect to the dot product on Rm . Recall now from 2.C.20 that for the matrix A =   2 4 1 3 0 5   the column space C(A) is spanned by the vectors u1 = (2, 1, 0)T and u2 = (4, 3, 5)T , while the left null space Ker(AT ) is spanned by the vector u3 = (5, −10, 2)T . Hence, we see that ⟨u1, u3⟩ = uT 3 u1 = ( 5 −10 2 )   2 1 0   = 10 − 10 = 0 , ⟨u2, u3⟩ = uT 3 u2 = ( 5 −10 2 )   4 3 5   = 20 − 30 + 10 = 0 . Similarly, we saw that the row space C(AT ) is the subspace span{(2, 4)T , (1, 3)T }, while Ker(A) is trivial, Ker(A) = {0}. Hence it is also trivial to prove that C(AT ) ⊥ Ker(A). 2.D.5. Observe that this matrix differs from the one in 2.D.1 by just a single sign. However, in this case the characteristic polynomial of A has the form λ3 −6λ2 +12λ−8, or in other words χA(λ) = (λ−2)3 . Therefore, the unique eigenvalue of A is given by λ = 2, with algebraic multiplicity three. The geometric multiplicity of λ is either one, two or three. Let us determine the eigenvectors associated to λ. They are the solutions of the matrix equation (A − 2E)x = 0, and a direct computation shows that the eigenspace V2 has the form V2 = spanR{(1, −1, 0)T , (0, 0, 1)T }. Thus, the geometric multiplicity of the unique eigenvalue λ of A equals to two. 2.D.7. Let us prove the statement for n = 3 and when A is upper triangular, and similarly is treated the more general case. Thus we assume that A has the form A =   a11 a12 a13 0 a22 a23 0 0 a33   , with aij ∈ K for all i, j. Then it is easy to see that the characteristic polynomial has the form χA(λ) = (a11 − λ)(a22 − λ)(a33 − λ), hence the roots of χA(λ) are the diagonal entries a11, a22, a33. The result now follows. 2.D.8. (a) The first claim is based on the fact that det(A) is the product of eigenvalues of A. On the other hand, we know that A is invertible if and only if det(A) ̸= 0. Combining these two statements we obtain that A is invertible if and only if 0 is not an eigenvalue of A. (b) The solution arises as an application of part (a). You may like to write down the matrix An for small values of n and then generalize (by induction over n). For instance, since A1 = 1 , A2 = ( 1 1 1 1 ) , A3 =   1 0 1 1 1 0 0 1 1   , A4 =     1 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1     , we see that det(A2) = det(A4) = 0 but det(A1) ̸= 0 and det(A3) ̸= 0. Thus, an option is to compute the determinant of An and show that det(An) ̸= 0 if and only if n is odd. An alternative method is suggested by the statement, and is based on the characteristic polynomial: 184 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA χAn (λ) = det(An − λ E) = 1 − λ 0 0 0 · · · 0 0 1 1 1 − λ 0 0 · · · 0 0 0 0 1 1 − λ 0 · · · 0 0 0 0 0 1 1 − λ · · · 0 0 0 ... ... ... ... ... ... ... ... 0 0 0 0 · · · 1 1 − λ 0 0 0 0 0 · · · 0 1 1 − λ . 
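In Sage we can compute the first few of these characteristic polynomials and confirm the pattern (a quick sketch of our own; Aₙ has 1 on the diagonal, 1 just below it, and 1 in the top-right corner):
def A_n(n):
    return matrix(QQ, n, n, lambda i, j: 1 if i == j or i == j + 1 or (i == 0 and j == n - 1) else 0)
for n in (2, 3, 4, 5):
    print(n, A_n(n).charpoly("t").factor())  # charpoly is (t - 1)^n - 1, matching the formula below up to sign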
By expanding the determinant with respect to the first column, for example, we see that $\chi_{A_n}(\lambda) = (1 - \lambda)^n + (-1)^{n-1}$. Based on this expression we deduce that Aₙ is invertible if and only if n is odd (we are looking for those n such that χ_{Aₙ}(λ) does not admit λ = 0 as a root).

2.D.10. The given matrix A1 is diagonalizable, since it has two eigenvalues, λ1 = 1 with algebraic and geometric multiplicity 1, and λ2 = 2 with algebraic and geometric multiplicity 2. The eigenspace V1 is generated by the eigenvector (0, 1, −1/2)^T (or any non-zero multiple of this vector), and V2 is generated by the eigenvectors (1, 0, −2)^T and (0, 1, 0)^T (or any non-zero multiples of them). Thus, for example, the matrices D1 and P1 presented below satisfy A1 = P1D1P1⁻¹:
$$D_1 = \mathrm{diag}(1, 2, 2),\qquad P_1 = \begin{pmatrix} 0&1&0\\ 2&0&1\\ -1&-2&0 \end{pmatrix}.$$
To verify all these results via Sage, execute the following block:
A1 = matrix(SR, [[2, 0, 0], [4, 2, 2], [-2, 0, 1]])
p(t) = A1.characteristic_polynomial(t)
show(p(t).factor())
show(p.roots())
show(A1.eigenvectors_right())
u1 = vector([0, 2, -1])
u2 = vector([1, 0, -2])
u3 = vector([0, 1, 0])
P1 = column_matrix([u1, u2, u3])
D1 = diagonal_matrix([1, 2, 2])
bool(A1 == P1*D1*P1.inverse())
For the matrix A2, which differs from A1 in only one sign, the result is negative; in particular, A2 is not diagonalizable. It has the same eigenvalues as A1, i.e., λ1 = 1 and λ2 = 2, with algebraic multiplicity 1 and 2, respectively. However, in this case λ2 has geometric multiplicity 1, and thus it is impossible to obtain a basis of eigenvectors for the matrix A2.

2.D.11. Suppose that AAᵀ = AᵀA = E. Then we see that
⟨φ_A(u), φ_A(w)⟩ = ⟨Au, Aw⟩ = (Aw)ᵀAu = wᵀAᵀAu = wᵀu = ⟨u, w⟩,
for any u, w ∈ V. Hence φ_A is orthogonal. For the converse, assume that φ_A is orthogonal. Then we have ⟨φ_A(u), φ_A(w)⟩ = ⟨Au, Aw⟩ = ⟨u, w⟩, or equivalently wᵀAᵀAu = wᵀu, or (AᵀAw)ᵀu = wᵀu. Hence ⟨u, w⟩ = ⟨u, AᵀAw⟩. By linearity this final relation can be equivalently written as ⟨u, (E − AᵀA)w⟩ = 0 for any u, w ∈ V. Recall now that ⟨ , ⟩ is a scalar product and hence non-degenerate, which means that the condition ⟨x, y⟩ = 0 for all x implies y = 0. Thus we obtain (E − AᵀA)w = 0 for all w ∈ V, which yields the desired relation E − AᵀA = 0, i.e., A is orthogonal.

2.E.3. (i) The row operation R2 → R2 − 3R1 transforms A as follows:
$$A = \begin{pmatrix} 1&0&2\\ 3&1&-1\\ 2&4&2 \end{pmatrix} \xrightarrow{R_2 \to R_2 - 3R_1} \begin{pmatrix} 1&0&2\\ 0&1&-7\\ 2&4&2 \end{pmatrix}.$$
To determine the corresponding elementary matrix, which we may denote by E1, we apply the same row operation to the 3 × 3 identity matrix. This gives
$$E = \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix} \xrightarrow{R_2 \to R_2 - 3R_1} \begin{pmatrix} 1&0&0\\ -3&1&0\\ 0&0&1 \end{pmatrix} =: E_1.$$
To confirm this, note that the product E1A yields the same result as performing the row operation on A, i.e.,
$$E_1A = \begin{pmatrix} 1&0&0\\ -3&1&0\\ 0&0&1 \end{pmatrix}\begin{pmatrix} 1&0&2\\ 3&1&-1\\ 2&4&2 \end{pmatrix} = \begin{pmatrix} 1&0&2\\ 0&1&-7\\ 2&4&2 \end{pmatrix}.$$
Every elementary matrix is invertible, and the inverse of an elementary matrix is also an elementary matrix. It is found by performing the reverse row operation on the identity matrix, which in our case is the row operation R2 → R2 + 3R1. Hence we have
$$E_1^{-1} = \begin{pmatrix} 1&0&0\\ 3&1&0\\ 0&0&1 \end{pmatrix},\quad \text{so that} \quad E_1E_1^{-1} = \begin{pmatrix} 1&0&0\\ -3&1&0\\ 0&0&1 \end{pmatrix}\begin{pmatrix} 1&0&0\\ 3&1&0\\ 0&0&1 \end{pmatrix} = E = E_1^{-1}E_1.$$
(ii) By applying the row operation R2 → R2 − (3/2)R3 we obtain
$$A = \begin{pmatrix} 1&0&2\\ 3&1&-1\\ 2&4&2 \end{pmatrix} \xrightarrow{R_2 \to R_2 - \frac{3}{2}R_3} \begin{pmatrix} 1&0&2\\ 0&-5&-4\\ 2&4&2 \end{pmatrix}.$$
For the corresponding elementary matrix we have
$$E = \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix} \xrightarrow{R_2 \to R_2 - \frac{3}{2}R_3} \begin{pmatrix} 1&0&0\\ 0&1&-3/2\\ 0&0&1 \end{pmatrix} =: E_2.$$
As a confirmation, we see that the product E2A matches the result of the row operation on A, i.e.,
$$E_2A = \begin{pmatrix} 1&0&0\\ 0&1&-3/2\\ 0&0&1 \end{pmatrix}\begin{pmatrix} 1&0&2\\ 3&1&-1\\ 2&4&2 \end{pmatrix} = \begin{pmatrix} 1&0&2\\ 0&-5&-4\\ 2&4&2 \end{pmatrix}.$$
For the inverse we get
$$E_2^{-1} = \begin{pmatrix} 1&0&0\\ 0&1&3/2\\ 0&0&1 \end{pmatrix}.$$
(iii) In a similar way we find the elementary matrix E3 representing the row operation R3 → (1/2)R3, and its inverse, corresponding to the row operation R3 → 2R3:
$$E_3 = \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1/2 \end{pmatrix},\qquad E_3^{-1} = \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&2 \end{pmatrix}.$$

2.E.5. Here is a block in Sage that confirms the expressions of E1, ..., E8 and of their inverses, given in 2.E.4:
# Define the elementary matrices
E1 = elementary_matrix(QQ, 3, row1=0, scale=1/2)
E2 = elementary_matrix(QQ, 3, row1=1, row2=0, scale=-1)
E3 = elementary_matrix(QQ, 3, row1=2, row2=0, scale=-4)
E4 = elementary_matrix(QQ, 3, row1=1, scale=-2)
E5 = elementary_matrix(QQ, 3, row1=2, scale=-1/5)
E6 = elementary_matrix(QQ, 3, row1=1, row2=2, scale=-1)
E7 = elementary_matrix(QQ, 3, row1=0, row2=2, scale=-3/2)
E8 = elementary_matrix(QQ, 3, row1=0, row2=1, scale=-1/2)
# Print the elementary matrices
show(E1, E2, E3, E4, E5, E6, E7, E8)
# Compute their inverses
E1_inv = E1.inverse()
E2_inv = E2.inverse()
E3_inv = E3.inverse()
E4_inv = E4.inverse()
E5_inv = E5.inverse()
E6_inv = E6.inverse()
E7_inv = E7.inverse()
E8_inv = E8.inverse()
# Print the inverses
show(E1_inv, E2_inv, E3_inv, E4_inv, E5_inv, E6_inv, E7_inv, E8_inv)

2.E.6. We have four unknowns x1, x2, x3, x4, and four equations. Hence the matrix A has size 4 × 4, and b = (3, 4, 1, 5)^T ∈ R⁴. Let us present the solution using Sage, and compute the reduced row echelon form of the extended matrix B:
B = matrix([[3, 3, 2, 1, 3], [2, 1, 1, 0, 4], [0, 5, -4, 3, 1], [5, 3, 3, -3, 5]])
Br = B.rref(); show(Br); print(Br.pivots())
The output is the matrix
$$\begin{pmatrix} 1&0&0&0&4\\ 0&1&0&0&-2\\ 0&0&1&0&-2\\ 0&0&0&1&1 \end{pmatrix}$$
and the quadruple (0, 1, 2, 3). Therefore, the pivots lie in the first four columns and the corresponding linear system admits a unique solution, given by x1 = 4, x2 = −2, x3 = −2, x4 = 1. Try to present a formal computation of the reduced row echelon form.

2.E.9. There are infinitely many solutions, represented by the vectors (x1, x2, x3, x4)^T = (1 + t, 3/2, t, −1/2)^T, with t ∈ R.

2.E.10. This linear system has no solution. Can you explain why this claim is true?

2.E.12. The solution has the form {(−10t, (µ + 4)t, (3µ − 8)t)^T : t ∈ R}.

2.E.13. For a = 0 the system has no solution. For a ≠ 0 the system has infinitely many solutions.

2.E.14. Let us use the extended matrix and apply elementary row transformations to obtain
$$\left(\begin{array}{ccc|c} 1&-a&-2&b\\ 1&1-a&0&b-3\\ 1&1-a&a&2b-1 \end{array}\right) \sim \left(\begin{array}{ccc|c} 1&-a&-2&b\\ 0&1&2&-3\\ 0&1&a+2&b-1 \end{array}\right) \sim \left(\begin{array}{ccc|c} 1&-a&-2&b\\ 0&1&2&-3\\ 0&0&a&b+2 \end{array}\right).$$
Above, we first subtracted the first row from the second and the third; then we subtracted the second row from the third. We see that the system has a unique solution (determined by backward elimination) if and only if a ≠ 0. If a = 0 and b = −2, we have a zero row in the extended matrix; choosing x3 ∈ R as a parameter then gives infinitely many distinct solutions. For a = 0 and b ≠ −2 the last equation reads 0 = b + 2, which cannot be satisfied, so the system has no solution. Note that:
• For a = 0, b = −2 we have infinitely many solutions of the form (x1, x2, x3)^T = (−2 + 2t, −3 − 2t, t)^T, with t ∈ R.
• For a ≠ 0 the unique solution has the form $\Big(\dfrac{-3a^2 - ab - 4a + 2b + 4}{a},\ -\dfrac{3a + 2b + 4}{a},\ \dfrac{b + 2}{a}\Big)^T$.
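The case analysis above can be spot-checked in Sage for concrete parameter values (a small sketch of our own; the equations are re-entered symbolically):
a, b, x1, x2, x3 = var("a b x1 x2 x3")
eqs = [x1 - a*x2 - 2*x3 == b, x1 + (1-a)*x2 == b - 3, x1 + (1-a)*x2 + a*x3 == 2*b - 1]
print(solve([e.subs(a=1, b=0) for e in eqs], x1, x2, x3))   # unique solution (-3, -7, 2)
print(solve([e.subs(a=0, b=-2) for e in eqs], x1, x2, x3))  # a one-parameter family of solutions
print(solve([e.subs(a=0, b=1) for e in eqs], x1, x2, x3))   # no solution: []

2.E.17.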
We compute: σ = (1, 7)(2, 6)(5, 3) , τ = (1, 6)(6, 8)(8, 7)(7, 3)(2, 4) , ρ = (1, 4)(4, 10)(10, 7)(7, 9)(9, 3)(2, 6)(6, 5) . 2.E.18. For σ we compute 17 inversions, and its parity is thus odd. For τ we enumerate 12 inversions, and hence its parity is even. Finally for ρ we enumerate 25 inversions, so its parity is odd. 2.E.24. A confirmation of the inverse of A and the given solution can be obtained as follows: A=matrix(QQ, [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, 1]]) det(A) Ainv=A.inverse() show(Ainv) b=vector(QQ, [2, 3, 3, 5]) x=Ainv*b; show(x) As for the computation of the algebraic complements Aij, one can proceed with the following block: B11=A.matrix_from_rows_and_columns ([1, 2, 3], [1, 2, 3]);show(B11) A11=(-1)^(1+1)*det(B11); show(" A11 is equal to",A11) B12=A.matrix_from_rows_and_columns ([1, 2, 3], [0, 2, 3]);show(B12) A12=(-1)^(1+2)*det(B12); show(" A12 is equal to",A12) B13=A.matrix_from_rows_and_columns ([1, 2, 3], [0, 1, 3]);show(B13) A13=(-1)^(1+3)*det(B13); show(" A13 is equal to",A13) B14=A.matrix_from_rows_and_columns ([1, 2, 3], [0, 1, 2]);show(B14) A14=(-1)^(1+4)*det(B14); show(" A14 is equal to",A14) B21=A.matrix_from_rows_and_columns ([0, 2, 3], [1, 2, 3]);show(B21) A21=(-1)^(1+2)*det(B21); show(" A21 is equal to",A21) B22=A.matrix_from_rows_and_columns ([0, 2, 3], [0, 2, 3]);show(B22) A22=(-1)^(2+2)*det(B22); show(" A22 is equal to",A22) B23=A.matrix_from_rows_and_columns ([0, 2, 3], [0, 1, 3]);show(B23) A23=(-1)^(2+3)*det(B23); show(" A23 is equal to",A23) B24=A.matrix_from_rows_and_columns ([0, 2, 3], [0, 1, 2]);show(B24) A24=(-1)^(2+4)*det(B24); show(" A24 is equal to",A24) B31=A.matrix_from_rows_and_columns ([0, 1, 3], [1, 2, 3]);show(B31) 187 CHAPTER 2. ELEMENTARY LINEAR ALGEBRA A31=(-1)^(3+1)*det(B31); show(" A31 is equal to",A31) B32=A.matrix_from_rows_and_columns ([0, 1, 3], [0, 2, 3]);show(B32) A32=(-1)^(3+2)*det(B32); show(" A32 is equal to",A32) B33=A.matrix_from_rows_and_columns ([0, 1, 3], [0, 1, 3]);show(B33) A33=(-1)^(3+3)*det(B33); show(" A33 is equal to",A33) B34=A.matrix_from_rows_and_columns ([0, 1, 3], [0, 1, 2]);show(B34) A34=(-1)^(3+4)*det(B34); show(" A34 is equal to",A34) B41=A.matrix_from_rows_and_columns ([0, 1, 2], [1, 2, 3]);show(B41) A41=(-1)^(4+1)*det(B41); show(" A41 is equal to",A41) B42=A.matrix_from_rows_and_columns ([0, 1, 2], [0, 2, 3]);show(B42) A42=(-1)^(4+2)*det(B42); show(" A42 is equal to",A42) B43=A.matrix_from_rows_and_columns ([0, 1, 2], [0, 1, 3]);show(B43) A43=(-1)^(4+3)*det(B43); show(" A43 is equal to",A43) B44=A.matrix_from_rows_and_columns ([0, 1, 2], [0, 1, 2]);show(B44) A44=(-1)^(4+4)*det(B44); show(" A43 is equal to",A44) Ain=(-1/16)*matrix(QQ, [[A11, A21, A31, A41], [A12, A22, A32, A42], [A13, A23, A33, A43], [A14, A24, A34, A44]]) show(Ain); bool(Ain==A.inverse()) Inside this cell, the matrices Bij represent the matrices ˆAij. The final command verifies that the matrix constructed via the algebraic complement, named here Ain, is the inverse of A. Or we can test our computation for the matrix adj(A), e.g., by adding the code Adtr=matrix(QQ, [[A11, A21, A31, A41], [A12, A22, A32, A42], [A13, A23, A33, A43], [A14, A24, A34, A44]]) show(Adtr); show(A.adjugate()) bool(Adtr==A.adjugate()) 2.E.25. To do this, we will use a slightly different method from those presented in 2.B.11. Both methods can be employed to further explore applications of Cramer’s rule, such as solving linear systems with parameters. Here we will define a function to construct the matrices Ai in Cramer’s notation (see 2.B.11). 
This will enable us to express the solution to Ax = b as $x_i = \det(A_i)/\det(A)$, $i = 1, \ldots, n$. This function can be introduced by the block given here:
def column_replace(M, column, u):
    n1 = M.nrows()
    n2 = M.ncols()
    P = matrix(n1, n2)
    for i in range(n1):
        for j in range(n2):
            P[i, j] = M[i, j]
    for i in range(n1):
        P[i, column - 1] = u[i, 0]
    return P
To solve our task, we can now add the following cell:
A = matrix([[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, 1]])
b = matrix(4, 1, [2, 3, 3, 5])
A1 = column_replace(A, 1, b)
A2 = column_replace(A, 2, b)
A3 = column_replace(A, 3, b)
A4 = column_replace(A, 4, b)
x = matrix(4, 1, [det(A1)/det(A), det(A2)/det(A), det(A3)/det(A), det(A4)/det(A)])
show(x)
Executing this block, Sage returns the solution obtained above, i.e., (13/4, −3/4, −3/4, 1/4)^T.

2.E.27. One may first compute F⁻¹. Then, based on the formula F⁻¹ = (1/det(F)) adj(F), we deduce that
$$\mathrm{adj}(F) = (\alpha\delta - \beta\gamma)F^{-1} = \begin{pmatrix} \delta&-\beta&0\\ -\gamma&\alpha&0\\ 0&0&\alpha\delta - \beta\gamma \end{pmatrix},$$
for all α, β, γ, δ ∈ R.

2.E.28. The corresponding adjoint matrices have the form
$$\mathrm{adj}(A) = \begin{pmatrix} 1&1&-2&-4\\ 0&1&0&-1\\ -1&-1&3&6\\ 2&1&-6&-10 \end{pmatrix},\qquad \mathrm{adj}(B) = \begin{pmatrix} 6&-2i\\ -3+2i&1+i \end{pmatrix}.$$

2.E.29. We can easily verify that A satisfies A² = A. In Sage give the cell
A = matrix([[2, -3, -5], [-1, 4, 5], [1, -3, -4]]); bool(A == A*A)
Sage returns True. As for the product AB, we get AB = A(A − I) = A² − A = A − A = 0. Moreover, for B² we see that
$$B^2 = (A - I)(A - I) = A^2 - 2A + I = I - A = -B = \begin{pmatrix} -1&3&5\\ 1&-3&-5\\ -1&3&5 \end{pmatrix}.$$
In Sage a verification of these results is very fast. For example, add the following syntax to the previous cell:
E = matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]]); B = A - E; show(A*B); show(B^2)
Finally, let us use Sage again to compute det(A), where we just need to add the command det(A). It gives det(A) = 0, i.e., A is a singular matrix. Try to present a formal computation of det(A). In fact, except for the identity matrix, all other idempotent matrices have determinant zero (see 2.E.61 below).

2.E.30. Both relations are true. To prove this, we will use Cauchy's theorem, which states that det(AB) = det(A) det(B) (see 2.2.7). Since det(PP⁻¹) = det(E) = 1 we obtain
det(B) = det(PAP⁻¹) = det(P) det(A) det(P⁻¹) = det(P) det(P⁻¹) det(A) = det(PP⁻¹) det(A) = det(A).
Hence det(B) = det(A), and we also obtain det(A⁻¹B) = det(A⁻¹) det(B) = det(A⁻¹) det(A) = det(A⁻¹A) = 1.

2.E.32. This task is solved in a similar way to 2.E.31, and you can verify that the desired coordinates are given by (2 + 1/√3, 2 − 1/√3).

2.E.33. Let us write (5, 1, 11)^T = p(3, 2, 2)^T + q(2, 3, 1)^T + r(1, 1, 3)^T for some p, q, r ∈ R. Solving the corresponding linear system we obtain a unique solution, given by p = 2, q = −2 and r = 3.

2.E.34. We see that the vectors u1, u2, u3 are linearly dependent whenever at least one of the following conditions is satisfied: a = b = 1, or a = c = 1, or b = c = 1.

2.E.35. It turns out that the given vectors are linearly independent. In order to prove this assertion, you may try to apply the method presented in 2.C.16.

2.E.36. It is easy to check that adding the polynomial x gives a basis of R3[x].

2.E.37. For both cases we have to check the linear independence of the given vectors u1, u2, u3. (a) In the first case we compute dim U = 2 for t ∈ {1, 2}; otherwise we have dim U = 3. (b) In the second case we compute dim U = 2 for t ≠ 0, and dim U = 1 for t = 0.
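The vectors u1, u2, u3 of 2.E.37 are stated in the problem section; as a generic illustration of how such parameter-dependent dimension counts are done in Sage, here is a sketch with sample vectors of our own:
t = var("t")
M = matrix(SR, [[1, 1, t], [1, t, 1], [t, 1, 1]])
show(M.det().factor())  # -(t + 2)*(t - 1)^2: the span is 3-dimensional except at t = 1 and t = -2

2.E.39.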
According to the definition of intersection, the vectors in the intersection lie both in the span of the vectors (1, 1, −3)^T, (1, 2, 2)^T, and in the span of the vectors (1, 1, −1)^T, (1, 2, 1)^T, (1, 3, 3)^T. We see that U is spanned by two linearly independent vectors, and hence U is a plane in R³. Next, V is spanned by three vectors, but these are linearly dependent:
$$\begin{vmatrix} 1&1&1\\ 1&2&3\\ -1&1&3 \end{vmatrix} = \begin{vmatrix} 1&1&-1\\ 1&2&1\\ 1&3&3 \end{vmatrix} = 0.$$
Therefore, V is also a plane. If the vector (x1, x2, x3) lies in U, then (x1, x2, x3) = λ(1, 1, −3) + µ(1, 2, 2), for some scalars λ, µ. Similarly, if the vector (x1, x2, x3) lies in V, then (x1, x2, x3) = α(1, 1, −1) + β(1, 2, 1) + γ(1, 3, 3), for some scalars α, β, γ. This gives a system of six equations in eight unknowns. Solving such a system can be quite cumbersome, but we can make some simplifications. First notice that the subspace U is described by the following three equations:
x1 = λ + µ, x2 = λ + 2µ, x3 = −3λ + 2µ.
Solving this system of equations with respect to λ and µ, or alternatively eliminating λ and µ from these equations, one obtains the single equation 8x1 − 5x2 + x3 = 0, which we may use to replace the first three. Now notice that the subspace V is described by the following three equations:
x1 = α + β + γ, x2 = α + 2β + 3γ, x3 = −α + β + 3γ.
Solving this system of equations with respect to α, β and γ, or alternatively eliminating α, β and γ from these equations, we obtain the single equation 3x1 − 2x2 + x3 = 0, which we can use to describe V. Hence, it is now straightforward that, after introducing a new parameter t, we can express the intersection as (x1, x2, x3) = t · (3, 5, 1).

2.E.40. The answer is given by the matrix
$$\begin{pmatrix} 5/6&-1/6&1/3\\ -1/6&5/6&1/3\\ 1/3&1/3&1/3 \end{pmatrix}.$$

2.E.41. The answer is given by the matrix
$$\begin{pmatrix} 5/9&2/9&-4/9\\ 2/9&8/9&2/9\\ -4/9&2/9&5/9 \end{pmatrix}.$$

2.E.42. (a) The vector that determines the subspace U is perpendicular to each of the three vectors that generate W. Thus the subspaces are orthogonal and the first claim holds. (b) It is not true that R⁴ = U ⊕ W. This is because the subspace W is only two-dimensional, since (−1, 0, −1, 2)^T = (−1, 0, 1, 0)^T − 2(0, 0, 1, −1)^T.

2.E.43. Let
$$A = \begin{pmatrix} 1&0&1\\ 0&1&2\\ 0&0&1\\ 1&1&0 \end{pmatrix}$$
be the 4 × 3 matrix induced by the vectors E1, E2, E3. Recall that the maximum number of linearly independent column vectors of A coincides with the rank of A. Because the determinant of the first three rows equals 1 ≠ 0, the rank of A equals 3, rank(A) = 3. Hence the vectors E1, E2, E3 are linearly independent. Recall that to verify this conclusion in Sage it suffices to add the block
V = RR^4
E1 = vector(RR, [1, 0, 0, 1])
E2 = vector(RR, [0, 1, 0, 1])
E3 = vector(RR, [1, 2, 1, 0])
V.linear_dependence([E1, E2, E3]) == []
On the other hand, for the rank of A the appropriate cell is
A = matrix(SR, [[1, 0, 1], [0, 1, 2], [0, 0, 1], [1, 1, 0]])
A.rank()
In the first case Sage's output is True, while in the second Sage prints out the desired number 3. Since E1, E2, E3 are linearly independent and span W, they form a basis of W. We will use this basis to obtain an orthogonal basis by applying the usual Gram–Schmidt procedure. Set
$$w_1 = E_1,\qquad w_2 = E_2 - \frac{\langle E_2, w_1\rangle}{\|w_1\|^2}w_1,\qquad w_3 = E_3 - \frac{\langle E_3, w_1\rangle}{\|w_1\|^2}w_1 - \frac{\langle E_3, w_2\rangle}{\|w_2\|^2}w_2.$$
Since ⟨E2, w1⟩ = 1 and ∥w1∥² = 2, this gives w2 = (−1/2, 1, 0, 1/2)^T. We also compute ⟨E3, w1⟩ = 1, ⟨E3, w2⟩ = 3/2 and ∥w2∥² = 3/2. Hence w3 = E3 − (1/2)w1 − w2, that is, w3 = (1, 1, 1, −1)^T, with ∥w3∥² = 4.
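Since hand computations in the Gram–Schmidt process are error-prone, it is worth letting Sage repeat the projections (a small check of our own):
E1 = vector(QQ, [1, 0, 0, 1]); E2 = vector(QQ, [0, 1, 0, 1]); E3 = vector(QQ, [1, 2, 1, 0])
w1 = E1
w2 = E2 - (E2.dot_product(w1)/w1.dot_product(w1))*w1
w3 = E3 - (E3.dot_product(w1)/w1.dot_product(w1))*w1 - (E3.dot_product(w2)/w2.dot_product(w2))*w2
print(w2, w3)  # (-1/2, 1, 0, 1/2) (1, 1, 1, -1)
print(w1.dot_product(w2), w1.dot_product(w3), w2.dot_product(w3))  # 0 0 0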
Thus an orthonormal basis of W consists of the vectors
$$\hat{w}_1 = \frac{w_1}{\|w_1\|} = \Big(\frac{1}{\sqrt{2}}, 0, 0, \frac{1}{\sqrt{2}}\Big)^T,\quad \hat{w}_2 = \frac{w_2}{\|w_2\|} = \Big(-\frac{1}{\sqrt{6}}, \frac{2}{\sqrt{6}}, 0, \frac{1}{\sqrt{6}}\Big)^T,\quad \hat{w}_3 = \frac{w_3}{\|w_3\|} = \Big(\frac{1}{2}, \frac{1}{2}, \frac{1}{2}, -\frac{1}{2}\Big)^T.$$
By definition, we have R⁴ = W ⊕ W^⊥ and hence dim W^⊥ = dim R⁴ − dim W = 1. Choosing a suitable vector of the standard basis of R⁴, say e2 = (0, 1, 0, 0)^T, we extend the above basis to a basis of R⁴, namely {Z1 = ŵ1, Z2 = ŵ2, Z3 = ŵ3, Z4 = e2} (we encourage the reader to verify this statement in Sage). By applying the Gram–Schmidt method to this basis we obtain an orthogonal basis {Ẑ1, ..., Ẑ4} of R⁴, given by Ẑj = Zj = ŵj for j = 1, 2, 3 and
$$\hat{Z}_4 = Z_4 - \frac{\langle Z_4, \hat{w}_1\rangle}{\|\hat{w}_1\|^2}\hat{w}_1 - \frac{\langle Z_4, \hat{w}_2\rangle}{\|\hat{w}_2\|^2}\hat{w}_2 - \frac{\langle Z_4, \hat{w}_3\rangle}{\|\hat{w}_3\|^2}\hat{w}_3.$$
It follows that W^⊥ = span_R{Ẑ4}, and the explicit computation of Ẑ4 is left to the reader.

2.E.47. Using the Gram–Schmidt orthogonalization process we obtain the basis {(1, 1, 1, 1)^T, (1, 1, 1, −3)^T, (−2, 1, 1, 0)^T}.

2.E.48. For example, one can obtain the orthogonal bases {(1, 0, 1, 0)^T, (0, 1, 0, −7)^T} for the first part, and {(1, 2, 2, −1)^T, (2, 3, −3, 2)^T, (2, −1, −1, −2)^T} for the second part.

2.E.49. The solution is a = 9/2, b = −5 (since 1 + b + 4 + 0 + 0 = 0 and 1 − b + 0 + 3 − 2a = 0).

2.E.50. The orthogonal complement V^⊥ is the set of all scalar multiples of the vector (4, 2, 7, 0)^T.

2.E.51. Obviously, there are infinitely many possible extensions. For example, a very simple one is given by the set {u1, u2, u3 = (1, 0, 0, −1)^T, u4 = (1, 0, −1, 1)^T}.

2.E.52. The basis consists only of the vector u1 = (3, −7, 1, −5, 9)^T (or any non-zero scalar multiple of u1).

2.E.53. The answers are given as follows:
(a) W^⊥ = span_R{(1, 0, −1, 1, 0)^T, (1, 3, 2, 1, −3)^T}.
(b) W^⊥ = span_R{(1, 0, −1, 0, 0)^T, (1, −1, 1, −1, 1)^T}.

2.E.54. Obviously, any square matrix A can be written as
$$A = \frac{1}{2}\big(A + A^T\big) + \frac{1}{2}\big(A - A^T\big).$$
Setting $A_s = \frac{1}{2}(A + A^T)$ and $A_a = \frac{1}{2}(A - A^T)$, respectively, it is clear that
$$A_s^T = \frac{1}{2}\big(A + A^T\big)^T = A_s,\qquad A_a^T = \frac{1}{2}\big(A - A^T\big)^T = -A_a.$$
This proves the first statement, i.e., A = A_s + A_a. For the given matrix A we compute
$$A_s = \frac{1}{2}\left(\begin{pmatrix} 1&0&2\\ 6&3&0\\ 2&2&4 \end{pmatrix} + \begin{pmatrix} 1&6&2\\ 0&3&2\\ 2&0&4 \end{pmatrix}\right) = \begin{pmatrix} 1&3&2\\ 3&3&1\\ 2&1&4 \end{pmatrix}$$
and
$$A_a = \frac{1}{2}\left(\begin{pmatrix} 1&0&2\\ 6&3&0\\ 2&2&4 \end{pmatrix} - \begin{pmatrix} 1&6&2\\ 0&3&2\\ 2&0&4 \end{pmatrix}\right) = \begin{pmatrix} 0&-3&0\\ 3&0&-1\\ 0&1&0 \end{pmatrix}.$$
Observe that a skew-symmetric matrix, such as A_a, has zeros on the diagonal. In Sage, to find A_s and A_a we can proceed with the cell
A = matrix([[1, 0, 2], [6, 3, 0], [2, 2, 4]]); show(A)
As = (A + A.transpose())/2; Aa = (A - A.transpose())/2
show(As); show(Aa)
As.is_symmetric()  # this command is used to verify that A_s is symmetric
Notice that an alternative way to compute the transpose of a matrix A in Sage is the shortcut A.T, a notation which is common in some other programming environments, like NumPy. Hence, for example, to determine the matrix A_s in Sage one could type
A = matrix([[1, 0, 2], [6, 3, 0], [2, 2, 4]])
As = (A + A.T)/2; print(As)

2.E.55. (a) The assertions are obvious:
(A + Aᵀ)ᵀ = Aᵀ + (Aᵀ)ᵀ = Aᵀ + A = A + Aᵀ, (A − Aᵀ)ᵀ = Aᵀ − (Aᵀ)ᵀ = Aᵀ − A = −(A − Aᵀ).
(b) By assumption A is skew-symmetric, i.e., A = −Aᵀ, and n is odd. Recall that det(X) = det(Xᵀ) and det(cX) = cⁿ det(X) for any n × n matrix X, see 2.B.8. Thus,
det(A) = det(Aᵀ) = det(−A) = (−1)ⁿ det(A) = −det(A),
which implies that 2 det(A) = 0, i.e., det(A) = 0.
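A quick numerical illustration of (b) in Sage, with a random skew-symmetric matrix of odd size (a sketch of our own):
A = random_matrix(QQ, 5)
S = A - A.transpose()   # a 5x5 skew-symmetric matrix
print(S == -S.transpose(), S.det())  # True 0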
(c) The proof of the claim about the trace is also easy: since a skew-symmetric matrix A = (aij) satisfies aij = −aji for all i, j, we have aii = −aii and hence aii = 0 for all i. Thus, all the diagonal entries of a skew-symmetric matrix are zero.
2.E.58. The given identity follows easily from (∗), which for λ = 1 gives χA(1) = det(A − E) = −1 + tr(A) + c1 + det(A), so we can replace c1 in (∗) by det(A − E) + 1 − tr(A) − det(A). For the given matrix A we compute tr(A) = 12 and det(A) = 60. Hence an application gives
χA(λ) = det(A − λE) = −λ³ + 12λ² − 47λ + 60.
Using Sage we find that χA(λ) has three roots: λ1 = 3, λ2 = 4, and λ3 = 5. These are the eigenvalues of A, all with algebraic multiplicity one. Try to solve the equation χA(λ) = 0 in a formal way. For a description in Sage of the given expression of the characteristic polynomial, but also for a verification of the determinant and trace, type
A=matrix(SR, [[32, -67, 47], [7, -14, 13], [-7, 15, -6]])
print(A.trace()); print(A.det())
p(t) = A.characteristic_polynomial(t)
show(p(t)); show(p.roots())
2.E.59. Possible examples are illustrated by the matrices A, B and C provided below, which correspond to the cases (i), (ii) and (iii), respectively:
$$A = \begin{pmatrix} 6 & 0 & 0 & 0 \\ 0 & 7 & 0 & 0 \\ 0 & 0 & 7 & 0 \\ 0 & 0 & 0 & 7 \end{pmatrix}, \quad B = \begin{pmatrix} 6 & 0 & 0 & 0 \\ 0 & 7 & 1 & 0 \\ 0 & 0 & 7 & 0 \\ 0 & 0 & 0 & 7 \end{pmatrix}, \quad C = \begin{pmatrix} 6 & 0 & 0 & 0 \\ 0 & 7 & 1 & 0 \\ 0 & 0 & 7 & 1 \\ 0 & 0 & 0 & 7 \end{pmatrix}.$$
Explain the reasoning behind the derivation of these matrices and then explore additional solutions.
2.E.60. There is a triple eigenvalue −1. The corresponding eigenspace is spanned by the eigenvectors (1, 0, 0)T and (0, 2, 1)T, hence it is 2-dimensional (over R).
2.E.65. The matrix has a double eigenvalue −1, and its associated eigenspace is spanned over R by the vectors (2, 0, 1)T and (1, 1, 0)T. Further, the matrix has the eigenvalue 0, with eigenvector (1, 4, −3)T. Combining this with the statements in 2.E.63, we deduce that the mapping induced by the given matrix M is an axial symmetry through the line induced by the last vector, composed with the projection onto the plane perpendicular to the last vector. Verify that this plane is given by the equation x + 4y − 3z = 0.
2.E.66. (i) Let us present the computations in Sage, where we can use the command A.commutator(B) to compute the commutator [A, B] = AB − BA of two square matrices A, B. The corresponding block is here:
s1=matrix(SR, [[0, 1], [1, 0]])
s2=matrix(SR, [[0, -i], [i, 0]])
s3=matrix(SR, [[1, 0], [0, -1]])
show(s1.commutator(s2)); bool(s1.commutator(s2)==2*i*s3)
show(s1.commutator(s3)); bool(s1.commutator(s3)==-2*i*s2)
show(s2.commutator(s3)); bool(s2.commutator(s3)==2*i*s1)
Execute this block to read Sage's output. (ii) To verify the claims posed in the second part one can proceed by adding in the previous cell the code
bool(s1*s1==identity_matrix(2))
bool(s2*s2==identity_matrix(2))
bool(s3*s3==identity_matrix(2))
ev1=s1.eigenvalues()
ev2=s2.eigenvalues()
ev3=s3.eigenvalues()
bool(ev1==ev2); bool(ev1==ev3); bool(ev2==ev3); show(ev1)
Let us also verify the claim for the eigenvalues by a formal computation, e.g., for σ1. The point is to compute its characteristic polynomial χσ1(λ) = det(σ1 − λE), where E is the 2 × 2 identity matrix. We have
$$\sigma_1 - \lambda E = \begin{pmatrix} -\lambda & 1 \\ 1 & -\lambda \end{pmatrix} \Longrightarrow \chi_{\sigma_1}(\lambda) = \begin{vmatrix} -\lambda & 1 \\ 1 & -\lambda \end{vmatrix} = \lambda^2 - 1.$$
Thus there are two roots, λ± = ±1. Similarly for σ2 and σ3. Try to determine the corresponding eigenvectors. (iii) The claim in this part can be proved similarly, and is left as an exercise.
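To determine the eigenvectors suggested above, one may again let Sage do the work. The following cell is a sketch of ours, using the standard eigenvectors_right method:
s1 = matrix(SR, [[0, 1], [1, 0]])
# each triple is (eigenvalue, [eigenvectors], algebraic multiplicity)
for ev, vecs, mult in s1.eigenvectors_right():
    print(ev, vecs, mult)
The same call applied to s2 and s3 yields their eigenvectors as well.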
2.E.67. This is direct and left for practice.

CHAPTER 3

Linear models and matrix calculus

where are the matrices useful? – basically almost everywhere...

We have already developed a useful package of tools, and it is time to show some applications of matrix calculus. The first three parts of this chapter are independent, and readers more interested in the theory may skip any of them and continue straight with the fourth part. It might seem that the assumption of linearity of relations between quantities is too restrictive, but this is often not so. In real problems, linear relations may appear directly, or a problem may be solved as the result of an iteration of many linear steps. Even if this is not the case, we may still use this approach at least to approximate real non-linear processes. We should also like to compute with matrices (and linear mappings) as easily as we compute with scalars. In order to do that, we prepare the necessary tools in the second part of this chapter. We also present a useful application of matrix decompositions to pseudoinverse matrices, which are needed for the numerical mastery of matrix calculus. We try to illustrate all the phenomena with rather easy problems. Still, some parts of this chapter may be difficult at first reading. This concerns in particular the very first part, providing some glimpses of linear optimization (linear programming), and the third part, devoted to iterated processes (the Frobenius–Perron theory). The rest of the chapter comes back to more advanced parts of the matrix calculus (the Jordan canonical form, decompositions, and pseudoinverses of matrices). The reader should feel free to move forward when getting lost.

1. Linear optimization

The simplest linear processes are given by linear mappings φ : V → W on vector spaces. As we can surely imagine, the vector v ∈ V can represent the state of some system we are observing, while φ(v) gives the result after some process is realized. If we want to reach a given result b ∈ W of such a process, we solve the problem φ(x) = b for some unknown vector x and a known vector b. In fixed coordinates we then have the matrix A of the mapping φ and the coordinate expression of the vector b. We mastered such problems in the previous chapter. Now we draw more interesting conclusions in the setup of linear optimization models (also called linear programming).

A. Linear optimization

The idea of maximizing or minimizing a linear function subject to linear constraints arises naturally in many fields. For instance, we may want to maximize profits, minimize costs, etc. Linear programming builds upon notions from linear algebra discussed in Chapter 2 and is a powerful tool for solving linear optimization problems subject to linear constraints.¹ In a linear programming problem (LP problem, in short) we seek a vector in some Euclidean space maximizing (or minimizing) the value of some linear functional, among all vectors satisfying a given system of linear constraints that govern the process. In 3.1.2 this linear functional is called the objective function.

3.A.1. To illustrate the idea, let us begin with LP problems in two dimensions, where a solution can be found graphically. As usual, we view vectors x in R2 as column matrices (x1, x2)T with x1, x2 ∈ R. We want to maximize the value of the linear function h(x1, x2) = x1 + x2 (objective function) subject to the constraints
x1 + 2x2 ≤ 3, 2x1 + x2 ≤ 3, x1 ≥ 0, x2 ≥ 0.

¹The name “programming” refers to process planning and was introduced before computers were invented, cf. the footnote on page 195.
3.1.1. Linear optimization. In the practical column, the previous chapter started with a painting problem, and we continue here in a similar way. Imagine that our very specialized painter in a black&white world is willing to paint facades of either small family houses or large public buildings, and that he (of course) uses only black and white colours. He can arbitrarily choose the proportions between x1 units of area for the small houses and x2 units for the large buildings. Assume that his maximal workload in a given interval of time is L units of area, and that his net income (that is, after subtracting the costs) is c1 per unit of area for small houses and c2 per unit of area for large buildings. Furthermore, he has only W kg of white colour and B kg of black colour at his disposal. Finally, a unit of area for small houses requires w1 kg of white colour and b1 kg of black colour; for large buildings the corresponding values are w2 and b2. If we write all this information as inequalities, we obtain the conditions

(1) x1 + x2 ≤ L
(2) w1x1 + w2x2 ≤ W
(3) b1x1 + b2x2 ≤ B.

The total net income of the painter, which is the linear form h(x1, x2) = c1x1 + c2x2, is to be maximized. Each of the given inequalities clearly determines a half-plane in the plane of the variables (x1, x2), bounded by the line given by the corresponding equality, and we must also assume that both x1 and x2 are non-negative real numbers (because the painter cannot paint negative areas). Thus we have constraints on the values (x1, x2) – either the constraints are unsatisfiable, or they allow the points inside a polygon with at most five vertices. See the diagram.

Solution. The inequalities are graphically represented by half-planes in R2, and the intersection of these half-planes determines the “feasible region” of solutions. Thus, a feasible solution is any point in R2 satisfying all the given constraints. The non-negativity constraints xi ≥ 0 for i = 1, 2 clearly restrict all the feasible solutions to the first quadrant. Therefore, the feasible region is the shaded (hatched) area in the diagram below. While it is possible to construct linear programming (LP) problems with an empty feasible region, we are primarily interested in those with a non-empty feasible region. In a linear maximization problem, an “optimal solution” is a point within the feasible region that maximizes the objective function's value. For our example, an optimal solution appears at a unique vertex. To see this, consider the vector c = (1, 1)T, so that cT x = x1 + x2. Consider also the objective function line x1 + x2 = k (k ∈ R), perpendicular to the vector c. As illustrated in the figure above, we now move this line upwards in the direction of the vector c, maintaining the same slope, without leaving the feasible region. By doing so, we determine the point where the moving line intersects the feasible region for the last time, namely the point (1, 1). This point provides the maximum value k = 2 of the objective function h. □
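The optimal vertex just found can also be confirmed numerically. Anticipating the Sage interface described later in 3.A.12, a minimal check (a sketch of ours) reads:
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective(x1 + x2)
p.add_constraint(x1 + 2*x2 <= 3)
p.add_constraint(2*x1 + x2 <= 3)
print(p.solve(), p.get_values(x1), p.get_values(x2))   # 2.0 1.0 1.0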
On the other hand, there exist LP problems with infinitely many optimal solutions. Nevertheless, at least one of these optimal solutions must occur at a vertex of the feasible region. Let us describe such an example.

3.A.2. Consider the problem of maximizing the functional h(x1, x2) = 2.5x1 + x2 under the constraints
3x1 + 5x2 ≤ 15, 5x1 + 2x2 ≤ 10, x1 ≥ 0, x2 ≥ 0.

How to solve such a problem? We seek the maximum value of a linear form h over subsets M of a vector space which are defined by linear inequalities. In the plane, M is given by an intersection of half-planes. Next, note that every linear form h : V → R on a real vector space (that is, an arbitrary linear scalar function) is monotone in every chosen direction. More precisely, if we choose a fixed starting vector u ∈ V and a “directional” vector v ∈ V, then the composition of our form h with this parametrization yields t ↦ h(u + tv) = h(u) + t h(v). As a function of t, this expression is either increasing, or decreasing, or constant (depending on whether h(v) is positive, negative, or zero). Thus, if the set M is bounded, as in our picture above, we easily find the solution by testing the value of h at the vertices of the boundary polygon. In general, we must expect that problems similar to the one with the painter are either unsatisfiable (if the set given by the constraints is empty), or the profit is unbounded (if the constraints allow for unbounded directions in the space and the form h is non-zero in some of the unbounded directions), or they attain a maximal solution in at least one of the “vertices” of the set M. Normally the maximum is attained at a single point of M, but sometimes it is attained on a part of the boundary of the set M. Try to choose explicit values for the parameters w1, w2, b1, b2, c1, c2, draw the above picture for these parameters, and find the explicit solution to the problem (if it exists)!

3.1.2. Terminology. In general, we speak of a linear programming problem whenever we seek either the maximum or the minimum value of a linear form h on Rn over a set bounded by a system of linear inequalities, which we call linear constraints. The vector on the right-hand side is then called the vector of constraints. The linear form h is also called the objective function.¹ In real practice we meet hundreds or thousands of constraints for dozens of variables. The standard maximization problem seeks a maximum of the objective function while the restrictive inequalities are ≤ and the variables are non-negative. On the other hand, the standard minimization problem seeks a minimum of the objective function while the restrictive inequalities are ≥ and the variables are non-negative. It is easy to see that every general linear programming problem can be transformed into a standard one of either type. Aside from sign changes, we can decompose each variable without sign restriction into a difference of two non-negative ones. Without loss of generality, we will work only with the standard maximization problem.

¹Leonid Kantorovich and Tjalling Koopmans shared the 1975 Nobel prize in economics for their formulation and solution of economic and logistics problems in this way during the Second World War. But it was George B. Dantzig who independently developed the general linear programming formulation in the period 1946–49, motivated by planning problems in the US Air Force. Among other things, he invented the simplex method algorithm.
Solution. Obviously, we have h = cT x where c = (2.5, 1)T, and the feasible region is the shaded polygon in the figure below. All the points on the thick side of the polygon are optimal solutions. This is because the level lines of h are parallel to the line 5x1 + 2x2 = 10 induced by the second constraint. Indeed, we see that both the points (2, 0) and (20/19, 45/19) are optimal solutions to the problem, with h(2, 0) = h(20/19, 45/19) = 5. The line segment p joining these points has the form
(x1, x2) = t(2, 0) + (1 − t)(20/19, 45/19) = (20/19 + (18/19)t, 45/19 − (45/19)t),
with t ranging from 0 to 1. Then any point (x1, x2) on p maximizes h and hence serves as an optimal solution:
h(x1, x2) = (5/2)(20/19 + (18/19)t) + (45/19 − (45/19)t) = 5. □

Finally, we may encounter LP problems with an unbounded feasible region, where the objective function can achieve arbitrarily large values, resulting in the absence of optimal solutions (an unbounded objective). For example, this occurs when attempting to maximize x1 + x2 in the first quadrant without any additional constraints, or when adding the constraint x2 − x1 ≤ 1. Another possible scenario involves an unbounded linear programming problem that still has infinitely many optimal solutions. This situation is illustrated by the following example, where the details are left as an exercise.

3.A.3. Maximize h = 2x2 − x1 subject to the constraints
x1 − x2 ≥ −1, −0.5x1 + x2 ≤ 2, x1 ≥ 0, x2 ≥ 0. ⃝

Linear programming applies to a wide range of problems, and in the sequel we will explore several of its applications. Examples include maximizing the profits of a production process, minimizing the area on a chip, optimizing supply chain logistics, scheduling tasks efficiently, and determining the best investment portfolio mix.

3.1.3. Formulation using linear equations. Finding an optimum is not always as simple as in the previous 2-dimensional case. The problem can contain many variables and constraints, and even deciding whether the set M of feasible points is non-empty can be a problem. We do not have the ambition to go into the detailed theory here. But we mention at least some ideas showing that a solution can always be found, and then we build an effective algorithm solving the problem in the next paragraphs. We begin by a comparison with systems of linear equations – because we understand those well. We write the inequalities (1)–(3) in 3.1.1 in the general form A · x ≤ b, where x is now an n-dimensional vector, b is an m-dimensional vector, and A is the corresponding matrix. By an inequality between vectors we mean the individual inequalities between all their coordinates. We want to maximize the product c · x for a given row vector c of coefficients of the linear form h over the feasible values of x. If we add new auxiliary variables xs, one for every inequality, and another variable z for the value of the linear form h, we can rewrite the whole system as a system of linear equations

(1)  $$\begin{pmatrix} 1 & -c & 0 \\ 0 & A & E_m \end{pmatrix} \cdot \begin{pmatrix} z \\ x \\ x_s \end{pmatrix} = \begin{pmatrix} z - c \cdot x \\ A \cdot x + x_s \end{pmatrix} = \begin{pmatrix} 0 \\ b \end{pmatrix},$$

where the matrix is composed of blocks with 1 + n + m columns and 1 + m rows, with the corresponding individual components of the vectors. We call the new variables xs the slack variables. Moreover, we require non-negativity of all coordinates of x and xs. If the given system of equations has a solution, we seek values of the variables z, x and xs such that all x and xs are non-negative and z is maximized.
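To see the block structure of (1) concretely, the following small Sage sketch of ours assembles the extended matrix for data of our own choosing, namely the constraints and objective of 3.A.1:
A = matrix(QQ, [[1, 2], [2, 1]]); c = matrix(QQ, [[1, 1]])
one = matrix(QQ, [[1]]); Z1 = zero_matrix(QQ, 1, 2); Z2 = zero_matrix(QQ, 2, 1)
M = block_matrix([[one, -c, Z1], [Z2, A, identity_matrix(QQ, 2)]], subdivide=False)
show(M)   # rows: (1,-1,-1,0,0), (0,1,2,1,0), (0,2,1,0,1)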
Specifically, in our problem of the black&white painter from 3.1.1, the system of linear equations looks like this:

$$\begin{pmatrix} 1 & -c_1 & -c_2 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & w_1 & w_2 & 0 & 1 & 0 \\ 0 & b_1 & b_2 & 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} z \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix} = \begin{pmatrix} 0 \\ L \\ W \\ B \end{pmatrix}.$$

In paragraph 4.1.11 on page 328 we will discuss this situation from the viewpoint of affine geometry. For now we just notice that being on the boundary of the set M of feasible points of the problem is equivalent to the vanishing of some of the slack variables. Our algorithm will try to move from one such position to another while increasing h. But we shall need some conceptual preparation first.

3.1.4. Duality of linear programming. Consider the real matrix A with m rows and n columns, the vector of constraints b, and the row vector c giving the objective function. From this data we can consider two problems of linear programming for x ∈ Rn and y ∈ Rm.

3.A.4. A company manufactures two models A and B of a product. Each piece of model A requires ten labour hours for fabrication and one labour hour for finishing and packing. For model B, each piece requires sixteen labour hours for fabrication and two labour hours for finishing and packing. The profit is €500 for each piece of model A and €850 for each piece of model B. Assuming there are 180 labour hours available monthly for fabrication and 20 labour hours available for finishing and packing, how many pieces of models A and B should be manufactured to maximize the profit?

Solution. Let us encode the given data in a table:

               model A          model B          labour hours
fabrication    10 hours/piece   16 hours/piece   180 hours
finish & pack  1 hour/piece     2 hours/piece    20 hours
profit         €500/piece       €850/piece

Suppose that x1 is the number of pieces of model A and x2 is the number of pieces of model B that are manufactured. The profit function is given by h(x1, x2) = 500x1 + 850x2, and we want to maximize it subject to the following constraints: 10x1 + 16x2 ≤ 180, x1 + 2x2 ≤ 20, with x1 ≥ 0 and x2 ≥ 0. Here, the first inequality represents the constraint on the labour hours for fabrication, the second one pertains to the labour hours for finishing and packing, and the last two are the natural non-negativity conditions. Equivalently, we have the LP problem of maximizing the given profit h = h(x1, x2) with respect to the conditions 5x1 + 8x2 ≤ 90, x1 + 2x2 ≤ 20, with xi ≥ 0 for i = 1, 2. Draw a figure yourself to see that this is a bounded LP problem, and hence the maximum of h appears at one of the corners of the corresponding feasible region. The vertices of the feasible region are (18, 0), (0, 10), and (10, 5). Since h(18, 0) = 9000, h(0, 10) = 8500, h(10, 5) = 9250, we deduce that the company must produce 10 pieces of model A and 5 pieces of model B, achieving the maximum profit of €9250. □

3.A.5. Minimize the costs of feeding. A stable in the west of the Czech Republic purchases fodder for the winter: hay and oats. The nutritional values of the fodder, along with the daily requirements (portions) per foal, are detailed in the following table.

g per kg                     Hay     Oats    Requirements
Dry basis                    841     860     ≥ 6300 g
Digestible nitrogen matter   53      123     ≥ 1150 g
Starch                       0.348   0.868   ≤ 5.35 g
Calcium                      6       1.6     ≥ 30 g
Phosphate                    2.8     3.5     ≤ 44 g
Sodium                       0.2     1.4     ≃ 7 g
Costs                        €1.80   €1.60
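As a quick aside, the optimum of 3.A.4 is easy to double-check with the Sage LP interface explained later in 3.A.12 (a sketch of ours):
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective(500*x1 + 850*x2)
p.add_constraint(10*x1 + 16*x2 <= 180)
p.add_constraint(x1 + 2*x2 <= 20)
print(p.solve(), p.get_values(x1), p.get_values(x2))   # 9250.0 10.0 5.0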
Dual problems of linear programming

Maximization problem: Maximize c · x under the conditions A · x ≤ b and x ≥ 0.
Minimization problem: Minimize yT · b under the conditions yT · A ≥ c and y ≥ 0.

We say that these two problems are dual problems of linear programming. Before deriving further properties of linear programming, we need some terminology. We say that a problem is solvable if there is an admissible vector x (or an admissible vector yT) which satisfies all its constraints. A solvable maximization (minimization) problem is bounded if the objective function is bounded from above (below) over the set of admissible vectors.

Lemma (Weak duality theorem). If x ∈ Rn is an admissible vector for the standard maximization problem, and if y ∈ Rm is an admissible vector for the dual minimization problem, then c · x ≤ yT · b.

Proof. It is a simple observation. Since x ≥ 0 and c ≤ yT · A, it follows that c · x ≤ yT · A · x. But also y ≥ 0 and A · x ≤ b, hence c · x ≤ yT · A · x ≤ yT · b, which is what we wanted to prove. □

We see immediately that if both dual problems are solvable, then they must be bounded. Even more interesting is the following corollary, which is directly implied by the inequality in the previous proof.

Corollary. If there exist admissible vectors x and y of the dual linear problems such that the objective functions satisfy c · x = yT · b, then both are optimal solutions of the corresponding problems.

3.1.5. Theorem (Strong duality theorem). If a standard problem of linear programming is solvable and bounded, then its dual is also bounded and solvable. There exists an optimal solution for each of the problems, and the optimal values of the corresponding objective functions are equal.

Proof. As already proved in the latter corollary, once it is established that the values of the objective functions of the dual problems are equal, we have the required optimal solutions to both problems. It remains to prove the other implication, i.e. the existence of an optimal solution under the assumptions of the theorem, as well as the fact that the objective functions share their values in such a case. This will be verified by delivering an efficient algorithm in the next paragraphs. □

We notice yet another corollary of the just formulated duality theorem:

Corollary (Equilibrium theorem). Consider two admissible vectors x and y for the standard maximization problem and its dual problem as defined in 3.1.4. Then both vectors are

During each daily meal, each foal requires a minimum of 2 kg of oats. The average cost, including transportation, is €1.80 per kg of hay and €1.60 per kg of oats. Design a daily diet for one foal that minimizes costs. ⃝

As explained earlier, when the feasible region of a linear programming problem with two variables is non-empty and bounded, the maximum value of the given linear cost function h will be found at one of the extreme points (corners) of this region, in the direction of the normal vector to the line defined by h. This principle also holds true in higher dimensions, as we will see below.

3.A.6. Linear programming on Rn. Consider a function f : Rn → R of the form f(x1, . . . , xn) = c0 + c1x1 + · · · + cnxn for some constants c0, c1, . . . , cn. Due to the appearance of the constant term c0, the function f is an affine function.² Since f is affine, the difference of the values of f at the points A = (a1, . . . , an) and B = A + u = (a1 + u1, . . . , an + un) equals f(B) − f(A) = c1u1 + · · · + cnun. Notice that the latter is the dot product of the vectors (c1, . . . , cn)T and (u1, . . . , un)T in Rn. Next we will refer to f by the term “objective function”.
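Digressing briefly to the weak duality lemma above: with the data of 3.A.1 and admissible vectors of our own choosing, the inequality c · x ≤ yT · b is easy to watch numerically in Sage:
A = matrix(QQ, [[1, 2], [2, 1]]); b = vector(QQ, [3, 3]); c = vector(QQ, [1, 1])
x = vector(QQ, [1/2, 1]); y = vector(QQ, [1/3, 1/3])
print(all(u <= w for u, w in zip(A*x, b)))   # x is admissible: True
print(all(u >= w for u, w in zip(y*A, c)))   # y is admissible: True
print(c*x, y*b)                              # 3/2 and 2, so c*x <= y*b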
The relation between the scalar product and the cosine of the angle between vectors ensures that fixing the value of the given function f defines a hyperplane in Rn with normal vector (c1, . . . , cn)T. This hyperplane splits the space Rn into two half-spaces. Clearly the given function grows when moving into one of these half-spaces and declines when moving into the other. This is essentially the same principle as we saw when discussing the visibility of segments in dimension 2 (we checked whether the observer is to the left or to the right of the oriented segment, cf. 1.5.12). At the same time, each linear inequality also defines a half-space, and we shall learn the properties of intersections of half-spaces in Chapter 4 (they form the so-called simplexes). These observations lead to an algorithm for finding the extremal values of the linear objective function f on the set of admissible points defined by linear inequalities. For the sake of simplicity, we shall deal with the “standard problem” of linear programming. This means that we treat the task of maximizing the linear function
h(x1, . . . , xn) = ∑_{j=1}^n cj xj = cT x
subject to the conditions ∑_{j=1}^n aij xj ≤ bi and xj ≥ 0, with i = 1, . . . , m, j = 1, . . . , n, respectively. According to the discussion in 3.1.6, our first task is to add the (non-negative) slack variables xs (also called slacks), one for each of the less-than inequalities, to reduce them to equalities:

²Recall that affine functions are essentially linear functions plus a constant offset.

optimal if and only if yi = 0 for all coordinates with index i for which ∑_{j=1}^n aij xj < bi, and simultaneously xj = 0 for all coordinates with index j such that ∑_{i=1}^m yi aij > cj.

Proof. Suppose both relations concerning the zeros among the xj and yi are true. Since the summands with strict inequality have zero coefficients, we have
∑_i yi bi = ∑_i yi ∑_j aij xj = ∑_i ∑_j yi aij xj,
and for the same reason
∑_i ∑_j yi aij xj = ∑_j cj xj.
This shows one implication, by the duality theorem. Suppose now that both x and y are optimal vectors. Then
∑_i yi bi ≥ ∑_i ∑_j yi aij xj ≥ ∑_j cj xj.
But the left- and right-hand sides are equal, and hence there is equality everywhere. If we rewrite the first equality as
∑_i yi (bi − ∑_j aij xj) = 0,
then we see that it can be satisfied only if the relation from the statement holds, since it is a sum of non-negative numbers which equals zero. From the second equality we similarly derive the second part, and the proof is finished. □

The duality theorem and the equilibrium theorem are useful when solving linear programming problems, because they reveal the relations between the zeros among the additional variables and the fulfilment of the constraints. As usual, it is good to know that the problem is solvable in principle and to have some theory related to that, but we still need some clever ideas to turn it all into an efficient algorithmic procedure. The next paragraph provides some insight into this.

3.1.6. The algorithm. As already explained, the linear programming problem of maximizing the linear objective function h = c x under the conditions A x ≤ b can be turned into solving the system of equations (1) in 3.1.3, where we added the slack variables xs. If all entries in b are non-negative, then the choice xs = b and x = 0 provides an admissible solution of the system with the value of the objective function h = 0.
This is the choice of the origin x = 0 as one of the vertices of the distinguished region M of admissible points. We can understand this as choosing the variables xs as the basic variables, whose values are given by the right-hand sides of the equations, while all the other variables are set to zero. In the general case (allowing negative entries in b), we shall see in 4.1.11 that we can always find an admissible vertex. That is, the choice of the basic variables in the above

∑_{j=1}^n aij xj + (xs)i = bi, for all i = 1, . . . , m.

Therefore, our goal is to maximize h over the solution set of a system of linear equations, while ensuring that all coordinate values are non-negative. This represents the “canonical form” of a linear programming problem. In fact, it is useful to have a notation in which the slacks are more or less indistinguishable from the original variables. Therefore, we will often write (x1, . . . , xn, xn+1, . . . , xn+m) instead of (x1, . . . , xn, (xs)1, . . . , (xs)m), and then the above equations take the form
∑_{j=1}^n aij xj + xn+i = bi, i = 1, . . . , m.
If there are inequalities of the opposite direction, we can transform them into our standard form by multiplying them by −1. Additionally, minimizing h is equivalent to maximizing −h. Therefore, we can reduce all linear programming problems to the standard form described above. The simplex method is an iterative process in which we begin with a less-than-optimal “solution” that meets the given equations and non-negativity constraints, and we then seek a new solution that improves upon it by increasing the objective function value. Using the Gaussian elimination method, we iterate this process until we reach a solution that cannot be further improved – thus achieving an optimal solution. Next, we will analyze the key steps in detail, summarizing the discussion given in 3.1.1 and 3.1.6. To illustrate these steps clearly, we will demonstrate them on a straightforward example.

Guide example. Consider the task of maximizing h(x1, x2) = 140x1 + 100x2 under the constraints
8x1 + 8x2 ≤ 960, 4x1 + 2x2 ≤ 400, 4x1 + 3x2 ≤ 420,
together with the positivity conditions x1 ≥ 0, x2 ≥ 0. The canonical form is obtained by maximizing h subject to
8x1 + 8x2 + x3 = 960, 4x1 + 2x2 + x4 = 400, 4x1 + 3x2 + x5 = 420,
with xi ≥ 0 for all i = 1, . . . , 5. The slacks are the variables x3, x4, x5, and in terms of matrices we have

$$\tilde A = (A \mid I_3) = \begin{pmatrix} 8 & 8 & 1 & 0 & 0 \\ 4 & 2 & 0 & 1 & 0 \\ 4 & 3 & 0 & 0 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 960 \\ 400 \\ 420 \end{pmatrix},$$

and x = (x1, x2, x3, x4, x5)T such that Ãx = b.

3.A.7. The simplex algorithm. To keep the exposition simple, from now on we restrict ourselves to the case where all bi are non-negative, bi ≥ 0 for all i = 1, . . . , m.

Step 1. Using the canonical form of an LP problem, we construct the initial simplex tableau. The first row of this tableau consists of the coefficients of h (with negative signs), while

sense, describing an admissible solution. Next, we shall assume that we have such a vertex already. The idea of the algorithm is to perform equivalent row transformations of the entire system in such a way that we move to other vertices of the region M while the function h increases. In order to move to more interesting vertices of M, we must bring some of the slack variables to zero, while the corresponding column of the unit matrix moves to one of the columns corresponding to the variables x.
A simple check reveals that in order to do this, we must choose one of the negative entries in the first row of the matrix 3.1.3(1), pick this column, and choose a row in such a way that, when Gaussian elimination is used to push the other entries in this column to zero, the right-hand sides of the equations remain non-negative. The latter condition means that we have to choose the index i such that bi/aij is minimal. This entry of the matrix is called the pivot for the next step of the elimination. Of course, non-positive coefficients aij are not taken into consideration, since they would not lead to any increase of the objective function. When there are no more negative entries in the first row, we are finished, and the claim is that the optimal value of h appears in the right-hand top corner of the matrix. Before indicating the proof of all the above claims, we show how all this works for the simple problem from 3.F.1. In practice, the very first column of the matrix in question does not change during the procedure at all, so we can omit it completely. Thus we deal with the matrix tableaux:

−4  −6  0  0  0  0 |   0
 1   2  1  0  0  0 | 120
 1   4  0  1  0  0 | 180
−1   1  0  0  1  0 | −90
 1   0  0  0  0  1 | 110

We cannot find an admissible solution by fixing xs as the basic variables here, since there are negative values in b. We try to initiate the above algorithm by changing the sign in the last but one row and performing Gaussian elimination on the very first column, aiming to have only the 1 in the last but one row there. We obtain:

 0  −10  0  0  −4  0 | 360
 0    3  1  0   1  0 |  30
 0    5  0  1   1  0 |  90
 1   −1  0  0  −1  0 |  90
 0    1  0  0   1  1 |  20

We choose the boxed entries for the basic variables; this represents the values x1 = 90, x2 = 0, x3 = 30, x4 = 90, x5 = 0, x6 = 20, and h = 360 = 4 · 90 = −4 · (−90), which is an admissible solution. We have also circled the pivot for the next step, i.e. the element in the second column which we want to replace with 1, eliminating the rest of the column (remember, this is the one yielding the smallest ratio with the

the subsequent rows are derived from the matrices (A|I) and b, i.e.,

−c1  · · ·  −cn | 1  0  · · ·  0 |  0
a11  · · ·  a1n | 1  0  · · ·  0 | b1
a21  · · ·  a2n | 0  1  · · ·  0 | b2
 ⋮            ⋮ | ⋮           ⋮ |  ⋮
am1  · · ·  amn | 0  0  · · ·  1 | bm

In this initial tableau, there are m basic variables and n non-basic variables, a configuration that remains consistent throughout the process. Recall that a basic variable is one that can take non-zero values, whereas a non-basic variable is one that equals zero in the current solution of the problem. Basic variables correspond to the columns of the tableau in which exactly one entry equals one and all other entries are zero. To initiate the algorithm, we choose all slack variables as basic variables. This is achieved by setting the initial decision variables x1, . . . , xn to zero in the equations ∑_{j=1}^n aij xj + xn+i = bi, for all i. This initialization ensures an initial feasible solution, though not necessarily an optimal one. For the guide example the initial tableau has the form

        x1    x2    x3  x4  x5
R0  −h  −140  −100  0   0   0 |   0
R1  x3   8     8    1   0   0 | 960
R2  x4   4     2    0   1   0 | 400
R3  x5   4     3    0   0   1 | 420

where R0, . . . , R3 denote the rows of the table. An initial feasible solution is given by x1 = x2 = 0 and x3 = 960, x4 = 400, x5 = 420. This yields h = 0, and since there are negative values in R0, a better solution exists.

Step 2. We now move in iterated steps (compare the more theoretical explanation in 3.1.6).
(a) We locate the column with the most negative value in the first row of the tableau (locating the leftmost column with a negative value also works). This column, say the jth one, is called the work column. For the guide example, the work column corresponds to the variable x1, since −140 < −100.
(b) In the work column we pick the positive entry aij of A which provides the minimal ratio bi/aij. If more than one element yields the same smallest ratio, we choose one of them. We call this entry the pivot. For our guide example, the pivot appears in row R2 (circled), because 400/4 < 420/4 < 960/8.
(c) To configure the new tableau we proceed with the Gaussian method. Our goal is to eliminate the current work column, and to determine the new pivot and the new basic variable (recall that within each iteration of the simplex method, exactly one variable goes from non-basic to basic and exactly one variable goes from basic to non-basic). To illustrate this on the guide example, denote by R̂2 = (1, 1/2, 0, 1/4, 0 | 100) the result of dividing the row R2 by the pivot 4. The elementary row operations R0 → R0 + 140R̂2,

last right-hand column entry among the positive elements: 30/3 = 10, which is less than 90/5 = 18 and 20/1 = 20). This leads to the next admissible vertex of our region M and, of course, the value of h increases:

 0  0  10/3  0  −2/3  0 | 460
 0  1   1/3  0   1/3  0 |  10
 0  0  −5/3  1  −2/3  0 |  40
 1  0   1/3  0  −2/3  0 | 100
 0  0  −1/3  0   2/3  1 |  10

with x1 = 100, x2 = 10, x3 = x5 = 0, x4 = 40, x6 = 10, and h = 460 = 4 · 100 + 6 · 10 = (10/3) · 120 − (2/3) · (−90). We still have one negative entry in the first row. We circled the next pivot, leading to

 0  0    3   0  0    1 | 470
 0  1   1/2  0  0  −1/2 |   5
 0  0   −2   1  0    1 |  50
 1  0    0   0  0    1 | 110
 0  0  −1/2  0  1   3/2 |  15

with the final values x1 = 110, x2 = 5, x3 = 0, x4 = 50, x5 = 15, x6 = 0, and h = 470 = 4 · 110 + 6 · 5 = 3 · 120 + 1 · 110. Let us recall why we can be sure that this is the optimal solution. Thanks to the fact that the first row is exclusively non-negative, we have obtained an admissible solution of the dual problem which leads to the same value as the solution of the original one. Thus the equilibrium theorem implies that we are done!

Correctness of the algorithm. Let us come back to the above claims in some detail. We should check whether the algorithm provides the right answer when it terminates, but we should also check under which conditions it terminates. We start with the first part. A special feature of the above algorithm reshuffling the basic variables is that the slack-variable parts of the matrix are closely linked to the dual linear programming problem. Moreover, there is an invariant of the entire procedure:

Claim. Writing (−ĉ, ĉs, ĥ) for the current first row of the matrix and (x̂, x̂s) for the current values of the variables, we obtain c · x̂ = ĉs · b = ĥ at each step. In particular, at the moment of termination of the above algorithm, the coefficients y = ĉs in the first row represent admissible values of the dual problem (while the values ĉ stand for the slack variables of the dual problem), and the right-hand top corner provides the value of the corresponding objective function yT · b.

Since the two objective functions are equal, we know that the algorithm provides the optimal solution.
R1 → R1 − 8R̂2, and R3 → R3 − 4R̂2 yield the second simplex tableau, given by

         x1  x2   x3  x4   x5
R̂0  −h   0  −30   0   35   0 | 14000
R̂1  x3   0   4    1  −2    0 |   160
R̂2  x1   1   1/2  0   1/4  0 |   100
R̂3  x5   0   1    0  −1    1 |    20

The variable x4 is replaced by x1, which becomes basic, and hence we moved it to the column of basic variables. Together with x3, x5, these are now the new basic variables, and we have the solution x1 = 100, x2 = 0 = x4, x3 = 160, x5 = 20 (the values of the basic variables are read from the last column). Since there is still a negative number in R̂0, we proceed.
(d) We repeat steps (a), (b) and (c), and terminate the procedure when there are no more negative entries in the first row. For our example, from the second tableau one deduces the new work column, that of x2, and the new pivot, which is the circled number 1. Applying the elementary row operations R̂0 → R̂0 + 30R̂3, R̂1 → R̂1 − 4R̂3, and R̂2 → R̂2 − (1/2)R̂3, we arrive at the third tableau:

     x1  x2  x3  x4    x5
−h   0   0   0   5     30 | 14600
x3   0   0   1   2    −4  |    80
x1   1   0   0   3/4  −1/2 |   90
x2   0   1   0  −1     1  |    20

The new basic variable is x2 and it replaces x5. So we have the solution x1 = 90, x2 = 20, x3 = 80 and x4 = 0 = x5. Since the above tableau is the final one, this solution is optimal, and we may read the maximal value of h in the right top corner of the tableau, that is, h(90, 20) = 14600. In the solution obtained above all the original variables are among the basic ones, and so their values are non-zero. This is based on the assumption b ≥ 0. The problem 3.F.1 presented in the final section of this chapter encodes the more general situation; its explanation via this simplex algorithm is given in 3.1.6.

3.A.8. A note on duality. By the theoretical discussion in 3.1.4 we know that when the primal LP problem is a maximization one, its dual is a minimization problem, and conversely. The number of decision variables in the dual problem is the same as the number of less-than inequalities in the primal problem, while the number of constraints in the dual problem equals the number of decision variables in the primal one. Note that when some constraint appears as an equality in the primal (maximization) problem, then the corresponding dual variable is unrestricted. The coefficients of the objective function of the primal problem form the right-hand side of the constraints of the dual one, while the right-hand side of the primal determines the objective function of the dual LP problem.

Proof of the Claim. As we know, Gaussian elimination can be expressed as multiplication by a suitable transition matrix T from the left. Our “pivoting” always corresponds to a (block-wise) matrix
$$T = \begin{pmatrix} 1 & y^T \\ 0 & R \end{pmatrix},$$
with an invertible matrix R. Thus (dealing with the standard problem)
$$T \cdot \begin{pmatrix} 1 & -c & 0 & 0 \\ 0 & A & E_m & b \end{pmatrix} = \begin{pmatrix} 1 & -c + y^T A & y^T & y^T b \\ 0 & RA & R & Rb \end{pmatrix}$$
and the second row corresponds (since R is invertible) to the equality Ax̂ + x̂s = b. Consequently, we arrive at −ĉ x̂ = (−c + yT A) x̂ = −c x̂ + yT (b − x̂s). Now, notice that the components of x̂s vanish whenever the corresponding components of yT are non-zero, i.e., yT x̂s = 0. Similarly, ĉ x̂ = 0, and so we read from the latter displayed equality c x̂ = yT b, as claimed. □

The termination of the algorithm is a more subtle question. While it terminates nearly always, it might happen that the algorithm cycles without enlarging the objective function. We shall not go into more details here; a simple example is given in 3.F.5. □
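Readers who wish to trace such pivoting steps mechanically may try Sage's interactive simplex implementation (available in recent Sage versions; we sketch its use for the guide example, without guaranteeing the exact output format):
A = ([8, 8], [4, 2], [4, 3]); b = (960, 400, 420); c = (140, 100)
P = InteractiveLPProblemStandardForm(A, b, c)
P.run_simplex_method()     # displays every tableau and pivot choice
print(P.optimal_value())   # 14600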
3.1.7. Notes about linear models in economy. Our simple scheme of the black&white painter from paragraph 3.1.1 can be used to illustrate one of the typical economic models, the model of production planning. The model tries to capture the problem completely, that is, to capture both external and internal relations. The left-hand sides of the inequalities (1), (2), (3) in 3.1.1 and the objective function h(x1, x2) express various production relations. Depending on the character of the problem, we have on the right-hand sides either exact values (and so we solve equations) or capacity constraints and goal optimization (then we obtain linear programming problems). Thus in general we can solve the problem of source allocation with supplier constraints and either minimize costs or maximize income. Among economic models we can find many modifications. One of them is the problem of financial planning, which is connected with the optimization of a portfolio. We set up a volume of investment into individual investment possibilities with the goal of meeting the given constraints on risk factors while maximizing the profit, or dually, minimizing the risk for a given volume. Another common model is the marketing application, for instance the allocation of costs for advertisement in various media, or the placement of advertisements into time intervals. Restrictions are in this case determined by the budget, the target population, etc. Very common are models of nutrition, that is, setting up how much of various kinds of food should be eaten in order to meet given total volumes of specific components, e.g. minerals and vitamins.

For our guide example, the dual problem is the task of minimizing 960y1 + 400y2 + 420y3 subject to
8y1 + 4y2 + 4y3 ≥ 140, 8y1 + 2y2 + 3y3 ≥ 100,
with yi ≥ 0 for all i = 1, 2, 3. The final tableau of the primal problem also provides the solution of the dual one. According to the strong duality theorem (see 3.1.5), the minimal value is again 14600, while the corresponding values of the dual variables y1, y2 and y3 are read off the first row of the corresponding final tableau, that is, y1 = 0, y2 = 5, y3 = 30.

3.A.9. For the following LP problems, find their standard form and describe the corresponding dual problem.
(a) Maximize h = x1 + 2.5x2 subject to 2x1 + 3x2 ≤ 20, x1 + x2 ≥ −1, x1 − 2x2 = 1, with x1 ≥ 0, x2 ≥ 0.
(b) Minimize h = 2x1 + 3x2 + 2x4 subject to x1 + 2x2 + 2x3 ≤ −6, x1 + 4x2 − 2x4 = 5, x2 − x3 + 4x4 ≥ 2, with xi ≥ 0 for all i = 1, . . . , 4. ⃝

3.A.10. LP problems with redundant constraints. Use the simplex method to solve the LP problem of maximizing the linear function h = 4x1 + 9x2 subject to the conditions
x1 + 4x2 ≤ 8, x1 + 2x2 ≤ 4, x1 ≥ 0, x2 ≥ 0.

Solution. The corresponding feasible region is the grey region in the figure below, and as we see, the first constraint is redundant. The evaluation of h at the three corners of this triangle shows that h is maximized at (x1 = 0, x2 = 2). We want to verify this result by the simplex method. The canonical form of the problem is the following: maximize h = 4x1 + 9x2 + 0x3 + 0x4 subject to
x1 + 4x2 + x3 = 8, x1 + 2x2 + x4 = 4,
with xi ≥ 0 for i = 1, . . . , 4. The variables x3, x4 are the slacks, and we begin with the first simplex tableau:

        x1  x2  x3  x4
R0  −h  −4  −9  0   0 | 0
R1  x3   1   4  1   0 | 8
R2  x4   1   2  0   1 | 4

The work column is the column of x2, and as pivot one may fix the circled number 4. To eliminate the work column, replace R1 by R̂1 := (1/4)R1 and apply the row operations R0 → R0 + 9R̂1 and R2 → R2 − 2R̂1.
Problems of linear programming arise also with personnel tasks, where workers with specific qualifications and other properties are distributed into working shifts. Common are also problems of merging, problems of splitting, and problems of goods distribution.

2. Difference equations

We have already met difference equations in the first chapter, albeit briefly and of first order only. Now we consider a more general theory for linear equations with constant coefficients. This not only provides very practical tools but also represents a good illustration of the concepts of vector spaces and linear mappings.

Homogeneous linear difference equation of order k

3.2.1. Definition. A homogeneous linear difference equation (or homogeneous linear recurrence) of order k is given by the expression
a0 xn + a1 xn−1 + · · · + ak xn−k = 0, a0 ≠ 0, ak ≠ 0,
where the coefficients ai are scalars, which may possibly depend on n. A solution of this equation is a sequence of scalars xi, for all i ∈ N (or i ∈ Z), which satisfy the equation for every n. We often understand the sequence in question as a function
xn = f(n) = −(a1/a0) f(n − 1) − · · · − (ak/a0) f(n − k).
By giving any k consecutive values xi in the sequence, all the other values are determined uniquely. Indeed, we work over a field of scalars, so the values a0 and ak are invertible; hence, using the recurrence, any xn can be computed uniquely from the preceding k values, and similarly for xn−k. Induction thus immediately proves that all the remaining values are uniquely determined. The space of all infinite sequences xi forms a vector space, where addition and multiplication by scalars work coordinate-wise. The definition immediately implies that a sum of two solutions of a homogeneous linear difference equation, or a scalar multiple of a solution, is again a solution. Analogously to homogeneous linear systems, we see that the set of all solutions forms a subspace. Initial conditions on the values x0, . . . , xk−1 of a solution represent a k-dimensional vector in Kk. The sum of initial conditions determines the sum of the corresponding solutions, and similarly for scalar multiples. Note also that substituting zeros and ones into the initial k values immediately yields k linearly independent solutions of the difference equation. Thus, although the vectors are infinite sequences, the set of all solutions has finite dimension. The dimension equals the order k of the equation. Moreover, we can easily obtain a basis of all those solutions. Again we speak of the fundamental

This brings the variable x2 into the column of basic variables, while x3 leaves. Thus the second simplex tableau has the form

        x1    x2  x3    x4
R̂0  −h  −7/4  0   9/4   0 | 18
R̂1  x2   1/4  1   1/4   0 |  2
R̂2  x4   1/2  0  −1/2   1 |  0

We observe that in this tableau the basic variable x4 takes the zero value, which means that there is a redundant constraint. Such a basic feasible solution, with at least one basic variable equal to zero, is called degenerate. Now, the column of x1 is the new work column and the new pivot is the circled 1/2. We replace R̂2 by Ř2 := 2R̂2 and apply the row operations R̂1 → R̂1 − (1/4)Ř2 and R̂0 → R̂0 + (7/4)Ř2. Thus x1 enters the column of basic variables, x4 leaves, and the final tableau is given by

        x1  x2  x3    x4
Ř0  −h   0   0  1/2   7/2 | 18
Ř1  x2   0   1  1/2  −1/2 |  2
Ř2  x1   1   0  −1    2   |  0

This gives us the optimal solution (x2 = 2, x1 = 0), which is degenerate, with the optimal value of the objective function being h = 18. □
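Once more, the result is easy to confirm with the interface of 3.A.12 (our own sketch):
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective(4*x1 + 9*x2)
p.add_constraint(x1 + 4*x2 <= 8)
p.add_constraint(x1 + 2*x2 <= 4)
print(p.solve(), p.get_values(x1), p.get_values(x2))   # 18.0 0.0 2.0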
3.A.11. Cycling. Observe that during the second iteration above, the value of the given h did not increase. When a sequence of iterations returns to a previously constructed tableau, we say that the simplex method cycles. In this case the method enters a loop, without improving the objective function value; see also the end of section 3.1.6. Cycling can occur only in the presence of degeneracy. However, there are many LP problems that are degenerate and do not cycle (as above). Examples of LP problems for which the simplex method cycles are in general difficult to construct, and rare in practice. We describe such an example in Section 3.F.5, where a remedy to this pitfall is also presented. Numerous software packages are available to handle the simplex method and solve linear programming problems via a computer. As is customary in our book, next we will use Sage to implement the tasks at hand.

3.A.12. LP treated by Sage. In Sage, for handling linear programming problems, one can utilize the mixed integer linear programming (MILP) package. Next, we will demonstrate the procedure using our guide example from section 3.A.7. Let us summarize the main steps:
1. We initialize the linear programming algorithm:
p = MixedIntegerLinearProgram()
2. We introduce the non-negative decision variables:
v = p.new_variable(real=True, nonnegative=True)

system of solutions, and all other solutions are its linear combinations. As we have just checked, if we choose k consecutive indices i, i + 1, . . . , i + k − 1, the homogeneous linear difference equation gives a linear mapping Kk → K∞ of k-dimensional vectors of initial values into infinite-dimensional sequences of the same scalars. The independence of such solutions is equivalent to the independence of the initial values – which can easily be checked by a determinant: if we have a k-tuple of solutions (x[1]n, . . . , x[k]n), it is independent if and only if the following determinant, sometimes called the Casoratian, is non-zero for some n:

$$C_n = \begin{vmatrix} x^{[1]}_n & \cdots & x^{[k]}_n \\ x^{[1]}_{n+1} & \cdots & x^{[k]}_{n+1} \\ \vdots & & \vdots \\ x^{[1]}_{n+k-1} & \cdots & x^{[k]}_{n+k-1} \end{vmatrix} \neq 0.$$

Notice that the determinant Cn+1 is obtained from Cn by replacing the first row of Cn by the last row of Cn+1, expressed as the linear combination of the rows of Cn given by the difference equation, and then moving this row to the last position. Thus Cn+1 = (−1)^k (ak/a0) Cn, with the values a0, ak corresponding to n + k. In particular, linearly independent initial conditions lead to independent k-dimensional vectors in all consecutive k components of the solution.

3.2.2. Recurrences with constant coefficients. It is difficult to find a universal mechanism for finding a solution (that is, a directly computable expression) of general homogeneous linear difference equations. We shall come back to this problem at the end of chapter 13. In practical models one very often meets equations where the coefficients are constant. In this case it is possible to guess a suitable form of the solution and indeed to find k linearly independent solutions. This is then a complete solution of the problem, since all other solutions are linear combinations of them. For simplicity we start with equations of second order. Such recurrences are very often encountered in practical problems, where there are relations based on two previous values.
A linear difference equation (recurrence) of second order with constant coefficients is thus a formula

(1) f(n + 2) = a f(n + 1) + b f(n) + c,

where a, b, c are known scalar coefficients. Consider a population model, where the individuals in a population mature and start breeding two seasons later (that is, they add to the value f(n + 2) by a multiple b f(n) with positive b > 1), while the immature individuals at the same time weaken and destroy part of the mature population (that is, the coefficient a at f(n + 1) is negative). Furthermore, it might be that somebody destroys (uses, eats) a fixed amount −c of individuals every season.

For the guide example there are two decision variables, and to introduce them one should additionally type
x1, x2 = v["x1"], v["x2"]
3. We set the objective function by typing
p.set_objective(140*x1 + 100*x2)
4. Next we set the constraints. For the guide example this can be done as follows:
p.add_constraint(8*x1 + 8*x2 <= 960)
p.add_constraint(4*x1 + 2*x2 <= 400)
p.add_constraint(4*x1 + 3*x2 <= 420)
5. If k is the maximum of the objective function, we let the program know about this by typing
k = p.solve()
Since x1, x2 are the coordinates of the solution, we should also include the code
x1, x2 = p.get_values(x1,x2)
6. Finally, to get the output we write
print("Answer =", round(k, 2))
print("(x1, x2) =", (x1, x2))
and Sage prints out the following answer:
Answer = 14600.0
(x1, x2) = (90.0, 20.0)
Let us summarize the code all together, without interruptions:
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective(140*x1 + 100*x2)
p.add_constraint(8*x1 + 8*x2 <= 960)
p.add_constraint(4*x1 + 2*x2 <= 400)
p.add_constraint(4*x1 + 3*x2 <= 420)
k = p.solve()
x1, x2 = p.get_values(x1,x2)
print("Answer =", round(k, 2))
print("(x1, x2) =", (x1, x2))
Problems with more decision variables can be treated similarly. For instance, if you need three variables, give the cell
v = p.new_variable(real=True, nonnegative=True)
x1, x2, x3 = v["x1"], v["x2"], v["x3"]
Further tasks in linear programming and applications of Sage are described in section F. It is noteworthy that an elegant application of linear programming involves game theory. In 3.F.6 we shall explore an application in this direction concerning zero-sum games, also known as “matrix games”. In these games, the interests of the two players are directly opposed, meaning one player's loss is the other player's gain. The term “zero-sum” indicates that the sum of the payoffs for both players always equals zero, regardless of the game's outcome. As detailed in

A similar situation with c = 0 and both other coefficients positive determines the famous Fibonacci sequence of numbers y0, y1, . . ., where yn+2 = yn+1 + yn, see 3.B.1. If we have no idea how to solve a mathematical problem, we can always blindly try known solutions of similar problems. Thus, let us substitute into the equation (1) with coefficient c = 0 a solution similar to that of the linear equations from the first chapter (cf. 1.2.1), that is, we try f(n) = λ^n for some scalar λ. Substitution into the equation yields
λ^{n+2} − aλ^{n+1} − bλ^n = λ^n (λ² − aλ − b) = 0.
This relation holds either for λ = 0 or for the choice of the values
λ1 = (1/2)(a + √(a² + 4b)), λ2 = (1/2)(a − √(a² + 4b)).
It is easy to see that such solutions work; we just had to choose the scalar λ suitably.
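This substitution is easy to replay symbolically in Sage (a sketch of ours; we write lam for λ, since lambda is a reserved word):
var('a b n lam')
expr = lam^(n+2) - a*lam^(n+1) - b*lam^n
print((expr / lam^n).simplify_full())        # lam^2 - a*lam - b
print(solve(lam^2 - a*lam - b == 0, lam))    # the two roots λ1, λ2 above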
But we are not finished, since we want to find a solution for any two initial values f(0) and f(1). So far, we have only found two specific sequences satisfying the given equation (possibly even only one sequence, if λ2 = λ1). As we have already derived for linear recurrences, the sum of two solutions f1(n) and f2(n) of our equation f(n + 2) − a f(n + 1) − b f(n) = 0 is again a solution of the same equation, and the same holds for scalar multiples of a solution. Our two specific solutions thus generate the more general solutions f(n) = C1 λ1^n + C2 λ2^n for arbitrary scalars C1 and C2. For a unique solution of the specific problem with given initial values f(0) and f(1), it remains only to find the corresponding scalars C1 and C2.

3.2.3. The choice of scalars. We show how this can work on an example. Consider the problem

(1) yn+2 = yn+1 + (1/2) yn, y0 = 2, y1 = 0.

Here λ1,2 = (1/2)(1 ± √3), and clearly
y0 = C1 + C2 = 2,
y1 = (1/2) C1 (1 + √3) + (1/2) C2 (1 − √3)
is satisfied for exactly one choice of these constants. Direct calculation yields C1 = 1 − (1/3)√3, C2 = 1 + (1/3)√3, and our problem has the unique solution
f(n) = (1 − (1/3)√3) (1/2^n)(1 + √3)^n + (1 + (1/3)√3) (1/2^n)(1 − √3)^n.
Note that even though the found solution of our equation with rational coefficients and rational initial values looks complicated and is expressed with irrational numbers, we know a priori that the solution itself is again rational. But without this “step aside” into a larger field of scalars, we would not be able to describe the general solution.

3.F.6, such problems can be effectively addressed using stochastic matrices.

B. Recurrence relations

A recurrence relation for a given sequence defines a relationship among its terms, often useful for modeling recurring patterns or behaviours. Such relations are particularly valuable in scenarios where a recursive structure is observed repeatedly. For instance, they serve as a powerful tool for describing growth models. Without doubt, the simplest recurrence relations are the “homogeneous linear recurrence relations” (with constant coefficients), expressed as
x_{n+k} = α1 x_{n+k−1} + α2 x_{n+k−2} + · · · + αk x_n,
where the αi are complex numbers with αk ≠ 0. The number k is called the order of the recurrence relation; see also Section 3.2.1 for details. When the right-hand side includes a constant term, we get the notion of a “non-homogeneous linear recurrence relation” (with constant coefficients). We begin with a well-known example of a second-order homogeneous linear recurrence relation, historically used to model populations of rabbits.

3.B.1. Rabbits and the Fibonacci sequence. Consider a scenario where, at the start of spring, a stork brings two newborn rabbits, one male and one female, to a meadow. Each female rabbit becomes fertile at two months old and gives birth to a pair of newborns, one male and one female, every month thereafter. Specifically, each female is pregnant for one month before producing new offspring. How many pairs of rabbits will be present after nine months, assuming no deaths and that none “move in”?

Solution. After the first month, there is still only one pair of rabbits, but the female is already pregnant. By the end of the second month, the first offspring are born, resulting in two pairs. Each subsequent month, the number of new pairs equals the number of pregnant females in the previous month. This corresponds to the number of pairs at least one month old, which coincides with the number of pairs that were there two months ago.
Let us denote by F_n the total number of pairs after n months. According to the explanation given above, for the first months we compute

F_1 = 1, F_2 = 1, F_3 = F_2 + F_1 = 2, F_4 = F_3 + F_2 = 3,
F_5 = F_4 + F_3 = 5, F_6 = F_5 + F_4 = 8, …

Hence F_n equals the sum of the numbers of pairs in the previous two months, and this is described by the following homogeneous linear recurrence relation:

F_{n+2} = F_{n+1} + F_n,  n = 1, 2, ….

We will often meet similar phenomena. Moreover, the general solution often allows us to discuss the qualitative behaviour of the sequence of numbers f(n) without direct enumeration of the constants. For example, we may see whether the values approach some fixed value with increasing n, or oscillate in some interval, or are unbounded.

3.2.4. General homogeneous recurrences. We substitute x_n = λ^n, for some (yet unknown) scalar λ, into the general homogeneous equation from the definition 3.2.1 (with constant coefficients). For every n we obtain the condition

λ^{n−k}(a_0 λ^k + a_1 λ^{k−1} + ⋯ + a_k) = 0.

This means that either λ = 0 or λ is a root of the so-called characteristic polynomial in the parentheses. The characteristic polynomial is independent of n.

Assume that the characteristic polynomial has k distinct roots λ_1, …, λ_k. For this purpose, we may extend the field of scalars we are working in, for instance Q into R or C. Of course, if the initial conditions are in the original field, then the solutions stay there too, since the recurrence equation itself does. Each of the roots gives us a single possible solution x_n = (λ_i)^n. We need k linearly independent solutions, so we check the independence by substituting the k values n = 0, …, k−1 for the k choices of λ_i into the Casoratian (see 3.2.1). This yields the Vandermonde matrix. It is a good but not entirely trivial exercise to show that for every k and any k-tuple of distinct λ_i the determinant of such a matrix is non-zero, see 2.E.21 on page 151. It follows that the chosen solutions are linearly independent. Thus we have found the fundamental system of solutions of the homogeneous difference equation in the case that all the (possibly complex) roots of its characteristic polynomial are distinct.

Now we suppose λ is a multiple root and ask whether x_n = nλ^n could be a solution. We arrive at the condition

a_0 nλ^n + ⋯ + a_k(n − k)λ^{n−k} = 0.

This condition can be rewritten as

λ·(a_0 λ^n + ⋯ + a_k λ^{n−k})′ = 0,

where the prime denotes differentiation with respect to λ (cf. the infinitesimal definition in 5.1.6, and 12.3.7 for the purely algebraic treatment). Moreover, a root c of a polynomial f has multiplicity greater than one if and only if it is a root of f′, see 12.3.7 for the proof. Our condition is thus satisfied. With greater multiplicity ℓ of the root of the characteristic polynomial, we can proceed similarly and use the (now obvious) fact that a root with multiplicity ℓ is a root of all derivatives of the polynomial up to order ℓ − 1 (inclusively).

This relation, along with the initial conditions F_1 = 1 and F_2 = 1, uniquely determines the numbers of pairs of rabbits on the meadow in the individual months. Observe that for a suitable r the sequence r^n is a solution of the difference equation (without initial conditions). Such r can be obtained by substituting into the recurrence relation, r^{n+2} = r^{n+1} + r^n, and after dividing by r^n we obtain

r^2 = r + 1.
This is the characteristic equation of the given recurrence, and the numbers (1 − √5)/2 and (1 + √5)/2 are its roots.³ Thus, according to the theory in 3.2.4, the solution of the given recurrence relation will be a sequence of the form F_n = a·x_n + b·y_n, with

x_n := ((1 − √5)/2)^n,  y_n := ((1 + √5)/2)^n,

and a, b ∈ R to be specified. To compute the constants a, b, we use the initial conditions. Alternatively, we can set F_0 = 0 and compute a and b from the equations for F_0 and F_1. We find a = −1/√5, b = 1/√5, and hence the solution is given by

F_n = (1/√5)·(y_n − x_n) = ((1 + √5)^n − (1 − √5)^n)/(2^n·√5).

Although the Fibonacci sequence looks like a sequence of irrational numbers, the value of F_n is actually an integer for every natural n. Hence all terms in the Fibonacci sequence are integers; in particular F_9 = 34, which gives the answer. □

We proceed with additional exercises related to second-order linear homogeneous difference equations with constant coefficients.

3.B.2. Find the next two terms of the sequence (a_n)_{n≥0} beginning as follows:

3, 5, 11, 21, 43, 85, …

by providing a recursive definition of a_n. Next solve the corresponding recurrence relation.

Solution. Set a_0 = 3 and a_1 = 5. We see that the terms in question are a_6 = 171 and a_7 = 341. Indeed, observe that a_2 = a_1 + 2a_0, a_3 = a_2 + 2a_1, and so on. Thus the sequence satisfies

a_n = a_{n−1} + 2a_{n−2},  n = 2, 3, …,

and this is the recurrence formula we are seeking. The corresponding characteristic equation is the quadratic equation r^2 − r − 2 = 0, whose roots are the real numbers 2 and −1. Thus the general solution has the form a_n = a·2^n + b·(−1)^n, with a, b ∈ R to be computed. From the initial conditions we obtain the system {3 = a + b, 5 = 2a − b}, so a = 8/3, b = 1/3 and

a_n = (8/3)·2^n + (1/3)·(−1)^n. □

³ The number (1 + √5)/2 is called the golden ratio and has fascinated mathematicians since Pythagoras (570–495 B.C.).

The derivatives look like this:

f(λ) = a_0 λ^n + ⋯ + a_k λ^{n−k}
f′(λ) = a_0 nλ^{n−1} + ⋯ + a_k(n−k)λ^{n−k−1}
f″(λ) = a_0 n(n−1)λ^{n−2} + ⋯ + a_k(n−k)(n−k−1)λ^{n−k−2}
⋮
f^{(ℓ)}(λ) = a_0 n(n−1)⋯(n−ℓ+1)λ^{n−ℓ} + ⋯ + a_k(n−k)(n−k−1)⋯(n−k−ℓ+1)λ^{n−k−ℓ}.

We look at the case of a triple root λ and try to find a solution in the form n²λ^n. Substituting into the definition, we obtain the equation

a_0 n²λ^n + ⋯ + a_k(n − k)²λ^{n−k} = 0.

Clearly the left side equals the expression λ²f″(λ) + λf′(λ), and because λ is a root of both derivatives, the condition is satisfied. Using induction, we prove that even in the general condition for a solution of the form x_n = n^ℓ λ^n,

a_0 n^ℓ λ^n + ⋯ + a_k(n − k)^ℓ λ^{n−k} = 0,

the left-hand side can be obtained as a linear combination of the derivatives of the characteristic polynomial, starting with the expression (check the combinatorics!)

λ^ℓ f^{(ℓ)} + (ℓ choose 2)·λ^{ℓ−1} f^{(ℓ−1)} + ⋯ .

We have thus come close to the complete proof of the following result:

Homogeneous equations with constant coefficients

Theorem. The solution space of a homogeneous linear difference equation of order k with constant coefficients, over the field of scalars K = C, is the k-dimensional vector space generated by the sequences x_n = n^ℓ λ^n, where λ runs over the (complex) roots of the characteristic polynomial and the powers ℓ run over the natural numbers 0, …, r_λ − 1, where r_λ is the multiplicity of the root λ.

Proof. The relation between the multiplicity of roots and the derivatives of real polynomials will be proved later (cf. 5.3.7), while the fact that every complex polynomial has exactly as many roots (counting multiplicities) as its degree will appear in 10.2.11.
It remains to prove that the k-tuple of solutions thus found is linearly independent. Even in this case we can prove inductively that the corresponding Casoratian is non-zero. We have done this already in the case of the Vandermonde determinant.

3.B.3. Find the solution of the recurrence relation x_n = 6x_{n−1} − 9x_{n−2} with initial conditions x_0 = 2, x_1 = 3. Additionally, validate your answer using Sage.

Solution. The characteristic equation has the form

r² − 6r + 9 = 0 ⟺ (r − 3)² = 0.

Hence r = 3 is the unique (double) root. Thus, according to the theory in 3.2.4, the general solution must be of the form x_n = a·3^n + b·n·3^n, for some a, b ∈ R to be computed. Based on the initial conditions x_0 = 2 and x_1 = 3 we obtain a = 2 and b = −1. Thus we deduce that

x_n = 2·3^n − n·3^n = 3^n(2 − n).

Sage provides a user-friendly package for handling recurrence relations. One useful function is rsolve from the Python package sympy, designed for symbolic computations; it can effectively handle linear recurrence relations. To begin, we typically start by entering the following cell:

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")

The next step is to input the recurrence relation that we want to solve, along with its initial conditions:

f = a(n) - 6*a(n-1) + 9*a(n-2)
initial = {a(0):2, a(1):3}

Now we can obtain the solution by adding the syntax

rsolve(f, a(n), initial).expand()

Sage prints out the answer -3**n*n + 2*3**n. Let us summarize all the code together:

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n) - 6*a(n-1) + 9*a(n-2)
initial = {a(0):2, a(1):3}
rsolve(f, a(n), initial).expand()

□

3.B.4. Determine an explicit expression for the sequence satisfying the difference equation x_{n+2} = 3x_{n+1} + 3x_n with members x_1 = 1 and x_2 = 3. Next verify your answer using Sage. ⃝

Note that the characteristic polynomial of a homogeneous difference equation may have complex roots. Therefore, when solving recurrence relations, we may encounter situations involving complex numbers. This requires transitioning from complex bases of the solution space to real ones, as discussed in Section 3.2.5. Let us explore such an example.

This approach is well illustrated by the calculation in the case of a root λ_1 with multiplicity one and a root λ_2 with multiplicity two:

C(λ_1^n, λ_2^n, nλ_2^n) = det [ λ_1^n      λ_2^n      nλ_2^n         ]
                              [ λ_1^{n+1}  λ_2^{n+1}  (n+1)λ_2^{n+1} ]
                              [ λ_1^{n+2}  λ_2^{n+2}  (n+2)λ_2^{n+2} ]

= λ_1^n λ_2^{2n} · det [ 1     1     n          ]
                       [ λ_1   λ_2   (n+1)λ_2   ]
                       [ λ_1²  λ_2²  (n+2)λ_2²  ]

= λ_1^n λ_2^{2n} · det [ 1              1   n    ]
                       [ λ_1 − λ_2      0   λ_2  ]
                       [ λ_1(λ_1 − λ_2) 0   λ_2² ]

= −λ_1^n λ_2^{2n} · det [ λ_1 − λ_2       λ_2  ]
                        [ λ_1(λ_1 − λ_2)  λ_2² ]

= λ_1^n λ_2^{2n+1} (λ_1 − λ_2)² ≠ 0.

In the general case the proof can be carried out inductively in a similar way. □

3.2.5. Real basis of the solutions. For equations with real coefficients, real initial conditions always lead to real solutions (and similarly for scalars in Z or Q). However, the corresponding fundamental solutions derived using the above theorem might exist only in the complex domain. We therefore try to find other generators, which are more convenient. Because the coefficients of the characteristic polynomial are real, each of its roots is either real, or the roots come in pairs of complex conjugates.
If we describe such a pair of roots in polar form as

λ^n = |λ|^n (cos nφ + i sin nφ),  λ̄^n = |λ|^n (cos nφ − i sin nφ),

we see immediately that their sum and difference lead (up to constant multiples) to two linearly independent solutions

x_n = |λ|^n cos nφ,  y_n = |λ|^n sin nφ.

Difference equations very often appear as models of the dynamics of some system. A nice topic to think about is the connection between the absolute values of the individual roots and the stability of the solution. We will not go into details here, because we shall only speak about convergence of values to some limit in the fifth chapter. There is space for some interesting numerical experiments: for instance with oscillations of suitable population or economical models.

3.2.6. The non-homogeneous case. As in the case of systems of linear equations, we can obtain all solutions of the non-homogeneous linear difference equation

a_0(n)x_n + a_1(n)x_{n−1} + ⋯ + a_k(n)x_{n−k} = b(n),

where the coefficients a_i and b are scalars which might depend on n, with a_0(n) ≠ 0, a_k(n) ≠ 0. Again, we proceed by finding one particular solution and adding the complete vector space of dimension k of solutions of the corresponding homogeneous system. Indeed, each such sum yields a solution, and since the difference of two solutions of the non-homogeneous system is a solution of the homogeneous system, we obtain all solutions in this way.

3.B.5. Complex bases versus real bases. Determine the solution of the difference equation x_{n+2} = 2x_{n+1} − 2x_n with initial values x_1 = 2 = x_2. Moreover, express your solution in terms of a real basis of the associated solution space.

Solution. The roots of the characteristic polynomial r² − 2r + 2 are the complex numbers 1 + i and 1 − i. Hence the sequences y_n = (1 + i)^n and z_n = (1 − i)^n form a basis of the (complex) vector space of solutions, see also the main theorem in 3.2.4. In particular, a general solution is given as a linear combination of y_n, z_n (with complex coefficients), that is, x_n = a·y_n + b·z_n, where a = a_1 + ia_2, b = b_1 + ib_2, with a_j, b_j ∈ R for j = 1, 2. The initial conditions will guide us in choosing these scalars. Extending the sequence backwards via the recurrence, x_2 = 2x_1 − 2x_0 gives x_0 = 1, so the conditions yield the following system of equations:

1 = a_1 + ia_2 + b_1 + ib_2,
2 = (a_1 + ia_2)(1 + i) + (b_1 + ib_2)(1 − i).

By comparing the real and the imaginary parts of both equations, we finally arrive at a system of four equations in four unknowns:

a_1 + b_1 = 1,  a_2 + b_2 = 0,  a_1 − a_2 + b_1 + b_2 = 2,  a_1 + a_2 − b_1 + b_2 = 0.

We compute a_1 = b_1 = b_2 = 1/2, a_2 = −1/2, and thus the sequence in question attains the following form:

x_n = (1/2 − i/2)(1 + i)^n + (1/2 + i/2)(1 − i)^n.

If we want to verify our solution via Sage, we may proceed as before, that is,

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n+2) - 2*a(n+1) + 2*a(n)
initial = {a(1):2, a(2):2}
rsolve(f, a(n), initial)

Sage's answer has the form

(1/2 + I/2)*(1 - I)**n + (1/2 - I/2)*(1 + I)**n

For the final task we use the sequences

u_n = (1/2)(y_n + z_n) = (√2)^n cos(nπ/4),  v_n = (i/2)(z_n − y_n) = (√2)^n sin(nπ/4).

Note that the transition matrix for changing the basis from the complex one to the real one is given by

T := [[1/2, −i/2], [1/2, i/2]],  with  T^{−1} = [[1, 1], [i, −i]].

Therefore, if (c, d) are the coordinates of the sequence x_n with respect to the basis {u_n, v_n}, then

(c, d)^T = T^{−1}·(a, b)^T = (1, 1)^T.

Thus we now have an alternative expression for the sequence x_n, which involves square roots instead of complex numbers:

x_n = (√2)^n cos(nπ/4) + (√2)^n sin(nπ/4). □
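One can also gain confidence in the closed form just obtained by comparing it numerically with a direct iteration of the recurrence. The following small cell is our own check (not part of the exercise): both printed lists should agree, giving 2, 2, 0, −4, −8, −8, 0, 16 for n = 1, …, 8.

# iterate x_{n+2} = 2*x_{n+1} - 2*x_n with x_1 = x_2 = 2
xs = [2, 2]
for k in range(6):
    xs.append(2*xs[-1] - 2*xs[-2])
print(xs)
# compare with the real-basis expression (sqrt(2))^n*(cos(n*pi/4) + sin(n*pi/4))
closed = [(sqrt(2)^n * (cos(n*pi/4) + sin(n*pi/4))).simplify_full() for n in range(1, 9)]
print(closed)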
When we were working with systems of linear equations, it was possible that no solution existed. This cannot happen with difference equations. But it is not always easy to find the one particular solution of a non-homogeneous system, especially if the behaviour of the scalar coefficients in the equation is complicated. Even for linear recurrences with constant coefficients it may not be easy to find a solution if the right-hand side is complicated. But we can always try to find a solution in a form similar to the right-hand side.

Consider the case when the corresponding homogeneous system has constant coefficients and b(n) is a polynomial of degree s. A solution can then be sought in the form of the polynomial

x_n = α_0 + α_1 n + ⋯ + α_s n^s

with unknown coefficients α_i, i = 0, …, s. Substituting into the difference equation and comparing the coefficients of the individual powers of n, we obtain a system of s + 1 equations for the s + 1 variables α_i. If this system has a solution, then we have found a solution of our original problem. If it has no solution, we can try again with an increased degree of the polynomial in question.

For instance, the equation x_n − x_{n−2} = 2 cannot have a constant solution, because substituting the potential solution x_n = α_0 yields the impossible requirement α_0 − α_0 = 0 = 2. But setting x_n = α_0 + α_1 n, we obtain the solution x_n = α_0 + n, with α_0 arbitrary. Thus the general solution of our equation is

x_n = C_1 + C_2(−1)^n + n.

We use this method of indeterminate coefficients, for example, in 3.F.8.

3.2.7. Variation of constants. Another possible way to solve such an equation is the method of variation of constants. Here we first find the solution

y(n) = Σ_{i=1}^{k} c_i f_i(n)

of the homogeneous equation, and then consider the constants c_i as functions c_i(n) of the variable n. We look for a particular solution of the given equation in the form

y(n) = Σ_{i=1}^{k} c_i(n) f_i(n).

We illustrate the method on second-order equations. Suppose that the homogeneous part of the second-order non-homogeneous equation

x_{n+2} + a_n x_{n+1} + b_n x_n = f_n

has x_n^{(1)} and x_n^{(2)} as a basis of solutions. We will be looking for a particular solution of the non-homogeneous equation in the form

x_n = A_n x_n^{(1)} + B_n x_n^{(2)}

It is often useful to treat recurrence relations using matrices. While we will describe many such cases in Section C, let us briefly provide an example to highlight this elegant application of matrices. For the reader's convenience, our description includes verifications using Sage, especially for matrix computations.

3.B.6. The matrix method via Sage. Using tools from linear algebra, find for n ≥ 0 the solution of the difference equation x_{n+2} = 2x_{n+1} + 3x_n, with x_0 = 0 and x_1 = 1.

Solution. Set p_n = (x_{n+1}, x_n)^T, so that p_0 = (x_1, x_0)^T = (1, 0)^T. Then, for n ≥ 0, the difference equation in question can be expressed in terms of matrices as

p_{n+1} = (x_{n+2}, x_{n+1})^T = (2x_{n+1} + 3x_n, x_{n+1})^T = [[2, 3], [1, 0]]·(x_{n+1}, x_n)^T = A·p_n,

where A = [[2, 3], [1, 0]]. The matrix A is usually referred to as the "companion matrix" of the recurrence and satisfies the relations

p_1 = A·p_0,  p_2 = A·p_1 = A²·p_0,  p_3 = A·p_2 = A³·p_0,

and so forth. For example, we obtain p_1 = (2, 1)^T and p_2 = (7, 2)^T, which implies that x_2 = 2, x_3 = 7, and so on. Inductively, this approach gives us the expression p_n = A^n·p_0. Therefore, to find the vector p_n, and consequently the desired sequence x_n, we simply need to compute A^n. This process is straightforward when A is diagonalizable.
For our case, using Sage and the syntax

var("l")
solve(l^2-2*l-3, l)

we find that the characteristic polynomial of A, represented by λ² − 2λ − 3, has roots λ_1 = 3 and λ_2 = −1. Hence A has two eigenvalues, both with multiplicity one, and as we will see, it is diagonalizable. The corresponding eigenvectors, that is, solutions of the equation AX = λX, are given by X_1 = (1, 1/3)^T and X_2 = (1, −1)^T. To obtain these expressions we used Sage, via the block

A = matrix(QQ, [[2,3], [1,0]])
A.eigenvectors_right()

This prints out the answer [(3, [(1, 1/3)], 1), (-1, [(1, -1)], 1)].⁴ Recall that one may verify that A is diagonalizable directly via the command A.is_diagonalizable(), which in our case prints True. Therefore, we have A^n = P·D^n·P^{−1}, with

D = [[3, 0], [0, -1]],  P = [[3, 1], [1, -1]],

⁴ Recall that in the expression (3, [(1, 1/3)], 1) the first number 3 corresponds to the first eigenvalue, the pair (1, 1/3) encodes its eigenvector, that is, X_1 = (1, 1/3)^T, and the last number 1 is the multiplicity of the eigenvalue. Similarly for the second component inside the list.

with some conditions on A_n and B_n to be imposed. We have

x_{n+1} = A_{n+1} x_{n+1}^{(1)} + B_{n+1} x_{n+1}^{(2)}
        = A_n x_{n+1}^{(1)} + B_n x_{n+1}^{(2)} + (A_{n+1} − A_n) x_{n+1}^{(1)} + (B_{n+1} − B_n) x_{n+1}^{(2)}
        = A_n x_{n+1}^{(1)} + B_n x_{n+1}^{(2)} + δA_n x_{n+1}^{(1)} + δB_n x_{n+1}^{(2)},

where δA_n = A_{n+1} − A_n and δB_n = B_{n+1} − B_n. In order to be able to use the same A_n, B_n in the expression for x_{n+1}, we impose for all n the condition

δA_n x_{n+1}^{(1)} + δB_n x_{n+1}^{(2)} = 0.

Thus, for all n,

x_{n+1} = A_n x_{n+1}^{(1)} + B_n x_{n+1}^{(2)},

and in particular

x_{n+2} = A_{n+1} x_{n+2}^{(1)} + B_{n+1} x_{n+2}^{(2)} = A_n x_{n+2}^{(1)} + B_n x_{n+2}^{(2)} + δA_n x_{n+2}^{(1)} + δB_n x_{n+2}^{(2)}.

Now,

f_n = x_{n+2} + a_n x_{n+1} + b_n x_n
    = A_n (x_{n+2}^{(1)} + a_n x_{n+1}^{(1)} + b_n x_n^{(1)}) + B_n (x_{n+2}^{(2)} + a_n x_{n+1}^{(2)} + b_n x_n^{(2)}) + δA_n x_{n+2}^{(1)} + δB_n x_{n+2}^{(2)}
    = δA_n x_{n+2}^{(1)} + δB_n x_{n+2}^{(2)}.

Hence the variations δA_n and δB_n are subject to the system

δA_n x_{n+1}^{(1)} + δB_n x_{n+1}^{(2)} = 0,
δA_n x_{n+2}^{(1)} + δB_n x_{n+2}^{(2)} = f_n,

with solutions (compute the inverse matrix, e.g., by means of the algebraic adjoint and the determinant)

δA_n = A_{n+1} − A_n = −f_n x_{n+1}^{(2)} / W_{n+1},
δB_n = B_{n+1} − B_n = f_n x_{n+1}^{(1)} / W_{n+1},

where W_{n+1} is the Wronski determinant

W_{n+1} = det [ x_{n+1}^{(1)}  x_{n+1}^{(2)} ]
              [ x_{n+2}^{(1)}  x_{n+2}^{(2)} ].

It follows that

A_n − A_0 = Σ_{j=0}^{n−1} (−f_j x_{j+1}^{(2)} / W_{j+1}),
B_n − B_0 = Σ_{j=0}^{n−1} (f_j x_{j+1}^{(1)} / W_{j+1}).

where for simplicity we have used the multiple 3X_1 = (3, 1)^T in the first column of P (this is also an eigenvector corresponding to λ_1). We compute

P^{−1} = [[1/4, 1/4], [1/4, -3/4]],

which one can verify by the relation P^{−1}·A·P = [[λ_1, 0], [0, λ_2]] = D. With all the necessary ingredients in hand, we now obtain

A^n = [[3, 1], [1, -1]] · [[3^n, 0], [0, (−1)^n]] · [[1/4, 1/4], [1/4, -3/4]]
    = (1/4)·[[3^{n+1} + (−1)^n, 3^{n+1} − 3(−1)^n], [3^n − (−1)^n, 3^n + 3(−1)^n]].

To verify the expression for A^n one can use Sage, by the following block (this relies on the fact that A is diagonalizable):

A = matrix(SR, [[2,3], [1,0]])
P = matrix(SR, [[3, 1], [1, -1]])
n = var('n')
D = matrix(SR, [[3**n, 0], [0, (-1)**n]])
An = P * D * P.inverse()
An

Returning to our initial task, we get

p_n = (x_{n+1}, x_n)^T = A^n·p_0 = (1/4)·(3^{n+1} + (−1)^n, 3^n − (−1)^n)^T.

From this expression it follows that x_n = (1/4)(3^n − (−1)^n). □

3.B.7. Verify the solution given in 3.B.6 via Sage. ⃝

Homogeneous recurrence relations of order higher than two can essentially be treated in the same way.
However, the characteristic polynomial in this case will be of a higher degree, making the overall procedure a bit more complicated. With this in mind, we present the following tasks.

3.B.8. For the difference equation x_{n+4} = x_{n+3} + x_{n+1} − x_n, find a real basis of the corresponding solution space. ⃝

3.B.9. Determine an explicit formula for the n-th member of the unique solution {x_n}_{n=1}^∞ that satisfies −x_{n+3} = 2x_{n+2} + 2x_{n+1} + x_n with x_1 = x_2 = x_3 = 1. ⃝

Non-homogeneous difference equations have numerous applications. Next, we dedicate some space to examples of this kind (see also 3.2.6 and the problem 3.F.8 in the final section). Notice that in Sage, non-homogeneous recurrence relations can be treated via exactly the same method described above for the homogeneous case.

Setting A_0 = B_0 = 0 we obtain

A_n = Σ_{j=0}^{n−1} (−f_j x_{j+1}^{(2)} / W_{j+1}),  B_n = Σ_{j=0}^{n−1} (f_j x_{j+1}^{(1)} / W_{j+1}),

and the acquired general solution of our recurrence equation is

x_n = C_1 x_n^{(1)} + C_2 x_n^{(2)} + (Σ_{j=0}^{n−1} −f_j x_{j+1}^{(2)} / W_{j+1})·x_n^{(1)} + (Σ_{j=0}^{n−1} f_j x_{j+1}^{(1)} / W_{j+1})·x_n^{(2)}.

This method is used to solve the example 3.F.9.

3.2.8. Linear filters. Now we consider infinite sequences

x = (…, x_{−n}, x_{−n+1}, …, x_{−1}, x_0, x_1, …, x_n, …).

As in the case of systems of linear equations, we work with an operation T that maps the sequence x to the sequence z = Tx with elements

z_n = a_0 x_n + a_1 x_{n−1} + ⋯ + a_k x_{n−k}.

As already noticed, the sequences x = (x_n) are vectors with respect to coordinate-wise operations, and the vector space of all such sequences is infinite-dimensional. The operation T is clearly a linear mapping on this space.

The sequences can be imagined as discrete values of a signal, often captured in very short time units. T plays the role of a filter that works with the signal; this is, for example, how the sampling of an audio signal looks. We are interested in estimating the properties such a linear filter can have.

Signals are often a linear combination of superimposed parts which are themselves periodic. From our definition it is clear that periodic sequences x_n, that is, sequences satisfying

x_{n+p} = x_n

for some fixed natural number p and all n, also have periodic images z = Tx:

z_{n+p} = a_0 x_{n+p} + a_1 x_{n−1+p} + ⋯ + a_k x_{n−k+p} = a_0 x_n + a_1 x_{n−1} + ⋯ + a_k x_{n−k} = z_n,

with the same period p. We are interested in the periodic input sequences x for which Tx remains roughly the same (up to a scalar multiple), and in those for which Tx is suppressed close to zero. This means that we are looking for the kernel of our linear mapping T and for its other eigenvectors. The kernel is the subspace of sequences given by the homogeneous difference equation

a_0 x_n + a_1 x_{n−1} + ⋯ + a_k x_{n−k} = 0,  a_0 ≠ 0, a_k ≠ 0,

which we are able to solve.

3.B.10. Determine the sequence of real numbers that satisfies the non-homogeneous difference equation

2x_{n+2} = −x_{n+1} + x_n + 2

with initial conditions x_1 = 2, x_2 = 3.

Solution. The general solution of the homogeneous equation is of the form a·(−1)^n + b·(1/2)^n. A particular solution is the constant 1. The general solution of the non-homogeneous equation, without initial conditions, is thus

a·(−1)^n + b·(1/2)^n + 1.

Based on the initial conditions we compute a = 1 and b = 4. Consequently, the solution is given by the sequence

x_n = (−1)^n + 4·(1/2)^n + 1. □

3.B.11. Find a sequence which satisfies the non-homogeneous difference equation x_{n+2} = x_{n+1} + 2x_n + 1 with the initial conditions x_1 = 2, x_2 = 2. Next verify your solution via Sage. ⃝
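As an illustration of the last remark, here is one possible cell (our own sketch) checking the solution of 3.B.10 with the same rsolve routine used in 3.B.3; the non-homogeneous relation is simply rewritten with all terms on one side.

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
# 2*x_{n+2} = -x_{n+1} + x_n + 2, rewritten as f = 0
f = 2*a(n+2) + a(n+1) - a(n) - 2
initial = {a(1):2, a(2):3}
rsolve(f, a(n), initial)

The output should be an expression equivalent to (−1)^n + 4·(1/2)^n + 1, in accordance with the solution above.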
C. Models of growth and iterated processes

Many population models are based on recurrence relations in vector spaces. In these models, the unknown is not a sequence of numbers but a sequence of vectors, with matrices serving as the coefficients. Thus matrices play a crucial role in population dynamics and can be used to simulate population growth. Next we focus on such problems and, moreover, explore how various iterated processes, such as discrete Markov chains, can model everyday situations and offer intriguing insights.

3.C.1. Alice and Bob together in savings. We begin with a model of growth of saved money. Alice and Bob are artists who have a band together, and they are saving money for the Black Friday online sales, to upgrade their musical instruments. As Black Friday falls in the last week of November, they start saving every December. In the first month Alice deposits 1 € and Bob deposits 2 €. Every consecutive month, each of them gives as much as in the previous month, plus one half of what the other gave the month before. How much money will Alice and Bob be able to spend together the next Black Friday, having applied the above plan for exactly one year? Is it true or false that during the last month Alice and Bob put away essentially the same amount of money?

Solution. Let us denote by a_n the amount of money that Alice saves in the n-th month, and let b_n be the corresponding amount saved by Bob. In the first month they deposit a_1 = 1, b_1 = 2. The subsequent savings can be encoded as follows:

a_{n+1} = a_n + (1/2)·b_n,  b_{n+1} = b_n + (1/2)·a_n.

3.2.9. Bad equalizer. As an example, consider a very simple linear filter given by the equation

z_n = (Tx)_n = x_n + x_{n−2}.

Clearly, the kernel of T is generated by x_n = cos(πn/2) and x_n = sin(πn/2), while the solutions of x_{n+2} = x_n correspond to the requirement (Tx)_n = 2x_n. The results of such an operation on a signal are illustrated by the two diagrams below. There we use two different frequencies of signals and display their discrete sampling (the solid lines and the points x_n on them). The dashed line represents the sampling z_n of the filtered signal. The first case shows an amplification of the signal, while the second frequency is close to the kernel, which is killed by the filter. Notice that the filtered signal suffers serious shifts in phase, which vary with the frequencies. Cheap equalizers work in such a bad way. Notice also how badly the original signal is sampled in the second picture. This is due to the fact that the sampling frequency is not much higher than the frequency of the signal.

3. Iterated linear processes

3.3.1. Iterated processes. In practical models we often encounter the situation where the evolution of a system in a given time interval is given by a linear process, and we are interested in the behaviour of the system after many iterations. The linear process often remains the same, so from the mathematical point of view we are dealing with iterated multiplication of the state vector by the same matrix. While solving systems of linear equations requires only minimal knowledge of the properties of linear mappings, in order to understand the behaviour of an iterated system we shall exploit eigenvalues, eigenvectors and further structural features. In fact, the determination of the solution of a linear recurrence equation by a set of initial conditions can itself be described as an iterated process.
Setting p_n := a_n + b_n for the common savings during the n-th month, we obtain p_{n+1} = (3/2)·p_n. This is obviously a geometric sequence, and hence p_n = 3·(3/2)^{n−1}.⁵ Now, the saving period lasts exactly twelve months, so the sum p_1 + p_2 + ⋯ + p_{12} represents the total common savings. We compute

3·(1 + 3/2 + ⋯ + (3/2)^{11}) = 3·((3/2)^{12} − 1)/(3/2 − 1) ≈ 772.5.

To confirm this computation use the command sum in Sage, as follows:

var("n"); pn = 3*(3/2)**(n-1)
N(sum(pn, n, 1, 12))

In this cell we first introduced n as a symbolic variable, and then asked Sage for the numerical approximation of the desired sum. This answers the first question: the next Black Friday, Alice and Bob will be able to spend almost 773 €.

To answer the second task, observe that

a_{n+1} = a_n + (1/2)·b_n = (1/2)(a_n + p_n) = (1/2)·a_n + (3/2)^n.

For a quick solution of this recurrence relation we apply Sage. Note that the extra initial condition a(2) = 2 passed to rsolve below is forced by the recurrence itself, since a_2 = (1/2)·1 + 3/2 = 2.

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n+1) - (1/2)*a(n) - (3/2)**n
initial = {a(1):1, a(2):2}
rsolve(f, a(n), initial)

which gives the answer (3/2)**n - (1/2)**n. This means that

a_n = (3/2)^n − (1/2)^n,

and hence during the 12th month Alice's savings should reach the level of 130 €, as given by the expression (3/2)^{12} − (1/2)^{12}. Bob's recurrence looks the same, just with b(1) = 2, and its solution is b_n = (3/2)^n + (1/2)^n. Bob thus saves basically the same amount, and the statement is true. □

3.C.2. Remark. Obviously, the example can be adapted to higher initial amounts, although in the last months Alice and Bob may find it hard to follow their plan. For instance, starting with a_1 = 10 € and b_1 = 20 €, we get p_1 = 30. Therefore, in this case one has

p_n = 30·(3/2)^{n−1},  a_{n+1} = (1/2)·a_n + 15·(3/2)^{n−1},

with solution a_n = 10·((3/2)^n − (1/2)^n). For convenience, we list the savings in a table, with accuracy of two decimal digits.

⁵ Recall that the geometric sequence x_n = κ·x_{n−1} satisfies x_n = x_1·κ^{n−1}.

Imagine we keep the state vector of the last k values, Y_n = (x_n, …, x_{n−k+1}), filled by the initial conditions at the beginning of the process. In the next step we update the state vector to Y_{n+1} = (x_{n+1}, x_n, …, x_{n−k+2}), where the first entry x_{n+1} = a_1 x_n + ⋯ + a_k x_{n−k+1} is computed by means of the homogeneous difference equation, while the other entries are just a shift by one position, with the last one forgotten. The corresponding square matrix of order k satisfying Y_{n+1} = A·Y_n is as follows:

A = [ a_1  a_2  ⋯  a_{k−1}  a_k ]
    [ 1    0    ⋯  0        0   ]
    [ 0    1    ⋯  0        0   ]
    [ ⋮         ⋱           ⋮   ]
    [ 0    0    ⋯  1        0   ].

A while ago, we derived an explicit procedure leading to the complete formula for the solution of such an iterated process with this special type of matrix. In general, this will not be easy even for very similar systems. A typical case is the study of the dynamics of populations in biological systems, which we discuss below.

The characteristic polynomial |A − λE| of our matrix is

p(λ) = (−1)^k (λ^k − a_1 λ^{k−1} − ⋯ − a_k),

as we can check directly, or by expanding the last column and employing induction on k. Thus the eigenvalues are exactly the roots λ of the characteristic polynomial of the linear recurrence. We should have expected this, because having a nonzero solution x_n = λ^n of the linear recurrence means that the matrix A must map (λ^k, …, λ)^T to its λ-multiple. Thus every such λ must be an eigenvalue of the matrix A.
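To see the companion matrix at work, the following small Sage cell (our own illustration) takes the Fibonacci-type coefficients a_1 = a_2 = 1, checks the characteristic polynomial, and iterates the state vector Y_n a few times.

A = matrix(QQ, [[1, 1], [1, 0]])   # companion matrix of x_{n+1} = x_n + x_{n-1}
print(A.charpoly())                # x^2 - x - 1, as predicted
Y = vector(QQ, [1, 1])             # initial state (x_1, x_0)
for k in range(5):
    Y = A * Y                      # one iteration of the linear process
    print(Y)

The printed vectors (2, 1), (3, 2), (5, 3), (8, 5), (13, 8) reproduce consecutive pairs of Fibonacci numbers.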
3.3.2. Leslie model for population growth. Imagine that we are dealing with some system of individuals (cattle, insects, cell cultures, etc.) divided into m groups (according to their age, evolution stage, etc.). The state X_n is thus given by the vector

X_n = (u_1, …, u_m)^T

depending on the time t_n at which we are observing the system. A linear model of the evolution of such a system is then given by an m×m matrix A, which gives the change of the vector X_n into X_{n+1} = A·X_n when the time changes from t_n to t_{n+1}.

month   Alice        Bob
1st     10.00 €      20.00 €
2nd     20.00 €      25.00 €
3rd     32.50 €      35.00 €
4th     50.00 €      51.25 €
5th     75.62 €      76.25 €
6th     113.75 €     114.06 €
7th     170.78 €     170.94 €
8th     256.25 €     256.33 €
9th     384.41 €     384.45 €
10th    576.64 €     576.66 €
11th    864.97 €     864.98 €
12th    1297.46 €    1297.47 €

The dominant chord of the next series of problems is the description of suitable frameworks for investigating age-structured population dynamics from a linear perspective based on discrete variables. First we focus on the "Leslie growth model", which effectively describes the growth of age-structured populations and is therefore very popular in population ecology. Recall that this model is written as p_{n+1} = A·p_n, where A is the Leslie matrix and p_n = (p_n^1, …, p_n^m)^T is the population vector at time n, divided into m age classes. The relevant data for each age class consist of the reproduction rate and the rate of survival into the next age class. These vital rates form the essential part of A.

The matrix form makes the Leslie model flexible and mathematically very tractable. This is because the solutions of the Leslie model asymptotically exhibit exponential growth: the population p_n grows at an exponential rate determined by λ_1, so that p_n ≈ λ_1^n X_1. Here λ_1 is the dominant eigenvalue (also known as the "Perron–Frobenius eigenvalue", see 3.3.3) of A, and X_1 is the corresponding rescaled eigenvector. Furthermore, we can utilize X_1 to infer long-term trends of the age classes, leading to the determination of the stable age distribution. For a more detailed explanation and the proofs of these statements, we encourage the reader to refer to the theoretical section starting at 3.3.2.

3.C.3. Bushbabies and Leslie matrices. A small team of biologists studies a colony of bushbabies⁶ living in a specific region of South Africa. In the wild, these mammals typically have a lifespan of no more than four years. For research purposes, we can categorize their population into four distinct age classes, as follows:

class A: 0 to 1 year old
class B: 1 to 2 years old
class C: 2 to 3 years old
class D: 3 to 4 years old

The biologists determined that the breeding rates b_i for the four age classes in this colony are as follows:⁷ b_1 = 0, b_2 = 1, b_3 = 2 and b_4 = 1, respectively.⁸

⁶ Also known as galagos or little night monkeys.
⁷ Breeding rates are also called fertility rates, and it is natural to assume that they are non-negative real numbers.
⁸ Galagos usually give birth to a single offspring at their first pregnancy, and then produce twins in subsequent litters. In particular, they may give birth to two sets of twins a year. (Source: https://animaldiversity.org/, https://africageographic.com)

As an example, we consider the Leslie model for population growth. Here the matrix is

A = [ f_1  f_2  f_3  ⋯  f_{m−1}  f_m ]
    [ τ_1  0    0    ⋯  0        0   ]
    [ 0    τ_2  0    ⋯  0        0   ]
    [ 0    0    τ_3  ⋯  0        0   ]
    [ ⋮              ⋱           ⋮   ]
    [ 0    0    0    ⋯  τ_{m−1}  0   ],
whose parameters are tied to the evolution of a population divided into m age groups: f_i denotes the relative fertility of the corresponding age group (in the observed time shift, N individuals in the i-th group give rise to f_i·N new ones, which belong to the first group), while τ_i is the relative survival rate, that is, the fraction of the i-th group which passes to the (i+1)-st group in one time interval. Clearly such a model can be used with any number of age groups. All coefficients are thus non-negative real numbers, and the numbers τ_i lie between zero and one.

Note that when all τ_i equal one, the model is actually a linear recurrence with constant coefficients, and thus exhibits either exponential growth or decay (for real roots λ of the characteristic polynomial) or oscillation connected with potential growth or decay (for complex roots).

Before we introduce a more general theory, we consider this specific model in more detail. Direct computation with the Laplace expansion along the last column yields the characteristic polynomial p_m(λ) of the matrix A for the model with m groups:

p_m(λ) = −λ·p_{m−1}(λ) + (−1)^{m−1} f_m τ_1 ⋯ τ_{m−1}.

By induction we derive that this characteristic polynomial is of the form

p_m(λ) = (−1)^m (λ^m − a_1 λ^{m−1} − ⋯ − a_{m−1} λ − a_m).

The coefficients a_1, …, a_m are all positive if all the parameters τ_i and f_i are positive; in particular, a_m = f_m τ_1 ⋯ τ_{m−1}.

Consider the distribution of the roots of the polynomial p_m(λ). We write the characteristic polynomial in the form

p_m(λ) = ±λ^m (1 − q(λ)),

where q(λ) = a_1 λ^{−1} + ⋯ + a_m λ^{−m} is a strictly decreasing non-negative function for λ > 0. For positive but very small λ the value of q is arbitrarily large, while for large λ it is arbitrarily close to zero. Thus there exists exactly one positive λ for which q(λ) = 1, and hence also p_m(λ) = 0.² In other words, for every Leslie matrix (with all the parameters f_i and τ_i positive) there exists exactly one positive real eigenvalue.

For actual Leslie models of populations, a typical situation is that the unique positive real eigenvalue λ_1 is greater than or equal to one, while the absolute values of all the other eigenvalues are strictly less than one.

² Actually, we shall spend a lot of time in chapter 5 making such considerations precise, involving the convergence and continuity issues.

Therefore, except for the babies in age class A, all other (female) galagos in the colony can mate and produce offspring.⁹ The biologists also estimated the corresponding survival rates s_i for these four classes: s_1 = 0.4, s_2 = 0.5, s_3 = 0.2, and s_4 = 0. Notice that 0 ≤ s_i ≤ 1, as each number represents the probability of a bushbaby surviving from one age class to the next. Graphically, the vital rates and the interactions between the age classes are represented by the following directed graph:

⁹ Note that our focus is solely on the female population.

Suppose that the current age distribution of female bushbabies across the four age classes is given by the vector p_0 = (20, 20, 30, 10)^T. Apply the Leslie model to compute the population of babies and the total female population of the colony after one year. In addition, determine the population of female galagos in age classes C and D ten years later, and deduce that the total female population decreases. Finally, provide the long-term trends for the four age classes.

Solution. The associated Leslie matrix is given by

A = [ b_1  b_2  b_3  b_4 ]   [ 0    1    2    1 ]
    [ s_1  0    0    0   ] = [ 0.4  0    0    0 ]
    [ 0    s_2  0    0   ]   [ 0    0.5  0    0 ]
    [ 0    0    s_3  0   ]   [ 0    0    0.2  0 ].

Let us denote by A_t, …, D_t the numbers of females belonging to the age classes A, …, D, respectively, at time t.
Then the Leslie condition p_{t+1} = A·p_t gives

p_{t+1} = (A_{t+1}, B_{t+1}, C_{t+1}, D_{t+1})^T = (B_t + 2C_t + D_t, 0.4·A_t, 0.5·B_t, 0.2·C_t)^T.

Thus p_1 = (A_1, B_1, C_1, D_1)^T = (90, 8, 10, 6)^T, and so after one year the model predicts the existence of 90 babies and in total 114 female galagos (the sum A_1 + ⋯ + D_1). For the second task, we utilize the rule p_{10} = A·p_9 = A^{10}·p_0. Using the following cell in Sage

A = matrix(SR, [[0, 1, 2, 1], [0.4, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 0.2, 0]])
show(A**10)

If we begin with any state vector X given as a sum of eigenvectors, X = X_1 + ⋯ + X_m, with eigenvalues λ_i, then the iterations yield

A^k·X = λ_1^k X_1 + ⋯ + λ_m^k X_m.

Thus, under the assumption that |λ_i| < 1 for all i ≥ 2, all components in the eigensubspaces diminish very fast, except for the component λ_1^k X_1. The distribution of the population among the age groups therefore quickly approaches the ratios of the components of the eigenvector associated with the dominant eigenvalue λ_1.

As an example, consider the matrix below, whose individual coefficients are taken from a model for sheep breeding; that is, the values τ contain both natural deaths and the activities of breeders:

A = [ 0     0.2  0.8  0.6  0 ]
    [ 0.95  0    0    0    0 ]
    [ 0     0.8  0    0    0 ]
    [ 0     0    0.7  0    0 ]
    [ 0     0    0    0.6  0 ].

The eigenvalues are approximately

1.03, 0, −0.5, −0.27 + 0.74i, −0.27 − 0.74i,

with absolute values 1.03, 0, 0.5, 0.78, 0.78, and the eigenvector corresponding to the dominant eigenvalue is approximately X^T = (30, 27, 21, 14, 8). We have chosen the eigenvector whose coordinates sum to 100, so it directly gives the percentage distribution of the population.

Suppose instead that we wish for a constant population, with one-year-old sheep being removed for consumption. Then we need to ask how to decrease τ_2 so that the dominant eigenvalue becomes one. A direct check shows that the farmer could then eat about 10% more of the one-year-old sheep to keep the population constant.

3.3.3. Matrices with non-negative elements. Real matrices with no negative elements have very special properties and are very often present in practical models. Thus we introduce the Perron–Frobenius theory, which deals with such matrices. Actually, we show some results of Perron and omit the more general situations due to Frobenius.³

³ Oskar Perron and Ferdinand Georg Frobenius were two great German mathematicians at the turn of the 19th and 20th centuries. Even in this textbook we shall meet their names in Analysis, Number Theory, Algebra. Look up the index.

we obtain

A^{10} ≈ [ 0.195  0.465  0.457  0.204 ]
         [ 0.081  0.195  0.208  0.095 ]
         [ 0.047  0.102  0.099  0.044 ]
         [ 0.008  0.023  0.023  0.010 ].

Therefore, a further computation shows that

p_{10} = (A_{10}, …, D_{10})^T ≈ (28.98, 12.75, 6.44, 1.44)^T.

It follows that C_{10} ≈ 6.4 and D_{10} ≈ 1.4. We also deduce that after ten years the total population of female galagos will be significantly smaller, specifically fewer than 50 individuals. The eventual extinction of the population can be inferred from the dominant eigenvalue of A. To determine this, one must compute the eigenvalues of A and compare them accordingly. To compute the eigenvalues we will use Sage.
However, instead of employing the conventional method from Chapter 2 (see ??), we opt for a more convenient approach and obtain the eigenvalues numerically. This can be done using the command A_num.eigenvalues(), which applies to matrices in numerical form. After defining the matrix A, the process unfolds as follows:

A_num = A.change_ring(RDF); show(A_num)
eig_numeric = A_num.eigenvalues()
show(eig_numeric)

Verify that this block accurately computes the eigenvalues of A in a very convenient format. For a more comprehensive response, you can replace the final line with

print("Numerical eigenvalues:")
for eigenvalue in eig_numeric:
    print(eigenvalue.n(digits=10))

According to Sage there are two real eigenvalues, λ_1 ≈ 0.9347 and λ_2 ≈ −0.1121, and two complex ones, λ_{3,4} ≈ −0.4112 ± 0.4607i. Thus λ_1 is the dominant eigenvalue of A, and since λ_1 < 1, the colony will die out, with the ratio p_{t+1}/p_t tending to λ_1 ≈ 0.9347.

The answer to the final task relies on the eigenvector X̂_1 corresponding to λ_1. Adding to the initial cell the code

eigV = A_num.eigenvectors_right()
show(eigV)

we see that X̂_1 is (approximately) given by (0.8988, 0.3847, 0.2058, 0.04403)^T, but it is more convenient to rescale this vector as X̂_1 ≈ (1, 0.427, 0.228, 0.048)^T. The number 1.703 approximates the sum of its entries, so the normalized eigenvector associated with λ_1 has the form

X_1 = (1/1.703)·X̂_1 ≈ (0.587, 0.250, 0.133, 0.028)^T.

From this we obtain the long-term trends of the four age classes in the population: about 59 % babies and 25 % in age class B, while the share of female galagos older than two years is smaller than 16.2 %. □

3.C.4. Remarks. To illustrate the decline of the population in the previous example, we may present a graph where the population's fading is depicted for a period of fifty years. Included is also the corresponding graph for the babies' population.

Positive and primitive matrices

Definition. A positive matrix is a square matrix A all of whose elements a_ij are real and strictly positive. A primitive matrix is a square matrix A such that some power A^k, k ∈ N, is positive.

Recall that the spectral radius of a matrix A is the maximum of the absolute values of all (complex) eigenvalues of A. The spectral radius of a linear mapping on a (finite-dimensional) vector space coincides with the spectral radius of its matrix with respect to any basis. In the sequel, the norm of a matrix A ∈ R^{n²} or of a vector x ∈ R^n will mean the sum of the absolute values of all its elements. For a vector x we write |x| for its norm.

The following result is very useful and hopefully understandable, but the difficulty of its proof is rather untypical for this textbook. If you prefer, read just the theorem and skip the proof until later.

Perron Theorem

Theorem. If A is a primitive matrix with spectral radius λ ∈ R, then λ is a root of the characteristic polynomial of A with multiplicity one, and λ is strictly greater than the absolute value of any other eigenvalue of A. Furthermore, there exists an eigenvector x associated with λ such that all elements x_i of x are positive.

Proof. We shall present the proof only briefly, relying on intuition from elementary geometry as well as on some results presented much later in this book. Notice that the matrices A and A^k share the same eigenvectors, while the corresponding eigenvalues are λ and λ^k respectively. Thus the assertion of the theorem holds for A if and only if the same is true for A^k.
In particular, we may assume without loss of generality that the matrix A itself is positive. Many of the necessary concepts and properties will be discussed in chapter four and in the subsequent chapters devoted to analytical aspects, so the reader might come back to this proof later.

The first step is to show the existence of an eigenvector with all elements positive. Consider the standard simplex

S = {x = (x_1, …, x_n)^T ; |x| = 1, x_i ≥ 0, i = 1, …, n}.

Since all elements of the matrix A are positive, the image A·x for x ∈ S has all coordinates positive too. The mapping x ↦ |A·x|^{−1}·(A·x) thus maps S to itself. This mapping S → S satisfies all the assumptions of the Brouwer fixed point theorem,⁴ and thus there exists a vector y ∈ S which is mapped by this mapping to itself. That means

A·y = λ·y,  λ = |A·y|,

and we have found an eigenvector lying in S. By assumption, A·y has all coordinates positive, so y must have the same property; moreover, λ > 0.

In order to prove the rest of the theorem, we consider the mapping given by the matrix A in a more suitable basis, in which the coordinates of the eigenvector become (λ, …, λ). Moreover, we multiply the mapping by the constant λ^{−1}.

⁴ This theorem is a great example of a blend of (homological) Algebra, (differential) Topology and Analysis. We shall discuss it in Chapter 9, cf. 9.1.16 on page 822.

To obtain the graph, each plot can be produced separately by a command of the following type (we present the code for the first 5 years of the total population; the babies' population is treated similarly):

list_plot({0: 80, 1: 114, 2: 76, 3: 78.4, 4: 79.2, 5: 66.32},
    plotjoined=True, color="dodgerblue") \
    + text(r"total population", (22, 34), color="black")

In this block the numbers 0, 1, 2, … represent the years, and the numbers 80, 114, 76, … represent the size of the female population in the corresponding year. The female population in the k-th year is given by m_k = A_k + B_k + C_k + D_k. Thus we need to compute the matrices A, A², …, A^{50} and then apply the Leslie condition p_k = A^k·p_0 for any k with 0 ≤ k ≤ 50; recall that p_k = (A_k, B_k, C_k, D_k)^T. Finally, to join the two plots in one figure one can add in the previous cell the code g1 = list_plot({0: 80, …}), g2 = list_plot({0: 20, …}), and then use the command (g1 + g2).show().

3.C.5. Show that the result of 3.C.3 about the decline of the population changes if we assume that galagos of age class B give birth to two female offspring instead of one, while the rest of the given data remains the same. What are the long-term trends of the four age classes in this case, and what is the total population of the colony after ten years, if we assume that for any female bushbaby there exists a male one? Present a graph illustrating the increase of the female population over a period of thirty years. ⃝

In the study of age-structured populations, an interesting question is how to stabilize the population under examination, so that it is neither growing nor declining. For the Leslie model this is the case if and only if the dominant eigenvalue of the associated Leslie matrix equals one, λ_1 = 1. Below we describe this situation in two cases: one less realistic, involving the population of female galagos analyzed in 3.C.3, and one more realistic, involving a population of fish in a restricted environment, such as an artificial pond.
Thus we work with the matrix B = λ^{−1}·(Y^{−1}·A·Y), where Y is the diagonal matrix with the coordinates y_i of the above eigenvector y on its diagonal. Evidently B is also a positive matrix. By construction, the vector z = (1, …, 1)^T is its eigenvector with eigenvalue 1, because Y·z = y. It remains to prove that µ = 1 is a simple root of the characteristic polynomial of the matrix B and that all other roots have absolute value strictly smaller than one. Then the proof of the Perron theorem is finished. For this we use an auxiliary lemma, which is discussed below.

Consider for the moment the matrix B as defining the linear mapping ψ that maps row vectors,

u = (u_1, …, u_n) ↦ u·B = v,

that is, using multiplication from the right (i.e., B is viewed as the matrix of a linear map on one-forms). Since z = (1, …, 1)^T is an eigenvector of the matrix B (with eigenvalue 1), the sum of the coordinates of the row vector v = u·B is

u·B·(1, …, 1)^T = Σ_{i,j=1}^n u_i b_ij = Σ_{i=1}^n u_i = 1

whenever u ∈ S. Therefore ψ maps the simplex S onto itself, and thus has a (row) eigenvector w in S with eigenvalue one (a fixed point, by the Brouwer theorem again). Because some power of B is positive by our assumption, the image of the simplex S under this power lies inside S.

We continue with the row vectors. Denote by P the shift of the simplex S into the origin by the eigenvector w we have just found, that is, P = −w + S. Evidently P is a set containing the origin and defined by linear inequalities. Clearly ψ(P) = −w + ψ(S) ⊂ P, and thus the vector subspace V ⊂ R^n generated by P is invariant with respect to the action of the matrix B through multiplication of row vectors from the right. Moreover, if ψ(p) ∈ P sits on the boundary of P, then ψ(p + w) is on the boundary of S. Hence the restriction of our mapping to V, together with P itself, satisfies the assumptions of the auxiliary lemma discussed below, and thus all its eigenvalues are strictly smaller than one. Now the entire space decomposes as the sum R^n = V ⊕ span{w} of invariant subspaces, where w is the eigenvector with eigenvalue 1, while all eigenvalues of the restriction to V are strictly smaller in absolute value.

3.C.6. Stabilizing the size of the bushbabies' colony. Let us return to the colony of bushbabies studied in 3.C.3 and determine the adjustments of the age-dependent vital rates that are required to stabilize the population size. By age-dependent vital rates we mean the fertility rates b_1, …, b_4 and the survival rates s_1, …, s_4.

Solution. Since we are using the Leslie model, the stabilization process is governed by the condition λ_1 = 1, where λ_1 is the dominant eigenvalue of the corresponding Leslie matrix. This condition is independent of the initial population size. We consider the following possibilities (though there may be others):

1. Increase s := s_1, that is, the survival rate of babies. Hence we use the Leslie matrix

A = [ 0    1    2    1 ]
    [ s    0    0    0 ]
    [ 0    0.5  0    0 ]
    [ 0    0    0.2  0 ]

for some s with 0 ≤ s ≤ 1. Since we require λ_1 = 1, to determine the survival rate s we need to compute the determinant of A − E, where E is the identity matrix. For this one can use Sage, as follows:

s = var("s")
A = matrix(SR, 4, 4, [0, 1, 2, 1, s, 0, 0, 0, 0, 0.5, 0, 0, 0, 0, 0.2, 0])
E = identity_matrix(4)
(A-E).det()

which gives -2.1*s + 1. Thus the equation det(A − E) = 0 has the solution s = 10/21 ≈ 0.47619.
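Before drawing the conclusion, one may double-check (this cell is our own addition) that substituting s = 10/21 back into the Leslie matrix indeed produces the dominant eigenvalue 1:

A1 = matrix(RDF, [[0, 1, 2, 1], [10/21, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 0.2, 0]])
# the spectral radius should be (numerically) equal to 1
max(abs(ev) for ev in A1.eigenvalues())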
Hence we need to increase the survival rate s = s_1 from its initial value 0.4 to 10/21 ≈ 0.47619 in order to stabilize the size of the female population of bushbabies (under the assumption that the other vital rates remain as in 3.C.3).

2. Change the fertility rates b_i, i = 2, 3, 4, in a proportional way. This means that we use the Leslie matrix

A = [ 0    s    2s   s ]
    [ 0.4  0    0    0 ]
    [ 0    0.5  0    0 ]
    [ 0    0    0.2  0 ]

for some s > 0, which we specify via the constraint λ_1 = 1. Using Sage we get det(A − E) = −0.84s + 1, and the equation det(A − E) = 0 gives s = 25/21 ≈ 1.19048. Therefore, by setting b_2 = b_4 = 1.19048 and b_3 = 2.38095, while maintaining the survival rates s_i as in 3.C.3, the population is stabilized. □

3.C.7. Population of fish in an artificial pond. Suppose we have a simple model of a pond where a population of fish lives (e.g., bleak, vimba, or nase). Assume that 20% of the babies survive their second year and, from that age on, they are able to reproduce. It is known that approximately 60% of these young fish survive the third year, and in the following years mortality can be ignored. We also assume that the birth rate is three times the number of fish that can reproduce.

The theorem is nearly proved. It remains to deal with the problem that the mapping in question was given by multiplication of row vectors by the matrix B from the right, while originally we were interested in the mapping given by the matrix B with multiplication of column vectors from the left. But the latter is equivalent to the usual multiplication, from the left, of the transposed column vectors by the transposed matrix B. Thus we have proved the claim about the eigenvalues for the transpose of B. But transposing does not change the eigenvalues, and so the proof is complete. □

A bounded polyhedron in R^n is a nonempty subset defined by linear inequalities, sitting in some large enough ball. The simplex S from the proof, or any translation of it, are examples. We shall provide a concise explanation of all these concepts in chapter 4.

Lemma. Consider any bounded polyhedron P ⊂ R^n containing a ball around the origin 0 ∈ R^n. If some iteration of the linear mapping ψ : R^n → R^n maps P into its interior (that is, ψ(P) ⊂ P and the image does not intersect the boundary), then the spectral radius of the mapping ψ is strictly less than one.

Proof. Consider the matrix A of the mapping ψ in the standard basis. Because the eigenvalues of A^k are the k-th powers of the eigenvalues of the matrix A, we may assume (without loss of generality) that the mapping ψ already maps P into P. Clearly ψ cannot have any eigenvalue λ with absolute value greater than one. This is easy to see if the eigenvalue is real. In the complex case, there is a 2-dimensional invariant plane in which ψ acts as multiplication by |λ| composed with a rotation, so |λ| > 1 is again in conflict with the invariance of P.

Next, we argue by contradiction and assume that there exists an eigenvalue λ with |λ| = 1. Then there are two possibilities: either λ^k = 1 for a suitable positive integer k, or there is no such k. The image of P is a closed set (this means that if the points in the image ψ(P) get arbitrarily close to some point y in R^n, then the point y is also in the image — a general feature of linear maps on finite-dimensional vector spaces). By our assumption, the boundary of P does not intersect the image.
Thus ψ cannot have a fixed point on the boundary, and there cannot even be any point on the boundary to which some sequence of points in the image would converge. The first argument excludes the possibility that some power of λ equals one, because such a fixed point of ψ^k on the boundary of P would then exist, and it would lie in the image. In the remaining case, there would be a two-dimensional subspace W ⊂ R^n on which the restriction of ψ acts as a rotation by an irrational angle, and thus there exists a point y in the intersection of W with the boundary of P. But then the point y could be approached arbitrarily closely by points of the set of iterates ψ^k(y), and thus it would have to be in the image too. This is a contradiction, and thus the lemma is proved. □

Clearly, such a population would fill the pond very quickly. To maintain a balance, we need to introduce a predator, such as esox. Assume that an esox eats approximately 500 mature fish per year. How many predators of this type should be introduced into the pond to keep the population constant?

Solution. There are three age classes: babies, young fish, and adult fish. Let p be the number of babies, m the number of young fish, and r the number of adult fish. In terms of vectors, the state of the population in the next year is given by

(p, m, r)^T ↦ (3m + 3r, 0.2·p, 0.6·m + y·r)^T.

The relative mortality of the adult fish caused by the predators equals the difference 1 − y, and we should specify the unknown y. This model is described by a generalized Leslie matrix of the form

A = [ 0    3    3 ]
    [ 0.2  0    0 ]
    [ 0    0.6  y ].

To stabilize the population, one of the eigenvalues of this matrix should be equal to 1, and this gives us the desired value of the unknown y. As before, in Sage we may type

y = var("y")
A = matrix(SR, 3, 3, [0, 3, 3, 0.2, 0, 0, 0, 0.6, y])
I3 = matrix(RR, [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
(A-I3).det()

This prints out the expression 0.4*y - 0.04, with the obvious solution y = 0.1. This means that in the next year only 10 % of the adult fish are allowed to survive, and the rest should be eaten by the predators. Let us denote the number of predators by x. Together they eat 500x fish, which according to the previous computation should be 0.9r. Consequently, the ratio of the number of fish to the number of predators is given by

r/x = 500/0.9,

that is, one esox per (approximately) 556 adult fish. □

3.C.8. Predators and prey. In a population model, let D_k be the number of predators and K_k the number of prey in month k. The relation between month k and month k + 1 is given by one of the following three linear systems:

(a) D_{k+1} = 0.6·D_k + 0.5·K_k,  K_{k+1} = −0.16·D_k + 1.2·K_k,
(b) D_{k+1} = 0.6·D_k + 0.5·K_k,  K_{k+1} = −0.175·D_k + 1.2·K_k,
(c) D_{k+1} = 0.6·D_k + 0.5·K_k,  K_{k+1} = −0.135·D_k + 1.2·K_k.

Analyse the behaviour of this model for large time values.

Solution. We can encode all three cases as

(D_k, K_k)^T = T_a · (D_{k−1}, K_{k−1})^T,  k ∈ N,

3.3.4. Simple corollaries. Once we know the Perron theorem, the following very useful claim has a surprisingly simple proof. It shows how strong the primitivity assumption on a matrix is.

Corollary. If A = (a_ij) is a primitive matrix and x ∈ R^n is its eigenvector with all coordinates non-negative and eigenvalue λ, then λ > 0 is the spectral radius of A. Moreover,

min_{j∈{1,…,n}} Σ_{i=1}^n a_ij ≤ λ ≤ max_{j∈{1,…,n}} Σ_{i=1}^n a_ij.

Proof. Because A is primitive, we can choose k such that A^k has only positive elements.
Then Ak · x = λk x is a vector with all coordinates strictly positive, so obviously λ > 0. According to the Perron theorem, the spectral radius µ of A is an eigenvalue, and the associated eigenvectors y have only positive coordinates. Thus we may choose such an eigenvector y with the property that the difference x − y has only strictly positive coordinates. Then for all large positive integer powers m we have 0 < Am · (x − y) = λm x − µm y, while λ ≤ µ since µ is the spectral radius. If µ = λ + α with α > 0, then
\[
0 < \lambda^m x - (\lambda + \alpha)^m y < \lambda^m \Big( x - y - \frac{m\alpha}{\lambda}\, y \Big),
\]
which is clearly negative for m large enough. Hence λ = µ.

It remains to estimate the spectral radius using the minimum and maximum of the column sums of the matrix; denote them by bmin and bmax. Choose x to be the eigenvector with the sum of coordinates equal to one and compute:
\[
\lambda = \sum_{i=1}^n \lambda x_i = \sum_{i,j=1}^n a_{ij}x_j = \sum_{j=1}^n \Big( \sum_{i=1}^n a_{ij} \Big) x_j \le \sum_{j=1}^n b_{\max} x_j = b_{\max},
\]
\[
\lambda = \sum_{j=1}^n \Big( \sum_{i=1}^n a_{ij} \Big) x_j \ge \sum_{j=1}^n b_{\min} x_j = b_{\min}. \qquad \square
\]

Note that, for instance, all Leslie matrices from 3.3.2 are primitive as soon as all their parameters fi and τj are strictly positive. Thus we can apply the just derived results to them. (Compare this with the ad hoc analysis of the roots of the characteristic polynomial in 3.3.2.)

3.3.5. Markov chains. A very frequent and interesting case of linear processes with only non-negative elements in the matrix is the mathematical model of a system which can be in one of m states with various probabilities. At a given point of time, the system is in state i with probability xi. The transition from state j to state i happens with probability tij. We can write the process as follows: at time n the system is described by the stochastic vector (we also say probability vector) xn = (u1(n), . . . , um(n))T,

where
\[
T_a := \begin{pmatrix} 0.6 & 0.5 \\ -a & 1.2 \end{pmatrix}
\]
and a ∈ {0.16, 0.175, 0.135}. The coefficient a represents the average number of prey killed by one predator per month. (The value of a does not depend on the size of the population; in fact, for stable populations one can estimate the size of a. As a consequence, observe that a small change of a can lead to different conclusions.) It follows that
\[
\begin{pmatrix} D_k \\ K_k \end{pmatrix} = T_a^k \cdot \begin{pmatrix} D_0 \\ K_0 \end{pmatrix}, \quad k \in \mathbb{N}.
\]
Using the powers of the matrix Ta we can determine the evolution of the populations of predators and prey after a very long time. For this procedure it is useful to summarize the eigenvalues and eigenvectors of Ta in a table:

            λ1      λ2      X1         X2
a = 0.16    1       4/5     (5, 4)T    (5, 2)T
a = 0.175   19/20   17/20   (10, 7)T   (2, 1)T
a = 0.135   21/20   3/4     (10, 9)T   (10, 3)T

To compute Tak we may simply proceed by hand. As we know, if X = (X1 X2) is the matrix formed by the eigenvectors, then
\[
T_a^k = X \cdot \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}^{\!k} \cdot X^{-1}, \quad k \in \mathbb{N}.
\]
For instance, for a = 0.16 we have
\[
X = \begin{pmatrix} 5 & 5 \\ 4 & 2 \end{pmatrix}, \qquad X^{-1} = \begin{pmatrix} -1/5 & 1/2 \\ 2/5 & -1/2 \end{pmatrix}, \qquad \begin{pmatrix} 1 & 0 \\ 0 & 0.8 \end{pmatrix}^{\!k} \approx \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\]
for large k. Thus, for large time values we get
\[
T_{0.16}^k \approx \begin{pmatrix} -1 & 5/2 \\ -4/5 & 2 \end{pmatrix}.
\]
In a similar way, for large k one computes
\[
T_{0.175}^k \approx \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad T_{0.135}^k \approx 1.05^k \begin{pmatrix} -1/2 & 5/3 \\ -9/20 & 3/2 \end{pmatrix}.
\]
Knowing Tak, we can describe the systems for large time values:
\[
\begin{pmatrix} D_k \\ K_k \end{pmatrix} \approx \frac{1}{10}\begin{pmatrix} 5(-2D_0 + 5K_0) \\ 4(-2D_0 + 5K_0) \end{pmatrix}, \qquad \begin{pmatrix} D_k \\ K_k \end{pmatrix} \approx \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad \begin{pmatrix} D_k \\ K_k \end{pmatrix} \approx \frac{1.05^k}{60}\begin{pmatrix} 10(-3D_0 + 10K_0) \\ 9(-3D_0 + 10K_0) \end{pmatrix},
\]
for a = 0.16, a = 0.175 and a = 0.135, respectively. A qualitative interpretation of these relations produces valuable conclusions for the future of the populations. In particular:
(a) If 2D0 < 5K0, the sizes of both populations stabilise at non-zero values; if 2D0 ≥ 5K0, both populations die out.
(b) Both populations die out.
(c) For 3D0 < 10K0 there is a population boom of both kinds; for 3D0 ≥ 10K0 both populations die out.
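The eigen-data in the table above are easy to confirm in Sage; the following loop is our own verification sketch, using exact rational entries:

for a in [16/100, 175/1000, 135/1000]:
    T = matrix(QQ, 2, 2, [3/5, 1/2, -a, 6/5])
    print(a, T.eigenvalues())
    print(T.eigenvectors_right())

Sage normalizes the eigenvectors differently (with first coordinate 1), but they span the same lines as the columns X1, X2 given in the table.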
Note that there are many models that effectively describe predator–prey population dynamics. These models can be either discrete iterative models or continuous models based on differential equations; a prominent example is the Lotka–Volterra model, see 8.3.10.

This means that all components of the vector x are real non-negative numbers and their sum equals one. The components give the probability distribution of the individual possible states of the system. The distribution of the probabilities at time n + 1 is given by multiplication by the transition matrix T = (tij), that is, xn+1 = T · xn. Since we assume that the vector x captures all possible states of the system and moves again to some of these states with total probability one, all columns of T are also stochastic vectors. We call such matrices stochastic matrices. Note that every stochastic matrix maps every stochastic vector x to a stochastic vector Tx:
\[
\sum_{i,j} t_{ij}x_j = \sum_j \Big(\sum_i t_{ij}\Big)x_j = \sum_j x_j = 1 .
\]
Such a sequence xn+1 = Txn is called a (discrete) Markov process, and the resulting sequence of vectors x0, x1, . . . is called a Markov chain. Now we can exploit the Perron–Frobenius theory in its full power. Because the sum of the rows of the matrix is always equal to the vector (1, . . . , 1), the matrix T − E is singular, and thus one is an eigenvalue of the matrix T. Furthermore, if T is a primitive matrix (for instance, when all its elements are non-zero), we know from the corollary in 3.3.4 that one is a simple root of the characteristic polynomial and all the other roots have absolute value strictly smaller than one. This leads to:

Ergodic Theorem
Theorem. Markov processes with primitive matrices T satisfy:
• there exists a unique eigenvector x∞ of the matrix T with the eigenvalue 1 which is stochastic;
• the iterations Tk x0 approach the vector x∞ for any initial stochastic vector x0.

Proof. The first claim follows directly from the positivity of the coordinates of the eigenvector derived in the Perron theorem (notice the dominant eigenvalue comes with multiplicity one). Next, assume that the algebraic and geometric multiplicities of the eigenvalues of the matrix T coincide. Then every stochastic vector x0 can be written (in the complex extension Cn) as a linear combination x0 = c1x∞ + c2y2 + · · · + cnyn, where y2, . . . , yn extend x∞ to a basis of eigenvectors. The k-th iteration then gives again a stochastic vector
\[
x_k = T^k \cdot x_0 = c_1 x_\infty + \lambda_2^k c_2 y_2 + \cdots + \lambda_n^k c_n y_n .
\]
All the eigenvalues λ2, . . . , λn are strictly smaller than one in absolute value, so all components of the vector xk but the first one approach zero (in norm).

Further exercises related to the Leslie model and population growth are presented in Section F, see for example the tasks 3.F.12, 3.F.13 and 3.F.15. □

In the remainder of this section, we delve into the discrete "Markov process", a topic where matrix calculus intersects with probability theory. For a more theoretical treatment, see 3.3.5 and 3.4.7. We begin with a brief introduction to stochastic processes, a concept that will be extensively analyzed in Chapter 10. Here, we primarily focus on "discrete stochastic processes" (cf. 10.2.14).
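Before turning to formal definitions, the reader may enjoy generating a first example in Sage; the following sketch (entirely our own illustration) produces one realization of a simple random walk, one of the most basic discrete stochastic processes:

import random
random.seed(0)                 # reproducible illustration
X = [0]                        # the state X_0
for n in range(10):
    X.append(X[-1] + random.choice([-1, 1]))
print(X)                       # one sample path X_0, ..., X_10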
Recall that a "random variable" in the context of a random experiment with sample space Ω is an R-valued function X : Ω → R satisfying the property that {s : X(s) ∈ I} is an event for any interval I ⊆ R; for further details see 10.2.10. In simpler terms, a random variable assigns a real number to each possible outcome of a random experiment. A random variable X that can take at most countably many values is termed discrete. For instance, the number of typographical errors on a page of a book is a discrete random variable.

A stochastic (or random) process is essentially a sequence of random variables {X(t) ≡ Xt}, t = 0, 1, . . .. Such processes depict the evolution over time of a random phenomenon, thereby encapsulating the concept of probabilistic dynamics. In the following we are only interested in discrete random variables and, consequently, in discrete stochastic processes.

3.C.9. Describe examples of discrete stochastic processes.

Solution. There is a plethora of such examples. Let Xn be the number of customers served in a bank, or the number of product sales in a market, at the end of the nth working day. Then {Xn : n = 0, 1, . . .} is a discrete stochastic process. Another example occurs when Xn represents the number of arrivals at an emergency room between midnight and 8.00 am. Or Xn can represent the number of books sold in a bookstore, the number of people who attend a concert, etc. □

The values that a discrete random variable Xn takes are known as the states of the system at step n. We can denote these states by integers 0, 1, 2, . . ., or by symbols such as s0, s1, and so on. Given that Xn−1 = j and Xn = i, the system is said to have made a "transition" from state sj to state si. Such a sequence of transitions is known as a chain. "Discrete Markov processes", also known as "Markov chains", are stochastic processes with a finite number of possible states, in which memory effects are strongly limited: the probability of having Xn = i depends only on the immediately preceding state of the system. Next we will assume time-homogeneity, meaning that the transition probabilities P(Xn = i|Xn−1 = j) are constant over time and do not depend on n. This is a common assumption in many Markov chain models.

But xk is still stochastic, thus the only possibility is that c1 = 1, and the second claim is proved. In fact, even if the algebraic and geometric multiplicities of the eigenvalues do not coincide, we reach the same conclusion using a more detailed study of the root subspaces of the matrix T. (We meet them when discussing the Jordan matrix decomposition later in this chapter.) Consequently, even in the general case the eigensubspace span{x∞} comes with a unique invariant (n − 1)-dimensional complement, on which all the eigenvalues are smaller than one in absolute value, and the corresponding components of xk approach zero as before. See the note 3.4.11 where we finish this argument in detail. □

3.3.6. Iteration of stochastic matrices. We reformulate the previous theorem into a simple but surprising result. By convergence to a limit matrix in the following theorem we mean: given any error bound ε > 0, we can find a number of iterations k after which all the components of the matrix differ from the limit ones by less than ε.

Corollary. Let T be a primitive stochastic matrix of a Markov process and let x∞ be the stochastic eigenvector for the dominant eigenvalue 1 (as in the Ergodic Theorem above). Then the iterations Tk converge to the limit matrix T∞, whose columns all equal x∞.

Proof.
The columns of the matrix Tk are the images of the standard basis vectors under the corresponding iterated linear mapping. But these are images of stochastic vectors, and thus they all converge to x∞. □

3.3.7. Final brief remark. Before leaving the Markov processes, we briefly mention their more general versions with matrices which are not primitive. Here we would need the full Frobenius–Perron theory. Without going into technicalities, consider a process with a blockwise diagonal or upper triangular matrix T,
\[
T = \begin{pmatrix} P & R \\ 0 & Q \end{pmatrix},
\]
and imagine first that P, Q are primitive and R = 0. Here we can again apply the above results blockwise. In words, if we start in a state x0 with all probability concentrated in the first block of coordinates, the process converges to a value x∞ which again has all the probability distributed among the first block of coordinates, and the same for the other block. If R > 0, then we can always jump to the states corresponding to the first block from those in the second block with non-zero probability, and the iterations get more complicated:
\[
T^2 = \begin{pmatrix} P^2 & P\cdot R + R\cdot Q \\ 0 & Q^2 \end{pmatrix}, \qquad
T^3 = \begin{pmatrix} P^3 & P^2\cdot R + P\cdot R\cdot Q + R\cdot Q^2 \\ 0 & Q^3 \end{pmatrix}.
\]

The transition from state sj to si, or simply from j to i, is encoded by the transition probability tij = P(Xn = i|Xn−1 = j); compare with 3.3.5. Thus, the dynamics of a discrete-time homogeneous Markov chain with a state space consisting of m states is described by the m × m matrix T = (tij). This is known as the transition matrix of the Markov process, from the (n − 1)th step to the nth step. In our context, T is a column-stochastic matrix, meaning that tij ≥ 0 and ∑i tij = 1. In other words, each column of T is a stochastic vector (it has non-negative real entries summing to one, see also 3.3.5).

3.C.10. Two-state Markov chains are Markov chains having a state space consisting of two elements, typically denoted by S = {0, 1}. They can be used to model various situations, such as the operational status of everyday machines where the probability of a machine being out of operation the next day is known. For example, consider a traffic light on a road in New York. Suppose the probability that a working traffic light will be out of order the next day is p, while the probability that an out-of-order traffic light will start operating the next day is q. Demonstrate that this scenario constitutes a Markov process and determine its transition matrix.

Solution. What happens on each day depends only on the previous day, hence we have a two-state (homogeneous) Markov process {Xn}, where Xn is the state of the traffic light on the nth day. By definition, Xn = 0 if the traffic light is out of order on the nth day, and Xn = 1 otherwise. Thus one computes
t00 = P(Xn = 0|Xn−1 = 0) = 1 − q,
t01 = P(Xn = 0|Xn−1 = 1) = p,
t10 = P(Xn = 1|Xn−1 = 0) = q,
t11 = P(Xn = 1|Xn−1 = 1) = 1 − p.
This means that
\[
T = \begin{pmatrix} t_{00} & t_{01} \\ t_{10} & t_{11} \end{pmatrix} = \begin{pmatrix} 1-q & p \\ q & 1-p \end{pmatrix}. \qquad \square
\]

3.C.11. Remarks. a) Markov chains on finite state spaces can alternatively be represented graphically by a "transition diagram". This diagram is a directed graph in which each vertex represents a state of the chain, and a directed edge is drawn from vertex j to vertex i, labeled with the probability tij, whenever tij > 0.
For example, in the case of a two-state Markov process as described earlier, the transition diagram consists of two vertices 0 and 1, with a loop at 0 labeled t00 = 1 − q, a loop at 1 labeled t11 = 1 − p, an edge from 0 to 1 labeled t10 = q, and an edge from 1 to 0 labeled t01 = p. This graphical representation is particularly useful for visualizing the structure of the Markov chain, understanding the possible state transitions, and analyzing the overall behavior of the chain in terms of state probabilities and sequences of transitions; see 3.C.12 for another example.

An interesting special case is when P = E and R is positive. Then Q − E must be a regular matrix, and a simple computation yields the general iteration (notice that E and Q commute and thus (E − Q)(E + Q + · · · + Qk−1) = E − Qk):
\[
T^k = \begin{pmatrix} E & R(E - Q)^{-1}(E - Q^k) \\ 0 & Q^k \end{pmatrix}.
\]
Thus the entire first block of states is formed by eigenvectors with eigenvalue 1 (so these states stay constant with probability 1), while the behavior on the other block is more complicated.

4. More matrix calculus

We have seen that understanding the inner structure of matrices is a strong tool for both computation and analysis. This is even more true when considering numerical calculations with matrices. Therefore we now return to the abstract theory. We introduce special types of linear mappings on vector spaces. We consider general linear mappings whose structure is understood in terms of the Jordan normal form (see 3.4.10). In all these cases, complex scalars are essential. So we extend our discussion of the scalar product (see 2.3.18–2.3.22) to complex vector spaces. Actually, in many areas complex vector spaces are the essential platform necessary for introducing the mathematical models. For instance, this is the case in so-called quantum computing, which has become a very active area of theoretical computer science. Many people hope to construct an effective quantum computer soon.

3.4.1. Unitary spaces and mappings. The definitions of scalar product and orthogonality easily extend to the complex case. But we do not mean the complex bilinear symmetric forms α, since there the quadratic expressions α(v, v) are not real in general, and thus we would not get the right definition of the length of vectors. Instead, we define:

Unitary spaces
A unitary space is a complex vector space V along with a mapping V × V → C, (u, v) → u · v, called the scalar product, satisfying for all vectors u, v, w ∈ V and scalars a ∈ C the following axioms:
(1) $u \cdot v = \overline{v \cdot u}$ (the bar stands for complex conjugation),
(2) (au) · v = a(u · v),
(3) (u + v) · w = u · w + v · w,
(4) if u ≠ 0, then u · u > 0 (notice u · u is always real).
The real number √(v · v) is called the norm of the vector v, and a vector is normalized if its norm equals one. Vectors u and v are said to be orthogonal if their scalar product is zero. A basis composed of mutually orthogonal and normalized vectors is called an orthonormal basis of V.

b) Be aware that other authors may define the transition matrix T = (tij) with tij being the probability of a transition from state i to state j, that is, tij = P(Xn = j|Xn−1 = i). This is the opposite of our convention, producing a row-stochastic matrix whose transpose is our transition matrix.
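To tie these conventions to concrete computations, here is a quick symbolic check in Sage (our own sketch, using the traffic-light matrix from 3.C.10; the variable names x0, x1 are ours): the columns of T sum to one, and the stationary vector is found by solving Tx = x together with the normalization.

p, q = var("p q")
T = matrix(SR, 2, 2, [1-q, p, q, 1-p])
print([sum(c) for c in T.columns()])    # [1, 1]: T is column-stochastic
x0, x1 = var("x0 x1")
solve([(1-q)*x0 + p*x1 == x0, x0 + x1 == 1], [x0, x1])
# [[x0 == p/(p + q), x1 == q/(p + q)]]

So in the long run the light is out of order with probability p/(p + q), in line with the Ergodic Theorem of 3.3.5.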
3.C.12. Absent-minded professor. An absent-minded professor always carries an umbrella with him, but each time he leaves a place, he forgets the umbrella there with probability 1/2. His daily routine is strict: in the morning he walks from his house to his office; from there he goes to a restaurant for lunch, then returns to his office, and finally, in the evening, he returns home. It is assumed that the professor does not visit other locations, and if he forgets his umbrella at the restaurant, it remains there until his next visit. This situation can be modeled as a Markov process. Determine its transition matrix and diagram, and calculate the probability that after many days, in the morning, the umbrella is at the restaurant.

Solution. We take as time unit one day, from morning to the next morning, and denote by Xn the place where the umbrella sits in the morning of the nth day. This defines a (homogeneous) Markov process {Xn : n ∈ N} with state space S = {s1 = house, s2 = office, s3 = restaurant}. For simplicity we represent these states as {1, 2, 3}. Let us proceed with the transition matrix T = (tij), where tij = P(Xn+1 = i|Xn = j) and 1 ≤ i, j ≤ 3. We explain the computations for the first column and leave the verification of the other two to the reader. By definition, t11 = P(Xn+1 = 1|Xn = 1), hence this is the probability that the umbrella starts its day at home and stays there till the next morning. There are three distinct possibilities p1, p2, p3 for this scenario:
p1: The umbrella stays at home in the morning. Thus p1 = 1/2.
p2: The umbrella arrives at the office and stays there during lunch, but in the evening it is taken back home. Thus p2 = 1/2 · 1/2 · 1/2 = 1/8.
p3: The professor carries the umbrella with him all the time and never forgets it. Thus p3 = 1/2 · 1/2 · 1/2 · 1/2 = 1/16.
In total, t11 = p1 + p2 + p3 = 11/16. Next, t21 = P(Xn+1 = 2|Xn = 1) is the probability that the umbrella starts its day at home and the next morning sits in the office. There are two possibilities q1, q2 for such a scenario:
q1: The umbrella arrives at the office, stays there during lunch, and in the evening it remains there again. Thus q1 = 1/2 · 1/2 · 1/2 = 1/8.
q2: The umbrella arrives at the office, then at the restaurant, then back at the office, and in the evening it remains at the office. Thus q2 = 1/2 · 1/2 · 1/2 · 1/2 = 1/16.
In total, t21 = 1/8 + 1/16 = 3/16. Finally, t31 = 1 − t11 − t21 = 1/8. Repeating this procedure for the remaining entries of T yields
\[
T = \begin{pmatrix} 11/16 & 3/8 & 1/4 \\ 3/16 & 3/8 & 1/4 \\ 1/8 & 1/4 & 1/2 \end{pmatrix}.
\]

At first sight this is an extension of the definition of Euclidean vector spaces into the complex domain. We will continue to use the alternative notation ⟨u, v⟩ for the scalar product of vectors u and v. As in the real domain, we obtain immediately from the definition the following simple properties of the scalar product, for all vectors in V and scalars in C:
u · u ∈ R,
u · u = 0 if and only if u = 0,
u · (av) = ā(u · v),
u · (v + w) = u · v + u · w,
u · 0 = 0 · u = 0,
$\big(\sum_i a_i u_i\big) \cdot \big(\sum_j b_j v_j\big) = \sum_{i,j} a_i \bar b_j (u_i \cdot v_j)$,
where the last equality holds for all finite linear combinations. It is a simple exercise to prove everything formally. For instance, the first property follows from (1), since the product u · u has to be the complex conjugate of itself. A standard example of the scalar product on the complex vector space Cn is
(x1, . . . , xn)T · (y1, . . . , yn)T = x1ȳ1 + · · · + xnȳn.
This expression is also called the standard (positive definite) Hermitian form on Cn.
Thanks to the conjugation of the coordinates of the second argument, this mapping satisfies all the required properties. The space Cn with this scalar product is called the standard unitary space of dimension n. In matrix notation, we can write this scalar product of vectors x and y as ȳT · x (here the complex conjugation indicated by the bar is performed on all components of y). As usual, the mappings which leave the additional structure invariant are of great importance.

Unitary mappings
A linear mapping φ : V → W between unitary spaces is called a unitary mapping if for all vectors u, v ∈ V,
u · v = φ(u) · φ(v).
A unitary isomorphism is a bijective unitary mapping.

3.4.2. Real and complex spaces with scalar product. In the previous chapter we already derived some simple properties of spaces with scalar products. The properties and proofs are very similar in the complex case. In the sequel we shall work with real and complex spaces simultaneously and write K for R or C. In the real case the conjugation is just the identity mapping (it is the restriction of the conjugation in the complex plane to the real line). As in the real case, we define the orthogonal complement of a vector subspace U ⊂ V in the unitary space V as
U⊥ = {v ∈ V ; u · v = 0 for all u ∈ U},
which is clearly also a vector subspace of V.

As for the transition graph, it has three vertices 1, 2, 3 (house, office, restaurant), a loop at each vertex labeled t11, t22, t33 respectively, and a pair of oppositely directed edges between every two vertices, labeled with the corresponding probabilities tij. For the final task, compute the eigenvector of T corresponding to the dominant eigenvalue 1. It is given by (y1 = 2, y2 = 1, y3 = 1)T, and we should rescale it to obtain the desired probability. This equals y3/(y1 + y2 + y3) = 1/4. □

Let {Xn : n ≥ 0} be a (homogeneous) Markov process with transition matrix T = (tij) and let x(n) be the probability distribution of the states at the nth transition, i.e., the vector whose ith component describes the probability of the system being in state i after n steps: x(n)i = P(Xn = i). We have ∑i x(n)i = 1, and it follows that each x(n) is a stochastic vector. Moreover, in terms of the matrix T we know that x(n+1) = Tx(n), which gives x(n) = Tn x(0). This makes the Markov process computationally very tractable, since once T is known, one can use matrix calculus to compute Tn. Let us describe such applications.

3.C.13. Experiment in a laboratory. In a laboratory, an experiment is carried out with equal probabilities of success and failure. If an experiment succeeds, the probability of success of the next experiment is 0.7. If it fails, the probability of success of the next experiment is 0.6. This process is continued indefinitely. For any n ∈ N, determine the probability that the nth experiment is successful.

Solution. This is a two-state Markov process {Xn : n ∈ N} with transition matrix
\[
T = \begin{pmatrix} 7/10 & 3/5 \\ 3/10 & 2/5 \end{pmatrix}.
\]
The corresponding state space admits the description {success, failure}, which here we denote by {1, 2}. Now, for n ∈ N, consider the stochastic vectors x(n) = (x(n)1, x(n)2)T, where x(n)1 is the probability of success of the nth experiment and x(n)2 is the probability of its failure, that is, x(n)1 = P(Xn = 1), x(n)2 = P(Xn = 2) = 1 − x(n)1. By assumption we have x(1) = (1/2, 1/2)T, and by the theoretical result in 3.3.5 we obtain the relation x(2) = Tx(1) = (13/20, 7/20)T. Similarly, for general n ∈ N the relation x(n+1) = Tx(n) yields x(n+1) = T2 x(n−1) = · · · = Tn x(1).
Although we deal exclusively with finite-dimensional spaces now, the results in the next two theorems have a natural generalization to Hilbert spaces, which are infinite-dimensional spaces with scalar products. We shall meet them later, in connection with approximation in vector spaces of real or complex valued functions.

Theorem. For every finite-dimensional space V of dimension n with scalar product we have:
(1) There exists an orthonormal basis in V.
(2) Every system of non-zero orthogonal vectors in V is linearly independent and can be extended to an orthogonal basis.
(3) For every system of linearly independent vectors (u1, . . . , uk) there exists an orthonormal basis (v1, . . . , vn) such that ⟨v1, . . . , vi⟩ = ⟨u1, . . . , ui⟩ for all 1 ≤ i ≤ k, i.e., its vectors consecutively generate the same subspaces as the vectors uj.
(4) If (u1, . . . , un) is an orthonormal basis of V, then the coordinates of every vector u ∈ V are expressed via
u = (u · u1)u1 + · · · + (u · un)un.
(5) In any orthonormal basis, the scalar product has the coordinate form
u · v = ȳT · x = x1ȳ1 + · · · + xnȳn,
where x and y are the columns of coordinates of the vectors u and v in the chosen basis. Notably, every n-dimensional space with scalar product is isomorphic to the standard Euclidean Rn or the unitary Cn.
(6) The orthogonal sum of unitary subspaces V1 + · · · + Vk in V is always a direct sum.
(7) If A ⊂ V is an arbitrary subset, then A⊥ ⊂ V is a vector subspace (and thus also unitary), and (A⊥)⊥ ⊂ V is exactly the subspace generated by A. Furthermore, V = span A ⊕ A⊥.
(8) V is an orthogonal sum of n one-dimensional unitary subspaces.

Proof. (1), (2), (3): First we extend the given system of vectors into any basis (u1, . . . , un) of the space V and then start the Gram–Schmidt orthogonalization from 2.3.21. This procedure works in the complex case as well. It yields an orthogonal basis with the properties required in (3). But from the Gram–Schmidt orthogonalization algorithm it is clear that if the original k vectors formed an orthogonal system, then they remain unchanged by the orthogonalization process. Thus we have also proved (2) and (1).
(4): If u = a1u1 + · · · + anun, then
u · ui = a1(u1 · ui) + · · · + an(un · ui) = ai∥ui∥2 = ai.
(5): If u = x1u1 + · · · + xnun and v = y1u1 + · · · + ynun, then
u · v = (x1u1 + · · · + xnun) · (y1u1 + · · · + ynun) = x1ȳ1 + · · · + xnȳn.

Next we compute Tn. By running the following commands in Sage we verify that T is diagonalizable and obtain its eigenvalues:

A = matrix(QQ, [[7/10, 3/5], [3/10, 2/5]])
A.is_diagonalizable()
A.eigenvalues()

The built-in function A.is_diagonalizable() provides a straightforward method to determine whether a matrix A is diagonalizable. We also obtain the eigenvalues λ1 = 1 and λ2 = 1/10. As eigenvectors we choose the vectors e1 = (2, 1) and e2 = (−1, 1). Since T is diagonalizable, we can compute Tn by applying the same method as in 3.B.6. Hence we should compute Tn = PDnP−1, where
\[
P = \begin{pmatrix} 2 & -1 \\ 1 & 1 \end{pmatrix}, \qquad D = \operatorname{diag}(\lambda_1, \lambda_2) = \operatorname{diag}(1, 1/10).
\]
This gives
\[
T^n = \frac{1}{3}\begin{pmatrix} 2 + 10^{-n} & 2 - 2\cdot 10^{-n} \\ 1 - 10^{-n} & 1 + 2\cdot 10^{-n} \end{pmatrix}, \quad n \in \mathbb{N}.
\]
Here is a verification in Sage:

T = matrix(SR, [[7/10, 3/5], [3/10, 2/5]])
P = matrix(SR, [[2, -1], [1, 1]])
n = var("n")
D = matrix(SR, [[1**n, 0], [0, (1/10)**n]])
Tn = P*D*P.inverse(); Tn

Now, matrix multiplication of Tn with the vector x(1) gives
\[
x^{(n+1)} = \Big( \frac{2}{3} - \frac{1}{6\cdot 10^{n}},\; \frac{1}{3} + \frac{1}{6\cdot 10^{n}} \Big)^{T}, \quad n \in \mathbb{N}.
\]
Thus, for big n, the probability of success of the nth experiment is close to 2/3. In other words, for large n we have x(n+1)1 ≈ 2/3. □
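In line with the corollary on iterations of stochastic matrices (see 3.3.6), one may also check numerically that the powers Tk approach the limit matrix whose columns both equal the stationary vector (2/3, 1/3)T. A one-line sanity check of our own:

T = matrix(QQ, [[7/10, 3/5], [3/10, 2/5]])
(T^50).n(digits=8)    # both columns are numerically (2/3, 1/3)^T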
To summarize, another significant advantage of studying Markov chains is their ability to predict the long-term behaviour of a system. Let us illustrate this with a straightforward example related to forecasting weather patterns.

3.C.14. Weather expectations. Suppose that the days are divided into warm, medium, and cold, and that the following hold:
1. After a warm day, the next day is warm with probability 50% and medium with probability 30%.
2. After a medium day, the next day is medium with probability 40% and cold with probability 30%.
3. After a cold day, the next day is cold with probability 50% and medium with probability 30%.
Without any further information, derive how many warm, medium and cold days can be expected in a year.

Solution. The problem is clearly a Markov process, since it is assumed that the daily weather depends only on the weather of the previous day.

(6): We need to show that for any pair Vi, Vj of the given subspaces, their intersection is the zero vector. If u ∈ Vi and u ∈ Vj, then u ⊥ u, that is, u · u = 0. This is possible only for the zero vector u ∈ V.
(7): Let u, v ∈ A⊥. Then (au + bv) · w = 0 for all w ∈ A and a, b ∈ K (by the distributivity of the scalar product). Thus A⊥ is a subspace of V. Let (v1, . . . , vk) be a basis of span A chosen among the elements of A, and let (u1, . . . , uk) be the orthonormal basis resulting from the Gram–Schmidt orthogonalization of the vectors (v1, . . . , vk). We extend it to an orthonormal basis of the whole of V (both exist by the already proven parts of this proposition). Because it is an orthogonal basis, necessarily
span{uk+1, . . . , un} = span{u1, . . . , uk}⊥ = A⊥
and A ⊂ span{uk+1, . . . , un}⊥ (this follows by expressing the coordinates in the orthonormal basis). If u ⊥ span{uk+1, . . . , un}, then u is necessarily a linear combination of the vectors u1, . . . , uk, and that happens exactly when it is a linear combination of the vectors v1, . . . , vk, which is equivalent to u being in span A.
(8): This is equivalent to the existence of an orthonormal basis. □

3.4.3. Important properties of the norm. Now we have everything prepared for the basic properties related to our definition of the norm of vectors; we also speak of the length of vectors defined by the scalar product. Note that all the claims consider finite sets of vectors; their validity does not depend on the dimension of the space V where it all takes place.

Properties of the norm
Theorem. Let V be a vector space with scalar product and let u, v be vectors in V. Then:
(1) ∥u + v∥ ≤ ∥u∥ + ∥v∥, with equality if and only if u and v are linearly dependent. This is called the triangle inequality.
(2) |u · v| ≤ ∥u∥∥v∥, with equality if and only if u and v are linearly dependent. This property is called the Cauchy inequality.
(3) If (e1, . . . , ek) is an orthonormal system of vectors, then ∥u∥2 ≥ |u · e1|2 + · · · + |u · ek|2. This property is called the Bessel inequality.
(4) If (e1, . . . , ek) is an orthonormal system of vectors, then u ∈ span{e1, . . . , ek} if and only if ∥u∥2 = |u · e1|2 + · · · + |u · ek|2. This is called the Parseval equality.
(5) If (e1, . . . , ek) is an orthonormal system of vectors and u ∈ V, then the vector w = (u · e1)e1 + · · · + (u · ek)ek is the unique vector minimizing the norm ∥u − v∥ among all v ∈ span{e1, . . . , ek}.
This has a 3-dimensional state space S = {w, m, c}, where w stands for warm, m for medium and c for cold, and the transition matrix is
\[
T = \begin{pmatrix} 0.5 & 0.3 & 0.2 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}.
\]
Let us now consider the probabilistic vector x(n) = (x(n)w, x(n)m, x(n)c)T, whose components are the probabilities that the nth day is warm, medium, or cold: x(n)w = P(Xn = w), x(n)m = P(Xn = m), and x(n)c = P(Xn = c), respectively. Since all the entries of T are positive, there exists a probabilistic vector x∞ = (x∞w, x∞m, x∞c)T which the vectors x(n) approach as n grows. By the corollary of the Perron–Frobenius theory (see 3.3.4), x∞ must be the eigenvector of T corresponding to the eigenvalue 1. This gives rise to the condition Tx∞ = x∞, which together with the condition that x∞ is stochastic yields the following system of equations:
x∞w = 0.5 x∞w + 0.3 x∞m + 0.2 x∞c,
x∞m = 0.3 x∞w + 0.4 x∞m + 0.3 x∞c,
x∞c = 0.2 x∞w + 0.3 x∞m + 0.5 x∞c,
1 = x∞w + x∞m + x∞c.
It is easy to see that x∞w = x∞m = x∞c = 1/3 is the unique solution of this system. Thus, one should expect roughly the same number of warm, medium and cold days. □

The column-stochastic matrix T in Problem 3.C.14 is symmetric, TT = T. Consequently, it is also "row-stochastic", meaning that the sum of the entries in any row is one. A matrix which is both column- and row-stochastic is called a "doubly stochastic matrix". A significant property of every doubly stochastic primitive matrix is that the corresponding vector x∞ has all its components identical, as above. This means that, after sufficiently many iterations, all states in the corresponding Markov chain are reached with the same frequency. A series of additional problems related to Markov chains is presented in Section F. There we address the development of an algorithm for determining the importance of web pages (see 3.F.20), along with other compelling tasks.

D. More matrix calculus

In this section we explore fundamental concepts of linear algebra that underpin advanced applications across diverse fields. We begin with the notion of "unitary spaces", see also 3.4.1. Given the vector space V = Fn, where F is either R or C, consider the standard scalar product ⟨ , ⟩ : V × V → F defined by
\[
\langle x, y \rangle := \sum_j x_j \bar y_j = x^T \bar y,
\]
where x = (x1, . . . , xn)T and y = (y1, . . . , yn)T are vectors in Fn, and the last expression is in terms of matrices. Since xT ȳ ∈ F is a scalar, we have xT ȳ = (xT ȳ)T = ȳT x = y∗x,

Proof. The verifications are all based on direct computations.
(2): The result is obvious if v = 0. Otherwise, define the vector w = u − ((u · v)/(v · v)) v, so that w ⊥ v, and compute
\[
\|w\|^2 = \|u\|^2 - \frac{\overline{(u\cdot v)}}{\|v\|^2}(u\cdot v) - \frac{u\cdot v}{\|v\|^2}(v\cdot u) + \frac{(u\cdot v)\overline{(u\cdot v)}}{\|v\|^4}\,\|v\|^2 ,
\]
hence
\[
\|w\|^2\,\|v\|^2 = \|u\|^2\|v\|^2 - 2|u\cdot v|^2 + |u\cdot v|^2 = \|u\|^2\|v\|^2 - |u\cdot v|^2 .
\]
These are non-negative real values, and thus ∥u∥2∥v∥2 ≥ |u · v|2, with equality if and only if w = 0, that is, whenever u and v are linearly dependent.
(1): It suffices to compute
\[
\|u+v\|^2 = \|u\|^2 + \|v\|^2 + u\cdot v + v\cdot u = \|u\|^2 + \|v\|^2 + 2\,\mathrm{Re}(u\cdot v)
\le \|u\|^2 + \|v\|^2 + 2|u\cdot v| \le \|u\|^2 + \|v\|^2 + 2\|u\|\|v\| = (\|u\| + \|v\|)^2 .
\]
Since we deal with squares of non-negative real numbers, this means that ∥u + v∥ ≤ ∥u∥ + ∥v∥. Furthermore, equality forces equality in all the previous inequalities, which is equivalent to the condition that u and v are linearly dependent (using the previous part).
(3), (4): Let (e1, . . . , ek) be an orthonormal system of vectors. We extend it to an orthonormal basis (e1, . . . , en) (that is always possible by the previous theorem). Then, again by the previous theorem, we have for every vector u ∈ V
\[
\|u\|^2 = \sum_{i=1}^n (u\cdot e_i)\overline{(u\cdot e_i)} = \sum_{i=1}^n |u\cdot e_i|^2 \ge \sum_{i=1}^k |u\cdot e_i|^2 .
\]
But that is the Bessel inequality. Furthermore, equality holds if and only if u · ei = 0 for all i > k, which proves the Parseval equality.
(5): Choose an arbitrary v ∈ span{e1, . . . , ek} and extend the given orthonormal system to an orthonormal basis (e1, . . . , en). Let (u1, . . . , un) and (x1, . . . , xk, 0, . . . , 0) be the coordinates of u and v in this basis. Then
\[
\|u - v\|^2 = |u_1 - x_1|^2 + \cdots + |u_k - x_k|^2 + |u_{k+1}|^2 + \cdots + |u_n|^2 ,
\]
and this expression is clearly minimized by choosing x1 = u1, . . . , xk = uk. □

3.4.4. Unitary and orthogonal mappings. The properties of orthogonal mappings have direct analogues in the complex domain. We can formulate and prove them together:

Proposition. Consider a linear mapping (endomorphism) φ : V → V on a (real or complex) space with scalar product. Then the following conditions are equivalent:
(1) φ is a unitary or orthogonal transformation;
(2) φ is a linear isomorphism and for every u, v ∈ V, φ(u) · v = u · φ−1(v);
(3) the matrix A of the mapping φ in any orthonormal basis satisfies $A^{-1} = \bar A^T$ (for Euclidean spaces this means A−1 = AT);

where y∗ := ȳT. Hence, we can equivalently express ⟨ , ⟩ as ⟨x, y⟩ = xT ȳ = y∗x. By convention, ⟨ , ⟩ is linear in the first argument but conjugate linear in the second one, in the sense that ⟨ax, y⟩ = a⟨x, y⟩ but ⟨x, ay⟩ = ā⟨x, y⟩, for all x, y ∈ V and a ∈ F. The scalar product ⟨ , ⟩ induces a norm on Fn defined by ∥x∥2 = ⟨x, x⟩ = ∑nj=1 |xj|2 (if zj = xj + iyj ∈ C, recall that z̄j = xj − iyj and |zj|2 = xj2 + yj2). Ensure that ∥ · ∥ satisfies the defining properties of a norm, that is:
• ∥x∥ ≥ 0, with ∥x∥ = 0 if and only if x = 0;
• ∥ax∥ = |a|∥x∥ for any a ∈ F and x ∈ Fn;
• ∥x + y∥ ≤ ∥x∥ + ∥y∥ for any x, y ∈ Fn.
Next we adopt the following terminology: ⟨·, ·⟩ is called the standard dot product for V = Rn, and the standard Hermitian form for V = Cn. This notation is consistent with what is used in Chapter 2 for the real case. Both spaces Rn and Cn endowed with ⟨ , ⟩ provide examples of unitary spaces (also known as inner product spaces), as defined in 3.4.1. On the other hand, the distance map d : V × V → R, defined by d(x, y) = ∥x − y∥, establishes V = Fn as a "metric space", a fundamental concept in analysis which we will explore in Chapter 7.

Next we will see that unitary spaces extend beyond Rn and Cn, with many other examples existing. We begin with the following task, which is left as an easy challenge for you.

3.D.1. Consider the vector space C3, endowed with the standard Hermitian form ⟨x, y⟩ = ∑k xk ȳk and the induced norm ∥ · ∥. Given the vectors x = (3 + 2i, 1 − i, −i)T and y = (2 − 2i, 1 − i, 2 + i)T, compute the inner product ⟨x, y⟩, the distance d(x, y) = ∥x − y∥, and the normalized vectors x̂, ŷ corresponding to x, y, with ∥x̂∥ = 1 = ∥ŷ∥. ⃝

3.D.2. When using Sage to compute scalar products of complex vectors, we need to be cautious. Sage permits applying the standard command x.dot_product(y), as discussed in Chapter 2 (see for example 2.C.45), to complex vectors as well. However, the handling of complex vectors may require special attention to ensure correct application and interpretation of the standard Hermitian scalar product. For example, let us use the vectors x, y given in 3.D.1.
Execute the cell

x = vector(CDF, [3+2*I, 1-I, -I])
y = vector(CDF, [2-2*I, 1-I, 2+I])
y.dot_product(x)

In this case Sage prints out the expression 11.0 − 6.0*I, which does not match the result obtained using the standard rule ⟨x, y⟩ = ∑i xi ȳi. To obtain the correct result we should use the cell

x = vector(CDF, [3+2*I, 1-I, -I])
y = vector(CDF, [2-2*I, 1-I, 2+I])
y.hermitian_inner_product(x)

which returns 3.0 + 8.0*I.

(4) the matrix A of the mapping φ in some orthonormal basis satisfies $A^{-1} = \bar A^T$;
(5) the rows of the matrix A of the mapping φ in an orthonormal basis form an orthonormal basis of the space Kn with the standard scalar product;
(6) the columns of the matrix A of the mapping φ in an orthonormal basis form an orthonormal basis of the space Kn with the standard scalar product.

Proof. (1) ⇒ (2): The mapping φ is injective, therefore it must be onto. Also φ(u) · v = φ(u) · φ(φ−1(v)) = u · φ−1(v).
(2) ⇒ (3): The standard scalar product on Kn is given for columns x, y of scalars by the expression x · y = ȳT E x = ȳT x, where E is the unit matrix. Property (2) thus means that the matrix A of the mapping φ is invertible and $\bar y^T A x = (\overline{A^{-1}y})^T x$. This means that $(\bar y^T A - (\overline{A^{-1}y})^T)x = 0$ for all x ∈ Kn. By substituting the complex conjugate of the expression in the parentheses for x, we find that equality is possible only when $\bar A^T = A^{-1}$. (We may also rewrite the expression as $\bar y^T (A - (\bar A^{-1})^T)x$ and see the conclusion by substituting the basis vectors for x and y.)
(3) ⇒ (4): This is an obvious implication.
(4) ⇒ (5): In the relevant basis, the claim is expressed via the matrix A of the mapping φ as the equation $A \bar A^T = E$, which is ensured by (4).
(5) ⇒ (6): We have $|A\bar A^T| = |E| = |A|\,\overline{|A|} = |\det A|^2 = 1$, so the inverse matrix A−1 exists. But we also have $A \bar A^T A = A$, therefore $\bar A^T A = E$, which is expressed exactly by (6).
(6) ⇒ (1): In the chosen orthonormal basis,
$\varphi(u)\cdot\varphi(v) = \overline{(Ay)}^T Ax = \bar y^T \bar A^T A x = \bar y^T E x = \bar y^T x$,
where x and y are the columns of coordinates of the vectors u and v. This ensures that the scalar product is preserved. □

The characterizations of the previous theorem deserve some notes. The matrices A ∈ Matn(K) with the property $A^{-1} = \bar A^T$ are called unitary matrices in the case of complex scalars (in the case of R we have already used the name orthogonal matrices). The definition immediately implies that a product of unitary (orthogonal) matrices is again unitary (orthogonal), and the same is true for inverses. Unitary matrices thus form a subgroup U(n) ⊂ Gln(C) of the group of all invertible complex matrices with the product operation. Orthogonal matrices form a subgroup O(n) ⊂ Gln(R) of the group of real invertible matrices. We speak of the unitary group and of the orthogonal group. The simple calculation
$1 = \det E = \det(A\bar A^T) = \det A \, \overline{\det A} = |\det A|^2$
shows that the determinant of a unitary matrix has norm equal to one. For real scalars the determinant is ±1.

Remark. It is important to note that in Sage the command x.hermitian_inner_product(y) prints out the expression 3.0 − 8.0*I. Therefore, Sage uses the rule ∑i x̄i yi instead of our rule ∑i xi ȳi. In summary, when computing the standard Hermitian form of complex vectors x, y ∈ Cn, to stay consistent with our conventions we should use the rule
⟨x, y⟩ = y∗x = y.hermitian_inner_product(x).
An alternative is based on the dot_product function and goes as follows:

u.dot_product(v.conjugate())

This corresponds to ⟨u, v⟩ for two complex vectors u, v.

3.D.3. Use the standard Hermitian form ⟨v, w⟩ = ∑i vi w̄i to compute ⟨v, w⟩, ∥v∥2, ∥w∥2 and the angle θ between v and w, where these vectors are given as follows:
(a) v = (1 + i, 2 − i)T, w = (3 − 2i, 1 + i)T in C2;
(b) v = (3, i)T, w = (2 − i, 1 − i)T in C2;
(c) v = (−i, 0, 2)T, w = (4, 1 − i, 1)T in C3.
Next verify your answers via Sage. ⃝

3.D.4. On V = F2 for F ∈ {R, C} consider the map
f((x1, x2)T, (y1, y2)T) = x1ȳ1 + 4x1ȳ2 + 4x2ȳ1 + x2ȳ2.
Does f define a scalar product on V? ⃝

3.D.5. For x = (x1, x2)T and y = (y1, y2)T in R2 set
g(x, y) := 2x1y1 − x1y2 − x2y1 + 5x2y2.
Show that g is a scalar product and compute its matrix relative to the standard basis of R2.

Solution. Linearity and symmetry are easily proved. In addition, we see that
g(x, x) = 2x12 − 2x1x2 + 5x22 = (x1 + x2)2 + (x1 − 2x2)2
for all x ∈ R2. Thus g(x, x) ≥ 0; in particular, g(x, x) = 0 if and only if x1 + x2 = 0 and x1 − 2x2 = 0, which gives x = 0. Thus g is a scalar product, clearly different from the standard dot product. Its matrix with respect to the standard basis of R2 is
\[
A = \begin{pmatrix} g(e_1, e_1) & g(e_1, e_2) \\ g(e_2, e_1) & g(e_2, e_2) \end{pmatrix} = \begin{pmatrix} 2 & -1 \\ -1 & 5 \end{pmatrix},
\]
such that g(x, y) = yT Ax. Observe that A is symmetric and positive definite (recall that A ∈ Matn(R) is called a "positive definite matrix" when uT Au > 0 for any non-zero vector u ≠ 0 in Rn). □

3.D.6. Let a ≠ b be positive real numbers. Show that the rule ρa,b(u, v) := au1v1 + bu2v2 defines a scalar product on R2, where u = (u1, u2)T and v = (v1, v2)T. Next, compute the angle between the vectors u = (1, 1)T and v = (1, −1)T with respect to ρ2,1, and compare the result with the angle that occurs if we use the dot product on R2. ⃝

Furthermore, if Ax = λx for a unitary or orthogonal matrix, then
(Ax) · (Ax) = x · x = |λ|2 (x · x).
Therefore the real eigenvalues of orthogonal matrices are ±1, and the eigenvalues of unitary matrices are always complex units in the complex plane. The same argument as for orthogonal mappings implies that orthogonal complements of invariant subspaces with respect to unitary mappings φ : V → V are also invariant. Indeed, if φ(U) ⊂ U and u ∈ U, v ∈ U⊥ are arbitrary, then
φ(v) · φ(φ−1(u)) = v · φ−1(u).
Because the restriction φ|U is also unitary, it is a bijection; notably, φ−1(u) ∈ U. But then φ(v) · u = 0, because v ∈ U⊥. Thus φ(v) ∈ U⊥. This leads to an immediate useful corollary in the complex domain.

Corollary. Let φ : V → V be a unitary mapping of complex vector spaces. Then V is an orthogonal sum of one-dimensional eigensubspaces.

Proof. There exists at least one eigenvector v ∈ V, since complex eigenvalues always exist. The restriction of φ to the invariant subspace ⟨v⟩⊥ is again unitary and also has an eigenvector. After n such steps we obtain the desired orthogonal basis of eigenvectors. After normalising the vectors, we obtain an orthonormal basis. □

Now it is possible to understand the details of the proof of the spectral decomposition of an orthogonal mapping from 2.4.7 at the end of the second chapter. The real matrix of an orthogonal mapping is interpreted as the matrix of a unitary mapping on the complex extension of the Euclidean space. We observe the consequences of the structure of the roots of the real characteristic polynomial over the complex domain.
Automatically we obtain invariant two-dimensional subspaces given by the pairs of complex conjugated eigenvalues, and hence the corresponding rotations for the restricted original real mapping.

3.4.5. Dual and adjoint mappings. When discussing vector spaces and linear mappings in the second chapter, we briefly mentioned the dual vector space V∗ of all linear forms on the vector space V, see 2.3.17. This duality extends to mappings:

Dual mappings
For any linear mapping ψ : V → W, the expression
(1) ⟨v, ψ∗(α)⟩ = ⟨ψ(v), α⟩,
where ⟨ , ⟩ denotes the evaluation of the linear forms (the second argument) on the vectors (the first argument), while v ∈ V and α ∈ W∗ are arbitrary, defines the mapping ψ∗ : W∗ → V∗ called the dual mapping to ψ.

Choose bases v in V, w in W, and write A for the matrix of the mapping ψ in these bases. Then we compute the matrix of the mapping ψ∗ in the corresponding dual bases of the dual spaces.

3.D.7. Suppose that A = (aij) is an m × n matrix over C whose column space has (complex) dimension n. Consider the mapping ρA : Cn × Cn → C with ρA(x, y) = y∗A∗Ax, where as usual $A^* = \bar A^T$ denotes the conjugate transpose of the matrix A. Show that the pair (Cn, ρA) is a unitary space. ⃝

3.D.8. Show that the rule B(A, B) := tr(B∗A) defines a scalar product on the space Matm,n(F) of m × n matrices with entries in F, for F ∈ {R, C}. This scalar product is known as the Frobenius inner product.

Solution. For any A, B ∈ Matm,n(F), the matrix B∗A is n × n, and hence B is well-defined. One can proceed by proving the axioms of a unitary space, as presented in 3.4.1. However, a direct calculation yields
\[
\mathcal{B}(A, B) = \sum_{i=1}^m \sum_{j=1}^n a_{ij}\,\overline{b_{ij}} .
\]
Hence, if we express A and B in terms of column vectors, say A = (A1 . . . An) and B = (B1 . . . Bn), then B(A, B) = ∑nj=1 ⟨Aj, Bj⟩, which is the sum of the standard scalar products of the corresponding columns of the matrices A and B. Hence B is a scalar product (as a sum of scalar products). Otherwise, a direct proof is mainly based on the properties of the trace, described in 2.C.38 in Chapter 2. For instance, over the complex numbers we have
B(aA, B) = tr(B∗(aA)) = a tr(B∗A) = a B(A, B)
for any scalar a ∈ C and any two elements A, B ∈ Matm,n(C); moreover, if C ∈ Matm,n(C) is another matrix, then
B(A + C, B) = tr(B∗(A + C)) = tr(B∗A) + tr(B∗C) = B(A, B) + B(C, B).
Positive-definiteness occurs as follows: if A∗A has entries cij, then $c_{jj} = \sum_{k=1}^m \overline{a_{kj}}\,a_{kj}$ and thus
\[
\|A\|_{\mathcal B}^2 = \mathcal B(A, A) = \sum_{j=1}^n c_{jj} = \sum_{i,j} |a_{ij}|^2 .
\]
Hence, as soon as A ≠ 0, we see that B(A, A) is strictly positive, B(A, A) > 0. We leave as an exercise the direct verification of the property $\mathcal B(A, B) = \overline{\mathcal B(B, A)}$ for A, B ∈ Matm,n(C). □

In Chapter 7 we will encounter additional examples of unitary spaces. There, leveraging the concept of integration, which we will examine in Chapter 6, we will introduce inner products on spaces of polynomials and, more generally, on infinite-dimensional function spaces (see for example 7.1.1, 7.1.2, and also the tasks 7.D.3, 7.D.4, 7.D.5, and 7.D.6). In this context, orthogonality becomes crucial, especially in Fourier analysis. For now, let us explore a few more elementary tasks related to orthogonality.
Indeed, the definition says that if we represent the vectors from W∗ in coordinates as rows of scalars, then the mapping ψ∗ is given by the same matrix as ψ, if we multiply the row vectors by it from the right:
\[
\langle \psi(v), \alpha \rangle = (\alpha_1, \ldots, \alpha_n) \cdot A \cdot \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \langle v, \psi^*(\alpha) \rangle .
\]
This means that the matrix of the dual mapping ψ∗ is the transpose AT, because α · A = (AT · αT)T.

Assume further that we have a vector space with a scalar product. Then we can naturally identify V and V∗ using the scalar product. Indeed, choosing one fixed vector w ∈ V, we substitute it into the second argument of the scalar product in order to obtain the identification
V ≃ V∗ = Hom(V, K), V ∋ w ↦ (v ↦ ⟨v, w⟩) ∈ V∗.
The non-degeneracy of the scalar product ensures that this mapping is a bijection. Notice that it is important to use w as the fixed second argument in the case K = C in order to obtain linear forms. Since factoring complex multiples out of the second argument yields complex conjugated scalars, the identification V ≃ V∗ is linear over real scalars only. It is clear that the vectors of an orthonormal basis are mapped to the forms constituting the dual basis, i.e., orthonormal bases are self-dual under our identification. Moreover, every vector is automatically understood as a linear form, by means of the scalar product. How does the above dual mapping W∗ → V∗ look in terms of our identification? We use the same notation ψ∗ : W → V for the resulting mapping, which is uniquely given as follows:

Adjoint mapping
For every linear mapping ψ : V → W between spaces with scalar products, there is the adjoint mapping ψ∗ uniquely determined by the formula
(2) ⟨ψ(u), v⟩ = ⟨u, ψ∗(v)⟩.
The brackets mean the scalar products on W and V, respectively.

Notice that using the same brackets for the evaluation of one-forms and for scalar products (which reflects the identification above) makes the defining formulae of dual and adjoint mappings look the same. Equivalently, we can understand the relation (2) as the definition of the adjoint mapping ψ∗. By substituting all pairs of vectors from an orthonormal basis for the vectors u and v, we obtain directly all the values of the matrix of the mapping ψ∗.

3.D.9. (The Pythagorean theorem) (a) In a real vector space V with a scalar product ⟨ , ⟩, prove that two vectors u, w are orthogonal, i.e., ⟨u, w⟩ = 0, if and only if ∥u + w∥2 = ∥u∥2 + ∥w∥2. (Observe that this condition, known as the Pythagorean theorem, is not the same as the equality case in the triangle inequality ∥u + w∥ ≤ ∥u∥ + ∥w∥.) Next demonstrate with a counterexample that this property does not hold for a complex unitary vector space.
(b) If (V, ⟨ , ⟩) is a real scalar product space, prove that the vectors u − w and u + w are orthogonal if and only if ∥u∥ = ∥w∥, where u, w ∈ V are two arbitrary vectors. ⃝

3.D.10. Let (V, ⟨ , ⟩) be a unitary space over F, and u, v ∈ V two arbitrary vectors. Show that:
(a) u ⊥ v if and only if ∥u + av∥ = ∥u − av∥ for all a ∈ F;
(b) u ⊥ v if and only if ∥u + av∥ ≥ ∥u∥ for all a ∈ F. ⃝

3.D.11. Consider the space Matm,n(R) with the scalar product B(A, B) = tr(BT A), introduced in 3.D.8. For the matrices
\[
A = \begin{pmatrix} 1 & 3 & 5 \\ 0 & 2 & 2 \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} 2 & 4 & 0 \\ 7 & 9 & 1 \end{pmatrix}
\]
compute the following:
(a) the angle θ between A and B;
(b) the distance ∥A − B∥B between A and B;
(c) verify the Cauchy–Schwarz inequality.

Solution. (a) By 3.D.8, for two real matrices A = (aij) and B = (bij), both of size m × n, we get B(A, B) = tr(BT A) = ∑mi=1 ∑nj=1 aij bij and ∥A∥2B = B(A, A) = ∑mi=1 ∑nj=1 aij2.
In particular, if
\[
A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{pmatrix}
\]
are two elements of Mat2,3(R), then the product BT A equals
\[
\begin{pmatrix}
b_{11}a_{11} + b_{21}a_{21} & b_{11}a_{12} + b_{21}a_{22} & b_{11}a_{13} + b_{21}a_{23} \\
b_{12}a_{11} + b_{22}a_{21} & b_{12}a_{12} + b_{22}a_{22} & b_{12}a_{13} + b_{22}a_{23} \\
b_{13}a_{11} + b_{23}a_{21} & b_{13}a_{12} + b_{23}a_{22} & b_{13}a_{13} + b_{23}a_{23}
\end{pmatrix},
\]
such that B(A, B) = tr(BT A) = a11b11 + a12b12 + a13b13 + a21b21 + a22b22 + a23b23. Thus, for the given A and B we compute
B(A, B) = 2 + 12 + 0 + 0 + 18 + 2 = 34,
∥A∥2B = a112 + a122 + a132 + a212 + a222 + a232 = 43,
∥B∥2B = b112 + b122 + b132 + b212 + b222 + b232 = 151.

Using the coordinate expression for the scalar product, the formula (2) reveals the coordinate expression of the adjoint mapping:
\[
\langle \psi(v), w\rangle = \bar w^T \cdot A \cdot \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \Big( \overline{\bar A^T \cdot \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}} \Big)^{T} \cdot \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \langle v, \psi^*(w)\rangle .
\]
It follows that if A is the matrix of the mapping ψ in an orthonormal basis, then the matrix of the adjoint mapping ψ∗ is the conjugate transpose of A, which we denote by $A^* = \bar A^T$. The matrix A∗ is called the adjoint matrix of the matrix A. Note that the adjoint matrix is well defined for any rectangular matrix. We should not confuse adjoint matrices with the algebraic adjoints used for square matrices when working with determinants.

We can summarise. For any linear mapping ψ : V → W between unitary spaces, with matrix A in some bases of V and W, the dual mapping has the matrix AT in the dual bases. If there are scalar products on V and W, we identify them (via the scalar products) with their duals. Then the dual mapping coincides with the adjoint mapping ψ∗ : W → V, which has the matrix A∗. The distinction between the matrix of the dual mapping and the matrix of the adjoint mapping is thus the additional conjugation. This is of course a consequence of the fact that our identification of a unitary space with its dual is not a linear mapping over complex scalars.

3.4.6. Self-adjoint mappings. Those linear mappings which coincide with their adjoints, ψ∗ = ψ, are of particular interest. They are called self-adjoint mappings. Equivalently, they are the mappings whose matrix A satisfies A = A∗ in some (and thus in every) orthonormal basis. In the case of Euclidean spaces, the self-adjoint mappings are those with symmetric matrices (in an orthonormal basis); they are often called symmetric mappings. In the complex domain, the matrices satisfying A = A∗ are called Hermitian matrices, or also Hermitian symmetric matrices; sometimes they are called self-adjoint matrices. Note that the Hermitian matrices form a real vector subspace of the space of all complex matrices, but not a complex vector subspace.

Remark. The next observation is of special interest. If we multiply a Hermitian matrix A by the imaginary unit, we obtain the matrix B = iA, which has the property $B^* = \bar i\,\bar A^T = -B$. Such matrices are called anti-Hermitian or Hermitian skew-symmetric.

Hence ∥A∥B = √43, ∥B∥B = √151, and
\[
\cos\theta = \frac{\mathcal B(A, B)}{\|A\|_{\mathcal B}\|B\|_{\mathcal B}} = \frac{34}{\sqrt{43}\sqrt{151}} .
\]
From this one can explicitly compute θ, as before.
(b) We see that
\[
A - B = \begin{pmatrix} -1 & -1 & 5 \\ -7 & -7 & 1 \end{pmatrix}.
\]
Thus ∥A − B∥2B = B(A − B, A − B) = 1 + 1 + 25 + 49 + 49 + 1 = 126, such that ∥A − B∥B = √126.
(c) For the Cauchy inequality we refer to 3.4.3. This important inequality for the scalar product B takes the form
\[
|\mathcal B(A, B)| \le \|A\|_{\mathcal B}\|B\|_{\mathcal B} = \sqrt{\mathcal B(A, A)}\sqrt{\mathcal B(B, B)}
\iff |\operatorname{tr}(B^*A)| \le \sqrt{\operatorname{tr}(A^*A)}\sqrt{\operatorname{tr}(B^*B)} .
\]
This gives us 34 < √43 √151 ≈ 80. □
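All three items are quick to confirm in Sage; the following is our own verification sketch (the names ip, nA, nB, dist are ours):

A = matrix(QQ, [[1, 3, 5], [0, 2, 2]])
B = matrix(QQ, [[2, 4, 0], [7, 9, 1]])
ip = (B.transpose()*A).trace()            # B(A, B) = 34
nA = sqrt((A.transpose()*A).trace())      # ||A|| = sqrt(43)
nB = sqrt((B.transpose()*B).trace())      # ||B|| = sqrt(151)
theta = arccos(ip/(nA*nB))                # the angle from part (a)
dist = sqrt(((A-B).transpose()*(A-B)).trace())   # sqrt(126)
bool(ip <= nA*nB)                         # True: the Cauchy inequality holds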
We proceed with exercises related to orthonormal bases and orthogonal complements. Recall that we have already discussed these concepts in Chapter 2 (see for example 2.C.48, 2.C.50). There we introduced the Gram–Schmidt orthogonalization process, which transforms any basis {E1, . . . , En} of a scalar product space (V, ⟨ , ⟩) into an orthogonal basis {w1, . . . , wn} of V. The process begins with w1 = E1 and constructs the jth member wj as follows:
\[
w_j = E_j - \frac{\langle E_j, w_1\rangle}{\|w_1\|^2}\,w_1 - \cdots - \frac{\langle E_j, w_{j-1}\rangle}{\|w_{j-1}\|^2}\,w_{j-1} .
\]
This results in an orthonormal basis {w1/∥w1∥, . . . , wn/∥wn∥} of V. This straightforward yet essential construction has a wide range of applications. We begin with an example that requires extending the method presented in Chapter 2 (cf. 2.C.50) to fit Sage's capabilities.

3.D.12. Consider the Euclidean space R3 endowed with the scalar product ⟨⟨u, v⟩⟩ = vT Au, where A is given by
\[
A = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{pmatrix}
\]
(verify yourselves that A is symmetric and positive definite; for the second claim, show for example that all the eigenvalues of A are positive). Apply the Gram–Schmidt procedure to obtain a ⟨⟨ , ⟩⟩-orthogonal basis of R3, starting with the standard basis e = {e1, e2, e3} of R3.

Solution. Let e1 = (1, 0, 0)T, e2 = (0, 1, 0)T and e3 = (0, 0, 1)T be the vectors of the standard basis e of R3. According to the Gram–Schmidt method, a ⟨⟨ , ⟩⟩-orthogonal basis of R3 is given by {w1, w2, w3}, with w1 = e1 and
\[
w_2 = e_2 - \frac{\langle\langle e_2, w_1\rangle\rangle}{\langle\langle w_1, w_1\rangle\rangle}\,w_1 , \qquad
w_3 = e_3 - \frac{\langle\langle e_3, w_1\rangle\rangle}{\langle\langle w_1, w_1\rangle\rangle}\,w_1 - \frac{\langle\langle e_3, w_2\rangle\rangle}{\langle\langle w_2, w_2\rangle\rangle}\,w_2 .
\]

Every real matrix can be written as a sum of its symmetric part and its anti-symmetric part:
\[
A = \tfrac{1}{2}(A + A^T) + \tfrac{1}{2}(A - A^T).
\]
In the complex domain we analogously have
\[
A = \tfrac{1}{2}(A + A^*) + i\,\tfrac{1}{2i}(A - A^*).
\]
In particular, we may express every complex matrix in a unique way as a sum A = B + iC with Hermitian symmetric matrices B = (1/2)(A + A∗) and C = (1/2i)(A − A∗). This is an analogue of the decomposition of a complex number into its real and purely imaginary components, and in the literature we often encounter the notation
B = re A = (1/2)(A + A∗), C = im A = (1/2i)(A − A∗).
In the language of linear mappings this means that every complex linear automorphism can be uniquely expressed by means of two self-adjoint mappings playing the role of the real and imaginary parts of the original mapping.

3.4.7. Spectral decomposition. Consider a self-adjoint mapping ψ : V → V with the matrix A in some orthonormal basis. We proceed similarly as in 2.4.7, where we diagonalized the matrix of an orthogonal mapping. Again, consider arbitrary invariant subspaces of self-adjoint mappings and their orthogonal complements. If a self-adjoint mapping ψ : V → V leaves a subspace W ⊂ V invariant, i.e., ψ(W) ⊂ W, then for every v ∈ W⊥ and w ∈ W,
⟨ψ(v), w⟩ = ⟨v, ψ(w)⟩ = 0.
Thus also ψ(W⊥) ⊂ W⊥. Next, consider the matrix A of a self-adjoint mapping in an orthonormal basis and an eigenvector x ∈ Cn, i.e., A · x = λx. We obtain
λ⟨x, x⟩ = ⟨Ax, x⟩ = ⟨x, Ax⟩ = ⟨x, λx⟩ = λ̄⟨x, x⟩.
The positive real number ⟨x, x⟩ can be cancelled on both sides, thus λ̄ = λ, and we see that the eigenvalues of Hermitian matrices are always real. The characteristic polynomial det(A − λE) has as many complex roots as the dimension of the square matrix A (counted with multiplicities), and all of them are actually real.
3.D.13. Let $(V, \langle\,,\rangle)$ be a (finite-dimensional) unitary vector space and let $\{E_j\}_{j=1}^n$ be a $\langle\,,\rangle$-orthonormal basis of $V$, i.e., $\langle E_i, E_j\rangle = \delta_{ij}$. Show that any two vectors $x, y \in V$ satisfy $\langle x, y\rangle = \sum_{j=1}^n \langle x, E_j\rangle\,\overline{\langle y, E_j\rangle}$. ⃝
3.D.14. Show that the vectors $E_1 = (-1, 1, 2)$, $E_2 = (2, 0, 1)$ and $E_3 = (1, 5, -2)$ form an orthogonal basis of $\mathbb{R}^3$ (with respect to the usual dot product). Next express the vector $u = (6, 2, -4)$ as a linear combination of this basis. ⃝
3.D.15. Based on the Cauchy-Schwarz inequality, show that any triple $(a, b, c)$ of positive real numbers satisfies the following inequality:
$$\sqrt{\frac{a+2b}{a+b+c}} + \sqrt{\frac{b+2c}{a+b+c}} + \sqrt{\frac{c+2a}{a+b+c}} \le 3. \qquad ⃝$$
3.D.16. Consider $\mathbb{C}^3$ endowed with the standard Hermitian form $\langle u, v\rangle = \sum_{i=1}^3 u_i\bar v_i$ and the standard basis $e = \{e_1, e_2, e_3\}$. Verify Parseval's equality for the vector $u = (2+i, -1+2i, 3-i) \in \mathbb{C}^3$. ⃝
3.D.17. In Problem 3.D.11 use the isomorphism $\mathrm{Mat}_{2,3}(\mathbb{R}) \cong \mathbb{R}^6$ established in Problem 2.C.31 to show that the matrices $A$, $B$ are linearly independent. Then consider the subspace $W$ of $\mathrm{Mat}_{2,3}(\mathbb{R})$ spanned by the matrices $A$, $B$. Find a basis of the orthogonal complement $W^\perp$ of $W$ with respect to the Frobenius scalar product $\mathcal{B}$ (introduced in 3.D.8).
Solution. Under the isomorphism $\varphi : \mathrm{Mat}_{2,3}(\mathbb{R}) \cong \mathbb{R}^6$ discussed in Problem 2.C.31, we may view the matrices $A$ and $B$ as the vectors $\varphi(A) = v_1 = (1, 3, 5, 0, 2, 2)^T$ and $\varphi(B) = v_2 = (2, 4, 0, 7, 9, 1)^T$ in $\mathbb{R}^6$, respectively. We now test their linear independence via Sage by the cell

V = RR^6
v1 = vector(RR, [1, 3, 5, 0, 2, 2])
v2 = vector(RR, [2, 4, 0, 7, 9, 1])
V.linear_dependence([v1, v2]) == []

Sage prints out True, so $v_1$, $v_2$ are linearly independent. Consider the subspace $W = \mathrm{span}_{\mathbb{R}}\{A, B\} \cong \mathrm{span}_{\mathbb{R}}\{v_1, v_2\}$ of $\mathrm{Mat}_{2,3}(\mathbb{R}) \cong \mathbb{R}^6$ spanned by $A$, $B$. We need to determine $W^\perp$ with respect to $\mathcal{B}$, that is,
$$W^\perp = \{C \in \mathrm{Mat}_{2,3}(\mathbb{R}) : \mathcal{B}(C, A) = 0 = \mathcal{B}(C, B)\}.$$
Let us express $C \in \mathrm{Mat}_{2,3}(\mathbb{R})$ as $C = \begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}$ for some reals $a, \dots, f$. Then $C \in W^\perp$ if and only if $\operatorname{tr}(A^T C) = 0 = \operatorname{tr}(B^T C)$, that is,
$$a + 3b + 5c + 2e + 2f = 0, \qquad 2a + 4b + 7d + 9e + f = 0.$$
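These two linear conditions can also be handed to Sage directly. The following small cell is our own; it reproduces symbolically the solution of the system that is derived in the continuation of this exercise below:

var('a b c d e f')
eqs = [a + 3*b + 5*c + 2*e + 2*f == 0,
       2*a + 4*b + 7*d + 9*e + f == 0]
solve(eqs, a, b)   # a == 10*c - 21/2*d - 19/2*e + 5/2*f, b == -5*c + 7/2*d + 5/2*e - 3/2*f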
Using the coordinate expression for the scalar product, the formula (2) reveals the coordinate expression of the adjoint mapping:
$$\langle \psi(v), w\rangle = (\bar w_1, \dots, \bar w_n)\, A \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \overline{\left(\bar A^T \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}\right)}^{\,T} \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \langle v, \psi^*(w)\rangle.$$
It follows that if $A$ is the matrix of the mapping $\psi$ in an orthonormal basis, then the matrix of the adjoint mapping $\psi^*$ is the transposed and conjugated matrix; we denote this by $A^* = \bar A^T$. The matrix $A^*$ is called the adjoint matrix of the matrix $A$. Note that the adjoint matrix is well defined for any rectangular matrix. We should not confuse adjoint matrices with the algebraic adjoints, which we used for square matrices when working with determinants.
We can summarize. For any linear mapping $\psi : V \to W$ between unitary spaces, with matrix $A$ in some bases on $V$ and $W$, its dual mapping has the matrix $A^T$ in the dual bases. If there are scalar products on $V$ and $W$, we identify $V$ and $W$ (via the scalar products) with their duals. Then the dual mapping coincides with the adjoint mapping $\psi^* : W \to V$, which has the matrix $A^*$. The distinction between the matrix of the dual mapping and the matrix of the adjoint mapping thus lies in the additional conjugation. This is of course a consequence of the fact that our identification of a unitary space with its dual is not a linear mapping over the complex scalars.
3.4.6. Self-adjoint mappings. Those linear mappings which coincide with their adjoints, $\psi^* = \psi$, are of particular interest. They are called self-adjoint mappings. Equivalently, they are the mappings whose matrix $A$ satisfies $A = A^*$ in some (and thus in every) orthonormal basis.
In the case of Euclidean spaces the self-adjoint mappings are those with symmetric matrices (in an orthonormal basis). They are often called symmetric mappings. In the complex domain the matrices that satisfy $A = A^*$ are called Hermitian matrices, or also Hermitian symmetric matrices; sometimes they are also called self-adjoint matrices. Note that the Hermitian matrices form a real vector subspace of the space of all complex matrices, but not a complex one.
Remark. The next observation is of special interest. If we multiply a Hermitian matrix $A$ by the imaginary unit, we obtain the matrix $B = iA$, which has the property $B^* = \bar i\,\bar A^T = -B$. Such matrices are called anti-Hermitian or Hermitian skew-symmetric. Every real matrix can be written as the sum of its symmetric part and its anti-symmetric part,
$$A = \tfrac12(A + A^T) + \tfrac12(A - A^T).$$
In the complex domain we have analogously
$$A = \tfrac12(A + A^*) + i\,\tfrac{1}{2i}(A - A^*).$$
In particular, we may express every complex matrix in a unique way as a sum $A = B + iC$ with Hermitian symmetric matrices $B = \tfrac12(A + A^*)$ and $C = \tfrac{1}{2i}(A - A^*)$. This is an analogy of the decomposition of a complex number into its real and purely imaginary components, and in the literature we often encounter the notation
$$\operatorname{re} A = \tfrac12(A + A^*), \qquad \operatorname{im} A = \tfrac{1}{2i}(A - A^*).$$
In the language of linear mappings this means that every complex linear automorphism can be expressed uniquely by means of two self-adjoint mappings playing the roles of the real and imaginary parts of the original mapping.
3.4.7. Spectral decomposition. Consider a self-adjoint mapping $\psi : V \to V$ with the matrix $A$ in some orthonormal basis. We proceed similarly as in 2.4.7, where we diagonalized the matrices of orthogonal mappings. Again, we consider arbitrary invariant subspaces of self-adjoint mappings and their orthogonal complements. If a self-adjoint mapping $\psi : V \to V$ leaves a subspace $W \subset V$ invariant, i.e. $\psi(W) \subset W$, then for every $v \in W^\perp$, $w \in W$,
$$\langle \psi(v), w\rangle = \langle v, \psi(w)\rangle = 0.$$
Thus also $\psi(W^\perp) \subset W^\perp$.
Next, consider the matrix $A$ of a self-adjoint mapping in an orthonormal basis and an eigenvector $x \in \mathbb{C}^n$, i.e. $A\cdot x = \lambda x$. We obtain
$$\lambda\langle x, x\rangle = \langle Ax, x\rangle = \langle x, Ax\rangle = \langle x, \lambda x\rangle = \bar\lambda\langle x, x\rangle.$$
The positive real number $\langle x, x\rangle$ can be cancelled on both sides, thus $\bar\lambda = \lambda$, and we see that the eigenvalues of Hermitian matrices are always real. The characteristic polynomial $\det(A - \lambda E)$ has as many complex roots as the dimension of the square matrix $A$ (counting multiplicities), and all of them are actually real. Thus we have proved the important general result:
Proposition. The orthogonal complements of invariant subspaces of self-adjoint mappings are also invariant. Furthermore, the eigenvalues of a Hermitian matrix $A$ are always real.
The very definition ensures that the restriction of a self-adjoint mapping to an invariant subspace is again self-adjoint. Thus the latter proposition implies that there always exists an orthonormal basis of $V$ composed of eigenvectors. Indeed, start with any eigenvector $v_1$, normalize it, consider its linear hull $V_1$ and restrict the mapping to $V_1^\perp$. There, choose another eigenvector $v_2 \in V_1^\perp$ and take $V_2 = \mathrm{span}(V_1 \cup \{v_2\})$, which is again invariant. Continuing in this way, we construct the sequence of invariant subspaces $V_1 \subset V_2 \subset \cdots \subset V_n = V$, building the orthonormal basis of eigenvectors, as expected.
Actually, it is easy to see directly that eigenvectors associated with different eigenvalues are perpendicular to each other. Indeed, if $\psi(u) = \lambda u$ and $\psi(v) = \mu v$, then we obtain
$$\lambda\langle u, v\rangle = \langle \psi(u), v\rangle = \langle u, \psi(v)\rangle = \bar\mu\langle u, v\rangle = \mu\langle u, v\rangle,$$
and for $\lambda \neq \mu$ this forces $\langle u, v\rangle = 0$.
Usually this result is formulated using projections onto the eigenspaces. Recall the properties of projections along subspaces, as discussed in 2.3.19. A projection $P : V \to V$ is a linear mapping satisfying $P^2 = P$. This means that the restriction of $P$ to its image is the identity, and the projection is completely determined by choosing the subspaces $\operatorname{Im} P$ and $\operatorname{Ker} P$. A projection $P : V \to V$ is called orthogonal if $\operatorname{Im} P \perp \operatorname{Ker} P$. Two orthogonal projections $P$, $Q$ are called mutually perpendicular if $\operatorname{Im} P \perp \operatorname{Im} Q$.
Spectral decomposition of self-adjoint mappings
Theorem (Spectral decomposition). For every self-adjoint mapping $\psi : V \to V$ on a vector space with scalar product there exists an orthonormal basis composed of eigenvectors. If $\lambda_1, \dots, \lambda_k$ are all the distinct eigenvalues of $\psi$ and $P_1, \dots, P_k$ are the corresponding orthogonal and mutually perpendicular projections onto the eigenspaces, then
$$\psi = \lambda_1 P_1 + \cdots + \lambda_k P_k.$$
The dimensions of the images of the projections $P_i$ equal the algebraic multiplicities of the eigenvalues $\lambda_i$.
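The theorem is easy to illustrate in Sage. In the following minimal sketch the symmetric matrix S and all names are our own choices; the orthogonal projections are assembled from the eigenvectors, and the weighted sum of projections recovers the mapping.

S = matrix(QQ, [[2, 1], [1, 2]])     # a self-adjoint (symmetric) matrix
D, P = S.eigenmatrix_right()         # S*P == P*D; the columns of P are eigenvectors
v1 = P.column(0); v2 = P.column(1)
P1 = v1.column()*v1.row()/(v1*v1)    # orthogonal projection onto the first eigenspace
P2 = v2.column()*v2.row()/(v2*v2)    # orthogonal projection onto the second eigenspace
print(P1*P2 == 0)                    # mutually perpendicular projections
print(S == D[0,0]*P1 + D[1,1]*P2)    # the decomposition psi = lambda_1 P_1 + lambda_2 P_2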
3.4.8. Orthogonal diagonalization. Linear mappings which admit orthonormal bases as in the latter theorem on spectral decomposition are called orthogonally diagonalizable. Of course, they are exactly the mappings for which we can find an orthonormal basis in which the matrix of the mapping is diagonal. We ask what such mappings look like.
In the Euclidean case this is simple: diagonal matrices are first of all symmetric, thus the orthogonally diagonalizable mappings are exactly the self-adjoint ones. As a corollary we note that an orthogonal mapping of a Euclidean space into itself is orthogonally diagonalizable if and only if it is self-adjoint. These are exactly the self-adjoint mappings with eigenvalues $\pm 1$.
The situation is much more interesting on unitary spaces. Consider any linear mapping $\varphi : V \to V$ on a unitary space and let $\varphi = \psi + i\eta$ be the (unique) decomposition of $\varphi$ into its Hermitian and anti-Hermitian parts. If $\varphi$ has a diagonal matrix $D$ in a suitable orthonormal basis, then $D = \operatorname{Re} D + i \operatorname{Im} D$, where the real and the imaginary parts are exactly the matrices of $\psi$ and $\eta$. This follows from the uniqueness of the decomposition. Knowing this in these particular coordinates, we conclude the following relations at the level of mappings: $\psi \circ \eta = \eta \circ \psi$ (i.e. the real and imaginary parts of $\varphi$ commute), and $\varphi \circ \varphi^* = \varphi^* \circ \varphi$ (since this clearly holds for all diagonal matrices). The mappings $\varphi : V \to V$ with the latter property are called the normal mappings. A detailed characterization is given by the following theorem (stated in the notation of this paragraph):
Theorem. The following conditions on a mapping $\varphi : V \to V$ on a unitary space $V$ are equivalent:
(1) $\varphi$ is orthogonally diagonalizable,
(2) $\varphi^* \circ \varphi = \varphi \circ \varphi^*$ ($\varphi$ is a normal mapping),
(3) $\psi \circ \eta = \eta \circ \psi$ (the Hermitian and anti-Hermitian parts commute),
(4) if $A = (a_{ij})$ is the matrix of $\varphi$ in some orthonormal basis and $\lambda_i$ are the $m = \dim V$ eigenvalues of $A$, then $\sum_{i,j=1}^m |a_{ij}|^2 = \sum_{i=1}^m |\lambda_i|^2$.
Proof. The implication (1) ⇒ (2) was discussed above.
(2) ⇔ (3): it suffices to calculate
$$\varphi \circ \varphi^* = (\psi + i\eta)(\psi - i\eta) = \psi^2 + \eta^2 + i(\eta\psi - \psi\eta),$$
$$\varphi^* \circ \varphi = (\psi - i\eta)(\psi + i\eta) = \psi^2 + \eta^2 + i(\psi\eta - \eta\psi).$$
Subtraction of the two lines yields $\varphi\varphi^* - \varphi^*\varphi = 2i(\eta\psi - \psi\eta)$.
(2) ⇒ (1): If $\varphi$ is normal, then $\langle \varphi(u), \varphi(u)\rangle = \langle \varphi^*\varphi(u), u\rangle = \langle \varphi\varphi^*(u), u\rangle = \langle \varphi^*(u), \varphi^*(u)\rangle$, thus $|\varphi(u)| = |\varphi^*(u)|$. Next, notice $(\varphi - \lambda\,\mathrm{id}_V)^* = \varphi^* - \bar\lambda\,\mathrm{id}_V$. Thus, if $\varphi$ is normal, then $\varphi - \lambda\,\mathrm{id}_V$ is normal too. If $\varphi(u) = \lambda u$, then $u$ is in the kernel of $\varphi - \lambda\,\mathrm{id}_V$. The latter equality of the norms of the values of normal mappings and their adjoints then ensures that $u$ is also in the kernel of $\varphi^* - \bar\lambda\,\mathrm{id}_V$. It follows that $\varphi^*(u) = \bar\lambda u$. We have proved, under the assumption (2), that $\varphi$ and $\varphi^*$ have the same eigenvectors, associated to conjugated eigenvalues.
Similarly to our procedure with self-adjoint mappings, we now prove orthogonal diagonalizability. The procedure is based on the fact that the orthogonal complements of sums of eigenspaces are invariant subspaces. Consider an eigenvector $u \in V$ with eigenvalue $\lambda$ and any $v \in \langle u\rangle^\perp$. We have
$$\langle \varphi(v), u\rangle = \langle v, \varphi^*(u)\rangle = \langle v, \bar\lambda u\rangle = \lambda\langle v, u\rangle = 0.$$
Thus $\varphi(v) \in \langle u\rangle^\perp$. The same occurs if $u$ is replaced by a sum of eigenvectors.
(1) ⇒ (4): the expression $\sum_{i,j}|a_{ij}|^2$ is the trace of the matrix $AA^*$, which is the matrix of the mapping $\varphi \circ \varphi^*$. Therefore its value does not depend on the choice of the orthonormal basis. Thus if $\varphi$ is orthogonally diagonalizable, this expression equals exactly $\sum_i |\lambda_i|^2$.
(4) ⇒ (1): This part of the proof is a direct corollary of the Schur theorem on unitary triangulation of an arbitrary linear mapping $V \to V$, which we prove later in 3.4.15. This theorem says that for every linear mapping $\varphi : V \to V$ there exists an orthonormal basis in which $\varphi$ has an upper triangular matrix; all the eigenvalues of $\varphi$ then appear on its diagonal. Since we have already shown that the expression $\sum_{i,j}|a_{ij}|^2$ does not depend on the choice of the orthonormal basis, all the elements of the upper triangular matrix which are not on the diagonal must be zero. □
Remark. We can rephrase the main statement of the latter theorem in terms of matrices. A mapping is normal if and only if its matrix $A$ satisfies $AA^* = A^*A$ in some orthonormal basis (and then equivalently in any orthonormal basis). Such matrices are called normal. Moreover, we can consider the last theorem as a generalization of standard calculations with complex numbers. The linear mappings appear similar to complex numbers in their algebraic form.
The role of real numbers is played by self-adjoint mappings, and the unitary mappings play the role of the complex units cos t+i sin t ∈ C. The following consequence of the theorem shows the link to the property cos2 t + sin2 t = 1. Corollary. The unitary mappings on a unitary space V are exactly those normal mappings φ on V for which the unique decomposition φ = ψ + iη into Hermitian and antiHermitian parts satisfies ψ2 + η2 = idV . Proof. If φ is unitary, then φφ∗ = idV = φ∗ φ and thus φφ∗ = (ψ + iη)(ψ − iη) = ψ2 + 0 + η2 = idV . On the other hand, if φ is normal, we can read the latter computation backwards which proves the other implication. □ 3.4.9. Roots of matrices. Non-negative real numbers are exactly those which are squares of real numbers (and thus we may find their square roots). At the same time, their positive square roots are uniquely defined. Now we observe a similar behaviour of matrices of the form B = A∗ A. Of course, these are the matrices of the compositions of mappings φ with their adjoints. By definition, (1) ⟨B x, x⟩ = ⟨A∗ A x, x⟩ = ⟨A x, A x⟩ ≥ 0 for all vectors x. Furthermore, we clearly have B∗ = (A∗ A)∗ = A∗ A = B. Hermitian matrices B with the property ⟨Bx, x⟩ ≥ 0 for all x are called positive semidefinite matrices. If the zero value is attained only for x = 0, they are called positive definite. Analogously, we speak of positive definite and positive semidefinite (self-adjoint) mappings φ : V → V . For every mapping φ : V → V we can define its square root as a mapping ψ such that ψ ◦ ψ = φ. The next theorem completely describes the situation when restricting to positive semidefinite mappings. 233 This is a system of two equations with six unknowns. Let c, d, e, f ∈ R be the free variables. Then we get the solution a = 10c − 21 2 d − 19 2 e + 5 2 f , b = −5c + 7 2 d + 5 2 e − 3 2 f . Hence W⊥ consists of matrices of the form C = ( 10c − 21 2 d − 19 2 e + 5 2 f −5c + 7 2 d + 5 2 e − 3 2 f c d e f ) with c, d, e, f ∈ R. We see that C = c ( 10 −5 1 0 0 0 ) + d ( −21/2 7/2 0 1 0 0 ) +e ( −19/2 5/2 0 0 1 0 ) + f ( 5/2 −3/2 0 0 0 1 ) = cW1 + dW2 + eW3 + fW4 , where we denote the matrices appearing above by W1, W2, W3, W4, respectively. This shows that W1, . . . , W4 generate W⊥ . Actually, they provide a basis of W⊥ . For a quick verification of their linear independence, we use again the isomorphism Mat2,3(R) ∼= R6 and proceed with Sage, as before: V = RR^6 w1 = vector(RR, [10, -5, 1, 0, 0, 0]) w2 = vector(RR, [-21/2, 7/2, 0, 1, 0, 0]) w3 = vector(RR, [-19/2, 5/2, 0, 0, 1, 0]) w4 = vector(RR, [5/2, -3/2, 0, 0, 0, 1]) V.linear_dependence([w1, w2, w3, w4]) == [] □ Invertible linear transformations naturally intersect with group theory, a topic that we briefly introduced in Chapter 1 and will explore in greater detail in Chapter 12. This intersection forms the basis of “matrix groups”, which are of particular interest in this context. Next we will focus on the group of all invertible linear mappings from Rn to Rn , known as the “real general linear group”, denoted by Gln(R). The operation that defines this group is the composition of linear mappings. Equivalently, Gln(R) can be described as the group of all invertible n×n matrices with real entries, where the group operation is the matrix multiplication, i.e., Gln(R) = {A ∈ Matn(R) : det(A) ̸= 0} . On the other hand, a “matrix group” is defined as a closed subgroup of Gln(R).14 An example of a matrix group 14Matrix groups are special cases of the well-known “Lie groups”. 
In simple terms, a Lie group is a group equipped with a compatible differentiable structure, also known as a smooth manifold. Lie groups were introduced by the Norwegian mathematician Sophus Lie (1842-1899) during the late 19th century, shortly after the discovery of non-Euclidean geometries. Lie referred to them as “continuous symmetry groups”. During this period, Lie collaborated with F. Klein, and together they significantly altered perspectives in geometry and the theory of differential equations. Today, the theory of Lie groups and Lie algebras has become a fundamental area in differential geometry with a wide range of applications. It’s worth noting that not all Lie groups are matrix groups, indicating the potential complexity of these structures. CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Positive semidefinite square roots Theorem. For each positive semidefinite square matrix B, there is the uniquely defined positive semidefinite square root √ B. If P is any matrix such that P−1 BP = D is diagonal, then √ B = P √ DP−1 , where D has got the (non-negative) eigenvalues of B on its diagonal and √ D is the matrix with the positive square roots of these values on its diagonal. Proof. Since B is a matrix of a self-adjoint mapping φ, there is even an orthonormal P as in the theorem (cf. Theorem 3.4.7) with all eigenvalues in the diagonal of D non-negative. Consider C = √ B as defined in the second claim and notice that in- deed C2 = P √ DP−1 P √ DP−1 = PDP−1 = B. Thus the mapping ψ given by C must have the same eigenvectors as φ and thus these two mappings share the decompositions of Kn into mutually orthogonal eigenspaces. In particular, both of them will share the bases in which they have diagonal matrices and thus the definition of √ D must be unique in each such basis. This proves that the definition of √ B does not depend on our particular choice of the diagonalization of φ. □ Notice there could be a lot of different roots, if we relax the positivity condition on √ B (e.g., we may choose the signs in the diagonal matrix D). 3.4.10. Spectra and nilpotent mappings. We return to the behavior of linear mappings in full generality. We continue to work with real or complex vector spaces, but without necessarily fixing a scalar product there. Recall that the spectrum of a linear mapping f : V → V is a sequence of roots of the characteristic polynomial of the mapping f, counting multiplicities. The algebraic multiplicity of an eigenvalue is its multiplicity as a root of the characteristic polynomial. The geometric multiplicity of an eigenvalue is the dimension of the corresponding subspace of eigenvec- tors. A linear mapping f : V → V is called nilpotent, if there exists an integer k ≥ 1 such that the iterated mapping fk is identically zero. The smallest k with such a property is called the degree of nilpotency of the mapping f. The mapping f : V → V is called cyclic, if there exists the basis (u1, . . . , un) of the space V such that f(u1) = 0 and f(ui) = ui−1 for all i = 2, . . . , n. In other words, the matrix of f in this basis is of the form A =    0 1 0 . . . 0 0 1 . . . ... ... ...    . 234 is the “orthogonal group” O(n), which consists of all linear transformations φ : Rn → Rn of Rn preserving the standard Euclidean product, i.e., ⟨φ(u), φ(v)⟩ = ⟨u, v⟩, for all u, v ∈ Rn . We studied such endomorphisms in the end of Chapter 2 and we learned that they correspond to orthogonal matrices (see 2.4.6 and 2.D.11). 
Hence, in terms of matrices, O(n) consists of all n × n orthogonal matrices, i.e., O(n) = {A ∈ Gln(R) : A−1 = AT }. 3.D.18. Prove that O(n) is a group and a subgroup of Gln(R). Additionally, demonstrate that the determinant of any matrix A ∈ O(n) is either 1 or −1. Solution. Obviously, the n × n identity matrix E belongs to O(n) and this is the corresponding identity element of the group. Thus, to verify that O(n) is a group it remains to prove that the composition of two orthogonal transformations is orthogonal, and that the inverse of an orthogonal transformation is again orthogonal. By the conclusion in Problem 2.D.11, one can equivalently work with orthogonal matrices. Let A, B ∈ O(n). Then we see that (AB)T AB = BT AT AB = BT B = E , AB(AB)T = ABBT AT = AAT = E , (A−1 )T A−1 = (AT )−1 A−1 = (AAT )−1 = E−1 = E , A−1 (A−1 )T = A−1 (AT )−1 = (AT A)−1 = E−1 = E , and these relations certify our claim. Finally recall that the matrix multiplication is associative. Now, given a group (G, ◦), a non-empty subset K of G which is closed under composition and taking inverses (with respect to the restriction of the group operation ◦ to K), is called a subgroup of G, see also 12.4.1 for more details. To demonstrate that a (non-empty) subset K ⊂ G of G is a subgroup of G, we need to show that a ◦ b−1 ∈ K, for any two elements a, b ∈ K. Let A, B ∈ O(n) be two orthogonal matrices. By the previous assertion we know that B−1 ∈ O(n) as well, and hence AB−1 ∈ O(n). Since O(n) is also a subset of Gln(R), our claim follows. For the determinant, let A ∈ Matn(R) be an orthogonal matrix, i.e., AAT = E = AT A. Hence det(AAT ) = det(E) = 1. However, det(AAT ) = det(A) det(AT ) = det(A)2 , thus det(A)2 = 1, i.e., det(A) = ±1. □ The special orthogonal group SO(n) is the subgroup of O(n) consisting of orthogonal matrices with determinant one, SO(n) = {A ∈ O(n) : det(A) = 1}. Obviously, for n = 1 we have O(1) ∼= Z2 and the group SO(1) is trivial. For n = 2 it is not hard to prove that the group SO(2) is isomorphic (as a group) to the unit circle S1 = {z ∈ C : |z| = 1}. For n = 3 we have already described examples of special orthogonal matrices in Chapter 2, e.g., rotations on R3 about an axis through the origin, which indeed have matrices lying in the special orthogonal group SO(3) (see 2.D.13). Let us further emphasize on rotations on the 3-dimensional Euclidean CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS If f(v) = a v, then fk (v) = ak · v for every natural k. Note that, the spectrum of nilpotent mapping can contain only the zero scalar (and this is always present). By the definition, every cyclic mapping is nilpotent. Moreover, its degree of nilpotency equals the dimension of the space V . The derivative operator on polynomials, D(xk ) = kxk−1 , is an example of a cyclic mapping on the spaces Kn[x] of all polynomials of degree at most n over the scalars K. Perhaps surprisingly, this is also true the other way round – every nilpotent mapping is a direct sum of cyclic mappings. A proof of this claim takes much work. So we formulate first the results we are aiming at, and only then come back to the technical work. In the resulting theorem describing the Jordan decomposition, the crucial role is played by vector (sub)spaces and linear mappings with a single eigenvalue λ given by the ma- trix (1) J =     λ 1 0 . . . 0 0 λ 1 . . . 0 ... ... ... ... 0 0 0 . . . λ     . These matrices (and the corresponding invariant subspaces) are called Jordan blocks.5 Jordan canonical form Theorem. 
Let $V$ be a real or complex vector space of dimension $n$ and let $f : V \to V$ be a linear mapping with $n$ eigenvalues (in the chosen domain of scalars), counting algebraic multiplicities. Then there exists a unique decomposition of the space $V$ into a direct sum of subspaces $V = V_1 \oplus \cdots \oplus V_k$ such that not only $f(V_i) \subset V_i$, but the restriction of $f$ to each $V_i$ has a single eigenvalue $\lambda_i$, and the restriction $f - \lambda_i\,\mathrm{id}_{V_i}$ to $V_i$ is either cyclic or the zero mapping. In particular, there is a suitable basis in which $f$ has a block-diagonal matrix $J$ with Jordan blocks along the diagonal.
We say that the matrix $J$ from the theorem is in Jordan canonical form. In the language of matrices, we can rephrase the theorem as follows:
Corollary. For each square matrix $A$ over complex scalars, there is an invertible matrix $P$ such that $A = P^{-1} J P$ and $J$ is in Jordan canonical form. The matrix $P$ is the transition matrix to the basis from the theorem above.
Notice that the total number of ones above the diagonal in $J$ equals the difference between the total algebraic and geometric multiplicities of the eigenvalues. The ordering of the blocks in the matrix corresponds to the chosen ordering of the subspaces $V_i$ in the direct sum. Thus the matrix $J$ is unique up to the ordering of the Jordan blocks, and there is a corresponding freedom in the choice of the basis for the Jordan canonical form.
5 Camille Jordan was a famous French mathematician working in analysis and algebra at the end of the 19th and the beginning of the 20th centuries.
space, by presenting some additional exercises; see also Section F for further material (cf. 3.F.35, 3.F.37).
3.D.19. Write down the matrices of the rotations by the angle $\theta$ about the (oriented) axes $x$, $y$ and $z$ in $\mathbb{R}^3$.(15) ⃝
15 The matrices presented in the solution are well known and commonly used in 3D graphics, robotics, and other fields involving three-dimensional transformations.
The concept of linear transformations on real vector spaces naturally extends to complex vector spaces. In particular, the groups $O(n)$ and $SO(n)$ have counterparts in the complex case, known as the "unitary group" $U(n)$ and the "special unitary group" $SU(n)$. These groups are closed subgroups of the complex general linear group $\mathrm{Gl}_n(\mathbb{C})$, which serves as the starting point in this context, replacing $\mathrm{Gl}_n(\mathbb{R})$. As a result, they are also matrix groups (note that when identifying $\mathbb{C}^n \cong \mathbb{R}^{2n}$, the group $\mathrm{Gl}_n(\mathbb{C})$ can be viewed as a subgroup of $\mathrm{Gl}_{2n}(\mathbb{R})$). The unitary group consists of all complex linear mappings $\psi : \mathbb{C}^n \to \mathbb{C}^n$ preserving the standard Hermitian form $\langle\,,\rangle$ on $\mathbb{C}^n$. Equivalently, it can be viewed as the matrix group
$$U(n) = \{U \in \mathrm{Gl}_n(\mathbb{C}) : U^{-1} = U^* = \bar U^T\}.$$
Just as $SO(n)$ is defined as a subgroup of $O(n)$, the special unitary group $SU(n)$ is defined as the subgroup of $U(n)$ consisting of unitary matrices with determinant one.
3.D.20. As we said above, an $n \times n$ unitary matrix $U$ over $\mathbb{C}$ is defined as one preserving the standard Hermitian form $\langle\,,\rangle$ on $\mathbb{C}^n$, i.e., $\langle Ux, Uy\rangle = \langle x, y\rangle$ for any $x, y \in \mathbb{C}^n$. Show that this is equivalent to saying that $U^*U = UU^* = E$. Next prove that $U(n)$ is a group, and that the determinant of any unitary matrix is a complex unit. ⃝
3.D.21. Unitary matrices. Determine which of the matrices listed below are unitary:
$$A = \begin{pmatrix} \frac{1}{\sqrt 2} & \frac{1}{\sqrt 2} \\ \frac{i}{\sqrt 2} & -\frac{i}{\sqrt 2} \end{pmatrix}, \qquad B = \frac12\begin{pmatrix} 1+i & \sqrt 2 \\ 1-i & \sqrt 2\, i \end{pmatrix}, \qquad C = \begin{pmatrix} 1 & -i \\ -1 & i \end{pmatrix}.$$
Solution. Let us use Sage to analyze the matrix A. We want to determine whether the products $A^*A$ and $AA^*$ yield the identity matrix. With this goal in mind, we type:

A = matrix(SR, [[1/sqrt(2), 1/sqrt(2)],
                [I/sqrt(2), -I/sqrt(2)]])
show(A)
A_her = A.conjugate_transpose()
show(A_her)
bool(A_her*A == identity_matrix(2)) and bool(A*A_her == identity_matrix(2))

The command A.conjugate_transpose() returns the conjugate transpose of A and has the shortcut A.H. Executing this block we confirm that A is a unitary matrix.
As mentioned before, a more direct method uses the command A.is_unitary(), which for our cases returns True. Alternatively, we can assess the orthonormality of the matrix's columns (or rows) with respect to the standard Hermitian form on $\mathbb{C}^n$ (for an $n \times n$ matrix). This is analogous to evaluating orthogonality in real matrices.
3.4.11. Remarks. The existence of the Jordan canonical form is clear in the cases when all eigenvalues are either distinct or when the geometric and algebraic multiplicities of the eigenvalues coincide, and the definition of normal mappings requires exactly this behavior. In particular, this is the case for all unitary and self-adjoint mappings on unitary vector spaces, and the Jordan canonical form of a normal mapping is always diagonal.
A consequence of the Jordan canonical form theorem is that for every linear mapping $f$, every eigenvalue of $f$ uniquely determines an invariant subspace corresponding to all Jordan blocks with this particular eigenvalue. We shall call this subspace the root subspace corresponding to the given eigenvalue.
We mention one useful corollary of the Jordan theorem (which is already used in the discussion about the behavior of Markov chains). Assume that the eigenvalues of our mapping $f$ are all of absolute value less than one. Then repeated application of the linear mapping to any vector $v \in V$ makes all the coordinates of $f^k(v)$ decrease to zero. Indeed, assume $f$ has only one eigenvalue $\lambda$ on all of the complex space $V$ and that $f - \lambda\,\mathrm{id}_V$ is cyclic (that is, we consider only one Jordan block separately). Let $v_1, \dots, v_\ell$ be the corresponding basis. Then the theorem says that $f(v_2) = \lambda v_2 + v_1$, $f^2(v_2) = \lambda^2 v_2 + \lambda v_1 + \lambda v_1 = \lambda^2 v_2 + 2\lambda v_1$, and similarly for the other $v_i$'s and higher powers. In any case, the iteration of $f$ results in higher and higher powers of $\lambda$ at all non-zero components. The smallest of these powers can differ from the largest one by less than the dimension of $V$, and the coefficients are bounded too. This proves the claim. The same argument can be used to prove that a mapping with all eigenvalues of absolute value strictly greater than one leads to unbounded growth of all coordinates of the iterations $f^k(v)$.
The remainder of this part of the third chapter is devoted to the proof of the Jordan theorem and a few necessary lemmas. It is much more difficult than anything so far; the reader may skip it and continue at the beginning of the fifth part of this chapter in case of any problems reading it.
3.4.12. Root spaces. We have already seen by explicit examples that the eigensubspaces completely describe the geometric properties only for some linear mappings. Thus we now introduce a more subtle tool, the root subspaces.
Definition. A non-zero vector $u \in V$ is called a root vector of the linear mapping $\varphi : V \to V$ if there exists an $a \in \mathbb{K}$ and an integer $k > 0$ such that $(\varphi - a\,\mathrm{id}_V)^k(u) = 0$.
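The definition is easy to probe computationally. Here is a tiny Sage illustration of our own: for a single 2 × 2 Jordan block the eigenspace is one-dimensional, while the iterated kernels grow until they fill the whole root subspace.

A = matrix(QQ, [[2, 1], [0, 2]])
(A - 2).right_kernel().dimension()      # 1: only the eigenvectors of the eigenvalue 2
((A - 2)^2).right_kernel().dimension()  # 2: every non-zero vector is a root vector here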
These arguments enable us to conclude that the matrix B is also unitary, while the matrix C is not. We will demonstrate the verification process for the matrix B and leave the other case for practice. For the matrix B, the column vectors are given by $u_1 = \big(\frac{1+i}{2}, \frac{1-i}{2}\big)^T$ and $u_2 = \big(\frac{\sqrt 2}{2}, \frac{\sqrt 2\,i}{2}\big)^T$. Thus
$$\|u_1\| = \sqrt{\frac{1+i}{2}\cdot\frac{1-i}{2} + \frac{1-i}{2}\cdot\frac{1+i}{2}} = 1, \qquad \|u_2\| = \sqrt{\frac{\sqrt 2}{2}\cdot\frac{\sqrt 2}{2} - i^2\cdot\frac{\sqrt 2}{2}\cdot\frac{\sqrt 2}{2}} = 1,$$
$$\langle u_1, u_2\rangle = \frac{1+i}{2}\cdot\frac{\sqrt 2}{2} - \frac{1-i}{2}\cdot\frac{i\sqrt 2}{2} = 0.$$
Hence B is unitary. Let us also proceed with Sage, applying one of the methods described above. For example, executing the command

B = matrix(SR, [[(1+I)/2, sqrt(2)/2],
                [(1-I)/2, sqrt(2)*I/2]])
B.is_unitary()

Sage returns True. □
Let $V$, $W$ be two unitary spaces. Given a linear map $\psi : V \to W$, its "adjoint" is the linear map $\psi^* : W \to V$ defined by $\langle \psi(u), w\rangle_W = \langle u, \psi^*(w)\rangle_V$ for all $u \in V$ and $w \in W$. By the discussion in 3.4.5 we know that if $A$ is the matrix of $\psi$ in orthonormal bases, then the conjugate transpose $A^*$ is the matrix of $\psi^*$. When $\psi = \psi^*$, or equivalently when $A$ is Hermitian, i.e., $A = A^*$, then $\psi$ is called "self-adjoint", see 3.4.6.(16) The following series of exercises explores these objects, starting with the properties of the adjoint operator.
16 Self-adjoint operators and Hermitian matrices are fundamental in various scientific and engineering disciplines. In quantum mechanics, they represent observable quantities such as energy, momentum, and position, with their real eigenvalues being of special importance. In numerical analysis, self-adjoint operators are used to solve eigenvalue problems, which are essential for simulations and modelling in engineering and physics. In computer science, Hermitian matrices find applications in machine learning, where they help in dimensionality reduction and data analysis by identifying important features in large datasets.
3.D.22. Consider two linear mappings $\varphi, \psi : V \to V$ on a complex unitary space $(V, \langle\,,\rangle)$. Show that the adjoint operators of $\varphi$ and $\psi$ satisfy the following properties:
• $(c\varphi)^* = \bar c\,\varphi^*$, for any $c \in \mathbb{C}$;
• $(\varphi + \psi)^* = \varphi^* + \psi^*$;
• $(\varphi \circ \psi)^* = \psi^* \circ \varphi^*$;
• $(\varphi^*)^* = \varphi$;
• if $\varphi$ is invertible, then so is $\varphi^*$, with $(\varphi^*)^{-1} = (\varphi^{-1})^*$. ⃝
This means that the $k$-th iteration of the given mapping sends $u$ to zero. The set of all root vectors corresponding to a fixed scalar $\lambda$, together with the zero vector, is called the root subspace associated with the scalar $\lambda \in \mathbb{K}$. We denote it by $R_\lambda$.
If $u$ is a root vector and the integer $k$ from the definition is chosen as small as possible for $u$, then $(\varphi - a\,\mathrm{id}_V)^{k-1}(u)$ is an eigenvector with the eigenvalue $a$. Thus we have $R_\lambda = \{0\}$ for all scalars $\lambda$ which are not in the spectrum of the mapping $\varphi$.
Proposition. Let $\varphi : V \to V$ be a linear mapping. Then
(1) $R_\lambda \subset V$ is a vector subspace for every $\lambda \in \mathbb{K}$,
(2) for every $\lambda, \mu \in \mathbb{K}$, the subspace $R_\lambda$ is invariant with respect to the linear mapping $\varphi - \mu\,\mathrm{id}_V$; in particular, $R_\lambda$ is invariant with respect to $\varphi$,
(3) if $\mu \neq \lambda$, then $(\varphi - \mu\,\mathrm{id}_V)|_{R_\lambda}$ is invertible,
(4) the mapping $(\varphi - \lambda\,\mathrm{id}_V)|_{R_\lambda}$ is nilpotent.
Proof. (1) Checking the properties of a vector subspace is easy and is left to the reader.
(2) Assume that $(\varphi - \lambda\,\mathrm{id}_V)^k(u) = 0$ and put $v = (\varphi - \mu\,\mathrm{id}_V)(u)$. Then
$$(\varphi - \lambda\,\mathrm{id}_V)^k(v) = (\varphi - \lambda\,\mathrm{id}_V)^k\big((\varphi - \lambda\,\mathrm{id}_V) + (\lambda - \mu)\,\mathrm{id}_V\big)(u) = (\varphi - \lambda\,\mathrm{id}_V)^{k+1}(u) + (\lambda - \mu)\,(\varphi - \lambda\,\mathrm{id}_V)^k(u) = 0.$$
(3) If $u \in \operatorname{Ker}(\varphi - \mu\,\mathrm{id}_V)|_{R_\lambda}$, then $(\varphi - \lambda\,\mathrm{id}_V)(u) = (\varphi - \mu\,\mathrm{id}_V)(u) + (\mu - \lambda)u = (\mu - \lambda)u$. This implies $0 = (\varphi - \lambda\,\mathrm{id}_V)^k(u) = (\mu - \lambda)^k u$, and thus $u = 0$ for $\lambda \neq \mu$.
(4) Choose a basis e1, . . .
, ep of the subspace Rλ. By definition, there exist integers ki such that (φ − λ idV )ki (ei) = 0. In particular, the entire mapping (φ − λ idV )|Rλ must be nilpotent. □ 3.4.13. Quotient spaces. Our next aim is to show that the dimension of the root spaces always equals the algebraic multiplicity of the corresponding eigenvalues. First, we introduce some general useful technical tools. Quotient spaces Definition. Let U ⊂ V be a vector subspace. Define an equivalence relation on the set of all vectors in V by v1 ∼ v2 if and only if v1 − v2 ∈ U. Axioms of equivalence are easy to check. The set V/U of the classes of this equivalence is equipped by the operations defined by using representatives. That is, for classes [u] and [v] determined by the vectors u and v, set [v] + [w] = [v + w], a [u] = [a u]. This is a well defined vector space called the quotient vector space of the space V by the subspace U. Check the correctness of the definition of the operations and verify all axioms of the vector space in detail! The classes (vectors) in the quotient space V/U will often be denoted as formal sums of one representative with all 237 3.D.23. Let φ : C2 → C2 the linear mapping given by φ (( z1 z2 )) = ( iz1 + 2z2 z1 − iz2 ) . Determine the matrix of φ with respect to the standard basis of C2 and find its adjoint operator φ∗ . Deduce that φ is not self-adjoint. Next solve the task in Sage. Solution. Consider the standard orthonormal basis e = {e1, e2} of C2 (with respect to the standard Hermitian form ⟨ , ⟩). Recall that if A = [φ]e is the matrix of φ with respect to e and A = (aij), then we have aij = ⟨φ(uj), ui⟩. We compute φ(e1) = (i, 1)T , φ(e2) = (2, −i)T and hence a11 = ⟨φ(e1), e1⟩ = i, a12 = ⟨φ(e2), e1⟩ = 2, a21 = ⟨φ(e1), e2⟩ = 1 and a22 = ⟨φ(e2), e2⟩ = −i. Directly A = [φ]e = ( φ(e1) φ(e2) ) = ( i 2 1 −i ) . Then, for the matrix A∗ = ¯AT we compute A∗ = ( −i 1 2 i ) . Consequently, the adjoint of φ is given by φ∗ (( z1 z2 )) = ( −iz1 + z2 2z1 + iz2 ) . To verify this result we will use defining equation of φ∗ , that is, ⟨φ(u), v⟩ = ⟨u, φ∗ (v)⟩, for any two vectors u = (z1, z2)T and v = (z3, z4)T of C2 . We compute ⟨( iz1 + 2z2 z1 − iz2 ) , ( z3 z3 )⟩ = (iz1 + 2z2)¯z3 + (z1 − iz2)¯z4 = iz1 ¯z3 + 2z2 ¯z3 + z1 ¯z4 − iz2 ¯z4 , ⟨( z1 z2 ) , ( −iz3 + z4 2z3 + iz4 )⟩ = z1(−iz3 + z4) + z2(2z3 + iz4) = z1(i¯z3 + ¯z4) + z2(2¯z3 − i¯z4) = iz1 ¯z3 + z1 ¯z4 + 2z2 ¯z3 − iz2 ¯z4 . Finally, since A ̸= A∗ it follows that φ is not self-adjoint. In Sage a solution goes as follows: # Define the matrix A of the linear map phi A = Matrix(QQbar, [[I, 2], [1, -I]]) # Display the matrix A show("Matrix A of phi:") show(A) # Compute the adjoint of A A_adjoint = A.H # Display the adjoint matrix show("Adjoint matrix A*:") show(A_adjoint) # Check if A is self-adjoint is_self_adjoint = (A == A_adjoint) # Display whether A is self-adjoint show("Is A self-adjoint?") show(is_self_adjoint) □ 3.D.24. The numerical system QQbar in Sage. Observe that the Sage solution presented in 3.D.23 uses the option QQbar to define the matrix A. This represents the numerical system of algebraic complex numbers. The field of algebraic CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS vectors in the subspace U, for instance u+U ∈ V/U, u ∈ V . The class 0 + U is the zero vector in V/U, i.e. the vector u ∈ V represents the zero element in V/U if and only if u ∈ U. Trivial examples are V/{0} ∼= V , V/V ∼= {0}. 
Another example is the quotient space of the plane R2 factored by any one-dimensional subspace (here, every one-dimensional subspace U ⊂ R2 is a line passing through the origin). Then the equivalence classes are all the lines parallel to this line. Proposition. Let U ⊂ V be a vector subspace and (u1, . . . , un) be a basis of V , such that (u1, . . . , uk) is a basis of U. Then dim V/U = n − k and the vectors uk+1 + U, . . . , un + U form a basis of V/U. Proof. V = span{u1, . . . , un}, so V/U = span{u1 + U, . . . , un + U}. But the first k generators are zero, thus V/U = span{uk+1 + U, . . . , un + U}. Assume that the linear combination ak+1 (uk+1 + U) + · · · + an (un + U) = (ak+1 uk+1 +· · · +an un)+U = 0 ∈ V/U vanishes. Equivalently, this linear combination of the vectors uk+1, . . . , un belongs to the subspace U. Since U is generated by the remaining vectors in the basis of V , the latter linear combination is necessarily zero, and so all coefficients ai are zero. This proves the linear independence of the generators of V/U. □ 3.4.14. Induced mappings on quotient spaces. Assume that U ⊂ V is an invariant subspace with respect to linear mapping φ : V → V and choose basis u1, . . . , un of the space V such that the first k vectors of this basis is a basis of U. With this basis, φ has block matrix A = ( B C 0 D ) . Then we can prove the following lemma: Lemma. (1) the mapping φ induces a linear mapping φV/U : V/U → V/U, φV/U (v+U) = φ(v)+U with the matrix D under the induced basis uk+1 +U, . . . , un +U on V/U, (2) the characteristic polynomial of φV/U divides the characteristic polynomial of φ. Proof. For v, w ∈ V , u ∈ U, a ∈ K we have φ(v+u) ∈ φ(v)+U (because U is invariant), (φ(v)+U)+(φ(w)+U) = φ(v +w)+U and a (φ(v)+U) = a φ(v)+U = φ(a v)+U (because φ is linear), thus the mapping φV/U is well-defined and linear. Moreover the very definition of the matrix of a mapping in a basis implies that the matrix of φV/U in the induced basis on V/U is exactly the matrix D (when counting the images of the basis elements the coefficients of the matrix C add only to the class U). The characteristic polynomial of the induced mapping φV/U is thus |D−λ E|, while characteristic polynomial of the original mapping φ is |A − λ E| = |B − λ E||D − λ E|. □ 238 numbers, usually denoted by ¯Q, is formed by adjoining the rational numbers Q with the roots of all polynomial equations with rational coefficients. In other words, an algebraic number is a complex number that is a root of a non-zero polynomial equation with rational coefficients. Thus, in Sage QQbar is an extension of the numerical system QQ, enabling exact computations with algebraic numbers. However, the result would be the same if we used CC, instead of QQbar, as both numerical systems approximate complex numbers. Note however that QQbar allows for exact algebraic computations, while CC uses floating-point arithmetic for complex numbers. In particular, when displaying complex matrices with QQbar, then each element is shown in its exact algebraic form, preserving the precision of computations. In contrast, CC displays complex matrices using floating-point approximations, which can introduce rounding errors but are generally more efficient for large-scale computations. 3.D.25. 
Use Sage to determine which of the following matrices is Hermitian: A =   √ 2 i 1 − i −i 10 √ 2 + i 1 + i √ 2 − i 0   , B = ( a i + b −i + b √ |a| ) , where a, b ∈ R , C =   1 8i 1 − i √ 5 −8i 4z 0 1 + i √ 5 0 1   , where z ∈ C with Im(z) = 0 , D =   i √ 2 i 1 − i −i 0 4 + i 1 + i 4 − i 0   . ⃝ 3.D.26. Prove that with respect to the Frobenius inner product B(A, B) = tr(B∗ A) on Matn(C) the following statement is true: Two matrices A, B ∈ Matn(C) satisfying B = U∗ AU for some n×n unitary matrix U, are such that ∥A∥2 B = ∥B∥2 B. ⃝ 3.D.27. Consider the linear map φ : R2 → R3 defined by φ (( x y )) =   √ 2x + y x − y 2y   . Compute its adjoint operator φ∗ : R3 → R2 . Next examine the linear operator ψ : C3 → C3 whose matrix with respect to the standard basis of C3 is given by B =   10 5 + 5i 3 + 2i 5 − 5i 5 √ 2i 3 − 2i − √ 2i 0   . CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Corollary. Let V be a vector space over K of dimension n and let φ : V → V be a linear mapping whose spectrum contains n elements (that is, all roots of the characteristic polynomial lie in K and we count their multiplicities). Then there exists a sequence of invariant subspaces {0} = V0 ⊂ V1 ⊂ · · · ⊂ Vn = V with dimensions dim Vi = i. Consider a basis u1, . . . , un of the space V such that Vi = span{u1, . . . , ui}. In this basis, the matrix of the mapping φ is an upper triangular matrix:    λ1 . . . ∗ ... ... ... 0 . . . λn    , with the spectrum λ1, . . . , λn on the diagonal. Proof. The subspaces Vi are constructed inductively. Let {λ1, . . . , λn} be the spectrum of the mapping φ. Thus the characteristic polynomial of the mapping φ is of the form ±(λ − λ1) · · · (λ − λn). We choose V0 = {0}, V1 = span{u1}, where u1 is an eigenvector with eigenvalue λ1. According to the previous theorem, the characteristic polynomial of the mapping φV/V1 is of the form ±(λ − λ2) · · · (λ − λn). Assume that we have already constructed linearly independent vectors u1, . . . , uk and invariant subspaces Vi = span{u1 . . . , ui}, i = 1, . . . , k < n such that the characteristic polynomial of φV/Vk is of the form ±(λ − λk+1) · · · (λ − λn) and φ(ui) ∈ (λi · ui + Vi−1) for all i = 1, . . . , k. We want to add one more vector uk+1 with analogous properties. There exists an eigenvector uk+1 +Vk ∈ V/Vk of the mapping φV/Vk with the eigenvalue λk+1. Consider the space Vk+1 = span{u1, . . . , uk+1}. If the vector uk+1 is a linear combination of the vectors u1, . . . , uk then uk+1 + Vk would be the zero class in V/Vk. But this is not possible. Thus dim Vk+1 = k + 1. It remains to study the induced mapping φV/Vk+1 . The characteristic polynomial of this mapping is of degree n − k − 1 and divides the characteristic polynomial of the mapping φ. But completing the vectors u1, . . . , uk+1 to the basis of V yields a block matrix of the mapping φ with an upper triangular submatrix B in the left upper corner and zero in the left lower corner. The diagonal elements are exactly the scalars λ1, . . . , λk+1. Therefore the roots of the characteristic polynomial of the induced mapping have the required properties. □ Remark. If V decomposes into the direct sum of eigensubspaces for φ, the latter results do not say anything new. But their significance consists in the fact, that only the existence of dim V roots of the characteristic polynomial (counting multiplicities) is assumed. This is ensured whenever the field K is algebraically closed, for instance the complex numbers C. 
As a direct consequence we see that the determinant and the trace of the mapping $\varphi$ are always the product and the sum, respectively, of the elements of the spectrum. This can also be used for all real matrices: just consider them as complex matrices, and calculate the determinant or the trace as the product or the sum of the eigenvalues. Because both the determinant and the trace are algebraic expressions in the entries of the matrix, the results will be correct.
3.4.15. Orthogonal triangulation. If we are given a scalar product on a vector space $V$ and $U \subset V$ is a subspace, then clearly $V/U \simeq U^\perp$, where $v \in U^\perp$ is identified with $v + U$. Moreover, each class of the quotient space $V/U$ contains exactly one vector from $U^\perp$ (the difference of two such vectors lies in $U \cap U^\perp$).
Demonstrate that this operator is self-adjoint. ⃝
3.D.28. Derive the adjoint of the trace $\operatorname{tr} : \mathrm{Mat}_m(\mathbb{F}) \to \mathbb{F}$ with respect to the Frobenius inner product $\mathcal{B}(A, B) = \operatorname{tr}(B^*A)$, for $\mathbb{F} \in \{\mathbb{R}, \mathbb{C}\}$. ⃝
3.D.29. Let $(V, \langle\,,\rangle)$ be an inner product space and let $W \subset V$ be a subspace of $V$. Show that the orthogonal projection $\operatorname{proj}_W : V \to V$ onto $W$ is self-adjoint. ⃝
Recall that a linear endomorphism $f : V \to V$ on a (finite-dimensional) vector space $V$ is diagonalizable if and only if $V$ admits a basis $B$ consisting of eigenvectors of $f$. When $V$ is equipped with a scalar product $\langle\,,\rangle$ and given an endomorphism $f$, it is natural to ask whether there exists an orthonormal basis of eigenvectors for $f$. If such a basis exists, the matrix $[f]_B$ of $f$ with respect to $B$ is diagonal, and the same holds for the matrix associated with the adjoint $f^*$ of $f$. Since diagonal matrices commute, this implies $f \circ f^* = f^* \circ f$. Linear operators (or matrices) satisfying this relation are called "normal". For instance, orthogonal or unitary matrices are normal, but there are many other types of normal matrices as well; see below. It turns out that on a complex unitary space $(V, \langle\,,\rangle)$ an endomorphism $f : V \to V$ is normal if and only if $V$ has an orthonormal basis consisting of eigenvectors of $f$. It is important to note that this result does not hold for real vector spaces: in the case of a real Euclidean space, the existence of an orthonormal basis of eigenvectors for a linear operator $f$ is equivalent to $f$ being self-adjoint, i.e., $f = f^*$.
3.D.30. Prove that orthogonal and unitary matrices are normal. Next demonstrate that not all normal matrices (or transformations) are orthogonal or unitary by providing a counterexample, that is, a normal matrix that is neither orthogonal nor unitary. ⃝
Let $\varphi : V \to V$ be a normal operator on a unitary space $(V, \langle\,,\rangle)$. By 3.4.8 and the proof given in the main theorem of that paragraph, we know that eigenvectors corresponding to distinct eigenvalues of $\varphi$ are orthogonal with respect to $\langle\,,\rangle$. Let us illustrate this important situation by examples.
3.D.31. Suppose that $\varphi : V \to V$ is a normal operator on a (finite-dimensional) unitary space $(V, \langle\,,\rangle)$ having two distinct eigenvalues, namely $\lambda_1 = 3$ and $\lambda_2 = 5$. Assume also that the corresponding eigenvectors are of unit length. Use the Pythagorean theorem to prove that there exists a vector $u \in V$ with $\|u\| = \sqrt 2$ and $\|\varphi(u)\| = \sqrt{34}$, where $\|\cdot\|$ is the norm induced by $\langle\,,\rangle$.
Solution. Suppose that $v, w \in V$ are eigenvectors of $\varphi$ corresponding to $\lambda_1$ and $\lambda_2$, that is, $\varphi(v) = 3v$ and $\varphi(w) = 5w$, respectively. By assumption $\varphi$ is normal, hence eigenvectors corresponding to distinct eigenvalues are orthogonal: $\langle v, w\rangle = 0$. Then, for the vector $u = v + w$, the Pythagorean theorem gives $\|u\|^2 = \|v\|^2 + \|w\|^2 = 2$, that is, $\|u\| = \sqrt 2$. On the other hand, again by the Pythagorean theorem we obtain
$$\|\varphi(u)\| = \|\varphi(v + w)\| = \|\varphi(v) + \varphi(w)\| = \|3v + 5w\| = \sqrt{9 + 25} = \sqrt{34}. \qquad □$$
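In the same spirit, a small Sage experiment (our own sketch, anticipating exercise 3.D.32 below, whose matrix we borrow) exhibits the orthogonality of eigenvectors of a normal matrix across distinct eigenvalues:

A = matrix(QQbar, [[2, 1+I, 0], [1-I, 3, 0], [0, 0, 1]])
print(A*A.conjugate_transpose() == A.conjugate_transpose()*A)   # True, so A is normal
D, P = A.eigenmatrix_right()    # columns of P are eigenvectors, A*P == P*D
for i in range(3):
    for j in range(i+1, 3):
        if D[i,i] != D[j,j]:    # compare only across distinct eigenvalues
            print(P.column(i).hermitian_inner_product(P.column(j)) == 0)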
We can exploit this observation in every inductive step of the proof of the theorem above. Choose the representative $u_{k+1} \in V_k^\perp$ of the eigenvector of $\varphi_{V/V_k}$. This modification leads to an orthogonal basis with the properties required in the claim about triangulation in the corollary above. Therefore there exists such an orthonormal basis, and we arrive at a very important theorem:
Schur's orthogonal triangulation theorem
Theorem. Let $\varphi : V \to V$ be a linear mapping on a vector space with scalar product, and let there be $m = \dim V$ eigenvalues, counting multiplicities. Then there exists an orthonormal basis of the space $V$ such that the matrix of $\varphi$ in this basis is upper triangular with the eigenvalues $\lambda_1, \dots, \lambda_m$ on the diagonal.
3.4.16. Theorem. Let $\varphi : V \to V$ be a linear mapping and let $\lambda_1, \dots, \lambda_k$ be all its distinct eigenvalues. Then the sum of the root spaces $R_{\lambda_1}, \dots, R_{\lambda_k}$ is direct. Furthermore, for every eigenvalue $\lambda$ the dimension of the subspace $R_\lambda$ equals the algebraic multiplicity of $\lambda$.
Proof. We prove first the independence of nonzero vectors from different root spaces. We proceed by induction on the number $k$ of root spaces. The claim is obvious for $k = 1$. Assume that the theorem holds for fewer than $k > 1$ root spaces, and assume that vectors $u_1 \in R_{\lambda_1}, \dots, u_k \in R_{\lambda_k}$ satisfy $u_1 + \cdots + u_k = 0$. Then $(\varphi - \lambda_k\,\mathrm{id}_V)^j(u_k) = 0$ for a suitable $j$, and moreover all $y_i = (\varphi - \lambda_k\,\mathrm{id}_V)^j(u_i)$ are non-zero vectors in $R_{\lambda_i}$, $i = 1, \dots, k-1$, whenever the $u_i$ are non-zero, by Proposition 3.4.12. But at the same time
$$y_1 + \cdots + y_{k-1} = (\varphi - \lambda_k\,\mathrm{id}_V)^j\Big(\sum_{i=1}^k u_i\Big) = 0,$$
and, according to the inductive assumption, all the $y_i$ are zero. But then all $u_i$, $1 \le i < k$, must vanish, and thus $u_k = 0$ too. This proves the first claim.
It remains to consider the dimensions of the root spaces $R_\lambda$. Consider an eigenvalue $\lambda$ of $\varphi$, use the same notation $\varphi$ for the restriction $\varphi|_{R_\lambda}$, and write $\psi : V/R_\lambda \to V/R_\lambda$ for the mapping induced by $\varphi$ on the quotient space. Assume that the dimension of $R_\lambda$ is strictly smaller than the algebraic multiplicity of the root $\lambda$ of the characteristic polynomial. In view of Lemma 3.4.14, $\lambda$ is then also an eigenvalue of the mapping $\psi$.
3.D.32. Consider the normal operator $\varphi : \mathbb{C}^3 \to \mathbb{C}^3$ whose matrix with respect to the standard basis of $\mathbb{C}^3$ is given by
$$A = \begin{pmatrix} 2 & 1+i & 0 \\ 1-i & 3 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
Use Sage to illustrate the orthogonality of eigenvectors corresponding to distinct eigenvalues of $\varphi$. Next present a formal solution. ⃝
3.D.33. Skew-Hermitian matrices. Recall that a complex matrix $A = (a_{ij})$ is called skew-Hermitian if $A = -A^*$, or equivalently $a_{ij} = -\bar a_{ji}$. Prove that:
(i) every skew-Hermitian matrix $A$ is normal;
(ii) the eigenvalues of a skew-Hermitian matrix $A$ are purely imaginary, that is, they are of the form $i\mu$ with $\mu \in \mathbb{R}$.
Solution. (i) The first claim is obvious: if $A^* = -A$, then $A^*A = -A^2 = AA^*$. Hence $A$ is normal.
(ii) Suppose that $\lambda$ is an eigenvalue of $A$ with corresponding eigenvector $u$, that is, $Au = \lambda u$. It is sufficient to prove that $\bar\lambda = -\lambda$. First we see that $u^*Au = u^*(\lambda u) = \lambda u^*u = \lambda\|u\|^2$, and since $\|u\|^2$ is real, taking conjugate transposes we obtain
$$\bar\lambda\|u\|^2 = (u^*Au)^* = u^*A^*u = -u^*Au = -\lambda\|u\|^2.$$
This final relation yields the desired $\bar\lambda = -\lambda$. □
3.D.34. Present an example of a normal matrix which is neither Hermitian nor skew-Hermitian.
Moreover, use Sage to verify your answer. ⃝
Recall that an $n \times n$ matrix $A$ is diagonalizable if and only if it has $n$ linearly independent eigenvectors, and in this case $A$ is similar to the diagonal matrix $D$ consisting of the eigenvalues of $A$, i.e., $P^{-1}AP = D$. Here $P$ is the matrix formed by the eigenvectors of $A$ (as column vectors). A square matrix $A$ is said to be orthogonally diagonalizable when an orthogonal matrix $P$ can be found such that $P^{-1}AP = P^TAP$ is diagonal. To clarify the concept we provide an example; additional tasks on orthogonal diagonalization are discussed in Section F (see for example 3.F.46 for an implementation of orthogonal diagonalization using Sage).
3.D.35. (Orthogonal diagonalization) Consider the symmetric matrix $A$ given below. Find a matrix $P \in \mathrm{Mat}_3(\mathbb{R})$ such that $P^{-1}AP = P^TAP$ is diagonal:
$$A = \begin{pmatrix} 1 & 2 & 6 \\ 2 & 0 & 2 \\ 6 & 2 & 1 \end{pmatrix}. \qquad ⃝$$
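For a first orientation one can let Sage do the work; the following quick sketch is our own code, not the formal solution of 3.D.35, and the normalization step at the end is what turns the eigenvector matrix into an orthogonal one:

A = matrix(QQ, [[1, 2, 6], [2, 0, 2], [6, 2, 1]])
D, P = A.eigenmatrix_right()     # A*P == P*D; the eigenvalues appear on the diagonal of D
print(D.diagonal())              # here: 8, -1, -5 (in some order)
# eigenvectors of a symmetric matrix to distinct eigenvalues are orthogonal:
print(all(P.column(i)*P.column(j) == 0 for i in range(3) for j in range(i+1, 3)))
# dividing each column by its length yields an orthogonal matrix Q with Q^T*A*Q == D
Q = matrix([v/v.norm() for v in P.columns()]).transpose()
print((Q.transpose()*A*Q).simplify_full() == D)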
A real $n \times n$ symmetric matrix has real eigenvalues, and the same holds true for Hermitian matrices, see 3.4.7. However, the converse is not necessarily true, as demonstrated by the counterexample given by the matrix $A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$. Moreover, by the spectral decomposition theorem for self-adjoint operators (see the theorem in 3.4.7), the eigenvectors corresponding to distinct eigenvalues of a symmetric or a Hermitian matrix $A$ are orthogonal. Leveraging these properties, one can derive the following fundamental result, which is further motivated by the result in 3.D.35 mentioned above.
3.D.36. Prove that a real $n \times n$ matrix $A$ is orthogonally diagonalizable if and only if it is symmetric, i.e., $A = A^T$. ⃝
In a similar vein, a complex square matrix $A \in \mathrm{Mat}_n(\mathbb{C})$ is said to be "unitarily diagonalizable" if there exists a unitary matrix $U$ such that $U^*AU$ is diagonal. By the spectral theorem (see 3.4.7) we know that any Hermitian matrix $A$ is unitarily diagonalizable. Conversely, the following holds.
3.D.37. Suppose that $A$ is a complex matrix with real eigenvalues which can be diagonalized by a unitary matrix $U$. Show that $A$ is Hermitian.
Solution. By assumption we have $U^*AU = D$, where $D$ is the diagonal matrix whose diagonal entries are the eigenvalues of $A$. However, the hypothesis states that $A$ has real eigenvalues, so $D$ is a real diagonal matrix, which implies $D^* = D$. Now, since $U$ is unitary, from the relation $U^*AU = D$ we get $A = (U^*)^{-1}DU^{-1} = UDU^*$. Then we see that
$$A^* = (UDU^*)^* = UD^*U^* = UDU^* = A. \qquad □$$
3.D.38. (Unitary diagonalization) Consider the matrices $A$, $B$ given below. For each case demonstrate that the eigenvectors corresponding to distinct eigenvalues are orthogonal to each other (with respect to the standard Hermitian forms on $\mathbb{C}^3$ and $\mathbb{C}^2$, respectively). Next demonstrate that both matrices are unitarily diagonalizable and illustrate the conditions necessary for unitary diagonalization:
$$A = \begin{pmatrix} -1 & 0 & i \\ 0 & 1 & 0 \\ -i & 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 3 & i \\ -i & 3 \end{pmatrix}. \qquad ⃝$$
3.D.39. Prove that a normal matrix $A$ having all its eigenvalues purely imaginary is skew-Hermitian. ⃝
Let $(v + R_\lambda) \in V/R_\lambda$ be the corresponding eigenvector, that is, $\psi(v + R_\lambda) = \lambda(v + R_\lambda)$. Then $v \notin R_\lambda$ and $\varphi(v) = \lambda v + w$ for a suitable $w \in R_\lambda$. Thus $w = (\varphi - \lambda\,\mathrm{id}_V)(v)$ and $(\varphi - \lambda\,\mathrm{id}_V)^j(w) = 0$ for a suitable $j$. We conclude that $(\varphi - \lambda\,\mathrm{id}_V)^{j+1}(v) = 0$, which contradicts the choice $v \notin R_\lambda$. It follows that the dimension of $R_\lambda$ equals the algebraic multiplicity of the root $\lambda$ of the characteristic polynomial of the mapping $\varphi : V \to V$. □
Combining the latter theorem with the triangulation result from Corollary 3.4.14, we can formulate:
Corollary. Consider a linear mapping $\varphi : V \to V$ on a vector space $V$ over scalars $\mathbb{K}$ whose entire spectrum lies in $\mathbb{K}$. Then $V = R_{\lambda_1} \oplus \cdots \oplus R_{\lambda_n}$ is the direct sum of the root subspaces. If we choose suitable bases for these subspaces, then with respect to this basis $\varphi$ has block-diagonal form with upper triangular matrices in the blocks and the eigenvalues $\lambda_i$ on the diagonal.
3.4.17. Nilpotent and cyclic mappings. Now almost everything is prepared for the discussion of canonical forms of matrices. It only remains to clarify the relation between cyclic and nilpotent mappings and to combine the results proved already.
Theorem. Let $\varphi : V \to V$ be a nilpotent linear mapping. Then there exists a decomposition of $V$ into a direct sum of subspaces $V = V_1 \oplus \cdots \oplus V_k$ such that the restriction of $\varphi$ to each summand $V_i$ is cyclic.
Proof. We provide a straightforward construction of a basis of the space $V$ such that the action of the mapping $\varphi$ on the basis vectors directly shows the decomposition into the cyclic mappings. Let $k$ be the degree of nilpotency of the mapping $\varphi$ and write $P_i = \operatorname{Im}(\varphi^i)$, $i = 0, \dots, k$. Thus
$$\{0\} = P_k \subset P_{k-1} \subset \cdots \subset P_1 \subset P_0 = V.$$
Choose a basis $e_1^{k-1}, \dots, e_{p_{k-1}}^{k-1}$ of the space $P_{k-1}$, where $p_{k-1} > 0$ is its dimension. By definition, $P_{k-1} \subset \operatorname{Ker}\varphi$, i.e. $\varphi(e_j^{k-1}) = 0$ for all $j$. Assume that $P_{k-1} \neq V$. Since $P_{k-1} = \varphi(P_{k-2})$, there necessarily exist vectors $e_j^{k-2}$, $j = 1, \dots, p_{k-1}$, in $P_{k-2}$ such that $\varphi(e_j^{k-2}) = e_j^{k-1}$. Assume
$$a_1 e_1^{k-1} + \cdots + a_{p_{k-1}} e_{p_{k-1}}^{k-1} + b_1 e_1^{k-2} + \cdots + b_{p_{k-1}} e_{p_{k-1}}^{k-2} = 0.$$
Applying $\varphi$ to this linear combination yields $b_1 e_1^{k-1} + \cdots + b_{p_{k-1}} e_{p_{k-1}}^{k-1} = 0$. This is a linear combination of independent vectors, therefore all $b_j = 0$. But then also all $a_j = 0$. Thus the linear independence of all $2p_{k-1}$ chosen vectors is established. Next, extend them to a basis
$$(1)\qquad e_1^{k-1}, \dots, e_{p_{k-1}}^{k-1};\quad e_1^{k-2}, \dots, e_{p_{k-1}}^{k-2}, e_{p_{k-1}+1}^{k-2}, \dots, e_{p_{k-2}}^{k-2}$$
of the space $P_{k-2}$. The images of the added basis vectors lie in $P_{k-1}$, so they must be linear combinations of the basis elements $e_1^{k-1}, \dots, e_{p_{k-1}}^{k-1}$. We can thus adjust the chosen vectors $e_{p_{k-1}+1}^{k-2}, \dots, e_{p_{k-2}}^{k-2}$ by adding the appropriate linear combinations of the vectors $e_1^{k-2}, \dots, e_{p_{k-1}}^{k-2}$, with the result that the adjusted vectors lie in the kernel of $\varphi$. Thus we may assume that our choice in the scheme (1) has this property.
Assume that we have already constructed a basis of the subspace $P_{k-\ell}$, arranged in a scheme similar to (1):
$$e_1^{k-1}, \dots, e_{p_{k-1}}^{k-1}$$
$$e_1^{k-2}, \dots, e_{p_{k-1}}^{k-2},\ e_{p_{k-1}+1}^{k-2}, \dots, e_{p_{k-2}}^{k-2}$$
$$e_1^{k-3}, \dots, e_{p_{k-1}}^{k-3},\ e_{p_{k-1}+1}^{k-3}, \dots, e_{p_{k-2}}^{k-3},\ e_{p_{k-2}+1}^{k-3}, \dots, e_{p_{k-3}}^{k-3}$$
$$\vdots$$
$$e_1^{k-\ell}, \dots, e_{p_{k-1}}^{k-\ell},\ e_{p_{k-1}+1}^{k-\ell}, \dots, e_{p_{k-2}}^{k-\ell},\ e_{p_{k-2}+1}^{k-\ell}, \dots\dots, e_{p_{k-\ell}}^{k-\ell}$$
where the value of the mapping $\varphi$ on any basis vector is located directly above it; the value is zero if there is nothing above that basis vector. If $P_{k-\ell} \neq V$, then again there must exist vectors $e_1^{k-\ell-1}, \dots, e_{p_{k-\ell}}^{k-\ell-1}$ which map to $e_1^{k-\ell}, \dots, e_{p_{k-\ell}}^{k-\ell}$. We extend them to a basis of $P_{k-\ell-1}$, say by the vectors $e_{p_{k-\ell}+1}^{k-\ell-1}, \dots, e_{p_{k-\ell-1}}^{k-\ell-1}$. Again, exactly as when adjusting (1) above, we choose the additional basis vectors from the kernel of $\varphi$, and analogously as before we verify that we indeed obtain a basis of $P_{k-\ell-1}$. After $k$ steps we obtain a basis for the whole of $V$, which has the properties given for the basis of the subspace $P_{k-\ell}$. Individual columns of the resulting scheme then generate the subspaces Vi.
Additionally we have found the bases of these subspaces which show that corresponding restrictions of φ are cyclic mappings. □ 3.4.18. Proof of the Jordan theorem. Let λ1, . . . , λk be all the distinct eigenvalues of the mapping φ. From the assumptions of the Jordan theorem it follows that V = Rλ1 ⊕ · · · ⊕ Rλk . The mappings φi = (φ|Rλi −λi idRλi ) are nilpotent and thus each of the root spaces is a direct sum Rλi = P1,λi ⊕ · · · ⊕ Pji,λi of spaces on which the restriction of the mapping φ−λi idV is cyclic. Matrices of these restricted mappings on Pr,s are Jordan blocks corresponding to the zero eigenvalue, the restricted mapping φ|Pr,s has thus for its matrix the Jordan block with the eigenvalue λi. For the proof of Jordan theorem it remains to verify the claim about uniqueness (up to reordering the blocks). Because the diagonal values λi are given as roots of the characteristic polynomial, their uniqueness is immediate. The decomposition to root spaces is unique as well. Thus, without loss of generality we may assume that there is just one eigenvalue λ and we are going to express the dimensions of 242 We have seen that working with diagonalizable matrices significantly simplifies many computations, such as matrix powers. When A is not diagonalizable, it is still useful to have a simplified form of A with respect to some ordered basis. In fact, for any linear transformation f : V → V on a finite-dimensional complex vector space V , there exists a particular simple matrix representation. If A is a complex square matrix, this means that we can find a matrix J such that A = PJP−1 for some invertible matrix P. The matrix J is known as the Jordan normal form (or Jordan canonical form) and is typically represented as: J =     J1 0 J2 ... 0 Jk     . In this expression, Ji (i = 1, . . . , k) are the so called Jordan blocks, which are upper triangular matrices whose form is explained below. The Jordan canonical form J will be a diagonal matrix if and only if A is diagonalizable, see 3.4.10. Let us briefly outline the procedure for computing the Jordan canonical form of a given n × n matrix A: Step 1: Compute the eigenvalues λ1, . . . , λk of A and their algebraic multiplicities, which we denote by α(λi). Step 2: Compute the geometric multiplicity γ(λi) of each eigenvalue λi of A. Recall that γ(λi) = dim Ker(A − λiE), where E is the n × n identity matrix (we have γ(λi) ≤ α(λi) for all i = 1, . . . , k). Step 3: For each eigenvalue λi determine the root subspace (also known as generalized eigenspace) Rλi = Ker ( (A − λiE)j ) , where j = α(λi) is the algebraic multiplicity of λi. Note the null spaces Ker ( (A−λiE)ℓ ) for ℓ < j = α(λi) are included in Rλi , for any eigenvalues λi, as Rλi captures all the necessary vectors. In particular, the dimension of Rλi coincides with the algebraic multiplicity j = α(λi) of λi, for all i = 1, . . . , k. Step 4: Construct the Jordan block Ji ≡ Jλi for all i = 1, . . . k, which has necessarily λi on the diagonal and 1s on the superdiagonal, while other elements are 0. To determine the size of Ji we start from the highest power j for which the dimension dim Ker ( (A − λiE)j ) is non-zero (this highest power is typically the algebraic multiplicity, i.e., j = α(λi)), and move downwards to j = 1. For example, a Jordan block of size m × m corresponding to the eigenvalue λi has the form Ji =       λi 1 0 . . . 0 0 λi 1 . . . 0 0 0 λi . . . 0 ... ... ... ... 1 0 0 0 . . . λi       ∈ Matm(C) . CHAPTER 3. 
LINEAR MODELS AND MATRIX CALCULUS individual Jordan blocks using the ranks rk of the mapping (φ−λ idV )k . This will show that the blocks are uniquely determined (up to their order). On the other hand, changing the order of the blocks corresponds to renumbering the vectors of basis, thus we can obtain them in any order. If ψ is a cyclic operator on an m-dimensional space, then the defect of the iterated mapping ψk is k for 0 ≤ k ≤ m, while the defect is m for all k ≥ m. This implies that if our matrix J of the mapping φ on the n-dimensional space V (remind we assume V = Rλ) contains dk Jordan blocks of the order k, then the defect Dℓ = n − rℓ of the matrix (J − λ E)ℓ is Dℓ = d1 + 2d2 + · · · + ℓdℓ + ℓdℓ+1 + · · · . Now, taking the combination 2Dk −Dk−1 −Dk+1 we cancel all those terms in the latter expression which coincide for ℓ = k − 1, k, k + 1 and we are left with 2Dk − Dk−1 − Dk+1 = dk. Substituting for Dℓ’s, we finally arrive at dk = 2n−2rk −n+rk−1 −n+rk+1 = rk−1 −2rk +rk+1. This is the requested expression for the sizes of the Jordan blocks and the theorem is proved. 3.4.19. Remarks. The proof of the theorem about the existence of the Jordan canonical form was constructive, but it does not give an efficient algorithmic approach for the construction. Now we show how our results can be used for explicit computation of the basis in which the given mapping φ : V → V has its matrix in the canonical Jordan form.6 (1) Find the roots of the characteristic polynomial. (2) If there are less than n = dim V roots (counting multiplicities), then there is no canonical form. (3) If there are n linearly independent eigenvectors, there is a basis of V composed of eigenvectors under which φ has diagonal matrix. (4) Let λ be the eigenvalue with geometric multiplicity strictly smaller than the algebraic multiplicity and v1, . . . , vk be the corresponding eigenvectors. They should be the vectors on the upper boundary of the scheme from the proof of the theorem 3.4.17. We need to complete the basis by application of iterations φ − λ idV . By doing this we also find in which row the vectors should be located. Hence we find the linearly independent solutions wi of the equations (φ − λ id)(wi) = vi from the rows below it. Repeat the procedure iteratively (that is, for wi and so on). In this way, we find the “chains” of basis vectors that give invariant subspaces, where φ − λ id is cyclic (the columns from the scheme in the proof). 6There is a beautiful purely algebraic approach to compute the Jordan canonical form efficiently, but it does not give any direct information about the right basis. This algebraic approach is based on polynomial matrices and Weierstrass divisors. We shall not go into details in this textbook. 243 Later we will see that each Jordan block corresponds to a size which represents the length of a so-called chain of generalized eigenvectors associated with λi. The increase in dimension between the kernel of (A − λiE)j−1 and (A − λiE)j is the number of blocks of size j or more. We finally form the matrix J as posed above, i.e., J consists of the Jordan blocks Ji which all lie along the diagonal, while the blank off-diagonal blocks are all zero. Therefore, in a Jordan matrix J the only possibly non-zero entries are those on the diagonal, which can attain any complex value (including 0), and those on the superdiagonal, which are either 0 or 1. 3.D.40. Let A be a matrix with characteristic polynomial χA(λ) = (λ− √ 2)3 . 
Illustrate all possible configurations for the Jordan canonical form of A. Solution. Since the characteristic polynomial is of degree three, the matrix A should be 3 × 3. Since χA(λ) = (λ −√ 2)3 , the unique eigenvalue λ = √ 2 has algebraic multiplicity 3. Now, the geometric multiplicity γ(λ) of λ satisfies 1 ≤ γ(λ) ≤ 3. Hence there are three possible configurations (up to the ordering of the Jordan blocks): • If γ(λ) = 1, then the Jordan canonical form will consist of a single Jordan block of size 3: J =   √ 2 1 0 0 √ 2 1 0 0 √ 2   . • If γ(λ) = 2, then the Jordan canonical form will consist of one Jordan block of size 2 and one Jordan block of size 1: J =   √ 2 1 0 0 √ 2 0 0 0 √ 2   . • If γ(λ) = 3, then the Jordan canonical form will consist of three Jordan blocks of size 1 each (in this case there exist three linearly independent eigenvectors, hence A is diagonalizable and J is diagonal): J =   √ 2 0 0 0 √ 2 0 0 0 √ 2   . □ 3.D.41. Which of the following matrices represents a Jordan canonical form? Explain your answer. G1 =     0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 2     , G2 =   3 0 0 0 1 1 0 0 1   , G3 = ( 5 1 0 1 ) , G4 =   −2 1 0 0 −1 0 0 0 −2   , G5 = (π), CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS The procedure is practical for matrices when the multiplicities of the eigenvalues are small, or at least when the degrees of nilpotency are small. For instance, for the matrix A =   2 0 1 0 2 1 0 0 2   we obtain the two-dimensional subspace of eigenvectors span{(1, 0, 0)T , (0, 1, 0)T }, but we still do not know, which of them are the “ends of the chains”. We need to solve the equations (A − 2E)x = (a, b, 0)T for (yet unknown) constants a, b. This system is solvable if and only if a = b, and one of the possible solutions is x = (0, 0, 1)T , a = b = 1. The entire basis is then composed of (1, 1, 0)T , (0, 0, 1)T , (1, 0, 0)T . Note that we have free choices on the way and thus there are many such bases. 5. Decompositions of the matrices and pseudoinversions Previously we concentrated on the geometric description of the structure of a linear mapping. Now we translate our results into the language of matrix decomposition. This is an important topic for numerical methods and matrix calculus in general. Even when computing effectively with real numbers we use decompositions into products. The simplest one is the unique expression of every real number in the form a = sgn(a) · |a|, that is, as a product of the sign and the absolute value. Proceeding in the same way with complex numbers, we obtain their polar form. That is, we write z = (cos φ + i sin φ)|z|. Here the complex unit plays the role of the sign and the other factor is a non-negative real multiple. In the following paragraphs we list briefly some useful decompositions for distinct types of matrices. Remind, we met suitable decompositions earlier, for instance for positive semidefinite matrices in paragraph 3.4.9 when finding the square roots. We shall start with similar simple examples. 3.5.1. LU-decomposition. In paragraphs 2.1.7 and 2.1.8 we transformed matrices over scalars from any field into row echelon form. For this we used elementary row transformations, based on successive multiplication of our matrix by invertible lower triangular matrices Pi. In this way we added multiples of the rows above the currently transformed one. Sometimes we also interchanged the rows, which corresponded to multiplication by a permutation matrix. 
That is a square matrix in which all elements are zero except exactly one value 1 in each row and column. To imagine why, consider a matrix with just one non-zero element in the first column but not in the first row. When we used the backwards 244 G6 =         2 1 0 0 0 0 0 2 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0         . ⃝ 3.D.42. Find the Jordan canonical form of the matrix A =   5 4 2 0 1 −1 −1 −1 3   . Solution. Step 1: The characteristic polynomial is given by χA(λ) = 5 − λ 4 2 0 1 − λ −1 −1 −1 3 − λ = (5 − λ) 1 − λ −1 −1 3 − λ − 4 0 −1 −1 3 − λ + 2 0 1 − λ −1 −1 = −λ3 + 9λ2 − 24λ + 16 . We see that 1 is a root of χA(λ). To find all roots we can apply the Horner’s scheme: 1 −1 9 −24 16 −1 8 −16 −1 8 −16 0 Thus, χA(λ) = −λ3 + 9λ2 − 24λ + 16 = (λ − 1)(−λ2 + 8λ − 16) = (λ − 1)(λ − 4)2 , and A has two eigenvalues: λ1 = 1 with α(λ1) = 1 and λ2 = 4 with α(λ2) = 2. Step 2: Let us now compute the geometric multiplicities. For λ1 = 1 we need to solve the matrix equation (A − E)u = 0 for some vector u = (x1, x2, x3)T ∈ R3 . Since A − E =   4 4 2 0 0 −1 −1 −1 2   , we obtain the linear system {4x1 + 4x2 + 2x3 = 0 , x3 = 0 , −x1 − x2 + 2x3 = 0} . The solution space of this system is 1-dimensional:      x1 x2 x3   =   r −r 0   : r ∈ R    . Thus γ(λ1) = dim Ker(A − I) = 1, in particular we may assume that V1 is generated by the eigenvector u1 = (1, −1, 0)T . For the second eigenvalue λ2 = 4 we have the matrix equation (A − 4E)v = 0 for some vector v = (y1, y2, y3)T ∈ R3 . Since A − 4E =   1 4 2 0 −3 −1 −1 −1 −1   , CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS elimination to transform the matrix into the blockwise form ( Eh 0 0 0 ) (remind Eh stays for the unit matrix of rank h) then we potentially needed to interchange columns as well. This was achieved by multiplying by a permutation matrix from the right hand side. For simplicity, assume we have a square matrix A of size m and that Gaussian elimination does not force a row interchange. Thus all matrices Pi can be lower triangular with ones on diagonal. Finally we note that inverses of such Pi are again lower triangular with ones on the diagonal (either remember the algorithm 2.1.10 or the formula in 2.2.11). We obtain U = P · A = Pk · · · P1 · A where U is an upper triangular matrix. Thus A = L · U where L is lower triangular matrix with ones on diagonal and U is upper triangular. This decomposition is called LU-decomposition of the matrix A. We can also absorb the diagonal values of U into a diagonal matrix D and obtain the LDU-decomposition where both U and L have just ones along the diagonal, A = L D U. For a general matrix A, we need to add the potential permutations of rows during Gaussian elimination. Then we obtain the general result. (Think why we can always put the necessary permutation matrices to the most left and most right positions!) LU-decomposition Let A be any square matrix of size m over a field of scalars. Then we can find lower triangular matrix L with ones on its diagonal, upper triangular matrix U and permutation matrices P and Q, all of size m, such that A = P · L · U · Q. 3.5.2. Remarks. As one direct corollary of the Gaussian elimination we can observe that, up to a choice of suitable bases on the domain and codomain, every linear mapping f : V → W is given by a matrix in block-diagonal form with unit matrix of the size equal to the dimension of the image of f, and with zero blocks all around. 
This can be reformulated as follows: every matrix A of the type m/n over a field of scalars K can be decomposed into the product
A = P · \begin{pmatrix} E & 0 \\ 0 & 0 \end{pmatrix} · Q,
where P and Q are suitable invertible matrices. Previously (in 3.4.10) we discussed properties of linear mappings f : V → V over complex vector spaces.

we obtain the linear system {y1 + 4y2 + 2y3 = 0, 3y2 + y3 = 0, y1 + y2 + y3 = 0}. Using Sage via the cell

A1=matrix([[1, 4, 2], [0, -3, -1], [-1, -1, -1]])
show(A1)
show(A1.rref())

we obtain the following reduced row echelon form
A − 4E = \begin{pmatrix} 1 & 4 & 2 \\ 0 & -3 & -1 \\ -1 & -1 & -1 \end{pmatrix} ∼ \begin{pmatrix} 1 & 0 & 2/3 \\ 0 & 1 & 1/3 \\ 0 & 0 & 0 \end{pmatrix},
which can be verified by hand in a few steps. Therefore, the solution space is given by
{ (y1, y2, y3)^T = (−(2/3)s, −(1/3)s, s)^T : s ∈ R }.
Consequently, we may assume that the eigenspace V4 is generated by the eigenvector u2 = (2, 1, −3)^T (this corresponds to the choice s = −3). In particular, γ(λ2) = 1.

Step 3: Since the eigenvalue λ2 = 4 satisfies γ(λ2) = 1 < α(λ2) = 2, the matrix A is not diagonalizable, and we should proceed with Step 3. In particular, for λ2 = 4 we have j = α(λ2), and this algebraic multiplicity coincides with the dimension of the generalized eigenspace Rλ2 = Ker((A − 4E)²), i.e., dim Ker((A − 4E)²) = 2 (see below in 3.D.44 for a description of Rλ2). Thus, according to our algorithm, for the eigenvalue λ2 = 4 there exists a unique Jordan block of size 2 × 2 or larger, since dim Ker((A − 4E)²) − dim Ker(A − 4E) = 2 − 1 = 1. Counting dimensions, we see that this block necessarily has the form J2 = \begin{pmatrix} 4 & 1 \\ 0 & 4 \end{pmatrix}, while for λ1 = 1 the corresponding Jordan block J1 is just a scalar, J1 = (1).

Step 4: We are now able to describe the Jordan canonical form:
J = (1) ⊕ \begin{pmatrix} 4 & 1 \\ 0 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 4 & 1 \\ 0 & 0 & 4 \end{pmatrix}. □

Having the Jordan canonical form J of a square matrix A, we can derive a similarity matrix P such that P^{−1}AP = J, or, equivalently, A = PJP^{−1}. Such a matrix P is composed of columns that are eigenvectors and generalized eigenvectors of A. To understand how P is constructed, recall that for an eigenvalue λi of A, a generalized eigenvector u is defined as a non-zero vector such that (A − λiE)^j u = 0 for some integer j > 0. If j is the smallest positive integer with this property, the set {(A − λiE)^{j−1}u, (A − λiE)^{j−2}u, . . . , (A − λiE)u, u} is known as a "(Jordan) chain of length j", see also the discussion in 3.4.19. The structure of these chains reflects the sizes and arrangements of the Jordan blocks in the Jordan form J. In

We showed that every square matrix A of dimension m can be decomposed into the product A = P · J · P^{−1}, where J is a block-diagonal matrix with the Jordan blocks associated with the eigenvalues of A on the diagonal. Indeed, this is just a reformulation of the Jordan theorem, because multiplying by the matrix P and by its inverse from the other side corresponds in this case just to the change of the basis on the vector space V (with transition matrix P). The quoted theorem says that every mapping has the Jordan canonical form in a suitable basis. Analogously, when discussing the self-adjoint mappings we proved that for real symmetric matrices or for complex Hermitian matrices there exists a decomposition into the product A = P · D · P∗, where D is the diagonal matrix with all (always real) eigenvalues on the diagonal, counting multiplicities. Indeed, we proved that there is an orthonormal basis consisting of eigenvectors (a small numerical illustration follows below).
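For instance, the fact just quoted is easy to check numerically. The following minimal Sage sketch uses a 2 × 2 symmetric matrix of our own choosing; since its eigenvalues are simple, the computed eigenvectors are automatically orthogonal, and it only remains to normalize them:

A = matrix(QQ, [[2, 1], [1, 2]])   # symmetric; eigenvalues 3 and 1
D, P = A.eigenmatrix_right()       # A*P == P*D, columns of P are eigenvectors
P = P*diagonal_matrix([1/v.norm() for v in P.columns()])   # normalize columns
print(bool(P.T*P == identity_matrix(2)))   # True: P is orthogonal
print(bool(P*D*P.T == A))                  # True: A = P*D*P^T

For eigenvalues of higher multiplicity one would additionally orthogonalize inside each eigenspace, e.g. by the Gram-Schmidt procedure.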
Thus the transition matrix P reflecting the appropriate change of the basis must be orthogonal. In particular, P−1 = P∗ . For real orthogonal mappings we derived analogous expression as for the symmetric ones, i.e. A = P · B · P∗ . But in this case the matrix B is block-diagonal with blocks of size two or one, expressing rotations, mirror symmetry and identities with respect to the corresponding subspaces. 3.5.3. Singular decomposition theorem. We return to general linear mappings f : V → W between vector spaces (generally distinct). We assume that scalar products are defined on both spaces and we restrict ourselves to orthonormal bases only. If we want a similar decomposition result as above, we must proceed in a more refined way than in the case of arbitrary bases. But the result is surprisingly similar and strong: Singular decomposition Theorem. Let A be a matrix of the type m/n over real or complex scalars. Then there exist square unitary matrices U and V of dimensions m and n, and a real diagonal matrix D with non-negative elements of dimension r, r ≤ min{m, n}, such that A = U S V ∗ , S = ( D 0 0 0 ) and r is the rank of the matrix AA∗ . The matrix S is determined uniquely up to the order of the diagonal elements si in D. Moreover, si are the square roots of the positive eigenvalues di of the matrix A A∗ . If A is a real matrix, then the matrices U and V are orthogonal. 246 particular, each chain corresponds to a block, and the number of vectors in the chain indicates the block’s size. Now each column in P corresponds to a vector in a Jordan chain. For a Jordan block of size j associated with λi, the matrix P will include j columns form the Jordan chain of a generalized eigenvector associated with λi. Let us illustrate these facts with examples and for further exercises on the Jordan canonical form see Section F. 3.D.43. Using the matrices G1, . . . G6 given in 3.D.41, derive the characteristic polynomial of the corresponding matrix A for those cases that represent a Jordan canonical from. ⃝ 3.D.44. For the matrix A provided in 3.D.42, determine a similarity matrix P such that P−1 AP = J, where J is Jordan canonical form of A, as obtained in 3.D.42. Solution. We know that Rλ2 = Ker ( (A − 4E)2 ) is 2-dimensional. This is because the homogeneous linear system induced by the matrix equation (A − 4E)2 w = 0, i.e.,   −1 −10 −4 1 10 4 0 0 0     w1 w2 w3   =   0 0 0   has a 2-dimensional solution space, given by      w1 w2 w3   =   −10t − 4s t s   : t, s ∈ R    . To derive that matrix P we first need to solve the matrix equation (A − 4E)w = u2, where v2 = (2, 1, −3)T is the eigenvector corresponding to λ2 = 4 (see 3.4.19). Then, the vector w should be an element of Rλ2 and the set {(A−4E)w, w} = {u2, w} will give rise to a Jordan chain for w. Moreover, the columns of P will be the vectors u1, u2 and w, i.e., P =( u1 u2 w ) . Let us find w. The augmented matrix of the linear system induced by the equation (A − 4E)w = u2, is given by B := ( A − 4E u2 ) =   1 4 2 2 0 −3 −1 1 −1 −1 −1 −3   , and we can apply elementary row operations to obtain the corresponding RREF. Successively one obtains B R3→R1+R3 −→   1 4 2 2 0 −3 −1 1 0 3 1 −1   R3→R2+R3 −→   1 4 2 2 0 −3 −1 1 0 0 0 0   R2→− 1 3 R2 −→   1 4 2 2 0 1 1/3 −1/3 0 0 0 0   R1→R1−4R2 −→   1 0 2/3 10/3 0 1 1/3 −1/3 0 0 0 0   . You can verify this expression in Sage by the rref()-method, as we learned in Chapter 2. CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Proof. Assume first that m ≤ n. 
Denote by φ : Kn → Km the mapping between real or complex spaces with standard scalar products, given by the matrix A in the standard bases. We can reformulate the statement of the theorem as follows: there exists orthonormal bases on Kn and Km in which the mapping φ is given by the matrix S from the statement of the theorem. As noted before, the matrix A∗ A is positive semidefinite. Therefore it has only real non-negative eigenvalues and there exists an orthonormal basis w of Kn in which the corresponding mapping φ∗ ◦ φ is given by a diagonal matrix with eigenvalues on the diagonal. In other words, there exists a unitary matrix V such that A∗ A = V B V ∗ for a real diagonal matrix B with non-negative eigenvalues (d1, d2, . . . , dr, 0, . . . , 0) on the diagonal, di ̸= 0 for all i = 1, . . . , r. Thus B = V ∗ A∗ A V = (A V )∗ (A V ). This is equivalent to the claim that the first r columns of the matrix AV are orthogonal, while the remaining columns vanish because they have zero norm. Next, we denote the first r columns of A V as v1, . . . , vr ∈ Km . Thus, ⟨vi, vi⟩ = di, i = 1, . . . , r, and the normalized vectors ui = 1√ di vi form an orthonormal system of non-zero vectors. Extend them to an orthonormal basis u = u1, . . . , um for the entire Km . Expressing the original mapping φ in the bases w of Kn and u of Km , yields the matrix √ B. The transformations from the standard bases to the newly chosen ones correspond to the multiplication from the left by a unitary (orthogonal) matrix U and from the right by V −1 = V ∗ . This is the claim of the theorem. If m > n, we can apply the previous part of the proof to the matrix A∗ which implies the desired result. All the previous steps in the proof are also valid in the real domain with real scalars. □ This proof of the theorem about singular decomposition is constructive and we can indeed use it for computing the unitary (orthogonal) matrices U and V and the non-zero diagonal elements of the matrix S. The diagonal values of the matrix D from the previous theorem are called singular values of the matrix A. 3.5.4. Further comments. When dealing with real scalars, the singular values of a linear mapping φ : Rn → Rm have a simple geometric meaning: Let K ⊂ Rn be the unit ball in the standard scalar product. The image φ(K) is always an m-dimensional ellipsoid (possibly degenerate). The singular values of the matrix A are then the norms of the main halfaxes. The theorem says further that the original ball allows an orthogonal set of diameters, whose images are exactly the half-axes of this ellipsoid. For square matrices it can be seen that A is invertible if and only if all singular values are non-zero. The ratio of the 247 A=matrix([[5, 4, 2], [0, 1, -1], [-1, -1, 3]]) E=identity_matrix(3) AA=A-4*E print(bool(AA== matrix([[1, 4, 2], [0, -3, -1], [-1, -1, -1]]))) b=vector([2, 1, -3]) B=AA.augment(b) B.rref() Therefore, we obtain the system (we assume that w has the form w = (x, y, z)T ) {x + 2 3 z = 10 3 , y + 1 3 z = − 1 3 , z ∈ R} with solution space      x y z   =   −2 3 s + 10 3 −1 2 s − 1 3 s   : s ∈ R    . To obtain a representative we may set s = 0 which gives the generalized eigenvector w = (10/3, −1/3, 0)T . Let us use Sage to verify that w is indeed a generalized eigenvector. 
To do so, we can simply run the following code block: w=vector([10/3, -1/3, 0]) A=matrix([[5, 4, 2], [0, 1, -1],[-1, -1, 3]]) E=identity_matrix(3) BB=(A-4*E)^2 bool(BB*w==0) Sage’s output is True and we can present the matrix P: P = ( u1 u2 w ) =   1 2 10 3 −1 1 −1 3 0 −3 0   . To verify the correctness of this result, first compute P−1 and next check that P−1 AP = J. For convenience, we will use Sage to perform these computations promptly. A=matrix([[5, 4, 2], [0, 1, -1], [-1, -1, 3]]) J=matrix([[1, 0, 0], [0, 4, 1], [0, 0, 4]]) P=matrix([[1, 2, 10/3], [-1, 1, -1/3], [0, -3, 0]]) show(A, P, J, P.inverse()) print(P.inverse()*A*P==J) We mention that Sage provides a built-in method to obtain the Jordan canonical form of a matrix, using the command A.jordan_form(). For instance, for our matrix A give the cell A=matrix([[5, 4, 2],[0, 1, -1], [-1, -1, 3]]) J,P=A.jordan_form(transformation=True) show(J, P) CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS greatest to the smallest singular value is an important parameter for the robustness of many numerical computations with matrices, for instance the computation of the inverse matrix. Note that there are fast methods of computation (approximations) for eigenvalues. Thus the singular decomposition is a very effective tool to work with. 3.5.5. Polar decomposition theorem. The singular decomposition theorem is the starting point for many other useful tools. We present several direct corollaries (which by themselves are non-trivial and important). The statement of the singular decomposition theorem saying that for any matrix A, real or complex, A = USW∗ with S diagonal with non-negative real numbers on the diagonal and U and W unitary, can be rephrased as A = USU∗ UW∗ and let us denote P = USU∗ , V = UW∗ . The first of the matrices, P, is Hermitian (in the real case, symmetric) and positive semidefinite (because P and S are matrices of the the same mapping in different orthonormal bases). At the same time, V is the product of two unitary matrices and thus again is unitary (in the real case orthogonal). Next, assume that A = PV = QZ are two such decompositions of the matrix A into the product of a positive semidefinite Hermitian matrix and a unitary matrix. Clearly, A∗ = WSU∗ . Thus AA∗ = USSU∗ = P2 , and the matrix P is actually the square root of the easily computable Hermitian matrix AA∗ . In particular, this proves that P is uniquely determined, cf. 3.4.9. Further, assume that A is invertible. Then also P is invertible and Z = V = P−1 A. We have derived a very useful analogy of the decomposition of a real number into a sign and the absolute value: Polar decomposition Theorem. Every square complex matrix A of dimension n can be expressed in the form A = PV , where P is a Hermitian positive semi-definite square matrix of the same dimension while V is unitary. The matrix P = √ AA∗ is uniquely given, and if A is invertible, the decomposition is unique and V = ( √ AA∗)−1 A. If A is a matrix of real scalars, then P is symmetric and V is orthogonal. If we apply the same theorem to A∗ instead of A, we obtain the same result, but with the order of the Hermitian and unitary matrices is reversed. This means A = V P with V unitary and P = √ A∗A positive semidefinite. The matrices in the corresponding right and left polar decompositions will in general be different. 248 Sage’s output has the form   1 0 0 0 4 1 0 0 4   ,   1 1 2 1 −1 1 4 0 0 −3 4 −1 4   . 
This verifies the expression of the matrix J derived in 3.D.42 and presents a similarity matrix P such that P^{−1}AP = J. However, be aware that P is not uniquely determined, which is why Sage's candidate for P may differ from ours. □

3.D.45. Consider the matrix A = \begin{pmatrix} 3 & 1 \\ -1 & 1 \end{pmatrix}. (a) Show that A is not diagonalizable and find the associated Jordan canonical form. Moreover, find a similarity matrix P such that P^{−1}AP = J and verify the results using Sage. (b) Compute the 4th power A⁴ of A using the Jordan canonical form J and the matrices P, P^{−1} derived in (a). ⃝

3.D.46. Consider the matrix A = \begin{pmatrix} 5 & 4 & 2 & 1 \\ 0 & 5 & 1 & 1 \\ 0 & 0 & 5 & 2 \\ 0 & 0 & 0 & 5 \end{pmatrix}. Determine the eigenvectors and generalized eigenvectors of A and find generators of the subspaces Ker((A − 5E)²), Ker((A − 5E)³) and Ker((A − 5E)⁴). Next determine the Jordan canonical form J of A and present a matrix P such that P^{−1}AP = J. Moreover, show that the generalized eigenspace Rλ of the unique eigenvalue λ of A satisfies Rλ = Ker((A − 5E)⁴) ≅ R⁴. ⃝

E. Matrix decompositions

Matrices with some special property, such as diagonal matrices, triangular matrices, or unitary matrices, are simpler to work with than general matrices. The final part of this chapter focuses on matrix decompositions, i.e., techniques for expressing a given matrix as a product of "simpler" matrices such as those mentioned above, see the paragraphs 3.5.1 – 3.5.6. Matrix decompositions are essential for tasks like solving linear systems of equations and performing many complex matrix operations.

We begin with the LU-decomposition, which is essentially the matrix form of the Gauss elimination method described in paragraphs 2.1.7–2.1.9. The LU-decomposition exists for square matrices A which can be reduced to a row-echelon form through a series of row operations without requiring row exchanges. In such a situation we arrive at a factorization of the form A = LU, where U is the upper triangular matrix whose diagonal entries are the pivots, and L = (ℓij) is a lower triangular matrix with ones on the diagonal whose off-diagonal entries ℓij encode the elementary row operations that we applied to find U. More precisely, ℓij records the multiple of row j that was subtracted from row i during the corresponding step of the Gaussian elimination.

Actually, if A is invertible, it is easy to check that the matrices in the left and the right polar decompositions coincide if and only if A is normal. Look at theorem 3.4.8 and verify it yourself. In the complex case the analogy with the decomposition of numbers is even more entertaining. The positive semidefinite P again plays the role of the absolute value of the complex number. The unitary matrix V uniquely allows the expression as a sum V = re V + i im V with Hermitian real and imaginary parts and the property (re V)² + (im V)² = E. We obtain a full analogy with the polar form of the complex numbers (see the final remark and corollary in 3.4.8). But note that in the higher dimensional case it is important in which order this "polar form" of the matrix is written. It is possible in both ways, but the results differ in general.

3.5.6. QR decomposition. For many practical applications it is faster to use another decomposition of matrices, which is an analogy of the Schur orthogonal triangulation theorem:

QR decomposition

Theorem. For every complex matrix A of the type m/n there exists a unitary matrix Q and an upper triangular matrix R such that A = QR. If all the scalars are real, then both Q and R are real (i.e., Q is orthogonal).
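Before turning to the proof, the statement is easy to test numerically. A minimal sketch, using Sage's built-in QR method (available over QQbar or CDF, and discussed further in the practice column) on an illustrative 3 × 2 matrix of our own choosing:

A = matrix(QQbar, [[1, 2], [0, 1], [1, 0]])   # an illustrative 3x2 real matrix
Q, R = A.QR()          # Q is 3x3 unitary, R is 3x2 upper triangular
print(Q.is_unitary())  # True
print(A == Q*R)        # True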
Proof. In the geometric formulation we need to prove that for every mapping φ : K^n → K^m with the matrix A in the standard bases, we can choose a new orthonormal basis on K^m for which φ has an upper triangular matrix. Consider the images φ(e1), . . . , φ(en) ∈ K^m of the vectors of the standard orthonormal basis of K^n. Choose from them a maximal linearly independent system v1, . . . , vk in such a way that the removed dependent vectors are always a linear combination of the previous vectors. Extend it into a basis v1, . . . , vm. Let u1, . . . , um be an orthonormal basis of K^m obtained by the Gram-Schmidt orthogonalization of this system of vectors. For every ei, φ(ei) is either one of the vj, j ≤ i, or it is a linear combination of v1, . . . , v_{i−1}. Therefore, in the expression of φ(ei) in the basis u only the vectors u1, . . . , ui appear. Thus, in the standard basis on K^n and u on K^m, the mapping φ has an upper triangular matrix R. The change of the basis u on K^m corresponds to the multiplication by a unitary matrix Q∗ from the left. That is, R = Q∗A, equivalently A = QR. The last claim is clear from the construction. □

3.5.7. Pseudoinversions. Finally, we discuss an especially useful and important extension of the inversion concept, which is of great importance for numerical procedures and also in Statistics.

3.E.1. Find an LU-decomposition of the matrix A = \begin{pmatrix} -2 & 1 & 0 \\ -4 & 4 & 2 \\ -6 & 1 & -1 \end{pmatrix}.

Solution. The desired decomposition is given by
A = LU = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & -1 & 1 \end{pmatrix} \begin{pmatrix} -2 & 1 & 0 \\ 0 & 2 & 2 \\ 0 & 0 & 1 \end{pmatrix}.
To obtain this result, let us denote by R1, R2, R3 the rows of A during any step of the Gaussian elimination. Then U (the row-echelon form of A) is obtained by the following elementary row operations:
\begin{pmatrix} -2 & 1 & 0 \\ -4 & 4 & 2 \\ -6 & 1 & -1 \end{pmatrix} \xrightarrow{R2 − 2R1,\; R3 − 3R1} \begin{pmatrix} -2 & 1 & 0 \\ 0 & 2 & 2 \\ 0 & -2 & -1 \end{pmatrix} \xrightarrow{R3 − (−1)R2} \begin{pmatrix} -2 & 1 & 0 \\ 0 & 2 & 2 \\ 0 & 0 & 1 \end{pmatrix} = U.

Recall from Chapter 2 (cf. 2.A.9) that in Sage one can compute a row echelon form of a matrix A using the echelon_form() command. However, Sage's output may differ from the expected result, and some caution is required when choosing the numerical system. For instance, consider the matrix A with entries over Z. Executing in Sage the cell given below, we will obtain the matrix \begin{pmatrix} 2 & 1 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix}, which is clearly not the desired result.

A = matrix(ZZ, [[-2, 1, 0], [-4, 4, 2], [-6, 1, -1]])
A.echelon_form()

On the other hand, you can verify on your own that by replacing ZZ with RR, RDF, QQbar or the symbolic ring SR, the resulting matrix will be the 3 × 3 identity matrix. Hence, in this case, Sage returns the reduced row-echelon form of A, which is not useful for our task. Now, for the lower triangular matrix L = \begin{pmatrix} 1 & 0 & 0 \\ ℓ21 & 1 & 0 \\ ℓ31 & ℓ32 & 1 \end{pmatrix} we see that the entries ℓ21, ℓ31, ℓ32 are given by the circled numbers 2, 3 and −1 appearing in the row operations mentioned above. In 3.E.5 below, you will be asked to derive the expression of L in terms of elementary matrices. Finally, a verification in Sage can be done manually, as follows:

Technically, the following quite straightforward application of singular decompositions of matrices allows us to define the pseudoinverse. However, we should beware that the singular decomposition is not unique, and thus we must verify that such a definition is consistent. We shall see that in the next theorem.

Pseudoinverse matrices

Let A be a real or complex matrix of the type m/n.
Let A = U S V ∗ , S = ( D 0 0 0 ) be its singular decomposition (in particular, D is invertible). The matrix A† := V S† U∗ , S† = ( D−1 0 0 0 ) is called the pseudoinverse matrix of the matrix A. In geometric terms, we may view the linear mapping φ given by the matrix A in the two special orthonormal basis, where φ has got the matrix S with non-negative diagonal entries. We take the inverse of the “invertible part” of φ and complete it trivially to the pseudoinverse7 mapping φ† . The result is then viewed in the original basis and yields A† . As the following theorem shows, the pseudoinverse is an important generalization of the notion of inverse matrix, together with direct applications. At the same time, property (3), together with property (2), verifies the appropriateness of the definition. 7This concept was introduced by Eliakim Hastings Moore, an American mathematician, around 1920. It was reinvented by Roger Penrose and others later. In the literature, it is often called the Moore-Penrose pseudoinverse. Roger Penrose is an extremely influential mathematical physicist and philosopher of science working in Oxford, known also for his many bestselling popular books such as The Emperor’s New Mind: Concerning Computers, Minds, and The Laws of Physics (1989); Shadows of the Mind: A Search for the Missing Science of Consciousness (1994); The Road to Reality: A Complete Guide to the Laws of the Universe (2004); Cycles of Time: An Extraordinary New View of the Universe (2010). 250 A = matrix(RR,[[-2, 1, 0],[-4, 4, 2], [-6, 1, -1]]) L = matrix(RR,[[1, 0, 0],[2, 1, 0], [3, -1, 1]]) U = matrix(RR,[[-2, 1, 0],[0, 2, 2], [0, 0, 1]]) L*U==A Sage returns True. Note in this block we could use RDF or SR instead of RR, or even omit specifying a numerical system altogether. □ 3.E.2. Permuted LU-decomposition via Sage. If partial pivoting (row exchanges) is allowed, then for any squared matrix A we have a decomposition of the form PA = LU, where P is a permutation matrix. This is known as the LU-decomposition with partial pivoting, or permuted LU-decomposition. Note that Sage provides the command A.LU(), which corresponds to a built-in method to obtain the permuted LU-decomposition. In particular, for a m×n matrix A this command returns a triple of matrices P, L and U, such that A = PLU, satisfying the following properties: P is a m × m permutation matrix, L is a lower triangular m × m matrix with units in the diagonal, and U is an upper triangular m × n matrix. Notice in this case U is not in general a row-echelon form of A. For example, using the matrix A from 3.E.1 and running the cell A = matrix([[-2, 1, 0], [-4, 4, 2], [-6, 1, -1]]) A.LU() we obtain the following answer: ( [0 0 1] [ 1 0 0] [ -6 1 -1] [0 1 0] [2/3 1 0] [ 0 10/3 8/3] [1 0 0], [1/3 1/5 1], [ 0 0 -1/5] ) Notice that the matrices L and U that Sage prints out here, differ from those obtained in 3.E.1. The LU-decomposition is an effective method for solving linear systems. Given that A = LU is the LU-factorization of a squared matrix A, the equation Ax = b can be written as LUx = b. Introducing Ux = y, this gives us Ly = b. Typically, we first solve Ly = b with respect to y and then use this result to solve Ux = y for x. 3.E.3. Solve Ax = b using a LU-decomposition of the matrix A, where A and b are respectively given by A =     3 −6 2 −2 −3 4 1 0 6 −4 0 −4 −9 6 −2 12     , b =     −7 4 6 −3     . Solution. One can verify that CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Properties of pseudoinverse matrices Theorem. 
Let A be a real or complex matrix of the type m/n and let A† be its pseudoinverse. Then: (1) if A is invertible (necessarily square), then A† = A−1 , (2) A† A and AA† are Hermitian (in real case symmetric) and AA† A = A, A† AA† = A† . (3) the pseudoinverse matrices A† are uniquely defined by the four properties from (2). Thus if some matrix B of the type n × m has the properties that BA and AB are both Hermitian, ABA = A and BAB = B, then B = A† . (4) if A is a matrix of the system of linear equations Ax = b with b ∈ Km , then the vector y = A† b ∈ Kn minimizes the norm ∥Ax − b∥ for all vectors x ∈ Kn . (5) the system of linear equations Ax = b with b ∈ Km is solvable if and only if AA† b = b. In this case all solutions are given by the expression x = A† b + (E − A† A)u, where u ∈ Kn is arbitrary. Proof. (1): If A is invertible, then the matrix S = U∗ AV is also invertible and directly from the definition S† = S−1 . Consequently, A† A = AA† = E. (2): Direct computation yields SS† S = S and S† SS† = S† , therefore AA† A = USV ∗ V S† U∗ USV ∗ = USS† SV ∗ = USV ∗ = A and analogically for the second equation. Furthermore, (AA† )∗ = (USS† U∗ )∗ = U(S† )∗ S∗ U∗ = U(SS† )∗ U∗ = USS† U∗ = AA† . It can be proved similarly that (A† A)∗ = A† A. (3) The claim can be proved by direct computation. Of course we can consider the matrices A, A† , B as the matrices of the mappings φ, φ† , and ψ in the standard bases on Kn and Km , or any other pair of orthonormal bases. The requested equality is equivalent to the equality φ† = ψ independently of the choice of the bases. We choose a couple of orthogonal bases from the singular decomposition of A. Then the mapping φ has the matrix S from the definition of the pseudoinverse A† , so we write directly A = ( D 0 0 0 ) , A† = ( D−1 0 0 0 ) , with the diagonal matrix D consisting of all non-zero singular values. We write again B for the matrix of ψ in these bases. Clearly B and A satisfy the assumptions of the claim (3). Thus A† A = ( E 0 0 0 ) , ABA = A 251 A = LU =    1 0 0 0 −1 1 0 0 2 −4 1 0 −3 6 −14/8 1         3 −6 2 −2 0 −2 3 −2 0 0 8 −8 0 0 0 4    . For example, use Sage via the block A=matrix(RDF,[[3, -6, 2, -2],[-3, 4, 1, 0], [6, -4, 0, -4], [-9, 6, -2, 12]]) L=matrix(RDF,[[1, 0, 0, 0], [-1, 1, 0, 0], [2, -4, 1, 0], [-3, 6, -14/8, 1]]) U=matrix(RDF,[[3, -6, 2, -2], [0, -2, 3, -2], [0, 0, 8, -8], [0, 0, 0, 4]]) L*U == A Sage prints out True. Now we solve the equation Ly = b, which is given by     1 0 0 0 −1 1 0 0 2 −4 1 0 −3 6 −14/8 1         y1 y2 y3 y4     =     −7 4 6 −3     . From this we obtain y1 = −7 , y2 = 4 + y1 = −3 , y3 = 6 − 2y1 + 4y2 = 8 , y4 = −3 + 3y1 − 6y2 + 14 8 y3 = 8 . Next we solve Ux = y, which is given by     3 −6 2 −2 0 −2 3 −2 0 0 8 −8 0 0 0 4         x1 x2 x3 x4     =     −7 −3 8 8     . By back substitution we get the desired solution: x4 = 8 4 = 2 , x3 = 8 + 8x4 8 = 3 , x2 = −3 − 3x3 + 2x4 −2 = −8 −2 = 4 , x1 = −7 + 6x2 − 2x3 + 2x4 3 = 15 3 = 5 . □ Given a n × n matrix A with a LU-decomposition A = LU, the efficiency of solving the linear equation Ax = b can be analyzed by counting the number of multiplications and divisions re- quired.17 As we saw before, to solve Ax = b using LU-factorization, the process involves solving Ly = b and Ux = y. It can be proved that approximately n2 multiplications and divisions are needed for these steps. 
For instance, in the case of a 4 × 4 matrix A as in 3.E.3, you can count exactly 16 multiplications and divisions, 6 for solving the equation Ly = b and 10 for solving the equation Ux = y. 17Additions are typically ignored in this analysis due to their lower computational “cost”. CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS and we obtain A† = A† ABAA† = ( E 0 0 0 ) B ( E 0 0 0 ) = ( D−1 0 0 0 ) . Consequently B = ( D−1 P Q R ) for suitable matrices P, Q and R. Next, BA = ( D−1 P Q R ) ( D 0 0 0 ) = ( E 0 QD 0 ) is Hermitian. Thus QD = 0 which implies Q = 0 (the matrix D is diagonal and invertible). Analogously, the assumption that AB is Hermitian implies that P is zero. Finally, we com- pute B = BAB = ( D−1 0 0 R ) ( D 0 0 0 ) ( D−1 0 0 R ) . On the right side in the right-lower corner there is zero, and thus also R = 0 and the claim is proved. (4): Consider the mapping φ : Kn → Km , x → Ax, and direct sums Kn = (Ker φ)⊥ ⊕Ker φ, Km = Im φ⊕(Im φ)⊥ of the orthogonal complements. The restricted mapping ˜φ := φ|(Ker φ)⊥ : (Ker φ)⊥ → Im φ is a linear isomorphism. If we choose suitable orthonormal bases on (Ker φ)⊥ and Im φ and extend them to orthonormal bases on whole spaces, the mapping φ will have matrix S and ˜φ the matrix D from the theorem about the singular decomposition. In the next section, we shall discuss in detail that for any given b ∈ Km , there is the unique vector which minimizes the distance ∥b − z∥ among all z ∈ Im φ (in analytic geometry we shall say that the point z realises the distance of b from the affine subspace Im φ), see 4.1.13). The properties of the norm proved in theorem 3.4.3 directly imply that this is exactly the component z = b1 of the decomposition b = b1 +b2, b1 ∈ Im φ, b2 ∈ (Im φ)⊥ . Now, in our choice of bases, the mapping φ† is given by the matrix S† from the singular decomposition theorem. In particular, φ† (Im φ) = (Ker φ)⊥ , D−1 is the matrix of the restriction φ† | Im φ, and φ† |(Im φ)⊥ is zero. Indeed, φ ◦ φ† (b) = φ(φ† (z)) = z and the proof is finished. (5) Evidently, the equality Ax = b, with x ∈ Kn fixed, implies b = Ax = AA† Ax = AA† b. Thus the condition is necessary. On the other hand, if this condition holds, then the choice x = A† b + (E − A† A)u as in (5) implies Ax = A(A† b + (E − A† A)u) = b + (A − AA† A)u = b. The rank of the matrix E − A† A gives the correct size of the image of the corresponding mapping according to the Kronecker-Capelli theorem (cf. 2.3.5) about the solution of the system of linear equations, and thus we obtain all solutions in this way. □ 252 On the other hand, performing row reduction directly on the augmented matrix ( A b ) to transform it to ( U y ) requires about n3 /3 arithmetic operations (as multiplications and divisions), which is significantly more time-consuming. Even worse, computing the inverse of A demands about n3 operations, and multiplying b by A−1 to obtain x = A−1 b requires another n2 operations. 3.E.4. How are determinants of large matrices computed? Typically, determinants of large matrices are not calculated using cofactor expansion, but rather through matrix decomposition. For instance, with an LU-decomposition of a squared matrix A, the determinant of A satisfies the relation det(A) = det(L) det(U). Since both L, U are triangular matrices, the previous relation implies that det(A) equals the product of the elements on the diagonals of L and U. 3.E.5. Consider the LU-decomposition of the matrix A described in 3.E.1. 
i) Express the lower-triangular matrix L in terms of elementary matrices and compute its inverse. ii) Compute the inverse of the matrix U using Gauss elimination. iii) Calculate both the determinant and the inverse matrix of A, using its LU-decomposition as derived above. ⃝

Let us now focus on the singular value decomposition, which is possibly the most important matrix decomposition, with a wide range of remarkable applications. According to the singular value decomposition (in short SVD), given a matrix A ∈ Matm,n(F), where F = R or F = C, we can find unitary matrices U ∈ Matm(F) and V ∈ Matn(F), and a diagonal matrix Σ ∈ Matm,n(F) with non-negative entries, such that A = UΣV∗, see 3.5.3. In such a decomposition the diagonal matrix Σ consists of the singular values of A, which are the non-negative square roots of the eigenvalues of A∗A (or equivalently of AA∗). They are typically denoted by σ1, . . . , σ_{min{m,n}}, and we order them so that σ1 ≥ σ2 ≥ . . . ≥ σ_{min{m,n}}. In fact, there are exactly rank(A) non-zero singular values of A, as rank(UΣV∗) = rank(Σ), and the rank of a diagonal matrix coincides with the number of non-zero diagonal elements. On the other hand, the columns of V are (normalized) eigenvectors of A∗A, while the columns of U are (normalized) eigenvectors of AA∗.

Remark. Notice that the last computation in the proof verifies that (E − A†A) is the matrix of the projection of R^n onto the subspace of all solutions of the homogeneous system Ax = 0. It can also be shown that the matrix A† minimizes the square of the norm of the expression AA† − E, that is, the sum of squares of all elements of the given matrix. The claim (4) of the theorem can also be interpreted as follows: AA† is the matrix of the orthogonal projection from the vector space R^m onto the subspace generated by the columns of the matrix A (m is the number of rows of the matrix A). This interpretation is especially meaningful for matrices with more rows than columns. Moreover, for matrices A whose columns are independent vectors, the expression (A^T A)^{−1} A^T makes sense, and it is not hard to verify that this matrix satisfies all the properties from (1) and (2) of the previous theorem. Thus it is the pseudoinverse A† of the matrix A.

3.5.8. Linear regression. The approximation property (4) from the previous theorem is very useful in the cases where we are to find as good an approximation as possible of the (non-existent) solution of a given system Ax = b, where A is a real matrix of the type m/n and m > n. For instance, an experiment gives many measured real values bj, j = 1, . . . , m, and we want to find a linear combination of only a few fixed functions fi, i = 1, . . . , n, which approximates the values bj as well as possible. The actual values of the fixed functions at the relevant points yj ∈ R define the matrix A with entries aji = fi(yj), whose columns are given by the values of the individual functions fi at the considered points. The goal is to determine the coefficients xi ∈ R so that the sum of the squares of the deviations from the actual values,
∑_{j=1}^{m} ( bj − ∑_{i=1}^{n} xi fi(yj) )² = ∑_{j=1}^{m} ( bj − ∑_{i=1}^{n} aji xi )²,
is minimized. By the previous theorem, the optimal coefficients are A†b.

As an example, consider just three functions f0(y) = 1, f1(y) = y, f2(y) = y². Assume that the "measured values" of their unknown combination g(y) = x0 + x1y + x2y² at the integral values of y between 1 and 5 are b^T = (1, 12, 6, 27, 33).
(This vector b arose by computing the values 1+y +y2 at the given points adjusted by random integral values in the range ±10.) This leads in our case to the matrix A = (bji) AT =   1 1 1 1 1 1 2 3 4 5 1 4 9 16 25   . 253 3.E.6. How many (non-zero) singular values has the matrix A = ( 4 0 3 5 ) ? Describe them explicitly. ⃝ 3.E.7. For the matrix A given below, derive the matrix Σ, as described above: A =   0 0 −1 2 −1 0 0 0 0 0   . ⃝ There is a simple way to perform the SVDdecomposition, which is described as follows: First compute the matrices Σ and V as suggested above. To compute the unitary matrix U, which forms the last piece of the SVD-decomposition, we rely on the relation AV = UΣ. To simplify the description, suppose that rank(A) = r and express V, U in terms of columns, as V = ( v1 v2 · · · vn ) and U = ( u1 u2 · · · um ) respectively. Then we have AV = ( Av1 Av2 · · · Avn ) = ( σ1u1 · · · σrur 0 · · · 0 ) = UΣ . Hence we will have uk = 1 σk Avk, for all k = 1, . . . , r. To determine the remaining m − r columns of U it suffices to extend the set {u1, . . . , ur} to an orthonormal basis of Fm , which usually can be done via the Gram–Schmidt procedure. 3.E.8. Describe the singular value decompositions of the matrices given in 3.E.6 and 3.E.7, respectively. Solution. (a) Let us first consider the 2 × 2 matrix A, where Σ = diag(2 √ 10, √ 10). Let us derive the eigenvectors corresponding to the eigenvalues λ1 = 40 and λ2 = 10 of AT A, to construct the matrix V . The solution space of the linear system induced by the matrix equation (AT A − 40E)v has the form {(x1, x2)T = (t, t)T : t ∈ R}. Hence, eigenvectors of AT A corresponding to λ1 = 40 are scalar multiples of the vector (1, 1)T . This vector has norm √ 2 and we obtain the normalized eigenvector v1 = (√ 2 2 , √ 2 2 )T . For the second eigenvalue we see that the linear system induced by the matrix equation (AT A − 10E)v has solution space {(x1, x2)T = (−s, s)T : s ∈ R}. Consequently, eigenvectors of AT A corresponding to λ2 are scalar multiples of (−1, 1)T . The norm of this vector is also √ 2, and we obtain the normalized eigenvector v2 = ( − √ 2 2 , √ 2 2 )T . Therefore, V = ( v1 v2 ) = (√ 2 2 − √ 2 2√ 2 2 √ 2 2 ) , CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS The requested optimal coefficients for the combination are x = A† · b =    9 5 0 −4 5 −3 5 3 5 −37 35 23 70 6 7 37 70 −23 35 1 7 − 1 14 −1 7 − 1 14 1 7    ·       1 12 6 27 33       ≃   0.600 0.614 1.214   . The resulting approximation can be seen in the picture, where the given values b are shown by the diamonds, while the dashed curve stays for the resulting approximation g(y) = x1 + x2y + x3y2 . The computation was produced in Maple and taking 15 values yi = 1 + i + i2 , with a random vector of deviations from the same range added produced the following picture: 254 and it is easy to verify that V is unitary. Next, for the matrix U = ( u1 u2 ) we have u1 = 1 σ1 Av1 = 1 2 √ 10 ( 4 0 3 5 ) (√ 2 2√ 2 2 ) = ( √ 5 5 2 √ 5 5 ) , u2 = 1 σ2 Av2 = 1 √ 10 ( 4 0 3 5 ) ( − √ 2 2√ 2 2 ) = ( −2 √ 5 5√ 5 5 ) . A direct computations show that u1, u2 are orthonormal, and hence the matrix U is given by U = ( u1 u2 ) = ( √ 5 5 −2 √ 5 5 2 √ 5 5 √ 5 5 ) , such that A = UΣV T . 
For a quick confirmation, you may use Sage:

A=matrix([[4, 0], [3, 5]])
show("\nThe matrix A is given by A=", A)
S=diagonal_matrix([2*sqrt(10), sqrt(10)])
show("\nThe matrix S is given by S=", S)
V=matrix([[sqrt(2)/2, -sqrt(2)/2], [sqrt(2)/2, sqrt(2)/2]])
show("\nThe matrix V is given by V=", V)
print(V.is_unitary())
U=matrix([[sqrt(5)/5, -2*sqrt(5)/5], [2*sqrt(5)/5, sqrt(5)/5]])
show("\nThe matrix U is given by U=", U)
print(U.is_unitary())
print(A*V==U*S)

(b) In this case we have seen that Σ = diag(1, 1/2, 0). For the eigenvalues λ1 = 1, λ2 = 1/4 and λ3 = 0 of A^T A we compute the normalized eigenvectors v1 = (1, 0, 0)^T, v2 = (0, 0, 1)^T and v3 = (0, 1, 0)^T. This can be confirmed in Sage by the following cell:

A=matrix([[0, 0, -1/2], [-1, 0, 0], [0, 0, 0]])
AT=A.T; a= AT*A
show(a.eigenvalues())
S=diagonal_matrix([1, 1/2, 0])
show("\nThe matrix S is given by S=", S)
aein=a.eigenvectors_right()
show(aein)

which also prints the expression of the matrix Σ (denoted inside the cell by S). Thus V = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}. Next, for the matrix U = ( u1 u2 u3 ) we can compute the first two columns by the rule uk = (1/σk) A vk. This gives
u1 = (1/σ1) A v1 = (1/1) \begin{pmatrix} 0 & 0 & -1/2 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ -1 \\ 0 \end{pmatrix},
u2 = (1/σ2) A v2 = (1/(1/2)) \begin{pmatrix} 0 & 0 & -1/2 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} -1 \\ 0 \\ 0 \end{pmatrix}.

It remains to extend {u1, u2} to an orthonormal basis of R³. Thus, we need to find a unit vector which is orthogonal to both u1, u2. Such a vector is u3 = (0, 0, −1)^T, which is in fact the cross product of u1, u2 in R³:
u1 × u2 = \begin{vmatrix} ⃗i & ⃗j & ⃗k \\ 0 & -1 & 0 \\ -1 & 0 & 0 \end{vmatrix} = 0 ·⃗i − 0 ·⃗j + (−1) · ⃗k.
This gives the matrix U = \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & -1 \end{pmatrix}, such that A = UΣV^T, i.e.,
A = \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1/2 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}.
If you would like to verify this computation using Sage, you can continue by typing the following code in the previous block:

V=matrix([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
show("\nThe matrix V is given by V=", V)
print(V.is_unitary())
U=matrix([[0, -1, 0], [-1, 0, 0], [0, 0, -1]])
show("\nThe matrix U is given by U=", U)
print(U.is_unitary())
A==U*S*(V.T)

In this way Sage confirms that both the matrices U and V are unitary, and moreover the relation A = UΣV^T. Note that Sage offers a built-in method to compute the singular value decomposition of a matrix, but it is only available for matrices over the numerical systems CDF or RDF. For our matrix A, for example, one can type

A=matrix(RDF, [[0, 0, -1/2], [-1, 0, 0], [0, 0, 0]])
show(A.SVD())

Note that Sage's output differs from our result only in terms of signs and column permutations. This variation is expected, since the singular vectors (the columns of U and V) are determined only up to such choices. Hence, be aware that different software implementations or methods might produce different but equally valid SVD results. □

Using the components of the SVD-decomposition of a matrix A we can construct the polar decomposition, which is another significant matrix factorization, see 3.5.5. This describes a complex matrix A as a product of a unitary matrix Up and a positive semi-definite matrix P (recall that P being positive semi-definite means that its eigenvalues are non-negative). If A is real, then Up is an orthogonal matrix. If A = UΣV∗ is the SVD-decomposition of A, then the corresponding polar decomposition A = PUp is obtained easily by setting Up = UV∗ and P = UΣU∗, respectively.
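This recipe is easy to try numerically. Here is a minimal sketch over RDF, applied to the matrix from 3.E.6 (so as not to spoil 3.E.9 below); since RDF works with floating point numbers, equality is checked only up to rounding:

A = matrix(RDF, [[4, 0], [3, 5]])
U, S, V = A.SVD()            # over RDF: A == U*S*V.transpose(), up to rounding
P = U*S*U.transpose()        # the positive semi-definite factor
Up = U*V.transpose()         # the orthogonal factor
print((P*Up - A).norm() < 1e-12)   # True: A = P*Up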
3.E.9. For the matrix A given in 3.E.7 compute its polar decomposition A = PUp and use Sage to verify your result. Is this decomposition unique? Next show that the matrix P is Hermitian and positive semi-definite. ⃝

Let A = UΣV∗ be the SVD-decomposition of an m × n matrix A of rank r, where Σ = diag(σ1, . . . , σr, 0, . . . , 0) is the m × n diagonal matrix with the singular values of A. The pseudo-inverse of A (also known as the Moore-Penrose inverse) is defined by A† = V Σ† U∗, where Σ† = diag(1/σ1, . . . , 1/σr, 0, . . . , 0). This matrix functions as a generalized inverse for matrices that are singular or non-square. As such, it extends the concept of an inverse matrix in several ways, see 3.5.7. Notably, while not every matrix has an inverse, every matrix, including non-square ones, has a pseudo-inverse.

3.E.10. For the matrices A given in 3.E.6 and 3.E.7 compute the corresponding pseudo-inverses.

Solution. (a) For the matrix A presented in 3.E.6 we obtain Σ† = \begin{pmatrix} 1/(2√10) & 0 \\ 0 & 1/√10 \end{pmatrix}. Thus, the explicit form of the pseudo-inverse A† = V Σ† U^T of the matrix A is given by
A† = \begin{pmatrix} √2/2 & -√2/2 \\ √2/2 & √2/2 \end{pmatrix} \begin{pmatrix} 1/(2√10) & 0 \\ 0 & 1/√10 \end{pmatrix} \begin{pmatrix} √5/5 & 2√5/5 \\ -2√5/5 & √5/5 \end{pmatrix} = \begin{pmatrix} 1/4 & 0 \\ -3/20 & 1/5 \end{pmatrix}.
To confirm our result we need to verify the identity AA†A = A, which is direct:
AA†A = \begin{pmatrix} 4 & 0 \\ 3 & 5 \end{pmatrix} \begin{pmatrix} 1/4 & 0 \\ -3/20 & 1/5 \end{pmatrix} \begin{pmatrix} 4 & 0 \\ 3 & 5 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 4 & 0 \\ 3 & 5 \end{pmatrix} = A.
In fact, observe that det(A) = 20 ≠ 0. Hence A is nonsingular and has a unique inverse, which should coincide with A†. This provides another confirmation:
A^{−1} = (1/det(A)) \begin{pmatrix} 5 & 0 \\ -3 & 4 \end{pmatrix} = \begin{pmatrix} 1/4 & 0 \\ -3/20 & 1/5 \end{pmatrix} = A†.

(b) For the matrix A presented in 3.E.7 we obtain Σ† = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}, and hence the corresponding pseudo-inverse has the form
A† = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & -1 \end{pmatrix} = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & 0 \\ -2 & 0 & 0 \end{pmatrix}.
Note that det(A) = 0, hence this matrix does not have an inverse. To confirm the relation AA†A = A (or A†AA† = A†), you can use Sage as follows:

A=matrix([[0, 0, -1/2], [-1, 0, 0], [0, 0, 0]])
Aps=matrix([[0, -1, 0], [0, 0, 0], [-2, 0, 0]])
print(A*Aps*A==A)

Or you can use the following routine, which relies on Scipy (or Numpy):

def pseudo_inverse(mat):
    from scipy import linalg
    return matrix(linalg.pinv(mat))
Ap=pseudo_inverse(A)
show(Ap)

Note that one should add this cell to the previous block. □

Let us conclude our description of matrix factorizations with the QR-decomposition, which expresses a complex matrix as a product of a unitary matrix Q and an upper-triangular matrix R, see 3.5.6. If, for example, A ∈ Matm,n(C) comes with n linearly independent columns, then using the Gram-Schmidt orthogonalization we can obtain the factorization A = QR, where Q is of type m × n with orthonormal columns, and R is an invertible upper triangular matrix with positive diagonal entries. Finding the QR-decomposition only requires simple algebraic operations, and it can be useful to summarize the steps for a matrix A ∈ Matm,n(C), as above:
Step 1: Via the Gram-Schmidt procedure find an orthogonal basis of the column space C(A) of A.
Step 2: Normalize the vectors of this basis to get the columns of Q.
Step 3: As the columns of Q are orthonormal with respect to the dot product, we have Q^T Q = E, where E is the n × n identity matrix. Hence, the upper triangular matrix R is given by the formula R = Q^T A.
Alternatively, if u1, . . . , un are the linearly independent columns of A, and
q1, . . . , qn are the orthonormal vectors obtained by the Gram-Schmidt procedure, then the entries rij of R are given by rij = ⟨uj, qi⟩ for any 1 ≤ i, j ≤ n, so that rij = 0 for all i > j. Note that both Q and R are unique.

3.E.11. Present the QR-decomposition of the matrix A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}.

Solution. The columns of A are the vectors E1, E2, E3 presented in 2.C.50. Recall that there we constructed the orthonormal basis {ŵi : i = 1, 2, 3} of the 3-dimensional subspace W = span_R{E1, E2, E3}. Obviously, this coincides with the column space of A, i.e., W = C(A). Thus the 4 × 3 matrix Q is given by Q = ( ŵ1 ŵ2 ŵ3 ), or in other words
Q = \begin{pmatrix} 1/2 & √3/6 & √6/6 \\ 1/2 & √3/6 & √6/6 \\ 1/2 & √3/6 & -√6/3 \\ 1/2 & -√3/2 & 0 \end{pmatrix}.
We can verify that Q^T Q = E in Sage, but again you should be careful in the following sense. If we use the number system QQbar to define the matrix Q, then the cell below gives True, which is the correct result.

Q=matrix(QQbar, [[1/2, sqrt(3)/6, sqrt(6)/6], [1/2, sqrt(3)/6, sqrt(6)/6], [1/2, sqrt(3)/6, -sqrt(6)/3], [1/2, -sqrt(3)/2, 0]])
Q.transpose()*Q == matrix.identity(3)

However, if we configure the matrix Q using one of the number systems RR, CC, or CDF, Sage will return False. Can you explain the possible reasons why this might occur? Next, the matrix R arises as the product Q^T A, which gives
R = \begin{pmatrix} 2 & 3/2 & 1 \\ 0 & √3/2 & √3/3 \\ 0 & 0 & √6/3 \end{pmatrix}.
Note that det(R) = √2 ≠ 0, hence R is invertible, as it should be. Sage provides a built-in method to obtain the QR-decomposition of a matrix A, based on the command A.QR(). Hence, you can quickly verify the result, just by typing

A=matrix(QQbar, [[1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 0, 0]])
Q, R = A.QR()

In case you would like to confirm the expression of R in a "manual" way, you may use the following cell:

Q=matrix(QQbar, [[1/2, sqrt(3)/6, sqrt(6)/6], [1/2, sqrt(3)/6, sqrt(6)/6], [1/2, sqrt(3)/6, -sqrt(6)/3], [1/2, -sqrt(3)/2, 0]])
A = matrix(RR, [[1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 0, 0]])
R=(Q.T)*A; R

□

The QR-decomposition of a matrix A is very useful when a system of linear equations Ax = b which has no solution is given, but an approximation as good as possible is needed. In other words, the main point is to minimize ∥Ax − b∥. According to the Pythagorean theorem, one has
∥Ax − b∥² = ∥Ax − b_∥∥² + ∥b_⊥∥²,
where b is decomposed as b = b_∥ + b_⊥. The first component belongs to the range of the linear transformation A, and the second one is perpendicular to this range. Now, the projection onto the range of A can be written in the form QQ^T for a suitable matrix Q, which one obtains by the Gram-Schmidt orthonormalisation of the columns of A. Then it follows that
Ax − b_∥ = Q(Q^T Ax − Q^T b).
The system in the parentheses has a solution, for which ∥Ax − b∥ = ∥b_⊥∥, and this is the minimal value. Furthermore, the matrix R := Q^T A is upper triangular, therefore the approximate solution can be easily obtained.

3.E.12. Find an approximate solution of the following system of equations: {x1 + 2x2 = 1, 2x1 + 4x2 = 4}.

Solution. We may express the given system in the matrix form Ax = b, where A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}, b = \begin{pmatrix} 1 \\ 4 \end{pmatrix}. Evidently, this system has no solution. Let us orthonormalize the columns of A: Take the first of them and divide it by its norm. This yields the first vector of the orthonormal basis, (1/√5)(1, 2)^T. The second column vector of A is twice the first, and thus we deduce that Q = (1/√5) \begin{pmatrix} 1 \\ 2 \end{pmatrix}. The projector onto the range of A is then given by QQ^T = (1/5) \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}. Next, we see that
Q^T b = (1/√5) ( 1  2 ) \begin{pmatrix} 1 \\ 4 \end{pmatrix} = 9/√5 and R = Q^T A = (1/√5) ( 1  2 ) \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = (1/√5) ( 5  10 ).
Thus, the approximate solution should satisfy the relation Rx = Q^T b, which implies that 5x1 + 10x2 = 9. Note that the approximate solution is not unique (in contrast to the QR-decomposition). In particular, the QR-decomposition of our matrix A has the form
\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = (1/√5) \begin{pmatrix} 1 \\ 2 \end{pmatrix} · (1/√5) ( 5  10 ). □
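This least-squares answer can be cross-checked with the pseudo-inverse introduced above: among all solutions of 5x1 + 10x2 = 9, the vector A†b is the one of minimal norm. A minimal Sage sketch:

A = matrix(QQ, [[1, 2], [2, 4]])
b = vector(QQ, [1, 4])
x = A.pseudoinverse()*b   # minimal-norm least-squares solution
print(x)                  # (9/25, 18/25)
print(5*x[0] + 10*x[1])   # 9, so x indeed satisfies Rx = Q^T*b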
3.E.12. Find an approximate solution of the following system of equations:
$$x_1 + 2x_2 = 1, \qquad 2x_1 + 4x_2 = 4.$$

Solution. We may express the given system in the matrix form $Ax = b$, where
$$A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 4 \end{pmatrix}.$$
Evidently, this system has no solution. Let us orthonormalize the columns of $A$: take the first of them and divide it by its norm. This yields the first vector of the orthonormal basis, $\frac{1}{\sqrt{5}}(1, 2)^T$. The second column vector of $A$ is twice the first, and thus we deduce that $Q = \frac{1}{\sqrt{5}}\begin{pmatrix} 1 \\ 2 \end{pmatrix}$. The projector onto the range of $A$ is then given by
$$QQ^T = \frac{1}{5}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}.$$
Next, we see that
$$Q^T b = \frac{1}{\sqrt{5}}\begin{pmatrix} 1 & 2 \end{pmatrix}\begin{pmatrix} 1 \\ 4 \end{pmatrix} = \frac{9}{\sqrt{5}} \quad\text{and}\quad R = Q^T A = \frac{1}{\sqrt{5}}\begin{pmatrix} 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = \frac{1}{\sqrt{5}}\begin{pmatrix} 5 & 10 \end{pmatrix}.$$
Thus, the approximate solution should satisfy the relation $Rx = Q^T b$, which implies that $5x_1 + 10x_2 = 9$. Note that the approximate solution is not unique (unlike the QR-decomposition). In particular, the QR-decomposition of our matrix $A$ has the form
$$\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = \frac{1}{\sqrt{5}}\begin{pmatrix} 1 \\ 2 \end{pmatrix} \cdot \frac{1}{\sqrt{5}}\begin{pmatrix} 5 & 10 \end{pmatrix}. \quad \square$$

3.E.13. Minimise $\|Ax - b\|$ for
$$A = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}.$$
Write down also the QR-decomposition of the matrix $A$.

Solution. The normalised first column of the matrix $A$ is $u_1 = \frac{1}{\sqrt{6}}(2, -1, -1)^T$. From the second column, subtract its component in the direction $u_1$: set $u = (-1, 2, -1)^T$. Then we see that $\langle u, u_1 \rangle = -\frac{3}{\sqrt{6}}$. Thus it follows that
$$u - \langle u, u_1 \rangle u_1 = \frac{1}{2}\begin{pmatrix} 0 \\ 3 \\ -3 \end{pmatrix}.$$
Hence, we have created an orthogonal vector, which we may normalise to obtain the vector $u_2 = \frac{1}{\sqrt{2}}(0, 1, -1)^T$. The third column of the matrix $A$ is already linearly dependent on the first two, and thus the matrix $Q$ with orthonormal columns has the form
$$Q = \frac{1}{\sqrt{6}}\begin{pmatrix} 2 & 0 \\ -1 & \sqrt{3} \\ -1 & -\sqrt{3} \end{pmatrix}.$$
Consequently, we obtain
$$R = Q^T A = \frac{1}{\sqrt{6}}\begin{pmatrix} 2 & -1 & -1 \\ 0 & \sqrt{3} & -\sqrt{3} \end{pmatrix}\begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix} = \frac{1}{\sqrt{6}}\begin{pmatrix} 6 & -3 & -3 \\ 0 & 3\sqrt{3} & -3\sqrt{3} \end{pmatrix}$$
and
$$Q^T b = \frac{1}{\sqrt{6}}\begin{pmatrix} 2 & -1 & -1 \\ 0 & \sqrt{3} & -\sqrt{3} \end{pmatrix}\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \frac{1}{\sqrt{6}}\begin{pmatrix} 2 \\ 0 \end{pmatrix}.$$
Writing $x = (x, y, z)^T$, the solutions of the equation $Rx = Q^T b$ satisfy $y = z$ and $x = y + \frac13$; that is, the vectors $(\frac13, 0, 0)^T + t(1, 1, 1)^T$, $t \in \mathbb{R}$, minimize $\|Ax - b\|$. Note that the mapping given by the matrix $A$ is, up to the factor 3, the projection onto the plane with normal vector $(1, 1, 1)^T$. □

When dealing with an overdetermined system of linear equations (more equations than unknowns), the system may not have an exact solution. In such cases, the pseudo-inverse provides the best approximate solution in the least squares sense. The pseudo-inverse is also used in linear regression for finding the optimal parameters. Next we analyze how to solve problems of linear regression. Such tasks are about finding the best approximation of some functional dependence by a linear function. In particular, given the values of the dependence at some points, that is,
$$f(a^1_1, \dots, a^1_n) = y_1, \ \dots,\ f(a^k_1, a^k_2, \dots, a^k_n) = y_k,$$
with $k > n$, we wish to find the “best possible” approximation of this dependence by a linear function $f(x_1, \dots, x_n) = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + c$. Observe that since $k > n$, one has more equations than unknowns. We choose to define “best possible” by the minimisation of
$$\sum_{i=1}^{k}\Big(y_i - \sum_{j=1}^{n} b_j a^i_j - c\Big)^2$$
with respect to the real constants $b_1, \dots, b_n, c$. The goal is to find such a linear combination of the columns of the matrix $A = (a^i_j)$ (with coefficients $b_1, \dots, b_n$) that is closest to the vector $(y_1, \dots, y_k)^T$ in $\mathbb{R}^k$. This means that the whole procedure is about specifying an orthogonal projection of the vector $(y_1, \dots, y_k)^T$ onto the subspace generated by the columns of $A$. Using the theorem presented in 3.5.7, it follows that this projection is described by the vector $(b_1, \dots, b_n)^T = A^\dagger (y_1, \dots, y_k)^T$, where $A^\dagger$ is the pseudo-inverse of $A$.
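Here is a small illustration of this projection formula on a made-up data set (the points and values below are our own example, not from the text), reusing the Scipy-based routine from 3.E.10. The column of ones in $A$ picks up the constant term $c$:
def pseudo_inverse(mat):
    from scipy import linalg
    return matrix(linalg.pinv(mat))

A = matrix(RDF, [[1, 0], [1, 1], [1, 2], [1, 3]])
y = vector(RDF, [0.1, 0.9, 2.1, 2.9])
b = pseudo_inverse(A) * y
print(b)       # (c, b_1): intercept and slope of the best-fitting line
print(A * b)   # the orthogonal projection of y onto the column space C(A)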
3.E.14. Linear regression. Using the least squares method, solve the linear system given below:
$$2x + y + 2z = 1, \quad x + y + 3z = 2, \quad 2x + y + z = 0, \quad x + z = -1.$$

Solution. The system has no solution, since its matrix has rank 3, while the extended matrix has rank 4. According to the theorem in 3.5.7, the best approximation of the vector $b = (1, 2, 0, -1)^T$ is obtained by the vector $A^\dagger b$. In particular, $AA^\dagger b$ is then the best approximation – the perpendicular projection of the vector $b$ onto the column space $C(A)$ of $A$. Because the columns of the matrix $A$ are linearly independent, its pseudo-inverse is given by the relation $A^\dagger = (A^T A)^{-1} A^T$. We compute
$$A^\dagger = \left[\begin{pmatrix} 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 0 \\ 2 & 3 & 1 & 1 \end{pmatrix}\begin{pmatrix} 2 & 1 & 2 \\ 1 & 1 & 3 \\ 2 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix}\right]^{-1}\begin{pmatrix} 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 0 \\ 2 & 3 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 10 & 5 & 10 \\ 5 & 3 & 6 \\ 10 & 6 & 15 \end{pmatrix}^{-1}\begin{pmatrix} 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 0 \\ 2 & 3 & 1 & 1 \end{pmatrix}$$
$$= \begin{pmatrix} 3/5 & -1 & 0 \\ -1 & 10/3 & -2/3 \\ 0 & -2/3 & 1/3 \end{pmatrix}\begin{pmatrix} 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 0 \\ 2 & 3 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 1/5 & -2/5 & 1/5 & 3/5 \\ 0 & 1/3 & 2/3 & -5/3 \\ 0 & 1/3 & -1/3 & 1/3 \end{pmatrix},$$
and now we can compute the desired $x$, which is given by $x = A^\dagger b = (-6/5, 7/3, 1/3)^T$. Consequently, the best possible approximation of the right-hand side is the vector $(3/5, 32/15, 4/15, -13/15)^T$. □

F. Additional exercises for the whole chapter

As usual, we proceed with supplementary exercises related to the notions that we have discussed so far.

A) Material on linear programming

3.F.1. Manufacturing bolts and nuts. A company manufactures bolts and nuts. Moulding a box of bolts takes one minute, while a box of nuts is moulded for 2 minutes. Preparing the box itself takes one minute for bolts and 4 minutes for nuts. The company has at its disposal two hours for moulding and three hours for box preparation. Demand dictates that it is necessary to manufacture at least 90 boxes of bolts more than boxes of nuts. Due to technical reasons, it is not possible to manufacture more than 110 boxes of bolts. The profit from one box of bolts is $4 and the profit from one box of nuts is $6. The company has no trouble with selling. Write down the corresponding LP problem and present its standard form. Deduce graphically how many boxes of nuts and bolts should be manufactured in order to have maximal profit.

Solution. For convenience, let us put the given data into a table:

            Bolts (1 box)   Nuts (1 box)   Capacity
Mould       1 min./box      2 min./box     120 min.
Box         1 min./box      4 min./box     180 min.
Profit      $4/box          $6/box

Denote by $x_1$ the number of manufactured boxes of bolts and by $x_2$ the number of manufactured boxes of nuts. From the restriction on moulding time and from the restriction on the box preparation we obtain the following restrictive conditions:
$$x_1 + 2x_2 \leq 120, \quad x_1 + 4x_2 \leq 180, \quad x_1 \geq x_2 + 90, \quad x_1 \leq 110.$$
The standard form reads as the maximization of the profit function $h(x_1, x_2) = 4x_1 + 6x_2$ subject to the conditions
$$x_1 + 2x_2 \leq 120, \quad x_1 + 4x_2 \leq 180, \quad x_2 - x_1 \leq -90, \quad x_1 \leq 110,$$
with $x_1 \geq 0$, $x_2 \geq 0$. The feasible region is the grey region in the figure below, where we have also included the objective lines $4x_1 + 6x_2 = k$. From this figure we deduce that the point $P = (110, 5)$ maximizes $h$, and the maximum possible income is thus $4 \cdot 110 + 6 \cdot 5 = \$470$. □
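The graphical answer can also be cross-checked numerically. The following is a sketch using the same Sage LP interface that appears in 3.F.2 below:
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective(4*x1 + 6*x2)
p.add_constraint(x1 + 2*x2 <= 120)
p.add_constraint(x1 + 4*x2 <= 180)
p.add_constraint(x2 - x1 <= -90)
p.add_constraint(x1 <= 110)
print(p.solve())                 # 470.0
print(p.get_values(x1, x2))      # (110.0, 5.0)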
3.F.2. Investments and profits. An insurance company has a capital of €100,000 which it aims to invest in two different ways: investments of type X and investments of type Y. These types of investments give an annual income of 10% and 15%, respectively. However, there is a limitation by the government, which requires that at least 25% of the capital should be invested in the X-type investment. On the other hand, the policy of the company requires that the ratio between the capital used for the Y-type investment and the capital used for the X-type investment not be greater than 1.5. How should the company invest its capital? Articulate the problem as an LP problem, and find the solution via the simplex method. Verify your solution using Sage.

Solution. Let $x_1, x_2$ be the decision variables corresponding to the capital (in euros) that the company will use for the X-type and the Y-type investment, respectively. The objective function is given by
$$h = 0.1x_1 + 0.15x_2 = \frac{1}{10}x_1 + \frac{3}{20}x_2 = \frac{2}{20}x_1 + \frac{3}{20}x_2.$$
The corresponding LP problem is the maximization of $h$ under the constraints
$$x_1 + x_2 \leq 100000, \quad \frac{x_1}{x_1 + x_2} \geq 0.25, \quad \frac{x_2}{x_1} \leq 1.5, \quad x_1 \geq 0, \quad x_2 \geq 0,$$
which we can equivalently write as
$$x_1 + x_2 \leq 100000, \quad -\tfrac34 x_1 + \tfrac14 x_2 \leq 0, \quad -\tfrac32 x_1 + x_2 \leq 0, \quad x_1 \geq 0, \quad x_2 \geq 0.$$
Let us introduce the slack variables $y_1, y_2, y_3$, with $y_i \geq 0$ for $i = 1, 2, 3$. The application of the simplex method yields four tableaux, which we present as follows:

         x1      x2     y1   y2   y3  |  rhs
 −h | −2/20  −3/20    0    0    0  |      0
 y1 |     1       1    1    0    0  | 100000
 y2 |  −3/4     1/4    0    1    0  |      0
 y3 |  −3/2       1    0    0    1  |      0
(x2 enters ⟹ y2 leaves)

 −h | −11/20   0   0   3/5   0  |      0
 y1 |      4   0   1    −4   0  | 100000
 x2 |     −3   1   0     4   0  |      0
 y3 |    3/2   0   0    −4   1  |      0
(x1 enters ⟹ y3 leaves)

 −h |  0   0   0  −13/15  11/30  |      0
 y1 |  0   0   1    20/3   −8/3  | 100000
 x2 |  0   1   0      −4      2  |      0
 x1 |  1   0   0    −8/3    2/3  |      0
(y2 enters ⟹ y1 leaves)

 −h |  0   0  13/100   0   1/50  | 13000
 y2 |  0   0    3/20   1   −2/5  | 15000
 x2 |  0   1     3/5   0    2/5  | 60000
 x1 |  1   0     2/5   0   −2/5  | 40000

Since in the last tableau the row of $-h$ has no negative entry, we have arrived at the optimal solution, which reads as follows: $(x_1 = 40000, x_2 = 60000)$ with $y_2 = 15000$. The maximum value of $h$ is 13000, hence the investments will bring the company a maximum profit of €13,000. To verify the solution in Sage, you can use the block
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2 = v["x1"], v["x2"]
p.set_objective((2/20)*x1 + (3/20)*x2)
p.add_constraint(x1 + x2 <= 100000)
p.add_constraint((-3/4)*x1 + (1/4)*x2 <= 0)
p.add_constraint((-3/2)*x1 + x2 <= 0)
k = p.solve()
x1, x2 = p.get_values(x1,x2)
print("Answer =", round(k, 2))
print("(x1, x2) =", (x1, x2))
By executing this block we obtain the desired verification:
Answer = 13000.0
(x1, x2) = (40000.0, 60000.0) □
3.F.3. If $x_i \geq 0$ for all $i = 1, 2, 3$, minimize the expression $-3x_1 - x_2 - 2x_3$ subject to the conditions
$$x_1 - x_2 + x_3 \geq -4, \quad 2x_1 + x_3 \leq 3, \quad x_1 + x_2 + 3x_3 \leq 8.$$

Solution. We multiply the objective function and the first inequality by $-1$, to get the equivalent task of maximizing $h = 3x_1 + x_2 + 2x_3$ subject to the conditions
$$-x_1 + x_2 - x_3 \leq 4, \quad 2x_1 + x_3 \leq 3, \quad x_1 + x_2 + 3x_3 \leq 8,$$
with $x_i \geq 0$ for $i = 1, 2, 3$. Introducing the non-negative slacks $x_4, x_5$ and $x_6$, we interpret the objective function as $3x_1 + x_2 + 2x_3 + 0\cdot x_4 + 0\cdot x_5 + 0\cdot x_6$. We now write down the first simplex tableau (the pivot element always appears in brackets):

 R0 |  −3  −1  −2   0   0   0  | 0
 R1 |  −1   1  −1   1   0   0  | 4
 R2 | [2]   0   1   0   1   0  | 3
 R3 |   1   1   3   0   0   1  | 8

To eliminate the first column, which is the driver, we apply the elementary row operations $R_2 \to \hat{R}_2 := \frac12 R_2$, $R_0 \to R_0 + 3\hat{R}_2$, $R_1 \to R_1 + \hat{R}_2$, and $R_3 \to R_3 - \hat{R}_2$. This yields the second tableau:

 R̂0 |  0  −1  −1/2   0   3/2   0  |  9/2
 R̂1 |  0 [1]  −1/2   1   1/2   0  | 11/2
 R̂2 |  1   0   1/2   0   1/2   0  |  3/2
 R̂3 |  0   1   5/2   0  −1/2   1  | 13/2

where the new basic variables are $x_1 = 3/2$, $x_4 = 11/2$, and $x_6 = 13/2$. This reflects the fact that we moved as much from the former slack $x_5$ to the new basic variable $x_1$ as possible. This increased the value of the objective function, which we may read in the top right corner of the tableau. We now move to the new work column, which is the one containing $-1$ in $\hat{R}_0$. In this column we bracket the new pivot, which equals 1 (since $11/2 < 13/2$). Hence we can directly eliminate the second column by the row operations $\hat{R}_0 \to \hat{R}_0 + \hat{R}_1$ and $\hat{R}_3 \to \hat{R}_3 - \hat{R}_1$ (observe that $\hat{R}_2$ already contains 0 in the work column). This gives

 R̃0 |  0   0   −1    1    2   0  |   10
 R̃1 |  0   1  −1/2   1   1/2  0  | 11/2
 R̃2 |  1   0   1/2   0   1/2  0  |  3/2
 R̃3 |  0   0  [3]   −1   −1   1  |    1

and shifts the new basic variable from $x_4$ to $x_2$. In this way we have increased the objective function. However, the first row $\tilde{R}_0$ still contains a negative number, and hence there is one more repetition. The new work column is the third one, and the next pivot is the bracketed number 3. By applying the necessary elementary row operations we arrive at the final tableau

 | 0  0  0   2/3   5/3   1/3  | 31/3
 | 0  1  0   5/6   1/3   1/6  | 17/3
 | 1  0  0   1/6   2/3  −1/6  |  4/3
 | 0  0  1  −1/3  −1/3   1/3  |  1/3

where the basic variables are the initial decision variables $x_1, x_2, x_3$. This gives the optimal solution $x_1 = 4/3$, $x_2 = 17/3$ and $x_3 = 1/3$, with the maximal value of $h$ being $31/3$. □
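As with 3.F.1, the hand computation can be cross-checked with a short Sage sketch (we maximize the transformed objective $h$, as in the solution above):
p = MixedIntegerLinearProgram()
v = p.new_variable(real=True, nonnegative=True)
x1, x2, x3 = v["x1"], v["x2"], v["x3"]
p.set_objective(3*x1 + x2 + 2*x3)
p.add_constraint(-x1 + x2 - x3 <= 4)
p.add_constraint(2*x1 + x3 <= 3)
p.add_constraint(x1 + x2 + 3*x3 <= 8)
print(p.solve())                 # 10.333... = 31/3
print(p.get_values(x1, x2, x3))  # (4/3, 17/3, 1/3), up to rounding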
3.F.4. Infinite optimal solutions. Consider the LP problem of maximizing $h = 3x_1 + x_2$ subject to the conditions
$$6x_1 + 2x_2 \leq 30, \quad 10x_1 + x_2 \leq 40, \quad x_1 \geq 0, \quad x_2 \geq 0.$$
Use the simplex method to show that an optimal solution of this LP problem is not unique. Verify your answer by a figure.

Solution. Let us introduce slack variables $y_1, y_2$ to bring the problem to its canonical form. This corresponds to the maximization of $h = 3x_1 + x_2 + 0y_1 + 0y_2$ under the constraints
$$6x_1 + 2x_2 + y_1 = 30, \quad 10x_1 + x_2 + y_2 = 40,$$
with $x_i \geq 0$ and $y_j \geq 0$ for $i, j = 1, 2$. By applying the simplex method we need two iterations to arrive at an optimal solution. Let us summarize the tableaux together with the pivot elements in brackets (we also indicate the corresponding elementary row operations).

            x1    x2   y1   y2  | rhs
 R0 −h |   −3    −1    0    0  |  0
 R1 y1 |    6     2    1    0  | 30
 R2 y2 | [10]    1    0    1  | 40

($R_2 \to \hat{R}_2 := \frac{1}{10}R_2$, $R_0 \to \hat{R}_0 := R_0 + 3\hat{R}_2$, $R_1 \to \hat{R}_1 := R_1 - 6\hat{R}_2$) ⟹

 R̂0 −h |  0  −7/10   0   3/10  | 12
 R̂1 y1 |  0  [7/5]   1  −6/10  |  6
 R̂2 x1 |  1   1/10   0   1/10  |  4

($\hat{R}_1 \to \tilde{R}_1 := \frac57\hat{R}_1$, $\hat{R}_0 \to \tilde{R}_0 := \hat{R}_0 + \frac{7}{10}\tilde{R}_1$, $\hat{R}_2 \to \tilde{R}_2 := \hat{R}_2 - \frac{1}{10}\tilde{R}_1$) ⟹

 R̃0 −h |  0   0    1/2     0  |   15
 R̃1 x2 |  0   1    5/7  −3/7  | 30/7
 R̃2 x1 |  1   0  −1/14   1/7  | 25/7

Hence in the first iteration $x_1$ enters the column of basic variables and $y_2$ leaves, while in the second one $x_2$ enters and $y_1$ leaves. Since $\tilde{R}_0$ does not contain a negative entry, we have arrived at an optimal solution, given by $(x_1 = \frac{25}{7}, x_2 = \frac{30}{7})$ (with $y_1 = 0 = y_2$). The maximal value of $h$ equals 15. However, observe that in the above tableau the non-basic variable $y_2$ appears with zero coefficient in $\tilde{R}_0$. Therefore, $y_2$ can be brought into the basis to generate a new optimal solution with the same value of the objective function $h$. For this we use the entry $1/7$ of $\tilde{R}_2$ as a pivot and apply the row operations $\tilde{R}_2 \to r_2 := 7\tilde{R}_2$, $\tilde{R}_1 \to \tilde{R}_1 + \frac37 r_2$, $\tilde{R}_0 \to \tilde{R}_0$ (the cost row remains the same). This gives the following tableau

 r0 −h |  3*0+0... 

 r0 −h |  0   0   1/2   0  | 15
 r1 x2 |  3   1   1/2   0  | 15
 r2 y2 |  7   0  −1/2   1  | 25

which provides the new optimal solution $(x_1 = 0, x_2 = 15)$ (with $y_1 = 0$ and $y_2 = 25$), such that $h(0, 15) = 15$. Since two optimal solutions have been found, their convex combinations are also optimal solutions of the initial LP problem. This means that $tP_1 + (1-t)P_2$ is also an optimal solution for any $t$ with $0 \leq t \leq 1$, where $P_1 := (\frac{25}{7}, \frac{30}{7})$ and $P_2 := (0, 15)$. Thus this LP problem admits infinitely many optimal solutions, as one can also deduce with the help of the corresponding figure. □
3.F.5. Cycling in the simplex method. Consider the following task: maximize the functional $h = 10x_1 - 57x_2 - 9x_3 - 24x_4$ subject to the conditions
$$\tfrac12 x_1 - \tfrac{11}{2} x_2 - \tfrac52 x_3 + 9x_4 \leq 0, \quad \tfrac12 x_1 - \tfrac32 x_2 - \tfrac12 x_3 + x_4 \leq 0, \quad x_1 + x_2 + x_3 + x_4 \leq 1,$$
with $x_i \geq 0$ for $i = 1, \dots, 4$. Show that the simplex method applied to this LP problem cycles, which roughly speaking means that the same tableau occurs more than once. Can you find an optimal solution?¹⁸

Solution. Introduce the slack variables $y_1, y_2, y_3$, such that
$$\tfrac12 x_1 - \tfrac{11}{2} x_2 - \tfrac52 x_3 + 9x_4 + y_1 = 0, \quad \tfrac12 x_1 - \tfrac32 x_2 - \tfrac12 x_3 + x_4 + y_2 = 0, \quad x_1 + x_2 + x_3 + x_4 + y_3 = 1,$$
with $x_i \geq 0$ and $y_j \geq 0$ for any $i = 1, \dots, 4$ and $j = 1, 2, 3$. Thus the initial simplex tableau reads as follows:

             x1     x2    x3   x4  y1  y2  y3 | rhs
 R0 −h |   −10     57     9   24   0   0   0 | 0
 R1 y1 | [1/2]  −11/2  −5/2    9   1   0   0 | 0
 R2 y2 |   1/2   −3/2  −1/2    1   0   1   0 | 0
 R3 y3 |     1      1     1    1   0   0   1 | 1

In the driver column, which is the column of $x_1$, there are two choices of a pivot and we choose the bracketed $1/2$ in $R_1$. Hence the variable $x_1$ enters the column of basic variables and the variable $y_1$ leaves. We replace $R_1$ by $\hat{R}_1 := 2R_1$ and next apply the row operations $R_0 \to R_0 + 10\hat{R}_1$, $R_2 \to R_2 - \frac12\hat{R}_1$ and $R_3 \to R_3 - \hat{R}_1$. This gives the second tableau:

 R̂0 −h |  0  −53  −41  204  20  0  0 | 0
 R̂1 x1 |  1  −11   −5   18   2  0  0 | 0
 R̂2 y2 |  0  [4]    2   −8  −1  1  0 | 0
 R̂3 y3 |  0   12    6  −17  −2  0  1 | 1

Since $-53 < -41$, the column of $x_2$ is the new work column and the bracketed 4 in $\hat{R}_2$ is the new pivot. So $x_2$ enters and $y_2$ leaves. To eliminate the second column we replace $\hat{R}_2$ by $\tilde{R}_2 := \frac14\hat{R}_2$ and apply the row operations $\hat{R}_0 \to \hat{R}_0 + 53\tilde{R}_2$, $\hat{R}_1 \to \hat{R}_1 + 11\tilde{R}_2$ and $\hat{R}_3 \to \hat{R}_3 - 12\tilde{R}_2$. This gives the third tableau:

 R̃0 −h |  0  0  −29/2   98  27/4  53/4  0 | 0
 R̃1 x1 |  1  0  [1/2]   −4  −3/4  11/4  0 | 0
 R̃2 x2 |  0  1    1/2   −2  −1/4   1/4  0 | 0
 R̃3 y3 |  0  0      0    7     1    −3  1 | 1

The column of $x_3$ is the new driver and we choose as pivot the bracketed $1/2$ in $\tilde{R}_1$. So $x_3$ enters the column of basic variables and $x_1$ leaves. To do so we replace $\tilde{R}_1$ by $\check{R}_1 := 2\tilde{R}_1$ and apply the row operations $\tilde{R}_0 \to \tilde{R}_0 + \frac{29}{2}\check{R}_1$ and $\tilde{R}_2 \to \tilde{R}_2 - \frac12\check{R}_1$. This gives the fourth tableau:

 Ř0 −h |  29  0  0  −18   −15    93  0 | 0
 Ř1 x3 |   2  0  1   −8  −3/2  11/2  0 | 0
 Ř2 x2 |  −1  1  0  [2]   1/2  −5/2  0 | 0
 Ř3 y3 |   0  0  0    7     1    −3  1 | 1

Since $-18 < -15$, the next work column is the column of $x_4$ and the pivot is the bracketed 2 in $\check{R}_2$. Hence $x_4$ enters and $x_2$ leaves. For this, replace $\check{R}_2$ by $\bar{R}_2 := \frac12\check{R}_2$ and apply the row operations $\check{R}_0 \to \check{R}_0 + 18\bar{R}_2$, $\check{R}_1 \to \check{R}_1 + 8\bar{R}_2$ and $\check{R}_3 \to \check{R}_3 - 7\bar{R}_2$. This gives the following tableau:

 R̄0 −h |   20     9  0  0  −21/2  141/2  0 | 0
 R̄1 x3 |   −2     4  1  0  [1/2]   −9/2  0 | 0
 R̄2 x4 | −1/2   1/2  0  1    1/4   −5/4  0 | 0
 R̄3 y3 |  7/2  −7/2  0  0   −3/4   23/4  1 | 1

The new driver column is the column of $y_1$ and we choose as pivot the bracketed $1/2$ in $\bar{R}_1$. Thus $y_1$ enters and $x_3$ leaves. To do so we replace $\bar{R}_1$ by $r_1 := 2\bar{R}_1$ and apply the row operations $\bar{R}_0 \to \bar{R}_0 + \frac{21}{2}r_1$, $\bar{R}_2 \to \bar{R}_2 - \frac14 r_1$ and $\bar{R}_3 \to \bar{R}_3 + \frac34 r_1$. This yields the following tableau:

 r0 −h |  −22    93    21  0  0  −24  0 | 0
 r1 y1 |   −4     8     2  0  1   −9  0 | 0
 r2 x4 |  1/2  −3/2  −1/2  1  0  [1]  0 | 0
 r3 y3 |  1/2   5/2   3/2  0  0   −1  1 | 1

Since $-24 < -22$, the new work column is that of $y_2$ and the pivot is the bracketed 1 in $r_2$. Hence $y_2$ enters the column of basic variables and $x_4$ leaves. To eliminate the driver column we leave $r_2$ unchanged and apply the row operations $r_0 \to r_0 + 24r_2$, $r_1 \to r_1 + 9r_2$ and $r_3 \to r_3 + r_2$. This returns us to the initial simplex tableau, and hence this LP problem exhibits cycling.

A simple remedy against cycling is the so-called Bland's rule. This says that for the entering variable we should take the first one in the present ordering of variables which has a negative entry in the objective row. For choosing the leaving variable, if there is a tie for the least ratio, take the candidate that is first in the ordering. For our problem, the ordering of the variables is $x_1, \dots, x_4, y_1, \dots, y_3$, and an application of Bland's rule leaves everything the same up to the last pivot. This means that $x_1$ should enter instead of $y_2$, and the pivot for this is the number $1/2$ in $r_2$ in the above tableau. Thus we replace the row $r_2$ by $\hat{r}_2 := 2r_2$ and apply the row operations $r_0 \to r_0 + 22\hat{r}_2$, $r_1 \to r_1 + 4\hat{r}_2$ and $r_3 \to r_3 - \frac12\hat{r}_2$. This yields the following tableau:

 r̂0 −h |  0  27   −1  44  0  20  0 | 0
 r̂1 y1 |  0  −4   −2   8  1  −1  0 | 0
 r̂2 x1 |  1  −3   −1   2  0   2  0 | 0
 r̂3 y3 |  0   4  [2]  −1  0  −2  1 | 1

The next iteration will increase the value of the objective function, and in particular will provide an optimal solution. Indeed, $x_3$ enters and the new pivot is bracketed in the row $\hat{r}_3$, so $y_3$ leaves. The final row operations are $\hat{r}_3 \to \tilde{r}_3 := \frac12\hat{r}_3$, $\hat{r}_0 \to \hat{r}_0 + \tilde{r}_3$, $\hat{r}_1 \to \hat{r}_1 + 2\tilde{r}_3$ and $\hat{r}_2 \to \hat{r}_2 + \tilde{r}_3$, and we arrive at the following optimal tableau:

 r̃0 −h |  0  29  0  87/2  0  19  1/2 | 1/2
 r̃1 y1 |  0   0  0     7  1  −3    1 |   1
 r̃2 x1 |  1  −1  0   3/2  0   1  1/2 | 1/2
 r̃3 x3 |  0   2  1  −1/2  0  −1  1/2 | 1/2

Hence we have the optimal solution given by $x_1 = 1/2 = x_3$ and $x_2 = 0 = x_4$, with the maximal value being $h = 1/2$. □

¹⁸ This example is adapted from the book of V. Chvátal (1983), Linear Programming, W. H. Freeman. Vasek Chvátal is a Czech mathematician known for his contributions to LP, graph theory and combinatorics.
3.F.6. Matrix games. Imagine a game played by two players – a billionaire and fate. The billionaire would like to invest in gold, silver, diamonds or stocks of an important IT software company. Suppose that the wins and losses of such investments are well known for a period of four to five years (for simplicity, we consider only a period of four years and write them into the matrix $A = (a_{ij})$):

        gold   silver   diamonds   software
2018     2%      1%        4%         3%
2019     3%     −1%       −2%         6%
2020     1%      2%        3%        −4%
2021    −2%      1%        2%         3%

The billionaire would like to invest for one year only. How should he split his investment in order to ensure the maximal win independently of the development of the stock market? We assume that next year will be some (unknown) probabilistic mix of the previous four ones. In terms of our game, fate will play some stochastic vector $(x_1, x_2, x_3, x_4)^T$ fixing the behaviour of the market (as a probabilistic mixture of the previous ones), while the billionaire will play another stochastic vector $(y_1, y_2, y_3, y_4)^T$ describing the split of his investment. The win of the billionaire is $\sum_{i,j=1}^4 x_i y_j a_{ij}$.

Solution. The task is to find the stochastic vector $(y_1, y_2, y_3, y_4)^T$ which maximizes the minimum of all values $\sum_{i,j=1}^4 x_i y_j a_{ij}$ for the fixed matrix $A$ and any stochastic vector $(x_1, x_2, x_3, x_4)^T$. This is equivalent to the problem of maximizing $z_1 + z_2 + z_3 + z_4$ under the conditions $A^T z \leq (1, \dots, 1)^T$, $z \geq 0$ (the requested stochastic vector $y$ is then obtained by normalizing the vector $z$, and the requested optimal value is the inverse of the optimal value obtained).¹⁹ Thus, one has to solve an LP problem, and the first step is to introduce the slack variables $w_1, w_2, w_3, w_4$ and transform the problem to the standard form:
$$\max\{z_1 + z_2 + z_3 + z_4 \mid (A^T | E_4)(z, w)^T = (1, 1, 1, 1)^T\}.$$
The initial tableau of the simplex method is the following one:

 | −1  −1  −1  −1  0  0  0  0 | 0
 |  2   3   1  −2  1  0  0  0 | 1
 |  1  −1   2   1  0  1  0  0 | 1
 |  4  −2   3   2  0  0  1  0 | 1
 |  3   6  −4   3  0  0  0  1 | 1

and after four iterations we arrive at the final tableau, given by

 |  188/89  0  0  0   25/89  0   44/89   17/89 | 86/89
 |  114/89  0  1  0   18/89  0   21/89   −2/89 | 37/89
 | −146/89  0  0  0   −9/89  1  −55/89    1/89 | 26/89
 |  −85/89  0  0  1  −10/89  0   18/89   11/89 | 19/89
 |   78/89  1  0  0   17/89  0    5/89    8/89 | 30/89

We can read off the optimal solution: $z_2 = \frac{30}{89}$, $z_3 = \frac{37}{89}$, $z_4 = \frac{19}{89}$, $z_1 = 0$. The optimal value (upper right corner) is $z_1 + z_2 + z_3 + z_4 = \frac{86}{89}$. After rescaling to a stochastic vector (multiplying by $\frac{89}{86}$) we get the solution of the original problem:
$$y_1 = 0, \quad y_2 = \tfrac{30}{86}, \quad y_3 = \tfrac{37}{86}, \quad y_4 = \tfrac{19}{86},$$
with the optimal value $\frac{89}{86}$. □

¹⁹ The observation comes from the proof of the von Neumann minimax theorem, 1928. The theorem claims that any probabilistic extension of a matrix game enjoys an equilibrium state.

B) Material on difference equations

3.F.7. Find a real basis of solutions for the difference equation $x_{n+4} = x_{n+3} - x_{n+2} + x_{n+1} - x_n$.

Solution. The space of solutions is a four-dimensional vector space whose generators can be obtained from the roots of the characteristic polynomial of the given equation. The characteristic equation has the form
$$r^4 - r^3 + r^2 - r + 1 = 0.$$
This is a so-called reciprocal equation, which means that the coefficients at the $(n-k)$-th and the $k$-th powers of $r$, for $k = 1, \dots, n$, are equal. After dividing the characteristic equation by $r^2$ (zero cannot be a root) and setting $u := r + \frac{1}{r}$ (note that $r^2 + \frac{1}{r^2} = u^2 - 2$) we obtain
$$r^2 - r + 1 - \frac{1}{r} + \frac{1}{r^2} = u^2 - u - 1 = 0.$$
This gives $u_{1,2} = \frac{1 \pm \sqrt{5}}{2}$, and from the equation $r^2 - ur + 1 = 0$ we determine the roots
$$r_{1,2,3,4} = \frac{1 \pm \sqrt{5} \pm \sqrt{-10 \pm 2\sqrt{5}}}{4}.$$
In fact, since $r^5 + 1 = (r + 1)(r^4 - r^3 + r^2 - r + 1)$, the roots of the characteristic equation could have been “guessed” right away. The roots of the characteristic polynomial are also roots of the polynomial $r^5 + 1$, which are exactly the fifth roots of $-1$. By this we obtain that the solutions of the characteristic equation are the numbers
$$r_{1,2} = \cos\big(\tfrac{\pi}{5}\big) \pm i\sin\big(\tfrac{\pi}{5}\big), \qquad r_{3,4} = \cos\big(\tfrac{3\pi}{5}\big) \pm i\sin\big(\tfrac{3\pi}{5}\big).$$
Now, by the description in 3.2.5, a real basis of the space of solutions of the given difference equation is given by the sequences
$$\cos\big(\tfrac{n\pi}{5}\big), \quad \sin\big(\tfrac{n\pi}{5}\big), \quad \cos\big(\tfrac{3n\pi}{5}\big), \quad \sin\big(\tfrac{3n\pi}{5}\big).$$
Note that these are the sines and cosines of the arguments of the corresponding powers of the roots of the characteristic polynomial. □
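The factorization and the location of the roots in 3.F.7 are easy to confirm with a short Sage cell:
R.<r> = QQ[]
p = r^4 - r^3 + r^2 - r + 1
print((r^5 + 1).factor())           # (r + 1) * (r^4 - r^3 + r^2 - r + 1)
for root, _ in p.roots(CC):
    print(root, abs(root), root^5)  # each root has modulus 1 and satisfies r^5 = -1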
3.F.8. A simplified model for the behaviour of gross domestic product. Non-homogeneous difference equations of second order appear in macroeconomics. Consider for example the difference equation
$$x_{n+2} - A(1 + B)x_{n+1} + ABx_n = 1,$$
where $x_k$ is the gross domestic product at the year $k$. The constant $A$ is the consumption tendency, a macroeconomic factor that gives the fraction of money that people spend (from what they have at their disposal), and the constant $B$ describes the dependence of the measure of investment of the private sector on the consumption tendency. Moreover, we assume that the size of the domestic product is normalised such that the right-hand side of the equation equals 1. Compute the values $x_n$ for $A = 3/4$, $B = 1/3$, $x_0 = 1 = x_1$.

Solution. For $A = 3/4$ and $B = 1/3$, it is easy to check that the constant function $x_n = 4$ is a particular solution of the initial (non-homogeneous) equation. We now look for solutions of the corresponding homogeneous equation. The general characteristic equation has the form
$$r^2 - A(1 + B)r + AB = 0,$$
which reduces to $r^2 - r + \frac14 = 0$ for the particular values of $A$ and $B$. Hence $1/2$ is a double root. In this case, as we have seen before (cf. 3.B.3), the solutions of the homogeneous equation are given by $a(\frac12)^n + bn(\frac12)^n$ for some $a, b \in \mathbb{R}$. Hence, according to the theoretical description in 3.2.6, the solutions of the given difference equation (for $A = 3/4$ and $B = 1/3$) are expressed by $4 + a(\frac12)^n + bn(\frac12)^n$. Using the initial conditions $x_0 = x_1 = 1$ we obtain $a = b = -3$, and we can present the solution explicitly:
$$x_n = 4 - 3\left(\frac12\right)^n - 3n\left(\frac12\right)^n.$$
Let us verify this result in Sage, by the method that we learned in the main part (cf. 3.B.11):
from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n+2) - a(n+1) + (1/4)*a(n) - 1
initial = {a(0):1, a(1):1}
rsolve(f, a(n), initial)
Sage's output is the expression 4 + (-3*n - 3)/2**n. □

3.F.9. Find the solution of the recurrence relation $x_{n+2} - 6x_{n+1} + 5x_n = n e^n$.

Solution. Solving first the homogeneous part we obtain $x_n^{(h)} = c_1 \cdot 1^n + c_2 \cdot 5^n$. To find a particular solution we can use the method of variation of constants, described in ??. For this, we compute the Wronskian determinant:
$$W_{j+1} = \det\begin{pmatrix} 1^{j+1} & 5^{j+1} \\ 1^{j+2} & 5^{j+2} \end{pmatrix} = 4 \cdot 5^{j+1}.$$
Thus,
$$x_n = c_1 + c_2 \cdot 5^n - \frac14\sum_{j=0}^{n-1} j e^j + \left(\sum_{j=0}^{n-1}\frac{j e^j}{4 \cdot 5^{j+1}}\right) 5^n,$$
with $c_1, c_2 \in \mathbb{R}$. □

3.F.10. Find the solution of the recurrence relation $-x_{n+3} = 3x_{n+2} + 3x_{n+1} + x_n$, with initial conditions $x_1 = x_2 = x_3 = 1$. ⃝

C) Material on models of growth and iterated processes

In this paragraph, the initial set of exercises addresses population growth and the Leslie model. However, it is crucial to first recognize the role that “primitive matrices” play in the theory of linear iterative models. Recall that a matrix $A$ is called primitive if there exists a positive integer $k$ such that $A^k$ has only positive entries. These matrices are discussed in detail in Section 3.3.3, where further information can be found.

3.F.11. Which of the matrices given below are primitive?
$$A = \begin{pmatrix} 0 & 1/7 \\ 1 & 6/7 \end{pmatrix}, \quad B = \begin{pmatrix} 1/2 & 0 & 1/3 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 1/6 \end{pmatrix}, \quad C = \begin{pmatrix} 0 & 1 & 0 \\ 1/4 & 0 & 1/2 \\ 3/4 & 0 & 1/2 \end{pmatrix}, \quad D = \begin{pmatrix} 1/3 & 1/2 & 0 & 0 \\ 1/2 & 1/3 & 0 & 0 \\ 0 & 1/6 & 1/6 & 1/3 \\ 1/6 & 0 & 5/6 & 2/3 \end{pmatrix}.$$

Solution. We see that
$$A^2 = \begin{pmatrix} 1/7 & 6/49 \\ 6/7 & 43/49 \end{pmatrix}, \qquad C^3 = \begin{pmatrix} 3/8 & 1/4 & 1/4 \\ 1/4 & 3/8 & 1/4 \\ 3/8 & 3/8 & 1/2 \end{pmatrix}.$$
So the matrices $A$ and $C$ are primitive, since $A^2$ and $C^3$ are positive matrices. For $n \in \mathbb{N}$, the middle column of the matrix $B^n$ is always the vector $(0, 1, 0)^T$, which contains the entry 0. Hence the matrix $B$ cannot be primitive. Finally, the product
$$\begin{pmatrix} 1/3 & 1/2 & 0 & 0 \\ 1/2 & 1/3 & 0 & 0 \\ 0 & 1/6 & 1/6 & 1/3 \\ 1/6 & 0 & 5/6 & 2/3 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0 \\ a \\ b \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ a/6 + b/3 \\ 5a/6 + 2b/3 \end{pmatrix},$$
with $a, b \in \mathbb{R}$, implies that the matrix $D^2$ has a zero two-dimensional (square) sub-matrix in its right upper corner. By induction, the same property is shared by the matrices $D^3 = D \cdot D^2$, $D^4 = D \cdot D^3$, ..., $D^n = D \cdot D^{n-1}$, and so on. Consequently, the matrix $D$ is not primitive. □
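Primitivity can also be tested by brute force. The following tiny helper (our own, not a built-in; the bound 20 is an arbitrary cut-off, so a negative answer is only an indication) simply tries successive powers:
def looks_primitive(M, bound=20):
    P = M
    for k in range(1, bound + 1):
        if all(x > 0 for x in P.list()):
            return True, k
        P = P * M
    return False, None

A = matrix(QQ, [[0, 1/7], [1, 6/7]])
print(looks_primitive(A))   # (True, 2), in accordance with the solution above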
3.F.12. Rabbits and their population growth. We are again interested in the population of rabbits in a meadow, as in problem 3.B.1. However, we now assume that the rabbits die after reaching the ninth year of age.²⁰ Show that according to this model the population grows approximately as the geometric sequence $1.608^t$.

Solution. Denote by $x_1(t), x_2(t), \dots, x_9(t)$ the numbers of rabbits according to their age (in years) at time $t$. Then the numbers of rabbits in the individual categories after one year are described by the formulas
$$x_1(t+1) = x_2(t) + x_3(t) + \dots + x_9(t), \qquad x_i(t+1) = x_{i-1}(t) \ \text{ for } i = 2, 3, \dots, 9.$$
Equivalently, we may write
$$\begin{pmatrix} x_1(t+1) \\ x_2(t+1) \\ x_3(t+1) \\ x_4(t+1) \\ x_5(t+1) \\ x_6(t+1) \\ x_7(t+1) \\ x_8(t+1) \\ x_9(t+1) \end{pmatrix} = \begin{pmatrix} 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}\begin{pmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \\ x_4(t) \\ x_5(t) \\ x_6(t) \\ x_7(t) \\ x_8(t) \\ x_9(t) \end{pmatrix}.$$
The characteristic polynomial of the given matrix is
$$\lambda^9 - \lambda^7 - \lambda^6 - \lambda^5 - \lambda^4 - \lambda^3 - \lambda^2 - \lambda - 1.$$
To obtain a verification in Sage we can apply the command A.eigenvalues(), as usual:
A = matrix(QQ, [[0,1,1,1,1,1,1,1,1], [1,0,0,0,0,0,0,0,0],
[0,1,0,0,0,0,0,0,0], [0,0,1,0,0,0,0,0,0], [0,0,0,1,0,0,0,0,0],
[0,0,0,0,1,0,0,0,0], [0,0,0,0,0,1,0,0,0], [0,0,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,0]])
A.eigenvalues()
The answer shows that $\lambda_1 \approx 1.608$ is indeed the only positive real eigenvalue, a property that every Leslie matrix satisfies (there are also eight complex eigenvalues). In fact, we can estimate this root of the characteristic polynomial very well (think about why it must be smaller than $(\sqrt{5} + 1)/2$). Now, the normalized eigenvector corresponding to $\lambda_1$ has the form (the coordinates of $X_1$ sum to 100)
$$X_1 \approx (38.36, 23.85, 14.83, 9.22, 5.73, 3.56, 2.21, 1.37, 0.85)^T.$$
Consequently, according to this model the population grows approximately as the geometric sequence $1.608^t$. □

²⁰ In the original model the rabbits were immortal. Note that domesticated rabbits can live between eight and twelve years.
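The normalization of $X_1$ quoted above can be reproduced by simple power iteration with the Leslie matrix $A$ from the previous cell; this is only a numerical sketch, relying on the fact that repeated multiplication converges to the dominant eigendirection:
Ad = A.change_ring(RDF)
x = vector(RDF, [1]*9)
for _ in range(200):
    x = Ad * x
    x = x / x.norm()                 # keep the iterate normalized
print(100 * x / sum(x))              # ~ (38.36, 23.85, ..., 0.85)
print((Ad*x).norm() / x.norm())      # ~ 1.608, the dominant eigenvalue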
3.F.13. Consider the following Leslie model, in which a farmer breeds sheep. The birth-rate of sheep depends only on their age and is on average 2 lambs per sheep between one and two years of age, 5 lambs per sheep between two and three years of age, and 2 lambs per sheep between three and four years of age. Younger sheep do not deliver any lambs. Every year, half of the sheep die, uniformly distributed among all age groups. Every sheep older than four years is sent to the butchery. The farmer would like to sell (living) lambs younger than one year for their skin. What proportion of the lambs can be sold every year so that the size of the herd remains the same? In what ratio will the sheep then be distributed among the individual age categories?

Solution. The matrix of the model, without any action of the farmer, can be expressed as
$$L = \begin{pmatrix} 0 & 2 & 5 & 2 \\ \frac12 & 0 & 0 & 0 \\ 0 & \frac12 & 0 & 0 \\ 0 & 0 & \frac12 & 0 \end{pmatrix}.$$
The farmer can influence how many sheep younger than one year stay in his herd for the next year, that is, he can influence the element $l_{21}$ of the matrix $L$. Thus we are finally dealing with the model encoded by the matrix
$$L = \begin{pmatrix} 0 & 2 & 5 & 2 \\ a & 0 & 0 & 0 \\ 0 & \frac12 & 0 & 0 \\ 0 & 0 & \frac12 & 0 \end{pmatrix}.$$
We are looking for an $a$ such that the matrix has the eigenvalue 1 (we know that it has only one real positive eigenvalue). The characteristic polynomial of this matrix is
$$\lambda^4 - 2a\lambda^2 - \frac{5a}{2}\lambda - \frac{a}{2}.$$
If we require 1 to be a root of this polynomial, then we get the solution $a = \frac15$. Thus, the farmer can sell $\frac12 - \frac15 = \frac{3}{10}$ of the lambs that are born each year. The eigenvector of the matrix corresponding to the eigenvalue 1 is $(20, 4, 2, 1)^T$, and in these ratios the population stabilises. □
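A quick symbolic check of this computation in Sage: require $\lambda = 1$ to be a root of the characteristic polynomial and solve for the retention rate $a$.
a = var('a')
L = matrix(SR, [[0, 2, 5, 2], [a, 0, 0, 0], [0, 1/2, 0, 0], [0, 0, 1/2, 0]])
chi = L.charpoly()        # the characteristic polynomial, with parameter a
print(chi)
print(solve(chi(1) == 0, a))   # a == 1/5, as claimed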
3.F.14. Consider the Leslie population growth model for a population of rats, divided into three groups according to age: younger than one year, between one and two years, and between two and three years. Assume that there exist no rats older than three years. The average birth-rate of one rat in the individual age categories is the following: in the first group it is zero, in the second and in the third it is 2 rats. The mortality in the second group is zero, that is, the rats that survive their first year die after three years of life. Determine the mortality in the first group, given that the population stagnates (the total number of rats does not change). ⃝

3.F.15. Model of evolution of a whale population. For the evolution of a population (of whales), females are important. The important factor is not age but fertility. From this point of view, one can divide the females into newborns (juveniles), that is, females who are not yet fertile; young fertile females; adult females with the highest fertility; and postclimacterial females, who are no longer fertile but are still important with respect to taking care of newborns and food gathering.

We model the evolution of such a population in time. For a time unit, we choose the time it takes to reach adulthood. A newborn female who survives this interval becomes fertile. The evolution of a young female to full fertility and to the postclimacterial state depends mainly on the environment. That is, the transition to the next category is a random event. Analogously, the death of an individual is also a random event. Note that a young fertile female has fewer children per unit interval than an adult female. Let us now try to formalise these statements.

Denote by $x_1(t), x_2(t), x_3(t), x_4(t)$ the numbers of juvenile, young, adult and postclimacterial females at time $t$, respectively. The amount can be expressed as a number of individuals, but also as a number of individuals per unit area (population density), or as total biomass. Denote further by $p_1$ the probability that a juvenile female survives the unit time interval and becomes fertile, and let $p_2$ and $p_3$ be the respective probabilities that a young female becomes adult and that an adult female becomes old. Another random event is the death (positively formulated: survival) of females who do not move to the next category – we denote these probabilities respectively by $q_2$, $q_3$ and $q_4$ for young, adult and old females. Of course, each of the numbers $p_1, p_2, p_3, q_2, q_3, q_4$ is a probability from the interval $[0, 1]$. Now, a young female can survive and stay young, reach adulthood, or die. These events are mutually exclusive, and together they form the certain event; since the possibility of death cannot be excluded, $p_2 + q_2 < 1$. For similar reasons $p_3 + q_3 < 1$. Finally, we denote by $f_2$ and $f_3$ the average numbers of daughters of a young and an adult female, respectively. These parameters satisfy $0 < f_2 < f_3$.

The expected number of newborn females in the next time interval is the sum of the daughters of the young and of the adult females, that is,
$$x_1(t+1) = f_2 x_2(t) + f_3 x_3(t).$$
Let us temporarily denote by:
• $x_{2,1}(t+1)$ the number of young females at time $t+1$ who were juvenile in the previous time interval;
• $x_{2,2}(t+1)$ the number of young females who were already fertile at time $t$, survived that time interval, but did not move into adulthood.
The probability $p_1$ that a juvenile female survives the interval can be expressed by classical probability, that is, by the ratio $x_{2,1}(t+1)/x_1(t)$. Similarly, the probability $q_2$ can be expressed as the ratio $x_{2,2}(t+1)/x_2(t)$, and we finally have the relation
$$x_2(t+1) = x_{2,1}(t+1) + x_{2,2}(t+1) = p_1 x_1(t) + q_2 x_2(t).$$
In the same way we deduce the expected number of fully fertile females,
$$x_3(t+1) = p_2 x_2(t) + q_3 x_3(t),$$
and finally the expected number of postclimacterial females,
$$x_4(t+1) = p_3 x_3(t) + q_4 x_4(t).$$
Set now
$$A := \begin{pmatrix} 0 & f_2 & f_3 & 0 \\ p_1 & q_2 & 0 & 0 \\ 0 & p_2 & q_3 & 0 \\ 0 & 0 & p_3 & q_4 \end{pmatrix} \quad\text{and}\quad x(t) := \begin{pmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \\ x_4(t) \end{pmatrix}.$$
Then we may rewrite the previous recurrence formulas in matrix form as $x(t+1) = Ax(t)$. Using this matrix difference equation it is possible to compute the expected numbers of females in the individual categories, assuming that the distribution of the population at some initial time is known.

Let us focus for example on the population of orca whales, where we may assume the following parameters:
$$p_1 = 0.9775, \quad q_2 = 0.9111, \quad f_2 = 0.0043, \quad p_2 = 0.0736, \quad q_3 = 0.9534, \quad f_3 = 0.1132, \quad p_3 = 0.0452, \quad q_4 = 0.9804.$$
In this case the time interval is one year. If we start at the time $t = 0$ with a unit measure of young females in some unoccupied area, that is, with the vector $x(0) = (0, 1, 0, 0)^T$, then with the help of Sage one easily computes
$$x(1) = \begin{pmatrix} 0 & 0.0043 & 0.1132 & 0 \\ 0.9775 & 0.9111 & 0 & 0 \\ 0 & 0.0736 & 0.9534 & 0 \\ 0 & 0 & 0.0452 & 0.9804 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0.0043 \\ 0.9111 \\ 0.0736 \\ 0 \end{pmatrix},$$
and similarly
$$x(2) = A\,x(1) = (0.01224925, 0.83430646, 0.13722720, 0.00332672)^T.$$
The reader may try a computation and graphical depiction of the results for a different initial distribution of the population. The result should be the observation that the total population grows exponentially, but the ratios of the sizes of the individual groups stabilise at constant values. In fact, the matrix $A$ has the following eigenvalues:
$$\lambda_1 = 1.025441326, \quad \lambda_2 = 0.980400000, \quad \lambda_3 = 0.834222976, \quad \lambda_4 = 0.004835698.$$
The eigenvector associated with $\lambda_1$ is given by
$$w = (0.03697187, 0.31607121, 0.32290968, 0.32404724)^T,$$
and this vector is normalized so that the sum of its components equals one.
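The suggested experiment is a few lines of Sage; the following sketch iterates $x(t+1) = Ax(t)$ for the orca parameters above and displays the stabilised ratios:
A = matrix(RDF, [[0, 0.0043, 0.1132, 0],
                 [0.9775, 0.9111, 0, 0],
                 [0, 0.0736, 0.9534, 0],
                 [0, 0, 0.0452, 0.9804]])
x = vector(RDF, [0, 1, 0, 0])
for t in range(100):
    x = A * x
print(x / sum(x))   # close to the normalized eigenvector w above
print(sum(x))       # the total grows roughly like 1.0254^t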
We will now explore a variety of further exercises related to discrete Markov chains. For the reader's convenience, we have included some classic problems that lend themselves to a Markov process interpretation, such as the “ruin of a player” (see 3.F.21), the “algorithm for determining the importance of web pages” (see 3.F.20), and others. We begin with an enjoyable problem.

3.F.16. Students in three groups. Suppose that we can divide students into three groups, as follows: those who are present at a lecture and pay attention, those who are present but pay no attention, and those who are in a pub instead. Now observe, lecture after lecture, how the numbers in the individual groups change. The first step is to observe the probabilities with which a student changes his state. Suppose that this goes as follows. A student who pays attention stays in the same state with probability 50%, stops paying attention with probability 40%, and moves to the pub with probability 10%. A student who pays no attention starts paying attention with probability 10%, stays in the same state with probability 50%, and moves to the pub with probability 40%. Finally, for a student who is in the pub we assume that there is zero probability of returning to the lectures. How does the model evolve in time? How does the situation change if we assume at least a ten percent probability that a student returns from the pub to the lecture (but is not going to pay any attention)?

Solution. This is obviously a (homogeneous) discrete Markov process. Its transition matrix $T$ has the form
$$T = \begin{pmatrix} 0.5 & 0.1 & 0 \\ 0.4 & 0.5 & 0 \\ 0.1 & 0.4 & 1 \end{pmatrix},$$
where the first column concerns the students paying attention, the second one those paying no attention, and the third one those in the pub. We will prove that all the students end up in the pub. Indeed, by Sage we see that $T$ has the following three eigenvalues: $\lambda_1 = 1$, while the remaining two are $0.7$ and $0.3$. We also see that the eigenvector associated with $\lambda_1$ is a multiple of $(0, 0, 1)^T$. This vector describes the limit distribution of students among the three groups with the passing of time, and this proves our claim (of course, such a result is clear even without any computation: as the probability of returning from the pub is zero, all students end up in the pub).

For the second task, we will show that adding a 10 percent possibility of leaving the pub changes the situation only slightly. The corresponding transition matrix is given by
$$T = \begin{pmatrix} 0.5 & 0.1 & 0 \\ 0.4 & 0.5 & 0.1 \\ 0.1 & 0.4 & 0.9 \end{pmatrix}.$$
We see by Sage (or by Mathematica) that $\lambda_1 = 1$ is an eigenvalue of $T$, with corresponding eigenvector a multiple of $(1, 5, 21)^T$. This means that the limit distribution of the students among the individual groups is encoded by the multiple of this vector whose coordinates sum to one, that is, the vector $(\frac{1}{27}, \frac{5}{27}, \frac{21}{27})^T$. Therefore, again most of the students end up in the pub. □
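The Sage computation referenced in the solution can be done as follows: the stationary distribution is the kernel of $T - E$, normalized to sum to one.
T = matrix(QQ, [[1/2, 1/10, 0], [2/5, 1/2, 1/10], [1/10, 2/5, 9/10]])
v = (T - 1).right_kernel().basis()[0]   # here T - 1 means T minus the identity
print(v / sum(v))                       # (1/27, 5/27, 7/9), i.e. (1, 5, 21)/27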
3.F.17. Daily running. Jonas goes running every evening. He has three tracks – short, medium and long. Whenever he chooses the short track, the next day he feels bad about it and chooses with equal probabilities between the long and the medium one. Whenever he chooses the long track, the next day he chooses arbitrarily among all three. Whenever he chooses the medium track, the next day he feels good about it, and again chooses with equal probabilities between the medium and the long one. Jonas claims that he has been running in this mode for a very long period. How often does he choose the short track and how often the long track? What is the probability that Jonas follows a long track, under the assumption that he did the same a week ago?

Solution. Clearly this task forms a (homogeneous) discrete Markov process with a three-dimensional state space $S = \{\text{short track, medium track, long track}\}$ (which we can silently encode as $\{1, 2, 3\}$). This order of the states gives the following transition matrix:
$$T = \begin{pmatrix} 0 & 0 & 1/3 \\ 1/2 & 1/2 & 1/3 \\ 1/2 & 1/2 & 1/3 \end{pmatrix}.$$
Let us explain the computation for the second column, which corresponds to the choice of the medium track during the previous day. This means that with probability $1/2$ a medium track will be chosen (the second row), and with probability $1/2$ a long track will be chosen (the third row). Hence $T$ follows. Now, we see that
$$T^2 = \begin{pmatrix} 1/6 & 1/6 & 1/9 \\ 5/12 & 5/12 & 4/9 \\ 5/12 & 5/12 & 4/9 \end{pmatrix}.$$
Therefore, we can use the corollary of the Perron–Frobenius theorem for Markov chains, see 3.3.4. The eigenvector corresponding to the eigenvalue 1 is the stochastic vector $(\frac17, \frac37, \frac37)^T$, where the numbers $1/7, 3/7, 3/7$ are respectively the probabilities that on a randomly chosen day Jonas chooses a short, medium or long track. Suppose now that on a certain day (that is, at time $n \in \mathbb{N}$) Jonas follows a long track. This corresponds to the probabilistic vector $x_{(n)} = (0, 0, 1)^T$. As we know, for the following day we will have
$$x_{(n+1)} = \begin{pmatrix} 0 & 0 & 1/3 \\ 1/2 & 1/2 & 1/3 \\ 1/2 & 1/2 & 1/3 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}.$$
In particular, after seven days
$$x_{(n+7)} = T^7 \cdot (0, 0, 1)^T = T^6 \cdot (1/3, 1/3, 1/3)^T.$$
Let us present this computation in Sage:
T=matrix(QQbar, [[0, 0, 1/3], [1/2, 1/2, 1/3], [1/2, 1/2, 1/3]])
x=vector(QQbar, [1/3, 1/3, 1/3]); T^6*x
Sage prints out (4999/34992, 29993/69984, 29993/69984), and hence we deduce that
$$x_{(n+7)} \approx (0.142861, 0.428569, 0.428569)^T.$$
Consequently, the probability that Jonas follows a long track under the assumption that he did the same a week ago is $0.428569 \approx 3/7$. □

3.F.18. Car rental. A car rental company has two branches – one in Prague and one in Brno. A car rented in Brno can be returned in Prague and vice versa. After operating for some time, the company has observed that roughly 80% of the cars rented in Prague and 90% of the cars rented in Brno are finally returned in Prague. However, the strategy of the company requires that in both branches, at the start of every week, there should be the same number of cars as in the week before. How should the cars then be distributed?

Solution. Let us denote by $x_B$ and $x_P$ the initial numbers of cars in Brno and in Prague, respectively. Then $x = (x_B, x_P)^T$ is the vector describing the distribution of the cars between the two branches. Now, according to the statement, the matrix describing the (linear) system of car rental is given by
$$A = \begin{pmatrix} 0.1 & 0.2 \\ 0.9 & 0.8 \end{pmatrix}.$$
The state at the end of the week is given by $Ax$. However, at the end of the week the branches should have the same numbers of cars as at the beginning. This means that $x$ must satisfy $Ax = x$, so it must be an eigenvector of the matrix $A$ associated with the eigenvalue $\lambda_1 = 1$. We can save time on computations by using Sage as follows:
A = matrix(QQ, [[0.1, 0.2], [0.9,0.8]])
print(A.eigenvalues())
show(A.eigenvectors_right())
Sage's output for the eigenvectors has the form
$$\left[\left(1, \left[\left(1, \tfrac92\right)\right], 1\right),\ \left(-\tfrac{1}{10}, \left[(1, -1)\right], 1\right)\right],$$
and we deduce that the eigenvector corresponding to $\lambda_1$ is $(1, 9/2)^T$, so that $x = (1, 9/2)^T$ up to scale. Now, the percentage distribution of the cars is given by the normalized vector associated with $x$, that is, the vector whose entries sum to 1. This vector is given by
$$\frac{2}{11}\begin{pmatrix} 1 \\ 9/2 \end{pmatrix} = \begin{pmatrix} 0.181818 \\ 0.818182 \end{pmatrix}.$$
Therefore, the optimal distribution of the cars would have approximately 18% stationed in Brno and 82% in Prague. □
3.F.19. Suppose that two students, A and B, spend every Monday morning playing a certain computer game. The person who wins pays for both of them in the evening in the restaurant. The game can also be a draw – then each pays for half of the meal. The result of the previous game partially determines the next game. If student A won a week ago, then with probability $3/4$ he wins again, and with probability $1/4$ the game is a draw. A draw is repeated with probability $2/3$, and with probability $1/3$ the next game is won by B. Moreover, if student B won a game, then with probability $1/2$ he/she wins again, and with probability $1/4$ student A wins the next game. Determine the probability that today each of them pays half of the costs, given that the first game, played a long time ago, was won by A.

Solution. This is a Markov process with the following three states: “student A wins”, “the game ends in a draw”, “student B wins”. Labelling these states in this order as $\{1, 2, 3\}$, we arrive at the following transition matrix of the process:
$$T = \begin{pmatrix} 3/4 & 0 & 1/4 \\ 1/4 & 2/3 & 1/4 \\ 0 & 1/3 & 1/2 \end{pmatrix}.$$
We want to find the probability of the transition from the first state to the second after a large number $n \in \mathbb{N}$ of steps (weeks). Observe that the matrix $T$ is primitive, because
$$T^2 = \begin{pmatrix} 9/16 & 1/12 & 5/16 \\ 17/48 & 19/36 & 17/48 \\ 1/12 & 7/18 & 1/3 \end{pmatrix}.$$
Thus, it suffices to find the probabilistic eigenvector $x_\infty$ of the matrix $T$ associated with the eigenvalue $\lambda_1 = 1$. As before, executing the following block in Sage we compute the eigenvalues and eigenvectors of $T$:
T = matrix(QQbar, [[3/4, 0, 1/4], [1/4, 2/3, 1/4], [0, 1/3, 1/2]])
T.eigenvalues()
T.eigenvectors_right()
We deduce that the eigenvector associated with the eigenvalue $\lambda_1 = 1$ is the vector $(1, 3/2, 1)^T$, which we may normalize as $x_\infty = (\frac27, \frac37, \frac27)^T$. Recall now that for large $n$ the probabilistic vector differs only very slightly from $x_\infty$, and in particular it does not depend on the initial state. Indeed, for large $n \in \mathbb{N}$ we compute
$$T^n \approx \begin{pmatrix} 2/7 & 2/7 & 2/7 \\ 3/7 & 3/7 & 3/7 \\ 2/7 & 2/7 & 2/7 \end{pmatrix}.$$
The desired probability is the element of this matrix in the second row and first column (the second component of the vector $x_\infty$). Hence the answer is $3/7$. □
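A quick numeric sanity check of the limit matrix claimed above:
T = matrix(RDF, [[3/4, 0, 1/4], [1/4, 2/3, 1/4], [0, 1/3, 1/2]])
print(T^50)   # every column is approximately (2/7, 3/7, 2/7)^T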
3.F.20. Algorithm for determining the importance of pages. Internet browsers can find (almost) all pages containing a given word or phrase on the Internet. But how can the pages be sorted so that the list is ordered according to their relevance? One possibility is the following algorithm: the collection of all found pages is considered to be a system, and each of the found pages is one of its states. We describe a random walk on these pages as a Markov process. The transition probabilities between web pages are determined by hyperlinks: each link from page A to page B establishes the probability of moving from A to B as $\frac{1}{\text{total number of links from page A}}$. If a page has no outgoing links, it is treated as a page that links to every other page. This creates a probability matrix $M = (m_{ij})$, where the entry $m_{ij}$ represents the probability of moving from the $j$th page to the $i$th page (so the $j$th column of $M$ describes the transitions out of the $j$th page). Assume that a user randomly clicks on links, and chooses a random page when encountering a linkless page. In this case, the probability of being at the $i$th page at a sufficiently large time from the beginning is given by the $i$th component of the unit eigenvector of the matrix $M$ corresponding to the unit eigenvalue. The importance of the individual pages is then determined by the magnitudes of these probabilities.

This algorithm can be modified by assuming that the user eventually stops clicking on links after some time and starts again from a random page. Suppose that the user chooses a new page randomly with probability $d$, and with probability $1-d$ continues clicking on links. In this scenario, the probability $m_{ij}$ of transitioning from the page $S_j$ to the page $S_i$ is given by
$$m_{ij} = \begin{cases} \dfrac{d}{n} + \dfrac{1-d}{\text{total number of links from the page } S_j}, & \text{if from } S_j \text{ there is a link to } S_i, \\[2mm] \dfrac{d}{n}, & \text{otherwise.} \end{cases}$$
Note that $m_{ij} \neq 0$. Now, according to the Perron–Frobenius theorem, the eigenvalue 1 has multiplicity one and is dominant, and thus the corresponding eigenvector is unique (up to scale). For an illustration, consider pages A, B, C and D. The links lead from A to B and to C, from B to C, and from C to A; from D no links lead anywhere. Suppose that the probability that the user chooses a random new page is $1/5$. Then the matrix $M$ looks as follows:
$$M = \begin{pmatrix} 1/20 & 1/20 & 17/20 & 1/4 \\ 9/20 & 1/20 & 1/20 & 1/4 \\ 9/20 & 17/20 & 1/20 & 1/4 \\ 1/20 & 1/20 & 1/20 & 1/4 \end{pmatrix}.$$
We find that the eigenvector corresponding to the eigenvalue 1 is the vector $u = (305/53, 175/53, 315/53, 1)^T \in \mathbb{R}^4$. Hence, the importance of the pages is given by the order of the sizes of the corresponding components, that is, C > A > B > D.
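For this small example the random-surfer distribution can be approximated by plain power iteration, a sketch of which follows (100 steps is an arbitrary choice that is more than enough here):
M = matrix(RDF, [[1/20, 1/20, 17/20, 1/4],
                 [9/20, 1/20, 1/20, 1/4],
                 [9/20, 17/20, 1/20, 1/4],
                 [1/20, 1/20, 1/20, 1/4]])
x = vector(RDF, [1/4, 1/4, 1/4, 1/4])   # start at a uniformly random page
for _ in range(100):
    x = M * x                           # one more click of the surfer
print(x)   # proportional to u above; the ordering is C > A > B > D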
3.F.21. Ruin of a player. Two players, A and B, gamble for money repeatedly in a certain game which can result only in a victory of one of the players. The winning probability for player A in each individual game is $p \in [0, 1/2)$. Both players always bet the amount of €1. Consequently, after each game, player B gives €1 to player A with probability $p$, and player A gives €1 to player B with probability $1-p$. Suppose that at the start player A has €$x$, player B has €$y$, and that they play as long as they both have some money. Determine the probability that player A will lose all of his/her money.

Solution. This is the so-called ruin of a player, a special Markov chain (see also the exercise “Sweet-toothed gambler”) with many important applications. The probability in question is given by
$$\frac{1 - \left(\frac{p}{1-p}\right)^y}{1 - \left(\frac{p}{1-p}\right)^{x+y}}.$$
We can now investigate what this value is for specific choices of $p, x, y$. We give an example. Suppose that player B wants the probability that player A will lose the amount of €1,000,000 to be at least 0.999. Assume for instance that $p = 0.495$. Then for player B it suffices to have the amount of €346. If $p = 0.499$, this amount increases to €1,727. Therefore, it is possible in big casinos that “passionate” players play almost fair games. □
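The casino figures quoted above are easy to reproduce; here is a small sketch evaluating the ruin formula numerically:
def ruin_probability(p, x, y):
    s = p / (1 - p)                      # s < 1 whenever p < 1/2
    return (1 - s^y) / (1 - s^(x + y))   # probability that player A is ruined

print(ruin_probability(0.495, 10^6, 346))    # slightly above 0.999
print(ruin_probability(0.499, 10^6, 1727))   # slightly above 0.999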
3.F.22. In a certain game you can choose one of two opponents. The probability that you beat the better one is $1/4$, while the probability that you beat the worse one is $1/2$. But the opponents cannot be distinguished, so you do not know which one is the better one. You await a large number of games. For each of them you can choose a different opponent. Consider the following two strategies:
1. For the first game choose the opponent randomly. If you win a game, carry on with the same opponent; if you lose a game, change the opponent.
2. For the first two games, choose an opponent randomly. Then for the next two games, if you lost both the previous games, change the opponent; otherwise stay with the same opponent.
Which of the two strategies is better?

Solution. Both strategies define a Markov chain. For simplicity, denote the worse opponent by A and the better opponent by B. In the first case the state space has two states: “game with A” and “game with B”. In this order, we obtain the probabilistic transition matrix
$$\begin{pmatrix} 1/2 & 3/4 \\ 1/2 & 1/4 \end{pmatrix}.$$
This matrix has all of its elements positive. Thus it suffices to find the probabilistic vector $x_\infty$ associated with the eigenvalue 1. We compute $x_\infty = (\frac35, \frac25)^T$. Its components correspond to the probabilities that after a long sequence of games the opponent is A or B, respectively. Therefore, we can expect that 60% of the games will be played against the worse of the two opponents, and because
$$0.4 = \frac35 \cdot \frac12 + \frac25 \cdot \frac14,$$
roughly 40% of the games will be winning ones. For the second strategy, use the states “two games in a row with A” and “two games in a row with B”, which lead to the probabilistic transition matrix
$$\begin{pmatrix} 3/4 & 9/16 \\ 1/4 & 7/16 \end{pmatrix}.$$
In this case we compute $x_\infty = (\frac{9}{13}, \frac{4}{13})^T$. Against the worse opponent one would then play $(9/4)$-times more frequently than against the better one. Recall that for the first strategy it is $(3/2)$-times more frequently, which means that the second strategy is better. Note also that for the second strategy roughly 42.3% of the games are winning ones, as we can see from the relation
$$0.423 \approx \frac{11}{26} = \frac{9}{13}\cdot\frac12 + \frac{4}{13}\cdot\frac14. \quad \square$$

3.F.23. In a certain country there are two television channels. From a public survey it follows that in one year $1/6$ of the viewers of the first channel move to the second one, while $1/5$ of the viewers of the second channel move to the first one. Determine the time evolution of the numbers of viewers watching the channels, using Markov processes. Write down the matrix of the process, and find its eigenvalues and eigenvectors. ⃝

3.F.24. Daily casino routine. A female roulette player has the following strategy: she comes to a casino to play with €10. She always bets everything she has, and she always bets on black (there are 37 numbers in roulette: 18 black, 18 red and zero). The player stops whenever she has either nothing or €80. Consider this problem as a Markov process and write down its transition matrix.

Solution. In the course of the game and at its end, the player can have only one of the following amounts of money (in €): 0, 10, 20, 40, 80. If we view the situation as a Markov process, then these amounts correspond to the states of the process, and we construct the following transition matrix:
$$A = \begin{pmatrix} 1 & a & a & a & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & b & 0 & 0 & 0 \\ 0 & 0 & b & 0 & 0 \\ 0 & 0 & 0 & b & 1 \end{pmatrix},$$
with $a = \frac{19}{37}$ and $b = \frac{18}{37}$, respectively. Note that the matrix is probabilistic and singular, and the eigenvalue $\lambda_1 = 1$ has multiplicity two (it is the unique non-zero eigenvalue of $A$), as we see from the following block in Sage:
a = var('a'); b = var('b')
assume(a, 'real'); assume(b, 'real')
A = matrix(SR, 5, 5, [[1, a, a, a, 0], [0, 0, 0, 0, 0],
[0, b, 0, 0, 0], [0, 0, b, 0, 0], [0, 0, 0, b, 1]])
A.eigenvalues()
Executing this block we obtain the list [0, 0, 0, 1, 1]. Observe that the game does not converge to a single vector $x_\infty$, but ends in one of the eigenvectors associated with the eigenvalue $\lambda_1$, that is, either $(1, 0, 0, 0, 0)^T$ (the player loses it all) or $(0, 0, 0, 0, 1)^T$ (the player wins €80). We verify this as follows (after introducing the matrix $A$ as above):
A.eigenvectors_right()
In this case Sage's output has the form
[(0, [(1, 0, 0, -1/a, b/a)], 3), (1, [(1, 0, 0, 0, 0), (0, 0, 0, 0, 1)], 2)]
Furthermore, using Sage we can easily check that the game ends after three bets, that is, the sequence $\{A^n\}_{n=1}^\infty$ is constant for $n \geq 3$. For instance, you may type
A^3 == A^4
A^3 == A^5
and in both cases Sage prints True (and similarly for any other power $A^n$ with $n \geq 3$). Moreover, we see that $A^3$ is given by
[ 1 (a*b + a)*b + a a*b + a a 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 b^3 b^2 b 1]
which means that
$$A^\infty := A^3 = A^n = \begin{pmatrix} 1 & a + ab + ab^2 & a + ab & a & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & b^3 & b^2 & b & 1 \end{pmatrix}, \quad n \geq 3.$$
We deduce that the game ends as a loss with probability $a + ab + ab^2 \doteq 0.885$, and as a win of €80 with probability roughly $0.115$. This conclusion is obtained by multiplying the matrix $A^\infty$ with the initial vector $(0, 1, 0, 0, 0)^T$, which gives the vector $(a + ab + ab^2, 0, 0, 0, b^3)^T$. Notice that whether the player was female or male makes no difference to the result. □

3.F.25. Consider the situation from the previous case and assume that the probability of both win and loss is $1/2$. Denote by $A$ the matrix of the process. Without using any computational software, determine $A^{100}$. ⃝

3.F.26. Reliable products. A production line is not reliable: individual products differ in quality in a significant way. A certain worker tries to improve the quality of the products and intervenes in the process. The products are distributed into classes I, II and III according to their quality, and a report found out the following:
• After a product of class I, the next product is of the same quality in 80% of the cases, and of quality II in 10% of the cases.
• After a product of class II, the next product is of class II in 60% of the cases, and of quality I in 20% of the cases.
• After a product of class III, the next product is of quality III in 50% of the cases, while in 25% of the cases it is of class II.
Compute the probability that the 18th product is of quality I, given that the 16th product is of quality III.

Solution. There are at least two ways to obtain the answer. First, we solve the problem without using a Markov chain. Since the 16th product has quality III, the event in question is realized by the cases
• the 17th product has quality I and the 18th product has quality I,
• the 17th product has quality II and the 18th product has quality I,
• the 17th product has quality III and the 18th product has quality I,
with probabilities respectively
$$0.25 \cdot 0.8 = 0.2, \quad 0.25 \cdot 0.2 = 0.05, \quad 0.5 \cdot 0.25 = 0.125.$$
Thus the solution is $0.375 = 0.2 + 0.05 + 0.125$.

Let us now view the problem as a discrete Markov process. From the statement it follows that the transition matrix has the form
$$T = \begin{pmatrix} 0.8 & 0.2 & 0.25 \\ 0.1 & 0.6 & 0.25 \\ 0.1 & 0.2 & 0.5 \end{pmatrix}.$$
The situation in which the product is in class III is given by the probabilistic vector $(0, 0, 1)^T$. For the next product we obtain the probabilistic vector
$$\begin{pmatrix} 0.25 \\ 0.25 \\ 0.5 \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 & 0.25 \\ 0.1 & 0.6 & 0.25 \\ 0.1 & 0.2 & 0.5 \end{pmatrix}\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.$$
Finally, for the next product in order, we compute the vector
$$\begin{pmatrix} 0.375 \\ 0.3 \\ 0.325 \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 & 0.25 \\ 0.1 & 0.6 & 0.25 \\ 0.1 & 0.2 & 0.5 \end{pmatrix}\begin{pmatrix} 0.25 \\ 0.25 \\ 0.5 \end{pmatrix},$$
whose first component is the desired probability. Observe that the first method (without the Markov process) led to the result faster and more easily. But notice also how unwieldy it would become if we wanted to compute, say, the 22nd or the 30th product. For the second method one can, in a sense, restrict the computations to the relevant parts of the matrices only, instead of mindlessly multiplying whole matrices. When using the Markov process, we have also directly obtained the probabilities that the 18th product belongs to classes II and III. □
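In Sage the answer to 3.F.26 is a one-liner: the desired probability is the $(1,3)$ entry of $T^2$.
T = matrix(QQ, [[4/5, 1/5, 1/4], [1/10, 3/5, 1/4], [1/10, 1/5, 1/2]])
print((T^2)[0, 2])   # 3/8 = 0.375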
3.F.27. Suppose that there are two boxes, which together contain $n$ white and $n$ black balls, each box holding $n$ of them. At regular time intervals, a ball is drawn from each box and moved to the other box. For this Markov process, find its probabilistic transition matrix $T$.

Solution. This problem is often used in physics as a model for the mixing of two incompressible liquids (introduced by D. Bernoulli already in 1769), or, analogously, as a model of the diffusion of gases. Let the states $0, 1, \dots, n$ correspond to the number of white balls in the first box. This information already determines how many black balls are in the first box (the remaining balls are then in the second box). If, in a given step, the state changes from $j \in \{1, \dots, n\}$ to $j - 1$, then a white ball was drawn from the first box and a black ball from the second. This happens with probability
\[ \frac{j}{n} \cdot \frac{j}{n} = \frac{j^2}{n^2}. \]
The transition from the state $j \in \{0, \dots, n-1\}$ to the state $j + 1$ corresponds to drawing a black ball from the first box and a white ball from the second, with probability
\[ \frac{n-j}{n} \cdot \frac{n-j}{n} = \frac{(n-j)^2}{n^2}. \]
The system stays in the state $j \in \{1, \dots, n-1\}$ if balls of the same colour are drawn from both boxes, which happens with probability
\[ \frac{j}{n}\cdot\frac{n-j}{n} + \frac{n-j}{n}\cdot\frac{j}{n} = \frac{2j(n-j)}{n^2}. \]
Notice that from the state $0$ the process necessarily (with probability $1$) moves to the state $1$, and similarly from the state $n$ it moves with probability one to the state $n - 1$. In summary, ordering the states $0, 1, \dots, n$, we obtain the following $(n+1)\times(n+1)$ transition matrix (the common factor $1/n^2$ is written in front of the matrix):
\[ T = \frac{1}{n^2}\begin{pmatrix}
0 & 1 & 0 & \cdots & 0 & 0 & 0 \\
n^2 & 2\cdot1(n-1) & 2^2 & \cdots & 0 & 0 & 0 \\
0 & (n-1)^2 & 2\cdot2(n-2) & \ddots & 0 & 0 & 0 \\
\vdots & \vdots & \ddots & \ddots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \ddots & 2\cdot(n-2)2 & (n-1)^2 & 0 \\
0 & 0 & 0 & \cdots & 2^2 & 2\cdot(n-1)1 & n^2 \\
0 & 0 & 0 & \cdots & 0 & 1 & 0
\end{pmatrix}. \]
In physics we are of course interested in the distribution of the balls between the boxes after a certain time (number of drawings). If the initial state is, for instance, $0$, we can use this model and the powers of the matrix $T$ to observe with what probability the number of white balls in the first box increases. We can confirm the expected result that the initial distribution of the balls influences their distribution after a long time only in a very negligible way. □
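To get a concrete feeling for the result, here is a small hedged sketch in Sage that builds $T$ for one chosen value of $n$ (the choice $n = 3$ is an assumption made purely for illustration) and checks that each column sums to $1$:

n = 3
T = matrix(QQ, n+1, n+1,
           lambda i, j: j^2/n^2 if i == j-1
                        else ((n-j)^2/n^2 if i == j+1
                        else (2*j*(n-j)/n^2 if i == j else 0)))
print(T)
print(all(sum(c) == 1 for c in T.columns()))   # True: T is probabilistic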
D) Material on linear algebra – unitary spaces, orthogonality, Jordan forms

3.F.28. The Cauchy–Schwarz inequality. Let $(V, \langle\,,\rangle)$ be a unitary space over $\mathbb{R}$. Describe an alternative proof of the Cauchy inequality, in comparison with the one presented in 3.4.3.

Solution. We should prove the inequality $|\langle x, y\rangle| \le \|x\|\,\|y\|$ for any two vectors $x, y \in V$. We may assume that $y \ne 0$, since otherwise the statement is trivial. For a scalar $t \in \mathbb{R}$ define the function
\[ \zeta(t) = \|ty - x\|^2 = \langle ty - x, ty - x\rangle = t^2\|y\|^2 - t\langle y, x\rangle - t\langle x, y\rangle + \|x\|^2. \]
Since $(V, \langle\,,\rangle)$ is a real unitary space we have $\langle y, x\rangle = \langle x, y\rangle$, and hence $\zeta(t) = t^2\|y\|^2 - 2t\langle x, y\rangle + \|x\|^2$. As $y \ne 0$, setting $a = \|y\|^2$, $b = 2\langle y, x\rangle$ and $c = \|x\|^2$ we obtain a quadratic polynomial $\zeta(t) = at^2 - bt + c$. However, $\|\cdot\|$ is a norm and thus $\|ty - x\|^2 \ge 0$, that is, $\zeta(t) \ge 0$ for all $t$. Therefore the discriminant of $\zeta(t)$ must be non-positive: $\Delta = b^2 - 4ac \le 0$, which is equivalent to $4ac \ge b^2$. This means $4\|y\|^2\|x\|^2 \ge 4\langle y, x\rangle^2$, or equivalently $\|y\|\,\|x\| \ge |\langle x, y\rangle|$, and the proof is complete. □

The next problem is about the so-called polarization identities on a unitary space. According to these formulas one can recover the inner product from the norm. The polarization identities can be used in many situations, see for example 3.F.32 (and see also Chapter 4 for a description in terms of quadratic forms).

3.F.29. Polarization identities. Let $(V, \langle\,,\rangle)$ be a unitary space.
(i) If $V$ is defined over $\mathbb{R}$, so that $(V, \langle\,,\rangle)$ is a real unitary space, prove that
\[ 4\langle u, v\rangle = \|u + v\|^2 - \|u - v\|^2, \qquad u, v \in V. \]
(ii) If $V$ is defined over $\mathbb{C}$, so that $(V, \langle\,,\rangle)$ is a complex unitary space, prove that
\[ 4\langle u, v\rangle = \|u + v\|^2 - \|u - v\|^2 + i\|u + iv\|^2 - i\|u - iv\|^2, \qquad u, v \in V. \]

Solution. (i) By assumption, $(V, \langle\,,\rangle)$ is a real inner product space, hence $\langle u, v\rangle = \langle v, u\rangle$ and by bilinearity we get
\[ \|u + v\|^2 - \|u - v\|^2 = \big(\|u\|^2 + 2\langle u, v\rangle + \|v\|^2\big) - \big(\|u\|^2 - 2\langle u, v\rangle + \|v\|^2\big) = 4\langle u, v\rangle. \]
(ii) In this case $(V, \langle\,,\rangle)$ is a complex inner product space and in general $\langle u, v\rangle \ne \langle v, u\rangle$. We compute
\begin{align*}
\|u + v\|^2 &= \|u\|^2 + \langle u, v\rangle + \langle v, u\rangle + \|v\|^2,\\
-\|u - v\|^2 &= -\|u\|^2 + \langle u, v\rangle + \langle v, u\rangle - \|v\|^2,\\
i\|u + iv\|^2 &= i\|u\|^2 + i\langle u, iv\rangle + i\langle iv, u\rangle + i\|iv\|^2 = i\|u\|^2 + \langle u, v\rangle - \langle v, u\rangle + i\|v\|^2,\\
-i\|u - iv\|^2 &= -i\|u\|^2 + \langle u, v\rangle - \langle v, u\rangle - i\|v\|^2,
\end{align*}
where we used $\langle u, iv\rangle = \bar{i}\langle u, v\rangle = -i\langle u, v\rangle$ and $\langle iv, u\rangle = i\langle v, u\rangle$. Adding these relations, we arrive at the desired identity
\[ 4\langle u, v\rangle = \|u + v\|^2 - \|u - v\|^2 + i\|u + iv\|^2 - i\|u - iv\|^2. \] □

Beyond the theory of unitary spaces, the notion of a norm appears in many other areas of mathematics. For instance, one could not imagine the theory of normed vector spaces without it. This is a topic that we will analyze in Chapter 7, but since we are already familiar with scalar products we can present a few exercises on norms induced by scalar products. An important remark at this point is that not all norms are of this type; see Problem 3.F.30, which establishes a very first link with the material that we are going to treat in Chapter 7.

3.F.30. Parallelogram law. Prove that the following relation holds on any unitary vector space $(V, \langle\,,\rangle)$:
\[ \|u + v\|^2 + \|u - v\|^2 = 2\big(\|u\|^2 + \|v\|^2\big), \tag{†} \]
for any $u, v \in V$, where $\|\cdot\|$ denotes the norm induced by $\langle\,,\rangle$. Then show by an example that there are norms which are not induced by an inner product.

Solution. Even a high-school student can solve the first task, at least for the real vector space $\mathbb{R}^n$: it follows directly from the properties of a scalar product, and the details of the proof are left to the reader. The second task is a bit more demanding. Consider for example $\mathbb{R}^2$ and set $\|(x_1, x_2)\|_\infty := \max\{|x_1|, |x_2|\}$ for any $x = (x_1, x_2)^T \in \mathbb{R}^2$. We can easily show that this is a norm, since it satisfies the characteristic properties of a norm. For example, the triangle inequality follows from
\[ \|x + y\|_\infty = \max\{|x_1 + y_1|, |x_2 + y_2|\} \le \max\{|x_1| + |y_1|, |x_2| + |y_2|\} \le \|x\|_\infty + \|y\|_\infty. \]
However, this norm is not induced by an inner product on $\mathbb{R}^2$, since it does not satisfy the relation (†) posed above. To see this, consider for example the vectors $x = (1, 0)^T$ and $y = (0, 1/2)^T$. Then we compute $\|x + y\|_\infty^2 + \|x - y\|_\infty^2 = 2$, but $2(\|x\|_\infty^2 + \|y\|_\infty^2) = 5/2$. □

3.F.31. Let $F : \mathrm{Mat}_2(\mathbb{R}) \to \mathbb{R}_2[x]$ be the mapping defined by
\[ F\begin{pmatrix} a & b \\ c & d \end{pmatrix} = (a + d) + (a + b)x + 2(c + d)x^2, \qquad a, b, c, d \in \mathbb{R}. \]
After verifying that $F$ is linear, find the dimension of the null space $\ker(F) \subset \mathrm{Mat}_2(\mathbb{R})$ of $F$. Next provide an orthonormal basis of $\ker(F)$, with respect to the Frobenius scalar product $B(A, B) = \mathrm{tr}(B^TA)$ on $\mathrm{Mat}_2(\mathbb{R})$. ⃝
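Returning to Problem 3.F.30, the counterexample can also be checked numerically. The following short Sage cell is a sketch of such a verification:

x = vector(QQ, [1, 0]); y = vector(QQ, [0, 1/2])
sup = lambda v: max(abs(c) for c in v)   # the sup-norm on R^2
lhs = sup(x + y)^2 + sup(x - y)^2        # equals 2
rhs = 2*(sup(x)^2 + sup(y)^2)            # equals 5/2
print(lhs, rhs, lhs == rhs)              # the parallelogram law (†) fails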
3.F.32. Linear isometries. Let $(V, \langle\,,\rangle_V)$ and $(W, \langle\,,\rangle_W)$ be two (finite-dimensional) unitary spaces. For a linear mapping $\psi : V \to W$ prove the following:
(a) $\psi$ is a linear isometry if and only if $\|u\|_V = \|\psi(u)\|_W$ for any $u \in V$, i.e., $\psi$ is norm (equivalently, distance) preserving;
(b) $\psi$ is a linear isometry if and only if $\psi^*\psi = I_V$, where $I_V$ is the identity map on $V$.

Solution. (a) Let $u \in V$ and assume that $\psi$ is an isometry. This means that $\langle u, v\rangle_V = \langle\psi(u), \psi(v)\rangle_W$ for any $u, v \in V$, and we see that
\[ \|\psi(u)\|_W^2 = \langle\psi(u), \psi(u)\rangle_W = \langle u, u\rangle_V = \|u\|_V^2, \]
hence $\|\psi(u)\|_W = \|u\|_V$ for any $u \in V$. Conversely, assume that $\|u\|_V = \|\psi(u)\|_W$ for any $u \in V$. Recall from 3.F.29 that the scalar product is determined by the norm via the formula
\[ 4\langle u, v\rangle_V = \|u + v\|_V^2 - \|u - v\|_V^2 + i\|u + iv\|_V^2 - i\|u - iv\|_V^2, \]
for any $u, v \in V$ (with the identity of 3.F.29(i) in the real case), and similarly for $\langle\,,\rangle_W$. Because $\psi$ is linear and preserves norms, a direct calculation via this formula certifies that it preserves the inner products as well, so $\psi$ is an isometry.
(b) Suppose that $\psi$ satisfies $\psi^*\psi = I_V$. Then, for any $u, v \in V$ we have
\[ \|u - v\|_V^2 = \langle u - v, u - v\rangle_V = \langle u - v, (\psi^*\psi)(u - v)\rangle_V = \langle\psi(u - v), \psi(u - v)\rangle_W = \|\psi(u) - \psi(v)\|_W^2. \]
Thus $\|u - v\|_V = \|\psi(u) - \psi(v)\|_W$, which shows that $\psi$ is an isometry. Conversely, if $\psi : V \to W$ is an isometry, then $\langle(\psi^*\psi)(u), v\rangle_V = \langle\psi(u), \psi(v)\rangle_W = \langle u, v\rangle_V$ for any $u, v \in V$. Since $\langle\,,\rangle_V$ is an inner product, this implies $(\psi^*\psi)(u) = u$ for any $u \in V$. Thus $\psi^*\psi = I_V$. □

3.F.33. Consider the vector $v = \big(0, \frac{\sqrt{2}}{2}, -\frac{\sqrt{2}}{2}\big)^T \in \mathbb{R}^3$. Find an orthogonal operator $F : \mathbb{R}^3 \to \mathbb{R}^3$ such that $F(v) = e_1$, where $e_1 = (1, 0, 0)^T$ is the first vector of the standard basis of $\mathbb{R}^3$. Next confirm your result using Sage.

Solution. Let $A$ be the $3\times3$ matrix corresponding to $F$, such that $F(u) = Au$ for all $u \in \mathbb{R}^3$. Since $F$ should be orthogonal, $A$ satisfies $AA^T = E$, where $E$ is the $3\times3$ identity matrix. Recall that the columns of an orthogonal $n\times n$ matrix form an orthonormal basis of $\mathbb{R}^n$. For the given vector $v$ we know that $F(v) = Av = e_1$, from where we get $v = A^Te_1$. Thus $v$ sits in the first column of $A^T$, and we may assume that $A^T = (v\ v_1\ v_2)$ for some vectors $v_1, v_2$ in $\mathbb{R}^3$ that we should specify. These vectors should be orthogonal to $v$, i.e., $v_1, v_2 \in v^\perp$, and it is easy to see that any vector $(x_1, x_2, x_3)^T$ in the orthogonal complement $v^\perp$ satisfies the equation $\frac{\sqrt{2}}{2}x_2 - \frac{\sqrt{2}}{2}x_3 = 0$, or equivalently $x_2 - x_3 = 0$. Thus the solution space has the form
\[ \left\{ \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} t \\ s \\ s \end{pmatrix} = t\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + s\begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} : t, s \in \mathbb{R} \right\}, \]
and we deduce that $v^\perp$ is generated by the vectors $(1, 0, 0)^T$ and $(0, 1, 1)^T$. Obviously, these vectors are orthogonal to each other. Setting $v_1 = (1, 0, 0)^T$ and $v_2 = \big(0, \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\big)^T$, so that $\|v_1\| = 1 = \|v_2\|$ and $v_1 \perp v_2$, we finally obtain
\[ A^T = \begin{pmatrix} 0 & 1 & 0 \\ \frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \\ -\frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \end{pmatrix} \implies A = \begin{pmatrix} 0 & \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \\ 1 & 0 & 0 \\ 0 & \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{pmatrix}. \]
Using this expression we can verify that $Av = e_1$. Consequently, if $u = (x, y, z)^T$ is an arbitrary vector of $\mathbb{R}^3$, we have
\[ F(u) = Au = \begin{pmatrix} \frac{\sqrt{2}}{2}(y - z) \\ x \\ \frac{\sqrt{2}}{2}(y + z) \end{pmatrix}. \]
For a confirmation in Sage, use the cell given here:

A = matrix([[0, sqrt(2)/2, -sqrt(2)/2], [1, 0, 0], [0, sqrt(2)/2, sqrt(2)/2]])
v = vector([0, sqrt(2)/2, -sqrt(2)/2]); e1 = vector([1, 0, 0])
print(A.is_unitary()); print(A*v == e1)

□

Let us now analyze an example related to the notion of matrix groups $G \subset \mathrm{Gl}_n(\mathbb{F})$, where $\mathbb{F} \in \{\mathbb{R}, \mathbb{C}\}$ and $\mathrm{Gl}_n(\mathbb{F})$ is the general linear group.
In this part, to prove the crucial property of "closedness" that characterizes such a group $G$, we require a basic understanding of continuous functions, as well as some familiarity with closed subsets of Euclidean space. These concepts are explored in Chapter 5, and our tasks below establish a first link between analysis and group theory. Notably, we have already encountered the concept of "closedness" in the proof of the lemma presented in 3.3.3, and the reader may find it helpful to refer to a similar explanation below. (Recall that we have an inclusion $\mathrm{Gl}_n(\mathbb{R}) \subset \mathbb{R}^{n^2} \cong \mathrm{Mat}_n(\mathbb{R})$.)

3.F.34. Prove that the orthogonal group $O(n)$ is a matrix group, that is, a closed subgroup of $\mathrm{Gl}_n(\mathbb{R})$.

Solution. Hints: The fact that $O(n)$ is a subgroup was discussed in 3.D.18. Consider the map
\[ \rho : \mathrm{Mat}_n(\mathbb{R}) \to \mathrm{Sym}(n, \mathbb{R}), \qquad A \mapsto AA^T, \]
where $\mathrm{Sym}(n, \mathbb{R})$ is the set of $n\times n$ symmetric matrices with real entries. First, establish a linear isomorphism between $\mathrm{Sym}(n, \mathbb{R})$ and the Euclidean space $\mathbb{R}^{\frac{1}{2}n(n+1)}$. Then deduce that the function $\rho$ is continuous and, moreover, that $O(n) = \rho^{-1}(\{E\})$, where $E$ is the $n\times n$ identity matrix. Since the preimage of a closed set under a continuous map is closed, this shows that $O(n)$ is closed in $\mathrm{Gl}_n(\mathbb{R})$, thereby confirming that $O(n)$ is indeed a matrix group. □

3.F.35. Matrix of a general rotation in $\mathbb{R}^3$. Derive the matrix of a general rotation in $\mathbb{R}^3$.

Solution. Consider an arbitrary unit vector $(x, y, z)^T \in \mathbb{R}^3$. The rotation in the positive sense by an angle $\varphi$ about this vector can be written as a composition of the following rotations, whose matrices we already know:
• The rotation $R_1$ in the negative sense about the $z$ axis through the angle with cosine equal to $x/\sqrt{x^2 + y^2} = x/\sqrt{1 - z^2}$, that is, with sine $y/\sqrt{1 - z^2}$, under which the line with directional vector $(x, y, z)^T$ goes over to the line with directional vector $(\sqrt{1 - z^2}, 0, z)^T$. The matrix of this rotation has the form
\[ R_1 = \begin{pmatrix} x/\sqrt{1 - z^2} & y/\sqrt{1 - z^2} & 0 \\ -y/\sqrt{1 - z^2} & x/\sqrt{1 - z^2} & 0 \\ 0 & 0 & 1 \end{pmatrix}. \]
• The rotation $R_2$ in the positive sense about the $y$ axis through the angle with cosine $\sqrt{1 - z^2}$, that is, with sine $z$, under which the line with directional vector $(\sqrt{1 - z^2}, 0, z)^T$ goes over to the line with directional vector $(1, 0, 0)^T$. The matrix of this rotation is given by
\[ R_2 = \begin{pmatrix} \sqrt{1 - z^2} & 0 & z \\ 0 & 1 & 0 \\ -z & 0 & \sqrt{1 - z^2} \end{pmatrix}. \]
• The rotation $R_3$ in the positive sense about the $x$ axis through the angle $\varphi$, with matrix
\[ R_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & -\sin\varphi \\ 0 & \sin\varphi & \cos\varphi \end{pmatrix}. \]
• The rotation $R_2^{-1}$ with the matrix $R_2^{-1}$,
• the rotation $R_1^{-1}$ with the matrix $R_1^{-1}$.
The matrix of the composition of these mappings, that is, the matrix we are looking for, is given by the product of the rotation matrices in the reverse order:
\[ R_1^{-1} \cdot R_2^{-1} \cdot R_3 \cdot R_2 \cdot R_1 = \begin{pmatrix} 1 - t + tx^2 & txy - zs & txz + ys \\ txy + zs & 1 - t + ty^2 & tyz - xs \\ txz - ys & tyz + xs & 1 - t + tz^2 \end{pmatrix}, \]
where $t = 1 - \cos\varphi$ and $s = \sin\varphi$. □
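Before moving on, we may test the derived formula in Sage. The sketch below uses an arbitrarily chosen unit vector (an assumption made for illustration only) and relies on simplify_full() to reduce the trigonometric entries; it verifies that the resulting matrix is orthogonal and fixes the rotation axis:

var('phi')
x, y, z = 1/3, 2/3, 2/3            # a unit vector: x^2 + y^2 + z^2 = 1
t = 1 - cos(phi); s = sin(phi)
R = matrix([[1 - t + t*x^2, t*x*y - z*s, t*x*z + y*s],
            [t*x*y + z*s, 1 - t + t*y^2, t*y*z - x*s],
            [t*x*z - y*s, t*y*z + x*s, 1 - t + t*z^2]])
print((R*R.T).simplify_full() == identity_matrix(3))               # True
print((R*vector([x, y, z])).simplify_full() == vector([x, y, z]))  # the axis is fixed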
3.F.36. Find the matrix of the rotation in the positive sense by the angle $\pi/3$ about the line passing through the origin with the oriented directional vector $(1, 1, 0)^T$, with respect to the standard basis of $\mathbb{R}^3$. ⃝

3.F.37. Using basic plotting functions in Sage, visualize the 2-dimensional square in $\mathbb{R}^3$ having as vertices the points $[1, 1, 0]$, $[1, -1, 0]$, $[-1, -1, 0]$ and $[-1, 1, 0]$, as it rotates around the $x$, $y$ and $z$ axes. Hint: Use the matrices $R_x$, $R_y$ and $R_z$ presented in the solution of 3.D.19.

Solution. We should define the given square, then apply the rotation matrices from 3.D.19 and finally plot the rotated squares. It is easiest to fix the rotation angle, say $\theta = \pi/4$, in order to visualize our goal by a figure. In the resulting figure, the initial square is shown in black, the square rotated about the $x$-axis is shown in red, the square rotated around the $y$-axis is in green, and the square rotated around the $z$-axis is in blue. Let us present the code and include the necessary explanations within the code block.

# Define the vertices of the square in the xy-plane
square_vertices = [(1, 1, 0), (1, -1, 0), (-1, -1, 0), (-1, 1, 0)]
# Define the edges of the square
square_edges = [(square_vertices[0], square_vertices[1]),
                (square_vertices[1], square_vertices[2]),
                (square_vertices[2], square_vertices[3]),
                (square_vertices[3], square_vertices[0])]
# Define the rotation angle in radians
theta = pi / 4  # 45 degrees
# Rotation matrices
Rx = Matrix([[1, 0, 0], [0, cos(theta), -sin(theta)], [0, sin(theta), cos(theta)]])
Ry = Matrix([[cos(theta), 0, sin(theta)], [0, 1, 0], [-sin(theta), 0, cos(theta)]])
Rz = Matrix([[cos(theta), -sin(theta), 0], [sin(theta), cos(theta), 0], [0, 0, 1]])

# Function to rotate the vertices and create a plot for each axis
def plot_rotation(rotation_matrix, color, label):
    rotated_edges = []
    for start, end in square_edges:
        start_rotated = rotation_matrix * vector(start)
        end_rotated = rotation_matrix * vector(end)
        rotated_edges.append((start_rotated, end_rotated))
    return sum([line([edge[0], edge[1]], color=color, thickness=2, legend_label=label)
                for edge in rotated_edges])

# Plot the original and rotated squares
original_plot = plot_rotation(Matrix.identity(3), "black", "Original")
rotated_x_plot = plot_rotation(Rx, "red", "Rotation around x-axis")
rotated_y_plot = plot_rotation(Ry, "green", "Rotation around y-axis")
rotated_z_plot = plot_rotation(Rz, "blue", "Rotation around z-axis")
# Combine the plots and show them
final_plot = original_plot + rotated_x_plot + rotated_y_plot + rotated_z_plot
final_plot.show(aspect_ratio=1, frame=True, legend=True)

The most advanced part of our program is the definition of the plot_rotation function. This is designed to take a rotation matrix, a colour, and a label as inputs. It then applies the rotation matrix to each edge of the given square, creates lines for these rotated edges, and returns a plot object that can be displayed, making it easier to visualize the effect of the rotation in $\mathbb{R}^3$. The command rotated_edges = [] initializes an empty list that will store the pairs of points (edges) after they have been transformed by the rotation matrix. Next, the syntax for start, end in square_edges: forms a loop over each edge of the square. In particular, square_edges is a list of tuples, where each tuple represents an edge of the square as a pair of vertex coordinates; the loop iterates over each edge, extracting its start and end points (start, end). On the other hand, the lines

start_rotated = rotation_matrix * vector(start)
end_rotated = rotation_matrix * vector(end)

play the following role: each operation multiplies the rotation matrix by a vector created from an endpoint of the edge. The vector() function converts the coordinate tuple into a vector object, allowing for matrix multiplication.
Thus start_rotated holds the new vector representing the rotated position of the start point, while end_rotated holds the rotated position of the end point. Next, the command rotated_edges.append((start_rotated, end_rotated)) adds the newly rotated edge, represented as a tuple of the rotated start and end points, to the rotated_edges list. Finally, the last line within the def environment creates the line plots for each rotated edge: it builds a list of line objects, each representing a line plot of one edge. To summarize, the plot_rotation function efficiently rotates the square's vertices using a specified rotation matrix, then plots each rotated edge with a given colour and label. You can experiment further by changing the rotation angle $\theta$ from the value we used, i.e., $\theta = \pi/4$. □

3.F.38. Let $(V, \langle\,,\rangle)$ be a unitary vector space and suppose that $\varphi : V \to V$ is a linear mapping with the property $\varphi^2 = \varphi$. Prove that there exists a linear subspace $W \subset V$ such that $\varphi$ is the orthogonal projection onto $W$, i.e., $\varphi = \mathrm{proj}_W$, if and only if $\varphi$ is self-adjoint.

Solution. Assume that $\varphi$ is self-adjoint. Then, by linearity and the condition $\varphi^2 = \varphi$, we see that $\varphi(u - \varphi(u)) = \varphi(u) - \varphi^2(u) = 0$ for any $u \in V$. Thus $u - \varphi(u) \in \ker(\varphi)$. Recall now from 3.F.41 that any endomorphism $\varphi$ of $V$ satisfies $\ker(\varphi^*) = (\mathrm{Im}(\varphi))^\perp$ and $\mathrm{Im}(\varphi^*) = (\ker(\varphi))^\perp$. By the second relation we get $\ker(\varphi) = (\mathrm{Im}(\varphi^*))^\perp$, and since $\varphi$ is self-adjoint this gives $\ker(\varphi) = (\mathrm{Im}(\varphi))^\perp$. Thus, writing
\[ u = \varphi(u) + (u - \varphi(u)), \]
we have $\varphi(u) \in \mathrm{Im}(\varphi)$ and $u - \varphi(u) \in (\mathrm{Im}(\varphi))^\perp$. Since $V = \mathrm{Im}(\varphi) \oplus (\mathrm{Im}(\varphi))^\perp$, we deduce that $\varphi(u) = \mathrm{proj}_W u$ for any $u \in V$, where $W = \mathrm{Im}(\varphi) \subset V$. This proves one direction. For the converse, assume that there exists a subspace $W \subset V$ such that $\varphi = \mathrm{proj}_W$. Let $W^\perp$ be the orthogonal complement of $W$ with respect to $\langle\,,\rangle$. Given arbitrary $u, v \in V$, write $u = u_1 + u_2$ and $v = v_1 + v_2$ with $u_1, v_1 \in W$ and $u_2, v_2 \in W^\perp$, respectively. Then, since $\varphi = \mathrm{proj}_W$, we have $\varphi(u) = u_1$, $\varphi(v) = v_1$, and thus
\[ \langle\varphi(u), v\rangle = \langle u_1, v_1 + v_2\rangle = \langle u_1, v_1\rangle = \langle u_1 + u_2, v_1\rangle = \langle u, v_1\rangle = \langle u, \varphi(v)\rangle. \]
Thus $\varphi = \varphi^*$ and $\varphi$ is self-adjoint. □

3.F.39. Show that for any symmetric matrix $A \in \mathrm{Mat}_n(\mathbb{R})$ the operator $L_A : \mathbb{R}^n \to \mathbb{R}^n$ defined by $L_Ax = Ax$ ($x \in \mathbb{R}^n$) is self-adjoint. ⃝

3.F.40. For $A \in \mathrm{Mat}_m(\mathbb{C})$ Hermitian, prove that the linear operator $L_A : \mathrm{Mat}_{m,n}(\mathbb{C}) \to \mathrm{Mat}_{m,n}(\mathbb{C})$ defined by $L_A(B) = AB$ is self-adjoint. ⃝

3.F.41. Let $f : V \to V$ be an endomorphism of a (finite-dimensional) inner product space $(V, \langle\,,\rangle)$. Prove that $\ker(f^*) = (\mathrm{Im}(f))^\perp$ and $\mathrm{Im}(f^*) = (\ker(f))^\perp$. ⃝

3.F.42. Let $\varphi_u : \mathbb{R}^n \to \mathbb{R}^n$ be the linear mapping that reflects $\mathbb{R}^n$ through the line in the direction of the unit vector $u \in \mathbb{R}^n$. Show that its matrix $[\varphi_u] = 2uu^T - E$ is unitary. ⃝

3.F.43. Let $A \in \mathrm{Mat}_n(\mathbb{C})$ be a complex matrix. Prove that the product $A^*A$ has only real eigenvalues. Next suppose that there exists a unitary matrix $U$ such that $A = U^*DU$ for some diagonal matrix $D$. Show that $A$ is normal. ⃝

3.F.44. Find the values of the complex parameters $a, b, c$ such that the matrix $A$ is Hermitian, where
\[ A = \begin{pmatrix} 1 & a & -i \\ 2 - 2i & 0 & b \\ c & 1 + i & 0 \end{pmatrix}. \] ⃝
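Several of the exercises above are easy to probe experimentally before proving them. For instance, the following Sage sketch checks the claim of 3.F.42 for one sample unit vector $u \in \mathbb{R}^3$ (the chosen $u$ is an assumption made purely for illustration):

u = vector(QQ, [2/3, 2/3, 1/3])                  # a unit vector
M = 2*u.outer_product(u) - identity_matrix(QQ, 3)
print(M*M.T == identity_matrix(3))               # M is orthogonal (unitary)
print(M*u == u)                                  # the line through u is fixed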
3.F.45. Given a Hermitian matrix $A$, show that its determinant $\det(A)$ is a real number.

Solution. The matrix $A$ is supposed to be Hermitian, and hence unitarily diagonalizable. Thus $U^*AU = D$ for some unitary matrix $U$ and a diagonal (real) matrix $D$ (since $A$ has only real eigenvalues). Then, since $U^{-1} = U^*$, a direct computation shows that $A = U^*DU$. Therefore,
\[ \det(A) = \det(U^*)\det(D)\det(U) = \det(U^{-1})\det(D)\det(U) = \det(D), \]
which is real, and our claim follows. □

So far, we have explored various methods in Sage for applying the Gram–Schmidt orthogonalization process. However, we have not yet discussed the gram_schmidt() method, which orthogonalizes the rows of a matrix. When the gram_schmidt() method is applied to a matrix, it returns two matrices: a matrix whose rows are mutually orthogonal (but not necessarily of unit length) and a matrix containing the coefficients used in the Gram–Schmidt process to express the original rows as linear combinations of the orthogonalized rows. Since we prefer to work with eigenvectors as columns, in 3.F.46 below we will apply this method to a transposed matrix and use only the first of the two resulting matrices.

3.F.46. Present an orthogonal diagonalization of the following symmetric matrix:
\[ A = \begin{pmatrix} -\frac{1}{3} & 1 & 0 \\ 1 & -\frac{1}{3} & 0 \\ 0 & 0 & -\frac{1}{3} \end{pmatrix}. \]
Next implement the task in Sage using the gram_schmidt() method mentioned above.

Solution. The characteristic polynomial of $A$ is given by
\[ \chi_A(\lambda) = \det\begin{pmatrix} -\frac{1}{3} - \lambda & 1 & 0 \\ 1 & -\frac{1}{3} - \lambda & 0 \\ 0 & 0 & -\frac{1}{3} - \lambda \end{pmatrix} = -\Big(\lambda + \frac{1}{3}\Big)\Big(\lambda^2 + \frac{2}{3}\lambda - \frac{8}{9}\Big) = -\Big(\lambda + \frac{1}{3}\Big)\Big(\lambda + \frac{4}{3}\Big)\Big(\lambda - \frac{2}{3}\Big). \]
Thus, the eigenvalues of $A$ are $\lambda_1 = \frac{2}{3}$, $\lambda_2 = -\frac{1}{3}$ and $\lambda_3 = -\frac{4}{3}$, all with algebraic multiplicity one. The geometric multiplicity of each $\lambda_i$ is then also one, for $i = 1, 2, 3$, and hence the matrix $A$ is diagonalizable. Let us find the corresponding eigenvectors.
• For $\lambda_1 = \frac{2}{3}$ we need to solve the matrix equation $(A - \frac{2}{3}E)u = 0$ for a vector $u = (x_1, x_2, x_3)^T \in \mathbb{R}^3$. We see that
\[ A - \tfrac{2}{3}E = \begin{pmatrix} -1 & 1 & 0 \\ 1 & -1 & 0 \\ 0 & 0 & -1 \end{pmatrix}, \]
and the corresponding linear system has the solution space $\{(t, t, 0)^T : t \in \mathbb{R}\}$. Hence, the eigenvectors corresponding to $\lambda_1$ are multiples of the vector $u_1 = (1, 1, 0)^T$.
• For $\lambda_2 = -\frac{1}{3}$ we have the matrix equation $(A + \frac{1}{3}E)u = 0$, where
\[ A + \tfrac{1}{3}E = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}. \]
The solution space of the corresponding linear system is $\{(0, 0, s)^T : s \in \mathbb{R}\}$. Hence, the eigenvectors corresponding to $\lambda_2$ are multiples of the vector $u_2 = (0, 0, 1)^T$.
• For $\lambda_3 = -\frac{4}{3}$ we have the matrix equation $(A + \frac{4}{3}E)u = 0$, where
\[ A + \tfrac{4}{3}E = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \]
The solution space of the corresponding linear system is $\{(r, -r, 0)^T : r \in \mathbb{R}\}$. Hence, the eigenvectors corresponding to $\lambda_3$ are multiples of the vector $u_3 = (1, -1, 0)^T$.
In Sage we can confirm these results as follows:

A = matrix(QQ, [[-1/3, 1, 0], [1, -1/3, 0], [0, 0, -1/3]])
chi_A = A.characteristic_polynomial()
print(chi_A.factor())
print(A.eigenvalues())
D, Pein = A.eigenmatrix_right()
show(D, Pein)

Executing this cell, Sage prints out the characteristic polynomial of $A$, its eigenvalues, the diagonal matrix $D = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3)$ and the matrix whose columns are the eigenvectors $u_1, u_2, u_3$ (the latter is denoted by Pein inside the program). Next, we see that $u_1 \perp u_2$, $u_1 \perp u_3$ and $u_2 \perp u_3$:
\[ \langle u_1, u_2\rangle = u_2^Tu_1 = 0, \qquad \langle u_1, u_3\rangle = u_3^Tu_1 = 0, \qquad \langle u_2, u_3\rangle = u_3^Tu_2 = 0. \]
Moreover, we compute $\|u_1\| = \sqrt{2} = \|u_3\|$, while $\|u_2\| = 1$. Thus, the eigenvectors given below are orthonormal:
\[ \hat{u}_1 = \frac{u_1}{\|u_1\|} = \Big(\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}, 0\Big)^T, \qquad \hat{u}_2 = u_2 = (0, 0, 1)^T, \qquad \hat{u}_3 = \frac{u_3}{\|u_3\|} = \Big(\frac{\sqrt{2}}{2}, -\frac{\sqrt{2}}{2}, 0\Big)^T. \]
We are now ready to present the matrix $P$:
\[ P = (\hat{u}_1\ \hat{u}_2\ \hat{u}_3) = \begin{pmatrix} \frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & 0 & -\frac{\sqrt{2}}{2} \\ 0 & 1 & 0 \end{pmatrix}. \]
By adding the following cell to the previous block, we can quickly verify that $P$ is orthogonal and, furthermore, that
\[ P^{-1}AP = P^TAP = D = \begin{pmatrix} \frac{2}{3} & 0 & 0 \\ 0 & -\frac{1}{3} & 0 \\ 0 & 0 & -\frac{4}{3} \end{pmatrix}. \]

P = matrix([[sqrt(2)/2, 0, sqrt(2)/2], [sqrt(2)/2, 0, -sqrt(2)/2], [0, 1, 0]])
print(P.is_unitary())
print(D == P.inverse()*A*P)

Let us now implement the solution of the task using the gram_schmidt() method, as suggested in the statement. As mentioned earlier, we will only utilize the first matrix returned by this method. To normalize the column vectors of a given matrix, we will define another function, which we may call normalize_col. This takes a matrix M as input and returns another matrix, with each column normalized. It works as follows:
• it iterates through each column vector u of the matrix M;
• then, each vector u is divided by its norm to normalize it;
• finally the normalized columns are recombined into a new matrix, using the column_matrix() command.
For our task we also need the matrices D and Pein, as introduced in the initial block via the eigenmatrix_right command. Our program has the following form:

A = matrix(QQ, [[-1/3, 1, 0], [1, -1/3, 0], [0, 0, -1/3]])
D, Pein = A.eigenmatrix_right()
# store the transpose of the matrix whose columns are the unnormalized eigenvectors of A as Q
Q = Pein.T
# orthogonalize the rows of Q and take the transpose of the matrix obtained by Gram-Schmidt
R = Q.gram_schmidt()[0].T
show(R)
# construct the function that normalizes the column vectors of a matrix M
def normalize_col(M):
    return column_matrix([v/norm(v) for v in M.columns()])
# normalize the columns of the matrix R and store the resulting matrix as P
P = normalize_col(R)
show(P)
print(P.is_unitary())
print(P.inverse()*A*P == D)

Running this program, Sage displays the matrix $P$ as shown below, along with the message "True" printed twice, verifying that $P$ is an orthogonal matrix that satisfies the relation $P^TAP = D$:
\[ \begin{pmatrix} \frac{1}{2}\sqrt{2} & 0 & \frac{1}{2}\sqrt{2} \\ \frac{1}{2}\sqrt{2} & 0 & -\frac{1}{2}\sqrt{2} \\ 0 & 1 & 0 \end{pmatrix}. \] □

3.F.47. Using the Sage program constructed in 3.F.46, present the orthogonal diagonalization of the following matrices:
\[ A = \begin{pmatrix} 2 & 2 & 2 \\ 2 & 0 & 4 \\ 2 & 4 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} -4 & 4 & 4 \\ 4 & -4 & 4 \\ 4 & 4 & -4 \end{pmatrix}. \]

Solution. For the matrix $A$ the program has the form

A = matrix(QQ, [[2, 2, 2], [2, 0, 4], [2, 4, 0]])
D, Pein = A.eigenmatrix_right()
show(D, Pein)
Q = Pein.T
R = Q.gram_schmidt()[0].T
def normalize_columns(M):
    return column_matrix([v/norm(v) for v in M.columns()])
P = normalize_columns(R); show(P)
print(P*D*P.T == A); print(P.is_unitary())

In this case we get the orthogonal matrix
\[ P = \begin{pmatrix} \frac{\sqrt{3}}{3} & \frac{\sqrt{6}}{3} & 0 \\ \frac{\sqrt{3}}{3} & -\frac{\sqrt{6}}{6} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{3}}{3} & -\frac{\sqrt{6}}{6} & -\frac{\sqrt{2}}{2} \end{pmatrix}, \qquad \text{such that} \qquad P^TAP = \begin{pmatrix} 6 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & -4 \end{pmatrix} = D, \]
where $\lambda_1 = 6$, $\lambda_2 = 0$ and $\lambda_3 = -4$ are the eigenvalues of $A$.
Similarly for the matrix $B$:

B = matrix(QQ, [[-4, 4, 4], [4, -4, 4], [4, 4, -4]])
D, Pein = B.eigenmatrix_right()
show(D, Pein)
Q = Pein.T
R = Q.gram_schmidt()[0].T
show(R)
def normalize_columns(M):
    return column_matrix([v/norm(v) for v in M.columns()])
P = normalize_columns(R)
show(P)
print(P*D*P.T == B)
print(P.is_unitary())

Running this block, Sage prints out the orthogonal matrix
\[ P = \begin{pmatrix} \frac{\sqrt{3}}{3} & \frac{\sqrt{2}}{2} & -\frac{\sqrt{6}}{6} \\ \frac{\sqrt{3}}{3} & 0 & \frac{\sqrt{6}}{3} \\ \frac{\sqrt{3}}{3} & -\frac{\sqrt{2}}{2} & -\frac{\sqrt{6}}{6} \end{pmatrix}, \qquad \text{such that} \qquad P^TBP = \begin{pmatrix} 4 & 0 & 0 \\ 0 & -8 & 0 \\ 0 & 0 & -8 \end{pmatrix} = D, \]
where $\lambda_1 = 4$ and $\lambda_2 = -8$ are the eigenvalues of $B$, the second with an algebraic multiplicity of two. □

3.F.48. Present a unitary diagonalization of the matrix
\[ B = \begin{pmatrix} 0 & i & 1 \\ -i & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}. \]

Solution. The matrix $B$ is Hermitian, so it has real eigenvalues, given by $\lambda_1 = 0$ and $\lambda_{2,3} = \pm\sqrt{2}$. The corresponding eigenvectors have the form
\[ u_1 = (0, i, 1)^T, \qquad u_2 = (\sqrt{2}, -i, 1)^T, \qquad u_3 = (-\sqrt{2}, -i, 1)^T. \]
By running the following commands in sequence in Sage, one can verify that $\langle u_i, u_j\rangle = 0$ for any $1 \le i \ne j \le 3$, and that $\|u_1\| = \sqrt{2}$, $\|u_2\| = \|u_3\| = 2$:

u1 = vector(QQbar, [0, I, 1])
u2 = vector(QQbar, [sqrt(2), -I, 1])
u3 = vector(QQbar, [-sqrt(2), -I, 1])
u2.hermitian_inner_product(u1)
u1.hermitian_inner_product(u3)
u3.hermitian_inner_product(u2)
u1.norm(); u2.norm(); u3.norm()

The normalized eigenvectors $\hat{u}_i = u_i/\|u_i\|$, for $i = 1, 2, 3$, form the columns of the matrix
\[ U = \begin{pmatrix} 0 & \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2}i & -\frac{i}{2} & -\frac{i}{2} \\ \frac{\sqrt{2}}{2} & \frac{1}{2} & \frac{1}{2} \end{pmatrix}. \]
We can verify that $U$ is unitary; for example, in Sage you can type

U = matrix(QQbar, [[0, sqrt(2)/2, -sqrt(2)/2], [(sqrt(2)/2)*I, (-1/2)*I, (-1/2)*I], [sqrt(2)/2, 1/2, 1/2]])
U.is_unitary()

which returns True. This matrix diagonalizes the given matrix $B$ with $D = \mathrm{diag}(0, \sqrt{2}, -\sqrt{2})$. Recall from 3.D.38 that a verification of this final claim in Sage relies on the block

U = matrix(QQbar, [[0, sqrt(2)/2, -sqrt(2)/2], [(sqrt(2)/2)*I, (-1/2)*I, (-1/2)*I], [sqrt(2)/2, 1/2, 1/2]])
B = matrix(QQbar, [[0, I, 1], [-I, 0, 0], [1, 0, 0]])
D = matrix(QQbar, [[0, 0, 0], [0, sqrt(2), 0], [0, 0, -sqrt(2)]])
U.conjugate_transpose()*B*U == D

Run this block yourself to see the output generated by Sage. □

3.F.49. Adapt the method of orthogonal diagonalization presented in 3.F.46 to implement the unitary diagonalization of the matrix $B$ given in 3.F.48 using Sage.

Solution. We only need to make a few adjustments to the program presented in 3.F.46 for it to work with unitary diagonalization. In particular, since unitary diagonalization deals with complex matrices, we need to carefully replace transposes by conjugate transposes. Also, to achieve a more readable display of the matrices printed by Sage, it is useful to pass each matrix to a low-precision ring before printing it. This adjustment does not affect the original matrices or the final results, only their display. Below is our program:

CC20 = ComplexField(prec=20)
MatPrint = MatrixSpace(CC20, 3, 3)
B = matrix(QQbar, [[0, i, 1], [-i, 0, 0], [1, 0, 0]])
show(MatPrint(B))
print(B == B.H)
chi_B = B.characteristic_polynomial()
print(chi_B.factor())
print(B.eigenvalues())
D, Pein = B.eigenmatrix_right()
show(MatPrint(D))
show(MatPrint(Pein))
Q = Pein.H
R = Q.gram_schmidt()[0].H
def normalize_columns(M):
    return column_matrix([v/norm(v) for v in M.columns()])
P = normalize_columns(R)
show(MatPrint(P))
print(P*D*P.H == B); print(P.is_unitary())

Sage prints the solution matrix $P$ as follows:
\[ P = \begin{pmatrix} 0.70711 & 9.5829\times10^{-21}\,i & 0.70711 \\ -0.50000i & 0.70711 & 0.50000i \\ 0.50000 & -0.70711i & -0.50000 \end{pmatrix} = \begin{pmatrix} 0.70711 & 0 & 0.70711 \\ -0.50000i & 0.70711 & 0.50000i \\ 0.50000 & -0.70711i & -0.50000 \end{pmatrix}. \]
This matrix differs slightly from our solution matrix $P$ in 3.F.48, since, for example, Sage enumerates the eigenvalues differently and uses different eigenvectors (recall that $P$ is not uniquely determined). However, Sage verifies that $P$ is unitary and satisfies the unitary diagonalization condition. To understand the importance of using the low-precision ring before printing our matrices, try running the program in Sage without including the initial lines related to matrix display. Then compare the corresponding output with the one presented above. □

Next we will explore several tasks involving the bracket of square matrices of the same size, also known as the commutator $[\,,]$, defined as $[A, B] := AB - BA$. The commutator is a fundamental operation in linear algebra and plays a crucial role in various branches of mathematics and physics, particularly in the study of the so-called "Lie algebras". When $[A, B] = 0$, the matrices commute in the sense that $AB = BA$. Hence, the commutator $[A, B]$ measures the extent to which two matrices $A, B$ fail to commute. The commutator is a central concept in the theory of Lie algebras, which are algebraic structures used to study symmetry, particularly in the context of continuous transformations. A "Lie algebra" is a vector space $V$ over a field $\mathbb{F}$, equipped with a bilinear operation $[\,,] : V \times V \to V$, known as the "Lie bracket". This is skew-symmetric and satisfies the so-called Jacobi identity
\[ \mathfrak{S}_{X,Y,Z}[X, [Y, Z]] = 0, \]
where $\mathfrak{S}_{X,Y,Z}$ denotes the cyclic sum over the elements $X, Y, Z \in V$. We should mention that while the commutator of matrices is a primary example, the concept of a Lie algebra extends far beyond matrices, encompassing a broad range of algebraic structures.

3.F.50. For $\mathbb{F} = \mathbb{R}$ or $\mathbb{F} = \mathbb{C}$, prove that the vector space of square matrices $\mathrm{Mat}_n(\mathbb{F})$ endowed with the commutator $[A, B] := AB - BA$ has the structure of a (finite-dimensional) Lie algebra.

Solution. It suffices to prove that the commutator is a bilinear mapping $[\,,] : \mathrm{Mat}_n(\mathbb{F}) \times \mathrm{Mat}_n(\mathbb{F}) \to \mathrm{Mat}_n(\mathbb{F})$ which is skew-symmetric and satisfies the Jacobi identity. Consider arbitrary matrices $A_1, A_2 \in \mathrm{Mat}_n(\mathbb{F})$ and scalars $\lambda, \mu \in \mathbb{F}$. Then, for any $B \in \mathrm{Mat}_n(\mathbb{F})$, we see that
\[ [\lambda A_1 + \mu A_2, B] = (\lambda A_1 + \mu A_2)B - B(\lambda A_1 + \mu A_2) = \lambda(A_1B - BA_1) + \mu(A_2B - BA_2) = \lambda[A_1, B] + \mu[A_2, B]. \]
This proves linearity in the first argument, and in a similar way we obtain $[A, \lambda B_1 + \mu B_2] = \lambda[A, B_1] + \mu[A, B_2]$ for all $A, B_1, B_2 \in \mathrm{Mat}_n(\mathbb{F})$ and $\lambda, \mu \in \mathbb{F}$. Since the bracket $[\,,]$ is linear in the second argument as well, it is a bilinear mapping. Skew-symmetry is obvious: $[B, A] = BA - AB = -(AB - BA) = -[A, B]$ for any $A, B \in \mathrm{Mat}_n(\mathbb{F})$. Finally, for the Jacobi identity, let $A, B, C$ be arbitrary $n\times n$ matrices over $\mathbb{F}$.
Then, we see that
\begin{align*}
[A, [B, C]] &= A(BC - CB) - (BC - CB)A = ABC - ACB - BCA + CBA,\\
[B, [C, A]] &= B(CA - AC) - (CA - AC)B = BCA - BAC - CAB + ACB,\\
[C, [A, B]] &= C(AB - BA) - (AB - BA)C = CAB - CBA - ABC + BAC,
\end{align*}
and the Jacobi identity $[A, [B, C]] + [B, [C, A]] + [C, [A, B]] = 0$ is now direct. The Lie algebra $(\mathrm{Mat}_n(\mathbb{R}), [\,,])$, also denoted by $\mathfrak{gl}(n, \mathbb{R})$ or $\mathfrak{gl}_n(\mathbb{R})$, underpins a wide range of applications in science, engineering, and technology. Its significance stems from its ability to model and solve real-world problems that involve linear transformations and systems. Note that in a similar way one can prove that $(\mathrm{Mat}_n(\mathbb{C}), [\,,])$ is a Lie algebra over $\mathbb{C}$. □

3.F.51. The dimension of a Lie algebra $(V, [\,,]_V)$ over a field $\mathbb{F}$ is the dimension of the vector space $V$ over $\mathbb{F}$. What is the dimension of the Lie algebra $(\mathrm{Mat}_n(\mathbb{R}), [\,,])$ described above? ⃝

3.F.52. Show that the set of matrices
\[ M = \left\{ \begin{pmatrix} a & b \\ 0 & -a \end{pmatrix} : a, b \in \mathbb{R} \right\} \]
forms a Lie algebra under matrix commutation.

Solution. The given set of matrices is a linear subspace of the vector space $\mathrm{Mat}_2(\mathbb{R})$ of $2\times2$ real matrices. This is because $M$ contains the zero matrix (take $a = b = 0$), and is closed under addition and scalar multiplication:
\[ A + B = \begin{pmatrix} a_1 & b_1 \\ 0 & -a_1 \end{pmatrix} + \begin{pmatrix} a_2 & b_2 \\ 0 & -a_2 \end{pmatrix} = \begin{pmatrix} a_1 + a_2 & b_1 + b_2 \\ 0 & -(a_1 + a_2) \end{pmatrix} \in M, \qquad \lambda A = \begin{pmatrix} \lambda a & \lambda b \\ 0 & -\lambda a \end{pmatrix} \in M, \]
for all $A, B \in M$ and $\lambda \in \mathbb{R}$. Next we should prove that the commutator $[A, B]$, for $A, B \in M$, stays inside $M$; the easiest route is to prove that $M$ is a "Lie subalgebra" of the Lie algebra $(\mathrm{Mat}_2(\mathbb{R}), [\,,])$. A Lie subalgebra of a Lie algebra $(V, [\,,]_V)$ is a subspace $W \subset V$ which is closed under the corresponding Lie bracket operation, i.e., $[W, W]_V \subset W$. It is easy to see that a Lie subalgebra is a Lie algebra itself. For two elements $A, B \in M$ we see that
\[ [A, B] = AB - BA = \begin{pmatrix} a_1a_2 & a_1b_2 - a_2b_1 \\ 0 & a_1a_2 \end{pmatrix} - \begin{pmatrix} a_1a_2 & a_2b_1 - a_1b_2 \\ 0 & a_1a_2 \end{pmatrix} = \begin{pmatrix} 0 & 2(a_1b_2 - a_2b_1) \\ 0 & 0 \end{pmatrix} \in M, \]
which is again of the given form, with $a = 0$ and $b = 2(a_1b_2 - a_2b_1)$. Thus $M$ is closed under the commutator, making it a Lie subalgebra of $\mathrm{Mat}_2(\mathbb{R})$ and hence a Lie algebra in its own right. Note that $[A, B]$ is in general non-zero (take, for instance, $a_1 = b_2 = 1$ and $a_2 = b_1 = 0$), so this 2-dimensional Lie algebra is not Abelian, i.e., not all of its Lie brackets vanish. □
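The bracket just computed can be confirmed symbolically in Sage; here is a minimal sketch:

a1, b1, a2, b2 = var('a1 b1 a2 b2')
A = matrix(SR, [[a1, b1], [0, -a1]])
B = matrix(SR, [[a2, b2], [0, -a2]])
print(A*B - B*A)   # [0  2*a1*b2 - 2*a2*b1]
                   # [0                  0]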
3.F.53. Let us denote by $\mathfrak{so}(3)$ the set of skew-symmetric $3\times3$ matrices.
i) Prove that $\mathfrak{so}(3)$ is a vector subspace of $\mathrm{Mat}_3(\mathbb{R})$ and compute its dimension.
ii) Prove that $\mathfrak{so}(3)$ endowed with the matrix commutator $[\,,]$ is a Lie subalgebra of $(\mathrm{Mat}_3(\mathbb{R}), [\,,])$. Next compute the brackets of the basis elements and use Sage to verify the result.
iii) Show that $\mathbb{R}^3$ endowed with the cross product $\times$ as a Lie bracket forms another 3-dimensional Lie algebra.
iv) Establish a linear isomorphism $\varphi$ between $\mathbb{R}^3$ and $\mathfrak{so}(3)$ that preserves the Lie algebra structures, i.e., $\varphi(u \times v) = [\varphi(u), \varphi(v)]$. (Such isomorphisms are called Lie algebra isomorphisms.)

Solution. (i) An element of $\mathfrak{so}(3)$ has the form
\[ \begin{pmatrix} 0 & -x_3 & x_2 \\ x_3 & 0 & -x_1 \\ -x_2 & x_1 & 0 \end{pmatrix} \]
for some $x_1, x_2, x_3 \in \mathbb{R}$. Thus it is easy to see that $A + B \in \mathfrak{so}(3)$ and $\lambda A \in \mathfrak{so}(3)$ for any two matrices $A, B \in \mathfrak{so}(3)$ and any scalar $\lambda \in \mathbb{R}$. Hence $\mathfrak{so}(3)$ is a vector subspace of $\mathrm{Mat}_3(\mathbb{R})$ with $\dim\mathfrak{so}(3) = 3$, as a basis is given by the matrices
\[ E_{12} = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad E_{23} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad E_{31} = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{pmatrix}. \]
(ii) Let us prove that $\mathfrak{so}(3)$ is closed under the matrix commutator. Let $A, B \in \mathfrak{so}(3)$ be skew-symmetric matrices, i.e., $A^T = -A$ and $B^T = -B$. Then we see that
\[ ([A, B])^T = (AB - BA)^T = B^TA^T - A^TB^T = (-B)(-A) - (-A)(-B) = BA - AB = -[A, B], \]
which implies that $[A, B] \in \mathfrak{so}(3)$. Thus $[\mathfrak{so}(3), \mathfrak{so}(3)] \subset \mathfrak{so}(3)$, and $\mathfrak{so}(3)$ is a Lie subalgebra of $\mathrm{Mat}_3(\mathbb{R})$, hence a Lie algebra itself. Alternatively, one may compute the Lie brackets of the basis elements and prove that they belong to $\mathfrak{so}(3)$:
\[ [E_{12}, E_{23}] = E_{31} \in \mathfrak{so}(3), \qquad [E_{12}, E_{31}] = -E_{23} \in \mathfrak{so}(3), \qquad [E_{23}, E_{31}] = E_{12} \in \mathfrak{so}(3). \]
To verify the brackets in Sage you may use the following block (see also 2.E.66):

E12 = matrix([[0, -1, 0], [1, 0, 0], [0, 0, 0]])
E23 = matrix([[0, 0, 0], [0, 0, -1], [0, 1, 0]])
E31 = matrix([[0, 0, 1], [0, 0, 0], [-1, 0, 0]])
def Lie_bracket(A, B):
    return A*B - B*A
show("\nThe commutator [E12, E23] equals:", Lie_bracket(E12, E23))
show("\nThe commutator [E12, E31] equals:", Lie_bracket(E12, E31))
show("\nThe commutator [E23, E31] equals:", Lie_bracket(E23, E31))
print(Lie_bracket(E12, E23) == E31)
print(Lie_bracket(E12, E31) == -E23)
print(Lie_bracket(E23, E31) == E12)

(iii) Recall that the cross product $v \times w$ of two vectors $v = (v_1, v_2, v_3)^T$ and $w = (w_1, w_2, w_3)^T$ in $\mathbb{R}^3$ is the vector
\[ v \times w = \det\begin{pmatrix} \vec{i} & \vec{j} & \vec{k} \\ v_1 & v_2 & v_3 \\ w_1 & w_2 & w_3 \end{pmatrix} = \begin{pmatrix} v_2w_3 - w_2v_3 \\ v_3w_1 - w_3v_1 \\ v_1w_2 - w_1v_2 \end{pmatrix} \in \mathbb{R}^3. \]
Hence $\mathbb{R}^3$ is closed with respect to the operation $[v, w]_{\mathbb{R}^3} := v \times w$. This operation is bilinear by definition and skew-symmetric, $v \times w = -w \times v$ for all $v, w \in \mathbb{R}^3$. For the Jacobi identity, recall that $u \times (v \times w) = (u \cdot w)v - (u \cdot v)w$. Thus
\begin{align*}
&[u, [v, w]_{\mathbb{R}^3}]_{\mathbb{R}^3} + [v, [w, u]_{\mathbb{R}^3}]_{\mathbb{R}^3} + [w, [u, v]_{\mathbb{R}^3}]_{\mathbb{R}^3} = u \times (v \times w) + v \times (w \times u) + w \times (u \times v)\\
&= (u \cdot w)v - (u \cdot v)w + (v \cdot u)w - (v \cdot w)u + (w \cdot v)u - (w \cdot u)v\\
&= \big((u \cdot w) - (w \cdot u)\big)v - \big((u \cdot v) - (v \cdot u)\big)w + \big((w \cdot v) - (v \cdot w)\big)u = 0.
\end{align*}
Thus $(\mathbb{R}^3, [\,,]_{\mathbb{R}^3})$ is also a Lie algebra.
(iv) The map $\varphi : \mathbb{R}^3 \to \mathfrak{so}(3)$ given by
\[ u = (x_1, x_2, x_3)^T \mapsto A_u := \begin{pmatrix} 0 & -x_3 & x_2 \\ x_3 & 0 & -x_1 \\ -x_2 & x_1 & 0 \end{pmatrix} \]
is clearly a vector space isomorphism. To show that it is a Lie algebra isomorphism it remains to prove that $\varphi([u, v]_{\mathbb{R}^3}) = [\varphi(u), \varphi(v)]$, or equivalently $A_{u\times v} = [A_u, A_v]$, which we leave as an exercise. □

3.F.54. Consider Hermitian matrices $A, B, C$ satisfying $[A, C] = [B, C] = 0$ and $[A, B] \ne 0$. Prove that at least one eigenspace of the matrix $C$ has dimension $> 1$.

Solution. We prove it by contradiction. Assume that all eigenspaces of the matrix $C$ are one-dimensional. Since $C$ is Hermitian, its linearly independent eigenvectors $u_k$, associated with the eigenvalues $\lambda_k$, form an orthonormal basis, so any vector $u$ can be written as $u = \sum_k\langle u, u_k\rangle u_k = \sum_k c_ku_k$. Now a computation shows that
\[ 0 = [A, C]u_k = ACu_k - CAu_k = \lambda_kAu_k - C(Au_k). \]
Therefore $Au_k$ is also an eigenvector of the matrix $C$ corresponding to the eigenvalue $\lambda_k$, and since the eigenspace is one-dimensional we get $Au_k = \lambda_k^Au_k$ for some number $\lambda_k^A$. Similarly, $Bu_k = \lambda_k^Bu_k$ for some number $\lambda_k^B$. Then, for the commutator of the matrices $A$ and $B$, one computes
\[ [A, B]u_k = ABu_k - BAu_k = \lambda_k^A\lambda_k^Bu_k - \lambda_k^B\lambda_k^Au_k = 0, \]
and hence
\[ [A, B]u = [A, B]\sum_k c_ku_k = \sum_k c_k[A, B]u_k = 0. \]
This final relation implies $[A, B] = 0$, a contradiction. □
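Returning to the isomorphism of 3.F.53(iv), which was left as an exercise, the identity $A_{u\times v} = [A_u, A_v]$ can at least be spot-checked in Sage. Below is a sketch with two sample vectors chosen arbitrarily for illustration:

def hat(u):   # the map u |-> A_u from 3.F.53(iv)
    x1, x2, x3 = u
    return matrix([[0, -x3, x2], [x3, 0, -x1], [-x2, x1, 0]])
u = vector([1, 2, 3]); v = vector([-1, 0, 2])
lhs = hat(u.cross_product(v))
rhs = hat(u)*hat(v) - hat(v)*hat(u)
print(lhs == rhs)   # True: phi(u x v) = [phi(u), phi(v)]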
3.F.55. Given the matrix
\[ A = \begin{pmatrix} 0 & -2 & 2 \\ -2 & 0 & 2 \\ 2 & 2 & 0 \end{pmatrix}, \]
compute the trace of the matrix $\exp(A) := \sum_{k=0}^{\infty}\frac{1}{k!}A^k = E + A + \frac{A^2}{2} + \cdots$. Hint: Use the eigenvalues of $A$.

Solution. Recall that the trace of a matrix $A \in \mathrm{Mat}_n(\mathbb{F})$, for $\mathbb{F} = \mathbb{R}$ or $\mathbb{F} = \mathbb{C}$, equals the sum of the eigenvalues of $A$. The given matrix is symmetric, and hence its eigenvalues are real. An application of the following block in Sage gives $\lambda_1 = -4$ with multiplicity $1$ and $\lambda_2 = 2$ with multiplicity $2$:

A = matrix(QQ, 3, 3, [[0, -2, 2], [-2, 0, 2], [2, 2, 0]])
A.eigenvalues()

Thus, the eigenvalues of $\exp(A)$ are $e^{-4}$ and $e^2$, the second again with multiplicity $2$. It follows that $\mathrm{tr}(\exp(A)) = e^{-4} + 2e^2$. □

3.F.56. Show that if $H$ is a Hermitian matrix, then $U = \exp(iH) = \sum_{n=0}^{\infty}\frac{1}{n!}(iH)^n$ is a unitary matrix. Next compute its determinant.

Solution. From the definition of $\exp$ we can show that $\exp(A + B) = \exp(A)\exp(B)$ whenever the matrices $A$ and $B$ commute, just as with the exponential mapping in the domain of real numbers; this applies below, since $iH$ and $-iH$ commute. Because $(u + v)^* = u^* + v^*$ and $(cv)^* = \bar{c}v^*$, we obtain
\[ U^* = \Big(\sum_{n=0}^{\infty}\frac{1}{n!}(iH)^n\Big)^* = \sum_{n=0}^{\infty}\frac{1}{n!}(-iH^*)^n. \]
Since $H^* = H$, we finally see that
\[ U^* = \sum_{n=0}^{\infty}(-1)^n\frac{1}{n!}(iH)^n = \exp(-iH). \]
Thus $U^*U = \exp(-iH)\exp(iH) = \exp(0) = E$, and $\det(U) = e^{\mathrm{tr}(iH)} = e^{i\,\mathrm{tr}(H)}$, a complex number of modulus $1$, since the trace of a Hermitian matrix is real. □

We will now continue with additional tasks related to generalized eigenspaces and the Jordan canonical form.

3.F.57. Determine the algebraic and geometric multiplicities of the eigenvalues of the matrices below, and use Sage to validate your results:
\[ A = \begin{pmatrix} 1 & 1 & 2 \\ 0 & 1 & 2 \\ 0 & 0 & 3 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 1 & 2 \\ 0 & 0 & 3 \end{pmatrix}, \qquad C = \begin{pmatrix} 4 & 0 & 1 \\ 2 & 3 & 2 \\ 1 & 0 & 4 \end{pmatrix}. \]

Solution. The matrices $A$ and $B$ are upper triangular with the same diagonal entries. Thus $A$ and $B$ have the same eigenvalues,
\[ |A - \lambda E| = |B - \lambda E| = \lambda^3 - 5\lambda^2 + 7\lambda - 3 = (\lambda - 1)^2(\lambda - 3) \]
(up to sign), and we get $\lambda_1 = 1$ with algebraic multiplicity $\alpha_A(\lambda_1) = 2 = \alpha_B(\lambda_1)$, and $\lambda_2 = 3$ with $\alpha_A(\lambda_2) = 1 = \alpha_B(\lambda_2)$. Using Sage we can obtain the same conclusion as follows:

A = matrix(QQbar, [[1, 1, 2], [0, 1, 2], [0, 0, 3]])
B = matrix(QQbar, [[1, 0, 2], [0, 1, 2], [0, 0, 3]])
A.characteristic_polynomial() == B.characteristic_polynomial()

Recall also that in Sage the command A.fcp('t') factors the characteristic polynomial of a matrix. For the matrix $C$, the following cell

C = matrix(QQbar, [[4, 0, 1], [2, 3, 2], [1, 0, 4]])
C.fcp('x')
C.eigenvalues()

gives the characteristic polynomial of $C$ as well as its eigenvalues: $\lambda_1 = 5$ with $\alpha_C(\lambda_1) = 1$ and $\lambda_2 = 3$ with $\alpha_C(\lambda_2) = 2$. Let us now compute the geometric multiplicities, using
\[ \gamma_A(\lambda) = \dim\ker(A - \lambda E) = n - \mathrm{rank}(A - \lambda E), \]
where $A$ is an $n\times n$ matrix and $\lambda$ is an eigenvalue of $A$. For the given matrix $A$ we get
\[ A - E = \begin{pmatrix} 0 & 1 & 2 \\ 0 & 0 & 2 \\ 0 & 0 & 2 \end{pmatrix}, \]
which has exactly two linearly independent columns. Thus $\mathrm{rank}(A - E) = 2$ and $\gamma_A(1) = 3 - 2 = 1$. Here is a confirmation using Sage:

A = matrix(QQbar, [[1, 1, 2], [0, 1, 2], [0, 0, 3]])
E = identity_matrix(QQbar, 3)
(A - E).rank()

Or we can type

A.eigenspaces_right()

which returns the eigenvalues of $A$ and their eigenspaces:

(3, Vector space of degree 3 and dimension 1 over Algebraic Field
User basis matrix:
[ 1.000000000000000? 0.6666666666666667? 0.6666666666666667?]),
(1, Vector space of degree 3 and dimension 1 over Algebraic Field
User basis matrix:
[1 0 0])

In a similar way you can verify that $\gamma_B(1) = 2$, $\gamma_A(3) = 1 = \gamma_B(3)$, $\gamma_C(5) = 1$ and $\gamma_C(3) = 2$. □
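Both 3.F.55 and 3.F.56 can also be cross-checked numerically. The sketch below works over the double-precision field CDF, where Sage provides a numerical matrix exponential exp(); the sample Hermitian matrix H is an assumption made purely for illustration:

A = matrix(CDF, [[0, -2, 2], [-2, 0, 2], [2, 2, 0]])
print(A.exp().trace())                    # approx. e^(-4) + 2*e^2 = 14.7964...
H = matrix(CDF, [[1, 1j], [-1j, 0]])      # a sample Hermitian matrix
U = (1j*H).exp()
print((U*U.conjugate_transpose() - identity_matrix(CDF, 2)).norm() < 1e-12)  # U is unitary
print(abs(abs(U.det()) - 1) < 1e-12)      # |det(U)| = 1, since tr(H) is real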
3.F.58. Find the Jordan canonical form of the matrices
\[ A = \begin{pmatrix} -1 & 1 \\ -6 & 4 \end{pmatrix}, \qquad B = \begin{pmatrix} -1 & 1 \\ -4 & 3 \end{pmatrix}. \]
Additionally, provide the geometric interpretation of the Jordan canonical form decomposition corresponding to these matrices.

Solution. The eigenvalues of $A$ are real, given by $\lambda_1 = 1$ and $\lambda_2 = 2$. Since $A$ is of size $2\times2$ and has two distinct eigenvalues, the Jordan form of $A$ is diagonal, i.e.,
\[ J = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}. \]
It is easy to see that an eigenvector $u_1 = (x, y)^T$ associated with the eigenvalue $\lambda_1 = 1$ satisfies the matrix equation
\[ 0 = (A - E)u_1 = \begin{pmatrix} -2 & 1 \\ -6 & 3 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}. \]
This gives the equation $-2x + y = 0$, thus the eigenvectors are multiples of the vector $u_1 = (1, 2)^T$. Similarly, the eigenvectors associated with the eigenvalue $\lambda_2$ are multiples of $u_2 = (1, 3)^T$. The matrix $P$ is then obtained by writing these eigenvectors into the columns,
\[ P = \begin{pmatrix} 1 & 1 \\ 2 & 3 \end{pmatrix}, \qquad P^{-1} = \begin{pmatrix} 3 & -1 \\ -2 & 1 \end{pmatrix}, \]
and for the matrix $A$ we can confirm the relation $A = PJP^{-1}$:
\[ \begin{pmatrix} -1 & 1 \\ -6 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}\begin{pmatrix} 3 & -1 \\ -2 & 1 \end{pmatrix}. \]
This decomposition says that the matrix $A$ determines a linear mapping which has the above diagonal form with respect to the basis $\{u_1, u_2\}$. Geometrically, this means that in the direction of $u_1$ nothing changes, while in the direction of $u_2$ every vector is stretched by a factor of two.

Let us now focus on $B$. This matrix has a unique eigenvalue, $\lambda = 1$, of multiplicity $2$. The corresponding eigenvector $v_1 = (x, y)^T$ satisfies the matrix equation
\[ 0 = (B - E)v_1 = \begin{pmatrix} -2 & 1 \\ -4 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}. \]
We find that the solutions are multiples of the vector $v_1 = (1, 2)^T$. The fact that the system does not have two linearly independent solutions implies the following expression for the Jordan canonical form:
\[ J = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}. \]
The basis for which $B$ has this form consists of the eigenvector $v_1$ and a vector $v_2$ that is mapped to $v_1$ by the linear transformation $B - E$. This amounts to solving the system with the augmented matrix
\[ \left(\begin{array}{cc|c} -2 & 1 & 1 \\ -4 & 2 & 2 \end{array}\right) \sim \left(\begin{array}{cc|c} -2 & 1 & 1 \\ 0 & 0 & 0 \end{array}\right). \]
You can easily check that the solutions are multiples of the vector $v_2 = (1, 3)^T$. Moreover, one obtains the same basis as in the previous case, and we can verify the relation $B = PJP^{-1}$:
\[ \begin{pmatrix} -1 & 1 \\ -4 & 3 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 3 & -1 \\ -2 & 1 \end{pmatrix}. \]
In this case, the mapping acts on vectors as follows: the component in the direction of $v_2$ stays the same, while the new component in the direction of $v_1$ equals the sum of the two original components, in the directions of $v_1$ and $v_2$. □

3.F.59. Consider the matrix
\[ A = \begin{pmatrix} -1 & -1 & -1 & 0 \\ 3 & 2 & 3 & -1 \\ 2 & 1 & 3 & -1 \\ 2 & 1 & 4 & -2 \end{pmatrix}. \]
Solve the following tasks:
i) Show that the eigenvalues of $A$ are given by $\pm1$. Compute their algebraic and geometric multiplicities and find the corresponding eigenvectors of $A$.
ii) Use Sage to verify that the dimensions of the null spaces $\ker((A - E)^j)$ do not increase for $j > 3$. Next verify that $\dim R_{\lambda_1} = 3$ by computing the rank of the matrix $(A - E)^3$ in Sage and applying the rank–nullity theorem.
iii) Derive the generalized eigenvectors of $A$.
iv) Describe the Jordan canonical form $J$ of $A$ and find a matrix $P$ satisfying $P^{-1}AP = J$. Next verify your answer in Sage.
v) Prove the direct sum decomposition $\mathbb{R}^4 = R_{\lambda_1} \oplus R_{\lambda_2}$, where $\lambda_1 = 1$ and $\lambda_2 = -1$ are the eigenvalues of $A$.

Solution. (i) The characteristic polynomial of $A$ is given by
\[ \chi_A(\lambda) = \det\begin{pmatrix} -1 - \lambda & -1 & -1 & 0 \\ 3 & 2 - \lambda & 3 & -1 \\ 2 & 1 & 3 - \lambda & -1 \\ 2 & 1 & 4 & -2 - \lambda \end{pmatrix} = (\lambda - 1)^3(\lambda + 1). \]
To confirm this expression we may use Sage via the block

A = matrix([[-1, -1, -1, 0], [3, 2, 3, -1], [2, 1, 3, -1], [2, 1, 4, -2]])
p = A.characteristic_polynomial(); show(factor(p))

Thus the eigenvalues of $A$ are $\lambda_1 = 1$ with multiplicity $3$ and $\lambda_2 = -1$ with multiplicity $1$.
• Let us first focus on $\lambda_1$ and describe the eigenspace $V_{\lambda_1} = V_1 = \ker(A - E)$. Performing elementary row operations, we bring the matrix $A - E$ to its reduced row echelon form:
\[ A - E = \begin{pmatrix} -2 & -1 & -1 & 0 \\ 3 & 1 & 3 & -1 \\ 2 & 1 & 2 & -1 \\ 2 & 1 & 4 & -3 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \]
You can verify this expression in Sage by adding the following cell to the initial block:

E = identity_matrix(4)
show((A - E).rref())

Thus, the solutions of the matrix equation $(A - E)u_1 = 0$ for a vector $u_1 = (x_1, x_2, x_3, x_4)^T$ correspond to the solutions of the linear system $\{x_1 + x_4 = 0,\ x_2 - x_3 = 0,\ x_3 - x_4 = 0\}$, where $x_4$ is a free parameter. This gives
\[ V_1 = \ker(A - E) = \{t(-1, 1, 1, 1)^T : t \in \mathbb{R}\}, \]
and hence the geometric multiplicity of $\lambda_1 = 1$ equals $1$, i.e., $\gamma(\lambda_1) = 1 = \dim V_1$. Moreover, the vector $u_1 = (-1, 1, 1, 1)^T$ is an eigenvector of $A$ corresponding to $\lambda_1 = 1$.
• The second eigenvalue $\lambda_2 = -1$ has algebraic multiplicity $1$, and thus its geometric multiplicity must also be $1$, i.e., $\gamma(\lambda_2) = \dim V_{-1} = \dim\ker(A + E) = 1$. This also implies that $\dim R_{\lambda_2} = 1$, in particular $R_{\lambda_2} \cong V_{-1}$ (the dimensions of the null spaces $\ker((A + E)^\ell)$ do not increase for $\ell > 1$). Elementary row operations yield the reduced row echelon form of the matrix $A + E$:
\[ A + E = \begin{pmatrix} 0 & -1 & -1 & 0 \\ 3 & 3 & 3 & -1 \\ 2 & 1 & 4 & -1 \\ 2 & 1 & 4 & -1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 0 & -1/3 \\ 0 & 1 & 0 & 1/9 \\ 0 & 0 & 1 & -1/9 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \]
Once more we can verify this expression in Sage quickly, by adding the cell

show((A + E).rref())

This results in the system $\{x_1 - \frac{1}{3}x_4 = 0,\ x_2 + \frac{1}{9}x_4 = 0,\ x_3 - \frac{1}{9}x_4 = 0\}$, where $x_4$ is a free parameter. Thus
\[ V_{\lambda_2} = V_{-1} = \ker(A + E) = \{t(\tfrac{1}{3}, -\tfrac{1}{9}, \tfrac{1}{9}, 1)^T : t \in \mathbb{R}\}. \]
Therefore the corresponding eigenvector is given (up to a scalar) by $v_1 = (\frac{1}{3}, -\frac{1}{9}, \frac{1}{9}, 1)^T$.
(ii) We will use Sage to study the matrices $(A - E)^2$, $(A - E)^3$, and so on.
We proceed with the following block:

A = matrix([[-1, -1, -1, 0], [3, 2, 3, -1], [2, 1, 3, -1], [2, 1, 4, -2]])
E = identity_matrix(4)
bb = (A - E)*(A - E)
show("\nThe matrix (A-E)^2 is given by:", bb)
cc = bb*(A - E)
show("\nThe matrix (A-E)^3 is given by:", cc)
dd = cc*(A - E)
show("\nThe matrix (A-E)^4 is given by:", dd)
ee = dd*(A - E)
show("\nThe matrix (A-E)^5 is given by:", ee)

Executing this block, we obtain
\[ (A - E)^2 = \begin{pmatrix} -1 & 0 & -3 & 2 \\ 1 & 0 & 2 & -1 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & -3 & 4 \end{pmatrix}, \qquad (A - E)^3 = \begin{pmatrix} 0 & 0 & 3 & -3 \\ 0 & 0 & -1 & 1 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 9 & -9 \end{pmatrix}, \]
\[ (A - E)^4 = \begin{pmatrix} 0 & 0 & -6 & 6 \\ 0 & 0 & 2 & -2 \\ 0 & 0 & -2 & 2 \\ 0 & 0 & -18 & 18 \end{pmatrix}, \qquad (A - E)^5 = \begin{pmatrix} 0 & 0 & 12 & -12 \\ 0 & 0 & -4 & 4 \\ 0 & 0 & 4 & -4 \\ 0 & 0 & 36 & -36 \end{pmatrix}. \]
This shows that the matrices $(A - E)^4$ and $(A - E)^5$ are (even) multiples of the matrix $(A - E)^3$. Thus, as we expected, the dimensions of the null spaces $\ker((A - E)^j)$ do not increase for $j > 3 = \alpha(\lambda_1)$. To compute the rank of the matrix $(A - E)^3$ we can continue typing in the previous block the following syntax:

show("The rank of (A-E)^3 is given by:", cc.rank())

Sage's output has the form: The rank of (A-E)^3 is given by: 1. Thus $\mathrm{rank}((A - E)^3) = 1$ and $\dim\ker((A - E)^3) = 4 - 1 = 3 = \dim R_{\lambda_1}$, as it should be, since $\alpha(\lambda_1) = 3$.
(iii) Let us now determine generalized eigenvectors $u_2, u_3 \in R_{\lambda_1}$ for the first eigenvalue, whose algebraic multiplicity exceeds its geometric multiplicity. To do this, start with the matrix equation $(A - E)u_2 = u_1$, i.e.,
\[ \begin{pmatrix} -2 & -1 & -1 & 0 \\ 3 & 1 & 3 & -1 \\ 2 & 1 & 2 & -1 \\ 2 & 1 & 4 & -3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} -1 \\ 1 \\ 1 \\ 1 \end{pmatrix}, \]
where $u_2 = (x_1, x_2, x_3, x_4)^T$ is the unknown. We will use Sage to obtain the reduced row echelon form of the augmented matrix $(A - E \mid u_1)$. It is sufficient to continue typing in our previous block and add the following cell:

u1 = vector([-1, 1, 1, 1])
M = (A - E).augment(u1, subdivide=True)
show(M.rref())

We see that
\[ \left(\begin{array}{cccc|c} -2 & -1 & -1 & 0 & -1 \\ 3 & 1 & 3 & -1 & 1 \\ 2 & 1 & 2 & -1 & 1 \\ 2 & 1 & 4 & -3 & 1 \end{array}\right) \sim \left(\begin{array}{cccc|c} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & -1 & 1 \\ 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{array}\right). \]
Now you may conclude that the solution space of the induced linear system has the form $\{(x_1, x_2, x_3, x_4)^T = (-s, 1 + s, s, s)^T\}$, where $s$ is a free parameter. Therefore, without loss of generality, we may take $u_2 = (0, 1, 0, 0)^T$. Next, using the matrix expression for $(A - E)^3$ provided above, we can easily verify that $(A - E)^3u_2 = 0$. Similarly, by solving the matrix equation $(A - E)u_3 = u_2$ we find that $u_3 = (1, -2, 0, 0)^T$ (up to a scalar); the verification of this result is left as an exercise for the reader. To summarize, the eigenvalue $\lambda_1 = 1$ has the eigenvector $u_1 = (-1, 1, 1, 1)^T$ and the generalized eigenvectors $u_2 = (0, 1, 0, 0)^T$ and $u_3 = (1, -2, 0, 0)^T$.
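Before assembling the Jordan form in (iv), it is worth checking the chain relations in Sage; a minimal sketch:

A = matrix([[-1, -1, -1, 0], [3, 2, 3, -1], [2, 1, 3, -1], [2, 1, 4, -2]])
E = identity_matrix(4)
u1 = vector([-1, 1, 1, 1]); u2 = vector([0, 1, 0, 0]); u3 = vector([1, -2, 0, 0])
print((A - E)*u1 == zero_vector(4), (A - E)*u2 == u1, (A - E)*u3 == u2)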
(iv) It is easy to see that the generalized eigenvector $u_3$ satisfies the additional relation $(A - E)^2u_3 = u_1$. Hence the set $\{(A - E)^2u_3, (A - E)u_3, u_3\} = \{u_1, u_2, u_3\}$ is a Jordan chain of length $3$ corresponding to $R_{\lambda_1}$. Consequently, the Jordan form should contain a Jordan block of size $3$ corresponding to $\lambda_1$, and a Jordan block of size $1$ corresponding to $\lambda_2$, namely $J_{-1} = (-1)$. This yields the following expression:
\[ J = J_1 \oplus J_{-1} = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix} \oplus (-1) = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}. \]
Recall that in Sage a verification of the Jordan canonical form (up to the order of the Jordan blocks) is obtained by adding the cell

J = A.jordan_form(transformation=True); show(J)

Now, a similarity matrix $P$ satisfying the condition $P^{-1}AP = J$ has as its columns the vectors $\{u_1, u_2, u_3, v_1\}$ derived in (i) and (iii), i.e.,
\[ P = \begin{pmatrix} -1 & 0 & 1 & 1/3 \\ 1 & 1 & -2 & -1/9 \\ 1 & 0 & 0 & 1/9 \\ 1 & 0 & 0 & 1 \end{pmatrix}. \]
In particular, $P$ is invertible, since these vectors are linearly independent. While the computation of $P^{-1}$ is left as an exercise, we use Sage to verify that $P^{-1}AP = J$:

J = matrix([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, -1]])
P = matrix([[-1, 0, 1, 1/3], [1, 1, -2, -1/9], [1, 0, 0, 1/9], [1, 0, 0, 1]])
P.inverse()*A*P == J

Sage's output is True. Finally, recall from Chapter 2 that a quick way to verify in Sage the linear independence of the vectors $u_1, u_2, u_3$ and $v_1$ is to compute the determinant of $P$, via the command P.det(). Alternatively, you might prefer to use the cell provided below:

u1 = vector([-1, 1, 1, 1])
u2 = vector([0, 1, 0, 0])
u3 = vector([1, -2, 0, 0])
v1 = vector([1/3, -1/9, 1/9, 1])
V = RR^4
V.linear_dependence([u1, u2, u3, v1]) == []

(v) We have seen that $\dim R_1 = 3$ and $\dim R_{-1} = \dim V_{-1} = 1$. Since the total dimension of $\mathbb{R}^4$ equals $4$, the generalized eigenspaces together have just enough "space" to cover all of $\mathbb{R}^4$. In fact, we saw above that joining the bases $\{u_1, u_2, u_3\}$ of $R_1$ and $\{v_1\}$ of $R_{-1} \cong V_{-1}$ we obtain a basis of $\mathbb{R}^4$, and this already provides a proof. Alternatively, the relations $\mathbb{R}^4 = R_1 + R_{-1}$ and $R_1 \cap R_{-1} = \{0\}$ imply the direct sum decomposition
\[ \mathbb{R}^4 = R_{\lambda_1} \oplus R_{\lambda_2} = R_1 \oplus R_{-1} = \ker((A - E)^3) \oplus \ker(A + E). \] □

E) Material on linear algebra – matrix decompositions

3.F.60. Find the LU-decomposition of the matrix
\[ A = \begin{pmatrix} 2 & 8 & 0 \\ 2 & 2 & -3 \\ 1 & 2 & 6 \end{pmatrix}. \]

Solution. An answer goes as follows:
\[ A = \begin{pmatrix} 2 & 8 & 0 \\ 2 & 2 & -3 \\ 1 & 2 & 6 \end{pmatrix} = LU = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1/2 & 1/3 & 1 \end{pmatrix}\begin{pmatrix} 2 & 8 & 0 \\ 0 & -6 & -3 \\ 0 & 0 & 7 \end{pmatrix}. \] □

3.F.61. Provide the QR-decomposition of the following matrices:
\[ A = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}. \]

Solution. The QR-factorizations in question are given as follows:
\[ A = QR = \begin{pmatrix} 1/2 & -\sqrt{12}/4 & 0 \\ 1/2 & \sqrt{12}/12 & -\sqrt{6}/3 \\ 1/2 & \sqrt{12}/12 & \sqrt{6}/6 \\ 1/2 & \sqrt{12}/12 & \sqrt{6}/6 \end{pmatrix}\begin{pmatrix} 2 & 3/2 & 1 \\ 0 & \sqrt{12}/4 & \sqrt{12}/6 \\ 0 & 0 & \sqrt{6}/3 \end{pmatrix}, \]
\[ B = QR = \begin{pmatrix} \sqrt{2}/2 & \sqrt{6}/6 & -\sqrt{3}/3 \\ \sqrt{2}/2 & -\sqrt{6}/6 & \sqrt{3}/3 \\ 0 & \sqrt{6}/3 & \sqrt{3}/3 \end{pmatrix}\begin{pmatrix} \sqrt{2} & \sqrt{2}/2 & \sqrt{2}/2 \\ 0 & \sqrt{6}/2 & \sqrt{6}/6 \\ 0 & 0 & 2\sqrt{3}/3 \end{pmatrix}. \] □
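Such factorizations are easy to validate in Sage by direct multiplication; for instance, for the LU-decomposition of 3.F.60:

A = matrix(QQ, [[2, 8, 0], [2, 2, -3], [1, 2, 6]])
L = matrix(QQ, [[1, 0, 0], [1, 1, 0], [1/2, 1/3, 1]])
U = matrix(QQ, [[2, 8, 0], [0, -6, -3], [0, 0, 7]])
print(L*U == A)   # True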
From the diagram we can derive directly that
\[ v_R = v - 2\langle v, n\rangle n. \]
In our case, $v_R = (-3, -2, -1)$. □

Solutions to the problems

3.A.3. We see that $h = c^T x$, where $c = (-1, 2)^T$. By multiplying the first inequality by $-1$ we obtain the following equivalent system of constraints:
\[ x_2 - x_1 \le 1, \quad -0.5x_1 + x_2 \le 2, \quad x_1 \ge 0, \ x_2 \ge 0. \]
An illustration of this LP problem goes as follows: The feasible region (the grey region in this diagram) is unbounded. However, the drawn thick semi-line starting from the point $(2, 3)$ gives us infinitely many optimal solutions.

3.A.5. The daily diet should contain 3.9 kg of hay and 4.3 kg of oats. Then the costs per foal are €13.82.

3.A.9. (a) Multiply both sides of the second inequality by $-1$ to get $-x_1 - x_2 \le 1$. Also, replace the relation $x_1 - 2x_2 = 1$ by the inequalities $x_1 - 2x_2 \le 1$ and $x_1 - 2x_2 \ge 1$. A multiplication of the latter by $-1$ yields the standard form of the initial problem: maximize $h = x_1 + 2.5x_2$ with respect to the conditions
\[ 2x_1 + 3x_2 \le 20, \quad -x_1 - x_2 \le 1, \quad x_1 - 2x_2 \le 1, \quad -x_1 + 2x_2 \le -1, \quad x_1 \ge 0, \ x_2 \ge 0. \]
Hence the dual problem is the minimization of $20y_1 + y_2 + y_3 - y_4$ subject to the conditions
\[ 2y_1 - y_2 + y_3 - y_4 \ge 1, \quad 3y_1 - y_2 - 2y_3 + 2y_4 \ge 2.5, \quad y_i \ge 0 \ \text{for all } i = 1, \dots, 4. \]
However, in the primal problem we had a constraint given by an equality, and so in the dual problem this should correspond to an unrestricted variable. Indeed, by setting $y_5 = y_3 - y_4$ we arrive at the minimization of $20y_1 + y_2 + y_5$ subject to
\[ 2y_1 - y_2 + y_5 \ge 1, \quad 3y_1 - y_2 - 2y_5 \ge 2.5, \quad y_1 \ge 0, \ y_2 \ge 0, \]
with $y_5$ unrestricted in sign.

(b) The standard form of the primal problem is the minimization of $h = 2x_1 + 3x_2 + 2x_4$ subject to
\[ -x_1 - 2x_2 - 2x_3 \ge 6, \quad -x_1 - 4x_2 + 2x_4 \ge -5, \quad x_1 + 4x_2 - 2x_4 \ge 5, \quad x_2 - x_3 + 4x_4 \ge 2, \quad x_i \ge 0 \ \text{for all } i = 1, \dots, 4. \]
The dual problem reads as follows: maximize $6y_1 - 5y_2 + 5y_3 + 2y_4$ subject to
\[ -y_1 - y_2 + y_3 \le 2, \quad -2y_1 - 4y_2 + 4y_3 + y_4 \le 3, \quad -2y_1 - y_4 \le 0, \quad 2y_2 - 2y_3 + 4y_4 \le 2, \quad y_i \ge 0 \ \text{for all } i = 1, \dots, 4. \]
By setting $y_5 = y_2 - y_3$ we finally arrive at the maximization of $6y_1 + 2y_4 - 5y_5$ with respect to
\[ -y_1 - y_5 \le 2, \quad -2y_1 + y_4 - 4y_5 \le 3, \quad -2y_1 - y_4 \le 0, \quad 4y_4 + 2y_5 \le 2, \quad y_1 \ge 0, \ y_4 \ge 0, \]
with $y_5$ unrestricted in sign.

3.B.4. Let us proceed with Sage for now, although we recommend that readers follow the discussion in Section 3.2.2 and perform the formal computations on their own. For the given recurrence relation, the corresponding code is as follows:

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n+2) - 3*a(n+1) - 3*a(n)
initial = {a(1): 1, a(2): 3}
rsolve(f, a(n), initial)

As an answer, Sage returns the following:

-sqrt(21)*(3/2 - sqrt(21)/2)**n/21 + sqrt(21)*(3/2 + sqrt(21)/2)**n/21

In other words, the solution is given by
\[ x_n = \frac{1}{\sqrt{21}}\left(\frac{3+\sqrt{21}}{2}\right)^n - \frac{1}{\sqrt{21}}\left(\frac{3-\sqrt{21}}{2}\right)^n. \]

3.B.7. A verification in Sage can be obtained as follows:

from sympy import Function, rsolve
from sympy.abc import n
a = Function('a')
f = a(n+2) - 2*a(n+1) - 3*a(n)
initial = {a(0): 0, a(1): 1}
rsolve(f, a(n), initial)

Sage's output has the form -(-1)**n/4 + 3**n/4.

3.B.8. To treat this task we begin with the characteristic equation, given by $r^4 - r^3 - r + 1 = 0$, or equivalently by $(r-1)^2(r^2 + r + 1) = 0$. This has two complex roots given by
\[ r_1 = -\tfrac{1}{2} + i\tfrac{\sqrt{3}}{2} = \cos(2\pi/3) + i\sin(2\pi/3), \qquad r_2 = -\tfrac{1}{2} - i\tfrac{\sqrt{3}}{2} = \cos(2\pi/3) - i\sin(2\pi/3), \]
and the real root 1, which is double.
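As a quick cross-check of this factorization and of the roots just listed, one may let Sage factor the characteristic polynomial (a small sketch; the ring name R is our own choice):

R.<r> = QQ[]
p = r^4 - r^3 - r + 1
print(p.factor())              # (r - 1)^2 * (r^2 + r + 1)
print((r^2 + r + 1).roots(CC)) # the two complex roots above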
Thus we can find a basis of the solution space; this consists of the sequences
\[ \{a_n\} := \Big\{\Big(-\tfrac12 + i\tfrac{\sqrt3}{2}\Big)^n\Big\}_{n=1}^{\infty}, \qquad \{b_n\} := \Big\{\Big(-\tfrac12 - i\tfrac{\sqrt3}{2}\Big)^n\Big\}_{n=1}^{\infty}, \]
together with $\{n\}_{n=1}^{\infty}$ and the constant sequence $\{1\}_{n=1}^{\infty}$. Our focus now shifts to establishing a real basis for the solution space, by replacing the two complex generators from the previous basis with sequences that are entirely real. These generators are "power series" whose elements are complex conjugates. We will discuss power series in Chapter 5. Now, it suffices to take as generators appropriate linear combinations of $a_n$, $b_n$, as in Problem 3.B.5. This yields the following real basis:
\[ \{1\}_{n=1}^{\infty}, \quad \{n\}_{n=1}^{\infty}, \quad \{\cos(2n\pi/3)\}_{n=1}^{\infty}, \quad \{\sin(2n\pi/3)\}_{n=1}^{\infty}. \]

3.B.9. The answer is given by the sequence $(x_n)$ with general term
\[ x_n = -3(-1)^n - 2\cos(2n\pi/3) - 2\sqrt{3}\,\sin(2n\pi/3). \]

3.B.11. The general solution of the homogeneous equation is of the form $a(-1)^n + b\,2^n$. A particular solution is the constant $-1/2$. Therefore, the general solution of the given non-homogeneous equation without initial conditions has the form $a(-1)^n + b\,2^n - \tfrac12$. Substituting in the initial conditions we obtain the constants $a = -5/6$, $b = 5/6$. Thus, the desired solution is given by the sequence
\[ x_n = -\frac{5}{6}(-1)^n + \frac{5}{6}\,2^n - \frac{1}{2}. \]
To confirm this result by Sage use the block

from sympy import Function, rsolve
from sympy.abc import n
a = Function("a")
f = a(n+2) - a(n+1) - 2*a(n) - 1
initial = {a(1): 2, a(2): 2}
rsolve(f, a(n), initial)

3.C.5. In this case we have $b_1 = 0$, $b_2 = 2 = b_3$, $b_4 = 1$, and thus the Leslie matrix has the form
\[ A = \begin{pmatrix} 0 & 2 & 2 & 1\\ 0.4 & 0 & 0 & 0\\ 0 & 0.5 & 0 & 0\\ 0 & 0 & 0.2 & 0 \end{pmatrix}. \]
Its unique dominant eigenvalue is $\lambda_1 \approx 1.09486$. Since $\lambda_1 > 1$, the population increases, with growth rate $1.09486^t$, in contrast with the population of bushbabies in 3.C.3. The normalized eigenvector associated with $\lambda_1$ is given by $X_1 \approx (0.639, 0.233, 0.106, 0.019)^T$; hence the long term trends of the female population are as follows: 63.9% of age class A, 23.3% of age class B, 10.6% of age class C and 1.9% of age class D. Finally, we compute
\[ A^{10} \approx \begin{pmatrix} 1.047 & 2.872 & 2.094 & 0.96\\ 0.384 & 1.047 & 0.761 & 0.348\\ 0.174 & 0.48 & 0.349 & 0.16\\ 0.032 & 0.087 & 0.064 & 0.029 \end{pmatrix}, \qquad p_{10} = A^{10}p_0 \approx (150.813,\ 54.956,\ 25.171,\ 4.598)^T. \]
Therefore, the female population after ten years will consist of approximately 235 members (this is the sum of the entries of $p_{10}$), and the total population of the colony (both females and males) after the same period will approximately reach the level of 470 galagos. Below is the required diagram representing the exponential growth of the female population for a duration of thirty years.

3.D.1. Let us first recall that the norm of an arbitrary vector $z \in \mathbb{C}^n$ with respect to the standard Hermitian form $\langle\ ,\ \rangle$ is given by $\|z\|^2 = \sum_{j=1}^n z_j\bar z_j$. For the vectors $x = (3+2i, 1-i, -i)^T$ and $y = (2-2i, 1-i, 2+i)^T$ in $\mathbb{C}^3$ one computes
\[ \langle x, y\rangle = y^* x = \begin{pmatrix} 2+2i & 1+i & 2-i \end{pmatrix}\begin{pmatrix} 3+2i\\ 1-i\\ -i \end{pmatrix} = (3+2i)(2+2i) + (1-i)(1+i) - i(2-i) = 3 + 8i. \]
Moreover, we see that
\[ \|x\| = \sqrt{(3+2i)(3-2i) + (1-i)(1+i) - i^2} = 4, \qquad \|y\| = \sqrt{(2-2i)(2+2i) + (1-i)(1+i) + (2+i)(2-i)} = \sqrt{15}. \]
On the other hand, the vector $x - y$ has the expression $x - y = (1+4i, 0, -(2+2i))^T$. Therefore,
\[ d(x, y) = \sqrt{(1+4i)(1-4i) + (2+2i)(2-2i)} = 5. \]
Since $\|x\| = 4$ and $\|y\| = \sqrt{15}$, the normalized vectors with $\|\hat x\| = 1 = \|\hat y\|$ are given by $\hat x = \tfrac14(3+2i, 1-i, -i)$ and $\hat y = \tfrac{1}{\sqrt{15}}(2-2i, 1-i, 2+i)$, respectively.
See the Problem 3.D.2 how one can verify these results in Sage. 3.D.3. (a) For the vectors v = (1 + i, 2 − i)T and w = (3 − 2i, 1 + i)T in C2 we compute ⟨v, w⟩ = w∗ v = ( 3 + 2i 1 − i ) ( 1 + i 2 − i ) = (3 + 2i)(1 + i) + (1 − i)(2 − i) = 2 + 2i . Moreover, we see that ∥v∥2 = v∗ v = (1 + i)(1 − i) + (2 − i)(2 + i) = 7 , ∥w∥2 = w∗ w = (3 − 2i)(3 + 2i) + (1 + i)(1 − i) = 15 . Recall that the angle θ between v, w is defined by the equation (compare with the formula given in 2.3.22 for the real case) cos θ = Re(⟨v, w⟩) ∥v∥∥w∥ , where Re(z) denotes the real part of a complex number z ∈ C. Hence it follows that θ = cos−1 ( 2 √ 7 √ 15 ) ≈ 1.37435 ≈ 78.75◦ . To verify all these results in Sage one can use the following cell: 304 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS v = vector(CDF, [1+I, 2-I]) w = vector(CDF, [3-2*I, 1+I]) vinnerw=w.hermitian_inner_product(v) print("The standard Hermitian product of the vectors v=",v,"" "and w=",w,"" "equals to",vinnerw) vn=N(v.norm()^2, digits=3) print("The square norm of the vector v=", v, "" "equals to", vn) wn=N(w.norm()^2, digits=3) print("The square norm of the vector w=", w, "" "equals to", wn) theta=arccos(w.hermitian_inner_product(v).real()/(norm(v)*norm (w))) 305 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS print("The angle between the vectors v and w equals to", N(theta, digits = 4), "radians" ) print("The angle between the vectors v and w equals to", N(theta*180/pi, digits = 4), "degrees") In this block we used the option CDF (Complex Double Field) when introducing the vectors v, w, similarly with the description given in 3.D.2. Recall that CDF approximates the field of complex numbers using double-precision floating point numbers. Alternatively, we could use the option CC (for this task, the result would remain the same). The CDF option helps to save some space in the output, compared to CC. Finally, for your convenience, in this block Sage was programmed to provide more detailed responses, than usual, as one can see in the output below: The standard Hermitian product of the vectors v= (1.0 + 1.0*I, 2.0 - 1.0*I) and w= (3.0 - 2.0*I, 1.0 + 1.0*I) equals to 2.0 + 2.0*I The square norm of the vector v= (1.0 + 1.0*I, 2.0 - 1.0*I) equals to 7.00 The square norm of the vector w= (3.0 - 2.0*I, 1.0 + 1.0*I) equals to 15.0 The angle between the vectors v and w equals to 1.374 radians The angle between the vectors v and w equals to 78.75 degrees Similarly can be treated the cases (b) and (c) for which we only present the results. (b) ⟨v, w⟩ = 5 + 4i, ∥v∥2 = 10, ∥w∥2 = 7, cos θ ≈ 0.59, θ ≈ 0.93 ≈ 53.5◦ . (c) ⟨v, w⟩ = 2 − 4i, ∥v∥2 = 5, ∥w∥2 = 19, cos θ ≈ 0.205, θ ≈ 1.364 ≈ 78.1◦ . 3.D.4. One can easily check the following properties of f; f(u + w, v) = f(u, v) + f(w, v), f(au, v) = af(u, v) and f(v, u) = f(u, v), for all u, v ∈ V = F2 and a ∈ F. Moreover, if u = (x1, x2)T we have f(u, u) = x1 ¯x1 + 4x1 ¯x2 + 4x2 ¯x1 + x2 ¯x2 = |x1|2 + 4(x1 ¯x2 + x2 ¯x1) + |x2|2 , where we used the relation z¯z = |z|2 with z ∈ F. However, x1 ¯x2 + x2 ¯x1 = |x1 + x2|2 − |x1|2 − |x2|2 , and hence the above relation reduces to f(u, u) = −3|x1|2 + 4|x1 + x2|2 − 3|x2|2 . Therefore, we may have f(u, u) < 0 for some u ̸= 0. For example, f(u, u) = −6 for u = (1, −1)T . This shows that f cannot be a scalar product on V . 3.D.6. We see that ρa,b(u, u) = au2 1 + bu2 2 ≥ 0 and the only way that this is zero is when u = 0, that is, u1 = u2 = 0. This proves positive definiteness. Moreover, by the commutativity of real numbers we get ρa,b(v, u) = ρa,b(v, u) = av1u1 + bv2u2 = ρa,b(u, v) . 
Hence ρa,b is Hermitian symmetric. Next, ρa,b(u, v + w) = au1(v1 + w1) + bu2(v2 + w2) = (au1v1 + bu2v2) + (au1w2 + bu2w2) = ρa,b(u, v) + ρa,b(u, w) , and moreover ρa,b(cu, v) = ρa,b(u, cv) = cρa,b(u, v) for any c ∈ R. This proves that ρa,b is a scalar product on R2 . Then the angle between u, v satisfies cos θ = ρa,b(u, v) ∥u∥ρa,b ∥v∥ρa,b . For a = 2 and b = 1 we compute ρ2,1(u, v) = 2 · 1 + 1 · (−1) = 1, and ∥u∥ρ2,1 = √ 2 · 1 + 1 · 1 = √ 3 = ∥v∥ρ2,1 . Thus cos θ = 1/3 and so θ = arccos(1/3). With respect to the dot product the vector u, v are orthogonal, u · v = 0, hence they obviously form a different angle in comparison with the one appearing with respect to ρ2,1. 3.D.7. For any vector x ∈ Cn we see that ρA(x, x) = x∗ A∗ Ax = (Ax)∗ Ax = ⟨Ax, Ax⟩ ≥ 0, where ⟨ , ⟩ is the standard Hermitian form on Cn . Thus ρA(x, x) ≥ 0 with the equality holding if and only if Ax = 0. Let us denote by A1, . . . , An the n columns of A, such that A = ( A1 A2 · · · An ) with Ai = (a1i, a2i, . . . , ami)T , for any 1 ≤ i ≤ n. Recall the column space of A: C(A) = {w ∈ Cm : w = ∑n i=1 xiAi, with xi ∈ C} = {w ∈ Cm : w = Ax} , where x = (x1, . . . , xn)T ∈ Cn , and the second relation holds since the vector Ax ∈ Cm is given by a linear combination of the columns of A with elements of x, i.e., Ax = ∑n i=1 xiAi. Hence our assumption that C(A) is n-dimensional, is equivalent to say that A consists of n linearly independent column vectors, or that the conditions Ax = 0 and x = 0 are equivalent each other. This shows that ρA is positive definite. Also, since ρA(x, y) ∈ C for any x, y ∈ Cn and so (ρA(x, y))∗ = ρA(x, y) we see that ρA(x, y) = (Ay)∗ (Ax) = ( (Ax)∗ (Ay) )∗ = (x∗ A∗ Ay)∗ = (ρA(y, x))∗ = ρA(y, x) , for any x, y ∈ Cn . Moreover, ρA(x, y + z) = (y + z)∗ (A∗ Ax) = y∗ A∗ Ax + z∗ A∗ Ax = ρA(x, y) + ρA(x, z) for any three vectors x, y, z, while for some a ∈ C and any x, y ∈ Cn we get ρA(ax, y) = (Ay)∗ Aax = a(Ay)∗ Ax = aρA(x, y). Thus ρ is an inner product. 306 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS We should mention that by the definition of ρA one gets ρA(x, y) = y∗ A∗ Ax = (Ay)∗ (Ax) = ⟨Ax, Ay⟩, where Ax is viewed as a (column) vector. This provides of course an alternative and much shorter verification of the claim, but for the convenience of the reader we kept both arguments. Finally observe that when A is the identity matrix, then ρA is nothing than the standard Hermitian form on Cn . 3.D.9. (a) For a real scalar product space (V, ⟨ , ⟩) we have ⟨u, w⟩ = ⟨w, u⟩, for all u, w ∈ V and hence ∥u + w∥2 = ⟨u + w, u + w⟩ = ∥u∥2 + ∥w∥2 + 2⟨u, w⟩ , u, w ∈ V . The first claim is now direct. For a complex scalar product space (V, ⟨ , ⟩) the bilinear form ⟨ , ⟩ should satisfy ⟨u, w⟩ = ⟨w, u⟩, for all u, w ∈ V . This requirement prevents the form ⟨ , ⟩ from having the same symmetry, as in the real case. For example, consider C2 endowed with the standard (positive) Hermitian form ⟨u, w⟩ = u1 ¯w1 + u2 ¯w2, with u, w ∈ C2 . Given the vectors u = (0, i)T ∈ C2 an w = (0, 1)T ∈ C2 we see that ⟨u, w⟩ = i and ∥u + w∥2 = 2, ∥u∥2 = i¯i = 1, ∥w∥2 = 1. Hence we have ∥u + w∥2 = ∥u∥2 + ∥w∥2 but u, w are not orthogonal (observe however that if ⟨u, w⟩ = 0, then ∥u + w∥2 = ∥u∥2 + ∥w∥2 still makes sense on a complex inner product space, see the proof of the theorem in 3.4.2). (b) If u + w ⊥ u − w then 0 = ⟨u + w, u − w⟩ = ∥u∥2 − ∥w∥2 . Thus ∥u∥2 = ∥w∥2 , which gives ∥u∥ = ∥w∥. Conversely, let u, w ∈ V such that ∥u∥ = ∥w∥. Then ⟨u + w, u − w⟩ = ∥u∥2 − ∥w∥2 = 0, hence u + w ⊥ u − w. 3.D.10. 
(a) From the definition of a unitary space given in 3.4.1, we see that
\[ \|u + av\|^2 = \langle u + av, u + av\rangle = \langle u, u\rangle + \bar a\langle u, v\rangle + a\langle v, u\rangle + a\bar a\langle v, v\rangle = \|u\|^2 + 2\,\mathrm{Re}(\bar a\langle u, v\rangle) + |a|^2\|v\|^2, \]
and similarly $\|u - av\|^2 = \|u\|^2 - 2\,\mathrm{Re}(\bar a\langle u, v\rangle) + |a|^2\|v\|^2$, for any two vectors $u, v \in V$ and any scalar $a$. Thus, $\|u + av\| = \|u - av\|$ if and only if $\mathrm{Re}(\bar a\langle u, v\rangle) = 0$. Since the latter relation should hold for any scalar $a$, taking $a = 1$ for the real case, and $a = i$ for the complex case, we see that $\langle u, v\rangle = 0$.

(b) The proof is similar and we leave it for practice.

3.D.13. According to part (4) of the main theorem in Section 3.4.2, for any two vectors $x, y \in V$ we have the (Fourier) expansions $x = \sum_j \langle x, E_j\rangle E_j$ and $y = \sum_k \langle y, E_k\rangle E_k$. Thus
\[ \langle x, y\rangle = \Big\langle \sum_j \langle x, E_j\rangle E_j,\ \sum_k \langle y, E_k\rangle E_k \Big\rangle = \sum_{j,k} \langle x, E_j\rangle\overline{\langle y, E_k\rangle}\,\langle E_j, E_k\rangle = \sum_j \langle x, E_j\rangle\overline{\langle y, E_j\rangle}. \]

3.D.14. We can easily check that the given set $\{E_1 = (-1, 1, 2), E_2 = (2, 0, 1), E_3 = (1, 5, -2)\}$ is an orthogonal basis of $\mathbb{R}^3$. We only prove orthogonality by Sage and leave a formal verification of the claim to the reader. Let us program Sage to do this quickly, as follows:

E1 = vector(QQbar, [-1, 1, 2])
E2 = vector(QQbar, [2, 0, 1])
E3 = vector(QQbar, [1, 5, -2])
B = [E1, E2, E3]
ips = [B[i].dot_product(B[j]) for i in range(3) for j in range(i+1, 3)]
all([ip == 0 for ip in ips])

Sage's answer is True. Notice that in the same way we can also verify that the vectors $E_i$ are not of unit length, for $i = 1, 2, 3$. This can be done by adding the cell

ips2 = [B[i].dot_product(B[i]) for i in range(3)]
all([ip2 == 1 for ip2 in ips2])

and in this case Sage prints out False. Now recall that if $\{u_1, \dots, u_m\}$ is an orthogonal basis of a unitary space $(V, \langle\ ,\ \rangle)$, then any vector $u \in V$ is expressed by
\[ u = \sum_{k=1}^m \frac{\langle u, u_k\rangle}{\|u_k\|^2}\,u_k. \]
Let us apply this rule for our task. We compute
\[ \langle u, E_1\rangle = -12, \quad \langle u, E_2\rangle = 8, \quad \langle u, E_3\rangle = 24, \qquad \|E_1\|^2 = 6, \quad \|E_2\|^2 = 5, \quad \|E_3\|^2 = 30. \]
Thus $u = -\frac{12}{6}E_1 + \frac{8}{5}E_2 + \frac{24}{30}E_3 = -2E_1 + \frac{8}{5}E_2 + \frac{4}{5}E_3$.

3.D.15. We will apply a method that relies on the Cauchy–Schwarz inequality, see 3.4.3, as is suggested in the statement. Set $u = (\sqrt{a+2b}, \sqrt{b+2c}, \sqrt{c+2a})^T$, $v = (1, 1, 1)^T$ and view $u, v$ as vectors in $\mathbb{R}^3$. Their dot product is given by
\[ \langle u, v\rangle = v^T u = \sqrt{a+2b} + \sqrt{b+2c} + \sqrt{c+2a}, \]
while $\|u\|^2 = (a+2b)+(b+2c)+(c+2a) = 3(a+b+c)$ and $\|v\|^2 = 3$. An application of the Cauchy–Schwarz inequality $\langle u, v\rangle \le \|u\|\|v\|$ yields the expression
\[ \sqrt{a+2b} + \sqrt{b+2c} + \sqrt{c+2a} \le 3\sqrt{a+b+c}, \]
which gives the result after dividing by $\sqrt{a+b+c}$ (which is non-zero by the assumption in the statement).

3.D.16. The standard basis $e$ is an orthonormal basis of $(\mathbb{C}^3, \langle\ ,\ \rangle)$ and hence we see that
\[ u = \sum_{j=1}^{3} \langle u, e_j\rangle e_j = (2+i)e_1 + (-1+2i)e_2 + (3-i)e_3, \]
that is, $\langle u, e_1\rangle = 2+i$, $\langle u, e_2\rangle = -1+2i$ and $\langle u, e_3\rangle = 3-i$. Now, according to Parseval's equality (see 3.4.3) we should have
\[ \|u\|^2 = \sum_{j=1}^{3} |\langle u, e_j\rangle|^2 = |2+i|^2 + |-1+2i|^2 + |3-i|^2 = (2+i)(2-i) + (-1+2i)(-1-2i) + (3-i)(3+i) = 20, \]
where we applied the rule $|z|^2 = z\bar z$, with $z \in \mathbb{C}$. On the other hand, we also see that
\[ \|u\|^2 = \langle u, u\rangle = (2+i)\overline{(2+i)} + (-1+2i)\overline{(-1+2i)} + (3-i)\overline{(3-i)} = 20, \]
and we are done.

3.D.19. When we rotate a point in $\mathbb{R}^3$ about a particular axis, the coordinate corresponding to that axis remains unchanged. The remaining two coordinates are given by the well known rotation in the plane (see Chapter 1).
This yields the matrices Rx(θ) =   1 0 0 0 cos θ − sin θ 0 sin θ cos θ   , Ry(θ) =   cos θ 0 sin θ 0 1 0 − sin θ 0 cos θ   , Rz(θ) =   cos θ − sin θ 0 sin θ cos θ 0 0 0 1   which represent the rotation about the x-axis, the y-axis and the z-axis, respectively. Note the sign of θ in the rotation matrix about the y-axis. As with any other rotation, we want the rotation about the y-axis to be in the “positive sense”. This means that when viewed from the negative y-axis direction, the rotation should appear anti-clockwise. Consequently, the signs in the rotation matrices depend on the orientation of our coordinate system. Usually, in the 3-dimensional space the “right-handed coordinate system” is chosen, also known as “dextrorotary coordinate system”. In a right-handed coordinate system, if you orient your right hand so that your thumb points along the positive x-axis and curl your fingers naturally, your fingers will point in the direction of the y-axis, followed by the z-axis. In particular, the index finger will point the positive y-axis and the middle finger will point the positive z-axis. For example, a positive rotation Rx(θ) about the x-axis occurs perpendicular to the yz-plane, and a negative rotation will occurs in the yz-plane, with opposite direction to the positive rotation. Notice when looking from positive x towards the origin, a positive rotation is counterclockwise. This configuration visually confirms the order x → y → z and its cyclical nature is evident in how axes are sequentially rotated: x → y → z → x → . . . for a positive cycle and x → z → y → x → . . . for a negative one. 3.D.20. The first task has been analyzed for unitary transformations in 3.4.4. Let us present an alternative proof. Suppose that U ∈ Matn(C) is unitary and consider the scalar product ⟨Ux, y⟩. Based on the relation E = UU−1 = U−1 U, where E is the identity n × n matrix, we get ⟨Ux, y⟩ = ⟨Ux, UU−1 y⟩ = ⟨x, U−1 y⟩. On the other hand, by the definition of the conjugate transpose matrix we have ⟨Ux, y⟩ = ⟨x, U∗ y⟩ for any x, y ∈ Cn . By comparing these two relations and using bilinearity, we arrive at the following relation: ⟨x, (U−1 − U∗ )y⟩ = 0 , (∗) for any x, y ∈ Cn . We can now utilize the fact that ⟨ , ⟩ is a non-degenerate bilinear form. Hence, by the relation (∗) we deduce that U−1 − U∗ = 0, which implies that U−1 = U∗ . Conversely, assume that U∗ U = E. Then, for any x ∈ Cn we have ∥Ux∥2 = ⟨Ux, Ux⟩ = x, U∗ Ux⟩ = ⟨x, x⟩ = ∥x∥2 , and hence U is unitary. To prove that U(n) is a group, we can proceed similarly to the case of O(n). Hence we need to verify that the composition of two unitary transformations and the inverse of a unitary transformation, both retain the same property. Let A, B ∈ U(n). Then A−1 = A∗ and B−1 = B∗ , respectively. Moreover (AB)∗ = B∗ A∗ = B−1 A−1 = (AB)−1 , and hence AB is also unitary. The verification of the second property is left as an exercise for the reader. Finally, the claim for the determinant occurs by the relation det(E) = 1 = det(U∗ U) = det(U∗ ) det(U) = det(U) det(U) = | det(U)|2 . This implies that the absolute value of the determinant of any unitary matrix is equal to one. 3.D.22. Consider linear maps φ, ψ : V → V as in the statement, and let A, B be their matrices. In Problem 2.E.56 we proved that A, B must satisfy all the analogous properties listed in the statement. Since φ, ψ are determined by their matrices, all the formulas follow. 
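These formulas are also easy to spot-check numerically. The following Sage sketch (with random matrices of our own choosing) tests two identities of this kind, namely $(AB)^* = B^*A^*$ and $(A^*)^* = A$, up to floating-point error:

A = random_matrix(CDF, 3, 3)
B = random_matrix(CDF, 3, 3)
print(((A*B).conjugate_transpose() - B.conjugate_transpose()*A.conjugate_transpose()).norm() < 1e-12)
print((A.conjugate_transpose().conjugate_transpose() - A).norm() < 1e-12)

Both lines should print True.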
Alternatively, one can work in terms of endomorphisms and prove all the relations based on the definition of the adjoint map. We leave such a description to the reader. 308 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS 3.D.25. For A we have A∗ = A and hence A is Hermitian (i.e., self-adjoint). We can directly verify this by Sage, as follows: A=matrix(CC, [[sqrt(2), I, 1-I], [-I, 10, sqrt(2)+I], [1+I, sqrt(2)-I, 0]]) A.is_hermitian() This command returns True. Alternatively, we can check this by verifying if A equals its conjugate transpose: A=matrix(CC, [[sqrt(2), I, 1-I], [-I, 10, sqrt(2)+I], [1+I, sqrt(2)-I, 0]]) A == A.conjugate_transpose() which again returns True. Among the remaining matrices, B and C are Hermitian, while D is not. For the matrix B, verifying this property using Sage may take more time compared to a manual computation, but it is useful for handling parametric matrices. The verification process is as follows: a = var("a"); b = var("b") assume(a, "real"); assume(b, "real") B = matrix(SR, 2, 2, [[a, I+b],[-I+b, sqrt(abs(a))]]) B.is_hermitian() We leave the verification for the matrices C and D to the reader. 3.D.26. Under our assumptions for the matrices A, B, we need to show tr(A∗ A) = tr(B∗ B). This is equivalent to proving that ∑n i,j=1 |aij|2 = ∑n i,j=1 |bij|2 , providing an alternative reformulation of the problem. Since B = U∗ AU, we have B∗ = (U∗ AU)∗ = U∗ A∗ U. Therefore tr(B∗ B) = tr(U∗ A∗ UU∗ AU) = tr(U∗ A∗ AU) = tr(A∗ A) . This follows from the property that similar matrices X, Y , i.e., X = P−1 Y P, have the same trace, tr(X) = tr(Y ), and and from the fact that U is unitary, i.e., U−1 = U∗ . 3.D.27. To compute its adjoint operator φ∗ : R3 → R2 , we need to find the matrix representation of φ∗ with respect to the standard bases of R3 and R2 . In matrix form we have φ (( x y )) =   √ 2 1 1 −1 0 2   ( x y ) . Thus, the matrix of φ with respect to the standard orthonormal bases is given by A =   √ 2 1 1 −1 0 2   . Now, to compute φ∗ we can take the conjugate transpose of A, but since A has only real entries, this simplifies to just the transpose of A: We compute A∗ = AT = (√ 2 1 0 1 −1 2 ) . and hence φ∗     x y z     = ( √ 2x + y x − y + 2z ) , with x, y, z ∈ R. For the second task, we need to verify that B = B∗ . The conjugate transpose of the matrix B is given by B∗ = ¯BT =   10 5 + 5i 3 + 2i 5 − 5i 5 √ 2i 3 − 2i − √ 2i 0   . Hence B = B∗ and consequently ψ is a self-adjoint operator. 3.D.28. The adjoint operator of tr : Matm(F) → F is a linear map tr∗ : F → Matm(F), assigning to any scalar z ∈ F a m × m matrix over F. It is determined by the relation B ( tr(A), z ) = B ( A, tr∗ (z) ) , for any z ∈ F and A ∈ Matm(F). Observe here that on the left-hand side B acts on scalars (viewed as trivial 1 × 1 matrices). By the definition of B, we have B(tr(A), z) = ¯z tr(A) . 309 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Setting tr∗ (z) = B for some matrix B ∈ Matm(F), we also obtain B(A, tr∗ (z)) = tr(B∗ A). Hence, we need ¯z tr(A) = tr(B∗ A). Based on the properties of the trace described in 2.C.38, we see that ¯z tr(A) = tr(¯zA) = tr(¯zEA), where E denotes the m × m matrix. By comparing these two relations and using the non-degeneracy of B, we find that B∗ = ¯zE. Therefore B = (B∗ )∗ = (¯zE)∗ = E∗ ¯¯z = Ez , which means that tr∗ (z) = zE from any z ∈ F. 3.D.29. By definition, the linear map projW : V → V sends u ∈ V to its orthogonal projection projW u on W. Let W⊥ be the orthogonal complement of W with respect to ⟨ , ⟩. 
Recall by Chapter 2 the direct sum decomposition V = W ⊕ W⊥ , which for any u ∈ V implies that u = projW u + projW ⊥ u = projW u + (u − projW u) = P(u) + (u − P(u)) , where P(u) := projW u. Then, by linearity we see that ⟨P(u), w⟩ = ⟨P(u), P(w) + (w − P(w))⟩ = ⟨P(u), P(w)⟩ + ⟨P(u), w − P(w)⟩ , for any u, w ∈ V . However, by assumption P(u) ∈ W and (w − P(w)) ∈ W⊥ . Thus, the second component in the right-hand side of the previous relation vanishes, which yields the relation ⟨P(u), w⟩ = ⟨P(u), P(w)⟩. In a similar way one computes ⟨u, P(w)⟩ = ⟨P(u), P(w)⟩, and the result follows. 3.D.30. The first claim is direct. To present a counterexample, use the matrix A =   1 1 0 0 1 1 1 0 1   ∈ Mat3(R). Then a direct computation shows that A∗ = AT =   1 0 1 1 1 0 0 1 1   , AA∗ = A∗ A =   2 1 1 1 2 1 1 1 2   . Hence A is normal, but is not orthogonal since AAT = AT A ̸= E. 3.D.32. We present directly the code including some comments within the code: # Define the matrix phi symbolically phi = Matrix(SR, [[2, 1+I, 0], [1-I, 3, 0], [0, 0, 1]]) # Verify that phi is normal is_normal = phi*phi.conjugate_transpose()== phi.conjugate_transpose() * phi print("Is phi normal?:", is_normal) # Compute eigenvalues and eigenvectors symbolically eigenvalues = phi.eigenvalues() eigenvectors = phi.eigenvectors_right() # Display eigenvalues and eigenvectors print("Eigenvalues:", eigenvalues) print("Eigenvectors:") for eig in eigenvectors: print(f"Eigenvalue: {eig[0]}, Eigenvectors: {eig[1]}") # Extract eigenvectors corresponding to distinct eigenvalues if len(eigenvectors) >= 2: v1 = eigenvectors[0][1][0] # Eigenvector for the first eigenvalue v2 = eigenvectors[1][1][0] # Eigenvector for the second eigenvalue # Compute the inner product of the eigenvectors inner_product = v1.inner_product(v2) print("Inner product of v1 and v2:", inner_product) # Verify if they are orthogonal is_orthogonal = inner_product == 0 print("Are v1 and v2 orthogonal?:", is_orthogonal) else: print("Not enough distinct eigenvalues to check orthogonality.") 310 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Let us also display Sage’s output: Is phi normal?: True Eigenvalues: [1, 1, 4] Eigenvectors: Eigenvalue: 1, Eigenvectors: [(1, 1/2*I - 1/2, 0), (0, 0, 1)] Eigenvalue: 4, Eigenvectors: [(1, -I + 1, 0)] Inner product of v1 and v2: I + 1 Are v1 and v2 orthogonal?: (I + 1) == 0 Notice the comparison (I + 1) == 0 is used to determine if the computed inner product is zero. Since 1 + i ̸= 0 the result of the comparison is False, indicating that the eigenvectors are not orthogonal. We leave a formal computation for practice. 3.D.34. Such an example is given by the matrix A = ( 0 1 − i 1 − i 0 ) . Indeed, since A∗ = ¯AT = ( 0 1 + i 1 + i 0 ) ̸= ±A, the matrix at hand is neither Hermitian, nor skew-Hermitian. However, A is a normal matrix since AA∗ = ( 0 1 − i 1 − i 0 ) ( 0 1 + i 1 + i 0 ) = ( 2 0 0 2 ) = ( 0 1 + i 1 + i 0 ) ( 0 1 − i 1 − i 0 ) = A∗ A . 
Here is a detailed verification with Sage: # Define the matrix A = Matrix([[0, 1 - I], [1 - I, 0]]) # Compute the conjugate transpose (Hermitian transpose) of A A_H = A.H # Check if A is Hermitian is_hermitian = A == A_H # Check if A is skew-Hermitian is_skew_hermitian = A == -A_H # Compute A * A^H A_AH = A * A_H # Compute A^H * A AH_A = A_H * A # Check if A is normal is_normal = A_AH == AH_A # Print the results print("Matrix A:") print(A) print("\nConjugate Transpose of A:") print(A_H) print("\nIs A Hermitian?") print(is_hermitian) print("\nIs A Skew-Hermitian?") print(is_skew_hermitian) print("\nIs A Normal?") print(is_normal) 3.D.35. We will mainly use Sage and leave the formal computations for practice. The given matrix A is symmetric, A = AT . The following cell verifies this statement in Sage: A = matrix(SR, [[1, 2, 6], [2,0, 2], [6, 2, 1]]) A.T == A Recall also that the eigenvalues of A occur by the command A.eigenvalues() They are given by λ1 = 8, λ2 = −5 and λ3 = −1. As for the eigenvectors, use the cell A.eigenvectors_right() Sage’s output is the list [(8, [(1, 1/2, 1)], 1), (-1, [(1, -4, 1)], 1), (-5, [(1, 0, -1)], 1)] 311 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS Thus the vectors u1 = (2, 1, 2)T , u2 = (1, 0, −1)T and u3 = (1, −4, 1)T are eigenvectors of A, corresponding to the eigenvalues λ1, λ2, and λ3, respectively. Observe that ⟨u1, u2⟩ = 0, ⟨u1, u3⟩ = 0, and ⟨u2, u3⟩ = 0 where ⟨ , ⟩ is the dot product on R3 . Thus, the eigenvectors of A are orthogonal to each other. We also compute ∥u1∥ = 3, ∥u2∥ = √ 2 and ∥u3∥ = 3 √ 2. Consequently, the orthogonal matrix P has the form P =    2 3 1√ 2 1 3 √ 2 1 3 0 − 4 3 √ 2 2 3 − 1√ 2 1 3 √ 2    . Note that P is not uniquely defined. It remains to verify the relations P−1 = PT , and P−1 AP = diag(λ1, λ2, λ3). In Sage this can be done easily: A = matrix(SR, [[1, 2, 6], [2,0, 2], [6, 2, 1]]) D = diagonal_matrix([8, -5, -1]) P = matrix(SR, [[2/3, 1/sqrt(2), 1/(3*sqrt(2))],[1/3, 0, -4/(3*sqrt(2))], [2/3, -1/sqrt(2), 1/(3*sqrt(2))]]) print(P.inverse()==P.T); print(P.inverse()*A*P==D) 3.D.36. The one direction starting with A orthogonally diagonalizable is easy. Then we have PT AP = D for some orthogonal n × n matrix P and a diagonal n × n matrix D. Thus A = PDPT and since D being diagonal satisfies D = DT , we obtain AT = (PDPT )T = PDT PT = PDPT = A . Hence A is symmetric. The converse is more difficult. Assume that A is a symmetric n × n matrix. If n = 1 the statement is trivial, i.e. the pair (P = 1, D = A) satisfies the statement. We will proceed by induction on the size of A. The induction hypothesis assumes that every (n−1)×(n−1) symmetric matrix is orthogonally diagonalizable. Consider now the symmetric matrix A and let λ ∈ R be an eigenvalue of A with eigenvector u ∈ Rn , i.e., Au = λu. Set p1 = u/∥u∥ and extend {p1} to an orthonormal basis {p1, . . . , pn} of Rn . Let P be the orthogonal matrix whose column vectors are the vectors pi, that is, P = ( p1 p2 . . . pn ) . Then, in terms of block matrices for the multiplication AP we get the expression AP = ( λp1 Ap2 . . . Apn ) = P ( λ C 0 A1 ) , for some C ∈ Mat1,n−1(R) and A1 ∈ Matn−1(R). However, P−1 = PT and hence this equivalently can be rewritten as PT AP = ( λ C 0 A1 ) . (∗) Since A is symmetric, we see that (PT AP)T = PT AP, hence the matrix PT AP is also symmetric. Thus, the relation given in (∗) will make sense only when C = 0 and A1 is symmetric. 
Since A1 has size (n − 1) × (n − 1), the induction hypothesis applies, so there exist an orthogonal matrix B and a diagonal matrix D1 of the same size with A1, satisfying BT A1B = D1. Consider now the block matrix P′ = ( 1 0 0 B ) , whose size is n × n. Since B is orthogonal, P′ is also orthogonal and hence we see that (PP′ )T A(PP′ ) = (P′ )T (PT AP)P′ = ( 1 0 0 BT ) ( λ 0 0 A1 ) ( 1 0 0 B ) = ( λ 0 0 BT A1B ) = ( λ 0 0 D1 ) . Since the last matrix is diagonal and PP′ is orthogonal, the proof is complete. 3.D.38. Both matrices A and B are Hermitian, i.e., A = A∗ and B = B∗ , and hence also normal. Thus, they are unitary diagonalizable. Moreover, since Hermitian matrices have only real eigenvalues, both A and B possess only real eigenvalues, as we will see also below. Let us begin with the matrix A. The characteristic polynomial of A is given by χA(λ) = |A − λ E| = −(1 + λ) 0 i 0 1 − λ 0 −i 0 1 − λ = −(1 + λ) 1 − λ 0 0 1 − λ + i 0 1 − λ −i 0 = −(1 + λ)(1 − λ)2 + i2 (1 − λ) = −(1 + λ)(1 − λ2 + 1) = (1 − λ)(λ2 − 2) = (1 − λ)(λ − √ 2)(λ + √ 2) . Or one may try to express the 3 × 3 determinant with respect to the second row, which has less computations. Hence the eigenvalues of A are given by λ = 1 and λ± = ± √ 2, all with multiplicity one. Hence their geometric multiplicity is also one. Let us derive generators of the corresponding eigenspaces Vλ, Vλ+ and Vλ− , which are subspaces of C3 . • For λ = 1 we have the matrix equation Au = λu = u, for some vector u = (u1, u2, u3)T ∈ C3 , i.e., 312 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS   −u1 + iu3 u2 −iu1 + u3   =   u1 u2 u3   . It is easy to see that the solutions of the corresponding linear system are given by u1 = u3 = 0, with u2 ∈ C free. Thus we may assume that u = (0, 1, 0)T . • For λ+ = √ 2 we need to solve the matrix equation Au+ = √ 2 u+, for some u+ = (w1, w2, w3) ∈ C3 , i.e.,   −w1 + iw3 w2 −iw1 + w3   =   √ 2w1√ 2w2√ 2w3   . This induces the linear system    ( √ 2 + 1)w1 − iw3 = 0 , ( √ 2 − 1)w2 = 0 , iw1 + ( √ 2 − 1)w3 = 0 , whose solutions are given by {w1 = i( √ 2 − 1)r , w2 = 0 , w3 = r ∈ C}. Thus, setting r = 1 we may assume that u+ = (i( √ 2 − 1), 0, 1)T ∈ C3 . • For λ− = − √ 2 we need to solve the matrix equation Au− = − √ 2 u−, for some vector u− = (v1, v2, v3) ∈ C3 . In a similar way we find that the solution of the corresponding linear system is given by {v1 = −i( √ 2 + 1)s , v2 = 0 , v3 = s ∈ C}. Thus, up to a multiple, the eigenvector u− is given by u− = (−i( √ 2 + 1), 0, 1)T ∈ C3 . We recommend formally verifying that the eigenvectors are orthogonal with respect to the standard Hermitian form ⟨ , ⟩ on C3 . We have achieved this using Sage, as presented below: u=vector([0, 1, 0]) u_plus=vector([1-sqrt(2), 0, i]) u_minus=vector([1+sqrt(2), 0, i]) a=u.dot_product(u_plus.conjugate()) print(a.simplify_full()) b=u.dot_product(u_minus.conjugate()) print(b.simplify_full()) c=u_plus.dot_product(u_minus.conjugate()) print(c.simplify_full()) Now, without loss of generality, we can multiply both u± by i and hence consider the eigenvectors ˜u+ =   1 − √ 2 0 i   ∈ V+ , and ˜u− =   1 + √ 2 0 i   ∈ V− . We need to compute the norms of the eigenvectors u, and ˜u± with respect the standard Hermitian form ⟨ , ⟩ on C3 . Obviously, ∥u∥ = 1 and ∥˜u+∥2 = ⟨˜u+, ˜u+⟩ = ˜u∗ + ˜u+ = ( 1 − √ 2 0 −i ) ·   1 − √ 2 0 i   = (1 − √ 2)2 − i2 = 4 − 2 √ 2 , ∥˜u−∥2 = ⟨˜u−, ˜u−⟩ = ˜u∗ − ˜ui = ( 1 + √ 2 0 −i ) ·   1 + √ 2 0 i   = (1 + √ 2)2 − i2 = 4 + 2 √ 2 , where “·” denotes matrix multiplication. 
Thus $\|\tilde u_+\| = \sqrt{4-2\sqrt2}$, $\|\tilde u_-\| = \sqrt{4+2\sqrt2}$, and the normalized eigenvectors are given by
\[ \hat u = u = \begin{pmatrix}0\\1\\0\end{pmatrix}, \qquad \hat u_+ = \frac{\tilde u_+}{\|\tilde u_+\|} = \frac{1}{\sqrt{4-2\sqrt2}}\begin{pmatrix}1-\sqrt2\\0\\i\end{pmatrix}, \qquad \hat u_- = \frac{\tilde u_-}{\|\tilde u_-\|} = \frac{1}{\sqrt{4+2\sqrt2}}\begin{pmatrix}1+\sqrt2\\0\\i\end{pmatrix}. \]
We can now derive the matrices $U$, $U^*$ (and $D$):
\[ U = \begin{pmatrix} 0 & \frac{1-\sqrt2}{\sqrt{4-2\sqrt2}} & \frac{1+\sqrt2}{\sqrt{4+2\sqrt2}}\\ 1 & 0 & 0\\ 0 & \frac{i}{\sqrt{4-2\sqrt2}} & \frac{i}{\sqrt{4+2\sqrt2}} \end{pmatrix}, \qquad U^* = \begin{pmatrix} 0 & 1 & 0\\ \frac{1-\sqrt2}{\sqrt{4-2\sqrt2}} & 0 & \frac{-i}{\sqrt{4-2\sqrt2}}\\ \frac{1+\sqrt2}{\sqrt{4+2\sqrt2}} & 0 & \frac{-i}{\sqrt{4+2\sqrt2}} \end{pmatrix}, \qquad D = \begin{pmatrix} 1 & 0 & 0\\ 0 & \sqrt2 & 0\\ 0 & 0 & -\sqrt2 \end{pmatrix}. \]
Verifying the relation $U^*AU = D$ by hand can be both time-consuming and challenging. To streamline this process we will use Sage in the following straightforward manner:

A = matrix([[-1, 0, i], [0, 1, 0], [-i, 0, 1]])
U = matrix([[0, (1-sqrt(2))/sqrt(4-2*sqrt(2)), (1+sqrt(2))/sqrt(4+2*sqrt(2))], [1, 0, 0], [0, i/sqrt(4-2*sqrt(2)), i/sqrt(4+2*sqrt(2))]])
D = diagonal_matrix([1, sqrt(2), -sqrt(2)])
U.H*A*U == D

Verify that Sage's response is rapid, and effectively confirms the desired relation $U^*AU = D$.

For the matrix $B$ we will rely solely on Sage to obtain a solution. Using the cell

B = matrix(QQbar, [[3, I], [-I, 3]])
B.eigenvalues()
B.eigenvectors_right()

we obtain the eigenvalues of $B$: $\lambda_1 = 2$ and $\lambda_2 = 4$, both with multiplicity one. This code also returns the corresponding eigenvectors: $(1, i)^T$ for $\lambda_1$ and $(1, -i)^T$ for $\lambda_2$. We may multiply the second vector by $i$ and set $u_1 = (1, i)^T$ and $u_2 = (i, 1)^T$. With respect to the standard Hermitian form on $\mathbb{C}^2$ it is easy to verify that $\langle u_1, u_2\rangle = 0$. Thus the eigenvectors are orthogonal, as they should be, and moreover $\|u_1\| = \sqrt2 = \|u_2\|$. By taking the orthonormal vectors $\hat u_1 = \tfrac{1}{\sqrt2}u_1$ and $\hat u_2 = \tfrac{1}{\sqrt2}u_2$ we arrive at the unitary matrix $U$ given by
\[ U = \frac{1}{\sqrt2}\begin{pmatrix} 1 & i\\ i & 1 \end{pmatrix}. \]
Recall that Sage can verify this claim via the command U.is_unitary(), where the same cell should include the definition of the matrix U. To verify that this matrix $U$ diagonalizes the given matrix $B$, i.e., $U^*BU = D$, where $D = \operatorname{diag}(\lambda_1, \lambda_2) = \operatorname{diag}(2, 4)$, use the following block:

B = matrix(QQbar, [[3, I], [-I, 3]])
U = matrix(QQbar, [[1/sqrt(2), I*(1/sqrt(2))], [I*(1/sqrt(2)), 1/sqrt(2)]])
D = matrix(QQbar, [[2, 0], [0, 4]])
U.conjugate_transpose()*B*U == D

3.D.39. Suppose that $A$ is a normal matrix with purely imaginary eigenvalues. Then we can find a unitary matrix $U$ satisfying $U^*AU = D$, where $D$ is a diagonal matrix with purely imaginary (diagonal) entries. From this we get $A = UDU^*$ and since $D^* = -D$, we see that
\[ A^* = (UDU^*)^* = UD^*U^* = -UDU^* = -A. \]

3.D.41. The matrices that represent Jordan canonical forms are $G_1$, $G_2$, $G_5$ and $G_6$.

3.D.43. We saw that only the matrices $G_1$, $G_2$, $G_5$ and $G_6$ represent a Jordan canonical form. Let us denote by $A_k$ the corresponding matrices and by $P_k$ the corresponding similarity matrices. Both $A_k$ and $P_k$ have the same size as $G_k$ for all these $k$. Since $A_k = P_kG_kP_k^{-1}$, i.e., the matrices $A_k$ and $G_k$ are similar for $k = 1, 2, 5, 6$, they have the same characteristic polynomial. Thus $\chi_{A_k}(\lambda) = \chi_{G_k}(\lambda)$ for all these four cases, and from the expressions of the $G_k$ it is direct that
• $\chi_{G_1}(\lambda) = \chi_{A_1}(\lambda) = \lambda^2(\lambda - 2)^2$;
• $\chi_{G_2}(\lambda) = \chi_{A_2}(\lambda) = (\lambda - 3)(\lambda - 1)^2$;
• $\chi_{G_5}(\lambda) = \chi_{A_5}(\lambda) = (\lambda - \pi)$;
• $\chi_{G_6}(\lambda) = \chi_{A_6}(\lambda) = (\lambda - 2)^2(\lambda - 1)^2\lambda^2$.

3.D.45. (a) Let us compute the characteristic polynomial of the given matrix $A = \begin{pmatrix} 3 & 1\\ -1 & 1 \end{pmatrix}$:
\[ \chi_A(\lambda) = |A - \lambda E| = \begin{vmatrix} 3-\lambda & 1\\ -1 & 1-\lambda \end{vmatrix} = (3-\lambda)(1-\lambda) + 1 = \lambda^2 - 4\lambda + 4 = (\lambda - 2)^2. \]
Thus, $\lambda = 2$ is the unique eigenvalue of $A$, and it has algebraic multiplicity $\alpha(\lambda) = 2$.
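A quick cross-check of this characteristic polynomial and of the multiplicity count in Sage (a minimal sketch):

A = matrix(QQ, [[3, 1], [-1, 1]])
print(A.charpoly())    # x^2 - 4*x + 4
print(A.eigenvalues()) # [2, 2]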
To determine the eigenvectors, we need to solve the matrix equation $(A - 2E)u = 0$ for some vector $u = (x, y)^T$. This gives the linear system $\{x + y = 0, \ -x - y = 0\}$ with 1-dimensional solution space
\[ \Big\{ \begin{pmatrix} x\\ y \end{pmatrix} = \begin{pmatrix} -s\\ s \end{pmatrix} : s \in \mathbb{R} \Big\}. \]
Therefore, we may assume that the eigenspace $V_\lambda = V_2$ is generated by the eigenvector $u = (1, -1)^T$ (this occurs for $s = -1$). For the geometric multiplicity of $\lambda$ this fact implies that $\gamma(\lambda) = 1$. Hence $\gamma(\lambda) \ne \alpha(\lambda)$ and $A$ cannot be diagonalizable. Since $A$ is $2 \times 2$, $\alpha(\lambda) = 2$, $\gamma(\lambda) = 1$, and $\lambda$ is the unique eigenvalue of $A$, the Jordan canonical form of $A$ is the matrix
\[ J = \begin{pmatrix} 2 & 1\\ 0 & 2 \end{pmatrix}. \]
Observe now that $(A - 2E)^2 = 0$, where $0$ denotes the zero $2 \times 2$ matrix. To find the generalized eigenvector corresponding to $\lambda$ we need to solve the matrix equation $(A - 2E)w = u$. If we assume that $w$ has the expression $w = (w_1, w_2)^T$, the solution space of the corresponding linear system has the form
\[ \Big\{ \begin{pmatrix} w_1\\ w_2 \end{pmatrix} = \begin{pmatrix} 1-t\\ t \end{pmatrix} : t \in \mathbb{R} \Big\}. \]
Thus we may assume that the generalized eigenvector $w$ has the form $w = (1, 0)^T$ (this occurs for $t = 0$). Consequently, the basis with respect to which $J$ has the above form is given by $\{u, w\}$ and $P$ has the form
\[ P = \begin{pmatrix} 1 & 1\\ -1 & 0 \end{pmatrix}. \]
We also compute $P^{-1} = \begin{pmatrix} 0 & -1\\ 1 & 1 \end{pmatrix}$ and it is straightforward to verify the relation $P^{-1}AP = J$. In Sage this verification can be done as usual, i.e.,

A = matrix(SR, [[3, 1], [-1, 1]])
J = matrix(SR, [[2, 1], [0, 2]])
P = matrix(SR, [[1, 1], [-1, 0]])
Pinv = P.inverse(); Pinv*A*P == J

As mentioned in 3.D.44, an alternative has the form

A = matrix(SR, [[3, 1], [-1, 1]])
J, P = A.jordan_form(transformation=True)
show(J, P)

In this case Sage directly prints out the derived expressions for $J$ and $P$.

(b) For this task, observe that the $k$th power $A^k$ of $A$ satisfies the relation
\[ A^k = \underbrace{(PJP^{-1})(PJP^{-1})\cdots(PJP^{-1})}_{k\ \text{factors}} = PJ^kP^{-1}. \]
Now, we easily compute $J^4 = \begin{pmatrix} 16 & 32\\ 0 & 16 \end{pmatrix}$ and by applying the relation $A^4 = PJ^4P^{-1}$, we obtain
\[ A^4 = \begin{pmatrix} 48 & 32\\ -32 & -16 \end{pmatrix}. \]

3.D.46. The matrix $A$ is upper triangular, meaning that all entries below the main diagonal are zero. For an upper triangular matrix the characteristic polynomial factorizes into linear factors obtained by inspecting the diagonal elements. For our case we get the expression $\chi_A(\lambda) = (5 - \lambda)^4$. Therefore, $\lambda = 5$ is the unique eigenvalue of $A$ and it has algebraic multiplicity 4, $\alpha(\lambda) = 4$. To determine the size of the Jordan blocks, we need to find the geometric multiplicity of $\lambda$. The matrix $A - 5E$ is given by
\[ A - 5E = \begin{pmatrix} 0 & 4 & 2 & 1\\ 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 2\\ 0 & 0 & 0 & 0 \end{pmatrix}, \]
and it is easy to see that its rank equals 3. We can verify this statement in Sage, by the following block:

A = Matrix([[5, 4, 2, 1], [0, 5, 1, 1], [0, 0, 5, 2], [0, 0, 0, 5]])
E = identity_matrix(4)
AA = A - 5*E; show(AA)
show(AA.rref())
print("The rank of A-5E is given by:", AA.rank())

Since $\operatorname{rank}(A - 5E) = 3$, the dimension of $\operatorname{Ker}(A - 5E)$ equals 1 (recall that, by the rank–nullity theorem, $\operatorname{rank}(T) + \dim\operatorname{Ker}(T) = \dim V$ for any linear transformation $T : V \to W$ defined on a finite-dimensional vector space $V$). Hence we conclude that the geometric multiplicity of $\lambda = 5$ equals 1. Since $\gamma(\lambda) = 1$, $\alpha(\lambda) = 4$ and $\lambda = 5$ is the unique eigenvalue of $A$, the Jordan form is composed of a single Jordan block of size 4 (see also below for a Jordan chain of length 4):
\[ J = \begin{pmatrix} 5 & 1 & 0 & 0\\ 0 & 5 & 1 & 0\\ 0 & 0 & 5 & 1\\ 0 & 0 & 0 & 5 \end{pmatrix}. \]
Let us now describe an eigenvector of $\lambda$. The previous block also provides the RREF of $A - 5E$, given by
\[ A - 5E = \begin{pmatrix} 0 & 4 & 2 & 1\\ 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 2\\ 0 & 0 & 0 & 0 \end{pmatrix} \sim \begin{pmatrix} 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 0 \end{pmatrix}. \]
Thus, the general solution of the linear system corresponding to the matrix equation $(A - 5E)u_1 = 0$ is given by $\{(x_1, x_2, x_3, x_4)^T = (t, 0, 0, 0)^T\}$, where $t$ is a free parameter. Hence, without loss of generality, we may assume that the eigenspace $V_\lambda = V_5 = \operatorname{Ker}(A - 5E)$ is generated by the eigenvector $u_1 = (1, 0, 0, 0)^T$. Using Sage we can confirm that $(A - 5E)^4 = 0$ and compute
\[ (A - 5E)^2 = \begin{pmatrix} 0 & 0 & 4 & 8\\ 0 & 0 & 0 & 2\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \end{pmatrix}, \qquad (A - 5E)^3 = \begin{pmatrix} 0 & 0 & 0 & 8\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \end{pmatrix}. \]
Therefore it is easy to derive the dimensions of the null spaces $\operatorname{Ker}\big((A - 5E)^2\big)$ and $\operatorname{Ker}\big((A - 5E)^3\big)$:
\[ \operatorname{rank}\big((A-5E)^2\big) = 2 \implies \dim\operatorname{Ker}\big((A-5E)^2\big) = 4 - 2 = 2, \qquad \operatorname{rank}\big((A-5E)^3\big) = 1 \implies \dim\operatorname{Ker}\big((A-5E)^3\big) = 4 - 1 = 3. \]
Since $(A - 5E)^4 = 0$ it is immediate that $\dim\operatorname{Ker}\big((A - 5E)^4\big) = 4$ and hence $\operatorname{Ker}\big((A - 5E)^4\big)$ is the whole $\mathbb{R}^4$. Let us compute a generalized eigenvector for the unique eigenvalue of $A$, which we will denote by $u_2 = (x_1, x_2, x_3, x_4)^T$. The matrix equation $(A - 5E)u_2 = u_1$ corresponds to the linear system $\{4x_2 + 2x_3 + x_4 = 1, \ x_3 + x_4 = 0, \ 2x_4 = 0\}$ and its solutions are given by $\{(x_1, x_2, x_3, x_4)^T = (s, \tfrac14, 0, 0)^T\}$, where $s$ is a free parameter. Thus, we may assume that $u_2 = (0, \tfrac14, 0, 0)^T$, and it is easy to see that $u_2 \in \operatorname{Ker}\big((A - 5E)^2\big)$ but $u_2 \notin \operatorname{Ker}(A - 5E)$. In particular, the vectors $u_1, u_2$ span the subspace $\operatorname{Ker}\big((A - 5E)^2\big)$. Next we need to solve the matrix equation $(A - 5E)u_3 = u_2$ for some vector $u_3 = (x_1, x_2, x_3, x_4)^T$, which induces the system $\{4x_2 + 2x_3 + x_4 = 0, \ x_3 + x_4 = \tfrac14, \ 2x_4 = 0\}$. Its solutions are given by $\{(x_1, x_2, x_3, x_4)^T = (r, -\tfrac18, \tfrac14, 0)^T\}$, where $r$ is a free parameter. Thus we may assume that $u_3 = (0, -\tfrac18, \tfrac14, 0)^T$ and it is easy to see that $u_3 \in \operatorname{Ker}\big((A - 5E)^3\big)$ but $u_3 \notin \operatorname{Ker}\big((A - 5E)^2\big)$. It follows that the vectors $u_1, u_2, u_3$ generate $\operatorname{Ker}\big((A - 5E)^3\big)$. Finally, solving the matrix equation $(A - 5E)u_4 = u_3$ for some unknown $u_4 = (x_1, x_2, x_3, x_4)^T$, we get the generalized eigenvector $u_4 = (0, \tfrac{3}{32}, -\tfrac14, \tfrac18)^T$. Putting all this information together, we deduce that $\operatorname{Ker}\big((A - 5E)^4\big) = \operatorname{span}\{u_1, u_2, u_3, u_4\}$. To summarize, we saw that the vectors $u_4, u_3, u_2$ and $u_1$ satisfy the relations
\[ (A-5E)^4u_4 = 0, \quad (A-5E)^3u_3 = 0, \quad (A-5E)^2u_2 = 0, \quad (A-5E)u_1 = 0, \]
\[ (A-5E)u_4 = u_3, \quad (A-5E)u_3 = u_2, \quad (A-5E)u_2 = u_1. \]
Moreover, a direct computation shows that
\[ (A-5E)^3u_4 = u_1, \quad (A-5E)^2u_4 = u_2, \quad (A-5E)u_4 = u_3. \]
Thus $\{(A - 5E)^3u_4, (A - 5E)^2u_4, (A - 5E)u_4, u_4\} = \{u_1, u_2, u_3, u_4\}$ is a Jordan chain of full length 4. This implies that the generalized eigenspace $R_\lambda$ corresponding to the eigenvalue $\lambda$ has dimension 4 and encompasses all vectors that map to zero under the fourth power of $(A - 5E)$. Thus, $R_\lambda = \operatorname{Ker}\big((A - 5E)^4\big) \cong \mathbb{R}^4$. Actually $R_\lambda$ is the union of the subspaces $\operatorname{Ker}\big((A - 5E)^\ell\big)$ for all $1 \le \ell \le 4$. As another consequence of this chain we can derive a representative of the matrix $P$:
\[ P = \big(\, u_1 \mid u_2 \mid u_3 \mid u_4 \,\big) = \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1/4 & -1/8 & 3/32\\ 0 & 0 & 1/4 & -1/4\\ 0 & 0 & 0 & 1/8 \end{pmatrix}. \]
Check for yourselves, using Sage and the following block, that $P$ satisfies $P^{-1}AP = J$.

A = matrix(SR, [[5, 4, 2, 1], [0, 5, 1, 1], [0, 0, 5, 2], [0, 0, 0, 5]])
J = matrix(SR, [[5, 1, 0, 0], [0, 5, 1, 0], [0, 0, 5, 1], [0, 0, 0, 5]])
P = matrix(SR, [[1, 0, 0, 0], [0, 1/4, -1/8, 3/32], [0, 0, 1/4, -1/4], [0, 0, 0, 1/8]])
# Verify P^(-1)AP = J
print(P.inverse()*A*P == J)

3.E.5.
(i) In 3.E.1 the matrix $U$ was derived from the matrix $A$ by successively applying the elementary row operations $R_2 \to R_2 - 2R_1$, $R_3 \to R_3 - 3R_1$ and $R_3 \to R_3 + R_2$. To obtain the corresponding elementary matrices, we apply each of these row operations to the $3 \times 3$ identity matrix. This results in the following matrices:
\[ E_1 = \begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}, \qquad E_2 = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ -3 & 0 & 1 \end{pmatrix}, \qquad E_3 = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 1 \end{pmatrix}, \]
and the Gauss elimination in terms of matrices has the form
\[ E_3E_2E_1A = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ -3 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} -2 & 1 & 0\\ -4 & 4 & 2\\ -6 & 1 & -1 \end{pmatrix} = \begin{pmatrix} -2 & 1 & 0\\ 0 & 2 & 2\\ 0 & 0 & 1 \end{pmatrix} = U. \]
Since the elementary matrices are always invertible, we obtain $A = E_1^{-1}E_2^{-1}E_3^{-1}U$, which implies that $L = E_1^{-1}E_2^{-1}E_3^{-1}$. The inverses of the elementary matrices are easily computed, leading to the expected result:
\[ L = E_1^{-1}E_2^{-1}E_3^{-1} = \begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 3 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & -1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 3 & -1 & 1 \end{pmatrix}. \]
We also compute
\[ L^{-1} = E_3E_2E_1 = \begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ -5 & 1 & 1 \end{pmatrix}. \]
For a verification using Sage, type the cell

L = matrix([[1, 0, 0], [2, 1, 0], [3, -1, 1]]); show(L.inverse())

(ii) To find the inverse of the matrix $U$ via the Gauss elimination method, we begin by forming the augmented matrix $(\,U \mid E\,)$, where $E$ is the $3 \times 3$ identity matrix. We then successively apply the following elementary row operations: $R_1 \to -\tfrac12 R_1$, $R_1 \to R_1 + \tfrac14 R_2$, $R_1 \to R_1 - \tfrac12 R_3$, $R_2 \to \tfrac12 R_2$ and $R_2 \to R_2 - R_3$. You should be able to verify on your own that this results in the matrix
\[ U^{-1} = \begin{pmatrix} -1/2 & 1/4 & -1/2\\ 0 & 1/2 & -1\\ 0 & 0 & 1 \end{pmatrix}. \]
Here is a quick verification using Sage:

U = matrix([[-2, 1, 0], [0, 2, 2], [0, 0, 1]]); show(U.inverse())

(iii) We have $\det(A) = \det(L)\det(U) = 1 \cdot (-4) = -4 \ne 0$, and hence $A$ is invertible. Now, using the expressions of $L^{-1}$ and $U^{-1}$ obtained above, the matrix $A^{-1}$ occurs very easily:
\[ A^{-1} = (LU)^{-1} = U^{-1}L^{-1} = \begin{pmatrix} -1/2 & 1/4 & -1/2\\ 0 & 1/2 & -1\\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ -5 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 3/2 & -1/4 & -1/2\\ 4 & -1/2 & -1\\ -5 & 1 & 1 \end{pmatrix}. \]
Verify this result in Sage with the same basic method applied for $L$ or $U$.

3.E.6. It is easy to see that $\operatorname{rank}(A) = 2$ and hence there exist two non-zero singular values. Since
\[ A^TA = \begin{pmatrix} 25 & 15\\ 15 & 25 \end{pmatrix}, \]
the characteristic polynomial of $A^TA$ is given by $\chi_{A^TA}(\lambda) = (40 - \lambda)(10 - \lambda)$. Hence we conclude that the eigenvalues of $A^TA$ are $\lambda_1 = 40$ and $\lambda_2 = 10$, so the singular values of $A$ are $\sigma_1 = \sqrt{40} = 2\sqrt{10}$ and $\sigma_2 = \sqrt{10}$, with $\sigma_1 > \sigma_2$.

3.E.7. We see that $A^TA$ is a diagonal matrix:
\[ A^TA = \begin{pmatrix} 0 & -1 & 0\\ 0 & 0 & 0\\ -\tfrac12 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 & 0 & -\tfrac12\\ -1 & 0 & 0\\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & \tfrac14 \end{pmatrix}. \]
From this we deduce that $A^TA$ has three eigenvalues, $\lambda_1 = 1$, $\lambda_2 = 1/4$ and $\lambda_3 = 0$, and two non-zero singular values (note that $\operatorname{rank}(A) = 2$), the latter given by $\sigma_1 = \sqrt{\lambda_1} = 1$ and $\sigma_2 = \sqrt{\lambda_2} = 1/2$, with $\sigma_1 > \sigma_2$. Therefore, the matrix $\Sigma$ has the form $\Sigma = \operatorname{diag}(1, 1/2, 0)$.

3.E.9. According to 3.5.5 we will have $P := U\Sigma U^T$ and $U_p := UV^T$, where $A = U\Sigma V^T$ is the SVD decomposition of $A$. We compute
\[ P = \begin{pmatrix} 0 & -1 & 0\\ -1 & 0 & 0\\ 0 & 0 & -1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ 0 & 1/2 & 0\\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 & -1 & 0\\ -1 & 0 & 0\\ 0 & 0 & -1 \end{pmatrix} = \begin{pmatrix} 1/2 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{pmatrix}, \]
and
\[ U_p = \begin{pmatrix} 0 & -1 & 0\\ -1 & 0 & 0\\ 0 & 0 & -1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & -1\\ -1 & 0 & 0\\ 0 & -1 & 0 \end{pmatrix}. \]
Thus, the polar decomposition of the matrix $A$ has the form
\[ A = \begin{pmatrix} 0 & 0 & -1/2\\ -1 & 0 & 0\\ 0 & 0 & 0 \end{pmatrix} = PU_p = \begin{pmatrix} 1/2 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 & 0 & -1\\ -1 & 0 & 0\\ 0 & -1 & 0 \end{pmatrix}. \]
A verification using Sage is easy: A=matrix([[0, 0, -1/2], [-1, 0, 0], [0, 0, 0]]) S=diagonal_matrix([1, 1/2, 0]) V=matrix([[1, 0, 0], [0, 0, 1], [0, 1, 0]]) U=matrix([[0, -1, 0], [-1, 0, 0], [0, 0, -1]]) P=U*S*(U.T) show(P) Upolar=U*(V.T) show(Upolar) A==P*Upolar The polar decomposition is not unique, since A is singular. Next it easy to see that P is Hermitian and positive semi-definite. For example, for the second claim we see that P has three non-negative eigenvalues: 1, 1/2 and 0. In Sage you can confirm these claims by the adding the cell show(P.eigenvalues()) P.is_hermitian() 3.F.10. The answer is given by xn = (−1)n (−2n2 + 8n − 7). 3.F.23. The required matrix has the form (5 6 1 5 1 6 4 5 ) . This matrix has the dominant eigenvalue 1, with corresponding eigenvector given by (6 5 , 1)T . Because the eigenvalue is dominant, the ratio of the viewers stabilises on 6 : 5. 3.F.25. The game ends after three bets (the same as in 3.F.24). Thus, all the powers of A, starting with A3 , are identical. Consequently the answer is given as follows: A100 = A3 =       1 7/8 3/4 1/2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1/8 1/4 1/2 1       . 3.F.31. It is easy to see that F is linear. Also we compute ker(F) = {( a −a a −a ) : a ∈ R } hence dimR(ker(F)) = 1 and a basis of ker(F) consists of the matrix u = ( 1 −1 1 −1 ) . Since B(u, u) = 4, we deduce that {ˆu = 1 2 u} is an orthonormal basis of ker(F). 3.F.36. The given rotation is easily obtained by composing the following three mappings: • rotation through the angle π/4 in the negative sense about the axis z (the axis of the rotation goes over on the x axis); • rotation through the angle π/3 in the positive sense about the x axis; • rotation through the angle π/4 in the positive sense about the z axis (the x axis goes over on the axis of the rotation). The matrix of the resulting rotation is the product of the matrices corresponding to the given three mappings, while the order of the matrices is given by the order of application of the mappings – the first mapping applied is in the product the rightmost one. Thus we obtain the desired matrix   √ 2 2 − √ 2 2 0√ 2 2 √ 2 2 0 0 0 1    ·    1 0 0 0 1 2 − √ 3 2 0 √ 3 2 1 2    ·    √ 2 2 √ 2 2 0 − √ 2 2 √ 2 2 0 0 0 1    =    3 4 1 4 √ 6 4 1 4 3 4 − √ 6 4 − √ 6 4 √ 6 4 1 2    . Note that the resulting rotation could be also obtained for instance by taking the composition of the three following mappings: • rotation through the angle π/4 in the positive sense about the axis z (the axis of rotation goes over on the axis y); • rotation through the angle π/3 in the positive sense about the axis y; • rotation through the angle π/4 in the negative sense about the axis z (the axis y goes over to the axis of rotation). Analogously we obtain 318 CHAPTER 3. LINEAR MODELS AND MATRIX CALCULUS    √ 2 2 √ 2 2 0 − √ 2 2 √ 2 2 0 0 0 1    ·    1 2 0 √ 3 2 0 1 0 − √ 3 2 0 1 2    ·    √ 2 2 − √ 2 2 0√ 2 2 √ 2 2 0 0 0 1    =    3 4 1 4 √ 6 4 1 4 3 4 − √ 6 4 − √ 6 4 √ 6 4 1 2    . 3.F.39. For any x, y ∈ Rn and with respect to the standard dot product on Rn we compute ⟨LAx, y⟩ = ⟨Ax, y⟩ = yT Ax = yT (AT )T x = (AT y)T x = ⟨x, AT y⟩ = ⟨x, Ay⟩ where the last equality follows since by assumption A = AT . 3.F.40. First notice that the operator LA : Matm,n(F) → Matm,n(F) is linear since LA(B+C) = A(B+C) = AB+AC = LA(B) + LA(C) and LA(zB) = A(zB) = z(AB) = zLA(B) for any B, C ∈ Matm,n(F) and z ∈ F. 
Its adjoint operator L∗ A is the linear map L∗ A : Matm,n(F) → Matm,n(F) satisfying B ( LA(B), C ) = B ( B, L∗ A(C) ) , B, C ∈ Matm,n(F) . We will show that for L∗ A = A∗ this holds like an identity. We compute B ( LA(B), C ) = tr ( C∗ (AB) ) = tr(C∗ AB) , B ( B, L∗ A(C) ) = B(B, A∗ C) = tr ( (A∗ C)∗ B ) = tr ( (C∗ (A∗ )∗ )B) = tr(C∗ AB) . Since L∗ A = A∗ and A is by assumption Hermitian, our claim follows. 3.F.41. Let y ∈ V . Then we have the following equivalences: y ∈ ker(f∗ ) ⇐⇒ f∗ (y) = 0 , ⇐⇒ ⟨x, f∗ (y)⟩ = 0 , for any x ∈ V , ⇐⇒ ⟨f(x), y⟩ = 0 , for any x ∈ V , ⇐⇒ y ⊥ f(x) , for any x ∈ V , ⇐⇒ y ∈ (Im(f))⊥ . Thus ker(f∗ ) = (Im(f))⊥ . Similarly is treated the second relation. 3.F.42. Reflections matrices are unitary since they preserve the length of vectors. In particular, [φu][φu]∗ = (2uuT −E)(2uuT −E)∗ = (2uuT −E)(2uuT −E) = 4u(uT u)uT −4uuT +E = 4uuT −4uuT +E = E , since u is a unit vector, i.e., ⟨u, u⟩ = uT u = 1. 3.F.43. Observe that (A∗ A)∗ = A∗ A hence the matrix B = A∗ A is Hermitian, B = B∗ . Then the result follows since any Hermitian matrix has real eigenvalues, see 3.4.7. For the second statement, we have AA∗ = U∗ DU(U∗ DU)∗ = U∗ DUU∗ D∗ U = U∗ DD∗ U and A∗ A = (U∗ DU)∗ U∗ DU = U∗ D∗ UU∗ DU = U∗ D∗ DU. Now the result follows since D is diagonal and hence we have DD∗ = D∗ D (as one can easily verify). 3.F.44. It should be a = 2 + 2i, b = 1 − i and c = i. One can verify this answer in Sage as follows: A=matrix(SR, 3, 3, [[1, 2+2*I, -I], [2-2*I, 0, 1-I], [I, 1+I, 0]]) A.is_hermitian() Could you suggest an alternative method in Sage to verify the same result? 3.F.51. The answer is dim Matn(R) = n2 . We return to the view on geometry that we had when we studied positions of points in the plane in the 5th part of the first chapter, c.f. 1.5.1. We are interested in the properties of objects in the Euclidean space, delimited by points, straight lines, planes etc. The essential point is to clarify how their properties are related to the notion of vectors, and whether they depend on the notion of the length of vectors. Later in the book, we use linear and mutli-linear algebra to study objects defined in a nonlinear way. In order to facilitate this we provide some insight in the so called analytic geometry, based on the application of matrix calculus. This point of view is most useful in discussion of the technique of optimization, e.g. when searching for (constrained) extrema of functions. At the end of this chapter we show how the projectivization of affine spaces helps us to obtain simplification and stability of algorithms typical for computer graphics. 1. Affine and Euclidean geometry While clarifying the structure of solutions of linear equations in the second chapter we find in paragraph 2.3.5 that the set of all solutions of a nonhomogeneous system of linear equations does not form a vector space. However, the solutions always arise in such a way that to one particular solution we can add the vector space of solutions to the corresponding homogeneous system. On the other hand, the difference of any two solutions of the nonhomogeneous system is always a solution of the homogeneous system. This behaviour is similar to the behaviour of linear difference equations. We see this already in paragraph 3.2.6. 4.1.1. Affine spaces. Going back to paragraph 1.5.3 (and further on) provides a hint how to deal with the theory in any finite dimension. There we describe straight lines and points as sets of solutions of systems of linear equations. 
A line is considered as a one-dimensional subspace, although its points are described by two coordinates. Parametrically, the line is defined by the sum of a single point (that is, a pair of coordinates) and multiples of a fixed direction vector. We proceed now in the same way for arbitrary dimensions. CHAPTER 4 Analytic geometry position, incidence, projection – and we return to matrices again... A. Affine geometry 4.A.1. Find a parametric equation for a line in R3 given by the equations x − 2y + z = 2, 2x + y − z = 5. Solution. It is sufficient to solve the equation system. However, there is an alternative approach. Find a non-zero direction vector orthogonal to the normal vectors (1, −2, 1), (2, 1, −1). The cross product (1, −2, 1) × (2, 1, −1) = (1, 3, 5) is such a vector. The triple [x, y, z] = [2, −1, −2] satisfies the respective system, so a solution is [2, −1, −2] + t (1, 3, 5) , t ∈ R. □ 4.A.2. A plane in R4 is given by its parametric equation ϱ : [0, 3, 2, 5] + t (1, 0, 1, 0) + s (2, −1, −2, 2) , t, s ∈ R Find its implicit equation. Solution. The task is to find a system of equations with 4 variables x, y, z, u (the dimension of the space is 4) which are satisfied by the coordinates of those points which lie in the plane. The desired system must contain 2 = 4−2 linearly CHAPTER 4. ANALYTIC GEOMETRY Standard affine space Standard affine space An is a set of all points in Rn = An together with an operation “+” which assigns the point A + v = (a1 + v1, . . . , an + vn) ∈ Rn = An to a point A = (a1, . . . , an) ∈ An and a vector v = (v1, . . . , vn) ∈ Rn = V. This operation satisfies the following three properties: (1) A + 0 = A for all points A ∈ An and the null vector 0 ∈ V , (2) A + (v + w) = (A + v) + w for all vectors v, w ∈ V and points A ∈ An, (3) for every two points A, B ∈ An there exists exactly one vector v ∈ V such that A + v = B. This vector is denoted by v = B − A, sometimes also ⃗AB. The underlying vector space Rn is called the difference space of the standard affine space An. Notice that care is needed about several formal ambiguities. In particular, the symbol “+” is used for two different operations. “+” is used for adding a vector from the difference space to a point in the affine space. “+” is also used for summing vectors in the difference space V = Rn . Further, notice that the operation “−” assigns a vector to a couple of points and cannot be understood as inverse of the addition (athough in the standard affine space it is given by the difference of the n-tuples B and A). We do not introduce specific letters for the set of points in the affine space. An denotes both this set of points as well as the whole structure defining the affine space. Why distinguish between the set of points in the affine space An and its difference space V when both spaces can be viewed as Rn ? It is a fundamental formal step to understanding the geometry in Rn : The issue is that geometric objects, namely straight lines, points, planes etc. do not depend directly on the vector space structure of the set Rn . They do not depend at all on the fact that we work with n–tuples of scalars. We need to know only what it means to move “straight in a given direction”. For instance, we can consider the affine plane as an unbounded board without chosen coordinates, but with the possibility of moving about a given vector. When we switch to such an abstract view, we can discuss the “plane geometry” for two-dimensional subspaces, without the need to work with k–tuples of coordinates. 
4.A.2. A plane in R⁴ is given by its parametric equation

ϱ : [0, 3, 2, 5] + t (1, 0, 1, 0) + s (2, −1, −2, 2), t, s ∈ R.

Find its implicit equation.

Solution. The task is to find a system of equations in the 4 variables x, y, z, u (the dimension of the space is 4) which is satisfied exactly by the coordinates of the points of the plane. The desired system must contain 2 = 4 − 2 linearly independent equations. Solve the problem by elimination of the parameters. The points [x, y, z, u] ∈ ϱ satisfy

x = t + 2s, y = 3 − s, z = 2 + t − 2s, u = 5 + 2s, where t, s ∈ R.

Write the system as a matrix
\[
\left(\begin{array}{cc|cccc|c}
1 & 2 & -1 & 0 & 0 & 0 & 0\\
0 & -1 & 0 & -1 & 0 & 0 & 3\\
1 & -2 & 0 & 0 & -1 & 0 & 2\\
0 & 2 & 0 & 0 & 0 & -1 & 5
\end{array}\right).
\]
The first two columns are the direction vectors of the plane, followed by the negative identity matrix; the last column consists of the coordinates of the point [0, 3, 2, 5]. This is now a system in t, s, x, y, z, u. Transform the matrix by elementary row operations so as to obtain as many rows as possible with only zeros to the left of the first vertical line. Adding (−1)-times the first row and (−4)-times the second row to the third row, and adding twice the second row to the fourth row, gives
\[
\left(\begin{array}{cc|cccc|c}
1 & 2 & -1 & 0 & 0 & 0 & 0\\
0 & -1 & 0 & -1 & 0 & 0 & 3\\
0 & 0 & 1 & 4 & -1 & 0 & -10\\
0 & 0 & 0 & -2 & 0 & -1 & 11
\end{array}\right).
\]
The bottom two rows, both with only zeros to the left of the first vertical line, imply

x + 4y − z − 10 = 0, −2y − u + 11 = 0.

Note that the original system can also be written as
\[
\left(\begin{array}{cccc|cc|c}
1 & 0 & 0 & 0 & 1 & 2 & 0\\
0 & 1 & 0 & 0 & 0 & -1 & 3\\
0 & 0 & 1 & 0 & 1 & -2 & 2\\
0 & 0 & 0 & 1 & 0 & 2 & 5
\end{array}\right),
\]
where x, y, z, u remain on the left-hand sides of the equations. A similar transformation gives
\[
\left(\begin{array}{cccc|cc|c}
1 & 0 & 0 & 0 & 1 & 2 & 0\\
0 & 1 & 0 & 0 & 0 & -1 & 3\\
-1 & -4 & 1 & 0 & 0 & 0 & -10\\
0 & 2 & 0 & 1 & 0 & 0 & 11
\end{array}\right),
\]
from which −x − 4y + z = −10, 2y + u = 11. As seen in this exercise, the elimination of parameters can be long-winded, and it is not difficult to make a mistake along the way.

Another solution. All that is needed are two linearly independent vectors perpendicular to (1, 0, 1, 0) and (2, −1, −2, 2). If we "guessed" that these vectors could be, for example, (0, 2, 0, 1) and (−1, 0, 1, 2), then putting x = 0, y = 3, z = 2, u = 5 into the equations

2y + u = a, −x + z + 2u = b

yields a = 11, b = 12. The desired implicit expression is

2y + u = 11, −x + z + 2u = 12.

Standard affine space

The standard affine space An is the set of all points in Rⁿ = An together with an operation "+", which assigns the point A + v = (a1 + v1, . . . , an + vn) ∈ Rⁿ = An to a point A = (a1, . . . , an) ∈ An and a vector v = (v1, . . . , vn) ∈ Rⁿ = V. This operation satisfies the following three properties:
(1) A + 0 = A for all points A ∈ An and the null vector 0 ∈ V,
(2) A + (v + w) = (A + v) + w for all vectors v, w ∈ V and points A ∈ An,
(3) for every two points A, B ∈ An there exists exactly one vector v ∈ V such that A + v = B. This vector is denoted by v = B − A, sometimes also ⃗AB.

The underlying vector space Rⁿ is called the difference space of the standard affine space An.

Notice that care is needed about several formal ambiguities. In particular, the symbol "+" is used for two different operations: for adding a vector from the difference space to a point of the affine space, and for summing vectors in the difference space V = Rⁿ. Further, notice that the operation "−" assigns a vector to a couple of points and cannot be understood as the inverse of the addition (although in the standard affine space it is given by the difference of the n-tuples B and A). We do not introduce specific letters for the set of points of the affine space: An denotes both this set of points and the whole structure defining the affine space.

Why distinguish between the set of points of the affine space An and its difference space V when both can be viewed as Rⁿ? It is a fundamental formal step towards understanding the geometry in Rⁿ: the issue is that geometric objects, namely straight lines, points, planes etc., do not depend directly on the vector space structure of the set Rⁿ, and they do not depend at all on the fact that we work with n-tuples of scalars. We only need to know what it means to move "straight in a given direction". For instance, we can consider the affine plane as an unbounded board without chosen coordinates, but with the possibility of moving about a given vector. When we switch to such an abstract view, we can discuss the "plane geometry" for two-dimensional subspaces without the need to work with k-tuples of coordinates.

This point of view underlies the following definition:

4.1.2. Definition. The affine space A with difference space V is a set of points P, together with a map P × V → P, (A, v) ↦ A + v, where V is a vector space and the map satisfies the properties (1)–(3) from the definition of the standard affine space. For a fixed vector v ∈ V, the translation τv : A → A is the restricted map τv : P ≃ P × {v} → P, A ↦ A + v.

By the dimension of an affine space A we mean the dimension of its difference space. In the sequel, we do not distinguish accurately between A and the set of points P; we talk instead about points and vectors of the affine space A.

It follows immediately from the axioms that for arbitrary points A, B, C in the affine space A,

(1) A − A = 0 ∈ V,
(2) B − A = −(A − B),
(3) (C − B) + (B − A) = C − A.

Indeed, (1) follows from the fact that A + 0 = A and that such a vector is unique (the first and third defining properties). By adding successively B − A and A − B to A, according to the second defining property we obtain A again; together with the uniqueness of the null vector this proves (2). Similarly, (3) follows from the defining property 4.1.1(2) and the uniqueness.

Notice that the choice of one fixed point A0 ∈ A determines a bijection between V and A. So for a fixed basis u of V there is a unique expression

A = A0 + x1u1 + · · · + xnun

for every point A ∈ A. We talk about an affine coordinate system (A0; u1, . . . , un) given by the origin A0 and the basis u of the corresponding difference space. This is sometimes called an affine frame (A0, u). To summarize:

Affine coordinates of a point A in the frame (A0, u) are the coordinates of the vector A − A0 in the basis u of the difference space V.
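In concrete coordinates, passing between a frame and the standard description is a single linear solve. A minimal Sage sketch, borrowing the frame of exercise 4.A.12 below (the point [11, 5, 12] is its standard-coordinate answer):

A0 = vector(QQ, [1, 2, 3])
u = matrix(QQ, [[1, 1, 1], [1, -1, 2], [2, 1, 1]]).transpose()  # basis vectors as columns
A = vector(QQ, [11, 5, 12])        # a point given in standard coordinates
x = u.solve_right(A - A0)          # its affine coordinates in the frame (A0; u)
print(x)                           # (2, 2, 3)

The inverse direction is just the evaluation A0 + u*x.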
The choice of an affine coordinate system identifies each n-dimensional affine space A with the standard affine space An.

4.1.3. Affine subspaces. If we consider some coordinate system on A and choose only those points of A which have some chosen coordinates equal to zero (for instance the last one), we obtain again a set which behaves as an affine space. This is the spirit of the following definition of affine subspaces.

Subspaces of an affine space

Definition. A nonempty subset Q ⊂ A of an affine space A with difference space V is called an affine subspace in A if the subset W = {B − A; A, B ∈ Q} ⊂ V is a vector subspace and A + v ∈ Q for any A ∈ Q, v ∈ W.

It is important to include both conditions in the definition, since there are sets which satisfy the first condition but not the second. One such set is a straight line in the plane with one point removed.

For an arbitrary set of points M ⊂ A in an affine space with difference space V, we define the vector subspace

Z(M) = span{B − A; B, A ∈ M} ⊂ V

of all vectors generated by the differences of points in M.

Another solution. Since

x = t + 2s, y = 3 − s, z = 2 + t − 2s, u = 5 + 2s,

eliminating t gives

z − x = 2 − 4s, y = 3 − s, u = 5 + 2s.

Eliminating s then yields the two equations

z − x + 2u = 12, u + 2y = 11,

which solves the problem. □

4.A.3. Find a parametric equation of the plane passing through the points A = [2, 1, 1], B = [3, 4, 5], C = [4, −2, 3]. Then find a parametric description of the open half-plane containing the point C and bounded by the line passing through the points A, B.

Solution. We need one point and two (linearly independent) vectors lying in the plane. It is enough to choose A together with the vectors B − A = (1, 3, 4) and C − A = (2, −3, 2), which are clearly independent. A point [x, y, z] lies in the plane if and only if there exist numbers t, s ∈ R such that

x = 2 + 1·t + 2·s, y = 1 + 3·t − 3·s, z = 1 + 4·t + 2·s.

Consequently, a parametric equation is

[x, y, z] = [2, 1, 1] + t (1, 3, 4) + s (2, −3, 2), t, s ∈ R.

Setting s = 0 gives the line passing through the points A and B. Setting t = 0 and s ≥ 0 defines the ray with initial point A passing through C. A particular but arbitrarily chosen t ∈ R together with variable s > 0 gives a ray with initial point on the border line, passing through the half-plane in which the point C lies. So the desired open half-plane can be expressed parametrically as

[x, y, z] = [2, 1, 1] + t (1, 3, 4) + s (2, −3, 2), t ∈ R, s > 0. □

4.A.4. Determine the relative position of the lines

p : [1, 0, 3] + t (2, −1, −3), t ∈ R,
q : [1, 1, 3] + s (1, −1, −2), s ∈ R.

Solution. Search for common points of the given lines (the intersection of the subspaces). We obtain the system

1 + 2t = 1 + s, 0 − t = 1 − s, 3 − 3t = 3 − 2s.

In particular, V = Z(A). Every affine subspace Q ⊂ A itself satisfies the axioms for an affine space, with difference space Z(Q). The intersection of any set of affine subspaces is either an affine subspace or the empty set; this follows directly from the definitions. The affine subspace ⟨M⟩ in A generated by a nonempty set M ⊂ A is the intersection of all affine subspaces which contain all points of M.
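Looking back at the parameter elimination in 4.A.2, the implicit equations can also be produced in one step: the normal vectors of the plane form the kernel of the matrix whose rows are the direction vectors. A small Sage sketch of this approach:

M = matrix(QQ, [[1, 0, 1, 0], [2, -1, -2, 2]])   # rows: the direction vectors of ϱ
K = M.right_kernel()
A = vector(QQ, [0, 3, 2, 5])                     # a point of the plane
for a in K.basis():
    print(a, a.dot_product(A))    # each a gives one equation a·(x,y,z,u) = a·A

With the echelonized kernel basis this yields, up to scaling, the equations x + 4y − z = 10 and 2y + u = 11, in accordance with 4.A.2.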
Affine hull and parametric description

Affine subspaces can be described by their difference spaces after choosing a point A0 ∈ M in a generating set M. Indeed, to generate the affine subspace, take the vector subspace Z(M) ⊂ Z(A) generated by all differences of points in M, and add this vector space to an arbitrary point in M, i.e.,

⟨M⟩ = {A0 + v; v ∈ Z(M) ⊂ Z(A)}.

We talk about the affine hull of the set of points M in A.

On the other hand, whenever a subspace U in the difference space Z(A) and a fixed point A ∈ A are chosen, the subset A + U, formed by all sums of A with vectors in U, is an affine subspace. This approach leads to the notion of a parametrization of subspaces:

Let Q = A + Z(Q) be an affine subspace in An and let (u1, . . . , uk) be a basis of Z(Q) ⊂ Rⁿ. Then the expression

Q = {A + t1u1 + · · · + tkuk; t1, . . . , tk ∈ R}

is called a parametric description of the subspace Q.

In the sequel, we use the same notation ⟨ ⟩ for affine hulls in affine spaces and for linear hulls in their difference vector spaces.

There is another way of prescribing affine subspaces. If we choose affine coordinates, then the difference space may be described by a homogeneous system of linear equations in these coordinates. By inserting the coordinates of one point of the subspace Q into the system, we obtain the right-hand side of a non-homogeneous system with the same matrix. The subspace Q is exactly the set of solutions of this system. The description of a subspace Q by a system of equations in given coordinates is called an implicit description of the subspace Q. The following proposition says that all affine subspaces can be prescribed in this way; it shows the geometric nature of the solutions of systems of linear equations.

4.1.4. Theorem. Let (A0; u) be an affine coordinate system in an n-dimensional affine space A. In these coordinates, the affine subspaces of dimension k in A are exactly the sets of solutions of solvable systems of n − k linearly independent linear equations in n variables.

Proof. Consider an arbitrary solvable system of n − k linearly independent equations αi(x) = bi, where bi ∈ R, i = 1, . . . , n − k.

From the first two equations of the system in 4.A.4, t = 1, s = 2. This does not satisfy the third equation, so the system has no solution. Moreover, the direction vector (2, −1, −3) of the line p is not a multiple of the direction vector (1, −1, −2) of the line q, hence the lines are not parallel. The lines are therefore skew. □

4.A.5. Find all numbers a ∈ R for which the lines

p : [4, −4, 8] + t (2, 1, −4), t ∈ R,
q : [a, 6, −5] + s (1, −3, 3), s ∈ R

intersect.

Solution. The lines intersect if and only if the system

4 + 2t = a + s, −4 + t = 6 − 3s, 8 − 4t = −5 + 3s

has a solution. Express the system as a matrix (the first column corresponding to t, the second to s) and solve:
\[
\left(\begin{array}{cc|c}
2 & -1 & a-4\\ 1 & 3 & 10\\ -4 & -3 & -13
\end{array}\right) \sim
\left(\begin{array}{cc|c}
1 & 3 & 10\\ 2 & -1 & a-4\\ -4 & -3 & -13
\end{array}\right) \sim
\left(\begin{array}{cc|c}
1 & 3 & 10\\ 0 & -7 & a-24\\ 0 & 1 & 3
\end{array}\right).
\]
The system has a solution if and only if the second row is a multiple of the third one, which happens only for a = 3. The point of intersection of the lines is then [6, −3, 4]. □
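Exercises of the type of 4.A.5 reduce to the solvability of a small linear system, which can be left entirely to a symbolic solver. A minimal Sage sketch:

var('t s a')
eqs = [4 + 2*t == a + s, -4 + t == 6 - 3*s, 8 - 4*t == -5 + 3*s]
print(solve(eqs, t, s, a))    # [[t == 1, s == 3, a == 3]]

So the lines meet only for a = 3, at t = 1, i.e. at the point [6, −3, 4], as found above.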
4.A.6. In R³, determine the relative position of the line p given implicitly by

x + y − z = 4, x − 2y + z = −3

and the plane ϱ : y = 2x − 1.

Solution. A normal vector of the plane ϱ is (2, −1, 0) (consider ϱ : 2x − y + 0z = 1). Since (1, 1, −1) + (1, −2, 1) = (2, −1, 0), the normal vector of the plane ϱ is a linear combination of the normal vectors of p. Hence the direction vector of the line p is parallel to the plane ϱ. It remains to discover whether or not they intersect. The system of equations

x + y − z = 4, x − 2y + z = −3, 2x − y = 1

has infinitely many solutions, because the first two equations add up to give the third one. So the line p lies in the plane ϱ. □

The following exercise is a typical exercise on the intersection of vector spaces. The reader should be able to solve it; otherwise we recommend not continuing with this book.

(Continuation of the proof of Theorem 4.1.4.) Suppose A = (a1, . . . , an)ᵀ ∈ Rⁿ is a fixed solution of this (non-homogeneous) system, and suppose U ⊂ Rⁿ is the vector space of all solutions of the homogenized system αi(x) = 0. Then the dimension of U is k, and the set of all solutions of the given system is of the form

{B; B = A + (y1, . . . , yn)ᵀ, y = (y1, . . . , yn)ᵀ ∈ U} ⊂ Rⁿ,

cf. 2.3.5. So the corresponding affine subspace is described parametrically in the initial coordinates (A0; u).

Conversely, consider an arbitrary affine subspace Q ⊂ An. Choose a point B therein, and consider this point as the origin of an affine coordinate system (B, v) for the affine space A. Since Q = B + Z(Q), it is necessary to describe the difference space of the subspace Q as the space of solutions of a homogeneous system of linear equations. Therefore, choose a basis v of Z(A) such that the first k vectors form a basis of Z(Q). In these coordinates, the vectors v ∈ Z(Q) are given by the equations αj(v) = 0, j = k + 1, . . . , n, where the αj are the linear forms of the dual basis to v, i.e. the functions which assign to a vector its corresponding coordinate in the basis v. Hence the vector subspace Z(Q) of dimension k in the n-dimensional space Rⁿ is given as the set of solutions of a homogeneous system of n − k independent equations. The description of the chosen affine subspace in the newly chosen coordinate system (B; v) is therefore given by a system of linear equations. It remains to consider the consequences of the transition from the former coordinate system (A0; u) to the new adapted system (B; v). It follows from the general considerations about transformations of coordinates in the following paragraph that the final description of the subspace is again a system of linear equations, this time non-homogeneous in general. □

4.1.5. Coordinate transformations. Two arbitrarily chosen affine coordinate systems (A0, u), (B0, v) differ in the bases of the difference spaces, and the origin of the latter is translated by the vector (B0 − A0). Hence the equations of the corresponding coordinate transformation can be read off from the transformation rule for a point X ∈ A:

X = B0 + x′1v1 + · · · + x′nvn = B0 + (A0 − B0) + x1u1 + · · · + xnun.

Let y = (y1, . . . , yn)ᵀ denote the column of coordinates of the vector (A0 − B0) in the basis v, and let M = (aij) be the matrix expressing the basis u in terms of the basis v. Then
\[
\begin{aligned}
x'_1 &= y_1 + a_{11}x_1 + \cdots + a_{1n}x_n,\\
&\ \ \vdots\\
x'_n &= y_n + a_{n1}x_1 + \cdots + a_{nn}x_n,
\end{aligned}
\]
in matrix notation x′ = y + M · x.

4.A.7. Find the intersection of the subspaces Q1 and Q2, where

Q1 : [4, −5, 1, −2] + t1 (3, 5, 4, 2) + t2 (2, 4, 5, 1) + t3 (0, 3, 1, 2),
Q2 : [4, 4, 4, 4] + s1 (0, −6, −2, −4) + s2 (−1, −5, −3, −3),

for t1, t2, t3, s1, s2 ∈ R.

Solution. The point X = [x1, x2, x3, x4] ∈ R⁴ lies in Q1 if and only if
\[
\begin{pmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{pmatrix} =
\begin{pmatrix} 4\\ -5\\ 1\\ -2 \end{pmatrix} +
t_1\begin{pmatrix} 3\\ 5\\ 4\\ 2 \end{pmatrix} +
t_2\begin{pmatrix} 2\\ 4\\ 5\\ 1 \end{pmatrix} +
t_3\begin{pmatrix} 0\\ 3\\ 1\\ 2 \end{pmatrix}
\]
for some numbers t1, t2, t3 ∈ R.
The point X = [x1, x2, x3, x4] ∈ R⁴ lies in Q2 if and only if
\[
\begin{pmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{pmatrix} =
\begin{pmatrix} 4\\ 4\\ 4\\ 4 \end{pmatrix} +
s_1\begin{pmatrix} 0\\ -6\\ -2\\ -4 \end{pmatrix} +
s_2\begin{pmatrix} -1\\ -5\\ -3\\ -3 \end{pmatrix}
\]
for some s1, s2 ∈ R. Hence X lies in Q1 ∩ Q2 if and only if the equation
\[
t_1\begin{pmatrix} 3\\ 5\\ 4\\ 2 \end{pmatrix} +
t_2\begin{pmatrix} 2\\ 4\\ 5\\ 1 \end{pmatrix} +
t_3\begin{pmatrix} 0\\ 3\\ 1\\ 2 \end{pmatrix} =
\begin{pmatrix} 4-4\\ 4+5\\ 4-1\\ 4+2 \end{pmatrix} +
s_1\begin{pmatrix} 0\\ -6\\ -2\\ -4 \end{pmatrix} +
s_2\begin{pmatrix} -1\\ -5\\ -3\\ -3 \end{pmatrix}
\]
has a solution in t1, t2, t3, s1, s2. Move the vectors corresponding to s1 and s2 to the left-hand side, write the equations in matrix form and reduce to echelon form:
\[
\left(\begin{array}{ccccc|c}
3 & 2 & 0 & 0 & 1 & 0\\
5 & 4 & 3 & 6 & 5 & 9\\
4 & 5 & 1 & 2 & 3 & 3\\
2 & 1 & 2 & 4 & 3 & 6
\end{array}\right) \sim
\left(\begin{array}{ccccc|c}
3 & 2 & 0 & 0 & 1 & 0\\
0 & 2 & 9 & 18 & 10 & 27\\
0 & 7 & 3 & 6 & 5 & 9\\
0 & -1 & 6 & 12 & 7 & 18
\end{array}\right) \sim \cdots \sim
\left(\begin{array}{ccccc|c}
3 & 0 & 0 & 0 & 0 & 0\\
0 & 2 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 2 & 0 & 3\\
0 & 0 & 0 & 0 & 1 & 0
\end{array}\right).
\]
So t1 = t2 = s2 = 0, and for s1 = t ∈ R we have t3 = 3 − 2t. Note that for the determination of Q1 ∩ Q2 it suffices to know either t1, t2, t3, or s1, s2. So
\[
\begin{pmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{pmatrix} =
\begin{pmatrix} 4\\ 4\\ 4\\ 4 \end{pmatrix} +
s_1\begin{pmatrix} 0\\ -6\\ -2\\ -4 \end{pmatrix} +
s_2\begin{pmatrix} -1\\ -5\\ -3\\ -3 \end{pmatrix} =
\begin{pmatrix} 4\\ 4\\ 4\\ 4 \end{pmatrix} +
t\begin{pmatrix} 0\\ -6\\ -2\\ -4 \end{pmatrix}.
\]
This can be checked using t1 = t2 = 0 and t3 = 3 − 2t. The intersection Q1 ∩ Q2 is thus a line lying in both planes. □

Now we can come back to the proof of the previous theorem. Suppose the system of linear equations in the "new" coordinates (B0; v) has the form S · x′ = b, where S is the matrix of the system. Then, in the original coordinates (A0; u),

S · x′ = S · (y + M · x) = b.

Thus in the original coordinates the system has the form S · M · x = b − S · y. Therefore, if a subspace is described by a system of linear equations in one affine frame, then it is so described in all other affine frames. This completes the proof of the previous proposition. □

4.1.6. Examples of subspaces. (1) The one-dimensional (standard) affine space is the set of all points of the real straight line A1. Its difference space is the one-dimensional vector space R, and the supporting set (i.e. the set of points of A1) is also R. Affine coordinates are obtained by the choice of an origin and a scale (i.e. a basis of the vector space R). All proper affine subspaces are 0-dimensional; they are the individual points of the real straight line R.

(2) The two-dimensional (standard) affine space is the set of all points of the space A2, with difference space R². The supporting set is R². Affine coordinates are obtained by a choice of an origin and two linearly independent vectors (directions and scales). The proper subspaces are then all points and straight lines in the plane (0-dimensional and 1-dimensional). The lines are prescribed by the choice of a point and one vector from the corresponding difference space; the vector generates the direction, as in the parametric definition of a straight line.

(3) The three-dimensional (standard) affine space is the set of all points of the space A3, with difference space R³. Affine coordinates are obtained by the choice of an origin and three linearly independent vectors (directions and scales). The proper affine subspaces are then all points, straight lines and planes (0-dimensional, 1-dimensional and 2-dimensional).

(4) Suppose there are given a nonzero vector of coefficients (a1, . . . , an) and a scalar b ∈ R. Consider the subspace of all solutions of the single linear equation a · x = b for the unknown point [x1, . . . , xn] ∈ An. This is an affine subspace of dimension n − 1. We say that the subspace is of codimension 1; it is called a hyperplane in An.
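The whole computation of 4.A.7 can be delegated to a symbolic solver. A short Sage sketch (the free parameter in the answer is named by Sage, typically r1):

var('t1 t2 t3 s1 s2')
u1 = vector([3,5,4,2]); u2 = vector([2,4,5,1]); u3 = vector([0,3,1,2])
v1 = vector([0,-6,-2,-4]); v2 = vector([-1,-5,-3,-3])
X1 = vector([4,-5,1,-2]) + t1*u1 + t2*u2 + t3*u3
X2 = vector([4,4,4,4]) + s1*v1 + s2*v2
print(solve([X1[i] == X2[i] for i in range(4)], t1, t2, t3, s1, s2))

The one-parameter family of solutions (t1 = t2 = s2 = 0, t3 = 3 − 2r, s1 = r) reproduces the line found above.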
4.1.7. Affine combinations of points. We introduce an analogue of the linear combination of vectors. Let A0, . . . , Ak be points in the affine space A. Their affine hull ⟨{A0, . . . , Ak}⟩ can be written as the set of points

{A0 + t1(A1 − A0) + · · · + tk(Ak − A0); t1, . . . , tk ∈ R}.

4.A.8. Determine whether or not the points [0, 2, 1], [−1, 2, 0], [−2, 5, 2] and [0, 5, 4] in R³ all lie in the same plane.

Solution. Consider the vectors [0, 2, 1] − [−1, 2, 0] = (1, 0, 1), [0, 2, 1] − [−2, 5, 2] = (2, −3, −1) and [0, 2, 1] − [0, 5, 4] = (0, −3, −3). They are linearly dependent, since the matrix
\[
\begin{pmatrix} 1 & 0 & 1\\ 2 & -3 & -1\\ 0 & -3 & -3 \end{pmatrix}
\]
has rank 2. Hence the given points lie in a plane. □

4.A.9. Into how many parts can three planes slice the space R³? Give an example of planes in a suitable position for every case.

4.A.10. Determine whether or not the point [2, 1, 0] lies within the convex hull of the points [0, 2, 1], [1, 0, 1], [3, −2, −1], [−1, 0, 1].

Solution. The point [2, 1, 0] lies in the convex hull (see 4.1.9) if and only if

[2, 1, 0] = t1[0, 2, 1] + t2[1, 0, 1] + t3[3, −2, −1] + t4[−1, 0, 1]

has a solution with t1, t2, t3, t4 all non-negative and t1 + t2 + t3 + t4 = 1. Equivalently, [2, 1, 0] lies in the convex hull if and only if

[2, 1, 0, 1] = t1[0, 2, 1, 1] + t2[1, 0, 1, 1] + t3[3, −2, −1, 1] + t4[−1, 0, 1, 1]

has a solution with t1, t2, t3, t4 all non-negative. Solving these four equations gives (t1, t2, t3, t4) = (1, 0, 1/2, −1/2), so the given point does not lie in the convex hull. □

4.A.11. In R³, a tetrahedron has vertices A = [4, 0, 2], B = [−2, −3, 1], C = [1, −1, −3], D = [2, 4, −2].
a) Determine its volume.
b) Decide whether or not the point X = [0, −3, 0] lies inside the tetrahedron.

Solution. a) The volume of the tetrahedron is one sixth of the volume of the parallelepiped whose three edges from the point A are B − A = (−6, −3, −1), C − A = (−3, −1, −5) and D − A = (−2, 4, −4). That volume is given by the absolute value of the determinant
\[
\begin{vmatrix} -6 & -3 & -1\\ -3 & -1 & -5\\ -2 & 4 & -4 \end{vmatrix} = -124.
\]
Thus, the volume of the tetrahedron is 124/6 = 62/3.

b) Write X as an affine combination of the vertices, by solving the system of four linear equations in four unknowns a, b, c, d given by the equality X = aA + bB + cC + dD together with a + b + c + d = 1. The solution is X = ¼A + ½B + ½C − ¼D. Since the coefficient of D is negative, X does not lie in the convex hull of the points A, B, C, D. Hence the given point does not lie inside the tetrahedron. □

In any affine coordinates, the same set can be written as

{t0A0 + t1A1 + · · · + tkAk; ti ∈ R, Σᵢ₌₀ᵏ ti = 1}.

Affine combinations of points

In general, an expression t0A0 + t1A1 + · · · + tkAk with coefficients satisfying Σᵢ₌₀ᵏ ti = 1 is understood as the point A0 + Σᵢ₌₁ᵏ ti(Ai − A0). Such expressions are called affine combinations of points.

The points A0, . . . , Ak are in general position if they generate a k-dimensional affine subspace. This happens if and only if for each Ai, the vectors arising as the differences of this point Ai and all the other points Aj are linearly independent. Observe that prescribing a sequence of (dim A) + 1 points in general position is equivalent to choosing an affine frame with the origin in the first of them.

4.1.8. Simplexes. For points in an affine space, the affine combination is a construction similar to the linear combination of vectors in a vector space.
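Both parts of exercise 4.A.11 are mechanical enough to verify in Sage; the barycentric coefficients come from one 4 × 4 solve (three coordinate equations plus the affine constraint):

A = vector(QQ, [4,0,2]); B = vector(QQ, [-2,-3,1])
C = vector(QQ, [1,-1,-3]); D = vector(QQ, [2,4,-2])
print(abs(matrix([B-A, C-A, D-A]).det()) / 6)          # 62/3, the volume
M = matrix(QQ, [list(A)+[1], list(B)+[1], list(C)+[1], list(D)+[1]]).transpose()
print(M.solve_right(vector(QQ, [0,-3,0,1])))           # (1/4, 1/2, 1/2, -1/4)

The negative last coefficient confirms that X lies outside the tetrahedron.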
Indeed, the affine subspace generated by the points A0, . . . , Ak equals the set of all affine combinations of its generators.

The notion "to lie on the line between two points" can now be generalized; in the two-dimensional case, imagine the interior of a triangle. In general, proceed as follows:

k-dimensional simplexes

Let A0, . . . , Ak be k + 1 points in general position in an affine space A. The set ∆ = ∆(A0, . . . , Ak) of all affine combinations of the points Ai with nonnegative coefficients only, that is

∆ = {t0A0 + t1A1 + · · · + tkAk; 0 ≤ ti, Σᵢ₌₀ᵏ ti = 1},

is called the k-dimensional simplex generated by the points Ai.

A one-dimensional simplex is a line segment, a two-dimensional simplex is a triangle, and a zero-dimensional simplex is a point. Notice that each k-dimensional simplex has exactly k + 1 faces, defined by the equations ti = 0, i = 0, . . . , k. The faces are also simplexes, of dimension k − 1; together they form the boundary of the simplex. For instance, the boundary of a triangle is formed by its three edges, and the boundary of each edge is formed by two vertices.

The description of a subspace as the set of affine combinations of points in general position is equivalent to the parametric description. We work similarly with parametric descriptions of simplexes.

4.1.9. Convex sets. A subset M of an affine space is called convex if and only if for any two points A, B ∈ M it contains the whole line segment ∆(A, B). Directly from the definition, each convex set containing k + 1 points in general position also contains the entire simplex defined by these points.

4.A.12. Affine transformation of point coordinates. The point X has coordinates [2, 2, 3] in the affine basis {[1, 2, 3], (1, 1, 1), (1, −1, 2), (2, 1, 1)} of R³. Determine its coordinates in the standard basis, i.e. in the basis {[0, 0, 0], (1, 0, 0), (0, 1, 0), (0, 0, 1)}.

Solution. The coordinates [2, 2, 3] in the given basis mean, by definition, that

[1, 2, 3] + 2 · (1, 1, 1) + 2 · (1, −1, 2) + 3 · (2, 1, 1) = [11, 5, 12]

are the coordinates of X in the standard basis. □

4.A.13. Affine transformation of a mapping. Find the expression of the affine mapping f in the coordinate system with basis u = {(1, 1), (−1, 1)} and origin [2, 0], given in the standard basis of R² as
\[
f(x_1, x_2) = \begin{pmatrix} 2 & 1\\ 0 & 1 \end{pmatrix}\begin{pmatrix} x_1\\ x_2 \end{pmatrix} + \begin{pmatrix} 1\\ 1 \end{pmatrix}.
\]
Solution. The change of basis matrix from the basis u to the standard basis is
\[
\begin{pmatrix} 1 & -1\\ 1 & 1 \end{pmatrix}.
\]
The expression of f in the basis ([2, 0], u) is obtained as follows: first transform the coordinates in the basis ([2, 0], u) to the standard basis, i.e. to the basis ([0, 0], (1, 0), (0, 1)); then apply the transformation f in the standard basis; finally transform the result back to the coordinates in the basis ([2, 0], u). The transformation equations for changing the coordinates y1, y2 in the basis ([2, 0], u) to the coordinates x1, x2 in the standard basis are
\[
\begin{pmatrix} x_1\\ x_2 \end{pmatrix} = \begin{pmatrix} 1 & -1\\ 1 & 1 \end{pmatrix}\begin{pmatrix} y_1\\ y_2 \end{pmatrix} + \begin{pmatrix} 2\\ 0 \end{pmatrix}.
\]
Hereby
\[
\begin{pmatrix} y_1\\ y_2 \end{pmatrix} = \begin{pmatrix} 1 & -1\\ 1 & 1 \end{pmatrix}^{-1}\left(\begin{pmatrix} x_1\\ x_2 \end{pmatrix} - \begin{pmatrix} 2\\ 0 \end{pmatrix}\right) = \begin{pmatrix} \tfrac12 & \tfrac12\\ -\tfrac12 & \tfrac12 \end{pmatrix}\begin{pmatrix} x_1\\ x_2 \end{pmatrix} + \begin{pmatrix} -1\\ 1 \end{pmatrix}.
\]
Hence the desired mapping is
\[
f(y_1, y_2) = \begin{pmatrix} \tfrac12 & \tfrac12\\ -\tfrac12 & \tfrac12 \end{pmatrix}\left[\begin{pmatrix} 2 & 1\\ 0 & 1 \end{pmatrix}\left(\begin{pmatrix} 1 & -1\\ 1 & 1 \end{pmatrix}\begin{pmatrix} y_1\\ y_2 \end{pmatrix} + \begin{pmatrix} 2\\ 0 \end{pmatrix}\right) + \begin{pmatrix} 1\\ 1 \end{pmatrix}\right] + \begin{pmatrix} -1\\ 1 \end{pmatrix} = \begin{pmatrix} 2 & 0\\ -1 & 1 \end{pmatrix}\begin{pmatrix} y_1\\ y_2 \end{pmatrix} + \begin{pmatrix} 2\\ -1 \end{pmatrix}. \quad \square
\]
Examples of convex sets are
(1) the empty set,
(2) affine subspaces,
(3) line segments, and rays p = {P + t · v; t ≥ 0},
(4) more generally, k-dimensional objects α = {P + t1 · v1 + · · · + tk · vk; t1, . . . , tk ∈ R, tk ≥ 0},
(5) angles in two-dimensional subspaces β = {P + t1 · v1 + t2 · v2; t1 ≥ 0, t2 ≥ 0}.
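Convexity questions like the membership test in 4.A.10 can also be handled by Sage's polyhedral machinery, which computes the convex hull directly. A small sketch, assuming the Polyhedron class of a standard Sage installation:

P = Polyhedron(vertices=[[0,2,1], [1,0,1], [3,-2,-1], [-1,0,1]])
print(P.contains([2,1,0]))    # False, in accordance with 4.A.10
print(P.contains([1,0,1]))    # True: a vertex certainly lies in the hull

Internally this solves the same kind of nonnegative affine combination problem as the hand computation in 4.A.10.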
The intersection of an arbitrary system of convex sets is again a convex set. The intersection of all convex sets containing a given set M is called the convex hull K(M) of the set M.

Theorem. The convex hull of any subset M ⊂ A is

K(M) = {t1A1 + · · · + tsAs; Σᵢ₌₁ˢ ti = 1, ti ≥ 0, Ai ∈ M}.

Proof. Let S denote the set of all affine combinations on the right-hand side of the equation. To check that S is convex, choose two sets of parameters ti, i = 1, . . . , s1, and t′j, j = 1, . . . , s2, with the desired properties. Without loss of generality, assume that s1 = s2 and that the same points from M appear in both combinations (otherwise simply add summands with zero coefficients). Consider an arbitrary point of the line segment whose endpoints are defined by the two combinations:

ε(t1A1 + · · · + tsAs) + (1 − ε)(t′1A1 + · · · + t′sAs), 0 ≤ ε ≤ 1.

Obviously every point of this line segment lies in S.

It remains to show that the convex hull of the points A1, . . . , As cannot be smaller than S. The points Ai themselves correspond to the choice of parameters tj = 0 for all j ≠ i and ti = 1. Assume the claim holds for all sets with at most s − 1 points. Then the convex hull of the points A1, . . . , As−1 is (by the assumption) formed exactly by the combinations on the right-hand side with ts = 0. Now consider a point A = t1A1 + · · · + tsAs ∈ S with ts < 1, and the affine combinations

ε(t1A1 + · · · + ts−1As−1) + (1 − ε(1 − ts))As, 0 ≤ ε ≤ 1/(1 − ts).

This is a line segment with the endpoints given by the parameter values ε = 0 (the point As) and ε = 1/(1 − ts) (a point in the convex hull of A1, . . . , As−1). The point A is an inner point of this line segment, with the parameter ε = 1. Thus A lies in the convex hull of A1, . . . , As. □

The convex hulls of finite sets are called convex polyhedrons. We have a k-dimensional simplex if and only if the vertices A0, . . . , Ak defining the convex polyhedron are in general position. In the case of a simplex, the expression of any of its points as an affine combination of the defining vertices is unique.

Specific examples are the convex polyhedrons defined by one point and a finite number of vectors:

4.A.14. Let there be the standard coordinate system in the space R³. Agent K lives at the point S with coordinates [0, 1, 2]. The headquarters gave him a coordinate system with origin S and basis {(1, 1, 0), (−1, 0, 1), (0, 1, 2)}. Agent Bond lives at the point D with coordinates [1, 1, 1] and uses the coordinate system with origin D and basis {(0, 0, 1), (−1, 1, 2), (1, 0, 1)}. Agent K has set an appointment with agent Bond in the old brickfield, which is (according to K's coordinate system) at the point [1, 1, 0]. To where should Bond go (in his coordinate system)?

Solution. The change of basis matrix from agent K's basis to Bond's basis (ignoring the different origins for the moment) is
\[
T = \begin{pmatrix} -4 & 2 & -1\\ 1 & 0 & 1\\ 2 & -1 & 1 \end{pmatrix}.
\]
The brickfield has K-coordinates (1, 1, 0), so the corresponding displacement from S has Bond-basis coordinates T · (1, 1, 0)ᵀ = (−2, 1, 1)ᵀ. It remains to translate the origin: the vector S − D = (−1, 0, 1) has coordinates (2, 0, −1) in Bond's basis, so the brickfield has Bond's coordinates (−2, 1, 1) + (2, 0, −1) = (0, 1, 0). Indeed, in standard coordinates the brickfield is the point S + (1, 1, 0) + (−1, 0, 1) = [0, 2, 3] = D + (−1, 1, 2), and (−1, 1, 2) is exactly the second vector of Bond's basis. □

4.A.15. Find a transversal of the lines (that is, a line intersecting both given lines)

p : [1, 1, 1] + t(2, 1, 0), q : [2, 2, 0] + t(1, 1, 1),

such that the point [1, 0, 0] lies on the transversal.

Solution. The transversal lies in the plane ρ defined by the point [1, 0, 0] and the line p, that is, in the plane

[1, 1, 1] + t(2, 1, 0) + s(0, 1, 1).

Let the point Q be the intersection of this plane with the line q.
Q is obtained by solving the system

1 + 2t = 2 + u,
1 + t + s = 2 + u,
1 + s = u.

The left-hand sides of the equations are the three coordinates of an arbitrary point of the plane ρ, while the right-hand sides are the coordinates of an arbitrary point of the line q (whose free variable is denoted u to avoid ambiguity). Solving this system yields s = 2, t = 2, u = 3. Substituting u = 3 into the equation of the line q gives Q = [5, 5, 3]. The desired transversal is thus the line given by Q and the point [1, 0, 0]; its intersection with p is the point P = [7/3, 5/3, 1]. □

4.A.16. Find the common perpendicular of the two skew lines

p : [3, 0, 3] + (0, 1, 2)t, t ∈ R,
q : [0, −1, −2] + (1, 2, 3)s, s ∈ R.

Solution. The direction of the common perpendicular is given by the cross product of the two direction vectors; up to sign, it is (1, −2, 1).

Let u1, . . . , uk be arbitrary vectors of the difference space Rⁿ and A ∈ An a point. A parallelepiped Pk(A; u1, . . . , uk) ⊂ An is the set

Pk(A; u1, . . . , uk) = {A + c1u1 + · · · + ckuk; 0 ≤ ci ≤ 1}.

If the vectors u1, . . . , uk are independent, we talk about a k-dimensional parallelepiped Pk(A; u1, . . . , uk) ⊂ An. It is clear from the definition that parallelepipeds are convex; they are the convex hulls of their vertices.

4.1.10. Examples of standard affine tasks. (1) To find a parametric description of an implicitly given subspace and vice versa: Find a particular solution of the nonhomogeneous system and a fundamental system of solutions of the homogenized system. Together these give (in the coordinates in which the equations were set) the desired parametric description. In the opposite direction, write the parametric description in coordinates and eliminate the free parameters t1, . . . , tk. This results in equations defining the given subspace implicitly.

(2) To find the subspace generated by several subspaces Q1, . . . , Qs (of different dimensions in general; for instance, to find the plane in A3 given by a straight line and a point, or by three points), and to describe this subspace implicitly or parametrically: The resulting subspace Q is always determined by one fixed point Ai in each subspace Qi and by the sum of all the difference spaces, for instance

Q = A1 + (Z({A1, . . . , As}) + Z(Q1) + · · · + Z(Qs)).

If the subspaces are given implicitly, it is possible to convert them into parametric form first; nevertheless, different methods may be advantageous in concrete situations. Notice that it is really necessary to use one point from each of the subspaces. For example, two parallel lines in a plane generate the whole plane, but they share the same one-dimensional difference space.

(3) To find the intersection of the subspaces Q1, . . . , Qs: If they are given in implicit form, it is sufficient to unify all the equations into one system, omitting linearly dependent ones. If the resulting system has no solution, the intersection is empty; otherwise we obtain an implicit description of the intersection we are searching for. If parametric forms are given, we may search directly for common points as solutions of the appropriate equations, as when finding intersections of vector spaces. If the number of subspaces is greater than two, we search for the intersection step by step. If one of the subspaces is defined parametrically and the other implicitly, it suffices to substitute the parametrized coordinates into the implicit equations and solve the resulting system.

(4) To find a crossbar between two skew lines p, q in A3 passing through a given point or having a given direction: By a crossbar we mean a straight line which has a nonempty intersection with both skew lines.
If one of the subspaces is defined parametrically and the other implicitly, it suffices to substitute the parametrized coordinates and to solve the resulting system of equations. (4) To find a crossbar between two skew lines p, q in A3 passing through a given point or having a given direction: By a crossbar we mean a straight line which has nonempty intersection with both skew lines. Thus the 327 expresses that a vector defined by two points, one lying on p, the other on q, is parallel to the direction (1, −2, 1). We get the system P − Q = k(1, −2, 1), or [3, 0, 3] + (0, 1, 2)t P − [0, −1, −2] + (1, 2, 3)s Q = k(1, −2, 1). Treat this equality component-wise to give 3 − s = k 1 + t − 2s = −2k 5 + 2t − 3s = k with the solution t = 1, s = 2, k = 1. Put t = 1 into the line p, to obtain the point [3, 1, 5] on the common perpendicular. Put s = 2 into the line q equation to obtain the point [3, 1, 5]. The common perpendicular is defined by the line joining these two points. □ B. Euclidean geometry 4.B.1. Determine the distance between the lines in R3 . p : [1, −1, 0]+t(−1, 2, 3), and q : [2, 5, −1]+t(−1, −2, 1). Solution. The distance is defined as the distance of the orthogonal projection of arbitrary points on the respective lines to the orthogonal complement of the vector subspace generated by their directions. The orthogonal complement is spanned by the cross product: ⟨(−1, 2, 3), (−1, −2, 1)⟩⊥ = ⟨(−1, 2, 3) × (−1, −2, 1)⟩ = ⟨(8, −2, 4)⟩ = ⟨(4, −1, 2)⟩. A transversal is (for example) the segment joining [1, −1, 0] to [2, 5, −1]. So the vector to be projected is [1, −1, 0] − [2, 5, −1] = (−1, −6, 1). The distance between the lines is therefore: ρ(p, q) = |(−1, −6, 1) · (4, −1, 2)| ∥(4, −1, 2)∥ = 4 √ 21 . □ 4.B.2. Find a point A lying on the line p : x + 2y + z − 1 = 0, 3x − y + 4z − 29 = 0, which is equidistant from both B = [3, 11, 4] and C = [−5, −13, −2]. Solution. First, express the line p parametrically. Solve the system x + 2y + z = 1, 3x − y + 4z = 29. Rewrite the system as an augmented matrix and perform row operations ( 1 2 1 1 3 −1 4 29 ) ∼ ( 1 2 1 1 0 −7 1 26 ) ∼ ( 1 0 9/7 59/7 0 1 −1/7 −26/7 ) . The line p is thus described by p : [ 59 7 , − 26 7 , 0 ] + t ( − 9 7 , 1 7 , 1 ) , t ∈ R. CHAPTER 4. ANALYTIC GEOMETRY resulting crossbar r is a one–dimensional affine subspace. If we are given one point A ∈ r, then the affine subspace generated by p and A is either a straight line (if A ∈ p) or a plane (if A /∈ p). In the first case, there are an infinite number of solutions, one for each point of q. In the second case, it suffices to find the intersection B of the plane ⟨p ∪ A⟩ with q, and r = ⟨{A, B}⟩. There is no solution if the intersection is empty. If q ⊂ ⟨p ∪ A⟩, there are an infinite number of solutions. If the intersection has one element, there is exactly one solution. If a direction u ∈ Rn is given, then we consider the subspace Q generated by p and the difference space Z(p)+⟨u⟩ ⊂ Rn . Again, we obtain an infinite number of solutions if q ⊂ Q. Otherwise we consider the intersection Q with q and we finish as before. The solutions of other practical geometric problems are based mostly on the systematic use of the steps given above. 4.1.11. Remark on linear programming. In the beginning of the third chapter in paragraphs 3.1.1–3.1.7, we dealt with practical problems which are given by systems of linear inequalities. Each single inequality a1x1 + · · · + anxn ≤ b defines a halfspace in the standard affine space Rn . 
4.1.11. Remark on linear programming. In the beginning of the third chapter, in paragraphs 3.1.1–3.1.7, we dealt with practical problems given by systems of linear inequalities. Each single inequality

a1x1 + · · · + anxn ≤ b

defines a halfspace in the standard affine space Rⁿ, bounded by the hyperplane given by the corresponding equation (compare with the definition in paragraph 4.1.9(4)). In particular, the set of all admissible vectors of a linear programming problem is always the intersection of a finite number of convex sets, hence it is either convex or empty. If this intersection is both nonempty and bounded, it is a convex polyhedron.

As justified in 3.1.1 already, each linear form is either increasing, or decreasing, or constant along every parametrized straight line in the affine space. Thus if a given problem of linear programming is solvable and bounded, then it has an optimal solution at one of the vertices of the corresponding convex polyhedron. The reader should be able to visualize this claim for two-dimensional or three-dimensional problems; nevertheless, the straightforward explanation in these low dimensions holds in all finite-dimensional cases. We have thus given a "geometric proof" of the existence part of the fundamental theorem 3.1.5, translating the initial problem into the finite problem of comparing the values of the given cost function at the vertices.

4.1.12. Euclidean point spaces. So far we have not needed the notions of distance and length in our geometric considerations. But the length of vectors and the angle between vectors, as defined in the second chapter (see 2.3.18 and elsewhere), play a significant role in many practical problems.

It is convenient to avoid the fractions by introducing the substitution t = 7s + 26, so that p is described by

p : [−25, 0, 26] + s (−9, 1, 7), s ∈ R.

The point A is obtained by requiring that the vectors

A − B = (−28 − 9s, −11 + s, 22 + 7s), A − C = (−20 − 9s, 13 + s, 28 + 7s)

have the same length. Hence

√((−28 − 9s)² + (−11 + s)² + (22 + 7s)²) = √((−20 − 9s)² + (13 + s)² + (28 + 7s)²),

or equivalently

(−28 − 9s)² + (−11 + s)² + (22 + 7s)² = (−20 − 9s)² + (13 + s)² + (28 + 7s)²,

which has the unique solution s = −3. Therefore

A = [−25, 0, 26] − 3 (−9, 1, 7) = [2, −3, 5]. □

4.B.3. Michael has a stick of length 4. Can he touch the lines p and q simultaneously with this stick, given that the stick must pass through the point [2, 1, 2]?

p : [−1, 4, 1] + t(−1, 2, 0), q : [4, 4, −1] + s(1, 2, −4).

Solution. Compute the transversal of the lines passing through [2, 1, 2]: it is the segment joining [1, 0, 1] and [3, 2, 3]. Its length is √12, which is less than 4, so Michael can touch both lines as required. □

4.B.4. In the Euclidean space R⁴, determine the distance between the point A = [2, −5, 1, 4] and the subspace given by the equations

U : 4x1 − 2x2 − 3x3 − 2x4 + 12 = 0, 2x1 − x2 − 2x3 − 2x4 + 9 = 0.

Solution. First find some point of the subspace U, for example B = [0, 3, 0, 3] ∈ U. The distance between A and U equals the length of the orthogonal projection of the vector A − B onto the orthogonal complement of the difference space of U. This orthogonal complement is spanned by the normal vectors taken from the implicit equations:

V := {t (4, −2, −3, −2) + s (2, −1, −2, −2); t, s ∈ R}.

We need the orthogonal projection P_{A−B} of the vector A − B onto V. It lies in V, and thus

P_{A−B} = a (4, −2, −3, −2) + b (2, −1, −2, −2)

for certain a, b ∈ R. Clearly, (A − B) − P_{A−B} is orthogonal to V.

Euclidean spaces

The standard Euclidean point space En is the affine space An whose difference space is the standard Euclidean vector space Rⁿ with the scalar product ⟨x, y⟩ = yᵀ · x.
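Distance computations such as 4.B.1 follow this scheme literally: project a joining vector onto the orthogonal complement. A small Sage check of 4.B.1:

u = vector(QQ, [-1, 2, 3]); v = vector(QQ, [-1, -2, 1])
n = u.cross_product(v)                       # (8, -2, 4) spans the orthogonal complement
w = vector(QQ, [1, -1, 0]) - vector(QQ, [2, 5, -1])
print(abs(w.dot_product(n)) / n.norm())      # 8/sqrt(84) = 4/sqrt(21)

The printed value agrees with ρ(p, q) = 4/√21 computed above.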
The Cartesian coordinate system is an affine coordinate system (A0; u) with an orthonormal basis u. The Euclidean distance between two points A, B ∈ En is defined as the length of the vector B − A and is denoted by ρ(A, B) = ∥B − A∥. Euclidean subspaces in En are affine subspaces whose difference spaces carry the restricted scalar products.

By a Euclidean point space E of dimension n we mean an affine space whose difference space is a real n-dimensional Euclidean vector space. The notion of a Cartesian coordinate system has the obvious meaning. Since each choice of such a coordinate system identifies E with the standard space En, we deal in the sequel with the standard Euclidean spaces and their subspaces, with no loss of generality.

From the geometric point of view, the simple properties of the scalar product, like the triangle inequality, the Cauchy inequality and the Bessel inequality derived in the previous chapter (see 3.4.3), have useful consequences:

4.1.13. Theorem. For points A, B, C ∈ En the following holds:
(1) ρ(A, B) = ρ(B, A);
(2) ρ(A, B) = 0 if and only if A = B;
(3) ρ(A, B) + ρ(B, C) ≥ ρ(A, C);
(4) in each Cartesian coordinate system (A0; e), the distance between the points A = A0 + a1e1 + · · · + anen and B = A0 + b1e1 + · · · + bnen is √(Σᵢ₌₁ⁿ (ai − bi)²);
(5) given a point A and a subspace Q in En, there exists a point P ∈ Q minimizing the distance between A and the points of Q; the distance between A and P equals the length of the orthogonal projection of the vector A − B into Z(Q)⊥ for an arbitrary B ∈ Q;
(6) more generally, for subspaces Q and R in En there exist points P ∈ Q and P′ ∈ R minimizing the distances between points B ∈ Q and A ∈ R; the distance between P and P′ equals the length of the orthogonal projection of the vector A − B into (Z(Q) + Z(R))⊥ for arbitrary points B ∈ Q and A ∈ R.

Proof. The first three properties follow directly from the properties of the length of vectors in spaces with a scalar product. The fourth follows from the expression of the scalar product in an orthonormal basis.

Consider the relation for the minimal distance ρ(A, B), B ∈ Q, in (5). The vector A − B decomposes uniquely as A − B = u1 + u2 with u1 ∈ Z(Q), u2 ∈ Z(Q)⊥. The component u2 does not depend on the choice of B ∈ Q.

Thus ((A − B) − P_{A−B}) ⊥ (4, −2, −3, −2) and ((A − B) − P_{A−B}) ⊥ (2, −1, −2, −2). Substituting for A − B and P_{A−B},

((2, −8, 1, 1) − a(4, −2, −3, −2) − b(2, −1, −2, −2)) · (4, −2, −3, −2) = 0,
((2, −8, 1, 1) − a(4, −2, −3, −2) − b(2, −1, −2, −2)) · (2, −1, −2, −2) = 0.

Computing these dot products, we obtain the system

19 − 33a − 20b = 0, 8 − 20a − 13b = 0,

with the unique solution a = 3, b = −4. Hence

P_{A−B} = 3 (4, −2, −3, −2) − 4 (2, −1, −2, −2) = (4, −2, −1, 2),

where ∥P_{A−B}∥ = √(4² + (−2)² + (−1)² + 2²) = 5. Hence the distance between A and U equals ∥P_{A−B}∥ = 5. □

4.B.5. In the vector space R⁴, compute the distance v between the point [0, 0, 6, 0] and the vector subspace

U : [0, 0, 0, 0] + t1 (1, 0, 1, 1) + t2 (2, 1, 1, 0) + t3 (1, −1, 2, 3), t1, t2, t3 ∈ R.

Solution. We solve the problem by the least squares method. Write the generating vectors of U as the columns of the matrix
\[
A = \begin{pmatrix} 1 & 2 & 1\\ 0 & 1 & -1\\ 1 & 1 & 2\\ 1 & 0 & 3 \end{pmatrix}.
\]
Represent the point [0, 0, 6, 0] by the corresponding vector b = (0, 0, 6, 0)ᵀ. Now solve A · x = b.
This is the linear equation system

x1 + 2x2 + x3 = 0, x2 − x3 = 0, x1 + x2 + 2x3 = 6, x1 + 3x3 = 0,

to be solved by the least squares method. (Note that the system does not have an exact solution – otherwise the distance would be 0.)

This is because any change of B amounts to adding a vector from Z(Q), which affects only the component u1. Choose P = A + (−u2) = B + u1 ∈ Q. Then

∥A − B∥² = ∥u1∥² + ∥u2∥² ≥ ∥u2∥² = ∥A − P∥².

Hence the minimal distance is realized by the point P, and its value is ∥u2∥.

The general result (6) is obtained in a similar way. For an arbitrary choice of points A ∈ R and B ∈ Q, their difference is the sum of vectors u1 ∈ Z(R) + Z(Q) and u2 ∈ (Z(R) + Z(Q))⊥, and the component u2 does not depend on the choice of the points. By adding suitable vectors from the difference spaces of R and Q, points A′ and B′ are obtained whose distance is ∥u2∥. □

We now consider some elementary problems in affine geometry requiring the concept of distance.

4.1.14. Examples of standard problems. (1) To find the distance between a point A ∈ En and a subspace Q ⊂ En: A method of solving this problem is given in proposition 4.1.13.

(2) In E2, to construct the straight line q through a given point A forming a given angle with a given line p: Recall that we worked with angles between vectors in plane geometry already (see e.g. 2.3.22). Find a vector u ∈ R² lying in the difference space of the line p, then choose a vector v having the prescribed angle with u. The desired line is given by the point A and the difference space ⟨v⟩. The problem has either one or two solutions.

(3) To find the line through a given point, perpendicular to a given line: The procedure is introduced in the proof of the last but one item of proposition 4.1.13.

(4) In E3, to determine the distance between two lines p, q: Choose a point on each of the lines, A ∈ p, B ∈ q. The component of the vector A − B lying in the orthogonal complement (Z(p) + Z(q))⊥ has length equal to the distance between p and q.

(5) In E3, to find the axis of two skew lines p and q: By the axis we mean the crossbar realizing the minimal distance between the given skew lines via its points of intersection with them. The procedure can be derived from the proof of the last item of proposition 4.1.13. Let η be the subspace generated by one point A ∈ p and the sum Z(p) + (Z(p) + Z(q))⊥. Provided the lines p and q are not parallel, η is a plane. The intersection η ∩ q together with the difference space (Z(p) + Z(q))⊥ then gives a parametric description of the desired axis. If the lines are parallel, the problem has infinitely many solutions.

Multiply A · x = b by the matrix Aᵀ from the left-hand side. The augmented matrix of the system Aᵀ · A · x = Aᵀ · b is
\[
\left(\begin{array}{ccc|c} 3 & 3 & 6 & 6\\ 3 & 6 & 3 & 6\\ 6 & 3 & 15 & 12 \end{array}\right).
\]
By elementary row operations, transform the matrix to its normal form,
\[
\left(\begin{array}{ccc|c} 3 & 3 & 6 & 6\\ 0 & 3 & -3 & 0\\ 0 & -3 & 3 & 0 \end{array}\right) \sim
\left(\begin{array}{ccc|c} 1 & 1 & 2 & 2\\ 0 & 1 & -1 & 0\\ 0 & 0 & 0 & 0 \end{array}\right),
\]
and continue with the backward elimination,
\[
\left(\begin{array}{ccc|c} 1 & 0 & 3 & 2\\ 0 & 1 & -1 & 0\\ 0 & 0 & 0 & 0 \end{array}\right).
\]
The solution is

x = (2 − 3t, t, t)ᵀ, t ∈ R.

Note that the existence of infinitely many solutions is caused by the redundancy of the third generating vector of U:

3 (1, 0, 1, 1) − (2, 1, 1, 0) = (1, −1, 2, 3).

An arbitrary (t ∈ R) linear combination

(2 − 3t) (1, 0, 1, 1) + t (2, 1, 1, 0) + t (1, −1, 2, 3) = (2, 0, 2, 2)

corresponds to the point [2, 0, 2, 2] of the subspace U, which is the point of U nearest to [0, 0, 6, 0].
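The same least squares computation can be scripted. A minimal Sage sketch, here working with a reduced (independent) basis of U so that the normal equations are invertible:

B = (QQ^4).span([vector(QQ,[1,0,1,1]), vector(QQ,[2,1,1,0]),
                 vector(QQ,[1,-1,2,3])]).basis_matrix()   # two independent rows
b = vector(QQ, [0, 0, 6, 0])
c = (B*B.transpose()).solve_right(B*b)     # normal equations for the projection
p = B.transpose()*c                        # the point of U nearest to b
print(p, (b - p).norm())                   # (2, 0, 2, 2), 2*sqrt(6)

Sage's span() already discards the redundant third generator, which is exactly the phenomenon noted above.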
The required distance is therefore

v = ∥[2, 0, 2, 2] − [0, 0, 6, 0]∥ = √(2² + 0 + (−4)² + 2²) = 2√6. □

4.B.6. Compute the volume of the parallelepiped in R³ with base in the plane z = 0 and with edges given by the pairs of vertices [0, 0, 0], [−2, 3, 0]; [0, 0, 0], [4, 1, 0] and [0, 0, 0], [5, 7, 3].

Solution. The parallelepiped is determined by the vectors (4, 1, 0), (−2, 3, 0), (5, 7, 3). Its volume is given by the determinant
\[
\begin{vmatrix} 4 & -2 & 5\\ 1 & 3 & 7\\ 0 & 0 & 3 \end{vmatrix} = 3 \begin{vmatrix} 4 & -2\\ 1 & 3 \end{vmatrix} = 3 \cdot 14 = 42.
\]
Note that if the order of the vectors were changed, we could also get the result −42, because the determinant gives the oriented volume of the parallelepiped. Note further that the volume would not change if the third vector were (a, b, 3) for arbitrary a, b ∈ R: the volume depends only on the orthogonal distance between the planes of the upper and lower base and on the area of the base, given by the determinant
\[
\begin{vmatrix} 4 & -2\\ 1 & 3 \end{vmatrix} = 14. \quad \square
\]
4.1.15. Angles. Various geometric notions, like angles, orientation, volume etc., are defined in the point spaces En in terms of suitable notions from the Euclidean vector spaces. The angle between two vectors was defined at the end of the third part of the second chapter, see 2.3.22. From the Cauchy inequality it follows that 0 ≤ |u · v| / (∥u∥∥v∥) ≤ 1, so it makes sense to define the angle φ(u, v) between vectors u, v ∈ V in a real vector space with a scalar product by the equation

cos φ(u, v) = (u · v) / (∥u∥∥v∥), 0 ≤ φ(u, v) ≤ π.

This is completely in accordance with the situation in the two-dimensional Euclidean space R², and with the philosophy that a notion concerning two vectors is an issue of plane geometry: in the Euclidean plane we may use the functions cos and sin defined by purely geometric considerations, and the angle between two vectors in a higher-dimensional space is measured in the plane generated by these two vectors (or it is zero).

In an arbitrary real vector space with a scalar product we then have

∥u − v∥² = ∥u∥² + ∥v∥² − 2(u · v) = ∥u∥² + ∥v∥² − 2∥u∥∥v∥ cos φ(u, v),

which is the well known cosine rule of plane geometry.

Consider an orthonormal basis e of the difference space V and a vector u ∈ V. The square of the length of u is given by the usual formula ∥u∥² = Σᵢ |u · ei|². Dividing this equation by ∥u∥², we arrive at

1 = Σᵢ (cos φ(u, ei))²,

which is known as the law of the direction cosines cos φ(u, ei) of the vector u.

Now we derive the definition of angles between general subspaces of a Euclidean vector space from the definition of angles between vectors. In particular, it must be decided how to deal with the cases where the subspaces have a nontrivial intersection. For the angle between two lines we use the smaller of the two possible angles. In the case of two nonparallel planes in R³ we do not say that the angle is zero, although they intersect and have one direction in common: instead, we take in each of the two planes the line perpendicular to this common direction, and measure the angle between these two lines. The general cases are treated as follows:

4.B.7. Let the points [0, 0, 1], [2, 1, 1], [3, 3, 1], [1, 2, 1] define a parallelogram. Determine the point X on the line p : [0, 0, 1] + t (1, 1, 1) such that the parallelepiped determined by the given parallelogram and the point X has volume 1.

Solution. Form the determinant which gives the (oriented) volume of the parallelepiped as X moves along the line p:
\[
\begin{vmatrix} t & t & t\\ 2 & 1 & 0\\ 1 & 2 & 0 \end{vmatrix} = 3t.
\]
The volume equals 1 for t = 1/3, which determines the point X. □
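Both determinants are instant to verify. A Sage check of 4.B.6 and 4.B.7:

print(matrix(QQ, [[4,-2,5], [1,3,7], [0,0,3]]).det())   # 42
var('t')
print(matrix(SR, [[t,t,t], [2,1,0], [1,2,0]]).det())    # 3*t

So the volume in 4.B.6 is 42, and the condition 3t = 1 in 4.B.7 gives t = 1/3.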
4.B.8. Let ABCDEFGH be a cube (with the common notation, i.e. the vectors E − A, F − B, G − C, H − D are orthogonal to the plane defined by the vertices A, B, C, D) in the Euclidean space R³. Compute the angle φ between the vectors F − A and H − A.

Solution. This can be solved using the formula for the angle between vectors. Alternatively, notice that A, F, H are the vertices of a triangle whose sides are face diagonals of the cube, hence all of the same length. So the triangle is equilateral, and therefore φ = π/3. □

4.B.9. Let S be the midpoint of the edge AB of the cube ABCDEFGH (with the common labelling). Compute the cosine of the angle between the lines ES and BG.

Solution. A dilation (homothety) preserves angles, so without loss of generality the cube has edge of length 1. Place the coordinate system so that A is at the origin, B = [1, 0, 0] and E = [0, 0, 1]. Then S = [1/2, 0, 0], G = [1, 1, 1], and the directions are ES = S − E = (1/2, 0, −1) and BG = G − B = (0, 1, 1). The desired cosine of the angle φ (taking the absolute value, since we deal with lines) is

cos φ = |(1/2, 0, −1) · (0, 1, 1)| / (∥(1/2, 0, −1)∥ ∥(0, 1, 1)∥) = √2/√5. □

4.B.10. Compute the angle between the line p given by the implicit equations

x + 3y + z = 0, −x − y + z = 0

and the plane ϱ : x + y + 2z + 1 = 0.

Solution. A normal vector of the plane ϱ is (1, 1, 2). Keeping the first equation of p and adding the two equations gives the equivalent system

x + 3y + z = 0, 2y + 2z = 0.

Hence y = −z and x = 2z, so (2, −1, 1) is a direction vector of p. In other words, p passes through the origin, and p : [0, 0, 0] + t (2, −1, 1), t ∈ R. For the angle φ between the vectors (1, 1, 2) and (2, −1, 1),

cos φ = (2 − 1 + 2) / (√6 · √6) = 1/2.

Angles between subspaces

4.1.16. Definition. Consider finite-dimensional subspaces U1, U2 of a Euclidean vector space V of arbitrary dimension. The angle between the vector subspaces U1, U2 is the real number α = φ(U1, U2) ∈ [0, π/2] satisfying:

(1) If dim U1 = dim U2 = 1, U1 = ⟨u⟩, U2 = ⟨v⟩, then

cos α = |u · v| / (∥u∥∥v∥).

(2) If the dimensions of U1, U2 are positive and U1 ∩ U2 = {0}, then the angle is the minimum of all angles between one-dimensional subspaces,

α = min{φ(⟨u⟩, ⟨v⟩); 0 ≠ u ∈ U1, 0 ≠ v ∈ U2}.

Such a minimum always exists.

(3) If U1 ⊂ U2 or U2 ⊂ U1 (in particular if one of them is the zero subspace), then α = 0.

(4) If U1 ∩ U2 ≠ {0} and U1 ≠ U1 ∩ U2 ≠ U2, then

α = φ(U1 ∩ (U1 ∩ U2)⊥, U2 ∩ (U1 ∩ U2)⊥).

The angle between affine subspaces Q1, Q2 in a Euclidean point space En is defined as the angle between their difference spaces Z(Q1), Z(Q2).

Notice that the angle is always well defined: in the last case

(U1 ∩ (U1 ∩ U2)⊥) ∩ (U2 ∩ (U1 ∩ U2)⊥) = {0},

so we can determine the angle according to item (2). Notice also that in the case U1 ∩ U2 = {0}, the subspaces U1 and U2 are perpendicular in the sense of the earlier definitions if and only if the angle between them is π/2; if the intersection is nontrivial, however, they can never be perpendicular in the former sense.

In order to justify the definition, it remains to show that vectors u ∈ U1, v ∈ U2 minimizing the expression for the angle in (2) always exist. First a special case:

4.1.17. Lemma. Let v be a vector in a Euclidean vector space V and U ⊂ V an arbitrary subspace. Denote by v1 ∈ U, v2 ∈ U⊥ the (uniquely determined) components of the vector v, i.e. v = v1 + v2. Then the angle φ between the subspace generated by v and the subspace U satisfies

cos φ(⟨v⟩, U) = cos φ(⟨v⟩, ⟨v1⟩) = ∥v1∥/∥v∥.

Proof.
By the Cauchy inequality,

|u · v| / (∥u∥∥v∥) = |u · (v1 + v2)| / (∥u∥∥v∥) = |u · v1| / (∥u∥∥v∥) ≤ (∥u∥∥v1∥) / (∥u∥∥v∥) = ∥v1∥/∥v∥ = ∥v1∥² / (∥v∥∥v1∥) = |v1 · v| / (∥v∥∥v1∥)

for all vectors u ∈ U. This implies that

cos φ(⟨v⟩, ⟨u⟩) ≤ cos φ(⟨v⟩, ⟨v1⟩) = ∥v1∥/∥v∥.

Thus the vector v1 yields the largest possible value of the cosine over all choices of vectors in U. Since the function cos is decreasing on the interval [0, π/2], the smallest possible angle is obtained in this way, and the claim is proved. □

Hence φ = 60°. However, this is the angle between the direction vector of p and the normal vector of ϱ; the desired angle is its complement, 30° = 90° − 60°. □

4.B.11. In the real plane, find a line passing through the point [−3, 0] such that the angle between this line and the line

p : √3 x + 3y + 5 = 0

is 60°.

Solution. The given line has slope −1/√3, i.e. it makes the angle −30° with the positive x axis. Thus the required line makes either the angle −90° or the angle 30° with the positive x axis. The former gives the vertical line x = −3, the latter the line with slope 1/√3 through [−3, 0], with equation y√3 = x + 3. □

Solution. (Alternative) Notice that there are two such lines. The general equation of a line in the plane has the form ax + by + c = 0, and we may choose the parameters so that a² + b² = 1. We look for numbers a, b, c ∈ R satisfying all the conditions. Since the line passes through [−3, 0], we get c = 3a. The condition that the angle between the lines equals 60° gives

1/2 = cos 60° = |√3 a + 3b| / √12, i.e. √3 = |√3 a + 3b|.

Hence ±1 = a + √3 b, and squaring yields 1 = a² + 3b² + 2√3 ab. Using a² + b² = 1, we get 0 = 2b² + 2√3 ab, i.e. 0 = b (b + √3 a). Altogether (remembering that c = 3a and a² + b² = 1),

a = ±1, b = 0, c = ±3; or a = ±1/2, b = ∓√3/2, c = ±3/2.

We can easily check that the lines determined by these coefficients,

x + 3 = 0, (1/2)x − (√3/2)y + 3/2 = 0,

satisfy all the conditions. □

4.B.12. Determine the equations of all planes containing the line p : [1, 0, 0] + t (1, 1, 0) such that the angle between each such plane and the plane x + y + z − 1 = 0 is 60°. ⃝

4.B.13. Determine the angle between the planes

σ : [1, 0, 2] + (1, −1, 1)t + (0, 1, −2)s,
ρ : [3, 3, 3] + (1, −2, 0)t + (0, 1, 1)s.

Solution. The intersection line of the planes has direction vector (1, −1, 1). The plane orthogonal to this vector intersects the given planes in one-dimensional subspaces generated by the vectors (1, 0, −1) and (0, 1, 1), respectively. The angle between these one-dimensional subspaces is 60°. □
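Anticipating the procedure of paragraph 4.1.18 below, the angle in 4.B.13 can also be computed from the matrix A of the mutual projection between orthonormal bases of the two direction planes: its singular values are the cosines of the principal angles, and their squares are the eigenvalues of AᵀA. A numeric Sage sketch (QR over RDF is used to orthonormalize the columns):

U1 = matrix(RDF, [[1,-1,1], [0,1,-2]]).transpose()
U2 = matrix(RDF, [[1,-2,0], [0,1,1]]).transpose()
Q1 = U1.QR()[0].matrix_from_columns([0,1])   # orthonormal basis of the first plane
Q2 = U2.QR()[0].matrix_from_columns([0,1])   # orthonormal basis of the second plane
A = Q1.transpose()*Q2
print(A.SVD()[1].diagonal())                 # approximately [1.0, 0.5]

The singular value 1 reflects the common direction (1, −1, 1) of the two planes (cf. 4.1.16(4)); the remaining value 1/2 gives cos α = 1/2, i.e. α = 60°, as above.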
4.1.18. Calculating angles. The procedure in the previous lemma can be understood as follows: choose the orthogonal projection of the one-dimensional subspace generated by v onto the subspace U, and consider the ratio of the lengths of v and its image. A similar procedure works in higher dimensions; the problem is to recognize the directions whose projections realize the desired (minimal) angle. The previous example suggests the way: project the larger space U onto the one-dimensional ⟨v⟩ first, and then orthogonally back to U. The desired angle corresponds to the direction of an eigenvector of this composed map, and the corresponding eigenvalue is the square of the cosine of the angle.

Let U1, U2 be two arbitrary subspaces of a Euclidean vector space V with U1 ∩ U2 = {0}. Choose orthonormal bases e and e′ of the whole space V such that U1 = ⟨e1, . . . , ek⟩ and U2 = ⟨e′1, . . . , e′l⟩. Consider the orthogonal projection φ of the space V onto U2; its restriction to U1 is denoted by φ : U1 → U2 as before. Similarly, let ψ : U2 → U1 be the restriction of the orthogonal projection onto U1. In the bases (e1, . . . , ek) and (e′1, . . . , e′l), these maps have the matrices
\[
A = \begin{pmatrix} e_1 \cdot e'_1 & \dots & e_k \cdot e'_1\\ \vdots & & \vdots\\ e_1 \cdot e'_l & \dots & e_k \cdot e'_l \end{pmatrix}, \qquad
B = \begin{pmatrix} e'_1 \cdot e_1 & \dots & e'_l \cdot e_1\\ \vdots & & \vdots\\ e'_1 \cdot e_k & \dots & e'_l \cdot e_k \end{pmatrix},
\]
respectively. Since ei · e′j = e′j · ei for all indices i, j, we have B = Aᵀ. The composition ψ ∘ φ : U1 → U1 therefore has the symmetric positive semidefinite matrix AᵀA, and ψ is adjoint to φ. Each such map has only nonnegative real eigenvalues, and in a suitable orthonormal basis it has a diagonal matrix with these eigenvalues on the diagonal, see 3.4.7 and 3.4.9.

Now we can derive a general procedure for computing the angle α = φ(U1, U2).

Theorem. In the previous notation, let λ be the largest eigenvalue of the matrix AᵀA. Then (cos α)² = λ.

Proof. Let u ∈ U1 be an eigenvector of the map ψ ∘ φ corresponding to the eigenvalue λ. Consider all the eigenvalues λ1, . . . , λk (including multiplicities), and let u = (u1, . . . , uk) be a corresponding orthonormal basis of U1 consisting of eigenvectors. Assume λ = λ1 and choose the eigenvector u = u1, ∥u∥ = 1.

4.B.14. A cube ABCDA′B′C′D′ is given with the standard notation, that is, ABCD and A′B′C′D′ are faces while AA′, BB′ are edges. Compute the angle φ between AB′ and AD′.

Solution. It can be assumed that the cube has edge 1 and is placed in R³ so that the vertices A, B, C, D have coordinates [0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0] and the vertices A′, B′, C′, D′ have coordinates [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1], respectively. Then

AB′ = B′ − A = (1, 0, 1), AD′ = D′ − A = (0, 1, 1),

so

cos φ = ((1, 0, 1) · (0, 1, 1)) / (∥(1, 0, 1)∥ ∥(0, 1, 1)∥) = 1/2,

hence φ = 60°. □

For further exercises on angles, see .

4.B.15. Prove that for every n ∈ N and all positive x1, x2, . . . , xn ∈ R,

n² ≤ (1/x1 + 1/x2 + · · · + 1/xn) · (x1 + x2 + · · · + xn).

For what arguments does equality hold?

Solution. It is sufficient to consider the Cauchy inequality |u · v| ≤ ∥u∥ ∥v∥ in the Euclidean space Rⁿ for the vectors

u = (1/√x1, . . . , 1/√xn), v = (√x1, . . . , √xn).

We get

(1) n ≤ √(1/x1 + 1/x2 + · · · + 1/xn) · √(x1 + x2 + · · · + xn),

and squaring (1) gives the desired inequality. The Cauchy inequality becomes an equality exactly when the vector u is a multiple of v, that is, when x1 = x2 = · · · = xn. □

4.B.16. Vectors u = (u1, u2, u3) and v = (v1, v2, v3) are given. Find a third unit vector such that the parallelepiped determined by these three vectors has the greatest possible volume.

Solution. Denote the desired vector by t = (t1, t2, t3). By Proposition ?? the volume of the parallelepiped P3(0; u, v, t) is the absolute value of the determinant
\[
\begin{vmatrix} u_1 & v_1 & t_1\\ u_2 & v_2 & t_2\\ u_3 & v_3 & t_3 \end{vmatrix} =
\begin{vmatrix} t_1 & t_2 & t_3\\ u_1 & u_2 & u_3\\ v_1 & v_2 & v_3 \end{vmatrix} = t \cdot (u \times v) \leq \|t\|\,\|u \times v\| = \|u \times v\|.
\]
The inequality follows from the Cauchy inequality, and it becomes an equality if and only if t = c (u × v), c ∈ R. The volume therefore can be at most the area of the parallelogram determined by the vectors u, v (i.e. the length of the vector u × v), and equality holds if and only if

t = ± (u × v) / ∥u × v∥. □

We need to show that the angle between an arbitrary v ∈ U1 and U2 is at least as large as the angle between u and U2; equivalently, that the cosine of the corresponding angle cannot be greater.
We need to show that the angle between an arbitrary $v \in U_1$ and $U_2$ is at least as large as the angle between $u$ and $U_2$; equivalently, that the cosine of the corresponding angle cannot be greater. By the previous lemma, it is sufficient to discuss the angle between $u$ and $\varphi(u) \in U_2$. Choose $v \in U_1$, $v = a_1u_1 + \dots + a_ku_k$, $\sum_{i=1}^k a_i^2 = \|v\|^2 = 1$. Then
\[
\|\varphi(v)\|^2=\varphi(v)\cdot\varphi(v)=(\psi\circ\varphi(v))\cdot v\le \|\psi\circ\varphi(v)\|\,\|v\|=\|\psi\circ\varphi(v)\| .
\]
Moreover, the previous lemma gives a formula for computing the angle $\alpha$ between the vector $v$ and the subspace $U_2$:
\[
\cos\alpha=\frac{\|\varphi(v)\|}{\|v\|}=\|\varphi(v)\| .
\]
Since $\lambda_1$ is the largest eigenvalue and the sum of the squares $a_i^2$ equals one,
\[
(\cos\alpha)^2=\|\varphi(v)\|^2\le\|\psi\circ\varphi(v)\|=\Big(\sum_{i=1}^k(\lambda_ia_i)^2\Big)^{1/2}=\Big(\lambda_1^2+\sum_{i=1}^k a_i^2(\lambda_i^2-\lambda_1^2)\Big)^{1/2}\le\sqrt{\lambda_1^2}=\lambda_1 .
\]
If $v = u$, then $\|\varphi(u)\|^2 = (\psi\circ\varphi(u))\cdot u = \lambda_1\|u\|^2 = \lambda$, and thus the angle attains its minimal value at this vector. □

4.1.19. Calculating volume. An indication of how to calculate volumes in plane geometry is given at the end of the fifth part of the first chapter (see 1.5.11). There, the notion of orientation played a fundamental role. We can imagine an orientation as the decision whether to look at the plane $\mathbb{R}^2$ from above or from below; the distinction lies in the order in which the standard basis vectors $e_1$ and $e_2$ are selected on the unit circle. We proceed in the same way in general:

Orientation of a vector space
Two bases $u$ and $v$ of a real vector space $V$ are said to determine the same orientation if the transformation matrix between them has a positive determinant. By an orientation of the vector space $V$ we mean an equivalence class of bases with respect to this equivalence, given by the sign of the determinant. Bases equivalent in this sense are called compatible with the chosen orientation. It follows that there exist exactly two orientations on every vector space, and every compatible basis is taken to a non-compatible one by a transformation matrix with a negative determinant. A vector space with a chosen orientation is called an oriented vector space. An oriented Euclidean (point) space is a Euclidean point space whose difference space is oriented.

In the sequel, we consider the standard Euclidean space $E_n$ together with the orientation given by the standard basis of $\mathbb{R}^n$.

4.B.17. Find the foot of the line which passes through the point $[0, 0, 7]$ and is perpendicular to the plane $\varrho : [0, 5, 3] + (1, 2, 1)t + (−2, 1, 1)s$.

4.B.18. In the Euclidean space $\mathbb{R}^5$, determine the distance between the planes
ϱ1 : [7, 2, 7, −1, 1] + t1(1, 0, −1, 0, 0) + s1(0, 1, 0, 0, −1),
ϱ2 : [2, 4, 7, −4, 2] + t2(1, 1, 1, 0, 1) + s2(0, −2, 0, 0, 3),
where $t_1, s_1, t_2, s_2 \in \mathbb{R}$.

Solution. First compute the orthogonal complement of the sum of the directions of the two planes. Form the matrix whose rows are the direction vectors of the planes and transform it into row echelon form:
\[
\begin{pmatrix} 1&0&-1&0&0\\ 0&1&0&0&-1\\ 1&1&1&0&1\\ 0&-2&0&0&3 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 1&0&-1&0&0\\ 0&1&0&0&-1\\ 0&0&1&0&1\\ 0&0&0&0&1 \end{pmatrix}.
\]
This shows that the orthogonal complement is one-dimensional, generated by the vector $(0, 0, 0, 1, 0)$. The distance between the planes is the length of the orthogonal projection of the vector $A_1 − A_2$ onto the subspace $\langle(0, 0, 0, 1, 0)\rangle$, for arbitrary points $A_1 \in \varrho_1$, $A_2 \in \varrho_2$. Choose e.g. $A_1 = [7, 2, 7, −1, 1]$, $A_2 = [2, 4, 7, −4, 2]$. The orthogonal projection of $A_1 − A_2 = (5, −2, 0, 3, −1)$ onto $\langle(0, 0, 0, 1, 0)\rangle$ is $(0, 0, 0, 3, 0)$, and its length gives the desired distance 3. □
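The computation in 4.B.18 can be verified numerically. The following NumPy sketch finds the orthogonal complement as the null space of the matrix of direction vectors (here via the singular value decomposition, a standard numerical substitute for the row reduction used above) and projects A1 − A2 onto it.

    import numpy as np

    # direction vectors of the planes rho_1 and rho_2 (as rows)
    D = np.array([[1, 0, -1, 0, 0],
                  [0, 1, 0, 0, -1],
                  [1, 1, 1, 0, 1],
                  [0, -2, 0, 0, 3]], dtype=float)

    # orthonormal basis of the orthogonal complement = null space of D
    _, s, Vt = np.linalg.svd(D)
    ns = Vt[np.linalg.matrix_rank(D):]        # rows spanning the complement

    diff = np.array([7, 2, 7, -1, 1]) - np.array([2, 4, 7, -4, 2])  # A1 - A2

    # orthogonal projection of A1 - A2 onto the complement, and its length
    proj = ns.T @ (ns @ diff)                 # = (0, 0, 0, 3, 0)
    print(proj.round(6), np.linalg.norm(proj))   # distance 3.0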
4.B.19. In the Euclidean space $\mathbb{R}^5$, determine the distance of the planes
σ1 : [0, 1, 2, 0, 0] + p1(2, 1, 0, 0, 1) + q1(−2, 0, 1, 1, 0),
σ2 : [3, −1, 7, 7, 3] + p2(2, 2, 4, 0, 3) + q2(2, 0, 0, −2, −1),
where $p_1, q_1, p_2, q_2 \in \mathbb{R}$.

Solution. The sum of the directions of σ1, σ2 is generated by the direction vectors
u1 = (2, 1, 0, 0, 1), u2 = (−2, 0, 1, 1, 0), v1 = (2, 2, 4, 0, 3), v2 = (2, 0, 0, −2, −1).
We look for points $X_1 \in \sigma_1$, $X_2 \in \sigma_2$ realizing the distance between σ1 and σ2. Then
X1 − X2 = [0, 1, 2, 0, 0] − [3, −1, 7, 7, 3] + p1u1 + q1u2 − p2v1 − q2v2,
and the vector $X_1 − X_2$ must be orthogonal to all four direction vectors:
⟨X1 − X2, u1⟩ = 0, ⟨X1 − X2, u2⟩ = 0, ⟨X1 − X2, v1⟩ = 0, ⟨X1 − X2, v2⟩ = 0.

Let $u_1, \dots, u_k$ be arbitrary vectors in the difference space $\mathbb{R}^n$ and $A \in E_n$ a point. As an example of a convex set, the parallelepiped $P_k(A; u_1, \dots, u_k) \subset E_n$ is given by
\[
P_k(A; u_1,\dots,u_k)=\{A+c_1u_1+\dots+c_ku_k;\ 0\le c_i\le 1\} .
\]
If $u_1, \dots, u_k$ are linearly dependent, the parallelepiped is degenerate and we set the volume $\mathrm{Vol}\,P_k = 0$. If the vectors $u_1, \dots, u_k$ are linearly independent, we have a $k$-dimensional parallelepiped $P_k(A; u_1, \dots, u_k) \subset E_n$. For given vectors $u_1, \dots, u_k$, the parallelepipeds of lower dimensions $P_1(A; u_1), \dots, P_k(A; u_1, \dots, u_k)$ in the Euclidean subspaces $A + \langle u_1\rangle, \dots, A + \langle u_1, \dots, u_k\rangle$ are also at our disposal. We proceed as in the Gram–Schmidt orthogonalization and consider the decomposition
\[
\langle u_1,\dots,u_k\rangle=\langle u_1,\dots,u_{k-1}\rangle\oplus\big(\langle u_1,\dots,u_{k-1}\rangle^\perp\cap\langle u_1,\dots,u_k\rangle\big).
\]
In this direct sum, $u_k$ is uniquely expressed as $u_k = u'_k + e_k$, where $e_k \perp \langle u_1, \dots, u_{k-1}\rangle$. The absolute value of the volume of a parallelepiped is defined inductively as the product of the volume of the “base” and the “altitude”:
\[
|\mathrm{Vol}|\,P_1(A;u_1)=\|u_1\|,\qquad |\mathrm{Vol}|\,P_k(A;u_1,\dots,u_k)=\|e_k\|\cdot|\mathrm{Vol}|\,P_{k-1}(A;u_1,\dots,u_{k-1}).
\]
If $u_1, \dots, u_n$ is a basis compatible with the orientation of the entire vector space $V$, the (oriented) volume of the parallelepiped is defined by $\mathrm{Vol}\,P_n(A; u_1, \dots, u_n) = |\mathrm{Vol}|\,P_n(A; u_1, \dots, u_n)$; in the case of a non-compatible basis we set $\mathrm{Vol}\,P_n(A; u_1, \dots, u_n) = −|\mathrm{Vol}|\,P_n(A; u_1, \dots, u_n)$.

Theorem. Let $Q \subset E_n$ be a Euclidean subspace, and let $e = (e_1, \dots, e_k)$ be an orthonormal basis of $Z(Q)$. For arbitrary vectors $u_1, \dots, u_k \in Z(Q)$ and $A \in Q$ the following holds:
\[
(1)\quad \mathrm{Vol}\,P_k(A; u_1,\dots,u_k)=\begin{vmatrix} u_1\cdot e_1 & \dots & u_k\cdot e_1\\ \vdots & & \vdots\\ u_1\cdot e_k & \dots & u_k\cdot e_k \end{vmatrix},
\qquad
(2)\quad (\mathrm{Vol}\,P_k(A; u_1,\dots,u_k))^2=\begin{vmatrix} u_1\cdot u_1 & \dots & u_k\cdot u_1\\ \vdots & & \vdots\\ u_1\cdot u_k & \dots & u_k\cdot u_k \end{vmatrix}.
\]

Proof. The matrix
\[
A=\begin{pmatrix} u_1\cdot e_1 & \dots & u_k\cdot e_1\\ \vdots & & \vdots\\ u_1\cdot e_k & \dots & u_k\cdot e_k \end{pmatrix}
\]

Hence
⟨(−3, 2, −5, −7, −3), u1⟩ + p1⟨u1, u1⟩ + q1⟨u2, u1⟩ − p2⟨v1, u1⟩ − q2⟨v2, u1⟩ = 0,
⟨(−3, 2, −5, −7, −3), u2⟩ + p1⟨u1, u2⟩ + q1⟨u2, u2⟩ − p2⟨v1, u2⟩ − q2⟨v2, u2⟩ = 0,
⟨(−3, 2, −5, −7, −3), v1⟩ + p1⟨u1, v1⟩ + q1⟨u2, v1⟩ − p2⟨v1, v1⟩ − q2⟨v2, v1⟩ = 0,
⟨(−3, 2, −5, −7, −3), v2⟩ + p1⟨u1, v2⟩ + q1⟨u2, v2⟩ − p2⟨v1, v2⟩ − q2⟨v2, v2⟩ = 0.
By computing the dot products, we obtain the system of linear equations
6p1 − 4q1 − 9p2 − 3q2 = 7,
−4p1 + 6q1 + 6q2 = 6,
9p1 − 33p2 − q2 = 31,
3p1 − 6q1 − p2 − 9q2 = −11.
We solve it by forming the augmented matrix and performing elementary row operations:
\[
\begin{pmatrix} 6&-4&-9&-3&7\\ -4&6&0&6&6\\ 9&0&-33&-1&31\\ 3&-6&-1&-9&-11 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 1&0&0&0&0\\ 0&1&0&0&-1\\ 0&0&1&0&-1\\ 0&0&0&1&2 \end{pmatrix}.
\]
The solution is $(p_1, q_1, p_2, q_2) = (0, −1, −1, 2)$. Consequently,
X1 − X2 = (−3, 2, −5, −7, −3) − u2 + v1 − 2v2 = (−3, 4, −2, −4, 2).
The length of the vector $(−3, 4, −2, −4, 2)$ equals the distance between the planes σ1, σ2, namely
\[
\sqrt{(-3)^2+4^2+(-2)^2+(-4)^2+2^2}=7 .
\]
We solved this problem by a method different from that of the previous problem; both methods can be used in both cases. Let us try the former method for σ1, σ2: find the orthogonal complement of the vector subspace generated by (2, 1, 0, 0, 1), (−2, 0, 1, 1, 0), (2, 2, 4, 0, 3), (2, 0, 0, −2, −1). We get
\[
\begin{pmatrix} 2&1&0&0&1\\ -2&0&1&1&0\\ 2&2&4&0&3\\ 2&0&0&-2&-1 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 1&0&0&0&3/2\\ 0&1&0&0&-2\\ 0&0&1&0&1\\ 0&0&0&1&2 \end{pmatrix},
\]
so the orthogonal complement is $\langle(−3/2, 2, −1, −2, 1)\rangle$, or rather $\langle(3, −4, 2, 4, −2)\rangle$.

has the coordinates of the vectors $u_1, \dots, u_k$ in the chosen basis as its columns, and
\[
|A|^2=|A|\,|A|=|A^T|\,|A|=|A^TA|=\begin{vmatrix} u_1\cdot u_1 & \dots & u_k\cdot u_1\\ \vdots & & \vdots\\ u_1\cdot u_k & \dots & u_k\cdot u_k \end{vmatrix}.
\]
Hence if (1) holds, then (2) holds as well. Directly from the definition, the unoriented volume equals the product
\[
|\mathrm{Vol}|\,P_k(A;u_1,\dots,u_k)=\|v_1\|\,\|v_2\|\cdots\|v_k\|,
\]
where $v_1 = u_1$, $v_2 = u_2 + a_{21}v_1$, \dots, $v_k = u_k + a_{k1}v_1 + \dots + a_{k,k-1}v_{k-1}$ is the result of the Gram–Schmidt orthogonalization. Thus
\[
(\mathrm{Vol}\,P_k(A;u_1,\dots,u_k))^2=\begin{vmatrix} v_1\cdot v_1 & \dots & 0\\ \vdots & \ddots & \vdots\\ 0 & \dots & v_k\cdot v_k \end{vmatrix}
=\begin{vmatrix} v_1\cdot v_1 & \dots & v_k\cdot v_1\\ \vdots & & \vdots\\ v_1\cdot v_k & \dots & v_k\cdot v_k \end{vmatrix}.
\]
Denote by $B$ the matrix whose columns are formed by the coordinates of the vectors $v_1, \dots, v_k$ in the orthonormal basis $e$. Since $v_1, \dots, v_k$ arise from $u_1, \dots, u_k$ by a linear transformation with an upper-triangular matrix $C$ with ones on the diagonal, we have $B = AC$ and $|B| = |A|\,|C| = |A|$. But then $(\mathrm{Vol}\,P_k)^2 = |B|^2 = |A|^2$, and thus $\mathrm{Vol}\,P_k(A; u_1, \dots, u_k) = \pm|A|$. The resulting volume is zero if and only if the vectors $u_1, \dots, u_k$ are linearly dependent. Provided they are independent, the sign of the determinant is positive if and only if the basis $u_1, \dots, u_k$ defines the same orientation as the basis $e$. □

Consider a parallelepiped in a $k$-dimensional space, spanned by $k$ vectors, and write their coordinates (in an orthonormal basis) into the columns of a matrix. Then the volume of the parallelepiped is the determinant of this matrix. The determinant in the formula (2) above is called the Gram determinant. It is independent of the choice of the basis and is therefore useful in particular when $k$ is less than the dimension of the whole space. We formulate the following important geometric consequence:

4.1.20. Corollary. For each linear map $\varphi : V \to V$ on a Euclidean vector space $V$, $\det\varphi$ equals the (oriented) volume of the image of the parallelepiped determined by the vectors of an orthonormal basis. More generally, the image of a parallelepiped $P$, determined by arbitrary $\dim V$ vectors, has volume equal to the $\det\varphi$-multiple of the original volume.

Note that the distance between σ1 and σ2 equals the size of the orthogonal projection onto this orthogonal complement of the vector
u = (3, −2, 5, 7, 3) = [3, −1, 7, 7, 3] − [0, 1, 2, 0, 0]
(the difference of arbitrary points of the two planes). Denote the orthogonal projection of $u$ by $p_u$, and choose $v = (3, −4, 2, 4, −2)$. Obviously $p_u = a\cdot v$ for some $a \in \mathbb{R}$, and $\langle u − p_u, v\rangle = 0$, i.e. $\langle u, v\rangle − a\langle v, v\rangle = 0$. Computing the products gives $49 − 49a = 0$. Therefore $p_u = 1\cdot v = v$, and the distance between the planes σ1 and σ2 equals
\[
\|p_u\|=\sqrt{3^2+(-4)^2+2^2+4^2+(-2)^2}=7 .
\]
The method computing the distance via the orthogonal complement of the sum of the directions proves to be the “faster way to the solution”. The same applies to the planes ϱ1 and ϱ2.
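The Gram-determinant formula (2) from the theorem in 4.1.19 lends itself to a quick numerical check. In the sketch below, the two spanning vectors in R³ are an arbitrary illustrative choice; for k = 2, n = 3 the result can be compared with the length of the cross product.

    import numpy as np

    # two vectors spanning a 2-dimensional parallelepiped in R^3
    M = np.array([[1.0, 0.0],
                  [2.0, 1.0],
                  [0.0, 1.0]])          # u_1, u_2 as columns

    # formula (2): squared volume = Gram determinant det(M^T M)
    vol_gram = np.sqrt(np.linalg.det(M.T @ M))

    # in R^3 the area of the parallelogram also equals |u_1 x u_2|
    vol_cross = np.linalg.norm(np.cross(M[:, 0], M[:, 1]))
    print(np.isclose(vol_gram, vol_cross))   # True (both are sqrt(6))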
The second method, however, also reveals the points at which the distance is attained (a pair of points where the planes are closest to each other). Let us find such points for the planes ϱ1, ϱ2. Denote
u1 = (1, 0, −1, 0, 0), u2 = (0, 1, 0, 0, −1), v1 = (1, 1, 1, 0, 1), v2 = (0, −2, 0, 0, 3).
The points $X_1 \in \varrho_1$, $X_2 \in \varrho_2$ which are closest (as commented above) are
X1 = [7, 2, 7, −1, 1] + t1u1 + s1u2, X2 = [2, 4, 7, −4, 2] + t2v1 + s2v2,
so that
X1 − X2 = [7, 2, 7, −1, 1] − [2, 4, 7, −4, 2] + t1u1 + s1u2 − t2v1 − s2v2 = (5, −2, 0, 3, −1) + t1u1 + s1u2 − t2v1 − s2v2.
The conditions ⟨X1 − X2, u1⟩ = 0, ⟨X1 − X2, u2⟩ = 0, ⟨X1 − X2, v1⟩ = 0, ⟨X1 − X2, v2⟩ = 0 then lead to the system of linear equations
2t1 = −5, 2s1 + 5s2 = 1, −4t2 − s2 = −2, −5s1 − t2 − 13s2 = −1
with the unique solution t1 = −5/2, s1 = 41/2, t2 = 5/2, s2 = −8. We obtain
\[
X_1=[7,2,7,-1,1]-\tfrac52u_1+\tfrac{41}2u_2=\big[\tfrac92,\tfrac{45}2,\tfrac{19}2,-1,-\tfrac{39}2\big],\qquad
X_2=[2,4,7,-4,2]+\tfrac52v_1-8v_2=\big[\tfrac92,\tfrac{45}2,\tfrac{19}2,-4,-\tfrac{39}2\big].
\]
The distance between the points X1, X2 equals the distance between the planes ϱ1, ϱ2; both are given by
‖X1 − X2‖ = ‖(0, 0, 0, 3, 0)‖ = 3. □

4.1.21. Outer product and cross product. The previous considerations are closely related to the tensor product of vectors. We do not go further into this technically more complicated topic now, but we do mention the outer product of $n = \dim V$ vectors $u_1, \dots, u_n \in V$. Let $(u_{1j}, \dots, u_{nj})^T$ be the coordinate expressions of the vectors $u_j$ in a chosen orthonormal basis of $V$, and let $M$ be the matrix with the entries $u_{ij}$. Then the determinant $|M|$ does not depend on the choice of the basis within the same orientation. Its value is called the outer product of the vectors $u_1, \dots, u_n$ and is denoted by $[u_1, \dots, u_n]$. Hence the outer product is the oriented volume of the corresponding parallelepiped, see 4.1.19. Although the outer product looks like a scalar quantity, the story gets more complicated once we allow for general bases of $V$: the determinant of the matrix $M$ built of the coordinates of the $u_i$ then changes by the determinants of the transition matrices. Such objects are called densities, and we shall come back to them in chapter 9.

Several useful properties of the outer product follow directly from the definition:
(1) The map $(u_1, \dots, u_n) \mapsto [u_1, \dots, u_n]$ is an antisymmetric $n$-linear map: it is linear in all arguments, and the interchange of any two arguments causes a change of sign.
(2) The outer product is zero if and only if the vectors $u_1, \dots, u_n$ are linearly dependent.
(3) The vectors $u_1, \dots, u_n$ form a positive basis if and only if their outer product is positive.

Consider a Euclidean vector space $V$ of dimension $n \ge 2$ and vectors $u_1, \dots, u_{n-1} \in V$. If these $n − 1$ vectors are substituted for the first $n − 1$ arguments of the $n$-linear map defined by the volume determinant as above, then one argument is left over, and we obtain a linear form on $V$. Since the scalar product is available, each linear form corresponds to exactly one vector. This vector $v \in V$ is called the cross product of the vectors $u_1, \dots, u_{n-1}$; it is determined by the property that for each vector $w \in V$,
\[
\langle w, v\rangle=[u_1,\dots,u_{n-1},w] .
\]
We denote the cross product by $v = u_1 \times \dots \times u_{n-1}$. If, in an orthonormal basis, the coordinates of the vectors are $v = (y_1, \dots, y_n)^T$, $w = (x_1, \dots, x_n)^T$ and $u_j = (u_{1j}, \dots, u_{nj})^T$, then the definition can be expressed as
\[
y_1x_1+\dots+y_nx_n=\begin{vmatrix} u_{11} & \dots & u_{1(n-1)} & x_1\\ \vdots & & \vdots & \vdots\\ u_{n1} & \dots & u_{n(n-1)} & x_n \end{vmatrix}.
\]
Hence the vector $v$ is determined uniquely.
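The defining determinant also gives a direct recipe for computing the cross product of n − 1 vectors in Rⁿ: expand along the last column. A small NumPy sketch of this expansion follows; the function name and interface are an illustrative choice, not from the text.

    import numpy as np

    def cross_general(*vectors):
        """Cross product of n-1 vectors in R^n, computed by the cofactor
        expansion of the defining determinant along its last column."""
        U = np.column_stack(vectors)            # n x (n-1) matrix
        n = U.shape[0]
        assert U.shape == (n, n - 1)
        # y_i = (-1)^(i+1+n) * det(U with row i deleted), i zero-based
        return np.array([(-1) ** (i + n + 1)
                         * np.linalg.det(np.delete(U, i, axis=0))
                         for i in range(n)])

    # in R^3 this reduces to the usual cross product
    print(cross_general(np.array([1., 0., 0.]),
                        np.array([0., 1., 0.])))   # [0. 0. 1.]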
Its coordinates are calculated by the formal expansion of this determinant along the last column. The following properties of the cross product are direct consequences of the definition:

Theorem. For the cross product $v = u_1 \times \dots \times u_{n-1}$:
(1) $v \in \langle u_1, \dots, u_{n-1}\rangle^\perp$;
(2) $v$ is nonzero if and only if the vectors $u_1, \dots, u_{n-1}$ are linearly independent;

4.B.20. Find the intersection of the plane passing through the point A = [1, 2, 3, 4] ∈ R⁴ and orthogonal to the plane ϱ : [1, 0, 1, 0] + (1, 2, −1, −2)s + (1, 0, 0, 1)t, s, t ∈ R, with the plane ϱ itself.

Solution. First find the plane orthogonal to ϱ. Its direction is orthogonal to the direction of ϱ, so for the vectors $(a, b, c, d)$ of its direction we get the system of linear equations
(a, b, c, d) · (1, 2, −1, −2) = 0, i.e. a + 2b − c − 2d = 0,
(a, b, c, d) · (1, 0, 0, 1) = 0, i.e. a + d = 0.
The solution is the two-dimensional vector space ⟨(0, 1, 2, 0), (−1, 0, −3, 1)⟩. The plane τ orthogonal to ϱ and passing through A therefore has the parametric equation
τ : [1, 2, 3, 4] + (0, 1, 2, 0)u + (−1, 0, −3, 1)v, u, v ∈ R.
We obtain the intersection of the planes by equating the two parametric expressions, which gives the system of linear equations
1 + s + t = 1 − v,
2s = 2 + u,
1 − s = 3 + 2u − 3v,
−2s + t = 4 + v.
It has a unique solution (necessarily so, since the corresponding matrix columns are linearly independent):
s = −8/19, t = 34/19, u = −54/19, v = −26/19.
Substituting the parameter values s and t into the parametric form of the plane ϱ, we obtain the intersection point [45/19, −16/19, 27/19, 50/19]. (Needless to say, the same point is obtained by substituting the values of u and v into τ.) □

4.B.21. Find a line passing through the point [1, 2] ∈ R² such that the angle between this line and the line p : [0, 1] + t(1, 1) is 30°.

Solution. The angle between two lines is the angle between their direction vectors, so it is sufficient to find the direction vector v of the desired line. One way to do so is to rotate the direction vector of p by 30°. The rotation matrix for the angle 30° is
\[
\begin{pmatrix} \cos 30° & -\sin 30°\\ \sin 30° & \cos 30° \end{pmatrix}=\begin{pmatrix} \frac{\sqrt3}{2} & -\frac12\\[2pt] \frac12 & \frac{\sqrt3}{2} \end{pmatrix}.
\]
The desired vector is therefore
\[
v=\begin{pmatrix} \frac{\sqrt3}{2} & -\frac12\\[2pt] \frac12 & \frac{\sqrt3}{2} \end{pmatrix}\begin{pmatrix}1\\1\end{pmatrix}=\begin{pmatrix} \frac{\sqrt3}{2}-\frac12\\[2pt] \frac{\sqrt3}{2}+\frac12 \end{pmatrix}.
\]
(We could equally well perform the backward rotation.) The line (one of the two possible) has the parametric equation
[1, 2] + t(√3/2 − 1/2, √3/2 + 1/2). □

(3) the length ‖v‖ of the cross product equals the absolute value of the volume of the parallelepiped $P(0; u_1, \dots, u_{n-1})$;
(4) $(u_1, \dots, u_{n-1}, v)$ is a compatible basis of the oriented Euclidean space $V$.

Proof. The first claim follows directly from the defining formula for $v$: substituting an arbitrary vector $u_j$ for $w$ gives the scalar product $v \cdot u_j$ on the left and a determinant with two equal columns on the right. The rank of the matrix with the $n − 1$ columns $u_j$ is given by the maximal size of a nonzero minor; the minors which define the coordinates of the cross product are of degree $n − 1$, and thus claim (2) is proved. If the vectors $u_1, \dots, u_{n-1}$ are linearly dependent, then (3) holds trivially. Suppose the vectors are linearly independent. Let $v$ be their cross product, and choose an orthonormal basis $(e_1, \dots, e_{n-1})$ of the space $\langle u_1, \dots, u_{n-1}\rangle$. It follows from what has been proved that there exists a multiple $(1/\alpha)v$, $0 \ne \alpha \in \mathbb{R}$, such that $(e_1, \dots, e_{n-1}, (1/\alpha)v)$ is an orthonormal basis of $V$. The coordinates of our vectors in this basis are $u_j = (u_{1j}, \dots, u_{(n-1)j}, 0)^T$, $v = (0, \dots, 0, \alpha)^T$. So the outer product $[u_1, \dots, u_{n-1}, v]$ equals (see the definition of the cross product)
\[
[u_1,\dots,u_{n-1},v]=\begin{vmatrix} u_{11} & \dots & u_{1(n-1)} & 0\\ \vdots & & \vdots & \vdots\\ u_{(n-1)1} & \dots & u_{(n-1)(n-1)} & 0\\ 0 & \dots & 0 & \alpha \end{vmatrix}=\langle v,v\rangle=\alpha^2 .
\]
Expanding the determinant along the last column gives $\alpha^2 = \alpha\,\mathrm{Vol}\,P(0; u_1, \dots, u_{n-1})$, which proves the remaining two claims. □

In technical applications in $\mathbb{R}^3$, the cross product, which assigns a vector to any pair of vectors, is used very often.

2. Transformations

This short section backs up a quite wide area of practical considerations displayed in the other column. As usual, we can understand objects well only if we master the mappings preserving the crucial concepts.

4.B.22. A regular octahedron has eight faces which are equilateral triangles. Determine cos α, where α is the angle between two adjacent faces.

Solution. The octahedron is symmetric, so it does not matter which two adjacent faces are selected. By suitable scaling we may assume that the octahedron has edges of length 1 and is placed in the standard Cartesian coordinate system of R³ with its centroid at [0, 0, 0]. Its vertices are then located at the points A = [√2/2, 0, 0], B = [0, √2/2, 0], C = [−√2/2, 0, 0], D = [0, −√2/2, 0], E = [0, 0, −√2/2] and F = [0, 0, √2/2]. We compute the angle between the faces CDF and BCF. We need vectors orthogonal to their common edge CF and lying within the respective faces, namely the altitudes from D and from B onto the edge CF in the triangles CDF and BCF. The altitudes in an equilateral triangle coincide with the medians, so these are the segments SD and SB, where S is the midpoint of CF. Since the coordinates of C and F are known, S = [−√2/4, 0, √2/4], and the vectors are SD = (√2/4, −√2/2, −√2/4) and SB = (√2/4, √2/2, −√2/4). Together,
\[
\cos\alpha=\frac{(\tfrac{\sqrt2}4,-\tfrac{\sqrt2}2,-\tfrac{\sqrt2}4)\cdot(\tfrac{\sqrt2}4,\tfrac{\sqrt2}2,-\tfrac{\sqrt2}4)}{\|(\tfrac{\sqrt2}4,-\tfrac{\sqrt2}2,-\tfrac{\sqrt2}4)\|\,\|(\tfrac{\sqrt2}4,\tfrac{\sqrt2}2,-\tfrac{\sqrt2}4)\|}=-\frac13 .
\]
Therefore α ≐ 109.47°. □

4.B.23. In the Euclidean space R⁵, determine the angle φ between the subspaces U, V, where
(a) U : [3, 5, 1, 7, 2] + t(1, 0, 2, −2, 1), t ∈ R; V : [0, 1, 0, 0, 0] + s(2, 0, −2, 1, −1), s ∈ R;
(b) U : [4, 1, 1, 0, 1] + t(2, 0, 0, 2, 1), t ∈ R; V : x1 + x2 + x3 + x5 = 7;
(c) U : 2x1 − x2 + 2x3 + x5 = 3; V : x1 + 2x2 + 2x3 + x5 = −1;
(d) U : [0, 1, 1, 0, 0] + t(0, 0, 0, 1, −1), t ∈ R; V : [1, 0, 1, 1, 1] + r(1, −1, 2, 1, 0) + s(0, 1, 3, 2, 0) + p(1, 0, 0, 1, 0) + q(1, 3, 1, 0, 0), r, s, p, q ∈ R;
(e) U : [0, 2, 5, 0, 0] + t(2, 1, 3, 5, 3) + s(0, 3, 1, 4, −2) + r(1, 2, 4, 0, 3), t, s, r ∈ R; V : [0, 0, 0, 0, 0] + p(−1, 1, 1, −5, 0) + q(1, 5, 1, 13, −4), p, q ∈ R;
(f) U : [1, 1, 1, 1, 1] + t(1, 0, 1, 1, 1) + s(1, 0, 0, 1, 1), t, s ∈ R; V : [1, 1, 1, 1, 1] + p(1, 1, 1, 1, 1) + q(1, 1, 0, 1, 1) + r(1, 1, 0, 1, 0), p, q, r ∈ R.

Solution. Recall that the angle between affine subspaces is defined as the angle between the vector spaces associated with them, so the translations given by the added points can be ignored.

Case (a). Since U and V are one-dimensional subspaces, the angle φ ∈ [0, π/2] is given by the formula
\[
\cos\varphi=\frac{|(1,0,2,-2,1)\cdot(2,0,-2,1,-1)|}{\|(1,0,2,-2,1)\|\cdot\|(2,0,-2,1,-1)\|}=\frac{5}{\sqrt{10}\cdot\sqrt{10}} .
\]

Affine maps

4.2.1. A map f : A → B between affine spaces is called an affine map if there exists a linear map φ : Z(A) → Z(B) between their difference spaces such that for all A ∈ A, v ∈ Z(A),
\[
f(A+v)=f(A)+\varphi(v) .
\]
The maps f and φ are determined uniquely by this property, and by arbitrarily chosen images of (dim A + 1) points in general position.
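In coordinates (see 4.2.3 below), an affine map is a linear map followed by a translation. The following sketch, with an arbitrarily chosen translation y0 and matrix Y, illustrates numerically the invariance of affine combinations that is proved next.

    import numpy as np

    # an affine map f(x) = y0 + Y x  (arbitrary illustrative data)
    y0 = np.array([1.0, -2.0])
    Y = np.array([[2.0, 1.0],
                  [0.0, 3.0]])
    f = lambda x: y0 + Y @ x

    # points and weights of an affine combination (weights sum to 1)
    A = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    t = [0.5, 0.25, 0.25]

    lhs = f(sum(ti * Ai for ti, Ai in zip(t, A)))   # image of the combination
    rhs = sum(ti * f(Ai) for ti, Ai in zip(t, A))   # combination of the images
    print(np.allclose(lhs, rhs))                    # True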
For an arbitrary affine combination of points $t_0A_0 + \dots + t_sA_s \in \mathcal{A}$ we obtain
\[
f(t_0A_0+\dots+t_sA_s)=f\big(A_0+t_1(A_1-A_0)+\dots+t_s(A_s-A_0)\big)
=f(A_0)+t_1\varphi(A_1-A_0)+\dots+t_s\varphi(A_s-A_0)
=t_0f(A_0)+t_1f(A_1)+\dots+t_sf(A_s).
\]
On the other hand, if a map preserves affine combinations, we may use specific combinations of $n + 1$ fixed points generating an affine frame. Choosing successively the single non-zero coefficients $t_i = 1$, $i = 1, \dots, s$, we define the map φ between the difference spaces by the relation
\[
\varphi(A_i-A_0)=f(A_i)-f(A_0).
\]
The previous computation can be read in the opposite direction: the assumption that the first and the last expressions in the display are equal implies that the second and the third are equal as well, so we can check that φ is well defined and linear. Thus we have an affine map with the corresponding linear map φ between the difference spaces, described in the chosen affine frame by this procedure. Therefore:

Theorem. Affine maps are exactly those maps which preserve affine combinations of points.

It is sufficient to check the invariance of affine combinations for pairs of points, since an arbitrary affine combination can be created from combinations of pairs. Indeed, an affine combination of $k + 2$ points $A_0, \dots, A_{k+1}$ can be expressed as
\[
r(t_0A_0+\dots+t_kA_k)+sA_{k+1},\qquad \textstyle\sum_{i=0}^k t_i=1,\quad r+s=1 .
\]
We first choose the point which is the affine combination of the first $k + 1$ points only, and then combine it with the last one. In this way, any finite affine combination is built step by step from combinations of pairs.

4.2.2. Ratio of collinear points. The affine combinations of pairs of points can also be expressed with the help of the ratio of points on a straight line. If C is given by an affine combination of points A and B ≠ C, C = rA + sB, then we say that the number λ = (C; A, B) = −s/r is the ratio of the point C with respect to the given points A and B. Since we can express C as C = A + s(B − A) = B + r(A − B),

Therefore cos φ = 1/2 and φ = π/3.

Case (b). The subspace U has the direction vector (2, 0, 0, 2, 1), and the hyperplane V has the normal vector (1, 1, 1, 0, 1). The angle ψ = π/3 between these two vectors is derived from the formula
\[
\cos\psi=\frac{(2,0,0,2,1)\cdot(1,1,1,0,1)}{\|(2,0,0,2,1)\|\cdot\|(1,1,1,0,1)\|}=\frac{3}{3\cdot 2} .
\]
Notice that φ = π/2 − ψ = π/6, because the angle between a line and a hyperplane is the complement of the angle between the line and the normal of the hyperplane.

Case (c). The hyperplanes U and V are defined by the normal vectors u = (2, −1, 2, 0, 1) and v = (1, 2, 2, 0, 1), and the angle φ equals the angle between these vectors. Therefore (see (a))
\[
\cos\varphi=\frac{|(2,-1,2,0,1)\cdot(1,2,2,0,1)|}{\|(2,-1,2,0,1)\|\cdot\|(1,2,2,0,1)\|}=\frac12,\qquad\text{i.e.}\quad\varphi=\frac{\pi}{3} .
\]

Case (d). Denote u = (0, 0, 0, 1, −1), v1 = (1, −1, 2, 1, 0), v2 = (0, 1, 3, 2, 0), v3 = (1, 0, 0, 1, 0), v4 = (1, 3, 1, 0, 0), and denote by $p_u$ the orthogonal projection of u into the direction space of V (the subspace generated by v1, v2, v3, v4). Then $p_u = av_1 + bv_2 + cv_3 + dv_4$ for some a, b, c, d ∈ R, and
⟨pu − u, v1⟩ = 0, ⟨pu − u, v2⟩ = 0, ⟨pu − u, v3⟩ = 0, ⟨pu − u, v4⟩ = 0.
Substituting for $p_u$ gives the system of linear equations
7a + 7b + 2c = 1,
7a + 14b + 2c + 6d = 2,
2a + 2b + 2c + d = 1,
6b + c + 11d = 0,
with the solution (a, b, c, d) = (−8/19, 7/19, 13/19, −5/19). Hence
\[
p_u=-\tfrac{8}{19}v_1+\tfrac{7}{19}v_2+\tfrac{13}{19}v_3-\tfrac{5}{19}v_4=(0,0,0,1,0),
\]
and, by the formula
(1) cos φ = ‖pu‖ / ‖u‖,
we get
\[
\cos\varphi=\frac{\|(0,0,0,1,0)\|}{\|(0,0,0,1,-1)\|}=\frac1{\sqrt2}=\frac{\sqrt2}{2} .
\]
Hence φ = π/4.
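The projection in case (d) is a least-squares problem, so it can be reproduced with a standard solver. A brief NumPy check of pu and the angle:

    import numpy as np

    u = np.array([0., 0., 0., 1., -1.])
    V = np.array([[1., -1., 2., 1., 0.],     # v1 ... v4
                  [0., 1., 3., 2., 0.],
                  [1., 0., 0., 1., 0.],
                  [1., 3., 1., 0., 0.]]).T   # columns span the direction of V

    # orthogonal projection of u onto the column span = least squares
    coeffs, *_ = np.linalg.lstsq(V, u, rcond=None)
    pu = V @ coeffs                          # = (0, 0, 0, 1, 0)

    cos_phi = np.linalg.norm(pu) / np.linalg.norm(u)
    print(pu.round(6), np.degrees(np.arccos(cos_phi)))   # 45 degrees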
Case (e). Determine first the intersection of the vector subspaces associated with the given affine subspaces. The vector $(x_1, x_2, x_3, x_4, x_5)$ lies in the direction of U if and only if
(x1, x2, x3, x4, x5) = t(2, 1, 3, 5, 3) + s(0, 3, 1, 4, −2) + r(1, 2, 4, 0, 3)
for some t, s, r ∈ R. Similarly, $(x_1, x_2, x_3, x_4, x_5)$ lies in the direction of V if and only if
(x1, x2, x3, x4, x5) = p(−1, 1, 1, −5, 0) + q(1, 5, 1, 13, −4)
for some p, q ∈ R. We look for t, s, r, p, q ∈ R such that
t(2, 1, 3, 5, 3) + s(0, 3, 1, 4, −2) + r(1, 2, 4, 0, 3) = p(−1, 1, 1, −5, 0) + q(1, 5, 1, 13, −4).
This is a homogeneous system of linear equations, solved in matrix form (the order of the variables is t, s, r, p, q):

the ratio λ is the ratio of the lengths of the oriented vectors C − A and C − B. In particular, λ = −1 if and only if C is the centre of the line segment joining A and B (i.e. r = s = 1/2 in the affine combination). Hence the characterization of affine maps in terms of affine combinations has the following consequence:

Corollary. Affine maps are exactly those maps which keep the ratios invariant.

Needless to say, the collinearity of points must be preserved in order to talk about ratios at all.

4.2.3. Coordinate expression for maps. A general affine map f : A → B, f(X) = f(A0) + φ(X − A0), is viewed in coordinates as follows. First express the image f(A0) of the origin of the frame (A0, u) on A in the frame (B0, v) on B; in other words, the vector f(A0) − B0 has the coordinates y0 in the basis v. Everything else is then given by multiplying by the matrix of the map φ in the chosen bases and adding the outcome. Each affine map therefore has the following form in coordinates:
x ↦ y0 + Y · x,
where y0 is as above and Y is the matrix of the map φ. Of course, the changes of coordinates are special instances of invertible affine mappings on the standard affine space An. Similarly to the case of linear mappings, under the choice of two affine coordinate systems (A0, u) and (B0, v) on A, the coordinate expression of the identity mapping is the requested rule for the change of coordinates, cf. 4.1.5. Next, consider the change of frame x = w + M · x′ on the domain, given by a translation w and a matrix M, and let y′ = z + N · y describe a change of frame on the range space, given by a translation z and a matrix N. Then the coordinate expression x ↦ y0 + Y · x of f : A → B transforms as
(1) y′ = z + N · y = z + N · (y0 + Y · x) = (z + N · y0 + N · Y · w) + (N · Y · M) · x′.
Hence in the new frames the affine map is given by the translation vector z + N · y0 + N · Y · w and the matrix N · Y · M.

Euclidean maps

4.2.4. The Euclidean maps f : E1 → E2 are the affine maps which respect distances, which happens if and only if the associated linear maps φ are orthogonal. In particular, the coordinate description of invertible Euclidean maps involves orthogonal matrices Y. If the dimension of the target space is bigger, we can always complete the image of a chosen orthonormal frame into an orthonormal frame of the codomain, and then the relevant matrix Y contains an orthogonal block, completed by zeros.

\[
\begin{pmatrix} 2&0&1&1&-1\\ 1&3&2&-1&-5\\ 3&1&4&-1&-1\\ 5&4&0&5&-13\\ 3&-2&3&0&4 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 1&3&2&-1&-5\\ 0&2&1&-1&-3\\ 0&0&1&-1&1\\ 0&0&0&0&0\\ 0&0&0&0&0 \end{pmatrix}.
\]
The vectors defining V are thus linear combinations of the vectors defining U. So the direction of V is a subspace of the direction of U, and hence φ = 0.
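The conclusion of case (e) — that the direction of V lies inside the direction of U — can also be read off from matrix ranks, which is exactly what the row reduction above computes. A brief check:

    import numpy as np

    U = np.array([[2, 1, 3, 5, 3],
                  [0, 3, 1, 4, -2],
                  [1, 2, 4, 0, 3]], dtype=float)       # directions of U
    V = np.array([[-1, 1, 1, -5, 0],
                  [1, 5, 1, 13, -4]], dtype=float)     # directions of V

    # V lies inside U iff adjoining its vectors does not raise the rank
    print(np.linalg.matrix_rank(U),
          np.linalg.matrix_rank(np.vstack([U, V])))
    # both ranks are 3, so Z(V) is a subspace of Z(U) and the angle is 0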
Case (f). Find the intersection of the directions of U and V: search for numbers t, s, p, q, r ∈ R such that
t(1, 0, 1, 1, 1) + s(1, 0, 0, 1, 1) = p(1, 1, 1, 1, 1) + q(1, 1, 0, 1, 1) + r(1, 1, 0, 1, 0).
The solution is (t, s, p, q, r) = (−a, a, −a, a, 0), a ∈ R. The intersection Z(U) ∩ Z(V) of the vector subspaces therefore contains the vectors
(0, 0, −a, 0, 0) = −a(1, 0, 1, 1, 1) + a(1, 0, 0, 1, 1) = −a(1, 1, 1, 1, 1) + a(1, 1, 0, 1, 1) + 0·(1, 1, 0, 1, 0),
where a ∈ R. So Z(U) ∩ Z(V) is generated by (0, 0, 1, 0, 0), and its orthogonal complement (Z(U) ∩ Z(V))⊥ is generated by the vectors (1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 0, 0, 1). We obtain Z(U) ∩ Z(V) ≠ {0}, Z(U) ∩ Z(V) ≠ Z(U), Z(U) ∩ Z(V) ≠ Z(V). The angle φ is therefore defined as the angle between the subspaces Z(U) ∩ (Z(U) ∩ Z(V))⊥ and Z(V) ∩ (Z(U) ∩ Z(V))⊥. It is now established that
Z(U) ∩ (Z(U) ∩ Z(V))⊥ = ⟨(1, 0, 0, 1, 1)⟩,
Z(V) ∩ (Z(U) ∩ Z(V))⊥ = ⟨(1, 1, 0, 1, 1), (1, 1, 0, 1, 0)⟩.
(It is enough to express Z(U) in terms of the vectors (0, 0, 1, 0, 0), (1, 0, 0, 1, 1), and Z(V) in terms of (0, 0, 1, 0, 0), (1, 1, 0, 1, 1), (1, 1, 0, 1, 0).) Since the dimension of Z(U) ∩ (Z(U) ∩ Z(V))⊥ is 1, we can use the formula (1), where u = (1, 0, 0, 1, 1) and pu is the orthogonal projection of u onto Z(V) ∩ (Z(U) ∩ Z(V))⊥. Writing pu = a(1, 1, 0, 1, 1) + b(1, 1, 0, 1, 0), the conditions
⟨pu − u, (1, 1, 0, 1, 1)⟩ = 0, ⟨pu − u, (1, 1, 0, 1, 0)⟩ = 0
lead to the system of linear equations 4a + 3b = 3, 3a + 3b = 2 with the unique solution a = 1, b = −1/3. Thus pu = (2/3, 2/3, 0, 2/3, 1). From (1) it follows that

Remark. Notice an amazing fact about the transformations of Euclidean spaces keeping the distances of points invariant, i.e. general distance-preserving mappings f : En → En. A point C lies on the segment determined by the points A and B if and only if ‖C − B‖ + ‖C − A‖ = ‖B − A‖. Consequently, any mapping f preserving distances must map segments to segments, and thus also lines to lines. Moreover, preserving distances, it clearly preserves ratios too, and thus we deal with an affine map. Affine maps preserve distances if and only if they are Euclidean. In 4.D.1, we can see a purely analytic proof that in principle all such (reasonably “smooth”) maps are affine and thus Euclidean. The advantage of the above synthetic approach is that even the smoothness of the mapping follows from the assumptions.

4.2.5. Affine and Euclidean properties. Now we can consider which properties are related to the affine structure and where we really need the scalar product on the difference vector space. All Euclidean transformations, i.e. bijective affine maps which preserve the distances of points, preserve also all the objects we have studied (possibly up to a change of orientation): in particular unoriented angles, unoriented volumes, angles between subspaces, etc. If we want them to preserve also oriented angles, cross products and volumes, then we must assume in addition that the transformations preserve the orientation. Dealing with a general affine transformation T on an m-dimensional Euclidean space, the volumes are multiplied by the determinant of the linear part of T. We ask: which concepts are preserved under affine transformations? Recall first that an affine transformation of an n-dimensional space A is uniquely defined by the images of n + 1 points in general position, that is, by the image of one n-dimensional simplex. In the plane, this means choosing the image of any nondegenerate triangle. The preserved properties are those related to subspaces and ratios. In particular, incidence properties of the type “a line passing through a point” or “a plane containing a line” are preserved, and so is the collinearity of vectors.
For every two collinear vectors, the ratio of their lengths is preserved, independently of the scalar product defining the length. Similarly, the ratio of the volumes of two n-dimensional parallelepipeds is preserved under the transformations, since the determinant of the corresponding matrix changes both volumes by the same multiple. These affine properties can be used to prove geometric statements. For instance, to prove that the medians of a triangle in the plane intersect in a single point, at one third of their lengths, it is sufficient to verify this for an isosceles right-angled triangle only, or for an equilateral triangle only. Then the property holds for all triangles. Think about this argument!

\[
\cos\varphi=\frac{\|(2/3,\,2/3,\,0,\,2/3,\,1)\|}{\|(1,0,0,1,1)\|}=\frac{\sqrt7}{3},\qquad \varphi\doteq 0.49\ (\approx 28°). \quad\square
\]

C. Geometry of quadratic forms

4.C.1. Determine a polar basis of the form f : R³ → R, f(x1, x2, x3) = 3x1² + 2x1x2 + x2² + 4x2x3 + 6x3².

Solution. The matrix of the form is
\[
A=\begin{pmatrix} 3&1&0\\ 1&1&2\\ 0&2&6 \end{pmatrix}.
\]
According to step (1) of the Lagrange algorithm (see Theorem 4.3.1), we perform the following operations:
\[
f(x_1,x_2,x_3)=\tfrac13(3x_1+x_2)^2+\tfrac23x_2^2+4x_2x_3+6x_3^2=\tfrac13y_1^2+\tfrac32\big(\tfrac23y_2+2y_3\big)^2=\tfrac13z_1^2+\tfrac32z_2^2 .
\]
The form has rank 2, and the change to the polar coordinates is obtained by combining the transformations z1 = y1 = 3x1 + x2, z2 = (2/3)y2 + 2y3 = (2/3)x2 + 2x3, z3 = y3 = x3, so the matrix of the change of coordinates is
\[
T=\begin{pmatrix} 3&1&0\\ 0&\frac23&2\\ 0&0&1 \end{pmatrix}.
\]
We computed the polar coordinates, expressed them in terms of the standard ones, and wrote them into the rows of this matrix (the columns of this matrix express the vectors of the standard basis in the polar basis). The coordinates of the polar basis vectors are the columns of the matrix
\[
T^{-1}=\begin{pmatrix} \frac13&-\frac12&1\\ 0&\frac32&-3\\ 0&0&1 \end{pmatrix}.
\]
The polar basis is therefore ((1/3, 0, 0), (−1/2, 3/2, 0), (1, −3, 1)). □

4.C.2. Determine a polar basis of the form f : R³ → R, f(x1, x2, x3) = 2x1x3 + x2².

Solution. The matrix of the form is
\[
A=\begin{pmatrix} 0&0&1\\ 0&1&0\\ 1&0&0 \end{pmatrix}.
\]
Change the order of the variables: y1 = x2, y2 = x1, y3 = x3. Step (1) of the Lagrange algorithm is then trivial (there are no mixed terms containing y1). For the next step, however, case (4) applies: introduce the transformation z1 = y1, z2 = y2, z3 = y3 − y2. Then
\[
f(x_1,x_2,x_3)=z_1^2+2z_2(z_3+z_2)=z_1^2+\tfrac12(2z_2+z_3)^2-\tfrac12z_3^2 .
\]

case of an equilateral triangle — see the theory column above.

4.2.6. Transformations of quadrics. After straight lines (which are mapped to straight lines again by all affine maps), the simplest objects in the analytic geometry of the plane are the conic sections. These are given by quadratic equations in Cartesian coordinates, and a conic is distinguished as a circle, ellipse, parabola or hyperbola by examining the coefficients. There are also two degenerate cases, namely a pair of lines or a point. We cannot distinguish a circle from an ellipse in affine geometry, but they are different in Euclidean geometry. In analogy with the equations of conic sections in the plane, we may discuss quadratic objects in Euclidean point spaces. These are defined in all orthonormal frames by quadratic equations, and they are known as quadrics. Consider a general quadratic equation for the coordinates (x1, . . . , xn)ᵀ of a point A ∈ En,
\[
(1)\qquad \sum_{i,j=1}^n a_{ij}x_ix_j+\sum_{i=1}^n a_ix_i+a=0,
\]
where it may be assumed by symmetry that aij = aji, without loss of generality.
This equation can be written as f(u) + g(u) + a = 0 for a quadratic form f (i.e. the restriction of a symmetric bilinear form F to pairs of equal arguments), a linear form g, and a scalar a ∈ R. We assume that at least one coefficient aij is nonzero; otherwise the equation is linear and describes a hyperplane. Notice that the equation (1) keeps the same shape under every affine coordinate transformation, i.e., it splits again into a nontrivial quadratic part, a linear part and a constant. Recognizing which of the standard types of quadrics is determined by a given equation is an extremely useful tool (we shall see this later, e.g. in multivariate analysis in Chapter 8), and thus we devote the next section to this topic. Notice also that the above observation is true for every fixed order of multivariate polynomial expressions. In particular, there is a well-defined concept of cubics, given by cubic equations and invariant with respect to all affine transformations; similarly quartics, quintics, etc. We shall meet cubic curves in the plane in more detail in Chapter 11, because of their fascinating use in cryptography.

3. Geometry of quadratic forms and quadrics

We shall start with the affine point of view and proceed to the Euclidean classification later.

4.3.1. Linear transformations of quadratic forms. Let us recall the bilinear symmetric forms F : Rⁿ × Rⁿ → R, cf. 2.3.23, and the corresponding mappings f(x) = F(x, x) for all x ∈ Rⁿ.

Together, the diagonal coordinates are w1 = z1 = x2, w2 = 2z2 + z3 = x1 + x3, w3 = z3 = x3 − x1, in which f = w1² + (1/2)w2² − (1/2)w3². The matrix expressing the polar coordinates (w1, w2, w3) in terms of (x1, x2, x3), and its inverse, are
\[
T=\begin{pmatrix} 0&1&0\\ 1&0&1\\ -1&0&1 \end{pmatrix},\qquad
T^{-1}=\begin{pmatrix} 0&\frac12&-\frac12\\ 1&0&0\\ 0&\frac12&\frac12 \end{pmatrix}.
\]
The polar basis is therefore ((0, 1, 0), (1/2, 0, 1/2), (−1/2, 0, 1/2)). □

4.C.3. Find a polar basis of the quadratic form f : R³ → R, which in the standard basis is defined as f(x1, x2, x3) = x1x2 + x1x3.

Solution. We apply the Lagrange algorithm. There are no squares, so we first substitute x1 = y1, x2 = y2 + y1, x3 = y3 (i.e. y2 = x2 − x1), which produces a square:
\[
f=y_1(y_1+y_2)+y_1y_3=y_1^2+y_1y_2+y_1y_3=\big(y_1+\tfrac12y_2+\tfrac12y_3\big)^2-\tfrac14(y_2+y_3)^2 .
\]
With the further substitutions z1 = y1 + (y2 + y3)/2 = (x1 + x2 + x3)/2, z2 = (y2 + y3)/2 = (−x1 + x2 + x3)/2 and z3 = x3, the quadratic form has the diagonal shape f = z1² − z2², which means that the basis associated with the coordinates (z1, z2, z3) is a polar basis of the form. If we want to exhibit the basis itself, we need the matrix of the change from the polar basis to the standard one; by the definition of such a matrix, its columns are the polar basis vectors. We may either express the old variables (x1, x2, x3) in terms of the new ones, or equivalently express the new variables in terms of the old ones (which is easier) and invert the resulting matrix:
\[
T=\begin{pmatrix} \frac12&\frac12&\frac12\\ -\frac12&\frac12&\frac12\\ 0&0&1 \end{pmatrix},\qquad
T^{-1}=\begin{pmatrix} 1&-1&0\\ 1&1&-1\\ 0&0&1 \end{pmatrix}.
\]
Hence one of the polar bases of the given quadratic form is (see the columns of T⁻¹)
{(1, 1, 0), (−1, 1, 0), (0, −1, 1)}. □

For an arbitrary basis e of this vector space, the value f(x) at a vector x = x1e1 + · · · + xnen is given by the equation
\[
(1)\qquad f(x)=F(x,x)=\sum_{i,j}x_ix_jF(e_i,e_j)=x^T\cdot A\cdot x,
\]
where A = (aij) is the symmetric matrix with the entries aij = F(ei, ej).
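Formula (1) and the basis-change rule derived in the next paragraph are easy to experiment with. A small sketch using the matrix A of the form from 4.C.1 (the test vector and the matrix S are arbitrary choices made for illustration):

    import numpy as np

    # matrix of f = 3x1^2 + 2x1x2 + x2^2 + 4x2x3 + 6x3^2  (cf. 4.C.1)
    A = np.array([[3., 1., 0.],
                  [1., 1., 2.],
                  [0., 2., 6.]])

    f = lambda x: x @ A @ x                 # formula (1): f(x) = x^T A x

    x = np.array([1., -2., 1.])
    print(f(x))                             # 3 - 4 + 4 - 8 + 6 = 1

    # under a change of basis x = S x', the matrix transforms as S^T A S
    S = np.array([[1., 1., 0.],
                  [0., 1., 1.],
                  [0., 0., 1.]])
    xp = np.linalg.solve(S, x)              # coordinates of the same vector
    print(np.isclose(f(x), xp @ (S.T @ A @ S) @ xp))   # True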
We call such maps f quadratic forms, and the formula (1), f(x) = Σij aij xi xj, is called the analytic formula for the form f. In general, by a quadratic form on a vector space V we mean the restriction f(u) of a symmetric bilinear form F(u, v) to the arguments of the type (u, u) ∈ V × V. Evidently, the whole bilinear form F can be reconstructed from the values f(u) (unless we work over the scalars Z2), since
\[
f(u+v)=F(u+v,u+v)=f(u)+f(v)+2F(u,v).
\]
This is called the polarization process. Let us come back to the real vector spaces Rⁿ. If we change the basis e to a different basis e′1, . . . , e′n, we get different coordinates x = S · x′ for the same vector (here S is the corresponding transformation matrix), and so
\[
f(x)=(S\cdot x')^T\cdot A\cdot(S\cdot x')=(x')^T\cdot(S^T\cdot A\cdot S)\cdot x' .
\]
Clearly, if we fix the second argument in the bilinear form F, we obtain a linear mapping V → V∗ with the coordinate description x ↦ F(·, x) = xᵀ · A. Thus the rank of the matrix A of the quadratic form f is the dimension of the image of this mapping, and therefore it is independent of the choice of the coordinates. We call it the rank of the quadratic form f.

Our first task is to decide whether two quadratic forms can be transformed one into the other by a linear transformation. We shall easily see that the matrix A of each quadratic form becomes diagonal for a suitable choice of the basis u, and this resolves our task as well. In other words, we request F(ui, uj) = 0 for i ≠ j for the corresponding symmetric bilinear form F. Each such basis is called a polar basis of the quadratic form f. Later we shall see that we could even find a polar orthonormal basis (with respect to any scalar product). Nevertheless, without the use of a scalar product, there is a much simpler algorithm for finding a polar basis among all the bases; at the same time, this algorithm brings relevant information about the affine properties of the quadratic form. The algorithmic procedure in the proof of the next theorem is known as the Lagrange algorithm.

Theorem. Let f : V → R be a quadratic form on a real vector space V of dimension n. Then there exists a polar basis for f on V.

Proof. (1) Let A be the matrix of f in a basis u = (u1, . . . , un) of V, and assume a11 ≠ 0. Then we may write
\[
f(x_1,\dots,x_n)=a_{11}x_1^2+2a_{12}x_1x_2+\dots+a_{22}x_2^2+\dots
=a_{11}^{-1}(a_{11}x_1+a_{12}x_2+\dots+a_{1n}x_n)^2+\text{terms not containing } x_1 .
\]

4.C.4. Determine the type of the conic section defined by 3x1² − 3x1x2 + x2 − 1 = 0.

Solution. Complete the squares:
\[
3x_1^2-3x_1x_2+x_2-1=\tfrac13\big(3x_1-\tfrac32x_2\big)^2-\tfrac34x_2^2+x_2-1
=\tfrac13y_1^2-\tfrac43\big(\tfrac34x_2-\tfrac12\big)^2+\tfrac13-1
=\tfrac13y_1^2-\tfrac43y_2^2-\tfrac23 .
\]
According to the list in 4.3.6, the given conic section is a hyperbola. □

4.C.5. By completing the squares, express the quadric −x² + 3y² + z² + 6xy − 4z = 0 in a form from which its type can be determined.

Solution. Complete the squares, dealing first with all the terms containing x. We obtain the equation
−(x − 3y)² + 9y² + 3y² + z² − 4z = 0.
There are no “unwanted” terms containing y left, so we repeat the procedure for z. This gives
−(x − 3y)² + 12y² + (z − 2)² − 4 = 0.
We conclude that there is a transformation of variables which leads to the equation (dividing by 4 if desired)
−x̄² + ȳ² + z̄² − 1 = 0. □

We can tell the type of a conic section even without transforming its equation to the canonical form listed in 4.3.6. Every conic section can be expressed as
a11x² + 2a12xy + a22y² + 2a13x + 2a23y + a33 = 0.
The determinants
\[
\Delta=\det A=\begin{vmatrix} a_{11}&a_{12}&a_{13}\\ a_{12}&a_{22}&a_{23}\\ a_{13}&a_{23}&a_{33} \end{vmatrix}\qquad\text{and}\qquad \delta=\begin{vmatrix} a_{11}&a_{12}\\ a_{12}&a_{22} \end{vmatrix}
\]
are invariants of the conic section, which means that they are not changed by Euclidean transformations (rotations and translations). Furthermore, the different types of conic sections give different signs of these determinants:
• Δ ≠ 0 for non-degenerate conic sections: an ellipse for δ > 0, a hyperbola for δ < 0 and a parabola for δ = 0. For a real (not imaginary) ellipse, it is moreover necessary that (a11 + a22)Δ < 0.
• Δ = 0 for degenerate conic sections, i.e. pairs of lines.
Let us check that the signs (or the vanishing) of the determinants are indeed invariant under the coordinate transformations. Denote X = (x, y, 1)ᵀ and let A be the matrix of the quadratic form; then the corresponding conic section has the equation XᵀAX = 0.

This suggests transforming the coordinates (i.e., changing the basis) as x′1 = a11x1 + a12x2 + · · · + a1nxn, x′2 = x2, . . . , x′n = xn. This corresponds to the new basis
\[
v_1=a_{11}^{-1}u_1,\quad v_2=u_2-a_{11}^{-1}a_{12}u_1,\ \dots,\ v_n=u_n-a_{11}^{-1}a_{1n}u_1
\]
(as an exercise, compute the transformation matrix). In the new basis, the corresponding symmetric bilinear form satisfies F(v1, vi) = 0 for all i > 1 (compute it!). Thus f has the analytic formula a11⁻¹x′1² + h in the new coordinates, where h is a quadratic form not containing the variable x′1. It is often easier to choose v1 = u1 in the new basis instead; then f = f1 + h, where f1 depends only on x′1 while x′1 does not appear in h, but F(v1, v1) = a11.

(2) Assume that after step (1), the form h (of rank less than n) has a nonzero coefficient at x′2². Then the same procedure can be repeated to obtain the expression f = f1 + f2 + h, where h contains only the variables with index greater than two. We proceed in this way until a diagonal form is obtained after n − 1 steps, or until, in the i-th step say, the corresponding diagonal element aii is zero.

(3) If the latter occurs and there exists some other element ajj ≠ 0 with j > i, then it suffices to interchange the i-th and the j-th vector of the basis, and to continue according to the previous procedure.

(4) Assume that ajj = 0 for all j ≥ i. If there is no element ajk ≠ 0 with j ≥ i, k ≥ i, then we are finished, since the matrix is already diagonal. If ajk ≠ 0, then we use the transformation vj = uj + uk, keeping the other vectors of the basis unchanged (i.e. x′k = xk − xj, the other coordinates remain the same). Then
h(vj, vj) = h(uj, uj) + h(uk, uk) + 2h(uk, uj) = 2ajk ≠ 0,
and we can continue as in case (1). □

4.3.2. Affine classification of quadratic forms. The vectors of the basis obtained from the Lagrange algorithm can be rescaled by scalars so that the coefficients of the squares of the variables are only 1, −1 and 0. Moreover, the following law of inertia says that the number of 1's and the number of −1's do not depend on the choices made in the course of the algorithm. These numbers are called the signature of the quadratic form. As before, we obtain a complete description: two quadratic forms may be transformed one into the other by a linear transformation if and only if they have the same signature.

Theorem. For each nonzero quadratic form of rank r on a real vector space V there exists a natural number p, 0 ≤ p ≤ r, and r independent linear forms φ1, . . . , φr ∈ V∗ such that
\[
f(u)=(\varphi_1(u))^2+\dots+(\varphi_p(u))^2-(\varphi_{p+1}(u))^2-\dots-(\varphi_r(u))^2 .
\]
Otherwise put, there exists a polar basis in which f has the analytic formula
\[
f(x_1,\dots,x_n)=x_1^2+\dots+x_p^2-x_{p+1}^2-\dots-x_r^2 .
\]
The standard form is obtained by a rotation and a translation, i.e. by a transformation to new coordinates x′, y′ satisfying
x = x′ cos α − y′ sin α + c1,
y = x′ sin α + y′ cos α + c2,
or, in matrix form, with the new coordinates X′ = (x′, y′, 1)ᵀ,
\[
(1)\qquad X=\begin{pmatrix} x\\ y\\ 1 \end{pmatrix}=\begin{pmatrix} \cos\alpha & -\sin\alpha & c_1\\ \sin\alpha & \cos\alpha & c_2\\ 0&0&1 \end{pmatrix}\begin{pmatrix} x'\\ y'\\ 1 \end{pmatrix}=MX' .
\]
Substituting X = MX′ into the equation of the conic section, we obtain the equation in the new coordinates:
XᵀAX = 0, (MX′)ᵀA(MX′) = 0, X′ᵀ(MᵀAM)X′ = 0.
Denote by A′ = MᵀAM the matrix of the quadratic form in the new coordinates. The matrix M above has unit determinant, so
det A′ = det Mᵀ · det A · det M = det A = Δ.
Similarly, the subdeterminant A33 (the algebraic complement of a33) is invariant under the coordinate transformation. For a rotation alone, the matrix is
\[
M=\begin{pmatrix} \cos\alpha & -\sin\alpha & 0\\ \sin\alpha & \cos\alpha & 0\\ 0&0&1 \end{pmatrix},
\]
and det A′33 = det A33 = δ. For a translation alone,
\[
M=\begin{pmatrix} 1&0&c_1\\ 0&1&c_2\\ 0&0&1 \end{pmatrix},
\]
and this subdeterminant remains unchanged as well.

4.C.6. Determine the type of the conic section 2x² − 2xy + 3y² − x + y − 1 = 0.

Solution. The determinant
\[
\Delta=\begin{vmatrix} 2&-1&-\frac12\\ -1&3&\frac12\\ -\frac12&\frac12&-1 \end{vmatrix}=-\frac{23}{4}\ne 0,
\]
hence the conic section is non-degenerate. Moreover, δ = 5 > 0, therefore it is an ellipse. Furthermore, (a11 + a22)Δ = (2 + 3) · (−23/4) < 0, so it is a real ellipse. □

4.C.7. Determine the type of the conic section x² − 4xy − 5y² + 2x + 4y + 3 = 0.

The number p of positive diagonal coefficients in the matrix of the given quadratic form (and thus the number r − p of negative coefficients) does not depend on the choice of the polar basis. Two symmetric matrices A, B of dimension n are the matrices of the same quadratic form in different bases if and only if they have the same rank and the same number of positive coefficients in the polar basis.

Proof. By completing the squares, f(x1, . . . , xn) = λ1x1² + · · · + λrxr², λi ≠ 0, in a suitable basis of V. Assume moreover that the first p coefficients λi are positive. Then the transformation
\[
y_1=\sqrt{\lambda_1}\,x_1,\ \dots,\ y_p=\sqrt{\lambda_p}\,x_p,\quad y_{p+1}=\sqrt{-\lambda_{p+1}}\,x_{p+1},\ \dots,\ y_r=\sqrt{-\lambda_r}\,x_r,\quad y_{r+1}=x_{r+1},\ \dots,\ y_n=x_n
\]
yields the desired formula. The forms φi are exactly the forms of the dual basis in V∗ to the obtained polar basis. It remains to prove that p does not depend on the procedure. Assume that the same form f is expressed in two polar bases u, v as
f(x1, . . . , xn) = x1² + · · · + xp² − xp+1² − · · · − xr²,
f(y1, . . . , yn) = y1² + · · · + yq² − yq+1² − · · · − yr².
Denote by P = ⟨u1, . . . , up⟩ the subspace generated by the first p vectors of the first basis, and set Q = ⟨vq+1, . . . , vn⟩. Then f(u) > 0 for each 0 ≠ u ∈ P, while f(v) ≤ 0 for each v ∈ Q. Hence necessarily P ∩ Q = {0}, and therefore dim P + dim Q ≤ n, i.e. p + (n − q) ≤ n, so that p ≤ q. By interchanging the roles of the two bases, q ≤ p, and so p = q. Thus p is independent of the choice of the polar basis. Consequently, for two matrices with the same rank and the same number of positive coefficients in the diagonal form of the corresponding quadratic form, the analytic formulas are the same. □

While discussing symmetric maps, we talked about definite and semidefinite maps. The same discussion has an obvious meaning also for symmetric bilinear forms and quadratic forms.
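The invariants Δ and δ translate directly into a small classification routine. The sketch below (the function name and its interface are, of course, just an illustrative choice) reproduces the conclusion of 4.C.6:

    import numpy as np

    def conic_type(a11, a12, a22, a13, a23, a33):
        """Classify a11 x^2 + 2 a12 xy + a22 y^2 + 2 a13 x + 2 a23 y + a33 = 0
        by the invariants Delta and delta discussed above."""
        A = np.array([[a11, a12, a13],
                      [a12, a22, a23],
                      [a13, a23, a33]])
        Delta = np.linalg.det(A)
        delta = a11 * a22 - a12 ** 2
        if np.isclose(Delta, 0):
            return "degenerate (pair of lines)"
        if delta > 0:
            return "real ellipse" if (a11 + a22) * Delta < 0 else "imaginary ellipse"
        return "hyperbola" if delta < 0 else "parabola"

    # 4.C.6: 2x^2 - 2xy + 3y^2 - x + y - 1 = 0
    print(conic_type(2, -1, 3, -0.5, 0.5, -1))   # real ellipse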
Types of quadratic forms

A quadratic form f on a real vector space V is called
(1) positive definite if f(u) > 0 for all vectors u ≠ 0,
(2) positive semidefinite if f(u) ≥ 0 for all vectors u ∈ V,
(3) negative definite if f(u) < 0 for all vectors u ≠ 0,
(4) negative semidefinite if f(u) ≤ 0 for all vectors u ∈ V,
(5) indefinite if f(u) > 0 and f(v) < 0 for suitable vectors u, v ∈ V.
The same names are used for the symmetric matrices corresponding to the quadratic forms. By the signature of a symmetric matrix we mean the signature of the corresponding quadratic form.

Solution. The determinant
\[
\Delta=\begin{vmatrix} 1&-2&1\\ -2&-5&2\\ 1&2&3 \end{vmatrix}=-34\ne0,
\]
and furthermore
\[
\delta=\begin{vmatrix} 1&-2\\ -2&-5 \end{vmatrix}=-9<0,
\]
so the conic section is a hyperbola. □

4.C.8. Determine the equation and the type of the conic section passing through the points
[−2, −4], [8, −4], [0, −2], [0, −6], [6, −2].

Solution. Substituting the coordinates of the points into the general equation of a conic section,
a11x² + a22y² + 2a12xy + a1x + a2y + a = 0,
gives the system of linear equations
4a11 + 16a22 + 16a12 − 2a1 − 4a2 + a = 0,
64a11 + 16a22 − 64a12 + 8a1 − 4a2 + a = 0,
4a22 − 2a2 + a = 0,
36a22 − 6a2 + a = 0,
36a11 + 4a22 − 24a12 + 6a1 − 2a2 + a = 0.
In matrix form, we perform the operations
\[
\begin{pmatrix} 4&16&16&-2&-4&1\\ 64&16&-64&8&-4&1\\ 0&4&0&0&-2&1\\ 0&36&0&0&-6&1\\ 36&4&-24&6&-2&1 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 4&16&16&-2&-4&1\\ 0&4&0&0&-2&1\\ 0&0&64&-8&12&-9\\ 0&0&0&24&0&3\\ 0&0&0&0&3&-2 \end{pmatrix}\sim\cdots\sim
\begin{pmatrix} 48&0&0&0&0&-1\\ 0&12&0&0&0&-1\\ 0&0&64&0&0&0\\ 0&0&0&24&0&3\\ 0&0&0&0&3&-2 \end{pmatrix}.
\]
Then (choosing a = 48) a11 = 1, a22 = 4, a12 = 0, a1 = −6, a2 = 32. The conic section has the equation
x² + 4y² − 6x + 32y + 48 = 0.
Completing the terms x² − 6x and 4y² + 32y to squares, the result is
(x − 3)² + 4(y + 4)² − 25 = 0, or rather (x − 3)²/5² + (y + 4)²/(5/2)² − 1 = 0.
The conic section is an ellipse with the centre at [3, −4]. □

Sylvester criterion

4.3.3. Theorem. A symmetric real matrix A is positive definite if and only if all its leading principal minors are positive. A symmetric real matrix A is negative definite if and only if (−1)ⁱ|Ai| > 0 for all leading principal submatrices Ai.

Proof. The claim about negative definite matrices follows immediately from the first part of the theorem: just observe that A is negative definite if and only if −A is positive definite. Suppose that the form f is positive definite. Then A = PᵀEP = PᵀP for a suitable regular matrix P, hence |A| = |P|² > 0. Let u be a chosen basis in which the form f has the matrix A. The restrictions of f to the subspaces Vk = ⟨u1, . . . , uk⟩ are again positive definite forms fk, and the corresponding matrices in the bases (u1, . . . , uk) are the leading principal submatrices Ak. Thus |Ak| > 0, too.

In order to prove the converse implication, let us analyse in detail the transformations used when completing the squares in the Lagrange algorithm. The transformation used in the first step always has an upper triangular matrix T, and by rescaling (see the proof of Theorem 4.3.1) the matrix may be chosen with ones on the diagonal:
\[
T=\begin{pmatrix} 1 & -\frac{a_{12}}{a_{11}} & \dots & -\frac{a_{1n}}{a_{11}}\\ 0&1&\dots&0\\ \vdots & & \ddots & \vdots\\ 0&0&\dots&1 \end{pmatrix}.
\]
Such a matrix of the transformation from the basis u to a basis v has several useful properties. In particular, its leading principal submatrices Tk, formed by the first k rows and columns, are the transformation matrices of the subspace Pk = ⟨u1, . . . , uk⟩ from the basis (u1, . . . , uk) to the basis (v1, . . . , vk). The leading principal submatrices Ak of the matrix A of the form f are the matrices of the restrictions of f to Pk. Therefore, the matrices Ak and A′k of the restrictions to Pk in the bases u and v, respectively, satisfy Ak = Tkᵀ A′k Tk. The inverse matrix of an upper triangular matrix with ones on the diagonal is again an upper triangular matrix with ones on the diagonal, hence we may similarly express A′ in terms of A. By the Cauchy formula, the determinants of the matrices Ak and A′k are thus equal. We may conclude:

Claim. Let f be a quadratic form on V, dim V = n, and let u be a basis of V such that the steps (3) and (4) of the Lagrange algorithm are not needed while finding the polar basis. Then the analytic formula
f(x1, . . . , xn) = λ1x1² + λ2x2² + · · · + λrxr²
is obtained, where r is the rank of the form f, λ1, . . . , λr ≠ 0, and the leading principal submatrices of the (original) matrix A of the quadratic form f satisfy |Ak| = λ1λ2 · · · λk, k ≤ r.
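The Sylvester criterion is straightforward to implement by evaluating the leading principal minors. A sketch, applied to the matrix from 4.C.1 (which has rank 2, so it is positive semidefinite but not positive definite):

    import numpy as np

    def is_positive_definite(A):
        """Sylvester criterion: all leading principal minors positive."""
        A = np.asarray(A, dtype=float)
        return all(np.linalg.det(A[:k, :k]) > 0
                   for k in range(1, A.shape[0] + 1))

    A = np.array([[3., 1., 0.],      # the matrix from 4.C.1
                  [1., 1., 2.],
                  [0., 2., 6.]])
    print([np.linalg.det(A[:k, :k]) for k in (1, 2, 3)])  # [3.0, 2.0, 0.0]
    print(is_positive_definite(A))                        # False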
Therefore, the matrices Ak and A′ k of restrictions to Pk in basis u and v respectively satisfy Ak = TT k A′ kTk, where T is the transformation matrix from u to v. The inverse matrix to an upper triangular matrix with one’s on the diagonal is again an upper triangular matrix with one’s on the diagonal. Hence we may similarly express A′ in terms of A. Thus the determinants of the matrices Ak and A′ k are equal by Cauchy formula. We may conclude: Claim. Let f be a quadratic form on V , dim V = n. Let u be a basis of V such that the items (3) and (4) from the Lagrange algorithm while finding the polar basis are not needed. Then the analytic formula f(x1, . . . , xn) = λ1x2 1 + λ2x2 2 + · · · + λrx2 r is obtained where r is the rank of the form f, λ1, . . . , λr ̸= 0 and for the leading principal submatrices of the (former) matrix A of quadratic form f, |Ak| = λ1λ2 . . . λk, k ≤ r. 346 4.C.9. Other characteristics and concepts of conic sections. The axis of a conic section is a line of reflection symmetry for the conic section. From the canonical form of a conic section in polar basis (4.3.6) it can be shown that an ellipse and a hyperbola both have two axes (x = 0 and y = 0). A parabola has one axis (x = 0). The intersection of a conic section and its axis is called a conic section vertex. The numbers a, b from the canonical form of a conic section (which express the distance between vertices and the origin) are called the length of semi-axes. In the case of an ellipse and hyperbola, the axes intersect at the origin. This is a point of central symmetry for the conic section, called the centre of the conic section. For practical problems involving conic sections, it is often easiest to describe them in parametric form. Often, this avoids contending with messy square roots. Every point P on the parabola y2 = 4ax, a > 0, can be described by P = (x, y) = (at2 , 2at), for real t. The standard parametric form for the parabola is the pair of equations x = at2 y = 2at, (Note that the roles of x and y are interchanged, so that the axis of symmetry is the line y = 0.) The tangent line at at2 , 2at) has slope 1 t and equation t(y − 2at) = (x − at2 ). The point F = (a, 0) on the axis is called the focus of the parabola, and the line x = −a is called the directrix. Each point on the parabola is equidistant from the focus and the directrix. This property can be used to define a parabola. Every point P on the ellipse x2 a2 + y2 b2 = 1 can be described by P = (x, y) = (a cos θ, b sin θ, ) where 0 < b ≤ a. The standard parametric form for the ellipse is the pair of equations x = a cos θ, y = b sin θ. The tangent line at P has slope −b cos θ a sin θ and consequently has equation (a cos θ)(y − b sin θ) = −b cos θ)(x − a cos θ). The positive number e, defined by b2 = a2 (1 − e2 ) is called the eccentricity of the ellipse. If e = 0, the ellipse becomes a circle or radius a = b. Otherwise 0 < e < 1. The two points F1 = (ae, 0) and F2 = (−ae, 0) are the foci of the ellipse, and the lines x = ±a/e are the directrices. Every point P on the hyperbola x2 a2 − y2 b2 = 1, 0 < a, 0 < b, can be described by P = (x, y) = (a cosh θ, b sinh θ). The standard parametric form for the hyperbola is the pair of equations x = a cosh θ, y = b sinh θ. The tangent line at P has slope b cosh θ a sinh θ and consequently has equation (a cosh θ)(y − b sinh θ) = b cosh θ)(x − a cosh θ). The positive number e, defined by b2 = a2 (e2 − 1) is called the eccentricity of the hyperbola. Necessarily, e > 1. 
A hyperbola has two asymptotes; in the standard form, their equations are y = ±(b/a)x.

After each step in this procedure, the resulting matrix contains zeros below the diagonal in the already processed columns, and at the same time all the leading principal minors remain the same. Consequently, if the leading principal minors are nonzero, then the next diagonal term in A is nonzero, and we do not need any steps other than completing the squares. Moreover, λi = |Ai|/|Ai−1|. This proves the following:

Corollary (Jacobi theorem). Let f be a quadratic form of rank r on a vector space V with the matrix A in a basis u. The steps other than completing the square are not required if and only if the leading principal submatrices of A satisfy |A1| ≠ 0, . . . , |Ar| ≠ 0. In that case there exists a polar basis in which f has the analytic formula
\[
f(x_1,\dots,x_n)=|A_1|\,x_1^2+\frac{|A_2|}{|A_1|}\,x_2^2+\dots+\frac{|A_r|}{|A_{r-1}|}\,x_r^2 .
\]
Hence if all the leading principal minors are positive, then f is positive definite by the Jacobi theorem, and the Sylvester criterion is proved. □

4.3.4. Euclidean classification of quadratic forms. We come back to the discussion of the equation 4.2.6(1) in the Euclidean context, and begin with its quadratic part, i.e. the bilinear symmetric form F : V × V → R. Assume now that the real vector space is equipped with a scalar product, and choose an orthonormal basis. Then the matrix of the bilinear form F, which is also the matrix of f, transforms under a change of coordinates in such a way that for orthogonal changes it coincides with the transformation of the matrix of a linear map (because then S⁻¹ = Sᵀ). This result can be interpreted as the following observation:

Proposition. Let V be a real vector space with a scalar product. Then the formula φ ↦ F, F(u, u) = ⟨φ(u), u⟩, defines a bijection between symmetric linear maps and quadratic forms on V.

Proof. Each bilinear form with a fixed second argument becomes a linear form F(·, u). In the presence of a scalar product, it is given by the formula F(v, u) = v · w for a suitable vector w; put φ(u) = w. Directly from the coordinate expression 4.3.1(1) displayed above, φ is the linear map with the symmetric matrix A, hence it is self-adjoint. On the other hand, each symmetric map φ defines a symmetric bilinear form F by the formula F(u, v) = ⟨φ(u), v⟩ = ⟨u, φ(v)⟩, and thus also a quadratic form. □

It is immediate that for each quadratic form f there exists an orthonormal basis in which f has a diagonal matrix. The values on the diagonal are the eigenvalues of the matrix, and they are determined uniquely up to their order. The rank of f equals the dimension of the image of the corresponding map φ.

4.C.10. Existence of foci. For an ellipse with the lengths of the semi-axes a > b, show that the sum of the distances from any point of the ellipse to its two foci is constant, namely 2a.

Solution. If P = (a cos θ, b sin θ) and F1 = (ae, 0), then
\[
|PF_1|^2=(a\cos\theta-ae)^2+b^2\sin^2\theta
=a^2\cos^2\theta-2a^2e\cos\theta+a^2e^2+a^2(1-e^2)\sin^2\theta
=a^2\big[1-2e\cos\theta+e^2-e^2(1-\cos^2\theta)\big]
=a^2(1-e\cos\theta)^2 .
\]
So |PF1| = a(1 − e cos θ). Similarly, |PF2| = a(1 + e cos θ). Hence |PF1| + |PF2| = 2a. □

Solution. (Alternative) Consider the points X = [x, y] which satisfy the property |F1X| + |F2X| = 2a.
Coordinatewise, this implies the equation
$$\sqrt{(x + ae)^2 + y^2} + \sqrt{(x - ae)^2 + y^2} = 2a.$$
Rewrite this as $\sqrt{(x + ae)^2 + y^2} = 2a - \sqrt{(x - ae)^2 + y^2}$. Square, simplify and square again to get $(1 - e^2)x^2 + y^2 = a^2(1 - e^2)$. Substitute $b^2 = a^2(1 - e^2)$ and divide by $b^2$ to obtain
$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1,$$
which is the ellipse in standard form. □

Remark. Similarly, the foci of the hyperbola are the points $F_1$, $F_2$ which satisfy $\bigl||F_2X| - |F_1X|\bigr| = 2a$ for an arbitrary $X$ on the hyperbola. You can check this in the same way as above for the ellipse, with $F_1 = [ae, 0]$, $F_2 = [-ae, 0]$, $ae = \sqrt{a^2 + b^2}$.

Parabola focus. If the parabola has equation $x^2 = 2py$, the focus is the point $F = [0, \frac{p}{2}]$. It is characterized by the fact that the distance between this point and an arbitrary $X$ on the parabola is equal to the distance between $X$ and the line $y = -\frac{p}{2}$.

4.C.11. Find the foci of the ellipse $x^2 + 2y^2 = 2$.

Solution. From the equation, the lengths of the semi-axes are $a = \sqrt{2}$ and $b = 1$. Compute (see 4.C.10): $ae = \sqrt{a^2 - b^2} = 1$. The foci are at $[-1, 0]$ and $[1, 0]$. □

4.C.12. Prove that the product of the distances between the foci of an ellipse and any tangent line is constant. Find the value of the constant.

Solution. Every point $T$ on the ellipse has coordinates $T = (x, y)$ where $x = a\cos\theta$, $y = b\sin\theta$ for some $\theta$. The tangent line to the ellipse at $T$ has equation
$$y - b\sin\theta = -\frac{b\cos\theta}{a\sin\theta}(x - a\cos\theta), \quad\text{i.e.}\quad a(\sin\theta)(y - b\sin\theta) = -b(\cos\theta)(x - a\cos\theta).$$

4.3.5. Classification of quadrics. We return to the equation 4.2.6(1). The above results enable us to rewrite this equation in a suitable orthonormal frame of the difference space as
$$\sum_{i=1}^n \lambda_i x_i^2 + \sum_{i=1}^n b_i x_i + b = 0.$$
Hence we may assume that the quadric is given in this form. In the next step, we "complete the square" for the coordinates $x_i$ with $\lambda_i \neq 0$, which "absorbs" the squares together with the linear terms in the same variable. So only those linear terms are left which correspond to variables for which the coefficient of the quadratic term is zero. We have
$$\sum_{i=1}^n \lambda_i (x_i - p_i)^2 + \sum_j b_j x_j + c = 0,$$
where the summation over $j$ runs only over the $j$ satisfying $\lambda_j = 0$. This corresponds to a translation of the origin by the vector with coordinates $p_i$. For such a choice of basis of the difference space, the quadratic part attains the desired diagonal form. In the identification of quadratic forms with linear maps derived above, this means that $\varphi$ is diagonal on the orthogonal complement of its kernel. If there are also some linear terms left, the orthonormal basis of the difference space can be adjusted on the kernel of $\varphi$ so that the corresponding linear form is a multiple of the first term of the dual basis. Hence the final formula reads
$$\sum_{i=1}^k \lambda_i y_i^2 + b y_{k+1} + c = 0,$$
where $k$ is the rank of the matrix of the quadratic form $f$. If $b \neq 0$, it can be arranged that the constant $c$ in the equation is zero, by a further change of the origin. Hence the linear term may (but does not have to) appear only in the case that the rank of $f$ is less than $n$, while $c \in \mathbb{R}$ may be nonzero only if $b = 0$. The resulting equations are called the canonical analytic formulas for quadrics.

4.3.6. The case of $\mathcal{E}_2$. As an example of the previous procedure, we discuss the simplest case of a nontrivial dimension, namely dimension two. The original equation has the form
$$a_{11}x^2 + a_{22}y^2 + 2a_{12}xy + a_1 x + a_2 y + a = 0.$$
By a suitable choice of a basis of the difference space, and the subsequent completion of the square, it is written in the form (using the same notation $x$, $y$ for the new coordinates)
$$a_{11}x^2 + a_{22}y^2 + a_1 x + a_2 y + a = 0,$$
where $a_i$ is nonzero only in the case that $a_{ii}$ is zero. By the last step of the general procedure, exactly one of the following equations is involved (notice that the equations of quadrics are determined up to a multiple):

This meets the $x$-axis at the point $(a/\cos\theta, 0)$. The distance from the focus $F_1$ to the tangent line is $D_1 = (a/\cos\theta - ae)\sin\varphi$, where $\tan\varphi = \pm\frac{b\cos\theta}{a\sin\theta}$. Eliminate $\varphi$ to get
$$D_1^2 = a^2(1 - e\cos\theta)^2 \frac{b^2}{a^2\sin^2\theta + b^2\cos^2\theta} = (1 - e\cos\theta)^2 \frac{b^2}{\sin^2\theta + (1 - e^2)\cos^2\theta} = (1 - e\cos\theta)^2 \frac{b^2}{1 - e^2\cos^2\theta} = b^2\,\frac{1 - e\cos\theta}{1 + e\cos\theta}.$$
Since $D_2$ is the same as $D_1$ with $e$ replaced by $-e$, it follows that $D_1 D_2 = b^2$. □

Solution. (Alternative.) Consider the polar basis. The ellipse matrix has the diagonal shape $\operatorname{diag}(\frac{1}{a^2}, \frac{1}{b^2}, -1)$ and the tangent equation at $X = [x_0, y_0]$ is $\frac{x_0}{a^2}x + \frac{y_0}{b^2}y = 1$. The distance between $F_{1,2} = [\mp ae, 0]$ and this line equals
$$\frac{1 \pm \frac{e x_0}{a}}{\sqrt{\frac{x_0^2}{a^4} + \frac{y_0^2}{b^4}}}.$$
The product of the two distances is
$$\frac{1 - \frac{e^2 x_0^2}{a^2}}{\frac{x_0^2}{a^4} + \frac{y_0^2}{b^4}}.$$
If we substitute $a^2 e^2 = a^2 - b^2$ and $\frac{y_0^2}{b^2} = 1 - \frac{x_0^2}{a^2}$ (the point $X$ lies on the ellipse), we find that the previous expression equals $b^2$. □

4.C.13. Projective approach to conic sections. Projective space gives us the ability to approach conic sections from a new perspective (compare with 4.4.11). We can understand the conic sections in $\mathcal{E}_2$ defined by the quadratic form
$$f(x, y) = a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + 2a_{13}x + 2a_{23}y + a_{33}$$

$0 = \frac{x^2}{a^2} + \frac{y^2}{b^2} + 1$ … empty set
$0 = \frac{x^2}{a^2} + \frac{y^2}{b^2} - 1$ … ellipse
$0 = \frac{x^2}{a^2} - \frac{y^2}{b^2} - 1$ … hyperbola
$0 = \frac{x^2}{a^2} - 2py$ … parabola
$0 = \frac{x^2}{a^2} + \frac{y^2}{b^2}$ … point
$0 = \frac{x^2}{a^2} - \frac{y^2}{b^2}$ … 2 concurrent lines
$0 = x^2 - a^2$ … 2 parallel lines
$0 = x^2$ … 2 identical lines
$0 = x^2 + a^2$ … empty set

The origin of the Cartesian coordinates is the center of the studied conic. The new orthonormal basis of the difference space gives the directions of the semi-axes. The final coefficients $a$, $b$ then give the lengths of the semi-axes in the nondegenerate directions.

4. Projective geometry

In many elementary texts on analytic geometry, the authors finish with the affine and Euclidean objects described above. The affine and Euclidean geometries are sufficient for many practical problems, but not for all of them. For instance, in processing an image from a camera, angles are not preserved and parallel lines may (but do not have to) intersect. Moreover, it is often difficult to distinguish very small angles from zero angles, and thus it would be convenient to have tools which do not need such a distinction. The basic idea of projective geometry is to extend affine spaces by points at infinity. This permits an easy way of dealing with linear objects such as points, lines, planes, projections, etc.

4.4.1. Projective extension of the affine plane. We begin with the simplest interesting case, namely geometry in a plane. If we imagine the points in the plane $\mathcal{A}_2$ as the plane $z = 1$ in $\mathbb{R}^3$, then each point $P$ in the affine plane is represented by a vector $u = (x, y, 1) \in \mathbb{R}^3$, and so also by the one-dimensional subspace $\langle u \rangle \subset \mathbb{R}^3$. On the other hand, almost every one-dimensional subspace in $\mathbb{R}^3$ intersects the plane in exactly one point $P$, and the vectors of such a subspace are given by coordinates $(x, y, z)$ uniquely up to a common scalar multiple. Only the subspaces corresponding to vectors $(x, y, 0)$ do not intersect the plane.
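This correspondence between affine points and one-dimensional subspaces is easy to experiment with. The following small Sage sketch (entirely our own illustration, not from the text) embeds an affine point into the plane $z = 1$, rescales its representative, and normalizes back:

P = vector(QQ, [3, -2, 1])    # the affine point (3, -2) embedded in the plane z = 1
Q = 5*P                       # another representative of the same one-dimensional subspace
to_affine = lambda v: (v[0]/v[2], v[1]/v[2])   # normalize back to z = 1
print(to_affine(P) == to_affine(Q))            # True: both represent the same point
D = vector(QQ, [1, 1, 0])     # vectors with last coordinate 0 have no affine image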
Projective plane

Definition. The projective plane $\mathcal{P}_2$ is the set of all one-dimensional subspaces in $\mathbb{R}^3$. The homogeneous coordinates of a point $P = (x : y : z)$ in the projective plane are triples of real numbers given up to a common scalar multiple, at least one of which must be nonzero. A straight line in the projective plane is defined as the set of one-dimensional subspaces (i.e. points in $\mathcal{P}_2$) which generate a two-dimensional subspace (i.e. a plane) in $\mathbb{R}^3$.

as a set of points in the projective plane $\mathcal{P}_2$ with homogeneous coordinates $(x : y : z)$ which are the zero points of the homogeneous quadratic form
$$f(x, y, z) = a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + 2a_{13}xz + 2a_{23}yz + a_{33}z^2,$$
or rather $f(v) = v^T A v$, where $v$ is the column vector with coordinates $(x, y, z)$ and $A$ is the symmetric matrix $(a_{ij})$. By theorem 4.3.2, there exists a basis in which this quadratic form has one of the following equations: $f(x, y, z) = x^2 + y^2 + z^2$, or $f(x, y, z) = x^2 + y^2 - z^2$. In the former case, the only solution of $f(x, y, z) = 0$ is the trivial one, and therefore the original form does not represent a real conic section. The second quadratic form represents a cone in $\mathbb{R}^3$. We obtain the corresponding conic section by moving back to inhomogeneous coordinates, that is, by intersecting the cone with the plane which has the equation $z = 1$ in the original basis. Immediately we obtain the conic section classification from 4.29, which corresponds to intersecting the cone in $\mathbb{R}^3$ with different planes. Non-degenerate sections are depicted. Degenerate sections are those which pass through the vertex of the cone.

We define the following useful terms for a conic section in the projective plane. Points $P, Q \in \mathcal{P}_2$ corresponding to one-dimensional subspaces $\langle p \rangle$, $\langle q \rangle$ (generated by vectors $p, q \in \mathbb{R}^3$) are called polar conjugate with respect to the conic section $f$ if $F(p, q) = 0$, or rather $p^T A q = 0$. A point $P = \langle p \rangle$ is called a singular point of the conic section $f$ when it is polar conjugate with respect to $f$ with all points of the plane, so that $F(p, x) = 0$ for all $X \in \mathcal{P}_2$. In other words, $Ap = 0$. Hence the matrix $A$ of the conic section does not have maximal rank and therefore defines a degenerate conic section. Non-degenerate conic sections do not contain singular points.

The set of all points $X = \langle x \rangle$ which are polar conjugate with $P = \langle p \rangle$ is called the polar of the point $P$ with respect to the conic section $f$. It is therefore the set of points for which $F(p, x) = p^T A x = 0$. Because the polar is given by a linear equation in the coordinates, it is always (in the non-singular case) a line. The following explains the geometric interpretation of the polar.

4.C.14. Polar characterization. Consider a non-degenerate conic section $f$. The polar of a point $P \in f$ with respect to $f$ is the tangent to $f$ with the touch point $P$. The polar of

For a concrete example, consider two parallel lines in the affine plane $\mathbb{R}^2$:
$$L_1 : y - x - 1 = 0, \qquad L_2 : y - x + 1 = 0.$$
If the points of the lines $L_1$ and $L_2$ are finite points in the projective space $\mathcal{P}_2$, then their homogeneous coordinates $(x : y : z)$ satisfy the equations
$$L_1 : y - x - z = 0, \qquad L_2 : y - x + z = 0.$$
The intersection $L_1 \cap L_2$ is the point $(1 : 1 : 0) \in \mathcal{P}_2$ in this context. It is the point at infinity corresponding to the common direction vector of the lines.

4.4.2. Affine coordinates in the projective plane. If we begin with the projective plane and we want to see the affine plane as its "finite" part, then instead of the plane $z = 1$ we may take another plane $\sigma$ in $\mathbb{R}^3$ which does not pass through the origin $0 \in \mathbb{R}^3$.
Then the finite points are those one-dimensional subspaces which have a nonempty intersection with the plane $\sigma$. Consider the two parallel lines from the previous paragraph. Let us choose the plane $y = 1$ to obtain two lines in the affine plane:
$$L'_1 : 1 - x - z = 0, \qquad L'_2 : 1 - x + z = 0.$$
The "infinite" points of the former affine plane are given by $z = 0$. The lines $L'_1$ and $L'_2$ intersect at the "finite" point $(x, z) = (1, 0)$. This corresponds to the geometric concept that two parallel lines $L_1$, $L_2$ in the affine plane meet at infinity, at the point $(1 : 1 : 0)$, but this point becomes finite in a different finite affine plane.

4.4.3. Projective spaces. We shall not go for an axiomatic definition of projective spaces here. Instead, we generalize the procedure from the affine plane to each finite dimension. By choosing an arbitrary affine hyperplane $\mathcal{A}_n$ in the vector space $\mathbb{R}^{n+1}$ which does not pass through the origin, we may identify the points $P \in \mathcal{A}_n$ with the one-dimensional subspaces generated by these points. The remaining one-dimensional subspaces determine a hyperplane parallel to $\mathcal{A}_n$. They are called the infinite points in the projective extension $\mathcal{P}_n$ of the affine space $\mathcal{A}_n$.

The set of infinite points in $\mathcal{P}_n$ is always a projective space of dimension one less. An affine straight line has only one infinite point in its projective extension (both ends of the line "intersect" at infinity, and thus the projective line looks like a circle). The projective plane has a projective line of infinite points, the three-dimensional projective space has a projective plane of infinite points, etc.

More generally, we can define the projectivization of a vector space. For an arbitrary vector space $V$ of dimension $n + 1$, we define
$$\mathcal{P}(V) = \{P \subset V;\ P \text{ a vector subspace},\ \dim P = 1\}.$$

the point $P \notin f$ is the line defined by the touch points of the tangents to $f$ passing through $P$.

Solution. First consider $P \in f$. Suppose that the polar of $P$, defined by $F(p, x) = 0$, intersects $f$ in $Q = \langle q \rangle \neq P$. Then $F(p, q) = 0$ and $f(q) = F(q, q) = 0$. An arbitrary point $X = \langle x \rangle$ lying on the line through $P$ and $Q$ satisfies $x = \alpha p + \beta q$ for some $\alpha, \beta \in \mathbb{R}$. Because of the bilinearity and symmetry of $F$,
$$f(x) = F(x, x) = \alpha^2 F(p, p) + 2\alpha\beta F(p, q) + \beta^2 F(q, q) = 0.$$
So every point $X$ of the line lies on the conic section $f$. However, when a conic section contains a line, it has to be degenerate, which is a contradiction. The claim for $P \notin f$ follows from a corollary of the symmetry of the bilinear form $F$: when $Q$ lies on the polar of $P$, then $P$ lies on the polar of $Q$. □

Using polar conjugates we can find the axes and the centre of a conic section without using the Lagrange algorithm. Consider the conic section matrix as a block matrix
$$A = \begin{pmatrix} \bar{A} & a \\ a^T & \alpha \end{pmatrix},$$
where $\bar{A} = (a_{ij})$ for $i, j = 1, 2$, $a$ is the vector $(a_{13}, a_{23})^T$ and $\alpha = a_{33}$. This means that the conic section is defined by the equation $u^T \bar{A} u + 2a^T u + \alpha = 0$ for the vector $u = (x, y)^T$. Now we show that:

4.C.15. The axes of a conic section are the polars of the points at infinity determined by the eigenvectors of the matrix $\bar{A}$.

Solution. Because of the symmetry of $\bar{A}$, in the basis of its eigenvectors it has the diagonal shape
$$D = \begin{pmatrix} \lambda & 0 \\ 0 & \mu \end{pmatrix},$$
where $\lambda, \mu \in \mathbb{R}$, and this basis is orthogonal. Denote by $U$ the matrix changing the basis to the basis of eigenvectors (written in columns); then the conic section matrix transforms as
$$\begin{pmatrix} U^T & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \bar{A} & a \\ a^T & \alpha \end{pmatrix} \begin{pmatrix} U & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} D & U^T a \\ a^T U & \alpha \end{pmatrix}.$$

By choosing a basis $u$ in $V$ we obtain homogeneous coordinates on $\mathcal{P}(V)$. For a point $P \in \mathcal{P}(V)$ we use a nonzero arithmetic representative $u \in V$ and the coordinates of this vector in the basis $u$.
The points of the projective space $\mathcal{P}(V)$ are called geometric points. Their generators in $V$ are called arithmetic representatives. In the chosen homogeneous coordinates, we fix one of the coordinates to be one. Thus we exclude all points of the projective space $\mathcal{P}(V)$ which have this coordinate equal to zero, and we obtain an embedding of the $n$-dimensional affine space $\mathcal{A}_n \subset \mathcal{P}(V)$. This is precisely the construction used in the example of the projective plane.

4.4.4. Perspective projection. Our basic "projective" concepts are now (projective) points, lines, planes, etc., together with their incidences (i.e. a point on a line, a line in a plane, etc.). Thus the morphisms of projective geometry must respect these. The best known example is given by the perspective projections $\mathbb{R}^3 \to \mathbb{R}^2$. Imagine that an observer sitting at the origin observes "one half of the world", that is, the points $(X, Y, Z) \in \mathbb{R}^3$ with $Z > 0$. The observer sees the image "projected" on the screen given by the plane $Z = f > 0$.

picture missing!

Thus a point $(X, Y, Z)$ in the "real world" projects to a point $(x, y)$ on the screen as follows:
$$x = -f\frac{X}{Z}, \qquad y = -f\frac{Y}{Z}.$$
This is a nonlinear formula, and the accuracy of calculations is problematic when $Z$ is small. By extending this transformation to a map between projective spaces, we get $(X : Y : Z : W) \mapsto (x : y : z) = (-fX : -fY : Z)$, which is well defined for all $Z > 0$. That is, a map described by the simple linear formula
$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} -f & 0 & 0 & 0 \\ 0 & -f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} X \\ Y \\ Z \\ W \end{pmatrix}.$$
This simple expression defines the perspective projection for finite points in the open half-space in $\mathbb{R}^3 \subset \mathcal{P}_3$ which we

So in this basis the canonical form is determined by the vector $U^T a$ (up to a translation). Specifically, if we denote the eigenvectors by $v_\lambda$, $v_\mu$, then
$$\lambda\Bigl(x + \frac{a^T v_\lambda}{\lambda}\Bigr)^2 + \mu\Bigl(y + \frac{a^T v_\mu}{\mu}\Bigr)^2 = \frac{(a^T v_\lambda)^2}{\lambda} + \frac{(a^T v_\mu)^2}{\mu} - \alpha.$$
This means that the eigenvectors are the direction vectors of the axes of the conic section (the main directions). The equations of the axes in this basis are $x = -\frac{a^T v_\lambda}{\lambda}$ and $y = -\frac{a^T v_\mu}{\mu}$. The coordinates $u_\lambda$ and $u_\mu$ of the axes in the standard basis satisfy $v_\lambda^T u_\lambda = -\frac{a^T v_\lambda}{\lambda}$ and $v_\mu^T u_\mu = -\frac{a^T v_\mu}{\mu}$, because $v_\lambda^T(\lambda u_\lambda + a) = 0$ and $v_\mu^T(\mu u_\mu + a) = 0$. These equations are equivalent to the equations $v_\lambda^T(\bar{A} u_\lambda + a) = 0$ and $v_\mu^T(\bar{A} u_\mu + a) = 0$, which are the polar equations of the points defined by the vectors $v_\lambda$ and $v_\mu$. □

4.C.16. Remark. A corollary of the previous claim is that the centre of the conic section is polar conjugate with all points at infinity. The coordinates of the centre $s$ then satisfy the equation $\bar{A}s + a = 0$. If $\det(A) \neq 0$, then the equation $\bar{A}s + a = 0$ for the centre coordinates has exactly one solution if $\delta = \det(\bar{A}) \neq 0$, and no solution if $\delta = 0$. That means that, regarding the nondegenerate conic sections, the ellipse and the hyperbola have exactly one centre, while the parabola has no centre (its centre is a point at infinity).

4.C.17. Prove that the angle between the tangent to a parabola (at an arbitrary touch point) and the axis of the parabola is the same as the angle between the tangent and the line connecting the focus and the point of tangency.

Solution. The polar (i.e. the tangent) of a point $X = [x_0, y_0]$ of the parabola defined by the canonical equation in the polar basis is the line satisfying
$$(x_0, y_0, 1)\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & -p \\ 0 & -p & 0 \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = x_0 x - p y - p y_0 = 0.$$
The cosine of the angle between the tangent and the axis of the parabola ($x = 0$) is given by the dot product of the corresponding unit direction vectors.
The unit direction vector of the tangent is $\frac{1}{\sqrt{p^2 + x_0^2}}(p, x_0)$, and therefore
$$\frac{1}{\sqrt{p^2 + x_0^2}}(p, x_0)\cdot(0, 1) = \frac{x_0}{\sqrt{p^2 + x_0^2}}.$$
Now we show that this is the same as the cosine of the angle between the tangent and the line connecting the focus $F = [0, \frac{p}{2}]$ and the touch point $X$. The unit direction vector of the connecting line is $\frac{1}{\sqrt{x_0^2 + (y_0 - \frac{p}{2})^2}}(x_0, y_0 - \frac{p}{2})$. For the cosine of the angle we get
$$\frac{1}{\sqrt{p^2 + x_0^2}}\cdot\frac{1}{\sqrt{x_0^2 + (y_0 - \frac{p}{2})^2}}\Bigl(x_0 y_0 + \frac{p x_0}{2}\Bigr).$$

substitute as points with $W = 1$. In this way we eliminate the problems with points whose image runs to infinity. Indeed, if the $Z$-coordinate of a real point is close to zero, then the value of the third homogeneous coordinate of the image is close to zero, i.e. it corresponds to a point close to infinity.

4.4.5. Affine and projective transformations. Each injective linear map $\varphi : V_1 \to V_2$ between vector spaces maps one-dimensional subspaces to one-dimensional subspaces. Therefore, we get a map on the projectivizations $T : \mathcal{P}(V_1) \to \mathcal{P}(V_2)$. We call such maps projective maps or homographies. A slightly more general concept posits the property that a map $C : \mathcal{P}(V_1) \to \mathcal{P}(V_2)$ is bijective and maps projective lines to projective lines. These are called collineations. Of course, every invertible homography is a collineation.¹

Otherwise put, a projective map is a map between projective spaces such that in each system of homogeneous coordinates on the domain and image, it is given by multiplication by a matrix. More generally, if the auxiliary linear map is not injective, then we need to define the projective map only outside of its kernel, that is, on the points whose homogeneous coordinates do not map to zero. This is what we saw in the previous paragraph when discussing the perspective projections.

In general, a collineation does not have to be a homography. For example, if we work with complex vector spaces, then we may use the field homomorphism $z \mapsto \bar{z}$, i.e. the conjugation, to define the map $F : (z_0 : \cdots : z_n) \mapsto (\bar{z}_0 : \cdots : \bar{z}_n)$ in any fixed homogeneous coordinates. Clearly this is bijective, and since collinearity is computed by subdeterminants (and determinants are equivariant with respect to each field homomorphism), collinearity is preserved. Thus, $F$ is a collineation which does not come from a linear automorphism of the vector space $\mathbb{C}^{n+1}$. Fortunately, there is the so-called fundamental theorem of projective geometry, saying that in dimensions at least two, each collineation is composed of a homography and a collineation coming from a field homomorphism in the above way. Of course, in dimension one, every bijection is a collineation and the concept does not make much sense there. Notice that there are no nontrivial field homomorphisms on the real scalars $\mathbb{R}$, and thus the two concepts coincide for our real projective spaces in dimensions at least two.

Since injective maps $V \to V$ of a vector space to itself are invertible, all projective maps of a projective space $\mathcal{P}_n$ to itself are invertible. They are also called regular collineations or projective transformations. In homogeneous coordinates, they correspond to invertible matrices of dimension $n + 1$. Two such matrices define the same projective transformation if and only if one is a (nonzero) multiple of the other.

¹ Since projective geometry is an old and rich mathematical discipline with very abstract current versions, there is a lot of terminology and different names for similar or identical objects involved.

Substituting $y_0 = \frac{x_0^2}{2p}$, we obtain $\frac{x_0}{\sqrt{p^2 + x_0^2}}$.
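A quick numeric spot-check of the equality just derived may be reassuring; the following Sage lines (with sample values chosen by us) compare the two cosines exactly.

p, x0 = 3/2, 2                 # our sample parameter and touch point abscissa
y0 = x0^2 / (2*p)              # the touch point lies on the parabola x^2 = 2*p*y
cos_axis = x0 / sqrt(p^2 + x0^2)
cos_focal = (x0*y0 + p*x0/2) / (sqrt(p^2 + x0^2) * sqrt(x0^2 + (y0 - p/2)^2))
print(cos_axis == cos_focal)   # True (both equal 4/5 here)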
This example shows that light rays striking parallel to the axis of a parabolic mirror are reflected to the focus and, vice versa, light rays going through the focus are reflected in the direction parallel to the axis of the parabola. This is the principle of many devices, such as parabolic reflectors. □

Solution. (Alternative.) At the point $P = (at^2, 2at)$ on the parabola, the tangent line has slope $1/t$ and the focus is at $(a, 0)$. So the line joining $P$ to the focus $F$ has slope
$$\frac{2at - 0}{at^2 - a} = \frac{2t}{t^2 - 1}.$$
If $\theta$ is the angle between the tangent line and the $x$-axis, then $\tan\theta = 1/t$, so
$$\tan 2\theta = \frac{2\tan\theta}{1 - \tan^2\theta} = \frac{2/t}{1 - 1/t^2} = \frac{2t}{t^2 - 1}.$$
By subtraction, the angle between the tangent line and the line joining $P$ to the focus is $\theta$. Note that the tangent line meets the $x$-axis at the point $Q = (-at^2, 0)$. The result also follows from showing that $|FP| = |FQ|$, and hence the triangle $QFP$ is isosceles. □

You can find many more examples on quadrics starting at 4.D.1.

If we choose the first coordinate as the one whose vanishing defines the infinite points, then the transformations preserving the infinite points are given by matrices whose first row vanishes except for its first element. If we wish to switch to affine coordinates of the finite points (i.e. the first coordinate is fixed at one), the first element in the first row also equals one. Hence the matrices of collineations preserving the finite points of the affine space have the form
$$\begin{pmatrix} 1 & 0 & \cdots & 0 \\ b_1 & a_{11} & \cdots & a_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ b_n & a_{n1} & \cdots & a_{nn} \end{pmatrix},$$
where $b = (b_1, \dots, b_n)^T \in \mathbb{R}^n$ and $A = (a_{ij})$ is an invertible matrix of dimension $n$. The action of such a matrix on a vector $(1, x_1, \dots, x_n)$ is exactly a general affine transformation, where $b$ is the translation and $A$ is its linear part. Thus the affine maps are exactly those collineations which preserve the hyperplane of points at infinity.

4.4.6. Determining collineations. In order to define an affine map, it is necessary and sufficient to define the image of an affine frame. In the above description of affine transformations as special cases of projective maps, this corresponds to a suitable choice of the image of a suitable arithmetic basis of the vector space $V$. In general, it is not true that the image of an arithmetic basis of $V$ determines the collineation uniquely. The basic problem is illustrated by a simple example in the affine plane. Choose four points $A$, $B$, $C$, $D$ in the plane such that no three of them lie on a line. Then choose their images under the collineation as follows: choose arbitrarily four images $A'$, $B'$, $C'$, $D'$ with the same property, and choose homogeneous coordinates $u$, $v$, $w$, $z$, $u'$, $v'$, $w'$, $z'$ of all these points in $\mathbb{R}^3$. The vectors $z$ and $z'$ can be written as linear combinations
$$z = c_1 u + c_2 v + c_3 w, \qquad z' = c'_1 u' + c'_2 v' + c'_3 w',$$
where all six coefficients must be nonzero, since otherwise there would exist three of the points not in general position. Now choose new arithmetic representatives $\tilde{u} = c_1 u$, $\tilde{v} = c_2 v$ and $\tilde{w} = c_3 w$ of the points $A$, $B$ and $C$ respectively, and similarly $\tilde{u}' = c'_1 u'$, $\tilde{v}' = c'_2 v'$ and $\tilde{w}' = c'_3 w'$ for the points $A'$, $B'$ and $C'$. This choice defines a unique linear map $\varphi$ which maps successively $\varphi(\tilde{u}) = \tilde{u}'$, $\varphi(\tilde{v}) = \tilde{v}'$, $\varphi(\tilde{w}) = \tilde{w}'$. But then
$$\varphi(z) = \varphi(\tilde{u} + \tilde{v} + \tilde{w}) = \tilde{u}' + \tilde{v}' + \tilde{w}' = z',$$
and so the constructed collineation maps the points chosen in advance as required. The linear map $\varphi$ is given uniquely by the construction, thus the collineation is given uniquely by this choice.

The argument holds also in the case when some of the chosen points are infinite (i.e.
one or two of them). The same phenomenon can be seen even more easily for the regular collineations of a projective line: these are determined by pairwise different images of three pairwise different points. The procedure works in an arbitrary dimension $n$. We say that $n + 2$ points are in general position if no $n + 1$ of them lie in the same hyperplane. We also call such points linearly independent; they form a geometric basis of the projective space.

Theorem. A regular collineation on an $n$-dimensional projective space is uniquely determined by linearly independent images of $n + 2$ linearly independent points.

Proof. The proof is exactly the same as in dimension two. We recommend writing it out in detail as an exercise. □

4.4.7. Cross-ratio. Recall that affine maps preserve the ratios of lengths of line segments on each line. Technically, we defined this ratio for three points $A$, $B$ and $C \neq B$, $C = rA + sB$, as $\lambda = (C; A, B) = -\frac{s}{r}$. For central projections the ratios are not preserved. Moreover, even the relative position of points on a line is not necessarily preserved. On the contrary, we may determine a projective transformation uniquely by choosing arbitrarily the images of three pairwise different points on a projective line. However, one can show relatively easily that the ratio of two such ratios, for two distinct points $C$, is preserved.

Consider four distinct points $A$, $B$, $C$, $D$ in a projective space, with arithmetic coordinates $x$, $y$, $w$, $z$ respectively, which lie on a projective line. Since all four vectors lie in the subspace generated by $\langle x, y \rangle$, we may write $w$ and $z$ as the linear combinations
$$w = t_1 x + s_1 y, \qquad z = t_2 x + s_2 y.$$
Define the cross-ratio of the four points $(A, B, C, D)$ as
$$\rho = \frac{s_1}{t_1}\cdot\frac{t_2}{s_2}.$$
The definition is valid: although the vectors $x$ and $y$ are determined only up to scalar multiples, this freedom cancels out in the definition. Each projective transformation preserves cross-ratios. Indeed, if the transformation is given in arithmetic coordinates by a matrix $A$, we have the images $A \cdot w = t_1 A \cdot x + s_1 A \cdot y$, and similarly for $A \cdot z$. Therefore the four images have the same cross-ratio.

We discuss the characterization of projective transformations. These are exactly those maps which preserve cross-ratios. But this is not a very practical characterization, since it implicitly contains the claim that these maps map projective lines to projective lines. We shall not go into details here, but one can prove a much stronger statement: a map of an arbitrarily small open area in the affine space $\mathbb{R}^n$ (e.g. a ball without its boundary) into the same affine space which maps lines to lines is actually the restriction of a uniquely determined, globally defined collineation of the projective extension $\mathcal{P}\mathbb{R}^{n+1}$ of the former affine space $\mathbb{R}^n$. As mentioned above, all collineations of real projective spaces of finite dimension at least two are projective transformations in our sense, and thus these transformations also preserve cross-ratios.

4.4.8. Duality. Projective hyperplanes in an $n$-dimensional projective space $\mathcal{P}(V)$ are defined as the projectivizations of the $n$-dimensional vector subspaces in the vector space $V$. Hence in homogeneous coordinates they are defined as the kernels of linear forms $\alpha \in V^*$, which in turn are determined up to a scalar multiple. Thus in a chosen arithmetic basis, a projective hyperplane is given by a row vector $\alpha = (\alpha_0, \dots, \alpha_n)$. But the forms $\alpha$ are given uniquely up to a scalar multiple.
Therefore, each hyperplane in $V$ is identified with exactly one geometric point in the projectivization of the dual space $\mathcal{P}(V^*)$. We call this space the dual projective space, and we talk about a duality between points and hyperplanes.

On forms, the linear map defining a given collineation acts by multiplication of row vectors from the right by the same matrix: $\alpha = (\alpha_0, \dots, \alpha_n) \mapsto \alpha \cdot A$. Viewing the coordinates $\alpha_i$ as columns, this corresponds to the action of the dual map with the matrix $A^T$. But the dual map maps forms in the opposite direction, from the "target space" to the "initial one". Therefore, in order to study the effect of regular collineations on points and their dual hyperplanes, the inverse map of the collineation $f$ is required. The inverse is given by the matrix $A^{-1}$. Hence the matrix for the action of the corresponding collineation on forms is $(A^T)^{-1}$. Since the inverse matrix equals the algebraically adjoint matrix $A^*_{\mathrm{alg}}$ up to multiplication by the inverse of the determinant (see equation (1) on page 106), we can work directly with the projective transformation of the space $\mathcal{P}(V^*)$ given by the matrix $(A^*_{\mathrm{alg}})^T$ (or without transposing, if we multiply row vectors from the right).

The projective point $X$ belongs to the hyperplane $\alpha$ if the arithmetic coordinates satisfy $\alpha \cdot x = 0$. This still holds after acting with an arbitrary collineation, since $(\alpha \cdot A^{-1}) \cdot (A \cdot x) = \alpha \cdot x = 0$.

4.4.9. Fixed points, centers and axes. Consider a regular collineation $f$ given in an arithmetic basis of the projective space $\mathcal{P}(V)$ by a matrix $A$. By a fixed point of the collineation $f$ we mean a point $P$ which is mapped to itself, that is, $f(P) = P$. By a fixed hyperplane of the collineation $f$ we mean a hyperplane $\alpha$ which is mapped to itself, that is, $f(\alpha) \subset \alpha$. Hence the arithmetic representatives of the fixed points are exactly the eigenvectors of the matrix $A$.

In the geometry of the plane, we meet many types of collineations: reflection through a point, reflection across a line, translation, homothety, etc. Perhaps we remember also some types of projections, e.g. the projection of one plane in $\mathbb{R}^3$ to another from a center $S \in \mathbb{R}^3$. Note also that in all these cases of affine maps, fixed lines appear next to the fixed points. For example, the reflection through a point preserves also all lines passing through this point. In the case of a translation, the infinite points behave similarly. Now we discuss this phenomenon in an arbitrary dimension.

First, we define a classical notion related to the incidence of points and hyperplanes. The set of hyperplanes passing through a point $P \in \mathcal{P}(V)$ is the set of all hyperplanes which contain the point $P$. For each point $P$, the corresponding set of hyperplanes is itself a hyperplane in the dual space $\mathcal{P}(V^*)$: it is given by one homogeneous linear equation in the arithmetic coordinates.

For a collineation $f : \mathcal{P}(V) \to \mathcal{P}(V)$, a point $S \in \mathcal{P}(V)$ is called a center of the collineation $f$ if all hyperplanes in the set determined by $S$ are fixed hyperplanes. A hyperplane $\alpha$ is called an axis of the collineation $f$ if all its points are fixed points. It follows that the axis of a collineation is the center of the dual collineation, while the set of hyperplanes defining the center of a collineation is the axis of the dual collineation.
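Since the fixed points are just the eigenvectors of the matrix, they are mechanical to compute. A small Sage sketch (the matrix is our own example, encoding the affine map $(x, y) \mapsto (2x, y + 1)$ in homogeneous coordinates $(z : x : y)$, with the vanishing of the first coordinate defining the infinite points):

A = matrix(QQ, [[1, 0, 0],
                [0, 2, 0],
                [1, 0, 1]])
for eigenvalue, vectors, multiplicity in A.eigenvectors_right():
    # each eigenvector gives the homogeneous coordinates of a fixed point
    print(eigenvalue, vectors)

Here both fixed points turn out to lie at infinity (the directions of the two coordinate axes), in line with the observation above that translations leave the infinite points fixed.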
Since the matrices of a collineation on the original space and on the dual space differ only by transposition, their eigenvalues coincide (the eigenvectors are the column vectors, respectively the row vectors, corresponding to the same eigenvalues). For example, in the projective plane (and for the same reason in each real projective space of even dimension) each collineation has at least one fixed point, since the characteristic polynomials of the corresponding linear maps are of odd degree, and hence have at least one real root.

Instead of discussing a general theory, we illustrate its usefulness in several results for projective planes.

Proposition. A projective transformation of the projective plane other than the identity has either exactly one center and exactly one axis, or it has neither a center nor an axis.

Proof. Consider a collineation $f$ on $\mathcal{P}\mathbb{R}^3$ and assume that it has two distinct centers $A$ and $B$. Denote by $\ell$ the line given by these two centers, and choose a point $X$ in the projective plane outside of $\ell$. If $p$ and $q$ are the lines passing through the pairs of points $(A, X)$ and $(B, X)$ respectively, then $f(p) = p$ and $f(q) = q$. In particular, $X$ is fixed. But then all points of the plane outside of $\ell$ are fixed. Hence each line different from $\ell$ has all its points outside of $\ell$ fixed, and thus also its intersection with $\ell$ is fixed. It follows that $f$ is the identity mapping. So it is proved that every projective transformation other than the identity has at most one center. The same argument for the dual projective plane proves that there is at most one axis.

If $f$ has a center $A$, then all lines passing through $A$ are fixed. They correspond therefore to a two-dimensional subspace of row eigenvectors of the matrix corresponding to the transformation $f$. Therefore, there exists a two-dimensional subspace of column eigenvectors for the same eigenvalue. This represents exactly a line of fixed points, hence the axis. The same consideration in the reversed order proves the opposite statement: if a projective transformation of the plane has an axis, then it also has a center. □

picture missing!

For practical problems it is useful to work with complex projective extensions also in the case of a real plane. Then the geometric behaviour can be easily read off from the potential existence of real or imaginary centers and axes.

4.4.10. Pappus theorem. The following result, known as the Pappus theorem, is a classical result of projective geometry.

Proposition. Let two triples of distinct consecutive collinear points $\{A, B, C\}$ and $\{A', B', C'\}$ lie on two lines that meet at the point $T$, which is closest to $A$ and $A'$, respectively. Define the points $Q$, $R$ and $S$ as
$$Q = [AB'] \cap [BA'], \quad R = [AC'] \cap [CA'], \quad S = [BC'] \cap [CB'].$$
Then $\{Q, R, S\}$ are also collinear.

picture missing!

Proof. Without loss of generality, consider the plane passing through $\{T, A, B, C, A', B', C'\}$ as the 2-dimensional plane in $\mathcal{P}_2$ defined by $z = 1$ in the homogeneous coordinates $(x : y : z)$. The points $\{T, A, B, C, A', B', C'\}$ may be considered as objects in $\mathcal{P}_2$, representing lines through the origin in $\mathbb{R}^3$ with direction vectors $\{t, a, b, c, a', b', c'\}$ respectively. These can be chosen up to a real non-zero factor. The condition $\{z = 1\}$ uniquely identifies those points in $\mathbb{R}^3$, regardless of the choice of $\{t, a, b, c, a', b', c'\}$. Since $\{T, A, B, C\}$ are collinear points (they lie in the same 2-dimensional linear subspace of $\mathbb{R}^3$), we may assume that this plane is generated by $t$ and $a$.
Choose $b = t + a$, $c = \lambda t + a$, and analogously, for $\{T, A', B', C'\}$, $b' = t + a'$, $c' = \lambda' t + a'$ for some real constants $\lambda$ and $\lambda'$. It is only necessary to show that the vectors $q$, $r$, $s$ representing $Q$, $R$, $S$ in $\mathcal{P}_2$ generate a 2-dimensional subspace in $\mathbb{R}^3$. Since $(t + a) + a' = a + (t + a')$, the vector $q = t + a + a'$ represents $Q$. Since
$$\lambda\lambda' t + \lambda' a + \lambda a' = \lambda(\lambda' t + a') + \lambda' a = \lambda'(\lambda t + a) + \lambda a',$$
the vector $r = \lambda\lambda' t + \lambda' a + \lambda a'$ represents $R$. Finally,
$$s = q - r = t + a + a' - \lambda\lambda' t - \lambda' a - \lambda a' = (1 - \lambda')(t + a) + (1 - \lambda)(\lambda' t + a') = (1 - \lambda)(t + a') + (1 - \lambda')(\lambda t + a)$$
represents the point $S$. Thus, the points $\{Q, R, S\}$ lie in the 2-dimensional subspace generated by the vectors $q$ and $r$. Since $Q$, $R$, $S$ also belong to the plane $\{z = 1\}$, these points are collinear. □

4.4.11. Projective classification of quadrics. To end this section, we return to conics and quadrics. A quadric $Q$ in the $n$-dimensional affine space $\mathbb{R}^n$ is defined by the general quadratic equation 4.2.6(1) on page 342. By viewing the affine space $\mathbb{R}^n$ as the affine coordinates in the projective space $\mathcal{P}\mathbb{R}^{n+1}$, we may wish to describe the set $Q$ by homogeneous coordinates in the projective space. The formula in these coordinates should contain only terms of second order, since only a homogeneous formula is independent of the choice of the multiple of the homogeneous coordinates $(x_0, x_1, \dots, x_n)$ of a point. Hence we search for a homogeneous formula whose restriction to the affine coordinates (that is, the substitution $x_0 = 1$) gives the original formula 4.2.6(1). But this is especially easy: simply add the right number of factors $x_0$ to all terms, namely none to the quadratic terms, one to the linear terms, and $x_0^2$ to the constant term in the original affine equation for $Q$. We obtain a well defined quadratic form $f = \sum_{i,j=0}^n a_{ij} x_i x_j$ on the vector space $\mathbb{R}^{n+1}$, whose zero set correctly defines the projective quadric $\bar{Q}$.

The intersection of the "cone" $\bar{Q} \subset \mathbb{R}^{n+1}$, the zero set of this form, with the affine plane $x_0 = 1$ is the original quadric $Q$, whose points are called the proper points of the quadric. The other points $\bar{Q} \setminus Q$ in the projective extension are the infinite points.

The classification of real or complex projective quadrics, up to projective transformations, is a problem already considered: it is all about finding the canonical polar basis, see paragraph 4.3.2. From this classification, given by the signature of the form in the real case and by the rank only in the complex case, we can deduce also the classification of the affine quadrics. We show the essential part of the procedure in the case of conics in the affine and the projective plane. The projective classification gives the following possibilities, described by homogeneous coordinates $(x : y : z)$ in the projective plane $\mathcal{P}\mathbb{R}^3$:
• imaginary regular conic given by $x^2 + y^2 + z^2 = 0$,
• real regular conic given by $x^2 + y^2 - z^2 = 0$,
• pair of imaginary lines given by $x^2 + y^2 = 0$,
• pair of real lines given by $x^2 - y^2 = 0$,
• one double line $x^2 = 0$.

We consider this classification as real, that is, the classification of the quadratic forms is given not only by the rank but also by the signature. Nevertheless, the points of a quadric are considered also in the complex extension. The stated names should be understood in this way; for example, the imaginary conic does not have any real points.

4.4.12. Affine classification of quadrics. For an affine classification, we must restrict the projective transformations to those which preserve the projective subspace of infinite points.
This can be seen also by the converse procedure: for a fixed projective type of conic $Q$, that is, its cone $\tilde{Q} \subset \mathbb{R}^3$, we choose different affine planes $\alpha \subset \mathbb{R}^3$ which do not pass through the origin, and we observe the changes to the set of points $\tilde{Q} \cap \alpha$, which are the proper points of $Q$ in the affine coordinates realized by the plane $\alpha$.

Hence in the case of a regular conic there is the real cone $\tilde{Q}$ given by the equation $z^2 = x^2 + y^2$. As the planes $\alpha$ we may, for instance, choose the tangent planes to the unit sphere. If we begin with the plane $z = 1$, the intersection consists only of finite points forming the unit circle $Q$. By a gradual change of the slope of $\alpha$ we obtain a more and more stretched ellipse, until we reach a slope for which $\alpha$ is parallel to one of the lines of the cone. At that moment, one (double) infinite point of the conic appears, while the finite points still form one connected component; so we have a parabola. Continuing to change the slope gives rise to two infinite points. The set of finite points is then no longer connected, and so we obtain the last regular quadric in the affine classification, a hyperbola.

The method just introduced enables us to continue the classification in higher dimensions. In particular, we notice that the intersection of the conic with the projective line of infinite points is always a quadric in dimension one less: it is either the empty set, or a double point, or two points, these being the types of quadrics on a projective line. Moreover, an affine transformation can transform one of the possible realizations of a fixed projective type into another one only if the corresponding quadrics in the subspace of infinite points are projectively equivalent. In this way, it is possible to continue the classification of quadrics in dimension three and above.

D. Further exercises on this chapter

4.D.1. [To be moved to an appropriate spot in Chapter 8 or 9] Show that all twice differentiable mappings $F : \mathbb{R}^n \to \mathbb{R}^n$ which preserve distances are affine mappings, and thus Euclidean (also the so-called rigid motions).

Solution. Suppose $F : \mathbb{R}^n \to \mathbb{R}^n$ preserves distances, i.e., $\|F(x + tv) - F(x)\| = \|tv\|$, where $t \in \mathbb{R}$, $v \in \mathbb{R}^n$. Then the Taylor expansion says
$$t\|v\| = \|F(x + tv) - F(x)\| = \|DF(x)(tv) + \tfrac{1}{2}D^2F(x + stv)(tv, tv)\| = t\|DF(x)(v) + \tfrac{t}{2}D^2F(x + stv)(v, v)\|,$$
where $x + stv$ is a point between $x$ and $x + tv$, i.e. $0 \le s \le 1$. Now, the limit for $t \to 0$ leads to $\|v\| = \|DF(x)(v)\|$, i.e., the expected fact that the differential of $F$ must be an orthogonal linear map. Consequently, writing $F_{x_i}$ for the partial derivatives, its columns are orthogonal:
$$(1)\qquad F_{x_i} \cdot F_{x_j} = \delta_{ij}.$$
In particular, $F$ is invertible on a neighborhood of $x$ and we may look at $G(y) = DF(x)^{-1} \circ F(y)$ instead of $F$. Now, the differential of $G$ is the identity matrix at the point $x$, and $G$ preserves distances again. Thus, differentiating the equation (1) for $G$ we arrive at
$$G_{x_i x_k}(x) \cdot G_{x_j}(x) + G_{x_i}(x) \cdot G_{x_j x_k}(x) = 0,$$
and taking into account $G^j_{x_i}(x) = \delta^j_i$, the latter equation reduces to $\alpha_{jik} = -\alpha_{ijk}$, where we write $\alpha_{ijk} = G^i_{x_j x_k}(x)$ for the second partial derivatives of the relevant component function. Clearly, we also know $\alpha_{ijk} = \alpha_{ikj}$, since the second derivatives commute. Thus,
$$\alpha_{ijk} = -\alpha_{jik} = -\alpha_{jki} = \alpha_{kji} = \alpha_{kij} = -\alpha_{ikj} = -\alpha_{ijk},$$
and thus all the quantities $\alpha_{ijk}$ have to vanish.
This means that all the second partial derivatives of $G$ at the point $x$ vanish, and by the definition, this also implies that all the second order partial derivatives of $F(y) = DF(x) \circ G(y)$ at the point $x$ vanish (write down the formulae explicitly in case of any doubts!). The point $x$ was arbitrary, and thus we have verified, via the Taylor expansion, that $F$ is affine. □

4.D.2. Find a parametric equation for the intersection of the following planes in $\mathbb{R}^3$:
$$\sigma : 2x + 3y - z + 1 = 0 \quad\text{and}\quad \rho : x - 2y + 5 = 0.$$ ⃝

4.D.3. Find a common perpendicular for the skew lines
$$p : [1, 1, 1] + t(2, 1, 0), \qquad q : [2, 2, 0] + t(1, 1, 1).$$ ⃝

4.D.4. Jarda is standing at $[-1, 1, 0]$ and has a stick of length 4. Can he simultaneously touch the lines $p$ and $q$, where
$$p : [0, -1, 0] + t(1, 2, 1), \qquad q : [3, 4, 8] + s(2, 1, 3)?$$
(The stick must pass through $[-1, 1, 0]$.) ⃝

4.D.5. A cube $ABCDEFGH$ is given. The point $T$ lies on the edge $BF$, with $|BT| = \frac{1}{4}|BF|$. Compute the cosine of the angle between $ATC$ and $BDE$. ⃝

4.D.6. A cube $ABCDEFGH$ is given. The point $T$ lies on the edge $AE$, with $|AT| = \frac{1}{4}|AE|$. $S$ is the midpoint of $AD$. Compute the cosine of the angle between $BDT$ and $SCH$. ⃝

4.D.7. A cube $ABCDEFGH$ is given. The point $T$ lies on the edge $BF$, with $|BT| = \frac{1}{3}|BF|$. Compute the cosine of the angle between $ATC$ and $BDE$. ⃝

4.D.8. What are the lengths of the semi-axes of an ellipse, if the sum of their lengths and the distance between the foci both equal 1?

Solution. It is given that $a + b = 1$ and $2ae = 1$. Also $b^2 = a^2(1 - e^2)$. Eliminating $e$ gives $b^2 = a^2 - \frac{1}{4}$. So $\frac{1}{4} = a^2 - b^2 = (a - b)(a + b) = a - b$. So $a = 5/8$ and $b = 3/8$. □

Solution. (Alternative.) Solve the system $a + b = 1$, $2ae = 2\sqrt{a^2 - b^2} = 1$, and find the solution $a = \frac{5}{8}$, $b = \frac{3}{8}$. □

4.D.9. For what slopes $k$ are the lines passing through $[-4, 2]$ secant and tangent lines of the ellipse defined by $\frac{x^2}{9} + \frac{y^2}{4} = 1$?

Solution. The direction vector of the line is $(1, k)$, and its parametric equations then are $x = -4 + t$, $y = 2 + kt$. The intersections with the ellipse satisfy
$$\frac{(-4 + t)^2}{9} + \frac{(2 + kt)^2}{4} = 1.$$
This quadratic equation has discriminant $D = -\frac{k}{9}(7k + 16)$ (up to a positive multiple). This implies that for $k \in (-\frac{16}{7}, 0)$ there are two solutions, and the line is a secant. For $k = -\frac{16}{7}$ and $k = 0$ there is only one solution, and the line is a tangent to the ellipse. □

4.D.10. Find all lines tangent to the ellipse $3x^2 + 7y^2 = 30$ such that the distance from the centre of the ellipse to the tangent equals 3.

Solution. All lines at distance 3 from the origin are tangent to the circle with centre at $[0, 0]$ and radius 3. They all have an equation $x\cos\theta + y\sin\theta = 3$ for some $\theta$. This line meets the standard ellipse $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$ where
$$\frac{x^2}{a^2} + \frac{(3 - x\cos\theta)^2}{b^2\sin^2\theta} = 1, \quad\text{or}\quad x^2(a^2\cos^2\theta + b^2\sin^2\theta) - 6a^2 x\cos\theta - a^2(b^2\sin^2\theta - 9) = 0.$$
It is a tangent line if the above equation has a double root in $x$. Thus it is required that
$$36a^4\cos^2\theta = 4a^2(a^2\cos^2\theta + b^2\sin^2\theta)(9 - b^2\sin^2\theta).$$
This simplifies to requiring that $a^2\cos^2\theta + b^2\sin^2\theta = 9$, which implies
$$\cos^2\theta = \frac{9 - b^2}{a^2 - b^2}, \qquad \sin^2\theta = \frac{a^2 - 9}{a^2 - b^2}.$$
For the given problem, $a^2 = 10$ and $b^2 = 30/7$. The solution is $x\sqrt{33} + y\sqrt{7} = 3\sqrt{40}$. □

Solution. (Alternative.) The tangent line is $y - b\sin\theta = -\frac{b\cos\theta}{a\sin\theta}(x - a\cos\theta)$, with $a^2 = 10$ and $b^2 = 30/7$. The distance to the origin being 3 implies $3 = (a/\cos\theta)\sin\varphi$, where $\tan\varphi = \frac{b\cos\theta}{a\sin\theta}$. Hence $3\cos\theta = a\sin\varphi$ and, similarly, $3\sin\theta = b\cos\varphi$, so that
$$\frac{9}{a^2}\cos^2\theta + \frac{9}{b^2}\sin^2\theta = 1, \quad\text{i.e.}\quad 9\cos^2\theta + 21\sin^2\theta = 10, \quad\text{giving}\quad 12\sin^2\theta = 1.$$
One of the resulting tangents is
$$y - \sqrt{5/14} = -\sqrt{33/7}\,\bigl(x - \sqrt{55/6}\bigr).$$ □
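The tangency condition from the first solution is also quick to resolve in Sage; the sketch below (our own check, not part of the exercise) solves for $\sin^2\theta$ directly.

s2 = var('s2')                    # s2 stands for sin(theta)^2
a2, b2 = 10, 30/7                 # from 3x^2 + 7y^2 = 30
print(solve(a2*(1 - s2) + b2*s2 == 9, s2))   # [s2 == (7/40)], so cos(theta)^2 == 33/40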
Solution. (Alternative.) The centre of the ellipse is at the origin. The distance $d$ between the line $ax + by + c = 0$ and the origin is $d = \frac{|c|}{\sqrt{a^2 + b^2}}$. The tangent then satisfies $a^2 + b^2 = \frac{c^2}{9}$. The equation of the tangent touching the ellipse at the point $[x_T, y_T]$ is $3x x_T + 7y y_T - 30 = 0$. Hence, for the coordinates of the point of tangency,
$$(3x_T)^2 + (7y_T)^2 = 100, \qquad 3x_T^2 + 7y_T^2 = 30.$$
Its solution is $x_T = \pm\sqrt{\frac{55}{6}}$, $y_T = \pm\sqrt{\frac{5}{14}}$. Considering the symmetry of the ellipse, there are four solutions
$$\pm 3\sqrt{\tfrac{55}{6}}\,x \pm 7\sqrt{\tfrac{5}{14}}\,y - 30 = 0.$$ □

4.D.11. The hyperbola $x^2 - y^2 = 2$ is given. Find an equation of the hyperbola having the same foci and passing through the point $[-2, 3]$.

Solution. The given hyperbola has $a^2 = b^2 = 2$, so $a^2e^2 = a^2 + b^2 = 4$, and the foci are at $(\pm ae, 0) = (\pm 2, 0)$. So the desired hyperbola has the equation
$$\sqrt{(x - 2)^2 + y^2} - \sqrt{(x + 2)^2 + y^2} = k$$
for some constant $k$. Since the hyperbola passes through $[-2, 3]$, $k = 2$. Rewrite this as $\sqrt{(x - 2)^2 + y^2} = \sqrt{(x + 2)^2 + y^2} + 2$ and square:
$$(x - 2)^2 + y^2 = (x + 2)^2 + y^2 + 4\sqrt{(x + 2)^2 + y^2} + 4.$$
Simplifying and squaring again gives $(-2x - 1)^2 = (x + 2)^2 + y^2$, or $3x^2 = y^2 + 3$, which is the required hyperbola. □

Solution. (Alternative.) The equation of the desired hyperbola is $\frac{x^2}{a^2} - \frac{y^2}{b^2} = 1$, with its eccentricity $e$ satisfying $a^2e^2 = a^2 + b^2 = 4$, since the foci are at $[\pm ae, 0] = [\pm 2, 0]$. The point $[-2, 3]$ lies on the hyperbola, so $\frac{4}{a^2} - \frac{9}{b^2} = 1$. It follows that $a^2 = 1$, $b^2 = 3$. The desired hyperbola is $x^2 - \frac{y^2}{3} = 1$. □

4.D.12. Determine the equations of the tangent lines to the hyperbola $4x^2 - 9y^2 = 1$ which are perpendicular to the line $x - 2y + 7 = 0$.

Solution. All lines perpendicular to the given line have an equation $2x + y + c = 0$ for some $c$. Such a line is a tangent if its intersection with the hyperbola is a double root, i.e. the equation $4x^2 - 9(-2x - c)^2 = 1$ has a double root. Hence $(36c)^2 - 4 \cdot 32 \cdot (9c^2 + 1) = 0$, and $c = \pm\frac{2\sqrt{2}}{3}$. □

4.D.13. Determine the tangent to the ellipse $\frac{x^2}{16} + \frac{y^2}{9} = 1$ which is parallel to the line $x + y - 7 = 0$.

Solution. The lines parallel to the given line intersect it at the point at infinity $(1 : -1 : 0)$. Construct the tangents to the given ellipse passing through this point. The point of tangency $T = (t_1 : t_2 : t_3)$ lies on its polar and therefore satisfies $\frac{t_1}{16} - \frac{t_2}{9} = 0$, so $t_2 = \frac{9}{16}t_1$. Substituting into the ellipse equation, we get $t_1 = \pm\frac{16}{5}$. The touch points of the desired tangents are $[\frac{16}{5}, \frac{9}{5}]$ and $[-\frac{16}{5}, -\frac{9}{5}]$. The tangents are the polars of these points, with the equations $x + y = 5$ and $x + y = -5$. □

Solution. (Alternative.) The given line has slope $-1$. The tangent line at $(4\cos\theta, 3\sin\theta)$ has slope $-\frac{3\cos\theta}{4\sin\theta}$, so it is required that $\tan\theta = \frac{3}{4}$. The tangent line has equation $y - 3\sin\theta = -(x - 4\cos\theta)$, where either $\sin\theta = 3/5$ and $\cos\theta = 4/5$, or $\sin\theta = -3/5$ and $\cos\theta = -4/5$. The two solutions are $x + y = \pm 5$. □

4.D.14. Determine the points at infinity and the asymptotes of the conic section
$$2x^2 + 4xy + 2y^2 - y + 1 = 0.$$

Solution. The equation for the points at infinity, $2x^2 + 4xy + 2y^2 = 0$, or rather $2(x + y)^2 = 0$, has the solution $x = -y$. The only point at infinity therefore is $(1 : -1 : 0)$, so the conic section is a parabola. The asymptote is the polar of this point, which here is the line at infinity $z = 0$. □

4.D.15. Prove that the product of the distances between an arbitrary point on a hyperbola and both of its asymptotes is constant. Find its value.

Solution. Denote the point lying on the hyperbola by $P = [p, q]$. The asymptotes of the hyperbola in canonical form are the lines $bx \pm ay = 0$.
Their normal vectors are $(b, \pm a)$, and from here we determine the orthogonal projections $P_1$, $P_2$ of the point $P$ onto the asymptotes. For the distances between the point $P$ and the asymptotes we get $|PP_{1,2}| = \frac{|bp \pm aq|}{\sqrt{a^2 + b^2}}$. The product is therefore equal to
$$\frac{|b^2p^2 - a^2q^2|}{a^2 + b^2} = \frac{a^2b^2}{a^2 + b^2},$$
because $P$ lies on the hyperbola. □

4.D.16. Compute the angle between the asymptotes of the hyperbola $3x^2 - y^2 = 3$.

Solution. For the cosine of the angle between the asymptotes of a hyperbola in canonical form, $\cos\alpha = \frac{b^2 - a^2}{b^2 + a^2}$. In this case ($a^2 = 1$, $b^2 = 3$, so $\cos\alpha = \frac{1}{2}$), the angle is 60 degrees. □

4.D.17. Locate the centers of the conic sections
(a) $9x^2 + 6xy - 2y - 2 = 0$,
(b) $x^2 + 2xy + y^2 + 2x + y + 2 = 0$,
(c) $x^2 - 4xy + 4y^2 + 2x - 4y - 3 = 0$,
(d) $\frac{(x - \alpha)^2}{a^2} + \frac{(y - \beta)^2}{b^2} = 1$.

Solution. (a) The system $\bar{A}s + a = 0$ for computing the centers is
$$9s_1 + 3s_2 = 0, \qquad 3s_1 - 1 = 0.$$
Solve it to obtain the center at $[\frac{1}{3}, -1]$.

(b) In this case,
$$s_1 + s_2 + 1 = 0, \qquad s_1 + s_2 + \tfrac{1}{2} = 0.$$
Therefore there is no proper center (the conic section is a parabola). Moving to homogeneous coordinates we can obtain the center at infinity $(1 : -1 : 0)$.

(c) The coordinates of the center in this case satisfy
$$s_1 - 2s_2 + 1 = 0, \qquad -2s_1 + 4s_2 - 2 = 0.$$
The solution is a whole line of centers. This is so because the conic section is degenerate: it is a pair of parallel lines.

(d) The center is at $(\alpha, \beta)$. The coordinates of the center therefore give the translation of the coordinate system to the frame in which the ellipse has its basic form. □

4.D.18. Find the equations of the axes of the conic section $6xy + 8y^2 + 4y + 2x - 13 = 0$.

Solution. The major and minor axes of the conic section are in the directions of the eigenvectors of the matrix
$$\begin{pmatrix} 0 & 3 \\ 3 & 8 \end{pmatrix}.$$
The characteristic equation has the form $\lambda^2 - 8\lambda - 9 = 0$. The eigenvalues are therefore $\lambda_1 = -1$, $\lambda_2 = 9$, with corresponding eigenvectors $(3, -1)$ and $(1, 3)$. The axes are the polars of the points at infinity defined by these directions. For $(3, -1)$, the axis equation is $-3x + y + 1 = 0$. For $(1, 3)$ it is $9x + 27y + 7 = 0$. □

4.D.19. Determine the equations of the axes of the conic section $4x^2 + 4xy + y^2 + 2x + 6y + 5 = 0$.

Solution. The eigenvalues of the matrix
$$\begin{pmatrix} 4 & 2 \\ 2 & 1 \end{pmatrix}$$
are $\lambda_1 = 0$, $\lambda_2 = 5$, and the corresponding eigenvectors are $(-1, 2)$ and $(2, 1)$. There is one axis, $2x + y + 1 = 0$, and the conic section is a parabola. □

4.D.20. The equation $x^2 + 3xy - y^2 + x + y + 1 = 0$ defines a conic section. Determine its center, axes, asymptotes and foci.

4.D.21. Find the equation of the tangent at $P = [1, 1]$ to the conic section $4x^2 + 5y^2 - 8xy + 2y - 3 = 0$.

Solution. By projectivizing, this is the conic section defined by the quadratic form $(x, y, z)A(x, y, z)^T$ with the matrix
$$A = \begin{pmatrix} 4 & -4 & 0 \\ -4 & 5 & 1 \\ 0 & 1 & -3 \end{pmatrix}.$$
Using the previous theorem, the tangent is the polar of $P$, which has homogeneous coordinates $(1 : 1 : 1)$. It is given by the equation $(1, 1, 1)A(x, y, z)^T = 0$, which in this case gives $2y - 2z = 0$. Moving back to inhomogeneous coordinates, the tangent line equation is $y = 1$. □

4.D.22. Find the coordinates of the intersection of the $y$ axis and the conic section defined by
$$5x^2 + 2xy + y^2 - 8x = 0.$$

Solution. The $y$ axis is the line $x = 0$. It is the polar of a point $P$ with homogeneous coordinates $\langle p \rangle = (p_1 : p_2 : p_3)$. That means that the equation $x = 0$ is equivalent to the polar equation $F(p, v) = p^T A v = 0$, where $v = (x, y, z)^T$. This is satisfied when $Ap = (\alpha, 0, 0)^T$ for some $\alpha \in \mathbb{R}$.
With the conic section matrix
$$A = \begin{pmatrix} 5 & 1 & -4 \\ 1 & 1 & 0 \\ -4 & 0 & 0 \end{pmatrix},$$
this condition gives the equation system
$$5p_1 + p_2 - 4p_3 = \alpha, \qquad p_1 + p_2 = 0, \qquad -4p_1 = 0.$$
We can find the coordinates of $P$ by the inverse matrix, $p = A^{-1}(\alpha, 0, 0)^T$, or solve the system directly by backward substitution. In this case we easily obtain the solution $p = (0, 0, -\frac{1}{4}\alpha)$. So the $y$ axis touches the conic section at the origin. □

4.D.23. Find the touch point of the line $x = 2$ with the conic section from the previous exercise.

Solution. The line has the equation $x - 2z = 0$ in the projective extension, and therefore we get the condition $Ap = (\alpha, 0, -2\alpha)^T$ for the touch point $P$, which gives
$$5p_1 + p_2 - 4p_3 = \alpha, \qquad p_1 + p_2 = 0, \qquad -4p_1 = -2\alpha.$$
Its solution is $p = (\frac{1}{2}\alpha, -\frac{1}{2}\alpha, \frac{1}{4}\alpha)$. These homogeneous coordinates are equivalent to $(2, -2, 1)$, and hence the touch point has coordinates $[2, -2]$. □

4.D.24. Find the equations of the tangents passing through $P = [3, 4]$ to the conic defined by
$$2x^2 - 4xy + y^2 - 2x + 6y - 3 = 0.$$

Solution. Suppose that the point of tangency $T$ has homogeneous coordinates given by a multiple of the vector $t = (t_1, t_2, t_3)$. The condition that $T$ lies on the conic section is $t^T A t = 0$, which gives
$$2t_1^2 - 4t_1t_2 + t_2^2 - 2t_1t_3 + 6t_2t_3 - 3t_3^2 = 0.$$
The condition that $P$ lies on the polar of $T$ is $p^T A t = 0$, where $p = (3, 4, 1)$ are the homogeneous coordinates of the point $P$. In this case, the equation gives
$$(3, 4, 1)\begin{pmatrix} 2 & -2 & -1 \\ -2 & 1 & 3 \\ -1 & 3 & -3 \end{pmatrix}\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} = -3t_1 + t_2 + 6t_3 = 0.$$
Now we can substitute $t_2 = 3t_1 - 6t_3$ into the previous quadratic equation. Then
$$-t_1^2 + 4t_1t_3 - 3t_3^2 = 0.$$
Because the equation is not satisfied for $t_3 = 0$, we may move to inhomogeneous coordinates $(\frac{t_1}{t_3}, \frac{t_2}{t_3}, 1)$, for which we get $-(\frac{t_1}{t_3})^2 + 4(\frac{t_1}{t_3}) - 3 = 0$ and $\frac{t_2}{t_3} = 3(\frac{t_1}{t_3}) - 6$, i.e. either $\frac{t_1}{t_3} = 1$ and $\frac{t_2}{t_3} = -3$, or $\frac{t_1}{t_3} = 3$ and $\frac{t_2}{t_3} = 3$. So the touch points have homogeneous coordinates $(1 : -3 : 1)$ and $(3 : 3 : 1)$. The tangent equations are the polars of these points: $7x - 2y - 13 = 0$ and $x = 3$. □

4.D.25. Find the equations of the tangents passing through the origin to the circle
$$x^2 + y^2 - 10x - 4y + 25 = 0.$$

Solution. The touch point $(t_1 : t_2 : t_3)$ satisfies
$$(0, 0, 1)\begin{pmatrix} 1 & 0 & -5 \\ 0 & 1 & -2 \\ -5 & -2 & 25 \end{pmatrix}\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} = -5t_1 - 2t_2 + 25t_3 = 0.$$
From here (setting $t_3 = 1$) we eliminate $t_2$ and substitute into the circle equation, which $(t_1 : t_2 : t_3)$ has to satisfy as well. We obtain the quadratic equation $29t_1^2 - 250t_1 + 525 = 0$, with solutions $t_1 = 5$ and $t_1 = \frac{105}{29}$. We compute the coordinate $t_2$ and get the touch points $[5, 0]$ and $[\frac{105}{29}, \frac{100}{29}]$. The tangents are the polars of these points, with equations $y = 0$ and $20x - 21y = 0$. □

4.D.26. Find the equations of the tangents to the circle $x^2 + y^2 = 5$ which are parallel to the line $2x + y + 2 = 0$.

Solution. In the projective extension, these tangents intersect at the point at infinity satisfying $2x + y + z = 0$, i.e. at the point with homogeneous coordinates $(1 : -2 : 0)$. They are the tangents from this point to the circle. We can use the same method as in the previous exercise. The conic section matrix is diagonal with the diagonal $(1, 1, -5)$, and therefore the touch point $(t_1 : t_2 : t_3)$ of the tangents satisfies $t_1 - 2t_2 = 0$. Substituting into the circle equation we get $5t_2^2 = 5$. Since $t_2 = \pm 1$, the touch points are $[2, 1]$ and $[-2, -1]$. □

Solution. (Alternative.) The point $P = \sqrt{5}(\cos\theta, \sin\theta)$ lies on the circle for all $\theta$. The tangent line at $P$ is $x\cos\theta + y\sin\theta = \sqrt{5}$. This has slope $-\frac{\cos\theta}{\sin\theta}$, which is $-2$ provided $\tan\theta = \frac{1}{2}$. It follows that $P$ is either $[2, 1]$ or $[-2, -1]$. □
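The touch points of 4.D.26 are easy to double-check in Sage (the check below is our own addition):

for (px, py) in [(2, 1), (-2, -1)]:
    on_circle = (px^2 + py^2 == 5)    # the point lies on x^2 + y^2 = 5
    slope = -px/py                    # slope of the polar px*x + py*y = 5
    print(on_circle, slope == -2)     # True True: tangent parallel to 2x + y + 2 = 0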
A tangent line touching the conic section at a point at infinity is called an asymptote. The number of asymptotes of a conic section equals the number of intersections of the conic section with the line at infinity. So the ellipse has no real asymptotes, the parabola has one (which is, however, the line at infinity itself) and the hyperbola has two.

4.D.27. Find the points at infinity and the asymptotes of the conic section defined by
$$4x^2 - 8xy + 3y^2 - 2y - 5 = 0.$$

Solution. First, rewrite the conic section in homogeneous coordinates:
$$4x^2 - 8xy + 3y^2 - 2yz - 5z^2 = 0.$$
The points at infinity are the homogeneous coordinates $(x : y : 0)$ satisfying this equation, which means $4x^2 - 8xy + 3y^2 = 0$. It follows that either $\frac{x}{y} = \frac{1}{2}$ or $\frac{x}{y} = \frac{3}{2}$. The conic section is therefore a hyperbola with the points at infinity $P = (1 : 2 : 0)$ and $Q = (3 : 2 : 0)$. The asymptotes are the polars of these points:
$$(1, 2, 0)\begin{pmatrix} 4 & -4 & 0 \\ -4 & 3 & -1 \\ 0 & -1 & -5 \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = -4x + 2y - 2 = 0$$
and
$$(3, 2, 0)\begin{pmatrix} 4 & -4 & 0 \\ -4 & 3 & -1 \\ 0 & -1 & -5 \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = 4x - 6y - 2 = 0.$$ □

There are further exercises on conic sections on page 361.

4.D.28. Harmonic cross-ratio. If the cross-ratio of four points lying on a line equals $-1$, we talk about a harmonic quadruple. Let $ABCD$ be a quadrilateral. Denote by $K$ the intersection of the lines $AB$ and $CD$, and by $M$ the intersection of the lines $AD$ and $BC$. Further, let $L$, $N$ be the intersections of the line $KM$ with $AC$ and $BD$ respectively. Show that the points $K$, $L$, $M$, $N$ form a harmonic quadruple. ⃝

CHAPTER 5

Establishing the ZOO

which functions do we need for our models? – a thorough menagerie

In this chapter, we start using tools allowing us to model dependencies which are neither linear nor discrete. Such models are often needed when dealing with time dependent systems. We try to describe them not only at discrete moments of time, but "continuously". Sometimes this is advantageous, for instance in physical models of classical mechanics and engineering. It might also be appropriate and computationally effective to employ an approximation of discrete models in economics, chemistry, or biology. In particular, such ideas may be appropriate in relation to stochastic models, as we shall see in Chapter 10.

The key concept is that of a function, also called a "signal" in practical applications. The larger the class of functions used, the more difficult is the development of effective tools. On the other hand, if there are only a few simple types of functions available, it may be that some real situations cannot be modelled at all. The objective of the following two chapters is thus to introduce explicitly the most elementary functions of real variables, to describe implicitly many more functions, and to build the standard tools for using them. This is the differential and integral calculus of one variable. While the focus has so far been mainly on the part of mathematics called algebra, the emphasis will now be on mathematical analysis. The link between the two is provided by a "geometric approach": if possible, this means building concepts and intuition independently of any choice of coordinates. Often this leads to a discrete (finite) description of the objects of interest, as is immediately the case when working with polynomials now.

1. Polynomial interpolation

In the previous chapters, we often worked with sequences of real or complex numbers, i.e. with scalar functions $\mathbb{N} \to \mathbb{K}$ or $\mathbb{Z} \to \mathbb{K}$, where $\mathbb{K}$ is a given set of numbers. We also worked with sequences of vectors over the real or complex numbers. Recall the discussion from paragraph 1.1.6 about dealing with scalar functions.
This discussion is adequate for working with functions R → R (real-valued functions of one real variable), or R → C (complex-valued functions of one real variable), or sometimes, more generally, with vector-valued functions of one real variable R → V. The results can usually be extended to cases concerning vector values over the considered scalars, rather than just real and complex numbers. We begin with some easily computable functions.

CHAPTER 5

Establishing the ZOO

which functions do we need for our models? – a thorough menagerie

A. Polynomial interpolation

Our first "route" in the zoo of functions is devoted to a beautiful topic concerning the elementary notion of polynomials. Usually, given a polynomial, one wants to find its roots, and in the first chapter we learned to evaluate polynomials via Horner's scheme. Of course, evaluation is an essential part of any root-finding method, and we know that the fundamental theorem of algebra classifies all roots of a polynomial over C (see also paragraph 12.2.8). Below we adopt a slightly different approach and focus on applications related to the notion of polynomial interpolation of real or complex functions. Roughly speaking, this means that we will use polynomials of a certain degree to approximate such functions. This includes the so-called "Lagrange interpolation", but we will also explain how the theory of derivatives gets involved, in terms of the "Hermite interpolation problem" and "cubic splines". We mention that in this part the use of elementary functions is stressed, although they are systematically analyzed only later in this chapter.

5.A.1. Using Sage for plotting graphs. Computer algebra programs, such as Mathematica, Matlab, Maple, or Sage, all include simple syntax for drawing the graph Gf = {(x, f(x)) : x ∈ I} of a real polynomial f(x) = anx^n + · · · + a1x + a0 of some fixed degree n ∈ N, or

5.1.1. Polynomials. We can add and multiply scalars. These operations satisfy a number of properties which we listed in paragraphs 1.1.1 and 1.1.5. If we admit any finite number of these operations, leaving one of the variables as an unknown and fixing the other scalars, we obtain the polynomial functions. We consider the scalars K = R, C, or Q.

Polynomials

A polynomial over a ring of scalars K is a mapping f : K → K given by the expression

f(x) = anx^n + an−1x^{n−1} + · · · + a1x + a0,

where ai, i = 0, . . . , n, are fixed scalars. Multiplication is indicated by juxtaposition of symbols, and "+" denotes addition. If an ≠ 0, the polynomial f is said to have degree n. The degree of the zero polynomial is undefined. The scalars ai are called the coefficients of the polynomial f.

The polynomials of degree zero are exactly the non-zero constant mappings. In algebra, polynomials are more often defined as formal expressions of the aforementioned form of f(x), i.e. a polynomial is defined to be a sequence a0, a1, . . . of coefficients such that only finitely many of them are nonzero. However, we will show shortly that these approaches are equivalent for our choices of scalars. It is easy to verify that the polynomials over a given ring of scalars form a ring. Multiplication and addition of polynomials are given by the corresponding operations in the original ring K, applied to the values of the polynomials.
Hence,

(f · g)(x) = f(x) · g(x),  (f + g)(x) = f(x) + g(x),

where the operations on the left-hand sides are interpreted in the ring of polynomials, whereas the operations on the right-hand sides are those of the ring of scalars (see the second part of Chapter 12 for a detailed algebraic treatment).

5.1.2. Division of polynomials. As already mentioned, the scalar fields used are Q, R, or C. In all of these fields, the following holds:

Euclidean division of polynomials

Proposition. For any two polynomials f of degree n and g of degree m, there is exactly one pair of polynomials q, r such that f = q · g + r, where either r = 0 or the degree of r is less than m.

Proof. This is a special, simple case of the much more general algebraic result in 12.3.6. Write

f(x) = anx^n + an−1x^{n−1} + · · · + a1x + a0

for the polynomial of degree n, and

g(x) = bmx^m + bm−1x^{m−1} + · · · + b1x + b0,

with an ≠ 0 and bm ≠ 0.

more generally, of some real function f : I → R, where I is a suitable interval in the real line. Next we will use Sage and its command plot(f(x), x, a, b), a situation that we encountered already in Chapter 1. It is worth mentioning that Sage computes the plot of a given function f by evaluating the function at a large number of random points lying in the chosen interval [a, b]. Since in general Sage tries to plot the whole graph of f, it is often useful to restrict the "y-range" between certain values ("compatible" with the range of y = f(x), say c ≤ y = f(x) ≤ d).¹ This can be done by including inside the command plot the options ymin = c and ymax = d, respectively. The restriction of the y-range is also useful for treating further specialities, as for example the graphs of functions with (vertical) asymptotes and other situations. We will discuss such details later.

¹This is extremely useful when, for example, we need to sketch the graph of a rational function.

5.A.2. Low degree polynomials. Describe geometrically real polynomials of degree zero, one, two, and three, respectively. Use Sage to plot the corresponding graphs for the most "fancy" cases (i.e., the non-linear cases).

Solution. Let us discuss the cases of degree n = 2, 3 only (whose graphs are more interesting than the straight lines obtained for n = 0, 1). So, recall that a second degree polynomial has the form f(x) = a2x² + a1x + a0 = ax² + bx + c, with a2 = a ≠ 0 and x ∈ R. Its graph is a parabola, symmetric with respect to the vertical line x = −b/(2a) and oriented upwards or downwards, depending on the sign of a. Below we present the graph of the parabola f(x) = 4x² − 2x − 3, constructed in Sage. This indicates also its axis of symmetry and its vertex V = [−b/(2a), f(−b/(2a))] = [1/4, −13/4]. Recall from 1.G.5 that a method to sketch the graph of f(x) = a0 + a1x + a2x² is based on the substitution method, and in particular relies on a combination of the commands plot and subs. This allows us to substitute certain values of the parameters a0, a1, a2 and quickly sketch the corresponding graph Cf (this can be applied to plot the graph of any polynomial).

a=SR.var("a", 3); P=a[0]+a[1]*x+a[2]*x^2
q=plot(P.subs(a[0] == -3, a[1] == -2, a[2] == 4), x, -6, 6); q.show()

As a side remark related to the use of ymin and ymax mentioned in 5.A.1, observe that one may include them also in the command show ending the previous block, to restrict the values of f.
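To illustrate this side remark, here is a tiny sketch of our own (the function 1/(x − 1) is an assumed example, not from the text): its vertical asymptote at x = 1 makes an unrestricted plot useless, and the y-range options repair this.

# Restricting the y-range when a vertical asymptote dominates the picture.
p = plot(1/(x - 1), (x, -4, 6), detect_poles=True)
show(p, ymin=-5, ymax=5)   # the same options could be passed to plot() itself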
If q ̸= q′ , then the term of highest degree in (q − q′ ) · g cannot be replicated in r − r′ . This leads to a contradiction. This proves uniqueness. It remains to prove that f can always be expressed in the desired form. If m > n, then f = 0 · g + f satisfies the requirements. So suppose that n ≥ m. The result is proved by induction on the degree of f. If f is of degree zero, then the statement is trivial. Suppose the statement holds for all polynomials f of degree less than n > 0. Put h(x) = f(x) − an bm xn−m g(x). If h(x) is the zero polynomial, then f is of the desired form. Otherwise h(x) is a polynomial of degree less than that of f and so h be written in the desired form as h(x) = q · g + r. But then f(x) = h(x) + an bm xn−m g(x) = (q + an bm xn−m )g(x) + r and the proof is complete. □ If f(b) equals zero for some element b ∈ K, then the equality f(x) = q(x)(x−b)+r yields 0 = f(b) = q(b)·0+r, and so the constant remainder is r = 0. Consequently f(x) = (x−b)q(x). The value b is called a root of the polynomial f. The degree of q is then n − 1. If q also has a root, we can continue and in no more than n steps we arrive at a constant polynomial. It follows that the number of roots of any non-zero polynomial over the field K is at most the degree of the polynomial. Hence the following observation: Corollary. If the field of scalars K is infinite, then the polynomials f and g are equal as mappings if and only if they are equal as sequences of coefficients. Proof. Suppose that f = g, i.e. f−g = 0, as a mapping. Then the polynomial (f − g)(x) has infinitely many roots, which is possible only if it is the zero polynomial. □ Notice that of course, this statement does not hold for finite fields. A simple counter-example is the polynomial x2 + x over Z2 which represents a constant zero mapping. 5.1.3. Interpolation polynomial. It is often desirable to use an easily computable expression for a function which is given by its values at some given points x0, . . . , xn. Mostly this would be an approximation of an unknown function represented by the finite values only. We look for such polynomials. If the values were all zeros, we can immediately find a polynomial of degree n + 1, namely f(x) = (x − x0)(x − x1) . . . (x − xn). 370 the command show ending the previous block, to restrict the values of f. For n = 3, the graph of a general cubic polynomial f(x) = a3x3 + a2x2 + a1x + a0, with a3 ̸= 0 is a curve running from −∞ to ∞, if a3 > 0, or the opposite way from ∞ to −∞ if a3 < 0. A cubic polynomial can be continuously either increasing or decreasing, or it can have two bumps. Observe also that such a polynomial always meets the x-axis, providing a real root either once or three times (see also 1.A.6 in Chapter 1). Later in Section C we will discuss conditions distinguishing these cases. □ Functions as trigonometric, exponential or logarithmic, have an elementary role in our zoo of functions, hence we will meet them very often. It will be also useful to know how to treat and evaluate them in Sage, a goal that we quickly figure out for the trigonometric one via our next task. 5.A.3. Trigonometric functions via Sage. Use Sage to obtain the exact values of all the elementary trigonometric functions at the points 0, π/6, π/4, π/3, π/2, 2π/3, π, 3π/2, and 2π, in case they are defined there. Solution. All the trigonometric functions are built into Sage. We type sin, cos and tan for sin, cos and tan = sin cos , respectively. 
Functions such as the trigonometric, exponential or logarithmic ones play an elementary role in our zoo of functions, hence we will meet them very often. It will also be useful to know how to treat and evaluate them in Sage, a goal that we quickly accomplish for the trigonometric ones via our next task.

5.A.3. Trigonometric functions via Sage. Use Sage to obtain the exact values of all the elementary trigonometric functions at the points 0, π/6, π/4, π/3, π/2, 2π/3, π, 3π/2, and 2π, in case they are defined there.

Solution. All the trigonometric functions are built into Sage. We type sin, cos and tan for sin, cos and tan = sin/cos, respectively. Similarly, we type cot, sec and csc for the cotangent, secant, and cosecant functions. To save time, we recall that these are the functions defined by

cot(x) = cos(x)/sin(x) = 1/tan(x),  sec(x) = 1/cos(x),  csc(x) = 1/sin(x),

respectively. To obtain the required values in an exact way, you may type sin(pi/6), cos(3*pi/2), and so on.² When a function is not defined at a certain point, Sage returns Infinity. This is the case for the commands cot(0) and sec(pi/2), for example. In this way one may recover the following table:

        sin     cos     tan     cot     sec     csc
0       0       1       0       −       1       −
π/6     1/2     √3/2    √3/3    √3      2√3/3   2
π/4     √2/2    √2/2    1       1       √2      √2
π/3     √3/2    1/2     √3      √3/3    2       2√3/3
π/2     1       0       −       0       −       1
2π/3    √3/2    −1/2    −√3     −√3/3   −2      2√3/3
π       0       −1      0       −       −1      −
3π/2    −1      0       −       0       −       −1
2π      0       1       0       −       1       −

Notice that the inverse trigonometric functions are also built into Sage: we type arcsin, arccos, arctan, arccot, arcsec and arccsc for the inverses of sin, cos, tan, cot, sec and csc, respectively. You may use Sage to evaluate these functions, or to obtain their graphs. For some of them, we will do this later in this chapter. □

²In case you need a decimal approximation, type N(sin(pi/6)), etc. An alternative is the command n(sin(pi/6)), though we should be careful not to have introduced "n" as a symbolic variable in the same cell.

This is zero at these points and only at them. However, there are other polynomials which are zero at the given points, for instance the zero polynomial, which turns out to be the only such polynomial in the vector space of polynomials of degree at most n. The general situation is analogous:

Interpolation polynomials

Let K be an infinite field of scalars. An interpolation polynomial f for the set of (pairwise distinct) points x0, . . . , xn ∈ K and given values y0, . . . , yn ∈ K is either the zero polynomial or a polynomial of degree at most n such that f(xi) = yi for all i = 0, 1, . . . , n.

Theorem. For every set of n + 1 (pairwise distinct) points x0, . . . , xn ∈ K and given values y0, . . . , yn ∈ K, there is exactly one interpolation polynomial f.

Proof. If f and g are interpolation polynomials with the same defining values, then their difference is a polynomial of degree at most n which has at least n + 1 roots, and thus f − g = 0. This proves uniqueness. It remains to prove the existence. Label the coefficients of the polynomial f of degree n: f = anx^n + · · · + a1x + a0. Substituting the desired values leads to a system of n + 1 equations for the same number of unknown coefficients ai:

a0 + x0a1 + · · · + (x0)^n an = y0
...
a0 + xna1 + · · · + (xn)^n an = yn.

The existence of a solution of this system is easily shown by constructing the polynomial using the Lagrange polynomials for the given points x0, . . . , xn, introduced in the next paragraph. However, the proof can be concluded using only basic knowledge from linear algebra. This system of linear equations has a unique solution if the determinant of its matrix is a non-zero scalar (see 2.3.5 and 2.2.11). The determinant is the Vandermonde determinant, which was discussed in exercise 2.E.21 on page 151. Since it is verified that for zero right-hand sides there is exactly one solution, we know that this determinant must be non-zero.

Let us now focus on polynomial interpolation. Recall from 5.1.3 that given some distinct nodes x0, x1, . . . , xn, there is precisely one polynomial P(x) of degree not greater than n which takes a prescribed value yi at xi, for all i = 0, 1, . . . , n.
This is given by

P(x) = y0 · ℓ0(x) + · · · + yn · ℓn(x),

with

ℓi(x) := \frac{(x − x_0) \cdots (x − x_{i−1})(x − x_{i+1}) \cdots (x − x_n)}{(x_i − x_0) \cdots (x_i − x_{i−1})(x_i − x_{i+1}) \cdots (x_i − x_n)}

for any i = 0, 1, . . . , n. Observe that ℓi(xi) = 1 for all i = 0, 1, . . . , n, while ℓi(xj) = 0 for all i ≠ j. The polynomial P(x) is known as the Lagrange interpolation polynomial, while the ℓi are referred to as the elementary Lagrange polynomials; see 5.1.4 for details. Notice that if yi = f(xi) for some function f, then P(x) is referred to as the Lagrange interpolation polynomial for f. Let us now illustrate the situation by examples.

5.A.4. Lagrange interpolation. Consider the nodes x0 = 0, x1 = 1, x2 = 4 and the values y0 = 1, y1 = 2, y2 = 4. Write down the corresponding Lagrange interpolation polynomial P(x), and sketch its graph together with the graphs of ℓ0, ℓ1, ℓ2.

Solution. According to 5.1.4, the Lagrange interpolation polynomial will be of degree at most two. We compute ℓ0(x) = (x−1)(x−4)/4, ℓ1(x) = −x(x−4)/3 and ℓ2(x) = x(x−1)/12, and hence

P(x) = −(1/12)x² + (13/12)x + 1.

For an implementation of the Lagrange interpolation, Sage provides an inbuilt method based on the command lagrange_polynomial. For example, in our case we can give the block

nodes=[(0, 1), (1, 2), (4, 4)]
R=PolynomialRing(QQ, "x")
P=R.lagrange_polynomial(nodes); show(P)

Check yourself that this cell prints out the explicit expression of the polynomial P(x). Notice also that you can use the bool command to confirm Sage's output. This can be done by adding to the cell posed above the following syntax:

l0(x)=(x-1)*(x-4)/4; l1(x)=-x*(x-4)/3
l2(x)=x*(x-1)/12
eq=P(x)-l0(x)-2*l1(x)-4*l2(x)==0; eq
bool(eq)

As for the required graphs, we present them below (check yourself that the reproduction of this figure in Sage is really enjoyable). □

Since polynomials are equal as mappings if and only if they are equal as sequences of coefficients, the theorem is proved. □

5.1.4. Applications of interpolation. At first sight, it may seem that real or rational polynomials, that is, polynomial functions R → R or Q → Q, form a very useful class of functions of one variable. We can arrange for them to attain any set of given values. Moreover, they are easily expressible, so their value at any point can be calculated without difficulties. However, there are a number of problems when trying to use them in practice. The first of the problems is to find quickly the polynomial which interpolates the given data. Solving the aforementioned system of linear equations generally requires time proportional to the cube of the number of given points xi. This is unacceptable for large data. We will demonstrate how to overcome this with one popular type of polynomials related to the fixed points x0, . . . , xn:

Lagrange¹ interpolation polynomials

The Lagrange interpolation polynomial is expressed in terms of the elementary Lagrange polynomials ℓi of degree n with the properties

ℓi(xj) = 1 for i = j, and ℓi(xj) = 0 for i ≠ j.

These polynomials must (up to a constant factor) equal the expressions (x − x0) · · · (x − xi−1)(x − xi+1) · · · (x − xn). So

ℓi(x) = \frac{\prod_{j \neq i}(x − x_j)}{\prod_{j \neq i}(x_i − x_j)}.

The desired Lagrange interpolation polynomial is then given by

f(x) = y0ℓ0(x) + y1ℓ1(x) + · · · + ynℓn(x).

Notice that the elementary Lagrange polynomials can be quite easily expressed by means of derivatives; see the exercise ??. The usage of Lagrange polynomials is especially efficient when working with different values yi for the same set of points xi. For in this case, the elementary polynomials ℓi are already prepared.
One of the disadvantages of this expression is its large sensitivity to inaccuracies in the computation when the differences between the given points xi are small, because division by these differences is required. Another disadvantage (common to all ways of expressing the unique interpolation polynomial) is the poor stability of the values of real or rational polynomials outside of the interval containing all the nodes.

¹Joseph-Louis Lagrange (1736-1813) was a famous Italian mathematician and astronomer, who contributed in particular to celestial mechanics. His famous Mécanique analytique appeared in 1788. His name appears often even in this elementary textbook.

5.A.5. Find a polynomial P satisfying the following conditions: P(2) = 1, P(3) = 0, P(4) = −1, P(5) = 6.

Solution. Initially, one may want to use Sage to plot the given points in R². We can do this by the cell

list_plot({2: 1, 3: 0, 4: -1, 5: 6}, size=30, figsize=4, color="black")

Now we can solve the task in two different ways. Four points are given, so we know that there is exactly one polynomial of degree at most three satisfying the given conditions. Hence we can assume that P(x) = a3x³ + a2x² + a1x + a0 for some a0, . . . , a3 ∈ R, and in such terms it is easy to see that the system {P(2) = 1, P(3) = 0, P(4) = −1, P(5) = 6} consists of the equations

a0 + 2a1 + 4a2 + 8a3 = 1,  a0 + 3a1 + 9a2 + 27a3 = 0,
a0 + 4a1 + 16a2 + 64a3 = −1,  a0 + 5a1 + 25a2 + 125a3 = 6.

Applying the Sage cell

var("a0, a1, a2, a3")
eq1=a0+2*a1+4*a2+8*a3-1; eq2=a0+3*a1+9*a2+27*a3
eq3=a0+4*a1+16*a2+64*a3+1
eq4=a0+5*a1+25*a2+125*a3-6
solve([eq1==0,eq2==0,eq3==0,eq4==0],a0,a1,a2,a3)

we obtain a unique solution given by

[[a0 == -29, a1 == (101/3), a2 == -12, a3 == (4/3)]]

An alternative, more "solid" way to treat this system via Sage is here:

a=SR.var("a", 4)
P(x)=a[0]+a[1]*x+a[2]*x^2+a[3]*x^3
solve([P(2)-1==0, P(3)==0, P(4)+1==0, P(5)-6==0], a[0], a[1], a[2], a[3])

To summarize, the polynomial P has the form (4/3)x³ − 12x² + (101/3)x − 29. Let us now verify this answer by the method described before, based on the Lagrange polynomial:

P(x) = 1 · \frac{(x−3)(x−4)(x−5)}{(2−3)(2−4)(2−5)} + 0 · (\dots) + (−1) · \frac{(x−2)(x−3)(x−5)}{(4−2)(4−3)(4−5)} + 6 · \frac{(x−2)(x−3)(x−4)}{(5−2)(5−3)(5−4)} = \frac{4}{3}x³ − 12x² + \frac{101}{3}x − 29.

Of course, such a computation can be done faster in Sage via the lagrange_polynomial command, as in 5.A.4, and we leave this verification to the reader. Below we present the graph of P(x) together with the given nodes. □

Soon we will develop tools for an exact description of the functions' behaviour. But even without such tools, it is clear that, according to the sign of the coefficient of the term of highest degree, the value of the polynomial rapidly approaches plus or minus infinity as x increases (or decreases). However, this sign is not even stable under small changes of the defining values yi. This is illustrated by the following two diagrams, displaying eleven values of the function sin(x) with two different small changes of the values. The interpolated function sin(x) is the dotted line, the circles are the gently moved values yi, and the uniquely determined interpolation polynomial is the solid line. While the approximation is quite good inside the interval covering the eleven points, it is very poor at the margins.
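The reader can reproduce this phenomenon easily. The following block is our own sketch (the book's exact data is not reproduced here): it interpolates sin(x) at eleven equidistant nodes, once exactly and once with a single value moved by 0.05, and compares the two interpolation polynomials near the margins.

# Our sketch of the instability of interpolation under a small data change.
R = PolynomialRing(RR, "x")
nodes = [(i, sin(i).n()) for i in range(11)]
moved = [(i, sin(i).n() + (0.05 if i == 5 else 0)) for i in range(11)]
P1 = R.lagrange_polynomial(nodes)
P2 = R.lagrange_polynomial(moved)
fig = plot(sin(x), (x, 0, 10), linestyle=":", color="black")
fig += plot(P1, 0, 10, color="gray") + plot(P2, 0, 10, color="red")
show(fig, ymin=-3, ymax=3, figsize=4)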
There is a rich theory about interpolation polynomials. If interested, consult the specialized literature.

5.1.5. Remark. The numerical instability caused by the closeness of (some of) the points xi is clearly seen in the system of equations from the proof of the Theorem in 5.1.3. When solving a system of linear equations, instability is closely related to the size of the determinant of the corresponding matrix, which in our case is the Vandermonde determinant V.

Lemma. For any sequence of pairwise distinct scalars x0, . . . , xn ∈ K,

V(x_0, \dots, x_n) = \prod_{n \ge i > k \ge 0} (x_i − x_k).

Proof. The proof is by induction on the number of the points xi. The result is true for n = 1. (The problem is completely uninteresting for n = 0.) Suppose that the result is true for n − 1, i.e.

V(x_0, \dots, x_{n−1}) = \prod_{n−1 \ge i > k \ge 0} (x_i − x_k).

Consider the values x0, . . . , xn−1 to be fixed, and vary the value of xn. Expand the determinant along the last row (see 2.2.9).

5.A.6. Find the Lagrange interpolation polynomial for the function f(x) = 1/(4 + x²) by dividing the closed interval [−1, 1] ⊂ R into n equal parts, for n = 20. Use Sage to present a figure including the graph of f and that of the interpolation polynomial, along with the interpolated points.

Solution. The procedure of dividing [−1, 1] into n equal parts gives rise to the n + 1 points of the form −1 + 2k/n, k = 0, 1, . . . , n. For instance, for n = 1 we get the points −1, 1; for n = 2 the points −1, 0, 1; for n = 3 the points −1, −1/3, 1/3, 1, etc. We can now use Sage and the command .lagrange_polynomial(), as follows:

f(x)=1/(4+x**2)
fig=plot(f(x), (x, -1, 1),figsize=4,color="black")
show(fig)
R=PolynomialRing(QQ, "x"); n=20
points=[(-1+k*(2/n), f(-1+k*2/n)) for k in [0, 1,.., n]]
P(x)=R.lagrange_polynomial(points)
fig+=plot(P(x),(x, -1, 1),color="purple")
fig+=list_plot(points, size=20, figsize=4, color="blue"); fig

This block includes the commands needed to produce the plots of f and of P, along with the interpolated points. This allows us to illustrate the interpolation, as in the figure below. Notice that to obtain the explicit form of P(x) one should type show(P(x)), which we do not present here to save some space (run the given code in your editor to read P(x)). □

5.A.7. Repeat the task in 5.A.6, replacing f with the function g(x) = x/(1 + 8x²). Next present the graphs of g(x) and of the Lagrange interpolation polynomial, along with the given points, via a programming package of your choice. ⃝

5.A.8. Find a polynomial P of degree two or less, taking the values y0 = 1, y1 = −3, y2 = 4 at the points x0 = −1, x1 = 1, x2 = 2, respectively. ⃝

5.A.9. Find a polynomial P of third degree satisfying P(0) = 1, P(1) = 0, P(2) = 1, and P(3) = 10. ⃝

5.A.10. Find a polynomial P satisfying: (i) P(1 + i) = i, P(2) = 1, P(3) = −i; (ii) P(1) = i, P(−1) = −i, P(i) = −1.

This exhibits the desired determinant as the polynomial

(1)  V(x_0, \dots, x_n) = (x_n)^n V(x_0, \dots, x_{n−1}) + \text{lower degree terms in } x_n.

This is a polynomial of degree n in xn, since its coefficient at (xn)^n is non-zero by the induction hypothesis. Evidently, it vanishes at any point xn = xi for i < n, because in that case the original determinant contains two identical rows. The polynomial is thus divisible by the expression (xn − x0)(xn − x1) · · · (xn − xn−1), which itself is of degree n. It follows that the Vandermonde determinant (as a polynomial in the variable xn) must, up to a multiplicative constant, be given by

V(x_0, \dots, x_n) = c \cdot (x_n − x_0)(x_n − x_1) \cdots (x_n − x_{n−1}).
Comparing the coefficients at the highest power (xn)^n in (1) with this expression yields c = V(x0, . . . , xn−1), and the induction hypothesis completes the proof. □

Notice that the value of the determinant is small if the points xi are close together.

5.1.6. Derivatives of polynomials. The values of polynomials rapidly tend to infinity as the input variable grows. Hence polynomials are unable to describe periodic events, such as the values of the trigonometric functions. One could hope to achieve much better results, at least between the points xi, by looking not only at the function values, but also at the rate of increase of the function at those points. For this purpose, we introduce (only intuitively, for the time being) the concept of a derivative for polynomials. Again, we can work with real, complex or rational polynomials. The rate of increase of a real-valued polynomial f(x) at a point x ∈ R should be related to the values

(1)  \frac{f(x + \delta x) − f(x)}{\delta x},

where δx is a small value in K expressing the increment of the argument x. Since we can calculate (over an arbitrary ring)

(x + \delta x)^k = x^k + \cdots + \binom{k}{l} x^l (\delta x)^{k−l} + \cdots + (\delta x)^k,

Solution. A straightforward computation reveals the expression

P(z) = (−3/5 − (4/5)i)z² + (2 + 3i)z − 3/5 − (14/5)i.

We can compute this directly via Sage, as in 5.A.5, although we have to change the field to C:

R = PolynomialRing(CC, "z")
R.lagrange_polynomial([(1+I,I),(2,1),(3,-I)])

Sage returns the answer as follows:

(-0.600000000000000 - 0.800000000000000*I)*z^2 + (2.00000000000000 + 3.00000000000000*I)*z - 0.600000000000000 - 2.80000000000000*I

In the second case, the solution is easier. This is because the conditions are satisfied by the rotation by the angle π/2 in the complex plane. This means that the polynomial must be of the form f(z) = iz. □

Assuming a bit of experience with elementary functions, we now proceed with tasks highlighting the interplay between elementary functions and polynomial interpolation. Later, in Chapter 7, we will study the so-called Chebyshev polynomials, which may enrich this relationship.

5.A.11. Based on Lagrange interpolation, present an approximate polynomial formula for the sine function, using the known values of sin(x) at the points 0, π/6, π/4, π/3, π/2. Next present the graphs of the interpolation polynomial and of sin(x) for x ∈ [0, π], including in your figure the given nodes.

Solution. According to the statement, we have the table

x       0    π/6    π/4     π/3     π/2
sin(x)  0    1/2    √2/2    √3/2    1

To solve our task we can use this table and apply the same technique in Sage as above. This can be done by the block

nodes=[(0, 0), (pi/6, 1/2), (pi/4, sqrt(2)/2),
(pi/3, sqrt(3)/2), (pi/2, 1)]
R=PolynomialRing(RR, "x")
P=R.lagrange_polynomial(nodes); show(P)

Sage's answer says that the interpolation polynomial P(x) is approximately given by

0.0288x⁴ − 0.2043x³ + 0.0214x² + 0.9956x.

Now, in order to produce the required graphs together with the given points, one may proceed with the code

nodes=[(0, 0), (pi/6, 1/2), (pi/4, sqrt(2)/2),
(pi/3, sqrt(3)/2), (pi/2, 1)]
R=PolynomialRing(RR, "x")
P=R.lagrange_polynomial(nodes)
A=plot(P, 0, pi, color="grey")
B=list_plot({0: 0, pi/6: 1/2, pi/4: sqrt(2)/2,
pi/3: sqrt(3)/2, pi/2: 1}, size=30, figsize=4, color="black")
C=plot(sin(x), 0, pi, color="black")
show(A+B+C, figsize=4)

We leave the implementation of this block to the reader. □
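As a complement, one can also build the elementary Lagrange polynomials ℓi directly from their defining formula, instead of calling lagrange_polynomial. The following block is our own sketch (the helper ell and the reuse of the nodes of 5.A.4 are our assumptions); it checks the characteristic property ℓi(xj) = δij used throughout this section:

# Elementary Lagrange polynomials from scratch, for the nodes of 5.A.4.
x = polygen(QQ, "x")
pts = [0, 1, 4]
def ell(i):
    num = prod(x - pts[j] for j in range(len(pts)) if j != i)
    den = prod(pts[i] - pts[j] for j in range(len(pts)) if j != i)
    return num / den
print([[ell(i)(p) for p in pts] for i in range(3)])
# [[1, 0, 0], [0, 1, 0], [0, 0, 1]] -- each ell(i) is 1 at its own node only
P = sum(y * ell(i) for i, y in enumerate([1, 2, 4]))
print(P)    # -1/12*x^2 + 13/12*x + 1, the polynomial found in 5.A.4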
we get, for the polynomial f(x) = anx^n + · · · + a0, the above quotient (1) in the form

\frac{f(x+\delta x) − f(x)}{\delta x} = a_n \frac{n x^{n−1}\,\delta x + \cdots + (\delta x)^n}{\delta x} + \cdots + a_1 \frac{\delta x}{\delta x} = n a_n x^{n−1} + (n−1)a_{n−1}x^{n−2} + \cdots + a_1 + \delta x(\dots),

where the expression in the parentheses at the end depends polynomially on δx. Clearly, for values δx very close to zero, we obtain a value arbitrarily close to the value in the following definition:

Derivatives of polynomials

The derivative of the polynomial f(x) = anx^n + · · · + a0 with respect to the variable x is the polynomial

f′(x) = nanx^{n−1} + (n−1)an−1x^{n−2} + · · · + a1.

From the definition, it is clear that it is just the value f′(x0) of the derivative which gives a good approximation of the polynomial's behaviour near the point x0. More precisely, the lines

y = \frac{f(x_0+\delta x) − f(x_0)}{\delta x}(x − x_0) + f(x_0),

that is, the secant lines of the graph of the polynomial going through the points [x0, f(x0)] and [x0 + δx, f(x0 + δx)], approach, as δx decreases, the line

y = f′(x_0)(x − x_0) + f(x_0),

which is the "tangent" to the graph of the polynomial f. This is the linear approximation to the polynomial f by its tangent line. An exact meaning to all these concepts is given later.

The derivative of polynomials is a linear mapping which, to polynomials of degree at most n, assigns polynomials of degree at most n − 1. Iterating this procedure, we obtain the second derivative f″, the third derivative f⁽³⁾, and generally, after k-fold iteration, the polynomial f⁽ᵏ⁾ of degree n − k. Thus the (n+1)-st derivative is the zero polynomial. This linear mapping is an example of the cyclic nilpotent mappings, which are more thoroughly examined in paragraph 3.4.10.

The derivative behaves well also with respect to the multiplication of polynomials. A straightforward combinatorial check reveals the derivation property, or Leibniz rule, for this linear operator:

(f(x) · g(x))′ = f′(x) · g(x) + f(x) · g′(x).

Actually, this is a purely algebraic result (which holds over any ring of scalars!), and you may either check it yourself or consult the formal proof in 12.3.7.

5.1.7. Hermite's interpolation problem. Consider m + 1 pairwise distinct real numbers x0, . . . , xm, i.e. xi ≠ xj for all i ≠ j. It is desired not only to place a polynomial through given values at these points, but also to determine the first derivatives of the interpolating polynomial at these points. Set the values yi and y′i for all i.

The Lagrange interpolation method comes with an interpolation error (and an error bound), which we will briefly analyze in the final section E of this chapter (see 5.E.9, 5.E.10 and 5.E.11). It is now reasonable to discuss problems where we should also regard the slope of the tangents to our polynomial at the given points. This amounts to employing the derivatives of polynomials (see 5.1.6 if necessary), and it can be handled by the very same methods as before. Thus, next we will meet the so-called Hermite interpolation, a method introduced in 5.1.7 (see also 5.1.8).

5.A.12. Hermite interpolation. Find a polynomial P satisfying the following conditions: P(1) = 0, P′(1) = 1, P(2) = 3, P′(2) = 3.

Solution. We provide two methods of finding the solution.

1st approach: The given conditions give rise to four linear equations in the coefficients of P. If we seek a polynomial of degree less than four, we get the same number of equations and unknown coefficients. Hence, assuming that P(x) = a3x³ + a2x² + a1x + a0 for some reals a0, . . .
, a3, we get the equations

P(1) = a3 + a2 + a1 + a0 = 0,
P′(1) = 3a3 + 2a2 + a1 = 1,
P(2) = 8a3 + 4a2 + 2a1 + a0 = 3,
P′(2) = 12a3 + 4a2 + a1 = 3.

Quickly solving this linear system with Sage, we get P(x) = −2x³ + 10x² − 13x + 5.

2nd approach: This is based on Hermite's interpolation, which requires the description of the so-called fundamental Hermite interpolation polynomials h(1)i and h(2)i (i = 0, 1); see 5.1.7. By assumption, x0 = 1, x1 = 2, and we may set y0 = P(x0) = 0, y1 = P(x1) = 3, y′0 = P′(x0) = 1 and y′1 = P′(x1) = 3. For an application of Hermite's interpolation method one needs the function

ℓ(x) = (x − x0)(x − x1) = (x − 1)(x − 2) = x² − 3x + 2.

Obviously ℓ′(x) = 2x − 3, with ℓ′(x0) = −1 and ℓ′(x1) = 1. Moreover, the second derivative is a constant function, ℓ″(x) = 2. We also need the elementary Lagrange polynomials induced by the nodes x0, x1:

ℓ0(x) = (x − x1)/(x0 − x1) = 2 − x,  ℓ1(x) = (x − x0)/(x1 − x0) = x − 1.

A polynomial f is wanted which satisfies these conditions on the values and derivatives. As in the case of interpolating the values only, we obtain the following system of 2(m + 1) equations for the coefficients of the polynomial f(x) = anx^n + · · · + a0:

a0 + x0a1 + · · · + (x0)^n an = y0
...
a0 + xma1 + · · · + (xm)^n an = ym
a1 + 2x0a2 + · · · + n(x0)^{n−1} an = y′0
...
a1 + 2xma2 + · · · + n(xm)^{n−1} an = y′m.

One can verify that with the choice n = 2m + 1, the determinant of this system is non-zero, and thus there is exactly one solution. The polynomial f can also be constructed immediately: simply create a set of polynomials with values 0 or 1, respectively, for the derivatives and the values, in order to express the desired values as a linear combination. We sketch briefly how to construct them, leaving the details to the reader. The elementary Lagrange polynomials serve well for this purpose. The derivative of f(x) = (ℓi(x))² is 2ℓ′i(x)ℓi(x), and thus all the xj with j ≠ i are roots of this polynomial, and similarly of its derivative f′(x). But a polynomial of degree 2m + 1 is wanted, so we consider rather g(x) = (x − xi)f(x). Now the values of g at all the xj are zero, while the derivative g′(x) = f(x) + (x − xi)f′(x) has the required properties too. Thus we take

h(2)i(x) = (x − xi)(ℓi(x))².

This is called the fundamental Hermite polynomial² of the second type. Finally, we look for a polynomial which has zero derivatives at all the points xj and the same values as ℓi at the given points. We can apply a very similar trick: look for polynomials of the form

h(1)i(x) = (1 − a(x − xi))(ℓi(x))².

All the xj will be roots of this polynomial, except for j = i, where

²Charles Hermite (1822-1901) was a Frenchman active in many areas of Mathematics. His name is mostly linked to the Hermitian operators and matrices, cf. 3.4.6.

Now, according to the formulas given in 5.1.7, we have

h(1)0(x) := (1 − (ℓ″(x0)/ℓ′(x0))(x − x0))(ℓ0(x))² = (2x − 1)(x − 2)²,
h(1)1(x) := (1 − (ℓ″(x1)/ℓ′(x1))(x − x1))(ℓ1(x))² = (5 − 2x)(x − 1)²,
h(2)0(x) := (x − x0)(ℓ0(x))² = (x − 1)(x − 2)²,
h(2)1(x) := (x − x1)(ℓ1(x))² = (x − 2)(x − 1)².

The polynomial P(x) is then given by the expression y0 · h(1)0(x) + y1 · h(1)1(x) + y′0 · h(2)0(x) + y′1 · h(2)1(x), and we arrive at the same expression as above:

P(x) = 0 · h(1)0(x) + 3 · h(1)1(x) + 1 · h(2)0(x) + 3 · h(2)1(x) = −2x³ + 10x² − 13x + 5.

In fact, one could avoid computing h(1)0 explicitly, since y0 = 0 by assumption.
Notice also that our final computation can be verified quickly in Sage, via the block

h01(x)=(2*x-1)*(x-2)^2; h11(x)=(5-2*x)*(x-1)^2
h02(x)=(x-1)*(x-2)^2; h12(x)=(x-2)*(x-1)^2
y0=0; y1=3; dy0=1; dy1=3
P(x)=y0*h01(x)+y1*h11(x)+dy0*h02(x)+dy1*h12(x)
show(P(x).expand())

□

The next exercises on Hermite interpolation are helpful for a better understanding of the method, but also for improving your computational skills (we suggest treating these tasks with the help of Sage as well). One can find further problems related to Hermite interpolation in Section E, see 5.E.1.

5.A.13. Determine the Hermite interpolation polynomial Q satisfying Q(−1) = −9, Q(1) = −1, Q′(−1) = 10, Q′(1) = 2. ⃝

5.A.14. Replace the function f with a Hermite polynomial, having as initial data the following values of f:

xi      −1    1     2
f(xi)    4    −4    −8
f′(xi)   8    −8    11

⃝

Let us now focus on splines, a notion introduced in 5.1.9. In short, we look for (cubic) polynomials on intervals, prescribing the values and requesting the first and second derivatives to agree at the boundary points. Splines are very popular in visualization and image processing, where smooth interpolation is essential, e.g., for computational animation and image scaling. However, a formal description of the spline through given data can often be a tedious task. For instance, if we are given n ≥ 2 points and values at them, then we need to solve 4n − 4 linear equations. Fortunately, using computers we may establish algorithms which lead to much faster implementations. In order to illustrate this situation, below we proceed by combining the appropriate syntax of Sage. But let us first present a more basic example.

the value is 1. The derivative is

(h(1)i(x))′ = −a(ℓi(x))² + (1 − a(x − xi)) · 2ℓi(x)ℓ′i(x).

All the xj, j ≠ i, are roots of ℓi(x), thus they are also roots of this polynomial. Finally, at the point xi we want 0 = −a + 2ℓ′i(xi), so we choose a = 2ℓ′i(xi). The combinatorial check that 2ℓ′i(xi) = ℓ″(xi)/ℓ′(xi) is left to the reader. We summarize:

Hermite's 1st order interpolation polynomial

The fundamental Hermite polynomials are defined as follows:

h(1)i(x) = (1 − (ℓ″(xi)/ℓ′(xi))(x − xi)) (ℓi(x))²,
h(2)i(x) = (x − xi) (ℓi(x))²,

where ℓ(x) = \prod_{i=0}^{m}(x − x_i) and ℓi(x) are the elementary Lagrange polynomials. These polynomials satisfy

h(1)i(xj) = δij (i.e. 1 for i = j and 0 for i ≠ j),  (h(1)i)′(xj) = 0,
h(2)i(xj) = 0,  (h(2)i)′(xj) = δij.

The Hermite interpolation polynomial is given by the expression

f(x) = \sum_{i=0}^{m} ( y_i h^{(1)}_i(x) + y'_i h^{(2)}_i(x) ).

5.1.8. Examples of Hermite's polynomials. The simplest example is the one prescribing the value and the derivative at a single point. This determines the polynomial of degree one

f(x) = f(x0) + f′(x0)(x − x0),

which is exactly the equation of the straight line given by the value and slope at the point x0. When we set the values and the derivatives at two points, i.e. y0 = f(x0), y′0 = f′(x0), y1 = f(x1), y′1 = f′(x1) for two distinct points xi, we still obtain an easily computable problem. Consider the simple case x0 = 0, x1 = 1. Then the matrix of the system and its inverse are

A = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 \\ 3 & 2 & 1 & 0 \end{pmatrix},  A^{−1} = \begin{pmatrix} 2 & −2 & 1 & 1 \\ −3 & 3 & −2 & −1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}.

The multiplication A^{−1} · (y0, y1, y′0, y′1)^T gives the vector (a3, a2, a1, a0)^T of the coefficients of the polynomial f, i.e.

f(x) = (2y0 − 2y1 + y′0 + y′1)x³ + (−3y0 + 3y1 − 2y′0 − y′1)x² + y′0x + y0.
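This closed formula is easy to check symbolically. The following short block is our own sketch (not from the text): it solves the four interpolation conditions for x0 = 0, x1 = 1 and reproduces the displayed coefficients.

# Solve for the cubic with prescribed values y0, y1 and derivatives dy0, dy1
# at the points 0 and 1, keeping all data symbolic.
var("y0 y1 dy0 dy1 a0 a1 a2 a3 x")
f = a3*x^3 + a2*x^2 + a1*x + a0
df = f.diff(x)
sol = solve([f.subs(x=0) == y0, f.subs(x=1) == y1,
             df.subs(x=0) == dy0, df.subs(x=1) == dy1],
            a0, a1, a2, a3)
show(sol)
# a3 = 2*y0 - 2*y1 + dy0 + dy1, a2 = -3*y0 + 3*y1 - 2*dy0 - dy1,
# a1 = dy0, a0 = y0 -- exactly the coefficients in the formula above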
5.A.15. Splines. Find a natural cubic spline S satisfying S(−1) = 0, S(0) = 1, S(1) = 0, and present its graph for x in the closed interval [−1, 1].

Solution. We have three nodes and the two intervals [x0, x1] = [−1, 0] and [x1, x2] = [0, 1]; that is, n = 2 in the terms of paragraph 5.1.9. Thus the spline that we are looking for consists of two cubic polynomials, say S1(x) = ax³ + bx² + cx + d and S2(x) = ex³ + fx² + gx + h, with domains the intervals [−1, 0] and [0, 1], respectively, and a, b, . . . , h ∈ R. Applying the definition of cubic splines for n = 2, one obtains six linear equations in terms of a, b, . . . , h:

S1(−1) = 0,  S1(0) = 1,  S2(0) = 1,  S2(1) = 0,  S′1(0) = S′2(0),  S″1(0) = S″2(0).

In addition, S must be a natural spline, which corresponds to the vanishing of the second derivatives of S1 and S2 at the points −1 and 1, respectively:

S″1(−1) = 0  and  S″2(1) = 0.

Hence we arrive at a system of eight equations, which in fact we can reduce in the following way. Thanks to the given value at 0, we know that the absolute coefficients of both polynomials equal 1, i.e. d = h = 1. The resulting spline has to be symmetric with respect to the y axis: otherwise, the reflection along this axis would yield a second spline satisfying the same conditions, contradicting the fact that the natural cubic spline is unique. Thus the only possibility for the common value of the first derivatives of S1 and S2 at zero is zero (c = g = 0); further, the second derivatives at zero have to agree, that is b = f, and the symmetry also gives e = −a. So far, we have S1(x) = ax³ + bx² + 1 and S2(x) = −ax³ + bx² + 1. Now the conditions S1(−1) = 0 and S″1(−1) = 0 correspond to the equations −a + b + 1 = 0 and −6a + 2b = 0, respectively. Solving this system, we obtain

S1(x) = −(1/2)x³ − (3/2)x² + 1,  S2(x) = (1/2)x³ − (3/2)x² + 1.

Altogether,

S(x) = −(1/2)x³ − (3/2)x² + 1 if x ∈ [−1, 0],  S(x) = (1/2)x³ − (3/2)x² + 1 if x ∈ [0, 1].

For the graph of the spline via Sage, we obtain the figure presented below (we used blue colour for S1 and green for S2). □
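Instead of the symmetry reduction, one can also let Sage handle all eight conditions directly. The following block is our own sketch (the variable names p0, . . . , q3 are our choice):

# Direct check of 5.A.15: solve the eight spline conditions at once.
var("p0 p1 p2 p3 q0 q1 q2 q3 x")
S1 = p3*x^3 + p2*x^2 + p1*x + p0
S2 = q3*x^3 + q2*x^2 + q1*x + q0
eqs = [S1.subs(x=-1) == 0, S1.subs(x=0) == 1,
       S2.subs(x=0) == 1, S2.subs(x=1) == 0,
       S1.diff(x).subs(x=0) == S2.diff(x).subs(x=0),
       S1.diff(x, 2).subs(x=0) == S2.diff(x, 2).subs(x=0),
       S1.diff(x, 2).subs(x=-1) == 0,   # natural boundary at -1
       S2.diff(x, 2).subs(x=1) == 0]    # natural boundary at +1
show(solve(eqs, p0, p1, p2, p3, q0, q1, q2, q3))
# p3 = -1/2, p2 = -3/2, p1 = 0, p0 = 1, q3 = 1/2, ... -- as computed above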
5.1.9. Spline interpolation. We can prescribe any finite number of derivatives at the particular points, and a suitable choice of the upper bound on the degree of the desired polynomial leads to a unique interpolation. Unfortunately, these interpolations do not solve the problems mentioned already, namely the complexity of the computations and the instability. However, a smarter usage of derivatives allows an improvement. As we have seen, small local changes of the values may dramatically affect the overall behaviour of the resulting polynomial, in particular outside of the interval covered by the points xi. So we try gluing together polynomial pieces of low degree.

The simplest possibility is to interpolate each pair of adjacent points by a polynomial of degree at most one. This corresponds either to the interpolation by values at two points, or to guessing the slope and employing Hermite's first order interpolation at a single point. This is a common way of displaying data. It means that the derivative will be constant on each of the segments, with a discontinuous 'jump' at the given points, and there is no freedom for improvements.

A more sophisticated method is to prescribe the value and the derivative at each point. We then have four values for each pair of neighbouring points. As seen earlier, this uniquely determines Hermite's polynomial of degree three. This polynomial can then be used for all the values of the input variable between the two distinguished points x0 < x1. Such a piece-wise polynomial approximation has the property that the first derivatives are compatible (equal) at the meeting points xi.

In practice, mere compatibility of the first derivatives is often insufficient. Consider, for instance, railway tracks, where the second derivative corresponds to acceleration. Discontinuous jumps in the second derivative would be very undesirable. So, instead of requiring fixed values of the first derivatives, we request the equality of both the first and second derivatives of the adjacent cubic pieces, as well as fixing the values at the given points. This requirement yields the same number of equations and unknowns, and so the problem is solvable similarly to the 1st order Hermite interpolation problem:

Cubic splines

Let x0 < x1 < · · · < xn be real values at which the required values y0, . . . , yn are given. A cubic interpolation spline for this assignment is a function S : R → R which satisfies the following conditions:
• the restrictions of S to the intervals [xi−1, xi] are polynomials Si of degree at most three, for all i = 1, . . . , n;
• Si(xi−1) = yi−1 and Si(xi) = yi for all i = 1, . . . , n;
• S′i(xi) = S′i+1(xi) for all i = 1, . . . , n − 1;
• S″i(xi) = S″i+1(xi) for all i = 1, . . . , n − 1.

5.A.16. Splines via Sage. Splines can be treated in Sage via the command spline. For instance, a graphical implementation of the previous example in Sage relies on the given interpolated points and goes as follows:

pts=[(-1, 0), (0, 1), (1, 0)]; S=spline(pts)
a=plot(S, -1, 1, color="darkgray", figsize=4)
show(points(pts)+a)

Verify yourself that for the spline in question this syntax gives the same graph (so this provides a graphical verification of our previous computation). Changing the points of the spline causes, of course, the spline to be recomputed. Notice also that this syntax does not return the explicit form of S as a cubic polynomial. However, it allows us to compute the values of the spline, e.g., by adding the syntax

show(S(0.4)); show(S(0.5)); show(S(0.6))

This returns the values of S at the chosen points, and one can easily verify that these values coincide with those obtained by applying the solution given in 5.A.15 at the same points. Notice, however, that we cannot compute the value of the spline at points which do not lie between the given interpolated points.

5.A.17. Find a (cubic) spline S which satisfies S(−1) = 0, S(0) = 1, S(1) = 0, S′(−1) = −1, S′(1) = 1. Hint: Apply the same trick as in 5.A.15. ⃝

5.A.18. Interpolate the function f(x) = e^{x²} − e on the interval [−6/5, 6/5] by the (unique) natural cubic spline S corresponding to the partition x0 = −1.2, x1 = −1, x2 = 0, x3 = 1, x4 = 1.2. Next, using Sage, plot the functions f and S (together with the interpolated nodes), and verify your result based on the "spline" method described in 5.A.16.

Solution. By assumption we have five nodes, x0 = −1.2, x1 = −1, x2 = 0, x3 = 1, x4 = 1.2, and so four intervals [x0, x1], [x1, x2], [x2, x3], [x3, x4]. Thus, in terms of the definition of cubic splines given in paragraph 5.1.9, we have

The cubic spline³ for n + 1 points consists of n cubic polynomials. There are 4n free parameters (by the first condition from the definition). The other conditions then yield 2n + (n − 1) + (n − 1) more equalities. Two parameters remain free.
The values of the derivatives at the marginal points may be prescribed explicitly (the complete spline), or the second derivatives there can be set to zero (the natural spline). Unfortunately, the computation of the whole spline is not as easy as the independent computations of Hermite's cubic polynomials, because the data mingles between adjacent intervals. However, ordering the variables and equations properly gives a matrix of the system such that all of its nonzero elements appear only on three diagonals. Such matrices are nice enough to be solved in a time proportional to the number of points, using a suitable numerical method. The results are stunning. For comparison, look at the interpolation of the same data as in the case of the Lagrange polynomial, now using splines. The spline is the solid line; the interpolated function is again the dotted line. Although the diagrams look nearly identical, the data is different.

2. Real numbers and limit processes

Polynomials and splines do not supply a sufficiently large stock of functions to express many dependencies. Actually, the first problem to solve is how to define the values of more general functions at all. In principle, all we can get with a finite number of multiplications and additions is polynomial functions. Perhaps division by polynomial quantities, and some efficient manipulation with rational numbers, can be added. However, we cannot restrict ourselves to rational numbers. For instance, √2 is not a rational number. Thus the first step is a thorough introduction to limit processes. We define precisely what it means for a sequence of numbers to approach another number.

³The name comes from the name of an elastic ruler used by engineers to draw smooth curves through interpolation points. In fact, the requirement on the equality of the first and second derivatives is a good model for natural elasticity behaviour.

n = 4. This means that the spline S(x) has the form

S1(x) = a3x³ + a2x² + a1x + a0,  x ∈ [−6/5, −1],
S2(x) = b3x³ + b2x² + b1x + b0,  x ∈ [−1, 0],
S3(x) = c3x³ + c2x² + c1x + c0,  x ∈ [0, 1],
S4(x) = d3x³ + d2x² + d1x + d0,  x ∈ [1, 6/5],

with ai, bi, ci, di ∈ R for all i = 0, 1, 2, 3. In our terms, we also have y0 = f(x0) ≈ 1.502, y1 = f(x1) = 0, y2 = f(x2) ≈ −1.718, y3 = f(x3) = 0, y4 = f(x4) ≈ 1.502. Let us summarize the sixteen conditions that determine S:

S1(x0) = y0 ⇔ −(6/5)³a3 + (6/5)²a2 − (6/5)a1 + a0 = 1.502,
S1(x1) = y1 ⇔ −a3 + a2 − a1 + a0 = 0,
S2(x1) = y1 ⇔ −b3 + b2 − b1 + b0 = 0,
S2(x2) = y2 ⇔ b0 = −1.718,
S3(x2) = y2 ⇔ c0 = −1.718,
S3(x3) = y3 ⇔ c3 + c2 + c1 + c0 = 0,
S4(x3) = y3 ⇔ d3 + d2 + d1 + d0 = 0,
S4(x4) = y4 ⇔ (6/5)³d3 + (6/5)²d2 + (6/5)d1 + d0 = 1.502,
S′1(x1) = S′2(x1) ⇔ 3a3 − 2a2 + a1 = 3b3 − 2b2 + b1,
S′2(x2) = S′3(x2) ⇔ b1 = c1,
S′3(x3) = S′4(x3) ⇔ 3c3 + 2c2 + c1 = 3d3 + 2d2 + d1,
S″1(x1) = S″2(x1) ⇔ −6a3 + 2a2 = −6b3 + 2b2,
S″2(x2) = S″3(x2) ⇔ b2 = c2,
S″3(x3) = S″4(x3) ⇔ 6c3 + 2c2 = 6d3 + 2d2,
S″1(x0) = 0 ⇔ −(36/5)a3 + 2a2 = 0,
S″4(x4) = 0 ⇔ (36/5)d3 + 2d2 = 0.

Solving this system of linear equations in Sage, we get the unique solution

S1(x) = (4933/380)x³ + (44397/950)x² + (228243/4750)x + 135841/9500,
S2(x) = −(28837/9500)x³ − (3129/2375)x² − 859/500,
S3(x) = (28837/9500)x³ − (3129/2375)x² − 859/500,
S4(x) = −(4933/380)x³ + (44397/950)x² − (228243/4750)x + 135841/9500.

Hence we obtain the plot of S(x) on [−1.2, 1.2] shown below.
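The text does not display the Sage code used for the step "Solving this system of linear equations in Sage". One possible block (our own sketch; the decimal values 1.502 and −1.718 are taken as the exact rationals 751/500 and −859/500, which matches the printed coefficients) is:

# Our sketch for solving the sixteen spline conditions of 5.A.18 exactly.
var("a3 a2 a1 a0 b3 b2 b1 b0 c3 c2 c1 c0 d3 d2 d1 d0 x")
S1 = a3*x^3 + a2*x^2 + a1*x + a0; S2 = b3*x^3 + b2*x^2 + b1*x + b0
S3 = c3*x^3 + c2*x^2 + c1*x + c0; S4 = d3*x^3 + d2*x^2 + d1*x + d0
y0, y1, y2 = 751/500, 0, -859/500   # f(-6/5), f(-1), f(0) as rationals
eqs = [S1.subs(x=-6/5) == y0, S1.subs(x=-1) == y1,
       S2.subs(x=-1) == y1, S2.subs(x=0) == y2,
       S3.subs(x=0) == y2, S3.subs(x=1) == y1,
       S4.subs(x=1) == y1, S4.subs(x=6/5) == y0,
       S1.diff(x).subs(x=-1) == S2.diff(x).subs(x=-1),
       S2.diff(x).subs(x=0) == S3.diff(x).subs(x=0),
       S3.diff(x).subs(x=1) == S4.diff(x).subs(x=1),
       S1.diff(x, 2).subs(x=-1) == S2.diff(x, 2).subs(x=-1),
       S2.diff(x, 2).subs(x=0) == S3.diff(x, 2).subs(x=0),
       S3.diff(x, 2).subs(x=1) == S4.diff(x, 2).subs(x=1),
       S1.diff(x, 2).subs(x=-6/5) == 0, S4.diff(x, 2).subs(x=6/5) == 0]
show(solve(eqs, a3, a2, a1, a0, b3, b2, b1, b0,
           c3, c2, c1, c0, d3, d2, d1, d0))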
An important property of polynomials is the "continuous" dependency of their values on the input variable. Intuitively, we expect that if x is changed by a little, then the value of f(x) also changes only a little. This behaviour is not possessed by piece-wise constant functions f : R → R near the "jump discontinuities". For instance, the Heaviside function⁴

f(x) = 0 for all x < 0,  f(0) = 1/2,  f(x) = 1 for all x > 0,

has this type of "discontinuity" at x = 0. We now formalize these intuitive statements.

5.2.1. Real numbers. We have dealt with the algebraic properties of real numbers, summarized by the claim that R is a field. However, we have also used the relation of the standard (total) ordering of the real numbers, denoted by "≤"; see paragraph 1.6.3 on page 43. The properties (axioms) of the real numbers, including the connections between the relations and the other operations, are enumerated in the following table. The bars indicate how the axioms guarantee that the real numbers form an abelian (commutative) group with respect to addition, that R \ {0} is an abelian group with respect to multiplication, that R is a field, and that the set R together with the operations +, · and the order relation is an ordered field. The last axiom can be considered as claiming that R is "sufficiently dense".

⁴Oliver Heaviside (1850-1925) was an unconventional English electrical engineer with an innovative and very original approach to practical mathematical modelling. His famous sayings include "Mathematics is an experimental science, and definitions do not come first, but later on", or, defending his incomplete argumentation, "I do not refuse my dinner simply because I do not understand the process of digestion". Is this suggestive of the methodology of this textbook?

Using the command spline, one can obtain a graphical verification, as before. Although the code is given below, we leave it to the reader to check the result:

f(x)=e^(x^2)-e
plf=plot(f, x, -1.2, 1.2, color="black")
pts=[(-1.2, f(-1.2)), (-1, f(-1)), (0, f(0)),
(1, f(1)), (1.2, f(1.2))]
S=spline(pts)
sps=plot(S, -1.2, 1.2, color="steelblue")
show(points(pts, size=40)+sps+plf)

□

5.A.19. Without calculation, construct the natural cubic interpolation spline for the points x0 = −1, x1 = 0 and x2 = 2, with the value y0 = y1 = y2 = 1 at these points.

Solution. The natural spline requires the second derivative to vanish at the outer boundary points. Thus the constant spline S1(x) ≡ 1, S2(x) ≡ 1 satisfies all the conditions, and so it must be the unique solution. □

5.A.20. Construct the natural cubic interpolation spline for the function f(x) = 1/(1 + x²), selecting the nodes x0 = 0, x1 = 1, and x2 = 3. ⃝

Additional problems concerning polynomial interpolation can be found in the final section E; see also Chapter 6.

B. Real numbers and limit processes

In this section we treat limits of sequences and of functions. We will also discuss some basic topological notions about subsets of the real line R or the complex plane C. In this way we establish the foundations of calculus, and also become familiar with elementary ideas from mathematical analysis, which we will revisit later, in Chapter 7. Functions and limits are among the most fundamental concepts in mathematics, and they are ideas that came of age in the 17th century.³ The French mathematician Augustin-Louis

³The French mathematician Pierre de Fermat (1607-1665) was one of the first to realize the importance of limits.
When Newton and Leibniz later invented the calculus, they did not use "delta-epsilon" proofs, and it took more than a century to develop them. From this perspective, it should be no wonder that modern students often find an introduction to calculus difficult. Hence you should not worry much whenever some of the notions that we discuss below seem difficult. Mastering limits takes a little while; once it is mastered, however, it becomes easier to understand differentiation and integration.

Axioms of the real numbers

(R1) (a + b) + c = a + (b + c), for all a, b, c ∈ R;
(R2) a + b = b + a, for all a, b ∈ R;
(R3) there is an element 0 ∈ R such that a + 0 = a for all a ∈ R;
(R4) for all a ∈ R, there is an additive inverse (−a) ∈ R such that a + (−a) = 0;
(R5) (a · b) · c = a · (b · c), for all a, b, c ∈ R;
(R6) a · b = b · a, for all a, b ∈ R;
(R7) there is an element 1 ∈ R, 1 ≠ 0, such that 1 · a = a for all a ∈ R;
(R8) for all a ∈ R, a ≠ 0, there is a multiplicative inverse a⁻¹ ∈ R such that a · a⁻¹ = 1;
(R9) a · (b + c) = a · b + a · c, for all a, b, c ∈ R;
(R10) the relation ≤ is a total order, i.e. reflexive, antisymmetric, transitive, and total on R;
(R11) for all a, b, c ∈ R, a ≤ b implies a + c ≤ b + c;
(R12) for all a, b ∈ R, a > 0 and b > 0 implies a · b > 0;
(R13) every non-empty set A ⊂ R which has an upper bound has a least upper bound.

The concept of a least upper bound from axiom (R13), also called the supremum, is very important. It makes sense for any partially ordered set, i.e. a set with a (not necessarily total) ordering relation. Recall that an ordering relation is a binary relation on a set which is reflexive, antisymmetric, and transitive; see paragraph 1.6.3.

Supremum and infimum

Consider a subset A ⊂ B in a partially ordered set B. An upper bound of the set A is any element b ∈ B such that b ≥ a for all a ∈ A. Similarly, a lower bound of the set A is an element b ∈ B such that b ≤ a for all a ∈ A. The least upper bound of the set A, if it exists, is called its supremum and is denoted by sup A. Similarly, the greatest lower bound, if it exists, is called the infimum and is denoted by inf A.

Thus the last axiom (R13) from the table of properties of the real numbers can be reformulated as follows: every non-empty bounded set A of real numbers has a supremum. This means that if there is a number a which is larger than or equal to all numbers x ∈ A, then there is a smallest number with this property. For instance, the choice A = {x ∈ Q, x² < 2} gives the supremum sup A, which is called √2, the square root of two. An immediate consequence of this axiom is the existence of the infimum for any non-empty set of real numbers bounded from below. Observe that changing the sign of all the numbers in a set interchanges suprema and infima. For the formal construction, it is necessary to know whether or not there is a set R with the operations and ordering relation which satisfies the thirteen axioms. So far,

Cauchy (1789-1867) was probably the first to put the calculus on a rigorous basis. To be more specific, Cauchy introduced the "delta-epsilon" notation commonly encountered in the definition of limits. While such proofs are presented here, we have chosen to de-emphasize them after a certain point. This is because many calculus problems can be approached in various alternative ways. Additionally, this approach allows us to conserve space, which is then utilized to implement solutions using Sage.

Our initial tasks involve the concepts of the supremum (sup A) and the infimum (inf A) of a subset A of real or complex numbers, which are elementary notions discussed in 5.2.1. These represent the smallest upper bound and the largest lower bound of A, respectively. They always exist when A is a bounded subset, but they do not necessarily belong to A, as they can be either limit points or isolated points of A; see 5.2.5 and 5.2.7.
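Before turning to the exercises, it may help to experiment numerically. The following little block is our own sketch (the sample set {1/n : n ∈ Z⁺} is an arbitrary choice, not one of the exercises): tabulating finitely many elements suggests candidates for sup and inf, although of course the proofs must still be given by hand.

# Numerical experiments suggest sup and inf; here A = {1/n : n in Z+}.
sample = [1/n for n in range(1, 1001)]
print(max(sample), min(sample))   # 1 and 1/1000; inf A = 0 is not attained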
5.B.1. Find sup A and inf A, if they exist, for

A = \left\{ \frac{n + (−1)^n}{n} : n ∈ \mathbb{N}^* \right\} ⊂ R,

where, as usual, we set N* = N \ {0} = Z⁺. ⃝

5.B.2. Determine the infima and suprema of the sets

B = \left\{ \frac{(−1)^n}{n^2} : n ∈ \mathbb{Z}^+ \right\},  C = (−9, 9) ∩ Q,  X = \left\{ −\frac{1}{n} : n ∈ \mathbb{Z}^+ \right\},  Y = (0, 2] ∪ [3, 5] \ {4},

and decide whether they belong to these sets or not. ⃝

Sequences are functions defined on the set of natural numbers. Next, we are interested in the behaviour of such functions as n tends to infinity; see also the discussion in 5.2.3, 5.2.9 and 5.2.10. To begin with, recall that to any subset A of the real or complex numbers we can assign the notion of the "limit points" of A. These are points x for which each of their ε-neighbourhoods Oε(x) = {z : |z − x| < ε} contains at least one point of A other than x. Be aware that limit points may not belong to A. Having in hand this definition of limit points, we can define a "convergent sequence" (xn) of real or complex numbers as a sequence approaching its only limit point x as n tends to infinity; see 5.2.3. In this case we say that (xn) converges to x (or tends to x), and we write xn → x, or

\lim_{n→∞} x_n = x.

When the limit does not exist as a real or complex number, or is ±∞, the sequence (xn) is called divergent; see also below. Notice that in our definition of limit points the neighbourhood Oε(x) is either an open interval in R or an open disc in C. The related topological notion of open sets is introduced in 5.2.7, see also 5.2.5.

only the rational numbers have been constructed formally. These form an ordered field; that is, Q satisfies the axioms (R1)–(R12), as can easily be verified. We do not go into the details of a consistent construction of the real numbers now. We will be satisfied with an intuitive idea of the real line, and we will work with the axioms (R1) through (R13). But we shall come back to this issue in a more general framework in Chapter 7; see paragraph 5.2.4 and the discussion started in 7.3.6. Actually, we shall see that if the real numbers can be constructed, then the construction is unique up to isomorphism, i.e. up to a bijection preserving all the algebraic structures of two different realizations of the field R.

5.2.2. The complex plane. Recall that the complex numbers are given as pairs of real numbers. We usually write them as z = Re z + i Im z. Therefore, the plane C = R² is an appropriate image of the complex numbers. With addition and multiplication, the complex numbers satisfy the axioms (R1)–(R9) and thus form a field. There is, however, no natural ordering defined on them which would satisfy the axioms (R10)–(R13). Nevertheless, we work with them, since extending real scalars to the complex numbers is highly advantageous and sometimes even necessary. There is an important operation on the complex numbers called complex conjugation. It is the reflection symmetry with respect to the line of real numbers. We denote it by a bar over the number z ∈ C:

\bar{z} = Re z − i Im z.

It changes the sign of the imaginary part.
Since for z = x + iy,
z · ¯z = (x + iy)(x − iy) = x^2 + y^2 ,
this value expresses the squared distance of the complex number z from the origin. The square root of this non-negative real number is called the absolute value of the complex number z; written

(1) |z|^2 = z · ¯z.

The absolute value can be defined on any ordered field of scalars K. Define the absolute value |a| as follows:
|a| = a if a ≥ 0, and |a| = −a if a < 0.
For any numbers a, b ∈ K,

(2) |a + b| ≤ |a| + |b|.

This property is called the triangle inequality. It holds also for the absolute value of the complex numbers. For the fields of rational numbers or real numbers, both of which are subfields of the complex numbers, both definitions of the absolute value coincide. The absolute value must be understood in the context of whichever field K of rational, real, or complex numbers is involved. The triangle inequality holds in all these cases.

To make the description in this column more functional, we decided to present problems related to such (and other) topological notions a little later, after mastering the notion of limits.

5.B.3. Convergence and divergence. Based on the formal definition of the convergence or the divergence of a sequence, prove the following:
(a) an = 1/n → 0, as n → +∞,
(b) an = 1/n^2 → 0, as n → +∞,
(c) an = n/(n+1) → 1, as n → +∞,
(d) if 0 < x < 1 then an = x^n → 0, as n → +∞,
(e) if x > 1 then an = x^n → +∞, as n → +∞,
(f) an = n^2 → +∞, as n → +∞.

Solution. (a) Let ε > 0 be given. We have to find some positive integer N such that |1/n − 0| < ε for all n ≥ N. Clearly, the condition |1/n − 0| < ε is equivalent to 1/n < ε, i.e., n > 1/ε. Hence taking N ∈ Z+ with N > 1/ε we see that 1/n ≤ 1/N < ε, so that |1/n − 0| < ε for all n ≥ N. This means that 1/n → 0.
(b) For any ε > 0 we see that 1/n^2 < ε is equivalent to n^2 > 1/ε, that is, n > 1/√ε. Thus, taking N ∈ Z+ with N > 1/√ε we have 1/n^2 < ε, for all n ≥ N. It follows that 1/n^2 → 0.
(c) Again, for given ε > 0 we wish to find some N ∈ Z+ such that |n/(n+1) − 1| < ε, provided that n ≥ N. Notice that the condition |n/(n+1) − 1| < ε is equivalent to 1/(n+1) < ε, i.e., 1/ε < n + 1. Hence, choosing N with N > 1/ε − 1 we see that the inequality n ≥ N implies n + 1 > 1/ε, and hence |n/(n+1) − 1| < ε. Thus n/(n+1) → 1.
(d) We have 0 < x < 1 and thus 1/x > 1. Hence 1/x = 1 + h, for some h > 0, and by applying the binomial rule (a + b)^n = ∑_(k=0)^n (n over k) a^(n−k) b^k to the power (1/x)^n we get
1/x^n = (1 + h)^n = (1 + nh + · · · + h^n) > 1 + nh .
Thus 0 < x^n < 1/(1 + nh). Since 1/(1 + nh) → 0 we will also have x^n → 0, as n → ∞. For instance, given some ε > 0 we see that 0 < x^n < ε is true whenever 1/(1 + nh) < ε. Hence for n ≥ N, with N being the least positive integer larger than (1/ε − 1)/h, we are done.
(e) Let us describe a proof that again is based on the binomial theorem. Since x > 1 we may write x = 1 + h for some h > 0, and by the binomial rule we deduce that x^n ≥ nh for all n. Since nh → ∞ it follows that x^n → ∞ as well. In fact, in this case we say that the sequence (x^n) "diverges to infinity". It is useful to state this as a general rule: for a sequence (an), saying that an → +∞ means that for any given ζ > 0 there exists some N ∈ Z+ such that an > ζ for all n ≥ N. Similarly, saying that a sequence (an) diverges to −∞ means that for any given ζ > 0 there exists N ∈ Z+ such that an < −ζ for all n ≥ N. Let us finally consider case (f). Given ζ > 0, the condition n^2 > ζ is satisfied if n > √ζ. Thus, for N > √ζ we see that n^2 > ζ for all n ≥ N. This implies that n^2 → +∞. □

5.2.3. Convergence of a sequence. We wish to formalize the notion of a sequence of numbers approaching a limit.
The key object of interest is a sequence of numbers ai, where the index i usually goes through all the natural numbers. Denote the sequences loosely either as a0, a1, . . ., or as infinite vectors (a0, a1, . . .), or as (ai)∞i=1.

Cauchy sequences
Consider a sequence (a0, a1, . . .) of elements of K such that for any fixed positive number ε > 0,
|ai − aj| < ε
for all but finitely many terms ai, aj of the sequence. In other words, for any fixed ε > 0, there is an index N such that the above inequality holds for all i, j > N. Loosely put, the elements of the sequence are eventually arbitrarily close to each other. Such a sequence is called a Cauchy sequence.

Intuitively, either all but finitely many of the sequence's terms are equal (then |ai − aj| = 0 from some index N on), or they "approach" some particular value. This is easily imaginable in the complex plane. Choose an arbitrarily small disc (with radius equal to ε). Suppose a Cauchy sequence is given. It must be possible to put the disc into the complex plane in such a way that it covers all but finitely many of the elements of the infinite sequence ai. Imagine that such discs have very small radii, and all contain a number a; see the diagram. If such a value a ∈ K exists for a Cauchy sequence, we would expect the sequence to have the property of convergence:

Convergent sequences
A sequence (ai)∞i=0 converges to a value a if and only if for any positive real number ε, |ai − a| < ε for all but finitely many indices i.

Notice that the set of those i for which the inequality does not hold may depend on ε. The number a is called the limit of the sequence (ai)∞i=0, and we write lim_{i→∞} ai = a.

If a sequence ai ∈ K, i = 0, 1, . . ., converges to a ∈ K, then for any fixed positive ε, |ai − a| < ε for all i greater than

5 Augustin-Louis Cauchy (1789-1857) was a French mathematician pioneering a rigorous approach to infinitesimal analysis. He was very productive, writing about 800 research articles. There are dozens of concepts and theorems named after him.

5.B.4. Use Sage to plot the first 30 terms of the sequence (an) given in Problem 5.B.3, case (c).

Solution. Sage provides many alternatives for plotting a sequence (an). Here we utilize a method based on the object Graphics(), which is employed when initializing a for loop over various graphics objects (further techniques will be analyzed in the sequel). Therefore, to plot the first 30 terms of the sequence (n/(n+1)), n ∈ Z+, type

p = Graphics()
for n in srange(1, 30+1):
    p = p + points((n, n/(n+1)), color="black")
show(p)

This produces the figure shown here (the plot of the first 30 terms). □

For many of the solutions presented in 5.B.3, we have repeatedly used the fact that for every real number x there exists a natural number n ∈ N∗ such that n > x. This is the so-called Archimedean property of R, and it will be used also below without mentioning it explicitly. In fact, it is not hard to prove that the Archimedean property is equivalent to saying that for each positive real x there exists n ∈ N∗ such that 1/n < x.

5.B.5. Infinite limits. Show that if an → +∞ and bn → +∞, then an · bn → +∞. Also, assuming an → a > 0 and bn → +∞, show that an · bn → +∞. What is the situation for a < 0 and for a = 0?

Solution. (Hints) First, note that in this context the rule of the product of limits must be carefully extended (see 5.2.13 for the product rule).
This is because at least one of the limits in the given statements is infinite, refer to the discussion in 5.2.14 for further details. To prove the statement we will apply the definition given in case (e) in 5.B.3. We will describe the proof for the second statement only, and the proof of the first is left for practice. CHAPTER 5. ESTABLISHING THE ZOO a certain N ∈ N. By the triangle inequality, |ai − aj| = |ai − a + a − aj| < |ai − a| + |a − aj| < 2ε. for all pairs of indices i, j ≥ N. Thus: Lemma. Every convergent sequence is a Cauchy sequence. However, in the field of rational numbers, it can happen that for a Cauchy sequence a corresponding value a does not exist. For instance, the number √ 2 can be approached by a sequence of rational numbers ai, thereby obtaining a sequence converging to √ 2, but the limit is not rational. Ordered fields of scalars in which every Cauchy sequence converges are called complete. The following theorem states that the axiom (R13) guarantees that the real numbers are such a field: Theorem. Every Cauchy sequence of real numbers ai converges to a real number a ∈ R. Proof. The terms of any Cauchy sequence form a bounded set since any choice of ε bounds all but finitely many of them. Let B be the set of those real numbers x for which x < aj for all but finitely many terms aj of the sequence. B has an upper bound, and thus B has a supremum as well, by (R13). Define a = sup B. Fix ε > 0, and choose N so that |ai − aj| < ε for all i, j ≥ N. Then aj > aN − ε and aj < aN + ε for all indices j > N, and so aN − ε belongs to B, while aN + ε does not. Hence |a − aN | ≤ ε, and thus |a − aj| ≤ |a − aN | + |aN − aj| ≤ 2ε for all j > N. So a is the limit of the given sequence. □ Corollary. Every Cauchy sequence of complex numbers zi converges to a complex number z. Proof. Write zi = ai+i bi. Since |ai−aj|2 ≤ |zi−zj|2 and similarly for the values bi, both sequences of real numbers ai and bi are Cauchy sequences. They converge to a and b, respectively. It is easily verified that z = a + i b is the limit of the sequence zi. □ 5.2.4. Remark. The previous discussion proposes a construction method for the real numbers. Proceed similarly to building the integers from the natural numbers (adding in all additive inverses). Build the rational numbers from the integers (adding all multiplicative inverses of non-zero numbers). Then “complete” the rational numbers by adding in all limits of Cauchy sequences. Cauchy sequences (ai)∞ i=0 and (bi)∞ i=0 of rational numbers are equivalent if and only if the distances |ai − bi| converge to zero. This is the same as the condition that merging these sequences into a single sequence also yields a Cauchy sequence. For example, a sequence can be formed by selecting alternately terms from the first sequence and the second sequence. Check the properties of the equivalence relations. Clearly the relation is reflexive, it is symmetric (since the distance of the rational numbers is symmetric in its arguments) 384 By assumption, limn→+∞ an = a > 0, so choosing ε = a/2 > 0 we can find N ∈ Z+ such that |an − a| < a 2 for all n ≥ N. This implies that a 2 < an < 3a 2 , for all n ≥ N, and so an > a 2 for n ≥ N. Also, choosing any ζ > 0 there exists M ∈ Z+ such that bn > ζ for all n ≥ M. Hence, for n ≥ L := max{M, N} both conditions hold and in particular an · bn > a 2 · ζ > 0. Taking ζ = 2˜ζ a for some ˜ζ > 0 we get an · bn > ˜ζ for all n ≥ L. Thus an · bn → +∞, and a·(+∞) = +∞ for a > 0. For a < 0 we have an ·bn → −∞, and thus a·(+∞) = −∞, for a < 0. 
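These two conclusions can be sanity-checked in Sage on concrete sequences of our own choosing, say an = 3 + 1/n → 3 > 0 and an = −3 + 1/n → −3 < 0, multiplied by bn = n^2 → +∞:

n = var("n")
print(lim((3 + 1/n)*n^2, n=oo))    # a > 0: the product diverges to +Infinity
print(lim((-3 + 1/n)*n^2, n=oo))   # a < 0: the product diverges to -Infinity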
However, the case a = 0 will produce a so-called indeterminate form, namely 0 · (+∞), which is excluded. □ 5.B.6. Using the results from 5.B.3 explain why: (a) The sequence ( an = 1 n + 3n ) diverges to +∞. (b) The sequence (bn = n2 + 2n ) diverges to +∞. (c) The sequence (cn = n2 · 3n ) diverges to +∞. (d) The sequence ( dn = n·4n n+1 ) diverges to +∞. ⃝ 5.B.7. (a) Prove that (an = 1 2n ) tends to zero as n → ∞. (b) Prove that (bn = (−1)n+1 ) is divergent. (c) Prove that (cn = sin(nπ/2)) is divergent. (d) Find sequences (xn) and (yn) with infinite limits, such that lim n→∞ (xn + yn) = 1 and lim n→∞ ( xny2 n ) = +∞. ⃝ 5.B.8. Show that lim n→∞ n √ n = 1 (notice n √ n = n 1 n ). Then present a proof via Sage. Solution. Apparently, for all naturals n ≥ 1 we have n √ n ≥ 1. So we can set n √ n = 1+an for certain numbers an ≥ 0. Now, by the binomial theorem we see that n = (1 + an) n = 1 + ( n 1 ) an + ( n 2 ) a2 n + · · · + an n , for all natural numbers n ≥ 2. Hence we have the bound n ≥ ( n 2 ) a2 n = n (n − 1) 2 a2 n, for all naturals n ≥ 2, which leads to 0 ≤ an ≤ √ 2 n−1 for all such n. Having established these inequalities, we can now use the squeeze theorem (see 5.2.12). This implies that limn→∞ an = 0, and hence limn→∞ n √ n = limn→∞ (1 + an) = 1. One should finally mention that it possible to prove the result by means of the so called “l’Hopital’s rule”, introduced in 5.3.10. However, such a procedure is based on the notion of derivatives and hence we will return to this later, see 5.E.109. For computing limits in Sage we use the command limit or its alias lim. We also recall that the mathematical symbol ∞ is represented either by oo or by Infinity (or infinity). Let us now illustrate the implementation of these rules via our example. So, an appropriate cell has the form n=var("n");a(n)=n**(1/n);lim(a(n), n=oo) Or, as we said, one could replace the last command by lim(a(n), n = Infinity). Notice that first one should introduce the variable n, as a symbolic variable. This is because CHAPTER 5. ESTABLISHING THE ZOO and transitivity follows easily from the triangle inequality. Thus, we may define R as the set of equivalence classes on the above set of sequences. We introduce algebraic structures on this set R and check their properties. Of course, the rational numbers can be represented by constant sequences, so that Q ⊂ R, as expected. Next, define the sum and product of equivalence classes by taking the sum and product of sequences representing them, respectively. It is easy to check that the results represent a class independent of the choices. Ordering is dealt with similarly. Here it is required to prove that a ≤ b if and only if there are representatives with ai ≤ bi. Finally it is necessary to show that all Cauchy sequences in R converge. We do not go into details now and advise the reader to return back and check all the details when going through the full discussion of the completion of metric spaces in the paragraph 7.3.6. The arguments used there with the real scalars replaced by rational ones provide an adequate proof. The arguments proving that the axioms (R1)–(R13) define the real numbers uniquely up to isomorphism are also to be found there. 5.2.5. Closed sets. For further work with real or complex numbers, we need to understand the notions of closeness, boundedness, convergence, and so on. These concepts belong to the topic “topology”6 . As before, we work with K = R or K = C. 
We advise the reader to draw many diagrams for all the concepts and their properties, for both the real line and the complex plane. For any subset A of points in K, we are interested not only in the points a ∈ A, but also in the ones which can be approached by limits of sequences in A.

Limit points of a set
Let A be a subset of K. A point x ∈ K is called a limit point of A if and only if there is a sequence a0, a1, . . . of elements in A such that all its terms differ from x, yet its limit is x.

Notice that a limit point of a set may or may not belong to the set. For every non-empty set A ⊂ K and a fixed point x ∈ K, the set of all distances |x − a|, a ∈ A, is a set of real numbers bounded from below, and so it has an infimum d(x, A), which is called the distance of the point x from the set A. Notice that d(x, A) = 0 if and only if either x ∈ A or x is a limit point of A. (We suggest the reader prove this in detail directly from the definitions.)

6 The name of this mathematical discipline comes from the Greek "studying the shape" (topos + logos). The main concepts are built on the formalism of open and closed sets, compactness etc. We use the same names here but only in the realm of real and complex numbers. Later on, we go further in metric spaces in chapter 7.

our goal is to view a(n) as a function of n. Execute the given syntax yourself to practice with Sage. □

5.B.9. Limits of sequences. Compute the following limits, and then verify your answers via Sage:
(a) lim_{n→∞} (2n^2 + 3n + 1)/(n + 1),
(b) lim_{n→∞} (1^2 + 2^2 + · · · + n^2)^(1/n),
(c) lim_{n→∞} (√(4n^2 + n) − 2n),
(d) lim_{n→∞} (n + 1)/(2n^2 + 3n + 1),
(e) lim_{n→∞} (5^n + 1)/(7^n + 1),
(f) lim_{n→∞} √(4n^2 + n)/n.

Solution. (a) Set an = (2n^2 + 3n + 1)/(n + 1). Multiplying both the numerator and the denominator by 1/n, we have
lim an = lim (2n + 3 + 1/n)/(1 + 1/n) = (∞ + 3 + 0)/(1 + 0) = ∞ .
To confirm this result in Sage, give the cell

n = var("n"); a(n) = (2*n^2+3*n+1)/(n+1)
lim(a(n), n=infinity)

Sage's answer is +Infinity.
(b) Let us present a solution based on a very useful theoretical result, the so-called "squeeze theorem" presented in 5.2.12. In particular, set an = (1^2 + 2^2 + · · · + n^2)^(1/n). Then for any positive natural n we have
bn = n^(1/n) ≤ an ≤ cn = (n^2 + n^2 + · · · + n^2)^(1/n) (n terms) = (n^3)^(1/n) .
By 5.B.8 we know that bn → 1. Moreover, since (n^3)^(1/n) = n^(3/n) = (n^(1/n))^3, we also conclude that cn → 1, as n → +∞. The squeeze theorem now applies and gives that an → 1. In Sage one can verify these claims as follows:

n, k = var("n, k")
a(n) = sum(k^2, k, 1, n)**(1/n)  # declare (a_n)
f(n) = (n^3)**(1/n); g(n) = n**(1/n)
print(all(bool(a(j) <= f(j)) for j in range(1, 1000)))
print(all(bool(a(j) >= g(j)) for j in range(1, 1000)))
print(lim(a(n), n=oo))

Notice that in the fourth and fifth lines we test the two posed inequalities for many values of n (bool converts each symbolic inequality into True or False, and all checks them together). However, one could directly compute the limit in question (hence, one may omit these two lines, together with the definitions of the sequences f(n) and g(n)).
(c) Set an = √(4n^2 + n) − 2n. Here the following classical trick applies:
lim an = lim (√(4n^2 + n) − 2n)(√(4n^2 + n) + 2n)/(√(4n^2 + n) + 2n) = lim n/(√(4n^2 + n) + 2n) = lim 1/(√(4n^2 + n)/n + 2) = 1/4 .
(d) We leave this for practice, since it can be treated as case (a). Verify yourself that the given limit equals 0 (informally, "1/∞ = 0").

Closed sets
The closure ¯A of a set A ⊂ K is the set of those points which have zero distance from A (note that the distance from the empty set of points is undefined, therefore the closure of ∅ is ∅).
A closed subset in K is a set which coincides with its closure; equivalently, a set is closed if it contains all of its limit points. On the real line, a closed interval [a, b] = {x ∈ R, a ≤ x ≤ b}, where a and b are fixed real numbers, is a closed set. The sets (−∞, b], [a, ∞), and (−∞, ∞) are also closed sets. A closed set may also be formed by a sequence of real numbers without a limit point, or a sequence with a finite number of limit points together with these points. The unit disc (including its boundary circle) in the complex plane is another example of a closed set.

An arbitrary intersection of closed sets is again a closed set. A finite union of closed sets is again a closed set. Indeed, if all of the points of some sequence belong to the considered intersection of closed sets, then they belong to each of the sets, and so do all the limit points. However, if we wanted to say the same about an arbitrary union, we would get into trouble: singleton sets are closed, but a sequence of points created from them may not be. On the other hand, if we restrict our attention to finite unions and consider a limit point of some converging sequence lying in this union, then the limit point must also be the limit point of any subsequence, especially one lying in only one of the united sets. As this set is assumed to be closed, the limit point lies in it, and thus it lies in the union.

5.2.6. Open sets. There is another useful type of subset of the real numbers: open intervals (a, b) = {x ∈ R; a < x < b}, where, again, a and b are fixed real numbers or the infinite values ±∞. It is an open set, in the following sense:

Open sets and neighbourhoods of points
An open set in K is a set whose complement is a closed set. A neighbourhood of a point a ∈ K is an open set O which contains a. If the neighbourhood is defined as Oδ(a) = {x ∈ K, |x − a| < δ} for some positive number δ, then we call it the δ-neighbourhood of the point a.

For real numbers, the δ-neighbourhood of a point a is the open interval of length 2δ, centered at a. In the complex plane, it is the disc of radius δ, also centered at a. Notice that for any set A, a ∈ K is a limit point of A if and only if every neighbourhood of a contains at least one point b ∈ A, b ≠ a.

Lemma. A set A ⊂ K of numbers is open if and only if every point a ∈ A has a neighbourhood contained in A.

(e) This case can be treated using the conclusions of 5.B.3. Hence we get
lim_{n→∞} (5^n + 1)/(7^n + 1) = lim_{n→∞} ((5/7)^n + 1/7^n)/(1 + 1/7^n) = (0 + 0)/(1 + 0) = 0 .
In Sage, for this result, work as above, i.e.,

n = var("n"); f(n) = (5^n+1)/(7^n+1)
lim(f(n), n=oo)

(f) Set an = √(4n^2 + n)/n. For any positive natural n we have that
bn = √(4n^2)/n < an < cn = √(4n^2 + n + 1/16)/n .
Moreover, it is easy to see that lim_{n→∞} bn = 2 = lim_{n→∞} cn. Thus, by the squeeze theorem it follows that lim_{n→∞} an = 2. □

5.B.10. Given a non-empty bounded set A ⊆ R, set a := sup A. Show that there exists a sequence (an) in A tending to a (notice that a similar statement holds for inf A).

Solution. By assumption a is the supremum of A, so a is an upper bound of A, and for every ζ > 0 there exists x ∈ A such that x > a − ζ. To prove this, fix some ζ > 0 and suppose, on the contrary, that every x ∈ A satisfies x ≤ a − ζ. Then a − ζ must also be an upper bound of A, a contradiction, since a = sup A is the least upper bound. Thus we can find some x ∈ A with x > a − ζ (note that the converse is also true). Choosing now ζ = 1/n for n ∈ Z+, so ζ > 0, we may denote the corresponding element by xn ∈ A, so that xn > a − 1/n.
Since xn ≤ a, we finally arrive to the inequality 0 ≤ a − xn < 1/n. Fix now some ε > 0 and choose N ∈ Z+ with 1/N < ε. Then for all n ≥ N we get |a − xn| < 1/n ≤ 1/N < ε, hence xn → a as n → ∞. □ A sequence (an) is said to be bounded above (respectively, bounded below), if there exists M ∈ R such that an ≤ M (respectively, an ≥ M), for all n that (an) is defined. A sequence that is bounded above and below is called bounded. Obviously, a sequence that diverges to +∞ (respectively, −∞) is not bounded above (respectively, is not bounded below). A sequence (an) is said to be increasing (respectively, decreasing) if an ≤ an+1 (respectively, an ≥ an+1) for all n that (an) is defined. If we have an < an+1 (respectively, an > an+1) for all n, then the sequence (an) is called strictly increasing (respectively, strictly decreasing). For instance, the sequence (n2 ) is strictly increasing, while the sequence (1/n) is strictly decreasing. 5.B.11. Monotone sequence theorem. Show that every bounded monotonic sequence is convergent. Solution. This extremely useful result can be equivalently rephrased as follows: “Every increasing sequence (an) which is bounded above converges to sup{an : n ∈ N∗ }, and similarly every decreasing sequence (bn) that is bounded below CHAPTER 5. ESTABLISHING THE ZOO Proof. Let A be an open set and let a ∈ A. If there is no neighbourhood of the point a inside A, then there is a sequence an /∈ A, |a − an| ≤ 1/n. But then the point a ∈ A is a limit point of the set K\A, which is impossible since the complement of A is closed. Suppose every a ∈ A has a neighbourhood contained in A. This prevents a limit point b of the set K \ A to lie in A. Thus the set K \ A is closed, and so A is open. □ From this lemma, it follows immediately that any union of open sets is an open set. A finite intersection of open sets is also an open set. 5.2.7. Bounded and compact sets. The closed and open sets are basic concepts of topology. Without going into deeper connections, the above material describes the topology of the real line and the topology of the complex plane. The following concepts are extremely useful: Bounded and compact sets A set A of rational, real, or complex numbers is called bounded if and only if there is a positive real number r such that |z| ≤ r for all numbers z ∈ A. Otherwise, the set is called unbounded. A set which is both bounded and closed is called com- pact. An interior point of a set A is a point such that one of its neighbourhoods is contained in A. A boundary point of a set A is a point for which all its neighbourhoods have a non-trivial intersection with both A and its complement K \ A. A boundary point of the set A may or may not belong to it. An open cover of a set A is such a collection of open sets Ui, i ∈ I, that its union contains the whole of A. An isolated point of a set A is a point a ∈ A such that there is a neighbourhood N of a satisfying N ∩ A = {a}. 5.2.8. Theorem. All subsets A ⊂ K of real or complex numbers satisfy: (1) a non-empty set A ⊂ R is open if and only if it is a union of countably (or finitely) many open intervals; similarly 387 converges to inf{bn : n ∈ N∗ }. We will prove the first statement and an analogous method applies for the second one. Let (an) be a sequence and assume for simplicity that n ∈ N∗ . Assume also that (an) is bounded above with an ≤ an+1 for all n ∈ N∗ . Since (an) is bounded above, its range {an : n ∈ N∗ } is also bounded above (as a set), and hence sup{an : n ∈ N∗ } exists. Set a := sup{an : n ∈ N∗ }. 
Recall that if a is the supremum of a non-empty set A, then a is an upper bound of A and for every ζ > 0 there exists x ∈ A such that x > a − ζ (see the proof of 5.B.10). In our case this means that for any given ζ > 0 there exists a natural N with a − ζ < aN, that is, a − aN < ζ. However, (an) is increasing, so for all n ≥ N we have an ≥ aN, and in particular a − ζ < aN ≤ an ≤ a < a + ζ, for all such n. Thus |an − a| < ζ for all n ≥ N, i.e., an → a. □

5.B.12. Present a "counterexample" verifying that the converse of the monotone sequence theorem fails. Moreover, give an example of a monotone sequence which is not convergent. ⃝

5.B.13. Given the sequences (an = sin(n)/n), (bn = n^2 + 1) and (cn = ((−1)^n + n)/n), with n ∈ Z+ in all three cases, determine which of them are bounded. ⃝

5.B.14. Indicate the convergent sequences among those given in the previous problem 5.B.13, and determine their limits. Repeat for the sequences fn = 1/√n and gn = n!/n^n, where n ∈ Z+. ⃝

5.B.15. Let a > 0. Show that lim_{n→∞} a^(1/n) = 1.

Solution. Once more, the squeeze theorem provides a traditional way to treat this limit, as in 5.B.8. Let us, however, discuss a different method. First notice that we may assume that 0 < a < 1, since otherwise the argument applies to 1/a. For such a it is easy to see that the given sequence (an = a^(1/n)), n ∈ Z+, is monotonically increasing. Moreover, it is bounded above by 1. A proof of this fact is left to the reader, but one may like to illustrate the situation in Sage, either via the cell

n = var("n"); a = 0.2; a(n) = a**(1/n)
all(a(j) < 1 for j in range(1, 5000))

which returns True, or by plotting many terms of (an) (continue typing in the previous cell)4

list_plot([a(n) for n in range(1, 100)])

Thus (an) converges, and we may assume that lim_{n→+∞} an = ℓ ∈ R. Assume that ℓ < 1. Using the inequality a^(1/n) ≤ ℓ we get a ≤ ℓ^n, with ℓ^n → 0 by 5.B.3. This gives a contradiction, since a > 0. Taking into account that (an) is bounded above by 1, we deduce that ℓ = 1. For a Sage verification one can continue in the very first cell used above, by adding the following command: lim(a(n), n=oo). □

4 Notice this gives an alternative way to plot sequences via Sage.

A ⊂ C is open if and only if it is a union of countably (or finitely) many open discs.
(2) every point a ∈ A is either an interior or a boundary point,
(3) every boundary point of A is either an isolated point or a limit point of A,
(4) A is compact if and only if every infinite sequence contained in A has a subsequence converging to a point in A.7
(5) A is compact if and only if each of its open covers contains a finite subcover of A.

Proof. (1) Every open set is a union of some neighbourhoods of its points, i.e., we may consider open intervals in the reals, or open discs in C. So the question that remains is whether it suffices to take countably many of them. Let us first prove the claim for the complex plane. For each z ∈ A, there is an open disc Oδ(z) contained in A, with some δ > 0, and let δz be the supremum of the values of such δ. Clearly, A = ∪_{z∈A} Oδz(z). Consider an arbitrary z ∈ A and pick w with both real and imaginary parts rational, such that |w − z| < δz/4. Thus, z ∈ Oδw(w) (draw a picture!) and we have checked that actually A is the union of the countably many open discs Oδw(w) for all w ∈ A with rational real and imaginary coordinates. If A is an open subset in R, then we may repeat the above argument with the discs Oδ(z) replaced by the intervals Oδ(x), x ∈ A. Think about the details!
(2) It follows immediately from the definitions that no point can be both an interior and a boundary point. Let a ∈ A be a point that is not interior. Then there is a sequence of points ai ∉ A with a as its limit point. At the same time, a belongs to each of its neighbourhoods. Thus a is a boundary point.
(3) Suppose that a ∈ A is a boundary point but not isolated. Then, similarly to the reasoning from the previous paragraph, there are points ai, this time inside A, whose limit point is a.
(4) Suppose A ⊂ R is a compact set, i.e., both closed and bounded. Consider an infinite sequence of points ai ∈ A. A has both a supremum b and an infimum a. Divide the interval [a, b] into halves: [a, (a + b)/2] and [(a + b)/2, b]. At least one of them contains infinitely many of the terms ai. Select this half and one of the terms contained in it; then cut the selected interval into halves. Again, select a half which contains infinitely many of the sequence's terms and select one of those points. By this procedure, a Cauchy sequence is established. Cauchy sequences have limit points or are constant up to finitely many exceptions.

7 This result for real numbers is usually referred to as the Bolzano-Weierstrass theorem. Karl Weierstrass was a famous German mathematician (1815-1897) and his name is linked to many theorems in Mathematics. Bernard Bolzano (1781-1848) was a Bohemian mathematician, logician, philosopher, theologian and Catholic priest working in Prague at the beginning of the 19th century. He laid the basis of rigorous mathematical analysis a few decades before the theory was fully worked out by Weierstrass and others. In particular, he was skeptical about the effective use of Leibniz's infinitesimals without the necessary rigour.

Often we are interested in the limit of a sequence sn whose terms are partial sums of some other given sequence an. Let us describe such an example; more applications are described in Section D, which is about infinite series.

5.B.16. Geometric series. Find the limit of the sequence with general term
sn = 1 + x + x^2 + · · · + x^n , n = 0, 1, 2, . . . ,
for x with |x| < 1.

Solution. First observe that
sn(1 − x) = (1 + x + · · · + x^n) − x(1 + x + · · · + x^n) = 1 + x + · · · + x^n − x − · · · − x^n − x^(n+1) = 1 − x^(n+1) .
So for x ≠ 1 we have sn = (1 − x^(n+1))/(1 − x), with n ∈ N. Recall now by 5.B.3 that lim_{n→∞} x^n = 0, provided that |x| < 1. Thus for such x we get
lim_{n→∞} sn = lim_{n→∞} (1 − x^(n+1))/(1 − x) = 1/(1 − x) ,
and it follows that
∑_{n=0}^∞ x^n = lim_{n→∞} sn = 1/(1 − x) , |x| < 1 .
This is the so-called geometric series, and for those x satisfying |x| < 1 its sum is a finite number. For instance,
∑_{n=1}^∞ (1/2)^n = ∑_{n=0}^∞ (1/2)^n − 1 = 1/(1 − 1/2) − 1 = 2 − 1 = 1 ,
that is, 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + · · · = 1.
Very often Sage allows us to compute such sums, by applying for example the following cell (which indeed returns −1/(x − 1)):

x, n = var("x, n"); assume(abs(x) < 1)
print(sum(x^n, n, 0, oo))

This is a preliminary example of the sum method in Sage, which we will analyze more in Section D. On the other hand, the geometric series will appear in the foreground very often, hence we encourage you to devote some additional time to it (e.g., compute in Sage some of its partial sums). □

5.B.17. Consider the sequence (an) with
an = 8/10 + 8/10^2 + · · · + 8/10^n , n = 1, 2, . . . ,
that is, an = 0.88 · · · 8, with 8 repeated n times. Show that an tends to 8/9 as n tends to ∞, and then give a proof via Sage.
Thus there is a subsequence with the desired limit. The fact that A is closed implies that the point obtained lies in A. Now the other direction: if every infinite subset of A has a limit point in A, then all limit points are in A, and so A is closed. If A were not bounded, we would be able to find an increasing or decreasing sequence such that the differences of the absolute values of adjacent numbers would be at least 1, for instance. However, such a sequence of points in A cannot have a limit point at all. Finally, we have to deal with the general case A ⊂ C. The arguments of the latter implication remain the same. Thus we have to show that any sequence zn of complex numbers in A has a limit point in A. Consider the sequences of real and imaginary parts, xn and yn. Since they both have to be in the bounded subsets AR and AiR of the real and imaginary projections of A, there is a subsequence znk = (xnk, ynk) such that xnk → x, ynk → y, with the limits sitting in the closures of AR and AiR, by virtue of the already proved real case. Obviously, znk → z = (x, y), but the latter limit has to sit in A since A is closed.
(5) First, focus on the easier implication. That is, suppose that every open cover contains a finite subcover. It is required to prove that A is both closed and bounded. A ⊂ C can be covered by a countable union of neighbourhoods On(z), with integers n and centers z with integral real and imaginary parts. Any choice of a finite subcover of them witnesses that A is bounded. If A ⊂ R, then the same argument applies with intervals On(x), n, x ∈ Z. Now suppose that a ∈ C \ A is the limit point of a sequence ai ∈ A. Further, assume that |a − an| < 1/n (otherwise select a subsequence satisfying this property). The sets Jn = C \ O_{1/n}(a) for all n ∈ N, n > 0, are open and they also cover our set A. Since it is possible to choose a finite cover of A, the point a lies inside the complement C \ A together with one of its neighbourhoods, and thus it is not a limit point. Therefore, all of A's limit points must again lie in A. Hence A is closed. If A ⊂ R, the same argument applies with closed discs replaced by closed intervals.
Finally, we have to prove the other implication. So assume A ⊂ C is closed and bounded, but there is an open covering Uα, α ∈ I, of A which does not contain any finite covering. Consider the sequence of positive real numbers εn = 1/n converging to 0 and the sets Bn = {z = (k/n, m/n) ∈ A, k, m ∈ Z} of complex numbers with real and imaginary parts in the "1/n-net of coordinates". Clearly all the sets Bn are finite. Further, for each k, consider the system Ak of closed discs with centres in the points of Bk and diameters 2εk. Clearly each such system Ak covers the entire set A. Altogether, there

Solution. We see that
an = 8/10 (1 + 1/10 + · · · + 1/10^(n−1)) = 8/10 · (1 − 1/10^n)/(1 − 1/10) = 8/9 (1 − 1/10^n) = 8/9 − 8/9 · 1/10^n .
This gives |an − 8/9| = 8/9 · 1/10^n, and hence it is easy to see that the condition |an − 8/9| < ε is equivalent to the inequality 10^n > 8/(9ε). Therefore, if N is a natural number satisfying 10^N > 8/(9ε), then we have |an − 8/9| < ε for all n ≥ N. Thus we are done. An alternative relies on the fact that 1/10^n → 0 as n → ∞. To prove this, notice that for given ε > 0 and for all n ≥ N := log10(1/ε) one has |1/10^n − 0| = 1/10^n ≤ 1/10^N = ε. Finally, the following block confirms the result by a less obvious technique (explain why this is true).
var("n, k") ; a=sum(8/(10)^n, n, 1, oo) bool(a==lim(sum(8/(10)^k, k, 1, n), n=oo)) □ The binomial theorem often combines with the monotone sequence theorem and together provide elegant techniques for studying the convergence of sequences. Such an example encodes our next task, which is about a famous number, the so called “Euler number” denoted by e, that is, the base of the natural logarithm, see 5.4.1. This satisfies 2 < e < 3, approximately equals to e ≈ 2.7183, and is the limit of the sequence (en)∞ n=1 featured in 5.B.18.5 5.B.18. (The Euler number) Combining the binomial theorem with the monotone sequence theorem, see 5.B.11, show that the sequence en given below is convergent: en = ( 1 + 1 n )n , n = 1, 2, . . . . ⃝ Roughly speaking, Cauchy sequences are sequences whose all but a finite number of elements sit from each other less than a given distance ε > 0. A general result states that any convergent sequence is Cauchy, while over the reals R or the complex numbers C the converse is also true, see 5.2.3. Thus we say that R or C are complete fields, instead of Q, for example. We can often use this result, to decide about the convergence/divergence of a given sequence of real or complex numbers. Let us describe such an example, and for more advanced tasks check the final section with the extra material, see 5.E.29, 5.E.30, for example. 5Good to know that the proof given here as a solution of 5.B.18, differs from those given in 5.4.1. Can you establish the different points between the two methods? CHAPTER 5. ESTABLISHING THE ZOO must be at least one closed disc C in the system A1 which is not covered by a finite number the sets Uα. Call it C1 and notice that diam C1 = 2ε1. Next, consider the sets C1 ∩C, with discs C ∈ A2 which cover the entire set C1. Again, at least one of them cannot be covered by a finite number of Uα, we call it C2. This way, we inductively construct a sequence of sets Ck satisfying Ck+1 ⊂ Ck, diam Ck ≤ 2εk, εk → 0, and none of them can be covered by a finite number of the open sets Uα. Finally we choose one point zk ∈ Ck in each of these sets. By construction, this must be a Cauchy-sequence. Consequently, this sequence of complex numbers has a limit z. Thus there is Uα0 containing z and containing also some δ-neighbourhood Oδ(z). But now, if diam Ck ≤ 2εk < δ, then Ck ⊂ Oδ(z) ⊂ Uα0 , which is a contradiction. The proof is complete when considering A ⊂ C. Dealing with real subset A ⊂ R, again the same line of arguments applies, just the 2-dimensional nets Bk become 1-dimensional and the open discs are replaced by open intervals. □ 5.2.9. Limits of functions and sequences. For the discussion of limits, it is advantageous to extend the set R of real numbers by the two infinite values ±∞ as we have done when defining intervals. A neighbourhood of infinity is any interval (a, ∞). Similarly, any interval (−∞, a) is a neighbourhood of −∞. Further, we will extend the concept of a limit point so that ∞ is a limit point of a set A ⊂ R if and only if every neighbourhood of ∞ has a non-empty intersection with it, i.e. if the set A is unbounded from above. Similarly for −∞. We talk about the infinite limit points, sometimes also called improper limit points of the set A. “Calculations” with infinities We also introduce rules for calculation with the formally added values ±∞ and arbitrary “finite” numbers a ∈ R: a + ∞ = ∞ a − ∞ = −∞ a · ∞ = ∞, if a > 0 a · ∞ = −∞, if a < 0 a · (−∞) = −∞, if a > 0 a · (−∞) = ∞, if a < 0 a ±∞ = 0, for all a ̸= 0. 
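Sage implements exactly this extended arithmetic through its symbol oo, so the conventions above can be checked directly; the particular finite values below are arbitrary.

print(5 + oo)    # +Infinity
print(-2*oo)     # -Infinity
print(7/oo)      # 0

Indeterminate combinations such as oo - oo are refused by Sage with an error, in line with the excluded expressions discussed in 5.2.14 below.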
The following definition covers many cases of limit processes and needs to be thoroughly understood. Some particular cases are considered in great detail below.

5.B.19. Harmonic numbers. Consider the sequence (Hn)∞n=1 with general term
Hn = ∑_{k=1}^n 1/k , n = 1, 2, . . .
Decide about the convergence of (Hn)∞n=1 in terms of Cauchy sequences.

Solution. The term Hn is the so-called nth harmonic number. The first harmonic numbers are given by H1 = 1, H2 = 3/2 = H1 + 1/2, H3 = 11/6 = H2 + 1/3, and by induction on n we can show that Hn+1 = Hn + 1/(n+1), for all naturals n ≥ 1. Harmonic numbers are the partial sums of the harmonic series that we will meet later in Section D. Now, for convenience set b = |H2n − Hn|. Then
b = |1 + 1/2 + · · · + 1/n + 1/(n+1) + · · · + 1/(2n) − Hn| = 1/(n+1) + 1/(n+2) + · · · + 1/(2n) > 1/(2n) + 1/(2n) + · · · + 1/(2n) (n terms) = 1/2 .
Thus, taking ε = 1/2 we see that there is no positive integer N satisfying |Hm − Hn| < 1/2 for all m, n ≥ N. This means that (Hn) is not a Cauchy sequence. Thus (Hn) diverges (to +∞, since all of its terms are positive). We may verify the situation in Sage, by giving the cell

var("n")
sum(1/n, n, 1, oo)

Sage, after returning some "Runtime Errors", types ValueError: Sum is divergent. □

Let us now present exercises of a somewhat more topological character, concerning limit, interior, boundary, or isolated points of certain subsets of R or C. Here one should refresh the formal definitions given in 5.2.5, 5.2.7, and read carefully the proof of Theorem 5.2.8. Mastering these notions may need some effort, and thus we have included a series of exercises (see also the problems 5.E.36, 5.E.37, 5.E.38, and 5.E.39 in Section E).

5.B.20. Limit, interior and isolated points. Find all limit, isolated, and interior points of the set N∗ ⊂ R of non-zero natural numbers (i.e., positive integers), and of the set Q ⊂ R of rational numbers.

Solution. Any n ∈ N∗ admits a neighbourhood in R containing only the number n, i.e., O1(n) ∩ N∗ = (n − 1, n + 1) ∩ N∗ = {n}. This means that every point n ∈ N∗ is isolated, and there are no interior points (since an isolated point cannot be an interior point). Moreover, N∗ provides an example of an infinite set having no limit points. Indeed, if x0 is a

Real and complex limits
Definition. Consider a subset A ⊂ R and a real-valued function f : A → R or a complex-valued function f : A → C, defined on A. Further, consider a limit point x0 of the set A (i.e. a real number or ±∞). We say that f has limit a ∈ R (or a complex limit a ∈ C) at x0, and write
lim_{x→x0} f(x) = a ,
if and only if for every neighbourhood O(a) of the point a, there is a neighbourhood O(x0) of x0 such that for all x ∈ A ∩ (O(x0) \ {x0}), f(x) ∈ O(a). In the case of a real-valued function, a = ±∞ can also be the limit. Such a limit is called infinite or improper. In the other case, i.e. a ∈ R, we say the limit is finite or proper.

It is important to notice that the value of f at x0 does not occur in the definition, and that the function f may not even be defined at this limit point (and in the case of an improper limit point, it cannot be defined, of course)! We shall not deal with improper limits of complex functions now.

5.2.10. The most important cases of domains. Our definition of a limit covers several very dissimilar situations:
(1) Limits of sequences. If A = N, i.e. the function f is defined for the natural numbers only, we talk about limits of sequences of real or complex numbers.
In this case, the only limit point of the domain is ∞, and we mostly write the values (terms) of the sequence as f(n) = an and the limit in the form lim n→∞ an = a. According to the definition, this means that for any neighbourhood O(a) of the limit value a, there is an index N ∈ N such that an ∈ O(a) for all n ≥ N. Actually, we have only reformulated the definition of convergence of a sequence (see 5.2.3). We have only added the possibility of infinite limits. As before, we also say that the sequence an converges to a. We can easily see from our definition for complex numbers that a sequence of complex values has limit a if and only if the real parts of ai converge to Re a and the imaginary parts converge to Im a. (2) Limits of functions at interior points of intervals. If f is defined on the interval A = (a, b) and x0 is an interior point of this interval, we talk about the limit of a function at an interior point of its domain. Usually, we write lim x→x0 f(x) = a. Let us examine why it is important to require f(x) ∈ O(a) only for the points x ̸= x0 in this case as well. As an example, let us consider the function f : R → R f(x) = { 0 if x ̸= 0 1 if x = 0. 391 limit point of N∗ then for some ε > 0 we should have a natural (and in particular an infinite number of natural numbers) in distance smaller than ε, which is impossible. Now for Q, recall by 1.A.1 that this is a dense subset of R. This means that each neighbourhood Oε(x) of any x ∈ R contains rational numbers, that is for any x ∈ R and every ε > 0 there exists q ∈ Q with q ∈ Oε(x) = (x − ε, x + ε). Thus, for every x ∈ R, there is a sequence of rational numbers an ̸= x, converging to it. For instance, imagine the decimal representation of a real number x /∈ Q, and the corresponding sequence whose kth term will be the representation truncated to the first k decimal digits. If x was rational, we could clearly consider the sequence an = x+ 1 n . The set of all limit points of Q is thus the whole real line R. There are no isolated points in Q, and in particular any rational number is also a limit point of the complement R\Q. Finally, there are no interior points of Q (this is because between any two rational numbers, there is an irrational number, and so it is impossible for a rational a ∈ Q to be an interior point of Q). □ 5.B.21. Find all limit, isolated, boundary and interior points of the sets X = {x ∈ R : 0 ≤ x < 1} ⊂ R and Y = {z ∈ C : 0 ≤ |z| < 1} ⊂ C. Solution. Let a ∈ [0, 1) be an arbitrary number. Apparently, the sequences {a + 1 n }∞ n=1, {1 − 1 n }∞ n=1 converge to a and 1, respectively. So we have easily shown that the set of X’s limit points contains the interval [0, 1]. There are no other limit points: for any b /∈ [0, 1] there is δ > 0 such that Oδ (b)∩[0, 1] = ∅ (for b < 0 it suffices to take δ = −b, and for b > 1 we can choose δ = b−1). Since every point of the interval [0, 1) is a limit point, there are no isolated points. For a ∈ (0, 1), let δa be the smaller one of the two positive numbers a, 1 − a. Considering Oδa (a) = (a − δa, a + δa) ⊆ (0, 1) for a ∈ (0, 1), we see that every point of the interval (0, 1) is an interior point of X. For every δ ∈ (0, 1), we have that Oδ (0) ∩ [0, 1) = (−δ, δ) ∩ [0, 1) = [0, δ), Oδ (1) ∩ [0, 1) = (1 − δ, 1 + δ) ∩ [0, 1) = (1 − δ, 1), so every δ-neighborhood of the point 0 contains some points of the interval [0, 1) and some points of the interval (−δ, 0), and every δ-neighborhood of 1 has a non-empty intersection with the intervals [0, 1), [1, 1 + δ). 
Therefore, 0 and 1 are boundary points. Altogether, we have found that the interior points of X coincide with the interval (0, 1) while the twoelement set {0, 1} consists of the boundary points of X (as we know that no point can be both interior and boundary and that a boundary point must be an isolated or a limit point). The case of Y is very similar and we leave most of the details to the reader. It is the open unit disk in the complex plane, all its points are interior (as with all open sets). The boundary points form the unit circle. Thus, the set of limit points is the closed unit disc and there are no isolated points there. □ CHAPTER 5. ESTABLISHING THE ZOO Apparently, the limit at zero is well-defined, and in accordance with our expectations, limx→0 f(x) = 0 even though the value f(0) = 1 does not belong into small neighbourhoods of the limit value 0. An equivalent definition using ε-neighbourhoods of the limits a and δ-neighbourhoods of the limit points x0 is the following: limx→x0 f(x) = a if for each ε > 0 there is a δ > 0 such, that for all x ̸= x0 satisfying |x − x0| < δ, |f(x) − a| < ε. (3) One-sided limits. If A = [a, b] is a bounded interval and x0 = a or x0 = b, we talk about a one-sided limit of the function f at the point x0, from the right and from the left respectively. If the point x0 is an interior point of the domain of f, we can, in order to determine the limit, consider the domain restricted to [x0, b] or [a, x0]. The resulting limits are also called a right-sided limit and left-sided limit, respectively, of the function f at the point x0. We denote them by limx→x+ 0 f(x) and limx→x− 0 f(x), respectively. As an example, we can consider the one-sided limits at x0 = 0 for Heaviside’s function h from the introduction to this part. Apparently, lim x→0+ h(x) = 1, lim x→0− h(x) = 0. However, the limit limx→0 f(x) does not exist. It follows from the definitions that the limit at an interior point of the domain of an arbitrary function f exists if and only if both one-sided limits exist and are equal. 5.2.11. Further examples of limits. (1) The limit of a complex function f : A → C in a limit point x0 of its domain exists if and only if the limits of both the real part and the imaginary part exist. In this case, we have lim x→x0 f(x) = lim x→x0 (Re f(x)) + i lim x→x0 (Im f(x)). The proof is straightforward and makes direct use of the definitions of distances and neighbourhoods of the points in the complex plane. Indeed, the membership into a δ– neighbourhood of a complex value z is guaranteed by the real (1/ √ 2)δ–neighbourhoods of the real and the imaginary parts of z. Hence the proposition follows immediately. (2) Let f be a real or complex polynomial. Then for every point x ∈ R, lim x→x0 f(x) = f(x0). Really, if f(x) = anxn + · · · + a0, then the identity (x0 + δ)k = xk 0 + kδxk−1 0 + · · · + δk , substituted for k = 0, . . . , n, gives that choosing a sufficiently small δ makes the values arbitrarily close to f(x0). (3) Now consider the following function defined on the whole real line f(x) = { 1 if x ∈ Q 0 if x /∈ Q. It is apparent from the definition that this function cannot have (even one-sided) limits at any point of its domain. 392 If all limit points of a subset A of R or of C belong to A, then the subset is said to be closed. This means that we cannot run away from A by a sequence (xn) ∈ A to a limit not belonging to A. 
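As a tiny Sage illustration of this (our own example, anticipating 5.B.22 below): the sequence (1/n) lies in the set (0, 1], yet its limit escapes the set, so (0, 1] is not closed.

n = var("n")
lim(1/n, n=oo)   # returns 0, and 0 does not belong to (0, 1]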
If the complement of A in the real line or in the complex plane is closed, then clearly each x ∈ A has a neighbourhood Oε(x) ⊂ A, for some ε > 0. In this case A is called open, see 5.2.5, 5.2.7 and 5.2.8 for further details. Observe that the empty set ∅ and R (or C) itself are both open and closed (for instance, C has no boundary points). If a subset A in R or C is closed and bounded (the latter means that A ⊂ Or(x) for some x and big enough r > 0), then each sequence (xn) ∈ A has a limit point in A, i.e., there is a convergent subsequence in A. Such sets are called compact sets. 5.B.22. Closed and open subsets. Verify that an open interval I = (a, b) = {x ∈ R : a < x < b} of the real line is an open set, and moreover that a closed interval J = [a, b] = {x ∈ R : a ≤ x ≤ b} is a closed set. Is the interval (0, 1] a closed set? Solution. Given some x ∈ I we should show that x is an interior point of I. Set ε = 1 2 min{x−a, b−x} > 0 and note that ε depends on x. Then we have a ≤ x − ε < x + ε ≤ b, and so (x−ε, x+ε) ⊂ (a, b). Therefore, we have constructed an open neighbourhood Oε(x) ⊂ I around x, contained in I, and since this works for any x ∈ I, the set I is open. Next, the complement of J is Jc = (−∞, a) ∪ (b, ∞) which is open as a union of open intervals. Thus J is closed. Finally, the interval (0, 1] cannot be a closed set; For instance, the sequence (xn) = (1/n) ∈ (0, 1] satisfies (1/n) → 0 as n → ∞, but 0 /∈ (0, 1]. □ 5.B.23. Decide which of the following sets are open or closed: (a) A = Z ⊂ R; (b) B = [2, +∞) ⊂ R; (c) C = {z ∈ C : |z − 1| + |z + 1| < 5} ⊂ C. Solution. (a) It is not hard to see that the complement of Z in R is given by the union ∪k∈Z(k, k + 1). Hence the set R\Z is open, as the union of open sets. It follows that Z is closed in R. (b) The complement of B in R is the set (−∞, 2), which is open. Thus B itself is a closed subset of R. In a similar way one can show that the following sets are closed: [a, +∞), and (−∞, b], with a, b ∈ R. (c) Let us treat this case based on the notion of continuity, that we will essentially analyze a few below. Consider the function f : C → R given by f(z) = |z − 1| + |z + 1|. As a challenge for you, prove that this function is continuous. Moreover, show that C = f−1 ((−∞, 5)). Then, the claim follows by 5.2.18; The set (−∞, 5) is open and f is continuous, hence C is open, as well. □ 5.B.24. Is the range of f(x) = x2 a closed subset of R? ⃝ CHAPTER 5. ESTABLISHING THE ZOO (4) The following function is even trickier than the previous one. Let f : R → R be the function defined as follows:8 f(x) =    1 if x = 0 1 q if x = p q ∈ Q, p, q ∈ Z, q > 0 are co-prime, 0 if x /∈ Q. Choose any point x0, no matter whether rational or irrational. Our goal is to show that limx→x0 f(x) = 0. Thus fix any ε > 0 and look at the possible values of f(x) close to x0. Notice that f(x) ≥ ε, i.e. 1 q ≥ ε can be true for only finite number of q ∈ N. This behaviour is illustrated on the diagram. In particular, there can be only a finite number of points x in the interval (x0 − 1, x0 + 1) for which f(x) ≥ ε. Label them x1, . . . , xn. Finally, choose δ smaller than the minimum of the distances of any two different points xi. Then f(x) < ε for all x ∈ Oδ(x0) \ {x0}. This finishes the proof. Notice that this limit equals the function value only at the irrational points. 5.2.12. The squeeze theorem. The following result is elementary, but extremely useful. We meet it when computing limits of all types discussed above, i.e. 
limits of sequences, limits of functions at interior points, one-sided limits, and so on.

Theorem. Let f, g, h be three real-valued functions with the same domain A, such that there is a neighbourhood of a limit point x0 ∈ R of the domain where f(x) ≤ g(x) ≤ h(x), for all x ≠ x0. Suppose there are limits
lim_{x→x0} f(x) = f0 , lim_{x→x0} h(x) = h0
and f0 = h0. Then the limit lim_{x→x0} g(x) = g0 exists, and it satisfies g0 = f0 = h0.

8 This function is called the Thomae function after the German mathematician J. Thomae, 1840–1921. You may find it under many other names too: e.g. Riemann function, pop-corn function, raindrop function etc. It illustrates how dense the "discontinuity" points of a function can be even though it has limits everywhere.

5.B.25. Decide which of the following sets are compact:
(a) I = (0, 1); (b) J = [0, 1]; (c) A = N∗; (d) B = {1/n : n ∈ N∗}; (e) C = B ∪ {0}.

Solution. Recall by (4) in Theorem 5.2.8 (the Bolzano-Weierstrass theorem) that a subset X ⊂ R is compact if and only if every infinite sequence contained in X has a subsequence converging to a point in X. We can use the Bolzano-Weierstrass theorem as a criterion to decide which of the given sets are compact.
(a) The open set I = (0, 1) is not compact: consider for example the sequence (xn) = (1/n), as above. This sequence lies in I and converges to 0, but 0 ∉ I. Thus, for (xn) there is no convergent subsequence whose limit belongs to I.
(b) The closed set J = [0, 1] is compact, as is every bounded closed interval [a, b] ⊂ R, see 5.E.39.
(c) The set of non-zero naturals N∗ is not compact. For instance, the sequence (yn = n) has no convergent subsequence. This is because every subsequence diverges to infinity.
(d) The set B is not closed, so it cannot be compact either.
(e) Obviously, the set C is closed and bounded, and hence compact. □

Given a sequence (an), in 5.4.6 we consider the so-called "upper limit" and "lower limit", also called the limes superior and limes inferior of (an), denoted by lim sup_{n→∞} an and lim inf_{n→∞} an. These limits always exist and have many important applications (see for example Section D, and also 5.E.32 and 5.E.34 for more details).

5.B.26. Compute the limes superior and inferior of the sequence
an = (n^2 + 4n − 5)/(n^2 + 9) · sin^2(nπ/4) , n ∈ N∗ .

Solution. Recall that the limes superior/inferior are the largest/smallest limit points of the sequence. To compute them for the given task, we may split the sequence (an) into several subsequences according to the value of sin^2(nπ/4), namely 0, 1/2, and 1. All of these subsequences converge, to different limits, and the result is lim sup_{n→∞} an = 1 and lim inf_{n→∞} an = 0, respectively. □

Additional exercises on the limes superior/inferior and on most of the directions analyzed above are presented in Section E. Later, in Chapter 7, we will revise most of the topological notions analyzed in this chapter (using R or C) in terms of the more general notion of "metric spaces". But to do so, we first need to learn about limits of functions and continuous functions.

Proof. From the assumptions of the theorem, it follows that for any ε > 0, there is a neighbourhood O(x0) of the point x0 ∈ A ⊂ R in which both f(x) and h(x) lie in the interval (f0 − ε, f0 + ε), for all x ≠ x0. From the condition f(x) ≤ g(x) ≤ h(x), it follows that g(x) ∈ (f0 − ε, f0 + ε), and so lim_{x→x0} g(x) = f0. The above reasoning is easily modified for infinite limit values or for limits at infinite points x0.
In the first case, choose a large N instead of ε. The condition on the values reads: both f(x) and h(x) have values larger than N on the neighbourhood O(x0) \ {x0}, and thus the same will be true for g(x). In the second case, the neighbourhood O will be an interval (M, ∞). The other infinite limit point −∞ is dealt with similarly. □ The next theorem reveals the elementary properties of limits, again for all types together. Think about the individual cases, including the limits taken at x0 = ±∞! 5.2.13. Theorem. Let A ⊂ R be the domain of real or complex functions f and g, let x0 be a limit point of A and let the limits lim x→x0 f(x) = a ∈ K, lim x→x0 g(x) = b ∈ K exist. Then: (1) the limit a is unique, (2) the limit of the sum f + g exists and satisfies lim x→x0 (f(x) + g(x)) = a + b, (3) the limit of the product f · g exists and satisfies lim x→x0 (f(x) · g(x)) = a · b. In particular, if f(x) = a is a constant function then limx→x0 a · g(x) = a · b, (4) if b ̸= 0, the limit of the quotient f/g exists and satisfies lim x→x0 f(x) g(x) = a b . Proof. (1) Suppose a and a′ are two values of the limit limx→x0 f(x). If a ̸= a′ , then there are disjoint neighbourhoods O(a) and O(a′ ). However, for sufficiently small neighbourhoods of x0, the values of f should lie in both neighbourhoods. This is a contradiction. Thus a = a′ . (2) Choose a neighbourhood of a+b, for instance O2ε(a+ b). For a sufficiently small neighbourhood of x0 and x ̸= x0, 394 Limits of functions enable us to investigate the behaviour of functions (and hence their graph), in the neighbourhood of some given point. They form the basis of the notion of differentiation that we will discuss in the forthcoming section, and establishes one of the most beautiful themes in mathematical analysis and in many other areas (e.g., numerical analysis). Most of the tasks presented below are based on theoretical results from the sections 5.2.9, 5.2.12, 5.2.13, 5.2.16, 5.2.18, and 5.2.20. All these results will gradually become more evident, as we explore them by examples. We recall however that limits are well defined in the limit points of the given domain of a function, and the function is called continuous at x, if its limit equals the function value there (be aware however, that in general a function does not need to be defined at the point where we seek for he limit). Recall also that from the topological point of view (see 5.2.18), a function f : A ⊂ R → R is continuous if and only if the pre-image f−1 (U) is an open subset of A, for every open set U ⊂ R. To verify that f−1 (U) is open it suffices to show that any point x ∈ f−1 (U) lies on an open interval of f−1 (U). 5.B.27. Limit of functions. Find the following limits or explain why they do not exist: (a) lim x→c x2 , (e) lim x→2+ x2 + 2x − 8 |x − 2| , (b) lim x→c (x6 − x3 − √ 6) , (f) lim x→2− x2 + 2x − 8 |x − 2| , (c) lim x→2 x2 + x − 6 x2 − 3x + 2 , (g) lim x→3 x − 3 √ x + 1 − (x − 1) , (d) lim x→0 sin ( 1 x ) , (k) lim x→0 x cos ( 1 x ) . Solution. (a) We can treat this limit by basic principles, as for example the Cauchy definition for limits of functions. Let us however rely on topology, as it was mentioned a few above, to show that f(x) = x2 is continuous and hence limx→c x2 = c2 , where c is any real number. Let I = (a, b) ⊂ R be an open interval with a < b and a, b ∈ R. By definition, f−1 (I) = {x ∈ R : x2 ∈ (a, b)}, hence obviously for b ≤ 0 it should be f−1 (I) = ∅ (since x2 ≥ 0 for any x), which is open. 
For a < 0 < b, one gets f⁻¹(I) = (−√b, √b) ⊂ R, since any x ∈ (−√b, √b) satisfies f(x) ∈ (a, b). Thus again f⁻¹(I) is open, being an open interval of R. Finally, for 0 ≤ a < b the preimage f⁻¹(I) is the union (−√b, −√a) ∪ (√a, √b), which is again open as the union of two open intervals of R.
(b) For the second limit one simply applies the sum and difference rules (see also below) and benefits from the fact that the n-th power x → x^n (n = 0, 1, 2, . . .) is a continuous function. Or you can rely directly on the fact that polynomials are continuous everywhere in R, see 5.2.11. Hence lim_{x→c} (x⁶ − x³ − √6) = c⁶ − c³ − √6.
(c) The substitution x = 2 leads to both a zero numerator and a zero denominator. Despite that, the problem can be solved

CHAPTER 5. ESTABLISHING THE ZOO

both f(x) and g(x) will lie in ε–neighbourhoods of the points a and b. Hence their sum will lie in the 2ε–neighbourhood of a + b. The proposition is proved.
(3) We have to be a bit more careful here. Let us look at what we can do with the assumption |f(x) − a| < ε, |g(x) − b| < ε for x ∈ O(x0), i.e. choosing a small enough neighbourhood of the limit point x0. Estimate:
|f(x)g(x) − ab| = |f(x)(g(x) − b) + b(f(x) − a)| ≤ |f(x)|ε + |b|ε ≤ |f(x) − a|ε + |a|ε + |b|ε ≤ ε² + ε(|a| + |b|).
Now we easily conclude. Choosing any ε̃ > 0, there is a unique ε > 0 with ε̃ = ε² + ε(|a| + |b|). Thus, using this ε for the choice of O(x0) above, we arrive at the required condition from the definition of the limit. Clearly, the limit of the constant function f(x) = a is a at all limit points of its domain.
(4) In view of the previous results, it suffices to prove lim_{x→x0} 1/g(x) = 1/b for |b| > 0. We need to be careful when considering complex valued functions. We need to estimate
|1/g(x) − 1/b| = |b − g(x)|/(|g(x)| · |b|).
Since |b| > 0, we may restrict ourselves to a neighbourhood U of x0 such that |g(x)| > |b|/2. Then |g(x)| · |b| > |b|²/2. Thus, if |g(x) − b| < ε, then
|1/g(x) − 1/b| < 2|b − g(x)|/|b|² < 2ε/|b|².
This verifies the claim as in the previous case. □

5.2.14. Remarks on infinite values of limits. The statement of the theorem can be extended to some infinite values of the limits of real-valued functions:
For sums, either at least one of the two limits must be finite, or both limits must share the same sign. Then the limit of the sum is the sum of the limits, with the conventions from 5.2.9. However, "∞ − ∞" is excluded.
For products, if one of the limits is infinite, then the other limit must be non-zero. Then the limit of the product is the product of the limits. The case "0 · (±∞)" is excluded.
For a quotient, it may be that a ∈ R and b = ±∞, in which case the resulting limit is zero; or a = ±∞ and b ∈ R, in which case it is ±∞ according to the signs of the numerator and the denominator. The case "∞/∞" is excluded.
The theorem also covers, as a special case, the corresponding statements about the convergence of sequences, as well as about one-sided limits of functions defined on an interval.
The following provides a "convergence test" useful in many situations. It relates to limits of sequences and functions in general.

5.2.15. Proposition. Consider a real or complex valued function f defined on a set A ⊂ R and a limit point x0 of the set A. f has a limit y at x0 if and only if for every sequence of

very easily by factorization:
lim_{x→2} (x² + x − 6)/(x² − 3x + 2) = lim_{x→2} ((x − 2)(x + 3))/((x − 2)(x − 1)) = (2 + 3)/(2 − 1) = 5.
(d) We will prove that this limit does not exist, using ε-neighbourhoods, see 5.2.9 and 5.2.10.
Set f(x) = sin(1/x) and assume, for contradiction, that lim_{x→0} f(x) = ℓ for some real number ℓ. This means that for ε = 1/2 there exists δ > 0 such that for any x with 0 < |x − 0| < δ we have |f(x) − ℓ| < 1/2. Then there exist some n ∈ N and reals x1 = 1/(2nπ) and x2 = 1/(2nπ + π/2) such that 0 < |x1| < δ and 0 < |x2| < δ. Hence |f(x1) − ℓ| < 1/2 and |f(x2) − ℓ| < 1/2, and using the triangle inequality we see that
Δ = |(f(x1) − ℓ) + (ℓ − f(x2))| ≤ |f(x1) − ℓ| + |ℓ − f(x2)| = |f(x1) − ℓ| + |f(x2) − ℓ| < 1/2 + 1/2 = 1,
where for simplicity we set Δ := |f(x1) − f(x2)|. Since f(x1) = sin(2nπ) = 0 and f(x2) = sin(2nπ + π/2) = sin(π/2) = 1, the last inequality gives |f(x1) − f(x2)| = |0 − 1| = 1 < 1, a contradiction.
(e) By the definition of the absolute value we have
lim_{x→2+} (x² + 2x − 8)/|x − 2| = lim_{x→2+} ((x − 2)(x + 4))/(x − 2) = lim_{x→2+} (x + 4) = 6.
(f) Similarly,
lim_{x→2−} (x² + 2x − 8)/|x − 2| = lim_{x→2−} ((x − 2)(x + 4))/(−(x − 2)) = −6.
Thus, a comparison with the result from case (e) shows that the function f(x) = (x² + 2x − 8)/|x − 2| cannot be continuous at x0 = 2 (since the one-sided limits are not equal).
(g) Set for simplicity f(x) = (x − 3)/(√(x + 1) − (x − 1)). When x → 3, both the numerator and the denominator tend to 0, which gives the indeterminate form 0/0 (see also below). One can treat such cases by multiplying both the numerator and the denominator by the "conjugate expression" of the denominator. In our case this means √(x + 1) + (x − 1), with
(√(x + 1) − (x − 1))(√(x + 1) + (x − 1)) = (x + 1) − (x − 1)² = −x(x − 3).
Thus
lim_{x→3} f(x) = lim_{x→3} ((x − 3)(√(x + 1) + (x − 1)))/(−x(x − 3)) = −lim_{x→3} (√(x + 1) + (x − 1))/x = −4/3.
(k) Recall that |cos(x)| ≤ 1 for any x ∈ R, and hence |x cos(1/x)| ≤ |x|. By the squeeze theorem (5.2.12) this gives lim_{x→0} x cos(1/x) = 0. □

CHAPTER 5. ESTABLISHING THE ZOO

points xn ∈ A converging to x0, xn ≠ x0, the sequence of the values f(xn) has limit y.

Proof. Suppose first that the limit of f at x0 is y. Then for any neighbourhood U of the point y, there is a neighbourhood V of x0 such that for all x ∈ V ∩ A, x ≠ x0, f(x) ∈ U. For every sequence xn → x0 of points different from x0, the terms xn lie in V for all n greater than a suitable N. Therefore, the sequence f(xn) converges to y.
Now suppose that the function f does not converge to y as x → x0. Then for some neighbourhood U of y, there is a sequence of points xm ≠ x0 in A which are closer to x0 than 1/m, with f(xm) not belonging to U. In this way, a sequence of points of A different from x0 is constructed, with lim_{m→∞} xm = x0, for which the values f(xm) do not converge to y. The proof is finished. □

5.2.16. Continuity. Continuity was discussed intuitively when polynomials were discussed. Now all the tools for a proper formulation of continuity are prepared. This is the basic class of functions in the sequel.

Continuity of functions

Definition. Let f be a real or complex valued function defined on an interval A ⊂ R. f is continuous at a point x0 ∈ A if and only if lim_{x→x0} f(x) = f(x0). The function f is continuous on an interval A if and only if it is continuous at every point x0 ∈ A.

The diagram explains the meaning of continuity. Firstly, the limit has to exist. Thus, after choosing a neighbourhood U of the limit value f(x0) (the ε-neighbourhood Oε(f(x0)) is shown), there is a neighbourhood of x0 (the δ-neighbourhood is shown) whose images all lie in U. In words, if we decide how close we want to be to f(x0), we may always choose a sufficiently small neighbourhood of x0 where this is guaranteed.
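Parts (e) and (f) of 5.B.27 can also be cross-checked in Sage with one-sided limits, using the dir option mentioned in 5.B.28 below. A minimal sketch (our own cell, not from the text; the expected outputs are indicated in the comments):

f(x) = (x^2 + 2*x - 8)/abs(x - 2)
print(limit(f(x), x=2, dir='right'))   # expected: 6
print(limit(f(x), x=2, dir='left'))    # expected: -6

Since the two one-sided limits differ, the two-sided limit at x = 2 does not exist.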
5.B.28. Compute the limit
lim_{x→1} (3x³ + 2x² − 2x − 3)/(5x³ + 2x² − 2x − 5),
if it is known that ρ = 1 is a root of both the numerator and the denominator. Next use Sage to confirm your computations.
Solution. First notice that the function f(x) = (3x³ + 2x² − 2x − 3)/(5x³ + 2x² − 2x − 5) is not defined at x = 1, and moreover this limit again has the indeterminate form 0/0. However, the statement already motivates a method for solving the task: use Horner's scheme and factorize the numerator and the denominator, i.e.,
3x³ + 2x² − 2x − 3 = (x − 1)(3x² + 5x + 3),
5x³ + 2x² − 2x − 5 = (x − 1)(5x² + 7x + 5).
Then it is easy to see that the limit at hand equals 11/17. In fact, you can easily do all the computations that we omitted via Sage, where you can still use the command limit to compute limits of functions, as in the case of sequences. Here, however, it is necessary to specify the limit point x0; the exact syntax for computing the limit lim_{x→x0} f(x) is limit(f(x), x = x_0). In addition, you can ask Sage to compute the left-side and right-side limits by adding the option dir="left" or dir="right", respectively. For our task, the cell
g(x)=3*x^3+2*x^2-2*x-3; h(x)=5*x^3+2*x^2-2*x-5
show(factor(g(x))); show(factor(h(x)))
lim(g(x)/h(x), x=1)
verifies the factorizations given above and the explicit value of the limit, as you can easily check on your computer.
Finally, it is worth mentioning that one could compute the limit at hand by other methods. For instance, one can apply Proposition 5.2.15. Hence, let (xn) be a sequence of real numbers with xn → 1 as n → ∞, and xn ≠ 1 for all n = 1, 2, . . .. Then, for the induced sequence of values f(xn), we see that
f(xn) = (3xn³ + 2xn² − 2xn − 3)/(5xn³ + 2xn² − 2xn − 5) = ((xn − 1)(3xn² + 5xn + 3))/((xn − 1)(5xn² + 7xn + 5)) = (3xn² + 5xn + 3)/(5xn² + 7xn + 5).
Taking the limit lim_{n→∞} f(xn), we get 11/17, and according to Proposition 5.2.15 this implies that lim_{x→1} f(x) = 11/17. □

5.B.29. Explain why the relation lim_{x→π/3} sin(x) = sin(π/3) is true. Next use Sage to confirm that sin(x) is continuous at the chosen point by combining the limit command with the bool syntax. ⃝

The Cauchy definition of limits of functions in terms of ε and δ can be used to derive a number of elementary results. Let us describe such an example.

CHAPTER 5. ESTABLISHING THE ZOO

Notice that for the boundary points of the interval A, the definition says that the value of f equals the value of the one-sided limit there. The function is then said to be right-continuous or left-continuous at such a point.
Every polynomial is a continuous function on the whole of R, see 5.2.11(2). The Thomae function is continuous at the irrational real numbers only, although it has limits at all rational points as well, see 5.2.11(4).
The previous theorem 5.2.13 about the properties of limits immediately implies all of the following claims. The same properties hold for right-continuity and left-continuity, as is easily checked.

5.2.17. Theorem. Let f and g be (real or complex valued) functions defined on an interval A ⊂ R and continuous at a point x0 ∈ A. Then
(1) the sum f + g is continuous at x0,
(2) the product f · g is continuous at x0,
(3) if g(x0) ≠ 0, then the quotient f/g is well-defined on some neighbourhood of x0 and is continuous at x0,
(4) if a continuous function h is defined on a neighbourhood of the value f(x0) of the real-valued function f, then the composite function h ◦ f is defined on a neighbourhood of x0 and is continuous at x0.

Proof. Statements (1) and (2) are clear.
For property (3): if g(x0) ≠ 0, then the ε–neighbourhood of the number g(x0) does not contain zero for a sufficiently small ε > 0. By the continuity of g, it follows that on a sufficiently small δ-neighbourhood of the point x0, g is non-zero, and the quotient f/g is thus well-defined there. It is continuous at x0 by the previous theorem.
(4) Choose a neighbourhood O of h(f(x0)). By the continuity of h, there is a neighbourhood O′ of f(x0) which is mapped into O by h. The continuous function f maps some sufficiently small neighbourhood of the point x0 into the neighbourhood O′. This is the defining property of continuity, so the proof is finished. □

5.2.18. We consider some basic relations between continuous mappings and the topology of the real numbers. They exploit the highly non-trivial characterization of compact sets in theorem 5.2.8.

5.B.30. Let f : A → R be a function, where A is a subset of R, and let a be a limit point of A. Suppose that the limit lim_{x→a} f(x) exists and is a finite number. Show that f is bounded on the set (a − h, a + h) ∩ A for some h > 0.
Solution. By assumption, the limit lim_{x→a} f(x) exists and is a finite number, say lim_{x→a} f(x) = ℓ. Then for every ε > 0 there exists some δ > 0 such that |f(x) − ℓ| < ε for all x ∈ A satisfying 0 < |x − a| < δ. Hence, taking ε = 1, there exists some h > 0 such that |f(x) − ℓ| < 1 for all x ∈ A with 0 < |x − a| < h. By the triangle inequality, this implies that |f(x)| < 1 + |ℓ| for all such x. Set now L = 1 + |ℓ| if a ∉ A, and L = max{1 + |ℓ|, |f(a)|} otherwise. Then it is easy to see that |f(x)| ≤ L for all x ∈ (a − h, a + h) ∩ A. □

In the previous examples we often applied some basic rules, such as: the limit of a sum of functions is the sum of the limits, the limit of a product is the product of the limits, and the limit of a quotient is the quotient of the limits, provided that the particular limits exist and do not lead to one of the following expressions: 0/0, ∞/∞, 0 · ∞, and ∞ − ∞. These are called "indeterminate forms", see also the discussion in 5.2.14; their full list includes three more types: 0^0, 1^∞ and ∞^0. To address many indeterminate forms (e.g., of type 0/0, ∞/∞, 0 · ∞, etc.), one uses the so-called "l'Hopital's rule", which we will examine later (see 5.3.10). However, let us begin with an example that demonstrates a pedestrian approach to dealing with an indeterminate form of type 0/0.

5.B.31. Show that lim_{x→0} sin(x)/x = 1.
Solution. Consider the first quadrant of the unit circle S¹ and an arbitrary point P(x) = [cos(x), sin(x)] on it, where x runs through the open interval (0, π/2) ⊂ R. The length of the arc joining the points P(x) and C := [1, 0] equals x, so we have sin(x) < x for all x ∈ (0, π/2); see also the figure given at the r.h.s. On the other hand, the value tan(x) is the distance between the points Q(x) = [1, tan(x)] and C, and we see that x < tan(x) for all x ∈ (0, π/2). Altogether,
sin(x) < x < sin(x)/cos(x),
which implies that
cos(x) < sin(x)/x < 1,

CHAPTER 5. ESTABLISHING THE ZOO

Topological characterization of continuity

Theorem. Let f : A ⊂ R → R be a function defined on an interval A. Then:
(1) f is continuous if and only if the inverse image f⁻¹(U) of every open set U ⊂ R is an open set in A.
(2) If f is continuous, then the inverse image f⁻¹(W) of every closed set W ⊂ R is a closed set in A.
(3) If f is continuous, then the image f(K) of every compact set K ⊂ A is a compact set.
(4) If f is continuous, then f attains both its maximum and its minimum on every compact set K.

Proof. (1) Consider a point x0 ∈ f⁻¹(U). There is a neighbourhood O of f(x0) contained in U, since U is open. Hence there is a neighbourhood O′ of x0 which is mapped into O, and thus is contained in the inverse image. Therefore, every point of the inverse image is an interior point, i.e., f⁻¹(U) is open. Conversely, if f⁻¹(U) is open for each open U, then taking any ε-neighbourhood of f(x0), its pre-image is an open neighbourhood of x0 satisfying the condition from the definition of continuity.
(2) Consider a limit point x0 of the inverse image f⁻¹(W) and a sequence xi, f(xi) ∈ W, which converges to x0. From the continuity of f, it follows that f(xi) converges to f(x0) (cf. the convergence test 5.2.15). Since W is closed, f(x0) ∈ W. Thus all limit points of the inverse image of the set W are contained in f⁻¹(W).
(3) Choose any open cover Uα of f(K). The inverse images of all the Uα are open and thus create an open cover of the set K. Select a finite subcover from it. Then finitely many of the corresponding images cover the original set f(K).
(4) Since the image of a compact set is again a compact set, the image must be bounded, and it contains both its supremum and its infimum. Hence these must also be the maximum and the minimum, respectively. □

Notice that a complex valued function is continuous if and only if its real and imaginary components are continuous. The first three claims of the theorem remain valid with a very slight modification of the proof. (Check it yourselves!)

5.2.19. There are two very useful consequences of the previous theorem.

for all x ∈ (0, π/2). Invoking now the squeeze theorem, one deduces that lim_{x→0+} sin(x)/x = 1. Based then on the fact that the function f(x) = sin(x)/x, defined for all x ≠ 0, is even, i.e., f(−x) = f(x), we get
lim_{x→0−} sin(x)/x = lim_{x→0+} sin(x)/x = 1.
This shows that both one-sided limits exist and have the same value, hence our claim follows. Notice that there exist other ways to prove the statement. □

5.B.32. Based on the result from 5.B.31, show that:
(a) lim_{x→0} (1 − cos(x))/x = 0;
(b) lim_{x→0} (1 − cos(x))/(x² sin(x²)) = ∞. ⃝

5.B.33. Compute lim_{x→x0} f(x) for the following cases:
(a) f(x) = (x − 2)/√(x² − 4), and x0 = 2,
(b) f(x) = sin(sin(x))/x, and x0 = 0,
(c) f(x) = sin²(x)/x, and x0 = 0. ⃝

5.B.34. Explain why the limits of the following functions as x → 0 do not exist:
f(x) = 1/x if x > 0, and f(x) = 1 if x ≤ 0; g(x) = 1 if x ≥ 0, and g(x) = 0 if x < 0; h(x) = |sin(x)|/sin(x); k(x) = sign(x). ⃝

5.B.35. Suppose that f : (0, +∞) → R is a function satisfying |e^x f(x) − 2e^x| ≤ |sin(e^x)| for any x ∈ (0, +∞). Show that lim_{x→+∞} f(x) = 2.
Solution. The given relation |e^x f(x) − 2e^x| ≤ |sin(e^x)| can be equivalently written as
−|sin(e^x)| ≤ e^x f(x) − 2e^x ≤ |sin(e^x)|,
or
2e^x − |sin(e^x)| ≤ e^x f(x) ≤ |sin(e^x)| + 2e^x,
or
2 − |sin(e^x)|/e^x ≤ f(x) ≤ |sin(e^x)|/e^x + 2. (∗)
Now, it is easy to prove that lim_{x→+∞} |sin(e^x)|/e^x = 0, which the graph of this function also suggests.

CHAPTER 5. ESTABLISHING THE ZOO

Maxima and minima of continuous functions9

Corollary. Let f : R → R be continuous. Then
(1) the image of every interval is again an interval,
(2) f takes all the values between the maximal and the minimal one on the closed interval [a, b].

Proof.
(1) Consider an open interval A, and suppose there is a point y ∈ R such that f(A) contains points less than y as well as points greater than y, but y ∉ f(A). Put B1 = (−∞, y) and B2 = (y, ∞). These are open sets, and the union of their inverse images A1 = f⁻¹(B1) ⊂ A and A2 = f⁻¹(B2) ⊂ A contains A. A1 and A2 are open, disjoint, and they both have a non-empty intersection with A. Thus there is a point x ∈ A which does not lie in A1 but is a limit point of A1. It lies in A2, which is impossible for two disjoint open sets. Thus it is proved that if there is a point y which does not belong to the image of the interval, then either all of the values must be above y, or they all must be below y. It follows that the image is again an interval. Notice that the boundary points of this interval may or may not lie in the image. If the domain interval A contains one of its boundary points, then the continuous function must map it to a limit point or an interior point of the image of the interior of A. This verifies the statement.
(2) This statement immediately follows from the previous one (and the above theorem), since the image of a closed bounded interval (i.e. a compact set) is again a closed interval. □

5.2.20. We conclude this introductory discussion with two more theorems which provide useful tools for calculating limits. Notice that we assume that the functions are defined on all of R. Actually, we are only interested in f on a neighbourhood of one point a, while g has to be defined on a neighbourhood of one point b only.

Limits of composite functions

Theorem. Let f, g : R → R be two functions and lim_{x→a} f(x) = b.
(1) If the function g is continuous at the point b, then
lim_{x→a} g(f(x)) = g(lim_{x→a} f(x)) = g(b).
(2) If the limit lim_{y→b} g(y) exists and f(x) ≠ b holds for all x ≠ a from some neighbourhood of the point a, then
lim_{x→a} g(f(x)) = lim_{y→b} g(y).

9This result is usually called the Weierstrass theorem, but it is also known (especially in Czech literature) as Bolzano's theorem. Bernard Bolzano apparently used such a result as a technical lemma when proving his Bolzano–Weierstrass theorem mentioned earlier.

For instance, via the squeeze theorem one can show that lim_{y→+∞} sin(y)/y = 0, and then the substitution y = e^x yields our claim; or directly via Sage, type: lim(sin(e^x)/e^x, x=+oo). Hence we also deduce that
lim_{x→+∞} (2 − |sin(e^x)|/e^x) = 2 = lim_{x→+∞} (|sin(e^x)|/e^x + 2),
and by (∗) we are done via the squeeze theorem. □

5.B.36. Let g be the function defined by
g(x) = lim_{α→+∞} (((2x − 3)α² + 4xα + 2)/(x + α)) · sin(1/α).
Show that the graph of g is a line which is a median of the triangle ABC with corners the points A = [1, −1], B = [6, 2], C = [−2, 0]. ⃝

So far we have tried to convince the reader to use common sense when reasoning about the limit behaviour of functions, and to avoid any mindless following of rules. On the other hand, beyond l'Hopital's rule, which is based on the notion of derivatives and will be discussed later (see 5.3.10), there are diverse useful "rules" or "tricks" for dealing with indeterminate expressions which are not necessarily based on derivatives. Let us mention some of them.
a) Recall that if f(x) > 0 for all x in the domain of f, then
f(x)^g(x) = e^(ln(f(x)^g(x))) = e^(g(x)·ln(f(x))).
Since the function e^x is continuous and injective on R, we get
lim_{x→x0} f(x)^g(x) = e^(lim_{x→x0} (g(x) ln(f(x)))),
which equals e^(lim_{x→x0} g(x) · lim_{x→x0} ln(f(x))) whenever the two latter limits exist. Moreover, it is easy to prove that
lim_{x→x0} (g(x) · ln f(x)) = a ∈ R ⟹ lim_{x→x0} f(x)^g(x) = e^a,
lim_{x→x0} (g(x) · ln f(x)) = +∞ ⟹ lim_{x→x0} f(x)^g(x) = +∞,
lim_{x→x0} (g(x) · ln f(x)) = −∞ ⟹ lim_{x→x0} f(x)^g(x) = 0.
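As an illustration of rule a) (a minimal Sage sketch with our own examples), the indeterminate forms 0^0 and 1^∞ can be resolved by computing the limit of g(x)·ln(f(x)) first:

# 0^0 type: x*ln(x) -> 0 as x -> 0+, hence x^x -> e^0 = 1
print(limit(x*ln(x), x=0, dir='right'))   # 0
print(limit(x^x, x=0, dir='right'))       # 1
# 1^oo type: x*ln(1 + 1/x) -> 1, hence (1 + 1/x)^x -> e
print(limit((1 + 1/x)^x, x=oo))           # e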
b) Some further useful rules to remember are these:
lim_{x→+∞} c/x^α = 0, lim_{x→+∞} x^α/x^β = 0, lim_{x→+∞} x^β/a^x = 0, lim_{x→+∞} a^x/b^x = 0,
for c ∈ R, 0 < α < β, 1 < a < b. You may try to verify them (a way to determine the third limit is based on l'Hopital's rule, for example).

5.B.37. Compute the following limits:
(a) lim_{x→+∞} (3^(x+1) + x⁵ − 4^x)/(3^x + 2^x + x²),
(b) lim_{x→+∞} (4^x − 8x⁶ − 2^x − 167)/(3^x − 45x − √11 · π^(x+12)),
(c) lim_{x→0} (√(1 + x) − √(1 − x))/x,
(d) lim_{x→π/4} (cos(x) − sin(x))/cos(2x). ⃝

CHAPTER 5. ESTABLISHING THE ZOO

Proof. The first proposition can be proved similarly to 5.2.17(4). From the continuity of g at the point b, it follows that for any neighbourhood V of the value g(b), we can find a sufficiently small neighbourhood U of the point b whose g-values lie in V. However, if f has limit b at the point a, then f hits U with all its values f(x) for x ≠ a from some sufficiently small neighbourhood of the point a, which verifies the first statement.
(2) Even if we cannot use the continuity of g at the point b, the previous reasoning remains valid if we ensure that all points x ≠ a from sufficiently small neighbourhoods of a are mapped by f into a neighbourhood of b, with f(x) ≠ b for all such points. □

5.2.21. Who or what is in the ZOO. We have begun to build a menagerie of functions with polynomials and functions which can be created from them "piece-wise". Moreover, we have derived many properties for the huge class of continuous functions. However, except for polynomials, we do not have many practically manageable examples at our disposal. We now consider quotients of polynomials. Let f and g be two polynomials in the real variable x with complex coefficients ai ∈ C. The function
h : R \ {x ∈ R, g(x) = 0} → C, h(x) = f(x)/g(x),
is well-defined for all real x except for the roots of the polynomial g. Such functions are called rational functions. From theorem 5.2.17 it follows that rational functions are continuous at all points of their domains. At the points where the denominators of real rational functions vanish, they can have
• a finite limit,
• an infinite limit, supposing the one-sided infinite limits are equal,
• different one-sided infinite limits.
For the case of a finite limit, it is necessary that the point is a common root of both f and g and that its multiplicity in f is at least as large as its multiplicity in g. Then the domain of the rational function can be extended by this point, defining the function to take the value of the limit there. The new function is then continuous at this point, too. The possibilities are illustrated in the diagram showing the values of the function
h(x) = ((x − 0.05a)(x − 2 − 0.2a)(x − 5))/(x(x − 2)(x − 4))

5.B.38. Consider a polynomial function f : R → R whose degree is determined by the result of rolling a die. Find the probability P(A), where A denotes the following event:
lim_{x→+∞} (x⁴ + x² + 1)/f(x) = 0.
Solution. We have f(x) = an x^n + a(n−1) x^(n−1) + · · · + a1 x + a0, where by assumption n ∈ {1, 2, . . . , 6} (with an ≠ 0, so f(x) ≠ 0 for large x). Thus one computes
lim_{x→+∞} (x⁴ + x² + 1)/f(x) = lim_{x→+∞} (x⁴ + x² + 1)/(an x^n + a(n−1) x^(n−1) + · · · + a1 x + a0) = lim_{x→+∞} x⁴/(an x^n).
Hence we seek the probability of having lim_{x→+∞} x⁴/(an x^n) = 0. It is now reasonable to consider the following cases:
1) For n = 1, 2, 3 we get lim_{x→+∞} x⁴/(an x^n) = ±∞.
2) For n = 4 we get lim_{x→+∞} x⁴/(a4 x⁴) = 1/a4 ≠ 0.
3) For n = 5, 6 we obtain lim_{x→+∞} x⁴/(an x^n) = 0.
Therefore, the favourable cases are those with n = 5, 6. Since the sample space is Ω = {1, 2, . . . , 6}, we deduce that the probability in question equals P(A) = 2/6 = 1/3. □

5.B.39. Consider the functions f(x) = e^(4x+2) and g(x) = ln(x²), defined on R\{0} and [1, e⁸], respectively.
(a) Show that the composition φ = g ◦ f is defined, and present its explicit form. Confirm your result via Sage and the command compose.
(b) Evaluate the limit
L = lim_{x→0} (φ(x) − sin³(x) − 4)/x.
Solution. (a) Let us denote by A = R\{0} and B = [1, e⁸] the two given domains. The composition φ(x) = g(f(x)) is defined if the set X := {x ∈ A : f(x) ∈ B} is non-empty, i.e., X ≠ ∅. We compute
X = {x ∈ R∗ : 1 ≤ e^(4x+2) ≤ e⁸} = {x ∈ R∗ : e⁰ ≤ e^(4x+2) ≤ e⁸} = {x ∈ R∗ : 0 ≤ 4x + 2 ≤ 8} = {x ∈ R∗ : −1/2 ≤ x ≤ 3/2} = [−1/2, 0) ∪ (0, 3/2],
where R∗ = R\{0}. Therefore, based on the basic properties of logarithms, we get φ : X → R given by
φ(x) = g(f(x)) = ln(f(x)²) = ln(e^(2(4x+2))) = 2(4x + 2).

CHAPTER 5. ESTABLISHING THE ZOO

for a = 0 on the left (thus it is the simpler rational function (x − 5)/(x − 4)) and for a = 5/3 on the right. Notice that the situation gets more complicated for complex rational functions. Their real and imaginary parts are again (real) rational functions (see the exercise ??), and the above options hold only for each of the components separately.

5.2.22. Power functions and the exponential. We have met the simple power functions x → x^n with natural exponents n = 0, 1, 2, . . . when building the polynomials. The meaning of the function x → x⁻¹, defined for all x ≠ 0, is also clear. Now we extend this definition to a general power function x^y with an arbitrary (constant) y ∈ R and x > 0. Changing our mind about constants and variables, we obtain the values of the exponential function x^y with arbitrary (constant) base x > 0. With natural y and z, we know the rule
(1) x^(y+z) = x^y · x^z
and its consequence
(2) x^(y·z) = (x^y)^z.
We shall extend this definition to all positive real x > 0 and all real y.

Exponential and power functions

Theorem. There is a unique function f(y) = x^y, defined for all x > 0, y ∈ R, separately continuous in both variables x and y (i.e. we consider the other one as a constant when checking the continuity) and satisfying (1), f(0) = 1, and f(1) = x. This function also satisfies (2). We call this function the exponential function with the base x, or the power function with exponent y.

Proof. Of course, we want to extend the well known values of the powers x^n with x rational and n integral, which are also direct consequences of (1) (and then automatically satisfy (2)). For a negative integer −a, we have to define (in view of (1)) x^(−a) = (x^a)⁻¹ = (x⁻¹)^a. Further, we want the equality b^n = x for n ∈ N to imply that b is the n-th root of x, i.e. b = x^(1/n) (again a consequence of

To verify this very last computation in Sage, use the cell
f(x)=e**(4*x+2); ln(f(x)^2).simplify_full()
which returns 8*x + 4. The same is returned by the method suggested in the statement, via the command compose, which we implement by the cell
f(x)=e^(4*x+2); g(x)=ln(x^2)
h = compose(g, f)
show(h(x).simplify_full())
(b) The explicit form of φ is known from (a), which makes it easy to compute the limit at hand:
L = lim_{x→0} (8x − sin³(x))/x = lim_{x→0} (8 − sin²(x) · sin(x)/x) = 8 − lim_{x→0} sin²(x) · lim_{x→0} sin(x)/x = 8 − 0 · 1 = 8.
□

We have exploited the continuity of several functions in an implicit or explicit way many times already. Now it is time to return to the essence and play a little with the concept itself.

5.B.40. The figure below shows the graph of a function y = f(x). Indicate the points where f is discontinuous and explain why.
[figure: graph of y = f(x), with x-axis ticks 0–5]
Solution. There is a discontinuity at x = 2, since f is not defined there (the break in the graph is represented by the small circle). Another discontinuity appears at x = 3 (a jump discontinuity). Indeed, although f(3) is defined (it is a negative number), we see that the left and right limits are different, so lim_{x→3} f(x) does not exist. Finally, according to the figure, at x = 5 the value f(5) is defined and moreover lim_{x→5} f(x) exists. However, we see that lim_{x→5} f(x) ≠ f(5). Thus f is discontinuous also at x = 5 (we may refer to such a discontinuity by the term removable discontinuity). □

5.B.41. Sketch the graph of a function which is continuous everywhere, except for a removable discontinuity at x = 1, a jump discontinuity at x = 3, and a discontinuity at x = 4 which is not of the two previous types.
Solution. There are of course many such functions. An example is given here:

CHAPTER 5. ESTABLISHING THE ZOO

(1), since this requires (x^(1/n))^n = x^(n·(1/n)) = x). We verify that such b's always exist for positive real numbers x. By factoring out y2 − y1 in y2^n − y1^n, or otherwise, we see that the function y → y^n is strictly increasing for y > 0. Choose x > 0 and consider the set B = {y ∈ R, y > 0, y^n ≤ x}. This is a non-empty set bounded from above, so set b = sup B. A power function with natural exponent n is continuous, thus b^n = x. Indeed, surely b^n ≤ x. If the inequality were strict, then there would be a number y such that b^n < y^n < x, which implies that b < y. This contradicts the definition of b as a supremum.
Thus the power function is suitably defined for all rational exponents a = p/q. For x > 0, we set x^a = (x^p)^(1/q) = (x^(1/q))^p. Finally, we notice that for the values 0 < a ∈ Q and fixed x > 1, x^a is strictly increasing in the rational a. Therefore, for general 0 < a ∈ R and 1 < x we can define
x^a = sup{x^y, y ∈ Q, y ≤ a}.
As before, x^(−a) = 1/x^a. For 0 < x < 1, proceed analogously with care for the inequality signs, or set x^a = (1/x)^(−a). For x = 1, define 1^a = 1 for any a, while 0^a = 0. Now the power function x → x^a is defined for all x ∈ [0, ∞) and a ∈ R.
Notice that the requested continuity in both parameters fixed all our choices above. In particular, our function is continuous also in the parameter a, and we have constructed also the exponential functions c^y for constants c > 0. The property (1), used when defining the power function x^a for integral a, implied also (2), and they both manifestly survived through the construction for any choices of the values x, y. □

5.2.23. Logarithmic functions. The exponential function f(x) = a^x is increasing for a > 1 and decreasing for 0 < a < 1. Thus in both cases there is a function f⁻¹(x) inverse to it. This function is called the logarithmic function with base a. We write ln_a(x), and the defining property is ln_a(a^x) = x. The equalities 5.2.22(1),(2) are thus equivalent to
ln_a(x · y) = ln_a(x) + ln_a(y), ln_a(x^y) = y · ln_a(x).
Logarithmic functions are defined only for positive arguments and they are, on the entire domain, increasing if a > 1 and decreasing for 0 < a < 1. Moreover, ln_a(1) = 0 holds for every a.
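Returning to 5.B.41: since the book's picture is reproduced below only as a placeholder, here is one concrete choice of such a function (our own, one of many) together with a Sage cell to draw it. The first branch has a removable discontinuity (a hole) at x = 1, the second creates a jump at x = 3, and the oscillating branch has no one-sided limit from the right of x = 4, so the discontinuity there is of neither of the two previous types.

f1(x) = (x^2 - 1)/(x - 1)    # equals x + 1 away from the hole at x = 1
f2(x) = x                    # jump at x = 3: left limit 4, value f(3) = 3
f3(x) = sin(1/(x - 4))       # oscillates: no limit from the right of x = 4
p  = plot(f1, (x, -1, 3), exclude=[1])
p += plot(f2, (x, 3, 4))
p += plot(f3, (x, 4.001, 6), plot_points=500)
show(p)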
[figure: graph of the example function, with discontinuities at x = 1, 3, 4]
Notice the discontinuity at x = 4: can you analyze why this differs from the previous two? □

5.B.42. Explain why the functions f, g : (0, +∞) → R defined by f(x) = x^x and g(x) = x^cos(x), respectively, are continuous. ⃝

5.B.43. If f, g : R → R are continuous functions with f(2) = 6 and lim_{x→2} (4f(x) − g(x)) = 2, find g(2). ⃝

5.B.44. Present an example of a function defined on R which is everywhere continuous except at x = 0. ⃝

5.B.45. Consider the function f : R → R given by f(x) = x if x ∈ Q, and f(x) = −x if x ∈ R\Q. Determine the points x0 ∈ R, if any, where f is continuous.
Solution. The given function f is continuous only at x0 = 0. For a confirmation, consider any sequence (xn) of R with xn → 0 and xn ≠ 0 for all n. Then obviously (−xn) → 0 as well, and hence f(xn) → 0. The claim now follows by combining the definition of continuity of a given function at a chosen point with Proposition 5.2.15. On the other hand, for any non-zero real number a we can find sequences (xn) and (yn) with xn → a and yn → a which, however, satisfy f(xn) → a and f(yn) → −a, respectively. This implies that f is discontinuous at any a ∈ R\{0}. □

5.B.46. Determine the interval(s) on which each of the following functions is continuous: f(x) = sin(1/x) if x ≠ 0, and f(0) = 0; g(x) = x sin(1/x) if x ≠ 0, and g(0) = 0. ⃝

5.B.47. Consider the function
g(x) = cot(x)/(π − 2x) + x, if x ∈ (0, π/2); g(x) = (2x + 1)/2, if x ∈ [π/2, +∞).
Show that g(x) is continuous at x0, where x0 is the value of the determinant |A|, and A ∈ Mat4(R) is a real 4 × 4 invertible matrix satisfying A = (1/|A|)B, for some square matrix B ∈ Mat4(R) with det(B) = |B| = (π/2)⁵.

CHAPTER 5. ESTABLISHING THE ZOO

There is an extremely important value of a, namely the number e, see the paragraph 5.4.1, also known as Euler's number. The function ln_e(x) is called the natural logarithm and is denoted by ln(x) (i.e., omitting the base e).

3. Derivatives

When talking about polynomials, the rate at which the function changes at a given point of its domain was already discussed (see the paragraph 5.1.6). It is the quotient 5.1.6(1), which expresses the slope of the secant line between the points [x, f(x)] ∈ R² and [x + δx, f(x + δx)] ∈ R² for a (small) change δx of the input variable. This reasoning is correct for any real or complex function f. It is only necessary to work with the concept of the limit, instead of the intuitive "small change" δx.

Derivative of a function of a real variable

5.3.1. Definition. Let f be a real or complex function defined on an interval A ⊂ R and x0 ∈ A. If the limit
lim_{x→x0} (f(x) − f(x0))/(x − x0) = a
exists, the value a of the derivative at x0 is denoted by f′(x0) or df/dx(x0) or (d/dx)f(x0). If a is finite, the derivative is also sometimes called proper. If a is infinite, it is improper. If x0 is one of the boundary points of A, we arrive at one-sided derivatives (i.e., the left-sided derivative and the right-sided derivative).
If a function has a finite derivative at x0, the function is said to be differentiable at x0. A function which is differentiable at every point of a given interval is said to be differentiable on the interval.

Obviously, the derivative of a complex valued function f(x) + i g(x) exists if and only if the derivatives of both the real and imaginary parts f and g exist (see the elementary properties of limits). Then (f(x) + i g(x))′ = f′(x) + i g′(x).

Solution.
By assumption, A and B are real 4 × 4 matrices satisfying the relation B = |A| A. In Chapter 2 we saw that det(λC) = λ^n det(C) for any real n×n matrix C and λ ∈ R, see ??. Thus we get |B| = ||A| A| = |A|⁴ |A| = |A|⁵, that is, (π/2)⁵ = |A|⁵, or equivalently |A| = π/2. Hence x0 = π/2, and we see that
lim_{x→π/2+} g(x) = lim_{x→π/2+} (x + 1/2) = π/2 + 1/2,
lim_{x→π/2−} g(x) = lim_{x→π/2−} cot(x)/(π − 2x) + lim_{x→π/2−} x = (1/2) lim_{x→π/2−} tan(π/2 − x)/(π/2 − x) + π/2 = π/2 + 1/2.
Here we have used the trigonometric identity cot(x) = tan(π/2 − x) (try to prove this), and moreover that
lim_{y→0} tan(y)/y = 1, (∗)
where one should replace π/2 − x by y, so that x → π/2− corresponds to y → 0+. Therefore, the one-sided limits coincide, and since they equal g(π/2) = π/2 + 1/2, g is continuous at x0. To be complete, let us prove (∗). For this, one uses the result from 5.B.31 and the fact that the limit of a product is the product of the limits, if both limits are defined:
lim_{y→0} tan(y)/y = lim_{y→0} (sin(y)/cos(y))/y = lim_{y→0} sin(y)/y · lim_{y→0} 1/cos(y) = 1 · 1/cos(0) = 1. □

Given a real number x ∈ R, its floor (also called the integer part of x), denoted by ⌊x⌋ (or by [x]), is the greatest integer which is less than or equal to x, i.e., ⌊x⌋ = max{a ∈ Z : a ≤ x}. Equivalently, we can view ⌊x⌋ as the unique integer satisfying ⌊x⌋ ≤ x < ⌊x⌋ + 1. Hence, for example, we have ⌊x⌋ = 1 for 1 ≤ x < 2, ⌊x⌋ = 2 for 2 ≤ x < 3, etc. The floor function or greatest integer function is then defined by f(x) = ⌊x⌋ ∈ Z, for all x ∈ R. Obviously, for any integer n ∈ Z we have f(n) = ⌊n⌋ = n. In the same vein one can define the "ceiling function" (also known as the "least integer function") x → ⌈x⌉, where ⌈x⌉ is the smallest integer ≥ x, called the "ceiling of x", i.e., ⌈x⌉ = min{a ∈ Z : a ≥ x}. In addition, {x} = x − ⌊x⌋ is called the "fractional part of x", and the assignment x → {x} induces the so-called "fractional part function". Obviously, 0 ≤ {x} < 1, and for non-negative real numbers the fractional part is just the part of the number after the decimal point (but this is not the case for negative real numbers). Nevertheless, together with the floor function, to which we devote some space in 5.B.48, these functions are among the simplest examples of discontinuous functions, and they can be quite useful in constructing wilder examples. Moreover, they all have important applications in integral calculus, in number theory and in computer science. Hence we will meet them again in the next chapter and in Chapter 11, the latter devoted to number theory. For more details on the functions ⌈x⌉ and {x} we refer to the problem 5.E.57 in the final section.
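As a quick illustration (our own mini-examples), Sage offers the commands floor, ceil and frac directly; the last line below shows the caveat about negative numbers, since frac(x) = x − ⌊x⌋:

print(floor(9/2), ceil(9/2))   # 4 5
print(frac(3/4))               # 3/4, the part after the decimal point
print(frac(-13/10))            # 7/10 = -13/10 - (-2), not 3/10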
CHAPTER 5. ESTABLISHING THE ZOO

Derivatives are handled rather easily, but it takes time to derive the proper formulae for derivatives of the elementary functions in our zoo. Therefore, we present a table of derivatives of several such functions in advance. In the last column, there are references to the corresponding paragraphs where the results are proved. Notice that even though we are unable to express inverse functions to some of our functions by elementary means, we can calculate their derivatives; see 5.3.6.

Derivatives of some functions
function | domain | derivative | ref.
polynomials f(x) | whole R | f′(x) is again a polynomial | 5.1.6
cubic splines h(x) | whole R | continuous second derivatives | 5.1.9
rational functions f(x)/g(x) | whole R, except for roots of g | rational functions: (f′(x)g(x) − f(x)g′(x))/g(x)² | 5.3.5
power functions f(x) = x^a | interval (0, ∞) | f′(x) = a·x^(a−1) | 5.3.7
exponential functions f(x) = a^x, a > 0, a ≠ 1 | whole R | f′(x) = ln(a)·a^x | 5.3.7
logarithmic functions f(x) = ln_a(x), a > 0, a ≠ 1 | interval (0, ∞) | f′(x) = (ln(a))⁻¹ · 1/x | 5.3.7

The initial idea of the definition suggests that f′(x0) allows an approximation of the function f by the straight line y = f(x0) + f′(x0)(x − x0). This is the meaning of the following lemma, which says that replacing the constant coefficient f′(x0) in the line's equation with a suitable continuous function gives exactly the values of f. The difference between ψ(x) and ψ(x0) on a neighbourhood of x0 then says how much the slopes of the secant lines and the tangent line at x0 differ.

Lemma. A real or complex function f(x) has a finite derivative at x0 if and only if there is a neighbourhood O of x0 and a function ψ defined on this neighbourhood which is continuous at x0 and such that for all x ∈ O,
f(x) = f(x0) + ψ(x)(x − x0).
Furthermore, ψ(x0) = f′(x0), and f is continuous at the point x0.

Proof. If ψ exists, it is of the form
ψ(x) = (f(x) − f(x0))/(x − x0) for all x ∈ O \ {x0}.

5.B.48. The floor function. Let f(x) = ⌊x⌋ be the floor function.
(1) Compute f(−0.4), f(0.4), f(0.99), f(3/2), f(√7), f(3), f(π), f(9/2) and f(19/2).
(2) Show that f(x + n) = f(x) + n, for all x ∈ R and n ∈ Z. Next present an example confirming this relation.
(3) Show that the relation f(nx) = nf(x), with x ∈ R and n ∈ Z, is false in general.
(4) Deduce that lim_{x→2+} f(x) ≠ lim_{x→2−} f(x). More generally, prove that f is discontinuous at any integer n ∈ Z.
(5) Use Sage to sketch the graph of the floor function for −5 ≤ x ≤ 5.
Solution. (1) Obviously, ⌊−0.4⌋ = −1, ⌊0.4⌋ = 0 = ⌊0.99⌋, ⌊3/2⌋ = 1, ⌊√7⌋ = 2, ⌊3⌋ = 3 = ⌊π⌋, ⌊9/2⌋ = 4, and ⌊19/2⌋ = 9. Notice that in Sage the floor function corresponds to the command floor. Hence, to verify some of our previous computations, one may type (and similarly for the cases omitted):
print(floor(-0.4)); print(floor(sqrt(7)))
print(floor(pi)); print(floor(19/2))
(2) We need to prove that ⌊x + n⌋ = ⌊x⌋ + n for all x ∈ R and n ∈ Z. By definition, we have ⌊x⌋ ≤ x < ⌊x⌋ + 1, and by adding n to all parts we get ⌊x⌋ + n ≤ x + n < ⌊x⌋ + n + 1. Recall however that for all x ∈ R and m ∈ Z the relations ⌊x⌋ = m and m ≤ x < m + 1 are equivalent. Thus, setting m = ⌊x⌋ + n, the above inequality reads m ≤ x + n < m + 1, and hence ⌊x + n⌋ = m = ⌊x⌋ + n, as required. As an example, we have ⌊9/2⌋ = ⌊0.5 + 4⌋ = ⌊0.5⌋ + 4 = 0 + 4 = 4.
(3) As a counterexample, notice that ⌊3 · (1/3)⌋ = ⌊1⌋ = 1, but 3 · ⌊1/3⌋ = 3 · 0 = 0.
(4) By definition, we have ⌊x⌋ = 2 for 2 ≤ x < 3. Thus lim_{x→2+} ⌊x⌋ = lim_{x→2+} 2 = 2. Similarly, since ⌊x⌋ = 1 for 1 ≤ x < 2, we have lim_{x→2−} ⌊x⌋ = lim_{x→2−} 1 = 1. This shows that f is discontinuous at x = 2, and similarly at any integer n ∈ Z, i.e., lim_{x→n+} ⌊x⌋ = n = f(n) ≠ lim_{x→n−} ⌊x⌋ = n − 1.
(5) Let us now present the graph of the floor function.
For convenience, we will use small circles to indicate the jump discontinuities of f at the integers lying in [−5, 5]. We can do this manually in Sage by combining first the commands floor and plot, to sketch f(x) = ⌊x⌋, and next the commands point and circle to draw the discontinuity points. This is encoded by the following cell:

CHAPTER 5. ESTABLISHING THE ZOO

Suppose f′(x0) is the proper derivative. Then we can define the value at the point x0 as ψ(x0) = f′(x0). Certainly,
lim_{x→x0} ψ(x) = f′(x0) = ψ(x0).
Thus ψ is continuous at x0, as desired. On the other hand, if such a function ψ exists, the same procedure calculates its limit at x0. Thus the derivative f′(x0) exists as well and equals ψ(x0). The function f is expressed in terms of the sum and product of functions continuous at x0. Thus f is continuous at x0. □

5.3.2. Geometrical meaning of the derivative. The previous lemma leads to a geometric interpretation of the derivative in terms of the slope of the secant lines of the graph of a real function f through [x0, f(x0)]. The derivative exists if and only if the slope of the secant line through the points [x0, f(x0)] and [x, f(x)] changes continuously when approaching the argument x = x0. If so, the limit of this slope is the value of the derivative. This observation leads to the important corollary:

Functions increasing and decreasing at a point

Corollary. If a real-valued function f has derivative f′(x0) > 0 at a point x0 ∈ R, then there is a neighbourhood O(x0) such that f(x) > f(x0) for all points x ∈ O(x0), x > x0, and f(x) < f(x0) for all x ∈ O(x0), x < x0. On the other hand, if the derivative satisfies f′(x0) < 0, then there is a neighbourhood O(x0) such that f(x) < f(x0) for all points x ∈ O(x0), x > x0, and f(x) > f(x0) for all x ∈ O(x0), x < x0.

Proof. Suppose f′(x0) > 0. By the previous lemma, f(x) = f(x0) + ψ(x)(x − x0) and ψ(x0) > 0. Since ψ is continuous at x0, there exists a neighbourhood O(x0) on which ψ(x) > 0. If x increases, x > x0, then f(x) increases as well, f(x) > f(x0). Analogously for x < x0. The case of a negative derivative is proved similarly. □

A real function is called increasing at a point x0 of its domain if, for all points x of some neighbourhood of x0, f(x) > f(x0) when x > x0 and f(x) < f(x0) when x < x0. A real function is increasing on an interval A if f(x) − f(y) > 0 for all x > y, x, y ∈ A. Similarly, a function is said to be decreasing at a point x0 if and only if there is a neighbourhood of x0 such that f(x) < f(x0) for all x > x0, while f(x) > f(x0) for all x < x0 from this neighbourhood. A function is decreasing on an interval A if f(x) − f(y) < 0 for all x > y, x, y ∈ A. Thus a function having a non-zero finite derivative at a point is either increasing or decreasing at that point, according to the sign of the derivative.

f=floor(x)
p=plot(f, x, -5, 5, color="steelblue", ticks=[1,1], legend_label=r"$\text{floor function}$")
for x in [-5..5]:
    p+=point([x,x], size=30, color="black")
    p+=circle((x,x-1), 0.08, color="black")
show(p)

CHAPTER 5. ESTABLISHING THE ZOO

A function increasing on an interval is increasing at each of its points. The converse is true as well. In order to see this, assume that f is increasing at all points of the interval A. Consider two points x < y in A with f(y) ≤ f(x). By the assumption, there is a δ-neighbourhood of y on which z < y implies f(z) < f(y). Let δ0 be the supremum of all such δ ≤ y − x, and set w = y − δ0. Then f(w) cannot be larger than f(y) (there would be such a point on the right of it too, which is excluded).
But, unless w = x, w is a limit point of a sequence of points less than w at which the values of f are at least f(y) ≥ f(w). This is a contradiction with f being increasing at w. And if w were x, then f(z) < f(y) ≤ f(x) for points z > x arbitrarily close to x, contradicting the assumption that f is increasing at x. The same arguments work for decreasing functions. The following is now proved:

Proposition. A real function is increasing or decreasing on an open interval A if and only if it is increasing or decreasing at each of its points, respectively.

5.3.3. Examples. (1) There is a function which is increasing at the origin x0 = 0 but is neither increasing nor decreasing on any neighbourhood of x0. Consider the (continuous) function
f(x) = x + 5x² sin(1/x), f(0) = 0.
The choice of f(0) makes f a continuous function on R (sin is a bounded function with values between −1 and 1). Its derivative at zero exists too:
lim_{x→0} (x + 5x² sin(1/x))/x = lim_{x→0} (1 + 5x sin(1/x)) = 1.
For x ≠ 0, f′(x) = 1 + 10x sin(1/x) − 5 cos(1/x) (cf. the rules for computing derivatives in 5.3.4 below). The derivative is not continuous at the origin. f is increasing at x = 0 but is not increasing on any neighbourhood of this point.
(2) As another illustration of a simple usage of the relation between derivatives and the property of being an increasing (or decreasing) function, we can consider the existence of inverses to real polynomials. Polynomials of degree at least two need not be either increasing or decreasing functions. Hence we cannot anticipate a globally defined inverse function for them. On the other hand, the inverse exists to every restriction

Sage prints out the required figure (ignore the vertical lines in it, as they are not part of the graph). Notice that there exists a shorter way to sketch the graph of the floor function, which combines the floor command with the option exclude. The latter automatically removes the vertical lines, and the syntax goes as follows:
plot(floor(x), (x, -5, 5), exclude=[-5..5]) □

To solve problems involving the continuity of piecewise functions, there are generally two steps. First, prove that the individual components of the function are continuous within the intervals into which the domain is divided. Second, examine the continuity at the "gluing points". For the first step, you can often rely on the inherent continuity of elementary functions within their domains (e.g., polynomials, trigonometric functions, power functions, exponentials, logarithms, etc.).

5.B.49. Consider the 2-parameter family fα,β of piecewise functions, defined by
fα,β(x) = (√(2x² − x + 6) − αx)/(x + 2), if x < −2; fα,β(x) = x³ + βx + 4, if x ≥ −2,
with α, β ∈ R. Find the continuous members of fα,β. Next use Sage to confirm your answer, and moreover sketch the graph of these members for −10 ≤ x ≤ 10. ⃝

The continuity of a function has far-reaching consequences, such as Bolzano's theorem, which states that a continuous function which changes its sign within an interval has a zero there, see 5.2.19. The following tasks are based on this important principle, with further examples provided in Section E.

CHAPTER 5. ESTABLISHING THE ZOO

of f to an interval between adjacent roots of the derivative f′, i.e. where the derivative of the polynomial is non-zero and keeps its sign. These inverse functions will never be polynomials, except for the case of polynomials of degree one.
The equation y = ax + b implies x = (1/a)(y − b). For a polynomial of degree two, the equation y = ax² + bx + c leads to the equation
x = (−b ± √(b² − 4a(c − y)))/(2a).
Thus the inverse (given by the above equation) exists only for those x which lie in one of the intervals (−∞, −b/(2a)), (−b/(2a), ∞). It can be shown that the roots of polynomials of degree larger than four cannot, in general, be expressed by means of power functions. Thus piece-wise defined inverses to polynomials may represent new items in our zoo.

5.3.4. Elementary properties of derivatives. We introduce several basic facts about the calculation of derivatives. We shall see that derivatives are quite nicely compatible with the algebraic operations of addition and multiplication of real or complex functions. The last formula below then allows us to efficiently determine the derivatives of composite functions. It is also called the chain rule.
Intuitively, these rules can be understood very easily if we imagine that the derivative of a function y = f(x) is the quotient of the rates of increase of the output variable y and the input variable x: f′ = δy/δx.
Of course, for y = h(x) = f(x) + g(x), the increase in y is given by the sum of the increases of f and g, and the increase of the input variable is still the same. Therefore, the derivative of a sum is the sum of the derivatives.
The derivative of a product is not the product of the derivatives. For y = f(x)g(x), the increase is
δy = f(x + δx)g(x + δx) − f(x)g(x) = f(x + δx)(g(x + δx) − g(x)) + (f(x + δx) − f(x))g(x).
Now, if we make the increase δx small, we actually calculate the limit of a sum of products, which is the sum of the products of the limits. Thus the derivative of a product fg is given by the expression fg′ + f′g, which is called the Leibniz rule10.

10Gottfried Wilhelm von Leibniz (1646–1716) was a great German mathematician and philosopher. He developed the differential and integral calculus in terms of infinitesimal quantities, arguing similarly as above.

5.B.50. Determine whether the equation e^(2x) − x⁴ + 3x³ − 6x² = 5 has a positive solution.
Solution. Let us consider the function
f(x) = e^(2x) − x⁴ + 3x³ − 6x² − 5, x ≥ 0,
for which f(0) = −4. Notice that lim_{x→+∞} f(x) = lim_{x→+∞} e^(2x) = +∞. Obviously, f is continuous on the whole domain, and hence it takes all the values y ∈ [−4, ∞). Moreover, we have f(0) < 0 and f(2) = e⁴ − 21 > 0. Hence, by Bolzano's theorem, its graph necessarily intersects the positive x semi-axis, i.e., the equation f(x) = 0 has a positive solution. To confirm all these claims in Sage, use the cell
f(x)=e^(2*x)-x^4+3*x^3-6*x^2-5
show(f(0)); show(f(2))
plot(f(x), x, 0, 2, ymax=20)
sol=solve(f(x)==0, x); show(sol)
which in fact tries to solve the equation f(x) = 0 symbolically via the last command. However, in this way Sage does not present a numerical value: it returns a complicated symbolic expression for the solution (check it yourself). To avoid this problem, one may instead use the command
f(x)=e^(2*x)-x^4+3*x^3-6*x^2-5
find_root(f, 0, 2)
□

5.B.51. Determine whether the polynomial P(x) = x³⁷ + 5x²¹ − 4x⁹ + 5x⁴ − 2x − 3 has a real root in (−1, 1). ⃝

5.B.52. Use Sage and implement Bolzano's theorem to prove that the equation x³ − cos(x)e^x + x sin(x) = 0 has a positive solution in (0, π/2). Next use the find_root function to obtain a numerical approximation of this root. ⃝
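For 5.B.52, a minimal sketch of the kind of cell one might write (our own code, not the book's solution): the sign change certifies the existence of a root on (0, π/2) via Bolzano's theorem, and find_root then approximates it numerically.

f(x) = x^3 - cos(x)*e^x + x*sin(x)
print(f(0), bool(f(pi/2) > 0))   # -1 and True: a sign change on (0, pi/2)
print(find_root(f, 0, pi/2))     # numerical approximation of the root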
C. Derivatives

Having both a conceptual understanding of the notion of limits and the ability to compute limits, we are now ready to proceed with problems in calculus, the branch of mathematics devoted to derivatives and integrals of functions.6 We will first focus on derivatives. Given a function f depending on a real variable x, the slope of its graph at the point [x0, f(x0)] indicates the rate of change of the value f(x) as x approaches the point x0. This rate of change is the limiting value of the slopes (f(x) − f(x0))/(x − x0) (x ≠ x0), as x tends to x0, that is,

6A very rough notion of calculus, and in particular of integration, goes back to the ancient Greeks. Archimedes already developed the method of exhaustion to compute the volume of a cone, sphere, etc. However, calculus was essentially developed by Isaac Newton (1643–1727) and Gottfried Leibniz (1646–1716), both building on previous works by Fermat (1607–1665) and others. Later the French mathematician Augustin-Louis Cauchy (1789–1857) established a more rigorous approach to infinitesimal calculus, as we mentioned before. On the other hand, the Italian mathematician Joseph-Louis Lagrange (1736–1813) was one of the founders of the "calculus of variations", whose elements will be studied in Chapter 9.

CHAPTER 5. ESTABLISHING THE ZOO

The derivative of a composite function is even more interesting. Consider a function g = h ◦ f, where the domain of the function z = h(y) contains the codomain of the function y = f(x). By writing out the increases,
g′ = δz/δx = (δz/δy)(δy/δx).
Thus we expect the formula to be of the form (h ◦ f)′(x) = h′(f(x))f′(x). Now we provide correct formulations together with proofs:

Rules for differentiation

Theorem. Let f and g be real or complex functions defined on a neighbourhood of a point x0 ∈ R and having finite derivatives at this point. Then
(1) for every real or complex number c, the function x → c · f(x) has a derivative at x0, and (cf)′(x0) = c(f′(x0)),
(2) the function f + g has a derivative at x0, and (f + g)′(x0) = f′(x0) + g′(x0),
(3) the function f · g has a derivative at x0, and (f · g)′(x0) = f′(x0)g(x0) + f(x0)g′(x0),
(4) further, suppose h is a function defined on a neighbourhood of the image y0 = f(x0) with a derivative at y0. Then the composite function h ◦ f also has a derivative at x0, and (h ◦ f)′(x0) = h′(f(x0)) · f′(x0).

Proof. (1) and (2): A straightforward application of the theorem about sums and products of limits of functions yields the result.
(3) Rewrite the quotient of the increases (already mentioned) in the following way:
((fg)(x) − (fg)(x0))/(x − x0) = f(x) · (g(x) − g(x0))/(x − x0) + ((f(x) − f(x0))/(x − x0)) · g(x0).
The limit of this as x → x0 gives the desired result, because f is continuous at x0.
(4) By lemma 5.3.1, there are functions ψ and φ which are continuous at x0 and y0 = f(x0), respectively. Further, they satisfy
h(y) = h(y0) + φ(y)(y − y0), f(x) = f(x0) + ψ(x)(x − x0)
on some neighbourhoods of y0 and x0. They satisfy ψ(x0) = f′(x0) and φ(y0) = h′(y0). Then,
h(f(x)) − h(f(x0)) = φ(f(x))(f(x) − f(x0)) = φ(f(x))ψ(x)(x − x0)
for all x from the neighbourhood of x0. However, the product φ(f(x))ψ(x) is a function which is continuous at x0 and

lim_{x→x0} (f(x) − f(x0))/(x − x0).
If this limit exists and is a finite number, then we say that f is "differentiable" at x0 and denote the limiting value by f′(x0) or df/dx(x0).7 This is the so-called first derivative of f at x0. The uniqueness of limits ensures that the derivative of f at x0, when it exists, is uniquely determined.
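Before moving on, here is a quick numerical sanity check of this definition in Sage (our own toy example, with f(x) = x² at x0 = 1, where f′(1) = 2):

f(x) = x^2
# difference quotients at x0 = 1 approach the derivative f'(1) = 2
print([(f(1 + h) - f(1))/h for h in [0.1, 0.01, 0.001]])   # 2.1, 2.01, 2.001
print(limit((f(x) - f(1))/(x - 1), x=1))                   # 2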
If f is differentiable at any point x0 ∈ (a, b) of a given open interval, we say that f is differentiable on (a, b). We say that f is differentiable on a closed interval [a, b] when f is differentiable at any point x0 ∈ (a, b) and the right-sided and left-sided derivatives at a and b, respectively, exist.

5.C.1. Based on the definition of the derivative of a function of one variable (see 5.3.1), compute the derivatives of the following functions: f(x) = x^n, with n ∈ N constant and x ∈ R; g(x) = sin(x), with x ∈ R; h(x) = √x, with x ∈ [0, +∞). ⃝

5.C.2. Let f : R → R be a function such that f(a + b) = f(a)f(b) for all a, b ∈ R. Suppose also that f(0) = f′(0) = 1. Show that f′(x) = f(x) for all x ∈ R. ⃝

5.C.3. Consider the function f(x) = |x|. Find the derivative of f for all 0 ≠ x ∈ R. Is f differentiable at x0 = 0? ⃝

5.C.4. Show that if f is a differentiable function at a point x0 in its domain, then f is continuous at x0 as well. ⃝

The converse of the statement in 5.C.4 is not true, and there are several reasons why a continuous function f may fail to be differentiable at a given point of its domain. A possible reason is a "corner" on its graph, like that of the absolute value function f(x) = |x| discussed in 5.C.3. Another reason is that f may admit a vertical tangent line at the given point, as does the square root function discussed in 5.C.1. Let us summarize with a useful slogan: "Differentiability is a stronger requirement than continuity."

5.C.5. Consider the function
f(x) = √(x² + 7), if x > 3; f(x) = x² + bx + c, if x ≤ 3.
Determine the reals b, c ∈ R such that f is differentiable for all x ∈ R. For those values, use Sage to plot the graph of f in at least two different ways.
Solution. Obviously, for all x ≠ 3 the given function is differentiable. In order to determine the unknowns b, c ∈ R, we need to establish a system of two equations. Since differentiability requires continuity, these equations are induced by the continuity and the differentiability of f at x0 = 3.

7For y = f(x), the notation dy/dx is due to Leibniz, while f′(x) is the notation of Lagrange.

CHAPTER 5. ESTABLISHING THE ZOO

its value at x0 is just the desired derivative of the composite function, again by lemma 5.3.1. □

5.3.5. Derivative of quotients. Consider first the special case of h(x) = x⁻¹. From the definition of the derivative,
h′(x) = lim_{δx→0} (1/(x + δx) − 1/x)/δx = lim_{δx→0} (x − (x + δx))/(δx(x² + xδx)) = lim_{δx→0} −1/(x² + xδx) = −x⁻².
Thus, the above leads to:

Derivative of a quotient

Corollary. Let f and g be real-valued functions which have finite derivatives at a point x0, with g(x0) ≠ 0. Then the function h(x) = f(x)(g(x))⁻¹ satisfies
h′(x0) = (f/g)′(x0) = (f′(x0)g(x0) − f(x0)g′(x0))/(g(x0))².

Proof. Using the formula (x⁻¹)′ = −x⁻², the chain rule says (g⁻¹)′ = −g⁻² · g′. The Leibniz rule then implies
(f/g)′ = (f · g⁻¹)′ = f′g⁻¹ − fg⁻²g′ = (f′g − fg′)/g². □

5.3.6. Derivatives of inverse functions. In paragraph 1.6.1, while talking about relations and mappings in general, the concept of an inverse function was introduced. If the inverse function f⁻¹ to a given function f : R → R exists (do not confuse this notation with the function x → (f(x))⁻¹), then it is uniquely determined by either of the following identities:
f⁻¹ ◦ f = id_R, f ◦ f⁻¹ = id_R.
Then the other identity is also true. If f is defined on a set A ⊂ R and f(A) = B, the existence of f⁻¹ is conditioned by the same statements with the identity mappings id_A and id_B, respectively, on the right-hand sides.
As seen in the diagram, the graph of the inverse function is obtained simply by interchanging the axes of the input and output variables.

We have $f(3) = 9 + 3b + c$, and the continuity condition $\lim_{x\to 3^-} f(x) = 9 + 3b + c = \lim_{x\to 3^+} f(x)$. The first equality holds as an identity, and the second gives the relation $9 + 3b + c = \sqrt{3^2+7} = 4$, i.e., $3b + c = -5$. As for the differentiability of $f$ at $x_0 = 3$, we require
$$\lim_{x\to 3^-} \frac{f(x)-f(3)}{x-3} = \lim_{x\to 3^+} \frac{f(x)-f(3)}{x-3}. \qquad (*)$$
We see that $f(3) = 9 + 3b + c = 9 - 5 = 4$, and hence
$$\lim_{x\to 3^+} \frac{f(x)-f(3)}{x-3} = \lim_{x\to 3^+} \frac{\sqrt{x^2+7}-4}{x-3} = \lim_{x\to 3^+} \frac{(\sqrt{x^2+7}-4)(\sqrt{x^2+7}+4)}{(x-3)(\sqrt{x^2+7}+4)} = \lim_{x\to 3^+} \frac{x^2-9}{(x-3)(\sqrt{x^2+7}+4)} = \lim_{x\to 3^+} \frac{x+3}{\sqrt{x^2+7}+4} = \frac{6}{8} = \frac{3}{4}.$$
Similarly,
$$\lim_{x\to 3^-} \frac{f(x)-f(3)}{x-3} = \lim_{x\to 3^-} \frac{x^2 + b(x-3) - 9}{x-3} = \lim_{x\to 3^-} \frac{(x-3)(x+3+b)}{x-3} = \lim_{x\to 3^-} (x+3+b) = 6 + b.$$
Thus, by $(*)$ we obtain the equation $6 + b = \frac{3}{4}$, that is, $b = -\frac{21}{4}$ and so $c = -5 - 3b = \frac{43}{4}$. In Sage we can plot $f$ "manually", i.e.,

f1(x)=sqrt(x^2+7); f2(x)=x^2-(21/4)*x+43/4
P=plot(f1(x), x, 3, 12, color="black")
P+=plot(f2(x), x, -6, 3, color="gray")
P+=point([3, 9-(21/4)*3+43/4], color="black", size=30); show(P)

An alternative relies on the command piecewise, whose implementation can be summarized from the cell

f1(x)=sqrt(x^2+7); f2(x)=x^2-(21/4)*x+43/4
F=piecewise([[(-6, 3), f2(x)], [(3, 12), f1(x)]])
p=plot(F(x), x, -6, 12)
p+=point([3, 9-(21/4)*3+43/4], color="black", size=30); show(p)

Execute these cells to generate the required graph (note that the code also includes the junction point $x_0 = 3$). Notably, evaluating the "piecewise" method takes more time, both at https://sagecell.sagemath.org and at cocalc.com, compared with the first approach. Finally, remember that another way to introduce $f$ is based on the method presented in 5.B.49. □

If it is known that the inverse $y = f^{-1}(x)$ of a differentiable function $x = f(y)$ is also differentiable, then the chain rule immediately yields
$$1 = (\mathrm{id})'(x) = (f\circ f^{-1})'(x) = f'(y)\cdot (f^{-1})'(x).$$
Notice that $f'(y)$ must be non-zero. This corresponds to the intuitive idea that for $y = f(x)$ the value of $f'$ is approximately $\frac{\delta y}{\delta x}$, while for $x = f^{-1}(y)$ it is approximately $(f^{-1})'(y) = \frac{\delta x}{\delta y}$. And this indeed is the way the derivatives of inverse functions are calculated.

Derivative of the inverse function

Theorem. If $f$ is a real-valued function differentiable at $y_0$, such that the inverse $f^{-1}(x)$ exists on a neighbourhood of the value $x_0 = f(y_0)$ and $f'(y_0) \neq 0$, then
$$(1)\qquad (f^{-1})'(x_0) = \frac{1}{f'(f^{-1}(x_0))} = \frac{1}{f'(y_0)}.$$

Let us notice that if we assume that the derivative $f'(x)$ is continuous on a neighbourhood of $x_0$, then the condition $f'(x_0) \neq 0$ clearly implies that $f$ is either increasing or decreasing on some neighbourhood, and thus the assumptions of the theorem are fulfilled. On the other hand, the example 5.3.3(1) shows that the existence of $f'(x_0) \neq 0$ alone is not necessarily sufficient.

Proof. To prove the proposition, it suffices to revisit the proof of the fourth statement of theorem 5.3.4, where we work with the composition $f\circ f^{-1} = \mathrm{id}$. The composite function is differentiable. By lemma 5.3.1, there is a function $\varphi$, continuous at $y_0$, such that $f(y) - f(y_0) = \varphi(y)(y - y_0)$ on some neighbourhood of $y_0$. Further, it satisfies $\varphi(y_0) = f'(y_0) \neq 0$, and $\varphi$ has constant sign on a neighbourhood of $y_0$.
Next, notice that the existence of the inverse f−1 around the point x0 and the continuity of f at y0 guarantees the continuity of f−1 at x0, (the ε and δ neighbourhoods map each to the other bijectively). The substitution y = f−1 (x) then yields x − x0 = φ(f−1 (x))(f−1 (x) − f−1 (x0)), for all x lying in some neighbourhood O(x0) of x0. Further, f−1 (x0) = y0, and φ(f−1 (x)) is continuous at x0 and remains non-zero on a neighbourhood O(x0) of x0 with constant sign. Thus f−1 (x) − f−1 (x0) x − x0 = 1 φ(f−1(x)) ̸= 0, for all x ∈ O(x0) \ {x0}. The right-hand side of this expression is continuous at x0. The limit is lim x→x0 1 φ(f−1(x)) = 1 φ(f−1(x0)) = 1 f′(y0) . 410 The fundamental rules of differentiation (see 5.3.4 and 5.3.5) enable us to compute derivatives of functions, even those with complex forms. The upcoming exercises focus on these rules to enhance familiarity with their application. It’s worth noting that while differentiation is a standard procedure, it can involve intricate computations. Therefore, many find it more enjoyable to compute derivatives using computer algebra packages, as these not only simplify the process but also allow for rigorous verification of calculations. Below we will explain the usage of Sage. Similar syntaxes are available in many other computer algebra systems, as Mathematica and Maple. 5.C.6. In 5.C.1 we saw that (sin(x))′ ≡ sin′ (x) = cos(x). Based on this fact and using the chain rule, introduced in 5.3.4, prove that (cos(x))′ = − sin(x). ⃝ 5.C.7. Differentiate the following functions, compute the required values and verify (some of) your answers via Sage. (1) f(x) = x sin(x). Find f′ (π/2); (2) g(x) = sin(x) x , x ̸= 0. Find g′ (π/2); (3) h(x) = ln(x + √ x2 − c2), c ̸= 0, |x| ≥ |c|. (4) k(x) = xx , x > 0; (5) m(x) = xsin(x) , x > 0. Solution. (1) By the product rule (also called Leibniz rule, see 5.3.4), we see that f′ (x) = x′ · sin(x) + x · (sin(x))′ = sin(x) + x cos(x). Thus f′ (π/2) = 1. In order to differentiate a function f(x), Sage provides many alternatives. For instance we can use the command derivative(f, x) or f.derivative(x), as fol- lows: f(x)=x*sin(x); show(derivative(f, x)) or type f(x)=x*sin(x); show(f.derivative(x)) If we want to find the explicit value of the derivative at a point x0, then the command f.derivative(x)(x = x0) is an appropriate one. Thus, to compute f′ (π/2) you may continue typing in the previous cell the following: f.derivative(x)(x=pi/2) This prints out 1, that is, f′ (π/2) = 1, where f(x) = x sin(x). An alternative to differentiate f(x) relies on the command diff. For example, by typing f(x)=x*sin(x); show(diff(f, x)) we obtain the same result. In order to specify the value at a certain point x0, we again type diff(f, x)(x = x0), i.e., f(x)=x*sin(x); show(diff(f, x)(x=pi/2)) The command diff is also useful when one wants to find higher-order derivatives, hence we will meet it again in Chapter 6, see also below and the final Section E of this chapter. CHAPTER 5. ESTABLISHING THE ZOO Therefore, the limit of the left-hand side also exists, and it follows that (f−1 )′ (x0) = 1 f′(y0) as required. □ 5.3.7. Derivatives of the elementary functions. Consider the exponential function f(x) = ax for any fixed real a > 0. If the derivative of ax exists for all x, then f′ (x) = lim δx→0 ax+δx − ax δx = ax lim δx→0 aδx − 1 δx = f′ (0)ax . On the other hand, if the derivative at zero exists, then this formula guarantees the existence of the derivative at any point of the domain and also determines its value. 
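A small Sage experiment (ours, with the sample base a = 2) makes this concrete: the limit defining f'(0) evaluates to log 2, and consequently (2^x)' = (log 2)·2^x, in accordance with the formula f'(x) = f'(0)a^x above.

t, x = var("t x")
a = 2                              # sample base; any a > 0 works
show(limit((a^t - 1)/t, t=0))      # prints log(2), the value of f'(0)
show(diff(a^x, x))                 # prints 2^x*log(2), i.e. f'(x) = f'(0)*a^x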
At the same time, the validity of this formula for one-sided derivatives is also verified. Unfortunately, it takes some time to verify that the derivatives of exponential functions indeed exist (see 5.4.2, 5.4.10, and 6.3.7). There is an especially important base e, also known as Euler’s number, for which the derivative at zero equals one. Remember the formula (ex )′ = ex for a while and draw on its consequences. For the general exponential function, (using standard rules of differentiation), (ax )′ = (eln(a)x )′ = ln(a)(eln(a)x ) = ln(a) · ax . Thus exponential functions are special since their derivatives are proportional to their values. Next, we determine the derivative (lne(x))′ . The definition of the natural logarithm as the inverse to ex , eln x = x, allows the calculation: (1) (ln)′ (y) = (ln)′ (ex ) = 1 (ex)′ = 1 ex = 1 y . The formula (2) (xa )′ = axa−1 for differentiating a general power function can also be derived using the derivatives of the exponential and logarithmic functions: (xa )′ = (ea ln x )′ = ea ln x (a ln x)′ = a xa x = axa−1 . 5.3.8. Mean value theorems. Before continuing the journey of finding new interesting functions, we derive several simple statements about derivatives. The meaning of all of them is intuitively clear from the diagrams. The proofs follow the visual imagi- nation. 411 (2) The rule for the derivative of a quotient (see 5.3.5), gives g′ (x) = (sin(x))′ · x − sin(x) · x′ x2 = x cos(x) − sin(x) x2 . Thus, we easily get g′ (π/2) = − 4 π2 . To confirm this via Sage a recommendation has the form g(x)=sin(x)/x; diff(g(x), x)(x=pi/2) (3) Here the “chain rule” applies. Set a(x) = ln(x) and b(x) = x + √ x2 − c2 such that h(x) = a(b(x)). Then h′ (x) = ( a(b(x) )′ = a′ (b(x)) · b′ (x) = (x + √ x2 − c2)′ x + √ x2 − c2 = 1 + x x2−c2 x + √ x2 − c2 . For a confirmation in Sage you may use one of the methods posed above. (4) Let us recall the identity (ef(x) )′ = f′ (x) ef(x) , where we assume that f′ (x) exists. Because xx = ex ln(x) by applying this rule we obtain k′ (x) = (xx )′ = ( ex ln(x) )′ = ( x ln(x) )′ ex ln(x) = ( ln(x) + 1 ) ex ln(x) = ( ln(x) + 1 ) xx . In Sage you may use the cell show(diff(x ∗ ∗x, x)), which returns the expression xx (log (x) + 1) (recall that in Sage the function log represents the natural logarithm). The same trick used in (4) applies for the final case, too. Hence we leave this as an exercise. □ 5.C.8. Chain rule via Sage. Write in Sage a short code implementing the chain rule for two arbitrary differentiable func- tions. Solution. Let h(x) = f(g(x)) be the composition of two arbitrary differentiable functions f, g : R → R. We want to implement in Sage the rule h′ (x) = g′ (x)f′ (g(x)), for all x ∈ R, and with this goal in mind it is useful to recall about symbolic functions in Sage. For instance, type in your editor the cell x=var("x") f=function(’f’)(x); show(f) This cell introduces an arbitrary function f of one variable x, in particular prints out f(x). This makes the implementation of the chain rule for two arbitrary functions f, g : R → R, really simple. First introduce the composition h, as fol- lows x=var("x"); f=function("f"); g=function("g") h=f(g(x)) # define the composition h show(h) Check yourself that this gives the composition f(g(x)). Next it is sufficient to add the command h.diff(x), which returns D[0](f)(g(x)) ∗ diff(g(x), x). Recall here that D[0] encodes the first derivative, hence we can indeed translate Sage’s output as f′ (g(x))g′ (x), that is, the chain rule. 
A more useful alternative has the following form: function("f")(x) CHAPTER 5. ESTABLISHING THE ZOO Rolle’s theorem11 Theorem. Assume that the function f : R → R is continuous on a closed bounded interval [a, b] and differentiable inside this interval. If f(a) = f(b), then there is a number c ∈ (a, b) such that f′ (c) = 0. Proof. Since the function f is continuous on the closed interval (i.e. on a compact set), it attains its maximum and its minimum there. Either its maximum value is greater than f(a) = f(b), or the minimum value is less than f(a) = f(b), or f is constant. If the third case applies, the derivative is zero at all points of the interval (a, b). If the second case applies, then the first case applies to the function −f. If the first case applies, it occurs at an interior point c. If f′ (c) ̸= 0 then the function f would be either increasing or decreasing at c (see 5.3.2), implying the existence of larger values than f(c) in a neighbourhood of c, contradicting that f(c) is a maximum value. □ 5.3.9. The latter result immediately implies the following corollary. Lagrange’s mean value theorem Theorem. Assume the function f : R → R is continuous on an interval [a, b] and differentiable at all points inside this interval. Then there is a number c ∈ (a, b) such that f′ (c) = f(b) − f(a) b − a . 11The French mathematician Michel Rolle (1652-1719) proved this theorem only for polynomials. The principle was perhaps known much earlier, but the rigorous proof comes from the 19th century only. 412 function("g")(x) show(diff(f(g(x)), x)) The output here has the form D0 (f) (g (x)) ∂ ∂x g (x), and the advantage is that we can “replace” f, g with certain functions, so that we can use this method for verifying our formal computations in such examples. For instance, to compute the derivative of ex2 , it is sufficient to specify f, g by adding in the previous cell the code f(x)=e^x;g(x)=x^2; show(diff(f(g(x)), x)) To compute the case (5) of Problem 5.C.7, add the line f(x) = xx ; g(x) = sin(x); show(diff(f(g(x)), x)). Try some additional examples yourself, and see also in Chapter 6 for further applications of symbolic functions (cf. 6.A.4). □ 5.C.9. Inverse function theorem. (a) Given the function f(x) = (4x + 3)/(x − 6) prove that it is invertible on its domain, with inverse given by the function g(x) = (6x + 3)/(x − 4), with x ∈ R\{4}. (b) According to 5.3.6 if y = g(x) is the inverse of a differentiable function f(x), then for all x satisfying f′ (g(x))) ̸= 0 we have g′ (x) = 1/f′ (g(x))). Based on this result compute the derivative of the function g(x) given in (a) and then compare your result, by applying a direct computation of g′ (x). Solution. (a) Obviously, the domain of f is the set A := R\{6}. For arbitrary x1, x2 ∈ A we see that the relation f(x1) = f(x2) is equivalent to 27x1 = 27x2, that is, x1 = x2. Hence f is one-to-one and its inverse y = f−1 (x) exists. We will show that y = f−1 (x) = g(x) with domain the set B = R\{4}. Setting y = 4x+3 x−6 , we equivalently get y(x − 6) = 4x + 3, or x(y − 4) = 6y + 3, that is x = 6y+3 y−4 for y ̸= 4. Such x are indeed elements of A, and in order to obtain the inverse of f it is now sufficient to reverse the roles of x, y in the previous relation. This gives the desired expression of g = f−1 . An alternative verification occurs due to the relations f(g(x)) = 27x 27 = x and g(f(x)) = 27x 27 = x. (b) A direct computation shows that g′ (x) = (6x + 3 x − 4 )′ = − 27 (x − 4)2 , x ∈ B . 
Let us obtain the same result by applying the method mentioned above. We similarly compute f′ (x) = − 27 (x−6)2 and hence it follows that for all x ̸= 4 we have f′ (g(x)) = − 27 (6x+3 x−4 − 6)2 = − (x − 4)2 27 ̸= 0 . Thus g′ (x) = 1/f′ (g(x)) = − 27 (x−4)2 , for all x ∈ B. □ Let y = f(x) be a differentiable function at x0 ∈ R. The the tangent line of the graph of f at the point [x0, y0 = f(x0)] ∈ R2 has the form y − y0 = f′ (x0)(x − x0), or equivalently y = f′ (x0)x + (y0 − f′ (x0)x0) . CHAPTER 5. ESTABLISHING THE ZOO Proof. The proof is a simple statement of the geometrical meaning of the theorem: The secant line between the points [a, f(a)] and [b, f(b)] has a tangent line which is parallel to it (have a look at the diagram). The equation of the secant line is y = g(x) = f(a) + f(b) − f(a) b − a (x − a). The difference h(x) = f(x) − g(x) determines the (vertical) distance of the graph and the secant line (in the values of y). Surely h(a) = h(b) and h′ (x) = f′ (x) − f(b) − f(a) b − a . By Rolle’s theorem, there is a point c at which h′ (c) = 0. □ The mean value theorem can also be written in the form: (1) f(b) = f(a) + f′ (c)(b − a). In the case of a parametrically given curve in the plane, i.e., a pair of functions y = f(t), x = g(t), similar result about the existence of a tangent line parallel to the secant line going through the boundary points is described by Cauchy’s mean value theorem. Notice we may consider such a curve as a complex valued function f(t) + i g(t). Cauchy’s mean value theorem Corollary. Let functions y = f(t) and x = g(t) be continuous on an interval [a, b] and differentiable inside this interval. Further, let g′ (t) ̸= 0 for all t ∈ (a, b) and g(b) ̸= g(a). Then there is a point c ∈ (a, b) such that f(b) − f(a) g(b) − g(a) = f′ (c) g′(c) . Proof. Put h(t) = (f(b) − f(a))g(t) − (g(b) − g(a))f(t). Now h(a) = f(b)g(a) − f(a)g(b), h(b) = f(b)g(a) − f(a)g(b), so by Rolle’s theorem, there is a number c ∈ (a, b) such that h′ (c) = 0. Finally, g′ (c) ̸= 0 and the desired formula follows. □ Notice that g(b) ̸= g(a) automatically, if g′ (t) is contin- uous. 5.3.10. A reasoning similar to the one in the above proof leads to a supremely useful tool for calculating limits of quotients of functions. 413 Hence, the tangent is the line through the point [x0, y0], having slope f′ (x0). Observe that the geometric condition of having a unique non-vertical tangent to the graph of f at a point [x0, f(x0)] ∈ Gf is equivalent to the existence of f′ (x0), that is, to the differentiability of f at x0. Hence tangent lines provide a more intuitive definition of differentiability: We may always think a differentiable function f at x0, as a function whose graph has a unique (non-vertical) tangent at the point [x0, f(x0)]. In this case the graph of f cannot have breaks, corners or cusps at x0. 5.C.10. Consider the function f(x) = αx2 − 4x ln(x) with x > 0, where α ̸= 0 is some real parameter. Find the tangent of f at the point P = [1, f(1)]. Next specify α such that the origin is a point of this tangent. ⃝ 5.C.11. Consider the function f(x) = x2 ln(x), with x > 0. Prove that there exists a unique point P ∈ R2 belonging on the graph of f and where the tangent of f is parallel to the x-axis. Explain why this happens and verify your computations via Sage. Solution. The derivative of f is given by f′ (x) = 2x ln(x) + x = x(2 ln(x) + 1) , for any x ∈ (0, +∞). 
We see that the equation f′ (x) = 0 has a unique solution, namely x0 = e− 1 2 = 1√ e (since f′ is defined only for x > 0, the solution x = 0 is not acceptable). In Sage we can verify these computations by typing f(x)=x^2*ln(x); assume(x>0) show(solve(diff(f, x)==0, x)) The tangent line of f at this point has zero slope, i.e., f′ (x0) = 0, and hence is horizontal. In particular, the tangent line is given by y = f(x0), and observe that near x0 this line lies under the graph of f. In such a situation we say that at x0 the function f attains a local minimum with value f(x0), see also here: Thus there is a unique point P ∈ R2 satisfying the statement, with coordinates P = [ 1√ e , − 1 2e ] . □ CHAPTER 5. ESTABLISHING THE ZOO L’Hospital’s rule12 Theorem. Suppose f and g are functions differentiable on some neighbourhood of a point x0 ∈ R, yet not necessarily at x0 itself. Suppose lim x→x0 f(x) = 0, lim x→x0 g(x) = 0. If the limit lim x→x0 f′ (x) g′(x) exists, then the limit lim x→x0 f(x) g(x) also exists, and the two limits are equal. Proof. Without loss of generality, the functions f and g are zero at the point x0. The quotient of the values then corresponds to the slope of the secant line between the points [0, 0] and [f(x), g(x)]. At the same time, the quotient of the derivatives corresponds to the slope of the tangent line at the given point. Thus it is necessary to verify that the limit of the slopes of the secant lines exists from the fact that the limit of the slopes of the tangent lines exists. Technically, we can use the mean value theorem in Cauchy’s parametric form. First of all, the existence of the expression f′ (x)/g′ (x) on some neighbourhood of the point x0 (excluding x0 itself) is implicitly assumed. Thus especially for points c sufficiently close to x0, g′ (c) ̸= 0.13 By the mean value theorem, lim x→x0 f(x) g(x) = lim x→x0 f(x) − f(x0) g(x) − g(x0) = lim x→x0 f′ (cx) g′(cx) , 12Guillaume François Antoine, Marquis de l’Hôpital, (1661-1704) became famous for his textbook on Calculus (in modern textbooks, his name is mostly spelled as l’Hospital). This rule was first published there, perhaps originally proved by one of the famous Bernoulli brothers. 13This is not always necessary for the existence of the limit in a general sense. Nevertheless, for the statement of l’Hospital’s rule, it is. A thorough discussion can be found (googled) in the popular article ‘R. P. Boas, Counterexamples to L’Hospital’s Rule, The American Mathematical Monthly, October 1986, Volume 93, Number 8, pp. 644–645.’ 414 5.C.12. Find the equations of the tangent and normal lines to the graph of g(x) = (x + 1) 3 √ 3 − x, with x ∈ R, at the point P = [1, f(1)]. Next sketch the two lines together with the graph of g via Sage. Solution. We have g(x0) = g(1) = 2 · 2 1 3 = 2 4 3 . Moreover g′ (x) = (3 − x) 1 3 − 1 3 (x + 1)(3 − x)− 2 3 , and hence g′ (1) = 2 3 · 2 1 3 = 2 3√ 2 3 . Thus the tangent of g at P is given by y1 = g(x0) + g′ (x0)(x − x0) = 2 4 3 + 2 3 √ 2 3 (x − 1) . The normal line and tangent line are perpendicular, hence they should have slopes that are opposite reciprocals each other. This means that the normal line has the form y2 = g(x0) − 1 g′(x0) (x − x0) = 2 4 3 − 3 2 3 √ 2 (x − 1) . 
In order to derive the equations of the tangent and normal (and sketch them) in Sage, type the block g(x)=(x+1)*(3-x)^(1/3);dg(x)=diff(g(x),x) a=plot(g(x), x, -1.2, 3, xmin=-3, xmax=3, legend_label="curve") x0=1; tangl=g(x0)+dg(x0)*(x-x0) show(tangl) perpl=g(x0)-(1/dg(x0))*(x-x0);show(perpl) b=plot(tangl, xmin=-1.2, xmax=1.5, color="black", legend_label="tangent", aspect_ratio=1) c=plot(perpl, xmin=-1, xmax=2, color="gray", legend_label="normal") m=point([x0, 0], size=30, color="black") mv=point([x0, g(x0)], size=30, color="black") show(a+b+c+m+mv) We leave to the reader the implementation of this block, for practicing with Sage. □ 5.C.13. Find the tangent and normal line to the curve given by the equation x3 + y3 − 2xy = 0, at the point P = [1, 1]. ⃝ 5.C.14. Tangent lines via Sage. Given the polynomial P(x) = x4 − 2x with x ∈ I = [−2, 2], write a short routine in Sage constructing the tangent line ℓx0 (x) of P at a random point x0 lying in the interval I. Then choosing certain x0 ∈ I, produce the graphs of P and ℓx0 for all x ∈ I. Solution. Using the def command we can introduce a subroutine, which we agree to call Tangent. To do so we first need to introduce P and its first derivative. The whole block has the form CHAPTER 5. ESTABLISHING THE ZOO where cx is a number lying between x0 and x, dependent on x. From the existence of the limit lim x→x0 f′ (x) g′(x) , it follows that this value will be shared by the limit of any sequence created by substituting the values x = xn approaching x0 into f′ (x)/g′ (x) (cf. the convergence test 5.2.15). Especially, we can substitute any sequence cxn for xn → x0, and thus the limit lim x→x0 f′ (cx) g′(cx) exist, and the last two limits are equal. Hence the desired limit exists and has the same value. □ From the proof of the theorem, it is true for one-sided limits as well. 5.3.11. Corollaries. L’Hospital’s rule can easily be extended for limits at the improper points ±∞ and for the case of infinite values of the limits. If, for instance, we have lim x→∞ f(x) = 0, lim x→∞ g(x) = 0, then limx→0+ f(1/x) = 0 and limx→0+ g(1/x) = 0. At the same time, from existence of the limit of the quotient of the derivatives at infinity, lim x→0+ (f(1/x))′ (g(1/x))′ = lim x→0+ f′ (1/x)(−1/x2 ) g′(1/x)(−1/x2) = lim x→0+ f′ (1/x) g′(1/x) = lim x→∞ f′ (x) g′(x) . Applying the previous theorem, the limit lim x→∞ f(x) g(x) = lim x→0+ f(1/x) g(1/x) = lim x→∞ f′ (x) g′(x) exists in this case as well. The limit calculation is even simpler in the case when lim x→x0 f(x) = ±∞, lim x→x0 g(x) = ±∞. Then it suffices to write lim x→x0 f(x) g(x) = lim x→x0 1/g(x) 1/f(x) , which is already the case of usage of l’Hospital’s rule from the previous theorem. In fact, the l’Hospital’s rule has the same form for inproper limits as well: Theorem. Let f and g be functions differentiable on some neighbourhood of a point x0 ∈ R, not necessarily at x0 itself. Further, let the limits limx→x0 f(x) = ±∞ and limx→x0 g(x) = ±∞ exist. If the limit lim x→x0 f′ (x) g′(x) exists, then the limit lim x→x0 f(x) g(x) also exists and they equal each other. 415 P(x)=x^4-2*x; P1(x)=diff(P(x), x) def Tangent(x_0): y_0=P(x_0) m=P1(x_0) c=y_0 - m*x_0 l(x)=m*x+c #this defines the tangent line Q = plot(P(x), -2, 2, color="blue", CHAPTER 5. ESTABLISHING THE ZOO Proof. Apply the mean value theorem. 
The key step is to express the quotient in a form where the derivative arises:
$$\frac{f(x)}{g(x)} = \frac{f(x)}{f(x)-f(y)}\cdot\frac{f(x)-f(y)}{g(x)-g(y)}\cdot\frac{g(x)-g(y)}{g(x)},$$
where $y \neq x_0$ is fixed, taken from a neighbourhood of $x_0$, and $x$ approaches $x_0$. Since the limits of $f$ and $g$ at $x_0$ are infinite, we can surely assume that the differences of the values of both functions at $x$ and $y$, having fixed $y$, are non-zero. Using the mean value theorem, replace the fraction in the middle with the quotient of the derivatives at an appropriate point $c$ between $x$ and $y$. The expression of the examined limit thus gets the form
$$\frac{f(x)}{g(x)} = \frac{1 - \frac{g(y)}{g(x)}}{1 - \frac{f(y)}{f(x)}}\cdot\frac{f'(c)}{g'(c)},$$
where $c$ depends on both $x$ and $y$. With $y$ fixed and $x$ approaching $x_0$, the former fraction converges to one. At the same time, if simultaneously $y \to x_0$ and $|y - x_0| \ge |x - x_0|$, the latter fraction becomes arbitrarily close to the limit value of the quotient of the derivatives. Thus, we may choose a sequence $y_n \to x_0$ such that
$$\Big|\lim_{x\to x_0}\frac{f'(x)}{g'(x)} - \frac{f'(c)}{g'(c)}\Big| < \frac{1}{n}$$
for all $c$ with $|c - x_0| < |y_n - x_0|$. At the same time, we may restrict appropriately $|x - x_0| \le \delta_n$, so that the former fraction with $y = y_n$ is closer to 1 than by $1/n$ for all such $x$. Altogether, this implies the requested equality of the limits. □

By making suitable modifications of the examined expressions, one can also apply l'Hospital's rule to forms of the types $\infty - \infty$, $1^\infty$, $0\cdot\infty$, and so on. Often one simply rearranges the expressions or uses some continuous function, for instance the exponential one.

5.3.12. Example. For an illustration of such a procedure, we show the connection between the arithmetic and geometric means of $n$ non-negative values $x_i$. The arithmetic mean
$$M_1(x_1,\dots,x_n) = \frac{x_1+\dots+x_n}{n}$$
is a special case of the power mean with exponent $r$, also known as the generalized mean:
$$M_r(x_1,\dots,x_n) = \Big(\frac{x_1^r+\dots+x_n^r}{n}\Big)^{\frac{1}{r}}.$$
The special value $M_{-1}$ is called the harmonic mean. Calculate the limit value of $M_r$ for $r$ approaching zero. For this purpose, determine the limit by l'Hospital's rule (we treat it as an expression of the form $0/0$ and differentiate with respect to $r$, with the $x_i$ as constant parameters). The following calculation uses the chain rule and knowledge of the derivative of the power function, and must be read in reverse: the existence of the last limit implies the existence of the last-but-one, and so on.

    ymin=-2, ymax=20, gridlines="true")
    Q+=plot(l(x), -2, 2, color="black", ymin=-2, ymax=20)
    Q+=point((x_0, y_0), color="black", size=50)
    Q.show()
    show("x_0=", x_0)
    show("tang.line: l(x)=(", m, ")*x+(", c, ")")
    return

Notice that x_0 enters this code as a real number input. Also, the command gridlines with the option true was used to add grid lines in the background (though this is not necessary). The last two show commands are included to display the chosen value of $x_0$ and the expression of the tangent line $\ell_{x_0}(x)$ at $x_0$. Finally, the command return is used to conclude the routine. Further applications of programming in Sage will be discussed later, as seen for example in 5.D.5. We can now check our routine by testing certain values for $x_0$. We may type

Tangent(0); Tangent(-0.5)
Tangent(0.5); Tangent(1)

etc. For instance, the last command produces the figure and prints

x_0=1
tang. line: l(x)=(2)*x+(-3)

Later in the final section of this chapter we will use the routine constructed here to build an interactive environment for tangent lines, see 5.E.92.
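As a glimpse of that interactive environment, the routine above can be wrapped in Sage's @interact decorator. The following is only a sketch of one possible setup (ours), not the construction of 5.E.92 itself:

@interact
def _(x_0=slider(-2, 2, step_size=0.1, default=1)):
    Tangent(x_0)   # the routine defined above; the slider moves the tangency point

Evaluated in a Sage or Jupyter worksheet, the slider redraws the tangent line of P at the chosen x_0.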
□ We have now only scratched the surface of the theory of derivatives, which encompasses a wide range of theorems and applications. Derivatives offer a powerful mathematical framework that naturally applies to the study of monotone functions, optimization problems, and approximation techniques. CHAPTER 5. ESTABLISHING THE ZOO lim r→0 ln(Mr (x1, . . . , xn)) = lim r→0 ln( 1 n (xr 1 + · · · + xr n)) r = lim r→0 xr 1 ln x1+···+xr n ln xn n xr 1+···+xr n n = ln x1 + · · · + ln xn n = ln n √ x1 . . . xn. Hence lim r→0 Mr (x1, . . . , xn) = n √ x1 . . . xn, which is known as the geometric mean. 4. Infinite sums and power series The last part of this chapter is mainly devoted to infinite sums of numbers, aiming at infinite extension of polynomials – the so called power series. We first complete the basic discussion of the exponential function, expressing it as a limit of polynomial approximations. This illustrates the more general need to develop effective tools to deal with sequences of numbers or functions. If the reader finds the next paragraphs too demanding, we suggest jumping to 5.4.3 starting the general discussion on infinite sums of numbers and maybe return later. 5.4.1. The calculation of ex . For numerical computations, manipulation with limits of sequences is needed as well as addition and multiplication of scalars. Thus it might be a good idea to approximate non-polynomial functions by sequences of numbers that can be calculated easily, keeping control of the approximation errors. We approach the function ex this way. In view of the expected property (ex )′ = ex (cf. 5.3.7), we look for a function whose rate of increase equals the function’s value at every point. This can be imagined as a splendid interest rate equal to the current value of your money. If we apply the interest rate per year once a month, once a day, once an hour, and so on, we obtain the following values for the yield x of the deposit after one year: ( 1 + x 12 )12 , ( 1 + x 365 )365 , ( 1 + x 8760 )8760 , . . . Therefore, we could guess that ex = lim n→∞ ( 1 + x n )n . At the same time, we can imagine that the finer we apply the interest, the higher the yield will be. So the sequence on the right-hand side should be an increasing sequence. In detail, we examine the sequence of numbers an = ( 1 + 1 n )n . Bernoulli’s inequality will come in handy: 417 In particular, when examining monotonicity and optimization problems (including local and absolute extrema), derivatives of the first and second orders are often sufficient to reveal the local behavior of a function defined on an interval of the real line.8 Notice in 5.3.2 we will revise the concepts of increasing and decreasing functions, thereby enriching our approach to studying the local behavior of functions. For further illustration, refer to the examples in 5.3.3. This latter aspect will be further elaborated upon in the initial section of Chapter 6. Therefore, the tasks presented below serve a more foundational purpose. 5.C.15. Discuss the monotonicity of the function f(x) = ln(x) x over its domain. Solution. For any x in the domain A := (0, +∞) of f we compute f′ (x) = (ln(x))′ x−x′ ln(x) x2 = 1−ln(x) x2 . Thus, the equation f′ (x) = 0 has a unique solution, namely x0 = e. This is a “critical” or “stationary point” of f, see 5.3.2 for terminology. Since x2 is always positive, the sign of f′ depends on the numerator 1 − ln(x). Observe now that 1 − ln(x) > 0 ⇔ ln(x) < 1 ⇔ ln(x) < ln(e) ⇔ x < e, and similarly we obtain 1 − ln(x) < 0 ⇔ x > e. 
Thus f′ (x) > 0 for all x ∈ (0, e) and f′ (x) < 0 for all x ∈ (e, +∞), which means that f is strictly increasing in the interval (0, e) and strictly decreasing in (e, +∞). In particular, f takes its maximum value at the point x0, namely f(x0) = f(e) = 1 e , see also the figure below. In section 6.1.2 we will learn about a test checking this final claim, which is based on the second derivative of f. The latter is simply defined by f′′ = (f′ )′ , hence in our case it has the form f′′ (x) = (f′ (x))′ = (1 − ln(x) x2 )′ = 2 ln(x) − 3 x3 . 8In this column we will mainly focus on real functions or on complexvalued functions of a real variable. Applications of complex-valued functions of a complex variable (complex analytic functions) will be analyzed later in Chapter 9. CHAPTER 5. ESTABLISHING THE ZOO Lemma. For every real number b ≥ −1, b ̸= 0, and a natural number n ≥ 2, (1 + b)n > 1 + nb. Proof. For n = 2, (1 + b)2 = 1 + 2b + b2 > 1 + 2b. Proceed by induction on n, supposing b > −1. Assume that the proposition holds for some k ≥ 2 and calculate: (1 + b)k+1 = (1 + b)k (1 + b) > (1 + kb)(1 + b) = 1 + (k + 1)b + kb2 > 1 + (k + 1)b The statement is, of course, true for b = −1 as well. □ Now an an−1 = (1 + 1 n )n (1 + 1 n−1 )n−1 = (n2 − 1)n n n2n(n − 1) = ( 1 − 1 n2 )n n n − 1 > ( 1 − 1 n ) n n − 1 = 1. by using Bernoulli’s inequality with b = − 1 n2 . So an > an−1 for all natural numbers, and it follows that the sequence an is indeed increasing. The following similar calculation (also using Bernoulli’s inequality) verifies that the sequence of numbers bn = ( 1 + 1 n )n+1 = ( 1 + 1 n ) ( 1 + 1 n )n is decreasing. Notice that bn > an. Also, bn bn+1 = n n + 1 ( n+1 n n+2 n+1 )n+2 = n n + 1 ( n2 + 2n + 1 n2 + 2n )n+2 = n n + 1 ( 1 + 1 n(n + 2) )n+2 ≥ n n + 1 ( 1 + n + 2 n(n + 2) ) = 1. Thus the sequence an is increasing and bounded from above, so the set of its terms has a supremum which equals the limit of the sequence. At the same time, this value is also the limit of the decreasing sequence bn because lim n→∞ bn = lim n→∞ (1 + 1 n )an = lim n→∞ an. This limit determines one of the most important numbers in mathematics (besides the numbers 0, 1, and π), namely Euler’s number14 e. Thus e = lim n→∞ ( 1 + 1 n )n . 14The ingenious Swiss mathematician, physicist, astronomer, logician and engineer Leonhard Euler (1707-1783) was behind extremely many inventions, including original mathematical techniques and tools. 418 According to this criterium, whenever we have f′′ (x0) < 0 near a critical point x0 of f, then f must attain a (local) maximum at x0. For our case we indeed compute f′′ (e) = − 1 e3 < 0. Notice finally in Sage one could compute the first and second derivative of f by the cell f(x)=ln(x)/x; d1(x)=diff(f, x).factor() show(d1(x)) D(x)=diff(d1, x).factor(); show(D(x)) Here we have programmed Sage to compute f′′ (x) “manually” (we denote this by D(x)). An alternative approach for the second derivative relies on an application of the command diff(f, x, 2).factor( ). This directly prints out the factorization of the expression given for the second derivative of f, without being necessary to compute the first derivative. We will meet more such applications in Chapter 6. □ 5.C.16. Prove by two different ways that the tangent function f(x) = tan(x) is strictly increasing for all x ∈ (−π/2, π/2). ⃝ 5.C.17. Consider the parabola y = x2 . Determine the x-coordinate of the parabola’s point which is nearest to the point A with coordinates x0 = 1 and y0 = 2. Solution. 
Recall the formula for the Euclidean distance between two points $[x_1, y_1]$, $[x_2, y_2]$ in the plane:
$$d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}.$$
An arbitrary point on the parabola has coordinates $[x, y] = [x, x^2]$ (and in this way we eliminate $y$ from the problem). Since the task requires the minimization of the distance between the points $A = [1, 2]$ and $[x, x^2]$, it is sufficient to find the minimum of the function
$$d(x) = \sqrt{(x-1)^2 + (x^2-2)^2}, \quad x \in \mathbb{R},$$
see also the figure for an illustration of the idea. The function $d(x)$ is non-zero for any $x \in \mathbb{R}$; in particular, the equation $g(x) := (x-1)^2 + (x^2-2)^2 = 0$ has only complex solutions. We have
$$d'(x) = \frac{g'(x)}{2\sqrt{g(x)}} = \frac{g'(x)}{2\,d(x)}, \quad x \in \mathbb{R}.$$

5.4.2. Power series for $e^x$. The exponential function has been defined as the only continuous function satisfying $f(1) = e$ and $f(x+y) = f(x)\cdot f(y)$. The base $e$ is now expressed as the limit of the sequence $a_n$, thus necessarily $e^x = \lim_{n\to\infty}(a_n)^x$. Fix a real number $x \neq 0$. If we replace $n$ with $n/x$ in the numbers $a_n$ from the previous paragraph, we arrive again at the same limit. (Think this out in detail!) Hence
$$e = \lim_{n\to\infty}\Big(1+\frac{x}{n}\Big)^{\frac{n}{x}}, \qquad e^x = \lim_{n\to\infty}\Big(1+\frac{x}{n}\Big)^{n}.$$
Denote the $n$-th term of this sequence by $u_n(x) = (1+\frac{x}{n})^n$ and express it by the binomial theorem:
$$(1)\quad u_n(x) = 1 + n\,\frac{x}{n} + \frac{n(n-1)x^2}{2!\,n^2} + \dots + \frac{n!\,x^n}{n!\,n^n} = 1 + x + \frac{x^2}{2!}\Big(1-\frac{1}{n}\Big) + \frac{x^3}{3!}\Big(1-\frac{1}{n}\Big)\Big(1-\frac{2}{n}\Big) + \dots + \frac{x^n}{n!}\Big(1-\frac{1}{n}\Big)\Big(1-\frac{2}{n}\Big)\cdots\Big(1-\frac{n-1}{n}\Big).$$
Look at $u_n(x)$ for very large $n$. It seems that many of the first summands of $u_n(x)$ will be fairly close to the values $\frac{1}{k!}x^k$, $k = 0, 1, \dots$. Thus it is plausible that the numbers $u_n(x)$ should be very close to $v_n(x) = \sum_{j=0}^{n}\frac{1}{j!}x^j$, and thus both these sequences should have the same limit. The following theorem is perhaps one of the most important results of Mathematics:

The power series for $e^x$

Theorem. The exponential function $e^x$ equals, for each $x \in \mathbb{R}$, the limit $\lim_{k\to\infty} v_k(x)$ of the partial sums in the expression
$$e^x = 1 + x + \frac{1}{2}x^2 + \dots + \frac{1}{n!}x^n + \dots = \sum_{n=0}^{\infty}\frac{1}{n!}x^n.$$
The function $e^x$ is differentiable at each point $x$, and its derivative is $(e^x)' = e^x$.

Proof. The technical proof makes the above idea precise. Although the complete argumentation might look complicated, the strategy is straightforward. We shall go through three steps. 1) Prove that the partial sums $v_n$ converge. 2) Write $u_{n,k}$ for the first $k < n$ summands in $u_n$ and conclude that for given $k$, the difference between $v_k$ and $u_{n,k}$ gets arbitrarily small for large $n > k$. 3) Show that there are subsequences $v_{k_i}$ and $u_{n_i}$ converging to the same limit. This will conclude the proof of the first claim of the theorem.

Because the square root is an increasing function, the function $d$ takes its least value at the same point where the function $g$ does. The derivative of $g$ has the form $g'(x) = 4x^3 - 6x - 2$, for all $x \in \mathbb{R}$. Moreover, the equation $g'(x) = 0$, or equivalently $0 = 2x^3 - 3x - 1$, has three solutions:
$$x_0 = -1, \qquad x_1 = \frac{1-\sqrt{3}}{2}, \qquad x_2 = \frac{1+\sqrt{3}}{2}.$$
We deduce that the $x$-coordinate in question equals $x_2 = \frac{1+\sqrt{3}}{2}$. Can you explain how we reject $x_0$ and $x_1$? □

First-order derivatives, and in particular tangent lines, give us the ability to approximate functions locally by linear functions (recall that linear functions are the easiest functions to work with). In particular, when $f$ is a differentiable function at a point $x_0$ of its domain, we can approximate the values of $f$ near $x_0$ via the formula
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0).$$
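Here is a tiny numerical sketch in Sage (our own illustration, with the sample choices f = exp and x0 = 0) of how quickly this approximation improves as x approaches x0:

f(x) = exp(x)                  # sample function; any differentiable f works
x0 = 0; df = diff(f, x)
for h in [0.1, 0.01, 0.001]:
    approx = f(x0) + df(x0)*h               # first-order approximation of f(x0+h)
    print(h, (f(x0 + h) - approx).n())      # errors ~5e-3, ~5e-5, ~5e-7

The error visibly shrinks roughly like $h^2$, a fact quantified by the Taylor expansions of Chapter 6.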
In the approximation formula above, the right-hand side is a first-degree polynomial with respect to $x$, and we suppose that the difference $x - x_0$ is "small enough". Hence, using first-order derivatives we can obtain a tangent-line approximation of the value $f(x)$ in a small neighbourhood of $x_0$. In Section A of Chapter 6 these ideas will be generalized; there we will explain how to approximate functions by higher-degree polynomials (Taylor polynomials). Next we pose applications related to linear approximations; see also 6.A.37 and 6.A.38 for an alternative interpretation in terms of the so-called "differentials". More exercises on tangential approximations are presented in Section D.

5.C.18. Linear approximations. Approximate linearly the function $f(x) = 1/x$ near $x_0 = 1$, and present the graph of $f$ and that of its tangent line at $x_0$. Then compute the approximations at $x = 0.9$ and $x = 1.1$ and compare their differences with the real values of $f$ at these points.

Solution. The first derivative of $f$ is given by $f'(x) = -1/x^2$, and at $x_0 = 1$ we get $f'(x_0) = -1$. Thus the tangent line of $f$ passing through $x_0$ has the form $f(1) + (x-1)f'(1) = 2 - x$; that is, the linear approximation of $f$ near $x_0$ is given by the line $L(x) = 2 - x$. The figure below illustrates the situation.

1) Fix $x$ and recall that $v_n(x)$ is the sequence defined as the sums of the first $n$ terms of the formal infinite expression
$$\sum_{j=0}^{\infty} c_j = \sum_{j=0}^{\infty}\frac{1}{j!}x^j.$$
For all $m > n$,
$$|v_m(x) - v_n(x)| \le \sum_{k=n+1}^{m}\frac{1}{k!}|x|^k = v_m(|x|) - v_n(|x|).$$
Consequently, in order to prove that $v_n(x)$ is a Cauchy sequence, and thus convergent, it is enough to restrict ourselves to $x > 0$ and to prove that $v_n(x)$ is always bounded, and thus convergent (as a nondecreasing sequence), in this case. The quotient of adjacent terms in the series is $\frac{c_{j+1}}{c_j} = \frac{x}{j+1}$. Thus for every fixed $x > 0$, there is a number $N \in \mathbb{N}$ such that $\frac{c_{j+1}}{c_j} < \frac{1}{2}$ for all $j \ge N$. Such large indices $j$ satisfy $|c_{j+1}| < \frac{1}{2}|c_j| < 2^{-(j-N+1)}|c_N|$. Recall that the sum of a geometric series is computed from the equality $(1-q)(1+q+\dots+q^k) = 1-q^{k+1}$. This means that the partial sums of the first $n > N$ terms of our formal sum with $x > 0$ can be estimated as follows:
$$v_n(x) - \sum_{j=0}^{N-1}\frac{1}{j!}x^j < \frac{1}{N!}x^N \sum_{j=0}^{n-N}\frac{1}{2^j}.$$
In particular, the limit of the expressions on the right-hand side for $n$ approaching infinity surely exists, and so the limit of the increasing sequence $v_n$ also exists.

2) Consider some fixed $k$ and $\varepsilon > 0$. If $N$ is large enough, then clearly for all $n > N$, $|u_{n,k} - v_k| < \varepsilon$. Indeed, there is only a fixed number of brackets in the $k$ summands of $u_{n,k}$, see (1), and they will all be arbitrarily close to 1 if $n$ is large enough.

3) Next, consider $x > 0$ and notice that $|u_n - u_{n,k}| < |v_n - v_k|$. Thus, if we fix some $\varepsilon > 0$, then there is some $M$ such that for $k > M$ we ensure $|u_n - u_{n,k}| < \varepsilon$ for all $n > k$. If $x < 0$, we may first bound the left-hand side by the sum of the absolute values of the individual terms, which is still less than $|v_n - v_k|$ evaluated at $|x|$. Now, take such a $k > M$ and consider the $N$ from the previous step (of course dependent on $k$). Then for $n > N$ we arrive at
$$|v_k - u_n| \le |v_k - u_{n,k}| + |u_{n,k} - u_n| < 2\varepsilon.$$
Finally, choosing $\varepsilon_\ell = \frac{1}{2\ell}$, the previous estimate allows us to find subsequences $v_{k_\ell}$, $u_{n_\ell}$ satisfying $|v_{k_\ell} - u_{n_\ell}| < \frac{1}{\ell}$. Thus the convergent sequences $u_n$ and $v_n$ must converge to the same limit value. This is the first claim we had to prove.

Recall that we already know that $(e^x)' = e^x$ if and only if the derivative equals 1 at the origin, see 5.3.7.
Thus, it remains to compute
$$\lim_{x\to 0}\frac{(1 + x + \frac{1}{2}x^2 + \dots) - 1}{x} = \lim_{x\to 0}\frac{x + \frac{1}{2}x^2 + \dots}{x}.$$
This seems to be tricky, since there are two limit processes involved (notice that an $x$ may be cancelled, since it is a constant with respect to the inner limit):
$$\lim_{x\to 0}\Big(\lim_{n\to\infty}\sum_{k=1}^{n}\frac{1}{k!}x^{k-1}\Big) = 1 + \lim_{x\to 0}\Big(\lim_{n\to\infty}\sum_{k=2}^{n}\frac{1}{k!}x^{k-1}\Big).$$
But now, for each positive $\varepsilon > 0$ we can find $N \in \mathbb{N}$ such that
$$\lim_{n\to\infty}\sum_{k=N}^{n}\frac{1}{(k+1)!}x^{k} < \lim_{n\to\infty}\sum_{k=N}^{n}\frac{1}{(k+1)!} < \varepsilon$$
for all $x \in [-1, 1]$. Finally, we can restrict the interval for $x$ enough to ensure that the remaining first terms are bounded by $\varepsilon$, too:
$$\sum_{k=2}^{N-1}\frac{1}{(k+1)!}x^{k} < \varepsilon.$$
This shows that the limit expression on the right-hand side must be zero. Thus the derivative exists and equals one, as expected. □

Now, we compute $f(0.9) \approx 1.1111$ and $L(0.9) = 1.1$, thus $|L(0.9) - f(0.9)| \approx 0.0111$. Similarly, we have $f(1.1) \approx 0.90909$, $L(1.1) = 0.9$ and $|L(1.1) - f(1.1)| \approx 0.009$. □

5.C.19. Given that $(\arcsin(x))' = \frac{1}{\sqrt{1-x^2}}$ for $x \in (-1, 1)$, approximate linearly the value $\arcsin(0.497)$. ⃝

5.C.20. Approximate linearly the values $\sin\big(\frac{29\pi}{180}\big)$ and $\sin\big(\frac{46\pi}{180}\big)$. ⃝

Rolle's theorem, the mean value theorem and its generalizations (Cauchy's mean value theorem) are important theorems that highlight the significance of derivatives, see 5.3.8 and 5.3.9. Below we illustrate them quickly via examples; further linked applications are presented in the final section of this chapter.

5.C.21. Prove that the equation $x^{2027} + 7x^3 - 5 = 0$ has a unique real solution.

Solution. Consider the function $f(x) = x^{2027} + 7x^3 - 5$. Since $f(0) = -5 < 0$ and $f(1) = 3 > 0$, by the intermediate value theorem there exists some $x_0 \in (0, 1)$ such that $f(x_0) = 0$. Suppose that $x_1 \neq x_0$ is another (positive) root of $f$. Without loss of generality we may assume that $x_0 < x_1$. We have $f(x_0) = f(x_1)$, and $f$ is continuous on $[x_0, x_1]$ and differentiable on $(x_0, x_1)$. Thus by Rolle's theorem the first derivative $f'$ of $f$ must have a root in $(x_0, x_1)$. However, $f'(x) = 2027x^{2026} + 21x^2$, and thus $f'(x) > 0$ for all $x > 0$, a contradiction. Hence $f$ has only one zero in $(0, +\infty)$. On the other hand, we see that $f(x) < 0$ for all $x \in (-\infty, 0]$. You may verify this final claim in Sage, just by the cell

f(x)=x^(2027)+7*x^3-5
assume(x<0); bool(f(x)<0)

Thus the given equation has a unique real solution. □

5.C.22. Using Rolle's theorem prove that the equation $x^3 + px + q = 0$ with $p, q \in \mathbb{R}$, $p > 0$, admits a unique real solution.

Solution. Consider the function $f(x) = x^3 + px + q$ with $x \in \mathbb{R}$, where $p, q \in \mathbb{R}$ and $p > 0$. It is easy to prove that $\lim_{x\to-\infty} f(x) = -\infty$ and $\lim_{x\to+\infty} f(x) = +\infty$. Moreover, $f$ takes both negative and positive values; thus there exists some $\xi \in \mathbb{R}$ with $f(\xi) = 0$. We will show that $\xi$ is unique. Indeed, suppose that $\zeta \in \mathbb{R}$ is another real number satisfying $f(\zeta) = 0$. We may assume that $\zeta < \xi$ (the case $\xi < \zeta$ is treated similarly). Based on Rolle's theorem we can then find some $a \in (\zeta, \xi)$ with $f'(a) = 0$. However, $f'(x) = 3x^2 + p$ for all $x$, and so this gives a contradiction (since $p > 0$, the equation $f'(x) = 0$ does not admit a real solution). □

5.C.23. Decide whether the function $f(x) = x^3 + b$, where $b$ is some constant, satisfies the mean value theorem on the interval $[1, 2]$. In the positive case determine $c$, where $c$ is as in Theorem 5.3.9.

Readers who skipped the preceding paragraphs (it does not matter whether on purpose or in need) can stay calm – we deduce all the results on the exponential function later again, using more general tools.
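A quick numerical comparison in Sage (our own sketch) shows how the two approximations behave in practice: the partial sums v_n converge to e^x much faster than the compound-interest terms u_n of 5.4.1.

n = 50; x = 1.0
u = (1 + x/n)^n                                  # u_n(x) from 5.4.1
v = sum(x^j/factorial(j) for j in range(n+1))    # partial sum v_n(x) from 5.4.2
print(u.n(), v.n(), exp(1).n())                  # u ~ 2.6916, v matches e closely

This is only numerical evidence, of course; the rigorous statements follow from the general theory of power series developed below.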
In particular, we will see that all power series are always differentiable and can be differentiated term by term. We see later that the conditions f′ (x) = f(x) and f(0) = 1 determine a function uniquely. 5.4.3. Number series. When deriving the previous theorems about the function ex , we have automatically used several extraordinarily useful concepts and tools. Now, we come back to them in detail: Infinite number series Definition. An infinite series of numbers is an expression ∞∑ n=0 an = a0 + a1 + a2 + · · · + ak + . . . , where the an’s are real or complex numbers. The sequence of partial sums is given by the terms sk = ∑k n=0 an. The series converges and equals s if the limit s = lim k→∞ sk of the partial sums exists and is finite. If the sequence of partial sums has an improper limit, the series diverges to ∞ or −∞. If the limit of the partial sums does not exist, the series oscillates. An immediate consequence of the properties of limits is the following claim: 421 Solution. The given function is a polynomial, and hence continuous on its domain [1, 2]. It is also differentiable on the open interval (1, 2), with f′ (x) = 3x2 . Therefore, it satisfies the mean value theorem. In particular, by Theorem 5.3.9 there exists c ∈ (1, 2) such that f′ (c) = 3c2 = f(2)−f(1) 2−1 = 8+b−1−b = 7. Thus c2 = 7/3, that is, c = ± √ 7/3. Since − √ 7/3 does not belong on the interval [1, 2], we deduce that c = √ 7/3 is the only possible value. □ 5.C.24. Check which of the following functions satisfy the mean value theorem. In the positive case, find the possible values of c, where c is as in the Theorem 5.3.9: f(x) = x 2 3 , g(x) = x x + 1 , h(x) = 3 √ 8 − x for x ∈ [−2, 2], x ∈ [1, 4] and x ∈ [0, 8], respectively. ⃝ 5.C.25. Using the mean value theorem prove that −1 ≤ cos(2x) < sin(2x) 2x ≤ 1 , for all x ∈ ( 0, π 2 ] . Solution. For x = π 2 the given inequality becomes −1 ≤ −1 < 0 ≤ 1, which is true. Hence we may focus on x ∈ (0, π 2 ). For any x ∈ (0, π 2 ) consider the function f(a) = sin(2a) , with a ∈ [0, x] . Observe that f is continuous on [0, x] as a composition of continuous functions. It is also differentiable on the open interval (0, x) with f′ (a) = 2 cos(2a). Thus, by the mean value theorem there exists c ∈ (0, x) with f′ (c) = f(x)−f(0) x−0 , or equivalently 2 cos(2c) = sin(2x) x ⇔ cos(2c) = sin(2x) 2x . (∗) Next we may assume that 0 < c < x < π 2 , that is, 0 < 2c < 2x < π. Also, for the function g(x) = cos(x) we have g′ (x) = − sin(x) < 0 for all x ∈ (0, π). This means that the function cos is strictly decreasing on (0, π) (having a local maximum at x = 0 and a local minimum at x = π). Thus, the previous inequality implies that 1 = cos(0) ≥ cos(2c) > cos(2x) ≥ cos(π) = −1 . Combining this inequality with (∗), the claim follows. □ 5.C.26. Based on Cauchy’s mean value theorem, show that a < n n + 1 (bn+1 − an+1 bn − an ) < b for all n ∈ N∗ , where a, b ∈ R are such that 0 < a < b ⃝ The theory of derivatives is useful in a plethora of practical tasks, but also for crucial problems appearing in many other sciences and technology, e.g., in physics, chemistry, statistics, economics, biology, engineering, environmental science, informatics and computer science, architecture, science of graphics, and space science, to mention a few of them. But why derivatives can be so useful? Partially this is because the derivative dy dx represents the rate of change of y CHAPTER 5. ESTABLISHING THE ZOO Corollary. Consider two convergent series S = ∑∞ n=0 an and T = ∑∞ n=0 bn. 
Then their sum and any constant multiple by a real or complex number $c$ are convergent too, and
$$S + T = \sum_{n=0}^{\infty}(a_n + b_n), \qquad cS = \sum_{n=0}^{\infty} c\,a_n.$$
In particular, any linear combination of series with constant coefficients is convergent and equals the same linear combination of the sums. Check the details yourself!

5.4.4. Properties of series. For the sequence of partial sums $s_n$ to converge, it is necessary and sufficient that it is a Cauchy sequence; that is,
$$|s_m - s_n| = |a_{n+1} + \dots + a_m|$$
must be arbitrarily small for sufficiently large $m > n$. Since $|a_{n+1}| + \dots + |a_m| \ge |a_{n+1} + \dots + a_m|$, the convergence of the series $\sum_{n=0}^{\infty}|a_n|$ implies the convergence of the series $\sum_{n=0}^{\infty}a_n$.

Absolutely convergent series

A series $\sum_{n=0}^{\infty}a_n$ is absolutely convergent if the series $\sum_{n=0}^{\infty}|a_n|$ converges.

Absolute convergence is introduced because it is often much easier to verify. The following theorem shows that all simple algebraic operations behave "very well" for series that converge absolutely.

Properties of series

Theorem. Let $S = \sum_{n=0}^{\infty}a_n$ and $T = \sum_{n=0}^{\infty}b_n$ be two absolutely convergent series. Then
(1) their sum converges absolutely to the sum
$$S + T = \sum_{n=0}^{\infty}a_n + \sum_{n=0}^{\infty}b_n = \sum_{n=0}^{\infty}(a_n + b_n),$$
(2) their difference converges absolutely to the difference
$$S - T = \sum_{n=0}^{\infty}a_n - \sum_{n=0}^{\infty}b_n = \sum_{n=0}^{\infty}(a_n - b_n),$$
(3) their product converges absolutely to the product
$$S \cdot T = \Big(\sum_{n=0}^{\infty}a_n\Big)\cdot\Big(\sum_{n=0}^{\infty}b_n\Big) = \sum_{n=0}^{\infty}\Big(\sum_{k=0}^{n}a_{n-k}b_k\Big),$$
(4) the value $S$ of the sum does not depend on any rearrangement of the series, i.e., $\sum_{n=0}^{\infty}a_{\sigma(n)} = S$ for any permutation $\sigma: \mathbb{N} \to \mathbb{N}$ of the indices.

Proof. The convergence in the first and the second statements was discussed already. Further, the linear combinations satisfy $|\alpha a_n + \beta b_n| \le |\alpha||a_n| + |\beta||b_n|$, and thus the absolute convergence is obvious, too.

with respect to $x$. Thus, the rate of change of any physical quantity at any time is obtained by differentiating the physical quantity with respect to time. Another reason is the related theory of minima and maxima (optimization problems), and we will meet many such tasks in Chapter 6. The next few examples aim to present how derivatives may be involved in problems of everyday life and other nice applications. A further enrichment of this list, which could easily be made much longer, is postponed to the final Section E of this chapter.

5.C.27. For a company renting city taxis, any driver costs €6 per working hour (h), while the gas costs €1.40 per litre (ℓ). Suppose that inside the city these taxis can travel $(180 - 2u)/6$ kilometres (km) per litre of gas, where $u$ denotes the driving velocity, measured in km/h. Find the cheapest driving speed, i.e., the one minimizing the company's cost per kilometre (counting only the expenses for a single taxi of its fleet), and present the corresponding minimal cost value in €/km per taxi.

Solution. We can focus on the cost of a single taxi (in € per km), which we should minimize. This cost is the sum of the cost of the driver (per km) and of the used gas (per km). According to the statement we have:
• the driver costs $\frac{6\ \text{€/h}}{u\ \text{km/h}} = \frac{6}{u}$ €/km;
• the gas costs $\frac{1.40\ \text{€/ℓ}}{\frac{180-2u}{6}\ \text{km/ℓ}} = \frac{8.4}{180-2u}$ €/km.
Thus, the total cost in € per km is a function of $u$, given by
$$c(u) := \frac{6}{u} + \frac{8.4}{180-2u} = \frac{6}{u} - \frac{4.2}{u-90}.$$
First we see that the driving velocity u, should satisfy 0 < u < 90 (e.g., the cost blows up for u = 90), and for such u the extreme points of c(u) correspond to the solutions of the equation c′ (u) = 0 (we recommend to sketch the graph of the cost function c via Sage). The derivative of c is given by c′ (u) = − 6 u2 + 4.2 (u − 90)2 and it follows that the equation c′ (u) = 0 has two solutions, namely u1,2 = ±30 √ 70 + 300 km/h. These computations can be quickly verified in Sage: u=var("u"); c(u)=(6/u)+8.4/(180-2*u) dc(u)=diff(c(u), u);show(solve(dc==0, u)) Acceptable is only the value u1 = −30 √ 70 + 300 ≈ 49 km/h, and this is the speed with the lowest costs. In particular, with this velocity the cost value c(u1) of a single taxi is approximately 0.225 C/km. □ 5.C.28. Suppose that from a given piece of industrial paper having area α2 we want to construct an open box, by cutting CHAPTER 5. ESTABLISHING THE ZOO The third statement is not so simple. Write cn = n∑ k=0 an−kbk. From the assumptions and from the rule for the limit of a prod- uct, ( k∑ n=0 an ) · ( k∑ n=0 bn ) → ( ∞∑ n=0 an ) · ( ∞∑ n=0 bn ) . Thus it suffices to prove that 0 = lim k→∞ (( k∑ n=0 an ) · ( k∑ n=0 bn ) − k∑ n=0 cn ) . Consider the expressions ( k∑ n=0 an ) · ( k∑ n=0 bn ) = ∑ 0≤i,j≤k aibj, cn = ∑ i+j=n 0≤i,j≤k aibj, k∑ n=0 cn = ∑ i+j≤k 0≤i,j≤k aibj. along with the bound ( k∑ n=0 an ) · ( k∑ n=0 bn ) − k∑ n=0 cn = ∑ i+j>k 0≤i,j≤k aibj ≤ ∑ i+j>k 0≤i,j≤k |aibj|. If the sum of the indices is to be larger than k, then at least one of them must be larger than k/2. The expression does not decrease if more terms are added into it. Take all as in the product and remove only those whose indices are both at most k/2. ∑ i+j>k 0≤i,j≤k |aibj| ≤ ∑ 0≤i,j≤k |aibj| − ∑ 0≤i,j≤k/2 |aibj|. However, both the expressions of the difference are the partial sums for the product S · T. Therefore, they share the same limit and their difference goes to zero. The last claim seems to be a little tricky. Notice, for each small ε > 0 we can find a common bound α such that for all N > α both estimates are true: ∞∑ n=N |an| < ε, N∑ n=0 an − S < ε. Now, consider any permutation σ of the indices and write Iσ = {σ−1 (0), . . . , σ−1 (α)}. Then, for each N > max Iσ 423 four small squares of equal size of its corners and then bending the resulting sides. Determine the total area of paper that we should cut, so that the box has maximum volume. Solution. Let x be the length of the side of any of the squares that we should cut, as it is illustrated in the figure: The volume of the box under construction is given by V (x) = x(α − 2x)2 with 0 < x < α 2 . We want to locate those x for which the function V attains its maximum value V (x). Since α > 0 is a constant, an application of the product rule gives V ′ (x) = dV dx = (α − 2x)2 + 2 · x · (α − 2x) · (−2) = (α − 2x)(α − 6x) . Thus V ′ (x) = 0 if and only if x = α 2 or x = α 6 . Now we will use the theory of second derivatives, which are analyzed in Section 6.1.5. As we said above, the second derivative of a differentiable function f is the derivative of its first derivative. So V ′′ (x) = (V ′ (x))′ = −2 · (α − 6x) − 6(α − 2x) = 24x − 8α = 24 ( x − α 3 ) . According to 6.1.5, if x0 satisfies f′ (x0) = 0 and f′′ (x0) < 0, then x0 is a local maximum of f, while when x0 is such that f′ (x0) = 0 but f′′ (x0) > 0, then x0 is a local minimum of f. In our case we see that V ′′ ( α 2 ) = 4α > 0 , V ′′ ( α 6 ) = −4α < 0 . 
Thus, the volume of the box under construction will be maximal if and only if x = α 6 , so the area that we should cut is 4· α2 36 = α2 9 = (α 3 )2 . Moreover, the maximal volume is given by V (α 6 ) = 2α3 27 = 2(α 3 )3 . An implementation of the whole problem via Sage is easy, and can be done in many ways. For instance, type the cell var("a") #introduce a symbolic variable V(x)=x*(a-2*x)^2 #declare the volume DV(x)=diff(V, x).factor()#declare the 1st #derivative of the volume show(solve(DV(x)==0, x)) #critical point eq. show(diff(V, x, 2)(x=a/6).factor()) show(V(a/6).factor()) #the maximal volume □ CHAPTER 5. ESTABLISHING THE ZOO clearly N∑ n=0 aσ(n) − S = N∑ n≤N, n∈Iσ aσ(n) − S + N∑ 0≤n≤N,n/∈Iσ aσ(n) ≤ α∑ n=0 an − S + N∑ 0≤n≤N, n/∈Iσ |aσ(n)|. Next, notice that n /∈ Iσ means σ(n) > α. Thus, the latter term is at most equal to ∑∞ α+1 an and thus the entire expresssion is bounded by 2ε. This shows that the rearranged series converges to the same value S again. □ 5.4.5. Simple tests. The following theorem collects some useful conditions for deciding on the convergence of series. Convergence tests Theorem. Let S = ∑∞ n=0 an be an infinite series of real or complex numbers. Let T = ∑∞ n=0 bn be another series with all bn ≥ 0 real. (1) If the series S converges, then limn→∞ an = 0. (2) (The comparison test) If T converges and |an| ≤ bn, then S converges absolutely. If bn ≤ |an| and T diverges, then S does not converge absolutely. (3) (The limit comparison test). If both an and bn are positive real numbers and the finite limit lim n→∞ an bn = r > 0 exists, then S converges if and only if T converges. (4) (The ratio test) Suppose that the limit of the quotients of adjacent terms of the series exists and lim n→∞ an+1 an = q. Then the series S converges absolutely for |q| < 1 and does not converge absolutely for |q| > 1. If |q| = 1 the series may or may not converge. (5) (The root test) If the limit lim n→∞ n √ |an| = q exists, then the series converges absolutely for q < 1. It does not converge absolutely for q > 1. If q = 1, the series may or may not converge. Proof. (1) The existence and the potential value of the limit of a sequence of complex numbers is given by the limits of the real parts and the imaginary parts. Thus it suffices to prove the first proposition for sequences of real numbers. If limn→∞ an does not exist or is non-zero, then for a sufficiently small number ε > 0, there are infinitely many terms ak with |ak| > ε. There are either infinitely many positive terms or infinitely many negative terms among them. But then, adding any one of them into the partial sum, the difference of the adjacent terms sn and sn+1 is at least 424 5.C.29. On a spherical ballon we inflate air so that its volume increase with a rate of 16 cm3 /min. Find the rate of change of the radius of the ballon when its volume is 36 cm3 . Hint: The volume of a sphere of radius r is given by 4πr3 3 . Solution. The radius of the ballon can be viewed as a function with respect to the time t. Hence also the volume of the ballon is a function of t (since it depends on the radius), i.e., V (t) = 4π 3 r3 (t) for all t. Now, by assumption at certain time t = t0 the volume is 36 cm3 . Hence we have 36 = 4πr3 (t0) 3 , and thus r3 (t0) = 33 π , that is, r(t0) = 3 3 √ π . We also compute V ′ (t) = dV dt = 4πr2 (t)r′ (t) . By assumption dV dt = 16, hence 16 = 4πr2 (t)r′ (t) and this gives r′ (t) = dr dt = 4 πr2(t) . 
At t = t0 we have r′ (t0) = 4 πr2(t0) and by replacing the value r(t0) = 3 3 √ π we obtain the rate of change of the radius of the balloon at time t = t0. □ 5.C.30. Consider an isosceles triangle with base length b and height (above the base) h. Inscribe the rectangle having the greatest possible area into it (one of the rectangle’s sides will lie on the triangle’s base). Determine the area S of this rectangle. Solution. To solve this problem, it suffices to consider the problem of inscribing the largest rectangle into a right triangle with legs of lengths b/2 and h so that two of the rectangle’s sides lie on the legs of the triangle. Thus we reduce the problem to maximizing the function f(x) = x ( h − 2 h x b ) on the closed interval I = [0, b/2]. Observe that f(x) ≥ 0, for all x ∈ I, with f(0) = f (b 2 ) = 0. We also have f′ (x) = h − 4 h x b and x0 = b/4 is the unique stationary point of f restricted to [0, b/2]. There f takes its greatest value, hence the sides of the required rectangle are b/2 long (i.e., twice x0), and h/2 long (the latter is obtained by substituting b/4 for x into the expression h − 2hx/b). Moreover, S = hb/4. □ 5.C.31. Recall that the velocity of a moving object is the derivative of its position function, and its acceleration is the derivative of its velocity function. If the position of a moving object in time t is given by the function s(t) = −(t − 3)2 + 16, t ∈ [0, 7] , where the time is measured in seconds, and the position is measured in meters, determine the following: (a) the initial velocity of the object (that is, at t = 0); (b) the time and position at which its velocity is zero; (c) its velocity and acceleration at time t = 4 s. ⃝ In the remainder of this section we shall practice the use of the so-called “l’Hopital’s rule” via a variety of examples (for extra material refer to Section E at the end of this chapter). CHAPTER 5. ESTABLISHING THE ZOO ε. Thus the sequence of partial sums cannot be a Cauchy sequence and, therefore, it cannot be convergent. (2) We are dealing with absolute convergence and thus the sequence of partial sums is non-decreasing. Once bounded, such a sequence is convergent and converges to its supremum. Similarly, the sequence of partial sums must be unbounded, if estimated from below by a divergent sequence. (3) Since the limit r = limn→∞ an bn exists, for any given ε > 0 and sufficiently large n > Nε, (r − ε)bn < an < (r + ε)bn. Thus, after choosing ε < r it follows that an < (r +ε)bn and bn < 1 r−ε an. The result follows from the previous claim (2). (4) To prove absolute convergence, it can be assumed that the terms of the series are real numbers an > 0. Suppose q < r < 1 for a real number r. From the existence of the limit of the quotients, for every j greater than a sufficiently large N, aj+1 < r · aj ≤ r(j−N+1) aN . But this means that the partial sums sn are, for large n > N, bounded from above by the sums sn < N∑ j=0 aj + aN n−N∑ j=0 rj = N∑ j=0 aj + aN 1 − rn−N+1 1 − r where the last equality follows from the general equality (1 − r)(1 + r + r2 + · · · + rk ) = 1 − rk+1 . Since 0 < r < 1, the set of all partial sums is an increasing sequence bounded from above, and thus its limit equals its supremum. In the case q > r > 1, a similar technique can be used. However, this time, from the existence of the limit of the quotients, aj+1 > r · aj ≥ r(j−N+1) aN > 0. This implies that the absolute values of the particular terms of the series do not converge to zero, and thus the series cannot be convergent, by the already proved part (1) of the theorem.
(5) The proof is similar to the previous case. From the existence of the limit q < 1, it follows that for any r, q < r < 1, there is an N such that for all n > N, n √ |an| < r holds. Exponentiation then gives |an| < rn , there is a comparison with a geometric series. Thus the proof can be finished in the same way as in the case of the ratio test. □ 5.4.6. Limes superior and inferior. In the proofs of the last two statements of the previous theorem, a much weaker assumption is used than the existence of the limit. It is only necessary to know that the examined sequences of non-negative terms (ratios or roots) are, from a given index on, either all larger or all less than a given number. For this purpose however, it suffices to consider, for a given bounded sequence of terms bn, the supremum of the terms with index higher than n. These suprema always exist and create a non-increasing sequence. Its infimum is then 425 Recall the claim is that for each differentiable curve [g(t), f(t)] emanating from the origin, the limit of the slope f(t) g(t) exists and they are equal, provided that the limit of the tangent lines in the origin exists. This resolves the indeterminate forms 0 0 , and the similar result is available for indeterminate forms ∞ ∞ , see also 5.3.10. It’s important to note that multiple applications of l’Hopital’s rule are quite common. This practice involves stating that the limits of quotients of derivatives obtained through l’Hopital’s rule are equal to the limits of the original quotients, albeit this is sometimes an abuse of notation. To justify this, it is essential to ensure that the limits produced on the right-hand sides actually exist, which may require repeated verification. 5.C.32. l’Hopital’s rule. Apply the l’Hopital’s rule to compute the limit lim x→+∞ f(x), where f(x) = ln x − 4 ex 8 ex αe − ln x , α = constant > 0 . Solution. When x → +∞, both the numerator and denominator of f(x) give the indeterminate form (+∞) − (+∞). However, we may write lim x→+∞ f(x) = lim x→+∞ ln x − 4 ex 8 ex αe − ln x = lim x→+∞ ln x ex − 4 8αe − ln x ex = lim x→+∞ ln x ex − 4 8αe − lim x→+∞ ln x ex . Now, for the limit lim x→+∞ ln x ex we may apply the l’Hopital’s rule. For this we need the derivatives of ln x and ex , and by 5.3.7 we know that (ln x)′ = 1/x and (ex )′ = ex , respectively. Therefore lim x→+∞ f(x) = lim x→+∞ (ln x)′ (ex)′ − 4 8αe − lim x→+∞ (ln x)′ (ex)′ = lim x→+∞ 1/x ex − 4 8αe − lim x→+∞ 1/x ex = lim x→+∞ 1 x ex − 4 8αe − lim x→+∞ 1 x ex = 0 − 4 8αe − 0 = − 1 2αe . □ 5.C.33. Based on the l’Hopital’s rule verify the given type and next verify the given result: (a) lim x→0 sin(2x) − 2 sin(x) 2 ex −x2 − 2x − 2 = −3 , (type 0 0 ); (b) lim x→0+ ln(x) cot(x) = 0 , (type ∞ ∞ ); (c) lim x→1+ ( x x − 1 − 1 ln(x) ) = 1 2 , (type ∞ − ∞); CHAPTER 5. ESTABLISHING THE ZOO called upper limit of the sequence and denoted by lim sup n→∞ bn = lim n→∞ sup{bk; k ≥ n}. The advantage is that the upper limit always exists (either finite if the sequence is bounded, or ±∞ if the sequence is unbounded). Similarly, lim inf n→∞ bn = lim n→∞ inf{bk; k ≥ n}. Therefore, we can reformulate the previous result (without having to change the proof) in a stronger form: Corollary. Let S = ∑∞ n=0 an be an infinite series of real or complex numbers. (1) If q = lim sup n→∞ an+1 an , then the series S converges absolutely for q < 1 and does not converge absolutely for q > 1. For q = 1, it may or may not converge. 
(2) If q = lim sup n→∞ n √ |an|, the series converges absolutely for q < 1 while it does not converge absolutely for q > 1. For q = 1, it may or may not converge. In the literature, the ratio test is often called d’Alembert’s criterion, while the root test is called Cauchy’s criterion. There are many other useful tests, but we do not have space for them here. 5.4.7. Alternating series. The condition an → 0 is a necessary but not sufficient condition for the convergence of the series ∑∞ n=0 an. However, there is Leibniz’s criterion of convergence for special types of series. Leibniz’s criterion for alternating series The series ∑∞ n=0(−1)n an, where an is a non-increasing sequence of non-negative real numbers, is called an alternating series. Theorem. An alternating series converges if and only if limn→∞ an = 0. Its value a = ∑∞ n=0(−1)n an differs from the partial sum s2k by at most a2k+1. Proof. By definition, the partial sums sk of an alternating series satisfy s2(k+1)+1 = s2k+1 + a2k+2 − a2k+3 ≥ s2k+1 s2(k+1) = s2k − a2k+1 + a2k+2 ≤ s2k s2k+1 − s2k = −a2k+1 → 0 s2 ≥ s2k ≥ s2k+1 ≥ s1. Thus, the even partial sums are a non-increasing sequence, while the odd ones are non-decreasing. The last line reveals that the bounded sequence of the odd partial sums converges 426 (d) lim x→1+ (ln(x) · ln(x − 1)) = 0 , (type 0 · ∞); (e) lim x→0+ (cot(x)) 1 ln(x) = 1 e , (type ∞0 ); (f) lim x→0 ( sin(x) x ) 1 x2 = 1 6 √ e , (type 1∞ ); (g) lim x→1− ( cos( πx 2 ) )ln x = 1 , (type 00 ). Solution. (a) The conclusion for the type is immediate. Set f(x) = sin(2x)−2 sin(x) 2 ex −x2−2x−2 . Repeatedly applying l’Hopital’s rule we obtain (observe that at any step the limit on the r.h.s. exists) lim x→0 f(x) 0 0 = lim x→0 (sin(2x) − 2 sin(x)) ′ (2 ex −x2 − 2x − 2) ′ = lim x→0 2 cos(2x) − 2 cos(x) 2 ex −2x − 2 0 0 = lim x→0 (2 cos(2x) − 2 cos(x)) ′ (2 ex −2x − 2) ′ = lim x→0 −4 sin(2x) + 2 sin(x) 2 ex −2 0 0 = lim x→0 (−4 sin(2x) + 2 sin(x)) ′ (2 ex −2) ′ = lim x→0 −8 cos(2x) + 2 cos(x) 2 ex = −3 . (b) Clearly, limx→0+ ln(x) = −∞ and limx→0+ cot(x) = +∞ (see also a sketch of the graphs of these functions below), hence the type of indeterminate form. Otherwise, for a direct verification via Sage, one may type g(x)=cot(x);k(x)=ln(x) lim(g(x), x=0, dir="plus") lim(ln(x), x=0, dir="plus") This time, applying l’Hopital’s rule we arrive at an indeterminate form of a different type, which is again a quite common situation. In particular, recall that cot(x) = 1 tan(x) = cos(x) sin(x) and hence cot′ (x) = −1/ sin2 (x). Setting f(x) = ln(x)/ cot(x) and observing that at any step the limit in the CHAPTER 5. ESTABLISHING THE ZOO to its supremum, while the even ones converge to the infimum. The third line shows that these two limits coincide if and only if limn→∞ an = 0, which proves the first claim. At the same time the limit value a of the series is always at least s2k+1 and at most s2k. Thus, the latter partial sums cannot differ by more than a2k+1 from the limit value. □ Remark (Riemann’s rearrangement theorem). As is obvious from the latter theorem, convergent alternating series often do not converge absolutely. Such series are called conditionally convergent. In contrast to the independence of the order in which we sum up the terms of an absolutely convergent series (cf. (4) of Theorem 5.4.4), there is the famous Riemann theorem saying that a conditionally convergent series can be brought to any finite or infinite value by an appropriate rearrangement of the terms in the sum.
We shall not go into the detailed proof here, but the idea is very simple: if a series S converges only conditionally, we may split it into the series of positive and negative terms, S+ and S−, say ordered by absolute value, and both of them have to be divergent (otherwise, they would both converge absolutely and thus their difference would do as well). Now, prescribing the desired limit q, we shall add the postive terms until getting bigger than q, then adding the negative ones, until getting smaller than q, etc. 5.4.8. Convergence rate. The proofs of the tests derived in the previous two paragraphs allow also for straightforward estimates of the speed of the convergence. Indeed, both the tests for the absolute convergence are based on the comparison with the geometric series either for q = lim supn→∞ an+1 an or q = lim supn→∞ n √ |an|, and 0 < q < 1. In the estimate of the error of approximation of the limit s∞ by the n-th partial sum sn |s∞ − sn| < |aN | ∞∑ j=0 rj = |aN |rn−N 1 1 − r = Crn where N and q < r < 1 are the two related choices from the proof of the test and C is the resulting constant not dependent of n. Thus the convergence rate is quite fast, in particular if r is much smaller than 1 (and we can get r as close to q as necessary). On the other hand, the proof of the alternating series test shows that the convergence rate is at least as fast as the convergence of the terms an. 5.4.9. Power series. If we do not consider a sequence of numbers an, but rather a sequence of functions fn(x) sharing the same domain A, we can use the definition of addition of series “point-wise”, thereby obtaining the concept of the series of functions S(x) = ∞∑ n=0 fn(x). 427 r.h.s. exists, the following makes sense lim x→0+ f(x) +∞ +∞ = lim x→0+ ln′ (x) cot′ (x) = lim x→0+ 1 x − 1 sin2(x) = lim x→0+ − sin2 (x) x 0 0 = lim x→0+ − ( sin2 (x) )′ (x)′ = lim x→0+ (−2 sin(x) cos(x)) = 0 . (c) For a verification of the type, notice that limx→1+ x x−1 = +∞, and limx→1+ 1 ln x = +∞. However, we see that f(x) := ( x x−1 − 1 ln x ) = x ln(x)−x+1 (x−1) ln(x) , thus we obtain the indeterminate form 0/0. One computes the following: lim x→1+ f(x) 0 0 = lim x→1+ (x ln(x) − x + 1)) ′ ((x − 1) ln(x)) ′ = lim x→1+ ln(x) + x x − 1 ln(x) + x−1 x = lim x→1+ x ln(x) x ln(x) + x − 1 0 0 = lim x→1+ (x ln(x)) ′ (x ln(x) + x − 1) ′ = 1 2 . (d) Obviously, this case is of type 0 · (−∞). We transform the given expression into the type −∞ ∞ by writing f(x) := ln(x) · ln(x − 1) = ln(x−1) 1 ln(x) . Then we see that lim x→1+ f(x) 0 0 = lim x→1+ (ln(x − 1)) ′ ( 1 ln(x) )′ = lim x→1+ 1 x−1 −1 x · 1 ln2(x) = lim x→1+ x ln2 (x) 1 − x 0 0 = lim x→1+ ( x ln2 (x) )′ (1 − x)′ = lim x→1+ ln2 (x) + 2 · x x · ln(x) −1 = 0 . (e) Recall from above that limx→0+ cot(x) = +∞. Moreover limx→0+ 1 ln(x) = 1 −∞ = 0, hence indeed we have the indeterminate form (+∞)0 . Moreover, we know that lim x→0+ f(x) = lim x→0+ (cot(x)) 1 ln(x) = e lim x→0+ ln (cot(x)) ln(x) , and hence it suffices to calculate the limit given in the argument of the exponential function. Setting g(x) = ln(cot(x)) ln(x) CHAPTER 5. ESTABLISHING THE ZOO If we consider the monomials fn(x) = anxn , we arrive at the “infinite polynomials”: Power series Definition. A power series is a series of functions given by the expression S(x) = ∞∑ n=0 anxn with coefficients an ∈ C, n = 0, 1, . . . . S(x) has the radius of convergence ρ ≥ 0 if and only if S(x) converges for every x satisfying |x| < ρ and does not converge for |x| > ρ. 5.4.10. Properties of power series. 
Although a significant part of the proof of the following theorem is postponed until the end of the following chapter, the formulation of the basic properties of power series can be considered now. Actually, we should notice that our argument on the convergence of power series works exactly the same way for complex values of z, and even now the reader may enjoy a direct simple proof of all the claimed properties in the complex setting in 9.4.2 on page 873 in Chapter 9. Our real case may be viewed as a special case there, and there is nothing involved but some very straightforward and natural estimates (which could serve as a nice exercise right now). Recall that the upper limit r = lim supn→∞ n √ |an| equals the limit limn→∞ n √ |an|, whenever this limit exists. Similarly with the ratio criterion and lim supn→∞ an+1 an . Convergence and differentiation Theorem. Let S(x) = ∑∞ n=0 anxn be a power series and let r = lim sup n→∞ n √ |an|. Then the radius of convergence of the series S is ρ = r−1 . Equivalently, we may compute r = lim sup n→∞ an+1 an . The power series S(x) converges absolutely on the whole interval of convergence and is continuous on it (including the boundary points, supposing it is convergent there). Moreover, the derivative exists on this interval, and S′ (x) = ∞∑ n=1 nanxn−1 . Proof. To verify the absolute convergence of the series, use the root test from theorem 5.4.5(5), for every value of x. Calculate (if the limit exists) lim n→∞ n √ |anxn| = r|x|. 428 we obtain lim x→0+ g(x) +∞ −∞ = lim x→0+ (ln (cot(x))) ′ ln′ (x) = lim x→0+ 1 cot(x) · (cot(x))′ 1 x = lim x→0+ x · ( tan(x) · −1 sin2 (x) ) = lim x→0+ −x cos(x) · sin(x) 0 0 = lim x→0+ −1 cos2(x) − sin2 (x) = −1 . Thus limx→0+ f(x) = e−1 = 1/ e. The cases (f) and (g) rely on the same trick as in the previous case. Thus, limx→0(sin(x) x ) 1 x2 = e limx→0 ln( sin(x) x ) x2 and limx→1− ( cos(πx 2 ) )ln(x) = e limx→1− ( ln(x)·ln ( cos( πx 2 ) )) , respectively, and now you should prove that the limits appearing as exponents on the right-hand side equal −1/6 and 0, respectively. Check yourself that this leads to the evaluations e−1/6 = 1/ 6 √ e and e0 = 1, respectively. □ 5.C.34. Show with an example that using l’Hopital’s rule for limits which are not indeterminate can lead to wrong results. ⃝ D. Infinite sums and power series In this section we shall investigate whether we can add infinitely many real numbers, which leads us to consider what is known as an infinite series of real numbers. Usually we denote such a series by ∑∞ n=0 an. By saying that an infinite series ∑∞ n=0 an converges to a real number L, we mean that the sequence (Sn = ∑n k=0 ak) of partial sums converges to L, that is, lim n→∞ Sn = L . If the sequence (Sn) does not converge, we usually say that the series diverges. When an infinite series ∑∞ n=0 an is convergent, it is relatively easy to prove that an → 0 as n → +∞, see 5.4.5. Hence, if the sequence (an) does not converge to zero, then the given series is not convergent. However, there are many cases where an → 0 as n → +∞, but the induced series diverges, see our first example below (5.D.1). In 5.4.5 further theorems are presented that help us to determine whether a series converges or diverges, known as “convergence criteria”. Many of these criteria revolve CHAPTER 5. ESTABLISHING THE ZOO If this limit is different from 1, then either the series converges absolutely (when r|x| < 1), or it does not converge (when r|x| > 1). It follows that it converges for |x| < ρ and diverges for |x| > ρ.
If the limit does not exist, use the upper limit in the same way. The same argument based on the ratio test of convergence leads to the other formula for r. The statements about continuity and the derivatives are proved later in a more general context, see 6.3.7–6.3.9, or check the straightforward proof in the complex setting in 9.4.2. □ 5.4.11. Remarks. If the coefficients of the series increase rapidly enough, (for example an = nn ), then r = ∞. Then the radius of convergence is zero, and the series converges only at x = 0. Here are some examples of convergence of power series (including the boundary points of the corresponding interval): Consider S(x) = ∞∑ n=0 xn , T(x) = ∞∑ n=1 1 n xn . The former example is the geometric series, which was already discussed. Its sum is, for every x, |x| < 1, S(x) = 1 1 − x , while |x| > 1 guarantees that the series diverges. For x = 1, we obtain the series 1 + 1 + 1 + . . . , which is divergent. For x = −1, the series is 1 − 1 + 1 − . . . , whose partial sums do not have a limit. The series oscillates. Theorem 5.4.5(2) shows that the radius of convergence of the series T(x) is again 1 because lim n→∞ 1 n+1 xn+1 1 n xn = |x| lim n→∞ n n + 1 = |x|. For x = 1, the series 1 + 1 2 + 1 3 + . . . , is divergent: By summing up the 2k−1 adjacent terms 1/2k−1 , . . . , 1/(2k −1) and replacing each of them by 2−k (thus they total up to 1/2), the partial sums are bounded from below by the sum of these 1/2’s. Since the bound from below diverges to infinity, so does the original series. On the other hand, the series T(−1) = −1+ 1 2 − 1 3 +. . . converges although of course, it cannot converge absolutely. Of course, this is true since we deal here with an alternating series. Notice that the convergence of a power series is relatively fast near x = 0. It is slower near the boundary of the convergence interval. 5.4.12. Trigonometric functions. Another important observation is that a power series is a series of numbers for each fixed x and the individual terms make sense for complex numbers x ∈ C. Thus the domain of convergence of a power series is always a disc in the complex plane C centered at the origin. 429 around “absolute convergence”, a significant concept introduced in 5.4.4. Our goal next is to carefully analyze numerous examples to illustrate these theoretical statements. Additionally, we will demonstrate how Sage can be used for computational assistance. We start with a straightforward example, the so-called “harmonic series”, and progressively introduce more challenging examples and applications. 5.D.1. Explain why the following series are divergent: S = ∞∑ n=1 1 √ n , H = ∞∑ n=1 1 n , Υ = ∞∑ n=1 ln ( n + 1 n ) . Solution. Starting with S, first of all observe that 1√ n → 0 as n → ∞. So indeed we cannot apply (1) of Theorem 5.4.5. However, considering the partial sums we see that Sn = 1 + 1 √ 2 + · · · + 1 √ n > n √ n = √ n for all n ∈ N∗ . This means that the sequence (Sn)n∈N∗ of partial sums of S is not bounded, and this implies that S di- verges. For H observe again that 1 n → 0 as n → ∞. This series is the harmonic series, discussed also in 5.B.19, and here we present a different proof verifying its divergence. 
The newer method is based on the partial sums of H, listed below: H2 = 1 + 1 2 = H1 + 1 2 , H4 = H2 + 1 3 + 1 4 ≥ H2 + 1 4 + 1 4 = H2 + 1 2 , H8 = H4 + 1 5 + 1 6 + 1 7 + 1 8 ≥ H4 + 1 8 + 1 8 + 1 8 + 1 8 = H4 + 1 2 , · · · · · · H2n+1 = H2n + 1 2n + 1 + 1 2n + 2 + · · · + 1 2n + 2n ≥ H2n + 2n 2n+1 = H2n + 1 2 , and more general H2n ≥ 1 + n 2 , n ≥ 1 . Thus, the sequence of partial sums is not bounded, and diverges to infinity. This implies that the harmonic series diverges as well. As we will see below, the harmonic series is a special case of a more general series, the so called “p-series”. For Υ, we have an := ln (n+1 n ) → 0 for n → +∞, as well. To verify this claim, use Sage via the cell var("n");lim(ln((n+1)/n), n=oo) CHAPTER 5. ESTABLISHING THE ZOO More generally, we can write power functions centered at an arbitrary (complex) point x0, S(x) = ∞∑ n=0 an(x − x0)n , which converge absolutely again on the disc of radius ρ, ρ−1 = lim supn→∞ n √ |an|, but this time centered at x0. Earlier we proved explicitly (by a simple application of the ratio test) that the exponential function series converges everywhere. Thus this defines a function for all complex numbers x. Its values are the limits of values of (complex) polynomials with real coefficients and each polynomial is completely determined by finitely many of its values. In particular, the values of each series are completely determined on the complex domain by their values at the real input values x. Therefore, the complex exponential must also satisfy the usual formulae which we have already derived for the real values x. In particular, for all x, y ∈ C ex+y = ex · ey , which can be also easily checked directly by the formula for the product in theorem 5.4.4(3). Substitute the values x = i t, where i ∈ C is the imaginary unit, t ∈ R arbitrary. We get the complex valued function on R eit = 1 + it − 1 2 t2 − i 1 3! t3 + 1 4! t4 + i 1 5! t5 − . . . . The conjugate number to z = eit is the number ¯z = e−it . Hence |z|2 = z · ¯z = eit · e−it = e0 = 1. All the values z = eit lie on the unit circle centered at the origin, in the complex plane. The real and imaginary parts of the points lying on the unit circle are named as the trigonometric functions cos θ and sin θ, where θ is the corresponding angle. Differentiating the parametric description of the points of the circle t → eit , gives the vectors of “velocities” which are easily computed. Differentiating the real and imaginary parts separately (knowing that the real power series can be differentiated term by term) gives : (eit )′ = (1 − 1 2 t2 + 1 4! t4 . . . )′ + i(t − 1 3! t3 + 1 5! t5 . . . )′ = −(t − 1 3! t3 + 1 5! t5 . . . ) + i(1 − 1 2 t2 + 1 4! t4 . . . ) 430 However, we may rely on the properties of natural logarithm, as follows: ∞∑ n=1 an = lim n→∞ ( ln 2 1 + ln 3 2 + ln 4 3 + · · · + ln n + 1 n ) = lim n→∞ ln 2 · 3 · 4 · · · (n + 1) 1 · 2 · 3 · · · n = lim n→∞ ln (n + 1) = +∞. Thus the series Υ also diverges to +∞. □ The infinite number series ζ(p) = ∞∑ n=1 1 np = 1 + 1 2p + · · · + 1 np + · · · is called the p-series, and it is known that converges for all p > 1. For 0 < p ≤ 1, the p-series diverges (observe that for p = 1 coincides with the harmonic series discussed above). For p > 1 the p-series ζ(p) is a monotone decreasing function of p, called the “Riemann zeta function”.9 Euler proved that ∞∑ n=1 1 np = ∏ s 1 ( 1 − 1 sp ) , where s ranges over all prime numbers. This formula, which is now known as the Euler product formula, makes ζ a very important tool in number theory. 
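For a small numerical illustration of the Euler product (our own addition, using only standard Sage functions), one can compare the product truncated to the first thousand primes with the exact value ζ(2) = π2/6; the truncated product already matches ζ(2) to about four decimal places:
P=prod(1/(1-1/p^2) for p in primes_first_n(1000)) #truncated Euler product
print(N(P), N(zeta(2))) #both approximately 1.6449
Including more primes brings the truncated product as close to ζ(2) as desired.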
However, for most values of p > 1 (in particular, for the odd integers p = 3, 5, . . .) no closed-form expression for the value ζ(p) is known, although there are accurate numerical approximations. Next we will use Sage to evaluate ζ(p) for some p. 5.D.2. Evaluate the zeta function ζ(p) for p = 2, . . . , 9 with the aid of Sage. Is this feasible for all the given values of p? ⃝ Infinite sums naturally appear in many practical problems. Perhaps the best known ancient example is the Achilles paradox, due to Zeno and discussed by Aristotle, which we briefly highlight in the next exercise. Later we will discuss more sophisticated problems, starting for example with 5.D.10. 5.D.3. Achilles’ paradox. Although Achilles runs much faster than the turtle on the beach, he needs some time tn to halve their distance. Thus Achilles needs the infinite sum of these time intervals tn to catch the turtle. Explain why this sum is still a finite number. Solution. If the current distance is d, and the velocity of the turtle is 1 while the velocity of Achilles is α, then their distance after time t is d + t − α t. Thus, shortening the distance to the half means t = 1 2 d α−1 . Repeating this n times, we obtain the time intervals tn = d α − 1 · 1 2n . 9Named after G. F. B. Riemann (1826–1866). CHAPTER 5. ESTABLISHING THE ZOO which means (eit )′ = i · eit . So the velocity vectors all have unit length. Hence the entire circle is parametrized as t runs through the interval [0, 2π], where 2π stands for the length of the circle (a thorough definition of the length of a curve needs integral calculus, which we will develop in the next chapter). In particular, this procedure of parameterizing the circle can be used to define the number π, also called Archimedes’ constant or the Ludolphian number,15 half the length of the unit circle in the Euclidean plane R2 . It can be found by computing the first positive zero point of the imaginary part of eit . For example, use the 10th order approximation sin t ≃ t − 1 6 t3 + 1 120 t5 − 1 5040 t7 + 1 362880 t9 . Ask any computer algebra system to find the first positive root. The result is π ≃ 3.148690, for which only the first 2 decimal digits are correct. But with the 20th order approximation this gives 3.141592, with 5 digits correct. The reason for the slow approximation is that π is relatively far from zero. The explicit representation of trigonometric functions in terms of the power series is now apparent: cos t = Re eit = 1 − 1 2 t2 + 1 4! t4 − 1 6! t6 + · · · + (−1)k 1 (2k)! t2k + . . . sin t = Im eit = t − 1 3! t3 + 1 5! t5 − 1 7! t7 + · · · + (−1)k 1 (2k + 1)! t2k+1 + . . . The following diagram illustrates the convergence of the series for the cosine function. It is the graph of the corresponding polynomial of degree 68. Drawing partial sums shows that the approximation near zero is very good. As the order increases, the approximation is better further away from the origin as well. 15This number describes the ratio of the circumference to the diameter of an (arbitrary) circle. It was known to the Babylonians and the Greeks in ancient times. The term Ludolphian number is derived from the name of the German mathematician Ludolph van Ceulen of the 16th century, who produced 35 digits of the decimal expansion of the number, using the method of inscribed and circumscribed regular polygons, invented by Archimedes. 431 The procedure of summing all these tn follows the rule of geometric series (see 5.B.16). Thus the result is ∞∑ n=0 1 2n = 1 1 − 1 2 = 2 and so Achilles needs the time d α−1 (here we sum from n = 1).
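The total catch-up time can also be checked symbolically in Sage; here is a minimal sketch of our own, where the symbols d and a stand for the initial distance d and the velocity α of Achilles in the solution above:
var("n d a")
assume(a>1) #Achilles is faster than the turtle
t(n)=(d/(a-1))*(1/2)^n #the time intervals t_n
show(sum(t(n), n, 1, oo)) #returns d/(a - 1)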
This perfectly matches the fact that their relative mutual velocity is α − 1 (explain why this assertion is true). □ An infinite series, whether real or complex, that is absolutely convergent is also convergent. This is fundamentally due to the completeness of both the real numbers (R) and the complex numbers (C). Conversely, a series that is convergent but not absolutely convergent is termed “conditionally convergent”, see the end of Section 5.4.7. In the following sections we will examine a range of examples that encompass all possible cases of convergence and divergence. 5.D.4. Examine which of the following series are convergent and which are divergent: T1 = ∞∑ n=1 n2 + 1 2n2 − 1 , T2 = ∞∑ n=1 2n n , T3 = ∞∑ n=1 2n (n + 2)! , T4 = ∞∑ n=1 1 n · 22025 , T5 = ∞∑ n=1 n7 + 7n + 2 (7n + 2) · n7 , T6 = ∞∑ n=1 en n2 , T7 = ∞∑ n=0 1 (n + 1) · 3n , T8 = ∞∑ n=1 n2 + 1 n3 , T9 = ∞∑ n=0 2n + 3n n! , T10 = ∞∑ n=1 2n · ( 4 5 )n2 , T11 = ∞∑ n=0 2 + sin3 (n + 1) 4n + n2 , T12 = ∞∑ n=1 nn n2 · n! . Solution. 1) We see that limn→∞ n2 +1 2n2−1 = 1 2 ̸= 0, so by (1) in Theorem 5.4.5 the series T1 is divergent. 2) Similarly we see that limn→∞ 2n n = ∞, hence T2 is divergent. Notice however that when the terms of the series are positive, absolute convergence is the same as convergence. Hence the same result follows by the ratio test, also known as the d’Alembert criterion (see (4) of Theorem 5.4.5). This is because lim n→∞ an+1 an = lim n→∞ 2n+1 n+1 2n n = lim n→∞ 2(n + 1) n = 2 > 1 . One may wish to verify the divergence of T2 (or T1) in Sage as well. This can be done just by typing var("n"); sum(2^n/n, n, 1, oo) and the essential part of the answer that Sage returns is the very final output: Sum is divergent. Thus, the sum method in Sage can be used as a verification CHAPTER 5. ESTABLISHING THE ZOO The well-known formula eit e−it = sin2 t + cos2 t = 1 is immediate. From the derivative (eit )′ = i eit it follows that (sin t)′ = cos t, (cos t)′ = − sin t by considering real and imaginary parts. Let t0 denote the smallest positive number for which e−it0 = − eit0 . The value t0 is the first positive zero point of the function cos t. According to the definition of π, t0 = 1 2 π. If approximating the function cos by the 10th order approximation, then twice the first positive root yields π ∼ 3.1415917 with 5 digits correct, while the 20th order approximation provides 10 correct digits after the decimal point. Squaring yields ei2t0 = e−i2t0 = (e−it0 )2 . So π is a zero point of the function sin t. Of course, for any t, ei(4kt0+t) = (eit0 )4k · eit = 1 · eit . Therefore, both trigonometric functions sin and cos are periodic, with period 2π. This is their prime period. Now the usual formulae connecting the trigonometric functions are easily derived. For illustration, we introduce some of them. First, the definition says that cos t = 1 2 (eit + e−it )(1) sin t = 1 2i (eit − e−it ).(2) Thus the product of these functions can be expressed as sin t cos t = 1 4i (eit − e−it )(eit + e−it ) = 1 4i (ei2t − e−i2t ) = 1 2 sin 2t. Further, by utilizing our knowledge of derivatives: cos 2t = ( 1 2 sin 2t)′ = (sin t cos t)′ = cos2 t − sin2 t. The properties of other trigonometric functions tan t = sin t cos t , cot t = (tan t)−1 can easily be derived from their definitions and the formulae for derivatives. The graphs of the functions sine, cosine, tangent, and cotangent are displayed on the diagrams (they are distinguished as the solid and dashed lines, respectively): 432 of divergence, as well.
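If one prefers a reusable test over reading Sage’s error message, the following small helper (our own sketch, not a built-in command) catches the ValueError which Sage raises for divergent symbolic sums:
var("n")
def appears_divergent(f):
    #True when symbolic summation reports "Sum is divergent"
    try:
        sum(f, n, 1, oo)
        return False
    except ValueError:
        return True
print(appears_divergent(2^n/n)) #True
print(appears_divergent(1/n^2)) #False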
3) We see that an+1 an = 2n+1 (n + 2)! 2n(n + 3)! = 2 n + 3 , which tends to 0 < 1 as n tends to infinity. By the ratio test we deduce that the series T3 is absolutely convergent and hence also convergent. 4) Let ∑∞ n=1 an be a non-convergent series. Taking some constant c ̸= 0 and considering the new series ∑∞ n=1(can), it is easy to see that this is also a non-convergent series. This is because by assumption the sequence of partial sums sn =∑n k=1 ak cannot converge, and hence also the sequence (Sn) of partial sums of the new series cannot converge (since Sn =∑n k=1(cak) = c · sn). We can apply this conclusion for T4: T4 = ∞∑ n=1 1 n · 22025 = 1 22025 · ∞∑ n=1 1 n , that is, the series is a multiple of the harmonic series. Thus, in combination with 5.D.1 we see that T4 is not convergent. 5) Obviously, T5 = ∞∑ n=1 1 n7 + ∞∑ n=1 1 7n + 2 = T1 5 + T2 5 . Hence it is sufficient to decide on the convergence or divergence of the series T1 5 and T2 5 , as defined above. We see that T2 5 = ∑∞ n=1 1 7n+2 < ∑∞ n=1 1 7n and the geometric series ∑∞ n=1 1 7n converges (see 5.B.16). Thus, the series T2 5 converges as well. The series T1 5 = ∑∞ n=1 1 n7 is also convergent (it is a p-series with p = 7 > 1). As a result, the relation T5 = T1 5 + T2 5 implies that T5 is a convergent series (since it is the sum of two convergent series, see Corollary 5.4.3). 6) The series T6 is not convergent. Let us apply for example the d’Alembert criterion (ratio test): lim n→∞ an+1 an = lim n→∞ en+1 ·n2 en ·(n + 1)2 = e · lim n→∞ ( n n + 1 )2 = e ·1 = e > 1 . 7) The series T7 converges, because T7 = ∞∑ n=0 1 (n + 1) · 3n ≤ ∞∑ n=0 ( 1 3 )n = 1 1 − 1 3 = 3 2 < +∞ . The result now follows by the comparison test (see Theorem 5.4.5). 8) The series T8 consists only of non-negative terms, as T7 for example, but it diverges (necessarily to +∞), as a direct computation shows: T8 = ∞∑ n=1 n2 + 1 n3 ≥ ∞∑ n=1 n2 n3 = ∞∑ n=1 1 n = +∞ . 9) Observe that T9 is again the sum of two series: T9 = ∞∑ n=0 2n n! + ∞∑ n=0 3n n! = T1 9 + T2 9 CHAPTER 5. ESTABLISHING THE ZOO Cyclometric functions are the functions inverse to trigonometric functions. Since the trigonometric functions all have period 2π, their inverses can be defined only inside one period, and further, only on the part where the given function is either increasing or decreasing. Two inverse trigonometric functions are arcsin = sin−1 with domain [−1, 1] and range [−π/2, π/2] and arccos = cos−1 with domain [−1, 1] and range [0, π]. See the left-hand illustration. The remaining functions are (displayed in the diagram on the right) arctan = tan−1 with domain R and range (−π/2, π/2), and finally arccot = cot−1 with domain R and range (0, π). The hyperbolic functions are also of great importance. Two basic ones are sinh x = 1 2 (ex − e−x ), cosh x = 1 2 (ex + e−x ). The name indicates that they should have something in common with a hyperbola. From the definition, (cosh x)2 − (sinh x)2 = 2 1 4 (2 ex e−x ) = 1. The points [cosh t, sinh t] ∈ R2 parametrically describe a hyperbola in the plane. For hyperbolic functions, one can easily derive identities similar to the ones for trigonometric functions. By substituting into (1) and (2), one can obtain for example cosh x = cos(ix), i sinh x = sin(ix). 433 and hence it is sufficient to decide on the convergence of T1 9 and T2 9 . By applying the ratio test to T1 9 , we see that lim n→∞ 2n+1 (n+1)! 2n n! = lim n→∞ 2 n + 1 = 0 < 1 so T1 9 converges. Similarly for T2 9 , and hence T9 converges.
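In fact, for T9 Sage evaluates both summands in closed form via the exponential series ∑∞ n=0 xn n! = ex (an optional cross-check of ours):
var("n")
show(sum(2^n/factorial(n), n, 0, oo)) #returns e^2
show(sum(3^n/factorial(n), n, 0, oo)) #returns e^3
so the exact value of T9 is e2 + e3 .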
10) We can apply the root test, see (5) in Theorem 5.4.5: lim n→∞ n √ 2n · ( 4 5 )n2 = lim n→∞ ( 2 · ( 4 5 )n ) = 0 < 1 . Thus T10 is a convergent series. In Sage you can compute this limit as usual: var("n"); lim(2*((4/5)**n), n=oo) 11) For the series T11 we see that 0 ≤ 2 + sin3 (n + 1) 4n + n2 < 3 4n . However, ∑∞ n=1 3 4n = 3 ∑∞ n=1 1 4n and the latter is a convergent geometric series. Thus the series T11 converges as well. 12) For the series T12 an application of the ratio test gives lim n→∞ an+1 an = lim n→∞ ( (n + 1)n+1 (n + 1)2 · (n + 1)! · n2 · n! nn ) = lim n→∞ n2 (n + 1)2 · lim n→∞ (n + 1)n nn = lim n→∞ n2 n2 · lim n→∞ ( 1 + 1 n )n = 1 · e > 1 . Thus, T12 does not converge; since its terms are positive, it diverges to +∞. □ Next, we will explore how to use programming in Sage as a tool for analyzing the convergence or divergence of infinite series. We introduce a syntax using the command def to construct a routine tailored for this purpose. The essence of this routine revolves around implementing the ratio test. Later on, we will demonstrate how this technique can be adapted to apply the root test (see 5.E.136), as well as other convergence criteria. 5.D.5. Let S = ∑∞ n=0 fn be an infinite series for which the limit limn→∞ |fn+1/fn| exists and equals q. Write a routine in Sage appropriate to decide the absolute convergence/divergence of S based on the ratio test. Then apply your program in order to: (a) Prove that the following infinite series converge: S1 = ∞∑ n=1 2n · (n + 1)3 3n , S2 = ∞∑ n=1 6n n! . Next provide a formal proof. (b) Deduce, if possible, the convergence/divergence of S3 = CHAPTER 5. ESTABLISHING THE ZOO 5.4.13. Notes. (1) If a power series S(x) is expressed with the variable x moved by a constant offset x0, we arrive at the function T(x) = S(x − x0). If ρ is the radius of convergence of S, then T will be well-defined on the interval (x0 −ρ, x0 +ρ). We say that T is a power series centered at x0. The power series can be defined in the following way: S(x) = ∞∑ n=0 an(x − x0)n , where x0 is an arbitrary fixed real number. All of the previous reasoning remains valid. It is only necessary to be aware of the fact that it relates to the point x0. In particular, such a power series converges on the interval (x0 −ρ, x0 +ρ), where ρ is its radius of convergence. Further, if a power series y = T(x) has its values in an interval where a power series S(y) is well-defined, then the values of the function S ◦ T are also described by a power series which can be obtained by formal substitution of y = T(x) for y into S(y). (2) As soon as a power series with a suitable center is available, the coefficients of the power series for inverse functions can be calculated. We do not introduce a list of formulae here. It is easily obtained in Maple, for instance, by the procedure “series”. For illustration, here are two examples: Begin with ex = 1 + x + 1 2 x2 + 1 6 x3 + 1 24 x4 + . . . . Since e0 = 1, we search for a power series centered at x = 1 for the inverse function ln x. So assume ln x = a0+a1(x−1)+a2(x−1)2 +a3(x−1)3 +a4(x−1)4 +. . . . Apply the equality x = ln(ex ), regroup the coefficients by the powers of x and substitute. The result is: x = a0 + a1 ( x + 1 2 x2 + 1 6 x3 + 1 24 x4 + . . . ) + a2 ( x + 1 2 x2 + . . . )2 + a3 ( x + 1 2 x2 + . . . )3 + . . . = a0 + a1x + ( 1 2 a1 + a2 ) x2 + ( 1 6 a1 + a2 + a3 ) x3 + ( 1 24 a1 + ( 1 4 + 2 6 ) a2 + 3 2 a3 + a4 ) x4 + . . . .
Comparing the coefficients of the corresponding powers on both sides, gives a0 = 0, a1 = 1, a2 = − 1 2 , a3 = 1 3 , a4 = − 1 4 , . . . . This corresponds to the valid expression (to be verified later): ln x = ∞∑ n=1 (−1)n−1 n (x − 1)n . Similarly, we can begin with the series sin t = t − 1 3! t3 + 1 5! t5 − 1 7! t7 + . . . 434 ∑∞ n=1 nn n2·n! and S4 = ∑∞ n=1 n2 n+1 (notice S3 is the last series in 5.D.4). Solution. As we said, we will use the command def to define a routine, which we will call ratiotest. First it is useful to introduce n as a symbolic variable. It is also convenient to introduce the subset (−1, 1) ⊂ R. This is because of the ratio test (see Theorem 5.4.5), which states that S converges absolutely for |q| < 1, does not converge absolutely for |q| > 1, while for q = 1 the series may or may not converge. The aim now is to encode the above criterion in a Sage routine. For this we may type var("n") s=RealSet(-1, 1) def ratiotest(f): q=lim(abs(f(n+1)/f(n)), n=oo) if q in s : return "converges absolutely" elif q==1 : return "no conclusion" else : return "does not converge absolutely" return Our routine is now ready to be tested.10 (a) First we will prove formally that S1, S2 are both convergent series. Notice that for S1, S2 the defining terms are positive, and hence q = lim n→∞ fn+1 fn = lim n→∞ fn+1 fn . So, for S1 we compute q = lim n→∞ 2n+1 · (n + 2)3 · 3n 3n+1 · 2n · (n + 1)3 = lim n→∞ 2(n + 2)3 3(n + 1)3 = lim n→∞ 2n3 3n3 = 2 3 < 1 . Or you can directly use Sage to compute this limit: var("n") f(n)=(2^(n+1)*(n+2)^3*3^n)/(3^(n+1)*(n+1)^3*2^n) lim(f(n), n=oo) Similarly, for S2: q = lim n→∞ ( 6n+1 (n + 1)! · n! 6n ) = lim n→∞ 6 n + 1 = 0 < 1 , where again one can compute the limit directly in Sage: var("n") g(n)=(factorial(n)*6^(n+1))/(factorial(n+1)*6^n) lim(g(n), n=oo) Notice here the command factorial(n) corresponds to n!. Since for both cases we obtained q < 1, our claim follows by the ratio test. To check our routine using S1, S2, we type 10In general, detailed “return inputs” are more than welcome. However, here we kept them “short” to save some space. CHAPTER 5. ESTABLISHING THE ZOO and the (unknown so far) power series for its inverse centered at zero (since sin 0 = 0) arcsin t = a0 + a1t + a2t2 + a3t3 + a4t4 + . . . . Substitution gives t = a0 + a1 ( t − 1 3! t3 + 1 5! t5 + . . . ) + a2 ( t − 1 3! t3 + 1 5! t5 + . . . )2 + . . . = a0 + a1t + a2t2 + ( − 1 6 a1 + a3 ) t3 + ( − 2 6 a2 + a4 ) t4 + ( 1 120 a1 − 3 6 a3 + a5 ) t5 + . . . , hence arcsin t = t + 1 6 t3 + 3 40 t5 + . . . . (3) Notice that if it is assumed that the function ex can be expressed as a power series centered at zero, and that power series can be differentiated term by term, then the difference equation for the coefficients an is easily obtained since (xn+1 )′ = (n + 1)xn . Therefore, from the condition that the exponential function has its derivative equal to its value at every point, an+1 = 1 n+1 an, a0 = 1 and hence it is clear that an = 1 n! . 435 f(n)=((2^n)*(n+1)^3)/3^n; ratiotest(f) and f(n)=(6^n/factorial(n)); ratiotest(f) respectively. Our routine verifies our result, by printing out the answer “converges absolutely”. On the other hand, in 5.D.4 we saw that S3 satisfies q = e > 1, in particular, S3 cannot converge. The same verifies our routine: f(n)=n^n/(n^2*factorial(n)); ratiotest(f) with answer “does not converge absolutely”. For the final case notice that running the command print(lim((n+1)^3/(n^2*(n+2)), n=oo)) we obtain 1, that is, limn→∞ f(n+1) f(n) = 1, where f(n) = n2 n+1 . 
This means that the ratio test is inconclusive and we cannot determine if the series converges or diverges using this test. Our routine is also able to certify this claim, and we just need to type f(n)=n^2/(n+1) ratiotest(f) which prints out “no conclusion”. To answer however this case, one may add the syntax sum(n2 /(n + 1), n, 1, oo), which shows to us that S4 is divergent. Try to describe a formal proof of this conclusion. □ 5.D.6. Alternating series. Determine whether or not the following series converge: S1 = ∞∑ n=1 (−1)n n2 + 3n − 1 (3n − 2)2 , S2 = ∞∑ n=1 (−1)n−1 3n4 − 3n3 + 9n − 1 (5n3 − 2) · 4n . Solution. We have limn→∞ n2 +3n−1 (3n−2)2 = limn→∞ n2 9n2 = 1 9 ̸= 0. Thus the limit lim n→∞ (−1)n n2 + 3n − 1 (3n − 2)2 does not exist, and the series S1 does not converge. For S2 an application of the ratio (or root) test shows that the polynomials in the numerator or in the denominator do not affect the value of the considered limit, and we may consider the series ∞∑ n=1 (−1)n−1 1 4n For this, we see that lim n→∞ an+1 an = 1 4 < 1. Thus this series (absolutely) converges and as a conclusion the series S2 is also (absolutely) convergent. □ 5.D.7. Decide whether the series S1 = ∞∑ n=1 sin(n) n2 , S2 = ∞∑ n=1 cos (πn) 3 √ n2 CHAPTER 5. ESTABLISHING THE ZOO 436 converges absolutely, converges conditionally, or does not converge at all. ⃝ Computing the exact values of infinite series is generally a challenging task. Sometimes, we can compare these values with known results. However, in many cases, relying on computer algebra software is necessary. To illustrate this situation, let us consider some examples and use Sage to verify our calculations rigorously. 5.D.8. Calculate the series given below and then use Sage to verify your answer. S1 = ∞∑ n=1 ( 1 √ n − 1 √ n + 1 ) , S2 = ∞∑ n=0 5 3n , S3 = ∞∑ n=1 ( 3 42n−1 + 2 42n ) , S4 = ∞∑ n=1 n 3n . Solution. For the first given series we see that S1 = lim n→∞ (( 1 √ 1 − 1 √ 2 ) + ( 1 √ 2 − 1 √ 3 ) + · · · · · · + ( 1 √ n − 1 √ n + 1 )) = lim n→∞ ( 1 + ( − 1 √ 2 + 1 √ 2 ) + · · · · · · + ( − 1 √ n + 1 √ n ) − 1 √ n + 1 ) = lim n→∞ ( 1 − 1 √ n + 1 ) = 1 . As we know, a verification by Sage takes the form n=var("n");sum((1/sqrt(n)-1/sqrt(n+1)),n,1,oo) For S2 we see that this is a quintuple of the standard geometric series with the common ratio q = 1/3. Thus S2 = ∞∑ n=0 5 3n = 5 ∞∑ n=0 ( 1 3 )n = 5 · 1 1 − 1 3 = 15 2 . Or we can directly give in Sage the cell n=var("n"); sum((5/3**n),n,0,oo) The third infinite series is a series of linear combinations which we can express as a sum of infinite series with factoring out the constants. This is a valid modification, since the obtained series are absolutely convergent. In particular, S3 = 3 4 ∞∑ n=1 ( 1 42n−2 ) + 2 16 ∞∑ n=1 ( 1 42n−2 ) m:=n−1 = ( 3 4 + 2 16 ) ∞∑ m=0 1 42m = 14 16 ∞∑ m=0 ( 1 16 )m = 14 16 · 1 1 − 1 16 = 14 15 . CHAPTER 5. ESTABLISHING THE ZOO 437 In Sage after introducing n as a symbolic variable we can compute this sum as above: sum((3/(4**(2*n-1))+2/(4**(2*n))),n,1,oo) Finally, for S4, notice that from the relation of the partial sums, i.e., sn = 1 3 + 2 32 + 3 33 + · · · + n 3n for n = 1, 2, . . ., we can claim that sn 3 = 1 32 + 2 33 + · · · + n − 1 3n + n 3n+1 . Therefore, for all n ∈ Z+ we obtain the relation sn − sn 3 = 1 3 + 1 32 + 1 33 + · · · + 1 3n − n 3n+1 . This, in combination with the relation lim n→∞ n 3n+1 = 0, gives S4 = 3 2 lim n→∞ ( sn − sn 3 ) = 3 2 lim n→∞ n∑ k=1 1 3k = 3 2 ∞∑ k=1 ( 1 3 )k (∗) = 3 2 ( 1 1 − 1 3 − 1 ) = 3 4 . 
Notice that here our geometric series ∑∞ k=1 (1 3 )k starts from k = 1 and thus the given replacement in (∗). A verification in Sage occurs as usual, i.e., var("n"); sum((n/3**n),n,1,oo) □ 5.D.9. Verify the inequality ∞∑ n=1 1 n2 < ∞∑ n=0 1 2n . ⃝ Further exercises concerning infinite series are presented in Section E. Let us now describe a beautiful application that highlights the use of infinite series (see 5.E.130 for another fascinated example). 5.D.10. Koch snowflake (1904). Create a “snowflake” by the following procedure: At the beginning, consider an equilateral triangle with sides of length 1. With each of its three sides, do the following: Cut it into three equally long parts, then build another equilateral triangle above the middle part (this is pointing out from the original triangle), and remove the middle part. This transforms the original equilateral triangle into a six-pointed star. Repeating this step ad infinitum, one arrives to the desired snowflake. Prove that the created figure has infinite perimeter. Then determine its area. Solution. The perimeter of the original triangle is equal to 3. In each step, the perimeter increases by one third since three parts of every line segment are replaced with four equally long ones. Thus the snowflake’s perimeter can be expressed as the limit lim n→∞ dn , dn := 3 ( 4 3 )n , and we see that lim n→∞ dn = +∞. This can be quickly verified in Sage: CHAPTER 5. ESTABLISHING THE ZOO 438 var("n");lim(3*((4/3)**n),n=oo) Now, the figure’s area is apparently increasing during the construction. For its computation it is sufficient to catch the rise between two consecutive steps. The number of the figure’s sides is four times higher at every step (the line segments are divided into thirds and one of them is doubled). Moreover, the new sides are three times shorter. Therefore, the figure’s area grows exactly by the equilateral triangles glued to each side (so, there is the same number of them as of the sides). In the first iteration (when creating the six-pointed star from the original triangle), the area grows by the three equilateral triangles with sides of length 1/3 (one third of the original sides’ length). Let us denote the area of the original equilateral triangle by S0. If we realize that shortening an equilateral triangle’s sides three times makes its area decrease nine times, we get S0+3· S0 9 for the area of the six-pointed star. Similarly, in the next step we obtain the area of the figure as S0 + 3 · S0 9 + 4 · 3 · S0 92 . It is easy now to deduce that the area E of the resulting snowflake equals the limit E = lim n→∞ ( S0 + 3 S0 9 + 4 · 3 S0 92 + · · · + 4n · 3 S0 9n+1 ) = S0 lim n→∞ ( 1 + 1 3 + 1 3 · 4 9 + · · · + 1 3 · ( 4 9 )n ) = S0 ( 1 + 1 3 lim n→∞ ( 1 + 4 9 + · · · + ( 4 9 )n) ) = S0 ( 1 + 1 3 lim n→∞ n∑ k=0 ( 4 9 )k ) = S0 ( 1 + 1 3 ∞∑ k=0 ( 4 9 )k ) = S0 ( 1 + 1 3 · 1 1 − 4 9 ) , that is, E = 8 5 S0. Thus, the snowflake’s area is 8/5 of the area of the original triangle, i.e., 8 5 S0 = 8 5 · √ 3 4 = 2 √ 3 5 . We mention that this snowflake is an example of an infinitely long curve which encloses a finite area. □ So far, we have explored the concept of assigning a value to a sum of infinitely many numbers. Now, we shift our focus to sums involving infinitely many functions. Specifically, we can consider such series for each argument x of the functions fn(x), particularly focusing on sums of polynomials. 
These are known as “power series”, which always converge on an interval (or a disc in the complex plane, for complex polynomials). Moreover, the radius of convergence of these series can be readily determined from their coefficients, which is either a non-negative real number or ∞. CHAPTER 5. ESTABLISHING THE ZOO 439 Later, in Chapters 6 and 7, we will revisit other types of function series and explore additional concepts of conver- gence. 5.D.11. Consider the series S = ∞∑ n=1 (x + 4)n n · 5n . Determine all x ∈ R for which S is convergent. Solution. Having the ratio test in mind, let us compute the limit lim n→∞ fn+1(x) fn(x) , that is, lim n→∞ (x+4)n+1 (n+1)·5n+1 (x+4)n n·5n = lim n→∞ n |x + 4| 5(n + 1) = |x + 4| lim n→∞ n 5(n + 1) = |x + 4| 5 . In case you like confirm this conclusion by Sage, give the cell var(”n”); lim(n/(5 ∗ (n + 1)), n = oo). Now, according to the ratio test the series will converge for |x+4| 5 < 1, that is, −9 < x < 1 and diverges for |x+4| 5 > 1, that is, x < −9 or x > 1. For x = 1 we see that S is the harmonic series, i.e., S = ∑∞ n=1 1 n which diverges by 5.D.1. For x = −9 we get S = ∞∑ n=1 (−5)n n · 5n = ∞∑ n=1 (−1)n n = −1 + 1 2 − 1 3 + 1 4 − · · · which is an alternating series, known as the alternating harmonic series. For this series we compute lim n→∞ (−1)n n = 0, hence S is convergent (see Theorem 5.4.7). As an alternative, in Sage the command sum((−1)n /n, n, 1, oo) provides an explicit evaluation of the alternating harmonic series; it equals to − ln(2). We conclude that the initial series is convergent for all x ∈ [−9, 1). □ 5.D.12. Determine the radius r of convergence of the following power series: A(x) = ∞∑ n=1 2n n xn , B(x) = ∞∑ n=1 1 (1 + i)n xn , C(x) = ∞∑ n=1 (−1) n+1 n · 8n xn , D(x) = ∞∑ n=1 (−4n) n xn , E(x) = ∞∑ n=1 ( 1 + 1 n )n2 xn . Solution. According to the discussion in 5.4.10, for A(x) we have r = 1 lim sup n→∞ an+1 an = 1 2 . Thus, the power series converges exactly for the real numbers x ∈ (−1 2 , 1 2 ). Moreover, the series diverges for x = 1 2 (since it becomes harmonic), and it converges for x = −1 2 (since then it becomes an alternating harmonic series). However, to determine the convergence for any x lying in the complex CHAPTER 5. ESTABLISHING THE ZOO 440 plane on the circle of radius 1 2 , it is a much harder question which goes beyond our lectures. For B(x) we compute r = lim sup n→∞ n √ 1 (1 + i)n = lim sup n→∞ 1 1 + i = √ 2 2 . For C(x) we see that lim n→∞ n √ | an | = lim n→∞ 1 n √ n · 8 = 1 8 . Thus the radius is r = 8. For D(x) we get lim n→∞ n √ | an | = lim n→∞ 4n = +∞, so the radius is r = 0, while for E(x) we compute lim n→∞ n √ | an | = lim n→∞ ( 1 + 1 n )n = e, so r = 1/e in this case. □ 5.D.13. Calculate the radius r of convergence of the power series ∞∑ n=1 ein 3 √ n3 + n · 3n πn · 3 √ n4 + 2n3 + 1 · (x − 2)n . Solution. Here we will apply the following trick: The radius of convergence of any power series does not change if we move its center or alter its coefficients while keeping their absolute values. Therefore, let us determine the radius of convergence of the infinite series ∞∑ n=1 3 √ n3 + n · 3n πn 3 √ n4 + 2n3 + 1 · xn . Since lim n→∞ n √ na = ( lim n→∞ n √ n )a = 1 for all a > 0, we can further move to the series ∞∑ n=1 3n πn xn . with the same radius of convergence r = π/3. □ 5.D.14. Determine the power series centered at the origin which, determines the function 1 x2 − x − 12 , x ∈ (−3, 3) . Solution. 
A quick method relies on the known procedure of partial fraction decomposition, which gives 1 x2 − x − 12 = 1 (x − 4)(x + 3) = 1 7 ( 1 x − 4 − 1 x + 3 ) . This expression can be obtained also in Sage, via the partial_fraction function, as follows: f(x)=1/(x^2-x-12) show(f.partial_fraction()) CHAPTER 5. ESTABLISHING THE ZOO 441 Using now appropriately the known formula of geometric series, we see that 1 x − 4 = − 1 4 ( 1 + x 4 + x2 42 + · · · + xn 4n + · · · ) , 1 x + 3 = 1 3 ( 1 − x 3 + x2 32 + · · · + (−x)n 3n + · · · ) . Thus, altogether we obtain 1 x2 − x − 12 = − 1 28 ∞∑ n=0 xn 4n − 1 21 ∞∑ n=0 (−x)n 3n = ∞∑ n=0 ( (−1)n+1 21 · 3n − 1 28 · 4n ) xn . □ 5.D.15. Determine the radius r of convergence of the power series ∞∑ n=0 22n ·n! (2n)! xn . ⃝ 5.D.16. Calculate the radius of convergence for ∑∞ n=1 2 √ n xn . ⃝ 5.D.17. Find the domain of convergence of the power series ∞∑ n=1 √ n+1 3 √ n xn . ⃝ 5.D.18. Determine for which x ∈ R the power series ∞∑ n=1 (−3)n √ n4+2n3+111 (x − 2)n converges. ⃝ 5.D.19. Determine for which x ∈ R the series ∞∑ i=1 1 2n · n · ln(n) x3n converges. ⃝ 5.D.20. Determine all x ∈ R for which the power series ∞∑ i=1 x2n n2 is convergent. ⃝ The polynomials are functions which we can easily evaluate and thus, the partial sums of power series offer a valuable method for approximating the values of functions. However, as we will see below, to be successful we need some good estimate on the convergence speed. Especially for convergent alternating series S = ∞∑ n=0 (−1)n an = a0 − a1 + a2 − a3 + · · · with a0 ≥ a1 ≥ a2 ≥ · · · ≥ an ≥ · · · ≥ 0 and∑∞ n=0(−1)n an = L, where L is a finite real number, we can prove a very useful estimation error given by |L − Sk| ≤ ak+1 . Here Sk = ∑k n=0(−1)n an denotes the partial sum corresponding to S. Hence, summing only terms up to the kth term ak and omitting all the remaining terms, the approximation CHAPTER 5. ESTABLISHING THE ZOO 442 error will be at most large as ak+1. Let us illustrate this situation via examples. 5.D.21. Approximate cos(1) with an error strictly less that 10−8 . Then use Sage to find the actual approximation error. Solution. From Section 5.4.12 we know the following expression of cos(x) in terms of power series: cos(x) = 1 − x2 2! + x4 4! − x6 6! + · · · = ∞∑ n=0 (−1)n x2n (2n)! , for all x ∈ R. Thus cos(1) = 1 − 1 2 + 1 4! − 1 6! + 1 8! − · · · = ∞∑ n=0 (−1)n (2n)! . According to the previous remark on alternating series, stopping at the term 1 (2n)! we will obtain an error which equals at most 1 (2(n+1))! = 1 (2n+2)! . Thus, to find the number of terms which we need to approximate cos(1) with an error strictly less that 10−8 , it suffices to solve the inequality 1 (2n + 2)! < 10−8 . (∗) There are many ways to solve this inequality, and perhaps, the simplest one is by testing several values of n. As a solution, one gets that n = 5 is the smallest positive integer satisfying (∗). For instance, in Sage a solution goes as follows: n=1; while (1/factorial(2*n+2) >= 10^-8 ): n = n+1 print(n) and this answers 5. Here, with the second line we program Sage to try all values until (∗) is true, starting with n = 1. Hence, the desired solution is cos(1) ≈ 1 − 1 2 + 1 4! − 1 6! + 1 8! − 1 10! , with an error less than 10−8 . We may find the error of the approximation in Sage as follows: a=1-1/2+1/factorial(4)-1/factorial(6) \ +1/factorial(8)-1/factorial(10) print(N(cos(1))-N(a)) bool((N(cos(1))-N(a))<(1/factorial(12))) Notice here we used a slash to break the first line. 
In this cell the print command gives the actual approximation error 2.07625261428035e − 9, which approximately translates to 2.077·10−9 . The final command returns True, verifying that the error is not larger than the next term 1 (2n)! evaluated at n = 5 + 1 = 6, that is, 1/12! ≈ 2.088·10−9 (recall that above we found n = 5). To treat such small numbers reliably, one can again use Sage, e.g., by the cell bool(2.077*(10**(-9))<2.088*(10**(-9))) □ CHAPTER 5. ESTABLISHING THE ZOO 443 5.D.22. Approximate sin(1) with an error strictly less than 10−8 , and then verify your answer in Sage. Solution. In Section 5.4.12 we learned the expression of sin(x) in terms of power series: sin(x) = x − 1 3! x3 + 1 5! x5 − 1 7! x7 + · · · = ∞∑ n=0 (−1)n (2n + 1)! x2n+1 , x ∈ R . Thus sin(1) = 1 − 1 3! + 1 5! − 1 7! + · · · = ∞∑ n=0 (−1)n (2n + 1)! and, as before, we are treating an alternating series. Stopping at the term (−1)n (2n+1)! we will obtain an error of at most 1 (2n+3)! . Hence, to find the number of terms needed to approximate sin(1) with an error strictly less than 10−8 , it suffices to solve the inequality 1 (2n + 3)! < 10−8 , and as a solution we get n = 5 (e.g. by applying the same method as in 5.D.21). This means that sin(1) ≈ 1 − 1 3! + 1 5! − 1 7! + 1 9! − 1 11! with an error less than 10−8 . In Sage you may type b=1-1/factorial(3)+1/factorial(5)-1/factorial(7) \ +1/factorial(9) -1/factorial(11) print(abs(N(sin(1))-N(b))) print(bool(abs(N(sin(1))-N(b))<1/10**8)) which returns the actual approximation error, given by 1.59828483781155e − 10 ≈ 1.599 · 10−10 , and verifies the statement. □ 444 CHAPTER 5. ESTABLISHING THE ZOO E. Additional exercises for the whole chapter As usual, we will now delve into additional material related to the concepts discussed thus far in Chapter 5. Many of the tasks described below rely on the theory of derivatives, making prior experience with derivatives particularly useful. In some instances, we may need to employ higher-order derivatives, which are formally introduced at the beginning of Chapter 6, see 6.1.1. However, these cases primarily pertain to higher-order derivatives of polynomials, a topic with which the reader is already familiar from our discussion in Section 5.1.6. A) Material on polynomial interpolation 5.E.1. Consider the function f(x) = sin(x). Given three points (nodes) x0, x1, x2, write a routine in Sage which will return the Hermite interpolation polynomial P(x) corresponding to these nodes and the values yi = sin(xi) and y′ i = sin′ (xi) = cos(xi), for i = 0, 1, 2, that is, adapted to the following table (for simplicity you may fix some triple (x0, x1, x2), e.g., x0 = −2π, x1 = 0 and x2 = 2π) xi x0 x1 x2 yi y0 = sin(x0) y1 = sin(x1) y2 = sin(x2) y′ i y′ 0 = cos(x0) y′ 1 = cos(x1) y′ 2 = cos(x2) . Then, for a variety of different triples (x0, x1, x2) centered at 0 (that is, with x1 = 0 and with x0 = −x2), present the graphs of the corresponding polynomial P(x) and that of f(x), for x ∈ [x0, x2]. Solution. We present the code in Sage and attach some comments within. Since the Hermite interpolation method includes derivatives, below we will use commands such as derivative(f, x), which gives the derivative of a function f, and derivative(f, x, n), returning the nth derivative of f. So, let us fix the nodes x0 = −2π, x1 = 0, x2 = 2π. Notice the elementary Lagrange polynomials ℓ0, ℓ1, ℓ2 corresponding to these three nodes are all of degree 2.
We have:

x0=-2*pi; x1=0; x2=2*pi;
l(x)=(x-x0)*(x-x1)*(x-x2)
d1l(x)=derivative(l, x)        # first derivative of the function l
d2l(x)=derivative(l(x), x, 2)  # second derivative of the function l
l0(x)=((x-x1)*(x-x2))/((x0-x1)*(x0-x2))  # elementary Lagrange polynomial l_0
l1(x)=((x-x0)*(x-x2))/((x1-x0)*(x1-x2))  # elementary Lagrange polynomial l_1
l2(x)=((x-x0)*(x-x1))/((x2-x0)*(x2-x1))  # elementary Lagrange polynomial l_2
h01(x)=(1-(d2l(x0)/d1l(x0))*(x-x0))*(l0(x))^2  # 1st type Hermite polyn. h^{(1)}_0
h11(x)=(1-(d2l(x1)/d1l(x1))*(x-x1))*(l1(x))^2  # 1st type Hermite polyn. h^{(1)}_1
h21(x)=(1-(d2l(x2)/d1l(x2))*(x-x2))*(l2(x))^2  # 1st type Hermite polyn. h^{(1)}_2
h02(x)=(x-x0)*(l0(x))^2  # 2nd type Hermite polyn. h^{(2)}_0
h12(x)=(x-x1)*(l1(x))^2  # 2nd type Hermite polyn. h^{(2)}_1
h22(x)=(x-x2)*(l2(x))^2  # 2nd type Hermite polyn. h^{(2)}_2
y0=sin(x0); y1=sin(x1); y2=sin(x2)  # introduce the values y_0, y_1, y_2
dy0=derivative(sin(x), x)(x=x0)  # introduce the value y'_0
dy1=derivative(sin(x), x)(x=x1)  # introduce the value y'_1
dy2=derivative(sin(x), x)(x=x2)  # introduce the value y'_2
P(x)=y0*h01(x)+y1*h11(x)+y2*h21(x)+dy0*h02(x)+dy1*h12(x)+dy2*h22(x); show(P(x))
a=plot(sin(x), x, x0, x2, color="red", thickness=8, legend_label="$\\sin(x)$")
b=plot(P(x), x, x0, x2, color="blue", thickness=2, linestyle="-.", legend_label="$P(x)$"); show(a+b)

This returns the explicit form of the Hermite interpolation polynomial,
$$P(x)=\frac{(2\pi+x)^2(2\pi-x)^2x}{16\pi^4}-\frac{(2\pi+x)^2(2\pi-x)x^2}{64\pi^4}+\frac{(2\pi+x)(2\pi-x)^2x^2}{64\pi^4},$$
together with the corresponding figure. The code is written so that one only needs to change the first line, that is, the initial values of $x_0,x_1,x_2$; this yields the Hermite interpolation polynomial of the new triple $(x_0,x_1,x_2)$. As practice, present the graphs for the triples $(-\pi,0,\pi)$, $(-3\pi/2,0,3\pi/2)$ and $(-8\pi,0,8\pi)$, centered at 0, but also for other kinds of triples, e.g., $(-5\pi/2,\pi/2,3\pi/2)$, $(-e,\sqrt{e},\ln(31^2))$, etc. □

5.E.2. Determine the natural cubic spline which interpolates the absolute value function $f(x)=|x|$ for $x\in[-1,1]$, selecting the points $x_0=-1$, $x_1=0$, $x_2=1$. Then use the "spline" method described in 5.A.16 to verify your result.

Solution. By assumption one has $y_0=f(x_0)=1$, $y_1=f(x_1)=0$ and $y_2=f(x_2)=1$. For $x\in[-1,0]$ suppose that $S(x)=S_1(x)=a_3x^3+a_2x^2+a_1x+a_0$, and for $x\in[0,1]$ let $S(x)=S_2(x)=b_3x^3+b_2x^2+b_1x+b_0$, with $a_i,b_i\in\mathbb{R}$ for all $i=0,\dots,3$. The conditions that $S(x)$ must satisfy are eight in total. First, we have $S_1(x_0)=y_0$, $S_1(x_1)=y_1$, $S_2(x_1)=y_1$ and $S_2(x_2)=y_2$, which are equivalent to $-a_3+a_2-a_1+a_0=1$, $a_0=0=b_0$ and $b_3+b_2+b_1+b_0=1$, respectively. Moreover, $S'_1(x_1)=S'_2(x_1)$ and $S''_1(x_1)=S''_2(x_1)$, that is, $a_1=b_1$ and $a_2=b_2$, respectively. Finally, since $S$ should be natural, we need $S''_1(x_0)=0=S''_2(x_2)$, so we also have $-6a_3+2a_2=0$ and $6b_3+2b_2=0$. The system of these eight linear equations has non-zero determinant and hence a unique solution. Using Sage we get $a_0=b_0=0$, $a_1=b_1=0$, $a_2=b_2=3/2$, $a_3=1/2$ and $b_3=-1/2$, that is,
$$S(x)=\begin{cases}S_1(x)=\tfrac12x^3+\tfrac32x^2, & x\in[-1,0],\\[2pt] S_2(x)=-\tfrac12x^3+\tfrac32x^2, & x\in[0,1].\end{cases}$$
Below we include the graphical verification obtained by a combination of the commands spline and plot; we leave it to the reader to also plot $S_1,S_2$ in Sage in the usual way and compare.

f(x)=abs(x); pts=[(-1, f(-1)), (0, f(0)), (1, f(1))]; S=spline(pts)
A=plot(S, -1, 1, color="steelblue", thickness=2, legend_label="$S(x)$")
B=points(pts, size=50, color="darkgrey")
fpl=plot(f(x), x, -1, 1, color="black", legend_label="$f(x)$")
fpl.set_legend_options(loc=(0.6,0.8))
tx0=text("$(x_0, y_0)$", (-0.89, 1), color="darkslategrey", fontsize="12")
tx1=text("$(x_1, y_1)$", (0.1, -0.03), color="darkslategrey", fontsize="12")
tx2=text("$(x_2, y_2)$", (0.89, 1), color="darkslategrey", fontsize="12")
show(A+B+fpl+tx0+tx1+tx2)

□
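In the solution above we only stated that Sage solves the system of eight linear equations; a minimal sketch of how this step might be carried out (the variable names are our choice) is the following:

var("a0 a1 a2 a3 b0 b1 b2 b3")
eqs = [-a3 + a2 - a1 + a0 == 1,   # S1(-1) = 1
       a0 == 0,                   # S1(0) = 0
       b0 == 0,                   # S2(0) = 0
       b3 + b2 + b1 + b0 == 1,    # S2(1) = 1
       a1 == b1,                  # S1'(0) = S2'(0)
       a2 == b2,                  # S1''(0) = S2''(0)
       -6*a3 + 2*a2 == 0,         # S1''(-1) = 0 (natural spline)
       6*b3 + 2*b2 == 0]          # S2''(1) = 0 (natural spline)
show(solve(eqs, a0, a1, a2, a3, b0, b1, b2, b3))

The output reproduces the coefficients found above, i.e., $a_3=1/2$, $a_2=b_2=3/2$ and $b_3=-1/2$.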
5.E.3. Without calculation, determine the Hermite interpolation polynomial if the following data is given: $x_0=0$, $x_1=2$, $x_2=1$, $y_0=0$, $y_1=4$, $y_2=1$, $y'_0=0$, $y'_1=4$, $y'_2=2$. ⃝

5.E.4. Construct the natural cubic interpolation spline for the points $x_0=-3$, $x_1=0$, $x_2=3$ and the values $y_0=-3$, $y_1=0$, $y_2=3$. ⃝

5.E.5. Construct the complete cubic interpolation spline for the points $x_0=-3$, $x_1=-2$, $x_2=-1$ with values $y_0=0$, $y_1=1$, $y_2=2$ and derivatives at the marginal points given by $y'_0=1$, $y'_2=1$. ⃝

5.E.6. Using Lagrange interpolation, approximate $\cos^2(1)$, based on the values of $\cos^2(x)$ at the points $\frac\pi4$, $\frac\pi3$ and $\frac\pi2$. ⃝

5.E.7. Let $P(x)$ be a polynomial with real non-negative coefficients. Assume that $P(\frac1x)P(x)\ge1$ for $x=1$. Show that the same inequality holds for every positive $x$.

Solution. Let $P(x)=a_nx^n+a_{n-1}x^{n-1}+\cdots+a_1x+a_0$. From the statement one has $P(1)^2\ge1$. Thus, for $x>0$,
$$P(x)P\Big(\frac1x\Big)=(a_nx^n+\cdots+a_1x+a_0)\cdot(a_nx^{-n}+\cdots+a_1x^{-1}+a_0)=\sum_{i=0}^{n}a_i^2+\sum_{i<j}a_ia_j\big(x^{i-j}+x^{j-i}\big)\ge\sum_{i=0}^{n}a_i^2+2\sum_{i<j}a_ia_j=\Big(\sum_{i=0}^{n}a_i\Big)^2=P(1)^2\ge1,$$
where we used that $t+t^{-1}\ge2$ for every $t>0$ and that all $a_i\ge0$. □

$\ldots c_n\ge c_1=0$. Thus, $\sup A=\frac32$ and $\inf A=0$, respectively. (b) An example is given by the set $\mathbb{Z}\setminus\mathbb{N}$. (c) The set of natural numbers is such an example. □

5.E.14. Recall that when the supremum of a subset $A\subset\mathbb{R}$ belongs to $A$, it is called the "maximum of $A$"; the "minimum of $A$" is defined similarly. Find, if they exist, the minimum/maximum of the following sets: $A=(-1,2)$, $B=(-1,2]$, $C=\{(-1)^n : n\in\mathbb{Z}^+\}$, and $\mathbb{R}^+=\{x\in\mathbb{R} : x>0\}$.

Solution. The set $A=(-1,2)$ has 2 as its supremum, but $2\notin A$, so $A$ has no maximum. Nor does it have a minimum, since its infimum $-1\notin A$. The set $B=(-1,2]$ has 2 as its supremum, and $2\in B$; hence 2 is the maximum of $B$. But $B$ does not have a minimum. For $C$ both the minimum and the maximum exist: they are $-1$ and $1$, respectively. Finally, $\mathbb{R}^+$ is not bounded above, but it does have an infimum, which equals 0. However, $0\notin\mathbb{R}^+$, hence in this case there is neither a minimum nor a maximum. □

5.E.15. Find a subset $X\subset\mathbb{R}$ such that $\sup X\le\inf X$. ⃝

5.E.16. Find sets $A,B,C\subseteq\mathbb{R}$ such that $A\cap B=\emptyset$, $A\cap C=\emptyset$, $B\cap C=\emptyset$, and $\sup A=\inf B=\inf C=\sup C$. ⃝

5.E.17. Based on the definition of a convergent sequence, show that $a_n\to1$ as $n\to\infty$, where $a_n=\frac{2^n-1}{2^n}$.

Solution. We have
$$|a_n-1|=\Big|\frac{2^n-1}{2^n}-1\Big|=\frac{1}{2^n}.$$
On the other hand, by induction we get $n<2^n$ for any natural number $n$, hence we also have $\frac{1}{2^n}<\frac1n$ for all $n\in\mathbb{Z}^+$. Let $\varepsilon>0$. Using the Archimedean property we can find some $N$ with $\frac1N<\varepsilon$. Then, for $n>N$ we get $\frac1n<\frac1N<\varepsilon$, and hence
$$|a_n-1|=\frac{1}{2^n}<\frac1n<\frac1N<\varepsilon.$$
This shows that $a_n\to1$ as $n\to\infty$. □

5.E.18. Use the binomial theorem to prove that $\lim_{n\to+\infty}\sqrt[n]{\ln(n)}=1$.

Solution. For $x>0$ the function $\ln(x)$ is strictly increasing, thus for $n>3$ it is true that $\ln(n)>\ln(e)=1$, and this implies that $\sqrt[n]{\ln(n)}>1$. Consequently, we may write $\sqrt[n]{\ln(n)}=1+a_n$ for some $a_n>0$, for all such $n$. Taking the $n$-th power of this relation we arrive at $\ln(n)=(1+a_n)^n$, and here is where the binomial theorem applies:
$$\ln(n)=(1+a_n)^n=1+na_n+\frac{n(n-1)}{2!}a_n^2+\cdots+a_n^n.$$
Thus $\ln(n)>1+na_n$, from which we deduce that $a_n<\frac{\ln(n)}{n}-\frac1n$. However, $\frac{\ln(n)}{n}\to0$ as $n\to+\infty$, thus $\lim_{n\to+\infty}a_n=0$. Hence, by the relation $\sqrt[n]{\ln(n)}=1+a_n$, the result follows. □
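One may also like to glance at this limit in Sage, both symbolically and numerically; a small sketch (assuming Sage's limit command handles this expression, which it should; otherwise the numerical values alone are convincing):

var("n")
print(limit(log(n)^(1/n), n=oo))   # should return 1
for k in [10, 10^3, 10^6]:
    print(k, N(log(k)^(1/k)))      # values slowly approaching 1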
5.E.19. Compute the limit $\lim_{n\to\infty}\big(\sqrt2\cdot\sqrt[4]{2}\cdot\sqrt[8]{2}\cdots\sqrt[2^n]{2}\big)$. ⃝

5.E.20. Evaluate the limit $\lim_{n\to\infty}\Big(\frac{1}{1\cdot2}+\frac{1}{2\cdot3}+\frac{1}{3\cdot4}+\cdots+\frac{1}{(n-1)\cdot n}\Big)$. ⃝

5.E.21. Compute the limit $\lim_{n\to\infty}\Big(\frac{1}{n^2}+\frac{2}{n^2}+\cdots+\frac{n-2}{n^2}+\frac{n-1}{n^2}\Big)$. ⃝

5.E.22. Evaluate the limit
$$\lim_{n\to\infty}\Big(\frac{1}{\sqrt{n^2+1}}+\frac{1}{\sqrt{n^2+2}}+\cdots+\frac{1}{\sqrt{n^2+n}}\Big).$$
Also, use Sage to plot several terms of the given sequence to visually illustrate your answer.

Solution. To determine this limit we can invoke the squeeze theorem. The bounds
$$\frac{1}{\sqrt{n^2+1}}+\cdots+\frac{1}{\sqrt{n^2+n}}\ \ge\ \frac{1}{\sqrt{n^2+n}}+\cdots+\frac{1}{\sqrt{n^2+n}}=\frac{n}{\sqrt{n^2+n}}$$
and
$$\frac{1}{\sqrt{n^2+1}}+\cdots+\frac{1}{\sqrt{n^2+n}}\ \le\ \frac{1}{\sqrt{n^2+1}}+\cdots+\frac{1}{\sqrt{n^2+1}}=\frac{n}{\sqrt{n^2+1}},$$
valid for all naturals $n$, imply that
$$1=\lim_{n\to\infty}\frac{n}{\sqrt{n^2+n}}\ \le\ \lim_{n\to\infty}\Big(\frac{1}{\sqrt{n^2+1}}+\cdots+\frac{1}{\sqrt{n^2+n}}\Big)\ \le\ \lim_{n\to\infty}\frac{n}{\sqrt{n^2+1}}=1.$$
Thus $\lim_{n\to\infty}\big(\frac{1}{\sqrt{n^2+1}}+\cdots+\frac{1}{\sqrt{n^2+n}}\big)=1$. This result can be visually confirmed in Sage by the same method used in 5.B.4; one can execute the cell

var("k")
p=Graphics()
for n in srange(1, 50+1):
    p=p+points((n, sum(1/sqrt(n^2+k), k, 1, n)), color="black")
show(p)

This code produces a figure in which one can indeed observe the convergence of the given sequence to 1. □

5.E.23. Evaluate the limits $\lim_{n\to\infty}\big(\frac{n}{n+1}\big)^n$, $\lim_{n\to\infty}\big(1+\frac{1}{n^2}\big)^n$ and $\lim_{n\to\infty}\big(1-\frac1n\big)^{n^2}$. ⃝

5.E.24. Let $(a_n)$ be a non-negative convergent sequence of real numbers.¹² Show that $\lim_{n\to\infty}a_n=a\ge0$.

Solution. On the contrary, suppose that the claim is wrong, i.e., the limit of $(a_n)$ is a negative number $a<0$. Since $a_n\to a$ as $n\to\infty$ and $\varepsilon=-a$ is a positive real number, there is a natural $n_0$ such that $|a_n-a|<\varepsilon$ for all $n\ge n_0$, that is, $a-\varepsilon<a_n<a+\varepsilon$ for all $n\ge n_0$. However, $a=-\varepsilon$, which gives $-2\varepsilon<a_n<0$, a contradiction, as $(a_n)$ consists only of non-negative terms. Thus $a\ge0$. □

¹² By the term "non-negative" we mean $a_n\ge0$ for all naturals $n$.

5.E.25. Let $(a_n)$ be a sequence of real numbers and let $(y_n)$ be a sequence of positive real numbers satisfying $y_n\to0$ as $n\to\infty$. Suppose that for some $N\in\mathbb{N}$, some constant $\mu>0$ and some (finite) real number $a$, we have $|a_n-a|<\mu\,y_n$ for all $n\ge N$. Prove that $(a_n)$ is convergent; in particular, $\lim_{n\to\infty}a_n=a$.

Solution. The sequence $(y_n)$ is by assumption convergent to 0, thus if $\hat\varepsilon>0$ is given, there exists some $n_0$ (depending in general on $\hat\varepsilon$) such that $|y_n-0|=|y_n|=y_n<\hat\varepsilon$ for all $n\ge n_0$. By assumption $\mu>0$, hence we may set $\hat\varepsilon=\varepsilon/\mu$ for some given $\varepsilon>0$. Thus we have $y_n<\frac{\varepsilon}{\mu}$ for all $n\ge n_0$. Then, for those $n$ satisfying both $n\ge n_0$ and $n\ge N$, we get $|a_n-a|\le\mu\,y_n<\mu\cdot\frac{\varepsilon}{\mu}=\varepsilon$. As $\varepsilon>0$ is arbitrary, the claim follows. □
5.E.26. (1) Let $(a_n)$ be a sequence of positive real numbers for which the limit $\lim_{n\to\infty}\frac{a_{n+1}}{a_n}$ exists and is a finite real number $\ell$. Show that if $\ell<1$, then $\lim_{n\to\infty}a_n=0$. (2) More generally, suppose that $a_n\ne0$ for all $n\in\mathbb{N}$ and that the limit $\ell=\lim_{n\to\infty}\big|\frac{a_{n+1}}{a_n}\big|$ exists. Show that if $\ell<1$, then $\lim_{n\to\infty}a_n=0$.

Solution. We will prove the first statement and leave the second one to the reader, since the method follows essentially the same ideas. We have $a_n>0$ for all $n\in\mathbb{N}$, hence the sequence $(y_n)$ with general term $y_n=\frac{a_{n+1}}{a_n}$ also satisfies $y_n>0$ for all $n\in\mathbb{N}$. Also, $y_n\to\ell$ as $n\to\infty$, and hence the requirements of the statement in 5.E.24 are satisfied. Thus $\lim_{n\to\infty}y_n=\ell\ge0$. Let $b$ be a real number with $\ell<b<1$ and set $\varepsilon=b-\ell>0$. Then there exists some $N\in\mathbb{N}$ with $\big|\frac{a_{n+1}}{a_n}-\ell\big|<\varepsilon$, provided that $n\ge N$. This implies that $\frac{a_{n+1}}{a_n}<\varepsilon+\ell=b-\ell+\ell=b$, which in turn gives
$$0<a_{n+1}<b\,a_n<b^2a_{n-1}<\cdots<b^{\,n-N+1}a_N.$$
Thus, if $\mu:=a_Nb^{-N}>0$, we obtain $0<a_{n+1}<\mu\,b^{n+1}$ for all $n\ge N$. In addition, since $0<b<1$ we have $\lim_{n\to\infty}b^n=0$. Thus the result follows by 5.E.25. Alternatively, the inequality $a_{n+1}<b\,a_n$ for all $n\ge N$ implies that $a_{n+1}<a_n$ for all $n\ge N$ (since $0<b<1$). Therefore, the sequence $(a_n)$ is eventually monotone (decreasing from the $N$-th term on). Since $(a_n)$ is also bounded, it is convergent. Let $L=\lim_{n\to\infty}a_n$. Then, using the relation $a_{n+1}=y_na_n$, we see that
$$L=\lim_{n\to\infty}a_{n+1}=\lim_{n\to\infty}(y_na_n)=\lim_{n\to\infty}y_n\cdot\lim_{n\to\infty}a_n=\ell\,L\iff L(1-\ell)=0.$$
But $1-\ell>0$, and thus $L=0$, as required. □

5.E.27. (a) Apply the statement in 5.E.25 to prove that $\lim_{n\to\infty}\frac{1}{1+na}=0$, where $a>0$. (b) For a real number $x$, apply the statement in 5.E.26 to prove that $\lim_{n\to\infty}\frac{x^n}{n!}=0$. ⃝

5.E.28. Recall that if $(a_n)$ is a Cauchy sequence in $\mathbb{R}$, then the difference $a_{n+1}-a_n$ tends to zero, i.e., $(a_{n+1}-a_n)\to0$ as $n\to\infty$. Present a counterexample verifying that the converse of this statement is not in general true.

Solution. Consider the sequence $(a_n=\sqrt{n})$ with $n\in\mathbb{N}$. Then we see that
$$a_{n+1}-a_n=\sqrt{n+1}-\sqrt{n}=\frac{(\sqrt{n+1}-\sqrt{n})(\sqrt{n+1}+\sqrt{n})}{\sqrt{n+1}+\sqrt{n}}=\frac{1}{\sqrt{n+1}+\sqrt{n}},$$
and $\frac{1}{\sqrt{n+1}+\sqrt{n}}\to0$. However, $(a_n)$ is not a Cauchy sequence (try to verify this claim yourself). □

5.E.29. Based on the definition of a Cauchy sequence, show that $(a_n=1/n)_{n=1}^{\infty}$ is such a sequence.

Solution. We need to show that for every $\varepsilon>0$ there exists $N\in\mathbb{N}$ such that $|a_n-a_m|=\big|\frac1n-\frac1m\big|<\varepsilon$ for all $n,m\ge N$. Indeed, for such $n,m\ge N$, by the triangle inequality we have
$$\Big|\frac1n-\frac1m\Big|\le\Big|\frac1n\Big|+\Big|{-}\frac1m\Big|=\frac1n+\frac1m\le\frac1N+\frac1N=\frac2N.\quad(*)$$
Now, for $\varepsilon>0$ we can find a non-zero $N\in\mathbb{N}$ such that $\frac1N<\frac\varepsilon2$. Then by $(*)$ we get the result, i.e., $\big|\frac1n-\frac1m\big|\le\frac2N<\varepsilon$. □

5.E.30. A sequence $(x_n)_{n=1}^{\infty}$ of real numbers is called "contractive" if there exists some real $\alpha$ with $0<\alpha<1$ such that
$$|x_{n+1}-x_n|\le\alpha\,|x_n-x_{n-1}|\quad\text{for all }n\in\mathbb{N},\ n\ge2.\quad(\star)$$
Prove that any contractive sequence is Cauchy.

Solution. Suppose that $(x_n)$ is a sequence of real numbers satisfying $(\star)$ for some $\alpha\in(0,1)$. By $(\star)$ we get
$$|x_3-x_2|\le\alpha|x_2-x_1|,\qquad |x_4-x_3|\le\alpha|x_3-x_2|\le\alpha^2|x_2-x_1|,\qquad\ldots$$
and in general $|x_{n+1}-x_n|\le\alpha|x_n-x_{n-1}|\le\alpha^2|x_{n-1}-x_{n-2}|\le\cdots\le\alpha^{n-1}|x_2-x_1|$, where the final relation follows by induction on $n$. Now, from the geometric series task 5.B.16 recall that
$$1+\alpha+\alpha^2+\cdots+\alpha^n=\frac{1-\alpha^{n+1}}{1-\alpha}.$$
Combining this with our previous observation, for all $m,n\in\mathbb{N}$ with $m>n$ we have
$$|x_m-x_n|\le|x_m-x_{m-1}|+|x_{m-1}-x_{m-2}|+\cdots+|x_{n+1}-x_n|\le\big(\alpha^{m-2}+\alpha^{m-3}+\cdots+\alpha^{n-1}\big)|x_2-x_1|$$
$$=\alpha^{n-1}\big(1+\alpha+\cdots+\alpha^{m-n-1}\big)|x_2-x_1|=\alpha^{n-1}\Big(\frac{1-\alpha^{m-n}}{1-\alpha}\Big)|x_2-x_1|\le\alpha^{n-1}\Big(\frac{1}{1-\alpha}\Big)|x_2-x_1|.\quad(\sharp)$$
If $x_2=x_1$ it is easy to see that $(x_n)$ is a Cauchy sequence, hence we may assume that $x_2\ne x_1$. Recall by 5.B.3 that $\alpha^{n-1}\to0$ as $n\to+\infty$, provided that $0<\alpha<1$. Therefore, given some $\varepsilon>0$ we can find some positive integer $N$ with
$$|\alpha^{n-1}-0|=\alpha^{n-1}<\frac{\varepsilon}{\big(\frac{1}{1-\alpha}\big)|x_2-x_1|}\quad\text{for all }n\ge N.$$
Then, for the same $N$ and for $m>n\ge N$, by $(\sharp)$ we see that
$$|x_m-x_n|\le\alpha^{n-1}\Big(\frac{1}{1-\alpha}\Big)|x_2-x_1|<\frac{\varepsilon}{\big(\frac{1}{1-\alpha}\big)|x_2-x_1|}\cdot\Big(\frac{1}{1-\alpha}\Big)|x_2-x_1|=\varepsilon.$$
Thus $(x_n)$ is a Cauchy sequence. □
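Contractive sequences are also pleasant to explore numerically, since by $(\star)$ the differences $|x_{n+1}-x_n|$ decay at least geometrically. A small sketch (the helper iterate_map and the sample map $x\mapsto1+x/2$, which is contractive with $\alpha=1/2$, are our choices):

def iterate_map(f, x1, steps):
    # return the first `steps` terms of the recursively defined sequence
    xs = [x1]
    for _ in range(steps - 1):
        xs.append(f(xs[-1]))
    return xs

xs = iterate_map(lambda t: 1 + t/2, 0.0, 12)
for n in range(len(xs) - 1):
    print(n + 1, abs(xs[n+1] - xs[n]))   # successive differences halve

The printed differences halve at each step, which is exactly the geometric decay used in the proof; the same experiment applies verbatim to the sequence of 5.E.31 below.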
5.E.31. Consider the sequence $(x_n)$ of real numbers defined recursively as follows:
$$x_1=1,\qquad x_{n+1}=\frac{1}{4+x_n},\quad n\ge1.$$
Deduce that $(x_n)$ is a convergent sequence, by proving that it is contractive (see 5.E.30). ⃝

5.E.32. Let $(a_n)_{n=1}^{\infty}$ be a bounded sequence of real numbers. Set $m_n=\sup\{a_k : k\ge n\}$ and $\ell_n=\inf\{a_k : k\ge n\}$, $n\in\mathbb{N}$. Based on the monotone convergence theorem, show that both $(m_n)_{n=1}^{\infty}$ and $(\ell_n)_{n=1}^{\infty}$ are convergent. ⃝

5.E.33. Equivalent condition for convergence. Let $(x_n)$ be a sequence of real numbers. Is the following claim true?
$$(x_n)\ \text{is convergent}\iff\lim_{n\to\infty}\limsup_{m\to\infty}|x_n-x_m|=0.$$

Solution. The answer is yes. Suppose first that $\lim_{n\to\infty}x_n=a$ for some (finite) real number $a$. Then $\limsup_{m\to\infty}|x_n-x_m|=|x_n-a|$, and in turn this implies that
$$\lim_{n\to\infty}\limsup_{m\to\infty}|x_n-x_m|=\lim_{n\to\infty}|x_n-a|=0.$$
For the opposite direction, we rely on the theory of Cauchy sequences. So, suppose that $\lim_{n\to\infty}\limsup_{m\to\infty}|x_n-x_m|=0$. Then, for every $\varepsilon>0$ we may fix some natural $N$ with $\limsup_{m\to\infty}|x_m-x_N|<\frac\varepsilon2$. Therefore, there exists $m_0\in\mathbb{N}$ with $|x_m-x_N|<\frac\varepsilon2$ for all $m\ge m_0$. Combining this with the triangle inequality, for all $m_1,m_2\ge m_0$ we obtain
$$|x_{m_1}-x_{m_2}|\le|x_{m_1}-x_N|+|x_{m_2}-x_N|<\frac\varepsilon2+\frac\varepsilon2=\varepsilon.$$
This shows that $(x_n)$ is a Cauchy sequence, and hence convergent; see the theorem in 5.2.3. □

5.E.34. Find the limes superior/inferior of the sequences $(a_n)$, $(b_n)$ and $(c_n)$ defined below:
$$a_n=3+(-1)^n,\qquad b_n=2+\frac1n,\qquad c_n=\frac{4n}{n+1}\cos\Big(\frac{n\pi}{2}\Big),\qquad n\in\mathbb{Z}^+.$$

Solution. Recall that the limes superior/inferior of a sequence $(x_n)$ are essentially the biggest/smallest limits over all subsequences of $(x_n)$; thus, to compute them it always suffices to choose appropriate subsequences. Let us begin with $(a_n)$ and consider its subsequence $(a_{2n}=3+(-1)^{2n})$. This obviously tends to 4 as $n\to+\infty$, and hence $\limsup_{n\to\infty}a_n=4$. On the other hand, the subsequence $(a_{2n+1}=3+(-1)^{2n+1})$ of $(a_n)$ tends to 2 as $n\to+\infty$, and hence $\liminf_{n\to\infty}a_n=2$. Obviously the sequence $(b_n)$ converges to 2, so any subsequence of $(b_n)$ has the same limit, and it follows that $\limsup_{n\to\infty}b_n=2=\liminf_{n\to\infty}b_n$. For $(c_n)$ we have the expression $c_n=x_n\cdot y_n$, where $x_n=\frac{4n}{n+1}$ and $y_n=\cos(\frac{n\pi}{2})$. Obviously, $\lim_{n\to\infty}x_n=\lim_{n\to\infty}\frac{4}{1+\frac1n}=4$, so we may focus on $(y_n)$. We consider the following four subsequences:
• The subsequence of $(y_n)$ defined by $y_{4n}=\cos(\frac{4n\pi}{2})=\cos(2n\pi)=1$, which satisfies $\lim_{n\to\infty}y_{4n}=1$. Thus the subsequence $(c_{4n})$ of $(c_n)$ defined by $c_{4n}=x_{4n}\cdot y_{4n}$ tends to $4\cdot1=4$.
• The subsequence of $(y_n)$ defined by $y_{4n+1}=\cos(\frac{(4n+1)\pi}{2})=\cos(2n\pi+\frac\pi2)=\cos(\frac\pi2)=0$ satisfies $\lim_{n\to\infty}y_{4n+1}=0$. Hence in this case the subsequence $(c_{4n+1})$ of $(c_n)$ defined by $c_{4n+1}=x_{4n+1}\cdot y_{4n+1}$ tends to $4\cdot0=0$.
• The subsequence of $(y_n)$ defined by $y_{4n+2}=\cos(\frac{(4n+2)\pi}{2})=\cos(2n\pi+\pi)=\cos(\pi)=-1$ satisfies $\lim_{n\to\infty}y_{4n+2}=-1$. Thus the subsequence $(c_{4n+2})$ of $(c_n)$ defined by $c_{4n+2}=x_{4n+2}\cdot y_{4n+2}$ tends to $4\cdot(-1)=-4$.
• The subsequence of $(y_n)$ defined by $y_{4n+3}=\cos(\frac{(4n+3)\pi}{2})=\cos(2n\pi+\frac{3\pi}{2})=\cos(\frac{3\pi}{2})=0$ tends to 0, and hence the subsequence $(c_{4n+3})$ of $(c_n)$ tends to 0 as well.
To summarize, the largest limit among the subsequences $(c_{4n})$, $(c_{4n+1})$, $(c_{4n+2})$ and $(c_{4n+3})$ is 4 and the smallest one is $-4$. This means that $\limsup_{n\to\infty}c_n=4$ and $\liminf_{n\to\infty}c_n=-4$, respectively. □
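The behaviour of $(c_n)$ can also be made visible in Sage: plotting a couple of hundred terms shows the values clustering along the four limits computed above (a sketch, with our choice of range and styling):

c(n) = 4*n/(n + 1)*cos(n*pi/2)
p = points([(k, c(k)) for k in range(1, 201)], size=10, color="black")
p += line([(0, 4), (200, 4)], linestyle="--", color="red")     # limsup = 4
p += line([(0, -4), (200, -4)], linestyle="--", color="blue")  # liminf = -4
show(p)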
5.E.35. Based on the result from 5.B.18 on the Euler number, compute $\limsup_{n\to\infty}a_n$ and $\liminf_{n\to\infty}a_n$, where $(a_n)_{n=1}^{\infty}$ is the sequence defined by $a_n=\big(1+\frac{(-1)^n}{n}\big)^n$, for $n\in\mathbb{Z}^+$.

Solution. When the definition of a sequence $(a_n)$ involves the expression $(-1)^n$ (or $(-1)^{n+1}$, or $(-1)^{n-1}$), then to compute the limes superior/inferior of $(a_n)$ one may consider the subsequences determined by $2n$ and $2n+1$, respectively. Hence, for the specific task, let us first consider the subsequence of $(a_n)$ defined by $a_{2n}=\big(1+\frac{1}{2n}\big)^{2n}$. We have $2n\to+\infty$ as $n\to+\infty$, and hence by 5.B.18 we see that $\lim_{n\to+\infty}a_{2n}=e$. On the other hand, the subsequence of $(a_n)$ defined by $a_{2n+1}=\big(1+\frac{(-1)^{2n+1}}{2n+1}\big)^{2n+1}$ satisfies $a_{2n+1}=\big(1-\frac{1}{2n+1}\big)^{2n+1}$, and a combination of 5.B.18 with exercise 5.E.23 above gives $\lim_{n\to+\infty}a_{2n+1}=e^{-1}$. As a conclusion, we get that $\limsup_{n\to\infty}a_n=e$ and $\liminf_{n\to\infty}a_n=e^{-1}$. □

Let us now present a few extra tasks on the topological notions that we have met so far in Chapter 5.

5.E.36. Is either of the sets $\mathbb{N}$ or $\mathbb{Q}$ an open or a closed subset of $\mathbb{R}$? ⃝

5.E.37. (a) Given the sets $\mathbb{R}^*=\mathbb{R}\setminus\{0\}$, $\mathbb{R}\setminus\mathbb{Q}$, $\mathbb{R}\setminus\mathbb{Z}$, and $(1,2)\cup\{5\}$, decide which of them are open. (b) Similarly, given the sets $\mathbb{R}\setminus\mathbb{Q}$, $\mathbb{R}\setminus\mathbb{Z}$ and $[1,2]\cup\{5\}$, locate the closed one. ⃝

5.E.38. (a) Consider the set $A=\bigcap_{n\in\mathbb{N}^*}\big(-\frac1n,\,1+\frac1n\big)$, where as usual $\mathbb{N}^*=\mathbb{N}\setminus\{0\}$. Show that $A=[0,1]$. (b) Is the set $B=\bigcup_{n\in\mathbb{N}^*}\big[\frac1n,\,1-\frac1n\big]$ an open or a closed subset of $\mathbb{R}$?

Solution. (a) Clearly, $[0,1]\subset A$, and we need to prove the opposite inclusion. So assume that $x\in A$, that is, $-\frac1n<x<1+\frac1n$ for all $n\in\mathbb{Z}^+$. Then we have $x\ge\sup\{-\frac1n : n\in\mathbb{Z}^+\}$ and $x\le\inf\{1+\frac1n : n\in\mathbb{Z}^+\}$. However, $\sup\{-\frac1n : n\in\mathbb{Z}^+\}=0$ and $\inf\{1+\frac1n : n\in\mathbb{Z}^+\}=1$, which means that $x\in[0,1]$. (b) This can be treated similarly; in particular, one can show that
$$B=\bigcup_{n\in\mathbb{N}^*}\Big[\frac1n,\,1-\frac1n\Big]=(0,1),$$
which we leave for practice. Hence $B$ is open, and this shows that a union of infinitely many closed sets is not necessarily closed. □

5.E.39. Compact sets. Suppose that we define the compactness of a subset $A\subset\mathbb{R}$ via the Bolzano-Weierstrass theorem (see (4) in Theorem 5.2.8), that is, $A$ is said to be compact if for any sequence $(x_n)$ in $A$ there is a convergent subsequence $(x_{n_k})$ of $(x_n)$ whose limit belongs to $A$. Under this definition, show that a compact set $A$ is closed and bounded.

Solution. If $A$ is the empty set, then the result clearly holds; hence assume that $A\ne\emptyset$ and that $A$ is compact. According to our new definition, this means that for any sequence $(x_n)$ in $A$ there is a convergent subsequence $(x_{n_k})$ of $(x_n)$ whose limit, say $\ell$, belongs to $A$. Hence if $(x_n)$ is convergent, then we must have $\lim_{n\to\infty}x_n=\lim_{k\to\infty}x_{n_k}=\ell\in A$. This shows that $A$ is closed, and it remains to show that $A$ is bounded. For each $n\in\mathbb{N}^*=\mathbb{N}\setminus\{0\}$ consider the open set $C_n=(-n,n)$. Obviously $\bigcup_{n\in\mathbb{N}^*}C_n=\mathbb{R}$, hence $A\subseteq\bigcup_{n\in\mathbb{N}^*}C_n$. Since each $C_n$ is open, this means that the family $\{C_n : n\in\mathbb{N}^*\}$ is an open cover of $A$. However, $A$ is compact, and by (5) in Theorem 5.2.8 each of its open covers contains a finite subcover of $A$. Thus there exists $m\in\mathbb{N}^*$ such that
$$A\subseteq\bigcup_{n=1}^{m}C_n,\qquad\text{with }\bigcup_{n=1}^{m}C_n=(-m,m).$$
This shows that $A$ is contained in the bounded interval $(-m,m)$; therefore $A$ is bounded. For instance, any interval of the form $[a,b]\subset\mathbb{R}$ with $a<b$, $a,b\in\mathbb{R}$, is closed (and bounded) and hence compact. □
5.E.40. An open cover. Find an open cover, a subcover and a finite subcover of $A:=\{x\in\mathbb{R} : 0\le x\le1\}=[0,1]$.

Solution. An open cover of $A$ is given by the family $\mathcal{A}:=\{A_a=(a-\varepsilon,a+\varepsilon)\}$, where $a\in A$ and $\varepsilon>0$. As a subcover, for the same $\varepsilon$ take the subfamily $\{A_b=(b-\varepsilon,b+\varepsilon)\}$, where $b\in\{x\in\mathbb{Q} : 0\le x\le1\}$. Notice however that this is not a finite subcover. To obtain a finite subcover, consider the family
$$\{A_{0\cdot\varepsilon},A_{1\cdot\varepsilon},A_{2\cdot\varepsilon},\ldots,A_{n_0\cdot\varepsilon}\}=\{A_0,A_\varepsilon,A_{2\varepsilon},\ldots,A_{n_0\varepsilon}\},$$
where $n_0\in\mathbb{N}$ is the largest natural number with $n_0\varepsilon<1$. This is a finite subfamily of $\{A_a=(a-\varepsilon,a+\varepsilon) : a\in A\}$ (every set in this family is obviously a member of $\mathcal{A}$), and moreover it covers $A=[0,1]$, i.e., $A\subseteq A_0\cup A_\varepsilon\cup A_{2\varepsilon}\cup\cdots\cup A_{n_0\varepsilon}$. □

5.E.41. Show that the collection of open intervals $\mathcal{A}=\{A_k=(k-1,k+1)\}_{k\in\mathbb{N}^*}$ is an open cover of $\mathbb{N}^*=\mathbb{N}\setminus\{0\}$. Deduce, however, that $\mathbb{N}^*$ is not a compact subset of $\mathbb{R}$.

5.E.42. Indeed, we see that $\bigcup_{k=1}^{\infty}A_k=(0,+\infty)$ and $\mathbb{N}^*\subset(0,+\infty)$. Observe however that the family $\{A_k=(k-1,k+1)\}_{k\in\mathbb{N}^*}$ is not a cover of $\mathbb{N}$, since its union does not contain 0.¹³ On the other hand, we see that there is no finite subfamily $\{A_{i_1},\ldots,A_{i_n}\}$ of $\mathcal{A}$ which covers $\mathbb{N}^*$. For, if $\alpha=\max\{i_1,\ldots,i_n\}$, then we see that
$$\bigcup_{k=1}^{n}A_{i_k}\subset(0,\alpha+1),$$
and it is obvious that this union does not contain all naturals. Hence $\mathbb{N}^*$ is a non-compact set. ⃝

¹³ Recall that until Chapter 11 we assume that $\mathbb{N}$ contains 0.

5.E.43. Suppose that $A\subset\mathbb{R}$ is a non-empty compact subset of $\mathbb{R}$ and let $\varepsilon>0$. Specify a finite subset $B\subset\mathbb{R}$ such that $\min\{|x-y| : y\in B\}<\varepsilon$ for all $x\in A$.

Solution. It is easy to see that for any $\varepsilon>0$ the set $\{(x-\varepsilon,x+\varepsilon) : x\in A\}$ is an open cover of $A$. On the other hand, $A$ is by assumption compact. Hence there exist some positive integer $n$ and points $x_1,\ldots,x_n$ of $A$ such that
$$A\subset\bigcup_{i=1}^{n}(x_i-\varepsilon,x_i+\varepsilon).$$
Hence a finite subset $B\subset\mathbb{R}$ with the required property is given by these points, i.e., $B=\{x_1,\ldots,x_n\}$. □

The theory of limits often combines with the theory of matrices, and can be used to solve tasks related to linear algebra. For instance, let $(A_n)_{n=1}^{\infty}$ be a sequence of $2\times2$ matrices, with $A_n=\begin{pmatrix}a_n&b_n\\ c_n&d_n\end{pmatrix}$, where $(a_n)_{n=1}^{\infty}$, $(b_n)_{n=1}^{\infty}$, $(c_n)_{n=1}^{\infty}$ and $(d_n)_{n=1}^{\infty}$ are sequences of real (or complex) numbers. When $a_n\to a$, $b_n\to b$, $c_n\to c$ and $d_n\to d$ for some numbers $a,b,c,d$, we say that $(A_n)_{n=1}^{\infty}$ converges to the matrix $A=\begin{pmatrix}a&b\\ c&d\end{pmatrix}$, i.e., $\lim_{n\to\infty}A_n=A$. In this case $A$ is referred to as the "limit matrix" of $(A_n)_{n=1}^{\infty}$. Let us illustrate such an example.

5.E.44. Given a positive real number $\varphi$, consider the matrix sequence $(A_n)$ whose general term is given by
$$A_n=\begin{pmatrix}1&-\frac{\varphi}{n}\\[2pt] \frac{\varphi}{n}&1\end{pmatrix}^{\!n},\qquad n\in\mathbb{Z}^+.$$
Based on the relation $\lim_{n\to\infty}\big(1+\frac zn\big)^n=e^z$ for all $z\in\mathbb{C}$, prove that the limit matrix $A$ of $(A_n)$ exists and describes a rotation of the plane by the angle $\varphi$, i.e.,
$$A=\begin{pmatrix}\cos(\varphi)&-\sin(\varphi)\\ \sin(\varphi)&\cos(\varphi)\end{pmatrix}.$$

Solution. Let us set $\frac{\varphi}{n}=\tan(\theta_n)$. Based on basic matrix calculus we can then express the matrix sequence $(A_n)$ as follows:
$$A_n=\begin{pmatrix}1&-\tan(\theta_n)\\ \tan(\theta_n)&1\end{pmatrix}^{\!n}=\frac{1}{\cos^n(\theta_n)}\begin{pmatrix}\cos(\theta_n)&-\sin(\theta_n)\\ \sin(\theta_n)&\cos(\theta_n)\end{pmatrix}^{\!n}=\frac{1}{\cos^n(\theta_n)}\begin{pmatrix}\cos(n\theta_n)&-\sin(n\theta_n)\\ \sin(n\theta_n)&\cos(n\theta_n)\end{pmatrix}.$$
Notice that we used the relations
$$\begin{pmatrix}1&-\tan(\theta_n)\\ \tan(\theta_n)&1\end{pmatrix}=\frac{1}{\cos(\theta_n)}\begin{pmatrix}\cos(\theta_n)&-\sin(\theta_n)\\ \sin(\theta_n)&\cos(\theta_n)\end{pmatrix},\qquad
\begin{pmatrix}\cos(\theta_n)&-\sin(\theta_n)\\ \sin(\theta_n)&\cos(\theta_n)\end{pmatrix}^{\!n}=\begin{pmatrix}\cos(n\theta_n)&-\sin(n\theta_n)\\ \sin(n\theta_n)&\cos(n\theta_n)\end{pmatrix}.$$
Although the first is obvious, you may like to confirm the second one by induction. Hence we can write
$$\lim_{n\to\infty}A_n=\begin{pmatrix}\lim_{n\to\infty}\frac{\cos(n\theta_n)}{\cos^n(\theta_n)}&-\lim_{n\to\infty}\frac{\sin(n\theta_n)}{\cos^n(\theta_n)}\\[4pt] \lim_{n\to\infty}\frac{\sin(n\theta_n)}{\cos^n(\theta_n)}&\lim_{n\to\infty}\frac{\cos(n\theta_n)}{\cos^n(\theta_n)}\end{pmatrix},\quad(*)$$
and $A=\lim_{n\to\infty}A_n$ exists if and only if the limits inside the matrix in $(*)$ exist. To examine these limits we rely on de Moivre's theorem, as follows:
$$\frac{\cos(n\theta_n)}{\cos^n(\theta_n)}+i\,\frac{\sin(n\theta_n)}{\cos^n(\theta_n)}=\frac{\cos(n\theta_n)+i\sin(n\theta_n)}{\cos^n(\theta_n)}=\Big(\frac{\cos(\theta_n)+i\sin(\theta_n)}{\cos(\theta_n)}\Big)^{\!n}=\big(1+i\tan(\theta_n)\big)^n=\Big(1+\frac{i\varphi}{n}\Big)^{\!n}.$$
Recalling that $\lim_{n\to\infty}\big(1+\frac{i\varphi}{n}\big)^n=e^{i\varphi}=\cos(\varphi)+i\sin(\varphi)$, we deduce that
$$\lim_{n\to\infty}\frac{\cos(n\theta_n)}{\cos^n(\theta_n)}+i\lim_{n\to\infty}\frac{\sin(n\theta_n)}{\cos^n(\theta_n)}=\cos(\varphi)+i\sin(\varphi).$$
Comparing the real and imaginary parts, we obtain
$$\lim_{n\to\infty}\frac{\cos(n\theta_n)}{\cos^n(\theta_n)}=\cos(\varphi),\qquad\lim_{n\to\infty}\frac{\sin(n\theta_n)}{\cos^n(\theta_n)}=\sin(\varphi),$$
and the result follows by $(*)$. □
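A quick numerical sanity check of this limit can also be performed in Sage; the following sketch (our choice of $\varphi$ and $n$) compares $A_n$ for a large $n$ with the rotation matrix:

phi = 0.7; n = 10^6
An = matrix(RDF, [[1, -phi/n], [phi/n, 1]])^n
Rot = matrix(RDF, [[cos(phi), -sin(phi)], [sin(phi), cos(phi)]])
print(An)
print((An - Rot).norm())   # small, and it shrinks further as n grows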
Let us now discuss additional content concerning the limits of functions and the concept of continuity. This involves various computational and theoretical examples, as well as notable applications of the intermediate value theorem. However, we first address tasks involving sequences defined recursively, such as in problem 5.E.31 above. Let $f:A\subset\mathbb{R}\to\mathbb{R}$ be a continuous function defined on a subset $A$ of real numbers. Suppose that there exists a sequence $(x_n)$ in $A$ satisfying $x_{n+1}=f(x_n)$ for all $n$. Moreover, assume that $\lim_{n\to\infty}x_n$ exists and is equal to some (limit point) $a\in A$. In this situation it is not hard to see that $\lim_{n\to\infty}f(x_n)=f(a)$ and $f(a)=a$. This shows how useful continuous functions can be, and provides a very common way to define sequences, namely by iterating some function $f$ (recursive definition).

5.E.45. Iterations. Consider the sequence $(a_n)_{n=1}^{\infty}$ with $a_1=0$, $a_2=1$, $a_3=\sqrt3$, $\ldots$, $a_n=\sqrt{1+2a_{n-1}}$. (a) Show that $(a_n)_{n=1}^{\infty}$ is strictly increasing. (b) Is this a convergent sequence? In the positive case, find the limit $\lim_{n\to\infty}a_n$. (c) Use Sage to plot some terms of $(a_n)$ and obtain a graphical confirmation of your conclusion in (b).

Solution. We should prove that $a_{n+1}>a_n$ for all $n$. Observe that $a_2=1>0=a_1$. Assuming that $a_n>a_{n-1}$ for $n\ge2$ and proceeding by induction on $n$, one gets $a_{n+1}=\sqrt{1+2a_n}>\sqrt{1+2a_{n-1}}=a_n$. Using mathematical induction on $n$ we can also show that $(a_n)$ is bounded above; in fact $a_n<3$ for all naturals $n\ge1$. This is because $a_1=0<3$, and assuming $a_n<3$ we arrive at the claim: $a_{n+1}=\sqrt{1+2a_n}<\sqrt{1+2\cdot3}<\sqrt9=3$. Since the sequence $(a_n)$ is strictly increasing and bounded above, by the monotone sequence theorem it must be convergent. To determine its limit we rely on the continuity of the square root function $\sqrt{x}$, as follows. Suppose that $a=\lim_{n\to\infty}a_n$. Then $a$ must satisfy the relation $a=\sqrt{1+2a}$, thus $a^2=1+2a$, with solutions $a=1\pm\sqrt2$. However, $a_n\ge0$ for all $n$, so one can discard the negative solution; that is, $\lim_{n\to\infty}a_n=1+\sqrt2$. Let us now use Sage to plot some terms of $(a_n)$. To do so, we will first use the def command to introduce our sequence. This can be done as follows:
def a(n, D={}):
    if n in D.keys():
        return D[n]
    if n == 1:
        sol = 0
    else:
        sol = sqrt(1 + 2*a(n-1))
    D[n] = sol
    return sol

In this way, to test different values of $(a_n)$ one can simply type a(2).n(); a(3).n(); a(10).n(), etc. Notice however that defining $(a_n)$ via the def method does not allow us to compute the limit of $(a_n)$ symbolically, which means that typing lim(a(n), n = oo) returns an error (even if we declare the variable n as symbolic before the def command). Now we can use this routine to plot $(a_n)$. Together with the points we sketch the line $y=1+\sqrt2$, so that the visualization of our conclusion about the limit is easier.

N=50
pts = [(n, a(n)) for n in range(1, N)]
p=points(pts, color="slategray", size=25)
p+=line([(0, 1+sqrt(2)), (51, 1+sqrt(2))], rgbcolor=(0.7,0.2,0.3), linestyle="--")
p+=point([0, 1+sqrt(2)], size=20, color="darkblue")
p+=text(r"$1+\sqrt{2}$", (-4, 1+sqrt(2)), color="darkblue", fontsize="14")
p.show(figsize=6)

The figure that Sage returns shows the points $(n,a_n)$ approaching the dashed line $y=1+\sqrt2$. □

5.E.46. Consider the sequence $(a_n)_{n=1}^{\infty}$ defined by the relation
$$a_{n+1}=\frac12\Big(a_n+\frac{a^2}{a_n}\Big)$$
for all positive integers $n$, with $a_1>a$, for some $a>0$. Show that $\lim_{n\to+\infty}a_n=a$.

Solution. For all naturals $n>1$, by the AM-GM inequality presented in Chapter 1, we have
$$a_n=\frac12\Big(a_{n-1}+\frac{a^2}{a_{n-1}}\Big)\ge\sqrt{a_{n-1}\cdot\frac{a^2}{a_{n-1}}}=a>0.$$
Thus, together with the condition $a_1>a$, this gives $a_n\ge a$ for all $n$, and hence $(a_n)$ is bounded. It is also decreasing, since
$$a_{n+1}-a_n=\frac{a^2-a_n^2}{2a_n}\le0.\quad(\flat)$$
One may like to confirm the equality appearing in $(\flat)$ also via Sage, which can be done quickly by the block

var("n, c"); function("a")(n)
bool((1/2)*(a(n)+(c^2/a(n)))-a(n)==(c^2-a(n)*a(n))/(2*a(n)))

In this way one proves that $(a_n)$ is bounded and monotone, hence its limit $\lim_{n\to+\infty}a_n$ exists and equals $\inf\{a_n : n\in\mathbb{N}\}$. Let us set $b=\inf\{a_n : n\in\mathbb{N}\}$. Then we should have $b=\frac12\big(b+\frac{a^2}{b}\big)$, which can be equivalently expressed as $b^2=a^2$. However, by assumption $a>0$, and thus we get $b=a$. □
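The recursion of 5.E.46 is precisely Heron's method for extracting square roots (here of $a^2$), so the convergence is in fact very fast. A short numerical illustration (our choice $a=2$, $a_1=3$):

a = 2.0
x = 3.0                      # a_1 > a
for k in range(6):
    x = (x + a^2/x) / 2
    print(k + 1, x, x - a)   # the error decays quadratically

Already after a handful of steps the printed error is at machine-precision level, well beyond what the monotonicity argument alone predicts.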
5.E.47. Consider the sequence $(a_n)_{n=1}^{\infty}$ with $a_1=\sqrt2$ and $a_n=\sqrt{2+a_{n-1}}$ for all naturals $n\ge2$. Show that $(a_n)$ is convergent, and in particular compute $\lim_{n\to+\infty}a_n$. ⃝

5.E.48. Let $f:\mathbb{R}\to\mathbb{R}$ be a continuous function satisfying the relation $x^2f(x)=1-\cos(2x)$ for all $x\in\mathbb{R}$. Using the result of 5.B.31, find the explicit form of $f$.

Solution. Obviously, for $x\ne0$ the function at hand satisfies $f(x)=\frac{1-\cos(2x)}{x^2}$. For $x=0$, by the continuity of $f$ we should have $f(0)=\lim_{x\to0}f(x)$. To compute this limit one can use the known identity $\sin^2\big(\frac\theta2\big)=\frac{1-\cos(\theta)}{2}$ (half-angle formula), which can be rephrased as $2\sin^2(\theta)=1-\cos(2\theta)$. This is a very useful trigonometric identity, which in our case applies as follows:
$$f(0)=\lim_{x\to0}f(x)=\lim_{x\to0}\frac{1-\cos(2x)}{x^2}=2\lim_{x\to0}\frac{\sin^2(x)}{x^2}=2\Big(\lim_{x\to0}\frac{\sin(x)}{x}\Big)^2=2\cdot1^2=2.$$
Therefore one can now present the full form of $f$, namely
$$f(x)=\begin{cases}\dfrac{1-\cos(2x)}{x^2}, & x\ne0,\\[4pt] 2, & x=0.\end{cases}\quad\square$$

5.E.49. Using the result of 5.B.31, where adequate, evaluate the following limits:
$$\lim_{x\to0}\frac{\sin^2(x)}{x},\quad\lim_{x\to0}\frac{x}{\sin^2(x)},\quad\lim_{x\to0}\frac{\arcsin(x)}{x},\quad\lim_{x\to0}\frac{3\tan^2(x)}{5x^2},\quad\lim_{x\to0}\frac{\sin(3x)}{\sin(5x)},\quad\lim_{x\to0}\frac{\tan(3x)}{\sin(5x)}.$$ ⃝

5.E.50. Confirm that
$$\lim_{x\to0}\frac{e^{5x}-e^{2x}}{x}=3=\lim_{x\to0}\frac{e^{5x}-e^{-x}}{\sin(2x)}.$$
(Hint: recall that $\lim_{x\to0}\frac{e^x-1}{x}=1$.) Next use Sage to sketch the graphs of the involved functions. ⃝

5.E.51. Compute the limit $\lim_{x\to+\infty}\big(\frac{3x+1}{3x-2}\big)^{4x}$ with the aid of the relation $\lim_{x\to+\infty}\big(1+\frac1x\big)^x=e$. Then verify your answer via Sage.

Solution. Based on the properties of powers, we see that
$$\Big(\frac{3x+1}{3x-2}\Big)^{4x}=\frac{\big[3x\big(1+\frac{1}{3x}\big)\big]^{4x}}{\big[3x\big(1-\frac{2}{3x}\big)\big]^{4x}}=\frac{\big(1+\frac{1}{3x}\big)^{4x}}{\big(1-\frac{2}{3x}\big)^{4x}}=\frac{\Big[\big(1+\frac{1}{3x}\big)^{3x}\Big]^{4/3}}{\Big[\big(1+\frac{1}{-\frac32x}\big)^{-\frac32x}\Big]^{-8/3}}.$$
Thus
$$\lim_{x\to+\infty}\Big(\frac{3x+1}{3x-2}\Big)^{4x}=\frac{\Big[\lim_{x\to+\infty}\big(1+\frac{1}{3x}\big)^{3x}\Big]^{4/3}}{\Big[\lim_{x\to+\infty}\big(1+\frac{1}{-\frac32x}\big)^{-\frac32x}\Big]^{-8/3}}=\frac{e^{4/3}}{e^{-8/3}}=e^{\frac43+\frac83}=e^4.$$
For a confirmation via Sage just run the cell limit(((3*x+1)/(3*x-2))^(4*x), x=oo). □

5.E.52. Evaluate the limits $\lim_{x\to+\infty}\big(2+\frac1x\big)^{\frac1x}$, $\lim_{x\to+\infty}x^{-x}$ and $\lim_{x\to0}e^{\frac1x}$. ⃝

5.E.53. Evaluate the following limits:
$$\lim_{x\to0}\frac{\sin(x)}{x^3},\quad\lim_{x\to+\infty}\big(\sqrt{x^2+x}-x\big),\quad\lim_{x\to+\infty}\big(x\sqrt{1+x^2}-x^2\big),\quad\lim_{x\to0^-}\frac{\sqrt{1+\tan(x)}-\sqrt{1-\tan(x)}}{\sin(x)}.$$ ⃝

5.E.54. Use the binomial theorem to compute the limit
$$\lim_{x\to0}\frac{(1+2nx)^n-(1+nx)^{2n}}{x^2},\quad\text{for all }n\in\mathbb{N}^*.$$
Next confirm your computation via Sage. ⃝

5.E.55. Examine the convergence of the sequence $(a_n)$ with general term $a_n=\sum_{k=1}^{n}\frac{\cos(k)}{2^k}$, $n=1,2,\ldots$ ⃝

5.E.56. Compute the limit $\lim_{n\to+\infty}\frac{\lfloor xn\rfloor}{n}$, where $x\in\mathbb{R}$ and $\lfloor\ \rfloor$ is the floor function introduced in 5.B.48.

Solution. Recall from 5.B.48 that for any $x\in\mathbb{R}$ we have $\lfloor x\rfloor\le x<\lfloor x\rfloor+1$. Since $nx\in\mathbb{R}$ for all $n\in\mathbb{Z}$ and $x\in\mathbb{R}$, we also get $\lfloor nx\rfloor\le nx<\lfloor nx\rfloor+1$, which we may rewrite as
$$0\le x-\frac{\lfloor xn\rfloor}{n}<\frac1n,\quad\text{for all }n\ne0.$$
Combining this with the squeeze theorem we deduce that $\lim_{n\to+\infty}\frac{\lfloor xn\rfloor}{n}=x$, for all $x\in\mathbb{R}$. □
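A numerical glance at this limit, say for $x=\pi$, is immediate in Sage (a sketch):

for k in [10, 100, 10^4, 10^6]:
    print(k, N(floor(pi*k)/k))   # values approaching pi = 3.14159...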
5.E.57. The ceiling function and the fractional part function. (1) Use the command ceil in Sage to compute $\lceil-1.4\rceil$, $\lceil-0.5\rceil$, $\lceil2\rceil$, $\lceil2.1\rceil$, $\lceil\sqrt7\rceil$, $\lceil\pi\rceil$ and $\lceil9/2\rceil$. (2) Construct in Sage the fractional part function and then evaluate it at the values mentioned in (1). (3) Show that $\lceil x\rceil=-\lfloor-x\rfloor$ for all $x\in\mathbb{R}$. (4) If $x=n+d$, with $n\in\mathbb{Z}$ and $0\le d<1$, show that $n=\lfloor x\rfloor$ and $d=\{x\}$. (5) Show that the fractional part function is discontinuous at every integer. (6) Use Sage to sketch the graph of the ceiling function for $-5\le x\le5$ and of the fractional part function for $0\le x\le5$.

Solution. (1) It is not hard to compute
$$\lceil-1.4\rceil=-1,\ \lceil-0.5\rceil=0,\ \lceil2\rceil=2,\ \lceil2.1\rceil=3=\lceil\sqrt7\rceil,\ \lceil\pi\rceil=4,\ \lceil9/2\rceil=5.$$
A confirmation in Sage is obtained with the command ceil, as follows:

print(ceil(-1.4)); print(ceil(-0.5)); print(ceil(2)); print(ceil(2.1))
print(ceil(sqrt(7))); print(ceil(pi)); print(ceil(9/2))

(2) To introduce in Sage the fractional part function $x\mapsto\{x\}=x-\lfloor x\rfloor$, use the floor function floor and type the syntax

fractal(x) = x - floor(x)

Then, to compute the required evaluations, type print(fractal(-1.4)), print(fractal(-0.5)), etc. Notice that Sage has a built-in function for the fractional part, called frac, but it agrees with ours only for positive reals. In this way you can quickly verify the following answers:
$$\{-1.4\}=-1.4-\lfloor-1.4\rfloor=-1.4-(-2)=0.6,\quad\{-0.5\}=-0.5-(-1)=0.5,\quad\{2\}=2-2=0,$$
$$\{2.1\}=2.1-2=0.1,\quad\{\sqrt7\}=\sqrt7-2\approx0.645752,\quad\{\pi\}=\pi-3\approx0.141593,\quad\{9/2\}=4.5-4=0.5.$$
(3) We leave this to the reader. (4) Consider a real $x$ such that $x=n+d$ with $n\in\mathbb{Z}$ and $0\le d<1$. By (2) in 5.B.48 we know that $\lfloor x+n\rfloor=\lfloor x\rfloor+n$ for all $x\in\mathbb{R}$ and $n\in\mathbb{Z}$. Thus we have $\lfloor x\rfloor=\lfloor n+d\rfloor=n+\lfloor d\rfloor=n$, since $\lfloor d\rfloor=0$. Consequently, $x=n+d=\lfloor x\rfloor+d=\lfloor x\rfloor+\{x\}$, that is, $\{x\}=d$. This proves (4). (5) Let $n\in\mathbb{Z}$ be an integer. We see that
$$\lim_{x\to n^+}\{x\}=\lim_{h\to0^+}\{n+h\}=\lim_{h\to0^+}\big(n+h-\lfloor n+h\rfloor\big)=\lim_{h\to0^+}(n+h)-\lim_{h\to0^+}\lfloor n+h\rfloor=n-n=0.$$
On the other hand, we compute
$$\lim_{x\to n^-}\{x\}=\lim_{h\to0^+}\{n-h\}=\lim_{h\to0^+}\big(n-h-\lfloor n-h\rfloor\big)=n-(n-1)=1.$$
It follows that the function $x\mapsto\{x\}$ is not continuous at any integer (see also its graph below). (6) For the ceiling function one may proceed in an analogous way to the floor function presented in 5.B.48. This method includes the jump discontinuities, and is encoded by the following block:

g=ceil(x)
p=plot(g, x, -5, 5, ticks=[1,1])
for x in [-5..5]:
    p+=point([x,x], size=30, color="black")
for x in [-4..5]:
    p+=circle((x-1,x), 0.08, color="black")
show(p)

In a similar way we can sketch the graph of the fractional part function (together with the jump discontinuities):

fractal(x)=x-floor(x)
q=plot(fractal, x, 0, 5)
for x in [0..5]:
    q+=point([x,0], size=30, color="black")
for x in [1..5]:
    q+=circle((x,1), 0.03, color="black")
show(q)

Sage then constructs the graph of the ceiling function and that of the fractional part function (in both figures one should ignore the vertical lines). Notice that the fractional part function is non-negative. □

5.E.58. Examine the continuity of the function $f(x)=(x-1)^{-\operatorname{sgn}(x)}$ at the points 0 and 1. ⃝

5.E.59. Given the points $-\pi,0,1,2,3,\pi$, determine whether the function
$$f(x)=\begin{cases}x, & x<0;\\ 0, & 0\le x<1;\\ x, & x=1;\\ 0, & 1<x<2;\\ x, & 2\le x\le3;\\ \frac{1}{x-3}, & x>3\end{cases}$$
is continuous, left-continuous or right-continuous at these points. ⃝

5.E.60. Find all $p\in\mathbb{R}$ for which the function $f(x)=\frac{\sin(6x)}{3x}$, $x\in\mathbb{R}^*=\mathbb{R}\setminus\{0\}$; $f(0)=p$, is continuous at the origin. ⃝

5.E.61. Choose a real number $a$ so that the function $h(x)=\frac{x^4-1}{x-1}$, $x>1$; $h(x)=a$, $x\le1$, is continuous on $\mathbb{R}$. ⃝

5.E.62. By defining the values at the points $-1$ and $1$, extend the function
$$f(x)=(x^2-1)\sin\frac{2x-1}{x^2-1},\quad x\in\mathbb{R}\setminus\{\pm1\},$$
so that the resulting function is continuous on the whole of $\mathbb{R}$. ⃝

5.E.63. Let $f:\mathbb{R}\to\mathbb{R}$ be a function satisfying the relation
$$f^3(x)+3f^2(-x)f(x)=4x^2,\quad x\in\mathbb{R}.\quad(\sharp)$$
(a) Prove that $f$ is even, and next show that $\lim_{x\to0}f(x)=0$. (b) Show that $\lim_{x\to0^+}\frac{\ln(f(x))}{\ln(\sqrt{x^3})}=4/9$.

Solution. (a) In the defining equation $(\sharp)$ replace $x$ by $-x$. This gives $f^3(-x)+3f^2(x)f(-x)=4x^2$, and by subtracting this relation from $(\sharp)$ we get
$$f^3(x)-3f^2(x)f(-x)+3f(x)f^2(-x)-f^3(-x)=4x^2-4x^2=0\iff\big(f(x)-f(-x)\big)^3=0,\quad x\in\mathbb{R},$$
where we used the identity $(a-b)^3=a^3-3a^2b+3ab^2-b^3$, with $a,b\in\mathbb{R}$. Therefore, we deduce that $f(x)-f(-x)=0$ for all $x\in\mathbb{R}$, i.e., $f$ is even. Now, because $f$ is even, the defining equation takes the form $f^3(x)+3f^2(x)f(x)=4x^2$, or $4f^3(x)=4x^2$, that is, $f^3(x)=x^2$ for all $x\in\mathbb{R}$, from which we get $f(x)=\sqrt[3]{x^2}=x^{2/3}$, $x\in\mathbb{R}$. As for the limit, recall that the function $f(x)=x^{2/3}$ is continuous everywhere on $\mathbb{R}=(-\infty,\infty)$; see also Problem 6.A.21, where the graph of $f$ is presented. Therefore, $\lim_{x\to0}f(x)=\lim_{x\to0}x^{2/3}=f(0)=0$. (b) The form of $f$ is known from the first part, thus one can easily prove the claim, i.e.,
$$\lim_{x\to0^+}\frac{\ln(f(x))}{\ln(\sqrt{x^3})}=\lim_{x\to0^+}\frac{\ln(x^{2/3})}{\ln(x^{3/2})}=\lim_{x\to0^+}\frac{\frac23\ln(x)}{\frac32\ln(x)}=\frac49.\quad\square$$
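Since part (a) gives $f(x)=x^{2/3}$, the limit in (b) can also be checked in Sage; a sketch (if the first command is returned unevaluated, the manually simplified quotient in the second line settles it):

assume(x > 0)
print(limit(log(x^(2/3))/log(sqrt(x^3)), x=0, dir="+"))
print(limit(((2/3)*log(x))/((3/2)*log(x)), x=0, dir="+"))   # 4/9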
5.E.64. Let $f:\mathbb{R}\to(-\infty,1)$ and $g:\mathbb{R}\to(1,+\infty)$ be continuous functions satisfying $f(a)=na$ and $g(b)=nb$ for certain reals $0<a<b$ and some positive integer $n$. Prove the existence of some $\xi\in(a,b)$ satisfying the equation $\xi=\frac{f(\xi)g(\xi)}{n}$.

Solution. Observe that $f(a)=na<1$, since $f(a)\in(-\infty,1)$, and hence $a<\frac1n$; while $g(b)=nb>1$, since $g(b)\in(1,+\infty)$, and hence $b>\frac1n$. Thus, we can indeed form the interval $[a,b]$ and consider the function
$$\ell(x)=f(x)g(x)-nx,\quad x\in[a,b].$$
Then we see that:
• $\ell(a)=f(a)g(a)-na=na\,g(a)-na=na\big(g(a)-1\big)>0$, since $g(a)>1$, so $g(a)-1>0$, and $0<a<1/n$.
• $\ell(b)=f(b)g(b)-nb=nb\,f(b)-nb=nb\big(f(b)-1\big)<0$, since $f(b)<1$, hence $f(b)-1<0$, while $0<\frac1n<b$.
Thus $\ell(a)\ell(b)<0$, and by Bolzano's theorem there exists some $\xi\in(a,b)$ with $\ell(\xi)=0$, that is, $n\xi=f(\xi)g(\xi)$. □

5.E.65. Let $f:\mathbb{R}\to\mathbb{R}$ be a periodic function with period $T>0$, that is, $f(x+T)=f(x)$ for all $x\in\mathbb{R}$. If $\lim_{x\to+\infty}f(x)$ exists and is a real number $\alpha$, show that $f(x)=\alpha$ for all $x\in\mathbb{R}$, i.e., $f$ is constant.¹⁴

Solution. By assumption $f$ is periodic with period $T$, hence it suffices to prove that $f$ is constant on $[0,T)$. Since $\lim_{x\to+\infty}f(x)=\alpha\in\mathbb{R}$, for every $\varepsilon>0$ there exists $C>0$ such that $|f(x)-\alpha|<\varepsilon$ for all $x\ge C$. Moreover, for every $x\in[0,T)$ we can find some positive integer $n$ such that $x+nT\ge C$. Thus we get $|f(x)-\alpha|=|f(x+nT)-\alpha|<\varepsilon$. This shows that $|f(x)-\alpha|<\varepsilon$ for all $x\in[0,T)$ and all $\varepsilon>0$, and hence $f(x)=\alpha$ for all $x\in\mathbb{R}$. □

¹⁴ Later we will see that the condition $\lim_{x\to+\infty}f(x)=\alpha\in\mathbb{R}$ means that the line $y=\alpha$ is a horizontal asymptote of $y=f(x)$.

5.E.66. Let $f:\mathbb{R}\to\mathbb{R}$ be a continuous function which is periodic with period $T>0$. Show that there exists $x_0\in\mathbb{R}$ such that $f\big(x_0+\frac T2\big)=f(x_0)$. Then use Sage to compute $x_0$ for $f(x)=\cos(x)$.

Solution. It is sufficient to consider the function $g(x)=f\big(x+\frac T2\big)-f(x)$, which is continuous on $\mathbb{R}$ (since $f$ is continuous) and satisfies
$$g(0)=f\Big(\frac T2\Big)-f(0),\qquad g\Big(\frac T2\Big)=f(0)-f\Big(\frac T2\Big)=-g(0).$$
Recall that if a continuous function has values of opposite sign inside an interval, then it admits a root in that interval (Bolzano's theorem, see 5.2.19). In particular, the result follows by the intermediate value theorem applied to $g$, the latter restricted to $[0,\frac T2]$. The function $f(x)=\cos(x)$ has period $T=2\pi$, thus it suffices to focus on the interval $(0,\frac T2=\pi)$. The equation $\cos(x_0+\pi)=\cos(x_0)$ is equivalent to $\cos(x_0)=0$ (since $\cos(x+\pi)=-\cos(x)$ for all $x$). Hence obviously $x_0=\frac\pi2\in(0,\pi)$ (see also the table in 5.A.3).¹⁵ In Sage just type solve(cos(x + pi) == cos(x), x). □

¹⁵ Recall that the equation $\cos(x)=0$ has general solution $\frac\pi2+k\pi$, with $k\in\mathbb{Z}$.

5.E.67. Let $f,g,h,k:\mathbb{R}\to\mathbb{R}$ be continuous functions on $\mathbb{R}$ satisfying the following conditions:
• $f(x)>0$ and $k(x)>0$ for all $x\in\mathbb{R}$;
• $f(a)=g(a)$ and $h(b)=k(b)$ for some $a\ne b\in\mathbb{R}$;
• $g(x)<f(x)$ for all $x\in\mathbb{R}\setminus\{a\}$, and $k(x)<h(x)$ for all $x\in\mathbb{R}\setminus\{b\}$.
Prove the existence of (at least) one $x_0\in(a,b)$ for which the vectors $\vec u=\big(g(x_0),f(x_0)\big)^T$ and $\vec v=\big(k(x_0),h(x_0)\big)^T$ (of $\mathbb{R}^2$) are parallel.

Solution. The vectors $\vec u$ and $\vec v$ are parallel if and only if the determinant $\begin{vmatrix}g(x_0)&f(x_0)\\ k(x_0)&h(x_0)\end{vmatrix}$ vanishes, i.e., $g(x_0)h(x_0)=k(x_0)f(x_0)$. Consider the function $\varphi(x)=g(x)h(x)-k(x)f(x)$. This function is continuous on $[a,b]$, and by the assumptions, using $f(a)=g(a)$ and $h(b)=k(b)$, it follows that
$$\varphi(a)\varphi(b)=\big(g(a)h(a)-k(a)f(a)\big)\cdot\big(g(b)h(b)-k(b)f(b)\big)=f(a)\big(h(a)-k(a)\big)\cdot k(b)\big(g(b)-f(b)\big)<0,$$
since $h(a)>k(a)$, $g(b)<f(b)$, $f(a)>0$ and $k(b)>0$. Thus, by Bolzano's theorem there exists $x_0\in(a,b)$ with $\varphi(x_0)=0$, and the claim follows. □
5.E.68. Show that the equation $x^{2024}+3x-e^x-1=0$ has at least one real solution in the open interval $(0,1)$.

Solution. The function $f(x)=x^{2024}+3x-e^x-1$ is the sum of the polynomial $x^{2024}+3x-1$ and the negative of the exponential function, hence by 5.2.17 it is continuous on $\mathbb{R}$, and in particular on the closed interval $[0,1]$. Moreover, we have $f(0)=-2<0$ and $f(1)=3-e>0$. Therefore, the claim follows by Bolzano's theorem. □

5.E.69. Let $f$ be a real-valued continuous function defined on the closed interval $[1,4]$. (a) Explain why the function $g(x)=f(x)+e^{2x-1}$, with the same domain as $f$, has a minimal and a maximal value there. (b) Prove the existence of some $x_0\in[1,4]$ such that $f(x)<f(x_0)+e^{2x_0}$ for all $x\in[1,4]$.

Solution. (a) The function $g$ is continuous for all $x\in[1,4]$, as the sum of two continuous functions defined there; see the basic limit properties in 5.2.17. Thus $g$ attains a maximal and a minimal value on the closed interval $[1,4]$, and takes all the values in between, by the Corollary in 5.2.19. (b) Using the conclusion from the first part, we can find $x_0\in[1,4]$ such that $g(x)\le g(x_0)$ for all such $x$ (i.e., at $x_0$ the function $g$ attains its maximal value). This yields the result, i.e.,
$$f(x)\le f(x_0)+e^{2x_0-1}-e^{2x-1}<f(x_0)+e^{2x_0-1}<f(x_0)+e^{2x_0},\quad x\in[1,4].\quad\square$$

5.E.70. Let $f:I\to\mathbb{R}$ be a continuous function defined on the interval $I=[a,b]$, where $a,b\in\mathbb{R}$ with $a<b$. If $f$ is injective, prove that $f$ is strictly monotone. ⃝

5.E.71. Compute the following limits or explain why they do not exist:
$$\lim_{x\to+\infty}\big(\arccos(1/(x+1))\big)^3,\quad\lim_{x\to-\infty}\arctan(1/x),\quad\lim_{x\to-\infty}\arctan(x^4),\quad\lim_{x\to-\infty}\arctan\big(\sin(x)\big).$$
Next, use Sage to verify your conclusions.

Solution. The function $\arccos(x)$ is continuous on its domain $[-1,1]$, while the function $x^3$ is continuous everywhere. Moreover, $\lim_{x\to+\infty}(1/(x+1))=0$, and $\arccos(0)=\pi/2$. Thus,
$$\lim_{x\to+\infty}\big(\arccos(1/(x+1))\big)^3=\Big[\arccos\Big(\lim_{x\to+\infty}\frac{1}{x+1}\Big)\Big]^3=\Big(\frac\pi2\Big)^3.$$
The same conclusion can be drawn graphically from the graph of the function $f(x)=(\arccos(1/(1+x)))^3$ for $x>0$. Similarly, the function $\arctan(x)$ is continuous and injective on its domain. Thus, according to 5.E.70, it should be strictly monotone, and indeed it is easy to verify that it is strictly increasing; this is also illustrated by the graph of $\arctan(x)$. Now, according to the discussion in 5.2.16, these properties allow us to move the examined limit into the argument of such a function. Therefore, it is possible to write
$$\lim_{x\to-\infty}\arctan(1/x)=\arctan\Big(\lim_{x\to-\infty}\frac1x\Big),\quad\lim_{x\to-\infty}\arctan(x^4)=\arctan\Big(\lim_{x\to-\infty}x^4\Big),\quad\lim_{x\to-\infty}\arctan\big(\sin(x)\big)=\arctan\Big(\lim_{x\to-\infty}\sin(x)\Big).$$
However, $\lim_{x\to-\infty}(1/x)=0$ and $\lim_{x\to-\infty}x^4=+\infty$, while $\lim_{x\to-\infty}\sin(x)$ does not exist. Therefore, using the relations above we obtain $\lim_{x\to-\infty}\arctan(1/x)=\arctan(0)=0$ and $\lim_{x\to-\infty}\arctan(x^4)=\pi/2$, while the last limit does not exist. Finally, for a confirmation via Sage just type:

f(x)=(arccos(1/(1+x)))^3; print(lim(f(x), x=+oo))
g(x)=arctan(1/x); print(lim(g(x), x=-oo))
h(x)=arctan(x^4); print(lim(h(x), x=-oo))
k(x)=arctan(sin(x)); print(lim(k(x), x=-oo))

□
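Returning for a moment to 5.E.68: Bolzano's theorem only asserts the existence of a root, but numerically one can also locate it, e.g., with Sage's find_root (a sketch):

f(x) = x^2024 + 3*x - e^x - 1
print(f(0).n(), f(1).n())   # opposite signs: -2 and 3 - e > 0
print(find_root(f, 0, 1))   # a root inside (0, 1)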
5.E.72. Show by an example that, in general, the intermediate value theorem fails for discontinuous functions.

Solution. For some positive real numbers $\alpha,\delta$, consider the function $f:[-\delta,\delta]\to\mathbb{R}$ with
$$f(x)=\begin{cases}\alpha, & -\delta\le x<0,\\ -\alpha, & 0\le x\le\delta.\end{cases}$$
Obviously, this function is discontinuous at $x=0$, since $\lim_{x\to0^-}f(x)=\alpha\ne\lim_{x\to0^+}f(x)=-\alpha=f(0)$. On the other hand, we see that $f(-\delta)=\alpha>0$ and $f(\delta)=-\alpha<0$, so that $f(-\delta)f(\delta)<0$, but $f(x)\ne0$ for every $x\in[-\delta,\delta]$. □

Let us now present some further applications related to the intermediate value theorem. We first describe the 1-dimensional case of the so-called Borsuk-Ulam theorem. We then study an application related to the temperature at antipodal points of the Earth.

5.E.73. 1-dimensional Borsuk-Ulam theorem. Let $S^1$ be the unit circle. Show that for any continuous function $f:S^1\to\mathbb{R}$ there exists some $x\in S^1$ with $f(-x)=f(x)$.¹⁶

Solution. Consider the function $g:S^1\to\mathbb{R}$ defined by $g(x):=f(x)-f(-x)$. If $g(x_0)=0$ for some $x_0\in S^1$, we are done. Otherwise, we may assume that $g(x)>0$ for some $x$ (the case $g(x)<0$ is treated similarly). Then $g(-x)=f(-x)-f(x)=-\big(f(x)-f(-x)\big)=-g(x)<0$. Thus, Bolzano's theorem certifies the existence of some $x_0\in S^1$ between $-x$ and $x$ satisfying $g(x_0)=0$. This proves the claim. We should mention that the result extends to $n$ dimensions, as follows. Let
$$S^n=\Big\{(x_1,\ldots,x_{n+1})\in\mathbb{R}^{n+1} : \sum_{k=1}^{n+1}x_k^2=1\Big\}\subset\mathbb{R}^{n+1}$$
be the $n$-sphere. Then, for any continuous map $f:S^n\to\mathbb{R}^n$ there exists $x\in S^n$ such that $f(-x)=f(x)$.¹⁷ □

¹⁶ Prove that this is equivalent to saying that any continuous odd function $S^1\to\mathbb{R}$ has a zero.
¹⁷ Another famous result in topology with a very similar flavour is the so-called "Brouwer fixed point theorem", which we will analyze in Chapter 9.

5.E.74. Earth's temperature and antipodal points. Prove that at any time there is a point on the globe where the temperature agrees with the temperature at the antipodal point, assuming that the temperature varies continuously.¹⁸

Solution. Consider the equator (or any great circle of the globe), and denote by $T:[0,2\pi]\to\mathbb{R}$ the temperature, so that $T(x)$ is the temperature at a point $x$ on this great circle. Notice that every great circle through any point also passes through its antipodal point, so there are infinitely many great circles through two antipodal points. Define the function $f(x)=T(x)-T(x+\pi)$, which is continuous on $[0,\pi]$, since $T$ is assumed to be continuous. If $f(0)=0$ we have found our antipodal points with equal temperature. Otherwise we may assume that $f(0)\ne0$. We see that $f(0)=T(0)-T(\pi)$ and $f(\pi)=T(\pi)-T(2\pi)$, and since $T$ is periodic with period $2\pi$, we finally get $f(\pi)=T(\pi)-T(2\pi)=T(\pi)-T(0)=-f(0)$. By Bolzano's theorem this implies that there exists some point $x_0\in(0,\pi)$ such that $f(x_0)=0$. Thus $T(x_0)=T(x_0+\pi)$, which proves the claim. □

¹⁸ Observe that the same result applies to the barometric pressure at antipodal points of the globe.

C) Material on derivatives

For convenience, we divide this paragraph into two subsections: one containing technical exercises for additional practice on derivatives, and the other focusing on applications of derivatives.

C1) Material on derivatives - Practice

5.E.75. Show that the function $f(x)=x|x|$ is differentiable for all $x\in\mathbb{R}$.

Solution.
Based on the definition of the absolute value function, one deduces that $f(x)=x^2$ for $x>0$ and $f(x)=-x^2$ for $x<0$, so for all $x\in\mathbb{R}\setminus\{0\}$ the function $f$ is differentiable (it is easy to see that $(x^2)'=2x$; see also below). We need to check the differentiability of $f$ at $x_0=0$, where $f(0)=0$. We should compute the left and right derivatives, $f'_-(0)$ and $f'_+(0)$, respectively. We see that
$$f'_-(0)=\lim_{x\to0^-}\frac{f(x)-f(0)}{x-0}=\lim_{x\to0^-}\frac{-x^2-0}{x}=0,\qquad f'_+(0)=\lim_{x\to0^+}\frac{f(x)-f(0)}{x-0}=\lim_{x\to0^+}\frac{x^2-0}{x}=0,$$
that is, the left and right derivatives exist and are equal. Hence, $f$ is also differentiable at $x_0=0$, with $f'(0)=0$. □

5.E.76. Compute the derivatives of the functions $f(x)=(\tan(x))^x$ and $g(x)=(\sin(x))^{\ln(x)}$.

Solution. By the quotient rule, and since $\tan(x)=\frac{\sin(x)}{\cos(x)}$, we easily get
$$\tan'(x)\equiv(\tan(x))'\equiv\frac{d}{dx}\tan(x)=\Big(\frac{\sin(x)}{\cos(x)}\Big)'=\frac{1}{\cos^2(x)}.$$
Moreover, we have $\tan(x)=e^{\ln(\tan(x))}$, and hence we deduce that $f(x)=\big(e^{\ln(\tan(x))}\big)^x=e^{x\ln(\tan(x))}$. Combining this with the relations $(e^{h(x)})'=h'(x)\,e^{h(x)}$ and $(\ln(h(x)))'=h'(x)/h(x)$, we get
$$f'(x)=\big(e^{x\ln(\tan(x))}\big)'=\big(x\ln(\tan(x))\big)'\cdot f(x)=\Big(\ln(\tan(x))+x\cdot\frac{\tan'(x)}{\tan(x)}\Big)f(x)=\Big(\ln(\tan(x))+x\cdot\frac{1+\tan^2(x)}{\tan(x)}\Big)f(x),$$
where in the final relation one replaces $1/\cos^2(x)$ by $1+\tan^2(x)$. In Sage a verification is given as usual:

f(x)=tan(x)**x; df=diff(f, x); show(df)

In a similar vein, for the function $g$ we get $g(x)=(\sin(x))^{\ln(x)}=\big(e^{\ln(\sin(x))}\big)^{\ln(x)}=e^{\ln(x)\cdot\ln(\sin(x))}$. Thus,
$$g'(x)=\big(e^{\ln(x)\cdot\ln(\sin(x))}\big)'=\big(\ln(x)\cdot\ln(\sin(x))\big)'\cdot g(x)=\Big(\frac1x\ln(\sin(x))+\ln(x)\cdot\frac{\cos(x)}{\sin(x)}\Big)g(x)=\Big(\frac{\ln(\sin(x))}{x}+\ln(x)\cot(x)\Big)g(x).$$
For a verification in Sage type

g(x)=sin(x)**ln(x); show(g.derivative(x))

Observe that, for your convenience, the given solution highlights two different but familiar methods in Sage that we can always use to compute derivatives. □

5.E.77. Differentiate the expression $\dfrac{\sqrt[4]{x-1}\,(x+2)^3}{e^x(x+132)^2}$, with $x>1$. ⃝

5.E.78. Calculate the first derivative of the functions given below:
(1) $a(x)=(2-x^2)\cos(x)+2x\sin(x)$, with $x\in\mathbb{R}$;
(2) $b(x)=\sin(\sin(x))$, with $x\in\mathbb{R}$;
(3) $c(x)=\sin\big(\ln(x^3+2x)\big)$, with $x\in(0,+\infty)$;
(4) $d(x)=(1+x-x^2)/(1-x+x^2)$, with $x\in\mathbb{R}$;
(5) $\varepsilon(x)=\sqrt{x\sqrt{x\sqrt{x}}}$, with $x\in(0,+\infty)$;
(6) $f(x)=\sin(\sin(\sin(x)))$, with $x\in\mathbb{R}$;
(7) $g(x)=\sqrt[3]{\sin(x)}$, with $x\in\mathbb{R}\setminus\{n\pi : n\in\mathbb{Z}\}$;
(8) $h(x)=\sqrt[3]{\frac{1+x^3}{1-x^3}}$, with $x\in\mathbb{R}\setminus\{\pm1\}$. ⃝

5.E.79. Recall that a function $f$ is called odd if $f(-x)=-f(x)$ for all $x$ in its domain, and even if $f(-x)=f(x)$ for all such $x$. Therefore, an odd function is symmetric about the origin, and an even function is symmetric with respect to the $y$-axis. (a) Present a function defined on $\mathbb{R}$ which is neither even nor odd. (b) Show that the derivative of an even (respectively odd) function is an odd (respectively even) function. ⃝

5.E.80. Extend the discussion of real polynomials of degree at most three, initiated in Problem 5.A.2, using derivatives. ⃝
So far we have discussed most of the transcendental functions: the exponential, the logarithms, the trigonometric functions and their inverses. Another remarkable class consists of the so-called "hyperbolic functions", defined by
$$\sinh(x)=\frac{e^x-e^{-x}}{2},\qquad\cosh(x)=\frac{e^x+e^{-x}}{2},\qquad\tanh(x)=\frac{\sinh(x)}{\cosh(x)},\qquad\coth(x)=\frac{\cosh(x)}{\sinh(x)},$$
where $x\in\mathbb{R}$ for the first three cases, while the domain of $\coth$ is $\mathbb{R}^*=\mathbb{R}\setminus\{0\}$. Notice that
$$\cosh^2(x)-\sinh^2(x)=1,\quad x\in\mathbb{R}.$$
The hyperbolic functions come in handy at times (especially in physics), and the next task is devoted to them; see also 5.4.12.

5.E.81. Hyperbolic functions. (a) Show that $\sinh$ (respectively, $\cosh$) is the odd (respectively, even) part of the exponential function. Moreover, prove that $\tanh(-x)=-\tanh(x)$ for all $x\in\mathbb{R}$ and $\coth(-x)=-\coth(x)$ for all $x\in\mathbb{R}^*$. (b) Determine the derivatives of the hyperbolic functions on their domains. (c) Confirm your answer in Sage and next sketch the graphs of $\sinh$ and $\cosh$ on the interval $I=[-\pi,\pi]$.

Solution. (a) Obviously, one has the relation
$$e^x=\frac{e^x-e^{-x}}{2}+\frac{e^x+e^{-x}}{2}=\sinh(x)+\cosh(x),$$
and it is easy to see that $\sinh(-x)=-\sinh(x)$, while $\cosh(-x)=\cosh(x)$, for all $x\in\mathbb{R}$. The remaining two relations are now direct. (b) This is really easy and relies on the basic rules of differentiation (see 5.3.4), in combination with the formula $(e^x)'=e^x$ for all $x\in\mathbb{R}$. Thus:
$$(\sinh(x))'=\Big(\frac{e^x-e^{-x}}{2}\Big)'=\frac12\big(e^x+e^{-x}\big)=\cosh(x),\qquad(\cosh(x))'=\Big(\frac{e^x+e^{-x}}{2}\Big)'=\frac12\big(e^x-e^{-x}\big)=\sinh(x),$$
$$(\tanh(x))'=\Big(\frac{\sinh(x)}{\cosh(x)}\Big)'=\frac{\cosh^2(x)-\sinh^2(x)}{\cosh^2(x)}=\frac{1}{\cosh^2(x)}=1-\tanh^2(x),$$
$$(\coth(x))'=\Big(\frac{\cosh(x)}{\sinh(x)}\Big)'=\frac{\sinh^2(x)-\cosh^2(x)}{\sinh^2(x)}=-\frac{1}{\sinh^2(x)}.$$
(c) As one might expect, the hyperbolic functions are built into Sage and correspond to the commands sinh, cosh, tanh and coth. Therefore, to confirm the previous computations in Sage one can combine these functions with either the command diff or the command derivative, as follows:

show(sinh(x).derivative()); show(cosh(x).derivative())
show(tanh(x).derivative()); show(coth(x).derivative())

Let us finally present the graphs of $\sinh$ and $\cosh$, but leave the corresponding coding in Sage for practice. □
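The plotting left for practice might look as follows (a sketch; the styling choices are ours):

p = plot(sinh(x), (x, -pi, pi), color="steelblue", legend_label=r"$\sinh(x)$")
p += plot(cosh(x), (x, -pi, pi), color="darkred", legend_label=r"$\cosh(x)$")
show(p)

Note how the two graphs approach each other for large $x$, reflecting the identity $\cosh(x)-\sinh(x)=e^{-x}\to0$.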
A short computation shows that its first derivative is given by f′ (x) =    2x sin ( 1 x ) − cos ( 1 x ) , x ̸= 0 ; 0 , x = 0 . Above, the first expression follows by the chain and product rule, while to verify that f′ (0) = 0, notice that lim x→0 f(x) − f(0) x = lim x→0 x sin ( 1 x ) = 0 (since 0 ≤ x sin (1 x ) ≤ |x|). On the other hand, we see that the limit limx→0 ( 2x sin(1 x ) − cos(1 x ) ) does not exist, so f′ (x) is not continuous at the origin. For another similar example see the description in 5.3.3. Now, to present the graph of f one can apply the piecewise method described in 5.C.5. Thus we can proceed with the cell f1(x)=(x^2)*sin(1/x); f2(x)=0 S=RealSet(x==0) S1=RealSet.closed_open(-1,0) S2=RealSet.open_closed(0,1) S3=S1.union(S2) F=piecewise([[S, f2(x)], [S3, f1(x)]]) p=plot(F(x), x, -1, 1, ymin=-0.1, ymax=0.1, color="black") show(p) In this block we appropriately used the command RealSet to produce the sets {0}, [−1, 0), and (0, 1], and next the command union to introduce the set [−1, 0) ∪ (0, 1]. Below, on the left is the figure produced by this block, while on the right we see a zoomed version of the graph of f near the origin. Can you realize the changes that we did in our code to get the zoomed version? □ 5.E.86. (1) Consider the function f : R → R defined by f(0) = 0 and f(x) = x arctan ( 1 x ) for all x ∈ R\{0}. Determine whether the derivative of f exists at x0 = 0. (2) Consider the function f : R → R defined by f(−1) = 0 and f(x) = ( x2 − 1 ) sin ( 1 x + 1 ) for all x ∈ R\{−1}. Determine whether the derivative of f exits at x0 = −1. (3) Present an example of a function f : R → R which is continuous on R but does not have derivatives at the points x1 = 5 and x2 = 9. (4) Find functions f and g which are not differentiable anywhere, yet their composition f ◦ g is differentiable everywhere on 468 CHAPTER 5. ESTABLISHING THE ZOO R. (5) For x > e, determine the sign of the derivative of the function f(x) = arctan ( ln(x) −1 + ln(x) ) . ⃝ 5.E.87. If f(x) = 5x + 4 sin(x), then compute (f−1 )′ (4) by means of the formula of the derivative of the inverse function presented in 5.3.6. ⃝ 5.E.88. (a) Let P(x) be a polynomial of order n ≥ 2. Prove that P(x) is divided by (x − ρ)2 with quotient π(x), i.e., P(x) = (x − ρ)2 π(x), if only if P(ρ) = P′ (ρ) = 0. (b) Show that (x − 1)2 is a factor of the polynomial P(x) = nxn+1 − xn−1 − (n2 + 1)x + n2 − n + 2, with n ≥ 2. Solution. (a) If P(x) = (x − ρ)2 π(x) then by differentiation we get P′ (x) = 2(x − ρ)π(x) + (x − ρ)2 π′ (x). Thus P′ (ρ) = 0 = P(ρ). Conversely, assume that P(ρ) = 0 = P′ (ρ). From the first relation we get P(x) = (x − ρ)ν(x) for some polynomial ν(x), and hence P′ (x) = ν(x) + (x − ρ)ν′ (x). But then the second relation can be equivalently written as ν(ρ) + (ρ − ρ)ν′ (ρ) = 0, that is ν(ρ) = 0. Therefore ρ is a root of ν(x) and hence ν(x) = (x − ρ)π(x) for some polynomial π(x). It follows that P(x) = (x − ρ)2 π(x), as it is required. (b) We see that P(1) = n − 1 − n2 − 1 + n2 − n + 2 = 0. Moreover, the first derivative of P is given by P′ (x) = n(n + 1)xn − (n − 1)xn−2 − (n2 + 1), and thus P′ (1) = n(n + 1) − (n − 1) − n2 − 1 = n2 + n − n + 1 − n2 − 1 = 0. The claim now follows from the statement in (a). □ 5.E.89. Let P be a cubic polynomial satisfying the conditions P(0) = 1, P′ (0) = 1, P(1) = 2a + 2, P′ (1) = 5a + 1. Find the values of the parameter a ∈ R for which the polynomial P is strictly monotonic on the whole real line. ⃝ 5.E.90. Suppose that f, g : R → R are two functions differentiable at a ∈ R. 
Prove that
$$\lim_{x\to a}\frac{x^2 f(a) - a^2 f(x)}{x - a} = 2af(a) - a^2 f'(a)\,,$$
and, more generally,
$$\lim_{x\to a}\frac{g(x)f(a) - g(a)f(x)}{x - a} = g'(a)f(a) - f'(a)g(a)\,.$$

Solution. For the first limit we have
$$\lim_{x\to a}\frac{x^2 f(a) - a^2 f(x)}{x - a} = \lim_{x\to a}\frac{x^2 f(a) - a^2 f(x) - a^2 f(a) + a^2 f(a)}{x - a} = \lim_{x\to a}\frac{(x^2 - a^2)f(a) - a^2\big(f(x) - f(a)\big)}{x - a}$$
$$= f(a)\lim_{x\to a}\frac{x^2 - a^2}{x - a} - a^2\lim_{x\to a}\frac{f(x) - f(a)}{x - a} = f(a)\lim_{x\to a}(x + a) - a^2 f'(a) = 2af(a) - a^2 f'(a)\,.$$
For the general case we apply the same trick:
$$\lim_{x\to a}\frac{g(x)f(a) - g(a)f(x)}{x - a} = \lim_{x\to a}\frac{\big(g(x) - g(a)\big)f(a) - g(a)\big(f(x) - f(a)\big)}{x - a} = f(a)\lim_{x\to a}\frac{g(x) - g(a)}{x - a} - g(a)\lim_{x\to a}\frac{f(x) - f(a)}{x - a} = f(a)g'(a) - g(a)f'(a)\,. \qquad \square$$

5.E.91. Let $f : \mathbb{R} \to \mathbb{R}$ be a function differentiable at $0$, satisfying $f(x)f(y) \ne 1$ for all $x, y \in \mathbb{R}$ and
$$f(x + y)\big(1 - f(x)f(y)\big) = f(x) + f(y)\,, \quad x, y \in \mathbb{R}\,.$$
Show that $f$ is differentiable on the whole real line. ⃝

5.E.92. Interactive tangent lines via Sage. For the polynomial $P(x) = x^4 - 2x$ use Sage to produce an interactive graph for the tangent line of $P$ at a movable point $x_0$ lying in the interval $[-2, 2]$. (Hints: Combine the routine constructed in 5.C.14 with the method presented in ?? for creating an interactive plot.)

Solution. Let us use the routine introduced in 5.C.14 for the construction of the tangent line and combine this with the commands @interact and slider, which are responsible for the creation and the control of the interactive environment, respectively. In particular, the command @interact always appears on a separate line, usually right before the def command that starts our subroutine. On the other hand, the slider command converts the $x_0$ input from a numerical input into a slider. The coding goes as follows:
P(x)=x^4-2*x; P1(x)=diff(P(x), x)
@interact
def Tangentat(x_0 = slider(-2, 2, 0.1, -1.5, label="x-coordinate")):
    y_0=P(x_0)
    m=P1(x_0)
    c=y_0 - m*x_0
    l(x)=m*x+c
    Q = plot(P(x), -2, 2, color="blue", ymin=-2, ymax=20)
    Q+=plot(l(x), -2, 2, color="black", ymin=-2, ymax=20)
    Q+=point((x_0, y_0), color="black", size=50)
    Q.show()
    show("x_0=", x_0)
    show("tang. line: l(x) = (", m, ")*x + (", c, ")")
Recall that inside the command slider the first two parameters determine the interval where the variable $x_0$ lives, in our case $-2 \le x_0 \le 2$. The third parameter determines the granularity of the slider; our choice represents a movement with step $0.1$. Notice that in CoCalc.com one should call the subroutine by typing Tangentat() at the very end of the block presented above (without specifying $x_0$ inside the parentheses). However, in https://sagecell.sagemath.org this is not necessary and one can simply move the slider around. Test Sage's output yourself. □

5.E.93. Consider the functions $f(x) = 2 + \ln(x - 1)$ and $g(x) = 2 - \ln(x - 1)$, with $x \in A = (1, +\infty)$. (a) Use Sage to find the common points of the graphs of $f$ and $g$. (b) Deduce the monotonicity of $f$, $g$, and show that $x = 1$ is an asymptote without slope (vertical asymptote) of both $C_f$ and $C_g$. Moreover, deduce that $f(A) = g(A) = \mathbb{R}$. (c) Show that the tangent lines of $f$, $g$ at points with the same $y$-coordinate are perpendicular.

Solution. (a) This can be done by solving the equation $f(x) = g(x)$, which is equivalent to $\ln(x - 1) = 0$ and has $x = 2$ as a unique solution. Then we see that $f(2) = g(2) = 2$, hence the common point of $C_f$ and $C_g$ is the point $P = [2, 2]$.
A solution via Sage occurs by the cell
f(x)=2+ln(x-1); g(x)=2-ln(x-1); solve(f(x)==g(x), x)
(b) We see that $f'(x) = \frac{1}{x-1} > 0$ and $g'(x) = -\frac{1}{x-1} < 0$ for all reals $x > 1$. Thus $f$ is strictly increasing on $A$, while $g$ is strictly decreasing on $A$. Moreover, you can compute
$$a = \lim_{x\to 1^+} f(x) = -\infty\,, \quad b = \lim_{x\to\infty} f(x) = +\infty\,, \quad c = \lim_{x\to 1^+} g(x) = +\infty\,, \quad d = \lim_{x\to\infty} g(x) = -\infty\,.$$
These limits can be confirmed in Sage, as usual:
f(x)=2+ln(x-1); g(x)=2-ln(x-1)
a=lim(f(x), x=1, dir="right"); b=lim(f(x), x=oo)
c=lim(g(x), x=1, dir="right"); d=lim(g(x), x=oo)
show(a, ",", b, ",", c, ",", d)
Both functions $f$, $g$ are continuous, hence by the limits above we deduce that $f(A) = \mathbb{R}$ and $g(A) = \mathbb{R}$. Moreover, the limits $a$, $c$ presented above show that the graphs of $f$ and $g$ tend toward $\mp\infty$ as the inputs approach $1$, and hence the line $x = 1$ is a vertical asymptote for both $C_f$ and $C_g$ (we will discuss more details on asymptotes in 6.1.8). To illustrate the behaviour of $f$, $g$, we used Sage to sketch the graphs $C_f$, $C_g$ for $1 \le x \le 50$, which we present together here:

(c) To get points on $C_f$, $C_g$ with the same $y$-coordinate it is sufficient to draw horizontal lines $y = c$, $c \in \mathbb{R}$; this works due to the type of monotonicity of $f$ and $g$. Let $Q$, $R$ be the corresponding intersection points of the line $y = c$ with $C_f$ and $C_g$, respectively. We may assume that $Q = [a, f(a)] = [a, c]$ and $R = [b, g(b)] = [b, c]$. Notice that for $a = 2 = b$ we have $P = Q = R$. To get a better view around the intersection point of $C_f$ and $C_g$, see the figure below (on the l.h.s.). In this figure, to be more precise, we fix the points $Q$, $R$ to be elements of the line $y = 2.5$ (hence one can specify $Q$, $R$ explicitly, as we did); from the mathematical point of view it is not necessary to fix $c$. At the points $Q = [a, c] \in C_f$ and $R = [b, c] \in C_g$ we have $f'(a) = \frac{1}{a-1}$ and $g'(b) = -\frac{1}{b-1}$, and to prove the claim it is sufficient to show that $f'(a)g'(b) = -1$. Because $f(a) = c = g(b)$ we get $a - 1 = e^{c-2}$, $b - 1 = e^{2-c}$, and the result follows (for an illustration see the picture on the r.h.s., which includes the tangent lines of $C_f$, $C_g$ at the points $Q$, $R$, respectively):
$$f'(a)g'(b) = -\frac{1}{(a-1)(b-1)} = -\frac{1}{e^{c-2}\,e^{2-c}} = -\frac{1}{e^0} = -1\,.$$
For your convenience, let us finally present the code used to construct the figure including the tangent lines.
f(x)=2+ln(x-1); g(x)=2-ln(x-1)
p=plot(f(x), x, 1, 5, rgbcolor=(0.2,0.3,0.6))
p+=plot(g(x), x, 1, 5, rgbcolor=(0.6,0.3,0.2))
p+=point((2, 2), color="black", size=30)
p+=text(r"$f(x)$", (4.8, 3.55), rgbcolor=(0.2,0.3,0.6), fontsize=13)
p+=text(r"$g(x)$", (4.8, 0.45), rgbcolor=(0.6,0.3,0.2), fontsize=13)
p+=text(r"$P$", (2.35, 2.1), color="black", fontsize=12)
p+=line([(0.5, 1), (4.5, 1)], rgbcolor=(0.2,0.2,0.2), linestyle="--")
p+=line([(0.5, 2.5), (4.5, 2.5)], rgbcolor=(0.2,0.2,0.2), linestyle="--")
p+=line([(0.5, 2), (4.5, 2)], rgbcolor=(0.2,0.2,0.2), linestyle="--")
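# (the plot continues below: we mark the points Q and R used in part (c)
#  and draw the dashed tangent lines of C_f and C_g at these points)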
p+=point((e^(1/2) + 1, f(e^(1/2) + 1)), color="black", size=30)
p+=text(r"$Q$", (e^(1/2) + 1, f(e^(1/2) + 1)+0.2), color="black", fontsize=13)
p+=point(((e^(1/2) + 1)*e^(-1/2), g((e^(1/2) + 1)*e^(-1/2))), color="black", size=30)
p+=text(r"$R$", ((e^(1/2) + 1)*e^(-1/2)+0.05, g((e^(1/2) + 1)*e^(-1/2))+0.2), color="black", fontsize=13)
p+=text(r"$y=2.5$", (4.7, 2.65), color="black", fontsize=12)
p+=text(r"$y=2$", (4.65, 2.15), color="black", fontsize=12)
p+=point(((e + 1)*e^(-1), f((e + 1)*e^(-1))), color="black", size=30)
p+=point((e + 1, g(e + 1)), color="black", size=30)
p+=text(r"$y=1$", (4.65, 1.15), color="black", fontsize=13)
xQ=e^(1/2) + 1
tangf=f(xQ)+diff(f, x)(x=xQ)*(x-xQ)
p+=plot(tangf, x, 1, 5, rgbcolor=(0.2,0.2,0.2), linestyle="--")
xR=(e^(1/2) + 1)*e^(-1/2)
tangg=g(xR)+diff(g, x)(x=xR)*(x-xR)
p+=plot(tangg, x, 1, 5, rgbcolor=(0.2,0.2,0.2), linestyle="--")
p.show(ymin=-1, ymax=5, ticks=[1, 1], aspect_ratio=1)
□

5.E.94. Consider the function $f(x) = e^{\sqrt{\alpha}\cos(x)}$, where $\alpha > 0$ is some fixed constant. Provide a proof of the following facts: (a) $f$ is continuous, even, and periodic with period independent of $\alpha$; sketch the graph of $f$ for $\alpha = 3$ and $x \in [-2\pi, 2\pi]$. (b) $f$ is differentiable for all $x \in \mathbb{R}$; find the formula of $f'(x)$. (c) $f$ is strictly decreasing in $(0, \pi)$ and strictly increasing in $(\pi, 2\pi)$; in particular, the point $x_0 = \pi$ provides a local minimum of $f$.

Solution. (a) The domain of $f$ is the whole real line $\mathbb{R}$. Since $f$ is a composition of continuous functions, it is also continuous. It is even since $\cos(-x) = \cos(x)$, which implies that $f(-x) = f(x)$ for all $x \in \mathbb{R}$. Suppose that $T \in \mathbb{R}\setminus\{0\}$ satisfies $f(x + T) = f(x)$ for all $x \in \mathbb{R}$. Since the exponential map is an injection, this gives $\sqrt{\alpha}\cos(x + T) = \sqrt{\alpha}\cos(x)$, that is, $\cos(x + T) = \cos(x)$ for all $x \in \mathbb{R}$. Thus $x + T = 2k\pi + x$ or $x + T = 2k\pi - x$, which implies that $T = 2k\pi$ or $T = 2k\pi - 2x$, for some $k \in \mathbb{Z}$. The second possibility cannot hold for all $x$ simultaneously, so $f$ is periodic with period $2k\pi$; for $k = 1$ we obtain $T = 2\pi$, which is the smallest positive period of $f$. A visual verification of this fact occurs by sketching the graph of $f$ for some fixed $\alpha$; below we present the case $\alpha = 3$. (b) The function $f$ is differentiable on $\mathbb{R}$ as a composition of differentiable functions. By the chain rule we see that
$$f'(x) = \big(\sqrt{\alpha}\cos(x)\big)'\, e^{\sqrt{\alpha}\cos(x)} = -\sqrt{\alpha}\sin(x)\, e^{\sqrt{\alpha}\cos(x)}\,, \quad x \in \mathbb{R}\,.$$
Or we can compute this in Sage (writing a for $\alpha$), via the cell
a=var("a"); f(x)=e**(sqrt(a)*cos(x)); show(diff(f, x))
(c) In $[0, 2\pi]$ the equation $f'(x) = 0$ has three solutions, namely the points $x = 0$, $x = \pi$ and $x = 2\pi$; see also the figure on the r.h.s., where we have included the graph of $f'(x)$. Moreover, we see that $f'(x) < 0$ for all $x \in (0, \pi)$ and $f'(x) > 0$ for all $x \in (\pi, 2\pi)$. It follows that $f$ is decreasing in the first interval and increasing in the second one. Hence $x_0 = \pi$ is a local minimum of $f$, giving the minimal value $f(\pi) = e^{-\sqrt{\alpha}}$ (the other two stationary points are local maxima of $f$ in $[0, 2\pi]$). □

5.E.95. (a) Determine all local extrema of the function $f(x) = x\ln^2(x)$ with domain $(0, +\infty)$. (b) Determine the maximum value of the function $f(x) = e^{-x}\cdot\sqrt[3]{3x}$, with $x \in \mathbb{R}$. (c) Find the absolute extrema of the polynomial $p(x) = x^3 - 3x + 2$ on the interval $[-3, 2]$. (d) Is there any $\alpha \in \mathbb{R}$ such that the function $h(x) = \alpha x + \sin(x)$, $x \in [0, 2\pi]$, has a global minimum at $x_0 = 5\pi/4$? ⃝

5.E.96. Show that the function $f(x) = e^{-x}$ is strictly decreasing for all $x \in \mathbb{R}$. ⃝
5.E.97. Find all the solutions of the equation $x^{2025} + 2024x = 2025 - \ln(x)$. ⃝

5.E.98. Let $I = [a, b] \subset \mathbb{R}$ be a closed interval and let $f : I \to \mathbb{R}$ be a continuous function. Suppose that $f$ is differentiable on the open interval $(a, b)$ and satisfies $f(a) = f(b) = 0$. Show that for given $\kappa \in \mathbb{R}$ there exists $x_0 \in (a, b)$ such that $\kappa f(x_0) + f'(x_0) = 0$.

Solution. For fixed $\kappa \in \mathbb{R}$ consider the function $g(x) = e^{\kappa x} f(x)$, with $x \in [a, b]$. By assumption $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, and so is $g$. We see that
$$g'(x) = \kappa\, e^{\kappa x} f(x) + e^{\kappa x} f'(x) = e^{\kappa x}\big(\kappa f(x) + f'(x)\big)\,, \quad x \in (a, b)\,. \qquad (*)$$
Moreover, $g(a) = 0 = g(b)$, and by Rolle's theorem (see 5.3.8) one can find some $x_0 \in (a, b)$ such that $g'(x_0) = 0$. The result is then a simple consequence of $(*)$. □

5.E.99. Let $f : [0, 4] \to \mathbb{R}$ be a differentiable function with $f(0) = f(4) = 0$, and set $g(x) := \frac{x(4-x)}{4} f(2)$, for any $x \in [0, 4]$. Show that there exist $x_1, x_2 \in (0, 4)$ with $x_1 \ne x_2$ satisfying $f'(x_1) = g'(x_1)$ and $f'(x_2) = g'(x_2)$.

Solution. The functions $f$, $g$ are both differentiable on $[0, 4]$, and so is their difference $h(x) := f(x) - g(x) = f(x) - \frac{x(4-x)}{4} f(2)$. Moreover, we see that $h(0) = f(0) - g(0) = 0$, $h(4) = f(4) - g(4) = 0$ and $h(2) = f(2) - \frac{2\cdot 2}{4} f(2) = 0$. Thus $h$ satisfies the conditions of Rolle's theorem on the intervals $[0, 2]$ and $[2, 4]$. This means that there exists $x_1 \in (0, 2)$ such that $h'(x_1) = 0$, which is equivalent to $f'(x_1) = g'(x_1)$, and $x_2 \in (2, 4)$ such that $h'(x_2) = 0$, that is, $f'(x_2) = g'(x_2)$. □

5.E.100. Suppose that $f : I \to \mathbb{R}$ is a continuous function on $I = [a, b]$, differentiable on $(a, b)$, and such that $\lim_{x\to a} f'(x) = \kappa \in \mathbb{R}$. Show that $f$ is also differentiable at $x = a$ with $f'(a) = \kappa$.

Solution. We can apply the mean value theorem to $f$ on the interval $[a, a + h]$ for small enough $h > 0$. Hence there exists $c_h \in (a, a + h)$ such that
$$f'(c_h) = \frac{f(a + h) - f(a)}{h}\,. \qquad (*)$$
Moreover, we see that $c_h \to a$ as $h \to 0$, and hence
$$f'(a) = \lim_{h\to 0}\frac{f(a + h) - f(a)}{h} \overset{(*)}{=} \lim_{h\to 0} f'(c_h) = \lim_{x\to a} f'(x) = \kappa\,. \qquad \square$$

5.E.101. Show that the polynomial $P(x) = x^5 - x^4 + 2x^3 - x^2 + x + 1$ has exactly one real root. Next use Sage to approximate this root.

Solution. Any polynomial of odd degree has at least one real root. Indeed, the polynomial at hand has degree 5, hence by the fundamental theorem of algebra (see 12.4.20) $P$ has five roots over $\mathbb{C}$. Moreover, the non-real roots of a polynomial with real coefficients come in conjugate pairs. Therefore $P$ must have at least one real root, say $x_0$. Suppose now that there exists a second real root $x_0'$. Then, according to Rolle's theorem, there must be some $c \in (x_0, x_0')$ such that $P'(c) = 0$. But
$$P'(x) = 5x^4 - 4x^3 + 6x^2 - 2x + 1 = 2x^2(x - 1)^2 + 3x^4 + 3x^2 + (x - 1)^2 > 0$$
for all $x \in \mathbb{R}$, a contradiction. Therefore $P$ admits exactly one real root. In other words, $P$ is strictly increasing and thus its graph crosses the $x$-axis just once. From the graph of $P$ given below we see that the unique root of $P$ should be near $-0.5$. In Sage we can obtain this root with the find_root function, as follows:
find_root(x^5-x^4+2*x^3-x^2+x+1, -1, 1)
According to Sage's answer, the unique root of $P$ is $x_0 \approx -0.47748$. □

5.E.102. Let $f : \mathbb{R} \to (0, +\infty)$ be a continuously differentiable function with $f(0) \ne 0$. Prove that there exists $\xi \in (0, 1)$ satisfying the relation
$$e^{f'(\xi)}\, f(0)^{f(\xi)} = f(1)^{f(\xi)}\,.$$

Solution. Let us rewrite the equation as $e^{f'(\xi)} = \Big(\frac{f(1)}{f(0)}\Big)^{f(\xi)}$.
By applying the logarithm to both sides we get
$$f'(\xi) = \ln\Big(\frac{f(1)}{f(0)}\Big)^{f(\xi)} \iff f'(\xi) = f(\xi)\ln\Big(\frac{f(1)}{f(0)}\Big) \iff \frac{f'(\xi)}{f(\xi)} = \ln(f(1)) - \ln(f(0))\,.$$
It is now obvious that the existence of such a $\xi$ is guaranteed by applying the Lagrange mean value theorem to the function $h(x) = \ln(f(x))$, which is continuous on $[0, 1]$ and differentiable at all points inside this interval. □

5.E.103. Show that any real $x \ge 0$ satisfies $x \ge \ln(x + 1)$. ⃝

5.E.104. Let $\alpha$, $\beta$ be reals satisfying $0 < \alpha < \beta$. Show that $\alpha x^\beta - \beta x^\alpha > \alpha - \beta$, for all $x > 1$. ⃝

5.E.105. Show that all $x \in \mathbb{R}$ satisfy $e^x - x \ge 1$.

Solution. Consider the function $f(x) = e^x - x + 1$. This is differentiable in $\mathbb{R}$ with $f'(x) = e^x - 1$. We see that the equation $f'(x) = 0$ has a unique solution, given by $x = 0$. In particular, having in mind the graph of $e^x$, we see that $e^x < 1$ for $x < 0$, so $f'(x) < 0$ for $x \in (-\infty, 0)$, and $e^x > 1$ for $x > 0$, which means that $f'(x) > 0$ for $x \in (0, +\infty)$. Hence $f$ is strictly decreasing in the first interval and strictly increasing in the second one, see the figure below. Thus, for $x > 0$ we have $f(x) > f(0)$, so $e^x - x + 1 > 2$, that is, $e^x - x > 1$; for $x < 0$ we again have $f(x) > f(0)$ and so $e^x - x + 1 > 2$. So for $x \ne 0$ we get $e^x - x > 1$. Since for $x = 0$ this holds as an equality, we have finally proved $e^x - x \ge 1$ for all $x \in \mathbb{R}$. Remark: The graphs of $f$, $g$ sketched below indicate that one may use the function $g(x) = e^x - x - 1$ instead of $f$ and apply the same procedure. Try to verify this claim. □

5.E.106. For $x > y > 0$ prove the inequality
$$\frac{x + y}{2} > \frac{x - y}{\ln(x) - \ln(y)}\,.$$

Solution. Rewrite the given inequality as $\frac{\ln(x) - \ln(y)}{x - y} > \frac{2}{x + y}$ and set $t = x - y > 0$. Then we get
$$\frac{\ln(t + y) - \ln(y)}{t} > \frac{2}{t + 2y} \iff \frac{\ln\big(\frac{t+y}{y}\big)}{t} > \frac{2}{t + 2y} \iff \ln\Big(1 + \frac{t}{y}\Big) > \frac{2t}{t + 2y} \iff \ln\Big(1 + \frac{t}{y}\Big) > \frac{2\,\frac{t}{y}}{2 + \frac{t}{y}}\,.$$
Putting now $\psi = \frac{t}{y}$, we have $\psi > 0$ and the initial inequality can be equivalently transformed to
$$\ln(1 + \psi) > \frac{2\psi}{2 + \psi}\,, \quad \psi > 0\,. \qquad (\flat)$$
Thus now you may consider the function $f(\psi) = \ln(1 + \psi) - \frac{2\psi}{2 + \psi}$, with $\psi > -1$. Its first derivative is given by
$$f'(\psi) = \frac{1}{\psi + 1} - \frac{2(2 + \psi) - 2\psi}{(2 + \psi)^2} = \frac{\psi^2}{(\psi + 1)(2 + \psi)^2}\,,$$
which is positive for all $\psi > -1$ except $\psi = 0$. Hence $f$ is strictly increasing on $(-1, +\infty)$, and in particular for $\psi > 0$ we have $f(\psi) > f(0) = 0$, which gives us the inequality presented in $(\flat)$. □

5.E.107. If $p, q \in \mathbb{R}$ are such that $0 < p < q$, show that
$$\ln\Big(\frac{p + q}{2}\Big) < \frac{p}{p + q}\ln(p) + \frac{q}{p + q}\ln(q)\,. \qquad (\sharp)$$

Solution. Consider the 1-parameter family $f_q(x) = \ln\big(\frac{x+q}{2}\big) - \frac{x}{x+q}\ln(x) - \frac{q}{x+q}\ln(q)$, with $q > 0$ and domain $A = (0, +\infty)$. Notice that $f_q(q) = 0$. We will prove that $f_q$ is strictly increasing on $(0, q]$, which for $0 < p < q$ gives the inequality $f_q(p) < f_q(q)$; it is then easy to see that the latter is equivalent to $(\sharp)$. Therefore let us compute the first derivative of $f_q$:
$$f_q'(x) = \frac{1}{x + q} - \frac{q\ln(x)}{(x + q)^2} - \frac{1}{x + q} + \frac{q\ln(q)}{(x + q)^2} = \frac{q\big(\ln(q) - \ln(x)\big)}{(x + q)^2} = \frac{q\ln\big(\frac{q}{x}\big)}{(x + q)^2}\,.$$
The critical points of $f_q$ are the solutions of $f_q'(x) = 0$, and we see that
$$f_q'(x) = 0 \iff \ln\Big(\frac{q}{x}\Big) = 0 \iff \frac{q}{x} = 1 \iff x = q\,.$$
Now, for $x > q$ we have $\ln\big(\frac{q}{x}\big) < 0$, hence $f_q'(x) < 0$ and $f_q$ is strictly decreasing on $[q, +\infty)$, for all $q > 0$. On the other hand, for $0 < x < q$ we get $f_q'(x) > 0$, therefore $f_q$ is strictly increasing on $(0, q]$, for all $q > 0$. Thus, in fact, at $x_0 = q$ the function $f_q$ attains its unique maximum, which equals $f_q(q) = 0$ (hence $f_q$ is non-positive on all of its domain, for all $q > 0$).
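Before plotting, one may let Sage confirm the formula for $f_q'(x)$ obtained above; a small verification cell reads:
var("x, q")
f=ln((x+q)/2)-(x/(x+q))*ln(x)-(q/(x+q))*ln(q)
show(diff(f, x).factor())  # should simplify to q*(log(q)-log(x))/(x+q)^2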
Let us finally sketch the graph of $f_q$ for $q = 1, 2, \ldots, 5$, so that we have an illustrated confirmation of the monotonicity of $f_q$ for these values of the parameter $q$. To obtain this figure we have used Sage and a bit of programming (relying on the command for), which goes as follows:
var("q"); f(x, q)=ln((x+q)/2)-(x/(x+q))*ln(x)-(q/(x+q))*ln(q)
a=plot([f(x, q) for q in [1,2,..,5]], (x, 0, 10), ymax=0.05, ymin=-0.4)
a+=point([(q, f(q, q)) for q in [1,2,..,5]], size=20, color="black")
a+=text(r"$q=1$", (1, 0+0.02), fontsize=12, color="black")
a+=text(r"$q=2$", (2, 0+0.02), fontsize=12, color="black")
a+=text(r"$q=3$", (3, 0+0.02), fontsize=12, color="black")
a+=text(r"$q=4$", (4, 0+0.02), fontsize=12, color="black")
a+=text(r"$q=5$", (5, 0+0.02), fontsize=12, color="black"); show(a)
□

When applying l'Hospital's rule, caution is always necessary. For instance, l'Hospital's rule can yield a non-existent limit even when the original limit exists. Let us illustrate this remarkable scenario with an example.

5.E.108. Consider the function $f(x) = \frac{x + \sin x}{x}$. (a) Which type of indeterminate form corresponds to the limit $\lim_{x\to+\infty} f(x)$? (b) Show that in this case l'Hospital's rule leads to a non-existing limit, although $\lim_{x\to+\infty} f(x) = 1$. ⃝

5.E.109. Using l'Hospital's rule, provide an alternative proof of the relation $\lim_{n\to+\infty}\sqrt[n]{n} = 1$ (see 5.B.8). Moreover, apply the same method to prove that $\lim_{n\to+\infty}\sqrt[n]{\ln(n)} = 1$ (see 5.E.18), and that $\lim_{n\to+\infty}\big(n\sin(\frac{1}{n})\big) = 1$.

Solution. Recall that $\sqrt[n]{n} = n^{1/n}$. Thus, as $n \to +\infty$ this gives the indeterminate form $(+\infty)^0$. Consider the function $f(x) = x^{1/x}$ with $x > 0$. Taking the logarithm of both sides we get $\ln(f(x)) = \ln(x^{1/x}) = \frac{1}{x}\ln(x)$, thus $f(x) = e^{\frac{\ln(x)}{x}}$. Notice that by l'Hospital's rule we get
$$\lim_{x\to+\infty}\frac{\ln(x)}{x} \overset{\frac{+\infty}{+\infty}}{=} \lim_{x\to+\infty}\frac{1}{x} = 0\,.$$
Therefore
$$\lim_{x\to+\infty} f(x) = \lim_{x\to+\infty} e^{\frac{\ln(x)}{x}} = e^{\lim_{x\to+\infty}\frac{\ln(x)}{x}} = e^0 = 1\,,$$
and applying this to our sequence we get the result: $\lim_{n\to+\infty}\sqrt[n]{n} = \lim_{n\to+\infty} n^{\frac{1}{n}} = \lim_{n\to+\infty} e^{\frac{\ln(n)}{n}} = 1$. Let us use l'Hospital's rule to prove also that $\lim_{n\to+\infty}\sqrt[n]{\ln(n)} = 1$, and leave the third limit for practice. Since $\sqrt[n]{\ln(n)} = (\ln(n))^{\frac{1}{n}}$, the limit again has the indeterminate form $(+\infty)^0$. So let us consider the function $g(x) = (\ln(x))^{\frac{1}{x}}$. Then we can write $g(x) = e^{\frac{1}{x}\ln(\ln(x))}$. Hence, to compute $\lim_{x\to+\infty} g(x)$ it is sufficient to compute $\lim_{x\to+\infty}\frac{\ln(\ln(x))}{x}$. This has the indeterminate form $\frac{+\infty}{+\infty}$, so by applying l'Hospital's rule we get
$$\lim_{x\to+\infty}\frac{\ln(\ln(x))}{x} = \lim_{x\to+\infty}\frac{\big(\ln(\ln(x))\big)'}{x'} = \lim_{x\to+\infty}\frac{1}{x\ln(x)} = 0\,.$$
Therefore $\lim_{x\to\infty} g(x) = e^0 = 1$, and this implies that $\lim_{n\to+\infty}\sqrt[n]{\ln(n)} = 1$. □

5.E.110. Use l'Hospital's rule to evaluate the limit $\lim_{x\to 0}\big(\cot(x) - \frac{1}{x}\big)$.

Solution. Since $\cot(x) = \frac{\cos(x)}{\sin(x)}$, by applying l'Hospital's rule repeatedly one has
$$\lim_{x\to 0}\Big(\cot(x) - \frac{1}{x}\Big) = \lim_{x\to 0}\frac{x\cos(x) - \sin(x)}{x\sin(x)} \overset{\frac{0}{0}}{=} \lim_{x\to 0}\frac{\big(x\cos(x) - \sin(x)\big)'}{\big(x\sin(x)\big)'} = \lim_{x\to 0}\frac{-x\sin(x)}{\sin(x) + x\cos(x)}$$
$$\overset{\frac{0}{0}}{=} \lim_{x\to 0}\frac{\big(-x\sin(x)\big)'}{\big(\sin(x) + x\cos(x)\big)'} = \lim_{x\to 0}\frac{-\sin(x) - x\cos(x)}{\cos(x) + \cos(x) - x\sin(x)} = \frac{0 - 0}{1 + 1 - 0} = 0\,. \qquad \square$$

5.E.111. Using l'Hospital's rule, where adequate, determine the limits
$$a = \lim_{x\to 1^-}(1 - x)\tan\Big(\frac{\pi x}{2}\Big)\,, \quad b = \lim_{x\to\frac{\pi}{2}^-}\Big(\frac{\pi}{2} - x\Big)\tan(x)\,, \quad c = \lim_{x\to 1}\Big(\frac{1}{2\ln(x)} - \frac{1}{x^2 - 1}\Big)\,, \quad d = \lim_{x\to+\infty}\Big(\cos\frac{2}{x}\Big)^{x^2}\,. \quad ⃝$$
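After working such limits out by hand, they can be double-checked in Sage with the limit command; for the four limits above one possible cell is:
show(limit((1-x)*tan(pi*x/2), x=1, dir="minus"))
show(limit((pi/2-x)*tan(x), x=pi/2, dir="minus"))
show(limit(1/(2*ln(x)) - 1/(x^2-1), x=1))
show(limit(cos(2/x)^(x^2), x=oo))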
Next we present a result which is analogous to l'Hospital's rule and which can be used to establish the monotonicity of a ratio $f/g$ of two functions $f, g : [a, b] \to \mathbb{R}$, both differentiable on $(a, b)$, with $g'(x) \ne 0$ for all $x \in (a, b)$. In the literature it is commonly referred to as the "l'Hopital Monotone Rule", or "LMR" for short, and despite its numerous applications in various contexts it often seems to be overlooked. Notice that the rule holds true even in cases where $a$ or $b$ is infinite, and in particular when $f$, $g$ are defined only on the open interval $(a, b)$ (provided that $g'$ has constant sign).

5.E.112. Suppose that $a$, $b$ are such that $-\infty \le a < b \le \infty$ and let $f, g : (a, b) \to \mathbb{R}$ be continuous functions which are differentiable on $(a, b)$. Suppose also that $\lim_{x\to a^+} f(x) = \lim_{x\to a^+} g(x) = 0$ (or $\lim_{x\to b^-} f(x) = \lim_{x\to b^-} g(x) = 0$), and that the derivative $g'$ is nonzero and does not change sign on $(a, b)$. Prove that if $f'/g'$ is (strictly) increasing or (strictly) decreasing on $(a, b)$, then so is $f/g$.

Solution. We will analyze the case with $\lim_{x\to a^+} f(x) = \lim_{x\to a^+} g(x) = 0$; the case $f(b^-) = g(b^-) = 0$ is treated similarly. Assume also that $f'/g'$ is strictly increasing; the proof in the remaining cases is analogous. We will prove that
$$\Big(\frac{f}{g}\Big)'(x) > 0\,, \quad \text{for all } x \in (a, b)\,,$$
and hence that the ratio $f/g$ is strictly increasing. Fix some arbitrary $x \in (a, b)$ and consider the function $h_x : (a, b) \to \mathbb{R}$, defined by
$$h_x(y) = f'(x)g(y) - f(y)g'(x)\,, \quad y \in (a, b)\,.$$
Obviously, $h_x$ is differentiable on $(a, b)$ (and hence also continuous), with
$$h_x'(y) = f'(x)g'(y) - f'(y)g'(x) = g'(x)g'(y)\Big(\frac{f'(x)}{g'(x)} - \frac{f'(y)}{g'(y)}\Big)\,. \qquad (*)$$
By the hypothesis, $g'$ is non-zero and has constant sign, that is, only one of the following is possible: $g'(x) > 0$ for all $x \in (a, b)$, or $g'(x) < 0$ for all $x \in (a, b)$; in both cases $g'(x)g'(y) > 0$. Moreover, $f'/g'$ is strictly increasing, so for $x > y$ the inequality $\frac{f'(x)}{g'(x)} > \frac{f'(y)}{g'(y)}$ holds. Therefore, by $(*)$ we deduce that $h_x'(y) > 0$ for all $y \in (a, x)$, which means that the function $h_x$ is strictly increasing on $(a, x)$; in fact, by continuity, $h_x$ is strictly increasing on $(a, x]$. Moreover, by our hypothesis we get
$$\lim_{y\to a^+} h_x(y) = \lim_{y\to a^+}\big(f'(x)g(y) - f(y)g'(x)\big) = 0\,,$$
and hence, all together, this implies that $h_x(x) > 0$. Thus
$$\Big(\frac{f}{g}\Big)'(x) = \frac{f'(x)g(x) - f(x)g'(x)}{g^2(x)} = \frac{h_x(x)}{g^2(x)} > 0\,,$$
and since $x \in (a, b)$ is arbitrary, our claim follows. □

5.E.113. (a) Based on the l'Hopital monotone rule, study the monotonicity of $\kappa = f/g$ on the specified domain $A$, when $f$, $g$, $A$ are given as follows: (1) $f(x) = e^{2x} - 1$, $g(x) = \ln(x + 1)$, $A = (-1, 0)\cup(0, +\infty)$; (2) $f(x) = e^x - 1$, $g(x) = 2x + e^x - 1$, $A = (-\infty, 0)\cup(0, \infty)$. Then confirm your conclusions by specifying the sign of the first derivative $\kappa'$ of $\kappa = f/g$, and rely on Sage to verify all your computations. (b) For both cases confirm that $\lim_{x\to 0}\kappa(x) = \lim_{x\to 0} K(x)$, where $K(x) = f'(x)/g'(x)$.

Solution. (a) Let us begin with (1). We have
$$\kappa(x) = \frac{f(x)}{g(x)} = \frac{e^{2x} - 1}{\ln(x + 1)}\,, \quad \text{and} \quad K(x) = \frac{f'(x)}{g'(x)} = 2(x + 1)\,e^{2x}\,,$$
where $\kappa$ is defined on $A = (-1, 0)\cup(0, +\infty)$ (and we assume the same for both $f$, $g$). Moreover, the given functions satisfy $\lim_{x\to 0^+} f(x) = \lim_{x\to 0^+} g(x) = 0$, hence one can apply the LMR (5.E.112). A direct computation shows that
$$K'(x) = 2\,e^{2x}(3 + 2x) > 0\,, \quad \text{for all } x \in A\,.$$
Because $K = f'/g'$ is differentiable, this implies that $K$ strictly increases on $A$, and according to the LMR the same monotonicity pattern should hold for the function $\kappa = f/g$.
To confirm this result, and because $\kappa$ is differentiable on $A$, we use its first derivative:
$$\kappa'(x) = \Big(\frac{e^{2x} - 1}{\ln(x + 1)}\Big)' = \frac{2\,e^{2x}\ln(x + 1) - (e^{2x} - 1)\cdot\frac{1}{x+1}}{(\ln(x + 1))^2} = \frac{2(x + 1)\,e^{2x}\ln(x + 1) - e^{2x} + 1}{(x + 1)(\ln(x + 1))^2}\,.$$
To determine the sign of $\kappa'(x)$ it is sufficient to determine the sign of the numerator $h(x) := 2(x + 1)\,e^{2x}\ln(x + 1) - e^{2x} + 1$. A direct computation shows that $h'(x) = 2\,e^{2x}\ln(x + 1)(2x + 3)$, so $h'(x) > 0$ on $(0, +\infty)$ while $h'(x) < 0$ on $(-1, 0)$ (there $\ln(x + 1) < 0$). Since $h(0) = 0$, in both cases we deduce that $h(x) > 0$ for all $x \in A$, which implies that $\kappa'(x) > 0$. As a side remark, observe that it is easier to deduce that the ratio $K = f'/g'$ is increasing than to determine the sign of $\kappa' = (f/g)'$. Now let us perform the computations in Sage, including the necessary explanations within the code (since the domain A is a continuum, the code spot-checks the claimed signs at sample points of A):
f(x)=e^(2*x)-1; g(x)=ln(x+1); k(x)=f(x)/g(x) #declare the ratio k=f/g
A=RealSet((-1, 0), x>0); show(A) #declare the domain A of k
K(x)=diff(f, x)/diff(g, x); show(K(x)) #declare and show the ratio K=f'/g'
Kprime(x)=diff(K, x).factor(); show(Kprime(x)) #declare and show the derivative K'
print(all(Kprime(t).n()>0 for t in [-0.9, -0.5, 0.5, 1, 2])) #spot-check that K'>0 on A
kprime(x)=diff(k, x).factor(); show(kprime(x)) #declare and show the derivative k'
print(all(kprime(t).n()>0 for t in [-0.9, -0.5, 0.5, 1, 2])) #spot-check that k'>0 on A
h(x)=2*(x+1)*e^(2*x)*ln(x+1)-e^(2*x)+1; show(diff(h, x).factor()) #declare the numerator h
print(all(h(t).n()>0 for t in [-0.9, -0.5, 0.5, 1, 2])) #spot-check that h is positive on A
(2) In this case we have
$$\kappa(x) = \frac{f(x)}{g(x)} = \frac{e^x - 1}{2x + e^x - 1}\,, \quad \text{and} \quad K(x) = \frac{f'(x)}{g'(x)} = \frac{e^x}{2 + e^x}\,,$$
where $\kappa$ is defined on $A = (-\infty, 0)\cup(0, +\infty)$ (and we assume the same for both $f$, $g$). We also have $\lim_{x\to 0^+} f(x) = \lim_{x\to 0^+} g(x) = 0$, and hence we can apply the LMR. In particular, the ratio $K(x) = \frac{e^x}{2 + e^x}$ is strictly increasing on $A$ (e.g., its first derivative satisfies $K'(x) = \frac{2\,e^x}{(2 + e^x)^2} > 0$ for all $x \in A$), and hence the ratio $\kappa = f/g$ must follow the same monotonicity pattern. It is an easy exercise to see that the first derivative of $\kappa$ satisfies
$$\kappa'(x) = \frac{2\,(x\,e^x - e^x + 1)}{(2x + e^x - 1)^2} > 0\,, \quad \forall\, x \in A\,.$$
On the other hand, a confirmation via Sage proceeds very similarly to (1), i.e.,
f(x)=e^(x)-1; g(x)=2*x+e^(x)-1; k(x)=f(x)/g(x)
A=RealSet(x>0, x<0); show(A) #declare the domain A
K(x)=diff(f, x)/diff(g, x); show(K(x)) #declare and show the ratio K=f'/g'
Kprime(x)=diff(K, x).factor(); show(Kprime(x)) #declare and show the derivative K'
print(all(Kprime(t).n()>0 for t in [-2, -0.5, 0.5, 2])) #spot-check that K'>0 on A
kprime(x)=diff(k, x).factor(); show(kprime(x)) #declare and show the derivative k'
print(all(kprime(t).n()>0 for t in [-2, -0.5, 0.5, 2])) #spot-check that k'>0 on A
(b) The objective of this question is to illustrate that the LMR operates similarly to l'Hospital's rule for limits. According to the latter rule we get
$$\lim_{x\to 0}\kappa(x) = \lim_{x\to 0}\frac{e^{2x} - 1}{\ln(x + 1)} \overset{\frac{0}{0}}{=} \lim_{x\to 0}\frac{f'(x)}{g'(x)} = \lim_{x\to 0} K(x) = \lim_{x\to 0} 2(x + 1)\,e^{2x} = 2\,.$$
Verify yourself that in a similar way in (2) we get $\lim_{x\to 0}\kappa(x) = \lim_{x\to 0} K(x) = 1/3$. □

C2) Material on derivatives - Applications

5.E.114. Thales' theorem and the shadow problem. (a) A man who is 1.7 meters tall walks away from a 4-meter-tall lamppost at a speed of 1.4 m/sec (see also the figure given below). Find the rate at which his shadow is increasing in length.
(b) Suppose now that as a man walks away from a lamppost of height 380 cm, the tip of his shadow moves twice as fast as he does. What is the man's height?

Solution. (a) Let us denote by $x$ the distance of the man from the base of the lamppost at time $t$, and by $y$ the length of his shadow, both measured in meters. By assumption the man moves at a rate of 1.4 m/sec, so that $x = 1.4t$. Consider the triangles $OLB$ and $AMB$. (Figure: the light $L$ sits atop the lamppost with base $O$, the man stands at $A$ with head $M$, and $B$ is the tip of the shadow; $x = |OA|$, $y = |AB|$.) These triangles are similar, so by Thales' theorem we have
$$\frac{x + y}{4} = \frac{y}{1.7} \iff \frac{1.4t + y}{4} = \frac{y}{1.7} \iff y = \frac{119}{115}\,t \approx 1.03478\,t\,.$$
One may solve this equation via Sage and obtain the final equality presented here, which can be done by the syntax
var("t"); solve(4*x==1.7*(1.4*t+x), x)
It follows that the length of the man's shadow increases at the rate $\frac{dy}{dt} = \frac{119}{115}$ m/sec. (b) Let us denote by $\alpha$ the whole distance from the lamppost's base to the tip of the man's shadow (in terms of the previous figure, $\alpha = x + y$), and by $h$ the man's height (now measured in centimeters; this is the unknown of the problem). Notice that both $\alpha$, $x$ are functions of the time $t$, but $h$ is constant. In this case Thales' theorem tells us that
$$\frac{\alpha(t)}{380} = \frac{\alpha(t) - x(t)}{h} \iff 380\big(\alpha(t) - x(t)\big) = h\,\alpha(t)\,,$$
which we differentiate with respect to $t$, using the assumption that $\frac{d\alpha}{dt} = 2\frac{dx}{dt}$. This gives
$$380\Big(\frac{d\alpha}{dt} - \frac{dx}{dt}\Big) = h\,\frac{d\alpha}{dt} = 2h\,\frac{dx}{dt} \iff 380\Big(2\frac{dx}{dt} - \frac{dx}{dt}\Big) = 2h\,\frac{dx}{dt}\,,$$
which is equivalent to $h = \frac{380}{2} = 190$ cm. □

5.E.115. Inscribed rectangle with maximal area. (1) Find the rectangle with the largest area that can be inscribed between the graph of the parabola $y = x^2$ and the line $y = a$, where $a > 0$ is some fixed constant (see the figure below). (2) What is the maximum area for the values 1, 2 and 3 of $a$?

Solution. (1) Let $[x, x^2]$ be the point representing the lower right corner of the rectangle. As we indicate in the figure, the upper right corner is then the point $[x, a]$; in particular the height of the rectangle is the function $h_a(x) = a - x^2$ with $x \in [0, \sqrt{a}]$. Notice that $h_a(\sqrt{a}) = 0$ and $h_a(0) = a$, two cases that we may exclude for obvious reasons (they correspond to zero height and zero width ($x = 0$), respectively, hence they do not determine a rectangle). Thus, the function representing the area of the rectangle is
$$E_a(x) = 2\,x\,h_a(x) = 2x(a - x^2) = 2ax - 2x^3\,,$$
which we want to maximize on the interval $(0, \sqrt{a})$. Differentiating $E_a$ with respect to $x$ gives $E_a'(x) = 2a - 6x^2$. The equation $E_a'(x) = 0$ has a unique solution in $(0, \sqrt{a})$, given by $x_a := \sqrt{\frac{a}{3}}$, with
$$E_a(x_a) = \frac{4\sqrt{3}}{9}\,a^{\frac{3}{2}} = \frac{4\sqrt{3a^3}}{9}\,, \quad a > 0\,. \qquad (*)$$
There are many ways to prove that $x_a$ maximizes the area. For instance, the second derivative of the area function $E_a$ is given by $E_a''(x) = (E_a'(x))' = -12x$, and hence obviously $E_a''(x_a) < 0$ (since $x_a > 0$ for all $a > 0$). You can easily confirm these conclusions in Sage by applying the cell
var("a"); assume(a>0)
h(a, x)=a-x^2; E(a, x)=2*x*h(a, x)
show(diff(E(a, x), x)); show(solve(diff(E(a, x), x)==0, x))
bool(diff(E(a, x), x, 2)(x=sqrt(a/3))<0)
(2) It is sufficient to substitute the required values in $(*)$. This gives $E_1(x_1) = \frac{4\sqrt{3}}{9}$, $E_2(x_2) = \frac{4\sqrt{24}}{9} = \frac{8\sqrt{6}}{9}$ and $E_3(x_3) = \frac{4\sqrt{81}}{9} = 4$.
In Sage, to confirm these values, add to the previous block the cell
show(E(1, sqrt(1/3)).factor()); show(E(2, sqrt(2/3)).factor()); show(E(3, sqrt(3/3)).factor())
Let us finalize this task by illustrating these three solutions. □

5.E.116. Inscribe a rectangle with the greatest perimeter possible into a semidisc with radius $r$. Determine the rectangle's perimeter. ⃝

5.E.117. Among the rectangles with perimeter $4c$, find the one having the greatest area (if such a rectangle exists) and determine the lengths of its sides. ⃝

5.E.118. Find the height $h$ and the radius $r$ of the cone of greatest volume which fits into a ball of radius $R$. ⃝

5.E.119. From all triangles with given perimeter $p$, select the one with the greatest area. ⃝

5.E.120. A parabola is given by the equation $2x^2 - 2y = 9$. Find the points of the parabola which are closest to the origin. ⃝

5.E.121. Regiomontanus' angle maximization problem (1471). In a museum, there is a painting on the wall. Its lower edge is $a$ meters above the ground and its upper edge $b$ meters, that is, its height equals $b - a$. A tourist is looking at the painting, her eyes at height $h < a$ meters above the ground (see the figure below). How far from the wall should the tourist stand if she wants to maximize her angle of view of the painting? This is the so-called "Regiomontanus angle maximization problem" from the 15th century.19 (Figure: the painting spans the heights $a$ to $b$ on the wall; the tourist stands at distance $x$, with angles $\alpha$, $\beta$ to the upper and lower edges.)

19 This optimization problem was first posed and solved in 1471 by the German astronomer Johannes Müller von Königsberg (1436–1476), known as Regiomontanus. Müller solved the problem by elementary geometry, but the solution presented here relies on differentiation.

Solution. First observe that because the tourist's eyes are below the lower edge of the painting, her viewing angle becomes small both very close to the wall and very far from the wall, so the largest viewing angle is attained somewhere in between. We will use the picture above to illustrate the situation. Let us denote by $x$ the distance (in meters) of the tourist from the wall, and the angle of her view of the painting by $\varphi$. For the angles $\alpha, \beta \in (0, \pi/2)$ we have
$$\tan(\alpha) = \frac{b - h}{x}\,, \quad \tan(\beta) = \frac{a - h}{x}\,,$$
with $h < a < b$. Our task is to maximize $\varphi = \alpha - \beta$. Let us add that for $h > b$ one can proceed analogously, while for $h \in [a, b]$ the angle $\varphi$ increases as $x$ decreases ($\varphi = \pi$ for $x = 0$ and $h \in (a, b)$). From the condition $h < a$ it follows that the angle $\varphi$ is acute, that is, $\varphi \in (0, \pi/2)$. Since the function $y = \tan(x)$ is increasing on the interval $(0, \pi/2)$, we can turn our attention to maximizing the value $\tan(\varphi)$. We have that
$$\tan(\varphi) = \tan(\alpha - \beta) = \frac{\tan(\alpha) - \tan(\beta)}{1 + \tan(\alpha)\tan(\beta)} = \frac{\frac{b-h}{x} - \frac{a-h}{x}}{1 + \frac{b-h}{x}\cdot\frac{a-h}{x}} = \frac{x\,(b - a)}{x^2 + (b - h)(a - h)}\,.$$
So it suffices to find the global maximum of the function
$$f(x) = \frac{x\,(b - a)}{x^2 + (b - h)(a - h)}\,, \quad x \in [0, +\infty)\,,$$
whose first derivative is given by
$$f'(x) = \frac{(b - a)\big[(b - h)(a - h) - x^2\big]}{\big[x^2 + (b - h)(a - h)\big]^2}\,, \qquad (*)$$
for all $x \in (0, +\infty)$. This expression can be quickly verified via the following cell in Sage:
var("a, b, h"); f(x)=x*(b-a)/(x^2+(b-h)*(a-h))
df1=diff(f(x), x).factor(); show(df1)
From $(*)$ one can see that $f'(x) > 0$ for $x \in \big(0, \sqrt{(b-h)(a-h)}\big)$, and $f'(x) < 0$ for $x \in \big(\sqrt{(b-h)(a-h)}, +\infty\big)$, and hence $f$ attains its global maximum at the point $x_0 = \sqrt{(b-h)(a-h)}$.
This is the desired point, and the corresponding angle is given by
$$\varphi_0 = \arctan\Big(\frac{x_0\,(b - a)}{x_0^2 + (b - h)(a - h)}\Big) = \arctan\Big(\frac{b - a}{2\sqrt{(b - h)(a - h)}}\Big)\,.$$
Let us describe two examples:
• Suppose an ant looks at the painting, so in this case we have $h = 0$, $x_0 = \sqrt{ab}$ and $\varphi_0 = \arctan\big(\frac{b-a}{2\sqrt{ab}}\big)$. If the painting is 1 meter high and its lower edge is 2 meters above the ground, i.e., $a = 2$ and $b = 3$, then the ant sees the painting at the largest angle $\varphi_0 \approx 0.201$ rad $\approx 11.5°$ at the distance $x_0 = \sqrt{6} \approx 2.45$ meters from the wall.
• For another example, look at the left picture above, where the painting is viewed by a man whose eyes are at the height of 1.8 meters, together with his son whose eyes are 1 meter above the ground. In this case the father should stand at $x_0 \approx 0.49$ meters from the wall, and his son at $x_0 \approx 1.41$ meters from the wall. Thus the father has viewing angle $\varphi_0 \approx 0.795$ rad $\approx 45.6°$, whereas his son has $\varphi_0 \approx 0.339$ rad $\approx 19.5°$; the quotient $\frac{0.795}{0.339} \approx 2.3$ shows how much better the father's view is. □

5.E.122. Determine the point $x_0$ of the previous problem 5.E.121 by an alternative method. ⃝

5.E.123. Snell's law. This task is about light rays, which bend when traveling from one (optical) medium to another, such as from air to water. This causes items placed in or behind water to look shifted, and there is a rule, called "Snell's law", which can be applied to evaluate the bending of a light ray.20 Having as reference the figure given below, determine the refracted light ray between the point $A$ in a homogeneous medium with speed of light $v_1$ and the point $B$ in a homogeneous medium with speed of light $v_2$.

20 An optical medium is a material through which light (and other electromagnetic waves) propagates.

Solution. We can assume that distances are given in meters, the speeds $v_1$, $v_2$ in meters per second, and the time in seconds; for convenience we omit the units of measurement in our notation. The ray is determined by "Fermat's principle of least time", which states that from all the paths between the points $A$ and $B$, the light travels along the one which can be traversed in the least time. In homogeneous media the ray is a straight line, and in our case we consider its segments. So it suffices to determine the point $R$ (given by the value $x$) where the ray refracts. The distance between the points $A$ and $R$ is $\sqrt{h_1^2 + x^2}$, while the distance between $R$ and $B$ is $\sqrt{h_2^2 + (d - x)^2}$. We deduce that the total travel time of the light between the points $A$ and $B$ is represented by the function
$$T(x) = \frac{\sqrt{h_1^2 + x^2}}{v_1} + \frac{\sqrt{h_2^2 + (d - x)^2}}{v_2}\,,$$
On the other hand, on the interval [0, π/2] the sine function is non-negative and increasing, so the quotient (sin φ1)/(sin φ2) is increasing with respect to x. Since T′ (0) < 0 and T′ (d) > 0, there is exactly one stationary point x0. From the inequalities T′ (x) < 0 for x ∈ [0, x0) and T′ (x) > 0 for x ∈ (x0, d], it follows that at the stationary point x0 there is the global minimum. Let us summarize as follows: The ray is given by the point R of refraction (i.e., the value x0), and the point R is determined by the Snell’s law (♯). The ration v1/v2 is constant for the given homogeneous mediums, and determines an important quantity which describes the interface of optical spaces. It is called “refractive index” and denoted by n. Usually, the first optical medium (space) is vacuum, that is, v1 = c, where c = speed of light and v2 = v, such that n = c/v. For vacuum, we get n = 1, of course. This value is also used for the air since its refractive index at standard conditions is n . = 1.000272. Other mediums have n > 1 (n = 1.31 for ice, n = 1.33 for water, n = 1.5 for glass). However, the refractive index also depends on the wave length of the electromagnetic radiation in question (for example, for water and light, it ranges from n . = 1.331 to n . = 1.344), where the index ordinarily decreases as the wave length increases. The speed of light in an optical space having n > 1 depends on its frequency, and hence we talk about the “dispersion of light”. The dispersion causes rays of different colors to refract at different angles. The violet ray refracts the most and the red ray refracts the least. This is also the origin of a rainbow. □ 5.E.124. You are in a boat on a lake at distance d km from the shore. You want to get to a given place on the shore whose straight-line distance is √ d2 + ℓ2 from you, see the figure below. What path will you take if you want to be there as soon as possible, supposing you can row at v1 kph and run along the shore at v2 kph? How long will the journey take? Solution. The optimal strategy is given by first rowing in a straight line to the shore at some point [0, x] for x ∈ [0, ℓ], and then running along the shore to the target point [0, ℓ], see the diagram. So the trajectory consists of two line segments, or only one segment, in the case when x = ℓ. The voyage to the point [0, x] on the shore takes √ d2+x2 v1 hours, and the final run takes ℓ−x v2 hours. Thus, the total time is the function t(x) = √ d2 + x2 v1 + ℓ − x v2 , x ∈ [0, ℓ] . It can be assumed that v1 < v2, for if v1 ≥ v2, the optimal strategy is to row straight to the target point, which corresponds to x = l. Now, the first and the second derivative of t(x) are given by t′ (x) = x v1 √ d2+x2 − 1 v2 and t′′ (x) = d2 v1 √ (d2+x2)3 , respectively, with x ∈ (0, ℓ). You may confirm these relations in Sage as follows: var("d, l, v1, v2"); assume(v1 0 for all x in that interval. If x0 ≥ ℓ, then t′ (x) ≤ 0 for all x ∈ [0, ℓ] and so the global minimum of t(x) occurs at ℓ. In the former case, the fastest journey in hours takes t (x0) = √ d2 + x2 0 v1 + ℓ − x0 v2 = d √ v2 2 − v2 1 v1v2 + ℓ v2 . In the latter case, the fastest journey in hours takes t (ℓ) = √ d2 + ℓ2 v1 . Notice one could follow an even simpler approach for doing the calculations, using the variable θ, instead of the variable x, where x = d tan(θ). The fastest journey occurs when sin(θ) = v1/v2., which is the limiting case of Snell’s law. □ 5.E.125. L’Hospital’s pulley. A rope of length r is tied at one of its ends to the ceiling at a point A. 
A small pulley is attached to its other end. A point $B$ is also on the ceiling, at distance $d > r$ from $A$. Another rope of length $\ell > \sqrt{d^2 + r^2}$ is tied to $B$ at one end, passes over the pulley, and has a weight attached to its other end, see the figure presented below. Neglect the masses and sizes of the ropes and the pulley. At what position does the weight stabilize, so that the system is in a stationary position?

Solution. The system is in a stationary position when its potential energy is minimized, that is, when the distance between the weight and the ceiling is maximal. Let $x$ be the distance between $A$ and the point $P$ on the ceiling vertically above the weight and the pulley. By the Pythagorean theorem, the distance between the pulley and the ceiling is $\sqrt{r^2 - x^2}$. Similarly, the distance between the weight and the pulley is $\ell - \sqrt{(d - x)^2 + r^2 - x^2}$. Hence if $f(x)$ is the distance between the weight and the ceiling, then
$$f(x) = \sqrt{r^2 - x^2} + \ell - \sqrt{(d - x)^2 + r^2 - x^2}\,.$$
The state of the system is fully determined by the value $x \in [0, r]$. Therefore, it suffices to find the global maximum of $f$ on the interval $[0, r]$. The first derivative of $f$ is given by
$$f'(x) = \frac{-x}{\sqrt{r^2 - x^2}} - \frac{-(d - x) - x}{\sqrt{(d - x)^2 + r^2 - x^2}} = \frac{-x}{\sqrt{r^2 - x^2}} + \frac{d}{\sqrt{(d - x)^2 + r^2 - x^2}}\,, \quad x \in (0, r)\,.$$
Let us use Sage once more to confirm this expression:
var("r, d, l"); f(x)=sqrt(r^2-x^2)+l-sqrt((d-x)^2+r^2-x^2)
df1(x)=diff(f, x); show(df1(x))
Now, instead of solving the equation $f'(x) = 0$ directly, it is easier to square it and clear denominators, which leads to the polynomial equation
$$2dx^3 - (2d^2 + r^2)x^2 + d^2r^2 = 0\,, \quad x \in (0, r)\,.$$
This has $x = d$ as an obvious solution, and hence we get the factorization $(x - d)(2dx^2 - r^2x - dr^2) = 0$. Thus the cubic has three real roots. The root $x = d$ does not lie in the interval $[0, r]$, since $d > r$. The root $x = \frac{r^2 - r\sqrt{r^2 + 8d^2}}{4d}$ is negative, hence also outside $[0, r]$. But the root
$$x_0 = \frac{r^2 + r\sqrt{r^2 + 8d^2}}{4d}$$
is positive, and since $r < d$ it satisfies $x_0 < \frac{r}{4} + \frac{3r}{4} = r$. Now, because $f'$ is continuous on the open interval $(0, r)$, it changes sign only at $x_0$. Based on the limits $\lim_{x\to 0^+} f'(x) = \frac{d}{\sqrt{d^2 + r^2}} > 0$ and $\lim_{x\to r^-} f'(x) = -\infty$, we deduce that $f'(x) > 0$ for all $x \in (0, x_0)$, while $f'(x) < 0$ for all $x \in (x_0, r)$. This implies that $f$ attains its global maximum at $x_0$. □

5.E.126. A rectangle is inscribed into an equilateral triangle with sides of length $a$ so that one of its sides lies on one of the triangle's sides and the other two of the rectangle's vertices lie on the remaining sides of the triangle. What is the maximum possible area of the rectangle? ⃝

5.E.127. Determine the dimensions of an (open) swimming pool whose volume is 32 m³ and whose bottom has the shape of a square, so that one would use the least amount of paint possible to prime its bottom and its walls. ⃝

5.E.128. Express the number 28 as a sum of two non-negative numbers such that the sum of the first summand squared and the second summand cubed is as small as possible. ⃝

5.E.129. With the help of the first derivative, find the real number $a > 0$ for which the sum $a + 1/a$ is minimal. Then solve this problem without using differential calculus. ⃝

D) Material on infinite sums and power series

Recall from Section D that infinite sums have several beautiful applications. Below we describe one strongly related to the compelling notion of fractals, the so-called Sierpiński carpet.

5.E.130. Sierpiński carpet.
The unit square is divided into nine equal squares, and then the middle one is removed. Each of the eight remaining squares is again divided into nine equal sub-squares, and the middle sub-square of each is removed again. Having applied this procedure ad infinitum, determine the area of the resulting figure.

Solution. In the first step, a square of area 1/9 is removed. In the second step, eight squares are removed, each of area $9^{-2}$, so in total we remove $8\cdot 9^{-2}$. Every further iteration removes eight times more squares than the previous step, but the squares are nine times smaller. Thus, the sum of the areas of all the removed squares is encoded by the following infinite sum:
$$\frac{1}{9} + \frac{8}{9^2} + \frac{8^2}{9^3} + \cdots = \sum_{n=0}^{\infty}\frac{8^n}{9^{n+1}}\,.$$
Now, to compute the area of the remaining figure (known as the Sierpiński carpet), it is sufficient to consider the difference
$$1 - \sum_{n=0}^{\infty}\frac{8^n}{9^{n+1}} = 1 - \frac{1}{9}\sum_{n=0}^{\infty}\Big(\frac{8}{9}\Big)^n = 1 - \frac{1}{9}\cdot\frac{1}{1 - \frac{8}{9}} = 0\,,$$
where we used the result from 5.B.16 on geometric series. Thus, the Sierpiński carpet has zero area. □

5.E.131. Decide whether the following implications hold: (a) Convergence of the series $\sum_{n=1}^{\infty}a_n$, $\sum_{n=1}^{\infty}b_n$ implies that the series $\sum_{n=1}^{\infty}(6a_n - 5b_n)$ converges as well. (b) If a series $\sum_{n=0}^{\infty}a_n$ satisfies $\lim_{n\to\infty}a_n^2 = 0$, then it is convergent. (c) If a series $\sum_{n=1}^{\infty}a_n^2$ converges, then the series $\sum_{n=1}^{\infty}\frac{a_n}{n}$ converges absolutely. ⃝

5.E.132. Prove that the series below diverge to $+\infty$:
$$T_1 = \sum_{n=2}^{\infty}\frac{1}{\sqrt[n]{\ln(n)}}\,, \quad T_2 = \sum_{n=1}^{\infty}\frac{1}{n - \ln(n)}\,.$$

Solution. Both series consist of non-negative terms only, so each either converges or diverges to $+\infty$. For $T_1$ it is useful to recall that $\lim_{n\to\infty}\sqrt[n]{\ln(n)} = 1$ (see for example 5.E.18), and hence we also get
$$\lim_{n\to\infty}\frac{1}{\sqrt[n]{\ln(n)}} = \frac{1}{\lim_{n\to\infty}\sqrt[n]{\ln(n)}} = 1\,.$$
Thus, by (1) in Theorem 5.1.9 we deduce that $T_1$ is not convergent, and in particular it diverges to $+\infty$. For $T_2$ we have $\lim_{n\to\infty}\frac{1}{n - \ln(n)} = 0$. However, we see that
$$\sum_{n=1}^{\infty}\frac{1}{n - \ln(n)} \ge \sum_{n=1}^{\infty}\frac{1}{n} = +\infty\,.$$
Hence $T_2$ also diverges to $+\infty$. □

5.E.133. Prove that the series
$$S = \sum_{n=0}^{\infty}\arctan\Big(\frac{n^2 + 2n + 3\sqrt{n} + 4}{n + 1}\Big)\,, \quad T = \sum_{n=1}^{\infty}\frac{3^n + 1}{n^3 + n^2 - n}$$
are both divergent.

Solution. For the first case we have
$$\lim_{n\to\infty}\arctan\Big(\frac{n^2 + 2n + 3\sqrt{n} + 4}{n + 1}\Big) = \lim_{n\to\infty}\arctan\Big(\frac{n^2}{n}\Big) = \frac{\pi}{2}\,.$$
We can do this computation in Sage by typing
var("n"); lim(arctan((n^2+2*n+3*sqrt(n)+4)/(n+1)), n=oo)
For the second one we compute
$$\lim_{n\to\infty}\frac{3^n + 1}{n^3 + n^2 - n} = \lim_{n\to\infty}\frac{3^n}{n^3} = +\infty\,.$$
Thus, the necessary condition $\lim_{n\to\infty}a_n = 0$ for a series $\sum_{n=n_0}^{\infty}a_n$ to converge fails in both cases, and so neither $S$ nor $T$ converges. □

5.E.134. Compute explicitly the sum $\sum_{n=0}^{\infty}\frac{1}{(3n+1)(3n+4)}$.

Solution. It suffices to use the partial fraction decomposition
$$\frac{1}{(3n+1)(3n+4)} = \frac{1}{3}\cdot\frac{1}{3n+1} - \frac{1}{3}\cdot\frac{1}{3n+4}\,, \quad n = 0, 1, 2, \ldots$$
This gives
$$\sum_{n=0}^{\infty}\frac{1}{(3n+1)(3n+4)} = \lim_{n\to\infty}\frac{1}{3}\Big(1 - \frac{1}{4} + \frac{1}{4} - \frac{1}{7} + \frac{1}{7} - \frac{1}{10} + \cdots + \frac{1}{3n+1} - \frac{1}{3n+4}\Big) = \lim_{n\to\infty}\frac{1}{3}\Big(1 - \frac{1}{3n+4}\Big) = \frac{1}{3}\,. \qquad \square$$

5.E.135. (a) Using the ratio test, show that $S = \sum_{n=1}^{\infty}\frac{(-2)^{n^2}}{n!}$ is not convergent. (b) Examine the convergence of the series $T = \sum_{n=1}^{\infty}(-1)^{n+1}\arctan\big(\frac{2}{\sqrt{3n}}\big)$. ⃝

5.E.136. Let $S = \sum_{n=0}^{\infty}f(n)$ be an infinite series for which the limit $\lim_{n\to\infty}\sqrt[n]{|f(n)|}$ exists and equals $q$. Present in Sage a routine based on the root test which decides, when possible, the absolute convergence or divergence of $S$. Then apply your program to the following series and state Sage's output.
$$S_1 = \sum_{n=1}^{\infty}2^n\Big(\frac{4}{5}\Big)^{n^2}\,, \quad S_2 = \sum_{n=1}^{\infty}\Big(\frac{\ln(n)}{n}\Big)^n\,, \quad S_3 = \sum_{n=1}^{\infty}\Big(\frac{2n}{4n^2 + 1}\Big)^{\frac{1}{n}}\,, \quad S_4 = \sum_{n=1}^{\infty}(-1)^{n-1}\Big(\frac{3n}{n + 1}\Big)^n\,.$$

Solution. Here one can use once more the command def to create a subroutine, as we did in 5.D.5; this time we call it roottest. Recall that according to the root test the series $S$ converges absolutely if $q < 1$ and does not converge absolutely for $q > 1$, while for $q = 1$ the root test is inconclusive. Thus one can give the cell
var("n")
def roottest(f):
    q=lim(abs(f(n))**(1/n), n=oo)
    if q<1:
        return "converges absolutely"
    elif q==1:
        return "no conclusion"
    else:
        return "does not converge absolutely"
Now, to test our routine on the given series, simply introduce the sequence inside the series first, and then apply the command roottest(f). For instance, for the second series type in your editor
f(n)=(ln(n)/n)^n; roottest(f)
Sage informs us that the series converges absolutely. Check yourself that the same is true for the first series (see also 5.D.4). On the other hand, the series $S_4$ does not converge absolutely, while the root test does not provide a result for $S_3$. □

5.E.137. Prove the convergence of the series $\sum_{n=1}^{\infty}\frac{3^n + 2^n}{6^n}$, and find its value. ⃝

5.E.138. Calculate the sums $\sum_{n=1}^{\infty}\frac{2n - 1}{2^n}$ and $\sum_{n=0}^{\infty}\frac{n + 1}{3^n}$. ⃝

5.E.139. Determine those $\alpha \in \mathbb{R}$, $\beta \in \mathbb{Z}$ and $\gamma \in \mathbb{R}\setminus\{0\}$ which make the following series convergent:
$$\sum_{n=120}^{\infty}\frac{e^{-\alpha n}}{n}\,, \quad \sum_{n=240}^{\infty}\frac{\beta^n\cdot n!}{n^n}\,, \quad \sum_{n=360}^{\infty}\frac{n}{\gamma^n}\,. \quad ⃝$$

5.E.140. Determine whether the series $\sum_{n=21}^{\infty}(-1)^n\frac{n^8 - 5n^6 + 2n}{2^n}$ converges absolutely, converges conditionally, or does not converge at all. ⃝

5.E.141. Find all real numbers $\alpha \ge 0$ for which the series $\sum_{n=1}^{\infty}(-1)^n\ln\big(1 + \alpha^{2n}\big)$ is convergent. ⃝

5.E.142. Explain why the radius of convergence of the power series expressing the function $\frac{1}{1+x^2}$ with center at the origin equals one.

Solution. Recall from 5.B.16 that the geometric series $\sum_{n=0}^{\infty}x^n = \frac{1}{1-x}$ converges for $|x| < 1$. Thus $\frac{1}{1+x} = \sum_{n=0}^{\infty}(-1)^n x^n$, and substituting $x^2$ for $x$ we arrive at $\frac{1}{1+x^2} = \sum_{n=0}^{\infty}(-1)^n x^{2n}$, again converging for $|x| < 1$. At the same time, viewing the same sum over complex $x$, the sum explodes toward infinite absolute value as $x$ approaches the imaginary unit $i$. Thus the power series cannot converge for $|x| > 1$. □

5.E.143. Express the function $y = e^x$, defined on the whole real line, as an infinite polynomial whose terms are of the form $a_n(x - 1)^n$. Then express the function $y = 2^x$ defined on $\mathbb{R}$ as an infinite polynomial with terms $a_n x^n$.

Solution. We know the series for $e^x$; thus writing $e^x = e\cdot e^{x-1}$ leads to
$$e^x = \sum_{n=0}^{\infty}\frac{e}{n!}(x - 1)^n\,.$$
The second task is even simpler, since
$$2^x = e^{x\ln(2)} = \sum_{n=0}^{\infty}\frac{\ln^n(2)}{n!}x^n\,. \qquad \square$$

5.E.144. Supposing $|x| < 1$, determine the sums
$$A(x) = \sum_{n=1}^{\infty}\frac{1}{2n - 1}x^{2n-1}\,, \quad B(x) = \sum_{n=1}^{\infty}n^2 x^{n-1}\,. \quad ⃝$$

5.E.145. Give an example of two divergent series $\sum_{n=1}^{\infty}a_n$, $\sum_{n=1}^{\infty}b_n$ with positive terms for which the series $\sum_{n=1}^{\infty}(3a_n - 2b_n)$ converges absolutely. ⃝

5.E.146. Determine whether the series $\sum_{n=1}^{\infty}(-1)^n\frac{(n!)^2}{(2n)!}$ and $\sum_{n=1}^{\infty}(-1)^n\frac{n^7 - n^4 + n}{n^8 + 2n^6 + n}$ converge absolutely, converge conditionally, or do not converge at all. ⃝

5.E.147. Does the series $\sum_{n=1}^{\infty}(-1)^{n+1}\frac{\sqrt[3]{n} + \sqrt[5]{n} + 1}{n + \sqrt[5]{n}}$ converge? ⃝

5.E.148. Find the values of the parameter $p \in \mathbb{R}$ for which the series $\sum_{n=1}^{\infty}\frac{(-1)^n\sin^n p}{n}$ converges. ⃝

5.E.149. Determine whether the series $\sum_{n=0}^{\infty}\frac{2^n + (-2)^n}{5^n}$ converges. ⃝
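Let us remark that many such series can be evaluated symbolically with Sage's sum command, which is handy for checking one's own hand computation; for instance, for the series in 5.E.149 one may type:
var("n")
show(sum((2^n+(-2)^n)/5^n, n, 0, oo))  # two geometric series in disguise; compare with your answer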
5.E.150. Using the power series $F(x) = \sum_{n=0}^{\infty}(-1)^n(2n + 1)x^{2n}$ with $x \in (-1, 1)$, calculate the infinite sum $S = \sum_{n=1}^{\infty}\frac{2n - 1}{(-2)^{n-1}}$. ⃝

Solutions to the exercises

5.A.7. As usual, we have a taste for Sage, where an appropriate cell reads as follows:
g(x)=(x/(1+8*x**2))
figg=plot(g(x), (x, -1, 1), figsize=4, color="black"); show(figg)
R=PolynomialRing(QQ, "x")
n=20
points=[(-1+k*(2/n), g(-1+k*2/n)) for k in [0, 1,.., n]]
P(x)=R.lagrange_polynomial(points); show(P(x))
figg+=plot(P(x), (x, -1, 1), color="purple")
figg+=list_plot(points, size=20, figsize=4, color="blue"); figg
This returns the interpolation polynomial P(x) in question, and also the plots of g(x) and of P(x), along with the given set of points. Note that the polynomial oscillates wildly near the endpoints, getting worse and worse as n increases.

5.A.8. Running the following cell in Sage prints out the explicit form of P, given by $P(x) = 3x^2 - 2x - 4$.
R = PolynomialRing(QQbar, "x")
R.lagrange_polynomial([(-1, 1), (1, -3), (2, 4)])
For further practice with the method of elementary Lagrange polynomials, we suggest that the reader verify this answer also by a formal computation.

5.A.9. Assume that $P(x) = ax^3 + bx^2 + cx + d$ for some reals $a, b, c, d$ to be specified. The given conditions yield the following system (we present this directly in Sage):
var("a, b, c, d")
eq1=d-1; eq2=a+b+c+d
eq3=8*a+4*b+2*c+d-1; eq4=27*a+9*b+3*c+d-10
solve([eq1==0, eq2==0, eq3==0, eq4==0], a, b, c, d)
This block returns the answer $a = 1$, $b = -2$, $c = 0$, $d = 1$, that is, $P(x) = x^3 - 2x^2 + 1$. Alternatively, one may apply the previous method of Lagrange polynomials:
R = PolynomialRing(QQbar, "x")
R.lagrange_polynomial([(0, 1), (1, 0), (2, 1), (3, 10)])
which verifies the previous answer: $x^3 - 2x^2 + 1$.

5.A.13. The polynomial Q is given by $Q(x) = x^3 - 2x^2 + 3x - 3$.

5.A.14. An answer is given by $x^5 - 2x^4 - 5x + 2$.

5.A.17. The sought spline differs from the one in 5.A.15 only in the values of the derivatives at the points $-1$ and $1$. Similarly to the previous task, we get that the parts $S_1$ and $S_2$ of our spline have the forms $S_1(x) = ax^3 + bx^2 + 1$ and $S_2(x) = -ax^3 + bx^2 + 1$, respectively, where $a$, $b$ are unknown real parameters. Confronting this with the conditions $S_1(-1) = 0$, $S_1'(-1) = -1$, $S_2(1) = 0$, and $S_2'(1) = 1$ yields the system $\{-a + b + 1 = 0,\ 3a - 2b = -1\}$, having the solution $a = -3$, $b = -4$. Hence
$$S(x) = \begin{cases} -3x^3 - 4x^2 + 1\,, & \text{if } x \in [-1, 0]\,, \\ 3x^3 - 4x^2 + 1\,, & \text{if } x \in [0, 1]\,. \end{cases}$$

5.A.20. $S_1(x) = 1 - \frac{11}{20}x + \frac{1}{20}x^3$; $S_2(x) = \frac{1}{2} - \frac{2}{5}(x - 1) + \frac{3}{20}(x - 1)^2 - \frac{1}{40}(x - 1)^3$.

5.B.1. The set $\mathbb{Z}^+$ is unbounded above, hence $A$ is also unbounded and in particular $\sup A$ cannot exist. Observe also that $\big|\frac{(-1)^n}{n}\big| = \frac{1}{n} \le 1$, and thus we have the inequalities $n + \frac{(-1)^n}{n} \ge n - \frac{1}{n} \ge n - 1$ for all $n \in \mathbb{Z}^+$. This implies that $n + \frac{(-1)^n}{n} \ge 0$ for all $n \in \mathbb{Z}^+$, and hence $0$ is a lower bound of $A$. Moreover, $0 \in A$ (take $n = 1$), so $A$ is non-empty and we deduce that $\inf A = 0$.

5.B.2. For the set $B$ we get $\sup B = \frac{1}{4} \in B$ and $\inf B = -1 \in B$. For $C$ we see that $\sup C = 9 \notin C$, $\inf C = -9 \notin C$. For the set $X$, $\inf X = -1 \in X$, while $\sup X = 0 \notin X$. Further, $\inf Y = 0 \notin Y$, while $\sup Y = 5 \in Y$.

5.B.6. (a) By 5.B.3 we know that $\frac{1}{n} \to 0$ and $3n \to +\infty$. Thus their sum diverges to $+\infty$. (b) Here we use the fact that if $a_n \to +\infty$ and $b_n \to +\infty$, then also $a_n + b_n \to +\infty$. We have $n^2 \to +\infty$ and $2n \to +\infty$, so $(n^2 + 2n) \to +\infty$ as well.
(c) For this, use the first assertion in Problem 5.B.5: since $n^2 \to +\infty$ and $3n \to +\infty$, their product $n^2\cdot 3n \to +\infty$ as well. (d) For this, use the second assertion in Problem 5.B.5: since $\frac{n}{n+1} \to 1 > 0$ and $4n \to +\infty$, their product tends to $+\infty$ as well.

5.B.7. (a) For any $\varepsilon > 0$ the inequality $\frac{1}{2^n} < \varepsilon$ is equivalent to $2^n > \frac{1}{\varepsilon}$. Thus, taking $N > \frac{1}{\varepsilon}$ we see that $n \ge N$ implies $2^n \ge 2^N > N > \frac{1}{\varepsilon}$. Hence $\frac{1}{2^n} < \varepsilon$, and this proves the claim. (b) Observe that the terms of $(b_n = (-1)^{n+1})$ alternate between $1$ and $-1$. Such a sequence is said to be alternating, and since its $n$th term is either $1$ or $-1$, it must diverge by oscillation. Indeed, taking $\varepsilon$ with $0 < \varepsilon < \frac{1}{2}$, we see that for any $x \in \mathbb{R}$ infinitely many terms of $(b_n)$ lie outside the open interval $(x - \varepsilon, x + \varepsilon)$; hence the inequality $|b_n - x| < \varepsilon$ fails for infinitely many $n$. (c) The sequence $(c_n = \sin(n\pi/2))$, $n \ge 1$, is of the form $(1, 0, -1, 0, 1, 0, \ldots)$. Therefore, by the same reasoning as in case (b), one deduces that $(c_n)$ diverges. (d) In this case a solution is given by $x_n = n$, $y_n = -n + 1$, with $n \in \mathbb{Z}^+$.

5.B.12. We have to disprove the statement: "Any convergent sequence is bounded and monotone." Take for example the sequence $\big(\frac{(-1)^n}{n}\big)_{n\ge 1}$: it converges to $0$ but is not monotone, as we can see in the figure below. For the second question, an example is given by $(a_n = n)_{n\ge 1}$. Clearly, this sequence is strictly increasing, but it is not bounded and so it cannot be convergent (it is not hard to prove that any convergent sequence is bounded).

5.B.13. Recall that $|\sin(x)| \le 1$ for any $x \in \mathbb{R}$. Thus $|a_n| = \big|\frac{\sin(n)}{n}\big| \le \frac{1}{n} \le 1$ for any natural $n \ge 1$. This is equivalent to saying that $-1 \le a_n \le 1$, and so the sequence $(a_n)$ is bounded. Next, it is easy to see that $(b_n = n^2 + 1)$ is bounded below: $b_n \ge 2$ for any $n$ (since $n^2 \ge 1$ for any natural $n \ge 1$). Suppose that $(b_n)$ is also bounded above. Then there exists $M \in \mathbb{R}$ such that $b_n \le M$ for all $n$. This gives $n^2 \le M - 1$, i.e., $n \le \sqrt{M - 1}$ for all $n$, a contradiction. Finally, for $c_n$ one can verify that $0 \le c_n \le 2$, which we leave as an exercise (we recommend plotting the first terms of $(c_n)$, e.g., via Sage).
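The figure referred to in 5.B.12 is easy to reproduce; one possible Sage cell is:
# first forty terms of the convergent, non-monotone sequence (-1)^n/n
list_plot([((-1)^n)/n for n in [1..40]], size=20, color="black")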
Thus, the monotone sequence theorem ensures its convergence, and in particular we deduce that lim n→+∞ gn = 0 = sup { n! nn : n ∈ Z+ } . This also follows by squeeze theorem, since we have 0 < gn ≤ 1/n for all positive naturals n. This is because n! = n · (n − 1) · · · 2 · 1 < n · n · · · n · 1 = nn−1 (so dividing by 1/nn we obtain the formula). 5.B.18. By the binomial theorem we have en = 1 + ( n 1 ) 1 n + · · · + ( n k ) 1 nk + · · · + ( n n ) 1 nn = 2 + ( n 2 ) 1 n2 + · · · + ( n k ) 1 nk + · · · + ( n n ) 1 nn . Since ( n k ) = n! k!(n − k)! = n · (n − 1) · . . . · (n − k + 1) k! , for the general term we see that ( n k ) 1 nk = 1 k! · n · (n − 1) · . . . · (n − k + 1) nk = 1 k! · n n · n − 1 n · . . . · n − k + 1 n = 1 k! · ( 1 − 1 n ) · . . . · ( 1 − k − 1 n ) . Therefore en = 2 + 1 2! ( 1 − 1 n ) + 1 3! ( 1 − 1 n ) ( 1 − 2 n ) + . . . + 1 k! ( 1 − 1 n ) · . . . · ( 1 − k − 1 n ) + . . . + 1 nn . As n increases, the quantities 1 n , 2 n , . . . , k−1 n decrease and so the expressions (1 − 1 n ), (1 − 2 n ), . . . , (1 − k−1 n ) increase. So the general term of (en) increases, i.e., (en) is increasing. Another way to obtain the monotonicity of (en) is based on the geometric mean inequality and in particular in Bernoulli’s inequality which we presented in Chapter 1. Also, since k! = 1 · 2 · · · k ≥ 1 · 2 · · · 2 = 2k−1 we see that 2 < en < 1 + n∑ k=1 1 k! < 1 + n∑ k=1 1 2k−1 = 1 + 1 − (1/2)n 1 − (1/2) < 3 , since 1 + 1−(1/2)n 1−(1/2) < 1 + 1 1−(1/2) . Thus (en) is also bounded, 2 < en < 3, for any positive natural n, and by 5.B.11 it is convergent. 5.B.24. Yes: The range of f(x) = x2 is the set [0, +∞), which according to the remark in 5.B.23 is a closed subset of R. 5.B.29. This is because the sine function f(x) = sin(x) is continuous at any x ∈ R. Verify this claim yourself, presenting a “delta-epsilon” proof.21 Thus limx→π/3 sin(x) = sin(π/3) = √ 3/2. Let us now use Sage and combine the limit command with the bool function, to verify that the left/right limits coincide with the value of sine function at the limit point x0. This idea is encoded by the following cell: bool(sin(pi/3)==limit(sin(x),x=pi/3, dir="left")) bool(sin(pi/3)==limit(sin(x),x=pi/3, dir="right")) For both these commands Sage’s output is True, as it was expected. 21Hint: Use the identity sin(x) − sin(x0) = 2 cos( x+x0 2 ) sin( x−x0 2 ) and some classical inequalities related to the sine and cosine function. 492 CHAPTER 5. ESTABLISHING THE ZOO 5.B.32. (a) We may multiply the expression inside the given limit by 1+cos(x) 1+cos(x) , so that we can form the difference of two squares in the numerator, that is, lim x→0 1 − cos(x) x = lim x→0 1 − cos(x) x · 1 + cos(x) 1 + cos(x) = lim x→0 1 − cos2 (x) x ( 1 + cos(x) ) = lim x→0 sin2 (x) x ( 1 + cos(x) ) = lim x→0 sin(x) x · lim x→0 sin(x) 1 + cos(x) = 1 · 0 2 = 0 . (b) Here we will use the so called trigonometric “half-angle” identity of the sine function, that is, 2 sin2 (x 2 ) = 1 − cos(x). We have lim x→0 1 − cos x x2 sin(x2) = lim x→0 2 sin2 (x 2 ) x2 sin(x2) = lim x→0 1 2 sin2 (x 2 ) (x 2 )2 sin(x2) = 1 2 ( lim x→0 sin (x 2 ) x 2 )2 · lim x→0 1 sin2 (x2) = 1 2 · ∞ = ∞ . Note that this calculation must be considered “from the back”. Indeed, since the limits on the right-hand side exist (no matter whether finite or infinite) and the expression 1 2 · ∞ is meaningful (see the note after theorem 5.2.13), the original limit exists as well. 
On the other hand, observe that if we split the original limit into the product lim x→0 (1 − cos x) · lim x→0 1 x2 sin(x2) , then we will get the 0 · ∞ type, which is an indeterminate form and tells us nothing about existence of the original limit. Finally, you may like to confirm your computation by Sage, which can be easily done by the cell f(x)=(1-cos(x))/(x^2*sin(x^2)); limit(f(x), x=0) 5.B.33. (a) One can decompose the polynomial in the denominator, and simplify the given expression, i.e., lim x→2 x − 2 √ x2 − 4 = lim x→2 x − 2 √ (x − 2)(x + 2) = lim x→2 √ x − 2 √ x + 2 = 0 4 = 0 . (b) Here one can exploit the rule for limits of compositions of functions (see 5.2.20). Thus lim x→0 sin (sin(x)) x = lim y→0 sin(y) y · lim x→0 sin(x) x = 1 . (c) We see that lim x→0 sin2 (x) x = lim x→0 sin(x) · lim x→0 sin(x) x = 0 · 1 = 0. Notice the original limit exists because both the right-hand side limits exist and their product is well-defined. 5.B.34. For instance, in the first case we have limx→0− f(x) = 1 but limx→0+ f(x) = +∞. Hence limx→0 f(x) does not exist. Let us also explain the situation for k and similarly are treated the function g and h. For the sign function sign : R → R we have k(x) = sign(x) =    1 , if x > 0 , 0 , if x = 0 , −1 , if x < 0 . Taking the sequence (xn = 1/n) we see that sign(1/n) → 1 for n → +∞, while for the sequence (yn = −1/n) we obtain sign(−1/n) → −1, as n → +∞. Since we found two real sequences (xn), (yn) with limn→+∞ xn = c = limn→+∞ yn (where c = 0), and limn→+∞ k(xn) ̸= limn→+∞ k(yn) we deduce that the limit limx→c k(x) does not exist (notice that c = 0 ∈ R is a limit point, see also Proposition 5.2.15). In Sage recall that we can treat limits of functions really quickly. For instance, for the functions h, k to verify the statement type k(x)=sign(x); h(x)=abs(sin(x))/sin(x); lim(k(x), x=0);lim(h(x), x=0) As a side remark notice that the sign function is very similar to the function g : R\{0} → {−1, 1} ∼= Z2, defined by g(x) = |x| /x which is everywhere continuous except at x = 0. In particular, g(0) is not defined and we see that g(x) = { 1 , if x > 0 , −1 , if x < 0 . If you like to sketch the graph of g, say for −2 ≤ x ≤ 2, type f(x)=abs(x)/x; plot(f, x, -2, 2, exclude=[0]) Notice here in our input we added the option exclude, to exclude the point that g is not defined. This syntax produces the known figure, i.e., 493 CHAPTER 5. ESTABLISHING THE ZOO 5.B.36. Let us multiply both the numerator and denominator by 1/α2 , which yields g(x) = lim α→+∞ ( 2x − 3 + 4x 1 α + 2 α2 1 + x α · sin 1 α 1 α ) = lim α→+∞ 2x − 3 + 4x 1 α + 2 α2 1 + x α · lim1 α →0 sin 1 α 1 α = 2x − 3 . This proves that g(x) = 2x − 3, i.e., g(x) represents a line. It remains to prove the second claim. Observe that y = 2x − 3 pass through A. Let M = [xM , yM ] be the middle point of the segment BC. This has coordinates xM = (6 + (−2))/2 = 2 and yM = (2 + 0)/2 = 1. These coordinates also lie on the line y = 2x − 3, and hence y = g(x) is a median of ABC. 5.B.37. (a) We see that lim x→+∞ 3x+1 + x5 − 4x 3x + 2x + x2 = lim x→+∞ 3 · 3x 3x = 3 . (b) Here one computes lim x→+∞ 4x − 8x6 − 2x − 167 3x − 45x − √ 11πx+12 = lim x→+∞ 4x − √ 11π12 · πx = −∞ . (c) In view of the formula (a − b)(a + b) = a2 − b2 , we have lim x→0 √ 1 + x − √ 1 − x x = lim x→0 (1 + x) − (1 − x) x (√ 1 + x + √ 1 − x ) = lim x→0 2 √ 1 + x + √ 1 − x = 2 √ 1 + √ 1 = 1 . 
(d) Similarly we see that lim_{x→π/4} (cos x − sin x)/cos(2x) = lim_{x→π/4} ((cos x + sin x)(cos x − sin x))/((cos x + sin x) cos(2x)) = lim_{x→π/4} (cos^2 x − sin^2 x)/((cos x + sin x) cos(2x)) = lim_{x→π/4} 1/(cos x + sin x) = 1/(√2/2 + √2/2) = √2/2. Notice the reduction above was made thanks to the identity cos(2x) = cos^2(x) − sin^2(x), where x ∈ R. 5.B.42. For the function f we have f(x) = x^x = e^{ln(x^x)} = e^{x ln(x)}. Thus f is continuous as the composition of the continuous functions y1 = e^x and y2 = x ln(x). Similarly, g(x) = x^{cos(x)} = e^{cos(x) ln(x)}, and g is continuous as the composition of two continuous functions, namely y1 = e^x and y3 = cos(x) ln(x). 5.B.43. Since f, g are continuous, the given relation means that 4 · f(2) − g(2) = 2, and so g(2) = 22. 5.B.44. We have already seen such an example in 5.B.34. We mean the sign function f(x) = sign(x), which has a “jump discontinuity” at x = 0. As we explained in 5.B.34, essentially this is because lim_{x→0+} f(x) = 1 ≠ lim_{x→0−} f(x) = −1. Sketch the graph of f to illustrate the situation. 5.B.46. The function 1/x is continuous for all x ∈ R\{0}, and the function sin(x) is continuous for all x ∈ R. Thus the function f is continuous on the set (−∞, 0) ∪ (0, +∞), i.e., everywhere on R except at x0 = 0. In particular, at x0 = 0 it is easy to see that the limit of f does not exist. The function g is similar to f, except that the oscillations are absorbed by the factor x. We see that −|x| ≤ x sin(1/x) ≤ |x|, and thus by the squeeze theorem we get lim_{x→0} g(x) = 0. In particular, g is continuous everywhere on R. 5.B.49. The family has the form fα,β(x) = (√(2x^2 − x + 6) − αx)/(x + 2) for x < −2, and fα,β(x) = x^3 + βx + 4 for x ≥ −2, and we see that for all x < −2 the expression 2x^2 − x + 6 is positive (hence fα,β is well-defined). We require the continuity of fα,β at any x ∈ R, and hence also at x0 = −2, which is equivalent to the condition lim_{x→−2−} fα,β(x) = fα,β(−2) = lim_{x→−2+} fα,β(x). Then we see that ℓ := lim_{x→−2−} (fα,β(x) · (x + 2)) = fα,β(−2) · 0 = 0. On the other hand, from the definition of f and ℓ, we get ℓ = lim_{x→−2−} (√(2x^2 − x + 6) − αx) = √(lim_{x→−2−} (2x^2 − x + 6)) − lim_{x→−2−} (αx) = 4 + 2α. Thus, 4 + 2α = 0, i.e., α = −2. Using this value, we compute lim_{x→−2−} f−2,β(x) = lim_{x→−2−} (√(2x^2 − x + 6) + 2x)/(x + 2) = lim_{x→−2−} (2x^2 − x + 6 − 4x^2)/((x + 2)(√(2x^2 − x + 6) − 2x)) = lim_{x→−2−} (−2x^2 − x + 6)/((x + 2)(√(2x^2 − x + 6) − 2x)) = lim_{x→−2−} ((x + 2)(−2x + 3))/((x + 2)(√(2x^2 − x + 6) − 2x)) = lim_{x→−2−} (−2x + 3)/(√(2x^2 − x + 6) − 2x) = 7/8. On the other hand, fα,β(−2) = −2(β + 2), and we arrive at the equation 7/8 = −2(β + 2), that is, β = −39/16. Thus only the member f_{−2, −39/16} is continuous. Using Sage we can define piecewise functions in many different ways. For instance, one can combine the def command with if and else, as follows: def f(x) : if x<-2 : return((sqrt(2*x^2-x+6)+2*x)/(x+2)) else : return(x^3-(39/16)*x+4) Now, to view f combine either the print or the show command with the assume command. Thus, for example, adding the syntax assume(x < −2); show(f(x)) will give the first component of f, and similarly for the second, where instead one should type assume(x >= −2). To evaluate f at different points x ∈ R just type f(−4), f(−2), f(0), etc.
As for the continuity, you can add the commands show(bool(limit(f(x), x=-2,dir="+")==f(-2))) show(bool(limit(f(x), x=-2, dir="-")==f(-2))) or simply type show(bool(limit(f(x), x=-2)==f(-2))) Finally, to sketch the graph of f add the cell plot(f, x, -10, 10,ymax=8, color="black") The produced figure is here: Another way to determine f and sketch its graph relies on the piecewise function in Sage, which we analyze in 5.C.5. Note that this method currently has some limitations (e.g., in computing limits) and Sage requires a significant amount of time to return an answer (see however 5.C.5 and 6.D.43 for further details). 5.B.51. The given polynomial has at least two roots in (−1, 1). This is because P(−1) > 0, P(0) < 0, P(1) > 0, and thus there must be at least one root in each of the subintervals (−1, 0) and (0, 1); see also 5.2.19. 5.B.52. The solution is based on applying Bolzano’s theorem via Sage, which goes as follows: f(x) = x^3 - cos(x)*e^x + x*sin(x) show(f(0)); show(f(pi/2)) Sage gives f(0) = −1 < 0 and f(π/2) = π^3/8 + π/2 > 0, and hence by Bolzano’s theorem f has at least one root in (0, π/2). As for the implementation of the find_root function, just add the syntax find_root(f, 0, pi/2). Sage’s answer is 0.9221778114841418. 5.C.1. For f(x), using the binomial theorem one computes (x^n)′ = lim_{h→0} ((x + h)^n − x^n)/h = lim_{h→0} ((n 1) x^{n−1} h + (n 2) x^{n−2} h^2 + · · · + h^n)/h = n x^{n−1} + lim_{h→0} ((n 2) x^{n−2} h + · · · + h^{n−1}) = n x^{n−1}. For the derivative of g(x) we rely on identities from trigonometry, such as the one for sin(x + h), and on the results from 5.B.31 and 5.B.32, that is, lim_{h→0} sin(h)/h = 1 and lim_{h→0} (cos(h) − 1)/h = 0, respectively. We have (sin(x))′ = lim_{h→0} (sin(x + h) − sin(x))/h = lim_{h→0} (sin(x) cos(h) + cos(x) sin(h) − sin(x))/h = cos(x) lim_{h→0} sin(h)/h + sin(x) lim_{h→0} (cos(h) − 1)/h = cos(x) · 1 + sin(x) · 0 = cos(x). The square root function h(x) = √x = x^{1/2} is treated similarly, and is left as an exercise. Here one can show that h′(x) = 1/(2√x) for all x ≠ 0, but h is not differentiable at x = 0 (verify this claim by computing the corresponding one-sided derivative). Note that the derivative of the first given function generalizes as (x^a)′ = a x^{a−1}, for all x ∈ R and all a > 0. When a is negative, we can use the relation (x^a)′ = a x^{a−1} only for x ≠ 0. These rules provide a way to compute h′(x) for x ≠ 0: (√x)′ = (x^{1/2})′ = (1/2) x^{1/2 − 1} = (1/2) x^{−1/2} = 1/(2 x^{1/2}) = 1/(2√x). 5.C.2. The claim is based on the properties of f and follows from the definition of the derivative of f: f′(x) = lim_{h→0} (f(x + h) − f(x))/h = lim_{h→0} (f(x)f(h) − f(x))/h = lim_{h→0} f(x)(f(h) − 1)/h = f(x) · lim_{h→0} (f(h) − 1)/h = f(x) · lim_{h→0} (f(0 + h) − f(0))/h = f(x) · f′(0) = f(x) · 1 = f(x). 5.C.3. Recall that f(x) = x if x ≥ 0, and f(x) = −x if x < 0. Thus f′(x) = 1 for x > 0 and f′(x) = −1 for x < 0. For instance, for x > 0 we can choose h small enough such that x + h > 0, and then f′(x) = lim_{h→0} (|x + h| − |x|)/h = lim_{h→0} (x + h − x)/h = 1. Similarly for x < 0. At x0 = 0 the function f cannot be differentiable, since the left-side derivative does not agree with the right-side derivative: lim_{x→0+} (f(x) − f(0))/(x − 0) = lim_{x→0+} (x − 0)/x = 1, lim_{x→0−} (f(x) − f(0))/(x − 0) = lim_{x→0−} (−x − 0)/x = −1. Hence f′(0) does not exist (and this is what we should expect, since the graph of f forms a corner at 0). 5.C.4. If f is differentiable at x0, then the limit lim_{x→x0} (f(x) − f(x0))/(x − x0) exists and equals f′(x0).
Thus we also have (x − x0) · f′ (x0) = (x − x0) · lim x→x0 f(x) − f(x0) x − x0 . Consider now the limits in both sides as x tends to x0. In the l.h.s we obtain a zero, since limx→x0 ( (x − x0) · f′ (x0) ) = f′ (x0) · limx→x0 (x − x0) = f′ (x0) · 0 = 0. For the limit in the r.h.s we write altogether 0 = lim x→x0 (x − x0) · lim x→x0 f(x) − f(x0) x − x0 = lim x→x0 ( (x − x0) · f(x) − f(x0) x − x0 ) = lim x→x0 ( f(x) − f(x0) ) , which implies that limx→x0 f(x) = f(x0). Therefore, f is continuous at x0. 5.C.6. In order to use the derivative sin′ (x) = cos(x) and the chain rule, we first need to use the identity cos(x) = sin(x+ π 2 ). This allows us to view cos(x) as the composition cos(x) = f(g(x)), with f(x) = sin(x) and g(x) = x + π 2 , respectively. Then, by the chain rule we get (cos(x))′ = g′ (x) · f′ (g(x)) = 1 · cos(x + π 2 ) = cos(x) cos(π/2) − sin(x) sin(π/2) = 0 − sin(x) = − sin(x) . 5.C.10. We have f(1) = α and since f′ (x) = 2αx − 4 ln(x) − 4, it follows that f′ (1) = 2α − 4. Thus the tangent of f at the point P = [1, α] is given by y − α = (2α − 4)(x − 1) ⇐⇒ y = (2α − 4)x − α + 4 . We finally see that this line passes through [0, 0] if and only if 0 = (2α − 4) · 0 − α + 4, that is, α = 4. 5.C.13. The answer is y = 2 − x; y = x. 5.C.16. It is useful to recall that the tangent function tan(x) = sin(x)/ cos(x) is not defined at any x = π 2 + κπ, with κ ∈ Z. Let us first plot tan(x) (in fact, let us present the graph of a periodic extension of the tangent function on (−3π/2, 3π/2)). To obtain this we used Sage via the following block: p=plot(tan(x), x, -pi/2, pi/2, ymax=5, ymin=-5, color="black") p+=plot(tan(x-pi), x, -3*pi/2, -pi/2, ymax=5, ymin=-5, color="black", detect_poles="False") p+=plot(tan(x+pi), x, pi/2, 3*pi/2, ymax=5, ymin=-5, color="black", detect_poles="False") p.show(ticks=pi/2, tick_formatter=pi, aspect_ratio="1") 497 CHAPTER 5. ESTABLISHING THE ZOO In this block ensure to include the options ymax=5, ymin=-5 for plotting correctly the tangent function. From the figure, it is evident that tan(x) is strictly increasing for all x ∈ (−π/2, π/2). Let us substantiate this claim rigorously with a mathematical proof. Let x1, x2 be real numbers such that −π 2 < x1 < x2 < π 2 . Based on the identity sin(α − β) = sin(α) cos(β) − sin(β) cos(α) we see that tan(x2) − tan(x1) = sin(x2) cos(x2) − sin(x1) cos(x1) = sin(x2) cos(x1) − sin(x1) cos(x2) cos(x1) cos(x2) = sin(x2 − x1) cos(x1) cos(x2) > 0 , since 0 < x2 − x1 < π. Thus, for x1 < x2 we have tan(x2) − tan(x1) > 0 ⇐⇒ tan(x1) < tan(x2) ⇐⇒ f(x1) < f(x2) , which shows that the tangent function is increasing on (−π/2, π/2). Alternatively, we can compute the first derivative of f, given by f′ (x) = tan′ (x) = 1 cos2(x) (see 5.E.76). Thus for any x ∈ (−π/2, π/2) we get f′ (x) > 0 and our claim follows. 5.C.19. The answer is given by π 6 − 2√ 3 0.003. 5.C.20. The answers are 1 2 − √ 3π 360 and √ 2 2 + √ 2π 360 , respectively. 5.C.24. The function f is continuous on its domain [−2, 2]. We have f′ (x) = 2 3 x− 1 3 = 2 3x1/3 . Thus f′ is not defined at x0 = 0. Since x0 ∈ (−2, 2), this means that we cannot apply the mean value theorem. In the opposite, for the functions g, h the mean value theorem applies. Indeed, g is continuous on [1, 4] with g′ (x) = 1 (x+1)2 . Thus g is also differentiable on the open interval (1, 4) and according to the Theorem 5.3.9 there exists c ∈ (1, 4) satisfying g′ (c) = g(4)−g(1) 4−1 = 4/5−1/2 3 = 1 10 . 
From this we get (1+c)2 = 10, that is, c = ± √ 10−1 and since c ∈ (1, 4) we can accept only the value c = √ 10−1 ≈ 2.163. Similarly is treated the case for h. 5.C.26. It is sufficient to apply Cauchy’s mean value theorem for the functions f(x) = xn+1 and g(x) = xn , which are both continuous on [a, b], and differentiable on (a, b). Also, by assumption 0 ̸∈ (a, b) and hence g′ (x) = nxn−1 ̸= 0 for all x ∈ (a, b). Thus f, g satisfy the requirements of the Cauchy’s mean value theorem. This gives that g(b) ̸= g(a) and moreover there exists c ∈ (a, b) such that f(b) − f(a) g(b) − g(a) = f′ (c) g′(c) ⇐⇒ bn+1 − an+1 bn − an = (n + 1)cn ncn−1 ⇐⇒ n n + 1 (bn+1 − an+1 bn − an ) = c , for any n ∈ N\{0}. Since c ∈ (a, b) we are done. 5.C.31. (a) v(0) = 6 m/s; (b) t = 3 s, s(3) = 16 m; (c) v(4) = −2 m/s, a(4) = −2 m/s2 . 5.C.34. Such an example occurs by considering the limit limx→0 sin(x) x+1 , which obviously equals to 0 and is not of some indeterminate form. Another example is the limit limx→0 sin(x) ex . 5.D.2. Often, we may use Sage to evaluate convergent infinite series, or verify summation identities. Recall from Chapter 1 that this can be done via the command sum, whose general form is sum(f(n), n, a, b) (formally, this corresponds to the sum ∑b n=a f(n)). Of course, one can replace b by Infinity or its alias oo, and this corresponds to infinite sums. This method can be applied for example when one tries to evaluate the zeta function ζ(p), for certain p, as in our example. In particular, an explicit implementation of our task goes as follows: var("n") show(sum(1/n^2, n, 1, oo)) show(sum(1/n^3, n, 1, oo)) show(sum(1/n^4, n, 1, oo)) show(sum(1/n^5, n, 1, oo)) show(sum(1/n^6, n, 1, oo)) show(sum(1/n^7, n, 1, oo)) show(sum(1/n^8, n, 1, oo)) show(sum(1/n^9, n, 1, oo)) Let us put Sage’s output in a table: p ζ(p) p ζ(p) 2 π2 /6 3 ζ(3) 4 π4 /90 5 ζ(5) 6 π6 /945 7 ζ(7) 8 π8 /9450 9 ζ(9) 498 CHAPTER 5. ESTABLISHING THE ZOO Hence Sage is able to evaluate explicitly ζ(p) for p = 2, 4, 6, 8. However, as we expected, Sage is unable to compute the cases p = 3, 5, 7, 9. In these instances, Sage indicates that the sum we are trying to evaluate corresponds to the zeta function. We will encounter further related applications in the sequel (see for example 5.D.8). 5.D.7. The series S1 converges absolutely. For instance, we see that ∞∑ n=1 sin(n) n2 ≤ ∞∑ n=1 1 n2 < ∞∑ n=0 1 2n = 2 . Passing to S2, one observes that this is an alternating series (since cos (πn) = (−1)n , n ∈ N). Moreover, we see that the sequence of the absolute values of its terms is decreasing, and lim n→∞ 1 3 √ n2 = 0. It follows that the series S2 is convergent. In addition, we see that ∞∑ n=1 cos(πn) 3 √ n2 = ∞∑ n=1 1 3 √ n2 ≥ ∞∑ n=1 1 n = +∞ , which implies that S2 converges also conditionally. 5.D.9. We see that 1 ≤ 1 , 1 22 + 1 32 < 2 · 1 22 = 1 2 , 1 42 + 1 52 + 1 62 + 1 72 < 4 · 1 42 = 1 4 , . . . , and more general 1 (2n)2 + · · · + 1 (2n+1 − 1)2 < 2n · 1 (2n)2 = 1 2n , n = 1, 2, . . . Hence, by comparing the terms of both of the series, we get the required inequality. By the way, notice that from this inequality it follows that the series ∑∞ n=1 1 n2 converges absolutely. Eventually, let us specify that ∞∑ n=1 1 n2 = π2 6 < 2 = ∞∑ n=0 1 2n . 5.D.15. The radius of convergence equals to r = +∞. 5.D.16. The radius of convergence equals to 1. 5.D.17. The domain of convergence is the closed interval [−1, 1]. 5.D.18. The answer is x ∈ [ 2 − 1 3 , 2 + 1 3 ] . 5.D.19. The series converges for all x ∈ [− 3 √ 2, 3 √ 2). 5.D.20. 
The series converges for all x ∈ [−1, 1] 5.E.3. The obvious answer is x2 . 5.E.4. The natural cubic spline interpolating the given data is given by S1(x) ≡ x; S2(x) ≡ x. 5.E.5. The complete cubic interpolation spline in question is given by Si(x) = x + 3, x ∈ [−3 + i − 1, −3 + i]; i ∈ {1, 2}. 5.E.6. The mentioned values are given by cos2 (π 4 ) = 1/2, cos2 (π 3 ) = 1/4, cos2 (π 2 ) = 0. Since the third value is zero, we need to compute only the values of the first two elementary Lagrange polynomials at the given points. Based on the definition of such polynomials and the given data, it follows that ℓ0(1) = (1 − π 3 )(1 − π 2 ) (π 4 − π 3 )(π 4 − π 2 ) = 8 (π − 3)(π − 2) π2 , ℓ1(1) = (1 − π 4 )(1 − π 2 ) (π 3 − π 4 )(π 3 − π 2 ) = −9 (π − 4)(π − 2) π2 . Thus, we deduce that P(1) = 1 2 · 8 (π − 3)(π − 2) π2 − 1 4 · 9 (π − 4)(π − 2) π2 + 0 = (7π − 12)(π − 2) 4π2 ≈ 0.288913 . Note that the actual value is approximately cos2 1 ≈ 0.291927. 5.E.8. Such a polynomial is given by x4 + 2x3 − x2 + x − 2. 5.E.15. Consider any singleton (one-element set) X ⊂ R. 5.E.16. The set C must be a singleton. Thus, let us choose C = {0}, for example. Now as A, B we can take the open subsets A = (−1, 0), and B = (0, 1), respectively. 499 CHAPTER 5. ESTABLISHING THE ZOO 5.E.19. Using the properties of powers one can express the general term of the given sequence as an = (√ 2 · 4 √ 2 · 8 √ 2 · · · 2n√ 2 ) = 2 1 2 · 2 1 4 · 2 1 8 · · · 2 1 2n = 2 1 2 + 1 4 + 1 8 +···+ 1 2n . Thus, in combination with the continuity of the exponential function (a property discussed in the paragraph 5.2.22), we get lim n→∞ an = 2 lim n→∞ ( 1 2 + 1 4 + 1 8 +···+ 1 2n ) = 2 ( ∞∑ n=1 1 2n ) . Next we use a known formula for the sum of geometric series: ∞∑ n=1 ( 1 2 )n = 1 2 · 1 1 − 1 2 = 1. Thus finally lim n→∞ an = 2. Notice in Sage we can simply type n=var("n"); b=2**(sum(1/2**n, n, 1, oo)); b which answers 2, and the claim follows. 5.E.20. Notice that every natural number k ≥ 2 satisfies the relation 1 (k−1)k = 1 k−1 − 1 k . (this identity is a special case of the so called partial fraction decomposition). Therefore, we compute lim n→∞ ( 1 1 · 2 + 1 2 · 3 + 1 3 · 4 + · · · + 1 (n − 1) · n ) = lim n→∞ ( 1 1 − 1 2 + 1 2 − 1 3 + 1 3 − 1 4 + · · · + 1 n − 1 − 1 n ) = lim n→∞ ( 1 − 1 n ) = 1 . Let us mention that this limit determines the sum of one of the so-called “telescoping series” (used by J. Bernoulli already about 300 years ago). In Sage we can verify the result as usual, i.e., by the cell var("i, n"); limit(sum(1/(i*(i+1)), i, 1, n), n=oo) 5.E.21. We see that lim n→∞ ( 1 n2 + 2 n2 + · · · + n − 2 n2 + n − 1 n2 ) = lim n→∞ ( 1 + n − 1 n2 · n − 1 2 ) = 1 2 . In Sage we can type limit(sum(i/n ∗ ∗2, i, 1, n − 1), n = oo), which provides a quick confirmation of our computation. 5.E.23. For these tasks one relies on the relation lim n→∞ ( 1 + a n )n = ea , where a ∈ R, see 5.B.18 . This means that e−1 = 1 e = lim n→∞ ( 1 − 1 n )n = lim n→∞ ( n − 1 n )n . Thus, the substitution m = n − 1 gives lim n→∞ ( n − 1 n )n = lim m→∞ ( m m + 1 )m+1 = lim m→∞ ( m m + 1 )m · lim m→∞ m m + 1 . Clearly, the second limit is equal to 1 and replacing n with m we get the result 1 e = lim n→∞ ( n n + 1 )n . In Sage we can easily verify this by typing n = var(”n”); f(n) = (n/(n + 1))n ; lim(f(n), n = oo). For the second limit we see that lim n→∞ ( 1 + 1 n2 )n = lim n→∞ ( 1 + 1 n2 )n2 n = lim n→∞ (( 1 + 1 n2 )n2 ) 1 n = e0 = 1 . For the third limit we will apply only Sage and leave to the reader the formal details. 
Hence type n=var("n"); g(n)=(1-1/n)**(n^2); lim(g(n), n=oo) This returns the value 0. 5.E.27. (a) We have a > 0, thus we also get 0 < na < na + 1 and 0 < 1/(na + 1) < 1/(na), for all positive n. This gives |1/(1 + na) − 0| = 1/(1 + na) < 1/(na) = (1/a) · (1/n), for all n = 1, 2, . . . . As the sequence (1/n) tends to 0 and 1/a > 0, the inequality obtained above allows us to invoke the result from 5.E.25, with n ≥ N = 1. Thus lim_{n→∞} 1/(1 + na) = 0. (b) Obviously, the sequence (an) with general term an = x^n/n! satisfies an ≠ 0 for all n. Moreover, ℓ = lim_{n→∞} |a_{n+1}/a_n| = lim_{n→∞} |x^{n+1}/(n + 1)! · n!/x^n| = lim_{n→∞} |x|/(n + 1) = 0 < 1. Thus, by the second result in 5.E.26 we get lim_{n→∞} x^n/n! = 0, for all x ∈ R. 5.E.31. By definition, x1 = 1 > 0. Assuming that xn > 0 for some n, we get x_{n+1} = 1/(4 + xn) > 0. Thus, by induction we see that (xn) is a positive sequence, that is, xn > 0 for all n. Using this, for all n ≥ 2 one computes |x_{n+1} − xn| = |1/(4 + xn) − 1/(4 + x_{n−1})| = |x_{n−1} − xn|/((4 + xn)(4 + x_{n−1})) < (1/16)|xn − x_{n−1}|. Thus (xn) is a contractive sequence, and by 5.E.30 it is Cauchy, and hence convergent. 5.E.32. Obviously, {ak : k ≥ n + 1} ⊆ {ak : k ≥ n}, and hence s_{n+1} = sup{ak : k ≥ n + 1} ≤ sup{ak : k ≥ n} = sn. This shows that the sequence (sn) is decreasing. Similarly, the sequence (ℓn) is increasing, since ℓn ≤ inf{ak : k ≥ n + 1} = ℓ_{n+1}; see also the discussion in 5.4.6. By assumption, (an) is bounded, hence (sn) and (ℓn) are also bounded, and the result follows by 5.B.11. 5.E.36. The naturals N form a closed subset of R (recall by 5.B.20 that there are no limit points). Another explanation follows from the fact that any b ∉ N has a small neighbourhood disjoint from N. On the other hand, the subset Q ⊂ R is neither closed nor open. Indeed, in 5.B.20 we saw that the set of all limit points of Q is the real line R, so Q cannot be closed. Moreover, we saw that there are no interior points, so Q cannot be open. 5.E.37. (a) The open subsets are R∗ and R\Z. (b) Only the set [1, 2] ∪ {5} is closed. 5.E.47. We will show that the sequence is strictly increasing and bounded, so by the monotonicity theorem (5.B.11) it must be convergent. Obviously, a1 = √2 ≤ a2 = √(2 + √2). Assume that ak < a_{k+1} for some k. Using the induction hypothesis we see that 2 + ak < 2 + a_{k+1}, or √(2 + ak) < √(2 + a_{k+1}), that is, a_{k+1} < a_{k+2}. Hence, an < a_{n+1} for all n ∈ N, which means that (an)_{n∈N} is strictly increasing. Notice now that a1 = √2 < 2. If also a_{n−1} < 2, then we get 2 + a_{n−1} < 4, that is, √(2 + a_{n−1}) < 2, or equivalently an < 2. Thus (an) is bounded and by the monotonicity theorem (5.B.11) it is convergent, lim_{n→+∞} an = sup{an : n ∈ N} = 2. To verify this final conclusion, let x = lim_{n→+∞} an. Then taking limits on both sides of the relation an = √(2 + a_{n−1}) we see that √(x + 2) = x, which has two solutions (x = −1, x = 2), of which only the positive one is acceptable. 5.E.49. We have lim_{x→0} sin^2(x)/x = lim_{x→0} sin(x) · lim_{x→0} sin(x)/x = 0 · 1 = 0. Apparently, lim_{x→0} x/sin(x) = 1^{−1} = 1, while the limit lim_{x→0} 1/sin(x) does not exist. Similarly, using the rule for the limit of a product, we see that the limit lim_{x→0} x/sin^2(x) does not exist. For the calculation of the limit lim_{x→0} arcsin(x)/x use the identity x = sin(arcsin(x)), which makes sense for any x ∈ (−1, 1). This gives lim_{x→0} arcsin(x)/x = lim_{x→0} arcsin(x)/sin(arcsin(x)) = lim_{y→0} y/sin(y) = 1, where we substitute y = arcsin(x).
Observe that y → 0 follows from substituting x = 0 into y = arcsin(x) and from continuity of this function at 0 (this also guarantees that such a substitution can be made). Next, see that lim x→0 3 tan2 (x) 5x2 = lim x→0 ( 3 5 · sin(x) x · sin(x) x · 1 cos2(x) ) = 3 5 · lim x→0 sin(x) x · lim x→0 sin(x) x · lim x→0 1 cos2(x) = 3 5 · 1 · 1 · 1 = 3 5 . For the next case it useful to sketch the graph of the function sin(3x)/ sin(5x), via Sage. This is given by 501 CHAPTER 5. ESTABLISHING THE ZOO This graph occurs via the following cell in Sage: a=plot(sin(3*x)/sin(5*x), x, -2*pi, 2*pi, detect_poles="show", ymin=-10,ymax=10, color="black") b=point((0, 3/5), size = 70, color="black") c=text("3/5", (0.4, 0.3), color="black", fontsize="10"); show(a+b+c) As we see, the limit under question must be equal to 3/5. Let us verify this in a formal way: lim x→0 sin(3x) sin(5x) = lim x→0 ( sin(3x) 3x · 5x sin(5x) · 3 5 ) = lim x→0 sin(3x) 3x · lim x→0 5x sin(5x) · 3 5 = lim y→0 sin y y · lim z→0 z sin z · 3 5 = 3 5 . For a confirmation via Sage, type lim(sin(3 ∗ x)/sin(5 ∗ x), x = 0). We leave to the reader to show that the final limit equals to 3/5, as well. 5.E.50. Based on the fact that lim x→0 ex −1 x = 1, for the first limit we obtain lim x→0 e5x − e2x x = lim x→0 ( e2x e(5−2)x −1 (5 − 2)x (5 − 2) ) = lim x→0 e2x · lim x→0 e3x −1 3x · 3 = e0 · lim y→0 ey −1 y · 3 = 1 · 1 · 3 = 3 . In a similar way one computes the second limit, and we leave the details for practice. We present the solution in Sage and include in a figure both plots. G(x)=(e**(5*x)-e**(2*x))/x; E(x)=(e**(5*x)-e**(-x))/sin(2*x); show(lim(E(x), x=0)) a=plot(E(x), x, -pi, pi, detect_poles="show", ymin=-2, ymax=4, color="black") b=plot(G(x), x, -pi, pi, detect_poles="show", ymin=-2, ymax=4, color="grey") show(a+b) Executing this block we get the confirmation for the limit and desired illustration of the graphs, see here: 502 CHAPTER 5. ESTABLISHING THE ZOO 5.E.52. Obviously, limx→+∞ ( 2 + 1 x ) = 2, limx→+∞ 1 x = 0 and limx→+∞ x = +∞. Thus, limx→+∞ ( 2 + 1 x )1 x = 20 = 1 and limx→+∞ x−x = limx→+∞ (1 x )x = 0. Equally well, we could first compute the limit lim x→+∞ (1 x · ln(2 + 1 x ) ) = 0·ln 2 = 0, and then, as a generalization of the formula lim x→x0 f(x)g(x) = e lim x→x0 (g(x) ln(f(x))) , we obtain limx→+∞ ( 2 + 1 x )1 x = e lim x→+∞ ( 1 x ·ln(2+ 1 x ) ) = e0 = 1. Similarly, we see that lim x→+∞ (−x ln(x)) = −∞, and hence lim x→+∞ x−x = elimx→+∞(−x ln(x)) = e−∞ = 0 . Finally one must be cautious when calculating the final limit. Both one-sided limits exist, but are different, which implies that this limit does not exist: lim x→0+ e 1 x = elimx→0+ 1 x = e∞ = ∞ and lim x→0− e 1 x = elimx→0− 1 x = e−∞ = 0. In Sage one can directly solve the task by the cell f(x)=(2+(1/x))**(1/x); g(x)=x**(-x); j(x)=e**(1/x) show([lim(f(x),x=+oo), lim(g(x),x=+oo), lim(j(x),x=0)]) 5.E.53. The first limit equals to +∞. For the second one, after multiplying by the fraction √ x2+x+x√ x2+x+x it follows that limx→+∞( √ x2 + x − x) = 1/2. For the third case we see that limx→+∞(x √ 1 + x2 − x2 ) = 1/2. Finally, limx→0− √ 1+tan(x)− √ 1−tan(x) sin(x) = 1. 5.E.54. In Sage we can directly type x, n=var("x, n"); f(x)=((1+2*n*x)**n-(1+n*x)**(2*n))/(x**2); lim(f(x), x=0) and this returns the answer −n3 . To verify this in a formal way, we will use the binomial theorem. This gives (1 + 2nx) n = 1 + ( n 1 ) 2nx + ( n 2 ) (2nx) 2 + P (x) x3 , (1 + nx) 2n = 1 + ( 2n 1 ) nx + ( 2n 2 ) (nx) 2 + Q (x) x3 , for some polynomials P, Q. 
Let us emphasize on the fact that this really holds for all n ∈ N∗ (if n = 1, the polynomials P, Q as zero constants). So, for all x ∈ R we obtain the relations (1 + 2nx) n = 1 + 2n2 x + 2n3 (n − 1) x2 + P (x) x3 , (1 + nx) 2n = 1 + 2n2 x + n3 (2n − 1) x2 + Q (x) x3 . Based on these formulas we are now able to present the result, as follows: lim x→0 (1 + 2nx) n − (1 + nx) 2n x2 = lim x→0 ( 2n3 (n − 1) − n3 (2n − 1) ) x2 + (P(x) − Q(x)) x3 x2 = lim x→0 ( −n3 + (P(x) − Q(x)) x ) = −n3 + 0 = −n3 . 5.E.55. By the definition of (an) and based on the trigonometric inequality |cos(x)| ≤ 1, (x ∈ R), we see that |an+1 − an| = cos(n + 1) 2n+1 ≤ 1 2n+1 < 1 2n . Thus |an+1 − an| < 1 2n and it follows that (an) is a Cauchy sequence. Thus it converges (in fact one can show that a sequence (an) satisfying |an+1 − an| < cn , for some c with 0 < c < 1 and for all n ∈ Z+, is a Cauchy sequence). 5.E.58. First, let us calculate the one-sided limits at the point x0 = 0: lim x→0− (x − 1)− sgn(x) = lim x→0− (x − 1) = −1 , lim x→0+ (x − 1)− sgn(x) = lim x→0+ 1 x − 1 = −1 , whence limx→0(x−1)− sgn(x) = −1. However, the function will be continuous at 0 = 0, if and only if f(0) = −1. Recall by the solution given in 5.B.34 that for the sign function we have adopted the convention sgn(0) = 0. Thus f(0) = (−1)0 = 1, and the function at hand is not continuous at x0 = 0. Similarly, for x = 1 we obtain lim x→1− (x − 1)− sgn(x) = lim x→1− 1 x − 1 = −∞ , lim x→1+ (x − 1)− sgn(x) = lim x→1+ 1 x − 1 = ∞ . Hence both one-sided limits at the point 1 exist, yet they differ. Thus the limit lim x→1 f(x) does not exist, and the function is not continuous at x1 = 1, as well. 5.E.59. The function is continuous at the points −π, 0, π, only right-continuous at the point 2, only left-continuous at the point 3, and discontinuous at 1. 5.E.60. The function is continuous if and only if p = 2. 503 CHAPTER 5. ESTABLISHING THE ZOO 5.E.61. The answer is a = 4. 5.E.62. The given function is continuous at every point of its domain R\{±1}. Thus, the extended function will be continuous if and only if f(−1) = lim x→−1 ( ( x2 − 1 ) sin 2x − 1 x2 − 1 ) , f(1) = lim x→1 ( ( x2 − 1 ) sin 2x − 1 x2 − 1 ) , such that the limits in the right hand side exist and are finite (if either of these limits did not exist, or are infinite, then f cannot be extended to a continuous function). Indeed, since any x ∈ R, x ̸= ±1 satisfies sin ( 2x−1 x2−1 ) ≤ 1, it follows that − x2 − 1 ≤ f(x) ≤ x2 − 1 , ∀ x ∈ R , x ̸= ±1 . Clearly limx→±1 x2 − 1 = 0, and by the squeeze theorem we get f(±1) = 0. 5.E.70. If f : [a, b] → R is injective then we have f(a) ̸= f(b), and we may assume that f(a) < f(b) (otherwise consider −f). We will show that f is strictly increasing. First we will show that for any x ∈ (a, b) we have f(a) < f(x) < f(b). Assume in the contrary that f(x) ≤ f(a) or f(x) ≥ f(b). Because f is injective we essentially assume that f(x) < f(a) or f(x) > f(b). Case f(x) < f(a): This means that f(x) < f(a) < f(b) and by applying (the general version of) the intermediate value theorem for the restriction f [x,b] we find some x0 ∈ (x, b) with f(x0) = f(a), a contradiction since f is injective. Case f(x) > f(b): This means that f(a) < f(b) < f(x) and by applying (the general version of) the intermediate value theorem for the restriction f [a,x] we find some x0 ∈ (a, x) with f(x0) = f(b), which is again a contradiction by injectivity. Based now on our claim and taking x1, x2 ∈ R with a < x1 < x2 < b it is easy to see that f(x1) < f(x2), that is, f is strictly increasing. 
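As a quick numerical illustration of the statement in 5.E.70, one may sample an injective continuous function on a grid and check that consecutive values increase. The following cell is only a minimal sketch: the function x + sin(x) and the grid of 101 points on [0, 4] are our own arbitrary choices, not part of the exercise.
xs = [k*4/100 for k in range(101)]            # uniform grid on [0, 4]
vals = [n(x + sin(x)) for x in xs]            # x + sin(x) is injective and continuous on [0, 4]
all(vals[i] < vals[i+1] for i in range(100))
Sage returns True, in accordance with the strict monotonicity guaranteed by 5.E.70 (of course, such a finite check only illustrates the claim; it does not prove it).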
5.E.77. We will solve this problem using logarithmic differentiation. In particular, let f be a positive function for which f′ (x) exists. Then recall that (ln f(x)) ′ = f′ (x) f(x) , that is, f′ (x) = f(x) · (ln f(x))′ . Now, the given function is differentiable as a product of differentiable functions. Thus, by combining the previous rule with basic properties of the natural logarithm, we obtain ( 4 √ x − 1 · (x + 2)3 ex(x + 132)2 )′ = 4 √ x − 1 · (x + 2)3 ex(x + 132)2 · [ ln 4 √ x − 1 · (x + 2)3 ex(x + 132)2 ]′ = 4 √ x − 1 · (x + 2)3 ex(x + 132)2 · [ 3 ln (x + 2) + 1 4 ln (x − 1) − x ln(e) − 2 ln (x + 132) ]′ = 4 √ x − 1 · (x + 2)3 ex(x + 132)2 [ 3 x + 2 + 1 4 (x − 1) − 1 − 2 x + 132 ] . In Sage to verify this complicated computation use the cell f(x)=(((x-1)**(1/4))*(x+2)^3)/(exp(x)*(x+132)^2) show(diff(f(x), x).simplify()) 5.E.78. (1) a′ (x) = x2 sin(x); (2) b′ (x) = cos (sin(x)) cos(x); (3) c′ (x) = 3x2 +2 x3+2x cos ( ln ( x3 + 2x )) ; (4) d′ (x) = 2(1−2x) (1−x+x2)2 ; (5) ε′ (x) = 7 8 x− 1 8 ; (6) f′ (x) = cos(x) cos (sin(x)) cos (sin (sin(x))); (7) g′ (x) = cos x 3 3√ sin2 x ; (8) h′ (x) = 2x2 1−x6 3 √ 1+x3 1−x3 . 5.E.79. (a) Such a function is the floor function f(x) = ⌊x⌋ introduced in 5.B.48. For instance, we have ⌊−2.2⌋ = −3 and ⌊2.2⌋ = 2, thus f(−x) ̸= f(x) and f cannot be even. Moreover, ⌊−2.2⌋ = −3 ̸= −2 = −⌊2.2⌋, and hence the floor function is neither odd. (b) Suppose that f is even, that is, f(−x) = f(x) for all x in the domain of f. Taking the derivatives in both sides of this equation and based on the chain rule, we get f′ (x) = (f(−x))′ = (−x)′ f′ (−x) = −f′ (−x). This means that f′ (−x) = −f′ (x) and hence f′ is odd. Similarly the case with f odd. 5.E.80. The derivative f′ (x) of a polynomial f provides the slope of the tangent line to its graph at the point [x, f(x)] ∈ R2 . In degree zero, f is a non-zero constant a · x0 = a · 1 = a and its derivative vanishes everywhere. This confirms the fact that the graph in this case is the line y = a, parallel to the x-axis. In degree one, f(x) = ax + b for some a, b ∈ R with a ̸= 0, thus its derivative is the constant a. Consequently, the slope of the tangent is a everywhere, which is the property 504 CHAPTER 5. ESTABLISHING THE ZOO characterizing a line. In degree two, f(x) = a x2 + b x + c with a, b, c ∈ R and a ̸= 0. Obviously, f′ (x) = 2a x + b and x = − b 2a is the only point where the slope is zero, i.e., the maximum or the minimum of the quadratic function, as pointed out in 5.A.2. In degree three f(x) = a x3 + b x2 + c x + d with a, b, c, d ∈ R and a ̸= 0. The derivative is the second degree polynomial f′ (x) = 3a x2 + 2b x + c. Clearly, there are points with tangent slope zero if and only if the latter quadratic expressions allows zero values. This happens if and only if its discriminant 4(b2 − 3a c) is non-negative. In particular, there will be two bumps on the graph if and only if b2 > 3ac. The slope of the graph gets to zero only in one point if b2 = 3a c, and then the graph curve looks similarly as the graph of f(x) = x3 at x = 0. 5.E.82. Observe first that (sinh(x))′ > 0 for all x ∈ R, hence f is strictly increasing on R (as we see also in the graph of f above). Thus it has an inverse function, called the “inverse hyperbolic sine function” and denoted by sinh−1 (x). The domain and the range of f are both (−∞, ∞), hence the same applies for sinh−1 . To compute its explicit form, let y = sinh−1 (x). Then x = sinh(y) = ey − e−y 2 , that is, ey − e−y −2x = 0. 
Multiplying both sides of this equation by e^y we arrive at the equation e^{2y} − 2x e^y − 1 = 0, which is obviously a quadratic equation in the variable t = e^y. Its solutions have the form t± = x ± √(x^2 + 1). However, t = e^y > 0, but x − √(x^2 + 1) < 0, from which we deduce that e^y = x + √(x^2 + 1), or equivalently y = ln(x + √(x^2 + 1)), i.e., sinh^{−1}(x) = ln(x + √(x^2 + 1)) for all x ∈ R. You may like to confirm this expression in Sage, based on the command arcsinh which corresponds to the function sinh^{−1}. This, for example, can be done via the cell bool(arcsinh(x)==ln(x+sqrt(x^2+1))) Recall now that (ln f(x))′ = f′(x)/f(x) and (√g(x))′ = g′(x)/(2√g(x)), where f, g are both positive on their domain. Thus, (sinh^{−1}(x))′ = (ln(x + √(x^2 + 1)))′ = (1 + x/√(x^2 + 1))/(x + √(x^2 + 1)) = 1/√(x^2 + 1), x ∈ R. As a confirmation via Sage, just type show(diff(ln(x+sqrt(x^2+1)), x).factor()) 5.E.86. (1) It does not, because the one-sided derivatives differ. In particular, we get π/2 from the right and −π/2 from the left. (2) It does not. (3) To provide an answer one should recall the derivative of the absolute value function. In particular, the function f(x) := |x − 5| + |x − 9| has the desired properties. (4) Suppose that f(x) = g(x) = 1 at the rational numbers and f(x) = g(x) = −1 at the irrational numbers. (5) It is not hard to see that f′(x) < 0, for all x > e. 5.E.87. We see that f′(x) = 5 − 4 sin(x) > 0. This means that f is increasing and hence a bijection. Thus f^{−1} exists, and since f(0) = 4 we have f^{−1}(4) = 0. Moreover, f′(f^{−1}(4)) = f′(0) = 5 ≠ 0, hence we may apply 5.3.6, which gives (f^{−1})′(4) = 1/f′(f^{−1}(4)) = 1/f′(0) = 1/5. 5.E.89. From the conditions P(0) = 1 and P′(0) = 1 it follows that P(x) = bx^3 + cx^2 + x + 1, for some b, c ∈ R. The two remaining conditions determine two equations for the variables b and c: b + c + 2 = 2a + 2, 3b + 2c + 1 = 5a + 1, with the unique solution b = c = a. Therefore, a polynomial satisfying the desired conditions has the form P(x) = ax^3 + ax^2 + x + 1, with a ∈ R. The derivative of P is a parabola given by P′(x) = 3ax^2 + 2ax + 1, and we require P′(x) > 0 or P′(x) < 0, which is equivalent to saying that the discriminant ∆ = 4a^2 − 12a of P′ is negative, ∆ < 0. This gives 4a(a − 3) < 0, which is true for all a ∈ (0, 3). 5.E.91. Let x0 ∈ R be an arbitrary point. By means of the definition of the first derivative of a function on R, and in combination with the relation f(x + y) = (f(x) + f(y))/(1 − f(x)f(y)), we see that f′(x0) = lim_{h→0} (f(x0 + h) − f(x0))/h = lim_{h→0} ((f(x0) + f(h))/(1 − f(x0)f(h)) − f(x0))/h = lim_{h→0} (f(x0) + f(h) − f(x0) + f^2(x0)f(h))/(h(1 − f(x0)f(h))) = lim_{h→0} (1 + f^2(x0))/(1 − f(x0)f(h)) · lim_{h→0} f(h)/h = (1 + f^2(x0))/(1 − f(x0)f(0)) · f′(0) = f′(0) · (1 + f^2(x0)). Here we have used that f(0) = 0, which we get from the relation f(x + y) = (f(x) + f(y))/(1 − f(x)f(y)) by setting y = 0, and moreover that f is differentiable at 0, and hence lim_{h→0} f(h)/h = lim_{h→0} (f(h + 0) − f(0))/(h − 0) = f′(0). Thus f′(x) = f′(0)(1 + f^2(x)) for all x ∈ R, and f is differentiable on R. 5.E.95. (a) The function has a local maximum at the point x1 = e^{−2}. It has a local minimum at the point x2 = 1. (b) The answer is 1/∛e. (c) The answer is 4 = p(−1) = p(2), −16 = p(−3). (d) No. For instance, if α = √2/2, there is only a local extremum at the point. 5.E.96. The given function f has as its domain the whole real line, and we have f′(x) = −e^{−x} = −1/e^x < 0 for any x ∈ R. Thus f is strictly decreasing on R. 5.E.97. Obviously, a solution is given by x = 1.
We will show that it is unique by using the function f(x) = x^{2025} + 2024x + ln x − 2025. It is sufficient to show that f is strictly increasing throughout its domain A = (0, +∞). Obviously, f is differentiable over A and its derivative is given by f′(x) = 2025x^{2024} + 2024 + 1/x. But f′(x) > 0 for all x ∈ A, so we are done. 5.E.103. Consider the function g(x) = x − ln(x + 1), with g(0) = 0. The first derivative of g is given by g′(x) = 1 − 1/(x + 1) = x/(x + 1). Thus g′(x) = 0 if and only if x = 0. For x > 0 we obviously have g′(x) > 0, so g is strictly increasing on the open interval (0, +∞). Hence, for all x > 0 we have g(x) > g(0) = 0, that is, x > ln(x + 1). Since for x = 0 this holds as an equality, we have finally proved x ≥ ln(x + 1) for all x ≥ 0. 5.E.104. Consider the function f(x) = α x^β − β x^α − α + β. We have f′(x) = αβ(x^{β−1} − x^{α−1}) > 0 for all x > 1, that is, f(x) is strictly increasing for all x ∈ (1, +∞). The point x = 1 is a critical point of f, i.e., f′(1) = 0, and we see that f attains there its minimum value f(1) = 0. Thus for all x > 1 we have f(x) > 0, i.e., α x^β − β x^α > α − β. 5.E.108. (a) Obviously, this limit has the indeterminate form ∞/∞. (b) An attempt to apply l’Hospital’s rule leads to the limit lim_{x→+∞} (1 + cos(x))/1 = lim_{x→+∞} (1 + cos(x)), which does not exist (e.g., use Sage and type the syntax limit(1 + cos(x), x = oo)). However, the limit at hand exists. This is because (x − 1)/x ≤ (x + sin x)/x ≤ (x + 1)/x, which implies that lim_{x→+∞} f(x) = 1. 5.E.111. The answers are a = 2/π, b = −1, c = 1/2 and d = e^{−2}. 5.E.116. The perimeter is 2√5 r. 5.E.117. The answer here is the square with sides of length c. 5.E.118. We get h = (4/3)R and r = (2√2/3)R. 5.E.119. This triangle is the equilateral triangle, with area √3 p^2/36. 5.E.120. The desired points have coordinates P = [2, −1/2] and Q = [−2, −1/2], respectively. Try to illustrate the situation via Sage. 5.E.122. To confirm the answer presented for x0 in 5.E.121, we may instead try to locate the global minimum of the function g(x) = 1/f(x) = (x^2 + (b − h)(a − h))/(x(b − a)) = x/(b − a) + (b − h)(a − h)/(x(b − a)), x ∈ (0, +∞). This can be done with the help of the AM-GM inequality that we met in Chapter 1, that is, (y1 + y2)/2 ≥ √(y1 y2), y1, y2 ≥ 0, where the equality occurs if and only if y1 = y2. The choice y1(x) = x/(b − a), y2(x) = (b − h)(a − h)/(x(b − a)) then gives g(x) = y1(x) + y2(x) ≥ 2√(y1(x) y2(x)) = (2/(b − a))√((b − h)(a − h)). Therefore, if there is a number x > 0 for which y1(x) = y2(x), then the function g has its global minimum at x. In particular, the equation y1(x) = y2(x), i.e. x/(b − a) = (b − h)(a − h)/(x(b − a)), has a unique positive solution given by x0 = √((b − h)(a − h)). 5.E.126. The inscribed rectangle has sides of lengths x and (√3/2)(a − x), thus its area is (√3/2)(a − x)x. The maximum occurs for x = a/2, hence the greatest possible area is (√3/8)a^2. 5.E.127. The dimensions of the pool are 4 m × 4 m × 2 m. 5.E.128. The answer is 28 = 24 + 4. 5.E.129. The answer is a = 1. 5.E.131. (a) The first claim is obviously true. We do not need any absolute convergence to handle linear combinations. This is a direct consequence of the properties of limits. (b) This statement is clearly false, in general. Consider the harmonic series ∑_{n=1}^∞ 1/n. We saw in 5.D.9 that the series ∑_{n=1}^∞ 1/n^2 converges, while the harmonic series does not. (c) Let us divide the series into a sum of two.
The first one will collect the terms with |an| > n|an|2 , while the other sum the remaining terms. In the first case, |an| < 1 n , and thus an n < 1 n2 . Consequently, this part of the series converges absolutely. The remaining part must converge, as well, as a consequence of the comparison with the converging series: an n ≤ |an|2 . As a sum of two absolutely convergent series, the initial series is absolutely convergent, too. Thus the claim is true. 5.E.135. (a) The ratio test gives lim n→∞ an+1 an = lim n→∞ 2(n+1)2 · n! (n + 1)! · 2n2 = lim n→∞ 22n+1 n + 1 = lim n→∞ 2 · 4n n + 1 = +∞ . Thus the series does not converge. (b) The sequence ( 2√ 3n ) n∈Z+ is decreasing, while recall that and the function f(x) = arctan(x) is increasing on the whole real axis. So, the sequence ( arctan ( 2√ 3n ) ) n∈Z+ is decreasing. Thus, it is an alternating series such that the sequence of the absolute values of its terms is decreasing. According to the Leibniz criterion, such an alternating series converges if and only if the sequence of its terms converges to zero, and this is satisfied: lim n→∞ arctan ( 2 √ 3n ) = arctan(0) = 0 and hence lim n→∞ ( (−1)n+1 arctan ( 2 √ 3n ) ) = 0 . So the series T is convergent. 5.E.137. The required value is 3/2. 5.E.138. The first series sums to 3 and the second sums to 9/4. 5.E.139. We get α > 0, β ∈ {−2, −1, 0, 1, 2}, and γ ∈ (−∞, −1) ∪ (1, +∞). 5.E.140. This series is absolutely convergent. 5.E.141. The answer is α ∈ [0, 1). 5.E.144. We see that A(x) = 1 2 ln 1+x 1−x and B(x) = 1+x (1−x)3 , respectively. 5.E.145. For example: an = n/3, bn = n/2, with n ∈ N. 5.E.146. The former series converges absolutely; the latter converges conditionally. 5.E.147. Yes, it does. 5.E.148. The answer is p ∈ R. 507 CHAPTER 5. ESTABLISHING THE ZOO 5.E.149. This series is convergent. 5.E.150. The answer is S = 2/9. In the previous chapter, we were working either with an extremely large class of functions (for example, all continuous, all differentiable), or with only particular functions, (for example exponential, trigonometric, polynomial). However we had very few tools. We indicated how to discuss the local behaviour of functions near a given point by linear approximation. We learned how to measure instantaneous changes by differentiation. Now we derive several results that will allow us to work with functions more easily when modeling real problems. We also deal with the task of summing infinitely many “infinitely small” changes, in particular, how to “integrate”. In the last part of the chapter we come back to series of functions. We also add useful techniques, how to deal with extra parameters in the functions, and we introduce some further integration concepts briefly. 1. Differentiation 6.1.1. Higher order derivatives. If the first derivative f′ (x) of a function of one real variable has a derivative (f′ )′ (x0) at the point x0, we say that the second derivative of function f (or second order derivative) exists. Then we write f′′ (x0) = (f′ )′ (x0) or f(2) (x0). A function f is two times differentiable on some interval, if it has a second derivative at each of its points. Derivatives of higher orders are defined inductively: k times differentiable functions A function f of one real variable is differentiable k times at the point x0 for some natural number k > 1, if it is differentiable (k − 1) times on some neighbourhood of the point x0 and its (k − 1)-st derivative has a derivative at the point x0. We write f(k) (x) for the k-th derivative of the function f(x). 
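The inductive definition is easy to experiment with in Sage. The following cell is a minimal sketch (the function sin(x)·e^x is our own illustrative choice): it computes the third derivative both at once and as the derivative of the second derivative, and checks that the two agree.
f(x) = sin(x)*exp(x)
f3 = diff(f(x), x, 3)                 # third derivative computed at once
f3_step = diff(diff(f(x), x, 2), x)   # derivative of the second derivative
bool(f3 == f3_step)
Sage returns True, in accordance with the inductive definition above.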
If derivatives of all orders exist on an interval A, we say the function f is smooth or infinitely differentiable on A. We use the notation Ck(A) for the class of functions with continuous k-th derivative on the interval A, where k can attain the values 0, 1, . . . , ∞. Often we write only Ck, if the domain is known from the context. When k = 0, C0 means continuous functions. CHAPTER 6 Differential and integral calculus we already have the menagerie, but what shall we do with it? – we’ll learn to control it... A. Derivatives of higher orders In the previous chapter we briefly explained the use of second order derivatives in the study of local extremes (see 5.C.28 for example). In this chapter we will study higher order derivatives in more detail and derive such applications in a more systematic way. Notice that a higher order derivative refers to the repeated process of taking derivatives of derivatives, a procedure also known as “successive differentiation”. As in Chapter 5, we will denote the second derivative of a function f by f′′ = (f′)′, and for derivatives of third or higher order we will write f(3) = (f′′)′, f(4) = (f(3))′, etc. The bracket in the notation f(n) = (f(n−1))′ is necessary to distinguish it from the nth power fn of f. When for a function f we can consider arbitrarily many continuous derivatives, the function is said to be smooth; see also the theoretical column for the notion of smooth functions (e.g., 6.1.9). At this point it is also important to recall that in Sage the nth order derivative of a given function f can be computed via the command diff(f, x, n). For the examples described below we advise the reader to use Sage and verify the computations, especially for the problems where a Sage implementation is not included. We illustrate the concept of higher order derivatives with polynomials. Because the derivative of a polynomial is a polynomial whose degree is one less than that of the original, after a finite number of differentiations we obtain the zero polynomial. If k is the degree of the polynomial, then exactly k + 1 differentiations yield zero. Derivatives of all orders then exist, hence f ∈ C∞(R). In the spline construction, see 5.1.9, we took care that the resulting functions belong to the class C2(R). Their third derivatives are piece-wise constant functions. That is why the splines do not belong to C3(R), even though all their higher order derivatives are zero in all of the inner points of all single intervals in the interpolation. Think this example through in detail! The next assertion is a combinatorial corollary of Leibniz’s rule for differentiation of a product of two functions: Lemma. If two functions f and g have derivatives of order k at the point x0, then their product also has the derivative of order k and the following equality holds: (f · g)^{(k)}(x0) = ∑_{i=0}^{k} (k i) f^{(i)}(x0) g^{(k−i)}(x0). Proof. For k = 0, the statement is trivial. For k = 1 it is Leibniz’s product rule. Suppose the equality holds for some k. Differentiate the right hand side and use Leibniz’s rule to obtain the expression ∑_{i=0}^{k} (k i)( f^{(i+1)}(x0) g^{(k−i)}(x0) + f^{(i)}(x0) g^{(k−i+1)}(x0) ). In this new sum, the sum of the orders of the derivatives of products in all summands is k + 1, and the coefficient of f^{(j)}(x0) g^{(k+1−j)}(x0) is the sum of binomial coefficients (k j−1) + (k j) = (k+1 j). □ 6.1.2. The meaning of second derivative.
We have already seen that the first derivative of a function is its linear approximation in the neighbourhood of a given point. The sign of a nonzero derivative determines whether the function is increasing or decreasing at the point x0. The points where the first derivative is zero are called the critical points or stationary points of the given function. If x0 is a critical point of function f, there are several possibilities for the behaviour of the function f in the neighbourhood of x0. Consider the behaviour of the function f(x) = xn in the neighbourhood of zero for different 509 6.A.1. Determine the derivatives given below. (1) ( x2 sin(x) )′′ , x ∈ R (5) (xx )′′ , x > 0 (2) ( 2x 4x+3 )′′ , x ∈ R\{−3 4 } (6) (xn )(n) , x ∈ R (3) ( ln √ x2+1 x2−1 )′′ , with (7) ( x ln(x) )(3) , x > 0 x ∈ (−∞, −1) ∪ (1, +∞) (4) (tan(x))′′ , with (8) (1/ √ x(x − 2))′′ , with x ∈ R\{π 2 + nπ : n ∈ N} 0 < x < 2. Solution. (1) This is based on the product rule (fg)′ = f′ g+ fg′ . Applying this rule successively we obtain (x2 sin(x))′′ = (( x2 sin(x) )′ )′ = ( 2x sin(x) + x2 cos(x) )′ = 2 sin(x) + 4x cos(x) − x2 sin(x) . (2) In this case we will apply twice the quotient rule ( f g )′ = gf′ −fg′ g2 . We get ( 2x 4x+3 )′ = 2(4x+3)−8x (4x+3)2 = 6 (4x+3)2 , hence ( 2x 4x + 3 )′′ = (( 2x 4x + 3 )′ )′ = ( 6 (4x + 3)2 )′ = − 48(4x + 3) (4x + 3)4 = − 48 (4x + 3)3 . (3) For any x ∈ (−∞, −1) ∪ (1, +∞) we can write ln √ x2 + 1 x2 − 1 = ln (x2 + 1 x2 − 1 )1 2 = 1 2 ln (x2 + 1 x2 − 1 ) = 1 2 [ ln(x2 + 1) − ln(x2 − 1) ] . (∗) This relation has many advantages. For instance, one can easily deduce that the given function is differentiable (as the difference of two differentiable functions). On the other hand, recall that a positive differentiable function f satisfies (ln(f(x))′ = f′ (x)/f(x) for all x in its domain. Hence, a combination of this rule with (∗) gives ( ln √ x2 + 1 x2 − 1 )′ = 1 2 [ 2x x2 + 1 − 2x x2 − 1 ] = − 2x x4 − 1 . Now we can easily compute also the second derivative: ( ln √ x2 + 1 x2 − 1 )′′ = ( − 2x x4 − 1 )′ = 2(1 + 3x4 ) (x4 − 1)2 . (4) For this you should apply the quotient rule twice. In particular, recall by 5.E.76 that (tan(x))′ = 1 cos2(x) . Therefore, tan′′ (x) ≡ (tan(x))′′ ≡ d2 d2x tan(x) = ( 1 cos2(x) )′ = 2 sin(x) cos3(x) = 2 tan(x) cos2(x) . CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS n. For odd n > 0, f(x) will be increasing on all R, while for even n it will be decreasing for x < 0 and increasing for x > 0. In the latter case, the function will attain its minimal value among points in the (sufficiently small) neighbourhood of x0 = 0. The same argument applies to the function f′ . If the second derivative is nonzero, its sign determines the behaviour of the first derivative. At the critical point x0 the derivative f′ (x) is increasing if the second derivative is positive and decreasing if the second derivative is negative. If increasing, it is necessarily negative to the left of the critical point and positive to the right of it. In that case, f is decreasing to the left of the critical point and increasing to the right of it. So f attains its minimal value among all points from a (sufficiently small) neighbourhood of x0 at x0. On the other hand, if the second derivative is negative at x0, the first derivative is decreasing. Thus the first derivative is negative to the left of x0 and positive to the right of it. f then attains its maximal value at x0 among all values in a neighbourhood of x0. 
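This discussion translates directly into a small computation. As a minimal sketch in Sage (the function x^3 − 3x is our own illustrative choice, not taken from the text), one can locate the critical points and inspect the sign of the second derivative at each of them:
f(x) = x^3 - 3*x
crit = solve(diff(f(x), x) == 0, x)   # critical points: solutions of f'(x) = 0
[(c.rhs(), diff(f(x), x, 2).subs(x=c.rhs())) for c in crit]
Sage returns [(-1, -6), (1, 6)]: the second derivative is negative at x = −1 (a local maximum) and positive at x = 1 (a local minimum), exactly as the discussion above predicts.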
A function which is differentiable on (a, b) and continuous on [a, b] has an absolute maximum and minimum on this interval. Both can be attained only at the boundary of the interval or at a point where the derivative is zero. Thus critical points may be sufficient for finding extremes. Second derivatives help to determine the type of extreme, if nonzero. For a more precise discussion of the latter phenomena we consider higher order polynomial approximations of the functions. We then return to the qualitative study of the behaviour of functions with new tools. 6.1.3. Taylor expansion. As a surprisingly easy use of Rolle’s theorem we derive an extremely important result. It is called the Taylor expansion with remainder1. Consider the power series centered at a, S(x) = ∑_{n=0}^∞ an(x − a)^n. Recall, power series can be differentiated term by term, cf. 5.4.10. Differentiate the series S(x) repeatedly, to get the power series S^{(k)}(x) = ∑_{n=k}^∞ n(n − 1) · · · (n − k + 1) an (x − a)^{n−k}. Put x = a. Then S^{(k)}(a) = k! ak. We can read the last statement as an equation for ak and rewrite the original series as S(x) = ∑_{n=0}^∞ (1/n!) S^{(n)}(a)(x − a)^n. 1 Brook Taylor was an English mathematician (1685–1731), best known for his formalization of the polynomial approximations of functions, recognized by Lagrange as the “main foundation of differential calculus”. (5) In 5.C.7 (d) we proved that (x^x)′ = (ln(x) + 1) x^x. Thus, for the second derivative we obtain (x^x)′′ = ((x^x)′)′ = ((ln(x) + 1) e^{x ln(x)})′ = (ln(x) e^{x ln(x)})′ + (e^{x ln(x)})′ = (1/x) e^{x ln(x)} + ln(x)(ln(x) + 1) e^{x ln(x)} + (ln(x) + 1) e^{x ln(x)} = x^x/x + x^x (ln^2(x) + 2 ln(x) + 1) = x^{x−1} + x^x (ln(x) + 1)^2. (6) We see that (x^n)^{(n)} = [(x^n)′]^{(n−1)} = (n x^{n−1})^{(n−1)} = · · · = n!. The last step of a formal proof includes an induction on n, which we leave as an exercise. (7) In this case we simply state the result and leave a formal computation for practice: (x/ln x)^{(3)} = 1/(x^2 (ln x)^2) − 6/(x^2 (ln x)^4). (8) For this case we just present the result, and leave the proof for practice: (1/√(x(x − 2)))′′ = (2x^2 − 4x + 3) √(x^2 − 2x)/(x^6 − 6x^5 + 12x^4 − 8x^3) = (2x^2 − 4x + 3)/(x^2 (x − 2)^2 √(x(x − 2))). □ 6.A.2. Use Sage to verify your answers in 6.A.1. Solution. For most of the cases the answer is based on the command diff(f, x, m), mentioned above. For instance, type show(diff(x^2*sin(x), x, 2).full_simplify()) show(diff((2*x)/(4*x+3),x,2).full_simplify( )) f(x)=ln(sqrt((x^2+1)/(x^2-1))) show(diff(f(x), x, 2).full_simplify()) show(diff(tan(x), x, 2)) show(diff(x^x,x,2)) show(diff(x/ln(x),x,3)) g(x)=1/sqrt(x*(x-2)) show(diff(g(x), x, 2).factor()) Notice the function full_simplify() (or factor()) was used to simplify the given expressions and make the verification direct, whenever necessary. Finally, observe also that the cell [diff(x^n,x,n) for n in range(1,100)] gives a verification of the relation (x^n)^{(n)} = n!, presented in (6), for many values of n. However, a more precise program that offers a more practical demonstration of the theoretical result is as follows: # Define the variable x x = var('x') # Define a specific integer value for n n = 4 # Change this to test different cases Suppose f is a smooth function instead of a power series. We search for a good approximation by polynomials in the neighbourhood of a given point a.
Taylor polynomial of function f
For a k times differentiable (real or complex valued) function f of one real variable, define its Taylor polynomial of k-th degree centered at a by the formula
T^k_a f(x) = f(a) + f′(a)(x − a) + (1/2) f′′(a)(x − a)² + (1/6) f^(3)(a)(x − a)³ + · · · + (1/k!) f^(k)(a)(x − a)^k.

The mean value theorem is used to show how good this approximation of f is.

Taylor expansion with a remainder
Theorem. Let f(x) be a function that is k times differentiable on the interval (a, b) and continuous on [a, b]. Then for all x ∈ (a, b) there exists a number c ∈ (a, x) such that
f(x) = T^{k−1}_a f(x) + (1/k!) f^(k)(c)(x − a)^k
= f(a) + f′(a)(x − a) + . . . + (1/(k − 1)!) f^(k−1)(a)(x − a)^{k−1} + (1/k!) f^(k)(c)(x − a)^k.

Proof. For fixed x define the remainder R by f(x) = T^{k−1}_a f(x) + R. Then R = (1/k!) r (x − a)^k for a suitable number r (dependent on x). Consider the function F(ξ) defined by
F(ξ) = ∑_{j=0}^{k−1} (1/j!) f^(j)(ξ)(x − ξ)^j + (1/k!) r (x − ξ)^k.
By the Leibniz rule, its derivative (here x is considered as a constant parameter) is
F′(ξ) = f′(ξ) + ∑_{j=1}^{k−1} ((1/j!) f^(j+1)(ξ)(x − ξ)^j − (1/(j − 1)!) f^(j)(ξ)(x − ξ)^{j−1}) − (1/(k − 1)!) r (x − ξ)^{k−1}
= (1/(k − 1)!) f^(k)(ξ)(x − ξ)^{k−1} − (1/(k − 1)!) r (x − ξ)^{k−1}
= (1/(k − 1)!) (x − ξ)^{k−1} (f^(k)(ξ) − r),
because the expressions in the sum cancel each other out sequentially. Now it suffices to notice that F(a) = F(x) = f(x) (recall that x is an arbitrarily chosen but fixed number from the interval (a, b)). According to Rolle's theorem there exists a number c, a < c < x, such that F′(c) = 0. That is the desired relation. □

# Define the function f(x) = x^n
f = x^n
# Compute the n-th derivative
n_th_derivative = f.diff(x, n)
# Simplify the result
n_th_derivative_s = n_th_derivative.simplify()
# Print the result
print(f"The {n}-th derivative of x^{n} is:", n_th_derivative_s)
# Verify that the result matches n!
expected_result = factorial(n)
print(f"Expected result ({n}!):", expected_result)
# Check if the computed result matches n!
if n_th_derivative_s == expected_result:
    print("The result matches n!")
else:
    print("The result does not match n!.")
□

6.A.3. Use the generalized Leibniz rule (see the lemma in 6.1.1) to compute the fourth-order derivative of the functions:
(a) h(x) = x cos(x),
(b) k(x) = e^{4x} x⁴,
(c) ℓ(x) = x³ ln(x),
(d) m(x) = e^x arctan(x).
Next describe a solution in Sage.
Solution. The generalized Leibniz rule for the 4th-order derivative of a product fg, with the binomial coefficients as weights, gives
(f(x)g(x))^(4) = f(x)g^(4)(x) + 4f′(x)g^(3)(x) + 6f′′(x)g′′(x) + 4f^(3)(x)g′(x) + f^(4)(x)g(x). (♯)
Now, for case (a) we have h(x) = f(x)g(x) with f(x) = x and g(x) = cos(x), respectively. Moreover, g′(x) = −sin(x), g′′(x) = −cos(x), g′′′(x) = sin(x), g^(4)(x) = cos(x), f′(x) = 1, and f^(k)(x) = 0 for all k ≥ 2. Thus, an application of (♯) gives h^(4)(x) = x cos(x) + 4 sin(x). A solution in Sage is based on the cell
h(x)=x*cos(x); show(diff(h,x,4))
Or, you may use the alternative
[diff(h(x),x,i) for i in [1..4]]
which gives us all the derivatives h′(x), h′′(x), h^(3)(x) and h^(4)(x). The remaining cases are treated similarly. □

6.A.4. Generalized Leibniz rule and Sage. For two symbolic functions f, g in Sage write a short program to implement the identity (♯) presented in 6.A.3. Then verify the results for the functions h, k, ℓ, m by using this program, and hence independently of the command diff(f*g, x, 4).
Solution.
Recall from 5.C.8 that in Sage we can easily introduce symbolic functions by typing

A special case of the last theorem is the mean value theorem, as an approximation by the Taylor polynomial of degree zero. See 5.3.9(1).

6.1.4. Estimates for Taylor expansions. A simple case of a Taylor expansion is when f is a polynomial: f(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_1 x + a_0, a_n ≠ 0. Because the (n + 1)-th derivative of f is identically zero, the Taylor polynomial of degree n has zero remainder, therefore for each x0 ∈ R
f(x) = f(x0) + f′(x0)(x − x0) + · · · + (1/n!) f^(n)(x0)(x − x0)^n.
We can compute all the derivatives easily (for example the last term is always of the form a_n(x − x0)^n). This result is a very special case of error estimation in Taylor expansion with the remainder. We know in advance that the remainder can be estimated by the size of the derivative, and for polynomials this is identically zero from some order onwards. More generally, the estimation of the size of the k-th derivative on some interval can be used to estimate the error on the same interval. Good examples of an expansion of an arbitrary degree are provided by the trigonometric functions sin and cos. By iterating the differentiation of the function sin x we always obtain either sine or cosine with some signs. The absolute values do not exceed one. Thus we obtain a direct estimation of the speed of convergence of the power series
|sin x − (T^k_0 sin)(x)| ≤ |x|^{k+1}/(k + 1)!.
This shows that for x much smaller than k the error is small, but for x comparable with k or bigger it may be large. In the figure, compare the approximation of the function cos x by a Taylor polynomial of degree 68 in paragraph 5.4.12 on page 429. As mentioned in the introduction of the discussion of Taylor expansion of functions, if we start with a power series f(x) centered at a, then its partial sums coincide with the Taylor polynomials T^k_a f(x). The next statement is one of the simple formulations of the converse implication. This is the case when the given function f(x) is actually a power series on some neighbourhood of the given point a.

Taylor's theorem
Theorem. Assume that the function f(x) is smooth on the interval (a − b, a + b) and all of its derivatives are bounded uniformly by a constant M > 0, so |f^(k)(x)| ≤ M, k = 0, 1, . . ., x ∈ (a − b, a + b). Then the power series S(x) = ∑_{n=0}^∞ (1/n!) f^(n)(a)(x − a)^n converges on the interval (a − b, a + b) to f(x).

var("x"); f=function("f")(x)
This, in combination with substitute_function, gives rise to an alternative approach to obtain the first derivative of a given function. For instance, for f(x) = e^x we just need to add the code
f1=f.diff(x)
show(f1.substitute_function(f==exp(x)))
Check yourself that this block returns the first derivative of f(x) = e^x. We are now ready to verify (♯), which can be done for example by the cell
f=function("f")(x)
g=function("g")(x)
f1=diff(f, x); f2=diff(f, x, 2)
f3=diff(f, x, 3); f4=diff(f, x, 4)
g1=diff(g, x); g2=diff(g, x, 2)
g3=diff(g, x, 3); g4=diff(g, x, 4)
h4x=f*g4+4*f1*g3+6*f2*g2+4*f3*g1+f4*g
show(h4x)
This block includes all the derivatives that one needs to encode (♯) inside the program, which we did here via the expression named “h4x”. To confirm the expression of h^(4)(x) given in 6.A.3, it now suffices to add the code
show(h4x.substitute_function(f==x, g==cos(x)))
The remaining cases can be confirmed similarly and are left for practice. □
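To see the estimate from 6.1.4 at work, one may compare the actual error of a Taylor polynomial of sin with the bound |x|^{k+1}/(k + 1)!. A minimal Sage sketch follows (the degree k = 5 and the point x0 = 2 are our own choice):
k = 5; x0 = 2.0                     # sample degree and evaluation point
T = taylor(sin(x), x, 0, k)
err = abs(sin(x0) - T(x=x0))        # actual error of the approximation
bound = abs(x0)^(k+1)/factorial(k+1)
print(err, bound, err <= bound)     # the error stays below the bound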
6.A.5. Let n ∈ N be arbitrary. Find the n-th derivative of the function f(x) = ln((1 + x)/(1 − x)), where x ∈ (−1, 1). ⃝

6.A.6. Compute the 12th derivative of the function f(x) = e^{2x} + cos(x) + x^{10} − 5x⁷ + 6x³ − 7x + 3, with x ∈ R. ⃝

6.A.7. Compute the 26th derivative of the function f(x) = sin(x) + x^{23} − x^{18} + 5x^{11} − 3x⁸ + e^{2x}, with x ∈ R. ⃝

Taylor polynomials extend the idea of linearization that we briefly discussed in Chapter 5. They provide high-order approximations of a function f by polynomials in the neighbourhood of a point (and thus locally). This is because such expansions are based on higher order derivatives of f at that point. We remark that as the number of terms of such polynomials (and hence their order) increases, the induced approximations improve. Taylor expansions have many natural applications in mathematics, but also in other sciences (e.g., in physics). In the sequel, we first describe exercises related to the computation of Taylor polynomials; more interesting applications are discussed afterwards. For our description we adopt the notation from the theoretical section 6.1.3, hence for example we denote by T^k_a f(x) the k-th order Taylor expansion of f(x) around a point a.

Proof. The proof is identical with the special case of the function sin x above, except that the universal bound by 1 is replaced by M. Thus the estimate of the remainders is
|f(x) − (T^k_a f)(x)| ≤ (M/(k + 1)!) |x − a|^{k+1}. □

6.1.5. Local behaviour of functions. With the Taylor expansions at hand, let us return to the local or global behaviour of real functions of one real variable. We have seen that the sign of the first derivative of a differentiable function determines whether it is increasing or decreasing on some neighbourhood of the given point. If the derivative is zero, it is one of the so called critical points, but we do not know much about the function without looking at higher derivatives. We encountered the importance of the second derivative when analysing the critical points, see 6.1.2. Now we generalize the discussion of critical points for all orders. First we deal with the local extremes of functions. In the following we consider real functions with a sufficiently high number of continuous derivatives, without specifically stating this assumption. The point a in the domain of f is a critical point of order k if and only if
f′(a) = · · · = f^(k)(a) = 0, f^(k+1)(a) ≠ 0.
Suppose f^(k+1)(a) > 0 and f ∈ C^{k+1}. Then this continuous derivative is positive on a certain neighbourhood O(a) of the point a as well. In that case, the Taylor expansion with the remainder gives
f(x) = f(a) + (1/(k + 1)!) f^(k+1)(c)(x − a)^{k+1}
for all x in O(a). Because of that, the change of values of f(x) in a neighbourhood of a is given by the behaviour of (x − a)^{k+1}. Moreover, if k + 1 is an even number, then the values of f(x) in such a neighbourhood are necessarily larger than the value f(a). So a is a local minimum. But if k + 1 is odd (i.e., k is even), then the values on the left are smaller, while those on the right are larger than f(a). So an extreme does not occur even locally. On the other hand, the graph of the function f(x) intersects its tangent y = f(a) at the point [a, f(a)] in the latter case, as discussed in more detail below. Similarly, if f^(k+1)(a) < 0, then it is a local maximum for odd k, and there is no extreme for even k.
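The discussion in 6.1.5 translates into a small Sage routine: find the first nonvanishing derivative at a critical point and read off the behaviour from its order and sign. This is only a sketch; the sample functions and the cap on the order are our own choice:
def first_nonzero_derivative(f, a, max_order=10):
    # return (order, value) of the first nonvanishing derivative of f at a
    for k in range(1, max_order + 1):
        val = diff(f, x, k)(x=a)
        if val != 0:
            return k, val
    return None

for f in [x^2, x^3, -x^4]:
    print(f, first_nonzero_derivative(f, 0))
An even order with a positive value gives a local minimum, an even order with a negative value a local maximum, while an odd order means there is no extreme; in the notation of 6.1.5, the point is a critical point of order k − 1.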
6.1.6. Convex and concave functions. The differentiable function f is concave at a, if its graph lies completely below the tangent at the point [a, f(a)] in a neighbourhood of a. That is, f(x) ≤ f(a) + f′(a)(x − a). Similarly f is convex at a, if its graph is above the tangent at the point a. That is, f(x) ≥ f(a) + f′(a)(x − a).

6.A.8. For the elementary functions sin(x), cos(x), e^x and ln(x + 1) compute the third order Taylor expansion centered at a = 0. ⃝

6.A.9. Determine the Taylor expansion T³₁f(x) for the functions f(x) = e^x/x and f(x) = arctan(x), with x ∈ R∖{0} and x ∈ R, respectively. Then verify your answers in Sage and moreover check the accuracy of the approximation near the point a.
Solution. The third-order Taylor expansion around the point a = 1 is given by
T³₁f(x) = f(1) + f′(1)(x − 1) + (f′′(1)/2)(x − 1)² + (f′′′(1)/6)(x − 1)³.
For the first case we have f(1) = e, and
f′(1) = (e^x/x − e^x/x²)|_{x=1} = 0,
f^(2)(1) = (e^x/x − 2e^x/x² + 2e^x/x³)|_{x=1} = e,
f^(3)(1) = (e^x/x − 3e^x/x² + 6e^x/x³ − 6e^x/x⁴)|_{x=1} = −2e.
Thus it follows that
T³₁(e^x/x) = e + (e/2)(x − 1)² − (e/3)(x − 1)³.
In Sage, in order to compute the Taylor expansion T^k_a f(x) of a given function f(x) around a point a, you can either apply the command taylor(f(x), x, a, k), or type f.taylor(x, a, k). Thus, a verification of the derivatives and the Taylor expansion presented above is obtained by the cell
f(x)=e^x/x
a=diff(f, x, 2); show(a); show(a(1))
b=diff(f, x, 3); show(b); show(b(1))
show(taylor(f(x), x, 1, 3))
In order to check the accuracy of the approximation near the point a = 1, we can check the values of f and T³₁f at a point close to a = 1, say at x = 1.1. The block
f(x)=exp(x)/x; T(x)=taylor(f(x), x, 1, 3)
show("Value of the function at x=1.1:", round(f(x=1.1), 10))
show("3rd degree Taylor pol. at x=1.1:", round(T(x=1.1), 10))
returns the desired answer, namely:
• Original value of the function at x = 1.1: 2.7310600218,
• 3rd degree Taylor polynomial at x = 1.1: 2.7309671437.
The function arctan(x) can be treated in the same way. For instance, the syntax show(taylor(arctan(x), x, 1, 3)) gives us the following expression:
T³₁ arctan(x) = π/4 + (1/2)(x − 1) − (1/4)(x − 1)² + (1/12)(x − 1)³.

A function is convex or concave on an interval, if it has this property at all its points. Suppose f has continuous second derivatives in a neighbourhood of a. The Taylor expansion of second order with the remainder implies
f(x) = f(a) + f′(a)(x − a) + (1/2) f′′(c)(x − a)².
Then the function is convex, whenever f′′(a) > 0, and concave whenever f′′(a) < 0. If the second derivative is zero, we can use derivatives of higher orders. But we can only make the same conclusion if the first nonzero derivative after the first derivative is of even order. If the first nonzero derivative is of odd order, the points of the graph of the function on opposite sides of some small neighbourhood of the studied point will lie on opposite sides of the tangent at this point.

6.1.7. Inflection points. A point a is called an inflection point of a differentiable function f, if the graph of f crosses from one side of the tangent in the point a to the other. The latter discussion on concave and convex functions shows that the inflections can appear only at points with vanishing second derivative. Suppose f has continuous third derivatives and write the Taylor expansion of third order with the remainder:
f(x) = f(a) + f′(a)(x − a) + (1/2) f′′(a)(x − a)² + (1/6) f′′′(c)(x − a)³.
If a is a zero point of the second derivative such that f′′′(a) ≠ 0, then the third derivative is nonzero on some neighbourhood.
Then a is an inflection point since the second derivative changes the sign at a and thus the tangent crosses the graph. In that case, the sign of the third derivative determines whether the graph of the function crosses the tangent from the top to the bottom or vice versa. Moreover, if a is an isolated zero point of the second derivative and simultaneously an inflection point, then on some small neighbourhood of a the function is concave on one side and convex on the other. Thus the inflection points are points of the change between concave and convex behaviour of the graph of the function.

To check the accuracy, one can apply the same method as for f(x) = e^x/x. □

6.A.10. Determine the Taylor expansion T⁴₁f(x) where f(x) = ln(x²), with x ∈ (0, 2). Next use Sage to plot together f(x) and T⁴₁f(x), for x in the domain of f. ⃝

6.A.11. Consider the function f(x) = cos(x) with x ∈ R. Use Sage to sketch in one diagram the graph of f together with the graphs of the Taylor polynomials T^k_a f(x) for k = 2, 3, 5, 7, 9 and a = π/2, over suitable intervals of R.
Solution. Sage allows us to use different colours to present the required graphs, and this makes their recognition easier. Based on this remark, one can present the following figure. To produce this figure we use the taylor function to introduce the required Taylor polynomials. We then sketch their graphs via the plot command, which we use with some more advanced options (for example, each polynomial comes with a label, which we added via the legend_label option). Let us present the full syntax in a block:
f(x)=cos(x)
p=plot(f(x), x, -2*pi, 2*pi, color="green", legend_label=r"$f(x)=cos(x)$")
T2(x)=taylor(f(x), x, pi/2, 2)
T3(x)=taylor(f(x), x, pi/2, 3)
T5(x)=taylor(f(x), x, pi/2, 5)
T7(x)=taylor(f(x), x, pi/2, 7)
T9(x)=taylor(f(x), x, pi/2, 9)
p2=plot(T2(x), x, -2*pi, 2*pi, legend_label=r"$T^2_{\frac{\pi}{2}}(x)$")
p3=plot(T3(x), x, -pi, 2*pi, color="red", legend_label=r"$T^3_{\frac{\pi}{2}}(x)$")
p5=plot(T5(x), x, -pi, 2*pi, color="orange", legend_label=r"$T^5_{\frac{\pi}{2}}(x)$")
p7=plot(T7(x), -pi, 2*pi, color="black", legend_label=r"$T^7_{\frac{\pi}{2}}(x)$")
p9=plot(T9(x), -pi, 2*pi, color="brown", legend_label=r"$T^9_{\frac{\pi}{2}}(x)$")
show(p+p2+p3+p5+p7+p9)
For further practice with Taylor polynomials, check 6.B.40, and see also Section D for a series of additional tasks (e.g., 6.D.5). □

6.1.8. Asymptotes of graphs of functions. We introduce one more useful utility for understanding or sketching the graph of a function. We consider the asymptotes. These are lines in R² whose distance from the graph of f(x) converges to zero for x → x0. Thus, an asymptote at the improper point ∞ is a line y = ax + b which satisfies
lim_{x→∞} (f(x) − ax − b) = 0.
An asymptote with a slope. If such an asymptote exists, it satisfies
lim_{x→∞} (f(x) − ax) = b.
Consequently the limit
lim_{x→∞} f(x)/x = a
also exists. Conversely, if the last two limits exist, the limit from the definition of the asymptote exists as well, thus these are sufficient conditions, too. The asymptote at the improper point −∞ is defined similarly. In this way we find all the lines satisfying the properties of asymptotes with slope. It remains to consider lines perpendicular to the x axis: the asymptotes at points a ∈ R are lines x = a such that at least one of the one-sided limits of f at a is infinite. They are called asymptotes without slope.
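The two limits above translate directly into Sage. Here is a small sketch on a sample rational function of our own choosing:
f(x) = (2*x^2 - 3)/(x + 1)       # sample function (our own choice)
a = limit(f(x)/x, x=oo)
b = limit(f(x) - a*x, x=oo)
print("asymptote with slope: y =", a, "* x +", b)   # y = 2x - 2
# asymptote without slope: one-sided limits at the zero of the denominator
print(limit(f(x), x=-1, dir="plus"), limit(f(x), x=-1, dir="minus"))
The one-sided limits return −∞ and +∞, so x = −1 is indeed an asymptote without slope.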
The real rational functions have asymptotes at all zero points of the denominator which are not zero points of the numerator as well. We consider a simple illustrative example: Let f(x) = x + 1/x. Then f has two asymptotes, y = x and x = 0. Indeed, the one-sided limits from the right and left at zero are clearly ±∞, while the limit of f(x)/x = 1 + 1/x² is of course 1 at the improper points. Finally the limit of f(x) − x = 1/x is zero at the improper points. By differentiating, f′(x) = 1 − x^{−2}, f′′(x) = 2x^{−3}. The function f′(x) has two zero points ±1. At x = 1, f has a local minimum. At x = −1, f has a local maximum. The second derivative has no zero points in all its domain (−∞, 0) ∪ (0, ∞), so f has no inflection points.

So far we have learned how to use Taylor polynomials to approximate smooth functions. Whenever we use an approximation technique, it is important to have a sense for how accurate our approximations are, as for example the Lagrange interpolation error presented in 5.E.9. For Taylor expansions the situation is encoded by the so called Remainder Theorem, which is described in the theoretical section 6.1.3. Let us illustrate this by examples.

6.A.12. Determine the Taylor polynomial T⁶₀f(x), where f(x) = sin(x), and next estimate the error of the approximation at the point x = π/4, according to the theorem in 6.1.3.
Solution. It is easy to prove that T⁶₀ sin(x) = x − (1/6)x³ + (1/120)x⁵. According to the statement in 6.1.3, the estimate of the remainder R at a = 0 is thus given by (the absolute value of) the expression (f^(k+1)(c)/(k + 1)!) x^{k+1}, for some c ∈ (0, π/4) and k = 6. We compute f^(7)(x) = −cos(x), for all x ∈ R, and so at x = π/4 we obtain
|R(π/4)| = |−cos(c)| π⁷/(7! · 4⁷) < 1/7! ≈ 0.0002. □

6.A.13. Show that e^x ≥ 1 + x + x²/2! + x³/3!. ⃝

6.A.14. Find the estimation of the error of the approximation ln(1 + x) ≈ x − x²/2 for x ∈ (−1, 0). ⃝

6.A.15. Consider the sine function f(x) = sin(x), with x ∈ R. Compute the Taylor polynomial T⁴₀f(x) and use Sage to plot together the graphs of T⁴₀f(x) and of f(x). Finally, based on your answer estimate sin(1°). ⃝

6.A.16. Compute an approximation of cos(π/10) with an error less than 10⁻⁵. ⃝

Taylor series are power series having as partial sums the Taylor polynomials of a given function. Taylor series need not be convergent, and even if the Taylor series of a function f does converge, its limit may not be equal to f(x). Functions f whose Taylor series ∑_{n=0}^∞ (f^(n)(a)/n!)(x − a)^n converges to f(x) for all x in some open neighbourhood of a are said to be “analytic at a”. When f is analytic at every point of an interval I ⊂ R, then f is said to be “analytic on I”, see also 6.1.9. As we will explain below, for analytic functions the power series representation allows us to solve problems as if the function were a polynomial. In general not all functions are analytic; in particular, functions with discontinuities cannot even be expressed as Taylor series. However, most of the functions that

6.1.9. Analytic and smooth functions. If f is smooth at a, the formal power series can be written
S(x) = ∑_{n=0}^∞ (1/n!) f^(n)(a)(x − a)^n.
If this power series has a nonzero radius of convergence and simultaneously S(x) = f(x) on the respective interval, we say that f is an analytic function at a. A function is analytic on an interval, if it is analytic at every point of it. One of the most important functions in Mathematics and Physics is the following.

Gaussian function
The analytic function g(x) = e^{−x²}
is called the Gaussian function. Indeed, the Gaussian is analytic: we simply replace x with −x² in the power series for the exponential e^x, and so it is an analytic function on the entire R. Its derivatives are easily computed, too. In particular, g′(0) = 0 is the only critical point and g′′(0) = −2. Since the limits at ±∞ are zero, the number g(0) = 1 is the global maximum of the Gaussian function (see ?? for more detailed comments). We shall see why the function is so important in Chapter 10 when dealing with probability and statistics. The Gaussian g(x) and the function f(x) = g(1/x) are depicted on the figure; f(x) is the solid line, while the Gaussian function is dashed. Notice how firmly the function f touches the x axis. We are going to explain this.

arise in applications are usually analytic. At the end of Chapter 5 we met for example the analytic functions
e^x = 1 + x + x²/2! + x³/3! + · · · = ∑_{k=0}^∞ x^k/k!,
sin(x) = x − x³/3! + x⁵/5! − · · · = ∑_{k=0}^∞ ((−1)^k/(2k + 1)!) x^{2k+1},
cos(x) = 1 − x²/2! + x⁴/4! − · · · = ∑_{k=0}^∞ ((−1)^k/(2k)!) x^{2k}.
In fact it is not hard to prove that the Taylor series of cos(x) is obtained from the Taylor series of sin(x) around the origin by differentiation. Another example is induced by the equality
ln(1 + x) = x − x²/2 + x³/3 − · · · = ∑_{k=1}^∞ ((−1)^{k+1}/k) x^k.
This will be proved later in Section C, where you will have the chance to learn more details on the convergence of Taylor series (see 6.D.47). Finally, a remarkable common property of all the examples mentioned above is that the Taylor series on the r.h.s. are centered at a = 0. Such Taylor series are referred to as “Maclaurin series”. As we will see, from the Maclaurin series we can build many other examples through substitution and series multiplication.

6.A.17. Consider the functions
f(x) = ln((1 + x)/(1 − x)), g(x) = e^{x²} + x² e^{−2x}
with x ∈ (−1, 1) and x ∈ R, respectively. Expand them into a Taylor series centered at the origin, that is, into a Maclaurin series, and then present a verification in Sage.
Solution. We first analyze the case of f(x). We know that ln(1 + x) = ∑_{n=1}^∞ ((−1)^{n+1}/n) x^n, which implies that
ln(1 − x) = ∑_{n=1}^∞ ((−1)^{n+1}/n)(−x)^n = −∑_{n=1}^∞ (1/n) x^n.
Thus, for all x ∈ (−1, 1) we have the relation
ln((1 + x)/(1 − x)) = ln(1 + x) − ln(1 − x) = 2 ∑_{n=1}^∞ x^{2n−1}/(2n − 1).
In order to verify this computation in Sage we could type
var("n"); assume(x>0)
sum((2/(2*n-1))*x^(2*n-1), n, 1, oo)
and this prints out the desired result, i.e., ln(−(x + 1)/(x − 1)). On the other hand, we could use Sage to derive the Taylor series directly, as follows:
f(x)=ln((1+x)/(1-x))
tf=taylor(f, x, 0, 10)
T=tf.power_series(QQ)
show(T)

Not all smooth functions are analytic. It can be proven that for every sequence of numbers a_k there is a smooth function whose derivatives of order k at a given point x0 are these numbers a_k.² Now, look more closely at our function f(x) = e^{−1/x²}. It is a well defined smooth function at all points x ≠ 0. Its limit at x = 0 exists, and lim_{x→0} f(x) = 0. By defining f(0) = 0, f is a continuous function for all real x. By a direct computation based on L'Hôpital's rule we compute the derivatives of f (the first three are in the picture; guess which line is which!). It suffices to consider only the right derivative, since the function is even.
f′(0) = lim_{x→0+} (e^{−1/x²} − 0)/x = lim_{x→0+} x^{−1}/e^{1/x²} = (1/2) lim_{x→0+} x/e^{1/x²} = 0.
By differentiating f(x) at an arbitrary point x ≠ 0,
f′(x) = e^{−1/x²} · 2x^{−3}.
By repeated differentiation of the results, there is always a sum of finitely many terms of the form C · e^{−1/x²} · x^{−j}, where C is an integer and j is a natural number. Next, assume it is already proven that the derivative of order k of f(x) exists and vanishes at zero. Compute the limit of the expression f^(k)(x)/x for x → 0+. This is a finite sum of limits of the expressions C x^{−j} e^{−1/x²} = C x^{−j}/e^{1/x²}. All these expressions are of type ∞/∞, so L'Hôpital's rule can be used repeatedly on them. After several differentiations of both the numerator and denominator (and a similar adjustment as above) there remains the same expression in the denominator, while in the numerator the power is non-negative. Thus the expression necessarily has a zero limit at zero, just as in the case of the first derivative above. The same holds for a finite sum of such expressions. So each derivative f^(k)(x) at zero exists with value zero.

²This is a special case of the Whitney extension theorem, which says that there is a smooth function on a Euclidean space with prescribed derivatives in all points of a closed set A if and only if the Taylor theorem estimates are true for the prescription. In the case of one single point A, the condition is empty. This is relevant for the Taylor theorem for functions of more than one real variable, as in Chapter 8. Hassler Whitney (1907–1989) was a very influential American mathematician.

This returns the Taylor series corresponding to f(x), in the following form
2x + (2/3)x³ + (2/5)x⁵ + (2/7)x⁷ + (2/9)x⁹ + O(x^{10}).
Here, Sage uses the “big O” notation to encode terms of higher order (in our case of order ≥ 10, since we “asked” the Taylor expansion to be of order 10). For the second function, by the identity
e^x = ∑_{n=0}^∞ x^n/n!
we obtain e^{x²} = ∑_{n=0}^∞ x^{2n}/n!, for all x ∈ R. Moreover,
x² e^{−2x} = x² ∑_{n=0}^∞ (1/n!)(−2x)^n = ∑_{n=0}^∞ ((−2)^n/n!) x^{n+2},
and hence
e^{x²} + x² e^{−2x} = ∑_{n=0}^∞ (x^{2n} + (−2)^n x^{n+2})/n!.
The verification in Sage for this case is left for practice. □

6.A.18. Determine the Taylor series centered at the origin for the function f(x) = 1/(1 + x)², with x ∈ (−1, 1). Next confirm your result by Sage. ⃝

Maclaurin series are also useful when computing limits, and here is a task for you to perform yourself, see also 6.D.6.

6.A.19. Use the appropriate Maclaurin series to compute the limits
lim_{x→0} (x sin(x) − x²)/x⁴, and lim_{x→0} x²(e^{−1/x²} − 1). ⃝

Let us now stress the use of derivatives when studying the “local behaviour” of functions, and hence revise the discussion initiated in Chapter 5 (cf. 5.C.15, 5.C.17). This is based on many new notions introduced in ??, and essentially includes the description of the following features of f:
• the domain and the range;
• parity and periodicity;
• discontinuities and their kind;
• points of intersection with the axes;
• the limits lim_{x→±∞} f(x);
• the first and the second derivatives;
• the critical points (also called stationary points);
• the intervals of monotonicity;
• local and absolute extremes;
• the intervals where f is convex/concave;
• the points of inflection;
• the horizontal and inclined asymptotes;
• the graph.
In order to become familiar with these notions, we present many illustrative examples.

In summary, f(x) is smooth on the whole of R. It is strictly positive everywhere except for x = 0. All its derivatives at this point are zero. It cannot be analytic at x0 = 0. The limit of the function at the improper points ±∞ is 1, while all its derivatives converge quickly to zero.
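The vanishing of these derivatives at zero can also be observed symbolically. A minimal Sage sketch (the range k ≤ 3 is our own choice) computes the one-sided limits of the first few derivatives of e^{−1/x²}:
f = e^(-1/x^2)
for k in [1, 2, 3]:
    # each k-th derivative is a sum of terms C*e^(-1/x^2)*x^(-j)
    print(k, limit(diff(f, x, k), x=0, dir="plus"))
Each limit returns 0, in accordance with the argument above: the Taylor series of f at 0 is identically zero, while f itself is not.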
6.1.10. Smooth jump functions. The smooth functions are very “elastic”: from the local behaviour around one point we cannot deduce anything at all about the global behaviour of such functions. On the other hand, analytic functions are completely determined just by the derivatives at one point. In particular they are completely determined by their behaviour on an arbitrarily small neighbourhood of a single point from their domain. In this sense, analytic functions are very “rigid”. In particular, the smooth functions allow for joining different constant values on disjoint open intervals in a differentiable way. Let us look at such functions more closely now. We can modify f(x) from the previous paragraph in this way:
g(x) = 0 if x ≤ 0, g(x) = e^{−1/x²} if x > 0.
Again it is a smooth function on all of R. By another modification there is another function h, which is nonzero at all inner points of the interval [−a, a], a > 0, and zero elsewhere:
h(x) = 0 if |x| ≥ a, h(x) = e^{1/(x²−a²) + 1/a²} if |x| < a.
This function is again smooth on all of R. The last two functions are in the two figures. On the right, the parameter a = 1/2 is used. Finally we show how to get smooth analogies of the Heaviside functions. For two fixed real numbers a < b, define the function f(x) exploiting the above function g as follows:
f(x) = g(x − a)/(g(x − a) + g(b − x)).
For all x ∈ R the denominator of the fraction is positive (because g is non-negative, and on each of the three intervals determined by the numbers a and b at least one of the summands of the denominator is nonzero). Thus the definition yields a smooth function f(x) on all of R. For x ≤ a the numerator of the fraction is zero according to the definition of g. For x ≥ b

6.A.20. (a) Find the critical points and the local extremes of the function f(x) = ∛(x²) = x^{2/3}, with x ∈ R.
(b) Provide an example of a function g satisfying g′(0) = 0 such that g does not attain a local extreme at x0 = 0.
(c) Provide an example of a function having no local minima or maxima.
Solution. (a) The given function f is continuous in R and for any x > 0 we have (see also 5.C.24)
f′(x) = (x^{2/3})′ = (2/3)x^{2/3−1} = (2/3)x^{−1/3} = 2/(3∛x).
Comparing the left and right derivatives, one can check that at x = 0 the function f is not differentiable. This can be done quickly via Sage, by the syntax
f(x)=x^(2/3)
show(lim((f(x)-f(0))/x, x=0, dir="right"))
show(lim((f(x)-f(0))/x, x=0, dir="left"))
which prints out +∞ and −∞, respectively. Moreover, we see that f′(x) = 0 has no solutions, hence f has no critical points (in terms of 6.1.5). We already know from Chapter 5 how to obtain a confirmation of the critical points via Sage: just add the cell
show(solve(diff(f, x).factor()==0, x))
On the other hand, we see that f′(x) > 0 for x > 0 and f′(x) < 0 for x < 0. Hence f′ changes from negative to positive at x0 = 0, and since f is continuous at x0 = 0 this implies that f attains a local minimum at this point. Thus local extremes of a function may occur at places where the first derivative does not exist.
(b) An example is given by the function g(x) = x³, which obviously satisfies g′(0) = 0, but the origin is neither a local minimum nor a local maximum of g (recall the graph of g, or see below). This shows that stationary points are not necessarily local extremes.
(c) An example is given by the (equilateral) hyperbola y = h(x) = 1/x with x ∈ A := R∖{0} = (−∞, 0) ∪ (0, +∞). Indeed, for x1, x2 ∈ (0, +∞) with x1 < x2 we have 1/x1 > 1/x2, i.e., h(x1) > h(x2), and thus y = h(x) is strictly decreasing on (0, +∞).
Since h is an odd function, h(−x) = −h(x) for all x ∈ A, we immediately deduce that h is strictly decreasing on (−∞, 0) as well, and hence on its whole domain A. This also follows by the first derivative test: h′(x) = −1/x² < 0 for all x ∈ A. Hence there are no extreme points, a conclusion that one can illustrate by sketching the graph of h (h is an odd function, hence its graph is symmetric with respect to the origin, see also the figure below). Moreover, we have lim_{x→0+} h(x) = +∞ and hence the y-axis (x = 0) is a vertical asymptote of h. In addition, the x-axis (y = 0) is a horizontal asymptote, since lim_{x→+∞} h(x) = 0.

the numerator and denominator are equal. In the next two figures there are functions f(x) with parameters a = 1 − α, b = 1 + α. On the left α = 0.8, and on the right α = 0.4. Finally, we can create a smooth analogue of the characteristic function of any interval [c, d]. Write fε(x) for the latter function f(x) with parameters a = −ε, b = +ε. For the interval (c, d) with the length d − c > 2ε define the function
hε(x) = fε(x − c) · fε(d − x).
This function is identically zero on the intervals (−∞, c − ε) and (d + ε, ∞). It is identically one on the interval (c + ε, d − ε). Moreover, it is smooth everywhere. Locally it is either constant or monotonic (you should verify the last claim yourself). The smaller the ε > 0, the faster hε(x) jumps from zero to one around the beginning of the interval or back at the end of it. The diagram shows the choices [c, d] = [1, 2] and ε = 0.6, ε = 0.3.

6.1.11. Differential of a function. In practical use of differential calculus, we often work with dependencies between several variables, say y and x. The choice of dependent and independent variable is not fixed. The explicit relation y = f(x) with some function f is then only one of the possible options. Differentiation then expresses that the immediate change of y = f(x) is proportional to the immediate change of x with the proportion f′(x) = (df/dx)(x). This relation is often written as
dy = f′(x)dx, or df(x) = f′(x)dx,
where we interpret df(x) as a linear map R → R defined on increments of x at the given point, df(x)(v) = f′(x)·v, while the identity map yields dx(x)(v) = v. We talk about the differential of the function f, which is the best linear approximation of the function f around the point

Our figure also contains the graph of the function k(x) = −1/x, which is symmetric to the graph of h(x) = 1/x with respect to the x-axis. Notice that k has the same domain as h and is strictly increasing on A, since k′(x) = 1/x² > 0 for all x ∈ A. Hence it provides another example of a function having no local minima or maxima. A generalization occurs by the hyperbolas y = a/x, with a > 0 and a < 0, respectively. For convenience, let us finally include the code used in Sage to construct these graphs:
h(x)=1/x; k(x)=-1/x
p=plot(h(x), x,-5, 5, ymin=-5, ymax=5, rgbcolor=(0.2,0.2,0.5), thickness=1.2)
p+=text(r"$y=\frac{1}{x}, \ x>0$", (3, 2), fontsize=16, rgbcolor=(0.2,0.2,0.5))
p+=text(r"$y=\frac{1}{x}, \ x<0$", (-3, -2), fontsize=16, rgbcolor=(0.2,0.2,0.5))
p+=plot(k(x), x, -5, 5, ymin=-5, ymax=5, rgbcolor=(0.2,0.5,0.2), thickness=1.2)
p+=text(r"$y=-\frac{1}{x}, \ x<0$", (-3, 2), fontsize=16, rgbcolor=(0.2,0.5,0.2))
p+=text(r"$y=-\frac{1}{x}, \ x>0$", (3, -2), fontsize=16, rgbcolor=(0.2,0.5,0.2)); show(p)
□

6.A.21. An annoying technicality. Generally speaking, Sage is still a work “in progress” and hence its users may face some technicalities.
For instance, try to produce the graph of f(x) = x^{2/3} with x ∈ R (or of x ↦ x^{1/3} = ∛x) in a traditional way. In this case Sage will sketch only the portion corresponding to the positive reals. This problem occurs since Sage returns complex numbers for odd roots of negative numbers, when the latter are numerically approximated. This is the situation when, for example, one tries to sketch the graph of f. To see this, in your editor type the syntax
c1 = (-1)^(2/3); print(c1)
print(float(c1))
c2 = (-1.)^(2/3); print(c2); float(c2)
For c1, by the third command, Sage returns the error “unable to simplify to float approximation”, while for c2 the error

x, i.e. the following approximation property is true:
(1) lim_{v→0} (f(x + v) − f(x) − df(x)(v))/v = 0.
Since we are working in dimension 1, the definition is equivalent to the existence of the derivative f′(x) (all linear maps are just multiplications by a constant) and then the differential is df(x)(v) = f′(x)v. Later, we shall see that the situation is quite different for functions of more variables. If the quantity x is expressed by another quantity t, e.g. x = g(t) and, moreover, g is differentiable, the chain rule for differentiating the composite functions says that f ∘ g has the differential too and
df(t) = d(f ∘ g)(t) = (df/dx)(x) (dx/dt)(t) dt.
We may also write dy = f′(x)dx = f′(g(t))g′(t)dt. Therefore dy can be seen as a linear approximation of the given quantity independently of the choice of the dependent variable.

6.1.12. The numerical derivatives. We shall conclude this section with two straightforward applications of differentials. First we provide a brief introduction to the numerical procedures for differentiation. Then we discuss curves in the plane and space, starting with the graphs of functions. This will also provide first glimpses into the so called vector calculus (working with vector valued functions). In the beginning of this textbook we discussed how to describe the values in a sequence if its immediate differences are known (cf. paragraphs 1.1.5, 1.2.1). Before proceeding the same way with the derivatives we clarify the connections between derivatives and differences. The key to this is the Taylor expansion with remainder. Suppose that for some (sufficiently) differentiable function f(x) defined on the interval [a, b], the values fi = f(xi) at the points x0 = a, x1, x2, . . . , xn = b are given, while xi − xi−1 = h for some constant h > 0 and all indices i = 1, . . . , n. Write the Taylor expansion of the function f in the form
f(xi ± h) = fi ± hf′(xi) + (h²/2)f′′(xi) ± (h³/3!)f^(3)(xi) + . . .
Suppose the expansion is terminated at the term containing h^k, which is of order k in h. Then the actual error is bounded by
(h^{k+1}/(k + 1)!) |f^(k+1)(x)|
on the interval [xi − h, xi + h]. If the (k + 1)-th derivative of f is continuous, it can be bounded by a constant. Then for small h, the error of the approximation by the Taylor polynomial of order k acts like h^{k+1} except for a constant multiple. Such an estimation is called an asymptotic estimation.

from the command float(c2) is about the difficulty of converting a complex number to float (notice the difference in the definition of c1, c2). In this case Sage advises us to use one of the options abs() or real_part(). Motivated by this, to cure our problem for the given f we may instead plot the graph of |x|^{2/3}, see the figure below.
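The approximation property (1) in 6.1.11 above is easy to observe numerically. A small Sage sketch (the function, point and increment are our own choice) compares the actual increment of f with the value of the differential:
f(x) = sin(x); x0 = 1; v = 0.01       # sample point and increment (our own choice)
increment = f(x0 + v) - f(x0)          # actual change of f
differential = diff(f, x)(x0) * v      # df(x0)(v) = f'(x0)*v
print(n(increment), n(differential))
The two printed numbers agree up to an error which is o(|v|), as the property (1) predicts.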
6.A.22. Recall that if x0 is a critical point of a differentiable function f and f′′(x0) > 0, then f′ is strictly increasing near x0. Therefore, the conditions f′(x0) = 0 and f′′(x0) > 0 imply that x0 is a local minimum. Similarly, the conditions f′(x0) = 0 and f′′(x0) < 0 imply that x0 is a local maximum. Show with a counterexample that the converse statements are not true in general. ⃝

6.A.23. Suppose that a function f : R → R admits the line y = 3x + 1 as an asymptote with slope, as x → +∞. Compute the limit
lim_{x→+∞} (x f(x) + 7x² + 2)/(x² f(x) − 3x³ + 5x² + 2).
Solution. By the definition of an asymptote with slope, we have
lim_{x→+∞} f(x)/x = 3, and lim_{x→+∞} (f(x) − 3x) = 1.
Thus, if we denote by A the given limit, by dividing both the numerator and the denominator by x² we get
A = lim_{x→+∞} (f(x)/x + 7 + 2/x²)/((f(x) − 3x) + 5 + 2/x²)
= (lim_{x→+∞} f(x)/x + lim_{x→+∞}(7 + 2/x²))/(lim_{x→+∞}(f(x) − 3x) + lim_{x→+∞}(5 + 2/x²)) = (3 + 7)/(1 + 5) = 5/3. □

Let us describe a problem which requires a bit of integration, but at a very elementary level and hence already treatable. We will analyze further similar problems in Section B and also in the final section of this chapter.

6.A.24. Let f : R → R be a twice differentiable function on R satisfying for all x ∈ R the relation
(√2 x² + 1)f′′(x) + 4√2 x f′(x) + 2√2 f(x) = 0. (∗)
(a) Find those differentiable functions g : R → R satisfying g(x) = 2√2 x f(x) + (√2 x² + 1)f′(x);
(b) Based on your answer in (a), determine the function f, if in addition it is known that the graph Cf of f passes through

Asymptotic estimates
Definition. The expression G(h) is asymptotically equal to F(h) for h → 0, written G(h) = O(F(h)), if the finite limit
lim_{h→0} G(h)/F(h) = a ∈ R
exists. Similarly, we compare the expressions for h → ∞ and use the same notation. If the limit is zero, then we write G(h) = o(F(h)).

This way of denoting the asymptotic behaviour is often called the big O and small o notation. For instance, the differentiability of a function f means that the error of the approximation of f by its differential is o(|h|) for increments h of the argument. Denote the values of the derivatives of f(x) at the points xi as f^(j)_i. Write the Taylor expansion as:
f_{i±1} = f_i ± f′_i h + (f′′_i/2) h² ± (f′′′_i/6) h³ + . . .
Considering combinations of the two expansions and f_i itself, we can express the derivative f′_i as follows:
(f_{i+1} − f_{i−1})/(2h) = f′_i + (h²/3!) f^(3)_i + . . .
(f_{i+1} − f_i)/h = f′_i + (h/2!) f′′_i + . . .
(f_i − f_{i−1})/h = f′_i − (h/2!) f′′_i + . . .
This suggests a basic numerical approximation for derivatives:

Central, forward, and backward differences
The central difference is defined as
f′_i = (f_{i+1} − f_{i−1})/(2h),
the forward difference is
f′_i = (f_{i+1} − f_i)/h,
and the backward difference is
f′_i = (f_i − f_{i−1})/h.

Theorem. The asymptotic estimate of the error of the central difference is O(h²). The errors of the backward and forward differences are O(h).

Proof. If we use the Taylor expansions with remainder of the appropriate order, we obtain an expression of the error of the approximation by the central difference in the form
(1/(2 · 3!)) h² (f^(3)(xi + ξh) + f^(3)(xi − ηh)).

the origin and the tangent line of Cf at the origin is perpendicular to the line determined by the equation x + √2 y − 1 = 0;
(c) Use Sage to confirm that your answer in (b) indeed satisfies the initial condition (∗). Moreover, sketch the graph of f for −10 ≤ x ≤ 10. Where did we meet a similar function earlier?
(d) Use Sage to specify the local extremes of f and characterize their type.
Solution. (a) By differentiating the relation g(x) = 2√2 x f(x) + (√2 x² + 1)f′(x) with respect to x, we obtain
g′(x) = 2√2 f(x) + 2√2 x f′(x) + 2√2 x f′(x) + (√2 x² + 1)f′′(x) = 0, x ∈ R,
where the final equality follows by (∗). This implies that g(x) = c for all x ∈ R, for some constant c ∈ R, i.e., g is constant.
(b) One observes that
g(x) = 2√2 x f(x) + (√2 x² + 1)f′(x) = ((√2 x² + 1)f(x))′, x ∈ R.
In (a) we proved that g(x) = c for all x ∈ R, hence a combination with the previous relation gives ((√2 x² + 1)f(x))′ = c. Hence it is easy to guess that (√2 x² + 1)f(x) = c x + α, for some constant α ∈ R. As we will see in the forthcoming section, this follows by integrating both sides of the relation including the derivative, i.e.,
∫((√2 x² + 1)f(x))′ dx = ∫ c dx ⟹ (√2 x² + 1)f(x) = c x + α, α ∈ R.
The polynomial √2 x² + 1 has only complex roots, hence it is never zero and we can divide by it, which yields the expression
f(x) = (c x + α)/(√2 x² + 1), for all x ∈ R. (∗∗)
To determine the constants c and α we rely on the information specified by the scenario in (b). First, f passes through the origin, hence we need α = 0. Next, the line x + √2 y − 1 = 0 is written as y = −(1/√2)x + 1 = −(√2/2)x + 1, which means that its slope equals −√2/2. Since the tangent line of f at the origin should be perpendicular to this line, we need f′(0) = 2/√2 = √2, such that −(√2/2) · √2 = −1. Therefore, to determine c we need to solve the equation f′(0) = √2. So, let us use (∗∗) (for α = 0) to compute f′(x):
f′(x) = c(−√2 x² + 1)/(√2 x² + 1)², x ∈ R.
Thus f′(0) = √2 if and only if c = √2, which means that
f(x) = √2 x/(√2 x² + 1), for all x ∈ R.
(c) A confirmation in Sage relies on the command bool, as follows (we also include the syntax for sketching Cf)

Here, 0 ≤ ξ, η ≤ 1 are the values from the remainder expression of f_{i+1} and f_{i−1}, respectively. The error in the other two cases, which involves the second derivative, is obtained similarly. □

Surprisingly, the central difference is one order better than the other two. But of course, the constants in the asymptotic estimates are important, too. In the case of the central difference, the bound on the third derivative appears, while in the two other cases second derivatives show up instead. We proceed the same way when approximating the second derivative. To compute f′′(xi) from a suitable combination of the Taylor polynomials, we cancel both the first derivative and the value at xi. The simplest combination cancels all the odd derivatives as well:
(f_{i+1} − 2f_i + f_{i−1})/h² = f^(2)_i + (h²/12) f^(4)(xi) + . . .
This is called the second order difference. Just as in the central first order difference, the asymptotic estimate of the error is
f^(2)_i = (f_{i+1} − 2f_i + f_{i−1})/h² + O(h²).
Notice that the actual bound depends on the fourth derivative of f.
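The orders O(h²) and O(h) are easy to observe experimentally. The following Sage sketch (the test function sin and the point x0 = 1 are our own choice) halves h and prints the errors of the central and forward differences; the former drops roughly by a factor of four, the latter by a factor of two:
f(x) = sin(x); x0 = 1
exact = cos(x0)                          # the true value f'(x0)
for h in [0.1, 0.05, 0.025]:
    central = (f(x0 + h) - f(x0 - h))/(2*h)
    forward = (f(x0 + h) - f(x0))/h
    print(h, n(abs(central - exact)), n(abs(forward - exact)))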
6.1.13. The curvature of the graph of a function. Imagine the graph of a function as a movement in the plane parametrized by the independent variable x. The vector (1, f′(x)) ∈ R² represents the velocity at x of such a movement. The tangent line through [x, f(x)] parametrized by this directional vector then represents a linear approximation of the curve. The goal is to discuss how “curved” the graph is at x. This is a straightforward exercise working with differentials in the setup of elementary plane geometry. It might need some effort to keep track of the details.

If f′′(x) = 0 and simultaneously f′′′(x) ≠ 0, the graph of the function f intersects its tangent line. In such a case, the tangent line is the best approximation of the curve at the point x up to the second order as well. We describe this by saying that the graph of f has zero curvature at the point x.

f(x)=(sqrt(2)*x)/(sqrt(2)*x^2+1)
print(bool((sqrt(2)*x^2+1)*diff(f, x, 2)
+ 4*sqrt(2)*x*diff(f, x)+2*sqrt(2)*f(x)==0))
show(plot(f(x), x, -10, 10,

The nonzero values of the first derivative describe the speed of the growth. Intuitively we expect the second derivative to describe the acceleration, including how “curved” the graph is. As a matter of convention, we want the curvature to be positive if the graph of the function is above its tangent. The tangent at a fixed point P = [x, f(x)] is the limit of the secants, i.e., the lines passing through the points P and Q = [x + ∆x, f(x + ∆x)]. To approximate the second derivative, interpolate the points P and Q ≠ P by the circle CQ, whose center is at the intersection of the perpendicular lines to the tangents at P and Q. It can be seen from the figure that if the angle between the tangent at the fixed point P and the x axis is α and the angle between the tangent at the chosen point Q and the x axis is α + ∆α, then the angle of the latter perpendicular lines is ∆α as well. Denote the radius of the circle by ρ. Then the length of the arc between points P and Q is ρ∆α. As Q approaches the fixed point P, the length ρ∆α of the arc approaches the length ∆s of the curve between P and Q, that is, the graph of the function f(x). At the same time the circle approaches some limit circle CP. This limit circle CP is called the osculating circle. Thus we arrive at the basic relation for the expected radius ρ of the circle CP in terms of the linear approximations of the quantities:
ρ = lim_{∆α→0} ∆s/∆α = ds/dα.
Notice that the quantity on the right hand side is well defined (independently of its rather intuitive justification). Define the curvature of the graph of the function f at the point P as the number 1/ρ. Zero curvature then corresponds to an infinite radius ρ. For computing the radius ρ in terms of f we need to express the length of the arc s by the change of the angle α and express the derivative of this function in terms of the derivative of f. Notice that for an increasing angle α the length of the arc can either increase or decrease, depending on whether the circle CQ has its center above or below the graph of the function f. The sign of ρ then reflects whether the function is concave or convex. There is also the special case when the center “runs off” to infinity in the limit. Instead of a circle there is the tangent line. There is no direct tool to compute the derivative ds/dα. However, tg α = df/dx. By differentiating this equality with respect to x we obtain (using the chain rule for differentials)
(1/(cos α)²) (dα/dx) = f′′.
On the left hand side we can substitute
1/(cos α)² = 1 + (tg α)² = 1 + (f′)²

ymin=-0.8, ymax=0.8, color="black"))
This block verifies our answer and produces the figure presented here:
(d) As shown in the graph above, f has two local extrema: one maximum and one minimum. Recall that critical points are found by solving the equation f′(x) = 0.
Sage performs this calculation quickly for us:
f(x)=(sqrt(2)*x)/(sqrt(2)*x^2+1)
d1=diff(f, x).factor(); print(solve(d1==0, x))
This prints out two solutions, namely x± = ±(1/2)·2^{3/4} = ±⁴√8/2 ≈ ±0.84. An alternative to obtain these solutions is based on the command roots and can be implemented as follows:
f(x)=(sqrt(2)*x)/(sqrt(2)*x^2+1)
d1=diff(f, x).factor(); roots=d1.roots()
sols = []
for i in range(len(roots)):
    sols.append(roots[i][0])
print(sols)
Executing this block one gets the two solutions (without their multiplicities), i.e., [-(1/2)*2**(3/4), (1/2)*2**(3/4)]. To verify that x+ (respectively, x−) is a point where f attains its local maximum (respectively, minimum) we will use the criterion of the second derivative. Hence it suffices to add the syntax
bool(diff(f, x, 2)(x=1/2*2^(3/4))<0)
bool(diff(f, x, 2)(x=-1/2*2^(3/4))>0)
which for both cases returns True. Or we can program Sage to check the critical points itself and print the corresponding result. For this it is sufficient to add in the initial block the following:
d2 = f.diff(2)
for a in sols:
    if d2(a)>0:
        print("{} is a local minimum".format(a))
    if d2(a)<0:
        print("{} is a local maximum".format(a))

which implies (see the rule for differentiating inverse functions)
dx/dα = (1 + (tg α)²)/f′′ = (1 + (f′)²)/f′′.
Now, we are almost finished, because the increment of the length of the arc s dependent on x is given by the formula
ds/dx = (1 + (f′)²)^{1/2}.
Thus, by the chain rule,
ρ = ds/dα = (ds/dx)(dx/dα) = (1 + (f′)²)^{3/2}/f′′.
Taking the reciprocal value, we have computed:

Curvature of the graph
If f is in C², then the curvature of its graph is given by the formula
ρ^{−1}(x) = f′′(x)/(1 + (f′(x))²)^{3/2}.

The result explains the relation between the curvature and the second derivative. The denominator of the fraction is always positive. It equals the third power of the length of the tangent vector of the given graph curve. The sign of the curvature is therefore given only by the sign of the second derivative, which confirms the ideas about concave and convex points of functions. In particular, at critical points, the curvature is just the second derivative. If the second derivative is zero, the curvature 1/ρ is also zero. If f′′ is large, then the radius ρ is small, thus the curvature is large as well. Compute the curvature of simple functions yourself and use osculating circles while sketching their graphs. The computation at the critical points of the function f is easiest. The radius of the osculating circle is the reciprocal value of the second derivative, with the corresponding sign.

6.1.14. Vector differential calculus. As mentioned already in the introduction to chapter five, most considerations related to differentiation are based on the fact that the functions are defined on real numbers and that their values can be added and multiplied by real numbers. That is why functions f : R → V need to have values in a vector space V. We call them vector functions of one real variable or, more briefly, vector functions. Now, we digress to consider functions with values in the plane or in space. Thus, f : R → R² and f : R → R³. We consider (parametrized) curves in plane and space. We could work with values in Rⁿ for any finite dimension n. For simplification, we work with the fixed standard bases ei in R² and R³. So curves are given by pairs or triples of real functions of one real variable, respectively.
The vector function r in plane or space, respectively, is given by
r(t) = x(t)e1 + y(t)e2, r(t) = x(t)e1 + y(t)e2 + z(t)e3.

In this case the output looks as follows:
-1/2*2^(3/4) is a local minimum
1/2*2^(3/4) is a local maximum
Finally, Sage can be used to compute the values of f at x±, which allows us to summarize as follows: The point P+ = [⁴√8/2, ⁴√2/2] is a local maximum of f, while the point P− = [−⁴√8/2, −⁴√2/2] is a local minimum of f. Notice ±⁴√2/2 ≈ ±0.59. □

6.A.25. Local extrema numerically via Sage. We should remark that Sage provides an in-built method which numerically finds the local extremes of a given function on an interval [a, b], along with the points at which these extreme values are attained. This technique relies on the commands find_local_maximum(f, a, b) and find_local_minimum(f, a, b), respectively. Hence, for example, for the function f presented in 6.A.24 one could alternatively type
f(x)=(sqrt(2)*x)/(sqrt(2)*x^2+1);
show(find_local_maximum(f, -10, 10))
show(find_local_minimum(f, -10, 10))
This, indeed, returns the desired answers in the form (f(x±), x±), see here:
(0.5946035575013604, 0.8408963982254773),
(−0.5946035575013604, −0.8408963982254773).

6.A.26. Use Sage to sketch the real functions f, g, h, k for x ∈ [−2, 2], and find numerically their local extrema, if any, where:
(a) f(x) = (1/8)x⁸ − (3/5)x⁵ − 3x + 9,
(b) g(x) = x² − (1/6)x³,
(c) h(x) = x√(x² + 4),
(d) k(x) = ln(x³ + 8). ⃝

6.A.27. Find all intervals on which the function f(x) = e^{−x²}, with x ∈ R, is concave.
Solution. We compute f′(x) = −2x e^{−x²} and f′′(x) = 2(2x² − 1) e^{−x²} for all x ∈ R. The sign of f′′ is given in the following table:
x:       −∞   −√2/2   √2/2   +∞
f′′(x):     +     0    −    0     +
Thus for any x ∈ (−√2/2, √2/2) we get f′′(x) < 0, and the desired answer is given by the open interval (−√2/2, √2/2) ⊂ R. □

6.A.28. Remark. In Sage we might consider using the bool command to verify the inequality f′′(x) < 0, for all x ∈ (−√2/2, √2/2), from the previous problem. Hence, for example, one could type
f(x)=exp(-x**2)
df2=diff(f, x, 2)
s=RealSet(-sqrt(2)/2, sqrt(2)/2)
bool(df2(x)<0 for x in s)

The derivative of such a vector function is a vector, which approximates the map r by a linear map of the real line to the plane or to the space. In the plane it is
dr/dt (t) = r′(t) = x′(t)e1 + y′(t)e2
and similarly in space. The differential of a vector function in this context is:
dr = ((dx/dt) e1 + (dy/dt) e2 + (dz/dt) e3) dt,
where the expression on the right hand side is understood as “selecting” an increment of the scalar independent variable t and mapping it linearly by multiplying the vector of the three derivative components. Thus the corresponding increment of the vector quantity r is obtained (of course, only two components in the plane). The notation r(t) is a convenient way to describe curves in space. For example r(t) = (a cos t, a sin t, bt) or r(t) = a cos t e1 + a sin t e2 + bt e3, for fixed constants a, b, describes a circular helix. Here the parameter t is related to a suitable angle measured around the z-axis. The derivative of r(t) at t = t0 determines the direction of the tangent line at r(t0). In Newtonian mechanics, the parameter t can stand for time, measured in suitable units. In this case the derivative of r(t) at time t = t0 gives the velocity vector at the same time. The second derivative then represents the acceleration vector at the same time.
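Such componentwise differentiation is immediate in Sage. A minimal sketch for the circular helix above, with sample values a = 1, b = 1/2 of the constants chosen by us:
t = var('t')
a, b = 1, 1/2                           # sample constants (our own choice)
r = vector([a*cos(t), a*sin(t), b*t])   # the circular helix
v = r.diff(t)                           # velocity vector r'(t)
w = r.diff(t, 2)                        # acceleration vector r''(t)
print(v.subs(t=0), w.subs(t=0))
print(sqrt((v*v).simplify_full()))      # constant speed sqrt(a^2 + b^2)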
6.1.15. Differentiating composite maps. In linear algebra and geometry there are very useful special maps called forms. They have one or more vectors as their arguments and they are linear in each of their arguments. In this way we defined the length of vectors (the dot product is a symmetric bilinear form) or the volume of a parallelepiped (this is an n-linear antisymmetric form, where n is the dimension of the space), see for example the paragraphs 2.3.23 and 4.1.19. Of course, we insert vectors r(t) dependent on a parameter as the arguments of these operations. By a straightforward usage of the Leibniz rule for differentiation of a product of functions, the following is verified:

Theorem. (1) If r(t) : R → Rⁿ is a differentiable vector function and Ψ : Rⁿ → Rᵐ is a linear map, then the derivative of the map Ψ ∘ r satisfies
d(Ψ ∘ r)/dt = Ψ ∘ (dr/dt).
(2) Consider differentiable vectors r1, ..., rk : R → Rⁿ and a k-linear form Φ : Rⁿ × . . . × Rⁿ → R on the space Rⁿ. Then the derivative of the composed map φ(t) = Φ(r1(t), . . . , rk(t)) satisfies the (generalized) Leibniz rule
dφ/dt = Φ(dr1/dt, r2, . . . , rk) + · · · + Φ(r1, . . . , rk−1, drk/dt).

In this block we used the RealSet command to define the interval of interest. Although Sage returns True, this result can be misleading and we cannot trust this method. This is because if we replace the last line by bool(df2(x) > 0 for x in s), Sage will still return True. In fact, the expression df2(x) < 0 for x in s is a generator, and bool applied to a generator object always returns True, without ever evaluating the condition; so no check of the inequality over the interval is performed at all. Therefore, one should exercise caution when interpreting results from computer algebra systems.

6.A.29. Show that the function f(x) = x^a with 0 < a < 1 is concave for all x > 0. ⃝

6.A.30. Use Sage to determine the asymptotes of the function f(x) = x/(x² − 1), with x ∈ R∖{±1}, and then sketch these asymptotes along with the graph of f. ⃝

6.A.31. Consider the function f(x) = (e^x − 1)/(e^x + 1) with x ∈ R.
(a) Prove that f is increasing on R, find its asymptotes and determine its range. Then use Sage to sketch the asymptotes along with the graph of f.
(b) Determine the extreme points, inflection points and intervals of convexity of f, if any. ⃝

6.A.32. Study the local behaviour of the function f(x) = x/(x + 1)², with x ∈ A := R∖{−1}.
Solution. Let us begin with the asymptotes of f. Obviously, f is continuous on its domain A. Thus, the only place we may have a vertical asymptote is at the endpoint −1. Indeed, we see that the line x = −1 is a vertical asymptote since lim_{x→−1+} f(x) = lim_{x→−1−} f(x) = −∞. We also have lim_{x→−∞} f(x) = lim_{x→∞} f(x) = 0, thus the line y = 0 is a horizontal asymptote. In Sage a computation of these limits can be done as usual, i.e.,
f(x)=x/(x+1)^2
show(lim(f(x), x=-1, dir="left"))
show(lim(f(x), x=-1, dir="right"))
show(lim(f(x), x=-oo)); lim(f(x), x=oo)
Next, by adding the command show(diff(f, x).factor()) we obtain the first derivative of f,
f′(x) = −(x − 1)/(x + 1)³, x ∈ A. (♭)
Try to verify these results by hand. By (♭) we deduce that the equation f′(x) = 0 has a unique solution, given by x = 1, hence f has a unique critical point. There f attains a relative maximum, with value f(1) = 1/4 (see below for an explanation).
To determine the critical points of f via Sage, add in the previous block the syntax solve(diff(f, x)==0, x) (the command diff(f, x).roots() also works, as we explained in 6.A.24). Now, by (♭) we also deduce that

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

(3) The previous statement remains valid even if Φ also has values in a vector space, and is linear in all its k arguments.

Proof. (1) The linear maps are given by a constant matrix of scalars A = (aij), so that Ψ ◦ r(t) = (∑_{i=1}^n a1i ri(t), ..., ∑_{i=1}^n ami ri(t)). Carry out the differentiation separately for the individual coordinates of the result. The derivative acts linearly with respect to scalar linear combinations, see Theorem 5.3.4. That is why the derivative is obtained simply by evaluating the original linear map Ψ on the derivative r′(t). (2) The second statement is obtained analogously. Write out the evaluation of the k-linear form on the vectors r1, ..., rk in coordinates in this way: Φ(r1(t), ..., rk(t)) = ∑_{i1,...,ik=1}^n B_{i1...ik} (r1)_{i1}(t) ··· (rk)_{ik}(t), where the scalars B_{i1...ik} = Φ(e_{i1}, ..., e_{ik}) are given as the values of the given form on the chosen k-tuple of basis vectors, for every choice of indices. The rule for differentiating a product of scalar functions then yields the statement. (3) If Φ has vector values, it is given by finitely many components and the previous result can be used for each of them. □

In the Euclidean space R^3, the scalar product assigns a scalar to two vectors. There is also the vector product, which assigns the vector u × v ∈ R^3 to vectors u and v, see 4.1.21. This vector u × v is orthogonal to both vectors u and v, its length equals the area of the parallelogram determined by u and v (in this order), and the orientation is such that the triple u, v, u × v is a positively oriented basis. The previous ideas immediately imply:

Corollary. Consider the vectors u(t) and v(t) in the space R^3. The derivatives of their scalar product ⟨u(t), v(t)⟩ and their vector product u(t) × v(t) satisfy
(1) d/dt ⟨u(t), v(t)⟩ = ⟨u′(t), v(t)⟩ + ⟨u(t), v′(t)⟩,
(2) d/dt (u(t) × v(t)) = u′(t) × v(t) + u(t) × v′(t).

6.1.16. The curvature of curves. We develop far more powerful tools for studying curves in a more systematic way than when we discussed the curvature of the graphs of functions. We proceed in dimension three. Plane curves are a special case in which the third component is the constant zero. Let r(s) be a curve in the Euclidean space R^3. For our purposes, it is convenient to choose the arc length s as the parameter. It follows that ∥dr/ds∥ = 1, so that the tangent vector has unit length. When s is the parameter, the notation ′ is used for differentiation.

• f′(x) < 0 for all x ∈ (−∞, −1) ∪ (1, +∞);
• f′(x) > 0 for all x ∈ (−1, 1).
Thus f decreases for all x ∈ (−∞, −1) ∪ (1, +∞) and increases for x ∈ (−1, 1). Let us now proceed with the second derivative. By adding the command diff(f, x, 2) we obtain the second derivative of f, given by f′′(x) = 2(x − 2)/(x + 1)^4 for all x ∈ A. Hence x = 2 is the unique solution of f′′(x) = 0, and we see that
• f′′(x) < 0 for all x ∈ (−∞, −1) ∪ (−1, 2);
• f′′(x) > 0 for all x ∈ (2, +∞).
Thus f is convex for all x ∈ (2, +∞) and concave for x ∈ (−∞, −1) ∪ (−1, 2). It also follows that the unique inflection point of f appears at x = 2, with value f(2) = 2/9.
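This sign analysis can also be delegated to Sage; the cell below is a small sketch of ours, sampling the factored second derivative on both sides of x = 2:
f(x)=x/(x+1)^2
d2=diff(f, x, 2).factor()
show(d2)
show(solve(d2(x)==0, x))
show(d2(0), d2(3))
Since the denominator (x + 1)^4 is positive on A, the sign of f′′ is governed by the factor x − 2 alone, which explains the sign pattern above.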
Notice that, in order to verify that a relative maximum appears at the critical point located above, we may use the criterion based on the second derivative (see below) and show that f′′(1) < 0. Let us finally focus on the graph of f. The origin is the unique intersection of the graph of f with the axes, while f is neither even nor odd (so its graph has no particular symmetries). To sketch the graph of f for x ∈ A, together with its vertical asymptote x = −1, we use the cell
f(x)=x/(x+1)^2
pf=plot(f, x, -5, 5, ymin=-5, ymax=1, color="black")
pm=point((1, 1/4), size=30, color="red")
pt=circle((-1, 0), 0.05, color="black")
pnf=point((2, 2/9), size=30, color="blue")
sy=line([(-1,-5), (-1,5)], linestyle="--", thickness=0.8)
(pf+sy+pm+pt+pnf).show(ticks=1, tick_formatter=1)
Execute this block in your editor and observe its small particularities, such as the red and blue colours marking the maximum and the inflection point of f, respectively. □

We will now focus on derivatives of functions defined “implicitly” by equations of the form F(x, y) = 0. In such cases it is often impossible to solve the equation for y (as an explicit function of x). However, as we will see below, it is still possible to find the derivative dy/dx, compute extrema, and decide on the convexity features of y, etc. For instance, to compute dy/dx the trick is simply to differentiate both sides of the given equation, and then solve for the derivative we are seeking. This is the so-called implicit differentiation, which has many useful applications.

6.A.33. Implicit differentiation. Find the extreme points of the real function y = f(x) given in the implicit form xy^2 − x^2 y = 54. Solution. Since y is a function of x, we can differentiate the given relation with respect to x. We get y^2 + 2xyy′ − 2xy − x^2 y′ = 0, and thus, solving with respect to y′, one deduces that y′ = (2xy − y^2)/(2xy − x^2). (♯)

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

So ⟨r′(s), r′(s)⟩ = 1 for all s. The curve r(s) is parametrized by the length s. By another differentiation of this equality we arrive at (using the symmetry of the dot product) 0 = d/ds ⟨r′(s), r′(s)⟩ = 2⟨r′′(s), r′(s)⟩. Thus the vector r′′(s) is always orthogonal to the vector r′(s). This corresponds to the idea that, after the choice of a parametrization with a derivative of constant length, the second derivative in the direction of the movement vanishes. The second derivative lies in the plane orthogonal to the tangent vector. If the second derivative is nonzero, the normed vector n(s) = (1/∥r′′(s)∥) r′′(s) is the (principal) normal of the curve r(s). The scalar function κ(s) satisfying (at the points where r′′(s) ≠ 0) r′′(s) = κ(s)n(s) is called the curvature of the curve r(s). At the zero points of the second derivative, κ(s) is defined as 0. At the nonzero points of the curvature, the unit vector b(s) = r′(s) × n(s) is well defined and is called the binormal of the curve r(s). By direct computation, 0 = d/ds ⟨b(s), r′(s)⟩ = ⟨b′(s), r′(s)⟩ + ⟨b(s), r′′(s)⟩ = ⟨b′(s), r′(s)⟩ + κ(s)⟨b(s), n(s)⟩ = ⟨b′(s), r′(s)⟩, which shows that the derivative of the binormal is orthogonal to r′(s). Further, b′(s) is also orthogonal to b(s) (for the same reason as with r′ above). Therefore it is a multiple of the principal normal n(s). We write b′(s) = −τ(s)n(s). The scalar function τ(s) is called the torsion of the curve r(s). In the case of plane curves, the definitions of binormal and torsion do not make sense.
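As a concrete illustration of these invariants, one can let Sage recompute the curvature of the circular helix r(t) = (a cos t, a sin t, bt) from 6.1.14. The cell below is a rough sketch of ours: since the speed is √(a^2 + b^2), substituting t = s/√(a^2 + b^2) yields the parametrization by arc length, and the expected output is the constant curvature κ = a/(a^2 + b^2).
a, b, s = var("a b s", domain="positive")
c = sqrt(a^2 + b^2)
r = vector([a*cos(s/c), a*sin(s/c), b*s/c])
r2 = diff(r, s, 2)
show(sqrt(r2*r2).simplify_full())
A similar computation with b′(s) = −τ(s)n(s) gives the constant torsion τ = b/(a^2 + b^2); so both invariants of the helix are constant, which fits the theorem below, stating that κ and τ determine a space curve up to a Euclidean transformation.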
One can obtain this formula also in Sage by the cell
var("x"); y = function("y")(x)
eq = x*y**2 - y*x**2 == 54
dy = diff(y, x)               # declare the derivative
sol = solve(diff(eq), dy)     # differentiate and solve the implicit equation
show(dy.subs(sol))            # substitute the solution
Thus Sage can be quickly programmed to implement implicit differentiation for us, see also below. Let us now return to our task. Based on (♯), we see that the equation y′ = 0 is equivalent to y(2x − y) = 0, which gives y = 0 or y = 2x. The first solution is not acceptable (y = 0 does not satisfy the given equation), and hence we have y′ = 0 if and only if y = 2x. For y = 2x the given equation reduces to 2x^3 = 54, i.e. x^3 − 27 = 0, that is, (x − 3)(x^2 + 3x + 9) = 0. Here x = 3 is the unique real solution, hence we get a stationary point of y with value y(3) = 6. Next we need to characterize this critical point, and hence we need the second derivative. A differentiation of both sides in (♯) gives the relation y′′ = A(x)/B(x), where A(x) = 2[(y + xy′ − yy′)(2xy − x^2) − (2xy − y^2)(y + xy′ − x)] and B(x) = (2xy − x^2)^2, respectively. By substituting the coordinates [x = 3, y = 6], and since y′ vanishes at this point, we finally deduce that y′′(3) = 12/27 = 4/9 > 0. Thus x = 3 is a local minimum of y = f(x), with value ymin = 6. □

6.A.34. Suppose that f : R → R is a function which is twice differentiable on R and satisfies f^2(x) − xf(x) + 2x^2 − 7 = 0, x ∈ R. If a ∈ R is a stationary (critical) point of f, prove that a = ±√2/2. Does f admit some inflection point? ⃝

Recall from Chapter 5 that first order derivatives provide linear approximations, which can be used to estimate values of differentiable functions (cf. 5.C.18). This idea is revised in 6.1.11, in terms of the “differential” of a differentiable function y = f(x), defined by df(x) = f′(x)dx, i.e., dy = f′(x)dx. This is of course an equivalent way to interpret the differentiability of f, which is encoded by the relation f′(x) = dy/dx. Notice that dx is an independent variable to which we may assign any non-zero real number, while dy (or df(x)) is necessarily a dependent variable. The differential of f can be viewed as a linear form on R, satisfying df(x)(h) = f′(x)h for all h ∈ R, and it provides the best linear approximation of f around x. Let us illustrate the most basic features of this concept via examples.

6.A.35. Compute the differentials of the functions f(x) = e^(x^2), g(x) = ln(x^2 + 1) and k(x) = arctan(e^x) with x ∈ R. Next present an implementation of these differentials via Sage. ⃝

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

We have not yet computed the rate of change of the principal normal, which can be written as n(s) = b(s) × r′(s): n′(s) = b′(s) × r′(s) + κ(s) b(s) × n(s) = −τ(s) n(s) × r′(s) + κ(s)(−r′(s)) = τ(s)b(s) − κ(s)r′(s). Summarizing, for all points with nonzero second derivative of the curve r(s) parametrized by the arc length, we have constructed the nice orthonormal basis (r′(s), n(s), b(s)), called the Frenet frame in the classical literature. At the same time, this basis is used to express the derivatives of its components in the well known form:

Frenet–Serret formulae
For the Frenet frame (r′(s), n(s), b(s)), the so called Frenet–Serret formulae hold true:
dr′/ds (s) = κ(s)n(s),
dn/ds (s) = τ(s)b(s) − κ(s)r′(s),
db/ds (s) = −τ(s)n(s).

The following theorem tells how crucial the curvature and torsion are. Notice that if the curve r(s) lies in one plane, then the torsion is identically zero. In fact, the converse is true as well. We shall not provide the proofs here. Theorem.
Two curves in the space R^3 parametrized by the length of their arc can be mapped to each other by a Euclidean transformation (i.e. an affine map preserving distances) if and only if their curvature functions and torsion functions coincide, except for a possible constant shift of the parameter. Moreover, for every choice of smooth functions κ and τ there exists a smooth curve with this curvature and torsion.

By a straightforward computation we can check that the curvature of the graph of a function y = f(x) in the plane and the curvature κ of this curve as defined in this paragraph coincide. Indeed, comparing the differentials of the arc length for the graph of a function (as a curve with coordinates x(t), y(t) = f(x(t))): ds = (1 + (fx)^2)^(1/2) dx, dx = (1 + (fx)^2)^(−1/2) ds (we write fx = df/dx), we obtain the following equality for the unit tangent vector of the graph: r′(s) = ((1 + (fx)^2)^(−1/2), fx (1 + (fx)^2)^(−1/2)). Now we have to go through a somewhat messy computation for the second derivative. Let us write briefly r = (x, y), y′ = fx x′, x′ = (1 + fx^2)^(−1/2) (recall that ′ always means the derivative with respect to the arc length parameter s). Then x′′ = −(1/2)(x′)^3 · 2 fx fxx x′ = −(x′)^4 fx fxx, and y′′ = fxx (x′)^2 + fx x′′ = fxx (x′)^2 − fxx fx^2 (x′)^4.

6.A.36. Differentials. Construct in Sage a routine having as input a pair (f, x0) consisting of a function and a point, and as output the differential of f at x0 (assuming that f is differentiable at x0). Next use your program to test some cases explicitly.

Solution. To solve this task we can use the def command to define a routine with the name Dif. Notice that Dif should have two inputs, a function f depending on one variable, and a real number x_0. Since df(x) = f′(x)dx, we should program Sage to view dx as a variable different from x, and it is sufficient to introduce dx as a symbolic variable, similarly to x. In total, one can program Sage via the block
var("x, dx")
def Dif(f, x_0):
    f1 = diff(f, x)(x=x_0)
    show("The differential of f(x)=", f, ", at x=", x_0, ", equals:", (dx)*f1)
    return
Notice that in the body of this routine the syntax diff(f, x)(x=x_0) corresponds to the derivative f′(x0). Let us now test the routine. For this goal, we choose the following pairs (f, x0):
Dif(ln(x), 2*pi)
Dif(exp(x), 0)
Dif(cos(2*x), 2*pi)
Dif(ln(x^2+1), 4*pi)
Dif(arctan(exp(x)-x^3+4*x), pi/2)
In the first case, for the pair (ln(x), 2π), Sage's output has the form:¹ “The differential of f(x) = log(x), at x = 2π, equals: dx/(2π)”, which is obviously true. Execute the remaining commands in your editor and then verify Sage's solutions by hand. As a side remark, observe that when the input is a function f which is not differentiable at x0, our routine returns an error (similar to the error that Sage returns when the derivative of a function at a point does not exist). □

6.A.37. Consider a function y = f(x) differentiable at a point x0. The change in the values of y in a small neighbourhood around x0 is given by ∆(y) = f(x0 + dx) − f(x0). We can approximate ∆(y) linearly using the differential of f at x0, that is, ∆(y) ≈ f′(x0)dx. Based on this formula, compare ∆(y) and dy for the cases: (a) y = x^4 − x^2 + 3x + 2 and x changes from 2 to 2.05; (b) y = x^2 + sin^2(x) cos(x) and x changes from π/2 to π/2 + 0.04; (c) y = ln(x + 1) − e^(√x) and x changes from 1 to 1.03.

Solution. (a) We have x0 = 2 and x0 + dx = 2 + 0.05 = 2.05. Thus ∆(y) = f(x0 + dx) − f(x0) = f(2.05) − f(2) ≈

¹To avoid confusion, recall that in Sage the function log(x) corresponds to the natural logarithm ln(x).
CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Hence (x′′)^2 + (y′′)^2 = fxx^2 (x′)^8 (fx^2 + (1 + fx^2)^2 + fx^4 − 2fx^2(1 + fx^2)) = fxx^2 (1 + fx^2)^(−4) (fx^2 + 1) = fxx^2 (1 + fx^2)^(−3). We have arrived at the expected formula κ^2 = ∥r′′∥^2 = (d^2 f/dx^2)^2 (1 + (fx)^2)^(−3).

2. Integration

6.2.1. Indefinite integral. Now we reverse the procedure of differentiation. We want to reconstruct the actual values of a function using its immediate changes. If we consider the given function f(x) as the (say continuous) derivative of an unknown function F(x), then at the level of differentials we can write dF = f(x) dx. We call the function F the primitive function or the indefinite integral of the function f (the name antiderivative is also found in the literature). Traditionally we write F(x) = ∫ f(x) dx.

Lemma. The primitive function F(x) to the function f(x) is determined uniquely on each interval [a, b], up to an additive constant.

Proof. The statement follows immediately from Lagrange's mean value theorem, see 5.3.9. Indeed, if F′(x) = G′(x) = f(x) on the whole interval [a, b], then the derivative of the function (F − G)(x) vanishes at all points c of the interval [a, b]. The mean value theorem implies that for all points x in this interval, F(x) − G(x) = F(a) − G(a) + 0 · (x − a). Thus the difference of the values of the functions F and G is constant on the interval [a, b]. □

The previous lemma supports another notation for the indefinite integral: F(x) = ∫ f(x) dx + C, with an unknown constant C. The primitive functions are well defined for complex functions f, where the real and the imaginary parts of the indefinite integrals are real primitive functions to the real and the imaginary parts of f. Thus, with no loss of generality, we work only with real functions in the sequel.

1.6085. On the other hand, dy|_(x=x0) = f′(x0)dx, that is, dy = (4x^3 − 2x + 3)|_(x=2) · dx = 31 · 0.05 = 1.55. Thus |∆(y) − dy| = ∆(y) − dy ≈ 1.6085 − 1.55 ≈ 0.0585.

(b) In this case we have x0 = π/2 and x0 + dx = π/2 + 0.04. Based on the following block in Sage,
f(x)=x^2+(sin(x))^2*cos(x)
D=f((pi/2)+0.04)-f(pi/2)
show(N(D))
we compute ∆(y) = f(π/2 + 0.04) − f(π/2) ≈ 0.08733. Also, we compute that f′(π/2) = (2x + 2 sin(x) cos^2(x) − sin^3(x))|_(x=π/2) = π − 1. Hence for the differential we get dy|_(x=π/2) = (π − 1) · 0.04 ≈ 0.08566, and we deduce that |∆(y) − dy| = ∆(y) − dy ≈ 0.00167.

(c) Let us answer this task only with the aid of Sage, and leave the formal computation as an exercise. An appropriate syntax has the form
f(x)=ln(x+1)-e**(sqrt(x))
D=f(1.03)-f(1); show(N(f(1.03)-f(1)))
d=(diff(f(x), x)(x=1))*0.03; show(N(d))
show(N(abs(D-d)))
In this block the first show command implies that ∆(y) ≈ 0.025887, while the second gives dy ≈ 0.025774. After introducing f, D and d, one can directly type the final command, which implies that |∆(y) − dy| ≈ 0.00011. At the end of the day, what we should keep from this task is that the formal computation of dy in all three cases (a), (b) and (c) is easier than computing ∆(y) explicitly. In fact, for complicated functions the computation of ∆(y) can be extremely hard, but it is always easier to compute differentials. Finally, you may like to verify for yourself that a recipe for improving the approximation is to choose smaller increments dx. □

6.A.38. Recall that the formula f(x) ≈ f(x0) + f′(x0)(x − x0) determines the linear (tangential) approximation of a differentiable function f at a point x0.
(a) Show that this is equivalent to the approximation formula ∆(y) ≈ dy; (b) For the function f(x) = 1/x compare the linear approximation of the value f(1.1), obtained in 5.C.18, with the value determined by the relation f(x0 + dx) ≈ f(x0) + dy.

Solution. (a) We have f(x) ≈ f(x0) + f′(x0)(x − x0), and since x − x0 = dx, this can be equivalently expressed as f(x) ≈ f(x0) + f′(x0)dx. Using the definition of the differential of f at x0 (which we simply denote by dy), we finally get the equivalent expression f(x) ≈ f(x0) + dy, that is, f(x0 + dx) ≈ f(x0) + dy, where on the left-hand side the variable x was replaced by x0 + dx. Hence the tangential approximation is equivalent to the relation f(x0 + dx) − f(x0) ≈ dy, i.e., ∆(y) ≈ dy. This proves (a). (b) Recall from 5.C.18 that the tangential approximation of 1/x around x0 = 1 is the line L(x) = 2 − x, and hence

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

6.2.2. Newton integral. We consider the value of a real function f(x) as an immediate increment of the region bounded by the graph of the function f and the x axis, and try to find the area of this region between boundary values a and b of some interval. We relate this idea to the indefinite integral. Suppose we are given a real function f and its indefinite integral F(x), i.e. F′(x) = f(x) on the interval [a, b]. Divide the interval [a, b] into n parts by choosing the points a = x0 < x1 < ··· < xn = b. Approximate the values of the derivatives at the points xi by the forward differences, that is, by the expressions f(xi) = F′(xi) ≃ (F(x(i+1)) − F(xi))/(x(i+1) − xi). Finally, the sum over all the intervals of our partition yields the approximation of the area: ∑_(i=0)^(n−1) f(xi)(x(i+1) − xi) ≃ ∑_(i=0)^(n−1) ((F(x(i+1)) − F(xi))/(x(i+1) − xi)) (x(i+1) − xi) = ∑_(i=0)^(n−1) (F(x(i+1)) − F(xi)) = F(b) − F(a). Therefore we expect that for “nice enough” functions f(x), the area of the region bounded by the graph of the function and the x axis (including the signs) can be calculated as a difference of the values of the primitive function at the boundary points of the interval. This procedure is called the Newton integration.³

³Isaac Newton (1642-1726) was a phenomenal English physicist and mathematician. The principles of integration and differentiation were formulated independently by him and Gottfried Leibniz in the late 17th century. It took nearly another two centuries before Bernhard Riemann introduced the completely rigorous modern version of the integration process.

L(1.1) = 0.9. Using the differential dy at x0, by the claim in (a) we should get the same result. Indeed, we have dy = −(1/x^2) dx. Moreover, 1.1 = x0 + dx = 1 + 0.1, i.e., dx = 0.1, and hence we get dy|_(x0=1) = −1 · 0.1 = −0.1. Then the relation f(x0 + dx) ≈ f(x0) + dy gives f(1.1) ≈ f(1) + (−0.1) = 1 − 0.1 = 0.9, as required. □

The relation df(x)(h) = f′(x)h and the replacement of the increment dx by some small positive number h close to zero allow us to rephrase the approximation formula ∆(y) ≈ dy as f(x + h) − f(x) ≈ f′(x)h, i.e., (f(x + h) − f(x))/h ≈ f′(x). Based on the Taylor expansion of f around x, and using the O-notation (see 6.1.12), we finally arrive at the equation f′(x) = (f(x + h) − f(x))/h + O(h). Obviously, this establishes a first-order accurate approximation of f′(x), since the dominant term in the truncation error is O(h). The expression (f(x + h) − f(x))/h is known as the “forward difference” formula for the first derivative.
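The first-order behaviour of this error is easy to observe experimentally. The following Sage loop is our own small illustration (the test function sin and the point 0.5 are arbitrary choices); dividing h by ten should shrink the error by roughly a factor of ten:
def forward_diff(f, x0, h):
    return (f(x0 + h) - f(x0))/h
f(x) = sin(x)
for h in [0.1, 0.01, 0.001]:
    print(h, N(abs(forward_diff(f, 0.5, h) - cos(0.5))))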
Similarly, one can consider the “backward difference” formula (f(x) − f(x − h))/h, which gives another first-order accurate approximation of f′(x), see 6.1.12.

6.A.39. Remark on the role of h. The figure below illustrates the forward and backward difference formulas for the approximation of f′(0.5), where f(x) = 3x^2 − x + 14 with x ∈ I = [−1, 2] and h = 1. Check for yourself that, since h is rather large, this does not produce a “good” approximation of f′(0.5). Hence, to improve the approximation, small increments h close to zero are in general preferable (see also below).

6.A.40. Consider the function f(x) = e^(x^2) − 9/10. For the steps h = 0.1 and h = 0.01, use Sage to find the forward and backward difference approximations of the derivative f′(0.2).

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Newton integral
If F is the primitive function to the function f on the interval [a, b], then we write ∫_a^b f(x) dx = [F(x)]_a^b = F(b) − F(a) and call it the Newton (definite) integral with the bounds a and b.

We prove later that for all continuous functions f ∈ C^0[a, b] the Newton integral exists and computes the area as expected. This is one of the fascinating theorems in elementary calculus. Before going into this, we discuss how to compute these integrals.

6.2.3. Integration “by heart”. We show several procedures for computing the Newton integral. We exploit the knowledge of differentiation, and look for primitive functions. The easiest case is the one where the given function is known as a derivative. To learn such cases, it suffices to read the tables for function derivatives in the menagerie the other way round. Hence:

Integration table
For arbitrary nonzero a, b ∈ R and n ∈ Z, n ≠ −1:
∫ a dx = ax + C
∫ a x^n dx = (a/(n+1)) x^(n+1) + C
∫ e^(ax) dx = (1/a) e^(ax) + C
∫ (a/x) dx = a ln x + C
∫ a cos(bx) dx = (a/b) sin(bx) + C
∫ a sin(bx) dx = −(a/b) cos(bx) + C
∫ a cos(bx) sin^n(bx) dx = (a/(b(n+1))) sin^(n+1)(bx) + C
∫ a sin(bx) cos^n(bx) dx = −(a/(b(n+1))) cos^(n+1)(bx) + C
∫ a tg(bx) dx = −(a/b) ln(cos(bx)) + C
∫ a/(a^2 + x^2) dx = arctg(x/a) + C
∫ −1/√(a^2 − x^2) dx = arccos(x/a) + C
∫ 1/√(a^2 − x^2) dx = arcsin(x/a) + C
In all the above formulae, it is necessary to clarify the domain on which the indefinite integral is well defined. We leave this to the reader.

Then, estimate the absolute value of the actual error. Which formula provides the better approximation, and for which h?

Solution. We have x = 0.2, hence the forward/backward difference formulas have the form (f(0.2 + h) − f(0.2))/h and (f(0.2) − f(0.2 − h))/h, respectively. It is easy to estimate these expressions for the given h via Sage, and a recommendation is here:
f(x)=exp(x^(2))-0.9
forw01=(f(0.2+0.1)-f(0.2))/0.1; show(forw01)
back01=(f(0.2)-f(0.2-0.1))/0.1; show(back01)
forw001=(f(0.2+0.01)-f(0.2))/0.01; show(forw001)
back001=(f(0.2)-f(0.2-0.01))/0.01; show(back001)
Based on Sage's output, for h = 0.1 we deduce that f′(0.2) ≈ (f(0.3) − f(0.2))/0.1 ≈ 0.53364 by the forward difference, and f′(0.2) ≈ (f(0.2) − f(0.1))/0.1 ≈ 0.30761 by the backward difference, while for h = 0.01 we get f′(0.2) ≈ (f(0.21) − f(0.2))/0.01 ≈ 0.42761 by the forward difference, and f′(0.2) ≈ (f(0.2) − f(0.19))/0.01 ≈ 0.40513 by the backward difference. Of course, the second choice h = 0.01 gives better approximations. Indeed, the derivative of f is given by f′(x) = 2x e^(x^2), thus f′(0.2) ≈ 0.4163243. For a confirmation via Sage, add in the previous cell the line
f1=diff(f(x), x)(x=0.2); show(f1)
Let us now estimate the errors, where it is reasonable to treat only the case h = 0.01, which provides the smaller errors.
So, to evaluate the differences |f′(0.2) − 0.42761| and |f′(0.2) − 0.40513|, we may add to our block the lines
show(N(abs(f1-forw001)))
show(N(abs(f1-back001)))
We deduce that |f′(0.2) − 0.42761| ≈ 0.0112841 and |f′(0.2) − 0.40513| ≈ 0.0111986. Thus, the backward difference with h = 0.01 approximates f′(0.2) with the smallest actual error. □

6.A.41. For h = 0.1, and for the function f given in 6.A.40, use Sage to illustrate the forward and backward difference approximations of f′(0.2), together with the tangent line of f at x = 0.2. ⃝

A combination of the forward and backward differences induces the “central difference” formula, given by (f(x + h) − f(x − h))/(2h). This provides another approximation of f′(x), whose truncation error is in fact of order O(h^2), see 6.1.12.

6.A.42. For the function f introduced in 6.A.40, show that the central difference approximation of the derivative f′(0.2),

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Further rules can be added by observation of the special structure of the given functions. For example, ∫ (f′(x)/f(x)) dx = ln |f(x)| + C for all continuously differentiable functions f on intervals where they are nonzero. Of course, the rules for differentiating a sum of differentiable functions and constant multiples of differentiable functions yield analogous rules for the indefinite integral. So the sum of two indefinite integrals is the indefinite integral of the sum of the integrated functions, up to the freedom in the chosen constant, etc.

6.2.4. Integration by parts. The Leibniz rule for derivatives, (F · G)′(t) = F′(t)G(t) + F(t)G′(t), can be interpreted in the realm of the primitive functions. This observation leads to the following very useful practical procedure. It also has theoretical consequences.

Integration by parts
The formula for computing the integral on the left hand side, ∫ F(x)G′(x) dx = F(x)G(x) − ∫ F′(x)G(x) dx + C, is called integration by parts.

The above formula is useful if we can compute G and at the same time compute the integral on the right hand side. The principle is best shown on an example. Compute I = ∫ x sin x dx. In this case the choice F(x) = x, G′(x) = sin x will help. Then G(x) = − cos x, and therefore I = x(− cos x) − ∫ − cos x dx = −x cos x + sin x + C. Some integrals can be dealt with by inserting the factor 1, so that G′(x) = 1: ∫ ln x dx = ∫ 1 · ln x dx = x ln x − ∫ (1/x) x dx = x ln x − x + C.

6.2.5. Integration by substitution. Another useful procedure is derived from the chain rule for differentiating composite functions. If F′(y) = f(y), y = φ(x), where φ is a differentiable function with nonzero derivative, then dF(φ(x))/dx = F′(y) · φ′(x), and thus F(y) + C = ∫ f(y) dy can be computed as F(φ(x)) + C = ∫ f(φ(x))φ′(x) dx.

with step h = 0.01, is much better than the corresponding forward and backward difference approximations. In particular, show that this approximation coincides with f′(0.2) up to the first four decimal digits. ⃝

Calculus helps us understand various fundamental phenomena, such as the intrinsic geometry of curves and surfaces. We will start with a task that demonstrates the concept of the curvature of the graph of a function, as introduced in 6.1.16. To maintain continuity with our previous discussion (cf. 6.A.33), we recommend solving this task using implicit differentiation, though other methods are also available.

6.A.43. Determine the curvature of the ellipse x^2 + 2y^2 = 2 at its vertices. Also determine the equations of the circles of osculation at these vertices.
⃝ Let us now explore how to work with parametrized curves in both the plane and space within the context of calculus. These curves can be viewed as vector-valued functions with values in R^2 and R^3, respectively, though the concept easily generalizes to R^n (see 6.1.14 for a brief introduction to such functions, which are examples of “vector functions”). These curves have numerous applications in Newtonian mechanics, where velocity and acceleration are described by the first and second derivatives of such parametric functions with respect to the parameter t, representing time. We will start with a simple task involving the derivatives of plane curves, which is left as an easy challenge for you and can be solved using Sage as well.

6.A.44. (a) Use the chain rule to prove that any differentiable plane curve α(t) = [x(t), y(t)] satisfies dy/dx = (dy/dt)/(dx/dt), assuming that x′(t) = dx/dt ≠ 0. (b) Compute, using two alternative methods, the derivative dy/dx for the parametric curves α(t) = [t + 1, 2t^3], β(t) = [1/(1 + t^2), ln(t + 1)], where t ∈ [0, 2] in both cases. Next evaluate dy/dx at t = 0 and t = 2 (if it is defined there). (c) Use Sage to confirm your computations in (b) and then plot the curves α, β via the command parametric_plot. ⃝

6.A.45. Cycloid. Consider the plane curve α given by the parametric equations x(t) = t − sin(t), y(t) = 1 − cos(t), t ∈ [0, 2π]. This is part of the so-called “cycloid”, which is the curve traced by a point on a circle as it rolls along a straight line

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

By substituting x = φ^(−1)(y), we obtain the originally desired primitive function. This is often written as follows:

Integration by substitution
If φ(x) is differentiable with a nowhere vanishing derivative, then F(y) = ∫ f(y) dy = ∫ f(φ(x))φ′(x) dx = F(φ(x)). We talk about substituting the variable y by y = φ(x).

On the level of differentials, the substitution can be easily understood in the sense that the (linearized) increments of the variables y and x are in the mutual relation formally described by dy = φ′(x) dx, which corresponds to the relation between the integrated quantities f(y) dy = f(φ(x))φ′(x) dx. As an illustration, we verify the last but one integral in the list in 6.2.3 using this method. To compute I = ∫ 1/√(1 − x^2) dx, choose the substitution x = sin t, for all t ∈ (−π/2, π/2). Then dx = cos t dt. So I = ∫ (1/√(1 − sin^2 t)) cos t dt = ∫ (1/√(cos^2 t)) cos t dt = ∫ dt = t + C. By substituting t = arcsin x into the result, I = arcsin x + C. While substituting, the actual existence of the inverse function to y = φ(x) is required. To evaluate a definite Newton integral, it is necessary to recalculate the bounds of integration correctly. Problems with the domains of the inverse functions can sometimes be avoided by dividing the integration into several intervals. We return to this point later.

6.2.6. Integration by reduction to recurrences. Often the use of substitutions and integration by parts leads to recurrence relations, from which the desired integrals can be evaluated. We illustrate this by an example. Integrating by parts, we evaluate Im = ∫ cos^m x dx = ∫ cos^(m−1) x cos x dx = cos^(m−1) x sin x − (m − 1) ∫ cos^(m−2) x (− sin x) sin x dx = cos^(m−1) x sin x + (m − 1) ∫ cos^(m−2) x sin^2 x dx. Using the formula sin^2 x = 1 − cos^2 x, we get m Im = cos^(m−1) x sin x + (m − 1) I(m−2). The initial values are I0 = x, I1 = sin x.
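Before using the recurrence, it may be reassuring to let Sage verify it symbolically for a few small values of m; the following loop is our own quick check, and each printed difference should simplify to zero (the integration constants produced by Sage happen to agree here):
for m in [2, 3, 4, 5]:
    lhs = m*integral(cos(x)^m, x)
    rhs = cos(x)^(m-1)*sin(x) + (m-1)*integral(cos(x)^(m-2), x)
    print(m, (lhs - rhs).simplify_full())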
without slipping.² Use calculus to show that the angle θ between the tangent lines of α at the points P = [x(2π/3), y(2π/3)] ∈ α and Q = [x(4π/3), y(4π/3)] ∈ α, respectively, is such that tan(θ) = −√3.

Solution. Recall that sin(2π/3) = √3/2 = − sin(4π/3) and cos(2π/3) = −1/2 = cos(4π/3). Thus we compute P = [xP, yP] = [2π/3 − √3/2, 3/2] and Q = [xQ, yQ] = [4π/3 + √3/2, 3/2], respectively. The tangent line of α passing through P is the line ℓP(x) = yP + k(2π/3) · (x − xP), with k(2π/3) = (dy/dx)|_(t=2π/3). Similarly, the tangent line of α passing through Q is the line ℓQ(x) = yQ + k(4π/3) · (x − xQ), with k(4π/3) = (dy/dx)|_(t=4π/3). We compute dy/dx = (dy/dt)/(dx/dt) = sin(t)/(1 − cos(t)) = 1/tan(t/2) = cot(t/2), and thus it follows that k(2π/3) = √3/3 and k(4π/3) = −√3/3. Hence the angle θ = ∠PRQ, where R is the intersection point of ℓP, ℓQ, is given by tan(θ) = (k(4π/3) − k(2π/3))/(1 + k(2π/3) · k(4π/3)) = (−√3/3 − √3/3)/(1 − 3/9) = −√3. An illustration is given here: This figure and all the computations presented above can be done easily in Sage, and we leave this part as an exercise, see 6.A.46. For more details on the cycloid, see for example 6.D.17 in Section D. □

6.A.46. For the task in 6.A.45, illustrate the situation in Sage and confirm the related computations. ⃝

At the end of the next section we will return to parametric curves and learn how to compute their lengths and the areas of the regions bounded or enclosed by their graphs. Hence we will soon also learn how to deal with integrals involving parametric equations.

²When the circle has radius r, the parametric equations of the cycloid are given by [r · x(t), r · y(t)], and in our case we fixed r = 1. Be aware that the cycloid is not an ellipse.

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Integrals in which the integrated function depends on expressions of the form x^2 + 1 can be reduced to these types of integrals using the substitution x = tg t. For example, to compute Jk = ∫ dx/(x^2 + 1)^k, the latter substitution yields (notice that dx = cos^(−2) t dt) Jk = ∫ (dt/cos^2 t) (sin^2 t/cos^2 t + 1)^(−k) = ∫ cos^(2k−2) t dt. For k = 2, the result is J2 = (1/2)(cos t sin t + t) = (1/2)(tg t/(1 + tg^2 t) + t). After the reverse substitution t = arctg x, J2 = (1/2)(x/(1 + x^2) + arctg x) + C. When evaluating definite integrals, we can compute the whole recurrence after evaluating with the given bounds. For example, while integrating over the interval [0, 2π], the integrals have these values: I0 = ∫_0^(2π) dx = [x]_0^(2π) = 2π, I1 = ∫_0^(2π) cos x dx = [sin x]_0^(2π) = 0, and Im = ∫_0^(2π) cos^m x dx equals 0 for odd m and ((m−1)/m) I(m−2) for even m. Thus for even m = 2n, the result is ∫_0^(2π) cos^(2n) x dx = ((2n − 1)(2n − 3) ··· 3 · 1)/(2n(2n − 2) ··· 2) · 2π. For odd m it is zero (as could be guessed from the graph of the function cos x).

6.2.7. Integration of rational functions. The next goal is the integration of quotients of two polynomials f(x)/g(x). There are several simplifications to start with. If the degree of the polynomial f in the numerator is greater than or equal to the degree of the polynomial g in the denominator, carry out the division with remainder (see the paragraph 5.1.2). This reduces the integration to a sum of two integrals. The division provides f = q · g + h, f/g = q + h/g. Thus ∫ f(x)/g(x) dx = ∫ q dx + ∫ h(x)/g(x) dx, where the first integral is easy and the second one is again an expression of the type h(x)/g(x), but with the degree of g(x) strictly larger than the degree of h(x) (such functions are called proper rational functions).
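The division with remainder from the previous paragraph is readily available in Sage once we work in a polynomial ring; the cell below is a small sketch of ours with an arbitrarily chosen pair f, g, where quo_rem returns the quotient q and the remainder h with f = q·g + h and deg h < deg g:
R.<x> = QQ[]
f = x^3 + 2*x + 1
g = x^2 + 3*x + 2
q, h = f.quo_rem(g)
print(q, h)
Here q = x − 3 and h = 9x + 7. For the subsequent splitting of a proper rational function into simple fractions (discussed below), the symbolic method partial_fraction can be used in a fresh worksheet with the symbolic variable x; for instance, ((4*x+2)/(x^2+3*x+2)).partial_fraction(x) returns −2/(x + 1) + 6/(x + 2).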
B. Integration

In this section we will focus on tasks related to the notion of integration. As we mentioned before, the processes of differentiation and integration are inverse to each other, a result that we will confirm later in terms of the “fundamental theorem of calculus”. This theorem is, in a sense, the main link between so-called “differential calculus” and “integral calculus”, and we will have the chance to analyze many examples below. Let us begin with the notion of indefinite integrals, also called antiderivatives, which are introduced in 6.2.1 (see also 6.2.2).³ We first present a few easy examples, based on basic rules of integration (see the “Integration table” in 6.2.3).

6.B.1. Integration by heart. Using integration “by heart” (see 6.2.3), evaluate the indefinite integrals given below. Next confirm your answers via Sage.
(1) ∫ e^(−x) dx, with x ∈ R;
(2) ∫ 1/√(4 − x^2) dx, with x ∈ (−2, 2);
(3) ∫ 1/(x^2 + 3) dx, with x ∈ R;
(4) ∫ (3x^2 + 1)/(x^3 + x + 2) dx, with x ≠ −1;
(5) ∫ |x| dx, with x ∈ R.

Solution. (1) The primitive of f(x) = e^(−x) is the function F(x) = −e^(−x), since (−e^(−x))′ = e^(−x). Thus, by the definition of the indefinite integral given in 6.2.1, we have ∫ f(x) dx = F(x) + C = −e^(−x) + C, for some constant C.
(2) Use the formula ∫ dx/√(1 − x^2) = arcsin(x) + C, for some constant C ∈ R. This gives ∫ 1/√(4 − x^2) dx = ∫ 1/(2√(1 − (x/2)^2)) dx = arcsin(x/2) + C.
(3) Use the formula ∫ dx/(1 + x^2) = arctan(x) + C, for some constant C ∈ R. We have ∫ 1/(x^2 + 3) dx = (1/3) ∫ 1/(x^2/3 + 1) dx = (1/√3) ∫ (1/√3)/(1 + (x/√3)^2) dx = (1/√3) arctan(x/√3) + C.
(4) Use the formula ∫ (f′(x)/f(x)) dx = ln |f(x)| + C, with C ∈ R. Since (x^3 + x + 2)′ = 3x^2 + 1, we obtain ∫ (3x^2 + 1)/(x^3 + x + 2) dx = ln |x^3 + x + 2| + C.
(5) The function f(x) = |x| is continuous on R, hence it has a primitive function on R. This has the form x^2/2 + c1, for x ≥ 0,

³The notation for the indefinite integral was first introduced by Leibniz in 1675.

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Thus we can assume that the degree of g is strictly larger than the degree of f. We introduce the procedure to integrate proper rational functions by a simple example. Observe that we can integrate (a + x)^(−n), n > 1, and ∫ 1/(a + x) dx = ln |a + x| + C. Summing such simple fractions yields more complicated ones: −2/(x + 1) + 6/(x + 2) = (4x + 2)/(x^2 + 3x + 2), which can be integrated directly: ∫ (4x + 2)/(x^2 + 3x + 2) dx = −2 ln |x + 1| + 6 ln |x + 2| + C. This suggests looking for a procedure to express proper rational functions as a sum of simple ones. In the example, it is straightforward to compute the unknown coefficients A and B, once the roots of the denominator are known: (4x + 2)/(x^2 + 3x + 2) = (4x + 2)/((x + 1)(x + 2)) = A/(x + 1) + B/(x + 2). Multiply both sides by the polynomial x^2 + 3x + 2 from the denominator and compare the coefficients of the individual powers of x in the resulting polynomials: 4x + 2 = A(x + 2) + B(x + 1) =⇒ 2A + B = 2, A + B = 4. This procedure is called decomposition into partial fractions. It is a purely algebraic procedure based on properties of polynomials. Without loss of generality, suppose that the denominator g(x) and the numerator f(x) do not share any real or complex roots and that g(x) has exactly n distinct real roots a1, ..., an. Then the points a1, ..., an are all the discontinuities of the function f(x)/g(x). Split the expression f(x)/g(x) according to the factors of the denominator. Thus, assume g(x) is the product g(x) = p(x)q(x) of two coprime polynomials.
By the Bezout identity (see 12.3.8 on page 1082), which is a corollary of the polynomial division with remainder, there exist polynomials a(x) and b(x) of degrees strictly less than the degree of g such that a(x)p(x) + b(x)q(x) = 1. Multiplying this equality by the quotient f(x)/g(x) gives f(x)/g(x) = f(x)a(x)/q(x) + f(x)b(x)/p(x). Thus, we may restrict our attention to cases where the denominator g(x) cannot be decomposed further into two coprime polynomials. Suppose that the polynomial g(x) has only real roots. Then there is a unique decomposition into factors (x − ai)^(ni), where ni are the multiplicities of the roots ai, i = 1, ..., k. By a sequential use of the latter procedure with coprime

and −x^2/2 + c2, for x ≤ 0, for some constants c1, c2 ∈ R. It is easy to see that c1 = c2 = c ∈ R, and hence ∫ |x| dx = (1/2) x |x| + c, that is, x^2/2 + c if x ≥ 0, and −x^2/2 + c if x ≤ 0.

A powerful capability of Sage, like most of the available computer algebra programs, is its ability to integrate symbolically. In Sage we will learn to integrate functions in many different ways, and also numerically. For antiderivatives (indefinite integrals) one uses the integral function, via the syntax integral(f(x), x), which can be rewritten as f.integral(x). An alternative reads as f.integrate(x). Later we will see that the situation of definite integrals does not differ much. Recall that the indefinite integral (as an infinite set) can be represented by one specific function and its translations. Hence, one could expect that Sage will print out a function with a constant C, as in the relation F(x) = ∫ f(x) dx + C. However, Sage ignores C, and hence you should always assume that this is implicitly part of the answer. Keeping in mind these details, we are now ready to solve our task. This can be done by the cell
show(integral(e^(-x), x))
show(integral(1/(sqrt(4-x^2)), x))
show(integral(1/(x^2+3), x))
show(integral((3*x^2+1)/(x^3+x+2), x))
show(integral(abs(x)))
Check for yourself that this provides the desired answers. Often, when computing integrals via Sage, we may need to add restrictions, via the command assume. For instance, type
m=var("m"); assume(m>1)
show(integral(1/x^m, x))
Notice that without the command assume(m > 1), Sage is not able to produce a result. We will meet further such examples in the sequel. □

6.B.2. Let f be a continuous function on R with f(x) ≠ 0 for all x ∈ R. Suppose that the primitive function F of f satisfies 4F(x) = f(x) for all x ∈ R, and moreover that f(4) = 4. Find the type of f. ⃝

6.B.3. (a) Write a routine in Sage which will print out a primitive of an integrable function f. (b) Use your routine to find the primitives of some of the functions presented in the “Integration table” in 6.2.3.

Solution. This is an easy task that one can implement as follows:
function("f")(x)
def Primitive_function(f):
    F(x) = integral(f, x).factor()
    show("A primitive function of f(x)=", f, " is given by:", F(x))
    return

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

polynomials p(x) and q(x), we obtain a representation of f(x)/g(x) as a sum of fractions of the form f(x)/g(x) = r1(x)/(x − a1)^(n1) + ··· + rk(x)/(x − ak)^(nk), where the degrees of the polynomials ri(x) are strictly smaller than the degrees of the denominators. Finally, each of the summands can be represented as a sum r(x)/(x − a)^n = A1/(x − a) + A2/(x − a)^2 + ··· + An/(x − a)^n. Indeed, we multiply the equation by (x − a)^n and start comparing the coefficients from the highest powers of the polynomial r(x), computing sequentially A1, A2, ... after expanding all the products. This can be done faster by suitable additions and subtractions, starting with the highest orders. For example, (5x − 16)/(x − 2)^2 = 5 (x − 2)/(x − 2)^2 − 6/(x − 2)^2 = 5/(x − 2) + (−6)/(x − 2)^2. Finally, we have to handle the case where there are not enough real roots. There always exists a factorization of g(x) into linear factors with complex roots (see the fundamental theorem of Algebra in 1.A.6 on page 6). The non-real roots always appear in conjugate pairs, since g(z) = g(z̄) for a polynomial with real coefficients. Repeating the above procedure for ratios of complex polynomials gives the same result, but with complex coefficients. If we insist on having real expressions only, we may collect the conjugate pairs together and get quadratic factors expressed as sums of squares (x − a)^2 + b^2, and their powers. The procedure works well and guarantees that it is possible to find summands of the form (Bx + C)/((x − a)^2 + b^2)^n. As in the real roots case, there is always a corresponding decomposition into partial fractions of the form (A1x + B1)/((x − a)^2 + b^2) + ··· + (Anx + Bn)/((x − a)^2 + b^2)^n in the case of a power ((x − a)^2 + b^2)^n of such a quadratic (irreducible) factor as well. The factorization of the polynomials and the further computations might be quite time consuming. The reader could prefer to experiment with computer algebra software instead. This works well in Maple by calling the procedure convert(h, parfrac, x), which decomposes the expression h, rationally dependent on the variable x, into partial fractions. The important point is that we can already integrate all of the above partial fractions. The last mentioned ones lead to integrals discussed in example 6.2.6. In summary, the rational functions f(x)/g(x) can be integrated easily, if the corresponding decomposition of the polynomial in the denominator g(x) is known. The reality is not that simple when computing (definite) Newton integrals. Although we find the primitive functions, the problematic points are the discontinuities of rational functions, in whose
after expanding all the products. This can be done faster by suitable additions and subtractions, starting by the highest orders. For example, 5x − 16 (x − 2)2 = 5 x − 2 (x − 2)2 − 6 1 (x − 2)2 = 5 x − 2 + −6 (x − 2)2 . Finally, we have to handle the case, where there are not enough real roots. There always exists a factorization of g(x) into linear factors with complex roots (see the fundamental theorem of Algebra in 1.A.6 on page 6). The non-real roots always appear in conjugated pairs, since g(z) = g(¯z) for a polynomial with real coefficients. Repeating the above procedure for ratios of complex polynomials gives the same result, but with complex coefficients. If we insist in having real expressions only, we may collect the conjugate pairs together and get quadratic factors expressed as sums of squares (x − a)2 + b2 and their powers. The procedure works well and guarantees that it is possible to find summands in the form of Bx + C ((x − a)2 + b2)n . As in the real roots case, there is always a corresponding decomposition into partial fractions of the form A1x + B1 (x − a)2 + b2 + · · · + Anx + Bn ((x − a)2 + b2)n in the case of a power ((x − a)2 + b2 )n of such quadratic (irreducible) factor as well. The factorization of the polynomials and the further computations might be quite time consuming. The reader could prefer to experiment with computer algebra software instead. This works well in Maple by calling the procedure convert(h, parfrac, x) that decomposes the expression h rationally dependent on the variable x into partial fractions. The important point is that we can already integrate all of the above partial fractions. The last mentioned ones lead to integrals discussed in example 6.2.6. In summary, the rational functions f(x)/g(x) can be integrated easily, if the corresponding decomposition of the polynomial in the denominator g(x) is known. The reality is not that simple when computing (definite) Newton integrals. Although we find the primitive functions, the problematic points are the discontinuities of rational functions, in whose 536 (b) To test the routine it is easy: a,b=var("a, b") Primitive_function(a*x^3) Primitive_function(e^(a*x)) Primitive_function(a/x) Primitive_function(a*sin(x*b)) Primitive_function(a/(a^2+x^2)) Primitive_function(1/sqrt(a^2+x^2)) Primitive_function(x*abs(x)) Primitive_function((3*x^2)*sqrt(1+(1/x^2))) For instance, for the final case Sage’s output has the form: A primitive function of f(x) = 3x2 √ 1 + 1 x2 is given by x3 ( x2 +1 x2 )3 2 Observe that when the input f is an integrable function, but there is no function, built up of addition, subtraction, multiplication, division, roots, exponents, logarithms, trigonometric functions, and inverse trigonometric functions which will have, as its derivative the function f, then Sage will not provide an answer. Such are the functions e−x3 sin(x) , e−x3 sin(x2 ) , or the “Gaussian” (also called the “error function”) that we will meet later (this has a crucial role in statistics) However, one can still numerically compute such integrals, a situation that we will encounter later. □ Trigonometric identities are often usefull when integrating expression involving trigonometric functions. Test your skills on trigonometric identities by evaluating the following indefinite integral. 6.B.4. Compute the integral A = ∫ sin2 (x) cos2 (x) dx. ⃝ Next we present a simple application of antiderivatives, where a given “initial condition” allows us to compute explicitly the constant of integration C. 
In particular, this provides an example of solving a first-order differential equation with an initial condition (also referred to as an “initial value problem”). We will meet further such problems at the end of this section (see 6.B.64, 6.B.65).

6.B.5. An initial value problem. Determine the curve y = f(x) passing through the point P = [1, 3] with slope (2/5)x. Then use the command desolve in Sage to confirm your answer.

Solution. By assumption the slope of y = f(x) is (2/5)x, which means that f′(x) = df/dx = (2/5)x. By integration, we get f(x) = ∫ f′(x) dx = ∫ (2/5)x dx = (2/5) ∫ x dx = x^2/5 + C. To specify the constant C ∈ R, use the equation f(1) = 3, i.e., 1/5 + C = 3. This gives C = 14/5, and hence y = f(x) = (x^2 + 14)/5.

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

neighbourhood these functions are unbounded. We return to this problem later (see paragraph 6.2.14 below).

6.2.8. Riemann integral. We return to the idea of defining the integral as a tool for computing the area of the region bounded by the graph of a function and the x axis. This is our next goal. We prove that for all continuous functions on a closed bounded interval, this definition yields the same result as the Newton integral. Consider a real function f defined on the interval [a, b]. Choose a partition of this interval, along with the choice of representatives ξi of the respective parts, i.e. a = x0 < x1 < ··· < xn = b and ξi ∈ [x(i−1), xi], i = 1, ..., n. The number δ = min_i{xi − x(i−1)} is called the norm of the partition. Define the Riemann sum corresponding to the chosen partition along with the chosen representatives Ξ = (x0, ..., xn; ξ1, ..., ξn) as SΞ = ∑_(i=1)^n f(ξi) · (xi − x(i−1)).

Riemann integral⁴
Definition. The Riemann integral of the function f on the interval [a, b] exists, if for every sequence of partitions with representatives (Ξk), k = 0, 1, ..., with the norms of the partitions δk approaching zero, the limit lim_(k→∞) SΞk = S exists and its value does not depend on the choice of the sequence of partitions and their representatives. Then we write S = ∫_a^b f(x) dx.

This definition does not look very practical, but nonetheless it allows us to formulate and prove several simple properties of the Riemann integral:

⁴Bernhard Riemann (1826-1866) was an extremely influential German mathematician with many contributions to infinitesimal analysis, differential geometry, and in particular complex analysis and analytic number theory.

To solve the system {df/dx = (2/5)x, f(1) = 3} with Sage, we will use the command desolve(F, y, [a, b]), which involves the differential equation that we want to solve (this is denoted by F, and it will necessarily include the first derivative diff(y, x) of y = f(x)), the function y, which we should first introduce as a symbolic function in Sage, and the numbers a, b, which are specified by the initial condition y(a) = b. For our case this technique takes the form
y = function("y")(x)
difeq = diff(y,x) - (2/5)*x == 0
h = desolve(difeq, y, [1, 3])
show(h)
Sage's output has the desired form, i.e., (1/5)x^2 + 14/5. □

A very elementary method of integration is the so-called integration by parts (see 6.2.4). This method is appropriate for computing integrals of the following forms:
∫ P(x) a^(bx) dx, ∫ P(x) sin(bx) dx, ∫ P(x) cos(bx) dx,
∫ P(x) log_a^n(x) dx, ∫ x^b log_a^n(kx) dx,
∫ P(x) arcsin(bx) dx, ∫ P(x) arccos(bx) dx,
∫ P(x) arctan(bx) dx, ∫ P(x) arccot(bx) dx,
∫ a^(bx) sin(cx) dx, and ∫ a^(bx) cos(cx) dx,
where P is an arbitrary polynomial, a ∈ (0, 1) ∪ (1, +∞), b, c ∈ R\{0}, n ∈ N and k > 0.
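Each of the listed patterns is also within reach of Sage's symbolic integrator, which can serve as a check of the hand computations that follow; two quick probes (the integrands are our own choices) are:
show(integral(x*cos(x), x))
show(integral(log(x), x))
The outputs x sin(x) + cos(x) and x log(x) − x match the results obtained by integration by parts in 6.2.4.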
Let us now illustrate this method by a series of examples. 6.B.6. Integration by parts. Using integration by parts, evaluate the integral L = ∫ ( x2 + 1 ) e−x dx, with x ∈ R. Solution. Here we should apply integration by parts twice: L = ∫ ( x2 + 1 ) (− e−x )′ dx = − ( x2 + 1 ) e−x − ∫ ( x2 + 1 )′ (− e−x ) = − ( x2 + 1 ) e−x + 2ℓ , where ℓ = ∫ x e−x dx. Next we have ℓ = ∫ x(− e−x )′ = −x e−x − ∫ (x)′ (− e−x ) = −x e−x + ∫ e−x = −x e−x − e−x +C , for some constant C ∈ R. Thus, all together we obtain L = − e−x (x2 + 2x + 3) + C. □ 6.B.7. Using integration by parts, compute the indefinite integrals given below: (a) K = ∫ x cos(x) dx, with x ∈ R; CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Theorem. (1) Suppose f is a bounded real function defined on the interval [a, b], and c ∈ [a, b] is an inner point of this interval. Then the integral ∫ b a f(x) dx exists if and only if both of the integrals ∫ c a f(x) dx and ∫ b c f(x) dx exist. In that case ∫ b a f(x) dx = ∫ c a f(x) dx + ∫ b c f(x) dx. (2) Suppose f and g are two real functions defined on the interval [a, b], and that both of the integrals ∫ b a f(x) dx and ∫ b a g(x) dx exist. Then the integral of their sum also exists and ∫ b a (f(x) + g(x)) dx = ∫ b a f(x) dx + ∫ b a g(x) dx. (3) Suppose f is a real function defined on the interval [a, b], C ∈ R is a constant, and the integral ∫ b a f(x) dx exists. Then the integral ∫ b a Cf(x) dx also exists and ∫ b a Cf(x) dx = C ∫ b a f(x) dx. Proof. (1) First suppose that the integral over the whole interval exists. When computing it, we can limit ourselves to limits of the Riemann sums whose partitions have the point c among their partitioning points. Each such sum can be obtained as a sum of two partial Riemann sums. If these two partial sums would depend on the chosen partitions and representatives in the limit, then the total sums could not be independent on the choices in limit. (It suffices to keep the sequence of partitions of the subinterval the same, and change the other so that the limit would change). Conversely, if both Riemann integrals on both subintervals exists, they can be approximated with arbitrary precision by the Riemann sums, and moreover independently on their choice. If a partitioning point c is added to any sequence of Riemann sums over the whole interval [a, b], the value of the whole sum is changed. Also the values of the partial sums over the intervals belonging to [a, c] and [c, b] change at most by a multiple of the norm of the partition and possible differences of the bounded function f on all of [a, b]. This is a number arbitrarily close to zero for a decreasing norm of the partition. Necessarily the partial Riemann sums of the function over the two parts of the interval also converge to the limits, whose sum is the Riemann integral over [a, b]. (2) In every Riemann sum, the sum of the functions manifests as the sum of the values in the chosen representatives. Because multiplication of real numbers is distributive, each Riemann sum becomes the sum of the two Riemann sums with the same representatives for the two functions. The statement follows from the elementary properties of limits. (3) Each of the Riemann sums is multiplied by the constant C. So the claim follows from the elementary properties of limits. □ 538 (b) M = ∫ (2x − 1) ln(x) dx, with x > 0; (c) N = ∫ ex sin(βx) dx, with x, β ∈ R. ⃝ 6.B.8. 
Determine the integrals given below and next use Sage to confirm your answers: (a) ∫ x/cos^2(x) dx, with x ≠ π/2 + kπ, k ∈ Z; (b) ∫ x^2 e^(−3x) dx, with x ∈ R; (c) ∫ cos^2(x) dx, with x ∈ R. ⃝

The substitution method, introduced in 6.2.5, is another very important technique to compute integrals. The next series of exercises is based on this method. Later you will meet a more systematic illustration of the substitution method, in the integration of rational and irrational functions, but also in many trigonometric integrals.

6.B.9. Substitution method. Using a suitable substitution, determine the integral ∫ f(x) dx, where f(x) is given below. Moreover, use Sage to confirm your answer.
(1) f(x) = √(2x − 5), with x > 5/2;
(2) f(x) = (7 + ln(x))^7/x, with x > 0;
(3) f(x) = cos(x)/(1 + sin(x))^2, with x ≠ (3 + 4k)π/2, k ∈ Z;
(4) f(x) = (1 − 2x)^2025, with x < 1/2.

Solution. (1) Set t = 2x − 5, so that dt = 2 dx, i.e., dx = dt/2. Thus ∫ √(2x − 5) dx = (1/2) ∫ √t dt = t^(3/2)/3 + C = (1/3)(2x − 5)^(3/2) + C, for some constant C.
(2) Set t = 7 + ln(x), so that dt = dx/x. Then we get ∫ ((7 + ln(x))^7/x) dx = ∫ t^7 dt = t^8/8 + C = (7 + ln(x))^8/8 + C.
(3) Set t = 1 + sin(x), so that dt = cos(x) dx. Then ∫ cos(x)/(1 + sin(x))^2 dx = ∫ dt/t^2 = −1/t + C = −1/(1 + sin(x)) + C.
(4) Set t = 1 − 2x, so that dt = −2 dx, i.e., dx = −dt/2. Then ∫ (1 − 2x)^2025 dx = −(1/2) ∫ t^2025 dt = −(1/2) · t^2026/2026 + C = −(1 − 2x)^2026/4052 + C, C ∈ R.
A verification via Sage proceeds with the same method presented in 6.B.1. Hence one can type
show(integral(sqrt(2*x-5), x))
show(integral(((7+ln(x))^7)/x, x))
show(integral((cos(x)/(1+sin(x))^2), x))

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

6.2.9. The fundamental theorem. The following result is crucial for understanding the relation between the integral and the derivative. The complete proof of this theorem is somewhat longer, so it is broken into several subsections.

Fundamental theorem of integral calculus
Theorem. For every continuous function f on a finite interval [a, b] there exists its Riemann integral ∫_a^b f(x) dx. Moreover, the function F(x) given on the interval [a, b] by the Riemann integral F(x) = ∫_a^x f(t) dt is a primitive function to f on this interval.

6.2.10. Upper and lower Riemann integral. In the first step of proving the existence of the integral, we use an alternative definition, in which the choice of representatives and the corresponding values f(ξi) is replaced by the suprema Mi of the values f(x) in the corresponding subintervals [x(i−1), xi], or by the infima mi of the function f(x) in the same subintervals, respectively. We speak of upper and lower Riemann sums, respectively (in the literature, this process is also called the Darboux integral). Because the function is continuous, it is bounded on a closed interval, hence all the above considered suprema and infima exist and are finite. Then the upper Riemann sum corresponding to the partition Ξ = (x0, ..., xn) is given by the expression SΞ,sup = ∑_(i=1)^n (sup_(x(i−1)≤ξ≤xi) f(ξ)) (xi − x(i−1)) = ∑_(i=1)^n Mi (xi − x(i−1)). The lower Riemann sum is SΞ,inf = ∑_(i=1)^n (inf_(x(i−1)≤ξ≤xi) f(ξ)) (xi − x(i−1)) = ∑_(i=1)^n mi (xi − x(i−1)). For each partition Ξ = (x0, ..., xn; ξ1, ..., ξn) with representatives, there are the inequalities (1) SΞ,inf ≤ SΞ,ξ ≤ SΞ,sup. Moreover, the infima and suprema can be approximated with arbitrary precision by the actual values of terms in the sequences.
Thus, we might suspect that the Riemann integral exists if and only if, for all sequences of partitions with norms approaching zero, the limits of both the upper and lower sums exist and are equal. This is indeed true for all bounded functions:

and similarly for the final case. □

6.B.10. Using a suitable substitution, determine ∫ e^(x^3 + 4) x^2 dx, with x ∈ R. Next use Sage to confirm the answer. ⃝

6.B.11. Compute ∫ cos^5(x) sin(x) dx, with x ∈ R. ⃝

6.B.12. Compute ∫ sin^4(x)/cos^4(x) dx, with x ∈ (−π/2, π/2). ⃝

6.B.13. Use Sage to confirm the answers presented in 6.B.11 and 6.B.12, respectively. ⃝

6.B.14. Evaluate ∫ cos^5(x) sin^2(x) dx. ⃝

Combining integration by parts with the substitution method is a very common situation. Hence we will often meet examples which rely on combining both methods, such as the task presented below.

6.B.15. Evaluate the following integrals: (a) ∫ x^3 e^(−x^2) dx, with x ∈ R; (b) ∫ x arcsin(x^2) dx, with x ∈ (−1, 1); (c) ∫ e^(√x) dx, with x > 0. ⃝

6.B.16. Integration by reduction to recurrences. Let n be a non-negative integer. (a) Set In = ∫ x^n e^x dx, with I0 = e^x. Show that In = x^n e^x − n I(n−1), and next compute I1, I2, I3. (b) Set Jn = ∫ (ln(x))^n dx. Show that Jn = x(ln(x))^n − n J(n−1), and next compute J1, J2, J3.

Solution. (a) Integration by parts gives the result: In = ∫ x^n (e^x)′ dx = x^n e^x − n ∫ x^(n−1) e^x dx = x^n e^x − n I(n−1). It is easy now to compute I1, I2, I3: I1 = x e^x − e^x = e^x (x − 1), I2 = x^2 e^x − 2 e^x (x − 1) = e^x (x^2 − 2x + 2), I3 = x^3 e^x − 3 e^x (x^2 − 2x + 2) = e^x (x^3 − 3x^2 + 6x − 6). You can also prove that In = e^x pn(x), where the polynomial pn is of order n and defined recursively by pn(x) = x^n − n p(n−1)(x), p0(x) = 1. (b) This can be solved similarly and is left for practice. □

6.B.17. Let In = ∫ sin^n(x) dx with n ∈ N. Prove that In = −(1/n) sin^(n−1)(x) cos(x) + ((n − 1)/n) I(n−2). Then use this relation to compute I2 and I3. ⃝

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Theorem. Let the function f be bounded on a closed interval [a, b]. Then Ssup = inf_Ξ SΞ,sup and Sinf = sup_Ξ SΞ,inf are the limits of all sequences of upper and lower sums with norms approaching zero, respectively. The Riemann integral of the function f exists if and only if Ssup = Sinf.

Proof. First, notice that choosing two partitions Ξ1, Ξ2, there is a common refinement Ξ, and we arrive at the inequalities SΞ1,inf ≤ SΞ,inf ≤ SΞ,sup ≤ SΞ2,sup. Thus, Ssup is well defined, since it is the infimum of a set of real values bounded from below by any of the SΞ,inf. Similarly for the value Sinf, which is bounded from above by any of the SΞ,sup. Refine a partition Ξ1 to Ξ2 by adding new points. Then SΞ1,sup ≥ SΞ2,sup and SΞ1,inf ≤ SΞ2,inf. By the definition of the infimum, there are sequences of partitions Ξk for which Ssup is the limit of the sums SΞk,sup. Moreover, every two partitions have a common refinement. Thus it may be assumed that each Ξk in the sequence is a partition obtained by refining the previous one. Hence the sums SΞk,sup form a non-increasing sequence of real numbers converging to Ssup. A similar argument applies to Sinf. Hence the values Ssup = inf_Ξ SΞ,sup and Sinf = sup_Ξ SΞ,inf are good candidates for the limits of upper and lower sums. Next, consider a fixed partition Ξ with n inner partitioning points of the interval [a, b], and another partition Ξ1, whose norm is a small number δ. In the common refinement Ξ2, there will be only n intervals contributing to the sum SΞ2,sup by possibly smaller contributions than in the case of Ξ1.
Now, f is a bounded function on [a, b], and thus each of these contributions is bounded by a universal constant multiplied by the norm δ of the partition. Hence, choosing δ sufficiently small, the distance of S_{Ξ₁,sup} from S_sup will not be larger than twice the distance of S_{Ξ,sup} from S_sup. Finally, return to the sequence of partitions Ξ_k as chosen above, and choose an ε > 0. Then there is some m ∈ N such that the distance of S_{Ξk,sup} from S_sup is less than ε for all k ≥ m. Hence for an arbitrary partition Ξ with appropriately small norm δ > 0, the distance of S_{Ξ,sup} from S_sup does not exceed 2ε. In summary, for an arbitrary ε > 0 there is δ > 0 such that for all partitions with norm at most δ the inequality |S_{Ξ,sup} − S_sup| < 2ε holds. This is exactly the statement that the number S_sup is the limit of all sequences of upper sums with norms of the partitions approaching zero. The statement for lower sums is proved in exactly the same way. It remains to deal with the existence of the Riemann integral ∫_a^b f(x) dx. If S_sup = S_inf, then all Riemann sums of

An important situation in integration occurs with the integration of rational functions, see 6.2.7. The key point here is the decomposition of such a function into a sum of simple rational functions. Integrating the partial fractions corresponding to real roots of the denominator of a rational function is very easy. For instance, the substitution y = x − x₀, dy = dx, gives
∫ A/(x − x₀) dx = ∫ A/y dy = A ln|y| + C₁ = A ln|x − x₀| + C₁,
∫ A/(x − x₀)^n dx = ∫ A/y^n dy = A y^{1−n}/(1 − n) + C₂ = A/((1 − n)(x − x₀)^{n−1}) + C₂,
for some constants C₁, C₂, and for all A, x₀ ∈ R and n ∈ N with n ≥ 2. Rational functions f(x)/g(x) can be integrated easily, assuming that we know the corresponding factorization of the polynomial in the denominator g(x) in terms of its roots and their multiplicities. Next we will cover all the possible cases; to begin with, here are a few easy tasks for you.

6.B.18. Evaluate the integral ∫ R(x) dx, for R(x) = 6/(x − 2), with x ≠ 2, and for R(x) = 6/(x + 4)³, with x ≠ −4. ⃝

6.B.19. (a) Prove that ∫ dx/(ax + b)^n = 1/(a(1 − n)(ax + b)^{n−1}) + C. (b) Compute ∫ dx/(9x² + 6x + 1) by applying the formula in (a). ⃝

6.B.20. Suppose that P(x) = ax² + bx + c (a ≠ 0) is a parabola with a double root, i.e., ∆ = b² − 4ac = 0. Prove that ∫ dx/(ax² + bx + c) = −2/(2ax + b) + C. Use this formula to confirm your result in 6.B.19, (b). ⃝

6.B.21. Evaluate the integral ∫ 1/(5x² + 5x + 2) dx. Next use Sage to confirm your result.

Solution. The given parabola P(x) = 5x² + 5x + 2 has negative discriminant ∆ = −15, and hence two complex conjugate roots, given by −1/2 ± i√15/10. By completing the square we get
P(x) = 5(x² + x + 2/5) = 5((x² + x + 1/4) − 1/4 + 2/5) = 5((x + 1/2)² + 3/20) = 5((x + 1/2)² + (√(3/20))²) = 5((x + 1/2)² + (√15/10)²).

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

sequences of the partitions have the same limit because of the inequalities (1). If the Riemann integral does not exist, then there exist two sequences of partitions Ξ_k and Ξ̄_k and their representatives with different limits of Riemann sums. Suppose the first limit is larger than the other one. Then the upper Riemann sums can be selected for the first sequence and the lower Riemann sums for the second sequence. Their difference will then be at least as large. In particular, in view of the previous part of the proof, this implies S_sup > S_inf. □

6.2.11. Uniform continuity.
Until now, we have only used the continuity of the function f to know that all such functions are bounded on a closed finite interval. It remains to show that for continuous functions S_sup = S_inf. From the definition of continuity, for every point x ∈ [a, b] and every neighbourhood O_ε(f(x)) there exists a neighbourhood O_δ(x) such that f(O_δ(x)) ⊂ O_ε(f(x)). This statement can be rewritten in this way: for y, z ∈ O_δ(x), i.e. |y − z| < 2δ, it is true that f(y), f(z) ∈ O_ε(f(x)), i.e. |f(y) − f(z)| < 2ε. A global variant of such a property is needed; it is called the uniform continuity of a function f:

Uniform continuity

Definition. Let f be a function on a closed finite interval [a, b]. f is uniformly continuous on [a, b] if for every ε > 0 there exists δ > 0 such that for all z, y ∈ [a, b] satisfying |y − z| < δ, the inequality |f(y) − f(z)| < ε holds.

Theorem. Each continuous function on a finite closed interval [a, b] is uniformly continuous.

Proof. Fixing some ε > 0, the definition of continuity of f provides for each x ∈ [a, b] the values δ(x) such that f(y) ∈ O_ε(f(x)) for all y ∈ O_{2δ(x)}(x). Since every finite closed interval is compact, it is covered by finitely many of such neighbourhoods O_{δ(xi)}(x_i), determined by points

Thus ∫ 1/P(x) dx = (1/5) ∫ dx/((x + 1/2)² + (√15/10)²). Next, set u = x + 1/2 with du = dx, and use the known formula (see 6.2.3)
∫ du/(u² + δ²) = (1/δ) arctan(u/δ) + C. (♯)
This gives
(1/5) ∫ dx/((x + 1/2)² + (√15/10)²) = (1/5) ∫ du/(u² + (√15/10)²) = (1/5) · (1/(√15/10)) arctan(u/(√15/10)) + C = (2√15/15) arctan((√15/3)(2x + 1)) + C,
for some constant C. For a confirmation give the cell show(integral(1/(5*x^2 + 5*x + 2), x)). □

Integrals of a rational function Q of the form Q(x) = (a₁x + b₁)/(a₂x² + b₂x + c₂) can be solved by producing in the numerator the derivative of the denominator. Then we decompose the fraction into two rational functions, and hence our integral decomposes into two integrals which are easier to compute. Let us describe such an example.

6.B.22. Evaluate the integral A = ∫ (3x + 7)/(x² − 4x + 15) dx, with x ∈ R.

Solution. The derivative of g(x) = x² − 4x + 15 is the line 2x − 4, with x ∈ R. Let f(x) = 3x + 7, with x ∈ R, as well. We see that f(x) = (3/2)(2x − 4) + 4 · (3/2) + 7 = (3/2)(2x − 4) + 13. Thus
A = ∫ f(x)/g(x) dx = (3/2) ∫ (2x − 4)/g(x) dx + 13 ∫ dx/g(x) = α + β.
For α := (3/2) ∫ (2x − 4)/g(x) dx, by setting u = g(x) = x² − 4x + 15 we get du = (2x − 4) dx, and thus
α = (3/2) ∫ du/u = (3/2) ln(x² − 4x + 15) + C₁, C₁ ∈ R.
For the second integral β := 13 ∫ dx/g(x), by completing the square we get g(x) = (x − 2)² + 11. Thus an application of the relation (♯) appearing in the proof of 6.B.21 gives
β = 13 ∫ dx/((x − 2)² + (√11)²) = (13/√11) arctan((x − 2)/√11) + C₂
for some constant C₂. Hence altogether we get
A = (3/2) ln(x² − 4x + 15) + (13/√11) arctan((x − 2)/√11) + C. □

6.B.23. If K_n = K_n(x₀, a) = ∫ dx/((x − x₀)² + a²)^n, where a, x₀ ∈ R are fixed, prove that
K_n = (1/a²) ( ((2n − 3)/(2n − 2)) K_{n−1} + (x − x₀)/((2n − 2)((x − x₀)² + a²)^{n−1}) )

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

x₁, . . . , x_k. Choose δ as the minimum of all the (finitely many) δ(x_i). Choose any two points y, z ∈ [a, b] with |y − z| < δ; then they both belong to one of the O_{2δ(xi)}(x_i). Thus |f(y) − f(z)| ≤ |f(y) − f(x_i)| + |f(x_i) − f(z)| < 2ε, and f has the desired property. □

6.2.12. Finishing the proof of Theorem 6.2.9. Now we complete the proof of the existence of the Riemann integral. Choose ε and δ as in the definition of the uniform continuity of f.
Consider any partition Ξ with n intervals and norm at most δ. Then, writing Ji = [xi−1, xi], n∑ i=1 sup ξ∈Ji f(ξ)(xi − xi−1) − n∑ i=1 inf ξ∈Ji f(ξ)(xi − xi−1) = n∑ i=1 ( sup ξ∈Ji f(ξ) − inf ξ∈Ji f(ξ) ) (xi − xi−1) ≤ ε · (b − a). For decreasing norm of the partition, the upper and lower sums are arbitrarily close to each other. In particular the upper Riemann integral and the lower Riemann integral coincide. To complete the proof of the fundamental theorem of integral calculus, it is still needed to verify the statement about the existence of a primitive function. For a continuous function f on interval [a, b] there exists the Riemann integral F(x) = ∫ x a f(t) dt for every x ∈ [a, b]. As in the statement about uniform continuity, there is δ > 0, dependent on a fixed small ε > 0, such that |f(x + ∆x) − f(x)| < ε for all 0 ≤ ∆x < δ on the interval [a, b]. The difference of the derivative of F(x) and the integrated function f(x) is expressed by the limit of the expressions α = 1 ∆x (∫ x+∆x a f(t) dt − ∫ x a f(t) dt ) − f(x) = 1 ∆x (∫ x+∆x x f(t) dt ) − f(x) for ∆x approaching zero. Now choose 0 < ∆x < δ and replace the integrated function by the constant value f(x). Then the values f(ξ) at any point ξ ∈ [x, x + ∆x] are distant from f(x) by at most ε. Hence the Riemann integral in question cannot be different from f(x)∆x by more then ε∆x. Thus, we arrived at the following estimate: |α| = 1 ∆x (∫ x+∆x x f(t) dt ) − f(x) < ε. But that means that at the point x, the one-sided right derivative of the function F(x) exists and equals f(x). The result for the left derivative is proved in the same way, just working with the interval [x − ∆x, x]. This finishes the proof of the theorem 6.2.9. 542 for all n ̸= 1, and that K1(x0, a) = 1 a arctan (x−x0 a ) + C. Solution. We will apply integration by parts. In terms of 6.2.4 we have F(x) = 1/ ( (x − x0) 2 + a2 )n and F′ (x) = −2n (x − x0) / ( (x − x0) 2 + a2 )n+1 . If you are not sure for this differentiation, use Sage and the cell var("a, n, x0"); F(x)=1/(((x-x0)^2+a^2)**n) show(diff(F(x), x).factor()) Sage prints out the following expression which agrees with ours: −2 ( a2 + x2 − 2 xx0 + x2 0 )−n−1 n(x − x0). Moreover, G′ (x) = 1 and we may fix G(x) = (x − x0). Thus, Kn(x0, a) = ∫ F(x)G′ (x) dx = x − x0 ( (x − x0) 2 + a2 )n + 2n ∫ ( (x − x0) 2 + a2 ( (x − x0) 2 + a2 )n+1 − a2 ( (x − x0) 2 + a2 )n+1 ) dx = x − x0 ( (x − x0) 2 + a2 )n + 2n ( Kn(x0, a) − a2 Kn+1(x0, a) ) , or equivalently, Kn+1 = 1 a2 ( 2n − 1 2n Kn + 1 2n x − x0 ( (x − x0) 2 + a2 )n ) . Replacing n by n − 1 in this formula, we get the result. On the other hand, the case n = 1 follows easily by the relation (♯), used in the proof of 6.B.21. □ 6.B.24. Using the result from 6.B.23, compute the integral given below and next verify your answer via Sage: I = ∫ 30x − 77 (x2 − 6x + 13) 2 dx , x ∈ R . Solution. This provides an example of partial fractions for multiple complex roots in the form of Ax+B[ (x−x0)2 +a2 ]n , with A, B, x0 ∈ R, a > 0, n ∈ N\{0, 1}, which again can be solved by forming on the numerator the derivative of the expression (x − x0) 2 + a2 appearing in the dominator, that is, A 2 · 2(x−x0)[ (x−x0)2 +a2 ]n + (B + Ax0) · 1[ (x−x0)2 +a2 ]n . Hence one needs to compute the two induced integrals, which can be done by the methods applied above. For our problem we have I = 15 ∫ 2x − 6 (x2 − 6x + 13) 2 dx + 13 ∫ dx (x2 − 6x + 13) 2 . To compute the first integral set u = x2 − 6x + 13 with du = (2x − 6) dx. 
Then 15 ∫ 2x − 6 (x2 − 6x + 13) 2 dx = 15 ∫ du u2 = − 15 u + C1 = − 15 x2 − 6x + 13 + C1 , for some constant C1. For the second integral by completing the square one gets x2 − 6x + 13 = (x − 3) 2 + 22 . Hence 13 ∫ 1 (x2 − 6x + 13) 2 dx = 13 ∫ dx ( (x − 3) 2 + 22 )2 . CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.2.13. Important remarks. (1) Theorems 6.2.9 and 6.2.8 claim that the Riemann integral is a linear map ∫ : C[a, b] → R from the vector space of continuous functions on the interval [a, b] to real numbers. Hence it is a linear form on the vector space C[a, b]. (2) We proved that every continuous function is a derivative of some function. Hence the concepts of the Newton and Riemann integrals coincide for continuous functions. Therefore the Riemann integral of continuous functions can be computed as the difference of values F(b)−F(a) of the primitive function F. (3) In the first step of the proof of the theorem 6.2.9 we proved the important statement that for bounded functions f on finite intervals [a, b] the limits of the upper and lower sums always exist. They are called respectively the upper Riemann integral and the lower Riemann integral and they are also denoted by ∫ b a f(x) dx and ∫ b a f(x) dx. In this way we can define the Riemann integral for continuous functions as in the above proof. (4) We derived the important property of continuous functions called the uniform continuity on finite closed intervals [a, b]. Clearly every uniformly continuous function is continuous as well, but the converse is not true on open intervals. As an example, consider the function f(x) = sin(1/x) on the interval (0, 1). (5) Consider a function f on an interval [a, b], which is only piece-wise continuous. This means that f is continuous in all points c ∈ [a, b] except for finitely many discontinuities ci, a < ci < b, in which it has finite one-sided limits. Because of the additivity of the integral with respect to the interval of integration, see 6.2.8(1), the last theorem implies that in this case the integral F(x) = ∫ x a f(t)dt exists for all x ∈ [a, b] and the derivative of the function F(x) exists at all points x where f is continuous. It can be verified that F(x) is continuous at the remaining points. So it is a continuous function on the whole interval [a, b]. When evaluating the integral by primitive functions, it is necessary to choose its individual parts so that they are connected continuously at the points ci. Then the entire integral can be again computed as a difference of the function F(x) at its boundary values. (6) Lagrange’s mean value theorem for differentiable functions has an analogue which is called the integral mean value theorem. Suppose f(x) is continuous on an interval [a, b] and its primitive function is F(x). The mean value theorem claims that there exists a point c, a < c < b such that ∫ b a f(x) dx = F(b) − F(a) = F′ (c)(b − a) = f(c)(b − a). 543 We can now apply 6.B.23 with x0 = 3, a = 2 and n = 2. In particular we have 13 · K2(3, 2) = 13 ∫ dx ( (x − 3) 2 + 22 )2 = 13 22 ( 1 2 K1(3, 2) + x − 3 2 ( (x − 3)2 + 22 ) ) = 13 4 ( 1 4 arctan ( x − 3 2 ) + C2 + x − 3 2 ( (x − 3)2 + 22 ) ) = 13 16 arctan ( x − 3 2 ) + 13(x − 3) 8 ( x2 − 6x + 13 ) + C2 , for some constant C2. In total we get I = 13 16 arctan ( x − 3 2 ) + 13x − 159 8 (x2 − 6x + 13) +C , C ∈ R . A verification via Sage occurs as usual, i.e., type show(integral((30 ∗ x − 77)/(x2 − 6 ∗ x + 13)2 , x)). □ 6.B.25. Consider the rational function R(x) = x2 − x − 1 x3 + 3x2 − 16x + 12 . 
(a) Find its domain and its discontinuities, and sketch its graph. (b) Find the decomposition of R into partial fractions. (c) Evaluate the integral ∫ R(x) dx.

Solution. (a) Let us express R as R(x) = f(x)/g(x). It is obvious that g(1) = 0, and based on Horner's scheme we get the factorization g(x) = (x − 1)(x − 2)(x + 6). Using Sage one can type factor(x**3 + 3*x**2 - 16*x + 12). Hence the domain of R is the set R\{−6, 1, 2}. Notice the denominator g has three distinct real roots and none of them is a root of the numerator f. Hence all the discontinuities of R = f/g appear at the roots 1, 2 and −6 of g, as we can also see from the graph of R:

(b) Let us assume that
R(x) = (x² − x − 1)/((x − 1)(x − 2)(x + 6)) = A₁/(x − 1) + A₂/(x − 2) + A₃/(x + 6)

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

This statement can be derived directly from the definition of the Riemann integral. It can then be used in the final step of the proof of the fundamental theorem of integral calculus.

6.2.14. Improper integrals. When discussing the integration of rational functions f, there is a need to consider definite integrals over intervals where f(x) has improper (one-sided) limits. Here f is neither continuous nor bounded. Thus earlier definitions and results may not apply. We speak of "improper" integrals. A simple solution is to discuss the definite integral on a smaller sub-interval, and determine whether the limit value of such a definite integral exists when the boundary approaches the problematic point. If it does, the corresponding improper integral exists and equals this limit. We illustrate this procedure by an example:
I = ∫_0^2 dx/(2 − x)^{1/4}.
This is an improper integral, because the integrand f(x) = (2 − x)^{−1/4} has its left-sided limit ∞ at the point b = 2. The integrand is continuous at all other points. Thus, for 0 < δ < 2, consider the integrals (substituting y = 2 − x)
I_δ = ∫_0^{2−δ} dx/(2 − x)^{1/4} = ∫_δ^2 y^{−1/4} dy = [(4/3) y^{3/4}]_δ^2 = (4/3)(2^{3/4} − δ^{3/4}).
Notice that dy = −dx, and x = 2 − δ corresponds to y = δ, while x = 0 corresponds to y = 2. The limit when δ → 0 from the right clearly exists, so the improper integral is evaluated:
I = ∫_0^2 dx/(2 − x)^{1/4} = (4/3) 2^{3/4}.
We proceed in the same way to integrate over an unbounded interval. In this case, we speak of improper Riemann integrals of the first kind. The integrals of unbounded functions on finite intervals are improper Riemann integrals of the second kind. More explicitly, we can define the integrals of both kinds as follows. For a ∈ R and f defined on [a, b) and bounded on each [a, c] ⊂ [a, b),
I = ∫_a^b f(x) dx = lim_{c→b} ∫_a^c f(x) dx,
if the integrals and the limit on the right hand side exist. Here b is either finite or b = ∞. Similarly, we can have a finite fixed upper bound and −∞ ≤ a < c < b with an infinite lower bound. If both a and b are infinite, we can evaluate the integral as a sum of two integrals with a chosen fixed bound in the middle, as in
∫_{−∞}^{∞} f(x) dx = ∫_{−∞}^{a} f(x) dx + ∫_{a}^{∞} f(x) dx.

for some reals A₁, A₂, A₃ to be specified. This relation can be equivalently written (by multiplying by the denominator) as
f(x) = A₁(x − 2)(x + 6) + A₂(x − 1)(x + 6) + A₃(x − 1)(x − 2).
There are many approaches to compute A₁, A₂, A₃. For instance, we may gather together like powers of the variable and equate their coefficients. This gives a set of equations to solve for A₁, A₂, A₃. As an alternative, we can plug x = 1 into this equation, which gives A₁ = 1/7. Likewise, the substitution of x = 2 gives A₂ = 1/8, while that of x = −6 yields A₃ = 41/56.
Thus, the decomposition of R into partial fractions is given by R(x) = 1 7(x − 1) + 1 8(x − 2) + 41 56(x + 6) . (∗) Sage provides a built-in method for computing the decomposition of a rational function into partial fractions, given by the command partial_fraction. For instance, for a Sage confirmation of (∗) one may use the cell f(x)=(x^2-x-1); g(x)=(x-1)*(x-2)*(x+6) R(x)=f(x)/g(x); show(R.partial_fraction()) (c) This is easy and relies on our previous conclusion (∗), i.e., ∫ R(x) dx = 1 7 ∫ 1 x − 1 dx + 1 8 ∫ 1 x − 2 dx + 41 56 ∫ 1 x + 6 dx = 1 7 ln |x − 1| + + 1 8 ln |x − 2| + 41 56 ln |x + 6| + C , for some constant C. Adding in the previous cell the command show(integral(R, x)), Sage confirms our answer. □ 6.B.26. Evaluate the integral ∫ Q(x) dx, where Q(x) = x (x − 1) 2 (x2 + 2x + 2) , x ̸= 1 . Solution. According to the discussion in 6.2.7, we may assume that Q(x) = A x − 1 + B (x − 1)2 + Cx + D x2 + 2x + 2 , for A, B, C, D ∈ R. This can be equivalently expressed as x = A (x − 1) ( x2 + 2x + 2 ) + B ( x2 + 2x + 2 ) + (Cx + D) (x − 1) 2 . By setting x = 1 we immediately get B = 1/5. By comparing the coefficients at the same powers of the polynomials we also get A = 1 25 , C = − 1 25 , and D = − 8 25 . Hence Q(x) = 1 25(x − 1) + 1 5(x − 1)2 − x + 8 25 (x2 + 2x + 2) . (∗) An easy way to verify this fractional decomposotion is by Sage, via the function partial_fraction as in 6.B.25, i.e., CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Its existence and its value do not depend on the choice of such bound, because by changing it, we only change both summands by the same finite value, but with opposite sign. At the same time a limit for which the upper and lower bound would approach ±∞ at the same speed can lead to different results! For example ∫ a −a x dx = [ 1 2 x2 ]a −a = 0, even though the values of the integrals ∫ b a x dx with a fixed and b → ∞ diverge to infinity. (This is the typical behavior for all odd functions.) Clearly, if f is continuous on [a, b), then the Newton integral F(x) = ∫ x a f(y) dy exists for all x ∈ [a, b) and the improper Riemann integral exists if and only if limx→b F(x) exists, and its value is this limit. Thus, for continous f(x) as in our simple example above, we compute the definite Newton integral and take its limit. This is what we did. The integrated functions may have more discontinuities with infinite one-sided limits and the interval of integration may be unbounded. Then the integration intervals must be split in such a way that the individual intervals of integration include only one of the above phenomena. Hence when evaluating the improper integral of a rational function, divide the given interval according to the discontinuities of the integrated function. Then compute all the improper integrals separately. 6.2.15. Mean value of a function. For a finite set of n numbers, the mean value, or arithmetic mean, is obtained by summing the numbers and dividing by n. For a Riemann integrable function f(x) on an interval (finite or infinite) [a, b], the mean value is defined by m(f) = 1 b − a ∫ b a f(x) dx. By definition, m(f) is the altitude of the rectangle over the interval [a, b], which has the same area as that of the region between the x axis and the graph of the f(x) (counted with signs according to being above or below the x-axis). Hence the integral mean value theorem is true in general: Proposition. If f(x) is a Riemann integrable function on an interval [a, b], then there exists a number m(f) satisfying ∫ b a f(x) dx = m(f)(b − a). 6.2.16. 
Integral criterion for series. Using the improper integral, we can also decide the question of convergence for a class of infinite series whose summands are expressed as values of a function at integers:

Q(x)=x/((x-1)^2*(x^2+2*x+2))
show(Q.partial_fraction())

Based now on (∗) we can write
∫ Q(x) dx = ∫ dx/(25(x − 1)) + ∫ dx/(5(x − 1)²) − ∫ (x + 8)/(25(x² + 2x + 2)) dx = (1/25) ln|x − 1| − 1/(5(x − 1)) − (1/50) ln(x² + 2x + 2) − (7/25) arctan(x + 1) + C,
for some constant C. Here, as in 6.B.22, we get
∫ (x + 8)/(x² + 2x + 2) dx = ∫ ((1/2)(2x + 2)/(x² + 2x + 2) + 7/(x² + 2x + 2)) dx = (1/2) ∫ (2x + 2)/(x² + 2x + 2) dx + 7 ∫ dx/((x + 1)² + 1) = (1/2) ln(x² + 2x + 2) + 7 arctan(x + 1) + C. □

6.B.27. Evaluate the integral A = ∫ Q(x) dx, where Q(x) = (2x⁴ + 2x² − 5x + 1)/(x(x² − x + 1)²), x ≠ 0. Next confirm your result in Sage. ⃝

6.B.28. Evaluate ∫ 1/(e^{3x} − 2e^{2x}) dx, with x ≠ ln(2). ⃝

Let us also integrate an improper rational function f(x)/g(x), which means that the degree of the polynomial f in the numerator is greater than or equal to the degree of the polynomial g in the denominator. In such a case we should first carry out the division with remainder.

6.B.29. Evaluate ∫ (x³ + 2x² + x − 1)/(x² − x + 1) dx, with x ∈ R. Can you confirm all the steps of your solution via Sage?

Solution. The degree of the numerator f(x) = x³ + 2x² + x − 1 is greater than the degree of the denominator g(x) = x² − x + 1, hence we proceed by division of polynomials. Subtracting x · g(x) = x³ − x² + x from f(x) leaves 3x² − 1, and subtracting further 3 · g(x) = 3x² − 3x + 3 leaves the remainder 3x − 4, so
x³ + 2x² + x − 1 = (x² − x + 1)(x + 3) + (3x − 4).
This means that f(x) = g(x)(x + 3) + 3x − 4 and implies the following decomposition
f(x)/g(x) = (x + 3) + (3x − 4)/(x² − x + 1). (♭)
The division of polynomials can be successfully performed in Sage; one way goes as follows.

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

Integral criterion

Theorem. Let ∑_{n=1}^∞ f(n) be a series such that the function f : R → R is positive and nonincreasing on the interval [1, ∞). Then this series converges if and only if the integral ∫_1^∞ f(x) dx converges.

Proof. If the integral is interpreted as the area of a region under the curve, the criterion is clear. Indeed, notice that the given series diverges or converges if and only if the same is true for the same series without the first summand. Moreover, by the monotonicity of f(x), there are the estimates
s̃_k = ∑_{n=2}^{k} f(n) ≤ ∫_1^k f(x) dx ≤ ∑_{n=1}^{k−1} f(n) = s_{k−1},
because s̃_k is a lower sum of the Riemann integral while s_{k−1} is an upper sum. Thus, the integral converges if and only if the series does, as expected. □

6.2.17. New acquisitions to the ZOO. As a matter of fact, primitive functions (Newton integrals) are very rarely described in terms of known elementary functions. Indeed, nearly all continuous functions lead to integrals which we cannot express in this way. Functions obtained by integration often appear in applications. Many of them have names, and there are efficient methods to approximate them numerically (we shall come to this point briefly in 6.2.23 below). Such functions either appear as primitive functions, or they are given by definite integrals depending on further parameter(s). We present just a few (important) examples now. In the methods of signal processing, a very important function is the so-called sinc function:
(1) sinc(x) = sin(x)/x.
We shall meet the sinc function in our discussion of the Fourier transform in 7.2.6. Check yourself that it is a smooth function with limit values f(0) = 1, f′′(0) = −1/3, . . .
, f(2k) (0) = (−1)k 1 2k + 1 , whilst all f(2k+1) (0) vanish. This is easily seen from the Taylor expansion of the sine function (try to verify this by computing the derivates directly - a good excercise on multiple use of L’Hospital even for very small orderes). The even function sinc has its absolute maximum at the point x = 0 and many local maxima and minima, with inflexions between them. It oscillates with a fast decreasing amplitude as x approaches infinity. The x-axis is the asymptot for both infinities. 546 f(x)=x^3+2*x^2 + x - 1 ; g(x)=x^2-x+1 show(f.maxima_methods().divide(g)) Sage’s output has the form [x + 3, 3x − 4], where the first expression encodes the quotient and the second one the reminder. One may like to certify Sage’s answer by adding the syntax bool(g(x) ∗ (x + 3) + 3 ∗ x − 4 == f(x)), which returns True. Based now on (♭) one has ∫ f(x) g(x) dx = ∫ (x + 3) dx + ∫ 3x − 4 x2 − x + 1 = x2 2 + 3x + 3 2 ln ( x2 − x + 1 ) − 5 √ 3 arctan ( 2x − 1 √ 3 ) + C . Here, to compute the integral ℓ = ∫ 3x − 4 x2 − x + 1 dx you can apply the same method as 6.B.22 (since g(x) has only complex roots). In particular, ℓ = 3 2 ∫ 2x − 1 x2 − x + 1 dx − 5 2 ∫ dx ( x − 1 2 )2 + (√ 3 2 )2 = 3 2 ln ( x2 − x + 1 ) − 5 √ 3 arctan ( 2x − 1 √ 3 ) + c , for some constant c. This agrees with Sage’s output for the command integral((3 ∗ x − 4)/(x2 − x + 1), x). □ Many integrals may initially appear complicated. In such cases, the power of the substitution method becomes evident. For instance, the substitution method often allows us to transform irrational functions into rational ones, which are easier to integrate. The tasks below focus on integrating irrational functions and serve as preparation for the material covered in the final section of this chapter, where more space is dedicated to this topic. But first, let us describe a few useful hints. • Hint 1: This is about integrals of the form ∫ f ( x, p1 √ x, p2 √ x, . . . , pk √ x ) dx for certain numbers p1, p2, . . . , pk ∈ N and a rational function f. In this case we set n √ x = t, i.e., tn = x, where n is the least common multiple of p1, . . . , pk. This substitution reduces the integrated function (integrand) to a rational function, which we can always integrate. When instead of the expressions pj √ x we have pj √ ax + b for all j = 1, . . . , k, where a, b ∈ R, then set tn = ax + b, where n occurs in the same vein as above. • Hint 2: This is about integrals of the type ∫ f ( x, p1 √ ax + b cx + d , p2 √ ax + b cx + d , . . . , pk √ ax + b cx + d ) dx, where the values a, b, c, d ∈ R are such that ad − bc ̸= 0. In this case we set tn = ax+b cx+d , where again n is the least CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Gamma, Si, and more The sine integral function is defined by (2) Si(x) = ∫ x 0 sinc(t) dt. Other important functions are Fresnel’s sine and cosine in- tegrals FresnelS(x) = ∫ x 0 sin (1 2 πt2 ) dt(3) FresnelC(x) = ∫ x 0 cos (1 2 πt2 ) dt.(4) One of the most important mathematical functions ever is the Gamma function. It is defined for all positive real numbers z by (5) Γ(z) = ∫ ∞ 0 e−t tz−1 dt. For all integers n ≥ 0, n! = Γ(n + 1). The function Si(x) is shown in the left figure. Both Fresnel’s functions are shown on the right. The Gamma function is defined via an improper integral dependent on the parameter z. We shall provide some theory for such functions in the next part of this chapter. In particular, it can be proved that this function is analytic at all points 0 < z ∈ R. 
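As a quick numerical cross-check (a sketch of our own, not needed for the theory), one can compare the defining improper integral with Sage's built-in gamma at a few sample points z:

var("t")
for z in [1, 2, 3, 1/2]:       # ad hoc sample values; 1/2 anticipates Γ(1/2) = √π later in this chapter
    approx = numerical_integral(e^(-t)*t^(z-1), 0, +Infinity)[0]
    print(z, approx, gamma(z).n())

The printed pairs agree to the accuracy of the numerical quadrature.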
For small z ∈ N, we can evaluate: Γ(1) = ∫ ∞ 0 e−t t0 dt = [− e−t ]∞ 0 = 1 Γ(2) = ∫ ∞ 0 e−t t1 dt = [− e−t t]∞ 0 + ∫ ∞ 0 e−t dt = 1 Γ(3) = ∫ ∞ 0 e−t t2 dt = 0 + 2 ∫ ∞ 0 e−t t dt = 0 + 2 = 2. Integration by parts reveals immediately Γ(z + 1) = zΓ(z). Hence for all positive integers n this function yields the value of the factorial: Γ(n) = (n − 1)!. The following figure shows the behaviour of the functions f1(x) = ln(Γ(x + 1)), f2(x) = ln(Γ(x)) together with the function x ln x−x+1 (the dashed one, the dots are the actual values of ln(n!)). 547 common multiple of p1, . . . , pk. This converts the integrand to a rational function, as well. 6.B.30. Determine the integrals ∫ f(x) dx, where (a) f(x) = 1 √ x3 + 5 √ x7 , with x > 0; (b) f(x) = x + 1 3 √ 3x + 1 , with x ̸= −1 3 ; (c) f(x) = 1 x √ x + 1 x − 1 , with x ∈ R\[−1, 1]. Solution. (a) For all x > 0 we have f(x) = 1 √ x3 + 5 √ x7 = 1 √ x2 · x + 5 √ x5 · x2 = 1 x( √ x + 5 √ x2) . The least common multiply of 2, 5 is 10. Hence, according to the first hint above to compute ∫ f(x) dx we set 10 √ x = t, that is, t10 = x with 10t9 dt = dx. Then it is easy to see that √ x = t5 and 5 √ x2 = t4 , thus ∫ f(x) dx = ∫ dx x (√ x + 5 √ x2 ) = ∫ 10t9 t10 (t5 + t4) dt = 10 ∫ dt t6 + t5 . Now we see that the function Q(t) = 1 t6+t5 = 1 t5(1+t) admits the following decomposition into simple fractions: Q(t) = − 1 t + 1 + 1 t − 1 t2 + 1 t3 − 1 t4 + 1 t5 . You can verify this expression by Sage and the command var("t"); show((1/(t^6+t^5)).partial_fraction() Thus 10 ∫ Q(t) dt = 10 ∫ ( − 1 t + 1 + 1 t − 1 t2 + 1 t3 − 1 t4 + 1 t5 ) dt = 10 ( − ln(t + 1) + ln(t) + 1 t − 1 2t2 + 1 3t3 − 1 4t4 ) + C = ln x (1 + 10 √ x)10 + 10 10 √ x − 5 5 √ x + 10 3 10 √ x3 − 5 2 5 √ x2 + C for some constant C. (b) Set t = 3 √ 3x + 1, i.e., t3 = 3x + 1 and dx = t2 dt. Then ∫ x + 1 3 √ 3x + 1 dx = ∫ t3 −1 3 + 1 t t2 dt = ∫ t3 − 1 + 3 3 t dt = 1 3 ∫ ( t4 + 2t ) dt = 1 3 ( t5 5 + t2 ) + C = 1 15 (3 x + 1) 5 3 + 1 3 (3 x + 1) 2 3 + C . Remember always to use Sage to verify your computations. For instance, one may find difficult to simplify the result of the back substitution t = 3 √ 3x + 1 in the expression 1 3 ( t5 5 + t2 ) . There are many different ways to implement this, and here we present one based on the function subs. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS The relatively good approximation should not be a surprise, because we may clearly write ln(n!) = ∑n k=1 ln k and the latter expression can be approximated by the integral∫ n 1 ln x = n ln n − n + 1. Hence we came close to a very important and famous formula approximating n! by nn e1−n : Stirling’s formula The asymptotic estimate for the factorial function is ln n! = n ln n − n + 1 + O(ln n) There is the famous Stirling’s approximation formula making this estimate much more precise (6) √ 2πnn+ 1 2 e−n ≤ n! ≤ e nn+ 1 2 e−n . The lower approximation is so good, that the function√ 2πxx+ 1 2 e−x would completely overlap with the black one on the picture above. We shall not provide the proof of the Stirling formula here. In order to understand the qualitative behavior of such functions (e.g. their differentiability etc.) we need to understand the limit processes much better. Before diving into this in the next section, we introduce several direct applications of the Riemann integral. 6.2.18. Riemann measurable sets. The definition of the Riemann integral is motivated by the concept of the area of rectangles in the plane with coordinates x and y. 
The definite integral ∫ b a is designed to correspond to the area of the region bounded by the x axis, the values of the function y = f(x) and boundary lines x = a, x = b. Moreover, the area of the region above the x axis is given with a positive sign, while values under the axis lead to a negative sign. From geometry, the length of an interval on the real line, and the area of a parallelogram determined by two vectors in the plane are basic concepts. This extends to the area of a parallelepiped in Euclidean vector space Rn . The areas/volumes of other subsets are yet to be defined. For some simple objects like triangles, polygons and polyhedrons, their area is given naturally by the generally expected properties of area 548 var("t"); a(t)=(1/3)*((t^5/5) + t^2) show(a.subs(t=(3*x+1)^(1/3))) As for the given integral, just type show(integral((x+1)/(3*x+1)^(1/3), x)) Observe that in our result there are still computations that can be done to arrive to a more solid expression, as 1 5 (3 x + 1) 2 3 (x + 2). The same can be done in Sage, by adding in the last two show commands the function factor, that is, show(a.subs(t = (3 ∗ x + 1)( 1/3)).factor()) and show(integral((x + 1)/(3 ∗ x + 1)( 1/3), x).factor()), respectively. Check yourself Sage’s outputs. (c) In this case set t = √ x+1 x−1 , thus t2 = x+1 x−1 and x = t2 +1 t2−1 . Moreover, it is easy to see that dx = −4t (t2−1)2 dt. This substitution will convert the initial integrand to a rational functional, namely ∫ 1 x √ x + 1 x − 1 dx = ∫ t2 − 1 t2 + 1 −4t2 (t2 − 1)2 dt = − ∫ 4t2 (t2 + 1)(t2 − 1) dt . If Q(t) = − 4t2 (t2 + 1)(t2 − 1) then we see that Q(t) = −2 2t2 (t2 + 1)(t2 − 1) = −2 ((t2 + 1) + (t2 − 1) (t2 + 1)(t2 − 1) ) = −2 ( 1 t2 + 1 + 1 t2 − 1 ) . However, 1 t2 − 1 = 1 2 ( 1 t − 1 − 1 t + 1 ) , and in total we get Q(t) = − 2 t2 + 1 − 1 t − 1 + 1 t + 1 . One can recover this expression in Sage by the cell var("t") (-(4*t^2)/((t^2-1)*(t^2+1))).partial_fraction() Thus now we can rewrite our integral as ∫ 1 x √ x + 1 x − 1 dx = ∫ ( 1 t + 1 − 1 t − 1 − 2 t2 + 1 ) dt = ln | t + 1 | − ln | t − 1 | − 2 arctan(t) + C = ln t + 1 t − 1 − 2 arctan(t) + C = ln √ x+1 x−1 + 1 √ x+1 x−1 − 1 − 2 arctan (√ x + 1 x − 1 ) + C . For a Sage confirmation type the usual show(integral((1/x)*sqrt((x+1)/(x-1)), x)) □ CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS (invariance with respect to Euclidean motions and additivity with respect to finite union of disjoint objects). We shall come to conceptual definitions in the next two chapters, mainly via generalizing the Riemann integral. Now we can start with several answers based on the univariate calculus. We start with the question how to measure the “volume” of one-dimensional subsets. We say that the subset A ⊂ R is (Riemann) measurable, if the function χ : R → R χA(x) = { 1 if x ∈ A 0 if x /∈ A. is Riemann integrable. That is, the (improper) integral m(A) = ∫ ∞ −∞ χA(x) dx exists (the finiteness of its value doesn’t matter). The function χA is called the characteristic function of the set A, the value m(A) is called the Riemann measure of the set A. Notice that for an interval A = [a, b] this yields the value ∫ ∞ ∞ χA(x) dx = ∫ b a dx = b − a, just as expected. The elementary properties of the Riemann integral imply that this definition of “size” has reasonable properties. The measure of a union of finitely many measurable pairwise disjoint subsets is the sum of their measures. In particular, every finite set A has zero measure. If instead we choose a countable union, this property is no longer true. 
For example, consider the set Q of all rational numbers as the union of one-element subsets. While every set containing only finitely many points has a zero measure by our definition, the characteristic function χQ is not Riemann integrable over any finite interval [a, b]. This is an essential flaw of the Riemann approach and we shall comment on how to improve it in the end of this chapter. Notice, the upper Riemann integral of the characteristic set χA corresponds to the infimum of the sums of lengths of finitely many intervals, by which we can cover the given set A. The lower integral is the supremum of the sums of lengths of finitely many disjoint intervals that can be embedded into the set A. We shall proceed in the same way in higher dimensions when defining the Jordan measure. For now, just remark that the area of a plane figure bounded by a graph of a function in the way described above is consistent with expectations and we are going to deduce some straightforward consequences for special higher dimensional concepts. 6.2.19. Length of a curve. The Riemann integral can be effectively used to compute the length of a curve in multidimensional Euclidean vector space Rn . For the sake of simplicity, we deal with a curve in R2 with coordinates x, y. Suppose a parametric description of a curve F : R → R2 , F(t) = [f(t), g(t)] 549 Recall from 6.2.2 that for an arbitrary function f that is continuous and bounded on a bounded interval (a, b), it holds the so called Newton-Leibniz formula, given by b∫ a f(x) dx = [F(x)] b a := lim x→b− F(x) − lim x→a+ F(x) . (⋆) Here, as usual F′ (x) = f(x) is the primitive function of f, with x ∈ (a, b). Under the given conditions, the function F always exists and so do both the proper limits appearing in (⋆). Therefore, to compute the definite integral one just needs to find the antiderivative and determine the respective one-sided limits. We also recall that the computation of definite integrals via Sage is simple, and relies on the command integral(f, x, a, b), or f.integral(x, a, b), see below for examples. 6.B.31. Compute π 3∫ π 6 tan2 (x) dx and π 4∫ 0 x cos2(x) dx. ⃝ 6.B.32. Compute the integrals given below by hand, and next present a confirmation by Sage. (a) ∫ 1 0 x √ 1 − x2 dx , (b) ∫ 2 1 1 √ x2 − 1 dx . ⃝ 6.B.33. Compute the integral ∫ π π/4 sin(3x) cos(x) dx in Sage. Next provide a formal computation. ⃝ 6.B.34. Prove the following inequalities: (a) √ 2 20 ≤ 1∫ 0 x9 √ 1 + x dx ≤ 1 10 ; (b) 1 < ∫ π/2 0 sin(x) x dx < π 2 . ⃝ 6.B.35. Let f : [−a, a] → R be a continuous function, a ∈ R. (a) If f is even show that ∫ a −a f(x) dx = 2 ∫ a 0 f(x) dx. (b) If f is odd show that ∫ a −a f(x) dx = 0. ⃝ 6.B.36. Let f : R → R be a continuous function which is periodic with period T > 0. (1) Show that the integral ∫ a+T a f(x) dx has the same value for every real number a, in particular, ∫ a+T a f(x) dx = ∫ T 0 f(x) dx . (2) More in general, show that ∫ a+nT a f(x) dx = n ∫ T 0 f(x) dx , CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS is given. Look at it as a trajectory of a movement. Assume that f(t) and g(t) have piece-wise continuous derivatives. By differentiating the map F(t) we obtain vectors corresponding to the speed of the movement along this trajectory. Hence the total length of the curve (i.e. distance traveled over time between the values t = a, t = b) is given by the integral over the interval [a, b], with the integrated function h(t) being the length of the vectors F′ (t). 
Therefore the length s is given by the formula
s = ∫_a^b h(t) dt = ∫_a^b √((f′(t))² + (g′(t))²) dt.
The result can be seen intuitively as a corollary of Pythagoras' theorem: the linear increment ∆s of the length of the curve corresponding to the increment ∆t of the variable t is given by the proportions in the right triangle, and thus at the level of differentials ds = √((f′(t))² + (g′(t))²) dt. In the special case when the curve is the graph of a function y = f(x) between points a < b, we obtain
s = ∫_a^b √(1 + (f′(x))²) dx,
and at the level of differentials, ds = √(1 + (y′(x))²) dx, just as expected.

As an example, we calculate the circumference of the unit circle as twice the integral of the function y = √(1 − x²) over [−1, 1]. We know that the result is 2π, because π is defined in this way.
s = 2 ∫_{−1}^{1} √(1 + (y′)²) dx = 2 ∫_{−1}^{1} √(1 + x²/(1 − x²)) dx = 2 ∫_{−1}^{1} dx/√(1 − x²) = 2 [arcsin x]_{−1}^{1} = 2π.
If we instead use y = √(r² − x²) = r√(1 − (x/r)²) and the bounds [−r, r] in the previous calculation, by substituting x = rt we obtain the circumference of the circle with radius r:
s(r) = 2 ∫_{−r}^{r} √(1 + (x/r)²/(1 − (x/r)²)) dx = 2 ∫_{−1}^{1} r/√(1 − t²) dt = 2r [arcsin t]_{−1}^{1} = 2πr.
The result is of course well known from elementary geometry. Nevertheless, by using integral calculus, we derive the important fact that the length of a circle is linearly dependent on its diameter 2r. The number π is exactly the ratio appearing in this dependency.

for any a ∈ R and n ∈ Z. ⃝

6.B.37. Use 6.B.36 to compute the following integrals: (a) ∫_{π/6}^{π/6+2π} |sin(x)| dx; (b) ∫_{2π}^{4π} |sin(x)| dx.

Solution. (a) The function f(x) = |sin(x)| is periodic with period T = π > 0. To confirm this claim in Sage use the bool command, as usual, i.e.,

f(x)=abs(sin(x)); bool(f(x+pi)==f(x))

The first integral corresponds to the grey area in the following figure:

An application of the formula given in the second part of 6.B.36 gives
∫_{π/6}^{π/6+2π} |sin(x)| dx = ∫_0^{2π} |sin(x)| dx = 2 ∫_0^π |sin(x)| dx = 2 ∫_0^π sin(x) dx = 2 [−cos(x)]_0^π = 2 · 2 = 4.
To confirm this computation in Sage add in the previous cell the syntax integral(f(x), x, pi/6, 2*pi + pi/6).

(b) Since ∫_0^π |sin(x)| dx = 2, the second integral is also 4, see the figure given for this case:

Here is a formal computation:
∫_{2π}^{4π} |sin(x)| dx = ∫_0^{4π} |sin(x)| dx − ∫_0^{2π} |sin(x)| dx = 4 ∫_0^π |sin(x)| dx − 2 ∫_0^π |sin(x)| dx = (4 − 2) ∫_0^π |sin(x)| dx = 2 · 2 = 4.
Once more, to confirm the result via Sage just type integral(abs(sin(x)), x, 2*pi, 4*pi). □

According to the fundamental theorem of integral calculus, whenever f : A ⊆ R → R is a continuous function and

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

6.2.20. Areas and volumes. The Riemann integral can be used to compute areas or volumes of shapes defined by a graph of a function. As an example, calculate the area of a circle with radius r. The quarter-circle bounded by the function √(r² − x²) for 0 ≤ x ≤ r determines one quarter of the area. Use the substitution x = r sin t, dx = r cos t dt (cf. the corollary for I₂ in paragraph 6.2.6) to obtain, by symmetry,
a(r) = 4 ∫_0^r √(r² − x²) dx = 4r² ∫_0^{π/2} cos² t dt = 4r² ∫_0^{π/2} sin² t dt = (1/2) · 4r² ∫_0^{π/2} (cos² t + sin² t) dt = 2r² ∫_0^{π/2} dt = πr².
It is worth noticing that this well-known formula is derived from the principles of integral calculus. The area of a circle is not only proportional to the square of the radius; this proportion is again given by the constant π. Notice the ratio of the area to the perimeter of a circle: πr²/(2πr) = r/2.
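The computation can also be verified numerically in Sage; the following lines are a small illustrative sketch of our own (with the ad hoc choice r = 1), comparing four times the quarter-circle integral with πr²:

r = 1                                                   # illustrative radius
quarter = numerical_integral(sqrt(r^2 - x^2), 0, r)[0]  # area under the quarter-circle
print(4*quarter, (pi*r^2).n())                          # both approximately 3.14159...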
The square with the same area has sides of length √π r, and therefore its perimeter is 4√π r. Hence the perimeter of a square with the area of the unit circle is 4√π, compared to the perimeter 2π of the unit circle, which is about 0.8 less. It can be shown that in fact the circle is the shape with the smallest perimeter among all shapes with the same area. We derive such results in the comments about the calculus of variations in chapter 9.

Another analogy of this approach is the computation of the volume or the surface area of a solid of revolution. Such a set in R³ is defined by plotting the graph of a function y = f(x) (for x in an interval [a, b]) in the plane xy and rotating this plane around the x axis. This is exactly what happens when producing pottery on a jigger – the hands shape the clay in the form of y = f(x). When computing the area of the surface, an increment ∆x causes the area to increase by the product of the length ∆s of the curve given by the graph of the function y = f(x) and the circumference of the circle with radius f(x). Hence the surface area A(f) is computed by the formula
A(f) = 2π ∫_a^b f(x) ds = 2π ∫_a^b f(x) √(1 + (f′(x))²) dx,
where the differential ds is given by the increment of the length of the curve y = f(x), see above. If instead we determine the solid of revolution by its boundary parametrized in the xy plane by a pair of functions [x(t), y(t)], then the corresponding differential of the length s has the form ds = √((x′(t))² + (y′(t))²) dt. Thus we obtain
A = 2π ∫_a^b y(t) √((y′(t))² + (x′(t))²) dt.

a ∈ A, the function
F(x) = ∫_a^x f(t) dt, x ∈ A,
satisfies F′(x) = f(x) for any x ∈ A. Therefore F is a primitive of f on A. This function plays a key role in integral calculus and in many cases can be used to simplify our computations. The next series of tasks is based on this scheme and will help us to master the fundamental theorem of calculus. Further similar exercises are presented in Section D.

6.B.38. (a) Find the derivative of the function F(x) = ∫_0^x t⁵ ln(t + 1) dt, with x ∈ (−1, 1). (b) If F(x) = ∫_1^x (t² sin(t) + 4 cos(4t)) dt, with x > 0, show that lim_{x→∞} g(x) = 1, where g(x) := F′(x)/(x² + 2). ⃝

6.B.39. Local extremes. Find the local extremes of the function F(x) = ∫_0^x (sin(t)/t) dt, with x > 0.

Solution. For all x > 0 we have F′(x) = sin(x)/x. Thus the critical points of F, i.e., the solutions of the equation F′(x) = 0, have the form x = kπ for k ∈ Z₊, and they all provide local extremes of F. In particular, the second derivative of F has the form
F′′(x) = (sin(x)/x)′ = (x cos(x) − sin(x))/x², x > 0,
and we see that F′′(kπ) = kπ cos(kπ)/(kπ)² = (−1)^k/(kπ). Thus
• F′′(kπ) < 0 if k is odd ⟹ F attains a local maximum at (2m + 1)π, with m ∈ N.
• F′′(kπ) > 0 if k is even ⟹ F attains a local minimum at (2m)π, with m ∈ Z₊. □

6.B.40. Find the third-order Taylor expansion of the function f(x) = ∫_0^x 1/(1 + t²) dt, x ∈ R, around the point a = 0, both by hand and by Sage.

Solution. Recall from 6.A.8 that we want to determine the polynomial
T^3_0 f(x) = f(0) + f′(0) x + (f′′(0)/2) x² + (f^(3)(0)/6) x³. (♭)
Obviously, f(0) = 0 and
f′(x) = 1/(1 + x²) ⟹ f′(0) = 1,
f′′(x) = (1/(1 + x²))′ = −2x/(1 + x²)² ⟹ f′′(0) = 0,
f^(3)(x) = (−2x/(1 + x²)²)′ = 2(3x² − 1)/(1 + x²)³ ⟹ f^(3)(0) = −2.

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

When computing the volume of the same solid, the increase ∆x causes the volume to increase by a multiple of this increment and the area of the circle with radius f(x).
Hence it is given by the formula V (f) = π ∫ b a (f(x))2 dx. As an example of using the formulas for surface and volume, we derive the well known formulas for the surface of the sphere and volume of the ball with diameter r. Ar = 2π ∫ r −r r √ 1 − (x/r)2 1 √ 1 − (x/r)2 dt = 2πr ∫ r −r dt = 4πr2 Vr = π ∫ r −r (r2 − x2 ) dx = π [ r2 x − 1 3 x3 ]r −r = 4 3 πr3 . Similarly to the circle, the ball is also the object which has the smallest surface among all with the given volume. That is the reason why small soap bubbles almost always assume this shape. 6.2.21. Differential equations. Theorem 6.2.9 can be understood in the following way. Given a continuous function f(x) on a bounded interval [a, b], the set of all functions y of one variable x satisfying the equality y′ = f(x) is given by the formula y(x) = ∫ x a f(t) dt + C with the constant C = y(a). This is the simplest instance of differential equations. More generally, ordinary differential equations of first order are given as y′ = f(x, y), where f(x, y) depends on two unknown variables x and y. A solution to this equation is a function y = y(x), such that the equality is true upon substitution. Similarly, dependence on higher derivatives of y may be included. We return to this concept in Chapter 8, see 8.3.2. For the present, we discuss one very special type of equations with separated variables y′ = f(x)g(y) and add a few observations concerning analytic solutions. Rewrite the equation in terms of the differentials, cf. 6.1.11, 1 g(y) dy = f(x) dx. Find the primitive functions on both sides to determine the unknown function y = y(x) implicitly. 552 Thus, according to (♭) we get T3 0 f(x) = x − x3 3 , x ∈ R. In Sage, as in 6.A.9 one can use the command taylor, though now the variable t should be included as a symbolic variable and assume that x > 0: var("x, t"); assume(x>0) f(x)=integral(1/(1+t^2), t, 0, x) T(x)=taylor(f(x), x, 0, 3); show(T(x)) Sage’s output has the form − 1 3 x3 + x. □ 6.B.41. Ler f : R → R be a differentiable function on R, having continuous second derivative everywhere on R and a local extremum at x0 = 1. If for some constants α, β ∈ R with α ̸= 1 we have the relation ∫ 1 0 ( x f′′ (x) + α f′ (x) ) dx = β (∗) and the graph Cf of f passes through the point P = [0, β] ∈ R2 , compute f(1) in terms of α and β. Solution. By assumption, f is differentiable everywhere on R and hence also at x0 = 1, and moreover attains a local extremum at x0, so we have f′ (x0) = f′ (1) = 0. We also have f(0) = β, since P belongs to the graph of f. Thus, if A is the left-hand-side of (∗), by applying integration by parts we get A = ∫ 1 0 x f′′ (x) dx + α ∫ 1 0 f′ (x) dx = [ xf′ (x) ]1 0 − ∫ 1 0 x′ f′ (x) + a [ f(x) ]1 0 = f′ (1) − [ f(x) ]1 0 + a [ f(x) ]1 0 = 0 − (f(1) − f(0)) + a(f(1) − f(0)) = (α − 1)(f(1) − f(0)) = (α − 1)(f(1) − β) . Thus we should have (α − 1)(f(1) − β) = β from where we get f(1) = β α−1 + β = α β α−1 . □ 6.B.42. Consider the function F(x) = x ∫ x 1 f(t) dt + (1 − x) ∫ x 0 f(t) dt with x ∈ [0, 1], where f : [0, 1] → R is continuous function on R with f(0)f(1) ̸= 0. Show that there exists ξ ∈ (0, 1) with f(ξ) = ∫ 1 0 f(t) dt . (∗) Solution. The function F is differentiable on [0, 1] and it easy to see that it satisfies F(0) = 0 = F(1). Therefore, by Rolle’s theorem there exists ξ ∈ (0, 1) with F′ (ξ) = 0. We will show that ξ satisfies (∗). In particular, a combination of the product rule with the fundamental theorem of calculus CHAPTER 6. 
DIFFERENTIAL AND INTEGRAL CALCULUS Indeed, if G(y) and F(x) are the primitive functions with G′ (y) = 1 g(y) and F′ (x) = f(x), and y(x) satisfies G(y(x)) = F(x), then differentiating both sides with respect to x yields 0 = G′ (y(x))y′ (x) − F′ (x) = y′ (x) g(y(x)) − f(x) as expected. Of course, it is necessary to be careful with the values y for which g(y) = 0, which need to be discussed separately. For example, the equation y′ = y leads to the implicit definition ln |y| = x + C, which for positive y provides y(x) = D ex with positive constant D, the constant solution y = 0 corresponds to D = 0. Negative values of y correspond to negative constants D in the same expression. If y(0) = 1, we recover the exponential y(x) = ex . 6.2.22. Analytic solutions. As we know, the power series are differentiated and integrated term by term, thus the solution y(x) to the equation y′ = f(x) with a known analytic function f(x) = ∑∞ n=0 anxn must be y(x) = ∞∑ n=0 1 n+1 anxn+1 + y(0), where y(0) is the free integration constant. The solution is defined on the covergence domain of the power series (which has to be same as that of f by the lim sup formula). Of course we might use series centered in other points x0 if prescribing the initial value y(x0). (We shall prove much later, that actually there is always the unique solution with the given initial prescribed value y(0) in Chapter 8.) The latter equation y′ = y had the analytic solution ex , too. Let us consider the general case of this type, i.e. equations of the form (1) y′ = f(y) with an analytic right-hand side f(y). Given the initial condition y(x0) = y0, straightforward differentiation with the help of the chain rule and the equation (1) shows y′ (x0) = f(y0) y′′ (x0) = f′ (y)y′ |x=x0 = f′ (y)f(y)|x=x0 = f′ (y0)f(y0) y′′′ (x0) = ( f′′ (y)y′ f(y) + f′ (y)f′ (y)y′ ) |x=x0 = f′′ (y0)(f(y0))2 + (f′ (y0))2 f(y0) ... Two crucial observations are due here. First, giving the initial condition y(x0) = y0, all derivatives y(k) (x0) are given at this point by the equation. Thus, if an analytic solution exists, we know it explicitly. So we have to focus on the convergence of the known formal expression of the series y(x) = ∑∞ n=0 1 n! y(n) (x0)(x − x0)n and we arrive at the theorem below. 553 (see 6.2.9), gives F′ (x) = ( x ∫ x 1 f(t) dt )′ + ( (1 − x) ∫ x 0 )′ = ∫ x 1 f(t) dt + x f(x) − ∫ x 0 f(t) dt + (1 − x)f(x) = − ( ∫ x 0 f(t) dt + ∫ 1 x f(t) dt ) + f(x) = − ∫ 1 0 f(t) dt + f(x) . This expression combined with the equation F′ (ξ) = 0 yields the result. □ 6.B.43. Let f : [0, 1] → R be a continuous function. Show that ∫ π 0 x f ( sin(x) ) dx = π 2 ∫ π 0 f ( sin(x) ) dx . Solution. Set A = ∫ π 0 x f ( sin(x) ) dx. The trick is the substitution u = π − x with du = −dx. For x = 0 we have u = π and for x = π we have u = 0. Recall also that sin(π − u) = sin(π) cos(u) − sin(u) cos(π) = 0 − (−1) sin(u) = sin(u) , for all u ∈ R. Thus A = − ∫ 0 π (π − u)f ( sin(π − u) ) du = ∫ π 0 (π − u)f ( sin(u) ) du = π ∫ π 0 f ( sin(u) ) du − ∫ π 0 u f ( sin(u) ) du , and the result follows easily. □ Along the proof of the fundamental theorem of integral calculus we used the notion of “uniformly continuous functions”, see 6.2.11. Uniform continuity is a stronger condition than continuity, since the real number δ in the relative definition depends only on ε. Thus, every uniformly continuous function defined on a subset A ⊆ R is also continuous, but the converse is not true (think for example of the hyperbola 1/x on (0, 1)). 
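To get a first feeling for how uniform continuity fails for the hyperbola 1/x on (0, 1), here is a small numerical sketch of our own, anticipating the sequence criterion of 6.B.44 and the exercise 6.B.46 below. With x_n = 1/(n + 1) and y_n = 1/(n + 2), both sequences lie in (0, 1) and x_n − y_n → 0, yet f(x_n) − f(y_n) = (n + 1) − (n + 2) = −1 for all n:

f(x) = 1/x
for n in [10, 100, 1000]:         # ad hoc sample indices
    xn = 1/(n+1); yn = 1/(n+2)    # both points lie in (0, 1)
    print((xn - yn).n(), (f(xn) - f(yn)).n())

The first column shrinks to zero while the second stays at −1.000, so no single δ can serve, say, ε = 1/2.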
Let us present an alternative way to think about uniformly continuous functions and discuss some examples.

6.B.44. Show that a function f : A ⊆ R → R is uniformly continuous if and only if for any two sequences (x_n), (y_n) in A with x_n − y_n → 0, we also have f(x_n) − f(y_n) → 0 as n → ∞.

Solution. Let us prove the one direction and leave the converse as an exercise. So, let f : A ⊆ R → R be a uniformly continuous function. Then, for every ε > 0 there exists δ > 0 such that for all x, y ∈ A satisfying |x − y| < δ the inequality |f(x) − f(y)| < ε holds. Let (x_n), (y_n) be sequences in A

CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS

In its proof, the second observation will be most helpful: the expressions for the derivatives y^(n) are universal polynomials P_n,
y^(n)(x) = P_n(f(x), f′(x), . . . , f^(n−1)(x)),
in the derivatives of the function f, all with non-negative coefficients and independent of the particular equation.⁵

Cauchy–Kovalevskaya theorem in dimension one

Theorem. Assume f(y) is a real analytic function convergent on the interval (x₀ − a, x₀ + a) ⊂ R and consider the differential equation (1) with the condition y(x₀) = y₀. Then the formal power series y(x) = ∑_{n=0}^∞ (1/n!) y^(n)(x₀)(x − x₀)^n converges on a neighborhood of x₀ and provides the solution to (1) satisfying the initial condition.

Proof. The second observation above suggests how to prove the convergence of the "candidate series" similarly to the way we proved the convergence of power series in general, i.e. by finding another convergent series whose partial sums bound ours from above. This was Cauchy's original approach to this theorem, and we speak of the method of majorants. Without loss of generality we fix x₀ = 0 and y(0) = 0 (we may always use the shifted quantities z = y − y₀ and t = x − x₀ to transform the general case). Assume we can find another analytic function g(x) = ∑_{n=0}^∞ (1/n!) b_n x^n with all b_n = g^(n)(0) ≥ 0, i.e. g has all derivatives non-negative at the origin, such that g^(n)(0) ≥ |f^(n)(0)| for all n. Now, replace f in the equation (1) by g and write the formal power series z(x) = ∑_{n=0}^∞ (z^(n)(0)/n!) x^n for the potential solution of this equation as above. In particular, we deduce (recall that the universal polynomials P_n have non-negative coefficients)
z^(n)(0) = P_n(g(z(0)), . . . , g^(n−1)(z(0))) ≥ P_n(|f(y(0))|, . . . , |f^(n−1)(y(0))|) ≥ |y^(n)(0)|,
and consequently the convergence of z(x) implies the absolute convergence of y(x), i.e. the claim of the theorem.

We try to find a majorant in the form of a geometric series. Let us pick r > 0, smaller than the radius of convergence of f. Then, obviously, there is a constant C > 0 such that the derivatives a_n = f^(n)(0) satisfy |(1/n!) a_n r^n| ≤ C for all n, i.e. |a_n| ≤ C n!/r^n (the series would certainly not converge otherwise). We may recognize the derivatives of a geometric series and write
(2) g(z) = C ∑_{n=0}^∞ z^n/r^n = C r/(r − z),
with derivatives g^(n)(0) = C n!/r^n. Finally, we have to prove that the solution of the equation z′ = g(z) is analytic. We can easily integrate this equation

⁵ Although we shall not need the explicit formulae for these polynomials, they are well known under the name of Faà di Bruno's formula. In principle, they are a direct generalization of the Leibniz rule to higher order derivatives.

with x_n − y_n → 0, and let ε > 0 be given. Applying the definition of a convergent sequence with δ in place of ε, there exists some natural n₀ such that |x_n − y_n| < δ for all n ≥ n₀.
But then we also get |f(xn) − f(yn)| < ε for all n ≥ n0, which is equivalent to say that limn→∞ ( f(xn) − f(yn) ) = 0. □ 6.B.45. Let f : A → R be a continuous function defined on subset A of R. Provide examples verifying that: (a) If A is not closed, then f may not be uniformly continuous on A; (b) If A is not bounded, then f may not be uniformly continuous on A. ⃝ 6.B.46. Based on 6.B.44, show that the hyperbola f(x) = 1/x with x ∈ (0, 1) is not uniformly continuous. ⃝ 6.B.47. Show that the sine function sin(x) is uniformly continuous on R. Solution. First we will use the Lagrange’s mean value theorem (see 5.3.9) to show that | sin(x) − sin(y)| ≤ |x − y|, for any x, y ∈ R. If x = y we have nothing to prove, so assume that x ̸= y. Without loss of generality we may also assume that x < y. The sine function sin(x) satisfies the conditions of the Lagrange’s mean value theorem on the interval [x, y]. Thus, there exists some ξ ∈ (x, y) with sin(x) − sin(y) x − y = sin′ (ξ) = cos(ξ) . Since | cos(ξ)| ≤ 1 we get sin(x)−sin(y) x−y ≤ 1, which implies that | sin(x) − sin(y)| ≤ |x − y|, for any x, y ∈ R. Let us now prove that sin(x) is uniformly continuous on R. Given some ε > 0 take δ = ε. If x, y ∈ R satisfy |x−y| < δ, by the previous inequality we will have | sin(x) − sin(y)| ≤ |x − y| < δ = ε, which proves the claim. □ 6.B.48. Show that the cos function cos(x) is uniformly continuous on R. ⃝ 6.B.49. Let f : A → R be a continuous function defined on a subset A ⊆ R. (a) Show with an example that if (xn) is a Cauchy sequence on A, then the sequence (f(xn)) may fail to be Cauchy. (b) If f is in addition uniformly continuous on A, show that the sequence (f(xn)) is Cauchy for any Cauchy sequence (xn) on A. ⃝ 6.B.50. Which of the following functions is uniformly continuous on the given domain? f(x) = x2 on A = [0, 1], g(x) = tan(x) on B = [0, π/2), h(x) = x2 on R, and k(x) = x3 on R. ⃝ In many cases we should compute integrals over an infinite interval, or integrals of functions containing a discontinuity, see 6.2.14. In such cases one speaks for “improper integrals”, which are in general definite integrals that cover an unbounded area. For convenience we summarize the following rules: CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS with separated variables directly. Written with the help of differentials, (r − z) dz = Cr dx. Thus, the implicit equation reads 1 2 (r − z)2 = −Crx + D, where the constant D is determined by z(0) = 0. Consequently D = 1 2 r2 and a simple computation reveals the solution of the implicit equation z(x) = r ( 1 ± √ 1 − 2Cx r ) . The option with the minus sign satisfies our initial condition. This clearly is an analytic function, g provides the requested majorant, and the proof is finished. □ 6.2.23. Numerical approximation of integration. Just as in paragraph 6.1.12, we use the Taylor expansion to propose simple approximations of integration. We deal with an integral I = ∫ b a f(x) dx of an analytic function f(x) and a uniform partition of the interval [a, b] using points a = x0, x1, . . . , xn = b with distances xi − xi−1 = h > 0. Denote the points in the middle of the intervals in the partitions by xi+1/2 and the values of the function at the points of the partition by f(xi) = fi. Compute the contribution of one segment of the partition to the integral by the Taylor expansion (knowing that the power series might be integrated term by term). 
Integrate symmetrically around the middle values so that the derivatives of odd orders cancel each other out while integrating: ∫ h/2 −h/2 f(xi+1/2 + t) dt = ∫ h/2 −h/2 ( ∑∞ n=0 1/n! f(n)(xi+1/2) tn ) dt = ∑∞ k=0 ( ∫ h/2 −h/2 1/k! f(k)(xi+1/2) tk dt ) = ∑∞ k=0 h2k+1/(22k(2k + 1)!) f(2k)(xi+1/2). A simple numerical approximation of integration on one segment of the partition is the trapezoidal rule. This uses the area of the trapezoid given by the points [xi, 0], [xi, fi], [xi+1, 0], [xi+1, fi+1] for approximation. This area is Pi = 1/2 (fi + fi+1)h. In total, the integral I is approximated by Itrap = ∑ n−1 i=0 Pi = h/2 (f0 + 2f1 + · · · + 2fn−1 + fn). Compare Itrap to the exact value of I computed by contributions over the individual segments of the partition. Express the values fi by the middle values fi+1/2 and the derivatives f(k) i+1/2 in the following way: fi+1/2±1/2 = fi+1/2 ± h/2 f′ i+1/2 + h2/(2!22) f′′ i+1/2 ± h3/(3!23) f(3) i+1/2 + . . . . 555 (1) ∫ +∞ a f(x) dx = limc→+∞ ∫ c a f(x) dx, where f is assumed to be continuous on [a, +∞). (2) ∫ b −∞ f(x) dx = limc→−∞ ∫ b c f(x) dx, where f is assumed to be continuous on (−∞, b]. (3) ∫ +∞ −∞ f(x) dx = ∫ 0 −∞ f(x) dx + ∫ +∞ 0 f(x) dx, where f is assumed to be continuous on R. (4) ∫ b a f(x) dx = limt→b− ∫ t a f(x) dx, where f is assumed to be continuous on [a, b). (5) ∫ b a f(x) dx = limt→a+ ∫ b t f(x) dx, where f is assumed to be continuous on (a, b]. (6) ∫ b a f(x) dx = ∫ c a f(x) dx + ∫ b c f(x) dx, where f is assumed to be continuous on [a, c) ∪ (c, b]. 6.B.51. Show that the integral L = ∫ 2 0 1/(x − 1)2 dx is divergent, i.e., L = +∞. Solution. Observe that x0 = 1 is an improper point of the function f(x) = 1/(x − 1)2 and limx→1+ f(x) = limx→1− f(x) = limx→1 f(x) = +∞. Hence the line x = 1 is a vertical asymptote of f. However, the integrand f is continuous at all other points, see also its graph below. Already from the graph we understand that the statement should be true. To prove it we will extend the method presented in 6.2.14 for computing the integral I appearing there; in particular, this improper integral corresponds to case (6) of those listed above. Hence we have L = ∫ 2 0 f(x) dx = ∫ 1 0 f(x) dx + ∫ 2 1 f(x) dx = limδ→1− ∫ δ 0 f(x) dx + limε→1+ ∫ 2 ε f(x) dx. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Thus, the contribution Pi to the approximation is Pi = 1/2 (fi + fi+1)h = h ( fi+1/2 + h2/(2!22) f′′ i+1/2 ) + O(h5). Estimate the error ∆i = I − Itrap over one segment of the partition: ∆i = h ( fi+1/2 + h2/24 f′′ i+1/2 − fi+1/2 − h2/8 f′′ i+1/2 + O(h4) ) = −h3/12 f′′ i+1/2 + O(h5). The total error is thus estimated as |I − Itrap| = 1/12 nh3 |f′′| + n O(h5) = 1/12 (b − a)h2 |f′′| + O(h4), where |f′′| represents an upper estimate for |f′′(x)| of f over the interval of integration. If the linear approximation of the function over the individual segments does not suffice, we can try approximations by quadratic polynomials. To do so, three values are always needed, so work with segments of the partition in pairs. Suppose n = 2m and consider xi with odd indices. We choose fi+1 = f(xi + h) = fi + αih + βih2, fi−1 = f(xi − h) = fi − αih + βih2, which implies βi = 1/(2h2) (fi+1 + fi−1 − 2fi). The approximation of the integral over two segments of the partition between xi−1 and xi+1 is now estimated by the expression (notice we integrate the quadratic polynomial with the requested values fi−1, fi, fi+1 at the points xi−1, xi, xi+1, respectively.
It is not necessary to know the constant αi) Pi = ∫ h −h (fi + αit + βit2) dt = 2hfi + 2/3 βih3 = 2hfi + 2h/6 (fi+1 + fi−1 − 2fi) = h/3 (fi+1 + fi−1 + 4fi). This procedure is called Simpson's rule6. The entire integral is now approximated by ISimp = 1/3 h ( f0 + 4 ∑ m−1 i=0 f2i+1 + 2 ∑ m−1 i=1 f2i + f2m ). As with the trapezoidal rule above, the total error is estimated by |I − ISimp| = 1/180 (b − a)h4 |f(4)| + O(h5), where |f(4)| represents the upper bound for |f(4)(x)| over the interval of integration. 6This way of approximating the integral is attributed to the English mathematician and inventor Thomas Simpson (1710-1761). 556 Now, obviously F(x) = −1/(x − 1) satisfies F′(x) = f(x) and hence is a primitive of f. Applying the fundamental theorem of integral calculus we obtain ∫ δ 0 f(x) dx = F(δ) − F(0) = −δ/(δ − 1), ∫ 2 ε f(x) dx = F(2) − F(ε) = −(ε − 2)/(ε − 1), and hence L = limδ→1− δ/(1 − δ) + limε→1+ (ε − 2)/(1 − ε) = (+∞) + (+∞) = +∞. These computations can be done quickly in Sage, e.g., by the block F(x)=-1/(x-1); var("delta, eps") show((F(delta)-F(0)).factor()) show(lim((F(delta)-F(0)), delta=1, dir="-")) show((F(2)-F(eps)).factor()) show(lim((F(2)-F(eps)), eps=1, dir="+")) Notice the command integral(1/(x-1)^2, x, 0, 2) gives an error which itself confirms the divergence of L; check the very end of Sage's output for this command. □ 6.B.52. Show that ∫ +∞ 1 arctan(x)/(x√x) dx is a finite real number. ⃝ In 6.2.17 one meets several new acquisitions to the ZOO. One of them is the Gamma function, defined for all positive real numbers z by Γ(z) = ∫ ∞ 0 e−x xz−1 dx. The Gamma function is well defined and analytic for z > 0 (and more generally for complex numbers z with positive real part). Moreover, the defining integral converges for all z > 0, and the next task shows that the Gamma function can be seen as an extension of the factorial function to positive real numbers.4 6.B.53. An interpretation of the factorial. (a) Prove the recurrence formula Γ(n + 1) = nΓ(n), n ∈ Z+. (b) Using (a) show that Γ(n + 1) = n!, for all naturals n. (c) For some a > 0 prove that ∫ ∞ 0 e−ax xn−1 dx = (1/an) Γ(n). (d) Using the "Gaussian integral" ∫ ∞ 0 e−u2 du = √π/2, show that Γ(1/2) = √π. 4The gamma function was introduced by Leonhard Euler (1707-1783) in 1729, as a natural extension of the factorial operation n! (in the sense that n is replaced by a real or a complex number). CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 3. Sequences, series and limit processes While building a menagerie of functions, we encountered power series, which extended the collection of all polynomials in a natural way, see 5.4.10. We obtained the class of analytic functions in this way. We have also got the message that they are always smooth and may be integrated and differentiated term by term on the entire domain of convergence. We gave the link to a straightforward technical proof of these statements in 9.4.2 on page 873. Now we shall develop simple tools and show similar results for much more general series of functions. Moreover, functions often depend on further parameters which are dummy when differentiating or integrating, but we need to understand how the result behaves with respect to these parameters. For instance, what about the differentiability of the Gamma function introduced above? Or, when computing a volume or area depending on a free parameter, how to minimize it? Finally, at the end of this chapter we briefly introduce some more advanced concepts of integration.
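Before proceeding, note that the factorial interpretation of the Gamma function from 6.B.53 is easy to check directly in Sage; the following small cell is our own illustration (gamma and factorial are built-in):
# Quick check of 6.B.53: gamma(k+1) = k! and gamma(1/2) = sqrt(pi)
for k in [1, 2, 3, 4, 5]:
    print(k, gamma(k + 1), factorial(k))
show(gamma(1/2))                              # sqrt(pi)
show(integral(e^(-x) * x^(-1/2), x, 0, oo))   # the defining integral for z = 1/2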
6.3.1. How well behaved is a sequence of functions? We return to the discussion of the limits of sequences of functions and the sums of series of functions in view of the methods of differential and integral calculus. Consider a convergent series of functions S(x) = ∑∞ n=1 fn(x) on an interval [a, b]. Natural questions include: • If all functions fn(x) are continuous at some point x0 ∈ [a, b], is the function S(x) also continuous at the point x0? • If all functions fn(x) are differentiable at some point x0 ∈ [a, b], is the function S(x) also differentiable there and does the equality S′(x) = ∑∞ n=1 f′ n(x) hold? • If all functions fn(x) are Riemann integrable on an interval [a, b], is the function S(x) also integrable there and does the equality ∫ b a S(x) dx = ∑∞ n=1 ∫ b a fn(x) dx hold? Notice, it does not matter whether we discuss series or sequences, since the former are just limits of the sequences of partial sums. First, we demonstrate by examples that the answers to all three questions above are "NO!". Then we find additional conditions on the convergence of the series (or sequences) which guarantee the validity of all three statements. Later we shall mention alternative concepts of integration which are more satisfactory than the Riemann integral for even wider classes of functions. 6.3.2. Examples of nasty sequences. (1) Consider the functions fn(x) = (sin x)n on the interval [0, π]. The values of these functions are nonnegative and smaller than one at all points 0 ≤ x ≤ π, except 557 Solution. (a) Integration by parts gives the result, i.e., Γ(n + 1) = ∫ ∞ 0 e−x xn dx = [ −xn e−x ]∞ 0 + n ∫ ∞ 0 e−x xn−1 dx = − limx→∞ xn/ex + nΓ(n) = 0 + nΓ(n) = nΓ(n). Above we used the fact limx→∞ xn/ex = 0. This limit has the indeterminate form ∞/∞ and one way to compute it is by applying l'Hôpital's rule n times: limx→+∞ xn/ex = limx→+∞ n xn−1/ex = · · · = limx→∞ n!/ex = 0. (b) Obviously, Γ(1) = ∫ ∞ 0 e−x dx = [ −e−x ]∞ 0 = 1. Moreover, a direct computation shows that Γ(2) = ∫ ∞ 0 e−x x dx = 1 = Γ(1). Thus we can write Γ(2) = 1 · Γ(1) = 1!, and similarly by integration by parts one can prove that Γ(3) = 2Γ(2) = 2!. Let us apply induction on n to prove the general formula. For the inductive step suppose Γ(n) = (n − 1)! for some n. Then, by (a) we see that Γ(n + 1) = nΓ(n) = n(n − 1)! = n!, for all naturals n. (c) For any positive integer n we have Γ(n) = ∫ ∞ 0 e−x xn−1 dx. For a > 0 set x = ay with dx = a dy. Then we see that Γ(n) = ∫ ∞ 0 e−ay (ay)n−1 a dy = an ∫ ∞ 0 e−ay yn−1 dy, i.e., ∫ ∞ 0 e−ay yn−1 dy = Γ(n)/an. The result now follows. (d) This is based on the substitution x = u2 (with dx = 2u du): Γ(1/2) = ∫ ∞ 0 e−x x−1/2 dx = ∫ ∞ 0 (e−x/√x) dx = 2 ∫ ∞ 0 e−u2 du = 2 · √π/2 = √π. □ Recall that the mean value of a function over a given interval provides a single value that represents the average behaviour of the function on that interval. This concept is fundamental in various areas of mathematics, since it allows the estimation of the overall effect of a function over an interval without needing to consider every individual point. This has many applications in physics and engineering, where average values are often more useful than point values (such as in calculating average velocity, average temperature, or average concentration of a substance). CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS for x = π/2, where the value is 1.
Hence on the whole interval [0, π], these functions converge pointwise to the function f(x) = limn→∞ fn(x) = { 0 for all x ̸= π/2, 1 for x = π/2 }. The limit of the sequence of functions fn is a discontinuous function, even though all the functions fn(x) are continuous. The problematic point is an interior point of the interval. The same phenomenon occurs for a series of functions, because the sum is the limit of partial sums. Hence in the previous example, it suffices to express fn as the n-th partial sum. For example, f1(x) = sin x, f2(x) = (sin x)2 − sin x, etc. The figure plots the functions fm(x) for m = n3, n = 1, . . . , 10. (2) We look at the second question, i.e. badly behaving derivatives. A natural idea based on the same principle as above is to construct a sequence of functions which has the same nonzero derivative at one point, but becomes smaller and smaller. So they converge pointwise to the identically zero function. The next figure plots the functions fn(x) = x(1 − x2)n on the interval [−1, 1] for values n = m2, m = 1, . . . , 10. It is immediate that limn→∞ fn(x) = 0 and that all functions fn(x) are smooth. Their derivative at x = 0 is f′ n(0) = ( (1 − x2)n − 2nx2(1 − x2)n−1 ) |x=0 = 1 for all n. But the limit function of the sequence fn has a zero derivative at every point. (3) The counterexample to the third statement is in 6.2.18 already. The characteristic function χQ of rational numbers can be expressed as a sum of countably many functions, which are numbered exactly by rational numbers. They are zero everywhere except for the single point after which they 558 On the other hand, the integral mean value theorem provides a rigorous way to calculate the average value of a function over an interval. Next we will describe applications related to these theorems, see also 6.2.15. 6.B.54. For the function f(θ) = cos(2θ) e1+sin(2θ), determine its average value on [−π/4, π/4]. Solution. The average value of f on [−π/4, π/4] is given by m(f) = 1/(π/4 − (−π/4)) ∫ π/4 −π/4 f(θ) dθ = 2/π ∫ π/4 −π/4 f(θ) dθ. To compute this integral one may set u = 1 + sin(2θ), such that du = 2 cos(2θ) dθ. We also have u = 2 for θ = π/4, and u = 0 for θ = −π/4, thus we get m(f) = 2/π · 1/2 ∫ 2 0 eu du = (e2 − 1)/π. Here is a confirmation in Sage: var("th"); f(th)=cos(2*th)*e^(1+sin(2*th)) show((2/pi)*integral(f(th), th,-pi/4,pi/4)) □ 6.B.55. Determine the average velocity m(v) of a solid in the time interval [1, 2], if its velocity is given by v(t) = t/√(1 + t2), t ∈ [1, 2]. You can omit the units. Solution. To solve the problem, it suffices to realize that the sought average velocity is the mean value of the function v on the interval [1, 2]. Hence m(v) = 1/(2 − 1) ∫ 2 1 t/√(1 + t2) dt = ∫ 5 2 1/(2√x) dx = √5 − √2, with 1 + t2 = x, t dt = dx/2. □ Further exercises on improper integrals and other applications related to definite integrals are presented in Section D. Next we describe a few tasks about lengths of curves, areas of regions, and volumes. 6.B.56. Arc length. Determine the length of the parametric curve defined by x(t) = et − t, y(t) = 4 et/2 with 0 ≤ t ≤ 1. Then present the solution via Sage. Solution. According to 6.2.18, the length of a curve α(t) = [x(t), y(t)] in R2 defined on the interval [a, b] is given by the integral sα(t) ≡ s(α(t)) = ∫ b a √((x′(t))2 + (y′(t))2) dt. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS are named for, where the value is 1. Riemann integrals of all such functions are zero, but the sum is not a Riemann integrable function.
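The misbehaviour of χQ can also be seen experimentally. The following small Sage cell (our own illustration) computes Riemann sums of the characteristic function of the rationals with rational and with irrational representatives; the two choices give different values no matter how fine the partition, so no limit of Riemann sums can exist:
# Riemann sums of the indicator of the rationals on [0, 1]
def riemann_sum(reps):
    # each representative contributes 1/len(reps) iff it is rational
    N = len(reps)
    return sum(1/N for p in reps if p in QQ)
N = 100
print(riemann_sum([k/N for k in [1..N]]))                  # rational reps: 1
print(riemann_sum([k/N + sqrt(2)/10^6 for k in [0..N-1]])) # irrational reps: 0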
This example illustrates the fundamental flaw of the Riemann integral, to which we return later. We present an example where the limit function f is integrable, all functions fn are continuous, but the value of the integral is not the limit of the integrals of fn. We modify the sequence of the functions x(1 − x2)n used above. They integrate to ∫ 1 0 x(1 − x2)n dx = 1/(2(n + 1)). Thus, we consider the functions fn(x) = 2(n + 1)x(1 − x2)n. These functions with n = m2, m = 1, . . . , 10 are shown in the next diagram. It is quite easy to verify that the values of these functions converge to zero for every x ∈ [0, 1] (for example the reader can check ln(fn(x)) → −∞). But for all n, ∫ 1 0 fn(x) dx = 1 ̸= 0. 6.3.3. Uniform convergence. A reason for the failure in all three previous examples is the fact that the speed of pointwise convergence of the values fn(x) → f(x) varies dramatically from point to point. Hence a natural idea is to confine the problem to cases where the convergence has roughly the same speed over the whole interval: Uniform convergence Definition. We say that the sequence of functions fn(x) converges uniformly on the interval [a, b] to the limit f(x), if for every positive number ε, there exists a natural number N ∈ N such that for all n ≥ N and all x ∈ [a, b] the inequality |fn(x) − f(x)| < ε holds. A series of functions converges uniformly on an interval, if the sequence of its partial sums converges uniformly. Although the choice of the number N depends on the chosen ε, it is independent of the point x ∈ [a, b]. This is the difference from pointwise convergence, where N depends on both ε and x. We visualise the definition graphically in this way: if we consider a zone created by a translation of the limit function f(x) to f(x) ± ε for arbitrarily small, but fixed positive 559 For the given curve we have x′(t) = et − 1 and y′(t) = 2 et/2, with (x′(t))2 + (y′(t))2 = (et + 1)2. Thus s = ∫ 1 0 √((et + 1)2) dt = ∫ 1 0 (et + 1) dt = [ et ]1 0 + [ t ]1 0 = e − 1 + 1 = e. In Sage you can confirm all these computations by the following cell: var('t'); x(t)=e^t-t; y(t)=4*e^(t/2) X(t)=diff(x(t), t); show(X(t)) Y(t)=diff(y(t), t); show(Y(t)) l(t)=(X(t)*X(t)+Y(t)*Y(t)).factor(); show(l(t)) s=integral(sqrt(l(t)), t, 0, 1).simplify_full() show(s) Using the command parametric_plot you can plot the given curve, which can be done just by adding in the previous block the syntax p=parametric_plot((x(t), y(t)), (t, 0, 1)) p.show(aspect_ratio=1/4) □ It is often useful in Sage to pass from classical (symbolic) integration to numerical integration. This can be useful in cases where we need an approximation, or in cases where a "closed form" of the integral does not exist, as for example for the Gaussian f(x) = e−x2, whose antiderivative is expressed via the so-called "error function" erf. This is an interesting situation which we will revisit in many different ways until the end of this chapter. Notice that in Sage an option to approximate a definite integral relies on the command n(integral(f, x, a, b)), or numerical_approx(integral(f, x, a, b)). For instance, for the Gaussian f(x) = e−x2 try the cell f(x)=e^(-x^2) show(integral(f, x, 0, 1)) print(n(integral(f, x, 0, 1))) Sage's output has the form (1/2)√π erf(1) and 0.746824132812427, respectively. Built-in methods that Sage provides for numerical integration will be discussed later.
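One such built-in method can already be previewed: the command numerical_integral avoids symbolic integration entirely and returns the approximate value together with an error estimate. A minimal sketch of our own:
f(x) = e^(-x^2)
val, err = numerical_integral(f, 0, 1)   # numerical quadrature, no erf needed
print(val, err)                          # 0.7468241328... and a tiny error bound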
We begin with the following task about the arc length of parametric curves. 6.B.57. Given a curve c : [a, b] → R2 in the parametric form c(t) = [x(t), y(t)], present a routine in Sage that will compute numerically the length of c on [a, b] and plot its graph. Then, test your routine on the following cases and confirm the routine's result by a formal computation: (1) c(t) = [cos3(t), sin3(t)], t ∈ [0, π/2]; (2) c(t) = [t, ln(cos(t))], t ∈ [0, π/4]; CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS ε, all of the functions fn(x) will fall into this zone, except for finitely many of them. The first and the last of the nasty examples above do not have this property; in the second example, the sequence of derivatives f′ n lacked it. 6.3.4. The three claims in the following theorem say that all three generally false properties discussed in 6.3.1 are true features for uniform convergence (but beware the subtleties when differentiating). Consequences of uniform convergence Theorem. (1) Let fn(x) be a sequence of functions continuous on a closed interval [a, b] and converging uniformly to the function f(x) on this interval. Then f(x) is also continuous on the interval [a, b]. (2) Let fn(x) be a sequence of Riemann integrable functions on a finite closed interval [a, b] which converge uniformly to the function f(x) on this interval. Then f(x) is Riemann integrable, and ∫ b a f(x) dx = ∫ b a ( limn→∞ fn(x) ) dx = limn→∞ ∫ b a fn(x) dx. (3) Let fn(x) be a sequence of functions differentiable on a closed interval [a, b] and assume fn(x0) → f(x0) at some point x0 ∈ [a, b]. Moreover, assume all derivatives gn(x) = f′ n(x) are continuous and converge uniformly to the function g(x) on the same interval. Then the function f(x) = f(x0) + ∫ x x0 g(t) dt is differentiable on the interval [a, b], the functions fn(x) converge to f(x) and f′(x) = g(x). In other words, d/dx f(x) = d/dx ( limn→∞ fn(x) ) = limn→∞ ( d/dx fn(x) ). Proof of the first claim. Fix an arbitrary point x0 ∈ [a, b] and let ε > 0 be given. It is required to show that |f(x) − f(x0)| < ε for all x close enough to x0. From the definition of uniform convergence, |fn(x) − f(x)| < ε for all x ∈ [a, b] and all sufficiently large n. Choose some n with this property and consider δ > 0 such that |fn(x) − fn(x0)| < ε for all x in a δ-neighbourhood of x0. That is possible because fn(x) are continuous for all n. Then |f(x) − f(x0)| ≤ |f(x) − fn(x)| + |fn(x) − fn(x0)| + |fn(x0) − f(x0)| < 3ε for all x in the δ-neighbourhood of x0. This is the desired inequality with the bound 3ε. □ 560 (3) c(t) = [t sin(t), t cos(t)], t ∈ [0, 4π]; (4) c(t) = [sin2(t), cos2(t)], t ∈ [0, π/2]. ⃝ 6.B.58. Compute the length s of a part of the so-called "tractrix", that is, the parametric curve α(t) = [f(t), g(t)] defined by f(t) = r cos t + r ln(tan(t/2)), g(t) = r sin(t), with t ∈ [π/2, a], where r > 0, and a ∈ (π/2, π). Next use Sage to plot α for the values r = 1, 2, . . . , 5 in one figure. ⃝ 6.B.59. Area of a half-ellipse. Consider the curve α(t) = [x(t), y(t)] = [4 cos(t), 3 sin(t)], with 0 ≤ t ≤ π. (a) Show that α(t) is half of a horizontal ellipse centered at the origin, with major axis of length 8 and minor axis of length 6. Use Sage to plot α and color the region bounded by the x-axis and the ellipse. (b) Compute the area of this region. Solution. (a) Obviously, α(t) is half of a horizontal ellipse centered at the origin of R2, since it satisfies the equation x2/a2 + y2/b2 = 1 with a = 4 and b = 3. Thus the major axis has length 2a = 8 and the minor axis has length 2b = 6, respectively.
To plot α we will use the command parametric_plot as above, that is, x(t)=4*cos(t); y(t)=3*sin(t) show(parametric_plot((x(t), y(t)), (t, 0, pi), fill=True, fillcolor="lightgrey", aspect_ratio=1, color="black", thickness=1.5)) which produces the following figure: (b) To find the area under the given parametric curve we first need to compute the integral ∫ π 0 y(t)x′(t) dt = ∫ π 0 3 sin(t)(4 cos(t))′ dt = −12 ∫ π 0 sin2(t) dt = −12/2 ∫ π 0 ( 1 − cos(2t) ) dt = −6 [ t − sin(2t)/2 ]π 0 = −6π. Thus the required area equals |∫ π 0 y(t)x′(t) dt| = 6π. To confirm this result in Sage, it suffices to add to the previous block the cell F(t)=y(t)*diff(x(t), t) show(abs(integral(F(t), t, 0, pi))) □ CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Remark. In fact, the arguments in the proof show a more general claim. Indeed, if the functions fn(x) converge uniformly to f(x) on [a, b], and the individual functions fn(x) have the limits (or one-sided limits) limx→x0 fn(x) = an, then the limit limx→x0 f(x) exists if and only if the limit limn→∞ an = a exists. Then they are equal, that is, a = limn→∞ ( limx→x0 fn(x) ) = limx→x0 ( limn→∞ fn(x) ). The reader should be able to modify the above proof for this situation. 6.3.5. Proof of the second claim. The proof of this part of the theorem is based upon a generalization of the properties of Cauchy sequences of numbers to uniform convergence of functions. In this way we can work with the existence of the limit of a sequence of integrals without needing to know the limit. Uniformly Cauchy sequences Definition. The sequence of functions fn(x) on the interval [a, b] is uniformly Cauchy, if for every (small) positive number ε, there exists a (large) natural number N such that for all x ∈ [a, b] and all m, n ≥ N, |fn(x) − fm(x)| < ε. Every uniformly convergent sequence of functions on the interval [a, b] is also uniformly Cauchy on the same interval. To see this, it suffices to notice the usual bound |fn(x) − fm(x)| ≤ |fn(x) − f(x)| + |f(x) − fm(x)| based on the triangle inequality. Before coming to the proof of 6.3.4(2), we mention the following: Proposition. Every uniformly Cauchy sequence of functions fn(x) on the interval [a, b] uniformly converges to some function f on this interval. Proof. Of course, the condition for a sequence of functions to be uniformly Cauchy implies that for all x ∈ [a, b], the sequence of values fn(x) is a Cauchy sequence of real (or complex) numbers. Hence the sequence of functions fn(x) converges pointwise to some function f(x). Choose N large enough so that |fn(x) − fm(x)| < ε for some small positive ε chosen beforehand and all m, n ≥ N, x ∈ [a, b]. Now choose one such n and fix it; then |fn(x) − f(x)| = limm→∞ |fn(x) − fm(x)| ≤ ε for all x ∈ [a, b]. Hence the sequence fn(x) converges to its limit uniformly. □ Proof of the second claim in 6.3.4. Recall that every uniformly convergent sequence of functions is also uniformly Cauchy and that the Riemann sums of all single terms fn(x) 561 6.B.60. Area. Calculate the area E between the x-axis and the graph of f(x) = x2 − 3x − 4 for 4 ≤ x ≤ 5. Next use Sage to illustrate the area in question. Solution. An easy computation shows that E = 17/6, and this corresponds to the grey region in the figure given here: To obtain the required illustration in Sage, we will use the polygon method, which is the core of our program presented below. Details on this method are postponed to Section D, see (b) in 6.D.43.
f(x)=x^2-3*x-4 p=plot(f(x), x, 3, 5, thickness=2, color="black") p+=line([(5, 0), (5, 6)], rgbcolor=(0.2,0.2,0.2), linestyle="--") p+=polygon([(4,0),(4,f(4))] + [(x, f(x)) for x in [4,4.1,..,5]]+[(5,0),(4,0)], rgbcolor=(0.8,0.8,0.8), aspect_ratio='automatic') p+=text("$f(x)=x^2-3x-4$", (4.5, 5), fontsize=14, color="black") show(p) show(integral(f(x), x, 4, 5)) □ 6.B.61. Compute the volume of a solid created by rotation of a bounded region, whose boundary is the curve x4 − 9x2 + y4 = 0, around the x-axis. You may omit the units. Solution. If [x, y] is a point on the curve x4 − 9x2 + y4 = 0, then clearly the curve also passes through the points [−x, y], [x, −y], [−x, −y]. Thus it is symmetric with respect to both axes x, y. For y = 0, we have x2(x − 3)(x + 3) = 0, i.e. the x axis is intersected by the boundary curve at the points [−3, 0], [0, 0], [3, 0]. In the first quadrant, it can then be expressed as the graph of the function f(x) = (9x2 − x4)1/4, x ∈ [0, 3]. The sought volume is thus twice (here we consider x > 0) the integral ∫ 3 0 πf2(x) dx = π ∫ 3 0 √(9x2 − x4) dx. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS of the sequence converge to ∫ b a fn(x) dx independently of the choice of the partition and the representatives. Hence, if |fn(x) − fm(x)| < ε for all x ∈ [a, b], then also |∫ b a fn(x) dx − ∫ b a fm(x) dx| ≤ ε|b − a|. Therefore the sequence of numbers ∫ b a fn(x) dx is Cauchy, and hence convergent. The Riemann sums of the limit function f(x) can be made arbitrarily close to those of fn(x) for large n, by the same argument as above. So f(x) is integrable. Moreover, |∫ b a fn(x) dx − ∫ b a f(x) dx| ≤ ε|b − a|, so the limit value is as expected. □ 6.3.6. Proof of the third claim. For the corresponding result about derivatives, extra care is needed regarding the assumptions: If the differentiable functions ˜fn(x) = fn(x) − fn(x0) are considered instead of fn(x), the derivatives do not change. Hence without loss of generality it can be assumed that all functions satisfy fn(x0) = 0. Then the first assumption of the theorem is satisfied automatically. For all x ∈ [a, b], we can write fn(x) = ∫ x x0 gn(t) dt. Because the functions gn converge uniformly to g on all of [a, b], the functions fn(x) converge to f(x) = ∫ x x0 g(t) dt. g is a uniform limit of continuous functions, thus g is again continuous. By 6.2.8, on the relation between the Riemann integral and the primitive function, the proof is finished. 6.3.7. Uniform convergence of series. For infinite series, we apply the previous three results to the sequences of partial sums. Thus the following corollary is an immediate consequence: 562 Using the substitution t = √(9 − x2) (x dx = −t dt), we get ∫ 3 0 √(9x2 − x4) dx = ∫ 3 0 x · √(9 − x2) dx = − ∫ 0 3 t2 dt = 9. Thus the final answer is 18π. □ 6.B.62. Torricelli's trumpet (1641). Let a part of a branch of the hyperbola xy = 1 for x ≥ a, where a > 0, rotate around the x axis. Show that the solid of revolution created in this manner has a finite volume V and simultaneously an infinite surface S. Solution. We know that V = π ∫ +∞ a (1/x)2 dx = π ∫ +∞ a 1/x2 dx = π ( limx→+∞ (−1/x) − (−1/a) ) = π/a and S = 2π ∫ +∞ a (1/x) · √(1 + (−1/x2)2) dx = 2π ∫ +∞ a √(x4 + 1)/x3 dx ≥ 2π ∫ +∞ a 1/x dx = 2π ( limx→+∞ ln x − ln a ) = +∞. The fact that the given solid (the so-called Torricelli's trumpet) cannot be painted with a finite amount of colour, but can be filled with a finite amount of fluid, is called "Torricelli's paradox". Realize however that a real coat of paint has a nonzero thickness, which the computation does not take into account. For example, if we painted the trumpet from the inside, a single drop of colour would undoubtedly "block" its infinitely long thin end. □
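Both computations are easy to reproduce in Sage; in the following cross-check of our own, the finite volume comes out symbolically, while the divergent surface integral is refused with an error (the exact message may vary between versions):
var("x, a"); assume(a > 0)
show(pi * integral(1/x^2, x, a, oo))      # the volume: pi/a
try:
    2*pi*integral(sqrt(x^4 + 1)/x^3, x, a, oo)
except ValueError as err:
    print(err)                            # reports a divergent integral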
A differential equation is an equation involving derivatives. Many physical and engineering problems naturally lead to differential equations, making it essential to learn how to formulate and solve them. For instance, in Chapter 3 we encountered differential equations in the context of population growth. Differential equations fall into two main categories: "ordinary differential equations" (ODEs) and "partial differential equations" (PDEs). An ODE involves derivatives with respect to a single independent variable, and in what follows we focus on first-order ODEs. Typically, such equations are written as y′ = F(x, y), where y = f(x), see 6.2.21. If F is of the form F(x, y) = f(x)g(y) for some functions f, g, the ODE is called separable. For g(y) ̸= 0 we can solve it by separating the variables: dy/g(y) = f(x) dx ⇒ ∫ dy/g(y) = ∫ f(x) dx. Let us explore some simple problems, with a more detailed discussion on differential equations postponed to Chapter 8. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Consequences for uniform convergence of series Theorem. Consider a sequence of functions fn(x) on the interval [a, b]. (1) If all the functions fn(x) are continuous on [a, b] and the series S(x) = ∑∞ n=1 fn(x) converges uniformly to the function S(x), then S(x) is continuous on [a, b]. (2) If all the functions fn(x) are Riemann integrable on [a, b] and the series S(x) = ∑∞ n=1 fn(x) uniformly converges to S(x) on [a, b], then S(x) is integrable on [a, b] and ∫ b a ( ∑∞ n=1 fn(x) ) dx = ∑∞ n=1 ∫ b a fn(x) dx. (3) If all the functions fn(x) are continuously differentiable on the interval [a, b], if the series S(x) = ∑∞ n=1 fn(x) converges for some x0 ∈ [a, b], and if the series T(x) = ∑∞ n=1 f′ n(x) converges uniformly on [a, b], then the series S(x) converges, S(x) is continuously differentiable on [a, b], and S′(x) = T(x). That is: d/dx ( ∑∞ n=1 fn(x) ) = ∑∞ n=1 d/dx fn(x). 6.3.8. Test of uniform convergence. A simple way to test that a series of functions converges uniformly is to use a comparison with the absolute convergence of a suitable series of numbers. This is often called the Weierstrass test. Suppose a sequence of functions fn(x) is given on an interval I = [a, b] satisfying |fn(x)| ≤ an ∈ R for suitable real constants an and for all x ∈ [a, b]. Let sk(x) = ∑ k n=1 fn(x) denote the partial sums. For k > m, |sk(x) − sm(x)| = |∑ k n=m+1 fn(x)| ≤ ∑ k n=m+1 |fn(x)| ≤ ∑ k n=m+1 an. 563 6.B.63. Differential equations. Compute the limit limx→+∞ f(x) if it is known that the function f(x) satisfies the differential equation f′(x) + α · ex (f(x))2 = 0, with initial condition f(0) = 1/α, where α > 0 is some constant. Next use Sage to solve the given differential equation for α = 3. Solution. Let us write y = f(x). Then, the given differential equation takes the form y′ = −α · ex y2, or equivalently dy/dx = −α · ex y2 ⇐⇒ −dy/y2 = α · ex dx. Hence, by integrating we obtain −∫ y−2 dy = α ∫ ex dx, that is, y−1 = α ex + C. Thus y = f(x) = 1/(α ex + C), and to find C one relies on the initial condition f(0) = 1/α, which gives C = 0. Thus f(x) = 1/(α ex) and we see that limx→+∞ f(x) = 0. Recall from 6.B.5 that a method to solve (first-order) ODEs in Sage relies on the command desolve.
Since α = 3 in this case, one can type var("x"); y = function("y")(x) a=3; show(desolve(diff(y, x) + a*e^x*y^2, y)) and Sage's output has the form 1/(3 y(x)) = C + ex, which is of course equivalent to our answer for α = 3. □ 6.B.64. Solve the ODE dy/dx = esin(x) cos(x) y1/2 with initial condition y(0) = 16. ⃝ 6.B.65. Solve the task in 6.B.64 using Sage. Moreover, plot the solution together with the "slope field" of the corresponding differential equation. Solution. Recall from 6.B.5 that in order to solve an initial value problem in Sage we use the desolve command and include the option ics = [x0, y0] (or simply [x0, y0]), where y0 = y(x0) represents the initial condition. Thus, in our case one can type var("x") y = function("y")(x) h=desolve(diff(y, x) - e^(sin(x))*cos(x)*sqrt(y), y, ics=[0, 16]) show(h) which prints out the expression 2√(y(x)) = esin(x) + 7, hence confirming our result in 6.B.64. Having a differential equation of the form y′ = F(x, y), plotting the slope field means that at each point (x, y) we plot a short line segment with slope F(x, y). Such details will be analyzed more carefully in Chapter 8. However, we may already discuss the built-in method that Sage provides for this procedure. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS If the series of the (nonnegative) constants ∑∞ n=1 an is convergent, then the sequence of its partial sums is a Cauchy sequence. But then the sequence of partial sums sn(x) is uniformly Cauchy. By 6.3.5 the following is verified: The Weierstrass test Theorem. Let fn(x) be a sequence of functions defined on the interval [a, b] with |fn(x)| ≤ an ∈ R. If the series of numbers ∑∞ n=1 an is convergent, then the series S(x) = ∑∞ n=1 fn(x) converges uniformly. 6.3.9. Consequences for power series. The Weierstrass test allows us to derive the properties of power series in a very straightforward way. Consider a series S(x) = ∑∞ n=0 an(x − x0)n centered at a point x0. We saw earlier in 5.4.9 that each power series converges on an entire interval (x0 − δ, x0 + δ). The radius of convergence δ ≥ 0 can be zero or ∞, see 5.4.13. Moreover, the series obtained by integrating or differentiating a power series term by term must have the same radius of convergence (by the very formula based on lim sup). In the proof of theorem 5.4.9, a comparison with a suitable geometric series is used to verify the convergence of power series. By the Weierstrass test, every power series S(x) converges uniformly on every compact (i.e. bounded closed) interval [a, b] contained in the interval (x0 − δ, x0 + δ). Thus the crucial result follows again: Differentiation and integration of power series Theorem. Every power series S(x) is continuous and is continuously differentiable at all points inside its interval of convergence. The function S(x) is Riemann integrable and can be differentiated or integrated term by term. Abel's theorem states that power series are continuous even at the boundary points of their domain when they converge there (including possible infinite limits). We do not prove it here. The pleasant properties of power series also reveal limitations on their use in practical modelling. In particular, it is not possible to approximate piece-wise continuous or nondifferentiable functions very well by using power series. Of course, it should be possible to find better sets of functions fn(x) than just the powers fn(x) = xn, up to constants. The best known examples are Fourier series and wavelets discussed in the next chapter.
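As a quick illustration of the last theorem, consider the geometric series ∑ xn = 1/(1 − x) on (−1, 1); differentiating its partial sums term by term indeed approaches the derivative 1/(1 − x)2 inside the interval of convergence. A sketch of our own in Sage (note the slower convergence near the endpoints):
var("x")
partial = sum(x^k for k in range(30))   # partial sum of the geometric series
dpartial = diff(partial, x)             # term-by-term derivative
exact = 1/(1 - x)^2
for pt in [0.1, 0.5, 0.9]:
    print(pt, n(dpartial(x=pt)), n(exact(x=pt)))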
564 The latter is based on the syntax plot_slope_field, but first we need to introduce the function F representing the differential equation. One can also plot the solution curves ("integral curves") of the slope field via the command streamline_plot(F(x, y), (x, a, b), (y, a, b)). For our task, we have dy/dx = esin(x) cos(x) y1/2, thus we need to consider the function F(x, y) = esin(x) cos(x) y1/2, and the implementation takes the form var("x, y") c1=implicit_plot(2*sqrt(y)==e^(sin(x))+7, (x, 0, 25), (y, 0, 25)) F(x, y)=e^(sin(x))*cos(x)*sqrt(y) p1=plot_slope_field(F(x, y),(x, 0, 25),(y, 0, 25)) p2=streamline_plot(F(x, y),(x, 0, 25),(y, 0, 25)) show(p1+c1, figsize=6); show(p2, figsize=6) This produces the following figures. On the left-hand side the solution of our initial value problem appears in blue, while the slope field appears in grey in the background. On the right-hand side we see the integral curves of the slope field. □ Numerical integration is a fundamental technique in computational mathematics, used to approximate the area under a curve, especially when an exact analytical solution is challenging or impossible to obtain. It complements the numerical differentiation that we met earlier in 6.1.12 and is invaluable for analyzing functions where direct solutions are impractical. Below, we will explore examples using the trapezoid rule and Simpson's rule, see 6.2.23. 6.B.66. Trapezoid rule. Compute the error of estimating the integral I = ∫ π 0 sin(x) dx by the trapezoid rule with n = 4 intervals. Demonstrate this in the following two ways: i) Compute the actual error given by the difference |I − Itrap|. ii) Apply the formula ((b − a)h2/12) |f′′|, where |f′′| represents an upper estimate of |f′′(x)| over the interval of integration [a, b], where f(x) = sin(x). Solution. The trapezoid rule with n = 4 intervals has the form Itrap = h/2 [ f(x0) + 2f(x1) + 2f(x2) + 2f(x3) + f(x4) ], CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.3.10. Laurent series. We return to the smooth function f(x) = e−1/x2 from paragraph 6.1.9 in the context of Taylor series expansions. It is not analytic at the origin, because all its derivatives are zero there and the function is strictly positive at all other points. At all points x0 ̸= 0 this function is given by its convergent Taylor series with radius of convergence r = |x0|. At the origin the Taylor series converges only at the one point 0. Substitute the expression −1/x2 for x in the power series for ex. The result is the series of functions S(x) = ∑∞ n=0 1/n! (−1)n x−2n = ∑ 0 n=−∞ ((−1)|n|/|n|!) x2n. The series converges at all points x ̸= 0. It gives a good idea about the behaviour near the exceptional point x = 0. Thus we consider the following series, similar to power series but more general: Laurent7 series A series of functions of the form S(x) = ∑∞ n=−∞ an(x − x0)n is called a Laurent series centered at x0. The series is convergent if both its parts with positive and negative exponents converge separately. The importance of Laurent series can be seen with rational functions. Consider such a function S(x) = f(x)/g(x) with coprime polynomials f and g and consider a root x0 of the polynomial g(x). If the multiplicity of this root is s, then after multiplication we obtain the function ˜S(x) = S(x)(x − x0)s, which is analytic on some neighbourhood of x0. Therefore we can write S(x) = a−s/(x − x0)s + · · · + a−1/(x − x0) + a0 + a1(x − x0) + . . . = ∑∞ n=−s an(x − x0)n.
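Such expansions are easy to produce in Sage, whose symbolic series command returns the finitely many negative powers as well; a small sketch of our own for a double pole at x0 = 0:
var("x")
S = 1/(x^2*(1 - x))    # double pole at 0, so the expansion starts at x^(-2)
show(S.series(x, 3))   # x^(-2) + x^(-1) + 1 + x + x^2 + Order(x^3)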
Consider the two parts of the Laurent series separately: S(x) = S− + S+ = ∑ −1 n=−∞ an(x − x0)n + ∑∞ n=0 an(x − x0)n. For the series S+, Theorem 5.4.9 implies that its radius of convergence R is given by R−1 = lim supn→∞ n√|an|. Apply the same idea to the series S− with 1/x substituted for x. It is then apparent that the series S−(x) converges for |x − x0| > r, where r = lim supn→∞ n√|a−n|. 7Pierre Alphonse Laurent (1813-1854) was a French engineer and military officer. He submitted his generalization of the Taylor series to the Grand Prix competition of the French Académie des Sciences. For formal reasons it was not considered. It was published much later, after the author's death. 565 where the step size h satisfies h = (b − a)/n, see 6.2.23. For our case we have f(x) = sin(x), a = 0, b = π and h equals π/4. Thus, x0 = 0, x1 = π/4, x2 = π/2, x3 = 3π/4, and x4 = π, with f(x0) = 0 = f(x4), f(x1) = √2/2 = f(x3) and f(x2) = 1. This gives Itrap = π/8 [ 2(1 + √2) ] ≈ 1.896, where we used Sage to compute this expression by the command N((pi/8)*2*(1+sqrt(2))) An illustration of the trapezoid rule is given here. Let us now derive the actual error of the estimation as suggested in (i). You can easily compute I = 2, thus |I − Itrap| ≈ |∫ π 0 sin(x) dx − 1.896| ≈ 0.104. This error should not differ dramatically from the result in (ii). Indeed, we see that f′′(x) = − sin(x) and over the interval [0, π] the maximum value of |f′′(x)| = |sin(x)| equals 1. This gives us ((b − a)3/(12n2)) maxx∈[0,π] |f′′(x)|, which for n = 4 equals π3/192 ≈ 0.162. Therefore, the actual error 0.104 is less than the theoretical upper bound 0.162, which is consistent with the fact that the theoretical error estimate provides an upper bound on the actual error. We also conclude that the trapezoid rule underestimates the actual integral value. This occurs because f(x) = sin(x) is concave (down) on the interval of integration, i.e., f′′(x) < 0 for all x ∈ (0, π). This characteristic is typical of the trapezoid rule's behaviour with concave functions, as illustrated in the figure above. □ 6.B.67. Create a routine in Sage to demonstrate the computation of the trapezoid rule with n intervals. Apply your program to estimate the integral L = ∫ 3π/2 π/2 cos(x) dx, with n = 4. Additionally, provide a formal proof, and calculate both the actual and theoretical errors of the estimation, as discussed in 6.B.66. ⃝ 6.B.68. Apply Simpson's rule to approximate the integral I = ∫ 1 0 1/(1 + x) dx with h = 1/4 and h = 1/2 respectively. CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Notice that the conclusions about convergence remain true even for complex values of x substituted into the expression. Laurent series can be considered as functions defined on a domain in the complex plane. We return to this in Chapter 9. The following theorem has already been proved. Convergence of the Laurent series on the annulus Theorem. The Laurent series S(x) centered at x0 converges for all x ∈ C satisfying r < |x − x0| < R and diverges for all x satisfying |x − x0| < r or |x − x0| > R, where r = lim supn→∞ n√|a−n|, R−1 = lim supn→∞ n√|an|. The Laurent series need not converge at any point, because possibly R < r. If we look for an example of the above case of rational functions expanded into Laurent series at some root of the denominator, then clearly r = 0 and therefore, as expected, the series converges in a punctured neighbourhood of this point x0. R is given by the distance to the closest other root of the denominator.
In the case of the first example, the function e−1/x2, we have r = 0 and R = ∞. 6.3.11. Integrals dependent on parameters. When integrating a function f(x, y1, . . . , yn) of one real variable x depending on further real parameters y1, . . . , yn with respect to the single variable x, the result is a function F(y1, . . . , yn) depending on all the parameters. Such a function F often occurs in practice. For instance, we can look for the volume or area of a body which depends on parameters, and determine the minimal and maximal values (with additional constraints as well). Often it is desirable to interchange the operations of differentiation and integration. That this can be done is proved below. We begin with an examination of continuous dependency on the parameters. For the sake of simplicity, we shall deal with functions f(x, y) depending on two variables, x ∈ [a, b], y ∈ [c, d]. We say f is continuous on I = [a, b] × [c, d] ⊂ R2 = C if for each z = (x, y) from the domain of f and ε > 0 there is some δ > 0 such that |f(w) − f(z)| < ε if w ∈ Oδ(z). (Notice the definition is the same as with univariate functions; we just use the distance in the plane.) The function f(x, y) is called uniformly continuous if for each ε > 0, there is δ > 0 such that for any two points z, w in I ⊂ R2 = C, |z − w| < δ implies |f(z) − f(w)| < ε. Exactly the same argument as with univariate functions, based on the fact that every open cover of a compact set in the complex plane contains a finite subcover, cf. Theorem 5.2.8(5), provides the following lemma (cf. the proof of Theorem 6.2.11). Lemma. Each continuous function f(x, y) on I = [a, b] × [c, d] is uniformly continuous. Now we are ready for the following important claim: 566 Then, compare the accuracy of these results with those obtained using the trapezoidal rule. Which method provides a better approximation? ⃝ 6.B.69. Create a routine in Sage to demonstrate the computation of Simpson's rule with n intervals, where n is an even integer. Next use your program to verify the result presented in 6.B.68 for h = 1/4, and illustrate Simpson's approximation. ⃝ 6.B.70. Calculate the theoretical error ((b − a)5/(180 n4)) |f(4)| of Simpson's approximation with h = 1/4 for the integral I given in 6.B.68, where |f(4)| represents the upper bound of |f(4)(x)| with x ∈ [0, 1]. Then, compare this theoretical error with the actual error computed earlier. ⃝ C. Sequences, series and limit processes We are already familiar with sequences of real numbers, power series and the concept of convergence. Our aim now is to readdress the discussion on series of functions in light of the methods of differential and integral calculus. First we will consider sequences whose terms are functions rather than real or complex numbers. Sequences of functions naturally arise in real analysis and are crucial in approximation theory. We will illustrate the consequences of the uniform convergence of sequences and series of functions through numerous examples. Additionally, we will discuss tasks related to the differentiation and integration of power series, along with other applications. Let us begin, however, with numerical series. Thanks to the integral criterion of convergence (see 6.2.15), one can address the question of convergence for a broader class of series. The next few tasks will highlight this fact. 6.C.1. Applications of the integral criterion of convergence. Decide whether the following sums converge or diverge and confirm your computations in Sage: T1 = ∑∞ n=2 1/(n ln n), T2 = ∑∞ n=1 1/n2.
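For the Sage confirmation requested in the task, one can evaluate the corresponding improper integrals and the second sum directly; a minimal sketch of our own (Sage refuses the divergent integral with an error, and the exact message may vary between versions):
var("x, k")
show(integral(1/x^2, x, 1, oo))        # = 1, so T2 converges
show(sum(1/k^2, k, 1, oo))             # indeed pi^2/6
try:
    integral(1/(x*ln(x)), x, 2, oo)
except ValueError as err:
    print(err)                         # divergent integral, so T1 diverges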
Solution. Observe that we cannot decide the convergence of either of these series by using the ratio or root test, since in both cases limn→∞ |an+1/an| = limn→∞ n√an = 1. However, using the integral criterion for convergence of series one obtains: ∫ ∞ 2 1/(x ln(x)) dx = ∫ ∞ ln 2 1/t dt = limδ→∞ [ln(t)] δ ln 2 = ∞, hence the series T1 diverges. On the other hand, ∫ ∞ 1 1/x2 dx = limδ→∞ [ −1/x ]δ 1 = 1, hence the series T2 converges. □ CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Theorem. Assume f(x, y) is a function defined for all x lying in a bounded interval [a, b] and all y in a bounded interval [c, d], continuous on I = [a, b] × [c, d]. Consider the (Riemann) integral F(y) = ∫ b a f(x, y) dx. Then the function F(y) is continuous on [c, d]. Proof. Fix a point y ∈ [c, d], small ε > 0, and choose a neighbourhood W of y such that for all ¯y ∈ W ⊂ [c, d] and all x ∈ [a, b] (remember f is uniformly continuous) |f(x, ¯y) − f(x, y)| < ε. The Riemann integral of continuous functions is evaluated by approximations of finite sums (equivalently: upper, lower, or Riemann sums with arbitrary representatives ξi, see paragraph 6.2.9). The goal is to establish that the Riemann sums for the integrals with parameters y and ¯y cannot differ much. In the following estimate for any partition with k intervals and representatives ξi, first use the standard properties of the absolute value and then exploit the choice of W: |∑ k−1 i=0 f(ξi, ¯y)(xi+1 − xi) − ∑ k−1 i=0 f(ξi, y)(xi+1 − xi)| ≤ ∑ k−1 i=0 |f(ξi, ¯y) − f(ξi, y)| (xi+1 − xi) < ε(b − a). It follows that the limit values for any sequences of the partitions and representatives F(y) and F(¯y) cannot differ by more than ε(b − a) either, so the function F is continuous. □ 6.3.12. Integrating twice. The fact that the integral F(y) = ∫ b a f(x, y) dx of a continuous function f : [a, b] × [c, d] → R in the plane is again a continuous function F : [c, d] → R allows us to repeat the integration and write (1) I = ∫ d c ∫ b a f(x, y) dx dy = ∫ d c ( ∫ b a f(x, y) dx ) dy. The next theorem is the simplest version of the claim known as the Fubini theorem. Fubini theorem Theorem. Consider a continuous function f : [a, b] × [c, d] → R in the plane R2. The multiple integration (1) is well defined and does not depend on the order of integration, i.e., I = ∫ d c ( ∫ b a f(x, y) dx ) dy = ∫ b a ( ∫ d c f(x, y) dy ) dx. 567 6.C.2. Using the integral criterion, decide on the convergence of the series ∑∞ n=1 1/((n + 1) ln2(n + 1)). Solution. The function f(x) = 1/((x + 1) ln2(x + 1)), x ∈ [1, +∞) is clearly positive and nonincreasing on its whole domain, thus the given series converges if and only if the integral ∫ +∞ 1 f(x) dx converges. By using the substitution y = ln(x + 1) (where dy = dx/(x + 1)), we can compute ∫ +∞ 1 1/((x + 1) ln2(x + 1)) dx = ∫ +∞ ln 2 1/y2 dy = 1/ln 2. Hence the series converges. □ Next we will explore examples related to the concept of "uniform convergence", which is a stronger notion than pointwise convergence. Recall that given a sequence of functions (fn) defined on an interval I = [a, b], we say that (fn) converges uniformly to a function f on [a, b], if for every ε > 0 there exists N ∈ N such that for all n ≥ N and x ∈ [a, b] we have |fn(x) − f(x)| < ε. In other words, we have f(x) − ε < fn(x) < f(x) + ε for all n ≥ N and x ∈ [a, b]; thus uniform convergence means that all of the functions fn(x) are close to f(x) at all points x ∈ [a, b], except for finitely many of them, see the figure given below and see also 6.3.3.
(Figure: the limit function f on [a, b] together with the ε-zone between y = f(x) − ε and y = f(x) + ε; for n large enough, the graphs of the fn lie inside this zone.) Notice that we can equivalently express the condition of uniform convergence as (see 6.D.46) limn→∞ supx∈[a,b] |fn(x) − f(x)| = 0. It is also important to note that uniform convergence implies pointwise convergence (and the uniform limit function is the same as the pointwise limit function), but the converse is not true. Before discussing tasks on uniform convergence, we will first emphasize some natural tasks related to the pointwise CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Proof. We know f is uniformly continuous on the product of intervals [a, b] × [c, d] in the plane. Thus, for each ε > 0 there is δ > 0 such that |f(x1, y1) − f(x2, y2)| < ε whenever |x1 − x2| < δ and |y1 − y2| < δ. We know both Riemann integrals in (1) exist, thus we may fix a sequence Ξk of partitions of the interval [a, b] into k subintervals [xi−1, xi] of equal size 1/k and with representatives ξi,k, i = 1, . . . , k, and similarly for the interval [c, d] with the subintervals [yi−1, yi] and representatives ηj,k, j = 1, . . . , k. Then we may write I = ∫ d c limk→∞ Sk(y) dy, Sk(y) = ∑ k i=1 f(ξi,k, y) (1/k)(b − a). If 1/k < δ, then |∫ b a f(x, y) dx − ∑ k i=1 f(ξi,k, y)(1/k)(b − a)| ≤ ∑ k i=1 |∫ xi xi−1 f(x, y) dx − f(ξi,k, y)(1/k)(b − a)| ≤ ∑ k i=1 ∫ xi xi−1 |f(x, y) − f(ξi,k, y)| dx ≤ ε(b − a). Thus, the convergence of limk→∞ Sk(y) = F(y) = ∫ b a f(x, y) dx is uniform on [c, d]. In particular, we may swap the integral and the limit to obtain I = limk→∞ ∫ d c ( ∑ k i=1 f(ξi,k, y)(1/k)(b − a) ) dy = limk→∞ ∑ k i=1 ( ∫ d c f(ξi,k, y) dy ) (1/k)(b − a) = limk→∞ ( limℓ→∞ ∑ k i=1 ∑ ℓ j=1 f(ξi,k, ηj,ℓ) (1/k)(1/ℓ)(b − a)(d − c) ). This double limit can be clearly rewritten as limk→∞ ( ∑ k i=1 ∑ ℓk j=1 f(ξi,k, ηj,ℓk) (1/k)(1/ℓk)(b − a)(d − c) ) for a suitable sequence of indices ℓk → ∞. Finally, take any sequence of indices kn → ∞ and ℓn → ∞, divide the intervals (a, b) and (c, d) into kn and ℓn equal parts, and choose some representatives (xi,kn, yj,ℓn) in all the corresponding small rectangles. Then the absolute value of the difference |∑ kn i=1 ∑ ℓn j=1 f(ξi,kn, ηj,ℓn) − ∑ kn i=1 ∑ ℓn j=1 f(xi,kn, yj,ℓn)| 568 convergence of sequences of functions. See also the discussion in 6.3.1, where similar problems are addressed. 6.C.3. Provide an example for each of the following scenarios: (a) A sequence of continuous functions (fn) that converges pointwise to a function f which is discontinuous; (b) A sequence of differentiable functions (fn) that converges pointwise to a function f which is non-differentiable; (c) A sequence of integrable functions (fn) that converges pointwise to a function f which is non-integrable. ⃝ 6.C.4. Inspect which of the sequences (fn)n∈Z+ given below converge uniformly on the given domains: (1) fn(x) = sin(nx)/n, with x ∈ R; (2) fn(x) = nx/(n + x), with x ∈ [0, 1]; (3) fn(x) = e^(x4/4n2), with x ∈ R; (4) fn(x) = e^(x/n), with x ∈ [0, 1]; (5) fn(x) = xn, with x ∈ [0, 1]. Solution. (1) In the first case it is easy to see that fn → 0 pointwise on R, that is, the limit function is the zero one, f(x) = 0. The sequence (fn) also converges uniformly on R. This is essentially derived from the inequality |fn(x)| = |sin(nx)/n| = |sin(nx)|/n ≤ 1/n. Thus, given some ε > 0, we will have |fn(x) − 0| < ε for all x ∈ R if n > 1/ε. Since 1/ε does not depend on x, we conclude. (2) We may express fn as fn(x) = x/(1 + x/n), x ∈ [0, 1], and it is obvious that (fn) is pointwise convergent on [0, 1] with limn→∞ fn(x) = f(x) = x, for all x ∈ [0, 1].
Next we see that |fn(x) − f(x)| = x2 n + x ≤ 1 n , x ∈ [0, 1] and using this we deduce that supx∈[0,1] |fn(x) − f(x)| ≤ 1 n . Thus limn→∞ supx∈[0,1] |fn(x) − f(x)| = 0 and (fn) is uniformly convergent in the interval [0, 1]. (3) The sequence (fn)n∈Z+ converges pointwise to the constant function f(x) = 1 on R, since lim n→∞ e x4 4n2 = e0 = 1, x ∈ R . However, we see that fn (√ 2n ) = e > 2, for all n ∈ Z+, which does not allow uniform convergence over R (be aware that in the definition of uniform convergence, it suffices to consider ε ∈ (0, 1)). (4) Obviously, limn→∞ e x n = 1, i.e., the pointwise limit function is the constant function f(x) = 1. Next we see that sup x∈[0,1] |fn(x) − f(x)| = sup x∈[0,1] ( e x n −1 ) = e 1 n −1 , CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS is at most ε(b−a)(c−d) whenever 1 kn < δ and 1 ℓn < δ. Thus, we see that the value of the iterated integral can be expressed as the limit of double sums, where the representatives of the individual rectangles are chosen arbitrarily.8 This expression does not depend on the order of our integration and thus the two posibilities for the iterated integrals must coincide. □ 6.3.13. Differentiation in the integrals. We are ready to discuss the differentiation of integrals with respect to parameters. The following result is extremely useful. For instance we shall use it in the next chapter when examining integral trans- forms. Differentiation with respect to parameters Theorem. Consider a continuous function f(x, y) defined for all x from a finite interval [a, b] and for all y in another finite interval [c, d], a point z ∈ [c, d], and the integral F(y) = ∫ b a f(x, y) dx. If there exists the continuous derivative d dy f on a neighbourhood of the point z, then d dy F(z) exists as well and d dy F(z) = ∫ b a d dy f(x, y)|y=z dx. Proof. By the assumed continuity of all functions and the already known continuous dependence of integrals on parameters, some knowledge about univariate antiderivatives can be used. The result is then a simple consequence of the Fubini theorem. Denote G(y) = ∫ b a d dy f(x, y) dx, F(y) = ∫ b a f(x, y) dx and compute, invoking Fubini theorem, the antiderivative H(y) = ∫ y y0 G(z) dz = ∫ y y0 (∫ b a d dz f(x, z) dx ) dz = ∫ b a (∫ y y0 d dz f(x, z) dz ) dx = ∫ b a (f(x, y) − f(x, y0)) dx = F(y) − F(y0). Finally, differentiating with respect to y yields G(y) = d dy H(y) = dF dy (y), as desired. □ 8ACtually, this is the way how we shall define the Riemann integral in more variables in Chapter 8. 569 and hence lim n→∞ sup x∈[0,1] |fn(x) − f(x)| = lim n→∞ ( e 1 n −1 ) = 1 − 1 = 0 . Thus (fn) is uniformly convergent over [0, 1]. (5) Let us first use Sage to sketch some members of (fn). 
We use the following cell: q=plot(x^5, x, 0, 1, color="steelblue") q+= text (r"$f_5(x)=x^5$ ",(0.84, 0.25), color="steelblue", fontsize ="14") #q+=plot(x^5, x, 0, pi, color="black") q+=plot(x^4, x, 0, 1, color="darkgreen") q+= text (r"$x^4$ ",(0.8, 0.47), color="darkgreen", fontsize ="14") q+=plot(x^3, x, 0, 1, color="magenta") q+= text (r"$x^3$ ",(0.8, 0.57), color="magenta", fontsize ="14") q+=plot(x^2, x, 0, 1, color="orange") q+= text (r"$x^2$ ",(0.8, 0.7), color="orange", fontsize ="14") q+=plot(x, x, 0, 1, color="darkred") q+= text (r"$f_1(x)=x$ ",(0.74, 0.85), color="darkred", fontsize ="14") q+=line([(1, 0), (1, 1^5)], linestyle="--") q.show(ymax=1) Executing this block we obtain the graph of the first 5 members of the sequence (fn), which we present here: An alternative but less informative way to plot some terms of (fn) goes as follows: var("n"); f(n, x)=x^n p=plot([f(n, x) for n in [1, 2..5]], (x, 0, 1)) show(p) As we have already seen in Chapter 5 (see 5.B.3), the limit function of (fn(x) = xn ) on [0, 1] is given by limn→∞(xn ) = f(x) = { 0 , if x ∈ [0, 1) , 1 , if x = 1 . Thus |fn(x) − f(x)| = { 0 , for x = 1 , |xn | , for x ∈ [0, 1) , CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.3.14. The Riemann–Stieltjes integral. To end this chapter, we mention briefly some other concepts of integration. Mostly we confine ourselves to remarks and comments. Readers interested in a thorough explanation can find another source. First, a modification of the Riemann integral, which is useful when discussing probability and statistics. In the discussion of integration, we summed infinitely many linearized (infinitely) small increments of the area given by a function f(x). We omitted the possibility that for different values of x we could take the increments with different weights. This can be arranged at the infinitesimal level by exchanging the differential dx for φ(x)dx for some suitable function φ, or we could even take some single points as adding the weight independentaly of the size of our intervals in the partitions. Imagine that at some point x0, the increment of the integrated quantity is given by αf(x0) independently of the size of the increment of x. For example, we may observe the probability that the amount of alcohol per mille in the blood of a driver at a test will be at most x. We might like to integrate over the possible values in the interval [0, x]. With quite a large probability the value is 0. Thus for any integral sum, the segment containing zero contributes by a constant nonzero contribution, independent of the norm of the partition. We cannot simulate such behaviour by multiplying the differential dx by some real function. Instead we generalize the Riemann integral in the following way: Riemann–Stieltjes integral Choose a real (usually) nondecreasing function g on a finite interval [a, b]. For every partition Ξ with representative ξi and points of the partition a = x0, x1, . . . , xn = b the Riemann–Stieltjes integral sum of function f(x) is SΞ = n∑ i=1 f(ξi) ( g(xi) − g(xi−1) ) . The Riemann–Stieltjes integral I = ∫ b a f(x)dg(x) exists and its value is I, if for every real ε > 0 there exists a norm of the partition δ > 0 such that for all partitions Ξ with norm smaller than δ, |SΞ − I| < ε. For example, choose g(x) on interval [0, 1] as a piecewise constant function with finitely many discontinuities c1, . . . , ck and “jumps” αi = lim x→ci+ g(x) − lim x→ci− g(x), 570 which implies that sup x∈[0,1] |fn(x) − f(x)| = sup{1, 0} = 1 . 
Therefore, limn→∞ supx∈[0,1] |fn(x) − f(x)| = 1 ̸= 0, and (xn )n∈Z+ is not uniformly convergent on [0, 1]. □ 6.C.5. Demonstrate that the sequence (fn)n∈Z+ of nonnegative functions defined by the given graph, pointwise converges to 0, yet it does not converge uniformly on [0, +∞). ⃝ 6.C.6. Show that the sequence (fn)n∈Z+ of functions defined by fn : R → R , fn(x) = √ x2 + 1 n2 , converges uniformly on R. ⃝ 6.C.7. (a) Provide an example of sequences (fn), (gn) that converge uniformly on a set I, but the product sequence (fn · gn) does not converge uniformly on I. (b) Suppose that the sequences (fn), (gn) converge uniformly to f, g, respectively, on a set I. Assume also that there exists some M > 0 such that |f(x)| < M and |g(x)| < M for all x ∈ I. Show that the sequence (fn · gn) uniformly converges to f · g on I. ⃝ 6.C.8. Inspect which of the sequences (fn)n∈Z+ given below, converge uniformly on the given domain. ⃝ Similarly with uniform convergence of sequences of functions, we can explore series of functions which converge uniformly on an interval. This is the case when the sequence of the partial sums converges uniformly. Uniform convergence of sereis has many applications. For instance, in Chapter 7 we will discuss the uniform convergence of trigonometric series, specifically focusing on Fourier series. 6.C.9. Decide whether the series ∞∑ n=1 √ x · n n4 + x2 converges uniformly on the interval (0, +∞). Solution. Using the denotation CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS then the Riemann–Stieltjes integral exists for every continuous f(x) and equals I = ∫ 1 0 f(x)dg(x) = k∑ i=1 αif(ci). By the same technique as used for the Riemann integral, we define upper and lower sums and upper and lower Riemann–Stieltjes integral. For bounded functions they always exist, and their values coincide if and only if the Riemann–Stieltjes integral in the above sense exists. We have already encountered problems with the Riemann integration of functions that are “too jumpy”. For a function g(x) on a finite interval [a, b] define its variation by varb a g = sup Ξ n∑ i=1 |g(xi) − g(xi−1)|, where the supremum is taken over all partitions Ξ of the interval [a, b]. If the supremum is infinite, we say that g(x) has unbounded variation on [a, b]. Otherwise we say that g is a function with bounded variation on [a, b]. A function is of bounded variation if and only if it can be written as the difference of two monotonic functions. We shall not provide a proof here. As in the discussion of the Riemann integral, we derive the following theorem. Properties of the Riemann–Stieltjes integral Theorem. Let f(x) and g(x) be real functions on a finite interval [a, b]. (1) Suppose g(x) is non-decreasing and continuously differentiable. Then the Riemann integral on the left hand side and the Riemann–Stieltjes integral on the right hand side either both exist or do not exist. In the former case, their values are equal ∫ b a f(x)g′ (x)dx = ∫ b a f(x)dg(x) (2) If f(x) is continuous and g(x) is a function with finite variation, then the integral ∫ b a f(x)dg(x) exists. We invite the reader to add the details of its proof. The main tools are the mean theorem, the uniform continuity of continuous functions on closed bounded intervals. The variation of g over the interval [a, b] plays the role of the length of the interval in the earlier proofs dealing with Riemann integra- tion. 6.3.15. Kurzweil-Henstock integral. 
The last topic in this chapter is a modification of the Riemann integral, which fixes the unfortunate behaviour at the third point in the paragraph 6.3.1. That is, the limits of the non-decreasing sequences of integrable functions are again integrable. Then we can interchange the order 571 fn(x) = √ x·n n4+x2 , x > 0, n ∈ N, we have f′ n(x) = n ( n4 −3x2 ) 2 √ x(n4+x2)2 , x > 0, n ∈ N. From now on, let n ∈ N be arbitrary. The inequalities f′ n(x) > 0 for x ∈ ( 0, n2 / √ 3 ) and f′ n(x) < 0 for x ∈ ( n2 / √ 3, +∞ ) imply that the maximum of function fn is attained exactly at the point x = n2 / √ 3. Since fn ( n2 √ 3 ) = 4√ 27 4n2 a ∞∑ n=1 4√ 27 4n2 = 4√ 27 4 ∞∑ n=1 1 n2 < +∞, according to the Weierstrass test, the series ∑∞ n=1 fn(x) converges uniformly on the interval (0, +∞). □ 6.C.10. For x ∈ [−1, 1], add ∞∑ n=1 (−1)n+1 n(n+1) xn+1 . Solution. First notice that by the symbol for an indefinite integral, we’ll denote one specific primitive function (while preserving the variable), which should be understood as a so called function of the upper limit, while the lower limit is zero. Using the theorem about integration of a power series for x ∈ (−1, 1), we’ll obtain ∑∞ n=1 (−1)n+1 n(n+1) xn+1 = ∑∞ n=1 ( (−1)n+1 n ∫ xn dx ) = ∫ ∞∑ n=1 ( (−1)n+1 n xn ) dx = ∫ ∑∞ n=1 ( (−1)n+1 ∫ xn−1 dx ) dx = ∫ ( ∫ ∑∞ n=1(−x)n−1 dx ) dx = ∫ ( ∫ 1 − x + x2 − x3 + · · · dx ) dx = ∫ ( ∫ 1 1+x dx ) dx = ∫ ln (1 + x) + C1 dx . Since ∫ ∞∑ n=1 ( (−1)n+1 n xn ) dx = ∫ ln (1 + x) + C1 dx, we know from the continuity of the given functions that ∞∑ n=1 (−1)n+1 n xn = ln (1 + x) + C1, x ∈ (−1, 1). The choice x = 0 then yields 0 = ln 1 + C1, i.e. C1 = 0. Next, ∫ ln (1 + x) dx = per partes = u = ln (1 + x) u′ = 1 1+x v′ = 1 v = x = x ln (1 + x) − ∫ x 1+x dx = x ln (1 + x) − ∫ 1 − 1 1+x dx = x ln (1 + x) − x + ln (1 + x) + C2 = (x + 1) ln (x + 1) − x + C2. Since the given series converges at the point x = 0 with a sum of 0, analogously as for C1 , 0 = 1 · ln 1 − 0 + C2 implies that C2 = 0. In total, we have for x ∈ (−1, 1): CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS of the limit process and integration in these cases, just as with uniform convergence. Notice what is the essence of the problem. Intuitively we assume that very small sets must have zero size. Thus the changes of values of the functions on such sets should not change the integral. Moreover, a countable union of such sets which are “negligible for the purpose of integration” should also have zero size. We would expect for example that the set of rational numbers inside a finite interval would have this property, hence its characteristic function should be integrable and the value of such an integral should be zero. We say that a set A ⊂ R has zero measure, if for every ε > 0 there is a covering of the set A by a countable system of open intervals Ji, i = 1, 2, . . . such that ∞∑ i=1 m(Ji) < ε. m(Ji) means the length of the interval Ji. In the sequel, the statement “function f has the given property on a set B almost everywhere” means that f has this property at all points except for a subset A ⊂ B of zero measure. For example, the characteristic function of rational numbers is zero almost everywhere. A piece-wise continuous function is continuous almost everywhere. Now we modify the definition of the Riemann integral so that restrictions on the Riemann sums are permitted, eliminating the effect of the values of the integrated function on sets of measure zero. 
This is achieved by a finer control of the size of the segments in the partition in the vicinity of problematic points. A positive real function δ on a finite interval [a, b] is called a gauge. A partition Ξ of interval [a, b] with representatives ξi is δ–gauged, if ξi − δ(ξi) < xi−1 ≤ ξi ≤ xi < ξi + δ(ξi) for all i. The norm δ of the partition used in the Riemann integration is a special case of constant gauges δ(x) = δ > 0. In order to restrict the Riemann sums to a gauged partition with representatives in the definition of the integral, it is necessary to know that for every gauge δ, a δ–gauged partition with representatives exists. Otherwise the condition in the definition could be satisfied in a vacuous way. This statement is called Cousin’s lemma. It is proved by exploiting the standard properties of suprema: For a given gauge δ on [a, b], denote by M the set of all points x ∈ [a, b] such that a δ–gauged partition with representatives can be found on [a, x]. M is nonempty and bounded, thus it has a supremum s. If s ̸= b, then there is a gauged partition with representatives at s, where s is in the interior of the last segment. This leads to a contradiction. Thus the supremum is b, but then the gauge δ(b) > 0 and thus b itself belongs to the set M. 572 ∞∑ n=1 (−1)n+1 n(n+1) xn+1 = (x + 1) ln (x + 1) − x. Moreover, according to Abel’s theorem (see 6.3.9), the sum of the given series equals the (potentially improper) limit of the function (x + 1) ln (x + 1) − x at points −1 and 1. In our case, both limits are proper (at point 1, the function is even continuous and the value of the limit at point 1 then equals the value of the function 2 ln 2 − 1.) For computing the value of the limit at point −1, we’ll use L’Hospital’s rule: lim x→−1+ (x + 1) ln (x + 1) − x = lim t→0+ t ln t + 1 = lim t→0+ ln t 1 t + 1 = lim t→0+ 1 t − 1 t2 + 1 = lim t→0+ −t + 1 = 1. Of course, the convergence of the series at points ±1 can be verified directly. It’s even possible to directly deduce that ∞∑ n=1 1 n(n+1) = 1 (by writing out 1 n(n+1) = 1 n − 1 n+1 . □ 6.C.11. Sum of a series. Using theorem 6.3.5 “about the interchange of a limit and an integral of a sequence of uniformly convergent functions”, we’ll now add the number series ∞∑ n=1 1 n2n . We’ll use the fact that ∞∫ 2 dx xn+1 = 1 n2n . Solution. On interval (2, ∞), the series of functions∑∞ n=1 1 xn+1 converges uniformly. That is implied for example by the Weierstrass test: each of the function 1 xn+1 is decreasing on interval (2, ∞), thus their values are at most 1 2n+1 ; the series ∑∞ n=1 1 2n+1 is convergent though (it’s a geometric series with quotient 1 2 ). Hence according to the Weierstrass test, the series of functions ∑∞ n=1 1 xn+1 converges uniformly. We can even write the resulting function explicitly. Its value at any x ∈ (2, ∞) is the value of the geometric series with quotient 1 x , so if we denote the limit by f(x), we have f(x) = ∞∑ n=1 1 xn+1 = 1 x2 1 1 − 1 x = 1 x(x − 1) . By using (6.3.7) (3), we get ∞∑ n=1 1 n2n = ∞∑ n=1 ∫ ∞ 2 dx xn+1 = ∫ ∞ 2 ( ∞∑ n=1 1 xn+1 ) dx = ∫ ∞ 2 1 x(x − 1) dx = lim δ→∞ ∫ δ 2 1 x − 1 − 1 x dx = lim δ→∞ [(ln(δ − 1) − ln(δ) − ln(1) + ln 2] = lim δ→∞ [ ln ( δ − 1 δ )] + ln(2) = ln ( lim δ→∞ δ − 1 δ ) + ln 2 = ln 2 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Now we can state the following generalization of the Riemann integral. Call it the K-integral9 . Kurzweil-Henstock integral Definition. 
A function f defined on a finite interval [a, b] has a Kurzweil-Henstock integral I = ∫ b a f(x) dx, if for every ε > 0, there exists a gauge δ such that for every δ–gauged partition with representatives Ξ, the inequality |SΞ −I| < ε is true for the corresponding Riemann sum SΞ. 6.3.16. Basic properties. When defining the K-integral, only the set of all partitions is bounded, for which the Riemann sums are taken into account. Hence if the function is Riemann integrable, then it is K-integrable, and the two integrals are equal. For the same reason, the argumentation in Theorem 6.2.8 about simple properties of the Riemann integral applies. This verifies that the K-integral behaves in the same way. In particular, a linear combination of integrable function cf(x) + dg(x) is again integrable and its integral is c ∫ b a f(x)dx + d ∫ b a g(x)dx etc. To prove this, it suffices only to think through some modifications when discussing the refined partitions, which moreover should be δ–gauged. The Kurzweil integral behaves as anticipated with respect to the sets of zero measure: Theorem. Consider a function f, defined on the interval [a, b]. If f is zero almost every, then the K-integral ∫ b a f(x)d(x) exists and is zero. Proof. The proof is an illustration of the idea that the influence of values on a “small” set can be removed by a suitable choice of gauge. Denote by M the corresponding set of zero measure, outside of which f(x) = 0 and write Mk ⊂ [a, b], k = 1, . . . , for the subset of the points for which k − 1 < |f(x)| ≤ k. Because all the sets Mk have zero measure, each of them can be covered by a countable system of pairwise disjoint open intervals Jk,i such that the sum of their lengths is arbitrarily small. Define the gauge δ(x) for x ∈ Jk,i so that the intervals (x−δ(x), x+δ(x)) are still contained in Jk,i. Outside of M, δ is defined arbitrarily. 9There are many equivalent definitions and thus also names for this K-integral. A complicated approach was coined by Arnaud Denjoy around 1912. Thus the space of real functions integrable on an interval [a, b] in this sense is often called Denjoy space. Other people involved were Nikolai Luzin and Oskar Perron. We can find the integral under their names. The simple and beautiful definition was introduced by Jaroslav Kurzweil, a Czech mathematician still living in 1957. Much of the theory was developed by Ralph Henstock (1923-2007), an English mathematician. 573 □ 6.C.12. Consider function f(x) = ∑∞ n=1 ne−nx . Determine ∫ ln 3 ln 2 f(x) dx. Solution. Similarly as in the previous case, the Weierstrass test for uniform convergence implies that the series of functions ∑∞ n=1 ne−nx converges uniformly on interval (ln 2, ln 3), since each of the functions ne−nx is lesser than n 2n on (ln 2, ln 3) and the series ∑∞ n=1 n 2n converges, which can be seen for example from the ratio test for convergence of series: lim n→∞ an+1 an = lim n→∞ (n + 1)2−(n+1 n2n = lim n→∞ 1 2 n + 1 n = 1 2 . In total, according to (6.3.7) (3), we have ∫ ln 3 ln 2 f(x) dx = ∫ ln 3 ln 2 ∞∑ n=1 ne−nx = ∞∑ n=1 ∫ ln 3 ln 2 ne−nx dx = ∞∑ n=1 [−e−nx ]ln 3 ln 2 = ∞∑ n=1 ( 1 2n − 1 3n ) = 1 − 1 2 = 1 2 . □ 6.C.13. Determine the following limit (give reasons for the procedure of computation): lim n→∞ ∫ ∞ 0 cos (x n ) ( 1 + x n )n dx. Solution. First we’ll determine lim n→∞ cos( x n )( 1+ x n )n . The sequence of these functions converges pointwise and we have lim n→∞ cos(x n ) ( 1 + x n )n = 1 lim n→∞ ( 1 + x n )n (??) = 1 ex It can be shown that the given sequence converges uniformly. 
Then according to (6.3.5) , lim n→∞ ∫ ∞ 0 cos (x n ) ( 1 + x n )n dx = ∫ ∞ 0 [ lim n→∞ cos (x n ) ( 1 + x n )n ] dx = ∫ ∞ 0 1 ex = 1 We leave the verification of uniform convergence to the reader (we only point out that the discussion is more complicated than in the previous cases). □ 6.C.14. Find the analytic function whose Taylor series is x − 1 3 x3 + 1 5 x5 − 1 7 x7 + · · · , for x ∈ [−1, 1]. ⃝ 6.C.15. From the knowledge of the sum of a geometric series, derive the Taylor series of function y = 1 5+2x CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS For any δ–gauged partition Ξ of the interval [a, b] the bound on the corresponding Riemann sum is given as n−1∑ j=0 f(ξn)(xi+1 − xi) = n−1∑ j=0 ξi∈M f(ξn)(xi+1 − xi) ≤ ∞∑ k=1 n−1∑ j=0 ξi∈Mk f(ξn) (xi+1 − xi) ≤ ∞∑ k=1 k ( n−1∑ j=0 ξi∈Mk m(Jk,j) ) . To guarantee that this bound is smaller than a given ε, it suffices to choose the covering by the intervals Jk,j so that ∞∑ j=1 m(Jk,j) =≤ ε k2k . Since ∑∞ k=1 2−k = 1, the result follows. □ Corollary. If the values of f(x) are changed on a set of zero measure, the K-integrability of f(x) is not changed, and neither is the value of its integral. 6.3.17. The fundamental theorems of Calculus. We conclude this chapter with a few remarks on the properties of integration procedures from the point of view of expectations and reality.10 In 6.2.9, we deal with the relation between the derivatives f(t) and the antiderivatives (integrals) F(t). Since f(t) is assumed continuous, two essential claims collapse into one, resulting in F(t) = ∫ t t0 f(x) dx up to the choice of the value of F(t0). In particular, ∫ t t0 F′ (dx) dx = F(t) − f(t0) for all choices of F. More generally, this can be split into two claims which hold for the K-integral under much milder conditions: 10A very good and elementary exposition of the K-integral can be found in the short paper Return to the Riemann Integral. The American Mathematical Monthly, Vol. 103, No. 8 (1996), 625-632. by Robert G. Bartle. 574 centered at the origin. Then determine its radius of convergence. ⃝ 6.C.16. Expand the function y = 1 3−2x , x ∈ ( −3 2 , 3 2 ) to a Taylor series centered at the origin. ⃝ 6.C.17. Expand the function cos2 (x) to a power series at the point π/4 and determine for which x ∈ R this series converges. ⃝ 6.C.18. Express the function y = ex defined on the whole real axis as an infinite polynomial with terms of the form an(x − 1)n and express the function y = 2x defined on R as an infinite polynomial with terms anxn . ⃝ 6.C.19. Find a function f such that for x ∈ R, the sequence of functions fn(x) = n2 x3 n2x2+1 , n ∈ N to it. Is this convergence uniform on R? ⃝ 6.C.20. Does the series ∞∑ n=1 n x n4+x2 , kde x ∈ R, converge uniformly on the whole real axis? ⃝ 6.C.21. By using differentiation, obtain the Taylor expansion of function y = cos x from the Taylor expansion of function y = sin x centered at the origin. ⃝ 6.C.22. Approximate (a) cosine of ten degress with a precision of at least 10−5 ; (b) the definite integral ∫ 1/2 0 dx x4+1 with a precision of at least 10−3 . ⃝ 6.C.23. Determine the power expansion centered at x0 = 0 of function f(x) = x∫ 0 et2 dt, x ∈ R. ⃝ 6.C.24. Find the analytic function whose Taylor series is x − 1 3 x3 + 1 5 x5 − 1 7 x7 + · · · , for x ∈ [−1, 1]. ⃝ 6.C.25. From the knowledge of the sum of a geometric series, derive the Taylor series of function y = 1 5+2x centered at the origin. Then determine its radius of convergence. ⃝ CHAPTER 6. 
DIFFERENTIAL AND INTEGRAL CALCULUS 1st and 2nd fundamental theorems of calculus Theorem. (1) Suppose the K-integral ∫ b a f(x) dx exists. Then F(t) = ∫ t a f(x dx) is continuous. The derivative F′ (t) exists and equals f(t) almost everywhere. (2) Suppose F(t) is a continuous function on [a, b] and suppose f(t) = F′ (t) exists for all but countably many exceptional points t in [a, b]. If f(t) is defined arbitrarily at those points, then F(t) = ∫ t a f(x) dx exists for all t ∈ [a, b] and equals to F(t) − F(a). We have no space here to go into proofs. Notice, the statements of the theorem are characteristic for the K-integrals, i.e. this is the only integration concept on R for which these two theorems hold true. 6.3.18. K-integrability and Lebesgue measure. We illustrate the claims in the latter theorem on the indicator function χQ of the rational numbers. Clearly its K-integral F(t) = ∫ t a χQ(x) dx exists (χQ is zero almost everywhere) and equals zero. Its derivative F′ (t) is identically zero, and equals χQ nearly everywhere. This is a good example of a bounded function which is not Riemann integrable, but is integrable in the more general sense. There are many more K-integrable functions than Riemann integrable functions. There is no difference between proper and improper integrals. More precisely, the K-integral ∫ b a f(x) dx exists if and only if the one-sided limit lim t→a− ∫ b t f(x) dx is well defined and their values coincide, and similarly for the upper limit b. This is due to the freedom in the choice of the gauges. There is only an indirect proof that there are bounded functions on a compact interval which are not K-integrable, based on some set-theoretic arguments, but there are no explicit constructions of such functions available. We say that a set of real numbers M is measurable if the K-integral of its indicator function χM exists. The assignment m : M → ∫ b a χM (x) dx for all sets M ⊂ [a, b] has the properties of a measure. The set of such measurable sets M is closed under finite intersections and countable unions. The measure m is additive with respect to unions of at most countable systems of pairwise disjoint sets. This measure coincides with the Lebesgue measure. This measure is used in another concept of integration, which is extremely useful in higher dimensional applications, the Lebesgue integral. We do not go into more details here. We remark that a real function f is Lebesgue integrable if and only if its absolute value is K-integrable. 575 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS A big advantage of the K-integral compared to other concepts is the possibility of integrating many functions which are not integrable in absolute value. Compare the concepts of convergence and absolute convergence of series. A typical example is the sinus integral over all reals. The K-integral of the sinc function ∫ ∞ 0 sin x x dx exists, while the absolute value g(x) = | sinc(x)| is not Lebesgue integrable. Such integrals are important in models for signal processing where it is necessary to aggregate potentially infinite many interferences canceling each other by different signs. 6.3.19. The convergence theorems. We have dealt with uniform convergence and Riemann integrability. With the Kintegral, there is a much nicer and stronger theorem available. A special case is the monotone convergence theorem for uniformly bounded functions f0(x) ≤ f1(x) ≤ . . . . Dominated convergence theorem Theorem. Suppose f0, f1, f2, . . . 
are K-integrable functions on an interval [a, b], converging pointwise to the limit function f. If there are two K-integrable functions g and h satisfying g(x) ≤ fn ≤ h(x), for all n ∈ N and x ∈ [a, b], then f is K-integrable too, and ∫ b a f(t) dt = lim n→∞ ∫ b a fn(t) dt. For monotone convergence, there is a stronger result saying that a sufficient and necessary condition for the Kintegrability of the pointwise limit is supn ∫ b a fn(x) dx < ∞. This theorem could not be applied in our third example in 6.3.2. There the functions fn have a "bump" which gets larger but narrower when close to the origin. The functions cannot be dominated by an integrable function. With the Riemann integral, a similar dominated convergence theorem can be proved, except that we have to guarantee the integrability of the pointwise limit f. 576 577 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS D. Additional exercises for the whole chapter Next we will describe some extra material on the notions that we have analyzed so far in Chapter 6. We begin with additional exercises on higher-order derivatives, Taylor polynomials, Taylor series and other tasks related to differentiation. Some of them are technical, and some other require a bit of our imagination. A) Material on derivatives of higher orders 6.D.1. Prove that nth derivative of the function f(x) = 1 x+1 with x ∈ A = R\{−1}, is given by f(n) (x) = (−1)n n! (x+1)n+1 for all x ∈ A. Solution. For any x in the domain of f we have f′ (x) = − 1 (x+1)2 . Thus for n = 1 the given relation holds. Assume that f(k) (x) = (−1)k k! (x+1)k+1 for some positive integer k. Then we see that f(k+1) (x) = ( (−1)k k! (x + 1)k+1 )′ = (−1)k k! ( 1 (x + 1)k+1 )′ = (−1)k k! ( −(k + 1)(x + 1)k (x + 1)′ (x + 1)2(k+1) ) = (−1)k+1 (k + 1)! (x + 1)k+2 , for all x ∈ A, where we used the identity k!(k + 1) = (k + 1)!. This shows that f(k+1) (x) = (−1)k+1 (k+1)! (x+1)k+2 and the claim follows by the principle of induction. □ 6.D.2. Compute the nth derivative of the function f(x) = 1 x(1−x) with x ∈ R\{0, 1}. Solution. For all x in the domain of f one can write f(x) = g(x) + h(x), where g(x) = 1 x and h(x) = 1 1−x , respectively. Thus it follows that f(n) (x) = g(n) (x) + h(n) (x) for all x, and it suffices to compute the nth derivatives of the functions g, h. For g we see that g′ (x) = −x−2 , g′′ (x) = 2 x−3 , g′′′ (x) = −6 x−4 = −1 · 2 · 3 x−4 , g(4) (x) = 24 x−5 = 1 · 2 · 3 · 4 x−5 . Hence clearly we have g(n) (x) = (−1)n n! x−(n+1) = (−1)n n! xn+1 , which easily follows by induction. Similarly, for h(x) we can prove that h(n) (x) = n! (1 − x)n+1 . Thus finally we deduce that f(n) (x) = n! ((−1)n xn+1 + 1 (1 − x)n+1 ) , x ∈ R\{0, 1} . □ 6.D.3. Consider the function f(x) = 1/x, with x ∈ R\{0}, and let n be a positive integer. Compute the limit lim x→+∞ f(n) (x) , for all x ̸= 0. Next confirm your answer via Sage by combining the commands limit and assume. Solution. First we need to compute the nth derivative. We see that f′ (x) = −x−2 , f′′ (x) = 2 x−3 , f′′′ (x) = −2·3 x−4 , f(4) (x) = 2·3·4 x−5 , . . . , f(n) (x) = (−1)n n! x−n−1 . Hence f(n) (x) = (−1)n n! x−n−1 and we leave as practice the proof by induction over n. To compute the given limit one may use the ratio test: lim n→+∞ f(n+1) (x) f(n)(x) = lim n→+∞ (−1)n+1 (n + 1)! x−n−2 (−1)nn! x−n−1 = lim n→+∞ (n + 1) |x| = +∞ . 
As for a Sage verification, it is necessary to use the command assume as follows var("n"); assume(x>1) limit(abs((-1)^n*(factorial(n))*(x**(-n-1))), n=oo) Otherwise, Sage prints out an error, suggesting the use of further restrictions. □ 6.D.4. Determine the Taylor expansion of third order around the point a = 0 of the following functions: g(x) = 1 cos(x) , h(x) = e− x2 2 , k(x) = sin (sin(x)) . Next verify your answers in Sage. ⃝ 578 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.D.5. For the function f(x) = esin(x) with x ∈ R compute the Taylor polynomials Tk a f(x), where k = 1, 2, 3, 4 and a = 0. Deduce that T3 0 f(x) = T2 0 f(x) for all x. Moreover, use Sage to plot the graphs of these Taylor extensions together with the graph of f. Solution. We have f(0) = e0 = 1, and compute f′ (0) = cos(x) esin(x) x=0 = 1 , f′′ (0) = [ esin(x) ( cos2 (x) − sin(x) )] x=0 = 1 , f′′′ (0) = [ esin(x) ( cos3 (x) − 3 cos(x) sin(x) − cos(x) )] x=0 = 1 − 1 = 0 , f(4) (0) = [ esin(x) ( cos4 (x) − 6 cos2 (x) sin(x) − 4 cos2 (x) + 3 sin2 (x) + sin(x) )] x=0 = 1 − 4 = −3 . Thus T1 0 f(x) = 1 + x , T2 0 f(x) = 1 + x + 1 2 x2 , T3 0 f(x) = T2 0 f(x) , T4 0 f(x) = 1 + x + 1 2 x2 − 1 8 x4 . In the figure below we present the graphs of these polynomials. To produce this figure in Sage (together with a confirmation of the expressions of Tk 0 f(x) for k = 1, . . . , 4) we applied the following block: f(x)=exp(sin(x)) p=plot(f(x), x, -2*pi, 2*pi, thickness=1.5, legend_label=r"$f$") T1(x)=taylor(f(x), x, 0, 1) p1=plot(T1(x), x, -2*pi, 2*pi, color="green", linestyle="--", legend_label=r"$T^1_{0}$") T2(x)=taylor(f(x), x, 0, 2) p2=plot(T2(x), x, -2*pi, 2*pi, color="red", linestyle="-.", legend_label=r"$T^2_{0}$") T3(x)=taylor(f(x), x, 0, 3) p3=plot(T2(x), x, -2*pi, 2*pi, color="purple", linestyle="-.", legend_label=r"$T^3_{0}$") T4(x)=taylor(f(x), x, 0, 4) p4=plot(T4(x), x, -pi, pi, color="black", linestyle=":", legend_label=r"$T^4_{0}$") show(p+p1+p2+p3+p4) □ 6.D.6. Consider the function f(x) = ln(1 + x2 )/(1 − cos(x)). Using the Taylor series centered at 0 of the enumerator and dominator of f, compute limx→0 f(x). Next present a confirmation of your result based on the l’Hopital’s rule. Solution. Recall that the Taylor series centered at 0 of ln(1 + x) and cos(x) are respectively given by ln(1 + x) = ∞∑ n=1 (−1)n−1 xn n , |x| < 1 , cos(x) = ∞∑ n=0 (−1)n x2n (2n)! x ∈ R . Thus, the Taylor series around 0 of ln(1 + x2 ) has the form ln(1 + x2 ) = ∞∑ n=1 (−1)n−1 n x2n = x2 − x4 2 + x6 3 − x8 4 + · · · , |x| < 1 , 579 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS while 1 − cos(x) = 1 − ( 1 − x2 2! + x4 4! − x6 6! + x8 8! − · · · ) = x2 2! − x4 4! + x6 6! − x8 8! + · · · , x ∈ R . We can now use these two formulas to compute the limit at hand: lim x→0 f(x) = lim x→0 ln(1 + x2 ) 1 − cos(x) = lim x→0 x2 − x4 2 + x6 3 − x8 4 + · · · x2 2! − x4 4! + x6 6! − x8 8! + · · · = lim x→0 x2 ( 1 − x2 2 + x4 3 − x6 4 + · · · ) x2 (1 2 − x2 4! + x4 6! − x6 8! + · · · ) = lim x→0 1 − x2 2 + x4 3 − x6 4 + · · · 1 2 − x2 4! + x4 6! − x6 8! + · · · = 1 1 2 = 2 . To confirm this result by an alternative method, apply the l’Hopital rule twice: lim x→0 ln(1 + x2 ) 1 − cos(x) 0 0 = lim x→0 ( ln(1 + x2 ) )′ ( 1 − cos(x) )′ = lim x→0 2x 1 + x2 sin(x) = lim x→0 2x (1 + x2) sin(x) 0 0 = lim x→0 (2x)′ ( (1 + x2) sin(x) )′ = lim x→0 2 2x sin(x) + (1 + x2) cos(x) = 2 0 + 1 = 2 . □ 6.D.7. Consider the function f(x) = ex − e−x −2x x2 − x ln(x + 1) . Using appropriately the theory of Taylor series to evaluate the limit limx→0 f(x). 
Next confirm your computation by Sage. Solution. Observe that limit that we need to compute has the indeterminate form 0/0. We have ex = 1 + x + x2 2! + x3 3! + · · · + xn n! + · · · , e−x = 1 − x + x2 2! − x3 3! + · · · + (−1)n xn n! + · · · Thus, ex − e−x = 2x + 2x3 3! + higher order terms. This implies that the numerator ex − e−x −2x is dominated by the term 2x3 3! , as x → 0. Similarly, x ln(x + 1) = x ( x − x2 2 + x3 3 − x4 4 + · · · ) = x2 − x3 2 + x4 3 − x5 4 + · · · . Thus the dominator x2 − x ln(1 + x) is dominated by the term x3 2 , as x → 0. Combining these two observations we get limx→0 f(x) = 2/3. In Sage to confirm the result give the cell f(x)=(e^x-e^(-x)-2*x)/(x^2-x*ln(1+x)); limit(f(x), x=0) □ 6.D.8. Consider the function f(x) = { 1/x2 , if x ̸= 0, 0 , if x = 0. Determine the intervals of monotonicity and find the extremes points of f, if any. Solution. The domain of f is the whole real line, but f is not continuous at x = 0, since lim x→0+ f(x) = lim x→0− f(x) = ∞ , and f(0) = 0 (try to plot the graph of f). Recall in Sage we can get the same, just by the cell show(lim(1/x^2, x=0, dir="left")); show(lim(1/x^2, x=0, dir="right")) Hence, f is neither differentiable at x = 0, i.e., f′ (0) does not exist. Now, for all x ̸= 0 we have f′ (x) = −2/x3 and hence f is strictly increasing for x < 0 and strictly decreasing for x > 0. However, at x = 0 the function f does not a maximum, since it is discontinuous at this point. □ 6.D.9. Find the asymptotes of the following functions: 580 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS (a) y = 2 arctan ( x x2 − 1 ) , x ∈ R\{±1} , (b) y = ln ( 3e2x + ex + 10 ex + 1 ) , x ∈ R . ⃝ 6.D.10. Let f : R → R be a differentiable function satisfying f(x) + ef(x) = x + 1, for all x ∈ R. (a) Show that f(0) = 0. (b) Does f have some critical points? Show that f is strictly increasing. (c) Show that f is positive for x > 0 and negative for x < 0. (d) Prove the inequality x 2 ≤ f(x) ≤ xf′ (x), for all x ∈ R. Solution. (a) The defining equation f(x) + ef(x) = x + 1 for x = 0 gives f(0) + ef(0) −1 = 0 . (∗) Suppose that f(0) = y ∈ R and consider the function g(y) = y + ey −1. Obviously, y = 0 is root of g, i.e., g(0) = 0. On the other hand, the first derivative of g is given by g′ (y) = 1 + ey and we have g′ (y) > 0 for all y ∈ R. Thus the function g is strictly increasing, which implies that y = 0 is its unique root. Thus f(0) = 0. Another solution: For y = f(0) and g(y) = y + ey −1 the relation (∗) is written as g(f(0)) = 0. However, g is injective since g is strictly increasing. Therefore, the relation g(f(0)) = 0 gives f(0) = 0. (b) By differentiating the relation f(x) + ef(x) = x + 1 with respect to x we get f′ (x)(1 + ef(x) ) = 1. Thus f′ (x) = 1 1 + ef(x) > 0 , x ∈ R . (∗∗) Hence f has no critical points, and in particular is strictly increasing. (c) To obtain this claim one can rely on the monotonicity of f and use (a). In particular, we have x > 0 ⇔ f(x) > f(0) = 0 , x < 0 ⇔ f(x) < f(0) = 0 . (d) Here we will use the second derivative of f, given by f′′ (x) = ( 1 1 + ef(x) )′ = − f′ (x) ef(x) (1 + ef(x) )2 = − ef(x) (1 + ef(x) )3 < 0 , for all ∈ R. Notice also by (∗∗) that f′ (0) = 1/2. Now, for x = 0 the given inequality holds as an equality. For x > 0 consider the set [0, x]. On this set f satisfies the conditions of the mean value theorem, hence there exist ξ ∈ (a, b) such that f′ (ξ) = f(x) − f(0) x − 0 = f(x) x , (♯) where we used (a) to replace f(0) by 0. On the other hand we saw that f′ is strictly decreasing (since f′′ (x) < 0 for all x). 
Thus, in combination with (♯) we obtain the following equivalences: 0 < ξ < x ⇔ f′ (x) > f′ (ξ) > f′ (0) ⇔ f′ (x) > f(x) x > 1 2 ⇔ xf′ (x) > f(x) > x 2 which proves the given formula for x > 0. For x < 0 one proceeds similarly by applying the mean value theorem on the set [x, 0]. □ 6.D.11. Study the local behaviour of the function f(x) = 3 √ | x |3 + 1, with x ∈ R. Solution. It is easy to deduce that f is continuous everywhere on R. Also, f(x) ≥ 1 and f(−x) = f(x) for all x ∈ R, i.e., the function f is positive and even. As for the limit behavior of f at ±∞, we compute (see also below the graph of f) lim x→±∞ 3 √ | x |3 + 1 = lim x→±∞ 3 √ | x |3 = lim x→±∞ | x | = +∞ . For the first derivative of f we compute f′ (x) =    x2 3 √ (x3+1)2 , for x > 0, 0 , for x = 0, − x2 3 √ (−x3+1)2 , for x < 0. Notice for computing f′ (0) we used the one-side limits lim x→0+ x2 3 √ (x3 + 1) 2 = 0 = lim x→0− − x2 3 √ (−x3 + 1) 2 . 581 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS We deduce that f′ (x) > 0 for all x > 0 and f′ (x) < 0 for all x < 0. This implies that f is strictly increasing on the interval (0, +∞), increasing on [0, +∞), strictly decreasing for all x ∈ (−∞, 0), and increasing on the interval (−∞, 0] (recall that f is an even function). It follows that has only one local minimum at point x0 = 0, which is a global minimum with value f(0) = 1. Next we will prove that f is convex for any x ∈ R. Indeed, the second derivative of f is given by f′′ (x) =    2x 3 √ (x3+1)5 , if x > 0; 0 , if x = 0; − 2x 3 √ (−x3+1)5 , if x < 0. Thus we deduce that f′′ (x) > 0 for all x ∈ (−∞, 0) ∪ (0, ∞), so f is strictly convex on this interval. In total, we obtained that f is convex on its whole domain (it doesn’t have some inflection points). Let us finally determine the asymptotes of Cf . Recall that a line y = ax + b is an inclined asymptote for x → ∞, if and only if both (proper) limits lim x→∞ f(x) x = a , lim x→∞ (f(x) − ax) = b , exist. Analogous statement holds for x → −∞. Hence the limits lim x→∞ f(x) x = lim x→∞ 3 √ x3 + 1 x = lim x→∞ 3 √ x3 x = 1 , lim x→∞ (f(x) − 1 · x) = lim x→∞ ( 3 √ x3 + 1 − x ) = lim x→∞   [ 3 √ x3 + 1 − x ] 3 √ (x3 + 1) 2 + x 3 √ x3 + 1 + x2 3 √ (x3 + 1) 2 + x 3 √ x3 + 1 + x2   = lim x→∞ x3 + 1 − x3 3 √ (x3 + 1) 2 + x 3 √ x3 + 1 + x2 = lim x→∞ 1 3x2 = 0 , imply that the line y = x is an asymptote at +∞. Since f is even, we immediately obtain the line y = −x as an asymptote at −∞. Let us finally present the graph of f together with the asymptotes y = ±x: □ 6.D.12. Using Sage as a tool, examine the local behaviour of the function f(x) = arctan ( x 2 − x ) . Solution. The given function is defined on A = R\{2}. Its graph does not have special symmetry. The only point of intersection of Cf with the axes is the origin [0, 0], while f is positive exactly on the open interval (0, 2). In Sage the roots of f occur by the command f(x)=arctan(x/(2-x)); show(solve(f(x)==0, x)) while one can verify that f is not even/odd by adding the cell show(bool(f(-x)==f(x))); show(bool(f(-x)==-f(x))) where for both cases Sage returns False. The function is everywhere continuous on A, and at x0 = 2 the graph of f has a jump of size π. This follows from the limits limx→2− f(x) = π 2 and limx→2+ f(x) = −π 2 , see also the graph Cf of f below: 582 CHAPTER 6. 
DIFFERENTIAL AND INTEGRAL CALCULUS This illustration occurs by adding in the initial block the cell p=plot(f(x), x, -10, 10, exclude=[2], color="black", thickness=2) p+=text(r"$f(x)=\arctan(\frac{x}{2-x})$", (5.1, 1.45), fontsize=14, color="black") p+=line([(2, -1.6), (2, 1.6)], rgbcolor=(0.2,0.2,0.2), linestyle="--") p.show(figsize=6) Moreover we see that limx→−∞ f(x) = −π 4 = limx→+∞ f(x), and thus the range of f is the set (−π/2, π/2)\{−π 4 }. Moreover, the only asymptote of Cf is the line y = −π/4 at ±∞. We also compute f′ (x) = 1 x2 − 2x + 2 , f′′ (x) = 2 (1 − x) (x2 − 2x + 2)2 , x ∈ R\{2} . The limits posed above together with the first two derivatives can be confirmed by adding in the initial block the cell show(lim(f(x), x=2, dir="+")); show(lim(f(x), x=2, dir="-")) show(lim(f(x), x=oo,)); show(lim(f(x), x=-oo)) show(diff(f, x).factor()); show(diff(f, x, 2).factor()) It is easy to see that f′ (x) > 0 for all x ∈ A and hence f is increasing at every point of its domain, see also the graph of f′ (x) above (which is coloured by green). This claim can be confirmed in Sage by adding the cell A=RealSet(x!=2); bool(diff(f, x)>0 for x in A) Notice the first command in this cell declares the domain A of f. Finally, the function f is convex on the interval (−∞, 1), and concave on the interval (1, 2), (2, +∞). In the figure at the right we included the graph of f′′ (x), which provides a graphical proof of these claims. The point x1 = 1 is the unique point of inflection with f(1) = π/4, which we can easily confirmed by adding the cell solve(diff(f, x, 2) == 0, x). □ 6.D.13. Study the local behaviour of the function f(x) = − x2 x + 1 , with x ∈ R\{−1} and use Sage to sketch the graph of f together with its asymptotes. ⃝ 6.D.14. Study the local behaviour of the function f(x) = x3 − 3x2 + 3x + 1 x − 1 and use Sage to confirm most of your computations. ⃝ 6.D.15. Study the local behaviour of the function f(x) = 3 √ x e−x . ⃝ Consider a set of n functions {f1(t), . . . , fn(t)} which are (n − 1) times differentiable. The Wronki matrix associated to this set is defined by W(f1, . . . , fn) =       f1(t) f2(t) . . . fn(t) f′ 1(t) f′ 2(t) . . . f′ n(t) f′′ 1 (t) f′′ 2 (t) . . . f′′ n (t) ... ... ... ... f (n−1) 1 (t) f (n−1) 2 (t) . . . f (n−1) n (t)       . On can prove that the set of functions {f1(t), . . . , fn(t)} is linearly independent if and only if det(W) ̸= 0. Let us describe an application of this result, which relates linear algebra with calculus. 583 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.D.16. Wronki matrix. Consider the following set B = {sin(t), cos(t), et }. (a) Show that B consists of linearly independent functions. (b) If V is the real vector space of functions spanned by B, let D : V → V be the linear operator defined by D(f) = f′ (t) = d f(t) d t . Compute the matrix of D with respect to the basis B. Moreover, find its kernel and its image. Solution. For the set B = {sin(t), cos(t), et } we compute det(W(sin, cos, exp)) = sin(t) cos(t) et cos(t) − sin(t) et − sin(t) − cos(t) et = −2et ̸= 0 , hence B consists of linearly independent vectors. We can easily verify the computation of the determinant in Sage, by typing t = var("t") W = matrix(SR, 3, 3, [[sin(t),cos(t), exp(t)],[cos(t), -sin(t), exp(t)], [-sin(t), -cos(t), exp(t)]]) W; w=W.det ( ); w.full_simplify ( ) which returns the desired answer −2 ∗ et . 
(b) For the matrix of the linear operator D with respect to the basis B we see that D(sin(t)) = 0 sin(t) + 1 cos(t) + 0et , D(cos(t)) = −1 sin(t) + 0 cos(t) + 0et , D(et ) = 0 sin(t) + 0 cos(t) + 1et . Then the coordinates of D(sin(t)), D(sin(t)) and D(et ) give the columns of the matrix of D, that is,   0 −1 0 1 0 0 0 0 1  . Let us now quickly compute the kernel of D via Sage: D = matrix(SR, 3, 3, [[0, -1, 0], [1, 0, 0], [0, 0, 1]]) D.right_kernel ( ) which prints out Vector space of degree 3 and dimension 0 over Symbolic Ring Basis matrix: [] This essentially means that Ker(D) = {0}. For the image use the command D.image (). It gives5 Vector space of degree 3 and dimension 3 over Symbolic Ring Basis matrix: [1 0 0] [0 1 0] [0 0 1] Thus the image of D is spanned by the standard basis of R3 . □ We proceed with material on parametrized curves and surfaces. 6.D.17. Find the points on the cycloid given in 6.A.45 where the tangent lines are vertical/parallel. ⃝ In differential calculus there is a plethora of important inequalities. Among them there is one with many applications, the so called Jensen inequality, This is about convex and concave functions, as we will see below. 6.D.18. Jensen inequality. For a strictly convex function f on interval I and for arbitrary points x1, . . . , xn ∈ I and real numbers c1, . . . , cn > 0 sucht that c1 + · · · + cn = 1, the inequality f ( n∑ i=1 ci xi ) ≤ n∑ i=1 ci f (xi) holds, with equality occuring if and only if x1 = · · · = xn. Notice the Jensen inequality can be also formulated in a more intuitive way: “the centroid of mass points placed upon a graph of a strictly convex function lies above this graph.” Solution. Could be proven easily by induction: for n = 2 it is just the definition of the convex function, for the induction step f (k+1∑ i=1 ci xi ) = f ( c1x1 + (1 − c1) k+1∑ i=2 ci 1 − c1 xi ) ≤ c1f(x1) + (1 − c1)f (k+1∑ i=2 ci 1 − c1 xi ) 5Recall that the basis computed by Sage is “row reduced”. 584 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS ≤ c1f(x1) + (1 − c1) (k+1∑ i=2 ci 1 − c1 f(xi) ) = k+1∑ i=1 cif(xi) . Notice above we used the inequality first for n = 2 and then for n = k. □ 6.D.19. Prove that among all (convex) n-gons inscribed into a circle, the regular n-gon has the largest area (for arbitrary n ≥ 3). Solution. Clearly it suffices to consider the n-gons inside of which lies the center of the circle. We’ll divide each such n-gon inscribed into a circle with radius r to n triangles with areas Si, i ∈ {1, . . . , n} according to the figure. With regard to the fact that sin φi 2 = xi r , cos φi 2 = hi r , i ∈ {1, . . . , n}, we have Si = xihi = r2 sin φi 2 cos φi 2 = 1 2 r2 sin φi, i ∈ {1, . . . , n}. This implies that the area of the hole n-gon is S = n∑ i=1 Si = 1 2 r2 n∑ i=1 sin φi. Thus we want to maximize the sum ∑n i=1 sin φi, while for values φi ∈ (0, π) we clearly have (1) φ1 + · · · + φn = n∑ i=1 φi = 2π. The function y = sin x is strictly concave on the interval (0, π), which means, that the function y = − sin x is strictly convex on this interval. Then according to Jensen’s inequality for ci = 1/n and xi = φi, we have − sin ( n∑ i=1 1 n φi ) ≤ − n∑ i=1 1 n sin φi, tj. sin ( n∑ i=1 1 n φi ) ≥ n∑ i=1 1 n sin φi. Moreover, we know the equality occurs exactly for φ1 = · · · = φn. If we express (using (1)) S = r2 n 2 n∑ i=1 1 n sin φi ≤ r2 n 2 sin ( n∑ i=1 1 n φi ) = r2 n 2 sin 2π n , we can see that S can attain at most the value on the right hand side. But that happens if and only if φ1 = · · · = φn (we chose xi = φi). 
Hence the regular n-gon is the one with the maximum area, because it satisfies φ1 = · · · = φn = 2π/n. □ 6.D.20. Isoperimitric quotient. For a closed curve in plane enclosing a planar region, we define its isoperimetric quotient as the number IQ := S π ( o 2π )2 = 4πS o2 , where S denotes the area of the region and o its perimeter (i.e. the length of the curve). Hence the isoperimetric quotient determines the ratio of the area of the region and the area of a circle with the same perimeter as the given region. The notation IQ is therefore not only an English abbreviation for the isoperimetric quotient, but can be also thought of as the “intelligence of the region”, with which it uses its perimeter for attaining as big area as possible. The isoperimetric theorem then states that for every closed curve, IQ ≤ 1, with equality occuring only for a circle, or (“the circle is the smartest”). Determine IQ for a regular polygon and a circle and find the sector of a circle, for which its boundary has the largest IQ Solution. First notice that the value of IQ doesn’t change with a change of scale on the axes (same on both). Because when the proportions of the region get a times bigger (for arbitrary a > 0), the perimeter also gets a times bigger and the area a2 times (it’s a square measure). Hence IQ doesn’t depend on the size of the region, but only on its shape. Thus we can consider a regular n-gon inscribed into a unit circle. According to the figure, h = cos φ = cos π n , x 2 = sin φ = sin π n , which yields on = n · x = 2n sin π n and Sn = n · 1 2 hx = n cos π n sin π n . Thus for a regular n-gon, we have IQ = 4πn cos π n sin π n 4n2 sin2 π n = π n cotg π n , which we can verify for example for a square (n = 4) with a side of length a, where IQ = 4πa2 (4a)2 = π 4 = π 4 cotg π 4 . 585 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Using the limit transition for n → ∞ and the limit lim x→0 sin x x = 1, we get the isoperimetric quotient for a circle: IQ = lim n→∞ π n cotg π n = lim n→∞ cos π n sin π n π n = cos 0 1 = 1. Of course, for a circle with radius r, we could have also directly computed IQ = 4πS o2 = 4π ( πr2 ) (2πr)2 = 1. For the boundary a of sector of a circle with radius r and central angle φ ∈ (0, 2π), we have IQ = 4πS o2 = 4π φr2 2 (2r+rφ)2 = 2πφ (2+φ)2 . Hence we’re looking for a maximum of the function f(φ) := 2πφ (2+φ)2 , φ ∈ (0, 2π). By computing f′ (φ) = 2π (2+φ)2 −2φ(2+φ) (2+φ)4 = 2π 2−φ (2+φ)3 , φ ∈ (0, 2π) we easily obtain that f′ (φ) > 0, φ ∈ (0, 2), f′ (φ) < 0, φ ∈ (2, 2π). Hence function f attains its maximal value for φ0 = 2 and for a central angle φ0 = 2 (radians), we get the largest IQ = 2πφ0 (2+φ0)2 = π 4 . For the sake of completeness, for a solid in three-dimensional space (more precisely, for the closed surface which is its boundary), we define IQ := V 4π 3 ( S 4π ) 3 2 , where V is the volume and S the surface of the solid. Thus we compare the volume of the solid with a given surface with the volume of the ball with the same space. □ 6.D.21. A string of length l is given. The task is to cut it into n parts so that it’s possible to create boundaries of geometric figures given in advance (for example a square, a triangle, a circle, a halfcircle) with the least sum of areas from the n smaller strings. Solution. To solve this problem, we’ll use the isoperimetric quotient of curves and Jensen’s inequality (stated in previous examples). For the geometric figures given in advance, denote the values of their isoperimetric quotients as 1 λi := 4πSi o2 i , i ∈ {1, . . . 
, n}, where Si is the area and oi the perimeter of the i-th figure. We’ll also use the denotation Λ := n∑ i=1 λi. Recall that the isoperimetric quotient is given only by the shape of the figure and doesn’t depend on its size. In particular, the value Λ is constant (it’s determined by the shapes of the given figures). Our task is to minimize the sum ∑n i=1 Si with ∑n i=1 oi = l. Because Si = o2 i 4πλi , i ∈ {1, . . . , n}, we need to minimize the expression S := 1 4π n∑ i=1 o2 i λi . Using Jensen’s inequality for the strictly convex function y = x2 (on the whole real axis), we obtain ( n∑ i=1 ci xi )2 ≤ n∑ i=1 ci x2 i for xi ∈ R and ci > 0 with the property c1 + · · · + cn = 1. Moreover we know that the equality occurs if and only if x1 = · · · = xn. By choosing ci = λi Λ , xi = oi λi , i ∈ {1, . . . , n}, we then get 586 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS ( n∑ i=1 λi Λ oi λi )2 ≤ n∑ i=1 λi Λ ( oi λi )2 . By several simplifications, we obtain the inequality 1 Λ2 ( n∑ i=1 oi )2 ≤ 1 Λ n∑ i=1 o2 i λi and then (notice that ∑n i=1 oi = l) l2 Λ ≤ n∑ i=1 o2 i λi , with equality again occuring for (1) x1 = · · · = xn, tj. o1 λ1 = · · · = on λn . This implies that S the smallest, if and only if (1) holds. This smallest value of S is l2 /(4πΛ). Now we only need to determine the lengths of the cut parts oi. If (1) holds, then clearly oi = kλi for all i ∈ {1, . . . , n} and certain constant k > 0. From n∑ i=1 oi = l and simultaneously n∑ i=1 oi = k n∑ i=1 λi = kΛ, we can immediately see that k = l/Λ, i.e. oi = λi Λ l, i ∈ {1, . . . , n}. Let’s take a look at a specific situation where we are to cut a string of length 1 m into two smaller ones and then create a square and a circle from them so that the sum of their areas is the smallest possible. For a square and a circle (in order), we have (see the example called Isoperimetric quotient) λ1 = 4 π , λ2 = 1, tj. Λ = λ1 + λ2 = 4+π π . Then the lengths of the respective parts are (in metres) o1 = 4 π 4+π π · 1 = 4 4+π . = 0, 56, o2 = 1 4+π π · 1 = π 4+π . = 0, 44. The area of a square with perimeter 0, 56 m (with a side of length a = 0, 14 m) is 0, 019 6 m2 and the area of a circle with perimeter 0, 44 m (and radius r . = 0, 07 m) is approximately 0, 015 4 m2 . We can verify that (in m2 l2 4πΛ = 1 4(4+π) . = 0, 035 = 0, 019 6 + 0, 015 4. □ A) Material on integration 6.D.22. Based on Sage, evaluate the integrals ∫ f(x) dx, where: (1) f(x) = x4 + ex +5 ln(x), with x > 0; (2) f(x) = √ x(1 + x3 ), with x > 0; (3) f(x) = x/ √ x + 1, with x > −1. ⃝ 6.D.23. Using any basic formula, and your pencil, determine a primitive function for the following functions: (a) f(x) = √ x √ x √ x, with x ∈ (0, +∞); (b) g(x) = (2x + 3x ) 2 , with x ∈ R; (c) h(x) = 1 √ 4 − 4x2 , with x ∈ (−1, 1); (d) k(x) = cos x 1 + sin x , with x ∈ ( −π 2 , 3π 2 ) . Then, confirm your computations via Sage. ⃝ 6.D.24. Find a primitive function for the function f(x) = ex + 3 √ 4 − x2 on the open interval (−2, 2). ⃝ 6.D.25. Evaluate the integral ∫ dx x ( ln(x) )2 + 2025x with an appropriate substitution. ⃝ 6.D.26. Evaluate the integral ∫ ex ex +2024 dx with an appropriate substitution. ⃝ 587 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.D.27. Apply integration by parts to compute the integral Λ = ∫ cos(3x + 2025)(x + 2026) dx, with x ∈ R. Solution. In terms of 6.2.4, it is convenient to set F(x) = (x + 2026), such that F′ (x) = 1. Then G′ (x) = cos(3x + 2025) and G must have the form G(x) = 1 3 sin(3x + 2025), such that G′ (x) = (3x)′ cos(3x+2025) 3 = cos(3x + 2025). 
Then we get Λ = ∫ (x + 2026) cos(3x + 2025) dx = ∫ (x + 2026) ( sin(3x + 2025) 3 )′ dx = = 1 3 (x + 2026) sin(3x + 2025) − 1 3 ∫ (x + 2026)′ sin(3x + 2025) dx = = 1 3 (x + 2026) sin(3x + 2025) − 1 3 ∫ sin(3x + 2025) dx = = 1 3 (x + 2026) sin(3x + 2025) + 1 3 · cos(3x + 2025) 3 + C , for some constant C. A direct verification in Sage occurs as usual: show(integral(cos(3*x+2025)*(x+2026), x).factor()) □ 6.D.28. Compute the integral Λ = ∫ sin(2024x) cos(x) dx, formally first and next in Sage. Solution. Recall the identities sin(α + β) = sin(α) cos(β) + sin(β) cos(α) , sin(α − β) = sin(α) cos(β) − sin(β) cos(α) . Adding these two relations we get 2 sin(α) cos(β) = sin(α + β) + sin(α − β), an identity that one can apply to compute the integral at hand. In particular, Λ = ∫ sin(2024x) cos(x) dx = 1 2 ∫ ( sin(2024x+x)+sin(2024x−x) ) dx = 1 2 ∫ sin(2025x) dx+ 1 2 ∫ sin(2023x) dx = = − 1 4050 cos(2025x) − 1 4046 cos(2023x) + C . In Sage to confirm this just type the command show(integrate(sin(2024 ∗ x) ∗ cos(x), x)). □ 6.D.29. Compute the integral I = ∫ √ 1 − x2 dx, with x ∈ (−1, 1), in two different ways. Solution. In terms of integration by parts (6.2.4), we have F(x) = √ 1 − x2, F′ (x) = −x √ 1 − x2 , G′ (x) = 1 and G(x) = x. Thus, up to a constant we have that I = ∫ √ 1 − x2 dx = x √ 1 − x2 + ∫ x2 √ 1 − x2 dx = x √ 1 − x2 − ∫ 1 − x2 − 1 √ 1 − x2 dx = = x √ 1 − x2 − ∫ √ 1 − x2 dx + ∫ 1 √ 1 − x2 dx = x √ 1 − x2 − ∫ √ 1 − x2 dx + arcsin(x) . This implies that 2 ∫ √ 1 − x2 dx = x √ 1 − x2 + arcsin(x) + C =⇒ I = ∫ √ 1 − x2 dx = 1 2 ( x √ 1 − x2 + arcsin(x) ) + C , for some constant C ∈ R. Let us now compute I by an appropriate substitution. Set x = sin(t) such that dx = cos(t) dt and x = arcsin(t). Notice that t ∈ (−π/2, π/2) for x ∈ (−1, 1), and among other things, one has 0 < cos(t) = |cos(t)| = √ cos2(t) = √ 1 − sin2 (t) . Therefore our integral I can written as I = ∫ √ 1 − x2 dx = ∫ √ 1 − sin2 (t) cos(t) dt = ∫ cos2 (t) dt = 1 2 ( t + sin(t) cos(t) ) + C , for some constant C, where for the final equality one is based on our previous computation from 6.B.8. Thus I = 1 2 ( sin(t) √ 1 − sin2 (t) + t ) + C = 1 2 ( x √ 1 − x2 + arcsin(x) ) + C , C ∈ R . □ 588 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.D.30. Compute the limit lim x→+∞ F′ (x) x4 + 1 , where F(x) = ∫ x 1 ( t sin(t) + 4 cos(4t) ) dt. Solution. Let us express F as F(x) = ∫ x 1 f(t) dt, with f(t) = t sin(t) + 4 cos(4t), for t ∈ R. Since f is continuous the function F is differentiable, with F′ (x) = x sin(x) + 4 cos(4x) for all x ∈ R. Now for all x > 0 the function g(x) = F′ (x) x4 + 1 = x sin(x) + 4 cos(4x) x4 + 1 satisfies |g(x)| = x sin(x) + 4 cos(4x) x4 + 1 ≤ |x sin(x)| + 4 |cos(4x)| x4 + 1 ≤ |x| + 4 x4 + 1 = x + 4 x4 + 1 . It follows that lim x→+∞ g(x) = 0 (since lim x→+∞ x+4 x4+1 = 0). □ 6.D.31. Suppose that the function f : R → R has continuous second derivative on R and attains a local extreme at x0 = 1. Moreover, assume that the graph Cf of f passes through the point P = [0, 3] ∈ R2 . If ∫ 1 0 ( x f′′ (x) + 4 f′ (x) ) dx = α for some real number α, find α such that f(1) = 1/3. Solution. By assumption at x0 = 1 the function f attains a local extreme, thus f′ (1) = 0. Moreover, we have f(0) = 3. Thus, ∫ 1 0 ( x f′′ (x) + 4 f′ (x) ) dx = ∫ 1 0 x f′′ (x) dx + 4 ∫ f′ (x) dx = [ x f′ (x) ]1 0 − ∫ 1 0 (x′ ) f′ (x) dx + 4 [ f(x) ]1 0 = = ( 1 · f′ (1) − 0 · f′ (0) ) + 3 [ f(x) ]1 0 = 0 + 3 ( f(1) − f(0) ) = 3 ( f(1) − 3) = 3f(1) − 9 . 
Thus 3f(1) − 9 = α, from where it follows that f(1) = (α + 9)/3, and since it should be f(1) = 1/3 we get α = −8. □ Suppose that for some rational function f we want to compute the integral ∫ f ( sin(x), cos(x) ) dx. Then, usually we apply the substitution method, and here are some hints: • If f ( sin(x), − cos(x) ) = −f ( sin(x), cos(x) ) , then apply the substitution t = sin(x); • If f ( − sin(x), cos(x) ) = −f ( sin(x), cos(x) ) , then set t = cos(x); • If f ( − sin(x), − cos(x) ) = f ( sin(x), cos(x) ) , then set t = tan(x); • If none of these equalities hold, then try to use the substitution t = tan (x/2). 6.D.32. Integrate the following functions, with x ∈ ( −π 2 , π 2 ) for all the cases: (a) f(x) = sin3 (x) 1 + 4 cos2(x) + 3 sin2 (x) ; (b) g(x) = 1 1 + sin2 (x) ; (c) h(x) = 1 2 − cos(x) . Solution. For f(x), in the denominator it appears the function β(x) = 1 + 4 cos2 (x) + 3 sin2 (x), which can be rewritten as β(x) = 4 + cos2 (x), and in the numerator only the sine function to an odd power. Thus, the substitution t = cos(x) with dt = − sin(x) dx, allows the replacement of all the sines and cosines. Indeed, ∫ sin3 (x) β(x) dx = ∫ sin(x) ( 1 − cos2 (x) ) 4 + cos2(x) dx = ∫ − ( 1 − t2 ) 4 + t2 dt = ∫ (1 − 5 4 + t2 ) dt = t − 5 2 arctan ( t 2 ) + C = cos(x) − 5 2 arctan ( cos(x) 2 ) + C . For g(x), because both the sine and cosine appear to an even power, we may use the substitution t = tan(x), and hence x = arctan(t). This leads to the relations sin2 (x) = tan2 (x) 1 + tan2 (x) = t2 1 + t2 , cos2 (x) = 1 1 + tan2 (x) = 1 1 + t2 . 589 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Combining them with the relation dx = 1 1+t2 dt, we get ∫ dx 1 + sin2 (x) = ∫ 1 1+t2 1 + t2 1+t2 dt = ∫ 1 1 + 2t2 dt = √ 2 2 arctan (√ 2t ) + C = √ 2 2 arctan (√ 2 tan(x) ) + C . Finally, for h(x) one can apply the universal substitution t = tan (x 2 ) , with sin(x) = 2t 1 + t2 , cos(x) = 1 − t2 1 + t2 , dx = 2 1 + t2 dt . In such terms we have ∫ dx 2 − cos x = ∫ 2 1+t2 2 − 1−t2 1+t2 dt = 2 ∫ dt 1 + 3t2 = 2 √ 3 3 arctan (√ 3t ) + C = 2 √ 3 3 arctan (√ 3 tan (x 2 )) + C . □ 6.D.33. Evaluate the integral I = ∫ 1 sin(x) dx with x ∈ ( 0, π 2 ) , in at least two alternative ways. Solution. We may rewrite the integrand as 1 sin(x) = sin(x) sin2 (x) = sin(x) 1 − cos2(x) . Thus, I = ∫ sin(x) 1 − cos2(x) dx and we may set t = cos(x) with dt = − sin(x) dx, which gives I = − ∫ dt 1 − t2 . However, it is easy to prove that − 1 1 − t2 = − 1 2(t + 1) + 1 2(t − 1) , hence we deduce that I = 1 2 ∫ ( 1 t − 1 − 1 t + 1 ) dt = 1 2 ln |t − 1| − 1 2 ln |t + 1| + C = 1 2 ln t − 1 t + 1 + C = 1 2 ln cos(x) − 1 cos(x) + 1 + C = = 1 2 ln tan2 (x 2 ) + C = 1 2 ln tan (x 2 ) 2 + C = ln tan (x 2 ) + C , C ∈ R . Here we applied the formulas tan (x 2 ) = √ 1 − cos(x) 1 + cos(x) and |x| 2 = x2 . Probably, the quickest way to compute the integral at hand is the substitution t = tan (x 2 ) (see also 6.D.32), with sin(x) = 2 tan (x 2 ) 1 + tan2 (x 2 ) = 2t 1 + t2 , dx = 2 1 + t2 dt . In such terms we immediately get I = ∫ 1 2t 1+t2 · 2 1 + t2 dt = ∫ 1 t dt = ln |t| + C = ln tan (x 2 ) + C, C ∈ R. Notice may exist many other alternatives based on different trigonometric identities. For instance, one may use the identity tan (x 2 ) + 1 tan (x 2 ) = tan (x 2 ) + cot (x 2 ) = 2 sin(x) , x ∈ ( − π 2 , π 2 ) , which you may verify either by your pencil, or by Sage and the command bool(2/sin(x)==(tan(x/2)+(1/tan(x/2))) for x in RealSet(-pi/2, pi/2)) □ Then next few tasks involve the integration of rational functions. 6.D.34. 
Consider the rational function Q(x) = 1 x3 − 1 , with x ̸= 1. (a) Present the decomposition of Q into partial fractions, and then verify your answer by Sage. (b) Evaluate the integral A = ∫ Q(x) dx. ⃝ 6.D.35. Determine the integral ∫ 3x + 5 x2 + 4x + 8 dx, where x ∈ R. ⃝ 6.D.36. Compute the indefinite integral of the function f(x) = 1 (x2 + x + 1)2 , with x ∈ R. ⃝ 6.D.37. Determine the indefinite integral ∫ dx x3 + 1 dx, where x ∈ R\{−1}. ⃝ 590 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Next, we present additional material related to the integration of irrational expressions. In particular, for a few, we will focus on integrals of the form ∫ f ( x, √ a x2 + b x + c ) dx , where f is a rational expression and we expect a ̸= 0, b2 − 4ac ̸= 0 for otherwise arbitrary numbers a, b, c ∈ R. In this case one can distinguish two cases, related with the existence/non-existence of real roots for the quadratic polynomial a x2 + b x + c. • If a > 0 and the polynomial a x2 + b x + c has real roots x1, x2, use the representation √ a x2 + b x + c = √ a √ (x − x1)2 x − x2 x − x1 = √ a | x − x1 | √ x − x2 x − x1 and set t2 = x−x2 x−x1 . • If a < 0 and the polynomial a x2 + b x + c has real roots x1 < x2, use the representation √ ax2 + bx + c = √ −a √ (x − x1)2 x2 − x x − x1 = √ −a (x − x1) √ x2 − x x − x1 and set t2 = x2−x x−x1 . • Finally, if the polynomial a x2 + b x + c does not have real roots (necessarily for a > 0), choose the substitution √ ax2 + bx + c = ± √ a · x ± t with any choice of the signs. Of course, here we choose the signs so that we result to as easy expression to integrate, as possible. As one can expect, for all these cases the corresponding substitutions lead to rational functions. 6.D.38. Determine the antiderivative of the following functions: (a) f(x) = 1 (x + 4) √ x2 + 3x − 4 , with x ∈ (−∞, −4) ∪ (1, +∞); (b) g(x) = 1 (x − 1) √ x2 + x + 1 , with x ̸= 1. Next confirm the computations using Sage. Solution. (a) According to the discussion above one can proceed as follows: ∫ dx (x + 4) √ x2 + 3x − 4 = ∫ dx (x + 4) √ (x − 1)(x + 4) = ∫ dx (x + 4) | x + 4 | √ x−1 x+4 = t2 = x−1 x+4 x = 5 1−t2 − 4 dx = 10t (1−t2)2 dt = ∫ 10t (1−t2)2 ( 5 1−t2 ) 5 1−t2 t dt = ∫ 2 5 1 − t2 1 − t2 dt = 2 5 sign ( 1 − t2 ) ∫ 1 dt = 2 5 sign ( 5 x + 4 ) t+C = 2 5 sign (x) √ x − 1 x + 4 +C . In Sage giving the cell f(x)=1/((x+4)*sqrt(x^2+3*x-4)); show(integral(f(x), x).factor()) we get the answer 2 √ x2 + 3 x − 4 5 (x + 4) (recall that x2 + 3x − 4 = (x + 4)(x − 1)). Hence Sage’s answer does not contain the sign of x, but obviously our solution is correct. (b) Here we see that ∫ dx (x − 1) √ x2 + x + 1 = √ x2 + x + 1 = x + t x2 + x + 1 = x2 + 2xt + t2 x = −t2 +2t−2 2t−1 + 1 dx = −2(t2 −t+1) (2t−1)2 dt = ∫ −2(t2 −t+1) (2t−1)2 −t2+2t−2 2t−1 t2−t+1 2t−1 dt = ∫ 2 t2 + 2t − 2 dt = ∫ (√ 3 3 1 t + 1 − √ 3 − √ 3 3 1 t + 1 + √ 3 ) dt = √ 3 3 ln t + 1 − √ 3 − √ 3 3 ln t + 1 + √ 3 + C = √ 3 3 ln t + 1 − √ 3 t + 1 + √ 3 + C = √ 3 3 ln √ x2 + x + 1 − x + 1 − √ 3 √ x2 + x + 1 − x + 1 + √ 3 + C . □ 591 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.D.39. Using a suitable substitution, compute ∫ dx x + √ x2 + x − 1 dx, x ∈ ( −∞, − √ 5 − 1 2 ) ∪ (√ 5 − 1 2 , +∞ ) . Solution. Even though the quadratic polynomial under the root has real roots x1, x2, we won’t solve this problem by substitution t2 = x−x2 x−x1 . We could proceed that way, but we will rather use a method we introduced for the complex roots case. 
This is because this method yields a very simple integral of a rational function, as can be seen from the calculation

∫ dx/(x + √(x² + x − 1)) = | √(x² + x − 1) = x + t, x² + x − 1 = x² + 2xt + t², x = (t² + 1)/(1 − 2t), dx = (−2t² + 2t + 2)/(1 − 2t)² dt | = ∫ (−2t² + 2t + 2)/((t + 2)(1 − 2t)) dt = ∫ ( 1 − 2/(t + 2) − (1/2)·1/(t − 1/2) ) dt = t − 2 ln|t + 2| − (1/2) ln|t − 1/2| + C = √(x² + x − 1) − x − 2 ln( √(x² + x − 1) − x + 2 ) − (1/2) ln| √(x² + x − 1) − x − 1/2 | + C ,

for some constant C. Note, though, that the recommended substitutions have an undeniable advantage, namely their universality: by using them, one can compute all integrals of the respective type. □

6.D.40. For x > 0 determine

(a) ∫ (2 + 5x)³/⁴√x³ dx , (b) ∫ ∛(1 + ⁴√x)/√x dx , (c) ∫ 1/⁴√(1 + x⁴) dx .

Solution. All three given integrals are binomial, i.e. they can be written as ∫ xᵐ (a + b xⁿ)ᵖ dx for some a, b ∈ R and m, n, p ∈ Q. Binomial integrals are usually solved by the substitution method. If p ∈ Z (not necessarily p < 0), we choose the substitution x = tˢ, where s is the common denominator of the numbers m and n; if (m + 1)/n ∈ Z and p ∉ Z, we choose a + b xⁿ = tˢ, where s is the denominator of the number p; and if (m + 1)/n + p ∈ Z (p ∉ Z, (m + 1)/n ∉ Z), we choose a + b xⁿ = tˢ xⁿ, where s is again the denominator of p. In these three cases, a reduction to an integration of a rational function is guaranteed. Hence we can easily compute

(a) ∫ (2 + 5x)³/⁴√x³ dx = ∫ x^(−3/4) (2 + 5x)³ dx = | p ∈ Z, x = t⁴, dx = 4t³ dt | = 4 ∫ (2 + 5t⁴)³ dt = 4 ∫ (8 + 60t⁴ + 150t⁸ + 125t¹²) dt = 4 ( 8t + 12t⁵ + (50/3) t⁹ + (125/13) t¹³ ) + C = 4 ( 8 ⁴√x + 12 ⁴√x⁵ + (50/3) ⁴√x⁹ + (125/13) ⁴√x¹³ ) + C .

(b) ∫ ∛(1 + ⁴√x)/√x dx = ∫ x^(−1/2) (1 + x^(1/4))^(1/3) dx = | p ∉ Z, (m + 1)/n ∈ Z, 1 + x^(1/4) = t³, x = (t³ − 1)⁴, dx = 12t² (t³ − 1)³ dt | = 12 ∫ t³ (t³ − 1) dt = 12 ∫ (t⁶ − t³) dt = 12 ( t⁷/7 − t⁴/4 ) + C = 12 ∛((1 + ⁴√x)⁴) ( (1 + ⁴√x)/7 − 1/4 ) + C .

(c) ∫ 1/⁴√(1 + x⁴) dx = ∫ (1 + x⁴)^(−1/4) dx = | p ∉ Z, (m + 1)/n ∉ Z, (m + 1)/n + p ∈ Z, 1 + x⁴ = t⁴x⁴, x = (t⁴ − 1)^(−1/4), dx = −t³ (t⁴ − 1)^(−5/4) dt | = −∫ t²/(t⁴ − 1) dt = −∫ t²/((t − 1)(t + 1)(t² + 1)) dt = −(1/4) ∫ ( 1/(t − 1) − 1/(t + 1) + 2/(t² + 1) ) dt = −(1/4) ( ln|t − 1| − ln|t + 1| + 2 arctan(t) ) + C = −(1/4) [ ln( (⁴√(1/x⁴ + 1) − 1)/(⁴√(1/x⁴ + 1) + 1) ) + 2 arctan( ⁴√(1/x⁴ + 1) ) ] + C . □

Next, we focus on definite integrals and related applications.

6.D.41. Compute the definite integrals given below:

A = ∫_{π/4}^{π/2} sin(t)/(1 − cos²(t)) dt , B = ∫₀^{ln 2} dx/(e^{2x} − 3 eˣ) , C = ∫₀^{π/2} sin(x) sin(2x) dx . ⃝

6.D.42. Use 6.B.36 to compute the integral ∫₀¹⁰ {x} dx, where {·} denotes the fractional part function.

Solution. Recall by 5.E.57 that the fractional part function is defined by {x} = x − ⌊x⌋, and clearly this function is periodic with period T = 1. The integral at hand corresponds to the grey area in the following figure, which equals 10 · (1/2) = 5 (obviously, this is the area of 10 half square boxes with sides of length 1). Indeed,

∫₀¹⁰ {x} dx = 10 ∫₀¹ {x} dx = 10 ∫₀¹ x dx = 10 [x²/2]₀¹ = 5 .

For your convenience, here is the code used in Sage to produce the previous figure and confirm the result:

fract(x)=x-floor(x)
p=plot(fract, x, 0, 10, color="black", aspect_ratio=1, fill=True, fillcolor="grey", figsize=8); show(p)
integral(fract, x, 0, 10)

□

6.D.43. For some parameter α ∈ R, consider the function

f(x) = { eˣ cos(2x) , if x ∈ [−π/2, 0] ; √(α + sin(4x)) , if x ∈ (0, 3π/8] .

(a) Determine α such that f is continuous for any x in its domain.
Then use your answer and Sage to execute the following: • Indicate the whole domain of f but also the domains of the two components of f. • Sketch the graph of f via the piecewise method in Sage (see also 5.C.5). (b) Compute the area in between the graph of f and the x-axis, bounded from the vertical lines x = ±π 4 . (c) Use Sage in order to • Confirm your result in (b), in a “manual way”. • Compute the error that one obtains, if any, integrating the function f, as introduced via the piecewise method. Solution. At x0 = 0 we see that lim x→0− f(x) = lim x→0− ( ex cos(2x) ) = 1 = f(0). Hence we need lim x→0+ f(x) = lim x→0+ √ α + sin(4x) = 1 ⇐⇒ √ α + lim x→0+ sin(4x) = 1 , that is, √ α + 0 = 1 and hence α = 1. Thus f is continuous everywhere in its domain if and only if α = 1 (obviously, f is continuous on the intervals [−π/2, 0) and (0, 3π/8]). From now fix α = 1 and observe that f(−π/4) = 0 and f(3π/8) = 0. Sage has built-in commands to indicate the domain of a piecewise function and of its components, given by domain and domains, respectively. Hence we may type f1(x)=(e^x)*cos(2*x); f2(x)=sqrt(1+sin(4*x)) f=piecewise([[(-pi/2, 0), f1(x)], [(0, 6*pi/16), f2(x)]]) f.domain(); f.domains() Sage returns the obvious answers, i.e., (−1/2 ∗ pi, 0) ∪ (0, 3/8 ∗ pi), and ((−1/2 ∗ pi, 0), (0, 3/8 ∗ pi)), respectively. Let us now present the graph of f, where we have included the two vertical lines from the question in (b), and shaded the region whose area needs to be computed. 593 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS The code to construct this figure relies on many different steps, and some advanced options from 2D-graphics in Sage, as for example the command polygon to shade the region n question. In general, when one wants to integrate a function f from a to b and compute the area between f and the x-axis, with bounds the vertical lines passing from a, b, then can follow the summary given here: Step 1 introduce the function f; Step 2 declare the points a, b Step 3 sketch the graph of f, e.g., add the command q= plot(f, (x, a-0.5, b+0.5), thickness=2) Step 4 graph the two vertical lines, which can be done by adding the syntax q+= line([(a,0),(a,f(a))], color=’black’) q+= line([(b,0),(b,f(b))], color=’black’) Step 5 shade the area at hand, via the polygon method: q += polygon( [(a,0),(a,f(a))] +[(x, f(x)) for x in [a,a+0.005,..,b]] +[(b,0),(a,0)], rgbcolor=(0.2,0.4,0.7),aspect_ratio="automatic") Step 6 produce the figure, which can be done by adding show(q) (we may specify further options here, as ymin, ymax, ticks, figsize, and other, hence instead one may type q.show( ), and add inside the parentheses the appropriate options). For our example this method becomes a few more complicated, since we have a piecewise function, and hence we use first the piecewise method to introduce f. Here is the full implementation combining both methods. 
f1(x)=(e^x)*cos(2*x); f2(x)=sqrt(1+sin(4*x))
f=piecewise([[(-pi/2, 0), f1(x)], [(0, 6*pi/16), f2(x)]])
p=plot(f(x), x, -pi/2, 6*pi/16, color="black", thickness=1.2)
p+=plot(f2(x), x, 0, 6*pi/16, color="darkgreen", thickness=1.2)
p+=text("$f(x)$", (pi/16, 1.47), color="black", fontsize="12")
a=-pi/4; b=pi/4
p+=line([(a, 0), (a, 1)], color="black")
p+=line([(b, 0), (b, f(N(b)))], color="black")
p+=polygon([(-pi/4,0), (-pi/4,f(N(-pi/4)))]+[(x, f(N(x))) for x in [-pi/4,-pi/4+0.005,..,pi/4]]+[(b,0),(a,0)], rgbcolor=(0.5,0.7,0.8), aspect_ratio="automatic")
p.show(ticks=pi/4, tick_formatter=pi, ymin=-0.25, ymax=1.5)

Notice that this technique requires the values passed to f to be numerical expressions, which is the reason behind the appearance of f(N(b)), f(N(x)), etc. Otherwise Sage returns errors. However, as we will see below, this method produces a small error in the computation of the given area; this is because the piecewise method uses approximations of a, b, which affect the integration.

(b) Let us denote by E the shaded area. By definition we have

E = ∫_{−π/4}^{π/4} |f(x)| dx = ∫_{−π/4}^{π/4} f(x) dx = ∫_{−π/4}^{0} eˣ cos(2x) dx + ∫₀^{π/4} √(1 + sin(4x)) dx = E₁ + E₂ ,

since f is non-negative on [−π/4, π/4]. For the first integral, recall that (sin(2x)/2)′ = cos(2x) and (−cos(2x)/2)′ = sin(2x). Thus, by applying integration by parts twice we get

E₁ = ∫_{−π/4}^{0} eˣ (sin(2x)/2)′ dx = (1/2)[eˣ sin(2x)]_{−π/4}^{0} − (1/2)∫_{−π/4}^{0} eˣ sin(2x) dx
= (1/2)( e⁰ · sin(0) − e^{−π/4} · sin(−π/2) ) + (1/4)∫_{−π/4}^{0} eˣ (cos(2x))′ dx
= (1/2) e^{−π/4} sin(π/2) + (1/4)[eˣ cos(2x)]_{−π/4}^{0} − (1/4)∫_{−π/4}^{0} eˣ cos(2x) dx .

Thus E₁ = (1/2) e^{−π/4} + (1/4)( e⁰ · cos(0) − e^{−π/4} · cos(−π/2) ) − (1/4) E₁, or equivalently E₁ = (2 e^{−π/4} + 1)/5 ≈ 0.382375.

For E₂, we use the identities sin(2y) = 2 sin(y) cos(y) and sin²(y) + cos²(y) = 1, valid for any y ∈ R, where we replace y by 2x. Thus we can write

E₂ = ∫₀^{π/4} √( sin²(2x) + cos²(2x) + 2 sin(2x) cos(2x) ) dx = ∫₀^{π/4} √( (sin(2x) + cos(2x))² ) dx = ∫₀^{π/4} ( sin(2x) + cos(2x) ) dx = [ −cos(2x)/2 + sin(2x)/2 ]₀^{π/4} = 1/2 + 1/2 = 1 .

Thus all together E = E₁ + E₂ = (2 e^{−π/4} + 1)/5 + 1 = (2 e^{−π/4} + 6)/5 ≈ 1.382375.

(c) A confirmation of the result above occurs easily, without using the piecewise method to introduce f; from the mathematical point of view it relies on the relation E = E₁ + E₂. Hence we can type

f1(x)=e^(x)*cos(2*x); f2(x)=sqrt(1+sin(4*x))
show(integral(f1(x), x, -pi/4, 0))
show(integral(f2(x), x, 0, pi/4))

and this is what we mean by "manually". Execute this cell yourself to check Sage's output. On the other hand, introducing the function f (in Sage) via the piecewise method gives us the ability to integrate f directly, as follows:

f1(x)=(e^x)*cos(2*x); f2(x)=sqrt(1+sin(4*x))
f=piecewise([[(-pi/2, 0), f1(x)], [(0, 6*pi/16), f2(x)]])
f.integral(x, N(-pi/4), N(pi/4))

However, since this still requires introducing the particular values a = −π/4 and b = π/4 numerically, some error appears. Indeed, executing this block, Sage prints out 1.16777341450385, which differs from the result in (b). At the same time, in this particular example we cannot replace the last command by f.integral(x, -pi/4, pi/4), which is a pitfall of Sage. □

6.D.44. Compute the area S of a figure composed of two parts of the plane bounded by the lines x = 0, x = 1, x = 4, the x-axis, and the graph of the function y = 1/∛(x − 1).

Solution. First realize that 1/∛(x − 1) < 0 for x ∈ [0, 1), 1/∛(x − 1) > 0 for x ∈ (1, 4], and lim_{x→1⁻} 1/∛(x − 1) = −∞, lim_{x→1⁺} 1/∛(x − 1) = +∞; these one-sided limits are checked in Sage right below.
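Since Sage's default power (x − 1)^(−1/3) is complex-valued for x < 1, a quick check of these limits is easiest with the real-valued rewriting 1/∛(x − 1) = −(1 − x)^(−1/3) for x < 1. A minimal sketch:

show(lim(-(1 - x)^(-1/3), x=1, dir="minus"))   # expect -Infinity
show(lim((x - 1)^(-1/3), x=1, dir="plus"))     # expect +Infinity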
The first part of the figure (below the x-axis) is thus bounded by the curves y = 0, x = 0, x = 1, y = 1/∛(x − 1), with an area given by the improper integral

S₁ = −∫₀¹ 1/∛(x − 1) dx ;

while the second part (above the x-axis), which is bounded by the curves y = 0, x = 1, x = 4, y = 1/∛(x − 1), has an area of

S₂ = ∫₁⁴ 1/∛(x − 1) dx .

Since ∫ 1/∛(x − 1) dx = (3/2) ∛((x − 1)²) + C, the sum S₁ + S₂ equals

S = −lim_{x→1⁻} ( (3/2) ∛((x − 1)²) − 3/2 ) + lim_{x→1⁺} ( (3/2) ∛9 − (3/2) ∛((x − 1)²) ) = (3/2) ( 1 + ∛9 ) .

We have shown, among other things, that the given figure has a finite area even though it is unbounded (both from above and from below): if we approach x = 1 from the right, or from the left, its height grows beyond all bounds. Recall here the indeterminate expression of type 0 · ∞. Note that the figure becomes bounded if we restrict ourselves to x ∈ [0, 1 − δ] ∪ [1 + δ, 4] for an arbitrarily small δ > 0. □

6.D.45. Determine the surface and volume of the circular paraboloid created by rotating the part of the parabola y = 2x² for x ∈ [0, 1] around the y-axis.

Solution. The formulas stated in the theory hold for rotation of curves around the x-axis! Hence it is necessary to integrate with respect to the variable y, i.e. to view x as a function of y, with x² = y/2 and y ∈ [0, 2]. This gives

V = π ∫₀² x² dy = π ∫₀² (y/2) dy = π

and

S = 2π ∫₀² √(y/2) √(1 + 1/(8y)) dy = 2π ∫₀² √(y/2 + 1/16) dy = π (17√17 − 1)/24 . □

A) Material on sequences, series and limit processes

We begin with additional material related to the uniform convergence of sequences of functions and of series.

6.D.46. Uniform convergence. Let (fₙ) be a sequence of functions defined on an interval I such that limₙ→∞ fₙ(x) = f(x) for all x ∈ I. Prove that (fₙ) converges uniformly to f on I if and only if limₙ→∞ aₙ = 0, where aₙ := sup_{x∈I} |fₙ(x) − f(x)|.

Solution. Suppose that (fₙ) converges uniformly to f on I. Then, given ε > 0 there exists an integer N such that for any n ≥ N and x ∈ I we have |fₙ(x) − f(x)| < ε. Therefore we also have aₙ = sup_{x∈I} |fₙ(x) − f(x)| ≤ ε for all n ≥ N, and thus aₙ → 0 as n → ∞. Conversely, assume that limₙ→∞ aₙ = 0, where aₙ is defined as above. Then, for any ε > 0 there exists an integer N such that aₙ < ε for all n ≥ N, that is,

sup_{x∈I} |fₙ(x) − f(x)| < ε , for all n ≥ N .

By the definition of the supremum,

|fₙ(x) − f(x)| ≤ sup_{x∈I} |fₙ(x) − f(x)| < ε ,

for sufficiently large n and for all x ∈ I. Thus (fₙ) converges uniformly to f on I. □

6.D.47. Natural logarithm. Expand the natural logarithm f(x) = ln(1 + x) into a power series around 0 and 1, and next determine all x ∈ R for which these series converge.

Solution. First we determine the expansion around the point 0. To expand a function into a power series at a given point is the same as to determine its Taylor expansion at that point. We can easily see that

f⁽ⁿ⁾(x) = [ln(x + 1)]⁽ⁿ⁾ = (−1)ⁿ⁺¹ (n − 1)!/(x + 1)ⁿ ,

so after computing the derivatives at zero, we have f(x) = ln(x + 1) = ln 1 + Σₙ₌₁^∞ aₙ xⁿ, where the coefficients aₙ are given by aₙ = (−1)ⁿ⁺¹ (n − 1)!/n! = (−1)ⁿ⁺¹/n. Therefore, one can write

f(x) = ln(x + 1) = x − (1/2) x² + (1/3) x³ − (1/4) x⁴ + · · · = Σₙ₌₁^∞ ((−1)ⁿ⁺¹/n) xⁿ .

For the radius of convergence, we can use the limit of the quotient of consecutive coefficients of the power series:

r = 1/limₙ→∞ |aₙ₊₁/aₙ| = 1/limₙ→∞ ( (1/(n + 1))/(1/n) ) = 1 .

Hence the series converges for arbitrary x ∈ (−1, 1).
For x = −1 we get the harmonic series (with a negative sign), while for x = 1 we get the alternating harmonic series, which converges by the Leibniz criterion. Thus the given series converges exactly for x ∈ (−1, 1]. Analogously, for the expansion at point 1, we get f(x) = ln(x + 1) = ln(2) + 1 2 (x − 1) − 1 8 (x − 1)2 + 1 3 · 23 (x − 1)3 − . . . = ln(2) + ∞∑ n=1 (−1)n+1 n · 2n (x − 1)n , and for the radius of convergence of this series we get r = 1 limn→∞ an+1 an = 1 limn→∞ 1 2n+1(n+1) 1 2n n = 1 . □ 6.D.48. Expand the function cos2 (x) into a power series at 0, and determine for which x ∈ R it converges. ⃝ 6.D.49. Expand the function sin2 (x) into a power series at 0 and determine for which x ∈ R it converges. ⃝ 6.D.50. Expand the function ln(x3 + 3x2 + 3x + 1) into a power series at 0 and determine for which x ∈ R it converges. ⃝ 6.D.51. Expand the function ln( √ x) into a power series at 1 and determine for which x ∈ R it converges. ⃝ Now is the time to demonstrate various ways to combine the theory of Taylor series with differentiation and integration. Further similar tasks are presented in Section D (see for example 6.D.58). To make the description more engaging, we will also utilize Sage. 6.D.52. On the interval of convergence (−1, 1), determine the sum of the series ∞∑ n=1 n (n + 1) xn . Next confirm your answer in Sage. 597 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Solution. We have ∞∑ n=1 n (n + 1) xn = ∞∑ n=1 n ( xn+1 )′ = ( ∞∑ n=1 n xn+1 )′ = ( ∞∑ n=1 n xn−1 x2 )′ = [ x2 ∞∑ n=1 (xn ) ′ ]′ = [ x2 ( ∞∑ n=1 xn )′ ]′ = [ x2 ( −1 + ∞∑ n=0 xn )′ ]′ = [ x2 ( −1 + 1 1 − x )′ ]′ = [ x2 · 1 (1 − x)2 ]′ = 2x (1 − x)3 , for all x ∈ (−1, 1). To confirm the result in Sage use the sum command, as follows: var("n") show(sum(n*(n+1)*x^n, n, 1, oo).factor()) □ 6.D.53. Use the Taylor series of f(x) = cos( √ x) centered at 0 to approximate the integral I = ∫ 1 0 cos( √ x) dx. Then use Sage to compute the real value of I, and compare the two results. ⃝ 6.D.54. Taylor series of arctan. Consider the function f(x) = 1/(1+x2 ), x ∈ (−1, 1). Use the Taylor series of f centered at 0 and the notion of definite integrals to compute the power series of arctan(x) around 0. Solution. Recall that for |x| < 1 we have 1 1−x = ∑∞ n=0 xn . Thus the replacement of x by −x2 gives (see also 5.E.142) 1 1 + x2 = ∞∑ n=0 (−1)n x2n , |x| < 1 . Hence, for all t ∈ (−1, 1) we have arctan′ (t) = 1 1 + t2 = ∞∑ n=0 ( −t2 )n = ∞∑ n=0 (−1)n t2n . Now, for all x ∈ (−1, 1) we have x∫ 0 arctan′ (t) dt = arctan(x) − arctan(0) = arctan(x). On the other side x∫ 0 ( ∞∑ n=0 (−1)n t2n ) dt = ∞∑ n=0  (−1)n x∫ 0 t2n dt   = ∞∑ n=0 (−1)n 2n + 1 x2n+1 , and thus arctan(x) = ∞∑ n=0 (−1) n 2n + 1 x2n+1 for all x ∈ (−1, 1). □ 6.D.55. Find the Maclaurin series of the function A(x) = ∫ x 0 t cos(t2 ) dt . Solution. The Maclaurin series of cos(t) is 1 − t2 2! + t4 4! − · · · = ∞∑ n=0 (−1)n (2n)! t2n . Using this we see that t cos(t2 ) = t ∞∑ n=0 (−1)n t4n (2n)! = ∞∑ n=0 (−1)n t4n+1 (2n)! . 598 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Therefore, A(x) = ∫ x 0 t cos(t2 ) dt = ∫ x 0 ∞∑ n=0 (−1)n (2n)! t4n+1 dt = ∞∑ n=0 [(−1)n (2n)! ∫ x 0 t4n+1 dt ] = ∞∑ n=0 (−1)n (4n + 2)(2n)! x4n+2 . Let us write down the first few terms of A, A(x) = 1 2 x2 − 1 12 x6 + 1 240 x10 − 1 10080 x14 + · · · . A confirmation of this expression via Sage is based on the method presented in 6.A.17. However, one should now combine this method with the command integral(g(t), t, 0, x), to introduce a function of the form ∫ x 0 g(t) dt, where g(t) = t cos(t2 ). 
Thus the implementation goes as follows: var("t"); assume(x>0) f(x)=integral(t*cos(t^2), t, 0, x) tf=taylor(f, x, 0, 6) T=tf.power_series(QQ); show(T) □ 6.D.56. For the convergent series ∞∑ n=0 (−1)n √ n + 100 estimate the error of the approximation of its sum by the partial sum s9999. ⃝ 6.D.57. Approximate the expression given below with an error lesser than 1/10: 2∫ 1 ( x − cos10 (x) 10 ) ln(x) dx . ⃝ 6.D.58. Find the Maclaurin series of the function F(x) = ∫ x 0 sin(t) t dt, with x ∈ R. Next check your answer via Sage, and plot the graph of the function F for certain x in the domain of F. Solution. Recall that sin(t) = t − t3 3! + t5 5! − · · · = ∞∑ n=0 (−1)n (2n + 1)! t2n+1 for all t ∈ R, hence sin(t) t = 1 − t2 3! + t4 5! − · · · = ∞∑ n=0 (−1)n (2n + 1)! t2n . By integrating this series we get the series expansion of F(x), i.e., F(x) = ∫ x 0 sin(t) t dt = ∫ x 0 [ ∞∑ n=0 (−1)n (2n + 1)! t2n ] dt = ∞∑ n=0 [ (−1)n (2n + 1)! ∫ x 0 t2n dt ] = ∞∑ n=0 [ (−1)n (2n + 1)! t2n+1 (2n + 1) ]x 0 = ∞∑ n=0 (−1)n (2n + 1)(2n + 1)! x2n+1 . Hence the series expansion of F is given by F(x) = x − x3 3 · 3! + x5 5 · 5! − · · · + (−1)n (2n + 1)(2n + 1)! x2n+1 + · · · We can easily check this result in Sage by the method described in 6.A.17 and 6.D.55, that is, 599 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS var("t"); assume(x>0); F(x)=integral(sin(t)/t, t, 0, x) tF=taylor(F, x, 0, 7); T=tF.power_series(QQ); show(T) This cell prints out the following expression: x − 1 18 x3 + 1 600 x5 − 1 35280 x7 + O(x8 ), hence verifies our answer. In order to plot the graph of F for certain x we can for example add in the above cell the command show(plot(F, x, −10, 10)). □ 6.D.59. Find the function f to which, for x ∈ R, the sequence of functions fn(x) = n2 x3 n2x2+1 , n ∈ N. converges. Is this convergence uniform on R? ⃝ 6.D.60. Does the series ∞∑ n=1 n x n4+x2 , kde x ∈ R, converge uniformly on the real line? ⃝ 6.D.61. Approximate (a) the cosine of ten degrees with accuracy of at least 10−5 ; (b) the definite integral ∫ 1/2 0 dx x4+1 with accuracy of at least 10−3 . ⃝ 6.D.62. Determine the power series centered at x0 = 0 of the function f(x) = x∫ 0 et2 dt, x ∈ R. ⃝ 6.D.63. Using the integral test, find the values a > 0 for which the series ∞∑ n=1 1 na converges. ⃝ 6.D.64. For which x ∈ R does the series ∞∑ n=1 ln(n!) nx converge? ⃝ 6.D.65. Determine whether the series ∞∑ n=1 (−1)n−1 tan 1 n √ n converges absolutely, converges conditionally, diverges to +∞, diverges to −∞, or none of the above. (such a series is sometimes said to be oscillating). ⃝ 6.D.66. Calculate the series ∞∑ n=1 1 n·3n with the help of an appropriate power series. ⃝ 600 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Solutions of the exercises 6.A.5. For all x ∈ (−1, 1) we have ln 1 + x 1 − x = ln (1 + x) − ln (1 − x). Therefore, it is useful to consider the auxiliary function φ(x) := ln (ax + 1) , x ∈ (−1, 1) , a = ±1 . For this function we easily compute φ′ (x) = a ax + 1 , φ′′ (x) = −a2 (ax + 1)2 , φ(3) (x) = 2a3 (ax + 1)3 , φ(4) (x) = −6a4 (ax + 1)4 , for all x ∈ (−1, 1). Motivated by these formulas, one may guess that φ(n) (x) = (−1)n−1 (n − 1)! an (ax + 1)n , x ∈ (−1, 1) , n ∈ N . (∗) Let us use the principle of mathematical induction to prove (∗). As we have seen, it holds for n = 1, 2, 3, 4 and we assume that (∗) holds for some other k ∈ N. Then, a direct computation shows that φ(k+1) (x) = ( φ(k) (x) )′ = ( (−1)k−1 (k − 1)! ak (ax + 1)k )′ = (−1)k−1 (k − 1)! ak (−k) a (ax + 1)k+1 = (−1)k k! 
aᵏ⁺¹/(ax + 1)ᵏ⁺¹ ,

and hence the relation (∗) is true for all n ∈ N. Let us now return to our initial task. By (∗) we may compute the nth derivative of ln(1 ± x), that is,

ln⁽ⁿ⁾(1 + x) = (−1)ⁿ⁻¹ (n − 1)!/(x + 1)ⁿ , ln⁽ⁿ⁾(1 − x) = −(n − 1)!/(−x + 1)ⁿ , x ∈ (−1, 1) .

Therefore, we finally deduce that ( ln (1 + x)/(1 − x) )⁽ⁿ⁾ = (n − 1)! ( 1/(1 − x)ⁿ − (−1)ⁿ/(1 + x)ⁿ ) for all x ∈ (−1, 1) and n ∈ N.

6.A.6. The answer is f⁽¹²⁾(x) = 2¹² e²ˣ + cos(x).

6.A.7. The answer is f⁽²⁶⁾(x) = −sin(x) + 2²⁶ e²ˣ.

6.A.8. Recall that the third-order Taylor expansion of a function f around the point a = 0 is given by

T³₀f(x) = f(0) + f′(0) x + (f′′(0)/2) x² + (f⁽³⁾(0)/6) x³ .

For the sine function f(x) = sin(x) we have f′(0) = cos(0) = 1, f⁽²⁾(0) = −sin(0) = 0, f⁽³⁾(0) = −cos(0) = −1, and f(0) = 0. Thus we get

T³₀ sin(x) = x − (1/6) x³ = x − x³/3! .

In a similar way we obtain

T³₀ cos(x) = 1 − x²/2 , T³₀ eˣ = 1 + x + x²/2 + x³/6 , T³₀ ln(x + 1) = x − x²/2 + x³/3 .

6.A.10. The answer is T⁴₁f(x) = 2(x − 1) − (x − 1)² + (2/3)(x − 1)³ − (1/2)(x − 1)⁴. To obtain this expression via Sage, we may type

f(x)=ln(x^2); T(x)=taylor(f(x), x, 1, 4); show(T(x))

Adding in this cell the commands

p=plot(f(x), x, 0, 2, color="black", thickness=1.1)
p+=plot(T(x), x, 0, 2, linestyle="--", thickness=1.1)
p+=text(r"$f$", (1.95,1.8), fontsize=14, color="black")
p+=text(r"$T$", (1.95,0.8), fontsize=14, color="blue")
show(p)

one can produce the required graphs, which we present here (the graph of the Taylor polynomial is coloured blue). Observe that the approximation is accurate enough. In fact, try to verify yourself that as the degree of the Taylor polynomial increases, the approximations become more accurate.

6.A.13. Consider the exponential function f(x) = eˣ restricted to the closed interval [0, x]. The third-order Taylor polynomial of f(x) = eˣ centered at a = 0 has the form

T³₀ eˣ = 1 + x + x²/2! + x³/3! .

Now, f is a smooth function, and according to the theorem in 6.1.3 there exists some c ∈ (0, x) such that

f(x) = eˣ = T³₀ eˣ + R(x) , R(x) = (x⁴/4!) f⁽⁴⁾(c) .

We have f⁽⁴⁾(x) = eˣ, and hence we see that R(x) = (x⁴/4!) e^c > 0. Thus f(x) − T³₀f(x) ≥ 0, which proves our claim.

6.A.14. In this case we have k = 2, and hence we compute the error −x³/(3(1 + x)³).

6.A.15. We compute T⁴₀ sin(x) = x − x³/6. In Sage the task can be solved by the block

f(x)=sin(x); F=plot(f(x), x, -pi, pi)
T4(x)=taylor(f(x), x, 0, 4); show(T4(x))
T=plot(T4(x), x, -pi, pi, color="black"); show(F+T)

This confirms the expression of T⁴₀ sin(x) and produces the graphs of f and T⁴₀f, which we present in the figure below (in this figure, the graph of f is coloured blue). Next we may approximate sin 1° by the expression sin 1° ≈ π/180 − π³/(6 · 180³).

6.A.16. The answer is 1 − π²/(10² · 2) + π⁴/(10⁴ · 4!).

6.A.18. One can use the known formula 1/(1 + x) = Σₙ₌₀^∞ (−x)ⁿ = Σₙ₌₀^∞ (−1)ⁿ xⁿ, corresponding to a geometric series. By differentiating it, we obtain

−1/(1 + x)² = ( Σₙ₌₀^∞ (−1)ⁿ xⁿ )′ = Σₙ₌₁^∞ (−1)ⁿ n xⁿ⁻¹ .

Thus we see that

1/(1 + x)² = Σₙ₌₁^∞ (−1)ⁿ⁺¹ n xⁿ⁻¹ = 1 − 2x + 3x² − 4x³ + 5x⁴ − 6x⁵ + 7x⁶ − 8x⁷ + · · ·

for all x ∈ (−1, 1). Finally, to confirm this expression by Sage we use the same method as in 6.A.17, i.e.,

f(x)=1/(1+x)^2; tf=taylor(f, x, 0, 10); T=tf.power_series(QQ); show(T)

Sage's answer has the form 1 − 2x + 3x² − 4x³ + 5x⁴ − 6x⁵ + 7x⁶ − 8x⁷ + 9x⁸ − 10x⁹ + 11x¹⁰ + O(x¹¹).

6.A.19.
The first limit equals to −1/6, while the second limit takes the value −1. 6.A.22. An example is given by f(x) = x4 . The point x0 = 0 is a local minimum of f but f′ (0) = 0 and f′′ (0) = 0. 6.A.26. For the solution one can adopt the method presented in 6.A.25, which means that we should use appropriately the commands find_local_maximum(f, a, b) and find_local_minimum(f, a, b). The implementation takes the form f(x)=(1/8)*x^8-(3/5)*x^5-3*x+9 fp=plot(f(x), x, -2, 2, color="black", legend_label=r"$f(x)$") show(fp) show(find_local_maximum(f, -2, 2)) show(find_local_minimum(f, -2, 2)) g(x)=x^2-(1/6)*x^3 gp=plot(g(x), x, -2, 2, color="black", legend_label=r"$g(x)$") show(gp) show(find_local_maximum(g, -2, 2)) show(find_local_minimum(g, -2, 2)) h(x)=x/sqrt(x^2+4) hp=plot(h(x), x, -2, 2, color="black", legend_label=r"$h(x)$") show(hp) show(find_local_maximum(h, -2, 2)) show(find_local_minimum(h, -2, 2)) k(x)=ln(x^3+8) kp=plot(k(x), x, -2, 2, color="black", legend_label=r"$k(x)$") show(kp) show(find_local_maximum(k, -2, 2)) show(find_local_minimum(k, -2, 2)) Sage’s answers look like as follows: For f : (66.19999293835187, −1.9999999605494494) , (3.1326992998650347, 1.525960530898268) . For g : (5.333333096630033, −1.9999999605494494) , ( 9.947977045494631 × 10−19 , 9.97395460544528e-10 ) . For h : (0.7071067742126095, 1.9999999605494494) , (−0.7071067742126095, −1.9999999605494494) . For k : (2.7725886926518686, 1.9999999605494494) , (−14.563311203591136, −1.9999999605494494) . Here, recall that Sage first presents the maximal/minimum value f(x0) at the extreme point x0, and next the point x0 itself. It also produces the plots of the functions at hand, which are listed below. 603 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.A.29. The function f(x) = xa is twice differentiable on (0, +∞) with f′ (x) = a xa−1 and f′′ (x) = a(a − 1) xa−2 . By assumption we have a > 0 and (a − 1) < 0. It is also easy to see that xa−2 > 0 for all x > 0. Thus f′′ (x) < 0 for all x > 0 which means that f is concave on (0, +∞). 6.A.30. Based on the cell f(x)=x/(x^2-1) a=lim(f(x), x=1, dir="right") b=lim(f(x), x=1, dir="left") c=lim(f(x), x=-1, dir="right") d=lim(f(x), x=-1, dir="left") show(a,’,’, b,’,’, c,’,’, d) we see that a := lim x→1+ f(x) = +∞ , b := lim x→1− f(x) = −∞ , c := lim x→−1+ f(x) = +∞ , d := lim x→−1− f(x) = −∞ . Thus, the lines x = ±1 are vertical asymptotes of f. It is also easy to see that the x-axis y = 0 is the unique horizontal asymptote of f (this is because limx→∞ f(x) = 0 = limx→−∞ f(x)). However, asymptotes with slope cannot exist, since f(x) = x/(x2 − 1) = g(x)/h(x), and the degree of g(x) is smaller than the degree of h(x). Recall that using Sage, to plot a rational function it is wise to restrict the y-values. On the other hand, Sage often sketches the vertical asymptotes via the option detect_poles, which we should appropriately include inside the command plot. In particular, the cell p=plot(x/(x^2-1), x, -5, 5, ymin=-10, ymax=10, detect_poles="show") produces the figure posed above. However, one has the chance to set detect_poles to ”True”, (or ”False”), to obtain only the graph of f. For an example with horizontal asymptotes see the 6.A.31. 6.A.31. (a) The given function is continuous and differentiable on R as a fraction of differentiable functions. We see that its first derivative is everywhere positive, f′ (x) = 2ex (ex + 1) 2 > 0, x ∈ R . This implies that f is strictly increasing for all x ∈ R. 
In case you want to verify this claim in Sage, use the cell f(x) = (exp(x) − 1)/(exp(x) + 1); d1(x) = diff(f, x).factor(); bool(d1(x) > 0) 604 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Now, since the domain of f is the whole real line, f cannot have vertical asymptotes. On the other hand, the lines y = ±1 are horizontal asymptotes of f since lim x→+∞ ex − 1 ex + 1 = lim x→+∞ ex ex = 1, lim x→−∞ ex − 1 ex + 1 = 0 − 1 0 + 1 = −1 . Combining these conclusions one deduces that the range is the open interval (−1, 1), see also the graph of f below. It seems that Sage has not an “built-in” method for horizontal asymptotes. But we can proceed “manually” and introduce theese lines via the command line. For instance, to sketch the graph of f, together with its asymptotes y = ±1, use the cell f(x)=(exp(x)-1)/(exp(x)+1) gf=plot(f, x, -5, 5, ymin=-1.1, ymax=1.1, color="black") a1 = line([(5,1) ,(-5,1)], linestyle="--", thickness =0.8) a2 = line([(5,-1) ,(-5,-1)], linestyle="--", thickness =0.8) (gf+a1+a2).show(ticks=1,tick_formatter =1) Executing this block, you will get the following figure In this figure we have restricted the y-values, by the options ymin = −1.1, and ymax = 1.1, though not necessary (more experienced programmers could jump this step). (b) We know that f is strictly increasing in R, hence it does not admit (local/global) extremes. The second derivative of f is given by f′′ (x) = 2(1−ex ) ex (ex +1)3 , for all x ∈ R. Thus, x = ln(1) = 0 is the unique solution of the equation f′′ (x) = 0 and we have • f′′ (x) > 0 for all x ∈ (−∞, 0) , • f′′ (x) < 0 for all x ∈ (0, +∞) . Therefore, f is strictly convex for all x ∈ (−∞, 0) and strictly concave for all x ∈ (0, +∞), and since f is continuous on 0 and its graph there changes from convex to concave, x = 0 is the unique inflection point of f (with value f(0) = 0). Some of the previous conclusions occur really easy in Sage, as well. For instance, by typing f(x)=(exp(x)-1)/(exp(x)+1) d2(x)=diff(f, x, 2); show(solve(d2(x)==0, x)) we confirm the inflection point of f. 6.A.34. By assumption, f is differentiable on R. Hence, if a ∈ R is a stationary point of f we have f′ (a) = 0. On the other hand, a differentiation of the defining equation gives 2f(x)f′ (x) − f(x) − xf′ (x) + 4x = 0 , x ∈ R . (∗) By replacing x = a in this equation, we find that f(a) = 4a. Combining this with the defining equation, the replacement x = a gives f2 (a) − af(a) + 2a2 − 7 = 0, that is, (4a)2 − 4a2 + 2a2 − 7 = 0, which is equivalent to 14a2 = 7. Thus a2 = 1 2 , i.e., a = ± √ 2 2 . Next, f is twice differentiable. Hence, if x = b ∈ R is an inflection point of f we should have f′′ (b) = 0. Now, taking the derivative of (∗), we see that 2 ( f′ (x) )2 + 2f(x)f′′ (x) − 2f′ (x) − xf′′ (x) + 4 = 0 , x ∈ R . 605 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS For x = b this relation reduces to ( f′ (b) )2 − f′ (b) + 2 = 0, which has negative discriminant, ∆ = −7 < 0. This implies that f does not admit inflection points. 6.A.35. We have f′ (x) = 2x ex2 , g′ (x) = 2x x2+1 , and k′ (x) = ex e2x +1 , for all x ∈ R. Thus, we get df(x) = (2x ex2 ) dx, dg(x) = 2x x2+1 dx, and dk(x) = ex e2x +1 dx, respectively. An implementation of these differentials in Sage, essentially relies on the command diff (or derivative). 
For instance, a cell based on this method and which includes dx as a symbolic variable has the form var("x, dx") f(x)=e**(x**2); f1(x)=diff(f, x) df(x)=f1*dx #declare the differential of f g(x)=ln(x**2+1); g1(x)=diff(g, x) dg(x)=g1*dx #declare the differential of g k(x)=arctan(e^x); k1(x)=diff(k, x) dk(x)=k1*dx #declare the differential of k show(df(x), ",", dg(x), ",", dk(x)) Chek yourself Sage’s output. 6.A.41. First we need to introduce f and use the command plot to sketch it. We also need to indicate the nodes x, x+h, f(x) and f(x + h), and draw the slopes corresponding to forward/backward differences, together with the tangent line. Hence, we may appropriately use the commands point and line (recall that these are build-in functions in Sage for the construction of points and lines, respectively). To keep our code simple, we agree to distinguish the slopes via colours. So, let us use green and red colours to illustrate the forward/backward difference slopes, and orange colour for the tangent line of f trough x. We can now encode the solution, by the following block: f(x)=exp(x^(2))-0.9 p=plot(f(x), x, -0.4, 0.5, ticks=[0.1,None]) p+=point((0.2, f(0.2)), size=30) p+=point((0.3, f(0.3)), size=30) p+=point((0.2, 0), size=30) p+=point((0.3, 0), size=30) p+=line([(0.2,f(0.2)),(0.3,f(0.3))],color="green") p+=line([(0.2,0), (0.2,f(0.2))], linestyle="--", rgbcolor=(0.2,0.2,0.2)) p+=line([(0.3,0), (0.3,f(0.3))], linestyle="--", rgbcolor=(0.2,0.2,0.2)) p+=point((0.1, 0), size=30) p+=point((0.1, f(0.1)), size=40) p+=line([(0.1,f(0.1)),(0.2,f(0.2))],color="red") p+=line([(0.1,0), (0.1,f(0.1))], linestyle="--",rgbcolor=(0.2,0.2,0.2)) f1=diff(f(x), x)(x=0.2); l1(x)=f(0.2)+f1*(x-0.2) p+=plot(l1(x), x, 0.01, 0.45,color="orange") p+=text(r"$x$", (0.22, 0.01),fontsize=12) p+=text(r"$x+h$", (0.335, 0.01),fontsize=12) p+=text(r"$x-h$", (0.14, 0.01),fontsize=12) p+=text(r"$f(x)$", (0.19, 0.16),fontsize=12) p+=text(r"$f(x+h)$", (0.24, 0.20),fontsize=12) p+=text(r"$f(x-h)$", (0.07, 0.128),fontsize=12) p+=text(r"$y=f(x)$", (0.35, 0.31),fontsize=12) show(p) This block has as output the following figure: 606 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.A.42. The central difference approximation of the value f′ (0.2) = 0.4163243 with step h = 0.01 is given by f(0.2 + 0.01) − f(0.2 − 0.01) 0.02 ≈ 0.4163671 . This proves the claim, comparing with the results in 6.A.40. 6.A.43. The axes of the given ellipse are the coordinate axes x and y, and its vertices are the points P = [ √ 2, 0], Q = [− √ 2, 0], R = [0, 1], S = [0, −1] , see also the given figure below. Let us first compute the curvature at R. We can consider the coordinate y as a function of x (determined uniquely in a neighbourhood of R ). Thus by differentiating the equation of the ellipse with respect to x we get 2x + 4yy′ = 0. Hence y′ = − x 2y . Differentiating this equation once more with respect to x we obtain the second derivative, y′′ = − 1 2 ( 1 y − xy′ y2 ) . At the point R, we see that y′ (R) = 0 and y′′ (R) = −1 2 . According to 6.1.13, the radius of the osculation circle will be given by the expression (1 + (y′ )2 ) 3 2 (y′′)2 evaluated at the point in question. This gives the values ±2 (the sign tells us the circle will be “below” the graph of the function). Now, the ideas in 6.1.13 and 6.1.16 imply that the centre will be in the direction opposite to the normal line of this curve, i.e., on the y-axis. Since the radius is 2, the centre will be at the point [0, 1 − 2] = [0, −1]. 
In total, the equation of the osculation circle of the given ellipse at R will be x² + (y + 1)² = 4. Analogously, we can determine the equation of the osculation circle at the point S, where we get the equation x² + (y − 1)² = 4. Finally, the curvature of the ellipse at these points equals 1/2 (the absolute value of the curvature of the graph of the function).

For determining the osculation circle at the point [√2, 0], we consider the equation of the ellipse as a formula for the variable x depending on the variable y, i.e., x as a function of y (near this point, y is not determined uniquely as a function of x, so we cannot use the previous procedure; technically, it would end up with division by zero). Then, by differentiation we obtain 2xx′ + 4y = 0, thus x′ = −2y/x, and x′′ = −2(1/x − yx′/x²). Hence at the point [√2, 0] we have x′ = 0 and x′′ = −√2, and the radius of the circle of osculation is ρ = 1/√2 = √2/2 according to 6.1.13. The normal line is heading to −∞ along the x-axis at the point [√2, 0], thus the center of the osculation circle will be on the x-axis, on the other side, at distance √2/2, hence at the point [√2 − √2/2, 0] = [√2/2, 0]. In total, the equation of the circle of osculation at the vertex [√2, 0] will be (x − √2/2)² + y² = 1/2. The curvature at both of these vertices equals √2.

Note that the vertices of an ellipse (more generally, the vertices of a closed smooth curve in the plane) can be defined as the points at which the function of curvature has an extreme. The ellipse having four vertices isn't a coincidence: the so-called "four vertex theorem" states that a closed curve of the class C³ has at least four vertices. A curve of the class C³ is locally given parametrically by points [f(t), g(t)] ∈ R², t ∈ (a, b) ⊂ R, where f and g are functions of the class C³(R). Thus the curvature of the ellipse at any of its points lies between its curvatures at its vertices, i.e. between 1/2 and √2.

6.A.44. (a) Since the functions x = x(t) and y = y(t) are differentiable, by the chain rule one has

y′(t) = dy/dt = (dy/dx)(dx/dt) =⇒ dy/dx = (dy/dt)/(dx/dt) ,

assuming that x′(t) ≠ 0 for all t ∈ I.

(b) For α(t) we have dy/dx = (dy/dt)/(dx/dt) = 6t²/1 = 6t². Thus (dy/dx)|_{t=0} = 0 and (dy/dx)|_{t=2} = 24. Alternatively, we have x = 1 + t, thus t = x − 1 and hence y = 2(x − 1)³. Thus y′(x) = dy/dx = 6(x − 1)² = 6t². Next, for β(t) we have

dy/dx = (dy/dt)/(dx/dt) = −(1 + t²)²/(2t(1 + t)) . (♭)

Note that in this case we do not have x′(t) ≠ 0 for all t ∈ I; in particular, the derivative dy/dx is not defined at t = 0. However, at t = 2 one gets (dy/dx)|_{t=2} = −25/12. To confirm the situation in an alternative way, we need to eliminate t, which can be done via the first parametric equation x = 1/(1 + t²). We get t = √((1 − x)/x) (notice that x = 1 for t = 0 and x = 1/5 for t = 2). Thus y(x) = ln( 1 + √((1 − x)/x) ), and now one can directly differentiate y with respect to x. Let's do this quickly in Sage:

var("t")
f(x)=ln(1+sqrt((1-x)/x))
show(diff(f, x))
show(diff(f, x)(x=1/5))
show(diff(f, x)(x=1))
show(diff(f(x), x)(x=1/(1+t^2)).full_simplify())

The very first show command returns the derivative y′(x) (in terms of x), the second the value y′(1/5) = −25/12, while the third one returns an error, since y′(x) is not defined there. Finally, the last command confirms the expression given in (♭).
(c) Let us now pose the code for solving this task: var("t") F1(t)=(diff(2*t^3, t)/diff(t^2+1, t)).factor(); show(F1) F2(t)=(diff(ln(1+t), t)/diff(1/(1+t^2), t)).factor(); show(F2) p=parametric_plot((t^2+1, 2*t^3), (t,0,2), legend_label=r"$\alpha(t)$", color="black", thickness=2) p+=parametric_plot((1/(1+t^2), ln(1+t)), (t,0,2),legend_label=r"$\beta(t)$",color="grey",thickness=2) p+=plot(F1(t), t, 0, 2, color="black", linestyle="--", thickness=1.5, legend_label=r"$\left(\frac{dy/dt}{dx/dt}\right)_{\alpha}$") p+=plot(F2(t), t, 0, 2, color="darkgrey", linestyle="--", thickness=1.5, legend_label=r"$\left(\frac{dy/dt}{dx/dt}\right)_{\beta}$") p.show(ymin=-4, ymax=4, xmax=4, aspect_ratio=1/2) This block confirms the given expressions of the derivatives dy dx and produces the graphs of the curves α, β, together with the graphs of the corresponding derivatives. 608 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.A.46. Let us present the block with some comments within: var("t, s, r") x(t)=t-sin(t); y(t)=1-cos(t) show(diff(y, t)/diff(x, t)) # declare the slope of the cycloid at general point y0=y(2*pi/3);show(y0) # declare the y-coordinate of the point P x0=x(2*pi/3);show(x0) # declare he x-coordinate of the point P k=diff(y, t)(t=2*pi/3)/(diff(x, t)(t=2*pi/3)); show(k) # declare the slope of the cycloid at 2pi/3 l(s)=y0+k*s-k*x0; show(l) #declare the tangent line passing through P y00=y(4*pi/3); show(y00) # declare the y-coordinate of the point Q x00=x(4*pi/3); show(x00) # declare the x-coordinate of the point Q kk=diff(y, t)(t=4*pi/3)/(diff(x, t)(t=4*pi/3)); show(kk) # declare the slope of the cycloid at 4pi/3 L(r)=y00+kk*r-kk*x00; show(L) #declare the tangent line passing through Q show(solve(l(r)-L(r)==0, r)) #declare the intersection point of the 2 tangent lines p=parametric_plot((x(t), y(t)), (t,0,2*pi), color="black", thickness=2) p+=plot(l(s), -0.5, 4, color="darkgrey") p+=plot(L(r),2, 6, color="grey") p+=point([x(2*pi/3), y(2*pi/3)], size=30) # the point P p+=point([x(4*pi/3), y(4*pi/3)], size=30) # the point Q p+=point([pi, l(pi)], size=30) # the intersection point R p+=text(r"$P$", (x0, y0+0.2),fontsize=14, rgbcolor=(0.1,0.2,0.5)) p+=text(r"$Q$", (x00, y00+0.2),fontsize=14, rgbcolor=(0.1,0.2,0.5)) p+=text(r"$R$", (pi, l(pi)+0.2),fontsize=14, rgbcolor=(0.1,0.2,0.5)) p.show(gridlines="true") 6.B.2. By definition, we have F′ (x) = f(x) for all x ∈ R. Combining this with the relation 4F(x) = f(x) we get F′ (x) = 4F(x), that is, F′ (x) F(x) = 4 (since F(x) = (1/4)f(x) ̸= 0 for all x ∈ R). An integration then gives ∫ F′ (x) F(x) dx = ∫ 4 dx ⇐⇒ ln | F(x) | = 4x + c ⇐⇒ | F(x) | = e4x+c , for some constant c. Thus F(x) = e4x+c or F(x) = − e4x+c . Because F(4) = f(4) 4 = 1 > 0 the second case is omitted. In particular, the relation F(4) = 1 gives e16+c = 1 = e0 , that is, 16 + c = 0 or equivalently c = −16. Thus F(x) = e4(x−4) and f(x) = F′ (x) = 4 e4(x−4) . 6.B.4. One can use the identities sin2 (x) = ( 1 − cos(2x) ) /2 and cos2 (x) = ( 1 + cos(2x) ) /2. In particular, we see that A = ∫ ( 1 − cos(2x) 2 ) ( 1 + cos(2x) 2 ) dx = ∫ 1 − cos2 (2x) 4 dx = 1 4 ∫ dx − 1 4 ∫ cos2 (2x) dx = x 4 − 1 4 ∫ ( (1 + cos(4x) 2 ) dx = x 4 − 1 8 ∫ dx − 1 8 ∫ cos(4x) dx = x 8 − sin(4x) 32 + C . 6.B.7. (a) We have (sin(x))′ = cos(x) and hence 609 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS K = ∫ x(sin(x))′ dx = x sin(x) − ∫ (x)′ sin(x) dx = x sin(x) + cos(x) + C , C ∈ R . 
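As with the other Sage confirmations in this chapter, the result in (a) can be reproduced by a one-line check (the output should match up to the integration constant):

show(integral(x*cos(x), x))   # expect x*sin(x) + cos(x)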
(b) Recall that (x2 − x)′ = 2x − 1 and hence M = ∫ (x2 − x)′ ln(x) dx = (x2 − x) ln(x) − ∫ (x2 − x)(ln(x))′ dx = (x2 − x) ln(x) − ∫ (x − 1) dx = (x2 − x) ln(x) + x − x2 2 + C . (c) In this case we will use the relations (− 1 β cos(βx))′ = sin(βx) and ( 1 β sin(βx))′ = cos(βx), for all x ∈ R. Thus N = ∫ ex sin(βx) dx = ∫ ex (− 1 β cos(βx))′ dx = − 1 β ex cos(βx) + 1 β ν , (∗) where ν = ∫ ex cos(βx) dx. In the same vein we see that ν = ∫ ex ( 1 β sin(βx))′ dx = 1 β ex sin(βx) − 1 β N . (∗∗) Therefore, a combination of (∗) and (∗∗) gives N + 1 β2 N = − 1 β ex cos(βx) + 1 β2 ex sin(βx) + C , that is, N = β β2 + 1 ex ( 1 β sin(βx) − cos(βx) ) + C, C ∈ R. A word of caution: One should remember to identify multiples of C with C itself, as we did in cases (b) and (d), for example. Moreover, we could even write N(x) = β β2+1 ex ( 1 β sin(βx) − cos(βx) ) + C, x ∈ R, and similarly for the previous cases (we avoid this to save some space). 6.B.8. All the solutions can be obtained by applying the rule ∫ F(x)G′ (x) dx = F(x)G(x) − ∫ F′ (x)G(x) dx (integration by parts). Notice the first non-trivial part in this rule is the function G(x), which we should “guess” using its first derivative G′ (x). Next we should be able to compute the integral ∫ F′ (x)G(x) dx. (a) We have F(x) = x, F′ (x) = 1, G′ (x) = 1 cos2(x) and hence G(x) = tan(x), such that (tan(x))′ = 1 cos2(x) . Therefore, an application of the preceding rule gives ∫ x cos2(x) dx = ∫ x(tan(x))′ dx = x tan(x) − ∫ tan(x) dx = x tan(x) + ∫ − sin(x) cos(x) dx = x tan(x) + ∫ (cos(x))′ cos(x) dx = x tan(x) + ln |cos(x)| + C . Above, in the final step we applied the identity ∫ f′ (x) f(x) dx = ln | f(x) | + C, for some constant C (see 6.2.3). (b) In this case we have F(x) = x2 , F′ (x) = 2x, G′ (x) = e−3x and hence G(x) = −1 3 e−3x . Therefore, at a first step we obtain ∫ x2 e−3x dx = ∫ x2 ( − 1 3 e−3x )′ dx = − 1 3 x2 e−3x + 2 3 ∫ x e−3x dx . Next we see that∫ x e−3x = ∫ x ( − 1 3 e−3x )′ dx = − 1 3 x e−3x + 1 3 ∫ e−3x dx = − 1 3 x e−3x − 1 9 e−3x +C , for some constant C ∈ R, where in the final step we used the relation ∫ eax dx = eax a + C (see 6.2.3). A combination of these two relations gives ∫ x2 e−3x = − 1 3 x2 e−3x + 2 3 ( − 1 3 x e−3x − 1 9 e−3x ) + C = − 1 3 e−3x ( x2 + 2 3 x + 2 9 ) + C , for some constant C ∈ R. (c) In this case we have F(x) = cos(x), F′ (x) = − sin(x), G′ (x) = cos(x) and hence G(x) = sin(x). If we set for convenience M = ∫ cos2 (x) dx, we thus have M = ∫ cos2 (x) dx = ∫ cos(x) cos(x) dx = ∫ cos(x) ( sin(x) )′ dx = cos(x) sin(x) + ∫ sin2 (x) dx = cos(x) sin(x) + ∫ ( 1 − cos2 (x) ) dx = cos(x) sin(x) + ∫ 1 dx − ∫ cos2 (x) dx = cos(x) sin(x) + x − M + C , 610 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS that is, 2M = cos(x) sin(x) + x + C or equivalently, M = 1 2 ( cos(x) sin(x) + x ) + C, for some constant C ∈ R. We emphasise that usually suitable simplifications or substitutions, lead to the desired result faster than integration by parts. For instance, by using the identity cos2 (x) = 1 2 ( 1 + cos(2x) ) , x ∈ R we easily obtain ∫ cos2 (x) dx = ∫ 1 2 dx+ ∫ 1 2 cos(2x) dx = x 2 + sin(2x) 4 +C = x 2 + 2 sin(x) cos(x) 4 +C = 1 2 ( x+sin(x)·cos(x) ) +C , for some constant C ∈ R. 6.B.10. Set t = x3 + 4 such that dt = 3x2 dx. Thus dx = dt 3x2 and ∫ ex3 +4 x2 dx = 1 3 ∫ et dt = et 3 + C = ex3 +4 3 + C , for some constant C. In Sage one can type show(integral((e^(x^3+4))*x^2, x)) 6.B.11. 
The main idea of the so called first substitution method is writing the integral in the form of ∫ f ( φ(x) ) φ′ (x) dx , (⋆) for certain functions f and φ. Then the substitution y = φ(x) gives that dy = φ′ (x) dx, and the integral above reads as ∫ f(y) dy . Based on this idea let us set y = cos(x), such that dy = − sin(x) dx. Then we obtain ∫ cos5 (x) sin(x) dx = − ∫ cos5 (x) ( − sin(x) ) dx = − ∫ y5 dy = − y6 6 + C = − cos6 (x) 6 + C , for some arbitrary constant C ∈ R. 6.B.12. We will use the so called second substitution method, which means a reduction of ∫ f(y) dy to the form (⋆) presented in the previous task 6.B.11, for y = φ(x). In particular, in our example we want to determine the primitive function of function f(x) = tan4 (x). Thus it is sensible to consider the substitution u = tan(x), and hence x = arctan(u). This gives dx = du 1+u2 , and hence we get ∫ sin4 (x) cos4(x) dx = ∫ u4 1 + u2 du = ∫ u2 − 1 + 1 u2 + 1 du = u3 3 − u + arctan(u) + C = tan3 (x) 3 − tan(x) + x + C , for some constant C ∈ R. 6.B.13. Let us include the solutions in one cell: show(integral((cos(x))^5*sin(x), x)); show(integral((sin(x))^4/((cos(x))^4), x)) Check yourself that this verifies the given answers in 6.B.11 and 6.B.12, respectively. 6.B.14. This is based on appropriate substitution, in particular ∫ cos5 (x) sin2 (x) dx = ∫ ( cos2 (x) )2 sin2 (x) cos(x) dx = ∫ ( (1 − sin2 (x) )2 sin2 (x) cos(x) dx t=sin(x) = dt=cos(x) dt ∫ ( 1 − t2 )2 t2 dt = ∫ ( 1 − 2t2 + t4 ) t2 dt = ∫ ( t6 − 2t4 + t2 ) dt = t7 7 − 2 t5 5 + t3 3 + C = sin7 (x) 7 − 2 sin5 (x) 5 + sin3 (x) 3 + C , for some constant C. 6.B.15. (a) The substitution method leads to the integral ∫ x3 e−x2 dx = t = −x2 dt = −2x dx = 1 2 ∫ t et dt , which can be easily computed by integrating by parts, yielding 1 2 ∫ t et dt = F(t) = t F′ (t) = 1 G′ (t) = et G(t) = et = 1 2 t et − 1 2 ∫ et dt = 1 2 t et − 1 2 et +C = − 1 2 e−x2 ( x2 + 1 ) + C , C ∈ R . 611 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS (b) Similarly, we obtain ∫ x arcsin(x2 ) dx = t = x2 dt = 2x dx = 1 2 ∫ arcsin(t) dt = F(t) = arcsin(t) F′ (t) = 1√ 1−t2 G′ (t) = 1 G(t) = t = 1 2 t arcsin(t) − 1 2 ∫ t √ 1 − t2 dt = u = 1 − t2 du = −2t dt = 1 2 t arcsin(t) + 1 4 ∫ du √ u = 1 2 t arcsin(t) + 1 2 √ u + C = 1 2 t arcsin(t) + 1 2 √ 1 − t2 + C = 1 2 x2 arcsin(x2 ) + 1 2 √ 1 − x4 + C , C ∈ R . (c) In this case let us first use the substitution y = √ x to get rid of the root from the argument of the exponential function. This leads to the integral ∫ e √ x dx = y2 = x 2y dy = dx = 2 ∫ y ey dy . Based now on integration by parts, we obtain ∫ y ey dy = F(y) = y F′ (y) = 1 G′ (y) = ey G(y) = ey = y ey − ∫ ey dy = y ey − ey +C , C ∈ R . In total this gives ∫ e √ x dx = 2y ey −2 ey +C = 2 e √ x (√ x − 1 ) + C , C ∈ R . 6.B.17. Integration by parts gives ∫ sinn (x) dx = ∫ sinn−1 (x) sin(x) dx = ∫ sinn−1 (x)(− cos(x))′ dx = − sinn−1 (x) cos(x) + (n − 1) ∫ sinn−2 (x) cos2 (x) dx = − sinn−1 (x) cos(x) + (n − 1) ∫ sinn−2 (x) ( 1 − sin2 (x) ) dx = − sinn−1 (x) cos(x) + (n − 1) ∫ sinn−2 (x) dx − (n − 1) ∫ sinn (x) dx = − sinn−1 (x) cos(x) + (n − 1)In−2 − (n − 1)In . Thus for any positive integer n we get nIn = − sinn−1 (x) cos(x) + (n − 1)In−2 ⇐⇒ In = − 1 n sinn−1 (x) cos(x) + n − 1 n In−2 . Now, we have I0 = x and I1 = − cos(x). Thus based on the recurrence relation and using the identity sin(2x) = 2 sin(x) cos(x), we get I2 = ∫ sin2 (x) dx = − 1 2 sin(x) cos(x) + 1 2 I0 = − 1 2 sin(x) cos(x) + 1 2 x = − 1 4 sin(2x) + 1 2 x . 
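Before continuing the recursion, here is a quick Sage sanity check of I₂; differentiating sidesteps the ambiguity in the integration constant (a small sketch, expecting the simplification to return 0):

I2 = -1/4*sin(2*x) + 1/2*x
show((diff(I2, x) - sin(x)^2).simplify_full())   # expect 0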
Similarly, I3 = ∫ sin3 (x) dx = − 1 3 sin2 (x) cos(x) + 2 3 I1 = − 1 3 sin2 (x) cos(x) − 2 3 cos(x) = − 1 3 ( 1 − cos2 (x) ) cos(x) − 2 3 cos(x) = 1 3 cos3 (x) − cos(x) . 6.B.18. We can solve this by substitution, which we will encode in a more “compact” but less informative way, as follows: ∫ 6 x − 2 dx = y = x − 2 dy = dx = ∫ 6 y dy = 6 ln | y | + C1 = 6 ln | x − 2 | + C1 , C1 ∈ R . Similarly, we have ∫ 6 (x + 4)3 dx = y = x + 4 dy = dx = ∫ 6 y3 dy = 6 −2y2 + C2 = − 3 (x + 4)2 + C2 , C2 ∈ R . 6.B.19. Setting w = ax + b we get dw = a dx, that is, dx = dw a . Hence 612 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS ∫ dx (ax + b)n = ∫ dw awn = 1 a ∫ w−n dw = 1 a ( w−n+1 −n + 1 ) + C = − 1 a(n − 1)wn−1 + C , and the result follows. Next observe that 9x2 + 6x + 1 = (3x + 1)2 . Thus ∫ dx 9x2 + 6x + 1 = ∫ dx (3x + 1)2 , and by applying the relation that we proved in (a), it follows that ∫ dx (3x + 1)2 = − 1 3(3x + 1) + C, for some constant C. 6.B.20. Recall that P(x) = ax2 + bx + c = a ( x + b 2a )2 − ∆ 4a , and since ∆ = 0 we get P(x) = a ( x + b 2a )2 . Therefore,∫ dx P(x) = ∫ dx a ( x + b 2a )2 . Let us set u = x + b 2a , such that du = dx. Then we see that ∫ dx a ( x + b 2a )2 = 1 a ∫ du u2 = 1 a ∫ u−2 du = − 1 au + C = − 1 a ( x + b 2a ) + C = − 2 2ax + b + C , for some constant C. We leave the other confirmation to the reader. 6.B.27. By the discussion in 6.2.7 we can assume that Q(x) = 2x4 + 2x2 − 5x + 1 x (x2 − x + 1) 2 = A x + B1x + C1 x2 − x + 1 + B2x + C2 (x2 − x + 1)2 , for some A, B1, C1, B2, C2, to be specified. By equating coefficients and solving the corresponding system one gets A = 1, B1 = 1, C1 = 3, B2 = 1 and C2 = −6, and thus Q(x) = 1 x + x + 3 x2 − x + 1 + x − 6 (x2 − x + 1)2 . Recall that you can use Sage to get a quick confirmation of the obtained fractional decomposition of Q, just by typing Q(x)=(2*x^4+2*x^2-5*x+1)/(x*(x^2-x+1)^2); show(Q.partial_fraction()) The fractional decomposition of Q implies that the integral in question splits in three parts: ∫ Q(x) dx = ∫ 2x4 + 2x2 − 5x + 1 x (x2 − x + 1) 2 dx = ∫ 1 x dx + ∫ x + 3 x2 − x + 1 dx + ∫ x − 6 (x2 − x + 1)2 dx . (⋆) Let us focus on the last two integrals which are more difficult. To simplify our notation, we will compute them up to a constant (as Sage does). We begin with Λ := ∫ x + 3 x2 − x + 1 dx. Since x2 − x + 1 has complex roots one can apply a method similar to those presented in 6.B.22. Hence we get Λ = ∫ x + 3 x2 − x + 1 dx = 1 2 ∫ 2x − 1 x2 − x + 1 dx + 7 2 ∫ 1 x2 − x + 1 dx = Λ1 + Λ2 , and we need to evaluate the integrals Λ1 and Λ2. We have (up to a constant) Λ1 = 1 2 ∫ 2x − 1 x2 − x + 1 dx t=x2 −x+1 = dt=(2x−1) dx 1 2 ∫ dt t = 1 2 ln |t| = 1 2 ln(t) = 1 2 ln(x2 − x + 1) . (α) Next Λ2 = 7 2 ∫ dx x2 − x + 1 = 7 2 ∫ dx (x − 1 2 )2 + 3 4 = 7 2 ∫ dx (x − 1 2 )2 + ( √ 3 2 )2 u=x− 1 2 = du=dx 7 2 ∫ du u2 + ( √ 3 2 )2 = = 7 2 · 1 √ 3 2 arctan ( u √ 3 2 ) = 7 √ 3 arctan ( 2u √ 3 ) = 7 √ 3 3 arctan (√ 3(2x − 1) 3 ) . (β) A combination of (α) and (β) gives Λ = Λ1 + Λ2 = 1 2 ln(x2 − x + 1) + 7 √ 3 3 arctan (√ 3(2x − 1) 3 ) . (A) In the second integral M := ∫ x − 6 (x2 − x + 1)2 dx the dominator appears in the second power. Hence one can apply a similar technique to those presented in 6.B.24. We have M = ∫ x − 6 (x2 − x + 1)2 dx = 1 2 ∫ 2x − 1 (x2 − x + 1)2 dx − 11 2 ∫ dx (x2 − x + 1)2 = M1 + M2 . Then, up to a constant we get 613 CHAPTER 6. 
DIFFERENTIAL AND INTEGRAL CALCULUS M1 = 1 2 ∫ 2x − 1 (x2 − x + 1)2 dx t=x2 −x+1 = dt=(2x−1) dx 1 2 ∫ dt t2 = − 1 2t = − 1 2(x2 − x + 1) , (γ) and M2 = − 11 2 ∫ dx (x2 − x + 1)2 = − 11 2 ∫ dx ( (x − 1 2 )2 + ( √ 3 2 )2 )2 = − 11 2 · K2 ( 1 2 , √ 3 2 ) = = − 11 2 · 1 3 4 (1 2 K1 ( 1 2 , √ 3 2 ) + x − 1 2 2 ( (x − 1 2 )2 + 3 4 ) ) = − 11 2 · 4 3 (1 2 · 1 √ 3 2 arctan ( x − 1 2√ 3 2 ) + 2x−1 2 2(x2 − x + 1) ) = = − 22 3 (√ 3 3 arctan (√ 3(2x − 1) 3 ) + 2x − 1 4(x2 − x + 1) ) = − 22 √ 3 9 arctan (√ 3(2x − 1) 3 ) − 11(2x − 1) 6(x2 − x + 1) . (δ) Therefore by (γ) and (δ) we obtain M = M1 + M2 = − 1 2(x2 − x + 1) − 22 √ 3 9 arctan (√ 3(2x − 1) 3 ) − 11(2x − 1) 6(x2 − x + 1) . (B) Combining now (⋆) with (A) and (B) we finally deduce that ∫ Q(x) dx = ln |x| + 1 2 ln(x2 − x + 1) + 7 √ 3 3 arctan (√ 3(2x − 1) 3 ) − 1 2(x2 − x + 1) − 22 √ 3 9 arctan (√ 3(2x − 1) 3 ) − 11(2x − 1) 6(x2 − x + 1) + C , or equivalently ∫ Q(x) dx = ln x √ x2 − x + 1 − √ 3 9 arctan (√ 3(2x − 1) 3 ) − 11x − 4 3(x2 − x + 1) + C , for some constant C. For a confirmation (and much faster computation) via Sage, just type Q(x)=(2*x^4+2*x^2-5*x+1)/(x*(x^2-x+1)^2); show(Q.integrate(x)) Or you may like to verify the integrals Λ1, Λ2, M1, M2 presented above, which can be done in a similar way: show((1/2)*integral((2*x-1)/(x^2-x+1), x)); show((7/2)*integral(1/(x^2-x+1), x)) show((1/2)*integral((2*x-1)/(x^2-x+1)^2, x)); show((-11/2)*integral(1/(x^2-x+1)^2, x)) 6.B.28. An appropriate substitution is given by t = ex , hence x = ln(t) and dx = 1 t dt. This will allows us to convert the function Q(x) = 1 e2x −4 ex to a rational function, hence we can apply the theory described above. In particular, ∫ Q(x) dx = ∫ 1 e3x −2 e2x dx = ∫ dt (t3 − 2t2)t = ∫ dt t3(t − 2) . Now, suppose that 1 t3(t − 2) = A t + B t2 + C t3 + D t − 2 , for some A, B, C, D ∈ R . This gives the relation 1 = t2 (t − 2)A + t(t − 2)B + (t − 2)C + t3 D ⇐⇒ 1 = t3 (A + D) + t2 (−2A + B) + t(−2B + C) − 2C . Thus we get the system −2C = 1 , −2B + C = 0 , −2A + B = 0 , A + D = 0 , which has the tuple {A = −1/8, B = −1/4, C = −1/2, D = 1/8} as a unique solution. This means that 1 t3(t − 2) = − 1 8t − 1 4t2 − 1 2t3 + 1 8(t − 2) , (∗) which can be easily confirmed in Sage by the usual method, that is, var("t"); show((1/(t^4-2*t^3)).partial_fraction() Based on (∗) we may finally compute the integral at hand: ∫ dt t3(t − 2) = ∫ ( 1 8(t − 2) − 1 8t − 1 4t2 − 1 2t3 ) dt = 1 8 (ln |t − 2| − ln |t|) + 1 4t + 1 4t2 + C = 614 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS = 1 8 (ln |ex −2| − ln(ex )) + 1 4 ex ( 1 + 1 ex ) + C = ln |ex −2| 8 − x 8 + 1 4 e−2x (1 + ex ) + C , for some constant C. For a quick confirmation of this result try the command show(integral(1/(e**(3*x)-2*e**(2*x)), x)) 6.B.31. For all x ̸= π 2 + kπ, with k ∈ Z, we see that ∫ 1 cos2(x) dx = t=tan(x) = dt=dx/ cos2(x) ∫ dt = t + C = tan(x) + C , for some constant C. Therefore, with the help of the table given in 5.A.3 one computes π/3∫ π/6 tan2 (x) dx = π/3∫ π/6 sin2 (x) cos2(x) dx = π/3∫ π/6 1 − cos2 (x) cos2(x) dx = π/3∫ π/6 ( 1 cos2(x) − 1 ) dx = [ tan(x) − x ]π/3 π/6 = √ 3 − π 3 − ( 1 √ 3 − π 6 ) = 2 √ 3 3 − π 6 . An alternative for computing this integral occurs by the substitution u = tan(x), with du = dx cos2(x) , sin2 (x) = tan2 (x) 1 + tan2 (x) = u2 1 + u2 . This is also based on the integral ∫ 1 1 + u2 du = arctan(u) + C and left as an easy challenge. 
Notice, to confirm the result in Sage we may type show(integral(tan(x)*tan(x), x, pi/6, pi/3)) For the second integral recall by 6.B.8 that integration by parts gives ∫ x cos2(x) dx = x tan(x) + ln |cos(x)| + C , for all x ̸= π 2 + kπ, k ∈ Z, where C is some constant. Thus π/4∫ 0 x cos2(x) dx = [ x tan(x) + ln(cos(x)) ]π/4 0 = π 4 + ln √ 2 2 = π 4 − ln(2) 2 . Again we can confirm this result in Sage, as before, i.e., show(integral(x/(cos(x)*cos(x)), x, 0, pi/4)) 6.B.32. (a) Set y = 1 − x2 with dy = −2dx. For x = 0 we have y = 1 and for x = 1 we have y = 0. Thus 1∫ 0 x √ 1 − x2 dx = − 0∫ 1 y−1/2 2 dy = 1∫ 0 y−1/2 2 dy = [√ y ]1 0 = 1 . (b) Set t = x + √ x2 − 1, so that dt = √ x2−1+x√ x2−1 dx For x = 1 we get t = 1 and for x = 2 we get t = 2 + √ 3. Thus, 2∫ 1 dx √ x2 − 1 = 2+ √ 3∫ 1 1 t dt = [ ln(t) ]2+ √ 3 1 = ln ( 2 + √ 3 ) . 6.B.33. Using Sage we see that the integral at hand equals to −1/2. This can be done in the usual way, i.e., the cell integral(sin(3*x)*cos(x), x, pi/4,pi) One method to verify this result formally is based on the identity sin(α + β) + sin(α − β) = 2 sin(α) cos(β) which for our case applies as 2 sin(3x) cos(x) = sin(3x + x) + sin(3x − x). Thus we get ∫ π π/4 sin(3x) cos(x) dx = 1 2 ∫ π π/4 ( sin(3x + x) + sin(3x − x) ) dx = 1 2 ∫ π π/4 sin(4x) dx + 1 2 ∫ π π/4 sin(2x) dx = − 1 8 [ cos(4x) ]π π/4 − 1 4 [ cos(2x) ]π π/4 = − 1 8 (1 − (−1)) − 1 4 (1 − 0) = − 1 4 − 1 4 = − 1 2 . 615 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS 6.B.34. (a) Because 0 ≤ x9 √ 2 ≤ x9 √ 1 + x ≤ x9 for all x ∈ [0, 1] the geometric meaning of the definite integral implies √ 2 20 = 1∫ 0 x9 √ 2 dx ≤ 1∫ 0 x9 √ 1 + x dx ≤ 1∫ 0 x9 dx = 1 10 . (b) Because 0 < x < π/2 we have 0 < 2 π < 1 x . Also we have 0 ≤ sin(x) ≤ 1 for all x ∈ (0, π/2). Thus ∫ π/2 0 2 π dx < ∫ π/2 0 sin(x) x < ∫ π/2 0 1 dx , or equivalently 2 π · π 2 < ∫ π/2 0 sin(x) x < π 2 . 6.B.35. Recall that any continuous function f : [−a, a] → R satisfies the relation ∫ a −a f(x) dx = ∫ 0 −a f(x) dx + ∫ a 0 f(x) dx . (∗) Set u = −x such that du = −dx. For x = −a we have u = a and for x = 0 we have u = 0. Assume now that f is even. Then f(u) = f(−x) = f(x) and thus ∫ 0 −a f(x) dx = ∫ 0 a f(u)(−du) = − ∫ 0 a f(u) du = ∫ a 0 f(u) du = ∫ a 0 f(x) dx and hence (∗) gives ∫ a −a f(x) dx = 2 ∫ a 0 f(x) dx. Similarly is treated the case where f is odd. 6.B.36. (1) Since f is periodic with period T, for any a ∈ R one can write ∫ a+T a f(x) dx = ∫ 0 a f(x) dx + ∫ T 0 f(x) dx + ∫ a+T T f(x − T) dx . (†) Set now u = x−T with du = dx. For x = T we have u = 0 and for x = a+T we have u = a. Thus, ∫ a+T T f(x−T) dx = ∫ a 0 f(u) du, and the result follows by (†). An alternative relies on the primitive function F of f, which allows us to prove that ∫ T 0 f(x) dx = ∫ a 0 f(x) dx + ∫ T a f(x) dx = ∫ a 0 f(x) dx + F(T) − F(a) = ∫ a 0 f(x) dx + ( F(a + T) − F(a) ) − ( F(a + T) − F(T) ) = ∫ a 0 f(x) dx + ∫ a+T a f(x) dx − ∫ a+T T f(x) dx . (‡) Then, as above with the substitution t = x − T one can show that ∫ a+T T f(x) dx = ∫ a 0 f(x) dx and the result follows by (‡). (2) Since f is periodic with period T, using induction over the naturals we can show that f(x) = f(x + nT) = f(x − nT), for any n ∈ N. This easily extends to n ∈ Z (try to prove this claim). Using this periodicity property, for n ≥ 0 we get ∫ a+nT a f(x) dx = ∫ 0 a f(x) dx + n−1∑ k=0 ∫ (k+1)T kT f(x) dx + ∫ a+nT nT f(x) dx = ∫ 0 a f(x) dx + n−1∑ k=0 ∫ (k+1)T kT f(x − kT) dx + ∫ a+nT nT f(x − nT) dx = ∫ 0 a f(x) dx + n−1∑ k=0 ∫ T 0 f(x) dx + ∫ T 0 f(x) dx = n−1∑ k=0 ∫ T 0 f(x) dx = n ∫ T 0 f(x) dx . 
(∗) 616 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS Notice for the integral ∫ (k+1)T kT f(x − kT) dx we did the substitution t = x − kT with dt = dx, and t = 0 for x = kT, t = T for x = (k + 1)T. Similarly for ∫ a+nT nT f(x − nT) dx. Suppose now that n < 0. Then we have ∫ a+nT a f(x) dx = − ∫ a a+nT f(x) dx = − ∫ a+nT +(−nT ) a+nT f(x) dx (∗) = − ( − n ∫ T 0 f(x) dx ) = n ∫ T 0 f(x) dx . This proves our claim. Observe, that the identity in (1) for a = 0 reduces to ∫ nT 0 f(x) dx = n ∫ T 0 f(x) dx . This says that the area under Cf for n periods equals n times the area under Cf for one period. 6.B.38. (a) The given function F is clearly the antiderivative of the function f(x) := x5 ln (x + 1) on the interval (−1, 1). Thus, F′ (x) = −x5 ln (x + 1), see also the fundamental theorem of calculus in 6.2.9. (b) We have F′ (x) = x2 sin(x) + 4 cos(4x) and we see that |g(x)| = x2 sin(x) + 4 cos(4x) x2 + 2 ≤ |x2 sin(x)| + 4| cos(4x)| x2 + 2 ≤ |x2 | · 1 + 4 · 1 x2 + 2 = x2 + 4 x2 + 1 , for all x ∈ (0, +∞). Thus, obviously limx→∞ x2 +4 x2+1 = 1, which you can confirm in Sage by the cell var("x"); lim((x^2+4)/(x^2+1), x=oo) 6.B.45. For some real numbers a < b consider the set A = (a, b] and the function f : A → R, with f(x) = 1/(x − a). Though f is continuous on A, it is not uniformly continuous, since A is not closed. Indeed, let (xn) and (yn) be the sequences defined by xn = a + b − a n , yn = a + b − a n + 1 , n ∈ Z+ . Then it is easy to see that (xn − yn) → 0 but f(xn) − f(yn) does not convergent to 0. Hence, by 6.B.44 one deduces that f cannot be uniformly continuous on A. As an example where the domain is not bounded, consider the parabola h : A → R, h(x) = x2 defined on A = [0, +∞). The sequences (xn = n) and (yn = n − 1 n ) with n ∈ Z+, they both belong to A and satisfy xn − yn = 1 n → 0, as n → ∞. However, h(xn)−h(yn) = 2− 1 n2 , which obviously does not converge to 0. By 6.B.44 we deduce that h cannot be uniformly continuous on A 6.B.46. Consider the sequences (xn), (yn) with general terms xn = 1/(n + 1) and yn = 1/n, respectively, with n ≥ 2. They both belong to A = (0, 1) and satisfy xn − yn → 0. However, it is easy to see that f(xn) − f(yn) = 1 and hence limn→∞ ( f(xn) − f(yn) ) = 1 ̸= 0. The assertion now follows by the statement in 6.B.44. Provide an alternative proof based on the ε-δ-definition of uniform continuity, given in 6.2.11. 6.B.48. The proof is analogous to 6.B.47 and we leave it to the reader for practice. 6.B.49. (a) For instance, consider the function f(x) = sin ( 1 x ) with x ∈ A = (0, 1]. It is not hard to see that f is continuous on A, but not uniformly continuous. Now, for any n ∈ N we have n π + π 2 ≥ 1 and hence the sequence (xn) with general term xn = 1 n π + π/2 , n ∈ N , is a sequence in A. Since n π + π/2 → +∞ we have xn → 0, as n → +∞. Thus (xn) is convergent and thus a Cauchy sequence in A. However, the sequence (f(xn)) = ( sin ( 1 xn )) n∈N is not a Cauchy sequence. Why?. (b) Let (xn) be a Cauchy sequence with xn ∈ A for all n. By assumption f is uniformly continuous on A, hence for every ε > 0 there exists δ = δ(ε) > 0 such that |f(x) − f(y)| < ε, for all x, y ∈ A, provided that |x − y| < δ. However, (xn) is Cauchy and since δ > 0 we can find some natural N = N(δ) such that |xn − xm| < δ for all n, m ≥ N. But then we get |f(xn) − f(xm)| < ε, for all n, m ≥ N, and the result follows. 6.B.50. The function f(x) = x2 is continuous on R and hence also on A = [0, 1]. 
Since $[0, 1]$ is a closed and bounded interval, the function $f$ must be uniformly continuous there, see the theorem in 6.2.11.

Suppose that the function $g(x) = \tan(x)$ is uniformly continuous on $B = [0, \pi/2)$. The domain of $g$ is a bounded interval, so $g$ would then have to be bounded as well. (Exercise: show that any uniformly continuous function $f\colon A \to \mathbb{R}$ defined on a bounded interval $A$ is bounded.) However, it is easy to see that $\tan(x_n) \to +\infty$ for $x_n \in (0, \pi/2)$ with $x_n \to \pi/2$, which implies that $g(x) = \tan(x)$ is not bounded (think also of the graph of $\tan(x)$). This gives a contradiction, and hence $g(x) = \tan(x)$ is not uniformly continuous on $B$.

Next, the function $h(x) = x^2$ is not uniformly continuous on $\mathbb{R}$. For instance, consider the sequences $(x_n = n + \frac{1}{n})$ and $(y_n = n)$, with $x_n - y_n \to 0$. Then we see that $h(x_n) - h(y_n) = 2 + \frac{1}{n^2} \ge 2$, and thus $\lim_{n\to+\infty}\big(h(x_n) - h(y_n)\big) \ne 0$. Our assertion now follows by 6.B.44. Finally, the function $k(x) = x^3$ is also not uniformly continuous on $\mathbb{R}$. Here one may consider the same sequences $x_n = n + \frac{1}{n}$ and $y_n = n$. Then obviously $x_n - y_n \to 0$, but we see that $k(x_n) - k(y_n) \ge 3n$ for all $n$, which implies that $\lim_{n\to+\infty}\big(k(x_n) - k(y_n)\big) = +\infty$.

6.B.52. The improper integral represents the area of the region bounded by the graph of the positive function $f(x) = \frac{\arctan(x)}{x\sqrt{x}} = x^{-3/2}\arctan(x)$, $x \ge 1$, and the $x$-axis. From the left, this region is bounded by the line $x = 1$, see also the accompanying figure. Therefore, the integral is either a positive real number or equals $+\infty$. First we see that
$$\int_1^{+\infty} x^{-3/2}\,dx = \lim_{t\to+\infty}\int_1^t x^{-3/2}\,dx = \lim_{t\to+\infty}\Big(2 - \frac{2}{\sqrt{t}}\Big) = 2\,.$$
Moreover, we know that $\frac{\pi}{4} \le \arctan(x) \le \frac{\pi}{2}$ for all $x \in [1, +\infty)$, and hence we get
$$\frac{\pi}{2} = \frac{\pi}{4}\int_1^{+\infty} x^{-3/2}\,dx \le \int_1^{+\infty}\frac{\arctan(x)}{x\sqrt{x}}\,dx \le \frac{\pi}{2}\int_1^{+\infty} x^{-3/2}\,dx = \pi\,.$$
Thus the integral at hand is a finite real number. In fact, with the aid of Sage one can compute this integral via the command integral(arctan(x)/(x*sqrt(x)), x, 1, oo).

6.B.57. As usual, you can use the def method and introduce a routine which you may call "length_of_curve". Since we usually work with the variable $t$ for a parametric curve $c\colon [a, b] \to \mathbb{R}^2$, $t$ should be declared as a symbolic variable, together with the endpoints $a, b$, that is, the limits of integration. Thus the input of our routine can be a tuple $(x, y, a, b)$, corresponding to the curve $c\colon [a, b] \to \mathbb{R}^2$ with $c(t) = [x(t), y(t)]$ for all $t \in [a, b]$. Hence, although $a, b$ are real-number inputs, $x, y\colon [a, b] \to \mathbb{R}$ are real-valued functions of $t$, which must be introduced as symbolic functions. To control also the aspect ratio of the corresponding parametric plot (and obtain a better illustration), we introduce one more variable, say rt, which we can specify for each curve so that the result is the desired one. This means that the input of our routine is finally the tuple (x, y, a, b, rt), and the implementation goes as follows:

var("t, a, b, rt")
function("x")(t)
function("y")(t)
def length_of_curve(x, y, a, b, rt):
    X(t) = diff(x, t)
    Y(t) = diff(y, t)
    l(t) = (X(t)*X(t) + Y(t)*Y(t)).factor()
    s(t) = numerical_approx(integral(sqrt(l), t, a, b), digits=7)
    show("The length of the curve c(t)=", (x, y), " on ", [a, b], " equals: ", s(t))
    p = parametric_plot((x, y), (t, a, b))
    p.show(figsize=6, aspect_ratio=rt)
    return

To test the routine, we first confirm a known result based on the curve $[\mathrm{e}^t - t,\, 4\mathrm{e}^{t/2}]$ of the previous task (see 6.B.56), with $t \in [0, 1]$.
To confirm this case, now one can simply give the cell length_of_curve(e^t-t, 4*e^(t/2), 0, 1, 1/2) where notice that we fixed rt = 1/2. Sage returns the following answer: The length of the curve c(t)= ( −t + et , 4 e ( 1 2 t )) on [0, 1] equals to:2.718282 For the given cases one can proceed similarly (we present them case by case): (1) For the curve c(t) = [cos3 (t), sin3 (t)] on [0, pi/2] type length_of_curve((cos(t))^3, (sin(t))^3, 0, pi/2, 1/2) Here the output is The length of the curve c(t)= ( cos (t) 3 , sin (t) 3 ) on [ 0, 1 2 π ] equals to:1.500000 Indeed, we see that |c′ (t)| = √ 9 cos4(t) sin2 (t) + 9 sin4 (t) cos2(t) = 3 cos(t) sin(t) and thus sc(t) = ∫ π/2 0 3 cos(t) sin(t) dt = 3 2 ∫ π/2 0 sin(2t) dt = 3 2 [ − cos(2t) 2 ]π/2 0 = 3 2 [ − cos(π) 2 + cos(0) 2 ] = 3 2 . As for the graph of c(t), Sage returns the figure given here: (2) For the curve c(t) = [t, ln(cos(t))] on [0, π/4] type length_of_curve(t, ln(cos(t)), 0, pi/4, 1) Sage;s output has the form The length of the curve c(t)= (t, log (cos (t))) on [ 0, 1 4 π ] equals to:0.8813736 and the corresponding plot has the form To confirm Sage’s result we compute 619 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS |c′ (t)| = √ 1 + ( − sin(t) cos(t) )2 = √ 1 + tan2 (t) = √ sec2(t) = sec(t) , where we used the identity tan2 (x) = sec2 (x) − 1. Thus sc(t) = ∫ π/4 0 |c′ (t)| dt = ∫ π/4 0 sec(t) dt . Notice now that ∫ sec(x) dx = ∫ sec(x) ( sec(x) + tan(x) ) sec(x) + tan(x) dx = ∫ sec2 (x) + sec(x) tan(x) sec(x) + tan(x) dx . Setting u = sec(x) + tan(x) we compute du = ( sec(x) + tan(x) )′ dx = ( ( 1 cos(x) )′ + tan′ (x) ) dx = ( sin(x) cos2(x) + (1 + tan2 (x)) ) dx = = ( 1 cos(x) · sin(x) cos(x) + sec2 (x) ) dx = ( sec(x) tan(x) + sec2 (x) ) dx , and hence ∫ sec(x) dx = ∫ 1 u du = ln |u| + C , C ∈ R . (⋆) Since we have u(0) = sec(0) + tan(0) = 1 and u(π/4) = sec(π/4) + tan(π/4) = √ 2 + 1, we finally deduce that sc(t) = ∫ π/4 0 sec(t) dt = ∫ √ 2+1 1 du u = [ ln |u| ]√ 2+1 1 = ln(1 + √ 2) − ln(1) = ln(1 + √ 2) ≈ 0.8813736 . (3) For the curve c(t) = [t sin(t), t cos(t)] on [0, 4π] type length_of_curve(t*sin(t), t*cos(t), 0, 4*pi, 1) which answers The length of the curve c(t)= (t sin (t) , t cos (t)) on [0, 4 π] equals to:80.81931 and prints out the following figure Let us present some hints helpful for verifying this result formally. First we compute |c′ (t)| = √ (sin(t) + t cos(t))2 + (cos(t) − t sin(t))2 = √ 1 + t2 and thus we need to compute sc(t) = ∫ 4π 0 √ 1 + t2 dt. For this integral, using the relation (⋆) from (2) we will first prove that ∫ √ 1 + x2 dx = 1 2 x √ 1 + x2 + 1 2 ln(x + √ 1 + x2) + C . (∗) Indeed, set x = tan(θ) with dx = ( 1 cos2(x) ) dθ = sec2 (x) dθ. Then ∫ √ 1 + x2 dx = ∫ √ 1 + tan2 (θ) sec2 (θ) dθ = ∫ √ sec2(θ) sec2 (θ) dθ = ∫ sec3 (θ) dθ . 620 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS We have (tan(θ))′ = sec2 (x) and (sec(θ))′ = sec(θ) tan(θ) and integration by parts gives ∫ sec3 (θ) dθ = ∫ sec2 (θ) sec(θ) dθ = ∫ (tan(θ))′ sec(θ) dθ = tan(θ) sec(θ) − ∫ tan(θ)(sec(θ))′ dθ = tan(θ) sec(θ) − ∫ tan2 (θ) sec(θ) dθ = tan(θ) sec(θ) − ∫ ( sec2 (θ) − 1 ) sec(θ) dθ = tan(θ) sec(θ) − ∫ sec3 (θ) dθ + ∫ sec(θ) dθ . Thus ∫ sec3 (θ) dθ = 1 2 ( tan(θ) sec(θ) + ∫ sec(θ) dθ ) and hence using (∗) we can write ∫ √ 1 + x2 dx = ∫ sec3 (θ) dθ = 1 2 tan(θ) sec(θ) + 1 2 ln | sec(θ) + tan(θ)| + C . The desired formula (⋆) appears now by replacing x = tan(θ) and sec(θ) = √ 1 + x2. The final computation of sc(t) is now based on (⋆) and left as an easy exercise. 
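As a quick cross-check of this exercise, one may evaluate the closed form $(\star)$ at the endpoints in Sage; the following is a small sketch of ours, with the antiderivative F taken from $(\star)$:

var("t")
F(t) = (1/2)*t*sqrt(1 + t^2) + (1/2)*ln(t + sqrt(1 + t^2))
print(numerical_approx(F(4*pi) - F(0), digits=7))                      # expected: about 80.81932
print(numerical_approx(integral(sqrt(1 + t^2), t, 0, 4*pi), digits=7))

Both values agree with the length 80.81931 reported by our routine, up to rounding in the last digit.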
(4) For the curve $c(t) = [\sin^2(t), \cos^2(t)]$ with $t \in [0, \frac{\pi}{2}]$, similarly to the previous cases we can type

length_of_curve((sin(t))^2, (cos(t))^2, 0, pi/2, 1)

Sage returns (be aware that in this case Sage does not typeset the curve $c(t)$ properly):

The length of the curve c(t)= (sin(t)^2, cos(t)^2) on [0, 1/2*pi] equals: 1.414214

together with an illustration of the curve at hand. For a formal computation, we see that
$$|c'(t)| = \sqrt{\big(2\sin(t)\cos(t)\big)^2 + \big(-2\sin(t)\cos(t)\big)^2} = \sqrt{\big(\sin(2t)\big)^2 + \big(-\sin(2t)\big)^2} = \sqrt{2\sin^2(2t)}\,.$$
Thus
$$s_c(t) = \int_0^{\pi/2} |c'(t)|\,dt = \sqrt{2}\int_0^{\pi/2}\sin(2t)\,dt = \sqrt{2}\Big[-\frac{\cos(2t)}{2}\Big]_0^{\pi/2} = \sqrt{2}\,.$$
In fact, the given curve is a part of the line $y = 1 - x$ (since $\sin^2(t) + \cos^2(t) = 1$), and in particular the segment with boundary points $[0, 1]$ for $t = 0$ and $[1, 0]$ for $t = \frac{\pi}{2}$, see also the figure mentioned above. Hence one can immediately write down its length $s_c(t) = \sqrt{2}$ (by the Pythagorean theorem).

6.B.58. It is easy to see that
$$f'(t) = -r\sin t + \frac{r}{2\tan\frac{t}{2}\cos^2\frac{t}{2}} = -r\sin t + \frac{r}{\sin t} = \frac{r\cos^2 t}{\sin t}\,, \qquad g'(t) = r\cos t\,,$$
for any $t \in [\pi/2, a]$. Thus, for the length $s_\alpha(t)$ we get
$$s_\alpha(t) = \int_{\pi/2}^a \sqrt{\frac{r^2\cos^4 t}{\sin^2 t} + r^2\cos^2 t}\,dt = \int_{\pi/2}^a \sqrt{\frac{r^2\cos^2 t}{\sin^2 t}}\,dt = -r\int_{\pi/2}^a \frac{\cos t}{\sin t}\,dt = -r\big[\ln(\sin t)\big]_{\pi/2}^a = -r\ln(\sin a)\,.$$
To plot the tractrix for the values $r = 1, 2, \ldots, 5$ one can use the block

var("r, t")
f(t, r) = r*cos(t) + r*ln(tan(t/2))
g(t, r) = r*sin(t)
a = parametric_plot((f(t, 1), g(t, 1)), (t, pi/2, pi/2+1), exclude=[pi], rgbcolor=(0.2,0.2,0.5), title="The tractrix for several values of r")
a += parametric_plot((f(t, 2), g(t, 2)), (t, pi/2, pi/2+1), exclude=[pi], rgbcolor=(0.2,0.5,0.2))
a += parametric_plot((f(t, 3), g(t, 3)), (t, pi/2, pi/2+1), exclude=[pi], rgbcolor=(0.5,0.2,0.2))
a += parametric_plot((f(t, 4), g(t, 4)), (t, pi/2, pi/2+1), exclude=[pi], rgbcolor=(0.4,0.4,0.2))
a += parametric_plot((f(t, 5), g(t, 5)), (t, pi/2, pi/2+1), exclude=[pi], rgbcolor=(0.4,0.2,0.8))
a += point([[0, r] for r in [1,2,..,5]], color="black")
a.show(aspect_ratio=4, figsize=6)

which produces the figure shown above.

6.B.64. By separating the variables we can write $y^{-1/2}\,dy = \mathrm{e}^{\sin(x)}\cos(x)\,dx$, and this implies that
$$\int y^{-1/2}\,dy = \int \mathrm{e}^{\sin(x)}\cos(x)\,dx \iff 2y^{1/2} = \int \mathrm{e}^{\sin(x)}\cos(x)\,dx + C\,.$$
To compute the integral on the right-hand side, set $t = \sin(x)$ with $dt = \cos(x)\,dx$. Then we get
$$2y^{1/2} = \int \mathrm{e}^t\,dt + C \iff 2y^{1/2} = \mathrm{e}^t + C = \mathrm{e}^{\sin(x)} + C\,.$$
To compute $C$ we use the initial condition. By assumption $y(0) = 16$, thus $2\sqrt{y(0)} = 2\cdot 4 = 8 = \mathrm{e}^{\sin(0)} + C$, from where we get $C = 7$. Thus
$$y = y(x) = \Big(\frac{\mathrm{e}^{\sin(x)} + 7}{2}\Big)^2\,.$$

6.B.67. Let us refer to our routine as trapezoid_rule. The input should be a function f, the real numbers a, b representing the lower and upper limits of the integral, and the number of steps n. Let us present the program and provide the necessary explanations below.

def trapezoid_rule(f, a, b, n):
    step = (b - a) / n
    valsf = [f(a + i * step) for i in range(n + 1)]
    return (step / 2) * (valsf[0] + 2 * sum(valsf[1:n]) + valsf[-1])

In this block we use the list valsf, which is generated by evaluating the function f at n + 1 evenly spaced points between a and b. Hence, for example, valsf[i] contains the function value at the ith point, i.e., valsf[i] = f(a + i*step). In the return statement, valsf[1:n] selects the sublist containing all the intermediate function values, excluding the first and last points.
This is because the trapezoid rule assigns a weight of 2 to the intermediate points between the start and end of the interval, while the first and last points are treated differently with a weight of 1. In our program, these are accessed separately as valsf[0] and valsf[−1], respectively. Let us now test our routine, by verifying, for instance, the result obtained in 6.B.66. It suffices to add the cell 622 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS n=4; a=0; b=pi; f(x)=sin(x) print(trapezoid_rule(f,a,b,n).n()) Sage’s result is 1.89611889793704. For the estimation of the integral L by the trapezoid rule, we will first present a formal computation. We have a = π/2, b = 3π/2, n = 4 and hence h = π/4. Moreover, x0 = a = π 2 , x1 = x0 + h = 3π 4 , x2 = x0 + 2h = π , x3 = x0 + 3h = 5π 4 , x4 = b = x0 + 4h = 3π 2 . Thus we obtain Ltrap = h 2 [ f(x0) + 2f(x1) + 2f(x2) + 2f(x3) + f(x4) ] = π 8 [ cos (π 2 ) + 2 cos ( 3π 4 ) + 2 cos (π) + 2 cos ( 5π 4 ) + cos ( 3π 2 ) ] = π 8 [ 0 − 2 √ 2 2 − 2 − 2 √ 2 2 + 0 ] = − π(1 + √ 2) 4 ≈ −1.896 . We can now confirm this result by applying the trapezoid_rule routine: n=4; a=pi/2; b=3*pi/2; f(x)=cos(x) print(trapezoid_rule(f,a,b,n).n()) Sage’s output is −1.89611889793704 To compute the actual error we first see that L = ∫ 3π/2 π/2 cos(x) dx = [sin(x)] 3π/2 π/2 = sin(3π/2) − sin(π/2) = −1 − 1 = −2. Thus, |L − Ltrap| = |−2 + 1.896| = |−0.104| = 0.104. For the theoretical error a simple computation gives that (b−a)h2 12 |f′′ | = π3 192 ≈ 0.161. It is easy to see that the cosine function is convex over the integral of integration and the trapezoid rule overestimates the integral (since −1.896 > −2). This overestimation is characteristic of the trapezoid rule when applied to convex functions. In fact, cos(x) is negative on [π/2, 3π/2], and hence the trapezoids do not exceed the area under the curve. This gives the desired overestimation, as illustrated in the figure below. 6.B.68. Simpson’s rule formula for an integral over [a, b] with n intervals, is given by (see 6.2.23) ISimp = h 3 [f(x0) + 4f(x1) + 2f(x2) + 4f(x3) + · · · + 2f(xn−2) + 4f(xn−1) + f(xn)] , where h = b−a n , xi = a + ih for i = 0, 1, . . . , n, and n is even! Hence, except of the starting and ending points x0, xn, for all odd indices i = 1, 3, 5 . . . the function values f(xi) are multiplied by 4, and for all even indices i = 2, 4, 6 . . . the function values f(xi) are multiplied by 2. For our case, for h = 1 4 we get the equation 1 4 = a−b n = 1−0 n , which gives n = 4. Thus we have n + 1 = 5 nodes: x0 = 0, x1 = 1/4, x2 = 1/2, x3 = 3/4 and x4 = 1, with values f(x0) = 1, f(x1) = 4/5, f(x2) = 2/3, f(x3) = 4/7 and f(x4) = 1/2, respectively. Thus ISimp = h 3 [f(x0) + 4f(x1) + 2f(x2) + 4f(x3) + f(x4)] = 1 12 ( 1 + 4 4 5 + 2 2 3 + 4 4 7 + 1 2 ) ≈ 0.69325 , 623 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS where we used Sage via the following basic command to do the final computation: print(N((1/12)*(1+16/5+4/3+16/7+1/2))) Let us now find the approximation by the trapezoid rule, where the corresponding nodes are the same. We have Itrap = h 2 [f(x0) + 2f(x1) + 2f(x2) + 2f(x3) + f(x4)] = 1 8 ( 1 + 2 4 5 + 2 2 3 + 2 4 7 + 1 2 ) ≈ 0.6970 . This result can be verified by using the program constructed in 6.B.67, for example. 
Hence we can type def trapezoid_rule(f, a, b, n): step = (b - a) / n valsf = [f(a + i * step) for i in range(n + 1)] return (step / 2) * (valsf[0] + 2 * sum(valsf[1:n]) + valsf[-1]) n=4; a=0; b=1; f(x)=1/(1+x) print(trapezoid_rule(f,a,b,n).n()) Running this block Sage prints out the number 0.697023809523809. The exact value of I is given by I = ∫ 1 0 1 1 + x dx = ln(1 + 1) − ln(1 + 0) = ln(2) ≈ 0.69314, and we may summarize the results about the errors in a table: Method Approximation Actual error Trapezoidal Rule 0.6970 |0.693147 -0.6970| ≈ 0.0039 Simpson’s Rule 0.693253 |0.693147-0.69325| ≈ 0.000106 Thus, the approximation given by Simpson’s rule is significantly closer to the exact value of I, compared to the trapezoidal rule (Simpson’s rule yields a smaller error). Similarly is treated the case with h = 1/2, which corresponds to n = 2 and n + 1 = 3 nodes: x0 = 0, x1 = 1/2, and x2 = 1. We leave this for practice. 6.B.69. Recall that Simpson’s rule requires n to be an even number, as the method involves pairing subintervals to fit a quadratic polynomial. To solve this task, we will use a method similar to the one presented in 6.B.67, and therefore, we will omit extensive explanations. Here is our routine: def simpson_rule(f, a, b, n): if n % 2 != 0: raise ValueError("n must be even for Simpson’s rule.") step = (b - a) / n valsf = [f(a + i * step) for i in range(n + 1)] return (step / 3) * (valsf[0] + 4 * sum(valsf[i] for i in range(1, n, 2)) + 2 * sum(valsf[i] for i in range(2, n-1, 2)) + valsf[-1]) Let us now use the function f(x) = 1/(1 + x) of the previous example, with h = 1/4 to check our program: n = 4; a = 0; b = 1; f(x) = 1 / (1 + x) print(simpson_rule(f, a, b, n).n()) This verifies our computation, as Sage’s output is the number 0.69325396825396. Note that if n is odd, Sage will display the the error: “n must be even for Simpson’s rule.” You can use this program to demonstrate additional examples of Simpson’s rule in your Sage editor. For h = 1/4 and the interval I given in 6.B.68, the slight overestimation of Simpson’s rule demonstrated above can be illustrated as follows: 624 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS To construct this figure we used the following block in Sage: f(x) = 1 / (1 + x) a = 0; b = 1; h = 1/4 n = int((b - a) / h) # Generate x values and corresponding y values x_values = [a + i*h for i in range(n + 1)] y_values = [f(x) for x in x_values] # Initialize the plot with the original function p = plot(f, a, b, color="blue", legend_label="f(x) = 1/(1+x)", thickness=1.5) # Create the piecewise quadratic approximation for Simpson’s rule for i in range(0, n, 2): x0, x1, x2 = x_values[i], x_values[i+1], x_values[i+2] y0, y1, y2 = y_values[i], y_values[i+1], y_values[i+2] # Shade the area under the curve for this segment using a polygon p += polygon([(x0, 0), (x0, y0), (x2, y2), (x2, 0)], color="darkgreen", alpha=0.3, fill=True) # Highlight the points used in Simpson’s rule p += points([(x, f(x)) for x in x_values], color="black", size=30, marker="o") # Show the plot with a larger range for visibility p.show(legend_loc="upper right") 6.B.70. It is easy to see that the fourth derivative of f(x) = 1 1+x is given by f(4) (x) = 24 (1+x)5 , for all x ∈ [0, 1]. Recall that we can verify this computation in Sage by the cell f(x)=1/(1+x); dif4=diff(f, x, 4); show(dif4) Since the fourth derivative f(4) (x) is decreasing for all x ∈ [0, 1], the maximum value of f(4) (x) occurs at the left endpoint of the interval, i.e., at x = 0, and equals 24. 
Here is a short program in Sage that can verify this claim (you may also plot the graph of $f^{(4)}(x)$):

var("x")
f_4 = 24 / (1 + x)^5
a = 0; b = 1
# Evaluate f^(4) at the endpoints
f_4_a = f_4.subs(x=a)
f_4_b = f_4.subs(x=b)
# Find the maximum value of |f^(4)(x)| on the interval
max_value = max(f_4_a, f_4_b)
# Output the results
f"f^(4)(0) = {f_4_a}, f^(4)(1) = {f_4_b}, Maximum |f^(4)(x)| on [0,1] = {max_value}"

Thus we compute
$$\frac{(b-a)^5}{180\,n^4}\max_{x\in[0,1]}\big|f^{(4)}(x)\big| \overset{n=4}{=} \frac{24}{180\cdot 256} = \frac{1}{1920} \approx 0.00052\,.$$
The actual error $0.000106$ computed above is indeed less than the theoretical error bound $0.00052$. In this way one confirms that the theoretical error bound is valid and provides an upper limit on the actual error.

6.C.3. (a) A typical example is the sequence $(f_n)$ with $f_n(x) = x^n$, $x \in [0, 1]$. In 5.B.3 we learned that for $0 \le x < 1$ the sequence $(f_n(x)) = (x^n)$ converges to $0$, while $(f_n(1))$ converges to $1$. Thus, the limit function of $(f_n)$ on $[0, 1]$ is given by $f(x) = 0$ for $x \in [0, 1)$ and $f(1) = 1$. Although each member $f_n$ is continuous, it is obvious that $f$ is not continuous at $x_0 = 1$ (it has a jump discontinuity there). Another example is given by the sequence $(f_n)_{n\in\mathbb{Z}^+}$ defined by $f_n(x) = 0$ for $x < 0$, $f_n(x) = nx$ for $0 \le x \le \frac{1}{n}$, and $f_n(x) = 1$ for $x > \frac{1}{n}$. The graph of $f_n$ can be made in Sage by a cell such as

n = 2
def f(x):
    if -2 <= x < 0:
        return 0
    elif 0 <= x <= 1/n:
        return n*x
    else:
        return 1
plot(f, -2, 2, figsize=6)

Here the pointwise limit function is $f(x) = 1$, if $x > 0$, and $f(x) = 0$, if $x \le 0$.

(b) Consider for instance the sequence $(f_n(x) = \mathrm{e}^{-nx^2})$ with $n \in \mathbb{N}$ and $x \in \mathbb{R}$. For $x = 0$ we obviously have $f_n(0) = 1$ for all $n$, and hence $\lim_{n\to\infty} f_n(0) = 1$. For $x \ne 0$ we have $\lim_{n\to\infty} f_n(x) = 0$. This is because of the inequality $\frac{1}{\mathrm{e}^t} < \frac{1}{1+t}$, which holds for any $t > 0$ (to see this, use the familiar inequality $\mathrm{e}^t > 1 + t$, $t > 0$). It gives $0 < f_n(x) = \frac{1}{\mathrm{e}^{nx^2}} < \frac{1}{1 + nx^2}$, and our claim follows. Thus, the sequence $(f_n)$ converges on $\mathbb{R}$ to the function $f$ with $f(0) = 1$ and $f(x) = 0$ for $x \ne 0$. Now it is obvious that although all members of $(f_n)$ are differentiable functions over $\mathbb{R}$, the limit function $f$ is not differentiable at $x = 0$ (it is not even continuous there). Another such example is given in 6.C.6.

(c) Let us consider the sequence of functions $f_n\colon [0, 1] \to \mathbb{R}$ defined by $f_n(0) = 0$, $f_n(x) = n$ for $0 < x \le \frac{1}{n}$, and $f_n(x) = \frac{1}{x}$ for $\frac{1}{n} < x \le 1$. It is not hard to prove that the pointwise limit function $f\colon [0, 1] \to \mathbb{R}$ of $(f_n)_{n\in\mathbb{Z}^+}$ is given by $f(0) = 0$ and $f(x) = \frac{1}{x}$ for $0 < x \le 1$. To illustrate this result graphically, we just need to use Sage and plot some members of the sequence $(f_n)$. This can be accomplished in several different ways; for instance, with a bit of programming as demonstrated in the following block:

n = 5
def f(x):
    if x == 0:
        return 0
    elif 0 < x <= 1/n:
        return n
    else:
        return 1/x
plot(f, 0, 1, ymax=6, figsize=6)

Observe that each $f_n$ is bounded by $n$ (notice that $x > \frac{1}{n}$ implies $\frac{1}{x} < n$). Next we need to understand the discontinuities of the functions $f_n$, $n \in \mathbb{Z}^+$. Obviously, at $x = 0$ we have a (jump) discontinuity: $\lim_{x\to 0^+} f_n(x) = n \ne 0 = f_n(0)$, for all $n \in \mathbb{Z}^+$. At all the other points the functions $f_n$ are continuous; e.g., at $x = \frac{1}{n}$ we get
$$\lim_{x\to(1/n)^-} f_n(x) = n = f_n(1/n)\,, \qquad \lim_{x\to(1/n)^+} f_n(x) = \lim_{x\to(1/n)^+} \frac{1}{x} = \frac{1}{1/n} = n\,.$$
To summarize, each member $f_n$ is bounded and has a single discontinuity at $x = 0$ (so a finite number of discontinuities, and thus of measure zero). Therefore, each $f_n$ is Riemann integrable. In fact, we can easily see that
$$\int_0^1 f_n(x)\,dx = \int_0^{1/n} n\,dx + \int_{1/n}^1 \frac{1}{x}\,dx = 1 + \ln(n)\,.$$
In contrast, the limit function $f$ is not bounded on $[0, 1]$, and thus cannot be integrable on $[0, 1]$.

6.C.5.
The sequence of functions (fn) defined by the given graph has the form fn(x) =    nx , for x ∈ [0, 1 n ) , 2 − nx , for x ∈ [1/n, 2/n] , 0 , otherwise. If x = 0 we have fn(x) = 0 for all n, and limn→∞ fn(0) = 0. Suppose now that x > 0 and in particular that x ∈ (0, 1 n ) for some n. Then, we may find a sufficiently large integer n0 such that x /∈ [0, 1 n0 ), which implies again that limn→∞ fn(x) = 0. 627 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS If x ∈ [ 1 n , 2 n ] then similarly there exists a sufficiently large integer n0 such that x /∈ [ 1 n0 , 2 n0 ], which again implies that limn→∞ fn(x) = 0. Hence the claim, i.e., limn→∞ fn(x) = f(x) = 0 for all x ≥ 0. For the uniform convergence, let x0 = 1 2n . Since fn(x0) = fn( 1 2n ) = n · 1 2n = 1 2 is non-zero (regardless how large n becomes), we also get |fn( 1 2n ) − f( 1 2n )| = |1 2 − 0| = 1 2 . Therefore, for any ε < 1 2 does not exist N such that |fn(x)| < ε for all n ≥ N. Consequently, (fn(x)) cannot converge uniformly to f(x) = 0. 6.C.6. Obviously, lim n→∞ fn(x) = lim n→∞ √ x2 + 1 n2 = √ x2 + lim n→∞ 1 n2 = √ x2 + 0 = √ x2 = |x| , that is, the limit function is given by f(x) = |x| with x ∈ R. Now, for given ε > 0, choose N large enough such that 1 N < ε. Then, for any n ≥ N and x ∈ R we have |fn(x) − f(x)| = √ x2 + 1 n2 − |x| = √ x2 + 1 n2 − |x| · √ x2 + 1 n2 + |x| √ x2 + 1 n2 + |x| = 1 n2 √ x2 + 1 n2 + |x| ≤ 1 n2 √ 1 n2 = 1 n < ε . This proves the claim. Observe that each member fn(x) is continuous and differentiable for all x ∈ R, but the limit function f(x) = |x|, though continuous, is not differentiable at x = 0. 6.C.7. (a) On the interval I = (0, 1) consider the sequences (fn), (gn), defined by fn(x) = 1 x , gn(x) = 1 n , n ∈ Z+ . Obviously, fn → f = 1 x and gn → 0, as n → ∞, and it is easy to see that we have uniform convergence on (0, 1). Consider the sequence (fn · gn)n∈Z+ with fn(x)gn(x) = 1 nx for all x ∈ (0, 1). Certainly, 1 nx → 0 as n → ∞, hence its pointwise limit function is the zero one. We will show however that the convergence is not uniform on (0, 1). For uniform convergence, given ε > 0 we need to find some positive integer N such that | 1 nx − 0| < ε for all n ≥ N and x ∈ (0, 1). This is also written as 1 nx < ε or equivalently, n > 1 ε x , which must hold for all x ∈ (0, 1). However, no matter how large n is chosen, there will always be some x ∈ (0, 1) close enough to 0 such that 1 nx ≥ ε (notice that as x → 0+ , the expression 1/x grows without bound, making 1 nx difficult to keep below any positive ε uniformly). In particular, we see that sup x∈(0,1) |fn(x)gn(x) − 0| = sup x∈(0,1) 1 nx = sup x∈(0,1) ( 1 nx ) = +∞ . 6.C.8. 6.D.4. We compute T3 0 (g(x)) = 1 + x2 2 , T3 0 (h(x)) = 1 − x2 2 and T3 0 (k(x)) = x − x3 3 , respectively. In Sage we can verify these results as usual, that is, by typing show(taylor(1/cos(x), x, 0, 3)) , show(taylor(exp(−x ∗ ∗2/2), x, 0, 3)) , show(taylor(sin(sin(x)), x, 0, 3)) , respectively. 6.D.9. (a) The x-axis y = 0 as x → ±∞. (b) The lines y = ln 10, and y = x + ln 3. 6.D.13. The function f(x) = −x2 /(x + 1) with x ∈ A = R\{−1} is not odd, neither even, nor periodic. Its range is the union (−∞, 0]∪[4, +∞). Moreover, the point x0 = −1 is an improper point, i.e., f is not defined there and so it has a single discontinuity with lim x→−1+ f(x) = −∞, lim x→−1− f(x) = +∞ . The function intersects the x axis only at the origin. It is positive for x < −1 and not positive for x > −1. 
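The one-sided limits at $x_0 = -1$ stated above are quickly confirmed in Sage; the following is a minimal sketch of ours, not part of the original solution:

f(x) = -x^2/(x + 1)
print(limit(f(x), x=-1, dir="+"))   # expected: -Infinity
print(limit(f(x), x=-1, dir="-"))   # expected: +Infinity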
It can be shown easily that lim x→−∞ f(x) = +∞, lim x→+∞ f(x) = −∞ , f′ (x) = − x2 + 2x (x + 1)2 , f′′ (x) = − 2 (x + 1)3 , x ∈ R\{−1} . Therefore, f is increasing on the intervals [−2, −1), (−1, 0] and decreasing on the intervals (−∞, −2], [0, +∞), see also a sketch of its graph below. The function f has two stationary points x1 = 0 and x2 = −2. We leave the characterization of these critical points to the reader. Moreover, f is convex on the interval (−∞, −1) and concave on the interval (−1, +∞). 628 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS However, it does not have a point of inflection. The line x = −1 is a vertical asymptote, the inclined asymptote at ±∞ is the line y = −x + 1. For example, f(−3) = 9/2, f′ (−3) = −3/4, f(1) = −1/2, f′ (1) = −3/4. To graph of f together with its asymptotes is given below: To obtain this illustration one may use the block f(x)=(-x^2)/(x+1) p=plot(f(x), x, -10, 10, exclude=[-1], ymin=-11, ymax=11, color="black", figsize=6) p+=text(r"$f(x)=\frac{-x^2}{x+1}$", (-4.1, 9), fontsize=14, color="black") p+=line([(-1, -12), (-1, 12)], rgbcolor=(0.2,0.2,0.5), linestyle="--") p+=line([(-10, 11), (10, -9)], rgbcolor=(0.2,0.5,0.2), linestyle="--"); show(p) Notice in the second line we have used the option exclude = [−1], in order to exclude the point where f is not defined. 6.D.14. The function is defined on A = R\{1} and is everywhere continuous on A. It is not odd, neither even nor periodic. The points of intersection of the graph of f with the axes are the points [ 1 − 3 √ 2, 0 ] and [0, −1]. At x0 = 1, the function has a discontinuity of the second kind and its range is R, which follows from the limits lim x→1− f(x) = −∞, lim x→1+ f(x) = +∞, lim x→±∞ f(x) = +∞. After the arrangement f(x) = (x − 1)2 + 2 x−1 , x ∈ R ∖ {1}, it is not difficult to compute f′ (x) = 2 (x−1)3 −1 (x−1)2 , x ∈ R ∖ {1}, f′′ (x) = 2 (x−1)3 +2 (x−1)3 , x ∈ R ∖ {1}. The only stationary point is x1 = 2. The function f is increasing on the interval [2, +∞), decreasing on the intervals (−∞, 1), (1, 2]. Hence at the point x1 it attains the local minimum y1 = 3. It is convex on the intervals ( −∞, 1 − 3 √ 2 ) , (1, +∞) and concave on the intervals ( 1 − 3 √ 2, 1 ) . The point x2 = 1 − 3 √ 2 is a point of inflection. The line x = 1 is a horizontal asymptote. The function does not have any inclined asymptotes. 6.D.15. The function is defined on R and is everywhere continuous. It is not odd, even nor periodic. It attains positive values on the positive half-axis, negative values on the negative half-axis. The point of intersection of the graph of f with the axes is only at the point [0, 0]. The derivative is: f′ (x) = e−x 3 3√ x2 − 3 √ x e−x , x ∈ R ∖ {0}, f′ (0) = +∞, f′′ (x) = 3 √ x e−x − 2e−x 3 3√ x2 − 2e−x 9 3√ x5 , x ∈ R ∖ {0}. The only zero point of the first derivative is the point x0 = 1/3. The function f is increasing on the interval (−∞, 1/3] and decreasing on the interval [1/3, +∞). Hence at the point x0, it has an absolute maximum y0 = 1/ 3 √ 3e. Since limx→−∞ f(x) = −∞, its range is (−∞, y0]. The points of inflection are x1 = 1− √ 3 3 , x2 = 0, x3 = 1+ √ 3 3 . It is convex on the intervals (x1, x2) and (x3, +∞), concave on the intervals (−∞, x1), (x2, x3). The only asymptote is the line y = 0 at +∞, i.e. limx→+∞ f(x) = 0. 6.D.17. Recall by 6.A.45 that the slope of the cycloid 629 CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS x(t) = t − sin(t) , y(t) = 1 − cos(t) , t ∈ [0, 2π] is given by dy dx = dy/dt dx/dt = sin(t) 1 − cos(t) . 
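This slope formula can also be reproduced symbolically in Sage; a short sketch of ours:

var("t")
X(t) = t - sin(t); Y(t) = 1 - cos(t)
show((diff(Y(t), t)/diff(X(t), t)).simplify_full())

which returns an expression equivalent to $\frac{\sin(t)}{1-\cos(t)}$.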
We see that $\frac{dy}{dx} = 0$ if and only if $\frac{dy}{dt} = 0$, that is, $\sin(t) = 0$. This gives $t = \pi$, where we have a horizontal tangent line. The vertical tangents appear at the cusps of the cycloid, that is, at the points $t = 0$ and $t = 2\pi$. These values of $t$ solve the equation $\frac{dx}{dt} = 0$, that is, $\cos(t) = 1$. To calculate the slopes at these points we can use limits:
$$k_0 = \lim_{t\to 0^+}\frac{dy}{dx} = \lim_{t\to 0^+}\frac{\sin(t)}{1-\cos(t)} = +\infty\,, \qquad k_{2\pi} = \lim_{t\to 2\pi^-}\frac{dy}{dx} = \lim_{t\to 2\pi^-}\frac{\sin(t)}{1-\cos(t)} = -\infty\,.$$
Thus the tangents at $t = 0, 2\pi$ are vertical.

6.D.22. Here one should simply give the following block:

show(integral(x^4+e^x+5*ln(x), x))
show(integral(sqrt(x)*(1+x^3), x))
show(integral(x/sqrt(x+1), x))

Check Sage's output yourself.

6.D.23. (a) $F(x) = \frac{8}{15}\,x\sqrt[8]{x^7}$; (b) $G(x) = \frac{4^x}{\ln 4} + \frac{2\cdot 6^x}{\ln 6} + \frac{9^x}{\ln 9}$; (c) $H(x) = \frac{\arcsin(x)}{2}$; (d) $K(x) = \ln(1 + \sin(x))$.

6.D.24. A primitive function is given by $F(x) = \mathrm{e}^x + 3\arcsin\big(\frac{x}{2}\big)$.

6.D.25. Set $t = \ln(x)$ with $dt = \frac{1}{x}\,dx$. Then we see that
$$\int \frac{dx}{x\big(\ln(x)\big)^2 + 2025x} = \int \frac{dx}{x\big(\big(\ln(x)\big)^2 + 2025\big)} = \int \frac{dt}{t^2 + 45^2} = \frac{1}{45}\arctan\Big(\frac{t}{45}\Big) + C = \frac{1}{45}\arctan\Big(\frac{\ln(x)}{45}\Big) + C\,,$$
for some constant $C \in \mathbb{R}$. If you like to verify the result in Sage, just type

show(integral(1/(x*((ln(x))^2+2025)), x))

6.D.26. Set $t = \mathrm{e}^x$ such that $dt = \mathrm{e}^x\,dx = t\,dx$, that is, $dx = \frac{1}{t}\,dt$. Then
$$\int \frac{\mathrm{e}^x}{\mathrm{e}^x + 2024}\,dx = \int \Big(\frac{t}{t+2024}\cdot\frac{1}{t}\Big)\,dt = \int \frac{dt}{t+2024} = \ln|t+2024| + C = \ln(t+2024) + C = \ln(\mathrm{e}^x + 2024) + C\,.$$

6.D.34. To be written.

6.D.35. The answer is $\frac{3}{2}\ln\big(x^2+4x+8\big) - \frac{1}{2}\arctan\frac{x+2}{2} + C$, for some $C \in \mathbb{R}$.

6.D.36. The answer is $\frac{4}{3\sqrt{3}}\arctan\frac{2x+1}{\sqrt{3}} + \frac{2x+1}{3(x^2+x+1)} + C$, with $C \in \mathbb{R}$.

6.D.37. The answer is $\frac{1}{6}\ln\frac{(x+1)^2}{x^2-x+1} + \frac{\sqrt{3}}{3}\arctan\frac{2x-1}{\sqrt{3}} + C$, with $C \in \mathbb{R}$.

6.D.41. We compute $A = \frac{1}{2}\ln\big(\frac{2+\ln 2}{2-\ln 2}\big)$, $B = -\frac{1}{6} - \frac{2}{9}\ln(2)$, and $C = 2/3$.

6.D.48. The answer is $\sum_{n=0}^{\infty} (-1)^n\frac{2^{2n-1}}{(2n)!}x^{2n}$, which converges for all real $x$.

6.D.49. The answer is $\sum_{n=1}^{\infty} (-1)^{n+1}\frac{2^{2n-1}}{(2n)!}x^{2n}$, which converges for all real $x$.

6.D.50. The answer is $f(x) = \sum_{n=1}^{\infty} \frac{3(-1)^{n+1}}{n}x^n$, which converges for all $x \in (-1, 1]$.

6.D.51. One should first realize that we are expanding the function $f(x) = \frac{1}{2}\ln(x)$. Thus, we get $f(x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{2n}(x-1)^n$, which converges on the interval $(0, 2]$.

6.D.53. Recall from Chapter 5 (see for example 5.D.21) that $\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots$, which makes sense for any $x \in \mathbb{R}$. Thus, replacing $x$ by $\sqrt{x}$ we get
$$\cos(\sqrt{x}) = 1 - \frac{(\sqrt{x})^2}{2!} + \frac{(\sqrt{x})^4}{4!} - \frac{(\sqrt{x})^6}{6!} + \cdots = 1 - \frac{x}{2!} + \frac{x^2}{4!} - \frac{x^3}{6!} + \cdots\,.$$
We can now approximate the integral at hand as follows:
$$I = \int_0^1 \Big(1 - \frac{x}{2!} + \frac{x^2}{4!} - \frac{x^3}{6!} + \cdots\Big)\,dx = \Big[x - \frac{x^2}{2\cdot 2!} + \frac{x^3}{3\cdot 4!} - \frac{x^4}{4\cdot 6!} + \cdots\Big]_0^1 = 1 - \frac{1}{2\cdot 2!} + \frac{1}{3\cdot 4!} - \frac{1}{4\cdot 6!} + \cdots$$
Hence we deduce that $I = \int_0^1 \cos(\sqrt{x})\,dx \approx 1 - \frac{1}{2\cdot 2!} + \frac{1}{3\cdot 4!} - \frac{1}{4\cdot 6!} = 1 - \frac{1}{4} + \frac{1}{72} - \frac{1}{2880} = 733/960 \approx 0.7635416$. In Sage the exact value of $I$ is obtained by the cell

integral(cos(sqrt(x)), x, 0, 1)
N(integral(cos(sqrt(x)), x, 0, 1))

Here the first command gives the answer $2\cos(1) + 2\sin(1) - 2$, and the second prints out its decimal expression, $0.763546581352073$. Hence the estimate based on the Maclaurin series has a very small error.

6.D.56. The error belongs to the interval $(0, 1/200)$.

6.D.57. We find that $0 < \int_1^2 \frac{\cos^{10}(x)}{10}\ln(x)\,dx < \frac{1}{10}$ and $\int_1^2 x\ln x\,dx = \ln 4 - \frac{3}{4}$.

In this chapter, we mainly deal with applications of the tools of differential and integral calculus. We consider a variety of problems related to functions of one real variable.
The tools and procedures are similar to the ones shown in Chapter 3, i.e. we consider linear combinations of selected generators and linear transformations. This chapter serves also as a useful consolidation of background material before considering functions of several variables, differential equations, and the calculus of variations. We begin by asking how to approximate a given function by linear combinations from a given set of generators. Approximation considerations lead to the general concept of distance. We illustrate the concepts on rudiments of the Fourier series. Our intuition from the Euclidean spaces of low dimensions is extended to infinite dimensional spaces, particularly the concept of orthogonal projections. The next part of this chapter focuses on integral operators. These are linear mappings on functions which are defined in terms of integrals. Especially, we pay attention to convolutions and Fourier analysis. Throughout all these considerations, we work with real or complex valued functions of one variable. Only then do we introduce the elements of the theory of metric spaces. This should enlighten the concepts of convergence and approximation on infinite dimensional spaces of functions. It will also cover our needs in analysis on Euclidean spaces Rn in the next chapter. 1. Fourier series 7.1.1. Spaces of functions. As usual, we begin by choosing appropriate sets of functions to use. We want enough functions so that our models can conveniently be applied in practice. At the same time, the functions must be sufficiently “smooth” so that we can integrate and differentiate as needed. All functions are defined on an interval I = [a, b] ⊂ R, where a < b. The interval may be bounded, (i.e., both a and b are finite), or unbounded (i.e., either a = −∞, or b = +∞, or both). CHAPTER 7 Continuous tools for modelling How do we manage non-linear objects? – mainly by linear tools again... A. Fourier series If we want to understand three-dimensional objects, we often use (one or more) two-dimensional plane projections of them. The orthogonal projections are special in providing the closest images of the points of the objects in the chosen plane. Similarly, we can understand complicated functions in terms of simpler ones. We consider their projections into the (real) vector space generated by those chosen functions. Recall from Chapter 2 that the orthogonal projections were easily computed in terms of inner products. Now we do the same for the infinite dimensional spaces of functions. CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING Spaces of piecewise smooth functions We denote by S0 = S0 [a, b] the set of all piecewise continuous functions on I = [a, b] with real or complex values. Otherwise put, all functions f in S0 = S0 [a, b] have only finitely many points of discontinuity on bounded intervals. Moreover, f has finite one-sided left and right limits at every point in [a, b]. In particular, f is bounded on all bounded subintervals. For every natural number k ≥ 1, we consider the set of all piecewise continuous functions f such that all their derivatives up to order k (inclusive) lie in S0 . We denote this set by Sk [a, b], or briefly Sk . Note that the derivatives of functions in Sk need not exist at all points, but their onesided limits must exist. If the interval I is unbounded, we often consider only those functions with compact support. A function with compact support means that it is identically zero outside some bounded interval of the real line. 
For unbounded intervals, we denote by Sk c the subset of those functions in Sk which have compact support. Functions in S0 are always Riemann integrable on the bounded interval I = [a, b], with both ∫ b a |f(x)| dx < ∞, and ∫ b a |f(x)|2 dx < ∞. Both integrals are finite for unbounded intervals if the function f has compact support. 7.1.2. Distance between functions. The properties of limits and derivatives ensure that Sk and Sk c are vector spaces. In finite-dimensional spaces, the distance between vectors can be expressed by means of the differences of the coordinate components. In spaces of functions, we proceed analogously and utilize the absolute value of real or complex numbers and the Euclidean distance in the following way: The L1 distance of functions The L1–distance between functions f and g in S0 c is defined by ∥f − g∥1 = ∫ b a |f(x) − g(x)| dx. If g = 0, then the distance from f to the zero function, namely ∥f∥1, is called the L1-norm (i.e., length or size) of f. The L1–distance between functions f and g (when both are real valued) expresses the area enclosed by the graphs of these functions, regardless of which function takes greater values. We observe that ∥f − g∥1 ≥ 0. Since f and g are both piecewise continuous functions, ∥f − g∥1 = 0 only if f and g differ in their values at most at the points of discontinuity, and hence at only finitely many points on any bounded interval. Recall that we can change 632 In this case the inner product mimics the product of scalars and provides the necessary tool to calculate the corresponding projections. The simplest way to define an inner product on appropriate vector spaces of functions takes the form ⟨f, g⟩ = ∫ b a f(x)g(x) dx . We refer to this inner product as L2 and denote the corresponding norm by ∥ ∥2, see 7.1.1-7.1.3 for further details and its extension to complex functions. 7.A.1. Orthogonal systems of functions. Consider the subspace W = ⟨f1, f2⟩ of the space of real-valued functions defined on the interval [1, 2], generated by the functions f1(x) = 1/x and f2(x) = x2 , endowed with the L2 product. (a) Complete the function f1 to an orthogonal basis of W. (b) Determine the orthogonal projection of f(x) = x onto W and compute the distance of f from W. Solution. (a) The vector space W is generated by two linearly independent functions, thus its dimension is 2. All vectors belonging to W are of the form a · f1(x) + b · f2(x) for some a, b ∈ R. According to the Gram–Schmidt process, we seek for a vector of the form h = x2 + k · 1 x , k ∈ R, subject to the orthogonality condition 0 = ⟨ 1 x , x2 + k · 1 x ⟩ , which gives k = − ⟨ 1 x , x2 ⟩ ⟨ 1 x , 1 x ⟩ = − ∫ 2 1 1 x · x2 dx ∫ 2 1 1 x · 1 x dx = −3 . Hence the requested orthogonal basis consists of the functions 1 x and x2 − 3 x . (b) The projection of the function f(x) = x onto W has the form (see ....) px = ⟨x, 1 x ⟩ ⟨1 x , 1 x ⟩ · 1 x + ⟨x, x2 − 3 x ⟩ ⟨x2 − 3 x , x2 − 3 x ⟩ · ( x2 − 3 x ) = 2 x + 15 34 ( x2 − 3 x ) . Let us recall that the distance of a vector from the subspace is given by the norm of the difference between this vector and its projection. In this case we get ∥x − px∥2 = (∫ 2 1 ( x − 2 x − 15 34 ( x2 − 3 x ))2 dx )1 2 = 1 √ 408 ≈ 0.0495. □ 7.A.2. Let W be the the space generated by the functions 1 x , 1 x2 , 1 x3 , defined on the interval [1, 2]. Assume that W is endowed with the L2 product. Complete the function 1 x to an orthogonal basis of W. Next determine the projection of the functions f(x) = 1 x4 and g(x) = x onto W and find their distances from W. 
⃝ CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING the value of any function at a finite number of points, without changing the value of the integral. If in particular, f and g are both continuous on [a, b], then ∥f − g∥1 = 0 implies f(x) = g(x) for all x ∈ [a, b]. Indeed, if f(x0) ̸= g(x0) at a point x0, a ≤ x0 ≤ b, and if f and g are both continuous at x0, then f and g also differ on some small neighbourhood of x0, and this neighbourhood, in turn, contributes a non-zero value into the integral, so that then ∥f − g∥1 > 0. If we have three functions f, g, and h, then, of course, ∫ b a |f(x) − g(x)| dx = ∫ b a |f(x) − h(x) + h(x) − g(x)| dx ≤ ∫ b a |f(x) − h(x)| dx + ∫ b a |h(x) − g(x)| dx, so the usual triangle inequality ∥f − g∥1 ≤ ∥f − h∥1 + ∥h − g∥1 holds. To derive this inequality, we used only the triangle inequality for the scalars; thus it is valid for functions f, g ∈ S0 c with complex values as well. ∥f −g∥1 is not the only way to measure distance between two functions f and g. For another way: The L2–distance The L2–distance between functions f and g in S0 c is defined by ∥f − g∥2 = (∫ b a |f(x) − g(x)|2 dx )1/2 . If g = 0, then ∥f∥2, the distance from f to the zero function, is called the L2 norm of f. Clearly ∥f∥2 ≥ 0. Moreover, ∥f∥2 = 0, implies that f(x) = 0 for all x except for a finite set of points in any bounded interval. As above for the L1 norm, ∥f − g∥2 = 0 only if f and g differ in their values at most at the points of discontinuity, and hence at only finitely many points on any bounded interval. In particular, if f and g are both continuous for all x, then ∥f − g∥2 = 0 implies f(x) = g(x) for all x. The square of ∥f∥2 for a function f is ∥f∥2 2 = ∫ b a |f(x)|2 dx and it is related to the well-defined symmetric bilinear mapping of real or complex functions to scalars ⟨f, g⟩ = ∫ b a f(x)g(x) dx since ⟨f, f⟩ = ∫ b a f(x)f(x) dx = ∫ b a |f(x)|2 dx = ∥f∥2 2 . We can use therefore all the properties of inner products in unitary spaces as described in Chapter 3. In particular, the 633 The next task involves a specific type of polynomials, known as “Chebyshev polynomials”. These are defined by Tn(x) = cos(n arccos(x)), or equivalently, given by Tn(cos(x)) = cos(n x) , n ∈ N . Chebyshev polynomials are described by a plethora of equivalent ways, and have a distinguished role in approximation theory. Moreover, due to their simple definition are easily adapted to symbolic computations, and here we will explain how one can manipulate them via Sage. 7.A.3. Chebyshev polynomials. Show that Tn(x) is a polynomial for all n ∈ N. Solution. By the definition given above it is direct that T0(x) = 1, T1(x) = x . Based now on the trigonometric identity cos(rz) cos(sz) = 1 2 ( cos((r − s)z)+ cos((r + s)z) ) , (⋆) we see that Tn+1(x) + Tn−1(x) = = cos ( (n + 1) arccos(x) ) + cos ( (n − 1) arccos(x) ) = 2 cos ( n arccos(x) ) cos ( arccos(x) ) = 2x Tn(x) , that is, Tn+1(x) = 2x Tn(x) − Tn−1(x) , for all positive integers n. This is the recurrent definition of Chebyshev polynomials, and the result follows. □ 7.A.4. (a) Verify that for each interval I = [a, b] ⊂ R and positive continuous function ω on I, the formula ⟨f, g⟩ω = ∫ b a f(x)g(x)ω(x) dx defines an inner product on the continuous functions on I. (b) Choosing I = (−1, 1) and ω(x) = ω0(x) = (1−x2 )−1/2 , deduce that the Chebyshev polynomials Tk(x), (k ∈ N), form an orthogonal system of polynomials with respect to ⟨ , ⟩ω0 . Solution. 
We compare the defining formula with the L2 inner product above: Consider the substitution x = φ(z), where φ is the inverse function to z = ∫ x a ω(t) dt. The inverse exists, since ω is positive and so z is a strictly increasing function of x. Thus, dz = ω(x)dx and ⟨f, g⟩ω = ∫ b a f(x)g(x)ω(x) dx = ∫ φ−1 (b) 0 f(φ(z))g(φ(z)) dz. In particular, the “coordinate change” x = φ(z) identifies the vector space of continuous functions on I with the space of continuous function on another interval equipped with the L2 inner product and so ⟨ , ⟩ω is an inner product. We leave as an easy exercise to check that ⟨ , ⟩ω satisfies the properties of an inner product. CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING inner product satisfies both linearity in the first argument and the Hermitian symmetry ⟨f, g⟩ = ⟨g, f⟩. It is a symmetric bilinear mapping in the real case. In the sequel, the L2 distance will be most important. Thus, to simplify notation, the norm symbol ∥ ∥ without subscript will mean ∥ ∥2 in the subsequent pargraphs. 7.1.3. Orthogonality. In Chapters 2 and 3, we dealt with finite-dimensional real or complex vector spaces. Most properties derived there concerned pairs or finite sets of vectors. Now, we can do just the same with functions. We restrict our definition of the inner product to any vector subspace generated by only finitely many functions f1, . . . , fk (over real or complex numbers, according to our need). We again obtain a well-defined inner product on this finite-dimensional vector subspace and so all our considerations from the finite dimensional linear algebra apply again. As an example, consider the monomial functions fi(x) = xi , i = 0, . . . , k. In S0 , these generate the (k + 1)– dimensional vector subspace Rk[x] of all polynomials of degree at most k. The inner product of two such polynomials is given by integration. Every polynomial of degree at most k is uniquely expressed as a linear combination of the generators f0, . . . , fk. Moreover, if we can arrange the choice of generators to satisfy (1) ⟨fi, fj⟩ = { 0 for i ̸= j, 1 for i = j, then the computations become much easier than they would otherwise. Recall the Gram–Schmidt orthogonalization procedure, see 2.3.21. This procedure transforms any system of linearly independent generators fi into new (again linearly independent) orthogonal generators gi of the same subspace, i.e. ⟨gi, gj⟩ = 0 for all i ̸= j. We can calculate them step by step. Put g1 = f1, and gℓ+1 = fℓ+1 + a1g1 + · · · + aℓgℓ, ai = − ⟨fℓ+1, gi⟩ ∥gi∥2 for ℓ ≥ 1. To illustrate, we apply this procedure to the three polynomials 1, x, x2 on the interval [−1, 1]. Put g1 = 1, and generate the sequence g1 = 1 g2 = x − 1 ∥g1∥2 (∫ 1 −1 x · 1 dx ) · g1 = x − 0 = x g3 = x2 − 1 ∥g1∥2 (∫ 1 −1 x2 · 1 dx ) · g1− 1 ∥g2∥2 (∫ 1 −1 x2 · x dx ) · g2 = x2 − 1 3 . 634 (b) In this special case, ω(x) = d d x (arccos(x)), and thus the above substitution yields ⟨Tr, Ts⟩ω = ∫ π 0 cos(rz) cos(sz) d z. We are dealing with improper Riemann integrals (integrating the unbounded function ω), but this does not cause any problem. In fact it is easy to evaluate the integral via the trigonometric formula (⋆) mentioned in 7.A.3. We see that it vanishes for all r ̸= s. □ Both Sage and Maple provide built-in functions for generating the Chebyshev polynomials. For instance, in Sage they can be accessed via the function chebyshev_T(n, x), and in our notation this correspond to Tn(x), see also below. 7.A.5. (a) Use Sage and its function chebyshev_T to derive the first five Chebyshev polynomials. 
Next, plot them. (b) Combine the recursive definition of Chebyshev polynomials with the def environment in Sage to present an alternative approach for integrating them into Sage. Solution. (a) To write down the first five Chebyshev polynomials one can type R = PolynomialRing(QQ, "x") show([chebyshev_T(n, x) for n in [0..4]]) which returns the list [ 1, x, 2 x2 − 1, 4 x3 − 3 x, 8 x4 − 8 x2 + 1 ] To plot them add the command plot([chebyshev_T(n, x) for n in [0..4]]) (b) For this task, let us agree to call T(n, x) the desired routine. We can introduce it as follows: def T(n, x): if n==0 : return 1 elif n==1 : return x else : return expand(2*x*T(n-1, x)-T(n-2, x)) To test the routine you may type T(3, x) to get 4x3 − 3x, T(4, x) to get 8x4 − 8x2 + 1, etc. Or you can directly compare the formula of T(n, x) with that of chebyshev_T(n, x), for some n, by typing bool(T(n, x)==chebyshev_T(n, x) for n in [0..10]) In this case Sage’s answer is True. □ 7.A.6. A good programmer is often interested in the efficiency of algorithms and programs. For example, you may like to check the efficiency of the routine in Sage presented in 7.A.5, which can be done via the command timeit. Hence, one may compare the execution time of the commands chebyshev_T(n, x) and T(n, x), for certain values of n. To test the case n = 3, for example, add the cell CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING The corresponding orthogonal basis of the space R2[x] of all polynomials of degree less than three on the interval [−1, 1] is 1, x, x2 − 1/3. Rescaling by appropriate numbers so that the basis elements all have length 1, yields the orthonormal basis h1 = √ 1 2 , h2 = √ 3 2 x, h3 = 1 2 √ 5 2 (3x2 − 1). For example, h1 = g1/∥g1∥ and ∥g1∥2 = ∫ 1 −1 12 dx = 2. We could easily continue this procedure in order to find orthonormal generators of Rk[x]. The resulting polynomials are called Legendre polynomials. Considering all Legendre polynomials hi, i = 0, . . . , we have an infinite orthonormal set of generators such that polynomials of all degrees are uniquely expressed as their finite linear combinations. 7.1.4. Orthogonal systems of functions. Generalizing the latter example, suppose we have three polynonials h1, h2, h3 forming an orthonormal set. For any polynomial h, we can put H = ⟨h, h1⟩h1 + ⟨h, h2⟩h2 + ⟨h, h3⟩h3. We claim that H is the (unique) polynomial which minimizes the L2–distance ∥h − H∥. See 3.4.3. The coefficients for the best approximation of a given function by a function from a selected subspace are obtained by the integration introduced in the definition of the inner product. This example of computing the best approximation of H by a linear combination of the given orthonormal generators suggests the following generalization: Orthogonal systems of functions Every (at most) countable system of linearly independent functions in S0 c [a, b] such that the inner product of each pair of distinct functions is zero is called an orthogonal system of functions. Moreover, if the norm ∥f∥2 = 1 for all f in an orthogonal system, we talk about an orthonormal system of functions. Notice, all continous functions have got compact supports on a finite interval [a, b]. Consider an orthogonal system of functions fn ∈ S0 [a, b] and suppose that for (real or complex) constants cn, the series F(x) = ∞∑ n=0 cnfn(x) converges uniformly on a finite interval [a, b]. Notice that the limit function F(x) does not need to belong to S0 [a, b], but this is not our concern now. 
635 print(timeit(’chebyshev_T(3,x)’, number=20)) print(timeit(’T(3,x)’), number=20) Here the option number=20 restricts the number of loops. The output appears as follows: 20 loops, best of 3: 81.2 µs per loop 20 loops, best of 3: 18 µs per loop and what one looks for is the execution time, which here is measured in microseconds (µs).1 Note that each time the above two commands are executed, Sage returns slightly different execution times. 7.A.7. Show that the choice of the weight function ω(x) = e−x and the interval I = [0, ∞) in Problem 7.A.3, leads to an inner product for which the Laguerre polynomials Ln(x) = n∑ k=0 ( n k ) (−1)k k! xk form an orthonormal system. ⃝ 7.A.8. Check that the orthonormal systems obtained in the previous two examples coincide with the result of the corresponding Gram-Schmidt orthogonalisation procedure applied to the system 1, x, x2 , . . . , xn , . . . , using the inner products ⟨ , ⟩ω, possibly only up to signs. ⃝ 7.A.9. Let S0 ([−π, π]) be the (infinite-dimensional) vector space of F-valued piecewise continuous functions defined on [−π, π], where F ∈ {R, C}, and let us denote by ¯g the complex conjugate of a function g ∈ S0 ([−π, π]). Prove that for any f, g ∈ S0 ([−π, π]) the rule (f, g) := 1 π ∫ π −π f(x)g(x) dx defines an inner product. Next show that the system of func- tions {√ 2 2 , sin(x) , cos(x) , sin(2x) , cos(2x) , · · · : n ∈ Z+ } is ( , )-orthonormal. ⃝ The orthonormal system of functions discussed in 7.A.9 is an orthonormal version of the so-called “Fourier orthogonal system”, introduced in 7.1.6. In fact, a key point in 7.A.9 is the “periodicity” of the functions forming the orthonormal basis (each having period 2π). Periodic functions occur frequently in engineering problems, and usually have a complicated form. Thus, it is often desirable to have a presentation of such periodic functions in terms of the simpler functions of sine and cosine. Building an orthogonal system of the periodic functions sin(nx) and cos(nx) leads to the classical concept of the Fourier series 1For the command timeit Sage’s output can include also the units nanoseconds (ns) and millisecond (ms). Recall that nanoseconds provide the smallest unit of the three, followed by microseconds and then milliseconds. CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING By uniform convergence, the inner product ⟨F, fn⟩ can be expressed in terms of the particular summands (see the corollary 6.3.7), obtaining ⟨F, fn⟩ = ∞∑ m=0 cm ∫ b a fm(x)fn(x) dx = cn∥fn∥2 , since each term in the sum is 0 except when m = n. Exactly as in the example above, each finite sum ∑k n=0 cnfn(x) is the best approximation of the function F(x) among the linear combinations of the first k +1 functions fn in the orthogonal system. Actually, we can generalize the definition further to any vector space of functions with an inner product. See the exercise 7.A.4 for such an example. For the sake of simplicity we confine ourselves to the L2 distance, but the reader can check that the proofs work in general. We extend our results from finite-dimensional spaces to infinite dimensional ones. Instead of finite linear combinations of base vectors, we have infinite series of pairwise orthogonal functions. The following theorem gives us a transparent and very general answer to the question as to how well the partial sums of such a series can approximate a given function: 7.1.5. Theorem. Let fn, n = 1, 2, . . . , be an orthogonal sequence of (real or complex) functions in S0 [a, b] and let g ∈ S0 [a, b] be an arbitrary function. 
Put cn = ∥fn∥−2 ∫ b a g(x)fn(x) dx. Then (1) For any fixed n ∈ N, the expression which has the least L2–distance from g among all linear combinations of functions f1, . . . , fn is hn = n∑ i=1 cifi(x). (2) The series ∞∑ n=1 |cn|2 ∥fn∥2 always converges, and more- over ∞∑ n=1 |cn|2 ∥fn∥2 ≤ ∥g∥2 . (3) The equality ∞∑ n=1 |cn|2 ∥fn∥2 = ∥g∥2 holds if and only if lim n→∞ ∥g − hn∥ = 0. Before presenting the proof, we consider the meaning of the individual statements of this theorem. Since we are working with an arbitrarily chosen orthogonal system of functions, we cannot expect that all functions can be approximated by linear combinations of the functions fi. For instance, if we consider the case of Legendre orthogonal polynomials on the interval [−1, 1] and restrict ourselves to even degrees only, surely we can approximate only even 636 for a 2L-periodic function f that is integrable over an integral of length 2L, say [−L, L]. 2 These are series of the form F(x) = a0 2 + ∞∑ n=1 ( an cos( n π x L ) + bn sin( n π x L ) ) , where an and bn are the Fourier coefficients, given by an = 1 L ∫ L −L f(x) cos( n π x L ) dx , n = 0, 1, . . . , bn = 1 L ∫ L −L f(x) sin( n π x L ) dx , n = 1, 2, . . . , respectively. In many cases the period is 2π, i.e., L = π, but we are also interested in functions with arbitrary periods, which, in practice (applications), is the most common scenario. For instance, if f has period T then it may defined on any interval of the form [0, T) or [−T/2, T/2), i.e., of length T. Defining f on intervals of the form [−π, π) or [0, 2π) is useful however, since simplifies the calculations and aligns with the standard form of the Fourier series, see 7.1.6. For the purpose of Fourier series the function f may defined even on the open interval (−π, π), where the key requirement is that f must be periodic of period 2π, see for example the so called “Heaviside function” in 7.1.9 (also known as the “square wave function”). Below, our goal is to illustrate all of this with a series of examples, most of which are implemented using Sage as well. For convenience, next we will denote by Fk(x) (k > 0) the kth-partial sum of a given Fourier series F(x). 7.A.10. Consider the 2π-periodic extension of the function f(x) = x, with x ∈ (−π, π). Describe the corresponding Fourier series F(x) and next confirm your answer via Sage. Moreover, confirm that as the number of terms in the series increases, the approximation improves significantly, except in a small neighbourhood around the discontinuity point. (Hint: For example, use Sage to plot the partial sums F5(x), F10(x), F20(x) and F100(x)). Solution. We see that f(−x) = −x = −f(x) for all x ∈ (−π, π), which means that f is an odd function. Thus, the coefficients an vanish for all n = 0, 1, . . ., and the Fourier series of f is a sine series, that is, F(x) = ∑∞ n=1 bn sin(nx). On the other hand, the function g(x) = x sin(nx) is even, i.e., g(−x) = g(x) for all x ∈ (−π, π). Hence, according to the statement in 6.B.35 we have bn = 1 π ∫ π −π x sin(nx) dx = 2 π ∫ π 0 x sin(nx) dx . 2The Fourier series are named in honour of the French mathematician and physicist Jean B. J. Fourier, in recognition of his seminal 1807 work “On the Propagation of Heat in Solid Bodies”, which focused on the issue of heat conduction. Fourier introduced the concept of analyzing periodic functions using trigonometric series, a method that remains fundamental to mathematical physics and has numerous critical applications in engineering. CHAPTER 7. 
CONTINUOUS TOOLS FOR MODELLING

functions in a reasonable way. Nevertheless, the first statement of the theorem says that the best approximation possible (in the $L_2$-distance) is by the partial sums as described. The second and third statements can be perceived as an analogy to the orthogonal projections onto subspaces in terms of Cartesian coordinates. Indeed, if for a given function g the series $F(x) = \sum_{n=1}^{\infty} c_n f_n(x)$ converges pointwise, then the function F(x) is, in a certain sense, the orthogonal projection of g into the vector subspace of all such series. The second statement is called Bessel's inequality, and it is an analogy of the finite-dimensional proposition that the size of the orthogonal projection of a vector cannot be larger than the original vector. The equality from the third statement is called Parseval's theorem, and it says that if a given vector does not decrease in length by the orthogonal projection onto a given subspace, then it belongs to this subspace. On the other hand, the theorem does not claim that the partial sums of the considered series need to converge pointwise to some function. There is no analogy to this phenomenon in the finite-dimensional world. In general, the series F(x) need not be convergent for some points x, even under the assumption of the equality in (3). However, if the series $\sum_{n=1}^{\infty} |c_n|$ converges to a finite value, and if all the functions $f_n$ are bounded uniformly on I, then the series $F(x) = \sum_{n=1}^{\infty} c_n f_n(x)$ converges at every point x. Yet it need not converge to the function g everywhere. We return to this problem later. The proof of all three statements of the theorem is similar to the case of finite-dimensional Euclidean spaces. That is to be expected, since the bounds for the distances of g from the partial sum f are constructed in the finite-dimensional linear hull of the functions concerned:

Proof of theorem 7.1.5. Choose any linear combination $f = \sum_{n=1}^{k} a_n f_n$ and calculate its distance from g. We obtain
$$\Big\| g - \sum_{n=1}^{k} a_n f_n \Big\|^2 = \int_a^b \Big| g(x) - \sum_{n=1}^{k} a_n f_n(x) \Big|^2 dx$$
$$= \int_a^b |g(x)|^2\, dx - \int_a^b \sum_{n=1}^{k} g(x)\, \overline{a_n f_n(x)}\, dx - \int_a^b \sum_{n=1}^{k} a_n f_n(x)\, \overline{g(x)}\, dx + \int_a^b \Big| \sum_{n=1}^{k} a_n f_n(x) \Big|^2 dx$$
$$= \|g\|^2 - \sum_{n=1}^{k} \overline{a_n} c_n \|f_n\|^2 - \sum_{n=1}^{k} a_n \overline{c_n} \|f_n\|^2 + \sum_{n=1}^{k} |a_n|^2 \|f_n\|^2$$
$$= \|g\|^2 + \sum_{n=1}^{k} \|f_n\|^2 \big( (c_n - a_n)\overline{(c_n - a_n)} - |c_n|^2 \big) = \|g\|^2 + \sum_{n=1}^{k} \|f_n\|^2 \big( |c_n - a_n|^2 - |c_n|^2 \big).$$
Since we are free to choose $a_n$ as we please, we minimize the last expression by choosing $a_n = c_n$ for each n. This

Keeping in mind the relations cos(nπ) = (−1)^n and sin(nπ) = 0, integration by parts gives
$$b_n = \frac{2}{\pi} \int_0^{\pi} x \sin(nx)\, dx = \frac{2}{\pi} \Big[ -\frac{x\cos(nx)}{n} + \frac{\sin(nx)}{n^2} \Big]_0^{\pi} = \frac{2}{\pi} \cdot \frac{-\pi}{n}\, (-1)^n = \frac{2}{n}\, (-1)^{n+1}.$$
Therefore, the Fourier series of f has the form
$$F(x) = \sum_{n=1}^{\infty} \frac{2(-1)^{n+1}}{n} \sin(nx) = 2\sin(x) - \sin(2x) + \frac{2}{3}\sin(3x) - \frac{1}{2}\sin(4x) + \frac{2}{5}\sin(5x) - \cdots.$$
In Sage it suffices to compute the Fourier coefficients $a_n$ and $b_n$, which can be easily done by the block

f=lambda x:x; f=piecewise([[(-pi,pi),f]])
L=pi; n=var("n")
an=(1/L)*integral(x*cos(n*pi*x/L),x,-pi,pi)
show("a_n=", an)
bn=(1/L)*integral(x*sin(n*pi*x/L),x,-pi,pi)
show("b_n=", bn.expand())

Sage prints out the following expressions
$$a_n = 0, \qquad b_n = -\frac{2\cos(\pi n)}{n} + \frac{2\sin(\pi n)}{\pi n^2},$$
which, with a bit of effort, confirm the presented values of $a_n$, $b_n$, and consequently the given expression of F(x). In Sage, the description of the kth partial sum of the Fourier series corresponding to a periodic function f relies on the syntax f.fourier_series_partial_sum(k, L), where L denotes the half-period of f.
For instance, the 5th partial sum of the Fourier series at hand can be explicitly described by adding in the previous cell the following:

FS = f.fourier_series_partial_sum(5,pi)
show("partial FS = ",FS)

To plot the suggested partial sums, together with f, you may add the block

FS5 = f.fourier_series_partial_sum(5,pi)
FS10 = f.fourier_series_partial_sum(10,pi)
FS20 = f.fourier_series_partial_sum(20,pi)
FS100 = f.fourier_series_partial_sum(100,pi)
a = f.plot(x, -pi, pi)
b5 = plot(FS5, x, -pi, pi, linestyle="--", tick_formatter=[2*pi, None])
b10 = plot(FS10, x, -pi, pi, linestyle="--", tick_formatter=[2*pi, None])
b20 = plot(FS20, x, -pi, pi, linestyle="--", tick_formatter=[2*pi, None])
b100 = plot(FS100, x, -pi, pi, linestyle="--", tick_formatter=[2*pi, None])
(a+b5).show(); (a+b10).show()
(a+b20).show(); (a+b100).show()

This procedure yields the figures presented below, illustrating the improvement in approximation as the number of terms is increased.

CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING

completes the proof of the first statement. With this choice of $a_n$, we obtain Bessel's identity
$$\Big\| g - \sum_{n=1}^{k} c_n f_n \Big\|^2 = \|g\|^2 - \sum_{n=1}^{k} |c_n|^2 \|f_n\|^2.$$
Since the left-hand side is non-negative, it follows that
$$\sum_{n=1}^{k} |c_n|^2 \|f_n\|^2 \le \|g\|^2.$$
Let k → ∞. Since every non-decreasing sequence of real numbers which is bounded from above has a limit, it follows that
$$\sum_{n=1}^{\infty} |c_n|^2 \|f_n\|^2 \le \|g\|^2,$$
which is Bessel's inequality. If equality occurs in Bessel's inequality, then statement (3) follows straight from the definitions and the Bessel identity proved above. □

An orthogonal system of functions is called a complete orthogonal system on an interval I = [a, b] for some space of functions on I if and only if Parseval's equality holds for every function g in this space.

7.1.6. Fourier series. The coefficients $c_n$ from the previous theorem are called the Fourier coefficients of a given function in the (abstract) Fourier series. The previous theorem indicates that we are able to work with countable orthogonal systems of functions $f_n$ in much the same way as with finite orthogonal bases of vector spaces. There are, however, essential differences:
• It is not easy to decide what the set of convergent or uniformly convergent series $F(x) = \sum_{n=1}^{\infty} c_n f_n(x)$ looks like.
• For a given integrable function, we can find only the "best approximation possible" by such a series F(x) in the sense of the $L_2$-distance.
In the case when we have an orthonormal system of functions $f_n$, the formulae mentioned in the theorem are simpler, but still there is no further improvement in the approximations. The choice of an orthogonal system of functions for use in practice should address the purpose for which the approximations are needed. The name "Fourier series" itself refers to the following choice of a system of real-valued functions:

Fourier's orthogonal system
The following functions form an orthogonal system
1, sin x, cos x, sin 2x, cos 2x, . . . , sin nx, cos nx, . . .

(Figures: the partial sums F5(x), F10(x), F20(x) and F100(x), each plotted together with f.) □

7.A.11. Triangle wave. Find the Fourier series F(x) for the 2π-periodic extension of the function f(x) = | x |, with x ∈ (−π, π). Next present the 1st, 3rd and 6th partial sums of the Fourier series and use Sage to illustrate them for x ∈ (−4π, 4π).

Solution. The given function corresponds to a triangle-shaped oscillation. Its expression as a Fourier series is very important in practice. The function f is even on (−π, π), i.e., f(−x) = f(x) for all x ∈ (−π, π). Therefore, we have $b_n = 0$ for all $n \in \mathbb{Z}^+$, and it suffices to determine $a_n$, for $n \in \mathbb{N}$.
Combining the definition of an with 6.B.35, we get a0 = 1 π π∫ −π g(x) dx = 2 π π∫ 0 x dx = 2 π [ x2 2 ]π 0 = π . To compute an for n ∈ Z+, one can apply integration by parts: an = 1 π π∫ −π f(x) cos(nx) dx = 2 π π∫ 0 x cos(nx) dx = 2 π [x n sin(nx) ]π 0 − 2 nπ π∫ 0 sin (nx) dx = 2 n2π [cos(nx)] π 0 = 2 n2π ((−1)n − 1) . So an = { − 4 n2π , for n odd; 0 , for n even , and the Fourier series in question has the form F(x) = π 2 + 2 π ∞∑ n=1 ( (−1)n − 1 n2 cos(nx) ) = π 2 − 4 π ∞∑ n=1 cos((2n − 1)x) (2n − 1)2 = π 2 − 4 π ( cos x + cos(3x) 32 + cos(5x) 52 + · · · ) . CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING An elementary exercise on integration by parts shows that this is an orthogonal system of functions on the interval [−π, π]. These functions are periodic with common period 2π (see the definition below). “Fourier analysis”, which builds upon this orthogonal system, allows us to work with all piecewise continuous periodic functions with extraordinary efficiency. Since many physical, chemical, and biological data are perceived, received, or measured, in fact, by frequencies of the signals (the measured quantities), it is really an essential mathematical tool. Biologists and engineers often use the word “signal” in the sense of “function”. Periodic functions A real or complex valued function f defined on R is called a periodic function with period T > 0 if f(x + T) = f(x) for every x ∈ R. It is evident that sums and products of periodic functions with the same period are again periodic functions with the same period. We note that the integral ∫ x0+T x0 f(x) dx of a periodic function f on an interval whose length equals the period T is independent of the choice of x0 ∈ R. To prove it, it is enough to suppose 0 ≤ x0 < T, using a translation by a suitable multiple of T. Then, ∫ x0+T x0 f(x) dx = ∫ T x0 f(x) dx + ∫ x0+T T f(x) dx = ∫ T x0 f(x) dx + ∫ x0 0 f(x) dx = ∫ T 0 f(x) dx. Next, remind the Fourier’s orthogonal system with the norms over intervals of length 2π equal to ∥ sin nx∥2 = ∥ cos nx∥2 = π (cf. 6.2.6). Using it to built series as in the theorem 7.1.5, we arrive at Fourier series The series of functions F(x) = a0 2 + ∞∑ n=1 ( an cos(nx) + bn sin(nx) ) with coefficients an = 1 π ∫ x0+2π x0 g(x) cos(nx) dx, bn = 1 π ∫ x0+2π x0 g(x) sin(nx) dx, is called the Fourier series of a function g on the interval [x0, x0 + 2π]. The coefficients an and bn are called Fourier coefficients of the function g. 639 Thus F1(x) = π 2 − 4 cos(x) π , F3(x) = F1(x) − 4 cos(3x) 9π , F6(x) = F3(x) − 4 cos(5x) 25π . Below we illustrate F1, F3 and F6 together in a single figure: To better observe the convergence of the Fourier series in a small neighbourhood around the origin, we may zoom in on the graph at that point: Notice the Fourier series described here could be derived by easier means, namely by integrating the Fourier series of the square wave function (Heaviside’s function), see 7.1.9. □ 7.A.12. Use Sage to provide a solution of the task in 7.A.11. Moreover, show that F5(x) = F6(x) for all x. ⃝ 7.A.13. Compute the Fourier series of the 2π-periodic extension of the sign function sgn : [−π, π] → R, where recall that sgn(x) =    −1 , for − π ≤ x < 0 , 0 , for x = 0 , 1 , for 0 < x ≤ π . ⃝ CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING If T is the time taken for one revolution of an object moving round the unit circle at constant speed, then that constant speed is ω = 2π/T. In practice, we often want to work with Fourier series with an arbitrary primary period T of the functions, not just 2π. 
Then we should employ the functions cos(ωnx), sin(ωnx), where ω = 2π T . By substitution t = ωx, we can verify the orthogonality of the new system of functions and recalculate the coefficients in the Fourier series F(x) of a function g on the interval [x0, x0 + T]: F(x) = a0 2 + ∞∑ n=1 ( an cos(nωx) + bn sin(nωx) ) , which have values an = 2 T ∫ x0+T x0 g(x) cos(nωx) dx, bn = 2 T ∫ x0+T x0 g(x) sin(nωx) dx. 7.1.7. The complex Fourier coefficients. Parametrize the unit circle in the form: eiωt = cos ωt + i sin ωt. For all integers m, n with m ̸= n, ∫ π −π eimx e−inx dx = ∫ π −π ei(m−n)x dx = 1 i(m−n) [ ei(m−n)x ]π −π = 0. Thus for m ̸= n, the integral ⟨eimx , einx ⟩ = 0. Fourier’s complex orthogonal system The following functions form a complex valued orthogonal system e−inωt , . . . , e−iωt , 1, eiωt , ei2ωt . . . , einωt , . . . Note that if m = n, then ∫ π −π eimx e−imx dx = ∫ π −π dx = 2π. The orthogonality of this system can be easily used to recover the orthogonality of the real Fourier’s system: Rewrite the above result as ∫ π −π (cos mx + i sin mx)(cos nx − i sin nx) dx = 0 By expanding and separating into real and imaginary parts we get both ∫ π −π (cos mx cos nx + sin mx sin nx) dx = 0 ∫ π −π (sin mx cos nx − cos mx sin nx) dx = 0 640 7.A.14. Use Sage to describe the Fourier series of the 2π-periodic extension of the function f(x) = { 0 , x ∈ [−π, 0) , sin(x) , x ∈ [0, π) . Additionally, plot the graphs of F3, F5, F8 and f in a single figure, and specify the 8th and 48th coefficients of the corresponding Fourier series. Solution. We have T = 2π and we are interested in the interval [−π, π). Let us begin by computing the Fourier coefficients of f, which can be done with the method applied above, involving the commands piecewise and integral: f1(x)=0; f2(x)=sin(x) f=piecewise([[(-pi, 0), f1], [(0, pi), f2]]) L=pi; n=var("n") an=(1/L)*integral(f1(x)*cos(n*pi*x/L),x,-pi, 0) +(1/L)*integral(f2(x)*cos(n*pi*x/L),x,0,pi) show("a_n=", an.expand()) bn=(1/L)*integral(f1(x)*sin(n*pi*x/L),x,-pi, 0) +(1/L)*integral(f2(x)*sin(n*pi*x/L),x,0,pi) show("b_n=", bn.expand()) Running this, we get the answers a_n= − cos (πn) π(n2 − 1) − 1 π(n2 − 1) , b_n= − sin (πn) π(n2 − 1) , which both make sense for all n ̸= 1. Thus, for all positive integers n ̸= 1 we have an = − 1 π(n2 − 1) ( (−1)n + 1) , bn = 0 and moreover a0 = 2 π . For n = 1 we have the integrals a1 = 1 2π π∫ 0 sin(2x) dx , b1 = 1 π π∫ −π f(x) sin(x) dx , and it is easy to see that a1 = 0 and b1 = 1/2. In this way we arrive at the Fourier series F(x) = 1 π + sin x 2 + 1 π ∞∑ n=2 ( (−1)n + 1 1 − n2 cos(nx) ) . Since (−1)n + 1 = 0 when n is odd, and (−1)n + 1 = 2 when n is even, put n = 2m to obtain the expression F(x) = 1 π + sin x 2 − 2 π ∞∑ m=1 cos(2mx) 4m2 − 1 . For the partial sums we see that F3(x) = 1 π + sin(x) 2 − 2 cos(2x) 3π , F5(x) = − 2 cos(4x) 15π + F3(x) , F8(x) = − 2 cos(8x) 63π − 2 cos(6x) 35π + F5(x) . In Sage to obtain these expressions we may continue typing in the preceding block the syntax given here: CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING By replacing n with −n, we have also ∫ π −π (cos mx cos nx − sin mx sin nx) dx = 0 ∫ π −π (sin mx cos nx + cos mx sin nx) dx = 0 and hence, with m ̸= n, ∫ π −π cos mx cos nx dx = 0 ∫ π −π sin mx sin nx dx = 0 ∫ π −π sin mx cos nx dx = 0 which proves again the orthogonality of the real valued Fourier system. Note the case m = n > 0, when we recovered again ∫ π −π cos2 nx dx = ∥ cos(nx)∥2 = π, ∫ π −π sin2 nx dx = ∥ sin(nx)∥2 = π. If n = 0, then ∥1∥2 = 2π. 
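These orthogonality relations can also be confirmed in Sage. The following block is a minimal sketch of our own (not one of the worksheets referenced in this chapter); the concrete values m = 3, n = 5 are an arbitrary choice of two distinct integers, and the first result may need full_simplify() to collapse to 0:

x = var("x")
m = 3; n = 5
# complex system: the integral vanishes for m != n and equals 2*pi for m = n
show(integral(exp(I*m*x)*exp(-I*n*x), x, -pi, pi).full_simplify())  # should print 0
show(integral(exp(I*m*x)*exp(-I*m*x), x, -pi, pi))                  # 2*pi
# the recovered real relations
show(integral(cos(m*x)*cos(n*x), x, -pi, pi))   # 0
show(integral(cos(n*x)^2, x, -pi, pi))          # pi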
For a (real or complex) function f(t) with −T/2 ≤ t ≤ T/2, and all integers n, we can define, in this context, its complex Fourier coefficients by the complex numbers cn = 1 T ∫ T/2 −T/2 f(t) e−iωnt dt. The relation between the coefficients an and bn of the Fourier series (after recalculating the formulae for these coefficients for functions with a general period of length T) and these complex coefficients cn follow from the definitions. Clearly, c0 = a0/2, and for natural numbers n, we have cn = 1 2 (an − ibn), c−n = 1 2 (an + ibn). If the function f is real valued, cn and c−n are complex conjugates of each other. We note here that for real valued functions with period 2π, the Bessel inequality in this notation becomes 1 2 |a0|2 + ∞∑ n=1 (|an|2 + |bn|2 ) ≤ ∥f∥2 = 1 π ∫ π −π |f(t)|2 dt. We have expressed the Fourier series F(t) for a function f(t) in the form F(t) = ∞∑ n=−∞ cn eiωnt . In both cases of real or complex valued functions, the corresponding Fourier series can be written in this form. In general, the coefficients are complex. We return to this expression later, in particular when dealing with Fourier transforms. For fixed T, the expression ω = 2π/T describes how the frequency changes if n is increased by one. 641 FS3 = f.fourier_series_partial_sum(3,pi) FS5 = f.fourier_series_partial_sum(5,pi) FS8 = f.fourier_series_partial_sum(8,pi) show("3rd partial sum = ",FS3) show("5th partial sum = ",FS5) show("8th partial sum = ",FS8) a = f.plot(x, -pi, pi, thickness=2, color="black", legend_label="$f$") b = plot(FS3, x,-pi,pi,linestyle="--", color="blue",legend_label="$F_3$") c= plot(FS5, x,-pi,pi,linestyle="-.", color="green", legend_label="$F_5$") d= plot(FS8, x, -pi, pi, linestyle=":", color="red", legend_label="$F_8$") (a+b+c+d).show() This block also constructs the required graphs, which we present below. Finally, for calculating the Fourier coefficients Sage provides built-in functions. For instance, one can use the syntax f.fourier_series_cosine_coefficient(k), which corresponds to the kth coefficient of the cosine Fourier series of our periodic function f. Notice that if you are interested in “sine coefficients”, then the right function is f.fourier_series_sine_coefficient(k), and in this case one should ensure that uses the sine function correctly, see 7.D.8 for an example. Now you may check yourselves the desired coefficients, by simply adding the following cell: Fc8=f.fourier_series_cosine_coefficient(8) show("8th coefficient = ",Fc8) Fc48=f.fourier_series_cosine_coefficient(48) show("48th coefficient = ",Fc48) In 7.D.11 we will present an alternative method using the Sage functions introduced in this block to determine and illustrate certain partial sums of a Fourier series, without using the Sage function fourier_series_partial_sum. □ CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING We wish to show that the Fourier orthogonal system is complete on S0 [−π, π]. This needs a thorough technical preparation. So now, we only formulate the main result, and add some practical notes. The proof of the theorem will be discussed later, see 7.1.15. 7.1.8. Theorem. Consider a bounded interval [a, b] of length T = b − a. Let f be a real or complex valued function in S1 [a, b] (i.e., a piecewise continuous function with a piecewise continuous first derivative), extended periodically on R. Then: (1) The partial sums sN of its Fourier series converge pointwise to the function g(x) = 1 2 ( lim y→x+ f(y) + lim y→x− f(y) ) . 
(2) Moreover, if the periodic extension of $f \in S^1[a, b]$ is a continuous function on R, then the pointwise convergence of its Fourier series is uniform.
(3) The $L_2$-distance $\|s_N - f\|$ converges to 0 for N → ∞.

7.1.9. Periodic extension of functions. The Fourier series converges, of course, outside the original interval [−T/2, T/2], since it is a periodic function on R. The Heaviside function g is defined by
$$g(x) = \begin{cases} -1 & \text{if } -\pi < x < 0, \\ 1 & \text{if } 0 < x < \pi \end{cases}$$
(We do not need to define the values at zero and at the end points of the interval, because these do not affect the coefficients of the Fourier series.) The periodic extension of the Heaviside function onto all of R is usually called the square wave function. Since g is an odd function, the coefficients of the functions cos(nx) are all zero. The coefficients of sin(nx) are given by
$$b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} g(x) \sin(nx)\, dx = \frac{2}{\pi} \int_0^{\pi} \sin(nx)\, dx = \frac{2}{n\pi} \big( 1 - (-1)^n \big).$$
Thus the Fourier series of g is
$$g(x) = \frac{4}{\pi} \Big( \sin(x) + \frac{1}{3}\sin(3x) + \frac{1}{5}\sin(5x) + \cdots \Big).$$
The partial sums of its first five and fifty terms, respectively, are shown in the following diagrams.

(Two diagrams: the graphs of these two partial sums over the interval [−4, 4].)

7.A.15. Fourier series over other intervals. Consider the function g : [−1, 1) → R, defined by
$$g(x) = \begin{cases} 0, & \text{for } -1 \le x < 0, \\ x + 1, & \text{for } 0 \le x < 1. \end{cases}$$
(a) Determine a 2-periodic extension ĝ of g for x ∈ [−3, 3) and next sketch its graph via Sage.
(b) Find the corresponding Fourier series on [−1, 1), and next confirm your result via Sage.
(c) Use Sage to plot the partial sums F5(x) and F55(x) together with ĝ, for x ∈ (−3, 3).

Solution. (a) By assumption the period is T = 2, and on the interval [−3, 3) a 2-periodic extension of g has the form ĝ(x) = g(x + 2) for −3 ≤ x < −1, ĝ(x) = g(x) for −1 ≤ x < 1, and ĝ(x) = g(x − 2) for 1 ≤ x < 3. Explicitly,
$$\hat{g}(x) = \begin{cases} 0, & -3 \le x < -2, \\ x + 3, & -2 \le x < -1, \\ 0, & -1 \le x < 0, \\ x + 1, & 0 \le x < 1, \\ 0, & 1 \le x < 2, \\ x - 1, & 2 \le x < 3. \end{cases}$$
In Sage we can use the piecewise command to plot ĝ, so one may try the following block:

g1(x)=0;g2(x)=x+1;g3(x)=x+3;g4(x)=x-1
g=piecewise([[(-3,-2),g1],[(-2,-1),g3],
[(-1,0),g1],[(0,1),g2],[(1,2),g1],[(2,3),g4]])
a = g.plot(x, -3, 3,thickness=3, color="steelblue",
exclude=[-2, -1, 0, 1, 2], legend_label=r"$\hat{g}$")
p0=point([-3, 0], size=30, color="black")
p1=point([-2, 1], size=30, color="black")
p2=point([-1, 0], size=30, color="black")
p3=point([0, 1], size=30, color="black")
p4=point([1, 0], size=30, color="black")
p5=point([2, 1], size=30, color="black")
show(a+p0+p1+p2+p3+p4+p5)

Let us present the result:

CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING

If the interval [−T/2, T/2] is chosen for the prime period T of such a square wave function, the resulting Fourier series is
$$g(x) = \frac{4}{\pi} \Big( \sin(\omega x) + \frac{1}{3}\sin(3\omega x) + \frac{1}{5}\sin(5\omega x) + \cdots \Big).$$
The number ω = 2π/T is also called the phase frequency of the wave. As the number of terms of the series increases, the approximation gets much better except in a (still shrinking) neighbourhood of the discontinuity point. There, the maximum of the deviation remains roughly the same. This is a general property of Fourier series and it is called the Gibbs phenomenon. In accordance with 7.1.8(1), the Fourier series of the Heaviside function converges to the mean of the two one-sided limits at the points of discontinuity.
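The two diagrams can be reproduced with the same Sage tools as in the preceding exercises. The block below is a minimal sketch of ours (the truncation orders 5 and 50 merely match the counts mentioned above); zooming in near the jump at x = 0 makes the Gibbs phenomenon clearly visible:

g = piecewise([[(-pi, 0), -1], [(0, pi), 1]])   # the Heaviside (square wave) function
s5 = g.fourier_series_partial_sum(5, pi)        # first five terms
s50 = g.fourier_series_partial_sum(50, pi)      # first fifty terms
a = g.plot(x, -pi, pi, color="black")
(a + plot(s5, x, -pi, pi, linestyle="--")).show()
(a + plot(s50, x, -pi, pi, linestyle="--")).show()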
Of course, we cannot expect that the convergence of Fourier series for functions g with discontinuity points be uniform (then, the function g would have to be continuous itself, being a uniform limit of continuous functions). 7.1.10. Utilizing symmetry of functions. We consider the problem of finding the Fourier series of the function f(x) = x2 on the interval [0, 1]. If we just periodically extend this function from the given interval [0, 1], the resulting function would not be continuous, and so the convergence at integers would be as slow as in the case of a square wave function. However, we can easily extend the domain of f to the interval [−1, 1], so that f(x) = x2 is an even function for −1 ≤ x ≤ 1. If we then extend periodically, the result is continuous. The resulting Fourier series then converges uniformly, and since then f is even, only the coefficients an are non-zero. For n > 0, iterated application of integration by parts yields: an = 2 2 ∫ 1 −1 x2 cos (2πnx 2 ) dx = 2 ∫ 1 0 x2 cos(πnx) dx = 4 π2n2 (−1)n . The remaining coefficient is a0 = 2 2 ∫ 1 −1 x2 dx = 2 ∫ 1 0 x2 dx = 2 3 . The entire series giving the periodic extension of x2 from the interval [−1, 1] thus equals g(x) = 1 3 + 4 π2 ∞∑ n=1 (−1)n n2 cos(πnx). By the Weierstrass criterion, this series converges uniformly. Therefore, g(x) is continuous. By theorem 7.1.8, g(x) = f(x) = x2 on the interval [−1, 1]. Thus our series approximates the function x2 on the interval [0, 1] better (i.e. faster) than we could achieve with the periodic extension of the function from [0, 1] interval only. If we put x = 1 and rearrange, 643 Notice that we have included the points where the jump discontinuities occur within the interval [−3, 3), using the command point. (b) In this task one should apply the general formulae given in 7.1.6. We have ω = 2π/T = π, and an = 2 T x0+T∫ x0 g(x) cos(nωx) dx , ∀ n ∈ N , bn = 2 T x0+T∫ x0 g(x) sin(nωx) dx , ∀ n ∈ Z+ . Now it is easy to see that a0 = 1∫ −1 g(x) dx = 1∫ 0 (x+1) dx = 3 2 and moreover an = 1∫ −1 g(x) cos(nπx) dx = 1∫ 0 (x + 1) cos(nπx) dx = (−1)n − 1 n2π2 , bn = 1∫ −1 g(x) sin(nπx) dx = 1∫ 0 (x + 1) sin(nπx) dx = 1 − 2(−1)n nπ . In Sage to confirm these computations, it suffices to use only g (and not its extension), as follows: g1(x)=0;g2(x)=x+1 g=piecewise([[(-1, 0), g1], [(0, 1), g2]]) T=2; n=var("n"); Om=pi an=(2/T)*integral(g1(x)*cos(n*Om*x),x,-1,0)\ +(2/T)*integral(g2(x)*cos(n*Om*x), x,0,1) show("a_n=", an.expand()) bn=(2/T)*integral(g1(x)*sin(n*Om*x),x,-1,0)\ +(2/T)*integral(g2(x)*sin(n*Om*x), x,0,1) show("b_n=", bn.expand()); integral(x+1, x, 0, 1) In this way we obtain the following expression of the Fourier series in question: 3 4 + ∞∑ n=1 ( (−1)n −1 n2π2 cos(nπx) + 1−2(−1)n nπ sin(nπx) ) , and one can try to obtain refinements by considering odd/even cases. (c) This task can be solved by the following block: CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING we obtain the remarkable result π2 4 (g(1) − 1 3 ) = π2 6 = ∞∑ n=1 (−1)2n n2 = ∞∑ n=1 1 n2 . We proceed with a further illustration. Let us try to differentiate the Fourier series for x2 term by term and, of course we hope to get the Fourier series of x on the interval −1 < x < 1 x = 2 π ∞∑ n=1 (−1)n+1 n sin(πnx). If it was truly the series of x, it could not converge uniformly since the periodic extension of the function x from the interval [−1, 1] is not a continuous function. Thus our differentiation term by term is not guaranteed by the general theorem. 
However, a straightforward computation of the Fourier coefficients shows that this is the Fourier series of x. The series for $x^2$ converges uniformly, and so we can integrate it term by term, obtaining
$$\frac{1}{3}x^3 = \frac{2}{3}x + \frac{4}{\pi^3} \sum_{n=1}^{\infty} \frac{(-1)^n}{n^3} \sin(\pi n x).$$
This is valid for −1 < x < 1. It is not valid for other values of x, since the series is periodic, but the other two terms are not. Of course, we may substitute the above Fourier series of the function x and thus obtain the Fourier series for $x^3$ on the interval [−1, 1] this way. Notice that this series no longer converges uniformly. In this context, we use the following terminology for the Fourier series of even or odd functions:

The sine and cosine Fourier series
For a given (real or complex) valued function f on an interval [0, T) of length T > 0, the Fourier series of its even periodic extension (with period 2T) is called the cosine Fourier series of f, while the Fourier series of the odd periodic extension of f is called the sine Fourier series of the function f.

7.1.11. General Fourier series and wavelets. In the case of a general orthogonal system of functions $f_n$ and the series generated from it, we often talk about the general Fourier series with respect to the orthogonal system of functions $f_n$. Fourier series and further tools built upon them are used for processing various signals, images, and other data. In fact, these mathematical tools also underpin many fundamental models in science, including, for example, the modelling of brain function, as well as much of theoretical physics. The periodic nature of the sine and cosine functions used in classical Fourier series, and their simple scaling by increasing the frequency in unit steps, limit their usability. In many

g1(x)=0;g2(x)=x+1;g3(x)=x+3;g4(x)=x-1
g=piecewise([[(-3, -2), g1], [(-2, -1), g3],
[(-1, 0), g1], [(0, 1), g2], [(1, 2), g1],
[(2, 3), g4]])
partS5 = g.fourier_series_partial_sum(5,1)
partS55 = g.fourier_series_partial_sum(55,1)
a = g.plot(x,-3,3,thickness=3,color="steelblue",
exclude=[-2, -1, 0, 1, 2], legend_label=r"$f$")
s5=plot(partS5, x, -3, 3, linestyle="--",
We do not have space here to go into details, but the readers may find many excellent specialized books in this fascinating area of applied Mathematics. Here we consider one simple example. 7.1.12. The Haar wavelets. Perhaps the first question to start with is, how to effectively approximate any given function with piecewise constant ones. For various reasons, it is good if our mother wavelet ψ has zero mean, too. Thus we want to consider an analogue of the Heaviside function ψ(x) = { 1 x ∈ [0, 1/2) −1 x ∈ [1/2, 1). As a straightforward exercise we may check that, indeed, the resulting system of functions ψj,k is orthonormal. Another exercise shows, using finite linear combinations of these functions, that we may approximate any constant function with given precision over a bounded interval. In an exercise we shall see ?? that this already verifies the density properties required for the orthogonal mother wavelet functions. 1The roots of wavelets go back to various attempts, how to localize the basic signals in both time and frequency, with diverse motivations from engineering and other applications. The name wavelet seems to be related to the idea of having a wave similar signal which begins and ends with zero amplitude. Since late 1970’s, these attempts were related to many names (e.g. Morlet, Meyer) and the wavelet theory became the main tool in signal analysis. Of course, first examples of wavelets are much older, the Haar’s construction goes back to 1909. Actually, many of the wavelet types do not represent orthogonal systems of functions, they rather share the idea of a combination of the high pass filters and low pass filters. The reader is advised to consult extremely rich literature, if interested in more details 645 color="darkgreen", legend_label=r"$F_{5}$") s55=plot(partS55, x, -3, 3, linestyle="--", color="darkgreen", legend_label=r"$F_{55}$") show(a+s5);show(a+s55) Notice the resulting approximations are much slower comparing the situation in the previous examples. By the illustrations one can also demonstrate the “Gibbs phenomenon”. This is the overshooting in the jumps, which is proportional to the magnitudes of the jumps, see the figures here: □ It is often convenient (and usually easier) to express the Fourier series using the complex coefficients cn instead of the real coefficients an and bn. This is a straightforward consequence of the facts einωx = cos(nωx) + i sin(nωx) or, vice versa cos(nωx) = 1 2 (einωx + e−inωx ) , sin(nωx) = 1 2i (einωx − e−inωx ) . The resulting series for a real or complex valued function g on the interval [−π, π] is F(x) = ∑∞ n=−∞ cn einx with cn = 1 2π ∫ π −π e−inx g(x) dx . CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING Now we consider the question of effective treatment. Notice that we can also use the characteristic function φ of the interval [0, 1) and write ψ(x) = φ(2x)−φ(2x−1) = 1 √ 2 √ 2φ1,0(x)− 1 √ 2 √ 2φ1,1(x). The function φ plays the role of the father wavelet function and it itself satisfies φ(x) = φ(2x)+φ(2x−1) = 1 √ 2 √ 2φ1,0(x)+ 1 √ 2 √ 2φ1,1(x). This can be interpreted as differencing and averaging the two consecutive values at the half scale. With these properties, there is no need for an explicit analytic form of ψ and φ, since we can find their values recurrently in all dyadic points x. Indeed, φ(2j−1 n) = φ(2j n) + φ(2j n − 1). The function φ has another useful feature. Namely we can obtain the unit constant function by adding all its integer trans- lations ∞∑ k=−∞ φ0,k(x) = 1 for all x ∈ R. 
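Readers who wish to see the "straightforward exercise" on orthonormality carried out concretely may try the following self-contained sketch (plain Python, runnable in a Sage session; the window [−4, 4] and the resolution J = 8 are our arbitrary choices, sufficient for the small indices tested):

def psi(t):
    # the Haar mother wavelet from 7.1.12
    return 1.0 if 0 <= t < 0.5 else (-1.0 if 0.5 <= t < 1 else 0.0)

def ip(j1, k1, j2, k2, J=8):
    # inner product of psi_{j1,k1} and psi_{j2,k2}; for j >= 0 both factors
    # are constant on dyadic cells of length 2^(-J), so the midpoint
    # Riemann sum below is exact (up to rounding)
    h = 2.0**(-J)
    pts = [-4 + (i + 0.5)*h for i in range(int(8/h))]
    w = lambda t, j, k: 2.0**(j/2) * psi(2.0**j * t - k)
    return h * sum(w(t, j1, k1) * w(t, j2, k2) for t in pts)

print(ip(0, 0, 0, 0))   # 1.0 -- normalized
print(ip(0, 0, 1, 0))   # 0.0 -- orthogonal, overlapping supports
print(ip(1, 0, 1, 1))   # 0.0 -- orthogonal, disjoint supports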
Finally, nearly all the coefficients in the general Fourier series with the base $\psi_{j,k}$ vanish for piecewise constant functions. On the contrary, the function φ "sees" the constants. In engineering terminology, this is an instance of the high pass filter and low pass filter.

7.1.13. Example. To illustrate the above considerations, we approximate the following function f(x) in R by the Haar wavelets,
$$f(x) = \begin{cases} 0.3(x + 3), & -2 \le x \le -1, \\ 0.7(x + 1), & 0 \le x \le 1, \\ 0 & \text{otherwise}. \end{cases}$$
The function in question is not periodic and could not be approximated well by classical Fourier series. The individual functions $\psi_{j,k}$ from 7.1.11 have compact support, but in order to approximate constant or linear behaviour, we still need a large number of them. The following illustrations have been acquired in Maple, working with indices |j| ≤ n and |k| ≤ n, the first one with n = 5, the second one with n = 10. The approximation on the sides of the interval is not as good as in the middle, because we do not include enough shifts, i.e. values of k. One of the motivations for the scaling

Let us discuss a few examples; for more details see 7.1.7 (see also 7.D.10).

7.A.16. Complex Fourier series. (a) Determine the complex version of the Fourier series F(x) of the function f(x) = x with x ∈ (−π, π). Next use the results from 7.A.10 to confirm that
$$c_n = \frac{1}{2}(a_n - i b_n), \qquad c_{-n} = \overline{c_n}.$$
(b) Verify that the complex form of the Fourier series matches the result given in 7.A.10.

Solution. (a) By definition, $F(x) = \sum_{n=-\infty}^{\infty} c_n e^{inx}$, hence one needs to specify the complex Fourier coefficients $c_n$. We have
$$c_0 = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\, dx = \frac{1}{2\pi} \int_{-\pi}^{\pi} x\, dx = 0,$$
since f(x) = x is an odd function. For n ≠ 0, using integration by parts we get
$$c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) e^{-inx}\, dx = \frac{1}{2\pi} \int_{-\pi}^{\pi} x e^{-inx}\, dx = \frac{1}{2\pi} \Big[ -\frac{x e^{-inx}}{in} \Big]_{-\pi}^{\pi} - \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{e^{-inx}}{-in}\, dx$$
$$= -\frac{1}{2i\pi n} \big[ x e^{-inx} \big]_{-\pi}^{\pi} + \frac{1}{2i\pi n} \int_{-\pi}^{\pi} e^{-inx}\, dx = -\frac{1}{2i\pi n} \big( \pi e^{-in\pi} + \pi e^{in\pi} \big) + \frac{1}{2i\pi n} \Big[ -\frac{e^{-inx}}{in} \Big]_{-\pi}^{\pi}$$
$$= -\frac{2\pi \cos(n\pi)}{2\pi i n} - \frac{1}{2\pi (in)^2} \big[ e^{-inx} \big]_{-\pi}^{\pi} = -\frac{(-1)^n}{in} + \frac{1}{2\pi n^2} \big( e^{-in\pi} - e^{in\pi} \big) = -\frac{(-1)^n}{in} + \frac{-2i \sin(n\pi)}{2\pi n^2} = -\frac{(-1)^n}{in}.$$
Here we used the identities $i^2 = -1$, $e^{-in\pi} + e^{in\pi} = 2\cos(n\pi) = 2(-1)^n$ and $e^{-in\pi} - e^{in\pi} = -2i\sin(n\pi) = 0$. Now, according to 7.A.10 we have $a_n = 0$, $b_n = \frac{2}{n}(-1)^{n+1}$, and it should be
$$c_n = \frac{1}{2}(a_n - i b_n) = \frac{1}{2} \Big( 0 - i\, \frac{2}{n}(-1)^{n+1} \Big) = \frac{-i(-1)^{n+1}}{n} = \frac{(-1)^{n+1}}{in},$$
since −i = 1/i. This agrees with the result presented above and can be verified in Sage using the bool function. For example, you may type the cell

var("n")
bn=(2*(-1)^(n+1))/n; cn=(-(-1)^(n))/(i*n)
bool(cn==-(i*bn)/2)

which returns True. In a similar way, for $c_{-n}$ one gets $c_{-n} = \frac{(-1)^n}{in}$, and it is easy to see that $\overline{c_n} = c_{-n}$, that is, $c_{-n}$ is the complex conjugate of $c_n$, see also 7.1.7.

CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING

and shifting in the construction of wavelets is the hope for a small amount of non-zero coefficients. But this does not mean that most of the coefficients would be zero. In our case the percentage of non-zero coefficients for n = 1, 2, …, 10 is 55.6, 44.0, 38.8, 34.6, 32.2, 30.8, 29.8, 28.7, 28.0, 27.4.

7.1.14. Concluding remarks. A series of famous wavelets $D_N$ is named after Ingrid Daubechies. They are constructed by very similar recurrent averaging and differencing relations based on certain natural requirements. Just as an indication, consider the slightly more general recurrent relations
$$\varphi(x) = \sqrt{2} \sum_{k=0}^{N} h_k \varphi(2x - k), \qquad \psi(x) = \sqrt{2} \sum_{k=0}^{N} g_k \varphi(2x - k)$$
with yet unknown constants $g_k$ and $h_k$.
If we want the mother wavelet ψ(x − k) to have zero coefficients in the resulting series for all polynomials up to the order N − 1, then we must ensure that $\int_{-\infty}^{\infty} x^k \psi(x)\, dx = 0$ for all k = 0, 1, …, N − 1. Similar conditions determine the Daubechies wavelets. The standards of JPEG2000 are based on similar wavelets, and such techniques provide tools for professional compression of visual data in the film industry, or the format DjVu for compressed publications. In the diagram below, there are the Daubechies D4 mother and father wavelets. In real applications, the orthogonality of the mother wavelet can be relaxed. As long as the functions $\psi_{k,l}$ are linearly independent and generate the whole space of interest, we always get the dual basis with respect to the $L_2$ inner product.

For the complex expression of F(x) we finally get
$$F(x) = c_0 + \sum_{n=1}^{\infty} c_n e^{inx} + \sum_{n=1}^{\infty} c_{-n} e^{-inx} = 0 + \sum_{n=1}^{\infty} \frac{-(-1)^n}{in} e^{inx} + \sum_{n=1}^{\infty} \frac{(-1)^n}{in} e^{-inx} = \sum_{n=1}^{\infty} \frac{(-1)^n}{in} \big( e^{-inx} - e^{inx} \big). \quad (♭)$$
(b) This follows directly from (♭), combined with the identity $-2i\sin(nx) = e^{-inx} - e^{inx}$, i.e.,
$$F(x) = \sum_{n=1}^{\infty} \frac{(-1)^n}{in} \big( e^{-inx} - e^{inx} \big) = \frac{1}{i} \sum_{n=1}^{\infty} \frac{(-1)^n}{n} \big( -2i\sin(nx) \big) = \sum_{n=1}^{\infty} \frac{2(-1)^{n+1}}{n} \sin(nx),$$
which is the formula presented in 7.A.10. □

7.A.17. Compute the complex Fourier series of the 2π-periodic extension of the exponential map $f(x) = e^x$, with x ∈ (−π, π). Next use Sage to illustrate the 20th partial sum of the Fourier series together with the graph of f. ⃝

When f is a periodic function with period 2L for some L > 0, then the complex Fourier series for $x \in (x_0, x_0 + 2L)$ has the form $F(x) = \sum_{n=-\infty}^{\infty} c_n e^{\frac{i\pi n x}{L}}$, where
$$c_n = \frac{1}{2L} \int_{x_0}^{x_0+2L} f(x)\, e^{-\frac{i\pi n x}{L}}\, dx.$$
The next task focuses on this case and is left as a challenge for you.

7.A.18. For the function $f(x) = e^{-x}$ with x ∈ [−1, 1] and f(x) = f(x + 2) outside the interval [−1, 1], compute the complex Fourier coefficients and derive the corresponding Fourier series. ⃝

Let f : [a, b] → R be a function for which f′ exists. We say that f is piecewise differentiable if both f and f′ are piecewise continuous on [a, b]. Suppose now that f is a 2L-periodic piecewise differentiable function defined on an interval of length 2L, for some L > 0, and let F be the Fourier series of f. Assuming that $\lim_{n\to\infty} F_n(x) = F(x)$, the first part of the theorem in 7.1.8 (known as the "Dirichlet condition", see 7.1.16) implies that at a point of continuity of f, the series F converges to the value of f at that point. Moreover, at a point of discontinuity the series F converges to the average of the left and right limits of f at that point. Thus, at a jump discontinuity $x_0$ of f, the value of the Fourier series F does not depend on the value $f(x_0)$, but only on the left and right limits of f at $x_0$. This means that the behaviour of the Fourier series of f at discontinuity points is not affected

CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING

7.1.15. Proof of theorem 7.1.8 about Fourier series. We return to the detailed proof of the basic properties of the classical Fourier series. We shall need several concepts and technicalities related to abstract metric spaces. Thus, the reader could enjoy reading first the general context of metrics and convergence, which we introduce at the beginning of the last part of this chapter. In this perspective, reading a few paragraphs starting with 7.3.1 first, and returning here later, might be a good idea.
On the other hand, we do not need much from the general theory of metric spaces and so our considerations of various concepts of convergence in the proof could also be of assistance for the abstract developments later. We do not worry here about necessary conditions for convergence, and many other formulations can be found in literature. On the other hand, the statement of theorem 7.1.8 is quite simple and deals with many useful situations, as we have seen already. Although we need only the L1 and L2 norms now, we should observe that for general 1 ≤ p < ∞, the formula ∥f∥p = (∫ b a |f|p )1/p defines also a norm. See the definition in 7.3.1 and the paragraph on the Lp-norms and Hölder inequality in 7.3.4 below. Moreover, there is the L∞ norm given by the suprema of values of f over the interval in question. For the sake of simplicity, we always work in the space S0 c or S1 c with respect to the corresponding norm (which always makes sense there). Hölder’s inequality (applied to the functions f and constant 1) yields the following bound on S0 [a, b]. Namely, for p > 1 and 1/p + 1/q = 1, ∫ b a |f(x)| dx ≤ |b − a|1/q (∫ b a |f(x)|p dx )1/p ≤ |b − a|1/q ∥f∥p. Replace f with fn − f. It is then clear from the above bound that Lp–convergence fn → f implies, for any p > 1, L1–convergence. (The terminology Lp–convergence is stronger than L1–convergence is sometimes used). With a modified bound, we can derive an even stronger proposition, namely that on a bounded interval, Lq–convergence is stronger than Lp–convergence whenever q > p; try this by yourselves. If uniform boundedness of the sequence of functions fn is given, then there is a constant C independent of n, so that ∥fn∥∞ ≤ C. Then we can assert that |fn(x)−f(x)| ≤ 2C, and it then follows that L1–convergence implies Lp–convergence, since 648 by the values of f there. On the other hand, recall that the continuity of f is a necessary and sufficient condition for the uniform convergence of F over the entire real line. Let us describe a few examples and for further material on the convergence of Fourier series see the final section of this chapter (Section D). 7.A.19. Describe the implications of Dirichlet’s condition for the Fourier series obtained in Problem 7.A.10. Solution. Consider the function f(x) = x, with x ∈ (−π, π), as in 7.A.10. The 2π-periodic extension of f has discontinuities at x = ±π, ±3π, ±5π, . . . and provides an example of a periodic piecewise differentiable function. At these discontinuities it is easy to see that the average of the left and right limits of f is zero. Therefore, by the result in 7.A.10 and Dirichlet’s condition one deduces that ∞∑ n=1 2(−1)n+1 n sin(nx) = { x , if − π < x < π , 0 , if x = ±π . □ 7.A.20. Convergence of Fourier series. Consider the square wave function (Heaviside’s function, see also 7.1.9), defined this time on [−1, 1], i.e., f(x) = { −1 , if − 1 ≤ x < 0 , 1 , if 0 < x ≤ 1 , with f(x + 2) = f(x) for all x outside the interval [−1, 1]. Determine the pointwise convergence of the corresponding Fourier series. Solution. To answer the task we do not need the explicit expression of the Fourier series F of f. In particular, f satisfies Dirichlet’s condition and hence F converges to f at all points where f is continuous. The discontinuities of f appear at x = 0, and x = ±1. In particular, for the endpoints we have f(−1) ̸= f(1) and the 2-periodic extension of f has discontinuities at ±1, ±3, . . .. 
At these points the Fourier series converges to the average of the left-hand and right-hand limits, which all equal to zero. For instance, we see that f(0+ ) = limx→0+ f(x) = 1 and f(0− ) = limx→0− f(x) = −1, hence 1 2 ( f(0− ) + f(0+ ) ) = 1 2 (−1 + 1) = 0 . The limits f(0− ) and f(0+ ) can be verified also in Sage, by utilizing the piecewise and limit commands. It is important however to specify the option algorithm = ”giac” within the limit command, as shown below: f = piecewise([[(-1, 0),-1], [(0,1),1]]) show(f(x).limit(x=0, dir="+", algorithm="giac")) show(f(x).limit(x=0, dir="-", algorithm="giac")) Thus F(x) converges to f(x) for all x, except of x = 0, ±1 where it converges to 0. □ CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING then ∫ b a |f(x) − fn(x)|p dx = = ∫ b a |f(x) − fn(x)|p−1 ||f(x) − fn(x)| dx ≤ (2C)p−1 ∫ b a |f(x) − fn(x)| dx which can be written ∥f − fn∥p ≤ (2C)1/q ∥f − fn∥ 1/p 1 . It follows that the Lp–norms on the space S0 [a, b] are equivalent with respect to the convergence of uniformly bounded sequences of functions. 7.1.16. Implications of the Dirichlet condition. The most difficult (and most interesting) problem is to prove the first statement of the theorem 7.1.8, which in the literature is often refered to as the Dirichlet condition, which seems to have been derived as early as in 1824. We begin by proving how this property of pointwise convergence implies the statements (2) and (3) of the theorem. Without loss of generality, we assume that we are working on the interval [−π, π], i. e. with period T = 2π. As the first step, we prepare simple bounds for the coefficients of the Fourier series. One bound is of course |an| ≤ 1 π ∫ π −π |f(x)| dx, and similarly for all the coefficients bn. This is because both cos(x) and sin(x) are bounded by 1 in absolute value. However, if f is a continuous function in S1 [−π, π], we can integrate by parts, thus obtaining (we write an(f) for the corresponding coefficient of the function f, and so on) bn(f) = 1 π ∫ π −π f(x) sin(nx) dx = −1 nπ [f(x) cos(nx)]π −π + 1 nπ ∫ π −π f′ (x) cos(nx) dx = 1 n an(f′ ). Notice f(−π) = f(π) by the continuity and thus the boundary term vanishes. Similarly we compute an(f) = −bn(f′ ). Iterating this procedure, we obtain a bound for functions f in Sk+1 [−π, π] with continuous derivatives up to order k inclusive |an(f)| ≤ 1 nk+1π ∫ π −π |f(k+1) (x)| dx, and similarly for bn(f). Thus we can see that the “smoother” a function is, the more rapidly the Fourier coefficients approach zero. For sufficiently smooth functions f, the nk –multiples of their Fourier coefficients an and bn are bounded by the L1-norm of their k–th derivative f(k) . 649 7.A.21. Using the Fourier series of the 2π-periodic extension of the function g(x) = | x |, with x ∈ [−π, π), sum the series ∞∑ n=1 1 (2n − 1)2 . Solution. To determine the value of this series, one can successfully apply several known Fourier series. The Fourier series of the function g coincides with the Fourier series presented in 7.A.11, i.e., π 2 − 4 π ∞∑ n=1 cos ( (2n − 1)x ) (2n − 1)2 . Since this function g is continuous on [−π, π) and | − π | = | π |, we have | x | = π 2 − 4 π ∞∑ n=1 cos ( (2n − 1)x ) (2n − 1)2 , x ∈ [−π, π] . Substituting x = 0, we get 0 = π 2 − 4 π ∞∑ n=1 1 (2n−1)2 . Therefore ∞∑ n=1 1 (2n − 1)2 = π2 8 . □ 7.A.22. Determine the convergence and uniform convergence of the Fourier series for the function g(x) = e−x , x ∈ [−1, 1). Solution. 
Again, it is unnecessary to calculate the corresponding Fourier series, since we wish only to check convergence. Let us consider the 2-periodic function s(x), defined as follows: s(x) = g(x) = e−x , x ∈ (−1, 1), s(1) = g(−1) + lim x→1− g(x) 2 = e + e−1 2 . According to 7.1.8, this function is the sum of the Fourier series in question. In other words, the Fourier series converges to s(x). Moreover, this convergence is uniform on every closed interval which contains none of the points 2k + 1, k ∈ Z. This follows from the continuity of the functions g and g′ on (−1, 1). On the other hand, the convergence cannot be uniform on any interval (c, d) such that [c, d] contains an odd integer. This is because the uniform limit of continuous functions is always a continuous function, and the periodic extension of s is not continuous at the odd integers. Thus, the series converges to the function g on (−1, 1), yet this convergence is uniform only on the subintervals [c, d] which satisfy the restriction −1 < c < d < 1. □ 7.A.23. Express the function g(x) = π2 −x2 on the interval [−π, π] in the form of a Fourier series. Using this expression, sum the two series CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING Let f be a continuous function in the space S1 [a, b]. By the Dirichlet condition, its Fourier series converge pointwise to f. Then we can assert that |sN (x) − f(x)| = ∞∑ k=N+1 (ak cos(kx) + bk sin(kx)) ≤ ∞∑ k=N+1 (|ak| + |bk|). The right-hand side can further be estimated by the coefficients a′ n and b′ n of the derivative f′ . By applying in succession the inequality above, then Hölder’s inequality for infinite series (with p = q = 2), and then with the arithmeticgeometric inequality, we obtain |sN (x) − f(x)| ≤ ∞∑ k=N+1 1 k (|a′ k| + |b′ k|) ≤ ( ∞∑ k=N+1 1 k2 )1 2 ( ∞∑ k=N+1 (|a′ k|2 + 2|a′ k||b′ k| + |b′ k|2 ) )1 2 ≤ ( ∞∑ k=N+1 1 k2 )1 2 ( ∞∑ k=N+1 (2|a′ k|2 + 2|b′ k|2 ) )1 2 ≤ √ 2 (∫ ∞ N 1 x2 dx )1 2 1 √ π ∥f′ ∥2 = ( √ 2 √ π ∥f′ ∥2 ) · 1 √ N . Thus we have obtained not only a proof of the uniform convergence of our series to the anticipated value, but also a bound for the speed of the convergence: Uniform convergence under the Dirichlet condition If f is a continuous function in S1 [a, b], then sup x∈R |sN (x) − f(x)| ≤ ( √ 2 √ π ∥f′ ∥2 ) · 1 √ N . This proves the statement 7.1.8.(2), supposing the Dirichlet condition 7.1.8.(1) holds. 7.1.17. L2–convergence. In the next step of our proof, we derive L2–convergence of Fourier series. The proof utilizes the common technique of approximation objects which are not continuous by ones which are. We describe it without further details. Interested readers should be able to fill in the gaps by themselves without any difficulties. First, we formulate the statement we need. Lemma. The subset of continuous functions f in S0 [a, b] on a finite interval [a, b] is a dense subset in this space with respect to the L2–norm. 650 (1) ∞∑ n=1 (−1)n+1 n2 , ∞∑ n=1 1 n2 . Solution. We could take advantage of the function g being even, and calculate the non-zero coefficients an by integration by parts. However, in 7.1.10 the Fourier series for the function f(x) = x2 on the interval [−1, 1] is derived. This proves the identity f(x) = 1 3 + 4 π2 ∞∑ n=1 (−1)n cos(nπx) n2 , x ∈ (−1, 1), valid also for x = ±1. By adding π2 and rescaling, it follows that g(x) = π2 − ( 1 3 + 4 π2 ∞∑ n=1 (−1)n cos nπx π n2 ) π2 = 2 3 π2 + 4 ∞∑ n=1 (−1)n+1 cos(nx) n2 , x ∈ [−π, π]. Of course, one can also calculate the Fourier series of the function g directly. 
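In Sage, this direct computation follows the same pattern as the earlier blocks. Here is a sketch of ours under the convention T = 2π, ω = 1; the printed coefficients still contain cos(πn) and sin(πn), which reduce via cos(πn) = (−1)^n and sin(πn) = 0 for integers n:

g(x) = pi^2 - x^2
n = var("n")
a0 = (1/pi)*integral(g(x), x, -pi, pi)            # 4*pi^2/3, so a_0/2 = 2*pi^2/3
an = (1/pi)*integral(g(x)*cos(n*x), x, -pi, pi)
bn = (1/pi)*integral(g(x)*sin(n*x), x, -pi, pi)
show("a_n =", an.expand())   # reduces to 4*(-1)^(n+1)/n^2
show("b_n =", bn.expand())   # reduces to 0, as g is even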
Substituting x = 0 and x = π then gives π2 = 2 3 π2 + 4 ∞∑ n=1 (−1)n+1 n2 , i.e. ∞∑ n=1 (−1)n+1 n2 = π2 12 , and 0 = 2 3 π2 + 4 ∞∑ n=1 (−1)n+1 (−1)n n2 , i.e. ∞∑ n=1 1 n2 = π2 6 . In other words, π2 = 12 ( 1 − 1 22 + 1 32 − 1 42 + · · · ) = 6 ( 1 + 1 22 + 1 32 + 1 42 + · · · ) . □ We have seen that the Fourier series of a 2L-periodic odd function involves only sine functions, whereas Fourier series of a 2L-periodic even function involves only cosine functions. However, Fourier series are employed in many areas and often we are interested in functions that are only defined on intervals of the form [0, L], for some L > 0. In particular, to get the cosine series expansion of f we may extend it to [−L, L] so that the extended function is an even 2L-periodic function. In this case, by the result in 6.B.35 we will have an = 2 L ∫ L 0 f(x) cos (nπx L ) dx , bn = 0 . This gives the cosine extension a0 2 + ∞∑ n=1 an cos (nπx L ) . Similarly, to get the sine series expansion of f, we may extend it to [−L, L] in such a way that the extended function is an odd 2L-periodic function. Then we will have an = 0 , bn = 2 L ∫ L 0 f(x) sin (nπx L ) dx , CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING Sketch of the proof. Here "dense" means that for any g in S0 [a, b] and any ε > 0, there is some continuous f satisfying ∥f − g∥2 < ε. We deal with abstract topological concepts like this in the last part of this chapter. The idea of the proof can be seen easily via the example of approximation of Heaviside’s function h on the interval [−π, π]. We recall that h(x) = −1 for x < 0, and h(x) = 1 for x > 0. For every δ satisfying π > δ > 0, we define the function fδ as x/δ for |x| ≤ δ and fδ(x) = h(x) otherwise. All the functions fδ are continuous, in fact, piecewise linear. It can be calculated easily that ∥h−fδ∥2 → 0 so that h can be approximated in L2 norm by a sequence of continuous func- tions. All discontinuity points of a general function f can be catered for in exactly the same way. There are only finitely many of them, and so all of the considered functions are limit points of sequences of continuous functions. □ Now, our proof of the L2 convergence (under the assumption of the Dirichlet condition) is already simple because for the given function f, the distance between the partial sums of its Fourier series can be bounded by using a sequence of continuous functions fε in this way (all norms in this paragraph are the L2 norms): ∥f−sN (f)∥ ≤ ∥f−fε∥+∥fε−sN (fε)∥+∥sN (fε)−sN (f)∥ and the particular summands on the right-hand side can be controlled. The first one of them is at most ε, and according to the uniform convergence for continuous functions (as just proved above), the second summand can be bounded also by ε, if N is big enough. Notice that the third term has the value of the partial sum of the Fourier series for f−fε. Thus (cf. Theorem 7.1.5(1)), ∥f − fε − sN (f − fε)∥ ≤ ∥f − fε∥. Therefore by the triangle inequality, ∥sN (f−fε)∥ ≤ ∥sN (f−fε)−f+fε∥+∥f−fε∥ ≤ 2∥f−fε∥. Altogether, ∥f − sN (f)∥ ≤ 4ε. This verifies the L2 convergence of the functions sN (f) to f under the Dirichlet condition, which is we wanted to prove. 7.1.18. Dirichlet kernel. Finally, we arrive at the proof of the first statement of theorem 7.1.8. 
It follows from the definition of the Fourier series F(t) for a function f(t), using its expression with the complex exponential in 7.1.7, that the partial sums sN (t) can be written as sN (t) = 1 T N∑ k=−N ∫ T/2 −T/2 f(x) e−iωkx eiωkt dx, 651 which gives the sine extension ∞∑ n=1 bn sin (nπx L ) , see also at the end of the paragraph 7.1.10. Let us describe a few examples and for further material we refer to Section D. 7.A.24. Describe the cosine Fourier series for the sine function f(x) = sin(x), with x ∈ [0, π]. ⃝ 7.A.25. Describe the sine Fourier series for the function f(x) = cos(x), with x ∈ [0, π]. ⃝ We end this section with a tasks on the so called “Paresval’s” identity, and some applications of Fourier series. 7.A.26. Using Parseval’s identity for Fourier’s orthogonal system (part (3) of the theorem 7.1.5), verify that ∞∑ n=1 1 (2n−1)4 = π4 96 . Solution. It is imperative to choose an appropriate Fourier series. For instance, consider the Fourier series π 2 − 4 π ∞∑ n=1 cos((2n−1)x) (2n−1)2 , which we obtained for the function g(x) = | x |, x ∈ [−π, π) in 7.A.15(b). Parseval’s identity a2 0 2 + ∞∑ n=1 a2 n + ∞∑ n=1 b2 n = 2 T x0+T∫ x0 (g(x))2 d x says, substituting for the a’s and b’s from our particular series, that π2 2 + 16 π2 ∞∑ n=1 1 (2n−1)4 = 1 π π∫ −π | x |2 d x = 2 π π∫ 0 x2 d x = 2π2 3 , so, ∞∑ n=1 1 (2n − 1)4 = ( 2π2 3 − π2 2 ) π2 16 = π4 96 . □ There are other ways of obtaining this result, see for example (3) at the page 698. We recommend comparing the solutions of this exercise to the previous one. We started our discussion of Fourier series with the simplest of periodic functions f(t) = a sin(ωt + b) for certain constants a, ω > 0, b ∈ R. They appear as the general solution to the homogeneous linear differential equation (1) y′′ + ω2 y = 0 which arises in mechanics by Newton’s law of force for a moving particle. Recall the brief introduction to the simplest differential equations in 6.2.21 on page 552. Much more follows in Chapter 8. CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING where T is the period we are working with and ω = 2π/T. This expression can be rewritten as (1) sN (t) = ∫ T/2 −T/2 KN (t − x)f(x) dx, where the function KN (y) = 1 T N∑ k=−N eiωky is called the Dirichlet kernel. The sum is a (finite) geometric series with common ratio eiωy . By multiplying by eiωy and then subtracting, we obtain eiωy KN (y) = 1 T N∑ k=−N eiω(k+1)y (1 − eiωy )KN (y) = 1 T ( e−iNωy − ei(N+1)ωy ) . Provided ωy is not a multiple of 2π, we continue to obtain KN (y) = 1 T e−iNωy − ei(N+1)ωy 1 − eiωy = 1 T − e−i(N+1/2)ωy + ei(N+1/2)ωy eiωy/2 − e−iωy/2 = 1 T sin((N + 1/2)ωy) sin(ωy/2) in which the key step was to multiply both the numerator and the denominator by − e−iωy/2 . When y = 0, KN (0) = 1 T (2N + 1). The last expression shows that KN (y) is an even function. By l’Hospital’s rule, applied at y = 0, it is continuous everywhere. Since all the partial sums of the series for the constant function f(x) = 1 also equal 1, we obtain from the definition of the Dirichlet kernel, cf. (1), that ∫ T/2 −T/2 KN (x) dx = 1. In the case of periodic functions, the integrals over intervals whose length equals the period are independent of the choice of the end points. Hence, changing the coordinates, we can also use the expression sN (x) = ∫ T/2 −T/2 KN (y)f(x + y) dy for the partial sums. Finally, we are fully prepared. First, we consider the case when the function f is continuous (and piecewice differentiable) at the point x. 
We want to prove that in this case, the Fourier series F(x) for the function f converges to the value f(x) at the point x. We have sN (x) − f(x) = ∫ T/2 −T/2 (f(x + y) − f(x))KN (y) dy. 652 We mention that the function f has period T = 2π/ω. In mechanics, one often talks about frequency 1/T. The positive value a expresses the maximum displacement of the oscillating point from the equilibrium position and it is called the amplitude. The value b determines the position of the point at the initial time t = 0 and it is called the initial phase, while ω is the angular frequency of the oscillation. Similarly, the function z ≡ g(t) describes the dependence of voltage upon time t in an electrical circuit with inductance L and capacity C and which is the solution of the differential equation (2) z′′ + ω2 z = 0. The only difference between the equations (1) and (2) (besides the dissimilar physical interpretation) is the constant ω. In the equation (1), there is ω2 = k/m where k is the proportionality constant and m is the mass of the point, while in the equation (2), there is ω2 = (LC)−1 . We illustrate how Fourier series can be applied in the theory of differential equations. Consider only the nonhomogeneous (compare to (1)) differential equation (3) y′′ + a2 y = f(x) with y an unknown in variable x ∈ R, with a periodic, continuously differentiable function f : R → R on the right-hand side and a constant a > 0. Let T > 0 be the prime period of the function f and let its Fourier series on [−T/2, T/2] be known, i.e. we assume (4) f(x) = A0 2 + ∞∑ n=1 ( An cos 2πnx T + Bn sin 2πnx T ) . 7.A.27. Prove that if the equation (3) has a periodic solution on R, then the period of this solution is also a period of the function f. Further, prove that the equation (3) has a unique periodic solution with period T if and only if (1) a ̸= 2πn T for every n ∈ N. Solution. Let a function y = g(x), x ∈ R, be a solution of the equation (3) with f(x) ̸≡ 0 and with period p > 0. In order to substitute the function g into a second-order differential equation, its second derivative g′′ must exist. Since the functions g, g′ , g′′ , . . . share the same period, the function g′′ (x) + a2 g(x) = f(x) is also periodic with period p. In other words, the function f is periodic as a linear combination of functions with period p. Thus, we have proved the first statement claiming that p = lT for a certain l ∈ N. Suppose that the function y = g(x), x ∈ R, is a periodic solution of the equation (3) with period T and that it is expressed by a Fourier series as follows: (2) g(x) = a0 2 + ∞∑ n=1 ( an cos (ωnx) + bn sin (ωnx) ) , x ∈ R, CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING The integrand can be rewritten f(x + y) − f(x) sin(ωy/2) sin((N + 1/2)ωy) = = φx(y) ( cos(ωy/2) sin(Nωy)+sin(ωy/2) cos(Nωy) ) , where φx(y) = f(x + y) − f(x) sin(ωy/2) for y ̸= 0. Moreover, we can compute the right and left limits of φx at the point y = 0 using the L’Hospital’s rule. Indeed, writing f′ ±(x) for the right and left derivatives of f, we arrive at lim y→0+ φx(y) = lim y→0+ f(x + y) − f(x) y y sin(ωy/2) = 2 ω f′ +(x), and similiary for the left limit. Thus, φx(y) is piecewise continuous on the entire interval [−T/2, T/2]. Next, we rewrite the integral expression for sN −f in the form reminding coefficients of Fourier series: 2 T ∫ T/2 −T/2 (ψ1(y) sin(Nωy) + ψ2(y) cos(Nωy) dy with the piecewise continuous functions ψ1(y) = T 2 φx(y) cos(ωy/2), ψ2(y) = T 2 φx(y)sin(ωy/2). 
Indeed, we deal with the Fourier coefficients of the functions $\psi_1$ and $\psi_2$, and they have to converge to zero as $N\to\infty$ by virtue of the general theorem on abstract Fourier series, see the Bessel inequality in 7.1.5(2). Hence
\[
\lim_{N\to\infty}\int_{-T/2}^{T/2}\psi_1(y)\sin(N\omega y)\,dy=0,\qquad
\lim_{N\to\infty}\int_{-T/2}^{T/2}\psi_2(y)\cos(N\omega y)\,dy=0.
\]
But this means $\lim_{N\to\infty}s_N(x)-f(x)=0$, as desired.

Now suppose the function $f$ has a discontinuity at $x$. Without loss of generality, we may assume $x=0$ (using a constant shift of coordinates otherwise). Since the function belongs to $S^1$, it is already continuous and differentiable on a neighbourhood of the point $x=0$ (outside the point itself). Split $f$ into its even part $f_1$ and its odd part $f_2$. That is, write $f(x)=f_1(x)+f_2(x)$, where
\[
f_1(x)=\tfrac12\bigl(f(x)+f(-x)\bigr)\ \text{for }x\neq0,\qquad
f_1(0)=\tfrac12\Bigl(\lim_{x\to0^+}f(x)+\lim_{x\to0^-}f(x)\Bigr),\qquad
f_2(x)=\tfrac12\bigl(f(x)-f(-x)\bigr).
\]
Then the even part $f_1(x)$ is continuous and piecewise differentiable at the point $x=0$, because of the existence of the one-sided limits, and so on a neighbourhood of the point $x=0$.

where $\omega=2\pi/T$. If $g$ satisfies the equation (3), it has a continuous second derivative on $\mathbb{R}$. Therefore, for $x\in\mathbb{R}$,
\[
g'(x)=\sum_{n=1}^{\infty}\bigl(\omega nb_n\cos(\omega nx)-\omega na_n\sin(\omega nx)\bigr),
\]
(3) $g''(x)=\sum_{n=1}^{\infty}\bigl(-\omega^2n^2a_n\cos(\omega nx)-\omega^2n^2b_n\sin(\omega nx)\bigr)$.
Substituting (4), (2), and (3) into the equation (3) yields
\[
\frac{a^2a_0}{2}+\sum_{n=1}^{\infty}\Bigl((-\omega^2n^2a_n+a^2a_n)\cos(n\omega x)+\bigl(-\omega^2n^2b_n+a^2b_n\bigr)\sin(n\omega x)\Bigr)
=\frac{A_0}{2}+\sum_{n=1}^{\infty}\bigl(A_n\cos(n\omega x)+B_n\sin(n\omega x)\bigr).
\]
It follows that
(4) $\frac{a^2a_0}{2}=\frac{A_0}{2}$, that is, $a^2a_0=A_0$,
and for $n\in\mathbb{N}$,
(5) $(-\omega^2n^2+a^2)\,a_n=A_n$, $(-\omega^2n^2+a^2)\,b_n=B_n$.
There is exactly one pair of sequences $\{a_n\}_{n\in\mathbb{N}\cup\{0\}}$, $\{b_n\}_{n\in\mathbb{N}}$ satisfying these conditions if and only if
\[
-\omega^2n^2+a^2=-\Bigl(\frac{2\pi n}{T}\Bigr)^2+a^2\neq0\quad\text{for every }n\in\mathbb{N},
\]
i.e., if (1) holds. In this case, the unique solution of (3) with period $T$ is determined by the unique solution
(6) $a_n=\frac{A_n}{-\omega^2n^2+a^2}$, $b_n=\frac{B_n}{-\omega^2n^2+a^2}$, $n\in\mathbb{N}$,
of the system (5). We emphasize that we utilized the uniform convergence of the series in (3). □

7.A.28. Using the solution of the previous problem, find all $2\pi$-periodic solutions of the differential equation
\[
y''+2y=\sum_{n=1}^{\infty}\frac{\sin(nx)}{n^2},\qquad x\in\mathbb{R}.
\]
Solution. The equation is of the form (3) for $a=\sqrt2$ and the continuously differentiable function $f(x)=\sum_{n=1}^{\infty}\frac{\sin(nx)}{n^2}$, $x\in\mathbb{R}$, with prime period $T=2\pi$. According to problem 7.A.27, the condition $\sqrt2\notin\mathbb{N}$ implies that there is exactly one $2\pi$-periodic solution. If we look for it as the value of the series $\frac{a_0}{2}+\sum_{n=1}^{\infty}\bigl(a_n\cos(nx)+b_n\sin(nx)\bigr)$, $x\in\mathbb{R}$, we also know (see (4) and (6)) that $a_0=a_n=0$, $b_n=\frac{1}{n^2(2-n^2)}$, $n\in\mathbb{N}$. Thus, the given equation has the unique $2\pi$-periodic solution
\[
y=\sum_{n=1}^{\infty}\frac{\sin(nx)}{n^2(2-n^2)},\qquad x\in\mathbb{R}. \qquad\square
\]

Also $f_2(0)=0$, and the Fourier series for $f_2$ contains only the terms with $\sin(n\omega x)$ and thus vanishes at $x=0$. Thus we can refer to the previous continuous case and obtain, for the Fourier series $F(x)$ of the function $f$, the equation
\[
F(0)=\tfrac12\Bigl(\lim_{x\to0^+}f(x)+\lim_{x\to0^-}f(x)\Bigr)+0,
\]
which is what we wanted to prove.

If we face more than one jump of $f$ within the basic period of length $T$, we may express $f$ as a sum of functions with exactly one discontinuity each. For example, if the points of discontinuity were $x_0<x_1$, both in $(-T/2,T/2)$, and writing $f(y\pm)=\lim_{x\to y^\pm}f(x)$ for the one-sided limits, we can consider $f=g_1+g_2$ with
\[
g_1(x)=\begin{cases} f(x) & x\in(-T/2,x_0)\\ f(x)-f(x_0+)+f(x_0-) & x\in(x_0,T/2)\end{cases}
\qquad
g_2(x)=\begin{cases} 0 & x\in(-T/2,x_0)\\ f(x_0+)-f(x_0-) & x\in(x_0,T/2).\end{cases}
\]
Clearly $g_1$ has its only discontinuity at $x_1$, while the piecewise constant $g_2$ jumps at $x_0$ only. By the already proved behaviour, the Fourier series of $g_1$ and $g_2$ converge at $x_0$ to $f(x_0-)$ and $\tfrac12\bigl(f(x_0+)-f(x_0-)\bigr)$, respectively. Their sum provides the desired result. Similarly for the other discontinuity point. This completes the proof. This also completes the proof of the statements (2) and (3) of theorem 7.1.8, where we required the Dirichlet condition to hold.

2. Integral operators and Fourier transform

This section provides a few glimpses towards a fascinating and useful area of mathematics exploiting the integration process in many practical ways. Unlike most of this book, we do not provide any general theory, and our constructions are illustrated by examples rather than by rigorous theorems with full proofs.

7.2.1. Functionals. In the case of finite-dimensional vector spaces, we can regard the vectors as mappings from a finite set of fixed generators into the space of coordinates. The sums of vectors and the scalar multiples of vectors were then given by the corresponding operations with such functions. We also worked with the vector spaces of functions of a real variable in the same way when their values were scalars (or vectors as well). The simplest linear mapping $\alpha$ between vector spaces maps vectors to scalars. It is called a linear form, or a linear functional (in particular when dealing with vector spaces of functions). In finite dimension, it is defined as the sum of products of coordinates $x_i$ of vectors with fixed values $\alpha_i=\alpha(e_i)$ at the generators $e_i$, i.e. by one-row matrices:
\[
(x_1,\dots,x_n)^T\mapsto(\alpha_1,\dots,\alpha_n)\cdot(x_1,\dots,x_n)^T.
\]

B. Integral operators and Fourier Transform

Next we focus on problems related to the notion of "convolution", which is a typical integral operation that applies to two functions $f,g$ and produces a new one, denoted by $f*g$ and defined by
\[
(f*g)(y)=\int_{-\infty}^{\infty}f(x)\,g(y-x)\,dx.
\]
The convolution integral is a fundamental concept in various fields, e.g., in statistics, signal processing, computer vision, and others. For its basic features and other basic facts, see 7.2.2.

7.B.1. Find the convolutions $f*g$ and $f*h$ of the following functions, and in each case check the "smoothing" of $f$:
\[
f(x)=\sin(x)+\tfrac25[\sin(6x)]^2-\tfrac15\sin(60x),
\]
\[
g(x)=\begin{cases}\frac{1}{2\varepsilon} & \text{if }-\varepsilon<x<\varepsilon\\ 0 & \text{otherwise,}\end{cases}\quad\text{where }\varepsilon>0,
\qquad
h(x)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}},\quad\text{where }\mu,\sigma\in\mathbb{R},\ \sigma>0.
\]
Solution. The function $g$ is chosen to provide the mean of $f$ over (small) intervals of length $2\varepsilon$, and it is normalized so that the integral of $g$ over all of $\mathbb{R}$ is one. We should expect some smoothing of the oscillations of the function $f(x)$. Drawing the resulting functions by Maple shows: the graph depicted in the upper left is the original $f$, while the graph in the upper right is of $g$ with $\varepsilon=1/10$. The two

More complicated mappings, with values lying in the same space, are given similarly by square matrices. We approach linear operations on spaces of functions in an analogous way. For simplicity, we work with the real vector space $S$ of all piecewise continuous real-valued functions having compact support, defined on the whole $\mathbb{R}$ or on an interval $I=[a,b]$. Linear mappings $S\to\mathbb{R}$ are called (real) linear functionals. Such functionals can be defined in many diverse ways.
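A discrete sketch of the smoothing effect studied in 7.B.1 (the grid step and $\varepsilon$ are arbitrary choices here, and Python replaces the book's Maple session):

```python
# Sketch of the smoothing in 7.B.1: convolve f with the normalized box
# kernel g on a grid; the fast sin(60x) ripple is strongly damped.
import numpy as np

dx = 0.001
x = np.arange(-2 * np.pi, 2 * np.pi, dx)
f = np.sin(x) + 0.4 * np.sin(6 * x) ** 2 - 0.2 * np.sin(60 * x)

eps = 0.1
g = np.where(np.abs(x) < eps, 1 / (2 * eps), 0.0)

smoothed = np.convolve(f, g, mode="same") * dx   # Riemann sum of the integral

# Compare the size of the oscillating remainder before and after smoothing:
print(np.std(f - np.sin(x)), np.std(smoothed - np.sin(x)))
```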
For example, by evaluating the function's values (or its derivatives') at some fixed points, or in terms of integration. We can, for instance, consider the functional $L$ given by evaluating the function at a single fixed point $x_0\in I$, i.e., $L(f)=f(x_0)$. Or, we can have the functional given by integration of the product with a fixed function $g(x)$, i.e.,
\[
L(f)=\int_a^b f(x)\,g(x)\,dx.
\]
The function $g(x)$ in the previous example is a function which weighs the particular values representing the function $f(x)$ in the definition of the Riemann integral, and the functional $L$ is a perfect analogue of the linear forms on finite-dimensional vector spaces mentioned above. The simplest case of such a functional is, of course, the Riemann integral itself, i.e. the case of $g(x)=1$ for all points $x$. A good example is given by
\[
g(x)=\begin{cases}0 & \text{if }|x|\geq\varepsilon\\ \frac{1}{2\varepsilon} & \text{if }|x|<\varepsilon\end{cases}
\]
for any $\varepsilon>0$. The integral of the function $g$ over $\mathbb{R}$ equals one, and our linear functional can be perceived as a (uniform) averaging of the values of the function $f$ over the $\varepsilon$-neighbourhood of the origin. Similarly, we can work with the function
\[
g(x)=\begin{cases}0 & \text{if }|x|\geq\varepsilon\\ e^{\frac{1}{x^2-\varepsilon^2}+\frac{1}{\varepsilon^2}} & \text{if }|x|<\varepsilon,\end{cases}
\]
which we used in the paragraph 6.1.9. This function is smooth on $\mathbb{R}$ with compact support on the interval $(-\varepsilon,\varepsilon)$. Our functional has the meaning of a weighted combination of the values, but this time, the weights of the input values decrease rapidly as their distance from the origin increases. The integral of $g$ over $\mathbb{R}$ is finite, but it is not equal to one. Dividing $g$ by this integral would lead to a functional which would have the meaning of a non-uniform averaging of a given function $f$. Another example is the Gaussian function
\[
g(x)=\frac{1}{\sqrt\pi}\,e^{-x^2},
\]
which also has its integral over $\mathbb{R}$ equal to one (we verify this later). This time, all the input values $x$ in the corresponding "average" have a non-zero weight, yet this weight becomes insignificantly small as the distance from the origin increases.

lower graphs show their convolution with (respectively) parameters for $g$ selected as $\varepsilon=1/10$ and $\varepsilon=13/50$. It is straightforward to compute the convolution explicitly, too:
\[
f*g(y)=\frac{1}{2\varepsilon}\int_{y-\varepsilon}^{y+\varepsilon}\sin(x)+\tfrac25\sin(6x)^2-\tfrac15\sin(60x)\,dx
=\frac{1}{2\varepsilon}\Bigl[-\cos(x)-\tfrac{1}{30}\sin(6x)\cos(6x)+\tfrac15x+\tfrac{1}{300}\cos(60x)\Bigr]_{y-\varepsilon}^{y+\varepsilon}.
\]
Similarly, the other function $h$ is a typical smoothing function which gives much more weight to the values of $f$ near the point $y$, and much less weight to the values of $f$ further from $y$. This is the famous Gaussian function. We meet it frequently. Using Maple again, we see that the integral of $h$ over $\mathbb{R}$ is one (we prove this in 13.2.8). It is not easy to find the convolution analytically, but Maple can do it approximately. The resulting diagrams are as follows. As before, the graph depicted in the upper left is the original $f$, while the graph in the upper right is of $h$ with $\mu=0$ and $\sigma=1/10$. Below that, their convolutions are shown with parameters $\mu=0$, $\sigma^2=1/60$ and (lower right) $\mu=0$, $\sigma^2=5/60$. □

7.B.2. Determine the convolution $f_1*f_2$ where $f_1(x)=\frac1x$ for $x\neq0$ and
\[
f_2(x)=\begin{cases}x & \text{for }x\in[-1,1]\\ 0 & \text{otherwise.}\end{cases}
\]
Solution.

7.2.2. Function convolution.
Integral functionals from the previous paragraph can easily be modified to obtain a "smeared averaging" of the values of a given function $f$ near a given point $y\in\mathbb{R}$:

Convolution of functions of a real variable
\[
L_y(f)=\int_{-\infty}^{\infty}f(x)\,g(y-x)\,dx
\]
The free parameter $y$ in the definition of the functional $L_y(f)$ can be perceived as a new independent variable, and our operation $L_y$ actually maps functions to functions again, $f\mapsto\tilde f$:
\[
\tilde f(y)=L_y(f)=\int_{-\infty}^{\infty}f(x)\,g(y-x)\,dx.
\]
This operation is called the convolution of the functions $f$ and $g$, denoted $f*g$. The convolution is usually defined for real or complex valued functions on $\mathbb{R}$ with compact support, or for those with decay fast enough to ensure integrability.

By the transformation $t=z-x$, we can easily calculate that
\[
(f*g)(z)=\int_{-\infty}^{\infty}f(x)\,g(z-x)\,dx=-\int_{\infty}^{-\infty}f(z-t)\,g(t)\,dt=(g*f)(z).
\]
Thus the convolution, considered as a binary operation $*:S_c\times S_c\to S_c$ on pairs of functions having compact support, is commutative.

Similarly, convolutions can be considered with integration over a finite interval; we only have to guarantee that the functions participating in them are well-defined. In particular, this can be done for periodic functions, integrating over an interval whose length equals the period.

Convolution is an extraordinarily useful tool for modeling the way in which we observe the data of an experiment, or the influence of a medium through which information is transferred. For instance, an analog audio or video signal affected by noise. The input value $f$ is the transferred information. The function $g$ is chosen so that it expresses the influence of the medium or the technical procedure used for the signal processing or the processing of any other data.

7.2.3. Gibbs phenomenon. Actually, we have already seen a very useful case of convolution. In paragraph 7.1.18, we interpreted the partial sum of the Fourier series for a function $f$ as a convolution with the Dirichlet kernel
\[
K_N(y)=\frac{1}{T}\sum_{k=-N}^{N}e^{i\omega ky}.
\]
The figure shows this convolution kernel with $N=5$ and $N=15$.

\[
(f_1*f_2)(t)=\int_{-\infty}^{\infty}f_1(x)\,f_2(t-x)\,dx=\int_{-\infty}^{\infty}\frac1x\,f_2(t-x)\,dx.
\]
Since $f_2(x)=0$ outside the interval $[-1,1]$, necessarily $-1\leq t-x\leq1$, that is, $t-1\leq x\leq t+1$. So
\[
(f_1*f_2)(t)=\int_{t-1}^{t+1}\frac1x(t-x)\,dx=t\int_{t-1}^{t+1}\frac1x\,dx-2.
\]
This last integral is improper if $t-1\leq0\leq t+1$. For $t$ outside that interval, the integration gives
\[
(f_1*f_2)(t)=t\ln\left|\frac{t+1}{t-1}\right|-2.
\]
If instead $-1<t<1$, we can, for small $\varepsilon>0$, replace $\int_{t-1}^{t+1}\frac1x\,dx$ with $\int_{\varepsilon}^{t+1}\frac1x\,dx+\int_{t-1}^{-\varepsilon}\frac1x\,dx$, which computes to $\ln|t+1|-\ln|\varepsilon|+\ln|-\varepsilon|-\ln|t-1|$. The terms in $\varepsilon$ cancel, so when we take the limit $\varepsilon\to0$, we obtain the same answer for the integral as before. Thus
\[
(f_1*f_2)(t)=t\ln\left|\frac{t+1}{t-1}\right|-2
\]
for all values of $t$ except for $t=1$ or $t=-1$. □

We calculate the convolution of two functions both of which have bounded support.

7.B.3. Determine the convolution $f_1*f_2$ where
\[
f_1(x)=\begin{cases}1-x^2 & \text{for }x\in[-1,1],\\ 0 & \text{otherwise,}\end{cases}
\qquad
f_2(x)=\begin{cases}x & \text{for }x\in[0,1],\\ 0 & \text{otherwise.}\end{cases}
\]
Solution. Since the integrand is zero when $f_1(x)=0$,
\[
f_1*f_2(t)=\int_{-\infty}^{\infty}f_1(x)\,f_2(t-x)\,dx=\int_{-1}^{1}(1-x^2)\,f_2(t-x)\,dx.
\]
But the integrand is also zero when $f_2(t-x)=0$, so we need $0\leq t-x\leq1$, i.e. $t-1\leq x\leq t$, for the integrand to be nonzero. So for a nonzero value of $f_1*f_2(t)$, we integrate over the intersection of the intervals $[t-1,t]$ and $[-1,1]$.
Notice that instead of integrals over the entire real line, we employ integration over the basic period $T$ of the periodic functions in question. This interpretation allows us to explain the Gibbs phenomenon mentioned in paragraph 7.1.9. The point is that we know well the behaviour of the Dirichlet kernel near the origin, and thus, taking into account that the function $f$ is bounded over the whole period and has all one-sided limits of values and derivatives at each point of discontinuity, the effect of the convolution must be quite local. Consequently, the convolution with the Dirichlet kernel at a point $x$ where $f$ jumps behaves in the same way as we computed explicitly for the Heaviside function at $x=0$. There the overshooting by the Fourier sums can be computed explicitly, and this explains the Gibbs effect in general. We do not provide more details here. Readers may either work them out themselves (as a truly nontrivial exercise) or look them up in the literature.

7.2.4. Integral operators. In general, integral operators can depend on any number of values and derivatives of the function in their argument. For example, considering a function $F$ depending on $k+2$ free arguments,
\[
L(f)(y)=\int F\bigl(y,x,f(x),f'(x),\dots,f^{(k)}(x)\bigr)\,dx.
\]
Convolution is one of many examples of a special class of such operators on spaces of functions:
\[
L(f)(y)=\int_a^b f(x)\,k(y,x)\,dx.
\]
The function $k(y,x)$, dependent on two variables, $k:\mathbb{R}^2\to\mathbb{R}$, is called the kernel of the integral operator $L$. The theory of integral operators is very useful and interesting. We focus only on an extraordinarily important special case, namely the Fourier transform $\mathcal{F}$, which has deep connections with Fourier series.

7.2.5. Fourier transform. Recall that a function $f(t)$, given by its converging Fourier series, equals
\[
f(t)=\sum_{n=-\infty}^{\infty}c_n\,e^{i\omega_nt},
\]

Consequently,
\[
(f_1*f_2)(t)=
\begin{cases}
0 & \text{if }t>2,\\[2pt]
\int_{t-1}^{1}(1-x^2)(t-x)\,dx=\frac{4t}{3}-t^2+\frac{t^4}{12} & 1\leq t\leq2,\\[2pt]
\int_{t-1}^{t}(1-x^2)(t-x)\,dx=-\frac{t^2}{2}+\frac14+\frac{2t}{3} & 0\leq t\leq1,\\[2pt]
\int_{-1}^{t}(1-x^2)(t-x)\,dx=-\frac{t^4}{12}+\frac{t^2}{2}+\frac14+\frac{2t}{3} & -1\leq t\leq0,\\[2pt]
0 & \text{if }t<-1.
\end{cases}
\]
□

7.B.4. Determine the convolution $f_1*f_2$ of the functions
\[
f_1=\begin{cases}1-x & \text{for }x\in[-2,1],\\ 0 & \text{otherwise,}\end{cases}
\qquad
f_2=\begin{cases}1 & \text{for }x\in[0,1],\\ 0 & \text{otherwise.}\end{cases}
\]
⃝

The next topic is the Fourier transform, which is another example of an integral operator. This time the kernel $e^{-i\omega t}$ is complex (see 7.2.5 for the terminology). Thus the values on real functions are complex functions in general, see 7.2.5. This is a basic operation in mathematics, allowing the time and frequency analysis of signals and also the transitions between local and global behaviour.

7.B.5. Fix $\Omega>0$. Recall that $\operatorname{sgn}t=1$ if $t>0$, $\operatorname{sgn}t=-1$ if $t<0$, and $\operatorname{sgn}0=0$. Find the Fourier transform $\mathcal{F}(f)$ and the inverse Fourier transform $\mathcal{F}^{-1}$ of the functions:
(a) $f(t)=\operatorname{sgn}t$ if $t\in(-\Omega,\Omega)$, and zero otherwise;
(b) $f(t)=1$ if $t\in(-\Omega,\Omega)$, and zero otherwise.

Solution. The case (a). The Fourier transform of the given function is
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(t)\,e^{-i\omega t}\,dt
=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\operatorname{sgn}t\,\bigl(\cos(\omega t)-i\sin(\omega t)\bigr)\,dt
\]
\[
=\frac{1}{\sqrt{2\pi}}\Bigl(\int_0^{\Omega}\bigl(\cos(\omega t)-i\sin(\omega t)\bigr)\,dt-\int_{-\Omega}^{0}\bigl(\cos(\omega t)-i\sin(\omega t)\bigr)\,dt\Bigr).
\]
Since cos and sin are respectively even and odd functions,

where the numbers $c_n$ are complex Fourier coefficients and $\omega_n=2\pi n/T$ with period $T$, see paragraph 7.1.7. After fixing $T$, the expression $\Delta\omega=2\pi/T$ describes the change of the frequency caused by $n$ being increased by one.
Thus it is just the discrete step by which we change the frequencies when calculating the coefficients of the Fourier series. The coefficient $1/T$ in the formula
\[
c_n=\frac1T\int_{-T/2}^{T/2}f(x)\,e^{-i\omega_nx}\,dx
\]
then equals $\Delta\omega/2\pi$, so the series for $f(t)$ can be rewritten as
\[
f(t)=\sum_{n=-\infty}^{\infty}\frac{1}{2\pi}\Bigl(\Delta\omega\int_{-T/2}^{T/2}f(x)\,e^{-i\omega_nx}\,dx\Bigr)e^{i\omega_nt}.
\]
Now imagine the values $\omega_n$ for all $n\in\mathbb{Z}$ as the chosen representatives for small intervals $[\omega_n,\omega_{n+1}]$ of length $\Delta\omega$. Then our expression in the big inner parentheses in the previous formula for $f(t)$, restricting the sum to $n\in[-N,N]$, describes the summands of the Riemann sums for the integrals
\[
\frac{1}{2\pi}\int_{-N}^{N}g(\omega)\,e^{i\omega t}\,d\omega,
\]
where $g(\omega)$ is a function which takes, at the points $\omega_n$, the values
\[
g(\omega_n)=\int_{-T/2}^{T/2}f(x)\,e^{-i\omega_nx}\,dx.
\]
Considering the limit for $N\to\infty$ leads to the improper integral of $g(\omega)\,e^{i\omega t}$. If the function $f(x)\,e^{-i\omega_nx}$ is integrable over the entire $\mathbb{R}$ (this is always the case if $f$ is piecewise continuous with a compact support), then letting $T\to\infty$, the norm $\Delta\omega$ of our subintervals in the Riemann sum decreases to zero. We obtain the integral
\[
g(\omega)=\int_{-\infty}^{\infty}f(x)\,e^{-i\omega x}\,dx.
\]
The previous reasoning indicates that there should be a large set of functions $f$ on $\mathbb{R}$ (e.g. all those with $|f|$ integrable over $\mathbb{R}$) for which we can define a pair of mutually inverse integral operators:

\[
\mathcal{F}(f)(\omega)=\frac{2}{\sqrt{2\pi}}\int_0^{\Omega}-i\sin(\omega t)\,dt
=\frac{2i}{\sqrt{2\pi}}\Bigl[\frac{\cos(\omega t)}{\omega}\Bigr]_0^{\Omega}
=i\sqrt{\frac2\pi}\,\frac{\cos(\omega\Omega)-1}{\omega}.
\]
The inverse Fourier transform is given by almost the same integral, with the kernel $e^{i\omega x}$ instead of $e^{-i\omega x}$. The integration is in the frequency domain with variable $\omega$. Thus, the only difference in the result is the sign:
\[
\mathcal{F}^{-1}(f)(t)=\frac{2}{\sqrt{2\pi}}\int_0^{\Omega}i\sin(\omega t)\,d\omega
=\frac{-2i}{\sqrt{2\pi}}\Bigl[\frac{\cos(\omega t)}{t}\Bigr]_0^{\Omega}
=i\sqrt{\frac2\pi}\,\frac{1-\cos(t\Omega)}{t}.
\]
Case (b) is computed similarly:
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\bigl(\cos(\omega t)-i\sin(\omega t)\bigr)\,dt
=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\cos(\omega t)\,dt
=\frac{1}{\sqrt{2\pi}}\Bigl[\frac{\sin(\omega t)}{\omega}\Bigr]_{-\Omega}^{\Omega}
=\sqrt{\frac2\pi}\,\frac{\sin(\omega\Omega)}{\omega}.
\]
The latter expression is often written by means of the function $\operatorname{sinc}(t)=\sin(t)/t$:
\[
\mathcal{F}(f)(\omega)=\frac{2\Omega}{\sqrt{2\pi}}\operatorname{sinc}(\omega\Omega).
\]
Here, the inverse Fourier transform has exactly the same result, because the sign change in the kernel does not affect the real part. Thus we only need to interchange the time and frequency variables:
\[
\mathcal{F}^{-1}(f)(t)=\frac{2\Omega}{\sqrt{2\pi}}\operatorname{sinc}(t\Omega). \qquad\square
\]
The results are: the first two diagrams below show the imaginary values of the Fourier image of the signum function from 7.B.5(a) with $\Omega=20$ and $\Omega=50$. The next two diagrams do the same for the characteristic function of the interval $|x|<\Omega$

Fourier transform

For every real or complex valued function $f$ on $\mathbb{R}$ for which the following integrals exist (e.g. all those piecewise continuous with compact support or fast decay), we define
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(t)\,e^{-i\omega t}\,dt,\qquad
\mathcal{F}^{-1}(f)(t)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(\omega)\,e^{i\omega t}\,d\omega.
\]
The function $\tilde f(\omega)=\mathcal{F}(f)(\omega)$ is called the Fourier transform of the function $f$.

The previous ideas indicate that $\mathcal{F}^{-1}(\mathcal{F}(f))(t)=f(t)$, whenever both the improper (Riemann) integrals exist. This says that the Fourier transform $\mathcal{F}$ has an inverse operation $\mathcal{F}^{-1}$, which is called the inverse Fourier transform. Notice that the Fourier transform and its inverse are integral operators with almost identical kernels $k(\omega,t)=e^{\pm i\omega t}$.

The right choice of the function space (and integral type) ensuring the existence of the Fourier transform and its inverse is a very subtle question. This is the main reason why we do not formulate theorems with formal proofs and just touch the problems based on examples here. We hope this will motivate the readers to find more detailed answers in specialized literature.
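A numerical sanity check of this transform pair (a sketch; the quadrature grids and the Gaussian test function are our own choices, not from the text):

```python
# Quadrature sketch: F followed by F^{-1} reproduces a fast-decaying f.
import numpy as np

t = np.linspace(-10, 10, 20001)     # "time" grid
w = np.linspace(-10, 10, 2001)      # frequency grid
f = np.exp(-t**2 / 2)               # a convenient test function

# F(f)(w) = (1/sqrt(2 pi)) * integral of f(t) e^{-i w t} dt
F = np.array([np.trapz(f * np.exp(-1j * wi * t), t) for wi in w]) / np.sqrt(2 * np.pi)

# back: f(t0) = (1/sqrt(2 pi)) * integral of F(w) e^{i w t0} dw
for t0 in (0.0, 0.5, 1.5):
    val = np.trapz(F * np.exp(1j * w * t0), w) / np.sqrt(2 * np.pi)
    print(t0, abs(val - np.exp(-t0**2 / 2)))   # tiny residuals
```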
7.2.6. Simple properties. The Fourier transform changes the local and global behaviour of functions in an interesting way. We begin with a simple example in which there is a function $f(t)$ which is transformed to the indicator function of the interval $[-\Omega,\Omega]$, i.e., $\tilde f(\omega)=0$ for $|\omega|>\Omega$, and $\tilde f(\omega)=1$ for $|\omega|\leq\Omega$. The inverse transform $\mathcal{F}^{-1}$ gives
\[
f(t)=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}e^{i\omega t}\,d\omega
=\frac{1}{\sqrt{2\pi}}\Bigl[\frac{1}{it}e^{i\omega t}\Bigr]_{-\Omega}^{\Omega}
=\frac{2}{\sqrt{2\pi}\,t}\,\frac{1}{2i}\bigl(e^{i\Omega t}-e^{-i\Omega t}\bigr)
=\frac{2\Omega}{\sqrt{2\pi}}\,\frac{\sin(\Omega t)}{\Omega t}.
\]
Thus, except for a multiplicative constant and the scaling of the input variable, it is the very important function
\[
f(t)=\operatorname{sinc}(t)=\frac{\sin t}{t}.
\]
This function is not that easy to handle. Our construction of the Fourier transform suggests that $\mathcal{F}(f)$ should be the indicator function of the interval $[-\Omega,\Omega]$. We verify this by direct computation in 7.B.5. Now, just notice that the sinc function is not integrable in absolute value over $\mathbb{R}$. It rather behaves as an alternating series of numbers. But the local behavior of $f$ is easy to see. Calculation of the limit at zero, by l'Hospital's rule or otherwise, gives $f(0)=2\Omega(2\pi)^{-1/2}$. The closest zero points are at

from 7.B.5(b). The longer the interval with the constant values is, the more the image is concentrated around the origin.

We can always use directly the simpler version of the transform for the odd and even functions. If the argument $f$ is odd, then only the sine part of the formula contributes and its Fourier transform is
\[
\mathcal{F}(f)(\omega)=\frac{-2i}{\sqrt{2\pi}}\int_0^{\infty}f(t)\sin(\omega t)\,dt.
\]
Similarly, for even functions $f$,
\[
\mathcal{F}(f)(\omega)=\frac{2}{\sqrt{2\pi}}\int_0^{\infty}f(t)\cos(\omega t)\,dt.
\]
In particular, the odd functions have purely imaginary images, while the images of the even functions are real. More generally, every real function $f$ decomposes into its odd and even parts, $f=f_{\mathrm{even}}+f_{\mathrm{odd}}$, and the real and imaginary components of the Fourier image $\tilde f$ are the images of these two parts, respectively.

7.B.6. Discover how the Fourier transform and its inverse behave under the translation $\tau_a$ in the variable, $\tau_af(x)=f(x+a)$, and the phase shift $\varphi_a$ defined as $\varphi_af(x)=e^{iax}f(x)$, always with $a\in\mathbb{R}$.

Solution. Evaluate the compositions $\mathcal{F}\circ\tau_a$ and $\mathcal{F}\circ\varphi_a$. This is easy:
\[
\mathcal{F}\circ\tau_af(\omega)=\int_{-\infty}^{\infty}f(t+a)\,e^{-i\omega t}\,dt
=\int_{-\infty}^{\infty}f(x)\,e^{-i\omega(x-a)}\,dx=e^{ia\omega}\,\mathcal{F}f(\omega),
\]
\[
\mathcal{F}\circ\varphi_af(\omega)=\int_{-\infty}^{\infty}f(t)\,e^{iat}\,e^{-i\omega t}\,dt
=\int_{-\infty}^{\infty}f(t)\,e^{-i(\omega-a)t}\,dt=\mathcal{F}f(\omega-a).
\]
This proves the formulae
\[
\mathcal{F}\circ\tau_a=\varphi_a\circ\mathcal{F},\qquad \mathcal{F}\circ\varphi_a=\tau_{-a}\circ\mathcal{F}.
\]
Similarly, $\mathcal{F}^{-1}\circ\tau_a=\varphi_{-a}\circ\mathcal{F}^{-1}$ and $\mathcal{F}^{-1}\circ\varphi_a=\tau_a\circ\mathcal{F}^{-1}$. □

The next problem displays the behaviour of the Fourier transform on the Gaussian function. This is a rare example where the time and frequency forms are very similar. Again, we see the feature of exchanging the local and global properties in the time and frequency domains.

7.B.7. Compute the Fourier transform $\mathcal{F}(f)$ of the function $f(t)=e^{-at^2}$, $t\in\mathbb{R}$, where $a>0$ is a fixed parameter.

Solution. The task is to calculate

$t=\pm\pi/\Omega$, and the function drops in value to zero quite rapidly away from the origin $x=0$. This function is shown in the diagram by a wavy curve for $\Omega=20$. Simultaneously, the area where our function $f(t)$ keeps waving more rapidly as $\Omega$ increases is also depicted by a curve. [Figure: the function $f$ for $\Omega=20$.]

The Fourier transform of the indicator function of the interval $[-\Omega,\Omega]$ is also the above function $f$, see the explicit computation in 7.B.5 (the even function sinc does not see the change of sign in the kernel of the transform).
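As a brief aside, the shift formulae of 7.B.6 above can be spot-checked numerically; the test function, the shift $a$, and the grid below are arbitrary choices, not from the text:

```python
# Numeric check of the shift rule from 7.B.6: F(tau_a f) = phi_a F(f).
import numpy as np

t = np.linspace(-15, 15, 30001)
f = lambda x: np.exp(-x**2)          # a convenient test function
a = 0.7

def fourier(g, w):                   # F(g)(w) with the book's normalization
    return np.trapz(g(t) * np.exp(-1j * w * t), t) / np.sqrt(2 * np.pi)

for w in (0.3, 1.1, 2.0):
    lhs = fourier(lambda x: f(x + a), w)          # F(tau_a f)(w)
    rhs = np.exp(1j * a * w) * fourier(f, w)      # (phi_a F f)(w)
    print(w, abs(lhs - rhs))                      # ~1e-12
```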
Clearly, this function $f$ takes significant positive values near zero, and the value taken at zero is a fixed multiple of $\Omega$. Therefore, as $\Omega$ increases, the function $f$ concentrates more and more near the origin.

Next, we derive the Fourier transform of the derivative $f'(t)$ of a function $f$. We continue to suppose that all the integration makes sense (e.g. $f$ has compact support), so that both $\mathcal{F}(f')$ and $\mathcal{F}(f)$ exist. By integration by parts,
\[
\mathcal{F}(f')(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f'(t)\,e^{-i\omega t}\,dt
=\frac{1}{\sqrt{2\pi}}\bigl[e^{-i\omega t}f(t)\bigr]_{-\infty}^{\infty}
+\frac{i\omega}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(t)\,e^{-i\omega t}\,dt
=i\omega\,\mathcal{F}(f)(\omega).
\]
Thus the Fourier transform converts the (limit) operation of differentiation into the (algebraic) operation of multiplication by the variable. Of course, this procedure can be iterated, to obtain
\[
\mathcal{F}(f'')(\omega)=-\omega^2\mathcal{F}(f),\ \dots,\ \mathcal{F}(f^{(n)})=i^n\omega^n\mathcal{F}(f).
\]

7.2.7. The relation to convolutions. There is another extremely important property to consider, namely the relation between convolutions and Fourier transforms. We calculate the Fourier transform of the convolution $h=f*g$, where we again assume that all the integrals exist (e.g. assuming the functions to be piecewise continuous with compact supports). Recall that we may change the order of integration, see 6.3.12. Then we change the variable by the substitution $t-x=u$.

\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-at^2}e^{-i\omega t}\,dt.
\]
A standard trick is to transform the problem into one of solving a (simple) differential equation. Hence, differentiating (with respect to $\omega$) and then integrating by parts gives
\[
\bigl(\mathcal{F}(f)(\omega)\bigr)'=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}-it\,e^{-at^2}e^{-i\omega t}\,dt
=\frac{1}{\sqrt{2\pi}}\Bigl(\frac{i}{2a}\lim_{t\to\infty}e^{-at^2-i\omega t}-\frac{i}{2a}\lim_{t\to-\infty}e^{-at^2-i\omega t}-\int_{-\infty}^{\infty}\frac{\omega}{2a}\,e^{-at^2}e^{-i\omega t}\,dt\Bigr)
\]
\[
=-\frac{\omega}{2a}\Bigl(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-at^2}e^{-i\omega t}\,dt\Bigr)
=-\frac{\omega}{2a}\,\mathcal{F}(f)(\omega),
\]
since both limits vanish. Therefore $y(\omega)=\mathcal{F}(f)(\omega)$ satisfies the differential equation
\[
\frac{dy}{d\omega}=-\frac{\omega}{2a}\,y,\quad\text{i.e.}\quad\frac1y\,dy=-\frac{\omega}{2a}\,d\omega,
\]
unless $y$ equals zero ($y\equiv0$ is a solution of the equation). Integration yields $\ln|y|=-\frac{\omega^2}{4a}+C$, i.e. $y=K\,e^{-\frac{\omega^2}{4a}}$, where $C$ and $K$ are constants. All solutions (including the zero solution) of the differential equation are given by the function $y(\omega)=K\,e^{-\frac{\omega^2}{4a}}$, $K\in\mathbb{R}$. To find $K$, begin with the well known fact (proved in ??)
\[
\int_{-\infty}^{\infty}e^{-x^2}\,dx=\sqrt\pi,
\]
to obtain
\[
\int_{-\infty}^{\infty}e^{-at^2}\,dt=\frac{1}{\sqrt a}\int_{-\infty}^{\infty}e^{-x^2}\,dx=\frac{\sqrt\pi}{\sqrt a}.
\]
Therefore $\mathcal{F}(f)(0)=\frac{1}{\sqrt{2\pi}}\frac{\sqrt\pi}{\sqrt a}=\frac{1}{\sqrt{2a}}$, while $\mathcal{F}(f)(0)=K\,e^0=K$. So $K=\frac{1}{\sqrt{2a}}$ and
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2a}}\,e^{-\frac{\omega^2}{4a}}. \qquad\square
\]

7.B.8. Determine the Fourier transform image of the Gaussian function
\[
f(t)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(t-\mu)^2}{2\sigma^2}}.
\]
Solution. Use the result of the previous problem with $a=\frac{1}{2\sigma^2}$ and the composition with the variable shift $\tau_a$ from the last but one problem. It follows that
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\,e^{-i\mu\omega}\,e^{-\frac{\sigma^2\omega^2}{2}}. \qquad\square
\]

The result is
\[
\mathcal{F}(h)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\Bigl(\int_{-\infty}^{\infty}f(x)\,g(t-x)\,dx\Bigr)e^{-i\omega t}\,dt
=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(x)\Bigl(\int_{-\infty}^{\infty}g(t-x)\,e^{-i\omega t}\,dt\Bigr)dx
\]
\[
=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(x)\Bigl(\int_{-\infty}^{\infty}g(u)\,e^{-i\omega(u+x)}\,du\Bigr)dx
=\frac{1}{\sqrt{2\pi}}\Bigl(\int_{-\infty}^{\infty}f(x)\,e^{-i\omega x}\,dx\Bigr)\cdot\Bigl(\int_{-\infty}^{\infty}g(u)\,e^{-i\omega u}\,du\Bigr)
=\sqrt{2\pi}\,\mathcal{F}(f)\cdot\mathcal{F}(g).
\]
A similar calculation shows that the Fourier transform of a product is the convolution of the transforms, up to a multiplicative constant. In fact,
\[
\mathcal{F}(f\cdot g)=\frac{1}{\sqrt{2\pi}}\,\mathcal{F}(f)*\mathcal{F}(g).
\]
As we mentioned above, the convolution $f*g$ often models the process of the observation of some quantity $f$. Using the Fourier transform and its inverse, the original values of this quantity are easily recognised if the convolution kernel $g$ is known. We just calculate $\mathcal{F}(f*g)$ and divide it by the image $\mathcal{F}(g)$.
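A quick grid check of the convolution formula just derived (a sketch with arbitrarily chosen compactly supported $f$ and $g$; the small residuals are discretization error):

```python
# Sketch: check F(f*g) = sqrt(2 pi) F(f) F(g) by direct quadrature.
import numpy as np

dx = 0.01
x = np.arange(-10, 10, dx)
f = np.where(np.abs(x) < 1, 1.0 - x**2, 0.0)
g = np.where((x >= 0) & (x <= 1), x, 0.0)

conv = np.convolve(f, g) * dx                  # full discrete convolution
xc = 2 * x[0] + dx * np.arange(conv.size)      # grid of the full convolution

def ft(h, grid, w):                            # book's normalization
    return np.trapz(h * np.exp(-1j * w * grid), grid) / np.sqrt(2 * np.pi)

for w0 in (0.5, 1.5, 3.0):
    lhs = ft(conv, xc, w0)
    rhs = np.sqrt(2 * np.pi) * ft(f, x, w0) * ft(g, x, w0)
    print(w0, abs(lhs - rhs))                  # small grid error
```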
This yields the Fourier transform of the original function $f$, which can then be obtained explicitly using the inverse Fourier transform. This is sometimes called deconvolution. In real applications, the procedure often cannot be that straightforward, since the Fourier image of the known convolution kernel might have zero values, and then we can hardly divide by it as above. For example, take the convolution kernel $\operatorname{sinc}(t)$, whose image is an indicator function of some finite interval. So we need some more cunning techniques, and there is a vast literature on them.

7.2.8. The $L_2$-norm. Now we are in a position to verify that the Fourier transform is actually an isometry with respect to the $L_2$-norm. The Fourier transform exists for all functions in $L_1$, i.e., with integrable absolute values. A simple observation reveals that $\mathcal{F}:L_1\to L_\infty$ satisfies $\|\mathcal{F}(f)\|_\infty\leq\|f\|_1$. Indeed, $|e^{-i\omega x}|=1$, and thus
\[
|\mathcal{F}(f)(\omega)|\leq\int_{-\infty}^{\infty}|f(x)|\,dx\quad\text{for all }\omega.
\]
Now, assume $f,g$ are two functions in $L_2$ and write $\hat g$ for the function $\hat g(t)=g(-t)$. Notice
\[
(f*\hat g)(t)=\int_{-\infty}^{\infty}f(x)\,\hat g(t-x)\,dx=\int_{-\infty}^{\infty}f(x)\,g(x-t)\,dx.
\]
In particular, the scalar product is given by the formula $\langle f,g\rangle=(f*\hat g)(0)$.

As mentioned, the most typical use of the Fourier transform is to analyse the frequencies in a signal. The next problem reveals the reason. For technical reasons, we cut the signal off by multiplication with the characteristic function $h_\Omega$ of the interval $(-\Omega,\Omega)$.

7.B.9. Find the Fourier transform of the functions $f(t)=h_\Omega(t)\cos(nt)$ and $g(t)=h_\Omega(t)\sin(nt)$.

Solution. By definition,
\[
\mathcal{F}(f)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\cos(nt)\,e^{-i\omega t}\,dt
=\frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\tfrac12\bigl(e^{int}+e^{-int}\bigr)e^{-i\omega t}\,dt
\]
\[
=\frac{1}{2\sqrt{2\pi}}\Bigl[\frac{1}{i(n-\omega)}e^{i(n-\omega)t}\Bigr]_{-\Omega}^{\Omega}
+\frac{1}{2\sqrt{2\pi}}\Bigl[\frac{-1}{i(n+\omega)}e^{-i(\omega+n)t}\Bigr]_{-\Omega}^{\Omega}
=\frac{\Omega}{\sqrt{2\pi}}\bigl(\operatorname{sinc}((n-\omega)\Omega)+\operatorname{sinc}((n+\omega)\Omega)\bigr).
\]
The same computation leads to the image of the sine signal; the only difference is one minus sign and an additional $i$ in the formula:
\[
\mathcal{F}(g)(\omega)=i\,\frac{\Omega}{\sqrt{2\pi}}\bigl(-\operatorname{sinc}((n-\omega)\Omega)+\operatorname{sinc}((n+\omega)\Omega)\bigr). \qquad\square
\]

7.B.10. Find the Fourier transform of the superposition of cosine signals over the interval $(-\Omega,\Omega)$,
\[
f(t)=h_\Omega(t)\bigl(3\cos(5t)+\cos(15t)\bigr).
\]
What happens if $\Omega\to\infty$?

Solution. The Fourier transform is linear over scalars, so we simply add the corresponding images from the previous problem with $n=5$ and $n=15$, multiplied by the proper coefficients. The illustration of the image with $\Omega=20$ is shown in the figure. Each of the peaks behaves like the Fourier image of the characteristic function $h_\Omega$, shifted to the corresponding frequencies.

Further, the definition of the Fourier transform yields
\[
\langle f,g\rangle=\bigl(\mathcal{F}^{-1}\mathcal{F}(f*\hat g)\bigr)(0)
=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathcal{F}(f*\hat g)\,e^{ix\omega}\,d\omega\Big|_{x=0}
=\int_{-\infty}^{\infty}\mathcal{F}f(\omega)\,\mathcal{F}\hat g(\omega)\,d\omega,
\]
while
\[
\mathcal{F}(\hat g)(\omega)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}g(-x)\,e^{-i\omega x}\,dx=\overline{\mathcal{F}(g)(\omega)}.
\]
Consequently, $\langle f,g\rangle=\langle\mathcal{F}(f),\mathcal{F}(g)\rangle$. Thus, we have verified that the Fourier transform preserves the scalar product, and so it also preserves the $L_2$-norm. This also explains our choice of the constants in the definition.

7.2.9. Dirac delta-function. We return to the first example of the inverse transform of the indicator function $f_\Omega$ of the interval $[-\Omega,\Omega]$. Let $\Omega$ approach infinity and denote by $\sqrt{2\pi}\,\delta(t)$ the desired "limit function" for $\mathcal{F}^{-1}(f_\Omega)(t)$. The inverse image of a product with an arbitrary image $\mathcal{F}(g)$ can be expressed using convolution:
\[
\mathcal{F}^{-1}\bigl(f_\Omega\cdot\mathcal{F}(g)\bigr)(z)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}g(t)\,\mathcal{F}^{-1}(f_\Omega)(z-t)\,dt.
\]
As $\Omega$ increases to $\infty$, the left-hand expression should approach $\mathcal{F}^{-1}(\mathcal{F}(g))(z)=g(z)$, while on the right-hand side, we get
\[
g(z)=\int_{-\infty}^{\infty}g(t)\,\delta(z-t)\,dt.
\]
The desired $\delta(t)$ thus looks like a "function" which is zero everywhere except at the single point $t=0$, where it "has an infinite value". Integrating the product of $\delta(t)$ with any integrable function $g$ gives just the value of $g$ at the point $t=0$. Of course, this is strictly speaking not a function at all. Nevertheless, it is a useful concept. It is called the Dirac function $\delta$, and it can be described correctly as an example of what is known as a distribution. Since we do not have enough space and time, we do not pay further attention to distributions. We mention only that the Dirac $\delta$ can be imagined as a unit impulse at a single point. In fact, we saw similar concepts under the name "measure" when dealing with the Riemann-Stieltjes and Henstock-Kurzweil integrals, cf. 6.3.18, 6.3.14, and we shall come back to them in Chapter 10 in the context of probability. In this sense, the Dirac function is the (probability) measure concentrated in the origin, and it can be realized by the Riemann-Stieltjes integral with the piecewise constant function $g$ with a single unit jump at the origin. Its Fourier transform is the constant function $\mathcal{F}(\delta)(\omega)=\frac{1}{\sqrt{2\pi}}$. On the other hand, many functions which are not strictly integrable on $\mathbb{R}$ are Fourier-transformed to expressions with

If $\Omega$ increases to infinity, the image $\tilde f$ has four peaks at the same positions, corresponding to the frequencies $\pm5$ and $\pm15$, but they become narrower and sharper. In the limit, this is no longer a function, since the width of the peaks becomes zero. This is usually written
\[
\mathcal{F}(\cos(nt))(\omega)=\sqrt{\tfrac\pi2}\bigl(\delta(n-\omega)+\delta(n+\omega)\bigr),
\]
with the special case $\mathcal{F}(1)(\omega)=\sqrt{2\pi}\,\delta(\omega)$. See 7.2.9 for comments on the Dirac delta function. □

7.B.11. Find the Fourier transform image of the convolution of the signals $f(t)$ and $g(t)$ from Problem 7.B.1. Recall that $f(t)=\sin(t)+0.4[\sin(6t)]^2-0.2\sin(60t)$ and $g$ is the characteristic function of the interval $(-\varepsilon,\varepsilon)$. Assume that the signal is nonzero only in the interval $(-\Omega,\Omega)$.

Solution. Once we note that $\mathcal{F}(f*g)=\sqrt{2\pi}\,\mathcal{F}(f)\mathcal{F}(g)$, we have all the ingredients ready. Indeed, in 7.B.5 and in the last two problems above, we already computed the Fourier image of $g$ and of the sine and cosine functions on the interval $(-\Omega,\Omega)$. Instead of writing the explicit formulae for the result, we display illustrations of the real components of $\mathcal{F}(f)$ and $\mathcal{F}(f*g)$ in the first line, and similarly the imaginary components in the second line, all with $\Omega=5$, $\varepsilon=1/10$. The reader should compare the diagrams of $f$ and $f*g$ in 7.B.1 to see that the higher frequencies in $f$ are effectively canceled by this convolution, as expected. □

the Dirac $\delta$. For instance,
\[
\mathcal{F}(\cos(nt))(\omega)=\sqrt{\tfrac\pi2}\bigl(\delta(n-\omega)+\delta(n+\omega)\bigr),
\]
which can be seen from the calculation of the Fourier transform of the function $f_\Omega\cos(nx)$ and then letting $\Omega$ approach $\infty$, see the solution to problem 7.B.10. We can obtain the Fourier transform of the sine function in a similar way. We can take advantage of the fact that the transform of the derivative of this function differs only by a multiple of the imaginary unit and the new variable. Alternatively, we can also use the fact that the sine function is obtained from the cosine function by the phase shift of $\pi/2$. These transforms are a basis for the Fourier analysis of signals (see also problem 7.B.9): if a signal is a pure sinusoid of a given frequency, then this is recognized in the Fourier transform as two single-point impulses exactly at the positive and negative value of the frequency.
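This frequency analysis is easy to reproduce with a discrete transform (a sketch using NumPy's FFT; the sampling parameters are arbitrary choices, not from the text):

```python
# Sketch: the spectrum of 3 cos(5t) + cos(15t) sampled on (-Omega, Omega)
# has its dominant bins at the angular frequencies +-5 and +-15.
import numpy as np

Omega = 20.0
n = 2**14
t = np.linspace(-Omega, Omega, n, endpoint=False)
f = 3 * np.cos(5 * t) + np.cos(15 * t)

F = np.fft.fftshift(np.fft.fft(f))
w = np.fft.fftshift(np.fft.fftfreq(n, d=t[1] - t[0])) * 2 * np.pi  # angular

for lo, hi in ((2, 8), (12, 18)):            # search two positive bands
    band = (w > lo) & (w < hi)
    print(w[band][np.argmax(np.abs(F[band]))])   # ~5 and ~15
```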
If the signal is a linear combination of several such pure signals, then we obtain the same linear combination of single-point impulses. However, since we always process a signal in a finite time interval only, we get not single-point impulses, but rather a wavy curve similar to the function sinc, with a strong maximum at the value of the corresponding frequency. The size of this maximum also yields information about the amplitude of the original signal.

Another good way to approximate the Dirac delta function is to exploit the Gaussian functions. As seen in the solution to problem 7.B.7, the Fourier image of the Gaussian function $f(t)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{t^2}{2\sigma^2}}$ is again a Gaussian, corresponding to the reciprocal value of $\sigma$. In the limit $\sigma\to0$, the image converges fast to a multiple of the constant function, see the illustrations in the picture, with $\sigma=3$ and $\sigma=1/10$. Notice that the rather large $\sigma$ in the first illustration corresponds to a wide Gaussian, while its image is the slim one. The second illustration provides the opposite case: the preimage is the narrow Gaussian and the image is already reasonably close to the constant function. The Gaussians are chosen with $L_1$-norm equal to one, but the Fourier transform preserves the $L_2$-norm of the functions.

7.2.10. Fourier sine and cosine transform. If we apply the Fourier transform to an odd function $f(t)$, where $f(-t)=$

As discussed in 7.2.5, the Fourier transform has an inverse operation. This means that no information is lost when changing from the time behaviour of a signal to its frequency behaviour. This allows us to use the Fourier transform for the solution of functional equations involving differentiation or integration. We stay with elementary observations only and return to differential equations in one and more variables in the following chapters.

7.B.12. By using the inverse Fourier transform, solve the integral equation
\[
\int_0^\infty f(t)\sin(xt)\,dt=e^{-x},\qquad x>0,
\]
for an unknown function $f$.

Solution. Multiply both sides of the equation by $\sqrt{2/\pi}$ to obtain the sine Fourier transform on the left-hand side. Apply the inverse transform to both sides of the equation to get
\[
f(t)=\frac2\pi\int_0^\infty e^{-x}\sin(xt)\,dx,\qquad t>0.
\]
Integrating by parts twice shows that
\[
\int e^{-x}\sin(xt)\,dx=\frac{e^{-x}}{1+t^2}\bigl(-\sin(xt)-t\cos(xt)\bigr)+C.
\]
Hence
\[
\int_0^\infty e^{-x}\sin(xt)\,dx=\lim_{x\to\infty}\Bigl(\frac{e^{-x}}{1+t^2}\bigl(-\sin(xt)-t\cos(xt)\bigr)\Bigr)-\frac{e^0(-t)}{1+t^2}=\frac{t}{1+t^2}.
\]
So $f(t)=\frac2\pi\,\frac{t}{1+t^2}$, $t>0$. □

7.B.13. Use the Fourier transform to find solutions of the non-homogeneous linear differential equation
(1) $y'=ay+f$,
where $a\in\mathbb{R}$ is a non-zero constant and $f$ is a known function. Can all solutions be obtained in this way?

Solution. The key observation for this problem is the relation between the Fourier transform and the derivative, $\mathcal{F}(f')(\omega)=i\omega\mathcal{F}(f)(\omega)$, see 7.2.6. Thus, if the Fourier transform is applied to the equation (1), we get the algebraic equation for $\tilde y=\mathcal{F}(y)$:
\[
i\omega\tilde y=a\tilde y+\tilde f.
\]
If it is assumed that $\mathcal{F}(f)=\tilde f$ exists and there is a solution $y$ with the Fourier image $\tilde y$, then
\[
\tilde y=\frac{1}{i\omega-a}\,\tilde f,
\]

$-f(t)$, the contribution in the integration of the product of $f(t)$ and the function $\cos(\pm\omega t)$ cancels for positive and negative values of $t$. Thus if $f$ is odd, then
\[
\mathcal{F}(f)(\omega)=\frac{-2i}{\sqrt{2\pi}}\int_0^\infty f(t)\sin(\omega t)\,dt.
\]
The resulting function is odd again; hence, for the same reason, the inverse transform can be determined similarly:
\[
\mathcal{F}^{-1}(f)(\omega)=\frac{2i}{\sqrt{2\pi}}\int_0^\infty f(t)\sin(\omega t)\,dt.
\]
Omitting the imaginary unit $i$ (notice, this means we have to multiply the Fourier transform and its inverse by $i$ and $-i$, respectively) gives mutually inverse transforms, which are called the Fourier sine transform for odd functions:
\[
\tilde f_s(\omega)=\sqrt{\tfrac2\pi}\int_0^\infty f(t)\sin(\omega t)\,dt,\qquad
f(t)=\sqrt{\tfrac2\pi}\int_0^\infty\tilde f_s(\omega)\sin(\omega t)\,d\omega.
\]
Similarly, we can define the Fourier cosine transform for even functions:
\[
\tilde f_c(\omega)=\sqrt{\tfrac2\pi}\int_0^\infty f(t)\cos(\omega t)\,dt,\qquad
f(t)=\sqrt{\tfrac2\pi}\int_0^\infty\tilde f_c(\omega)\cos(\omega t)\,d\omega.
\]

7.2.11. Laplace transforms. The Fourier transform can mainly be applied to functions which are integrable in absolute value over $\mathbb{R}$. The Laplace transform is similar to the Fourier transform and applies to all functions whose growth is not too fast:
\[
\mathcal{L}(f)(s)=\int_0^\infty f(t)\,e^{-st}\,dt.
\]
The integral operator $\mathcal{L}$ has a rapidly decreasing kernel if $s$ is a positive real number. Therefore, the Laplace transform is usually perceived as a mapping of suitable functions on the interval $[0,\infty)$ to functions on the same or a shorter interval. The image $\mathcal{L}(p)$ exists, for example, for every polynomial $p(t)$ and all positive numbers $s$. Analogously to the Fourier transform, we obtain the formula for the Laplace transform of a differentiated function for $s>0$ by using integration by parts:
\[
\mathcal{L}(f'(t))(s)=\int_0^\infty f'(t)\,e^{-st}\,dt=\bigl[f(t)\,e^{-st}\bigr]_0^\infty+s\int_0^\infty f(t)\,e^{-st}\,dt=-f(0)+s\mathcal{L}(f)(s).
\]
The properties of the Laplace transform and many other transforms used in technical practice can be found in specialized literature. We provide a few examples in the other column, starting with 7.B.17.

and using the general relation $\mathcal{F}^{-1}(g\cdot h)=\frac{1}{\sqrt{2\pi}}\,\mathcal{F}^{-1}(g)*\mathcal{F}^{-1}(h)$ between products and convolutions from 7.2.7, we arrive at the final formula
\[
y=\frac{1}{\sqrt{2\pi}}\,\mathcal{F}^{-1}\Bigl(\frac{1}{i\omega-a}\Bigr)*f.
\]
So it is necessary to compute the inverse Fourier transform of the simple rational function $(i\omega-a)^{-1}$. Guess the solution in two steps. Assume first $a>0$ and evaluate
\[
\int_{-\infty}^0 e^{at}\,e^{-i\omega t}\,dt=\Bigl[\frac{1}{a-i\omega}e^{(a-i\omega)t}\Bigr]_{-\infty}^0=\frac{1}{a-i\omega}.
\]
Similarly, for $a<0$,
\[
\int_0^\infty e^{at}\,e^{-i\omega t}\,dt=\Bigl[\frac{1}{a-i\omega}e^{(a-i\omega)t}\Bigr]_0^\infty=\frac{1}{i\omega-a}.
\]
This provides the two desired results. Indeed, if the equation (1) comes with $a>0$, we rewrite our rational function as $-(a-i\omega)^{-1}$. Then the function $-\sqrt{2\pi}\,e^{at}$ for negative $t$ (and zero for positive $t$) provides the requested Fourier image. It is immediately seen that the convolution
\[
y(t)=-\int_{-\infty}^0 e^{ax}f(t-x)\,dx
\]
is a solution. (The multiples $\sqrt{2\pi}$ in the expression with the convolution cancel.) Similarly, if $a<0$, then
\[
y(t)=\int_0^\infty e^{ax}f(t-x)\,dx
\]
is a solution.

Not all solutions can be obtained in this way. For example, $y'=y$ leads to $y(t)=C\,e^t$ with an arbitrary constant $C$, but this is not a function with a Fourier image. With $f(t)=0$, our procedure produces the zero function, which is just one of the solutions. Similarly, if we deal with the equation $y'=y+t$, then the particular solution suggested above is
\[
y(t)=-\int_{-\infty}^0 e^x(t-x)\,dx=-t-1. \qquad\square
\]

7.B.14. Check directly that the two functions $y(t)$ found above are indeed solutions to the equation $y'=ay+f$. ⃝

7.B.15. As in the previous problem, solve the second order equation $y''=ay+f$.

Solution. Use the fact that $\mathcal{F}(y'')(\omega)=-\omega^2\mathcal{F}(y)(\omega)$ and deduce the algebraic relation $-\omega^2\tilde y=a\tilde y+\tilde f$ for the Fourier images $\tilde y$ and $\tilde f$. Hence
\[
\tilde y=\frac{-1}{\omega^2+a}\,\tilde f.
\]

7.2.12. Discrete transforms. The Fourier analysis of signals mentioned in the previous paragraph is realized by special analog circuits in, for example, radio technology. Nowadays, we work only with discrete data when processing signals by computer circuits.
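Before the formal treatment in the next paragraph, here is a minimal sketch of the discrete transform pair that will be defined there (Python, not from the text); it verifies the mutual inversion proved below:

```python
# Sketch of the discrete transform pair from 7.2.12: the forward map
# divides by N, the backward map does not, and their composition is the
# identity (cf. the theorem below).
import numpy as np

N = 16
rng = np.random.default_rng(1)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)

r = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(r, r) / N)   # kernel e^{-i 2pi kr/N}

f_tilde = (W @ f) / N                          # forward transform
f_back = W.conj() @ f_tilde                    # backward transform (W is symmetric)

print(np.max(np.abs(f_back - f)))              # ~1e-15
```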
Assume that there is a fixed (small) sampling interval $\tau$ given in a (discrete) time variable and that, for a large natural number $N$, the signal repeats with period $N\tau$, which is the maximal period that can be represented in our discrete model. We should not be surprised that our continuous models allow for a discrete analogy. Consider an $N$-dimensional vector, which can be imagined as the function $r\mapsto f(r)\in\mathbb{C}$, for $r=0,1,\dots,N-1$. Denote $\Delta\omega=\frac{2\pi}{N}$ and $\omega_k=k\Delta\omega$. The simplest discrete approximation of the Fourier transform integral suggests that
\[
\tilde f(k)=\frac1N\sum_{r=0}^{N-1}f(r)\,e^{-i\frac{2\pi}{N}kr}
\]
should be a promising transformation $f\mapsto\tilde f$, whose inverse $f\mapsto\hat f$ should not be far from
\[
\hat{\tilde f}(k)=\sum_{r=0}^{N-1}\tilde f(r)\,e^{i\frac{2\pi}{N}kr}.
\]
Actually, these are already mutually inverse transformations:

Theorem. The transformation above satisfies $\hat{\tilde f}(k)=f(k)$ for all $k=0,1,\dots,N-1$.

Proof. Let $T=\sum_{r=0}^{N-1}e^{ir\frac{2\pi}{N}k}$. Then $e^{i\frac{2\pi}{N}k}T=\sum_{r=1}^{N}e^{ir\frac{2\pi}{N}k}$, and so by subtraction, $(1-e^{i\frac{2\pi}{N}k})T=1-e^{i2\pi k}$. The right-hand side is 0 for all integers $k$. On the left side, the coefficient of $T$ is not zero unless $k$ is a multiple of $N$. Hence
\[
T=\sum_{r=0}^{N-1}e^{ir\frac{2\pi}{N}k}=\begin{cases}N & \text{if $k$ is a multiple of }N,\\ 0 & \text{otherwise.}\end{cases}
\]
With $k$ and $s$ both confined to the range $\{0,1,2,\dots,N-1\}$, $k-s$ can only be a multiple of $N$ when $k=s$. It follows that for such $k$ and $s$,
\[
\sum_{r=0}^{N-1}e^{ir\frac{2\pi}{N}(k-s)}=\delta_{ks}N,
\]
where the Kronecker delta $\delta_{ks}=0$ for $k\neq s$ and $\delta_{ks}=1$ if $k=s$. Finally, we compute:

In order to guess the correct preimage of the rational function in question, first assume $a>0$ and compute
\[
\int_{-\infty}^{\infty}e^{-a|x|}\,e^{-i\omega x}\,dx
=\Bigl[\frac{1}{a-i\omega}e^{(a-i\omega)x}\Bigr]_{-\infty}^0+\Bigl[\frac{-1}{a+i\omega}e^{-(a+i\omega)x}\Bigr]_0^\infty
=\frac{1}{a-i\omega}+\frac{1}{a+i\omega}=\frac{2a}{a^2+\omega^2}.
\]
Thus it is verified that
\[
\mathcal{F}\bigl(e^{-\sqrt a|x|}\bigr)=\sqrt{\frac{2a}{\pi}}\,\frac{1}{a+\omega^2}.
\]
Immediately (the factors $\sqrt{2\pi}$ cancel),
\[
y(t)=-\frac{1}{2\sqrt a}\,e^{-\sqrt a|t|}*f(t)=-\frac{1}{2\sqrt a}\int_{-\infty}^{\infty}e^{-\sqrt a|x|}f(t-x)\,dx.
\]
The case $a<0$ is a little more complicated. But we may ask Maple, or look up in the literature, that the function $g(t)=\sin(b|t|)$ has the Fourier image
\[
\mathcal{F}(g)(\omega)=\frac{1}{\sqrt{2\pi}}\,\frac{2b}{b^2-\omega^2}.
\]
We are nearly finished. The required preimage is $h(t)=\frac{\sqrt{2\pi}}{\sqrt{-4a}}\sin(\sqrt{-a}\,|t|)$, and the resulting convolution is
\[
y(t)=\frac{1}{\sqrt{-4a}}\int_{-\infty}^{\infty}\sin(\sqrt{-a}\,|x|)\,f(t-x)\,dx.
\]
If we write $b=\sqrt{-a}$, i.e. rewrite the equation as $y''+b^2y=f$ with $b>0$, the result says
\[
y(t)=\frac{1}{2b}\int_{-\infty}^{\infty}\sin(b|x|)\,f(t-x)\,dx. \qquad\square
\]

7.B.16. Check directly that the two functions $y(t)$ found above are indeed solutions to the equation $y''=ay+f$. ⃝

The Laplace transform is another integral transform which interchanges differentiation and algebraic multiplication. As with the Fourier transform, it is based on the properties of the exponential function, but this time we take the real exponential, see 7.2.11 for the formula. One advantage is that every polynomial has its Laplace image.

7.B.17. Determine the Laplace transform $\mathcal{L}(f)(s)$ for each of the functions
(a) $f(t)=e^{at}$; (b) $f(t)=c_1e^{a_1t}+c_2e^{a_2t}$; (c) $f(t)=\cos(bt)$; (d) $f(t)=\sin(bt)$; (e) $f(t)=\cosh(bt)$; (f) $f(t)=\sinh(bt)$; (g) $f(t)=t^k$, $k\in\mathbb{N}$,

\[
\hat{\tilde f}(k)=\sum_{r=0}^{N-1}\Bigl(\frac1N\sum_{s=0}^{N-1}f(s)\,e^{-i\frac{2\pi}{N}rs}\Bigr)e^{i\frac{2\pi}{N}rk}
=\frac1N\sum_{s=0}^{N-1}f(s)\Bigl(\sum_{r=0}^{N-1}e^{i\frac{2\pi}{N}r(k-s)}\Bigr)
=\frac1N\sum_{s=0}^{N-1}f(s)\,\delta_{ks}N=f(k). \qquad\square
\]
The computations in the proof also verify that the Fourier image of a periodic complex valued function with a unique period among the chosen sampling periods is just its amplitude at this particular frequency. Thus, if the signal has been created as a superposition of periodic signals with the sampling frequencies only, we obtain the absolutely optimal result.
However, if the transformed signal has a frequency not exactly available among the sampling frequencies, there are nonzero amplitudes at all the sampling frequencies in the Fourier image. This is called frequency leakage in the technical literature.

There is a vast amount of literature devoted to fast implementation and exploitation of the discrete Fourier transform, as well as other similar discrete tools. This is an extremely active area of current research.

3. Metric spaces

At the end of the chapter, we focus on the concepts of distance and convergence in a more abstract way. This also provides the conceptual background for some of the already derived properties of Fourier series and the Fourier transform. We need these concepts in miscellaneous contexts later. It is hoped that the subsequent pages are a useful (and hopefully manageable) trip into the world of mathematics for the competent or courageous!

7.3.1. Metrics and norms. When we discussed Fourier series, the distance between functions in a space of functions was commonly referred to. Now we examine the concept of distance more thoroughly. We are familiar with the distance of points $x,y$ in the Euclidean space $\mathbb{R}^n$, given by the size of the vector $x-y$. A very different example of distance is the so-called discrete metric, defined on any set $X$ as follows: each element $x\in X$ has distance zero from itself and one from all other elements in $X$. Notice that the triangle inequality is strict here – making detours when walking from one element to another always increases the distance travelled. Both of these distances are generalized in the following concept:

where the constants $b\in\mathbb{R}$ and $a,a_1,a_2,c_1,c_2\in\mathbb{C}$ are arbitrary. It is assumed that the positive number $s\in\mathbb{R}$ is greater than the real parts of the numbers $a,a_1,a_2\in\mathbb{C}$, and is greater than $|b|$ in the problems (e) and (f).

Solution. The case (a). It follows directly from the definition of the Laplace transform that
\[
\mathcal{L}(f)(s)=\int_0^\infty e^{at}e^{-st}\,dt=\int_0^\infty e^{-(s-a)t}\,dt
=\lim_{R\to\infty}\frac{e^{-(s-a)R}}{-(s-a)}-\frac{e^0}{-(s-a)}=\frac{1}{s-a}.
\]
The case (b). Using the result of the above case and the linearity of improper integrals, we obtain
\[
\mathcal{L}(f)(s)=c_1\int_0^\infty e^{a_1t}e^{-st}\,dt+c_2\int_0^\infty e^{a_2t}e^{-st}\,dt=\frac{c_1}{s-a_1}+\frac{c_2}{s-a_2}.
\]
The case (c). Since $\cos(bt)=\frac12\bigl(e^{ibt}+e^{-ibt}\bigr)$, the choice $c_1=1/2=c_2$, $a_1=ib$, $a_2=-ib$ in the previous case gives
\[
\mathcal{L}(f)(s)=\int_0^\infty\bigl(\tfrac12e^{ibt}+\tfrac12e^{-ibt}\bigr)e^{-st}\,dt=\frac{1}{2(s-ib)}+\frac{1}{2(s+ib)}=\frac{s}{s^2+b^2}.
\]
The cases (d), (e), (f). Analogously, the choices
(d) $c_1=-i/2$, $c_2=i/2$, $a_1=ib$, $a_2=-ib$;
(e) $c_1=1/2=c_2$, $a_1=b$, $a_2=-b$;
(f) $c_1=1/2$, $c_2=-1/2$, $a_1=b$, $a_2=-b$
lead to
(d) $\mathcal{L}(f)(s)=\frac{b}{s^2+b^2}$; (e) $\mathcal{L}(f)(s)=\frac{s}{s^2-b^2}$; (f) $\mathcal{L}(f)(s)=\frac{b}{s^2-b^2}$.
Finally, the last one is obtained by a straightforward repetition of integration by parts:
\[
\mathcal{L}(t^k)(s)=\int_0^\infty t^k\,e^{-st}\,dt=\Bigl[-t^k\frac1se^{-st}\Bigr]_0^\infty+\frac ks\int_0^\infty t^{k-1}e^{-st}\,dt=\dots=\frac{k!}{s^k}\int_0^\infty e^{-st}\,dt=\frac{k!}{s^{k+1}}. \qquad\square
\]

7.B.18. Use the definition of the Gamma function $\Gamma(t)$ in Chapter 6 in order to prove
\[
\mathcal{L}(t^\alpha)(s)=\Gamma(\alpha+1)\,\frac{1}{s^{\alpha+1}}
\]
for general $\alpha>0$. Compare the result to that of 7.B.17(g). ⃝

7.B.19. For $s>-1$, calculate the Laplace transform $\mathcal{L}(g)(s)$ of the function $g(t)=t\,e^{-t}$. Further, for $s>1$, calculate the Laplace transform $\mathcal{L}(h)(s)$ of the function $h(t)=t\sinh t$.
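These images can be cross-checked symbolically (a sketch; SymPy is our choice of tool here, using its standard laplace_transform call, and the sample exponent is arbitrary):

```python
# Symbolic cross-check of a few cases of 7.B.17 with SymPy.
import sympy as sp

t, s = sp.symbols("t s", positive=True)
a, b = sp.symbols("a b", positive=True)
k = 3  # a sample exponent for case (g)

for f in (sp.exp(a * t), sp.cos(b * t), sp.sin(b * t), t**k):
    F = sp.laplace_transform(f, t, s, noconds=True)
    print(f, "->", sp.simplify(F))
# exp(a*t) -> 1/(s - a),   cos(b*t) -> s/(b**2 + s**2),
# sin(b*t) -> b/(b**2 + s**2),   t**3 -> 6/s**4
```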
Axioms of a metric and a norm

A set $X$ together with a mapping $d:X\times X\to\mathbb{R}$ such that for all $x,y,z\in X$ the following conditions are satisfied:
(1) $d(x,y)\geq0$, and $d(x,y)=0$ if and only if $x=y$,
(2) $d(x,y)=d(y,x)$,
(3) $d(x,z)\leq d(x,y)+d(y,z)$,
is called a metric space. The mapping $d$ is a metric on $X$.
If $X$ is a vector space over $\mathbb{R}$ and $\|\ \|:X\to\mathbb{R}$ is a function satisfying
(4) $\|x\|\geq0$, and $\|x\|=0$ if and only if $x=0$,
(5) $\|\lambda x\|=|\lambda|\,\|x\|$ for all scalars $\lambda$,
(6) $\|x+y\|\leq\|x\|+\|y\|$,
then the function $\|\ \|$ is called a norm on $X$, and the space $X$ is then a normed vector space. Every norm defines a metric by setting $d(x,y)=\|x-y\|$.

The Euclidean distance in the vector spaces $\mathbb{R}^n$ satisfies the above three requirements, and it is given by a norm. The $L_1$ norm $\|\ \|_1$ and the $L_2$ norm $\|\ \|_2$ on functions satisfy the norm properties and thus define a metric on the relevant spaces of functions. Of course, not every metric can be defined by a norm in this way. The discrete metric mentioned above is an example. Metrics given by a norm have very specific properties, since their behaviour on the whole space $X$ can be derived from the properties in an arbitrarily small neighbourhood of the zero element $x=0\in X$.

7.3.2. Convergence. The concepts of (close) neighbourhoods of particular elements, convergence of sequences of elements, and the corresponding "topological" concepts can be defined on abstract metric spaces in much the same way as in the case of the real and complex numbers and their sequences. See the beginning of the fifth chapter, 5.2.3–5.2.8. We can almost copy these paragraphs, although the proof of the theorem 5.2.8 is much harder. We begin with the concept of convergent sequences in a metric space $X$ with metric $d$:

Cauchy sequences

Consider an arbitrary sequence of elements $x_0,x_1,\dots$ in $X$. Suppose that for any fixed positive real number $\varepsilon$, $d(x_i,x_j)<\varepsilon$ for all but finitely many pairs of terms $x_i,x_j$ of the sequence. In other words, for any given $\varepsilon>0$, there is an index $N$ such that the above inequality holds for all $i,j>N$. Loosely put, the elements of the sequence are eventually arbitrarily close to each other. Such a sequence is called a Cauchy sequence.

Just as in the case of the real or complex numbers, we would like every Cauchy sequence of terms $x_i\in X$ to converge to some $x$ in the following sense:

⃝

7.B.20. The basic Laplace transforms are enumerated in the following table:

$y(t)$ | $\mathcal{L}(y)(s)$
$t^k$ | $\dfrac{k!}{s^{k+1}}$
$e^{at}$ | $\dfrac{1}{s-a}$
$t\,e^{at}$ | $\dfrac{1}{(s-a)^2}$
$t^n\,e^{at}$ | $\dfrac{n!}{(s-a)^{n+1}}$
$\sin\omega t$ | $\dfrac{\omega}{s^2+\omega^2}$
$\cos\omega t$ | $\dfrac{s}{s^2+\omega^2}$
$e^{at}\sin\omega t$ | $\dfrac{\omega}{(s-a)^2+\omega^2}$
$e^{at}\bigl(\cos\omega t+\frac{a}{\omega}\sin\omega t\bigr)$ | $\dfrac{s}{(s-a)^2+\omega^2}$
$t\sin\omega t$ | $\dfrac{2\omega s}{(s^2+\omega^2)^2}$
$\sin\omega t-\omega t\cos\omega t$ | $\dfrac{2\omega^3}{(s^2+\omega^2)^2}$

Establish the 5th and 6th rows of the table above using Euler's formula $e^{i\omega t}=\cos\omega t+i\sin\omega t$. ⃝

As expected, using the features of the Laplace transform allows us to find explicit solutions to some differential equations. By 7.D.18, it is straightforward to incorporate the initial conditions into the solution. We present just two such examples in the problems at the conclusion of this chapter, see 7.D.21. We return to this topic in Chapter 8.

C. Metric spaces

The concept of metric is an abstract version of what we understand as distance in Euclidean geometry. It is always based on the triangle inequality. The axioms in Definition 7.3.1 follow the Euclidean experience, saying that our "distance" of two elements has to be strictly positive (except if the two elements coincide), should be symmetric in the arguments, and should satisfy the triangle inequality.
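A random spot-check (a sketch, with arbitrary dimension and sample count) that a metric defined by a norm as above indeed satisfies the triangle inequality, here for the $p$-norms introduced in 7.3.4 below:

```python
# Random spot-check: d(x, y) = ||x - y||_p satisfies the triangle
# inequality for a few p >= 1 (cf. 7.3.1 and 7.3.4).
import numpy as np

rng = np.random.default_rng(42)
for p in (1.0, 1.5, 2.0, 7.0):
    d = lambda u, v: np.sum(np.abs(u - v) ** p) ** (1 / p)
    worst = 0.0
    for _ in range(10000):
        x, y, z = rng.standard_normal((3, 5))
        worst = max(worst, d(x, z) - (d(x, y) + d(y, z)))
    print(p, worst)   # never positive
```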
Other concepts available in the literature are more abstract and might lead to more general objects, the most important ones being pseudometrics, ultrametrics, and semimetrics. (The first axiomatic definition of a "traditional" metric was given by Maurice Fréchet in 1906. However, the name of the metric comes from Felix Hausdorff, who used this word in his work from 1914.)

7.C.1. The discrete metric space $X$ is defined as the set $X$ with the function $d:X\times X\to\mathbb{R}$,
\[
d(x,y)=\begin{cases}1 & x\neq y\\ 0 & x=y.\end{cases}
\]
Show that this is a metric space according to Definition 7.3.1. Show how to introduce a metric on Cartesian products of metric spaces, so that the product of two discrete metric spaces is again discrete.

Convergent sequences

Let $x_0,x_1,\dots$ be a sequence in a metric space $X$ and let $x$ be an element of $X$. We say that the sequence $\{x_i\}$ converges to the element $x$ if, for every positive real number $\varepsilon$, there is an integer $N>0$ such that $i>N$ implies $d(x_i,x)<\varepsilon$.

By the triangle inequality, it follows that for each pair of terms $x_i,x_j$ from a convergent sequence with sufficiently large indices,
\[
d(x_i,x_j)\leq d(x_i,x)+d(x,x_j)<2\varepsilon.
\]
Therefore, every convergent sequence is a Cauchy sequence. Conversely however, not every Cauchy sequence is convergent. Metric spaces where every Cauchy sequence is convergent are called complete metric spaces.

7.3.3. Topology, convergence, and continuity. Just as in the case of the real numbers, we can formulate the convergence in terms of "open neighbourhoods".

Open and closed sets

Definition. The open $\varepsilon$-neighbourhood of an element $x$ in a metric space $X$ (or just $\varepsilon$-neighbourhood for short) is the set
\[
O_\varepsilon(x)=\{y\in X;\ d(x,y)<\varepsilon\}.
\]
A subset $U\subset X$ is open if and only if for all $x\in U$, $U$ contains some $\varepsilon$-neighbourhood of $x$. We define a subset $W\subset X$ to be closed if and only if its complement $X\setminus W$ is an open set.

Instead of an $\varepsilon$-neighbourhood, we also talk about an (open) $\varepsilon$-ball centered at $x$. In the case of a normed space, we can consider $\varepsilon$-balls centered at zero: along with $x$, $\varepsilon$-balls determine an $\varepsilon$-neighbourhood. The limit points of a subset $A\subset X$ are defined as those elements $x\in X$ such that there is a sequence of points in $A$ other than $x$ converging to $x$.

Lemma. A subset in a metric space is closed if and only if it contains all of its limit points.

Proof. Suppose $A$ is closed and $x$ is a limit point of $A$ not belonging to $A$. Then $x\in X\setminus A$, which is open, so there is an $\varepsilon$-neighbourhood of $x$ not intersecting $A$. But in every $\varepsilon$-neighbourhood of $x$ there are infinitely many points of the set $A$, since $x$ is a limit point. This is a contradiction.

Conversely, suppose $A$ contains all of its limit points and suppose $x\in X\setminus A$. If in every $\varepsilon$-neighbourhood of the point $x$ there is a point $x_\varepsilon\in A$, then the choices $\varepsilon=1/n$ provide a sequence of points $x_n\in A$ converging to $x$. But then the point $x$ would have to be a limit point, thus lying in $A$, which again leads to a contradiction. □

For every subset $A$ in a metric space $X$, we define its interior as the set of those points $x$ in $A$ for which a neighbourhood of $x$ also belongs to $A$. We define the closure $\bar A$ of

Solution. All three axioms of a metric from 7.3.1 are obviously satisfied by our definition of the discrete metric space. Consider two metric spaces $X$ and $Y$ with metrics $d_X$ and $d_Y$. The first obvious idea seems to be to add the distances of the components, i.e.
\[
d\bigl((x_1,y_1),(x_2,y_2)\bigr)=d_X(x_1,x_2)+d_Y(y_1,y_2).
\]
Clearly this is a metric (verify in detail!), but if the metric spaces $X$ and $Y$ are discrete, then considering points $u=(x_1,y_1)$ and $w=(x_2,y_2)$ such that $x_1\neq x_2$, $y_1\neq y_2$, we arrive at $d(u,w)=2$. Thus, this is not a discrete metric space. But there is another simple possibility of introducing a metric on $X\times Y$, using the maximum of the distances:
\[
d\bigl((x_1,y_1),(x_2,y_2)\bigr)=\max\{d_X(x_1,x_2),\,d_Y(y_1,y_2)\}.
\]
We call this the product of the metric spaces $X$ and $Y$. The triangle inequality as well as the other axioms are obvious (write down the explicit arguments!). Moreover, if both $X$ and $Y$ are discrete, then $d$ is also a discrete metric. □

7.C.2. Decide whether or not the following sets and mappings form a metric space:
i) $\mathbb{N}$, $d(m,n)=\gcd(m,n)$;
ii) $\mathbb{N}$, $d(m,n)=\frac{\max(m,n)}{\gcd(m,n)}-1$;
iii) the world population, $d(P_1,P_2)=n$, where $P_1=X_0,X_1,\dots,X_{n+1}=P_2$ is the shortest sequence of people such that $X_i$ knows $X_{i+1}$ for $i=0,\dots,n$.

Solution.
i) No. The "distance" $d$ does not satisfy $d(m,m)=0$.
ii) No. The first and second conditions in the definition 7.3.1 are fulfilled, but the triangle inequality (property (3)) is not. The distance of 8 and 9 is 8, the distance of 8 and 6 is 3, and the distance of 6 and 9 is 2, thus $d(8,9)>d(8,6)+d(6,9)$.
iii) No. The "distance" is not symmetric. It would be a metric space if, in the definition, the word "knows" were changed to mean "know each other". □

7.C.3. Consider the set of binary words of length $n$. Define the distance between two words as the number of bits in which they differ. This is called the Hamming distance (see 12.5.2). Show that it defines a metric.

Solution. The first two axioms of a metric are satisfied. For the third one, let the words $x$ and $z$ differ in $k$ bits, and let $y$ be another word. Consider just the $k$ bits in which $x$ and $z$ differ. Clearly, in each of these bits, $y$ differs from exactly one of $x$ and $z$. Thus, considering only the parts of the words $x_p,y_p,z_p$ in these $k$ bits, we have $d(x_p,y_p)+d(y_p,z_p)=d(x_p,z_p)$. In the other bits, the words $x$ and $z$ are the same, while $x$ and $y$, or $y$ and $z$, may differ. Thus $d(x,y)+d(y,z)\geq d(x,z)$, and the third axiom is satisfied as well. □

a set $A$ as the union of the original set $A$ with the set of all limit points of $A$. As easily as in the case of the real numbers, we can verify that the intersection of any system of closed sets, as well as the union of any finite system of closed sets, is also closed. On the other hand, any union of open sets is again an open set, and a finite intersection of open sets is again an open set. Prove these propositions by yourselves in detail! We also advise the reader to verify that the interior of a set $A$ equals the union of all open sets contained in $A$ (alternatively put, the interior of $A$ is the largest open subset of $A$). The closure of $A$ is the intersection of all closed sets which contain $A$ (alternatively put, the closure of $A$ is the smallest closed superset of $A$).

The closed and open sets are the essential concepts of the mathematical discipline called topology. Without pursuing these ideas further, we have just familiarised ourselves with the topology of metric spaces. The concept of convergence can now be reformulated as follows: a sequence of elements $x_i$, $i=0,1,\dots$, in a metric space $X$ converges to $x\in X$ if and only if for every open set $U$ containing $x$, all but finitely many points of our sequence lie in $U$.
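For the Hamming distance of 7.C.3 above, the triangle inequality can even be verified exhaustively for small word lengths (a sketch; word length 6 is an arbitrary choice):

```python
# Exhaustive check of the triangle inequality for the Hamming distance
# on all binary words of length 6 (cf. 7.C.3).
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

words = list(product((0, 1), repeat=6))
ok = all(hamming(x, z) <= hamming(x, y) + hamming(y, z)
         for x in words for y in words for z in words)
print(ok)   # True
```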
Just as in the case of the real numbers, we can define continuous mappings between metric spaces:

Continuous mappings

Let W and Z be metric spaces. A mapping f : W → Z is continuous if and only if the inverse image f⁻¹(V) of every open set V ⊂ Z is an open set in W.

This is equivalent to the statement that f is continuous if and only if, for every z = f(x) ∈ Z and every positive real number ε, there is a positive real number δ such that d_Z(z, f(y)) < ε for all elements y ∈ W with distance d_W(x, y) < δ. Again, as in the case of real-valued functions, a mapping between metric spaces is continuous if and only if it preserves the convergence of sequences (check this yourselves!).

7.3.4. Lp–norms. We now have at our disposal the general tools with which we can look at examples of metric spaces built from finite-dimensional vectors or from functions. We restrict ourselves to an extraordinarily useful class of norms. We begin with the real or complex finite-dimensional vector spaces Rⁿ and Cⁿ: for a fixed real number p ≥ 1 and any vector z = (z₁, …, zₙ), we define
\[ \|z\|_p = \Bigl( \sum_{i=1}^{n} |z_i|^p \Bigr)^{1/p}. \]
We prove that this indeed defines a norm. The first two properties from the definition are clear. It remains to prove the triangle inequality, and for that purpose we use Hölder's inequality, stated and proved below.

7.C.4. Consider a connected subset S ⊂ Rⁿ (any two points in S can be connected by a path lying in S). Define the distance between two points as the length of the shortest path between them. Is this a metric on S?

Solution. It is a metric; all the axioms are easily verified. But this metric has a special significance: the principle of the "shortest way" is often met in reality. Recall, for example, Fermat's principle of least time (see 5.E.123), where the length of a path is measured by the time light needs to travel along it. In general, shortest paths in a metric space are called geodesics. □

7.C.5. Consider the space of integrable functions on the interval [a, b]. Define the (L₁) distance of the functions f, g as
\[ \|f - g\|_1 = \int_a^b |f(x) - g(x)| \, dx. \]
Why is this not a metric space?

Solution. The first axiom of a metric from 7.3.1 is not satisfied: any function which is non-zero only on a set of measure zero has distance 0 from the zero function. But if we consider the equivalence under which two functions are equivalent whenever they differ only on a set of measure zero, then we get the space S⁰(a, b), and the given distance, considered on the equivalence classes, is the L₁ metric. □

7.C.6. Let p be a prime number. Every non-zero rational number r can be written uniquely in the form r = pᵏ · u/v, where k ∈ Z, u ∈ Z and v ∈ N are coprime, and p divides neither the numerator u nor the denominator v. Consider the map ∥·∥_p : Q → R given by ∥r∥_p = p⁻ᵏ (with ∥0∥_p = 0). Show that it is a norm on Q as a vector space over Q. It is called the p-adic norm. ⃝

Solution. It is an exercise in elementary number theory. □
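As a computational companion to 7.C.6 (the solution itself being left as a number-theoretic exercise), the following sketch computes ∥r∥_p via the p-adic valuation and spot-checks the strong triangle inequality ∥r + s∥_p ≤ max(∥r∥_p, ∥s∥_p), which implies the ordinary one. The helper names are our own:

from fractions import Fraction

def vp(n: int, p: int) -> int:
    """Exponent of the prime p in the integer n (n != 0)."""
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k

def p_adic_norm(r: Fraction, p: int) -> Fraction:
    """The p-adic norm |r|_p = p**(-k) for r = p**k * u/v, with |0|_p = 0."""
    if r == 0:
        return Fraction(0)
    k = vp(r.numerator, p) - vp(r.denominator, p)
    return Fraction(1, p**k) if k >= 0 else Fraction(p**(-k))

p = 3
samples = [Fraction(n, d) for n in range(-9, 10) for d in range(1, 10)]
for r in samples:
    for s in samples:
        assert p_adic_norm(r + s, p) <= max(p_adic_norm(r, p), p_adic_norm(s, p))
print("strong triangle inequality verified on", len(samples)**2, "pairs")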
7.C.7. Consider the power set (the set of all subsets) of a given finite set. Determine whether the functions d₁ and d₂, defined for all subsets X, Y by
(a) d₁(X, Y) := |(X ∪ Y) ∖ (X ∩ Y)|,
(b) d₂(X, Y) := |(X ∪ Y) ∖ (X ∩ Y)| / |X ∪ Y| for X ∪ Y ≠ ∅, and d₂(∅, ∅) := 0,
are metrics. (Here |X| denotes the number of elements of the set X; thus the metric d₁ measures the size of the symmetric difference of the sets, while d₂ measures the relative symmetric difference.)

Solution. As usual in exercises on deciding whether a particular mapping is a metric, we omit the verification of the first and second conditions from the definition and analyze the triangle inequality only.

The case (a). For any sets X, Y, Z,
\[ (1)\quad (X \cup Z) \setminus (X \cap Z) \subseteq \bigl( (X \cup Y) \setminus (X \cap Y) \bigr) \cup \bigl( (Y \cup Z) \setminus (Y \cap Z) \bigr). \]
To show this, suppose first that x is an element satisfying x ∈ X and x ∉ Z. Then either x ∈ Y, in which case x ∈ (Y ∪ Z) ∖ (Y ∩ Z), or x ∉ Y, in which case x ∈ (X ∪ Y) ∖ (X ∩ Y).

Hölder inequality

Lemma. For a fixed real number p > 1 and every pair of n–tuples of non-negative real numbers xᵢ and yᵢ,
\[ \sum_{i=1}^{n} x_i y_i \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} \cdot \Bigl(\sum_{i=1}^{n} y_i^q\Bigr)^{1/q}, \]
where 1/q = 1 − 1/p.

Proof. Denote by X and Y the two factors in the product on the right-hand side of the inequality to be proved. If all of the numbers xᵢ or all of the numbers yᵢ are zero, then the statement is clearly true; therefore we can assume that X ≠ 0 and Y ≠ 0. We use the fact that the exponential function is convex: its graph lies below any of its chords (indeed, its second derivative is again the exponential function, thus always positive). Hence, for any a and b, with p, q as above,
\[ e^{a/p + b/q} \le \tfrac{1}{p} e^{a} + \tfrac{1}{q} e^{b} \]
(in fact, this is the simplest case of the Jensen inequality, see 6.D.18). Now, for those k with x_k y_k ≠ 0, we define the numbers v_k and w_k so that
\[ x_k = X e^{v_k/p}, \qquad y_k = Y e^{w_k/q}. \]
Then e^{v_k/p + w_k/q} ≤ (1/p)e^{v_k} + (1/q)e^{w_k}, and by substitution it follows immediately that
\[ x_k y_k \le XY \Bigl( \frac{1}{p} \Bigl(\frac{x_k}{X}\Bigr)^{p} + \frac{1}{q} \Bigl(\frac{y_k}{Y}\Bigr)^{q} \Bigr). \]
Summing over k = 1, …, n gives (notice that adding the terms with x_k y_k = 0 does not spoil the inequality)
\[ \frac{1}{XY} \sum_{i=1}^{n} x_i y_i \le \frac{1}{pX^p} \sum_{i=1}^{n} x_i^p + \frac{1}{qY^q} \sum_{i=1}^{n} y_i^q = \frac{1}{p} + \frac{1}{q} = 1. \]
Multiplying this inequality by XY finishes the proof. □

Now we can prove that ∥·∥_p is indeed a norm:

Minkowski inequality

For every p > 1 and all n–tuples of non-negative real numbers (x₁, …, xₙ) and (y₁, …, yₙ),
\[ \Bigl(\sum_{i=1}^{n} (x_i+y_i)^p\Bigr)^{1/p} \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} + \Bigl(\sum_{i=1}^{n} y_i^p\Bigr)^{1/p}. \]

It follows that x belongs to the union on the right-hand side of (1). By symmetry, the same holds if x ∉ X and x ∈ Z. Since these are all the possibilities for x to belong to the left-hand side of (1), the inclusion (1) is established. But then
\[ d_1(X,Z) = |(X\cup Z)\setminus(X\cap Z)| \le \bigl| \bigl((X\cup Y)\setminus(X\cap Y)\bigr) \cup \bigl((Y\cup Z)\setminus(Y\cap Z)\bigr) \bigr| \le |(X\cup Y)\setminus(X\cap Y)| + |(Y\cup Z)\setminus(Y\cap Z)| = d_1(X,Y) + d_1(Y,Z). \]

The case (b). Proceed similarly to the case of d₁. Denote by X′ the complement of a set X. The equalities
\[ (X\cup Y)\setminus(X\cap Y) = (X\cap Y'\cap Z)\cup(X\cap Y'\cap Z')\cup(X'\cap Y\cap Z)\cup(X'\cap Y\cap Z'), \]
\[ (Y\cup Z)\setminus(Y\cap Z) = (X\cap Y\cap Z')\cup(X\cap Y'\cap Z)\cup(X'\cap Y\cap Z')\cup(X'\cap Y'\cap Z), \]
\[ \bigl((X\cup Z)\setminus(X\cap Z)\bigr)\cup\bigl(Y\setminus(X\cup Z)\bigr) = (X\cap Y\cap Z')\cup(X\cap Y'\cap Z')\cup(X'\cap Y\cap Z)\cup(X'\cap Y'\cap Z)\cup(X'\cap Y\cap Z'), \]
which, again, can be proved by checking the several possibilities, imply a stronger form of (1), namely
\[ \bigl((X\cup Z)\setminus(X\cap Z)\bigr)\cup\bigl(Y\setminus(X\cup Z)\bigr) \subseteq \bigl((X\cup Y)\setminus(X\cap Y)\bigr)\cup\bigl((Y\cup Z)\setminus(Y\cap Z)\bigr). \]
Further, we invoke the inequality
\[ \frac{|(X\cup Z)\setminus(X\cap Z)|}{|X\cup Z|} \le \frac{\bigl|\bigl((X\cup Z)\setminus(X\cap Z)\bigr)\cup\bigl(Y\setminus(X\cup Z)\bigr)\bigr|}{\bigl|X\cup Z\cup\bigl(Y\setminus(X\cup Z)\bigr)\bigr|}, \qquad X\cup Z \neq \emptyset, \]
which involves only non-negative numbers and rests on the general fact that
\[ \frac{x}{z} \le \frac{x+y}{z+y}, \qquad y \ge 0,\ z > 0,\ x \in [0,z]. \]
Since X ∪ Z ∪ (Y ∖ (X ∪ Z)) = X ∪ Y ∪ Z, we obtain
\[ d_2(X,Z) = \frac{|(X\cup Z)\setminus(X\cap Z)|}{|X\cup Z|} \le \frac{\bigl|\bigl((X\cup Z)\setminus(X\cap Z)\bigr)\cup\bigl(Y\setminus(X\cup Z)\bigr)\bigr|}{|X\cup Y\cup Z|} \le \frac{\bigl|\bigl((X\cup Y)\setminus(X\cap Y)\bigr)\cup\bigl((Y\cup Z)\setminus(Y\cap Z)\bigr)\bigr|}{|X\cup Y\cup Z|} \le \frac{|(X\cup Y)\setminus(X\cap Y)| + |(Y\cup Z)\setminus(Y\cap Z)|}{|X\cup Y\cup Z|} \le \frac{|(X\cup Y)\setminus(X\cap Y)|}{|X\cup Y|} + \frac{|(Y\cup Z)\setminus(Y\cap Z)|}{|Y\cup Z|} = d_2(X,Y) + d_2(Y,Z) \]
whenever X ∪ Z ≠ ∅ and Y ≠ ∅. For X = Z = ∅ or Y = ∅, the triangle inequality clearly still holds. Therefore, both mappings are metrics. The metric d₁ is quite elementary, but the metric d₂ has wider applications; in the literature it is known as the Jaccard metric.⁴ □

⁴ It is named after the biologist Paul Jaccard, who described this measure of similarity of insect populations, using the function 1 − d₂, in 1908.
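The combinatorial bookkeeping above is easy to get wrong, so a randomized check is reassuring. The sketch below tests the triangle inequality for both d₁ and d₂ on random subsets of a small ground set, using exact Fractions to avoid rounding; it is a test, not a proof:

import random
from fractions import Fraction

def d1(X, Y):
    """Size of the symmetric difference."""
    return len(X ^ Y)

def d2(X, Y):
    """Relative symmetric difference (the Jaccard metric)."""
    return Fraction(len(X ^ Y), len(X | Y)) if X | Y else Fraction(0)

random.seed(1)
ground = range(8)
def random_subset():
    return {i for i in ground if random.random() < 0.5}

for _ in range(10_000):
    X, Y, Z = random_subset(), random_subset(), random_subset()
    assert d1(X, Z) <= d1(X, Y) + d1(Y, Z)
    assert d2(X, Z) <= d2(X, Y) + d2(Y, Z)
print("triangle inequalities hold on 10000 random triples")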
To verify the Minkowski inequality, we can use the following observation:
\[ \sum_{i=1}^{n} (x_i+y_i)^p = \sum_{i=1}^{n} x_i (x_i+y_i)^{p-1} + \sum_{i=1}^{n} y_i (x_i+y_i)^{p-1}, \]
and by Hölder's inequality (recall p > 1),
\[ \sum_{i=1}^{n} x_i (x_i+y_i)^{p-1} \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} \Bigl(\sum_{i=1}^{n} (x_i+y_i)^{(p-1)q}\Bigr)^{1/q}, \]
\[ \sum_{i=1}^{n} y_i (x_i+y_i)^{p-1} \le \Bigl(\sum_{i=1}^{n} y_i^p\Bigr)^{1/p} \Bigl(\sum_{i=1}^{n} (x_i+y_i)^{(p-1)q}\Bigr)^{1/q}. \]
Adding the last two inequalities, and taking into account that p + q = pq, and so (p−1)q = pq − q = p, we arrive at
\[ \frac{\sum_{i=1}^{n} (x_i+y_i)^p}{\bigl(\sum_{i=1}^{n} (x_i+y_i)^p\bigr)^{1/q}} \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} + \Bigl(\sum_{i=1}^{n} y_i^p\Bigr)^{1/p}, \]
that is,
\[ \Bigl(\sum_{i=1}^{n} (x_i+y_i)^p\Bigr)^{1-1/q} \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} + \Bigl(\sum_{i=1}^{n} y_i^p\Bigr)^{1/p}, \]
which, since 1 − 1/q = 1/p, is the Minkowski inequality we wanted to prove.

Thus we have verified that on every finite-dimensional real or complex vector space there is a class of norms ∥·∥_p for all p > 1; the case p = 1 was considered earlier. We can also allow p = ∞ by setting
\[ \|z\|_\infty = \max\{ |z_i|;\ i = 1, \dots, n \}, \]
which is obviously a norm. Notice that, in the context of these norms, Hölder's inequality can be written for all x = (x₁, …, xₙ), y = (y₁, …, yₙ) as
\[ \sum_{i=1}^{n} |x_i|\,|y_i| \le \|x\|_p \cdot \|y\|_q \]
for all p ≥ 1 and q satisfying 1/p + 1/q = 1, where for p = 1 we set q = ∞.

7.3.5. Lp–norms for sequences and functions. Now we can easily define the Lp-norms on suitable infinite-dimensional vector spaces as well. We begin with sequences. The vector space ℓ_p, p ≥ 1, is the set of all sequences of real or complex numbers x₀, x₁, … such that
\[ \sum_{i=0}^{\infty} |x_i|^p < \infty. \]
If x = (x₀, x₁, …) ∈ ℓ_p, p ≥ 1, then the norm is given by
\[ \|x\|_p = \Bigl(\sum_{i=0}^{\infty} |x_i|^p\Bigr)^{1/p}. \]

7.C.8. Let
\[ d(x,y) := \frac{|x-y|}{1+|x-y|}, \qquad x, y \in \mathbb{R}. \]
Prove that d is a metric on R.

Solution. We prove the triangle inequality only (the rest is clear). Introduce the auxiliary function
\[ (1)\quad f(t) := \frac{t}{1+t}, \qquad t \ge 0. \]
Note that
\[ f(s) - f(r) = \frac{s}{1+s} - \frac{r}{1+r} = \frac{s-r}{(1+s)(1+r)} > 0 \quad \text{whenever } s > r \ge 0. \]
It follows that f is increasing, a fact which can also be verified by examining the first derivative. Therefore,
\[ d(x,z) = \frac{|x-y+y-z|}{1+|x-y+y-z|} \le \frac{|x-y|+|y-z|}{1+|x-y|+|y-z|} = \frac{|x-y|}{1+|x-y|+|y-z|} + \frac{|y-z|}{1+|x-y|+|y-z|} \le \frac{|x-y|}{1+|x-y|} + \frac{|y-z|}{1+|y-z|} = d(x,y) + d(y,z) \]
for all x, y, z ∈ R. □

The metrics in the next problems are defined by norms on vector spaces of functions; see the definitions and discussion in 7.3.1.

7.C.9. Determine the distance between the functions
\[ f(x) = x, \qquad g(x) = -\frac{x}{\sqrt{1+x^2}}, \qquad x \in [1,2], \]
as elements of the normed vector space S⁰[1,2] of (piecewise) continuous functions on the interval [1,2] with the norm
(a) ∥f∥₁ = ∫₁² |f(x)| dx;
(b) ∥f∥_∞ = max{|f(x)|; x ∈ [1,2]}.

Solution. The case (a). We need only compute the norm of the difference of the functions:
\[ \int_1^2 |f(x)-g(x)|\,dx = \int_1^2 \Bigl( x + \frac{x}{\sqrt{1+x^2}} \Bigr) dx = \Bigl[ \frac{x^2}{2} + \sqrt{1+x^2} \Bigr]_1^2 = \frac{3}{2} + \sqrt{5} - \sqrt{2}. \]
The case (b). It is necessary to compute
\[ \max_{x\in[1,2]} |f(x)-g(x)| = \max_{x\in[1,2]} \Bigl( x + \frac{x}{\sqrt{1+x^2}} \Bigr). \]
Since
\[ \Bigl( x + \frac{x}{\sqrt{1+x^2}} \Bigr)' = 1 + \frac{1}{\bigl(\sqrt{1+x^2}\bigr)^3} > 0, \qquad x \in [1,2], \]
the function f − g is increasing, and so it attains its maximum at the right end point of the interval, x = 2. Hence
\[ \max_{x\in[1,2]} \Bigl( x + \frac{x}{\sqrt{1+x^2}} \Bigr) = 2 + \frac{2}{\sqrt{1+2^2}} = 2 + \frac{2}{\sqrt{5}}. \]
□

That ∥x∥_p is a norm on ℓ_p follows immediately from the Minkowski inequality by letting n → ∞. The vector space ℓ_∞ is the set of all bounded sequences of real or complex numbers x₀, x₁, …. If x = (x₀, x₁, …) ∈ ℓ_∞, then its norm is given by
\[ \|x\|_\infty = \sup\{ |x_i|;\ i = 0, 1, 2, \dots \}. \]
It is easily checked that this is indeed a norm.

Finally, we return to the space of functions S⁰[a,b] on a finite interval [a,b], or S⁰_c on an unbounded interval. We have already met the L₁ norm ∥·∥₁. For every p > 1, the Riemann integrals ∫_a^b |f(x)|^p dx surely exist for all functions in such a space, so we can define
\[ \|f\|_p = \Bigl( \int_a^b |f(x)|^p\,dx \Bigr)^{1/p}. \]
The Riemann integral was defined in terms of limits, using the Riemann sums which correspond to splittings Ξ with representatives ξᵢ; in our case, these are the finite sums
\[ S_{\Xi,\xi} = \sum_{i=1}^{n} |f(\xi_i)|^p (x_i - x_{i-1}). \]
Hölder's inequality applied to the Riemann sums of a product |f(x)g(x)| of two functions f and g gives
\[ \sum_{i=1}^{n} |f(\xi_i)|\,|g(\xi_i)|\,(x_i - x_{i-1}) = \sum_{i=1}^{n} |f(\xi_i)| (x_i - x_{i-1})^{1/p} \cdot |g(\xi_i)| (x_i - x_{i-1})^{1/q} \le \Bigl( \sum_{i=1}^{n} |f(\xi_i)|^p (x_i-x_{i-1}) \Bigr)^{1/p} \Bigl( \sum_{i=1}^{n} |g(\xi_i)|^q (x_i-x_{i-1}) \Bigr)^{1/q}, \]
where on the right-hand side there is the product of the Riemann sums for the integrals defining ∥f∥_p and ∥g∥_q. Passing to the limit, we thus verify Hölder's inequality for integrals,
\[ \int_a^b f(x) g(x)\,dx \le \Bigl( \int_a^b f(x)^p dx \Bigr)^{1/p} \Bigl( \int_a^b g(x)^q dx \Bigr)^{1/q}, \]
which is valid for all non-negative real-valued functions f and g in our space of piecewise continuous functions with compact support. In just the same way as in the previous paragraph, we can derive the integral form of the Minkowski inequality from Hölder's inequality:
\[ \|f+g\|_p \le \|f\|_p + \|g\|_p. \]
Thus ∥·∥_p is indeed a norm on the vector space of all continuous functions having compact support, for all p > 1 (we verified this for p = 1 long ago).

The L₁ and L₂ distances, discussed at the beginning of this chapter (cf. 7.1.2), reflect the basic intuition about the distance between graphs of functions. In practice, however, we often need more subtle concepts of distance. The most obvious refinement is to include the derivatives in the same way as the values of the functions.

7.C.10. Consider the space S¹[a,b] of piecewise differentiable (real or complex) functions on the interval [a,b] and show that the formula
\[ \|f\| = \Bigl( \int_a^b |f(x)|^2 + \alpha^2 |f'(x)|^2 \, dx \Bigr)^{1/2}, \]
with any real α ≥ 0, defines a norm on this vector space (up to the identification of functions differing only at points of discontinuity). Compute the distance between the functions f(x) = sin x + 0.1 sin²(6x) − 0.03 sin(60x) and g(x) = sin x on the interval [−π, π] in this norm, and explain its dependence on α.

Solution. The formula
\[ \langle f, g \rangle = \int_a^b f(x)\overline{g(x)} + \alpha^2 f'(x)\overline{g'(x)} \, dx \]
defines an inner product on S¹[a,b]: it is linear in the first argument f, it yields the complex conjugate value when the arguments are exchanged, and it is clearly positive if f = g is non-zero on some interval (we ignore the values at the points of discontinuity, cf. the discussion in 7.1.2). Thus the corresponding quadratic form defines a norm on the complex vector space S¹[a,b]. The distance in this norm is easily computed:
\[ \|f-g\| = \sqrt{0.02639 + 11.3097\,\alpha^2}. \]
Its dependence on α can be seen in the illustration: the values of the function f are nearly equal to sin x, but the very wiggly difference is well apparent in the derivatives. If α = 0, we obtain the usual L₂ distance 0.162; if α = 1, the distance is 3.367.⁵ □

⁵ This is an illustration of the very important concept of Sobolev spaces, in which any number of derivatives can be involved. Moreover, Lp with any p ≥ 1 can be used in the definition of the norm instead of p = 2. There is much literature on this subject.
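The two constants in 7.C.10 are easy to confirm numerically. The following sketch evaluates ∫(f−g)² dx and ∫(f′−g′)² dx on a fine grid (numpy is assumed; the derivative is written in closed form), so the squared distance is the first number plus α² times the second:

import numpy as np

# fine uniform grid on [-pi, pi]; the integrand oscillates at frequency 60
x, dx = np.linspace(-np.pi, np.pi, 400_001, retstep=True)

h  = 0.1 * np.sin(6 * x)**2 - 0.03 * np.sin(60 * x)              # f - g
dh = 1.2 * np.sin(6 * x) * np.cos(6 * x) - 1.8 * np.cos(60 * x)  # (f - g)'

a = np.sum(h**2) * dx    # ~ 0.02639
b = np.sum(dh**2) * dx   # ~ 11.3097
for alpha in (0.0, 1.0):
    print(alpha, np.sqrt(a + alpha**2 * b))   # ~ 0.1624 and ~ 3.3669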
We use the word "norm" on the entire space S⁰[a,b] of piecewise continuous functions in this context; however, we should bear in mind that we have to identify those functions which differ only by their values at points of discontinuity. Among these norms, the case p = 2 is special because of the existence of the inner product; in that case, we could have derived the triangle inequality much more easily using the Schwarz inequality.

For the functions from S⁰[a,b], we can also define an analogue of the L∞–norm on n–dimensional vectors. Since our functions are piecewise continuous, the suprema of their absolute values on a finite closed interval always exist, so we can set
\[ \|f\|_\infty = \sup\{ |f(x)|;\ x \in [a,b] \}. \]
If we consider both the one-sided limits (which always exist by our definition) and the value of the function itself as values of f at a point of discontinuity, we can work with maxima instead of suprema. It is apparent again that this is a norm (except for the problems with values at discontinuity points).

7.3.6. Completion of metric spaces. Both the real numbers R and the complex numbers C form (with the metric given by the absolute value) a complete metric space. This is contained in the axiom of the existence of suprema. Recall that the real numbers were created as a "completion" of the space of rational numbers, which is not complete; evidently, the closure of the set Q ⊂ R is R.

Dense and nowhere-dense subsets

We say that a subset A ⊂ X of a metric space X is dense if and only if the closure of A is the whole space X. A set A is said to be nowhere dense in X if and only if the set X ∖ Ā is dense.

Evidently, A is dense in X if and only if every non-empty open set in X has a non-empty intersection with A. In all the cases of norms on functions from the previous paragraph, the resulting metric spaces are not complete, since the limit of a Cauchy sequence of functions from our vector space S⁰[a,b] may be a function which no longer belongs to this space. Consider, on the interval [0,1], the functions fₙ which vanish on [0, 1/n) and equal sin(1/x) on [1/n, 1]. They converge to the function sin(1/x) in all the Lp norms, but this function does not lie in the space.

Completion of a metric space

Let X be a metric space with metric d which is not complete. A metric space X̃ with metric d̃ such that X ⊂ X̃, d is the restriction of d̃ to the subset X, and the closure X̄ in X̃ is the whole space X̃, is called a completion of the metric space X.

Now we move to more theoretical considerations. Though these exercises may not look particularly practical, they should help in understanding the basic concepts of metric spaces and convergence, as well as their links to the topological concepts.
7.C.11. Show that the definition of a metric as a function d defined on X × X, for a non-empty set X, satisfying
(1) d(x, y) = 0 if and only if x = y, for x, y ∈ X,
(2) d(x, z) ≤ d(y, x) + d(y, z), for x, y, z ∈ X,
is equivalent to the definition given in the theoretical part, in paragraph 7.3.1.

Solution. At first glance, this definition seems to demand fewer requirements of the metric than the definition from the theoretical part. The two definitions are equivalent if and only if the two conditions of non-degeneracy and triangle inequality imply
(3) d(x, y) ≥ 0, for x, y ∈ X,
(4) d(x, y) = d(y, x), for x, y ∈ X.
Setting x = z in (2) and using (1), we get 0 = d(x, x) ≤ 2d(y, x), i.e. the non-negativity (3). Similarly, the choice y = z in (2) together with (1) implies that d(x, y) ≤ d(y, x) for all points x, y ∈ X; interchanging the variables x and y then gives d(y, x) ≤ d(x, y), i.e. (4). Thus it is proved that the definitions are equivalent. □

7.C.12. Describe all sequences in a discrete metric space X which are convergent or Cauchy.

Solution. Since the distance between two points x, y ∈ X is either 1 or zero, a sequence x₁, x₂, … is Cauchy if and only if all its terms xᵢ are equal, except for finitely many of them. But then, the sequence is convergent. □

This problem shows a behaviour quite different from the convergence of sequences in the metric spaces X = R or X = C; sequences of integers, however, would behave in a very similar way. On the other hand, we deal mostly with metrics on spaces of functions, where the intuition gained on the real line R may be useful.

7.C.13. Determine whether or not the sequence {xₙ}, where
\[ x_1 = 1, \qquad x_n = 1 + \frac{1}{2} + \dots + \frac{1}{n}, \quad n \in \mathbb{N} \smallsetminus \{1\}, \]
is a Cauchy sequence in R with the standard metric.

Solution. Recall that
\[ (1)\quad \sum_{k=1}^{\infty} \frac{1}{k} = \infty, \quad \text{and hence} \quad \sum_{k=m}^{\infty} \frac{1}{k} = \infty \ \text{ for every } m \in \mathbb{N}. \]
Therefore,
\[ \lim_{n\to\infty} |x_n - x_m| = \sum_{k=m+1}^{\infty} \frac{1}{k} = \infty, \qquad m \in \mathbb{N}. \]

The following theorem says that the completion of an arbitrary (incomplete) metric space X can be found in essentially the same way as the real numbers were created from the rationals. (Actually, the reader might read the detailed proof below with this completion of the rationals in mind; it verifies that the construction leads to R with the standard complete metric, thus satisfying the axioms of the reals.)

7.3.7. Theorem. Let X be a metric space with metric d which is not complete. Then there exists a completion X̃ of X.

Proof. The idea of the construction is identical to the one used when building the real numbers. Two Cauchy sequences xᵢ and yᵢ of points of X are considered equivalent if and only if d(xᵢ, yᵢ) converges to zero as i approaches infinity. This is convergence of real numbers, so the definition is correct. From the properties of convergence of real numbers, it is clear that the relation defined above is an equivalence; for instance, the transitivity follows from the fact that the sum of two sequences converging to zero converges to zero as well. The reader is advised to verify the details.

We define X̃ as the set of the classes of this equivalence of Cauchy sequences. The original points x ∈ X are identified with the classes of sequences equivalent to the constant sequence xᵢ = x, i = 0, 1, …. It is now easy to define the metric d̃: we put
\[ \tilde{d}(\tilde{x}, \tilde{y}) = \lim_{i\to\infty} d(x_i, y_i) \]
for sequences x̃ = {x₀, x₁, …} and ỹ = {y₀, y₁, …}. First, we have to verify that this limit exists at all and is finite. Notice the following consequence of the triangle inequality.
For x, y, and z ∈ X we have d(x, y) ≤ d(x, z) + d(y, z), and thus d(x, y) − d(x, z) ≤ d(y, z). Swapping y and z, we obtain d(x, z) − d(x, y) ≤ d(y, z). Thus,
\[ (1)\quad |d(x,y) - d(x,z)| \le d(y,z) \]
(draw a picture!). Now, the fact that both the sequences x̃ and ỹ are Cauchy sequences implies that the considered sequence of distances is also a Cauchy sequence of real numbers:
\[ |d(x_i,y_i) - d(x_j,y_j)| \le |d(x_i,y_i) - d(x_i,y_j)| + |d(x_i,y_j) - d(x_j,y_j)| \le d(y_i,y_j) + d(x_i,x_j). \]
Thus, the limit of d(xᵢ, yᵢ) exists. If we select different representatives x̃ = {x′₀, x′₁, …} and ỹ = {y′₀, y′₁, …}, then by a similar argument as above,
\[ |d(x'_i,y'_i) - d(x_i,y_i)| \le |d(x'_i,y'_i) - d(x'_i,y_i)| + |d(x'_i,y_i) - d(x_i,y_i)| \le d(x_i,x'_i) + d(y_i,y'_i). \]
Therefore, the definition is indeed independent of the choice of representatives.

We verify that d̃ is a metric on X̃. The first and second properties are clear, so it remains to prove the triangle inequality. For that purpose, choose three Cauchy representatives of the elements x̃, ỹ, z̃, and obtain
\[ \tilde{d}(\tilde{x},\tilde{z}) = \lim_{i\to\infty} d(x_i,z_i) \le \lim_{i\to\infty} d(x_i,y_i) + \lim_{i\to\infty} d(y_i,z_i) = \tilde{d}(\tilde{x},\tilde{y}) + \tilde{d}(\tilde{y},\tilde{z}). \]
The restriction of the metric d̃ just defined to the original space X is identical to the original metric, because the original points are represented by constant sequences.

Hence the sequence {xₙ} of 7.C.13 is not a Cauchy sequence. Alternatively: if {xₙ} were a Cauchy sequence, it would be convergent in the complete metric space R, which contradicts the divergence shown in (1). □

7.C.14. Repeat the question from the previous problem for the metric d given by (cf. 7.C.8)
\[ d(x,y) := \frac{|x-y|}{1+|x-y|}, \qquad x, y \in \mathbb{R}. \]

Solution. Instead of repeating the arguments, we point out the difference between the given metric and the standard one. The difference is expressed by the function f introduced in (1) of 7.C.8. This is a continuous function and, moreover, a bijection between the sets [0, ∞) and [0, 1) with f(0) = 0. The properties of being Cauchy or convergent are defined in terms of the real numbers describing the distances between the elements of the sequence; since f and its inverse are continuous at zero, these distances tend to zero in the new metric if and only if they do so in the standard one. Hence the solution for the new metric is the same as with the standard one. □

7.C.15. Determine whether or not the metric space C[−1,1] of continuous functions on the interval [−1,1], with the metric given by the norm
(a) ∥f∥_p = (∫_{−1}^{1} |f(x)|^p dx)^{1/p} for p ≥ 1;
(b) ∥f∥_∞ = max{|f(x)|; x ∈ [−1,1]},
is complete.

Solution. The case (a). For every n ∈ N, define a function
\[ f_n(x) = \begin{cases} 0, & x \in [-1, 0), \\ nx, & x \in [0, \tfrac{1}{n}], \\ 1, & x \in (\tfrac{1}{n}, 1]. \end{cases} \]
For all m ≥ n, m, n ∈ N, we have
\[ \Bigl( \int_{-1}^{1} |f_m(x) - f_n(x)|^p \, dx \Bigr)^{1/p} < \Bigl( \int_0^{1/n} 1 \, dx \Bigr)^{1/p} = \Bigl( \frac{1}{n} \Bigr)^{1/p}. \]
It follows that the sequence {fₙ} ⊂ C[−1,1] is a Cauchy sequence. Suppose the sequence {fₙ} had a ∥·∥_p limit f in C[−1,1]. We show that this limit could not be continuous at x = 0. For every ε ∈ (0,1), there exists an n(ε) ∈ N such that fₙ(x) = 0 for x ∈ [−1,0] and fₙ(x) = 1 for x ∈ [ε,1], for all n ≥ n(ε). Imagine that f(y) ≠ 1 at some y ≥ ε; since f is continuous, it would then differ from 1 on a whole interval containing y, and so ∥f − fₙ∥_p ≥ δ for some δ > 0 and all n ≥ n(ε) — a contradiction. Therefore f must satisfy f(x) = 1 for x ∈ [ε,1] with ε > 0 arbitrarily small, and similarly f(x) = 0 for x ∈ [−1,0]. Thus, necessarily,
\[ f(x) = 0,\ x \in [-1,0], \qquad f(x) = 1,\ x \in (0,1]. \]
But this function is not continuous on [−1,1], so it does not belong to the considered metric space. Therefore, the sequence {fₙ} does not have a limit in C[−1,1], and this space is not complete.
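To see the failure of completeness in case (a) concretely, the sketch below approximates ∥f_m − f_n∥_p for the ramp functions above and watches the distances shrink while the pointwise limit develops a jump at 0. Grid-based integration and numpy are assumed; this illustrates the computation, it proves nothing:

import numpy as np

def f(n, x):
    """The ramp function: 0 on [-1,0), nx on [0,1/n], 1 on (1/n,1]."""
    return np.clip(n * x, 0.0, 1.0)

x, dx = np.linspace(-1.0, 1.0, 200_001, retstep=True)
p = 2.0
for n in (10, 100, 1000):
    m = 2 * n
    dist = (np.sum(np.abs(f(m, x) - f(n, x))**p) * dx)**(1 / p)
    print(n, dist)   # decreases roughly like (1/n)**(1/p)

# the pointwise limit is the step function, which is not continuous:
print(f(10**9, np.array([-0.5, 0.0, 1e-6, 0.5])))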
It is required to prove that X is dense in X̃. Let x̃ = {xᵢ} be a fixed Cauchy sequence, and let ε > 0 be given. Since the sequence {xᵢ} is a Cauchy sequence, all pairs of its terms xₙ, xₘ with sufficiently large indices are closer to each other than ε. Then the choice y = xₙ, for one of those indices, necessarily implies that the elements y and xₘ are closer together than ε for all large m, and so d̃(ỹ, x̃) ≤ ε. Hence there is an element y of the original space whose distance from the chosen class x̃ does not exceed ε. This establishes the denseness of X.

It remains to prove that the constructed metric space is complete, that is, that every Cauchy sequence of points of the extended space X̃ with respect to the metric d̃ converges to a point of X̃. This can be done by approximating the points of a Cauchy sequence x̃_k by points z_k from the original space X, so that the resulting sequence z̃ = {z₁, z₂, …} is the limit of the original sequence with respect to the metric d̃. Since X is a dense subset of X̃, we can choose, for every element x̃_k of our fixed sequence, an element z_k ∈ X such that the constant sequence z̃_k satisfies d̃(x̃_k, z̃_k) < 1/k. Now consider the sequence z̃ = {z₁, z₂, …}. The original sequence is Cauchy, so for a fixed real number ε > 0 there is an index n(ε) such that d̃(x̃ₙ, x̃ₘ) < ε/2 whenever both m and n are greater than n(ε). Without loss of generality, the index n(ε) is greater than or equal to 4/ε. Now, for m and n greater than n(ε), we get
\[ d(z_m, z_n) = \tilde{d}(\tilde{z}_m, \tilde{z}_n) \le \tilde{d}(\tilde{z}_m, \tilde{x}_m) + \tilde{d}(\tilde{x}_m, \tilde{x}_n) + \tilde{d}(\tilde{x}_n, \tilde{z}_n) \le \frac{1}{m} + \frac{\varepsilon}{2} + \frac{1}{n} \le 2\cdot\frac{\varepsilon}{4} + \frac{\varepsilon}{2} = \varepsilon. \]
Hence {z_k} is a Cauchy sequence of elements of X, and so z̃ ∈ X̃. From the triangle inequality,
\[ \tilde{d}(\tilde{z}, \tilde{x}_n) \le \tilde{d}(\tilde{z}, \tilde{z}_n) + \tilde{d}(\tilde{z}_n, \tilde{x}_n). \]
By the previous bounds, both terms on the right-hand side converge to zero. Hence the distances d̃(x̃ₙ, z̃) approach zero, thereby finishing the proof. □

7.3.8. Uniqueness. We now consider the uniqueness of the completion of metric spaces.

The case (b) of 7.C.15. Let an arbitrary Cauchy sequence {fₙ} ⊂ C[−1,1] be given. The terms of this sequence are continuous functions fₙ on [−1,1] with the property that for every ε > 0 there is an n(ε) ∈ N such that
\[ (1)\quad \max_{x\in[-1,1]} |f_m(x) - f_n(x)| < \frac{\varepsilon}{2}, \qquad m, n \ge n(\varepsilon). \]
In particular, for every x ∈ [−1,1] we get a Cauchy sequence {fₙ(x)} ⊂ R of numbers. Since the metric space R with the usual metric is complete, every such sequence {fₙ(x)} is convergent. Set
\[ f(x) := \lim_{n\to\infty} f_n(x), \qquad x \in [-1,1]. \]
Letting m → ∞ in (1), we obtain
\[ \max_{x\in[-1,1]} |f(x) - f_n(x)| \le \frac{\varepsilon}{2} < \varepsilon, \qquad n \ge n(\varepsilon). \]
It follows that the sequence {fₙ} converges uniformly (that is, with respect to the given norm) to the function f on [−1,1]. Since a uniform limit of continuous functions is continuous (see 6.3.4), f ∈ C[−1,1]. Therefore, the metric space is complete. □

The same reasoning as above, and hence the same results, apply to the more general metric space C[a,b] of continuous functions on any closed bounded interval [a,b], and to the space C_c of continuous functions with compact support.

7.C.16. Prove that the metric space ℓ₂ is complete.

Solution. Recall that ℓ₂ is the space of sequences of real numbers with the L₂-norm, see 7.3.5.
Consider an arbitrary Cauchy sequence {xₙ} in the space ℓ₂. Every term of this sequence is again a sequence, xₙ = {xₙᵏ}_{k∈N}, n ∈ N. (The exact range of the indices does not matter: whether n, k ∈ N or n, k ∈ N ∪ {0} makes no difference.) For each fixed k ∈ N, introduce the auxiliary sequence y_k = {xₙᵏ}_{n∈N} of k-th components. If {xₙ} is a Cauchy sequence in ℓ₂, then each of the sequences y_k is a Cauchy sequence of real numbers. It follows from the completeness of R (with respect to the usual metric) that all of the sequences y_k are convergent. Denote their limits by z_k, k ∈ N. It suffices to prove that z = {z_k} ∈ ℓ₂ and that the sequence {xₙ} converges to z in ℓ₂ as n → ∞.

The sequence {xₙ} ⊂ ℓ₂ is a Cauchy sequence; therefore, for every ε > 0, there is an n(ε) ∈ N with the property that
\[ \sum_{k=1}^{\infty} (x_m^k - x_n^k)^2 < \varepsilon, \qquad m, n \ge n(\varepsilon). \]
In particular,
\[ \sum_{k=1}^{l} (x_m^k - x_n^k)^2 < \varepsilon, \qquad m, n \ge n(\varepsilon),\ l \in \mathbb{N}. \]

Isometries

A mapping φ : X₁ → X₂ between metric spaces with metrics d₁ and d₂, respectively, is called an isometry if and only if all elements x, y ∈ X₁ satisfy d₂(φ(x), φ(y)) = d₁(x, y).

Of course, every isometry is a bijection onto its image (this follows from the property that the distance between distinct elements is non-zero), and the corresponding inverse mapping is an isometry as well. Now, consider two inclusions of a dense subset, ι₁ : X → X̃₁ and ι₂ : X → X̃₂, into two completions of the space X, and denote the corresponding metrics by d, d₁, and d₂, respectively. The mapping
\[ \varphi = \iota_2 \circ \iota_1^{-1} : \iota_1(X) \to \tilde{X}_2 \]
is well-defined on the dense subset ι₁(X) ⊂ X̃₁. Its image is the dense subset ι₂(X) ⊂ X̃₂ and, moreover, this mapping is clearly an isometry. The dual mapping ι₁ ∘ ι₂⁻¹ works in the same way. Every isometric mapping maps, of course, Cauchy sequences to Cauchy sequences; and two such Cauchy sequences converge to the same element in the completion if and only if this holds for their images under the isometry φ. Thus, if such a mapping φ is defined on a dense subset X of a metric space X̃₁, then it has a unique extension to the whole of X̃₁, with values lying in the closure of the image φ(X), i.e. in X̃₂. By the previous ideas, there is a unique extension of φ to a mapping φ̃ : X̃₁ → X̃₂ which is both a bijection and an isometry. Thus, the completions X̃₁ and X̃₂ are indeed identical in this sense. Thus it is proved:

Theorem. Let X be a metric space with metric d which is not complete. Then the completion X̃ of X with metric d̃ is unique up to bijective isometries.

In the following three paragraphs, we introduce three theorems about complete metric spaces. They are highly applicable both in mathematical analysis and in verifying the convergence of numerical methods.

7.3.9. Banach's contraction principle. A mapping F : X → X on a metric space X with metric d is called a contraction mapping if and only if there is a real constant 0 ≤ C < 1 such that for all elements x, y ∈ X,
\[ d(F(x), F(y)) \le C\, d(x,y). \]

Theorem. If F is a contraction mapping on a complete metric space X, then it has a fixed point, i.e., there is a z ∈ X such that F(z) = z.

Proof. The proof naturally follows the intuitive idea that the iterative application of a contraction mapping, starting from an initial value z₀ ∈ X, should "accumulate" at some point.
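The iteration in this proof is exactly what one implements in practice. Below is a minimal Python sketch for the contraction F(x) = cos x on the complete space [0, 1] ⊂ R (there |F′(x)| = |sin x| ≤ sin 1 ≈ 0.84 < 1, so C = sin 1 works); successive distances d(z_{i+1}, z_i) shrink geometrically, as in the estimate derived below:

import math

def F(x):
    """A contraction on [0, 1]: |F'(x)| = |sin x| <= sin(1) < 1 there."""
    return math.cos(x)

z = 0.0
prev_step = None
for i in range(25):
    z_next = F(z)
    step = abs(z_next - z)
    if prev_step:
        # the ratio of successive steps stays below C = sin(1) ~ 0.8415
        print(i, z_next, step / prev_step)
    prev_step, z = step, z_next

print("fixed point ~", z, "residual", abs(F(z) - z))  # ~ 0.7390851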
Returning to the proof in 7.C.16: letting m → ∞ in the last estimate yields
\[ \sum_{k=1}^{l} (z_k - x_n^k)^2 \le \varepsilon, \qquad n \ge n(\varepsilon),\ l \in \mathbb{N}, \]
i.e. (this time letting l → ∞)
\[ (1)\quad \sum_{k=1}^{\infty} (z_k - x_n^k)^2 \le \varepsilon, \qquad n \ge n(\varepsilon). \]
In particular,
\[ \sum_{k=1}^{\infty} (z_k - x_n^k)^2 < \infty, \qquad n \ge n(\varepsilon), \]
and, at the same time,
\[ \sum_{k=1}^{\infty} (x_n^k)^2 < \infty, \qquad n \in \mathbb{N}, \]
which follows directly from {xₙ} ⊂ ℓ₂. Hence, by the triangle (Minkowski) inequality from 7.3.4, applied to the partial sums and letting the number of terms grow,
\[ \sqrt{\sum_{k=1}^{\infty} z_k^2} \le \sqrt{\sum_{k=1}^{\infty} (z_k - x_n^k)^2} + \sqrt{\sum_{k=1}^{\infty} (x_n^k)^2} < \infty. \]
It is proved that z ∈ ℓ₂. The fact that {xₙ} converges to z in ℓ₂ as n → ∞ follows from (1). □

The next problem addresses the relative power of different metrics on the same space of functions in terms of convergence. We deal with the space S_c of piecewise continuous functions with compact support, equipped with the Lp metrics; we write briefly Lp for these metric spaces. In particular, we show that convergence in Lp for some positive p does not always imply convergence in Lq for another positive q ≠ p.

7.C.17. Let 0 < p < ∞. For each positive integer n, define the function
\[ f_n(x) = \begin{cases} n^{1/p}, & -\tfrac{1}{n} \le x \le \tfrac{1}{n}, \\ 0, & \text{otherwise}. \end{cases} \]
Decide for which q the sequence {fₙ} converges in Lq.

Solution. Let q be any positive real number. Then
\[ \int_{-\infty}^{\infty} |f_n(x)|^q \, dx = \int_{-1/n}^{1/n} n^{q/p} \, dx = \frac{2}{n}\, n^{q/p} = 2\, n^{(q/p)-1}. \]
If 0 < q < p, then ∫ |fₙ(x)|^q dx → 0 as n → ∞; so ∥fₙ∥_q → 0, and the sequence converges to the zero function. Similarly, if 0 < p < q, then ∫ |fₙ(x)|^q dx → ∞ as n → ∞.

The metric space X, of course, needs to be complete; otherwise it could happen that the limit point does not exist in it. Choose an arbitrary z₀ ∈ X and consider the sequence z₀, z₁, z₂, …,
\[ z_1 = F(z_0),\ z_2 = F(z_1),\ \dots,\ z_{i+1} = F(z_i), \dots \]
From the assumptions, we have
\[ d(z_{i+1}, z_i) = d(F(z_i), F(z_{i-1})) \le C\, d(z_i, z_{i-1}) \le \dots \le C^i\, d(z_1, z_0). \]
The triangle inequality then implies that for all natural numbers j,
\[ d(z_{i+j}, z_i) \le \sum_{k=1}^{j} d(z_{i+k}, z_{i+k-1}) \le \sum_{k=1}^{j} C^{i+k-1} d(z_1, z_0) = C^i\, d(z_1,z_0) \sum_{k=1}^{j} C^{k-1} \le \frac{C^i}{1-C}\, d(z_1, z_0). \]
Now, since 0 ≤ C < 1, limₙ→∞ Cⁿ = 0, so for every positive (no matter how small) ε the right-hand expression is surely less than ε for sufficiently large indices i, that is,
\[ d(z_i, z_{i+j}) \le \frac{C^i}{1-C}\, d(z_1, z_0) \le \varepsilon. \]
However, this ensures that the sequence {zᵢ} is a Cauchy sequence. Since X is complete, the sequence has a limit z, and all that remains to be proved is F(z) = z. Every contraction mapping is continuous. Therefore,
\[ F(z) = F\bigl(\lim_{n\to\infty} z_n\bigr) = \lim_{n\to\infty} F(z_n) = \lim_{n\to\infty} z_{n+1} = z. \]
This finishes the proof. □

The next two theorems extend the intuitive understanding of the "density" of closed intervals [a,b] ⊂ R, which do not allow for any "holes". They are essential for the understanding of compactness of metric spaces. In fact, they are both special cases of more general theorems on topological spaces.

7.3.10. Cantor intersection theorem. For any set A in a metric space X with metric d, the real number
\[ \operatorname{diam} A = \sup_{x,y\in A} d(x,y) \]
is called the diameter of the set A. The set A is said to be bounded if and only if diam A < ∞.

Theorem. If A₁ ⊃ A₂ ⊃ ⋯ ⊃ Aᵢ ⊃ ⋯ is a non-increasing sequence of non-empty closed subsets of a complete metric space X and diam Aᵢ → 0, then there is exactly one point x ∈ X belonging to the intersection of all the sets Aᵢ.²

² Georg Cantor is considered the founder of set theory, which he introduced and developed in the last quarter of the 19th century. At that time, the new abstract approach to the foundations of Mathematics met fierce objections; it also led to the severe internal crisis of Mathematics at the beginning of the 20th century. This part of the history of Mathematics is fascinating.

Returning to 7.C.17: in the case 0 < p < q, the norms ∥fₙ∥_q diverge and, in particular, fₙ cannot converge to any limit. Finally, for q = p we have ∫ |fₙ(x)|^p dx = 2 for all positive integers n, and so ∥fₙ∥_p = 2^{1/p} for all n. At the same time, for any g ∈ S_c with g(x) ≠ 0 at some point x ≠ 0 where g is continuous, the distance of g from fₙ cannot converge to zero; so the only candidate for a limit is the zero function, whose distance from fₙ is constantly 2^{1/p}. It follows that fₙ converges to 0 in Lq for 0 < q < p, but it does not converge in Lq with q ≥ p. □
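A quick grid computation makes the three regimes of 7.C.17 visible: with p = 2, the Lq norms of fₙ tend to 0 for q < 2, stay at 2^{1/2} for q = 2, and blow up for q > 2. As before, numpy and a crude Riemann sum are assumed:

import numpy as np

p = 2.0

def lq_norm_fn(n, q, grid_size=2_000_001):
    """Crude Riemann-sum approximation of ||f_n||_q for the box function."""
    x, dx = np.linspace(-1.0, 1.0, grid_size, retstep=True)
    fn = np.where(np.abs(x) <= 1.0 / n, n**(1.0 / p), 0.0)
    return (np.sum(fn**q) * dx)**(1.0 / q)

for q in (1.0, 2.0, 4.0):
    print("q =", q, [round(lq_norm_fn(n, q), 4) for n in (10, 100, 1000)])
# q = 1: decreasing to 0; q = 2: constantly ~ 2**0.5; q = 4: growing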
The next problem deals with the extremely useful Banach fixed point theorem, showing the necessity of all the requirements in Theorem 7.3.9.

7.C.18. Show that the mapping f : [0, ∞) → [0, ∞) given by f(x) = x + e⁻ˣ satisfies, for all x ≠ y, the condition
\[ |f(x) - f(y)| < |x - y|, \]
but it does not have any fixed point, i.e. f(x) ≠ x for every x. (Thus the condition |f(x) − f(y)| ≤ C|x − y|, with a constant C < 1, in the Banach fixed point Theorem 7.3.9 is essential.)

Solution. Clearly, the function f is strictly increasing on the entire domain. Assume y < x. Then e⁻ˣ < e⁻ʸ, and by the mean value theorem applied to e⁻ᵗ (whose derivative is less than 1 in absolute value on (0, ∞)) we get 0 < e⁻ʸ − e⁻ˣ < x − y. Hence
\[ |f(x) - f(y)| = |x - y + e^{-x} - e^{-y}| < |x - y|. \]
Finally, f(x) = x would imply e⁻ˣ = 0, which is impossible. □

When dealing with the convergence of real numbers, we observed that the topological concepts of open neighbourhoods and compactness are most useful. This is of course even more true for metric spaces, where we can work with open balls of radius r, etc. The definitions remain essentially the same.

7.C.19. We have already seen the discrete metric space X ≠ ∅ with the metric d : X × X → R defined by the formula
\[ d(x,y) := \begin{cases} 1, & x \neq y, \\ 0, & x = y. \end{cases} \]
(a) Decide whether (X, d) is complete.
(b) Describe all open, closed, and bounded sets in (X, d).
(c) Describe the interior, boundary, limit, and isolated points of an arbitrary set in (X, d).
(d) Describe all compact sets in (X, d).

Solution. The case (a) was essentially dealt with in 7.C.12. For an arbitrary sequence {xₙ} to be a Cauchy sequence, it is necessary in this space that there is an index n ∈ N such that xₙ = xₙ₊ₘ for all m ∈ N. Any sequence with this property then necessarily converges to the common value xₙ = xₙ₊₁ = ⋯ (we talk about almost stationary sequences). So the metric space (X, d) is complete.

The case (b). The open 1-neighbourhood of any element contains this element only; therefore, every singleton set is open. Since the union of any number of open sets is an open set, every set is open in (X, d).

Proof (of the Cantor intersection theorem). Select one point xᵢ from each set Aᵢ. Since diam Aᵢ → 0, for every positive real number ε we can find an index n(ε) such that all the Aᵢ with indices i ≥ n(ε) have diameters less than ε. For sufficiently large indices i, j, then, d(xᵢ, xⱼ) ≤ ε, and thus our sequence is a Cauchy sequence. Therefore, it has a limit point x ∈ X. Now x must be a limit point of all the sets Aᵢ, thus it belongs to all of them (since they are all closed), and so x belongs to their intersection. This proves the existence of x. Assume that two points x and y both belong to the intersection of all the sets Aᵢ. Then d(x, y) ≤ diam Aᵢ for every i. But diam Aᵢ → 0, so d(x, y) = 0, hence x = y. This proves the uniqueness of x. □

7.3.11. Theorem (Baire theorem). If X is a complete metric space, then the intersection of every countable system of open dense sets Aᵢ is a dense set in the metric space X.³

³ This theorem is a part of the considerations of René-Louis Baire in his 1899 doctoral thesis. More generally, a topological space satisfying the property from the theorem is called a Baire space, and the theorem simply says that every complete metric space is a Baire space.

Proof. Suppose X contains a system of dense open sets Aᵢ, i = 1, 2, …. It is required to show that the set A = ∩_{i=1}^∞ Aᵢ has a non-empty intersection with every non-empty open set U ⊂ X. We proceed inductively, invoking the previous theorem.
Since A₁ is dense, there is a z₁ ∈ A₁ ∩ U, and since the set A₁ ∩ U is open, the closure of an ε₁–neighbourhood U₁ of the point z₁ (for sufficiently small ε₁) is contained in A₁ ∩ U as well. Denote the closure of this ε₁–ball U₁ by B₁. Further, suppose that the points zᵢ and their open εᵢ–neighbourhoods Uᵢ are already chosen for i = 1, …, n, so that
\[ z_k \in B_k = \bar{U}_k, \qquad z_k \in \bigcap_{i=1}^{k} B_i \subset \Bigl(\bigcap_{i=1}^{k} A_i\Bigr) \cap U. \]
Since the set Aₙ₊₁ is open and dense in X, there is a point zₙ₊₁ ∈ Aₙ₊₁ ∩ Uₙ; however, since Aₙ₊₁ ∩ Uₙ is open, the point zₙ₊₁ belongs to it together with a sufficiently small εₙ₊₁–neighbourhood Uₙ₊₁. Then the closures surely satisfy Bₙ₊₁ = Ūₙ₊₁ ⊂ Ūₙ, and so the closed set Bₙ₊₁ is contained in Aₙ₊₁ ∩ Ūₙ. Moreover, we can assume that all εₙ ≤ 1/n. If we proceed in this inductive way from the original point z₁ and the set B₁, we obtain a non-increasing sequence of non-empty closed sets Bₙ whose diameters approach zero. Therefore, there is a point z common to all of these sets; that is,
\[ z \in \bigcap_{i=1}^{\infty} \bar{U}_i = \bigcap_{i=1}^{\infty} B_i \subset \Bigl(\bigcap_{n=1}^{\infty} A_n\Bigr) \cap U, \]
which is the statement to be proved. □

Back in 7.C.19, the case (b) continues: passing to complements, every set is also closed. The fact that the 2-neighbourhood of any element coincides with the whole space implies that every set is bounded in (X, d).

The case (c). Once again, we use the fact that the open 1-neighbourhood of any element contains this element only. It follows that every point of any set is both an interior and an isolated point of that set, and that sets have neither boundary nor limit points.

The case (d). Every finite set in an arbitrary metric space is compact (it defines a compact metric space by restricting the domain of d). It follows from the classification of convergent sequences (see (a)) that no infinite set can be compact in (X, d). □

7.C.20. In the metric space S[−1,1] with the metric given by the norm ∥·∥_∞, consider the sets
\[ A = \{ f \in S[-1,1];\ f(0) \in (0,2) \}, \qquad B = \Bigl\{ f \in S[-1,1];\ \int_{-1}^{1} f(x)\,dx = 0 \Bigr\}. \]
Are these sets open? Are these sets closed?

Solution. The interior of a set M is the set of all interior points of M, usually denoted by M⁰; a set M is open if and only if M = M⁰. Similarly, we define the closure M̄ of M as the set of all points having zero distance from M; M is closed if and only if M = M̄. Since
\[ A^0 = A, \qquad \bar{A} = \{ f \in S[-1,1];\ f(0) \in [0,2] \}, \qquad B^0 = \emptyset, \qquad \bar{B} = B \]
(in particular, Ā contains the functions f for which f(0) attains any value from the whole closed interval [0,2]), the set A is open and not closed, while, the other way around, the set B is closed and not open. □

One of the most important concepts related to complete metric spaces is the principle of nested balls (cf. Theorem 7.3.10). Under an additional condition, it says that a metric space (X, d) is complete if and only if every sequence {Aₙ} of nested (i.e., Aₙ₊₁ ⊆ Aₙ, n ∈ N) non-empty closed sets Aₙ has a non-empty intersection, that is,
\[ (1)\quad \bigcap_{n\in\mathbb{N}} A_n \neq \emptyset. \]

7.C.21. Verify that the additional condition in this theorem,
\[ (2)\quad \lim_{n\to\infty} \sup\{ d(x,y);\ x, y \in A_n \} = 0, \]
cannot be omitted.
Solution. That the requirement (2) cannot be omitted is probably contrary to many readers' expectations. For a counterexample, consider the set X = N with the metric
\[ d(m,n) = 1 + \frac{1}{m+n} \ \text{ for } m \neq n, \qquad d(m,n) = 0 \ \text{ for } m = n. \]
This is indeed a metric. The first and second properties are clearly satisfied, and to prove the triangle inequality it suffices to observe that d(m, n) ∈ (1, 4/3] whenever m ≠ n. Hence the only Cauchy sequences are those which are constant from some index on — constant except for finitely many terms, the almost stationary sequences again. Thus every Cauchy sequence is convergent, so the metric space is complete. Now define
\[ A_n := \Bigl\{ m \in \mathbb{N};\ d(m,n) \le 1 + \frac{1}{2n} \Bigr\}, \qquad n \in \mathbb{N}. \]
As the inequality in their definition is not strict, the sets Aₙ are guaranteed to be closed. Since Aₙ = {n, n+1, …}, the sets {Aₙ} are nested, but with empty intersection, contrary to (1). So if the condition (2) were omitted from the statement of the principle, this complete metric space would contradict it. And indeed, (2) is not met in this case, as
\[ \lim_{n\to\infty} \sup\{ d(x,y);\ x,y \in A_n \} = \lim_{n\to\infty} \Bigl( 1 + \frac{1}{2n+1} \Bigr) = 1 \neq 0. \]
□

7.3.12. Bounded and compact sets. The following concepts facilitated our discussions when dealing with the real and complex numbers. They can be reformulated for general metric spaces with almost no change. An interior point of a subset A in a metric space is an element of A which belongs to it together with some of its ε–neighbourhoods. A boundary point of a set A is an element x ∈ X such that each of its neighbourhoods has a non-empty intersection with both A and the complement X ∖ A; a boundary point may or may not belong to the set A itself. A limit point of a set A is an element x equal to the limit of a sequence xᵢ ∈ A such that xᵢ ≠ x for all i; clearly, a limit point may or may not belong to the set A. An isolated point of a set A is an element a ∈ A such that one of its ε–neighbourhoods in X has the singleton intersection {a} with A. An open cover of a set A is a system of open sets Uᵢ ⊂ X, i ∈ I, such that their union contains A.

Compact sets

A metric space X is called compact if every sequence of points xᵢ ∈ X has a subsequence converging to some point x ∈ X. A subset A ⊂ X of a metric space is called compact if it is compact as the metric space with the restricted metric.

Clearly, the compact subsets of a discrete metric space X are exactly the finite subsets of X. In the case of the real numbers R and the complex numbers C, our definition recovers the compact subsets discussed there, and we would also like to arrive at properties as useful as those derived for real and complex numbers in paragraphs 5.2.7–5.2.8. Just as an appetizer, notice that continuous mappings between general metric spaces behave like real functions:

Theorem. Let f : X → Y be a continuous mapping between metric spaces. Then the images of compact sets are compact.

Proof. Recall that any convergent sequence of points xᵢ → x in X is mapped onto the convergent sequence f(xᵢ) → f(x) in Y. Any sequence in f(X) is the image of a sequence in X, which has a convergent subsequence; the image of that subsequence is then a convergent subsequence of the original one. Thus, the statement follows immediately from our definition of compactness via convergent subsequences. □

In particular, we obtain the most useful consequence concerning the minima and maxima of continuous real-valued functions on compact subsets:

Corollary. Let f : X → R be a real function defined on a compact metric space. Then there are points x₀ and y₀ in X such that
\[ f(x_0) = \max_{x\in X} f(x), \qquad f(y_0) = \min_{x\in X} f(x). \]

Proof. The image f(X) must be a compact subset of R, thus it must achieve both its maximum and its minimum (which are the supremum and the infimum of the bounded and closed image). □
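Returning to the counterexample of 7.C.21: the following sketch spot-checks that d(m, n) = 1 + 1/(m+n) satisfies the triangle inequality on an initial segment of N and prints the diameters of the nested sets Aₙ, which stay above 1 even though ∩Aₙ = ∅. Exact Fractions are used to avoid rounding:

from fractions import Fraction

def d(m, n):
    """The metric of 7.C.21 on the natural numbers."""
    return Fraction(0) if m == n else 1 + Fraction(1, m + n)

N = range(1, 30)
assert all(d(x, z) <= d(x, y) + d(y, z) for x in N for y in N for z in N)

# A_n = {n, n+1, ...}; its diameter is attained at the pair (n, n+1)
for n in (1, 10, 100, 1000):
    print(n, float(1 + Fraction(1, 2 * n + 1)))   # tends to 1, not to 0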
7.C.22. Determine whether the set (known as the Hilbert cube)
\[ A = \bigl\{ \{x_n\}_{n\in\mathbb{N}} \in \ell_2;\ |x_n| \le \tfrac{1}{n},\ n \in \mathbb{N} \bigr\} \]
is compact in ℓ₂. Then determine the compactness of the set
\[ B = \bigl\{ \{x_n\}_{n\in\mathbb{N}} \in \ell_\infty;\ |x_n| < \tfrac{1}{n},\ n \in \mathbb{N} \bigr\} \]
in the space ℓ∞.

Solution. The space ℓ₂ is complete (see 7.C.16), and every closed subset of a complete metric space defines a complete metric space. The set A is evidently closed in ℓ₂, so it suffices to show that it is totally bounded; by Theorem 7.3.13(3), it is then compact. To do that, we construct an ε-net of A for any given ε > 0. Begin with the well-known series
\[ \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6}. \]
For every ε > 0, there is an n(ε) ∈ N satisfying
\[ \sqrt{ \sum_{k=n(\varepsilon)+1}^{\infty} \frac{1}{k^2} } < \frac{\varepsilon}{2}. \]
From each of the intervals [−1/n, 1/n], n ∈ {1, …, n(ε)}, choose finitely many points x₁ⁿ, …, x_{m(n)}ⁿ so that for every x ∈ [−1/n, 1/n],
\[ \min_{j\in\{1,\dots,m(n)\}} |x - x_j^n| < \frac{\varepsilon}{\sqrt{5^n}}. \]
Consider those sequences {yₙ} ∈ ℓ₂ whose terms with indices n > n(ε) are zero and which satisfy
\[ y_1 \in \{x_1^1, \dots, x_{m(1)}^1\},\ \dots,\ y_{n(\varepsilon)} \in \{x_1^{n(\varepsilon)}, \dots, x_{m(n(\varepsilon))}^{n(\varepsilon)}\}. \]
There are only finitely many such sequences, and they create the desired ε-net for A, as we verify below.

The concept of boundedness is a little more complicated in the case of general metric spaces. For any point x and subset B ⊂ X of a metric space X with metric d, we define their distance⁴
\[ \operatorname{dist}(x, B) = \inf_{y\in B} \{ d(x,y) \}. \]

⁴ Notice that the distance between two subsets A, B ⊂ X should express how "different" they are. Thus one defines the (Hausdorff) distance as follows:
\[ \operatorname{dist}(A,B) = \max\bigl\{ \sup\{\operatorname{dist}(x,B);\ x \in A\},\ \sup\{\operatorname{dist}(y,A);\ y \in B\} \bigr\}. \]
This distance is finite for bounded sets, and it is easy to see that it vanishes if and only if the closures of A and B coincide. In particular, the distances of the point x and of the one-point set {x} are in general very different: dist(x, A) ≠ dist({x}, A).

We say that a metric space X is totally bounded if for every positive real number ε there is a finite set A such that dist(x, A) < ε for all points x ∈ X. We call such an A an ε-net of X. Recall that a metric space is called bounded if it has a finite diameter. We can immediately see that a totally bounded space is always bounded. Indeed, the diameter of a finite set is finite, and if A is the finite set corresponding to ε in the definition of total boundedness, then the distance d(x, y) of any two points can be bounded by the sum of dist(x, A), dist(y, A), and diam A, which is a finite number. In the case of a metric on a subset of a finite-dimensional Euclidean space, these concepts coincide, since the boundedness of a set guarantees the boundedness of all coordinates in a fixed orthonormal basis, and this implies total boundedness. (Verify this in detail by yourselves!)

The next theorem provides the promised, very useful alternative characterisations of compactness.

7.3.13. Theorem. The following statements about a metric space X are equivalent:
(1) X is compact;
(2) every open covering of X, X = ∪_{i∈I} Uᵢ, contains a finite covering X = ∪_{k=1}^{n} U_{j_k}, where all j_k ∈ I;
(3) X is complete and totally bounded.

Proof. We show consecutively the implications (1) ⟹ (3) ⟹ (2) ⟹ (1). Start by assuming that X is compact. Then for each Cauchy sequence of points xᵢ there is a subsequence x_{iₙ} converging to a point x ∈ X. We just have to verify that the initial sequence also converges to the same limit, xᵢ → x; this is easy, and we leave it to the reader. So X is complete. Suppose X is not totally bounded. Then there is ε > 0 such that no finite ε-net exists in X, and hence there is a sequence of points xᵢ such that d(xᵢ, xⱼ) ≥ ε for all i ≠ j. (Verify this almost obvious claim — look at the definition of ε-nets!)
No subsequence of such a sequence can be a Cauchy sequence, so X is not compact. This contradicts (1), and we conclude that X is totally bounded.

Returning to 7.C.22: let x = {x_k} ∈ A be arbitrary. According to our choice of the candidate coordinates, there is a sequence y = {y_k} in our finite family such that
\[ d(x, y) = \sqrt{ \sum_{k=1}^{\infty} (x_k - y_k)^2 } \le \sqrt{ \sum_{k=1}^{n(\varepsilon)} (x_k - y_k)^2 } + \sqrt{ \sum_{k=n(\varepsilon)+1}^{\infty} x_k^2 } < \sqrt{ \frac{\varepsilon^2}{5} + \frac{\varepsilon^2}{5^2} + \dots + \frac{\varepsilon^2}{5^{n(\varepsilon)}} } + \frac{\varepsilon}{2} \le \varepsilon \sqrt{ \frac{1}{1-\frac{1}{5}} - 1 } + \frac{\varepsilon}{2} = \varepsilon. \]
Since ε > 0 was arbitrary, the set A is totally bounded, which implies compactness.

The closure of the set B in ℓ∞ is
\[ \bar{B} = \bigl\{ \{x_n\}_{n\in\mathbb{N}} \in \ell_\infty;\ |x_n| \le \tfrac{1}{n},\ n \in \mathbb{N} \bigr\}. \]
Hence B is not closed, and so it is not compact. The closure B̄, however, is compact; the proof of this fact is much simpler than for the set A, so we leave it as an exercise for the reader. □

7.C.23. Prove that on each metric space X, the given metric d is a continuous function X × X → R. ⃝

7.C.24. Show that if F is a continuous mapping on a compact metric space X such that
\[ d(F(x), F(y)) < d(x, y) \quad \text{for all } x \neq y, \]
then F has a fixed point.

Solution. The infimum α of the values of the continuous function x ↦ d(x, F(x)) must be achieved at some point x₀ ∈ X (see 7.3.12 for the concepts and main results, and use the previous result 7.C.23). Since distances are non-negative, α ≥ 0. If α ≠ 0, then
\[ d(F(x_0), F(F(x_0))) < d(x_0, F(x_0)) = \alpha, \]
which is a contradiction. Hence α = 0, i.e. F(x₀) = x₀. □

The next implication, namely (3) ⟹ (2), is more demanding. So assume X is complete and totally bounded, but X does not satisfy (2). Then there is an open covering U_α, α ∈ I, of X which does not contain any finite covering. Choose a sequence of positive real numbers ε_k → 0 and consider the finite ε_k–nets from the definition of total boundedness. Further, for each k, consider the system A_k of closed balls with centres at the points of the ε_k-net and diameters 2ε_k. Clearly, each such system A_k covers the entire space X. Altogether, there must be at least one closed ball C in the system A₁ which is not covered by a finite number of the sets U_α; call it C₁ and notice that diam C₁ ≤ 2ε₁. Next, consider the sets C₁ ∩ C with balls C ∈ A₂, which together cover the entire set C₁. Again, at least one of them cannot be covered by a finite number of the U_α; we call it C₂. This way, we inductively construct a sequence of sets C_k satisfying C_{k+1} ⊂ C_k and diam C_k ≤ 2ε_k, with ε_k → 0, such that none of them can be covered by a finite number of the open sets U_α. Finally, we choose one point x_k ∈ C_k in each of these sets. By construction, this must be a Cauchy sequence; consequently, it has a limit x, since X is complete. Thus there is U_{α₀} containing x, and it contains some δ-neighbourhood B_δ(x) as well. But for large k we have both d(x, x_k) < δ/2 and diam C_k < δ/2, so C_k ⊂ B_δ(x) ⊂ U_{α₀}, which contradicts the choice of the C_k.

The remaining step is to show the implication (2) ⟹ (1). Assume (2) and, considering any sequence of points xᵢ ∈ X, set
\[ C_n = \overline{ \{ x_k;\ k \ge n \} }. \]
The intersection of these closed sets must be non-empty by the following general lemma.

Lemma. Let X be a metric space such that property (2) in the Theorem holds. Consider a system of closed sets D_α, α ∈ I, such that each of its finite subsystems
D_{α₁}, …, D_{α_k} has a non-empty intersection. Then also ∩_{α∈I} D_α ≠ ∅.

This simple lemma is proved by contradiction, again. If the latter intersection were empty, then
\[ X = X \setminus \Bigl( \bigcap_{\alpha\in I} D_\alpha \Bigr) = \bigcup_{\alpha\in I} (X \setminus D_\alpha) = \bigcup_{\alpha\in I} V_\alpha, \]
where the V_α = X ∖ D_α are open sets. Thus, by (2), there must be a finite number of them, V_{α₁}, …, V_{αₙ}, covering X too. Then we obtain
\[ X = \bigcup_{i=1}^{n} V_{\alpha_i} = \bigcup_{i=1}^{n} (X \setminus D_{\alpha_i}) = X \setminus \Bigl( \bigcap_{i=1}^{n} D_{\alpha_i} \Bigr). \]
This is a contradiction with our assumptions on the D_α, and the lemma is proved.

Now, let x ∈ ∩_{n=1}^∞ Cₙ. By construction, there is a subsequence x_{n_k} in our sequence of points xₙ ∈ X such that d(x_{n_k}, x) < 1/k. This is a converging subsequence, and so the proof is complete. □

As an immediate corollary of the latter theorem, each closed subset of a compact metric space is again compact: subsets of a totally bounded set are totally bounded, and closed subsets of a complete metric space are also complete. Another consequence is an alternative proof that a subset K ⊂ Rⁿ is compact if and only if it is closed and bounded. Notice also that while the conditions (1) and (3) are given in terms of the metric, the equivalent condition (2) is purely topological.

7.3.14. Continuous functions. We revisit the questions related to the continuity of mappings between metric spaces. In fact, many ideas understood for functions of one real variable generalize naturally. In particular, as we already showed, every continuous function f : X → R on a compact set X is bounded and achieves its maximum and minimum. Here is another argument for this, using the purely topological concept of compactness: the open intervals Uₙ = (n−1, n+1) ⊂ R, n ∈ Z, cover R. Then their preimages f⁻¹(Uₙ) cover X, so there is a finite number of them covering X as well. Thus f is bounded, and the supremum and infimum of its values exist. Consider sequences f(xₙ) and f(yₙ) converging to the supremum and the infimum, respectively. Then there must be convergent subsequences of the points xₙ and yₙ in X, and their limits x and y are in X too. But then f(x) and f(y) are the supremum and the infimum of the values of f, since f is continuous and thus respects convergence.

We should also appreciate the differences between the "purely topological" concepts, such as continuity (which can be defined merely by means of open sets), and the following stronger concepts, which are "metric" properties.

Uniformly continuous mappings

A mapping f : X → Y between metric spaces is called uniformly continuous if for each ε > 0 there is a δ > 0 such that d_Y(f(x), f(y)) < ε for all x, y ∈ X with d_X(x, y) < δ.

Notice the following simple lemma:

Lemma. A mapping f : X → Y between metric spaces is uniformly continuous if and only if, for each pair of sequences x_k and y_k in X, d_X(x_k, y_k) → 0 implies d_Y(f(x_k), f(y_k)) → 0.

Proof. If f is uniformly continuous, then the claim about the sequences is obvious. Now, assume that the property of sequences holds true but f is not uniformly continuous. Then, employing the negation of the defining property, there is an ε > 0 such that for every δ = 1/n there are xₙ, yₙ with d_X(xₙ, yₙ) < 1/n and d_Y(f(xₙ), f(yₙ)) ≥ ε. But this violates the assumed condition. □

This observation leads to the following generalization of the behaviour of real functions:

Theorem. Each continuous mapping f : X → Y on a compact metric space X is uniformly continuous.

Proof. Assume f is a continuous mapping.
Consider any two sequences x_k and y_k in X with d_X(x_k, y_k) → 0, and assume that their images do not satisfy d_Y(f(x_k), f(y_k)) → 0. Then, for some ε > 0, there must be a subsequence (x_{k_n}, y_{k_n}) for which d_Y(f(x_{k_n}), f(y_{k_n})) > ε. Since X is compact, there is a further subsequence of the x_{k_n} converging to some point x ∈ X. Without loss of generality, we may assume that the original sequence (x_k, y_k) already has all these properties (we are aiming at a contradiction with the assumed behaviour of the images, via the lemma above). Of course,
\[ d_X(x, y_k) \le d_X(x, x_k) + d_X(x_k, y_k) \to 0, \]
and so lim_{k→∞} y_k = x, too. Next, notice that the metric d_Y : Y × Y → R is always a continuous function. Indeed, recall the definition of the product metric as the maximum of the distances of the components, see 7.C.1. With this definition, the continuity of d_Y is obvious from the estimate
\[ |d(x,y) - d(x',y')| \le d(x,x') + d(y,y'), \]
cf. 7.3.8(1). Now, the continuity of f ensures
\[ \lim_{k\to\infty} d_Y(f(x_k), f(y_k)) = d_Y(f(x), f(x)) = 0. \]
But this violates our assumption on the sequences, and the proof is complete. □

A very useful variation on the theme of continuity is the following definition.

Lipschitz continuity

A mapping f : X → Y between metric spaces is called Lipschitz continuous if there is a constant C > 0 such that for all points x, y ∈ X,
\[ d_Y(f(x), f(y)) \le C\, d_X(x, y). \]

Every Lipschitz continuous mapping is uniformly continuous.

7.3.15. Arzelà–Ascoli Theorem. To conclude, we consider some spaces of functions. These provide examples of how much such spaces may differ from the usual Euclidean ones. First, we introduce some terminology. Basically, we want to deal with families of functions which are all uniformly continuous in the very same way:

Equicontinuous functions

Consider a set M of mappings f : X → Y between metric spaces. We say that the functions in M are equicontinuous if for each ε > 0 there is a δ > 0 such that d_Y(f(x), f(y)) < ε for all x, y ∈ X with d_X(x, y) < δ, and for all functions f ∈ M.

Consider the metric space C(X) of all continuous (real or complex) functions on a compact metric space X, with its ∥·∥_∞ norm. This means that the distance between two functions f, g is the maximum of the distances between their values f(x) and g(x) over x ∈ X. We say that a set M ⊂ C(X) of real functions is uniformly bounded if there is a constant K ∈ R such that |f(x)| ≤ K for all functions f ∈ M and all points x ∈ X. Of course, bounded sets M of functions in C(X) are always uniformly bounded, by the definition of the norm.

Theorem. Consider a compact metric space X. A set M ⊂ C(X) in the space of continuous functions with the supremum norm ∥·∥_∞ is compact if and only if it is bounded, closed, and equicontinuous.⁵

⁵ A weaker version, providing a sufficient condition, was first published by Ascoli in 1883; the complete exposition was given by Arzelà in 1895. Again, there are much more general versions of this theorem in the realm of topological spaces.

Proof. Suppose M is compact. Then M is totally bounded (and thus also uniformly bounded, as noticed above). Since every compact subset is closed, it remains to verify the equicontinuity. Given ε > 0, consider the corresponding ε-net (f₁, f₂, …, f_k) ⊂ M from the definition of the total boundedness of M. Recall that all the functions fᵢ are uniformly continuous (as continuous functions on a compact set). Thus, for each fᵢ there is a δᵢ such that d_X(x, y) < δᵢ implies |fᵢ(x) − fᵢ(y)| < ε. Of course, we take δ to be the minimum of the finitely many δᵢ, i = 1, …, k. Then the same inequality holds for all the fᵢ in the ε-net.
7.3.15. Arzelà-Ascoli Theorem. To conclude, we consider some spaces of functions. These provide examples of how much metric spaces may differ from the usual Euclidean spaces. First, we introduce some terminology. Basically, we want to deal with families of functions which are all uniformly continuous in the very same way:
Equicontinuous functions
Consider a space M of mappings f : X → Y between metric spaces. We say that the functions in M are equicontinuous, if for each ε > 0 there is a δ > 0, such that dY(f(x), f(y)) < ε for all x, y ∈ X with dX(x, y) < δ, for all functions f ∈ M.
Consider the metric space C(X) of all continuous (real or complex) functions on a compact metric space X, with its ∥ ∥∞ norm. This means that the distance between two functions f, g is the maximum of the distance between their values f(x) and g(x) for x ∈ X.
We say that a set M ⊂ C(X) of real functions is uniformly bounded, if there is a constant K ∈ R such that |f(x)| ≤ K for all functions f ∈ M and points x ∈ X. Of course, bounded sets M of functions in C(X) are always uniformly bounded, by the definition of the norm.
Theorem. Consider a compact metric space X. A set M ⊂ C(X) in the space of continuous functions with the supremum norm ∥ ∥∞ is compact if and only if it is bounded, closed, and equicontinuous.⁵
Proof. Suppose M is compact. Then M is totally bounded (and thus also uniformly bounded, as noticed above). Since every compact subset is closed, it remains to verify the equicontinuity. Given ε > 0, consider the corresponding ε-net {f1, f2, . . . , fk} ⊂ M from the definition of the total boundedness of M. Recall that all the functions fi are uniformly continuous (as continuous functions on a compact set). Thus there is a δi for each fi, such that dX(x, y) < δi implies |fi(x) − fi(y)| < ε. Of course, we take δ to be the minimum of the finitely many δi, i = 1, . . . , k. Then the same inequality holds for all fi in the ε-net. But now, considering an arbitrary function f ∈ M, there is a function fj in our ε-net with ∥f − fj∥∞ < ε, and so dX(x, y) < δ implies that |f(x) − f(y)| is at most |f(x) − fj(x)| + |fj(x) − fj(y)| + |fj(y) − f(y)| < 3ε, and the equicontinuity has been proved.
Conversely, suppose that M is a bounded, closed, and equicontinuous subset of C(X), with X a compact metric space. First we show that M is complete. Obviously, a Cauchy sequence in the ∥ ∥∞ norm is also a pointwise Cauchy sequence of real or complex numbers. Thus the sequence of functions converges pointwise. Next, the pointwise limit must be bounded, and the equicontinuity property implies that this limit is continuous as well. Since M is closed, the limit belongs to M.
Thus, we need to find a Cauchy (sub)sequence within any sequence of functions fn ∈ M. The compact space X itself is totally bounded and therefore it contains a countable dense set A ⊂ X (we may take the points in all 1/k-nets for k ∈ N). Write A = {a1, a2, . . . } as a sequence. Choose a subsequence of functions f_{1j}, j = 1, 2, . . . , within the functions fn, so that the sequence of values f_{1j}(a1) converges. (This is possible, since the set M is bounded in the ∥ ∥∞ norm.) Similarly, the subsequence f_{2j} can be chosen from f_{1j}, so that f_{2j}(a2) converges. In general, the m-th subsequence is chosen from f_{(m−1)j} so that the values f_{mj}(am) converge (and, by our construction, it converges in all ai, i < m, too).
As a result, we can choose the diagonal sequence of functions gk = f_{kk} for all positive integers k, with the hope that this is a Cauchy sequence. This is where the equicontinuity helps. Start with any ε > 0 and find δε > 0, such that |f(x) − f(y)| < ε for all f ∈ M whenever the arguments x and y are closer than δε. Let Aε ⊂ A be a subset forming a δε-net. This is a finite set, and so there must be an n ∈ N such that for all i, j ≥ n and all a ∈ Aε, we know |gi(a) − gj(a)| < ε. But then, for every x ∈ X, there is some a ∈ Aε with dX(x, a) < δε, and so |gi(x) − gj(x)| can be at most |gi(x) − gi(a)| + |gi(a) − gj(a)| + |gj(a) − gj(x)| < 3ε.
Thus, the sequence gk is a Cauchy sequence in C(X); by the completeness of M it converges in M, and so M is compact. □
⁵A weaker version providing a sufficient condition was first published by Ascoli in 1883; the complete exposition was given by Arzelà in 1895. Again, there are much more general versions of this theorem in the realm of topological spaces.
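To get a feeling for what equicontinuity rules out, here is a small numerical illustration of our own (not from the text): the family fn(x) = xⁿ on [0, 1] is uniformly bounded, but near x = 1 the functions change arbitrarily fast, so the family is not equicontinuous.
# f_n(x) = x^n on [0,1]: the points 1 and 1 - 1/n get arbitrarily close,
# yet |f_n(1) - f_n(1 - 1/n)| stays above 1 - 1/e for all n
for n in [5, 50, 500]:
    y = 1 - 1/n
    print(n, (1 - y).n(), (1 - y^n).n())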
D. Additional exercises to the whole chapter
Let us begin this section with problems similar to those described in the introduction of Chapter 7 (cf. 7.A.1, 7.A.2).
7.D.1. Let S0[0, 1] be the space of F-valued piecewise continuous functions defined on [0, 1], where F ∈ {R, C}, endowed with the L2-inner product ⟨f, g⟩ := ∫_0^1 f(x)g(x) dx. With respect to the induced norm ∥ · ∥2 compute:
(a) ∥f∥2, where f(x) = x + b, with b being some constant;
(b) the distance ∥f − g∥2 between f(x) = cos(2πx) and g(x) = sin(2πx).
Solution. (a) By definition, ∥f∥2 = √⟨f, f⟩ = √(∫_0^1 (x + b)² dx), and we easily compute that ∫_0^1 (x + b)² dx = [x³/3]_0^1 + 2[bx²/2]_0^1 + [b²x]_0^1 = 1/3 + b + b². Thus ∥f∥2 = √(1/3 + b + b²).
For (b) and the distance between the given f, g, we compute ⟨f − g, f − g⟩ = ∫_0^1 (cos(2πx) − sin(2πx))² dx = ∫_0^1 (1 − 2 cos(2πx) sin(2πx)) dx = ∫_0^1 (1 − sin(4πx)) dx = [x + cos(4πx)/(4π)]_0^1 = 1, where the third equality relies on the identity sin(2t) = 2 sin(t) cos(t). Thus ∥f − g∥2 = √1 = 1. □
7.D.2. Compute the L2-norm of the function f(x) = e^{ix}, defined on the interval [−π, π].
Solution. We can express f as f(x) = e^{ix} = cos(x) + i sin(x), for all x ∈ [−π, π]. Then, with respect to the L2-product, we see that ⟨f, f⟩ = ∫_{−π}^{π} (cos(x) + i sin(x))(cos(x) − i sin(x)) dx = ∫_{−π}^{π} (cos²(x) + sin²(x)) dx = ∫_{−π}^{π} dx = 2π. Thus ∥f∥2 = √(2π). □
7.D.3. Consider the vector space R3[x] of polynomials of one variable of degree at most 3 with real coefficients, endowed with the L2-inner product (i.e., the same inner product ⟨ , ⟩ as the one in Problem 7.D.1). Let W be the subspace of R3[x] having as basis the set B = {x, x²}. Find an orthonormal basis of W with respect to ⟨ , ⟩.
Solution. Set E1 = x and E2 = x². The Gram–Schmidt procedure states that an orthogonal basis of W is given by {w1 = E1, w2 = E2 − (⟨E2, w1⟩/∥w1∥2²) w1}. We compute ⟨E2, w1⟩ = ∫_0^1 x³ dx = [x⁴/4]_0^1 = 1/4, and ⟨w1, w1⟩ = ∫_0^1 x² dx = [x³/3]_0^1 = 1/3. Thus w2 = E2 − (3/4)E1 = x² − (3/4)x. As a verification, observe that ⟨w1, w2⟩ = ⟨x, x²⟩ − (3/4)⟨x, x⟩ = 1/4 − (3/4)·(1/3) = 0. Consequently, an orthonormal basis of W is given by the "vectors" {ŵ1 = w1/∥w1∥2, ŵ2 = w2/∥w2∥2}, where ∥w1∥2 = 1/√3 and ∥w2∥2 = √(∫_0^1 (x² − (3/4)x)² dx) = 1/√80, respectively. □
7.D.4. Consider the inner product space (R3[x], ⟨ , ⟩) of the previous task 7.D.3, and let W be the subspace of R3[x] spanned by the basis {1, x²}. Find a basis of the orthogonal complement W⊥ of W with respect to ⟨ , ⟩.
Solution. By assumption, W is spanned by the set {1, x²}. Thus, a vector f(x) = a3x³ + a2x² + a1x + a0 ∈ W⊥ must satisfy ⟨f(x), 1⟩ = 0 and ⟨f(x), x²⟩ = 0. This gives the following system of equations: a3/4 + a2/3 + a1/2 + a0 = 0 and a3/6 + a2/5 + a1/4 + a0/3 = 0. We can solve this in Sage via the cell
var("a3", "a2", "a1", "a0")
solve([(1/4)*a3+(1/3)*a2+(1/2)*a1+a0==0,(1/6)*a3+(1/5)*a2+(1/4)*a1+(1/3)*a0==0], a3, a2, a1, a0)
Sage returns the solution
[[a3 == 16*r1 + 3*r2, a2 == -15*r1 - 15/4*r2, a1 == r2, a0 == r1]]
which means that a3 = 16a + 3b and a2 = −15(a + b/4), with a1 = b ∈ R and a0 = a ∈ R being the free variables. Thus f(x) = b(3x³ − (15/4)x² + x) + a(16x³ − 15x² + 1), and this implies that W⊥ = span{f1(x), f2(x)}, where f1(x) = 3x³ − (15/4)x² + x and f2(x) = 16x³ − 15x² + 1, respectively. It remains to prove that f1(x), f2(x) are linearly independent, which one can easily check (for instance, one may compute the Wronskian as in Problem 6.D.16). Thus {f1, f2} is a basis of W⊥. □
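The arithmetic in 7.D.3 is easy to double-check in Sage; the following cell is our own quick verification sketch (w1, w2 are the Gram–Schmidt vectors computed above):
w1 = x
w2 = x^2 - 3/4*x
print(integral(w1*w2, x, 0, 1))        # 0, so w1 and w2 are orthogonal
print(sqrt(integral(w1^2, x, 0, 1)))   # 1/sqrt(3)
print(sqrt(integral(w2^2, x, 0, 1)))   # 1/sqrt(80), printed as 1/(4*sqrt(5))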
7.D.5. Given the vector subspace ⟨sin(x), x⟩ of the space of continuous real-valued functions defined on the interval [0, π], complete the function x to an orthogonal basis of the subspace and determine the orthogonal projection of the function (1/2) sin(x) onto it. ⃝
7.D.6. Given the vector subspace ⟨cos(x), x⟩ of the space of continuous real-valued functions defined on the interval [0, π], complete the function cos(x) to an orthogonal basis of the subspace and determine the orthogonal projection of the function sin(x) onto it. ⃝
7.D.7. Show that the system of functions 1, sin(x), cos(x), . . . , sin(nx), cos(nx), . . . is orthogonal with respect to the L2 inner product on the interval I = [−π, π]. ⃝
Next we will focus on tasks about Fourier series of periodic functions, and their convergence.
7.D.8. Let f : (−π, π) → R be the piecewise function defined by f(x) = −1 if −π < x < −π/3; f(x) = 0 if −π/3 ≤ x ≤ π/3; f(x) = 1 if π/3 < x < π.
(1) Find the 2π-periodic extension of f and sketch its graph via Sage for x ∈ (−3π, 3π);
(2) Prove that the Fourier series F of the 2π-periodic extension of f is a sine series. Next determine F explicitly;
(3) Use Sage to verify your computations in (2);
(4) Present the 5th partial sum of the approximation. Next confirm your result via Sage and plot F5, F10, F20 and F100 together with f, for x ∈ (−π, π);
(5) Use Sage to specify the 5th, 10th, and 100th coefficient of the Fourier series F corresponding to f.
Solution. (1) The 2π-periodic extension of f is defined by f̂(x) = f(x) for all −π < x < π, with f̂(x) = f̂(x + 2π) for all x ∈ R. To plot this periodic extension in Sage for x ∈ (−3π, 3π), we can use the piecewise function appropriately. In particular, the block
f1(x)=-1; f2(x)=0; f3(x)=1
f=piecewise([[(-3*pi,-7*pi/3),f1],[(-7*pi/3,-5*pi/3),f2], [(-5*pi/3,-pi),f3],[(-pi,-pi/3),f1], [(-pi/3,pi/3),f2],[(pi/3,pi),f3], [(pi,5*pi/3),f1], [(5*pi/3,7*pi/3),f2], [(7*pi/3,3*pi),f3]])
a = f.plot(x, -3*pi, 3*pi, thickness=2, color="red", exclude=[-7*pi/3, -5*pi/3, -pi, -pi/3, pi/3, pi, 5*pi/3, 7*pi/3])
a.show(ticks=pi/3, tick_formatter=pi)
produces the required figure. One may avoid using the exclude option (which removes the jump discontinuities from the plot); in that case Sage produces a figure in which the vertical lines at the jumps should not be considered as part of the required graph.
(2) It is easy to see that f(x) (and hence f̂) is an odd function, thus an = 0 for all n. Consequently, the Fourier series of f is a sine series. As for bn, first notice that the function g(x) := f(x) sin(nx) satisfies g(x) = −sin(nx) if −π < x < −π/3, g(x) = 0 if −π/3 ≤ x ≤ π/3, and g(x) = sin(nx) if π/3 < x < π, while g(−x) attains exactly the same values. Thus g(−x) = g(x) for all x ∈ (−π, π), that is, g is an even function. Therefore,
bn = (1/π) ∫_{−π}^{π} f(x) sin(nx) dx = (2/π) ∫_0^{π} f(x) sin(nx) dx = (2/π) [∫_0^{π/3} f(x) sin(nx) dx + ∫_{π/3}^{π} f(x) sin(nx) dx] = (2/π) ∫_{π/3}^{π} 1 · sin(nx) dx = (2/π) [−cos(nx)/n]_{π/3}^{π} = (2/(nπ)) (−cos(nπ) + cos(nπ/3)) = (2/(nπ)) (−(−1)ⁿ + cos(nπ/3)) = (2/(nπ)) ((−1)^{n+1} + cos(nπ/3)).
Consequently, the Fourier series of the function at hand has the form
F(x) = (2/π) Σ_{n=1}^{∞} (((−1)^{n+1} + cos(nπ/3))/n) sin(nx).
(3) To confirm the previous expression in Sage, it suffices to confirm the expressions of the Fourier coefficients, which can be done via the method presented in 7.A.10, that is,
f1(x)=-1; f2(x)=0; f3(x)=1
g=piecewise([[(-pi, -pi/3), f1], [[-pi/3, pi/3], f2],[(pi/3, pi), f3]]); L=pi; n=var("n")
an=(1/L)*integral(f1(x)*cos(n*pi*x/L),x,-pi,-pi/3)+(1/L)*integral(f2(x)*cos(n*pi*x/L),x,-pi/3,pi/3)\
+(1/L)*integral(f3(x)*cos(n*pi*x/L), x, pi/3, pi); show("a_n=", an.expand())
bn=(1/L)*integral(f1(x)*sin(n*pi*x/L),x,-pi,-pi/3)+(1/L)*integral(f2(x)*sin(n*pi*x/L),x,-pi/3,pi/3)\
+(1/L)*integral(f3(x)*sin(n*pi*x/L), x, pi/3, pi); show( "b_n=", bn.expand())
Executing this block, we obtain the desired result: a_n = 0, b_n = −2 cos(πn)/(πn) + 2 cos(πn/3)/(πn).
(4) An application of the formula given in (2) yields F5(x) = 3 sin(x)/π − 3 sin(2x)/(2π) − 3 sin(4x)/(4π) + 3 sin(5x)/(5π).
To confirm this expression in Sage, simply add the following cell to the block above:
pS5 = f.fourier_series_partial_sum(5,pi); show("5th partial sum = ",pS5)
To generate the required graphs, include the following syntax in the initial block in (3):
pS5 = f.fourier_series_partial_sum(5,pi); pS10 = f.fourier_series_partial_sum(10,pi)
pS20 = f.fourier_series_partial_sum(20,pi); pS100 = f.fourier_series_partial_sum(100,pi)
a = f.plot(x, -pi, pi, thickness=2, color="red", legend_label=r"$f$", exclude=[-pi/3, pi/3])
s1 = plot(pS5, x, -pi, pi, linestyle="--", color="midnightblue", legend_label=r"$F_5$")
s2 = plot(pS10, x, -pi, pi, linestyle="--", color="midnightblue", legend_label=r"$F_{10}$")
s3 = plot(pS20, x, -pi, pi, linestyle="--", color="midnightblue", legend_label=r"$F_{20}$")
s4 = plot(pS100, x, -pi, pi, linestyle="--",color="midnightblue", legend_label=r"$F_{100}$")
(a+s1).show(); (a+s2).show(); (a+s3).show(); (a+s4).show()
Running the code all together, we obtain the four figures showing the partial sums F5(x), F10(x), F20(x) and F100(x).
(5) By the result in (4), we already know the 5th coefficient, which is given by 3/(5π). For a Sage description of the 10th and 100th coefficient of F, one can proceed as described in 7.A.14. In fact, the Fourier series is a sine series, hence we need the function f.fourier_series_sine_coefficient(k), and the implementation of the method is as follows:
f1(x)=-1; f2(x)=0; f3(x)=1
f=piecewise([[(-pi, -pi/3), f1], [[-pi/3, pi/3], f2],[(pi/3, pi), f3]])
Fc5=f.fourier_series_sine_coefficient(5)
show("5th coefficient = ",Fc5)
Fc10=f.fourier_series_sine_coefficient(10)
show("10th coefficient = ",Fc10)
Fc100=f.fourier_series_sine_coefficient(100)
show("100th coefficient = ",Fc100)
This confirms the value of the 5th coefficient and additionally provides the values of the other two coefficients in question, namely: 10th coefficient = −3/(10π), 100th coefficient = −3/(100π). Can you guess which coefficient is the 1000th? □
7.D.9. (Reverse) sawtooth function. Consider the function defined by f(x) = (π − x)/2 if 0 < x ≤ 2π, and f(x) = f(x + 2π) otherwise.
(1) Describe the given function f for x ∈ (−2π, 4π], and next use Sage to plot its graph over this interval.
(2) Find the Fourier series of f and next use Sage to confirm the result.
(3) Use Sage to plot the partial sums F5, F8, F15 and F50, in one single figure.
(4) Examine the convergence of the Fourier series and show that π/4 = 1 − 1/3 + 1/5 − 1/7 + · · · .
Solution. (1) The given function f provides the 2π-periodic extension of the function (π − x)/2, the latter defined on (0, 2π]. Hence, it should be f(x) = f(x + 2π) = (π − (x + 2π))/2 = (−π − x)/2 for all x ∈ (−2π, 0], and f(x) = f(x − 2π) = (π − (x − 2π))/2 = (3π − x)/2 for all x ∈ (2π, 4π]. That is, f(x) = (−π − x)/2 if −2π < x ≤ 0, f(x) = (π − x)/2 if 0 < x ≤ 2π, and f(x) = (3π − x)/2 if 2π < x ≤ 4π. To sketch the graph of this piecewise function, use the piecewise command, as before. For clarity, you may also mark the endpoints where f is defined (e.g., with black dots).
This can be done using the command point, as follows:
f1(x)= (pi-x)/2;f2(x)= (-pi-x)/2;f3(x)= (3*pi-x)/2
f = piecewise([[(-2*pi, 0),f2], [(0, 2*pi),f1], [(2*pi, 4*pi), f3]])
p=f.plot(x, -2*pi, 4*pi, exclude=[0, 2*pi], legend_label=r"$f$")
p0=point([0, -pi/2], size=30, color="black")
p1=point([2*pi, -pi/2], size=30, color="black")
p2=point([4*pi, -pi/2], size=30, color="black")
show(p+p0+p1+p2)
This produces the required plot.
(2) By assumption the period is 2π, so the Fourier series has the form F(x) = a0/2 + Σ_{n=1}^{∞} (an cos(nx) + bn sin(nx)), where a0 = (1/π) ∫_0^{2π} f(x) dx, an = (1/π) ∫_0^{2π} f(x) cos(nx) dx, bn = (1/π) ∫_0^{2π} f(x) sin(nx) dx.
First we will prove that an = 0 for all n. It is very easy to see that a0 = 0. Next, based on integration by parts, we get
an = (1/π) ∫_0^{2π} f(x) cos(nx) dx = (1/(2π)) ∫_0^{2π} (π − x) cos(nx) dx = (1/2) ∫_0^{2π} cos(nx) dx − (1/(2π)) ∫_0^{2π} x cos(nx) dx = (1/2) [sin(nx)/n]_0^{2π} − (1/(2π)) ∫_0^{2π} x (sin(nx)/n)′ dx = (1/(2n)) (sin(2nπ) − sin(0)) − (1/(2π)) ([x sin(nx)/n]_0^{2π} − (1/n) ∫_0^{2π} sin(nx) dx) = 0 − (1/(2nπ)) (2π sin(2nπ) − 0) + (1/(2nπ)) [−cos(nx)/n]_0^{2π} = 0 − 0 − (1/(2n²π)) (cos(2nπ) − cos(0)) = 0,
for all positive integers n. In a similar way, for bn we compute
bn = (1/π) ∫_0^{2π} f(x) sin(nx) dx = (1/(2π)) ∫_0^{2π} (π − x) sin(nx) dx = (1/2) ∫_0^{2π} sin(nx) dx − (1/(2π)) ∫_0^{2π} x sin(nx) dx = −(1/(2n)) [cos(nx)]_0^{2π} + (1/(2π)) ∫_0^{2π} x (cos(nx)/n)′ dx = −(1/(2n)) (cos(2nπ) − cos(0)) + (1/(2π)) ([x cos(nx)/n]_0^{2π} − (1/n) ∫_0^{2π} cos(nx) dx) = −(1/(2n)) (1 − 1) + (1/(2nπ)) (2π cos(2nπ) − 0) − (1/(2nπ)) [sin(nx)/n]_0^{2π} = 2π/(2nπ) − (1/(2n²π)) (sin(2nπ) − sin(0)) = 1/n.
Thus the Fourier series associated to f is given by F(x) = Σ_{n=1}^{∞} sin(nx)/n. To confirm this expression by Sage, one can compute the Fourier coefficients, as usual, i.e.,
var("n"); f(x)=(pi-x)/2; f = piecewise([[(0, 2*pi),f]])
an=(1/pi)*integral(((pi-x)/2)*cos(n*x), x, 0, 2*pi)
bn=(1/pi)*integral(((pi-x)/2)*sin(n*x), x, 0, 2*pi)
show("a_n=", an.expand()); show("b_n=", bn.expand())
(3) Recall that for the partial sums, Sage provides the built-in function fourier_series_partial_sum. For our case, the implementation can be done by continuing to type the following in the previous block:
partS5=f.fourier_series_partial_sum(5,pi); show("5th partial sum = ",partS5)
partS8=f.fourier_series_partial_sum(8,pi); show("8th partial sum = ",partS8)
partS15=f.fourier_series_partial_sum(15,pi); show("15th partial sum = ",partS15)
partS50=f.fourier_series_partial_sum(50,pi); show("50th partial sum = ",partS50)
Run the whole syntax yourselves to obtain the corresponding expressions of the partial sums F5, F8, F15 and F50. To plot them, execute the block given below (notice that this is independent of the previous cells).
f1(x)= (pi-x)/2; f2(x)= (-pi-x)/2; f3(x)= (3*pi-x)/2
f = piecewise([[(-2*pi, 0),f2], [(0, 2*pi),f1], [(2*pi, 4*pi), f3]])
partS5=f.fourier_series_partial_sum(5,pi);partS8=f.fourier_series_partial_sum(8,pi)
partS15=f.fourier_series_partial_sum(15,pi);partS50=f.fourier_series_partial_sum(50,pi)
p=f.plot(x, -2*pi, 4*pi, exclude=[0, 2*pi], legend_label=r"$f$")
s5=plot(partS5, x, -2*pi, 4*pi, linestyle="--", color="midnightblue", legend_label=r"$F_5$")
s8=plot(partS8, x, -2*pi, 4*pi, linestyle="--", color="darkgreen", legend_label=r"$F_8$")
s15=plot(partS15, x, -2*pi, 4*pi, linestyle="--", color="darkorange", legend_label=r"$F_{15}$")
s50=plot(partS50, x, -2*pi, 4*pi, linestyle="--", color="crimson", legend_label=r"$F_{50}$")
(p+s5+s8+s15+s50).show(figsize=8, ticks=pi/2, tick_formatter=pi)
This generates the corresponding figure. We should mention that in this block f was introduced as in the first part (see (1)), with the aim of displaying a larger portion of the graph of f. To verify the computations in part (2), however, it was adequate to use only the function (π − x)/2 on its initial domain.
(4) By the figure above, it is clear that the Fourier series F(x) of f converges to f(x) at each point x where f is continuous, which means that for any x ∈ (0, 2π) we have
(π − x)/2 = F(x) = Σ_{n=1}^{∞} sin(nx)/n = sin(x) + sin(2x)/2 + sin(3x)/3 + sin(4x)/4 + sin(5x)/5 + · · · .
This applies to x = π/2 and gives π/4 = sin(π/2) + sin(π)/2 + sin(3π/2)/3 + sin(2π)/4 + sin(5π/2)/5 + · · · = 1 + 0 − 1/3 + 0 + 1/5 − · · · .
On the other hand, the points of discontinuity of f appear at x = 2kπ, for k ∈ Z. The average value of our function at these points equals 0, and hence the Fourier expansion of f converges to 0 at any jump discontinuity of f. □
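As a quick numerical complement (our own snippet, not part of the text), one can watch the partial sums of the series at x = π/2 approach π/4; the convergence of the Leibniz series is notoriously slow.
# partial sum of sum(sin(n*pi/2)/n) up to n = 199, compared with pi/4
partial = sum(sin(k*pi/2)/k for k in srange(1, 200))
print(partial.n(), (pi/4).n())   # approx. 0.7829 vs 0.7854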
7.D.10. Compute the complex version of the Fourier series of the 2π-periodic extension of the following function (often referred to as the "pulse wave"): g(x) = 0 if −π < x < 0, and g(x) = 1 if 0 < x < π. ⃝
7.D.11. Consider the exponential mapping f(x) = eˣ with x ∈ [0, 1). Use Sage to implement the following:
(1) Determine and illustrate the 1-periodic extension of f for x ∈ [−1, 2) (with at least two different Sage methods);
(2) Compute the Fourier series of f on the interval [0, 1), and then illustrate its 5th, 15th and 25th partial sums (with at least two different Sage methods);
(3) Determine the convergence of the Fourier series obtained above.
Solution. (1) The 1-periodic extension of f on the interval [−1, 2) is given by f(x + 1) for −1 ≤ x < 0, by f(x) for 0 ≤ x < 1, and by f(x − 1) for 1 ≤ x < 2. In Sage, one method utilizes the piecewise function, while another relies on the def procedure for defining piecewise functions. The implementation goes as follows:
f=piecewise([[RealSet.closed_open(-1, 0), e^(x+1)], [(0, 1), e^x], [(1, 2), e^(x-1)]])
a=f.plot(x, -1, 2, thickness=1.5, color="grey", exclude=[0, 1])
p0=point([-1, 1], size=30, color="black")
p1=point([0, 1], size=30, color="black")
p2=point([1, 1], size=30, color="black")
show(a+p0+p1+p2)
var("t")
def g(t):
    if 0 <= t < 1:
        return e^t
    elif -1 <= t < 0:
        return e^(t+1)
    elif 1 <= t < 2:
        return e^(t-1)
b= plot(g, (-1, 2), exclude=[0, 1], thickness=1.5, color="grey")
p0=point([-1, 1], size=30, color="black")
p1=point([0, 1], size=30, color="black")
p2=point([1, 1], size=30, color="black")
show(b+p0+p1+p2)
They both produce the same figure. Notice that, as in 7.D.9, both blocks include code for marking with black dots the endpoints where the periodic extension is defined.
(2) As usual, it is sufficient to compute the Fourier coefficients. However, one should carefully apply the general formulas from 7.1.6, and here is a possible implementation:
f=piecewise([[(0, 1), e^x]]); T=1; n=var("n"); Om=pi
an=(2/T)*integral((e^x)*cos(2*n*Om*x), x, 0,1); show("a_n=", an.expand())
bn=(2/T)*integral((e^x)*sin(2*n*Om*x), x, 0,1); show( "b_n=", bn.expand())
Sage's answer has the form
a_n = 4πn e sin(2πn)/(4π²n² + 1) + 2 e cos(2πn)/(4π²n² + 1) − 2/(4π²n² + 1),
b_n = −4πn e cos(2πn)/(4π²n² + 1) + 4πn/(4π²n² + 1) + 2 e sin(2πn)/(4π²n² + 1).
Recalling that sin(2nπ) = 0 and cos(2nπ) = 1 for all n, this means that an = 2(e − 1)/(1 + 4n²π²) (n ∈ N) and bn = 4nπ(1 − e)/(1 + 4n²π²) (n ∈ Z+), respectively. It follows that the Fourier series of the function at hand has the form
F(x) = e − 1 + 2(e − 1) Σ_{n=1}^{∞} cos(2nπx)/(1 + 4n²π²) + 4π(1 − e) Σ_{n=1}^{∞} n sin(2nπx)/(1 + 4n²π²). (♭)
Now, to plot the required partial sums, we can proceed as in the previous tasks (cf. 7.D.9), i.e.,
f=piecewise([[(0, 1), e^x]])
partS5 = f.fourier_series_partial_sum(5,1/2)
partS15 = f.fourier_series_partial_sum(15,1/2)
partS25 = f.fourier_series_partial_sum(25,1/2)
a = f.plot(x, 0, 1, thickness=1.5, color="black")
s5=plot(partS5, x, -1, 1, linestyle="--", color="grey")
s15=plot(partS15, x, -1, 1, linestyle="--", color="grey")
s25=plot(partS25, x, -1, 1, linestyle="--", color="grey")
show(a+s5);show(a+s15);show(a+s25)
Notice that in the first line of this block we could replace the open set (0, 1) with the correct set on which f is defined, i.e., the set [0, 1). This could be done by the command RealSet.closed_open(0, 1), but it is not essential, in the sense that it does not affect the Fourier series, as previously mentioned in ??. This block produces three figures, showing the partial sums F5, F15 and F25.
An alternative method to derive the partial sums utilizes a combination of certain built-in functions in Sage, namely fourier_series_sine_coefficient and fourier_series_cosine_coefficient. Below we present the case of the 5th partial sum; the remaining cases are treated similarly.
f=piecewise([[(0, 1), e**x]])
a5=[f.fourier_series_cosine_coefficient(n, 1/2) for n in range(6)]
b5=[f.fourier_series_sine_coefficient(n, 1/2) for n in range(6)]
fs5=(a5[0]/2)+(sum([a5[k]*cos(2*k*pi*x) for k in range(1, 6)])) +(sum([b5[k]*sin(2*k*pi*x) for k in range(1, 6)]))
print(bool(fs5.expand()==f.fourier_series_partial_sum(5,1/2).expand()))
a=f.plot(x, 0, 1, thickness=1.5, color="black")
b=fs5.plot(x, 0, 1, linestyle="--")
(a+b).show(figsize=8)
In this block we programmed Sage to compare the expression of the 5th partial sum, denoted by fs5 and defined manually, with the one derived via the usual method, i.e., by the command fourier_series_partial_sum. This comparison is performed by the bool command, and Sage's answer is True. One could also obtain the explicit form of the partial sums in question, which we skip to save some space. As a side remark, we finally mention that for a formal verification of the Fourier coefficients an, bn one can use the following formulas:
∫ eˣ cos(αx) dx = eˣ (α sin(αx) + cos(αx))/(1 + α²) + C, ∫ eˣ sin(βx) dx = eˣ (sin(βx) − β cos(βx))/(1 + β²) + C,
with α, β ∈ R. Both of them can be derived by integration by parts (see 6.B.7 for the second one).
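Such antiderivative formulas are easy to validate symbolically; the following cell (our own sketch) differentiates the right-hand side of the first formula and checks that the integrand is recovered:
var("a")
F = e^x*(a*sin(a*x) + cos(a*x))/(1 + a^2)
print((diff(F, x) - e^x*cos(a*x)).simplify_full())   # prints 0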
(3) The 1-periodic extension of f(x) = eˣ is continuous at every x ∈ (0, 1). Hence, according to Dirichlet's theorem (see ??), we have eˣ = F(x) for all x ∈ (0, 1), where F is given in (♭). At the jump points (the integers, where the extension jumps from e to 1), the series converges to the average value (1 + e)/2. □
7.D.12. Consider again the function g(x) = eˣ (cf. 7.D.11).
(1) Derive a cosine Fourier series of g on the interval [0, 1] and illustrate some of its partial sums;
(2) Derive a sine Fourier series of g on the interval (0, 1] and illustrate some of its partial sums;
Note the differences between the three cases as shown in the diagrams. The approximation differs in relation to the continuity. The first diagram uses n = 5, the second uses n = 3, while the last diagram uses n = 10. We use the formulae (α, β ∈ R)
(1) ∫ eˣ cos(αx) dx = eˣ (α sin(αx) + cos(αx))/(1 + α²) + C,
(2) ∫ eˣ sin(βx) dx = eˣ (sin(βx) − β cos(βx))/(1 + β²) + C,
both of which can be obtained by two integrations by parts. Actually, the second one was computed in detail in ??(d). We obtain
(a) an = 2 ∫_0^1 eˣ cos(2nπx) dx = 2 [eˣ (2nπ sin(2nπx) + cos(2nπx))/(1 + 4n²π²)]_0^1 = 2(e − 1)/(1 + 4n²π²), n ∈ N ∪ {0},
bn = 2 ∫_0^1 eˣ sin(2nπx) dx = 2 [eˣ (sin(2nπx) − 2nπ cos(2nπx))/(1 + 4n²π²)]_0^1 = 4nπ(1 − e)/(1 + 4n²π²), n ∈ N;
(b) an = 2 ∫_0^1 eˣ cos(nπx) dx = 2 [eˣ (nπ sin(nπx) + cos(nπx))/(1 + n²π²)]_0^1 = 2((−1)ⁿ e − 1)/(1 + n²π²), n ∈ N ∪ {0};
(c) bn = 2 ∫_0^1 eˣ sin(nπx) dx = 2 [eˣ (sin(nπx) − nπ cos(nπx))/(1 + n²π²)]_0^1 = 2nπ(1 + (−1)^{n+1} e)/(1 + n²π²), n ∈ N.
Substitution then yields the corresponding Fourier series for g(x):
(a) e − 1 + 2(e − 1) Σ_{n=1}^{∞} cos(2nπx)/(1 + 4n²π²) + 4π(1 − e) Σ_{n=1}^{∞} n sin(2nπx)/(1 + 4n²π²);
(b) e − 1 + 2 Σ_{n=1}^{∞} ((−1)ⁿ e − 1) cos(nπx)/(1 + n²π²);
(c) 2π Σ_{n=1}^{∞} n(1 + (−1)^{n+1} e) sin(nπx)/(1 + n²π²).
7.D.13. Determine the cosine Fourier series for a periodic extension of the function g(x) = 1 for x ∈ [0, 1), g(x) = 0 for x ∈ [1, 4). Determine also the sine Fourier series for f(x) = x − 1 for x ∈ (0, 2), f(x) = 3 − x for x ∈ [2, 4).
Solution. We have already encountered the construction of a cosine Fourier series in ??(b) and also in 7.A.15(b). It is the case of the Fourier series of an even function. Therefore, the first thing to do is to extend the definition of the function g to the interval (−4, 0) so that it is even. This means g(x) := 1 for x ∈ (−1, 0), g(x) := 0 for x ∈ (−4, −1]. Now we can consider its periodic extension onto the whole R, with x0 = −4, period T = 8 and ω = π/4. Necessarily, bn = 0 for all n ∈ N in a cosine Fourier series. We determine the Fourier coefficients an, n ∈ N:
a0 = (2/T) ∫_{x0}^{x0+T} g(x) dx = (2/8) ∫_{−1}^{1} 1 dx = (1/2) ∫_0^{1} 1 dx = 1/2,
an = (2/T) ∫_{x0}^{x0+T} g(x) cos(nωx) dx = (1/2) ∫_0^{1} cos(nπx/4) dx = (2/(nπ)) sin(nπ/4).
Here we use the fact
(1) ∫_{−a}^{a} f(x) dx = 2 ∫_0^{a} f(x) dx,
which is valid for every even function f integrable on the interval [0, a]. The expression sin(nπ/4) is conveniently left as is, rather than the alternative of sorting out when it attains which of its five different values. Thus we write the cosine Fourier series in the form 1/4 + Σ_{n=1}^{∞} (2/(nπ)) sin(nπ/4) cos(nπx/4).
The sine Fourier series of the second function can be determined analogously from the odd extension of the given segment. Again, T = 8 and ω = π/4 for the function f. This time it is the coefficients an, n ∈ N ∪ {0}, which are zero. To find the remaining coefficients, use integration by parts and (1) (the product of two odd functions is an even function).
For all n ∈ N,
bn = (2/T) ∫_{x0}^{x0+T} f(x) sin(nωx) dx = (1/2) (∫_0^{2} (x − 1) sin(nπx/4) dx − ∫_2^{4} (x − 3) sin(nπx/4) dx) = [(1 − x)(2/(nπ)) cos(nπx/4)]_0^{2} + [(8/(n²π²)) sin(nπx/4)]_0^{2} − [(3 − x)(2/(nπ)) cos(nπx/4)]_2^{4} − [(8/(n²π²)) sin(nπx/4)]_2^{4} = (2/(nπ)) ((−1)ⁿ − 1) + (16/(n²π²)) sin(nπ/2).
Immediately, bn = 0 when n is even. So the sine Fourier series can be written as
Σ_{n=1}^{∞} ((2/(nπ)) ((−1)ⁿ − 1) + (16/(n²π²)) sin(nπ/2)) sin(nπx/4) = Σ_{n=1}^{∞} (−4/((2n−1)π) + (−1)^{n−1} 16/((2n−1)²π²)) sin((2n−1)πx/4). □
7.D.14. Write the Fourier series of the π-periodic function which equals cosine on the interval (−π/2, π/2). Then write the cosine Fourier series of the 2π-periodic function y = |cos x|.
Solution. We are looking for one Fourier series only, since the second part of the problem is just a reformulation of the first part. Therefore, we construct the Fourier series for the function g(x) = cos x, x ∈ [−π/2, π/2]. Since g is even, bn = 0, n ∈ N. We compute
an = (2/π) ∫_{−π/2}^{π/2} cos x cos(2nx) dx = (2/π) ∫_{−π/2}^{π/2} (1/2)(cos((2n + 1)x) + cos((2n − 1)x)) dx = (1/π) [sin((2n+1)x)/(2n+1) + sin((2n−1)x)/(2n−1)]_{−π/2}^{π/2} = (2/π) ((−1)ⁿ/(2n+1) + (−1)^{n+1}/(2n−1)) = (4/π) (−1)^{n+1}/(4n²−1)
for every n ∈ N. The calculation is also valid for n = 0, thus a0 = 4/π. The desired Fourier series is
2/π + (4/π) Σ_{n=1}^{∞} ((−1)^{n+1}/(4n²−1)) cos(2nx). □
7.D.15. Sum the two series Σ_{n=1}^{∞} 1/n⁴ and Σ_{n=1}^{∞} (−1)^{n+1}/n⁴.
Solution. We hint at the procedure by which the series Σ_{n=1}^{∞} 1/n^{2k}, Σ_{n=1}^{∞} (−1)^{n+1}/n^{2k} for general k ∈ N can be calculated. Use the identities
(1) x = π − 2 Σ_{n=1}^{∞} sin(nx)/n, x ∈ (0, 2π),
(2) x² = 4π²/3 + 4 Σ_{n=1}^{∞} cos(nx)/n² − 4π Σ_{n=1}^{∞} sin(nx)/n, x ∈ (0, 2π),
which follow from the constructions of the Fourier series for the functions g(x) = x and g(x) = x², respectively, on the interval [0, 2π). By (1), Σ_{n=1}^{∞} sin(nx)/n = (π − x)/2, x ∈ (0, 2π). Substituting into (2) gives Σ_{n=1}^{∞} cos(nx)/n² = (3x² − 6πx + 2π²)/12, x ∈ (0, 2π). Since the values of the series Σ_{n=1}^{∞} 1/n² = π²/6 and Σ_{n=1}^{∞} (−1)^{n+1}/n² = π²/12 have already been determined, substitution then proves the validity of this last equation at the marginal points x = 0, x = 2π. The left-hand series is evidently bounded from above by Σ_{n=1}^{∞} 1/n², thus it converges absolutely and uniformly on [0, 2π]. Therefore, it can be integrated term by term:
Σ_{n=1}^{∞} sin(nx)/n³ = Σ_{n=1}^{∞} [−cos(ny)/n³]_0^{x} = ∫_0^{x} Σ_{n=1}^{∞} cos(ny)/n² dy = ∫_0^{x} (3y² − 6πy + 2π²)/12 dy = (x³ − 3πx² + 2π²x)/12, x ∈ [0, 2π].
In fact, every Fourier series may be integrated term by term. Further integration gives
Σ_{n=1}^{∞} (1 − cos(nx))/n⁴ = Σ_{n=1}^{∞} [−cos(ny)/n⁴]_0^{x} = ∫_0^{x} Σ_{n=1}^{∞} sin(ny)/n³ dy = ∫_0^{x} (y³ − 3πy² + 2π²y)/12 dy = (x⁴ − 4πx³ + 4π²x²)/48, x ∈ [0, 2π].
Substituting x = π leads to Σ_{n=1}^{∞} (1 + (−1)^{n+1})/n⁴ = Σ_{n=1}^{∞} (1 − cos(nπ))/n⁴ = π⁴/48. Since the numerator on the left-hand side is zero for even numbers n and is 2 for odd numbers n, the series can be written as
(3) Σ_{n=1}^{∞} 2/(2n − 1)⁴ = π⁴/48.
From the expression Σ_{n=1}^{∞} 1/n⁴ = Σ_{n=1}^{∞} 1/(2n)⁴ + Σ_{n=1}^{∞} 1/(2n−1)⁴ = (1/16) Σ_{n=1}^{∞} 1/n⁴ + Σ_{n=1}^{∞} 1/(2n−1)⁴, it follows that Σ_{n=1}^{∞} 1/n⁴ = (16/15) Σ_{n=1}^{∞} 1/(2n−1)⁴ = (16/15) · (1/2) · (π⁴/48) = π⁴/90, thereby having summed up the first series. As for the second one,
Σ_{n=1}^{∞} (−1)^{n+1}/n⁴ = Σ_{n=1}^{∞} 1/(2n−1)⁴ − Σ_{n=1}^{∞} 1/(2n)⁴ = Σ_{n=1}^{∞} 1/(2n−1)⁴ − (1/16) Σ_{n=1}^{∞} 1/n⁴ = (1/2) · (π⁴/48) − (1/16) · (π⁴/90) = 7π⁴/720.
One can proceed similarly to sum the series Σ_{n=1}^{∞} 1/n^{2k}, Σ_{n=1}^{∞} (−1)^{n+1}/n^{2k} for other k ∈ N. It is natural to ask for the value of the series Σ_{n=1}^{∞} 1/n³.
This problem has been tackled by mathematicians for centuries without success. The reader may justifiably be surprised by this, since the procedure above is applicable to all the odd powers as well. For instance, one can start with the identity Σ_{n=1}^{∞} cos(nx)/n = −ln(2 sin(x/2)), x ∈ (0, 2π), which, by the way, can be proved by expanding the right-hand function into a Fourier series. If, similarly to the above, we integrate the left-hand series term by term twice and substitute x → 0+ in the limit, we get the series Σ_{n=1}^{∞} 1/n³. Thus, it should suffice to integrate the right-hand function twice and calculate one limit. However, the integration of the right-hand side leads to a non-elementary integral. That is, the antiderivative cannot be expressed in terms of the elementary functions.⁶ □
⁶The function ζ(p) = Σ_{n=1}^{∞} 1/n^p is called the Riemann zeta function.
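For a quick numerical confirmation of the two sums obtained above, one may run the following cell (our own snippet, in the style of the earlier Sage verifications):
# compare partial sums of the two series with pi^4/90 and 7*pi^4/720
s1 = sum(1.0/k^4 for k in srange(1, 5001))
s2 = sum((-1.0)^(k+1)/k^4 for k in srange(1, 5001))
print(s1, (pi^4/90).n())     # both approx. 1.082323
print(s2, (7*pi^4/720).n())  # both approx. 0.947033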
7.D.16. Determine the function f whose Fourier transform is the function f̃(ω) = (1/√(2π)) (sin ω)/ω, ω ≠ 0.
Solution. We might have noticed that the sinc function appeared as the image of the characteristic function hΩ of the interval (−Ω, Ω) in one of the previous problems: h̃Ω(ω) = (2Ω/√(2π)) sinc(ωΩ). In this case Ω = 1, and the function f is the half of h1. The result can also be computed directly. The inverse Fourier transform is
f(t) = (1/(2π)) ∫_{−∞}^{∞} ((sin ω)/ω) e^{iωt} dω = (1/(2π)) (∫_{−∞}^{0} ((sin ω)/ω) e^{iωt} dω + ∫_0^{∞} ((sin ω)/ω) e^{iωt} dω).
Substitute −ω for ω in the first integral, to obtain
f(t) = (1/(2π)) (∫_0^{∞} ((sin ω)/ω) e^{−iωt} dω + ∫_0^{∞} ((sin ω)/ω) e^{iωt} dω) = (1/(2π)) ∫_0^{∞} ((sin ω)/ω) (e^{−iωt} + e^{iωt}) dω = (1/π) ∫_0^{∞} ((sin ω)/ω) cos(ωt) dω.
Continue via trigonometric identities to obtain f(t) = (1/(2π)) (∫_0^{∞} sin(ω(1+t))/ω dω + ∫_0^{∞} sin(ω(1−t))/ω dω). The substitutions u = ω(1 + t), v = ω(1 − t) then give
f(t) = (1/(2π)) (∫_0^{∞} (sin u)/u du − ∫_0^{∞} (sin v)/v dv) = 0 for t > 1;
f(t) = (1/(2π)) (∫_0^{∞} (sin u)/u du + ∫_0^{∞} (sin v)/v dv) = (1/π) ∫_0^{∞} (sin u)/u du for t ∈ (−1, 1);
f(t) = (1/(2π)) (−∫_0^{∞} (sin u)/u du + ∫_0^{∞} (sin v)/v dv) = 0 for t < −1.
Thus the function f is zero for |t| > 1 and constant (necessarily non-zero) for |t| < 1. (Throughout, we assume that the inverse Fourier transform exists.) The constant is f(t) = 1/2 for |t| < 1, from the standard result ∫_0^{∞} (sin u)/u du = π/2. Alternatively, we can 'guess' that the constant is one, i.e. g(t) = 1 for |t| < 1, g(t) = 0 for |t| > 1, and compute F(g)(ω) = (1/√(2π)) ∫_{−1}^{1} e^{−iωt} dt = (2/√(2π)) ∫_0^{1} cos(ωt) dt = (2/√(2π)) (sin ω)/ω. So f(0) = g(0)/2 = 1/2, which also establishes ∫_0^{∞} (sin u)/u du = π/2. □
7.D.17. Using the relation
(1) L(f′)(s) = s L(f)(s) − lim_{t→0+} f(t),
derive the Laplace transforms of both the functions y = cos t and y = sin t.
Solution. Notice first that from (1) it follows that L(f′′)(s) = s L(f′)(s) − lim_{t→0+} f′(t) = s (s L(f)(s) − lim_{t→0+} f(t)) − lim_{t→0+} f′(t) = s² L(f)(s) − s lim_{t→0+} f(t) − lim_{t→0+} f′(t). Therefore,
−L(sin t)(s) = L(−sin t)(s) = L((sin t)′′)(s) = s² L(sin t)(s) − s lim_{t→0+} sin t − lim_{t→0+} cos t = s² L(sin t)(s) − 1,
whence we get −L(sin t)(s) = s² L(sin t)(s) − 1, i.e. L(sin t)(s) = 1/(s²+1). Now, invoking (1), we determine L(cos t)(s) = L((sin t)′)(s) = s · 1/(s²+1) − lim_{t→0+} sin t = s/(s²+1). □
7.D.18. Using the discussion from the previous problem, prove that for a continuous function y with sufficiently many higher order derivatives,
L(y^{(n)})(s) = sⁿ L(y)(s) − Σ_{i=1}^{n} s^{n−i} y^{(i−1)}(0).
Solution. Clearly L(y′)(s) = s L(y)(s) − y(0) and L(y′′)(s) = s² L(y) − s y(0) − y′(0), and the claim is verified by induction. □
7.D.19. Find the Laplace transform of Heaviside's function H(t) and, for real a, the shifted Heaviside's function Ha(t) = H(t − a): H(t) = 0 for t < 0, H(t) = 1/2 for t = 0, H(t) = 1 for t > 0.
Solution. L(H(t))(s) = ∫_0^{∞} H(t) e^{−st} dt = ∫_0^{∞} e^{−st} dt = [−e^{−st}/s]_0^{∞} = −(1/s)(0 − 1) = 1/s,
L(Ha(t))(s) = L(H(t − a))(s) = ∫_0^{∞} H(t − a) e^{−st} dt = ∫_a^{∞} e^{−st} dt = ∫_0^{∞} e^{−s(t+a)} dt = e^{−as} L(H(t))(s) = e^{−as}/s. □
7.D.20. Show that for real a,
(1) L(f(t) · Ha(t))(s) = e^{−as} L(f(t + a))(s).
Solution. L(f(t)Ha(t))(s) = ∫_0^{∞} f(t)H(t − a) e^{−st} dt = ∫_a^{∞} f(t) e^{−st} dt = ∫_0^{∞} f(t + a) e^{−s(t+a)} dt = e^{−as} ∫_0^{∞} f(t + a) e^{−st} dt = e^{−as} L(f(t + a))(s). □
7.D.21. Find a function y(t) satisfying the differential equation y′′(t) + 4y(t) = sin 2t and the initial conditions y(0) = 0 and y′(0) = 0.
Solution. From the example 7.D.18: s² L(y)(s) + 4 L(y)(s) = L(sin 2t)(s). Now, by 7.B.17(d), L(sin 2t)(s) = 2/(s² + 4). It follows that L(y)(s) = 2/(s² + 4)². The inverse transform then gives y(t) = (1/8) sin 2t − (1/4) t cos 2t. □
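Even without carrying out the inverse transform by hand, the answer is easy to check symbolically; the following cell (our own sketch) verifies the differential equation and the initial conditions:
var("t")
y = sin(2*t)/8 - t*cos(2*t)/4
print((diff(y, t, 2) + 4*y - sin(2*t)).simplify_full())  # prints 0
print(y.subs(t=0), diff(y, t).subs(t=0))                 # prints 0 0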
7.D.22. Find a function y(t) satisfying the differential equation and the initial conditions y′′(t) + 4y(t) = f(t), y(0) = 0, y′(0) = −1, where f(t) is the piecewise continuous function f(t) = cos(2t) for 0 ≤ t < π, and f(t) = 0 for t ≥ π.
Solution. This problem is a model for the undamped oscillations of a spring (excluding friction and other phenomena like non-linearities in the toughness of the spring, and so on). It is initiated by an exterior force during the initial period only, which then ceases. The function f(t) can be written as a linear combination of Heaviside's function H(t) and its shift. That is, f(t) = cos(2t)(H(t) − Hπ(t)), up to the values t = 0, t = π, . . . Since L(y′′)(s) = s² L(y) − s y(0) − y′(0) = s² L(y) + 1, we get, making use of the previous example 7.D.20 for the right-hand side,
s² L(y) + 1 + 4 L(y) = L(cos(2t)(H(t) − Hπ(t))) = L(cos(2t) · H(t)) − L(cos(2t) · Hπ(t)) = L(cos(2t)) − e^{−πs} L(cos(2(t + π))) = (1 − e^{−πs}) s/(s² + 4).
Hence, L(y) = −1/(s² + 4) + (1 − e^{−πs}) s/(s² + 4)². The inverse transform then yields the solution in the form
y(t) = −(1/2) sin(2t) + (1/4) t sin(2t) − L^{−1}(e^{−πs} s/(s² + 4)²).
According to (1) of 7.D.20,
L^{−1}(e^{−πs} s/(s² + 4)²) = (1/4) L^{−1}(e^{−πs} L(t sin(2t))) = (1/4)(t − π) sin(2(t − π)) · Hπ(t).
Since the Heaviside function Hπ(t) is zero for t < π and equals 1 for t > π, we get the solution in the form y(t) = ((t − 2)/4) sin(2t) for 0 ≤ t < π, and y(t) = ((π − 2)/4) sin(2t) for t ≥ π. In particular, once the exterior force ceases, the spring keeps oscillating with constant amplitude. □
Key to the exercises
7.A.2. Similarly to 7.A.1, one can use the Gram–Schmidt process with respect to the mentioned scalar product. This gives the functions f1(x) = 1/x, f2(x) = 1/x² − 3/(4x), f3(x) = 1/x³ − 3/(2x²) + 13/(24x), and it is easy to check that the set {f1, f2, f3} forms an orthonormal basis. We also see that:
• The projection of the function 1/x⁴ has the form (15/32) f1 + (69/40) f2 + (9/4) f3, while the distance is √14/2240 ≈ 0.00167.
• The projection of the function x has the form 2 f1 + 96(−3/4 + ln(2)) f2 + 5760(−(3/2) ln(2) + 25/24) f3, while the distance is approximately 0.035.
Let us now illustrate the functions x and 1/x⁴ and their approximations. In the corresponding figure we see that the function 1/x⁴, which is the one whose shape is similar to that of one or more generators, is approximated much better by the projection.
7.A.7. We just have to check the orthogonality of the couples Lm(x), Ln(x) with respect to the inner product ⟨f, g⟩ω = ∫_0^{∞} f(t)g(t) e^{−t} dt. This can be done by integration by parts.
7.A.8. The claim follows from the fact that the powers x^k appear in the polynomials Tk or Lk for the first time. Thus the linear hulls of the first k functions always coincide.
7.A.9. We mention that since f, g are both piecewise continuous functions defined on [−π, π], the product fḡ also belongs to S0([−π, π]). Thus, the function fḡ is Riemann integrable, and (f, g) := ∫_{−π}^{π} f(x)g(x) dx is well-defined. The remaining details proving that ( , ) is an inner product on S0([−π, π]) rely on elementary properties of integrals. Next, we see that (√2/2, √2/2) = (1/π) ∫_{−π}^{π} 1/2 dx = (1/π)[x/2]_{−π}^{π} = π/π = 1. Moreover, for any positive integer n we have
(√2/2, sin(nx)) = (√2/(2π)) ∫_{−π}^{π} sin(nx) dx = −(√2/(2π)) [cos(nx)/n]_{−π}^{π} = −(√2/(2nπ)) (cos(nπ) − cos(−nπ)) = 0,
(√2/2, cos(nx)) = (√2/(2π)) ∫_{−π}^{π} cos(nx) dx = 0,
(sin(nx), sin(nx)) = (1/π) ∫_{−π}^{π} sin²(nx) dx = (1/(2π)) ∫_{−π}^{π} (1 − cos(2nx)) dx = (1/(2π))(π + π) − (1/(2π)) [sin(2nx)/(2n)]_{−π}^{π} = 1,
(cos(nx), cos(nx)) = (1/π) ∫_{−π}^{π} cos²(nx) dx = (1/(2π)) ∫_{−π}^{π} (1 + cos(2nx)) dx = 1.
Finally, using appropriate trigonometric identities, it is not hard to prove that (cos(mx), cos(nx)) = (cos(mx), sin(nx)) = (sin(mx), sin(nx)) = 0 for all positive integers m, n with m ≠ n. Moreover, we have (cos(mx), sin(nx)) = 0 also for m = n, and the claim follows.
7.A.12. A confirmation of the expression presented in 7.A.11 goes as follows:
f = piecewise([[(-pi,pi),abs(x)]]); L=pi; n=var("n")
an=(1/L)*integral(abs(x)*cos(n*pi*x/L), x, -pi, pi); show("a_n=", an.expand())
bn=(1/L)*integral(abs(x)*sin(n*pi*x/L), x, -pi, pi); show( "b_n=", bn.expand())
For the approximation with k = 5, and its graph, add to the previous block the cell
Fs5 = f.fourier_series_partial_sum(5,pi); show("5th partial sum = ",Fs5)
s5 = plot(Fs5, x, -pi-0.5, pi+0.5, linestyle="--", tick_formatter=[2*pi, None])
a=plot(f, x, -pi, pi, color="black", thickness=1.5)
show(a+s5)
Execute this code to obtain the explicit form of F5(x) and confirm graphically that it provides a pretty good approximation of the function at hand. The confirmation of the identity F5(x) = F6(x) can be done easily, and we leave this for practice.
7.A.13. The sign function has discontinuities at x = 0 and x = ±π. Since sgn(−x) = −sgn(x) for all x, we get an = 0 for all n ∈ N, and the Fourier series of sgn(x) is a sine series, F(x) = Σ_{n=1}^{∞} bn sin(nx). The function g(x) = sgn(x) sin(nx) is even, as a product of two odd functions, thus
bn = (1/π) ∫_{−π}^{π} sgn(x) sin(nx) dx = (2/π) ∫_0^{π} sin(nx) dx = (2/π) [−cos(nx)/n]_0^{π} = (2/π) (−cos(nπ)/n + 1/n) = 2(1 − (−1)ⁿ)/(nπ).
We can simplify this by observing that bn = 0 if n = 2m, and bn = 4/(nπ) if n = 2m − 1, where m ∈ Z+. Therefore, the corresponding Fourier series has the form
F(x) = (2/π) Σ_{n=1}^{∞} ((1 − (−1)ⁿ)/n) sin(nx) = (4/π) Σ_{m=1}^{∞} (1/(2m − 1)) sin((2m − 1)x).
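The coefficients bn can be confirmed in Sage in the same spirit as the earlier verification cells; this small block is our own sketch:
n = var("n")
bn = (2/pi)*integral(sin(n*x), x, 0, pi)
show("b_n=", bn.expand())   # an expression equivalent to 2*(1-(-1)^n)/(n*pi) for integer n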
7.A.17. Since f is neither even nor odd, the complex Fourier coefficients will have both real and imaginary parts. In particular, using the properties of the exponential map, we compute
cn = (1/(2π)) ∫_{−π}^{π} eˣ e^{−inx} dx = (1/(2π)) ∫_{−π}^{π} e^{(1−in)x} dx = (1/(2π)) [e^{(1−in)x}/(1 − in)]_{−π}^{π} = (1/(2π(1 − in))) (e^{(1−in)π} − e^{−(1−in)π}).
Now it is easy to see that e^{(1−in)π} = e^{π} cos(nπ) and e^{−(1−in)π} = e^{−π} cos(nπ). Thus, recalling the identity 2 sinh(x) = eˣ − e^{−x}, we arrive at the following expression:
cn = (cos(nπ)/(2π(1 − in))) (e^{π} − e^{−π}) = (sinh(π)/π) ((1 + in)/(1 + n²)) cos(nπ) = (sinh(π)/π) ((1 + in)/(1 + n²)) (−1)ⁿ.
This also gives c0 = sinh(π)/π = (e^{π} − e^{−π})/(2π). As for c−n, we see that c−n = (sinh(π)/π) ((1 − in)/(1 + (−n)²)) cos(nπ) = (sinh(π)/π) ((1 − in)/(1 + n²)) (−1)ⁿ = c̄n, as required. Recalling now that e^{inx} + e^{−inx} = 2 cos(nx) and e^{inx} − e^{−inx} = 2i sin(nx), we finally get
F(x) = c0 + Σ_{n=1}^{∞} cn e^{inx} + Σ_{n=1}^{∞} c−n e^{−inx} = (sinh(π)/π) (1 + Σ_{n=1}^{∞} ((1 + in)/(1 + n²)) (−1)ⁿ e^{inx} + Σ_{n=1}^{∞} ((1 − in)/(1 + n²)) (−1)ⁿ e^{−inx}) = (sinh(π)/π) (1 + Σ_{n=1}^{∞} ((−1)ⁿ/(1 + n²)) [(1 + in) e^{inx} + (1 − in) e^{−inx}]) = (sinh(π)/π) (1 + Σ_{n=1}^{∞} ((−1)ⁿ/(1 + n²)) [e^{inx} + e^{−inx} + in(e^{inx} − e^{−inx})]) = (sinh(π)/π) (1 + 2 Σ_{n=1}^{∞} ((−1)ⁿ/(1 + n²)) [cos(nx) + i² n sin(nx)]) = (sinh(π)/π) (1 + 2 Σ_{n=1}^{∞} ((−1)ⁿ/(1 + n²)) [cos(nx) − n sin(nx)]).
In order to plot f together with the 20th partial sum of the Fourier series, use the cell given here:
f=piecewise([[(-pi,pi), e**x]])
a=plot(f, x, -pi, pi, color="steelblue", thickness=1.5)
Fs20 = f.fourier_series_partial_sum(20,pi)
s20 = plot(Fs20, x, -4*pi, 4*pi,color="grey", linestyle="--")
(a+s20).show(ticks=pi, tick_formatter=pi, figsize=8)
Running this cell we obtain the corresponding figure.
7.A.18. In this case the period is given by T = 2L = 2, i.e., L = 1. Thus F(x) = Σ_{n=−∞}^{∞} cn e^{inπx/L} = Σ_{n=−∞}^{∞} cn e^{inπx}, with cn = (1/(2L)) ∫_{−L}^{L} e^{−x} e^{−inπx/L} dx = (1/2) ∫_{−1}^{1} e^{−x} e^{−inπx} dx. We compute
cn = (1/2) ∫_{−1}^{1} e^{−(1+inπ)x} dx = (1/2) [e^{−(1+inπ)x}/(−(1 + inπ))]_{−1}^{1} = −(1/(2(1 + inπ))) (e^{−(1+inπ)} − e^{(1+inπ)}) = −(1/(2(1 + inπ))) (e^{−1} e^{−inπ} − e^{1} e^{inπ}) = ((−1)ⁿ/(1 + inπ)) · (e^{1} − e^{−1})/2 = ((−1)ⁿ/(1 + inπ)) sinh(1).
Here we used the identities e^{inπ} = e^{−inπ} = cos(nπ) = (−1)ⁿ. Hence, the Fourier series of f has the form
F(x) = Σ_{n=−∞}^{∞} ((−1)ⁿ/(1 + inπ)) sinh(1) e^{inπx} = sinh(1) Σ_{n=−∞}^{∞} ((−1)ⁿ (1 − inπ)/(1 + n²π²)) e^{inπx}.
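As a plausibility check of such closed forms (our own snippet, not part of the text), one can compare the defining integral with the derived expression for a particular coefficient, say c₃ of 7.A.17:
n0 = 3
c = (1/(2*pi))*integral(e^x*exp(-I*n0*x), x, -pi, pi)
claimed = (sinh(pi)/pi)*((1 + I*n0)/(1 + n0^2))*(-1)^n0
print(c.n(), claimed.n())   # the two printed values agree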
7.A.24. We need the even extension of f on (−π, π), which reads fev(x) = f(x) = sin(x) for 0 ≤ x < π, and fev(x) = f(−x) = −sin(x) for −π < x < 0. We may extend fev periodically all over R by setting fev(x + 2π) = fev(x) for all the other x. Notice that the values of this 2π-periodic extension at ±π coincide with f(π) = sin(π) = 0 (and similarly for ±Nπ, N ∈ N). Let us use Sage to illustrate this extension on (−π, π):
f1(x)=sin(x); f2(x)=-sin(x)
f = piecewise([[(-pi, 0),f2], [RealSet.closed_open(0,pi),f1]])
p=f.plot(x, -pi, pi, exclude=[0], color="black")
p.show(figsize=8, ticks=pi/2, tick_formatter=pi)
Now, for the cosine extension we have bn = 0, while the coefficients an are given by an = (2/π) ∫_0^{π} sin(x) cos(nx) dx. Using the identity 2 sin(θ) cos(φ) = sin(θ + φ) + sin(θ − φ), we obtain
an = (1/π) ∫_0^{π} (sin((1 + n)x) + sin((1 − n)x)) dx = (1/π) [−cos((1 + n)x)/(n + 1) − cos((1 − n)x)/(1 − n)]_0^{π} = (1/π) ((1 − cos((n + 1)π))/(n + 1) + (cos((1 − n)π) − 1)/(n − 1)) = (1/π) ((1 − (−1)^{n+1})/(n + 1) + ((−1)^{n+1} − 1)/(n − 1)) = (1/π) ((1 + (−1)ⁿ)/(n + 1) − ((−1)ⁿ + 1)/(n − 1)) = ((1 + (−1)ⁿ)/π) (1/(n + 1) − 1/(n − 1)) = −2(1 + (−1)ⁿ)/(π(n² − 1)),
which makes sense for all integers n > 1. To verify this expression in Sage, one may add to the block presented above the following cell:
L=pi; n=var("n"); an=(2/L)*integral(sin(x)*cos(n*pi*x/L), x, 0, pi); show( "a_n=", an.expand())
Moreover, it is easy to see that a1 = (2/π) ∫_0^{π} sin(x) cos(x) dx = 0 and a0 = (2/π) ∫_0^{π} sin(x) dx = 4/π. All together, we have proved that an = 4/π if n = 0; an = 0 if n ≥ 1 is odd; and an = −4/((n² − 1)π) if n > 1 is even. Therefore, for all x ∈ (−π, π), by Dirichlet's condition we have that
sin(x) = 2/π − (4/π) Σ_{k=1}^{∞} cos(2kx)/(4k² − 1) = 2/π − (4/π) (cos(2x)/3 + cos(4x)/15 + cos(6x)/35 + · · ·).
We may use Sage to illustrate the 3rd and 5th partial sums of the approximation, with brown and blue color respectively, together with the graph of fev for x ∈ (−π, π). It suffices to add to the initial block the following one:
Fs3 = f.fourier_series_partial_sum(3,pi)
Fs5 = f.fourier_series_partial_sum(5,pi)
s3 = plot(Fs3, x, -pi, pi,color="brown", linestyle="--")
s5 = plot(Fs5, x, -pi, pi, linestyle="--", color="steelblue")
(p+s3+s5).show(ticks=pi/2, tick_formatter=pi, figsize=8)
7.A.25. Here we need the odd extension of f, which is given by fodd(x) = f(x) = cos(x) for 0 < x < π, and fodd(x) = −f(−x) = −cos(x) for −π < x < 0. We may extend fodd over the whole real line by setting fodd(x + 2π) = fodd(x) for x outside (−π, π). At x = ±π the values of this extension are ∓1. Let us illustrate the extension for x ∈ (−π, π) via Sage, by the same method as above:
f1(x)=cos(x); f2(x)=-cos(x)
f = piecewise([[(-pi, 0),f2], [(0, pi),f1]])
p=f.plot(x, -pi, pi, exclude=[0])
p.show(figsize=8, ticks=pi/2, tick_formatter=pi)
Now, since we are interested in the sine expansion, we necessarily have an = 0 for all n ∈ N. Moreover, since L = π, we see that bn = (2/π) ∫_0^{π} cos(x) sin(nx) dx. To compute this integral, use the identity 2 sin(nx) cos(x) = sin((1 + n)x) + sin((n − 1)x), which gives
bn = (1/π) ∫_0^{π} (sin((1 + n)x) + sin((n − 1)x)) dx = −(1/π) [cos((n + 1)x)/(n + 1) + cos((n − 1)x)/(n − 1)]_0^{π} = 2n((−1)ⁿ + 1)/((n² − 1)π).
This holds for all n > 1 and can be verified in Sage by adding to the previous block the syntax given here:
L=pi; n=var("n")
bn=(2/L)*integral(cos(x)*sin(n*pi*x/L), x, 0, pi)
show( "b_n=", bn.expand())
On the other hand, for n = 1 we get b1 = (2/π) ∫_0^{π} cos(x) sin(x) dx = (1/π) ∫_0^{π} sin(2x) dx = 0. Thus we have proved that bn = 0 if n = 1 or if n > 1 is odd, and bn = 4n/((n² − 1)π) if n > 1 is even. Therefore, one deduces that
cos(x) = Σ_{k=1}^{∞} (8k/((4k² − 1)π)) sin(2kx) = (8/(3π)) sin(2x) + (16/(15π)) sin(4x) + (24/(35π)) sin(6x) + · · ·
for all x ∈ (0, π). The approximations obtained for n = 10 and n = 30 on the interval (−π, π) are presented below.
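These two plots can be produced with the following block (our own sketch, using the same conventions as the previous cells):
f1(x)=cos(x); f2(x)=-cos(x)
f = piecewise([[(-pi, 0),f2], [(0, pi),f1]])
p = f.plot(x, -pi, pi, exclude=[0])
s10 = plot(f.fourier_series_partial_sum(10, pi), x, -pi, pi, linestyle="--", color="brown")
s30 = plot(f.fourier_series_partial_sum(30, pi), x, -pi, pi, linestyle="--", color="steelblue")
(p+s10).show(figsize=8, ticks=pi/2, tick_formatter=pi); (p+s30).show(figsize=8, ticks=pi/2, tick_formatter=pi)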
7.B.4. f1 ∗ f2(t) = t − t²/2 + 4 for t ∈ [−2, −1]; 3/2 − t for t ∈ [−1, 1]; t²/2 − 2t + 2 for t ∈ [1, 2]; 0 otherwise.
7.B.14. It is a good exercise on differentiation and integration by parts. We may differentiate with respect to t inside the integral, and d/dt f(t − x) can be interpreted as −d/dx f(t − x).
7.B.16. Another good exercise on differentiation and integration by parts. We may differentiate twice with respect to t inside the integral, and f′′(t − x) can be interpreted either as a derivative with respect to t or with respect to x.
7.B.18. The definition of Γ(t) reveals L(t^α) = ∫_0^{∞} e^{−st} t^α dt = (1/s^{α+1}) ∫_0^{∞} e^{−x} x^α dx = Γ(α + 1)/s^{α+1}.
7.B.19. Integrate by parts to obtain
L(g)(s) = ∫_0^{∞} t e^{−t} e^{−st} dt = ∫_0^{∞} t e^{−(s+1)t} dt = lim_{t→∞} (t e^{−(s+1)t}/(−(s+1))) − 0 − ∫_0^{∞} e^{−(s+1)t}/(−(s+1)) dt = −(lim_{t→∞} e^{−(s+1)t}/(s+1)² − e⁰/(s+1)²) = 1/(s+1)².
Differentiating the Laplace transform of a general function −f (i.e., an improper integral) with respect to the parameter s gives
(∫_0^{∞} −f(t) e^{−st} dt)′ = ∫_0^{∞} −f(t) (e^{−st})′ dt = ∫_0^{∞} t f(t) e^{−st} dt.
This means that the derivative of the Laplace transform L(−f)(s) is the Laplace transform of the function t f(t). The Laplace transform of the function y = sinh t has already been determined to be the function y = 1/(s²−1). Therefore, L(h)(s) = (−1/(s²−1))′ = 2s/(s²−1)². We could also have determined L(g)(s) this way.
7.B.20. L(cos ωt)(s) + i L(sin ωt)(s) = L(e^{iωt})(s) = ∫_0^{∞} e^{iωt} e^{−st} dt = ∫_0^{∞} e^{(iω−s)t} dt = −(1/(s − iω)) [e^{(iω−s)t}]_0^{∞} = −(1/(s − iω)) (lim_{t→∞} e^{iωt}/e^{st} − 1) = 1/(s − iω) = (s + iω)/((s − iω)(s + iω)) = s/(s² + ω²) + i ω/(s² + ω²).
7.C.23. Recall the definition of the product metric on the Cartesian product X × X, where the distance is given as the maximum of the distances of the components. The claim follows directly from the triangle inequality and the topological definition of continuity.
7.D.5. The orthogonal basis has the form {x, −(3/π²)x + sin(x)}, while the projection does not change the function (1/2) sin(x), since it already lies in the space.
7.D.6. The orthogonal basis has the form {cos(x), (4/π) cos(x) + x}. Moreover, the required projection is given by 3π/(π⁴ − 24) · (4 cos(x) + πx). Notice that this is a very bad approximation.
7.D.7. We have already checked the orthogonality of the cosine terms in the solution to the example 7.A.4. The sine terms are obtained the same way, since they are just shifts of cosines by π/2 in the argument. The mixed couples provide an odd function to be integrated on a symmetric interval around the origin, and so the integral also vanishes.
7.D.10. For n ≠ 0 we have cn = (1/(2π)) ∫_0^{π} e^{−inx} dx = (1/(2π)) [e^{−inx}/(−in)]_0^{π} = (1/(2π)) (1 − (−1)ⁿ)/(in). Moreover, c0 = (1/(2π)) ∫_0^{π} dx = 1/2. Thus
F(x) = 1/2 + Σ_{n=1}^{∞} cn e^{inx} + Σ_{n=−∞}^{−1} cn e^{inx} = 1/2 + Σ_{n=1}^{∞} (1/(2π)) ((1 − (−1)ⁿ)/(in)) e^{inx} + Σ_{n=−∞}^{−1} (1/(2π)) ((1 − (−1)ⁿ)/(in)) e^{inx}.
Next, in the second of the sums, replace n with −n. This gives
F(x) = 1/2 + (1/(2π)) Σ_{n=1}^{∞} ((1 − (−1)ⁿ)/(in)) (e^{inx} − e^{−inx}) = 1/2 + (1/(2π)) Σ_{n=1}^{∞} ((1 − (−1)ⁿ)/(in)) 2i sin(nx) = 1/2 + (1/π) Σ_{n=1}^{∞} ((1 − (−1)ⁿ)/n) sin(nx).
It is easy to see that when n is even, the terms are all 0. For n odd, put n = 2m − 1 to obtain
F(x) = 1/2 + (2/π) Σ_{m=1}^{∞} (1/(2m − 1)) sin((2m − 1)x) = 1/2 + (2/π) (sin(x) + sin(3x)/3 + sin(5x)/5 + . . .).
Notice that this expression is similar to the problem described in 7.1.9. In particular, we could use the result from 7.1.9 to derive our claim after an easy transformation.
7.D.15. Determine the convolution of the functions f1 and f2, where f1 ∗ f2(t) = (t+1)²/2 for t ∈ [−1, 0]; (1−t²)/2 for t ∈ [0, 1]; 0 otherwise.
CHAPTER 8
Calculus with more variables
one variable is not enough? – never mind, just recall vectors!
At the beginning of our journey through the mathematical landscape, we saw that vectors can be manipulated nearly as easily as scalars. Now, we return to situations where the relations between objects are expressed with the help of more (yet still finitely many) parameters. This is really necessary when modeling processes or objects in practice, where functions R → R of one variable are seldom adequate.
At least, functions dependent on finitely many parameters are necessary, and the dependence of the change of the results on the parameters is often more important than the result itself. There is little need for brand new ideas. Many problems we encounter can be reduced to ones we can solve already. We return to the discussion of situations when the values of functions are described in terms of instantaneous changes. That is, we consider ordinary differential equations. In the next chapter, we consider partial differential equations and provide a gentle introduction to variational problems.
1. Functions and mappings on Rn
8.1.1. The world of functions. In the sequel, the main objects are mappings between Euclidean spaces, f : Rn → Rm. We have seen many such examples already. The complex valued real functions correspond to n = 1, m = 2, while the power series converge inside of a circle in the complex plane, providing examples of f : R² → R². We have also dealt with vector valued real functions, representing parametrized curves c : R → Rn (see e.g. the paragraphs on curvatures and Frenet frames in 6.1.16 on page 526). In linear algebra and geometry, we saw the linear and affine maps f : Rn → Rm defined with the help of matrices A ∈ Mat_{m,n}(R) and constant vectors y ∈ Rm: Rn ∋ x → y + A x ∈ Rm. In coordinates, the value is given by the expression Σ_j a_{ij} x_j + y_i, where A = (a_{ij}) and y = (y_i). Finally, the quadratic forms were mappings Rn → R given by symmetric matrices S = (s_{ij}) and the formula Rn ∋ x → x^T S x ∈ R. In coordinates, the value is Σ_{i,j} s_{ij} x_i x_j. In general, all such mappings f : Rn → Rm are composed of m component functions fi : Rn → R. So we start with this case.
8.1.2. Multivariate functions. We can stress the dependence on the variables x1, . . . , xn by writing the functions as f(x1, x2, . . . , xn) : Rn → R. The goal is to extend methods for monitoring the values of functions and their changes to this situation. We speak about functions of more variables or, more compactly, multivariate functions. We often work with the cases n = 2 or n = 3, so that the concepts being introduced are easier to understand. In these cases, letters like x, y, z are used instead of numbered variables. This means that a function f defined in the "plane" R² is denoted R² ∋ (x, y) → f(x, y) ∈ R, and, similarly, in the "space" R³, R³ ∋ (x, y, z) → f(x, y, z) ∈ R. Just as in the case of univariate functions, the domain A ⊂ Rn on which the function in question is defined needs to be considered. When examining a function given by a concrete formula, the first task is often to find the largest domain on which the formula makes sense.
It is also useful to consider the graph of a multivariate function, i.e., the subset Gf ⊂ Rn × R = R^{n+1} defined by Gf = {(x1, . . . , xn, f(x1, . . . , xn)); (x1, . . . , xn) ∈ A}, where A is the domain of f. For instance, the graph of the function defined in the plane by the formula f(x, y) = (x + y)/(x² + y²) is quite a smooth surface, see the illustration below. The maximal domain of this function consists of all the points of the plane except for the origin (0, 0). When defining the function, and especially when drawing its graph, fixed coordinates are used in the plane. Fixing the value of either of the coordinates implies that only one variable remains. Fixing the value of x, for example, gives the mapping R → R³, y → (x, y, f(x, y)), i.e., a curve in the space R³. Curves are vector functions of one variable, already worked with in chapter six (see 6.1.14). The images of the curves for some fixed values of the coordinates x and y are depicted by lines in the illustration.
A. Multivariate functions
We start this chapter with a couple of easy examples, to get a first "grasp" of multivariate functions.
8.A.1. Solve the following systems of inequalities and mark the resulting areas in the plane.
a) x² + y² ≤ 4, y ≥ 1/x;
b) y ≤ arctan x, y ≤ 1/x²;
c) x² + (y − 1)² ≥ 4, y + x² − 2x ≥ 0, y ≥ 0.
Solution. Whenever you have to solve an inequality of the form f(x, y) ≥ 0 (f : R² → R is a function of two variables, but the same method is valid for inequalities with more variables), you just consider the border curve f(x, y) = 0. This curve divides the plane into several areas, and either all points of an area satisfy the inequality, or none of them does. If we have a system of inequalities, we solve each inequality separately and then intersect the results. In our cases, we obtain the areas depicted in the figures. □
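For instance, system c) can be drawn directly with Sage's region_plot; this cell is our own sketch (the plotting window is an arbitrary choice):
var("x y")
r = region_plot([x^2 + (y-1)^2 >= 4, y + x^2 - 2*x >= 0, y >= 0], (x, -4, 4), (y, -2, 6), incol="lightblue")
r.show(aspect_ratio=1)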
8.A.2. Determine the domains of the following functions:
a) xy/(y(x³ + x² + x + 1)),
b) ln(x² − y²),
c) ln(−x² − y²),
d) arcsin(2χQ(x)), where χQ denotes the indicator function of the rational numbers,
e) f(x, y, z) = √(ln x · arcsin(y²z)).
Solution. a) The formula correctly expresses a value iff the denominator of the fraction is non-zero. Therefore, the formula defines a function on the set R² \ {[x, 0], [−1, y]; x, y ∈ R}.
b) The formula is correct iff the argument of the logarithm is positive, i.e., |x| > |y|. Therefore, the domain of this function is {(x, y) ∈ R²; |x| > |y|}. You can see the graph of this function in the picture.
c) This formula is again a composition of a logarithm and a polynomial of two variables. However, the polynomial −x² − y² takes on only non-positive real values, where the logarithm is undefined (as a function R → R). Hence, the domain is empty.
d) This formula correctly defines a value iff the argument of the arc sine lies in the interval $[-1,1]$, which fails for exactly those pairs $(x,y) \in \mathbb{R}^2$ whose first component is rational. The formula thus defines a function on the set $\{[x,y];\ x \in \mathbb{R}\setminus\mathbb{Q}\}$.

e) The argument of the square root must be non-negative, that is, either the value of the logarithm and the value of the arc sine are both non-negative, or both are non-positive. Thus we get that the domain is the set $$\{[x,y,z] \in \mathbb{R}^3;\ (x \ge 1 \wedge y \ne 0 \wedge 0 \le z \le \tfrac{1}{y^2}) \vee (x \in (0,1) \wedge y \ne 0 \wedge -\tfrac{1}{y^2} \le z \le 0) \vee (x > 0 \wedge y = 0)\}. \qquad\square$$

In the following examples, $k([x,y]; r)$ means the circle with the center $[x,y]$ and the radius $r$.

8.A.3. Determine the domain of the function $f$ and mark the resulting area in the plane:

i) $f(x,y) = \sqrt{(x^2+y^2-1)(4-x^2-y^2)}$,
ii) $f(x,y) = \sqrt{1-x^2} + \sqrt{1-y^2}$,
iii) $f(x,y) = \sqrt{\frac{x^2+y^2-x}{2x-x^2-y^2}}$,
iv) $f(x,y) = \arcsin\frac{x}{y} - \frac{1}{|y|-|x|}$,
v) $f(x,y) = \sqrt{1-x^2-4y^2}$,
vi) $f(x,y,z) = \sqrt{1 - \frac{x^2}{a^2} - \frac{y^2}{b^2} - \frac{z^2}{c^2}}$.

Solution. i) It has to hold that ($x^2+y^2-1 \ge 0$, $4-x^2-y^2 \ge 0$) or ($x^2+y^2-1 \le 0$, $4-x^2-y^2 \le 0$), that is, ($x^2+y^2 \ge 1$, $x^2+y^2 \le 4$) or ($x^2+y^2 \le 1$, $x^2+y^2 \ge 4$); the latter is impossible, so the domain is the annulus between the circles $k([0,0];1)$ and $k([0,0];2)$.

ii) It is the square with the center $[0,0]$ and vertices $[\pm 1, \pm 1]$.

iii) The area between the circles $k([\frac12, 0]; \frac12)$ and $k([1,0]; 1)$; the smaller circle belongs to the area, the bigger one does not.

iv) The area between the lines $y = x$ and $y = -x$ containing the $y$ axis (without these lines).

v) The ellipse (together with its interior) with the center $[0,0]$, with the major axis lying on the $x$-axis with the major semi-axis $a = 1$, and the minor axis on the $y$-axis with the minor semi-axis $b = \frac12$.

vi) The ellipsoid (with its interior) with the center $[0,0,0]$ and semi-axes lying on the $x$, $y$, $z$ axes respectively, with lengths $a$, $b$, and $c$. □

B. The topology of En

In the previous chapter, we defined general metric spaces, and we studied especially metric spaces consisting of sets of functions. As we have already seen there, many metrics can be defined on the space $\mathbb{R}^n$ (or on its subsets). For instance, considering a map of a

See 3.4.3(1) in geometry, or the axioms of metrics in 7.3.1, or the same inequality 5.2.2(2) for complex scalars.
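The domain computations of 8.A.2 and 8.A.3 can be spot-checked numerically: evaluate the formula and see where it fails. A minimal Python sketch for 8.A.2 b), with our own helper name and sample points; it only illustrates the answer $|x| > |y|$, it is not a proof.

```python
# Sample the plane and record where ln(x^2 - y^2) actually produces a value.
import math

def in_domain(x, y):
    try:
        math.log(x**2 - y**2)
        return True
    except ValueError:          # the argument of log was <= 0
        return False

for (x, y) in [(2, 1), (1, 2), (1, 1), (-3, 0.5), (0, 0)]:
    print((x, y), in_domain(x, y), "|x|>|y|:", abs(x) > abs(y))
```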
The concepts defined for real and complex scalars and discussed for metric spaces in detail can be carried over (extended) with no problem to the points $P_i$ of any Euclidean space:

Topology and metric in Euclidean spaces

(1) a Cauchy sequence: a sequence of points $P_i$ such that for every fixed $\varepsilon > 0$, $\|P_i - P_j\| < \varepsilon$ holds for all but finitely many indices $i, j$;
(2) a convergent sequence: a sequence of points $P_i$ converges to a point $P$ if and only if for every fixed $\varepsilon > 0$, $\|P_i - P\| < \varepsilon$ holds for all but finitely many indices $i$; the point $P$ is then called the limit of the sequence $P_i$;
(3) a limit point $P$ of a set $A \subset E_n$: there exists a sequence of points in $A$, different from $P$, converging to $P$;
(4) a closed set: contains all of its limit points;
(5) an open set: its complement is closed;
(6) an open $\delta$–neighbourhood of a point $P$: the set $O_\delta(P) = \{Q \in E_n;\ \|P - Q\| < \delta\}$, $\delta \in \mathbb{R}$, $\delta > 0$;
(7) a boundary point $P$ of a set $A$: every $\delta$–neighbourhood of $P$ has non-empty intersection with both $A$ and the complement $E_n \setminus A$;
(8) an interior point $P$ of a set $A$: there exists a $\delta$–neighbourhood of $P$ which lies inside $A$;
(9) a bounded set: lies inside some $\delta$–neighbourhood of one of its points (for a sufficiently large $\delta$);
(10) a compact set: both closed and bounded;
(11) limit of a mapping: $a \in \mathbb{R}^m$ is the limit of a mapping $f : \mathbb{R}^n \to \mathbb{R}^m$ at a limit point $x_0$ of its domain $A$ if for each $\varepsilon > 0$, there is a $\delta$–neighbourhood $U$ of $x_0$ such that $\|f(x) - a\| < \varepsilon$ for all $x \in U \cap A$ different from $x_0$; this happens if and only if for each sequence $x_n \in A$ converging to $x_0$, the values $f(x_n)$ converge to $a$;
(12) continuity: a mapping $f : A \subset \mathbb{R}^n \to \mathbb{R}^m$ is continuous at $x_0 \in A$ if the limit $\lim_{x\to x_0} f(x)$ exists and equals $f(x_0)$; the mapping $f$ is continuous on $A$ if it is continuous at all points of $A$.

Both the first and second items deal with norms of differences of points approaching zero. Since the square of the norm is the sum of squares of the individual components, this happens if and only if the individual components approach zero. In particular, a sequence of points $P_i$ is Cauchy or convergent if and only if these properties are possessed by the real sequences obtained from the particular coordinates of the points $P_i$ in every Cartesian coordinate system. Therefore, it also follows from Lemma 5.2.3 that every Cauchy sequence of points in $E_n$ is convergent. Especially, $E_n$ is a complete metric space. Similarly, the mappings from item (11) are $m$-tuples of the component functions, and the limits are given as the $m$-tuples of limits of these components.

Recall some further results already discussed at the more general level of metric spaces in chapter seven:

state as a subset of $\mathbb{R}^2$, the distance of two points may be defined as the time necessary to get from one of the points to the other by public transport or on foot. In France, for example, the shortest paths between most pairs of points in this metric are far from line segments. In this chapter, we will focus on the space $E_n$, that is, $\mathbb{R}^n$ with the usual metric (distance) known to mankind for a long time. The property that the shortest path between any two points of this space is the line segment connecting them could be seen as its defining property (for example, the metric above does not satisfy it). Let us examine the space $E_n$ in more detail.

8.B.1. Show that every non-empty proper subset of $E_n$ has a boundary point (which need not lie in it).

Solution.
Let $U \subset E_n$ be a non-empty subset with no boundary point. Consider a point $X \in U$, a point $Y \in U' := E_n \setminus U$, and the line segment $XY \subset E_n$. Intuitively, going from $X$ to $Y$ along this segment, we must at some moment cross from $U$ to $U'$, and this can happen only at a boundary point (everyone who has ever been to a foreign country is surely well acquainted with this fact). Formally, let $A$ be the point of $XY$ for which $|XA| = \sup\{|XZ|;\ Z \in XY,\ XZ \subset U\}$ (clearly, there is exactly one such point on the segment $XY$). This point is a boundary point of $U$: it follows from the definition of $A$ that any line segment $XB$ with $B$ between $X$ and $A$ is contained in $U$; in particular, $B \in U$. However, if there were a neighborhood of $A$ contained in $U$, then there would exist a part of the line segment $XY$ longer than $XA$ which would be contained in $U$, which contradicts the definition of the point $A$. Therefore, any neighborhood of the point $A$ contains a point from $U$ as well as a point from $E_n \setminus U$. □

8.B.2. Prove that the only non-empty clopen (both closed and open) subset of $E_n$ is $E_n$ itself.

Solution. It follows from the above exercise 8.B.1 that every non-empty proper subset $U$ of $E_n$ has a boundary point. If $U$ is closed, then it is equal to its closure; therefore, it contains all of its boundary points. However, an open set (by definition) cannot contain any of its boundary points. □

8.B.3. Show that the space $E_n$ cannot be written as the union of (at least two) disjoint non-empty open sets.

Solution. Suppose that $E_n$ can be expressed thus, i.e., $E_n = \bigcup_{i\in I} U_i$, where $I$ is an index set. Let us fix a set $U$ from this union and denote by $V$ the union of all the others. Then we can write $E_n = U \cup V$, where both $U$ and $V$ (being unions of open sets) are open. However, they are also complements of each other; therefore, they are closed as well. This contradicts the result of the previous exercise 8.B.2. □

8.B.4. Prove or disprove: a union of (even infinitely many) closed subsets of $E_n$ is a closed subset of $E_n$.

Solution. The proposition does not hold. As a counterexample, consider the union $$\bigcup_{i=3}^{\infty} \left[\frac{1}{i},\ 1 - \frac{1}{i}\right]$$

A mapping is continuous if and only if its preimages of open sets are open (check this carefully!). Further, each continuous function on a compact set $A$ is uniformly continuous, bounded, and attains its maximum and minimum, cf. paragraph 7.3.14 on page 682.

The reader should make an appropriate effort to read the paragraphs 3.4.3, 5.2.5–5.2.8, 7.3.3–7.3.5, and 7.3.12, as well as to recall or think through the definitions and connections of all these concepts.

8.1.4. Compact sets. Working with general open, closed, or compact sets could seem useless in the case of the real line $E_1$, since intervals are almost always used. In the case of metric spaces in the last part of chapter seven, the ideas are complicated at first sight. However, the same approach is easy in the case of the Euclidean spaces $\mathbb{R}^n$. It is also very useful and important (and it is, of course, a special case of general metric spaces). Just as in the case of $E_1$ or $E_2$, we deal with the open covers of sets (i.e., systems of open sets containing the given sets), and Theorem 5.2.8 is again true (with mere reformulations):

Theorem.
Subsets $A \subset E_n$ of Euclidean spaces satisfy:
(1) $A$ is open if and only if it is a union of a countable (or finite) system of $\delta$–neighbourhoods;
(2) every point $a \in A$ is either interior or boundary;
(3) every boundary point of $A$ is either an isolated or a limit point of $A$;
(4) $A$ is compact if and only if every infinite sequence contained in it has a subsequence converging to a point in $A$;
(5) $A$ is compact if and only if each of its open covers contains a finite subcover.

Proof. The proof from 5.2.8 can be reused without changes in the case of claims (1)–(3), yet now the concepts have to be perceived in a different way, and the "open intervals" are substituted with multidimensional $\delta$–neighbourhoods of appropriate points. However, the proofs of the fourth and fifth claims have to be adjusted properly. Actually, we proved the claims there for $\mathbb{R}$ and $\mathbb{C}$, thus in dimensions one and two. Thus the reader may either extend the two-dimensional reasoning, or rewrite the proof of the corresponding propositions for general metric spaces in 7.3.12, noticing the parts which can be simplified for Euclidean spaces. □

8.1.5. Curves in En. Almost all the discussion about limits, derivatives, and integrals of functions in chapters 5 and 6 concerned functions of a real variable with real or complex values, since only the triangle inequality, valid for the magnitudes of real and complex numbers, was used. This argument can be carried over to any function of a real variable with values in a Euclidean

of closed subsets of $\mathbb{R}$, which is equal to the open interval $(0,1)$. □

8.B.5. Prove or disprove: an intersection of (even infinitely many) open subsets of $E_n$ is an open subset of $E_n$.

Solution. The proposition does not hold in general. As a counterexample, consider the intersection $$\bigcap_{i=2}^{\infty} \left(1 - \frac{1}{i},\ 1 + \frac{1}{i}\right)$$ of open subsets of $\mathbb{R}$, which is equal to the closed singleton $\{1\}$. □

8.B.6. Consider the graph of a continuous function $f : \mathbb{R}^2 \to \mathbb{R}$ as a subset of $E_3$. Determine whether this subset is open, closed, and compact, respectively.

Solution. The subset is not open, since any neighborhood of a point $[x_0, y_0, f(x_0,y_0)]$ contains a segment of the line $x = x_0$, $y = y_0$. However, there is a unique point of the graph of the function on this segment, namely the point $[x_0, y_0, f(x_0,y_0)]$.

The continuity of $f$ implies that the subset is closed; we show that every convergent sequence of points of the graph of $f$ converges to a point which also lies in the graph. If such a sequence is convergent in $E_3$, then it must converge in every component, so the sequence $\{[x_n, y_n]\}_{n=1}^{\infty}$ is convergent in $\mathbb{R}^2$. Let us denote this limit by $[a,b]$. Then, it follows from the definition of continuity that the function values at the points $[x_n, y_n]$ must converge to the value $f(a,b)$. However, this means that the sequence $\{[x_n, y_n, f(x_n,y_n)]\}_{n=1}^{\infty}$ converges to the point $[a, b, f(a,b)]$, which belongs to the graph of the function $f$. Therefore, the graph is a closed set.

The subset is closed, yet it is not compact, since it is not bounded (its orthogonal projection onto the coordinate plane $xy$ is the whole $\mathbb{R}^2$). Recall that a subset of $E_n$ is compact iff it is both closed and bounded. □

And now, let us study limits of functions (a limit is defined thanks to the topology of $E_n$, see 8.1.3).

C. Limits and continuity of multivariate functions

When approaching limits of multivariate functions, there is one fact we have to deal with: let us emphasize that there is no analogue of l'Hospital's rule for multivariate functions.
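Before the computational tricks, here is a small numeric illustration of the componentwise convergence used in the solution of 8.B.6 above: a sequence in $E^2$ converges exactly when both coordinate sequences do. The particular sequence is our own example.

```python
# P_i = (1/i, 1 + 1/i^2) -> (0, 1): the distances to the limit go to zero
# exactly because each coordinate converges.
import math

P = lambda i: (1 / i, 1 + 1 / i**2)
limit = (0.0, 1.0)
for i in (1, 10, 100, 1000):
    x, y = P(i)
    dist = math.hypot(x - limit[0], y - limit[1])
    print(f"i={i:4d}  P_i=({x:.4f}, {y:.6f})  ||P_i - P|| = {dist:.2e}")
```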
When computing limits of the type $\frac{0}{0}$ or $\frac{\infty}{\infty}$, we have to be "clever". In one dimension, we can approach a point only from the right or from the left (and the limit at the point exists if both one-sided limits exist and are equal to each other). In more dimensions, we can approach a point from infinitely many directions, and the limit at the point exists iff the limits of the function restricted to every path leading to the point exist and are all equal to each other. The easiest way to obtain a limit is (as with functions of one variable) to plug the given point into the formula defining the function; if we get a meaningful expression, we are done.

space $\mathbb{R}^n$. Several tools for the work with curves are introduced in paragraphs 6.1.12–6.1.14. For every (parametrized) curve¹, that is, a mapping $c = (c_1(t), \dots, c_n(t)) : \mathbb{R} \to \mathbb{R}^n$ in an $n$–dimensional space, the concepts simply extend the ideas from univariate functions, with some extra thoughts. First note that both the limit and the derivative of curves make sense in an affine space even without selecting the coordinates (the limit is again a point in the original space, while the derivative is a vector in the modelling vector space!). In the case of an integral, curves are considered in the vector space $\mathbb{R}^n$. The reason for this can be seen even in the case of one dimension, where the origin is needed to be able to see the "area under the graph of a function". It is apparent that limits, derivatives, and integrals have to be considered via the $n$ individual coordinate components in $\mathbb{R}^n$. In particular, their existence is determined in the same way:

Basic concepts for curves

(1) limit: $\lim_{t\to t_0} c(t) = (\lim_{t\to t_0} c_1(t), \dots, \lim_{t\to t_0} c_n(t)) \in \mathbb{R}^n$;
(2) derivative: $c'(t_0) = \lim_{t\to t_0} \frac{1}{t-t_0}\big(c(t) - c(t_0)\big) = (c_1'(t_0), \dots, c_n'(t_0)) \in \mathbb{R}^n$;
(3) integral: $\int_a^b c(t)\,dt = \left(\int_a^b c_1(t)\,dt, \dots, \int_a^b c_n(t)\,dt\right) \in \mathbb{R}^n$.

We can directly formulate the analogy of the connection between the Riemann integral and the antiderivative for curves in $\mathbb{R}^n$ (see 6.2.9):

Proposition. Let $c$ be a curve in $\mathbb{R}^n$, continuous on an interval $[a,b]$. Then its Riemann integral $\int_a^b c(t)\,dt$ exists. Moreover, the curve $$C(t) = \int_a^t c(s)\,ds \in \mathbb{R}^n$$ is well defined, differentiable, and $C'(t) = c(t)$ for all values $t \in [a,b]$.

It is not simple to extend the mean value theorem and, in general, Taylor's expansion with remainder, see 5.3.9 and 6.1.3. They can be applied, in a selected coordinate system, to the particular coordinate functions of a differentiable curve $c(t) = (c_1(t), \dots, c_n(t))$ on a finite interval $[a,b]$. In the case of the mean value theorem, for instance, there are numbers $t_i$ such that $$c_i(b) - c_i(a) = (b-a)\cdot c_i'(t_i), \quad i = 1, \dots, n.$$

¹In geometry, one often makes a distinction between a curve as a subset of $E_n$ and its parametrization $\mathbb{R} \to \mathbb{R}^n$. The word "curve" means exclusively the parametrized curve here.

Otherwise, we may get an "indeterminate" expression.
There are some tricks we can use to compute such a limit:
(1) factor the numerator or the denominator according to some known formula and then cancel;
(2) multiply the numerator and the denominator by an appropriate term and then cancel;
(3) use $\frac{\text{bounded expression}}{\infty} = 0$ and $0 \cdot (\text{bounded expression}) = 0$;
(4) use an appropriate substitution to get a limit of a function of one variable;
(5) try polar coordinates $x = r\cos\varphi$, $y = r\sin\varphi$ (this usually works with the expression $x^2+y^2$: we have $x^2+y^2 = r^2\cos^2\varphi + r^2\sin^2\varphi = r^2(\cos^2\varphi + \sin^2\varphi) = r^2$, which is independent of $\varphi$);
(6) try $y = kx$ or $y = kx^2$, or generally $x = f(k)$ and $y = g(k)$ (to prove the non-existence of the limit: if the limit after the substitution depends on $k$, the original limit does not exist).
Tricks (5) and (6) are illustrated in a short symbolic sketch below.

8.C.1. $\lim_{(x,y)\to(e^2,1)} \frac{\ln x}{y}$ ⃝
8.C.2. $\lim_{(x,y)\to(4,4)} \frac{\sqrt{x}-\sqrt{y}}{x-y}$ ⃝
8.C.3. $\lim_{(x,y)\to(1,\infty)} \frac{\cos y}{x+y}$ ⃝
8.C.4. $\lim_{(x,y)\to(0,2)} \frac{e^{xy}-1}{x}$ ⃝
8.C.5. $\lim_{(x,y)\to(\infty,\infty)} \frac{x^2+y^2}{x^4+y^4}$ ⃝
8.C.6. $\lim_{(x,y)\to(0,0)} \frac{x^2+y^2}{x+y}$ ⃝
8.C.7. $\lim_{(x,y)\to(0,0)} \frac{x^2-y^2}{x^2+y^2}$ ⃝
8.C.8. $\lim_{(x,y)\to(\infty,\infty)} \left(\frac{2xy}{x^2+y^2}\right)^{x^2}$ ⃝
8.C.9. $\lim_{(x,y)\to(1,1)} \frac{x+y}{\sqrt{x^2+y^2}}$ ⃝
8.C.10. $\lim_{(x,y)\to(0,0)} \frac{x^2+y^2}{\sqrt{x^2+y^2+1}-1}$ ⃝
8.C.11. $\lim_{(x,y)\to(0,0)} xy^2\cos\frac{1}{xy^2}$ ⃝
8.C.12. $\lim_{(x,y)\to(0,0)} \frac{\sin xy}{x}$ ⃝
8.C.13. $\lim_{(x,y)\to(0,0)} \frac{x^3+y^3}{x^2+y^2}$ ⃝
8.C.14. $\lim_{(x,y)\to(\infty,\infty)} (x^2+y^2)\,e^{-(x+y)}$ ⃝
8.C.15. $\lim_{(x,y)\to(\infty,1)} \left(1+\frac{1}{x}\right)^{\frac{x^2}{x+y}}$ ⃝
8.C.16. $\lim_{(x,y)\to(0,0)} \frac{xy}{x^2+y^2}$ ⃝
8.C.17. $\lim_{(x,y)\to(0,0)} \frac{1-\cos(x^2+y^2)}{(x^2+y^2)\,xy}$ ⃝
8.C.18. Prove that $\lim_{(x,y)\to(0,0)} \frac{-y}{x^2-y}$ does not exist. ⃝
8.C.19. Prove that $\lim_{(x,y)\to(1,-2)} \frac{2x+xy-y-2}{x^2+y^2-2x+4y+5}$ does not exist. ⃝

These numbers $t_i$ are distinct in general, so the difference vector of the boundary points $c(b) - c(a)$ cannot, in general, be expressed as a multiple of the derivative of the curve at a single point. For example, for a differentiable curve $c(t) = (x(t), y(t))$ in the plane $E_2$, $$c(b) - c(a) = \big(x'(\xi)(b-a),\ y'(\eta)(b-a)\big) = (b-a)\cdot\big(x'(\xi), y'(\eta)\big)$$ for two (in general different) values $\xi, \eta \in [a,b]$. However, this reasoning is still sufficient for the following estimate:

Lemma. If $c$ is a curve in $E_n$ with continuous derivative on a compact interval $[a,b]$, then for all $a \le s \le t \le b$, $$\|c(t) - c(s)\| \le \sqrt{n}\,\Big(\max_{r\in[a,b]} \|c'(r)\|\Big)\,|t-s|.$$

Proof. Direct application of the mean value theorem gives, for appropriate points $r_i$ inside the interval $[s,t]$, $$\|c(t)-c(s)\|^2 = \sum_{i=1}^n \big(c_i(t)-c_i(s)\big)^2 \le \sum_{i=1}^n \big(c_i'(r_i)(t-s)\big)^2 \le (t-s)^2 \sum_{i=1}^n \max_{r\in[s,t]} c_i'(r)^2 \le n\,\Big(\max_{r\in[s,t],\,i=1,\dots,n} |c_i'(r)|\Big)^2 (t-s)^2 \le n\,\max_{r\in[s,t]} \|c'(r)\|^2\,(t-s)^2. \qquad\square$$

Another important concept is the tangent vector to a curve $c : \mathbb{R} \to E_n$ at a point $c(t_0) \in E_n$. It is defined as the vector in the modelling vector space $\mathbb{R}^n$ given by the derivative $c'(t_0) \in \mathbb{R}^n$. Consider $c$ to be the path of an object moving in space in time. Then the tangent vector at $t_0$ can be perceived physically as the instantaneous velocity at this point. The straight line $T$ given parametrically as $$T :\quad c(t_0) + (t-t_0)\cdot c'(t_0)$$ is called the tangent line to the curve $c$ at the point $t_0$. Unlike the tangent vector, the (unparametrized) tangent line $T$ is independent of the parametrization of the curve $c$: the chain rule ensures that changing the parametrization leads to the same tangent vector, up to a multiple.
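As promised after the list of tricks, here is a short symbolic sketch of tricks (5) and (6), applied to 8.C.13 and 8.C.16. It uses the sympy library; the variable names are our own.

```python
# Trick (5): polar coordinates settle 8.C.13.
# Trick (6): a line substitution disproves the limit in 8.C.16.
import sympy as sp

x, y, r, phi, k = sp.symbols('x y r phi k', real=True)

f13 = (x**3 + y**3) / (x**2 + y**2)
polar = sp.simplify(f13.subs({x: r*sp.cos(phi), y: r*sp.sin(phi)}))
print(polar)                  # simplifies to r*(cos(phi)**3 + sin(phi)**3)
print(sp.limit(polar, r, 0))  # 0, independently of phi -> the limit is 0

f16 = x*y / (x**2 + y**2)
along_line = sp.simplify(f16.subs(y, k*x))
print(along_line)             # k/(k**2 + 1): depends on k, so no limit exists
```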
8.1.6. Partial derivatives. If we look at the multivariate function $f(x_1, \dots, x_n) : \mathbb{R}^n \to \mathbb{R}$ as a function of one real variable $x_i$ while the other variables are considered constant, we can consider the derivative of this function. This is called the partial derivative of the function $f$ with respect to $x_i$, and it is denoted $\frac{\partial f}{\partial x_i}$, $i = 1, \dots, n$, or (without referring to the particular function) as the operator $\frac{\partial}{\partial x_i}$ on the functions. More generally, for every function $f : \mathbb{R}^n \to \mathbb{R}$ and an arbitrary curve $c : \mathbb{R} \to \mathbb{R}^n$, their composition $(f \circ c)(t) : \mathbb{R} \to \mathbb{R}$ can be considered. This composite function $F = f \circ c$ expresses the behaviour of the function $f$ along the curve

Solution. The lines through $[1,-2]$ have the equation $y = kx - k - 2$. As we approach $[1,-2]$ along one of these lines, we get the limit $\frac{k}{1+k^2}$, which is different for different $k$; thus the limit does not exist. □

Let us recall that a function is continuous at the points where the limit exists and is equal to the function value.

8.C.20. Find the discontinuity points of $f(x,y) = \frac{2x-5y}{x^2+y^2-1}$. ⃝

8.C.21. Find the discontinuity points of $f(x,y) = \frac{\sin(x^2y+xy^2)}{\cos(x-y)}$. ⃝

8.C.22. Find the discontinuity points of $f(x,y) = \begin{cases} \frac{x^3+y^3}{x^2+y^2} & \text{for } [x,y] \ne [0,0], \\ 0 & \text{for } [x,y] = [0,0]. \end{cases}$ ⃝

D. Tangent lines, tangent planes, graphs of multivariate functions

8.D.1. A car is moving at velocity given by the vector $(0,1,1)$. At the initial time $t = 0$, it is situated at the point $[1,0,0]$. The acceleration of the car at time $t$ is given by the vector $(-\cos t, -\sin t, 0)$. Describe the dependence of the position of the car on the time $t$. (A symbolic re-derivation of this computation follows below.)

Solution. As we discuss in paragraph 8.1.5, we got acquainted with the means of solving this type of problem as early as in chapter 6. Notice that the "integral curve" $C(t)$ from the proposition of paragraph 8.1.5 starts at the point $(0,0,0)$ (in other words, $C(0) = (0,0,0)$). In the affine space $\mathbb{R}^n$, we can move it so that it starts at an arbitrary point, and this does not change its derivative (this is performed by adding a constant to every component in the parametric equation of the curve). Therefore, up to translation, this integral curve is determined uniquely (nothing other than constants can be added to the components without changing the derivative). When we integrate the curve of acceleration, we get the curve of velocity $(-\sin t, \cos t - 1, 0)$. Taking the initial velocity into account as well, we obtain the velocity curve of the car: $(-\sin t, \cos t, 1)$ (we shifted the curve by the vector $(0,1,1)$ so that the velocity curve at time $t = 0$ agrees with the given initial velocity). Further integration leads to the curve $(\cos t - 1, \sin t, t)$. Shifting this by the vector $(1,0,0)$ then fits with the initial position of the car. Therefore, the car moves along the curve $[\cos t, \sin t, t]$ (this curve is called a helix). □

8.D.2. Determine both the parametric and implicit equations of the tangent line to the curve $c : \mathbb{R} \to \mathbb{R}^3$, $c(t) = (c_1(t), c_2(t), c_3(t)) = (t, t^2, t^3)$ at the point which corresponds to the parameter value $t = 1$.

Solution. The value $t = 1$ corresponds to the point $c(1) = [1,1,1]$. The derivatives of the particular components are $c_1'(t) = 1$, $c_2'(t) = 2t$, $c_3'(t) = 3t^2$. The values of

$c$. The simplest case is using parametrized straight lines: choosing the lines $c_i(t) = (x_1, \dots, x_i + t, \dots, x_n)$, the derivative of $f \circ c_i$ yields just the partial derivative $\frac{\partial f}{\partial x_i}$.
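The integration carried out in 8.D.1 above can be reproduced symbolically. A minimal sympy sketch follows; the enforcement of the initial conditions by shifting is exactly the translation argument from the solution.

```python
# Integrate the acceleration twice, fixing the integration constants by the
# initial velocity (0, 1, 1) and the initial position [1, 0, 0].
import sympy as sp

t = sp.symbols('t', real=True)
acc = sp.Matrix([-sp.cos(t), -sp.sin(t), 0])
v0  = sp.Matrix([0, 1, 1])
p0  = sp.Matrix([1, 0, 0])

vel = acc.integrate(t)             # an antiderivative (constants set to 0)
vel = vel - vel.subs(t, 0) + v0    # enforce vel(0) = v0
pos = vel.integrate(t)
pos = pos - pos.subs(t, 0) + p0    # enforce pos(0) = p0
print(sp.simplify(vel.T))          # (-sin(t), cos(t), 1)
print(sp.simplify(pos.T))          # (cos(t), sin(t), t): the helix
```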
More generally, derivatives can be defined in any direction:

Directional and partial derivatives

Definition. $f : \mathbb{R}^n \to \mathbb{R}$ has a derivative in the direction of a vector $v \in \mathbb{R}^n$ at a point $x \in E_n$ if and only if the derivative $d_vf(x)$ of the composite mapping $t \mapsto f(x+tv)$ exists at the point $t = 0$, i.e. $$d_vf(x) = \lim_{t\to 0} \frac{1}{t}\big(f(x+tv) - f(x)\big).$$ The partial derivatives are the values $\frac{\partial f}{\partial x_i} = d_{e_i}f$, where $e_i$ are the elements of the standard basis of $\mathbb{R}^n$.

In other words, the directional derivative expresses the infinitesimal increment of the function $f$ in the direction $v$. For functions in the plane, $$\frac{\partial}{\partial x}f(x,y) = \lim_{t\to 0}\frac{1}{t}\big(f(x+t,y) - f(x,y)\big), \qquad \frac{\partial}{\partial y}f(x,y) = \lim_{t\to 0}\frac{1}{t}\big(f(x,y+t) - f(x,y)\big).$$ In particular, partial differentiation with respect to a given variable is just the usual one-variable differentiation while considering the other variables to be constants.

8.1.7. The differential of a function. Partial or directional derivatives are not always good enough to obtain a fair approximation of the behaviour of a function by linear expressions. There are three concerns for a function $f : \mathbb{R}^n \to \mathbb{R}$ here. First, the directional derivatives at a point $x \in \mathbb{R}^n$ may not exist in all directions, although the partial derivatives are well defined. Second, the dependence of the directional derivatives $d_vf(x)$ on the direction $v$ need not be linear. Third, even if $d_vf(x)$ is a linear mapping in the argument $v$, the function may still not be "well behaved" around the point $x$. As an example, consider the functions in the plane with coordinates $(x,y)$ given by the formulae $$g(x,y) = \begin{cases} 1 & \text{if } xy = 0 \\ 0 & \text{otherwise} \end{cases} \qquad h(x,y) = \begin{cases} x & \text{if } y = 0 \\ y & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases} \qquad k(x,y) = \begin{cases} x & \text{if } y = x^2 \ne 0 \\ 0 & \text{otherwise.} \end{cases}$$ Both partial derivatives of $g$ at $(0,0)$ exist and no other directional derivatives do, and $g$ is not even continuous at the origin. The functions $h$ and $k$ are continuous at $(0,0)$, and $h$ has all its directional derivatives at the origin equal to zero,

the derivatives at the point $t = 1$ are $1, 2, 3$. Therefore, the parametric equations of the tangent line are $$x = c_1'(1)t + c_1(1) = t+1, \quad y = c_2'(1)t + c_2(1) = 2t+1, \quad z = c_3'(1)t + c_3(1) = 3t+1.$$ In order to get the implicit equations (which are not given canonically), we eliminate the parameter $t$, thereby obtaining $$2x - y = 1, \qquad 3x - z = 2. \qquad\square$$

8.D.3. Determine the tangent line $p$ to the curve $c(t) = (\ln t, \arctan t, e^{\sin(\pi t)})$ at the point $t_0 = 1$. ⃝

8.D.4. Find a point on the curve $c(t) = (t^2-1, -2t^2+5t, t-5)$ such that the tangent line at it is parallel to the plane $\varrho : 3x + y - z + 7 = 0$.

Solution. The direction $c'(t_0)$ of the curve $c(t)$ at $t_0$ has to be perpendicular to the normal of $\varrho$, i.e., the scalar product of these two vectors must be $0$. The tangent vector at the point $c(t)$ is $(2t, -4t+5, 1)$; the normal vector of the plane $\varrho$ is $(3,1,-1)$ (just read off the coefficients of $x$, $y$, and $z$ in the equation of $\varrho$). That is, $3\cdot 2t + 1\cdot(-4t+5) - 1\cdot 1 = 0$, i.e. $t = -2$, which gives the point $[3, -18, -7]$. □

8.D.5. Find the parametric equation of the tangent line to the curve given as the intersection of the surfaces $x^2+y^2+z^2 = 4$ and $x^2+y^2-2x = 0$ at the point $[1, 1, \sqrt{2}]$.

Solution. $p = \{[1 - \sqrt{2}s,\ 1,\ \sqrt{2} + s];\ s \in \mathbb{R}\}$. □

8.D.6. The set of differentiable functions. We can notice that multivariate polynomials are differentiable on the whole of their domain. Similarly, the composition of a differentiable univariate function and a differentiable multivariate function leads to a differentiable multivariate function. For instance, the function $\sin(x+y)$ is differentiable on the whole $\mathbb{R}^2$; $\ln(x+y)$ is a differentiable function on the set of points with $x > -y$ (an open half-plane,
i.e., without the boundary line). The proofs of these propositions are left as an exercise on limit compositions.

Remark. Notation of partial derivatives. The partial derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ in the variables $x_1, \dots, x_n$ with respect to the variable $x_1$ will be denoted both by $\frac{\partial f}{\partial x_1}$ and by the shorter expression $f_{x_1}$. In the exercise part of the book, we will rather keep to the latter notation. On the other hand, the notation $\frac{\partial f}{\partial x_1}$ better captures the fact that this is a derivative of $f$ in the direction of the vector field $\frac{\partial}{\partial x_1}$ (you will learn what a vector field is in paragraph 9.1.1).

8.D.7. Determine the domain of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = x^2\sqrt{y}$. Calculate the partial derivatives where they are defined on this domain.

except for the partial derivatives, which are equal to $1$. In particular, $d_vh(0,0)$ is not a linear mapping in the argument $v$. More generally, consider a function $f$ which, along the lines $(r\cos\theta, r\sin\theta)$ with a fixed angle $\theta$, takes the values $\alpha(\theta)r$, where $\alpha(\theta)$ is a periodic odd function of the angle $\theta$, with period $2\pi$. All of its directional derivatives $d_vf$ at $(0,0)$ exist, yet these are not linear expressions depending on the directions $v$ for general functions $\alpha(\theta)$. The graph of such an $f$ can be visualized as a "deformed cone", and we can hardly hope for a good linear approximation at its vertex. Finally, $k$ has all directional derivatives zero, i.e. $d_vk(0) = 0$ for all directions $v$, which is a linear dependence on $v \in \mathbb{R}^2$. But still, the zero mapping is a very bad approximation of $k$ along the parabola $y = x^2$. Check all these claims in detail yourselves!

Therefore, we imitate the case of univariate functions as thoroughly as possible, and avoid such pathological behaviour of functions directly by defining and using the concept of the differential:

Differential of a function

Definition. A function $f : \mathbb{R}^n \to \mathbb{R}$ has the differential at a point $x$ if and only if all of the following three conditions hold:
(1) the directional derivatives $d_vf(x)$ at the point $x$ exist for all vectors $v \in \mathbb{R}^n$;
(2) $d_vf(x)$ depends linearly on the argument $v$;
(3) $\lim_{v\to 0} \frac{1}{\|v\|}\big(f(x+v) - f(x) - d_vf(x)\big) = 0$.
The linear expression $d_vf$ (in a vector variable $v$) is then called the differential $df$ of the function $f$, evaluated at the increment $v$.

In words, it is required that the behaviour of the function $f$ at the point $x$ is well approximated by linear functions of increments of the variable quantities. It follows directly from the definition of directional derivatives that the differential can be defined solely by the property (3). If there is a linear form $df(x)$ such that the increments $v$ at the point $x$ satisfy the property (3) with $d_vf(x) = df(x)(v)$, then $df(x)(v)$ is just the directional derivative of the function $f$ at the point $x$, so the properties (1) and (2) are automatically satisfied. Notice that in dimension one, the only linear functions are multiplications by constant numbers, and if such a number satisfying (3) exists, we call it the derivative $f'(x)$ at the point $x$. Then the first two properties automatically hold true, and thus we did not have to distinguish between the derivative and the differential there.

8.1.8. Examine what can be said about the differential of a function $f(x,y)$ in the plane, supposing both partial derivatives $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$ exist and are continuous in a neighbourhood of a point $(x_0, y_0)$. To this purpose, consider any smooth curve $t \mapsto (x(t), y(t))$ with $x_0 = x(0)$, $y_0 = y(0)$.
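The claims about the function $h$ above can be probed numerically, though of course a finite difference is only an indication, not a proof. A minimal Python sketch with our own helper dir_deriv:

```python
# Directional derivatives of h at (0,0): 1 along the axes, 0 along the
# diagonal -- so v -> d_v h(0,0) cannot be linear (linearity would force 2).
def h(x, y):
    if y == 0:
        return x
    if x == 0:
        return y
    return 0.0

def dir_deriv(fun, v, t=1e-8):
    return (fun(t*v[0], t*v[1]) - fun(0.0, 0.0)) / t

for v in [(1, 0), (0, 1), (1, 1)]:
    print(v, dir_deriv(h, v))   # 1.0, 1.0, 0.0
```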
Solution. The domain of the function in question in $\mathbb{R}^2$ is the half-plane $\{(x,y);\ y \ge 0\}$. In order to determine the partial derivative with respect to a given variable, we consider the other variables to be constants in the formula that defines the function. Then, we simply differentiate the expression as a univariate function. We thus get $$f_x = 2x\sqrt{y} \quad\text{and}\quad f_y = \frac{x^2}{2\sqrt{y}}.$$ The partial derivatives exist at all points of the domain except for the boundary line $y = 0$. □

8.D.8. Determine the derivative of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x,y,z) = x^2yz$ at the point $[1,-1,2]$ in the direction $v = (3,2,-1)$.

Solution. The directional derivative can be calculated in two ways. The first one is to derive it directly from the definition (see paragraph 8.1.6). The second one is to use the differential of the function; see 8.1.7 and 8.1.8. Since the given function is a polynomial, it is differentiable on the whole $\mathbb{R}^3$. Let us follow the definition:
$$f_v(x,y,z) = \lim_{t\to 0}\frac{1}{t}\big[f(x+3t, y+2t, z-t) - f(x,y,z)\big] = \lim_{t\to 0}\frac{1}{t}\big[(x+3t)^2(y+2t)(z-t) - x^2yz\big] = \lim_{t\to 0}\frac{1}{t}\big[t(6xyz + 2x^2z - x^2y) + t^2(\dots)\big] = 6xyz + 2x^2z - x^2y.$$
We have thus derived the derivative in the direction of the vector $(3,2,-1)$ as a function of three real variables which determine the point at which we are interested in the value of the derivative. Evaluating this at the desired point leads to $f_v(1,-1,2) = -7$.

In order to compute the directional derivative from the differential of the function, we first have to determine the partial derivatives of the function: $f_x = 2xyz$, $f_y = x^2z$, $f_z = x^2y$. It follows from the note beyond theorem 8.1.8 that we can express
$$f_v(1,-1,2) = 3f_x(1,-1,2) + 2f_y(1,-1,2) + (-1)f_z(1,-1,2) = 3\cdot(-4) + 2\cdot 2 + (-1)\cdot(-1) = -7. \qquad\square$$

8.D.9. Determine the derivative of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x,y,z) = \frac{\cos(x^2y)}{z}$ at the point $[0,0,2]$ in the direction of the vector $(1,2,3)$.

Solution. The domain of this function is $\mathbb{R}^3$ except for the plane $z = 0$. The following calculations are considered only on this domain. The function in question is differentiable at the point $[0,0,2]$ (this follows from the note 8.D.6). We can determine the value of the examined directional derivative by 8.1.7, using partial derivatives. First, we determine the partial derivatives of the given function (as we have already mentioned in exercise 8.D.7, in

The idea is to use the mean value theorem for univariate functions for differences of function values where only one of the variables changes: $f(x,y) - f(x_0,y) = \frac{\partial f}{\partial x}(x_1, y)(x - x_0)$ for a suitable $x_1$ between $x_0$ and $x$. Apply this in both summands of the following expression separately, to obtain
$$\frac{1}{t}\big(f(x(t),y(t)) - f(x_0,y_0)\big) = \frac{1}{t}\big(f(x(t),y(t)) - f(x_0,y(t))\big) + \frac{1}{t}\big(f(x_0,y(t)) - f(x_0,y_0)\big) = \frac{1}{t}(x(t)-x_0)\,\frac{\partial f}{\partial x}\big(x(\xi), y(t)\big) + \frac{1}{t}(y(t)-y_0)\,\frac{\partial f}{\partial y}\big(x_0, y(\eta)\big)$$
for suitable numbers $\xi$ and $\eta$ between $0$ and $t$. Indeed, by exploiting that the curve $(x(t), y(t))$ is differentiable, there must be such values $\xi$ and $\eta$. Especially, for every sequence of numbers $t_n$ converging to zero, the corresponding sequences of numbers $\xi_n$ and $\eta_n$ also converge to zero (by the squeeze theorem), and they all satisfy the above equality.
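Before finishing the derivation, here is a quick symbolic cross-check of 8.D.8 above, computing the directional derivative both from the limit definition and from the partial derivatives (sympy; the names are ours):

```python
# Both computations give -7 for f = x^2*y*z at [1, -1, 2], v = (3, 2, -1).
import sympy as sp

x, y, z, t = sp.symbols('x y z t', real=True)
f = x**2 * y * z
P = {x: 1, y: -1, z: 2}
v = (3, 2, -1)

# (a) the limit of (f(P + t v) - f(P)) / t as t -> 0
shifted = f.subs({x: 1 + 3*t, y: -1 + 2*t, z: 2 - t})
print(sp.limit((shifted - f.subs(P)) / t, t, 0))    # -7

# (b) the gradient paired with the direction
grad = [sp.diff(f, s).subs(P) for s in (x, y, z)]
print(sum(g*c for g, c in zip(grad, v)))            # -7
```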
If $t$ converges to $0$, the continuity of the partial derivatives, together with the test for convergence of functions using subsequences of the input values (cf. 5.2.15), as well as the properties of the limits of sums and products of functions (cf. Theorem 5.2.13), imply
$$\frac{d}{dt}f(x(t),y(t))\Big|_{t=0} = x'(0)\,\frac{\partial f}{\partial x}(x_0,y_0) + y'(0)\,\frac{\partial f}{\partial y}(x_0,y_0),$$
which is a pleasant extension of the theorem on differentiation of composite functions of one variable to the case $f \circ c$. Of course, with the special choice of parametrized straight lines with direction vector $v = (\xi, \eta)$, $(x(t), y(t)) = (x_0 + t\xi, y_0 + t\eta)$, the calculation leads to the derivative in the direction $v = (\xi, \eta)$ and the equality
$$d_vf(x_0,y_0) = \frac{\partial f}{\partial x}(x_0,y_0)\,\xi + \frac{\partial f}{\partial y}(x_0,y_0)\,\eta.$$
This formula can be expressed in a neat way using the coordinate expressions of linear functions on vector spaces:
$$df = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy,$$
where $dx$ stands for the differential of the function $(x,y) \mapsto x$, i.e. $dx(v) = \xi$, and similarly for $dy$. In other words, the directional derivative $d_vf$ is a linear function $\mathbb{R}^n \to \mathbb{R}$ on the increments, with coordinates given by the partial derivatives.

Now we could similarly prove that the assumption of continuous partial derivatives at a given point guarantees the approximation property of the differential as well. In particular, note that the computation for $f \circ c$ above excluded phenomena like the function $k(x,y)$ above (there $d_vk(0,0) = 0$, but the derivative along the curve $(t,t^2)$ was one). We shall do this for general multivariate functions straightaway.

order to determine the partial derivative with respect to $x$, we differentiate it as a univariate function (in $x$) and use the chain rule; similarly for the other partial derivatives):
$$f_x = -\frac{2xy\sin(x^2y)}{z}, \quad f_y = -\frac{x^2\sin(x^2y)}{z}, \quad f_z = -\frac{\cos(x^2y)}{z^2}.$$
Evaluating the expression gives
$$f_x(0,0,2) + 2\cdot f_y(0,0,2) + 3\cdot f_z(0,0,2) = 1\cdot 0 + 2\cdot 0 + 3\cdot\left(-\frac{1}{4}\right) = -\frac{3}{4}. \qquad\square$$

8.D.10. Having a function $f : \mathbb{R}^n \to \mathbb{R}$ with differential $df(x)$ at a point $x \in \mathbb{R}^n$, determine a unit direction $v \in \mathbb{R}^n$ in which the directional derivative $d_vf(x)$ is maximal.

Solution. According to the note beyond theorem 8.1.8, we are maximizing the function $f_v(x) = v_1f_{x_1}(x) + v_2f_{x_2}(x) + \dots + v_nf_{x_n}(x)$ in dependence on the variables $v_1, \dots, v_n$, which are bound by the condition $v_1^2 + \dots + v_n^2 = 1$. We have already solved this type of problem in chapter 3, when we talked about linear optimization (see 3.A.5). The value $f_v(x)$ can be interpreted as the scalar product of the vectors $(f_{x_1}, \dots, f_{x_n})$ and $(v_1, \dots, v_n)$, and this product is maximal if the vectors have the same direction. The vector $v$ can thus be obtained by normalizing the vector $(f_{x_1}, \dots, f_{x_n})$. We see that the function grows maximally in the direction of $(f_{x_1}, \dots, f_{x_n})$. This vector is called the gradient of the function $f$. In paragraph 8.1.26, we will recall this idea and go into further details. □

Computing the differential of a function is technically very easy: just plug the partial derivatives into the coordinate formula above.

8.D.11. Find the differential of the function $f$ at the point $P$:
i) $f(x,y) = \arctan\frac{x+y}{1-xy}$, $P = [\sqrt{3}, 1]$;
ii) $f(x,y) = \arcsin\frac{x}{\sqrt{x^2+y^2}}$, $P = [1, \sqrt{3}]$;
iii) $f(x,y) = xy + \frac{x}{y}$, $P = [1,1]$.

Solution. i) $df(\sqrt{3}, 1) = \frac{1}{4}dx + \frac{1}{2}dy$; ii) $df(1, \sqrt{3}) = \frac{\sqrt{3}}{4}dx - \frac{1}{4}dy$; iii) $df(1,1) = 2\,dx$. □

Let us note that the differential of a function is a linear map:

8.D.12. Evaluate the differential of the function $f(x,y,z) = 2^x \sin y \arctan z$ at the point $[-4, \frac{\pi}{2}, 0]$ on the increments $dx = 0.05$, $dy = 0.06$, and $dz = 0.08$.

Solution. $df(-4, \frac{\pi}{2}, 0) = 0\,dx + 0\,dy + \frac{1}{16}\,dz = 0.005$. □

The differential can thus be used to approximate the values of a function.
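The chain-rule formula derived above is easy to spot-check symbolically for a concrete function and curve of our own choosing:

```python
# d/dt f(x(t), y(t)) versus f_x * x'(t) + f_y * y'(t) for f = x*y^2 along
# the curve (cos t, t^2); the difference simplifies to zero.
import sympy as sp

t, x, y = sp.symbols('t x y', real=True)
f = x * y**2
xt, yt = sp.cos(t), t**2
lhs = sp.diff(f.subs({x: xt, y: yt}), t)
rhs = (sp.diff(f, x).subs({x: xt, y: yt}) * sp.diff(xt, t)
       + sp.diff(f, y).subs({x: xt, y: yt}) * sp.diff(yt, t))
print(sp.simplify(lhs - rhs))   # 0
```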
8.1.9. The following theorem provides a crucial and very useful observation.

Continuity of partial derivatives

Theorem. Let $f : E_n \to \mathbb{R}$ be a function of $n$ variables with continuous partial derivatives in a neighbourhood of the point $x \in E_n$. Then its differential $df$ at the point $x$ exists, and its coordinate expression is given by the formula
$$df = \frac{\partial f}{\partial x_1}dx_1 + \frac{\partial f}{\partial x_2}dx_2 + \dots + \frac{\partial f}{\partial x_n}dx_n.$$

Proof. This theorem can be derived analogously to the procedure described above for the case $n = 2$. Care is needed in the details to finish the reasoning about the approximation property. As above, consider a curve $c(t) = (c_1(t), \dots, c_n(t))$ with $c(0) = (0, \dots, 0)$ and a point $x \in \mathbb{R}^n$, and express the difference $f(x + c(t)) - f(x)$ for the composite function $f(c(t))$ as the telescoping sum
$$\big(f(x_1+c_1(t), x_2+c_2(t), \dots, x_n+c_n(t)) - f(x_1, x_2+c_2(t), \dots, x_n+c_n(t))\big) + \big(f(x_1, x_2+c_2(t), \dots, x_n+c_n(t)) - f(x_1, x_2, x_3+c_3(t), \dots, x_n+c_n(t))\big) + \dots + \big(f(x_1, x_2, \dots, x_{n-1}, x_n+c_n(t)) - f(x_1, x_2, \dots, x_n)\big).$$
Now, apply the mean value theorem to all of the $n$ summands, obtaining (similarly to the case of two variables)
$$(c_1(t) - c_1(0))\,\frac{\partial f}{\partial x_1}\big(x_1+c_1(\theta_1), x_2+c_2(t), \dots, x_n+c_n(t)\big) + (c_2(t) - c_2(0))\,\frac{\partial f}{\partial x_2}\big(x_1, x_2+c_2(\theta_2), \dots, x_n+c_n(t)\big) + \dots + (c_n(t) - c_n(0))\,\frac{\partial f}{\partial x_n}\big(x_1, x_2, \dots, x_n+c_n(\theta_n)\big),$$
for appropriate values $\theta_i$, $0 \le \theta_i \le t$. This is a finite sum, so the same reasoning as in the case of two variables verifies that
$$\frac{d}{dt}f(x + c(t))\Big|_{t=0} = c_1'(0)\,\frac{\partial f}{\partial x_1}(x) + \dots + c_n'(0)\,\frac{\partial f}{\partial x_n}(x).$$
The special choice of the curves $c(t) = tv$ for a direction vector $v$ verifies the statement about the existence and linearity of the directional derivatives at $x$. Finally, apply the mean value theorem in the same way to the difference
$$f(x+v) - f(x) = d_vf(x+\theta v) = v_1\,\frac{\partial f}{\partial x_1}(x+\theta v) + \dots + v_n\,\frac{\partial f}{\partial x_n}(x+\theta v)$$

8.D.13. Approximate $\sqrt{2.98^2 + 4.05^2}$ with the use of the differential (and not with a calculator).

Solution. We use the differential of the function $f(x,y) = \sqrt{x^2+y^2}$ at the point $[3,4]$. We have $f_x' = \frac{x}{\sqrt{x^2+y^2}}$, $f_y' = \frac{y}{\sqrt{x^2+y^2}}$, thus
$$\sqrt{2.98^2 + 4.05^2} \doteq f(3,4) + df(3,4)(2.98-3,\ 4.05-4) = \sqrt{3^2+4^2} + \frac{3}{\sqrt{3^2+4^2}}(-0.02) + \frac{4}{\sqrt{3^2+4^2}}(0.05) = 5 - \frac{0.06}{5} + \frac{0.2}{5} = 5.028. \qquad\square$$

8.D.14. With the help of the differential, calculate approximately: i) $\arctan\frac{1.02}{0.95}$, ii) $\ln(0.97^2 + 0.05^2)$, iii) $\arcsin\frac{0.48}{1.05}$, iv) $1.04^{2.02}$. ⃝

8.D.15. What is approximately the change (in cm³) of the volume of a cone with base radius $r = 10$ cm and height $h = 10$ cm, if we increase the radius by 5 mm and decrease the height by 5 mm?

Solution. The volume, as a function of the radius $r$ and the height $h$, is $V(r,h) = \frac{1}{3}\pi r^2 h$. The change is approximately given by the differential of $V$ at $[10,10]$, evaluated on $dr = 10.5 - 10 = 0.5$ and $dh = 9.5 - 10 = -0.5$. We get $\frac{50}{3}\pi\ \mathrm{cm}^3$. □

8.D.16. Find the tangent plane to the graph of the function $f : \mathbb{R}^2 \to \mathbb{R}$ at the point $P = [x_0, y_0, f(x_0,y_0)]$:
i) $f(x,y) = \sqrt{1-x^2-y^2}$, $P = [x_0,y_0,z_0] = [\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, ?]$;
ii) $f(x,y) = e^{x^2+y^2}$, $P = [x_0,y_0,z_0] = [0,0,?]$;
iii) $f(x,y) = x^2 + xy + 2y^2$, $P = [x_0,y_0,z_0] = [1,1,?]$;
iv) $f(x,y) = \arctan\frac{y}{x}$, $P = [x_0,y_0,z_0] = [1,-1,?]$.

Solution. i) $f(x_0,y_0) = \sqrt{1 - \frac13 - \frac13} = \frac{1}{\sqrt{3}}$, thus $z_0 = \frac{1}{\sqrt{3}}$. Further, $f_x' = -\frac{x}{\sqrt{1-x^2-y^2}}$ and $f_y' = -\frac{y}{\sqrt{1-x^2-y^2}}$, thus $f_x'(x_0,y_0) = \frac{-1/\sqrt{3}}{1/\sqrt{3}} = -1$ and $f_y'(x_0,y_0) = -1$. The equation of the tangent plane at $[\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}]$ is $z = \frac{1}{\sqrt{3}} - (x - \frac{1}{\sqrt{3}}) - (y - \frac{1}{\sqrt{3}})$, or $x + y + z = \sqrt{3}$;
ii) $z_0 = 1$, $z = 1$;
iii) $z_0 = 4$, $3x + 5y - z = 4$;
iv) $z_0 = -\frac{\pi}{4}$, $x + y - 2z = \frac{\pi}{2}$. □
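A quick numeric companion to 8.D.13 above, comparing the differential approximation with the exact value:

```python
# f(x, y) = sqrt(x^2 + y^2) at [3, 4]: the linear approximation vs. the truth.
import math

fx, fy = 3/5, 4/5                       # the partial derivatives at [3, 4]
approx = 5 + fx*(-0.02) + fy*0.05       # f(3,4) + df(3,4)(dx, dy)
exact  = math.hypot(2.98, 4.05)
print(approx, exact, abs(approx - exact))   # 5.028 vs ~5.02821, error ~2e-4
```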
with an appropriate $\theta$, $0 \le \theta \le 1$, where the latter equality holds according to the formula for directional derivatives derived above, for sufficiently small $v$. Since all the partial derivatives are continuous at the point $x$, for an arbitrarily small $\varepsilon > 0$, there is a $\delta$–neighbourhood $U$ of the origin in $\mathbb{R}^n$ such that for $w \in U$, all the partial derivatives $\frac{\partial f}{\partial x_i}(x+w)$ differ from $\frac{\partial f}{\partial x_i}(x)$ by less than $\varepsilon$. Hence we get the estimate
$$\frac{1}{\|w\|}\Big|f(x+w) - f(x) - d_wf(x)\Big| \le \frac{1}{\|w\|}\Big|f(x+w) - f(x) - d_wf(x+\theta w)\Big| + \frac{1}{\|w\|}\Big|d_wf(x+\theta w) - d_wf(x)\Big| = \frac{1}{\|w\|}\Big|w_1\Big(\frac{\partial f}{\partial x_1}(x+\theta w) - \frac{\partial f}{\partial x_1}(x)\Big) + \dots + w_n\Big(\frac{\partial f}{\partial x_n}(x+\theta w) - \frac{\partial f}{\partial x_n}(x)\Big)\Big| \le \frac{n}{\|w\|}\,\|w\|\,\varepsilon = n\varepsilon,$$
where $\theta$ is the parameter for which the expression on the second line vanishes. Thus, the approximation property of the differential is satisfied as well. □

The approximation property of the differential can be written as $f(x+v) = f(x) + df(x)(v) + \alpha(v)$, where the function $\alpha(v)$ satisfies $\lim_{v\to 0}\frac{\alpha(v)}{\|v\|} = 0$, i.e. $\alpha(v) = o(\|v\|)$ in the asymptotic terminology introduced in 6.1.12 on page 520.

8.1.10. A plane tangent to the graph of a function. The linear approximation of the behaviour of a function by its differential can be expressed in terms of its graph, similarly to the case of univariate functions. We work with hyperplanes instead of tangent lines. In the case of a function on $E_2$ and a fixed point $(x_0,y_0) \in E_2$, consider the plane in $E_3$ given by the equation in the three coordinates $(x,y,z)$:
$$z = f(x_0,y_0) + df(x_0,y_0)(x-x_0, y-y_0) = f(x_0,y_0) + \frac{\partial f}{\partial x}(x_0,y_0)(x-x_0) + \frac{\partial f}{\partial y}(x_0,y_0)(y-y_0).$$
It is already seen that the increase of the values of a differentiable function $f : E_n \to \mathbb{R}$ between points $x + tv$ and $x$ is always expressed in terms of the directional derivative $d_vf$ at a suitable point between them. Therefore, this is the only plane containing the point $(x_0, y_0, f(x_0,y_0))$ with the property that the tangent lines of all curves $c(t) = (x(t), y(t), f(x(t),y(t)))$ lie in it. It is called the tangent plane to the graph of the function $f$.

8.D.17. Find all points on the conic $k : x^2 + 3y^2 - 2x + 6y - 8 = 0$ at which the normal of the conic is parallel to the $y$ axis. For each such point, write the equation of the tangent at that point.

Solution. The normal to $k$ at a point $[x_0, y_0] \in k$ is parallel to the $y$ axis iff the tangent to $k$ at $[x_0, y_0]$ is parallel to the $x$ axis, and this happens iff $y'(x_0) = 0$, where $y$ is the function given implicitly by $k$ in a neighbourhood of $[x_0, y_0]$. Differentiating the equation of $k$ gives $2x + 6yy' - 2 + 6y' = 0$, that is, $y' = \frac{1-x}{3(1+y)}$. Thus $y'(x_0) = 0$ iff $x_0 = 1$. Substituting into the equation of $k$, we get $1 + 3y_0^2 - 2 + 6y_0 - 8 = 0$, thus $y_0 = 1$ or $y_0 = -3$. The sought points are $[1,1]$ and $[1,-3]$; the equations of the tangents at these points are $y = 1$ and $y = -3$, respectively. □

8.D.18. On the conic given by the equation $3x^2 + 6y^2 - 3x + 3y - 2 = 0$, find all points where the normal to the conic is parallel to the line $y = x$. For each such point, give the equation of the tangent at that point. ⃝

8.D.19. On the conic given by the equation $x^2 + xy + 2y^2 - x + 3y - 54 = 0$, find all points where the normal to the conic is parallel to the $x$ axis. For each such point, give the equation of the tangent at that point. ⃝

8.D.20. On the graph of the function $u(x,y,z) = x\sqrt{y^2+z^2}$, find all points where the tangent plane is parallel to the plane $x + y - z - u = 0$. ⃝

8.D.21.
Find the points on the ellipsoid $x^2 + 2y^2 + z^2 = 1$ where the tangent planes are parallel to the plane $x - y + 2z = 0$.

Solution. The equation of the tangent plane is determined by the partial derivatives of $z = z(x,y)$, given implicitly by the equation $x^2 + 2y^2 + z^2 = 1$ of the ellipsoid. The normal vector at $[x_0, y_0, z_0]$ is $(z_x'(x_0,y_0), z_y'(x_0,y_0), -1)$. This vector has to be parallel to the normal $(1,-1,2)$ of the plane, thus $$(-2z_x'(x_0,y_0),\ -2z_y'(x_0,y_0),\ 2) = (1,-1,2),$$ which yields $2x_0 = z_0$ and $4y_0 = -z_0$; after substituting into the ellipsoid's equation, we get the sought points $[\frac{2}{\sqrt{22}}, -\frac{1}{\sqrt{22}}, \frac{4}{\sqrt{22}}]$ and $[-\frac{2}{\sqrt{22}}, \frac{1}{\sqrt{22}}, -\frac{4}{\sqrt{22}}]$.

Another solution. It is useful to realize that the normal vector at $[x_0, y_0, z_0]$ of the surface given implicitly by $F(x,y,z) = 0$ is the vector $(F_x'(x_0,y_0,z_0), F_y'(x_0,y_0,z_0), F_z'(x_0,y_0,z_0))$. □

8.D.22. Determine whether the tangent plane to the graph of the function $f : \mathbb{R}\times\mathbb{R}^+ \to \mathbb{R}$, $f(x,y) = x\cdot\ln(y)$ at the point $[1, \frac{1}{e}]$ goes through the point $[1,2,3] \in \mathbb{R}^3$.

Solution. First of all, we calculate the partial derivatives: $f_x(x,y) = \ln(y)$, $f_y(x,y) = \frac{x}{y}$; their values at the point $[1, \frac1e]$ are $-1$ and $e$; further, $f(1, \frac1e) = -1$. Therefore, the equation of the tangent plane is
$$z = f\big(1, \tfrac1e\big) + f_x\big(1, \tfrac1e\big)(x-1) + f_y\big(1, \tfrac1e\big)\big(y - \tfrac1e\big) = -1 - x + ey.$$

Two tangent planes to the graph of the function $f(x,y) = \sin(x)\cos(y)$ are shown in the illustration, the diagonal line being the image of the curve $c(t) = (t, t, f(t,t))$.

For the case of functions of $n$ variables, the tangent plane is defined as an analogy to the tangent plane to a surface in the three-dimensional space. Instead of being overwhelmed by many indices, it is useful to recall affine geometry, where hyperplanes can be used, see paragraph 4.1.3.

Tangent (hyper)planes

Definition. A tangent hyperplane to the graph of a function $f : \mathbb{R}^n \to \mathbb{R}$ at a point $x \in \mathbb{R}^n$ is the hyperplane containing the point $(x, f(x))$ whose modelling vector space is the graph of the linear mapping $df(x) : \mathbb{R}^n \to \mathbb{R}$, i.e. the differential at the point $x \in E_n$.

The definition takes advantage of the fact that the directional derivative $d_vf$ is given by the increment in the tangent (hyper)plane corresponding to the increment $v$. Many analogies with univariate functions follow from this fact. In particular, a differentiable function $f$ on $E_n$ has zero differential at a point $x \in E_n$ if and only if its composition with any curve going through this point has a stationary point there, i.e., is neither increasing nor decreasing in the linear approximation. In other words, the tangent plane at such a point is parallel to the hyperplane of the variables (i.e., its modelling space is $\mathbb{R}^n \subset \mathbb{R}^{n+1}$, with the last coordinate set to zero). Of course, this does not mean that $f$ should have a local extremum at such a point. Just as in the case of univariate functions, this depends on the values of the higher derivatives. But it is a necessary condition for the existence of an extremum.
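The tangent-plane computation of 8.D.22 can be verified with sympy; the conclusion for the point $[1,2,3]$ is drawn in the text right below.

```python
# Build the tangent plane to f(x, y) = x*ln(y) at [1, 1/e] from the partial
# derivatives and evaluate it at the candidate point.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x * sp.log(y)
x0, y0 = 1, sp.exp(-1)
plane = (f.subs({x: x0, y: y0})
         + sp.diff(f, x).subs({x: x0, y: y0}) * (x - x0)
         + sp.diff(f, y).subs({x: x0, y: y0}) * (y - y0))
print(sp.simplify(plane))                      # e*y - x - 1
print(sp.simplify(plane.subs({x: 1, y: 2})))   # 2*e - 2, which is not 3
```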
8.1.11. Derivatives of higher orders. The operation of differentiation can be iterated, similarly to the case of univariate functions. This time, new directions can be chosen for each iteration. Fix an increment $v \in \mathbb{R}^n$. The evaluation of the differentials at this argument defines a (differential) operation on differentiable functions $f : E_n \to \mathbb{R}$, $$f \mapsto d_vf = df(v),$$ and the result is again a function $df(v) : E_n \to \mathbb{R}$. If this function is differentiable as well, we can repeat this procedure with another increment, and so on.

The given point does not satisfy this equation, so it does not lie in the tangent plane. □

8.D.23. Determine the parametric equation of the tangent line to the intersection of the graphs of the functions $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = x^2 + xy - 6$, and $g : \mathbb{R}\times\mathbb{R}^+ \to \mathbb{R}$, $g(x,y) = x\cdot\ln(y)$ at the point $[2,1]$.

Solution. The tangent line to the intersection is the intersection of the tangent planes at the given point. The plane that is tangent to the graph of $f$ and goes through the point $[2,1]$ is $$z = f(2,1) + f_x(2,1)(x-2) + f_y(2,1)(y-1) = 5x + 2y - 12.$$ The tangent plane to the graph of $g$ is then $$z = g(2,1) + g_x(2,1)(x-2) + g_y(2,1)(y-1) = 2y - 2.$$ The intersection line of these two planes is given parametrically as $[2, t, 2t-2]$, $t \in \mathbb{R}$.

Another solution. The normal to the surface given by the equation $f(x,y) - z = 0$ at the point $b = [2,1,0]$ is $(f_x(b), f_y(b), -1) = (5,2,-1)$; the normal to the surface given by $g(x,y) - z = 0$ at the same point is $(0,2,-1)$. The tangent line is perpendicular to both normals; we can thus obtain a vector parallel to it as the vector product of the normals, which is $(0,5,10)$. Since the tangent line goes through the point $[2,1,0]$, its parametric equation is $[2, 1+t, 2t]$, $t \in \mathbb{R}$. □

8.D.24. Compute all first-order and second-order partial derivatives of the function $f(x,y,z) = x^{\frac{y}{z}}$.

Solution. $f_x' = \frac{y}{z}\,x^{\frac{y}{z}-1}$, $f_y' = x^{\frac{y}{z}}\ln x\cdot\frac{1}{z}$, $f_z' = x^{\frac{y}{z}}\ln x\cdot\frac{-y}{z^2}$; $f_{xx}'' = \frac{y}{z}\big(\frac{y}{z}-1\big)x^{\frac{y}{z}-2}$, $f_{yy}'' = x^{\frac{y}{z}}\ln^2 x\cdot\frac{1}{z^2}$, $f_{zz}'' = x^{\frac{y}{z}}\ln^2 x\cdot\frac{y^2}{z^4} + x^{\frac{y}{z}}\ln x\cdot\frac{2y}{z^3}$, $f_{xy}'' = \frac{1}{z}x^{\frac{y}{z}-1} + \frac{y}{z}x^{\frac{y}{z}-1}\ln x\cdot\frac{1}{z}$, $f_{xz}'' = \frac{-y}{z^2}x^{\frac{y}{z}-1} + \frac{y}{z}x^{\frac{y}{z}-1}\ln x\cdot\frac{-y}{z^2}$, $f_{yz}'' = x^{\frac{y}{z}}\ln^2 x\cdot\frac{-y}{z^3} + x^{\frac{y}{z}}\ln x\cdot\frac{-1}{z^2}$. □

8.D.25. Find all first-order and second-order partial derivatives of $z = f(x,y)$, defined in a neighbourhood of the point $[1, \sqrt{2}, 2]$ by $x^2 + y^2 + z^2 - xz - \sqrt{2}yz = 1$. ⃝

8.D.26. Find all first-order and second-order partial derivatives of $z = f(x,y)$, defined in a neighbourhood of the point $[-2, 0, 1]$ by $2x^2 + 2y^2 + z^2 + 8xz - z + 8 = 0$. ⃝

8.D.27. Determine all second partial derivatives of the function $f$ given by $f(x,y,z) = \sqrt{xy\ln z}$.

Solution. First, we determine the domain of the given function: the argument of the square root must be non-negative, and the argument of the natural logarithm must be positive. Therefore, $D_f = \{(x,y,z) \in \mathbb{R}^3;\ (z \ge 1 \wedge xy > 0) \vee (0 < z < 1 \wedge xy < 0)\}$.

In particular, we work with iterations of partial derivatives. For second-order partial derivatives, we write
$$\left(\frac{\partial}{\partial x_j}\circ\frac{\partial}{\partial x_i}\right)f = \frac{\partial^2}{\partial x_i\partial x_j}f = \frac{\partial^2 f}{\partial x_i\partial x_j}.$$
In the case of the repeated choice $i = j$, we write
$$\left(\frac{\partial}{\partial x_i}\circ\frac{\partial}{\partial x_i}\right)f = \frac{\partial^2}{\partial x_i^2}f = \frac{\partial^2 f}{\partial x_i^2}.$$
We proceed in the same way with further iterations and talk about partial derivatives of order $k$,
$$\frac{\partial^k f}{\partial x_{i_1}\dots\partial x_{i_k}}.$$
More generally, one can iterate (assuming the function is sufficiently differentiable) any directional derivatives, for instance $d_v \circ d_wf$ for two fixed increments $v, w \in \mathbb{R}^n$.

$k$–times differentiable functions

Definition. A function $f : E_n \to \mathbb{R}$ is $k$–times (continuously) differentiable at a point $x$ if and only if all its partial derivatives up to order $k$ (inclusive) exist in a neighbourhood of the point $x$ and are continuous at this point. $f$ is $k$–times differentiable if it is $k$–times (continuously) differentiable at all points of its domain.

From now on, we work with continuously differentiable functions unless explicitly stated otherwise.
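The interchangeability of mixed partial derivatives invoked in 8.D.27 (and proved in 8.1.12 below) can be spot-checked symbolically for the function from 8.D.24:

```python
# f_ab == f_ba for f = x**(y/z), for every pair of variables.
import sympy as sp

x, y, z = sp.symbols('x y z', positive=True)
f = x**(y/z)
for a, b in [(x, y), (x, z), (y, z)]:
    lhs = sp.diff(f, a, b)              # differentiate w.r.t. a, then b
    rhs = sp.diff(f, b, a)              # the opposite order
    print(a, b, sp.simplify(lhs - rhs) == 0)   # True, True, True
```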
To show the basic features of the higher derivatives in the simplest form, we work in the plane $E_2$. In the plane, as well as in the space, iterated derivatives are often denoted by mere indices referring to the variable names, for example
$$f_x = \frac{\partial f}{\partial x}, \quad f_{xx} = \frac{\partial^2 f}{\partial x^2}, \quad f_{xy} = \frac{\partial^2 f}{\partial x\partial y}, \quad f_{yx} = \frac{\partial^2 f}{\partial y\partial x}.$$
Supposing the second-order partial derivatives are continuous at the point $(x,y)$ (and exist in its neighbourhood), we show that the partial derivatives commute, that is, the order in which differentiation is carried out does not matter. By our assumption, the limits
$$f_{xy}(x,y) = \lim_{t\to 0}\frac{1}{t}\big(f_x(x, y+t) - f_x(x,y)\big) = \lim_{t\to 0}\frac{1}{t}\Big(\lim_{s\to 0}\frac{1}{s}\big(f(x+s, y+t) - f(x, y+t) - (f(x+s, y) - f(x,y))\big)\Big)$$
exist. However, since the limits can be expressed by any choice of values $t_n \to 0$ and $s_n \to 0$ and the limits of the corresponding sequences, the second derivative can be expressed

Now, we calculate the first partial derivatives with respect to each of the three variables:
$$f_x = \frac{y\ln(z)}{2\sqrt{xy\ln(z)}}, \quad f_y = \frac{x\ln(z)}{2\sqrt{xy\ln(z)}}, \quad f_z = \frac{xy}{2z\sqrt{xy\ln(z)}}.$$
Each of these three partial derivatives is again a function of three variables, so we can consider (first) partial derivatives of these functions. Those are the second partial derivatives of the function $f$. We write the variables with respect to which we differentiate as subscripts of the function $f$:
$$f_{xx} = -\frac{y^2\ln^2 z}{4(xy\ln z)^{\frac32}}, \quad f_{xy} = -\frac{xy\ln^2 z}{4(xy\ln z)^{\frac32}} + \frac{\ln z}{2\sqrt{xy\ln z}}, \quad f_{xz} = -\frac{xy^2\ln z}{4z(xy\ln z)^{\frac32}} + \frac{y}{2z\sqrt{xy\ln z}},$$
$$f_{yy} = -\frac{x^2\ln^2 z}{4(xy\ln z)^{\frac32}}, \quad f_{yz} = -\frac{x^2y\ln z}{4z(xy\ln z)^{\frac32}} + \frac{x}{2z\sqrt{xy\ln z}}, \quad f_{zz} = -\frac{x^2y^2}{4z^2(xy\ln z)^{\frac32}} - \frac{xy}{2z^2\sqrt{xy\ln z}}.$$
By the theorem about the interchangeability of partial derivatives (see 8.1.12), we know that $f_{xy} = f_{yx}$, $f_{xz} = f_{zx}$, $f_{yz} = f_{zy}$. Therefore, it suffices to compute the mixed partial derivatives (the word "mixed" means that we differentiate with respect to more than one variable) just for one order of differentiation. □

E. Taylor polynomials

8.E.1. Write the second-order Taylor expansion of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = \ln(x^2+y^2+1)$ at the point $[1,1]$.

Solution. First, we compute the first partial derivatives, $$f_x = \frac{2x}{x^2+y^2+1}, \quad f_y = \frac{2y}{x^2+y^2+1},$$ then the Hessian: $$Hf(x,y) = \begin{pmatrix} \frac{2y^2-2x^2+2}{(x^2+y^2+1)^2} & -\frac{4xy}{(x^2+y^2+1)^2} \\ -\frac{4xy}{(x^2+y^2+1)^2} & \frac{2x^2-2y^2+2}{(x^2+y^2+1)^2} \end{pmatrix}.$$ The value of the Hessian at the point $[1,1]$ is $$\begin{pmatrix} \frac29 & -\frac49 \\ -\frac49 & \frac29 \end{pmatrix}.$$

as
$$f_{xy}(x,y) = \lim_{t\to 0}\frac{1}{t^2}\Big(\big(f(x+t, y+t) - f(x, y+t)\big) - \big(f(x+t, y) - f(x,y)\big)\Big),$$
if the limit on the right-hand side exists (notice that we cannot take this for granted without further arguments). Consider the expression from which the last limit is taken as a function $\varphi(x,y,t)$, and try to express it in terms of partial derivatives. For a temporarily fixed $t$, denote $g(x,y) = f(x+t, y) - f(x,y)$. Notice that the partial derivative $g_y$ exists in a neighbourhood of $(x,y)$ and is continuous at $(x,y)$. Thus, for small $t$, we may apply the mean value theorem, viewing $g$ as a function of $y$. Therefore, the expression in the last large parentheses equals
$$g(x, y+t) - g(x,y) = t\cdot g_y(x, y+t_0)$$
for a suitable $t_0$ which lies between $0$ and $t$ (the value of $t_0$ depends on $t$). Next, $g_y(x,y) = f_y(x+t, y) - f_y(x,y)$, so we may rewrite $\varphi$ as
$$\varphi(x,y,t) = \frac{1}{t}\,g_y(x, y+t_0) = \frac{1}{t}\big(f_y(x+t, y+t_0) - f_y(x, y+t_0)\big).$$
Another application of the mean value theorem yields $\varphi(x,y,t) = f_{yx}(x+t_1, y+t_0)$ for a suitable $t_1$ between $0$ and $t$.
Since it is assumed that the second-order partial derivatives of $f$ are continuous at the point $(x,y)$, the requested limit $\lim_{t\to 0}\varphi(x,y,t)$ must exist, and we have arrived at the desired equality $f_{xy} = f_{yx}$.

8.1.12. Schwarz's Theorem. The same procedure for functions of $n$ variables proves the following fundamental result:²

Commutativity of partial derivatives

Theorem. Let $f : E_n \to \mathbb{R}$ be a $k$–times differentiable function with continuous partial derivatives up to order $k$ (inclusive) at the point $x \in \mathbb{R}^n$. Then all partial derivatives of the function $f$ at the point $x$ up to order $k$ (inclusive) are independent of the order of differentiation.

Proof. The proof for the second order is illustrated above in the special case $n = 2$. In fact, it yields the general case as well. Indeed, notice that for every fixed choice of a pair of coordinates $x_i$ and $x_j$, the discussion of their interchanging takes place in a two-dimensional affine subspace (all the other variables are considered to be constant and do not affect the discussion). So neighbouring partial derivatives may be interchanged. This solves the problem in order two. In the case of higher-order derivatives, the proof can be completed by induction on the order: every order of the indices $i_1, \dots, i_k$ can be obtained from a fixed one by several interchanges of adjacent pairs of indices. □

²This is a great example of a result which was used widely for many decades before a complete proof was published in 1873 by Karl Hermann Amandus Schwarz (1843–1921). The result is also known as Clairaut's theorem, following the earlier version for functions requiring continuous partial derivatives on a neighbourhood of $x$.

Altogether, we get that the second-order Taylor expansion at the point $[1,1]$ is
$$T_2(x,y) = f(1,1) + f_x(1,1)(x-1) + f_y(1,1)(y-1) + \frac12\,(x-1,\ y-1)\,Hf(1,1)\begin{pmatrix} x-1 \\ y-1 \end{pmatrix} = \ln(3) + \frac23(x-1) + \frac23(y-1) + \frac19(x-1)^2 - \frac49(x-1)(y-1) + \frac19(y-1)^2 = \frac19\big(x^2 + y^2 + 8x + 8y - 4xy - 14\big) + \ln(3). \qquad\square$$

Remark. In particular, we can see that the second-order Taylor expansion of an arbitrary (sufficiently differentiable) function at a given point is a polynomial of degree at most two.

8.E.2. Determine the second-order Taylor polynomial of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = xy\cos y$ at the point $[\pi, \pi]$. Decide whether the tangent plane to the graph of this function at the point $[\pi, \pi, f(\pi,\pi)]$ goes through the point $[0, \pi, 0]$.

Solution. As in the exercises above, we find out that $$T(x,y) = \frac12\pi^2y^2 - xy - \pi^3y + \frac12\pi^4.$$ The tangent plane to the graph of the given function at the point $[\pi,\pi]$ is given by the first-order Taylor polynomial at that point; its general equation is thus $$z = -\pi y - \pi x + \pi^2,$$ and this equation is satisfied by the given point $[0,\pi,0]$. □

8.E.3. Determine the third-order Taylor polynomial of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x,y,z) = x^3y + xz^2 + xy + 1$ at the point $[0,0,0]$. ⃝

8.E.4. Determine the second-order Taylor polynomial of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = x^2\sin y + y^2\cos x$ at the point $[0,0]$. Decide whether the tangent plane to the graph of this function at the point $[0,0,0]$ goes through the point $[\pi, \pi, \pi]$. ⃝

8.E.5. Determine the second-order Taylor polynomial of the function $\ln(x^2y)$ at the point $[1,1]$. ⃝

8.E.6. Determine the second-order Taylor polynomial of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = \tan(xy + y)$ at the point $[0,0]$. ⃝
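The Taylor polynomial of 8.E.1 can be rebuilt mechanically from the gradient and the Hessian; the following sympy sketch confirms the closed form obtained above.

```python
# Second-order Taylor polynomial of ln(x^2 + y^2 + 1) at [1, 1], compared
# with the closed form from the solution of 8.E.1.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = sp.log(x**2 + y**2 + 1)
P = {x: 1, y: 1}
v = sp.Matrix([x - 1, y - 1])

grad = sp.Matrix([sp.diff(f, x), sp.diff(f, y)]).subs(P)
H = sp.hessian(f, (x, y)).subs(P)
T2 = f.subs(P) + (grad.T * v)[0] + sp.Rational(1, 2) * (v.T * H * v)[0]

claimed = sp.Rational(1, 9)*(x**2 + y**2 + 8*x + 8*y - 4*x*y - 14) + sp.log(3)
print(sp.simplify(sp.expand(T2 - claimed)))   # 0
```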
In the case of higher-order derivatives, the proof can be completed by induction on the order. Every order of the indices i1, . . . , ik can be obtained from a fixed one by several interchanges of adjacent pairs of indices. □ 8.1.13. Hessian. The differential was introduced as the linear form df(x) which approximates the function f at a point x in the best possible way. Similarly, a quadratic approximation of a function f : En → R is possible. Hessian Definition. If f : Rn → R is a twice differentiable function, the symmetric matrix of functions Hf(x) = ( ∂2 f ∂xi∂xj (x) ) =     ∂2 f ∂x1∂x1 (x) . . . ∂2 f ∂x1∂xn (x) ... ... ... ∂2 f ∂xn∂x1 (x) . . . ∂2 f ∂xn∂xn (x)     is called the Hessian of the function f. It is already seen from the previous reasonings that the vanishing of the differential at a point (x, y) ∈ E2 guarantees stationary behaviour along all curves going through this point. The Hessian Hf(x, y) = ( fxx(x, y) fxy(x, y) fyx(x, y) fyy(x, y) ) plays the role of the second derivative. For every parametrized straight line c(t) = (x(t), y(t)) = (x0 + ξt, y0 + ηt), the derivative of the univariate function α(t) = f(x(t), y(t)) can be computed by means of the formula d dt f(t) = fx(x(t), y(t))x′ (t) + fy(x(t), y(t))y′ (t) (derived in 8.1.8) and so the function β (t) = f(x0, y0) + t ∂f ∂x (x0, y0)ξ + t ∂f ∂y (x0, y0)η + t2 2 ( fxx(x0, y0)ξ2 + 2fxy(x0, y0)ξη + fyy(x0, y0)η2 ) shares the same derivatives up to the second order (inclusive) at the point t = 0 (calculate this on your own!). The function β can be written in terms of vectors as β(t) = f(x0, y0) + df(x0, y0)(tv) + 1 2 Hf(x0, y0)(tv, tv), where v = (ξ, η) is the increment given by the derivative of the curve c(t), and the Hessian is used as a symmetric 2–form. This is an expression which looks like Taylor’s theorem for univariate functions, namely the quadratic approximation 724 to zero simultaneously, the system has the following solution: {x = y = 0}, {x = 0, y = 1}, {x = 1, y = 0}, {x = 1/3, y = 1/3}, which are four stationary points of the given function. The Hessian of the function f is( 2y 2x + 2y − 1 2x + 2y − 1 2x ) . Its values at the stationary points are, respectively, ( 0 −1 −1 0 ) , ( 1 1 1 0 ) , ( 0 1 1 1 ) , (2 3 1 3 1 3 2 3 ) . Therefore, the first three Hessians are indefinite, and the last one is positive definite. The point [1/3, 1/3] is thus a local minimum. □ 8.F.2. Determine the point in the plane x+y+3z = 5 lying in R3 which is closest to the origin of the coordinate system. First, do this by applying the methods of linear algebra; then, using the methods of differential calculus. Solution. It is the intersection point of the perpendicular going through the point [0, 0, 0] to the plane. The normal to the plane is (t, t, 3t), t ∈ R. Substituting into the equation of the plane, we get the intersection point [5/11, 5/11, 15/11]. Alternatively, we can minimize the distance (or its square) of the plane’s points from the origin, i. e., the function (5 − y − 3z)2 + y2 + z2 . Setting the partial derivatives equal to zero, we get the system 3y + 10z − 15 = 0 2y + 3z − 5 = 0, whose solution is as above. Since we know that the minimum exists and is the only stationary point, we need not calculate the Hessian any more. □ 8.F.3. Determine the local extrema of the function f(x, y) = x2 + arctan2 x + y3 + y , x, y ∈ R. Solution. The function f can be written as the sum f1 + f2, where f1(x) = x2 + arctan2 x, f2(y) = y3 + y , x, y ∈ R. 
If the function f has a local extremum at a point, then it does so with respect to an arbitrary subset of its domain. In other words, if the function has, for instance, a maximum at a point [a, b] and we set y = b, then the univariate function f(x, b) of x must have a maximum at the point x = a. Let us thus fix an arbitrary y ∈ R. For this fixed value of y, we get a univariate function, which is a shift of the function f1. This means that its maxima and minima are at the same points. However, it is easy to find the extrema of the function f1. We can just realize that this function is even (it is the sum of two even functions, and the function y = arctan2 x is the product of two odd functions) and increasing for x ≥ 0 (the composition as well as the sum of increasing functions is again an increasing function). Therefore, it has a unique extremum, and that is a minimum at the point x = 0. Similarly, for any fixed value of x, f is a shift of the function f2, and f2 has a minimum at the point y = 0, which is its only extremum. We CHAPTER 8. CALCULUS WITH MORE VARIABLES of a function by Taylor’s polynomial of degree two. The following illustration shows both the tangent plane and this quadratic approximation for two distinct points and the function f(x, y) = sin(x) cos(y). popis obrazku 6 5 4 x 3 2 1 0 0 1 2 3 4 y 5 6-2 -1 0 1 2 6 5 4 x 3 2 1 0 0 1 2 3 4 y 5 6-2 -1 0 1 2 8.1.14. Taylor’s expansion. The multidimensional version of Taylor’s theorem is an example of a mathematical statement where the most difficult part is finding the right formulation. The proof is then quite simple. The discussion on the Hessians continues. Write Dk f for the k–th order approximations of the function f : En → Rn . It is always a k–linear expressions in the increments. The differential D1 f = df (the first order) and the Hessian D2 f = Hf (the second order) are already discussed. For functions f : En → R, points x = (x1, . . . , xn) ∈ En, and increments v = (ξ1, . . . , ξn), set Dk f(x)(v) = ∑ 1≤i1,...,ik≤n ∂k f ∂xi1· · ·∂xik (x1, . . . , xn)ξi1· · · ξik . An illustrative example (making use of the symmetry of the partial derivatives) is, for E2, the third-order expression D3 f(x, y)(ξ, η) = ∂3 f ∂x3 ξ3 + 3 ∂3 f ∂x2∂y ξ2 η + 3 ∂3 f ∂x∂y2 ξη2 + ∂3 f ∂y3 η3 , and, in general, Dk f(x, y)(ξ, η) = k∑ ℓ=0 ( k ℓ ) ∂k f ∂xk−ℓ∂yℓ ξk−ℓ ηℓ . Taylor’s expansion with remainder Theorem. Let f : En → R be a k–times differentiable function in a neighbourhood Oδ(x) of a point x ∈ En. For every increment v ∈ Rn of size ∥v∥ < δ, there exists a number θ, 0 ≤ θ ≤ 1, such that f(x + v) = f(x) + D1 f(x)(v) + 1 2! D2 f(x)(v)+ · · · + 1 (k − 1)! Dk−1 f(x)(v) + 1 k! Dk f(x + θ · v)(v). 725 have thus proved that f can have a local extremum only at the origin. Since f(0, 0) = 0, f(x, y) > 0, [x, y] ∈ R2 ∖ {[0, 0]}, the function f has a strict local (even global) minimum at the point [0, 0]. □ 8.F.4. Examine the local extrema of the function f(x, y) = ( x + y2 ) e x 2 , x, y ∈ R. Solution. This function has partial derivatives of all orders on the whole of its domain. Therefore, local extrema can occur only at stationary points, where both the partial derivatives fx, fy are zero. Then, it can be determined whether the local extremum occurs by computing the second derivatives. We can easily determine that fx(x, y) = e x 2 + 1 2 ( x + y2 ) e x 2 , fy(x, y) = 2y e x 2 , x, y ∈ R. A stationary point [x, y] must satisfy fy(x, y) = 0, i. e. y = 0, and, further, fx(x, y) = fx(x, 0) = e x 2 ( 1 + 1 2 x ) = 0, i. e. x = −2. 
We can see that there is a unique stationary point, namely [−2, 0]. Now, we calculate the Hessian Hf at this point. If this matrix (the corresponding quadratic form) is positive definite, the extremum is a strict local minimum. If it is negative definite, the extremum is a strict local maximum. Finally, if the matrix is indefinite, there will be no extremum at the point. We have fxx(x, y) = 1 2 e x 2 ( 2 + 1 2 ( x + y2 )) , fyy(x, y) = 2 e x 2 , fxy(x, y) = fyx(x, y) = y e x 2 , x, y ∈ R. Therefore, Hf (−2, 0) = ( fxx (−2, 0) fxy (−2, 0) fyx (−2, 0) fyy (−2, 0) ) = ( 1/2e 0 0 2/e ) . We should recall that the eigenvalues of a diagonal matrix are exactly the values on the diagonal. Further, positive definiteness means that all the eigenvalues are positive. Hence it follows that there is a strict local minimum at the point [−2, 0]. □ 8.F.5. Find the local extrema of the function f(x, y, z) = x3 + y2 + z2 2 − 3xz − 2y + 2z, x, y, z ∈ R. Solution. The function f is a polynomial; therefore, it has partial derivatives of all orders. It thus suffices to look for its stationary points (the extrema cannot be elsewhere). In order to find them, we differentiate f with respect to each of the three variables x, y, z and set the derivatives equal to zero. We thus obtain 3x2 − 3z = 0, i. e., z = x2 , CHAPTER 8. CALCULUS WITH MORE VARIABLES Proof. Given an increment v ∈ Rn , consider the parametrized straight line c(t) = x + tv in En, and examine the function φ : R → R defined by the composition φ(t) = f ◦ c(t). Taylor’s theorem for univariate functions claims that (see Theorem 6.1.3) φ(t) = φ(0) + φ′ (0)t + . . . + 1 (k − 1)! φ(k−1) (0)tk−1 + 1 k! φ(k) (θ)tk . It remains to verify that computing the derivatives φ(ℓ) yields the desired relation. This can be done quite easily by induction on the order k. For k = 1, Taylor’s theorem coincides with the corollary of the mean value theorem applied to the directional derivative, which is already used several times. When deriving it, the formula d dt φ(t) = ∂f ∂x1 (x(t)) · x′ 1(t) + · · · + ∂f ∂xn (x(t)) · x′ n(t) is used. It holds for every continuously differentiable curve and function f, cf. 8.1.8 and 8.1.9. This means that φ′ (t) = D1 f(c(t))(c′ (t)) = D1 f(c(t))(v) for all t in a neighbourhood of zero. Proceed similarly for functions Dℓ f. Write c′ (t) instead of the increment v, and recall that further differentiation of c(t) leads identically to zero everywhere, i.e. c′′ (t) = 0 for all t (since it is a parametrized straight line). Suppose φ(ℓ) (t) = Dℓ f(x(t))(v) = ∑ i1,...,iℓ ( ∂ℓ f ∂xi1 . . . ∂xiℓ (x1(t), . . . , xn(t))x′ i1 (t) · · · x′ iℓ (t) ) and calculate φ(ℓ+1) (t). By the above formula for first-order differentiation in a given direction and the rule for the derivative of a product (see Theorem 5.3.4), the differentiation of the composite function gives φ(ℓ+1) (t) = d dt Dℓ f(c(t))(c′ (t)) = d dt ∑ i1,...,iℓ ( ∂ℓ f ∂xi1 . . . ∂xiℓ (x1(t), . . . , xn(t)) · x′ i1 (t) · · · x′ iℓ (t) ) = ∑ i1,...,iℓ ( n∑ j=1 ∂ℓ+1 f ∂xi1 . . . ∂xiℓ ∂xj (x1(t), . . . , xn(t)) · x′ j(t) · x′ i1 (t) · · · x′ iℓ (t) ) + 0, which is the required formula for order ℓ+1. Taylor’s theorem now follows from substituting into the equality for φ at the beginning of this proof and the enumeration at the right values of t. □ 726 2y − 2 = 0, i. e., y = 1, and (utilizing the first equation) z − 3x + 2 = 0, i. e., x ∈ {1, 2}. Therefore, there are two stationary points, namely [1, 1, 1] and [2, 1, 4]. 
Now, we compute all second-order partial deriva- tives: fxx = 6x, fxy = fyx = 0, fxz = fzx = −3, fyy = 2, fyz = fzy = 0, fzz = 1. Having this, we are able to evaluate the Hessian at the stationary points: Hf (1, 1, 1) =   6 0 −3 0 2 0 −3 0 1   , Hf (2, 1, 4) =   12 0 −3 0 2 0 −3 0 1   . Now, we need to know whether these matrices are positive definite, negative definite, or indefinite in order to determine whether and which extrema occur at the corresponding points. Clearly, the former matrix (for the point [1, 1, 1]) has eigenvalue λ = 2. Since its determinant equals −6 and it is a symmetric matrix (all eigenvalues are real), the matrix must have a negative eigenvalue as well (because the determinant is the product of the eigenvalues). Therefore, the matrix Hf (1, 1, 1) is indefinite, and there is no extremum at the point [1, 1, 1]. We will use the so-called Sylvester’s criterion for the latter matrix Hf (2, 1, 4). According to this criterion, a realvalued symmetric matrix A =       a11 a12 a13 · · · a1n a12 a22 a23 · · · a2n a13 a23 a33 · · · a3n ... ... ... ... ... a1n a2n a3n · · · ann       is positive definite if and only if all of its leading principal minors A, i. e. the determinants d1 = a11 , d2 = a11 a12 a12 a22 , d3 = a11 a12 a13 a12 a22 a23 a13 a23 a33 , . . . dn = | A |, are positive. Further, it is negative definite iff d1 < 0, d2 > 0, d3 < 0, . . . , (−1)n dn > 0. The inequalities 12 = 12 > 0, 12 0 0 2 = 24 > 0, 12 0 −3 0 2 0 −3 0 1 = 6 > 0, imply that the matrix Hf (2, 1, 4) is positive definite – there is a strict local minimum at the point [2, 1, 4]. □ 8.F.6. Find the local extrema of the function z = ( x2 − 1 ) ( 1 − x4 − y2 ) , x, y ∈ R. CHAPTER 8. CALCULUS WITH MORE VARIABLES 8.1.15. Formula with multi-indices. To simplify the notation, let us introduce the multi-index notation for the polynomials with more variables. Multi-indices A multi-index α of length n is an n-tuple of non-negative integers (α1, . . . , αn). The integer |α| = α1 + · · · + αn is called the size of the multi-index α. Monomials are written shortly as xα instead of xα1 1 xα2 2 . . . xαn n . Real polynomials in n variables can be symbolically expressed in a similar way as univariate polynomials: f = ∑ |α|≤k aαxα , g = ∑ |β|≤ℓ bβxβ ∈ R[x1, . . . , xn]. f is said to have total degree k if at least one coefficient with multi-indices α of size k is non-zero, while all the coefficients with multi-indices of larger sizes vanish. Nice formulae express addition and multiplication of multivariate polynomials of degrees k and ℓ respectively: f + g = ∑ |α|≤max(k,ℓ) (aα + bα)xα , fg = k+ℓ∑ |γ|=0 ( ∑ α+β=γ (aαbβ)xγ ) , where the multi-indices are added componentwise, and the formally non-existing coefficients are assumed to be zero. Moreover we write shortly ∂αf = ∂|α| f ∂x1 α1 . . . ∂xn αn and α! = α1! · · · αn!. In particular, ∂αf = f if |α| = 0. Taylor’s polynomials via multi-indices Taylor’s expansion up to order r of a function f : Rn → R, for an increment v ∈ Rn is the polynomial (1) f(x + v) = f(x) + ∑ 1≤|α|≤k 1 α! ∂αf(x) vα , quite as the formula in dimension one. 8.1.16. Real analytic functions. If the multidimensional power series F(v) = ∑ |α|≥0 1 α! ∂αf(x) vα converges at some neighborhood of v = 0, we call the function f (real) analytic on a neighborhood of x. For instance, this happens if all the partial derivatives are uniformly bounded, i.e., ∂αf(x) < C for all α. Indeed, we may estimate 1 k! ∑ i1,...,ik ∂k f(x) ∂xi1 . . . ∂xik vi1 . . . vik ≤ 1 k! 
nk C∥v∥k and thus the Taylor’s series converges absolutely by the Weierstrass criterion for all v. Think about the details. Actually, our argument shows that the Taylor’s series converges (on a small 727 Solution. Once again, we calculate the partial derivatives zx, zy and set them equal to zero. This leads to the equations −6x5 + 4x3 + 2x − 2xy2 = 0, ( x2 − 1 ) (−2y) = 0, whose solutions [x, y] = [0, 0], [x, y] = [1, 0], [x, y] = [−1, 0]. (In order to find these solutions, it suffices to find the real roots 1, −1 of the polynomial −6x4 + 4x2 + 2 using the substitution u = x2 . Now, we compute the second-order partial derivatives zxx = −30x4 + 12x2 + 2 − 2y2 , zxy = zyx = −4xy, zyy = −2 ( x2 − 1 ) and evaluate the Hessian at the stationary points: Hz (0, 0) = ( 2 0 0 2 ) , Hz (1, 0) = Hz (−1, 0) = ( −16 0 0 0 ) . We can see that the first matrix is positive definite, so the function has a strict local minimum at the origin. However, the second and third matrices are negative semidefinite. Therefore, the knowledge of second partial derivatives in insufficient for deciding whether there is an extremum at the points [1, 0] and [−1, 0]. On the other hand, we can examine the function values near these points. We have z (1, 0) = z (−1, 0) = 0, z (x, 0) < 0 for x ∈ (−1, 1). Further, consider y dependent on x ∈ (−1, 1) by the formula y = √ 2 (1 − x4), so that y → 0 for x → ±1. For this choice, we get z ( x, √ 2 (1 − x4) ) = ( x2 − 1 ) ( x4 − 1 ) > 0, x ∈ (−1, 1). We have thus shown that in arbitrarily small neighborhoods of the points [1, 0] and [−1, 0], the function z takes on both higher and lower values than the function value at the corresponding point. Therefore, these are not extrema. □ 8.F.7. Decide whether the polynomial p(x, y) = x6 + y8 + y4 x4 − x6 y5 has a local extremum at the stationary point [0, 0]. Solution. We can easily verify that the partial derivatives px and py are indeed zero at the origin. However, each of the partial derivatives pxx, pxy, pyy is also equal to zero at the point [0, 0]. The Hessian Hp (0, 0) is thus both positive and negative semidefinite at the same time. However, a simple idea can lead us to the result: We can notice that p(0, 0) = 0 and p(x, y) = x6 ( 1 − y5 ) + y8 + y4 x4 > 0 for [x, y] ∈ R × (−1, 1) ∖ {[0, 0]}. Therefore, the given polynomial has a local minimum at the origin. □ 8.F.8. Determine local extrema of the function f : R3 → R, f(x, y, z) = x2 y + y2 z + x − z on R3 . ⃝ 8.F.9. CHAPTER 8. CALCULUS WITH MORE VARIABLES neighborhood of x) if the partial derivatives grow with the order k slower than k!. 8.1.17. Local extrema. We examine the local maxima and minima of functions on En using the differential and the Hessian. Just as in the case of univariate functions, an interior point x0 ∈ En of the domain of a function f is said to be a (local) maximum or minimum if and only if there is a neighbourhood U of x0 such that for all points x ∈ U, the function value satisfies f(x) ≤ f(x0) or f(x) ≥ f(x0), respectively. If strict inequalities hold for all x ̸= x0, there is a strict extremum. To simplify, suppose that f has continuous both firstorder and second-order partial derivatives on its domain. A necessary condition for the existence of an extremum at a point x0 is that the differential be zero at this point, i.e., df(x0) = 0. If df(x0) ̸= 0, then there is a direction v in which dvf(x0) ̸= 0. However, then the function value is increasing at one side of the point x0 along the line x0 +tv and it is decreasing on the other side, see 5.3.2. 
An interior point x ∈ En of the domain of a function f at which the differential df(x) is zero is called a stationary point of the function f. To illustrate the concept on a simple function in E2, consider f(x, y) = sin(x) cos(y). The shape of this function resembles the well-known egg plates, so it is evident that there are many extrema, and also many stationary points which are not extrema (“saddles” are visible in the picture). 00 -1 22 -0,5 44 0 66 0,5 8 8 1 Calculate the first derivatives, and then the necessary secondorder ones: fx(x, y) = cos(x) cos(y), fy(x, y) = − sin(x) sin(y), and both derivatives are zero for two sets of points (1) cos(x) = 0, sin(y) = 0, that is (x, y) = (2k+1 2 π, ℓπ), for any k, ℓ ∈ Z (2) cos(y) = 0, sin(x) = 0, that is (x, y) = (kπ, 2ℓ+1 2 π), for any k, ℓ ∈ Z. The second partial derivatives are Hf(x, y) = ( fxx fxy fxy fyy ) (x, y) = ( − sin(x) cos(y) − cos(x) sin(y) − cos(x) sin(y) − sin(x) cos(y) ) . So the following Hessians are obtained in two sets of stationary points: 728 Determine the local extrema of the function f : R3 → R, f(x, y, z) = x2 y − y2 z + 4x + z on R3 . ⃝ 8.F.10. Determine the local extrema of the function f : R3 → R, f(x, y, z) = xz2 + y2 z − x + y on R3 . ⃝ 8.F.11. Determine the local extrema of the function f : R3 → R, f(x, y, z) = y2 z − xz2 + x + 4y on R3 . ⃝ 8.F.12. Determine the local extrema of the function f : R2 → R, f(x, y) = x2 y + x2 + 2y2 + y on R2 ⃝ 8.F.13. Determine the local extrema of the function f : R2 → R, f(x, y) = x2 y + 2y2 + 2y on R2 . ⃝ 8.F.14. Determine the local extrema of the function f : R2 → R, f(x, y) = x2 + xy + 2y2 + y on R2 . ⃝ 8.F.15. Determine the local extrema of the function f : R2 → R, f(x, y) = x2 + xy − 2y2 + y on R2 . ⃝ G. Implicitly given functions and mappings 8.G.1. Let F : R2 → R be a function, F(x, y) = xy sin (π 2 xy2 ) . Show that the equality F(x, y) = 1 implicitly defines a function f : U → R on a neighborhood U of the point [1, 1] so that F(x, f(x)) = 1 for x ∈ U. Determine f′ (1). Solution. The function is differentiable on the whole R2 , so it is such on any neighborhood of the point [1, 1]. Let us evaluate Fy at [1, 1]: Fy(x, y) = x sin (π 2 xy2 ) + πx2 y2 cos (π 2 xy2 ) , so Fy(1, 1) = 1 ̸= 0. Therefore, it follows from theorem 8.1.25 that the equation F(x, y) = 1 implicitly determines on a neighborhood of the point (1, 1) a function f : U → R defined on a neighborhood of the point (number) 1. Moreover, we have Fx(x, y) = y sin (π 2 xy2 ) + π 2 xy3 cos (π 2 xy2 ) , so the derivative of the function f at the point 1 satisfies f′ (1) = − Fx(1, 1) Fy(1, 1) = − 1 1 = −1. □ Remark. Notice that although we are unable to explicitly define the function f from the equation F(x, f(x)) = 1, we are able to determine its derivative at the point 1. 8.G.2. Considering the function F : R2 → R, F(x, y) = ex sin(y) + y − π/2 − 1, show that the equation F(x, y) = 0 implicitly defines the variable y to be a function of x, y = f(x), on a neighborhood of the point [0, π/2]. Compute f′ (0). Solution. The function is differentiable in a neighborhood of the point [0, π/2]; moreover, Fy = ex cos y+1, F(0, π/2) = 1 ̸= 0, so the equation indeed defines a function f : U → R on a neighborhood of the point [0, π/2]. Further, we have Fx = ex sin y, Fx(0, π/2) = 1, and its derivative at the point CHAPTER 8. 
CALCULUS WITH MORE VARIABLES (1) Hf(kπ + π 2 , ℓπ) = ± ( 1 0 0 1 ) , where the minus sign occurs when k and ℓ have the same parity (remainder on division by two), and the sign + occurs in the other case; (2) Hf(kπ, ℓπ + π 2 ) = ± ( 0 1 1 0 ) , where again the minus sign occurs when k and ℓ have the same parity, and the sign + occurs in the other case; From the proposition of Taylor’s theorem for order k = 2, there is, in a neighbourhood of one of the stationary points (x0, y0), f(x, y) = f(x0, y0)+ + 1 2 Hf(x0 + θ(x − x0), y0 + θ(y − y0))(ξ, η). Here, Hf is considered to be a quadratic form evaluated at the increment (x − x0, y − y0) = (ξ, η). In the case (1), Hf(x0, y0)(ξ, η) = ±(ξ2 + η2 ), while in the case (2), Hf(x0, y0)(ξ, η) = ±2ξη. While in the first case, the quadratic form is either always positive or always negative on all nonzero arguments, in the second case, there are always arguments with positive values and other arguments with negative values. Since the Hessian of the function is continuous (i.e. all the partial derivatives up to order two are continous), the Hessians in the nearby points are small perturbations of those in (x0, y0) and so these properties of the quadratic form Hf(x, y) remain true on some neighbourhood of (x0, y0). This is obvious in cases (1) and (2), since a small perturbation of the matrices does clearly not change the latter properties of the quadratic forms in question. A general formal proof is presented below. The local maximum occurs if and only if the point (x0, y0) belongs to the case (1) with k and ℓ of the same parity. On the other hand, if the parities are different, then the point from the case (1) happens to be a point of a local minimum. On the other hand, in the case (2) the entire function f behaves similarly to the Hessian and so the “saddle” points are not extrema. 8.1.18. The decision rules. In order to formulate the general statement about the Hessian and the local extrema at stationary points, it is necessary to remember the discussion about quadratic forms from the paragraphs 4.3.2–4.3.3 in the chapter on affine geometry. There are introduced the following types of quadratic forms h : En → R: • positive definite if and only if h(u) > 0 for all u ̸= 0 • positive semidefinite if and only if h(u) ≥ 0 for all u ∈ V • negative definite if and only if h(u) < 0 for all u ̸= 0 • negative semidefinite if and only if h(u) ≤ 0 for all u ∈ V • indefinite if and only if h(u) > 0 and f(v) < 0 for appropriate u, v ∈ V . 729 0 satisfies: f′ (0) = − Fx(0, π/2) Fy(0, π/2) = − 1 1 = −1. □ 8.G.3. Let F(x, y, z) = sin(xy) + sin(yz) + sin(xz). Show that the equation F(x, y, z) = 0 implicitly defines a function z(x, y) : R2 → R on a neighborhood of the point [π, 1, 0] ∈ R3 so that F(x, y, z(x, y)) = 0. Determine zx(π, 1) and zy(π, 1). Solution. We will calculate Fz = y cos(yz) + x cos(xz), Fz(π, 1, 0) = π + 1 ̸= 0, and the function z(x, y) is defined by the equation F(x, y, z(x, y)) = 0 on a neighborhood of the point [π, 1, 0]. In order to find the values of the wanted partial derivatives, we first need to calculate the values of the remaining partial derivatives of the function F at the point [π, 1, 0]. Fx(x, y, z) = y cos(xy) + z cos(xz) Fx(π, 1, 0) = −1, Fy(x, y, z) = x cos(xy) + z cos(yz) Fy(π, 1, 0) = −π, odkud zx(π, 1) = − Fx(π, 1, 0) Fz(π, 1, 0) = 1 π + 1 , zy(π, 1) = − Fy(π, 1, 0) Fz(π, 1, 0) = π π + 1 . □ 8.G.4. 
Having the mapping F : R3 → R2 , F(x, y, z) = (f(x, y, z), g(x, y, z)) = (ex sin y , xyz), show that the equation F(x, c1(x), c2(x)) = (0, 0) defines a curve c : R → R2 on a neighborhood of the point [1, π, 1]. Determine the tangent vector to this curve at the point 1. Solution. We will calculate the square matrix of the partial derivatives of the mapping F with respect to y and z: H(x, y, z) = ( fy fz gy gz ) = ( x cos y ex sin y 0 xz xy ) . Hence, H(1, π, 1) = ( −1 0 1 π ) and det H(1, π, 1) = −π ̸= 0. Now, it follows from the implicit mapping theorem (see 8.1.25) that the equation F(x, c1(x), c2(x)) = (0, 0) on a neighborhood of the point [1, π, 1] determines a curve (c1(x), c2(x)) defined on a neighborhood of the point [1, π]. In order to find its tangent vector at this point, we need to determine the )column) vector (fx, gx) at this point: ( fx gx ) = ( sin y ex sin y yz ) , ( fx(1, π, 1) gx(1, π, 1) ) = ( 0 π ) . The wanted tangent vector is thus ( (c1)x(1) (c2)x(1)) ) = ( fy(1, π, 1) fz(1, π, 1) gy(1, π, 1) gz(1, π, 1) )−1 ( fx(1, π, 1 gx(1, π, 1) ) = = ( −1 0 1 π )−1 ( 0 π ) = ( −1 0 1 π 1 π ) = ( 0 1 ) . CHAPTER 8. CALCULUS WITH MORE VARIABLES There are methods to allow determining whether or not a given form has any of these properties. The Taylor expansion with remainder immediately yields the following rules: Local extrema Theorem. Let f : En → R be a twice continuously differentiable function and x ∈ En be a stationary point of the function f. Then (1) f has a strict local minimum at x if Hf(x) is positive definite, (2) f has a strict local minimum at x if Hf(x) is negative definite, (3) f does not have an extremum at x if Hf(x) is indefinite. Proof. The Taylor second-order expansion with remainder applied to our function f(x1, . . . , xn), an arbitrary point x = (x1, . . . , xn), and any increment v = (v1, . . . , vn), such that all points x+θv, θ ∈ [0, 1], lie in the domain of the function f, says that f(x + v) = f(x) + df(x)(v) + 1 2 Hf(x + θ · v)(v) for an appropriate real number θ, 0 ≤ θ ≤ 1. Since it is supposed that the differential is zero, we obtain f(x + v) = f(x) + 1 2 Hf(x + θ · v)(v). By assumption, the quadratic form Hf(x) is continuously dependent on the point x, and the definiteness or indefiniteness of quadratic forms can be determined by the sign of the major subdeterminants of the matrix Hf, see Sylvester’s criterion in paragraph 4.3.3. However, the determinant itself is a polynomial expression in the coefficients of the matrix, hence a continuous function. Therefore, the non-vanishing and signs of the examined determinants are the same in a sufficiently small neighbourhood of the point x as at the point x itself. In particular, for a positive definite Hf(x), it is guaranteed that at a stationary point x, f(x + v) > f(x) for sufficiently small v. So this is a sharp minimum of the function f at the point x. The case of negative definiteness is analogous. If Hf(x) is indefinite, then there are directions v, w in which f(x + v) > f(x) and f(x + w) < f(x), so there is no extremum at the stationary point in question. □ The theorem yields no result if the Hessian of the function is degenerate, yet not indefinite at the point in question. The reason is the same as in the case of univariate functions. In these cases, there are directions in which both the first and second derivatives vanish, so at this level of approximation, it cannot be determined whether the function behaves like t3 or ±t4 until higher-order derivatives in the necessary directions are calculated. 
At the same time, even at those points where the differential is non-zero, the definiteness of the Hessian Hf(x) has 730 □ H. Constrained optimization We will begin with a somewhat atypical optimization problem. 8.H.1. A betting office accepts bets on the outcome of a tennis match. Let the odds laid against player A winning be a : 1 (i. e., if a bettor bets x dollars on the event that player A wins and this really happens, then the bettor wins ax dollars) and, similarly, let the odds laid against player B winning be b : 1 (fees are neglected). What is the necessary and sufficient condition for (positive real) numbers a and b so that a bettor cannot guarantee any profit regardless the actual outcome of the match? (For instance, if the odds were laid 1.5 : 1 against the win of A and 5 : 1 against the win of B, then the bettor could bet 3 dollars on B winning and 7 dollars on A winning and profit from this bet in either case). Solution. Let the bettor have P dollars. The bet amount can be divided to kP and (1 − k)P dollars, where k ∈ (0, 1). The profit is then akP dollars (if player A wins) or b(1−k)P dollars (if B does). The bettor is always guaranteed to win the lesser of these two amounts; the total profit (or loss) is obtained by subtracting the bet P, then. Since each of a, b, P is a positive real number, the function akP is increasing, and the function b(1 − k)P is decreasing with respect to k. For k = 0, b(1−k)P is greater; for k = 1, akP is. The minimum of the two numbers akP and b(1 − k)P is thus maximal for a k ∈ (0, 1), namely for the value k0 which satisfies ak0P = b(1 − k0)P, whence k0 = b a+b . Therefore, the betting office must choose a, b so that ak0P = b(1 − k0)P < P, which is equivalent to ak0 < 1, i. e., ab < a + b. □ We managed to solve this constrained optimization problem even without using the differential calculus. However, we will not be able to do so in the following problems. 8.H.2. Find the extremal values of the function h(x, y, z) = x3 + y3 + z3 on the unit sphere S in R3 given by the equation F(x, y, z) = x2 + y2 + z2 − 1 as well as on the circle which is the intersection of this sphere with the plane G(x, y, z) = x + y + z. Solution. First, we will look for stationary points of the function h on the sphere S. Computing the corresponding gradients (for instance, grad h(x, y, z) = (3x2 , 3y2 , 3z2 )) , we get the system 0 = 3x2 − 2λx, 0 = 3y2 − 2λy, 0 = 3z2 − 2λz, 0 = x2 + y2 + z2 − 1 CHAPTER 8. CALCULUS WITH MORE VARIABLES similar consequences as the non-vanishing of the second derivative of a univariate function. For a function f : Rn → R, the expression z(x + v) = f(x) + df(x)(v) defines the tangent hyperplane to the graph of the function f in the space Rn+1 . Taylor’s theorem of order two with remainder, as used in the proof above, provides the expression f(x + v) = z(x + v) + 1 2 Hf(x + θv)(v). If the Hessian is positive definite, all the values of the function f lie above the values of the tangent hyperplane for arguments in a sufficiently small neighbourhood of the point x, i.e. the whole graph is above the tangent hyperplane in a sufficiently small neighbourhood. In the case of negative definiteness, it is the other way round. Finally, when the Hessian is indefinite, the graph of the function has values on both sides of the hyperplane. This happens, in general, along objects of lower dimensions in the tangent hyperplane, so there is no straightforward generalization of inflection points. 8.1.19. The differential of mappings. 
The concepts of derivative and differential can be easily extended to mappings F : En → Em. Having selected the Cartesian coordinate system on both sides, this mapping is an ordinary m–tuple F(x1, . . . , xn) = (f1(x1, . . . , xn), . . . , fm(x1, . . . , xn)) of functions fi : En → R. F is defined to be a differentiable or k–times differentiable mapping if and only if the corresponding property is shared by all the functions f1, . . . , fm. The differentials dfi(x) of the particular functions fi give a linear approximation of the increments of their values for the mapping F. Therefore, we can expect that they also give a coordinate expression of the linear mapping D1 F(x) : Rn → Rm between the modelling spaces which linearly approximates the increments of the mapping F. Differential and Jacobi matrix Consider a differentiable mapping F : Rn → Rm with components (f1(x1, . . . , xn), . . . , fm(x1, . . . , xn)) and x in its domain. The matrix D1 F(x) =     df1(x) df2(x) ... dfm(x)     =       ∂f1 ∂x1 ∂f1 ∂x2 . . . ∂f1 ∂xn ∂f2 ∂x1 ∂f2 ∂x2 . . . ∂f2 ∂xn ... ... ... ... ∂fm ∂x1 ∂fm ∂x2 . . . ∂fm ∂xn       (x) is called the Jacobi matrix of the mapping F at the point x. The linear mapping D1 F(x) defined on the increments v = (v1, . . . , vn) by the Jacobi matrix is called the differential of the mapping F at a point x in the domain if and only if lim v→0 1 ∥v∥ ( F(x + v) − F(x) − D1 F(x)(v) ) = 0. Recall that the definition of Euclidean distance guarantees that the limits of values in En exist if and only if the 731 consisting of four equations in four variables. Before trying to solve this system, we can estimate how many local constrianed extrema we should anticipate the function to have. Surely, h(P) is in absolute value equal to at most 1, and this happens at all intersection points of the coordinate axes with S. Therefore, we are likely to get 6 local extrema. Further, inside every eighth of the sphere given by the coordinate planes, there may or may not be another extremum. The particular quadrants can be easily parametrized, and the function h (considered a function of two parameters) can be analyzed by standard means (or we can have it drawn in Maple, for example). Actually, solving the system (no matter whether algebraically or in Maple again) leads to a great deal of stationary points. Besides the six points we have already talked about (two of the coordinates equal to zero and the other to ±1) and which have λ = ±3 2 , there are also the points P± = ± (√ 3 3 , √ 3 3 , √ 3 3 ) , for example, where a local extremum indeed occurs. If we restrict our interest to the points of the circle K, we must give another function G another free parameter η representing the gradient coefficient. This leads to the bigger system 0 = 3x2 − 2λx − η, 0 = 3y2 − 2λy − η, 0 = 3z2 − 2λz − η, 0 = x2 + y2 + z2 − 1, 0 = x + y + z. However, since a circle is also a compact set, h must have both a global minimum and maximum on it. Further analysis is left to the reader. □ 8.H.3. Determine whether the function f : R3 → R, f(x, y, z) = x2 y has any extrema on the surface 2x2 + 2y2 + z2 = 1. If so, find these extrema and determine their types. Solution. Since we are interested in extrema of a continuous function on a compact set (ellipsoid) – it is both closed and bounded in R3 – the given function must have both a minimum and maximum on it. 
Moreover, since the constraint is given by a continuously differentiable function and the examined function is differentiable, the extrema must occur at stationary points of the function in question on the given set. We can build the following system for the stationary points: 2xy = 4kx, x2 = 4ky, 0 = 2kz. This system is satisfied by the points [± 1√ 3 , 1√ 6 , 0] and [± 1√ 3 , − 1√ 6 , 0]. The function takes on only two values at these four stationary points. Ir follows from the above that the first and second stationary points are maxima of CHAPTER 8. CALCULUS WITH MORE VARIABLES limits of the particular coordinate components do. Direct application of Theorem 8.1.6 about the existence of the differential for functions of n variables to the particular coordinate functions of the mapping F thus leads to the following generalization (prove this in detail by yourselves!): Existence of the differential Corollary. Let F : En → Em be a mapping such that all of its coordinate functions have continuous partial derivatives in a neighbourhood of a point x ∈ En. Then the differential D1 F(x) exists, and it is given by the Jacobi matrix D1 F(x). 8.1.20. Lipschitz continuity. Continuous differentiability of mappings allows good control on their variability in the following sense. Assume the estimates of the difference F(y) − F(x) for all x and y from a convex compact subset K in the domain of F are of interest. Applying the Taylor’s theorem with remainder in order one on each of the components of F = (f1, . . . , fn) separately gives the estimate (write v = y − x) ∥F(y) − F(x)∥2 = m∑ i=1 |fi(y) − fi(x)|2 = m∑ i=1 |D1 fi(x + θiv)(v)|2 = m∑ i=1 n∑ j=1 ∂fi ∂xj (x + θiv)vj 2 ≤ ( max z∈K,i,j ∂fi ∂xj (z) 2 ) nm∥v∥2 = C2 ∥v∥2 for an appropriate constant C ≥ 0. The fact that continuous functions are bounded over each compact set is used. This is the property of Lipschitz continuity of F on the compact set K: ∥F(y) − F(x)∥ ≤ C∥y − x∥, for all x, y ∈ K which was considered in 7.3.14 in the end of chapter 7. Proposition. Each continuously differentiable mapping F : Rn → Rm is Lipschitz continuous over convex compact sets. 8.1.21. Differential of composite mappings. The following theorem formulates a very useful generalization of the chain rule for univariate functions. Except for the concept of the differential itself, which is mildly complicated, it is actually the same as the one already seen in the case of one variable. The Jacobi matrix for univariate functions is a single number, namely the derivative of the function at a given point, so the multiplication of Jacobi matrices is simply the multiplication of the derivatives of the outer and inner components of the function. There is, of course, another special case: the formula derived and used several times for the derivative of a composition of multivariate functions with curves. There, the differential is the linear form expressed via the partial derivatives of the outer components, evaluated on the vector of the 732 the function on the given ellipsoid, while the other two are minima. □ Remark. Note that we have used the variable k instead of λ from the theorem 8.1.29. 8.H.4. Decide whether the function f : R3 → R, f(x, y, z) = z − xy2 has any minima and maxima on the sphere x2 + y2 + z2 = 1. If so, determine them. Solution. We are looking for solutions of the system kx = −y2 , ky = −2xy, kz = 1. The second equation implies that either y = 0 or x = −k 2 . The first possibility leads to the points [0, 0, 1], [0, 0, −1]. The second one cannot be satisfied. 
Note that because of the third equation k ̸= 0 and substituting into the equation of the sphere, we get the equation k2 4 + k2 2 + 1 k2 = 1, which has no solution in real numbers (it is a quadratic equation in k2 with the negative discriminant). The function has a maximum and minimum, respectively, at the two computed points on the given sphere. □ 8.H.5. Determine whether the function f : R3 → R, f(x, y, z) = xyz, has any extrema on the ellipsoid given by the equation g(x, y, z) = kx2 + ly2 + z2 = 1, k, l ∈ R+ . If so, calculate them. Solution. First, we build the equations which must be satisfied by the stationary points of the given function on the ellipsoid: ∂g ∂x = λ ∂f ∂x : yz = 2λkx, ∂g ∂y = λ ∂f ∂y : xz = 2λly, ∂g ∂z = λ ∂f ∂z : xy = 2λz. We can easily see that the equation can only be satisfied by a triple of non-zero numbers. Dividing pairs of equations and substituting into the ellipse’s equation, we get eight solutions, namely the stationary points x = ± 1√ 3k , y = ± 1√ 3l , z = ± 1√ 3 . However, the function f takes on only two distinct values at these eight points. Since it is continuous and the given ellipsoid is compact, f must have both a maximum and minimum on it. Moreover, since both f and g are continuously differentiable, these extrema must occur at stationary points. Therefore, it must be that four of the computed stationary points are local maxima of the function (of value 1 3 √ 3kl ) and the other four are minima (of value − 1 3 √ 3kl ). □ CHAPTER 8. CALCULUS WITH MORE VARIABLES derivative of the inner component, again given by the product of the one line (the form) and one column (the vector). The chain rule Theorem. Let F : En → Em and G : Em → Er be two differentiable mappings, where the domain of G contains the whole image of F. Then, the composite mapping G ◦ F is also differentiable, and its differential at any point x in the domain of F is given by the composition of differentials D1 (G ◦ F)(x) = D1 G(F(x)) ◦ D1 F(x). The Jacobi matrix on the left hand side is the product of the corresponding Jacobi matrices on the right hand side. Proof. In paragraph 8.1.6 and in the proof of Taylor’s theorem, it was derived how the differentiation of mappings composed of functions and curves behaves. This proved the theorem in the special case of n = r = 1. The general case can be proved analogously, one just has to work with more vectors. Fix an arbitrary increment v and calculate the directional derivative for the composition G ◦ F at a point x ∈ En. This means to determine the differentials for the particular coordinate functions of the mapping G composed with F. To simplify, write g ◦ F for any one of them. dv(g ◦ F)(x) = lim t→0 1 t ( g(F(x + tv)) − g(F(x)) ) . The expression in parentheses can, from the definition of the differential of g, be expressed as g(F(x + tv)) − g(F(x) = dg(F(x))(F(x + tv) − F(x)) + α(F(x + tv) − F(x)), where α is a function defined on a neighbourhood of the point F(x) which is continuous and limv→0 1 ∥v∥ α(v) = 0. Substitution into the equality for the directional derivative yields dv(g ◦ F)(x) = lim t→0 1 t ( dg(F(x))(F(x + tv) − F(x)) + α ( F(x + tv) − F(x) ) ) = dg(F(x)) ( lim t→0 1 t ( F(x + tv) − F(x) )) + lim t→0 1 t ( α ( F(x + tv) − F(x) ) ) = dg(F(x)) ◦ D1 F(x)(v) + 0. The fact that linear mappings between finite-dimensional spaces are always continuous was used. In the last step the Lipschitz continuity of F, i.e. ∥F(x+tv)−F(x)∥ ≤ C∥v∥|t| was exploited, and the properties of the function α. 
So the theorem for the particular functions g1, . . . , gr of the mapping G is proved. The theorem in general now follows from the definition of matrix multiplication and its links to linear mappings. Think about all details! □ 733 8.H.6. Determine the global extrema of the function f(x, y) = x2 − 2y2 + 4xy − 6x − 1 on the set of points [x, y] that satisfy the inequalities (1) x ≥ 0, y ≥ 0, y ≤ −x + 3. Solution. We are given a polynomial with continuous partial derivatives on a compact (i. e. closed and bounded) set. Such a function necessarily has both a minimum and a maximum on this set, and this can happen only at stationary points or on the boundary. Therefore, it suffices to find stationary points inside the set and the ones on a finite number of open (or singleton) parts of the boundary, then evaluate f at these points and choose the least and the greatest values. Notice that the set of points determined by the inequalities (1) is clearly a triangle with vertices at [0, 0], [3, 0], [0, 3]. Let us determine the stationary points inside this triangle as the solution of the equations fx = 0, fy = 0. Since fx(x, y) = 2x + 4y − 6, fy(x, y) = 4x − 4y, these equations are satisfied only by the point [1, 1]. The boundary suggests itself to be expressed as the union of three line segments given by the choice of pairs of vertices. First, we consider x = 0, y ∈ [0, 3], when f(x, y) = −2y2 − 1. However, we know the graph of this (univariate) function on the interval [0, 3] It is thus not difficult to find the points at which global extrema occur. They are the marginal points [0, 0], [0, 3]. Similarly, we can consider y = 0, x ∈ [0, 3], also obtaining the marginal points [0, 0], [3, 0]. Finally, we get to the line segment y = −x+3, x ∈ [0, 3]. Making some rearrangements, we get f(x, y) = f(x, −x + 3) = −5x2 + 18x − 19, x ∈ [0, 3]. We thus need to find the stationary points of the polynomial p(x) = −5x2 + 18x − 19 from the interval [0, 3]. The equation p′ (x) = 0, i. e., −10x + 18 = 0, is satisfied by x = 9/5. This means that in the last case, we obtained one more point (besides the marginal points), namely [9/5, 6/5], where a global extremum may occur. Altogether, we have these points as “suspects”: [1, 1], [0, 0], [0, 3], [3, 0], [9 5 , 6 5 ] with function values −4, −1, −19, −10, −14 5 , respectively. We can see that the function f takes on the greatest value −1 at the point [0, 0] and the least value −19 at the point [0, 3]. □ 8.H.7. Determine whether the function f : R3 → R, f(x, y, z) = y2 z has any extrema on the line segment given by the equations 2x + y + z = 1, x − y + 2z = 0 and the constraint x ∈ [−1, 2]. If so, find these extrema and determine their types. Justify all of your decisions. Solution. We are looking for the extrema of a continuous function on a compact set. Therefore, the function must have both a minimum and a maximum on this set, and this will happen either at the marginal points of the segment or at those CHAPTER 8. CALCULUS WITH MORE VARIABLES 8.1.22. Transformation of coordinates. A mapping F : En → En which has an inverse mapping G : En → En defined on the entire image of F is called a transformation. Such a mapping can be perceived as a change of coordinates. It is usually required that both F and G be (continuously) differentiable mappings. Just as in the case of vector spaces, the choice of “point of view”, i.e. the choice of coordinates, can simplify or deteriorate comprehension of the examined object. 
The change of coordinates is now being discussed in a much more general form than in the case of affine mappings in the fourth chapter. Sometimes, the term “curvilinear coordinates” is used in this general sense. An illustrative example is the change of the most usual coordinates in the plane to polar coordinates. That is, the position of a point P is given by its distance r = √ x2 + y2 from the origin and the angle φ = arctan(y/x) between the ray from the origin to it and the x-axis (if x ̸= 0). Notice, this is just the transformation between the algebraic and geometric form of a complex num- ber. The illustration shows the the “line” r = φ drawn in the Cartesian coordinates. The change from the polar coordinates to the standard ones is Ppolar = (r, φ) → (r cos φ, r sin φ) = PCartesian It is apparent that it is necessary to limit the polar coordinates to an appropriate subset of points (r, φ) in the plane so that the inverse mapping would exist. The Cartesian image of lines in polar coordinates with constant coordinates r or φ is also shown in the illustration above. Let us discuss an example how to deal with the concept of transformation and the theorem about differentiation of composite mappings. The inverse to the above is the transformation F : R2 → R2 (for instance, on the domain of all points in the first quadrant except for the points having x = 0): r = √ x2 + y2, φ = arctan y x . Consider now the function gt : E2 → R, with free parameter t ∈ R, g(r, φ, t) = sin(r − t) 734 where the gradient of the examined function is a linear combination of the gradients of the functions that give the constraints. First, let us look for the points which satisfy the gradient condition: 0 = 2k + l, 2yz = k − l, y2 = k + 2l, 2x + y + z = 1, x − y + 2z = 0. The solution of the system is [x, y, z] = [2 3 , 0, −1 3 ] and [x, y, z] = [4 9 , 2 9 , −1 9 ] (of course, the variables k and l can also be computed, but we are not interested in them). The marginal points of the given line segment are [−1, 5 3 , 4 3 ] and [2, −4 3 , −5 3 ]. Considering these four points, the function takes on the greatest value at the first marginal point (f(x, y, z) = 100 27 ), which is its maximum on the given segment, and it takes the least value at the second marginal point (f(x, y, z) = −80 27 ), which is thus its minimum there. □ 8.H.8. Find the maximal and minimal values of the polyno- mial p(x, y) = 4x3 − 3x − 4y3 + 9y on the set M = { [x, y] ∈ R2 ; x2 + y2 ≤ 1 } . Solution. This is again the case of a polynomial on a compact set; therefore, we can restrict our attention to stationary points inside or on the boundary of M and the “marginal” points on the boundary of M. However, the only solutions of the equations px(x, y) = 12x2 − 3 = 0, py(x, y) = −12y2 + 9 = 0 are the points [ 1 2 , √ 3 2 ] , [ 1 2 , − √ 3 2 ] , [ −1 2 , √ 3 2 ] , [ −1 2 , − √ 3 2 ] , which are all on the boundary of M. This means that p has no extremum inside M. Now, it suffices to find the maximum and minimum of p on the unit circle k : x2 + y2 = 1. The circle k can be expressed parametrically as x = cos t, y = sin t, t ∈ [−π, π]. Thus, instead of looking for the extrema of p on M, we are now seeking the extrema of the function f(t) := p(cos t, sin t) = 4 cos3 t − 3 cos t − 4 sin3 t + 9 sin t on the interval [−π, π]. 
For t ∈ [−π, π], we have f′ (t) = −12 cos2 t sin t + 3 sin t − 12 sin2 t cos t + 9 cos t, In order to determine the stationary points, we must express the function f′ in a form from which we will be able to calculate the intersection of its graph with the x-axis. To this purpose, we will use the identity 1 cos2 t = 1 + tg2 t, CHAPTER 8. CALCULUS WITH MORE VARIABLES in polar coordinates. Such a function can approximate the waves on a water surface after a point impulse in the origin at the time t, see the illustration (there, t = −π/2). While it was easy to define the function in polar coordinates, it would have been much harder to guess with Cartesian coordinates. Compute the derivative of this function in Cartesian coordinates. Using the theorem, ∂g ∂x (x, y, t) = ∂g ∂r (r, φ) ∂r ∂x (x, y) + ∂g ∂φ (r, φ) ∂φ ∂x (x, y) = cos( √ x2 + y2 − t) x √ x2 + y2 + 0 and, similarly, ∂g ∂y (x, y, t) = ∂g ∂r (r, φ) ∂r ∂y (x, y) + ∂g ∂φ (r, φ) ∂φ ∂y (x, y) = cos( √ x2 + y2 − t) y √ x2 + y2 . 8.1.23. The inverse mapping theorem. If the first derivative of a differentiable univariate function is non-zero, its sign determines whether the function is increasing or decreasing. Then, the function has this property in a neighbourhood of the point in question, and so an inverse function exists in the selected neighbourhood. The derivative of the inverse function f−1 is then the reciprocal value of the derivative of the function f (i.e. the inverse with respect to multiplication of real numbers). For higher dimensions there is the analogous re- sult: The inverse mapping theorem Theorem. Let F : En → En be a differentiable mapping on a neighbourhood of a point x0 ∈ En, and let the Jacobi matrix D1 F(x0) be invertible. Then in some neighbourhood of y0 = F(x0), the inverse mapping F−1 exists, it is differentiable, and its differential at the point F(x0) is the inverse mapping to the differential D1 F(x0). Hence, D1 (F−1 )(F(x0)) is given by the inverse matrix to the Jacobi matrix of the mapping F at the point x0. Interpreting this situation for a mapping E1 → E1 and linear mappings R → R as their differentials, the nonvanishing is a necessary and sufficient condition for the differential 735 which is valid provided both sides are well-defined. We get f′ (t) = cos3 t [ − 12 tg t + 3 ( tg t + tg3 t ) − 12 tg2 t + 9 ( 1 + tg2 t ) ] for t ∈ [−π, π] with cos t ̸= 0. However, this condition does not exclude any stationary points since sin t ̸= 0 if cos t = 0. Therefore, the stationary points of f are those points t ∈ [−π, π] for which −4 tg t + tg t + tg3 t − 4 tg2 t + 3 + 3 tg2 t = 0. The substitution s = tg t leads to s3 − s2 − 3s + 3 = 0, i. e. (s − 1) ( s − √ 3 ) ( s + √ 3 ) = 0. Then, the values s = 1, s = √ 3, s = − √ 3 respectively correspond to t ∈ {−3 4 π, 1 4 π}, t ∈ {−2 3 π, 1 3 π}, t ∈ {−1 3 π, 2 3 π}. Now, we evaluate the function f at each of these points as well as at the marginal points t = −π, t = π. Sorting them, we get f ( −1 3 π ) = −1 − 3 √ 3 < f ( −3 4 π ) = −3 √ 2 < f ( −2 3 π ) = 1 − 3 √ 3 < −1, f (−π) = f (π) = −1 < 0, f (2 3 π ) = 1 + 3 √ 3 > f (1 4 π ) = 3 √ 2 > f (1 3 π ) = −1 + 3 √ 3 > 0. Therefore, the global minimum of the function f is at the point t = −π/3 , while the global maximum is at t = 2π/3. Now, let us get back to the original function p. 
Since we know the values cos ( −1 3 π ) = 1 2 , sin ( −1 3 π ) = − √ 3 2 , cos (2 3 π ) = −1 2 , sin (2 3 π ) = √ 3 2 , we can deduce that the polynomial p takes on the minimal value −1−3 √ 3 (the same as f, of course) at the point [1/2, − √ 3/2] and the maximal value 1 + 3 √ 3 at [−1/2, √ 3/2]. □ 8.H.9. At which points does the function f(x, y) = x2 − 4x + y2 take on global extrema on the set M : | x | + | y | ≤ 1? Solution. Expressing f in the form f(x, y) = (x − 2)2 − 4 + y2 , we can see that the global maximum and minimum occur at the same points as for the function g(x, y) := √ (x − 2)2 + y2, [x, y] ∈ M, since neither shifting the function nor applying the increasing function v = √ u for u ≥ 0 changes the points of extrema (of course, they can change their values). However, we know that the function g gives the distance of a point [x, y] from the point [2, 0]. Since the set M is clearly a square with vertices [1, 0], [0, 1], [−1, 0], [0, −1], the point of M that is closest to [2, 0] is the vertex [1, 0], while the most distant one is [−1, 0]. Altogether, we have obtained that the minimal value of f occurs at the point [1, 0] and the maximal one at [−1, 0]. □ CHAPTER 8. CALCULUS WITH MORE VARIABLES to be invertible as a linear mapping. In general finite dimensions, the non-generacy of the differential is the adequate con- cept. Proof. First, verify that the theorem makes sense and is as expected. If it is supposed that the inverse mapping exists and is differentiable at F(x0), then differentiating the composite mapping F−1 ◦ F enforces the formula idRn = D1 (F−1 ◦ F)(x0) = D1 (F−1 ) ◦ D1 F(x0), which verifies the formula at the conclusion of the theorem. Therefore, it is known at the beginning which differential for F−1 to find. Next, suppose that the inverse mapping F−1 exists in a neighbourhood of the point F(x0) and that it is continuous. Since F is differentiable in a neighbourhood of x0, it follows that (1) F(x) − F(x0) − D1 F(x0)(x − x0) = α(x − x0) with function α : Rn → 0 satisfying limv→0 1 ∥v∥ α(v) = 0. To verify the approximation properties of the linear mapping (D1 F(x0))−1 , it suffices to calculate the following limit for y = F(x) approaching y0 = F(x0): lim y→y0 1 ∥y − y0∥ ( F−1 (y)−F−1 (y0)−(D1 F(x0))−1 (y−y0) ) . Substituting (1) for y − y0 into the latter equality yields lim y→y0 1 ∥y − y0∥ ( x − x0 − (D1 F(x0))−1 (D1 F(x0)(x − x0) + α(x − x0)) ) = lim y→y0 −1 ∥y − y0∥ ( D1 F(x0))−1 (α(x − x0) ) = (D1 F(x0))−1 lim y→y0 (−1) ∥y − y0∥ (α(x − x0)), where the last equality follows from the fact that linear mappings between finite-dimensional spaces are always continuous. Hence performing this linear mapping commutes with the limit process. The proof is almost finished. The limit at the end of the expression is, using the properties of α, zero if the values ∥F(x)−F(x0)∥ are greater than C∥x−x0∥ for some constant C > 0. This can be translated in terms of the inverse as C∥F−1 (y) − F−1 (y0)∥ ≤ ∥y − y0∥, i.e. ∥F−1 (y) − F−1 (y0)∥ ≤ D∥y − y0∥ for the constant D = C−1 > 0. This is Lipschitz continuity, which is a stronger property than F−1 being continuous. So, now it remains “merely” to prove the existence of a Lipschitzcontinuous inverse mapping to the mapping F. 736 8.H.10. Compute the local extrema of the function y = f(x) given implicitly by the equation 3x2 + 2xy + x = y2 + 3y + 5 4 , [x, y] ∈ R2 ∖ {[ x, x − 3 2 ] ; x ∈ R } . Solution. 
In accordance with the theoretical part (see 8.1.25), let us denote F(x, y) = 3x2 + 2xy + x − y2 − 3y − 5 4 , [x, y] ∈ R2 ∖ {[ x, x − 3 2 ] ; x ∈ R } and calculate the derivative y′ = f′ (x) = −Fx(x,y) Fy(x,y) = −6x+2y+1 2x−2y−3 . We can see the this derivative is continuous on the whole set in question. In particular, the function f is defined implicitly on this set (the denominator is non-zero). A local extremum may occur only for those x, y which satisfy y′ = 0, i. e., 6x + 2y + 1 = 0. Substituting y = −3x−1/2 into the equation F(x, y) = 0, we obtain −12x2 + 6x = 0, which leads to [x, y] = [ 0, −1 2 ] , [x, y] = [1 2 , −2 ] . We can also easily compute that y′′ = (y′ ) ′ = − ( 6+2y′ ) (2x−2y−3)−(6x+2y+1) ( 2−2y′ ) (2x−2y−3)2 . Substituting x = 0, y = −1/2, y′ = 0 and x = 1/2, y = −2, y′ = 0, we obtain y′′ = −6(−2)−0 4 > 0 for [x, y] = [ 0, −1 2 ] and y′′ = −6(+2)−0 4 < 0 for [x, y] = [1 2 , −2 ] . We have thus proved that the implicitly given function has a strict local minimum at the point x = 0 and a strict local maximum at x = 1/2. □ 8.H.11. Find the local extrema of the function z = f(x, y) given on the maximum possible set by the equation (1) x2 + y2 + z2 − xz − yz + 2x + 2y + 2z − 2 = 0. Solution. Differentiating (1) with respect to x and y gives 2x + 2zzx − z − xzx − yzx + 2 + 2zx = 0, 2y + 2zzy − xzy − z − yzy + 2 + 2zy = 0. Hence we get that (2) zx = fx(x, y) = z − 2x − 2 2z − x − y + 2 , zy = fy(x, y) = z − 2y − 2 2z − x − y + 2 . We can notice that the partial derivatives are continuous at all points where the function f is defined. This implies that the local extrema can occur only at stationary points. These points satisfy zx = 0, i. e. z − 2x − 2 = 0, zy = 0, i. e. z − 2y − 2 = 0. We have thus two equations, which allow us to express the dependency of x and y on z. Substituting into (1), we obtain the points [x, y, z] = [ −3 + √ 6, −3 + √ 6, −4 + 2 √ 6 ] , CHAPTER 8. CALCULUS WITH MORE VARIABLES To simplify, reduce the general case slightly. Especially, without loss of generality, apply shifts of the coordinates by constant vectors. In particular, it can be assumed that x0 = 0 ∈ Rn , y0 = F(x0) = 0 ∈ Rn . So assume this property of the mapping F. Further, composing the mapping F with any linear mapping G yields a differentiable mapping again, and it is known how the differential changes. The choice G(y) = (D1 F(0))−1 (y) gives D1 (G ◦ F)(0) = idRn and thus we may assume that D1 F(0) = idRn . With these assumptions, consider the mapping K(x) = F(x) − x. This mapping is also differentiable, and its differential at 0 is zero. It is already known that each continuously differentiable mapping is Lipschitz continuous over every δ–neighbourhood Uδ of the origin (in the its domain), ∥K(x) − K(y)∥ ≤ C∥x − y∥, where C is bounded by the maximum of all absolute values of the partial derivatives in the Jacobi matrix of the mapping K in the neighbourhood Uδ, cf. 8.1.20. Since the differential of the mapping K at the point x0 = 0 is zero, one can, by selecting a sufficiently small neighbourhood U of the origin, achieve the bound ∥K(x) − K(y)∥ ≤ 1 2 ∥x − y∥. It follows by the triangle inequality that ∥x − y∥ = ∥(F(x) − K(x)) − (F(y) − K(y))∥ ≤ ∥F(x) − F(y)∥ + ∥K(x) − K(y)∥ ≤ ∥F(x) − F(y)∥ + 1 2 ∥x − y∥ and hence 1 2 ∥x − y∥ ≤ ∥F(x) − F(y)∥. With this estimate, if x ̸= y are both in the neighbourhood U = Uδ, then also F(x) ̸= F(y). Therefore, the mapping is bijective onto its image V = F(U). Write F−1 for its inverse defined on V . 
For this mapping, the latter estimate says ∥F⁻¹(x) − F⁻¹(y)∥ ≤ 2∥x − y∥, so this mapping is not only continuous (as we assumed in our first step of the proof), but also Lipschitz-continuous, as requested at the end of the previous part of the proof.

It could seem that the proof is complete, but this is not so. To finish, it is necessary to show that the mapping F restricted to a sufficiently small neighbourhood U_δ is not only bijective onto its image, but also that it maps open neighbourhoods of zero onto open neighbourhoods of zero.³

³ In the literature, there are examples of mappings which continuously and bijectively map a line segment onto a square. So this is not an obvious requirement.

[x, y, z] = [−3 − √6, −3 − √6, −4 − 2√6].

Now, we need the second derivatives in order to decide whether the local extrema really occur at the corresponding points. Differentiating z_x in (2), we obtain

z_xx = f_xx(x, y) = ( (z_x − 2)(2z − x − y + 2) − (z − 2x − 2)(2z_x − 1) ) / (2z − x − y + 2)²

with respect to x, and

z_xy = f_xy(x, y) = ( z_y(2z − x − y + 2) − (z − 2x − 2)(2z_y − 1) ) / (2z − x − y + 2)²

with respect to y. We need not calculate z_yy, since the variables x and y are interchangeable in (1) (if we swap x and y, the equation is left unchanged). Moreover, the x- and y-coordinates of the considered points are the same; hence z_xx = z_yy. Now, we evaluate these at the stationary points:

f_xx(−3 + √6, −3 + √6) = f_yy(−3 + √6, −3 + √6) = −1/√6,
f_xy(−3 + √6, −3 + √6) = f_yx(−3 + √6, −3 + √6) = 0,
f_xx(−3 − √6, −3 − √6) = f_yy(−3 − √6, −3 − √6) = 1/√6,
f_xy(−3 − √6, −3 − √6) = f_yx(−3 − √6, −3 − √6) = 0.

As for the Hessian, we have

Hf(−3 + √6, −3 + √6) = ( −1/√6 0 ; 0 −1/√6 ),
Hf(−3 − √6, −3 − √6) = ( 1/√6 0 ; 0 1/√6 ).

Apparently, the first Hessian is negative definite, while the second one is positive definite. This means that there is a strict local maximum of the function f at the point [−3 + √6, −3 + √6], and there is a strict local minimum at the point [−3 − √6, −3 − √6]. □

8.H.12. Determine the strict local extrema of the function

f(x, y) = 1/x + 1/y, x ≠ 0, y ≠ 0,

on the set of points that satisfy the equation 1/x² + 1/y² = 4.

Solution. Since both the function f and the function given implicitly by the equation 1/x² + 1/y² − 4 = 0 have continuous partial derivatives of all orders on the set R² ∖ {[0, 0]}, we should look for stationary points, i.e., for the solutions of the equations L_x = 0, L_y = 0 for

L(x, y, λ) = 1/x + 1/y − λ(1/x² + 1/y² − 4), x ≠ 0, y ≠ 0.

We thus get the equations

−1/x² + 2λ/x³ = 0, −1/y² + 2λ/y³ = 0,

which lead to x = 2λ, y = 2λ. Considering the set of points in question, the constraint x = y gives the stationary points

(1) [√2/2, √2/2], [−√2/2, −√2/2].

Now, let us examine the second differential of the function L. We can easily compute that

CHAPTER 8. CALCULUS WITH MORE VARIABLES

Decrease the neighbourhood U = U_δ so that the above estimates remain true on the boundary of U as well, and so that at the same time the Jacobi matrix of the mapping is invertible on all of U. This can be done since the determinant is a continuous function of the matrix entries. Let B denote the boundary of the set U, that is, the corresponding sphere. Since B is compact and F is continuous, the function ρ(x) = ∥F(x)∥ achieves both its maximum and its minimum on B. Denote a = (1/2) min_{x∈B} ρ(x) and consider any y ∈ O_a(0) fixed. Of course, a > 0, because x = 0 is the only point with F(x) = 0 within U_δ.
It is necessary to show that there is at least one x ∈ U such that y = F(x), which completes the proof of the inverse mapping theorem. For this purpose, consider the function (y is a fixed point)

h(x) = ∥F(x) − y∥².

Again, the continuous function h attains a minimum on the compact set U ∪ B. This minimum cannot occur for x ∈ B. Notice that F(0) = 0, hence h(0) = ∥y∥² < a². At the same time, the distance of y from F(x) for x ∈ B is at least a for all y ∈ O_a(0) (since a is selected to be half the minimum of the magnitude of F(x) on the boundary), hence h(x) ≥ a² there. Therefore, the minimum occurs inside U, and it is a stationary point z of the function h. Fixing such z means that for all j = 1, . . . , n,

∂h/∂x_j(z) = Σ_{i=1}^{n} 2 (f_i(z) − y_i) ∂f_i/∂x_j(z) = 0.

This is a system of linear equations with variables ξ_i = f_i(z) − y_i and coefficients given by twice the Jacobi matrix D¹F(z). In particular, for z ∈ U, such a system has a unique solution, and this is zero, since the Jacobi matrix is invertible. In this way the desired point x = z ∈ U is found, satisfying f_i(z) = y_i for all i = 1, . . . , n, i.e., F(z) = y. □

8.1.24. The implicit functions. The next goal is to employ the inverse mapping theorem for clarifying the properties of implicitly defined functions. To start, consider a differentiable function F(x, y) defined in the plane E₂, and look for those points (x, y) where F(x, y) = 0. An example of this can be the usual (implicit) definition of straight lines and circles:

F(x, y) = ax + by + c = 0, a, b, c ∈ R,
F(x, y) = (x − s)² + (y − t)² − r² = 0, r > 0.

While in the first case, the relation between the quantities x and y can be expressed as the function (for b ≠ 0) y = f(x) = −(a/b)x − c/b for all x; in the other case, for any point (x₀, y₀) satisfying the equation of the circle and such that y₀ ≠ t (these are the

L_xx = 2/x³ − 6λ/x⁴, L_xy = 0, L_yy = 2/y³ − 6λ/y⁴, x ≠ 0, y ≠ 0,

whence it follows that

d²L(x, y) = (2/x³ − 6λ/x⁴) dx² + (2/y³ − 6λ/y⁴) dy².

Differentiating the constraint 1/x² + 1/y² = 4, we get

−(2/x³) dx − (2/y³) dy = 0, i.e. dy² = (y⁶/x⁶) dx².

Therefore,

d²L(x, y) = [ 2/x³ − 6λ/x⁴ + (2/y³ − 6λ/y⁴)(y⁶/x⁶) ] dx².

In fact, we are considering a one-dimensional quadratic form whose positive (negative) definiteness at a stationary point means that there is a minimum (maximum) at that point. Recalling that the stationary points had x = 2λ, y = 2λ, mere substitution yields

d²L(√2/2, √2/2) = −4√2 dx², d²L(−√2/2, −√2/2) = 4√2 dx²,

which means that there is a strict local maximum of the function f at the point [√2/2, √2/2], while at the point [−√2/2, −√2/2], there is a strict local minimum. The corresponding values are:

(2) f(√2/2, √2/2) = 2√2, f(−√2/2, −√2/2) = −2√2.

Now, we will demonstrate a quicker way to obtain the result. We know (or we can easily calculate) the second partial derivatives of the function L, i.e., the Hessian with respect to the variables x and y:

HL(x, y) = ( 2/x³ − 6λ/x⁴ 0 ; 0 2/y³ − 6λ/y⁴ ).

The evaluations

HL(√2/2, √2/2) = ( −2√2 0 ; 0 −2√2 ), HL(−√2/2, −√2/2) = ( 2√2 0 ; 0 2√2 )

then tell us that the quadratic form is negative definite at the former stationary point (there is a strict local maximum) and positive definite at the latter one (there is a strict local minimum). We should be aware of a potential trap in this “quicker” method in the case we obtain an indefinite form (matrix). Then, we cannot conclude that there is no extremum at that point: since we have not included the constraint (which we did when computing d²L), we are considering a more general situation. The graph of the function f on the given set is a curve which can be described as a univariate function. This must correspond to a one-dimensional quadratic form. □
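The constrained extrema just found can also be cross-checked by eliminating the constraint: on the set 1/x² + 1/y² = 4 one may put 1/x = 2 cos t, 1/y = 2 sin t (t avoiding the axes). A minimal numerical sketch (the parametrization is ours, not part of the text):

    import numpy as np

    # On the constraint curve, f = 1/x + 1/y = 2(cos t + sin t).
    t = np.linspace(0, 2 * np.pi, 100001)
    f = 2 * (np.cos(t) + np.sin(t))
    print(f.max(), 2 * np.sqrt(2))    # both approx  2.8284
    print(f.min(), -2 * np.sqrt(2))   # both approx -2.8284
    # the maximizer t = pi/4 corresponds to x = y = sqrt(2)/2, as found above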
CHAPTER 8. CALCULUS WITH MORE VARIABLES

marginal points of the circle in the direction of the coordinate x), there is a neighbourhood of the point x₀ in which either

y = f(x) = t + √(r² − (x − s)²), or y = f(x) = t − √(r² − (x − s)²),

according to whether (x₀, y₀) belongs to the upper or the lower semicircle. If a diagram of the situation is drawn, the reason is clear: describing both the semicircles simultaneously by a single function y = f(x) is not possible.

The boundary points of the interval [s − r, s + r] are even more interesting. They also satisfy the equation of the circle with y = t, yet F_y(s ± r, t) = 0, which reflects the fact that the tangent line to the circle at these points is parallel to the y-axis. There are no neighbourhoods of these points in which the circle could be described as a function y = f(x).

Moreover, the derivatives of the function y = f(x) = t + √(r² − (x − s)²) can be easily expressed in terms of partial derivatives of the function F:

f′(x) = (1/2) · (−2(x − s))/√(r² − (x − s)²) = −(x − s)/(y − t) = −F_x/F_y.

If the roles of the variables x and y are interchanged and a relation x = f(y) such that F(f(y), y) = 0 is sought, then neighbourhoods of the points (s ± r, t) are obtained with no problem. Notice that the partial derivative F_x is non-zero at these points.

So it is observed (though for only two examples): for a function F(x, y) and a point (a, b) ∈ E₂ such that F(a, b) = 0, there is a unique function y = f(x) satisfying F(x, f(x)) = 0 on some neighbourhood of a if F_y(a, b) ≠ 0. In this case, f′(a) = −F_x(a, b)/F_y(a, b) can even be computed. We prove below that this proposition is in fact always true. The last statement about the derivative can be remembered (and is quite comprehensible if things are properly understood) from the expression for the differential of the (constant) function g(x) = F(x, y(x)) and the differential dy = f′(x)dx:

0 = dg = F_x dx + F_y dy = (F_x + F_y f′(x)) dx.

One can work analogously with implicit expressions F(x, y, z) = 0, looking for a function g(x, y) such that F(x, y, g(x, y)) = 0. As an example, consider the function f(x, y) = x² + y², whose graph is the rotational paraboloid with vertex at the origin. It can be defined implicitly by the equation 0 = F(x, y, z) = z − x² − y².

Before formulating the result for the general situation, notice which dimensions could/should appear in the problem. If it is desired to find, for this function F, a curve c(x) = (c₁(x), c₂(x)) in the plane such that F(x, c(x)) = F(x, c₁(x), c₂(x)) = 0,

8.H.13. Find the global extrema of the function

f(x, y) = 1/x + 1/y, x ≠ 0, y ≠ 0,

on the set of points that satisfy the equation 1/x² + 1/y² = 4.

Solution. This exercise illustrates that looking for global extrema may be much easier than looking for local ones (cf. the above exercise), even in the case when the function values are considered on an unbounded set. First, we would determine the stationary points (1) and the values (2) in the same way as above. Let us emphasize that we are looking for the function's extrema on a set that is not compact, so we cannot make do with merely evaluating the function at the stationary points.
The reason is that the function f may not attain an extremum on the considered set at all: its range might be an open interval. However, we will show that this is not the case here. Let us thus consider |x| ≥ 10. The equation 1/x² + 1/y² = 4 can then be satisfied only by those values y for which |y| ≥ 1/2. We have thus obtained the bounds

−2√2 < −1/10 − 2 ≤ f(x, y) ≤ 1/10 + 2 < 2√2, if |x| ≥ 10.

At the same time, we have (interchanging x and y leads to the same task)

−2√2 < −1/10 − 2 ≤ f(x, y) ≤ 1/10 + 2 < 2√2, if |y| ≥ 10.

Hence we can see that the function f must have global extrema on the considered set, and this must happen inside the square ABCD with vertices A = [−10, −10], B = [10, −10], C = [10, 10], D = [−10, 10]. The intersection of the “hundred times reduced” square with vertices at Ã = [−1/10, −1/10], B̃ = [1/10, −1/10], C̃ = [1/10, 1/10], D̃ = [−1/10, 1/10] with the given set is clearly empty (on the constraint set, 1/x² ≤ 4 forces |x| ≥ 1/2, and similarly for y). Therefore, the global extrema occur at points inside the compact set bounded by these two squares. Since f is continuously differentiable on this set, the global extrema can occur only at stationary points. We thus must have

f_max = f(√2/2, √2/2) = 2√2, f_min = f(−√2/2, −√2/2) = −2√2. □

8.H.14. Determine the maximal and minimal values of the function f(x, y, z) = xyz on the set M given by the conditions

x² + y² + z² = 1, x + y + z = 0.

Solution. It is not hard to realize that M is a circle. However, for our problem, it is sufficient to know that M is compact, i.e. bounded (by the first condition, M lies on the unit sphere) and closed (the set of solutions of the given equations is closed, since if the equations are satisfied by all terms of a convergent sequence, then they are satisfied by its limit as well). The function f as well as the constraint functions F(x, y, z) = x² + y² + z² − 1, G(x, y, z) = x + y + z

CHAPTER 8. CALCULUS WITH MORE VARIABLES

then this can be done (even for all initial conditions x = a), yet the result is not unique for a given initial condition. It suffices to consider an arbitrary curve on the rotational paraboloid whose projection onto the first coordinate has a non-zero derivative. Then consider x to be the parameter of the curve, and c(x) to be its projection onto the plane yz. Therefore, it is to be expected that one function of m + 1 variables implicitly defines a hypersurface in R^{m+1} which can be expressed (at least locally) as the graph of a function of m variables. It can be anticipated that n functions of m + n variables define an intersection of n hypersurfaces in R^{m+n}, which is expected to be an “m–dimensional” object.

8.1.25. The general theorem. Consider a differentiable mapping F = (f₁, . . . , f_n) : R^{m+n} → Rⁿ. The Jacobi matrix of this mapping has n rows and m + n columns. Write it symbolically as

D¹F = (D¹_x F, D¹_y F),

where (x₁, . . . , x_{m+n}) ∈ R^{m+n} is written as (x, y) ∈ R^m × Rⁿ, D¹_x F is the matrix of n rows and the first m columns of the Jacobi matrix (the partial derivatives ∂f_i/∂x_j, j = 1, . . . , m), while D¹_y F is the square matrix of order n formed by the remaining columns (the partial derivatives ∂f_i/∂x_{m+1}, . . . , ∂f_i/∂x_{m+n}). The multidimensional analogy to the previous reasoning with the non-zero partial derivative with respect to y is the condition that the matrix D¹_y F is invertible.

The implicit mapping theorem

Theorem.
Let F : R^{m+n} → Rⁿ be a differentiable mapping in an open neighbourhood of a point (a, b) ∈ R^m × Rⁿ = R^{m+n} at which F(a, b) = 0 and det D¹_y F ≠ 0. Then there exists a neighbourhood U of the point a ∈ R^m and a unique differentiable mapping G : R^m → Rⁿ defined on U, with G(a) = b and such that F(x, G(x)) = 0 for all x ∈ U. Moreover, the Jacobi matrix D¹G of the mapping G is, in the neighbourhood of the point a, given by the product of matrices

D¹G(x) = −(D¹_y F)⁻¹(x, G(x)) · D¹_x F(x, G(x)).

Proof. For the sake of comprehensibility, first show the proof for the simplest case of the equation F(x, y) = 0 with a function F of two variables. At first sight, it might look complicated, but this situation can be discussed in a way which can be extended for the general dimensions as in the theorem, almost without changes.

have continuous partial derivatives of all orders (since they are polynomials). The Jacobi matrix of the constraints is

( F_x(x, y, z) F_y(x, y, z) F_z(x, y, z) ; G_x(x, y, z) G_y(x, y, z) G_z(x, y, z) ) = ( 2x 2y 2z ; 1 1 1 ).

Its rank is reduced (less than 2) if and only if the vector (2x, 2y, 2z) is a multiple of the vector (1, 1, 1), which gives x = y = z, and thus x = y = z = 0 (by the second constraint). However, the set M does not contain the origin. Therefore, we may look for the stationary points using the method of Lagrange multipliers. For

L(x, y, z, λ₁, λ₂) = xyz − λ₁(x² + y² + z² − 1) − λ₂(x + y + z),

the equations L_x = 0, L_y = 0, L_z = 0 give

yz − 2λ₁x − λ₂ = 0, xz − 2λ₁y − λ₂ = 0, xy − 2λ₁z − λ₂ = 0,

respectively. Subtracting the first equation from the second one and from the third one leads to

xz − yz − 2λ₁y + 2λ₁x = 0, xy − yz − 2λ₁z + 2λ₁x = 0,

i.e.,

(x − y)(z + 2λ₁) = 0, (x − z)(y + 2λ₁) = 0.

The last equations are satisfied in these four cases: x = y, x = z; x = y, y = −2λ₁; z = −2λ₁, x = z; z = −2λ₁, y = −2λ₁; thus (including the constraint G = 0)

x = y = z = 0; x = y = −2λ₁, z = 4λ₁; x = z = −2λ₁, y = 4λ₁; x = 4λ₁, y = z = −2λ₁.

Except for the first case (which clearly cannot happen), including the constraint F = 0 yields

(4λ₁)² + (−2λ₁)² + (−2λ₁)² = 1, i.e. λ₁ = ±1/(2√6).

Altogether, we get the points

[−1/√6, −1/√6, 2/√6], [−1/√6, 2/√6, −1/√6], [2/√6, −1/√6, −1/√6],
[1/√6, 1/√6, −2/√6], [1/√6, −2/√6, 1/√6], [−2/√6, 1/√6, 1/√6].

We will not verify whether these really are points of local extrema. The only important thing is that all stationary points are among these six. We are looking for the global maximum and minimum of the continuous function f on the compact set M. However, the global extrema (we know they exist) can occur only at points of local extrema with respect to M. And the local extrema can occur only at the aforementioned points. Therefore, it suffices to evaluate the function f at these points. Thus

CHAPTER 8. CALCULUS WITH MORE VARIABLES

Extend the function F to F̃ : R² → R², (x, y) ↦ (x, F(x, y)). The Jacobi matrix of the mapping F̃ is

D¹F̃(x, y) = ( 1 0 ; F_x(x, y) F_y(x, y) ).

It follows from the assumption F_y(a, b) ≠ 0 that the same also holds in a neighbourhood of the point (a, b), so the function F̃ is invertible in this neighbourhood, by the inverse mapping theorem. Therefore, there is a uniquely defined differentiable inverse mapping F̃⁻¹ in a neighbourhood of the point (a, 0). Denote by π : R² → R the projection onto the second coordinate, and consider the function f(x) = π ∘ F̃⁻¹(x, 0). This function is well-defined and differentiable.
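The six candidate points of 8.H.14, listed in the practice column above, can also be checked numerically: the constraint circle is parametrized by an orthonormal basis of the plane x + y + z = 0. A minimal sketch (the basis vectors are our own choice):

    import numpy as np

    # orthonormal basis of the plane x + y + z = 0
    u = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)
    v = np.array([1.0, 1.0, -2.0]) / np.sqrt(6)
    t = np.linspace(0, 2 * np.pi, 200001)
    pts = np.outer(np.cos(t), u) + np.outer(np.sin(t), v)  # unit circle in the plane
    f = pts[:, 0] * pts[:, 1] * pts[:, 2]
    print(f.max(), 1 / (3 * np.sqrt(6)))    # both approx  0.13608
    print(f.min(), -1 / (3 * np.sqrt(6)))   # both approx -0.13608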
It must be verified that the expression

F(x, f(x)) = F(x, π(F̃⁻¹(x, 0)))

is zero in a neighbourhood of the point x = a. It follows directly from the definition of F̃(x, y) = (x, F(x, y)) that its inverse is of the form F̃⁻¹(x, y) = (x, π F̃⁻¹(x, y)). Therefore, the previous calculation can be resumed:

F(x, f(x)) = π( F̃(x, π(F̃⁻¹(x, 0))) ) = π( F̃(F̃⁻¹(x, 0)) ) = π(x, 0) = 0.

This proves the first part of the theorem, and it remains to compute the derivative of the function f(x). This derivative can, once again, be obtained by invoking the inverse mapping theorem, using the matrix (D¹F̃)⁻¹. The following equality is easily verified by multiplying the matrices; it can also be computed directly using the explicit formula for the inverse matrix in terms of the determinant and the algebraically adjoint matrix, see paragraph 2.2.11:

( 1 0 ; F_x(x, y) F_y(x, y) )⁻¹ = (F_y(x, y))⁻¹ ( F_y(x, y) 0 ; −F_x(x, y) 1 ).

By the definition, f(x) = π F̃⁻¹(x, 0), and thus the first entry of the second row of this matrix is the derivative f′(x) with y = f(x), i.e. the Jacobi matrix D¹f. In this simple case, it is exactly the desired scalar −F_x(x, f(x))/F_y(x, f(x)).

The general proof is exactly the same; there is no need to change any of the formulae. We obtain the invertible mapping F̃ : R^{m+n} → R^{m+n} and define G(x) = π F̃⁻¹(x, 0), where π : R^{m+n} → Rⁿ, π(x, y) = y. The same check as above reveals that F(x, G(x)) = 0, as requested. Only in the last computation of the derivative do the corresponding blocks D¹_x F and D¹_y F of the Jacobi matrix appear, instead of the particular partial derivatives. For the calculation of the Jacobi matrix of the mapping G, use the computation of the inverse matrix. This time, the algebraic procedure from paragraph 2.2.11 is not very advantageous. It is better to be guided by the case in dimension

we find out that the wanted maximum is

f(−1/√6, −1/√6, 2/√6) = f(−1/√6, 2/√6, −1/√6) = f(2/√6, −1/√6, −1/√6) = 1/(3√6),

while the minimum is

f(1/√6, 1/√6, −2/√6) = f(1/√6, −2/√6, 1/√6) = f(−2/√6, 1/√6, 1/√6) = −1/(3√6). □

8.H.15. Find the extrema of the function f : R³ → R, f(x, y, z) = x² + y² + z², on the plane x + y − z = 1 and determine their types.

Solution. We can easily build the equations expressing that the gradient of f is a multiple of the normal vector (1, 1, −1) of the constraint plane:

x = k, y = k, z = −k, k ∈ R.

Together with the constraint, the only solution is the point [1/3, 1/3, −1/3]. Further, we can notice that the function f grows beyond all bounds along the constraint plane (for instance in the direction (1, −1, 0), which lies in that plane). Therefore, the examined function has a minimum at this point.

Another solution. We will reduce this problem to finding the extrema of a two-variable function on R². Since the constraint is linear, we can express z = x + y − 1. Substituting this into the given function then yields a real-valued function of two variables:

f(x, y) = x² + y² + (x + y − 1)² = 2x² + 2xy + 2y² − 2x − 2y + 1.

Setting both partial derivatives equal to zero, we get the linear equations

4x + 2y − 2 = 0, 4y + 2x − 2 = 0,

whose only solution is the point [1/3, 1/3]. Since the quadratic part of this function is positive definite, the function is convex and grows beyond all bounds on R²; therefore, there is a (global) minimum at the obtained stationary point. Then, we can get the corresponding point [1/3, 1/3, −1/3] in the constraint plane from the linear expression for z. □
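The result of 8.H.15 is easy to confirm with a constrained numerical optimizer. A minimal sketch using SciPy (our own check, not part of the original solution):

    import numpy as np
    from scipy.optimize import minimize

    # minimize x^2 + y^2 + z^2 subject to x + y - z = 1
    res = minimize(lambda p: p @ p, x0=np.zeros(3), method="SLSQP",
                   constraints=[{"type": "eq",
                                 "fun": lambda p: p[0] + p[1] - p[2] - 1}])
    print(res.x)    # approx [ 1/3, 1/3, -1/3 ]
    print(res.fun)  # approx 1/3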
8.H.16. Find the extrema of the function f : R³ → R, f(x, y, z) = x + y, on the circle given by the equations x + y + z = 1 and x² + y² + z² = 4.

Solution. The “suspects” are those points which satisfy

(1, 1, 0) = k · (1, 1, 1) + l · (x, y, z), k, l ∈ R.

Clearly, x = y (subtract the first two component equations and note that l ≠ 0). Substituting this into the equations of the circle then leads to the two solutions

[1/3 ± √22/6, 1/3 ± √22/6, 1/3 ∓ √22/3].

Since every circle is compact, it suffices to examine the function values at these two points. We find out that there is a

CHAPTER 8. CALCULUS WITH MORE VARIABLES

m + n = 2 and to divide the matrix

(D¹F̃)⁻¹ = ( id_{R^m} 0 ; D¹_x F(x, y) D¹_y F(x, y) )⁻¹ = ( A B ; C D )

into blocks of m and n rows and columns (for instance, A is of type m×m, while C is of type n×m). Now, the matrices A, B, C, D can be determined from the defining equality for the inverse:

( id_{R^m} 0 ; D¹_x F(x, y) D¹_y F(x, y) ) · ( A B ; C D ) = ( id_{R^m} 0 ; 0 id_{Rⁿ} ).

Apparently, it follows that A = id_{R^m}, B = 0, D = (D¹_y F)⁻¹, and finally, D¹_x F + D¹_y F · C = 0. The latter equality already implies the desired relation

D¹G = C = −(D¹_y F)⁻¹ · D¹_x F.

This concludes the proof of the theorem. □

8.1.26. The gradient of a function. As seen in the previous paragraph, if F is a continuously differentiable function of n variables, the equation F(x₁, . . . , x_n) = b with a fixed value b ∈ R defines a subset M_b ⊂ Rⁿ which mostly has the properties of an (n−1)–dimensional hypersurface. To be more precise, if the vector of the partial derivatives

D¹F = ( ∂F/∂x₁, . . . , ∂F/∂x_n )

is non-zero, the set M_b can be described locally as the graph of a continuously differentiable function of n − 1 variables. The sets M_b are called the level sets of F. The vector D¹F ∈ Rⁿ is called the gradient of the function F. In technical and physical literature, it is also often denoted as grad F or ∇F.

Since M_b is given by a constant value of the function F, the derivatives of the curves lying in M_b have the property that the differential dF always evaluates to zero along them. For every such curve, F(c(t)) = b, hence

d/dt F(c(t)) = dF(c′(t)) = 0.

On the other hand, we can consider a general vector v = (v₁, . . . , v_n) ∈ Rⁿ and the magnitude of the corresponding directional derivative

|d_v F| = | ∂F/∂x₁ v₁ + · · · + ∂F/∂x_n v_n | = |cos φ| ∥D¹F∥ ∥v∥,

where φ is the angle between the directions of the vector v and the gradient D¹F, see the discussion about angles of vectors and straight lines in the fourth chapter (cf. definition 4.1.15). Thus, the following result is observed:

maximum of the considered function on the given circle at the former point and a minimum at the latter one. □

8.H.17. Find the extrema of the function f : R³ → R, f(x, y, z) = x² + y² + z², on the plane 2x + y − z = 1 and determine their types. ⃝

8.H.18. Find the maximum of the function f : R² → R, f(x, y) = xy on the circle with radius 1 which is centered at the point [x₀, y₀] = [0, 1]. ⃝

8.H.19. Find the minimum of the function f : R² → R, f = xy on the circle with radius 1 which is centered at the point [x₀, y₀] = [2, 0]. ⃝

8.H.20. Find the minimum of the function f : R² → R, f = xy on the circle with radius 1 which is centered at the point [x₀, y₀] = [2, 0]. ⃝

8.H.21. Find the minimum of the function f : R² → R, f = xy on the ellipse x² + 3y² = 1. ⃝

8.H.22. Find the minimum of the function f : R² → R, f = x²y on the circle with radius 1 which is centered at the point [x₀, y₀] = [0, 0]. ⃝
8.H.23. Find the maximum of the function f : R² → R, f(x, y) = x³y on the circle x² + y² = 1. ⃝

8.H.24. Find the maximum of the function f : R² → R, f(x, y) = xy on the ellipse 2x² + 3y² = 1. ⃝

8.H.25. Find the maximum of the function f : R² → R, f(x, y) = xy on the ellipse x² + 2y² = 1. ⃝

I. Volumes, areas, centroids of solids

8.I.1. Find the volume of the solid which lies in the half-space z ≥ 0, inside the cylinder x² + y² ≤ 1, and in the half-space a) z ≤ x, b) x + y + z ≤ 0.

Solution. a) The volume can be calculated with ease using cylindric coordinates. There, the cylinder is determined by the inequality r ≤ 1, and the half-space z ≤ x by z ≤ r cos φ. Altogether, we get

CHAPTER 8. CALCULUS WITH MORE VARIABLES

The maximal growth of a function

Proposition. The gradient D¹F = (∂F/∂x₁, . . . , ∂F/∂x_n) provides the direction of maximal growth of the function F of n variables. Moreover, the vanishing directional derivatives are exactly those in the directions perpendicular to the gradient.

Therefore, it is clear that the tangent plane to a non-empty level set M_b in a neighbourhood of its point with non-zero gradient D¹F is determined by the orthogonal complement of the gradient, and the gradient itself is the normal vector of the hypersurface M_b.

For instance, considering a sphere in R³ with radius r > 0, centered at (a, b, c), i.e. given implicitly by the equation

F(x, y, z) = (x − a)² + (y − b)² + (z − c)² = r²,

the normal vectors at a point P = (x₀, y₀, z₀) are obtained as non-zero multiples of the gradient, i.e. multiples of

D¹F = ( 2(x₀ − a), 2(y₀ − b), 2(z₀ − c) ),

and the tangent vectors are exactly the vectors perpendicular to the gradient. Therefore, the tangent plane to the sphere at the point P can always be described implicitly in terms of the gradient by the equation

0 = (x₀ − a)(x − x₀) + (y₀ − b)(y − y₀) + (z₀ − c)(z − z₀).

This is a special case of the following general formula:

Tangent hyperplanes to level sets

Theorem. For a function F(x₁, . . . , x_n) of n variables and a point P = (p₁, . . . , p_n) in a level set M_b of the function F such that the gradient D¹F is non-vanishing at P, the implicit equation for the tangent hyperplane to M_b is

0 = ∂F/∂x₁(P)(x₁ − p₁) + · · · + ∂F/∂x_n(P)(x_n − p_n).

Proof. The statement is clear from the previous discussions. The tangent hyperplane must be (n − 1)–dimensional, so its direction space is given as the kernel of the linear form given by the gradient (the zero values of the corresponding linear mapping Rⁿ → R given by multiplying the column of coordinates by the row vector grad F). Clearly, the selected point P satisfies the equation. □

8.1.27. Illumination of 3D objects. Consider the illumination of a three-dimensional object where the direction v of the light falling onto the two-dimensional surface M of this object is known. Assume M is given implicitly by an equation F(x, y, z) = 0. The light intensity at a point P ∈ M is defined as I cos φ, where φ is the angle between the normal line to M and the vector which is opposite to the flow of the light. As seen, the normal line is determined by the gradient of the function F.

V = ∫₀¹ ∫_{−π/2}^{π/2} ∫₀^{r cos φ} r dz dφ dr = 2/3.

b) We will reduce this problem to one that is completely analogous to part a) by rotating the solid around the z-axis by the angle π/4 (be it in the positive or the negative direction). Applying the rotation matrix

( √2/2 −√2/2 0 ; √2/2 √2/2 0 ; 0 0 1 ),

the original inequality x + y + z ≤ 0 is transformed to √2 x′ + z′ ≤ 0 in the new coordinates.
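Before finishing part b), the value of part a) can be cross-checked numerically. A minimal sketch using SciPy's dblquad (the check is ours, not part of the original solution); dblquad integrates func(y, x) with the first argument innermost:

    from math import pi, cos
    from scipy.integrate import dblquad

    # part a) in cylindric coordinates: integrand r * (r cos phi),
    # phi in [-pi/2, pi/2], r in [0, 1]
    V, err = dblquad(lambda r, phi: r * r * cos(phi),
                     -pi / 2, pi / 2, lambda phi: 0, lambda phi: 1)
    print(V)  # approx 0.6667 = 2/3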
Now, it is easy to express the integral that corresponds to the volume of the examined solid:

V = ∫₀¹ ∫_{π/2}^{3π/2} ∫₀^{−√2 r cos φ} r dz dφ dr = 2√2/3.

We need not have computed the result as we did; instead, we could notice that the present solid is obtained from the solid of part a) by a rotation together with a stretching by the factor √2 in the direction of the z-axis. See also remark 8.I.11. □

8.I.2. Find the volume of the solid in R³ which is given by x² + y² + z² ≤ 1, 3x² + 3y² ≥ z², x ≥ 0.

Solution. First, we should realize what the examined solid looks like. It is the part of a ball which lies outside a given cone (see the picture). The best way to determine the volume is probably to subtract half the volume of the cone-shaped sector determined by the cone from half the ball's volume (note that the volume of the solid does not change if we replace the condition x ≥ 0 with z ≥ 0; the solid is cut either “horizontally” or “vertically”, but always into halves). We will calculate in spherical coordinates,

x = r cos(φ) sin(ψ), y = r sin(φ) sin(ψ), z = r cos(ψ), φ ∈ [0, 2π), ψ ∈ [0, π), r ∈ (0, ∞).

CHAPTER 8. CALCULUS WITH MORE VARIABLES

The sign of the expression then says which side of the surface is illuminated. For example, consider an illumination with constant intensity I₀ in the direction of the vector v = (1, 1, −1) (i.e. “downward askew”), and let the ball given by the inequality F(x, y, z) = x² + y² + z² − 1 ≤ 0 be the object of interest. Then, for a point P = (x, y, z) ∈ M on the surface, the intensity

I(P) = ( grad F · (−v) ) / ( ∥grad F∥ ∥v∥ ) I₀ = ( (−2x − 2y + 2z) / (2√3) ) I₀

is obtained. Notice that, as anticipated, the point which is illuminated with the (full) intensity I₀ is the point P = (1/√3)(−1, −1, 1) on the surface of the ball, while the antipodal point is fully illuminated with the minus sign (i.e. on the inside of the sphere).

8.1.28. Tangent and normal spaces. Ideas about tangent and normal lines can be extended to general dimensions. With a mapping F : R^{m+n} → Rⁿ and coordinate functions f_i, one can also consider the n equations for the m + n variables,

f_i(x₁, . . . , x_{m+n}) = b_i, i = 1, . . . , n,

expressing the equality F(x) = b for a vector b ∈ Rⁿ. Assuming that the conditions of the implicit function theorem hold, the set of all solutions (x₁, . . . , x_{m+n}) ∈ R^{m+n} is (at least locally) the graph of a mapping G : R^m → Rⁿ. Technically, it is necessary to have some submatrix in D¹F of the maximal possible rank n.

For a fixed choice b = (b₁, . . . , b_n), the set of all solutions is, of course, the intersection of all the hypersurfaces M(b_i, f_i) corresponding to the particular functions f_i. The same must hold for the tangent directions, while the normal directions are generated by the individual gradients. Therefore, if D¹F (the matrix with the rows given by the gradients of the f_i) is the Jacobi matrix of a mapping which implicitly defines a set M, and P = (p₁, . . . , p_{m+n}) ∈ M is a point such that M is the graph of a mapping in a neighbourhood of the point P, then the affine subspace in R^{m+n} which contains exactly all the tangent lines going through the point P is given implicitly by the following equations:

0 = ∂f₁/∂x₁(P)(x₁ − p₁) + · · · + ∂f₁/∂x_{m+n}(P)(x_{m+n} − p_{m+n})
⋮
0 = ∂f_n/∂x₁(P)(x₁ − p₁) + · · · + ∂f_n/∂x_{m+n}(P)(x_{m+n} − p_{m+n}).

This subspace is called the tangent space to the (implicitly given) m–dimensional surface M at the point P.
The normal space at the point P is the affine subspace generated by the point P and the gradients of all the functions

The Jacobian of this transformation R³ → R³ is r² sin(ψ). First of all, let us determine the volume of the ball. As for the integration bounds, it is convenient to express the conditions that bind the solid in the coordinates we will work in. In the spherical coordinates, the ball is given by the inequality x² + y² + z² = r² ≤ 1. First, let us find the integration bounds for the variable φ. If we denote by π_φ the projection onto the φ-coordinate in the spherical coordinates (π_φ(φ, ψ, r) = φ), then the image under π_φ of the solid in question gives the integration bounds for the variable φ. We know that π_φ(ball) = [0, 2π) (the inequality r² ≤ 1 does not contain the variable φ, so there are no constraints on it, and it takes on all possible values; this can also easily be imagined in space). Having the bounds of one of the variables determined, we can proceed with the bounds of the other variables. In general, those may depend on the variables whose bounds have already been determined (although this is not the case here). Thus, we choose arbitrarily a φ₀ ∈ [0, 2π), and for this φ₀ (fixed from now on), we find the intersection of the solid (ball) with the surface φ = φ₀ and its projection π_ψ onto the variable ψ. Similarly as for φ, the variable ψ is not bounded (either by the inequality r² ≤ 1 or by the equality φ = φ₀), so it can take on all possible values, ψ ∈ [0, π). Finally, let us fix a φ = φ₀ and a ψ = ψ₀. Now, we are looking for the projection π_r(U) of the object (line segment) U given by the constraints r² ≤ 1, φ = φ₀, ψ = ψ₀ onto the variable r. The only constraint for r is the condition r² ≤ 1, so r ∈ (0, 1]. Note that the integration bounds of the variables are independent of each other, so we can perform the integration in any order. Thus, we have

V_ball = ∫₀¹ ∫₀^{2π} ∫₀^π r² sin(ψ) dψ dφ dr = 4π/3.

Now, let us compute the volume of the spherical sector inside the cone, given by x² + y² + z² ≤ 1, 3x² + 3y² ≤ z², z ≥ 0. Again, we express the conditions in the spherical coordinates: r² ≤ 1 and 3 sin²(ψ) ≤ cos²(ψ), i.e., tan(ψ) ≤ 1/√3. Just like in the case of the ball, the variables occur independently in the inequalities, so the integration bounds of the variables will be independent of each other as well. The condition r² ≤ 1 implies r ∈ (0, 1]; from tan(ψ) ≤ 1/√3, we have ψ ∈ [0, π/6]. The variable φ is not restricted by any condition, so φ ∈ [0, 2π].

V_sector = ∫₀^{2π} ∫₀¹ ∫₀^{π/6} r² sin ψ dψ dr dφ = ((2 − √3)/3) π;

altogether (the halving corresponds to the condition x ≥ 0, and V_sector is exactly half the volume of the double sector cut out by the cone),

V = V_ball/2 − V_sector = 2π/3 − ((2 − √3)/3) π = π/√3.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

f₁, . . . , f_n at the point P, i.e. by the rows of the Jacobi matrix D¹F.

As an illustrative simple example, calculate the tangent and normal spaces to a conic section in R³. Consider the equation of a cone with vertex at the origin,

0 = f(x, y, z) = z − √(x² + y²),

and a plane given by

0 = g(x, y, z) = z − 2x + y + 1.

The point P = (1, 0, 1) belongs to both the cone and the plane, so the intersection M of these surfaces is a curve (draw a diagram!). Its tangent line at the point P is given by the following equations:

0 = −( x/√(x² + y²) )|_{x=1,y=0} · (x − 1) − ( y/√(x² + y²) )|_{x=1,y=0} · y + 1 · (z − 1) = −x + z,
0 = −2(x − 1) + y + (z − 1) = −2x + y + z + 1,

while the plane perpendicular to the curve, containing the point P, is given parametrically by the expression

(1, 0, 1) + τ(−1, 0, 1) + σ(−2, 1, 1)

with real parameters τ and σ.
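The gradients and the tangent direction of this example are easy to verify symbolically. A minimal sketch with SymPy (the cross product of the two gradients gives a direction vector of the tangent line):

    import sympy as sp

    x, y, z = sp.symbols('x y z')
    f = z - sp.sqrt(x**2 + y**2)   # the cone
    g = z - 2*x + y + 1            # the plane
    P = {x: 1, y: 0, z: 1}
    grad_f = sp.Matrix([f.diff(v) for v in (x, y, z)]).subs(P)  # (-1, 0, 1)
    grad_g = sp.Matrix([g.diff(v) for v in (x, y, z)]).subs(P)  # (-2, 1, 1)
    print(grad_f.T, grad_g.T)
    print(grad_f.cross(grad_g).T)  # (-1, -1, -1): tangent direction at P

One checks directly that the direction (−1, −1, −1) satisfies both implicit equations of the tangent line above.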
8.1.29. Constrained extrema. Now we arrive at the first really serious application of the differential calculus of several variables. The typical task in optimization is to find the extrema of values depending on several (yet finitely many) parameters, under some further constraints on the parameters of the model. The problem often has m + n parameters constrained by n conditions. In the language of differential calculus, it is desired to find the extrema of a differentiable function h on the set M of points given implicitly by a vector equation F(x₁, . . . , x_{m+n}) = 0.

Of course, we might first locally parameterize the solution space of the latter equation by m free parameters, express the function h in terms of these parameters, and look for the local extrema by inspecting the critical points. However, we have already prepared more efficient procedures for this effort. If h has a local extremum on M at a point P, then for every curve c(t) ⊂ M going through P = c(0), the univariate function h(c(t)) has a local extremum at t = 0. Therefore, the derivative must satisfy

d/dt h(c(t))|_{t=0} = d_{c′(0)}h(P) = dh(P)(c′(0)) = 0.

This means that the differential of the function h at the point P vanishes along all tangent increments to M at P. This property is equivalent to stating that the gradient of h lies in the normal subspace (more precisely, in the modelling vector space of the normal subspace). Such points P ∈ M are called stationary points of the function h with respect to the constraints given by F.

We could also have computed the volume directly:

V = ∫₀^π ∫₀¹ ∫_{π/6}^{5π/6} r² sin ψ dψ dr dφ = π/√3.

In the cylindric coordinates x = r cos(φ), y = r sin(φ), z = z, with the Jacobian of this transformation equal to r, the calculation of the volume as the difference of the two solids considered above looks as follows:

V = 2π/3 − ∫₀^{2π} ∫₀^{1/2} ∫_{√3 r}^{√(1−r²)} r dz dr dφ = π/√3.

Note that we cannot compute the volume of the solid by a single iterated integral in the cylindric coordinates. Thus, we must split it into the two solids defined by the conditions r ≤ 1/2 and r ≥ 1/2, respectively:

V = V₁ + V₂ = ∫₀^{2π} ∫₀^{1/2} ∫₀^{√3 r} r dz dr dφ + ∫₀^{2π} ∫_{1/2}^{1} ∫₀^{√(1−r²)} r dz dr dφ = π/√3. □

Another alternative is to compute it as the volume of a solid of revolution, again splitting the solid into the two parts as in the previous case (the part “under the cone” and the part “under the sphere”). However, neither of these parts is directly the solid of revolution of the graph of a single function, so we express each as a difference. The volume of the former part can be calculated as the difference between the volumes of the cylinder x² + y² ≤ 1/4, 0 ≤ z ≤ √3/2, and of the part of the cone 3x² + 3y² ≤ z², 0 ≤ z ≤ √3/2. The volume of the latter part is then the difference between the volume of the solid created by rotating the arc x = √(1 − z²), 0 ≤ z ≤ √3/2, around the z-axis and the volume of the cylinder x² + y² ≤ 1/4, 0 ≤ z ≤ √3/2:

V = V₁ + V₂ = ( π√3/8 − π√3/24 ) + ( π ∫₀^{√3/2} (1 − z²) dz − π√3/8 ) = π√3/12 + π√3/4 = π/√3.

8.I.3. Calculate the volume of the spherical segment of the ball x² + y² + z² ≤ 2 cut by the plane z = 1.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

As seen in the previous paragraph, the normal space to the set M is generated by the rows of the Jacobi matrix of the mapping F, so the stationary points are described equivalently by the following proposition:

Lagrange multipliers

Theorem. Let F = (f₁, . . . , f_n) : R^{m+n} → Rⁿ be a differentiable mapping in a neighbourhood of a point P, F(P) = 0. Further, let M be given implicitly by the equation F(x, y) = 0, and let the rank of the matrix D¹F at the point P be n.
Then P is a stationary point of a continuously differentiable function h : R^{m+n} → R with respect to the constraints F if and only if there exist real parameters λ₁, . . . , λ_n such that

grad h = λ₁ grad f₁ + · · · + λ_n grad f_n.

The procedure suggested by the theorem is called the method of Lagrange multipliers. It is of algorithmic character. Consider the numbers of unknowns and equations: the gradients are vectors of m + n coordinates, so the statement of the theorem yields m + n equations. The variables are, on the one hand, the coordinates x₁, . . . , x_{m+n} of the stationary points P with respect to the constraints, and, on the other hand, the n parameters λ_i in the linear combination. It remains to add that the point P must belong to the implicitly given set M, which represents n further equations. Altogether, there are 2n + m equations for 2n + m variables, so it can be expected that the solution is given by a discrete set of points P (i.e., each one of them is an isolated point).

Very often, the system of equations is a seemingly simple system of algebraic equations, but in fact only rarely can it be solved explicitly. We return to special algebraic methods for systems of polynomial equations in chapter 12. There are also various numerical approaches to such systems. Theoretical details are not discussed here, but there are several solved examples in the other column, including an illustration of how to use the second derivatives to decide about the local extrema under the constraints.

8.1.30. Arithmetic mean versus geometric mean. As an example of practical application of the Lagrange multipliers, we prove the inequality

(1/n)(x₁ + · · · + x_n) ≥ ⁿ√(x₁ · · · x_n)

for any n positive real numbers x₁, . . . , x_n. Equality occurs if and only if all the x_i's are equal.

Consider the sum x₁ + · · · + x_n = c as the constraint for a (non-specified) positive constant c. We look for the maxima and minima of the function f(x₁, . . . , x_n) = ⁿ√(x₁ · · · x_n) with respect to the constraint and the assumption x₁ > 0, . . . , x_n > 0.

Solution. We will compute the integral in spherical coordinates. The segment can be perceived as a spherical sector without the cone (with vertex at the point [0, 0, 0] and the circular base z = 1, x² + y² ≤ 1). In these coordinates, the sector is the product of the intervals [0, √2] × [0, 2π) × [0, π/4]. We thus integrate within the given bounds, in any order:

∫₀^{2π} ∫₀^{√2} ∫₀^{π/4} r² sin(θ) dθ dr dφ = (4/3)(√2 − 1)π.

In the end, we must subtract the volume of the cone. That is equal to (1/3)πR²H (where R is the radius of the cone's base and H is its height; both are equal to 1 in our case), so the total volume is

V_sector − V_cone = (4/3)(√2 − 1)π − (1/3)π = (1/3)π(4√2 − 5).

The volume of a general spherical segment of height h in a ball with radius R could be computed similarly:

V = V_sector − V_cone = ∫₀^{2π} ∫₀^{arccos((R−h)/R)} ∫₀^R r² sin(θ) dr dθ dφ − (1/3)π(2Rh − h²)(R − h) = (1/3)πh²(3R − h). □

8.I.4. Find the volume of the part of the cylinder x² + z² = 16 which lies inside the cylinder x² + y² = 16.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

The normal vector to the part of the hyperplane M defined by the constraint is (1, . . . , 1). Therefore, the function f can have an extremum only at those points where its gradient is a multiple of this normal vector.
Hence there is the following system of equations for the desired points (the components ∂f/∂x_i of the gradient appear on the left-hand sides, a λ-multiple of the normal vector on the right):

(1/n)(1/x_i) ⁿ√(x₁ · · · x_n) = λ, for i = 1, . . . , n and λ ∈ R.

These equations imply x₁ = · · · = x_n on the set M. If the variables x_i are allowed to be zero as well, then the set M becomes compact, so the function f has to attain both a maximum and a minimum on it. However, f is minimal if and only if at least one of the values x_i is zero; so the function necessarily has a strict maximum at the point with x_i = c/n, i = 1, . . . , n, and then λ = 1/n. By substitution, the geometric mean equals the arithmetic mean for these extremal values, but it is strictly smaller at all other points with the given sum c of coordinates, which proves the inequality.

2. Integration for the second time

We return to the process of integration, discussed in the second and third parts of chapter six. We saw that the integration with respect to the diverse coordinates can be iterated. Now we extend the concept of the Riemann integration and Jordan measure to general Euclidean spaces and, again, we shall see that the approaches coincide for many reasonable functions.

8.2.1. Integrals dependent on parameters. Recall that integrating a function f(x, y₁, . . . , y_n) of n + 1 variables with respect to the single variable x, the result is a function F(y₁, . . . , y_n) of the remaining variables. Essentially, we proved the following theorem already in 6.3.11 and 6.3.13. This is an extremely useful technical tool, as we saw when handling the Fourier transforms and convolutions in the last chapter. The previous results about extrema of multivariate functions also have a direct application for the minimization of areas or volumes of objects defined in terms of functions dependent on parameters, etc.

Solution. We will compute the integral in Cartesian coordinates. Since the solid is symmetric, it suffices to integrate over the first octant (interchanging x and −x does not change the defining inequalities of the solid; the same holds for y and for z). The part of the solid that lies in the first octant is the space under the graph of the function z(x, y) = √(16 − x²) and over the quarter-disc x² + y² ≤ 16, x ≥ 0, y ≥ 0. Therefore, the volume of the whole solid is equal to

V = 8 ∫₀⁴ ∫₀^{√(16−x²)} √(16 − x²) dy dx = 8 ∫₀⁴ (16 − x²) dx = 1024/3. □

Remark. Note that the projection of the considered solid onto both the plane y = 0 and the plane z = 0 is a disc with radius 4, yet the solid is not a ball.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

Continuity and differentiation

Theorem. Consider a continuous function f(x, y₁, . . . , y_n) defined for all x from a finite interval [a, b] and for all (y₁, . . . , y_n) lying in some neighbourhood U of a point c = (c₁, . . . , c_n) ∈ Rⁿ, and its integral

F(y₁, . . . , y_n) = ∫_a^b f(x, y₁, . . . , y_n) dx.

Then the function F(y₁, . . . , y_n) is continuous on U. Moreover, if the continuous partial derivative ∂f/∂y_j exists on a neighbourhood of the point c, then ∂F/∂y_j(c) exists as well, and

∂F/∂y_j(c) = ∫_a^b ∂f/∂y_j(x, c₁, . . . , c_n) dx.

Proof. In Chapter 6, we dealt with two variables x, y only, but replacing the absolute value |y| with the norm ∥y∥ of the vector of parameters does not change the argumentation at all. Again, the main point is that continuous real functions on compact sets are uniformly continuous.
Since the partial derivative concerns only one of the variables, the rest of the theorem was proved in 6.3.13, too. □

8.2.2. Integration of multivariate functions. In the case of univariate functions, integration is motivated by the idea of the area under the graph of a given function of one variable. Consider now the volume of the part of the three-dimensional space which lies under the graph of a function z = f(x, y) of two variables, and the multidimensional analogues in general. In chapter six, small intervals [x_i, x_{i+1}] of lengths ∆x_i were chosen which divided the whole interval [a, b]. Then, their representatives ξ_i were selected, and the corresponding part of the area was approximated by the area of the rectangle with height given by the value f(ξ_i) at the representative, i.e. by the expression f(ξ_i)∆x_i. In the case of functions of two variables, we work with divisions in both variables and with values representing the height of the graph above the particular little rectangles in the plane.

The first thing to deal with is the integration domain, that is, the region the function f is to be integrated over. As an example, consider the function z = f(x, y) = √(1 − x² − y²), whose graph, over the unit disc, is half of the unit sphere. Integrating this function over the unit disc yields the volume of the unit semi-ball.

The simplest approach is to consider only those integration domains M which are given by products of intervals, i.e. given by the ranges x ∈ [a, b] and y ∈ [c, d]. In this context, such a product is called a multidimensional interval. If M is a different bounded set in R², we work with a sufficiently large interval [a, b] × [c, d] rather than with the set itself, and adjust the function so that f(x, y) = 0 at all points lying outside M.

8.I.5. Find the volume of the part of the cylinder x² + y² ≤ 4 bounded by the planes z = 0 and z = x + y + 2.

Solution. We will work in cylindric coordinates given by the equations x = r cos(φ), y = r sin(φ), z = z. The Jacobian of this transformation is J = r. The solid can be divided into two parts, above and below the plane z = 0, whose volumes will be denoted by V₁ and V₂, respectively. Further, we can notice that one part of the solid contributing to V₁ is the pyramid with vertices [0, 0, 0], [0, 0, 2], [−2, 0, 0], [0, −2, 0]. Thus, we further split the solid above z = 0 into two parts, whose volumes we calculate separately:

V₁ − V_pyramid = ∫_{−π/2}^{π} ( ∫₀² [r sin φ + r cos φ + 2] r dr ) dφ = 6π + 16/3, V_pyramid = 4/3.

Further,

V₁ − V₂ = ∫_{−π}^{π} ∫₀² r²(sin(φ) + cos(φ)) + 2r dr dφ = 8π,

so V₁ + V₂ = 2V₁ − (V₁ − V₂) = 4π + 40/3. □

Remark. During the calculation, we made use of the fact that integrating a function of two variables over a region in R² yields the difference between the volume of the solid in R³ determined by the graph of the integrated function that lies above the plane z = 0 and the volume of the part lying below it.

8.I.6. Find the volume of the solid in R³ which is given by the intersection of the sphere x² + y² + z² = 4 and the cylinder x² + y² = 1.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

Considering the above case of the unit ball, integrate over the set M = [−1, 1] × [−1, 1] the function

f(x, y) = √(1 − x² − y²) for x² + y² ≤ 1, and f(x, y) = 0 otherwise.

The definition of the Riemann integral then faithfully follows the procedure from paragraph 6.2.8. This can be done for an arbitrary finite number of variables.
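This procedure can be tried out numerically at once: a crude midpoint Riemann sum for the function above already approaches the volume 2π/3 of the unit semi-ball. A minimal sketch (the grid size is an arbitrary choice of ours):

    import numpy as np

    # Riemann sum over I = [-1,1] x [-1,1], with f extended by zero outside M
    n = 1000                                  # subintervals per axis
    xs = (np.arange(n) + 0.5) * 2 / n - 1     # midpoint representatives
    X, Y = np.meshgrid(xs, xs)
    F = np.sqrt(np.clip(1 - X**2 - Y**2, 0, None))
    S = F.sum() * (2 / n) ** 2                # sum of f(xi) * area of cells
    print(S, 2 * np.pi / 3)                   # both approx 2.0944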
Given an n–dimensional interval I and partitions into k_i subintervals in each variable x_i, select the partition of I into k₁ · · · k_n small n–dimensional intervals, and write ∆x_{i₁...i_n} for their volumes. The maximum of the lengths of the sides of the multidimensional intervals in such a partition is called its norm.

Riemann integral

Definition. The Riemann integral of a real-valued function f defined on a multidimensional interval I = [a₁, b₁] × [a₂, b₂] × . . . × [a_n, b_n] exists if, for every choice of a sequence of divisions Ξ (dividing the multidimensional interval in all variables simultaneously) with the norms of the partitions converging to zero, and for every choice of representatives ξ_{i₁...i_n} of the little multidimensional intervals in the partitions, the integral sums

S_{Ξ,ξ} = Σ_{i₁...i_n} f(ξ_{i₁...i_n}) ∆x_{i₁...i_n}

always converge to the value

S = ∫_I f(x₁, . . . , x_n) dx₁ . . . dx_n,

independent of the selected sequence of divisions and representatives. The function f is then said to be Riemann-integrable over I.

As a relatively simple exercise, prove in detail that every Riemann-integrable function over an interval I must be bounded there. The reason is the same as in the case of univariate functions: the control of the norms of the divisions used in the definition is somewhat rough. The situation gets worse when integrating in this way over unbounded intervals, see more remarks in 8.2.6 below. Therefore, we consider integration of functions over Rⁿ mainly for functions whose support is compact, that is, functions which vanish outside a bounded interval I.

A bounded set M ⊂ Rⁿ is said to be Riemann-measurable⁴ if and only if its indicator function, defined by

χ_M(x₁, . . . , x_n) = 1 for (x₁, . . . , x_n) ∈ M, and 0 for all other points in Rⁿ,

is Riemann-integrable over Rⁿ.

⁴ Better to say “measurable via Riemann integration”; the measure itself is commonly called the Peano–Jordan measure in the literature.

Solution. Thanks to symmetry, it suffices to compute the volume of the part that lies in the first octant. We will integrate in cylindric coordinates given by the equations x = r cos(φ), y = r sin(φ), z = z, with Jacobian J = r; the relevant part of the solid is the space between the plane z = 0 and the graph of the function z = √(4 − x² − y²) = √(4 − r²). Therefore, we can directly write the volume as the double integral

V = 8 ∫₀^{π/2} ∫₀¹ r √(4 − r²) dr dφ = (4/3)(8 − 3√3)π. □

8.I.7. Find the volume of the solid in R³ which is given by the intersection of the sphere x² + y² + z² = 2 and the paraboloid z = x² + y².

Solution. Once again, we will work in cylindric coordinates:

V = ∫₀^{2π} ∫₀¹ ∫_{r²}^{√(2−r²)} r dz dr dφ = (4√2/3)π − 7π/6.

CHAPTER 8. CALCULUS WITH MORE VARIABLES

For any Riemann-measurable set M and a function f defined at all points of M, consider the function f̃ = χ_M · f as a function defined on the whole Rⁿ. This function f̃ apparently has a compact support. The Riemann integral of the function f over the set M is defined by

∫_M f dx₁ . . . dx_n = ∫_{Rⁿ} f̃ dx₁ . . . dx_n,

supposing the integral on the right-hand side exists.

8.2.3. Properties of the Riemann integral. This definition of the integral does not provide reasonable instructions for computing the values of Riemann integrals. However, it does lead to the following basic properties of the Riemann integral (cf. Theorem 6.2.8):

Theorem. The set of Riemann-integrable real-valued functions over a Riemann-measurable domain M ⊂ Rⁿ is a vector space over the real scalars, and the Riemann integral is a linear form there.
If the integration domain M is given as a disjoint union of finitely many Riemann-measurable domains M_i, then f is integrable over M if and only if it is integrable over all the M_i, and the integral of f over M is given by the sum of the integrals over the individual subdomains M_i.

Proof. All the properties follow directly from the definition of the Riemann integral and the properties of convergent sequences of real numbers, just as in the case of univariate functions. Work out the details by yourselves. □

For practical use, we rewrite the theorem into the usual equalities:

Finite additivity and linearity

Any linear combination of Riemann-integrable functions f_i : I → R, i = 1, . . . , k (over scalars in R) is again a Riemann-integrable function, and its integral can be computed as follows:

∫_I ( a₁f₁(x₁, . . . , x_n) + . . . + a_k f_k(x₁, . . . , x_n) ) dx₁ . . . dx_n = a₁ ∫_I f₁(x₁, . . . , x_n) dx₁ . . . dx_n + · · · + a_k ∫_I f_k(x₁, . . . , x_n) dx₁ . . . dx_n.

Let M₁ and M₂ be disjoint Riemann-measurable sets, and consider a function f : M₁ ∪ M₂ → R. Then f is Riemann-integrable over both sets M_i if and only if it is integrable over their union, and

∫_{M₁∪M₂} f(x₁, . . . , x_n) dx₁ . . . dx_n = ∫_{M₁} f(x₁, . . . , x_n) dx₁ . . . dx_n + ∫_{M₂} f(x₁, . . . , x_n) dx₁ . . . dx_n.

□

8.I.8. Find the volume of the solid in R³ which is bounded by the elliptic cylinder 4x² + y² = 1 and the planes z = 2y and z = 0, lying above the plane z = 0.

Solution. Thanks to symmetry, it is advantageous to work in the coordinates x = (1/2) r cos(φ), y = r sin(φ), z = z, with Jacobian J = (1/2) r. The equation of the elliptic cylinder in these coordinates is r² = 1. The solid lies above those points where z = 2y = 2r sin(φ) is positive, i.e., for φ ∈ [0, π]. Thus, the wanted volume is

V = ∫₀^π ∫₀¹ 2r sin(φ) · (1/2) r dr dφ = ∫₀^π ∫₀¹ r² sin(φ) dr dφ = ∫₀^π (1/3) sin(φ) dφ = 2/3. □

8.I.9. Find the volume of the solid in R³ which is bounded by the paraboloid 2x² + y² = z and the plane z = 2.

Solution. Similarly to the above problem, we choose “special” coordinates which respect the symmetry of the solid: x = (1/√2) r cos(φ), y = r sin(φ), z = z, with Jacobian

CHAPTER 8. CALCULUS WITH MORE VARIABLES

8.2.4. Multiple integrals. Riemann-measurable domains especially involve the cases when the boundary of the integration domain M can be expressed step by step via continuous dependencies between the coordinates in the following way. The first coordinate x runs within an interval [a, b]. The interval range of the next coordinate can be defined by two functions, i.e. y ∈ [φ(x), ψ(x)]; then the range of the next coordinate is expressed as z ∈ [η(x, y), ζ(x, y)], and so on for all of the other coordinates. For example, this is easy in the case of the ball from the introductory example: for x ∈ [−1, 1], define the range for y as y ∈ [−√(1 − x²), √(1 − x²)]. The volume of the ball can then be computed by integration of the mentioned function f, or we can integrate the indicator function of the ball, i.e. the function which equals one on the subset M ⊂ R³ which is further defined by z ∈ [−√(1 − x² − y²), √(1 − x² − y²)].

The following fundamental theorem transforms the computation of a Riemann integral into a sequence of computations of univariate integrals (while the other variables are considered to be parameters, which can appear in the integration bounds as well). Notice that we could have defined the multiple integral directly via the one-dimensional integration, but then we would face the trouble of ensuring the independence of the result of our way of describing M. The theorem reveals that the two approaches coincide, and there are no unclear points left.
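As a numerical preview of the theorem below, the volume from 8.I.5 can be recomputed as an iterated integral with functional bounds. A minimal sketch using SciPy's dblquad (our own check; the integrand |x + y + 2| is the total height between the two bounding planes):

    from math import sqrt, pi
    from scipy.integrate import dblquad

    # total volume between z = 0 and z = x + y + 2 over the disc x^2 + y^2 <= 4
    V, err = dblquad(lambda y, x: abs(x + y + 2),
                     -2, 2,
                     lambda x: -sqrt(max(4 - x*x, 0.0)),
                     lambda x: sqrt(max(4 - x*x, 0.0)))
    print(V, 4 * pi + 40 / 3)  # both approx 25.8997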
Multiple integrals

Theorem. Let M ⊂ Rⁿ be a bounded set, expressed with the help of continuous functions ψ_i, η_i as

M = {(x₁, . . . , x_n); x₁ ∈ [a, b], x₂ ∈ [ψ₂(x₁), η₂(x₁)], . . . , x_n ∈ [ψ_n(x₁, . . . , x_{n−1}), η_n(x₁, . . . , x_{n−1})]},

and let f be a function which is continuous on M. Then the Riemann integral of the function f over the set M exists and is given by the formula

∫_M f(x₁, x₂, . . . , x_n) dx₁ . . . dx_n = ∫_a^b ( ∫_{ψ₂(x₁)}^{η₂(x₁)} . . . ( ∫_{ψ_n(x₁,...,x_{n−1})}^{η_n(x₁,...,x_{n−1})} f(x₁, x₂, . . . , x_n) dx_n ) . . . dx₂ ) dx₁,

where the individual integrals are one-variable Riemann integrals.

Proof. Consider first the proof for the case of two variables. It can then be seen that no further ideas are needed in the general case. Consider an interval I = [a, b] × [c, d] containing the set M = {(x, y); x ∈ [a, b], y ∈ [ψ(x), η(x)]}, and divisions Ξ of the interval I with representatives ξ_{ij}.

J = (1/√2) r. The equation of the paraboloid in these coordinates is z = r², so the volume of the solid is equal to

V = 4 ∫₀^{π/2} ∫₀^{√2} ∫_{r²}^{2} (1/√2) r dz dr dφ = 2√2 ∫₀^{π/2} ∫₀^{√2} (2r − r³) dr dφ = 2√2 ∫₀^{π/2} dφ = √2 π. □

8.I.10. Calculate the volume of the ellipsoid x² + 2y² + 3z² = 1.

Solution. We will consider the coordinates x = r cos(φ) sin(θ), y = (1/√2) r sin(φ) sin(θ), z = (1/√3) r cos(θ). The corresponding Jacobian is (1/√6) r² sin(θ), so the volume is

V = ∫₀^{2π} ∫₀^π ∫₀¹ (1/√6) r² sin(θ) dr dθ dφ = 4/(3√6) π. □

8.I.11. Remark. Note that if the transformation of the coordinates is linear (or affine), then the space is deformed “uniformly”. This means that the volume of an arbitrary solid is changed proportionally, by the factor expressing the change of the volume of an infinitesimal volume element, which is the Jacobian. Therefore, if we consider the volume of the ball with a given radius to be known (in this case, r = 1), we can infer directly that the volume of the ellipsoid is V = (1/√6) · (4/3)π = 4/(3√6) π.

8.I.12. Find the volume of the solid which is bounded by the paraboloid 2x² + 5y² = z and the plane z = 1.

Solution. We choose the coordinates x = (1/√2) r cos(φ), y = (1/√5) r sin(φ), z = z. The Jacobian determinant is r/√10, so the volume is

V = ∫₀^{2π} ∫₀¹ ∫_{r²}^{1} (r/√10) dz dr dφ = π/(2√10). □

CHAPTER 8. CALCULUS WITH MORE VARIABLES

The corresponding integral sum is

S_{Ξ,ξ} = Σ_{i,j} f(ξ_{ij}) ∆x_{ij} = Σ_i ( Σ_j f(ξ_{ij}) ∆y_j ) ∆x_i,

where ∆x_{ij} is written for the product of the sizes ∆x_i and ∆y_j of the intervals which correspond to the choice of the representative ξ_{ij}. Assume first that we work only with those choices of representatives ξ_{ij} which, for each fixed i, all share the same first coordinate x_i. If the partition of the interval [a, b] is fixed, and only the partition of [c, d] is refined, then the values of the inner sum of the expression approach the value of the integral

S_i = ∫_{ψ(x_i)}^{η(x_i)} f(x_i, y) dy,

which exists since the function f(x_i, y) is continuous. In this way, a function is obtained which is continuous in the free parameter x_i, see 8.2.1. Therefore, further refinement of the partition of the interval [a, b] leads, in the limit, to the desired formula

Σ_i S_i ∆x_i → S = ∫_a^b ( ∫_{ψ(x)}^{η(x)} f(x, y) dy ) dx.

It remains to deal with the case of general choices of representatives in general divisions Ξ. Since M is clearly compact (bounded and closed by definition), and f is a continuous function on M, it is uniformly continuous there.
Therefore, if a small real number ε > 0 is selected beforehand, there is always a bound δ > 0 for the norm of the partitions, so that the values of the function f for the general choices ξ_{ij} differ by no more than ε from the choices used above. Thus, the limit process results in the same value for general Riemann sums S_{Ξ,ξ} as seen above.

Now, the general case can be proved easily by induction. In the case of n = 1, the result is trivial. The presented reasoning can easily be transformed for a general induction step, writing (x_2, …, x_n) instead of y, having x_1 instead of x, and perceiving the particular little cubes of the divisions as (n − 1)-dimensional cubes Cartesian-multiplied by the last interval. In the last-but-one step of the proof, the induction hypothesis is used, rather than the simple one-dimensional integration. The final argument about uniform continuity remains the same. It is advised to write this proof in detail as an exercise. □

8.2.5. Fubini theorem. The latter theorem has a particularly simple shape in the case of a multidimensional interval M. Then all the functions in the bounds for integration are just the constant bounds from the definition of M. But this means that the integration process can be carried out coordinate by coordinate in

8.I.13. Find the volume of the solid which lies in the first octant and is bounded by the surfaces y² + z² = 9 and y² = 3x.

Solution. In cylindric coordinates,

V = ∫_0^{π/2} ∫_0^3 ∫_0^{(r² cos² φ)/3} r dx dr dφ = (27/16)π. □

8.I.14. Find the volume of the solid in R³ which is bounded by the cone part 2x² + y² = (z − 2)², z ≥ 2, and the paraboloid 2x² + y² = 8 − z.

Solution. First of all, we find the intersection of the given surfaces: (z − 2)² = −z + 8, z ≥ 2; therefore, z = 4, and the equation of the intersection is 2x² + y² = 4. The substitution x = (1/√2) r cos φ, y = r sin φ, z = z transforms the given surfaces to the form r² = (z − 2)², z ≥ 2, and r² = 8 − z, i.e., z = r + 2 for the former surface and z = 8 − r² for the latter. Altogether, the projection of the given solid onto the coordinate φ is equal to the interval [0, 2π]. Having fixed a φ_0 ∈ [0, 2π], the projection of the intersection of the solid and the plane φ = φ_0 onto the coordinate r equals (independently of φ_0) the interval [0, 2]. Having fixed both r_0 and φ_0, the projection of the intersection of the solid and the line r = r_0, φ = φ_0, onto the coordinate z is equal to the interval [r_0 + 2, 8 − r_0²]. The Jacobian of the considered transformation is J = (1/√2) r, so we can write

V = ∫_0^{2π} ∫_0^2 ∫_{r+2}^{8−r²} (r/√2) dz dr dφ = (16√2/3)π. □

any order. We have exploited this behavior already in Chapter 6, cf. 6.3.12. In this way, the following important corollary is proved:⁵

Fubini theorem

Theorem. Every continuous function f(x_1, …, x_n) on a multidimensional interval M = [a_1, b_1] × [a_2, b_2] × ⋯ × [a_n, b_n] is Riemann integrable on M, and its integral

∫_M f(x_1, …, x_n) dx_1⋯dx_n = ∫_{a_1}^{b_1} ∫_{a_2}^{b_2} ⋯ ∫_{a_n}^{b_n} f(x_1, …, x_n) dx_1⋯dx_n

is independent of the order in which the multiple integration is performed.

The possibility of changing the order of integration in multiple integrals is extremely useful. We have already taken advantage of this result, namely when studying the relation of Fourier transforms and convolutions, see paragraph 7.1.9.

8.2.6. Unbounded regions and functions. There is no simple concept of an improper integral for unbounded multivariate functions.
The following example of multiple integration of an unbounded function is illustrative in this direction:

∫_0^1 (∫_0^1 (x − y)/(x + y)³ dy) dx = 1/2, while ∫_0^1 (∫_0^1 (x − y)/(x + y)³ dx) dy = −1/2.

The reason can be understood by looking at the properties of non-absolutely converging series. There, rearranging the summands can lead to an arbitrary result.

The situation is better if the Riemann integral of a bounded non-negative function f(x) ≥ 0 with non-compact support over the whole R^n is calculated. Of course, some extra information is needed on the decay of the function f for large arguments. For example, if f is Riemann integrable over each n-dimensional interval I and there is a universal bound ∫_I |f(x)| dx ≤ C with a constant C independent of the choice of the n-dimensional interval I ⊂ R^n, then we may define

∫_{R^n} f(x) dx = lim_{r→∞} ∫_{I_r} f(x) dx, where I_r = {(x_1, …, x_n); |x_j| < r, j = 1, …, n}.

The resulting limit, if it exists, is bounded by the same constant C.

⁵ Guido Fubini (1879–1943) was an important Italian mathematician, active also in applied areas of mathematics. The simple derivation of the Fubini theorem given here builds upon the simple properties of Riemann integration and the continuity of the integrated function. Fubini, in fact, proved this result in a much more general context of integration, while the theorem just introduced was used by mathematicians like Cauchy at least a century before Fubini.

8.I.15. Find the volume of the solid which lies inside the cylinder y² + z² = 4 and the half-space x ≥ 0 and is bounded by the surface y² + z² + 2x = 16.

Solution. In cylindric coordinates,

V = ∫_0^{2π} ∫_0^2 ∫_0^{8 − r²/2} r dx dr dφ = 28π. □

8.I.16. The centroid of a solid. The coordinates (x_t, y_t, z_t) of the centroid of a (homogeneous) solid T with volume V in R³ are given by the following integrals:

x_t = (1/V) ∫∫∫_T x dx dy dz, y_t = (1/V) ∫∫∫_T y dx dy dz, z_t = (1/V) ∫∫∫_T z dx dy dz.

The centroid of a figure in R² or in other dimensions can be computed analogously.

8.I.17. Find the centroid of the part of the ellipse 3x² + 2y² = 1 which lies in the first quadrant of the plane R².

Solution. First, let us calculate the area of the given figure. The transformation x = (1/√3) x′, y = (1/√2) y′ with Jacobian 1/√6 leads to

S = ∫_0^{1/√3} ∫_0^{√((1−3x²)/2)} dy dx = (1/√6) ∫_0^1 ∫_0^{√(1−x′²)} dy′ dx′ = π/(4√6).

The other integrals we need can be computed directly in the Cartesian coordinates x and y:

T_x = ∫_0^{1/√3} ∫_0^{√((1−3x²)/2)} x dy dx = ∫_0^{1/√3} x √((1 − 3x²)/2) dx = (1/2) ∫_0^{1/3} √((1 − 3t)/2) dt = √2/18,

T_y = ∫_0^{1/√3} ∫_0^{√((1−3x²)/2)} y dy dx = (1/2) ∫_0^{1/√3} (1 − 3x²)/2 dx = (1/4) ∫_0^{1/√3} (1 − 3x²) dx = √3/18.

Therefore, the coordinates of the centroid are [4√3/(9π), 2√2/(3π)]. □

8.I.18. Find the volume and the centroid of a homogeneous cone of height h and circular base with radius r.

Solution. Positioning the cone so that the vertex is at the origin, pointing downwards, we have in cylindric coordinates that

V = 4 ∫_0^{π/2} ∫_0^r ∫_{(h/r)ρ}^h ρ dz dρ dφ = (1/3)πhr².

In this case, the Fubini theorem is true in the form

∫_{R^n} f(x) dx = ∫_{−∞}^∞ ⋯ (∫_{−∞}^∞ f(x) dx_1) ⋯ dx_n.

8.2.7. Further remarks on integration. The Riemann integral of multivariate functions behaves even worse than in the case of functions of one variable in the sixth chapter. Therefore, more sophisticated approaches to integration have been developed. They are mainly based on the concept of the measure of a set. We consider this problem briefly now.
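As an aside, the failure of the interchange of the integration order in the example from 8.2.6 above is easy to reproduce by direct computation. The following small sympy sketch (an illustrative aside of ours, with sympy assumed available) returns the two different values:

```python
# Sketch: the two iterated integrals of (x - y)/(x + y)^3 over the unit square disagree.
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = (x - y) / (x + y)**3

dy_first = sp.integrate(sp.integrate(f, (y, 0, 1)), (x, 0, 1))  # inner dy, outer dx
dx_first = sp.integrate(sp.integrate(f, (x, 0, 1)), (y, 0, 1))  # inner dx, outer dy
print(dy_first, dx_first)  # prints 1/2 and -1/2
```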
As we shall see in 8.2.10, the Riemann integration of the indicator functions χ_M of sets M ⊂ R^n leads to a finitely additive measure. In probability theory in chapter 10, even elementary problems require a concept of measure which is additive over countable systems of disjoint sets. Having such a measure, measurable functions f can be defined by the condition that their preimages of bounded intervals, f^{−1}([a, b]), are measurable sets, and the integral is built by approximation via such "horizontal strips". This is the starting point of Lebesgue integration.

We omit further details here, but note the Riesz representation theorem⁶, saying that for each linear functional I (i.e. a linear mapping valued in R) on continuous functions with compact support on a metric space X, there is a unique measure (with certain regularity properties) such that the integral associated to this measure extends I. In the case of the Riemann integral I on functions on R^n, this provides the Lebesgue measure and the Lebesgue integral.

Another point of view is the completion procedure for metric spaces. Consider the vector space X = S_c(R^n) of all continuous functions with compact support. It can be equipped with the L_p norms, similar to the univariate case from the seventh chapter, i.e.

‖f‖_p = (∫_{R^n} |f(x_1, …, x_n)|^p dx_1⋯dx_n)^{1/p}

for any 1 ≤ p < ∞. Since the Riemann integral is defined again in terms of partitions and the representative values, the properties of the norm can be verified in the same way as for univariate functions, using Hölder's and Minkowski's inequalities. This yields the metrics ‖·‖_p on X. The general theory provides its completion X̃, unique up to isometry, and it can be shown that it is again a space of functions. The Lebesgue integral mentioned above defines exactly these norms. Hence the spaces of functions with Lebesgue-integrable powers |f|^p are obtained.

⁶ Frigyes Riesz (1880–1956) was a famous Hungarian mathematician, active in particular in functional analysis. He introduced this theorem in the special case of X being an interval in R in 1909.

Apparently, the centroid lies on the z-axis. For the z-coordinate, we get

z_t = (1/V) ∫_cone z dV = (1/V) · 4 ∫_0^{π/2} ∫_0^r ∫_{(h/r)ρ}^h zρ dz dρ dφ = (3/4)h.

Thus, the centroid lies at a quarter of the cone's height from the center of its base. □

8.I.19. Find the centroid of the solid which is bounded by the paraboloid 2x² + 2y² = z, the cylinder (x + 1)² + y² = 1, and the plane z = 0.

Solution. First, we will compute the volume of the given solid. Again, we use the cylindric coordinates (x = r cos φ, y = r sin φ, z = z), in which the equation of the paraboloid is z = 2r² and the equation of the cylinder reads r = −2 cos φ. Moreover, taking into account the fact that the plane x = 0 is tangent to the given cylinder, we can easily determine the bounds of the integral that corresponds to the volume of the examined solid:

V = ∫_{π/2}^{3π/2} ∫_0^{−2 cos φ} ∫_0^{2r²} r dz dr dφ = ∫_{π/2}^{3π/2} ∫_0^{−2 cos φ} 2r³ dr dφ = ∫_{π/2}^{3π/2} 8 cos⁴ φ dφ = 3π,

where the last integral can be computed using the method of recurrence from 6.2.6. Now, let us find the centroid. Since the solid is symmetric with respect to the plane y = 0, the y-coordinate of the centroid must be zero.
Then, the remaining coordinates x_T and z_T of the centroid can be computed by the following integrals:

x_T = (1/V) ∫∫∫_B x dx dy dz = (1/V) ∫_{π/2}^{3π/2} ∫_0^{−2 cos φ} ∫_0^{2r²} r² cos φ dz dr dφ = (1/V) ∫_{π/2}^{3π/2} ∫_0^{−2 cos φ} 2r⁴ cos φ dr dφ = (1/V) ∫_{π/2}^{3π/2} −(64/5) cos⁶ φ dφ = −4/3,

where the last integral was computed by 6.2.6 again. Analogously for the z-coordinate of the centroid:

z_T = (1/V) ∫_{π/2}^{3π/2} ∫_0^{−2 cos φ} ∫_0^{2r²} zr dz dr dφ = 20/9.

The coordinates of the centroid are thus [−4/3, 0, 20/9]. □

8.2.8. Change of coordinates. When calculating integrals of univariate functions, the "substitution method" is used as one of the powerful tools, cf. 6.2.5. The method works similarly in the case of functions of more variables, once its geometric meaning is understood.

Recall and reinterpret the univariate case. There, the integrated expression f(x) dx infinitesimally describes the two-dimensional area of the rectangle whose sides are the (linearized) increment ∆x of the variable x, i.e. the one-dimensional rectangle, and the value f(x). If the variable x is transformed by the relation x = u(t), then the linearized increment can be expressed with the help of the differential as dx = (du/dt) dt, and so the corresponding contribution to the integral is given by f(u(t)) (du/dt) dt. Here, one either supposes that the sign of the derivative u′(t) is positive, or one interchanges the bounds of the integral, so that the sign does not affect the result.

Intuitively, the procedure for n variables should be similar. It is only necessary to recall the formula for the (change of the) volume of parallelepipeds. The Riemann integrals are approximated by Riemann sums, which are based on the n-dimensional volume (area) of small multidimensional intervals ∆x_{i_1…i_n} in the variables, multiplied by the values of the function at the representative points ξ_{i_1…i_n}. If the coordinates are transformed by means of a mapping x = G(y), not only the function values f(G(ξ̃_{i_1…i_n})) are obtained at the representative points ξ̃_{i_1…i_n} = G^{−1}(ξ_{i_1…i_n}) in the new coordinate expression, but also the change of the volume of the corresponding small multidimensional intervals needs care. Once again, this is the case of a linear approximation of a change, which is well known: the best linear approximation of G(y) is its derivative D¹G(y), which is given by the Jacobi matrix of G, see 8.1.19. The change of the volume is then given (in absolute value) by the determinant of this matrix (see the discussion of this topic in chapter 4 devoted to analytic geometry and linear algebra, especially 4.1.19).

Summarizing, the formulation of the next theorem should not be surprising, and its proof consists in a formalization of the latter ideas. However, this needs some effort, and so the proof is split into several steps.

8.I.20. Find the centroid of the homogeneous solid in R³ which lies between the planes z = 0 and z = 2, bounded by the cones x² + y² = z² and x² + y² = 2z².

Solution. The problem can be solved in the same way as the previous ones. It would be advantageous to work in cylindric coordinates. However, we can notice that the solid in question is an "annular cone": it is formed by cutting a cone K_1 with base radius 2 out of a cone K_2 with base radius 2√2, of common height 2.
The centroid of the examined solid can be determined by the "rule of lever": the centroid of a system of two solids is the weighted arithmetic mean of the particular solids' centroids, weighted by the masses of the solids. We found out in exercise 8.I.18 that the centroid of a homogeneous cone is situated at a quarter of its height, measured from the base. Therefore, the centroids of both cones lie at the same point, and this point must thus be the centroid of the examined solid as well. Hence, the coordinates of the wanted centroid are [0, 0, 3/2]. □

8.I.21. Find the volume of the solid in R³ which is bounded by the cone part x² + y² = (z − 2)² and the paraboloid x² + y² = 4 − z.

Solution. We build the corresponding integral in cylindric coordinates, which evaluates as follows:

V = ∫_0^{2π} ∫_0^1 ∫_{r+2}^{4−r²} r dz dr dφ = (5/6)π. □

8.I.22. Find the volume of the solid in R³ which lies under the cone x² + y² = (z − 2)², z ≤ 2, and over the paraboloid x² + y² = z.

Solution.

V = ∫_0^{2π} ∫_0^1 ∫_{r²}^{2−r} r dz dr dφ = (5/6)π.

Transformation of coordinates

Theorem. Let G(t) : R^n → R^n be a continuously differentiable and invertible mapping, and write t = (t_1, …, t_n), x = (x_1, …, x_n) = G(t_1, …, t_n). Further, let M = G(N) be a Riemann-measurable set, and f : M → R a continuous function. Then, N is also Riemann-measurable and

∫_M f(x) dx_1⋯dx_n = ∫_N f(G(t)) |det(D¹G(t))| dt_1⋯dt_n.

8.2.9. The invariance of the integral. The first thing to be verified is the coincidence of two definitions of the volume of parallelepipeds (taken for granted in the above intuitive explanation of the latter theorem). Volumes and similar concepts were dealt with in chapter 4, and a crucial property was the invariance of the concepts with respect to the choice of Euclidean frames of R^n, cf. 4.1.19 on page 334, which followed directly from the expression of the volumes in terms of determinants. It is needed to show that the same result holds in terms of the Riemann integration as defined above. It turns out that it is easier to deal with invariance with respect to general invertible linear mappings Ψ : R^n → R^n.

Proposition. Let Ψ : R^n → R^n be an invertible linear mapping and I ⊂ R^n a multidimensional interval. Consider a function f such that f ∘ Ψ is integrable on I. Then M = Ψ(I) is Riemann-measurable, f is Riemann-integrable on M, and

∫_M f(x_1, …, x_n) dx_1⋯dx_n = |det Ψ| ∫_I (f ∘ Ψ)(y_1, …, y_n) dy_1⋯dy_n.

Proof. Each linear mapping is a composition of elementary transformations of three types (see the discussion in chapter 2, in particular paragraphs 2.1.7 and 2.1.9). The first one is the multiplication of one of the coordinates by a constant: Ψ(y_1, …, y_n) = (y_1, …, αy_i, …, y_n). In this case, |det Ψ| = |α|. The second one consists of an exchange of two coordinates, i.e. for given 1 ≤ i < j ≤ n, Ψ(y_1, …, y_n) = (y_1, …, y_j, …, y_i, …, y_n). The determinant of Ψ is −1 in this case. The third type of transformations is of the form Ψ(y_1, …, y_n) = (y_1, …, y_i + y_j, …, y_j, …, y_n), with determinant one. Without loss of generality, i = 1 can be chosen in the first case, and i = 1, j = 2 in the second case. Since the determinant of a composition of mappings (i.e. the determinant of the product of the matrices) is the product of the individual determinants, it is enough to prove the proposition for all three special types of Ψ.
Express the right-hand integrals for these three types of Ψ by means of the multiple integrals and the Fubini theorem. Write

Note that the considered solid is symmetric to the solid from the previous exercise 8.I.21 (the center of the symmetry is the point [0, 0, 2]). Therefore, it must have the same volume. □

8.I.23. Find the centroid of the figure bounded by the parabola y = 4 − x² and the line y = 0. ⃝

8.I.24. Find the centroid of the circular sector corresponding to the angle of 60° that was cut out of a disc with radius 1. ⃝

8.I.25. Find the centroid of the semidisc x² + y² ≤ 1, y ≥ 0. ⃝

8.I.26. Find the centroid of the circular sector corresponding to the angle of 120° that was cut out of a disc with radius 1. ⃝

8.I.27. Find the volume of the solid in R³ which is given by the inequalities z ≥ 0, z − x ≤ 0, and (x − 1)² + y² ≤ 1. ⃝

8.I.28. Find the volume of the solid in R³ which is given by the inequalities z ≥ 0, z − y ≤ 0. ⃝

8.I.29. Find the volume of the solid bounded by the surface 3x² + 2y² + 3z² + 2xy − 2yz − 4xz = 1. ⃝

8.I.30. Find the volume of the part of R³ lying inside the ellipsoid 2x² + y² + z² = 6 and in the half-space x ≥ 1. ⃝

8.I.31. The area of the graph of a real-valued function f(x, y) in variables x and y. The area of the graph of a function of two variables over a region S in the plane xy is given by the integral

P = ∫_S √(1 + f_x² + f_y²) dx dy.

Considering the cone x² + y² = z², find the area of the part of its lateral surface which lies above the plane z = 0 and inside the cylinder x² + y² = y.

Solution. The wanted area can be calculated as the area of the graph of the function z = √(x² + y²) over the disc K: x² + (y − 1/2)² ≤ 1/4. We can easily see that

f_x = x/√(x² + y²), f_y = y/√(x² + y²),

so the area is expressed by the integral

∫∫_K √(1 + f_x² + f_y²) dx dy = ∫∫_K √2 dx dy = √2 ∫_0^π ∫_0^{sin φ} r dr dφ = (√2/2) ∫_0^π sin² φ dφ = √2 π/4.

I = [a_1, b_1] × ⋯ × [a_n, b_n] and x = Ψ(y) for the transformation. In the first case (notice that we can deal with the first variable and α > 0 without loss of generality),

|det Ψ| ∫_I f(αy_1, y_2, …, y_n) dy_1⋯dy_n = α ∫_{a_n}^{b_n} ⋯ (∫_{a_1}^{b_1} f(αy_1, y_2, …, y_n) dy_1) ⋯ dy_n = α · α^{−1} ∫_{a_n}^{b_n} ⋯ (∫_{αa_1}^{αb_1} f(x_1, x_2, …, x_n) dx_1) ⋯ dx_n = ∫_{Ψ(I)} f(x_1, x_2, …, x_n) dx_1⋯dx_n.

The second case is even easier, since the order of integration does not matter due to the Fubini theorem. The third case is similar to the first one:

|det Ψ| ∫_I f(y_1 + y_2, y_2, …, y_n) dy_1⋯dy_n = ∫_{a_n}^{b_n} ⋯ (∫_{a_1}^{b_1} f(y_1 + y_2, y_2, …, y_n) dy_1) ⋯ dy_n = ∫_{a_n}^{b_n} ⋯ (∫_{a_1+x_2}^{b_1+x_2} f(x_1, x_2, …, x_n) dx_1) ⋯ dx_n = ∫_{Ψ(I)} f(x_1, x_2, …, x_n) dx_1⋯dx_n.

The reader should check the details that the last multiple integral describes the image Ψ(I). □

As a direct corollary of the proposition, the Riemann integral is invariant with respect to the Euclidean affine mappings. That is, the integral cannot depend on the choice of the orthogonal frame in the Euclidean R^n.

8.2.10. Riemann-measurable sets. It is necessary to understand how to recognize Riemann-measurable domains M. When defining the Riemann integral, a strict analogy of the lower and upper Riemann integrals for univariate functions can be considered. This means taking infima or suprema of the integrated function over the corresponding multidimensional intervals instead of the function values at the representatives in the Riemann sums.
For bounded functions, there are well-defined values of the upper and lower integrals found in this way. If this is done for the indicator function χ_M of a fixed set M, the inner and outer Riemann measure of the set M is obtained. Evidently, the inner measure is the supremum of the areas given by the (finite) sums of the volumes of all multidimensional intervals from the partitions which lie inside M; on the other hand, the outer measure is the infimum of the (finite) sums of the volumes of intervals covering M.

It follows directly from the definition that a set M is Riemann-measurable if and only if its inner and outer measures are equal. The sets whose outer measure is zero are, of course, Riemann-measurable. They are called measure zero sets or null sets. The finite additivity of the Riemann integral makes

□

8.I.32. Find the area of the paraboloid z = x² + y² over the disc x² + y² ≤ 4. ⃝

8.I.33. Find the area of the part of the plane x + 2y + z = 10 that lies over the figure given by (x − 1)² + y² ≤ 1 and y ≥ x. ⃝

In the following exercise, we will also apply our knowledge of the theory of Fourier transforms from the previous chapter.

8.I.34. Fourier transform and diffraction. Light intensity is a physical quantity which expresses the transmission of energy by waves. The intensity of a general light wave is defined as the time-averaged magnitude of the Poynting vector, which is the vector product of the mutually orthogonal vectors of the electric and magnetic fields. A monochromatic plane wave spreading in the direction of the y-axis satisfies

I = cε_0 (1/τ) ∫_0^τ E_y² dt,

where c is the speed of light and ε_0 is the vacuum permittivity. The monochromatic wave is described by the harmonic function E_y = ψ(x, t) = A cos(ωt − kx). The number A is the maximal amplitude of the wave, ω is the angular frequency, and, for any fixed t, the so-called wavelength λ is the primitive period. The number k = 2π/λ is the wave number. We have

I = cε_0 (1/τ) ∫_0^τ E_y² dt = cε_0 (1/τ) ∫_0^τ A² cos²(ωt − kx) dt = cε_0 A² (1/τ) ∫_0^τ (1 + cos(2(ωt − kx)))/2 dt = (1/2) cε_0 A² (1/τ) [t + sin(2(ωt − kx))/(2ω)]_0^τ = (1/2) cε_0 A² (1/τ) (τ + (sin(2(ωτ − kx)) − sin(2(−kx)))/(2ω)) = (1/2) cε_0 A² (1 + (sin(2(ωτ − kx)) − sin(2(−kx)))/(2ωτ)) ≈ (1/2) cε_0 A².

The second term in the parentheses can be neglected, since it is always less than 2/(2ωτ) = T/(2πτ) < 10⁻⁶ for real detectors of light, so it is much smaller than 1. The light intensity is directly proportional to the squared amplitude.

A diffraction is such a deviation from the straight-line propagation of light which cannot be explained as the result of a refraction or reflection (or the change of the ray's direction in a medium with a continuously varying refractive index). The diffraction can be observed when a light beam propagates through a bounded space. The diffraction phenomena are strongest and easiest to see if the light goes through openings or obstacles whose size is roughly the wavelength of the light. In the case of the Fraunhofer diffraction, with which we will deal in the following example, a monochromatic plane wave goes through a very thin rectangular opening and projects on

the measure finitely additive. Hence, a disjoint union of finitely many measurable sets is again a measurable set, and its measure is given by the sum of the measures of the individual sets in the union.

Consider the measurability of any given set M ⊂ I ⊂ R^n inside a sufficiently large multidimensional interval I.
Consider the boundary ∂M, i.e. the set of all boundary points of M. For any partition Ξ of I from the definition of the Riemann integral of χ_M, each of the intervals I_{i_1…i_n} with a non-trivial intersection with ∂M contributes to the upper integral, but might not contribute to the lower integral. On the contrary, for every point in the interior M° ⊂ M, its interval I_{i_1…i_n} contributes to both in the same way as soon as the norm of the partition is small enough. This observation leads to the first part of the following claim:

Proposition. A bounded set M ⊂ R^n is Riemann-measurable if and only if its boundary is of Riemann measure zero. If M is a Riemann-measurable set and G : M ⊂ R^n → R^n is a continuously differentiable and invertible mapping, then G(M) is again Riemann-measurable.

Proof. The first claim is already verified. Since both G and G^{−1} are continuous, G maps internal points of M to internal points of G(M). To finish the proof, it must be verified that G maps the boundary ∂M, which is a set of measure zero, again to a set of measure zero.

Since every Riemann-measurable set M is bounded, its closure M̄ must be compact. It follows that G and all partial derivatives of its components are uniformly continuous on M̄, and in particular on the boundary ∂M. Next, consider a partition Ξ of an interval I containing ∂M and a fixed tiny interval J in the partition including a point t ∈ ∂M. Write

R = G(t) + D¹G(t)(J − t),

which is to be understood as follows: J is first shifted to the origin by a translation, then the derivative of G is applied, obtaining a parallelepiped. This is shifted back to be around G(t). By the uniform continuity of G and D¹G, for each ε > 0 there is a bound δ for the norm of a partition for which

G(J) ⊂ G(t) + (1 + ε)D¹G(t)(J − t)

can be guaranteed. The entire image of J lies inside a slightly enlarged linear image of J by the derivative. Now, the outer measure α of the image G(J) satisfies

α ≤ (1 + ε)^n vol_n R = (1 + ε)^n |det D¹G(t)| vol_n J.

If µ is the upper Riemann sum for the measure of ∂M corresponding to the chosen partition, the outer measure of G(∂M) must be bounded by (1 + ε)^n max_{t∈∂M} |det D¹G(t)| µ. Finally, we exploit the assumption that ∂M has measure zero, and thus, for the same ε, we may further decrease the bound on the norm of the partition so that µ ≤ ε, too. But then the outer measure is bounded by a constant multiple of (1 + ε)^n ε, with the universal constant max_{t∈∂M} |det D¹G(t)|. So the outer measure is zero, as required. □

a distant surface. For instance, we can highlight a spot on the wall with a laser pointer. The image we get is the Fourier transform of the function describing the permeability of the shade (the opening).

Let us choose the plane of the diffraction shade as the coordinate plane z = 0. Let a plane wave A exp(ikz) (independent of the point (x, y) of landing on the shade) hit this plane perpendicularly.
Let s(x, y) denote the function of the permeability of the shade; then the resulting wave falling onto the projection surface at a point (ξ, η) can be described as the integral sum of the waves (the Huygens–Fresnel principle) which have gone through the shade and propagate through the medium from all the points (x, y, 0) (as a spherical wave) into the point (ξ, η, z):

ψ(ξ, η) = A ∫∫_{R²} s(x, y) e^{−ik(ξx+ηy)} dx dy
= A ∫_{−p/2}^{p/2} ∫_{−q/2}^{q/2} e^{−ik(ξx+ηy)} dy dx
= A ∫_{−p/2}^{p/2} e^{−ikξx} dx ∫_{−q/2}^{q/2} e^{−ikηy} dy
= A [e^{−ikξx}/(−ikξ)]_{−p/2}^{p/2} [e^{−ikηy}/(−ikη)]_{−q/2}^{q/2}
= A (2 sin(kξp/2)/(kξ)) (2 sin(kηq/2)/(kη))
= Apq (sin(kξp/2)/(kξp/2)) (sin(kηq/2)/(kηq/2)).

(Illustrations: the graph of the function f(x) = sin x / x, and the graph of the function ψ(ξ, η) = (sin ξ / ξ)(sin η / η).)

8.2.11. Proof of Theorem 8.2.8. A continuous function f and a differentiable change of coordinates are under consideration. So the inverse G^{−1} is continuously differentiable, and the image G^{−1}(M) = N is Riemann-measurable. Hence, the integrals on both sides of the equality exist, and it remains to prove that their values are equal.

Denote the composite continuous function by g(t_1, …, t_n) = f(G(t_1, …, t_n)), and choose a sufficiently large n-dimensional interval I containing N and its partition Ξ. The entire proof is nothing more than a more exact writing of the discussion presented before the formulation of the theorem. Repeat the estimates on the volumes of images from the previous paragraph on Riemann measurability. It is already known that the images G(I_{i_1…i_n}) of the intervals from the partition are again Riemann-measurable sets. For each small part I_{i_1…i_n} of the partition Ξ, the integral of f over J_{i_1…i_n} = G(I_{i_1…i_n}) certainly exists, too.

Further, if the center t_{i_1…i_n} of the interval I_{i_1…i_n} is fixed, then the linear image of this interval,

R_{i_1…i_n} = G(t_{i_1…i_n}) + D¹G(t_{i_1…i_n})(I_{i_1…i_n} − t_{i_1…i_n}),

is obtained. This is an n-dimensional parallelepiped (note that the interval is shifted to the origin, then mapped by the linear mapping given by the Jacobi matrix, and the result is then added to the image of the center). If the partition is very fine, this parallelepiped differs only a little from the image J_{i_1…i_n}. By the uniform continuity of the mapping G, there is, for an arbitrarily small ε > 0, a norm of the partition such that for all finer partitions

G(t_{i_1…i_n}) + (1 + ε)D¹G(t_{i_1…i_n})(I_{i_1…i_n} − t_{i_1…i_n}) ⊃ J_{i_1…i_n}.

However, then the n-dimensional volumes also satisfy

vol_n(J_{i_1…i_n}) ≤ (1 + ε)^n vol_n(R_{i_1…i_n}) = (1 + ε)^n |det D¹G(t_{i_1…i_n})| vol_n(I_{i_1…i_n}).

Now, it is possible to estimate the entire integral:

∫_M f(x_1, …, x_n) dx_1⋯dx_n = Σ_{i_1…i_n} ∫_{J_{i_1…i_n}} f(x_1, …, x_n) dx_1⋯dx_n ≤ Σ_{i_1…i_n} (sup_{t∈I_{i_1…i_n}} g(t)) vol_n(J_{i_1…i_n}) ≤ (1 + ε)^n Σ_{i_1…i_n} (sup_{t∈I_{i_1…i_n}} g(t)) |det D¹G(t_{i_1…i_n})| vol_n(I_{i_1…i_n}).

If ε approaches zero, then the norms of the partitions approach zero too, the left-hand value of the integral remains the same, while on the right-hand side, the Riemann integral

(Illustration: the diffraction pattern we are describing.)

Since lim_{x→0} sin x / x = 1, the intensity at the middle of the image is directly proportional to I_0 = A²p²q².
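The derived pattern is easy to explore numerically. The following numpy sketch is an illustrative aside of ours; the wavelength and the sides p, q of the opening are hypothetical values chosen only for the sake of the example:

```python
# Numerical sketch of the Fraunhofer relative intensity I/I_0 derived above.
import numpy as np

lam = 633e-9            # wavelength of a red laser pointer (assumed value)
k = 2 * np.pi / lam     # wave number k = 2*pi/lambda
p, q = 1e-4, 2e-4       # sides of the rectangular opening (assumed values)

def relative_intensity(xi, eta):
    # I/I_0 = sinc^2(k*xi*p/2) * sinc^2(k*eta*q/2) with sinc(u) = sin(u)/u;
    # numpy's sinc is sin(pi*u)/(pi*u), hence the division by pi.
    u = np.sinc(k * xi * p / 2 / np.pi)
    v = np.sinc(k * eta * q / 2 / np.pi)
    return (u * v) ** 2

print(relative_intensity(0.0, 0.0))                  # 1.0 at the centre of the image
print(relative_intensity(2 * np.pi / (k * p), 0.0))  # 0.0 at the first dark stripe
```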
The Fourier transform can easily be observed if we aim a laser pointer through a subtle opening between the thumb and the index finger; the image will be that of the function of its permeability. The image from the last picture can be seen if we create a good rectangular opening by, for instance, gluing together some stickers with sharp edges.

J. First-order differential equations

8.J.1. Find all solutions of the differential equation

y′ = (√(1 − y²)/cos² x)(1 + cos² x).

Solution. We are given an ordinary first-order differential equation in the form y′ = f(x, y), which is called an explicit form of the equation. Moreover, we can write it as y′ = f_1(x)·f_2(y) for continuous univariate functions f_1 and f_2 (on certain open intervals), i.e., it is a differential equation with separated variables. First, we replace y′ with dy/dx and rewrite the differential equation in the form

(1/√(1 − y²)) dy = ((1 + cos² x)/cos² x) dx.

Since ∫ (1 + cos² x)/cos² x dx = ∫ (1/cos² x + 1) dx, we can integrate using the basic formulae, thereby obtaining

(1) arcsin y = tan x + x + C, C ∈ R.

However, we must keep in mind that the division by the expression √(1 − y²) is valid only if it is non-zero, i.e., only for y ≠ ±1. Substituting the constant functions y ≡ 1, y ≡ −1 into the given differential equation, we can immediately see that they satisfy it. We have thus obtained two more solutions, which are called singular. We do not have to pay attention to the case cos x = 0, since this only removes points from the domains (but does not lose any solutions).

Now, we will comment on several parts of the computation. The expression y′ = dy/dx allows us to make many symbolic manipulations. For instance, we have

of g(t)|det D¹G(t)| is obtained. Instead of the desired equality, the inequality

∫_M f(x) dx_1⋯dx_n ≤ ∫_N f(G(t)) |det(D¹G(t))| dt_1⋯dt_n

is obtained. The same reasoning can be repeated after interchanging G and G^{−1}, the integration domains M and N, and the functions f and g. The reverse inequality is immediately obtained:

∫_N g(t) |det(D¹G(t))| dt_1⋯dt_n ≤ ∫_M f(x) |det(D¹G(G^{−1}(x)))| |det(D¹G^{−1}(x))| dx_1⋯dx_n = ∫_M f(x) dx_1⋯dx_n.

The proof is complete.

8.2.12. An example in two dimensions. The coordinate transformations are quite transparent for the integral of a continuous function f(x, y) of two variables. Consider the differentiable transformation G(s, t) = (x(s, t), y(s, t)). Denoting g(s, t) = f(x(s, t), y(s, t)),

∫_{G(N)} f(x, y) dx dy = ∫_N g(s, t) |(∂x/∂s)(∂y/∂t) − (∂x/∂t)(∂y/∂s)| ds dt

is obtained. As a truly simple example, calculate the integral of the indicator function of a disc with radius R (i.e. its area) and the integral of the function f(r, θ) = cos r defined in polar coordinates inside a circle with radius π/2 (i.e. the volume hidden under such a "cap placed above the origin", see the illustration).

First, determine the Jacobi matrix of the transformation x = r cos θ, y = r sin θ:

D¹G = ( cos θ  −r sin θ
        sin θ   r cos θ ).

Hence, the determinant of this matrix is equal to

det D¹G(r, θ) = r(sin² θ + cos² θ) = r.

dz/dy · dy/dx = dz/dx,  1/(dy/dx) = dx/dy.

The validity of these two formulae is actually guaranteed by the chain rule theorem and the theorem for differentiating an inverse function, respectively. It was just the facility of these manipulations that inspired G. W. Leibniz to introduce this notation, which has been in use up to now.
Further, we should realize why we have not written the general solution (1) in the suggestive form

(2) y = sin(tan x + x + C), C ∈ R.

Since we will not discuss the domains of differential equations (i.e., for which values of x the expressions are well-defined), we will not change them by "redundant" simplifications, either. It is apparent that the function y from (2) is defined for all x ∈ (0, π) ∖ {π/2}. However, for the values of x which are close to π/2 (having fixed C), there is no y satisfying (1). In general, the solutions of differential equations are curves which may not be expressible as graphs of elementary functions (on the whole intervals where we consider them). Therefore, we will not even try to do that. □

8.J.2. Find the general solution of the equation y′ = (2 − y) tan x.

Solution. Again, we are given a differential equation with separated variables. We have

dy/dx = (2 − y) tan x,
−dy/(y − 2) = (sin x/cos x) dx,
−ln|y − 2| = −ln|cos x| − ln|C|, C ≠ 0.

Here, the shift obtained from the integration has been expressed as ln|C|, which is very advantageous (bearing in mind what we want to do next), especially in those cases when we obtain a logarithm on both sides of the equation. Further, we have

ln|y − 2| = ln|C cos x|, C ≠ 0,
|y − 2| = |C cos x|, C ≠ 0,
y − 2 = C cos x, C ≠ 0,

where we should write ±C (after removing the absolute value). However, since we consider all non-zero values of C, it makes no difference whether we write +C or −C. We should pay attention to the fact that we have made a division by the expression y − 2. Therefore, we must examine the case y ≡ 2 separately. The derivative of a constant function is zero, so we have found another solution, y ≡ 2. However, this solution is not singular, since it is contained in the general solution as the case C = 0. Thus, the correct result is

y = 2 + C cos x, C ∈ R. □

Therefore, the calculation can be done directly for the disc S which is the image of the rectangle (r, θ) ∈ [0, R] × [0, 2π] = T. In this way, the area of the disc is obtained:

∫_S dx dy = ∫_0^{2π} ∫_0^R r dr dθ = ∫_0^R 2πr dr = πR².

The integration of the function f is very similar, using multiple integration and integration by parts:

∫_S f(x, y) dx dy = ∫_0^{2π} ∫_0^{π/2} r cos r dr dθ = π² − 2π.

In many real-life applications, a much more general approach to integration is needed, one which allows for dealing with objects over curves, surfaces, and their higher-dimensional analogues. For many simple cases, such tools can be built now with the help of parametrizations of such k-dimensional surfaces, employing the latter theorem to show the independence of the result of such a parametrization. These topics are postponed to the beginning of the next chapter, where a more general and geometric approach is discussed, see e.g. 9.1.9 and 9.1.12.

3. Differential equations

In this section, we return to (vector) functions of one variable, defined and examined in terms of their instantaneous changes.

8.3.1. Linear and non-linear difference models. The concept of the derivative was introduced in order to work with instantaneous changes of the examined quantities. In the introductory chapter, difference equations based on similar concepts in relation to sequences of scalars were discussed. As a motivating introduction to equations containing derivatives of unknown functions, recall first the difference equations. The simplest difference equations are formulated as y_{n+1} = F(y_n, n), with a function F of two variables.
For example, the model describing interest on deposits or loans (this includes the Malthusian model of populations) was considered. The increment was proportional to the value, y_{n+1} = a y_n, see 1.2.2. Growth by 5% is represented by a = 1.05. Considering continuous modelling, the same request leads to an equation connecting the derivative y′(t) of a function with its value,

(1) y′(t) = r y(t),

with the proportionality constant r. Here, the instantaneous growth by 5% corresponds to r = 0.05. It is easy to guess the solution of the latter equation, i.e. a function y(t) which satisfies the equality identically,

y(t) = C e^{rt},

with an arbitrary constant C. This constant can be determined uniquely by choosing the initial value y_0 = y(t_0) at some point t_0. If a part of the increment in a model should be given as a constant independent of the value y or t (like bank charges

8.J.3. Find the solution of the differential equation (1 + eˣ) yy′ = eˣ which satisfies the initial condition y(0) = 1.

Solution. If the functions f : (a, b) → R and g : (c, d) → R are continuous and g(y) ≠ 0, y ∈ (c, d), then the initial problem y′ = f(x)g(y), y(x_0) = y_0 has a unique solution for any x_0 ∈ (a, b), y_0 ∈ (c, d). This solution is determined implicitly as

∫_{y_0}^{y(x)} dt/g(t) = ∫_{x_0}^x f(t) dt.

In practical problems, we first find all solutions of the equation and then select the one which satisfies the initial condition. Let us compute:

(1 + eˣ) y dy/dx = eˣ,
y dy = (eˣ/(1 + eˣ)) dx,
y²/2 = ln(1 + eˣ) + ln|C|, C ≠ 0,
y²/2 = ln(C [1 + eˣ]), C > 0.

The substitution y = 1, x = 0 then gives 1/2 = ln(C · 2), i.e. C = √e/2. We have thus found the solution

y²/2 = ln((√e/2)[1 + eˣ]), i.e., y = √(2 ln((√e/2)[1 + eˣ]))

on a neighborhood of the point [0, 1], where y > 0. □

8.J.4. Find the solution of the differential equation y′ = (y² + 1)/(x + 1) which satisfies y(0) = 1.

Solution. Similarly to the previous example, we get

dy/(y² + 1) = dx/(x + 1),
arctan y = ln|x + 1| + C, C ∈ R.

The initial condition (i.e., the substitution x = 0 and y = 1) gives arctan 1 = ln|1| + C, i.e., C = π/4. Therefore, the solution of the given initial problem is the function

y(x) = tan(ln|x + 1| + π/4)

on a neighborhood of the point [0, 1]. □

or the natural decrease of a stock population as a result of sending some part of it to slaughterhouses), an equation with a constant s on the right-hand side can be used:

(2) y′(t) = r y(t) + s.

The solution of this equation is the function

y(t) = C e^{rt} − s/r.

It is a straightforward matter to produce this solution when it is realized that the set of all solutions of the equation (1) is a one-dimensional vector space, while the solutions of the equation (2) are obtained by adding any one of its solutions to the solutions of the previous equation. The constant solution y(t) = k for k = −s/r is easily found.

Similarly, in paragraph 1.4.1, the logistic model of population growth was created, based on the assumption that the ratio of the change p(n + 1) − p(n) of the population size and its size p(n) is affine with respect to the population size itself. The model behaves similarly to the Malthusian one for small values of the population size, and it ceases growing when reaching a limit value K. Now, the same relation for the continuous model can be formulated for a population p(t) dependent on time t by the equality

(3) p′(t) = p(t) (−(r/K) p(t) + r).
At the value p(t) = K for a (large) constant K, the instantaneous increment of the function p is zero, while for p(t) > 0 near zero, the ratio of the rate of increment of the population and its size is close to r, which is the (small) number expressing the rate of increment of the population in good conditions (e.g. 0.05 would again mean an immediate growth by 5%).

It is not easy to solve such an equation without knowing any theory (although this type of equation will be dealt with in a moment). However, as an exercise on differentiation, it is easily verified that the following function is a solution for every constant C:

p(t) = K/(1 + CK e^{−rt}).

For the continuous and discrete versions of the logistic models, the values K = 100, r = 0.05, and C = 1 are chosen in the left-hand illustration. The discrete model from 1.4.1 yields the result in the right-hand illustration (i.e. with a = 1.05 and p_1 = 1, as expected). The choice C = 1 yields p(0) = K/(1 + K), which is very close to 1 if K is large enough.

8.J.5. Solve

(1) y′ = (x + y + 1)/(2x + 2y − 1).

Solution. Let a function f : (a, b) × (c, d) → R have continuous second-order partial derivatives and f(x, y) ≠ 0, x ∈ (a, b), y ∈ (c, d). Then, the differential equation y′ = f(x, y) can be transformed to an equation with separated variables if and only if

f(x, y) f″_{xy}(x, y) − f′_x(x, y) f′_y(x, y) = 0, x ∈ (a, b), y ∈ (c, d).

With a bit of effort, it can be shown that a differential equation of the form y′ = f(ax + by + c) can be transformed to an equation with separated variables, and this can be done by the substitution z = ax + by + c. Let us emphasize that the variable z replaces y. We thus set z = x + y, which gives z′ = 1 + y′. Substitution into (1) yields

z′ − 1 = (z + 1)/(2z − 1),
dz/dx = (z + 1)/(2z − 1) + 1,
dz/dx = 3z/(2z − 1),
(2/3 − 1/(3z)) dz = dx,
(2/3) z − (1/3) ln|z| = x + C, C ∈ R, or
(2/3) z − (1/3) ln|Cz| = x, C ≠ 0.

Now, we must get back to the original variable y in one of these forms. The general solution can be written as

(2/3) x + (2/3) y − (1/3) ln|x + y| = x + C, C ∈ R, i.e.,
x − 2y + ln|x + y| = C, C ∈ R.

At the same time, we have the singular solution y = −x, which follows from the constraint z ≠ 0 of the operations we have made (we have divided by the value 3z). □

8.J.6. Solve the differential equation x y′ + y ln x = y ln y.

Solution. Using the substitution u = y/x, every homogeneous differential equation y′ = f(y/x) can be transformed to an equation (with separated variables)

u′ = (1/x)(f(u) − u), i.e. u′x + u = f(u).

The name of this differential equation comes from the following definition. A function f of two variables is called homogeneous of degree k iff f(tx, ty) = t^k f(x, y). Then, a differential equation of the form

P(x, y) dx + Q(x, y) dy = 0

In particular, both versions of this logistic model yield quite similar results. For example, the left-hand illustration also contains the dashed line of the graph of the solution of the equation (1) with the same constant r and initial condition (i.e. the Malthusian model of growth).

8.3.2. First-order differential equations. By an (ordinary) first-order differential equation is usually meant a relation between the derivative y′(t) of a function with respect to the variable t, its value y(t), and the variable itself, which can be written in terms of some real-valued function F : R³ → R as the equality

F(y′(t), y(t), t) = 0.
This equation resembles the implicitly defined functions y(t); however, this time, there is a dependency on the derivative of the function y(t). We also often suppress the dependency of y = y(t) on the other variable t and write F(y′, y, t) = 0 instead.

If the implicit equation is solved at least explicitly with regard to the derivative, i.e.

y′ = f(t, y)

for some function f : R² → R, it is clear graphically what this equation defines. For every value (t, y) in the plane, the arrow corresponding to the vector (1, f(t, y)) can be considered. That is the velocity with which the point of the graph of the solution moves through the plane, depending on the free parameter t. For instance, the equation (3) from the previous subsection determines the following picture (illustrating the solution for the initial condition as above).

Such illustrations should invoke the idea that differential equations define a "flow" in the plane, and each choice of the initial value (t_0, y(t_0)) should correspond to a unique flowline expressing the movement of the initial point in the time t. It can be anticipated intuitively that, for reasonably behaved functions f(t, y) in the equations y′ = f(t, y), there is a unique solution for all initial conditions.

is a homogeneous differential equation iff the functions P and Q are homogeneous of the same degree k. For instance, we can discover that the given equation

x dy + (y ln x − y ln y) dx = 0

is homogeneous. Of course, it is not difficult to write it explicitly in the form y′ = (y/x) ln(y/x). The substitution u = y/x then leads to

u′x + u = u ln u,
(du/dx) x = u (ln u − 1),
du/(u(ln u − 1)) = dx/x,

where u(ln u − 1) ≠ 0. Using another substitution, namely t = ln u − 1, we can integrate:

∫ du/(u(ln u − 1)) = ∫ dx/x,
∫ dt/t = ∫ dx/x,
ln|t| = ln|x| + ln|C|, C ≠ 0,
ln|ln u − 1| = ln|Cx|, C ≠ 0,
ln u − 1 = Cx, C ≠ 0,
ln(y/x) = Cx + 1, C ≠ 0,
y = x e^{Cx+1}, C ≠ 0.

The excluded cases u = 0 and ln u = 1 do not lead to two more solutions, since u = 0 implies y = 0, which cannot be put into the original equation. On the other hand, ln u = 1 gives y/x = e, and the function y = e·x is clearly a solution. Therefore, the general solution is

y = x e^{Cx+1}, C ∈ R. □

8.J.7. Compute y′ = −(4x + 3y + 1)/(3x + 2y + 1).

Solution. In general, we are able to solve every equation of the form

(1) y′ = f((ax + by + c)/(Ax + By + C)).

If the system of linear equations

(2) ax + by + c = 0, Ax + By + C = 0

has a unique solution x_0, y_0, then the substitution u = x − x_0, v = y − y_0 transforms the equation (1) to a homogeneous equation dv/du = f((au + bv)/(Au + Bv)).

8.3.3. Integration of differential equations. Before examining the conditions for the existence and uniqueness of the solutions, we present a truly elementary method for finding the solutions. The idea, mentioned briefly already in 6.2.21 on page 552, is to transform the problem to ordinary integration, which usually leads to an implicit description of the solution.

Equations with separated variables

Consider a differential equation in the form

(1) y′ = f(t) · g(y)

for two continuous functions of a real variable, f and g. The solution of this equation can be obtained by integration, finding the antiderivatives

G(y) = ∫ dy/g(y), F(t) = ∫ f(t) dt.

This procedure reliably finds solutions y(t) which satisfy g(y(t)) ≠ 0, given implicitly by the formula

(2) F(t) + C = G(y)

with an arbitrary constant C.
Differentiating the latter equation (2), using the chain rule for the composite function G(y(t)), leads to

(1/g(y)) y′(t) = f(t),

as required.

As an example, find the solution of the equation

y′ = t y.

Direct calculation gives ln|y(t)| = (1/2)t² + C with an arbitrary constant C. Hence it looks (at least for positive values of y) as

y(t) = e^{(1/2)t² + C} = D e^{(1/2)t²},

where D is an arbitrary positive constant. It is helpful to examine the resulting formula and the signs thoroughly. The constant solution y(t) = 0 also satisfies the equation. For negative values of y, the same solution can be used with negative constants D. In fact, the constant D can be arbitrary, and a solution is found satisfying any initial value.

The illustration shows two solutions which demonstrate the instability of the equation with regard to the initial values: for every t_0, if we change a small y_0 from a negative value to a positive one, then the behaviour of the resulting solution changes dramatically. Notice the constant solution y(t) = 0, which satisfies the initial condition y(t_0) = 0.

If the system (2) has no solution or has infinitely many solutions, the substitution z = ax + by transforms the equation (1) to an equation with separated variables (often, the original equation is already such).

In this problem, the corresponding system of equations

4x + 3y + 1 = 0, 3x + 2y + 1 = 0

has a unique solution x_0 = −1, y_0 = 1. The substitution u = x + 1, v = y − 1 then leads to the homogeneous equation dv/du = −(4u + 3v)/(3u + 2v), which can be solved by the further substitution z = v/u. We thus obtain

z′u + z = −(4 + 3z)/(3 + 2z),
(dz/du) u = −(2z² + 6z + 4)/(3 + 2z),
((2z + 3)/(2z² + 6z + 4)) dz = −du/u,

provided z² + 3z + 2 ≠ 0. Integrating, we get

(1/2) ln|z² + 3z + 2| = −ln|u| + ln|C|, C ≠ 0,
(1/2) ln(|z² + 3z + 2| u²) = ln|C|, C ≠ 0,
ln(|z² + 3z + 2| u²) = ln C², C ≠ 0,
(z² + 3z + 2) u² = ±C², C ≠ 0.

We thus have

(z² + 3z + 2) u² = D, D ≠ 0,

and, returning to the original variables,

(v²/u² + 3(v/u) + 2) u² = D, D ≠ 0,
v² + 3vu + 2u² = D, D ≠ 0,
(y − 1)² + 3(y − 1)(x + 1) + 2(x + 1)² = D, D ≠ 0.

Making simple rearrangements, the general solution can be expressed as

(x + y)(2x + y + 1) = D, D ≠ 0.

Now, let us return to the condition z² + 3z + 2 ≠ 0. It follows from z² + 3z + 2 = 0 that z = −1 or z = −2, i.e., v = −u or v = −2u. For v = −u, we have x = u − 1 and y = v + 1 = −u + 1, which means that y = −x. Similarly, for v = −2u, we have y = −2u + 1, hence y = −2x − 1. However, both functions y = −x, y = −2x − 1 satisfy the original differential equation and are included in the general solution for the choice D = 0. Therefore, every solution is known from the implicit form

(x + y)(2x + y + 1) = D, D ∈ R. □

Using separation of variables, the non-linear equation from the previous paragraph, which describes the logistic population model, is easily solved. Try this as an exercise.

8.3.4. First-order linear equations. In the first chapter, we paid much attention to linear difference equations. Their general solution was determined in paragraph 1.2.2 on page 11. Although it is clear beforehand that it is a one-dimensional affine space of sequences, it is a hardly transparent sum, because all the changing coefficients need to be taken into account. Consequently, this can be used as a source of inspiration for the following construction of the solution of a general first-order linear equation

(1) y′ = a(t)y + b(t)

with continuous coefficients a(t) and b(t).
First, find the solution of the homogeneous equation y′ = a(t)y. This can be computed easily by separation of variables: the solution with the initial value y(t_0) = y_0 is

y(t) = y_0 F(t, t_0), where F(t, s) = e^{∫_s^t a(x) dx}.

In the case of difference equations, the solution of the general non-homogeneous equation was "guessed", and then it was proved by induction that it was correct. It is even simpler now, as it suffices to differentiate the correct solution to verify the statement, once we are told what the right result is:

The solution of first-order linear equations

The solution of the equation (1) with the initial value y(t_0) = y_0 is (locally in a neighbourhood of t_0) given by the formula

y(t) = y_0 F(t, t_0) + ∫_{t_0}^t F(t, s) b(s) ds,

where F(t, s) = e^{∫_s^t a(x) dx}.

Verify the correctness of the solution by yourselves (pay proper attention to the differentiation of the integral, where t appears both in the upper bound and as a free parameter in the integrand, cf. 6.3.13). In fact, there is a general method called variation of constants which directly yields this solution, see e.g. the problem 8.J.9. It consists in taking the solution of the homogeneous equation in the form y(t) = c F(t, t_0) and considering instead an ansatz for a solution to the non-homogeneous equation in the form y(t) = c(t) F(t, t_0) with an unknown function c(t). Differentiating yields the equation

c′(t) = e^{−∫_{t_0}^t a(x) dx} b(t),

and integrating this leads to

c(t) = ∫_{t_0}^t e^{∫_s^{t_0} a(x) dx} b(s) ds, i.e. y(t) = c(t) e^{∫_{t_0}^t a(x) dx},

as in the above formula. Check the details! Notice also the similarity to the solution for the equations with constant coefficients, explicitly computed in the form of a convolution in 7.B.13 on page 663, which could serve as an inspiration, too.

As an example, the equation y′ = 1 − xy

8.J.8. Find the general solution of the differential equation

(x² + y²) dx − 2xy dy = 0.

Solution. For y ≠ 0, simple rearrangements lead to

y′ = (x² + y²)/(2xy) = (1 + (y/x)²)/(2(y/x)).

Using the substitution u = y/x, we get to the equation u′x + u = (1 + u²)/(2u). For u ≠ ±1 and D = −1/C, we have

(du/dx) x = (1 + u² − 2u²)/(2u),
(2u/(1 − u²)) du = dx/x,
−ln|1 − u²| = ln|x| + ln|C|, C ≠ 0,
ln(1/|1 − u²|) = ln|Cx|, C ≠ 0,
1/(1 − u²) = Cx, C ≠ 0,
1 = Cx(1 − y²/x²), C ≠ 0,
−D/x = 1 − y²/x², D ≠ 0,
−Dx = x² − y², D ≠ 0.

The condition u = ±1 corresponds to y = ±x. While y ≡ 0 is not a solution, both the functions y = x and y = −x are solutions, and they can be obtained by the choice D = 0. The general solution is thus

y² = x² + Dx, D ∈ R. □

8.J.9. Solve y′ = x − 2y/(x² − 1).

Solution. The given equation is of the form y′ = a(x)y + b(x), i.e., it is a non-homogeneous linear differential equation (the function b is not identically equal to zero). The general solution of such an equation can be obtained using the method of integration factor (the non-homogeneous equation is multiplied by the expression e^{−∫a(x) dx}) or the method of variation of constants (the integration constant that arises in the solution of the corresponding homogeneous equation is considered to be a function of the variable x). We will illustrate both of these methods on this problem.

As for the former method, we multiply the original equation by the expression

e^{∫ 2/(x²−1) dx} = e^{ln|(x−1)/(x+1)|} = (x − 1)/(x + 1),

where the corresponding integral is understood to stand for any antiderivative, and where any non-zero multiple of the obtained function can be considered (that is why we could remove the absolute value). Thus, consider the equation

y′ (x − 1)/(x + 1) + 2y/(x + 1)² = x(x − 1)/(x + 1).
The core of the method of integration factor is the fact that the expression on the left-hand side is the derivative of y(x − 1)/(x + 1). Integrating this leads to

can be solved directly (although the so-called error function appears when integrating the particular solution, and this cannot be expressed via elementary functions), this time encountering stable behaviour, visible in the following illustration.

8.3.5. Transformation of coordinates. The illustrations suggest that differential equations can be perceived as geometric objects (the "directional field of the arrows"), so the solution can be found by conveniently chosen coordinates. We return to this point of view later. Here are three simple examples of typical tricks, as seen from the explicit form of the equations in coordinates.

We begin with homogeneous equations of the form

y′ = f(y/t).

Considering the transformation z = z(t) = y/t and assuming that t ≠ 0, then by the chain rule,

z′ = (1/t²)(t y′ − y) = (1/t)(f(z) − z),

which is an equation with separated variables.

Other examples are the Bernoulli differential equations, which are of the form

y′ = f(t)y + g(t)yⁿ,

where n ≠ 0, 1. The choice of the transformation z = y^{1−n} leads to the equation

z′ = (1 − n) y^{−n} (f(t)y + g(t)yⁿ) = (1 − n)f(t)z + (1 − n)g(t),

which is a linear equation, easily integrated.

We conclude with the extraordinarily important Riccati equation. It is a form of the Bernoulli equation with n = 2, extended by an absolute term,

y′ = f(t)y + g(t)y² + h(t).

This equation can also be transformed to a linear equation, provided that a particular solution x = x(t) can be guessed. Then, use the transformation

z = 1/(y − x).

Verify by yourselves that this transformation leads to the equation

z′ = −(f(t) + 2x(t)g(t)) z − g(t).

y(x − 1)/(x + 1) = ∫ x(x − 1)/(x + 1) dx = x²/2 − 2x + 2 ln|x + 1| + C, C ∈ R.

Therefore, the solutions are the functions

y = ((x + 1)/(x − 1)) (x²/2 − 2x + 2 ln|x + 1| + C), C ∈ R.

As for the latter method, we first solve the corresponding homogeneous equation y′ = −2y/(x² − 1), which is an equation with separated variables. We have

dy/dx = −2y/(x² − 1),
dy/y = −(2/(x² − 1)) dx,
ln|y| = −ln|x − 1| + ln|x + 1| + ln|C|, C ≠ 0,
ln|y| = ln|C (x + 1)/(x − 1)|, C ≠ 0,
y = C (x + 1)/(x − 1), C ≠ 0,

where we had to exclude the case y = 0. However, the function y ≡ 0 is always a solution of a homogeneous linear differential equation, and it can be included in the general solution. Therefore, the general solution of the corresponding homogeneous equation is

y = C(x + 1)/(x − 1), C ∈ R.

Now, we will consider the constant C to be a function C(x). Differentiating leads to

y′ = (C′(x)(x + 1)(x − 1) + C(x)(x − 1) − C(x)(x + 1))/(x − 1)².

Substituting this into the original equation, we get

(C′(x)(x + 1)(x − 1) + C(x)(x − 1) − C(x)(x + 1))/(x − 1)² = x − 2C(x)(x + 1)/((x − 1)(x² − 1)).

It follows that C′(x) = x(x − 1)/(x + 1), i.e.,

C(x) = ∫ x(x − 1)/(x + 1) dx = x²/2 − 2x + 2 ln|x + 1| + C, C ∈ R.

Now, it suffices to substitute:

y = C(x)(x + 1)/(x − 1) = ((x + 1)/(x − 1)) (x²/2 − 2x + 2 ln|x + 1| + C), C ∈ R.

We can see that the result we have obtained here is of the same form as in the former case. This should not be surprising, as the differences between the two methods are insignificant and the computed integrals are the same.

Finally, we can notice that the solution y of an equation y′ = a(x)y can be found in the same way for any continuous function a. We thus always have

y = C e^{∫ a(x) dx}, C ∈ R.
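For instance, the linear equation from problem 8.J.9 can also be handed to a computer algebra system. The following sympy sketch (an illustrative aside of ours, assuming sympy's dsolve) should reproduce, up to rearrangement and the naming of the constant, the solution computed above:

```python
# Sketch: solving y' = x - 2*y/(x^2 - 1) from 8.J.9 with sympy's dsolve.
import sympy as sp

x = sp.symbols('x')
y = sp.Function('y')

ode = sp.Eq(y(x).diff(x), x - 2*y(x)/(x**2 - 1))
print(sp.dsolve(ode, y(x)))
# Expected, up to the form of the integration constant:
# y(x) = (x + 1)/(x - 1) * (x**2/2 - 2*x + 2*log(x + 1) + C)
```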
CALCULUS WITH MORE VARIABLES As seen in the case of integration of functions (the simplest type of equations with separated variables), the equations usually do not have a solution expressible explicitly in terms of elementary functions. As with standard engineering tables of values of special functions, books listing the solutions of basic equations are compiled as well.7 Today, the wisdom concealed in them is essentially transferred to software systems like Maple or Mathematica. Here, any task about ordinary differential equations can be assigned, with results obtained in surprisingly many cases. Yet, explicit solutions are not possible for most prob- lems. 8.3.6. Existence and uniqueness. The way out of this is numerical methods, which try only to approximate the solutions. However, to be able to use them, good theoretical starting points are still needed regarding existence, uniqueness, and stability of the solutions. We begin with the Picard–Lindelöf theorem: Existence and uniqueness of the solutions of ODEs Theorem. Consider a function f(t, y) : R2 → R with continuous partial derivatives on an open set U. Then for every point (t0, y0) ∈ U ⊂ R2 , there exists the maximal interval I = (t0 − a, t0 + b), with positive a, b ∈ R, and the unique function y(t) : I → R which is the solution of the equation y′ = f(t, y) on the interval I. Proof. If a differentiable function y(t) is a solution of an equation satisfying the initial condition y(t0) = y0, then it also satisfies the equation y(t) = y0 + ∫ t t0 y′ (s) ds = y0 + ∫ t t0 f(s, y(s)) ds, where the Riemann integrals exist due to the continuity of f and hence also y′ . However, the right-hand side of this expression is the integral operator L(y)(t) = y0 + ∫ t t0 f(s, y(s)) ds acting on functions y. Solving first-order differential equations, is equivalent to finding fixed points for this operator L, that is, to find a function y = y(t) satisfying L(y) = y. On the other hand, if a Riemann-integrable function y(t) is a fixed point of the operator L, then it immediately follows from the fundamental theorem of calculus that y(t) satisfies the given differential equation, including the initial con- ditions. 7For example, the famous book Differentialgleichungen reeller Funktionen, Akademische Verlagsgesellschaft, Leipzig 1930, by E. Kamke, a German mathematician, contains many hundreds of solved equations. They appeared in many editions in the last century. 767 Similarly, the solution of an equation y′ = a(x)y +b(x) with an initial condition y(x0) = y0 can be determined explicitly as (provided the coefficients, i. e. the functions a and b, are continuous) y = e ∫ x x0 a(t) dt ( y0 + ∫ x x0 b(t) e − ∫ t x0 a(s) ds dt ) . Let us remark that the linear equation has no singular solution, and the general solution contains a C ∈ R. □ 8.J.10. Solve the linear equation (y′ + 2xy) ex2 = cos x. Solution. If we used the method of integration factor, we would only rewrite the equation trivially since it is already of the desired form – the expression on the left-hand side is the derivative of y ex2 . Thus, we can immediately calculate ( y ex2 )′ = cos x, y ex2 = ∫ cos x dx, y ex2 = sin x + C, C ∈ R, y = e−x2 (sin x + C) , C ∈ R. □ 8.J.11. Find all non-zero solutions of the Bernoulli equa- tion y′ − y x = 3xy2 . Solution. The Bernoulli equation y′ = a(x)y + b(x)yr , r ̸= 0, r ̸= 1, r ∈ R can be solved by first dividing by the term yr and then using the substitution u = y1−r , which leads to the linear differential equation u′ = (1 − r) [a(x)u + b(x)] . 
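Both of these recipes (the general solution of the linear equation above and the Bernoulli substitution) can be cross-checked with a computer algebra system of the kind mentioned later in this chapter; a minimal sketch, assuming Python with SymPy rather than the book's Maple or Mathematica, applied to problems 8.J.9 and 8.J.11 (dsolve may parametrize the integration constant differently than the hand computation):

```python
# Cross-check of 8.J.9 (linear) and 8.J.11 (Bernoulli) with SymPy's dsolve.
import sympy as sp

x = sp.symbols('x')
y = sp.Function('y')

# 8.J.9: y' = x - 2y/(x^2-1); expected y = (x+1)/(x-1)*(x^2/2 - 2x + 2 ln|x+1| + C)
linear = sp.Eq(y(x).diff(x), x - 2*y(x)/(x**2 - 1))
sol_linear = sp.dsolve(linear)
print(sol_linear, sp.checkodesol(linear, sol_linear))      # (True, 0) if valid

# 8.J.11: y' - y/x = 3x y^2; expected y = x/(C - x^3)
bernoulli = sp.Eq(y(x).diff(x) - y(x)/x, 3*x*y(x)**2)
sol_bernoulli = sp.dsolve(bernoulli)
print(sol_bernoulli, sp.checkodesol(bernoulli, sol_bernoulli))
```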
In this very problem, the substitution u = y1−2 = 1/y gives u′ + u x = −3x. Similarly to the previous exercise, we have u = e− ln | x | [∫ −3x eln | x | dx ] , where ln | x | was obtained as an (arbitrary) antiderivative to 1/x. Furhter, u = eln 1 | x | [∫ −3x eln | x | dx ] , u = 1 | x | [∫ −3x | x | dx ] . The absolute value can be replaced with a sign that can be canceled, i. e., it suffices to consider u = 1 x [∫ −3x2 dx ] = 1 x [ −x3 + C ] , C ∈ R. Returning to the original variable, we get y = 1 u = x C−x3 , C ∈ R. The excluded case y ≡ 0 is a singular solution (which, of course, is true for every Bernoulli equation with r positive). □ CHAPTER 8. CALCULUS WITH MORE VARIABLES It is easy to estimate how much the values L(y) and L(z) differ for various functions y(t) and z(t). Since both partial derivatives of f are continuous, f is itself locally Lipschitz. This means that restricting the values (t, y) to a neighbourhood U of the point (t0, y0) with compact closure, there is the estimate |f(t, y) − f(t, z)| ≤ C|y − z|, with some constant C depending only on U. This immediately leads to the following bound (for the sake of simplicity, t ≥ t0, but the final conclusion works for t < t0 the same way) | ( L(y) − L(z) ) (t)| = ∫ t t0 f(s, y(s)) − f(s, z(s)) ds ≤ ∫ t t0 |f(s, y(s)) − f(s, z(s))| ds ≤ C ∫ t t0 |y(s) − z(s)| ds ≤ C ( max t0≤s≤t |y(s) − z(s)| ) |t − t0| = D( max t0≤s≤t |y(s) − z(s)| ) , where the constant D comes from substituting the maximum of |t − t0| on U. If the operator L is viewed as an operator on a metric space of continuous functions on a compact interval with the max norm, this yields ∥L(y) − L(z)∥ ≤ D ∥y − z∥. Some further restrictions on the choice of U and the considered functions y and z are required, in order to make the constant D smaller than one. Then the Banach fixed point theorem, based on the notion of a contraction, can be applied. See 7.3.9 on the page 676. At the same time, the operator must leave the chosen space of functions y invariant, i.e. the images L(y) are also there. To begin, choose ε > 0 and δ > 0, both small enough so that [t0 − δ, t0 + δ] × [y0 − ε, y0 + ε] = V ⊂ U, and consider only those functions y(t) which satisfy for J = [t0 −δ, t0 +δ] the estimate maxt∈J |y(t)−y0| < ε. The uniform continuity of f(t, y) on V ensures that fixing ε and further shrinking δ, implies max t∈J |L(y)(t) − y0| < ε. Finally, the above estimate for ∥L(y) − L(z)∥ shows that if δ is decreased sufficiently further, then the latter constant D becomes smaller than one, as required for a contraction. At the same time, L maps the above space of functions into itself. However, for the assumptions of the Banach contraction theorem, which guarantees the uniquely determined fixed point, completeness of the space X of functions on which the operator L works is needed. Since the mapping f(t, y) is continuous, there follows a uniform bound for all of the functions y(t) considered above 768 8.J.12. Interchanging the variables, solve the equation y dx − ( x + y2 sin y ) dy = 0. Solution. When the variable x occurs only in the first power in the differential equation and y occurs in the arguments of elementary functions, we can apply the so-called method of variable interchange, when we look for the solution as for a function x of the independent variable y. First, we write the equation explicitly: y′ = y x+y2 sin y . 
This equation is not of any of the previous types, so we rewrite it as follows: dy dx = y x + y2 sin y , dx dy = ( y x + y2 sin y )−1 = x y + y sin y, x′ = 1 y x + y sin y. We have thus obtained a linear differential equation. Now, we can easily compute its general solution x = −y cos y + Cy, C ∈ R. □ Further problems concerning first-order differential equations can be found on page 795. K. Practical problems leading to differential equations 8.K.1. A water purification plant with volume 2000 m3 was contaminated with lead which is spread in the water with density 10 g/m3 . Water is flowing in and out of the basin at 2 m3 /s. In what time does the amount of lead in the basin decrease below 10 µg/m3 (which is the hygienic norm for the amount of lead in drinkable water by a regulation of the European Community) provided the water keeps being mixed uniformly? Solution. Let us denote the water’s volume in the basin by V (m3 ), the speed of the water’s flow by v (m3 /s). In an infinitesimal (infinitely small) time unit dt, m V · v dt grams of lead runs out of the basin, so we can construct the differential equation dm = − m V · v dt for the change of the lead’s mass in the basin. Separating the variables, we get the equation dm m = − v V dt. Integration both sides of the equation and getting rid of the logarithms, we get the solution in the form m(t) = m0e− v V t , where m0 is the lead’s mass at time t = 0. Substituting the concrete values, we find out that t . = 6 h 35 min. □ CHAPTER 8. CALCULUS WITH MORE VARIABLES and the values t > s in their domain: |L(y)(t) − L(y)(s)| ≤ ∫ t s |f(s, y(s)| ds ≤ A |t − s| with a universal constant A > 0. Besides the conditions mentioned above, there is a restriction to the subset of all equicontinuous functions in the sense of the Definition 7.3.15. According to the Arzelà-Ascoli Theorem proved in the same paragraph at the page 683, this set of continuous functions is already compact, hence it is a complete set of continuous functions on the interval. Therefore, there exists a unique fixed point y(t) of this contraction L by the Theorem 7.3.9. This is the solution of the equation. It remains to show the existence of a maximal interval I = (t0 − a, t0 + b). Suppose that a solution y(t) is found on an interval (t0, t1), and, at the same time, the one-sided limit y1 = limt→t1− y(t) exists and is finite. It follows from the already proven result that there exists a solution with this initial condition (t1, y1), in some neighbourhood of the point t1. Clearly, it must coincide with the discussed solution y(t) on the left-hand side of t1. Therefore, the solution y(t) can be extended on the right-hand side of t1. There are only two possibilities when the extension of the solution behind t1 does not exist: either there is no finite left limit y(t) at t1, or the limit y1 exists, yet the point (t1, y1) is on the boundary of the domain of the function f. In both cases, the maximal extension of the solution to the right of t0 is found. The argumentation for the maximal solution left of t0 is analogous. □ 8.3.7. Iterative approximations of solutions. The proof of the previous theorem can be reformulated as an iterative procedure which provides approximate solutions using step-by-step integration. Moreover, an explicit estimate for the constant C from the proof yields bounds for the errors. Think this out as an exercise (see the proof of Banach fixed-point theorem in paragraph 7.3.9). 
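As a complement to this exercise, the iteration scheme itself is easy to run symbolically; a minimal sketch, assuming Python with SymPy, for the equation $y' = y$, $y(0) = 1$, whose Picard iterates are exactly the partial sums of the exponential series:

```python
# Picard's iterations y_{n+1}(t) = y0 + int_{t0}^{t} f(s, y_n(s)) ds in SymPy.
import sympy as sp

t, s = sp.symbols('t s')

def picard(f, t0, y0, steps):
    y = sp.sympify(y0)                     # y_0(t) is the constant function
    for _ in range(steps):
        y = y0 + sp.integrate(f(s, y.subs(t, s)), (s, t0, t))
    return sp.expand(y)

# y' = y, y(0) = 1: the iterates are the partial sums of the series for e^t.
print(picard(lambda s, y: y, 0, 1, 4))     # 1 + t + t**2/2 + t**3/6 + t**4/24
```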
It can then be shown easily and directly that this is a uniformly convergent sequence of continuous functions, so the limit is again a continuous function (without invoking the complicated theorems from the seventh chapter).

8.K.2. The speed of transmission of a message in a population consisting of $P$ people is directly proportional to the number of people who have not heard the message yet. Determine the function $f$ which describes the dependency of the number of people who have heard the message on time. Is it appropriate to use this model of message transmission for small or large values of $P$?

Solution. We construct a differential equation for $f$. The speed of the transmission $\frac{df}{dt} = f'(t)$ should be directly proportional to the number of people who have not heard of it, i.e. the value $P - f(t)$. Altogether, $$\frac{df}{dt} = k(P - f(t)).$$ Separating the variables and introducing a constant $K$ (the number of people who know the message at time $t = 0$ must be $P - K$), we get the solution $$f(t) = P - K e^{-kt},$$ where $k$ is a positive real constant. Apparently, this model makes sense for large values of $P$ only. □

8.K.3. The speed at which an epidemic spreads in a given closed population consisting of $P$ people is directly proportional to the product of the number of people who have been infected and the number of people who have not. Determine the function $f(t)$ describing the number of infected people in time.

Solution. Just like in the previous problem, we construct a differential equation: $$\frac{df}{dt} = k \, f(t)\,(P - f(t)).$$ Again, separating the variables and introducing suitable constants $K$ and $L$, we obtain $$f(t) = \frac{K}{1 + L e^{-Kkt}}.$$ □

8.K.4. The speed at which a given isotope of a given chemical element decays is directly proportional to the amount of the given isotope. The half-life of the isotope of plutonium $^{239}_{94}\mathrm{Pu}$ is 24,100 years. In what time does a hundredth of a nuclear bomb whose active component is the mentioned isotope disappear?

Solution. Denoting the amount of plutonium by $m$, we can build a differential equation for the rate of the decay: $$\frac{dm}{dt} = -k \cdot m,$$ where $k$ is an unknown positive constant. The solution is thus the function $m(t) = m_0 e^{-kt}$. Substituting into the equation for the half-life ($e^{-kt} = \frac{1}{2}$ with $t = 24{,}100$), we get the constant $k \doteq 2.88 \cdot 10^{-5}$ per year. The wanted time is then approximately 349 years. □

Picard's approximations

Theorem. The unique solution of the equation $y' = f(t, y)$ whose right-hand side $f$ has continuous partial derivatives can be expressed, on a sufficiently small interval, as the limit of step-by-step iterations beginning with the constant function (Picard's approximation): $$y_0(t) = y_0, \qquad y_{n+1}(t) = L(y_n), \quad n = 0, 1, \dots.$$ It is a uniformly converging sequence of differentiable functions with differentiable limit $y(t)$.

Only the Lipschitz condition is needed for the function $f$, so the latter two theorems are true with this weaker assumption as well. It is seen in the next paragraph that continuity of the function $f$ guarantees the existence of the solution. Yet it is insufficient for the uniqueness.

8.3.8. Ambiguity of solutions. We begin with a simple example. Consider the equation $y' = \sqrt{|y|}$. Separating the variables, the solution is $y(t) = \frac{1}{4}(t + C)^2$ for positive values $y$, with an arbitrary constant $C$ and $t + C > 0$. For the initial values $(t_0, y_0)$ with $y_0 \neq 0$, this is an assignment matching the previous theorem, so there is locally exactly one solution.
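The parabolic family just found is easy to verify symbolically; a minimal sketch, again assuming SymPy:

```python
# Check that y(t) = (t + C)^2/4 solves y' = sqrt(|y|) wherever t + C > 0.
import sympy as sp

t, C = sp.symbols('t C', real=True)
y = (t + C)**2 / 4
residual = sp.simplify(y.diff(t) - sp.sqrt(sp.Abs(y)))
print(residual)   # (t + C)/2 - Abs(t + C)/2: vanishes exactly for t + C >= 0
```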
The solution must apparently remain non-decreasing, hence for negative values $y_0$, the solution is the same, only with the opposite sign and $t + C < 0$. However, for the initial condition $(t_0, y_0) = (t_0, 0)$, there is not only the already discussed solution continuing to the left of $t_0$ and to the right, but also the identically zero solution $y(t) = 0$. Therefore, these two branches can be glued arbitrarily (see the diagram, where the thick solution can be continued along the $t$ axis and branch along the parabola at any value $t$). Nevertheless, the existence of a solution is guaranteed by the following theorem, known as the Peano existence theorem:

8.K.5. The acceleration of an object falling in a constant gravitational field with a certain resistance of the environment is given by the formula $\frac{dv}{dt} = g - kv$, where $k$ is a constant which expresses the resistance of the environment. An object was dropped in a gravitational field with $g = 10\ \mathrm{m\,s^{-2}}$ at the initial speed of $5\ \mathrm{m\,s^{-1}}$; the resistance constant is $k = 0.5\ \mathrm{s^{-1}}$. What will the speed of the object be in three seconds?

Solution. $v = \frac{g}{k} - \left(\frac{g}{k} - v_0\right)e^{-kt}$, so $v(3) = 20 - 15\,e^{-3/2}\ \mathrm{m\,s^{-1}}$ after substitution. □

8.K.6. The rate of increase of a population of a certain type of bug is inversely proportional to its size. At time $t = 0$, the population had 100 bugs. In a month, the population doubled. What will the size of the population be in two months?

Solution. Let us consider a continuous approximation of the number of bugs, and let their amount be denoted by $P$. Then we can build the following equation: $\frac{dP}{dt} = \frac{k}{P}$, whence $P = \sqrt{Kt + c}$. Substituting the given values, we get $P(2) = \sqrt{7} \cdot 100$, which is an estimate of the actual number of bugs. □

8.K.7. Find the equation of the curve with the following properties: it lies in the first quadrant, goes through the point $[1, 3/4]$, and its tangent at any point marks on the positive half-axis $y$ a segment whose length is the same as the distance of that point from the origin. ⃝

8.K.8. Consider a chemical compound C isolated in a container. C is unstable, with the half-life of a molecule equal to $q$ time units. If there were $M$ moles of the compound C in the container at the beginning (i.e., at time $t = 0$), how many moles of it will be there at time $t \geq 0$? ⃝

8.K.9. A 100-gram body stretches a spring by 5 cm when hung on it. Express the dependency of its position on time $t$, provided the speed of the body is 10 cm/s when going through the equilibrium point. ⃝

Further practical problems that lead to differential equations can be found on page 795.

L. Higher-order differential equations

8.L.1. Underdamped oscillation. Now, we will describe a simple model for the movement of a solid object attached to a point with a strong spring. If $y(t)$ is the deviation of our object from the point $y_0 = y(0) = 0$, then we can assume that the acceleration $y''(t)$ in time $t$ is proportional to the magnitude of the deviation, yet with the other sign. The proportionality constant $k$ is called the spring constant. Considering the

Theorem. Consider a function $f(t, y) : \mathbb{R}^2 \to \mathbb{R}$ which is continuous on an open set $U$. Then for every point $(t_0, y_0) \in U \subset \mathbb{R}^2$, there exists a solution of the equation $y' = f(t, y)$ locally in some neighbourhood of $t_0$.

Proof. The proof is presented only roughly, with the details left to the reader. We construct a solution to the right of the initial point $t_0$. For this purpose, select a small step $h > 0$ and label the points $t_k = t_0 + kh$, $k = 1, 2, \dots$ The value of the derivative $f(t_0, y_0)$ of the corresponding curve of the solution $(t, y(t))$ is defined at the initial point $(t_0, y_0)$, so a parametrized line with the same derivative can be substituted: $y_{(0)}(t) = y_0 + f(t_0, y_0)(t - t_0)$. Label $y_1 = y_{(0)}(t_1)$. Construct inductively the functions and points $$y_{(k)}(t) = y_k + f(t_k, y_k)(t - t_k), \qquad y_{k+1} = y_{(k)}(t_{k+1}).$$ Now, define $\tilde y_h(t)$ by gluing the particular linear parts, i.e., $$\tilde y_h(t) = y_{(k)}(t) \quad \text{if } t \in [t_0 + kh,\ t_0 + (k+1)h].$$ This is a continuous function, called the Euler approximation of the solution. It "only" remains to prove that the limit of the functions $\tilde y_h$ for $h$ approaching zero exists and is a solution. For this, one must observe (as done already in the proof of the theorem on uniqueness and existence of the solution) that $f(t, y)$ is uniformly continuous on a sufficiently small neighbourhood $U$ where the solution is sought. For any selected $\varepsilon > 0$, a sufficiently small $\delta$ exists such that $|f(t, y) - f(s, z)| < \varepsilon$ whenever $\|(t - s, y - z)\| < \delta$. Especially, all functions $\tilde y_h$ are in the set of uniformly continuous and equicontinuous functions on a sufficiently small interval. By the Arzelà–Ascoli theorem (see paragraph 7.3.15 on page 683), the constructed continuous functions $\tilde y_h$ are all in a compact set of functions. So there exists a sequence of values $h_n \to 0$ such that the corresponding sequence of functions $\tilde y_{h_n}$ converges uniformly to a continuous function $y(t)$. Write $\hat y_n(t) = \tilde y_{h_n}(t)$, i.e. $\hat y_n \to y$ uniformly. For each of the continuous functions $\tilde y_h$, there are only finitely many points in the interval $[t_0, t]$ where it is not differentiable, so $$\hat y_n(t) = y_0 + \int_{t_0}^{t} \hat y_n'(s)\, ds.$$ On the other hand, the derivatives on the particular intervals are constant, so (here, $k$ is the largest such that $t_0 + kh_n \leq$

case $k = 1$, we get the so-called oscillation equation $y''(t) = -y(t)$. This equation corresponds to the system of equations $x'(t) = -y(t)$, $y'(t) = x(t)$ from 1. The solution of this system is given by $x(t) = R\cos(t - \tau)$, $y(t) = R\sin(t - \tau)$ with an arbitrary non-negative constant $R$, which determines the maximum amplitude, and a constant $\tau$, which determines the initial phase. Therefore, in order to determine a unique solution, we need to know not only the initial position $y_0$, but also the speed of the motion at that moment. These two pieces of information uniquely determine both the amplitude and the initial phase. Moreover, let us imagine that as a result of the properties of the spring material, there is another force which is directly proportional to the instantaneous speed of our object, again with the opposite sign. This is expressed by one more term with the first derivative, so our equation is now $$y''(t) = -y(t) - \alpha y'(t),$$ where $\alpha$ is a constant which expresses the magnitude of the damping. In the following picture, there are the so-called phase diagrams for solutions with two distinct initial conditions, namely with zero damping on the left, and for the value of the coefficient $\alpha = 0.3$ on the right.

[Two phase diagrams ("Tlumené oscilace", damped oscillations): the speed $x(t)$ against the deviation $y(t)$ for $t$ from 0 to 20; undamped on the left, damping $\alpha = 0.3$ on the right.]

The oscillations are expressed by the $y$-axis values; the $x$-axis values describe the speed of the motion.

8.L.2. Undamped oscillation. Find the function $y(t)$ which satisfies the following differential equation and initial conditions: $$y''(t) + 4y(t) = f(t), \qquad y(0) = 0, \quad y'(0) = -1,$$ where the function $f(t)$ is piecewise continuous: $$f(t) = \begin{cases} \cos(2t) & \text{for } 0 \leq t < \pi, \\ 0 & \text{for } t \geq \pi. \end{cases}$$
Solution. This problem is a model of undamped oscillation of a spring (omitting friction, non-linearities in the toughness of the spring, and other factors) which is initiated by an outer force. CHAPTER 8. CALCULUS WITH MORE VARIABLES t, while yj and tj are the points from the definition of the function ˜yhn ) ˆyn(t) = y0 + k−1∑ j=0 ∫ tj+1 tj f(tj, yj)ds + ∫ t tk f(tk, yk) ds. Instead, the equation ˆyn(t) = y0 + ∫ t t0 f(s, ˆyn(s)) ds is wanted, but the difference between this integral and the last two terms in the previous expression is bounded by the possible variation of the function values f(t, ˆy) and the lengths of the intervals. By the universal bound for f(t, y) above, the last integral can be used instead of the actual values in the limit process limn→∞ yn(t), thereby obtaining y(t) = lim n→∞ ( y0 + ∫ t t0 f(s, ˆyn(s)) ds ) = y0 + ∫ t t0 ( lim n→∞ f(s, ˆyn(s)) ) ds = y0 + ∫ t t0 f(s, y(s)) ds, where the uniform convergence ˆyn(t) → y(t) is employed. This proves the theorem. □ 8.3.9. Coupled first-order equations. The problem of finding the solution of the equation y′ = f(x, y) can also be viewed as looking for a (parametrized) curve (x(t), y(t)) in the plane where the parametrization of the variable x(t) = t is fixed beforehand. If this point of view is accepted, then this fixed choice for the variable x can be forgotten, and the work can be carried out with an arbitrary (finite) number of variables. In the plane, for instance, such a system can be written in the form x′ = f(t, x, y), y′ = g(t, x, y) with two functions f, g : R3 → R. A simple example in the plane might be the system of equations x′ = −y, y′ = x. It is easily guessed (or at least verified) that there is a solution of this system, x(t) = R cos t, y(t) = R sin t, with an arbitrary non-negative constant R, and the curves of the solution are exactly the parametrized circles with radius R. In the general case, the vector notation of the system can be used in the form x′ = f(t, x) for a vector function x : R → Rn and a mapping f : Rn+1 → Rn . The validity of the theorem on uniqueness and existence of the solution to such systems can be extended: 772 The function f(t) can be written as a linear combination of Heaviside’s function u(t) and its shift, i. e., f(t) = cos(2t)(u(t) − uπ(t)) Since L(y′′ )(s) = s2 L(y) − sy(0) − y′ (0) = s2 L(y) + 1, we get, applying the results of the above exercises 7 and 8 to the Laplace transform of the right-hand side s2 L(y) + 1 + 4L(y) = L(cos(2t)(u(t) − uπ(t))) = L(cos(2t) · u(t)) − L(cos(2t) · uπ(t)) = L(cos(2t)) − e−πs L(cos(2(t + π)) = (1 − e−πs ) s s2 + 4 . Hence, L(y) = − 1 s2 + 4 + (1 − e−πs ) s (s2 + 4)2 . Performing the inverse transform, we obtain the solution in the form y(t) = −1 2 sin(2t) + 1 4 t sin(2t) + L−1 ( e−πs s (s2 + 4)2 ) . However, by formula (1), we have L−1 ( e−πs s (s2 + 4)2 ) = 1 4 L−1 (e−πs L(t sin(2t))) = (t − π) sin(2(t − π)) · Hπ(t). Since Heaviside’s function is zero for t < π and equal to 1 for t > π, we get the solution in the form y(t) = { −1 2 sin(2t) + 1 4 t sin(2t) for 0 ≤ t < π π−2 4 sin(2t) for t ≥ π □ 8.L.3. Find the general solution of the equation y′′′ − 5y′′ − 8y′ + 48y = 0. Solution. This is a third-order linear differential equation with constant coefficients since it is of the form y(n) + a1y(n−1) + a2y(n−2) + · · · + an−1y′ + any = f(x) for certain constants a1, . . . , an ∈ R. Moreover, we have f(x) ≡ 0, i. e., the equation is homogeneous. 
First of all, we will find the roots of the so-called characteristic polynomial λn + a1λn−1 + a2λn−2 + · · · + an−1λ + an. Each real root λ with multiplicity k corresponds to the k so- lutions eλx , x eλx , . . . , xk−1 eλx and every pair of complex roots λ = α ± iβ with multiplicity k corresponds to the k pairs of solutions eαx cos (βx) , x eαx cos (βx) , . . . , xk−1 eαx cos (βx) , eαx sin (βx) , x eαx sin (βx) , . . . , xk−1 eαx sin (βx) . CHAPTER 8. CALCULUS WITH MORE VARIABLES Existence and uniqueness for systems of ODEs Theorem. Consider functions fi(t, x1, . . . , xn) : Rn+1 → R, i = 1, . . . , n, with continuous partial derivatives. Then, for every point (t0, c) ∈ Rn+1 , c = (c1, . . . , cn), there exists a maximal interval (t0−a, t0+b), with positive numbers a, b ∈ R, and a unique (vector) function x(t) : R → Rn which is the solution of the system of equations x′ 1 = f1(t, x1, . . . , xn) ... x′ n = fn(t, x1, . . . , xn) with the initial condition x(t0) = c, i.e. x1(t0) = c1, . . . , xn(t0) = cn. Proof. The proof is almost identical to the one of the existence and uniqueness of the solution for a single equation with a single unknown function as shown in Theorem 8.3.6. The unknown function x(t) = (x1(t), . . . , xn(t)) is a curve in Rn satisfying the given equation, so its components xi(t) are again expressed in terms of integrals xi(t) = xi(t0) + ∫ t t0 x′ i(s) ds = ci + ∫ t t0 fi(s, x(s)) ds. We work with the integral operator y → L(y), this time mapping curves in Rn to curves in Rn . It is desired to find its fixed point. The proof proceeds in much the same way as in the case 8.3.6. It is only necessary to observe that the size of the vector ∥f(t, z1, . . . , zn) − f(t, y1, . . . , yn)∥ is bounded from above by the sum ∥f(t, z1, . . . , zn) − f(t, y1, z2 . . . , zn)∥ + . . . + ∥f(t, y1, . . . , yn−1, zn) − f(t, y1, . . . , yn)∥. It is recommended to go through the proof of Theorem 8.3.6 from this point of view and to think out the details. □ 8.3.10. Example. When dealing with models in practice, it is of interest to consider the qualitative behaviour of the solution in dependence on the initial conditions and free parameters of the system We consider a simple example of a system of first-order equations from this point of view. The standard population model “predator – prey”, was introduced in the 1920s by Lotka and Volterra. Let x(t) denote the evolution of the number of individuals in the prey population and y(t) for the predators. Assume that the increment of the prey would correspond to the Malthusian model (i.e. exponential growth with coefficient α) if they were not hunted. On the other hand, assume that the predator would only naturally die out if there were no prey (i.e. exponential decrease with coefficient γ). Further, consider an 773 Then, the general solution corresponds to all linear combinations of the above solutions. Therefore, let us consider the polynomial λ3 − 5λ2 − 8λ + 48 with roots λ1 = λ2 = 4, λ3 = −3. Since we know the roots, we can deduce the general solution as well: y = C1e4x + C2x e4x + C3e−3x , C1, C2, C3 ∈ R. □ 8.L.4. Compute y′′′ + y′′ + 9y′ + 9y = ex + 10 cos (3x) . Solution. First, we will solve the corresponding homogeneous equation. The characteristic polynomial is equal to λ3 + λ2 + 9λ + 9, with roots λ1 = −1, λ2 = 3i, λ3 = −3i. The general solution of the corresponding homogeneous equation is thus y = C1e−x +C2 cos (3x)+C3 sin (3x) , C1, C2, C3 ∈ R. 
The solution of the non-homogeneous equation is of the form $$y = C_1 e^{-x} + C_2 \cos(3x) + C_3 \sin(3x) + y_p, \qquad C_1, C_2, C_3 \in \mathbb{R},$$ for a particular solution $y_p$ of the non-homogeneous equation. The right-hand side of the given equation is of a special form. In general, if the non-homogeneous part is given by a function $P_n(x)\,e^{\alpha x}$, where $P_n$ is a polynomial of degree $n$, then there is a particular solution of the form $y_p = x^k R_n(x)\,e^{\alpha x}$, where $k$ is the multiplicity of $\alpha$ as a root of the characteristic polynomial and $R_n$ is a polynomial of degree at most $n$. More generally, if the non-homogeneous part is of the form $e^{\alpha x}\left[P_m(x)\cos(\beta x) + S_n(x)\sin(\beta x)\right]$, where $P_m$ is a polynomial of degree $m$ and $S_n$ is a polynomial of degree $n$, there exists a particular solution of the form $y_p = x^k e^{\alpha x}\left[R_l(x)\cos(\beta x) + T_l(x)\sin(\beta x)\right]$, where $k$ is the multiplicity of $\alpha + i\beta$ as a root of the characteristic polynomial and $R_l, T_l$ are polynomials of degree at most $l = \max\{m, n\}$.

In our problem, the non-homogeneous part is a sum of two functions of the special form (see above). Therefore, we will look for (two) corresponding particular solutions using the method of undetermined coefficients, and then we will add up these solutions. This will give us a particular solution of the original equation (as well as the general solution, then). Let us begin with the function $y = e^x$, which has a particular solution $y_{p_1}(x) = A e^x$ for some $A \in \mathbb{R}$. Since $y_{p_1}(x) = y_{p_1}'(x) = y_{p_1}''(x) = y_{p_1}'''(x) = A e^x$, substitution into the original equation, whose right-hand side contains only the function $y = e^x$, leads to $20A e^x = e^x$, i.e. $A = \frac{1}{20}$. For the right-hand side with the function $y = 10\cos(3x)$, we are looking for a particular solution in the form

interaction of the predator and the prey which is expected to be proportional to the number of both, with a certain coefficient $\beta$, which is, in the case of the predator, supplemented by a multiplicative coefficient expressing the hunting efficiency.

Lotka–Volterra model

This is a system of two equations, $x$ models the prey, $y$ the predator, with positive constants $\alpha$, $\beta$, $\gamma$, $\delta$: $$x' = \alpha x - \beta yx, \qquad y' = -\gamma y + \delta\beta xy.$$

The diagram illustrates one of the typical behaviours of such dynamical systems – the existence of closed orbits on which the system moves in time. These are the thick black ovals, while the "comets" indicate the field at the individual points (i.e. their expected movement). The left illustration corresponds to $\alpha = 1$, $\beta = 1$, $\gamma = 0.3$, $\delta = 0.3$ and the initial condition $(x_0, y_0) = (1, 0.5)$ at $t_0 = 0$ for the solution, while the other illustration comes with $\alpha = 1$, $\beta = 2$, $\gamma = 2$, $\delta = 1$ and $(x_0, y_0) = (1, 1.5)$. In both cases, the system is quite stable in the vicinity of the initial condition, and it would be very stable for $(x_0, y_0) = (1, 1)$ or $(1, 0.5)$, respectively. But their development differs in speed: the depicted solution cycles close at times about $t = 12$ in the first case and $t = 5$ in the other one. (A numerical sketch of the first orbit follows below.) It is interesting that the same model captures quite well the development of the unemployment rate in a population, considering the employees to be the predators, while the employers play the role of the prey. Much information about this and other models can be found in the literature.

Next, we are approaching several qualitative results (i.e., featuring properties of the solutions, without knowing them explicitly).
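A minimal numerical sketch of the first Lotka–Volterra orbit depicted above, assuming Python with SciPy (parameters and initial condition as in the left illustration):

```python
# Lotka-Volterra orbit for alpha=1, beta=1, gamma=0.3, delta=0.3, start (1, 0.5).
from scipy.integrate import solve_ivp

a, b, g, d = 1.0, 1.0, 0.3, 0.3

def field(t, z):
    x, y = z
    return [a*x - b*y*x, -g*y + d*b*x*y]

sol = solve_ivp(field, (0, 12), [1.0, 0.5], dense_output=True, rtol=1e-9)
print(sol.sol(0.0), sol.sol(12.0))   # the orbit should nearly close around t = 12
```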
They are all easily understandable as statements, but the complexity of their proofs perhaps would be too demanding, at least in the first reading. Thus, the readers are advised to focus on the Theorems (do not ignore them, they are all absolutely essential!) and their rough explanations, but skip the proofs. The exposition will come back to easily proved topics in 8.3.16 on page 785. 774 yp2 (x) = x [B cos (3x) + C sin (3x)] . Recall that the number λ = 3i was obtained as a root of the characteristic polynomial. We can easily compute the deriva- tives y′ p2 (x) = [B cos (3x) + C sin (3x)] +x [−3B sin (3x) + 3C cos (3x)] , y′′ p2 (x) = 2 [−3B sin (3x) + 3C cos (3x)] +x [−9B cos (3x) − 9C sin (3x)] , y′′′ p2 (x) = 3 [−9B cos (3x) − 9C sin (3x)] +x [27B sin (3x) − 27C cos (3x)] . Substituting them into the equation, whose right-hand side contains the function y = 10 cos (3x), we get −18B cos (3x) − 18C sin (3x) − 6B sin (3x) + 6C cos (3x) = 10 cos (3x) . Confronting the coefficients leads to the system of linear equa- tions −18B + 6C = 10, −18C − 6B = 0 with the only solution B = −1/2 and C = 1/6, i. e., yp2 (x) = x [ −1 2 cos (3x) + 1 6 sin (3x) ] . Altogether, the general solution is y = C1e−x + C2 cos (3x) + C3 sin (3x) + 1 20 ex − 1 2 x cos (3x) + 1 6 x sin (3x) , C1, C2, C3 ∈ R. □ 8.L.5. Determine the general solution of the equation y′′ + 3y′ + 2y = e−2x . Solution. The given equation is a second-order (the highest derivative of the wanted function is of order two) linear (all derivatives are in the first power) differential equation with constant coefficients. First, we solve the homogenized equa- tion y′′ + 3y′ + 2y = 0. Its characteristic polynomial is x2 + 3x + 2 = (x + 1)(x + 2), with roots x1 = −1 and x2 = −2. Hence, the general solution of the homogenized equation is c1e−x + c2e−2x , where c1, c2 are arbitrary real constants. Now, using the method of undetermined coefficients, we will find a particular solution of the original nonhomogeneous equation. According to the form of the non-homogeneity and since −2 is a root of the characteristic polynomial of the given equation, we are looking for the solution in the form y0 = axe−2x for a ∈ R. Substituting into the original equation, we obtain a[−4e−2x +4xe−2x +3(e−2x −2xe−2x )+2xe−2x ] = e−2x , CHAPTER 8. CALCULUS WITH MORE VARIABLES 8.3.11. Stability of systems of equations. In order to illustrate the stability questions, we discuss just one basic theorem only. We are interested in the continuity with respect the to L∞ norm on the space of functions (i.e. the supremum norm, see 7.3.5). According to the theorem below, the assumption that the partial derivatives of the functions defining the system are continuous (in fact, it suffices to have them Lipschitz), guarantees the continuity of the solutions in dependence on the initial conditions as well as the defining equations themselves. Note however, that as the distance of t from the initial value t0 grows, then the error estimates grow exponentially! Therefore, this result is of a strictly local character. It is not in contradiction with the example of the unstably behaving equation y′ = ty illustrated in paragraph 8.3.3.8 Consider two systems of equations written in the vector form (1) x′ = f(t, x), y′ = g(t, y) and assume that the mappings f, g : U ⊂ Rn+1 → Rn have continuous partial derivatives on an open set U with compact closure. 
Such functions must be uniformly continuous and uniformly Lipschitz on U, so there are the finite values C = sup x̸=y; (t,x), (t,y)∈U ∥f(t, x) − f(t, y)∥ ∥x − y∥ B = sup (t,x)∈U ∥f(t, x) − g(t, x)∥ With this notation, the fundamental theorem can be formu- lated: Theorem. Let x(t) and y(t) be two fixed solutions x′ = f(t, x(t)), y′ = g(t, y(t)) of the systems (1) considered above, given by initial conditions x(t0) = x0 and y(t0) = y0. Then, ∥x(t) − y(t)∥ ≤ ∥x0 − y0∥ eC|t−t0| +B C ( eC|t−t0| −1 ) . Proof. Without loss of generality, t0 = 0. From the expression of the solutions x(t) and y(t) as fixed points of the corresponding integral operators follows the estimate ∥x(t)−y(t)∥ ≤ ∥x0 −y0∥+ ∫ t 0 ∥f(s, x(s))−g(s, y(s))∥ ds. The integrand can be further estimated as follows: ∥f(s, x(s)) − g(s, y(s))∥ ≤ ≤ ∥f(s, x(s)) − f(s, y(s))∥ + ∥f(s, y(s)) − g(s, y(s))∥ ≤ C ∥x(s) − y(s)∥ + B If F(t) = ∥x(t) − y(t)∥, α = ∥x0 − y0∥, then F(t) ≤ α + ∫ t 0 (C F(s) + B) ds. 8Much more information can be found for example in Gerald Teschl’s book Ordinary Differential Equations and Dynamical Systems, Graduate Studies in Mathematics, Volume 140, Amer. Math. Soc., Providence, 2012. 775 hence a = −1. We have thus found the function −xe−2x as a particular solution of the given equation. Hence, the general solution is the function space c1e−x + c2e−2x − xe−2x , c1, c2 ∈ R. □ 8.L.6. Determine the general solution of the equation y′′ + y′ = 1. Solution. The characteristic polynomial of the given equation is x2 + x, with roots 0 and −1. Therefore, the general solution of the homogenized equation is c1 + c2e−x , where c1, c2 ∈ R. We are looking for a particular solution in the form ax, a ∈ R (since zero is a root of the characteristic polynomial). Substituting into the original equation, we get a = 1. The general solution of the given non-homogeneous equation is c1 + c2e−x + x, c1, c2 ∈ R. □ 8.L.7. Determine the general solution of the equation y′′ + 5y′ + 6y = e−2x . Solution. The characteristic polynomial of the equation is x2 + 5x + 6 = (x + 2)(x + 3), its roots are −2 and −3. The general solution of the homogenized equation is thus c1e−2x + c2e−3x , c1, c2 ∈ R. We are looking for a particular solution in the form axe−2x , (−2 is a root of the characteristic polynomial), a ∈ R, using the method of undetermined coefficients. Substitution into the original equation yields a = 1. Hence, the general solution of the given equation is c1e−2x + c2e−3x + xe−2x . □ 8.L.8. Determine the general solution of the equation y′′ − y′ = 5. Solution. The characteristic polynomial of the equation is x2 − x, with roots 1, 0. Therefore, the general solution of the homogenized equation is c1 + c2ex , where c1, c2 ∈ R. We are looking for a particular solution in the form ax, a ∈ R, using the method of undetermined coefficients. The result is a = −5, and the general solution is of the form c1 + c2ex − 5x. □ 8.L.9. Solve the equation y′′ − 2y′ + y = ex x2+1 . Solution. We will solve this non-homogeneous equation using the method of variation of constants. We will thus obtain the solution in the form y = C1(x) y1(x) + C2(x) y2(x) + · · · + Cn(x) yn(x), CHAPTER 8. CALCULUS WITH MORE VARIABLES Such an estimate bound can be exploited further, by the following general result, known as Gronwall’s inequality. Note the similarity with the general solution of linear equa- tions. Lemma. 
Assume a real-valued function F(t) satisfies for all t in the interval [0, tmax] F(t) ≤ α(t) + ∫ t 0 β(s)F(s) ds for some real-valued functions α(t), β(t), with β(t) ≥ 0. Then F(t) ≤ α(t) + ∫ t 0 α(s)β(s) e ∫ t s β(r) dr ds for all t ∈ [0, tmax]. Moreover, if additionally α(t) is nondecreasing, then F(t) ≤ α(t) e ∫ t 0 β(s) ds . Proof of the lemma. Write G(t) = e− ∫ t 0 β(s) ds . By the first assumption of the theorem, d dt ( G(t) ∫ t 0 β(s)F(s) ds ) = = β(t)G(t) ( F(t) − ∫ t 0 β(s)F(s) ds ) ≤ α(t)β(t)G(t). Integrating with respect to t and dividing by the non-zero function G(t) gives ∫ t 0 β(s)F(s) ds ≤ ∫ t 0 α(s)β(s) G(s) G(t) ds, which, having added α(t) to both sides of the inequality, gives the first proposition of the lemma. Assuming that α(t) is non-decreasing, there follows: F(t) ≤ α(t) ( 1 + ∫ t 0 β(s) e ∫ t s β(r) dr ds ) . The integrand is a derivative: −β(s) e ∫ t s β(r) dr = d ds ( e ∫ t s β(r) dr ) , so F(t) ≤ α(t) ( 1 − ∫ t 0 d ds e ∫ t s β(r) dr ds ) = α(t) ( 1 + e ∫ t s β(r) dr −1 ) , and the second proposition of the lemma is also proved. □ Now, the proof of the theorem about continuous dependency on the parameters is easily finished. The bound F(t) ≤ α + ∫ t 0 (C F(s) + B) ds is already obtained, and using a slightly modified function ˜F(t) = F(t) + B C , this yields ˜F(t) ≤ B C + α + ∫ t 0 C ˜F (s) ds. 776 where y1, . . . , yn give the general solution of the corresponding homogeneous equation and the functions C1(x), . . . , Cn(x) can be obtained from the system C′ 1(x) y1(x) + · · · + C′ n(x) yn(x) = 0, C′ 1(x) y′ 1(x) + · · · + C′ n(x) y′ n(x) = 0, ... C′ 1(x) y (n−2) 1 (x) + · · · + C′ n(x) y(n−2) n (x) = 0, C′ 1(x) y (n−1) 1 (x) + · · · + C′ n(x) y(n−1) n (x) = f(x). The roots of the characteristic polynomial λ2 − 2λ + 1 are λ1 = λ2 = 1. Therefore, we are looking for the solution in the form C1(x) ex + C2(x) x ex , considering the system C′ 1(x) ex + C′ 2(x) x ex = 0, C′ 1(x) ex + C′ 2(x) [ex + x ex ] = ex x2 + 1 . We can compute the unknowns C′ 1(x) and C′ 2(x) using Cramer’s rule. It follows from ex x ex ex ex + x ex = e2x , 0 x ex ex x2+1 ex + x ex = −x e2x x2 + 1 , ex 0 ex ex x2+1 = e2x x2 + 1 that C1(x) = − ∫ x x2 + 1 dx = − 1 2 ln ( x2 + 1 ) + C1, C1 ∈ R, C2(x) = ∫ dx x2 + 1 = arctan x + C2, C2 ∈ R. Hence, the general solution is y = C1ex + C2x ex − 1 2 ex ln ( x2 + 1 ) + x ex arctan x, C1, C2 ∈ R. □ 8.L.10. Find the only function y which satisfies the linear differential equation y(3) − 3y′ − 2y = 2ex , with initial conditions y(0) = 0, y′ (0) = 0, y′′ (0) = 0. Solution. The characteristic polynomial is x3 − 3x − 2, with roots 2 and −1 (double). We are looking for a particular solution in the form aex , a ∈ R, easily finding out that it is the function −1 2 ex . The general solution of the given equation is thus c1e2x + c2e−x + c3xe−x − 1 2 ex . CHAPTER 8. CALCULUS WITH MORE VARIABLES This is the assumption of Gronwall’s inequality with even constant parameters, so by the second claim of the lemma, F(t) + B C ≤ (α + B C ) e ∫ t 0 C ds , which is the statement F(t) ≤ α eCt +B C (eCt −1) as desired. □ The continuous dependency on both the initial conditions and the potential further parameters in which the function f would be Lipschitz-continuous follows immediately from the statement of the theorem. The extremely simple equations in one variable x′ = ax, where a are small constants, with their exponential solution x(t) = eat show that better general results cannot be ex- pected. 8.3.12. Differentiable dependance. 
In practical problems, the differentiability of the obtained solutions is often of interest, especially with regard to the initial conditions or other parameters of the system. In the general vector notation of the system of ordinary equations y′ = f(t, y), it can always be supposed that the vector function does not depend explicitly on t. If it does, then another variable y0 can be added to the other variables y1, . . . , yn. Then there is the same system of equations for the curve ˜y′ (t) = (y0(t), y1(t), . . . , yn(t)) as y′ 0 = 1 y′ 1 = f1(y0, y1, . . . , yn) ... y′ n = fn(y0, y1, . . . , yn) with the initial conditions y0(t0) = t0, y1(t0) = x1, . . . , yn(t0) = xn. Such systems, which do not explicitly depend on time, are called autonomous systems of ordinary differential equations. Without loss of generality, we deal with autonomous systems in finite dimension n, dependent on parameters λ and with initial conditions (1) y′ = f(y, λ), y(t0) = x. Without loss of generality, consider the initial value t0 = 0, and write the solution with y(0) = x in the form y(t, x, λ) to emphasize the dependency on the parameters. For fixed values of the initial conditions (and the potential parameters λ), the solution is always once more differentiable than the function f. This can be derived inductively by applying the chain rule. If f is continuously differentiable and y(t) is a solution, then (use the matrix notation where 777 Substituting into the original conditions, we get the only satisfactory function, 2 9 e2x + 5 18 e−x + 1 3 xe−x − 1 2 ex . □ Further problems concerning higher-order differential equations can be found on page 798 M. Applications of the Laplace transform Differential equations with constant coefficients can also be solved using the Laplace transform. 8.M.1. Let L(y)(s) denote the Laplace transform of a function y(t). Integrating by parts, prove that Solution. (1) L(y′ )(s) = sL(y)(s) − y(0) L(y′′ )(s) = s2 L(y) − sy(0) − y′ (0) and, by induction: L(y(n) )(s) = sn L(y)(s) −∑n i=1 sn−i y(i−1) (0) . □ 8.M.2. Find the function y(t) which satisfies the differential equation y′′ (t) + 4y(t) = sin 2t as well as the initial conditions y(0) = 0, y′ (0) = 0. Solution. It follows from the above exercise 7.D.18 that s2 L(y)(s) + 4L(y)(s) = L(sin 2t)(s). We also have L(sin 2t)(s) = 2 s2 + 4 , i. e., L(y)(s) = 2 (s2 + 4)2 . The inverse transform leads to y(t) = 1 8 sin 2t − 1 4 t cos 2t . □ 8.M.3. Find the function y(t) which satisfies the differential equation y′′ (t) + 6y′ (t) + 9y(t) = 50 sin t and the initial conditions y(0) = 1, y′ (0) = 4. Solution. The Laplace transform yields s2 L(y)(s)−s−4+6(sL(y)(s)−1)+9L(y)(s) = 50L(sin t)(s), i. e., (s2 + 6s + 9)L(y)(s) = 50 s2 + 1 + s + 10, L(y)(s) = 50 (s2 + 1)(s + 3)2 + s + 10 (s + 3)2 . Decomposing the first term to partial fractions, we obtain 50 (s2 + 1)(s + 3)2 = As + B s2 + 1 + C s + 3 + D (s + 3)2 , CHAPTER 8. CALCULUS WITH MORE VARIABLES the Jacobi matrix D1 f(y) of the mapping f : Rn → Rn is multiplied with the column vector y′ ) y′′ = D1 f(y) · y′ = D1 f(y) · f(y) exists and is continuous. With all the derivatives up to order two continuous, there is an expression for the third derivative: y(3) = D2 f(y) ( f(y), f(y) ) + ( D1 f(y))2 · f(y). Here, the chain rule is used again, starting with the differential of the bilinear mapping of matrix multiplication and viewing the second derivative as a bilinear object evaluated on y′ in both arguments. Think out the argumentation for this and higher orders in detail. 
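The formula $y'' = D^1 f(y) \cdot f(y)$ can be checked mechanically on any concrete field; a minimal sketch, assuming SymPy (the planar field below is an arbitrary illustrative choice):

```python
# Check y'' = D^1 f(y) . f(y) for the autonomous system y' = f(y) in the plane.
import sympy as sp

t, u, v = sp.symbols('t u v')
y1, y2 = sp.Function('y1')(t), sp.Function('y2')(t)

f = sp.Matrix([-v, u + u*v])              # an arbitrary smooth field f(u, v)
J = f.jacobian([u, v])                    # the Jacobi matrix D^1 f

subs = {u: y1, v: y2}
fy = f.subs(subs)                         # f(y(t))
rhs = J.subs(subs) * fy                   # D^1 f(y) . f(y)

# Differentiate f(y(t)) in t and substitute y' = f(y):
lhs = fy.diff(t).subs({y1.diff(t): fy[0], y2.diff(t): fy[1]})
print(sp.simplify(lhs - rhs))             # the zero matrix, as claimed
```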
Assume for a while that there is a solution y(t, x) of the system (1) which is continuously differentiable in the parameters x ∈ Rn , i.e. the initial condition as well, and forget about the further parameters λ for now. Write Φ(t, x) = D1 x(y(t, x)), for the Jacobi matrix of all partial derivatives with respect to the coordinates xi, which depends on the time t as well as the initial condition x. Its derivative Φ′ (t, x) with respect to t can be computed using the symmetry of partial derivatives and the chain rule: Φ′ (t, x) = d dt ( D1 xy(t, x) ) = D1 x ( y′ (t, x) ) = D1 f(y(t, x)) · D1 xy(t, x) = D1 f(y(t, x)) · Φ(t, x). So the derivatives with respect to the initial conditions along the solution y(t, x) of the system (1) are given as the solutions of a system of n2 first-order equations with initial condition (2) Φ′ (t, x) = F(t, x) · Φ(t, x), Φ(0, x) = E, where F(t, x) = D1 f(y(t, x)), and the initial condition comes out from the identity y(0, x) = x. The unique existence of the solution of this (matrix) system and its continuous dependence on the parameters have already been proved. The following theorem says that for systems (1) with continuously differentiable right-hand sides f, the derivatives with respect to the initial condition can be obtained in this way. Differentiability of the solutions Theorem. Consider an open subset U ⊂ Rn+k and a mapping f : U → Rn with continuous first derivatives. Then, a system of differential equations dependent on a parameter λ ∈ Rk with initial condition at a point x ∈ U y′ = f(y, λ), y(0) = x has a unique solution y(t, x, λ), which is a mapping with continuous first derivatives with respect to each variable. 778 so 50 = (As + B)(s + 3)2 + C(s2 + 1)(s + 3) + D(s2 + 1). Substituting s = −3, we get 50 = 10D hence D = 5 and confronting the coefficients at s3 , we have 0 = A + C, hence A = −C. Confronting the coefficients at s, we obtain 0 = 9A + 6B + C = 8A + 6B, hence B = 4 3 C. Finally, confronting the absolute term, we infer 50 = 9B + 3C + D = 12C + 3C + 5 hence C = 3, B = 4, A = −3. Since s + 10 (s + 3)2 = s + 3 + 7 (s + 3)2 = 1 s + 3 + 7 (s + 3)2 , we have L(y)(s) = −3s + 4 s2 + 1 + 3 s + 3 + 5 (s + 3)2 + 1 s + 3 + 7 (s + 3)2 = −3s s2 + 1 + 4 s2 + 1 + 4 s + 3 + 12 (s + 3)2 . Now, the inverse Laplace transform yields the solution in the form y(t) = −3 cos t + 4 sin t + 4e−3t + 12te−3t . □ 8.M.4. Find the function y(t) which satisfies the differential equation y′′ (t) = cos (πt) − y(t), t ∈ (0, +∞) and the initial conditions y(0) = c1, y′ (0) = c2. Solution. First, we should emphasize that it follows from the theory of ordinary differential equations that this equation has a unique solution. Further, we should recall that L (f′′ ) (s) = s2 L (f) (s) − s lim t→0+ f(t) − lim t→0+ f′ (t) and L (cos (bt)) (s) = s s2+b2 , b ∈ R. Applying the Laplace transform to the given differential equation then gives s2 L (y) (s) − sc1 − c2 = s s2+π2 − L (y) (s), i. e., (1) L (y) (s) = s (s2 + 1) (s2 + π2) + c1s s2 + 1 + c2 s2 + 1 . Therefore, it suffices to find a function y which satisfies (1). Performing partial fraction decomposition, we obtain s (s2+1)(s2+π2) = 1 π2−1 ( s s2+1 − s s2+π2 ) . The above expression of L (cos (bt)) (s) and the already proved formula L (sin t) (s) = 1 s2+1 CHAPTER 8. CALCULUS WITH MORE VARIABLES Proof. Consider a general system dependent on parameters, but viewed as an ordinary autonomous system with no parameters. 
More explicitely, consider the parameters to be additional space variables and add (vector) conditions λ′ (t) = 0 and λ(0) = λ. Therefore, the theorem is proved for autonomous systems with no further parameters. There is dependency on the initial condi- tions. Just as in the proof of the fundamental existence theorem 8.3.6, build on the expression of the solutions as fixed points of the integral operators and prove that the expected derivative, as discussed above, enjoys the properties of the differential. Fix a point x0 as the initial condition, together with a small neighbourhood x0 ∈ V , which if necessary can be further decreased during the following estimates, so that ∥f(y) − f(z)∥ ≤ C ∥y − z∥ on this neighbourhood by the Lipschitz property. It is already deduced that if the derivative Φ(t, x) = D1 xy(t, x) of the solution y(t, x) exists, then it must be uniquely given by the equation (2) wit the proper initial conditions. Therefore, define Φ(t, x) by this equation and examine the expression G(t, h) = ∥y(t, x0 + h) − y(t, x0) − Φ(t, x0)(h)∥ with small increments h ∈ Rn . In order to prove that the continuous derivative exists, it is necessary to show that lim h→0 1 ∥h∥ G(t, h) = 0. Several estimates are needed for this purpose. First, from the latter theorem 8.3.11 about continuous dependence on initial conditions, the estimate ∥y(t, x0 + h) − y(t, x0)∥ ≤ ∥h∥ eC|t| follows immediately. In the next step, use Taylor’s expansion of f with remainder: f(y) − f(z) = D1 f(z) · (y − z) + R(y, z), where R(y, z) satisfies R(y, z) ∥y − z∥ → 0 for ∥y − z∥ → 0. This implies the crucial estimate. In the first equality substitute in the expression of solutions in terms of fixed points of the integral operators. Next, exploit the definition of the mapping Φ(t, x0) in terms of its derivative (write F(t, x) = D1 f(y(t, x)) again and notice that its initial condition Φ(0, x)(h) = h implies the vanishing of the h 779 then yield the wanted solution y(t) = 1 π2−1 (cos t − cos (πt)) + c1 cos t + c2 sin t . □ 8.M.5. Solve the system of differential equations x′′ (t) + x′ (t) = y(t) − y′′ (t) + et , x′ (t) + 2x(t) = −y(t) + y′ (t) + e−t with the initial conditions x(0) = 0, y(0) = 0, x′ (0) = 1, y′ (0) = 0. Solution. Again, we apply the Laplace transform. This, using L (e±t ) (s) = 1 s∓1 , transforms the first equation to s2 L (x) (s) − s lim t→0+ x(t) − lim t→0+ x′ (t) + sL (x) (s) − lim t→0+ x(t) = = L (y) (s)− ( s2 L (y) (s) − s lim t→0+ y(t) − lim t→0+ y′ (t) ) + 1 s−1 and the second one to sL (x) (s) − lim t→0+ x(t) + 2L (x) (s) = = −L (y) (s) + sL (y) (s) − lim t→0+ y(t) + 1 s+1 . Evaluating the limits (according to the initial conditions), we obtain the linear equations s2 L (x) (s)−1+sL (x) (s) = L (y) (s)−s2 L (y) (s)+ 1 s−1 and sL (x) (s) + 2L (x) (s) = −L (y) (s) + sL (y) (s) + 1 s+1 with the only solution L (x) (s) = 2s−1 2(s−1)(s+1)2 , L (y) (s) = 3s 2(s2−1)2 . Once again, we perform partial fraction decomposition, get- ting L (x) (s) = 1 8 1 s−1 + 3 4 1 (s+1)2 − 1 8 1 s+1 = 3 4 1 (s+1)2 + 1 4 1 s2−1 . Since we have already computed that L (t e−t ) (s) = 1 (s+1)2 , L (sinh t) (s) = 1 s2−1 , L (t sinh t) (s) = 2s (s2−1)2 , we get x(t) = 3 4 t e−t + 1 4 sinh t, y(t) = 3 4 t sinh t. We definitely advise the reader to verify that these functions of x and y are indeed the wanted solution. The reason is that the Laplace transforms of the functions y = et , y = sinh t and y = t sinh t were obtained only for s > 1). □ 8.M.6. 
Find the solution of the following system of differential equations: x′ (t) = −2x(t) + 3y(t) + 3t2 , y′ (t) = −4x(t) + 5y(t) + et , x(0) = 1, y(0) = −1 Solution. L(x′ )(s) = L(−2x + 3y + 3t2 )(s), L(y′ )(s) = L(−4x + 5y + et )(s). CHAPTER 8. CALCULUS WITH MORE VARIABLES summand). G(t, h) = x0 + h + ∫ t 0 f(y(s, x0 + h))ds − x0 − ∫ t 0 f(y(s, x0))ds − Φ(t, x0)(h) = h + ∫ t 0 ( f(y(s, x0 + h)) − f(y(s, x0)) − F(s, x0)Φ(s, x0)(h) ) ds − h ≤ ∫ t 0 f(y(s, x0 + h)) − f(y(s, x0)) − F(s, x0)Φ(s, x0)(h)∥ ds ≤ ∫ t 0 ∥F(s, x0)∥ ∥y(s, x0 +h) − y(s, x0) − Φ(s, x0)(h)∥ ds + ∫ t 0 ∥R(y(s, x0 + h), y(s, x0))∥ ds, where the norm on the matrices is taken as the maximum of the absolute values of their entries. Since F(t, x) is continuous, there is a uniform bound of its norm in the neighbourhood V given by ∥F(t, x0)∥ ≤ B, for all |t| < T with a sufficiently small T to ensure the solutions remain in the neighbourhood V . At the same time, for any fixed constant ε > 0, there is a bound ∥h∥ < δ for which the remainder R satisfies ∥R(y(t, x0 + h), y(t, x0))∥ ≤ ε∥y(t, x0 + h) − y(t, x0)∥ ≤ ∥h∥ε eCT . Therefore, the estimate on G(t, h) can be improved as fol- lows: G(t, h) ≤ B ∫ t 0 G(s, h) ds + ε∥h∥ eCT . Gronwall’s lemma now gives G(t, h) ≤ ε∥h∥ e(C+B)T . This implies that limh→0 1 ∥h∥ G(t, h) = 0 as requested. □ In the same way, it can be proved that continuous differentiability of the right-hand side up to order k (inclusive) guarantees the same order of differentiability of solutions in all input parameters. 8.3.13. The analytic case. Let us pay additional attention to the case when the right hand side f of the system of equations (1) y′ = f(y), y(t0) = y0 is analytic in all arguments (i.e., a convergent multidimensional power series f(y) = ∑∞ |α|=0 1 α! ∂f|α| ∂yα yα , see 8.1.15). Exactly as in the previous discussion, we may hide the time variable t as well as further parameters in the variables. 780 The left-hand sides can be written using (1), while the righthand sides can be rewritten thanks to linearity of the L operator. Since L(3t2 )(s) = 6 s3 and L(et )(s) = 1 s−1 , we get the system of linear equations sL(x)(s) − 1 = −2L(x)(s) + 3L(y)(s) + 6 s3 , sL(y)(s) + 1 = −4L(x)(s) + 5L(y)(s) + 1 s−1 . In matrices, this is A(s)ˆx(s) = b(s), where A(s) = ( s + 2 −3 4 s − 5 ) , ˆx(s) = ( L(x)(s) L(y)(s) ) and b(s) = ( 1 + 6 s3 −1 + 1 s−1 ) . Cramer’s rule says that L(x)(s) = |A1| |A| , L(y)(s) = |A2| |A| , where |A| = s + 2 −3 4 s − 5 = s2 − 3s + 2, |A1| = 1 + 6 s3 −3 −1 + 1 s−1 s − 5 = (s − 5)(1 + 6 s3 ) + 3(−1 + 1 s−1 ) |A2| = s + 2 1 + 6 s3 4 −1 + 1 s−1 = (s + 2)(−1 + 1 s−1 ) − 4 − 24 s3 . Hence, L(x)(s) = 1 (s − 1)(s − 2) ( (s − 5)(s3 + 6) s3 − 3 s − 2 s − 1 ) , L(y)(s) = 1 (s − 1)(s − 2) ( (s + 2)(2 − s) s − 1 − 4s3 + 24 s3 ) . Decomposing to partial fractions, the Laplace images of the solutions can be expressed as follows: L(x)(s) = − 39 2s2 − 3 (s−1)2 + 28 s−1 − 21 4(s−2) − 15 s3 − 87 4s , L(x)(s) = −18 s2 − 3 (s−1)2 + 27 s−1 − 7 s−2 − 12 s3 − 21 s . Now, the inverse transform yields the solution of this Cauchy problem: x(t) = −39 2 t − 3tet + 28et − 21 4 e2t − 15 2 t2 − 87 4 , y(t) = −18t − 3tet +27et − 7e2t − 6t2 − 21 . □ N. Numerical solution of differential equations Now, we present two simple exercises on applying the Euler method for solving differential equations. 8.N.1. Use the Euler method to solve the equation y′ = −y2 with the initial condition y(1) = 1. Determine the approximate solution on the interval [1, 3]. Try to estimate for which value h of the step is the error less than one tenth. 
Solution. The Euler method for the considered equation is given by yk+1 = yk − h · y2 k for x0 = 1, y0 = 1, xk = x0 + k · h, yk = y(xk). We begin the procedure with step value h = 1 and halve it in each iteration. The estimate for the “sufficiency” of h will be made somewhat imprecisely by comparing two adjacent CHAPTER 8. CALCULUS WITH MORE VARIABLES The famous theorem below says that the solution of the most general system with analytic right-hand side is analytic in all the parameters as well (including the initial conditions). ODE version of Cauchy-Kovalevskaya Theorem Theorem. Assume f(y) is a real analytic vector valued function on a domain in Rn and consider the differential equation (1). Then the unique solution of this initial problem is real analytic, including the dependancy on the initial condition. Proof. The idea of the proof is identical as in the simple one-dimensional case in 6.2.22. As we saw in the beginning of the previous paragraph, there are universal (multidimensional) polynomial expressions for all derivatives of the vector function y(t) in terms of the partial derivatives of the vector function f. If we expand them in terms of the individual partial derivatives of the mapping f all of their coefficients are obviously non-negative. Let us write again y(k) (0) = Pk(f(y(0)), . . . , ∂βf(y(0)), . . . ) for these multivariate vector valued polynomials (the multiindices β in the arguments are all of size up to k − 1). Without loss of generality we may consider the initial condition t0 = 0, y(0) = 0. Indeed, constant shifts of the variables (say z = y − y0, x = t − t0) transform the general case to this one. Once we know that the components of the solution are power series, the transformed quantities will be analytic too, including the dependancy on the values of the incital conditions. In order to prove that the solution to the problem y′ = f(y), y(0) = 0 is analytic on a neighborhood of the origin, we shall again look for a majorant g for the vector equation y′ = f(y), i.e. we want an analytic function on a neighborhood of the origin 0 ∈ Rn with ∂αg(0) ≥ |∂αf(0)|, for all multi-indices α. Then, by the universal computations of all the coefficients of the power series y(t) = ∑∞ k=0 1 k! y(k) (0)tk solving potentially our problem, and similarly for z′ = g(z), the convergence of the series for z implies the same for y: z(k) (0) = Pk ( g(0), . . . , ∂βg(0), . . . ) ≥ Pk ( |f(0)|, . . . , |∂βf(0)| ) ≥ |y(k) (0)|. As usual, knowing already how to find a majorant in a simpler case, we try to apply a straightforward modification. By the analycity of f, for r > 0 small enough there is a constant C such that | 1 α! ∂αfi(0)r|α| | ≤ C, for all i = 1, . . . , n and mutli-indices α. This means |∂αfi(0)| ≤ C α! r|α| . In the 1-dimensional case, we considered the multiple of a geometric series g(z) = C r r−z with the right derivatives g(n) = C n! rn . Now the most similar mapping is g(z1, . . . , zn) = (g1(z1, . . . , zn), . . . , gn(z1, . . . , zn)) with 781 approximate values of the function y at common points, terminating the procedure if the maximum of the absolute difference of these values is not greater than the tolerated error (0.1). The results h0 = 1 y(0) = (1 0 0) h1 = 0.5 y(1) = (1 0.5 0.375 0.3047 0.2583) Maximal difference: 0.375. h2 = 0.25 y(2) = (1.0000 0.7500 0.6094 0.5165 0.4498 0.3992 0.3594 0.3271 0.3004) Maximal difference: 0.1094. 
h₃ = 0.125: y(3) = (1.0000, 0.8750, 0.7793, 0.7034, 0.6415, 0.5901, 0.5466, 0.5092, 0.4768, 0.4484, 0.4233, 0.4009, 0.3808, 0.3627, 0.3462, 0.3312, 0.3175); maximal difference: 0.0322.

Using suitable software, a graphical representation of the results can be obtained. [Figure: the Euler approximations for the steps h = 1, 0.5, 0.25, 0.125 on the interval [1, 3]; the dashed curve corresponds to the exact solution, which is the function y = 1/x.] □

8.N.2. Using the Euler method, solve the equation y′ = −2y with the initial condition y(0) = 1 and step value h = 1. Explain the phenomenon which occurs here and suggest another procedure.

Solution. In this case, the Euler method is given by
\[ y_{k+1} = y_k - h \cdot 2y_k = -y_k. \]
For the initial condition y₀ = 1, we get the alternating values ±1 as the result. This is a typical manifestation of the instability of this method for large step values h. If the step cannot be reduced for some reason (for instance, when processing digital data, the step value is fixed), better results can be achieved by the so-called implicit Euler method. For a general equation y′ = f(x, y), it is given by the formula
\[ y_{k+1} = y_k + h \cdot f(x_{k+1}, y_{k+1}). \]

all the components gᵢ equal to the function h : Rⁿ → R,
\[ h(z_1, \dots, z_n) = C\,\frac{r}{r - z_1 - \dots - z_n}. \]
Then the values of all the partial derivatives with |α| = k at z = 0 are
\[ \partial^{\alpha}h(0) = C r\, k!\,(r - z_1 - \dots - z_n)^{-k-1}\big|_{z=0} = C\,\frac{k!}{r^k}, \]
exactly as suitable. (Check the latter simple computation yourself!)

So it remains to prove that the majorant system z′ = g(z) has a converging power series solution z. Obviously, by the symmetry of g (all components equal to the same h, and h is symmetric in the variables zᵢ), the solution z with z(0) = 0 must also have all components equal (the system does not see any permutation of the variables zᵢ at all). Let us write zᵢ(t) = u(t) for the common solution components. With this ansatz,
\[ u'(t) = h(u(t), u(t), \dots, u(t)) = C\,\frac{r}{r - n\,u(t)}. \]
This is nearly exactly the same equation as the one in 6.2.22, and we can easily see its solution with u(0) = 0:
\[ u = \frac{r}{n}\Big(1 - \sqrt{1 - \frac{2nCt}{r}}\Big). \]
Clearly, this is an analytic solution and the proof is finished. □

8.3.14. Vector fields and their flows. Before going to higher-order equations, pause to consider systems of first-order equations from the geometrical point of view. When drawing illustrations of solutions earlier, we already viewed the right-hand side of an autonomous system as a "field of vectors" f(x) ∈ Rⁿ. This shows how fast and in which direction the solution should move in time. This can be formalized.

A tangent vector with a footpoint x ∈ Rⁿ is a couple (x, v) ∈ Rⁿ × Rⁿ. The set of all vectors with footpoints in an open set U ⊂ Rⁿ is called the tangent bundle TU, with the footpoint projection p : (x, v) ↦ x. A vector field X defined on an open set U ⊂ Rⁿ is a mapping X : U → TU which is a section of the projection p, i.e., p ∘ X = id_U.

The derivative in the direction of the vector field X is defined for all differentiable functions g on U by X(g) : U → R, X(g)(x) = dg(x)(X(x)). So the vector field X is a first order linear differential operator mapping functions to functions. Applying the properties of the directional derivative pointwise, we obtain the derivative rule (also called the Leibniz rule) for products of functions:

(1) X(gh) = hX(g) + gX(h).

In fixed coordinates, X(x) = (X₁(x), …, Xₙ(x)) and
\[ X(g)(x) = X_1(x)\frac{\partial g}{\partial x_1}(x) + \dots + X_n(x)\frac{\partial g}{\partial x_n}(x). \]
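The step-halving experiment from 8.N.1 can also be reproduced programmatically. The following Python sketch is only our own illustration (the helper euler and the printing format are our choices, not part of the text); it reprints the maximal differences 0.375, 0.1094 and 0.0322 obtained above.

    import numpy as np

    def euler(f, t0, y0, t_end, h):
        # the explicit Euler scheme y_{k+1} = y_k + h * f(t_k, y_k)
        ts = np.arange(t0, t_end + h / 2, h)
        ys = np.empty_like(ts)
        ys[0] = y0
        for k in range(len(ts) - 1):
            ys[k + 1] = ys[k] + h * f(ts[k], ys[k])
        return ys

    f = lambda t, y: -y * y                      # the equation y' = -y^2 from 8.N.1
    h, y_prev = 1.0, euler(f, 1.0, 1.0, 3.0, 1.0)
    while True:
        h /= 2
        y = euler(f, 1.0, 1.0, 3.0, h)
        diff = np.max(np.abs(y[::2] - y_prev))   # compare at the common points
        print(f"h = {h:7.4f}, maximal difference = {diff:.4f}")
        if diff <= 0.1:                          # the tolerated error
            break
        y_prev = y

Halving the step until two successive approximations agree within the tolerance is exactly the imprecise stopping rule described in the solution above.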
In general, we thus have to solve a non-linear equation in each step. However, in our problem, we get y_{k+1} = y_k − 2h·y_{k+1}, so for h = 1 we have y_{k+1} = (1/3)y_k. Again, the obtained results can be represented graphically, together with the exact solution of the equation. [Figure: the computed approximations on [0, 10] together with the exact solution y = e^{−2x}.] □

Clearly, there are the special vector fields whose coordinate functions are all zero except for one function Xᵢ which is identically one. Such a field corresponds to the partial derivative with respect to the variable xᵢ. This is also matched by the common notation ∂/∂xᵢ for such vector fields, and in general,
\[ X(x) = X_1(x)\frac{\partial}{\partial x_1} + \dots + X_n(x)\frac{\partial}{\partial x_n}. \]

Remark. Actually, each derivative on functions, i.e., a linear operator D satisfying (1), is given by a unique vector field. This may be seen as follows. First, D(1) = D(1 · 1) = 2D(1), and thus D(c) = 0 for constant functions. Next, each function f(x) can be written on a neighborhood of a point q ∈ Rⁿ as
\[ f(x) = f(q) + \int_0^1 \frac{d}{dt} f(q + t(x - q))\,dt = f(q) + \sum_{i=1}^n \int_0^1 \frac{\partial f}{\partial x_i}(q + t(x - q))\,dt\,(x_i - q_i) = f(q) + \sum_{i=1}^n \alpha_i(x)(x_i - q_i). \]
Thus,
\[ D(f) = 0 + \sum_{i=1}^n D(\alpha_i)(q_i - q_i) + \sum_{i=1}^n \alpha_i(q) D(x_i) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(q)\, D(x_i). \]
Defining the components Xᵢ = D(xᵢ) of the vector field X, we have obtained D as the derivative in the direction of X.

We shall write X(U) for the set of all smooth vector fields on U, i.e. those with all components Xᵢ smooth. The vector fields ∂/∂xᵢ can be perceived as generators of X(U), admitting smooth functions as the coefficients in linear combinations.

We return to the problem of finding the solution of a system of equations. Rephrase it equivalently as finding a curve which satisfies x′(t) = X(x(t)) for each value x(t) in the domain of the vector field X. In words: the tangent vector of the curve is given, at each of its points, by the vector field X. Such a curve is called an integral curve of the vector field X, and the mapping Fl^X_t : Rⁿ → Rⁿ, defined at a point x₀ as the value of the integral curve x(t) satisfying x(0) = x₀, is called the flow of the vector field X.

The theorem about the existence and uniqueness of solutions of systems of equations (cf. 8.3.6) says that for every continuously differentiable vector field X, its flow exists at every point x₀ of the domain for sufficiently small values of t. The uniqueness guarantees that Fl^X_{t+s}(x) = Fl^X_t ∘ Fl^X_s(x) whenever both sides exist. In particular, the mappings Fl^X_s and Fl^X_t always commute. Moreover, the mapping Fl^X_t(x) with a fixed parameter t is differentiable at all points x where it is defined, cf. 8.3.12.

If a vector field X is defined on all of Rⁿ, and if its support is compact (i.e., X(x) = 0 off a compact set K ⊂ Rⁿ), then its flow clearly exists at all points and for all values of t. Vector fields with flows existing for all t ∈ R are called complete. The flow of a complete vector field consists of (mutually commuting) diffeomorphisms Fl^X_t : Rⁿ → Rⁿ with inverse diffeomorphisms Fl^X_{−t}.

A simple example of a complete vector field is the field X(x) = ∂/∂x₁. Its flow is given by Fl^X_t(x₁, …, xₙ) = (x₁ + t, x₂, …, xₙ). On the other hand, the vector field X(x) = x² d/dx on the one-dimensional space R is not complete, since the solutions x(t) of the corresponding equation x′ = x² are of the form x(t) = 1/(C − t), except for the one with initial condition x(0) = 0, so they "run away" towards infinite values in a finite time.
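The incompleteness of the last vector field can be watched numerically, too. A minimal sketch, assuming nothing beyond the example itself (the escape threshold 1e8 and the step 1e-4 are arbitrary choices of ours): integrating x′ = x² with x(0) = 1 by the explicit Euler method, the computed trajectory leaves every bounded region close to the blow-up time t = 1 of the exact solution x(t) = 1/(1 − t).

    # x' = x^2 with x(0) = 1: the exact solution 1/(1 - t) blows up at t = 1
    h, t, x = 1e-4, 0.0, 1.0
    while x < 1e8 and t < 2.0:       # 1e8 is an arbitrary escape threshold
        x += h * x * x               # one explicit Euler step
        t += h
    print(t)                         # prints a value close to 1.0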
The points x₀ in the domain of a vector field X : U ⊂ Rⁿ → Rⁿ where X(x₀) = 0 are called singular points of the vector field X. Clearly Fl^X_t(x₀) = x₀ for all t at all singular points.

8.3.15. Local qualitative description. The description of vector fields as an assignment of a tangent vector in the modelling space to each point of the Euclidean space is independent of the coordinates. It follows that the flows exhibit a geometric concept which must be coordinate-free. It is necessary to know what happens to the fields and their flows when the coordinates are transformed.

Suppose y = F(x) is such a transformation with F : Rⁿ → Rⁿ (or on some smaller domain there). Then the solutions x(t) to a system x′ = X(x) satisfy x′(t) = X(x(t)), and in the transformed coordinates this reads
\[ y'(t) = \big(F(x(t))\big)'(t) = D^1F(x(t)) \cdot x'(t) = D^1F(x(t)) \cdot X(x(t)). \]
This means that the "transformed field" Y in the new coordinates is Y(F(x)) = D¹F(x) · X(x). At the same time, the flows of these vector fields are related as follows:
\[ \mathrm{Fl}^Y_t \circ F(x) = F \circ \mathrm{Fl}^X_t(x). \]
Indeed, fixing x = x₀ and writing x(t) = Fl^X_t(x₀), the curve F(x(t)) is the unique solution of the system of equations y′ = Y(y) with initial condition y₀ = F(x₀), and this is exactly the right-hand side.

The following theorem offers a local qualitative geometric description of all solutions of systems of first order ordinary differential equations in a neighbourhood of each point x which is not singular.

The flowbox theorem

Theorem. If X is a differentiable vector field defined on a neighbourhood of a point x₀ ∈ Rⁿ and X(x₀) ≠ 0, then there exists a transformation of coordinates F such that in the new coordinates y = F(x), the vector field X is given as the field ∂/∂y₁.

Proof. Construct a diffeomorphism F with the required properties step by step. Geometrically, the essence of the proof can be summarized as follows: first select a hypersurface which goes through the point x₀ and is complementary to the directions X(x) near x₀. Then fix the coordinates on it, and finally, extend them to some neighbourhood of the point x₀ using the flow of the field X.

Without loss of generality, move the point x₀ to the origin by a translation. Then, by a suitable linear transformation on Rⁿ, arrange X(0) = ∂/∂x₁(0). Let us write ξ for the vector field in question, independently of any coordinates. In the standard coordinates on Rⁿ, write the flow Fl^ξ_t of the field ξ going through the point (x₁, …, xₙ) at time t = 0 as
\[ x_i(t) = \varphi_i(t, x_1, \dots, x_n), \qquad i = 1, \dots, n. \]
Next, define the new coordinates y = (y₁, …, yₙ) by y = Fl^ξ_{y₁}(0, y₂, …, yₙ), which corresponds to the inverse transformation F⁻¹ of the diffeomorphism F = (f₁, …, fₙ) with the components fᵢ(x₁, …, xₙ) = φᵢ(x₁, 0, x₂, …, xₙ). This follows exactly the strategy using the hypersurface x₁ = 0.

Since ξ(0, …, 0) = ∂/∂x₁, we get
\[ \frac{\partial F}{\partial x_1}(0, \dots, 0) = \frac{d}{dt}\Big|_0 \big(\varphi_1(t, 0, \dots, 0), \dots, \varphi_n(t, 0, \dots, 0)\big) = (1, 0, \dots, 0), \]
while the flow Fl^ξ_0 at the time t = 0 yields (φ₁, …, φₙ)(0, 0, x₂, …, xₙ) = (0, x₂, …, xₙ), and in particular
\[ \frac{\partial F}{\partial x_i}(0, \dots, 0) = (0, \dots, 1, \dots, 0), \qquad i = 2, \dots, n. \]
Therefore, the Jacobi matrix of the mapping F at the origin is the identity matrix E, so F is a transformation of coordinates on some neighbourhood (see the inverse mapping theorem in paragraph 8.1.23). Directly from the definition of the mapping F⁻¹ we can compute
\[ F^{-1}\big(\mathrm{Fl}^{\xi}_t(y)\big) = F^{-1}\big(\mathrm{Fl}^{\xi}_{t+y_1}(0, y_2, \dots, y_n)\big) = F^{-1} \circ F(t + y_1, y_2, \dots, y_n) = (t + y_1, y_2, \dots, y_n), \]
and this is the desired coordinate description of the flow in the new coordinates. □

8.3.16. Higher-order equations. An ordinary differential equation of order k (solved with respect to the highest derivative) is an equation

(1) $y^{(k)} = f(t, y, y', \dots, y^{(k-1)})$,

where f is a known function of k + 1 variables, t is the independent variable, and y(t) is an unknown function of one variable. This type of equation is always equivalent to a system of k first-order equations. Introduce new unknown functions of the variable t as follows:
\[ y_0(t) = y(t),\quad y_1(t) = y_0'(t),\ \dots,\ y_{k-1}(t) = y_{k-2}'(t). \]
Now, the function y(t) is a solution of the original equation (1) if and only if it is the first component of the solution of the system of equations
\[ y_0' = y_1,\quad y_1' = y_2,\quad \dots,\quad y_{k-2}' = y_{k-1},\quad y_{k-1}' = f(t, y_0, y_1, \dots, y_{k-1}). \]
Hence the following direct corollary of the theorems 8.3.9–8.3.12:

Solutions of higher-order ODEs

Theorem. Consider a function f(t, y₀, …, y_{k−1}) : U ⊂ R^{k+1} → R with continuous partial derivatives on an open set U. Then for every point (t₀, z₀, …, z_{k−1}) ∈ U, there exists a maximal interval I_max = (t₀ − a, t₀ + b), with positive numbers a, b ∈ R, and a unique function y(t) : I_max → R which is a solution of the k-th order equation
\[ y^{(k)} = f(t, y, y', \dots, y^{(k-1)}) \]
with the initial conditions y(t₀) = z₀, y′(t₀) = z₁, …, y^{(k−1)}(t₀) = z_{k−1}.

This solution depends differentiably on the initial conditions and on potential further parameters entering the function f differentiably. Moreover, the solution is analytic if the latter dependence is analytic.

In particular, the theorem shows that in order to determine the solution of an ordinary k-th order differential equation unambiguously, the values of the solution and of its first k − 1 derivatives must be prescribed at one point. With a system of ℓ equations of order k, the same procedure transforms the system into a system of kℓ first-order equations. Therefore, an analogous statement about existence, uniqueness, continuity, and differentiability is also true. If the right-hand side f of the equation is differentiable up to order r, or analytic, including the parameters, then the same property is enjoyed by the solutions as well.

8.3.17. Linear differential equations. The operation of differentiation can be viewed as a linear mapping from (sufficiently) smooth functions to functions. Multiplying the derivatives (d/dt)^j of the particular orders j by fixed functions a_j(t) and adding these expressions gives the linear differential operators y(t) ↦ D(y)(t):
\[ D(y)(t) = a_k(t)\,y^{(k)}(t) + \dots + a_1(t)\,y'(t) + a_0(t)\,y(t). \]
To solve the corresponding homogeneous linear differential equation of order k then means finding a function y satisfying D(y) = 0.

The sum of two solutions is again a solution, since for any functions y₁ and y₂,
\[ D(y_1 + y_2)(t) = D(y_1)(t) + D(y_2)(t). \]
A constant multiple of a solution is again a solution. So the set of all solutions of a k-th order homogeneous linear differential equation is a vector space. Applying the previous theorem about existence and uniqueness, we obtain the following:

The space of solutions of linear equations

Theorem. The set of all solutions of a homogeneous linear differential equation of order k with continuously differentiable coefficients is a vector space of dimension k. Therefore, the solutions can be described as linear combinations of any set of k linearly independent solutions.
Such solutions are determined uniquely by linearly independent initial conditions on the value of the function y(t) and its first k − 1 derivatives at a fixed point t0. Proof. Choose k linearly independent initial conditions at a fixed point. For each of them, there is a unique solution. A linear combination of these initial condition then leads to the same linear combination of the corresponding solutions. All of the possible initial conditions are exhausted, so the entire space of solutions of the equation is obtained in this way. □ The same arguments as with the first order linear differential equations in the paragraph 8.3.4 reveal that all solutions of the non-homogeneous k-th order equation D(y) = b(t) with a fixed continuous function b(t) are the sums of one fixed solution y(t) of this problem and all solutions ˜y of the corresponding homogeneous equation. Thus the entire space of solutions is an affine k-dimensional space of functions. The method of variation of constants exploited in 8.3.4 is one of the possible approaches to guess one non-homogeneous solution if we know the complete solution to the homogeneous problem. We shall illustrate the latter results on the most simple case: 8.3.18. Linear equations with constant coefficients. The previous discussion recalls the situation with homogeneous linear difference equations dealt with in paragraph 3.2.1 of the third chapter. The analogy goes further when all of the coefficients aj of the differential operator D are constant. Such first-order equations (1) have solutions as an exponential with an appropriate constant at the argument. Just as in the case of difference equations, it suggests trying whether such a form of the solution y(t) = eλt with an unknown parameter λ can satisfy an equation of order k. Substitution yields D(eλt ) = ( akλk + ak−1λk−1 + · · · + a1λ + a0 ) eλt . The parameter λ leads to a solution of a linear differential equation with constant coefficients if and only if λ is a root of the characteristic polynomial akλk + · · · + a1λ + a0. 787 CHAPTER 8. CALCULUS WITH MORE VARIABLES If the root λ = a + i b ∈ C is not real, then we may consider the complex valued solution eλt . Since the conjugate root is also involved, we may consider the linear combinations of the solutions eλt and e ¯λt , providing the real solutions eat sin(bt) and eat cos(bt). If the characteristic polynomial has k distinct roots, then we have the basis of the whole vector space of solutions. Otherwise, if λ is a multiple root, then direct calculation, making use of the fact that λ is then a root of the derivative of the characteristic polynomial as well, yields that the function y(t) = t eλt is also a solution. Similarly, for higher multiplicities ℓ, There are ℓ distinct solutions eλt , t eλt , . . . , tℓ−1 eλt . In the case of a general linear differential equation, a nonzero value of the differential operator D is wanted. Again, as for systems of linear equations or linear difference equations, the general solution of this type of (non-homogeneous) equa- tions D(y) = b(t), for a fixed function b(t), is the sum of an arbitrary solution of this equation and the set of all solutions of the corresponding homogeneous equation D(y)(t) = 0. The entire space of solutions is a finite-dimensional affine space, hidden in the huge space of functions. The methods for finding a particular solution are introduced in concrete examples in the other column. 
In principle, they are based on looking for the solution in a similar form as the right-hand side is, or the method of variation of the constants. 8.3.19. Matrix systems with constant coefficients. Before leaving the area of differential equations, consider a very special case of first-order systems, whose right-hand side is given by multiplication of a matrix A ∈ Matn(R) of constant coefficients and an n2 -dimensional unknown matrix function Y (t): (1) Y ′ (t) = A · Y (t). Clearly this is a strict analogy to the iterative models in chapter 3 and we also met such a system of n2 equations 8.3.12(2) when preparing the proof of Theorem 8.3.12. Combine knowledge from linear algebra and univariate function analysis to guess the solution. Define the exponential of a matrix by the formula B(t) = etA = ∞∑ k=0 tk k! Ak . The right-hand expression can be formally viewed as a matrix whose entries bij are infinite series created from the mentioned products. If all entries of A are estimated by the maximum of their absolute values ∥A∥ = C, then the absolute value of the k-th summand in bij(t) is at most |t|k k! nk C2k . Hence, every series bij(t) is necessarily absolutely and uniformly convergent, and it is bound above by the value e|t|nC2 . Differentiate the terms of the series one by one, to get a uniformly convergent series with limit A etA . Therefore, by the 788 CHAPTER 8. CALCULUS WITH MORE VARIABLES general properties of uniformly convergent series, the deriva- tive d dt ( etA ) = A etA also equals this expression. The general solution of the system (1) is obtained in the form Y (t) = etA ·Y0, where Y0 ∈ Matn(R) is the arbitrary initial condition Y (0) = Y0. The exponential etA is a well defined invertible matrix for all t. So we have a vector space of the proper dimension, and hence all solutions to the system (1). Notice that in order to get a solution, it is necessary to multiply by Y0 from the right. It is remarkable that dealing with a vector equation with a constant matrix A ∈ Matn(R), (2) y′ = A · y, for an unknown function y : R → Rn , then the columns of the matrix exponential etA provide n linearly independent solutions. The general solution is then given by linear combinations of them. The general solutions of the system (2) may be understood much better by invoking some linear algebra – the Jordan canonical form of linear mappings, see e.g. 3.4.10. In terms of vector fields X, the system has the linear expression X(y) = Φ(y) where Φ is the linear mapping with the matrix A in coordinates. Clearly linear transformations of the system lead to another vector field with such linear description, since the differential of a linear mapping is the mapping itself. Any linear transformation of coordinates with the (constant) matrix T transforms the system into ˜y′ = (Ty)′ = (TAT−1 ) · (Ty) = ˜A · ˜y. In particular, a suitable change of coordinates T provides the matrix ˜A in the Jordan canonical form expressing Φ as a sum of two commuting linear mappings Φ = Φd + Φn with Φd diagonalisable and Φn nilpotent. Moreover, the decomposition of the nilpotent part into the sum of cyclic nilpotent mappings provides the Jordan blocks Jλ =   λ 1 . . . 0 0 λ . . . 0 . . . . . . . . . 1 0 0 . . . λ   = (Jλ)d + (Jλ)n =   λ 0 . . . 0 0 λ . . . 0 . . . . . . . . . 0 0 0 . . . λ   +   0 1 . . . 0 0 0 . . . 0 . . . . . . . . . 1 0 0 . . . 0  . 
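The defining series of e^{tA} from 8.3.19 can also be checked numerically before the Jordan-form analysis below. The following Python sketch is our own illustration (the truncation at 60 terms and the test matrix are arbitrary choices); it sums the series for a rotation generator, for which the exponential is known in closed form, and verifies that d/dt e^{tA} = A e^{tA} by a difference quotient.

    import numpy as np

    def expm_series(A, t, terms=60):
        # truncated power series e^{tA} = sum_k t^k/k! A^k
        result, term = np.eye(A.shape[0]), np.eye(A.shape[0])
        for k in range(1, terms):
            term = term @ (t * A) / k          # now term == (tA)^k / k!
            result = result + term
        return result

    A = np.array([[0.0, 1.0], [-1.0, 0.0]])    # A^2 = -E, a rotation generator
    t = 1.3
    E_t = expm_series(A, t)
    # closed form here: e^{tA} = cos(t) E + sin(t) A
    print(np.allclose(E_t, np.cos(t) * np.eye(2) + np.sin(t) * A))
    # derivative check: (e^{(t+eps)A} - e^{tA}) / eps should be close to A e^{tA}
    eps = 1e-6
    print(np.allclose((expm_series(A, t + eps) - E_t) / eps, A @ E_t, atol=1e-4))

The columns of the computed matrix are then two linearly independent solutions of y′ = A·y, in accordance with the discussion above.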
Splitting the system (2) into a block-wise diagonal form splits also the space of solutions generated by the exponential e^{tA} into the corresponding blocks (all the powers A^k enjoy the same block structure). So we may work with the matrix A already in the form of one such block J_λ = (J_λ)_d + (J_λ)_n.

But for any two commuting matrices C and D, the exponentials e^{tC} and e^{tD} commute and satisfy e^{t(C+D)} = e^{tC}·e^{tD}. So the exponential e^{tD} of the nilpotent D = (J_λ)_n can be computed as the finite sum
\[ e^{tD} = E + t\begin{pmatrix} 0 & 1 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ \vdots & & \ddots & 1 \\ 0 & 0 & \dots & 0 \end{pmatrix} + \dots + \frac{t^{k-1}}{(k-1)!}\begin{pmatrix} 0 & 0 & \dots & 1 \\ 0 & 0 & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & 0 \end{pmatrix}, \]
where k is the size of the block and E is the identity matrix. The solution of the corresponding matrix system is
\[ Y(t) = e^{\lambda t E} \cdot e^{tD} = e^{\lambda t}\, e^{tD} = \begin{pmatrix} e^{\lambda t} & t\,e^{\lambda t} & \dots & \frac{t^{k-1}}{(k-1)!}\,e^{\lambda t} \\ 0 & e^{\lambda t} & \dots & \frac{t^{k-2}}{(k-2)!}\,e^{\lambda t} \\ \vdots & & \ddots & t\,e^{\lambda t} \\ 0 & 0 & \dots & e^{\lambda t} \end{pmatrix}. \]
Finally, k independent solutions can be written down by inspecting the individual columns of Y(t). Notice that the canonical basis (e₁, …, e_k) provides just the chain of vectors with D(eᵢ) = e_{i−1}, i = 2, …, k, while D(e₁) = 0. Now the k independent solutions are:
\[ y_1(t) = e^{\lambda t} e_1, \quad y_2(t) = e^{\lambda t}\big(t e_1 + e_2\big), \quad \dots, \quad y_k(t) = e^{\lambda t}\Big(\frac{t^{k-1}}{(k-1)!}\,e_1 + \frac{t^{k-2}}{(k-2)!}\,e_2 + \dots + t\,e_{k-1} + e_k\Big). \]
The result can be easily transferred back to the original coordinates in which the system (2) was given. Finding the decomposition of the space into Jordan blocks and finding the chains of basis vectors vᵢ realizing the cyclic nilpotent components, we arrive at the independent solutions by replacing the eᵢ by vᵢ.

The findings are summarised in the following theorem, one of many attributed to Euler.

Theorem. All solutions of the system (2) are linear combinations of solutions of the form combining exponential and polynomial expressions,
\[ y(t) = e^{\lambda t} \sum_{j=0}^{k} p_j t^j, \]
where k is the order of nilpotency of the Jordan block corresponding to the eigenvalue λ of the matrix A, and p_j are suitable constant vectors. In particular, if the nilpotent part of A is trivial, then k = 0.

This important result allows many generalizations. For example, the Floquet–Lyapunov theory generalizes this behaviour of solutions to systems with periodically time-dependent matrices A(t).

8.3.20. Return to singular points. Finally, recall the first-order matrix system in paragraph 8.3.12, where the derivative of the solutions of vector equations with respect to the initial conditions was discussed. Consider a differentiable vector field X(x) defined on a neighbourhood of its singular point x₀ ∈ Rⁿ, i.e. X(x₀) = 0. Then the point x₀ is a fixed point of its flow Fl^X_t(x).

The differential Φ(t) = D_x Fl^X_t(x₀) satisfies the matrix system with initial condition (see (2) on page 778)
\[ \Phi'(t) = D^1X(x_0) \cdot \Phi(t), \qquad \Phi(0) = E. \]
The important point is that the differential D¹X is evaluated along the constant flow line x₀, since this is a singular point of the system. The solution is known explicitly, and it describes the evolution of the differential Φ(t) of the vector field's flow at the singular point x₀:
\[ \Phi(t) = e^{tA}, \qquad A = D^1X(x_0). \]
This is a useful step in analysing the qualitative behaviour in a neighbourhood of the stationary point x₀.

Consider the Lotka–Volterra system from this point of view. Use the coordinates (x, y) and the parameters α, β, γ, δ exactly as in 8.3.10. In particular, all these quantities are assumed to be positive. The vector field in question is
\[ X(x, y) = \big(x(\alpha - \beta y),\; y(-\gamma + \delta\beta x)\big). \]
So there is a single singular point (apart from (0, 0)),
\[ (x_0, y_0) = \Big(\frac{\gamma}{\delta\beta},\, \frac{\alpha}{\beta}\Big), \]
and the differential of X at this point is
\[ D^1X(x_0, y_0) = \begin{pmatrix} \alpha - \beta y_0 & -\beta x_0 \\ \delta\beta y_0 & \delta\beta x_0 - \gamma \end{pmatrix} = \begin{pmatrix} 0 & -\frac{\gamma}{\delta} \\ \alpha\delta & 0 \end{pmatrix}. \]
The characteristic polynomial of the latter matrix is λ² + γα, and so there are two complex conjugate roots λ = ±i√(αγ). As is known from linear algebra, such a matrix describes a rotation in suitable coordinates. Computing the real and imaginary components of the eigenvectors corresponding to λ (as developed in linear algebra), we obtain the matrix solution in the form
\[ Y(t) = \begin{pmatrix} \cos\sqrt{\alpha\gamma}\,t & -\sin\sqrt{\alpha\gamma}\,t \\ \delta\sqrt{\tfrac{\alpha}{\gamma}}\,\sin\sqrt{\alpha\gamma}\,t & \delta\sqrt{\tfrac{\alpha}{\gamma}}\,\cos\sqrt{\alpha\gamma}\,t \end{pmatrix}. \]
The columns are two independent solutions y(t) (they differ just by the phase of the linearly distorted rotation). This might be useful information for further analysis of the model around its singular point. For example, the parameter β does not appear explicitly here, while δ influences the distortion of the flow lines from being circles. Compare this with the illustrations on page 774. As a first approximation, we may guess that the sizes of the populations of both the prey and the predator oscillate regularly around the values of the singular point if the initial conditions are near this point.

8.3.21. A note about Markov chains. In the third chapter, we dealt with iterative processes, where the stochastic matrices and the Markov processes determined by them played an interesting role. Recall that a matrix A is stochastic if the sum of each of its columns is one. In other words,
\[ (1 \dots 1) \cdot A = (1 \dots 1). \]

Take the exponential e^{tA} to obtain
\[ (1 \dots 1) \cdot e^{tA} = \sum_{k=0}^{\infty} \frac{t^k}{k!}\,(1 \dots 1) \cdot A^k = e^t\,(1 \dots 1). \]
Therefore, for every t, the invertible matrix B(t) = e^{−t} e^{tA} is stochastic if A is stochastic. Adding stochastic initial conditions B₀, we get the flow B(t) = e^{−t} e^{tA} · B₀, which is a continuous version of the Markov process (infinitesimally) generated by the stochastic matrix A. Differentiating with respect to t yields
\[ \frac{d}{dt} B(t) = -e^{-t} e^{tA} \cdot B_0 + e^{-t} A\, e^{tA} \cdot B_0 = (-E + A)\,B(t), \]
so the matrix B(t) is the solution of the matrix system of equations with constant coefficients
\[ Y'(t) = (A - E) \cdot Y(t) \]
with the stochastic matrix A.

This can be explained intuitively. If the matrix A is stochastic, then the instantaneous increment of the stochastic vector y(t) in the vector system with the matrix A, y′(t) = A · y(t), is again a stochastic vector. However, it is desired that the Markov process keep the vector y(t) stochastic for all t. Hence, the sum of the increments of the particular components of the vector y(t) must be zero, which is guaranteed by subtracting the identity matrix.

As seen above, the columns of the matrix solution Y(t) create a basis of all solutions y(t) of the vector system. Much information about the solutions can be obtained by using some linear algebra. For example, suppose that the matrix A is primitive, that is, suppose one of its powers has only positive entries, see 3.3.3 on page 214. Then its powers converge to a matrix A_∞, all of whose columns are eigenvectors corresponding to the eigenvalue 1.

Next, estimate the difference between the solution Y(t) for large t and the constant matrix A_∞. There are two consequences of the latter convergence. First, there exists a universal constant bound for all powers, ‖A^k − A_∞‖ ≤ C. Second, for every small positive ε, there is an N ∈ N such that ‖A^k − A_∞‖ ≤ ε for all k ≥ N. Hence,
\[ \Big\| e^{-t}\sum_{k=0}^{\infty} \frac{t^k}{k!} A^k - e^{-t}\sum_{k=0}^{\infty} \frac{t^k}{k!} A_\infty \Big\| \le e^{-t}\sum_{k<N} \frac{t^k}{k!}\,\|A^k - A_\infty\| + e^{-t}\sum_{k\ge N} \frac{t^k}{k!}\,\|A^k - A_\infty\| \le C\,e^{-t}\sum_{k<N} \frac{t^k}{k!} + \varepsilon. \]
The first summand is e^{−t} times a polynomial in t, so it tends to zero as t grows; hence the entire difference is bounded (for large t > 0) by the value ε(C + 1).
Summarizing, a very interesting statement is proved, which resembles the discrete version of Markov processes: Continuous processes with a stochastic matrix Theorem. Every primitive stochastic matrix A determines a vector system of equations y′ = (A − E) · y with the following properties: • the basis of the vector space of all solutions is given by the columns of the stochastic matrix Y (t) = e−t etA , • if the initial condition y0 = y(t0) is a stochastic vector, then the solution y(t) is also a stochastic vector for all values of t, • every stochastic solution converges for t → ∞ to the stochastic eigenvector y∞ of the matrix A corresponding to the eigenvalue 1. 8.3.22. Remarks on numerical methods. Except for the exceptionally simple equations, for example, linear equations with constant coefficients, analytically solvable equations are seldom encountered in practice. Therefore, some techniques are required to approximate the solutions of the equations. Approximations have already been considered in many other situations. (Recall the interpolation polynomials and splines, exploitation of Taylor polynomials in methods for numerical differentiation and integration, Fourier series etc.). With a little courage, consider difference and differential equations to be mutual approximations. In one direction, replace differences with differentials (for example, in economical or population models). For other situations the differences may imitate well continuous changes in models. Use the terminology for asymptotic estimates, as introduced in 6.1.12. In particular, an expression G(h) is asymptotically equal to F(h) for h approaching zero or infinity, and write G(h) = O(F(h)), if the finite limit of G(h)/F(h) ex- ists. A good example is the approximation of a multivariate function f(x) by its Taylor polynomial of order k at a point x0. Taylor’s theorem says that the error of this approximation is O(∥h∥k+1 ), where h is the increment of the argument h = x − x0. In the case of ordinary differential equations, the simplest scheme is approximation using Euler polygons. Present this method for a single ordinary equation with two quantities: one independent and one dependent. It works analogously for systems of equations where scalar quantities and their derivatives in time t are replaced with vectors dependent on time and their derivatives. This procedure was used before in the proof of the Peano’s existence theorem, see 8.3.8. 793 CHAPTER 8. CALCULUS WITH MORE VARIABLES Consider an equation y′ = f(t, y) with continuous right-hand f. Denote the discrete increment of time by h, i.e. set tn = t0 + nh. It is desired to approximate y(t). It follows from Taylor’s theorem (with remainder of order two) and the equation that y(tn+1) = y(tn) + y′ (tn)h + O(h2 ) = y(tn) + f(tn, y(tn))h + O(h2 ). Define recurrently the values yj by the first order formula yj+1 = yj + f(tj, yj)h. This leads to the local approximation error O(h2 ), occuring in one step of the recurrence. If n such steps are needed with increment h from t0 to t = tn, the error could be up to nO(h2 ) = 1 h (t−t0)O(h2 ) = O(h). More care is needed, since the function f(t, y) is evaluated in the approximate points (ti, yi) and the already approximate previous values yj. In order to keep control, f(t, y) must be Lipschitz in y. 
Assuming inductively that the estimate is true for all i < j, |f(tj, y(tj)) − f(tj, yj)| ≤ C|y(tj) − yj| ≤ C|t − t0|O(h) where C is the Lipschitz constant, assuming that the error does not exceed O(h) with globally valid constant for yj. Inductively, the expected bound O(h) for the global error estimate is obtained. Think about the details. The Euler procedure is the simplest method within the class of the Runge-Kutta methods. Dealing with higher order equations, we may either view them as vector valued first order systems (as in the theoretical column) and then even Euler method provides results for the initial condition on the necessary number of derivatives in one point. But in practical problems, it is often needed to find solutions passing through more then one prescribed point. For example, with second order equations, prescribe two values y(t1) and y(t2) of the solution. This would need completely different methods. 794 795 CHAPTER 8. CALCULUS WITH MORE VARIABLES O. Additional exercises to the whole chapter 8.O.1. A basin with volume 300 hl contains 100 hl of water in which 50 kg of salt is dissolved. Water with 2 kg of salt per 1 hl starts flowing into the basin at 6 hl/min. The mixture, being kept homogeneous by permanent stirring, leaves the basin at 4 hl/min. Express the amount of salt (in kg) in the basin after t minutes have expired as a function of the variable t ∈ [0, 100]. ⃝ 8.O.2. During a controlled experiment, a small smelting furnace is slowly cooling down while the outer temperature keeps at 300 K. The experiment began at noon. At 1 pm, the temperature in the furnace was estimated at 1300 K. At 3 pm, it was only 550 K. Supposing the measurements were accurate, compute what the temperature in the furnace was at 2 pm. ⃝ 8.O.3. The half-life of the radioactive sulfur isotope 35 S is 87.5 days. After what period are there only 900 grams left of the original amount of 1 kilogram of this isotope? (You may express the result in terms of the natural logarithm.) ⃝ 8.O.4. The half-time of a radioactive element A is 5 years; for an element B, it is 1 year. If we have 5 kg of element B and 1 kg of element A, after what period will we have the same amount of both? (You may express the result in terms of the natural logarithm.) ⃝ 8.O.5. The half-time of a radioactive element A is 8 years; for an element B, it is 2 years. If we have 3 kg of element B and 1 kg of element A, after what period will we have the same amount of both? (You may express the result in terms of the natural logarithm.) ⃝ 8.O.6. The half-life of the radioactive cobalt isotope 60 Co is 5.27 years. Having 4 kg of this isotope, after what period does 1 kg of it decay? (You may express the result in terms of the natural logarithm.) ⃝ 8.O.7. Solve the following differential equation for the function y = y(x): y′ = 1 + y2 1 + x2 . ⃝ 8.O.8. Determine all solutions of the following equation with separated variables: y − y2 + xy′ = 0. ⃝ 8.O.9. Solve the equation 1 + dy dx = ey . ⃝ 8.O.10. Solve the equation 2y = x3 y′ . ⃝ 8.O.11. Determine all solutions of the equation √ 4 − y2 dx + y dy = 0. ⃝ 8.O.12. Solve y′ tan x = y2 + 1 − 2y. ⃝ 8.O.13. Determine the general solution of the differential equation x2 +1 x = y 1−y2 y′ . ⃝ 8.O.14. Find the general solution of the differential equation (x + 1) dy + xy dx = 0. 796 CHAPTER 8. CALCULUS WITH MORE VARIABLES ⃝ 8.O.15. Find the solution of the differential equation sin y cos x dy = cos y sin x dx which satisfies 4y(0) = π. ⃝ 8.O.16. 
Solve the initial problem ( x2 + 1 ) ( y2 − 1 ) + xyy′ = 0, y(1) = √ 2. ⃝ 8.O.17. Determine the particular solution of the equation y′ sin x = y ln y which goes through the point [π/2, e]. ⃝ 8.O.18. Find all solutions of the differential equation 2 (1 + ex ) yy′ = ex , which satisfy y(0) = 0. ⃝ 8.O.19. Solve the homogeneous equation (xy′ − y) cos y x = x. ⃝ 8.O.20. Determine the general solution of the homogeneous differential equation y3 = x3 y′ . ⃝ 8.O.21. Find all solutions of the equation xy′ = √ x2 − y2 + y. ⃝ 8.O.22. Determine the general solution if we are given xy′ = y cos ( ln y x ) . ⃝ 8.O.23. Solve the equation (x + y) dx − (x − y) dy = 0 as homogeneous. ⃝ 8.O.24. Calculate y′ = (x + y)2 . ⃝ 8.O.25. Find the general solution for y′ = x−y+3 x+y−5 . ⃝ 8.O.26. Calculate y′ = x−y+1 x−y . ⃝ 8.O.27. Determine all solutions of the differential equation y′ = 5y−5x−1 2y−2x−1 . ⃝ 8.O.28. Find the general solution of the equation y′ = x−y−1 x+y+3 . 797 CHAPTER 8. CALCULUS WITH MORE VARIABLES ⃝ 8.O.29. Determine the general solution for the equation y′ = 2x−y−5 x−3y−5 . ⃝ 8.O.30. Express the solutions of the equation y′ = x+2y−7 x−3 as explicitly given functions. ⃝ 8.O.31. Using the method of constant variation, calculate y′ + 2y = x. ⃝ 8.O.32. Determine the general solution of the equation y′ = 6x + 2y + 3. ⃝ 8.O.33. Solve the linear equation y′ = 4xy + (2x + 1)e2x2 . ⃝ 8.O.34. Solve the equation y′ x + y = x ln x. ⃝ 8.O.35. Calculate the linear differential equation y′ x = y + x2 ln x. ⃝ 8.O.36. Find all solutions of the equation y′ cos x = (y + 2 cos x) sin x. ⃝ 8.O.37. Find the solution of the equation y′ = 6x − 2y which satisfies the initial condition y(0) = 0. ⃝ 8.O.38. Solve the initial problem y′ + y sin x = sin x, y (π 2 ) = 2. ⃝ 8.O.39. Find the solution of the equation y′ = 4y + cos x which goes through the point [0, 1]. ⃝ 8.O.40. Solve the following equation for any a, b ∈ R: xy′ + y = ex , y (a) = b. ⃝ 8.O.41. Determine the general solution of the equation 3x2 y′ + xy = 1 y2 . ⃝ 8.O.42. Solve the Bernoulli equation y′ = xy − y3 e−x2 . ⃝ 8.O.43. Calculate the Bernoulli equation y′ − y x = y2 sin x. 798 CHAPTER 8. CALCULUS WITH MORE VARIABLES ⃝ 8.O.44. Find all solutions of the equation y′ = 4y x + x √ y. ⃝ 8.O.45. Solve the equation xy′ + 2y + x5 y3 ex = 0. ⃝ 8.O.46. Find the general solution of the following equation provided a, b > 0: y dy = ( a y2 x2 + b 1 x2 ) dx. ⃝ 8.O.47. Interchanging the variables, solve 2y + ( y2 − 6x ) y′ = 0. ⃝ 8.O.48. Solve the equation y′ = y 2y ln y+y−x . ⃝ 8.O.49. Calculate the general solution of the following equation: x dx = ( x2 y − y3 ) dy. ⃝ 8.O.50. Interchanging the variables, calculate (x + y) dy = y dx + y ln y dy. ⃝ 8.O.51. Solve y′ (e−y − x) = 1. ⃝ 8.O.52. Calculate y′ = 1 2x−y2 . ⃝ 8.O.53. Solve the equation 2y dx + x dy = 2y3 dy. ⃝ 8.O.54. Calculate y′′ + 3y′ + 2y = (20x + 29) e3x . ⃝ 8.O.55. Find any solution of the non-homogeneous linear equation 799 CHAPTER 8. CALCULUS WITH MORE VARIABLES y′′ + y′ + 5 2 y = 25 cos (2x) . ⃝ 8.O.56. Determine the solution of the equation y′′ + 2y′ + 2y = 3e−x cos x. ⃝ 8.O.57. Find the solution of the equation y′′ = 2y′ + y + 1, which satisfies y(0) = 0 and y′ (0) = 1. ⃝ 8.O.58. Find the solution of the equation y′′ = 4y − 3y′ + 1 which satisfies y(0) = 0 and y′ (0) = 2. ⃝ 8.O.59. Determine the general solution of the linear equation y′′ − 2y′ + 5y = 5e2x sin x. ⃝ 8.O.60. Taking advantage of the special form of the right-hand side, find all solutions of the equation y′′ + y′ = x2 − x + 6e2x . 
⃝

8.O.61. Solve $y^{(4)} - 2y'' + y = 8(e^x + e^{-x}) + 4(\sin x + \cos x)$. ⃝

8.O.62. Using the method of constant variation, calculate $y'' - 2y' + y = \frac{e^x}{x}$. ⃝

8.O.63. Solve $y'' + 4y' + 4y = e^{-2x} \ln x$. ⃝

8.O.64. Using the method of constant variation, find the general solution of the equation $y'' + 4y = \frac{1}{\sin(2x)}$. ⃝

8.O.65. Solve the equation $y'' + y = \tan^2 x$. ⃝

8.O.66. Find the solution of the differential equation $y^{(3)} = -2y'' - 2y' - y + \sin(x)$ which satisfies $y(0) = -\frac{1}{2}$, $y'(0) = \frac{\sqrt{3}}{2}$, and $y''(0) = -1 - \frac{\sqrt{3}}{2}$. ⃝

8.O.67. Solve the equation $y''' - 2y'' - y' + 2y = 0$. ⃝

8.O.68. Find the general solution of the equation $y^{(4)} + 2y'' + y = 0$. ⃝

8.O.69. Solve $y^{(6)} + 2y^{(5)} + 4y^{(4)} + 4y''' + 5y'' + 2y' + 2y = 0$. ⃝

8.O.70. Find the general solution of the linear equation $y^{(5)} - 3y^{(4)} + 2y''' = 8x - 12$. ⃝

Key to the exercises

This chapter presents several glimpses towards more serious applications of the differential and integral calculus. We cannot be ambitious in covering the displayed topics extensively. Thus, after a reasonably detailed introduction to a more geometric approach to the differential and integral calculus, we present rather quick surveys of and comments on partial differential equations, variational calculus, and complex analysis. We hope the readers will get excited by at least some of them and look for further resources themselves.

1. Exterior differential calculus and integration

We have already seen how to optimize functions on subsets in Rⁿ, but how do we integrate quantities over such domains? For example, if we have a 2-dimensional membrane in R³ and we know the infinitesimal flow of some liquid through it, how do we compute how much went through within a given time interval? In order to understand such questions properly, we formalize the concept of the level sets M_b from paragraph 8.1.24, devoted to implicit functions, and we provide a geometric explanation of the integration process. Then we quite easily arrive at several powerful tools, including the Stokes theorem, which is a higher dimensional extension of the fundamental theorem of (univariate) calculus, and the Frobenius theorem, which generalizes to higher dimensions the integration of prescribed line elements into a solution of an ODE.

9.1.1. Vector fields and differential forms. Let us come back to the concept of tangent vectors and vector fields, cf. 8.3.14, where we introduced the tangent space TU = U × Rⁿ as the set of all possible tangent vectors at the points of an open subset U ⊂ Rⁿ. There is the projection p : TU → U assigning the foot points to the tangent vectors; we write T_xU for the vector space of all vectors X with p(X) = x at a point x ∈ U, and we use the notation X(U) for the set of all smooth vector fields on the open subset U. The linear combinations of the special vector fields ∂/∂xᵢ, admitting smooth functions as the coefficients, generate the entire X(U). Thus we write general vector fields as
\[ X(x) = X_1(x)\frac{\partial}{\partial x_1} + \dots + X_n(x)\frac{\partial}{\partial x_n}. \]

CHAPTER 9

Continuous models – further selected topics

"The World might be discrete in reality. But the continuous models are useful anyhow ..."

A. Exterior differential calculus

B. Applications of Stokes' theorem

9.B.1. Compute ∫_c (x − y) dx + x dy, where c is the positively oriented curve represented by the perimeter of the square ABCD with vertices A = [2, 2], B = [−2, 2], C = [−2, −2], D = [2, −2].

Solution.
Using Green’s theorem (see ??), we reduce the given curve integral to an area (multiple) integral. The integral is of the form ∫ c f(x, y) dx+g(x, y) dy, where f(x, y) = x − y and g(x, y) = x. The needed partial derivatives of the functions f(x, y) and g(x, y) are thus fy(x, y) = −1 and gx(x, y) = 1. All of the functions f(x, y), g(x, y), fy(x, y), and gx(x, y) are continuous on R2 , so we can use Green’s theorem: ∫ c (x − y)dx + xdy = ∫∫ D (1 + 1)dxdy = 2 ∫∫ D dxdy CHAPTER 9. CONTINUOUS MODELS – FURTHER SELECTED TOPICS As we know, every differentiable mapping F : U → V between two open sets U ⊂ Rn , V ⊂ Rm defines the mapping F∗ : TU → TV by applying the differential D1 F to the individual tangent vectors. Thus if y = F(x) = (f1(x), . . . , fm(x)) then F∗ : TxU → TF (x)V , F∗ ( n∑ i=1 Xi(x) ∂ ∂xi ) (y) = m∑ j=1 ( n∑ i=1 ∂fj(x) ∂xi Xi(x) ) ∂ ∂yj (y). When we studied the vector spaces in chapter two, we came accros the useful concept of linear forms. They were defined in paragraph 2.3.17 on page 122. This idea extends naturally now. A scalar valued linear mapping defined on the tangent space TxU is such a linear form at the foot point x. The vector space of all such forms T∗ x U = (TxU)∗ is thus naturally isomorphic to Rn∗ and the collection T∗ U of these spaces comes equipped by the projection to the foot points, let us denote it again by p. Having a mapping η : U ⊂ Rn → T∗ U with values η(x) ∈ T∗ x U on an open subset U, i.e., p ◦ η = idU , we talk about a differential form η on U, or a linear form. Every differentiable function f on an open subset U ⊂ Rn defines the differential form df on U (cf. 8.1.7). We use the notation Ω1 (U) for the set of all smooth linear differential forms on the open set U. In the chosen coordinates (x1, . . . , xn) we can use the differentials of the particular coordinate functions to express every linear form η as η(x) = η1(x)dx1 + · · · + ηn(x)dxn, where ηi(x) are uniquely determined functions. Such a form η evaluates on a vector field X(x) = X1(x) ∂ ∂x1 + · · · + Xn(x) ∂ ∂xn as η(X)(x) = η(x)(X(x)) = η1(x)X1(x)+· · ·+ηn(x)Xn(x). If the form η is the differential of a function f, we get just back the expression X(f)(x) = df(X(x)) = ∂f ∂x1 X1(x) + · · · + ∂f ∂xn Xn(x) for the derivative of f in the direction of the vector field X. 9.1.2. Exterior differential forms. As we discussed already in chapters 1 and 4, the volume of k-dimensional parallelepipeds S, as a quantity depending of the k vectors spanning S, is an antisymmetric k–linear form on the vectors, see 2.3.23 on page 127. Remember also the computation of the volume of parallelepipeds in terms of determinants in 4.1.19 on page 334. Thus, if we want to talk about the (linearized) volume on k-dimensional objects, we need a concept which will be linear in k distinct tangent vector arguments and will assign a scalar quantity to them. Moreover, we will require that interchanging any pair of arguments swaps the sign, in accordance with the orientations. 803 = 2 2∫ −2 2∫ −2 dxdy =2 [ x ]2 −2 · [ y ]2 −2 = 32. □ 9.B.2. Compute ∫ c x4 dx + xy dy, where c is the positively oriented curve going through the vertices A = [0, 0]; B = [1, 0]; C = [0, 1]. Solution. The curve c is the boundary of the triangle ABC. The integrated functions are continuously differentiable on the whole R2 , so we can use Green’s theorem: ∫ c x4 dx + xydy = ∫∫ D y dx dy = 1∫ 0 −x+1∫ 0 y dx dy = 1∫ 0 [ y2 2 ]−x+1 0 dx = 1∫ 0 [ x2 − 2x + 1 2 ] dx = 1 2 [ x3 3 − 2x2 2 + x ]1 0 = 1 6 . □ 9.B.3. 
Calculate ∫_c (xy + x + y) dx + (xy + x − y) dy, where c is the circle with radius 1 centered at the origin.

Solution. Again, the prerequisites of Green's theorem are satisfied, and it now gives
\[ \int_c (xy + x + y)\,dx + (xy + x - y)\,dy = \iint_D (y + 1 - x - 1)\,dx\,dy = \int_0^1\!\!\int_0^{2\pi} r^2(\sin\varphi - \cos\varphi)\,d\varphi\,dr = \int_0^1 r^2\,dr \int_0^{2\pi} (\sin\varphi - \cos\varphi)\,d\varphi = \frac{1}{3}\big[-\cos\varphi - \sin\varphi\big]_0^{2\pi} = 0. \] □

Exterior differential forms

Definition. The vector space of all k–linear antisymmetric forms on a tangent space T_xU, U ⊂ Rⁿ, will be denoted by Λ^k(T_xU)^*. We talk about exterior k–forms at the point x ∈ U. The assignment of a k–form η(x) ∈ Λ^k T^*_xU to every point x ∈ U from an open subset in Rⁿ defines an exterior differential k–form on U. The set of smooth exterior k–forms on U is denoted Ω^k(U).

Next, let us consider a smooth mapping G : V → U between two open sets V ⊂ Rᵐ and U ⊂ Rⁿ, an exterior k–form η(G(x)) ∈ Λ^k(T^*_{G(x)}U), and choose arbitrarily k vectors X₁(x), …, X_k(x) in the tangent space T_xV. Just like in the case of linear forms, we can evaluate the form η at the images of the vectors Xᵢ under the mapping y = G(x) = (g₁(x), …, gₙ(x)). This operation is called the pullback of the form η by G:
\[ G^*\big(\eta(G(x))\big)(X_1(x), \dots, X_k(x)) = \eta(G(x))\big(G_*(X_1(x)), \dots, G_*(X_k(x))\big), \]
which is an exterior form in Λ^k(T^*_xV). In the case of linear forms, this is the dual mapping to the differential D¹G. We can compute directly from the definition that, for instance,
\[ G^*(dy_i)\Big(\frac{\partial}{\partial x_k}\Big) = dy_i\Big(G_*\Big(\frac{\partial}{\partial x_k}\Big)\Big) = \frac{\partial g_i}{\partial x_k}, \]
and so

(1) $G^*(dy_i) = \frac{\partial g_i}{\partial x_1}\,dx_1 + \dots + \frac{\partial g_i}{\partial x_m}\,dx_m$,

which extends to the linear combinations of all dyᵢ over functions. Another immediate consequence of the definition is the formula for pullbacks of arbitrary k–forms by a composition of two diffeomorphisms:

(2) $(G \circ F)^*\alpha = F^*\big(G^*\alpha\big)$.

Indeed, as a mapping on k-tuples of vectors,
\[ (G \circ F)^*\alpha = \alpha \circ \big((D^1G \circ D^1F) \times \dots \times (D^1G \circ D^1F)\big) = G^*(\alpha) \circ (D^1F \times \dots \times D^1F) = F^* \circ G^*\alpha, \]
as expected.

9.1.3. Wedge product of exterior forms. Given a k–form α ∈ Λ^k Rⁿ* and an ℓ–form β ∈ Λ^ℓ Rⁿ*, we can create a (k + ℓ)–form α ∧ β by alternating over all possible permutations σ of the arguments. We just have to alternate the arguments in all possible orders and take the right sign each time:
\[ (\alpha \wedge \beta)(X_1, \dots, X_{k+\ell}) = \frac{1}{k!\,\ell!} \sum_{\sigma \in \Sigma_{k+\ell}} \operatorname{sgn}(\sigma)\, \alpha(X_{\sigma(1)}, \dots, X_{\sigma(k)})\, \beta(X_{\sigma(k+1)}, \dots, X_{\sigma(k+\ell)}). \]

9.B.4. Compute ∫_c (2e^{2x} sin y − 3y³) dx + (e^{2x} cos y + (4/3)x³) dy, where c is the positively oriented ellipse 4x² + 9y² = 36.

Solution. We will use Green's theorem, choosing the linear deformation of polar coordinates x = 3r cos φ, φ ∈ [0, 2π],
Calculate the integral ∫ c (ex sin y − xy2 )dx + ( ex cos y − 1 2 x2 y ) dy, where c is the positively oriented circle x2 +y2 +4x+4y+7 = 0. ⃝ 9.B.7. Compute ∫ c (3y − esin x ) dx + (7x + √ y4 + 1) dy, CHAPTER 9. CONTINUOUS MODELS – FURTHER SELECTED TOPICS It is clear from the definition that α ∧ β is indeed a (k + ℓ)– form. In the simplest case of 1–forms, the definition says that (α ∧ β)(X, Y ) = α(X)β(Y ) − α(Y )β(X). In the case of a 1–form α and a k–form β, we get (α ∧ β)(X0, X1, . . . , Xk) = k∑ j=0 (−1)j α(Xj)β(X0, . . . , ˆXj, . . . , Xk), where the hat indicates omission of the corresponding argument. The wedge product of finitely many forms is defined analogously (either directly by a similar formula, or we can notice that the above wedge product of forms is an associative operation – think this out by yourselves!). Next, remind the generators ∂ ∂xi of all vector fields in X(Rn ), as well as the generators dxi of all linear exterior forms in Ω1 (Rn ). Their wedge products εi1...ik = dxi1 ∧ · · · ∧ dxik with k-tuples of indices i1 < i2 < · · · < ik generate the whole space Ωk (Rn ) by linear combinations with functions standing for coefficients. Indeed, interchanging a pair of adjacent forms in the product merely changes the sign, so the whole expression is identically zero if an index appears twice. Therefore, every k–form α is given uniquely by functions αi1...ik (x) in the expression α(x) = ∑ i1<··· dim U. Thus, Ωk (U) contains only the trivial zero form in this case. Another straightforward consequence of the definition is that the pullback of the wedge product by a smooth mapping G : V → U satisfies G∗ (α ∧ β) = G∗ α ∧ G∗ β. We should also notice that 0–forms Ω0 (Rn ) are just smooth functions on Rn . The wedge product of a 0–form f and a k–form α is just the multiple of the form α by the function f. Similarly, the top degree forms in Ωn (U) are all generated by the single generator ε12...n, since there is just one possibility of n different choices among n coordinates, up to the ordering. This means that actually the n-forms ω are identified with functions via the formula ω(x) = f(x)dx1 ∧ · · · ∧ dxn. At the same time, while the pullback on the functions f ∈ Ω0 (U) by a transformation F : Rn → Rn , y = F(x), is trivial, i.e. F∗ f(x) = f(y) = f ◦ F(x), a straightforward computation reveals (1) F∗ ω(x) = det(D1 F)(x)f(F(x))dx1 ∧ · · · ∧ dxn for all ω = fdy1 ∧ · · · ∧ dyn. 805 where c is the positively oriented circle x2 + y2 = 9. ⃝ 9.B.8. Compute the integral ∫ c ( 1 x + 2xy − y3 3 ) dx + ( 1 y + x2 + x3 3 ) dy, where c is the positively oriented boundary of the set D = {(x, y) ∈ R2 : 4 ≤ x2 + y2 ≤ 9, x√ 3 ≤ y ≤ √ 3x}. ⃝ 9.B.9. Remark. An important corollary of Green’s theorem is the formula for computing the area D that is bounded by a curve c. m(D) = 1 2 ∫ c −y dx + x dy. 9.B.10. Compute the area given by the ellipse x2 a2 + y2 b2 = 1. Solution. Using the formula 9.B.9 and the transformation x = a cos t, y = b sin t, we get for t ∈ [0, 2π] that m(D) = 1 2 ∫ c −y dx + x dy = = 1 2 2π∫ 0 a cos t · b cos tdt − 1 2 2π∫ 0 b sin t · (−a sin t)dt = = 1 2 ab 2π∫ 0 cos2 tdt + 1 2 ab 2π∫ 0 sin2 tdt = = 1 2 ab 2π∫ 0 cos2 t + sin2 tdt = 1 2 ab2π = πab, which is indeed the well-known formula for the area of an ellipse with semi-axes a and b. □ 9.B.11. Find the area bounded by the cycloid which is given parametrically as ψ(t) = [a(t−sin t); a(1−cos t)], for a ≥ 0, t ∈ (0, 2π), and the x-axis. Solution. Let the curves that bound the area be denoted by c1 and c2. 
As for the area, we get m(D) = 1 2 ∫ c1 −y dx + x dy + 1 2 ∫ c2 −y dx + x dy. Now, we will compute the mentioned integrals step by step. The parametric equation of the curve c1 (a segment of the x-axis) is (t; 0); t ∈ [0; 2aπ], so we obtain for the first integral that 1 2 ∫ C1 −y dx + x dy = 1 2 ∫ 2aπ 0 0 · 1 dt + ∫ 2aπ 0 t · 0 dt = 0. The parametric equation of the curve c2 is ψ(t) ∈ (a(t− sin t), a(1 − cos t)); t ∈ [2π; 0]. The formula for the area expects a positively oriented curve, CHAPTER 9. CONTINUOUS MODELS – FURTHER SELECTED TOPICS 9.1.4. Integration of exterior forms on Rn . Once we fix coordinates (x1, . . . , xn) on Rn (e.g. the standard ones), there is the bijection between functions f and top degree forms ω(x) = f(x)dx1 ∧ · · · ∧ dxn. This can be interpreted as defining the scale with which the standard volume in Rn is to be taken pointwise due to the function f. Notice, that changing the coordinates via a tranformation F will rescale this understanding of the forms exactly as in the formula for coordinate substitution in the Riemann integral. We should view this observation as a new interpretation of the integrands in our earlier procedure of integration of functions f on Riemann measurable open subsets U ⊂ Rn , independent of any coordinate choice. Let us check this interpretation in more detail. First, we define the n–form ωRn , giving the standard n–dimensional volume of parallelograms, i.e. in the standard coordinates we obtain ωRn = dx1 ∧ · · · ∧ dxn. If we want to integrate a function f(x) “in the new way”, we consider the form ω = fωRn instead, i.e. ω = f(x)dx1 ∧ · · · ∧ dxn. We define the integral of the form ω as ∫ U ω = ∫ U f(x)dx1 ∧ · · · ∧ dxn = ∫ U f(x) dx1 · · · dxn, where the Riemann integral of a function is considered on the right-hand side. Let us point out, that the n-form ω on the left-hand side is well defined, independently of any choice of coordinates. If we want to express the form ω in different coordinates using a diffeomorphism G : V → U, G = (g1, . . . , gn), it means we will evaluate ω at a point G(y) = x at the values of the vectors G∗(X1), . . . , G∗(Xn). However, this means we will integrate the form G∗ ω in coordinates (y1, . . . , yn), and we already saw in the previous paragraph, cf. 9.1.3(1) that (G∗ ω)(y) = f(G(y)) det(D1 G(u))dy1 ∧ · · · ∧ dyn. Substituting into our interpretation of the integral, we get ∫ V G∗ (fωRn ) = ∫ G−1(U) f(G(y)) det(D1 G(u))dy1 · · · dyn, which is, by the theorem 8.2.8 on the coordinate substitution in the integral, the same value as ∫ U fωRn if the determinant of the Jacobian matrix is positive, and the same value up to the sign if it is negative. Our new interpretation thus provides the geometrical meaning for the integral of an n–form on Rn , supposing the corresponding Riemann integral exists in some (and hence any) coordinates. This integration takes into account the orientation of the area we are integrating over. We shall come back to this point in a moment. 9.1.5. Integrals along curves. Our next goal is to integrate objects over domains which are similar to curves or surfaces in R3 . Let us first shape our mind on the simplest case of the lowest dimension, i.e. the curves in Rn . 806 which means for the considered parametric equation that we are moving against the parametrization direction, i. e. from the upper bound to the lower one. 
We thus get for the area of the cycloid that
\[ \frac{1}{2}\int_{c_2} -y\,dx + x\,dy = \frac{1}{2}\int_{2\pi}^{0} \big(a(t - \sin t)\cdot a\sin t - a(1 - \cos t)\cdot a(1 - \cos t)\big)\,dt = \frac{a^2}{2}\int_0^{2\pi} \big((1 - \cos t)^2 - (t - \sin t)\sin t\big)\,dt \]
\[ = \frac{a^2}{2}\int_0^{2\pi} \big(2 - 2\cos t - t\sin t\big)\,dt = \frac{a^2}{2}\Big(4\pi - 0 + \big[t\cos t - \sin t\big]_0^{2\pi}\Big) = \frac{a^2}{2}\,(4\pi + 2\pi) = 3\pi a^2. \] □

9.B.12. Compute I = ∫∫_S x³ dy dz + y³ dx dz + z³ dx dy, where S is the sphere x² + y² + z² = 1.

Solution. It is advantageous to work in spherical coordinates
\[ x = \rho \sin\varphi \cos\psi, \quad y = \rho\sin\varphi\sin\psi, \quad z = \rho\cos\varphi, \qquad \rho \in [0, 1],\ \varphi \in [0, \pi],\ \psi \in [0, 2\pi]. \]
The Jacobian of this transformation is ρ² sin φ (in absolute value). By the Gauss–Ostrogradsky theorem (cf. 9.B.13), the given integral is then equal to
\[ I = \iiint_V \big(3x^2 + 3y^2 + 3z^2\big)\,dx\,dy\,dz = 3\int_0^1\!\!\int_0^{2\pi}\!\!\int_0^{\pi} \rho^2\sin\varphi\,\big(\rho^2\sin^2\varphi(\cos^2\psi + \sin^2\psi) + \rho^2\cos^2\varphi\big)\,d\varphi\,d\psi\,d\rho \]
\[ = 3\int_0^1 \rho^4\,d\rho \int_0^{2\pi} d\psi \int_0^{\pi} \sin\varphi\,d\varphi = 3 \cdot \Big[\frac{\rho^5}{5}\Big]_0^1 \cdot \big[\psi\big]_0^{2\pi} \cdot \big[-\cos\varphi\big]_0^{\pi} = 3 \cdot \frac{1}{5} \cdot 2\pi \cdot 2 = \frac{12}{5}\pi. \] □

Recall the calculation of the length of a curve in Rⁿ by univariate integrals, which was discussed in paragraph 6.1.10 on page 518. The curve was parametrized as a mapping c(t) : R → Rⁿ, and the size of the tangent vector ‖c′(t)‖ was expressed in the Euclidean vector space. This procedure was given by the universal relation for an arbitrary tangent vector, i.e., we actually found the function ρ : Rⁿ → R which gave the true size when evaluated at c′(t). This mapping satisfied ρ(av) = |a|ρ(v), since we ignored the orientation of the curve given by our parametrization. If we wanted a signed length, respecting the orientation, then our mapping ρ would be linear on every one-dimensional subspace L ⊂ Rⁿ. Of course, we could also have multiplied the Euclidean size by a positive function and integrated this quantity.

In view of our geometric approach to integration, we should rather integrate linear forms along curves, while the size of vectors is given by a quadratic form rather than a linear one. However, in dimension one, we take the square root of the values of the (positive definite) quadratic form in order to get a linear form (up to sign), which is just the size of the vectors.

Let us proceed in a much similar way dealing with linear differential forms η on Rⁿ. The simplest ones are the differentials df of functions f on Rⁿ. In order to motivate our development, let us consider the following task. Imagine we are cycling along a path c(t) in R², and the function f is the altitude of the terrain. If we want to compute the total gain of altitude along the path c(t), we should "integrate" the immediate infinitesimal gains, which are the derivatives of f in the directions of the tangent vectors to the path, i.e. df(c′(t)).

Thus, let us consider a differentiable curve c(t) in Rⁿ, t ∈ [a, b], write M for the image c([a, b]), and assume that a differentiable function f is defined on a neighborhood of M. The differential of this function gives for every tangent vector the increment of the function in the given direction. It is expressed by the differential of the composite mapping f ∘ c,
\[ d(f \circ c)(t) = \frac{\partial f}{\partial x_1}(c(t))\,c_1'(t) + \dots + \frac{\partial f}{\partial x_n}(c(t))\,c_n'(t). \]
We can thus try to define the value of the integral in the following way:
\[ \int_M df = \int_a^b \Big( \frac{\partial f}{\partial x_1}(c(t))\,c_1'(t) + \dots + \frac{\partial f}{\partial x_n}(c(t))\,c_n'(t) \Big)\,dt, \]
and we immediately verify that a change of the parametrization of the curve has no effect upon the value.
Indeed, writing $\tilde c(s) = c(\psi(s))$, $a = \psi(\tilde a)$, $b = \psi(\tilde b)$, our procedure yields
$$\int_{\tilde a}^{\tilde b}\Big(\frac{\partial f}{\partial x_1}(c(\psi(s)))\,c_1'(\psi(s)) + \cdots + \frac{\partial f}{\partial x_n}(c(\psi(s)))\,c_n'(\psi(s))\Big)\frac{d\psi}{ds}\,ds,$$
and the theorem about coordinate transformations for univariate integrals gives just the same value if $\frac{d\psi}{ds} > 0$, i.e. if we keep the orientation of the curve, and the same value up to sign if the derivative of the transformation is negative.

If we extend the same definition to an arbitrary linear form $\eta = \eta_1\,dx_1 + \cdots + \eta_n\,dx_n$, we arrive at the same formula with $\eta_i$ replacing the derivatives $\frac{\partial f}{\partial x_i}$,
$$\int_M \eta = \int_a^b\big(\eta_1(c(t))\,c_1'(t) + \cdots + \eta_n(c(t))\,c_n'(t)\big)\,dt,$$
again independent of the parametrization of the curve $c$ as above. In the above example with $n = 2$, $f$ was the altitude of the terrain, and the integral of $df$ along the path modelled the total gain of elevation. Thus, we should expect that the total gain along the path depends on the values $c(a)$ and $c(b)$ only, while different curves with the same boundary points would produce different integrals of $\eta$ for a general 1-form $\eta$. This will indeed be the special claim of the Stokes theorem below. Before we treat the higher dimensional analogues, we shall look at a more abstract approach to suitable subsets in $\mathbb{R}^n$ and the role of coordinates on them.

9.1.6. Manifolds. The straightforward generalizations of parametrized curves $c(t) : \mathbb{R}\to\mathbb{R}^n$ are the differentiable mappings $\varphi : V\subset\mathbb{R}^k\to\mathbb{R}^n$, $k\le n$, with injective differential $d\varphi(u)$ at every point of the open domain $V$. Such mappings are called immersions. With the curves, we did not care about their self-intersections etc. Now, for technical reasons, we shall be more demanding.

Manifolds in $\mathbb{R}^n$
A subset $M\subset\mathbb{R}^n$ is called a manifold of dimension $k$ if every point $x\in M$ has a neighborhood $U\subset\mathbb{R}^n$ which is the image of a diffeomorphism $\tilde\varphi : V\times\tilde V\to U$, $V\subset\mathbb{R}^k$, $\tilde V\subset\mathbb{R}^{n-k}$, such that
• the restriction $\varphi = \tilde\varphi|_V : V\to M$ is an immersion,
• $\tilde\varphi^{-1}(M) = V\times\{0\}\subset\mathbb{R}^n$.
The manifolds $M$ carry the topology inherited from $\mathbb{R}^n$.

9.B.13. The vector form of the Gauss–Ostrogradsky theorem. The divergence of a vector field $F(x,y,z) = f(x,y,z)\frac{\partial}{\partial x} + g(x,y,z)\frac{\partial}{\partial y} + h(x,y,z)\frac{\partial}{\partial z}$ is defined as $\operatorname{div} F := f_x + g_y + h_z$. Then, the Gauss–Ostrogradsky theorem can be formulated as follows:
$$\iiint_V \operatorname{div}\vec F(x,y,z)\,dx\,dy\,dz = \iint_S \vec F(x,y,z)\cdot\vec n(x,y,z)\,dS,$$
where $\vec n(x,y,z)$ is the outer unit normal to the surface $S$ at the point $[x,y,z]\in S$ ($S$ is the boundary of the normal domain $V$).

9.B.14. Find the flow of the vector field $F = (xy^2, yz, x^2z)$ through the boundary of the solid cylinder given by $x^2+y^2\le 4$, $1\le z\le 3$.

Solution. First of all, we compute the divergence of the vector field:
$$\operatorname{div} F = \nabla\cdot F = \frac{\partial(xy^2)}{\partial x} + \frac{\partial(yz)}{\partial y} + \frac{\partial(x^2z)}{\partial z} = y^2 + z + x^2.$$
Therefore, the flow $T$ of the vector field equals
$$\iiint_V y^2+z+x^2\,dx\,dy\,dz = \int_1^3\!\!\int_0^{2\pi}\!\!\int_0^2 \rho\,\big(\rho^2\sin^2\varphi + z + \rho^2\cos^2\varphi\big)\,d\rho\,d\varphi\,dz = \int_1^3\!\!\int_0^{2\pi}\!\!\int_0^2 \rho^3 + \rho z\,d\rho\,d\varphi\,dz$$
$$= 2\pi\int_1^3\Big[\frac{\rho^4}{4} + \frac{\rho^2}{2}z\Big]_0^2\,dz = 2\pi\int_1^3 4 + 2z\,dz = 2\pi\big[4z + z^2\big]_1^3 = 2\pi(12 + 9 - 4 - 1) = 32\pi. \;\square$$

9.B.15. Find the flow of the vector field $F = (y, x, z^2)$ through the sphere $x^2+y^2+z^2 = 4$.

Solution. The divergence of the given vector field is
$$\operatorname{div} F = \nabla\cdot F = \frac{\partial y}{\partial x} + \frac{\partial x}{\partial y} + \frac{\partial z^2}{\partial z} = 2z.$$
Thus, the wanted flow equals
$$\iiint_V 2z\,dx\,dy\,dz = \int_0^{2\pi}\!\!\int_0^{\pi}\!\!\int_0^2 \rho^2\sin\varphi\cdot 2\rho\cos\varphi\,d\rho\,d\varphi\,d\psi = 2\int_0^2\rho^3\,d\rho\int_0^{2\pi}d\psi\int_0^{\pi}\sin\varphi\cos\varphi\,d\varphi$$
$$= 2\Big[\frac{\rho^4}{4}\Big]_0^2\cdot\big[\psi\big]_0^{2\pi}\cdot\Big[\frac{\sin^2\varphi}{2}\Big]_0^{\pi} = 2\cdot\frac{16}{4}\cdot 2\pi\cdot 0 = 0. \;\square$$

C. Equation of heat conduction

9.C.1. Find the solution of the so-called equation of heat conduction (equation of diffusion)
$$u_t(x,t) = a^2 u_{xx}(x,t),\quad x\in\mathbb{R},\ t > 0,$$
satisfying the initial condition $\lim_{t\to0^+} u(x,t) = f(x)$.

Notes: The symbol $u_t = \frac{\partial u}{\partial t}$ stands for the partial derivative of the function $u$ with respect to $t$ (i.e., differentiating with respect to $t$ while considering $x$ constant), and similarly $u_{xx} = \frac{\partial^2 u}{\partial x^2}$ denotes the second partial derivative with respect to $x$ (i.e., twice differentiating with respect to $x$ while considering $t$ constant). The physical interpretation of this problem is as follows: we are trying to determine the temperature $u(x,t)$ in a thermally insulated and homogeneous bar of infinite length (the range of the variable $x$) if the initial temperature of the bar is given by the function $f$. The cross-section of the bar is constant and the heat can spread in it by conduction only. The coefficient $a^2$ equals the quotient $\frac{\alpha}{c\varrho}$, where $\alpha$ is the coefficient of thermal conductivity, $c$ is the specific heat and $\varrho$ is the density. In particular, we assume that $a^2 > 0$.

Solution. We apply the Fourier transform to the equation with respect to the variable $x$. We have
$$\mathcal F(u_t)(\omega,t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} u_t(x,t)\,e^{-i\omega x}\,dx = \Big(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} u(x,t)\,e^{-i\omega x}\,dx\Big)',$$
where the derivative on the right-hand side is taken with respect to $t$, i.e., $\mathcal F(u_t)(\omega,t) = (\mathcal F(u)(\omega,t))' = (\mathcal F(u))_t(\omega,t)$. At the same time, we know that
$$\mathcal F(a^2u_{xx})(\omega,t) = a^2\,\mathcal F(u_{xx})(\omega,t) = -a^2\omega^2\,\mathcal F(u)(\omega,t).$$
Denoting $y(\omega,t) = \mathcal F(u)(\omega,t)$, we arrive at the equation $y_t = -a^2\omega^2 y$. We already solved a similar differential equation when we were calculating Fourier transforms, so it is now easy to determine all of its solutions: $y(\omega,t) = K(\omega)\,e^{-a^2\omega^2 t}$, $K(\omega)\in\mathbb{R}$.

[figure ./img/0214_eng.png]

This definition is illustrated by the picture above. Manifolds can typically be given (at least locally) implicitly as the level sets of differentiable mappings, see paragraph 8.1.24 and the discussion in 8.1.26. The mapping $\varphi$ from the definition is called the local parametrization or local map of the manifold $M$. The manifolds are a straightforward generalization of curves and surfaces in the plane $\mathbb{R}^2$ or the space $\mathbb{R}^3$. We have excluded curves and surfaces which are self-intersecting, and even those which are self-approaching. For instance, we can surely imagine a curve representing the figure 8 parametrized by a mapping $\varphi$ with everywhere-injective differential. However, we will be unable to satisfy the second property from the manifold definition in a neighborhood of the point where the two branches of the curve meet.

Tangent and cotangent bundles of manifolds
The tangent bundle $TM$ of the manifold $M$ is the collection of the vector subspaces $T_xM\subset T_x\mathbb{R}^n$ which contain all vectors tangent to the curves in $M$. There is the footpoint projection $p : TM\to M$. Similarly, the cotangent bundle $T^*M$ of the manifold $M$ is the collection of the dual spaces $(T_xM)^*$, together with the footpoint projection.

Clearly, every parametrization $\varphi$ defines a diffeomorphism $\varphi_* : TV\to T(\varphi(V))\subset TM$, $\varphi_*(c'(t)) = \frac{d}{dt}\varphi(c(t))$. Due to the chain rule, this definition does not depend on the choice of the representing curve $c(t)$. We shall also write $T\varphi$ for the mapping $\varphi_*$.
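The pushforward just introduced is a concrete computation. A minimal sympy sketch (assuming sympy; the sphere parametrization and the curve are our own illustrative choices), checking that $\varphi_*(c'(t)) = \frac{d}{dt}\varphi(c(t))$ coincides with multiplying $c'(t)$ by the Jacobian of $\varphi$:

```python
# Sketch: pushforward of a curve's velocity under a local parametrization
# of the unit sphere; by the chain rule both computations agree.
import sympy as sp

t, u, v = sp.symbols('t u v')
phi = sp.Matrix([sp.cos(u)*sp.cos(v), sp.sin(u)*sp.cos(v), sp.sin(v)])
c = {u: t, v: t**2}                       # a sample curve in the domain V

pushforward = phi.subs(c).diff(t)         # d/dt of phi(c(t))
via_jacobian = (phi.jacobian([u, v]) * sp.Matrix([1, 2*t])).subs(c)
print(sp.simplify(pushforward - via_jacobian))   # zero vector
```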
In particular, the local maps $\varphi$ (extended to $\tilde\varphi$, as in the above definition) induce the local maps $\varphi_* : TV = V\times\mathbb{R}^k\to TM\subset\mathbb{R}^n\times\mathbb{R}^n$ of the tangent bundle. Thus, the tangent bundle $TM$ is again a manifold, which locally looks like $U\times\mathbb{R}^k$ over sufficiently small open subsets $U\subset M$. But we shall see that $TM$ might be quite different from $M\times\mathbb{R}^k$ globally. Dealing with the cotangent bundle, we can use the dual mappings $(T\varphi^{-1})^*$ on the individual fibers $T_x^*M$ to obtain local parametrizations.

It remains to determine $K(\omega)$. The transformation of the initial condition gives
$$\mathcal F(f)(\omega) = \lim_{t\to0^+}\mathcal F(u)(\omega,t) = \lim_{t\to0^+} y(\omega,t) = K(\omega)\,e^0 = K(\omega),$$
hence $y(\omega,t) = \mathcal F(f)(\omega)\,e^{-a^2\omega^2t}$. Now, using the inverse Fourier transform, we can return to the original differential equation, with the solution
$$u(x,t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y(\omega,t)\,e^{i\omega x}\,d\omega = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathcal F(f)(\omega)\,e^{-a^2\omega^2t}\,e^{i\omega x}\,d\omega$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\Big(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(s)\,e^{-i\omega s}\,ds\Big)e^{-a^2\omega^2t}\,e^{i\omega x}\,d\omega = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(s)\Big(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-a^2\omega^2t}\,e^{-i\omega(s-x)}\,d\omega\Big)ds.$$
Computing the Fourier transform $\mathcal F(f)$ of the function $f(t) = e^{-at^2}$ for $a > 0$, we obtained (after relabeling the variables)
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-cp^2}\,e^{-irp}\,dp = \frac{1}{\sqrt{2c}}\,e^{-\frac{r^2}{4c}},\quad c > 0.$$
According to this formula (consider $c = a^2t > 0$, $p = \omega$, $r = s-x$), we have
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-a^2\omega^2t}\,e^{-i\omega(s-x)}\,d\omega = \frac{1}{\sqrt{2a^2t}}\,e^{-\frac{(s-x)^2}{4a^2t}}.$$
Therefore,
$$u(x,t) = \frac{1}{2a\sqrt{\pi t}}\int_{-\infty}^{\infty} f(s)\,e^{-\frac{(x-s)^2}{4a^2t}}\,ds. \;\square$$

D. Partial Differential Equations

9.D.1. Find the general solution and the solutions for the given boundary conditions of the homogeneous linear equation $xu_x + yu_y = 0$:
(i) $u(\cos\sigma, \sin\sigma) = 1$, (ii) $u(\sigma, 1) = 1 + \sigma^2$, (iii) $u(\sigma, 1-\sigma) = 3\sigma$, (iv) $u(\cos\sigma, \sin\sigma) = \sigma$, (v) $u(\sigma, \sigma) = 1$, (vi) $u(\sigma, \sigma) = \sigma$.

Solution. First, let us look at the problem geometrically. The vector $\nabla u = (u_x, u_y)$ is the gradient of the unknown function $u = u(x,y)$ (it points in the direction of steepest growth of $u$). The equation $a(x,y)u_x + b(x,y)u_y = 0$ tells us that the scalar product of the given vector field $\vec v = (a(x,y), b(x,y))$ and $\nabla u$ is zero, i.e. $\vec v$ is orthogonal to $\nabla u$ at each point, so $\vec v$ has to be tangent to the contour lines (the lines of constant value of $u$); we will call them characteristics.

Notice that two differentiable immersions $\varphi$ and $\psi$ parametrizing the same open subset $U\subset M$ provide the composition $\psi^{-1}\circ\varphi$. We view this as a coordinate change for $U$, and we have just seen that coordinate changes on $M$ induce coordinate changes on $TM$. Further, if $M$ and $N$ are two manifolds and $F : M\to N$ is a mapping, we say that $F$ is differentiable (up to order $r$, or smooth, or analytic) if the compositions $\psi^{-1}\circ F\circ\varphi$ with local parametrizations $\varphi$ of $M$ and $\psi$ of $N$ (of the same order of differentiability as we want to check) are differentiable (up to order $r$, or smooth, or analytic). Again, the chain rule property of differentiation shows that this definition does not depend on the particular choice of the parametrizations. Each differentiable mapping $F : M\to N$ defines the tangent mapping $TF : TM\to TN$ between the tangent spaces, which clearly is differentiable of order one less than the assumed differentiability of $F$.

Vector fields and differential forms on manifolds
Smooth vector fields $X$ on a manifold $M$ are smooth sections $X : M\to TM$ of the footpoint projection $p : TM\to M$, i.e., $p\circ X = \mathrm{id}_M$. Smooth $k$-forms $\eta$ on a manifold $M$ are sections $M\to\Lambda^k(TM)^*$ such that the pullback of this form by any parametrization $V\to M$ yields a smooth exterior $k$-form on $V$.
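As a computational aside to the heat conduction problem 9.C.1 solved above: for a Gaussian initial datum the kernel formula can be evaluated in closed form. A minimal sympy check (assuming sympy; the choice $f(x) = e^{-x^2}$ is ours):

```python
# Sketch: for f(x) = exp(-x**2) the convolution with the heat kernel has a
# closed form, and the result satisfies u_t = a**2 u_xx with u -> f as t -> 0+.
import sympy as sp

x, s = sp.symbols('x s', real=True)
a, t = sp.symbols('a t', positive=True)

u = sp.integrate(sp.exp(-s**2) * sp.exp(-(x - s)**2/(4*a**2*t)),
                 (s, -sp.oo, sp.oo)) / (2*a*sp.sqrt(sp.pi*t))
u = sp.simplify(u)        # exp(-x**2/(4*a**2*t + 1)) / sqrt(4*a**2*t + 1)

residual = sp.diff(u, t) - a**2*sp.diff(u, x, 2)
print(sp.simplify(residual))     # 0
```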
Now it is easy to find the general solution: it can be any function which is constant on the characteristics (the integral curves of the vector field $\vec v$). Let us find them for our case by solving the system of ODEs
$$\dot x(t) = \frac{dx}{dt}(t) = a(x,y) = x,\qquad \dot y(t) = \frac{dy}{dt}(t) = b(x,y) = y,$$
or by solving the equation $y'(x) = \frac{dy}{dx} = \frac{\dot y}{\dot x} = \frac{b}{a}$. The solution yields the curves $x = C_1e^t$, $y = C_2e^t$, i.e. $\frac{y}{x} = C$, and the characteristics are lines through the origin (picture P.1). The general solution can be written as $u(x,y) = \Phi(\frac{y}{x})$, or $u(x,y) = \Psi(\frac{x}{y})$. Test: $u_x = \Phi'\cdot(-\frac{y}{x^2})$, $u_y = \Phi'\cdot\frac{1}{x}$, hence $xu_x + yu_y = 0$.

[figure: the characteristics together with the boundary curves for the cases (i)/(iv), (ii) and (iii)]

On the picture, we see the characteristics together with the given boundary curves. We need to choose the solutions satisfying our boundary condition. The questions are: How many solutions can we find? Does a solution exist for every boundary condition?

(i) The parametrized boundary curve (a circle) intersects every characteristic infinitely many times. But the value $u(\cos\sigma, \sin\sigma) = 1$ is constant. The solution has to be constant on the characteristics, therefore $u(x,y) = 1$ is the unique solution of the equation with the boundary condition (i).

(ii) The parametrized boundary curve (a line) $x = \varphi(\sigma) = \sigma$, $y = \psi(\sigma) = 1$ intersects almost every characteristic (except the line $y = 0$) exactly once. The value $u = f(\sigma) = 1 + \sigma^2$ at the point of intersection should be the same along the whole characteristic: without loss of generality, take $t = 0$, and from $x = C_1e^0 = \varphi(\sigma) = \sigma$, $y = C_2e^0 = \psi(\sigma) = 1$ we get $C_1 = \sigma$, $C_2 = 1$. Then $x = \sigma e^t$, $y = e^t$, hence $\sigma = \frac{x}{y}$, and we have the solution
$$u(x,y) = f(\sigma(x,y)) = 1 + [\sigma(x,y)]^2 = 1 + \frac{x^2}{y^2}.$$
This solution exists for $y\ne 0$. On the connected component of its domain containing the boundary curve, $\Omega = \{(x,y)\in\mathbb{R}^2\mid y > 0\}$, this is the unique continuous solution.

We write $X(M)$ for the space of smooth vector fields on $M$, while $\Omega^k(M)$ stands for the space of all smooth exterior $k$-forms on $M$. Notice that all our coordinate formulae for the vector fields, forms, pullbacks etc. on $\mathbb{R}^m$ hold true in the more abstract setting of manifolds and their local parametrizations.¹

¹ Actually, instead of dealing with manifolds as subsets of $\mathbb{R}^n$, we might use the same concept of local parametrizations of a space $M$ with differentiable transition functions $\psi^{-1}\circ\varphi$. We just need to know what the "open subsets" in $M$ are, thus we could start at the level of topological spaces. On the other hand, there is the general result (the so-called Whitney embedding theorem) that each such abstract $n$-dimensional manifold can be realized as embedded in $\mathbb{R}^{2n}$, so we essentially do not lose any generality here.

9.1.7. Integration of exterior forms on manifolds. Now, we are almost ready for the definition of the integral of $k$-forms on $k$-dimensional manifolds. For the sake of simplicity, we will examine smooth forms $\omega$ with compact support only. First, let us assume that we are given a $k$-dimensional manifold $M\subset\mathbb{R}^n$ and one of its local parametrizations $\varphi : V\subset\mathbb{R}^k\to U\subset M\subset\mathbb{R}^n$. We consider the standard orientation on $\mathbb{R}^k$ given by the standard basis (cf. 4.1.19 for the definition of the orientation of a vector space). The choice of the parametrization $\varphi$ also fixes the orientation of the manifold $U\subset M$. This orientation will be the same for those choices of local parametrizations which differ by diffeomorphisms with positive determinants of their Jacobi matrices. The orientation will be the other one in the case of negative determinants.
The manifold $M$ is called orientable if there is a covering of the entire set $M$ by local parametrizations $\varphi$ such that their orientations coincide. Therefore, we apparently have exactly two orientations on every connected orientable manifold. Fixing either of them, we thereby restrict the set of parametrizations to those compatible with this orientation. From now on, we will always proceed in this fashion, and we will talk about oriented manifolds only.

Next, let us fix a form $\omega$ with compact support inside the image of one parametrization $U\subset M$ of an oriented manifold $M$. The pullback form $\varphi^*(\omega)$ is a smooth $k$-form on $V\subset\mathbb{R}^k$ with compact support. The integral of the form $\omega$ on $M$ is defined in terms of the chosen parametrization which is compatible with the orientation as follows:
$$\int_M \omega = \int_{\mathbb{R}^k}\varphi^*(\omega).$$
If we choose a different compatible parametrization $\tilde\varphi = \varphi\circ\psi$, where $\psi$ is a diffeomorphism $\psi : W\to V\subset\mathbb{R}^k$, we can easily compute the result, following the same definition. Let us denote $\varphi^*(\omega)(u) = f(u)\,du_1\wedge\cdots\wedge du_k$. Invoking the relation 9.1.2(2) for the pullback of a form by a composite mapping, we get
$$\int_M \omega = \int_{\mathbb{R}^k}\tilde\varphi^*(\omega) = \int_{\mathbb{R}^k}\psi^*(\varphi^*\omega) = \int_{\mathbb{R}^k}\psi^*\big(f\,du_1\wedge\cdots\wedge du_k\big) = \int_{\mathbb{R}^k} f(\psi(v))\,\det(D^1\psi)(v)\,dv_1\cdots dv_k.$$
This is again the same value as $\int_{\mathbb{R}^k}\varphi^*\omega$. This proves the correctness of our definition of the integral $\int_M\omega$, provided the integrated $k$-form has compact support lying in the image of a single parametrization.

However, typical manifolds $M$ are given by implicit equations. For example, $x^2+y^2+z^2 = 1$ defines the surface of the unit ball, i.e., the sphere $S^2\subset\mathbb{R}^3$. If we want to integrate an exterior 2-form on $S^2$, we will have to use several parametrizations. Fortunately, our definition of the integral is additive with respect to disjoint unions of integration domains. Therefore, if we can write $M = U_1\cup U_2\cup\cdots\cup U_m\cup B$, where the $U_i$ are pairwise disjoint images of parametrizations $\varphi_i$, and $B$ is a set whose inverse image in any parametrization is a Riemann measurable set with measure zero, we can compute
$$\int_M\omega = \int_{U_1}\omega + \cdots + \int_{U_m}\omega,$$
and we can easily verify that this value is independent of the choice of the sets $U_i$ and the parametrizations (in particular, we need not be worried by the set $B$, since the result of any integration on it is zero). For example, we can imagine splitting a sphere into the upper and lower hemispheres, leaving the equator $B$ uncovered.

There is no "natural" way to extend it for $y < 0$.

(iii) The parametrized boundary curve (a line) $x = \varphi(\sigma) = \sigma$, $y = \psi(\sigma) = 1-\sigma$ intersects almost every characteristic (except the line $y = -x$) exactly once. Again, the value $u = f(\sigma) = 3\sigma$ at the point of intersection should be the same along the whole characteristic: take $t = 0$, and from $x = C_1e^0 = \varphi(\sigma) = \sigma$, $y = C_2e^0 = \psi(\sigma) = 1-\sigma$ we get $C_1 = \sigma$, $C_2 = 1-\sigma$. Then
$$x = \sigma e^t,\quad y = (1-\sigma)e^t \;\Longrightarrow\; \frac{\sigma}{1-\sigma} = \frac{x}{y} \;\Longrightarrow\; \sigma = \frac{x}{x+y},$$
and we have the unique solution $u(x,y) = f(\sigma(x,y)) = 3\sigma(x,y) = \frac{3x}{x+y}$. This solution exists for $y\ne -x$. Let us make a test: $u_x = \frac{3y}{(x+y)^2}$, $u_y = \frac{-3x}{(x+y)^2}$, hence $xu_x + yu_y = 0$ and $u(\sigma, 1-\sigma) = \frac{3\sigma}{\sigma + 1 - \sigma} = 3\sigma$.

(iv) The boundary curve intersects every characteristic infinitely many times, but the values at two points on the same characteristic differ. Globally, this problem has no solution. We can solve it, for example, for $\sigma\in(-\frac{\pi}{2}, \frac{\pi}{2})$: $t = 0$, $x = C_1e^0 = \cos\sigma$, $y = C_2e^0 = \sin\sigma$, thus $C_1 = \cos\sigma$, $C_2 = \sin\sigma$ and $x = \cos\sigma\,e^t$, $y = \sin\sigma\,e^t$, hence $\sigma = \operatorname{arctg}\frac{y}{x}$.
We get the solution for $x > 0$: $u(x,y) = \sigma(x,y) = \operatorname{arctg}\frac{y}{x}$. Make a test.

(v) The boundary curve $y = x$ is one of the characteristics, so the problem is degenerate. The condition $u(\sigma,\sigma) = 1$ admits infinitely many solutions: any function $u = \Phi(\frac{y}{x})$ or $\Psi(\frac{x}{y})$ taking the value 1 for $y = x$ solves this problem (for example $u = \frac{y}{x}$, $u = \frac{y^2}{x^2}$, $u = 1$, ...).

(vi) The boundary curve $y = x$ is again one of the characteristics, and now there is no solution, because the condition $u(\sigma,\sigma) = \sigma$ does not prescribe a constant value along the characteristic $y = x$.

Finally, it seems (for our equation) that if the boundary curve intersects each characteristic at exactly one point, then there exists a unique continuous solution (on some open set containing the boundary curve) satisfying the boundary condition. In general, the situation is more complicated: the requirement that the boundary curve intersect each characteristic at exactly one point is neither necessary nor sufficient. $\square$

When calculating in practice, we usually divide the entire manifold into several disjoint open areas with compact closures, and we integrate on each of them separately. However, this procedure still does not help if we stick with the strict assumption that the entire support of the integrated form has to be inside one parametrization. Thus, we will develop a global definition of the integral, which is more advantageous from the technical/theoretical point of view (although it usually does not help in computations directly).

9.1.8. Partition of unity. Consider a manifold $M\subset\mathbb{R}^n$ and one of its covers by open images $U_i$ of parametrizations $\varphi_i$. We can surely find a countable cover of each manifold $M$ (it suffices to realize that we can make do with parametrizations which map the origin to points with rational coordinates in $\mathbb{R}^n$). Furthermore, we shall assume that any point $x\in M$ belongs to only finitely many sets $U_i$. Such a cover is called a locally finite cover by the parametrizations $\varphi_i$.²

² This property is called paracompactness and, actually, each metric space is paracompact. Thus, in particular, all our manifolds enjoy this property too. But we do not want to go into the details of the proof.

Now, recall the smooth variants of indicator functions from paragraph 6.1.10. For every pair of positive numbers $\varepsilon < r$, we constructed a function $f_{\varepsilon,r}(t)$ of one real variable $t$ such that $f_{\varepsilon,r}(t) = 1$ for $|t| < r-\varepsilon$, while $f_{\varepsilon,r}(t) = 0$ for $|t| > r+\varepsilon$, and $0\le f_{\varepsilon,r}(t)\le 1$ everywhere. At the same time, we had $f(t)\ne 0$ if and only if $|t| < r+\varepsilon$. Next, if we define $\chi_{r,\varepsilon,x_0}(x) = f_{\varepsilon,r}(|x - x_0|)$, then we get a smooth function which takes the value 1 inside the ball $B_{r-\varepsilon}(x_0)$, with support exactly $B_{r+\varepsilon}(x_0)$, and with values between 0 and 1 everywhere.

Lemma (Whitney's theorem). Every closed set $K\subset\mathbb{R}^n$ is the set of all zero points of some smooth non-negative function.

Proof. The idea of the proof is quite simple. If $K = \mathbb{R}^n$, the zero function fulfills the conditions, so we can further assume that $K\ne\mathbb{R}^n$. The open set $U = \mathbb{R}^n\setminus K$ can be expressed as the union of (at most) countably many open balls $B_{r_i}(x_i)$, and for each of them, we choose a smooth non-negative function $f_i$ on $\mathbb{R}^n$ whose support is just $B_{r_i}(x_i)$, see the function $\chi_{r,\varepsilon,x_0}$ above. Now, we add up all these functions into an infinite series
$$f(x) = \sum_{k=1}^{\infty} a_kf_k(x),$$
where the positive coefficients $a_k$ are selected so small that this series converges to a smooth function $f(x)$. To this purpose, it suffices to choose $a_k$ so that all partial derivatives of all the functions $a_kf_k(x)$ up to order $k$ (inclusive) are bounded from above by $2^{-k}$.
9.D.2. Find the general solution of $yu_x + xu_y = 0$ and the solutions for the given boundary conditions: (i) $u(x,0) = |x|$, (ii) $u(0,y) = y^2$, (iii) $u(\sigma,\sigma) = 2\sigma^2$. ⃝

9.D.3. Find the general solution of $yu_x - xu_y = 0$ and the solutions for the given boundary conditions: (i) $u(\sigma,\sigma) = 2\sigma^2$, (ii) $u(0,y) = \frac{1}{1+y^2}$, (iii) $u(x,1) = x^4$. ⃝

9.D.4. Find the solution of $\sin x\sin y\,u_x + \cos x\cos y\,u_y = 0$, $u = \cos 2y$ for $x+y = \frac{\pi}{2}$. ⃝

9.D.5. Consider the general quasilinear first order equation
$$a(x,y,u)u_x + b(x,y,u)u_y = f(x,y,u);$$
its special case is the equation linear in the highest order terms (in some texts, including ours, called semilinear), $a(x,y)u_x + b(x,y)u_y = f(x,y,u)$, and again its special case is the linear equation (generally nonhomogeneous), $a(x,y)u_x + b(x,y)u_y = f(x,y)$. In the following sections we show how to solve each of these types of equations. Find a solution of the quasilinear equation with the given boundary condition:
$$u_x - uu_y = -u,\qquad u(0,y) = 2y.$$

Solution. Let us solve the characteristic system $\dot x = a = 1$, $\dot y = b = -u$, $\dot u = f = -u$. The characteristics are given parametrically:
$$x(t) = t + C_1,\quad y(t) = C_3e^{-t} + C_2,\quad u(t) = C_3e^{-t}.$$
If we want the general solution of the equation, it is given implicitly by $\Phi(ue^x, u-y) = 0$, where $u - y = -C_2$, $ue^x = C_3e^{C_1}$ is some implicit description of the characteristics. From the boundary condition for $t = 0$ we get $x(0) = C_1 = 0$, $y(0) = C_3 + C_2 = \sigma$, $u(0) = C_3 = 2\sigma$, so $C_2 = -\sigma$ and
$$x(t,\sigma) = t,\quad y(t,\sigma) = \sigma(2e^{-t} - 1) \;\Longrightarrow\; \sigma = \frac{y}{2e^{-x} - 1},\quad e^{-t} = e^{-x},$$
and $u(t,\sigma) = 2\sigma e^{-t}$, hence $u(x,y) = \frac{2y}{2 - e^x}$.

Then, the series $\sum_k a_kf_k$ is bounded from above by the series $\sum_k 2^{-k}$; hence, by the Weierstrass criterion, it converges uniformly on the entire $\mathbb{R}^n$. Moreover, we get the same for all series of partial derivatives, since we can always write them as
$$\sum_{k=0}^{r-1} a_k\frac{\partial^r f_k}{\partial x_{i_1}\cdots\partial x_{i_r}} + \sum_{k=r}^{\infty} a_k\frac{\partial^r f_k}{\partial x_{i_1}\cdots\partial x_{i_r}},$$
where the first part is a smooth function, being a finite sum of smooth functions, and the second part can again be bounded from above by an absolutely converging series of numbers, so this expression converges uniformly to $\frac{\partial^r f}{\partial x_{i_1}\cdots\partial x_{i_r}}$. It is apparent from the definition that the function $f(x)$ satisfies the conditions of the lemma. $\square$

Partition of unity on a manifold
Theorem. Consider a manifold $M\subset\mathbb{R}^n$ equipped with a locally finite cover by open images $U_i$ of parametrizations $\varphi_i$. Then, there exists a system of smooth non-negative functions $f_i$ on $M$ such that for every point $x\in M$ we have $\sum_i f_i(x) = 1$, and $f_i(x)\ne 0$ if and only if $x\in U_i$.

The system of functions $f_i$ from the theorem is called the partition of unity subordinated to the locally finite cover of the manifold by the open sets $U_i$.

Proof. First, we extend the sets $U_i$ to open sets $\tilde U_i$ using the extended parametrizations $\tilde\varphi$ from the definition of a manifold and its local parametrizations. We can surely do this in such a way that the sets $\tilde U_i$ keep being a locally finite cover of an open neighborhood $\tilde U = \cup_i\tilde U_i\subset\mathbb{R}^n$ of the manifold $M$. For every open set $\tilde U_i$, we can choose a non-negative function $g_i(x)$ on the whole $\mathbb{R}^n$ so that $g_i(x)\ne 0$ exactly for $x\in\tilde U_i$. This can be done by Whitney's theorem proved in the above Lemma. Now, the function $g(x) = \sum_i g_i(x)$ is well-defined for all $x\in\mathbb{R}^n$ and smooth, thanks to the cover being locally finite (for every fixed point $x$, it is a finite sum of non-zero functions on some of its neighborhoods). The function $g(x)$ is positive for all $x\in M$.
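The normalization step that finishes this proof below can be made concrete in one dimension. A minimal numpy sketch (assuming numpy; the bumps, their supports and the interval $M = [0,1]$ are our own illustrative choices), building two smooth bumps and normalizing them into a partition of unity:

```python
# Sketch: two smooth bumps g1, g2 whose supports cover M = [0, 1],
# normalized into a partition of unity f_i = g_i / (g1 + g2).
import numpy as np

def h(t):
    # smooth on R, zero for t <= 0 (vanishes to infinite order at 0)
    return np.where(t > 0, np.exp(-1.0/np.maximum(t, 1e-12)), 0.0)

def bump(x, a, b):
    # smooth, positive exactly on (a, b), zero outside, as in Whitney's lemma
    return h(x - a) * h(b - x)

x = np.linspace(0.0, 1.0, 5)
g1, g2 = bump(x, -0.5, 0.7), bump(x, 0.3, 1.5)   # supports cover [0, 1]
g = g1 + g2                                       # positive on all of M
f1, f2 = g1/g, g2/g
print(f1 + f2)                                    # [1. 1. 1. 1. 1.]
```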
Thus, instead of the functions $g_i(x)$ restricted to $M$, we may rather consider the functions $f_i(x) = g_i(x)/g(x)$, which already have both of the required properties of the theorem. $\square$

9.1.9. Integration of $k$-forms on manifolds. Now, we are ready for the definition of the integral of $k$-forms on $k$-dimensional manifolds. Let us consider an oriented manifold $M\subset\mathbb{R}^n$ and a form $\omega\in\Omega^k(M)$ with compact support. Let us choose a locally finite cover of the manifold $M$ by parametrizations $\varphi_i : V_i\to U_i$ such that the closures of all images $\varphi_i(V_i)$ are compact, and then choose a partition of unity $f_i$ subordinated to this cover. The integral is defined by the formula
$$\int_M\omega = \int_M\sum_i f_i\omega = \sum_i\int_{U_i} f_i\omega,$$

This is the unique continuous solution on the connected component of its domain containing the boundary curve (for $x < \ln 2$). Make a test. $\square$

9.D.6. Find a solution of the equation $yu_x + xu_y = 2u$ for the boundary conditions (i)–(iii):
(i) $u(x,0) = 1$, (ii) $u(x,0) = x^2$, (iii) $u(0,y) = \sqrt{1+y^2}$.

Solution. This is a semilinear equation; we will solve it as a quasilinear one (in the next section we show another method working for semilinear equations): $\dot x = y$, $\dot y = x$, $\dot u = 2u$. The characteristics are given parametrically:
$$x(t) = C_1e^t + C_2e^{-t},\quad y(t) = C_1e^t - C_2e^{-t},\quad u(t) = C_3e^{2t}.$$
(i) Without loss of generality, put $t = 0$ at the point of intersection of the characteristic curve with the boundary curve $x = \varphi(\sigma) = \sigma$, $y = \psi(\sigma) = 0$, $u = f(\sigma) = 1$ (the boundary curve intersects the characteristics for $x > y$): $x(0) = C_1 + C_2 = \sigma$, $y(0) = C_1 - C_2 = 0$. Then $C_1 = C_2 = \frac{\sigma}{2}$, and from $x = \frac{\sigma}{2}(e^t + e^{-t}) = \sigma\cosh t$, $y = \frac{\sigma}{2}(e^t - e^{-t}) = \sigma\sinh t$ we get
$$x^2 - y^2 = \sigma^2,\quad x + y = \sigma e^t,\quad \sigma = \sqrt{x^2 - y^2},\quad e^t = \frac{x+y}{\sqrt{x^2-y^2}}.$$
$u(0) = C_3 = f(\sigma) = 1$, so $u(x,y) = u(\sigma(x,y), t(x,y)) = e^{2t} = \Big[\frac{x+y}{\sqrt{x^2-y^2}}\Big]^2$. Finally $u(x,y) = \frac{x+y}{x-y}$, for $x > y$. Make a test.

(ii) The boundary curve is the same, $u(0) = C_3 = f(\sigma) = \sigma^2$, and the solution is $u(x,y) = \sigma^2e^{2t} = (x+y)^2$, for $x > y$.

(iii) Again put $t = 0$ at the point of intersection of the characteristic curve with the boundary curve $x = \varphi(\sigma) = 0$, $y = \psi(\sigma) = \sigma$, $u = f(\sigma) = \sqrt{1+\sigma^2}$ (the boundary curve intersects the characteristics for $y > x$): $x(0) = C_1 + C_2 = 0$, $y(0) = C_1 - C_2 = \sigma$. Then $C_1 = -C_2 = \frac{\sigma}{2}$, and from $x = \frac{\sigma}{2}(e^t - e^{-t}) = \sigma\sinh t$, $y = \frac{\sigma}{2}(e^t + e^{-t}) = \sigma\cosh t$

where the right-hand integrals have already been defined, since each of the forms $f_i\omega$ has support inside the image under the parametrization $\varphi_i$ (and they equal $\int_M f_i\omega$ for the same reason). Actually, we can assume that our sum is finite, since it suffices to consider the integral over the images of the parametrizations covering the compact support of $\omega$. Hence, it is a well-defined number, yet it remains to verify that the resulting value is independent of all our choices. To this purpose, let us choose another system of parametrizations $\psi_j : \tilde V_j\to\tilde U_j$, again with compatible orientations, providing a locally finite cover of $M$. Let $g_j$ be the corresponding partition of unity. Then the sets $W_{ij} = U_i\cap\tilde U_j$ form again a locally finite cover, and the functions $f_ig_j$ provide the partition of unity subordinated to this cover. We arrive at the following equalities:
$$\sum_i\int_M f_i\omega = \sum_i\int_M f_i\Big(\sum_j g_j\Big)\omega = \sum_{i,j}\int_M f_ig_j\omega,$$
$$\sum_j\int_M g_j\omega = \sum_j\int_M g_j\Big(\sum_i f_i\Big)\omega = \sum_{i,j}\int_M f_ig_j\omega,$$
where the potentially infinite sums inside of the integrals are all locally finite, while the sums outside of the integrals can be viewed as finite due to the compactness of the support of $\omega$.
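Returning for a moment to the quasilinear problem 9.D.5 solved above: a minimal sympy check (assuming sympy) that the solution found there satisfies both the equation and the boundary condition:

```python
# Sketch: verify u = 2y/(2 - exp(x)) solves u_x - u*u_y = -u with u(0, y) = 2y.
import sympy as sp

x, y = sp.symbols('x y')
u = 2*y / (2 - sp.exp(x))

print(sp.simplify(sp.diff(u, x) - u*sp.diff(u, y) + u))   # 0
print(u.subs(x, 0))                                       # 2*y
```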
Thus, we have checked that the choices of the partition of unity and the parametrizations do not influence the value of the integral.

9.1.10. Exterior differential. As we have seen, the differential of a function can be interpreted as a mapping $d : \Omega^0(\mathbb{R}^n)\to\Omega^1(\mathbb{R}^n)$. By means of parametrizations, this definition extends (in a coordinate-free way) to functions $f$ on manifolds $M$, where the differential $df$ is a linear form on $M$. The following theorem extends this differential to arbitrary exterior forms on manifolds $M\subset\mathbb{R}^n$.

Exterior differential
Theorem. For all $m$-dimensional manifolds $M\subset\mathbb{R}^n$ and $k = 0,\dots,m$, there is the unique mapping $d : \Omega^k(M)\to\Omega^{k+1}(M)$ such that
(1) $d$ is linear with respect to multiplication by real numbers;
(2) for $k = 0$, this is the differential of functions;
(3) if $\alpha\in\Omega^k(M)$, $\beta$ arbitrary, then $d(\alpha\wedge\beta) = (d\alpha)\wedge\beta + (-1)^k\alpha\wedge(d\beta)$;
(4) $d(df) = 0$ for every function $f$ on $M$.
The mapping $d$ is called the exterior differential. The equality $d\circ d = 0$ is valid for all degrees $k$.

Proof. Each $k$-form can be written locally in the form $\alpha = \sum_{i_1<\cdots<i_k} a_{i_1\cdots i_k}\,dx_{i_1}\wedge\cdots\wedge dx_{i_k}$.

and we get
$$y^2 - x^2 = \sigma^2,\quad x + y = \sigma e^t,\quad \sigma = \sqrt{y^2 - x^2},\quad e^t = \frac{x+y}{\sqrt{y^2-x^2}}.$$
$u(0) = C_3 = f(\sigma) = \sqrt{1+\sigma^2}$, so $u(x,y) = u(\sigma(x,y), t(x,y)) = \sqrt{1+\sigma^2}\,e^{2t}$. Finally
$$u(x,y) = \sqrt{1 + y^2 - x^2}\cdot\frac{x+y}{y-x},\quad\text{for } y > x.$$
Make a test. How do we write the general solution of the equation? Take an implicit description of the characteristics: from $x = C_1e^t + C_2e^{-t}$, $y = C_1e^t - C_2e^{-t}$, $u = C_3e^{2t}$ we get that $\Phi = \frac{(x+y)^2}{u}$ and $\Psi = y^2 - x^2$ are constant along the characteristics. The general solution of the equation is any function constant on the characteristics: $u$ given by $\alpha(\Phi, \Psi) = 0$, or $u = (x+y)^2K(\Psi) = (x+y)^2K(y^2-x^2)$. Make a test. In the next example we show another way to solve a semilinear equation. $\square$

9.D.7. Solve the equation (S) and the equation (L),
$$(S)\;\; x^2u_x + xyu_y = u^2,\qquad (L)\;\; x^2u_x + xyu_y = x^2,$$
both generally and for the boundary conditions (i)–(iv):
(i) $u(1,y) = y$, (ii) $u(\cos\sigma, \sin\sigma) = 1$, $\sigma\in(-\frac{\pi}{2}, \frac{\pi}{2})$, (iii) $u(x, 1-x) = x$, (iv) $u(x, 1-x) = x^2$.

Solution. Let us start again with the characteristic equation $y' = \frac{dy}{dx} = \frac{xy}{x^2}$, hence $\frac{y}{x} = C$. Take the new coordinates $\xi(x,y) = \frac{y}{x}$, $\eta(x,y) = y$ and compute
$$u_x = u_\xi\xi_x + u_\eta\eta_x = -\frac{y}{x^2}u_\xi,\qquad u_y = u_\xi\xi_y + u_\eta\eta_y = \frac{1}{x}u_\xi + u_\eta.$$
The equation (S) now reads $u_\eta = \frac{u^2}{\eta^2}\cdot\xi$, with the solution
$$\int\frac{\partial u}{u^2} = \int\frac{\xi\,\partial\eta}{\eta^2} \;\Longrightarrow\; -\frac{1}{u} = -\frac{\xi}{\eta} + K(\xi).$$
We can write the general solution as
$$u(\xi,\eta) = \frac{\eta}{\xi + \eta D(\xi)},\qquad u(x,y) = \frac{x}{1 + xD\big(\frac{y}{x}\big)}.$$
Make a test. Now we can use the general solution to find the function $D$, depending on the given boundary condition:
(i) $u(1,\sigma) = \frac{1}{1 + D(\sigma)} = \sigma$, so $D(\sigma) = \frac{1}{\sigma} - 1$. From $\frac{y}{x} = \sigma$ we get the solution $u(x,y) = \frac{x}{1 + x(\frac{x}{y} - 1)} = \frac{xy}{y + x^2 - xy}$.
(ii) … for $x > 0$.
(iii) Analogously, $u(x,y) = x$.
(iv) Analogously, $u(x,y) = \frac{x^2}{x + y^2 + xy}$.
For the linear equation (L) the situation is much easier. We can see that $u = x$ is one of the solutions of both (S) and (L). For (L) we can use the principle of superposition and write the general solution as the sum of the general solution of the homogeneous case and any particular solution (make a test): $u(x,y) = x + D\big(\frac{y}{x}\big)$. For the boundary conditions we get
(i) $u(x,y) = x + \frac{y}{x} - 1$,
(ii) $u(x,y) = x + 1 - \frac{x}{\sqrt{x^2+y^2}}$, $x > 0$,
(iii) $u(x,y) = x$,
(iv) $u(\sigma, 1-\sigma) = \sigma + D\big(\frac{1-\sigma}{\sigma}\big) = \sigma^2$, so $D\big(\frac{1-\sigma}{\sigma}\big) = \sigma^2 - \sigma$; from $\frac{1-\sigma}{\sigma} = \frac{y}{x}$, i.e. $\sigma = \frac{x}{x+y}$, we get $D\big(\frac{y}{x}\big) = \sigma^2 - \sigma = -\frac{xy}{(x+y)^2}$, and the solution is $u(x,y) = x - \frac{xy}{(x+y)^2}$, $x\ne -y$.
Verify all the solutions by substitution, and use this method also for the previous problem. $\square$

9.D.8. Find the general solution of $yu_x + u_y = -u$. ⃝
9.D.9. Find the general solution of $u_x + u_y = -u(x-y)$ and the solutions for the given boundary conditions: (i) $u(0,y) = y$, (ii) $u(\sigma,-\sigma) = \frac{1}{\sigma}$, (iii) $u(x,0) = x^2$. ⃝

9.D.10. Solve $u_x - y^2u_y = u^2$, $u(x,1) = \frac{1}{2x}$. ⃝

9.D.11. Solve $xu_x + yu_y = 2(x^2+y^2)$, $u(1,y) = 2y^2 + 1$. ⃝

9.D.12. Solve $2xu_x - yu_y = x^2 + y^2$, $u(2,y) = 1 - y^2$. ⃝

The second step in the proof is the verification that the coordinate formula (5) correctly defines a differential operator on general manifolds $M$. In order to achieve this, it is sufficient to show that the coordinate expression of the exterior derivative commutes with the pullbacks of forms. Indeed, we may then define the differential operator around any point in any coordinates, and the results will coincide.³ Thus, consider a change of coordinates $G : U\to V$, $x = G(y) = (g_1(y),\dots,g_m(y))$, and compute $G^*(d\alpha)$ of an exterior form $\alpha = a\,dx_{i_1}\wedge\cdots\wedge dx_{i_k}$ (which gives the result for sums of such expressions too). This is straightforward:
$$G^*(d\alpha) = \sum_i G^*\Big(\frac{\partial a}{\partial x_i}\Big)G^*(dx_i)\wedge G^*(dx_{i_1})\wedge\cdots = \sum_i\Big(\frac{\partial a}{\partial x_i}\circ G\Big)\Big(\frac{\partial g_i}{\partial y_1}dy_1 + \cdots\Big)\wedge\Big(\frac{\partial g_{i_1}}{\partial y_1}dy_1 + \cdots\Big)\wedge\cdots.$$
Now, notice $d(G^*(dx_j)) = G^*(d(dx_j)) = 0$, and thus
$$d\big(G^*(\alpha)\big) = d\big((a\circ G)\,G^*(dx_{i_1})\wedge\dots\big) = d(a\circ G)\wedge G^*(dx_{i_1})\wedge\cdots\wedge G^*(dx_{i_k}) = \sum_i\Big(\Big(\frac{\partial a}{\partial x_i}\circ G\Big)\Big(\frac{\partial g_i}{\partial y_1}dy_1 + \dots\Big)\Big)\wedge G^*dx_{i_1}\wedge\dots,$$
clearly the same expressions. $\square$

³ Such operators are intrinsically defined on all manifolds. Actually, for all $k > 0$, the only operation $d : \Omega^k\to\Omega^{k+1}$ commuting with pullbacks and with values depending only on the behavior of the argument $\alpha$ on any small neighborhood of $x$ (locality of the operator) is the exterior derivative. Thus, even the linearity, as well as the dependence on the first derivatives, are direct consequences of naturality. See the book Natural operations in differential geometry, Springer, 1993, by I. Kolar, P.W. Michor and J. Slovak for the full proof of this astonishing claim.

9.1.11. Manifolds with boundary. In practical problems, we often work with manifolds $M$ like an open ball in the three-dimensional space. At the same time, we are interested in the boundaries of these manifolds $\partial M$, which is a sphere in the case of a ball. The simplest case is that of connected curves. Either it is a closed curve (like a circle in the plane), and then its boundary is empty, or the boundary is formed by two points. These points will be considered including the orientation inherited from the curve, i.e. the initial point will be taken with the minus sign, and the terminal point with the plus sign. The curve integral is the easiest one, and we can notice that when integrating the differential $df$ of a function along the curve $M$ defined as the image of a parametrization $c : [a,b]\to M$, we get directly from the definition that
$$\int_M df = \int_{[a,b]} c^*(df) = \int_a^b\frac{d}{dt}(f\circ c)(t)\,dt = f(c(b)) - f(c(a)).$$
Therefore, the result is not only independent of the selected parametrization, but also of the actual curve. Only the initial and terminal points matter. Splitting the curve into several consecutive disjoint intervals, the integral splits into the sum of the differences of the values at the splitting points.

9.D.13. Solve $x^2u_x + yu_y = 2u$, $u(1,y) = y^3$. ⃝

9.D.14. Solve the equation with the given boundary condition
$$u = u_xu_y,\qquad u(x,0) = x^2.$$

Solution. We show how to use the method of characteristics for the general first order equation $F(p, q, u, x, y) = 0$, where $p = u_x$, $q = u_y$. Our equation can be written in the form $u - p\cdot q = 0$. We need to solve a system of five ODEs:
$$\dot x = F_p = -q,\quad \dot y = F_q = -p,\quad \dot u = pF_p + qF_q = -2pq,\quad \dot p = -F_x - pF_u = -p,\quad \dot q = -F_y - qF_u = -q.$$
The solution of this characteristic system is
$$x = C_1e^{-t} + C_2,\quad y = C_3e^{-t} + C_4,\quad u = C_1C_3e^{-2t} + C_5,\quad p = C_3e^{-t},\quad q = C_1e^{-t}.$$
Without loss of generality, take $t = 0$ at the point of intersection of the characteristic lines with the boundary line $x = \varphi(\sigma) = \sigma$, $y = \psi(\sigma) = 0$, $u = f(\sigma) = \sigma^2$, and find $C_i = C_i(\sigma)$ from
$$x(0) = C_1 + C_2 = \sigma,\quad y(0) = C_3 + C_4 = 0,\quad u(0) = C_1C_3 + C_5 = \sigma^2,\quad p(0) = C_3 = p_0 = 2\sigma,\quad q(0) = C_1 = q_0 = \frac{\sigma}{2}.$$
The values of $p_0$ and $q_0$ are obtained from a system of two algebraic equations:
$$F(p_0, q_0, f(\sigma), \varphi(\sigma), \psi(\sigma)) = 0 \;\Longrightarrow\; \sigma^2 = p_0\cdot q_0,$$
and, differentiating the identity $u(\varphi(\sigma), \psi(\sigma)) = f(\sigma)$ with respect to $\sigma$,
$$u_x\varphi'(\sigma) + u_y\psi'(\sigma) = f'(\sigma) \;\Longrightarrow\; p_0\cdot 1 + q_0\cdot 0 = 2\sigma.$$
Finally, we get $C_1 = \frac{\sigma}{2}$, $C_2 = \frac{\sigma}{2}$, $C_3 = 2\sigma$, $C_4 = -2\sigma$, $C_5 = 0$. From $x(\sigma,t) = \frac{\sigma}{2}(e^{-t} + 1)$, $y(\sigma,t) = 2\sigma(e^{-t} - 1)$ we can write
$$\sigma(x,y) = x - \frac{y}{4},\qquad e^{-t}(x,y) = \frac{4x+y}{4x-y},$$

This sum is telescoping (i.e., the middle terms cancel out), resulting in the same value again. Notice that we have already proved the behavior expected in 9.1.5 when dealing with the elevation gain of a cyclist. We shall discuss this phenomenon in general dimensions now. To be able to do this, we need to formalize the concept of the boundary of a manifold and its orientation. The simplest case is the closed half-space $\bar M = (-\infty, 0]\times\mathbb{R}^{n-1}$. Its boundary is $\partial M = \{(x_1, x_2,\dots,x_n)\in\mathbb{R}^n;\ x_1 = 0\}$. The orientation on this boundary inherited from the standard orientation is the one determined by the form $dx_2\wedge\cdots\wedge dx_n$.

Oriented boundary of a manifold
Let us consider a closed subset $\bar M\subset\mathbb{R}^n$ such that its interior $M\subset\bar M$ is an oriented $m$-dimensional manifold covered by compatible parametrizations $\varphi_i$. Further, let us assume that for every boundary point $x\in\partial M = \bar M\setminus M$, there is a neighborhood in $\bar M$ with a parametrization $\varphi : V\subset(-\infty,0]\times\mathbb{R}^{m-1}\to\bar M$ such that the points $x\in\partial M\cap\varphi(V)$ form just the image of the boundary of the half-space $(-\infty,0]\times\mathbb{R}^{m-1}$. The subset $\bar M\subset\mathbb{R}^n$ covered by the above parametrizations with compatible orientations is called an oriented manifold with boundary. The restrictions of the parametrizations including boundary points to the boundary $\partial M$ define the structure of an $(m-1)$-dimensional oriented manifold on $\partial M$.

Think of the closed unit balls $B(x,r)\subset\mathbb{R}^n$ as such manifolds. Their interiors are $n$-dimensional manifolds, just open subsets in $\mathbb{R}^n$, but their boundaries $S^{n-1}$ are the spheres with the inherited structure of $(n-1)$-dimensional manifolds. The inherited orientations are well understood via the outward normals to the spheres. Another example is a plane disc sitting as a 2-dimensional manifold in $\mathbb{R}^3$, with its 1-dimensional boundary being a circle. Here the chosen position of the normal to the plane defines the orientation of the circle, one way or the other. In practice, we often deal with slightly more general manifolds where we allow for corners in the boundary of all smaller dimensions. A good example is the cube in $\mathbb{R}^3$, having the sides as 2-dimensional parts of the boundary, the edges between them as 1-dimensional parts, and the vertices as 0-dimensional parts of the boundary. Yet another class of examples is formed by all simplices and their curved embeddings in $\mathbb{R}^n$. Since those lower dimensional parts of the boundary have Riemann measure zero, we can neglect them when integrating over $\partial M$.
Thus, we shall not go into the details of this technical extension of our definitions.

and
$$u(\sigma(x,y), t(x,y)) = \sigma^2e^{-2t} = \Big[\Big(x - \frac{y}{4}\Big)\frac{4x+y}{4x-y}\Big]^2,\qquad u(x,y) = \frac{1}{16}(4x+y)^2.$$
Make a test. $\square$

9.D.15. Solve $u_x^2 + u_y^2 = 1$, $u(\cos\sigma, \sin\sigma) = 1$. ⃝

9.D.16. Solve $u_x^2 + u_y^2 + 2u = 0$, $u(\cos\sigma, \sin\sigma) = -\frac12$. ⃝

9.D.17. Solve $\sqrt{u_x^2 + u_y^2} - u = 0$, $u(\cos\sigma, \sin\sigma) = 1$. ⃝

9.D.18. Solve $u_x^2 + u_y^2 - u = 0$, $u(\cos\sigma, \sin\sigma) = 1$. ⃝

9.D.19. Solve $\sqrt{u_x^2 + u_y^2} + u = 0$, $u(\cos\sigma, \sin\sigma) = -1$. ⃝

9.D.20. Solve $u_xu_y + u = 0$, $u(x,0) = x$. ⃝

9.D.21. Solve $xu_x^2 + yu_y^2 = u$, $u(\sigma,\sigma) = 2\sigma$. ⃝

9.D.22. Solve $xu_x^2 + yu_y^2 = u$, $u(1,y) = 1$. ⃝

9.D.23. Solve $u_x^2 - u_y^2 = 4u$, $u(\cos\sigma, \sin\sigma) = \cos 2\sigma$. ⃝

9.D.24. Solve $u_x^2 + yu_y = u$, $u(1,y) = y$. ⃝

9.1.12. Stokes' theorem. Now, we get to a very important and useful result. We shall formulate the main theorem about the multidimensional analogy of curve integrals for smooth forms and smooth manifolds. A brief analysis of the proof shows that actually we need once continuously differentiable exterior forms as integrands on twice continuously differentiable parametrizations of the manifold. In practice, the boundary of the region is often similar to the case of the unit cube in $\mathbb{R}^3$, i.e., we have discontinuities of the derivatives on a Riemann measurable set of measure zero in the boundary. In such a case, we divide the integration into smooth parts and add the results up. We can notice that although new pieces of boundaries appear, they are adjacent and have opposite orientations in the adjacent regions, so their contributions cancel out (just like in the above case of the boundary points of a piecewise differentiable curve).

Stokes' theorem
Theorem. Consider a smooth exterior $(k-1)$-form $\omega$ with compact support on an oriented manifold $\bar M$ with boundary $\partial M$ with the inherited orientation. Then we have
$$\int_M d\omega = \int_{\partial M}\omega.$$

Proof. Using an appropriate locally finite cover of the manifold $\bar M$ and a partition of unity subordinated to it, we can express the integrals on both sides as the sum (even a finite sum, since the support of the considered form $\omega$ is compact) of integrals of forms supported in individual parametrizations. Thus, we can restrict ourselves to just two cases: $\bar M = \mathbb{R}^k$ or the half-space $\bar M = (-\infty,0]\times\mathbb{R}^{k-1}$. In both cases, $\omega$ will surely be the sum of forms
$$\omega_j = a_j(x)\,dx_1\wedge\cdots\wedge\widehat{dx_j}\wedge\cdots\wedge dx_k,$$
where the hat indicates the omission of the corresponding linear form, and $a_j(x)$ is a smooth function with compact support. Their exterior differentials are
$$d\omega_j = (-1)^{j-1}\frac{\partial a_j}{\partial x_j}\,dx_1\wedge\cdots\wedge dx_k.$$
Again, we can verify the claim of the theorem for such forms $\omega_j$ separately. Let us compute the integrals $\int_M d\omega_j$ using Fubini's theorem. This is most simple if $\bar M = \mathbb{R}^k$:
$$\int_{\mathbb{R}^k} d\omega_j = (-1)^{j-1}\int_{\mathbb{R}^{k-1}}\Big(\int_{-\infty}^{\infty}\frac{\partial a_j}{\partial x_j}\,dx_j\Big)dx_1\cdots\widehat{dx_j}\cdots dx_k = (-1)^{j-1}\int_{\mathbb{R}^{k-1}}\big[a_j\big]_{-\infty}^{\infty}\,dx_1\cdots\widehat{dx_j}\cdots dx_k = 0.$$
Notice that we are allowed to use Fubini's theorem for the entire $\mathbb{R}^k$, since the support of the integrated function is in fact compact, and thus we can replace the integration domain by a large multidimensional interval $I$. At the same time, the forms $\omega_j$ are all zero outside of such a large interval $I$, and thus the integrals $\int_{\partial M}\omega_j$ all vanish and the claim of Stokes' theorem is verified in this case.

9.D.25. Find all solutions (if they exist) of the system
$$u_x = f(x,y,u) = y(u - xy + 1),\qquad u_y = g(x,y,u) = x(u - xy + 1).$$

Solution.
A necessary (and sufficient) condition for the existence of a solution is $u_{xy} = u_{yx}$:
$$u_{xy} = f_y + f_ug = u - 2xy + 1 + xyu - x^2y^2 + xy,\qquad u_{yx} = g_x + g_uf = u - 2xy + 1 + xyu - x^2y^2 + xy.$$
Let us start with the first equation, $\frac{\partial u}{\partial x} = yu - xy^2 + y$, which is linear, so we can look for the solution in the form $u(x,y) = K(x,y)\,e^{xy}$. Then $u_x = K_xe^{xy} + yKe^{xy} = yKe^{xy} - xy^2 + y$, hence $K_x = (y - xy^2)\,e^{-xy}$, and by per partes we have $K(x,y) = xy\,e^{-xy} + D(y)$. Substituting this into the second equation,
$$u(x,y) = xy + D(y)\,e^{xy} \;\Longrightarrow\; u_y = x + D'(y)\,e^{xy} + xD(y)\,e^{xy} = g(x,y,u),$$
we get $D'(y) = 0$, so $D$ is constant and all solutions are given by $u(x,y) = xy + De^{xy}$. $\square$

9.D.26. Find the solution (if it exists) of the system $u_x = 2xy^2u$, $u_y = 2x^2yu$. ⃝

9.D.27. Find the solution (if it exists) of the system $xu_x = u - y$, $yu_y = u - x$. ⃝

9.D.28. For the second order semilinear equation we will use the notation
$$A(x,y)u_{xx} + 2B(x,y)u_{xy} + C(x,y)u_{yy} = F(u_x, u_y, u, x, y).$$
Show that the characteristic polynomial of the matrix $\begin{pmatrix} A & B\\ B & C\end{pmatrix}$ has
(E) two nonzero real roots (not necessarily different) of the same sign ($\lambda_1\cdot\lambda_2 > 0$) if and only if $B^2 - AC < 0$; the equation is elliptic;
(H) two nonzero real roots of different signs ($\lambda_1\cdot\lambda_2 < 0$) if and only if $B^2 - AC > 0$; the equation is hyperbolic;
(P) two real roots of which (at least) one is zero ($\lambda_1\cdot\lambda_2 = 0$) if and only if $B^2 - AC = 0$; the equation is parabolic. ⃝

Actually, we may also say that $\partial M = \emptyset$, and thus the integral is zero. Next, let us assume that $\bar M$ is the half-space $(-\infty,0]\times\mathbb{R}^{k-1}$. If $j > 1$, the form $\omega_j$ evaluates identically to zero on the boundary $\partial M$, since $x_1$ is constant there and thus $dx_1$ is identically zero on all tangent directions to $\partial M$. Integration over the interior $M$ yields zero, using the same approach as above:
$$\int_M d\omega_j = (-1)^{j-1}\int_{-\infty}^{0}\int_{\mathbb{R}^{k-2}}\Big(\int_{-\infty}^{\infty}\frac{\partial a_j}{\partial x_j}\,dx_j\Big)dx_1\cdots\widehat{dx_j}\cdots dx_k = (-1)^{j-1}\int_{-\infty}^{0}\int_{\mathbb{R}^{k-2}}\big[a_j\big]_{-\infty}^{\infty}\,dx_1\cdots\widehat{dx_j}\cdots dx_k = 0,$$
since the function $a_j$ has compact support. So the theorem is also true in this case. However, if $j = 1$, then we obtain
$$\int_M d\omega_1 = \int_{\mathbb{R}^{k-1}}\Big(\int_{-\infty}^{0}\frac{\partial a_1}{\partial x_1}\,dx_1\Big)dx_2\cdots dx_k = \int_{\mathbb{R}^{k-1}} a_1(0, x_2,\dots,x_k)\,dx_2\cdots dx_k = \int_{\partial M}\omega_1.$$
This finishes the proof of Stokes' theorem. $\square$

9.1.13. Green's theorem. We have proved an extraordinarily strong result which covers several standard integral relations from the classical vector analysis. For instance, we can notice that by Stokes' theorem, the integral of the exterior differential $d\omega$ of any $k$-form over a compact manifold without boundary is always zero (for example, the integral of any 2-form $d\omega$ over the sphere $S^2\subset\mathbb{R}^3$ vanishes). Let us look step by step at the cases of Stokes' theorem with $k$-dimensional boundaries $\partial M$ in $\mathbb{R}^n$ in low dimensions.

Green's theorem
In the case $n = 2$, $k = 1$, we are examining a domain $M$ in the plane, bounded by a closed curve $C = \partial M$. Differential 1-forms are $\omega(x,y) = f(x,y)\,dx + g(x,y)\,dy$, with the differential
$$d\omega = \Big(-\frac{\partial f}{\partial y} + \frac{\partial g}{\partial x}\Big)\,dx\wedge dy.$$
Therefore, Stokes' theorem yields the formula
$$\int_C f(x,y)\,dx + g(x,y)\,dy = \int_M\Big(-\frac{\partial f}{\partial y} + \frac{\partial g}{\partial x}\Big)\,dx\wedge dy,$$
which is one of the standard forms of Green's theorem.

Using the standard scalar product on $\mathbb{R}^2$, we can identify the vector field $X$ with a linear form $\omega_X$ such that $\omega_X(Y) = \langle Y, X\rangle$. In the standard coordinates $(x,y)$, this just means that the field $X = f(x,y)\frac{\partial}{\partial x} + g(x,y)\frac{\partial}{\partial y}$ corresponds to the form $\omega = f(x,y)\,dx + g(x,y)\,dy$ given above.
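A minimal sympy check of Green's theorem as just stated (assuming sympy; the form $\omega = -y\,dx + x\,dy$ on the unit disc is an illustrative choice):

```python
# Sketch: both sides of Green's theorem for omega = -y dx + x dy on the
# unit disc M with boundary circle C; both equal twice the area of the disc.
import sympy as sp

t, r, phi, x, y = sp.symbols('t r phi x y')
f, g = -y, x

# left-hand side: line integral over C parametrized by (cos t, sin t)
cx, cy = sp.cos(t), sp.sin(t)
lhs = sp.integrate(f.subs({x: cx, y: cy})*sp.diff(cx, t)
                   + g.subs({x: cx, y: cy})*sp.diff(cy, t), (t, 0, 2*sp.pi))

# right-hand side: (-f_y + g_x) dx ^ dy over the disc, in polar coordinates
integrand = -sp.diff(f, y) + sp.diff(g, x)        # the constant 2
rhs = sp.integrate(integrand * r, (r, 0, 1), (phi, 0, 2*sp.pi))
print(lhs, rhs)                                   # 2*pi, 2*pi
```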
The integral of $\omega_X$ over a curve $C$ has the physical interpretation of the work done by movement along this curve in the force field $X$. Green's theorem then says, besides others, that if $\omega_X = dF$ for some function $F$, then the work done along a closed curve is always zero.

9.D.29. Show that the solution of
(H) the hyperbolic equation in canonical form, $u_{xy} = 0$, is any function $u(x,y) = F(x) + G(y)$;
(E) the elliptic equation in canonical form, $\Delta v = v_{xx} + v_{yy} = 0$, is any function $v(x,y) = F(x + iy) + G(x - iy)$;
(P) the parabolic equation in canonical form, $w_{xx} = kw_y$, is any function $w(x,y) = e^{k(Cx + C^2y)}$.

Solution. $u_x = F'$, $u_{xy} = 0$. $v_x = F' + G'$, $v_{xx} = F'' + G''$, $v_y = iF' - iG'$, $v_{yy} = -F'' - G''$, hence $v_{xx} + v_{yy} = 0$. $w_x = kC\,e^{k(Cx+C^2y)}$, $w_{xx} = k^2C^2\,e^{k(Cx+C^2y)}$, $w_y = kC^2\,e^{k(Cx+C^2y)}$, hence $w_{xx} = kw_y$. $\square$

9.D.30. Show that real solutions of the Laplace equation in polar coordinates $x = r\cos\varphi$, $y = r\sin\varphi$ (compute it),
$$u_{rr} + \frac{1}{r^2}u_{\varphi\varphi} + \frac{1}{r}u_r = 0,$$
are the harmonic functions given by $\alpha_n(r,\varphi) = r^n\cos n\varphi$, $\beta_n(r,\varphi) = r^n\sin n\varphi$, $n\in\mathbb{N}_0$. Find their expressions $\alpha_n(x,y)$, $\beta_n(x,y)$ for $n = 0, 1, 2, 3$. ⃝

9.D.31. Find the general solution of the second order equation
$$x^2u_{xx} - 2xyu_{xy} + y^2u_{yy} - x^2u_x + (x+2)yu_y = 0.$$

Solution. We have $A = x^2$, $B = -xy$, $C = y^2$, $B^2 - AC = 0$, and the equation is parabolic. Let us solve the characteristic equation
$$\frac{dy}{dx} = y' = \frac{B\pm\sqrt{B^2 - AC}}{A} = \frac{-xy}{x^2} = -\frac{y}{x}.$$
We have the solution in implicit form $xy = C = \xi$ ($\xi = \xi(x,y)$ will be a new coordinate function). For the second new coordinate function $\eta = \eta(x,y)$ we can take any independent function ($\xi_x\eta_y\ne\xi_y\eta_x$), for example $\eta = x$. The inverse transformation is $x = \eta$, $y = \frac{\xi}{\eta}$. We get
$$u_x = u_\xi\xi_x + u_\eta\eta_x = \frac{\xi}{\eta}u_\xi + u_\eta,\qquad u_y = u_\xi\xi_y + u_\eta\eta_y = \eta u_\xi,$$

Such fields are called potential fields, and the function $F$ is the potential of the field $X$. In other words, the work done when moving in potential fields does not depend on the path; it depends only on the initial and terminal points. With Green's theorem, we have verified once again that integrating the differential of a function along a curve depends solely on the initial and terminal points of the curve.

9.1.14. The divergence theorem. The next case deals with integrating over an open subset in $\mathbb{R}^3$, and it has got a lot of incarnations in practical use. We shall mention a few.

Gauss–Ostrogradsky's theorem
In the case $n = 3$, $k = 2$, we are examining a region $M\subset\mathbb{R}^3$, bounded by a surface $S$. All 2-forms are of the form $\omega = f(x,y,z)\,dy\wedge dz + g(x,y,z)\,dz\wedge dx + h(x,y,z)\,dx\wedge dy$, and we get
$$d\omega = \Big(\frac{\partial f}{\partial x} + \frac{\partial g}{\partial y} + \frac{\partial h}{\partial z}\Big)\,dx\wedge dy\wedge dz.$$
Stokes' theorem says that
$$\int_S f\,dy\wedge dz + g\,dz\wedge dx + h\,dx\wedge dy = \int_M\Big(\frac{\partial f}{\partial x} + \frac{\partial g}{\partial y} + \frac{\partial h}{\partial z}\Big)\,dx\wedge dy\wedge dz.$$
This is the statement of the Gauss–Ostrogradsky theorem.

This theorem has a very illustrative physical interpretation, too. Every vector field $X = f(x,y,z)\frac{\partial}{\partial x} + g(x,y,z)\frac{\partial}{\partial y} + h(x,y,z)\frac{\partial}{\partial z}$ can be plugged into the first argument of the standard volume form $\omega_{\mathbb{R}^3} = dx\wedge dy\wedge dz$ on $\mathbb{R}^3$. Clearly, the result is the 2-form
$$\omega_X(x,y,z) = f(x,y,z)\,dy\wedge dz + g(x,y,z)\,dz\wedge dx + h(x,y,z)\,dx\wedge dy.$$
The latter 2-form infinitesimally describes the volume of the parallelepiped given by the flux caused by the field $X$ through a linearized piece of surface. If we consider the vector field to be the velocity of the flow of the particular points of the space, this infinitesimally describes the volume transported pointwise by the flow through the given surface $S$. Thus, the left-hand side is the total change of volume inside of $S$, caused by the flow of $X$.
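This balance of flux and divergence can be cross-checked symbolically on example 9.B.14 above. A minimal sympy sketch (assuming sympy):

```python
# Sketch: integrate div F over the solid cylinder of 9.B.14 and recover 32*pi.
import sympy as sp

x, y, z, rho, phi = sp.symbols('x y z rho phi', real=True)
F = (x*y**2, y*z, x**2*z)
divF = sum(sp.diff(Fi, w) for Fi, w in zip(F, (x, y, z)))  # y**2 + z + x**2

# cylindrical coordinates with Jacobian rho; x**2 + y**2 <= 4, 1 <= z <= 3
integrand = divF.subs({x: rho*sp.cos(phi), y: rho*sp.sin(phi)}) * rho
T = sp.integrate(integrand, (rho, 0, 2), (phi, 0, 2*sp.pi), (z, 1, 3))
print(sp.simplify(T))   # 32*pi, matching 9.B.14
```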
The integrand of the right-hand side of the integral is related to the so-called divergence of the vector field, which is the expression defined as
$$d(\omega_X) = (\operatorname{div} X)\,dx\wedge dy\wedge dz.$$
The Gauss–Ostrogradsky theorem says
$$\int_S i_X\,\omega_{\mathbb{R}^3} = \int_M\operatorname{div} X\;\omega_{\mathbb{R}^3},$$
i.e. the total volume flow through a surface is given as the integral of the divergence of the vector field over the interior. In particular, if $\operatorname{div} X$ vanishes identically, then the total volume flow through the boundary surface of the region is zero as well.

$$u_{xx} = \frac{\xi^2}{\eta^2}u_{\xi\xi} + 2\frac{\xi}{\eta}u_{\xi\eta} + u_{\eta\eta},\qquad u_{xy} = \xi u_{\xi\xi} + \eta u_{\xi\eta} + u_\xi,\qquad u_{yy} = \eta^2u_{\xi\xi}.$$
Finally, the equation takes the form $\eta^2(u_{\eta\eta} - u_\eta) = 0$, hence $u_{\eta\eta} = u_\eta$. Integrating twice, we have
$$u(\xi,\eta) = G(\xi)\,e^\eta + F(\xi),\qquad u(x,y) = G(xy)\,e^x + F(xy);$$
make a test. Try to solve the equation for a different choice of the function $\eta$. $\square$

9.D.32. Find the general solution of the second order equation $yu_{yy} - xu_{xy} + u_y = 0$.

Solution. We have $A = 0$, $B = -\frac{x}{2}$, $C = y$, $B^2 - AC = \frac{x^2}{4} > 0$ for $x\ne 0$, and this equation is hyperbolic. The characteristic equation (we cannot divide by $A = 0$!) is now
$$\frac{dx}{dy} = x' = \frac{B\pm\sqrt{B^2-AC}}{C} = \frac{1}{y}\Big(-\frac{x}{2}\pm\frac{x}{2}\Big).$$
We have two solutions in implicit form (the new coordinate functions $\xi = \xi(x,y)$, $\eta = \eta(x,y)$):
$$x = C_1 = \xi,\quad xy = C_2 = \eta \;\Longrightarrow\; x = \xi,\; y = \frac{\eta}{\xi}.$$
Finally, $u_x = u_\xi + \frac{\eta}{\xi}u_\eta$, $u_y = \xi u_\eta$, $u_{xy} = \xi u_{\xi\eta} + \eta u_{\eta\eta} + u_\eta$, $u_{yy} = \xi^2u_{\eta\eta}$, and in the new coordinates $\xi, \eta$ the equation has the canonical form $u_{\xi\eta} = 0$. The general solution is $u(\xi,\eta) = F(\xi) + G(\eta)$, $u(x,y) = F(x) + G(xy)$. $\square$

9.D.33. Find the general solution of the second order equation
$$u_{xx} - 2\sin x\,u_{xy} + (2 - \cos^2x)u_{yy} - \cos x\,u_y = 0.$$

Solution. We have $A = 1$, $B = -\sin x$, $C = 2 - \cos^2x$, $B^2 - AC < 0$, and the equation is elliptic. Let us solve the characteristic equation
$$\frac{dy}{dx} = y' = \frac{B\pm\sqrt{B^2-AC}}{A} = -\sin x\pm i.$$
We have the solution in implicit form $y - \cos x - ix = C_1 = \Phi(x,y)$, $y - \cos x + ix = C_2 = \Psi(x,y)$, and the new (real) coordinate functions are
$$\xi = \frac12(\Phi + \Psi) = y - \cos x,\qquad \eta = \frac{1}{2i}(\Psi - \Phi) = x.$$

Such fields, with $\operatorname{div} X = 0$, are called divergence free or solenoidal vector fields. They correspond to dynamics without changes of volumes (e.g. modelling the dynamics of incompressible liquids). In order to reformulate the theorem completely in terms of functions, let us observe that the inherited volume form $\omega_S$ on $S$ is defined by the property $\nu^*\wedge\omega_S = \omega_{\mathbb{R}^3}$ at all points of $S$, where $\nu^*$ is the form dual to the oriented (outward) unit normal to $S$. All forms of degree 2 are multiples of $\omega_S$ by functions. In particular, $i_X(\nu^*\wedge\omega_S) = \nu\cdot X\;\omega_S$, i.e. we have to integrate the scalar product of the vector field $X$ with the unit normal vector with respect to the standard volume on $S$. Thus, we have proved the following result, formulated in the classical vector analysis style. Actually, a simple check reveals that the above arguments work for all open submanifolds $M\subset\mathbb{R}^n$ with boundary hypersurface $S$ and vector fields $X$. The reader should easily verify this in detail.

Divergence theorem
Theorem. Let $X$ be a vector field on an $n$-dimensional manifold $M\subset\mathbb{R}^n$ with hypersurface boundary $S$. Then
$$\int_M\operatorname{div} X\,dx_1\dots dx_n = \int_S X\cdot\nu\,dS,$$
where $\nu$ is the oriented (outward) unit normal to $S$ and $dS$ stands for the volume inherited from $\mathbb{R}^n$ on $S$.

Notice that the 2-dimensional case coincides with Green's theorem above.

9.1.15. The original Stokes theorem. If $\omega$ is any linear form, then the integral of $d\omega$ over a surface depends on the boundary curve only. This is the most classical Stokes' theorem:

The classical Stokes' theorem
In the case $n = 3$, $k = 1$, we deal with a surface $M$ in $\mathbb{R}^3$ bounded by a curve $C$.
The general linear forms are $\omega = f\,dx + g\,dy + h\,dz$, with the integral
$$\int_C f\,dx + g\,dy + h\,dz = \int_M d\omega,$$
where
$$d\omega = \Big(\frac{\partial h}{\partial y} - \frac{\partial g}{\partial z}\Big)dy\wedge dz + \Big(\frac{\partial f}{\partial z} - \frac{\partial h}{\partial x}\Big)dz\wedge dx + \Big(\frac{\partial g}{\partial x} - \frac{\partial f}{\partial y}\Big)dx\wedge dy.$$

Again, we use the standard scalar product to identify the vector field $X = f\frac{\partial}{\partial x} + g\frac{\partial}{\partial y} + h\frac{\partial}{\partial z}$ with the form $\omega_X = f\,dx + g\,dy + h\,dz$. Finally, reverting the above relation between the vector fields and two-forms on $\mathbb{R}^3$, the 2-form $d\omega_X$ can be identified with the vector field $\operatorname{rot} X$, $d\omega_X = \omega_{\mathbb{R}^3}(\operatorname{rot} X,\ ,\ )$.

We have $\xi_x = \sin x$, $\xi_y = 1$, $\xi_{xx} = \cos x$, $\xi_{xy} = 0$, $\xi_{yy} = 0$, $\eta_x = 1$, $\eta_y = \eta_{xx} = \eta_{xy} = \eta_{yy} = 0$, and
$$u_y = u_\xi,\quad u_{xy} = u_{\xi\xi}\sin\eta + u_{\xi\eta},\quad u_{yy} = u_{\xi\xi},\quad u_{xx} = u_{\xi\xi}\sin^2\eta + 2u_{\xi\eta}\sin\eta + u_{\eta\eta} + u_\xi\cos\eta.$$
Finally, the equation becomes canonical: $\Delta u = u_{\xi\xi} + u_{\eta\eta} = 0$. The general solution is $u(\xi,\eta) = C(\xi + i\eta) + D(\xi - i\eta)$ (make a test),
$$u(x,y) = C(y - \cos x + ix) + D(y - \cos x - ix).$$
Real solutions are given by harmonic functions in the variables $\xi, \eta$. $\square$

9.D.34. Solve $x^2u_{xx} - 2xu_{xy} + u_{yy} + xu_x = 0$. ⃝

9.D.35. Solve $xu_{xx} - yu_{xy} + u_x = 0$. ⃝

9.D.36. Solve $u_{xx} - u_{xy} + 2u_{yy} = 0$. ⃝

9.D.37. Find the canonical form of $y^2u_{xx} + x^2u_{yy} = 0$. ⃝

9.D.38. Show that the solution of the nonhomogeneous wave equation with initial conditions (an infinite string),
$$u_{tt}(t,x) = a^2u_{xx}(t,x) + f(t,x),\quad t\in(0,\infty),\ x\in\mathbb{R},\qquad u(0,x) = \varphi(x),\quad u_t(0,x) = \psi(x),\quad x\in\mathbb{R},$$
is given by d'Alembert's formula:
$$u(t,x) = \frac{\varphi(x-at) + \varphi(x+at)}{2} + \frac{1}{2a}\int_{x-at}^{x+at}\psi(\xi)\,d\xi + \frac{1}{2a}\int_0^t\Big(\int_{x-a(t-\sigma)}^{x+a(t-\sigma)} f(\sigma,\xi)\,d\xi\Big)d\sigma. \;⃝$$

This field is called the rotation or curl of the vector field $X$. Stokes' theorem now reads:
$$\int_C\omega_X = \int_M\operatorname{rot} X.$$
Consequently, the fields $X$ with the property $\omega_X = dF$ for some function $F$ (the fields of gradients of functions) have got the property $\operatorname{rot} X = 0$. They are called conservative (or potential) vector fields.

9.1.16. Brouwer's theorem. Among many useful fixed-point theorems, the Brouwer theorem is particularly nice.⁴ We present a special case here. In particular, our formulation clearly must hold true for any homeomorphic image of the domain $K$. In fact, the convexity ensures that there are no "holes" inside of $K$. For example, rotating an annulus $r\le\|x\|\le R$ in the plane clearly does not have any fixed point.

Brouwer's fixed-point theorem
Theorem. Let $K\subset\mathbb{R}^n$ be a compact convex submanifold of dimension $n$ with boundary $\partial K$. Then every continuous mapping $f : K\to K$ has got at least one fixed point $x$, i.e. $f(x) = x$.

Proof. We shall restrict ourselves to the case of a smooth manifold $K$ and a smooth mapping $f$. In fact, if a continuous $f$ had no fixed point, then it would be possible to approximate it by a smooth $\tilde f$ without any fixed point (e.g., by taking a convolution $\tilde f = f*\varphi$ with a suitable smooth kernel $\varphi$ enjoying a very small support). Assume $f : K\to K$ is such a smooth mapping with $f(x)\ne x$ for all $x\in K$. For each $y\in K$, consider the ray $L$ from $f(y)$ through $y$. This ray leaves $K$ at the unique point $F(y)\in\partial K$, and $F : K\to\partial K$ is smooth. Notice that we use the convexity assumption here (it is evident if $K$ is a ball $B$ of diameter $r$, and the general case can be transformed to the ball case by "smoothly expanding" $K$ from an inner point to a big enough ball $B$). In particular, the construction implies $F|_{\partial K} = \mathrm{id}_{\partial K}$. Thus, by our assumptions, $F : K\to\partial K$ is a smooth retraction of $K$ to its boundary. Now, we may consider a smooth exterior form $\omega\in\Omega^{n-1}(K)$ providing the standard (inherited) volume on $\partial K$, and we employ the general Stokes' theorem:
$$0 < \int_{\partial K}\omega = \int_{\partial K} F^*\omega = \int_K d(F^*\omega) = \int_K F^*(d\omega) = \int_K F^*(0) = 0.$$
This is a contradiction, and the theorem is proved. $\square$

⁴ Its 2-dimensional version is attributed to Luitzen Egbertus Jan Brouwer (1881–1966), a Dutch mathematician and philosopher, who is said to have noticed that when stirring a cup of coffee with sugar, at least one point always stays fixed.

9.D.39. Using d'Alembert's formula, solve
$$u_{tt}(t,x) = u_{xx}(t,x) + \cos x,\quad t\in(0,\infty),\ x\in\mathbb{R},\qquad u(0,x) = x^2,\quad u_t(0,x) = \operatorname{arctg} x,\quad x\in\mathbb{R}.$$

Solution.
$$u_1(t,x) = \frac{\varphi(x-at) + \varphi(x+at)}{2} = \frac{(x-t)^2 + (x+t)^2}{2} = x^2 + t^2,$$
$$u_2(t,x) = \frac{1}{2a}\int_{x-at}^{x+at}\psi(\xi)\,d\xi = \frac12\int_{x-t}^{x+t}\operatorname{arctg}\xi\,d\xi = \frac12\Big[\xi\operatorname{arctg}\xi - \frac12\ln(1+\xi^2)\Big]_{x-t}^{x+t}$$
$$= \frac12\Big[(x+t)\operatorname{arctg}(x+t) - (x-t)\operatorname{arctg}(x-t) + \ln\sqrt{\frac{1+(x-t)^2}{1+(x+t)^2}}\Big],$$
$$u_3(t,x) = \frac{1}{2a}\int_0^t\Big(\int_{x-a(t-\sigma)}^{x+a(t-\sigma)} f(\sigma,\xi)\,d\xi\Big)d\sigma = \frac12\int_0^t\Big(\int_{x-t+\sigma}^{x+t-\sigma}\cos\xi\,d\xi\Big)d\sigma = \frac12\int_0^t\big[\sin\xi\big]_{x-t+\sigma}^{x+t-\sigma}\,d\sigma$$
$$= \frac12\int_0^t\big[\sin(x+t-\sigma) - \sin(x-t+\sigma)\big]\,d\sigma = \frac12\big[\cos(x+t-\sigma)\big]_0^t + \frac12\big[\cos(x-t+\sigma)\big]_0^t = \frac12\big[-\cos(x+t) - \cos(x-t)\big] + \cos x.$$
$u(t,x) = u_1(t,x) + u_2(t,x) + u_3(t,x)$. $\square$

9.D.40. Solve ($x\in\mathbb{R}$, $t\in(0,\infty)$):
a) $u_{tt} - u_{xx} = 0$, $u(0,x) = \sin x$, $u_t(0,x) = x\cos x$;
b) $u_{tt} - u_{xx} = 0$, $u(0,x) = 2x$, $u_t(0,x) = \ln(1+x^2)$;
c) $u_{tt} - u_{xx} = \sin x$, $u(0,x) = x$, $u_t(0,x) = \frac{1}{x}$. ⃝

9.D.41. Solve $u_{tt}(t,x) = u_{xx}(t,x) + 2\sin x$, $t\in(0,\infty)$, $x\in\mathbb{R}$, $u(0,x) = x^2 + 2x$, $u_t(0,x) = x\cos x$, $x\in\mathbb{R}$. ⃝

9.D.42. Solve $u_{tt}(t,x) = u_{xx}(t,x) + 2\cos x$, $t\in(0,\infty)$, $x\in\mathbb{R}$, $u(0,x) = 2x^2 - x$, $u_t(0,x) = x\sin x$, $x\in\mathbb{R}$. ⃝

9.1.17. Another kind of integration. As we have seen, solutions to ODEs are flows of vector fields. As a modification, we can prescribe one-dimensional linear subspaces $L_x\subset T_xM$ at each point of a manifold $M$ and look for unparameterized curves $P$ tangent to them at all points. This is a coordinate-free version of the ODE theory. Indeed, locally we may always choose a vector field $X$ generating the spaces $L_x$, and in each coordinate patch, the flow of $X$ provides the parameterized one-dimensional submanifolds $P\subset M$ tangent to $L_x$ at all points. A change of coordinates or of $X$ will change the parameterizations, but not the curves $P$. If we want to describe an $n$-dimensional submanifold $N\subset M$, $1 < n < \dim M$, in a similar way, we define $n$-dimensional subspaces $D_x\subset T_xM$ for all $x\in M$ and seek a submanifold $N$ with $T_yN = D_y$ at all $y\in N$.

Integrability of distributions
The union $D\subset TM$ of individual linear subspaces $D_x\subset T_xM$, $x\in M$, is called a distribution $D$ on $M$. We say that the distribution is $n$-dimensional and smooth if each fixed point $x$ allows for a neighborhood $U$ and $n$ linearly independent smooth vector fields $X_1,\dots,X_n$ generating $D_y$ at all $y\in U$. The distribution is called integrable if for each point $x\in M$, there is a submanifold $N\subset M$ such that $x\in N$ and $T_yN = D_y$ for all $y\in N$.

Our goal is to give necessary and sufficient conditions for smooth distributions to be integrable. Clearly, the case $n = 1$ is trivial, since we already know that the conditions are empty; each such distribution is integrable. The core idea is to use the so-called flow box theorem for vector fields, proved in 8.3.15, and to exploit the individual flows of the chosen generators $X_1,\dots,X_n$ in order to "draw" new coordinates, in which the integral submanifold would appear as given by $x_{n+1} = 0,\dots,x_m = 0$. The problem we face is that the flows do not commute in general, and thus our idea will not work.

9.1.18. Lie bracket of vector fields.
9.1.18. Lie bracket of vector fields. Fortunately, the commutativity of the flows is captured by a simple differential operation. Consider two vector fields on $\mathbb R^m$,
\[ X = X_1(x)\tfrac{\partial}{\partial x_1} + \dots + X_m(x)\tfrac{\partial}{\partial x_m}, \qquad Y = Y_1(x)\tfrac{\partial}{\partial x_1} + \dots + Y_m(x)\tfrac{\partial}{\partial x_m}. \]
The commutator of the derivatives of functions in the directions of these vector fields is
\[ Y(Xf) - X(Yf) = \sum_{i,j}\Big(Y_i\tfrac{\partial}{\partial x_i}\big(X_j\tfrac{\partial f}{\partial x_j}\big) - X_i\tfrac{\partial}{\partial x_i}\big(Y_j\tfrac{\partial f}{\partial x_j}\big)\Big) = \sum_{i,j}\Big(Y_j\tfrac{\partial X_i}{\partial x_j} - X_j\tfrac{\partial Y_i}{\partial x_j}\Big)\tfrac{\partial f}{\partial x_i}, \]
thanks to the commutativity of the second derivatives of $f$. Thus the commutator of the two vector fields behaves as a vector field, denoted $[X,Y]$ and written out below.

9.D.43. $u_{tt} = 4u_{xx} + \cos^2 x$, $t\in(0,\infty)$, $x\in\mathbb R$, $u(0,x) = x^3$, $u_t(0,x) = \frac{1}{\sqrt{1+x^2}}$. ⃝
9.D.44. $u_{tt} = 9u_{xx} - 4\cos x$, $t\in(0,\infty)$, $x\in\mathbb R$, $u(0,x) = x^2$, $u_t(0,x) = x e^x$. ⃝
9.D.45. Solve the parabolic equation $u_t = ku_{xx}$, $t\in(0,\infty)$, $x\in(0,l)$, with Dirichlet boundary conditions $u(t,0) = u(t,l) = 0$ and initial condition $u(0,x) = \varphi(x) = x(l-x)$, $x\in(0,l)$.
Solution. Suppose the solution has the form $u(t,x) = X(x)T(t)$, so $u_t = XT'$, $u_{xx} = X''T$; dividing both sides of the equation by $XT$ we get $\frac{T'}{kT} = \frac{X''}{X} = -\lambda$, where $\lambda$ is a constant, because the left side depends only on $t$ and the right side only on $x$. Solving both ordinary equations $T' + \lambda kT = 0$, $X'' + \lambda X = 0$, we get $T(t) = C e^{-\lambda kt}$, $X(x) = C_1e^{\mu x} + C_2e^{-\mu x}$, where $\mu = \sqrt{-\lambda}$. The boundary conditions give $X(0) = C_1 + C_2 = 0 \Rightarrow C_2 = -C_1$ and $X(l) = C_1(e^{\mu l} - e^{-\mu l}) = 0 \Rightarrow e^{2\mu l} = 1 = e^{2\pi n i}$. Hence $\mu_n = \frac{in\pi}{l}$ and $\lambda_n = \frac{n^2\pi^2}{l^2}$, $n = 0,1,2,\dots$ We have $T_n = C_n e^{-\frac{n^2\pi^2}{l^2}kt}$, $X_n = D_n\big(e^{i\frac{n\pi}{l}x} - e^{-i\frac{n\pi}{l}x}\big) = 2iD_n\sin\big(\frac{n\pi}{l}x\big)$. The solution of the equation with the boundary conditions is the series
\[ u(t,x) = \sum_{n=0}^\infty T_nX_n = \sum_{n=1}^\infty K_n\sin\big(\tfrac{n\pi}{l}x\big)\,e^{-\frac{n^2\pi^2}{l^2}kt}. \]

The vector field
\[ [X,Y] = \sum_{i,j=1}^m\Big(Y_j\tfrac{\partial X_i}{\partial x_j} - X_j\tfrac{\partial Y_i}{\partial x_j}\Big)\tfrac{\partial}{\partial x_i} \]
is called the Lie bracket⁵ of its arguments. It is easy to see that $[\ ,\ ]$ is a bilinear antisymmetric operation (over the real scalars) on differentiable vector fields, and expanding the commutators explicitly we arrive at the so-called Jacobi identity $[X,[Y,Z]] = [[X,Y],Z] + [Y,[X,Z]]$, valid for all triples of vector fields, and the Leibniz derivative property $[X, fY] = (Xf)Y + f[X,Y]$.

Remark. In fact, it is quite straightforward to see that vector fields $X$ and the diffeomorphisms $\operatorname{Fl}^X_t$ in their flows are linked in a manner very similar to square matrices $A$ and their exponential images $e^{tA}$. The Lie bracket encodes the composition of the diffeomorphisms as the commutators of matrices encode matrix multiplication. Thus it is not surprising that the flows of two vector fields commute if and only if their Lie bracket vanishes. We shall not go into the technical proof here, since we shall not need the result explicitly below.
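A direct symbolic computation of the bracket (our sketch, following the coordinate formula of 9.1.18): for the two fields of the numerical example above the bracket is non-zero, while e.g. a rotation field and the radial field commute.

```python
import sympy as sp

x, y = sp.symbols('x y')
coords = (x, y)

def bracket(X, Y):
    # [X, Y]_i = sum_j (Y_j dX_i/dx_j - X_j dY_i/dx_j), as in 9.1.18
    return [sp.simplify(sum(Y[j]*sp.diff(X[i], coords[j]) - X[j]*sp.diff(Y[i], coords[j])
                            for j in range(2))) for i in range(2)]

print(bracket((1, 0), (0, x)))   # [0, -1]: X = d/dx, Y = x d/dy do not commute
print(bracket((y, -x), (x, y)))  # [0, 0]: rotations commute with dilations
```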
9.1.19. Back to distributions. We say that $D \subset TM$ is an involutive distribution if for all vector fields $X, Y$ valued in $D$, their Lie bracket $[X,Y]$ has values in $D$, too.

Frobenius' theorem
Theorem. Let $D \subset TM$ be a smooth $n$-dimensional distribution on an $m$-dimensional manifold $M$. Then $D$ is integrable if and only if it is involutive.

Proof. Recall that integrability means the local existence of integral submanifolds through each point of $M$. One of the implications is nearly trivial. If $D$ is integrable, then through each $x \in M$ there is an integral submanifold $N$. Consider the embedding $i : N \to M$ and any vector fields $\tilde X, \tilde Y$ on $M$ valued in $D$. Since $D_y = T_yN$ for all $y \in N$, $\tilde X$ and $\tilde Y$ are tangent to $i(N) \subset M$. We claim that the restriction of the Lie bracket $[\tilde X, \tilde Y]$ to $i(N)$ is the image $i_*([X,Y])$, where the vector fields $X, Y$ are viewed as the given fields on $N$, i.e., $i_*X(x) = \tilde X(i(x))$, $i_*Y(x) = \tilde Y(i(x))$. Thus the bracket has to be in the image again. The latter claim is a consequence of a more general statement:

⁵ Marius Sophus Lie (1842–1899) was an excellent Norwegian mathematician, the father of Lie theory. Originally invented to deal with systems of partial differential equations via continuous groups of their symmetries, the theory of Lie groups and Lie algebras is nowadays at the core of a vast part of mathematics. It is a pity we do not have time and space to devote more attention to this marvelous mathematical story in this textbook.

The coefficients $K_n$ we get from the initial condition $u(0,x) = \sum_{n=1}^\infty K_n\sin\big(\frac{n\pi}{l}x\big) = \varphi(x) = x(l-x)$. So the $K_n$ are the Fourier coefficients of the odd extension of $\varphi(x)$ to the interval $[-l,l]$: $K_n = \frac{2}{l}\int_0^l x(l-x)\sin\big(\frac{n\pi}{l}x\big)dx$. Integration by parts gives $K_n = \frac{4l^2}{n^3\pi^3}\big[(-1)^{n+1}+1\big]$ and the final solution
\[ u(t,x) = \sum_{n=0}^\infty \frac{8l^2}{(2n+1)^3\pi^3}\sin\Big(\frac{(2n+1)\pi}{l}x\Big)\,e^{-\frac{(2n+1)^2\pi^2}{l^2}kt}. \] $\square$

9.D.46. Solve the elliptic equation on the square, $\Delta u = 0$, $x\in(0,1)$, $y\in(0,1)$, with Dirichlet boundary conditions in $x$, $u(0,y) = u(1,y) = 0$, $y\in(0,1)$, and boundary conditions in $y$, $u(x,0) = \varphi(x) = \sin(\pi x)\cos(\pi x)$, $u(x,1) = \psi(x) = -\sin(\pi x)$, $x\in(0,1)$.
Solution. Suppose again that the solution has the form $u(x,y) = X(x)Y(y) \Rightarrow \Delta u = X''Y + XY'' = 0$. Dividing the equation by $XY$ we get $\frac{X''}{X} = -\frac{Y''}{Y} = -\lambda$. The solution (using the Dirichlet boundary conditions in $x$) is $X_n = C_n\sin(n\pi x)$, $Y_n = K_ne^{n\pi y} + H_ne^{-n\pi y}$, $n\in\mathbb N$, and $u(x,y) = \sum_{n=1}^\infty X_nY_n = \sum_{n=1}^\infty\big[k_ne^{n\pi y} + h_ne^{-n\pi y}\big]\sin(n\pi x)$. Using the boundary conditions in $y$ we get $u(x,0) = \sum(k_n+h_n)\sin(n\pi x) = \varphi(x)$, $u(x,1) = \sum(k_ne^{n\pi} + h_ne^{-n\pi})\sin(n\pi x) = \psi(x)$. The numbers $b_n = k_n + h_n$ are the Fourier coefficients of the odd extension of $\varphi(x)$ to $[-1,1]$: $b_2 = \frac12$ and $b_n = 0$ for $n \ne 2$. The numbers $\beta_n = k_ne^{n\pi} + h_ne^{-n\pi}$ are the Fourier coefficients of the odd extension of $\psi(x)$ to $[-1,1]$: $\beta_1 = -1$ and $\beta_n = 0$ for $n \ne 1$. We need to solve the algebraic equations $k_2 + h_2 = \frac12$, $k_2e^{2\pi} + h_2e^{-2\pi} = 0$, $k_1 + h_1 = 0$, $k_1e^{\pi} + h_1e^{-\pi} = -1$.

Claim. If $\varphi : N \to M$ is a smooth map and two couples of vector fields $X, Y$ and $\tilde X, \tilde Y$ satisfy $T\varphi\circ X = \tilde X\circ\varphi$, $T\varphi\circ Y = \tilde Y\circ\varphi$, then their Lie brackets satisfy the same relation: $T\varphi\circ[X,Y] = [\tilde X,\tilde Y]\circ\varphi$.
Indeed, consider a smooth function $f$ on $M$ and compute, using $X(f\circ\varphi)(x) = (T\varphi X)f = (\tilde X\circ\varphi)(x)f = \tilde X(\varphi(x))f$, and similarly for $Y$:
\[ [X,Y](f\circ\varphi)(x) = XY(f\circ\varphi)(x) - YX(f\circ\varphi)(x) = X((\tilde Yf)\circ\varphi)(x) - Y((\tilde Xf)\circ\varphi)(x) = \tilde X(\tilde Yf)(\varphi(x)) - \tilde Y(\tilde Xf)(\varphi(x)) = ([\tilde X,\tilde Y]f)\circ\varphi(x). \]
Now we employ the latter claim for the inclusion $i$ in the role of $\varphi$, and obviously every integrable distribution must be involutive.
As we already revealed, each one-dimensional distribution is involutive and locally integrable. The main idea of the proof is to start with any set of (locally) generating vector fields for $D$, to use some nice coordinates with respect to the first vector field, and to employ induction on the dimension for the rest of them. Assume the theorem is true for dimensions less than $n$ and consider an involutive smooth distribution $D$ of dimension $n$, generated by fields $X_1, \dots, X_n$. Actually, we shall prove a much stronger version of the theorem.
We claim that if $D$ is involutive, then there are coordinates $(x_1, \dots, x_m)$ around each point $x \in M$ such that the equations $x_{n+1} = a_{n+1}, \dots, x_m = a_m$ with small constants $a_i$ define all the individual integral submanifolds of $D$ through points close to $x$. This is indeed true in dimension $n = 1$. By the flow box theorem 8.3.15, for each point $x \in M$ there are coordinate functions $y_1, \dots, y_m$ on a neighborhood $U$ of $x$ for which $X_1 = \frac{\partial}{\partial y_1}$. Let us consider the submanifold $Q \subset M$ defined by $y_1 = 0$ and the "projections" $Y_j$ of the other fields making them tangent to $Q$. This requires that the $Y_j$ leave the coordinate $y_1$ constant, i.e. we set $Y_j = X_j - X_j(y_1)X_1$, $j = 2, \dots, n$. Indeed, this definition ensures $Y_j(y_1) = 0$, and thus the fields are tangent to $Q$ as required. We leave $Y_1 = X_1$, and clearly $Y_1, \dots, Y_n$ generate the same involutive distribution $D$. Thus $[Y_i, Y_j] = \sum_k c_{ijk}Y_k$ for some set of functions $c_{ijk}$. Moreover, we may view $Q$ as one leaf of the family of slices $y_1 = b_1$ with small constants $b_1$, and there is the projection $p : U \to Q$ forgetting the first coordinate. On the submanifold $Q$ there is the $(n-1)$-dimensional involutive distribution $\tilde D$ generated by the fields $\tilde Y_i = Y_i|_Q$, $i = 2, \dots, n$ (notice we again use the argument from the beginning of the proof about the brackets of restricted fields).

Finally $k_1 = \frac{1}{e^{-\pi}-e^{\pi}}$, $h_1 = \frac{1}{e^{\pi}-e^{-\pi}}$, $k_2 = \frac{1}{2-2e^{4\pi}}$, $h_2 = \frac{1}{2-2e^{-4\pi}}$; all the other coefficients are zero. The solution is
\[ u(x,y) = \sin(\pi x)\Big[\frac{e^{\pi y}}{e^{-\pi}-e^{\pi}} + \frac{e^{-\pi y}}{e^{\pi}-e^{-\pi}}\Big] + \sin(2\pi x)\Big[\frac{e^{2\pi y}}{2-2e^{4\pi}} + \frac{e^{-2\pi y}}{2-2e^{-4\pi}}\Big]. \]
(Check by substitution.) $\square$

9.D.47. Solve the hyperbolic equation $u_{tt} = a^2u_{xx}$, $t\in(0,\infty)$, $x\in(0,\pi)$, with Neumann boundary conditions $u_x(t,0) = u_x(t,\pi) = 0$, $t\in(0,\infty)$, and initial conditions $u(0,x) = \varphi(x) = 0$, $u_t(0,x) = 5\cos(3x)$, $x\in(0,\pi)$.
Solution. Again, looking for a solution in the form $u(t,x) = X(x)T(t)$, we get $XT'' = a^2X''T \Rightarrow \frac{T''}{a^2T} = \frac{X''}{X} = -\lambda$. So $X(x) = C_1e^{\mu x} + C_2e^{-\mu x}$, where $\mu = \sqrt{-\lambda}$, and the Neumann boundary conditions give $X'(0) = \mu(C_1 - C_2) = 0 \Rightarrow C_1 = C_2$, $X'(\pi) = \mu C_1(e^{\mu\pi} - e^{-\mu\pi}) = 0$, $e^{2\mu\pi} = 1 = e^{i2n\pi} \Rightarrow \mu_n = in$, $\lambda_n = n^2$, $n = 0,1,2,\dots$ We have the solutions $X_n(x) = C_1(e^{inx} + e^{-inx}) = 2C_1\cos(nx)$, $T_n(t) = a_n\cos(nat) + b_n\sin(nat)$, and
\[ u(t,x) = \sum_{n=0}^\infty X_nT_n = \sum_{n=0}^\infty\big[A_n\cos(nat) + B_n\sin(nat)\big]\cos(nx). \]
Now we apply the initial conditions: $u(0,x) = \sum_{n=0}^\infty A_n\cos(nx) = \varphi(x)$, $u_t(0,x) = \sum_{n=0}^\infty naB_n\cos(nx) = \psi(x)$. The numbers $A_n$ are the Fourier coefficients of the even extension of $\varphi(x)$ to $[-\pi,\pi]$: $A_0 = \frac1\pi\int_0^\pi\varphi(x)\,dx$, $A_n = \frac2\pi\int_0^\pi\varphi(x)\cos(nx)\,dx$,

Now, our assumption says we find suitable coordinates $(q_2, \dots, q_m)$ on $Q$ around the point $x \in Q$, so that for all small constants $b_{n+1}, \dots, b_m$, the integral submanifolds of $\tilde D$ are defined by $q_{n+1} = b_{n+1}, \dots, q_m = b_m$. Finally, we need to adjust the original coordinate functions $y_i$ all over the neighborhood $U$ of $x$. The obvious idea is to use the flow of $X_1 = Y_1$ to extend the latter coordinates from $Q$. Thus we define the coordinate functions at all $y \in U$ using the projection $p$: $x_1(y) = y_1(y)$, $x_2(y) = q_2(p(y))$, ..., $x_m(y) = q_m(p(y))$. The hope is that all submanifolds $N$ given by the equations $x_{n+1} = b_{n+1}, \dots, x_m = b_m$ (for small $b_j$) will be tangent to all fields $Y_1, \dots, Y_n$. Technically, this means $Y_i(x_j) = 0$ for all $i = 1, \dots, n$, $j = n+1, \dots, m$. By our definition, this is obvious for the restriction to $Q$, and obviously $Y_1(x_j) = 0$ at all other points, too.
Let us look closely at what happens with one of our functions $Y_i(x_j)$ along the flow of the field $X_1$. With the help of the definition of the Lie bracket we easily compute
\[ \tfrac{\partial}{\partial x_1}(Y_i(x_j)) = Y_1(Y_i(x_j)) = Y_i(Y_1(x_j)) + [Y_1,Y_i](x_j) = Y_i(Y_1(x_j)) + c_{1i1}Y_1(x_j) + \sum_{k=2}^n c_{1ik}Y_k(x_j) = \sum_{k=2}^n c_{1ik}Y_k(x_j). \]
This is a system of linear ODEs for the unknown functions $Y_i(x_j)$ in the one variable $x_1$ along the flow lines of $Y_1$. The initial condition at the point in $Q$ is zero, and thus this constant zero value propagates along the flow lines, as requested. The induction step is complete. $\square$

9.1.20. Formulation via exterior forms. As we know from linear algebra, a vector subspace of codimension $k$ is defined by $k$ independent linear forms. Thus, every smooth $n$-dimensional distribution $D \subset TM$ on a manifold $M$ can be (at least locally) defined by $m-n$ linear forms $\omega_j$ on $M$. A direct computation in coordinates reveals that the differential of a linear form $\omega$ evaluates on two vector fields as follows:
\[ (1)\quad d\omega(X,Y) = X(\omega(Y)) - Y(\omega(X)) - \omega([X,Y]). \]
Indeed, if $X = \sum_i X_i\frac{\partial}{\partial x_i}$, $Y = \sum_i Y_i\frac{\partial}{\partial x_i}$, $\omega = \sum_i\omega_i\,dx_i$, then
\[ X(\omega(Y)) - Y(\omega(X)) = \sum_{i,j}\Big(X_i\tfrac{\partial}{\partial x_i}(\omega_jY_j) - Y_i\tfrac{\partial}{\partial x_i}(\omega_jX_j)\Big) = \sum_{i,j}\Big(X_i\tfrac{\partial\omega_j}{\partial x_i}Y_j - Y_i\tfrac{\partial\omega_j}{\partial x_i}X_j + \omega_j\big(X_i\tfrac{\partial Y_j}{\partial x_i} - Y_i\tfrac{\partial X_j}{\partial x_i}\big)\Big) = d\omega(X,Y) + \omega([X,Y]). \]
Thus, the involutivity of a distribution defined by the linear forms $\omega_{n+1}, \dots, \omega_m$ should be closely linked to the properties of the differentials on the common kernel.

The numbers $naB_n$ are the Fourier coefficients of the even extension of $\psi(x)$ to $[-\pi,\pi]$: $B_0 = \frac{1}{na\pi}\int_0^\pi\psi(x)\,dx$, $B_n = \frac{2}{na\pi}\int_0^\pi\psi(x)\cos(nx)\,dx$. Because $\varphi(x) = 0$, all $A_n$ are zero; $\psi(x) = 5\cos(3x)$, so $B_3 = \frac{5}{3a}$ and all other $B_n$ are again zero. Finally, the solution of our problem is $u(t,x) = \frac{5}{3a}\sin(3at)\cos(3x)$. (Check by substitution.) $\square$

9.D.48. Solve $\Delta u = u_{xx} + u_{yy} = 0$, $(x,y)\in(0,\pi)\times(0,\pi)$, $u(x,0) = u(x,\pi) = 0$, $x\in(0,\pi)$, $u(\pi,y) = 0$, $u(0,y) = \sin y$, $y\in(0,\pi)$. ⃝
9.D.49. Solve $u_{xx} = u_{tt}$, $x\in(0,l)$, $t\in(0,\infty)$, $u_x(t,0) = u_x(t,l) = 0$, $u(0,x) = -\cos\frac{\pi x}{l}$, $u_t(0,x) = \cos^2\frac{\pi x}{l} - \sin^2\frac{\pi x}{l}$, $x\in(0,l)$. ⃝
9.D.50. Solve $u_{xx} = u_t$, $x\in(0,\pi)$, $t\in(0,\infty)$, $u(t,0) = u(t,\pi) = 0$, $u(0,x) = 2\cos(3x)\sin x$, $x\in(0,\pi)$. ⃝
9.D.51. Solve $u_{xx} = u_{tt}$, $x\in(0,\pi)$, $t\in(0,\infty)$, $u(t,0) = u(t,\pi) = 0$, $u(0,x) = -6\sin(2x)\cos(2x)$, $u_t(0,x) = -\sin x + 4\sin x\cos x$, $x\in(0,\pi)$. ⃝
9.D.52. Solve $u_t = u_{xx}$, $x\in(0,a)$, $t\in(0,\infty)$, $u(t,0) = u_x(t,a) = 0$, $u(0,x) = x(2a-x)$, $x\in(0,a)$. ⃝
9.D.53. $u_{tt} = u_{xx}$, $x\in(0,A)$, $t\in(0,\infty)$, $u(t,0) = u(t,A) = 0$, $u(0,x) = x(A-x)$, $u_t(0,x) = 0$, $x\in(0,A)$. ⃝

Indeed, there is the following version of the latter theorem:

Frobenius' theorem
Theorem. The distribution $D$ defined on an $m$-dimensional manifold $M$ by $(m-n)$ independent smooth linear forms $\omega_{n+1}, \dots, \omega_m$ is integrable if and only if there are linear forms $\alpha_{k\ell}$ such that $d\omega_k = \sum_\ell \alpha_{k\ell}\wedge\omega_\ell$.

Proof. Let us write $\omega = (\omega_{n+1}, \dots, \omega_m)$ for the $\mathbb R^{m-n}$-valued form. The distribution is $D = \ker\omega$. Now, the formula (1) (applied to all components of $\omega$) implies that involutivity of $D$ is equivalent to $d\omega|_{\ker\omega} = 0$. If the assumption of the theorem on the forms holds true, $d\omega$ clearly vanishes on the kernel of $\omega$, therefore $D$ is involutive, and one of the implications of the theorem is proved. Next, assume $D$ is integrable. By the stronger claim proved in the latter Frobenius theorem, for each point $x \in M$ there are coordinates
$(x_1, \dots, x_m)$ such that $D$ is the common kernel of all $dx_{n+1}, \dots, dx_m$. In particular, our forms $\omega_j$ are linear combinations (over functions) of the latter $(m-n)$ differentials. Moreover, there must be smooth invertible matrices of functions $A = (a_{k\ell})$ such that $dx_k = \sum_\ell a_{k\ell}\omega_\ell$, $k,\ell = n+1, \dots, m$. Finally, $d\omega_k$ includes only terms $dx_i\wedge dx_j$ with $j > n$, and all the $dx_j$ can be expressed via our forms $\omega_\ell$ from the previous equation. Thus the differentials have the requested form. $\square$

2. Remarks on Partial Differential Equations

The aim of our excursion into the landscape of differential equations is modest. We do not have space in this rather elementary guide to come close enough to this subtle, beautiful, and extremely useful part of mathematics dealing with differential equations. Still, we mention a few issues. First, the simplest method reducing the problem to already mastered ordinary differential equations is explained, based on the so-called characteristics. Then we show more simple methods of obtaining some families of solutions. Next, we present a more complicated theoretical approach dealing with the formal solvability of even higher order systems of differential equations and its convergence – the famous Cauchy–Kovalevskaya theorem. This is the only instance of a general existence and uniqueness theorem for differential equations involving partial derivatives. Unfortunately, it does not cover many interesting problems of practical importance. Finally, we display a few classical methods of solving boundary problems involving some of the most common second order equations.

9.D.54. $\Delta u = u_{xx} + u_{yy} = 0$, $x\in(0,a)$, $y\in(0,a)$, $u(0,y) = u(a,y)$, $y\in(0,a)$, $u(x,a) = 0$, $u(x,0) = x(a-x)$, $x\in(0,a)$. ⃝
9.D.55. $u_t = 4u_{xx}$, $x\in(0,1)$, $t\in(0,\infty)$, $u_x(t,0) = u(t,1) = 0$, $u(0,x) = x - 1$, $x\in(0,1)$. ⃝
9.D.56. Find a bounded solution of the Laplace equation with boundary condition on the circle (internal Dirichlet boundary problem): $\Delta u(x,y) = 0$, $x^2 + y^2 < A^2$, $u(A\cos\varphi, A\sin\varphi) = \alpha(\varphi) = \sin^4\varphi$.
Solution. After the transformation to polar coordinates $x = r\cos\varphi$, $y = r\sin\varphi$, the Laplace equation for the function $u = u(r,\varphi)$ becomes $u_{rr} + \frac1r u_r + \frac1{r^2}u_{\varphi\varphi} = 0$. We look for a solution in the form $u(r,\varphi) = R(r)\Phi(\varphi)$, where $\Phi(\varphi+2\pi) = \Phi(\varphi)$, $u(A,\varphi) = \alpha(\varphi) = \sin^4\varphi$, and $u(0,\varphi) = C$ is constant. Compute $R''\Phi + \frac1r R'\Phi + \frac1{r^2}R\Phi'' = 0$; multiplying by $r^2$ and dividing by $R\Phi$ we get $r^2\frac{R''}{R} + r\frac{R'}{R} = -\frac{\Phi''}{\Phi} = -\lambda$. Solving the equation $\Phi'' - \lambda\Phi = 0 \Rightarrow \Phi = B_1e^{\mu\varphi} + B_2e^{-\mu\varphi}$, where $\mu = \sqrt{\lambda}$; the condition $\Phi(\varphi) = \Phi(\varphi+2\pi)$ forces $\Phi$ to be $2\pi$-periodic, so $\mu = in$ and $\lambda_n = -n^2$, $n\in\mathbb N_0$. Finally, for the function $\Phi(\varphi)$ we have $\Phi_n = a_n\cos(n\varphi) + b_n\sin(n\varphi)$. The equation for $R = R(r)$ is the Euler equation $r^2R_n'' + rR_n' - n^2R_n = 0$. Substituting $s = \ln r$, $r = e^s$, $R_n' = \frac1r\dot R_n$, $R_n'' = -\frac1{r^2}\dot R_n + \frac1{r^2}\ddot R_n$ (dots denote $d/ds$), we get $\ddot R_n - n^2R_n = 0$, i.e. $R_n(s) = C_0 + D_0 s$ for $n = 0$ and $R_n(s) = C_ne^{ns} + D_ne^{-ns}$ for $n\in\mathbb N$,

9.2.1. Initial observations. In practical problems, we often meet equations relating unknown functions of more variables and their derivatives. We already handled the very special case where the relations concern functions $x(t)$ of just one variable $t$. More explicitly, we dealt with vector equations $x^{(k)} = F(t, x, \dot x, \ddot x, \dots, x^{(k-1)})$, $F : \mathbb R^{nk+1}\to\mathbb R^n$, where the dots over $x\in\mathbb R^n$ mean the (iterated) derivatives of $x(t) = (x_1(t), \dots, x_n(t))$, up to order $k$.
The goal was to find a (vector) curve $x(t)$ in $\mathbb R^n$ which makes this equation valid. Two more comments are due: 1) we can omit the explicit appearance of $t$ at the cost of adding one more variable and the equation $\dot x_0 = 1$; and 2) giving new names to the iterated derivatives, $x_j = x^{(j)}$, and adding the equations $\dot x_j = x_{j+1}$, $j = 1, \dots, k-1$, we always reduce the problem to a first order system of equations (on a much bigger space). Thus, we would like to work similarly with the equations $F(x, y, u_x, u_{xx}, u_{xy}, u_{yy}, \dots) = 0$, where $u$ is an unknown function (possibly vector valued) of two variables $x$ and $y$ (or even more variables) and, as usual, the indices denote partial derivatives. Even if we expect the implicit equation to be solved in some sense with respect to some of the highest partial derivatives, we cannot hope for a general existence and uniqueness result similar to the ODE case. Let us start with a most simple example illustrating the general problem related to the choice of initial conditions.

9.2.2. The simplest linear case. Consider one real function $u = u(x,y)$, subject to the linear homogeneous equation
\[ (1)\quad a(x,y)u_x + b(x,y)u_y = 0, \]
where $a$ and $b$ are known functions of two variables defined for $x, y$ in a domain $\Omega\subset\mathbb R^2$. We consider the equation in the tubular domain $\Omega\times\mathbb R\subset\mathbb R^3$. Usually, $\Omega$ is an open set with a nice boundary, in our case a curve $\partial\Omega$. An obvious simple idea suggests writing $\Omega$ as a union of non-intersecting curves and looking for $u$ constant along those curves. Moreover, if those curves were transversal to the boundary $\partial\Omega$, then initial conditions along the boundary should extend inside $\Omega$. Thus, consider such a potentially existing curve $c(t) = (x(t), y(t))$ and write $0 = \frac{d}{dt}u(c(t)) = u_x(c(t))\dot x(t) + u_y(c(t))\dot y(t)$. This yields the conditions for the requested curves:
\[ (2)\quad \dot x = a(x,y),\qquad \dot y = b(x,y). \]
Since $u$ is considered constant along the curve, we obtain a unique possibility for the function $u$ along the curves for all initial conditions $x(0)$, $y(0)$, and $u(x(0),y(0))$, if the coefficients $a$ and $b$ are at least Lipschitz in $x$ and $y$. The latter curves are called the characteristics of the first order partial differential equation (1), and they are solutions of its characteristic equations (2).

Hence $R_n(r) = C_0 + D_0\ln r$ for $n = 0$ and $R_n(r) = C_nr^n + D_nr^{-n}$ for $n\in\mathbb N$. The solution has to be bounded for $r \to 0$ (inside the circle), so $D_0 = D_n = 0$ and finally $u(r,\varphi) = R_n(r)\Phi_n(\varphi) = a_0C_0 + \sum_{n=1}^\infty C_nr^n\big[a_n\cos(n\varphi) + b_n\sin(n\varphi)\big]$. The coefficients are computed from the boundary condition
\[ u(A,\varphi) = K_0 + \sum_{n=1}^\infty A^n\big[K_n\cos(n\varphi) + H_n\sin(n\varphi)\big] = \alpha(\varphi) = \sin^4\varphi = \Big(\frac{1-\cos(2\varphi)}{2}\Big)^2 = \frac{1 - 2\cos(2\varphi) + \cos^2(2\varphi)}{4} = \frac14 - \frac{\cos(2\varphi)}{2} + \frac14\Big(\frac{1+\cos(4\varphi)}{2}\Big) = \frac38 - \frac{\cos(2\varphi)}{2} + \frac18\cos(4\varphi), \]
so $K_0 = \frac38$, $K_2 = \frac{-1}{2A^2}$, $K_4 = \frac{1}{8A^4}$, and all other $K_n$, $H_n$ are zero. Hence
\[ u(r,\varphi) = \frac38 - \frac{r^2}{2A^2}\cos(2\varphi) + \frac{r^4}{8A^4}\cos(4\varphi),\qquad u(x,y) = \frac38 - \frac{x^2-y^2}{2A^2} + \frac{x^4+y^4-6x^2y^2}{8A^4}, \]
where we used $\cos(2\varphi) = \cos^2\varphi - \sin^2\varphi = \frac{x^2-y^2}{r^2}$ and $\cos(4\varphi) = \cos^2(2\varphi) - \sin^2(2\varphi)$. $\square$
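A quick symbolic check of the result of 9.D.56 (our addition): the Cartesian form of the solution is harmonic and matches the boundary data.

```python
import sympy as sp

x, y, A, phi = sp.symbols('x y A phi', positive=True)
u = sp.Rational(3, 8) - (x**2 - y**2)/(2*A**2) + (x**4 + y**4 - 6*x**2*y**2)/(8*A**4)

print(sp.simplify(sp.diff(u, x, 2) + sp.diff(u, y, 2)))   # 0: u is harmonic
bnd = u.subs({x: A*sp.cos(phi), y: A*sp.sin(phi)})
print(sp.simplify(bnd - sp.sin(phi)**4))                  # 0: boundary data matched
```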
9.D.57. Find a bounded solution of the Laplace equation with boundary condition on the circle (external Dirichlet boundary problem): $\Delta u(x,y) = 0$, $x^2+y^2 > A^2$, $u(A\cos\varphi, A\sin\varphi) = \alpha(\varphi) = \cos^2\varphi$.
Solution. From the previous exercise we get $\Phi_n = a_n\cos(n\varphi) + b_n\sin(n\varphi)$, $n\in\mathbb N_0$, and $R_n(r) = D_0 + C_0\ln r$ for $n = 0$, $R_n(r) = C_nr^n + D_nr^{-n}$ for $n\in\mathbb N$. The functions $R_n(r)$ have to be bounded for $r\to\infty$ (outside the circle), so $C_0 = C_n = 0$ and we have $u(r,\varphi) = a_0D_0 + \sum_{n=1}^\infty D_nr^{-n}\big[a_n\cos(n\varphi) + b_n\sin(n\varphi)\big]$. The coefficients are computed from the boundary condition
\[ u(A,\varphi) = K_0 + \sum_{n=1}^\infty A^{-n}\big[K_n\cos(n\varphi) + H_n\sin(n\varphi)\big] = \alpha(\varphi) = \cos^2\varphi = \frac{1+\cos(2\varphi)}{2}. \]

If the coefficients are differentiable in all variables, then the solution $u$ will also be differentiable for differentiable choices of the initial conditions on a curve transversal to the characteristics, and we might have solved the problem (1) locally. Still it might fail. Let us look at the homogeneous linear problem
\[ (3)\quad yu_x - xu_y = 0,\qquad u(x,0) = x. \]
We have already seen the solutions of the characteristic equations $\dot x = y$, $\dot y = -x$: the characteristics are circles centred at the origin, $x(t) = R\sin t$, $y(t) = R\cos t$. If we choose any even differentiable function $\psi(x) = u(x,0)$ for the initial conditions at the points $(x,0)$, we are lucky and the solution will work. But for odd functions, e.g. our choice $\psi(x) = x$, there is no solution of our problem in any neighbourhood of the origin. Clearly, this failure is linked to the fact that the origin is a singular point of the characteristic equations.

9.2.3. The quasi-linear case. The situation seems to get more tricky once we add a nontrivial right-hand side $f(x,y,u)$ to the equation (1), i.e. we try to solve the problem (allowing $a$ and $b$ to depend on $u$)
\[ (1)\quad a(x,y,u)u_x + b(x,y,u)u_y = f(x,y,u). \]
But in fact, the very same idea leads to characteristic equations on $\mathbb R^3$, writing $z = u(x,y)$ for the unknown function along the characteristics. Geometrically, we seek a vector field tangent to all graphs of solutions in the tubular domain $\Omega\times\mathbb R$. Recall that $z = u(x,y)$, restricted to a curve in the graph, implies $\dot z = u_x\dot x + u_y\dot y$, and thus we may set $\dot z = f(x,y,z)$, $\dot x = a(x,y,z)$, $\dot y = b(x,y,z)$ in order to get such a characteristic vector field.

Characteristic equations and integrals
The characteristic equations of the equation (1) are
\[ (2)\quad \dot x = a(x,y,z),\quad \dot y = b(x,y,z),\quad \dot z = f(x,y,z). \]
This autonomous system of three equations is uniquely solvable for each initial condition if $a$, $b$, and $f$ are Lipschitz. A function $\psi$ on $\Omega\times\mathbb R$ which is constant on each flow line of the characteristic vector field, i.e., $\psi(x(t),y(t),z(t)) = \text{const}$ for all solutions of (2), is called an integral of the equation (1).

If $\psi_z \ne 0$, then the implicit function theorem guarantees the unique existence of the function $z = u(x,y)$ satisfying the chosen initial conditions. Check yourself that these functions $u$ are solutions of our problem. This approach covers the homogeneous case as well; we just consider the autonomous characteristic equations with $\dot z = 0$ added. Let us come back to our simple equation 9.2.2(3) and choose $f(x,y,u) = y$ for the right-hand side. The characteristic equations yield $x = R\sin t$, $y = R\cos t$ as before, while $\dot z = y = R\cos t$ and hence $z = R\sin t + z(0)$. Thus, we may choose $\psi(x,y,z) = z - x$ as an integral of the equation, and the solution $u(x,y) = x + C$ with any constant $C$.

We get $K_0 = \frac12$, $K_2 = \frac{A^2}{2}$, and all other $K_n$, $H_n$ are zero. Hence
\[ u(r,\varphi) = \frac12 + \frac{A^2}{2r^2}\cos(2\varphi),\qquad u(x,y) = \frac12 + \frac{x^2-y^2}{2(x^2+y^2)^2}A^2. \] $\square$

9.D.58. Solve the internal Dirichlet problem on the circle: $\Delta u(x,y) = 0$, $x^2+y^2 < A^2$, $u(A\cos\varphi, A\sin\varphi) = \alpha(\varphi) = \sin\varphi\cos\varphi$. ⃝
9.D.59. Solve the external Dirichlet problem on the circle: $\Delta u(x,y) = 0$, $x^2+y^2 > A^2$, $u(A\cos\varphi, A\sin\varphi) = \alpha(\varphi) = 2\sin^2\varphi + 3\cos^2\varphi$. ⃝
9.D.60. Solve the internal Dirichlet problem on the circle: $\Delta u(x,y) = 0$, $x^2+y^2 < 1$, $u(\cos\varphi, \sin\varphi) = \alpha(\varphi) = \sin\varphi + \cos^2\varphi$. ⃝
9.D.61. Solve the parabolic equation $u_t = \kappa u_{xx}$, $x\in\mathbb R$, $t\in(0,\infty)$, with initial condition
\[ u(0,x) = \varphi(x) = \begin{cases}1 & \text{for } x\in(-1,1),\\ 0 & \text{otherwise.}\end{cases} \]
Suppose $u(t,x)\to 0$ and $u_x(t,x)\to 0$ for $x\to\pm\infty$.
Solution. Let us compute the Fourier image of $u_{xx}$. Integrating by parts twice, with the boundary terms vanishing since $u\to 0$ and $u_x\to 0$ for $x\to\pm\infty$,
\[ \widetilde{u_{xx}}(\xi) = \int_{-\infty}^\infty u_{xx}e^{-i\xi x}\,dx = \big[u_xe^{-i\xi x}\big]_{-\infty}^\infty + i\xi\big[ue^{-i\xi x}\big]_{-\infty}^\infty + (-i\xi)^2\int_{-\infty}^\infty ue^{-i\xi x}\,dx = 0 + 0 - \xi^2\tilde u. \]
After the Fourier transformation the equation becomes $\tilde u_t(t,\xi) = -\kappa\xi^2\tilde u(t,\xi)$, and together with the initial condition $\tilde u(0,\xi) = \tilde\varphi(\xi)$, where $\tilde\varphi(\xi)$ is the Fourier image of $\varphi(x)$, we have the Fourier image of the solution $\tilde u(t,\xi) = \tilde\varphi(\xi)e^{-\kappa\xi^2t}$ (solve this separable equation with the initial condition thoroughly).

Notice, there will be plenty of solutions here, since we may add any solution of the homogeneous problem, i.e. all functions of the form
\[ (3)\quad u(x,y) = h(x^2+y^2) \]
with any differentiable function $h$. Thus the general solution $u(x,y) = x + h(x^2+y^2)$ depends on one function of one variable (the above constant $C$ is a special case of $h$). We may also conclude that for "reasonable" curves $\partial\Omega\subset\mathbb R^2$ (those transversal to the circles centred at the origin and not containing the origin) and "reasonable" initial values $u|_{\partial\Omega}$ (we have to watch multiple intersections of the circles with $\partial\Omega$) there will be, at least locally, a unique solution extending the initial values to an open neighborhood of $\partial\Omega$. Of course, we may similarly use characteristics and integrals for any finite number of variables $x = (x_1,\dots,x_n)$ and equations of the form
\[ a_1(x,u)\tfrac{\partial u}{\partial x_1} + \dots + a_n(x,u)\tfrac{\partial u}{\partial x_n} = f(x,u) \]
with the unknown function $u = u(x_1,\dots,x_n)$. As we shall see later, typically we obtain generic solutions depending on one function of $n-1$ variables, similarly to the above example.

9.2.4. Systems of equations. Let us look at what happens if we add more equations. There are two quite different ways to couple the equations. We may seek an unknown vector valued function $u = (u_1,\dots,u_m) : \mathbb R^n\to\mathbb R^m$, subject to $m$ equations
\[ (1)\quad A_i(x,u)\cdot\nabla u_i = f_i(x,u),\quad i = 1,\dots,m, \]
where the left-hand side is the scalar product of a vector valued function $A_i : \mathbb R^{m+n}\to\mathbb R^n$ and the gradient vector of the function $u_i$. Such systems behave similarly to the scalar ones, and we shall come back to them later. The other option leads to the so-called overdetermined systems of equations. Actually, we shall not pay more attention to this case in the sequel, so the reader may jump to 9.2.6 if getting lost. Consider a (scalar) function $u$ on a domain $\Omega\subset\mathbb R^n$ and its gradient vector $\nabla u$. For each matrix $A = (a_{ij})$ with $m$ rows and $n$ columns, with differentiable functions $a_{ij}(x,u)$ on $\Omega\times\mathbb R$, and the right-hand side function $F(x,u) : \Omega\times\mathbb R\to\mathbb R^m$, we can consider the system of equations
\[ (2)\quad A(x,u)\cdot\nabla u = F(x,u). \]
Of course, in both cases we have $m$ individual equations of the type from the previous paragraph, and we could apply the same idea of characteristic vector fields to each of them. The problem consists in the coupling of the equations, where the individual characteristic fields may yield inconsistent necessary conditions.
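Returning to the quasi-linear example of 9.2.3, the general solution $u = x + h(x^2+y^2)$ derived above is easy to confirm symbolically. A minimal sketch (our addition), with $h$ an arbitrary differentiable function:

```python
import sympy as sp

x, y = sp.symbols('x y')
h = sp.Function('h')
u = x + h(x**2 + y**2)

# the equation y*u_x - x*u_y = y, i.e. 9.2.2(3) with right-hand side f = y
print(sp.simplify(y*sp.diff(u, x) - x*sp.diff(u, y) - y))  # 0
```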
If the Fourier image of the solution is a product of two functions, the original solution has to be the convolution of the inverse Fourier images of these functions:
\[ u(t,x) = \int_{-\infty}^\infty\varphi(y)\frac{1}{2\sqrt{\pi\kappa t}}e^{-\frac{(x-y)^2}{4\kappa t}}\,dy. \]
We have used the fact that the inverse Fourier image of the Gaussian function $e^{-\kappa\xi^2t}$ is again a Gaussian function, $\frac{1}{2\sqrt{\pi\kappa t}}e^{-\frac{x^2}{4\kappa t}}$, and the formula for the convolution $f*g(x) = \int_{-\infty}^\infty f(y)g(x-y)\,dy$. We express $u(t,x)$ using the error function $\operatorname{erf}(x) = \frac{2}{\sqrt\pi}\int_0^x e^{-a^2}\,da$. Taking $\varphi(x)$ from the assignment, we compute (substituting $a = \frac{x-y}{2\sqrt{\kappa t}}$, $da = -\frac{1}{2\sqrt{\kappa t}}\,dy$):
\[ u(t,x) = \int_{-1}^1\frac{1}{2\sqrt{\pi\kappa t}}e^{-\frac{(x-y)^2}{4\kappa t}}\,dy = \int_{\frac{x+1}{2\sqrt{\kappa t}}}^{\frac{x-1}{2\sqrt{\kappa t}}}\frac{-1}{\sqrt\pi}e^{-a^2}\,da = \frac{1}{\sqrt\pi}\int_0^{\frac{x+1}{2\sqrt{\kappa t}}}e^{-a^2}\,da - \frac{1}{\sqrt\pi}\int_0^{\frac{x-1}{2\sqrt{\kappa t}}}e^{-a^2}\,da, \]
\[ u(t,x) = \frac12\Big[\operatorname{erf}\Big(\frac{x+1}{2\sqrt{\kappa t}}\Big) - \operatorname{erf}\Big(\frac{x-1}{2\sqrt{\kappa t}}\Big)\Big]. \] $\square$

9.D.62. Solve the parabolic equation $u_t = u_{xx}$, $x\in\mathbb R$, $t\in(0,\infty)$, with initial condition $u(0,x) = \varphi(x) = 2$ for $x\in(0,1)$ and $0$ otherwise. Suppose $u(t,x)\to 0$ and $u_x(t,x)\to 0$ for $x\to\pm\infty$, and express the solution using the error function. ⃝

E. Variational Problems

F. Complex analytic functions

9.F.1. Check that the mapping $z\mapsto z^n$, $n\in\mathbb N$, defined on the entire $\mathbb C$, has the complex derivative $z\mapsto nz^{n-1}$.
Solution. We can proceed in two ways: either check the definition of the complex derivative directly, cf. ??, or use the detour via the explicit expression of the mapping $f(x+iy) = (x+iy)^n$ in two real coordinates $x, y$ and see that its derivative is complex. The first, very simple, possibility repeats the computation with polynomials in one real variable and is shown in 9.4.1. $\square$

Let us look at the overdetermined case now. We get closest to the situation with ordinary differential equations if $A$ is invertible; moving it to the right-hand side, we arrive at the system of equations
\[ (3)\quad \nabla u = A^{-1}(x,u)\cdot F(x,u) = G(x,u). \]
The simplest non-trivial case consists of two equations in two variables: $u_x = f(x,y,u)$, $u_y = g(x,y,u)$. Geometrically, we describe the graph of the solution as a surface in $\mathbb R^3$ by prescribing its tangent plane through each point. An obvious necessary condition for the existence of such $u$ is obtained by differentiating the equations and employing the symmetry of the higher order partial derivatives, i.e. the condition $u_{xy} = u_{yx}$. Indeed, $u_{xy} = f_y + f_ug = g_x + g_uf = u_{yx}$, where we substituted the original equations after applying the chain rule. We shall see in a moment that this condition is also sufficient for the existence of solutions. Moreover, if the solutions exist, they are determined by their values at one point, similarly to ordinary differential equations.

9.2.5. Frobenius' theorem again. Similarly, we can deal with the gradient $\nabla u$ of an $m$-dimensional vector valued function $u$. For example, if $m = 2$ and $n = 2$ we are describing the tangent planes to the two-dimensional graph of the solution $u$ in $\mathbb R^4$. In general we face $mn$ equations
\[ (1)\quad \tfrac{\partial u_p}{\partial x_i} = F_{pi}(x,u),\quad i = 1,\dots,n,\ p = 1,\dots,m. \]
The necessary conditions imposed by the symmetry of higher order derivatives then read
\[ (2)\quad \tfrac{\partial^2u_p}{\partial x_i\partial x_j} = \tfrac{\partial F_{pi}}{\partial x_j} + \sum_q\tfrac{\partial F_{pi}}{\partial u_q}F_{qj} = \tfrac{\partial F_{pj}}{\partial x_i} + \sum_q\tfrac{\partial F_{pj}}{\partial u_q}F_{qi} \]
for all $i$, $j$ and $p$. Let us reconsider our problem from the geometric point of view. We are seeking the graph of the mapping $u : \mathbb R^n\to\mathbb R^m$. The equations (1) describe an $n$-dimensional distribution $D$ on $\mathbb R^{m+n}$, and the graphs of possible solutions $u = (u_1,\dots,u_m)$ are just the integral manifolds of $D$. The distribution $D$ is clearly defined by the $m$ linear forms $\omega_p = du_p - \sum_i F_{pi}\,dx_i$, $p = 1,\dots,m$, while the vector fields generating the common kernel of all $\omega_p$ can be chosen as $X_i = \frac{\partial}{\partial x_i} + \sum_p F_{pi}\frac{\partial}{\partial u_p}$.
Now we compute the differentials $d\omega_p$ and evaluate them on the fields $X_i$:
\[ -d\omega_p = \sum_{i,j}\tfrac{\partial F_{pi}}{\partial x_j}\,dx_j\wedge dx_i + \sum_{i,q}\tfrac{\partial F_{pi}}{\partial u_q}\,du_q\wedge dx_i = \sum_{i,j}\Big(\tfrac{\partial F_{pi}}{\partial x_j} + \sum_q\tfrac{\partial F_{pi}}{\partial u_q}F_{qj}\Big)\,dx_j\wedge dx_i, \]
\[ -d\omega_p(X_j,X_i) = \Big(\tfrac{\partial F_{pi}}{\partial x_j} + \sum_q\tfrac{\partial F_{pi}}{\partial u_q}F_{qj}\Big) - \Big(\tfrac{\partial F_{pj}}{\partial x_i} + \sum_q\tfrac{\partial F_{pj}}{\partial u_q}F_{qi}\Big). \]
Thus, vanishing of the differentials on the common kernel is equivalent to the necessary conditions deduced above, and the Frobenius theorem says that these conditions are also sufficient. We have proved the following:

Theorem. The system of equations (1) admits solutions if and only if the conditions (2) are satisfied. Then the solutions are determined uniquely, locally around $x\in\Omega$, by the initial condition $u(x)\in\mathbb R^m$.

Remark. The Frobenius theory deals with the so-called overdetermined systems of PDEs: we have too many equations, and this causes obstructions to their integrability. Although the case in the last paragraph sounds very special, the actual use of the theory consists in considering differential consequences of a given system until we reach a point where the special theorem applies, providing not only further obstructions but also sufficient conditions.

9.2.6. General solutions to PDEs. In a moment, we shall deal with diverse boundary conditions for the solutions of PDEs. In most cases we shall be happy to have good families of simple "guessed" solutions which are not subject to any further conditions. We talk about general solutions in this context. Unlike the situation with ODEs, we should not hope to get a universal expression for all possible solutions this way (although we can come close to that in some cases, cf. 9.2.3(3)). Instead, we often try to find the right superpositions (i.e. linear combinations) or integrals built from suitable general solutions.

Let us look at the simplest linear second order equations in two variables, homogeneous with constant coefficients:
\[ (1)\quad Au_{xx} + 2Bu_{xy} + Cu_{yy} + Du_x + Eu_y + Fu = 0, \]
where $A, B, C, D, E, F$ are real constants and at least one of $A, B, C$ is non-zero. Similarly to the method of characteristics, we try to reduce the problem to ODEs. Let us again assume a solution of the form $u = f(p)$, where $f$ is an unknown function of $p$, and $p(x,y)$ should be nice enough to get close to solutions. The necessary derivatives are $u_x = f'p_x$, $u_y = f'p_y$, $u_{xx} = f''p_xp_x + f'p_{xx}$, $u_{xy} = f''p_xp_y + f'p_{xy}$, $u_{yy} = f''p_yp_y + f'p_{yy}$. Thus (1) becomes too complicated in general, but restricting to affine $p(x,y) = \alpha x + \beta y$ with constants $\alpha, \beta$, we arrive at
\[ (2)\quad (A\alpha^2 + 2B\alpha\beta + C\beta^2)f'' + (D\alpha + E\beta)f' + Ff = 0. \]
This is a nice ODE as soon as we fix the values of $\alpha$ and $\beta$. Let us look at several simple cases of special importance. Assume $D = E = F = 0$, $A\ne 0$. Then, after dividing by $\alpha^2$, we solve the equation $\big(A + 2B\frac{\beta}{\alpha} + C\frac{\beta^2}{\alpha^2}\big)f'' = 0$, and the right choice of the ratio $\lambda = \beta/\alpha \ne 0$ kills the entire coefficient at $f''$. Thus, (2) holds true for any (twice differentiable) function $f$, and we arrive at the general solution $u(x,y) = f(p(x,y))$ with $p(x,y) = x + \lambda y$. Of course, the behavior will very much depend on the number of real roots of the quadratic equation $A + 2B\lambda + C\lambda^2 = 0$.

The wave equation. Put $A = 1$, $C = -\frac{1}{c^2}$, $B = 0$; thus our equation is $u_{xx} = \frac{1}{c^2}u_{yy}$, the wave equation in dimension 1. Then the equation $1 - \frac{\lambda^2}{c^2} = 0$ has two real roots $\lambda = \pm c$, and we obtain $p = x \pm cy$, leading to the general solution $u(x,y) = f(x-cy) + g(x+cy)$ with two arbitrary twice differentiable functions $f$ and $g$ of one variable. In physics, the equation models one-dimensional wave development in the space parametrized by $x$, while $y$ stands for time. Notice $c$ corresponds to the speed of the wave $u(x,0) = f(x) + g(x)$ initiated at time $y = 0$; the $f$ part moves forwards while the other part moves backwards. Indeed, imagine $u(x,y) = f(x-cy)$ describes the displacement of a string at the point $x$ at time $y$. This remains constant along the lines $x - cy = \text{const}$. Thus a stationary observer sees the initial displacement $u(x,0)$ moving along the $x$-axis with speed $c$. In particular, we see that an initial condition along a line in the plane is not enough to determine the solution, unless we request that the solution move only in one of the possible directions (i.e. we posit either $f$ or $g$ to be zero).

The Laplace equation. Now we consider $A = C = 1$, $B = 0$, i.e. the equation $u_{xx} + u_{yy} = 0$. This is the Laplace equation in two dimensions, and its solutions are called harmonic functions. Proceeding as before, we obtain two imaginary solutions of the equation $\lambda^2 + 1 = 0$, and our method produces $p = x \pm iy$, a complex valued function instead of the expected real one. This looks ridiculous, but we may consider $f$ to be a mapping $f : \mathbb C\to\mathbb C$ viewed as a mapping on the complex plane. Recall that some such mappings have differentials $D^1f(p)$ which actually are multiplications by complex numbers at each point, cf. ??. This is in particular true for any polynomial or converging power series. We may request that this property hold true for all iterated derivatives of this kind. In general, we call such functions on $\mathbb C$ holomorphic, and we discuss them in the last part of this chapter. The reader is advised to come back to this exposition on general solutions of the Laplace equation after reading the beginning of the part on complex analytic functions below, starting in 9.4.1. Now, assuming $f$ is holomorphic, we can repeat the above computation and arrive again at $(\lambda^2+1)f''(p) = 0$, independently of the choice of $f$ (here $f'(p)$ means the complex number given by the differential $D^1f$, and $f''(p)$ is the iteration of this kind of derivative). Moreover, the derivatives of vector valued functions are computed for the components separately, and thus both the real and the imaginary part of the general solution $f(x+iy) + g(x-iy)$ are real general solutions. For example, $f(p) = p^2$ leads to $u(x,y) = (x+iy)^2 = (x^2-y^2) + i\,2xy$, and a simple check shows that both terms satisfy the equation separately. Notice the two solutions $x^2-y^2$ and $xy$ provide bases of the 2-dimensional vector space of harmonic homogeneous polynomials of degree two.

The diffusion equation. Next assume $A = \kappa$, $B = C = D = F = 0$, and add the first order term with $E = -1$. This provides the equation $u_y = \kappa u_{xx}$, the diffusion equation in dimension one. Applying the same method again, we arrive at the ODE $\kappa\alpha^2f'' - \beta f' = 0$, which is easy to solve. We know the solutions are found in the form $f(p) = e^{\nu p}$ with $\nu$ satisfying the condition $\kappa\alpha^2\nu^2 - \beta\nu = 0$. The zero solution is not interesting, thus we are left with the general solution to our problem by substituting $p(x,y) = \alpha x + \beta y$ and $\nu = \frac{\beta}{\kappa\alpha^2}$:
\[ u(x,y) = f(p) = e^{\frac1\kappa\big(\frac\beta\alpha x + \frac{\beta^2}{\alpha^2}y\big)}. \]
Again, a simple check reveals that this is a solution. But it is not very "general" – it depends on just two scalars $\alpha$ and $\beta$. We shall have to find much better ways of producing solutions of such equations.
9.2.7. Nonhomogeneous equations. As always with linear equations, the space of solutions of a homogeneous linear equation is a real vector space (or complex, if we deal with complex valued solutions). Let us write the equation as $Lu = 0$, where $L$ is the differential operator on the left-hand side. For instance,
\[ L = A\tfrac{\partial^2}{\partial x^2} + 2B\tfrac{\partial^2}{\partial x\partial y} + C\tfrac{\partial^2}{\partial y^2} + D\tfrac{\partial}{\partial x} + E\tfrac{\partial}{\partial y} + F \]
in the case of the linear equation 9.2.6(1). The solutions of the corresponding non-homogeneous equation $Lu = f$ with a given function $f$ on the right-hand side form an affine space. Indeed, if $Lu_1 = f$, $Lu_2 = f$, $Lu_3 = 0$, then clearly $L(u_1-u_2) = 0$ while $L(u_1+u_3) = f$. Thus, once we succeed in finding a single solution of $Lu = f$, we can add any general solution of the homogeneous equation to obtain a general solution. Let us illustrate this observation on some of our basic examples. The non-homogeneous wave equation $u_{xx} - u_{yy} = x + y$ has the general solution $u(x,y) = \frac16(x^3 - y^3) + f(x-y) + g(x+y)$, depending on two twice differentiable functions. The non-homogeneous Laplace equation is called the Poisson equation. A general complex valued solution of the Poisson equation $u_{xx} + u_{yy} = x + y$ is $u(x,y) = \frac16(x^3 + y^3) + f(x-iy) + g(x+iy)$, depending on two holomorphic functions $f$ and $g$.

9.2.8. Separation of variables. As we have experienced, a straightforward attempt to get solutions is to expect them in a particularly simple form. The method of separation of variables is based on the assumption that the solution appears as a product of functions of the single variables in question. Let us apply this method to our three special examples.

The diffusion equation. We expect to find a general solution of $\kappa u_{xx} = u_t$ in the form $u(x,t) = X(x)T(t)$. Thus the equation says $\kappa X''(x)T(t) = T'(t)X(x)$. Assume further $u \ne 0$ and divide this equation by $u = XT$: $\frac{X''(x)}{X(x)} = \frac{T'(t)}{\kappa T(t)}$. Now the crucial observation comes: the terms on the left and right are functions of different variables, and thus the equation can be satisfied only if both sides are constant. We shall have to distinguish the signs of this separation constant, so let us write it as $-\alpha^2$ (choosing the negative option). Thus we have to solve two independent ODEs, $X'' + \alpha^2X = 0$, $T' + \alpha^2\kappa T = 0$. The general solutions are $X(x) = A\cos\alpha x + B\sin\alpha x$, $T(t) = C e^{-\alpha^2\kappa t}$, with free real constants $A$, $B$, $C$. When combining these solutions in the product, we may absorb the constant $C$ into the other ones, and thus we arrive at the general solution
\[ u(x,t) = (A\cos\alpha x + B\sin\alpha x)\,e^{-\alpha^2\kappa t}. \]
This solution depends on three real constants. If we choose a positive separation constant $\alpha^2$ instead, there is a sign change in our equations and the resulting general solution is $u(x,t) = (A\cosh\alpha x + B\sinh\alpha x)\,e^{\alpha^2\kappa t}$. If the separation constant vanishes, we obtain just $u(x,t) = A + Bx$, independent of $t$.
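The separated solutions can again be tested symbolically. A minimal sketch (our addition) covering both sign choices of the separation constant:

```python
import sympy as sp

x, t, kappa, alpha, A, B = sp.symbols('x t kappa alpha A B', positive=True)
u_osc = (A*sp.cos(alpha*x) + B*sp.sin(alpha*x))*sp.exp(-alpha**2*kappa*t)
u_hyp = (A*sp.cosh(alpha*x) + B*sp.sinh(alpha*x))*sp.exp(alpha**2*kappa*t)
for u in (u_osc, u_hyp):
    print(sp.simplify(kappa*sp.diff(u, x, 2) - sp.diff(u, t)))  # 0, 0
```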
The Laplace equation. Assume $u(x,y) = X(x)Y(y)$ satisfies the equation $u_{xx} + u_{yy} = 0$ and proceed exactly as above. Thus $X''Y + Y''X = 0$, and dividing by $XY$ and choosing the separation constant $\alpha^2$, we arrive at $X'' = \alpha^2X$, $Y'' = -\alpha^2Y$. The general solution depends on four real constants $A$, $B$, $C$, $D$:
\[ u(x,y) = (A\cosh\alpha x + B\sinh\alpha x)(C\cos\alpha y + D\sin\alpha y). \]
If the separation constant is negative, i.e. $-\alpha^2$, the roles of $x$ and $y$ swap.

The wave equation. Let us look at how the method works when more variables are present. Consider a solution $u(x,y,z,t) = X(x)Y(y)Z(z)T(t)$ of the 3D wave equation $\frac{1}{c^2}u_{tt} = u_{xx} + u_{yy} + u_{zz}$. Playing the same game again, we arrive at the equation $\frac{1}{c^2}T''XYZ = X''YZT + Y''XZT + Z''XYT$. Dividing by $u\ne 0$,
\[ \frac{1}{c^2}\frac{T''}{T} = \frac{X''}{X} + \frac{Y''}{Y} + \frac{Z''}{Z}, \]
and since all the individual terms depend on different single variables, they have to be constant. Again, we have to pay attention to the signs of the separation constants. For instance, let us choose all constants negative and look at the individual four ODEs
\[ \frac{1}{c^2}\frac{T''}{T} = -\alpha^2,\quad \frac{X''}{X} = -\beta^2,\quad \frac{Y''}{Y} = -\gamma^2,\quad \frac{Z''}{Z} = -\delta^2, \]
with the constants satisfying $-\alpha^2 = -\beta^2 - \gamma^2 - \delta^2$. The general solution is $u = XYZT$ with the linear combinations $T(t) = A\cos c\alpha t + B\sin c\alpha t$, $X(x) = C\cos\beta x + D\sin\beta x$, $Y(y) = E\cos\gamma y + F\sin\gamma y$, $Z(z) = G\cos\delta z + H\sin\delta z$, with eight real constants $A$ through $H$. If we choose any of the separation constants positive, the corresponding component of the product displays hyperbolic sine and cosine instead; of course, the relation between the constants sees the signs as well. We can also work with complex valued solutions and choose exponentials as our building blocks (i.e. $X(x) = e^{\pm i\beta x}$ or $X(x) = e^{\pm\beta x}$, etc.). For instance, take one of the solutions with all the separation constants negative:
\[ u(x,y,z,t) = e^{i\beta x}e^{i\gamma y}e^{i\delta z}e^{-ic\alpha t} = e^{i(\beta x + \gamma y + \delta z - c\alpha t)}. \]
Similarly to the 1D situation, we can again see a "plane wave" propagating along the direction $(\beta,\gamma,\delta)$ with angular frequency $c\alpha$.

9.2.9. Boundary conditions. We continue with our examples of second order equations and discuss the three most common boundary conditions for them. Let us consider a domain $\Omega\subset\mathbb R^n$, bounded or unbounded, and a differential operator $L$ defined on (real or complex valued) functions on $\Omega$. We write $\partial\Omega$ for the boundary of $\Omega$ and assume it is a smooth manifold. Locally, such a hypersurface in $\mathbb R^n$ is given by one implicit function $\rho : \mathbb R^n\to\mathbb R$, and the unit normal vector $\nu(x)$, $x\in\partial\Omega$, to the hypersurface $\partial\Omega$ is given by the normalized gradient $\nu(x) = \frac{\nabla\rho(x)}{\|\nabla\rho(x)\|}$. We say that a function $u$ is differentiable on $\Omega$ if it is differentiable on its interior and the directional derivatives $D^1_\nu u(x)$ exist at all points of the boundary. Typically we write $\frac{\partial}{\partial\nu}$ for the derivative in the normal direction. For simplicity, let us restrict ourselves to $L$ of the form
\[ L = A(x,y)\tfrac{\partial^2}{\partial x^2} + 2B(x,y)\tfrac{\partial^2}{\partial x\partial y} + C(x,y)\tfrac{\partial^2}{\partial y^2} \]
and look at the equation $Lu = F(x,y,u,\frac{\partial u}{\partial x},\frac{\partial u}{\partial y})$.

Cauchy boundary problem
At each point $x\in\partial\Omega$ of the boundary we prescribe both the value $\varphi(x) = u(x)$ and the derivative $\psi(x) = \frac{\partial u}{\partial\nu}(x)$ in the unit normal direction. The Cauchy problem is to solve the equation $Lu = F$ on $\Omega$, subject to $u = \varphi$ and $\frac{\partial u}{\partial\nu} = \psi$ on $\partial\Omega$.

We shall see that Cauchy problems very often lead locally to unique solutions, subject to certain geometric conditions on the boundary $\partial\Omega$. At the same time, this is often not the convenient setup for practical problems. We shall illustrate this phenomenon on the 2D Laplace equation in the next but one paragraph. An even simpler possibility is to request only the condition on the values of $u$ on the boundary $\partial\Omega$.
Another possibility, often needed in direct applications, is to prescribe the derivatives only. We shall see that this is reasonable for the Laplace and Poisson equations.

Dirichlet and Neumann boundary problems
At each point $x\in\partial\Omega$ of the boundary we prescribe the value $\varphi(x) = u(x)$ or the derivative $\psi(x) = \frac{\partial u}{\partial\nu}(x)$ in the unit normal direction. The Dirichlet problem is to solve the equation $Lu = F$ on $\Omega$, subject to the condition $u = \varphi$ on $\partial\Omega$. The Neumann problem is to solve the equation $Lu = F$ on $\Omega$, subject to the condition $\frac{\partial u}{\partial\nu} = \psi$ on $\partial\Omega$.

9.2.10. Uniqueness for Poisson equations. Because the proof of the next theorem works in all dimensions $n\ge 2$, we shall formulate it for the general Poisson equation
\[ (1)\quad \Delta u = \Big(\tfrac{\partial^2}{\partial x_1^2} + \dots + \tfrac{\partial^2}{\partial x_n^2}\Big)u = F(x_1,\dots,x_n). \]

Theorem. Assume $u$ is a twice differentiable solution of the Poisson equation (1) on a domain $\Omega\subset\mathbb R^n$. If $u$ satisfies the Dirichlet condition $u = \varphi$ on $\partial\Omega$, then $u$ is the only solution of the Dirichlet problem. If $u$ satisfies the Neumann condition $\frac{\partial u}{\partial\nu} = \psi$ on $\partial\Omega$, then $u$ is the unique solution of the Neumann problem, up to an additive constant.

The proof of this theorem relies on a straightforward consequence of the divergence theorem. Recall 9.1.14: for each vector field $X$ on a domain $\Omega\subset\mathbb R^n$ with hypersurface boundary $\partial\Omega$,
\[ (2)\quad \int_\Omega \operatorname{div}X\,dx_1\dots dx_n = \int_{\partial\Omega} X\cdot\nu\,d\partial\Omega, \]
where $\nu$ is the oriented (outward) unit normal to $\partial\Omega$ and $d\partial\Omega$ stands for the volume inherited from $\mathbb R^n$ on $\partial\Omega$.

1st and 2nd Green's identities
Lemma. Let $M\subset\mathbb R^n$ be an $n$-dimensional manifold with boundary hypersurface $S$, and consider two differentiable functions $\varphi$ and $\psi$. Then
\[ (3)\quad \int_M(\varphi\Delta\psi + \nabla\varphi\cdot\nabla\psi)\,dx_1\dots dx_n = \int_S\varphi\,\nabla\psi\cdot\nu\,dS. \]
This version of the divergence theorem is called the 1st Green's identity. Next, let us consider one more differentiable function $\mu$ and $X = \varphi\mu\nabla\psi - \psi\mu\nabla\varphi$. Then the divergence theorem yields the so-called 2nd Green's identity
\[ (4)\quad \int_M\big(\varphi(\nabla\cdot(\mu\nabla))\psi - \psi(\nabla\cdot(\mu\nabla))\varphi\big)\,dx_1\dots dx_n = \int_S\mu(\varphi\nabla\psi - \psi\nabla\varphi)\cdot\nu\,dS, \]
where $\nabla\cdot(\mu\nabla)$ means the formal scalar product of the two vector valued differential operators.

Proof of the Green's identities. The first claim follows by applying (2) to $X = \varphi\nabla\psi$, where $\varphi$ and $\psi$ are differentiable functions and $\nabla\psi$ is the gradient of $\psi$. Indeed, $i_X\omega_{\mathbb R^n} = \varphi(\nabla\psi\cdot\nu)\,dS$ and $\operatorname{div}X = \varphi\Delta\psi + \nabla\varphi\cdot\nabla\psi$, where the dot in the second term denotes the scalar product of the two gradients. Let us also notice that the scalar product $\nabla\psi\cdot\nu$ is just the derivative of $\psi$ in the direction of the oriented unit normal $\nu$. The second identity is computed the same way; the two terms with the scalar products of two gradients cancel each other. The reader should check the details. $\square$

Remark. A special case of the 2nd Green's identity is worth mentioning. Namely, if $\mu = 1$ and both $\psi$ and $\varphi$ vanish on the boundary $\partial\Omega$, we obtain $\int_\Omega(\varphi\Delta\psi - \psi\Delta\varphi)\,dx_1\dots dx_n = 0$. This means that the Laplace operator is self-adjoint with respect to the $L^2$ scalar product on such functions.
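The 1st Green's identity is easy to test on a concrete domain. A sketch (our addition): on the unit square, with the hypothetical choices $\varphi = x^2y$ and $\psi = xy^2$, both sides evaluate to $3/4$.

```python
import sympy as sp

x, y = sp.symbols('x y')
phi, psi = x**2*y, x*y**2

lhs = sp.integrate(phi*(sp.diff(psi, x, 2) + sp.diff(psi, y, 2))
                   + sp.diff(phi, x)*sp.diff(psi, x) + sp.diff(phi, y)*sp.diff(psi, y),
                   (x, 0, 1), (y, 0, 1))

# boundary integral over the four edges, outward normals (1,0),(-1,0),(0,1),(0,-1)
rhs = (sp.integrate((phi*sp.diff(psi, x)).subs(x, 1), (y, 0, 1))
       - sp.integrate((phi*sp.diff(psi, x)).subs(x, 0), (y, 0, 1))
       + sp.integrate((phi*sp.diff(psi, y)).subs(y, 1), (x, 0, 1))
       - sp.integrate((phi*sp.diff(psi, y)).subs(y, 0), (x, 0, 1)))

print(lhs, rhs)  # 3/4 3/4
```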
Proof of the uniqueness. Assume $u_1$ and $u_2$ are solutions of the Poisson equation on $\Omega$; then $u = u_1 - u_2$ is a solution of the homogeneous Laplace equation, $\Delta u = \Delta u_1 - \Delta u_2 = F - F = 0$. At the same time, either $u = u_1 - u_2 = 0$ on $\partial\Omega$, or $\frac{\partial u}{\partial\nu} = 0$ on $\partial\Omega$. Now we exploit the first Green's identity (3) with $\varphi = \psi = u$:
\[ \int_\Omega(u\Delta u + \nabla u\cdot\nabla u)\,dx_1\dots dx_n = \int_{\partial\Omega}u\frac{\partial u}{\partial\nu}\,dS. \]
In both problems, Dirichlet or Neumann, the right-hand side vanishes. The first term in the left-hand integrand vanishes, too. We conclude $\int_\Omega\|\nabla u\|^2\,dx_1\dots dx_n = 0$, but this is possible only if $\nabla u = 0$, since the integrand is continuous. Thus $u = u_1 - u_2$ is constant. If we solve a Dirichlet problem, then $u_1$ and $u_2$ coincide on the boundary, and thus they are equal. $\square$

9.2.11. Well posed problems. Consider the Cauchy boundary problem for $u_{xx} + u_{yy} = 0$, with $\partial\Omega$ given by $y = 0$ and $\varphi(x) = u(x,0) = A_\alpha\sin\alpha x$, $\psi(x) = u_y(x,0) = B_\alpha\sin\alpha x$, with the scalar coefficients $A_\alpha$ and $B_\alpha$ depending on the chosen frequency $\alpha$. Simple inspection reveals that we can find such a solution within the results of the separation method:
\[ u(x,y) = \big(A_\alpha\cosh\alpha y + \tfrac1\alpha B_\alpha\sinh\alpha y\big)\sin\alpha x. \]
Now choose $B_\alpha = 0$ and $A_\alpha = \frac1\alpha$, i.e. $u(x,y) = \frac1\alpha\cosh\alpha y\,\sin\alpha x$. Obviously, when moving $\alpha$ towards infinity, the Cauchy boundary data become arbitrarily small, and still this arbitrarily small perturbation of the zero data causes an arbitrarily big increase of the values of $u$ in any close vicinity of the line $y = 0$. Imagine the equation describes some physical process and the boundary conditions reflect some measurements, including small periodic errors. The results will be horribly unstable with respect to these errors in the derivatives. We should admit that the problem is in some sense ill-posed, even locally. This motivates the following definition.

Well-posed and ill-posed boundary problems
The problem $Lu = F$ on the domain $\Omega$ with boundary conditions on $\partial\Omega$ is called well-posed if all three of the following conditions hold: (1) the boundary problem has a solution $u$ (a classical solution means $u$ is twice continuously differentiable); (2) the solution $u$ is unique; (3) the solution is stable with respect to the initial data, i.e. a "small" change of the boundary conditions results in a "small" change of the solution. The problem is called ill-posed if any of the above conditions fails.

Usually, the stability in the third condition means that the solution depends continuously on the boundary conditions in a suitable topology on the chosen space of functions. Also the uniqueness required in the second condition has to be taken reasonably: for instance, only uniqueness up to an additive constant makes sense for Neumann problems.
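The instability in 9.2.11 is visible numerically. A small sketch (our addition): the Cauchy data of $u = \frac1\alpha\cosh(\alpha y)\sin(\alpha x)$ have amplitude $1/\alpha$, yet already at the height $y = 0.1$ the solution blows up as $\alpha$ grows.

```python
import numpy as np

y = 0.1
for alpha in [1, 10, 100, 1000]:
    # data amplitude ~ 1/alpha, solution amplitude at height y:
    print(alpha, 1/alpha, np.cosh(alpha*y)/alpha)
# the last column explodes although the data tend uniformly to zero
```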
9.2.12. Quasilinear equations. Now we exploit our experience and focus on (local) Cauchy type problems for equations of arbitrary order. Similarly to the ODEs, we shall deal with problems where the highest order derivatives are prescribed (more or less) explicitly and the initial conditions are given on a hypersurface up to order $k-1$. Some notation will be useful. We use multi-indices to express multivariate polynomials and derivatives, cf. 8.1.15. Further, we write $\nabla^ku = \{\partial_\alpha u;\ |\alpha| = k\}$ for the vector of all derivatives of order $k$. In particular, $\nabla u$ again means the gradient vector of $u$.

Quasi-linear PDEs
For an unknown scalar function $u$ on a domain $\Omega\subset\mathbb R^n$ we prescribe its derivatives:
\[ (1)\quad \sum_{|\alpha|=k}a_\alpha(x,u,\dots,\nabla^{k-1}u)\,\partial_\alpha u = b(x,u,\dots,\nabla^{k-1}u), \]
where $b$ and the $a_\alpha$ are functions on the tubular domain $\Omega\times\mathbb R^N$ accommodating all the derivatives, with at least one $a_\alpha$ non-zero. We call such equations the (scalar) quasi-linear partial differential equations (PDEs) of order $k$. We call (1) semi-linear if none of the $a_\alpha$ depends on $u$ and its derivatives (thus all the non-linearity hides in $b$). The principal symbol of a semi-linear PDE of order $k$ is the symmetric $k$-linear form $P$ on $\Omega$, $P(x) : (\mathbb R^n)^k\to\mathbb R$, $P(x,\xi,\dots,\xi) = \sum_{|\alpha|=k}a_\alpha(x)\xi^\alpha$.

For instance, the Poisson equation $\Delta u = f(x,y,u,\nabla u)$ on $\mathbb R^2$ is a semi-linear equation and its principal symbol is the positive definite quadratic form $P(\zeta,\eta) = \zeta^2 + \eta^2$, independent of $(x,y)$. The diffusion equation $\frac{\partial u}{\partial t} = \Delta u$ on $\mathbb R^3$ has the symbol $P(\tau,\zeta,\eta) = \zeta^2 + \eta^2$, i.e. a positive semi-definite quadratic form, while the wave equation $\square u = \frac{\partial^2u}{\partial t^2} - \frac{\partial^2u}{\partial x^2} = 0$ has the indefinite symbol $P(\tau,\zeta) = \tau^2 - \zeta^2$ on $\mathbb R^2$.

We shall focus on the scalar equations and reduce the problem to a special situation which allows a further reduction to a system of first order equations (quite similarly to the ODE theory). Thus we extend the previous definition to systems of equations. Notice these are systems of the first kind mentioned in 9.2.4.

Systems of quasi-linear PDEs
A system of quasi-linear PDEs determines a vector valued function $u : \Omega\subset\mathbb R^n\to\mathbb R^m$, subject to the vector equation
\[ (2)\quad A(x,u,\dots,\nabla^{k-1}u)\cdot\nabla^ku = b(x,u,\dots,\nabla^{k-1}u). \]
Here $A$ is a matrix of type $m\times M$ with functions $a_{i,\alpha} : \Omega\times\mathbb R^N$ as entries, $M = \binom{n+k-1}{k}$ is the number of $k$-combinations with repetition from $n$ objects, $\nabla^ku$ is the vector of vectors of all the $k$-th order derivatives of the components of $u$, $b : \Omega\times\mathbb R^N\to\mathbb R^m$, and $\cdot$ means the scalar products of the individual rows of $A$ with the vectors $\nabla^ku_i$ of the individual components of $u$, matching the individual components of $b$.
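A tiny symbolic sketch (our addition) of the characteristic behavior of the three model symbols just mentioned, found by solving $P(\nu) = 0$ for a normal direction $\nu = (\tau,\zeta)$:

```python
import sympy as sp

tau, zeta = sp.symbols('tau zeta')

P_laplace = tau**2 + zeta**2
P_wave    = tau**2 - zeta**2
P_heat    = zeta**2            # only the second-order part enters the symbol

print(sp.solve(P_laplace, tau))  # [-I*zeta, I*zeta]: no real characteristic normals
print(sp.solve(P_wave, tau))     # [-zeta, zeta]: the "light cone" directions
print(sp.solve(P_heat, zeta))    # [0]: normals (tau, 0), i.e. surfaces t = const
```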
9.2.13. Cauchy data. Next, we have to clarify the boundary condition data. Let us consider a domain $U\subset\mathbb R^n$ and a smooth hypersurface $\Gamma\subset U$, e.g. $\Gamma$ given locally by an implicit equation $f(x_1,\dots,x_n) = 0$. Consider the unit normal vector $\nu(x)$ at each point $x\in\Gamma$ (i.e. $\nu = \frac{1}{\|\nabla f\|}\nabla f$ if given implicitly). We would like to find minimal data along $\Gamma$ determining a solution of 9.2.12(1), at least locally around a given point. To make things easy, let us first assume that $\Gamma$ is prescribed by $x_n = 0$. Then $\nu(x) = (0,\dots,0,1)$ at all $x\in\Gamma$, and knowing the restriction of $u$ to $\Gamma$, we also know all derivatives $\partial_\alpha$ with $\alpha = (\alpha_1,\dots,\alpha_{n-1},0)$, $0\le|\alpha|$. Thus, we have to choose reasonably differentiable functions $c_j$ on $\Gamma$, $j = 0,\dots,k-1$, and posit for all $j$: $\partial_\alpha u(x) = c_j(x)$, $\alpha = (0,\dots,0,j)$, $x\in\Gamma$. All the other derivatives $\partial_\alpha u$ on $\Gamma$, $0\le|\alpha|<\infty$ with $\alpha_n < k$, are computed inductively using the symmetry of partial derivatives. Moreover, if $a_{(0,\dots,0,k)}\ne 0$, we can establish the remaining $k$-th order derivative by means of the equation 9.2.12(1) and hope to be able to continue inductively. Indeed, writing $a = a_{(0,\dots,0,k)}(x,u,\dots,\nabla^{k-1}u)\ne 0$ (and similarly leaving out the arguments of the other functions $a_\alpha$), the equation 9.2.12(1) can be rewritten as
\[ (1)\quad \frac{\partial^k u}{\partial x_n^k} = \frac1a\Big(-\sum_{|\alpha|=k,\,\alpha_n\ne k}a_\alpha\partial_\alpha u + b(x,u,\dots,\nabla^{k-1}u)\Big). \]
Now, on $\Gamma$ we can use the already known derivatives to compute directly all the $\partial_\alpha u$ with $\alpha_n < k+1$. Differentiating the latter equation by $\frac{\partial}{\partial x_n}$, we obtain the missing derivative of order $k+1$ from the known quantities on the right-hand side. By induction, we obtain all the derivatives, as requested. In the general situation we can iterate the derivative $D^1_{\nu(x)}u$ of $u$ in the direction of the unit normal vector $\nu$ to the hypersurface $\Gamma$:

Cauchy data for scalar PDEs
The (smooth or analytic) Cauchy data for the $k$-th order quasi-linear PDE 9.2.12(1) consist of a hypersurface $\Gamma\subset U$ and $k$ (smooth or analytic) functions $c_j$, $0\le j\le k-1$, prescribing the derivatives in the normal directions to $\Gamma$:
\[ (2)\quad (D^1_{\nu(x)})^j u(x) = c_j(x),\quad x\in\Gamma. \]
A normal direction $\nu(x)$, $x\in\Gamma$, is called characteristic for the given Cauchy data if
\[ (3)\quad \sum_{|\alpha|=k}a_\alpha(x,u,\dots,\nabla^{k-1}u)\,\nu(x)^\alpha = 0. \]
The Cauchy data are called non-characteristic if there are no characteristic normals to $\Gamma$.

Notice the situation simplifies for semi-linear equations. There the characteristic directions do not depend on the chosen functions $c_j$ from the Cauchy data, and they are directly related to the properties of the principal symbol of the equation. In the case of the hyperplane $\Gamma = \{x_n = 0\}$ treated above, the Cauchy data are non-characteristic if and only if $a_{(0,\dots,0,k)}\ne 0$. For instance, semi-linear equations of first order always admit characteristic directions, since their principal symbols are linear forms and so must have non-trivial kernels (hyperplanes of characteristic directions). In the three second order examples of the Laplace equation, diffusion equation, and wave equation very different phenomena occur. Since the symbol of the Laplace equation is a positive definite quadratic form, characteristic directions can never appear, independently of our choice of $\Gamma$. On the contrary, there are always non-trivial characteristic directions in the other two cases.

Characteristic cones of semi-linear PDEs
The characteristic directions of a semi-linear PDE on a domain $\Omega\subset\mathbb R^n$ generate the characteristic cone $C(x)\subset T_x\Omega$ in the tangent bundle, $C(x) = \{\xi\in T_x\Omega;\ P(x)(\xi,\dots,\xi) = 0\}$. The Cauchy data on a hypersurface $\Gamma$ are non-characteristic if and only if $(T\Gamma)^\perp\cap C = \{0\}$, i.e. the orthogonal complements of the tangent spaces of $\Gamma$ with respect to the standard scalar product on $\mathbb R^n$ never meet the characteristic cone.

Notice that cones of linear forms are hyperplanes in the tangent space, quadratic cones appear with second order, etc. The tangent vectors to the characteristics of first order quasi-linear equations (as introduced in 9.2.2) are orthogonal to the characteristic normals. We have learned that the first order equations propagate the solutions along the characteristic lines, and so we are not free to prescribe the Cauchy data for the solution in such a case.
9.2.14. Cauchy-Kovalevskaya theorem. As seen so many times already, the analytic mappings are very rigid and most questions related to them boil down to some estimates and smart combinatorial ideas. It is time to recall what happens for analytic equations and Cauchy data in the very special case of ODEs. For a single scalar autonomous ODE of first order, the Cauchy data consist of a single point "hypersurface" Γ = {x} in Ω ⊂ R and the value u(x). In particular, the Cauchy data are always non-characteristic in dimension one. Already in 6.2.22 we gave a complete proof that the induced derivatives of u provide a converging power series and thus the only solution, on a certain neighborhood of x. In 8.3.13 we extended the same proof to autonomous systems of ODEs, which verified the same phenomenon for general systems of ODEs of any order k. Here the Cauchy data again consist of the only point in Γ and all derivatives of u of orders less than k (and again, they are always non-characteristic).

In the subsequent paragraphs we shall comment on how to extend the ODE proof to the following very famous theorem. In particular, the statement says that we have to expect general solutions of kth order scalar equations in n variables to depend on k independent functions of n−1 variables. This is in accordance with our experience from simple examples.

Cauchy-Kovalevskaya theorem

Theorem. The analytic Cauchy problem consisting of the quasi-linear equation 9.2.12(1) with analytic coefficients and right hand side, and analytic non-characteristic Cauchy data 9.2.13(2), has got a unique analytic solution on a neighborhood of each point in Γ.

Notice that we have computed explicitly the formal power series for the solution (by an inductive procedure) for the special case when Γ is defined by xₙ = 0. In this case, the theorem claims that this formal series always converges with non-trivial radius of convergence. The full proof is very technical and we do not have space to bother the readers with all details. In the next paragraphs, we shall provide indications of the steps in the proof. If the thread (or interest) is lost, the reader should rather jump to 9.2.18.

9.2.15. Flattening the Cauchy data. The first step in the proof is to transform the non-characteristic data to the "flat" hypersurface Γ discussed in the beginning of 9.2.13. Recall that for such Γ the non-characteristic condition 9.2.13(3) reads a_{(0,…,0,k)} ≠ 0.

Let us start with the general equation and its analytic Cauchy data on an analytic Γ ⊂ Rⁿ (we omit the arguments of all the functions, and ℓ = 0, …, k−1):

(1) ∑_{|α|=k} a_α ∂^α u = b, ∂^ℓu/∂ν^ℓ(x) = c_ℓ(x), x ∈ Γ.

We shall work locally around some unspecified fixed point in Γ. Since Γ is an analytic hypersurface in Rⁿ, there are new local coordinates y = Ψ(x) such that Γ = {x; Ψₙ(x) = 0}. Moreover, Ψ can be chosen again analytic. Thus, the unit normal vector ν to Γ equals ∇Ψₙ, up to a multiple µ^{−1}, at each point of Γ. Let Φ = Ψ^{−1}, i.e. x = Φ(y), and v(y) = u(Φ(y)). Then Φ is analytic and the equation transforms into another equation in the coordinates y,

(2) ∑_{|α|=k} ã_α ∂^α v = b̃,

with analytic coefficients (they can all be expressed by means of the chain rule and the mutually inverse transformations Φ, Ψ). By the very definition, ∂Φ/∂yₙ is a vector (the last column in the matrix D¹Φ) perpendicular to Γ and thus it must be µν (recall that the product of the Jacobi matrices D¹(Ψ)D¹(Φ) is the identity matrix and the rows ∇Ψ_j, j = 1, …, n−1, generate the tangent space TΓ).

Claim 1. The transformed Cauchy data for the equation (2) are analytic.

The hypersurface Γ̃ given by yₙ = 0, as well as the coefficients of the equation, are analytic. Compute ∂^ℓv/∂yₙ^ℓ on Γ̃:

v = c₀ ∘ Φ
∂v/∂yₙ = ∇u · ∂Φ/∂yₙ = µ∇u · ν = µ ∂u/∂ν = µc₁
∂²v/∂yₙ² = (∂µ/∂yₙ) ∂u/∂ν + µ² ∂²u/∂ν² = (∂µ/∂yₙ) c₁ + µ² c₂
∂³v/∂yₙ³ = (∂²µ/∂yₙ²) ∂u/∂ν + 3µ (∂µ/∂yₙ) ∂²u/∂ν² + µ³ ∂³u/∂ν³ = (∂²µ/∂yₙ²) c₁ + 3µ (∂µ/∂yₙ) c₂ + µ³ c₃.

Inductively, we see that the transformed functions c̃_j are obtained in an analytic way from the functions c_i, i = 0, …, j.

Claim 2. The Cauchy data for the equation (1) are non-characteristic if and only if the transformed Cauchy data for (2) are non-characteristic, i.e. ã_{(0,…,0,k)} ≠ 0.
Compute using the chain rule (recall ∇Ψₙ is a vector, the gradient of the last coordinate function of Ψ, and it is equal to ν up to the non-zero multiple µ^{−1}):

∂^α u = (∂^k v/∂yₙ^k)(∇Ψₙ)^α + terms of lower order in ∂/∂yₙ = µ^{−k} (∂^k v/∂yₙ^k) ν^α + ⋯ .

Substitute into (1):

∑_{|α|=k} a_α ∂^α u = µ^{−k} ( ∑_{|α|=k} a_α ν^α ) ∂^k v/∂yₙ^k + ⋯ .

We have computed ã_{(0,…,0,k)} = µ^{−k} ∑_{|α|=k} a_α ν^α, which is non-zero if and only if the original Cauchy data were non-characteristic.

Since we have already verified that all the partial derivatives of v along Γ̃ can be computed for non-characteristic Cauchy data with the flat hypersurface Γ̃, we have actually proved the following claim.

Proposition. The Cauchy data for (1) allow us to compute all partial derivatives of its solution u along the hypersurface Γ if and only if the data are non-characteristic.

9.2.16. Reduction to a first-order system. Without loss of generality, we may consider only the Cauchy data of the form discussed in 9.2.13, i.e. the quasi-linear equation on a domain in Rⁿ is

(1) ∂^k u/∂xₙ^k = ∑_{|α|=k, αₙ≠k} a_α ∂^α u + b

and Γ is given by xₙ = 0 with prescribed normal derivatives c₀, …, c_{k−1}. Moreover, we can subtract suitable fixed functions from u in order to transform the equation into a new one of the same shape and with all the Cauchy data c_j vanishing on Γ. Indeed, start with v = u − c₀. This transforms the equation and the Cauchy data so that the new c̃₀ = 0. If we have killed the functions c̃₀, …, c̃_{ℓ−1}, then we may subtract

g(x₁, …, xₙ) = (1/ℓ!) xₙ^ℓ c̃_ℓ(x₁, …, x_{n−1}),

which kills the next one.

The final reduction step is to introduce new functions for all components in the vector (u, ∇u, …, ∇^{k−1}u). Write v₁, …, v_N for all these functions, and add one more function v₀(x) = xₙ. Then we can rewrite our equation (1) as a system of quasi-linear equations of first order on the vector function v = (v₀, …, v_N),

(2) ∂vₛ/∂xₙ = ∑_{0≤r≤N} ∑_{i=1}^{n−1} a_{sri} ∂vᵣ/∂xᵢ + bₛ, s = 0, …, N,

where all the coefficients a_{sri}, bₛ are functions of x₁, …, x_{n−1}, v₀, …, v_N, and the boundary condition on Γ = {xₙ = 0} is v|_Γ = 0. Notice two important facts: the coefficients do not depend on xₙ, and all derivatives on the left hand side are ∂/∂xₙ. This is a technicality which makes the problem similar to the autonomous systems of ODEs.

The principle is obvious from a simple example. Consider a 2nd order equation with coefficients a_{xx}, a_{xt}, b,

u_{tt} = a_{xt} u_{xt} + a_{xx} u_{xx} + b,

and view t, u, u_t, u_x as unknown functions. Then our equation rewrites as the system of four equations

∂t/∂t = 1, ∂u/∂t = u_t, ∂u_x/∂t = ∂u_t/∂x, ∂u_t/∂t = a_{xt} ∂u_t/∂x + a_{xx} ∂u_x/∂x + b

with boundary condition t = u = u_t = u_x = 0 on the line t = 0.

9.2.17. The majorant. The rest of the proof is very technical, but it is based on the straightforward idea of a majorant for the Cauchy problem 9.2.16(2). Recall we used this method in 8.3.13 when proving the ODE version of the Cauchy-Kovalevskaya theorem. We shall not go into much detail and only indicate how to generalize the method from 8.3.13.

The first step is easy – we have already seen that all derivatives of the solution vector v at a fixed point x in Γ are computed by the chain rule from the Cauchy data. This proves the uniqueness of the analytic solution, if such a solution exists.
Again, it turns out the derivatives are given via universal polynomials in the derivatives of the coefficients a_{sri}, bₛ of the system 9.2.16(2), and these polynomials have got non-negative real coefficients. The reader may easily fill in the details as an exercise.

Now, a very similar majorant of the coefficients as in the ODE case can be chosen. First, the analyticity of the coefficients ensures the existence of some suitably small r > 0 and a (perhaps big) constant C such that

(1/α!) |∂^α a_{sri}| r^{|α|} ≤ C, (1/α!) |∂^α bₛ| r^{|α|} ≤ C,

and thus

|∂^α a_{sri}| ≤ C |α|! r^{−|α|}, |∂^α bₛ| ≤ C |α|! r^{−|α|},

for all coefficients and multi-indices α. In particular, all the coefficients can be majorized by the function

h(x₁, …, x_{n−1}, v₀, …, v_N) = Cr / ( r − ∑_{j=1}^{n−1} xⱼ − ∑_{s=0}^{N} vₛ ).

Now, the majorizing system for the vector (V₀, …, V_N) is

∂Vₛ/∂xₙ = ∑_{0≤r≤N} ∑_{i=1}^{n−1} h ∂Vᵣ/∂xᵢ + h, s = 0, …, N.

Since the coefficients are completely symmetric in the variables, let us expect the solution in the form V₀ = ⋯ = V_N = W(x₁ + ⋯ + x_{n−1}, xₙ), i.e. W is a real function of two variables, say W(t, y). Substituting into the system, we arrive at the linear first order PDE

∂W/∂y = ( Cr / (r − t − NW(t, y)) ) ( N(n−1) ∂W/∂t + 1 ),

with boundary condition W(t, 0) = 0. The reader may find the solution (e.g. by the method of characteristics)

W(t, y) = (1/(Nn)) ( r − t − √( (r−t)² − 2nNCry ) ),

a real analytic function on a neighborhood of the origin. This concludes the proof.⁶

9.2.18. Back to second order equations. Let us continue our brief excursion into the PDE world with more detailed comments on the second order semi-linear equations. We shall deal with scalar PDEs on a domain Ω ⊂ Rⁿ with one of the usual boundary conditions. Thus consider a general linear operator

(1) L = ∑_{1≤i,j≤n} a_{ij} ∂²/∂xᵢ∂xⱼ + ∑_{1≤i≤n} aᵢ ∂/∂xᵢ + a

written in coordinates (x₁, …, xₙ). If we consider any other coordinate system y = Φ(x), then the chain rule determines how to transform the equation Lu = f in coordinates x into L̃ṽ = f̃, where u(x) = v(Φ(x)). Write ∇ = (∂/∂x₁, …, ∂/∂xₙ) for the gradient operator, the dot for the standard scalar product of vectors, D¹Φ for the Jacobi matrix of Φ, and ∂Φ/∂xᵢ for the ith column of D¹Φ. Then

∂/∂xᵢ = ∇ · ∂Φ/∂xᵢ = ∑_k (∂Φ_k/∂xᵢ) ∂/∂y_k,
∂²/∂xᵢ∂xⱼ = ∑_{k,ℓ} (∂Φ_k/∂xᵢ)(∂Φ_ℓ/∂xⱼ) ∂²/∂y_k∂y_ℓ + ∑_k (∂²Φ_k/∂xᵢ∂xⱼ) ∂/∂y_k.

Thus, the operator (1) transforms into

L̃ = ∑_{1≤i,j,k,ℓ≤n} a_{ij} (∂Φ_k/∂xᵢ)(∂Φ_ℓ/∂xⱼ) ∂²/∂y_k∂y_ℓ + ∑_{1≤i,k≤n} aᵢ (∂Φ_k/∂xᵢ) ∂/∂y_k + a.

In particular, the principal symbol transforms pointwise as a quadratic form under the linearized transformation D¹Φ. As we know from linear algebra, the global behavior of real quadratic forms is classified by their signature, i.e. the numbers of positive and negative entries in the diagonalized matrix, cf. 4.3.2. This is transferred to the following classification.

Classification of 2nd order quasi-linear PDEs

Consider a second order quasi-linear operator (1) with the principal symbol Q. The equation Lu = f and the operator L are called
• elliptic if Q is either positive or negative definite,
• hyperbolic if Q has got the signature (n−1, 1) (or equivalently (1, n−1)),
• parabolic if Q is positive semi-definite with rank n−1 and the equation can be rewritten as ∂u/∂t = L̃(u), where the principal symbol of L̃ depends on the remaining variables only.
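As a quick illustration of the box above, the following sketch (our own, in Python) classifies a constant-coefficient operator by the eigenvalue signs of the symmetric matrix (a_{ij}); note that for the parabolic case it only checks the rank condition, not the additional structural condition on ∂/∂t from the box.

```python
# A sketch (ours): classify sum a_ij d^2/(dx_i dx_j) by the signature of A.
import numpy as np

def classify(A, tol=1e-12):
    eig = np.linalg.eigvalsh(np.asarray(A, dtype=float))  # A assumed symmetric
    pos = int(np.sum(eig > tol))
    neg = int(np.sum(eig < -tol))
    zero = len(eig) - pos - neg
    if zero == 0 and (pos == len(eig) or neg == len(eig)):
        return 'elliptic'
    if zero == 0 and min(pos, neg) == 1:
        return 'hyperbolic'
    if zero == 1 and min(pos, neg) == 0:
        return 'parabolic (rank n-1; the d/dt structure must be checked separately)'
    return f'other: signature ({pos} positive, {neg} negative, {zero} zero)'

print(classify(np.eye(3)))                  # Laplace operator: elliptic
print(classify(np.diag([1., -1., -1.])))    # wave operator:    hyperbolic
print(classify(np.diag([0., -1., -1.])))    # heat operator:    parabolic
```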
Notice that we actually did not include all possibilities in the above list. We omitted the ultra-hyperbolic case, where the rank of Q is maximal, with the remaining possibilities of signatures. Further, the parabolic equations could appear with the minus sign at ∂/∂t (the so called "backward parabolic equations"), and the rank of Q could also drop by more than one. Most of these cases cannot be seen in low dimensions.

⁶ The reader may find a detailed exposition in many basic books; see, for instance, Chapter 1 of the book "Introduction to Partial Differential Equations" by Gerald B. Folland, Princeton, 1995.

If the coefficients a_{ij}, aᵢ and a are constant, then the principal symbol Q is a constant quadratic form, too, and we may choose just linear transformations T = D¹Φ instead of a general Φ. This will always allow us to transform the quadratic form to its canonical form over the entire domain Ω. Many particular examples are discussed explicitly in the other column, see ??, in particular in low dimensions. Let us also stress that, while in dimension n = 2 we may even locally integrate the necessary linearized transformations into a genuine mapping Φ and get the canonical forms of the equations in a more general context, see ??, this is mostly not possible in higher dimensions.

Before coming to a few of the most important examples, let us look at the characteristic directions in the individual cases. If L is elliptic, then of course there cannot be any characteristic direction. So the (local) Cauchy problem prescribing the analytic value and first normal derivative along any analytic hypersurface Γ will have a locally convergent analytic solution around any fixed point. Unfortunately, we have already encountered the fact that this is not a well posed problem even for the two dimensional Laplace equation, cf. 9.2.11. On the contrary, there will be an (n−1)-dimensional cone of characteristic directions at each point for a hyperbolic equation, while the parabolic equations come equipped with a line of characteristic directions.

9.2.19. The wave equation. The wave operator in dimension n is

(1) L = ∂²/∂t² − c²∆,

where ∆ is the Laplace operator ∆ = ∂²/∂x₁² + ⋯ + ∂²/∂xₙ², and c² > 0 is a real constant. The operator L lives on domains in R^{n+1}.

Let us first return to the 2D wave equation u_{tt} = c²u_{xx}. We know the general solution u(x, t) = f(x − ct) + g(x + ct), cf. 9.2.6, which is the superposition of the forward and backward waves u₁(x, 0) = f(x) and u₂(x, 0) = g(x). This perfectly matches our general expectation from the Cauchy-Kovalevskaya theorem that general solutions to quasi-linear second order equations in two variables should depend on two real single variable functions. Moreover, the characteristic lines are x ± ct = const, and thus the line Γ = {t = 0} is non-characteristic.

Setting the Cauchy boundary data u(x, 0) = φ(x), ∂u/∂t(x, 0) = ψ(x) and substituting the general solution, we arrive at

φ(x) = f(x) + g(x), ψ(x) = −cf′(x) + cg′(x),

where the dash stands for the derivative of a single variable function, as usual. Thus, for any s₀ and s,

(1/c) ∫_{s₀}^{s} ψ(x) dx = −f(s) + g(s) + f(s₀) − g(s₀).

Add and subtract the equations:

2g(s) = φ(s) + (1/c) ∫_{s₀}^{s} ψ(x) dx − f(s₀) + g(s₀),
2f(s) = φ(s) − (1/c) ∫_{s₀}^{s} ψ(x) dx + f(s₀) − g(s₀).

Finally, substitute s = x − ct into f(s), and s = x + ct into g(s), in order to get the value of u(x, t) (notice the integrals add nicely, while the constants depending on the choice of s₀ cancel).
(2) u(x, t) = ½( φ(x−ct) + φ(x+ct) ) + (1/2c) ∫_{x−ct}^{x+ct} ψ(y) dy.

This solution is often called d'Alembert's solution. The formula also reveals the continuous dependence of the solution on the boundary conditions. We may conclude that the Cauchy problem seems to be the right boundary value problem for the wave equation (although we have fully proved that it is well posed only in dimension two and in the analytic category).

In higher dimensions, the situation is much more complicated. One useful option is to employ the method of separation of variables, but splitting only the time and space variables. Consider the n-dimensional wave equation and expect the solution in the form u(x, t) = F(x)T(t) (now x ∈ Rⁿ, t ∈ R). Plugging this into (1) and playing the separation method game, we arrive at the two equations

∆F + αF = 0, T″ + αc²T = 0,

where α is the separation constant (usually we consider either α² or −α² to fix the types of the equations). The first equation is called the Helmholtz equation and we shall come back to it below.

9.2.20. The diffusion equation. In general dimension n the diffusion operator is

(1) L = ∂/∂t − κ∆, κ > 0,

and the diffusion equation is considered on domains in Rⁿ × R. Again, let us have a look at the simplest 1D diffusion equation u_t = κu_{xx}. It describes the diffusion process in a one-dimensional object with diffusivity κ (assumed to be constant here) in time.

First of all, let us notice that the usual boundary value prescription of the state at time t = 0 does not match the assumptions of the Cauchy-Kovalevskaya theorem. Indeed, taking Γ = {t = 0}, the normal direction vector ∂/∂t is characteristic. The intuition related to diffusion problems suggests that Dirichlet boundary data should suffice (we just need the initial state and the diffusion then does the rest), or we can combine them with some Neumann data (if we supply heat at some parts of the boundary). Moreover, the process should not be reversible in time, so we should not expect that the solution would extend across the line t = 0.

Let us look at a classical example considered already by Kovalevskaya. Posit

u(0, x) = g(x) = 1/(1 + x²)

on a neighborhood of the origin (perfectly analytic boundary data and equation), and expect u to be a solution of u_t = u_{xx} in the form

u(t, x) = ∑_{k,ℓ≥0} c_{k,ℓ} (t^k/k!)(x^ℓ/ℓ!).

The equation obviously implies the relations c_{k+1,ℓ} = c_{k,ℓ+2} for all k, ℓ. Further, the power series (1 + x²)^{−1} = ∑_ℓ (−1)^ℓ x^{2ℓ} is obtained from the geometric power series with argument −x, with x² substituted for x in the end. Thus, for all ℓ,

c_{0,2ℓ+1} = 0, c_{0,2ℓ} = (−1)^ℓ (2ℓ)!.

By the recurrence, c_{k,2ℓ} = (−1)^{k+ℓ}(2(k+ℓ))!. This is too quick a growth for a converging power series. For example, the coefficients

c_{k,2k}/(k!(2k)!) = ±(4k)!/(k!(2k)!)

grow towards infinity as fast as the expression e^{−k} k^k 8^{2k}, by the Stirling formula for the factorial (cf. 6.2.17). We have learned that there cannot be any analytic solution to our Dirichlet boundary problem at all. This example also shows the relevance of all the assumptions in the Cauchy-Kovalevskaya theorem.
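The divergence can be watched numerically; a tiny sketch (our own):

```python
# The coefficients of t^k x^{2k} in Kovalevskaya's example have absolute
# value (4k)!/(k!(2k)!); their k-th roots blow up, so the formal power
# series has radius of convergence zero.
from math import factorial

for k in range(1, 8):
    term = factorial(4 * k) // (factorial(k) * factorial(2 * k))
    print(k, term, round(term ** (1.0 / k)))   # growing k-th roots => R = 0
```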
9.2.21. Diffusion via Fourier transform. Fortunately, another straightforward method helps us to solve the simplest diffusion equation with Dirichlet data. Let us assume u(x, t) is a solution of

u_t = κu_{xx}, u(x, 0) = φ(x).

Recall that the Fourier transform (with respect to x) transfers the differentiation ∂/∂x to algebraic multiplication by iξ, while the other variable t remains a parameter. Thus, the Fourier image ũ(ξ, t) must obey

ũ_t = −κξ²ũ, ũ(ξ, 0) = φ̃.

This is a quite simple ODE problem with the general solution ũ(ξ, t) = C(ξ) e^{−κtξ²}, and the initial condition implies that the integration constant is just C(ξ) = φ̃.

Now remember the relation between the Fourier transform and convolution, 7.2.7 at page 660. The image of the convolution is the product of the images, up to the factor √(2π). Thus we shall immediately write down the solution u with t > 0, once we find the inverse Fourier image of the Gaussian f(ξ) = e^{−κtξ²}. But Fourier images F(f) of Gaussians f are again Gaussians, up to a constant, see ??:

F(e^{−ax²})(ξ) = (1/√(2a)) e^{−ξ²/4a},

with any real constant a > 0. Thus, we can write for t > 0

ũ(ξ, t) = F(φ) F( (1/√(2κt)) e^{−x²/4κt} ).

Finally, we obtain u as the convolution of the initial condition with the so called heat kernel function,

u(x, t) = (1/(2√(πκt))) ∫_{−∞}^{∞} e^{−(x−y)²/4κt} φ(y) dy.

Obviously, the solution depends continuously on the boundary condition. We may imagine it models the dynamics of temperature in an infinite homogeneous bar, with some initial distribution of temperature and no losses or gains of energy in time. Let us also observe the behavior of the solution for t close to zero. As mentioned in 7.2.9, the Gaussians with variance converging to zero are a good approximation of the so called Dirac delta function, and indeed the limit of the convolution for t → 0+ is exactly the function φ, as expected. We shall come back to such convolution based principles a few pages later, after investigating simpler methods.
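A numerical sketch (our own grid and data choices) of the heat kernel solution: the convolution smooths the initial profile for t > 0 and recovers it in the limit t → 0+.

```python
# Convolve phi with the heat kernel Q(x, t) on a grid (Riemann sum).
import numpy as np

kappa = 1.0
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
phi = 1.0 / (1.0 + x**2)                 # initial profile (our choice)

def heat(t):
    Q = np.exp(-(x[:, None] - x[None, :])**2 / (4 * kappa * t)) \
        / (2 * np.sqrt(np.pi * kappa * t))
    return (Q * phi[None, :]).sum(axis=1) * dx

for t in (1.0, 0.1, 0.001):
    print(f't = {t:5.3f}, max|u - phi| = {np.max(np.abs(heat(t) - phi)):.4f}')
```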
9.2.22. Superposition of the solutions. A general idea for solving boundary value problems is to take a good supply of general solutions and try to take linear combinations of even infinitely many of them. This means we consider the solution in the form of a series. The type of the series is governed by the available solutions. Let us illustrate the method on the diffusion equation discussed above.

Imagine we want to model the temperature of a homogeneous bar of length d. Initially, at time t = 0, the temperature at all points x is zero. At one of its ends we keep the temperature zero, while the other end is heated with some constant intensity. Set the bar as the interval x ∈ [0, d] ⊂ R, and the domain Ω = [0, d] × [0, ∞). Our boundary problem is

(1) u_t = κu_{xx}, u(x, 0) = 0, u(0, t) = 0, ∂u/∂x(d, t) = ρ,

where ρ is a constant representing the effect of the heating.

The idea is to exploit the general solutions

u(x, t) = (A cos αx + B sin αx) e^{−α²κt}

from 9.2.6, with free parameters α, A, and B. We want to consider a superposition of such solutions with properly chosen parameters and get the solution to our boundary problem in a form combining Fourier series terms with the exponentials. This approach is often called the Fourier method.

The condition u(0, t) = 0 suggests restricting ourselves to A = 0. Then u_x(x, t) = Bα cos(αx) e^{−α²κt}. It seems difficult now to guess how to combine such solutions to get something constant in time, as the Neumann part of the boundary conditions requests. But we can help ourselves with a small trick. There are some further obvious solutions to the equation – those with u depending on the space coordinate only. We may consider v(x) = ρx and seek our solution in the form u(x, t) + v(x). Then u must again be a solution of the same diffusion equation (1), but the boundary conditions change to

u(x, 0) = −ρx, u(0, t) = 0, ∂u/∂x(d, t) = 0.

Now we want u_x(d, t) = Bα cos(αd) e^{−α²κt} = 0, i.e. we should restrict to the frequencies α = nπ/(2d) with odd positive integers n. This settles the second of the boundary conditions. The remaining one is u(x, 0) = −ρx, which sets the condition on the coefficients B in the superposition

∑_{k≥0} B_{2k+1} sin( (2k+1)πx/(2d) ) = −ρx

on the interval x ∈ [0, d]. This is a simple task of finding the Fourier series of the function x, which we handled in 7.1.10. Combining all this, we get the requested solution u(x, t) to our problem:

ρx − (8ρd/π²) ∑_{k≥0} ( (−1)^k/(2k+1)² ) sin( (2k+1)πx/(2d) ) e^{−κ(2k+1)²π²t/(4d²)}.

Even though our supply of general solutions was not big, superposing countably many of them helped us to solve our problem. Notice the behavior at the heated end. If t → ∞, all the exponential terms in the sum vanish, the higher ones faster than the very first one; the sine terms are bounded, and thus the entire sum component vanishes quite fast. Thus, for big t, the temperature approaches the stationary linear profile ρx, with slope ρ at the heated end.
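The series is easily summed numerically; in the following sketch (our own parameter choices) one can check both boundary conditions and the nearly linear profile for large t.

```python
# Sum the Fourier-method series for the heated bar and test the boundary data.
import numpy as np

kappa, d, rho = 1.0, 1.0, 1.0               # illustrative choices (ours)
x = np.linspace(0.0, d, 201)

def u(x, t, terms=200):
    s = np.zeros_like(x)
    for k in range(terms):
        n = 2 * k + 1
        s += (-1)**k / n**2 * np.sin(n * np.pi * x / (2 * d)) \
             * np.exp(-kappa * n**2 * np.pi**2 * t / (4 * d**2))
    return rho * x - 8 * rho * d / np.pi**2 * s

for t in (0.01, 0.1, 1.0):
    w = u(x, t)
    slope = (w[-1] - w[-2]) / (x[-1] - x[-2])    # ~ rho (Neumann condition)
    print(f't = {t}: u(0) = {w[0]:.4f}, u_x(d) ~ {slope:.3f}')
```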
9.2.23. Separation in transformed coordinates. As we have seen several times, it is very useful to view a given equation as an independent object expressed in some particular coordinates. Practical problems mostly include some symmetries, and then we should like to find suitable coordinates in order to see the equation in some simple form. As an example, let us look at the Laplace operator ∆ in polar coordinates in the plane, and in cylindrical or spherical coordinates in space. Writing as usual x = r cos φ, y = r sin φ for the polar transformation, the Laplace operator gets the neat form

(1) ∆ = ∂²/∂r² + (1/r²) ∂²/∂φ² + (1/r) ∂/∂r = (1/r) ∂/∂r( r ∂/∂r ) + (1/r²) ∂²/∂φ².

The reader should perform the tedious but straightforward computation. Similarly,

(2) ∆ = (1/r) ∂/∂r( r ∂/∂r ) + (1/r²) ∂²/∂φ² + ∂²/∂z²,
(3) ∆ = (1/r²) ∂/∂r( r² ∂/∂r ) + (1/(r² sin²ψ)) ∂²/∂φ² + (1/(r² sin ψ)) ∂/∂ψ( sin ψ ∂/∂ψ )

in the cylindrical and spherical coordinates, respectively.

Let us illustrate the use on the following problem. Imagine a twisted circular drum whose rim suffers a small vertical displacement. We should model the stabilized position of the drumskin. Intuitively, we should describe the drumskin position by the 2D wave equation, but since we are interested in the state with ∂/∂t vanishing, we actually take u as the vertical displacement in the interior of the unit circle, Ω = {x² + y² ≤ 1} ⊂ R², and request ∆u = 0, subject to the Dirichlet boundary condition prescribing the vertical displacement u(x, y) = f(x, y) of the rim. Obviously, we want to consider the problem in the polar coordinates, where the boundary condition gets the neat form u(1, φ) = g(φ). Say g(φ) = ε sin φ + ε² sin 5φ with some small constant ε ≥ 0.

We shall apply the separation of variables method to these data. Expecting the solution in the form u(r, φ) = R(r)Φ(φ), the equation implies (after dividing by RΦ)

R″/R + (1/r) R′/R + (1/r²) Φ″/Φ = 0.

Thus, multiplying by r² and considering the separation constant α², we arrive at the two ODEs

Φ″ + α²Φ = 0, r²R″ + rR′ − α²R = 0.

Fortunately, they are both easy to solve. From the first equation (with α > 0),

Φ(φ) = A cos αφ + B sin αφ,

while the second equation transforms via S(t) = R(e^t) into an equation with constant coefficients, and its solution yields

R(r) = Cr^α + Dr^{−α}, α ≠ 0.

If α = 0, then the solution for R is R(r) = C ln r + D, and the first equation implies Φ(φ) = Aφ + B. In fact, we insist that the solution u = RΦ be a single valued function in the plane, and so we can allow only integer values of α, including α = 0, when Φ becomes a constant (any non-zero multiple of φ would again lead to multi-valued solutions u). Thus the general solution of the Laplace equation coming from the separation of variables method and superposition is

(4) u(r, φ) = C₀ ln r + D₀ + ∑_{n=1}^{∞} (Aₙ cos nφ + Bₙ sin nφ)(Cₙrⁿ + Dₙr^{−n}).

In our problem, we clearly insist on having u finite at the origin, and thus all the Dₙ and C₀ have to vanish. Now we can employ the boundary condition

D₀ + ∑_{n=1}^{∞} Cₙ(Aₙ cos nφ + Bₙ sin nφ) = ε sin φ + ε² sin 5φ.

This is a very simple case of Fourier series, and we see immediately that all the coefficients have to vanish except B₁ and B₅, and the requested solution is

u(r, φ) = εr sin φ + ε²r⁵ sin 5φ.

The higher the frequency of the twist of the rim, the slower the distortion develops towards the center of the drumskin. The method works for every boundary condition u(1, φ) = g(φ), as long as we are able to find its Fourier series.

9.2.24. The Helmholtz equation. In 9.2.19, we looked for solutions of the nD wave equation u_{tt} − c²∆u = 0 in the form u(x, t) = F(x)T(t), where x ∈ Rⁿ, t ∈ R. This (partial) separation of variables with negative separation constant −α² leads to the Helmholtz equation

∆F + α²F = 0,

together with the easily solved T″ + α²c²T = 0. Let us treat the 2D case in polar coordinates, again using separation of variables. The Helmholtz equation gets the form

(1/r) ∂/∂r( r ∂F/∂r ) + (1/r²) ∂²F/∂φ² + α²F = 0,

and we seek F(r, φ) = R(r)Φ(φ). Writing β² for the separation constant now, we arrive at the two equations (the second one is multiplied by r², for convenience)

Φ″ + β²Φ = 0, r²R″ + rR′ + (α²r² − β²)R = 0.

The angular component equation has got the obvious solutions A cos βφ + B sin βφ, and again we have to restrict β to integers in order to get single-valued solutions. With β = m, the radial equation is the well known Bessel's ODE of order m (notice our equation gets the form we had in ?? once we substitute z = αr), with the general solution

R(r) = CJₘ(αr) + DYₘ(αr),

where Jₘ and Yₘ are the Bessel functions of the first and second kinds. We have obtained a general solution which is very useful in practical problems, cf. ??.
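The Bessel functions Jₘ, Yₘ are available in standard numerical libraries; the following sketch (our own parameter choices, using scipy) evaluates the radial factors and checks Bessel's ODE by finite differences.

```python
# Evaluate R(r) = J_m(alpha r) and verify r^2 R'' + r R' + (a^2 r^2 - m^2) R ~ 0.
import numpy as np
from scipy.special import jv, yv

alpha, m = 5.0, 2                 # illustrative frequency and angular mode
r = np.linspace(0.5, 3.0, 6)
h = 1e-5

R   = jv(m, alpha * r)                                          # regular at r = 0
Rp  = (jv(m, alpha * (r + h)) - jv(m, alpha * (r - h))) / (2 * h)
Rpp = (jv(m, alpha * (r + h)) - 2 * R + jv(m, alpha * (r - h))) / h**2

residual = r**2 * Rpp + r * Rp + (alpha**2 * r**2 - m**2) * R
print(np.max(np.abs(residual)))   # ~ 0, up to finite-difference error
print(yv(m, alpha * 0.01))        # the second kind blows up towards r = 0
```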
9.2.25. Non-homogeneous equations. Finally, we add a few comments on the non-homogeneous linear PDEs. Although we provide arguments for the claims, we shall not go into the technical details of proofs, for lack of space. Still, we hope this limited insight will motivate the reader to seek further sources and learn more.

As always, facing a problem Lu = f, we have to find a single particular solution to this problem, and we may then add all solutions of the homogeneous problem Lu = 0. Thus, if we have to match, say, Dirichlet conditions u = g on the boundary ∂Ω of a domain Ω, and we know some solution w, i.e. Lw = f (not taking care of the boundary conditions), then we should find a solution v of the homogeneous Dirichlet problem with the boundary condition g − w|_{∂Ω}. Clearly the sum u = v + w solves our problem. In principle, we may always consider superpositions of known solutions as in the Fourier method above. We shall now briefly indicate a more conceptual and general approach.

Let us come back to the 1D diffusion equation and our solution of the homogeneous problem by means of the Fourier transform in 9.2.21. The solution of u_t = κu_{xx} with u(x, 0) = φ is a convolution of the boundary values u(x, 0) with the heat kernel

(1) Q(x, t) = (1/√(4πκt)) e^{−x²/4κt}.

Now, the crucial observation is that u(x, t) = Q(x, t) is a solution of L(u) = u_t − κu_{xx} = 0 for all x and t > 0, while on a neighborhood of the origin it behaves as the Dirac delta function in the variable x. (The first part is a matter of direct computation; the second one was revealed in 9.2.21 already.)

The latter observation suggests how to find particular solutions to a non-homogeneous problem. Consider the integral of the convolutions

(2) u(x, t) = ∫₀ᵗ ( ∫_{−∞}^{∞} Q(x−y, t−s) f(y, s) dy ) ds.

The derivative u_t will have two terms. In the first one we differentiate with respect to the upper limit of the outer integral, while the other one is the derivative inside the integrals. The derivatives with respect to x are evaluated inside the integrals. Thus, in the evaluation of L = ∂/∂t − κ∂²/∂x², the terms inside the integral cancel each other (remember Q is a solution for all x and t > 0) and only the first term of u_t survives. It seems obvious that this term is the evaluation of the integrand at s = t. Although these values are not properly defined, we may verify this claim by taking the limit (t−s) → 0+. This leads to

lim_{s→t−} ∫_{−∞}^{∞} Q(x−y, t−s) f(y, s) dy = f(x, t).

Thus, (2) is a particular solution, and clearly u(x, 0) = 0. The solution of the general Dirichlet problem L(u) = f, u(x, 0) = φ on Ω = R × [0, ∞) is

(3) u(x, t) = ∫_{−∞}^{∞} Q(x−y, t) φ(y) dy + ∫₀ᵗ ( ∫_{−∞}^{∞} Q(x−y, t−s) f(y, s) dy ) ds.

Let us summarize the achievements and try to generalize them to arbitrary dimensions. First, we can generalize the heat kernel function Q, writing its nD variant depending on the distance from the origin only. Consider the formula with x ∈ Rⁿ given as the product of the 1D heat kernels for each of the variables in x:

(4) Q(x, t) = (1/√((4πκt)ⁿ)) e^{−∥x∥²/4κt}.

Then taking the n-dimensional (iterated) convolution of Q with the boundary condition φ on the hyperplane t = 0 provides the solution candidate

(5) u(x, t) = ∫_{Rⁿ} Q(x−y, t) φ(y) dy₁ … dyₙ.

Indeed, a straightforward (but tedious) computation reveals that Q is a solution of L(u) = 0 at all points (x, t) with t > 0, and Q behaves again as the Dirac delta at the origin. In particular, (5) is a solution of the Dirichlet problem L(u) = 0, u(x, 0) = φ, and we can also obtain the non-homogeneous solutions similarly to the 1D case.

9.2.26. The Green's functions. The solutions to the (non-homogeneous) diffusion equation constructed in the last paragraph are built on a very simple idea – we find a solution G to our equation which is defined everywhere except at the origin and blows up at the origin at the speed making it into a Dirac delta function there. A convolution with such a kernel G is then a good candidate for the solutions. Let us try to mimic this approach for the Laplace and Poisson equations now. Actually, we shall modify the strategy.
[about 1 page to be finished – the spherically symmetric solution to the Laplace equation => Green's function => solution to the Poisson equation, similarly to the diffusion.]

3. Remarks on Variational Calculus

Many practical problems look for minima or maxima of real functions J : S → R defined on some spaces of functions. In particular, many laws of nature can be expressed as a certain "minimum principle" concerning some space of mappings. The basic idea is exactly the same as in elementary differential calculus: we aim at finding the best linear approximations of J at fixed arguments u ∈ S, we recognize the critical points (those with vanishing linearization), and then we perhaps look at the quadratic approximations at the critical points. However, all these steps are far more intricate, need a lot of care, and may provide nasty surprises.

9.3.1. Simple examples first. If we know the sizes of tangent vectors to curves, we may ask what is the shortest distance between two points. In the plane R², this means we have got a quadratic form g(x) = (g_{ij}(x)), 1 ≤ i, j ≤ 2, at each x ∈ R², and we want to integrate (the dots mean derivatives in time t, and u(t) = (u₁(t), u₂(t)) are differentiable paths)

(1) J(u) = ∫_{t₁}^{t₂} √( g(u(t))(u̇(t)) ) dt

to get the distance between the two given points u(t₁) = (u₁(t₁), u₂(t₁)) = A and u(t₂) = (u₁(t₂), u₂(t₂)) = B. If the size of the vectors is just the Euclidean one, and we consider curves u(t) = (t, v(t)), i.e., graphs of functions of one variable, the length (1) becomes the well known formula

(2) J(u) = ∫_{t₁}^{t₂} √( 1 + v̇(t)² ) dt.

Quite certainly we all believe that the minimum for fixed boundary values v(t₁) and v(t₂) must be a straight line. But so far, we have not formulated the problem itself. What is the space of curves we deal with? If we allowed non-continuous ones, then shorter paths would be available! So we should aim at proving that the lines are the minimal curves among the continuous ones. Do we need them to be differentiable? In some sense we do, since the derivative appears in our formula for J, but we need the integrand to be defined only almost everywhere. For example, this will be true for all Lipschitz curves.

In general, g(u)(u̇) = g₁₁(u)u̇₁² + 2g₁₂(u)u̇₁u̇₂ + g₂₂(u)u̇₂². Such lengths of vectors are automatically inherited from the ambient Euclidean R³ on every hypersurface in the space. Thus, finding the minimum of J means finding the shortest track in a real terrain (with hills and valleys).

If we choose a positive function α on R² and consider g(x) = α(x)² id_{R²}, i.e., the Euclidean size of vectors scaled by α(x) > 0 at each point x ∈ R², we obtain

(3) J(u) = ∫_{t₁}^{t₂} α(t, v(t)) √( 1 + v̇(t)² ) dt.

We can imagine a particle (or light) moving in the plane with speed 1/α depending on the values of α (the smaller is α, the bigger is the speed), and our problem will be to find the shortest path in terms of the time necessary to pass from A to B.

As a warm up, consider α = 1 in the entire plane except the vertical strip V = {(t, y); t ∈ [a, a+b]}, where α = N, and take A = (0, 0), B = (a+b, c), a, b, c > 0. We can imagine V is a lake: you have to get from A to B by running and swimming, and you swim N times slower than you run.
If we believe that straight lines are the minimizers for constant α, then it is clear that we have to find the optimal point P = (a, p) on the bank of the lake where we start swimming. The total time T(p) is then (s is our actual speed when running straight)

|AP|/s + |PB|/(s/N) = (1/s)( √(p² + a²) + N√((c−p)² + b²) ),

and we want to find the minimum of T(p). The critical point is given by

p/√(p² + a²) = N(c−p)/√((c−p)² + b²) ⟹ sin φ = N sin ψ,

where φ is the angle between our running track and the normal to the boundary of V, while ψ is the angle between our swimming track and the normal to the boundary (draw a picture yourself!). Thus we have recovered the famous Snell law of refraction, saying that the ratio of the sines of the angles equals the ratio of the speeds. (Of course, to finish the solution of the problem, the reader should find the solution p of the resulting quartic equation and check that it is a minimum.)

9.3.2. Variational problems. We shall restrict our attention to the following class of problems.

General first order variational problems

Consider an open Riemann measurable set Ω ⊂ Rⁿ, the space C¹(Ω) of all differentiable mappings u : Ω → Rᵐ, and a C² function F = F(x, y, p) : Rⁿ × Rᵐ × R^{nm} → R, and set the functional

(1) J(u) = ∫_Ω F(x, u(x), D¹u(x)) ω_{Rⁿ},

i.e., J(u) is computed as the ordinary integral of the Riemann integrable function f(x) = F(x, u(x), D¹u(x)), where D¹u is the Jacobi matrix (the differential) of u. The function F is called the Lagrangian of the variational problem, and our task is to find the minimum of J and the corresponding minimizer u with prescribed boundary values u on the boundary ∂Ω (and perhaps some further conditions restricting u).

Mostly we shall restrict ourselves to the case n = m = 1, as in the previous paragraph, where u is a real differentiable function defined on an interval (t₁, t₂) and the function is F = F(t, y, p) : R³ → R,

(2) J(u) = ∫_Ω F(t, u(t), u̇(t)) dt.

We saw F = √(1+p²) and F = α(t)√(1+p²) in the previous paragraph. If we take F = y√(1+p²), the functional J computes the area of the rotational surface given by the graph of the function u (up to a constant multiple). In all cases we may set the boundary values u(t₁) and u(t₂).

Actually, our differentiability assumptions are too strict, as we saw already in our last example above, where F was differentiable except on the boundary of the lake V. We can easily extend our space of functions to piecewise differentiable u and request F(t, u(t), u̇(t)) to be piecewise differentiable for all such u's (as always, piecewise differentiable means the one-sided derivatives exist at all points).

A maybe shocking example is the following functional on piecewise differentiable functions on [0, 1] (i.e. F is the neat polynomial (p² − 1)²):

(3) J(u) = ∫₀¹ ( u̇(t)² − 1 )² dt.

Clearly, J(u) ≥ 0 for all u and, if we set u(0) = u(1) = 0, then any zig-zag piecewise linear function u with derivatives ±1 satisfying the boundary conditions achieves the zero minimum. At the same time, there is no minimum among the differentiable functions u (find a quick proof of that!), but we can approximate any of the zig-zag minima by smooth ones to any precision.
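The phenomenon is easy to observe numerically. In the sketch below (our own construction), the zig-zag gives J ≈ 0, while smoothing the corner over a width ε raises the value only to roughly 16ε/15.

```python
# Evaluate J(u) = integral of (u'^2 - 1)^2 for the zig-zag and for smooth
# competitors with the corner replaced by a parabolic cap of width eps.
import numpy as np

t = np.linspace(0.0, 1.0, 200001)
dt = t[1] - t[0]

def J(u):
    du = np.gradient(u, t)
    return np.sum((du**2 - 1.0)**2) * dt

print('u = 0   :', J(np.zeros_like(t)))          # J = 1
print('zig-zag :', J(0.5 - np.abs(t - 0.5)))     # J ~ 0
for eps in (0.1, 0.01, 0.001):
    u = np.where(np.abs(t - 0.5) >= eps,
                 0.5 - np.abs(t - 0.5),
                 0.5 - eps / 2 - (t - 0.5)**2 / (2 * eps))
    print(f'eps = {eps}:', J(u))                 # ~ 16 * eps / 15
```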
9.3.3. More examples. Let us develop a general method for finding the analogy of the critical points from elementary calculus here. We shall find the necessary steps by dealing with a specific set of problems in this paragraph. Let us work with a Lagrangian generalizing the previous examples,

(1) F(t, y, p) = y^r √(1 + p²),

and write F_t, F_y, F_p, etc., for the corresponding partial derivatives. Consider the variational problem on an interval I = (t₁, t₂) with fixed boundary conditions u(t₁) and u(t₂), and assume u ∈ C²(I), u(t) > 0. Let us consider any differentiable v on I with v(t₁) = v(t₂) = 0 (or, even better, v with compact support inside I). Then u + δv fulfills the boundary conditions for all small real δ, and we consider

J(u + δv) = ∫_{t₁}^{t₂} F(t, u(t) + δv(t), u̇(t) + δv̇(t)) dt.

Of course, the necessary condition for u to be a critical point must be (d/dδ)|₀ J(u + δv) = 0, i.e. (recall the derivative with respect to a parameter can be swapped with the integration)

(2) 0 = ∫_{t₁}^{t₂} ( F_y(t, u(t), u̇(t)) v(t) + F_p(t, u(t), u̇(t)) v̇(t) ) dt.

Integrating the second term in (2) per partes immediately yields (remember v(t₁) = v(t₂) = 0)

0 = ∫_{t₁}^{t₂} ( F_y(t, u(t), u̇(t)) − (d/dt) F_p(t, u(t), u̇(t)) ) v(t) dt.

This condition will certainly be satisfied if the so called Euler equation holds true for u (we prove that this is a necessary condition in lemma 9.3.6):

(3) (d/dt) F_p(t, u(t), u̇(t)) = F_y(t, u(t), u̇(t)).

An equivalent form of this equation for u̇(t) ≠ 0 is (we omit the arguments t of u and u̇)

(4) F_t(t, u, u̇) = (d/dt)( F(t, u, u̇) − u̇ F_p(t, u, u̇) ).

In our case of F(t, y, p) = y^r(1+p²)^{1/2}, F_t vanishes identically, F_p = y^r p(1+p²)^{−1/2}, and thus, if we further assume r ≠ 0 and u > 0, the term in the bracket has to be a positive constant C^r:

C^r = u^r(1+u̇²)^{1/2} − u̇ · u^r u̇ (1+u̇²)^{−1/2} = u^r(1+u̇²)^{−1/2}.

We have arrived at the differential equation

(5) u = C(1 + u̇²)^{1/(2r)},

which we are going to solve. Consider the transformation u̇ = tan τ, i.e.,

u = C(1 + tan²τ)^{1/(2r)} = C(cos τ)^{−1/r},

and so du = (C/r)(cos τ)^{−1/r} tan τ dτ. Consequently,

dt = (1/u̇) du = (C/r)(cos τ)^{−1/r} dτ,

and by integration we arrive at the very useful parametrization of the solutions by the parameter τ (which is actually the slope of the tangent to the solution graph):

(6) t = t₀ + (C/r) ∫₀^τ (cos s)^{−1/r} ds, u = C(cos τ)^{−1/r}.

Now we can summarize the results for several interesting values of r.

First, if r = 0 (which we excluded on the way), then the Euler equation (3) reads ü(1+u̇²)^{−3/2} = 0, which implies ü = 0, and thus the potential minimizers should be straight lines, as expected. (Notice that we have not proved yet that the Euler equation is indeed a necessary condition; we shall come to that in the next paragraphs.)

For general r ≠ 0, the Euler equation (3) tells us (a straightforward computation!)

ü = r (1 + u̇²)/u,

and thus the sign of the second derivative coincides with the sign of r. In particular, the potential minimizers are always concave functions (if r < 0) or convex ones (if r > 0).

If r = −1, the parametrization (6) leads to (an easy integration!)

(7) t = t₀ − C sin τ, u = C cos τ,

thus for τ ∈ [−π/2, π/2] our solutions are half-circles with radius C in the upper half-plane, centred at (t₀, 0).

For r = −1/2, the solution is

(8) t = t₀ − (C/2)(2τ + sin 2τ), u = (C/2)(1 + cos 2τ),

which is a parametric description of a fixed point on a circle with diameter C rolling along the t axis, the so called cycloid. Now, τ ∈ [−π/2, π/2] provides t running from t₀ + Cπ/2 to t₀ − Cπ/2, while u is zero at the points t₀ ± Cπ/2 and reaches its highest point at t = t₀. (Draw pictures!)
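Readers who wish to avoid the straightforward computation can let a computer algebra system derive the Euler equation. A sketch with sympy's euler_equations (our own; the positivity assumptions on u and r are imposed to help sympy cancel the powers):

```python
# Derive the Euler equation for F = u^r sqrt(1 + u'^2) and check that it
# reduces to u'' = r (1 + u'^2) / u, as computed above.
import sympy as sp
from sympy.calculus.euler import euler_equations

t = sp.Symbol('t')
r = sp.Symbol('r', positive=True)
u = sp.Function('u', positive=True)

F = u(t)**r * sp.sqrt(1 + u(t).diff(t)**2)
eq = euler_equations(F, [u(t)], [t])[0]        # F_y - d/dt F_p = 0
upp = sp.solve(eq, u(t).diff(t, 2))[0]         # isolate u''
print(sp.simplify(upp - r * (1 + u(t).diff(t)**2) / u(t)))   # expect 0
```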
Next, look at r = 1/2. Another quick integration reveals

t = t₀ + 2C tan τ = t₀ + 2Cu̇,

and we can compute u̇ and substitute into (5) to obtain

u = C + (1/(4C))(t − t₀)².

Thus the potential minimizers are parabolas with the axis of symmetry t = t₀. If we fix A = (0, 1) and a t₀, there are two relevant choices C = ½(1 ± √(1 − t₀²)) whenever |t₀| < 1 (and no options for |t₀| > 1). The two parabolas have two points of intersection, A and another point B. Clearly only one of them should be the actual minimizer. Moreover, the reader could try to prove that the parabola u = ¼t² touches all of them and has them all on the left (it is the so called envelope of the family of parabolas). Thus, there will be no potential minimizer joining the point A = (0, 1) to an arbitrary point on the right of the parabola u = ¼t².

The last case we come to is r = 1, i.e., the case of the area of the surface of the rotational body drawn by the graph. Here we had better use another parametrization of the slope of the tangent: we set u̇ = sinh τ. A very similar computation as above then immediately leads to t = t₀ + Cτ, u = C cosh τ, and we arrive at the result⁷

(9) u(t) = C cosh( (t − t₀)/C ).

⁷ Some more details on the set of examples of this paragraph can be found in the article "Elementary Introduction to the Calculus of Variations" by Magnus R. Hestenes, Mathematics Magazine, Vol. 23, No. 5 (May–Jun., 1950), pp. 249–267.

9.3.4. Critical points of functionals. Now we shall develop a bit of theory verifying that the steps done in the previous examples really provided necessary conditions for solutions of the variational problems. In order to underline the essential features, we shall first introduce the basic tools in the realm of general normed vector spaces, see 7.3.1. The spaces of piecewise differentiable functions on an interval with the L_p norms can serve as typical examples. We shall deal with mappings F : S → R called (real) functionals.

The first differential

Let S be a vector space equipped with a norm ∥ ∥. A continuous linear mapping L : S → R is called a continuous linear functional. A functional F : S → R is said to have the differential D_uF at a point u ∈ S if there is a continuous linear functional L such that

(1) lim_{v→0} ( F(u + v) − F(u) − L(v) ) / ∥v∥ = 0.

In the very special case of the Euclidean S = Rⁿ, we have recovered the standard definition of the differential, cf. 8.1.7 (just notice that all linear functionals are continuous on a finite dimensional vector space). Again, the differential is computed via the directional derivatives.⁸ Indeed, if (1) holds true, then for each fixed v ∈ S the limit

(2) δF(u)(v) = lim_{t→0} ( F(u + tv) − F(u) ) / t = (d/dt)|₀ F(u + tv)

exists and L(v) = δF(u)(v). We call δF(u) the variation of the functional F at u. A point u ∈ S is called a critical point if δF(u) = 0.

⁸ In functional analysis, this directional derivative is usually called the Gâteaux differential, while the continuous functional L satisfying (1) is usually called the Fréchet differential, going back to two of the founders of functional analysis from the beginning of the 20th century.

We say that F has got a local minimum at u if there is an open neighborhood U of u such that F(w) ≥ F(u) for all w ∈ U. Similarly, we define local maxima and talk about local extrema. If u is an extremum of F, then in particular t = 0 must be an extremum of the function F(u + tv) of one real variable t, where v is arbitrary. Thus the extrema have to be at critical points, whenever the variations exist.

Next, let us assume the variations exist at all points in a neighborhood of a critical point u ∈ S. Then, again exactly as in elementary calculus, considering two increments v, w ∈ S, we consider the limit

(3) δ²F(u)(v, w) = lim_{t→0} ( δF(u + tv)(w) − δF(u)(w) ) / t.
If the limits exist for all v, w, then clearly δ²F(u) is a bilinear mapping. Then δ²F(u)(w, w) is a quadratic form, which we can consider as a second order approximation of F at u. We call it the second variation of F. Moreover, again as in elementary calculus,

δ²F(u)(w, w) = (d²/dt²)|₀ F(u + tw),

if the second variation exists. We may summarize:

Theorem. Let F : S → R be a functional with a local extremum at u ∈ S. If the variation δF(u) exists, then it has to vanish. If the second variation δ²F(u) exists (thus, in particular, δF exists on a neighborhood of u), then δ²F(u)(w, w) ≥ 0 for a minimum, while δ²F(u)(w, w) ≤ 0 for a maximum.

Proof. Assume F has got a local minimum at u. We have already seen that f(t) = F(u + tv) has to achieve a local minimum at t = 0 for each v. Thus f′(0) = 0 if f(t) is differentiable, and so δF(u) vanishes. Now assume δ²F(u)(w, w) = f″(0) = τ < 0 for some w. Then the mean value theorem implies

f(t) − f(0) = f′(c)t = (1/c)( f′(c) − f′(0) ) ct

for some t ≥ c > 0. Thus, for t small enough, f(t) − f(0) < 0, which contradicts f(0) being a local minimum. The claim for a maximum follows analogously (or we may apply the already proved result to the functional −F). □

Corollary. On top of all the assumptions of the above theorem, suppose F(v + tw) is twice differentiable at t = 0 and δ²F(v)(w, w) ≥ 0 for all v in a neighborhood of the critical point u and all w ∈ S. Then F has got a minimum at u.

Proof. As before, we consider f(t) = F(u + tw), w = z − u. Thus, for some 0 < c ≤ 1,

F(z) − F(u) = f(1) − f(0) = f′(0) + ½f″(c) = ½ δ²F(u + cw)(w, w) ≥ 0. □

Remark. Actually, the condition from the corollary is far too strong in infinite dimensional spaces. It is possible to replace it by the condition that δ²F be continuous at u and δ²F(u)(w, w) ≥ C∥w∥² for some real constant C > 0 just at the critical point u. In the finite dimensional case, this is equivalent to the requirement that δ²F be continuous and positive definite.

9.3.5. Back to variational problems. As we already noticed, the answer to a variational problem minimizing a functional (we omit the arguments t of the unknown function u)

(1) J(u) = ∫_{t₁}^{t₂} F(t, u, u̇) dt

depends very much on the boundary conditions and the space of functions we deal with. If we posit u(t₁) = A, u(t₂) = B with arbitrary A, B ∈ R, we may deal with spaces of differentiable or piecewise differentiable functions satisfying these boundary conditions. But these subspaces are not vector spaces any more. Thus, strictly speaking, we cannot apply the concepts from the previous paragraph here. However, we may fix any differentiable function v on [t₁, t₂] satisfying v(t₁) = A, v(t₂) = B, e.g. v(t) = A + (B − A)(t − t₁)/(t₂ − t₁), and replace the functional J by

J̃(u) = J(u + v) = ∫_{t₁}^{t₂} F(t, u + v, u̇ + v̇) dt.

Now the initial problem transforms into one with boundary conditions u(t₁) = u(t₂) = 0, and computing the variations, (d/dδ) J̃(u + δw) = (d/dδ) J(u + v + δw), does not change, i.e. we have to request w(t₁) = w(t₂) = 0 and we differentiate in a vector space.
Essentially, we just exploit the natural affine structures on the subspaces of functions defined by the general boundary conditions, and thus the derivatives have to live in their modelling vector subspaces.

The first and second variations

Corollary. Let F(t, y, p) be a twice differentiable Lagrangian and consider the variational problem of finding a minimum of the functional (1) on the space of differentiable functions u ∈ C¹[t₁, t₂] with boundary conditions u(t₁) = A, u(t₂) = B. Then the first and second variations exist and can be computed for all v ∈ S = {v ∈ C¹[t₁, t₂]; v(t₁) = v(t₂) = 0} as follows:

(2) δJ(u)(v) = ∫_{t₁}^{t₂} ( F_y(t, u, u̇) v + F_p(t, u, u̇) v̇ ) dt,

(3) δ²J(u)(v, v) = ∫_{t₁}^{t₂} ( F_{yy}(t, u, u̇) v² + 2F_{yp}(t, u, u̇) v v̇ + F_{pp}(t, u, u̇) v̇² ) dt.

If u is a local minimum of the variational problem, then δJ(u)(v) = 0 for all v ∈ S, while δ²J(u)(v, v) ≥ 0 for all v in a neighborhood of the origin in S.

Proof. Thanks to our strong assumptions on the differentiability of F, u, and v, we may differentiate the real function f(t) = J(u + tv) at t = 0, swapping the integral and the derivative. This immediately provides both formulae. The remaining two claims are straightforward consequences of the theorem and corollary in the previous paragraph 9.3.4. □

9.3.6. Euler-Lagrange equations. We are following the path which we already tried when discussing our first bunch of examples in 9.3.3. Our next step was to guess the consequences of the vanishing of the first variation in terms of a differential equation. Now we complete the arguments. We start with a simple result called the fundamental lemma of the calculus of variations.

Lemma. Assume u is a continuous function on the interval [t₁, t₂] such that, for all compactly supported smooth φ ∈ C_c^∞[t₁, t₂],

∫_{t₁}^{t₂} u(t)φ(t) dt = 0.

Then u vanishes identically on [t₁, t₂].

Proof. Assume there is a c ∈ (t₁, t₂) such that u(c) > 0. Due to continuity, u(t) > u(c)/2 > 0 on a neighborhood (c−s, c+s) ⊂ (t₁, t₂), s > 0. Next, recall the smooth variants of indicator functions constructed in 6.1.10. For every pair of positive numbers 0 < ε < r, we constructed a function φ_{ε,r}(t) of one real variable t such that φ_{ε,r}(t) = 1 for |t| < r−ε, while φ_{ε,r}(t) = 0 for |t| > r+ε, and 0 ≤ φ_{ε,r}(t) ≤ 1 everywhere. Thus, choosing such a function for φ, with r+ε < s and the origin shifted to t = c, we certainly obtain

∫_{t₁}^{t₂} u(t)φ(t) dt ≥ ½u(c) · 2(r−ε) = u(c)(r−ε) > 0.

If we find some negative value u(c), the same argument finishes the proof (notice there is no need to consider the boundary points, due to the continuity of u). □

Euler-Lagrange equations

Theorem. Consider a twice differentiable Lagrangian F(t, y, p) on [t₁, t₂] × R² and a differentiable critical point u of the functional J(u) = ∫_{t₁}^{t₂} F(t, u, u̇) dt with fixed boundary values u(t₁), u(t₂). Then u is a solution of the differential equation

(1) F_y(t, u, u̇) − (d/dt) F_p(t, u, u̇) = 0.

Notice that the derivative in the second term of the Euler-Lagrange equation is the so called total derivative, i.e. we should differentiate the composed mapping via the chain rule. This can be a problem if we do not assume u to be twice differentiable.
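A finite-difference sanity check (our own construction) of the formula (2) above: for F = √(1+p²) a straight line is a critical point, so the difference quotient of J in any admissible direction v should tend to zero.

```python
# For the length functional, J(u + s v) - J(u) = o(s) at the straight line.
import numpy as np

t = np.linspace(0.0, 1.0, 10001)
dt = t[1] - t[0]

def J(u):                              # F = sqrt(1 + p^2)
    du = np.gradient(u, t)
    return np.sum(np.sqrt(1.0 + du**2)) * dt

u = 2.0 * t                            # straight line, u(0)=0, u(1)=2
v = np.sin(np.pi * t)                  # admissible direction, v(0)=v(1)=0
for s in (1e-1, 1e-2, 1e-3):
    print(s, (J(u + s * v) - J(u)) / s)    # -> 0 linearly in s
```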
Proof. We already know that the vanishing of the first variation δJ(u) is a necessary condition for u to be a critical point. Thus we can start with the equality (2) in the previous paragraph 9.3.5 and compute, integrating per partes:

(2) 0 = ∫_{t₁}^{t₂} ( F_y(t, u, u̇) v + F_p(t, u, u̇) v̇ ) dt
= ∫_{t₁}^{t₂} ( F_y(t, u, u̇) − (d/dt) F_p(t, u, u̇) ) v dt + F_p(t, u, u̇)v|_{t=t₂} − F_p(t, u, u̇)v|_{t=t₁}.

Finally, we exploit the above fundamental lemma of the calculus of variations with arbitrary smooth test functions v with compact support inside (t₁, t₂). Thus the last term with the boundary values vanishes and, by the lemma, the Euler-Lagrange equation has to hold true for u. □

9.3.7. Remarks. We made our life comfortable by adopting very strong differentiability assumptions in the theorem above. In the last century, there was a lot of effort to get much more general results with weaker assumptions. This is really important in practice, where we need to deal with piecewise differentiable extremals. On the other hand, we need even twice differentiable critical points in order to write down the Euler-Lagrange equation explicitly.

Another difficult point is to recognize which of the critical points are the minima or maxima of the functional. We saw that the second variation is a very specific quadratic functional, see 9.3.5(3), and there is a rich theory dealing with its properties. We do not have the space to go into details here, but to get a feeling for the topic we mention just one simple necessary condition for an extremum (with a bit tricky proof).

Lemma (Legendre necessary condition). Consider a twice differentiable Lagrangian F(t, y, p) on [t₁, t₂] × R² and a differentiable critical point u of the functional J(u) = ∫_{t₁}^{t₂} F(t, u, u̇) dt with fixed boundary values u(t₁), u(t₂). If u is a local minimum of J, then F_{pp}(t, u(t), u̇(t)) ≥ 0 on the entire interval [t₁, t₂].

Proof. Assume there is t₀ ∈ (t₁, t₂) such that F_{pp}(t₀, u(t₀), u̇(t₀)) = −µ < 0. Similarly as in the proof of the fundamental lemma in the previous paragraph, we choose s > 0 so that F_{pp}(t, u, u̇) < −½µ on (t₀−s, t₀+s), and a smooth analog of an indicator function φ = φ_{ε,r}, centered at t₀ and satisfying r+ε < s. Then the second variation evaluated on v(t) = αφ(t₀ + (t−t₀)/α), with some α > 0, can be estimated as follows (we use a constant C > |F_{yy}(t, u, u̇)|, C > |F_{yp}(t, u, u̇)| on the entire interval, the fact that v̇(t) = φ̇(τ) under the substitution τ = t₀ + (t−t₀)/α, and the fact that |φ̇_{ε,r}| integrates to two over its support):

δ²J(u)(v, v) = ∫_{t₁}^{t₂} ( F_{yy}(t, u, u̇) v² + 2F_{yp}(t, u, u̇) v v̇ + F_{pp}(t, u, u̇) v̇² ) dt
≤ ∫_{t₀−αs}^{t₀+αs} ( Cα² + 2Cα|v̇| ) dt − ½µ ∫_{t₀−αs}^{t₀+αs} v̇² dt
= 2Csα³ + 2Cα² ∫_{t₀−s}^{t₀+s} |φ̇| dτ − ½µα ∫_{t₀−s}^{t₀+s} φ̇² dτ
= 2Csα³ + 4Cα² − ½µ ( ∫_{t₀−s}^{t₀+s} φ̇² dτ ) α.

The integral in the last term is strictly positive and thus, since the negative term is linear in α, the entire expression is negative if α is small enough. This is a contradiction and the proof is complete. □

9.3.8. Special cases. Very often the Lagrangians do not depend on all the variables, and then the variations and the Euler-Lagrange equations take special forms. The following summary is a straightforward consequence of the general equation 9.3.6(1), whose equivalent form we saw already in 9.3.3(4).

Special forms of Lagrangians

Case 1. If the Lagrangian is F(t, y), i.e., does not depend on the derivatives, then the Euler-Lagrange equation says

(1) F_y(t, u) = 0,

which is an implicit equation for u(t). Moreover, the second variation is δ²J(u)(v, v) = ∫_{t₁}^{t₂} F_{yy}(t, u) v² dt.
Case 2. If the Lagrangian is F(t, p), then the Euler-Lagrange equation is

(2) (d/dt) F_p(t, u̇) = 0,

and its solutions are given by the first order differential equation F_p(t, u̇) = C with a constant parameter C. Moreover, the second variation is δ²J(u)(v, v) = ∫_{t₁}^{t₂} F_{pp}(t, u̇) v̇² dt.

Case 3. If the Lagrangian is F(y, p), then there is a consequence of the Euler-Lagrange equation (for u̇ ≠ 0),

(3) (d/dt)( F(u, u̇) − u̇ F_p(u, u̇) ) = 0,

which again reduces the equation to first order, including a free constant parameter.

9.3.9. Remarks on higher dimensional problems.

9.3.10. Problems with free boundary conditions.

9.3.11. Constrained and isoperimetric problems.

4. Complex Analytic Functions

In the rest of the chapter, we shall look at (single complex variable) functions defined on the complex plane C = R². On many occasions we saw how helpful it was to extend objects from the real line into the complex plane. We provide a few glimpses into the rich classical theory, and we hope the readers will enjoy the use of it in the practical column.

9.4.1. Complex derivative. An open and connected subset Ω ⊂ C is called a region, or a domain. A mapping f : Ω → C is called a complex function of a single complex variable. Working with complex numbers, we may repeat the definition of the derivative:

Complex derivative

We say that a complex function f : Ω → C has the complex derivative f′(a) at a point a ∈ Ω if the complex limit

f′(a) = lim_{z→a} ( f(z) − f(a) )/( z − a ) ∈ C

exists. We say that f is differentiable in the complex sense, or holomorphic, on Ω if its complex derivative f′(z) exists at each z ∈ Ω.

Clearly, this definition restricts to the definition of the derivative of functions of one real variable along R ⊂ C, when we restrict the defining limit to real z and a. We shall see that the existence of a complex derivative is much more restrictive than in the real case.

The simplest example of a differentiable complex function is z ↦ zⁿ, n ∈ N. Indeed, exactly as with the real polynomials, we compute

(z+h)ⁿ − zⁿ = h( nz^{n−1} + ½n(n−1)z^{n−2}h + ⋯ + h^{n−1} ),

and thus for all z ∈ C we obtain the limit

(zⁿ)′ = lim_{h→0} ( (z+h)ⁿ − zⁿ )/h = nz^{n−1}.

By the very definition, the mapping f ↦ f′ is linear over the complex scalars, and thus all polynomials f(z) are differentiable this way:

f(z) = ∑_{k=0}^{n} a_k z^k ↦ f′(z) = ∑_{k=0}^{n−1} (k+1)a_{k+1} z^k.
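The direction-independence of the complex limit can be observed numerically; a tiny sketch (our own) for f(z) = z³:

```python
# The complex difference quotient of z^3 approaches 3 z^2 regardless of the
# direction from which h tends to 0.
import numpy as np

z = 1.0 + 2.0j
exact = 3 * z**2

for direction in (1.0, 1.0j, (1 + 1j) / np.sqrt(2)):
    for eps in (1e-2, 1e-5):
        h = eps * direction
        quotient = ((z + h)**3 - z**3) / h
        print(direction, eps, abs(quotient - exact))   # -> 0 in every direction
```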
An important property of every power series is that it is (complex) differentiable within its radius of convergence.⁹

Convergence and derivative of power series

Theorem. Consider an analytic function (1). There exists $R$, either a non-negative real number or infinity, such that
• $f(z)$ converges absolutely if $|z-a|<R$;
• $f(z)$ diverges if $|z-a|>R$;
• $\frac1R=\limsup_{n\to\infty}|c_n|^{1/n}$.
Further, if the radius of convergence is $R>0$, then $f$ is differentiable (in the complex sense) for $|z-a|<R$, and its derivative equals the power series
$f'(z)=\sum_{n=1}^\infty nc_n(z-a)^{n-1},$
obtained by differentiating $f(z)$ term by term. Moreover, the power series representing $f'(z)$ has the same radius of convergence as $f(z)$.

The open disc $|z-a|<R$ is called the disc of convergence of $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$.

Proof. Let $L=\frac1R$. Suppose $|z-a|<R$. We show that $\sum_{n=0}^\infty c_n(z-a)^n$ converges. Note that $L=0$ if $R=\infty$, and the statement is trivially true in this case. For our fixed $z$, there is $\varepsilon>0$ such that $(L+\varepsilon)|z-a|<1$. Since $L=\limsup_{n\to\infty}|c_n|^{1/n}$, we have $|c_n|^{1/n}<L+\varepsilon$, i.e. $|c_n|<(L+\varepsilon)^n$, for sufficiently large $n$. Therefore $|c_n||z-a|^n\le(L+\varepsilon)^n|z-a|^n$ and $\sum_{n=0}^\infty|c_n||z-a|^n$ is majorized by the convergent geometric series $\sum_{n=0}^\infty\rho^n$ with $\rho=(L+\varepsilon)|z-a|$. Therefore $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ converges.

Now suppose $|z-a|>R$. By the definition of $\limsup$, for any $\varepsilon>0$ there exist infinitely many $c_n$ satisfying $|c_n|>(L-\varepsilon)^n$. Choose $\varepsilon>0$ small enough that $(L-\varepsilon)|z-a|>1$. Then $|c_n||z-a|^n>(L-\varepsilon)^n|z-a|^n$ for infinitely many $n$, and because $(L-\varepsilon)|z-a|>1$, the terms $c_n(z-a)^n$ do not converge to $0$ as $n\to\infty$. Therefore $\sum_{n=0}^\infty c_n(z-a)^n$ diverges.

Next, we move to the derivative. First notice that $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ and $g(z)=\sum_{n=1}^\infty nc_n(z-a)^{n-1}$ have the same radius of convergence, because $\lim n^{1/n}=1$.

⁹ Actually the opposite implication is true as well: a holomorphic function on a domain $\Omega$ is analytic on $\Omega$. We shall not provide a full proof of this result, but we come close to it below. The reader may find the full argument in nearly all basic textbooks on complex analysis.

Fix $z_0$ in the disc of convergence, so that $|z_0-a|<r<R$ for some value of $r$. Let $S_N(z)$, $E_N(z)$ be defined by
$S_N(z)=\sum_{n=0}^Nc_n(z-a)^n,\qquad E_N(z)=\sum_{n=N+1}^\infty c_n(z-a)^n.$
We may think of $S_N$, which consists of the lower-order terms of the power series, as the main term, and of $E_N$ as the error term. Notice that $g(z)=\sum_{n=1}^\infty nc_n(z-a)^{n-1}$ is the term-by-term derivative of $f(z)$. We prove that $f'(z_0)=g(z_0)$, which means
$\lim_{h\to0}\Bigl(\frac{f(z_0+h)-f(z_0)}{h}-g(z_0)\Bigr)=0.$
Thus, given any $\varepsilon>0$, we must show that there exists $\delta>0$ such that if $0<|h|<\delta$, then the expression above has absolute value less than $\varepsilon$. To do so, we break the expression into three parts and estimate each of them separately. More precisely, since $f(z)=S_N(z)+E_N(z)$, and we know the derivative $S'_N$ of the polynomial $S_N$, we write
$\frac{f(z_0+h)-f(z_0)}{h}-g(z_0)=\Bigl(\frac{S_N(z_0+h)-S_N(z_0)}{h}-S'_N(z_0)\Bigr)+\bigl(S'_N(z_0)-g(z_0)\bigr)+\frac{E_N(z_0+h)-E_N(z_0)}{h}.$
We analyze the individual terms. The first term contains the main term and its derivative, which exists because $S_N$ is a polynomial. Thus this term approaches $0$ as $h\to0$. In other words, given $\frac\varepsilon3>0$, we can find $\delta>0$ such that $0<|h|<\delta$ implies
$\Bigl|\frac{S_N(z_0+h)-S_N(z_0)}{h}-S'_N(z_0)\Bigr|<\frac\varepsilon3.$
The second term is $S'_N(z_0)-g(z_0)$.
Since $S'_N(z_0)\to g(z_0)$ as $N\to\infty$ (because $g(z)$ is a power series converging absolutely in its disc of convergence centered at $a$, and $S'_N(z)$ is the $N$-th partial sum of this power series), for $\frac\varepsilon3>0$ we can find some $N_1$ such that $N>N_1$ implies $|S'_N(z_0)-g(z_0)|<\frac\varepsilon3$.

The third term is the trickiest to estimate effectively. We can write
$E_N(z_0+h)-E_N(z_0)=\sum_{n=N+1}^\infty\bigl(c_n(z_0+h-a)^n-c_n(z_0-a)^n\bigr).$
Expanding
$(z_0+h-a)^n-(z_0-a)^n=h\bigl((z_0+h-a)^{n-1}+(z_0+h-a)^{n-2}(z_0-a)+\cdots+(z_0-a)^{n-1}\bigr),$
we obtain
$\frac{E_N(z_0+h)-E_N(z_0)}{h}=\sum_{n=N+1}^\infty c_n\bigl((z_0+h-a)^{n-1}+(z_0+h-a)^{n-2}(z_0-a)+\cdots+(z_0-a)^{n-1}\bigr).$
Observe that for $h$ sufficiently small, $|z_0+h-a|<r$ as well as $|z_0-a|<r$. Therefore, replacing all terms by their absolute values and applying the triangle inequality, we obtain
$\Bigl|\frac{E_N(z_0+h)-E_N(z_0)}{h}\Bigr|\le\sum_{n=N+1}^\infty|c_n|nr^{n-1}.$
The series on the right converges, and furthermore its value approaches $0$ as $N\to\infty$. Indeed, $\sum_{n=N+1}^\infty|c_n|nr^{n-1}$ is just the tail of the series $g(r)$ with absolute values on all of its individual terms, and we know that $g(z)$ converges absolutely for $|z-a|<R$. So the series in question does converge, and since it is the tail of a convergent series, it approaches $0$ as $N\to\infty$. Therefore, given $\frac\varepsilon3>0$, we can find $N_2$ such that for all sufficiently small $h$ and $N>N_2$,
$\Bigl|\frac{E_N(z_0+h)-E_N(z_0)}{h}\Bigr|<\frac\varepsilon3.$
Now select $N>\max\{N_1,N_2\}$. Then an application of the triangle inequality yields
$\Bigl|\frac{S_N(z_0+h)-S_N(z_0)}{h}-S'_N(z_0)\Bigr|+\bigl|S'_N(z_0)-g(z_0)\bigr|+\Bigl|\frac{E_N(z_0+h)-E_N(z_0)}{h}\Bigr|\le\varepsilon.$ □

9.4.3. Corollaries. We can apply the above theorem any number of times to obtain the following consequences. In particular, notice the straightforward existence of the antiderivative, which we shall link with integrals in the next subsections.

Corollaries on the derivatives of power series

Corollary. Consider any power series $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ with radius of convergence $R>0$ and write $D$ for its disc of convergence.
(1) $f(z)$ is infinitely (complex) differentiable in $D$, and each of its $k$-th derivatives $f^{(k)}(z)$ can be obtained by differentiating term by term $k$ times. The resulting power series again has radius of convergence $R$.
(2) There exists the (complex) antiderivative
$F(z)=\sum_{n=0}^\infty\frac{1}{n+1}c_n(z-a)^{n+1},$
such that $F'(z)=f(z)$ in the disc of convergence $D$, which is the same for both series.
(3) The coefficients $c_k$ of the power series $f(z)$ are $c_k=\frac{f^{(k)}(a)}{k!}$.

Proof. All the claims are more or less obvious. Differentiating consecutively, we see that $f$ is infinitely differentiable at all $z\in D$, as claimed in (1). Furthermore, $f^{(k)}(z)=\sum_{n=k}^\infty n(n-1)\cdots(n-k+1)c_n(z-a)^{n-k}$, which in particular yields (2) with $k=1$. Finally, substituting $z=a$ gives $f^{(k)}(a)=k!\,c_k$, since all terms containing $(z-a)^{n-k}$ with $n>k$ vanish at $z=a$. Therefore $c_k=\frac{f^{(k)}(a)}{k!}$. □

9.4.4. Links to the real calculus. Each complex valued function $f$ on $\Omega$ can be viewed as a mapping $f:\Omega\subset\mathbb R^2\to\mathbb R^2$. If this mapping is differentiable in the real sense (e.g., if all partial derivatives are continuous), we may write
$f(z)=f(a)+D^1f(a)(z-a)+(z-a)\alpha(z)$
for $a\in\Omega$ and $z$ in a small neighborhood of $a$ in $\Omega$, with $D^1f$ the Jacobi matrix of first partial derivatives in the two real coordinates and $\lim_{z\to a}\alpha(z)=0$. Thus it is legitimate to ask whether the real linear approximation $D^1f(a)$ is complex linear.
Obviously this happens if and only if the complex derivative $f'(a)$ exists. If $f(z)=u(x+iy)+iv(x+iy)$, $z=x+iy$, is the coordinate expression of a complex differentiable function $f$ viewed as a differentiable mapping $\mathbb R^2\to\mathbb R^2$, then clearly
$\frac{\partial f}{\partial x}(z)=f'(z)\cdot1,\qquad\frac{\partial f}{\partial y}(z)=f'(z)\cdot i.$
Thus $\frac{\partial u}{\partial y}+i\frac{\partial v}{\partial y}=i\bigl(\frac{\partial u}{\partial x}+i\frac{\partial v}{\partial x}\bigr)$, and we have arrived at the necessary and sufficient conditions for $D^1f(z)$ to be complex linear, the Cauchy–Riemann equations

(1) $u_x=v_y,\qquad u_y=-v_x.$

Yet another argument goes as follows: the rank-two matrix describing multiplication by a complex number $a+ib$ has $a$ in the diagonal entries, while $-b$ and $b$ are the other two entries; $D^1f$ is of this form exactly when (1) holds. In particular, differentiating the first equation in (1) by $x$, the second by $y$, and adding, we obtain $u_{xx}+u_{yy}=0$. The same Laplace equation holds for the other component function $v$ of any holomorphic function $f=u+iv$.

At the level of differentials, it is useful to consider the two (complex valued) linear forms
$dz=dx+i\,dy,\qquad d\bar z=dx-i\,dy,$
together with the dual basis of the complexified tangent space,
$\frac{\partial}{\partial z}=\frac12\Bigl(\frac{\partial}{\partial x}-i\frac{\partial}{\partial y}\Bigr),\qquad\frac{\partial}{\partial\bar z}=\frac12\Bigl(\frac{\partial}{\partial x}+i\frac{\partial}{\partial y}\Bigr).$
A straightforward check reveals that a differentiable function $f:\Omega\subset\mathbb C=\mathbb R^2\to\mathbb C$ is complex differentiable if and only if $\frac{\partial f}{\partial\bar z}=0$.

9.4.5. Integrals along paths. Another important link concerns integration along paths. A continuous path $\gamma$ in the complex plane is a continuous mapping $\gamma:J\subset\mathbb R\to\mathbb C$ defined on a bounded closed interval $J=[a,b]$. A path is called a simple closed path, or a Jordan curve in the complex plane¹⁰, if it does not intersect itself and $\gamma(a)=\gamma(b)$.

¹⁰ We shall be interested in piecewise differentiable Jordan curves only, and it is quite easy to see that these curves always divide the complex plane into exactly two connected components (this is obvious for piecewise linear curves, and the rest comes via approximation). For general Jordan curves, this is a difficult topological result attributed to the French mathematician Camille Jordan (1838–1922). This is the same Jordan related to the Jordan canonical form of matrices discussed in Chapter 4.

The composition of a path $\gamma$ with any continuous mapping $f$ defined on the image of $\gamma$ is again continuous, and thus the (complex valued) Riemann integral $\int_a^bf\circ\gamma(t)\,dt$ exists; however, it depends on the parametrization of the path. An easy way out is to restrict ourselves to differentiable paths with derivative $\dot\gamma(t)\ne0$ for all $t$, and to define the integral $I_\gamma$ of $f$ along the path $\gamma$ as the Riemann integral
$I_\gamma=\int_a^bf(\gamma(t))\,\dot\gamma(t)\,dt.$
This coincides perfectly with the Riemann integral of real functions of one variable restricted to a reparametrization $\gamma$ of an interval in $\mathbb R\subset\mathbb C$. Writing $f=u+iv$ and $\gamma(t)=x(t)+iy(t)$,
$f(\gamma)\dot\gamma=(u+iv)(\dot x+i\dot y)=(u\dot x-v\dot y)+i(v\dot x+u\dot y).$
Now we may check directly, by the substitution formula for real integrals, that the complex value $I_\gamma$ is independent of the choice of parametrization. We should also notice that the (complex valued) linear form $f(z)\,dz$ on $\mathbb R^2=\mathbb C$ equals
$(u+iv)(dx+i\,dy)=(u\,dx-v\,dy)+i(v\,dx+u\,dy).$
Thus $I_\gamma$ equals the integral of the linear form $f(z)\,dz$ over the (unparametrized) submanifold $\gamma\subset\mathbb C$ in the sense introduced in the first part of this chapter:
$I_\gamma=\int_a^b(u\dot x-v\dot y)\,dt+i\int_a^b(v\dot x+u\dot y)\,dt=\int_\gamma f(z)\,dz.$
In fact, any choice of parametrization ($\dot\gamma\ne0$ on $J$) determines an orientation of $\gamma$; thus the integral is independent of the parametrization, up to sign. If $\gamma$ is a composition $\gamma_2\circ\gamma_1$ of two paths (we simply concatenate the curves $\gamma_1:[a,b]\to\mathbb C$ and $\gamma_2:[b,c]\to\mathbb C$ with $\gamma_2(b)=\gamma_1(b)$), then clearly
$\int_\gamma f(z)\,dz=\int_{\gamma_1}f(z)\,dz+\int_{\gamma_2}f(z)\,dz.$
In particular, if $\gamma_2=\gamma_1^{-1}$, i.e. the same curve with the opposite parametrization, then $\int_\gamma f(z)\,dz=0$.
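The definition of $I_\gamma$ is easy to implement. The following Python sketch (our illustration, not part of the text; names are ours) approximates $I_\gamma=\int_a^bf(\gamma(t))\dot\gamma(t)\,dt$ by a midpoint Riemann sum over the positively oriented unit circle, anticipating the exact computation in 9.4.6 below.

\begin{verbatim}
# Sketch: numerical path integral along gamma(t) = exp(it), 0 <= t <= 2*pi.
import cmath

def path_integral(f, gamma, dgamma, a, b, n=20000):
    h = (b - a) / n
    s = 0j
    for k in range(n):
        t = a + (k + 0.5) * h            # midpoint rule
        s += f(gamma(t)) * dgamma(t) * h
    return s

gamma = lambda t: cmath.exp(1j * t)
dgamma = lambda t: 1j * cmath.exp(1j * t)

for m in (-2, -1, 0, 1, 2):
    val = path_integral(lambda z, m=m: z**m, gamma, dgamma, 0.0, 2 * cmath.pi)
    print(f"integral of z^{m}: {val:.6f}")  # ~ 2*pi*i for m = -1, else ~ 0
\end{verbatim}

For a closed analytic curve the midpoint rule converges very quickly, so already these sums reproduce $2\pi i$ for $m=-1$ and zero otherwise to many digits.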
Clearly, our definition of the integral extends to piecewise differentiable paths. By uniform continuity over compact domains, the value $I_\gamma$ depends continuously on the choice of the path $\gamma$ in the $C^0$ metric on the functions on the interval $J$. Thus we may approximate any integral $I_\gamma$ by integrating the same function over a piecewise linear path $\tilde\gamma$.

9.4.6. Antiderivatives. If $F(z)$ is an antiderivative of $f(z)$, then clearly
$\frac{d}{dt}F(\gamma(t))=F'(\gamma(t))\cdot\dot\gamma(t)=f(\gamma(t))\cdot\dot\gamma(t),$
and therefore we have verified the straightforward generalization of the Newton integral formula from one-variable real calculus:

Newton integral formula

For each piecewise differentiable path $\gamma:[a,b]\to\mathbb C$ and each antiderivative $F(z)$ of the function $f(z)$ defined on a neighborhood of $\gamma$,
(1) $I_\gamma=\int_\gamma f(z)\,dz=F(\gamma(b))-F(\gamma(a)).$
In particular, the value of the integral depends only on the values of $F$ at the endpoints of $\gamma$ and not on the path itself.

Corollary. Antiderivatives of complex functions on connected domains are uniquely defined, up to additive complex constants.

Proof. Assume $F'(z)=G'(z)$, i.e. $(F-G)'(z)=0$. Then for each path $\gamma$ with $\gamma(0)=z$, $\gamma(1)=w$,
$(F-G)(w)-(F-G)(z)=\int_\gamma0\,dz=0,$
and thus $F-G$ is a constant function. □

As an example, consider the paths $\gamma_r:[0,2\pi]\to\mathbb C$, $\gamma_r(t)=r\,e^{it}$, i.e. the positively oriented boundary of the ball $B(0,r)$ centered at the origin, with radius $r>0$. It is easy to compute the integral of $f(z)=z^n$ along these paths, for all $n\in\mathbb Z$:
(2) $\int_{\gamma_r}z^n\,dz=\int_0^{2\pi}r^ne^{nit}\,ir\,e^{it}\,dt=\begin{cases}\frac{r^{n+1}}{n+1}\bigl[e^{(n+1)it}\bigr]_0^{2\pi}=0,&n\ne-1,\\ ir^0\int_0^{2\pi}e^0\,dt=2\pi i,&n=-1.\end{cases}$
In particular, we see that the integral of any polynomial along any circle vanishes (cf. more details in ??).

9.4.7. Cauchy integral theorem. The formula 9.4.6(1) can be applied to closed paths, and we arrive immediately at the important Cauchy integral theorem on the convergence discs of analytic functions. This result is actually available for all holomorphic functions on much more general domains. Recall that a domain $\Omega$ is called simply connected if every simple closed continuous path in $\Omega$ can be shrunk continuously to a point without leaving $\Omega$.

Cauchy integral theorem

Theorem. Let $f:\Omega\to\mathbb C$ be analytic on a simply connected domain $\Omega$ and let $\gamma$ be a closed piecewise differentiable path in $\Omega$. Then
$\int_\gamma f(z)\,dz=0.$

Sketch of proof. The analytic function $f$ has an antiderivative $F$ on each of its convergence discs. Assume first that the entire domain $\Omega$ is contained in one such disc. Then we break the closed path $\gamma:[0,1]\to\Omega$ at points $0=t_0<t_1<t_2<\ldots<t_m=1$ into intervals on which $\gamma$ is differentiable, and
$\int_\gamma f(z)\,dz=\sum_{j=0}^{m-1}\bigl(F(\gamma(t_{j+1}))-F(\gamma(t_j))\bigr)=F(\gamma(1))-F(\gamma(0))=0,$
since $\gamma(0)=\gamma(1)$. In particular, if $T$ is a triangle lying entirely in a convergence disc of the analytic function $f(z)$ inside $\Omega$, then $\int_{\partial T}f(z)\,dz=0$, where $\partial T$ is the oriented boundary of $T$.

Next, without loss of generality, the path $\gamma$ can be viewed as a polygon, since any piecewise differentiable path $\gamma(t)$ can be uniformly approximated by piecewise linear functions $\gamma_n(t)$ which form closed polygons. The integrals of $f$ over $\gamma_n$ approximate the integral in question. Thus, if we show $\int_{\gamma_n}f(z)\,dz=0$, this implies $\int_\gamma f(z)\,dz=0$, too.
It seems clear that the interior of any closed polygon $\gamma_n$ can be triangulated into closed triangles $T_j$ so that all $T_j$, together with their interiors, lie in $\Omega$. (Actually, here we need the assumption that $\Omega$ is simply connected if we want to fill in all the details.) The integral along the path $\gamma_n$ is then equal to the sum of the integrals over all the individual triangles (notice that over each edge which does not belong to $\gamma_n$ we integrate twice, in opposite directions). Finally, possibly refining the polygon $\gamma_n$ and the triangulation, we may assume that each triangle $T_j$ is so small that it lies entirely in some convergence disc of $f(z)$. Therefore,
$\int_{\gamma_n}f(z)\,dz=\sum_j\int_{\partial T_j}f(z)\,dz=0,$
and hence $\int_\gamma f(z)\,dz=0$, as requested. □

9.4.8. Cauchy integral theorem again. We were quite sloppy about the topological issues in the above sketch of the proof. Actually, there is a more general theorem deducing the conclusion of the Cauchy integral theorem under the assumption that the function $f$ is merely complex differentiable (holomorphic). We shall prove this theorem under the additional assumption that $f$ is (continuously) differentiable as a mapping of two real variables. Both conditions are obviously satisfied for analytic functions. We remark that the general claim of the theorem is proved by a procedure similar to the above argumentation, dealing first with the claim for triangles, etc. The reader may find the full proof in any basic textbook on complex analysis.

Theorem (Cauchy integral theorem). If $f(z)$ is holomorphic in a simply connected domain $\Omega\subset\mathbb C$, then for every piecewise differentiable simple closed path $\gamma\subset\Omega$,
$\int_\gamma f(z)\,dz=0.$

Proof of a special case. Without loss of generality, assume that $\gamma$ is a piecewise differentiable path bounding a simply connected region $G$. Write as usual $f(z)=u(x,y)+iv(x,y)$, so that
$\int_{\partial G}f(z)\,dz=\int_{\partial G}u\,dx-v\,dy+i\int_{\partial G}v\,dx+u\,dy.$
Now, assuming $f$ is continuously differentiable as a function of two real variables, Green's version 9.1.13 of the general Stokes theorem 9.1.12 implies
$\int_{\partial G}f(z)\,dz=\int_G\Bigl(-\frac{\partial v}{\partial x}-\frac{\partial u}{\partial y}+i\Bigl(\frac{\partial u}{\partial x}-\frac{\partial v}{\partial y}\Bigr)\Bigr)dx\,dy=2i\int_G\frac{\partial f}{\partial\bar z}\,dx\,dy=0,$
since $\frac{\partial f}{\partial\bar z}=0$ is equivalent to being holomorphic. □

The Cauchy integral theorem has an immediate consequence, ensuring the existence of antiderivatives:

9.4.9. Theorem. Every analytic function $f(z)$ in a simply connected region $\Omega$ has an antiderivative $F(z)$ in that region.

Proof. If $\Omega$ is the convergence disc of a power series expression for $f$, then the claim is obvious, cf. 9.4.3. In general, fix a point $z_0\in\Omega$, consider an arbitrary $\zeta\in\Omega$ and any path $\gamma\subset\Omega$ with initial point $z_0$ and end point $\zeta$, and define
$F(\zeta)=\int_\gamma f(z)\,dz.$
Choose any other path $\tilde\gamma$ with the same beginning and end, and prolong the path $\gamma$ by $\tilde\gamma^{-1}$. This provides a closed path $\mu=\tilde\gamma^{-1}\circ\gamma$, and therefore, by the Cauchy integral theorem, $\int_\mu f(z)\,dz=0$. Thus $F(\zeta)$ is well defined, independently of the choice of $\gamma$.

Next, consider $h$ so small that the entire oriented segment $\nu$ joining $\zeta$ and $\zeta+h$ lies in $\Omega$. Then
$F(\zeta+h)-F(\zeta)=\int_{\nu\circ\gamma}f(z)\,dz-\int_\gamma f(z)\,dz=\int_0^1f(\zeta+ht)\,h\,dt.$
In particular,
$\lim_{h\to0}\frac{F(\zeta+h)-F(\zeta)}{h}=\lim_{h\to0}\int_0^1f(\zeta+ht)\,dt=f(\zeta),$
and thus $F(\zeta)$ is the requested antiderivative. □

Clearly, the antiderivative of an analytic function on a simply connected domain is again analytic.
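The construction in the proof of 9.4.9 can be mimicked numerically. In the following Python sketch (our illustration, not part of the text; all names are ours) we take the entire function $f(z)=e^z$ on the simply connected domain $\mathbb C$, define $F(\zeta)$ as the integral of $f$ along the straight segment from $z_0=0$ to $\zeta$, and check that the difference quotient of $F$ reproduces $f$.

\begin{verbatim}
# Sketch: antiderivative via a path integral, F(zeta) = int_{[0,zeta]} f(z) dz.
import cmath

def segment_integral(f, z0, z1, n=4000):
    h = (z1 - z0) / n                       # midpoint rule on the segment
    return sum(f(z0 + (k + 0.5) * h) * h for k in range(n))

f = cmath.exp
z0 = 0j

def F(zeta):
    return segment_integral(f, z0, zeta)    # here F(zeta) ~ exp(zeta) - 1

zeta, eps = 0.3 + 0.4j, 1e-4
approx = (F(zeta + eps) - F(zeta)) / eps
print("F'(zeta) ~", approx)
print("f(zeta)  =", f(zeta))                # the two values should agree
\end{verbatim}

Since $e^z$ is entire, any other path from $z_0$ to $\zeta$ would give the same value of $F(\zeta)$, exactly as the Cauchy integral theorem predicts.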
This follows immediately from the corollary in 9.4.6 and the local formula for the antiderivative, $F(z)=\sum_{n=0}^\infty\frac{1}{n+1}c_n(z-a)^{n+1}$, of the analytic function $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ on the individual convergence discs. It is related to the much more general concept of analytic extension, which we discuss now.

9.4.10. Uniqueness. At first glance, the power series locally representing an analytic function in a domain $\Omega$ should glue together. We deal first with the uniqueness issues.

Uniqueness theorem

Lemma. Consider an analytic function $f(z)$ in $\Omega$ and a sequence of its zeroes $a_n\in\Omega$ which has a limit point $a\in\Omega$. Then $f(z)=0$ everywhere in $\Omega$.

Proof. We start with a simple observation on the non-vanishing of analytic functions:

Claim. Let $f(z)\not\equiv0$ be analytic in $\Omega$, with $f(a)=0$ for some $a\in\Omega$. Then there exists $\varepsilon>0$ such that $f(z)\ne0$ for $0<|z-a|<\varepsilon$.

Indeed, in some neighbourhood of $a$, the analytic function $f(z)$ is represented by a power series $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$. Since $f(a)=0$, we have $c_0=0$. Let $c_k$ be the first non-zero coefficient in the series. Then $f(z)=(z-a)^kg(z)$, where $g(z)=\sum_{n=k}^\infty c_n(z-a)^{n-k}$ and $g(a)\ne0$. Therefore, by the continuity of $g(z)$, there exists a disc of radius $\varepsilon>0$ centered at $a$ in which $g(z)$ has no zeroes; consequently $f(z)\ne0$ for $0<|z-a|<\varepsilon$.

The lemma is now a simple corollary of the above claim. Under the assumptions, $f(z)$ must vanish identically on a nontrivial disc centered at $a$. Assume $f(w)\ne0$ for some $w\in\Omega$, and choose a path $\gamma$ with $\gamma(0)=a$, $\gamma(1)=w$. Define $t_0$ as the infimum of the nonempty set $\{t\in[0,1];\ f(\gamma(t))\ne0\}$. Then $t_0>0$, $f(\gamma(t))$ is identically zero for $t\in[0,t_0)$, and by continuity also $f(\gamma(t_0))=0$. Thus the above claim applies to $a=\gamma(t_0)$ and provides a punctured neighbourhood of $\gamma(t_0)$ free of zeroes of $f$, contradicting the vanishing of $f(\gamma(t))$ for $t<t_0$ arbitrarily close to $t_0$, and we are done. □

As a corollary, we see that any function $f(z)$ analytic in two concentric discs is represented in those discs by the same power series $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$, where $a$ is the common centre of those discs and the $c_k$ are the Taylor coefficients $c_k=\frac{1}{k!}f^{(k)}(a)$, $k=0,1,\ldots$.

9.4.11. Analytic extension. The basic idea for gluing non-zero power series together is very simple. Consider a power series $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ converging in $D=\{|z-a|<r\}$ and some point $b\in D$. If $|z-b|+|b-a|<r$, then $\sum_{n=0}^\infty|c_n|(|z-b|+|b-a|)^n$ converges, and thus on the smaller disc $D_b=\{|z-b|<s=r-|b-a|\}$ we may rewrite the power series $f(z)$ by expanding $(z-a)^n=(z-b+b-a)^n$:
$\sum_{n=0}^\infty c_n(z-a)^n=\sum_{n=0}^\infty\sum_{k=0}^n\binom nk c_n(b-a)^{n-k}(z-b)^k=\sum_{k=0}^\infty\Bigl(\sum_{n=k}^\infty\binom nk c_n(b-a)^{n-k}\Bigr)(z-b)^k,$
where all the series converge absolutely, so the order of summation is irrelevant.

Thus, writing $d_k=\sum_{n=k}^\infty\binom nk c_n(b-a)^{n-k}$, the new power series $f(z)=\sum_{k=0}^\infty d_k(z-b)^k$ converges at least on the disc $D_b$, and we call it the re-expansion of $\sum_{n=0}^\infty c_n(z-a)^n$ at the centre $b$. The re-expansion is guaranteed to converge for $|z-b|<r-|b-a|$; however, the radius of convergence of $\sum_{n=0}^\infty d_n(z-b)^n$ can be larger. The concept of analytic extension is based on this.

Analytic elements and extensions

An analytic element $\Phi$ with centre $a\in\mathbb C$ is a pair $\Phi=(D,f)$, where $D$ is some disc $D=\{|z-a|\le R\}$ and $f$ is a convergent power series in $D$. The element $\Phi$ is called canonical if $R$ is the radius of convergence of $f$ at $a$.
Elements $\Phi_1=(D_1,f_1)$ and $\Phi_2=(D_2,f_2)$ are called immediate extensions of each other if $D_1\cap D_2\ne\emptyset$ and $f_1(z)=f_2(z)$ on $D_1\cap D_2$. An element $\Psi$ is called an analytic extension of $\Phi$ along the chain $\Phi_0=\Phi,\Phi_1,\ldots,\Phi_n=\Psi$ if $\Phi_{j+1}$ is an immediate analytic extension of $\Phi_j$, $j=0,1,\ldots,n-1$.

A parametric family $\Phi_t=(D_t,f_t)$ of canonical elements is called an analytic extension of $\Phi_0$ along a path $\gamma:[0,1]\to\mathbb C$ if
(i) for all $t\in[0,1]$ the radii of convergence $R_t$ of the discs $D_t$ are positive and the centres of $\Phi_t$ are $a_t=\gamma(t)$; and
(ii) for all $\tau\in[0,1]$ there exist open intervals $U_\tau\subset[0,1]$ containing $\tau$ such that for all $t\in U_\tau$, $\gamma(t)\in D_\tau$ and $\Phi_t$ is an immediate extension of $\Phi_\tau$.

For example, the pair
(1) $\log z=\Bigl(|z-1|<1,\ \sum_{n=1}^\infty\frac{(-1)^{n-1}}{n}(z-1)^n\Bigr)$
is the canonical element which restricts to the standard real logarithm, centered at $1$.

Obviously, if $\Psi=(D_2,f_2)$ is an immediate extension of $\Phi=(D_1,f_1)$ and the centre $a_2$ of $\Psi$ lies in $D_1$, then the series $f_2$ is just the re-expansion of $f_1$ around $a_2$. This follows from the uniqueness theorem and the possibility of re-expansion.

Lemma. Assume that the element $\Phi_1=(D_1,f_1)$ is an immediate extension of $\Phi_0=(D_0,f_0)$, and that $\Phi_2=(D_2,f_2)$ is an immediate extension of $\Phi_1$. Then $\Phi_2$ is also an immediate extension of $\Phi_0$, provided $D_0\cap D_1\cap D_2\ne\emptyset$.

If $\Phi_t$ and $\Psi_t$ are two analytic extensions of a canonical element $\Phi_0=\Psi_0$ along the same path $\gamma:[0,1]\to\mathbb C$, then $\Psi_t=\Phi_t$ for all $t\in[0,1]$.

Proof. Clearly $f_0=f_1=f_2$ on the non-empty open subset $D_0\cap D_1\cap D_2$ of $D_0\cap D_2$. By the uniqueness theorem, $f_2=f_0$ everywhere in $D_0\cap D_2$, which proves the first claim.

Next, consider $E=\{t\in[0,1]:\Psi_t=\Phi_t\}$. Since $0\in E$, the set $E$ is not empty. It is open, because $U^\Phi_\tau\cap U^\Psi_\tau\subset E$ for all $\tau\in E$. Further, if $t_0\in[0,1]$ is a limit point of $E$, then the elements $\Phi_{t_0}$ and $\Psi_{t_0}$ are immediate extensions of $\Phi_{t_1}=\Psi_{t_1}$ for suitable $t_1\in U^\Phi_{t_0}\cap U^\Psi_{t_0}\cap E$, $t_1<t_0$. Since $\Phi_{t_0}$ and $\Psi_{t_0}$ have the common centre $\gamma(t_0)$, they must be equal. So $E$ is closed and thus $E=[0,1]$, i.e. $\Phi_t=\Psi_t$ for all $t$. □

9.4.12. Technical observations. In a sequence of simple observations we show that an analytic extension along a path can always be obtained by analytic extension along some chain of elements, and vice versa. First, consider a family $\Phi_t$ of elements that extend $\Phi_0$ along a path $\gamma:[0,1]\to\mathbb C$, and write $R(t)$ for the corresponding radii of convergence.

Proposition. If $R(\tau)=\infty$ for some $\tau\in[0,1]$, then it is infinite for all $t$. If finite, then $R(t)$ is continuous on $[0,1]$.

Proof. If $R(\tau)=\infty$, then each element $\Phi_t$ is a re-expansion of $\Phi_\tau$, and therefore $R(t)=\infty$ for all $t$. On the other hand, if $R(\tau)<\infty$ for some $\tau$, then for all $t\in U_\tau$ the circles $\{|z-\gamma(t)|=R(t)\}$ and $\{|z-\gamma(\tau)|=R(\tau)\}$ intersect in a pair of points, because neither of these circles lies inside the other one. Let $w$ be one such point of intersection. Then the triangle with vertices $w$, $\gamma(\tau)$, $\gamma(t)$ yields
$|R(t)-R(\tau)|<|\gamma(t)-\gamma(\tau)|.$
Since $\gamma$ is uniformly continuous on $[0,1]$, $R(t)$ is continuous on $U_\tau$, and thus also on $[0,1]$. □

Lemma. Consider an analytic extension $\{\Phi_t,\ t\in[0,1]\}$ of $\Phi_0$ along a path $\gamma$. There exist finitely many intermediate points $0=t_0<t_1<\ldots<t_n=1$ such that $\Phi_1$ is obtained by analytic extension along the chain of elements $\Phi_0=\Phi_{t_0},\Phi_{t_1},\ldots,\Phi_{t_n}=\Phi_1$. Conversely, if a canonical element $\Psi$ extends $\Phi$ along the chain
$\Phi=\Phi_0,\Phi_1,\ldots,\Phi_n=\Psi$ and $\gamma:[0,1]\to\mathbb C$ is the piecewise linear path through the centres of $\Phi_0,\Phi_1,\ldots,\Phi_n$, then there is a family $\{\Psi_t\}$ of canonical elements extending $\Psi_0=\Phi_0$ along $\gamma$ such that $\Psi_1=\Phi_n=\Psi$.

Proof. Since $R(t)>0$ for all $t$ and is continuous, it is separated from zero by the compactness of $[0,1]$; therefore $R(t)>c$ for some constant $c>0$. Uniform continuity of $\gamma(t)$ yields $\delta>0$ such that $|t_2-t_1|<\delta$ implies $|\gamma(t_2)-\gamma(t_1)|<c$. Since the intervals $J_t=U_t\cap(t-\frac\delta2,t+\frac\delta2)$ cover $[0,1]$, one can choose a finite subcover $J_{t_1},\ldots,J_{t_{n-1}}$ for some $t_1<\ldots<t_{n-1}$, to which we append $t_0=0$ and $t_n=1$ if these terminal points are missing from the sequence. Then $|\gamma(t_{j+1})-\gamma(t_j)|<c$, so the centres of $\Phi_{t_{j+1}}$ and $\Phi_{t_j}$ lie in each other's discs of convergence. Hence these elements are re-expansions, and thus immediate extensions, of each other.

For the converse, as there are finitely many straight segments in $\gamma(t)$, it is sufficient to consider the case $n=1$; the rest follows by induction. Thus $\gamma(t)$ is a segment connecting $\gamma(0)$ and $\gamma(1)$, lying in $D_0\cup D_1$ with $D_0\cap D_1\ne\emptyset$. Define $\Phi_t=(D_t,f_t)$ as the re-expansion of either $\Phi_0$ or $\Phi_1$, depending on whether $\gamma(t)$ lies in $D_0$ or in $D_1$. If $\gamma(t)\in D_0\cap D_1$, then the re-expansions of both $\Phi_0$ and $\Phi_1$ at the centre $\gamma(t)$ determine the same canonical element, as noticed above. The interval $U_\tau$ can be chosen so that $\gamma(t)$, $t\in U_\tau$, lies entirely in either $D_0$ or $D_1$; then $\Phi_t$ for $t\in U_\tau$ is an immediate extension of $\Phi_\tau$. □

9.4.13. Monodromy theorem. In general, it is very difficult to say whether there is an analytic extension of a given analytic element along a given path. But we may quite easily find conditions under which existing extensions along paths $\gamma_0$ and $\gamma_1$ with a common beginning and end must coincide.

Homotopic paths

We say that two paths $\gamma_0(t)$ and $\gamma_1(t)$ with common terminal points $\gamma_0(0)=\gamma_1(0)=a$ and $\gamma_0(1)=\gamma_1(1)=b$ are homotopic if there exists a continuous function $\gamma:[0,1]\times[0,1]\to\mathbb C$ such that $\gamma(0,t)=\gamma_0(t)$, $\gamma(1,t)=\gamma_1(t)$, $\gamma(s,0)=a$, $\gamma(s,1)=b$ for all $s\in[0,1]$. We say that the paths $\gamma_s(t)=\gamma(s,t)$ provide a homotopic deformation of $\gamma_0$ to $\gamma_1$.

The following theorem says that homotopic paths lead to the same analytic extensions.

Monodromy theorem

Theorem. Suppose that a canonical element $\Phi_0=(D_0,f_0)$ centred at $a$ can be analytically extended along every path $\gamma_s$ in a homotopic deformation. Then the extensions along all of these paths terminate with the same canonical element $\Phi_1$.

Proof. Write $\Phi_{st}$, $s\in[0,1]$, $t\in[0,1]$, for the canonical elements of the extension of $\Phi_{00}=\Phi_0$ along $\gamma_s(t)$, and let $R_s(t)$ be the radius of convergence of $\Phi_{st}$. Since $R_s(t)>0$ for all $(s,t)\in[0,1]\times[0,1]$, there exists $\rho>0$ such that $R_s(t)>\rho$ for all $s$ and $t$. Notice that $\gamma(s,t)$ is uniformly continuous; thus, fixing $s_0\in[0,1]$, we may choose an interval $V_{s_0}$ around $s_0$ on the $s$-axis such that
$\max_{t\in[0,1]}|\gamma(s,t)-\gamma(s_0,t)|<\frac\rho4$ for all $s\in V_{s_0}$.
Then, for all $s\in V_{s_0}$, the result of the analytic extension remains the same, since every element $\Phi_{st}$ centred at $\gamma(s,t)$ is a re-expansion of $\Phi_{s_0t}$, because the centre $\gamma(s,t)$ lies in the disc of convergence of $\Phi_{s_0t}$. Thus extensions along $\gamma_s(t)$ and $\gamma_{s_0}(t)$ produce the same terminal element.

Consider the set $E=\{s\in[0,1]:\Phi_{s1}=\Phi_{01}\}$. Clearly $E$ is not empty, as it contains $s=0$. By the previous argument, $E$ is open. We shall see that $E$ is also closed. Let $s_0$ be a limit point of $E$ and consider the interval $V_{s_0}$ as constructed above.
Then there is $s'\in V_{s_0}\cap E$, the extensions along $\gamma_{s'}(t)$ and $\gamma_{s_0}(t)$ coincide, and so $s_0\in E$. Hence $E=[0,1]$, which proves the theorem. □

Corollary (Monodromy theorem). Consider a simply connected region $\Omega\subset\mathbb C$ and a canonical element $\Phi=(D,f)$ centred at $a\in\Omega$. Suppose that $\Phi$ extends along any path $\gamma\subset\Omega$ through $a$. Then for any $b\in\Omega$, the extension of $\Phi$ along any path terminating at $b$ is independent of the path, i.e. it produces the same analytic element for each such path. Thus an analytic extension of $\Phi$ to every point in $\Omega$ generates an analytic function which is represented as a convergent power series in any disc inscribed in $\Omega$.

9.4.14. Remarks. We look at some simple examples where analytic extension is crucial. Consider the function $f(z)=\sqrt z$. Clearly we may choose the two different options $f(1)=\pm1$, and each of the choices leads to a canonical element. Their analytic extensions are called the branches of the multivalued complex function $f$. Notice that $\sqrt z$ is not analytic at the origin, since its derivative blows up to infinity there. Intuitively, it seems that two closed paths in $\mathbb C\setminus\{0\}$ are homotopic if and only if they run around the origin the same number of times (the winding number). We may imagine what happens to the values if we move $z$ along a circle $z=r\,e^{i\theta}$. The two initial options lead to
$f_1(z)=\sqrt r\,e^{i\theta/2},\qquad f_2(z)=\sqrt r\,e^{i(\pi+\theta/2)}.$
Once we run $\theta$ from $0$ to $2\pi$, the values of the two branches swap. See ?? for more observations on root functions (more in the other column).

Another very important example is $f(z)=z^{-1}$. Since $f(z)$ integrates to $2\pi i$ over each circle centered at the origin, there cannot exist an antiderivative of $f$ along any such circle. But locally, the antiderivative is the logarithm function $\log z$. The canonical element 9.4.11(1) extends to one of infinitely many branches, and running along a circle changes its value by the constant $2\pi i$.

We return to general analytic functions. As promised a few pages back, the monodromy theorem implies that for any analytic function $f(z)$ in a simply connected region $\Omega$, there exists an analytic function representing its antiderivative. Moreover, on each analytic element of $f$, this is just the antiderivative of the power series representing the function on the disc. Now we may also deduce the Cauchy integral theorem for simply connected regions $\Omega$ in another way: the integral along a closed path is given by the difference of the values of the antiderivative at the terminal points; since the closed path is homotopic to a point, the integral vanishes.

9.4.15. Cauchy theorem for the third time. The Cauchy integral theorem holds also for analytic functions on domains $\Omega$ which are not simply connected. A bounded open domain $\Omega\subset\mathbb C$ is said to have a regular boundary $\partial\Omega$ if the set of boundary points $\partial\Omega$ consists of finitely many piecewise smooth and mutually disjoint Jordan curves $\gamma_0,\gamma_1,\ldots,\gamma_n$. Notice that the Jordan curves in the boundary divide the complex plane into $n+2$ connected components. Just one of them is unbounded, one of them coincides with $\Omega$, and all the others are bounded "holes" inside $\Omega$. We write $\gamma_0$ for the oriented exterior boundary, i.e. the boundary of the connected unbounded component of $\mathbb C\setminus\Omega$, oriented counterclockwise, while $\gamma_1,\ldots,\gamma_n$ form the oriented interior boundary, i.e. these curves are oriented clockwise.

Cauchy integral theorem

Theorem.
Let $\Omega\subset\mathbb C$ be a bounded region with regular boundary and let $f(z)$ be analytic on the closure $\overline\Omega$ (i.e. analytic on some domain containing $\overline\Omega$). Then
$\int_{\partial\Omega}f(z)\,dz=0.$

Proof. With our choice of orientation of the boundary, we must prove
$\int_{\partial\Omega}f(z)\,dz=\int_{\gamma_0}f(z)\,dz+\sum_{j=1}^n\int_{\gamma_j}f(z)\,dz=0.$
We proceed by induction on $n$. If $n=0$, then $\Omega$ is simply connected and the theorem is already proved, see Theorem 9.4.8. If $n=1$, then there is exactly one interior part $\gamma_1$ of the boundary, and we may choose two smooth paths $\mu_1$ and $\mu_2$ joining, say, the leftmost and the rightmost points of $\gamma_0$ and $\gamma_1$, respectively. This splits $\Omega$ into two simply connected regions $\Omega_+$ (say the upper one) and $\Omega_-$ (the lower one), with boundaries $\partial\Omega_+=\gamma_0^+\circ\mu_1\circ\gamma_1^+\circ\mu_2^{-1}$ and $\partial\Omega_-=\gamma_0^-\circ\mu_2\circ\gamma_1^-\circ\mu_1^{-1}$. At the same time,
$\int_{\partial\Omega}f(z)\,dz=\int_{\partial\Omega_+}f(z)\,dz+\int_{\partial\Omega_-}f(z)\,dz,$
since the integrations over $\mu_i$ and $\mu_i^{-1}$, $i=1,2$, cancel each other on the right-hand side. Moreover, the boundaries $\partial\Omega_\pm$ are again piecewise differentiable Jordan curves, and therefore both integrals on the right-hand side vanish.

The general induction step is completely analogous. If $n>1$, we find one of the interior boundaries $\gamma_i$ closest to $\gamma_0$ and choose two cuts $\mu_1$, $\mu_2$ so that one of the two newly created components of $\Omega$ is simply connected. The other component then has one less interior boundary, and the theorem follows by induction. □

[A diagram with the cuts is needed here; see the illustration.]

9.4.16. Cauchy integral formula. Consider an open ball without its center, $\Omega=B(z_0,r)\setminus\{z_0\}$, an analytic function $f(z)$ in $\Omega$, and positively oriented Jordan curves $\gamma\subset\Omega$ containing $z_0$ in their interiors. Due to the Cauchy integral theorem, the integral $\int_\gamma f(z)\,dz$ does not depend on the choice of such $\gamma$. Indeed, the region enclosed between two such choices $\gamma_1$, $\gamma_2$ is bounded by them, but with opposite orientations; thus the vanishing of the integral over its boundary means that the integrals over $\gamma_1$ and $\gamma_2$ are actually equal.

Next, recall that the integral of $z^{-1}$ over any circle centered at the origin is $2\pi i$, see 9.4.6(2). These observations suggest the following essential formula (we may expect that $f(\zeta)$ behaves much like the constant $f(z)$ for very small $\gamma$):

Cauchy integral formula

Theorem. Let $f(z)$ be an analytic function on the closure of a region $\Omega\subset\mathbb C$ with regular boundary. Then for all $z$ in $\Omega$,
$f(z)=\frac{1}{2\pi i}\int_{\partial\Omega}\frac{f(\zeta)}{\zeta-z}\,d\zeta.$

Proof. Fix $z\in\Omega$ and consider an open disc $D_\rho=\{\zeta\in\mathbb C;\ |\zeta-z|<\rho\}$ lying inside $\Omega$. The function $g(\zeta)=\frac{f(\zeta)}{\zeta-z}$ is analytic on the closure of $\Omega\setminus\overline{D_\rho}$. Adopting the counterclockwise orientation of the boundary of $D_\rho$, the Cauchy integral theorem implies
$\int_{\partial\Omega}\frac{f(\zeta)}{\zeta-z}\,d\zeta=\int_{\partial D_\rho}\frac{f(\zeta)}{\zeta-z}\,d\zeta.$
We aim at showing that $\int_{\partial D_\rho}\frac{f(\zeta)}{\zeta-z}\,d\zeta=2\pi if(z)$. We know $2\pi if(z)=\int_{\partial D_\rho}\frac{f(z)}{\zeta-z}\,d\zeta$. Thus we consider
$\int_{\partial D_\rho}\frac{f(\zeta)}{\zeta-z}\,d\zeta-2\pi if(z)=\int_{\partial D_\rho}\frac{f(\zeta)-f(z)}{\zeta-z}\,d\zeta$
and estimate
$\Bigl|\int_{\partial D_\rho}\frac{f(\zeta)-f(z)}{\zeta-z}\,d\zeta\Bigr|\le\max_{|\zeta-z|=\rho}\frac{2\pi\rho\,|f(\zeta)-f(z)|}{\rho}=2\pi\max_{|\zeta-z|=\rho}|f(\zeta)-f(z)|.$
Clearly, the right-hand side approaches zero as $\rho\to0$. Since the left-hand side does not depend on $\rho$, we conclude $\int_{\partial D_\rho}\frac{f(\zeta)-f(z)}{\zeta-z}\,d\zeta=0$ and the formula in the theorem is verified. □

Notice that if we consider $z\in\mathbb C\setminus\overline\Omega$ in the above theorem, then the function $\frac{f(\zeta)}{\zeta-z}$ is analytic in $\Omega$ and thus the integral vanishes by the Cauchy integral theorem.

9.4.17. Corollaries. Taking consecutive derivatives with respect to $z$ in the above formula, we obtain expressions for all derivatives of $f(z)$:

Cauchy integral formula for derivatives

Corollary.
Let $f(z)$ be an analytic function on the closure of a region $\Omega\subset\mathbb C$ with regular boundary. Then for all $z$ in $\Omega$,
$f^{(n)}(z)=\frac{n!}{2\pi i}\int_{\partial\Omega}\frac{f(\zeta)}{(\zeta-z)^{n+1}}\,d\zeta.$

Proof. Indeed, $z$ is an independent argument in the smooth integrand; thus we may differentiate under the integral sign, which yields the formula. □

Applying the Cauchy integral formula to the disc $D_r=\{|z-a|<r\}$, we obtain:

Mean value theorem

Theorem. For $f(z)$ analytic on the closure of $D_r$, the value at the centre of the disc can be evaluated as
$f(a)=\frac{1}{2\pi}\int_0^{2\pi}f(a+r\,e^{i\theta})\,d\theta.$

Proof. By the Cauchy integral formula,
$f(a)=\frac{1}{2\pi i}\int_{|\zeta-a|=r}\frac{f(\zeta)}{\zeta-a}\,d\zeta.$
Substituting $\zeta=a+r\,e^{i\theta}$ and $d\zeta=ir\,e^{i\theta}\,d\theta$, we obtain $f(a)=\frac{1}{2\pi}\int_0^{2\pi}f(a+r\,e^{i\theta})\,d\theta$. □

9.4.18. Laurent series. Already in 6.3.10 we noticed that quotients $\frac{f(z)}{g(z)}$ of two polynomials enjoy quite a nice expansion similar to power series. We called a series of the form $\sum_{n=-\infty}^\infty c_n(z-a)^n$ a Laurent series. Now we do the same with complex arguments and coefficients. The part of the Laurent series with non-negative powers, $\sum_{n=0}^\infty c_n(z-a)^n$, is called the regular part, while the remaining part $\sum_{n=-1}^{-\infty}c_n(z-a)^n$, consisting of the negative powers of $(z-a)$, is called its principal part. A Laurent series is called convergent if both its regular and principal parts converge.

Laurent series

Theorem. Every function $f(z)$ analytic in the annulus $A=\{r<|z-a|<R\}$, with $0\le r<R\le\infty$, admits a representation by a Laurent series
(1) $f(z)=\sum_{n=-\infty}^\infty c_n(z-a)^n,$
where the coefficients $c_n$ can be calculated by
(2) $c_n=\frac{1}{2\pi i}\int_{|\zeta-a|=\rho}\frac{f(\zeta)}{(\zeta-a)^{n+1}}\,d\zeta,\qquad n\in\mathbb Z,\ r<\rho<R.$
The coefficients $c_n$, called the Laurent coefficients of $f(z)$ in $A$, are uniquely determined; in particular they do not depend on $\rho$.

Proof. Choose $z\in A$ and $r'$, $R'$ such that $r<r'<|z-a|<R'<R$. The function $f(z)$ is then analytic on the closure of the annulus $A'=\{r'<|z-a|<R'\}$. Therefore, by the Cauchy integral formula, adopting the counterclockwise orientation on both circles, $f(z)$ is a difference of two integrals, $f(z)=J_1-J_2$,
$f(z)=\frac{1}{2\pi i}\int_{|\zeta-a|=R'}\frac{f(\zeta)}{\zeta-z}\,d\zeta-\frac{1}{2\pi i}\int_{|\zeta-a|=r'}\frac{f(\zeta)}{\zeta-z}\,d\zeta.$
If $|\zeta-a|=R'$, then $|z-a|<|\zeta-a|$ and we may expand
$\frac{f(\zeta)}{\zeta-z}=\frac{f(\zeta)}{\zeta-a}\cdot\frac{1}{1-\frac{z-a}{\zeta-a}}=\sum_{n=0}^\infty\frac{f(\zeta)}{(\zeta-a)^{n+1}}(z-a)^n.$
Next, we can estimate
$\Bigl|\frac{f(\zeta)}{(\zeta-a)^{n+1}}(z-a)^n\Bigr|\le\frac{\max_{|\zeta-a|=R'}|f(\zeta)|}{R'}\Bigl(\frac{|z-a|}{R'}\Bigr)^n,$
and thus the series is uniformly convergent and admits term-by-term integration. Therefore
$J_1=\sum_{n=0}^\infty c_n(z-a)^n,\qquad c_n=\frac{1}{2\pi i}\int_{|\zeta-a|=R'}\frac{f(\zeta)}{(\zeta-a)^{n+1}}\,d\zeta.$
If $|\zeta-a|=r'$, then $|\zeta-a|<|z-a|$ and, similarly to the above, the expansion
$\frac{f(\zeta)}{\zeta-z}=\frac{f(\zeta)}{z-a}\cdot\frac{1}{\frac{\zeta-a}{z-a}-1}=-\sum_{n=0}^\infty\frac{f(\zeta)}{(z-a)^{n+1}}(\zeta-a)^n$
leads (via term-by-term integration) to the equality
$-J_2=\sum_{n=-\infty}^{-1}c_n(z-a)^n,\qquad c_n=\frac{1}{2\pi i}\int_{|\zeta-a|=r'}\frac{f(\zeta)}{(\zeta-a)^{n+1}}\,d\zeta.$
Thus we have obtained the Laurent series representation $f(z)=\sum_{n=-\infty}^\infty c_n(z-a)^n$, as requested.

On the other hand, if we are given a Laurent series (1), then fixing an arbitrary $n\in\mathbb Z$, multiplying the formula by $(z-a)^{-(n+1)}$ and integrating over $|z-a|=\rho$, we obtain
$\int_{|z-a|=\rho}\frac{f(z)}{(z-a)^{n+1}}\,dz=2\pi i\,c_n.$
The circle $\{|z-a|=\rho\}$ with $r<\rho<R$ was chosen arbitrarily; in particular, we see that the integrals $\int_{|z-a|=\rho}\frac{f(z)}{(z-a)^{n+1}}\,dz$ cannot depend on $\rho$. □

9.4.19. Remarks on convergence.
Given a Laurent series, its regular part $\sum_{n=0}^\infty c_n(z-a)^n$ is a power series that converges absolutely, and uniformly on compact sets, in its disc of convergence $\{|z-a|<R\}$ with $\frac1R=\limsup_{n\to\infty}|c_n|^{1/n}$, see the Cauchy–Hadamard formula in the theorem in 9.4.2. The principal part $\sum_{n=-1}^{-\infty}c_n(z-a)^n$ becomes a power series $\sum_{n=1}^\infty c_{-n}w^n$ after the coordinate change $w=\frac{1}{z-a}$, and this series converges for $|w|<\frac1r$ with $r=\limsup_{n\to\infty}|c_{-n}|^{1/n}$. Thus we have verified:

Proposition. For any set of coefficients $\{c_n,\ n\in\mathbb Z\}$, set
$\frac1R=\limsup_{n\to\infty}|c_n|^{1/n},\qquad r=\limsup_{n\to\infty}|c_{-n}|^{1/n}.$
Then the Laurent series $f(z)=\sum_{n=-\infty}^\infty c_n(z-a)^n$ converges absolutely and uniformly on any compact set in the annulus $A=\{r<|z-a|<R\}$, and it is analytic in $A$. If $|z-a|<r$, then the principal part $\sum_{n=-1}^{-\infty}c_n(z-a)^n$ diverges, while the regular part $\sum_{n=0}^\infty c_n(z-a)^n$ diverges for $|z-a|>R$.

9.4.20. Link to Fourier series. There is a very interesting link between Laurent and Fourier series. If $f$ is analytic in $A=\{1-\rho<|z|<1+\rho\}$ for some $\rho>0$, then its $n$-th Laurent coefficient is
$c_n=\frac{1}{2\pi i}\int_{|z|=1}\frac{f(z)}{z^{n+1}}\,dz=\frac{1}{2\pi}\int_0^{2\pi}f(e^{it})\,e^{-int}\,dt.$
Therefore $c_n$ is the $n$-th Fourier coefficient of $\varphi(t)=f(e^{it})$ on $t\in[0,2\pi]$, and the Fourier series of $f(e^{it})$ converges uniformly to $f(e^{it})$ on $[0,2\pi]$.

9.4.21. Liouville theorem. The formula 9.4.18(2) for the Laurent coefficients yields the following Cauchy inequalities:
(1) $|c_n|=\Bigl|\frac{1}{2\pi i}\int_{|z-a|=\rho}\frac{f(z)}{(z-a)^{n+1}}\,dz\Bigr|\le\frac{\max_{|z-a|=\rho}|f(z)|}{\rho^n}$
for all $n\in\mathbb Z$. As a straightforward consequence we obtain the following

Liouville theorem

Theorem. If $f(z)$ is analytic in $\mathbb C$ and bounded, i.e. $|f(z)|\le M$ for some constant $M$ and all $z\in\mathbb C$, then $f(z)$ is constant.

Proof. The Cauchy inequalities applied to $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$ yield $|c_n|\le\frac{M}{R^n}$, $n>0$, for any $R>0$. Thus $c_n=0$ for all $n\ge1$, and consequently $f(z)=c_0$. □

9.4.22. Isolated singularities. We look at typical examples of analytic functions around "suspicious" points. Consider the fraction $f(z)=\frac{\sin z}{z}$. The origin is a zero of both $\sin z$ and $z$, and since they behave very similarly for small $z$, we see $\lim_{z\to0}f(z)=1$.

On the other hand, $f(z)=\frac1z$ grows towards infinity, $\lim_{z\to0}\frac1z=\infty$, in the sense of the extended complex plane $\overline{\mathbb C}=\mathbb C\cup\{\infty\}$ (also called the Riemann sphere). We can imagine $\overline{\mathbb C}$ as the sphere with the stereographic projection onto the plane $\mathbb C$, see the picture. [Picture of the Riemann sphere $\overline{\mathbb C}$ missing.] Then clearly $\lim_{z\to a}f(z)=\infty$ if and only if $\lim_{z\to a}|f(z)|=\infty$ in the sense of standard analysis in real variables. It might easily happen that the limit does not exist at all, see the theorem below. For example, take $f(z)=e^{1/z}$ around the point $z=0$. It is given by a Laurent series with infinite principal part,
$f(z)=\sum_{n=0}^\infty\frac{1}{n!}z^{-n}.$
In general, we talk about isolated singular points:

Isolated singularities

If $f(z)$ is analytic in a punctured neighbourhood $V=\{0<|z-a|<\rho\}$, then $a$ is called an isolated singular point of $f(z)$. We say that the singular point is
• removable, if there is a finite limit $\lim_{z\to a}f(z)=b\in\mathbb C$;
• a pole, if $\lim_{z\to a}f(z)=\infty$;
• an essential singularity, if $\lim_{z\to a}f(z)$ does not exist in $\overline{\mathbb C}$.
A function $f(z)$ with only isolated singularities in a domain $\Omega\subset\mathbb C$, and without any essential singularities, is called a meromorphic function in $\Omega$.
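The coefficient formula 9.4.18(2), read through the Fourier link of 9.4.20, invites a quick numerical experiment. The following Python sketch (our illustration, not part of the text; names are ours) approximates the Laurent coefficients of $e^{1/z}$, whose infinite principal part was just noted, by a Riemann sum over the unit circle; the expected values are $c_{-n}=\frac{1}{n!}$ for $n\ge0$ and $c_n=0$ for $n>0$.

\begin{verbatim}
# Sketch: Laurent coefficients as Fourier coefficients of f(e^{it}).
import cmath, math

def laurent_coeff(f, n, m=4096):
    # c_n ~ (1/2pi) * sum over a uniform grid of f(e^{it}) e^{-int}
    s = sum(f(cmath.exp(1j * t)) * cmath.exp(-1j * n * t)
            for t in (2 * math.pi * k / m for k in range(m)))
    return s / m

f = lambda z: cmath.exp(1 / z)
for n in (-3, -2, -1, 0, 1, 2):
    print(n, f"{laurent_coeff(f, n):.6f}")
# expected: 1/3! = 0.1667, 1/2! = 0.5, 1, 1, then zeros for n > 0
\end{verbatim}

Because the integrand is analytic on the unit circle, the uniform-grid sum converges extremely fast, so even modest values of m recover the coefficients to many digits.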
The function $f(z)=\tan\frac1z$ provides an example of a non-isolated singularity at $a=0$, as $0$ is the limit of the poles $\bigl(\frac\pi2+n\pi\bigr)^{-1}$, $n\in\mathbb Z$, of $f(z)$. On the other hand, all rational functions $f(z)/g(z)$ are meromorphic in $\mathbb C$.

The following theorem classifies isolated singularities and poles in terms of Laurent series.

Theorem. The following properties are equivalent:
• the point $z=a$ is a removable singularity of $f(z)$;
• $|f(z)|$ is bounded in some punctured neighbourhood $V=\{0<|z-a|<\rho\}$;
• the Laurent series of $f(z)$ in $V=\{0<|z-a|<\rho\}$ is a Taylor series $f(z)=\sum_{n=0}^\infty c_n(z-a)^n$, i.e. the principal part vanishes;
• $f(a)$ can be defined so that $f(z)$ becomes analytic in $\{|z-a|<\rho\}$.
Further, the point $z=a$ is a pole of $f(z)$ if and only if the principal part of the Laurent series of $f(z)$ in $\{0<|z-a|<\rho\}$ contains only finitely many terms, i.e. $f(z)=\sum_{n=-N}^\infty c_n(z-a)^n$ for some $N\in\mathbb N$ (the smallest $N$ with this property is called the order of the pole at $z=a$).
Finally, the Laurent series $f(z)=\sum_{n=-\infty}^\infty c_n(z-a)^n$ in a punctured neighbourhood of $a$ contains infinitely many terms with non-zero coefficients $c_n$, $n<0$, if and only if $z=a$ is an essential singularity of $f(z)$.

Proof. If $|f(z)|\le M$ for $0<|z-a|<\rho$, then by the Cauchy inequalities 9.4.21(1), $|c_{-n}|\le M\varepsilon^n$, $n>0$, for all $0<\varepsilon<\rho$. Therefore all coefficients with negative indices vanish, and
$f(z)=\sum_{n=0}^\infty c_n(z-a)^n.$
Defining $f(a)=c_0$, we obtain a power series that converges in the entire disc $\{|z-a|<\rho\}$. This implies the equivalence of the four conditions in the first part of the theorem.

By the definition of a pole, $f(z)\ne0$ in some punctured disc $D=\{0<|z-a|<\rho'<\rho\}$ around $a$, since $\lim_{z\to a}f(z)=\infty$. Therefore $g(z)=\frac{1}{f(z)}$ is also analytic in $D$ and $\lim_{z\to a}g(z)=0$. Hence $g(z)$ becomes analytic in $D\cup\{a\}$ after setting $g(a)=0$, and therefore $g(z)=(z-a)^Nh(z)$ for some integer $N$ and an analytic function $h(z)$ with $h(a)\ne0$. Thus $\frac{1}{h(z)}$ is also analytic on a neighborhood of $a$, and therefore
$f(z)=\frac{1}{(z-a)^N}\sum_{n=0}^\infty c_n(z-a)^n.$
Conversely, if $f(z)=\frac{1}{(z-a)^N}h(z)$, where $h(a)\ne0$, then $\lim_{z\to a}f(z)=\infty$ and $a$ is a pole of $f$.

Finally, an isolated singularity of $f(z)$ is neither removable nor a pole if and only if the principal part of its Laurent series is infinite, and this observation finishes the proof. □

9.4.23. Some consequences. There are several straightforward corollaries of our classification of isolated singularities. In particular, if $\lim_{z\to a}f(z)$ does not exist, then $f(z)$ has really chaotic behaviour:

Theorem. If $a\in\mathbb C$ is an essential singularity of $f(z)$, then for any $w\in\overline{\mathbb C}$ there is a sequence $z_n\to a$ such that $\lim_{n\to\infty}f(z_n)=w$.

Proof. Let $w=\infty$. Since the singularity $z=a$ is not removable, $f(z)$ cannot be bounded in any punctured neighbourhood of $a$. So there exists a sequence $z_n\to a$ such that $\lim_{n\to\infty}f(z_n)=\infty$.

For $w\in\mathbb C$, if every punctured neighbourhood of $a$ contains a point $z$ such that $f(z)=w$, then collecting such points we obtain a sequence $z_n\to a$ with $f(z_n)=w$, as required. If there is a punctured neighbourhood of $a$ where $f(z)\ne w$, then $g(z)=\frac{1}{f(z)-w}$ also has an isolated singularity at $z=a$, which can be neither a pole nor removable, as otherwise $f(z)=w+\frac{1}{g(z)}$ would have a limit as $z\to a$. Therefore $z=a$ is an essential singularity of $g(z)$, and thus there is a sequence $z_n\to a$ such that $\lim_{n\to\infty}g(z_n)=\infty$, which implies that $\lim_{n\to\infty}f(z_n)=w$.
□

We say that $\infty\in\overline{\mathbb C}$ is an isolated singularity of $f(z)$ if $f(z)$ is analytic in $\{|z-a|>R\}$ for some $R>0$. The following are straightforward consequences of the Liouville theorem when $\infty$ is the only singularity of $f(z)$:

Corollary. If $f(z)$ is analytic in $\mathbb C$ and $z=\infty$ is a removable singularity of $f(z)$, then $f(z)$ is constant. If $f(z)$ is analytic in $\mathbb C$ and $z=\infty$ is a pole, then $f(z)$ is a polynomial, $f(z)=\sum_{j=0}^Nc_jz^j$.

Proof. The first claim is a simple reformulation of the Liouville theorem, cf. 9.4.21. To deal with the other claim, consider $g(w)=f(\frac1w)$. Then $w=0$ is a pole of $g(w)$. Let $P(w)=\sum_{j=1}^Nc_jw^{-j}$ be the principal part of the Laurent series of $g(w)$. Thus $h(w)=g(w)-P(w)$ is analytic in $\mathbb C$ with a removable singularity at $w=0$. Moreover, $\lim_{w\to\infty}h(w)=\lim_{w\to\infty}g(w)=f(0)$. Thus $|h(w)|$ is bounded, and by the Liouville theorem $h(w)=\mathrm{const}=f(0)=c_0$. Hence $f(z)=g(z^{-1})=\sum_{j=0}^Nc_jz^j$, which is a polynomial in $z$. □

9.4.24. Residues. Next we return to the Cauchy integral theorem, armed with our knowledge of isolated singularities. The residue of an analytic function $f(z)$ at an isolated singular point $a\in\mathbb C$ (with $f$ analytic in $\{0<|z-a|<\rho\}$) is defined as
$\operatorname{res}_af=\frac{1}{2\pi i}\int_{|z-a|=r}f(z)\,dz,$
where $0<r<\rho$. Obviously, the definition does not depend on the choice of $r$.

Residue theorem

Theorem. If $f(z)$ is represented by the Laurent series $\sum_{n=-\infty}^\infty c_n(z-a)^n$, then $\operatorname{res}_af=c_{-1}$. Further, consider a domain $D\subset\mathbb C$ with regular boundary and a function $f(z)$ analytic in $\overline D\setminus\{a_1,\ldots,a_n\}$, where $a_j\in D$, $j=1,\ldots,n$. Then
$\int_{\partial D}f(z)\,dz=2\pi i\sum_{j=1}^n\operatorname{res}_{a_j}f.$

Proof. Integrating the Laurent series $\sum_{n=-\infty}^\infty c_n(z-a)^n$ term by term and using the fact that $\int_{|z-a|=\rho}(z-a)^n\,dz=0$ unless $n=-1$, while $\int_{|z-a|=\rho}(z-a)^{-1}\,dz=2\pi i$, we obtain $\operatorname{res}_af=c_{-1}$.

Next, choose $\rho>0$ such that the open discs $D_j=\{|z-a_j|<\rho\}$, $j=1,\ldots,n$, have pairwise empty intersections and their closures $\overline{D_j}$ belong to $D$. Then the Cauchy integral theorem 9.4.15, applied to $D_\rho=D\setminus\bigcup_{j=1}^n\overline{D_j}$, yields
$0=\int_{\partial D_\rho}f(z)\,dz=\int_{\partial D}f(z)\,dz-\sum_{j=1}^n\int_{\partial D_j}f(z)\,dz=\int_{\partial D}f(z)\,dz-2\pi i\sum_{j=1}^n\operatorname{res}_{a_j}f.$ □

9.4.25. Residues at infinity. Recall that when integrating along the circle $|z-a|=R$, we always assume the counterclockwise orientation of the circle. Thus we use the minus sign in the definition: if $f(z)$ is analytic in the closure of the exterior of a disc, $\{|z|\ge R\}$, then
$\operatorname{res}_\infty f=-\frac{1}{2\pi i}\int_{|z|=R}f(z)\,dz.$
In terms of the Laurent series $f(z)=\sum_{n=-\infty}^\infty c_nz^n$, valid in $\{|z|\ge R\}$, we have $\operatorname{res}_\infty f=-c_{-1}$.

Note that if $f(z)$ is analytic in $\mathbb C\setminus\{a_1,\ldots,a_n\}$, then
$\operatorname{res}_\infty f+\sum_{j=1}^n\operatorname{res}_{a_j}f=0.$
Indeed, by taking a disc $\{|z|<R\}$ of sufficiently large radius that contains all the singularities and has none on its boundary, we conclude that
$\frac{1}{2\pi i}\int_{|z|=R}f(z)\,dz=\sum_{j=1}^n\operatorname{res}_{a_j}f.$

9.4.26. Examples of applications. Residues of analytic functions are used for the evaluation of improper integrals in real analysis. The following lemma turns out to be very useful for such purposes. We write $M(R)$ for the maximum of $|f(z)|$ over the upper half of the circle of radius $R$, i.e. $M(R)=\max_{|z|=R,\ \operatorname{Im}z\ge0}|f(z)|$.

Jordan's lemma

Lemma. Consider a function $f(z)$ continuous on $\{\operatorname{Im}z\ge0,\ |z|=R\}$. Then, for each positive real parameter $t$,
$\Bigl|\int_{|z|=R,\ \operatorname{Im}z\ge0}f(z)\,e^{itz}\,dz\Bigr|\le\frac\pi t\,M(R).$
Consequently, if $f(z)$ is continuous on $\{\operatorname{Im}z\ge0,\ |z|\ge R_0\}$ and $\lim_{R\to\infty}M(R)=0$, then
$\lim_{R\to\infty}\int_{|z|=R,\ \operatorname{Im}z\ge0}f(z)\,e^{itz}\,dz=0.$

Proof.
We estimate the integral from the lemma:
$\Bigl|\int_0^\pi f(R\,e^{i\theta})\,e^{-tR\sin\theta+itR\cos\theta}\,iR\,e^{i\theta}\,d\theta\Bigr|\le R\,M(R)\int_0^\pi e^{-tR\sin\theta}\,d\theta.$
To evaluate the latter integral, we observe that $\sin\theta\ge\frac2\pi\theta$ for $0\le\theta\le\frac\pi2$. Thus, using $t>0$ and the substitution $\tau=\frac{2Rt\theta}{\pi}$, we arrive at
$R\,M(R)\int_0^\pi e^{-tR\sin\theta}\,d\theta=2R\,M(R)\int_0^{\pi/2}e^{-tR\sin\theta}\,d\theta\le2R\,M(R)\int_0^{\pi/2}e^{-2tR\theta/\pi}\,d\theta=\frac\pi t\,M(R)\int_0^{Rt}e^{-\tau}\,d\tau=\frac\pi t\,M(R)\bigl(1-e^{-Rt}\bigr)\le\frac\pi t\,M(R).$
The consequence for $R\to\infty$ is obvious. □

Typically, Jordan's lemma is used to compute improper integrals of real-analytic (complex valued) functions $g(x)=f(x)\,e^{itx}$ along the entire real line (or rather the real or imaginary parts of such integrals). If the corresponding complex analytic function $f(z)$ has only a finite number of poles $a_k$ in the upper half-plane and $\lim_{R\to\infty}M(R)=0$, then we may compute the real integral
$\int_{-\infty}^\infty g(x)\,dx=\lim_{R\to\infty}\int_{\gamma_R}g(z)\,dz=2\pi i\sum_k\operatorname{res}_{a_k}g(z),$
where $\gamma_R$ is the path composed of the interval $[-R,R]$ and the upper half circle of radius $R$. See the diagram and the examples in the other column. [Picture of the half-circle contour missing.]

9.4.27. Concluding remarks. Of course, we have not touched on many important issues in this short introduction. These include the conformal property of all analytic functions (they preserve all angles of curves), and the richness of analytic functions expressed by the Riemann mapping theorem: any simply connected region $\Omega$ other than the whole plane can be mapped bijectively onto the open unit disc, with both the map and its inverse analytic. The proper setup for analytic extensions is that of Riemann surfaces, with their fascinating topological properties. Also, we only commented on the possibility of proving the Cauchy integral theorem for triangles assuming just the existence of the complex derivative; the analyticity of all holomorphic functions then follows from the Cauchy integral formula. Moreover, we have not mentioned functions of several complex variables at all! We hope that all of these interesting issues will challenge the readers to pursue further, more detailed study in the relevant literature.

G. Additional exercises to the whole chapter

9.G.1. Solution. □
9.G.2. Solution. □
9.G.3. Solution. □
9.G.4. Solution. □
9.G.5. Solution. □

Solutions of the exercises

9.B.6. The answer is $4\pi$.
9.B.7. The answer is $36\pi$.
9.B.8. The answer is $\frac{65\pi}{24}$.
9.D.2. The general solution has the form $u=\Phi(x^2-y^2)$. Moreover, $u_{(i)}=\sqrt{x^2-y^2}$, $x>y$, $u_{(ii)}=y^2-x^2$, and the condition (iii) makes no sense.
9.D.3. The general solution has the form $u=\Phi(x^2+y^2)$. Moreover, we see that $u_{(i)}=x^2+y^2$, $u_{(ii)}=\frac{1}{1+x^2+y^2}$, $u_{(iii)}=(x^2+y^2-1)^2$ (unique for $x^2+y^2>1$).
9.D.4. The answer has the form $u(x,y)=2\cos(y)\sin(x)-1$.
9.D.8. The general solution has the form $u=K(y^2-2x)\cdot e^y$.
9.D.9. The general solution has the form $u(x,y)=K(y-x)\cdot e^y$. Moreover, $u_{(i)}=(y-x)\cdot e^{x(y-x)}$, $u_{(ii)}=\frac{2}{x-y}\cdot e^{\frac{y^2-x^2}{2}}$, $u_{(iii)}=(y-x)^2\cdot e^{y(y-x)}$.
9.D.10. The solution is given by $u=\frac{y}{y\,C\bigl(\frac1y-x\bigr)-1}$, $u_p=\frac{y}{2xy+3y-3}$.
9.D.11. The solution has the form $u=x^2+y^2+C\bigl(\frac yx\bigr)$, $u_p=x^2+y^2+\frac{y^2}{x^2}$.
9.D.12. The solution is given by $u=\frac14x^2-\frac12y^2+C(y\sqrt x)$, $u_p=\frac14(x^2-2y^2-xy^2)$.
9.D.13. The solution is given by $u=y^2\,C\bigl(y\,e^{1/x}\bigr)$, $u_p=y^3\,e^{\frac1x-1}$.
9.D.15. The solution is given by $u(x,y)=\sqrt{x^2+y^2}$, $u(x,y)=2-\sqrt{x^2+y^2}$.
9.D.16.
The solution is given by $u(x,y)=\frac{x^2+y^2}{2}$, $u(x,y)=-\frac12\bigl(1-\sqrt{x^2+y^2}\bigr)^2$.
9.D.17. The solution is the function $u(x,y)=e^{\sqrt{x^2+y^2}-1}$.
9.D.18. The solution is the function $u(x,y)=\frac14\bigl(x^2+y^2+2\sqrt{x^2+y^2}+1\bigr)$.
9.D.19. The solution is the function $u(x,y)=-e^{1-\sqrt{x^2+y^2}}$.
9.D.20. The answer is given by $u(x,y)=x(1-y)$.
9.D.21. The answer is the function $u(x,y)=x+y$.
9.D.22. The answer is given as follows: $u(x,y)=(2-\sqrt x)^2$, $u=x$.
9.D.23. The answer is $u(x,y)=x^2-y^2$.
9.D.24. The answer is $u(x,y)=y$.
9.D.26. $u(x,y)=D\,e^{x^2y^2}$.
9.D.27. $u(x,y)=x+y+Dxy$.
9.D.30. We get $\alpha_0=1$, $\beta_0=0$, $\alpha_1=x$, $\beta_1=y$, $\alpha_2=x^2-y^2$, $\beta_2=2xy$, $\alpha_3=x^3-3xy^2$, $\beta_3=3x^2y-y^3$.
9.D.34. Parabolic equation, $\xi=y+\ln x$, $\eta=y$, $u_{\eta\eta}=0$, solution $u=yC(y+\ln x)+D(y+\ln x)$.
9.D.35. Hyperbolic equation, $\xi=xy$, $\eta=y$, $u_{\xi\eta}=0$, $u=F(y)+G(xy)$.
9.D.36. Elliptic equation, $\xi=x+2y$, $\eta=\sqrt7\,x$, $\Delta u=0$, $u(x,y)=C(x+2y+i\sqrt7\,x)+D(x+2y-i\sqrt7\,x)$.
9.D.37. Elliptic equation, $\xi=y^2$, $\eta=x^2$, $\Delta u=-\frac12\bigl(\frac{u_\eta}{\eta}+\frac{u_\xi}{\xi}\bigr)$.
9.D.38. Substitute the formula into the equation and the initial condition.
9.D.40. $u_a=\frac12\bigl[\sin(x-t)+\sin(x+t)+(x+t)\sin(x+t)+\cos(x+t)-(x-t)\sin(x-t)-\cos(x-t)\bigr]$,
$u_b=2x+\frac12\bigl[(x+t)\ln(1+(x+t)^2)-2(x+t)+2\arctan(x+t)-(x-t)\ln(1+(x-t)^2)+2(x-t)-2\arctan(x-t)\bigr]$,
$u_c=x+\ln\sqrt{\frac{x+t}{x-t}}+\sin x-\sin x\cos t$.
9.D.41. $u(t,x)=x^2+2x+t^2+2\sin x-2\sin x\cos t+\frac12\bigl[(x+t)\sin(x+t)-(x-t)\sin(x-t)+\cos(x+t)-\cos(x-t)\bigr]$.
9.D.42. $u(t,x)=2x^2-x+2t^2+2\cos x-2\cos x\cos t+\frac12\bigl[-(x+t)\cos(x+t)+(x-t)\cos(x-t)+\sin(x+t)-\sin(x-t)\bigr]$.
9.D.43. $u(t,x)=x^3+12xt^2+\frac14\ln\frac{x+2t+\sqrt{1+(x+2t)^2}}{x-2t+\sqrt{1+(x-2t)^2}}+\frac14t^2+\frac18\cos2x-\frac18\cos2x\cos4t$.
9.D.44. $u(t,x)=x^2+9t^2+\frac16\bigl[(x+3t)e^{x+3t}-(x-3t)e^{x-3t}+e^{x-3t}-e^{x+3t}\bigr]+\frac49\bigl[\cos x\cos3t-\cos x\bigr]$.
9.D.48. $u(x,y)=\sin y\cdot\Bigl(\frac{1}{1-e^{2\pi}}e^x+\frac{1}{1-e^{-2\pi}}e^{-x}\Bigr)$.
9.D.49. $u(t,x)=-\cos\frac{\pi x}{l}\cos\frac{\pi t}{l}+\frac{l}{2\pi}\cos\frac{2\pi x}{l}\sin\frac{2\pi t}{l}$.
9.D.50. $u(t,x)=-e^{-4t}\sin(2x)+e^{-16t}\sin(4x)$.
9.D.51. $u(t,x)=-3\sin(4x)\cos(4t)-\sin x\sin t+\sin(2x)\sin(2t)$.
9.D.52. $u(t,x)=\sum_{n=1}^\infty c_n\,e^{-\bigl[\frac{(2n-1)\pi}{2a}\bigr]^2t}\sin\Bigl(\frac{2n-1}{2a}\pi x\Bigr)$, $c_n=\frac2a\int_0^a x(2a-x)\sin\Bigl(\frac{2n-1}{2a}\pi x\Bigr)dx=\frac{32a^2}{\pi^3(2n-1)^3}$.
9.D.53. $u(t,x)=\frac{8A^2}{\pi^3}\sum_{n=1,3,5,\ldots}\frac{1}{n^3}\sin\frac{n\pi x}{A}\cos\frac{n\pi t}{A}$.
9.D.54. $u(x,y)=\sum_{n=1}^\infty\sin\frac{n\pi x}{a}\bigl(a_ne^{\frac{n\pi}{a}y}+b_ne^{-\frac{n\pi}{a}y}\bigr)$, with $a_n=\frac{1}{1-e^{2n\pi}}\cdot\frac2a\int_0^a x(a-x)\sin\frac{n\pi x}{a}\,dx$, $b_n=\frac{1}{1-e^{-2n\pi}}\cdot\frac2a\int_0^a x(a-x)\sin\frac{n\pi x}{a}\,dx$; explicitly,
$u(x,y)=\sum_{n=0}^\infty\sin\frac{(2n+1)\pi x}{a}\Bigl(\frac{8a^2}{(2n+1)^3\pi^3\bigl(1-e^{2(2n+1)\pi}\bigr)}e^{\frac{(2n+1)\pi}{a}y}+\frac{8a^2}{(2n+1)^3\pi^3\bigl(1-e^{-2(2n+1)\pi}\bigr)}e^{-\frac{(2n+1)\pi}{a}y}\Bigr)$.
9.D.55. $u(t,x)=-\sum_{n=0}^\infty\frac{8}{(2n+1)^2\pi^2}\cos\Bigl(\frac{2n+1}{2}\pi x\Bigr)e^{-(2n+1)^2\pi^2t}$.
9.D.58. $u(x,y)=\frac{xy}{A^2}$.
9.D.59. $u(x,y)=\frac52+\frac{A^2}{2}\cdot\frac{x^2-y^2}{(x^2+y^2)^2}$.
9.D.60. $u(x,y)=\frac12+\frac{x^2-y^2}{2}+y$.
9.D.62. $u(t,x)=\operatorname{erf}\Bigl(\frac{x}{2\sqrt t}\Bigr)-\operatorname{erf}\Bigl(\frac{x-1}{2\sqrt t}\Bigr)$.

Roughly speaking, statistics is any processing of numerical or other data about a population of objects, and their presentation. In this context, we talk about descriptive statistics. Its objective is thus to process and comprehensibly represent data about the objects of a given "population" — for instance, the annual income of all citizens obtained from the complete data of the revenue authorities, or the quality of hotel accommodation in some region.
In order to achieve this, we focus on simple numerical characterizations and visualizations of the data. [In general, many pictures are missing!]

Mathematical statistics uses mathematical methods to derive conclusions valid for the whole (potentially infinite) population of objects, based on a "small" sample. For instance, we might want to find out how widespread a certain disease is in the population by collecting data about a few randomly chosen people, but we interpret the results with regard to the entire population. In other words, mathematical statistics draws conclusions about a large population of objects from the study of a small (usually randomly selected) sample collection. It also estimates the reliability of the resulting conclusions.

Mathematical statistics is based on the tools of probability theory, which is very useful (and amazing) in itself. Therefore, probability theory is discussed first. This chapter provides an elementary introduction to the methods of probability theory, which should be sufficient for the correct comprehension of the ordinary statistical information all around us. However, for a serious understanding of a mathematical statistician's work, one must look for other resources.

1. Descriptive statistics

Descriptive statistics alone is not a mathematical discipline, although it involves many manipulations with numbers and sometimes even very sophisticated methods. However, it is a good opportunity for illustrating the mathematical approach to building generally useful tools. At the same time, it should serve as a motivation for studying probability theory, in view of the later applications in statistics.

CHAPTER 10

Statistics and probability methods

Is statistics a part of mathematics? — whenever it is so, we need much of mathematics there...!

A. Dots, lines, rectangles

The data obtained from reality can be displayed in many ways. Let us illustrate some of them.

10.A.1. Presenting the collected data. Twenty mathematicians were asked about the number of members of their household. The following table displays the frequency of each number of members.

Number of members: 1 2 3 4 5 6
Number of households: 5 5 1 6 2 1

Create the frequency distribution table. Find the mean, median and mode of the number of members. Build a column diagram of the data.

Solution. Let us begin with the frequency distribution table. There, we write not only the frequencies, but also the cumulative frequencies and relative frequencies (i.e., the probability that a randomly picked household has the given number of members). Let us denote the number of members by $x_i$, the corresponding frequency by $n_i$, the relative frequency by $p_i$ ($=n_i/\sum_{j=1}^6n_j=n_i/20$), the cumulative frequency by $N_i$ ($=\sum_{j=1}^in_j$), and the relative cumulative frequency by $F_i$

In our brief introduction, we first introduce the concepts allowing us to measure the positions of the data values and the variability of the data values (means, percentiles, etc.). We touch on the problem of how to visualize or otherwise present data sets (diagrams). Then we deal with the potential relations between several data sets (covariance and principal components) and, finally, we deal with data without numerical values, relying just on their frequencies of appearance (entropy).

10.1.1. Probability, or statistics? It is not by accident that we return to a part of the motivating hints from the first chapter as soon as we have managed to gather enough mathematical tools, both discrete and continuous.
Nowadays, many communications are of a statistical nature, be it in media, politics, or science. Nevertheless, in order to properly understand the meaning of such a communication and to use particular statistical methods and concepts, one must have a broad knowledge of miscellaneous parts of mathematics. In this subsection, we move away from the mathematical theory and think about the following steps and our objectives. As an example of a population of objects, consider the students of a given basic course. Then, the examined numerical data can be:
• the "mean number of points" obtained during the course in the previous semester and the "variance" of these values,
• the "mean marks" for the examination of this and other courses and the "correlation" (i.e. mutual dependence) of these results,
• the "correlation" of data about the past results of given students,
• the "correlation" of the number of failed exams of a given student and the number of hours spent in a temporary job,
• ...
With regard to the first item, the arithmetic mean itself does not carry enough information about the quality of the lecture or of the lecturer, nor about the results of particular students. Maybe the value which is "in the middle" of the population, or the number of points achieved by the student who was just better than half of the students, is of more concern. Similarly, the first quarter, the last quarter, the first tenth, etc. may be of interest. Such data are called statistics of the population. Such statistics are interesting for the students in question as well, and it is quite easy to define, compute, and communicate them. From general experience or as a theoretical result outside mathematics, a reasonable assessment should be "normally" distributed. This is a concept of probability theory, and it requires quite advanced mathematics to be properly defined. Comparing the collected data about even a small random population of students to theoretical results can serve in two ways: We can estimate the parameters of the distribution as well as draw a conclusion whether the assessment is reasonable.

($= N_i/20 = \sum_{j=1}^{i} p_j$):

x_i  n_i  p_i   N_i  F_i
1    5    1/4   5    1/4
2    5    1/4   10   1/2
3    1    1/20  11   11/20
4    6    3/10  17   17/20
5    2    1/10  19   19/20
6    1    1/20  20   1

Now, we can easily construct the wanted (column) graphs of (relative, cumulative) frequencies. The mean number of members of a household is
$\bar{x} = \frac{5\cdot 1 + 5\cdot 2 + 1\cdot 3 + 6\cdot 4 + 2\cdot 5 + 1\cdot 6}{20} = 2.9.$
The median is the arithmetic mean of the tenth and eleventh values (having been sorted), which are respectively 2 and 3, i.e., $\tilde{x} = 2.5$. The mode is the most frequent value, i.e., $\hat{x} = 4$. The collected data can also be presented using a box plot: The upper and lower sides of the "box" correspond respectively to the first (lower) and the third (upper) quartile, so its height is equal to the interquartile range. The thick horizontal line is drawn at the median level; the lower and upper horizontal lines correspond respectively to the minimum and maximum elements of the data set, or to the value that is 1.5 times the interquartile range less than the lower side of the box (and greater than the upper side, respectively). The data outside this range would be shown as circles. We can also build the histogram of the data.

At the same time, the numerical values of statistics for a given population can yield a qualitative description of the likelihood of our conclusions.
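The mean, median, and mode just computed in 10.A.1 can be checked with Python's standard library. A minimal sketch of ours, expanding the frequency table back into the raw data:

    from statistics import mean, median, mode

    members = [1, 2, 3, 4, 5, 6]
    counts  = [5, 5, 1, 6, 2, 1]
    data = [x for x, ni in zip(members, counts) for _ in range(ni)]

    print(mean(data))    # 2.9
    print(median(data))  # 2.5, the average of the 10th and 11th sorted values
    print(mode(data))    # 4, the most frequent value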
We can compute statistics which reflect the variability of the examined values, rather than where these values are positioned within a given population. For instance, if the assessment does not show enough variability, it may be concluded that it is badly designed, because the students' skills are of course different. The same applies if the collected data seem completely random. In the above paragraph, it is assumed that the examined data is reliable. This is not always the case in practice. On the contrary, the data is often perturbed with errors due to the construction of the experiment and the data collection itself. In many cases, not much is known about the type of the data distribution. Then, methods of non-parametric statistics are often used (to be mentioned at the end of this chapter). Very interesting conclusions can be found if we compare the statistics for different quantities and then derive information about their relations. For example, if there is no evident relation between the history of previous studies and the results in a given course, then it may be that the course is managed wrongly. These ideas can be summarized as follows:
• In descriptive statistics, there are tools which allow the understanding of the structure and nature of even a huge collection of data;
• in mathematics, one works with an abstract mathematical description of probability, which can be used for the analysis of given data, especially when there is a theoretical model to which the data should correspond;
• conclusions of statistical investigation of samples of particular data sets can be given by mathematical statistics;
• mathematical statistics can also estimate how adequate such a description is for a given data set.

10.1.2. Terminology. Statisticians have introduced a great many concepts which need mastering. The fundamental concept is that of a statistical population, which is an exactly defined set of basic statistical units. These can be given by enumeration or, in the case of a larger population, by some rules. On every statistical unit, statistical data is measured, with the "measurement" perceived very broadly. For instance, the population can consist of all students of a given university. Then, each of the students is a statistical unit and much data can be gathered about these units – the numerical values obtainable from the information system, what is their favorite colour, what they had for dinner before their last test, etc. The basic object for examining particular pieces of data is a data set. It usually consists of ordered values. The ordering can be either natural (when the data values are real numbers, for example) or we can define it (for instance, when we observe colours, we can express them in the RGB format and order them with respect to this sign).

Note that the frequencies of one- and two-member households were merged into a single rectangle. This is done in order to make the data "easier to read" – there exist various (and ambiguous) rules for the merging. We simply mention this fact without presenting an exact procedure (the choice is largely a matter of taste). □

10.A.2. Given a data set $x = (x_1, x_2, \dots, x_n)$, find the mean and variance of the centered values $x_i - \bar{x}$ and the standardized values $\frac{x_i-\bar{x}}{s_x}$.

Solution. The mean of the centered values can be found directly using the definition of the arithmetic mean:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}) = \frac{1}{n}\sum_{i=1}^{n} x_i - \frac{\bar{x}}{n}\sum_{i=1}^{n} 1 = \bar{x} - \bar{x} = 0.$
The variance of the centered values is clearly the same as for the original ones ($s_x^2$).
For the standardized values, the mean is equal to zero again, and the variance is
$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)^2 = \frac{1}{s_x^2}\cdot\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 1.$ □

10.A.3. Prove that the variance satisfies $s_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$.

Solution. Using the definitions of variance and arithmetic mean, we get:
$s_x^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \frac{2\bar{x}}{n}\sum_{i=1}^{n} x_i + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2.$ □

We can also work with unordered values. Since statistical description aims at telling comprehensible information about the entire population, we should be able to compare and take ratios of the data values. Therefore, we need to have a measurement scale at our disposal. In most cases, the data values are expressed as numbers. However, the meaning of the data can be quantified variously, and thus we distinguish between the following types of data measurement scales.

Types of data measurement scales

The data values are called:
• nominal if there is no relation between particular values; they are just qualitative names, i.e. possible values (for instance, political parties or lecturers at a university when surveying how popular they are);
• ordinal the same as above, but with an ordering (for example, number of stars for hotels in guidebooks);
• interval if the values are numbers which serve for comparisons but do not correspond to any absolute value (for example, when expressing temperature in Celsius or Fahrenheit degrees, the position of zero is only conventional);
• ratio if the scale and the position of zero are fixed (most physical and economical quantities).

With nominal types, we can interpret only equalities $x_1 = x_2$; with ordinal types, we can also interpret inequalities $x_1 < x_2$ (or $x_1 > x_2$); with interval types, we can also interpret differences $x_1 - x_2$. Finally, with ratio types, we have also ratios $x_1/x_2$ available.

10.1.3. Data sorting. In this subsection, we work with a data set $x_1, x_2, \dots, x_n$, which can be ordered (thus, their type is not nominal) and which has been obtained through measurement on $n$ statistical units. These values are sorted in a sorted data set
(1) $x_{(1)}, x_{(2)}, \dots, x_{(n)}.$
The integer $n$ is called the size of the data set. When working with large data sets where only a few values occur, the simplest way to represent the data set is to enumerate the values' frequencies. For instance, when surveying the political party preference or when presenting the quality of a hotel, write only the number of occurrences of each value. If there are many possible values (or there can even be continuously distributed real values), divide them into a suitable number of intervals and then observe the frequencies in the given intervals. The intervals are also called classes and the frequencies are called class frequencies. We also use cumulative frequencies and cumulative class frequencies which correspond to the sum of frequencies of values not exceeding a given one.

10.A.4. The following values have been collected: 10; 7; 7; 8; 8; 9; 10; 9; 4; 9; 10; 9; 11; 9; 7; 8; 3; 9; 8; 7. Find the arithmetic mean, median, quartiles, variance, and the corresponding box diagram.

Solution. Denoting the individual values by $a_i$ and their frequencies by $n_i$, we can arrange the given data set into the following table.

a_i  3  4  7  8  9  10  11
n_i  1  1  4  4  6  3   1

From the definition of arithmetic mean, we have
$\bar{x} = \frac{3 + 4 + 4\cdot 7 + 4\cdot 8 + 6\cdot 9 + 3\cdot 10 + 11}{1 + 1 + 4 + 4 + 6 + 3 + 1} = \frac{162}{20} = 8.1.$
Since the tenth least collected value is $x_{(10)} = 8$ and the eleventh one is $x_{(11)} = 9$, the median is equal to $\tilde{x} = \frac{8+9}{2} = 8.5$. The first quartile is $x_{0.25} = \frac{x_{(5)}+x_{(6)}}{2} = 7$, and the third quartile is $x_{0.75} = \frac{x_{(15)}+x_{(16)}}{2} = 9$. From the definition of variance, we get
$s_x^2 = \frac{5.1^2 + 4.1^2 + 4\cdot 1.1^2 + 4\cdot 0.1^2 + 6\cdot 0.9^2 + 3\cdot 1.9^2 + 2.9^2}{1 + 1 + 4 + 4 + 6 + 3 + 1} = 3.59.$
The histogram and box diagram are shown in the following pictures, where we have used a "statistical" method to make the histogram "nice" and "clear". You can find a lot of these conventions in the books on statistics, but if you do not know them, you may be lost. This is the default setting of the R program. For example, if you replace just the value 3 by 2, you get a quite different looking histogram:

Most often, the mean $a_i$ of a given class is considered to be its representative, and the value $a_i n_i$ (where $n_i$ is the frequency of the class) is the total contribution of the class. Relative frequencies $n_i/n$, and relative cumulative frequencies, can also be considered. A graph which has the intervals of particular classes on one axis and rectangles above them with height corresponding to the frequency is called a histogram. Cumulative frequency is represented similarly. The following diagram shows histograms of data sets of size $n = 500$ which were randomly generated with various standard distributions (called normal, $\chi^2$, respectively).

10.1.4. Measures of the position of statistical values. If the magnitude of values around which the collected data values gather is to be expressed, then the concepts of the definition below can be used. There, we work with ratio or interval types of scales. Consider an (unsorted) data set $(x_1, \dots, x_n)$ of the values for all examined statistical units and let $n_1, \dots, n_m$ be the class frequencies of $m$ distinct values $a_1, \dots, a_m$ that occur in this set.

Means

Definition. The arithmetic mean (often only mean) is given as
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum_{j=1}^{m} n_j a_j.$
The geometric mean is given as $\bar{x}_G = \sqrt[n]{x_1 x_2\cdots x_n}$ and makes sense for positive values $x_i$ only. The harmonic mean is given as
$\bar{x}_H = \left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}\right)^{-1}$
and is also used for positive values $x_i$ only.

The arithmetic mean is the only one of the three above which is invariant with respect to affine transformations. For all scalars $a$, $b$,
$\overline{(a + b\cdot x)} = \frac{1}{n}\sum_{i=1}^{n}(a + bx_i) = a + \frac{b}{n}\sum_{i=1}^{n} x_i = a + b\cdot\bar{x}.$

□ 10.A.5. 425 carps were fished, and each one was weighed. Then, mass intervals were set, resulting in the following frequency distribution table:

Weight (kg):     0–1  1–2  2–3  3–4  4–5  5–6  6–7
Class midpoint:  0.5  1.5  2.5  3.5  4.5  5.5  6.5
Frequency:       75   90   97   63   48   42   10

Draw a histogram, find the arithmetic, geometric, and harmonic means of the carps' weights. Furthermore, find the median, quartiles, mode, variance, standard deviation, coefficient of variation, and draw a box plot.

Solution. The histogram looks as follows: From the definitions of the corresponding concepts in subsection 10.1.4, we can directly compute that the arithmetic mean is $\bar{x} = 2.7$ kg, the geometric mean is $\bar{x}_G = 2.1$ kg, and the harmonic mean is $\bar{x}_H = 1.5$ kg. By the definitions of subsection 10.1.5, the median is equal to $\tilde{x} = x_{0.5} = 2.5$

Therefore, the arithmetic mean is especially suitable for interval types. The logarithm of the geometric mean is the arithmetic mean of the logarithms of the values. It is especially suitable for those quantities which cumulate multiplicatively, e.g.
interests. If the interest rate for each time period is $x_i\,\%$, then the final result is the same as if the interest rate had the constant value of $\bar{x}_G\,\%$. See 10.A.9 for an example where the harmonic mean is appropriate. In subsection 8.1.30 (page 746), we use the methods invented there to prove that the geometric mean never exceeds the arithmetic mean. The harmonic mean never exceeds the geometric mean, and so $\bar{x}_H \le \bar{x}_G \le \bar{x}$.

10.1.5. Median, quartile, decile, percentile, ... Another way of expressing the position or distribution of the values is to find, for a number $\alpha$ between zero and one, such a value $x_\alpha$ that $100\alpha\,\%$ of the values from the set are at most $x_\alpha$ and the remaining ones are greater than $x_\alpha$. If such a value is not unique, one can choose the mean of the two nearest possibilities. The number $x_\alpha$ is called the $\alpha$-quantile. Thus, if the result of a contestant puts him into $x_{1.00}$, it does not mean that he is better than anyone else yet. However, there is surely no one better than him. The most common values of $x_\alpha$ are the following:
• The median (also sample median) is defined by
$\tilde{x} = x_{0.50} = \begin{cases} x_{((n+1)/2)} & \text{for odd } n, \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{for even } n, \end{cases}$
where $x_{(k)}$ corresponds to the value in the sorted data set 10.1.3(1).
• The first and third quartile are $Q_1 = x_{0.25}$ and $Q_3 = x_{0.75}$, respectively.
• The $p$-th quantile (also sample quantile or percentile) $x_p$, where $0 < p < 1$ (usually rounded to two decimal places).
One can also meet the mode, which is the value $\hat{x}$ that is most frequent in the data set $x$. The arithmetic mean, median (with ratio types), and mode (with ordinal or nominal types) correspond to the "anticipated" values. Note that all $\alpha$-quantiles with interval scales are invariant with respect to affine transformations of the values (check this yourselves!).

10.1.6. Measures of the variability. Surely any measure of the variability of a data set $x \in \mathbb{R}^n$ should be invariant with respect to constant translations. In the Euclidean space $\mathbb{R}^n$, both the standard distance and the sample mean have this property. Therefore, choose the following:

kg, the lower quartile to $x_{0.25} = 1.5$ kg, the upper quartile to $x_{0.75} = 3.5$ kg, and the mode is $\hat{x} = 2.5$ kg. From the definitions of subsection 10.1.6, we compute the variance of the weights, which is $s_x^2 = 2.7$ kg², whence it follows that the standard deviation is $s_x = 1.7$ kg, and the coefficient of variation is $V_x = 0.6$. □

10.A.6. Prove that the entropy is maximal if the nominal values are distributed uniformly, i.e., the frequency of each class is $n_i = 1$.

Solution. By the definition of entropy (see 10.1.11), we are looking for the maximum of the function $H_X = -\sum_{i=1}^{n} p_i\ln p_i$ with respect to the unknown relative frequencies $p_i = \frac{n_i}{n}$, which satisfy $\sum_{i=1}^{n} p_i = 1$. Therefore, this is a typical example of finding constrained extrema, which can be solved using Lagrange multipliers. The corresponding Lagrange function is
$L(p_1, \dots, p_n, \lambda) = -\sum_{i=1}^{n} p_i\ln p_i + \lambda\left(\sum_{i=1}^{n} p_i - 1\right).$
The partial derivatives are $\frac{\partial L}{\partial p_i} = -\ln p_i - 1 + \lambda$, hence the stationary point is determined by the equations $p_i = e^{\lambda-1}$ for all $i = 1, \dots, n$. Moreover, we know that the sum of the relative frequencies $p_i$ is equal to one. This means that $n e^{\lambda-1} = 1$, whence we get $\lambda = 1 - \ln n$. Substitution then yields $p_i = \frac{1}{n}$. □

10.A.7. The following graphs depict the frequencies of particular amounts of points obtained by students of the MB104 lecture at the Faculty of Informatics of Masaryk University in 2012.
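Before examining those graphs, here is a quick numerical check of the conclusion of 10.A.6 – among all vectors of relative frequencies, the uniform one maximizes the entropy, with maximum value $\ln n$. This is a sketch of ours with an arbitrarily chosen non-uniform comparison vector:

    import numpy as np

    def entropy(p):
        # H = -sum p_i ln p_i for a vector of positive relative frequencies
        p = np.asarray(p, dtype=float)
        return float(-(p * np.log(p)).sum())

    k = 5
    uniform = np.full(k, 1 / k)
    skewed = np.array([0.5, 0.2, 0.15, 0.1, 0.05])

    print(entropy(uniform), np.log(k))  # both approx. 1.609
    print(entropy(skewed))              # approx. 1.33, strictly smaller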
The axes of the cumulative graph are "swapped", as opposed to the previous example. The frequencies of particular amounts of points are enumerated in the following table:

Variance and standard deviation

Definition. The variance of a data set $x$ is defined by
$s_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$
The standard deviation $s_x$ is defined to be the square root of the variance.

As requested, the variability of statistical values is independent of constant translation of all values. Indeed, the unsorted data set $y = (x_1 + c, x_2 + c, \dots, x_n + c)$ has the same variance, $s_y = s_x$. Sometimes, the sample variance is used, where there is $(n-1)$ in the denominator instead of $n$. The reason will be clear later, cf. 10.3.2. In case of class frequencies $n_j$ of values $a_j$ for $m$ classes, this expression leads to the value
$s_x^2 = \frac{1}{n}\sum_{j=1}^{m} n_j(a_j - \bar{x})^2$
of the variance. In practice, it is recommended to use Sheppard's correction, which decreases $s_x^2$ by $h^2/12$, where $h$ is the width of the intervals that define the classes. Further, one can encounter the data-set range $R = x_{(n)} - x_{(1)}$ and the interquartile range $Q = Q_3 - Q_1$. The mean deviation is defined as the mean distance of the values from the median:
$D_x = \frac{1}{n}\sum_{i=1}^{n}|x_i - \tilde{x}|.$
The following theorem clarifies why these measures of variability are chosen:

Theorem. The function $S(t) = \frac{1}{n}\sum_{i=1}^{n}(x_i - t)^2$ has the minimum value at $t = \bar{x}$, i.e., at the sample mean. The function $D(t) = \frac{1}{n}\sum_{i=1}^{n}|x_i - t|$ has the minimum value at $t = \tilde{x}$, i.e., the median.

Proof. The minimum of the quadratic polynomial $f(t) = \sum_{i=1}^{n}(x_i - t)^2$ is at the only root of its derivative:
$f'(t) = -2\sum_{i=1}^{n}(x_i - t).$
Since the sum of the distances of all values from the sample mean is zero, $t = \bar{x}$ is the requested root, and the first proposition is proved. As for the second proposition, return to the definition of the median. For this purpose, rearrange the sum so that the first and the last summand are added, then the second and the last-but-one summand, etc. In the first case, this leads to the expression $|x_{(1)} - t| + |x_{(n)} - t|$, and this is equal to the distance $x_{(n)} - x_{(1)}$ provided $t$ lies inside the range, and it is even greater otherwise. Similarly, the other pair in the sum gives $x_{(n-1)} - x_{(2)}$ if $x_{(2)} \le t \le x_{(n-1)}$, and it is greater otherwise. Therefore, the minimality assumption leads to $t = \tilde{x}$. □

# of points:   20.5  20  19  18.5  18  17.5  17  16.5  16  15.5  15  14.5  14  13.5  13  12.5  12  11.5  11  10.5  10
# of students: 1     1   2   1     2   3     2   4     3   5     7   6     14  21    21  19    17  18    31  22    53

# of points:   9.5  9  8.5  8  7.5  7  6.5  6  5.5  5  4.5  4  3.5  3  2.5  2   1.5  1  0.5  0
# of students: 9    9  13   8  13   4  7    4  8    7  9    5  7    8  8    14  8    2  6    9

The corresponding histogram looks as follows: The histogram was obtained from the Information System of Masaryk University. We can see that the data are shown in a somewhat unusual way: individual amounts of points correspond to "double rectangles". It is a matter of taste how to represent the data (it is possible to merge some values, thereby decreasing the number of rectangles, or to use thinner rectangles).
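The theorem quoted above (the sample mean minimizes $S(t)$, the median minimizes $D(t)$) can also be illustrated by brute force on the data set of 10.A.4. This is a sketch of ours; note that for an even number of values, $D(t)$ is minimized by a whole interval, which contains the median:

    import numpy as np

    x = np.array([10, 7, 7, 8, 8, 9, 10, 9, 4, 9,
                  10, 9, 11, 9, 7, 8, 3, 9, 8, 7], dtype=float)

    t = np.linspace(x.min(), x.max(), 2001)       # candidate values of t
    S = ((x[:, None] - t) ** 2).mean(axis=0)      # S(t) = (1/n) sum (x_i - t)^2
    D = np.abs(x[:, None] - t).mean(axis=0)       # D(t) = (1/n) sum |x_i - t|

    print(t[S.argmin()], x.mean())                # both 8.1
    flat = t[np.isclose(D, D.min())]              # minimizers of D form [8, 9]
    print(flat.min(), flat.max(), np.median(x))   # 8.0, 9.0, and 8.5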
If the values of a data set are distributed symmetrically around the mean value, then ¯x = ˜x However, there are distributions where ¯x > ˜x. This is common, for instance, with the distribution of salaries in a population where the mean is driven up by a few very large incomes, while much of the population is below the av- erage. A useful characteristic concerning this is the Pearson coefficient, given by β = 3 ¯x − ˜x sx . It estimates the relative measure (the absolute value of β) and the direction of the skewness (the sign). In particular, note that the standard deviation is always positive, so it is already the sign of ¯x − ˜x which shows the direction of the skewness. Quantile coefficients of skewness More detailed information can be obtained from the quantile coefficients of skewness βp = x1−p + xp − 2˜x x1−p − xp , for each 0 < p < 1/2. Their meaning is clear when the numerator is expressed as (x1−p − ˜x) − (˜x − xp). In particular, the quartile coefficient of skewness is obtained when selecting p = 0.25. 10.1.8. Diagrams. People’s eyes are well suited for perceiving information with a complicated structure. That is why there exist many standardized tools for displaying statistical data or their correlations. One of them is the box diagram. 908 We can notice that the mode of the values is 10, which, accidentally, was also the number of points necessary to pass the course. The mean of the obtained points is 9.48. 10.A.8. Here, we present column diagrams of the amounts of points of MB101 students in autumn 2010 (the very first semester of their studies). The first one corresponds to all students of the course; the second one does to those who (3 years later) successfully finished their studies and got the bachelor’s degree. Again, the results can be depicted in an alternative way: CHAPTER 10. STATISTICS AND PROBABILITY THEORY Box diagram The diagram illustrates a histogram and a box diagram of the same data set (normal distribution with mean equal to 10 and variance equal to 3, n = 500). The middle line is the median; the edges of the box are the quartiles; the “paws” show 1.5 of the interquartile range, but not more than the edges of the sample range. Potential outliers are indicated too. Common displaying tools allow us to view potential dependencies of two data sets. For instance, in the left-hand diagram below, the coordinates are chosen as the values of two independent normal distributions with mean equal to 10 and variance equal to 3. In the right-hand illustration, the first coordinate is from the same data set, and the second coordinate is given by the formula y = 3x + 4. It is also perturbed with a small error. 10.1.9. Covariance matrix. Actually, the depencies between several data sets associated to the same statistical units are at the core of our interest in many real world problems. When definining the variance in 10.1.6 above, we employed the euclidean distance, i.e. we evaluated the scalar product of the values of the square of distances from the mean with itself. Thus, having two vectors of data sets, we may define 909 And these are the graphs of amounts of points obtained by those students who continued their studies: We can see that in the former case, the mode is equal to 0, while in the latter case, it is 10 again. The frequency distribution is close to the one of the MB104 course, which is recommended for the fourth semester. CHAPTER 10. STATISTICS AND PROBABILITY THEORY Covariance and covariance matrix Consider two data sets x = (x1, . . . , xn), y = (y1, . . . 
$x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n)$, and their means $\bar{x}$, $\bar{y}$. We define their covariance by the formula
$\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$
If there are $k$ sample sets $x^{(1)} = (x^{(1)}_1, \dots, x^{(1)}_n)$, ..., $x^{(k)} = (x^{(k)}_1, \dots, x^{(k)}_n)$, then their covariance matrix is the symmetric matrix $C = (c_{ij})$ with $c_{ij} = \operatorname{cov}(x^{(i)}, x^{(j)})$.

Again, the sample covariance and sample covariance matrix are defined by the same formulae with $n$ replaced by $(n-1)$. Clearly, the covariance matrix has the variances of the individual data sets on its diagonal. In order to imagine what the covariance should say, consider two possible behaviours of two data sets: (a) they deviate from their means in a very similar way (comparing $x_i$ and $y_i$ individually), (b) they behave very independently. In the first case, we should expect that the signs of the deviations will mostly coincide, and thus the sum in the definition will lead to a quite big positive number. In the other case, the signs should be rather independent, and thus the positive and negative contributions should effectively cancel each other in the covariance sum. Thus we expect the covariance of data sets expressing independent features to be close to zero, while the covariance of dependent sets should be far from zero. The sign of the covariance shows the character of the dependence. For example, the two sets of data depicted in the left-hand diagram above had covariance about −0.11, while the covariance of the data from the right-hand picture was about 25.9. Similarly to the variance, we are often interested in normalized values. The correlation coefficient takes the covariance and divides it by the standard deviations of the two data sets. In our two latter cases, the correlation coefficients are about −0.01 and 0.99. As expected, they very clearly indicate which of the data are correlated.

10.1.10. Principal components analysis. If we deal with statistics involving many parameters and we need to decide quickly about their similarity (correlation) with some given patterns, we might use a simple idea from linear algebra. Assume we have got $k$ data sets $x^{(i)}$. Since their covariance matrix $C$ is symmetric, there is an orthonormal basis $e$ in $\mathbb{R}^k$ such that, in this basis, the corresponding quadratic form given by $C$ enjoys a diagonal matrix. The relevant basis $e$ consists of the real eigenvectors $e_i \in \mathbb{R}^k$ for the eigenvalues $\lambda_i$. The bigger the absolute value $|\lambda_i|$, the bigger the variation of the orthogonal projection $\hat{x}$ of all the $k$ data sets into the one-dimensional subspace spanned by $e_i$. Thus we may restrict ourselves to just this one data set $\hat{x}$ and consider the statistics concerning this one set as representing the multi-parametric data sets $x^{(i)}$. Similarly, we may

10.A.9. A car was traveling from Brno to Prague at 160 km/h, and then back from Prague to Brno at 120 km/h. What was its average speed?

Solution. This is an example where one might think of using the arithmetic mean, which is incorrect. The arithmetic mean would be the correct result if the car spent the same period of time going at each speed. However, in this case, it traveled the same distance, not time, at each speed. Denoting by $d$ the distance between Brno and Prague and by $v_p$ the average speed, we obtain
$\frac{d}{160} + \frac{d}{120} = \frac{2d}{v_p},$
whence
$v_p = \frac{2}{\frac{1}{160} + \frac{1}{120}} \doteq 137.14.$
Therefore, the average speed is the harmonic mean (see 10.1.4) of the two speeds. □
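The behaviour of the covariance and the correlation coefficient described in 10.1.9 is easy to reproduce numerically. The following sketch is our own (the seed and the size of the error are arbitrary choices) and mimics the two diagrams discussed above; it also re-checks the harmonic-mean answer of 10.A.9:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(10, np.sqrt(3), 500)            # mean 10, variance 3
    y_indep = rng.normal(10, np.sqrt(3), 500)      # independent of x
    y_dep = 3 * x + 4 + rng.normal(0, 0.5, 500)    # y = 3x + 4 plus a small error

    def cov(a, b):
        # the 1/n convention of 10.1.9
        return ((a - a.mean()) * (b - b.mean())).mean()

    print(cov(x, y_indep))              # close to 0
    print(cov(x, y_dep))                # close to 3 * var(x), i.e. about 9
    print(np.corrcoef(x, y_dep)[0, 1])  # close to 1

    # 10.A.9: the average speed is the harmonic mean of 160 and 120
    print(2 / (1 / 160 + 1 / 120))      # 137.14...

The exact magnitude of the covariance depends on the variance chosen for the simulated data; only its order and sign matter here, while the correlation coefficient is scale-free.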
B. Visualization of multidimensional data

The above examples were devoted to displaying one numerical characteristic measured for several objects (the number of points obtained by individual students, for example). Graphical visualization of data helps us understand them better. However, how do we depict the data if we measure $p$ different characteristics, $p \ge 3$, of $n$ objects? Such measurements cannot be displayed using the graphs we have met.

10.B.1. One of the possible methods is the so-called principal component analysis. In this method, we use eigenvectors and eigenvalues (see 2.4.2) of the sample covariance matrix (see 10.2.35). We will use the following notation:
• random vectors of the measurement $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$, $i = 1, \dots, n$,
• the mean of the $j$-th component $m_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, $j = 1, \dots, p$,
• the sample variance of the $j$-th component $s_j = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - m_j)^2$, $j = 1, \dots, p$,
• the vector of means $m = (m_1, \dots, m_p)$,
• the sample covariance matrix $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - m)(x_i - m)^T$ (note that each summand is a $p$-by-$p$ matrix).
The covariance matrix is symmetric, hence all its eigenvalues are real and its eigenvectors are pairwise orthogonal. Moreover, considering the eigenvectors of unit length, we can see that the eigenvalue corresponding to an eigenvector of the covariance matrix yields the variance of (the size of) the projection of the data onto this direction (the projection takes place in the $p$-dimensional space). The goal of this method is to find the direction (in the $p$-dimensional space of the measured characteristics) for which the variance of the projections is as great as possible. Thus, this direction corresponds to the eigenvector of the covariance matrix whose eigenvalue is the greatest one. The linear combination given

also use several biggest eigenvalues instead of one and reduce the dimension of our parameter space in this way. Finally, considering the unit-length eigenvector $(\alpha_1, \dots, \alpha_k)$ corresponding to the chosen eigenvalue $\lambda$, the values $\alpha_j$ provide the right coefficients in the orthogonal projection
$(x^{(1)}, \dots, x^{(k)}) \mapsto \hat{x} = \alpha_1 x^{(1)} + \cdots + \alpha_k x^{(k)}.$
See the exercise 10.B.2 for an illustration, together with another description of how to proceed with the data in 10.B.1. The latter approach is called the principal component analysis.

10.1.11. Entropy. We also need to describe the variability of data sets even with nominal types, for instance in statistical physics or information theory. The only thing at our disposal is the class frequencies, so the principle of classical probability can be used (see the fourth part of chapter one). There, the relative frequency of the $i$-th class, $p_i = \frac{n_i}{n}$, is understood to be the probability that a random object belongs to this class. The variance of ratio-type values with class frequencies $n_j$ was given by the formula (see 10.1.6)
$s_x^2 = \sum_{j=1}^{m}\frac{n_j}{n}(a_j - \bar{x})^2 = \sum_{j=1}^{m} p_j(a_j - \bar{x})^2,$
where $p_j$ denotes the (classical) probability that the value is in the $j$-th class. Therefore, it is a weighted mean of the adjusted values, where the weight of the term $(a_j - \bar{x})^2$ is $p_j$. The variability of nominal values is expressed similarly (denote it by $H_X$). Even though there are no numerical values $a_j$ for the indices $j$, we can be interested in functions $F$ that depend on the relative frequencies $p_j$. For a data set $X$ we can define
$H_X = \sum_{i=1}^{n} p_i F(p_i),$
where $F$ is an unknown function with some reasonable properties. If the data set has only one value, i.e.
$p_k = 1$ for some $k$ and otherwise $p_j = 0$, then we agree that the variability is zero, and so $F(1) = 0$. Moreover, $H_X$ is required to have the following property: If a data set $Z$ consists of pairs of values from data sets $X$ and $Y$ (for example, one can observe eye colour and hair colour of people – statistical units), it is reasonable that the variability of $Z$ be the sum of the variabilities, that is, $H_Z = H_X + H_Y$. The relative class frequencies $p_i$ for the values of the data set $X$ and $q_j$ for those of $Y$ are known. The relative class frequencies for $Z$ are then $r_{ij} = \frac{n_i m_j}{nm} = p_i q_j$, so we demand the equality (the ranges of the sums are clear from the context)
$\sum_{i,j} p_i q_j F(p_i q_j) = \sum_i p_i F(p_i) + \sum_j q_j F(q_j).$

by the components of this vector is called the first principal component. The size of the projection onto this direction estimates the data quite well (the principal component can be viewed as a characteristic which substitutes for the $p$ characteristics, i.e., it is a random vector with $n$ components). If we subtract this projection from the data and consider the direction of the greatest variance again, we get the second principal component. Repeating this procedure further, we obtain the other principal components. The directions of the principal components correspond to the eigenvectors of the covariance matrix in decreasing order with respect to the size of the corresponding eigenvalues.

10.B.2. Find the first principal component of the following simple data and the vector which substitutes for them: Five people had their height, little finger length, and index finger length measured. The collected data are shown in the following table (in centimeters).

Solution.

           Martin  Michael  Matthew  John  Peggy
index f.   9       11       8        8     8
little f.  7.5     8        6.3      6     6.5
height     186     187      173      174   167

The vectors of the collected data are: $x_1 = (9;\ 7.5;\ 186)$, $x_2 = (11;\ 8;\ 187)$, $x_3 = (8;\ 6;\ 173)$, $x_4 = (8;\ 6;\ 174)$, $x_5 = (8;\ 6.5;\ 167)$. The covariance matrices of these vectors are:
$\begin{pmatrix} 0.04 & 0.14 & 1.72 \\ 0.14 & 0.49 & 6.02 \\ 1.72 & 6.02 & 73.96 \end{pmatrix},\quad \begin{pmatrix} 4.84 & 2.64 & 21.12 \\ 2.64 & 1.44 & 11.52 \\ 21.12 & 11.52 & 92.16 \end{pmatrix},\quad \begin{pmatrix} 0.64 & 0.64 & 3.52 \\ 0.64 & 0.64 & 3.52 \\ 3.52 & 3.52 & 19.36 \end{pmatrix},\quad \begin{pmatrix} 0.64 & 0.64 & 2.72 \\ 0.64 & 0.64 & 2.72 \\ 2.72 & 2.72 & 11.56 \end{pmatrix},\quad \begin{pmatrix} 0.64 & 0.24 & 8.32 \\ 0.24 & 0.09 & 3.12 \\ 8.32 & 3.12 & 108.16 \end{pmatrix}.$
The sample covariance matrix is then a quarter of their sum, i.e.,
$S = \begin{pmatrix} 1.70 & 1.075 & 9.35 \\ 1.075 & 0.825 & 6.725 \\ 9.35 & 6.725 & 76.30 \end{pmatrix}.$
The eigenvalues of $S$ are approximately 0.68, 78.05, and 0.10. The unit eigenvector corresponding to the greatest one is approximately $(0.122;\ 0.09;\ 0.989)$. Thus, the first principal component is $(185.5;\ 186.8;\ 172.4;\ 173.4;\ 166.5)$, which is not far from the people's heights. □

10.B.3. The students of a class had the following marks in various subjects:
Sometimes (especially in information theory), the binary logarithm is used instead of the natural logarithm. One often works with the quantity
$e^{H_X} = \prod_i p_i^{-p_i}$
(or with another logarithm base). In this form, for a data set $X$ with $k$ equal class frequencies, compute
$e^{H_X} = \left(\left(\tfrac{1}{k}\right)^{-\frac{1}{k}}\right)^k = k,$
which is independent of the sample size. The next illustration shows the 2-based entropy $y$ for the numbers of occurrences of the letters a, b in 10-letter words consisting of these characters, where $x$ is the number of occurrences of b. Note that the maximum entropy 1 occurs for the same number of a's and b's, and indeed $2^1 = 2$ as computed above. The following illustration displays the entropy of 11 randomly chosen strings of length 10 made of 8 characters. The values are all much less than the theoretical maximal value of 3. This reflects the fact that the numbers of occurrences of the individual 8 characters cannot be equal (or it could happen with a very small probability if the length of the string was 8

Student id  Maths  Physics  History  English  PE
1           1      1        2        2        1
2           1      3        1        1        1
3           2      1        1        1        1
4           2      2        2        2        1
5           1      1        3        2        1
6           2      1        2        1        2
7           3      3        2        2        1
8           3      2        1        1        1
9           4      3        2        3        1
10          2      3        1        2        1

Find the first principal component of these data and the vector which substitutes for them.

Solution. The vectors of observation are $x_1 = (1, 1, 2, 2, 1)$, ..., $x_{10} = (2, 3, 1, 2, 1)$. The corresponding covariance matrices are:
$\begin{pmatrix} 1.21 & 1.10 & -0.33 & -0.33 & 0.11 \\ 1.10 & 1.00 & -0.30 & -0.30 & 0.10 \\ -0.33 & -0.30 & 0.09 & 0.09 & -0.03 \\ -0.33 & -0.30 & 0.09 & 0.09 & -0.03 \\ 0.11 & 0.10 & -0.03 & -0.03 & 0.01 \end{pmatrix},\ \dots,\ \begin{pmatrix} 0.01 & -0.10 & 0.07 & -0.03 & 0.01 \\ -0.10 & 1.00 & -0.70 & 0.30 & -0.10 \\ 0.07 & -0.70 & 0.49 & -0.21 & 0.07 \\ -0.03 & 0.30 & -0.21 & 0.09 & -0.03 \\ 0.01 & -0.10 & 0.07 & -0.03 & 0.01 \end{pmatrix}.$
The sample covariance matrix is
$\begin{pmatrix} 0.99 & 0.44 & -0.078 & 0.26 & -0.01 \\ 0.44 & 0.89 & -0.22 & 0.22 & -0.11 \\ -0.078 & -0.22 & 0.45 & 0.23 & 0.033 \\ 0.26 & 0.22 & 0.23 & 0.45 & -0.078 \\ -0.01 & -0.11 & 0.033 & -0.078 & 0.10 \end{pmatrix}.$
Its dominant eigenvalue is about 1.52, and the corresponding unit eigenvector is approximately $(0.70;\ 0.65;\ -0.13;\ 0.28;\ -0.07)$. Therefore, the principal component is $(1.58;\ 2.73;\ 2.13;\ 2.93;\ 1.45;\ 1.93;\ 4.28;\ 3.48;\ 5.26;\ 3.71)$. □

Another possible method of visualization of multidimensional data is the so-called cluster analysis, but we will not go into further details here.

C. Classical and conditional probability

In the first chapter, we met the so-called classical probability, see 1.4.1. Just to recall it, let us try to solve the following (a bit more complicated) problem:

10.C.1. Aleš wants to buy a new bike, which costs 5100 crowns. He has 2500 crowns left from organizing a camp. Aleš is no dope: he took 50 more crowns from his pocket money and went to the casino to play roulette. Aleš always bets on red. This means that the probability of winning is 18/37 and the amount he wins is equal to the amount he has
Recall that when we talked about geometric probability at the end of the fourth part of chapter one, the sample space for the description of an event was a part of a Euclidean space, and events were suitable subsets of it. All of those sets were uncountable. Begin with a simple (infinite, yet still discrete) example, to which we return from time to time throughout this section.

10.2.1. Why infinite sets of events? Imagine an experiment where a coin is repeatedly tossed until it comes up heads. There are many questions to be asked about this experiment: What is the probability of tossing the coin at least 3 times? (or exactly 35 times, or at most 10 times, etc.) The outcomes of this experiment can be considered in the form $\omega_k \in \mathbb{N}_{\ge 1} \cup \{\infty\}$, which could be read as "the coin comes up heads for the first time in the $k$-th toss". Note that $k = \infty$ is included, since the possibility that the coin always comes up tails must be allowed, too. This problem is solved if the classical probability 1/2 of the coin coming up heads in one toss is used (and the same for tails). In the abstract model, the total number of tosses cannot be bounded by any natural number $N$. On the other hand, the probability that the coin comes up tails in the first $(k-1)$ tosses and heads in the $k$-th one, out of the total number of $n \ge k$ tosses, is given by the fraction
$\frac{2^{n-k}}{2^n} = 2^{-k},$
where in the numerator there is the number of favorable possibilities out of $n$ independent tosses (i.e. the number of possibilities of distributing the two values among the $n-k$ remaining positions), while in the denominator there is the number of

bet. His betting strategy is as follows: The first time, he bets 10 crowns. Each time he has lost, he bets twice the previous bet (if he does not have enough money to make this bet, he leaves the casino, deeply depressed). Each time he has won, he bets 10 crowns again. What is the probability that, using this strategy, he wins the desired 2550 more crowns? (As soon as this happens, he immediately runs to buy the bike.)

Solution. First of all, we calculate how many times Aleš can lose in a row. If he bets 10 crowns the first time, then in order to bet $n$ times, he needs
$10 + 20 + \cdots + 10\cdot 2^{n-1} = 10\cdot\left(\sum_{i=0}^{n-1} 2^i\right) = 10\cdot\frac{2^n - 1}{2 - 1} = 10\cdot(2^n - 1)$
crowns. As we can see, the number 2550 is of the form $10(2^n - 1)$ for $n = 8$. This means that Aleš can bet eight times in a row regardless of the odds. He can never bet nine times in a row, because for that he would have to have $10(2^9 - 1) = 5110$ crowns, which he will never reach (he stops betting as soon as he has 5100 crowns). Therefore, Aleš loses the whole game if and only if he loses eight consecutive bets. The probability of losing one bet is 19/37; hence, the probability of losing eight consecutive (independent) bets is $(19/37)^8$. Thus, the probability that he wins 10 crowns (using his strategy) is $1 - (19/37)^8$. In order to win 2550 crowns, he must win 255 times, and the probability of this is
$\left(1 - \left(\tfrac{19}{37}\right)^8\right)^{255} \doteq 0.29.$
Therefore, the probability of winning using his strategy is much lower than if he bet everything on red straightaway. □

10.C.2. You could try to solve a slight modification of the above problem: Joe stops playing only if he loses all his money; if he still has some money, but not enough to bet twice the previous bet, he bets 10 dollars again. We also met the conditional probability in the first chapter, see 1.4.8.

10.C.3. Let $A$, $B$ be two events such that $B$ is a disjoint union of events $B_1, B_2, \dots, B_n$.
Using the definition of conditional probability (see 10.2.6), prove that
(1) $P(A|B) = \sum_{i=1}^{n} P(A|B_i)P(B_i|B).$

all possible outcomes. As expected, this probability is independent of the chosen $n$, and indeed $\sum_{k=1}^{\infty} 2^{-k} = 1$. Therefore, the probability of tossing only tails is zero. Thus we can define probability on the sample space $\Omega$ with sample points (outcomes) $\omega_k$, whose probability is $2^{-k}$. This leads to a probability according to the definitions below. We return to this example throughout this section.

10.2.2. σ-fields. Work with a fixed non-empty set $\Omega$, which contains the possible outcomes of the experiment and which is called the sample space. The possible outcomes $\omega \in \Omega$ are also called sample points. In probability models, not all subsets of outcomes need be admitted. In particular, the singletons $\{\omega\}$ need not be considered. Those subsets whose probability we want to measure are required to satisfy the axioms of the so-called σ-algebras. The axioms listed below are chosen from a larger collection of natural requirements in a minimal form. The first one is based on the assumption that the universal event should be a measurable set. The second one is forced by the assumption that events can be negated. The third one reflects the necessity to examine the event of the occurrence of at least one event from a countably infinite collection. (For instance, in the example from the previous subsection, the coin is tossed only finitely many times, but there is no upper bound on the number of tosses.)

σ-algebras of subsets

A collection $\mathcal{A}$ of subsets of the sample space is called a σ-algebra or σ-field, and its elements are called events or measurable sets, if and only if
• $\Omega \in \mathcal{A}$, i.e., the sample space is an event;
• if $A, B \in \mathcal{A}$, then $A \setminus B \in \mathcal{A}$, i.e., the set difference of two events is also an event;
• if $A_i \in \mathcal{A}$, $i \in I$, is a countable collection of events, then their union is also an event, i.e., $\cup_{i\in I} A_i \in \mathcal{A}$.

As usual, the basic axioms imply simple corollaries which describe further (intuitively required) properties in the form of mathematical theorems. The reader should check carefully that both following properties hold.
• The complement $A^c = \Omega \setminus A$ of an event $A$ is again an event.
• The intersection of two events is again an event, since for any two subsets $A, B \subset \Omega$, $A \setminus (\Omega \setminus B) = A \cap B$.
Actually, for any countable system of events $A_i$, $i \in I$, the event $\Omega \setminus \cup_{i\in I} A_i^c = \cap_{i\in I} A_i$ is also in the σ-algebra $\mathcal{A}$. Altogether, a σ-algebra is a collection of subsets of the sample space which is closed with respect to set differences, countable unions, and countable intersections.

Solution. First, note that the events $A\cap B_1$, $A\cap B_2$, ..., $A\cap B_n$ are also disjoint. Therefore, we can write
$P(A|B_1 \cup \cdots \cup B_n) = \frac{P\left(A \cap (B_1 \cup \cdots \cup B_n)\right)}{P(B_1 \cup \cdots \cup B_n)} = \frac{P\left((A\cap B_1) \cup (A\cap B_2) \cup \cdots \cup (A\cap B_n)\right)}{P(B)} = \sum_{i=1}^{n}\frac{P(A\cap B_i)}{P(B_i)}\cdot\frac{P(B_i)}{P(B)} = \sum_{i=1}^{n} P(A|B_i)P(B_i|B).$ □

10.C.4. We have four bags with balls: In the first bag, there are four white balls. In the second bag, there are three white balls and one black ball. In the third bag, there are two white and two black balls. Finally, in the fourth bag, there are four black balls. We randomly pick a bag and take two balls out of it (without putting the first one back). Find the probability that
a) the balls are of different colors;
b) the second ball is white provided the first ball was white.

Solution.
Since there is the same number of balls in each of the bags, any ball has the same probability of being taken (similarly for any pair of balls lying in the same bag). Therefore, we can solve this problem using classical probability.
a) Altogether, there are 24 pairs of balls that can be taken. Out of them, 7 consist of balls of different colors. Therefore, the wanted probability is 7/24.
b) Let $A$ denote the event that the first ball is white and $B$ denote the event that the second ball is white. Then, $P(B\cap A)$ is the probability that both balls are white, and this is equal to $10/24 = 5/12$ since there are 10 such pairs. Again, we can use classical probability to calculate $P(A)$: there are 16 balls in total, and 9 of them are white. Altogether, we have
$P(B|A) = \frac{P(B\cap A)}{P(A)} = \frac{\frac{5}{12}}{\frac{9}{16}} = \frac{20}{27}.$

Another solution. The event $A$ can be viewed as the union of three mutually exclusive events $A_1$, $A_2$, $A_3$ that we took a white ball from the first, second, and third bag, respectively. Since there is the same number of balls in each of the bags, the probability of taking any (white) ball is also the same (independent of which ball it is), so we get $P(A) = \frac{9}{16}$ and
$P(A_1|A) = \frac{\frac{4}{16}}{\frac{9}{16}} = \frac{4}{9},\quad P(A_2|A) = \frac{3}{9} = \frac{1}{3},\quad P(A_3|A) = \frac{2}{9}.$
Applying (5), we obtain
$P(B|A) = P(B|A_1)P(A_1|A) + P(B|A_2)P(A_2|A) + P(B|A_3)P(A_3|A) = P(B|A_1)\cdot\frac{P(A_1)}{P(A)} + P(B|A_2)\cdot\frac{P(A_2)}{P(A)} + P(B|A_3)\cdot\frac{P(A_3)}{P(A)} = 1\cdot\frac{4}{9} + \frac{2}{3}\cdot\frac{1}{3} + \frac{1}{3}\cdot\frac{2}{9} = \frac{20}{27}.$ □

10.2.3. Probability space. Now introduce probability in the mathematical model, recalling the concepts used already in the first chapter.

Elementary concepts

Use the following terminology in connection with events:
• the entire sample space $\Omega$ is called the universal event; the empty set $\emptyset \in \mathcal{A}$ is called the null event;
• the singletons $\omega \in \Omega$ are called elementary events (note that $\{\omega\}$ may not even be an event in $\mathcal{A}$);
• the intersection of events $\cap_{i\in I} A_i$ corresponds to the simultaneous occurrence of all the events $A_i$, $i \in I$;
• the union of events $\cup_{i\in I} A_i$ corresponds to the occurrence of at least one of the events $A_i$, $i \in I$;
• if $A\cap B = \emptyset$, then $A, B \in \mathcal{A}$ are called exclusive events or disjoint events;
• if $A \subset B$, then the event $A$ implies the event $B$;
• if $A \in \mathcal{A}$, then the event $B = \Omega \setminus A$ is called the complementary event to $A$ and denoted $B = A^c$.

We have seen an example of probability defined on an infinite sample space in 10.2.1 above. In general, probability is comprehended as follows:

Probability

Definition. A probability space is the σ-algebra $\mathcal{A}$ of subsets of the sample space $\Omega$ on which there is a scalar function $P : \mathcal{A} \to \mathbb{R}$ with the following properties:
• $P$ is non-negative, i.e., $P(A) \ge 0$ for all events $A$;
• $P$ is countably additive, i.e., $P(\cup_{i\in I} A_i) = \sum_{i\in I} P(A_i)$ for every countable collection of mutually exclusive events;
• the probability of the universal event is 1.
The function $P$ is called the probability function on $(\Omega, \mathcal{A})$.

Immediately from the definition, the complementary event satisfies $P(A^c) = 1 - P(A)$. In chapter one, theorems on addition of probabilities were derived. Although dealing with finite sample spaces, the arguments remain the same now. In particular, the inclusion and exclusion principle says for any finite collection of $k$ events $A_i$ that
$P\left(\cup_{i=1}^{k} A_i\right) = \sum_{i=1}^{k} P(A_i) - \sum_{i=1}^{k-1}\sum_{j=i+1}^{k} P(A_i\cap A_j) + \sum_{i=1}^{k-2}\sum_{j=i+1}^{k-1}\sum_{\ell=j+1}^{k} P(A_i\cap A_j\cap A_\ell) - \cdots + (-1)^{k-1} P(A_1\cap A_2\cap\cdots\cap A_k).$

10.C.5. We have four bags with balls: In the first bag, there are four white balls.
In the second bag, there are three white balls and one black ball. In the third bag, there are two white and two black balls. Finally, in the fourth bag, there are one white and three black balls. We randomly pick a bag and take a ball out of it, finding out that it is black. Then we throw away this bag, pick another one, and take a ball out of it. What is the probability that it is white?

Solution. Similarly as in the above exercise, let $A$ denote the event that the very first ball is black. This event can be viewed as the union of mutually exclusive events $A_i$, $i = 2, 3, 4$, where $A_i$ is the event of picking the $i$-th bag and taking a black ball from there. Again, the probability of picking any (black) ball is the same. Hence, $P(A_2|A) = \frac{1}{6}$, $P(A_3|A) = \frac{2}{6} = \frac{1}{3}$, and $P(A_4|A) = \frac{3}{6} = \frac{1}{2}$. Let $B$ denote the event that the second ball is white. If the thrown-away bag is the second one, then there are a total of 7 white balls remaining, so the probability of taking one of them is $P(B|A_2) = \frac{7}{12}$ (we can use classical probability again because each of the bags contains the same number of balls, so any ball has the same probability of being taken). Similarly, $P(B|A_3) = \frac{8}{12}$ and $P(B|A_4) = \frac{9}{12}$. Applying (5), we get that the wanted probability is
$P(B|A) = P(B|A_2)P(A_2|A) + P(B|A_3)P(A_3|A) + P(B|A_4)P(A_4|A) = \frac{7}{12}\cdot\frac{1}{6} + \frac{8}{12}\cdot\frac{1}{3} + \frac{9}{12}\cdot\frac{1}{2} = \frac{25}{36}.$ □

10.C.6. We have four bags with balls: In the first bag, there are a white ball and a black ball. In the second bag, there are three white balls and one black ball. In the third bag, there are one white and two black balls. Finally, in the fourth bag, there are one white and three black balls. We randomly pick a bag and take a ball out of it, finding out that it is white. Then we throw away this bag, pick another one, and take a ball out of it. What is the probability that it is white?

Solution. Similarly as in the above exercise, we view the event $A$ of the first ball being white as the union of four mutually exclusive events $A_1$, $A_2$, $A_3$, and $A_4$ that we take a white ball from the first, second, third, and fourth bag, respectively. The probability of taking a white ball out of the first bag is $P(A_1) = \frac{1}{4}\cdot\frac{1}{2}$ (the probability of $A_1$ is the product of the probability that we pick the first bag and the probability that we take a white ball from there); similarly, $P(A_2) = \frac{1}{4}\cdot\frac{3}{4}$, $P(A_3) = \frac{1}{4}\cdot\frac{1}{3}$, $P(A_4) = \frac{1}{4}\cdot\frac{1}{4}$. Then,
$P(A) = P(A_1) + P(A_2) + P(A_3) + P(A_4) = \frac{11}{24}.$
Note that the probability $P(A)$ cannot be calculated classically, i.e., by simply dividing the number of white balls by the total number of the balls, because, for instance, the probability of taking a white ball from the first bag is twice as great as from the fourth bag. As for the conditional probabilities, we have $P(A_1|A) = P(A_1)/P(A) = \frac{3}{11}$, $P(A_2|A) = \frac{9}{22}$, $P(A_3|A) = \frac{2}{11}$, $P(A_4|A) = \frac{3}{22}$. Now, let $B$ denote the event that we take another white ball after we have thrown away the first bag. We want to apply (5) again. It remains to compute $P(B|A_i)$, $i = 1, \dots, 4$. The probability $P(B|A_1)$ can be

The reader should look back at 1.4.5 and think about the details.

10.2.4. Independent events. The definition of stochastically independent events also remains unchanged. It reflects the intuition that the probability of the simultaneous occurrence of independent events is equal to the product of the particular probabilities.

Stochastic independence

Events $A$, $B$ are said to be stochastically independent if and only if $P(A\cap B) = P(A)P(B)$.
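As an aside, computations such as those of 10.C.5 are mechanical enough to delegate to a computer with exact rational arithmetic. A sketch of ours reproducing $P(B|A) = 25/36$ by enumerating the bags:

    from fractions import Fraction as F

    bags = [(4, 0), (3, 1), (2, 2), (1, 3)]   # (white, black) as in 10.C.5

    num = F(0)   # P(first ball black and second ball white)
    den = F(0)   # P(first ball black)
    for i, (w1, b1) in enumerate(bags):
        p_first_black = F(1, 4) * F(b1, 4)
        den += p_first_black
        # the drawn bag is thrown away; one of the other three is picked uniformly
        for j, (w2, b2) in enumerate(bags):
            if j != i:
                num += p_first_black * F(1, 3) * F(w2, 4)

    print(num / den)   # 25/36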
Of course, the universal event and the null event are stochastically independent of any event. Recall that replacing an event $A_i$ with the complementary event $A_i^c$ in a collection of stochastically independent events $A_1, A_2, \dots$ again results in a collection of stochastically independent events, and (see 1.4.7, page 23)
$P(A_1 \cup \cdots \cup A_k) = 1 - P(A_1^c \cap \cdots \cap A_k^c) = 1 - (1 - P(A_1))\cdots(1 - P(A_k)).$
Classical finite probability remains the fundamental example of probability, used as the inspiration during the creation of the mathematical model. Recall that in this case, $\Omega$ is a finite set, the σ-algebra $\mathcal{A}$ is the collection of all subsets of $\Omega$, and the classical probability is the probability space $(\Omega, \mathcal{A}, P)$ with the probability function
$P : \mathcal{A} \to \mathbb{R},\quad P(A) = \frac{|A|}{|\Omega|}.$
This corresponds precisely to the intuition about the relative frequency $p_A$ of an event $A$ when drawing a random element from the sample set $\Omega$. This definition of probability guarantees reasonable behaviour of monotone sequences of events:

10.2.5. Theorem. Consider a probability space $(\Omega, \mathcal{A}, P)$ and a non-decreasing sequence of events $A_1 \subset A_2 \subset \dots$. Then,
$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \lim_{i\to\infty} P(A_i).$
Similarly, if $A_1 \supset A_2 \supset A_3 \supset \dots$, then
$P\left(\bigcap_{i=1}^{\infty} A_i\right) = \lim_{i\to\infty} P(A_i).$

Proof. The considered union $A = \cup_{i=1}^{\infty} A_i$ can be rewritten in terms of the mutually exclusive events $\tilde{A}_i = A_i \setminus A_{i-1}$, defined for all $i = 2, 3, \dots$. Set $\tilde{A}_1 = A_1$. Then,
$P(A) = P\left(\bigcup_{i=1}^{\infty}\tilde{A}_i\right) = \sum_{i=1}^{\infty} P(\tilde{A}_i) = \lim_{k\to\infty}\sum_{i=1}^{k} P(\tilde{A}_i).$

computed as the sum of the probabilities of the mutually exclusive events $B_2$, $B_3$, $B_4$ (given $A_1$) that the second white ball comes from the second, third, fourth bag, respectively. Altogether, we have
$P(B|A_1) = P(B_2|A_1) + P(B_3|A_1) + P(B_4|A_1) = \frac{1}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot\frac{1}{3} + \frac{1}{3}\cdot\frac{1}{4} = \frac{4}{9}.$
Similarly,
$P(B|A_2) = \frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{1}{3} + \frac{1}{3}\cdot\frac{1}{4} = \frac{13}{36},\quad P(B|A_3) = \frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot\frac{1}{4} = \frac{1}{2},\quad P(B|A_4) = \frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot\frac{1}{3} = \frac{19}{36}.$
Altogether, we get
$P(B|A) = P(B|A_1)P(A_1|A) + P(B|A_2)P(A_2|A) + P(B|A_3)P(A_3|A) + P(B|A_4)P(A_4|A) = \frac{4}{9}\cdot\frac{3}{11} + \frac{13}{36}\cdot\frac{9}{22} + \frac{1}{2}\cdot\frac{2}{11} + \frac{19}{36}\cdot\frac{3}{22} = \frac{19}{44}.$ □

10.C.7. Two shooters shoot at a target; each makes two shots. Their respective accuracies are 80 % and 60 %. We have found two hits in the target. What is the probability that they both belong to the first shooter?

Solution. The probability of hitting the target is 4/5 for the first shooter, and 3/5 for the second one. Consider the events:
A ... there are two hits in the target, both of the first shooter,
B ... there are two hits in the target.
Our task is to find $P(A|B)$. We can divide the event $B$ into six disjoint events according to which shot(s) of each shooter was/were successful. We enumerate the events in a table and, for each of them, we compute its probability. This is easy, as each of the events is the intersection of four independent events (the results of the four shots). A hit is denoted by 1, a miss by 0.

      Shooter 1   Shooter 2   probability
B1    0 1         0 1         (1/5)·(4/5)·(2/5)·(3/5) = 24/625
B2    0 1         1 0         24/625
B3    1 0         1 0         24/625
B4    1 0         0 1         24/625
B5    1 1         0 0         64/625
B6    0 0         1 1         9/625

Adding up the probabilities of these disjoint events, we get:
$P(B) = \sum_{i=1}^{6} P(B_i) = \frac{169}{625}.$

For the finite sums,
$\sum_{i=1}^{k} P(\tilde{A}_i) = P(A_1) + \sum_{i=2}^{k}\left(P(A_i) - P(A_{i-1})\right) = P(A_k)$
by the assumptions $A_{i-1} \subset A_i$. This proves the first part of the theorem. In the second part, consider the complements $B_i = A_i^c$ instead of the events $A_i$. They satisfy the assumptions of the first part of this theorem.
Then, the complement of the considered intersection is
$B = A^c = \left(\bigcap_{i=1}^{\infty} A_i\right)^c = \bigcup_{i=1}^{\infty} B_i.$
The desired statement follows from the fact that
$P(A) = 1 - P(B) = 1 - \lim_{i\to\infty} P(B_i) = \lim_{i\to\infty}\left(1 - P(B_i)\right),$
which completes the proof. □

10.2.6. Conditional probability. Consider the following problem: On average, 40% of students succeed in course X and 80% of students succeed in course Y. If a random student enrolled in both these courses tells us that he has passed one of them (but we did not catch which one), what is the probability that he meant course X? As mentioned in subsection 1.4.8 (page 24), such problems can be formalized in the way described below. (We shall come back to the solution of the latter problem in 10.3.12.)

Conditional probability

Definition. Let $H$ be an event with non-zero probability in the σ-algebra $\mathcal{A}$ of a probability space $(\Omega, \mathcal{A}, P)$. The conditional probability $P(A|H)$ of an event $A \in \mathcal{A}$ with respect to the hypothesis $H$ is defined as
$P(A|H) = \frac{P(A\cap H)}{P(H)}.$

The definition corresponds to the intuition from classical probability that the probability of the events $A$ and $H$ occurring simultaneously, provided the event $H$ has occurred, is $P(A\cap H)/P(H)$. Directly from the definition, the hypothesis $H$ and the event $A$ are independent if and only if $P(A) = P(A|H)$. At first sight, it may seem that introducing conditional probability does not add anything new. Actually, it is a very important type of approach which is needed in statistics as well. The hypothesis can be the a priori probability (i.e. the prior belief assumed beforehand), and the resulting probability is said to be a posteriori (i.e., it is considered to be a consequence of the assumption). This is the core of the Bayesian approach to statistics, as is seen later. The definition also implies the following result.

Now, we can compute the conditional probability, using the formula of subsection 10.2.6:
$P(A|B) = \frac{P(A\cap B)}{P(B)} = \frac{P(B_5)}{P(B)} = \frac{\frac{64}{625}}{\frac{169}{625}} = \frac{64}{169} \doteq 0.38.$ □

10.C.8. We toss a coin. If it comes up heads, we put a white ball into an (initially empty) bag; otherwise, we put a black ball there. This is repeated $n$ times. Then, we take a ball randomly from the bag (without replacement). Suppose it is white. What is the probability that another ball we take randomly from the bag is black?

Solution. We will solve the problem for a general (possibly biased) coin. In particular, we assume that the individual tosses are independent and that there exists a fixed probability of the coin coming up heads, which we denote $p$. The event "a ball in the bag is white" corresponds to the event "the coin came up heads in the corresponding toss". Since the first ball was white, we deduce that $p > 0$. We can also see that the probability space "taking a random ball from the bag" is isomorphic to the probability space "tossing a coin". Since we assume that the individual tosses are independent, we also get the independence of the colors of the selected balls. This leads to the conclusion that the probability in question is $1-p$. Is this reasoning correct? Do we not expect the probability of taking a black ball to be greater than $1-p$? See, there were approximately $np$ white and $n(1-p)$ black balls in the bag, so if we have removed one white ball, the probability of selecting a black one should increase, shouldn't it? Before reading further, try to figure out which (if any) of these two presented reasonings is correct, and whether the probability also depends on $n$ (the number of balls in the bag before any were removed).
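Before turning to the exact computation, the question can be probed by simulation. The following Monte Carlo sketch is ours (the function name and parameters are illustrative); it estimates P(second ball black | first ball white):

    import random

    def estimate(n, p, trials=200_000, seed=0):
        # fill the bag by n independent tosses, then draw two balls
        # without replacement, conditioning on the first being white
        rng = random.Random(seed)
        hits = total = 0
        for _ in range(trials):
            bag = ['W' if rng.random() < p else 'B' for _ in range(n)]
            rng.shuffle(bag)
            if bag[0] == 'W':
                total += 1
                hits += (bag[1] == 'B')
        return hits / total

    print(estimate(10, 0.3))   # close to 1 - p = 0.7
    print(estimate(2, 0.3))    # also close to 0.7

The estimates support the first (isomorphism) reasoning, as the exact computation below confirms.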
Now, we select a more sophisticated approach to the problem. Let Bi denote the event "there were i white balls in the bag" (before any were removed), i ∈ {0, 1, 2, . . . , n}. Further, let A denote the event "the first ball is white" and C the event "the second ball is black". The event Bi says that the coin came up heads i times out of n; hence, its probability is

P(Bi) = (n i) p^i (1 − p)^(n−i).

The conditional probability of taking a white ball, provided there are exactly i white balls in the bag, equals P(A|Bi) = i/n. We are interested in the probability of C, knowing that A has occurred, i.e., we want to know P(C|A). Since the events Bi are pairwise disjoint, so are the events C ∩ Bi. Since C can be decomposed as the disjoint union ∪_{i=0}^{n} (C ∩ Bi), we can write

P(C|A) = P(∪_{i=0}^{n} (C ∩ Bi) | A) = ∑_{i=0}^{n} P((C ∩ Bi) ∩ A)/P(A)
= (1/P(A)) ∑_{i=0}^{n} P(C ∩ (A ∩ Bi))
= (1/P(A)) ∑_{i=0}^{n} P(A ∩ Bi) P(C|A ∩ Bi)
= (1/P(A)) ∑_{i=0}^{n} P(Bi) P(A|Bi) P(C|A ∩ Bi).

We use the law of total probability and substitute for P(A), which leads to

(1) P(C|A) = ∑_{i=0}^{n} P(Bi)P(A|Bi)P(C|A ∩ Bi) / ∑_{i=0}^{n} P(Bi)P(A|Bi).

This formula is sometimes called the second Bayes' formula; it holds in general, provided the space Ω is a disjoint union of the events Bi. Since we tossed the coin at least once, we have n ≥ 1. Now, we can calculate (noting that P(C|A ∩ Bi) = (n − i)/(n − 1) for n > 1):

∑_{i=0}^{n} P(Bi)P(A|Bi) = ∑_{i=0}^{n} (n i) p^i (1 − p)^(n−i) (i/n)
= ∑_{i=1}^{n} ((n−1)!/((i−1)!(n−i)!)) p^i (1 − p)^(n−i)
= ∑_{i=0}^{n−1} ((n−1)!/(i!(n−i−1)!)) p^(i+1) (1 − p)^(n−i−1)
= p ∑_{i=0}^{n−1} (n−1 i) p^i (1 − p)^(n−1−i)
= p (p + (1 − p))^(n−1) = p,

∑_{i=0}^{n} P(Bi)P(A|Bi)P(C|A ∩ Bi) = ∑_{i=0}^{n} (n i) p^i (1 − p)^(n−i) (i/n) ((n − i)/(n − 1))
= ∑_{i=1}^{n−1} ((n−2)!/((i−1)!(n−i−1)!)) p^i (1 − p)^(n−i)
= ∑_{i=0}^{n−2} ((n−2)!/(i!(n−2−i)!)) p^(i+1) (1 − p)^(n−i−1)
= p(1 − p) ∑_{i=0}^{n−2} (n−2 i) p^i (1 − p)^(n−2−i) = p(1 − p)

for n > 1, while for n = 1 the last sum is empty and the value is 0.

Lemma. Let an event B be the union of mutually exclusive events B1, B2, . . . , Bn. Then

(1) P(A|B) = ∑_{i=1}^{n} P(A|Bi)P(Bi|B).

Proof. The events A ∩ B1, A ∩ B2, . . . , A ∩ Bn are also mutually exclusive. Therefore,

P(A|B1 ∪ · · · ∪ Bn) = P(A ∩ (B1 ∪ · · · ∪ Bn))/P(B1 ∪ · · · ∪ Bn)
= P((A ∩ B1) ∪ (A ∩ B2) ∪ · · · ∪ (A ∩ Bn))/P(B)
= ∑_{i=1}^{n} (P(A ∩ Bi)/P(Bi)) (P(Bi)/P(B))
= ∑_{i=1}^{n} P(A|Bi)P(Bi|B). □

Consider the special case B = Ω. Then, the events Bi can be considered the "possible states of the universe", P(A|Bi) expresses the probability of A provided the universe is in its i-th state, and P(Bi|Ω) = P(Bi) is the probability of the universe being in its i-th state. By the above lemma,

P(A) = P(A|Ω) = ∑_{i=1}^{n} P(A|Bi)P(Bi).

This formula is called the law of total probability.

10.2.7. Bayes' theorem. Simple rearrangement of the conditional probability formula leads to

P(A ∩ B) = P(B ∩ A) = P(A)P(B|A) = P(B)P(A|B).

There are two important corollaries:

Bayes' rules

Theorem. The probabilities of events A and B satisfy

(1) P(A|B) = P(A)P(B|A)/P(B),
(2) P(A|B) = P(A)P(B|A)/(P(A)P(B|A) + P(Ac)P(B|Ac)).

The first proposition is called the inverse probability formula. The second proposition is called the first Bayes' formula.

Proof. The first statement is a mere rearrangement of the formula above the theorem. To obtain the second statement, note that P(B) = P(B ∩ A) + P(B ∩ Ac). Applying the law of total probability,

P(B) = P(A)P(B|A) + P(Ac)P(B|Ac)

can be substituted into the inverse probability formula, thereby obtaining the second statement of the theorem. □
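The lemma is exactly what was used in problem 10.C.6 above, with the weights P(Ai|A) summing to one. That arithmetic is easy to verify with exact fractions; a minimal sketch:

    from fractions import Fraction as F

    conds   = [F(4, 9), F(13, 36), F(1, 2), F(19, 36)]   # P(B|Ai) from 10.C.6
    weights = [F(3, 11), F(9, 22), F(2, 11), F(3, 22)]   # P(Ai|A) from 10.C.6
    print(sum(weights))                                  # 1, an exhaustive decomposition
    print(sum(c * w for c, w in zip(conds, weights)))    # 19/44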
Bayes' rule is sometimes formulated in a somewhat more general form, proved similarly as in (??): Let the sample space Ω be the union of mutually exclusive events A1, . . . , An. Then, for any i ∈ {1, . . . , n},

(3) P(Ai|B) = P(B|Ai)P(Ai) / ∑_{k=1}^{n} P(B|Ak)P(Ak).

10.2.8. Example and remarks. Now, the introductory question from 10.2.6 can be dealt with easily. Consider the event A, which corresponds to "the student has passed an exam", and the event B, which corresponds to "the exam in question concerns course X". Assume that the probabilities of the exam concerning either course are the same, i.e., P(B) = P(Bc) = 0.5. While the wanted probability P(B|A) is unclear, the probability P(A|B) = 0.4 is given, as well as P(A|Bc) = 0.8. This is a typical application of Bayes' formula ??(??). There is no need to calculate P(A) at all:

P(B|A) = P(B)P(A|B) / (P(B)P(A|B) + P(Bc)P(A|Bc)) = (0.5 · 0.4)/(0.5 · 0.4 + 0.5 · 0.8) = 1/3.

In order to better understand the role of the prior probability, here is another example. Consider a university using entrance exams with the following reliability: 99% of intelligent people pass them, while among non-intelligent people, only 0.5% are able to pass. It is desired to find the probability that a random student (accepted applicant) of the university is intelligent. Thus, let A be the event "a random person is intelligent" and B the event "the person passed the exams successfully". Using Bayes' formula, the probability that A occurs provided B has occurred can be computed. It is only necessary to supply the overall probability p = P(A) that a random applicant is intelligent:

P(A|B) = 0.99p / (0.99p + 0.005(1 − p)).

The following table presents the result for various values of p. The first column corresponds to the case that every other applicant is intelligent, etc.

p        0.5    0.1    0.05   0.01   0.001   0.0001
P(A|B)   0.99   0.96   0.91   0.67   0.17    0.02

Therefore, if every other applicant is intelligent, then 99% of the students are intelligent. If only 1% of the population meets the expectation of "intelligence" and the applicants form a good random sample, then only about two thirds of the students are intelligent, etc.

Consider similar tests for the occurrence of a disease, say HIV. There may be a test with the same reliability as the one above, used to test all students present at the university. In this case, assume that the parameter p is close to the one for the entire population (say, 1 out of 10000 people is infected, on average), which corresponds to the last column of the table above.

Substituting the two sums computed above into the second Bayes' formula (1), we obtain the wanted probability

P(C|A) = 0 for n = 1, and P(C|A) = 1 − p for n > 1.

Thus, the simple reasoning about the probability spaces being isomorphic led to the correct result. The second reasoning was wrong because it omitted the fact that, since the first ball was white, the expected number of white balls in the bag (before removing the first one) was greater than np. The calculation highlights the singular case n = 1. □
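The table in 10.2.8 is reproduced by evaluating the first Bayes' formula directly; a minimal sketch (the function and parameter names here are ours):

    def posterior(p, passed_if_ok=0.99, passed_if_not=0.005):
        """First Bayes' formula: P(intelligent | passed) for the prior p."""
        return passed_if_ok * p / (passed_if_ok * p + passed_if_not * (1 - p))

    for p in (0.5, 0.1, 0.05, 0.01, 0.001, 0.0001):
        print(p, round(posterior(p), 2))
    # prints 0.99, 0.96, 0.91, 0.67, 0.17, 0.02, the row of the table above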
10.C.9. Once upon a time, there was a quiz show whose first prize was a Ferrari 599 GTB Fiorano. The contestant who won the final round was taken into a room with three identical doors. Behind two of them, there were goats, while behind the third one was the car. In order to win the car, the contestant had to guess the correct door. First, the contestant pointed at one of the three doors. Then, an assistant opened one of the other two doors, behind which there was a goat. Now, the contestant is given the option to change his guess. Should he do so?

Solution. Of course, we assume that the contestant wants to win the car. First of all, try to examine your intuition for random events. For example, you can reason as follows: "One of the two remaining doors contains the car, each with the same probability. Therefore, it does not matter which door we choose." Or: "The probability of choosing the correct door at the beginning is 1/3. The shown goat changes nothing, so the probability that the guess is wrong is 2/3. Therefore, we should change the door, thereby winning with probability 2/3."

Apparently, it is wise to change the door only if the probability of the car being behind the other closed door is greater than behind the initially chosen one. We consider the following events: H stands for "the initial guess is correct", A stands for "we have changed the door", and C for "we have won". We are thus interested in the probabilities P(C|A) and P(C|Ac). First, we choose one of three doors, and the Ferrari is behind one of them, so

P(H) = 1/3, P(Hc) = 1 − 1/3 = 2/3.

We assume that the event of changing the door is independent of the original guess; hence

P(A|H) = P(A|Hc) = P(A), P(Ac|H) = P(Ac|Hc) = P(Ac).

If the original guess is correct and it is changed, then we surely lose, while if it is originally wrong and then changed, we surely win. Therefore, we have

P(C|A ∩ H) = 0 = P(C|Ac ∩ Hc), P(C|Ac ∩ H) = 1 = P(C|A ∩ Hc).

Clearly, the result of the test is catastrophically unreliable: only about 2% of the students who are tested positive are really infected! Note that the problem with both tests is the same one. It is clear that real entrance exams require good selectivity and reliability, so the university marketing must ensure that the actual applicants do not form a good random sample of the population. Perhaps the university should try to discourage "non-intelligent" people from applying and thus secure a sufficiently low number of such applicants. With diseases, even a very rare occurrence of healthy people tested positive can be devastating. If the test is improved so that it is 100% reliable for infected people, this has almost no impact on the resulting probabilities in the table. Thus, if a person is tested positive when diagnosing a rare disease, it is necessary to make further tests. Then, the result P(A|B) of the first test plays the role of the prior probability P(A) during the second test, etc. This approach allows one to "accumulate the experience".

10.2.9. Borel sets. In practice, one is interested in the probability of events expressed by the question whether some numerical quantity falls into a given interval. We illustrate this on the example dealing with the results of students in a given course, measured for instance by the number of points in a written exam (cf. 10.1.1). On the one hand, there is only a finite number of students, and there are only finitely many possible results (say, the numbers of points in the written exam can be the integers 0 through 20). On the other hand, imagining the results of the students as an analogy to independent rolls of a regular die is inappropriate. Even if a regular 21-hedron existed (it cannot, see chapter 13), such a model would be somewhat weird.
Thus it is better to focus on the assessing function X : Ω → R in the sample space Ω of all students and model the probability that its value falls into a fixed interval when a random student is picked. For instance, if the table transferring points into marks A through F is fixed, the probability that the student obtained an A or a B can be modeled. In the case of a reasonable course, we should expect that the most probable results are somewhere in the middle of the “interval of success”, while the ideal result of the full number of points is not very probable. Similarly, if many values of X lie in the interval of failure, this may be at most universities perceived as a significant failure of the lecturer. This is a typical example of the random variables or random vectors, as defined below (it depends whether the result of just one or several students is chosen randomly). One way to proceed is to model the behaviour of X as probability defined for all intervals. This requires the following σ-algebra:1 1In this connection, we also talk about the σ-algebra of Borelmeasurable sets on Rk, and then the following definition says that random variables are Borel-measurable functions. 920 It follows from the second Bayes’ formula (1) that P(C|A) = = P(H)P(A|H)P(C|A ∩ H) + P(Hc )P(A|Hc )P(C|A ∩ Hc ) P(A) = =P(Hc ) = 2 3 and, analogously, P(C|Ac ) = = P(H)P(Ac |H)P(C|Ac ∩ H) + P(Hc )P(Ac |Hc )P(C|Ac ∩ Hc ) P(Ac) = =P(H) = 1 3 . We have thus obtained P(C|A) > P(C|Ac ), which means that it is wise to change the door. Note that the solution is based upon the assumption that the assistant deliberately opens a door behind which there is a goat. If the contestant believes it was an accident or if instead, say, he happens to see (or hear) a goat behind one of the two not chosen doors, then the first reasoning is correct and the probability remains to be 1 2 . □ 10.C.10. We have two bags. The first one contains two white and two black balls, while the second one contains one white and two black balls. We randomly select one of the bags and take two balls out of it (without replacement). What is the probability that the second ball is black provided the first one is white? ⃝ D. What is probability? First of all, recall the geometric probability, which was introduced in ??. 10.D.1. Buffon’s needle. A plane is covered with parallel lines, creating bands of width l. Then, a needle of length l is thrown onto the plane. What is the probability that the needle crosses one of the lines? Solution. The position of the needle is given by two independent parameters: the distance d of the needle’s center from the closest line (d ∈ [0, l/2]) and the angle α (α ∈ [0, π/2]) between the lines and the needle’s direction. The needle crosses one of the lines if and only if l/2 sin α > d. The space of all events (α, d) is a rectangle π/2 × l/2. The favorable events (α, d) (i. e. those for which l/2 sin α > d) correspond to those points in the rectangle which lie under the curve l/2 sin α (α being the variable of the x-axis). By 6.2.20, the area of the figure is ∫ π 2 0 l 2 sin α dα = l 2 . Thus, the wanted probability is (see ??) l 2 π 2 · l 2 = 2 π . □ CHAPTER 10. STATISTICS AND PROBABILITY THEORY Borel sets The Borel sets in R are all those subsets that can be obtained from intervals using complements, countable unions, and countable intersections. More generally, on the sample space Ω = Rk , one considers the smallest σ-algebra B which contains all k– dimensional intervals. The sets in B are called the Borel sets on Rk . 
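The result of 10.C.9 also lends itself to an empirical check. A minimal Python sketch of the quiz (we assume, as in the solution above, that the assistant deliberately opens a goat door; the helper names are ours):

    import random

    def play(switch):
        """One round of the quiz; returns True if the car is won."""
        car, guess = random.randrange(3), random.randrange(3)
        # the assistant opens a goat door different from the current guess
        opened = random.choice([d for d in range(3) if d not in (guess, car)])
        if switch:
            guess = next(d for d in range(3) if d not in (guess, opened))
        return guess == car

    random.seed(3)
    n = 100_000
    print(sum(play(True) for _ in range(n)) / n)    # close to 2/3 when changing
    print(sum(play(False) for _ in range(n)) / n)   # close to 1/3 when staying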
10.2.10. Random variables. The probabilities of the individual intervals in the Borel algebra are usually given as follows. Consider a numerical quantity X on any sample space, that is, a function X : Ω → R. Since it is desired to work with the probability of X taking on values from any fixed interval, the probability space and the properties of the function X have to allow this. Notice that working with finite probability spaces where all subsets are events, every function X : Ω → R is a random variable in the following sense. Random variables and vectors Definition. A random variable X on a probability space (Ω, A, P) is a function X : Ω → R such that the inverse image X−1 (B) lies in A for every Borel set B ∈ B on R. The real-valued function PX(B) = P(X−1 (B)) defined on all intervals B ⊂ R is called the (probability) distribution of a random variable X. A random vector X = (X1, . . . , Xk) on (Ω, A, P) is a k-tuple of random variables Xi : Ω → R defined on the same probability space (Ω, A, P). If intervals I1, . . . , Ik in R are chosen, then the probability of simultaneous occurrence of all of the k events Xi ∈ Ii must exist. Thus, as in the scalar case, there is a real-valued function defined on the k-dimensional intervals B = I1 × · · · × Ik, PX(B) = P(X−1 (B)) (and thus also for all Borel sets B ⊂ Rk ). It is called the probability distribution of the random vector X. 10.2.11. Distribution function. The distribution of random variables is usually given by a rule which shows how the probability grows as the interval B is extended. In particular, consider the intervals I with endpoints a, b, −∞ ≤ a ≤ b ≤ ∞. Denote P(a < X < b) the probability of X lying in I = (a, b), or P(X < b) if a = −∞; and analogously for other types of intervals. In the special case of a singleton, write P(X = a). In the case of a random vector X = (X1, . . . , Xk), write P(a1 < X1 < b1, . . . , ak < Xk < bk) for the probability of simultaneous occurrence of the events where the values of Xi fall into the corresponding intervals (which may also be closed, unbounded, etc.). 921 The following (known) problem, which also deals with geometric probability, illustrates that we must be cautious about what is assumed to be “clear”. 10.D.2. Bertrand’s paradox. What is the probability that a random chord of a given circle is longer than the side of an equilateral triangle inscribed into the circle? Solution. We will show three ways how to find “this” proba- bility. 1) Every chord is determined by its center. Thus, a random choice of the chord is given by a random choice of the center. The chord is greater than the side of the inscribed equilateral triangle if and only if its center lies inside the concentric circle with half radius. The center is chosen “randomly” from the whole inside of the circle. Therefore, the probability that it will lie in the inner disc is given by the ratio of the areas of these discs, which is 1 4 . 2) Unlike above, we claim that the wanted probability does not change if the direction of the chord is fixed. Then, the centers of such chords lie on a fixed diameter of the circle. The favorable centers are those which lie inside the inner circle (see 1)), i. e., inside a fixed diameter of the inner circle. The ratio of the diameters is 1 : 2, hence the wanted probability is 1 2 . 3) Now, we observe that a chord is determined by its endpoints (which must lie on the circle). Let us fix one of the endpoints (call it A)–thanks to the apparent symmetry, this should not affect the resulting probability. 
Then, the chord satisfies the given condition if and only if the other endpoint lies on the shorter arc BC, where ABC is the inscribed equilateral triangle. However, the length of this arc is one third of the length of the entire circle, which means that the wanted probability is equal to 1 3 . How is it possible that we came to three different probabilities? It is caused by a hidden ambiguity in the statement of the problem. It is necessary to specify what exactly it means to choose a chord “randomly”. Each of the three results is correct provided the chord is chosen in the corresponding way. However, these ways are not equivalent; this is apparent not only from the different results, but also from the distribution of the chords’ centers. In the first case, they are distributed uniformly throughout the inside of the circle. In the second and third cases, the centers are concentrated more towards the center of the circle. □ 10.D.3. Two envelopes. There are two envelopes, each contains a certain amount of money. We know that the amount in one of them is twice as great as in the other one. We can choose either of the envelopes (and take its contents). As soon as we choose one, we are allowed to change our mind and take the other envelope instead. Is it advantageous to do so? Solution. At the first sight, it must not matter which envelope we choose. The probability of choosing the one which contains more is 1/2, so it is no good to change our choice. CHAPTER 10. STATISTICS AND PROBABILITY THEORY Distribution function Definition. The distribution function or cumulative distribution function of a random variable X is the function FX : R → [0, 1] defined for all x ∈ R by FX(x) = P(X < x). The distribution function of a random vector (X1, . . . , Xk) is the function FX : Rk → R defined for all vectors x = (x1, . . . , xk) ∈ Rk by FX(x) = P(X1 < x1, . . . , Xk < xk). If it is clear from the context which distribution function is discussed, omit the random variable name and write simply F(x). The following theorem guarantees that, for every random variable, the probability that the value of X falls into any (fixed) interval (and thus into any Borel set B) can be calculated purely from the knowledge of its distribution function.2 10.2.12. Theorem. For every random variable X, its distribution function F : R → [0, 1] has the following properties: (1) F is a non-decreasing function; (2) F has both side-limits at every point x ∈ R, yet these limits may differ; (3) F is left-continuous; (4) at the infinite points, the limits of F are lim x→∞ F(x) = 1, lim x→−∞ F(x) = 0; (5) the probability of X taking on the value x is given by P(X = x) = lim y→x+ F(y) − F(x). (6) The distribution function of a random variable always has only countably many points of discontinuity. Proof. The proof consists of quite simple and straightforward calculations. In particular, note that the events a ≤ X < b and X < a are exclusive, so P(a ≤ X < b) = P(X < b) − P(X < a) = F(b) − F(a). Hence the first property follows immediately from the definition of probability. The next two statements follow from the probability of monotone sequences of events, discussed in 10.2.5. Fix a nonincreasing sequence of numbers rn > 0 which converges to 0, and consider the events An given by X < x − rn. The union of these events is exactly the event A given by X < x. Of course, the event A does not depend on the choice of the sequence rn. By the first proposition of 10.2.5, P(A) = lim n→∞ P(An). 
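Returning for a moment to 10.D.2: each of the three answers corresponds to a different way of sampling a "random" chord, and the difference becomes tangible in a short simulation. A minimal sketch (method numbering follows the solution above; the unit circle is assumed):

    import math, random

    SIDE = math.sqrt(3)                 # side of the inscribed triangle for R = 1

    def chord(method):
        """Length of a 'random' chord under the three sampling schemes."""
        if method == 1:                 # uniform midpoint inside the disc
            d = math.sqrt(random.random())
        elif method == 2:               # uniform midpoint on a fixed diameter
            d = random.random()
        else:                           # two uniform endpoints on the circle
            a = 2 * math.pi * random.random()
            b = 2 * math.pi * random.random()
            return 2 * abs(math.sin((a - b) / 2))
        return 2 * math.sqrt(1 - d * d)

    random.seed(4)
    n = 200_000
    for m in (1, 2, 3):
        print(m, sum(chord(m) > SIDE for _ in range(n)) / n)
    # prints values close to 1/4, 1/2 and 1/3, respectively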
2 In the literature, the definition with the non-strict inequality F(x) = P(X ≤ x) is often met. In this case, the probability P(X = x) is also included in FX(x). The distribution function then has properties similar to those in 10.2.12, only it is right-continuous instead of left-continuous, etc.

However, consider the following reasoning: the envelope we have chosen contains the amount a. Therefore, the other one contains a/2 or 2a, each with probability 1/2. This means that if we change the envelope, then we get a/2 with probability 1/2 and 2a with probability 1/2, i.e., the expected outcome is

(1/2)(a/2) + (1/2)(2a) = (5/4)a.

Therefore, it is wise to change the envelope. What is wrong with this reasoning? There are several issues. Mainly, it is not generally true that if there is the amount a in one of the envelopes, then the second one contains a/2 with probability 1/2. This depends on the initial distribution of the amounts that have been put into the envelopes, which is not precisely stated in the problem. However, the paradox is rooted not only in the concealed a priori distribution. There are (even discrete) distributions for which the choice of changing the envelope always produces a greater expected outcome than that of not changing it. Nevertheless, any distribution with this property must have an infinite expected value (if the expectation is finite, then there is always a value which, when seen in the envelope, is more advantageous to keep), so it is dubious to say that it is better to get a "greater" infinity on average. □

E. Random variables, density, distribution function

10.E.1. Consider rolling a die. The set of sample points is Ω = {ω1, . . . , ω6}, where ωi means that we have rolled the number i. Further, consider the σ-field A = {∅, {ω1, ω2}, {ω3, ω4, ω5, ω6}, Ω}. Find whether the mapping X : Ω → R defined by

i) X(ωi) = i for each i ∈ {1, 2, 3, 4, 5, 6},
ii) X(ω1) = X(ω2) = −2, X(ω3) = X(ω4) = X(ω5) = X(ω6) = 3

is a random variable with respect to A.

Solution. First of all, we should make sure that the set A really satisfies all the axioms of 10.2.2, i.e., that it is a well-defined σ-field. Then, by the definition in 10.2.10, a random variable is any function X : Ω → R such that the preimage of every Borel-measurable set B ⊂ R lies in A. As for the first case, consider the interval [2, 3]. Since X−1([2, 3]) = {ω2, ω3} ∉ A, we can see that the function X is not a random variable. In the second case, we can easily see that X is a random variable: consider any interval in R. Then, exactly one of the following four cases occurs: 1) If the interval contains neither −2 nor 3, then the preimage under X is the empty set. 2) If it contains −2 but not 3, then the preimage is {ω1, ω2}. 3) On the other hand, if it contains 3 but not −2, then the preimage is {ω3, ω4, ω5, ω6}. 4) Finally, if it contains both these numbers, then the preimage is the whole sample space Ω. In each case, the preimage lies in the σ-field A. □

The distribution function is non-decreasing, and thus the left-sided limit equals the supremum. Thus, the left-sided limit of FX at x exists and equals P(A). This proves one half of proposition (2) as well as all of proposition (3). Similarly, the above sequence rn can be used to define the events An by X < x + rn. This time, it is a non-increasing sequence A1 ⊃ A2 ⊃ . . . , and its intersection is the event X ≤ x. By the second property of 10.2.5,

lim_{n→∞} P(An) = P(X ≤ x),

which verifies that the right-sided limit of F at x exists.
At the same time, property (5) is proved. The limit values of property (4) can be derived similarly by applying theorem 10.2.5, as shown for the one-sided limits above. In the first case, use the events An given by X < rn, for an arbitrary increasing sequence rn → ∞; their union is the universal event Ω. In the second case, use the events An given by X < rn, for any decreasing sequence rn → −∞; their intersection is the null event.

It remains to prove the last statement. As already shown, the discontinuity points of the distribution function are exactly those values x which the random variable takes on with non-zero probability, i.e., P(X = x) ≠ 0. Now, let Mn denote the set of points x for which P(X = x) > 1/n. Clearly, the set M of all discontinuity points equals the union of the sets Mn: M = ∪_{n=2}^{∞} Mn. Since the sum of probabilities of mutually exclusive events cannot exceed 1, Mn can contain no more than n − 1 elements. Thus, M is a countable union of finite sets, and hence it is countable. □

10.2.13. Probability measure. The probability that a random variable has a value lying in an arbitrarily chosen interval can be computed purely from the knowledge of its distribution function. The distribution function FX thus defines the entire probability distribution of the random variable X. How a particular random variable X is defined can be ignored; X can be viewed directly as a probability definition on the σ-algebra of all the Borel sets in R. In this sense, every function F : R → R satisfying the first four properties of the latter theorem is the distribution function of a unique random variable. (Check the properties of the probability function defined on all intervals this way!) The probability obtained in this way is also called a probability measure on R. Similarly, one deals with probability measures on the algebra of Borel sets in Rk in terms of the distribution functions of random vectors. In this sense, a random variable or random vector can be considered without any explicit link to a probability space (Ω, A, P).

10.2.14. Discrete random variables. Random variables behave substantially differently according to whether the non-zero probability is "concentrated in isolated points" or "continuously distributed" along (a part of) the real axis.

10.E.2. Consider a σ-field (Ω, A), where Ω = {ω1, ω2, ω3, ω4, ω5} and A = {∅, {ω1, ω2}, {ω3}, {ω4, ω5}, {ω1, ω2, ω3}, {ω1, ω2, ω4, ω5}, {ω3, ω4, ω5}, Ω}. Find a mapping X : Ω → R, as general as possible, which is a random variable with respect to A.

Solution. Since the events ω1, ω2 do not occur individually in A, the random variable X must map them to the same number, i.e., X(ω1) = X(ω2) = a for some a ∈ R. For the same reason, we must have X(ω4) = X(ω5) = b for some b ∈ R. If an interval contains both a and b, then its preimage is {ω1, ω2, ω4, ω5} ∈ A, which is fine. Clearly, the event ω3 may be mapped to an arbitrary c ∈ R. Then, we can easily verify that the X-preimage of every interval is contained in A, i.e., X is a random variable with respect to A. □

10.E.3. Consider a random variable X which takes on the value i with probability P(X = i) = 1/6, for each i = 1, . . . , 6. Find the distribution function FX(x) and draw its graph.

Solution. By definition 10.2.11, the distribution function is FX(x) = P(X < x). With this (left-continuous) convention, FX jumps by 1/6 at each of the points 1, . . . , 6: we have FX(x) = 0 for x ≤ 1, FX(x) = (⌈x⌉ − 1)/6 for 1 < x ≤ 6 (where ⌈x⌉ stands for the ceiling of x), and FX(x) = 1 for x > 6. The graph looks as follows: □
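The left-continuous convention of 10.2.11 is easy to confirm empirically for the die of 10.E.3; note in particular the value F(6) = 5/6. A minimal sketch (the names are ours):

    import math, random

    def F(x):
        """FX(x) = P(X < x) for a fair die, left-continuous as in 10.2.11."""
        if x <= 1:
            return 0.0
        if x > 6:
            return 1.0
        return (math.ceil(x) - 1) / 6

    random.seed(5)
    rolls = [random.randint(1, 6) for _ in range(200_000)]
    for x in (1, 1.5, 2, 3.999, 6, 6.5):
        empirical = sum(r < x for r in rolls) / len(rolls)   # strict inequality
        print(x, round(F(x), 3), round(empirical, 3))        # F(6) = 5/6, not 1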
10.E.4. An archer keeps shooting at a target until he hits it. He has 4 arrows at his disposal. In each attempt, the probability that he hits the target is 0.6. Let X be the random variable which gives the number of unused arrows. Find the probability mass function and the distribution function of X and draw their graphs.

Solution. Clearly, the probability of k consecutive misses followed by a hit is equal to 0.4^k · 0.6. Therefore, fX(x) = P(X = x) = 0.4^(3−x) · 0.6 for x ∈ {1, 2, 3}. If the archer misses three times, then there will be no arrow left at the end no matter how the last shot turns out, so fX(0) = 0.4^3 · 0.6 + 0.4^4 = 0.4^3 = 0.064. The distribution function is the corresponding step function: FX(x) = 0 for x ≤ 0, FX(x) = 1 for x > 3, with a jump of size fX(k) at each point k = 0, 1, 2, 3. The graphs of the probability mass function and the distribution function are as follows: □

Discrete random variables

If a random variable X assumes only finitely many values x1, x2, . . . , xn ∈ R or countably infinitely many values x1, x2, . . . , it is called a discrete random variable. One can define its probability mass function f(x) by

f(x) = P(X = xi) for x = xi, and f(x) = 0 otherwise.

Since the probability is countably additive and the singleton events X = xi are mutually exclusive, the sum of all the values f(xi) is given by either a finite sum or an absolutely convergent series: ∑_i f(xi) = 1. The probability distribution of a random variable X satisfies

P(X ∈ B) = ∑_{xi∈B} f(xi).

In particular, the distribution function is of the form

FX(t) = ∑_{xi<t} f(xi).

10.E.5. The distribution function of a random variable X is

FX(x) = 0 for x ≤ 3, FX(x) = (1/3)x − 1 for 3 < x ≤ 6, FX(x) = 1 for 6 < x.

i) Justify that FX is indeed a distribution function.

Note that the distribution function F(x) of a continuous random variable X is always differentiable. Its derivative is the density function of X, i.e., F′(x) = f(x).

10.2.16. The general case. Of course, there are also random variables with mixed behaviour, where a part of the probability is distributed continuously, while there are also values that are taken on with non-zero probability. This means that the probability measure of some singletons x ∈ R is non-zero and still X is not a discrete random variable. For instance, consider a chaotic lecturer who remains standing at his laptop with probability p throughout the entire lecture, but once he decides to move, he happens to be at any position in front of the lecture room with equal probability. Then, the random variable which corresponds to his position (assume that the desk with the laptop is at position 0 and the lecture room is bounded by the values ±1) has the following distribution function:

F(t) = 0 if t ≤ −1,
F(t) = ((1 − p)/2)(t + 1) if t ∈ (−1, 0],
F(t) = p + ((1 − p)/2)(t + 1) if t ∈ (0, 1),
F(t) = 1 if t ≥ 1.

The distribution functions of all such variables can be expressed directly using the Riemann-Stieltjes integral

F(t) = ∫_{−∞}^{t} f(x) d(g(x)),

developed in subsection 6.3.14 (page 570). In the example above, choose f(x) = 1 and

g(x) = −1 for x ≤ −1,
g(x) = ((1 − p)/2)x for −1 < x ≤ 0,
g(x) = ((1 − p)/2)x + p for 0 < x < 1,
g(x) = (1 + p)/2 for x ≥ 1.

This corresponds again to the idea that the distribution function is equivalent to a probability measure. Thus the measure of any interval is given by integrating its indicator function with respect to this measure. This is what the Riemann-Stieltjes integral achieves. The Riemann integral corresponds to the choice g(x) = x. One could also keep only the jump p at x = 0 in g (i.e., g(x) = x for x < 0, while g(x) = x + p otherwise) and leave the constant density (1 − p)/2 to f(x), which would be nonzero only on [−1, 1].
This corresponds to splitting the probability measure into its discrete part (hidden in g) and continuous part (expressed by the probability density). Notice that any distribution function can have only countably many points of discontinuity. 10.2.17. Basic discrete distributions. The requirements on the properties of probability distributions of random variables are based on the modeled situations. Here is a list of the simplest discrete dis- tributions. 925 ii) Find the density of the random variable X. iii) Compute P(2 < X < 4). Solution. a) Clearly, FX is continuous and non-decreasing. Moreover, we have lim x→−∞ F(x) = 0 and lim x→∞ F(x) = 1, as needed. b) By 10.2.14, the density of a continuous random variable is the derivative of its distribution function. We can see that on the interval (3, 6), the density is equal to f(x) = 1 3 , while on the intervals (−∞, 3) and (6, ∞), it is equal to zero. Therefore, the variable X has uniform distribution, see 10.2.20. c) We have from the definition of the distribution function that P(2 < X < 4) = FX(4) − FX(2) = 4 3 − 1 = 1 3 . □ 10.E.6. Consider a random variable X and a function f : R → R given by f(x) = a 1+x2 for x ∈ R, where a is a parameter. Suppose that f is the density of X. Find i) the value of a, ii) the distribution function of X, iii) P(−1 < X < 1). Solution. a) If the function f is to be a probability density, then its integral over R must be equal to one. This yields the condition 1 = ∫ ∞ −∞ a 1 + x2 dx = a[arctg x]∞ −∞ = aπ. Hence a = 1 π . b) By 10.2.14, the distribution function is given by the following integral: FX(x) = ∫ x −∞ f(t)dt = 1 π ∫ x −∞ dt 1 + t2 = 1 π arctg x + 1 2 . c) By b) and the definition of the distribution function, we have P(−1 < X < 1) = FX(1)−FX(−1) = 1 π · π 4 − 1 π · ( − π 4 ) = 1 2 . □ 10.E.7. The joint probability mass function of a discrete random vector is given by the following table: X Y 2 5 6 1 1 5 1 10 1 20 2 1 10 1 20 0 3 3 10 1 20 3 20 Find i) the marginal distribution and probability mass functions; ii) the joint distribution function and draw it in a suitable way; iii) P(Y > 3X). Solution. a) By 10.2.22, the marginal distribution of the random variable X is obtained by summing up the joint probability mass function over all possible values of Y in each row. Similarly, the marginal distribution of Y is obtained by CHAPTER 10. STATISTICS AND PROBABILITY THEORY Degenerate distribution The distribution which corresponds to a constant random variable X = µ is called the degenerate distribution Dg(µ). Its distribution function FX and probability mass function fX are given by FX(t) = { 0 t ≤ µ 1 t > µ fX(t) = { 1 t = µ 0 otherwise. Here follows a description of an experiment with two possible outcomes called success and failure. If the probability of success is p, then the probability of failure must be 1 − p. It is convenient to take the values 0 and 1 for the two possible results. Bernoulli distribution The distribution of a random variable X which is 0 (failure) with probability q = 1 − p and 1 (success) with probability p is called the Bernoulli distribution A(p). Its distribution function FX and probability mass function fX are given by FX(t) =    0 t ≤ 0 q 0 < t ≤ 1 1 t > 1 fX(t) =    p t = 1 q t = 0 0 t /∈ {0, 1}. Further, consider a random variable X which corresponds to n independent experiments described by the Bernoulli distribution, where X measures the number of successes. Clearly the probability mass function is non-zero exactly at the integers t = 0, . . . 
, n, which correspond to the total number of successes in the experiments (the order does not matter). The probability that the successes occur exactly in t chosen experiments out of n is p^t (1 − p)^(n−t). It is necessary to sum over all the (n t) possibilities. This leads to the binomial distribution of X:

Binomial distribution

The binomial distribution Bi(n, p) has probability mass function

fX(t) = (n t) p^t (1 − p)^(n−t) for t ∈ {0, 1, . . . , n}, and fX(t) = 0 otherwise.

The illustration shows the probability mass functions for Bi(50, 0.2) and Bi(50, 0.9). The distribution of the probability corresponds to the intuition that most outcomes occur near the value np:

summing up the entries in each column. Thus, we get the following:

X     1      2      3
fX    7/20   3/20   1/2

Y     2      5      6
fY    3/5    1/5    1/5

b) The joint distribution function at a point (a, b) is equal to the sum of all values of the joint probability mass function f(X,Y) at the points (x, y) with x ≤ a and y ≤ b. This corresponds to the values of the subtable whose lower-right corner is (a, b). Precisely, the joint distribution function F(X,Y) looks as follows:

             2 ≤ b < 5   5 ≤ b < 6   b ≥ 6
1 ≤ a < 2    1/5         3/10        7/20
2 ≤ a < 3    3/10        9/20        1/2
a ≥ 3        3/5         4/5         1

and on the intervals (−∞, 1) × R and R × (−∞, 2), F(X,Y) is clearly zero.

c) Apparently,

P(Y > 3X) = P(X = 1, Y = 5) + P(X = 1, Y = 6) = 1/10 + 1/20 = 3/20. □

10.E.8. Find the probability P(2X > Y ), provided the density of the random vector (X, Y ) is given by

f(X,Y)(x, y) = (1/6)(4x − y) for 1 ≤ x ≤ 2, 2 ≤ y ≤ 4, and 0 otherwise.

Solution. By definition, we have

P(2X > Y ) = ∫_{−∞}^{∞} ∫_{−∞}^{2x} f(X,Y)(x, y) dy dx = ∫_{1}^{2} ∫_{2}^{2x} (1/6)(4x − y) dy dx
= ∫_{1}^{2} [ (2/3)xy − (1/12)y² ]_{y=2}^{y=2x} dx
= ∫_{1}^{2} ( x² − (4/3)x + 1/3 ) dx
= [ (1/3)x³ − (2/3)x² + (1/3)x ]_{1}^{2} = 2/3. □

10.E.9. Find the marginal distribution functions and the joint and marginal densities of the random vector (X, Y ), provided

F(X,Y)(x, y) = 0 for x < 0 or y < 0; (1/4)x²y² for 0 ≤ x ≤ 1, 0 ≤ y ≤ 2; 1 for x > 1, y > 2.

Solution. The density of the random vector (X, Y ) is obtained by differentiating with respect to x and y. Thus, for 0 ≤ x ≤ 1, 0 ≤ y ≤ 2, we have f(X,Y)(x, y) = xy, and elsewhere the density is zero. The marginal density of the random variable X is then

fX(x) = ∫_{−∞}^{∞} f(X,Y)(x, y) dy = ∫_{0}^{2} xy dy = [ (1/2)xy² ]_{0}^{2} = 2x.

Similarly, for Y, we get fY(y) = y/2. The marginal distribution functions are

FX(x) = ∫_{−∞}^{x} fX(t) dt = ∫_{0}^{x} 2t dt = x²

Next, consider distributions similar to the Bernoulli process referred to in 10.2.1. Consider independent experiments with the Bernoulli distribution A(p), as in the case of the binomial distribution, and fix a positive integer r. Repeat the experiment until r successes occur. The random variable X is defined as the number of failures before the r-th success. In the case of r = 1, it is exactly the example from 10.2.1. The event X = k occurs if and only if there are exactly r − 1 successes in the first k + r − 1 experiments and the (k + r)-th experiment also ends with a success. Thus, the following probability mass function is arrived at:

Geometric distribution

The random variable X which corresponds to the number of failures before reaching the r-th success has probability distribution

P(X = k) = (k+r−1 r−1) p^r (1 − p)^k, k = 0, 1, 2, . . .

This is called the negative binomial distribution. In the case of r = 1, it is the geometric distribution.

Often the same definition is used with the successes and failures interchanged. This results in the same formula for the probability mass function with p and 1 − p interchanged.
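The negative binomial probability mass function can be checked against a direct simulation of the Bernoulli experiments; a minimal Python sketch (with the illustrative parameters p = 0.4 and r = 3):

    import math, random

    def failures_before(r, p):
        """Bernoulli(p) trials until the r-th success; counts the failures."""
        fails = succ = 0
        while succ < r:
            if random.random() < p:
                succ += 1
            else:
                fails += 1
        return fails

    def pmf(k, r, p):
        """(k+r-1 over r-1) p^r (1-p)^k, the negative binomial pmf."""
        return math.comb(k + r - 1, r - 1) * p**r * (1 - p)**k

    random.seed(7)
    r, p, n = 3, 0.4, 200_000
    counts = {}
    for _ in range(n):
        k = failures_before(r, p)
        counts[k] = counts.get(k, 0) + 1
    for k in range(6):
        print(k, round(counts.get(k, 0) / n, 4), round(pmf(k, r, p), 4))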
The geometric distribution appears in physics in connection with Einstein–Bose statistics. 10.2.18. Poisson distribution. In practice, the binomial distribution often leads to further model problems. Consider the situation that r (mutually indistinguishable) objects are to be divided into n (distinguishable) boxes, and each object is equally probable (i.e., has probability 1/n) to fall into any of the boxes. The random variable which describes the number X of objects in one fixed box can be described as follows: The admissible values are X = k, where k = 0, . . . , r, and the individual probabilities are P(X = k) = ( r k ) ( 1 n )k ( 1 − 1 n )r−k = ( r k ) (n − 1)r−k nr . Thus, the distribution of X is of the type Bi(r, 1/n). Such a variable can be encountered, for example, when describing a physical system with a huge number of gas molecules. The boxes represent small volumes of the space. 927 and FY (y) = ∫ y −∞ fY (t)dt = ∫ y 0 1 2 tdt = 1 4 y2 . □ 10.E.10. In a bag, there are 14 balls–4 red, 5 white, and 5 blue ones. We randomly take 6 balls out of the bag (without replacement). Find the distribution of the random vector (X, Y ) where X stands for the number of red balls taken and Y for the number of white balls. In addition, find the marginal distributions of X and Y . Then, compute P(X ≤ 3), P(1 ≤ Y ≤ 4). Solution. The value of the probability mass function at point (x, y) is defined as the probability P(X = x, Y = y), i. e. the probability of taking x red balls and y white balls. The number of ways how to take x red balls is (4 x ) ; for y white balls, it is (5 y ) ; and the remaining 6−x−y blue balls can be selected in ( 5 6−x−y ) ways. Altogether, there are (4 x )(5 y )( 5 6−x−y ) possibilities. The values of this expression for all x, y are in the following table. x\ y 0 1 2 3 4 5 ∑ X 0 0 5 50 100 50 5 210 1 4 100 400 400 100 4 1008 2 30 300 600 300 30 0 1260 3 40 200 200 40 0 0 480 4 10 25 10 0 0 0 45∑ Y 84 630 1260 840 180 9 3003 The values in the last column and row are the sums over all values of y and x, respectively. Then, the values of the probability mass function are obtained after dividing by the number of all possibilities how to take the 6 balls, i. e. (14 6 ) = 3003. The marginal distributions of X and Y correspond to the last column and row, respectively. The probability P(X ≤ 3) can be calculated easily from the marginal distribution of X: P(X ≤ 3) = FX(3) = 1 3003 (210+1008+1260+480) = 0.985. Similarly, for the probability P(1 ≤ Y ≤ 4), we have P(1 ≤ Y ≤ 4) = FY (4) − FY (1) = = 1 3003 (630 + 1260 + 840 + 180) = 0.969. □ 10.E.11. The density of a random vector (X, Y, Z) is f(x, y, z) = { c(x + y + z) for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, 0 ≤ z ≤ 1 0 otherwise. Find the value of the parameter c as well as the distribution function of the vector, and compute P(0 ≤ X ≤ 1 2 , 0 ≤ Y ≤ 1 2 , 0 ≤ Z ≤ 1 2 ). Solution. The integral of the density over the entire space must be equal to one. This gives us 1 = ∫ 1 0 ∫ 1 0 ∫ 1 0 c(x + y + z)dzdydx = c ∫ 1 0 ∫ 1 0 (x + y + 1 2 )dydx = = c ∫ 1 0 (x + 1)dx = 3 2 c. CHAPTER 10. STATISTICS AND PROBABILITY THEORY Observe the distribution of the molecules. Then, the behaviour of Xn as the number n of boxes as well as the number rn of objects increases so that their ratio rn/n = λ remains constant is of interest. In other words, every box is to contain (approximately) the same number λ of elements, on average. We are interested in the asymptotic behaviour of the variables Xn as n → ∞. 
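The limit carried out in the next paragraph can be anticipated numerically: the probability mass functions of Bi(n, λ/n) settle, as n grows, on the values λ^k e^(−λ)/k! derived below. A minimal sketch (with the illustrative choice λ = 2):

    import math

    lam = 2.0                                   # the fixed ratio r_n / n

    def bi(k, n, p):
        return math.comb(n, k) * p**k * (1 - p)**(n - k)

    def po(k):
        return lam**k / math.factorial(k) * math.exp(-lam)

    for n in (10, 100, 1000):
        print(n, [round(bi(k, n, lam / n), 4) for k in range(5)])
    print('Po', [round(po(k), 4) for k in range(5)])
    # the rows approach the Poisson values as n grows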
Letting limn→∞ rn/n = λ, the standard procedure (with details to be added – take it as a challenge to recall the methods from the analysis of univariate functions!) leads to: lim n→∞ P(Xn = k) = lim n→∞ ( rn k ) (n − 1)rn−k nrn = lim n→∞ rn(rn − 1) . . . (rn − k + 1) (n − 1)k 1 k! ( 1 − 1 n )rn = λk k! lim n→∞ ( 1 + −rn n rn )rn = λk k! lim m→∞ ( 1 + −λ m )m = λk k! e−λ , since the functions (1+x/n)n converge uniformly to the function ex on every bounded interval in R. Poisson distribution The Poisson distribution Po(λ) describes the random variables with probability mass function fX(k) = { λk k! e−λ k ∈ N 0 otherwise. Of course, ∞∑ k=0 fX(k) = ∑ k λk k! e−λ = e−λ ∑ k λk k! = e−λ+λ = 1. As seen above, this discrete distribution Po(λ) with an arbitrary λ > 0 (distributed into infinitely many points) is a good approximation of the binomial distributions Bi(n, λ/n), for large values of n. 10.2.19. Two examples. Besides the physical model mentioned above, such a behaviour can be encountered when observing occurrences of events in a space with constant expected density in a unit volume. Observing bacteria under a microscope, when the bacteria are expected to occur in any part of the image with the same probability, provides an example. If the “mean density of occurrence” in a unit area is λ and the whole region is divided into n identical parts, then the occurrence of k events in a fixed part is modeled by a random variable X with the Poisson distribution. When diagnosing in practice, such an observation allows us to compute the total number of bacteria with a relatively good accuracy from the actual numbers in only several randomly chosen samples. 928 Hence, c = 2 3 . By definition, the distribution function is equal to FX(x, y, z) = 2 3 ∫ x 0 ∫ y 0 ∫ z 0 (r + s + t)dtdsdr = = 2 3 ∫ x 0 ∫ y 0 (rz + sz + 1 2 z2 )dsdr = 2 3 ∫ x 0 (rzy + 1 2 y2 z + 1 2 z2 y)dr = = 2 3 (1 2 x2 zy + 1 2 y2 zx + 1 2 z2 yx) = 1 3 (x2 zy + y2 zx + z2 yx), so the wanted probability is P(0 ≤ X ≤ 1 2 , 0 ≤ Y ≤ 1 2 , 0 ≤ Z ≤ 1 2 ) = F(1 2 , 1 2 , 1 2 ) = 1 16 . □ 10.E.12. Find the value of the parameter a so that the func- tion f(x) =    0 for x ≤ 1 a ln(x) for 1 < x < 2 0 for 2 ≤ x would be the probability density of a random variable. Solution. We know that the condition for the function to be a density is ∫ ∞ −∞ f(x) = 1 Thus, we have to calculate ∫ ln(x) dx: ∫ ln(x) dx = x ln(x)− ∫ 1 dx = x ln(x)−x = x(ln(x)−1). Altogether, ∫ ∞ −∞ f(x) = ∫ 2 1 a ln(x) = a[x(ln(x)−1)]2 1 = a(2 ln(2)−1), so a = 1 2 ln(2)−1 . □ 10.E.13. A child has become lost in a forest whose shape is that of a regular hexagon. Suppose that the probability that the child happens to be in a given part of the forest is directly proportional to the size of that part, but independent of its position in the forest. • What is the probability distribution of the distance of the child from a given side (extended to a straight line) of the forest? • What is the probability distribution of the distance of the child from the closest side of the forest? Solution. • Let a be the length of the sides of the hexagon (forest). Then, the probability distribution satisfies f(x) =    0 for x ≤ 0 4 9a2 x + 2 3 √ 3a for 0 < x ≤ 1 2 √ 3a − 4 9a2 x + 2√ 3a for 1 2 √ 3a ≤ x ≤ √ 3a 0 for x > √ 3a , as for the first question. • First, let us compute the distribution function F of the wanted random variable X that corresponds to the distance of the child from the closest side. The distance can CHAPTER 10. 
STATISTICS AND PROBABILITY THEORY The second example is more challenging. We describe events which occur randomly at time t ≥ 0. Here, the probability of an occurrence in the following small time period of length h does not depend on what had happened before and equals the same value hλ for a fixed λ > 0. At the same time, the probability that the event occurs more than once in a given time period is small. Let Xt denote the random variable which corresponds to the number of occurrences of the examined event in the interval [0, t). The requirements are expressed infinitesimally. We want: • the probability of exactly one event in each time period of length h equals hλ + α(h), where the function α(h) satisfies limh→0+ α(h) h = 0; • the probability β(h) of more than one event occurring in a time period of length h to satisfy limh→0+ β(h) h = 0; • the events Xt = j and Xt+h −Xt = k to be independent for all j, k ∈ N and t, h > 0. Use the notation pk(t) = P(Xt = k), k ∈ N, and set the initial conditions p0(0) = 1 and pk(0) = 0 for k > 0. Compute directly p0(t + h) = p0(t)P(Xt+h − Xt = 0) = = p0(t)(1 − hλ − α(h) − β(h)) and similarly, pk(t + h) = P(Xt = k, Xt+h − Xt = 0) + P(Xt = k − 1, Xt+h − Xt = 1) + P(Xt ≤ k − 2, Xt+h = k) = pk(t)P(Xt+h − Xt = 0) + pk−1P(Xt+h − Xt = 1) + k−2∑ i=0 P(Xt = i, Xt+h − Xt = k − i) = pk(t)(1 − hλ − α(h) − β(h)) + pk−1(t)(hλ + α(h)) + k−2∑ i=0 pi(t)P(Xt+h − Xt = k − i). Hence (similar to in 6.1.12, page 520, the symbol o(h) is written for expressions which, when divided by h, approach zero as h → 0+) p0(t + h) − p0(t) h = −λp0(t) + 1 h o(h) pk(t + h) − pk(t) h = −λpk(t) + λpk−1(t) + 1 h o(h). Letting h → 0+, an (infinite!) system of ordinary differential equations is obtained: p′ 0(t) = −λp0(t), p0(0) = 1 p′ k(t) = −λpk(t) + λpk−1(t), pk(0) = 0 for all t > 0 and k ∈ N, with an initial condition. The first equation has a unique solution p0(t) = e−λt , 929 be anywhere in the interval I = ⟨0, √ 3 2 a⟩. Then, for y ∈ I, we have F(y) = P[X < y] = √ 3 4 a2 − ( √ 3 2 a−y)2 3 4 a2 √ 3 4 a2 √ 3 4 a2 = 1− 4( √ 3 2 a − y)2 3a2 Altogether, F(y) =    0 for y ≤ 0 1 − 4( √ 3 2 a−y)2 3a2 for y ∈ ⟨0, √ 3 2 a⟩ 1 for y ≥ √ 3 2 a Thus, the density, being the derivative of the distribution function, satisfies: f(x) =    0 for x ≤ 0 8( √ 3 2 a−y) 3a2 for y ∈ ⟨0, √ 3 2 a⟩ 0 for y ≥ √ 3 2 a □ 10.E.14. Let a random variable X have uniform distribution on an interval ⟨0, r⟩. Find the distribution function and probability density of the volume of the ball whose radius is equal to X. Solution. First, we find the distribution function F (for 0 < d < 4 3 πr3 ) F(d) = P [ 4 3 πX3 ≤ d ] = P [ X ≤ 3 √ 3d 4π ] = 3 √ 3d 4π r . Altogether, F(x) =    0 for x ≤ 0 3 √ 3 4πr3 x 1 3 for 0 < x < 4 3 πr3 1 for x ≥ 4 3 πr3 Differentiating this, we obtain the density: f(x) =    0 for x ≤ 0 3 √ 1 36πr3 x− 2 3 for 0 < x < 4 3 πr3 0 for x ≥ 4 3 πr3 □ 10.E.15. Find the value(s) of the parameter a ∈ R so that the function f(x) =    0 for x ≤ 0 ax2 for 0 < x < 3 0 for x ≥ 3 defines the probability density of a random variable X. Then, find its distribution function, probability density, and the expected value of the volume of the cube whose edge-length has probability density determined by f. Solution. Simply, a = 1 9 . Thus, the distribution function of the random variable X is FX(t) = 1 27 t3 for t ∈ (0, 3), zero for smaller values of t, and one for greater. Let Z = X3 denote the random variable corresponding to the volume of the considered cube. It lies in the interval (0, 27). 
Thus, for CHAPTER 10. STATISTICS AND PROBABILITY THEORY which can be immediately substituted into the second equation. This leads to p1(t) = λt e−λt . A trivial induction argument shows that the system has a unique solution pk(t) = (λt)k k! e−λt , t > 0, k ∈ N. It is thus verified that for every process which satisfies the three properties above, the random variable Xt which corresponds to the number of occurrences in the time period [0, t) has distribution Po(λt). In practice, these processes are connected with the failure rate of machines. 10.2.20. Continuous distributions. The simplest example of a continuous distribution is the uniform distribution of the probability throughout a fixed interval. This is also a good illustration of the fact that a simply formulated requirement does not leave many free choices in the definition. Now, the probability of X taking on a value inside an interval which is included in the sample interval (a, b) ⊂ R is required to be dependent only on the length of the interval, but not on its actual position. This means that the density fX of the random variable X should be constant and the value of this constant is given by the requirement P(a ≤ X < b) = 1. Uniform distribution For any real numbers a, b, −∞ < a < b < ∞, define the density and distribution function as follows: fX(t) =    0 t ≤ a 1 b−a t ∈ (a, b) 0 t ≥ b, FX(t) =    0 t ≤ a t−a b−a t ∈ (a, b) 1 t ≥ b. Here, the random variable X has uniform distribution. The next distribution is similar to the discrete Poisson distribution. Suppose the occurrence of a random event is observed such that its occurrences in non-overlapping intervals are independent. Thus, if p(t) is the probability of the event not occurring during an interval of length t, then of necessity p(t + s) = p(t)p(s) for all t, s > 0. Moreover, assume that p is differentiable and p(0) = 1. Then, ln p(t + s) = ln p(t) + ln p(s). Letting s → 0+ (and applying l’Hospital’s rule), ( ln(p) )′ (t) = lim s→0+ ln p(t + s) − ln p(t) s = lim s→0+ (ln p(s))′ 1 = p′ (0) p(0) = p′ (0). Thus, p′ (0) = −λ ∈ R ( Note: λ > 0, and p′ (0) cannot be positive as p(0) = 1). Then, p(t) satisfies ln p(t) = −λt + C. The initial condition leads to the only solution p(t) = e−λt . 930 t ∈ (0, 27) and the distribution function FZ of the random variable Z, we can write FZ(t) = P[Z < t] = P[X3 < t] = P[X < 3 √ t] = FX( 3 √ t) = 1 27 t. Then, the density is fZ(t) = 1 27 on the interval (0, 27) and zero elsewhere. Since this is the uniform distribution on the given interval, the expected value is equal to 13.5. □ 10.E.16. Find the value(s) of the parameter a ∈ R so that the function f(x) =    0 for x ≤ 0 ax for 0 < x < 3 0 for x ≥ 3 defines the probability density of a random variable X. Then, find its distribution function, probability density, and the expected value of the area of the square whose side-length has probability density determined by f. Solution. We proceed similarly as in the previous example. Again, we can easily find that a = 2 9 . Thus, the distribution function of the random variable X is FX(t) = 1 9 t2 for t ∈ (0, 3), zero for smaller values of t, and one for greater. Let Z = X3 denote the random variable corresponding to the area of the considered square. It lies in the interval (0, 9). Thus, for t ∈ (0, 9) and the distribution function FZ of the random variable Z, we can write FZ(t) = P[Z < t] = P[X2 < t] = P[X < √ t] = FX( √ t) = 1 9 t. Then, the density is fZ(t) = 1 9 on the interval (0, 9) and zero elsewhere. 
Since this is the uniform distribution on the given interval, the expected value is equal to 4.5. □ 10.E.17. Find the value(s) of the parameter a ∈ R so that the function f(x) =    0 for x ≤ 0 ax2 for 0 < x < 2 0 for x ≥ 2 defines the probability density of a random variable X. Then, find its distribution function, probability density, and the expected value of the volume of the cube whose edge-length has probability density determined by f. ⃝ 10.E.18. We randomly cut a line segment of length l into two pieces. Find the distribution function and the density of the area of the rectangle whose side-lengths are equal to the obtained pieces. Solution. Let us compute the distribution function: Let X denote the random variable with uniform distribution on the interval ⟨0, l⟩, which corresponds to the length of one of the pieces (then, the length of the other piece is l − X). The area S = x(l−x) of the rectangle, for x ∈ ⟨0, l⟩, can lie anywhere in the interval ⟨0, l2 /4⟩. Setting d ∈ ⟨0, l2 /4⟩, we can write F(d) = P[S ≤ d] = P[X(l − X) ≤ d] Thus, we are looking for those values of x for which x(l − x) ≤ d, which is a quadratic inequality. The roots of the corresponding quadratic equation are l− √ l2−4d 2 and l+ √ l2−4d 2 . CHAPTER 10. STATISTICS AND PROBABILITY THEORY Now, consider the random variable X which corresponds to a (random) moment when the event occurs for the first time. Apparently, the distribution function of X is given by FX(t) = 1 − p(t) = { 1 − e−λt t > 0 0 t ≤ 0. This function has the desired properties: It has values between zero and one, it is increasing and it has the required behaviour at ±∞. The density of this random variable can be obtained by differentiation of the distribution function. Exponential distribution The distribution corresponding to the continuous random variable X with density fX(t) = { λ e−λt t > 0 0 t ≤ 0. is called the exponential distribution ex(λ). The exponential distribution belongs to the more general family of important distributions with the densities of the form cxa−1 e−bx for x > 0, with given constants a > 0, b > 0, while the constant c is to be computed. The following expression is required to equal one: ∫ ∞ 0 cxa−1 e−bx dx = ∫ ∞ 0 c ( t b )a−1 e−t 1 b dt = c ba Γ(a). Γ is the famous transcendental function providing the analytic extension of the factorial function, discussed in 6.2.17 on the page 546. Gamma distribution The distribution whose density is zero for x ≤ 0, while for x > 0. It is given by f(x) = ba Γ(a) xa−1 e−bx , called the gamma distribution Γ(a, b) with parameters a > 0, b > 0. Thus, the exponential distribution is the special case of this one for the value a = 1. 10.2.21. Normal distribution. Recall the binomial distribution. If the success rate p is left constant, but the number n of experiments is increased, the probability mass function keeps its shape (although the scale changes). As n increases, the values of the probability mass function merges into a curve that should correspond to the density of a continuous distribution which is a good approximation for Bi(n, p) for large values of n. Recall the smooth function y = e−x2 /2 , mentioned in subsection 6.1.9 (page 516) as an appropriate tool for the construction of functions which are smooth but not analytic. The 931 The inequality is satisfied by exactly those values of x which lie outside this interval. 
Therefore,

P[X(l − X) ≤ d] = P[X ∈ ⟨0, l⟩ \ ((l − √(l² − 4d))/2, (l + √(l² − 4d))/2)] = (l − √(l² − 4d))/l = 1 − √(l² − 4d)/l.

Altogether,

F(x) = 0 for x ≤ 0,
F(x) = 1 − √(l² − 4x)/l for 0 ≤ x ≤ l²/4,
F(x) = 1 for x > l²/4.

The density is obtained by differentiation:

f(x) = 0 for x ≤ 0,
f(x) = 2/(l√(l² − 4x)) for 0 ≤ x ≤ l²/4,
f(x) = 0 for x > l²/4. □

10.E.19. Independent random variables X and Y have the following probability densities:

fX(t) = 0 for t ≤ 0, fX(t) = 1 for 0 < t < 1, fX(t) = 0 for 1 ≤ t,
fY(t) = 0 for t ≤ 0, fY(t) = 2t for 0 < t < 1, fY(t) = 0 for 1 ≤ t.

Determine the distribution function of the random variable giving the area of the rectangle with sides X and Y.

Solution. Denoting the area by Z = XY, we get

FZ(t) = 0 for t ≤ 0, FZ(t) = 2t − t² for 0 < t < 1, FZ(t) = 1 for 1 ≤ t. □

10.E.20. Let X, Y be independent random variables, where X has the uniform distribution on the interval (0, 2) and Y is given by its density function:

f(x) = 0 for x ≤ 0, f(x) = 2x for 0 < x < 1, f(x) = 0 for x ≥ 1.

Find the probability that Y is less than X².

Solution. Since X and Y are independent random variables, the joint density f(X,Y) : R² → R of the vector (X, Y ) is given by the product of the densities fX and fY of the individual random variables. Thus, we have

f(X,Y)(u, v) = fX(u) · fY(v) = (1/2) · 2v = v for (u, v) ∈ (0, 2) × (0, 1), and 0 otherwise.

illustration compares this curve (in the right-hand part) to the values of Bi(40, 0.5). This suggests looking for a convenient continuous distribution whose density would be given by a suitably adjusted variation of this function. The function e^(−x²/2) is everywhere positive, so it suffices to compute ∫_{−∞}^{∞} e^(−x²/2) dx. If this results in a finite value, it suffices to multiply the function by the reciprocal value. Unfortunately, this integral cannot be computed in terms of elementary functions. Luckily, multidimensional integration and Fubini's theorem can be used. Transforming to polar coordinates, we obtain

(∫_{−∞}^{∞} e^(−x²/2) dx)(∫_{−∞}^{∞} e^(−y²/2) dy) = ∫_{R²} e^(−(x²+y²)/2) dx dy = ∫_{0}^{∞} ∫_{0}^{2π} e^(−r²/2) r dθ dr = 2π

(cf. the notes at the end of subsection 8.2.5; verify that the integrated function satisfies the conditions given there, and carry out the computation thoroughly!). Hence the integral equals √(2π), so the function

f(x) = (1/√(2π)) e^(−x²/2)

is a well-defined density of a random variable.

Normal distribution

The distribution of the random variable Z with density

φ(z) = (1/√(2π)) e^(−z²/2)

is called the (standard) normal distribution N(0, 1). The corresponding distribution function

Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^(−x²/2) dx

cannot be expressed in terms of elementary functions. It is called the Gaussian function, and the graph of φ(x) is often called the Gaussian curve.

So far, the correct density which approximates the binomial distribution is not found. The diagram comparing the probability mass function of the binomial distribution to the Gaussian curve shows that the position of the maximum must be moved, and the curve must be shrunk or stretched horizontally. The first goal is easily reached by constant
Let X, Y be independent random variables, where X has density function f(x) =    0 for x ≤ 0 2x 9 for 0 < x < 3 0 for x ≥ 1, and Y has density function f(x) =    0 for x ≤ 0 x 2 for 0 < x < 2 0 for x ≥ 2. Find the probability that Y is greater than X3 . ⃝ F. Expected value, correlation Compute the expected value and variance of the binomial distribution. Solution. The direct calculation from the definitions is a nice exercise on combinatorics. We prove this statement using the properties of the expected value and variance. Using the definition of the binomial distribution (see 10.2.17), we can view the random variable X ∼ Bi(n, p) as the sum X = ∑n k=1 Yk, where Y1, . . . , Yn ∼ A(p) are independent random variables saying whether the k–th experiment was successful. Clearly, the Bernoulli distribution has expected value E Yi = p, hence by theorem 10.2.29, we have E X = ∑n k=1 E Yk = np. Similarly, we compute E(Y 2 k ) = 12 · p + 02 · (1 − p) = p, so var Yk = E(Y 2 k ) − (E Yk)2 = p − p2 . By theorem 10.2.33, we have var X = ∑n k=1 var Yk = np(1 − p). □ CHAPTER 10. STATISTICS AND PROBABILITY THEORY shift µ of the variable z, while scaling the difference x − µ by coefficient σ > 0 does the rest. Thus, there are two real parameters µ and σ > 0 and the density function is of the form: gµ,σ(x) = e−(x−µ)2 /(2σ2 ) . Simple variable substitution leads to ∫ ∞ −∞ e−(x−µ)2 /(2σ2 ) dx = √ 2πσ. Thus there is an entire two-parametric class of densities φµ,σ = 1 σ √ 2π e− (x−µ)2 2σ2 of random variables. The corresponding distributions are denoted by N(µ, σ2 ). We return to the asymptotic closeness of the normal and binomial distributions for n → ∞ after creating suitable tools. The following illustration reveals, how well this works. The discrete values correspond to Bi(40, 0.5), while the curve depicts the density of N(20, 10). 10.2.22. Distributions of random vectors. As for the scalar random variables, one defines the distribution functions and the density or the probability mass function for continuous and discrete random vectors. There are joint probability mass functions and densities. For two discrete random variables, i.e. a discrete vector (X, Y ) of random variables, define their (joint) probability mass function f(x, y) = { P(X = xi ∧ Y = yj) x = xi, y = yj 0 otherwise. A random vector (X, Y ) is called continuous, if its distribution function is defined as for continuous random variables. This means, for all a, b ∈ R, F(a, b) = P(X < a, Y < b) = = ∫ b −∞ ∫ a −∞ f(x, y)dxdy, and the function f(x, y) is called the (joint) density of the random vector (X, Y ). 933 10.F.1. An archer shoots five arrows at a target. Each time, the probability he hits is 0.6, and the individual results are independent. Let X be the random variable which corresponds to the number of hits. Determine its distribution and find its expected value and variance. Solution. Clearly, the shots are independent experiments with the Bernoulli distribution A(3 5 ). Thus, by the definition of the binomial distribution, we have X ∼ Bi(5, 3 5 ). By F, the expected value and variance of Bi(n, p) are equal to np and np(1−p), respectively, which gives E X = 3 and var X = 6 5 for our case. □ 10.F.2. Consider the discrete random variable X which takes on the values k = 0, 1, 2, 3, . . . , each with probability P(X = k) = p(1 − p)k (geometric distribution). Find E X (the expected number of failures before the first success) and var X. Solution. 
Using the definition of the expected value and the formula for summing the derivative of a geometric series, we calculate E X = ∞∑ k=0 kp(1 − p)k = p(1 − p) ∞∑ k=0 k(1 − p)k−1 = = p(1 − p) 1 p2 = 1 − p p . Similarly, using the formula for summing the second derivative of a geometric series, we compute E(X2 ) = ∞∑ k=0 k2 p(1 − p)k = (1 − p)(2 − p) p2 , hence the variance is var X = E(X2 ) − (E X)2 = 1−p p2 . □ 10.F.3. A random variable X is defined by its density fX(x) = 3 x4 for x ∈ (1, ∞) and fX(x) = 0 elsewhere. Find its distribution function, expected value, and variance. Solution. By the definition of the distribution function, we have, for x ∈ (1, ∞), FX(x) = ∫ x 1 3 t4 dt = [ − 1 t3 ]x 1 = 1 − 1 x3 . The expected value of X is equal to E X = ∫ ∞ 1 3 x3 dx = [ − 3 2x2 ]∞ 1 = 3 2 and the expected value of X2 is E(X2 ) = ∫ ∞ 1 3 x2 dx = [ − 3 x ]∞ 1 = 3. Therefore, var X = 3 − (3 2 )2 = 3 4 . □ CHAPTER 10. STATISTICS AND PROBABILITY THEORY For a general continuous random vector X = (X1, . . . , Xn), define F(a1, . . . , an) = P(X1 < a1, . . . , Xn < an) = = ∫ an −∞ · · · ∫ a1 −∞ f(x1, . . . , xn) dx1 · · · dxn, and similarly for discrete random vectors with more compo- nents. A random vector (X, Y ) with both X and Y continuous is not always a continuous vector in the above sense. For example, taking a continuous variable X, the random vector (X, 2X) is neither continuous nor discrete, since the entire probability mass is concentrated along the line y = 2x in the plane, but not in individual points. The marginal distribution for one of the variables can be obtained by summation or integration over the others. For instance, in the case of a discrete random vector (X, Y ), the events (X = xi, Y = yj) for all possible values xi and yj with non-zero probabilities for X and Y , respectively, form an exhaustive collection of events for the vector (X, Y ). Thus P(X = xi) = ∞∑ j=1 P(X = xi, Y = yj), which relates the marginal probability distribution of the random variable X to the joint probability distribution of the random vector (X, Y ). In the case of continuous random vectors, proceed similarly using integrals instead of sums. 10.2.23. Stochastic independence. It is known from subsection 10.2.3 what (in)dependence means for events. Random variables X1, · · · , Xn are (stochastically) independent if and only if for any ai ∈ R, the events X1 < a1, . . . , Xn < an are independent. In view of the definition of the distribution function F of the random vector (X1, . . . , Xn), this is equivalent to F(x1, . . . , xn) = FX1 (x1) · · · FXn (xn), where FXi are the distribution functions of the individual components. It follows that the events corresponding to Xk ∈ Ik for arbitrarily chosen intervals Ik is also independent. The probability of X1 ∈ [a, b) and simultaneously Xi ∈ (−∞, ci) for the other components is F(b, c2, . . . , cn) − F(a, c2, . . . , cn) = (FX1 (b) − FX1 (a))FX2 (c2) . . . FXn (cn), and so on. The densities and probability mass functions behave well too: Proposition. For any random vector (X1, . . . , Xn), the following two conditions are equivalent: • The random variables X1 . . . , Xn are stochastically in- dependent. • The joint distribution function F of the of random vector (X1, . . . , Xn) is the product of the marginal distribution functions FXi of the individual components. 934 10.F.4. A random variable X is defined by its density fX(x) = cos x for x ∈ ⟨0, π 2 ⟩ and fX(x) = 0 elsewhere. Find its expected value, variance, and median. Solution. 
Using the definition and integration by parts, we get E X = ∫ π 2 0 x cos xdx = [x sin x + cos x] π 2 0 = π 2 − 1. Using double integration by parts, we obtain E(X2 ) = ∫ π 2 0 x2 cos xdx = = [ x2 sin x + 2x cos x − 2 sin x ]π 2 0 = (π 2 )2 − 2, so the variance is equal to var X = (π 2 )2 −2−(π 2 −1)2 = π− 3. By definition, the distribution function is equal to FX(x) =∫ x 0 cos tdt = sin x, and the median is F−1 (0.5) = π 6 . □ 10.F.5. A random variable X is defined by its density fX(x) = λe−λx for x ≥ 0, and fX(x) = 0 elsewhere (the so-called exponential distribution; λ > 0 is a fixed parameter). Find its expected value, variance, mode (the real number where the density reaches its maximum), and median. Solution. Using the definition and integration by parts, we get E X = ∫ ∞ 0 xλe−λx dx = [ −xe−λx − 1 λ e−λx ]∞ 0 = 1 λ , E(X2 ) = ∫ ∞ 0 x2 λe−λx dx = = [ −x2 e−λx − 2x 1 λ e−λx − 2 λ2 e−λx ]∞ 0 = 2 λ2 , hence var X = E(X2 ) − (E X)2 = 1 λ2 . Since F′ X(x) = −λ2 e−λx < < 0, the density keeps decreasing. Therefore, its maximum is at zero. By definition, we have F(x) = ∫ x 0 λe−λt dt = 1 − e−λx , so the median is equal to F−1 (0.5) = − 1 λ ln(1 2 ) = ln 2 λ . □ 10.F.6. The joint probability mass function of a discrete random vector (X1, X2) is defined by π(0, −1) = c, π(1, 0) = π(1, 1) = π(2, 1) = 2c, π(2, 0) = 3c and zero elsewhere. Find the parameter c and compute the covariance cov(X1, X2). Solution. If π is to be a probability mass function, then the sum of its values over the entire domain must be equal to 1, i. e., ∑ i,j π(i, j) = c + 3.2c + 3c = 10c = 1, so c = 1 10 . The probability mass function π1 of X1 is given by the sum of the joint function over all possible values of X2, i. e., π1(i) = ∑ j π(i, j). Thus, we have π1(0) = c, π1(1) = 4c, π1(2) = 5c and zero elsewhere. Similarly, for CHAPTER 10. STATISTICS AND PROBABILITY THEORY Moreover, if all Xi are discrete random variables, then they are independent if and only if the joint probability mass function f of the random vector (X1, . . . , Xn) is the product of the marginal probability mass functions fXi of the individiual components. Similarly, if all Xi are continuous random variables, then they are independent if and only if the joint density function f of the random vector (X1, . . . , Xn) exists and it is the product of the marginal density functions fXi of the individiual components. In particular, any random vector with independent continuous components is again a continuous random vector. Proof. Many of the claims are already verified. The only nontrivial implication left is the one assuming the product formula for the joint distribution function and deriving the claim on the probability function or the density. The argument for n = 2 is shown below, the general case is analogous. Consider first two discrete independent random variables X, Y . Then fX,Y (xi, yj) = P(X = xi, Y = yi) = P(X = xi)P(Y = yj) = fX(xi)fY (yj). The joint distribution function is FX,Y (x, y) = ∑ xi 0, iii) Y = ln X, x > 0, iv) Y = 1 X , x > 0. Solution. We can simply apply the formula for the density of a transformed random variable, which yields a) fY (y) = f(ln y)1 y , b) fY (y) = 2f(y2 )y, c) fY (y) = f(ey )ey , d) fY (y) = f(1/y) 1 y2 . □ CHAPTER 10. STATISTICS AND PROBABILITY THEORY Z ∼ N(0, 1) in 10.2.21. This is verified easily. FY (y) = P(Y < y) = P(µ + σZ < y) = Φ ( 1 σ (y − µ) ) = 1√ 2π ∫ y−µ σ −∞ e−z2 /2 dz = ∫ y −∞ 1 √ 2πσ e− (x−µ)2 2σ2 dx, where the substitution x = µ + σz is used in the last step. This is exactly what is wanted. 
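The transformation rules just listed are easy to test by simulation. For instance, taking X with the exponential density f(x) = λ e^(−λx) and Y = e^X, case a) predicts fY(y) = f(ln y)/y = λ y^(−λ−1) for y > 1. A small sketch (Python with numpy; the value of λ and the sampling grid are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
lam = 3.0                          # test parameter for the density of X
x = rng.exponential(scale=1/lam, size=200_000)
y = np.exp(x)                      # transformed variable Y = e^X

# empirical histogram of Y on (1, 3) versus the predicted density f(ln y)/y
bins = np.linspace(1, 3, 41)
hist, edges = np.histogram(y, bins=bins, density=True)
mid = (edges[:-1] + edges[1:]) / 2
predicted = lam * mid**(-lam - 1)  # f(ln y) * (1/y) with f(x) = lam * exp(-lam * x)
print(np.max(np.abs(hist - predicted)))  # small, up to sampling noise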
More generally, the above formula (1) has a straightforward analog for the density of Y = ψ(X) for a continuous X in the case when ψ has a non-zero derivative (thus ψ is invertible):

(2) fY(y) = |ψ′(ψ⁻¹(y))|⁻¹ fX(ψ⁻¹(y)).

Check the formula yourself! (Start with the case when the derivative ψ′ is always positive.)

It is more complicated with more general sums of independent random variables. Consider two such continuous random variables X and Y with densities fX and fY, respectively. The distribution function of the random variable V = X + Y is computed directly (exploit the independence of X and Y and write the joint density of (X, Y) as a product):

FV(u) = ∫∫_{x+y≤u} fX(x) fY(y) dx dy.

H. Inequalities and limit theorems

Recall Markov's inequality: for any non-negative random variable X and any a > 0, it holds that P(X ≥ a) ≤ E X / a.

10.H.1. Consider a non-negative random variable X with expected value µ. With no further information about X, bound P(X > 3µ). Then, compute P(X > 3µ) if you know that X ∼ Ex(1/µ).

Solution. If the non-negative random variable X does not take zero with probability 1, then its expected value µ is positive. Therefore, the wanted probability can be bounded using Markov's inequality as P(X ≥ 3µ) ≤ µ/(3µ) = 1/3. If we know that X ∼ Ex(1/µ), then P(X > 3µ) = 1 − P(X ≤ 3µ) = 1 − F(3µ), where F is the distribution function of the exponential distribution. By definition, this is

F(x) = ∫₀ˣ (1/µ) e^(−t/µ) dt = [−e^(−t/µ)]₀ˣ = 1 − e^(−x/µ).

Hence, P(X > 3µ) = 1/e³. □

10.H.2. At a particular place, the average speed of wind is 20 kilometers per hour.
• Regardless of the distribution of the speed as a random variable, bound the probability that in a given observation, the speed does not exceed 60 km/h.
• Find the interval in which the speed lies with probability at least 0.9 if you know that the standard deviation is σ = 1 km/h.

Solution. Let X denote the random variable that corresponds to the speed. In the first case, we can only use Markov's inequality, leading to P(X ≤ 60) = 1 − P(X ≥ 60) ≥ 1 − 20/60 = 2/3. In the second case, we know the variance (or standard deviation) of the speed, so we can use Chebyshev's inequality (see 10.2.32): P(|X − 20| < x) = 1 − P(|X − 20| ≥ x) ≥ 1 − 1/x², and the right-hand side is at least 0.9 as soon as x ≥ √10 ≈ 3.2. Thus, the wanted interval is (16.8 km/h, 23.2 km/h). □

If X is the random variable corresponding to the amount won, it seems that the correct answer is “anything below the expected value E X”. As derived in 10.2.1, P(T = k) = 2⁻ᵏ, provided that the coin is fair. Summing up all the probabilities multiplied by 2ᵏ gives ∑_{k=1}^∞ 2⁻ᵏ · 2ᵏ = ∑_{k=1}^∞ 1 = ∞. Therefore, the expected value does not exist. So it seems that it is advantageous for the gambler to play even if the initial amount is very high... Simulating the game for a while shows that the amount won is somewhere around 2⁴. The reason is that no one is able to play infinitely long, hence the extremely high amounts are not feasible enough to be won, so such amounts cannot be taken seriously. In decision theory, these cases (when the expected value does not directly correspond to the evaluated utility) are called the St. Petersburg paradox, and much literature has been devoted to this topic.³

10.2.29. Properties of the expected value. In the case of simple distributions, compute the expected value directly from the definition. For instance, for the Bernoulli distribution A(p), it is immediate that E X = (1 − p) · 0 + p · 1 = p. Similarly, compute the expected value np of the binomial distribution Bi(n, p). This requires more thought.
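Before the general argument, the claim can be checked empirically; a minimal simulation sketch (Python with numpy; n and p are arbitrary test values):

import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 0.3                           # arbitrary test parameters
sample = rng.binomial(n, p, size=1_000_000)
print(sample.mean(), n * p)              # approx. 12
print(sample.var(), n * p * (1 - p))     # approx. 8.4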
The result is a direct corollary of the following general theorem since Bi(n, p) is the sum of n random variables with the Bernoulli distributions A(p). For any random variables X, Y , real constants a, b, consider the expected values of the functions of random variables X + Y and a + bX, provided the expected values E X and E Y exist. It follows directly from the definition that the constant random variable a has expected value a. Further, E(bX) = b E X, since the constant b can be factored out from the sums or in- tegrals. More generally, the expected value of the product of independent random variables X and Y can be computed as follows. Suppose the components of the vector (X, Y ) are discrete and independent, with probability mass functions fX(xi), fY (yj). Then, E(XY ) = ∑ i ∑ j xiyjfX(xi)fY (yj) = (∑ i xifX(xi) )(∑ j yjfY (yj) ) = E X E Y. Similarly, verify the equality E(XY ) = E X E Y for independent continuous random variables. 3Going back to Bernoulli, 1738, the real value is given by the utility, rather than the price. 940 10.H.3. Each yogurt of an undisclosed company contains a photo of one of 26 ice-hockey world champions. Suppose the players are distributed uniformly at random. How many yogurts must Vera buy if she wants the probability of getting at least 5 photos of Jaromír Jágr to be at least 0.95? Solution. Let X denote the random variable that corresponds to the number of obtained photos of Jaromír Jágr (parametrized by the number n of yogurts bought). Clearly, X ∼ Bi(n, 1 26 ). We are looking for the value of n for which P(X ≥ 5) = 0.95, i. e., FX(4) = P(X ≤ 4) = 0.05. In order to find it, we use the de Moivre-Laplace theorem and approximate the binomial distribution with the normal distribution (we assume that n is large, so the approximation error will be small). By F, the expected value of X is E X = n 26 , and its variance is var X = 25n 262 . Denoting the corresponding standardized variable by Z, we can reformulate the condition as 0.05 = P(X ≤ 4) = P ( Z ≤ 4 − n 26 5 √ n 26 ) = FZ ( 104 − n 5 √ n ) , where by the approximation assumption, FZ ≈ Φ is the distribution function of the normal distribution N(0, 1). Since we must have n > 104, using Φ(−x) = 1 − Φ(x), the above equation gives n − 104 = Φ−1 (0.95) · 5 √ n. Using a table of the normal distribution or appropriate software, we can learn that z(0.95) = 1.65. Solving this quadratic equation, we get n . = 228.8. Thus, Vera must buy at least 229 yogurts. □ 10.H.4. We roll a die 1200 times. Find the probability that the number of 6s lies between 150 and 250 (inclusive) using Chebyshev’s inequality, and then using Moivre-Laplace the- orem. Solution. Let X denote the random variable which corresponds to the number of 6s. Clearly, X ∼ Bi(1200, 1 6 ). By F, we have E X = 1200 · 1 6 = 200 and var X = 200(1 − 1 6 ) = 500 3 . The condition on the number of 6s says that 150 ≤ X ≤ 250, which can be written as |X−200| ≤ 50. Using Chebyshev’s inequality 10.2.32, we get P(|X−200| ≤ 50) = 1−P(|X−200| ≥ 51) ≥ 1− 500 3 · 512 ≈ 0.94. (2) The exact value of the wanted probability is given by the expression P(150 ≤ X ≤ 250) = FX(250) − FX(150), where FX is the distribution function of the binomial distribution. By definition, P(150 ≤ X ≤ 250) = 250∑ k=150 ( 1200 x ) ( 1 6 )x ( 5 6 )1200−x . This expression is hard to evaluate without a computer, so we use Moivre-Laplace theorem. Replacing X with the standardized random variable Z = √ 3(X − 200) 10 √ 5 , CHAPTER 10. 
STATISTICS AND PROBABILITY THEORY Now compute E(X + Y) for arbitrary random variables. For discrete distributions of X and Y,

E(X + Y) = ∑ᵢ ∑ⱼ (xᵢ + yⱼ) P(X = xᵢ, Y = yⱼ) = ∑ᵢ (xᵢ ∑ⱼ P(X = xᵢ, Y = yⱼ)) + ∑ⱼ (yⱼ ∑ᵢ P(X = xᵢ, Y = yⱼ)) = ∑ᵢ xᵢ P(X = xᵢ) + ∑ⱼ yⱼ P(Y = yⱼ),

where absolute convergence of the first double sum follows from the triangle inequality and the absolute convergence of the sums that stand for the expected values of the particular random variables. Absolute convergence is used in order to interchange the sums. Dealing with continuous variables X and Y, whose expected values exist, proceed analogously:

E(X + Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) fX,Y(x, y) dx dy = ∫_{−∞}^{∞} x (∫_{−∞}^{∞} fX,Y(x, y) dy) dx + ∫_{−∞}^{∞} y (∫_{−∞}^{∞} fX,Y(x, y) dx) dy = ∫_{−∞}^{∞} x fX(x) dx + ∫_{−∞}^{∞} y fY(y) dy = E X + E Y,

where absolute convergence of the integrals of the expected values E X and E Y is used to interchange the integrals by Fubini's theorem. Altogether, the expected formula

E(X + Y) = E X + E Y

is obtained, whenever the expected values E X and E Y exist. Straightforward application of this result leads to the following:

Affine nature of expected values

For any constants a, b1, . . . , bk and random variables X1, . . . , Xk,

E(a + b1X1 + · · · + bkXk) = a + b1 E X1 + · · · + bk E Xk.

The following theorem extends this behaviour with respect to affine transformations of random vectors, and shows that the expected value is invariant with respect to affine transformations, as is the arithmetic mean:

Theorem. Let X = (X1, . . . , Xn) be a random vector with expected value E X, a ∈ Rᵐ, and B ∈ Mat_mn(R) a matrix. Then, E(a + B · X) = a + B · E X.

Proof. There is almost nothing remaining to be proved. Since the expected value of a vector is defined as the vector of the expected values of the components, it suffices to restrict attention to a single item in E(a + B · X). Thus, it can be assumed that a is a scalar and B is a matrix with a single row.

Then, by 10.2.40, we have Z ∼ N(0, 1), i.e., FZ ≈ Φ. Thus,

P(150 ≤ X ≤ 250) = P(√3(150 − 200)/(10√5) ≤ Z ≤ √3(250 − 200)/(10√5)) ≈ Φ(√15) − Φ(−√15) = 2Φ(√15) − 1.

We learn that Φ(√15) ≈ 0.99994, so the wanted probability is approximately 99.988 %. □

10.H.5. At the Faculty of Informatics, 10 % of students have a grade average below 1.2 (let us call them successful). How many students must we meet if the probability that there are 8–12 % successful ones among them is to be at least 0.95? Solve this problem using Chebyshev's inequality, and then using the de Moivre-Laplace theorem.

Solution. Let X denote the random variable that corresponds to the number of successful students, parametrized by the number n of students we meet. Since a randomly met student has probability 10 % of being successful, when meeting n students, we have X ∼ Bi(n, 1/10). By F, we have E X = 0.1n and var X = 0.09n. By Chebyshev's inequality 10.2.32, the wanted probability satisfies

P(|X − 0.1n| ≤ 0.02n) = 1 − P(|X − 0.1n| ≥ 0.02n) ≥ 1 − 0.1 · 0.9n/(0.02n)² = 1 − 225/n.

The inequality 1 − 225/n ≥ 0.95, and hence P(|X − 0.1n| ≤ 0.02n) ≥ 0.95, holds for n ≥ 4500. The exact value of the probability is given in terms of the distribution function FX of the binomial distribution: P(0.08n ≤ X ≤ 0.12n) = FX(0.12n) − FX(0.08n).
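The quality of the two estimates obtained in 10.H.4 can be judged against the exact binomial value; a short sketch, assuming scipy.stats is available:

from scipy.stats import binom, norm

n, p = 1200, 1/6
# exact: P(150 <= X <= 250) for X ~ Bi(1200, 1/6)
exact = binom.cdf(250, n, p) - binom.cdf(149, n, p)
# the normal approximation from 10.H.4: 2*Phi(sqrt(15)) - 1
approx = 2 * norm.cdf(15**0.5) - 1
# the Chebyshev bound from 10.H.4
bound = 1 - (500 / 3) / 51**2
print(exact, approx, bound)  # approx. 0.9999, 0.9999, 0.94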
Using the de Moivre-Laplace theorem (see 10.2.40), we can approximate the standardized random variable Z = 10X−n 3 √ n with the standard normal distribution, FZ ≈ Φ, so 0.95 = P(0.08n ≤ X ≤ 0.12n) = P(− √ n 15 ≤ Z ≤ √ n 15 ) ≈ ≈ Φ( √ n 15 ) − Φ(− √ n 15 ) = = 2Φ( √ n 15 ) − 1. Hence √ n = 15z(0.975) and we learn n ≈ 864.4. Thus, we can see that it is sufficient to meet 865 students. □ 10.H.6. The probability that a planted tree will grow is 0.8. What is the probability that out of 500 planted trees, at least 380 trees will grow? Solution. The random variable X that corresponds to the number of trees that will grow has binomial distribution X ∼ Bi(500, 4 5 ). By F, we have E X = 400 and var X = 80. The standardized random variable is Z = X−400√ 80 . By the de CHAPTER 10. STATISTICS AND PROBABILITY THEORY Then, the expected value of a finite sum of random variables is obtained, and by the above results, that exists and is given as the sum of the expected values of the individual items. This is exactly what is wanted to be proved. □ 10.2.30. Quantiles and critical values. Introduce numerical characteristics that are analogous to those from descriptive statistics. There, the next useful characteristics are the quantiles, cf. 10.1.5. Consider a random variable X whose distribution function FX is strictly monotone. This is satisfied by any random variable whose density is nowhere equal to zero, which is the case for the normal distribution, for example. In this case, define the quantile function F−1 X simply as the inverse function (FX)−1 : (0, 1) → R. This means that the value y = F−1 (α) is such that P(X < y) = α. This corresponds precisely to the quantiles from descriptive statistics using relative frequencies for the probabilities. Quantile function For any random variable X with distribution function FX(x), define its quantile function F−1 (α) = inf{x ∈ R; F(x) ≥ α}, α ∈ (0, 1). Clearly, this is a generalization of the previous definition in the case the distribution function is strictly monotone. As seen in descriptive statistics, the most used quantiles are for α = 0.5 (the median), α = 0.25 (the first quartile), α = 0.75 (the third quartile). Similarly for deciles and percentiles when α is equal to (integer) multiples of tenths and hundredths, respectively. It follows directly from the definition that the quantile function for a given random variable X allows the determination of intervals into which the values of X fall with a chosen probability. For instance, the value Φ−1 (0.975), approximately 1.96, corresponds to percentile 97.5 for the normal distribution N(0, 1). This says that with the probability of 2.5 %, the value of such a random variable Z ∼ N(0, 1) is at least 1.96. Since the density of the variable Z is symmetric with respect to the origin, this observation can be interpreted as that there is only a 5% probability that the value of |Z| is greater 1.96. There are similar intervals and values when discussing the reliability of estimates of characteristics of random vari- ables. Critical values For a random variable X and a real number 0 < α < 1, define its critical value x(α) at level α as P(X ≥ x(α)) = α. This means that x(α) = F−1 X (1 − α) where F−1 X is the quantile function of the random variable X. 942 Moivre-Laplace theorem, we have FZ ≈ Φ, so P(X ≥ 380) = P(Z ≥ 380 − 400 √ 80 ) ≈ 1 − Φ(− √ 20 2 ) = = Φ( √ 20 2 ) ≈ 0.987. □ 10.H.7. 
Using the distribution function of the standard normal distribution, find the probability that the absolute difference between the heads and the tails in 1600 tosses of a coin is at least 82.

Solution. Let X denote the random variable that corresponds to the number of times the coin came up heads. Then X has binomial distribution Bi(1600, 1/2) (with expected value 800 and standard deviation 20), so for a large value of n = 1600, by the de Moivre-Laplace theorem, the distribution function of the variable (X − 800)/20 can be approximated with the distribution function Φ of the standard normal distribution. Thus, the wanted probability is

P = 1 − P[759 ≤ X ≤ 841] = 1 − P[−2.05 ≤ (X − 800)/20 ≤ 2.05] ≈ 2Φ(−2.05) ≈ 0.0404. □

10.H.8. Using the distribution function of the standard normal distribution, find the probability that the absolute difference between the heads and the tails in 3600 tosses of a coin is at most 66.

Solution. Let X denote the random variable that corresponds to the number of times the coin came up heads. Then X has binomial distribution Bi(3600, 1/2) (with expected value 1800 and standard deviation 30), so for a large value of n = 3600, the distribution function of the variable (X − 1800)/30 can be approximated, by the de Moivre-Laplace theorem, with the distribution function Φ of the standard normal distribution. Thus, the wanted probability is

P[1767 ≤ X ≤ 1833] = P[−1.1 ≤ (X − 1800)/30 ≤ 1.1] ≈ Φ(1.1) − Φ(−1.1) = 2Φ(1.1) − 1 ≈ 0.7287. □

10.H.9. The probability that a seed will grow is 0.9. How many seeds must we plant if we require that, with probability at least 0.995, the relative number of grown items differs from 0.9 by at most 0.034?

Solution. The random variable X that corresponds to the number of grown seeds, out of n planted ones, has binomial distribution X ∼ Bi(n, 9/10). By F, we have E X = 0.9n and var X = 0.09n, so the standardized variable is Z = (X − 0.9n)/√(0.09n).

10.2.31. Variance and standard deviation. The simple numerical characteristics concerning the variability of sample values in descriptive statistics were the variance and the standard deviation. Define them similarly for random variables.

Variance of a random variable

Given a random variable X with finite expected value, its variance is defined as var X = E((X − E X)²), provided the right-hand expected value exists. Otherwise, the variance of X does not exist. The square root √(var X) of the variance is called the standard deviation of the random variable X.

Using the properties of the expected value, a simpler formula can be derived for the variance of a random variable X whose expected value exists:

var X = E(X − E X)² = E(X² − 2X(E X) + (E X)²) = E X² − 2(E X)² + (E X)² = E X² − (E X)².

Consider how affine transformations change the variance of a random variable. Given real numbers a, b and a random variable X with expected value and variance, consider the random variable Y = a + bX. Compute

var Y = E((a + bX) − E(a + bX))² = E(b(X − E X))² = b² var X.

Thus are derived the following useful formulae:

Properties of variance

(1) var X = E(X²) − (E X)²
(2) var(a + bX) = b² var X
(3) √(var(a + bX)) = |b| √(var X)

Given a random variable X with expected value and nonzero variance, define its standardization as the random variable Z = (X − E X)/√(var X). Thus, the standardized variable is the affine transformation of the original variable whose expected value equals zero and variance equals one.

10.2.32. Chebyshev's inequality.
A good illustration of the usefulness of variance is Chebyshev's inequality. This connects the variance directly to the probability that the random variable assumes values that are distant from its expected value.

The condition in question can be written as

P(|X − 0.9n| ≤ 0.034n) = P(|Z| ≤ 0.034n/√(0.09n)) = P(|Z| ≤ 0.34√n/3) ≥ 0.995.

By the de Moivre-Laplace theorem, for large n, the distribution function can be approximated by the distribution function Φ of the normal distribution. Thus,

P(|Z| ≤ 0.34√n/3) ≈ Φ(0.34√n/3) − Φ(−0.34√n/3) = 2Φ(0.34√n/3) − 1.

Altogether, we get the condition 2Φ(0.34√n/3) − 1 ≥ 0.995. From this, we compute n ≥ (3z(0.9975)/0.34)² ≈ 615. □

10.H.10. The service life (in hours) of a certain kind of gadget has exponential distribution with parameter λ = 1/10. Using the central limit theorem, bound the probability that the total service life of 100 such gadgets lies between 900 and 1050 hours.

Solution. In exercise 10.F.5, we computed that the expected value and variance of a random variable Xi with exponential distribution are equal to E Xi = 1/λ and var Xi = 1/λ², respectively. Thus, the expected service life of each gadget is E Xi = µ = 10 hours, with variance var Xi = σ² = 100 hours². By the central limit theorem, the distribution of the transformed random variable (1/√n) ∑ᵢ₌₁ⁿ (Xi − µ)/σ = (1/100) ∑ᵢ₌₁¹⁰⁰ Xi − 10 approaches the standard normal distribution as n tends to infinity. Thus, the wanted probability for the service life of 100 gadgets

P(900 ≤ ∑ Xi ≤ 1050) = P(−1 ≤ (1/100) ∑ᵢ₌₁¹⁰⁰ Xi − 10 ≤ 0.5)

can be approximated with the distribution function of the normal distribution: P(900 ≤ ∑ Xi ≤ 1050) ≈ Φ(0.5) − Φ(−1) ≈ 0.533. □

10.H.11. We keep putting items into a chest. The expected mass of an item is 3 kg and the standard deviation is 0.8 kg. What is the maximum number of items that we can put into the chest so that with probability at least 99%, the total mass does not exceed one ton?

Solution. Let Xi denote the random variable that corresponds to the mass of the i-th item. Then, we have µ = E Xi = 3 and σ = √(var Xi) = 0.8 (in kilograms), and we want to have P(∑ᵢ₌₁ⁿ Xi ≤ 1000) = 0.99.

Chebyshev's inequality

Theorem. Consider a random variable X with finite variance, and fix an arbitrary ε > 0. Then,

P(|X − E X| ≥ ε) ≤ var X / ε².

Proof. Suppose X is continuous. Set µ = E X and compute, using the definition:

var X = ∫_{−∞}^{∞} (x − µ)² f(x) dx = ∫_{|x−µ|≥ε} (x − µ)² f(x) dx + ∫_{|x−µ|<ε} (x − µ)² f(x) dx ≥ ∫_{|x−µ|≥ε} ε² f(x) dx = ε² P(|X − µ| ≥ ε). □

The analogous proof for discrete random variables is left as an exercise for the reader. Realizing that the variance is the square of the standard deviation σ, the choice ε = kσ yields the probability P(|X − E X| ≥ kσ) ≤ 1/k².

Chebyshev's inequality helps in understanding asymptotic descriptions of limit processes. For instance, consider the sequence of random variables X1, X2, . . . with probability distributions Xn ∼ Bi(n, p), with a fixed value of p, 0 < p < 1. Intuitively, it is expected that the relative frequency of success should approach the probability p as n increases, i.e., that the values of the random variables Yn = (1/n)Xn should approach p. Clearly, E Yn = np/n = p and var Yn = np(1 − p)/n² = p(1 − p)/n. Direct application of Chebyshev's inequality yields, for any fixed ε > 0, that P(|Yn − p| ≥ ε) ≤ p(1 − p)/(nε²). Hence it is clear that, for any fixed ε > 0,

lim_{n→∞} P(|Xn/n − p| ≥ ε) = 0.
This result is known as Bernoulli’s theorem (one of many). This type of limit behaviour is called convergence in probability. Thus it is proved (as a corollary of Chebyshev’s inequality) that the random variables Yn converge in probability to the constant random variable p. 944 By the central limit theorem 10.2.40, the distribution of the random variable Sn = 1 √ n n∑ i=1 ( Xi − 3 0.8 ) = 1 0.8 √ n n∑ i=1 Xi − 3 √ n 0.8 can be approximated by the standard normal distribution. Thus, we get P( n∑ i=1 Xi ≤ 1000) = P(Sn ≤ 1000 0.8 √ n − 3 √ n 0.8 ) ≈ Φ( 1000 0.8 √ n − 3 √ n 0.8 ). We learn that z(0.99) ≈ 2.326, so the wanted n satisfies the quadratic equation 1000 0.8 √ n − 3 √ n 0.8 = 2.326, whence we get n ≈ 322. □ I. Testing samples from the normal distribution In subsection 10.3.4, we introduced the so-called twosided interval estimate of an unknown parameter µ of the normal distribution N(µ, σ2 ). In some cases, we may be interested only in an upper or lower estimate, i.e. a statistic U or L for which P(µ < U) or P(L < µ), respectively. Then, we talk about a one-sided confidence interval (−∞, U) or (L, ∞). The formula for these intervals can be derived similarly as for the two-sided interval. Now, we have for the random variable Z = √ n ¯X−µ σ ∼ N(0, 1) that 1 − α = Φ(z(1 − α)) = P(Z < z(1 − α)). Hence it immediately follows that 1 − α = P( ¯X − σ √ n z(1 − α) < µ), so L = ¯X − σ√ n z(1 − α). Similarly, we find U = ¯X + σ√ n z(1 − α), and for a distribution with unknown variance, µ ≥ ¯X − S√ n tn−1(1 − α) and µ ≤ ¯X + S√ n tn−1(1 − α). If we want to estimate the variance σ2 of a random distribution, then we use theorem 10.3.3, similarly as when we derived it for the expected value. This time, we use the second part of the theorem, by which the random variable n−1 σ2 S2 has distribution χ2 . Then, we can immediately see that 1 − α = P ( χ2 n−1(α/2) ≤ n − 1 σ2 S2 ≤ χ2 n−1(1 − α/2) ) . Thus, the two-sided 100(1 − α)% confidence interval for the variance is ( (n − 1)S2 χ2 n−1(1 − α/2) , (n − 1)S2 χ2 n−1(α/2) ) and similarly for the one-sided upper and lower estimates, we get σ2 ≤ (n − 1)S2 χ2 n−1(α) , resp. (n − 1)S2 χ2 n−1(1 − α) ≤ σ2 . CHAPTER 10. STATISTICS AND PROBABILITY THEORY 10.2.33. Covariance. We return to random vectors. In the case of the expected value, the situation is very simple — just take the vector of expected values. When characterizing the variability, the dependencies between the individual components are also of much interest. We follow the idea from 10.1.9 again. Covariance Given random variables X, Y whose variances exist, Define their covariance as cov(X, Y ) = E ((X − E X)(Y − E Y )) The basic properties of the concept can be derived very easily: Theorem. For any random variables X, Y , Z whose variances exist and real numbers a, b, c, d, cov(X, Y ) = cov(Y, X)(1) cov(X, Y ) = E(XY ) − (E X)(E Y )(2) cov(X + Y, Z) = cov(X, Z) + cov(Y, Z)(3) cov(a + bX, c + dY ) = bd cov(X, Y )(4) var(X + Y ) = var X + var Y + 2 cov(X, Y ).(5) Moreover, if X and Y are independent, then cov(X, Y ) = 0, and consequently (6) var(X + Y ) = var X + var Y. Proof. Directly from the definition, the covariance is symmetric in the arguments. The second proposition follows immediately from the properties of the expected value: cov(X, Y ) = E(X − E X)(Y − E Y ) = E(XY ) − (E Y )X − (E X)Y + E X E Y = E(XY ) − E X E Y. 
The next proposition also follows easily if the definition is expanded and the fact that the expected value of the sum of random variables equals the sum of their expected values is used. The next proposition can be computed directly: cov(a + bX, c + dY ) = = E ( (a + bX − E(a + bX))(c + dY − E(c + dY )) ) = E ( (bX − b E(X))(dY − d E(Y )) ) = E ( bd(X − E(X))(Y − E(Y )) ) = bdE ( (X − E X)(Y − E Y ) ) = bd cov(X, Y ). 945 10.I.1. We roll a die 600 times, obtaining only 45 sixes. Is it possible to say that the die is ideal at level α = 0.01? Solution. For an ideal die, the probability of rolling a six is always p = 1 6 . The number of sixes in 600 rolls is given by a random variable X with binomial distribution X ∼ Bi(600, 1 6 ). By 10.2.40, this distribution can be approximated by the distribution N(100, 250 3 ). The measured value X = 45 can be considered a random sample consisting of one item. Assuming that the variance is known and applying 10.3.4, we get that the 99% (two-sided) confidence interval for the expected value µ equals (45 − √ 250 3 z(0.995), 45 + √ 250 3 z(0.995)). We learn that the quantile is approximately z(0.995) ≈ 2.58, which gives the interval (21, 69). However, for an ideal die, we clearly have µ = 100, so our die is not ideal at level α = 0.01. □ 10.I.2. Suppose the height of 10-years-old boys has normal distribution N(µ, σ2 ) with unknown expected value µ and variance σ2 = 39.112. Taking the height of 15 boys, we get the sample mean ¯X = 139.13. Find i) the 99% two-sided confidence interval for the parameter µ, ii) the lower estimate for µ at significance level 95 %. Solution. a) By 10.3.4, the 100(1 − α)% two-sided confidence interval for the unknown expected value µ of the normal distribution is (1) µ ∈ ( ¯X − σ √ n z(1 − α/2), ¯X + σ √ n z(1 − α/2) ) , where ¯X is the sample mean of n items, σ2 is the known variance, and z(1 − α/2) is the corresponding quantile. Substituting the given values n = 15, σ ≈ 6.254 and the learned z(0.995) ≈ 2.576, we get σ√ n z(α/2) ≈ 4.16, i. e., µ ∈ (134.97, 143.29). b) The lower estimate L for the parameter µ at significance level 95 % is given by the expression L = ¯X − σ√ n z(0.95). We learn that z(0.95) ≈ 1.645, and direct substitution leads to µ ∈ (136.474, ∞). □ 10.I.3. A customer tests the quality of bought products by examining 21 randomly chosen ones. He will accept the delivery if the sample standard deviation does not exceed 0.2 mm. We know that the pursued property of the products has normal distribution of the form N(10 mm; 0.0734 mm2 ). Using statistical tables, find the probability that the delivery will be accepted. How does the answer change if the customer, in order to save expenses, tests only 4 products? Solution. The problem asks for the probability P(S ≤ 0.2). By theorem 10.3.3, when sampling n products, the random variable n−1 σ2 S2 has distribution χ2 n−1. In our case, n = 21 and σ2 = 0.0734, so P(S ≤ 0.2) = P ( 20 0.0734 S2 ≤ 20 0.0734 0.22 ) = χ2 20 ( 20 · 0.22 0.0734 ) CHAPTER 10. STATISTICS AND PROBABILITY THEORY The other propositions about the variance are quite simple corollaries: var(X + Y ) = E ( (X + Y ) − E(X + Y ) )2 = E ( (X − E X) + (Y − E Y ) )2 = E(X − E X)2 + 2 E(X − E X)(Y − E Y ) + E(Y − E Y )2 = var X + 2 cov(X, Y ) + var Y. Furthermore, if X and Y are independent, then E(XY ) = E X E Y , and hence that their covariance is zero. □ Directly from the definition, var(X) = cov(X, X). 
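Property (5) of the theorem above also holds exactly for the sample versions of variance and covariance, which gives a quick numerical illustration; a sketch (Python with numpy, arbitrary dependent test data):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # deliberately correlated with x

lhs = np.var(x + y, ddof=1)
rhs = np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * np.cov(x, y)[0, 1]
print(lhs - rhs)  # zero up to rounding errors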
The latter theorem claims that covariance is a symmetric bilinear form on the real vector space of random variables whose variance exists. The variance is the corresponding quadratic form. The covariance can be computed from the variance of the particular random variables and of their sum, as seen in linear algebra, see the property (5). Notice that the random variable, equal to the sum of n independent and identically distributed random variables Yi behaves, very much differently than the multiple nY . In fact, var(Y1 + · · · + Yn) = n var Y, var(nY ) = n2 var Y. 10.2.34. Correlation of random variables. To a certain extent, covariance corresponds to dependency between the random variables. Its relative version is called the correlation of random variables and, similarly as for the standard deviation, the following concept is defined: Correlation coefficient The correlation coefficient of random variables X and Y whose variances are finite and non-zero is defined as ρX,Y = cov(X, Y ) √ var X √ var Y . As seen from theorem 10.2.33, the correlation coefficient of random variables equals the covariance of the standardized variables 1√ varX (X − E X) and 1√ varY (Y − E Y ). The following equalities hold (here, a, b, c, d are real constants, bd ̸= 0, and X, Y are random variables with finite non-zero variances) ρa+bX,c+dY = sgn(bd)ρX,Y ρX,X = 1. Moreover, if X and Y are independent, then ρX,Y = 0. Note that if the variance of a random variable X is zero, then it assumes the value E X with probability 1. If the value of X falls into an interval I not containing E X with probability p ̸= 0, then the expression var X = E(X − E X)2 is positive. Stochastically, random variables with zero variance behave as constants. 946 The expression in the argument of the distribution function is approximately 10.9, and we can learn from the table of the χ2 distribution that χ2 20(10.9) ≈ 0.05. Thus, the probability that delivery will be accepted is only 5 %. We could have expected the probability to be low: indeed, E S2 = = σ2 = 0.0734 > 0.22 . If the customer tests only 4 products, then the probability of acceptance is given by the expression χ2 3 ( 3·0.22 0.0734 ) ≈ χ2 3(1.63). The value of the distribution function of χ2 in this argument cannot be found in most tables. Therefore, we estimate it using linear interpolation. For instance, if the nearest known points are χ2 3(0.58) = 0.1 and χ2 3(6.25) = 0.9, then χ2 3(1.63) ≈ (1.63 − 0.58) 0.9 − 0.1 6.25 − 0.58 + 0.1 ≈ 0.24. Although this results is only an estimate, we can be sure that the probability of acceptance is much greater than when testing 21 products. □ 10.I.4. From a population with distribution N(µ, σ2 ), where σ2 = 0.06, we have sampled the values 1.3; 1.8; 1.4; 1.2; 0.9; 1.5; 1.7. Find the two-sided 95% confidence interval for the unknown expected value. Solution. We have a random sample of size n = 7 from the normal distribution with known variance σ2 = 0.06. The sample mean is ¯X = 1 7 (1.3 + 1.8 + 1.4 + 1.2 + 0.9 + 1.5 + 1.7) = 1.4 and we can learn for the given confidence level α = 0.05 that z(1 − α/2) = z(0.975) ≈ 1.96. Substituting into (1), we immediately obtain the wanted interval (1.22, 1.58). □ 10.I.5. Let X1, . . . , Xn be a random sample from the distribution N(µ, 0.04). Find the least number of measurements that are necessary so that the length of the 95% confidence interval for µ would not exceed 0.16. Solution. 
Since we have a normal distribution with known variance, we know from (1) that the length of the (1 − α)% confidence interval is 2σ√ n z(1 − α/2). Substituting the given values, we get that the number n of measurements satisfies the inequality 2 · 0.2 √ n z(0.975) ≤ 0.16. Since z(0.975) ≈ 1.96, we obtain n ≥ 24.01. Thus, at least 25 measurements are necessary. □ 10.I.6. Consider a random variable X with distribution N(µ, σ2 ), where µ, σ2 are unknown. The following table shows the frequencies of individual values of this random variable: Xi 8 11 12 14 15 16 17 18 20 21 ni 1 2 3 4 7 5 4 3 2 1 Calculate the sample mean, sample variance, sample standard deviation, and find the 99% confidence interval for the expected value µ. CHAPTER 10. STATISTICS AND PROBABILITY THEORY If the covariance is a positive-definite symmetric bilinear form, then it would follow from the Cauchy-Schwarz inequality (see 3.4.3) that (1) |ρX,Y | ≤ 1 The following theorem claims more. It shows that the full correlation or anti-correlation, i.e. ρX,Y = ±1 of random variables X a Y says that they are bound by an affine relation Y = kX + c, where the sign of k corresponds to the sign in ρX,Y = ±1. On the other hand, a zero correlation coefficient says that the (potential) dependency between the variables is very far from any affine relation of the mentioned type. (Note, however, this does not mean that the variables must be independent). For instance, consider random variables Z ∼ N(0, 1) and Z2 . Then cov(Z, Z2 ) = E Z3 = 0 since the density of Z is an even function. Thus the expected value of an odd power of Z is zero, if it exists. Theorem. If the correlation coefficient is defined, then |ρX,Y | ≤ 1. Equality holds if and only if there are constants k, c such that P(Y = kX + c) = 1. Proof. A stochastic affine relation between Y and X with nonzero coefficient at Y is sought. This is equivalent to Y + sX ∼ D(c) for some fixed value of the parameter s and constant c. In such a case the variance vanishes. Thus one considers the following non-negative quadratic expression: 0 ≤ var ( Y − E Y √ varY + t X − E X √ varX ) = 1 + 2tρX,Y + t2 . The right-hand quadratic expression does not have two distinct real roots; hence its discriminant cannot be positive. So 4(ρX,Y )2 − 4 ≤ 0. Hence the desired inequality is obtained, and also the discriminant vanishes if ρX,Y = ±1. For the only (double) root t0, the corresponding random variable has zero variance; thus it asumes a fixed value with probability 1. This yields the affine relation as expected. □ 10.2.35. Covariance matrix. The variability of a random vector must be considered. This suggests considering the covariances of all pairs of components. The following definition and theorem show that this leads to an analogy of the variance for vectors, including the behaviour of the variance under affine transformations of the random variables. 947 Solution. The sample mean is given by the expression ¯X =∑ niXi/ ∑ ni. Substituting the given values, we get ¯X = 490/32 ≈ 15.3. By definition, the sample variance is S =∑ ni(Xi − ¯X)2 /( ∑ ni − 1). Substituting the given values, we get S2 = 1943/256 ≈ 7.6, so the sample standard deviation is S ≈ 2.8. The formula for the two-sided (1−α)% confidence interval for the expected value µ, when the variance is unknown, was derived at the end of subsection 10.3.4: µ ∈ ( ¯X − S √ n tn−1(1 − α/2), ¯X + S √ n tn−1(1 − α/2) ) . Substitution yields ¯X = 15.3, n = 32, S ≈ 2.8, α = 0.01, and we learn t31(0.995) ≈ 2.75. 
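The computation of 10.I.6 takes only a few lines; a sketch in Python (assuming numpy and scipy are available):

import numpy as np
from scipy.stats import t

values = np.array([8, 11, 12, 14, 15, 16, 17, 18, 20, 21])
counts = np.array([1, 2, 3, 4, 7, 5, 4, 3, 2, 1])

n = counts.sum()                                    # 32
mean = (counts * values).sum() / n                  # 490/32 = 15.3125
s2 = (counts * (values - mean)**2).sum() / (n - 1)  # sample variance
s = np.sqrt(s2)                                     # approx. 2.8
q = t.ppf(0.995, df=n - 1)                          # t_31(0.995), approx. 2.75
half = s / np.sqrt(n) * q
print(mean - half, mean + half)                     # approx. (14.0, 16.7)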
Thus, the 99% confidence interval is µ ∈ (14.0, 16.7). □

10.I.7. Using the following table of the distribution function of the standard normal distribution, find the probability that the absolute difference between the heads and the tails in 3600 tosses of a coin is greater than 90.

[Standard normal distribution table: the entry in row z and column c gives Φ(z + c) − 0.5, for z = 0.0, 0.1, . . . , 3.5 and c = 0.00, 0.01, . . . , 0.09; for example, the entry for z = 1.5, c = 0.00 is .4332, i.e. Φ(1.5) ≈ 0.9332.]

Solution. Let X denote the random variable that corresponds to the number of heads. Then, X has binomial distribution Bi(3600, 1/2) (with expected value 1800 and standard deviation 30), so by the de Moivre-Laplace theorem, for large values of n, the distribution function of the variable (X − 1800)/30 can be approximated by the distribution function Φ of the standard normal distribution.

Covariance matrix

Consider a random vector X = (X1, . . . , Xn)ᵀ all of whose components have finite variances. The covariance matrix of the random vector X is defined in terms of the expected value as (notice the vector X is viewed as a column of random variables now)

var X = E(X − E X)(X − E X)ᵀ.
Using the definition of the expected value of a vector and expanding the matrix multiplication, it is immediate that the covariance matrix var X is the symmetric matrix

( var X1         cov(X1, X2)   · · ·   cov(X1, Xn) )
( cov(X2, X1)    var X2        · · ·   cov(X2, Xn) )
( · · ·          · · ·         · · ·   · · ·       )
( cov(Xn, X1)    cov(Xn, X2)   · · ·   var Xn      )

Theorem. Consider a random vector X = (X1, . . . , Xn)ᵀ all of whose components have finite variances. Further, consider the transformed random vector Y = BX + c, where B is an m-by-n matrix of real constants and c ∈ Rᵐ is a vector of constants. Then,

var(Y) = var(BX + c) = B(var X)Bᵀ.

Proof. The claim follows from direct computation, using the properties of the expected value:

var(Y) = E((BX + c) − E(BX + c))((BX + c) − E(BX + c))ᵀ = E(B(X − E X))(B(X − E X))ᵀ = B E(X − E X)(X − E X)ᵀ Bᵀ = B(var X)Bᵀ. □

The constant part of the transformation has no impact, while with respect to the linear part of the transformation, the covariance matrix behaves as the matrix of a quadratic form.

10.2.36. Moments and moment function. The expected value and variance reflect the square of the deviation of the values of a random variable from the average. In descriptive statistics, one also examines the skewness of the data, and it is natural to examine the variability of random variables in terms of higher powers of the given random variable X. The characteristic E(Xᵏ) is called the k-th moment; the characteristic µₖ = E((X − E X)ᵏ) is called the k-th central moment of a random variable X. What also comes in handy is the k-th absolute moment, given by E |X|ᵏ. From the definition it follows that for a continuous random variable X,

E Xᵏ = ∫_{−∞}^{∞} xᵏ fX(x) dx.

Thus, the wanted probability is

P = 1 − P[1755 ≤ X ≤ 1845] = 1 − P[−1.5 ≤ (X − 1800)/30 ≤ 1.5] = 2Φ(−1.5) ≈ 0.1336,

where the last value was learned from the table. □

10.I.8. The probability that a newborn baby is a boy is 0.515. Find the probability that there are at least the same number of girls as boys among ten thousand babies.

Solution. The number X of boys among ten thousand babies has binomial distribution with expected value 5150 and variance 5150 · 0.485, and the standardized variable (X − 5150)/√(5150 · 0.485) is approximately N(0, 1). Hence

P[X < 5000] = P[(X − 5150)/√(5150 · 0.485) < −150/√(5150 · 0.485)] ≈ Φ(−3.001) ≈ 0.00135. □

10.I.9. Using the distribution function of the standard normal distribution, find the probability that we get at least 3100 sixes out of 18000 rolls of a six-sided die.

Solution. We proceed similarly as in the exercises above. X has binomial distribution Bi(18000, 1/6). We find the expected value ((1/6) · 18000 = 3000) as well as the standard deviation √((1/6)(1 − 1/6) · 18000) = 50. Therefore, the distribution function of the variable (X − 3000)/50 can be approximated by the distribution function Φ of the standard normal distribution:

P[X ≥ 3100] = P[(X − 3000)/50 ≥ (3100 − 3000)/50] = P[(X − 3000)/50 ≥ 2] ≈ 1 − Φ(2) ≈ 0.0228. □

10.I.10. A public opinion agency organizes a survey of preferences of five political parties. How many randomly selected respondents must answer so that the probability that for each party, the survey result differs from the actual preference by no more than 2% is at least 0.95?

Solution. Let pᵢ, i = 1, . . . , 5, be the actual relative frequency of voters of the i-th political party in the population, and let Xᵢ denote the number of voters of this party among n randomly chosen people. Note that given any five intervals, the events corresponding to Xᵢ/n falling into the corresponding interval may be dependent.
If we choose n so that for each i, Xi/n falls into the given interval with probability at least 1 − ((1 − 0.95)/5) = 0.99, then the desired condition is sure to hold even in spite of the dependencies. Thus, let us look for n such that P[|X n − p| < 0.02] ≥ 0.99. First of all, we CHAPTER 10. STATISTICS AND PROBABILITY THEORY Similarly, for a discrete random variable X whose probability is concentrated into points xi, E Xk = ∑ i xi k fX(xi). The next theorem shows that all the moments completely describe the distribution of the random variable, as a rule. For the sake of computations, it is advantageous to work with a power series in which the moments appear in the coefficients. Since the coefficients of the Taylor series of a function at a given point can be obtained using differentiation, it is easy to guess the right choice of such a function: Moment generating function Given a random variable X, consider the function MX(t) : R → R defined by MX(t) = E etX = {∑ i etxi fX(xi) if X is discrete ∫ ∞ −∞ etx fX(x) dx if X is continuous. If this expected value exists, the moment generating function of the random variable X can be discussed. It is clear that this function MX(t) is always analytic in the case of discrete random variables with finitely many values xi. Theorem. Let X be a random variable such that its analytic moment generating function on an interval (−a, a) exists. Then, MX(t) is given on this interval by the absolutely convergent series MX(t) = ∞∑ k=0 tk k! E Xk . If two random variables X and Y share their moment generating functions over a nontrivial interval (−a, a), then their distribution functions coincide. Proof. The verification of the first statement is a simple exercise on the techniques of differential and integral calculus. In the case of discrete variables, there are either finite sums or absolutely and uniformly converging series. In the case of continuous variables, there are absolutely converging integrals. Thus, the limit process and the differentiation can be interchanged. Since d dt etx = x etx , it is immediate that dk dtk MX(t) = E Xk , as expected. The second claim is obvious for two discrete variables X and Y with only a finite number of values x1, . . . , xk for which either fX(xi) ̸= 0 or fY (xi) ̸= 0. Indeed, the functions etxi are linearly independent functions and thus their coefficients in the common moment function M(t) = etx1 f(xi) + · · · + etxk f(xk) must be the shared probability function values for both random variables X and Y . 949 rearrange the expression: P [ X n − p < 0.02 ] = P [ −0.02 < X n − p < 0.02 ] = = P [−0.02 · n < X − pn < 0.02 · n] = = P [ −0.02 · n √ np(1 − p) < X − pn √ np(1 − p) < 0.02 · n √ np(1 − p) ] = = Φ ( 0.02 · n √ np(1 − p) ) − Φ ( − 0.02 · n √ np(1 − p) ) = = 2Φ ( 0.02 · n √ np(1 − p) ) − 1, where Φ is the distribution function of the normal distribution. Thus, let us solve the inequality 2Φ ( 0.02 · n √ np(1 − p) ) − 1 ≥ 0.99 Φ ( 0.02 · n √ np(1 − p) ) ≥ 0.995 Since the distribution function is increasing, the last condition is equivalent to 0.02 · n √ np(1 − p) ≥ Φ−1 (0.995) 0.02 · n √ np(1 − p) ≥ 2.576 √ n ≥ 50 · 2.576 · √ p(1 − p) ≤ 1 2 =⇒ =⇒ n ≥ (25 · 2.276)2 · 4147 Here, we used the fact that the maximum of the function p(1 − p) is 1 4 , and it is reached at p = 1 2 . We can see that if e. g. p . = 0.1, then √ p(1 − p) = 0.3 and the value of the least n is lower. 
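The final bound, n ≥ (25 · 2.576)² ≈ 4148 in the worst case p = 1/2, and the effect of a smaller p, can be computed directly; a sketch (Python with numpy/scipy):

import numpy as np
from scipy.stats import norm

z = norm.ppf(0.995)                  # approx. 2.576
n_worst = (z / (2 * 0.02))**2        # worst case p = 1/2: (25 z)^2
print(int(np.ceil(n_worst)))         # 4148

p = 0.1                              # a less popular party
n_p = (z * np.sqrt(p * (1 - p)) / 0.02)**2
print(int(np.ceil(n_p)))             # approx. 1493, considerably fewer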
This accords with our expectations: for less popular parties, it suffices to have fewer respondents (if the agency estimates the gain of such party to be around 2 % without asking anybody, then the wanted precision is almost guaranteed). □ 10.I.11. Two-choice test. Consider random vectors Y1 and Y2 all of whose components are pairwise independent random variables with normal distribution, and suppose that the components of vector Yi have expected value µi, and the variance σ is the same for all the components of both vectors. Use the general linear model to test the hypothesis whether µ1 = µ2. Solution. We will proceed quite similarly as in subsection 10.3.12 of the theoretical part. This time, we can write both CHAPTER 10. STATISTICS AND PROBABILITY THEORY In the case of continuous variables X and Y sharing their generating function M(t), the argument is more involved and an indication only is provided. Notice that M(t) is analytic and thus it is defined for all complex numbers t, |t| < a. In particular, M(it) = ∫ ∞ −∞ eitx f(x)dx, which is the inverse Fourier transform of f(x), up to the constant multiple √ 2π, see 7.2.5 (on page 657). If this works for all t, then clearly f is obtained by the Fourier transform of ( √ 2π)−1 M(it) and thus must be the same for both X and Y . Further details, in particular covering general random variables, would need much more input from measure theory and Fourier analysis, and thus it is not provided here. □ It can be also shown that the assumptions of the theorem are true whenever both MX(−a) < ∞ and MX(a) < ∞. 10.2.37. Properties of moment function. By the properties of the exponential functions, it is easy to compute the behaviour of the moment function under affine transformations and sums of independent random variables. Proposition. Let a, b ∈ R and X, Y be independent random variables with moment generating functions MX(t) and MY (t), respectively. Then, the moment generating functions of the random variables V = a + bX and W = X + Y are Ma+bX(t) = eat MX(bt) MX+Y (t) = MX(t)MY (t) Proof. The first formula can be computed directly from the definition: MV (t) = E e(a+bX)t = E eat e(bt)X = eat MX(bt). As for the second formula, recall that etX and etY are independent variables. Use the fact that the expected value of the product of independent random variables equals the product of the expected values. MW (t) = E et(X+Y ) = E etX etY = E etX E etY = MX(t)MY (t). □ 10.2.38. Normal and binomial distributions. As an illustrating example, compute the moment function of two random variables X ∼ N(µ, σ) and X ∼ Bi(n, p). Moment generating function for N(µ, σ) Proposition. If X ∼ N(µ, σ), then MX(t) = eµt e σ2t2 2 . In particular, it is an analytic function on all of R. 950 vectors Yi into one column, and we consider the model           Y11 ... Y1n1 Y21 ... Y2n2           =           1 0 ... ... 1 0 1 1 ... ... 1 1           ( β1 β2 ) + σZ. We will work with arithmetic means of the individual vectors ¯Y1 and ¯Y2. Direct application of the general formula from theory gives the estimate b in the form ( b1 b2 ) = ( n1 + n2 n2 n2 n2 )−1 ( n1 ¯Y1 + n2 ¯Y2 n2 ¯Y2 ) = = 1 n1n2 ( n2 −n2 −n2 n1 + n2 ) ( n1 ¯Y1 + n2 ¯Y2 n2 ¯Y2 ) = ( ¯Y1 ¯Y2 − ¯Y1 ) and for the matrix C = (XT X)−1 , where X is the 2-column matrix with zeros and ones from our model, we have C = ( 1 n1 − 1 n1 − 1 n1 1 n1 + 1 n2 ) . Thus, we test the hypothesis µ1 = µ2, which means that we test whether β2 = 0. 
For this, it is suitable to use the statistic
$$T = \frac{\bar Y_2 - \bar Y_1}{S}\Bigl(\frac{n_1 n_2}{n_1+n_2}\Bigr)^{\!1/2},$$
where the standard deviation $S$ is substituted as
$$S^2 = \frac{1}{n_1+n_2-2}\Bigl(\sum_{i=1}^{n_1}(Y_{1i}-\bar Y_1)^2 + \sum_{i=1}^{n_2}(Y_{2i}-\bar Y_2)^2\Bigr).$$
The distribution of this statistic is $t_{n_1+n_2-2}$, so the null hypothesis $\mu_1 = \mu_2$ is rejected at level $\alpha$ if $|T| \ge t_{n_1+n_2-2}(\alpha)$. □

10.I.12. In JZD¹ Tempo, the milk yield of their cows was measured during five days, the results being 15, 14, 13, 16 and 17 hectoliters. In JZD Boj, which had the same number of cows, the same measurement was performed during seven days, the results being 12, 16, 13, 15, 13, 11, 18 hectoliters.
a) Find the 95% confidence interval for the milk yield of JZD Boj's cows, and the 95% confidence interval for the milk yield of JZD Tempo's cows.
b) On the 5% level, test the hypothesis that both farms have cows of the same quality.
Suppose that the milk yield of the cows in each day is given by the normal distribution. Solve these problems assuming that there are no data from previous measurements, and then assuming that the previous measurements showed that the standard deviation was $\sigma = 2$ hl.

¹JZD (jednotné zemědělské družstvo), an agricultural cooperative farm, created by forced collectivization in the 1950s in Czechoslovakia.

Proof. Suppose $Z \sim N(0,1)$. Then
$$M_Z(t) = \int_{-\infty}^{\infty} e^{tx}\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-\frac12(x^2-2tx+t^2-t^2)}\,dx = e^{t^2/2}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-(x-t)^2/2}\,dx = e^{t^2/2},$$
where use is made of the fact that in the last-but-one expression, for every fixed $t$, the density of a continuous random variable is integrated; hence this integral equals one. Substituting into the formula for the moment generating function of $\mu + \sigma Z$ (see proposition 10.2.37), we obtain for $X \sim N(\mu,\sigma^2)$ that
$$M_X(t) = e^{\mu t}\, e^{\frac{\sigma^2 t^2}{2}},$$
again a function analytic on the entire $\mathbb{R}$. □

In particular, the moments of $Z$ of all orders exist. Substitute $\frac12 t^2$ into the power series for the exponential function, and calculate them all:
$$M_Z(t) = \sum_{k=0}^{\infty}\frac{1}{k!}\Bigl(\frac{t^2}{2}\Bigr)^{k} = \sum_{k=0}^{\infty}\frac{1}{k!\,2^k}\,t^{2k} = 1 + 0\,t + \frac12\,t^2 + 0\,t^3 + \frac{3}{4!}\,t^4 + \cdots$$
In particular, the expected value of $Z$ is $\operatorname{E}Z = 0$, and its variance is $\operatorname{var}Z = \operatorname{E}Z^2 - (\operatorname{E}Z)^2 = 1$. Further, all moments of odd orders vanish, $\operatorname{E}Z^4 = 3$, etc. Hence the sum of independent normal variables $X \sim N(\mu,\sigma^2)$ and $Y \sim N(\mu',\sigma'^2)$ has again the normal distribution, $X + Y \sim N(\mu+\mu',\ \sigma^2+\sigma'^2)$; note that it is the variances which add up, since $M_X(t)M_Y(t) = e^{(\mu+\mu')t}\,e^{\frac{(\sigma^2+\sigma'^2)t^2}{2}}$.

Similarly, considering the discrete random variable $X \sim \operatorname{Bi}(n,p)$,
$$M_X(t) = \operatorname{E}e^{tX} = \sum_{k=0}^{n}\binom{n}{k}(p\,e^t)^k(1-p)^{n-k} = \bigl(p\,e^t + (1-p)\bigr)^n = \bigl(p(e^t-1)+1\bigr)^n = 1 + npt + \tfrac12\bigl(n(n-1)p^2 + np\bigr)t^2 + \cdots$$
is computed. Of course, the same can be computed even more easily using proposition 10.2.37, since $X$ is the sum of $n$ independent variables $Y \sim \operatorname{A}(p)$ with the Bernoulli distribution. Therefore, $\operatorname{E}e^{tX} = (\operatorname{E}e^{tY})^n = (p\,e^t + (1-p))^n$. All the moments of the variable $Y$ equal $p$; therefore $\operatorname{E}Y = p$, while $\operatorname{var}Y = p(1-p)$. From the moment generating function $M_X(t)$, $\operatorname{E}X = np$ and $\operatorname{var}X = \operatorname{E}X^2 - (\operatorname{E}X)^2 = np(1-p)$.

Solution. First of all, let us compute the results for the known variance. In order to find the confidence interval, we use the statistic
$$U = \frac{\bar X - \mu}{\sigma/\sqrt n},$$
which has the standardized normal distribution (see 10.2.21). Then, the confidence interval is (see 10.3.4)
$$\Bigl(\bar X - \frac{\sigma}{\sqrt n}\,z(\alpha/2),\ \bar X + \frac{\sigma}{\sqrt n}\,z(\alpha/2)\Bigr),$$
where $\alpha = 0.05$. Now, it suffices to substitute the specific values. For JZD Tempo, we thus get the sample mean
$$\bar X_1 = \frac{15+14+13+16+17}{5} = 15,$$
and using appropriate software, we can learn that $z(0.025) = 1.96$, which gives the interval
$$\Bigl(15 - \frac{2}{\sqrt 5}\,1.96,\ 15 + \frac{2}{\sqrt 5}\,1.96\Bigr) = (13.25;\ 16.75).$$
For JZD Boj, we get
$$\bar X_2 = \frac{12+16+13+15+13+11+18}{7} = 14,$$
so the 95% confidence interval for the milk yield of their cows is $(12.52;\ 15.48)$.

If the variance of the measurements is not known, we use the so-called sample variance for the estimate. In order to find the confidence interval, we use the statistic
$$T = \sqrt n\,\frac{\bar X - \mu}{S},$$
which has Student's distribution with $n-1$ degrees of freedom (see also 10.3.4). Then, we can analogously obtain the 95% confidence interval
$$\Bigl(\bar X - \frac{S}{\sqrt n}\,t_{n-1}(\alpha/2),\ \bar X + \frac{S}{\sqrt n}\,t_{n-1}(\alpha/2)\Bigr).$$
For the values of JZD Tempo, we get the sample variance
$$S_1^2 = \frac{0^2 + (-1)^2 + (-2)^2 + 1^2 + 2^2}{4} = 2.5,$$
i.e., $S_1 \doteq 1.58$. Further, we have $t_4(0.025) \doteq 2.78$, so the 95% confidence interval for JZD Tempo is $(13.03;\ 16.97)$. For JZD Boj, we get the sample variance $S_2^2 = 6$, so the wanted confidence interval is $(11.73;\ 16.27)$.

b) If we compare the expected values of milk yield in both farms, then this is a comparison of the expected values of two independent samples from the normal distribution. In the case of unknown variances, we further assume that the variance is the same for both farms.

10.2.39. Skewness and kurtosis. Since the third central moment is given in terms of third powers of deviations from the expected value, it expresses to a certain extent the symmetry of the distribution of the random variable around the expected value. In descriptive statistics, we describe this by the coefficient of skewness. For random variables, we similarly use the characteristic
$$\gamma_1 = \frac{\operatorname{E}(X - \operatorname{E}X)^3}{(\sqrt{\operatorname{var}X})^3},$$
which is called the coefficient of skewness of a random variable $X$. Another commonly used characteristic is the kurtosis of a random variable $X$, defined as
$$\gamma_2 = \frac{\operatorname{E}(X-\operatorname{E}X)^4}{(\operatorname{var}X)^2} - 3.$$
The standard normal distribution has third central moment equal to zero and the fourth one equal to 3. Thus, the kurtosis is standardized so that its value for the standard normal distribution is zero. For a general distribution, the kurtosis provides a comparison with the normal distribution. In practice, there are other standardizations of skewness coefficients and kurtosis.

10.2.40. Law of large numbers. Now, we can consider the key tools which connect probability and statistics. We start with the generalization of Bernoulli's theorem about the binomial distribution, discussed at the end of subsection 10.2.32. The random variables $\frac1n X_n$, where $X_n \sim \operatorname{Bi}(n,p)$, can be viewed as the arithmetic means of $n$ independent variables with distribution $\operatorname{A}(p)$, and Bernoulli's theorem then says that these means converge to $p$ in probability. Such a proposition holds in general. Independence of the variables is not needed; the mere fact that $\operatorname{cov}(X_i, X_j) = 0$ guarantees that the variances sum up.

The law of large numbers

Proposition. Consider a sequence of pairwise uncorrelated random variables $X_1, X_2, \ldots$ which have the same finite expected value $\operatorname{E}X_i = \mu$. Moreover, assume the variances are bounded, so that $\operatorname{var}X_i \le C$ for a fixed constant $C$. Then for any $\varepsilon > 0$,
$$\lim_{n\to\infty} P\Bigl(\Bigl|\frac1n\sum_{i=1}^n X_i - \mu\Bigr| < \varepsilon\Bigr) = 1.$$

Proof. By Chebyshev's inequality, just as at the end of subsection 10.2.32,
$$P\Bigl(\Bigl|\frac1n\sum_{i=1}^n X_i - \mu\Bigr| \ge \varepsilon\Bigr) \le \frac{\operatorname{var}\bigl(\frac1n\sum_{i=1}^n X_i - \mu\bigr)}{\varepsilon^2} = \frac{\frac{1}{n^2}\sum_{i=1}^n \operatorname{var}X_i}{\varepsilon^2} \le \frac{C}{n\varepsilon^2}.$$

Thus, let us examine the hypothesis assuming the known variances $\sigma_1^2 = \sigma_2^2 = 4$.
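Before carrying out the test, the intervals of part a) can be reproduced with a short script (a sketch of ours, assuming scipy; the helper name ci is ours, not the book's):

```python
# Confidence intervals for a normal mean: z-interval (known sigma)
# and t-interval (sample variance), as in part a) above.
import numpy as np
from scipy.stats import norm, t

def ci(data, alpha=0.05, sigma=None):
    x = np.asarray(data, dtype=float)
    n, xbar = len(x), x.mean()
    if sigma is not None:                     # known variance: z-interval
        half = sigma / np.sqrt(n) * norm.ppf(1 - alpha / 2)
    else:                                     # unknown variance: t-interval
        half = x.std(ddof=1) / np.sqrt(n) * t.ppf(1 - alpha / 2, df=n - 1)
    return xbar - half, xbar + half

tempo = [15, 14, 13, 16, 17]
boj = [12, 16, 13, 15, 13, 11, 18]
print(ci(tempo, sigma=2), ci(boj, sigma=2))   # matches the intervals above
print(ci(tempo), ci(boj))                     # (up to rounding)
```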
We use the statistic
$$U = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{\bar X_1 - \bar X_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0,1),$$
where $\mu_1$ and $\mu_2$ are the unknown expected values of milk yield in the examined farms, and $n_1, n_2$ are the numbers of measurements. This statistic has, as indicated, the standardized normal distribution. We reject the hypothesis at the 5% level if and only if the absolute value of the statistic $U$ is greater than $z(0.025)$, i.e., if and only if 0 does not lie in the 95% confidence interval for the difference of the expected values of milk yield in both farms. For the specific values, we get
$$U = \frac{15-14}{\sqrt{\frac45 + \frac47}} \doteq 0.854.$$
Thus, we have $|U| < z(0.025) = 1.96$, so the hypothesis that the expected values of milk yield are the same in both farms is not rejected at the 5% level. The reached p-value of the test (see 10.3.9) is 39.4%, so we did not get much closer to rejecting the hypothesis (the probability that the value of the examined statistic is less than 0.854, provided the null hypothesis holds, is 60.6%).

If we do not know the variances of the measurements but we know that they must be equal in both farms, we use the statistic
$$K = \frac{(\bar X_1 - \bar X_2) - (\mu_1-\mu_2)}{S_*\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} = \frac{\bar X_1 - \bar X_2}{S_*\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim t_{n_1+n_2-2},$$
where
$$S_*^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}.$$
For the specific values, we get $K \doteq 0.796$ and $|K| < t_{10}(0.025) = 2.2281$, so again, the null hypothesis is not rejected. The reached p-value of the test is 44.6%, which is even greater than in the above test. □

10.I.13. Analysis of variance with one-way classification. For $k \ge 2$ independent samples $Y_i$ of size $n_i$ from normal distributions with equal variance, use a linear model to test the hypothesis that all the expected values of the individual samples are equal.

Solution. The technique is quite similar to that of the above exercise. The hypothesis to be tested is equivalent to stating that a submodel holds in which all the components of the random vector $Y$, created by joining the given $k$ vectors $Y_i$, have the same expected value.

Thus, the probability $P$ is bounded from below by
$$P\Bigl(\Bigl|\frac1n\sum_{i=1}^n X_i - \mu\Bigr| < \varepsilon\Bigr) \ge 1 - \frac{C}{n\varepsilon^2},$$
which proves the proposition. □

Thus, existence and uniform boundedness of the variances suffices for the means of pairwise uncorrelated variables $X_i$ with zero expected value to converge (in probability) to zero.

10.2.41. Central limit theorem. The next goal is more ambitious. In addition to the law of large numbers, the stochastic properties of the fluctuation of the means $\bar X_n = \frac1n\sum_{i=1}^n X_i$ around the expected value $\mu$ need to be understood. We focus first on the simplest case of sequences of independent and identically distributed random variables $X_i$; a more general version of the theorem is formulated afterwards, with comments on the proofs only.

Consider a sequence of normalized random variables $X_i$: assume $\operatorname{E}X_i = 0$ and $\operatorname{var}X_i = 1$. Assume further that the moment generating function $M_X(t)$ exists and is shared by all the variables $X_i$. The arithmetic means $\frac1n\sum_{i=1}^n X_i$ are, of course, random variables with zero expected value, yet their variances are $\frac{n}{n^2} = \frac1n$. Thus, it is reasonable to renormalize them to
$$S_n = \frac{1}{\sqrt n}\sum_{i=1}^n X_i,$$
which are again standardized random variables. Their moment generating functions are (see proposition 10.2.37)
$$M_{S_n}(t) = \operatorname{E}\, e^{\frac{t}{\sqrt n}\sum_i X_i} = \Bigl(M_X\bigl(\tfrac{t}{\sqrt n}\bigr)\Bigr)^{n}.$$
Since it is assumed that the variables Xi are standardized, MX( t √ n ) = 1 + 0 t √ n + 1 t2 2n + o (t2 n ) , where again o(G(n)) is written for expressions which, when divided by G(n), approach zero as n → ∞, see subsection 6.1.12. Thus, in the limit, lim n→∞ MSn (t) = lim n→∞ ( 1 + t2 2n + o ( 1 n ) )n = e t2 2 . This is just the moment generating function of the normal distribution Z ∼ N(0, 1), see the end of subsection 10.2.35. Thus, the standardized variables Sn asymptotically have the standard normal distribution. We have thus proved a special version of the following fundamental theorem. Although the calculation is merely a manipulation of moment generating functions, many special cases were proved in different ways, providing explicit estimates for the speed of convergence, which of course is useful information in practice. Notice that the following theorem does not require the probability distributions of the variables Xi to coincide! 953 Thus, the used model is of the form                Y11 ... Y1n1 Y21 ... Yk1 ... Yknk                =               1 0 · · · 0 ... ... ... 1 0 · · · 0 0 1 · · · 0 ... ... ... 0 0 · · · 1 ... ... ... 0 0 · · · 1                   µ1 µ2 ... µk     + σZ. We can easily compute estimates for the expected values µi using arithmetic means: ¯Yi = 1 ni ni∑ j=1 Yij. Hence we get the estimate ˆYij = ¯Yi, so the residual sum of squares is of the form RSS = k∑ i=1 ni∑ j=1 (Yij − ¯Yi)2 . The estimate of the common expected value in the considered submodel is ¯Y = 1 n k∑ i=1 ni∑ j=1 Yij = 1 n k∑ i=1 ni ¯Yi, where n = n1 + · · · + nk, and the residual sum of squares in this submodel is RSS0 = k∑ i=1 ni∑ j=1 (Yij − ¯Y )2 . In the original model, there are k independent parameters µi, while in the submodel, there is a single parameter µ, so the tested statistic is of the form F = (n − k) (k − 1) (RSS0 − RSS) RSS . □ J. Linear regression We already met the linear regression in chapter three, subsection ??. Now, we will try to apply the same principle to problems which are often studied by statisticians. One standard application of the linear regression is “laying a line” through given data. Thus, we have a sequence of measurements for which we record the values of two variables between which we anticipate linear dependency. A classical example is the dependency of a son’s height on his father’s height. CHAPTER 10. STATISTICS AND PROBABILITY THEORY Central limit theorem Theorem. Consider a sequence of independent random variables Xi which have the same expected value E Xi = µ, variance var Xi = σ2 > 0 and uniformly bounded third absolute moment E |Xi|3 < C. Then, the distribution of the random variable Sn = 1 √ n n∑ i=1 ( Xi − µ σ ) satisfies lim n→∞ P(Sn < x) = Φ(x), where Φ is the distribution function of the standard normal distribution. Note that the central limit theorem gives a result on asymptotic behaviour which says that the distribution functions of certain variables approach the standard normal distribution. Such behaviour is called convergence in distribution. This type of convergence is weaker than convergence in probabil- ity. The assumption that all Xi are independent and identically distributed was not fully exploited in the argumentation above. Only the knowledge of E Xi = 0 and var Xi = 1 was used. The assumption of the uniformly bounded third absolute moments of Xi can be used to prove the existence of the moment generating functions. 
The estimate on $\operatorname{E}|X_i|^3$ can then be used to complete the proof exactly as above. There are many more general results. We mention at least Lyapunov's central limit theorem, formulated as follows: Consider a sequence of random variables $X_i$ with finite expected values $\mu_i$ and variances $\sigma_i^2$. Write
$$s_n^2 = \sum_{i=1}^n \sigma_i^2$$
and assume that for some $\delta > 0$,
$$\lim_{n\to\infty}\frac{1}{s_n^{2+\delta}}\sum_{i=1}^n \operatorname{E}|X_i - \mu_i|^{2+\delta} = 0.$$
Then $\frac{1}{s_n}\sum_{i=1}^n (X_i - \mu_i)$ converges in distribution to $Z \sim N(0,1)$.

The previous version of the central limit theorem is derived by choosing $\delta = 1$. Then $s_n = \sigma\sqrt n$, and the condition of Lyapunov's theorem reads
$$0 \le \lim_{n\to\infty} n^{-3/2}\sigma^{-3}\sum_{i=1}^n \operatorname{E}|X_i - \mu_i|^3 \le C\sigma^{-3}\lim_{n\to\infty} n^{-3/2+1} = 0.$$

10.2.42. De Moivre-Laplace theorem. Historically, the first formulated special case of the central limit theorem was that of variables $Y_n$ with binomial distribution $\operatorname{Bi}(n,p)$. They can be viewed as the sum of $n$ independent variables $X_i$ with Bernoulli distribution $\operatorname{A}(p)$, $0 < p < 1$. These variables have moment generating functions, and $\operatorname{E}|X_i|^3 = p < 1$.

10.J.1. Find the linear regression model for the dependence of $Y$ on $X$, based on the following lists of measured data: $X = [1, 4, 5, 7, 10]$, $Y = [3, 7, 8, 12, 18]$.

Solution. In order to find the parameters of the regression line, use the formulas derived in 10.3.12. Using the method of least squares, we try to minimize the distance of the vector $b_1X + b_0$ from the vector $Y$ with respect to the parameters $b_1$ and $b_0$. This distance, as we know from chapter two, is minimal for the orthogonal projection of the vector $Y$ onto the vector subspace generated by the vectors $(1,\ldots,1)$ and $(x_1,\ldots,x_n)$. For the parameters $b_0, b_1$ of the regression line $Y = b_1X + b_0$, we obtain
$$b_1 = \frac{\sum_{i=1}^n (x_i-\bar x)(Y_i - \bar Y)}{\sum_{i=1}^n (x_i-\bar x)^2} = \frac{(1-5.4)(3-9.6) + \cdots + (10-5.4)(18-9.6)}{(1-5.4)^2 + (4-5.4)^2 + (5-5.4)^2 + (7-5.4)^2 + (10-5.4)^2} = 1.677.$$
Now, we can easily calculate the coefficient $b_0$: $b_0 = \bar Y - b_1\bar x = 0.5442$. Therefore, the wanted linear dependency is $Y = 1.677\cdot X + 0.5442$. Note that the method treats the roles of the variables $X$ and $Y$ symmetrically: in the same way, we could have obtained the dependency of $X$ on $Y$, namely $X = 0.5867\cdot Y - 0.2322$. □

Remark. Think about why the linear regression model of the dependency of $X$ on $Y$ cannot be obtained by merely expressing $X$ from the linear regression model of the dependency of $Y$ on $X$.

Remark. In many real situations, the dependency of the variables is clearly given, for example if one of the variables is time.

10.J.2. An orbital station has measured, at the same instant of five consecutive days, the following velocities of an unknown cosmic object (in km/s): 10, 11.4, 13.1, 15.8, and 18.7. Estimate the object's velocity on the tenth day.

Solution. Here, it is good to notice that the velocity does not change linearly with time (the acceleration is increasing). Thus, we can hypothesize that the object is being attracted to another one by gravitational force; then its velocity would be approximately a quadratic function of time. So let us use the method of least squares to lay a quadratic function (as precise as possible) through the measured data. The procedure is the same as if we made the linear regression of the vector $v = (v_1, \ldots, v_n)$ dependent on $x = (x_1,\ldots,x_n)$ and $x^2 = (x_1^2,\ldots,x_n^2)$. This method is called quadratic regression. Thus, we are looking for a vector of parameters $b = (b_0, b_1, b_2)$ such that the variable $b_2x^2 + b_1x + b_0$ estimates $y$ as well as possible; see also the sketch below.
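The fit can be checked numerically (a sketch of ours, assuming numpy; the book's own computation via the matrix of values follows):

```python
# Quadratic least squares fit of the measured velocities.
import numpy as np

days = np.arange(1.0, 6.0)                  # days 1..5
v = np.array([10, 11.4, 13.1, 15.8, 18.7])

X = np.column_stack([np.ones_like(days), days, days**2])   # columns 1, x, x^2
b, *_ = np.linalg.lstsq(X, v, rcond=None)
print(b)                   # ≈ (9.26, 0.47, 0.29)
print(b @ [1, 10, 100])    # day 10: ≈ 42.49 with unrounded coefficients
```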
Thus, the central limit theorem says in this case that the random variables
$$S_n = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{X_i - p}{\sqrt{p(1-p)}} = \frac{X - np}{\sqrt{np(1-p)}}$$
behave asymptotically as the standard normal distribution. This can be formulated as follows: the random variable $X \sim \operatorname{Bi}(n,p)$ behaves as a random variable with normal distribution $N(np,\ np(1-p))$ as $n$ increases. This behaviour is demonstrated exactly in the illustration at the end of 10.2.21. In practice, the approximation of the binomial distribution by the normal distribution is usually considered appropriate if $np(1-p) > 9$.

We illustrate the result with a concrete example. Suppose it is desired to know what percentage of students like a given course, with an error of at most 5%. The number of people who like the course among $n$ randomly chosen people should behave as the random variable $X \sim \operatorname{Bi}(n,p)$. Further, suppose the result is desired to be correct with confidence (i.e., probability again) of at least 90%. Thus,
$$P\Bigl(\Bigl|\frac1n X - p\Bigr| < 0.05\Bigr) \simeq 0.9$$
is desired, by choosing a high enough number $n$ of students to ask. Approximate
$$0.9 \simeq P\Bigl(\Bigl|\frac1n X - p\Bigr| < 0.05\Bigr) = P\Bigl(-\frac{0.05\,n}{\sqrt{np(1-p)}} < \frac{X-np}{\sqrt{np(1-p)}} < \frac{0.05\,n}{\sqrt{np(1-p)}}\Bigr) \simeq \Phi\Bigl(\frac{0.05\,n}{\sqrt{np(1-p)}}\Bigr) - \Phi\Bigl(-\frac{0.05\,n}{\sqrt{np(1-p)}}\Bigr) = 2\,\Phi\Bigl(\frac{0.05\,n}{\sqrt{np(1-p)}}\Bigr) - 1,$$
where the symmetry of the density function of the normal distribution is exploited. Thus,
$$\Phi\Bigl(\frac{0.05\,n}{\sqrt{np(1-p)}}\Bigr) \simeq \frac12(1+0.9) = 0.95$$
is wanted. This leads to the choice (recall the definition of the critical values $z(\alpha)$ for a variable $Z$ with standard normal distribution in subsection 10.2.30)
$$\frac{0.05\,n}{\sqrt{np(1-p)}} \simeq z(0.05) = 1.64485.$$
Since $p(1-p)$ is at most $\frac14$, the necessary number of students can be bounded by $n > 270$, independently of $p$.

Let us build the matrix $X$ of the values of the independent variables:
$$X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \end{pmatrix},$$
and the vector of parameters $b = (b_0, b_1, b_2)$ can be computed by (1):
$$b = (X^TX)^{-1}X^Tv \doteq (9.26;\ 0.47;\ 0.29).$$
Then, the wanted quadratic estimate is $v = 0.29x^2 + 0.47x + 9.26$, so the estimated velocity on the tenth day is approximately 42.96 km/s (about 42.5 km/s when the unrounded coefficients are used). In the model of classic linear regression, we would get $v = 2.18x + 7.26$, which yields 29.06 km/s for the tenth day. The difference between these estimates is quite large. This illustrates that analysis of the situation is a very important part of statistics. □

K. Bayesian data analysis

10.K.1. Consider the Bernoulli process defined by a random variable $X \sim \operatorname{Bi}(n,\theta)$ with binomial distribution, and assume that the parameter $\theta$ is a random variable with uniform distribution on the interval $(0,1)$. We define the success chance in our process as the variable $\gamma = \frac{\theta}{1-\theta}$. What is the density of this variable $\gamma$?

Solution. Intuitively, we can feel that the distribution is not uniform. Denoting the wanted probability density by $f(s)$, we can use the relation between $\theta$ and $\gamma$ to compute $\theta = \frac{\gamma}{1+\gamma}$. In addition, we can immediately see that the probability density of $\gamma$ is non-zero only for positive values of the variable. Now, we can formulate the statement as the requirement
$$(1)\qquad \Theta = P(\theta < \Theta) = P(\gamma < \Gamma) = \int_0^\Gamma f(s)\,ds, \qquad\text{where } \Gamma = \frac{\Theta}{1-\Theta}.$$
Since $\Theta = \frac{\Gamma}{1+\Gamma}$, differentiation with respect to the variable upper bound $\Gamma$ gives the defining formula for $f(s)$:
$$f(s) = \Bigl(\frac{s}{s+1}\Bigr)' = \frac{1}{(s+1)^2}.$$
Indeed, the wanted density gives much higher probability to low values of the chance than to high ones. □
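The density just derived can be corroborated by simulation (our sketch, assuming numpy):

```python
# Monte Carlo check of f(s) = 1/(1+s)^2 for gamma = theta/(1-theta),
# with theta uniform on (0, 1).
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(size=1_000_000)
gamma = theta / (1 - theta)

print((gamma < 1).mean())              # P(gamma < 1) = P(theta < 1/2) = 0.5
counts, edges = np.histogram(gamma, bins=30, range=(0, 3))
emp = counts / (len(gamma) * (edges[1] - edges[0]))   # empirical density
mid = (edges[:-1] + edges[1:]) / 2
print(np.abs(emp - 1 / (1 + mid) ** 2).max())         # close to zero
```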
We could see in subsection 10.3.7 that when taking the Bayesian approach with the binomial model of the probability distribution of a random variable $X \sim \operatorname{Bi}(n,\theta)$, we are interested in its probability mass function $f_X(k) = \binom{n}{k}\theta^k(1-\theta)^{n-k}$. Viewed as a function of $\theta$, this is (up to a constant multiple) the conditional probability density of $\theta$ given $X = k$, for the uniform a priori probability distribution of the variable $\theta$ on the interval $(0,1)$. Thus, it is just the a posteriori probability distribution of $\theta$ corresponding to the result $X = k$ of the experiment. The following exercise concerns the general class of these probability distributions.

10.2.43. Important distributions. In the sequel, we return to statistics. It should be of no surprise that we work with characteristics of random vectors similar to the sample mean and variance, as well as with relative quotients of such characteristics, etc. We consider several such cases.

Consider a random variable $Z \sim N(0,1)$, and compute the density $f_Y(x)$ of the random variable $Y = Z^2$. Clearly, $f_Y(x) = 0$ for $x \le 0$, while for positive $x$,
$$F_Y(x) = P(Y < x) = P(-\sqrt x < Z < \sqrt x) = \int_{-\sqrt x}^{\sqrt x}\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz = \int_0^x \frac{1}{\sqrt{2\pi}}\,t^{-1/2}e^{-t/2}\,dt.$$
Differentiation leads to
$$f_Y(x) = \frac{d}{dx}F_Y(x) = \frac{1}{\sqrt{2\pi}}\,x^{-1/2}e^{-x/2}.$$
This distribution is called $\chi^2$ with one degree of freedom, written $Y \sim \chi^2$. We work with sums of such independent variables. All fall into a general class of distributions whose densities are of the form $f_X(x) = c\,x^{a-1}e^{-bx}$ for $x > 0$, while $f_X(x) = 0$ for non-positive $x$; the distribution $\chi^2$ corresponds to the choice $a = b = 1/2$. This case is already thoroughly discussed as an example in subsection 10.2.20. Such a function is the density exactly for the constant $c = \frac{b^a}{\Gamma(a)}$. Thus, it is the distribution $\Gamma(a,b)$ with density, for positive $x$,
$$f_X(x) = \frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}.$$
In general, the $k$-th moment of such a variable $X$ is easily computed:
$$\operatorname{E}X^k = \int_0^\infty x^k\,\frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}\,dx = \frac{\Gamma(a+k)}{\Gamma(a)\,b^k}\int_0^\infty \frac{b^{a+k}}{\Gamma(a+k)}\,x^{a+k-1}e^{-bx}\,dx = \frac{\Gamma(a+k)}{\Gamma(a)\,b^k},$$
since the integral of the density of $\Gamma(a+k, b)$ in the last expression must be equal to one. In particular, $\operatorname{E}X = \frac{\Gamma(a+1)}{b\,\Gamma(a)} = \frac{a}{b}$, while
$$\operatorname{var}X = \frac{\Gamma(a+2)}{b^2\,\Gamma(a)} - \frac{a^2}{b^2} = \frac{(a+1)a - a^2}{b^2} = \frac{a}{b^2}.$$
Similarly, the moment generating function can be computed for all values $t$, $-b < t < b$:
$$M_X(t) = \int_0^\infty e^{tx}\,\frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}\,dx = \frac{b^a}{(b-t)^a}\int_0^\infty \frac{(b-t)^a}{\Gamma(a)}\,x^{a-1}e^{-(b-t)x}\,dx = \frac{b^a}{(b-t)^a}.$$

10.K.2. Find the basic characteristics of the so-called beta-distribution $\beta(a,b)$ with probability density of the form
$$f_Y(y) = \begin{cases} C\,y^{a-1}(1-y)^{b-1} & y \in (0,1), \\ 0 & \text{otherwise.}\end{cases}$$

Solution. The constant $C$ must be chosen as the multiplicative inverse of the integral $\int_0^1 y^{a-1}(1-y)^{b-1}\,dy$, which is a function $B(a,b)$, known as the beta-function in mathematical analysis and other sciences (e.g. physics). The gamma-function, which generalizes the discrete values of the factorial, emerges in the following calculation:
$$\Gamma(x)\Gamma(y) = \int_0^\infty e^{-t}t^{x-1}\,dt\cdot\int_0^\infty e^{-s}s^{y-1}\,ds = \int_0^\infty\!\!\int_0^\infty e^{-t-s}\,t^{x-1}s^{y-1}\,dt\,ds$$
(substitution $t = rq$, $s = r(1-q)$)
$$= \int_{r=0}^\infty\int_{q=0}^1 e^{-r}(rq)^{x-1}\bigl(r(1-q)\bigr)^{y-1}\,r\,dq\,dr = \int_0^\infty e^{-r}r^{x+y-1}\,dr\cdot\int_0^1 q^{x-1}(1-q)^{y-1}\,dq = \Gamma(x+y)\,B(x,y).$$
Thus, we get the general formula
$$B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)},$$
and it follows from the properties of the gamma-function that, for positive integers $n, k$,
$$B(n-k+1,\ k+1) = \frac{k!\,(n-k)!}{(n+1)!} = \frac{1}{n+1}\binom{n}{k}^{-1}.$$
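Both formulas are easy to check numerically (our sketch, assuming scipy):

```python
# Check B(a,b) = Gamma(a)Gamma(b)/Gamma(a+b) and the integer identity above.
from math import comb
from scipy.special import beta, gamma

a, b = 2.5, 4.0
print(beta(a, b), gamma(a) * gamma(b) / gamma(a + b))      # coincide

n, k = 10, 3
print(beta(n - k + 1, k + 1), 1 / ((n + 1) * comb(n, k)))  # coincide
```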
We can directly compute that the expected value of the variable $X \sim \beta(a,b)$ with beta-distribution is (applying $\Gamma(z+1) = z\,\Gamma(z)$)
$$\operatorname{E}X = \frac{B(a+1,b)}{B(a,b)} = \frac{a}{a+b}.$$
If $a = b$, then the expected value and the median are $\frac12$. We can also directly calculate the variance:
$$\operatorname{var}X = \operatorname{E}(X - \operatorname{E}X)^2 = \frac{ab}{(a+b)^2(a+b+1)}.$$
Thus, for $a = b$, we get $\operatorname{var}X = \frac{1}{8a+4}$, which shows that the variance decreases as $a = b$ increases. For $a = b = 1$, we get the ordinary uniform distribution on the interval $(0,1)$. □

Thus, for the sum of independent variables $Y = X_1 + \cdots + X_n$ with distributions $X_i \sim \Gamma(a_i, b)$, the moment generating function (for values $|t| < b$) is obtained:
$$M_Y(t) = \Bigl(\frac{b}{b-t}\Bigr)^{a_1+\cdots+a_n},$$
that is, $Y \sim \Gamma(a_1 + \cdots + a_n,\ b)$. It is essential that all of the gamma distributions share the same value of $b$. As an immediate corollary, the density of the variable $Y = Z_1^2 + \cdots + Z_n^2$ is obtained, where $Z_i \sim N(0,1)$. As just shown, this is the gamma distribution $Y \sim \Gamma(n/2, 1/2)$; hence its density is
$$f_Y(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,x^{n/2-1}e^{-x/2}.$$
This special case of a gamma distribution is called $\chi^2$ with $n$ degrees of freedom. Usually, it is denoted by $Y \sim \chi^2_n$.

10.2.44. The F-distribution. In statistics, it is often wanted to compare two sample variances, so we need to consider variables which are given as a quotient $U = \frac{X/k}{Y/m}$, where $X \sim \chi^2_k$ and $Y \sim \chi^2_m$. Suppose $f_X(x)$ and $f_Y(x)$ are the densities of independent random variables $X$ and $Y$, and suppose $f_Y$ is non-zero only for positive values of $x$. Compute the distribution function of the random variable $U = cX/Y$, where $c > 0$ is an arbitrary constant. By Fubini's theorem, the order of integration with respect to the individual variables can be interchanged:
$$F_U(u) = P\bigl(X < (u/c)Y\bigr) = \int_0^\infty\int_{-\infty}^{uy/c} f_X(x)f_Y(y)\,dx\,dy = \int_0^\infty\Bigl(\int_{-\infty}^u \frac{y}{c}\,f_X(ty/c)f_Y(y)\,dt\Bigr)dy = \int_{-\infty}^u\Bigl(\frac1c\int_0^\infty y\,f_X(ty/c)f_Y(y)\,dy\Bigr)dt.$$
This expression for $F_U(u)$ shows that the density $f_U$ of the random variable $U$ equals
$$f_U(u) = \frac1c\int_0^\infty y\,f_X(uy/c)\,f_Y(y)\,dy.$$
Substitute the densities of the corresponding special gamma distributions for $X \sim \chi^2_k$ and $Y \sim \chi^2_m$, and set $c = m/k$. The random variable $U = \frac{X/k}{Y/m}$ has density $f_U(u)$ equal to
$$\frac{(k/m)^{k/2}}{2^{(k+m)/2}\,\Gamma(k/2)\Gamma(m/2)}\int_0^\infty y^{(k+m)/2-1}\,e^{-y(1+ku/m)/2}\,dy.$$
The integrand in the latter integral is, up to the right constant multiple, the density of the distribution of a random variable $Y \sim \Gamma\bigl((k+m)/2,\ (1+ku/m)/2\bigr)$. Hence the multiple can be rescaled (notice $u$ is constant there) in order to get

10.K.3. In the situation of the problem before the previous one (10.K.1), assume that the success probability $\theta$ in the Bernoulli process is a random variable with probability distribution $\beta(a,b)$. What is the probability distribution of the variable $\gamma = \frac{\theta}{1-\theta}$? What is special about it when $a = b = p$?

Solution. We have already discussed the special case of the uniform distribution $\beta(1,1)$. Thus, we can continue with the equality (1), where we used the form of this distribution. Now the left-hand side contains, instead of $\Theta$, the expression
$$\frac{1}{B(a,b)}\int_0^\Theta t^{a-1}(1-t)^{b-1}\,dt.$$
When differentiating, we must use the rule for the differentiation of an integral with a variable upper bound. Thus, we get for the wanted density
$$B(a,b)\,f(s) = \Bigl(\frac{s}{s+1}\Bigr)^{a-1}\Bigl(1 - \frac{s}{s+1}\Bigr)^{b-1}\frac{1}{(s+1)^2} = \frac{s^{a-1}}{(s+1)^{a+b}}.$$
The picture shows the densities for $a = b = p = 2, 5, 15$. This supports the intuition that equal and not too small values of $a = b = p$ correspond to the most probable value $\theta = \frac12$, so the density of the chance is greatest around one.
The higher p, the lower the variance of this variable. □ 10.K.4. Show that the Bernoulli experiment, described by a random variable X ∼ Bi(n, θ), and the a priori probability of a random variable θ with beta-distribution, the a posteriori probability also has beta-distribution with suitable parameters which depend on the experiment results. What is the a posteriori expected value of θ (i. e., the Bayesian point estimate of this random variable)? Solution. As justified in subsection 10.3.7 of the theoretic part, the a posteriori probability density is, up to an appropriate constant, given as the product of the a priori probability density g(θ) = 1 B(a, b) θa−1 (1 − θ)b−1 and the probability of the examined variable X provided the value of θ occurred. Thus, assuming k successes in the CHAPTER 10. STATISTICS AND PROBABILITY THEORY the integral to evaluate to one. The density fU (u) is then expressed as Γ((k + m)/2) Γ(k/2)Γ(m/2) ( k m )k/2 uk/2−1 ( 1 + k m u )−(k+m)/2 . This distribution is called the Fisher-Snedecor distribution with k and m degrees of freedom, or F-distribution in short. 10.2.45. The t-distribution. One encounters another useful distribution when examining the quotient of variables Z ∼ N(0, 1) and √ X/n. Here X ∼ χ2 n. (We are interested in the quotient of Z and the standard deviation of some sample). Compute first the distribution function of Y = √ X (note that X, and hence Y as well, take only positive values with non-zero probability) FY (y) = P( √ X < y) = P(X < y2 ) = ∫ y2 0 1 2n/2Γ(n/2) xn/2−1 e−x/2 dx = ∫ y 0 1 2n/2−1Γ(n/2) tn−1 e−t2 /2 dt. Hence the density of the random variable Y is fY (y) = 1 2n/2−1Γ(n/2) yn−1 e−y2 /2 . The same method can be used as in the previous subsection with the random variable U = cZ/Y , setting c = √ n, Y = √ X. This leads to the random variable T = Z √ X/n . Similar computation as the one above yields that the density fT satisfies fT (t) = Γ((n + 1)/2) Γ(n/2) √ nπ ( 1 + t2 n )−(n+1)/2 . This is called the Student’s t-distribution with n degrees of freedom. 10.2.46. Multidimensional normal distribution. Consider a random vector Z = (Z1, . . . , Zn) with independent components Zi ∼ N(0, 1). Then its covariance matrix is equal to the unit matrix, i.e., var Z = In. Random vectors are often encountered which are an affine transformation U = a + BZ of such a vector Z, where a is an arbitrary constant vector in Rm and B is an m-by-n constant matrix. As derived in theorems 10.2.29 and 10.2.35, these random vectors have expected value E U = a and covariance matrix var U = V = BBT (since the covariance matrix of Z is the identity matrix). Therefore, this covariance matrix is always positive-semidefinite. The random vector U is said to have multivariate normal distribution Nm(a, V ). 958 Bernoulli experiment, we get the a posteriori density (the sign used instead of equality denotes “proportional”) g(θ|X = k) ∝ P(X = k|θ)g(θ) ∝ ∝ θk (1 − θ)n−k θa−1 (1 − θ)b−1 = = θa+k−1 (1 − θ)b+n−k−1 . Thus, we have indeed obtained the density (up to a constant, which we need not evaluate) of the a posteriori distribution for θ with distribution B(a + k, b + n − k). Its a posteriori expected value is ˆθ = a + k a + b + n . For n and k approaching infinity so that k/n → p, our a posteriori estimate also satisfies ˆθ → p. Thus, we can see that for large values of n and k, the observed fraction of successful experiments outweighs the a priori assumption. On the other hand, for small values, the a priori assumption is very important. □ 10.K.5. 
We have data about accident rates for N = 20 drivers in the last n = 10 years (the k-th item corresponds to the number of years when the k-th driver had an accident): 0, 0, 2, 0, 0, 2, 2, 0, 6, 4, 3, 1, 1, 1, 0, 0, 5, 1, 1, 0. We assume that the probabilities pj, j = 1, . . . , N, that the j-th driver has an accident in a given year are constants. For each driver, estimate the probability that s/he has an accident in the following year (in order to determine the individual insurance fee, for instance). 2 Solution. We introduce random variables Xij with value 0 if the i-th driver has no accident in the j-th year, and 1 otherwise. The individual years are considered to be independent. Thus, we can assume that the random variables Sj = ∑n i=1 Xji that correspond to the number of accidents in all the n = 10 years have distribution Bi(n, pj). Of course, we could estimate the probabilities for all drivers altogether, i. e., using the arithmetic mean ˆp = 1 N n∑ j=1 Sj 1 n = 1 20 29 10 = 0.145. However, consider the homogeneity of the distribution of the variables Xj, they can hardly be accounted equal, so such estimate would be misleading. On the other hand, the opposite extreme, i. e., a totally independent and individual estimate ˆpj = 1 n Sj is also inappropriate, since we surely do not want to set zero insurance fee until the first accident happens. The realistic method is to use the same assumption for the a priori distribution of the probabilities pj of accident 2This problem is taken from the contribution M. Friesl, Bayesovské odhady v některých modelech, published in: Analýza dat 2004/II (K. Kupka, ed.), Trilobyte Statistical Software, Pardubice, 2005, pp. 21-33. CHAPTER 10. STATISTICS AND PROBABILITY THEORY For any multivariate normal distribution Nm(a, V ), consider again the affine transformation W = c + DU with a vector of constants c ∈ Rk and an arbitrary k-by-m constant matrix D. Direct calculation leads to W = c + D(a + BZ) = (c + Da) + (DB)Z, which is a random vector W ∼ Nk(c + Da, DBBT DT ). Thus, the covariance matrix of the multivariate normal distribution behaves as a quadratic form with respect to affine transformations. This straightforward idea shows that any linear combination of components of a random vector with the multivariate normal distribution is a random variable with the normal distribution. Similarly, any vector obtained by choosing only some of the components of the vector U is again a random vector with the multivariate normal distribution. Note that when the random vector Z ∼ Nn(0, In) is transformed with an orthogonal matrix QT , then the joint distribution function of the random vector U = QT Z can be computed directly. If the transformation in coordinates as t = QT z is written, then its inverse is z = Qt, and the Jacobian of this transformation is equal to one. Hence (note that∑ i z2 i = ∑ i t2 i . As in chapter 3, write z < u if all components satisfy zi < ui) FU (u) = P(Ui < ui, i = 1, . . . , n) = = ∫ · · · ∫ QT z 0. The experiment of tossing the coin n times allows the adjustment of the distribution within the preferred class. Thus, we build on some assumptions about the distribution and adjust the prior distribution in view of the experiment. This approach is called Bayesian statistics. The first approach is based on the purely mathematical abstraction that probabilities are given by the frequencies of event occurrences in data samples which are so large that they can be approximated with infinite models. 
The central limit theorem can be used to estimate their confidence. From the statistical point of view, the probability is an idealization of the relative frequencies of the cases when an examined result

L. Processing of multidimensional data

Sometimes, we need to process multidimensional data: for each of $n$ objects, we determine $p$ characteristics. For instance, we can examine the marks of several students in various subjects.

10.L.1. In his experiments, J. G. Mendel examined 10 pea plants, and each was examined for the number of yellow and green seeds. The results of the experiment are summarized in the following table:

plant number  1   2   3   4   5   6   7   8   9   10
yellow seeds  25  32  14  70  24  20  32  44  50  44
green seeds   11  7   5   27  13  6   13  9   14  18
total seeds   36  39  19  97  37  26  45  53  64  62

It follows from the genetic models that the probability of occurrence of a yellow seed should be 0.75 (and 0.25 for a green seed). At the asymptotic significance level 0.05, test the hypothesis that the results of Mendel's experiments are in accordance with the model.

Solution. We test the hypothesis with Pearson's chi-squared test. We use the statistic
$$K = \sum_{j=1}^r \frac{(n_j - np_j)^2}{np_j},$$
where $r$ is the number of sorting intervals (measurements; we have $r = 10$), $n_j$ is the actually measured frequency in the given sorting interval (we count the number of yellow seeds), and $p_j$ is the probability of the observed characteristic under the assumed distribution, so that $np_j$ is the expected frequency, with $n$ the total count in the interval; in our case, $p_j = 0.75$, $j = 1, \ldots, 10$. If the results of the experiment were really distributed as assumed in our model, we would have $K \approx \chi^2(r - 1 - p)$, where $p$ is the number of estimated parameters in the assumed probability distribution. In our case, it is especially simple, since our model does not have any unknown parameters, so we have $p = 0$ (the parameters may occur if, e.g., we assume that the probability distribution in our experiment is normal but with unknown variance and expected value; then we would have $p = 2$). Thus, $K \approx \chi^2(9)$. The statistic is recommended to be used if the expected frequency of the characteristic in each of the sorting intervals is at least 5. Let us write the data into a table:

j    n_j   p_j    np_j    (n_j − np_j)²/(np_j)
1    25    0.75   27      0.148148
2    32    0.75   29.25   0.258547
⋮    ⋮     ⋮      ⋮       ⋮
10   44    0.75   46.5    0.134409

The value of the statistic $K$ for the given data is
$$K = 0.148148 + 0.258547 + \cdots + 0.134409 = 1.797495.$$

occurs in many repeated experiments. This seeming advantage/rigor can become a disadvantage as soon as we are interested in the confidence of the data themselves and the suitability of the chosen experiment. The same problem occurs if we want to use frequentist statistics to estimate the probability of one or more outcomes of an experiment that is executed only once. On the other hand, Bayesian statistics is an example of applying mathematics to "common sense" when we want to adjust our belief in the light of new information. It is interesting that, from the historical point of view, the first approach was the Bayesian one (used, for instance, by Laplace as early as in the 18th century), which succumbed to frequentist statistics in the 20th century. In recent decades, Bayesian statistics has been returning, together with further new approaches.

10.3.2. Random sample of a population. We now describe the first of the two approaches mentioned above in more detail.
Thus, assume that there is a (huge) basic statistical set of $N$ units, which is called the population, and each of the units has a numerical characteristic, i.e., there is a set of values $(x_1, \ldots, x_N)$. Only a sample with values $(X_1, \ldots, X_n)$ is drawn from this set. In order to avoid the discussion of the actual size of the basic statistical set of $N$ units, assume that the items of the sample are selected one by one and every item is always put back into the population. In addition, assume that every item has the same probability $1/N$ of being chosen. This is a random sample. The realization of the random sample can then be viewed as working with a vector $(X_1, \ldots, X_n)$ of independent, identically distributed random variables. In particular, they have the same distribution function $F_X(x)$ and moments $\operatorname{E}X_i = \mu$, $\operatorname{var}X_i = \sigma^2$.

The next step must be a derivation of the characteristics of the sample mean $\bar X$ and the sample variance
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$$
The following theorem explains why the coefficient $\frac{1}{n-1}$ is selected instead of $\frac1n$, which is the case with $s^2$ in subsection 10.1.6.

Theorem. The sample mean $\bar X$ computed from a random sample of size $n$ whose distribution has finite expected value $\mu$ and finite variance $\sigma^2$ satisfies $\operatorname{E}\bar X = \mu$ and $\operatorname{var}\bar X = \frac1n\sigma^2$. The sample variance $S^2$ satisfies $\operatorname{E}S^2 = \sigma^2$.

Proof. As derived in subsection 10.2.29,
$$\operatorname{E}\bar X = \frac1n\,\operatorname{E}\sum_{i=1}^n X_i = \frac1n\,n\mu = \mu.$$

This value is less than $\chi^2_{0.95}(9) = 16.9$, so we do not reject the null hypothesis at level 0.05 (i.e., we do not refute the known genetic model). □

Since the variables $X_i$ are independent, the additivity of variance can be used (derived in subsection 10.2.33). The variance behaves as a quadratic form with respect to multiplication by a scalar. Hence
$$\operatorname{var}\bar X = \frac{1}{n^2}\operatorname{var}\sum_{i=1}^n X_i = \frac{1}{n^2}\,n\sigma^2 = \frac1n\sigma^2.$$
The formula
$$\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n (X_i - \bar X)^2 + n(\bar X - \mu)^2$$
can be verified simply by expanding the products. Thus:
$$\operatorname{E}s^2 = \frac1n\,\operatorname{E}\Bigl(\sum_{i=1}^n (X_i-\mu)^2 - n(\bar X - \mu)^2\Bigr) = \frac1n\sum_{i=1}^n \operatorname{var}X_i - \operatorname{var}\bar X = \Bigl(1 - \frac1n\Bigr)\sigma^2.$$
That is why the variance $s^2$ is multiplied by the coefficient $\frac{n}{n-1}$, which leads just to the sample variance $S^2$ and its expected value $\sigma^2$. Of course, this multiplication makes sense only if $n \ne 1$. □

10.3.3. Random sample of the normal distribution. In practice, it is necessary to know not only the numerical characteristics of the sample mean and the variance, but also their entire probability distributions. Of course, these can be derived only if the particular probability distribution of the $X_i$ is known. As a useful illustration, calculate the result for a random sample of the normal distribution. It is already verified, as an example on the properties of moment generating functions in 10.2.37, that the sum of random variables with the normal distribution has again the normal distribution. Hence the sample mean must also have the normal distribution, and since both its expected value and variance are known, $\bar X \sim N(\mu,\ \frac1n\sigma^2)$.

The probability distribution of the sample variance is more complicated. Here, apply the ideas about multivariate normal distributions from subsection 10.2.46. Consider the vector $Z$ of standardized normal variables $Z_i = \frac{X_i - \mu}{\sigma}$. The same property holds for the vector $U = Q^TZ$ with any orthogonal matrix $Q$; in addition, $\sum_i U_i^2 = \sum_i Z_i^2$. Choose the matrix $Q$ so that the first component $U_1$ equals the sample mean $\bar Z$, up to a multiple. This means that the first column of $Q$ is chosen as $(\sqrt n)^{-1}(1, \ldots, 1)$.
Then U2 1 = n ¯Z2 , so we 962 CHAPTER 10. STATISTICS AND PROBABILITY THEORY can compute: n∑ i=1 U2 i = n∑ i=1 Z2 i = n∑ i=1 (Zi − ¯Z)2 + n ¯Z2 n∑ i=2 U2 i = n∑ i=1 (Zi − ¯Z)2 = 1 σ2 n∑ i=1 (Xi − ¯X)2 . Therefore, a multiple of the sample variance n−1 σ2 S2 is the sum of n−1 squares of standardized normal variables, so the following theorem is proved: Theorem. Let (X1, . . . , Xn) be a random sample from the N(µ, σ2 ) distribution. Then, ¯X and S2 are independent variables, and ¯X ∼ N(µ, 1 n σ2 ), n − 1 σ2 S2 ∼ χ2 n−1. Hence, it immediately follows that the standardized sample mean T = √ n ¯X − µ S has Student’s t-distribution with n − 1 degrees of freedom. 10.3.4. Point and interval estimates. Now, we have everything needed to estimate the parameter values in the context of frequentist statistics. Here is a simple example. Suppose there are 500 students enrolled in a course, each of which has a certain degree of satisfaction with the course, expressed as an integer in the range 1 through 10. It may be assumed that the satisfactions Xi of the students are approximated by a random variable with distribution N(µ, σ2 ). Further, suppose a detailed earlier survey showed that µ = 6, σ = 2. In the current semester, 15 students are asked about their opinion about the course, as rumor has it that the evaluation of the new lecturer might be quite different. The results show that 2 students vote 3, 3 vote 4, 3 vote 5, 5 vote 6, and 2 vote 7. Altogether, the sample mean is ¯X = 5.133 and the sample variance is S2 = 1.695. By assumptions, ¯X ∼ N(µ, σ2 /n), so Z = √ n ¯X−µ σ ∼ N(0, 1). In order to express the confidence of the estimate, compute the interval which contains the estimated parameter with an a priori fixed probability 100(1−α)%. We talk about a confidence level α, 0 < α < 1. Consider µ to be the unknown parameter, while the variance can be assumed (be it correct or not) to remain unchanged. It follows that 1 − α = P(|Z| < z(α/2)) = P ( √ n ¯X − µ σ < z(α/2) ) = P ( ¯X − σ √ n z(α/2) < µ < ¯X + σ √ n z(α/2) ) , where z(α/2) means the critical value, cf. 10.2.30. Thus, an interval is found whose endpoints are random variables and which contains the estimated parameter µ with an a priori fixed probability. The middle point of this interval is called the point estimate for parameter µ; the whole interval is called the interval estimate. We can also say that at the confidence 963 CHAPTER 10. STATISTICS AND PROBABILITY THEORY level α, the estimated parameter µ is or is not different from another value µ0. Suppose for instance, the data and levels are α = 0.05 and α = 0.1. Respectively we obtain the intervals µ ∈ (4.121, 6.145), µ ∈ (4.284, 5.983). Considering the confidence level of 5%, we cannot affirm that the opinion of students are worse compared to the previous year because the mentioned interval also contains the value µ0 = 6. We can conclude this if we take the confidence level of 10% since the value µ0 = 6 no more lies in the corresponding interval. On the other hand, if it is assumed that the other (worse) lecturer causes the variance of the answers to change as well (for instance, the students might agree more on the bad assessment), we proceed differently. Instead of the standardized variable Z, deal in a similar way with the variable T = √ n ¯X − µ S . As seen, this random variable has probability distribution T ∼ tn−1, where n = 15 in this case. This leads to the interval estimate ¯X − S √ n tn−1(α/2) < µ < ¯X + S √ n tn−1(α/2). 
Substitute the data at levels $\alpha = 0.05$ and $\alpha = 0.03$ respectively, to obtain
$$\mu \in (4.412,\ 5.854), \qquad \mu \in (4.321,\ 5.945).$$
Therefore, at the confidence level of 3%, the opinion seems to have become worse. This corresponds to our intuition that the sample deviation $S = 1.302$, which is significantly smaller than $\sigma = 2$ from the previous case, should be essential for our thinking.

10.3.5. Likelihood of estimates. From the mathematical point of view, interval and point estimates are simple and easy to understand. It is much worse with their practical interpretation, because it is problematic to verify all the assumptions about the randomness of the sample. In more complicated cases, we encounter problems with the "likelihood" of our estimates. As mathematicians, we can avoid the practical problem by defining the missing concept.

In general, one works with a random sample of size $n$. Implicitly, it is assumed that there are independent random variables $X_i$ with the same probability distribution which depends on an unknown parameter $\theta$ (a vector in general). We are trying to find a sample statistic $T$, i.e., a function of the random variables $X_1, X_2, \ldots$, which, in a mathematical sense, estimates the actual value of the parameter $\theta$. $T$ is said to be an unbiased estimator of $\theta$ if and only if $\operatorname{E}T = \theta$. The expected value $\operatorname{E}(T - \theta)$ is called the bias of the estimator $T$. The asymptotic behaviour of the estimator, that is, what it does as $n$ goes to infinity, is often of interest. $T = T(n)$ is said to be a consistent estimator of the parameter $\theta$ if and only if $T(n)$ converges in probability to $\theta$, i.e., for every $\varepsilon > 0$,
$$\lim_{n\to\infty} P\bigl(|T(n) - \theta| < \varepsilon\bigr) = 1.$$
Chebyshev's inequality immediately yields
$$P\bigl(|T(n) - \operatorname{E}T(n)| < \varepsilon\bigr) \ge 1 - \frac{\operatorname{var}T(n)}{\varepsilon^2}.$$
Assuming $\lim_{n\to\infty}\operatorname{E}T(n) = \theta$, then, for sufficiently large values of $n$,
$$P\bigl(|T(n) - \theta| < 2\varepsilon\bigr) \ge P\bigl(|T(n) - \operatorname{E}T(n)| < \varepsilon\bigr) \ge 1 - \frac{\operatorname{var}T(n)}{\varepsilon^2}.$$
A useful proposition is thus proved:

Theorem. Assume that $\lim_{n\to\infty}\operatorname{E}T(n) = \theta$ and $\lim_{n\to\infty}\operatorname{var}T(n) = 0$. Then $T(n)$ is a consistent estimator of $\theta$.

As a simple example, we can illustrate this theorem on the variance estimate
$$\hat\sigma^2 = \frac1n\sum_{i=1}^n (X_i - \bar X)^2 = \frac{n-1}{n}\,S^2.$$
Since it is known from subsection 10.3.2 that $S^2$ is an unbiased estimator, it follows that $\hat\sigma^2$ is not. However, $\lim_{n\to\infty}\operatorname{E}\hat\sigma^2 = \sigma^2$, and it can be calculated (for a random sample from the normal distribution) that
$$\lim_{n\to\infty}\operatorname{var}\hat\sigma^2 = \lim_{n\to\infty}\operatorname{var}S^2 = \lim_{n\to\infty}\frac{2\sigma^4}{n-1} = 0.$$
Therefore, the statistic $\hat\sigma^2$ is a consistent estimator of the variance.

It is apparent that there may be more unbiased estimators for a given parameter. For instance, it is already shown that the arithmetic mean $\bar X$ is an unbiased estimator of the expected value $\theta$ of random variables $X_i$. The value $X_1$ is, of course, an unbiased estimator of $\theta$ as well. We wish to find the best estimator $T$ in the class of considered statistics, which are unbiased or consistent. Consider as best the one whose variance is as small as possible. Recall that the variance of a vector statistic $T$ is given by the corresponding covariance matrix, which is, in the case of independent components, a diagonal matrix with the individual variances of the components on the diagonal. We have already defined inequalities between positive-definite matrices.

10.3.6. Maximum likelihood. Assume that the density function of the components of the sample is given by a function $f(x,\theta)$ which depends on an unknown parameter $\theta$ (a vector in general). By the assumed independence, the joint density of the vector $(X_1, \ldots, X_n)$ is equal to the product
$$f(x_1, \ldots, x_n, \theta) = f(x_1, \theta)\cdots f(x_n, \theta),$$
which is called the likelihood function. We are interested in the value $\hat\theta$ which maximizes the likelihood function on the set of all admissible values of the parameter. In the discrete case, this means choosing the parameter for which the obtained sample has the greatest probability.

Usually, it is more efficient to work with the log-likelihood function
$$\ell(x_1, \ldots, x_n, \theta) = \ln f(x_1, \ldots, x_n, \theta) = \sum_{i=1}^n \ln f(x_i, \theta).$$
Since the function $\ln$ is strictly increasing, maximization of the log-likelihood function is equivalent to maximization of the original likelihood function. If, for some input, it happens that $f(x_1, \ldots, x_n, \theta) = 0$, set $\ell(x_1, \ldots, x_n, \theta) = -\infty$. In the case of discrete random variables, use the same definition with the probability mass function instead of the density, i.e.,
$$\ell(x_1, \ldots, x_n, \theta) = \sum_{i=1}^n \ln P(X_i = x_i\,|\,\theta).$$

We can illustrate the principle on a random sample from the normal distribution $N(\mu, \sigma^2)$ of size $n$. The unknown parameters are $\mu$ or $\sigma$, or both. The considered density is
$$f(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
Take logarithms of both sides, to obtain
$$\ell(x, \mu, \sigma) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$$
The maximum can be found using differentiation (note that $\sigma^2$ is treated as a symbol for a variable):
$$\frac{\partial\ell}{\partial\mu} = -\frac{1}{2\sigma^2}\sum_{i=1}^n (-2)(x_i - \mu) = \frac{1}{\sigma^2}\Bigl(-n\mu + \sum_{i=1}^n x_i\Bigr),$$
$$\frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i-\mu)^2 = \frac{1}{2(\sigma^2)^2}\Bigl(-n\sigma^2 + \sum_{i=1}^n (x_i - \mu)^2\Bigr).$$
Thus, the only critical point is given by $\hat\mu = \bar X$ and $\hat\sigma^2 = s^2$. Substitute these values into the matrix of second derivatives, to obtain the Hessian of $\ell$:
$$\begin{pmatrix} -\dfrac{n}{\hat\sigma^2} & 0 \\[1ex] 0 & -\dfrac{n}{2(\hat\sigma^2)^2}\end{pmatrix}.$$
This is negative definite, so the critical point is indeed a maximum, and since it is the only critical point, it must be the global maximum (think about the details of this argument!). Thus it is verified that the sample mean and the variance estimate $s^2$ are the maximum likelihood estimates of $\mu$ and $\sigma^2$, as already used.

10.3.7. Bayesian estimates. We return to the example from subsection 10.3.4, now from the point of view of Bayesian statistics. This totally reverses the approach: the collected data $X_1, \ldots, X_{15}$ (i.e., the points which express how much each student is satisfied, using the scale 1 through 10) are treated as constants. On the other hand, the estimated parameter $\mu$ (the expected value of the points of satisfaction) is viewed as the random variable
Indeed, we do not need to know f(x) for the following reason: we have to view f(x) as a constant independent of θ and thus the proper density is obtained from f(x|θ)g(θ) by multiplying with a uniquely given constent in the end. Thus, during the computation, it is sufficient to be precise “up to a constant multiple”. For this purpose, use the notation Q ∝ R, meaning that there is a constant C such that the expressions Q and R satisfy Q = CR. We shall illustrate this procedure on a more explicite example. In order to be as close as possible to the ideas from subsection 10.3.4, work with normal distributions N(µ, σ2 ). Suppose that the satisfaction of individual students in particular lectures is a random variable X ∼ N(θ, σ2 ), while the parameter θ reached by the particular lecturers is a random variable θ ∼ N(a, b). Compute, (up to a constant multiple, ignoring all multiplicative components which do not include any θ), g(θ|x) ∝ f(x|θ)g(θ) ∝ exp ( − (x − θ)2 2σ2 − (θ − a)2 2b2 ) ∝ exp ( − 1 2 ( θ2 ( 1 σ2 + 1 b2 ) − 2θ ( x σ2 + a b2 ))) ∝ exp ( − 1 2 ( θ − b2 x + σ2 a σ2b2 σ2 b2 b2 + σ2 )2( b2 σ2 b2 + σ2 )−1) . This proves already that the distribution for θ is θ ∼ N ( b2 b2 + σ2 x + σ2 b2 + σ2 a, b2 σ2 b2 + σ2 ) . This result can be interpreted so that if the parameters a, b, σ are known from long-run evaluation of surveys and the opinion of another student is learned, then the a priori opinion about the parameters for an individual lecture can be adjusted. In the resulting estimate, the expected value is given by the weighted average of the found value x and the a priori assumed expected value a, in dependence on the standard deviations σ and b. 967 CHAPTER 10. STATISTICS AND PROBABILITY THEORY 10.3.8. Interpretation in Bayesian statistics. We follow the ideas from the previous subsection, compared to the frequentist interpretation from 10.3.4. It may seem odd that a single query can influence an opinion so much. For σ → 0, the relevance of a single opinion is still increasing, and this corresponds to a 100% relevance of x in the case σ = 0. This is in accordance with the interpretation that Bayesian statistics is the probability extension of the standard discrete mathematical logic. If the variance σ is close to zero, then it is almost certain that the opinion of any student precisely describes the opinion of the whole population. In subsection 10.3.4, we worked with the sample mean ¯X of the collected data. This can be used in the previous calculation, since the mean also has a normal distribution, too. The expected value is the same, and the only difference is that σ2 /n is substituted instead of σ2 . To facilitate the notation, define the constant cn = nb2 nb2 + σ2 . The a posteriori estimate for θ based on the found sample mean ¯X has the distribution with parameters θ ∼ N(cn ¯X + (1 − cn)a, cnσ2 /n). As could be expected, for increasing n, the expected value of the distribution for θ approaches the sample mean, and its variance approaches zero. In other words, the higher the value of n, the closer is the point estimate from the frequentist point of view. A contribution of the Bayesian approach is that if the estimated distribution is used, questions of the kind: “What is the probability that the new lecturer is worse than the old one?” can be answered. Use the same data as in 10.3.4 and supplement the necessary a priori data. Assume that the lecturers are assessed quite well (otherwise, they would probably not be teaching at the university at all). 
For concreteness, select the a priori distribution with parameters a = 7.5, b = 2.5, and the standard deviation with σ = 2. Continue with n = 15 and the sample mean of 5.133. Substitute this data, to get the a posteriori estimate for the distribution θ ∼ N(5.230, 0.256). We are interested in P(θ < 6). This is computed by evaluating the distribution function of the corresponding normal distribution for the input value 6 (Excel is capable of this, too). The answer is approximately 93.6 %. This is similar to the material in subsection 10.3.4, where the known variance is assumed constant. Note the influence of the a priori assumption about the distribution of the parameter θ for all lecturers. To a certain extent, this reflects a faith that the lecturers are rather good. If a statistician has a reason for assuming that the actual expected value a for a specific lecturer is shifted, say a = 6 as in the survey about the previous lecturer, (this can be caused, 968 CHAPTER 10. STATISTICS AND PROBABILITY THEORY for example, by the fact that the lecture is hard and unpopular), then the probability of his actual parameter being less than 6 would be approximately 95.0 %. (If the expected value is considered to be significantly worse only when below 5.5, then the value would be only approximately 75 %). When substituting a = 5, the value is already 96.8 %. The variance b2 is also important. For instance, the a priori estimate a = 6, b = 3.5 leads to probability 95.2 %. In the above discussion, another very important point is touched on – sensitivity analysis. It would seem desirable that a small change of the a priori assumption has only a small effect on the a posteriori result. It appears that this is so in this example; however, we omit further discussion here. The same model with exponential distributions is used in practice when judging the relevance of the output of an IQ test of an individual person. It can also be used for another similar exam where it is expected that the normal distribution approximates well the probability distribution of the results. In both cases, there is an a priori assumption to which group he/she should belong. Other good examples (with different distributions) are practical problems from insurance industry, where it is purposeful to estimate the parameters so that both the effects of the experiment upon an individual item and the altogether expectations over the population are included. 10.3.9. Notes on hypothesis testing. We return to deciding whether a given event does or does not occur in the context of frequentist statistics. We build on the approach from interval estimates, as presented above. Thus, consider a random vector X = (X1, . . . , Xn) (the result of a random sample), whose joint distribution function is FX(x). A hypothesis is an arbitrary statement about the distribution which is determined by this distribution function. Usually, one formulates two hypothesis, denoted H0 and HA. The former is traditionally called null hypothesis, and the latter is called alternative hypothesis. The result of the test is then a decision based on a concrete realization of the random vector X (a test) whether the hypothesis H0 is to be rejected or not in favor of the hypothesis HA. During this process, two types of errors may occur. Type I error occurs when H0 is rejected even though it is true. Type II error occurs when H0 is not rejected although it is false. 
10.3.9. Notes on hypothesis testing. We return to deciding whether a given event does or does not occur, in the context of frequentist statistics. We build on the approach of interval estimates, as presented above. Thus, consider a random vector X = (X1, . . . , Xn) (the result of a random sample), whose joint distribution function is F_X(x). A hypothesis is an arbitrary statement about the distribution determined by this distribution function. Usually, one formulates two hypotheses, denoted H0 and HA. The former is traditionally called the null hypothesis, the latter the alternative hypothesis. The result of the test is then a decision, based on a concrete realization of the random vector X (a test), whether the hypothesis H0 is to be rejected or not in favor of the hypothesis HA. During this process, two types of errors may occur. A type I error occurs when H0 is rejected even though it is true. A type II error occurs when H0 is not rejected although it is false.

The decision procedure of a frequentist statistician consists of selecting the critical region W, i.e., the set of test results for which the hypothesis is rejected. The size of the critical region is chosen so that a true hypothesis is rejected with probability not greater than α. This means that a fixed bound for the probability of the type I error is required: the significance level α. The most common choices are α = 0.05 or α = 0.01. It is also useful in practice to determine the least possible significance level at which the hypothesis would still be rejected – the p-value of the test.

It remains to find a reasonable procedure for choosing the critical region. This should be done so that the type II error occurs as rarely as possible. Usually, it is convenient to consider the likelihood function f(x, θ), defined for a random vector X in subsection 10.3.6. For the sake of simplicity, assume there is a one-dimensional parameter θ, and formulate the null hypothesis as X being given by the density f(x, θ0), while the alternative hypothesis is given by the distribution f(x, θ1), for fixed distinct values θ0 and θ1. Ideas about rejecting or accepting the hypotheses suggest that, when substituting the values of a specific test into the likelihood function, the hypothesis can be accepted if f(x, θ0) is much greater than f(x, θ1). This suggests considering, for each constant c > 0, the critical region

$$W_c = \{x;\ f(x,\theta_1) \ge c\,f(x,\theta_0)\}.$$

Having chosen the significance level, choose c so that $\int_{W_c} f(x,\theta_0)\,dx = \alpha$. This guarantees that, when H0 is valid, the test result x ∈ W_c (a type I error) occurs with at most the prescribed probability. The same can be guaranteed by other critical regions W which also satisfy $\int_{W} f(x,\theta_0)\,dx = \alpha$. On the other hand, type II errors are also of interest. That is, it is desired to maximize the probability of HA over the critical region. Thus, consider the difference

$$D = \int_{W_c} f(x,\theta_1)\,dx - \int_{W} f(x,\theta_1)\,dx$$

for an arbitrary W as above. The regions over which integration is carried out can be divided into the common part W ∩ W_c and the remaining set differences. The contributions of the common part cancel, and there remains

$$D = \int_{W_c\setminus W} f(x,\theta_1)\,dx - \int_{W\setminus W_c} f(x,\theta_1)\,dx.$$

Using the definition of the critical region W_c (and putting back the same integrals over the common part),

$$D \ge c\int_{W_c\setminus W} f(x,\theta_0)\,dx - c\int_{W\setminus W_c} f(x,\theta_0)\,dx = c\int_{W_c} f(x,\theta_0)\,dx - c\int_{W} f(x,\theta_0)\,dx = c\alpha - c\alpha = 0.$$

This proves an important statement, the Neyman–Pearson lemma:

Proposition. Under the above assumptions, W_c is the optimal critical region, in the sense that it minimizes the occurrence of the type II error at a given significance level.

10.3.10. Example. The interval estimate, as illustrated on an example in subsection 10.3.4, is a special case of hypothesis testing, where H0 has the form "the expected value of the satisfaction with the course remained µ0", while HA says that it is equal to a different value µ1. The general procedure mentioned above leads in this case to the critical region given by

$$|Z| = \frac{|\bar X - \mu_0|}{\sigma}\sqrt{n} \ge z(\alpha/2).$$

Note that in the definition of the critical region, the actual value µ1 is not essential. In the context of classical probability, the decision at a given level α whether or not there is a change of the expected value µ is thus formalized. To test only whether the satisfaction decreased, assume beforehand that µ1 < µ0.
We analyze this case thoroughly. The critical region from the Neyman–Pearson lemma is determined by the inequality

$$\frac{f(x,\mu_1,\sigma^2)}{f(x,\mu_0,\sigma^2)} = \mathrm{e}^{-\frac{1}{2\sigma^2}\sum_{i=1}^n\left((x_i-\mu_1)^2-(x_i-\mu_0)^2\right)} \ge c.$$

Take logarithms and rearrange to obtain

$$2\bar x(\mu_1-\mu_0) - (\mu_1^2-\mu_0^2) \ge \frac{2\sigma^2}{n}\ln c.$$

Since µ1 < µ0, it follows that

$$\bar x \le \frac{\mu_1+\mu_0}{2} + \frac{\sigma^2}{n(\mu_1-\mu_0)}\ln c = y.$$

For a given level α, the constant c, and thereby the decisive parameter y, are determined so that, under the assumption that H0 is true,

$$\alpha = P(\bar X \le y) = P\Big(\frac{\bar X-\mu_0}{\sigma}\sqrt{n} \le \frac{y-\mu_0}{\sigma}\sqrt{n}\Big).$$

Assuming that H0 is true, $Z = \frac{\bar X-\mu_0}{\sigma}\sqrt{n} \sim N(0,1)$, so the requirement means choosing Z ≤ −z(α), which determines the optimal W_c uniquely. Note that this critical region is independent of the chosen value µ1, and the actual value of y did not have to be expressed at all. It was only essential to assume that µ1 < µ0.

In the illustrative example from subsection 10.3.4, H0 : µ = 6, and the alternative hypothesis is HA : µ < 6. The variance is σ² = 4. The test with n = 15 yielded $\bar x$ = 5.133. Substituting this gives the value $z = \frac{5.133-6}{\sqrt{4}}\sqrt{15} = -1.678$, while −z(0.05) = −1.645. Therefore, if we are testing whether the new teacher is even worse than the previous one, we reject the hypothesis at the level of 5 %, deducing that the students' opinions are really worse.

If the union of the critical regions for the cases µ1 < µ0 and µ1 > µ0 is chosen as the critical region, the same results as for the interval estimate are obtained, as mentioned above.

We remark that in the Bayesian approach, it is also possible to accept or reject hypotheses in a direct connection to the a posteriori probability of events, as was, to a certain extent, indicated in subsection 10.3.8, where our specific example is interpreted.
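The arithmetic of this one-sided test is easy to verify; a small sketch (Python with SciPy, reusing the numbers from the example – a check of ours, not part of the original computation):

    # One-sided z-test: H0: mu = 6 against HA: mu < 6.
    from math import sqrt
    from scipy.stats import norm

    mu0, sigma, n, x_bar = 6.0, 2.0, 15, 5.133   # sigma^2 = 4
    z = (x_bar - mu0) / sigma * sqrt(n)          # approx. -1.678
    print(z, norm.ppf(0.05))                     # critical value -z(0.05), approx. -1.645
    print(z <= norm.ppf(0.05))                   # True: reject H0 at the 5 % level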
10.3.11. Linear models. As is usual in the analysis of mathematical problems, either we deal with linear dependencies and objects, or we discuss their linearizations. In statistics, too, many methods belong to the linear models. We consider a quite general scheme of this type. Consider a random vector Y = (Y1, . . . , Yn)^T and suppose Y = X·β + σZ, where X = (x_{ij}) is a constant real-valued matrix with n rows and k < n columns whose rank is k, β is an unknown constant vector of k parameters of the model, Z is a random vector whose n components are independent with distribution N(0, 1), and σ > 0 is an unknown positive parameter of the model. This is a linear model with full rank.

In practice, the variables x_{ij} are often known, and the problem is to estimate or predict the value of Y. For instance, x_{ij} can express the grade in maths of the i-th student in the j-th semester (j = 1, 2, 3), and we want to know how this student will fare in the fourth semester. For this purpose, the vector β needs to be known. It can be estimated from complete observations, that is, from the knowledge of Y (from the results of past years, for example).

In order to estimate the vector β, the least squares method can often be used. This means looking for the estimate b ∈ R^k for which the vector Ŷ = Xb minimizes the squared length of the vector Y − Xβ. This is a simple problem from linear algebra: we look for the orthogonal projection of the vector Y onto the subspace span X ⊂ R^n generated by the columns of the matrix X, i.e., we minimize the function

$$\|Y - X\beta\|^2 = \sum_{i=1}^n\Big(Y_i - \sum_{j=1}^k x_{ij}\beta_j\Big)^2.$$

Choose an arbitrary orthonormal basis of the vector subspace span X and write it into the columns of a matrix P. For any choice of such a basis, the orthogonal projection is realized as multiplication by the matrix PP^T. On the subspace span X, the mapping given by this matrix is the identity. That is,

$$\hat Y = PP^TY = PP^T(X\beta + \sigma Z) = X\beta + \sigma PP^TZ.$$

The matrix PP^T is positive-semidefinite. Extend the basis consisting of the columns of P to an orthonormal basis of the whole space R^n. In other words, create a matrix Q = (P R) by writing the newly added basis vectors into the matrix R with n − k columns and n rows. Denote by V = P^TZ and U = R^TZ the random vectors with k and n − k components, respectively. Stacked together, they form the vector (V^T U^T)^T = Q^TZ in R^n.

Clearly (see subsection 10.2.46), both vectors V and U have multivariate normal distributions with zero expected value and identity covariance matrix. The random vector Y is decomposed into the sum of the constant vector Xβ and two orthogonal projections,

$$Y = X\beta + \sigma PV + \sigma RU,$$

and the desired orthogonal projection Ŷ is the sum of the first two summands. In subsection 10.2.46, the distribution of such random vectors is also derived.

The quantity ∥Y − Ŷ∥² is called the residual sum of squares, sometimes denoted RSS. Also, the residual variance is defined as

$$S^2 = \frac{\|Y - Xb\|^2}{n-k}.$$

Recall that Ŷ = Xb and that X^TX is invertible, since X is assumed to have full rank. Thus b = (X^TX)^{-1}X^TŶ can be computed. At the same time, X^T(Y − Ŷ) = σX^T(RU) = 0, since the columns of X and R are mutually orthogonal. Therefore,

(1) b = (X^TX)^{-1}X^TY.

The chosen matrix P can be used to advantage. Since its columns generate the same subspace as the columns of X, there is an invertible square matrix T such that X = PT (its columns are the coefficients of the linear combinations expressing the columns of X in the basis given by P). Substituting, and using the fact that P^TP is the identity matrix:

$$b = (T^TP^TPT)^{-1}T^TP^TY = T^{-1}(T^T)^{-1}T^TP^T(PT\beta + \sigma Z) = \beta + \sigma T^{-1}V.$$

This proves the main properties of the linear model:

Theorem. Consider a linear model Y = Xβ + σZ.
(1) For the estimate Ŷ, Ŷ = Xβ + σPV, and Ŷ ∼ N(Xβ, σ²PP^T).
(2) The residuals and the normed residual sum of squares have distributions Y − Ŷ ∼ N(0, σ²RR^T) and ∥Y − Ŷ∥²/σ² ∼ χ²_{n−k}.
(3) The random variable b = β + σT^{-1}V has distribution b ∼ N(β, σ²(X^TX)^{-1}).
(4) The residual variance satisfies (n − k)S²/σ² ∼ χ²_{n−k}.
(5) The expected value of the residual variance is E S² = σ².
(6) The variables b and S² are independent.

Proof. Both the shape and distribution of Ŷ were determined above. It is clear that Y − Ŷ = σRU, which verifies the second proposition. Further, ∥Y − Ŷ∥²/σ² = ∥RU∥² = ∥U∥², where the last equality follows from the fact that, in the construction, U is the vector of coordinates of the projection of Z onto the orthogonal complement of span X, and RU is this projection. The squared size of a vector is exactly the sum of squares of its coordinates in any orthonormal basis. Therefore, the random variable ∥Y − Ŷ∥²/σ² is the sum of n − k squares of independent random variables with distribution N(0, 1), so it has the distribution χ²_{n−k}, which proves the rest of (2).

The next proposition follows directly from the definitions and the calculations above; it suffices to compute the covariance matrix of b. From the general properties, it is the matrix T^{-1}(T^T)^{-1}, which is the same as (X^TX)^{-1} = ((PT)^T(PT))^{-1}. Proposition (4) is a reformulation of the information in (2). The next proposition follows from the fact that the expected value of the χ² distribution equals its number of degrees of freedom. Finally, the independence of the variables b and S² is a consequence of the fact that the former is a function of the vector V, while the latter is a function of the vector U. These vectors are independent, since they are two complementary parts of an orthogonal transformation of the vector Z. □
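Formula (1) is exactly what numerical libraries implement. A small illustration (Python with NumPy; the design matrix and parameters are made up for the sketch):

    # Least squares estimate in the full-rank linear model Y = X beta + sigma Z.
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 50, 3
    X = rng.normal(size=(n, k))            # known design matrix of rank k
    beta, sigma = np.array([2.0, -1.0, 0.5]), 0.3
    Y = X @ beta + sigma * rng.normal(size=n)

    b = np.linalg.solve(X.T @ X, X.T @ Y)  # b = (X^T X)^{-1} X^T Y
    RSS = np.sum((Y - X @ b) ** 2)         # residual sum of squares
    S2 = RSS / (n - k)                     # residual variance; E S^2 = sigma^2
    print(b, S2)                           # b close to beta, S2 close to 0.09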
In practice, the hypothesis whether fewer parameters suffice to estimate the expected value is sometimes tested. A random vector Y is said to satisfy a submodel if and only if both Y = Xβ + σZ and Y = X⁰β⁰ + σZ, where X⁰ has only q < k columns. It is assumed that the columns of X⁰ generate a subspace of span X, i.e., they are all linear combinations of the columns of X. Repeat the above construction, choosing the matrix P so that its first q vectors generate span X⁰. The matrix P is then of the form (P⁰ P¹), and the vector V decomposes similarly:

$$V = \begin{pmatrix}V^0\\ V^1\end{pmatrix} = \begin{pmatrix}(P^0)^TZ\\ (P^1)^TZ\end{pmatrix}.$$

This yields a finer decomposition of the vectors, of their sizes, and of the corresponding residues:

$$\hat Y^0 = P^0(P^0)^TY = X^0\beta^0 + \sigma P^0V^0,$$
$$Y - \hat Y^0 = \sigma P^1V^1 + \sigma RU,$$
$$\|Y - \hat Y^0\|^2 = \sigma^2\|V^1\|^2 + \sigma^2\|U\|^2,$$
$$(\mathrm{RSS}_0 - \mathrm{RSS})/\sigma^2 = \|V^1\|^2.$$

Therefore, the normed difference of the residues has distribution χ²_{k−q}. It follows immediately that the statistic F, given as the relative difference of the residues, has the Fisher–Snedecor distribution:

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/(k-q)}{\mathrm{RSS}/(n-k)} \sim F_{k-q,n-k}.$$

In practice, the parameter σ is seldom known, and so the estimate S² is used. Instead of the individual components b_j ∼ N(β_j, σ²c_{jj}) of the random vector b, where c_{jj} are the diagonal entries of the matrix C = (X^TX)^{-1}, work with the statistics

$$T_j = \frac{b_j - \beta_j}{S\sqrt{c_{jj}}} \sim t_{n-k}.$$

Of course, these variables need not be independent. If the matrix X is not assumed to have full rank, a pseudoinverse matrix can be used instead of C = (X^TX)^{-1}.

10.3.12. Examples of tests. As an illustration, we mention some applications of linear models in the simplest types of tests. The most trivial case is that of a single sample, where the test is whether or not the only parameter β equals a given value β0. For this case, choose the matrix X as a single column consisting of ones. Then the expression Y = Xβ + σZ means that the individual components of Y are independent variables Y_i ∼ N(β, σ²); it is a random sample of size n from the normal distribution. The general formulae give the estimates

$$b = (X^TX)^{-1}X^TY = \frac1n\sum_{i=1}^n Y_i = \bar Y, \qquad S^2 = \frac{1}{n-1}\|Y - X\bar Y\|^2 = \frac{1}{n-1}\sum_{i=1}^n(Y_i - \bar Y)^2,$$

which are exactly the sample mean and the sample variance used before. In this context, the statistic

$$T = \frac{\bar Y - \beta_0}{S}\sqrt{n}$$

may also be of interest. Testing the hypothesis β = β0 is called the one-sample t-test. The hypothesis is rejected at level α if |T| ≥ t_{n−1}(α).

There is another simple application of the general model, called the paired t-test. It is appropriate when pairs of random vectors W1 = (W_{i1}) and W2 = (W_{i2}) are compared. The differences Y_i = W_{i1} − W_{i2} of their components are assumed to have distribution N(β, σ²). In addition, the variables Y_i need to be independent (which does not mean that the individual pairs W_{i1} and W_{i2} have to be independent!). In the context of our illustrative example from 10.3.4, we can imagine the assessment of two lecturers by the same students. To test the hypothesis that E W_{i1} = E W_{i2} for every i, use the statistic

$$T = \frac{\bar W_1 - \bar W_2}{S}\sqrt{n}.$$
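Both tests are routinely available in software. A minimal sketch of the one-sample t-test (Python with SciPy; the data are made up, and SciPy's ttest_1samp performs the same computation):

    # One-sample t-test of H0: beta = beta0 on a small, made-up sample.
    import numpy as np
    from scipy import stats

    Y = np.array([5.8, 6.1, 5.4, 6.3, 5.9, 6.0, 5.6])
    beta0 = 6.0
    n = len(Y)
    T = (Y.mean() - beta0) / Y.std(ddof=1) * np.sqrt(n)
    p_value = 2 * stats.t.sf(abs(T), df=n - 1)   # two-sided p-value
    print(T, p_value)
    print(stats.ttest_1samp(Y, beta0))           # same statistic and p-value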
Finally, we consider an example with more parameters: the classical case of the regression line. Assume that the variables Y_i, i = 1, . . . , n, have distribution N(β0 + β1x_i, σ²), where x_i are given constants. Examine the best approximation Y_i ≈ b0 + b1x_i; the matrix X of the corresponding linear model is

$$X^T = \begin{pmatrix}1 & 1 & \dots & 1\\ x_1 & x_2 & \dots & x_n\end{pmatrix}.$$

Substitute into the general formulae and compute the estimate

$$\begin{pmatrix}b_0\\ b_1\end{pmatrix} = \begin{pmatrix}n & n\bar x\\ n\bar x & \sum_{i=1}^n x_i^2\end{pmatrix}^{-1}\begin{pmatrix}n\bar Y\\ \sum_{i=1}^n x_iY_i\end{pmatrix} = \Big(\sum_{i=1}^n(x_i-\bar x)^2\Big)^{-1}\begin{pmatrix}\frac1n\sum_{i=1}^n x_i^2 & -\bar x\\ -\bar x & 1\end{pmatrix}\begin{pmatrix}n\bar Y\\ \sum_{i=1}^n x_iY_i\end{pmatrix}.$$

It follows that

$$b_1 = \frac{\sum_{i=1}^n(x_i-\bar x)(Y_i-\bar Y)}{\sum_{i=1}^n(x_i-\bar x)^2}.$$

Finally, compute b0 = Ȳ − b1x̄. From the calculations,

$$\operatorname{var} b_1 = \sigma^2\Big/\sum_{i=1}^n(x_i-\bar x)^2.$$

In order to test the hypothesis whether the expected value of the variable Y does not depend on x, that is, whether H0 has the form β1 = 0, use the statistic

$$T = \frac{b_1}{S}\Big(\sum_{i=1}^n(x_i-\bar x)^2\Big)^{1/2} \sim t_{n-2}.$$

The statistical analysis of multiple regression is similar. There are several sets of values x_{ij}, and the statistical relevance of the approximation Y_i ≈ b0 + b1x_{1i} + · · · + b_kx_{ki} is to be evaluated. The individual statistics T_j allow for a t-test of the dependence of the regression on the individual parameters. Software packages often provide a parameter which expresses how well the values Y_i are approximated, the so-called coefficient of determination:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{i=1}^n(Y_i-\bar Y)^2}.$$

10.3.13. In practice, one often meets problems where the distributions of the statistical data sets are either completely unknown, or errors with nonzero expected value and a non-normal distribution are assumed in the model. In these cases, the application of classical frequentist statistics is very hard or even impossible. There are approaches which work directly with the sample set and derive from it the statistics for point or interval estimates, or probability calculations about them, including the evaluation of standard errors. One of the pioneering articles on this topic is the brief work of Bradley Efron of Stanford University, published in 1981: Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods⁴. The keywords of this article are: balanced repeated replications; bootstrap; delta method; half-sampling; jackknife; infinitesimal jackknife; influence function.

The bootstrap method uses software to create, from the given data sample, many new data samples of the same size (drawn with replacement). The desired statistic (sample mean, variance, etc.) is then evaluated for each of them. After a great number of repetitions of this procedure, a data set is obtained which is considered a relevant approximation of the probability distribution of the examined statistic. The characteristics of this data set are then considered good approximations of the characteristics of the examined statistic for point or interval estimates, analysis of variance, etc. There is not enough space here for a more detailed analysis of these techniques, which are the foundation of non-parametric methods in contemporary statistical software tools.

⁴ Biometrika (1981), 68, 3, pp. 589–99.
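The described resampling procedure fits in a few lines. A minimal sketch (Python with NumPy; the data sample is made up for the illustration) bootstraps the standard error of the sample mean:

    # Bootstrap estimate of the standard error of the sample mean.
    import numpy as np

    rng = np.random.default_rng(1)
    sample = rng.normal(loc=6.0, scale=2.0, size=15)   # the observed data

    B = 10_000                                         # number of resamples
    means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(B)
    ])
    print(means.std(ddof=1))                           # bootstrap standard error
    print(sample.std(ddof=1) / np.sqrt(sample.size))   # classical formula, for comparison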
CHAPTER 11
Number theory

God created the integers, all else is the work of man. – Leopold Kronecker

In this chapter, we will deal with problems concerning integers: mainly divisibility, and solving equations whose domain is the set of integers (or natural numbers). Notice that in this chapter, unlike in the other parts of this book, we will not consider zero to be a natural number, as is usual in this field of mathematics.

Although the natural numbers and the integers are, from a certain point of view, the simplest mathematical structures, the examination of their properties has yielded a good deal of tough problems for generations of mathematicians. These are often problems which can be formulated quite easily, yet many of them remain unsolved so far. Let us mention some of the most popular of them:

• twin primes – the problem is to decide whether there are infinitely many primes p such that p + 2 is also a prime,¹
• Sophie Germain primes – the problem is to decide whether there are infinitely many primes p such that 2p + 1 is also a prime,
• existence of an odd perfect number – i.e., an odd integer the sum of whose positive divisors equals twice the integer (the even examples include 6 and 28),
• Goldbach's conjecture – the problem is to decide whether every even integer greater than 2 can be expressed as the sum of two primes.

A jewel among the problems of number theory is Fermat's Last Theorem – the problem to decide whether there are natural numbers n, x, y, z such that n > 2 and $x^n + y^n = z^n$. Pierre de Fermat formulated this problem as early as 1637; the efforts of many generations were devoted to this question, and it was solved (using results of various fields of mathematics) by Andrew Wiles in 1995.

¹ In 2013, Yitang Zhang published a proof of a promising proposition: for some n < 7·10⁷, there are infinitely many pairs of primes which differ by n. See Y. Zhang, Bounded gaps between primes, Annals of Mathematics, 2013. Although the bound was improved to n = 246 already one year later by James Maynard and Terence Tao, the problem is still open.

1. Fundamental concepts

11.1.1. Divisibility. Recall that we say that an integer a divides an integer b (or that b is divisible by a, or that b is a multiple of a) if and only if there exists an integer c satisfying a·c = b. We write this as a | b. The concept of divisibility can be considered much more generally, as we shall see in 12.3.5.

A. Basic properties of divisibility

Let us recall the basic properties of divisibility, whose proofs follow directly from the definition: the integer 0 is divisible by every integer; the only integer that is divisible by 0 is 0; every integer a satisfies a | a; and every triple of integers a, b, c satisfies the following four implications:

a | b ∧ b | c ⟹ a | c,
a | b ∧ a | c ⟹ (a | b + c) ∧ (a | b − c),
c ≠ 0 ⟹ (a | b ⟺ ac | bc),
a | b ∧ b > 0 ⟹ a ≤ b.

The mere knowledge of these basic rules allows us to solve many problems.

11.A.1. Determine the natural numbers n for which the integer n³ + 1 is divisible by the integer n − 1.

Solution. We have n³ − 1 = (n − 1)(n² + n + 1), so the integer n³ − 1 is divisible by n − 1 for any n. If n − 1 is to divide n³ + 1 as well, it must also divide the difference (n³ + 1) − (n³ − 1) = 2 (see the second property of divisibility above). Since n ∈ N, we have n − 1 ≥ 0. Now, n − 1 | 2 implies that n − 1 = 1 or n − 1 = 2, whence n = 2 or n = 3. The wanted property is thus possessed only by the natural numbers 2 and 3. □

We will often take advantage of one of the most important properties of the integers: the unique Euclidean division (i.e., division with remainder).

Unique division with remainder
Theorem. For any integers a ∈ Z, m ∈ N, there exists a unique pair of integers q ∈ Z, r ∈ {0, 1, . . . , m − 1} satisfying a = qm + r.
Proof. First, we prove the existence of the integers q, r. Fix a natural number m and prove the statement for any a ∈ Z. Assume first that a is non-negative and prove the existence of q, r by induction on a. If 0 ≤ a < m, we can choose q = 0, r = a, and the equality a = qm + r holds trivially. Next, suppose that a ≥ m and that the existence of the integers q, r has been proved for all a′ ∈ {0, 1, 2, . . . , a − 1}. In particular, for a′ = a − m ≥ 0, there are q′, r′ such that a′ = q′m + r′ and r′ ∈ {0, 1, . . . , m − 1}. Therefore, if we select q = q′ + 1, r = r′, we obtain a = a′ + m = (q′ + 1)m + r′ = qm + r, which is what we wanted to prove. Now, if a is negative, then we have already proved that for the positive integer −a, there are q′ ∈ Z, r′ ∈ {0, 1, . . . , m − 1} such that −a = q′m + r′. If r′ = 0, we set r = 0, q = −q′; otherwise (i.e., r′ > 0), we put r = m − r′, q = −q′ − 1. In either case, we get a = q·m + r. Therefore, the integers q, r with the wanted properties exist for every a ∈ Z, m ∈ N.

Finally, we prove the uniqueness. Suppose that there are integers q1, q2 ∈ Z and r1, r2 ∈ {0, 1, . . . , m − 1} which satisfy a = q1m + r1 = q2m + r2. A simple rearrangement yields r1 − r2 = (q2 − q1)m, so m | r1 − r2. However, we have 0 ≤ r1 < m and 0 ≤ r2 < m, whence −m < r1 − r2 < m. Therefore, r1 − r2 = 0, and then (q2 − q1)m = 0, hence q1 = q2, r1 = r2. □

The integers q and r from the theorem are called the quotient and the remainder, respectively, of the division of a by m. The choice of this terminology seems more intuitive if we rearrange the equality a = mq + r into the form $\frac{a}{m} = q + \frac{r}{m}$, where $0 \le \frac{r}{m} < 1$.

11.1.2. Greatest common divisor. One of the most needed tools of computational number theory is the algorithm for computing the greatest common divisor. Since it is a relatively fast procedure, as we are going to show, it is used very often in modern algorithms as well.

11.A.2. Prove that for any a ∈ Z, the following holds:
i) a² leaves remainder 0 or 1 when divided by 4;
ii) a² leaves remainder 0, 1, or 4 when divided by 8;
iii) a⁴ leaves remainder 0 or 1 when divided by 16.

Solution. i) It follows from the Euclidean division theorem that every integer a can be written uniquely in either the form a = 2k or the form a = 2k + 1. Squaring this leads to a² = 4k² or a² = 4(k² + k) + 1, which is what we wanted to prove.
ii) Making use of the above result, we immediately obtain the statement for the (even) integers of the form a = 2k. For odd integers a, we arrived at a² = 4k(k + 1) + 1 above; we get the proposition easily if we realize that k(k + 1) is surely even.
iii) Again, we utilize the results of the previous parts, i.e., a² = 4ℓ or a² = 8ℓ + 1. Squaring these equalities once again, we get a⁴ = (a²)² = 16ℓ² for a even, and a⁴ = (a²)² = (8ℓ + 1)² = 64ℓ² + 16ℓ + 1 = 16(4ℓ² + ℓ) + 1 for a odd. □

11.A.3. Prove that if integers a, b ∈ Z leave remainder 1 when divided by an m ∈ N, then so does their product ab.

Solution. By the Euclidean division theorem, there are s, t ∈ Z such that a = sm + 1, b = tm + 1. Multiplying these equalities leads to the expression ab = (sm + 1)(tm + 1) = (stm + s + t)m + 1, where stm + s + t is the quotient, so the remainder of ab upon division by m equals 1. □
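The constructive proof of the division theorem translates directly into code. A small sketch (Python; the function name is ours, and Python's built-in divmod uses the same convention of a non-negative remainder for positive m):

    # Division with remainder, following the constructive proof above:
    # returns (q, r) with a = q*m + r and 0 <= r < m.
    def euclidean_division(a: int, m: int) -> tuple[int, int]:
        assert m > 0
        if a >= 0:
            q = 0
            while a - (q + 1) * m >= 0:   # peel off copies of m (the induction step)
                q += 1
            return q, a - q * m
        # negative a: divide -a and correct the remainder
        q, r = euclidean_division(-a, m)
        return (-q, 0) if r == 0 else (-q - 1, m - r)

    print(euclidean_division(27, 4), divmod(27, 4))     # (6, 3) twice
    print(euclidean_division(-27, 4), divmod(-27, 4))   # (-7, 1) twice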
It follows from the Euclidean division theorem that the greatest common divisor (a, b) of any pair of integers a, b exists, is unique, and can be computed efficiently by the Euclidean algorithm. At the same time, the coefficients in Bézout's identity, i.e., integers k, l such that ka + lb = (a, b), can be determined along the way. It can also be easily proved directly from the properties of divisibility that the integer linear combinations of integers a, b are exactly the multiples of their greatest common divisor.

11.A.4. Find the greatest common divisor of the integers a = 10175, b = 2277 and determine the corresponding coefficients in Bézout's identity.

Solution. We invoke the Euclidean algorithm:
10175 = 4 · 2277 + 1067,
2277 = 2 · 1067 + 143,
1067 = 7 · 143 + 66,
143 = 2 · 66 + 11,
66 = 6 · 11 + 0.
Therefore, 11 is the greatest common divisor.

Greatest common divisor
Consider integers a, b. An integer m satisfying both m | a and m | b is called a common divisor of a and b. A common divisor m ≥ 0 of a and b which is divisible by every common divisor of a and b is called the greatest common divisor of a and b; it is denoted by (a, b) (or gcd(a, b) for the sake of clarity). The concept of the least common multiple is defined dually; it is denoted by [a, b] (or lcm(a, b)).

It follows directly from the definition that for any a, b ∈ Z, we have (a, b) = (b, a), [a, b] = [b, a], (a, 1) = 1, [a, 1] = |a|, (a, 0) = |a|, [a, 0] = 0.

So far, we have not shown that the greatest common divisor and the least common multiple exist for every pair of integers a, b. However, if they exist, then they are unique, because every pair of non-negative integers k, l satisfies (directly from the definition): if k | l and l | k, then k = l. In the general case of divisibility in integral domains, however, the situation is more complicated – see 12.3.8. Even in the case of the so-called Euclidean domains,² which guarantee the existence of greatest common divisors, the result is determined uniquely only up to multiplication by a unit (an element having a multiplicative inverse) – in the case of the integers, the result would be determined uniquely up to sign; the uniqueness is thus guaranteed by the condition that the greatest common divisor be non-negative.

Euclidean algorithm
Theorem. Let a1, a2 be positive integers. For every n ≥ 3 such that a_{n−1} ≠ 0, let a_n denote the remainder of the division of a_{n−2} by a_{n−1}. Then, after a finite number of steps, we arrive at a_k = 0, and it holds that a_{k−1} = (a1, a2).

Proof. By the Euclidean division, a2 > a3 > a4 > . . .. Since these are non-negative integers, this decreasing sequence cannot be infinite, so we get a_k = 0 after a finite number of steps, with a_{k−1} ≠ 0. From the definition of the integers a_n, it follows that there are integers q1, q2, . . . , q_{k−2} such that
a1 = q1 · a2 + a3,
a2 = q2 · a3 + a4,
...
a_{k−3} = q_{k−3} · a_{k−2} + a_{k−1},
a_{k−2} = q_{k−2} · a_{k−1}.
It follows from the last equality that a_{k−1} | a_{k−2}. Further, a_{k−1} | a_{k−3}, ..., a_{k−1} | a2, a_{k−1} | a1. Therefore, a_{k−1} is a common divisor of the integers a1, a2. On the other hand, any common divisor of the given integers a1, a2 divides the integer a3 = a1 − q1a2 as well, hence it also divides a4 = a2 − q2a3, a5, . . . , and especially a_{k−1} = a_{k−3} − q_{k−3}a_{k−2}. We have thus proved that a_{k−1} is the greatest common divisor of the integers a1, a2. □

² Wikipedia, Euclidean domain, http://en.wikipedia.org/wiki/Euclidean_domain (as of July 29, 2017).
We will express this integer from the particular equalities, resulting in a linear combination of the integers a, b:

11 = 143 − 2 · 66
   = 143 − 2 · (1067 − 7 · 143) = −2 · 1067 + 15 · 143
   = −2 · 1067 + 15 · (2277 − 2 · 1067) = 15 · 2277 − 32 · 1067
   = 15 · 2277 − 32 · (10175 − 4 · 2277) = −32 · 10175 + 143 · 2277.

The wanted expression in the form of Bézout's identity is thus 11 = (−32) · 10175 + 143 · 2277. □

11.A.5. The computation of the greatest common divisor using the Euclidean algorithm is quite fast even for relatively large integers. In this example, we try it out with integers A, B, each of which is a product of two 101-digit primes. Notice that the computation of the greatest common divisor of even such huge integers takes an immeasurably small amount of time. A noticeable amount of time is needed only in the second computation, where the input consists of two integers having more than a million digits. An example in the system SAGE:

    sage: p = next_prime(5*10^100)
    sage: q = next_prime(3*10^100)
    sage: r = next_prime(10^100)
    sage: A = p*q; B = q*r
    sage: time G = gcd(A, B); print G
    Time: CPU 0.00 s, Wall: 0.00 s
    300000000000000000000000000000000000\
    000000000000000000000000000000000000\
    00000000000000000000000000223
    sage: time G = gcd(A^10000+1, B^10000+1)
    Time: CPU 2.47 s, Wall: 2.48 s

11.A.6. Find the greatest common divisor of the integers $2^{49} - 1$ and $2^{35} - 1$, and determine the corresponding coefficients in Bézout's identity.

Solution. Again, we use the Euclidean algorithm. We get:

$2^{49} - 1 = 2^{14}(2^{35} - 1) + 2^{14} - 1,$
$2^{35} - 1 = (2^{21} + 2^{7})(2^{14} - 1) + 2^{7} - 1,$
$2^{14} - 1 = (2^{7} + 1)(2^{7} - 1).$

The wanted greatest common divisor is thus $2^7 - 1 = 127$. Notice that 7 = (49, 35) – see also the following exercise 11.A.7. Reversing this procedure, we find the coefficients k, ℓ in Bézout's identity $2^7 - 1 = k(2^{49} - 1) + \ell(2^{35} - 1)$:

$2^7 - 1 = (2^{35} - 1) - (2^{21} + 2^{7})(2^{14} - 1)$
$= (2^{35} - 1) - (2^{21} + 2^{7})\big((2^{49} - 1) - 2^{14}(2^{35} - 1)\big)$
$= (2^{35} + 2^{21} + 1)(2^{35} - 1) - (2^{21} + 2^{7})(2^{49} - 1).$

Therefore, $k = -(2^{21} + 2^{7})$, $\ell = 2^{35} + 2^{21} + 1$. Bear in mind that these coefficients are never determined uniquely. □

It follows from the previous statement and the fact that (a, b) = (a, −b) = (−a, b) = (−a, −b) holds for any a, b ∈ Z that every pair of integers has a greatest common divisor.

11.1.3. Bézout's theorem. The Euclidean algorithm provides another interesting and often used statement.

Theorem (Bézout). For every pair of integers a, b, there exist integers k, l such that (a, b) = ka + lb.

Proof. It surely suffices to prove the theorem for a, b ∈ N. Notice that if integers r, s can be expressed in the form r = r1a + r2b, s = s1a + s2b, where r1, r2, s1, s2 ∈ Z, then we can also express r + s = (r1 + s1)a + (r2 + s2)b in this way, as well as c·r = (c·r1)a + (c·r2)b for any c ∈ Z, and thus also any integer linear combination of the numbers r and s arising in the course of the Euclidean algorithm. It follows from the Euclidean algorithm (for a1 = a, a2 = b) that we can express in this way a3 = a1 − q1a2, a4 = a2 − q2a3, . . . , hence also the integer a_{k−1} = a_{k−3} − q_{k−3}a_{k−2}, which is (a1, a2). Let us emphasize that the wanted numbers k, l are not determined uniquely. □

The Euclidean algorithm and Bézout's identity are fundamental results of elementary number theory and form one of the pillars of the algorithms used in algebra and number theory.
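The back-substitution in 11.A.4 and 11.A.6 is exactly the extended Euclidean algorithm. A compact iterative sketch (Python; ours, not the book's) reproduces both results:

    # Extended Euclidean algorithm: returns (g, k, l) with g = (a, b) = k*a + l*b.
    def egcd(a: int, b: int):
        k0, l0, k1, l1 = 1, 0, 0, 1        # coefficients expressing a and b themselves
        while b != 0:
            q, r = divmod(a, b)
            a, b = b, r
            k0, l0, k1, l1 = k1, l1, k0 - q * k1, l0 - q * l1
        return a, k0, l0

    print(egcd(10175, 2277))               # (11, -32, 143), as in 11.A.4
    g, k, l = egcd(2**49 - 1, 2**35 - 1)
    print(g, k, l)                         # g = 127, k = -(2^21 + 2^7), l = 2^35 + 2^21 + 1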
11.1.4. Least common multiple. We have so far ignored the properties of the least common multiple. However, thanks to the following proposition, they can be derived from the properties of the greatest common divisor.

Lemma. For every pair of integers a, b, their least common multiple [a, b] exists, and it holds that (a, b) · [a, b] = |a · b|.

Proof. The statement is trivially true if either of the integers a, b is zero. Further, we can assume that both of these (from now on, non-zero) integers are positive, since their signs have no effect on the formula in question. We show that q = a·b/(a, b) is the least common multiple of the integers a, b, which will finish the proof. Since (a, b) is a common divisor of a, b, both a/(a, b) and b/(a, b) are integers, hence

$$q = \frac{ab}{(a,b)} = \frac{a}{(a,b)}\cdot b = \frac{b}{(a,b)}\cdot a$$

is a common multiple of a, b. By Bézout's identity, there are integers k, l such that (a, b) = ka + lb. Let us suppose that n ∈ Z is an arbitrary common multiple of the integers a, b; we want to show that it is divisible by q. We have n/a, n/b ∈ Z, hence the number

$$\frac{n}{b}\cdot k + \frac{n}{a}\cdot l = \frac{n(ka+lb)}{ab} = \frac{n(a,b)}{ab} = \frac{n}{q}$$

is an integer as well. This means that q | n, which is what we wanted to prove. □

11.A.7. Now, let us try to generalize the result of the previous exercise, i.e., prove that for any a, m, n ∈ N, a ≠ 1, it holds that $(a^m - 1, a^n - 1) = a^{(m,n)} - 1$.

Solution. This statement follows easily from the fact that any pair of natural numbers k, ℓ satisfies $a^k - 1 \mid a^\ell - 1$ if and only if k | ℓ. This can be proved by dividing the integer ℓ by the integer k with remainder: we set ℓ = kq + r, where q, r ∈ N0, r < k, and observe that

$$a^{kq+r} - 1 = (a^k - 1)(a^{k(q-1)+r} + a^{k(q-2)+r} + \cdots + a^r) + a^r - 1$$

is the division of the integer $a^{kq+r} - 1$ by the integer $a^k - 1$ with remainder (clearly, $a^r - 1 < a^k - 1$). Hence we easily see that the remainder r is zero if and only if the remainder $a^r - 1$ is zero, which is what we wanted to prove. □

11.A.8. Prove that for all n ∈ N, $25 \mid 4^{2n+1} - 10n - 4$.

Solution. This statement can be proved in several ways (the easiest is in terms of congruences, which will be introduced a bit later). Here, we prove it by induction and then as a consequence of the binomial theorem. The proposition is clearly true for n = 0 (even though the problem does not ask about the situation for n = 0, we can surely prove the desired property for all n ∈ N0, thereby simplifying the first step of the induction). As for the second step: assuming $25 \mid 4^{2n-1} - 10(n-1) - 4$, we also have

$$25 \mid 16\big(4^{2n-1} - 10(n-1) - 4\big) = 4^{2n+1} - 10n - 4 - 150n + 100,$$

whence it easily follows that the desired proposition $25 \mid 4^{2n+1} - 10n - 4$ holds.

The second proof uses the binomial theorem. By that,

$$4^{2n+1} = (5-1)^{2n+1} = \sum_{k=0}^{2n+1}\binom{2n+1}{k}5^{2n+1-k}(-1)^k,$$

where all terms of the sum except the last two are clearly multiples of 25, i.e., the only part of the sum which need not be divisible by 25 is $\binom{2n+1}{2n}5^1(-1)^{2n} + 5^0(-1)^{2n+1} = 5(2n+1) - 1 = 10n + 4$. In other words, $4^{2n+1}$ leaves the same remainder when divided by 25 as the integer 10n + 4, which is equivalent to what we were to prove. □

B. Prime numbers

Euclid's theorem 11.2.1 expresses a fundamental property of primes, as can be seen in the following example.
11.1.5. Coprime integers. Analogously to the case of two integers, we can define the greatest common divisor and the least common multiple of more than two integers, and it is easily proved that

(a1, . . . , an) = ((a1, . . . , a_{n−1}), a_n), [a1, . . . , an] = [[a1, . . . , a_{n−1}], a_n].

Integers a1, a2, . . . , an ∈ Z are said to be coprime (also relatively prime) if and only if (a1, a2, . . . , an) = 1. They are said to be pairwise coprime (pairwise relatively prime) if and only if (a_i, a_j) = 1 for every pair of indices i, j satisfying 1 ≤ i < j ≤ n.

Remark. Note that the concepts coprime and pairwise coprime differ. For example, (6, 10, 15) = 1, yet no two of the three integers 6, 10, 15 are coprime.

Lemma. For any natural numbers a, b, c, we have:
(1) (ac, bc) = (a, b) · c;
(2) if a | bc and (a, b) = 1, then a | c;
(3) d = (a, b) if and only if there are k, l ∈ N such that a = dk, b = dl, and (k, l) = 1.

Proof. (1) Since (a, b) is a common divisor of the integers a, b, the number (a, b)·c is a common divisor of the integers ac, bc, hence (a, b)·c | (ac, bc). From Bézout's identity, we obtain k, l ∈ Z such that (a, b) = ka + lb. Since (ac, bc) is a common divisor of the integers ac, bc, it divides the integer kac + lbc = (a, b)·c as well. We have thus proved that (a, b)·c and (ac, bc) are natural numbers which divide each other, hence they are equal.
(2) Suppose that (a, b) = 1 and a | bc. From Bézout's identity again, we get k, l ∈ Z such that ka + lb = 1, whence c = c(ka + lb) = kca + lbc. Since a | bc, it follows that c, too, is a multiple of a.
(3) Let d = (a, b); then there are q1, q2 ∈ N such that a = dq1, b = dq2. By part (1), d = (a, b) = (dq1, dq2) = d·(q1, q2), so (q1, q2) = 1. On the other hand, if a = dq1, b = dq2, and (q1, q2) = 1, then (a, b) = (dq1, dq2) = d·(q1, q2) = d·1 = d (again invoking part (1) of this lemma). □

2. Primes

The concept of a prime is one of the most important in elementary number theory. Its importance is given mainly by the unique factorization theorem, which is a strong and efficient tool for solving miscellaneous problems from number theory.

11.B.1. i) Prove that if natural numbers m, n are coprime, then so are m² + mn + n² and m² − mn + n².
ii) Prove that if odd natural numbers m, n are coprime, then so are m + 2n and m² + 4n².

Solution. i) To reach a contradiction, suppose that there is a prime p which divides both of the integers m² + mn + n² and m² − mn + n². Then it divides their difference 2mn as well, whence p = 2 or p divides one of the integers m, n. If p = 2, then m² + mn + n² is even, so the integers m and n must both be even as well, which contradicts their coprimality. If p divides m as well as m² + mn + n², then it also divides n², whence, by Euclid's theorem (11.2.1), it divides n as well. This also contradicts the coprimality of m, n. The case p | n is analogous.
ii) Just as above, suppose that there is a prime p which divides both m + 2n and m² + 4n². Then it must also divide (m² + 4n²) − (m + 2n)(m − 2n) = 8n², and since p ≠ 2 (if m + 2n were even, then so would m be), we necessarily have p | n. However, since p divides m + 2n as well, we get p | m, which is a contradiction. □
11.B.2. Prove that an integer n > 1 is prime if and only if n is not divisible by any prime p ≤ √n.

Solution. If n is composite, we have n = ab with appropriate a, b > 1. If both a, b > √n, then we would have n = ab > √n · √n = n, a contradiction. Therefore, n has a divisor (and thus also a prime divisor) not greater than √n. □

The theoretical part contains Euclid's proof of the infinitude of primes and deals in detail with the distribution of primes in the set of natural numbers (in some cases, however, we were forced to leave the mentioned theorems unproved). Now we give several exercises on this topic.

11.B.3. Show that for any natural number n ≥ 3, there is at least one prime between the integers n and n!.

Solution. Let p denote an arbitrary prime dividing the integer n! − 1 (by the Fundamental theorem of arithmetic (11.2.2), such a prime exists since n! − 1 > 1). If we had p ≤ n, then p would divide n! as well, so it could not divide n! − 1. Therefore, n < p. Since p | (n! − 1), we have p ≤ n! − 1, hence p < n!. The prime p thus satisfies the conditions of the problem. □

The result of this exercise also implies the infinitude of primes (it suffices to consider the sequence a0 = 3, a_{n+1} = a_n! for n ∈ N). However, this statement is very weak (compared to reality), since the constructed sequence contains only a "tiny" portion of the primes. On the other hand, we are able to construct arbitrarily long runs of consecutive composite numbers, as shown by exercise 11.B.4 below.

Prime numbers
Every natural number n ≥ 2 has at least two positive divisors: 1 and itself. If it has no other divisors, it is called a prime (number). Otherwise (i.e., if other divisors exist), we talk about a composite (number).

In the subsequent paragraphs, we usually denote primes by the letter p. The first few primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, ... (in particular, the number 1 is considered to be neither prime nor composite, as it is a unit in the ring of the integers). As we shall prove shortly, there are infinitely many primes. However, we have rather limited computational resources when it comes to determining whether a given number is prime: the number $2^{82\,589\,933} - 1$, the greatest known prime as of 2018, has only 24 862 048 digits, so its decimal representation would fit into many a prehistoric data storage device. Printing it as a book would, however, take 5 180 pages (assuming 60 rows on a page and 80 digits in a row).

Now we introduce a theorem which gives a necessary and sufficient condition for being a prime and is thus a fundamental ingredient in the proof of the unique factorization theorem.

11.2.1. Theorem (Euclid's theorem on primes). An integer p ≥ 2 is a prime if and only if the following holds: for every pair of integers a, b, p | ab implies that either p | a or p | b (or both).

Proof. "⟹" Suppose that p is a prime and p | ab, where a, b ∈ Z. Since (p, a) is a positive divisor of p, we have either (p, a) = p or (p, a) = 1. In the former case, we get p | a; in the latter, p | b by part (2) of the previous lemma.
"⟸" If p is not a prime, it has a positive divisor distinct from both 1 and p; denote it by a. Then b = p/a ∈ N and p = ab, with 1 < a < p, 1 < b < p. We have thus found integers a, b such that p | ab while p divides neither a nor b. □
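The criterion from 11.B.2 yields the classical trial-division primality test. A minimal sketch in Python (our illustration, not the book's):

    # Trial division: n > 1 is prime iff no prime p <= sqrt(n) divides it.
    # It suffices to test all integers d with d*d <= n (every divisor has a prime factor).
    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    print([n for n in range(2, 40) if is_prime(n)])   # 2, 3, 5, 7, 11, 13, ...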
11.2.2. Fundamental theorem of arithmetic. In arithmetic operations, we rely heavily on expressing numbers (or more complicated objects) as products of simpler ones. The prime property distinguishes the simplest ones, and we call them irreducible. We shall see in 12.3.5 how to define divisibility in an arbitrary integral domain (i.e., a ring without divisors of zero). In some integral domains (e.g. Q), there are no elements with the prime property. Other integral domains have such elements, and yet they do not satisfy the unique factorization theorem. It is quite similar with the generalization of the aforementioned Euclid's theorem on primes – the elements which satisfy p | ab ⟹ (p | a or p | b) are always irreducible, but the converse is not true in general. Let us mention at least one example of an ambiguous factorization – in Z[√−5], we have:³

6 = 2 · 3 = (1 + √−5) · (1 − √−5).

³ The symbol Z[√−5] denotes the integers extended by a root of the equation x² = −5, defined similarly to how the complex numbers are obtained by adjoining the number √−1 to the reals.

11.B.4. Prove that for any natural number n, there exist n consecutive natural numbers, none of which is prime.

Solution. Examine the integers (n + 1)! + 2, (n + 1)! + 3, . . . , (n + 1)! + (n + 1). For any k ∈ {2, 3, . . . , n + 1}, we have k | (n + 1)!, so k | (n + 1)! + k as well; thus (n + 1)! + k cannot be a prime. Therefore, there is no prime among these n consecutive natural numbers. □

Practical notes. As we will show, it is very complicated to decide for certain whether a given large integer is a prime (on the other hand, for most composite numbers, it is really easy to prove that they are indeed composite – see part 11.6.4). Nevertheless, Indian mathematicians¹ managed to prove in 2002 that there is an algorithm, running in polynomial time with respect to the size of the input (i.e., the number of digits of the integer in question), which decides whether the integer is a prime. We are unable to produce such an algorithm for prime factorization (and it is widely believed to be impossible, although no one has been able to prove this so far). The fastest generally applicable factorization algorithm, the so-called general number field sieve, runs in sub-exponential time $O\big(e^{1.9(\log N)^{1/3}(\log\log N)^{2/3}}\big)$. In 1994, Peter Shor invented an algorithm which factors an integer N in cubic time (i.e., it runs in $O(\log^3 N)$) on a quantum computer. However, this algorithm requires computers with a sufficient number of quantum bits. We can see how difficult this is from the fact that in 2001, IBM managed to use a quantum computer to factor the integer 15, and in 2012, another record was achieved by factoring the integer 143 (in fact using another approach, the so-called adiabatic quantum computation). More evidence of the difficulty of this problem can be found in the challenge made in 1991 by RSA Security.² If anyone managed to factor the integers labeled by the number of their digits as RSA-100, ..., RSA-704, RSA-768, ..., RSA-2048, they could receive respectively 1,000, ..., 30,000, 50,000, ..., 200,000 dollars (the integer RSA-100 was factored by Arjen Lenstra that very year; the integer RSA-704 was factored in 2012; many others have not been factored yet).

Thanks to the unique factorization theorem, we are able (provided we know this factorization) to easily answer questions concerning the number or the sum of the divisors of a given integer. Just as easily, we obtain the (intuitively well-known) procedure for computing the greatest common divisor of two integers from their prime factorizations.

¹ M. Agrawal, N. Kayal, N. Saxena. PRIMES is in P. Annals of Mathematics 160 (2): 781–793. 2004.
² See http://www.rsasecurity.com/rsalabs/node.asp?id=2093.
However, it needs a longer discussion to verify that all of the mentioned factors are really irreducible in Z[√−5].

Fundamental theorem of arithmetic
Theorem. Every natural number n can be expressed as a product of primes, and this expression is unique up to the order of the factors. (If n is a prime, then it is the "product" of a single prime; if n = 1, it is the empty product, i.e., the product of the empty set of primes.)

Proof. First, we prove by complete induction on n that every natural number n can be expressed as a product of primes. We have already discussed the validity of this statement for n = 1. Now suppose that n ≥ 2 and that all natural numbers less than n can be factored into primes. If n is a prime, the statement clearly holds. If n is not a prime, then it has a divisor d with 1 < d < n. Setting e = n/d, we also have 1 < e < n. By the induction hypothesis, both d and e can be expressed as products of primes, so their product d · e = n can be expressed in this way as well.

To prove the uniqueness, consider an equality of products n = p1·p2 · · · ps = q1·q2 · · · qt, where p_i, q_j are primes for all i ∈ {1, . . . , s}, j ∈ {1, . . . , t}. We prove by induction on s that s = t and, after a suitable reordering, p1 = q1, . . . , ps = qs. If s = 1, then p1 = q1 · · · qt is a prime. If we had t > 1, the integer p1 would have a divisor q1 with 1 < q1 < p1 (since q2q3 · · · qt > 1), which is impossible. Therefore, t = 1 and p1 = q1. Now suppose that s ≥ 2 and that the proposition holds for s − 1. It follows from the equality p1·p2 · · · ps = q1·q2 · · · qt that ps divides the product q1 · · · qt, which is, by Euclid's theorem, possible only if ps divides some q_j (after relabeling, we may assume j = t). Since q_t is a prime, it follows that ps = qt. Dividing both sides of the original equality by this integer, we obtain p1·p2 · · · p_{s−1} = q1·q2 · · · q_{t−1}, and the induction hypothesis gives s − 1 = t − 1 and p1 = q1, . . . , p_{s−1} = q_{s−1}. Altogether, s = t and p1 = q1, . . . , ps = qs. This proves the uniqueness, and thus the entire theorem. □

11.2.3. Prime distribution. Don Zagier wrote: "There are two facts about the distribution of prime numbers. The first is that, [they are] the most arbitrary and ornery objects studied by mathematicians: they grow like weeds among the natural numbers, seeming to obey no other law than that of chance, and nobody can predict where the next one will sprout. The second fact is even more astonishing, for it states just the opposite: that the prime numbers exhibit stunning regularity, that there are laws governing their behavior, and that they obey these laws with almost military precision."

11.B.5. Number of divisors and sum of divisors. Prove the following formulae: every integer $a = p_1^{\alpha_1}\cdots p_k^{\alpha_k}$ has exactly

$$\tau(a) = (\alpha_1+1)(\alpha_2+1)\cdots(\alpha_k+1)$$

positive divisors, which sum up to

$$\sigma(a) = \frac{p_1^{\alpha_1+1}-1}{p_1-1}\cdots\frac{p_k^{\alpha_k+1}-1}{p_k-1}.$$

Moreover, let p1, . . . , pk be pairwise distinct primes and α1, ..., αk, β1, ..., βk non-negative integers. Denoting γ_i = min{α_i, β_i}, δ_i = max{α_i, β_i} for every i = 1, 2, . . . , k, we have

$$(p_1^{\alpha_1}\cdots p_k^{\alpha_k},\ p_1^{\beta_1}\cdots p_k^{\beta_k}) = p_1^{\gamma_1}\cdots p_k^{\gamma_k}, \qquad [p_1^{\alpha_1}\cdots p_k^{\alpha_k},\ p_1^{\beta_1}\cdots p_k^{\beta_k}] = p_1^{\delta_1}\cdots p_k^{\delta_k}.$$

Solution. Every positive divisor of the integer $a = p_1^{\alpha_1}\cdots p_k^{\alpha_k}$ is of the form $p_1^{\beta_1}\cdots p_k^{\beta_k}$, where β1, . . . , βk ∈ N0 and β1 ≤ α1, β2 ≤ α2, ..., βk ≤ αk. The justification of all the claims is now a simple consequence of this explicit description of the divisors. To find the number of positive divisors, we employ elementary combinatorics (the product rule) to get τ(a) = (α1 + 1)(α2 + 1) · · · (αk + 1). Next, we see that the formula for the sum of the divisors holds if we rewrite it in the form

$$(1 + p_1 + \cdots + p_1^{\alpha_1})\cdots(1 + p_k + \cdots + p_k^{\alpha_k}),$$

realizing that each pair of parentheses contains the sum of a finite geometric series. The other statements follow directly from the definition. □
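These formulas are easy to check numerically. A small sketch (Python, using SymPy's factorint for the prime factorization; the helper name is ours) compares them against brute force:

    # Number and sum of divisors: the formulas of 11.B.5 vs. a brute-force count.
    from sympy import factorint   # prime factorization, e.g. 12 -> {2: 2, 3: 1}

    def tau_sigma(a: int):
        tau, sigma = 1, 1
        for p, alpha in factorint(a).items():
            tau *= alpha + 1
            sigma *= (p**(alpha + 1) - 1) // (p - 1)
        return tau, sigma

    for a in (12, 28, 10175):
        divisors = [d for d in range(1, a + 1) if a % d == 0]
        print(a, tau_sigma(a), (len(divisors), sum(divisors)))   # the pairs agree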
The sum of all positive divisors of an integer is connected to the so-called perfect numbers. We say that an integer a is perfect if and only if σ(a) = 2a, i.e., if and only if the sum of all positive divisors of a, excluding a itself, equals a. Perfect numbers include, for instance, 6 = 1 + 2 + 3, 28 = 1 + 2 + 4 + 7 + 14, 496, and 8128 (this exhausts all perfect numbers less than 10 000). It can be shown that even perfect numbers are in a tight relation with the so-called Mersenne primes, since the following holds:

11.B.6. Show that a natural number a is an even perfect number if and only if it is of the form $a = 2^{q-1}(2^q - 1)$, where $2^q - 1$ is a prime.

Solution. If $a = 2^{q-1}(2^q - 1)$, where $p = 2^q - 1$ is a prime, then the previous statement yields

$$\sigma(a) = \frac{2^q-1}{2-1}\cdot(p+1) = (2^q-1)\cdot 2^q = 2a.$$

Such an integer a is thus a perfect number. For the opposite direction, consider any even perfect number a, and write $a = 2^k \cdot m$, where m, k ∈ N and 2 ∤ m.

In the next paragraphs, we will discuss the following questions: Are there infinitely many primes? Are there infinitely many primes in every (or at least one) arithmetic sequence? How are the primes distributed among the natural numbers?

The first question has an easy answer, and the fundamental theorem was known to Euclid already around 300 BC:

Theorem (Euclid). There are infinitely many primes among the natural numbers.

Proof. Suppose that there are only finitely many, and let them be denoted by p1, p2, . . . , pn. Set N = p1·p2 · · · pn + 1. This integer, being greater than 1, is either a prime or is divisible by a prime different from p1, . . . , pn (since the primes p1, . . . , pn divide the integer N − 1). In either case, we obtain a prime not in the list, which is a contradiction. □

11.2.4. Next, we mention a rather strong statement whose proof is very laborious (which is why we do not present it), yet it can be done by elementary means⁵.

Theorem (Chebyshev, Bertrand's postulate). For every integer n > 1, there is at least one prime p satisfying n < p < 2n.

11.2.5. Primes are distributed quite uniformly, in the sense that in any "reasonable" arithmetic sequence (i.e., one whose first term and common difference are coprime), there are infinitely many of them. For instance, considering the remainders upon division by 4, there are infinitely many primes with remainder 1 as well as infinitely many primes with remainder 3 (of course, there is no prime with remainder 0 and only one prime with remainder 2). The situation is analogous for remainders upon division by other integers, as explained by the following theorem, whose proof is very difficult.

Theorem (Dirichlet's on primes).
If a, m are coprime natural numbers, then there are infinitely many natural numbers k such that mk + a is a prime. In other words, there are infinitely many primes among the integers 1·m + a, 2·m + a, 3·m + a, . . ..

We can at least present a proof of a special case of this theorem, which is a modification of Euclid's proof of the infinitude of primes.

Proposition. There are infinitely many primes of the form 4k + 3, where k ∈ N0.

⁵ See Wikipedia, Proof of Bertrand's postulate, http://en.wikipedia.org/wiki/Proof_of_Bertrand's_postulate (as of July 29, 2017), or M. Aigner, G. Ziegler, Proofs from THE BOOK, Springer, 2009.

Since the function σ is multiplicative (see 11.3.2), we have $\sigma(a) = \sigma(2^k)\cdot\sigma(m) = (2^{k+1}-1)\cdot\sigma(m)$. However, it follows from a being perfect that $\sigma(a) = 2a = 2^{k+1}\cdot m$, whence $2^{k+1}\cdot m = (2^{k+1}-1)\cdot\sigma(m)$. Since $2^{k+1}-1$ is odd, we must have $2^{k+1}-1 \mid m$, so we can write $m = (2^{k+1}-1)\cdot n$ for an appropriate n ∈ N. Rearranging leads to $2^{k+1}\cdot n = \sigma(m)$. Both m and n divide m (and since $m/n = 2^{k+1}-1 > 1$, these integers are different), hence $\sigma(m) \ge m + n = 2^{k+1}\cdot n = \sigma(m)$, and so σ(m) = m + n. However, this means that m is a prime with the sole divisors m and n = 1, whence $a = 2^k\cdot(2^{k+1}-1)$, where $2^{k+1}-1 = m$ is a prime. □

Remark. On the other hand, attempts to describe odd perfect numbers have been unsuccessful; we do not even know whether an odd perfect number exists. Mersenne primes are primes of the form $2^k - 1$. It should not go unnoticed that Mersenne primes are easily recognizable among all primes – for Mersenne numbers (i.e., numbers of this form, without the primality requirement), there is a simple and fast procedure for verifying whether they are primes. It is thus not by chance that the largest known primes are usually of the form $2^k - 1$. Later, we will show how to efficiently verify whether a given Mersenne number is prime (see the Lucas–Lehmer test in part 11.6.9).

Although it may seem a weird and practically useless business to look for primes as large as possible, it pushes the borders of our knowledge of mathematics forward and refines the methods (as well as the hardware) used. Moreover, the discoverers often benefit from this: the Electronic Frontier Foundation issued EFF Cooperative Computing Awards for finding a prime having at least 10⁶, 10⁷, 10⁸, and 10⁹ digits – the rewards of 50,000 and 100,000 dollars for the first two categories were paid in 2000 and 2009, respectively, in both cases to the GIMPS project. Apparently, it will take a while before the other prizes are awarded.

C. Congruences

In this section, we will see in practice how wielding the basic operations with congruences can streamline our reasoning about various problems. We would be able to solve them without congruences, using only the basic properties of divisibility; however, with the help of congruences, our considerations will often be much shorter and clearer.

11.C.1. Show that for any a, b ∈ Z, m ∈ N, the following conditions are equivalent:
i) a ≡ b (mod m),
ii) a = b + mt for an appropriate t ∈ Z,
iii) m | a − b.

Proof. Suppose the contrary, i.e., that there are only finitely many primes of this form, and let them be denoted by p1 = 3, p2 = 7, p3 = 11, ..., pn. Further, set N = 4p2·p3 · · · pn + 3. Factoring N, the product must contain (according to the result of exercise 11.A.3) at least one prime p of the form 4k + 3. If not, N would be a product of only primes of the form 4k + 1, so N as well would give remainder 1 upon division by 4, which is not true. However, none of the primes p1, p2, . . . , pn can play the role of the mentioned p, since if we had p_i | N for some i ∈ {2, . . . , n}, then we would get p_i | 3. Similarly, 3 ∤ N. We thus get a contradiction with the assumption that there are only finitely many primes of the given form. □
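A quick empirical look at the proposition (Python, using SymPy's isprime; a crude count, not a proof, of how the two odd residue classes modulo 4 fill up with primes):

    # Count primes p <= x in the residue classes 1 and 3 modulo 4.
    from sympy import isprime

    for x in (100, 1000, 10000):
        r1 = sum(1 for p in range(2, x + 1) if isprime(p) and p % 4 == 1)
        r3 = sum(1 for p in range(2, x + 1) if isprime(p) and p % 4 == 3)
        print(x, r1, r3)   # both classes keep growing, roughly in balance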
An analogous elementary proof can be used for primes of the form 3k + 2 or 6k + 5; however, it does not work for primes of the form 3k + 1 or 4k + 1 (think this through!). In the latter case, we will be able to remedy this in part 11.4.11 about quadratic congruences.

11.2.6. From the propositions mentioned in this chapter, one can roughly imagine how "densely" the primes appear among the natural numbers. This is described more accurately (although "only" asymptotically) by the following very important theorem, which was proved independently by J. Hadamard and Ch. J. de la Vallée-Poussin in 1896 (we shall not go into the proof here).

Theorem (Prime Number Theorem). Let π(x) denote the number of primes less than or equal to a number x ∈ R. Then

$$\pi(x) \sim \frac{x}{\ln x},$$

i.e., the quotient of the functions π(x) and x/ln x approaches 1 as x → ∞.

The following table illustrates how good the asymptotic estimate π(x) ∼ x/ln x is in several concrete instances:

    x        π(x)     x/ln x      relative error
    100      25       21.71       0.13
    1000     168      144.76      0.13
    10000    1229     1085.73     0.11
    100000   9592     8685.88     0.09
    500000   41538    38102.89    0.08

The density of primes among the natural numbers is also partially described by the following result of Euler. Recall the formula $\sum_{n\in\mathbb N}\frac{1}{n^2} = \frac{\pi^2}{6}$ (see 7.1.10), which indicates, in view of the next proposition, that the primes are distributed more "densely" in N than the squares.

Proposition. Let P denote the set of all primes. Then

$$\sum_{p\in P}\frac{1}{p} = \infty.$$

Solution. i) ⟹ iii): If a = q1m + r, b = q2m + r, then a − b = (q1 − q2)m.
iii) ⟹ ii): If m | a − b, then there is a t ∈ Z such that m·t = a − b, i.e., a = b + mt.
ii) ⟹ i): If a = b + mt, then, expressing b = mq + r, it follows that a = m(q + t) + r. Therefore, a and b share the same remainder r upon division by m, i.e., a ≡ b (mod m). □

11.C.2. Prove the fundamental properties of congruences stated in 11.3.1.

Solution. i) If a ≡ b (mod m) and c ≡ d (mod m), then, by the previous lemma, there are integers s, t such that a = b + ms, c = d + mt. But then a + c = b + d + m(s + t), and, by the lemma again, a + c ≡ b + d (mod m). Adding the congruence a ≡ b (mod m) to the clearly valid congruence mk ≡ 0 (mod m) leads to a + mk ≡ b (mod m).
ii) If a ≡ b (mod m) and c ≡ d (mod m), there are integers s, t such that a = b + ms, c = d + mt. Then ac = (b + ms)(d + mt) = bd + m(bt + ds + mst), whence ac ≡ bd (mod m).
iii) Let a ≡ b (mod m) and let n be a natural number. Since $a^n - b^n = (a-b)(a^{n-1} + a^{n-2}b + \cdots + b^{n-1})$, it follows that $a^n \equiv b^n \pmod m$ as well.
iv) Suppose that a ≡ b (mod m), a = a1·d, b = b1·d, and (m, d) = 1. By the lemma, the difference a − b = (a1 − b1)·d is divisible by the integer m, and since (m, d) = 1, the integer a1 − b1 is also divisible by m (by lemma 11.1.5). Hence a1 ≡ b1 (mod m). Further, if ad ≡ bd (mod md), i.e., md | ad − bd = (a − b)d, we get directly from the definition of divisibility that m | a − b.
v) If a ≡ b (mod m), then a − b is a multiple of m, hence a multiple of any divisor d of m, so a ≡ b (mod d).
vi) Suppose that a ≡ b (mod m), b = b1d, m = m1d. Then there is an integer t such that a = b + mt = b1d + m1dt = (b1 + m1t)d, hence d | a.
vii) If a ≡ b (mod m1), a ≡ b (mod m2), . . . , a ≡ b (mod mk), then the difference a − b is a common multiple of the integers m1, m2, . . . , mk, and so it is divisible by their least common multiple [m1, m2, . . . , mk], whence a ≡ b (mod [m1, . . . , mk]). □
The density of primes among the natural numbers is also partially described by the following result of Euler. Let us recall the formula $\sum_{n \in \mathbb{N}} \frac{1}{n^2} = \frac{\pi^2}{6}$ (see 7.1.10), which indicates, in view of the next proposition, that the primes are distributed more "densely" in $\mathbb{N}$ than the squares.

Proposition. Let $P$ denote the set of all primes. Then $\sum_{p \in P} \frac{1}{p} = \infty$.

Solution. i) $\Rightarrow$ iii) If $a = q_1 m + r$, $b = q_2 m + r$, then $a - b = (q_1 - q_2) m$.
iii) $\Rightarrow$ ii) If $m \mid a - b$, then there is a $t \in \mathbb{Z}$ such that $m \cdot t = a - b$, i.e., $a = b + mt$.
ii) $\Rightarrow$ i) If $a = b + mt$, then, expressing $b = mq + r$, it follows that $a = m(q + t) + r$. Therefore, $a$ and $b$ share the same remainder $r$ upon division by $m$, i.e., $a \equiv b \pmod m$. □

11.C.2. Prove the fundamental properties of congruences stated in 11.3.1.

Solution. i) If $a \equiv b \pmod m$ and $c \equiv d \pmod m$, then, by the previous lemma, there are integers $s, t$ such that $a = b + ms$, $c = d + mt$. But then $a + c = b + d + m(s + t)$, and, by the lemma again, $a + c \equiv b + d \pmod m$. Adding the congruence $a \equiv b \pmod m$ to $mk \equiv 0 \pmod m$, which is clearly valid, leads to $a + mk \equiv b \pmod m$.
ii) If $a \equiv b \pmod m$ and $c \equiv d \pmod m$, there are integers $s, t$ such that $a = b + ms$, $c = d + mt$. Then $ac = (b + ms)(d + mt) = bd + m(bt + ds + mst)$, whence $ac \equiv bd \pmod m$.
iii) Let $a \equiv b \pmod m$ and let $n$ be a natural number. Since $a^n - b^n = (a - b)(a^{n-1} + a^{n-2} b + \cdots + b^{n-1})$, it follows that $a^n \equiv b^n \pmod m$ as well.
iv) Suppose that $a \equiv b \pmod m$, $a = a_1 \cdot d$, $b = b_1 \cdot d$, and $(m, d) = 1$. By the lemma, the difference $a - b = (a_1 - b_1) \cdot d$ is divisible by the integer $m$, and since $(m, d) = 1$, the integer $a_1 - b_1$ is also divisible by $m$ (by lemma 11.1.5). Hence $a_1 \equiv b_1 \pmod m$. Further, if $ad \equiv bd \pmod{md}$, i.e., $md \mid ad - bd$, we get directly from the definition of divisibility that $m \mid a - b$.
v) If $a \equiv b \pmod m$, then $a - b$ is a multiple of $m$, and hence a multiple of any divisor $d$ of $m$, so $a \equiv b \pmod d$.
vi) Suppose that $a \equiv b \pmod m$, $b = b_1 d$, $m = m_1 d$. Then there is an integer $t$ such that $a = b + mt = b_1 d + m_1 d t = (b_1 + m_1 t) d$, hence $d \mid a$.
vii) If $a \equiv b \pmod{m_1}$, $a \equiv b \pmod{m_2}$, ..., $a \equiv b \pmod{m_k}$, then the difference $a - b$ is a common multiple of the integers $m_1, m_2, \ldots, m_k$, and so it is divisible by their least common multiple $[m_1, m_2, \ldots, m_k]$, whence $a \equiv b \pmod{[m_1, \ldots, m_k]}$. □

Remark. We have already used some properties of congruences without explicitly mentioning it – now, the result of exercise 11.A.3 can be reformulated as "if $a \equiv 1 \pmod m$ and $b \equiv 1 \pmod m$, then also $ab \equiv 1 \pmod m$", which is a special case of property (2) of the theorem above. This is no coincidence: any statement which uses congruences can be reformulated in terms of divisibility. The usefulness of congruences thus lies not in the power to solve more problems than without them, but rather in being a very convenient notation which simplifies both expressions and reasonings.

Proof. Let $n$ be an arbitrary natural number and $p_1, \ldots, p_{\pi(n)}$ all the primes less than or equal to $n$. Let us set
$$\lambda(n) = \prod_{i=1}^{\pi(n)} \left(1 - \frac{1}{p_i}\right)^{-1}.$$
The particular factors can be perceived as sums of geometric series, hence
$$\lambda(n) = \prod_{i=1}^{\pi(n)} \left( \sum_{\alpha_i = 0}^{\infty} \frac{1}{p_i^{\alpha_i}} \right) = \sum \frac{1}{p_1^{\alpha_1} \cdots p_{\pi(n)}^{\alpha_{\pi(n)}}},$$
where we sum over all $\pi(n)$-tuples of non-negative integers $(\alpha_1, \ldots, \alpha_{\pi(n)})$. Since every integer not exceeding $n$ factors into primes from the set $\{p_1, \ldots, p_{\pi(n)}\}$ only, all the fractions $\frac{1}{1}, \frac{1}{2}, \ldots, \frac{1}{n}$ occur among the summands. Therefore,
$$\lambda(n) > 1 + \frac{1}{2} + \cdots + \frac{1}{n},$$
and since the harmonic series diverges (see 5.D.1), we also have $\lim_{n \to \infty} \lambda(n) = \infty$. Taking into account the expansion of the function $\ln(1 + x)$ into a power series (see 6.D.47), we further get
$$\ln \lambda(n) = -\sum_{i=1}^{\pi(n)} \ln\left(1 - \frac{1}{p_i}\right) = \sum_{i=1}^{\pi(n)} \sum_{m=1}^{\infty} \frac{1}{m p_i^m} = p_1^{-1} + \cdots + p_{\pi(n)}^{-1} + \sum_{i=1}^{\pi(n)} \sum_{m=2}^{\infty} \frac{1}{m p_i^m}.$$
Since the inner sum can be bounded from above by
$$\sum_{m=2}^{\infty} \frac{1}{m p_i^m} < \sum_{m=2}^{\infty} p_i^{-m} = p_i^{-2} \left(1 - p_i^{-1}\right)^{-1} \leq 2 p_i^{-2},$$
we get $\ln \lambda(n) < \sum_{i=1}^{\pi(n)} p_i^{-1} + 2 \sum_{i=1}^{\pi(n)} p_i^{-2}$. The second sum converges (since the series $\sum_{n=1}^{\infty} n^{-2}$ does), while $\ln \lambda(n)$ diverges, so the first sum $\sum_{i=1}^{\pi(n)} p_i^{-1}$ must diverge, which is what we wanted to prove. □

3. Congruences and basic theorems

This concept was introduced by C. F. Gauss in 1801 in his book Disquisitiones Arithmeticae. Although simple, its contribution to number theory lies mainly in allowing some reasonings (even quite complicated ones) to be written much more compactly and transparently.

Congruence

If integers $a, b$ give the same remainder $r$ (where $0 \leq r < m$) when divided by a natural number $m$, we say that they are congruent modulo $m$ and write $a \equiv b \pmod m$. In the other case, we say that the integers $a, b$ are not congruent modulo $m$, writing $a \not\equiv b \pmod m$.

11.C.3. i) Find the remainder of the integer $7^{30}$ when divided by 50.
ii) Find the last two digits of the decimal representation of the integer $7^{30}$.

Solution. i) Since $7^2 = 49 \equiv -1 \pmod{50}$, the properties of congruences mentioned above yield $7^{30} \equiv (-1)^{15} = -1 \pmod{50}$, so the remainder of $7^{30}$ upon division by 50 is 49.
ii) Our task is actually to determine the remainder of $7^{30}$ upon division by 100. We know from the above that the integer $7^{30}$ leaves remainder 49 when divided by 50, so the last two digits are either 49 or 99. In particular, we already know that $7^{30} \equiv -1 \pmod{25}$, and we can easily calculate that $7^{30} \equiv (-1)^{30} = 1 \pmod 4$.
Since $(4, 25) = 1$, the wanted pair of digits is 49 (as 49 leaves the desired remainders upon division by both 25 and 4). □

We now show how helpful the notion of congruence can be when solving problems similar to one already solved in 11.A.8 (where induction and the binomial theorem were used).

11.C.4. Prove that for any $n \in \mathbb{N}$, the integer $37^{n+2} + 16^{n+1} + 23^n$ is divisible by 7.

Solution. We have $37 \equiv 16 \equiv 23 \equiv 2 \pmod 7$, so by the basic properties of congruences,
$$37^{n+2} + 16^{n+1} + 23^n \equiv 2^{n+2} + 2^{n+1} + 2^n = 2^n (4 + 2 + 1) \equiv 0 \pmod 7. \quad □$$

11.C.5. Prove that the integer $n = (835^5 + 6)^{18} - 1$ is divisible by 112.

Solution. We factor $112 = 7 \cdot 16$. Since $(7, 16) = 1$, it suffices to show that $7 \mid n$ and $16 \mid n$. We have $835 \equiv 2 \pmod 7$, so
$$n \equiv (2^5 + 6)^{18} - 1 = 38^{18} - 1 \equiv 3^{18} - 1 = 27^6 - 1 \equiv (-1)^6 - 1 = 0 \pmod 7,$$
hence $7 \mid n$. Similarly, $835 \equiv 3 \pmod{16}$, so
$$n \equiv (3^5 + 6)^{18} - 1 = (3 \cdot 81 + 6)^{18} - 1 \equiv (3 \cdot 1 + 6)^{18} - 1 = 9^{18} - 1 = 81^9 - 1 \equiv 1^9 - 1 = 0 \pmod{16},$$
hence $16 \mid n$. Altogether, $112 \mid n$, which was to be proved. □
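Claims of this sort are also easy to sanity-check by brute force; a quick check of 11.C.4 and 11.C.5 (plain Python, using nothing beyond the statements themselves):

```python
# Check 11.C.4: 37**(n+2) + 16**(n+1) + 23**n is divisible by 7 for many n
assert all((37**(n + 2) + 16**(n + 1) + 23**n) % 7 == 0 for n in range(1, 200))

# Check 11.C.5: (835**5 + 6)**18 - 1 is divisible by 112 = 7 * 16;
# the three-argument pow keeps all intermediate numbers small
assert pow(835**5 + 6, 18, 112) == 1
```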
11.C.6. Prove that the following relations hold for a prime $p$:
i) If $k \in \{1, \ldots, p-1\}$, then $p \mid \binom{p}{k}$.
ii) If $a, b \in \mathbb{Z}$, then $a^p + b^p \equiv (a + b)^p \pmod p$.

Solution.

Whenever it is apparent that we are working with congruence relations, we usually omit the symbol $\mathrm{mod}$, or write just $a \equiv b \ (m)$.

11.3.1. Fundamental properties. It follows directly from the definition that congruence modulo $m$ is an equivalence relation. Now we state further properties of congruences; they are proven in detail in the practical column, see 11.C.2.

Properties of congruences

(1) Congruences with respect to the same modulus can be added. An arbitrary multiple of the modulus can be added to either side.
(2) Congruences with respect to the same modulus can be multiplied.
(3) Both sides of a congruence can be raised to the power of the same natural number.
(4) Both sides of a congruence can be divided by their common divisor provided it is coprime to the modulus. Both sides of a congruence together with the modulus can be divided by a positive divisor common to all three of them.
(5) If a congruence is valid with respect to a modulus $m$, it is also valid with respect to any modulus $d$ which divides $m$.
(6) If one side of a congruence and the modulus are divisible by an integer, then this integer must divide the other side of the congruence as well.
(7) If a congruence is valid with respect to moduli $m_1, \ldots, m_k$, it is also valid with respect to their least common multiple $[m_1, \ldots, m_k]$.

Remark. We have already used some properties of congruences without explicitly mentioning it – now, the result of exercise 11.A.3 can be reformulated as "if $a \equiv 1 \pmod m$ and $b \equiv 1 \pmod m$, then also $ab \equiv 1 \pmod m$", which is a special case of property (2) above. This is no coincidence: any statement which uses congruences can be reformulated in terms of divisibility. The usefulness of congruences thus lies not in the power to solve more problems than without them, but rather in being a very convenient notation which simplifies both expressions and reasonings.

11.3.2. Arithmetic functions. By an arithmetic function we mean any function whose domain is the set of natural numbers. Good examples are the number of divisors $\tau(a)$ or their sum $\sigma(a)$, computed in 11.B.5. A prominent example is the following function, counting the numbers coprime to a given $a$ and not exceeding it:

i) Since the binomial coefficient satisfies
$$\binom{p}{k} = \frac{p(p-1) \cdots (p-k+1)}{k!},$$
which is an integer, we know that $k!$ divides the product $p(p-1) \cdots (p-k+1)$. However, since the integer $k!$ is coprime to the prime $p$, we get that $k! \mid (p-1) \cdots (p-k+1)$, whence it follows that $p \mid \binom{p}{k}$.
ii) The binomial theorem implies that
$$(a + b)^p = a^p + \binom{p}{1} a^{p-1} b + \cdots + \binom{p}{p-1} a b^{p-1} + b^p.$$
Thanks to the previous item, we have $\binom{p}{k} \equiv 0 \pmod p$ for any $k \in \{1, \ldots, p-1\}$, whence the statement follows easily. □

11.C.7. Prove that for any natural numbers $m, n$ and any integers $a, b$ such that $a \equiv b \pmod{m^n}$, it is true that $a^m \equiv b^m \pmod{m^{n+1}}$.

Solution. Since clearly $m \mid m^n$, the congruence $a \equiv b \pmod m$ holds, invoking property (5) of 11.3.1. Therefore, considering the algebraic identity
$$a^m - b^m = (a - b)(a^{m-1} + a^{m-2} b + \cdots + a b^{m-2} + b^{m-1}),$$
all the summands in the second pair of parentheses are congruent to $a^{m-1}$ modulo $m$, so
$$a^{m-1} + a^{m-2} b + \cdots + b^{m-1} \equiv m \cdot a^{m-1} \equiv 0 \pmod m.$$
Since $m^n$ divides $a - b$ and the sum $a^{m-1} + a^{m-2} b + \cdots + b^{m-1}$ is divisible by $m$, we get that $m^{n+1}$ must divide their product, which means that $a^m \equiv b^m \pmod{m^{n+1}}$. □

11.C.8. Using the result of the previous exercise (see also 11.A.2), prove that:
i) integers $a$ which are not divisible by 3 satisfy $a^3 \equiv \pm 1 \pmod 9$,
ii) odd integers $a$ satisfy $a^4 \equiv 1 \pmod{16}$.

Solution. i) Cubing the congruence $a \equiv \pm 1 \pmod 3$ (and, again, raising the exponent of the modulus), we get $a^3 \equiv \pm 1 \pmod{3^2}$.
ii) This statement was proved already in the third part of exercise 11.A.2. Now we present another proof. Thanks to part (ii) of the mentioned exercise, we know that every odd integer $a$ satisfies $a^2 \equiv 1 \pmod{2^3}$. Squaring this (and recalling the above exercise) leads to $a^4 \equiv 1^2 \pmod{2^4}$. □

11.C.9. Divisibility rules. We can surely recall the basic rules of divisibility (at least by the numbers 2, 3, 4, 5, 6, 9, and 10) in terms of the decimal representation of a given integer. However, how can these rules be proved, and can they be extended to other divisors as well? We already know that we can restrict ourselves to divisibility by powers of primes (for instance, divisibility by 6 can be tested using divisibility by 2 and 3).

Euler's totient function $\varphi$

For a natural number $n$, we define the value of Euler's totient function $\varphi$ as
$$\varphi(n) = |\{a \in \mathbb{N} \mid 0 < a \leq n, \ (a, n) = 1\}|.$$
For example, $\varphi(1) = 1$, $\varphi(5) = 4$, $\varphi(6) = 2$. If $p$ is a prime, then clearly $\varphi(p) = p - 1$ (all natural numbers less than $p$ are coprime to it).

We are going to prove several important properties of the function $\varphi$. Let us start with a quite simple observation:

Lemma. Let $n \in \mathbb{N}$. Then $\sum_{d \mid n} \varphi(d) = n$.

Proof. Let us consider the $n$ fractions
$$\frac{1}{n}, \frac{2}{n}, \frac{3}{n}, \ldots, \frac{n-1}{n}, \frac{n}{n}.$$
Reducing them to lowest terms and grouping them together by their denominators, we get just the statement in question (see the picture illustrating the case $n = 12$). □

11.3.3. Soon we shall see that knowing the values of Euler's totient function may be extremely useful. The following explicit formula is crucial. We might prove it by the inclusion-exclusion principle, determining the number of integers not coprime to $n$ in a given interval. Here we shall employ a more conceptual approach which is of great interest by itself; thus, the proof is postponed to 11.3.8 below.

Theorem. Let $n \in \mathbb{N}$ factor into primes as $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$. Then
$$\varphi(n) = n \cdot \left(1 - \frac{1}{p_1}\right) \cdots \left(1 - \frac{1}{p_k}\right).$$
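Both the lemma and the theorem (whose proof comes only in 11.3.8) are easy to test numerically; a minimal sketch, with `phi` implementing the product formula just stated via naive trial division:

```python
from math import gcd

def phi(n):
    """Euler's totient via the product formula phi(n) = n * prod(1 - 1/p)."""
    result, m, p = n, n, 2
    while p * p <= m:
        if m % p == 0:
            result -= result // p      # multiply by (1 - 1/p)
            while m % p == 0:
                m //= p
        p += 1
    if m > 1:                          # one prime factor > sqrt(n) may remain
        result -= result // m
    return result

# The definition, the product formula, and the lemma sum_{d|n} phi(d) = n agree:
for n in range(1, 500):
    assert phi(n) == sum(1 for a in range(1, n + 1) if gcd(a, n) == 1)
    assert sum(phi(d) for d in range(1, n + 1) if n % d == 0) == n
```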
11.3.4. Multiplicative functions. Dealing with arithmetic functions, we introduce the concept of "multiplicativity", which concerns only coprime arguments:

Definition. A multiplicative function on the natural numbers is an arithmetic function which, for all pairs of coprime natural numbers $a, b$, satisfies $f(a \cdot b) = f(a) \cdot f(b)$.

The rule for divisibility by 9 says that a given integer is divisible by 9 if and only if its digit sum is. We will prove this as a consequence of a much stronger statement: every integer is congruent to its digit sum modulo 9 (in particular, it is congruent to zero if and only if its digit sum is). And this is trivial to prove: the digit sum of an integer $n = a_k 10^k + a_{k-1} 10^{k-1} + \cdots + a_1 10 + a_0$ is equal to $S(n) = a_k + a_{k-1} + \cdots + a_0$, and since $10^\ell \equiv 1^\ell = 1 \pmod 9$ for any $\ell \in \mathbb{N}_0$, we get
$$n = a_k 10^k + \cdots + a_0 \equiv a_k + \cdots + a_0 = S(n) \pmod 9.$$
This derivation remains valid if we replace 9 with 3.

The rule for divisibility by 11, which we have not mentioned yet, works similarly. Here, we have $10^\ell \equiv (-1)^\ell \pmod{11}$, so we get
$$n = a_k 10^k + \cdots + a_0 \equiv a_k (-1)^k + \cdots + a_1 (-1) + a_0 \equiv (a_0 + a_2 + \cdots) - (a_1 + a_3 + \cdots) \pmod{11}.$$
Therefore, an integer is divisible by 11 if and only if the difference of the sum of its digits at even places and the sum of its digits at odd places is.

There is a nice trick for the divisors 7 and 13: we have $1001 = 7 \cdot 11 \cdot 13$; an integer $n = 1000a + b$ thus satisfies $n \equiv -a + b \pmod m$, where $m$ is any of the numbers 7, 11, 13. Therefore, 2015 is divisible by 13 since $015 - 2 = 13$. Similarly, 2016 is divisible by 7 as $016 - 2 = 14$ is a multiple of 7. We could justify that 2013 is a multiple of 11 in the same way, but the aforementioned criterion $11 \mid (3 + 0) - (1 + 2)$ is perhaps slicker.

Using divisibility for error detection. Let us note that divisibility by eleven is often used for making decimal codes which allow us to detect a single-digit error: if we make such a mistake when copying an integer divisible by eleven, then the resulting integer is surely not divisible by eleven (see the aforementioned criterion of divisibility by eleven). More details can be found in chapter 12.5.1 about coding. For instance, the national identification numbers in the Czech Republic and Slovakia contain a check digit which completes the code into an integer divisible by eleven. Similarly, the numbers of bank accounts managed by Czech banks must comply with a similar (only a bit more complicated) procedure. Both the transformed 6-digit prefix $a_5 a_4 a_3 a_2 a_1 a_0$ and the 10-digit account number $b_9 b_8 b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0$ must satisfy the following condition on divisibility by eleven (here, we mention only the one for the number without the prefix):
$$0 \equiv b_9 2^9 + b_8 2^8 + b_7 2^7 + \cdots + b_3 2^3 + b_2 2^2 + b_1 2^1 + b_0 2^0 \equiv -5 b_9 + 3 b_8 - 4 b_7 - 2 b_6 - b_5 + 5 b_4 - 3 b_3 + 4 b_2 + 2 b_1 + b_0 \pmod{11}.$$
This condition can be described briefly as follows: the account number, perceived as being written in binary (though with usage of decimal digits), is to be divisible by eleven.

Clearly, the multiplicative functions include, for instance, the number of divisors $\tau(n)$ and their sum $\sigma(n)$, cf. 11.B.5. It follows directly from Theorem 11.3.3 that Euler's totient function is a multiplicative arithmetic function, too:

Corollary. Let $a, b \in \mathbb{N}$, $(a, b) = 1$. Then $\varphi(a \cdot b) = \varphi(a) \cdot \varphi(b)$.

Remark. The multiplicativity of $\varphi$ can also be derived directly from the observation that $(n, ab) = 1 \iff (n, a) = 1 \wedge (n, b) = 1$.
Then the easy fact about the totient function on prime powers,
$$\varphi(p^\alpha) = p^\alpha - p^{\alpha-1} = (p - 1) \cdot p^{\alpha-1},$$
leads to the formula for the computation of $\varphi$ in yet another alternative way.

11.3.5. Möbius function. Let us start our detour leading to the proof of Theorem 11.3.3. At the same time, the convolution and inversion formulae introduced here should provide a glimpse into a very important conceptual direction.

Definition. Let a natural number $n$ be factored into distinct primes: $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$. The value of the Möbius function $\mu(n)$ is defined to be 0 if $\alpha_i > 1$ for some $i$ (i.e., if $n$ is divisible by a square), and $(-1)^k$ otherwise. Further, we define $\mu(1) = 1$ (in accordance with the convention that 1 factors into the product of zero primes).

For instance, $\mu(4) = \mu(2^2) = 0$, $\mu(6) = \mu(2 \cdot 3) = (-1)^2 = 1$, $\mu(2) = \mu(3) = -1$. By its very definition, $\mu$ is a multiplicative function, too. In the next paragraphs, we prove several important properties of the Möbius function, especially the so-called Möbius inversion formula. Let us start with a simple observation:

Lemma. For all $n \in \mathbb{N} \setminus \{1\}$, it holds that $\sum_{d \mid n} \mu(d) = 0$.

Proof. Writing $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, all divisors $d$ of $n$ are of the form $d = p_1^{\beta_1} \cdots p_k^{\beta_k}$, where $0 \leq \beta_i \leq \alpha_i$ for all $i \in \{1, \ldots, k\}$. Therefore,
$$\sum_{d \mid n} \mu(d) = \sum_{\substack{(\beta_1, \ldots, \beta_k) \\ 0 \leq \beta_i \leq \alpha_i}} \mu\left(p_1^{\beta_1} \cdots p_k^{\beta_k}\right) = \sum_{(\beta_1, \ldots, \beta_k) \in \{0,1\}^k} \mu\left(p_1^{\beta_1} \cdots p_k^{\beta_k}\right) = \binom{k}{0} + \binom{k}{1}(-1) + \binom{k}{2}(-1)^2 + \cdots + \binom{k}{k}(-1)^k = (1 + (-1))^k = 0.$$
In the third equality, we used a combinatorial reasoning – the summand $\binom{k}{\ell}(-1)^\ell$ gives the contribution of the divisors $d = p_1^{\beta_1} \cdots p_k^{\beta_k}$ with the property that exactly $\ell$ of the exponents $\beta_1, \ldots, \beta_k$ are equal to one; there are $\binom{k}{\ell}$ of them, and each satisfies $\mu(p_1^{\beta_1} \cdots p_k^{\beta_k}) = (-1)^\ell$. □

11.C.10. Verify that the account number of Masaryk University, 85636621, is built correctly.

Solution. We test the condition of divisibility by eleven (padding the number with leading zeros to ten digits):
$$-5 b_9 + 3 b_8 - 4 b_7 - 2 b_6 - b_5 + 5 b_4 - 3 b_3 + 4 b_2 + 2 b_1 + b_0 \equiv -4 \cdot 8 - 2 \cdot 5 - 1 \cdot 6 + 5 \cdot 3 - 3 \cdot 6 + 4 \cdot 6 + 2 \cdot 2 + 1 \cdot 1 \equiv 0 \pmod{11}. \quad □$$
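The weighted check from 11.C.10 is a one-liner to automate; a minimal sketch (the function name is ours; the weights $2^i \bmod 11$ are exactly those derived above):

```python
def czech_account_ok(number: str) -> bool:
    """Check the divisibility-by-eleven condition for an (up to 10-digit)
    Czech bank account number: sum of digit * 2**position ≡ 0 (mod 11)."""
    digits = number.zfill(10)            # pad with leading zeros to b9..b0
    total = sum(int(d) * 2**i for i, d in enumerate(reversed(digits)))
    return total % 11 == 0

print(czech_account_ok("85636621"))      # True -- the Masaryk University account
```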
Euler's totient function. The totient function $\varphi$ assigns to a natural number $m$ the number of natural numbers which are less than or equal to $m$ and coprime to $m$, which can be written as
$$\varphi(m) = |\{a \in \mathbb{N} \mid 0 < a \leq m, \ (a, m) = 1\}|.$$
However, to be able to evaluate it efficiently, one needs to know the factorization of the input integer $m$ into primes. In such a case, for $m = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, we have
$$\varphi(m) = (p_1 - 1) p_1^{\alpha_1 - 1} \cdots (p_k - 1) p_k^{\alpha_k - 1}.$$
In particular, we know that $\varphi(p^\alpha) = (p - 1) \cdot p^{\alpha-1}$ and that $\varphi(m \cdot n) = \varphi(m) \cdot \varphi(n)$ holds whenever $m, n$ are coprime.

11.C.11. Calculate $\varphi(72)$.

Solution. $72 = 2^3 \cdot 3^2 \implies \varphi(72) = 72 \cdot (1 - \frac{1}{2}) \cdot (1 - \frac{1}{3}) = 24$; alternatively, $\varphi(72) = \varphi(8) \cdot \varphi(9) = 4 \cdot 6 = 24$. □

11.C.12. i) Determine all natural numbers $n$ for which $\varphi(n)$ is odd.
ii) Prove that $\forall n \in \mathbb{N}: \varphi(4n + 2) = \varphi(2n + 1)$.

Solution. i) We clearly have $\varphi(1) = \varphi(2) = 1$. Every integer $n \geq 3$ is either divisible by an odd prime $p$ (and then $\varphi(n)$ is divisible by $p - 1$, which is an even integer), or $n$ is a (higher-than-first) power of two (and then $\varphi(2^\alpha) = 2^{\alpha-1}$ is even as well). Altogether, $\varphi(n)$ is odd only for $n = 1, 2$.
ii) The integer $2n + 1$ is odd, so $(2, 2n + 1) = 1$, and hence
$$\varphi(4n + 2) = \varphi(2 \cdot (2n + 1)) = \varphi(2) \cdot \varphi(2n + 1) = \varphi(2n + 1). \quad □$$

11.C.13. Find all natural numbers $m$ for which:
i) $\varphi(m) = 30$, ii) $\varphi(m) = 34$, iii) $\varphi(m) = 20$, iv) $\varphi(m) = \frac{m}{3}$.

Solution. In the first three cases, we are looking for the preimages (fibers) of a given integer $a$ under $\varphi$, writing $m = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, and we proceed as follows:
• Since $\varphi(m) = (p_1 - 1) p_1^{\alpha_1 - 1} \cdots (p_k - 1) p_k^{\alpha_k - 1} = a$, every prime $p$ which divides $m$ must satisfy $p - 1 \mid a$.
• Similarly, every prime $p$ whose higher power divides $m$ must divide $a$. More exactly, we must even have $p^{\alpha - 1} \mid a$.
• This procedure results in a finite set of candidates for $m$, which can be eliminated in a convenient way, sometimes also using the fact that any prime dividing $a$ must occur in the factorization of $\varphi(m)$ as a divisor of some $p - 1$ or in a prime power $p^{\alpha-1}$.

11.3.6. Dirichlet convolution. There is another concept which is tightly connected to the Möbius function: the so-called Dirichlet product (also Dirichlet convolution).

Definition. Let $f, g$ be arithmetic functions. Their Dirichlet product is defined as
$$(f \circ g)(n) = \sum_{d \mid n} f(d) \cdot g\left(\frac{n}{d}\right) = \sum_{d_1 d_2 = n} f(d_1) \cdot g(d_2).$$

The next observation is nearly obvious:

Lemma. The Dirichlet product is associative.

Proof. $((f \circ g) \circ h)(n) = \sum_{d_1 d_2 d_3 = n} f(d_1) g(d_2) h(d_3) = (f \circ (g \circ h))(n)$. □

11.3.7. Möbius inversion formula. Let us define two useful functions $i$ and $I$ by $i(1) = 1$, $i(n) = 0$ for all $n > 1$, and $I(n) = 1$ for all $n \in \mathbb{N}$. Then every arithmetic function $f$ satisfies
$$f \circ i = i \circ f = f, \qquad (I \circ f)(n) = (f \circ I)(n) = \sum_{d \mid n} f(d).$$
Further, notice that the Möbius function commutes with $I$, and the result is $i$ (the statement is clear for $n = 1$; for $n > 1$ we use the above Lemma 11.3.5):
$$(I \circ \mu)(n) = \sum_{d \mid n} I(d) \mu\left(\frac{n}{d}\right) = \sum_{d \mid n} I\left(\frac{n}{d}\right) \mu(d) = \sum_{d \mid n} \mu(d) = 0.$$

Theorem (Möbius inversion formula). Let an arithmetic function $F$ be defined in terms of an arithmetic function $f$ by $F(n) = \sum_{d \mid n} f(d)$. Then $f$ can be expressed as
$$f(n) = \sum_{d \mid n} \mu\left(\frac{n}{d}\right) \cdot F(d).$$

Proof. The relation $F(n) = \sum_{d \mid n} f(d)$ can be rewritten as $F = f \circ I$. Therefore, $F \circ \mu = (f \circ I) \circ \mu = f \circ (I \circ \mu) = f \circ i = f$, which is the statement of the theorem. □

11.3.8. Proof of Theorem 11.3.3. Invoking Lemma 11.3.2 and the Möbius inversion formula, we get
$$\varphi(n) = \sum_{d \mid n} \mu(d) \frac{n}{d} = n - \frac{n}{p_1} - \cdots - \frac{n}{p_k} + \cdots + (-1)^k \frac{n}{p_1 \cdots p_k} = n \cdot \left(1 - \frac{1}{p_1}\right) \cdots \left(1 - \frac{1}{p_k}\right),$$
and the explicit formula for $\varphi(n)$ is proved.
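A quick numerical sanity check of the inversion formula, reusing the `phi` sketch above together with a straightforward implementation of $\mu$ (again by naive trial division):

```python
def mobius(n):
    """Möbius function: 0 if a square divides n,
    otherwise (-1)**(number of distinct prime factors)."""
    k, p = 0, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0        # p**2 divides the original n
            k += 1
        p += 1
    if n > 1:
        k += 1
    return (-1) ** k

# Möbius inversion applied to F(n) = sum_{d|n} phi(d) = n must give phi back:
divisors = lambda n: [d for d in range(1, n + 1) if n % d == 0]
for n in range(1, 300):
    assert phi(n) == sum(mobius(n // d) * d for d in divisors(n))
```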
Now, let us solve problems i)–iii):

i) Every prime $p$ from the factorization of $m$ must satisfy $p - 1 \mid 30$, so $p - 1 \in \{1, 2, 3, 5, 6, 10, 15, 30\}$, which is satisfied by the primes $p \in \{2, 3, 7, 11, 31\}$, and only 2 and 3 of them can divide $m$ in a higher power than the first. Therefore, $m = 2^\alpha 3^\beta 7^\gamma 11^\delta 31^\varepsilon$, where $\alpha, \beta \in \{0, 1, 2\}$, $\gamma, \delta, \varepsilon \in \{0, 1\}$. The analysis of the possibilities can be further simplified if we realize that $\varphi(3) = 2$, $\varphi(3^2) = \varphi(7) = 6$, $\varphi(11) = 10$ are all integers which divide 30 with an odd quotient greater than 1. Therefore, if we had, for instance, $m = 7 \cdot m_1$, where $7 \nmid m_1$, then we would also have $\varphi(m_1) = 5$, which is impossible, as we know from the previous exercise. We thus get $\beta = \gamma = \delta = 0$ and $m = 2^\alpha \cdot 31^\varepsilon$, whence we easily obtain the solution $m \in \{31, 62\}$.

ii) Similarly to the above, only the primes $p \in \{2, 3\}$ can divide $m$, and the prime 3 can divide $m$ only in the first power. However, since $34/\varphi(3) = 17$ is odd, the prime 3 cannot divide $m$ at all. The remaining possibility, $m = 2^\alpha$, leads to $34 = 2^{\alpha-1}$, which is also impossible. Therefore, there is no such number $m$.

iii) Now, every prime $p$ dividing $m$ must satisfy $p - 1 \mid 20$, so $p - 1 \in \{1, 2, 4, 5, 10, 20\}$, which is satisfied by the primes $p \in \{2, 3, 5, 11\}$, and only 2 and 5 of those can divide $m$ in a higher power. We thus have $m = 2^\alpha 3^\beta 5^\gamma 11^\delta$, where $\alpha \in \{0, 1, 2, 3\}$, $\gamma \in \{0, 1, 2\}$, $\beta, \delta \in \{0, 1\}$. First, consider $\delta = 1$. Then $\varphi(2^\alpha 3^\beta 5^\gamma) = 2$, whence we easily get that $\gamma = 0$ and $(\alpha, \beta) \in \{(2, 0), (1, 1), (0, 1)\}$, which gives three solutions: $m \in \{44, 66, 33\}$. Further, let $\delta = 0$. If $\gamma = 2$, then $\varphi(2^\alpha 3^\beta) = 1$, whence $(\alpha, \beta) \in \{(1, 0), (0, 0)\}$; we thus obtain two more solutions, $m \in \{50, 25\}$. If $\gamma = 1$, then we get $20/\varphi(5) = 5$, similarly to the above item; this is an odd integer, so there are no solutions in this case. The same holds for $\gamma = 0$, since the equation $\varphi(2^\alpha) = 20$ has no solution either. Altogether, there are five satisfactory values, $m \in \{25, 33, 44, 50, 66\}$.

iv) This problem is of a different kind than the previous ones, so we must approach it differently. The relation $\varphi(m) = \frac{m}{3}$ implies that $m$ must be a multiple of three (since the left-hand side of the equation is an integer). We will thus look for the solution in the form $m = 3^\alpha \cdot n$, where $3 \nmid n$, $\alpha \geq 1$. Then $\varphi(m) = 2 \cdot 3^{\alpha-1} \cdot \varphi(n) = \frac{m}{3} = 3^{\alpha-1} \cdot n$.

11.3.9. Remark. [Placeholder in the source: a remark (by T. Perutka) on the ring of arithmetic functions should appear here; it is also meant to fill the space required by the large number of exercises on Euler's function in the practical column.]

Möbius inversion formula and irreducible polynomials. Above, we proved the properties of Euler's totient function using the Möbius inversion formula. (A caveat: this subsection is somewhat premature – finite fields and polynomial rings are treated in detail only much later.) The standard form of this formula connects the expression of an arithmetic function $F$ in terms of a function $f$,
$$F(n) = \sum_{d \mid n} f(d),$$
to the inverse expression of the function $f$ in terms of the function $F$,
$$f(n) = \sum_{d \mid n} \mu\left(\frac{n}{d}\right) \cdot F(d).$$
The value $\mu(n)$ depends on the prime factorization of the input value $n$ as follows:
• if a square of a prime divides $n$, then $\mu(n) = 0$;
• otherwise, we set $\mu(n) = (-1)^k$, where $k$ is the number of primes which divide $n$.

Reducing this leads to $2 \varphi(n) = n$ or, equivalently, $\varphi(n) = \frac{n}{2}$. Here we must have $2 \mid n$, and writing $n = 2^\beta \cdot k$, where $(k, 6) = 1$ and $\beta \geq 1$, we get $\varphi(k) = k$, which is satisfied only by $k = 1$. To summarize, the problem is solved exactly by the natural numbers of the form $2^\alpha 3^\beta$, where $\alpha, \beta \geq 1$. □

11.C.14. Find all two-digit numbers $n$ for which $9 \mid \varphi(n)$. ⃝

11.C.15. Fermat's (little) theorem. Now we prove Fermat's little theorem 11.3.11 in two more ways: by mathematical induction, and then by combinatorial means. The theorem states that for any integer $a$ and any prime $p$ which does not divide $a$, it holds that $a^{p-1} \equiv 1 \pmod p$.

Solution. First, we prove (by induction on $a$) the apparently equivalent statement that $a^p \equiv a \pmod p$ holds for any $a \in \mathbb{Z}$ and prime $p$. For $a = 1$, there is nothing to prove. Further, let us assume that the proposition holds for $a$ and prove its validity for $a + 1$. It follows from the induction hypothesis and exercise 11.C.6 that
$$(a + 1)^p \equiv a^p + 1^p \equiv a + 1 \pmod p,$$
which is what we were to prove. The statement holds trivially for $a = 0$ as well as in the case $a < 0$, $p = 2$. The validity for $a < 0$ and $p$ odd can be obtained easily from the above: since $-a$ is a positive integer, we get $-a^p = (-a)^p \equiv -a \pmod p$, whence $a^p \equiv a \pmod p$.

The combinatorial proof is a somewhat "cunning" one. Similarly to problems using Burnside's lemma (see exercise 12.G.1), we are to determine how many necklaces can be created by wiring a given number of beads, of which there is a given number of types. Having $a$ types of beads, there are clearly $a^p$ strings of length $p$, $a$ of which consist of a single bead type. From now on, we are interested only in the other ones, of which there are $a^p - a$. Apparently, each necklace is transformed into itself by rotating by $p$ beads. In general, a necklace can be transformed into itself by rotating by another number of beads, but this number can never be coprime to $p$ (for instance, considering $p = 8$ and the necklace ABABABAB, rotations by 2, 4, or 6 beads leave it unchanged). However, if $p$ is a prime, it follows that all rotations lead to different necklaces. Therefore, if we do not distinguish necklaces which differ by rotation only (i.e., in the position of the "knot"), there are exactly $\frac{a^p - a}{p}$ of them, which especially means that $p \mid a^p - a$.

As an example, let us consider the case $a = 2$, $p = 5$, i.e., necklaces of length 5 consisting of 2 bead types (A, B). There are $2^5 = 32$ strings in total, 2 of which consist of a single bead type (AAAAA, BBBBB). Leaving these out and ignoring the position of the knot, there remain $\frac{2^5 - 2}{5} = 6$ necklaces which differ not merely by rotation, namely ABBBB, AABBB, AAABB, AAAAB, ABABB, AABAB. □

This formula can be generalized in many ways – especially to the case where the functions $F$ and $f$ map the natural numbers into an abelian group $(G, \cdot)$. In this case, the formula (considering the operation in $G$ to be written multiplicatively) takes the form
$$f(n) = \prod_{d \mid n} F(d)^{\mu\left(\frac{n}{d}\right)}.$$

Now, we will demonstrate the use of the Möbius inversion formula on a more complex example from the theory of finite fields. Let us consider the $p$-element field $\mathbb{F}_p$ (i.e., the ring of residue classes modulo a prime $p$) and examine the number $N_d$ of monic irreducible polynomials of a given degree $d$ over this field.
Let $S_d(x)$ denote the product of all such polynomials. Now, we borrow a (not very hard) theorem from the theory of finite fields which states that for all $n \in \mathbb{N}$, we have
$$x^{p^n} - x = \prod_{d \mid n} S_d(x).$$
Comparing the degrees of the polynomials on both sides yields
$$p^n = \sum_{d \mid n} d N_d,$$
whence we get, by applying the standard Möbius inversion formula,
$$N_n = \frac{1}{n} \sum_{d \mid n} \mu\left(\frac{n}{d}\right) p^d.$$
In particular, we can see that for any $n \in \mathbb{N}$,
$$N_n = \frac{1}{n}\left(p^n - \cdots + \mu(n) p\right) \neq 0,$$
since the expression in the parentheses is a sum of distinct powers of $p$ multiplied by coefficients $\pm 1$, so it cannot be equal to 0. Therefore, there exist irreducible polynomials over $\mathbb{F}_p$ of an arbitrary degree $n$, and hence there are finite fields $\mathbb{F}_{p^n}$ (having $p^n$ elements) for any prime $p$ and natural number $n$ (in the theory of field extensions, such a field is constructed as the quotient ring $\mathbb{F}_p[x]/(f)$ of the ring of polynomials over $\mathbb{F}_p$ modulo the ideal generated by an irreducible polynomial $f \in \mathbb{F}_p[x]$ of degree $n$, whose existence has just been proved).

11.3.10. Example. By the formula we have proved, the number of monic irreducible polynomials over $\mathbb{F}_2$ of degree 5 is equal to
$$N_5 = \frac{1}{5} \sum_{d \mid 5} \mu\left(\frac{5}{d}\right) 2^d = \frac{1}{5}\left(\mu(1) \cdot 2^5 + \mu(5) \cdot 2\right) = 6.$$
The number of monic irreducible polynomials over $\mathbb{F}_3$ of degree four is then
$$N_4 = \frac{1}{4} \sum_{d \mid 4} \mu\left(\frac{4}{d}\right) 3^d = \frac{1}{4}\left(\mu(1) \cdot 3^4 + \mu(2) \cdot 3^2 + \mu(4) \cdot 3\right) = \frac{1}{4}(81 - 9) = 18.$$
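The count $N_n$ is easy to tabulate; a small sketch reusing the `mobius` helper defined in the earlier snippet:

```python
def monic_irreducible_count(p, n):
    """Number of monic irreducible polynomials of degree n over F_p:
    N_n = (1/n) * sum_{d|n} mu(n/d) * p**d."""
    total = sum(mobius(n // d) * p**d for d in range(1, n + 1) if n % d == 0)
    assert total % n == 0
    return total // n

print(monic_irreducible_count(2, 5))   # 6, as computed in 11.3.10
print(monic_irreducible_count(3, 4))   # 18
```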
11.3.11. The next two theorems belong to the most important results of elementary number theory, and they will often be applied in further theoretical as well as practical problems.

11.C.16. i) Determine the remainder of the integer $2^{50} + 3^{50} + 4^{50}$ when divided by 17.
ii) Determine the remainder of the integer $2^{181} + 3^{181} + 5^{181}$ when divided by 37.

Solution. i) By Fermat's theorem, we have $2^{16} \equiv 3^{16} \equiv 4^{16} \equiv 1 \pmod{17}$. Since $50 \equiv 2 \pmod{16}$, we get
$$2^{50} + 3^{50} + 4^{50} \equiv 2^2 + 3^2 + 4^2 \equiv 12 \pmod{17}.$$
ii) Similarly, $2^{36} \equiv 3^{36} \equiv 5^{36} \equiv 1 \pmod{37}$, and since $181 \equiv 1 \pmod{36}$, we get
$$2^{181} + 3^{181} + 5^{181} \equiv 2 + 3 + 5 \equiv 10 \pmod{37}. \quad □$$

Euler's theorem and orders of integers modulo $m$. Thanks to Euler's theorem, it is guaranteed that every integer $a$ which is coprime to $m$ has an order, i.e., a least natural number $n$ such that $a^n \equiv 1 \pmod m$. The most interesting are those integers $a$ whose order equals $\varphi(m)$; they are called the primitive roots modulo $m$.

11.C.17. Determine the order of 2 modulo 7.

Solution. The order of 2 modulo 7 is equal to 3, as $2^1 = 2 \not\equiv 1 \pmod 7$, $2^2 = 4 \not\equiv 1 \pmod 7$, $2^3 = 8 \equiv 1 \pmod 7$. □

11.C.18. Determine the last two digits of the number $7^{2019}$.

Solution. We can easily see that the order of 7 modulo 100 is equal to 4 – by simple calculations, we have $7^2 = 49$ and $49^2 = (50 - 1)^2 = 50^2 - 2 \cdot 50 + 1 \equiv 1 \pmod{100}$. Therefore, it suffices to determine the remainder $r$ of the integer 2019 when divided by 4, since $7^{2019} \equiv 7^r \pmod{100}$. Apparently, we have $r = 3$, so the wanted last two digits are the same as those of $7 \cdot 49$, i.e., 43. □

Now we mention several statements about the properties of the order of an integer modulo $m$.

11.C.19. Let $m \in \mathbb{N}$, $a, b \in \mathbb{Z}$, $(a, m) = (b, m) = 1$. Prove that if $a \equiv b \pmod m$, then the integers $a, b$ share the same order modulo $m$.

Solution. Raising the congruence $a \equiv b \pmod m$ to the $n$-th power leads to $a^n \equiv b^n \pmod m$, so $a^n \equiv 1 \pmod m \iff b^n \equiv 1 \pmod m$. □

11.C.20. Let $m \in \mathbb{N}$, $a \in \mathbb{Z}$, $(a, m) = 1$. If the order of $a$ modulo $m$ is $r \cdot s$ (where $r, s \in \mathbb{N}$), prove that the order of the integer $a^r$ modulo $m$ is $s$.

Solution. Since none of the integers $a, a^2, a^3, \ldots, a^{rs-1}$ is congruent to 1 modulo $m$, neither is any of the integers $a^r, a^{2r}, a^{3r}, \ldots, a^{(s-1)r}$. On the other hand, we have $(a^r)^s \equiv 1 \pmod m$, so the order of $a^r$ modulo $m$ equals $s$. □

Fermat's little theorem

Theorem. Let $a$ be an integer and $p$ a prime, $p \nmid a$. Then $a^{p-1} \equiv 1 \pmod p$.

Proof. The statement will follow as a simple consequence of Euler's theorem (and together with it, it is a consequence of the more general Lagrange's theorem 12.4.10). However, it can also be proved directly, by mathematical induction or by combinatorial means, as mentioned in exercise 11.C.15. □

Sometimes, Fermat's little theorem is presented in the following form, which is apparently equivalent to the original statement.

Corollary. Let $a$ be an integer and $p$ a prime. Then $a^p \equiv a \pmod p$.

11.3.12. Euler's theorem. We are coming to a theorem which has immediate striking applications in cryptography (we shall come to this point in 11.6.18). Before formulating and proving Euler's theorem, we introduce a few useful concepts.

Residue systems

A complete residue system modulo $m$ is an arbitrary $m$-tuple of integers which are pairwise incongruent modulo $m$ (the most commonly used $m$-tuple is $0, 1, \ldots, m-1$ or, for odd $m$, its "symmetric" variant $-\frac{m-1}{2}, \ldots, -1, 0, 1, \ldots, \frac{m-1}{2}$). A reduced residue system modulo $m$ is an arbitrary $\varphi(m)$-tuple of integers which are pairwise incongruent modulo $m$ and coprime to $m$.

Lemma. Let $x_1, x_2, \ldots, x_{\varphi(m)}$ form a reduced residue system modulo $m$. If $a \in \mathbb{Z}$, $(a, m) = 1$, then the integers $a x_1, \ldots, a x_{\varphi(m)}$ also form a reduced residue system modulo $m$.

Proof. Since $(a, m) = 1$ and $(x_i, m) = 1$, we have $(a x_i, m) = 1$. Further, if we had $a x_i \equiv a x_j \pmod m$ for some distinct indices $i, j$, dividing both sides of the congruence by the integer $a$ (which is coprime to $m$) would lead to $x_i \equiv x_j \pmod m$, meaning that the original $\varphi(m)$-tuple was not a reduced residue system either. □

Euler's theorem

Theorem. Let $a \in \mathbb{Z}$, $m \in \mathbb{N}$, $(a, m) = 1$. Then $a^{\varphi(m)} \equiv 1 \pmod m$.

Proof. Let $x_1, x_2, \ldots, x_{\varphi(m)}$ be an arbitrary reduced residue system modulo $m$. By the previous lemma, $a x_1, \ldots, a x_{\varphi(m)}$ is also a reduced residue system modulo $m$.

11.C.21. Show that the converse of the previous statement need not be true in general.

Solution. Indeed, even if the order of an integer $a^r$ modulo $m$ is $s$, the order of $a$ modulo $m$ may not be $r \cdot s$. For instance, for $m = 13$ and the integers $a = 3$, $b = -4$, we have $a^2 = 9$, $a^3 = 27 \equiv 1 \pmod{13}$, so the order of $a$ modulo 13 is 3. Similarly, $b^2 = 16 \not\equiv 1 \pmod{13}$, $b^3 = -64 \equiv 1 \pmod{13}$, so the order of $b$ modulo 13 is 3, too. On the other hand, $b^2 = (-4)^2 = 16 \equiv 3 = a \pmod{13}$ has the same order (3) as $a$, yet the integer $b$ does not have order $2 \cdot 3$. □

11.C.22. Determine the last digit of the numbers i) $35^{79}$, ii) $37^{37^{37}}$, iii) $12^{13^{14}}$. ⃝

11.C.23. It holds for all odd $n \in \mathbb{N}$ that $n \mid 2^{n!} - 1$. Prove this! ⃝

11.C.24. i) Determine the last digit of the number $7^{9^{573}}$.
ii) Determine the remainder of the number $15^{14^{13}}$ when divided by 11.

Solution. i) The order of 7 modulo 100 is equal to 4 by exercise 11.C.18, so it suffices to find the remainder of the (extremely large) exponent upon division by 4. Since $9 \equiv 1 \pmod 4$, the entire exponent $9^{573}$ leaves remainder 1 as well. Therefore, the wanted digit is $7^1 = 7$.
ii) The order of $15 \equiv 4 \pmod{11}$ is 5 (which can be found by direct computation, or from the fact that 2 is a primitive root modulo 11 (see also 11.C.28); theorem 11.3.14 then yields that the order of $4 = 2^2$ is $\frac{10}{(10, 2)} = 5$). It is thus sufficient to determine the remainder of the exponent modulo 5. We have $14^{13} \equiv (-1)^{13} = -1 \equiv 4 \pmod 5$, so the wanted remainder is $4^4 = 2^8 = 256 \equiv 6 - 5 + 2 = 3 \pmod{11}$ (using the alternating-digit-sum rule for eleven on 256). Alternatively, we could have finished the calculation as follows: $4^4 \equiv 4^{-1} \equiv 3 \pmod{11}$. □
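The order of an integer is found by exactly the direct search used in 11.C.17; a minimal sketch:

```python
from math import gcd

def order(a, m):
    """Least n >= 1 with a**n ≡ 1 (mod m); requires gcd(a, m) == 1.
    By Euler's theorem, the search terminates within phi(m) steps."""
    assert gcd(a, m) == 1
    x, n = a % m, 1
    while x != 1:
        x = x * a % m
        n += 1
    return n

print(order(2, 7))     # 3  (11.C.17)
print(order(7, 100))   # 4  (11.C.18)
print(order(4, 11))    # 5  (11.C.24)
```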
11.C.25. Determine the last two digits of the decimal expansion of the number $14^{14^{14}}$.

Solution. We are interested in the remainder of the number $a = 14^{14^{14}}$ upon division by 100. However, since $(14, 100) > 1$, we cannot consider the order of 14 modulo 100. Instead, we factor the modulus into coprime integers: $100 = 4 \cdot 25$. Apparently, $4 \mid a$, so it remains to find the remainder of $a$ modulo 25. By Euler's theorem, we have $14^{\varphi(25)} = 14^{20} \equiv 1 \pmod{25}$, so we are interested in the remainder of $14^{14}$ upon division by $20 = 4 \cdot 5$. Again, we clearly have $4 \mid 14^{14}$, and further $14^{14} \equiv (-1)^{14} = 1 \pmod 5$, so $14^{14} \equiv 16 \pmod{20}$. Altogether,
$$14^{14^{14}} \equiv 14^{16} = 2^{16} \cdot 7^{16} \pmod{25}.$$

Therefore, for every $i \in \{1, 2, \ldots, \varphi(m)\}$, there is a unique $j \in \{1, 2, \ldots, \varphi(m)\}$ such that $a x_i \equiv x_j \pmod m$. Multiplying all these congruences leads to
$$(a x_1)(a x_2) \cdots (a x_{\varphi(m)}) \equiv x_1 x_2 \cdots x_{\varphi(m)} \pmod m,$$
which can be rearranged to
$$a^{\varphi(m)} \cdot x_1 x_2 \cdots x_{\varphi(m)} \equiv x_1 x_2 \cdots x_{\varphi(m)} \pmod m.$$
Dividing this congruence by the integer $x_1 x_2 \cdots x_{\varphi(m)}$ (which is coprime to $m$) already gives the wanted statement. □

Remark. As we have already mentioned, Euler's theorem is a consequence of Lagrange's theorem (see 12.4.10) applied to the group $(\mathbb{Z}_m^\times, \cdot)$. This proof of Euler's theorem utilized the fact that multiplication by an integer $a$ coprime to $m$ is, in algebraic words, an automorphism of the group $(\mathbb{Z}_m^\times, \cdot)$.

11.3.13. There is an important concept tightly connected to Euler's totient function and Euler's theorem: the so-called order of an integer modulo $m$ – once again, it is nothing else than the order of the corresponding element in the group of invertible residue classes modulo $m$:

Order of an integer

Let $a \in \mathbb{Z}$, $m \in \mathbb{N}$, where $(a, m) = 1$. The order of $a$ modulo $m$ is the least natural number $n$ satisfying $a^n \equiv 1 \pmod m$.

It follows from Euler's theorem that the order of an integer is well defined – the order of any integer coprime to the modulus is surely not greater than $\varphi(m)$. As we will see later, the integers whose order is exactly $\varphi(m)$ are of great interest – they are called primitive roots modulo $m$, and they play an important role in solving binomial congruences, among others. This concept is just another name for a generator of the group $(\mathbb{Z}_m^\times, \cdot)$. Some of the very basic results about the order are demonstrated in 11.C.19, and a complete description of the dependency of the order upon the exponent is given by the subsequent two theorems.

Theorem. Let $m \in \mathbb{N}$, $a \in \mathbb{Z}$, $(a, m) = 1$. Let $r$ denote the order of $a$ modulo $m$. Then, for any $t, s \in \mathbb{N}_0$, we have
$$a^t \equiv a^s \pmod m \iff t \equiv s \pmod r.$$

Proof. Without loss of generality, we can assume that $t \geq s$. Dividing the integer $t - s$ by $r$ with remainder, we get $t - s = qr + z$, where $q, z \in \mathbb{N}_0$, $0 \leq z < r$.
"$\Leftarrow$": Since $t \equiv s \pmod r$, we have $z = 0$, hence $a^{t-s} = a^{qr} = (a^r)^q \equiv 1^q \pmod m$. Multiplying both sides of this congruence by the integer $a^s$ leads to the wanted statement.
"$\Rightarrow$": It follows from $a^t \equiv a^s \pmod m$ that $a^s \cdot a^{qr+z} \equiv a^s \pmod m$.
Since $a^r \equiv 1 \pmod m$, we also have $a^{qr+z} \equiv a^z \pmod m$. Altogether, after dividing both sides of the first congruence by the integer $a^s$ (which is coprime to the modulus), we get $a^z \equiv 1 \pmod m$.

We can simplify the computation to come a lot if we realize that $7^2 \equiv -1 \pmod{25}$ and $2^5 \equiv 7 \pmod{25}$. Then
$$14^{14^{14}} \equiv 2^{16} \cdot 7^{16} \equiv (2^5)^3 \cdot 2 \cdot 7^{16} \equiv 7^3 \cdot 2 \cdot 7^{16} \equiv 2 \cdot 7^{19} \equiv 2 \cdot (-1)^9 \cdot 7 = -14 \equiv 11 \pmod{25}.$$
We are thus looking for a non-negative integer which is less than 100, is a multiple of 4, and leaves remainder 11 when divided by 25 – the only such number is clearly 36. □

11.C.26. Determine the last three digits of the number $12^{10^{11}}$. ⃝

11.C.27. Find all natural numbers $n$ for which the integer $5^n - 4^n - 3^n$ is divisible by eleven.

Solution. The orders of all of the numbers 3, 4, and 5 modulo 11 are equal to five, so it suffices to examine $n \in \{0, 1, 2, 3, 4\}$. It can be seen from the following table

n              0   1   2   3   4
5^n mod 11     1   5   3   4   9
4^n mod 11     1   4   5   9   3
3^n mod 11     1   3   9   5   4

that only the case $n \equiv 2 \pmod 5$ yields $3 - 5 - 9 \equiv 0 \pmod{11}$. The problem is thus satisfied exactly by those natural numbers $n$ which satisfy $n \equiv 2 \pmod 5$. □

11.C.28. Primitive roots. Show that there are no primitive roots modulo 8, and find a primitive root modulo 11.

Solution. Apparently, even integers cannot be primitive roots modulo 8, so it remains to examine the odd ones. We can easily calculate that $3^2 \equiv 5^2 \equiv 7^2 \equiv 1 \pmod 8$, but $\varphi(8) = 4 > 2$. Now, we verify that 2 is a primitive root modulo 11: the order of 2 divides $\varphi(11) = 10$, so it suffices to show that $2^2 \not\equiv 1 \pmod{11}$ and $2^5 = 32 \equiv -1 \not\equiv 1 \pmod{11}$. Therefore, the order of 2 modulo 11 is indeed 10. □

11.C.29. We will now determine (with the help of the propositions proving Theorem 11.3.16) primitive roots modulo 41, $41^2$, and $2 \cdot 41^2$.

Solution. Since $\varphi(41) = 40 = 2^3 \cdot 5$, an integer $g$ coprime to 41 is a primitive root modulo 41 if and only if
$$g^{20} \not\equiv 1 \pmod{41} \quad \wedge \quad g^8 \not\equiv 1 \pmod{41}.$$
Since r is the order of a, we have ar ≡ 1 (mod m), i.e., brδ ≡ 1 (mod m), and so s | rδ. From r being coprime to s, we get s | δ. Analogously, we can get r | δ, so (again utilizing that r, s are coprime) r · s | δ. On the other hand, we clearly have (ab)rs ≡ 1 (mod m), hence δ | rs. Altogether, δ = rs. □ 11.3.15. Primitive roots. Among the integers coprime to a modulus m (i.e., the elements of a reduced residue system modulo m), the most important ones are those whose order is equal to φ(m). Step-by-step exponentiation of such a number yields all possible elements of a reduced residue system (or integers congruent to them). Therefore, in various problems, we can work with powers of a given integer instead of considering random elements of a reduced residue system modulo m, and this is often much simpler (see, for instance, the proof of the theorem 11.4.10 about binomial coefficients). 996 Now, we will go through the potential primitive roots in ascending order: g = 2 : 28 = 25 · 23 ≡ −9 · 8 ≡ 10 (mod 41), 220 = (25 )4 ≡ (−9)4 = 812 ≡ (−1)2 = 1 (mod 41), g = 3 : 38 = (34 )2 ≡ (−1)2 = 1 (mod 41), g = 4 : the order of 4 = 22 always divides the order 2, g = 5 : 58 = (52 )4 ≡ (−24 )4 = 216 = (28 )2 ≡ 102 ≡ 18 (mod 41), 520 = (52 )10 ≡ (−24 )10 = 240 = (220 )2 ≡ 1 (mod 41), g = 6 : 68 = 28 · 38 ≡ 10 · 1 = 10 (mod 41), 620 = 220 · 320 ≡ 220 · (38 )2 · 34 ≡ 1 · 1 · (−1) = −1 (mod 41). We have thus proved that 6 is the least positive primitive root modulo 41 (if we were interested in other primitive roots modulo 41 as well, we would get them as the powers of 6 with exponent taking on values from the range 1 to 40 which are coprime to 40. There are exactly φ(40) = φ(23 · 5) = 16 of them, and the resulting remainders modulo 41 are ±6, ±7, ±11, ±12, ±13, ±15, ±17, ±19). Now, if we prove that 640 ̸≡ 1 (mod 412 ), we will know that 6 is a primitive root modulo any power of 41 (if we had “bad luck” and found out that 640 ≡ 1 (mod 412 ), then a primitive root modulo 412 would be 47 = 6 + 41). To avoid manipulating huge numbers when verifying the condition, we will use several tricks (the so-called residue number system). First of all, we calculate the remainder of 68 upon division by 412 ; this problem can be further reduced to computing the remainders of the integers 28 and 38 : 28 = 256 = 6 · 41 + 10 (mod 412 ), 38 = (34 )2 = (2 · 41 − 1)2 ≡ −4 · 41 + 1 (mod 412 ). Then, 68 = 28 · 38 ≡ (6 · 41 + 10)(−4 · 41 + 1) ≡ −34 · 41 + 10 ≡ 7 · 41 + 10 (mod 412 ) and 640 = (68 )5 ≡ (7 · 41 + 10)5 ≡ (105 + 5 · 7 · 41 · 104 ) = 104 (10 + 35 · 41) ≡ (−2 · 41 − 4)(−6 · 41 + 10) ≡ (4 · 41 − 40) = 124 ̸≡ 1 (mod 412 ). In the calculation, we made use of the fact that 104 = 6 · 412 − 86, i.e., 104 ≡ −2 · 41 − 4 (mod 412 ). Therefore, 6 is a primitive root modulo 412 , and since it is an even integer, we can see that 1687 = 6+412 is a primitive root modulo 2 · 412 (while the least positive primitive root modulo 2 · 412 is the integer 7). □ CHAPTER 11. ELEMENTARY NUMBER THEORY Primitive root Let m ∈ N. An integer g, (g, m) = 1, is said to be a primitive root modulo m if and only if its order modulo m equals φ(m). Lemma. If g is a primitive root modulo m, then for every integer a such that (a, m) = 1, there is a unique xa ∈ Z, 0 ≤ xa < φ(m) with the property that gxa ≡ a (mod m). The mapping a → xa is called the discrete logarithm or index of the integer a (with respect to a given modulus m and a fixed primitive root g), and it is a bijection between the sets {a ∈ Z; (a, m) = 1, 0 < a < m} and {x ∈ Z; 0 ≤ x < φ(m)}. Proof. 
Suppose that for $x, y \in \mathbb{Z}$, $0 \leq x, y < \varphi(m)$, we have $g^x \equiv g^y \pmod m$. From the properties of the order, we get $x \equiv y \pmod{\varphi(m)}$, i.e., $x = y$, so the mapping is injective. Since it is a mapping between two finite sets of the same cardinality, it must be surjective as well. □

If there are primitive roots at all for a natural number $m$, then there are exactly $\varphi(\varphi(m))$ of them among the integers $1, 2, \ldots, m$: if $g$ is a primitive root and $a \in \{1, 2, \ldots, \varphi(m)\}$ is arbitrary, then the order of $g^a$ is $\frac{\varphi(m)}{(a, \varphi(m))}$ (by theorem 11.3.14), which is equal to $\varphi(m)$ if and only if $(a, \varphi(m)) = 1$, and there are exactly $\varphi(\varphi(m))$ such integers in the set $\{1, 2, \ldots, \varphi(m)\}$.

Now, we are about to show that primitive roots exist for a sufficient amount of moduli $m$.

11.3.16. Theorem (Existence of primitive roots). Let $m \in \mathbb{N}$, $m > 1$. The modulus $m$ has primitive roots if and only if at least one of the following conditions holds:
• $m = 2$ or $m = 4$,
• $m$ is a power of an odd prime,
• $m$ is twice a power of an odd prime.

The proof of this theorem will be done in several steps. We can easily see that 1 is a primitive root modulo 2 and 3 is a primitive root modulo 4. Further, we show that primitive roots exist modulo any odd prime (in algebraic words, this is another proof of the fact that the group $(\mathbb{Z}_m^\times, \cdot)$ of invertible residue classes modulo a prime $m$ is cyclic; see also 12.4.8).

Proposition. Let $p$ be an odd prime. Then there are primitive roots modulo $p$.

Proof. Let $r_1, r_2, \ldots, r_{p-1}$ be the orders of the integers $1, 2, \ldots, p-1$ modulo $p$. Let $\delta = [r_1, r_2, \ldots, r_{p-1}]$ be the least common multiple of these orders. We will show that there is an integer of order $\delta$ among $1, 2, \ldots, p-1$ and that $\delta = p - 1$. Let $\delta = q_1^{\alpha_1} \cdots q_k^{\alpha_k}$ be the factorization of $\delta$ into primes. For every $s \in \{1, \ldots, k\}$, there is a $c \in \{1, \ldots, p-1\}$ such that $q_s^{\alpha_s} \mid r_c$ (otherwise, there would be a common multiple

D. Solving congruences

Linear congruences. The following exercise illustrates that the procedure mentioned in the proof of theorem 11.4.3 about the solvability of linear congruences (which invokes Euler's theorem) is usually not the most efficient one – we can also utilize Bézout's theorem, or equivalent modifications of the given congruence.

11.D.1. Solve the congruence $39x \equiv 41 \pmod{47}$.

Solution. i) First, we use Euler's theorem. Since $(39, 47) = 1$, we have $39^{\varphi(47)} = 39^{46} \equiv 1 \pmod{47}$. Multiplying the given congruence by $39^{45}$ and using $39^{46} \equiv 1 \pmod{47}$, we get $x \equiv 39^{45} \cdot 41 \pmod{47}$. To complete the solution, it remains to calculate the remainder of $39^{45} \cdot 41$ when divided by 47, which is left as an exercise to the kind reader, leading to the result $x \equiv 36 \pmod{47}$.

ii) Another option is to make use of Bézout's theorem. The Euclidean algorithm applied to the pair $(39, 47)$ yields
$$47 = 1 \cdot 39 + 8, \quad 39 = 4 \cdot 8 + 7, \quad 8 = 1 \cdot 7 + 1.$$
In the other direction, this leads to
$$1 = 8 - 7 = 8 - (39 - 4 \cdot 8) = 5 \cdot 8 - 39 = 5 \cdot (47 - 39) - 39 = 5 \cdot 47 - 6 \cdot 39.$$
Considering this equality modulo 47 and remembering that we are solving the equation $41 \equiv x \cdot 39$, we obtain
$$1 \equiv -6 \cdot 39 \pmod{47},$$
and, multiplying by 41,
$$41 \equiv 41 \cdot (-6) \cdot 39 \pmod{47}, \quad\text{i.e.,}\quad x \equiv 41 \cdot (-6) \equiv -246 \equiv 36 \pmod{47}.$$
Let us note that this procedure is the one usually used in the corresponding software tools – it is efficient and can easily be turned into an algorithm. It was also important that 41 (the number we multiplied the congruence by) and the modulus 47 are coprime.
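The Bézout approach in ii) is exactly what one implements in code; a minimal sketch (iterative extended Euclidean algorithm; the function names are ours):

```python
def ext_gcd(a, b):
    """Return (g, u, v) with u*a + v*b == g == gcd(a, b)."""
    u0, v0, u1, v1 = 1, 0, 0, 1
    while b:
        q, a, b = a // b, b, a % b
        u0, u1 = u1, u0 - q * u1
        v0, v1 = v1, v0 - q * v1
    return a, u0, v0

def solve_linear(a, b, m):
    """Solutions of a*x ≡ b (mod m): either None, or (x0, step),
    meaning x ≡ x0 (mod step)."""
    g, u, _ = ext_gcd(a, m)
    if b % g:
        return None                     # no solution unless gcd(a, m) | b
    step = m // g
    return (u * (b // g)) % step, step

print(solve_linear(39, 41, 47))         # (36, 47) -- matching 11.D.1
```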
of the integers $r_1, r_2, \ldots, r_{p-1}$ less than $\delta$). Therefore, there exists an integer $b$ such that $r_c = b \cdot q_s^{\alpha_s}$. Since $c$ has order $r_c$, the order of the integer $g_s := c^b$ is equal to $q_s^{\alpha_s}$ (by the theorem 11.3.14 on the orders of powers). Reasoning analogously for every $s \in \{1, \ldots, k\}$, we get integers $g_1, \ldots, g_k$, and we can set $g := g_1 \cdots g_k$. From the properties of the order of a product, the order of $g$ is equal to the product of the orders of the integers $g_1, \ldots, g_k$, i.e., to $q_1^{\alpha_1} \cdots q_k^{\alpha_k} = \delta$.

Now, we prove that $\delta = p - 1$. Since the orders of the integers $1, 2, \ldots, p-1$ divide $\delta$, the congruence $x^\delta \equiv 1 \pmod p$ holds for every $x \in \{1, 2, \ldots, p-1\}$. By theorem 11.4.8, a congruence of degree $\delta$ modulo a prime $p$ has at most $\delta$ solutions (in algebraic words, we are actually looking for the roots of a polynomial over a field, and there cannot be more of them than the degree of the polynomial, as we will see in part 12.3.4). On the other hand, we have just shown that this congruence has $p - 1$ solutions, so necessarily $\delta \geq p - 1$. Still, $\delta$ is (being the order of $g$) a divisor of $p - 1$, whence we finally get the wanted equality $\delta = p - 1$. □

11.3.17. Next, we show that there are primitive roots modulo the powers of odd primes. First, we prove two helpful lemmas; then comes the proposition itself.

Lemma. Let $p$ be an odd prime, $\ell \geq 2$ arbitrary. Then it holds for any $a \in \mathbb{Z}$ that
$$(1 + ap)^{p^{\ell - 2}} \equiv 1 + a p^{\ell - 1} \pmod{p^\ell}.$$

Proof. This follows easily from the binomial theorem using mathematical induction on $\ell$.
I. The statement is clearly true for $\ell = 2$.
II. Let the statement be true for $\ell$; let us prove it for $\ell + 1$. Invoking exercise 11.C.7 and raising the statement for $\ell$ to the $p$-th power, we obtain
$$(1 + ap)^{p^{\ell - 1}} \equiv (1 + a p^{\ell - 1})^p \pmod{p^{\ell + 1}}.$$
It follows from the binomial theorem that
$$(1 + a p^{\ell - 1})^p = 1 + p \cdot a \cdot p^{\ell - 1} + \sum_{k=2}^{p} \binom{p}{k} a^k p^{(\ell - 1)k},$$
and since $p \mid \binom{p}{k}$ for $1 < k < p$ (by exercise 11.C.6), it suffices to show that $p^{\ell + 1} \mid p^{1 + (\ell - 1)k}$, which is equivalent to $1 \leq (k - 1)(\ell - 1)$. Thanks to the assumption $\ell \geq 2$, we also get $p^{\ell + 1} \mid p^{(\ell - 1)p}$ for $k = p$. □

Lemma. Let $p$ be an odd prime, $\ell \geq 2$ arbitrary. Then it holds for any integer $a$ satisfying $p \nmid a$ that the order of $1 + ap$ modulo $p^\ell$ equals $p^{\ell - 1}$.

Proof. By the previous lemma, we have $(1 + ap)^{p^{\ell - 1}} \equiv 1 + a p^\ell \pmod{p^{\ell + 1}}$, and considering this congruence modulo $p^\ell$, we get $(1 + ap)^{p^{\ell - 1}} \equiv 1 \pmod{p^\ell}$. At the same time, it follows directly from the previous lemma and from $p$ not being a divisor of $a$ that

iii) Concerning paper-and-pencil calculations, the most efficient procedure (yet one not easily generalizable into an algorithm) is to gradually modify the congruence so that the set of solutions remains unchanged:
$$39x \equiv 41 \pmod{47},$$
$$-8x \equiv -6 \pmod{47} \quad \text{(divide by } -2\text{)},$$
$$4x \equiv 3 \equiv -44 \pmod{47} \quad \text{(divide by } 4\text{)},$$
$$x \equiv -11 \equiv 36 \pmod{47}. \quad □$$

Systems of congruences. In order to solve systems of (not only linear) congruences, we often utilize the Chinese remainder theorem, which guarantees the uniqueness of the solution provided the moduli of the particular congruences are pairwise coprime.

11.D.2. Solve the system
$$x \equiv 7 \pmod{27}, \quad x \equiv -3 \pmod{11}.$$

Solution. As $(27, 11) = 1$, we are guaranteed by the Chinese remainder theorem that the solution is unique modulo $27 \cdot 11 = 297$. There are two major approaches to finding the solution.

(a) Using the Euclidean algorithm, we can find the coefficients in Bézout's identity: $1 = 5 \cdot 11 - 2 \cdot 27$. Hence, $[11]_{27}^{-1} = [5]_{27}$ and $[27]_{11}^{-1} = [-2]_{11}$.
Therefore, the solution is
$$x \equiv 7 \cdot 11 \cdot 5 - 3 \cdot 27 \cdot (-2) = 547 \equiv 250 \pmod{297}.$$

(b) Using step-by-step substitution, we get $x = 11t - 3$ from the second congruence. Substituting this into the first one leads to $11t \equiv 10 \pmod{27}$. Multiplying this by 5 yields $55t \equiv 50$, i.e., $t \equiv -4 \pmod{27}$. Altogether, $x = 11 \cdot 27 \cdot s - 4 \cdot 11 - 3 = 297 s - 47$ for $s \in \mathbb{Z}$, i.e., $x \equiv -47 \pmod{297}$. □

11.D.3. Solve the following system of congruences:
$$x \equiv 1 \pmod{10}, \quad x \equiv 5 \pmod{18}, \quad x \equiv -4 \pmod{25}.$$

Solution. The integers $x$ satisfying the first congruence are those of the form $x = 1 + 10t$, where $t \in \mathbb{Z}$ is arbitrary. We substitute this expression into the second congruence and solve it (as a congruence in the variable $t$):
$$1 + 10t \equiv 5 \pmod{18}, \quad 10t \equiv 4 \pmod{18}, \quad 5t \equiv 2 \pmod 9, \quad 5t \equiv 20 \pmod 9, \quad t \equiv 4 \pmod 9,$$

$(1 + ap)^{p^{\ell - 2}} \not\equiv 1 \pmod{p^\ell}$, which gives the wanted proposition. □

Proposition. Let $p$ be an odd prime. Then, for every $\ell \in \mathbb{N}$, there is a primitive root modulo $p^\ell$.

Proof. Let $g$ be a primitive root modulo $p$. We show that if $g^{p-1} \not\equiv 1 \pmod{p^2}$, then $g$ is a primitive root modulo $p^\ell$ for any $\ell \in \mathbb{N}$. (If we had $g^{p-1} \equiv 1 \pmod{p^2}$, then $(g + p)^{p-1} \equiv 1 + (p-1) g^{p-2} p \not\equiv 1 \pmod{p^2}$, so we could choose $g + p$ instead of the congruent integer $g$ as the original primitive root.)

Let $g$ satisfy $g^{p-1} \not\equiv 1 \pmod{p^2}$. Then there is an $a \in \mathbb{Z}$, $p \nmid a$, such that $g^{p-1} = 1 + pa$. We show that the order of $g$ modulo $p^\ell$ is $\varphi(p^\ell) = (p-1) p^{\ell-1}$. Let $n$ be the least natural number satisfying $g^n \equiv 1 \pmod{p^\ell}$. By the previous lemma, the order of $g^{p-1} = 1 + pa$ modulo $p^\ell$ is $p^{\ell-1}$. However, then it follows from the corollary in 11.3.13 that
$$(g^{p-1})^n = (g^n)^{p-1} \equiv 1 \pmod{p^\ell} \implies p^{\ell - 1} \mid n.$$
At the same time, the congruence $g^n \equiv 1 \pmod p$ implies that $p - 1 \mid n$. From $p - 1$ and $p^{\ell-1}$ being coprime, we get $(p-1) p^{\ell-1} \mid n$. Therefore, $n = \varphi(p^\ell)$, and $g$ is thus a primitive root modulo $p^\ell$. □

11.3.18. Our next task is to deal with the existence of primitive roots for moduli of the form twice a power of an odd prime.

Proposition. Let $p$ be an odd prime and $g$ a primitive root modulo $p^\ell$ for $\ell \in \mathbb{N}$. Then the odd one of the integers $g$, $g + p^\ell$ is a primitive root modulo $2 p^\ell$.

Proof. Let $c$ be an odd natural number. Then, for every $n \in \mathbb{N}$, we have $c^n \equiv 1 \pmod{p^\ell}$ if and only if $c^n \equiv 1 \pmod{2 p^\ell}$. Since $\varphi(2 p^\ell) = \varphi(p^\ell)$, every odd primitive root modulo $p^\ell$ is also a primitive root modulo $2 p^\ell$. □

The subsequent proposition describes the case of the powers of two. We use helping lemmas similar to those in the case of odd primes.

Lemma. Let $\ell \in \mathbb{N}$, $\ell \geq 3$. Then $5^{2^{\ell - 3}} \equiv 1 + 2^{\ell - 1} \pmod{2^\ell}$.

Proof. Analogous to the corresponding lemma above for odd primes. □

Lemma. Let $\ell \in \mathbb{N}$, $\ell \geq 3$. Then the order of the integer 5 modulo $2^\ell$ is $2^{\ell - 2}$.

Proof. Easily from the above lemma. □

Proposition. Let $\ell \in \mathbb{N}$. There are primitive roots modulo $2^\ell$ if and only if $\ell \leq 2$.

Proof. Let $\ell \geq 3$. Then the set
$$S = \{(-1)^a \cdot 5^b;\ a \in \{0, 1\},\ b \in \mathbb{Z},\ 0 \leq b < 2^{\ell - 2}\}$$
forms a reduced residue system modulo $2^\ell$: it has $\varphi(2^\ell)$ elements, and it can be easily verified that they are pairwise incongruent modulo $2^\ell$.

or $t = 4 + 9s$, where $s \in \mathbb{Z}$ is arbitrary. The first two congruences are thus satisfied exactly by the integers $x$ of the form $x = 1 + 10t = 1 + 10(4 + 9s) = 41 + 90s$. Once again, this can be substituted into the third congruence and solved:
$$41 + 90s \equiv -4 \pmod{25}, \quad 90s \equiv 5 \pmod{25}, \quad 18s \equiv 1 \pmod 5, \quad 3s \equiv 6 \pmod 5, \quad s \equiv 2 \pmod 5,$$
or $s = 2 + 5r$, where $r \in \mathbb{Z}$. Altogether, $x = 41 + 90s = 41 + 90(2 + 5r) = 221 + 450r$. Therefore, the system is satisfied by those integers $x$ with $x \equiv 221 \pmod{450}$. □
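The substitution method from 11.D.2 and 11.D.3 is easy to automate, and it even handles non-coprime (but compatible) moduli such as those in 11.D.3; a minimal sketch, building on `ext_gcd` from the previous snippet (the helper names are ours):

```python
from functools import reduce

def merge(a1, m1, a2, m2):
    """Combine x ≡ a1 (mod m1) and x ≡ a2 (mod m2); the moduli need not be
    coprime -- the system is solvable iff gcd(m1, m2) | a1 - a2."""
    g, u, _ = ext_gcd(m1, m2)
    if (a2 - a1) % g:
        return None                        # incompatible congruences
    lcm = m1 // g * m2
    # x = a1 + m1*t with t ≡ (a2 - a1)/g * u (mod m2/g), since u*m1/g ≡ 1 there
    t = (a2 - a1) // g * u % (m2 // g)
    return (a1 + m1 * t) % lcm, lcm

solve = lambda congs: reduce(lambda acc, c: merge(*acc, *c), congs[1:], congs[0])

print(solve([(7, 27), (-3, 11)]))           # (250, 297)  -- 11.D.2
print(solve([(1, 10), (5, 18), (-4, 25)]))  # (221, 450)  -- 11.D.3
```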
11.D.4. A group of thirteen pirates managed to steal a chest full of gold coins (there were around two thousand of them). The pirates tried to divide the coins evenly among themselves, but ten coins were left over. They started to fight for the remaining coins, and one of the pirates was fatally stabbed during the combat. So they tried to divide the coins evenly once again, and now three coins were left. Another pirate died in a subsequent battle for the three coins. The remaining pirates then tried to divide the coins evenly for the third time, now successfully. How many coins were there in the chest?

Solution. The problem leads to the following system of congruences:
$$x \equiv 10 \pmod{13}, \quad x \equiv 3 \pmod{12}, \quad x \equiv 0 \pmod{11}.$$
Its solution is $x \equiv 231 \pmod{11 \cdot 12 \cdot 13}$. Since the number $x$ of coins is to be around 2000 and $x \equiv 231 \pmod{1716}$, we can easily settle that there were exactly $231 + 1716 = 1947$ coins. □

11.D.5. When gymnasts formed groups of eight people, three were left over. When they formed circles, each consisting of seventeen people, seven remained; and when they grouped into pyramids (each of them containing $21 = 4^2 + 2^2 + 1$ gymnasts), two of the pyramids were incomplete (each missing a person "on the top"). How many gymnasts were there, provided there were at least 2000 and at most 4000?

Solution. We solve the following system of linear congruences in the standard way:
$$c \equiv 3 \pmod 8, \quad c \equiv 7 \pmod{17}, \quad c \equiv -2 \pmod{21},$$

At the same time (utilizing the previous lemma), the order of every element of $S$ apparently divides $2^{\ell - 2}$. Therefore, this reduced system cannot (nor can any other) contain an element of order $\varphi(2^\ell) = 2^{\ell - 1}$. □

11.3.19. The last piece of the jigsaw puzzle of propositions which collectively prove theorem 11.3.16 is the statement about the non-existence of primitive roots for composite numbers which are neither a power of a prime nor twice such a power.

Proposition. Let $m \in \mathbb{N}$ be divisible by at least two primes, and let it not be twice a power of an odd prime. Then there are no primitive roots modulo $m$.

Proof. Let $m$ factor into primes as $m = 2^\alpha p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, where $\alpha \in \mathbb{N}_0$, $\alpha_i \in \mathbb{N}$, $2 \nmid p_i$, and either $k \geq 2$, or both $k \geq 1$ and $\alpha \geq 2$. Denoting $\delta = [\varphi(2^\alpha), \varphi(p_1^{\alpha_1}), \ldots, \varphi(p_k^{\alpha_k})]$, we can easily see that
$$\delta < \varphi(2^\alpha) \cdot \varphi(p_1^{\alpha_1}) \cdots \varphi(p_k^{\alpha_k}) = \varphi(m)$$
and that for any $a \in \mathbb{Z}$, $(a, m) = 1$, we have $a^\delta \equiv 1 \pmod m$. Therefore, there are no primitive roots modulo $m$. □

11.3.20. In general, it is computationally very hard to find a primitive root for a given modulus. The following theorem describes a necessary and sufficient condition for an examined integer to be a primitive root.

Theorem. Let $m$ be an integer such that there are primitive roots modulo $m$, and let us write $\varphi(m) = q_1^{\alpha_1} \cdots q_k^{\alpha_k}$, where $q_1, \ldots, q_k$ are primes and $\alpha_1, \ldots, \alpha_k \in \mathbb{N}$. Then, for every $g \in \mathbb{Z}$, $(g, m) = 1$, the integer $g$ is a primitive root modulo $m$ if and only if none of the following congruences holds:
$$g^{\varphi(m)/q_1} \equiv 1 \pmod m, \quad \ldots, \quad g^{\varphi(m)/q_k} \equiv 1 \pmod m.$$

Proof. If any of the congruences were true, the order of $g$ would be less than $\varphi(m)$. On the other hand, if $g$ fails to be a primitive root, then there is a $d \in \mathbb{N}$, $d \mid \varphi(m)$, with $d < \varphi(m)$ and $g^d \equiv 1 \pmod m$. If $u = \frac{\varphi(m)}{d} > 1$, then there must be an $i \in \{1, \ldots, k\}$ such that $q_i \mid u$. But then we get
$$g^{\varphi(m)/q_i} = g^{d \cdot u / q_i} \equiv 1 \pmod m. \quad □$$
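The criterion of 11.3.20 is exactly how primitive roots are searched for in practice; a small sketch reusing `phi` from the earlier snippet (valid, of course, only for moduli which have primitive roots at all):

```python
def prime_factors(n):
    """Set of primes dividing n (trial division)."""
    ps, p = set(), 2
    while p * p <= n:
        if n % p == 0:
            ps.add(p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        ps.add(n)
    return ps

def is_primitive_root(g, m):
    """Criterion 11.3.20: g (coprime to m) is a primitive root modulo m
    iff g**(phi(m)/q) is not ≡ 1 (mod m) for every prime q | phi(m)."""
    f = phi(m)
    return all(pow(g, f // q, m) != 1 for q in prime_factors(f))

print(min(g for g in range(2, 41) if is_primitive_root(g, 41)))      # 6
print(is_primitive_root(6, 41**2), is_primitive_root(7, 2 * 41**2))  # True True
```

This reproduces the findings of 11.C.29 without any of the clever hand manipulations.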
Solving congruences and systems of them This part will be devoted to the analog to solving equations in a numerical domain. We will actually be solving equations (and systems of equations) in the ring of residue classes (Zm, +, ·); we will, however, talk about solving congruences modulo m and write it in the more transparent way as usual. 1000 leading to the solution c ≡ 1027 (mod 2856), which, together with the additional information, implies that there were exactly 3883 gymnasts. □ 11.D.6. Find which of the following (systems of) linear congruences has a solution. i) x ≡ 1 (mod 3), x ≡ −1 (mod 9); ii) 8x ≡ 1 (mod 12345678910111213); iii) x ≡ 3 (mod 29), x ≡ 5 (mod 47). ⃝ The Chinese remainder theorem can also be used “in the opposite direction”, i.e. to simplify a linear congruence provided we are able to express the modulus as a product of pairwise coprime factors. 11.D.7. Solve the congruence 23 941x ≡ 915 (mod 3564). Solution. Let us factor 3564 = 22 ·34 ·11. Since none of the integers 2, 3, 11 divides 23 941, we have (23 941, 3564) = 1, so the congruence has a solution. Since φ(3564) = 2·(33 ·2)· 10 = 1080, the solution is of the form x ≡ 915 · 23 9411079 (mod 3564). However, it would take much effort to simplify the right-hand integer to a more explicit form. Therefore, we will try to solve the congruence in a different way – we will build an equivalent system of congruences which are easier to solve than the original one. We know that an integer x is a solution of the given congruence if and only if it is a solution of the system 23941x ≡ 915 (mod 22 ), 23941x ≡ 915 (mod 34 ), 23941x ≡ 915 (mod 11). Solving these congruences separately, we get the following, equivalent system: x ≡ 3 (mod 4), x ≡ −3 (mod 81), x ≡ −4 (mod 11). Now, the procedure for finding a solution of a system of congruences yields x ≡ −1137 (mod 3564), which is the solution of the original congruence as well. □ 11.D.8. Solve the congruence 3446x ≡ 8642 (mod 208). ⃝ 11.D.9. Prove that the sequence (2n − 3)∞ n=1 contains infinitely many multiples of 5 as well as infinitely many multiples of 13, yet there is no multiple of 65 in it. ⃝ Residue number system. When calculating with large integers, it is often more advantageous to work not with their decimal or binary expansions, but rather with their representation in a so-called residue number system, which allows for easy parallelization of computations with large integers. Such a system is given by a k-tuple of (usually pairwise coprime) moduli, and each integer which CHAPTER 11. ELEMENTARY NUMBER THEORY Congruence in one variable Let m ∈ N, f(x), g(x) ∈ Z[x]. The notation f(x) ≡ g(x) (mod m) is called a congruence in variable x, and it is understood to be the problem of finding the set of solutions, i.e., the set of all such integers c for which f(c) ≡ g(c) (mod m). Two congruences (in one variable) are called equivalent if and only if they have the same set of solutions. The mentioned congruence is equivalent to the congru- ence f(x) − g(x) ≡ 0 (mod m). The only method which always leads to a solution is trying out all possible values (however, this would, of course, often take too much time). This procedure is formalized by the following proposition. 11.4.1. Proposition. Let m ∈ N, f(x) ∈ Z[x]. Then, it holds for every a, b ∈ Z that a ≡ b (mod m) =⇒ f(a) ≡ f(b) (mod m). Proof. Let f(x) = cnxn +cn−1xn−1 +· · ·+c1x + + c0, where c0, c1, . . . , cn ∈ Z. Since a ≡ b (mod m), ciai ≡ cibi (mod m) holds for every i = 0, 1, . . . , n. 
Adding up these congruences for i = 0, 1, 2, . . . , n leads to cnan + · · · + c1a + c0 ≡ cnbn + · · · + c1b + c0 (mod m), i.e., f(a) ≡ f(b) (mod m). □ Corollary. The set of solutions of an arbitrary congruence modulo m is a union of residue classes modulo m. Definition. The number of solutions of a congruence in one variable modulo m is the number of residue classes modulo m containing the solutions of the congruence. Example. The concept number of solutions of a congruence, which we have just defined, is a bit counterintuitive in that it depends on the modulus of the congruence. Therefore, equivalent congruences (sharing the same integers as solutions) can have different numbers of solutions. (1) The congruence 2x ≡ 3 (mod 3) has exactly one solution (modulo 3). (2) The congruence 10x ≡ 15 (mod 15) has five solutions (modulo 15). (3) The congruences from (1) and (2) are equivalent. 11.4.2. Linear congruence in one variable. Just like in the case of ordinary equations, the easiest congruences are the linear ones, for which we are able not only to decide whether they have a solution, but to efficiently find it (provided they have some). The procedure is described by the following theorem and its proof. 11.4.3. Theorem. Let m ∈ N, a, b ∈ Z, and d = (a, m). Then the congruence (in variable x) ax ≡ b (mod m) 1001 is less than their product is then uniquely representable as a k-tuple of remainders (whose values do not exceed the mod- uli). 11.D.10. The quintuple of moduli 3, 5, 7, 11, 13 can serve to uniquely represent integers which are less than their product (i.e. less than 15015) and to perform standard arithmetic operations efficiently (and in a distributed manner if desired). Now, we will determine the representation of the integers 1234 and 5678 in this residue number system and we will determine their sum and product. Solution. Calculating the remainders of the given integers upon division by the particular moduli, we get their RNS representations, which can be written as the tuples (1, 4, 2, 2, 12) and (2, 3, 1, 2, 10). The sum is computed componentwise (reducing the results modulo the appropriate number), leading to the tuple (0, 2, 3, 4, 9). Using the Chinese remainder theorem, this tuple can then be transformed back to the integer 6912. The product is computed analogously, yielding the corresponding tuple (2, 2, 2, 4, 3), which can be transformed back to 9662 (by the Chinese remainder theorem again). This is indeed congruent to 1234 · 5678 modulo 15015. □ 11.D.11. In practice, the residue number system is often a triple 2n − 1, 2n , 2n + 1 (why are these integers always coprime?), which can uniquely cover integers of 3n bits at the utmost. Consider the case n = 3 and determine the representation of the integer 118 in this residue number system. Solution. We can directly calculate that 118 ≡ 6 (mod 7), 118 ≡ 6 (mod 8), and 118 ≡ 1 (mod 9). The wanted representation is thus given by the triple (6, 6, 1). In practice, however, it is very important that the RNS representation can be efficiently transformed to binary and vice versa. In our concrete case, the remainder of 118 = (1110110)2 when divided by 23 can be found easily – it is the last three bits (110)2 = 6. Computing the remainder upon division by 23 + 1 = 9 or 23 − 1 = 7 is not any more complicated. We can see (splitting the examined integer into three groups of n bits each) that (1110110)2 ≡ (001)2 + (110)2 + (110)2 ≡ 6 (mod 23 − 1), (1110110)2 ≡ (001)2 − (110)2 + (110)2 ≡ 1 (mod 23 + 1). 
A thoughtful reader has surely noticed the similarity with the criteria for divisibility by 9 and 11, which were discussed in paragraph 11.C.9. □ 11.D.12. Higher-order congruences. Using the procedure of theorem 11.4.6, solve the congruence x4 + 7x + 4 ≡ 0 (mod 27). Solution. First, we will solve this congruence modulo 3 (by substitution, for instance) – we can easily find that the solution CHAPTER 11. ELEMENTARY NUMBER THEORY has a solution if and only if d | b. If d | b, then this congruence has exactly d solutions (modulo m). Proof. First, we prove that the mentioned condition is necessary. If an integer c is a solution of this congruence, then we must have m | a · c − b. Since d = (a, m), we get d | m and d | a · c − b, so d | a · c − (a · c − b) = b. Now, we will prove that if d | b, then the given congruence has exactly d solutions modulo m. Let a1, b1 ∈ Z and m1 ∈ N so that a = d · a1, b = d · b1, and m = d · m1. The congruence we are trying to solve is thus equivalent to the congruence a1 · x ≡ b1 (mod m1), where (a1, m1) = 1. This congruence can be multiplied by the integer a φ(m1)−1 1 , which, by Euler’s theorem, leads to x ≡ b1 · a φ(m1)−1 1 (mod m1). This congruence has a unique solution modulo m1, thus it has d = m/m1 solutions modulo m. □ Using the theorem about solutions of linear congruences, we can, among others, prove Wilson’s theorem – an important theorem which gives a necessary (and sufficient) condition for an integer to be a prime. Such conditions are extremely useful in computational number theory, where one needs to efficiently determine whether a given large integer is a prime. Unfortunately, it is not known now how fast modular factorial of a large integer can be computed. That is why Wilson’s theorem is not used for this purpose in practice. Theorem (Wilson). A natural number n > 1 is a prime if and only if (n − 1)! ≡ −1 (mod n). Proof. First, we prove that every composite number n > 4 satisfies n | (n − 1)!, i.e., (n − 1)! ≡ 0 (mod n). Let 1 < d < n be a non-trivial divisor of n. If d ̸= n/d, then the inequality 1 < d, n/d ≤ n − 1 implies what we need: n = d · n/d | (n − 1)!. If d = n/d, i.e., n = d2 , then we have d > 2 (since n > 4) and n | (d · 2d) | (n − 1)!. For n = 4, we easily get (4 − 1)! ≡ 2 ̸≡ −1 (mod 4). Now, let p be a prime. The integers in the set {2, 3, . . . , p − 2} can be grouped by pairs of those mutually inverse modulo p, i.e., pairs of integers whose product is congruent to 1. By the previous theorem, for every integer a of this set, there is a unique solution of the congruence a · x ≡ 1 (mod p). Since a ̸= 0, 1, p − 1, it is apparent that the solution c of the congruence also satisfies c ̸≡ 0, 1, −1 (mod p). The integer a cannot be paired with itself, either: If so, i.e., a · a ≡ 1 (mod p), we would (thanks to p | a2 − 1 = (a + 1)(a − 1)) get the congruence a ≡ ±1 (mod p). The product of the integers of the mentioned set thus consists of products of (p − 3)/2 pairs (whose product is always congruent to 1 modulo p). Therefore, we have (p − 1)! ≡ 1(p−3)/2 · (p − 1) ≡ −1 (mod p). □ 1002 is x ≡ 1 (mod 3). Now, writing the solution in the form x = 1 + 3t, where t ∈ Z, we will solve the congruence modulo 9: x4 + 7x + 4 ≡ 0 (mod 9), (1 + 3t)4 + 7(1 + 3t) + 4 ≡ 0 (mod 9), 1 + 4 · 3t + 7 + 7 · 3t + 4 ≡ 0 (mod 9), 33t ≡ −12 (mod 9), 11t ≡ − 4 (mod 3), t ≡ 1 (mod 3). 
Writing t = 1 + 3s, where s ∈ Z, we get x = 4 + 9s, and substituting this leads to (4 + 9s)4 + 7(4 + 9s) + 4 ≡ 0 (mod 27), 44 + 4 · 43 · 9s + 28 + 63s + 4 ≡ 0 (mod 27), 256 · 9s + 63s ≡ −288 (mod 27), 256s + 7s ≡ − 32 (mod 3), 2s ≡ 1 (mod 3), s ≡ 2 (mod 3). Altogether, we get the solution in the form x = 4 + 9s = 4 + 9(2 + 3r) = 22 + 27r, where r ∈ Z, i.e., x ≡ 22 (mod 27). □ 11.D.13. Knowing a primitive root modulo 41 from exercise 11.C.29 , solve the congruence 7x17 ≡ 11 (mod 41). Solution. Multiplying the congruence by 6, we get an equivalent congruence 42x17 ≡ 66, i.e., x17 ≡ 25 (mod 41). Since 6 is a primitive root modulo 41, the substitution x = 6t leads to the congruence 617t ≡ 25 ≡ 64 (mod 41), which is equivalent to 17t ≡ 4 (mod 40), and this holds if and only if t ≡ 12 (mod 40). Therefore, the congruence is satisfied by exactly those integers x with x ≡ 612 ≡ 4 (mod 41). □ 11.D.14. Solve the congruence x5 + 1 ≡ 0 (mod 11). Solution. Since (5, φ(11)) = 5 and (−1) φ(11) 5 ≡ 1 (mod 11), the congruence x5 ≡ −1 (mod 11) has five solutions. There are several possibilities how to find them. We can either try all (ten) candidates or transform the problem to a linear congruence using the primitive-root trick. Since 210/2 ≡ −1 ̸≡ 1 (mod 11) and 210/5 ≡ 4 ̸≡ 1 (mod 11), 2 is a primitive root modulo 11 (see also exercise 11.C.28), and the substitution x ≡ 2y then transforms the congruence to 25y ≡ 25 (mod 11), CHAPTER 11. ELEMENTARY NUMBER THEORY 11.4.4. Systems of linear congruences. Having a system of linear congruences in the same variable, we can decide whether each of them is solvable by the previous theorem. If at least one of the congruences does not have a solution, nor does the whole system. On the other hand, if each of the congruences is solvable, we can rearrange it into the form x ≡ ci (mod mi). We thus get a system of congruences x ≡ c1 (mod m1), ... x ≡ ck (mod mk). Apparently, it suffices to solve the case k = 2 since the solutions of a system of more congruences can be obtained by repeatedly applying the procedure for a system of two con- gruences. Proposition. Let c1, c2 be integers and m1, m2 be natural numbers. Let us denote d = (m1, m2). The system of two congruences x ≡ c1 (mod m1), x ≡ c2 (mod m2) has no solution if c1 ̸≡ c2 (mod d). On the other hand, if c1 ≡ c2 (mod d), then there is an integer c such that x ∈ Z satisfies the system if and only if it satisfies the congruence x ≡ c (mod [m1, m2]). Proof. If the given system is to have a solution x ∈ Z, we must have x ≡ c1 (mod d), x ≡ c2 (mod d), and thus c1 ≡ c2 (mod d) as well. Hence it follows that the system cannot have a solution when c1 ̸≡ c2 (mod d). From now on, suppose that c1 ≡ c2 (mod d). The first congruence of the system is satisfied by those integers x which are of the form x = c1 + tm1, where t ∈ Z is arbitrary. Such an integer x satisfies the second congruence of the system if and only if c1 + tm1 ≡ c2 (mod m2), i.e., tm1 ≡ c2 − c1 (mod m2). By the theorem about solutions of linear congruences, this congruence (in variable t) is solvable since d = (m1, m2) divides c2 − c1, and t satisfies this congruence if and only if t ≡ c2 − c1 d · (m1 d )φ ( m2 d ) −1 ( mod m2 d ) , i.e., if and only if x = c1 + tm1 = c1 + (c2 − c1) · (m1 d )φ ( m2 d ) + r m1m2 d = c + r · [m1, m2], where r ∈ Z is arbitrary and c = c1 + (c2 − c1) · (m1/d)φ(m2/d) , as m1m2 equals d · [m1, m2]. We have thus found such an integer c that every x ∈ Z satisfies the system if and only if x ≡ c (mod [m1, m2]), as wanted. 
□ 1003 which is equivalent to the linear congruence 5y ≡ 5 (mod 10), y ≡ 1 (mod 2). This congruence is satisfied by y ∈ {−3, −1, 1, 3, 5}; the original congruence is thus (substituting x ≡ 2y (mod 11)) satisfied by x ∈ {−1, 2, −3, −4, −5}. □ 11.D.15. Solve the congruence x3 − 3x + 5 ≡ 0 (mod 105). ⃝ 11.D.16. Determine the number of solutions of the congru- ence x5 ≡ 534 (mod 232 ). Solution. The given congruence is equivalent to x5 ≡ 5 (mod 232 ), and since we have (5, φ(23)) = 1, it follows from the theorem on solvability of binomial congruences that the congruence has a unique solution if considered modulo 23. Furthermore, this solution is surely not a multiple of 23. Therefore, considering the polynomial whose roots we are looking for, its derivative (x5 − 5)′ = 5x4 does not evaluate to a multiple of 23 at the wanted solution, either. Invoking Hensel’s lemma, we can summarize that the original congruence has a unique solution (without having to describe it explicitly). □ 11.D.17. Give an example of a polynomial congruence whose degree is less than the number of its solutions. Solution. Taking into account theorem 11.4.8, we must use either a modulus which is composite or a polynomial all of whose coefficients will be multiples of the modulus. As an example of a congruence of the first kind, we can put x2 ≡ 1 (mod 8), which is a quadratic congruence with four solutions 1, 3, 5, 7. The case if a prime modulus can be exemplified by the quadratic congruence 10x2 − 15 ≡ 0 (mod 5), which has five solutions. □ 11.D.18. Other types of congruences. Prove that for any natural number n, the integer 111 + 2222n−1 is divisible by 127. Solution. We are to prove that the congruence 2222n−1 ≡ −111 (mod 127) is satisfied for every n ∈ N. This congruence is equivalent to 2222n−1 ≡ 222 (mod 127). Since 27 = 128 ≡ 1 (mod 127), the order of 2 modulo 127 equals 7, so the congruence to be proved is (by 11.3.13) equivalent to 222n−1 ≡ 22 (mod 7). CHAPTER 11. ELEMENTARY NUMBER THEORY We can notice that the proof of this theorem is constructive, i.e., it yields a formula for finding the integer c. This theorem thus gives us a procedure how to catch the condition that an integer x satisfies a given system by a single congruence. This new congruence is then of the same form as the original one. Therefore, we can apply this procedure to a system of more congruences – first, we create a single congruence from the first and second congruences of the system (satisfied by exactly those integers x which satisfy the original two); then, we create another congruence from the new one and the third one of the original system, and so on. Each step reduces the number of congruences by one; after a finite number of steps, we thus arrive at a single congruence which describes all solutions of the given system. It follows from the procedure we have just mentioned (supposing the condition from below holds) that a system of congruences always has a solution, and this is unique. Theorem (Chinese remainder theorem). Let m1, , . . . , mk ∈ N be pairwise coprime, a1, . . . , ak ∈ Z. Then, the system x ≡ a1 (mod m1), ... x ≡ ak (mod mk) has a unique solution modulo m1 · m2 · · · mk. Remark. The unusual name of this theorem comes from Chinese mathematician Sun Tzu of the 4th century. In his text, he asked for an integer which leaves remainder 2 when divided by 3, leaves remainder 3 when divided by 5, and again remainder 2 when divided by 7. The answer is rumored to be hidden in the following song: Proof. 
It is a simple consequence of the previous proposition about the form of the solution of a system of two congruences. However, as we show here, this result can also be proved directly. Let us denote M := m1m2 · · · mr and ni = M/mi for every i, 1 ≤ i ≤ r. Then, for any i, mi is coprime to ni, so there is an integer bi ∈ {1, . . . , mi − 1} such that bini ≡ 1 (mod mi). Note that bini is divisible by all the numbers mj, 1 ≤ j ≤ r, i ̸= j. Therefore, the wanted solution of the system is the integer x = a1b1n1 + a2b2n2 + · · · + arbrnr. □ 1004 Similarly, the order of 2 modulo 7 is 3, which leads to the (again equivalent) congruence 22n−1 ≡ 2 (mod 3), (−1)2n−1 ≡ −1 (mod 3), and this is apparently true (we could also have proceed likewise – the order of 2 modulo 3 is 2, and so on). This proves the statement. □ 11.D.19. Determine which natural numbers n satisfy that the integer n · 2n + 1 is divisible by seven. Solution. We are looking for the solution of the con- gruence n · 2n ≡ −1 (mod 7). We should be aware of the fact that we cannot use the theorem 11.4.1 since n · 2n is not a polynomial in variable n, so it is not guaranteed (and it is even not true) that the expression will yield the same remainder modulo 7 when evaluated at integers which are congruent modulo 7. On the other hand, we can notice that the order of 2 modulo 7 is equal to 3, so we can split the problem into three cases according to the remainder of n when divided by 3. For n ≡ 0 (mod 3), we have 2n ≡ 1 (mod 7), so the congruence in question is equivalent to n ≡ −1 (mod 7). Combining the conditions n ≡ 0 (mod 3) and n ≡ −1 (mod 7) in the Chinese remainder theorem leads to the solution n ≡ 6 (mod 21). Now, for n ≡ 1 (mod 3), we have 2n ≡ 2 (mod 7), so the examined congruence is of the form 2n ≡ −1 (mod 7), which is equivalent to n ≡ 3 (mod 7). The conditions n ≡ 1 (mod 3) and n ≡ 3 (mod 7) are satisfied iff n ≡ 10 (mod 21). Finally, for n ≡ 2 (mod 3), we have 2n ≡ 4 (mod 7), and the solution of the congruence 4n ≡ −1 (mod 7) is n ≡ 5 (mod 7). Altogether, n ≡ 5 (mod 21). The problem is satisfied by exactly those natural numbers n with n ≡ 5, 6, 10 (mod 21). □ 11.D.20. Prove that for any natural number n, the integer 2n4 + n3 + 50 is divisible by 6 if and only if the integer 2 · 4n + 3n + 50 is divisible by 13. Solution. The expression f(n) = 2n4 +n3 +50 is a polynomial in variable n, so in this case, we can make use of theorem 11.4.1, i.e., it suffices to go through all possible remainders modulo 6. Since the order of 4 modulo 13 is equal to 6 and the order of 3 modulo 13 equals 3, it is enough (by 11.3.13) to examine the remainder of n upon division by 6 in the latter case as well. In the former case, we calculate n 0 1 2 3 4 5 f(n) mod 6 2 5 0 5 2 3 CHAPTER 11. ELEMENTARY NUMBER THEORY Let us emphasize that this is quite a strong theorem (which is actually valid in much more general algebraic structures), which allows us to guarantee that for any remainders with respect to given (pairwise coprime) moduli, there exists an integer with the given remainders. 11.4.5. Higher-order congruences. Now, let us get back to the more general case of congruences. f(x) ≡ 0 (mod m), where f(x) is a polynomial with integer coefficients and m ∈ N. So far, we have only one method at our disposal, which is tedious, yet universal – to try all possible remainders modulo m. When solving such a congruence, it is sufficient to find out for which integers a, 0 ≤ a < m, it holds that f(a) ≡ 0 (mod m). 
The disadvantage of this method is its complexity, which increases as m does. If m is composite, i.e., m = pα1 1 . . . pαk k , where p1, . . . , pk are distinct primes, and k > 1, we can replace the original congruence by the system of congruences f(x) ≡ 0 (mod pα1 1 ), ... f(x) ≡ 0 (mod pαk k ), which has the same set of solutions. However, we can solve the congruences separately. The advantage of this method is in that the moduli of the congruences of the system are less than the modulus of the original congruence. Example. Consider the congruence x3 − 2x + 11 ≡ 0 (mod 105). If we were to try out all possibilities, we would have to compute the value of f(x) = x3 −2x+11 for the 105 values f(0), f(1), . . . , f(104). Therefore, we better factor 105 = 3 · 5 · 7 and solve the congruences f(x) ≡ 0 for moduli 3, 5, and 7. We evaluate the polynomial f(x) in convenient integers: x −3 −2 −1 0 1 2 3 f(x) −10 7 12 11 10 15 32 . The congruence f(x) ≡ 0 (mod 3) thus has solution x ≡ −1 (mod 3) (only the first one of the integers 12, 11, 10 is a multiple of 3); the congruence f(x) ≡ 0 (mod 5) has solutions x ≡ 1 and x ≡ 2 (mod 5); finally, the solution of the congruence f(x) ≡ 0 (mod 7) is x ≡ −2 (mod 7). It remains to solve two systems of congruences: x ≡ −1 (mod 3), x ≡ 1 (mod 5), x ≡ −2 (mod 7) and x ≡ −1 (mod 3), x ≡ 2 (mod 5), x ≡ −2 (mod 7). Solving these systems, we can find out that the solutions of the given congruence f(x) ≡ 0 (mod 105) are exactly those integers x which satisfy x ≡ 26 (mod 105) or x ≡ 47 (mod 105). 1005 Therefore, the congruence f(n) ≡ 0 (mod 6) is satisfied by exactly those natural numbers n which satisfy n ≡ 2 (mod 6). In the latter case, we gradually compute that n 0 1 2 3 4 5 4n mod 13 1 4 3 −1 −4 −3 3n mod 13 1 3 9 1 3 9 2 · 4n + 3n − 2 mod 13 1 9 0 −3 −7 1 Just like in the former case, the congruence 2·4n +3n +50 ≡ 0 (mod 13) is satisfied if and only if n ≡ 2 (mod 6). □ 11.D.21. Solve the congruence x2 ≡ 18 (mod 63). Solution. Since (18, 63) = 9, it must be that 9 | x2 , i.e., 3 | x. Setting x = 3x1, x1 ∈ Z, we get an equivalent congruence x2 1 ≡ 2 (mod 7), which already satisfies that the modulus is coprime to the integer on the right-hand side. It follows from theorem 11.4.8 that this congruence has at most 2 solutions, and those are clearly x1 ≡ ±3 (mod 7), i.e., x1 ≡ ±3, ±10, ±17, ±24, ±31, ±38, ±45, ±52, ±59 (mod 63). The solution of the original congruence is thus x ≡ 3x1 (mod 63), i.e., x ≡ ±9, ±12, ±30 (mod 63). □ 11.D.22. Solve the congruence x3 ≡ 3 (mod 18). Solution. Since (3, 18) = 3, we must have 3 | x. Making the substitution x = 3·x1, similarly to the above exercise, we get the congruence 27x3 1 ≡ 3 (mod 18), which has no solution since (27, 18) ∤ 3. □ Quadratic congruences. In the theoretical column we state that any quadratic congruence can be transformed to the (possibly system of congruences of) binomial form x2 ≡ a (mod p), and then we can decide about the solvability using the Legendre symbol. Let us illustrate it on several examples. 11.D.23. Determine the number of solutions of the congruence 13x2 + 7x + 1 ≡ 0 (mod 37). Solution. First, we need to normalize the polynomial on the left-hand side, i.e. we have to find the inverse of 13 modulo 37. Using the Eucliean algorithm we find that the inverse is 20, and after multiplication of both sides of the congruence by it and reducing modulo 37 we obtain the congruence x2 +29x+ 20 ≡ 0 (mod 37). 
Now we complete the square (an odd coefficient 29 does not cause any trouble as it can be replaced by −8) and we obtain (x − 4)2 + 4 ≡ 0 (mod 37). After substitution of y for x − 4 we finally obtain the congruence in a binomial form y2 ≡ −4 (mod 37). CHAPTER 11. ELEMENTARY NUMBER THEORY It is not always possible to replace the congruence with a system of congruences modulo primes, as in the above example: if the original modulus is a multiple of a higher power of a prime, then we cannot “get rid” of this power. However, even such a congruence modulo a power of prime need not be solved by examining all possibilities. There is a more efficient tool, which is described by the following theorem. 11.4.6. Theorem (Hensel’s lemma). Let p be a prime, f(x) ∈ Z[x], a ∈ Z such that p | f(a), p ∤ f′ (a). Then, for every n ∈ N, the system x ≡ a (mod p), f(x) ≡ 0 (mod pn ) has a unique solution modulo pn . Proof. We will proceed by induction on n. In the case of n = 1, the congruence f(x) ≡ 0 (mod p1 ) is only another formulation of the assumption that the integer a satisfies p | f(a). Further, let n > 1 and suppose the proposition is true for n − 1. If x satisfies the system for n, then it does so for n−1 as well. Denoting one of the solutions of the system for n − 1 as cn−1, we can look for the solution of the system for n in the form x = cn−1 + k · pn−1 , where k ∈ Z. We need to find out for which k we have f ( cn−1 + k · pn−1 ) ≡ 0 (mod pn ). We know that pn−1 | f ( cn−1 + k · pn−1 ) . Now, we use the binomial theorem for f(x) = amxm + · · · + a1x + a0, where a0, . . . , am ∈ Z. We have ( cn−1 + k · pn−1 )i ≡ ci n−1 + i · ci−1 n−1 · kpn−1 (mod pn ), hence f ( cn−1 + k · pn−1 ) ≡ f(cn−1) + k · pn−1 f′ (cn−1). Therefore, f ( cn−1 + k · pn−1 ) ≡ 0 (mod pn ) ⇐⇒ ⇐⇒ 0 ≡ f(cn−1) pn−1 + k · f′ (cn−1) (mod p). Since cn−1 ≡ a (mod p), we get f′ (cn−1) ≡ f′ (a) ̸≡ 0 (mod p), so (f′ (cn−1), p) = 1. By the theorem about the solutions of linear congruences, we can hence see that there is (modulo p) a unique solution k of this congruence, and since cn−1 was, by the induction hypothesis, the only solution modulo pn−1 , the integer cn−1 +k ·pn−1 is the only solution of the given system modulo pn . □ Example. Consider the congruence 3x2 + 4 ≡ 0 (mod 49). The congruence can be equivalently transformed (by solving the linear congruence 3y ≡ 1 (mod 49) and multiplying both sides of the congruence by the integer y ≡ 33) to the form x2 ≡ 15 (mod 72 ). Then, we proceed as in the constructive proof of Hensel’s lemma. 1006 The fact that this congruence is solvable can be established either using theorem 11.4.10, or with use of the Legendre symbol. The former approach leads to the calculation d = (2, φ(37)) = 2, and (−4) 36 2 ≡ 1 (mod 37), while the latter one gives ( −4 37 ) = ( −1 37 ) · ( 2 37 )2 = 1 by the corollary after theorem 11.4.13 (as 37 ≡ 1 (mod 4)). In any way we have obtained that the given congruence has d = 2 solutions. □ 11.D.24. Solve the congruence 6x2 +x−1 ≡ 0 (mod 29). Solution. Although we have not presented any special method for finding solutions of quadratic congruence yet (apart from the general method for binomial congruences or going through the complete residue system) we will see that in some case the set of solutions can be easily established. Let us first proceed in the usual way: multiplying the congruence by 5 (it is the inverse of 6 modulo 29) we obtain x2 + 5x − 5 ≡ 0 (mod 29), and after completing the square we have (x − 12)2 ≡ 4 (mod 29). 
We immediately see that this congruence is solvable with the pair of solutions x−12 ≡ ±2 (mod 29), and thus x ≡ 10, 14 (mod 29). We could have also seen almost immediately that the given polynomial can be factored as 6x2 + x − 1 = (3x − 1)(2x+1), and thus the prime modulus 29 has to divide either 3x − 1 or 2x + 1. The obtained linear congruences 3x ≡ 1 (mod 29) and 2x ≡ −1 (mod 29) easily yield the same solutions x ≡ 10 (mod 29) and x ≡ 14 (mod 29) as above. □ 11.D.25. Find all integers which satisfy the congruence x2 ≡ 7 (mod 43). Solution. The Legendre symbol evaluates to ( 7 43 ) = − ( 43 7 ) = − ( 1 7 ) = −1. Hence it follows that 7 is a quadratic nonresidue modulo 43, so there is no solution of the given congruence. □ 11.D.26. Find all integers a for which the congruence x2 ≡ a (mod 43) is solvable. Solution. This exercise is a follow-up to the above one, from which we can see that the integer 7 does not meet the requirement. We can test all the remainders modulo 43 in the same way, but there is a simpler method. The congruence is surely solvable if a is a multiple of 43 (then, it has a unique solution); CHAPTER 11. ELEMENTARY NUMBER THEORY First, we solve the congruence x2 ≡ 15 ≡ 1 (mod 7), which has at most 2 solutions, and those are x ≡ ±1 (mod 7). These solutions can be expressed in the form x = ±1 + 7t, where t ∈ Z, and substituted into the congruence modulo 49, whence we get the solution x ≡ ±8 (mod 49) (if we were interested solely in the number of solutions, we would not even have to finish the calculation as it follows straight from Hensel’s lemma that every solution modulo 7 gives a unique solution modulo 49 because for f(x) = x2 − 15, we have 7 ∤ f′ (±1)). 11.4.7. Congruences modulo a prime. The solution of general higher-order congruences has thus been reduced to the solution of congruences modulo a prime. As we will see, this is where the stumbling block is since no (much) more efficient universal procedure than trying out all possibilities is known. We can at least mention several statements describing the solvability and number of solutions of such congruences. We will then prove some detailed results for some special cases in further paragraphs. Theorem. Let p be a prime, f(x) ∈ Z[x]. Every congruence f(x) ≡ 0 (mod p) is equivalent to a congruence of degree at most p − 1. Proof. Since it holds for any a ∈ Z that p | ap − a (simple consequence of Fermat’s little theorem), the congruence xp −x ≡ 0 (mod p) is satisfied by all integers. Dividing the polynomial f(x) by xp − x with remainder, we get f(x) = q(x) · (xp − x) + r(x) for suitable f(x), r(x) ∈ Z, where the degree of r(x) is less than that of the divisor, i.e. p. We thus get that the congruence r(x) ≡ 0 (mod p) is equivalent to the congruence f(x) ≡ 0 (mod p), yet it is of degree at most p − 1. □ 11.4.8. Theorem. Let p be a prime, f(x) ∈ Z[x]. If the congruence f(x) ≡ 0 (mod p) has more than deg(f) solutions, then each of the coefficients of the polynomial f is a multiple of p. Proof. In algebraic words, we are actually interested in the number of roots of a non-zero polynomial over a finite field Zp, and by 12.3.4, there are at most deg(f) of them. □ Corollary (Another proof of Wilson’s theorem). If p is a prime, then (p − 1)! ≡ −1 (mod p). Proof. The statement is clearly true for p = 2, so we can consider only odd primes p from now on. 
By Fermat’s little theorem, the congruence (x−1)(x−2) · · · (x−(p−1))−(xp−1 −1) ≡ 0 (mod p) is satisfied by any integer a which is not divisible by p; i.e., there are p−1 solutions. However, its degree is equal to p−2 (which is less than the number of solutions). It follows from 11.4.7 that all of the coefficients of the left-hand polynomial 1007 and if not, it must be a quadratic residue modulo 43. The quadratic residues can be most simply enumerated by calculating the squares of all elements of a reduced residue system modulo 43. The quadratic residues are thus the integers congruent to (±1)2 , (±2)2 , (±3)2 , . . . , (±21)2 modulo 43, so the problem is satisfied by exactly those integers a which are congruent to any one of 1, 4, 6, 9, 10, 11, 13, 14, 15, 16, 17, 21, 23, 24, 25, 31, 35, 36, 38, 40, 41. □ Using law of quadratic reciprocity 11.4.13 we can calculate the value (a/p) for any integer a and an odd prime p. Moreover, evaluation of the Legendre symbol is fast enough even for high arguments, therefore using it is favourable to verifying criteria of the theorem 11.4.10. 11.D.27. Here, we recall the statement of the Law in the slightly modified way which is more suitable for direct calcu- lations. i) −1 is a quadratic residue for primes p which satisfy p ≡ 1 (mod 4) and it is a quadratic nonresidue for primes p satisfying p ≡ 3 (mod 4). ii) 2 is a quadratic residue for primes p which satisfy p ≡ ±1 (mod 8) and it is a quadratic nonresidue for primes p satisfying p ≡ ±3 (mod 8). iii) If p ≡ 1 (mod 4) or q ≡ 1 (mod 4), then (p/q) = (q/p); for other odd p, q, we have (p/q) = −(q/p). Solution. We simply apply law of quadratic reciprocity in the appropriate cases. i) The integer p−1 2 is even iff 4 | p − 1. ii) We need to know for which odd primes p the exponent is p2 −1 8 is even. Odd primes are congruent to ±1 or ±3 modulo 8, so we have (by 11.C.7) that either p2 ≡ 1 (mod 16) or p2 ≡ 9 (mod 16). iii) This is clear from the law of quadratic reciprocity. □ 11.D.28. Derive by straight calculation from Gauss’s lemma 11.4.14 once again the so-called supplementary laws of quadratic reciprocity: ( −1 p ) = (−1) p−1 2 and ( 2 p ) = (−1) p2 −1 8 . Solution. To evaluate (−1/p) in the former case, we should realize that µ tells the number of least (in absolute value) negative remainders of integers in the set { −1, −2, . . . , −p−1 2 } . However, those are exactly the desired remainders and they are all negative; hence we have µ = p−1 2 and (−1/p) = (−1) p−1 2 . In the latter case, we need to express the number of least (in absolute value) negative remainders of integers in the set { 1 · 2, 2 · 2, 3 · 2 . . . , p−1 2 · 2 } . CHAPTER 11. ELEMENTARY NUMBER THEORY are multiples of p. In particular, this applies to the absolute term, which equals (p−1)!+1. This proves Wilson’s theorem. □ 11.4.9. Binomial congruences. This part will be devoted to solving special types of higher-order polynomial congruences, the so-called binomial congruences. It is an analog to the binomial equations, where the polynomial f(x) is xn − a. It can easily be shown that we can restrict ourselves to the condition that a be coprime with the modulus of the congruence – otherwise, we can always equivalently transform the congruence into this form or decide that it has no solution. Quadratic and power residues Let m ∈ N, a ∈ Z, (a, m) = 1. The integer a is said to be a n-th power residue modulo m, or residue of degree n modulo m if and only if the congruence xn ≡ a (mod m) is solvable. 
Otherwise, we call a a n-th power nonresidue modulo m, or nonresidue of degree n modulo m. For n = 2, 3, 4, we use the adjectives quadratic, cubic, and quartic residue (or nonresidue) modulo m. Now, we will show how to solve binomial congruences modulo m, if there are primitive roots modulo m (in particular, when the modulus is an odd prime or its power). 11.4.10. Theorem. Let m ∈ N be such that there are primitive roots modulo m. Further, let a ∈ Z, (a, m) = 1. Then, the congruence xn ≡ a (mod m) is solvable (i.e., a is an n-th power residue modulo m) if and only if aφ(m)/d ≡ 1 (mod m), where d = (n, φ(m)). And if so, it has exactly d solutions. Proof. Let g be a primitive root modulo m. Then, for any x coprime to m, there is a unique integer y (its discrete logarithm) with the property 0 ≤ y < φ(m) such that x ≡ gy (mod m). Similarly, for a given a, there is a unique b ∈ Z; 0 ≤ b < φ(m) such that a ≡ gb (mod m). After this substitution, the binomial congruence in question is thus equivalent to the congruence (gy )n ≡ gb (mod m) and, invoking theorem 11.3.13, to the linear congruence n · y ≡ b (mod φ(m)) as well. However, this congruence n · y ≡ b (mod φ(m)) is solvable if and only if d = (n, φ(m)) | b (and if so, it has d solutions). It remains to prove that d | b if and only if aφ(m)/d ≡ 1 (mod m). However, the congruence 1 ≡ aφ(m)/d ≡ gbφ(m)/d (mod m) is true if and only if φ(m) | bφ(m) d , which happens if and only if d | b. □ 1008 For any k ∈ { 1, 2, . . . , p−1 2 } , the integer 2k leaves a negative remainder if and only if 2k > p−1 2 , i.e., iff k > p−1 4 . Now, it remains to determine the number of such integers k. If p ≡ 1 (mod 4), then this number is equal to p−1 2 − p−1 4 = p−1 4 , so ( −1 p ) = (−1)µ = (−1) p−1 4 = (−1) p−1 4 · p+1 2 = (−1) p2 −1 8 , since p+1 2 is odd in this case. Similarly, for p ≡ 3 (mod 4), the number of such integers k equals p−1 2 − p−3 4 = p+1 4 , so ( −1 p ) = (−1) p+1 4 = (−1) p+1 4 · p−1 2 = (−1) p2 −1 8 , since p−1 2 is odd in this case as well. □ 11.D.29. Solve the congruence x2 − 23 ≡ 0 (mod 77). Solution. Factoring the modulus, we get the system x2 − 1 ≡ 0 (mod 11), x2 − 2 ≡ 0 (mod 7). Clearly, 1 is a quadratic residue modulo 11, so the first congruence of the system has (exactly) two solutions: x ≡ ±1 (mod 11). Further, (2/7) = (9/7) = 1, and it should not take much effort to notice the solution: x ≡ ±3 (mod 7). We have thus obtained four simple systems of two linear congruences each. Solving them, we will get that the original congruence has the following four solutions: x ≡ 10, 32, 45 or 67 (mod 77). □ 11.D.30. Solve the congruence 7x2 + 112x + 42 ≡ 0 (mod 473). ⃝ Jacobi symbol. Jacobi symbol (a/b) is a generalization of the Legendre symbol to the case where the “lower” argument b need not be a prime, but any odd positive integer. It is defined as the product of the Legendre symbols corresponding to the prime factors of b: if b = pα1 1 · · · pαk k , then ( a b ) = ( a p1 )α1 · · · ( a pk )αk . The primary motivation for introducing the Jacobi symbol is the necessity to evaluate the Legendre symbol (and thus to decide the solvability of quadratic congruences) without having to factor integers to primes. We will illustrate such calculation on an example having in mind that Jacobi symbol shares with the Legendre one not only the notation but also almost all of the (computational) properties. 11.D.31. Decide whether the congruence x2 ≡ 219 (mod 383) is solvable. CHAPTER 11. ELEMENTARY NUMBER THEORY Corollary. 
If the assumptions of the above theorem hold and, moreover, (n, φ(m)) = 1, the congruence xn ≡ a (mod m) always has a unique solution. In other words, exponentiation to the n-th power (where n is coprime to φ(m)) is a bijection on the set Z× m of invertible residue classes modulo m (it is even an automorphism of the group (Z× m, ·)). 11.4.11. Quadratic congruences and the Legendre symbol. Now, our task is to find an efficient condition determining whether a quadratic congru- ence ax2 + bx + c ≡ 0 (mod m) is solvable (and if so, how many solutions it has). It can easily be seen from the presented theory that if we want to decide whether this congruence is solvable, it suffices to decide this for the (binomial) congruence x2 ≡ a (mod p), where p is an odd prime and a is an integer coprime to it. A congruence modulo a composite m can be decomposed to an equivalent system of congruences modulo the particular factors of the integer m, which are powers of primes. Such congruences can be transformed to quadratic congruences with prime modulus using the procedure described in Hensel’s lemma 11.4.6. Norming this congruence and completing the square then results in the aforementioned form. To decide the solvability of a congruence, we can, of course, use the theorem 11.4.10 about the solvability of binomial congruences. Its application is, however, often limited by time resources; we will thus try to find a criterion which will be computationally easier in (not only) the quadratic case. Example. Let us determine the number of solutions of the congruence x2 ≡ 219 (mod 383). Since 383 is a prime and (2, φ(383)) = 2, it follows from theorem 11.4.10 that the given congruence is solvable (and it has 2 solutions) if and only if 219 φ(383) 2 = 219191 ≡ 1 (mod 383). It is not easy to verify this proposition without some computational power (though, this can still be calculated on a “piece of paper”). However, we will show that this condition can be verified much more easily using the properties of the so-called Legendre symbol. Legendre symbol Let p be an odd prime and a an integer. The Legendre symbol is defined by ( a p ) =    1 for p ∤ a, a is a quadratic residue modulo p, 0 for p | a, −1 if a is a quadratic nonresidue modulo p. The Legendre symbol is also often written as (a/p) and usually read as “a on p”. Example. Since the congruence x2 ≡ 1 (mod p) is solvable for an arbitrary odd prime p, we have (1/p) = 1. Further, 1009 Solution. Since 383 is a prime, the congruence will be solvable if the Legendre symbol will satisfy (219/383) = 1. ( 219 383 ) = − ( 383 219 ) (Jacobi) as 383 ≡ 219 ≡ 3 (mod 4) = − ( 164 219 ) = − ( 41 219 ) 164 = 22 · 41 = − ( 219 41 ) (Jacobi) as 41 ≡ 1 (mod 4) = − ( 14 41 ) = − ( 2 41 )( 7 41 ) = − ( 7 41 ) as 41 ≡ 1 (mod 8) = − ( 41 7 ) as 41 ≡ 1 (mod 4) = − ( −1 7 ) = 1 as 7 ≡ 3 (mod 4). □ Now, we introduce several exercises proving that the Jacobi symbol has properties similar to the Legendre one, which relieves us of the necessity to factor the integers that appear when working purely with the Legendre symbol. 11.D.32. Prove that all odd positive numbers b, b′ and all integers a, a1, a2 satisfy (the symbols used here are always the Jacobi ones): i) if a1 ≡ a2 (mod b), then (a1 b ) = (a2 b ) , ii) (a1a2 b ) = (a1 b )(a2 b ) , iii) ( a bb′ ) = (a b )(a b′ ) . ⃝ 11.D.33. Prove that if a, b are odd natural numbers, then i) ab−1 2 ≡ a−1 2 + b−1 2 (mod 2), ii) a2 b2 −1 8 ≡ a2 −1 8 + b2 −1 8 (mod 2). Solution. 
i) Since the integer (a − 1)(b − 1) = (ab − 1) − (a − 1) − (b−1) is a multiple of 4, we get (ab−1) ≡ (a−1)+(b− 1) (mod 4), which gives what we want when divided by two. ii) Similarly to above, (a2 − 1)(b2 − 1) = (a2 b2 − 1) − − (a2 − 1) − (b2 − 1) is a multiple of 16. Therefore, (a2 b2 − 1) ≡ (a2 − 1) + (b2 − 1) (mod 16), which gives the wanted statement when divided by eight (see also exercise 11.A.2). □ 11.D.34. Prove that if a1, . . . , ak are odd natural numbers, then i) ∏k ℓ=1 aℓ−1 2 ≡ ∑k ℓ=1 aℓ−1 2 (mod 2), ii) ∏k ℓ=1 a2 ℓ −1 8 ≡ ∑k ℓ=1 a2 ℓ −1 8 (mod 2). ⃝ CHAPTER 11. ELEMENTARY NUMBER THEORY (−1/5) = (4/5) = 1, because the congruence x2 ≡ −1 (mod 5) is equivalent to the congruence x2 ≡ 4 (mod 5), whose solutions are x ≡ ±2 (mod 5). The statement of the following lemma will be very often used when evaluating the Legendre symbol in practice. 11.4.12. Lemma. Let p be an odd prime, a, b ∈ Z arbitrary. Then: (1) (a p ) ≡ a p−1 2 (mod p). (2) (ab p ) = (a p )(b p ) . (3) If a ≡ b (mod p), then (a p ) = (b p ) . Proof. (1) The statement is clear for p | a; if a is a quadratic residue modulo p, then the statement follows from the theorem about the solvability of quadratic congruences, which claims (in this case, we have (φ(p), 2) = 2) that the necessary and sufficient condition for the congruence x2 ≡ a (mod p) to be solvable that a p−1 2 ≡ 1 (mod p). The same theorem implies for the case of a quadratic nonresidue as well that we have a p−1 2 ̸≡ 1 (mod p). However, then (since we have p | ap−1 − 1 = (a p−1 2 − 1)(a p−1 2 + 1) by Fermat’s theorem), necessarily p | a p−1 2 + 1, i.e., a p−1 2 ≡ −1 (mod p). (2) From (1), we have ( ab p ) ≡ (ab) p−1 2 = a p−1 2 b p−1 2 ≡ ( a p )( b p ) (mod p). However, since the values of the Legendre symbol belong to the set {−1, 0, 1}, this congruence immediately implies that the left and right sides are equal. (3) Apparent from the definition. □ Corollary. (1) Any reduced residue system modulo p contains the same number of quadratic residues as non- residues. (2) The product of two quadratic residues as well as the product of two quadratic nonresidues is a residue; the product of a residue and a nonresidue is a nonresidue. (3) (−1/p) = (−1) p−1 2 , i.e., the congruence x2 ≡ −1 (mod p) is solvable if and only if p ≡ 1 (mod 4). Proof. (1) Considering the elements of a reduced residue system modulo p (we can take, for instance, the set {−p−1 2 , . . . , −1, 1, . . . , p−1 2 }), the quadratic residues are exactly those integers which are congruent to one of (±1)2 , . . . , (±p−1 2 )2 . Thus there are exactly p−1 2 of quadratic residues, so there are p − 1 − p−1 2 = p−1 2 of the other ones (the quadratic nonresidues). (2) This follows immediately from part (2) and the previous lemma. 1010 11.D.35. Prove the law of quadratic reciprocity for the Jacobi symbol, i.e., prove that if a, b are odd natural numbers, then i) (−1 a ) = (−1) a−1 2 , ii) (2 a ) = (−1) a2−1 8 , iii) (a b ) = (b a ) · (−1) a−1 2 b−1 2 . Solution. Let (just like in the definition of the Jacobi symbol) a factor to (odd) primes as p1p2 · · · pk. i) The properties of the Legendre symbol and the aforementioned statement imply that ( −1 a ) = ( −1 p1 ) · ( −1 p2 ) · · · ( −1 pk ) = = (−1) p1−1 2 · · · (−1) pk−1 2 = = (−1) ∑k i=1 pi−1 2 = = (−1) ∏k i=1 pi−1 2 = (−1) a−1 2 . ii) Analogously to above. iii) Further, let b factor to (odd) primes as q1q2 · · · qℓ. If we have pi = qj for some i and j, then the symbols on both sides of the equality are equal to zero. 
Otherwise, the law of quadratic reciprocity for the Legendre symbol implies that for all pairs (pi, qj), we have ( pi qj ) = ( qi pj ) · (−1) pi−1 2 qj −1 2 . Therefore, ( a b ) = k∏ i=1 ℓ∏ j=1 ( pi qj ) = = k∏ i=1 ℓ∏ j=1 ( qj pi ) · (−1) pi−1 2 qj −1 2 = = k∏ i=1 (−1) pi−1 2 ∑ℓ j=1 qj −1 2 ℓ∏ j=1 ( qj pi ) = = k∏ i=1 (−1) pi−1 2 ∏ℓ j=1 qj −1 2 ℓ∏ j=1 ( qj pi ) = = k∏ i=1 (−1) pi−1 2 b−1 2 ℓ∏ j=1 ( qj pi ) = = (−1) b−1 2 ∑k i=1 pi−1 2 k∏ i=1 ℓ∏ j=1 ( qj pi ) = = (−1) a−1 2 b−1 2 ( b a ) . We utilized the result of part (i) of the previous exercise in the calculations. □ 11.D.36. Determine whether the congruence x2 ≡ 38 (mod 165) is solvable. CHAPTER 11. ELEMENTARY NUMBER THEORY (3) It follows from part (1) of the lemma that (−1/p) ≡ (−1) p−1 2 (mod p); both sides, however, take on the values ±1, so they must be equal. □ These basic statements about the values of the Legendre symbol are already sufficient for proving the theorem on the infinitude of primes of the form 4k + 1 (see paragraph 11.2.3). Proposition. There are infinitely many primes of the form 4k + 1. Proof. We will proceed by contradiction. Suppose that p1, p2, . . . , pℓ is the enumeration of all primes of the form 4k + 1, and consider the integer N = (2p1 · · · pℓ)2 + 1. This integer is of the form 4k+1 as well. The assumption that N is a prime would lead to an immediate contradiction, since N is surely greater than any of the integers p1, p2, . . . , pℓ. Therefore, from now on, let us suppose that it is thus composite. Then, there must exist a prime p which divides N. Apparently, none of the primes 2, p1, p2, . . . , pℓ divides N, so we will be finished if we prove that p is also of the form 4k + 1. It follows from the congruence (2p1 · · · pℓ)2 ≡ −1 (mod p), that (−1/p) = 1, and this is true (by the previous corollary) if and only if p ≡ 1 (mod 4). Altogether, we have reached a contradiction (a prime p not belonging to the original list of all primes of the form 4k + 1) in the case of composite N as well, which proves that there are infinitely many such primes. □ The most important theorem which allows us to efficiently compute the value of the Legendre symbol (and thus determine the solvability of a quadratic congruence), is the so-called law of quadratic reciprocity. Law of quadratic reciprocity 11.4.13. Theorem. Let p, q be odd primes. Then, (1) (−1 p ) = (−1) p−1 2 , (2) (2 p ) = (−1) p2−1 8 , (3) (q p ) = (p q ) · (−1) p−1 2 q−1 2 . The theorem is put this way mainly because we can calculate the value (a/p) for any integer a using these three formulae and the basic rules for the Legendre symbol. Example. Let us calculate the value (79/101) using the properties of the Legendre symbol. 1011 Solution. The Jacobi symbol is equal to ( 38 165 ) = ( 2 165 ) · ( 19 165 ) = (2 3 ) · (2 5 ) · ( 2 11 ) · (19 3 ) · (19 5 ) · (19 11 ) = (−1)3 (1 3 ) · (−1 5 ) · ( 2 11 )3 = 1. This result does not answer the question of the existence of a solution. However, if we split the congruence to a system of congruences according to the factors of the modulus, we obtain x2 ≡ −1 (mod 3), x2 ≡ 3 (mod 5), x2 ≡ 5 (mod 11), whence we can easily see that the first and second congruences have no solution. In particular,(−1 3 ) = −1 and (3 5 ) = (5 3 ) = (2 3 ) = −1 . Therefore, neither the original congruence has a solution. □ 11.D.37. Find all primes p such that the integer below is a quadratic residue modulo p: i) 3, ii) − 3, iii) 6. Solution. i) We are looking for primes p ̸= 3 such that x2 ≡ 3 (mod p) is solvable. 
Since p = 2 satisfies the above, we will consider only odd primes p ̸= 3 from now on. For p ≡ 1 (mod 4), it follows from the law of quadratic reciprocity that 1 = (3/p) = (p/3), which occurs if and only if p ≡ 1 (mod 3). On the other hand, if p ≡ −1 (mod 4), then 1 = (3/p) = −(p/3), which holds for p ≡ −1 (mod 3). Putting the conditions of both cases together, we arrive at p ≡ ±1 (mod 12), which, together with p = 2, completes the set of all primes satisfying the given condition. ii) The condition 1 = (−3/p) = (−1/p)(3/p) is satisfied if either (−1/p) = (3/p) = 1 or (−1/p) = (3/p) = −1. In the former case (using the result of the previous item), this means that p ≡ 1 (mod 4) and p ≡ ±1 (mod 12). In the latter case, we must have p ≡ −1 (mod 4) and p ≡ ±5 (mod 12), at the same time – we can take, for instance, the set {−5, −1, 1, 5} for a reduced residue system modulo 12, and since (3/p) = 1 for p ≡ ±1 (mod 12), we surely have (3/p) = −1 whenever p ≡ ±5 (mod 12). We have thus obtained four systems of two congruences each. Two of them have no solution, and the remaining two are satisfied by p ≡ 1 (mod 12) and p ≡ −5 (mod 12), respectively. iii) In this case, (6/p) = (2/p)(3/p) and once again, there are two possibilities: either (2/p) = (3/p) = 1 or (2/p) = (3/p) = −1. The former case occurs if p satisfies p ≡ ±1 (mod 8) as well as p ≡ ±1 (mod 12). Solving the corresponding systems of linear congruences leads to the condition p ≡ ±1 (mod 24). In the latter case, we get p ≡ ±3 (mod 8) as well as p ≡ ±5 (mod 12), which together gives p ≡ ±5 (mod 24). Let us remark that thanks to Dirichlet’s theorem 11.2.5, the number of primes we were interested in is infinite in each of the three problems. □ CHAPTER 11. ELEMENTARY NUMBER THEORY ( 79 101 ) = ( 101 79 ) since 101 is congruent to 1 modulo 4 = ( 22 79 ) = ( 2 79 ) · ( 11 79 ) = ( 11 79 ) since 79 is congruent to −1 modulo 8 = (−1) ( 79 11 ) since 11 ≡ 79 ≡ 3 (mod 4) = (−1) ( 2 11 ) = 1 since 11 ≡ 3 (mod 8). Many proofs of the the quadratic reciprocity law can be found in literature6 . However, many of them (especially the shorter ones) usually make use of deeper knowledge from algebraic number theory. We will present an elementary proof of this theorem here. Let S denote the reduced residue system of the least residues (in absolute value) modulo p, i.e., S = { −p−1 2 , −p−3 2 , . . . , −1, 1, . . . , p−3 2 , p−1 2 } . Further, for a ∈ Z, p ∤ a, let µp(a) denote the number of negative least residues (in absolute value) of the integers 1 · a, 2 · a, . . . , p − 1 2 · a, i.e., we decide for each of these integers to which integer from the set S it is congruent and count the number of the negative ones. If it is clear from context which values a, p we mean, we will usually omit the parameters and write only µ instead of µp(a). Example. We determine µp(a) for the prime p = 11 and the integer a = 3. Now, the reduced residue system we are interested in is S = {−5, . . . , −1, 1, , . . . , 5}, and for a = 3, we calculate 1 · 3 ≡ 3 (mod 11) 2 · 3 ≡ −5 (mod 11) 3 · 3 ≡ −2 (mod 11) 4 · 3 ≡ 1 (mod 11) 5 · 3 ≡ 4 (mod 11), whence µ11(3) = 2. We will show in the following statement that this integer is tightly connected to the Legendre symbol – the value of the symbol (3/11) can be determined in terms of the µ function as (−1)µ11(3) = (−1)2 = 1. 6In 2000, F. Lemmermeyer stated 233 proofs – see F. Lemmermeyer, Reciprocity laws. From Euler to Eisenstein, Springer. 2000 1012 11.D.38. 
The following exercise illustrates that if the modulus of a quadratic congruence is a prime p satisfying p ≡ 3 (mod 4), then we are able not only to decide the solvability of the congruence, but also to describe all of its solutions in a simple way. Consider a prime p ≡ 3 (mod 4) and an integer a such that (a/p) = 1. Prove that the solution of the congruence x2 ≡ a (mod p) is x ≡ ±a p+1 4 (mod p). Solution. It can be easily verified (using lemma 11.4.12) that ( a p+1 4 )2 ≡ a p+1 2 ≡ a · (a p ) ≡ a (mod p) . □ 11.D.39. Determine whether the congruence x2 ≡ 3 (mod 59) is solvable. If so, find all of its solutions. Solution. Calculating the Legendre symbol ( 3 59 ) = − ( 59 3 ) = − ( 2 3 ) = −(−1) = 1, we find out that the congruence has two solutions. Thanks to the statement above, we can immediately see (59 ≡ 3 (mod 4)) that the congruence is satisfied by x ≡ ±3 59+1 4 = ±315 ≡ (35 )3 ≡ ≡ ±73 = ±343 ≡ ∓11 (mod 59), since 35 = 243 ≡ 7 (mod 59). □ E. Diophantine equations Here, we limit ourselves only to the small class of equations which can be solved using divisibility or can be reduced to solving congruences. 11.E.1. Linear Diophantine equations. Decide whether it is possible to use a balance scale to weigh 50 grams of given goods provided we have only (an arbitrary number of) three kinds of masses; their weights are 770, 630, and 330 grams, respectively. If so, how to do that? Solution. Our task is to solve the equation 770x + 630y + 330z = 50, where x, y, z ∈ Z (a negative value in the solution would mean that we put the corresponding masses on the other scale). Dividing both sides of the equation by (770, 630, 330) = 10, we get an equivalent equation 77x + 63y + 33z = 5. Considering this equation modulo (77, 63) = 7, we get the following linear congruence: 33z ≡ 5 (mod 7), 5z ≡ 5 (mod 7), z ≡ 1 (mod 7). CHAPTER 11. ELEMENTARY NUMBER THEORY 11.4.14. Lemma (Gauss). If p is an odd prime, a ∈ Z, p ∤ a, then the value of the Legendre symbol satisfies ( a p ) = (−1)µp(a) . Proof. For each integer i ∈ { 1, 2, . . . , p−1 2 } , we set a value mi ∈ { 1, 2, . . . , p−1 2 } so that i · a ≡ ±mi (mod p). We can easily see that if k, l ∈ { 1, 2, . . . , p−1 2 } are different, then the values mk, ml are also different (the equality mk = ml would imply that k · a ≡ ±l · a (mod p), and hence k ≡ ±l (mod p), which cannot be satisfied unless k = l). Therefore, the sets {1, 2, . . . , p−1 2 } and {m1, m2, . . . , mp−1 2 } coincide, which is also illustrated by the above example. Multiplying the congruences 1 · a ≡ ±m1 (mod p), 2 · a ≡ ±m2 (mod p), ... p − 1 2 · a ≡ ±mp−1 2 (mod p) leads to p−1 2 ! · a p−1 2 ≡ (−1)µ · p−1 2 ! (mod p), since there are exactly µ negative values on the right-hand sides of the congruences. Dividing both sides by the integer p−1 2 !, we get the wanted statement, making use of lemma 11.4.12, whence (a/p) ≡ a p−1 2 (mod p). □ Now, with the help of Gauss’s lemma, we will prove the law of quadratic reciprocity. Proof of the law of quadratic reciprocity. The first part has been already proven; for the rest, we first derive a lemma which will be utilized in the proof of both of the remaining parts. Let a ∈ Z, p ∤ a, k ∈ N and let [x] and ⟨x⟩ denote the integer part (i.e. floor) and the fractional part, respectively, of a real number x. Then, [ 2ak p ] = [ 2 [ ak p ] + 2 ⟨ ak p ⟩] = 2 [ ak p ] + [ 2 ⟨ ak p ⟩] . 
This expression is odd if and only if ⟨ak p ⟩ > 1 2 , which is if and only if the least residue (in absolute value) of the integer ak modulo p is negative (a watchful reader should notice the return from the calculations of (ostensibly) irrelevant expressions back to the conditions close to the Legendre symbol). The integer µp(a) thus has the same parity (is congruent to, modulo 2) as ∑p−1 2 k=1 [2ak p ] , whence (thanks to Gauss’s lemma) we get that ( a p ) = (−1)µp(a) = (−1) ∑ p−1 2 k=1 [ 2ak p ] . 1013 This congruence is thus satisfied by those integers z of the form z = 1 + 7t, where t is an integer parameter. Substituting the form of z into the original equation, we get 77x + 63y = 5 − 33(1 + 7t), 11x + 9y = −4 − 33t. Now, we consider this (parametrized) equation modulo 11: 9y ≡ −4 − 33t (mod 11), −2y ≡ −4 (mod 11), y ≡ 2 (mod 11). Therefore, this congruence is satisfied by integers y = 2+11s for any s ∈ Z. Now, it only remains to calculate x: 11x = −4 − 33t − 9(2 + 11s), 11x = −22 − 33t − 9 · 11s, x = −2 − 3t − 9s. We have found out that the equation is satisfied if and only if (x, y, z) is in the set {(−2 − 3t − 9s, 2 + 11s, 1 + 7t); s, t ∈ Z}. Particular solutions can be obtained by evaluating the triple at concrete values of t, s. For instance, setting t = s = 0 gives the triple (−2, 2, 1); putting t = −4, s = 1 leads to (1, 13, −27). Of course, the unknowns can be eliminated in any order – the result may seem “syntactically” different, but it must still describe the same set of solutions (that is given by a particular coset of an appropriate subgroup (in our case, it is the subgroup (2, 2, 1)+(3, 0, 7)Z+(−9, 11, 0)Z) in the commutative group Z3 , which is an apparent analog to the fact that the solution of such an equation over a field forms an affine subspace of the corresponding vector space). □ Other types of Diophantine equations reducible to congruences. Some Diophantine equations are such that one of the unknowns can be expressed explicitly as a function of the other ones. In this case, it makes sense to examine for which integer arguments it holds that the value of the function is also an integer. For instance, having an equation of the form mxn = f(x1, . . . , xn−1), where m is a natural number and f(x1, . . . , xn−1) ∈ Z[x1, . . . , xn−1] is a polynomial with integer coefficients, an n-tuple of integers x1, . . . , xn is a solution of it if and only if f(x1, . . . , xn−1) ≡ 0 (mod m). 11.E.2. Solve the Diophantine equation x(x + 3) = 4y − 1. Solution. The equation can be rewritten as 4y = x2 +3x+1. Now, we will solve the congruence x2 + 3x + 1 ≡ 0 (mod 4). This congruence has no solution since for any integer x, the polynomial x2 + 3x + 1 evaluates to an odd integer (the CHAPTER 11. ELEMENTARY NUMBER THEORY Furthermore, if a is odd, then a + p is even and we get ( 2a p ) = ( 2a + 2p p ) = ( 4a+p 2 p ) = ( 2 p )2 · (a+p 2 p ) = (−1) ∑ p−1 2 k=1 [(a+p)k p ] = (−1) ∑ p−1 2 k=1 [ ak p ] · (−1) ∑ p−1 2 k=1 k . Since the sum of the arithmetic series ∑p−1 2 k=1 k is 1 2 p−1 2 p+1 2 = p2 −1 8 , we get (for a odd) the relation ( 2 p ) · ( a p ) = (−1) ∑ p−1 2 k=1 [ ak p ] · (−1) p2−1 8 , which, for a = 1, gives the wanted statement of item 2. By part (2), which we have already proved, and the previous equality, we now get for odd integers a that (1) ( a p ) = (−1) ∑ p−1 2 k=1 [ ak p ] . Now, let us consider, for given primes p ̸= q, the set T = {q · x; x ∈ Z, 1 ≤ x ≤ (p − 1)/2}× × {p · y; y ∈ Z, 1 ≤ y ≤ (q − 1)/2}. We apparently have |T| = p−1 2 · q−1 2 . 
We show that we also have
$(-1)^{|T|} = (-1)^{\sum_{y=1}^{(q-1)/2}\left[\frac{py}{q}\right]} \cdot (-1)^{\sum_{x=1}^{(p-1)/2}\left[\frac{qx}{p}\right]},$
which is sufficient thanks to the above. Since the equality $qx = py$ can happen for no pair $x, y$ from the permissible domain, the set $T$ can be partitioned into disjoint subsets $T_1$ and $T_2$:
$T_1 = T \cap \{(u, v);\ u, v \in \mathbb{Z},\ u < v\},\qquad T_2 = T \setminus T_1.$
Clearly, $|T_1|$ is the number of pairs $(qx, py)$ for which $x < \frac{p}{q}y$. Since $\frac{p}{q}y \le \frac{p}{q} \cdot \frac{q-1}{2} < \frac{p}{2}$, we have $\left[\frac{p}{q}y\right] \le \frac{p-1}{2}$. For a fixed $y$, the pairs $(qx, py)$ lying in $T_1$ are thus exactly those with $1 \le x \le \left[\frac{p}{q}y\right]$; hence $|T_1| = \sum_{y=1}^{(q-1)/2}\left[\frac{py}{q}\right]$. Analogously, $|T_2| = \sum_{x=1}^{(p-1)/2}\left[\frac{qx}{p}\right]$. By (1), we thus have $\left(\tfrac{p}{q}\right) = (-1)^{|T_1|}$ and $\left(\tfrac{q}{p}\right) = (-1)^{|T_2|}$, which finishes the proof of the law of quadratic reciprocity. □

11.E.3. Solve the following equation in integers:
$379x + 314y + 183y^2 = 210.$

Solution. The equation is linear in $x$, so the other unknown, $y$, must satisfy the congruence
$183y^2 + 314y - 210 \equiv 0 \pmod{379}.$
We would like to complete the left-hand polynomial to a square in order to get rid of the linear term. To that end, we first find a $t \in \mathbb{Z}$ such that $183 t \equiv 1 \pmod{379}$ (in other words, we determine the inverse of the integer 183 modulo 379). For this purpose, we use the Euclidean algorithm: $379 = 2 \cdot 183 + 13$, $183 = 14 \cdot 13 + 1$, whence
$1 = 183 - 14 \cdot 13 = 183 - 14 \cdot (379 - 2 \cdot 183) = 29 \cdot 183 - 14 \cdot 379.$
Therefore, we can take $t = 29$. Multiplying both sides of the congruence by $t = 29$ and rearranging, we get the equivalent congruence
$y^2 + 10y - 26 \equiv 0 \pmod{379}.$
Completing the square and substituting $z = y + 5$ leads to
$(y+5)^2 - 5^2 - 26 \equiv 0 \pmod{379}$, i.e. $z^2 \equiv 51 \pmod{379}.$
Invoking the law of quadratic reciprocity, we calculate the Legendre symbol $(51/379)$:
$\left(\tfrac{51}{379}\right) = \left(\tfrac{3}{379}\right)\left(\tfrac{17}{379}\right) = \left(\tfrac{379}{3}\right)(-1) \cdot \left(\tfrac{379}{17}\right)(+1) = \left(\tfrac{1}{3}\right)(-1)\left(\tfrac{5}{17}\right) = (-1)\left(\tfrac{17}{5}\right) = (-1)\left(\tfrac{2}{5}\right) = (-1)(-1) = 1,$
whence it follows that the congruence is solvable and, in particular, has two solutions modulo 379. The proposition of exercise 11.D.38 implies that the solutions are of the form $z \equiv \pm 51^{380/4} = \pm 51^{95} \pmod{379}$. Since $51^3 \equiv 1 \pmod{379}$, we get $51^{95} = (51^3)^{31} \cdot 51^2 \equiv 51^2 \equiv -52 \pmod{379}$. The solution is thus $z \equiv \pm 52 \pmod{379}$, which gives for the original unknown
$y \equiv 47 \pmod{379}$ or $y \equiv -57 \pmod{379}.$
Therefore, the given Diophantine equation is satisfied exactly by the pairs $(x, y)$ with $y \in \{47 + 379k;\ k \in \mathbb{Z}\} \cup \{-57 + 379k;\ k \in \mathbb{Z}\}$ and $x = \frac{1}{379}\left(210 - 314y - 183y^2\right)$; e.g. $(-1105, 47)$ or $(-1521, -57)$ (which are the only solutions with $|x| < 10^5$). □

As the example above shows, evaluating the Legendre symbol via the law of quadratic reciprocity alone, which applies only to primes, forces us to factor integers into primes – a very hard operation from the computational point of view. This can be mended by extending the definition of the Legendre symbol to the so-called Jacobi symbol with similar properties.

Definition. Let $a \in \mathbb{Z}$, $b \in \mathbb{N}$, $2 \nmid b$. Let $b$ factor as $b = p_1 p_2 \cdots p_k$ into (odd) primes (here, we exceptionally do not group equal primes into a power of the prime; rather, we write each one explicitly, e.g. $135 = 3 \cdot 3 \cdot 3 \cdot 5$). The symbol
$\left(\tfrac{a}{b}\right) = \left(\tfrac{a}{p_1}\right)\left(\tfrac{a}{p_2}\right)\cdots\left(\tfrac{a}{p_k}\right)$
is called the Jacobi symbol. We show in the practical column that the Jacobi symbol has similar properties as the Legendre one.
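The following sketch (ours, not from the text) computes the Jacobi symbol without any factoring, using only the reciprocity law stated below together with the supplementary laws – which is exactly the point of introducing the symbol:

    def jacobi(a, b):
        # Jacobi symbol (a/b) for odd b > 0, via the reciprocity law;
        # no factorization of a or b is needed (a sketch)
        assert b > 0 and b % 2 == 1
        a %= b
        result = 1
        while a != 0:
            while a % 2 == 0:            # pull out (2/b) = (-1)^((b^2-1)/8)
                a //= 2
                if b % 8 in (3, 5):
                    result = -result
            a, b = b, a                  # reciprocity: swap and fix the sign
            if a % 4 == 3 and b % 4 == 3:
                result = -result         # (-1)^((a-1)/2 * (b-1)/2)
            a %= b
        return result if b == 1 else 0

    print(jacobi(2, 15))    # 1, although x^2 = 2 (mod 15) is unsolvable (see below)
    print(jacobi(51, 379))  # 1, matching the computation in 11.E.3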
However, there is a substantial aberration – it is not generally true that $(a/b) = 1$ implies that the congruence $x^2 \equiv a \pmod b$ is solvable.

Example. $\left(\tfrac{2}{15}\right) = \left(\tfrac{2}{3}\right)\left(\tfrac{2}{5}\right) = (-1)(-1) = 1$, but the congruence $x^2 \equiv 2 \pmod{15}$ has no solution (the congruence $x^2 \equiv 2$ is solvable neither modulo 3 nor modulo 5).

Theorem (Law of quadratic reciprocity for the Jacobi symbol). Let $a, b \in \mathbb{N}$ be odd integers. Then,
(1) $\left(\tfrac{-1}{a}\right) = (-1)^{\frac{a-1}{2}}$,
(2) $\left(\tfrac{2}{a}\right) = (-1)^{\frac{a^2-1}{8}}$,
(3) $\left(\tfrac{a}{b}\right) = \left(\tfrac{b}{a}\right) \cdot (-1)^{\frac{a-1}{2}\cdot\frac{b-1}{2}}$.

Proof. The proof is simple, utilizing the law of quadratic reciprocity for the Legendre symbol. See exercise 11.D.35. □

There is another application of the law of quadratic reciprocity, in a certain sense in the opposite direction – we can consider the question: modulo which primes is a given integer $a$ a quadratic residue? (We are already able to answer this question for $a = 2$, for example.) The first step is to answer it for prime values of $a$, since the answer for composite values of $a$ depends on the factorization of the integer $a$.

Theorem. Let $q$ be an odd prime.
• If $q \equiv 1 \pmod 4$, then $q$ is a quadratic residue modulo exactly those primes $p$ which satisfy $p \equiv r \pmod q$, where $r$ is a quadratic residue modulo $q$.
• If $q \equiv 3 \pmod 4$, then $q$ is a quadratic residue modulo exactly those primes $p$ which satisfy $p \equiv \pm b^2 \pmod{4q}$, where $b$ is odd and coprime to $q$.

11.E.4. Solve the equation $2^x = 1 + 3^y$ in integers.

Solution. If $y < 0$, then $1 < 1 + 3^y < 2$, whence $0 < x < 1$, so $x$ could not be an integer. Therefore, $y \ge 0$, hence $2^x = 1 + 3^y \ge 2$ and $x \ge 1$. We show that we must also have $x \le 2$. If not (i.e., if $x \ge 3$), then we would have $1 + 3^y = 2^x \equiv 0 \pmod 8$, whence it follows that $3^y \equiv -1 \pmod 8$. However, this is impossible since the order of 3 modulo 8 equals 2, so the powers of three are congruent to 3 and 1 only. Now, it remains to examine the possibilities $x = 1$ and $x = 2$. For $x = 1$, we get $3^y = 2^1 - 1 = 1$, hence $y = 0$. If $x = 2$, we have $3^y = 2^2 - 1 = 3$, whence $y = 1$. Thus, the equation has two solutions: $x = 1$, $y = 0$; and $x = 2$, $y = 1$. □

F. Primality tests

11.F.1. Mersenne primes. The following problems are in deep connection with testing Mersenne numbers for primality. For any $q \in \mathbb{N}$, consider the integer $M_q = 2^q - 1$ and prove:
i) If $q$ is composite, then so is $M_q$.
ii) If $q$ is a prime, $q \equiv 3 \pmod 4$, then $2q + 1$ divides $M_q$ if and only if $2q + 1$ is a prime (hence it follows that if $q > 3$, $q \equiv 3 \pmod 4$, is a Sophie Germain prime [3], then $M_q$ is not a prime).
iii) If a prime $p$ divides $M_q$, then $p \equiv \pm 1 \pmod 8$ and $p \equiv 1 \pmod q$.

[3] See Wikipedia, Sophie Germain prime, http://en.wikipedia.org/wiki/Sophie_Germain_prime (as of July 28, 2013, 14:43 GMT).
Proof (of the theorem above). The first part follows directly from the law of quadratic reciprocity. Let us consider $q \equiv 3 \pmod 4$, i.e., $\left(\tfrac{q}{p}\right) = (-1)^{\frac{p-1}{2}}\left(\tfrac{p}{q}\right)$. First, let $p \equiv +b^2 \pmod{4q}$, where $b$ is odd, and hence $b^2 \equiv 1 \pmod 4$. Then, $p \equiv b^2 \equiv 1 \pmod 4$ and $p \equiv b^2 \pmod q$. Therefore, $(-1)^{\frac{p-1}{2}} = 1$ and $(p/q) = 1$, whence $(q/p) = 1$. Now, if $p \equiv -b^2 \pmod{4q}$, then we similarly get that $p \equiv -b^2 \equiv 3 \pmod 4$ and $p \equiv -b^2 \pmod q$. Therefore, $(-1)^{\frac{p-1}{2}} = -1$ and $(p/q) = -1$, whence we again get $(q/p) = 1$.

For the opposite direction, suppose that $(q/p) = 1$. There are two possibilities – either $(-1)^{\frac{p-1}{2}} = 1$ and $(p/q) = 1$, or $(-1)^{\frac{p-1}{2}} = -1$ and $(p/q) = -1$. In the former case, we have $p \equiv 1 \pmod 4$ and there is a $b$ such that $p \equiv b^2 \pmod q$. We can assume without loss of generality that $b$ is odd (if not, we could have taken $b + q$ instead). But then $b^2 \equiv 1 \equiv p \pmod 4$, and altogether $p \equiv b^2 \pmod{4q}$. In the latter case, we have $p \equiv 3 \pmod 4$ and $\left(\tfrac{-p}{q}\right) = \left(\tfrac{-1}{q}\right)\left(\tfrac{p}{q}\right) = (-1)(-1) = 1$. Therefore, there is a $b$ (which can again be chosen odd) such that $-p \equiv b^2 \pmod q$. We thus get $-b^2 \equiv 3 \equiv p \pmod 4$, and altogether $p \equiv -b^2 \pmod{4q}$. □

5. Diophantine equations

It was as early as the third century AD that Diophantus of Alexandria dealt with miscellaneous equations while admitting only integers as solutions. And no wonder – in many practical problems that lead to equations, non-integer solutions may fail to have a meaningful interpretation. As an example, we can consider the problem of how to pay an exact amount of money with coins of given values. In honor of Diophantus, equations for which we are interested in integer solutions only are called Diophantine equations. Another nice example of a Diophantine equation is Euler's relation $v - e + f = 2$ from graph theory, connecting the numbers of vertices, edges, and faces of a planar graph. Furthermore, if we restrict ourselves to regular graphs only, we arrive at the problem of the existence of the so-called Platonic solids, which can be smartly described just as solutions of this Diophantine equation – for more information, see 13.1.22.

Unfortunately, there is no universal method for solving this kind of equations. There is not even a method (algorithm) to decide whether a given polynomial Diophantine equation has a solution. This question is well known as Hilbert's tenth problem, and the proof of the algorithmic unsolvability of this problem was given by Yuri Matiyasevich in 1970. [7] However, there are cases in which we are able to find the solution of a Diophantine equation, or – at least – to reduce the problem to solving congruences, which is, besides the applications already mentioned, another motivation for studying them.

[7] See the elementary text M. Davis, Hilbert's Tenth Problem is Unsolvable, The American Mathematical Monthly 80(3): 233–269, 1973.

Solution (of 11.F.1).
i) If $n \mid q$, then it follows from exercise 11.A.7 that $2^n - 1 \mid 2^q - 1$, i.e. $M_n \mid M_q$. Therefore, if $q$ has a divisor $n$ with $1 < n < q$, then $M_n$ is a non-trivial divisor of $M_q$, so $M_q$ is composite.
ii) Let $n = 2q + 1$ be a divisor of $M_q$. We show that $n$ is a prime by invoking Lucas' theorem 11.6.10. Since $n - 1 = 2q$ has only two prime divisors, it suffices to find a primality witness for each of the primes 2 and $q$; we show that $a = -2$ works for both. We have
$(-2)^{\frac{n-1}{q}} = (-2)^2 = 4 \not\equiv 1 \pmod n$ and $(-2)^{\frac{n-1}{2}} = -2^q \equiv -1 \not\equiv 1 \pmod n,$
thanks to the assumption $n \mid M_q = 2^q - 1$. Further, since $(-2)^{n-1} - 1 = 2^{2q} - 1 = (2^q + 1)M_q \equiv 0 \pmod n$, it follows from Lucas' theorem that $n$ is a prime.
Now, let $p = 2q + 1$ be a prime; since $q \equiv 3 \pmod 4$, we have $p \equiv -1 \pmod 8$. Hence $(2/p) = 1$, and there exists an $m$ such that $2 \equiv m^2 \pmod p$. Therefore, $2^q \equiv 2^{\frac{p-1}{2}} \equiv m^{p-1} \equiv 1 \pmod p$, so $p \mid 2^q - 1 = M_q$.
iii) If $p \mid M_q = 2^q - 1$, then the order of 2 modulo $p$ must divide the prime $q$; since it is clearly not 1, it equals $q$. Therefore, $q \mid p - 1$, and since $p - 1$ is even and $q$ odd, there exists a $k \in \mathbb{Z}$ such that $2qk = p - 1$. Altogether, we get $\left(\tfrac{2}{p}\right) \equiv 2^{\frac{p-1}{2}} = 2^{qk} \equiv 1 \pmod p$, i.e., $p \equiv \pm 1 \pmod 8$. □
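Parts (ii) and (iii) are easy to confirm numerically; a small sketch of ours (with a naive trial-division primality check, fine for inputs of this size):

    def is_prime(n):
        # trial division -- sufficient for the small inputs used here
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    # part (ii): for a prime q = 3 (mod 4) with 2q+1 prime, 2q+1 | M_q
    for q in (11, 23, 83, 131):
        assert is_prime(q) and q % 4 == 3 and is_prime(2 * q + 1)
        assert pow(2, q, 2 * q + 1) == 1

    # a Sophie Germain prime q = 1 (mod 4) need not give a divisor:
    print(pow(2, 29, 59))   # 58, not 1, so 59 does not divide M_29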
11.F.2. For each of the following Mersenne numbers, determine whether it is prime or composite: $2^{11} - 1$, $2^{15} - 1$, $2^{23} - 1$, $2^{29} - 1$, and $2^{83} - 1$.

Solution. In the case of the integer $2^{15} - 1$, the exponent is composite; therefore, the whole integer is composite as well (we even know that it is divisible by $2^3 - 1$ and $2^5 - 1$). In the other cases, the exponent is always a prime. We can notice that these primes, namely $q = 11, 23, 29$, and $83$, are even Sophie Germain primes (i.e., $2q + 1$ is also a prime). It thus follows from part (ii) of the previous exercise that $23 \mid 2^{11} - 1$, $47 \mid 2^{23} - 1$, and $167 \mid 2^{83} - 1$. We cannot use this proposition for $q = 29$ since $29 \not\equiv 3 \pmod 4$; and indeed, $59 \nmid 2^{29} - 1$. However, it follows from part (iii) of the above exercise that if a prime $p$ divides $2^{29} - 1$, then it must satisfy $p \equiv \pm 1 \pmod 8$ and $p \equiv 1 \pmod{29}$, i.e., $p \equiv 1 \pmod{232}$ or $p \equiv 175 \pmod{232}$. If we are looking for a prime divisor of the integer $n = 2^{29} - 1 = 536\,870\,911$, it suffices to check the primes (of the above form) up to $\sqrt n \approx 23\,170$. There are 50 of them, so we are able to decide whether $n$ is a prime quite easily (even with paper and pencil). In this case, fortunately, $n$ is divisible already by the least candidate, 233. □

11.F.3. Show that the integer 341 is a Fermat pseudoprime to base 2, yet it is not an Euler–Jacobi pseudoprime to base 2. Further, prove that the integer 561 is an Euler–Jacobi pseudoprime to base 2, but not to base 3. Prove that, on the other hand, the integer 121 is an Euler–Jacobi pseudoprime to base 3, but not to base 2.

We now describe several such types of Diophantine equations.

Linear Diophantine equations. A linear Diophantine equation is an equation of the form
$a_1x_1 + a_2x_2 + \cdots + a_nx_n = b,$
where $x_1, \dots, x_n$ are unknowns and $a_1, \dots, a_n, b$ are given non-zero integers. We can see that the ability to solve Diophantine equations is sometimes important in "practical" life as well, as is proved by Bruce Willis and Samuel Jackson in Die Hard with a Vengeance, where they have to defuse a bomb using exactly 4 gallons of water, having only 3- and 5-gallon containers at their disposal. A mathematician would say that the gentlemen were to find a solution of the Diophantine equation $3x + 5y = 4$.

One can use congruences in order to solve these equations. Apparently, it is necessary for the equation to be solvable that the integer $d = (a_1, \dots, a_n)$ divides $b$. Provided it does, dividing both sides of the equation by $d$ leads to the equivalent equation $a_1'x_1 + a_2'x_2 + \cdots + a_n'x_n = b'$, where $a_i' = a_i/d$ for $i = 1, \dots, n$ and $b' = b/d$. Here, we have $d \cdot (a_1', \dots, a_n') = (da_1', \dots, da_n') = (a_1, \dots, a_n) = d$, so $(a_1', \dots, a_n') = 1$.

Further, we show that the equation $a_1x_1 + a_2x_2 + \cdots + a_nx_n = b$, where $a_1, \dots, a_n, b$ are integers such that $(a_1, \dots, a_n) = 1$, always has a solution in integers, and that all such solutions can be described in terms of $n - 1$ integer parameters. We prove this proposition by mathematical induction on $n$, the number of unknowns. The situation is trivial for $n = 1$ – there is a unique solution (which does not depend on any parameters).

Solution (of 11.F.3). The integer 341 is a Fermat pseudoprime to base 2 since $2^{10} \equiv 1 \pmod{341}$ implies $2^{340} \equiv 1 \pmod{341}$. It is not an Euler–Jacobi pseudoprime since $2^{170} \equiv 1 \pmod{341}$, but $\left(\tfrac{2}{341}\right) = -1$, which follows from the fact that $341 \equiv -3 \pmod 8$. For the integer 561, we have $2^{280} \equiv 1 \pmod{561}$ and $\left(\tfrac{2}{561}\right) = 1$, since $561 \equiv 1 \pmod 8$; therefore, it is an Euler–Jacobi pseudoprime to base 2. But not to base 3, since $3 \mid 561$. On the other hand, the integer 121 satisfies $3^5 \equiv 1 \pmod{121}$, hence $3^{60} \equiv 1 \pmod{121}$, and $\left(\tfrac{3}{121}\right) = 1$, but $2^{60} \equiv 89 \not\equiv 1 \pmod{121}$. □
11.F.4. Prove that the integers 2465, 2821, and 6601 are Carmichael numbers, i.e., denoting any of them by $n$, every integer $a$ coprime to $n$ satisfies $a^{n-1} \equiv 1 \pmod n$.

Solution. We have $2465 = 5 \cdot 17 \cdot 29$, $2821 = 7 \cdot 13 \cdot 31$, $6601 = 7 \cdot 23 \cdot 41$, and the proposition follows from Korselt's criterion 11.6.6, since all of the integers 4, 16, 28 divide $2464 = 2^5 \cdot 7 \cdot 11$, all of the integers 6, 12, 30 divide $2820 = 2^2 \cdot 3 \cdot 5 \cdot 47$, and 6, 22, 40 divide $6600 = 2^3 \cdot 3 \cdot 5^2 \cdot 11$. □

11.F.5. Prove that the integer 2047 is a strong pseudoprime to base 2, but not to base 3. Further, prove that the integer 1905 is an Euler–Jacobi pseudoprime to base 2, but not a strong pseudoprime to this base.

Solution. To verify that 2047 is a strong pseudoprime to base 2, we factor $2^{2046} - 1 = (2^{1023} - 1)(2^{1023} + 1)$. Since $2^{1023} \equiv 1 \pmod{2047}$, the statement is true. However, 2047 is not a strong pseudoprime to base 3, as $3^{1023} \equiv 1565 \not\equiv \pm 1 \pmod{2047}$. Notice that for the integer 2047, the strong pseudoprimality test is identical to the Euler one (this is because the integer 2046 is not divisible by four).

The integer 1905 is an Euler–Jacobi pseudoprime to base 2 since $2^{1904/2} \equiv 1 \pmod{1905}$ and the Jacobi symbol $(2/1905)$ is equal to 1. Since $1904 = 2^4 \cdot 7 \cdot 17$, the integer 1905 would be a strong pseudoprime to base 2 only if at least one of the following congruences held:
$2^{952} \equiv -1$, $2^{476} \equiv -1$, $2^{238} \equiv -1$, or $2^{119} \equiv \pm 1 \pmod{1905}.$
However, $2^{952} \equiv 2^{476} \equiv 1 \pmod{1905}$, $2^{238} \equiv 1144 \pmod{1905}$, and $2^{119} \equiv 128 \pmod{1905}$. Therefore, 1905 is not a strong pseudoprime to base 2. □

Further, let $n \ge 2$ and suppose that the statement holds for equations with $n - 1$ unknowns. Denoting $d = (a_1, \dots, a_{n-1})$, any $n$-tuple $x_1, \dots, x_n$ that satisfies the equation must also satisfy the congruence
$a_1x_1 + a_2x_2 + \cdots + a_nx_n \equiv b \pmod d.$
Since $d$ is the greatest common divisor of the integers $a_1, \dots, a_{n-1}$, this congruence is of the form $a_nx_n \equiv b \pmod d$, which (since $(d, a_n) = (a_1, \dots, a_n) = 1$) has a unique solution $x_n \equiv c \pmod d$, where $c$ is a suitable integer, i.e., $x_n = c + d t$, where $t \in \mathbb{Z}$ is arbitrary. Substituting into the original equation and rearranging leads to the equation
$a_1x_1 + \cdots + a_{n-1}x_{n-1} = b - a_nc - a_nd t$
with $n - 1$ unknowns and one parameter, $t$. The number $(b - a_nc)/d$ is an integer, so we can divide the equation by $d$. This leads to $a_1'x_1 + \cdots + a_{n-1}'x_{n-1} = b'$, where $a_i' = a_i/d$ for $i = 1, \dots, n-1$ and $b' = (b - a_nc)/d - a_nt$, satisfying
$(a_1', \dots, a_{n-1}') = (da_1', \dots, da_{n-1}') \cdot \tfrac1d = (a_1, \dots, a_{n-1}) \cdot \tfrac1d = 1.$
By the induction hypothesis, this equation has, for any $t \in \mathbb{Z}$, a solution which can be described in terms of $n - 2$ integer parameters (different from $t$), which together with the condition $x_n = c + dt$ gives what we wanted. □

11.5.1. Pythagorean equation. In this section, we deal with the enumeration of all right triangles with integer side lengths. This is a Diophantine equation where we only seldom use the methods described above; nevertheless, we look at it in detail. The task is to solve the equation
$x^2 + y^2 = z^2$
in integers.

Solution. Clearly, we can assume that $(x, y, z) = 1$ (otherwise, we simply cancel the integer $d = (x, y, z)$ from the whole triple). Further, we can show that the integers $x, y, z$ are then pairwise coprime: if there were a prime $p$ dividing two of them, then we can easily see that it would have to divide the third one as well, which it may not, according to our assumption.
Therefore, at most one of the integers $x, y$ is even. If neither of them were, we would get $z^2 \equiv x^2 + y^2 \equiv 1 + 1 \pmod 8$, which is impossible (see exercise 11.A.2). Altogether, we get that exactly one of the integers $x, y$ is even. However, since the roles of these integers in the equation are symmetric, we can, without loss of generality, select $x$ to be even and set $x = 2r$, $r \in \mathbb{N}$. Hence, we have $4r^2 = z^2 - y^2$, so
$r^2 = \frac{z + y}{2} \cdot \frac{z - y}{2}.$
Now, let us denote $u = \frac12(z + y)$, $v = \frac12(z - y)$ (then, the inverse substitution is $z = u + v$, $y = u - v$). Since $y$ is coprime to $z$, so is $u$ to $v$ (if there were a prime $p$ dividing both $u$ and $v$, then it would divide their sum as well as their difference, i.e., the integers $z$ and $y$). It follows from $r^2 = u \cdot v$ that there are coprime positive integers $a, b$ such that $u = a^2$, $v = b^2$. Moreover, since $u > v$, we must have $a > b$. Altogether, we get
$x = 2r = 2ab,\qquad y = u - v = a^2 - b^2,\qquad z = u + v = a^2 + b^2,$
which indeed satisfies the given equation for any coprime $a, b \in \mathbb{N}$ with $a > b$. Further solutions can be obtained by interchanging $x$ and $y$. Finally, relinquishing the condition $(x, y, z) = 1$, each solution yields infinitely many more if we multiply each of its components by a fixed positive integer $d$. □

11.F.6. Applying the Pocklington–Lehmer test 11.6.11, show that 1321 is a prime.

Solution. Let us set $N = 1321$; then $N - 1 = 1320 = 2^3 \cdot 3 \cdot 5 \cdot 11$. For the sake of simplicity, we assume that the trial division is executed only for primes below 10; then $F = 2^3 \cdot 3 \cdot 5 = 120$, $U = 11$, where $(F, U) = (120, 11) = 1$. In order to prove the primality of 1321 by the Pocklington–Lehmer test, we need to find a primality witness $a_p$ for each $p \in \{2, 3, 5\}$. Since $\left(2^{1320/3} - 1, 1321\right) = 1$ and $\left(2^{1320/5} - 1, 1321\right) = 1$, we can take $a_3 = a_5 = 2$. However, for $p = 2$, we have $\left(2^{1320/2} - 1, 1321\right) = 1321$, so we have to look for another primality witness. We can take $a_2 = 7$ since $\left(7^{1320/2} - 1, 1321\right) = 1$. In both cases, we have $2^{1320} \equiv 7^{1320} \equiv 1 \pmod{1321}$. The primality witnesses of the integer 1321 are thus $a_2 = 7$, $a_3 = a_5 = 2$. Instead, we could also have chosen the same number for all primes $p$ (e.g. 13, which is a primitive root modulo 1321). □

11.F.7. Factor the integer 221 into primes by Pollard's ρ-method. Use the function $f(x) = x^2 + 1$ with initial value $x_0 = 2$.

Solution. Let us set $x = y = 2$. The procedure from 11.6.14 gives (all values reduced modulo 221):

x := f(x) | y := f(f(y)) | (|x − y|, 221)
5 | 26 | 1
26 | 197 | 1
14 | 104 | 1
197 | 145 | 13

We have thus found a non-trivial divisor, so it is now easy to calculate $221 = 13 \cdot 17$. □

11.F.8. Find a non-trivial divisor of the integer 455459.

Solution. Consider the function $f(x) = x^2 + 1$ (we silently assume that this function behaves randomly modulo an unknown prime divisor $p$ of the integer $n$ and has the required properties). In the particular iterations, we compute $a \leftarrow f(a) \pmod n$, $b \leftarrow f(f(b)) \pmod n$, while evaluating $d = (a - b, n)$:

a | b | d
5 | 26 | 1
26 | 2871 | 1
677 | 179685 | 1
2871 | 155260 | 1
44380 | 416250 | 1
179685 | 43670 | 1
121634 | 164403 | 1
155260 | 247944 | 1
44567 | 68343 | 743

We have found the divisor 743, and we can now easily compute that $455459 = 613 \cdot 743$. □
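The parametrization derived in 11.5.1 above is easy to turn into a generator of all primitive Pythagorean triples; a small sketch of ours (the condition that $a, b$ have opposite parity is implicit in the derivation, $y = a^2 - b^2$ being odd):

    from math import gcd

    def primitive_triples(limit):
        # x = 2ab, y = a^2 - b^2, z = a^2 + b^2 with a > b coprime,
        # a - b odd, as in 11.5.1 (a sketch)
        for a in range(2, limit):
            for b in range(1, a):
                if gcd(a, b) == 1 and (a - b) % 2 == 1:
                    yield 2 * a * b, a * a - b * b, a * a + b * b

    for t in primitive_triples(5):
        print(t)    # (4, 3, 5), (12, 5, 13), (8, 15, 17), (24, 7, 25)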
11.5.2. Fermat's Last Theorem for n = 4. Thanks to the parametrization of Pythagorean triples, we are able to prove that the famous Fermat equation $x^n + y^n = z^n$ has no solution for $n = 4$ in positive integers. For this task, it is sufficient to prove that the equation $x^4 + y^4 = z^2$ has no solution in $\mathbb{N}$.

Solution. We use the so-called method of infinite descent, which was introduced by Pierre de Fermat. This method utilizes the fact that every non-empty set of natural numbers has a least element (in other words, $\mathbb{N}$ is a well-ordered set). Therefore, suppose that the set of solutions of the equation $x^4 + y^4 = z^2$ is non-empty, and let $(x, y, z)$ denote (any) solution with $z$ as small as possible. The integers $x, y, z$ are then pairwise coprime (a common prime divisor of two of them would allow us to cancel it and obtain a solution with smaller $z$). Since the equation can be written in the form $(x^2)^2 + (y^2)^2 = z^2$, it follows from the previous paragraph that there exist $r, s \in \mathbb{N}$ such that
$x^2 = 2rs,\qquad y^2 = r^2 - s^2,\qquad z = r^2 + s^2.$
Hence, $y^2 + s^2 = r^2$, where $(y, s) = 1$ (if there were a prime $p$ dividing both $y$ and $s$, then it would divide $x$ as well as $z$, which contradicts that they are coprime). Making the Pythagorean substitution once again, we get natural numbers $a, b$ with ($y$ being odd)
$y = a^2 - b^2,\qquad s = 2ab,\qquad r = a^2 + b^2.$
The inverse substitution leads to $x^2 = 2rs = 4ab(a^2 + b^2)$, and since $x$ is even, we get
$\left(\tfrac{x}{2}\right)^2 = ab(a^2 + b^2).$
The integers $a, b, a^2 + b^2$ are pairwise coprime (which can be derived easily from the fact that $y$ is coprime to $s$). Therefore, each of them is a square of a natural number: $a = c^2$, $b = d^2$, $a^2 + b^2 = e^2$, whence $c^4 + d^4 = e^2$, and since $e \le a^2 + b^2 = r < z$, we get a contradiction with the minimality of $z$. □

G. Encryption

11.G.1. RSA. We have overheard that the integers 29, 7, 21 were sent by means of RSA with public key (7, 33). Try to break the cipher and find the messages (integers) that were originally sent.

Solution. In order to find the private key $d$, we need to solve the congruence $7d \equiv 1 \pmod{\varphi(33)}$. Since the integer 33 is quite small, we can factor it and easily compute $\varphi(33) = (3-1)(11-1) = 20$. We are thus looking for a $d$ such that $7d \equiv 1 \pmod{20}$, which is satisfied by $d \equiv 3 \pmod{20}$. Since $29^3 \equiv (-4)^3 \equiv 2$, $7^3 \equiv 13$, and $21^3 \equiv 21 \pmod{33}$, the messages that were encrypted are 2, 13, and 21. □

Attacks against RSA. Using the so-called Fermat factorization method, we can try to factor $n = p \cdot q$ if we suspect that the difference between $p$ and $q$ is small. We have
$n = \left(\tfrac{p+q}{2}\right)^2 - \left(\tfrac{p-q}{2}\right)^2,$
where $s = (p - q)/2$ is small and $t = (p + q)/2$ is only a bit greater than $\sqrt n$. Therefore, it suffices to check [4]
$t = \lceil\sqrt n\rceil,\ t = \lceil\sqrt n\rceil + 1,\ t = \lceil\sqrt n\rceil + 2,\ \dots,$
until $t^2 - n$ is a square (this condition can, of course, be checked efficiently).

[4] The symbol $\lceil x\rceil$ denotes the ceiling of a real number $x$, i.e., the integer which satisfies $\lceil x\rceil - 1 < x \le \lceil x\rceil$.

11.G.2. Now, we try to factor the integer $n = 23104222007$ this way. (We anticipate that it is a product of two close primes.)

Solution. We compute $\sqrt n \approx 152000.731$ and check the candidates for $t$: for $t = 152001$, we have $\sqrt{t^2 - n} \approx 286.345$; for $t = 152002$, it is $\sqrt{t^2 - n} \approx 621.287$; for $t = 152003$, $\sqrt{t^2 - n} \approx 830.664$. Finally, for $t = 152004$, we get $\sqrt{t^2 - n} = 997 \in \mathbb{Z}$. Therefore, $s = 997$ and we can easily calculate the prime divisors of $n$: $p = t + s = 153001$, $q = t - s = 151007$. □

11.G.3. The RSA modulus $n = p \cdot q$ can also be easily factored if the integer $\varphi(n)$ is known (compromised). Then, $\varphi(n) = (p-1)(q-1) = pq - (p+q) + 1$, whence $p + q = n + 1 - \varphi(n)$. We are thus to find two integers whose sum and product are known, which can be done using Viète's formulas relating the roots and the coefficients of a polynomial: $p$ and $q$ are the roots of the polynomial
$x^2 - (n + 1 - \varphi(n))x + n.$
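Both attacks are easy to express in a few lines; a sketch of ours (the function names are not from the text):

    from math import isqrt

    def fermat_factor(n):
        # Fermat factorization (11.G.2): works well when p and q are close
        t = isqrt(n)
        if t * t < n:
            t += 1                      # t = ceil(sqrt(n))
        while True:
            s2 = t * t - n
            s = isqrt(s2)
            if s * s == s2:
                return t + s, t - s
            t += 1

    def factor_from_phi(n, phi):
        # recover p, q from n = p*q and phi(n) via Viete's formulas (11.G.3)
        s = n + 1 - phi                 # s = p + q
        d = isqrt(s * s - 4 * n)        # discriminant of x^2 - s*x + n
        assert d * d == s * s - 4 * n
        return (s + d) // 2, (s - d) // 2

    print(fermat_factor(23104222007))                 # (153001, 151007)
    print(factor_from_phi(23104222007, 23103918000))  # (153001, 151007)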
6. Applications – calculation with large integers, cryptography

11.6.1. Computational aspects of number theory. In many practical problems which utilize the results of number theory, it is necessary to execute one or more of the following computations fast:
• common arithmetic operations (sum, product, division with remainder) on integers;
• to determine the remainder of a (natural) $n$-th power of an integer $a$ when divided by a given $m$;
• to determine the multiplicative inverse of an integer $a$ modulo $m \in \mathbb{N}$;
• to determine the greatest common divisor of two integers (and the coefficients of the corresponding Bézout's identity);
• to decide whether a given integer is a prime or a composite number;
• to factor a given integer into primes.

Basic arithmetic operations are usually executed on large integers in the same way as we were taught at primary school, i.e., we add in linear time and multiply and divide with remainder in quadratic time. Multiplication, which is a basis for many other operations, can be performed asymptotically more efficiently (there exist algorithms of the divide-and-conquer type) – for instance, the Karatsuba algorithm (1960), running in time $\Theta\left(n^{\log_2 3}\right)$, or the Schönhage–Strassen algorithm (1971), which runs in $\Theta(n \log n \log\log n)$ and uses Fast Fourier Transforms – see also 7.2.5. Although the latter is asymptotically much better, in practice it becomes advantageous only for integers of at least about ten thousand digits (it is thus used, for example, when looking for large primes in the GIMPS project).

11.G.4. Consider (as above) the integer $n = 23104222007$ and factor it with the additional knowledge that $\varphi(n) = 23103918000$.

Solution. Following the procedure described above, we get the quadratic equation $x^2 - 304008x + 23104222007 = 0$, whose solutions are
$p = \tfrac12\left(304008 + \sqrt{304008^2 - 4 \cdot 23104222007}\right) = 153001,$
$q = \tfrac12\left(304008 - \sqrt{304008^2 - 4 \cdot 23104222007}\right) = 151007.$ □

11.G.5. ElGamal. Martin and John want to communicate using the ElGamal encryption (designed by the Egyptian mathematician Taher ElGamal, who was inspired by the Diffie–Hellman key exchange protocol). Martin chose the prime 41 and its primitive root $g = 11$ as well as the integer 10. Then, he published the triple $(41, 11, A)$, where $A \equiv 11^{10} \pmod{41}$; he kept the integer 10 to himself – it is his private key. John used a public channel to send the pair (22, 6) to him. What is the original message John sent?

Solution. For completeness, we first compute the whole public key: $A = 9$ (however, this integer was only needed by John when he encrypted the message for Martin; it is no longer necessary for decryption). The message $M$ can be obtained as $M \equiv 6/22^{10} \pmod{41}$. First, we compute
$22^{10} = \left(22^2\right)^5 \equiv (-8)^5 = (-8) \cdot (-8)^2 \cdot (-8)^2 \equiv (-8) \cdot 23 \cdot 23 \equiv -9 \pmod{41}$
and $(-9)^{-1} \equiv 9 \pmod{41}$. Therefore, the decrypted message is the integer $M \equiv 9 \cdot 6 \equiv 13 \pmod{41}$. □

11.G.6. Rabin cryptosystem. Alice has chosen $p = 23$, $q = 31$ as her private key in the Rabin cryptosystem; the public key is then $n = pq = 713$. Encrypt the message $M = 327$ for Alice and show how Alice will decrypt it.

Solution. We compute $C = 327^2 \equiv 692 \pmod{713}$ and send this cipher to Alice. Following the decryption procedure, we determine
$r \equiv C^{(p+1)/4} \equiv 692^{\frac{23+1}{4}} \equiv 18 \pmod{23},\qquad s \equiv C^{(q+1)/4} \equiv 692^{\frac{31+1}{4}} \equiv 14 \pmod{31},$
and further the coefficients $a, b$ in Bézout's identity $23a + 31b = 1$ (using the Euclidean algorithm). We get $a = -4$, $b = 3$; the candidates for the original message are thus the integers $\mp 4 \cdot 23 \cdot 14 \pm 3 \cdot 31 \cdot 18 \pmod{713}$. We thus know that one of the integers 386, 603, 110, 327 is the message that was sent. □
11.G.7. Show how to encrypt and decrypt the message $M = 321$ in the Rabin cryptosystem with $n = 437$.

Solution. The encrypted text is obtained as the square modulo $n$: $C = 321^2 \equiv (-116)^2 = 13456 \equiv 346 \pmod{437}$. When decrypting, we use the factorization $n = 437 = 19 \cdot 23$ (its knowledge is the private key of the receiver), and we compute $r = 346^{\frac{19+1}{4}} = 346^5 \equiv 17 \equiv -2 \pmod{19}$ and $s = 346^{\frac{23+1}{4}} = 346^6 \equiv 1 \pmod{23}$. Applying the Euclidean algorithm to the coprime pair (19, 23), we determine the coefficients in Bézout's identity: $19 \cdot (-6) + 23 \cdot 5 = 1$. The message is then one of the integers $\pm 6 \cdot 19 \cdot 1 \pm 5 \cdot 23 \cdot (-2) \pmod{437}$, i.e., $M = \pm 116$ or $M = \pm 344$. Indeed, $M = -116 \equiv 321 \pmod{437}$. □

11.6.2. Greatest common divisor and modular inverses. As we have already shown, the computation of the solution of the congruence $a \cdot x \equiv 1 \pmod m$ in the variable $x$ can be easily reduced (thanks to Bézout's identity) to the computation of the greatest common divisor of the integers $a$ and $m$ and to looking for the coefficients $k, l$ in Bézout's identity $k a + l m = 1$ (the integer $k$ is then the wanted inverse of $a$ modulo $m$).

    def extended_gcd(a, m):
        # returns (k, l) such that k*a + l*m = gcd(a, m)
        if m == 0:
            return (1, 0)
        q, r = divmod(a, m)
        k, l = extended_gcd(m, r)
        return (l, k - q * l)

A thorough analysis [8] shows that the problem of computing the greatest common divisor has quadratic time complexity.

[8] See, for example, D. Knuth, Art of Computer Programming, Volume 2: Seminumerical Algorithms, Addison-Wesley 1997, or Wikipedia, Euclidean algorithm, http://en.wikipedia.org/wiki/Euclidean_algorithm (as of July 29, 2017).

11.6.3. Modular exponentiation. The algorithm for modular exponentiation is based on the idea that when computing, for instance, $2^{64} \bmod 1000$, one need not calculate $2^{64}$ and only then divide it with remainder by 1000; it is better to multiply the 2's gradually and reduce the temporary result modulo 1000 whenever it exceeds this value. More importantly, there is no need to perform such a huge number of multiplications: in this case, 63 naive multiplications can be replaced with six squarings, as $2^{64} = (((((2^2)^2)^2)^2)^2)^2$.

    def modular_pow(base, exp, mod):
        result = 1
        while exp > 0:
            if exp % 2 == 1:
                result = (result * base) % mod
            exp >>= 1
            base = (base * base) % mod
        return result

The algorithm squares the base modulo $n$ for every binary digit of the exponent (which can be done in quadratic time in the worst case), and it performs a multiplication for every one in the binary representation of the exponent. Altogether, we are able to do modular exponentiation in cubic time in the worst case. We can also notice that the complexity depends a good deal on the binary form of the exponent.

Example. Let us compute $2^{560} \bmod 561$. Since $560 = (1000110000)_2$, the mentioned algorithm gives:

exp | base | result | last bit of exp
560 | 2 | 1 | 0
280 | 4 | 1 | 0
140 | 16 | 1 | 0
70 | 256 | 1 | 0
35 | 460 | 1 | 1
17 | 103 | 460 | 1
8 | 511 | 256 | 0
4 | 256 | 256 | 0
2 | 460 | 256 | 0
1 | 103 | 256 | 1
0 | 511 | 1 | 0

Therefore, $2^{560} \equiv 1 \pmod{561}$.
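A quick use of the two routines above (a sketch of ours; note that Python's built-in pow covers both tasks):

    k, l = extended_gcd(7, 20)       # 7k + 20l = 1, so k is 7^(-1) mod 20
    print(k % 20)                    # 3 -- the private key of exercise 11.G.1
    print(modular_pow(2, 560, 561))  # 1, as in the example above
    print(pow(2, 560, 561))          # the built-in three-argument pow agrees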
11.6.4. Primality testing. Although we have the Fundamental theorem of arithmetic, which guarantees that every natural number can be uniquely factored into a product of primes, this operation is very hard from the computational point of view. In practice, it is usually done in the following steps:
(1) finding all divisors below a given threshold (by trying all primes up to the threshold, which is usually somewhere around $10^6$);
(2) testing the remaining factor for compositeness (deciding whether some necessary condition for primality holds);
(a) if the compositeness test did not find the integer to be composite, i.e., it is likely to be a prime, then we test it for primality to verify that it is indeed a prime;
(b) if the compositeness test proved that the integer is composite, then we try to find a non-trivial divisor.
The mentioned steps are executed in this order because the corresponding algorithms are gradually (and strongly) increasing in time complexity. In 2002, Agrawal, Kayal, and Saxena published an algorithm for primality testing in polynomial time, but it is still more efficient to use the above procedure in practice.

11.6.5. Compositeness tests – how to recognize composite numbers with certainty? The so-called compositeness tests check some necessary condition for primality. The easiest of such conditions is Fermat's little theorem.

Proposition (Fermat's test). Let $N$ be a natural number. If there is an $a \not\equiv 0 \pmod N$ such that $a^{N-1} \not\equiv 1 \pmod N$, then $N$ is not a prime.

Unfortunately, for a composite $N$, it still may not be easy to find an integer $a$ which reveals the compositeness of $N$. There are even exceptional integers $N$ for which the only integers $a$ with the mentioned property are those not coprime to $N$; finding such an $a$ is thus equivalent to finding a divisor, and hence to factoring $N$ into primes. These ugly (or extremely nice?) composite numbers $N$, for which every integer $a$ coprime to $N$ satisfies $a^{N-1} \equiv 1 \pmod N$, are called Carmichael numbers. The least of them [9] is $561 = 3 \cdot 11 \cdot 17$, and it was only in 1992 that it was proved [10] that there are even infinitely many of them.

[9] The first seven Carmichael numbers were apparently first discovered by the Czech priest and mathematician Václav Šimerka (1819–1887), who occupied himself with them much earlier than the American mathematician R. D. Carmichael (1879–1967), whose name they bear.
[10] W. R. Alford, A. Granville, C. Pomerance, There are Infinitely Many Carmichael Numbers, Annals of Mathematics, Vol. 139, No. 3 (1994), pp. 703–722.

Example. We prove that 561 is a Carmichael number, i.e., that every $a \in \mathbb{N}$ coprime to $3 \cdot 11 \cdot 17$ satisfies $a^{560} \equiv 1 \pmod{561}$. Thanks to the properties of congruences, it suffices to prove this congruence modulo 3, 11, and 17. However, this follows straight from Fermat's little theorem, since such an integer $a$ satisfies
$a^2 \equiv 1 \pmod 3,\qquad a^{10} \equiv 1 \pmod{11},\qquad a^{16} \equiv 1 \pmod{17},$
where all of 2, 10, and 16 divide 560; hence $a^{560} \equiv 1$ modulo 3, 11 as well as 17 for all integers $a$ coprime to 561 (see also Korselt's criterion mentioned below).
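A direct numerical confirmation (a brute-force sketch of ours, not part of the text):

    from math import gcd

    n = 561
    assert all(pow(a, n - 1, n) == 1
               for a in range(1, n) if gcd(a, n) == 1)  # 561 is Carmichael

    print(pow(2, 340, 341))  # 1: 341 is a Fermat pseudoprime to base 2 ...
    print(pow(3, 340, 341))  # ... but base 3 reveals it (result is not 1)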
11.6.6. Proposition (Korselt's criterion). A composite number $n$ is a Carmichael number if and only if both of the following conditions hold:
• $n$ is square-free (i.e., divisible by the square of no prime),
• $p - 1 \mid n - 1$ holds for all primes $p$ which divide $n$.

Proof. "⇐" We show that if $n$ satisfies the above two conditions and is composite, then every $a \in \mathbb{Z}$ coprime to $n$ satisfies $a^{n-1} \equiv 1 \pmod n$. Let $n$ factor into a product of distinct odd primes: $n = p_1 \cdots p_k$, where $p_i - 1 \mid n - 1$ for all $i \in \{1, \dots, k\}$. Since $(a, p_i) = 1$, we get from Fermat's little theorem that $a^{p_i - 1} \equiv 1 \pmod{p_i}$, whence (thanks to the condition $p_i - 1 \mid n - 1$) it also follows that $a^{n-1} \equiv 1 \pmod{p_i}$. This is true for all indices $i$, hence $a^{n-1} \equiv 1 \pmod n$, so $n$ is indeed a Carmichael number.

"⇒" A Carmichael number $n$ cannot be even, since then we would get for $a = -1$ that $a^{n-1} \equiv -1 \pmod n$, which would (since $a^{n-1} \equiv 1 \pmod n$) mean that $n$ equals 2 (and thus is not composite). Therefore, let $n$ factor as $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, where $p_i$ are distinct odd primes and $\alpha_i \in \mathbb{N}$. Thanks to theorem 11.3.16, we can choose for every $i$ a primitive root $g_i$ modulo $p_i^{\alpha_i}$, and the Chinese remainder theorem then yields an integer $a$ which satisfies $a \equiv g_i \pmod{p_i^{\alpha_i}}$ for all $i$ and which is apparently coprime to $n$. Further, we know from the assumption that $a^{n-1} \equiv 1 \pmod n$, so this holds modulo $p_i^{\alpha_i}$, and thus $g_i^{n-1} \equiv 1 \pmod{p_i^{\alpha_i}}$ as well. Since $g_i$ is a primitive root modulo $p_i^{\alpha_i}$, the integer $n - 1$ must be a multiple of its order, i.e. of $\varphi(p_i^{\alpha_i}) = p_i^{\alpha_i - 1}(p_i - 1)$. At the same time, we have $(p_i, n - 1) = 1$ (since $p_i \mid n$), so necessarily $\alpha_i = 1$ and $p_i - 1 \mid n - 1$. □

Fermat's primality test can be slightly improved to Euler's test, or even further with the help of the Jacobi symbol, yet this still does not mend the presented problem completely.

Proposition (Euler's test). Let $N$ be an odd natural number. If there is an integer $a \not\equiv 0 \pmod N$ such that $a^{\frac{N-1}{2}} \not\equiv \pm 1 \pmod N$, then $N$ is not a prime.

Proof. This follows directly from Fermat's test and the fact that, for $N$ odd, we have
$a^{N-1} - 1 = \left(a^{\frac{N-1}{2}} - 1\right)\left(a^{\frac{N-1}{2}} + 1\right).$ □

Proposition (Euler–Jacobi test). Let $N$ be an odd natural number. If there is an integer $a \not\equiv 0 \pmod N$ such that $a^{\frac{N-1}{2}} \not\equiv \left(\tfrac{a}{N}\right) \pmod N$, then $N$ is not a prime.

Proof. This follows immediately from lemma 11.4.12. □

Example. Let us consider $N = 561 = 3 \cdot 11 \cdot 17$ as before and let $a = 5$. Then, we have $5^{280} \equiv 1 \pmod 3$ and $5^{280} \equiv 1 \pmod{11}$, but $5^{280} \equiv -1 \pmod{17}$, so surely $5^{280} \not\equiv \pm 1 \pmod{561}$. Here, it did not hold that $a^{(N-1)/2} \equiv \pm 1 \pmod N$, so we did not even need to check the value of the Jacobi symbol $(5/561)$. However, the Euler–Jacobi test can often reveal a composite number even in the case when this power is equal to $\pm 1$.

Example. Euler's test cannot detect the compositeness of the integer $N = 1729 = 7 \cdot 13 \cdot 19$, since the integer $\frac{N-1}{2} = 864 = 2^5 \cdot 3^3$ is divisible by 6, 12, and 18, and so it follows from Fermat's little theorem that $a^{(N-1)/2} \equiv 1 \pmod N$ holds for all integers $a$ coprime to $N$. On the other hand, already for $a = 11$ we get $(11/1729) = -1$, so the Euler–Jacobi test is able to recognize the integer 1729 as composite.

Let us notice that the value of the Legendre or Jacobi symbol $(a/n)$ can be computed very efficiently thanks to the law of quadratic reciprocity [11], namely in time $O((\log a)(\log n))$.

[11] See H. Cohen, A Course in Computational Algebraic Number Theory, Springer, 1993.

Pseudoprimes

A composite number $n$ is called a pseudoprime if it passes the corresponding compositeness test without being revealed. We thus have
(1) Fermat pseudoprimes to base $a$,
(2) Euler (or Euler–Jacobi) pseudoprimes to base $a$,
(3) strong pseudoprimes to base $a$, which are composite numbers passing the compositeness test described below.

The following test is simple, yet (as shown in theorem 11.6.8) very efficient. It is a further refinement of Fermat's test, which we introduced at the beginning.

11.6.7. Theorem. Let $p$ be an odd prime. Let us write $p - 1 = 2^t \cdot q$, where $t$ is a natural number and $q$ is odd.
Then, every integer $a$ which is not a multiple of $p$ satisfies
$a^q \equiv 1 \pmod p$, or there exists an $e \in \{0, 1, \dots, t-1\}$ such that $a^{2^e q} \equiv -1 \pmod p.$

Proof. It follows from Fermat's little theorem that
$p \mid a^{p-1} - 1 = \left(a^{\frac{p-1}{2}} - 1\right)\left(a^{\frac{p-1}{2}} + 1\right) = \left(a^{\frac{p-1}{4}} - 1\right)\left(a^{\frac{p-1}{4}} + 1\right)\left(a^{\frac{p-1}{2}} + 1\right) = \cdots = \left(a^q - 1\right)\left(a^q + 1\right)\left(a^{2q} + 1\right)\cdots\left(a^{2^{t-1}q} + 1\right),$
whence the statement follows easily, since $p$ is a prime. □

Proposition (Miller–Rabin compositeness test). Let $N, t, q$ be natural numbers such that $N$ is odd and $N - 1 = 2^t \cdot q$, $2 \nmid q$. If there is an integer $a \not\equiv 0 \pmod N$ such that
$a^q \not\equiv 1 \pmod N$ and $a^{2^e q} \not\equiv -1 \pmod N$ for all $e \in \{0, 1, \dots, t-1\},$
then $N$ is not a prime.

Proof. The correctness of the test follows directly from the previous theorem. □

Miscellaneous types of pseudoprimes

In practice, this easy test rapidly increases the ability to recognize composite numbers. The least strong pseudoprime to base 2 is 2047 (while the least Fermat pseudoprime to base 2 was already 341), and considering the bases 2, 3, and 5, the least strong pseudoprime is 25326001. In other words, if we are to test integers below $2 \cdot 10^7$, it is sufficient to execute this compositeness test for the bases 2, 3, and 5; if the tested integer is not revealed to be composite, then it is surely a prime. On the other hand, it has been proved that no finite set of bases is sufficient for testing all natural numbers.

The Miller–Rabin test is a practical application of the previous statement, and we are even able to bound the probability of failure thanks to the following theorem, which we present without proof. [12]

[12] Schoof, René (2004), "Four primality testing algorithms", Algorithmic Number Theory: Lattices, Number Fields, Curves and Cryptography, Cambridge University Press, ISBN 978-0-521-80854-5.

11.6.8. Theorem. Let $N > 10$ be an odd composite number. Let us write $N - 1 = 2^t \cdot q$, where $t$ is a natural number and $q$ is odd. Then, at most a quarter of the integers from the set $\{a \in \mathbb{Z};\ 1 \le a < N,\ (a, N) = 1\}$ satisfy the following condition: $a^q \equiv 1 \pmod N$, or there is an $e \in \{0, 1, \dots, t-1\}$ satisfying $a^{2^e q} \equiv -1 \pmod N$.

In practical implementations, one usually tests about 20 random bases (or the least prime bases). In this case, the above theorem states that the probability of failing to reveal a composite number is less than $2^{-40}$. The time complexity of the algorithm is the same as in the case of modular exponentiation, i.e. cubic in the worst case. However, we should realize that the test is non-deterministic, and the reliability of its deterministic version depends on the so-called generalized Riemann hypothesis (GRH). [13]

[13] Wikipedia, Riemann hypothesis, http://en.wikipedia.org/wiki/Riemann_hypothesis (as of July 29, 2017).

11.6.9. Primality tests. Primality tests are usually applied when the used compositeness test claims that the examined integer is likely to be a prime, or they are executed straightaway for special types of integers. Let us first give a list of the best known tests, which includes historical tests as well as very modern ones.
(1) AKS – a general polynomial-time primality test discovered by the Indian mathematicians Agrawal, Kayal, and Saxena in 2002.
(2) Pocklington–Lehmer test – a primality test of subexponential complexity.
(3) Lucas–Lehmer test – a primality test for Mersenne numbers.
(4) Pépin's test – a primality test for Fermat numbers, from 1877.
(5) ECPP – a primality test based on the so-called elliptic curves.
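Before turning to these, here is the Miller–Rabin compositeness test from the proposition above as runnable code (a minimal sketch of ours, not a hardened implementation):

    import random

    def miller_rabin(N, rounds=20):
        # False: N is certainly composite; True: N passed all rounds,
        # so N is prime with probability of error below 4^(-rounds)
        if N < 2:
            return False
        if N % 2 == 0:
            return N == 2
        t, q = 0, N - 1
        while q % 2 == 0:             # write N - 1 = 2^t * q with q odd
            t += 1
            q //= 2
        for _ in range(rounds):
            a = random.randrange(2, N) if N > 3 else 2
            x = pow(a, q, N)
            if x == 1 or x == N - 1:
                continue
            for _ in range(t - 1):
                x = pow(x, 2, N)
                if x == N - 1:
                    break
            else:
                return False          # a witnesses the compositeness of N
        return True

    print(miller_rabin(561))          # False (almost surely): Carmichael
                                      # numbers are caught by this test
    print(miller_rabin(2**31 - 1))    # True: the Mersenne prime M_31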
Now, we introduce a standard primality test for Mersenne numbers.

Proposition (Lucas–Lehmer test). Let $q \ne 2$ be a prime, and let the sequence $(s_n)_{n=0}^{\infty}$ be defined recursively by $s_0 = 4$, $s_{n+1} = s_n^2 - 2$. Then, the integer $M_q = 2^q - 1$ is a prime if and only if $M_q$ divides $s_{q-2}$.

Proof. We will be working in the ring $R = \mathbb{Z}[\sqrt3] = \{a + b\sqrt3;\ a, b \in \mathbb{Z}\}$, where division with remainder behaves similarly as in the integers (see also 12.3.5). Let us set $\alpha = 2 + \sqrt3$, $\beta = 2 - \sqrt3$ and note that $\alpha + \beta = 4$, $\alpha \cdot \beta = 1$. First, we prove by induction that for all $n \in \mathbb{N}_0$,
(1) $s_n = \alpha^{2^n} + \beta^{2^n} = \beta^{2^n}\left(1 + \alpha^{2^{n+1}}\right).$
The statement is true for $n = 0$ since $s_0 = 4 = \alpha + \beta$. Now, suppose that it is true for $n - 1$; then, by the induction hypothesis,
$s_n = s_{n-1}^2 - 2 = \left(\alpha^{2^{n-1}} + \beta^{2^{n-1}}\right)^2 - 2 = \alpha^{2^n} + \beta^{2^n},$
since $\alpha^{2^{n-1}}\beta^{2^{n-1}} = (\alpha\beta)^{2^{n-1}} = 1$.

Further, since $M_q \equiv -1 \pmod 8$, we have $(2/M_q) = 1$, and it follows from the law of quadratic reciprocity that
$\left(\tfrac{3}{M_q}\right) = -\left(\tfrac{M_q}{3}\right) = -\left(\tfrac{2^q - 1}{3}\right) = -\left(\tfrac{1}{3}\right) = -1,$
since we have $2^q - 1 \equiv 1 \pmod 3$ for $q$ odd. Both of these expressions are valid even if $M_q$ is not a prime (in that case, they are Jacobi symbols). Let us note that in the last part of the proof, we use the extension of the congruence relation to the elements of the domain $\mathbb{Z}[\sqrt3]$: just like in the case of the integers, we write, for $\gamma, \delta \in \mathbb{Z}[\sqrt3]$, that $\gamma \equiv \delta \pmod p$ if $p \mid \gamma - \delta$. An analogue of proposition (ii) from 11.C.6 holds as well – if $p$ is a prime, then $(\gamma + \delta)^p \equiv \gamma^p + \delta^p \pmod p$ (the proof is identical to the one for the integers).

"⇒" Suppose that $M_q$ is a prime. We prove that $\alpha^{2^{q-1}} \equiv -1 \pmod{M_q}$, which, thanks to (1), implies $M_q \mid s_{q-2}$. Since $2^{(M_q-1)/2} \equiv (2/M_q) = 1 \pmod{M_q}$, there is a $y \in \mathbb{Z}$ such that $2y^2 \equiv 1 \pmod{M_q}$. We have $\left(y(1 + \sqrt3)\right)^2 = y^2(4 + 2\sqrt3) \equiv \alpha \pmod{M_q}$, whence, invoking Fermat's theorem and the relation $2^{q-1} = \frac{M_q + 1}{2}$, we get
$\alpha^{2^{q-1}} \equiv \left(y(1 + \sqrt3)\right)^{M_q + 1} \equiv y^2 \cdot y^{M_q - 1}\left(1 + \sqrt3\right)\left(1 + \sqrt3\right)^{M_q} \equiv y^2\left(1 + \sqrt3\right)\left(1 - \sqrt3\right) = -2y^2 \equiv -1 \pmod{M_q}.$
When deriving this, we made use of the fact that 3 is a quadratic nonresidue modulo $M_q$, so
$\left(1 + \sqrt3\right)^{M_q} \equiv 1 + \left(\sqrt3\right)^{M_q} = 1 + 3^{(M_q - 1)/2}\sqrt3 \equiv 1 - \sqrt3 \pmod{M_q}.$

"⇐" For the other direction, let $M_q \mid s_{q-2}$. Then, $M_q \mid s_{q-2} \cdot \alpha^{2^{q-2}} = 1 + \alpha^{2^{q-1}}$. If $p \ne 2, 3$ is a prime divisor of $M_q$, then $\alpha^{2^{q-1}} \equiv -1 \pmod p$ as well, and $\alpha^{2^q} \equiv 1 \pmod p$. Hence, $2^q$ is the order of $\alpha$ in the multiplicative group $T_p = \{a + b\sqrt3;\ 0 \le a, b < p\} \setminus \{0\}$. If we had $(3/p) = 1$, then we would get
$\alpha^{p-1} = \beta \cdot \alpha^p \equiv \beta\left(2^p + \left(\sqrt3\right)^p\right) \equiv \beta\left(2 + \sqrt3 \cdot 3^{(p-1)/2}\right) \equiv \beta\left(2 + \sqrt3\right) = 1 \pmod p,$
whence $p - 1$ would be a multiple of the order of $\alpha$, i.e. of $2^q$. However, this would mean that $p > p - 1 \ge 2^q > 2^q - 1 = M_q$, which contradicts the fact that $p$ is a divisor of $M_q$. Therefore, we have $(3/p) = -1$ and
$\alpha^{p+1} \equiv \left(2 + \sqrt3\right)\left(2 + \sqrt3\right)^p \equiv \left(2 + \sqrt3\right)\left(2 - \sqrt3\right) \equiv 1 \pmod p.$
The order of $\alpha$ modulo $p$ is $2^q$, hence $2^q \mid p + 1$, and especially $p \ge 2^q - 1 = M_q$. At the same time, $p$ is a prime divisor of $M_q$; therefore, $M_q = p$ is a prime. □

Unlike the proof, the implementation of this algorithm is very easy.

Algorithm (Lucas–Lehmer primality test):

    def LL_is_prime(q):
        s, M = 4, 2**q - 1
        for _ in range(q - 2):        # repeat q - 2 times
            s = (s * s - 2) % M
        return s == 0                 # True = PRIME, False = COMPOSITE
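For instance (a quick check of ours):

    for q in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31):
        print(q, LL_is_prime(q))
    # M_11 = 2047 = 23 * 89 and M_23, M_29 come out composite,
    # in accord with 11.F.2; the remaining exponents give Mersenne primes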
The time complexity of this test is asymptotically the same as in the case of the Miller–Rabin test; it is, however, more efficient in concrete instances.

Fermat numbers are integers of the form $F_n = 2^{2^n} + 1$. Pierre de Fermat conjectured in the 17th century that all integers of this form are primes (apparently driven by the effort to generalize the observation for $F_0 = 3$, $F_1 = 5$, $F_2 = 17$, $F_3 = 257$, and $F_4 = 65537$). However, in the 18th century, Leonhard Euler found out that $F_5 = 641 \cdot 6700417$, and we have not been able to discover any other Fermat primes so far. Since the size of Fermat numbers increases rapidly, computing with them takes many resources (so the following test is not used much). Nowadays, the least Fermat number which has not been tested is $F_{33}$, which has 2 585 827 973 digits and is thus much greater than the largest discovered prime.

Proposition (Pépin's test). A necessary and sufficient condition for the $n$-th Fermat number $F_n$ to be a prime is
$3^{\frac{F_n - 1}{2}} \equiv -1 \pmod{F_n}.$

We can see that this is a very simple test, which is actually a mere instance of Euler's compositeness test.

Proof of correctness of Pépin's test. First, suppose that $3^{(F_n - 1)/2} \equiv -1 \pmod{F_n}$. Then, $3^{F_n - 1} \equiv 1 \pmod{F_n}$. Since $F_n - 1$ is a power of two, $F_n - 1$ is necessarily the order of 3 modulo $F_n$. However, the order of every integer modulo $F_n$ is at most $\varphi(F_n) \le F_n - 1$; hence, in this case, we have $\varphi(F_n) = F_n - 1$, which means that $F_n$ is a prime.

For the other direction, let $F_n$ be a prime. From part (i) of lemma 11.4.12, we get that $3^{(F_n - 1)/2} \equiv (3/F_n) \pmod{F_n}$, so it suffices to determine the value $(3/F_n)$. This is easy: $F_n \equiv 2 \pmod 3$, thus $(F_n/3) = -1$; further, $F_n \equiv 1 \pmod 4$, and the law of quadratic reciprocity thus yields $(3/F_n) = (F_n/3) = -1$, which is what we wanted to prove. □

Now, we introduce a primality test which is a bit old, yet still widely used in modern computation systems – the so-called Pocklington–Lehmer test. First of all, however, we describe a simpler primality test for illustration, the so-called Lucas's test.

11.6.10. Theorem (Lucas). If for every prime divisor $q$ of $N - 1$, there is an $a$ such that
$a^{N-1} \equiv 1 \pmod N$ and $a^{\frac{N-1}{q}} \not\equiv 1 \pmod N,$
then $N$ is a prime.

Proof. It suffices to prove that $N - 1$ divides $\varphi(N)$ (a condition which apparently no composite number satisfies; it forces $\varphi(N) = N - 1$). If not, then there is a prime $q$ and an $r \in \mathbb{N}$ such that $q^r$ divides $N - 1$ but does not divide $\varphi(N)$. The order $e$ of the corresponding integer $a$ divides $N - 1$ (by the first condition) and does not divide $(N-1)/q$ (by the second condition); hence $q^r$ divides $e$. Furthermore, $e$ divides $\varphi(N)$, and so does $q^r$ – a contradiction. □

The integer $a$ from the previous theorem is called a primality witness for the integer $N$ (in this as well as in the other primality tests). Another general primality test is based on the above one. It is useful if we want to turn the high probability given by the Miller–Rabin compositeness test into certainty.

11.6.11. Theorem (Pocklington–Lehmer). Let $N$ be a natural number, $N > 1$. Let $p$ be a prime which divides $N - 1$. Further, suppose that there is an integer $a_p$ such that
$a_p^{N-1} \equiv 1 \pmod N$ and $\left(a_p^{\frac{N-1}{p}} - 1,\ N\right) = 1.$
Let $p^{\alpha_p}$ be the highest power of $p$ which divides $N - 1$. Then, every positive divisor $d$ of the integer $N$ satisfies $d \equiv 1 \pmod{p^{\alpha_p}}$.

Proof of the Pocklington–Lehmer theorem. Every positive divisor $d$ of the integer $N$ is a product of prime divisors of $N$, so it suffices to prove the theorem for prime values of $d$.
The condition $a_p^{N-1} \equiv 1 \pmod N$ implies that the integers $a_p$ and $N$ are coprime (any divisor they have in common must divide the right-hand side of the congruence as well). Then, $(a_p, d) = 1$ too, and we have $a_p^{d-1} \equiv 1 \pmod d$ by Fermat's theorem. Since $\left(a_p^{(N-1)/p} - 1, N\right) = 1$, we get $a_p^{(N-1)/p} \not\equiv 1 \pmod d$. Let $e$ denote the order of $a_p$ modulo $d$. Then, $e \mid d - 1$, $e \mid N - 1$, and $e \nmid (N-1)/p$. If we had $p^{\alpha_p} \nmid e$, then $e \mid N - 1$ would imply $e \mid \frac{N-1}{p}$, a contradiction. Therefore, $p^{\alpha_p} \mid e$, and so $p^{\alpha_p} \mid d - 1$. □

11.6.12. Theorem. Let $N \in \mathbb{N}$, $N > 1$. Suppose that we can write $N - 1 = F \cdot U$, where $(F, U) = 1$ and $F > \sqrt N$, and that we are familiar with the prime factorization of $F$. Then:
• if we can find, for every prime $p \mid F$, an integer $a_p \in \mathbb{Z}$ as in the above theorem, then $N$ is a prime;
• if $N$ is a prime, then for every prime $p \mid N - 1$, there is an integer $a_p \in \mathbb{Z}$ with the desired properties.

Proof. By theorem 11.6.11, any divisor $d > 1$ of the integer $N$ satisfies $d \equiv 1 \pmod{p^{\alpha_p}}$ for all prime factors $p$ of $F$, hence $d \equiv 1 \pmod F$, and so $d > \sqrt N$. If $N$ has no non-trivial divisor less than or equal to $\sqrt N$, then it is necessarily a prime.

On the other hand, it suffices to choose for $a_p$ a primitive root modulo the prime $N$ (independently of $p$). Then, it follows from Fermat's theorem that $a_p^{N-1} \equiv 1 \pmod N$, and since $a_p$ is a primitive root, we get $a_p^{(N-1)/p} \not\equiv 1 \pmod N$ for any $p \mid N - 1$. The integers $a_p$ are again called primality witnesses for the integer $N$. □

Remark. The previous test also contains Pépin's test as a special case (there, for $N = F_n$, we have $p = 2$, which is satisfied by the primality witness $a_p = 3$).

11.6.13. The polynomial test. (An editorial note from the draft, translated from the Czech: a description of the AKS algorithm is to be added here – considering whether to include the proof, and checking that the algebra chapter contains all the necessary prerequisites; alternatively, a proof of the statement on Miller–Rabin from Schoof's article cited above could be added. See Radan (ATC 2014) for a detailed account and, very briefly, McAndrew, Cryptography in Sage.)
11.6.14. Looking for divisors. If one of the compositeness tests verifies that a given integer is indeed composite, we usually want to find one of its non-trivial divisors.
However, this task is much more difficult than merely revealing that the integer is composite – let us recall that the compositeness tests guarantee the compositeness, yet they provide us with no divisors (which is, on the other hand, advantageous for RSA and similar cryptographic protocols). Therefore, we present here only a short summary of the methods used in practice and one sample for inspiration.
(1) Trial division
(2) Pollard's ρ-algorithm
(3) Pollard's p − 1 algorithm
(4) Elliptic curve method (ECM)
(5) Quadratic sieve (QS)
(6) Number field sieve (NFS)

For illustration, we demonstrate in the exercises (11.F.7, 11.F.8) one of these algorithms – Pollard's ρ-method – on concrete instances. This algorithm is especially suitable for finding relatively small divisors (since its expected complexity depends on the size of these divisors), and it is based on the following idea: given a random function $f: S \to S$, where $S$ is a finite set with $n$ elements, the sequence $(x_n)_{n=0}^{\infty}$ defined by $x_{n+1} = f(x_n)$ must eventually run into a cycle, and the expected length of the tail as well as of the period is $\sqrt{\pi n/8}$. The algorithm described below is again a straightforward implementation of these observations.

Algorithm (Pollard's ρ-method):

    from math import gcd

    def pollard_rho(n, f):
        # n: the integer to be factored, f: an appropriate function
        a, b, d = 2, 2, 1
        while d == 1:
            a = f(a) % n
            b = f(f(b) % n) % n
            d = gcd(a - b, n)
        return None if d == n else d   # None signals FAILURE

11.6.15. Public-key cryptography. In present-day practice, the most important application of number theory is the so-called public-key cryptography. We shall only briefly touch this fascinating area of mathematics. The main objectives are to provide
• encryption: a message encrypted with the public key of the receiver can be decrypted by no one else (to be precise, by no one who does not know the corresponding private key);
• signatures: the integrity of a message signed with the private key of the sender can be verified by anyone with access to the sender's public key.

The most basic and most often used protocols in public-key cryptography are:
• RSA (encryption) and the derived system for signing messages,
• Digital Signature Algorithm (DSA) and its variant based on elliptic curves (ECDSA),
• Rabin cryptosystem (and signature scheme),
• ElGamal cryptosystem (and signature scheme),
• elliptic curve cryptography (ECC),
• Diffie–Hellman key exchange protocol (DH).

11.6.16. Encryption – RSA. First, we describe the best known public-key cipher – RSA. The simple principle of the RSA protocol [14] is based on Euler's theorem 11.3.12:
• The user A needs a pair of keys – a public one ($V_A$) and a private one ($S_A$).
• Key generation: the user selects two large primes $p, q$ and calculates $n = pq$, $\varphi(n) = (p-1)(q-1)$. The integer $n$ is part of the public key; the main idea is that it is too hard to compute $\varphi(n)$ without knowing the decomposition of $n$.
• Then, the user chooses the second part of the public key, $e$, and verifies that $(e, \varphi(n)) = 1$.
• Using the Euclidean algorithm, the private key $d$ is computed so that $e \cdot d \equiv 1 \pmod{\varphi(n)}$.

[14] Ron Rivest, Adi Shamir, Leonard Adleman (1977); C. Cocks of the secret service GCHQ (not publicly) as early as 1973.

The principle of RSA

The secret communication runs in the following steps (for the sake of simplicity, we identify the encryption procedure with the public key $V_A$ and the decryption procedure with the private key $S_A$):
• Encrypting the numerical code of a message M addressed to the user A (by any other participant who has access to the public key $V_A$): $C = V_A(M) \equiv M^e \pmod n$.
11.6.15. Public-key cryptography. In present-day practice, the most important application of number theory is so-called public-key cryptography. We shall only briefly touch this fascinating area of mathematics. The main objectives are to provide
• encryption: a message encrypted with the public key of the receiver can be decrypted by no one else (more precisely, by no one who does not know the corresponding private key);
• signature: the integrity of a message signed with the private key of the sender can be verified by anyone with access to the sender's public key.
The most basic and most often used protocols in public-key cryptography are:
• RSA (encryption) and the derived system for signing messages,
• Digital Signature Algorithm – DSA and its variant based on elliptic curves (ECDSA),
• Rabin cryptosystem (and signature scheme),
• ElGamal cryptosystem (and signature scheme),
• elliptic curve cryptography (ECC),
• Diffie-Hellman key exchange protocol (DH).

11.6.16. Encryption – RSA. First, we describe the best-known public-key cipher – RSA. The simple principle of the protocol RSA14 is based on Euler's theorem 11.3.12:
• The user A needs a pair of keys – a public one (V_A) and a private one (S_A).
• Key generation: the user selects two large primes p, q, and calculates n = pq and φ(n) = (p − 1)(q − 1). The integer n is part of the public key; the main idea is that it is too hard to compute φ(n) without knowing the decomposition of n.
• Then, the user chooses the second half of the public key, e, and verifies that (e, φ(n)) = 1.
• Using the Euclidean algorithm, the private key d is computed so that e · d ≡ 1 (mod φ(n)).

The principle of RSA
The secret communication runs in the following steps (for the sake of simplicity, we further identify the encryption procedure with the public key V_A and the decryption procedure with the private key S_A):
• Encrypting the numerical code of a message M addressed to the user A (by any other participant who has access to the public key V_A): C = V_A(M) ≡ M^e (mod n).
• Decrypting the cipher C by the user A: the plaintext is M = S_A(C) ≡ C^d (mod n).

14 Ron Rivest, Adi Shamir, Leonard Adleman (1977); C. Cocks of the secret service GCHQ, not publicly, as early as 1973.

The proof of correctness of this protocol (i.e., that A indeed receives what was meant) is a straightforward application of Euler's theorem: thanks to 11.3.13, it holds for any message M which is coprime to n that (M^e)^d ≡ M^1 = M (mod n). In the (extremely unlikely) case that the message M is not coprime to n, the statement holds as well, although the proof needs to be modified with the help of the Chinese remainder theorem (however, we should realize that if a message M with 0 < M < n is not coprime to n, then (M, n) is a non-trivial divisor of n, so the key of the receiver is actually compromised).
The security of RSA has been tested since it was invented in 1977, and no meaningful weakness (except for side channels or some singular keys) has been discovered yet (provided a sufficiently large key is used; nowadays it is recommended to use at least 2048 bits). Nevertheless, it has not been proved that breaking RSA really relies on the hardness of integer factorization. For practical reasons, the requirements on a secure choice of the key are:
• d is large enough (defense against the so-called Wiener's attack),
• p and q are not too close to each other (see the exercise 11.G.1),
• the public exponent is selected to be at least e = 65537 (although no direct attack against a small public exponent e has been noticed).
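To make the key generation and the two steps above concrete, here is a toy Python computation (our own illustration) with absurdly small primes; as noted above, real keys must have at least 2048 bits.

from math import gcd

p, q = 61, 53                 # two "large" primes (toy values)
n = p * q                     # n = 3233, part of the public key
phi = (p - 1) * (q - 1)       # φ(n) = 3120, must remain secret
e = 17                        # second half of the public key
assert gcd(e, phi) == 1       # (e, φ(n)) = 1
d = pow(e, -1, phi)           # private key: e·d ≡ 1 (mod φ(n)), here d = 2753
M = 65                        # numerical code of the message
C = pow(M, e, n)              # encryption: C ≡ M^e (mod n)
assert pow(C, d, n) == M      # decryption: C^d ≡ M (mod n)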
11.6.17. Rabin cryptosystem. Further, we mention a simplified variant of the protocol named the Rabin cryptosystem15, which was the first public-key cryptosystem where one demonstrably needs to factorize the modulus in order to break it (unlike RSA, for which this has not been proved):
• Every participant A needs a pair of keys – a public one (V_A) and a private one (S_A).
• Key generation: A chooses two large primes of roughly the same size, p, q ≡ 3 (mod 4), and computes n = pq.
• The public key is V_A = n; the private key is the pair S_A = (p, q).
The secret communication then runs as follows:
• Encryption of the numerical code of the message M: C = V_A(M) ≡ M^2 (mod n).
• Decryption of the cipher C: the (four) square roots of C modulo n are computed, and it is easily found out which one of them is the original message (for instance, the other three make no sense or do not contain the agreed identification).

15 Rabin, Michael. Digitalized Signatures and Public-Key Functions as Intractable as Factorization. MIT Laboratory for Computer Science, January 1979.

As we can see from the description of the protocol, the process of decryption requires the computation of the square root of C modulo n = pq, where p ≡ q ≡ 3 (mod 4). This can be done as follows:
• The values r ≡ C^{(p+1)/4} (mod p) and s ≡ C^{(q+1)/4} (mod q) are computed.
• Further, we need to determine the coefficients a, b in Bézout's identity, i.e., integers for which ap + bq = 1.
• We set x ≡ aps + bqr (mod n), y ≡ aps − bqr (mod n).
• The square roots of C modulo n are then ±x, ±y.
Let us mention that this is in fact an application of the Chinese remainder theorem and of the fact that we are able to easily find the solutions of the quadratic congruence x^2 ≡ a (mod p) provided p ≡ 3 (mod 4) (see exercise 11.D.38). Indeed, it holds that
(±x)^2 = (aps + bqr)^2 ≡ (bqr)^2 ≡ r^2 ≡ C^{(p+1)/2} ≡ C (mod p),
where we made use of the fact that bq ≡ 1 (mod p) and that C ≡ M^2 (mod p) is a quadratic residue modulo p, hence C^{(p−1)/2} ≡ (C/p) = 1 (mod p). Similarly, we have (±x)^2 ≡ C (mod q) as well; thus ±x is a square root of C modulo n. The derivation of the same result for y is nearly identical.

11.6.18. Digital signature. Now, let us briefly describe the principle of the digital signature.

Principle of the digital signature
Creating the signature:
(1) A digest (hash) H_M of the message is generated; the length of the hash is fixed (160 or 256 bits, for instance) – we should realize that such a mapping is surely not injective (there will be many messages sharing the same hash).
(2) The signature of the message, S_A(H_M), is created from this hash using the knowledge of the private key of the signer (similarly to decryption of a message's text).
(3) The message M is sent (optionally encrypted with the public key of the receiver) together with the created signature.
The signature verification then runs as follows:
(1) A digest H′_M is generated for the received message M (after decryption, if it has been encrypted).
(2) Using the public key of the (declared) sender of the message, the original digest of the message is reconstructed: V_A(S_A(H_M)) = H_M.
(3) The digests are then compared, i.e., it is found out whether H_M = H′_M.

The (cryptographic) hash function mentioned above must have the following properties:
• It is easy to find the hash of any message.
• It is computationally infeasible to find any message with a given hash.
• It is computationally infeasible to find two messages with the same hash (the function must be collision-resistant).
• Every change of the message changes the hash as well.
The best-known examples of such functions are:
• MD5 (128 bits, Rivest 1992) – not collision-resistant,
• SHA-1 (160 bits, NSA 1995) – since 2005 considered insufficiently collision-resistant,
• RIPEMD-320,
• SHA-3.

11.6.19. Diffie-Hellman key exchange system. Another important type of protocol, which is very often used in practice, is a protocol for key exchange in symmetric cryptography – the Diffie-Hellman key exchange16, whose discovery was a breakthrough in this discipline: it made it possible to replace one-time keys, couriers with briefcases, and the like by purely mathematical means, in particular without the necessity of prior communication of both sides. The protocol for the agreement of two sides (Alice, Bob) on a common key (an integer) is as follows:

Principle of the DH key exchange protocol
• Both sides agree on a prime p and a primitive root g modulo p (this need not be done secretly).
• Alice chooses a random integer a and sends g^a (mod p).
• Bob chooses a random integer b and sends g^b (mod p).
• The common key for the communication is then g^{ab} (mod p).

The security of this protocol relies on the fact that it is hard to compute the discrete logarithm (the so-called discrete logarithm problem) – see also part 11.3.15.

16 Whitfield Diffie, Martin Hellman (1976); M. Williamson of the secret service GCHQ as early as 1974 (not published).
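A toy run of the exchange in Python (our own illustration; the prime p = 23 and the primitive root g = 5 are tiny demonstration values, not parameters anyone should use):

import random

p, g = 23, 5                      # public parameters: a prime and a primitive root mod p
a = random.randrange(2, p - 1)    # Alice's secret exponent
b = random.randrange(2, p - 1)    # Bob's secret exponent
A = pow(g, a, p)                  # Alice sends g^a mod p
B = pow(g, b, p)                  # Bob sends g^b mod p
keyA = pow(B, a, p)               # Alice computes (g^b)^a
keyB = pow(A, b, p)               # Bob computes (g^a)^b
assert keyA == keyB               # both obtained the common key g^(ab) mod p

An eavesdropper sees p, g, g^a and g^b, and recovering the key amounts to the discrete logarithm problem mentioned above.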
There is another encryption algorithm based on the Diffie-Hellman key exchange protocol – the ElGamal algorithm – which we also describe in short:
• Every participant chooses a prime p with a primitive root g.
• Further, they choose a private key x, compute h ≡ g^x (mod p), and publish the public key (p, g, h).
The secret communication with a participant A then runs as follows:
• Encryption of the numerical code of the message M: choosing a random y and computing C_1 ≡ g^y (mod p) and C_2 ≡ M · h^y (mod p), then sending the pair (C_1, C_2) to the participant A.
• The participant A then decrypts the message by computing C_2 / C_1^x (mod p).
Remark. The mechanism of digital signature can be derived from the ElGamal algorithm just as in the case of RSA.

H. Additional exercises to the whole chapter

11.H.1. Find all three-digit n ∈ N for which 11 | n and n/11 equals the sum of the squares of the digits of n. ⃝
11.H.2. Someone falsely memorized Fermat's little theorem and claims that for p ∤ a we have a^p ≡ 1 (mod p). Find all pairs (a, p) for which this is true. ⃝
11.H.3. Prove that the last four digits of 5^n (n = 4, 5, 6, . . .) form a periodic sequence. Find this period. ⃝
11.H.4. Prove that there are infinitely many odd natural numbers k such that the integer 2^{2^n} + k is composite for every n ∈ N. ⃝
11.H.5. Prove that for every integer k ̸= 1, there are infinitely many natural numbers n such that the integer 2^{2^n} + k is composite. ⃝
11.H.6. Consider the sequence (a_n)_{n≥1} defined by a_n = 2^n + 3^n + 6^n − 1. Prove that for every prime p, this sequence contains a multiple of p. ⃝
11.H.7. Prove that no natural number n greater than 1 satisfies n | 2^n − 1. ⃝
11.H.8. Prove that for every odd prime p, there are infinitely many natural numbers n such that p | n · 2^n + 1. ⃝
11.H.9. Let a function f : N → N satisfy (f(a), f(b)) = (f(a), f(|a − b|)) for all a, b ∈ N. Prove that (f(a), f(b)) = f((a, b)). Show that this implies the result of exercise 11.A.7 as well as the fact that (F_a, F_b) = F_{(a,b)}, where F_a denotes the a-th term of the Fibonacci sequence. ⃝
11.H.10. Let the RSA parameters be n = 143 = 11 · 13, e = 7, d = 103. Sign the message m = 8, and verify this signature. Decide whether s = 42 is the signature of the message m = 26. ⃝

Key to the exercises
10.C.10. The answer is (3/5) · (2/3) + (2/5) · 1 = 4/5.
10.E.17. Simply, a = 3/8. Thus, the distribution function of the random variable X is F_X(t) = t^3/8 for t ∈ (0, 2), zero for smaller values of t, and one for greater. Let Z = X^3 denote the random variable corresponding to the volume of the considered cube. It lies in the interval (0, 8). Thus, for t ∈ (0, 8) and the distribution function F_Z of the random variable Z, we can write F_Z(t) = P[Z < t] = P[X^3 < t] = P[X < t^{1/3}] = F_X(t^{1/3}) = t/8. Then, the density is f_Z(t) = 1/8 on the interval (0, 8) and zero elsewhere. Since this is the uniform distribution on the given interval, the expected value is equal to 4.
10.E.21. f_{(X,Y)}(u, v) = uv for (u, v) ∈ (0, 1) × (0, 2), f_{(X,Y)}(u, v) = 0 otherwise. Then, the wanted probability is P = ∫_0^1 ∫_{x^2}^2 xy dy dx = 11/12.
10.E.22. P = ∫_0^{∛2} ∫_{x^3}^2 xy dy dx = ∛4/12.
10.F.9. EU = 1 · 0.6 + 2 · 0.4 = 1.4, EU^2 = 0.4 + 4 · 0.6 = 2.8, EV = 0.3 + 0.6 + 1.2 = 2.1, EV^2 = 0.3 + 1.2 + 3.6 = 5.1, E(UV) = 2.8, var(U) = 2.8 − 1.96 = 0.84, var(V) = 5.1 − 4.41 = 0.69, cov(U, V) = 2.8 − 1.4 · 2.1 = −0.14, ρ_{U,V} = −0.14/√(0.84 · 0.69).
10.F.10. EX = 1/3, var X = 4/45.
10.F.11. ρ_{X,Y} = −1.
10.F.12. ρ_{U,V} ≈ −0.421.

CHAPTER 12
Algebraic structures

The more abstraction, the more chaos? – no, it is often the other way round...

In this chapter, we begin a seemingly very formal study. But the concepts reflect many properties of things and phenomena surrounding us.
This is one of the parts of the book which is not in the prerequisites of any other chapter. Large parts serve as a quick illustration of interesting uses of mathematical tools and models. The simplest properties of real objects are used for encoding in terms of algebraic operations. Thus, "algebra" considers algorithmic manipulations with letters which usually correspond to computations or descriptions of processes. Strictly speaking, this chapter builds only on the first and sixth parts of chapter one, where abstract views on numbers and relations between objects are introduced. But it is a focal point for abstract versions of many concepts already met.
The first two sections aim at direct generalizations of the familiar algebraic structure of numbers. This leads to a discussion of rings of polynomials. Only then do we provide an introduction to group theory, where there is only a single operation. The last two sections provide some glimpses of direct applications. The construction of (self-correcting) codes often used in data transfer is considered. The last section explains the elementary foundations of computer algebra. This includes solving polynomial equations and algorithmic methods for manipulation and calculations with formal expressions.

1. Posets and Boolean algebras

Familiarity with the properties of addition and multiplication of scalars and matrices is assumed, and likewise with the binary operations of set intersection and union in elementary set theory, as indicated at the end of the first chapter. We proceed to work with symbols which stand for miscellaneous objects, resulting in the universal applicability of the results. This allows the relating of the basic set operations to propositional logic, which formalizes methods for expressing propositions and evaluating truth values.

12.1.1. Algebraic operations. For any set M, there is the set K = 2^M consisting of all subsets of M, together with the operations of union ∨ : K × K → K and intersection ∧ : K × K → K. This is an instance of an algebraic structure on the set K with two binary operations. In general, we write (K, ∨, ∧). In the special case of sets, these binary operations are denoted rather by ∪ and ∩, respectively.
To every set A ∈ K, its complement A′ = M \ A can be assigned. This is another operation ′ : K → K with only one argument. Such operations are called unary operations. In general, there are algebraic structures with k operations μ_1, . . . , μ_k, each μ_j : K × · · · × K → K (with i_j factors of K) having i_j arguments, and we write (K, μ_1, . . . , μ_k) for such a structure. The number i_j of arguments is called the arity of the operation ("unary", "binary", etc.). If i_j = 0, then the operation has no arguments, which means it is a distinguished element in K.
With subsets in K = 2^M, there is the unique "greatest object", namely the entire set M, which is neutral for the ∧ operation. Similarly, the empty set ∅ ∈ K is the only neutral element for ∨. Notice that if M is empty, then K contains the only element ∅.
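The set algebra 2^M can be explored directly on a computer; the following Python sketch (our own illustration) models subsets of a three-element set M as frozensets, with join as union, meet as intersection, and complement A′ = M \ A, and verifies one of the distributivity laws below by brute force.

from itertools import chain, combinations

M = frozenset({1, 2, 3})
# the underlying set K = 2^M of all subsets of M
K = [frozenset(s) for s in
     chain.from_iterable(combinations(M, r) for r in range(len(M) + 1))]

join = lambda a, b: a | b      # supremum = union
meet = lambda a, b: a & b      # infimum = intersection
comp = lambda a: M - a         # complement A' = M \ A

# check A ∧ (B ∨ C) = (A ∧ B) ∨ (A ∧ C) for all triples
assert all(meet(a, join(b, c)) == join(meet(a, b), meet(a, c))
           for a in K for b in K for c in K)

The remaining axioms listed in the next subsection can be checked in exactly the same way.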
12.1.2. Set algebra. View the algebraic structure on the set K = 2^M from the previous paragraph as (K, ∨, ∧, ′, 1, 0), with two binary operations, one unary operation (the complement), and two special elements 1 = M, 0 = ∅. It is easily verified that all elements A, B, C ∈ K satisfy the following properties:

Axioms of Boolean algebras
(1) A ∧ (B ∧ C) = (A ∧ B) ∧ C,
(2) A ∨ (B ∨ C) = (A ∨ B) ∨ C,
(3) A ∧ B = B ∧ A, A ∨ B = B ∨ A,
(4) A ∧ (B ∨ C) = (A ∧ B) ∨ (A ∧ C),
(5) A ∨ (B ∧ C) = (A ∨ B) ∧ (A ∨ C),
(6) there is a 0 ∈ K such that A ∨ 0 = A,
(7) there is a 1 ∈ K such that A ∧ 1 = A,
(8) A ∧ A′ = 0, A ∨ A′ = 1.

Compare these properties with those of the scalars (K, +, ·, 0, 1): properties (1) and (2) say that both the operations ∧ and ∨ are associative, and property (3) says that both operations are also commutative. So far, this is the same as for the addition and multiplication of scalars, and there are neutral elements for both operations there as well. However, the properties (4) and (5) are stronger now: they require the distributivity of ∧ over ∨ as well as of ∨ over ∧. Of course, this cannot be the case for addition and multiplication of numbers: there, multiplication distributes over addition, but not vice versa. Properties (6)–(8) require the existence of neutral elements for both operations as well as the existence of an analogy to the "inverse" of each element. (Note, however, that the meet with the complement results in the neutral element for ∨, and the join with the complement in the neutral element for ∧. This is the other way round for numbers.)

A. Boolean algebras and lattices

12.A.1. Find the (complete) disjunctive normal form of the proposition (B′ ⇒ C) ∧ [(A ∨ C) ∧ B]′.
Solution. If the propositional formula contains only a few variables (in our case, three), the most advantageous procedure is to build the truth table of the formula and read the disjunctive normal form off it. The table consists of 2^3 = 8 rows. The examined formula is denoted φ.

A B C | B′ ⇒ C | [(A ∨ C) ∧ B]′ | φ
0 0 0 |   0    |       1        | 0
0 0 1 |   1    |       1        | 1
0 1 0 |   1    |       1        | 1
0 1 1 |   1    |       0        | 0
1 0 0 |   0    |       1        | 0
1 0 1 |   1    |       1        | 1
1 1 0 |   1    |       0        | 0
1 1 1 |   1    |       0        | 0

The resulting complete disjunctive normal form is the disjunction of the formulas that correspond to the rows with 1 in the last column (the formula is true for the given valuation of the atomic propositions). Each such row corresponds to the conjunction of the variables (if the corresponding value is 1) or their negations (if it is 0). In our case, it is the disjunction of the conjunctions corresponding to the second, third, and sixth rows, i.e., the result is
(A′ ∧ B′ ∧ C) ∨ (A′ ∧ B ∧ C′) ∨ (A ∧ B′ ∧ C).
We can also rewrite the formula by expanding the connective ⇒ in terms of ∧ and ∨, using the De Morgan laws and distributivity:
(B′ ⇒ C) ∧ [(A ∨ C) ∧ B]′
⇐⇒ (B ∨ C) ∧ [(A ∨ C)′ ∨ B′]
⇐⇒ (B ∨ C) ∧ [(A′ ∧ C′) ∨ B′]
⇐⇒ [(B ∨ C) ∧ (A′ ∧ C′)] ∨ [(B ∨ C) ∧ B′]
⇐⇒ [(B ∧ A′ ∧ C′) ∨ (C ∧ A′ ∧ C′)] ∨ [(B ∧ B′) ∨ (C ∧ B′)]
⇐⇒ (B ∧ A′ ∧ C′) ∨ (C ∧ B′),
which is an (incomplete) disjunctive normal form of the given formula. Clearly, it is equivalent to our result above. (The word "complete" means that each disjunct (called a clause in this context) contains each of the three variables or their negations (these are called literals).) □

12.A.2. Find a disjunctive normal form of the formula ((A ∧ B) ∨ C)′ ∧ (A′ ∨ (B ∧ C′ ∧ D)). ⃝

We know several logical connectives: ∧, ∨, ⇒, ≡ and the unary ′. Any propositional formula with these connectives can be equivalently written using only some of them, for instance ∨ and ′. There are also connectives which alone suffice to express any propositional formula. Among binary connectives, these are NAND and NOR (A NAND B = (A ∧ B)′, A NOR B = (A ∨ B)′). Try to express each of the known connectives using only NAND, and then only NOR. These connectives are implemented in electric circuits as so-called "gates".

12.A.3. Express the propositional formula A ⇒ B using only NAND gates. ⃝
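The truth-table procedure of 12.A.1 is easy to mechanize. The Python sketch below (our own illustration) tabulates the formula φ = (B′ ⇒ C) ∧ [(A ∨ C) ∧ B]′, using the equivalence B′ ⇒ C ⇔ B ∨ C, and prints the clauses of the complete disjunctive normal form.

from itertools import product

def phi(a, b, c):
    # B' => C is equivalent to B ∨ C; the whole formula is (B ∨ C) ∧ ¬((A ∨ C) ∧ B)
    return (b or c) and not ((a or c) and b)

for a, b, c in product([0, 1], repeat=3):
    if phi(a, b, c):
        clause = " ∧ ".join(v if bit else v + "'"
                            for v, bit in zip("ABC", (a, b, c)))
        print(clause)
# prints: A' ∧ B' ∧ C,  A' ∧ B ∧ C',  A ∧ B' ∧ C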
There are only a few structures that possess such restrictive properties.

Boolean algebras
Definition. A set K together with two binary operations ∧, ∨, a unary operation ′, and two special elements 1, 0 which satisfy the properties (1)–(8) is called a Boolean algebra. The ∧ operation is called infimum or meet, and the ∨ operation is called supremum or join. The element A′ is called the complement of A.

Note that the axioms of Boolean algebras are symmetric with respect to interchanging the operations ∧ and ∨ together with the interchange of 0 and 1. This means that any proposition that can be derived from the axioms has a valid dual proposition, created by interchanging ∧ with ∨ and 0 with 1. This is the principle of duality.

12.1.3. Properties of Boolean algebras. As usual, we derive several elementary corollaries of the axioms. In particular, note that in the special case of the Boolean algebra of all subsets of a given set, this proves all the elementary properties known from set algebra.
The special elements 1 and 0 are unique as the neutral elements for ∧ and ∨: if there is a ˜0 with the same properties, then ˜0 = ˜0 ∨ 0 = 0, and similarly 1 is unique. Next, if B, C ∈ K satisfy the properties of A′ (axiom (8) in the above definition), then
B = B ∨ 0 = B ∨ (A ∧ C) = (B ∨ A) ∧ (B ∨ C) = 1 ∧ (B ∨ C) = B ∨ C,
and similarly C = C ∨ B = B ∨ C. Therefore, B = C, and so the complement of any A ∈ K is determined uniquely by its properties.
This observation means that given (K, ∧, ∨), there exists at most one operation ′ which completes it to a Boolean algebra (K, ∧, ∨, ′, 1, 0), together with unique elements 1 and 0. Therefore, one usually writes just (K, ∧, ∨), omitting the other three symbols.
The properties in the following proposition have their own names in set algebra: (2) are the absorption laws, (3) is the idempotency of the operations ∧ and ∨, and the equalities of (4) are the De Morgan laws.

12.A.4. Write down the truth table of the Boolean proposition ((A ∧ B) ∨ C)′.
Solution. Using the De Morgan and distributive laws, we express ((A ∧ B) ∨ C)′ = (A ∧ B)′ ∧ C′ = (A′ ∨ B′) ∧ C′ = (A′ ∧ C′) ∨ (B′ ∧ C′). Setting 1 for the value True and 0 for False for A, B, C, we obtain the table for φ(A, B, C) = (A′ ∧ C′) ∨ (B′ ∧ C′):

A B C | φ
0 0 0 | 1
0 0 1 | 0
0 1 0 | 1
1 0 0 | 1
0 1 1 | 0
1 0 1 | 0
1 1 0 | 0
1 1 1 | 0
□

12.A.5. For A = 0, B = 0, C = 1, find the value of the Boolean proposition A ∧ B ∨ C.
Solution. A ∧ B ∨ C = (0 ∧ 0) ∨ 1 = 0 ∨ 1 = 1, which is the value True. □

12.A.6. A switching circuit is an electrical network that consists of switches which are open or closed, as in the figure. If switch a is open, we set a = 0; if switch a is closed, we set a = 1, and so on. Switches labeled with the same letter must be either all open or all closed simultaneously. Switch a′ is open if and only if switch a is closed. If current can flow between the extreme left and right ends of the circuit, the output of the circuit is 1, and 0 otherwise. The switching circuit representing a ∧ b is called a series circuit, while the circuit representing a ∨ b is called a parallel circuit. Represent the switching circuit in the figure as a Boolean proposition.
Solution. We can write this circuit as the proposition φ = a ∧ X ∧ b, where X = b ∨ d′ ∨ (c ∧ (a ∨ d ∨ c′)). Thus, φ = a ∧ (b ∨ d′ ∨ (c ∧ (a ∨ d ∨ c′))) ∧ b. □

12.A.7. Draw a switching circuit representing the Boolean proposition (a ∧ ((b ∧ c′) ∨ (b′ ∧ c))) ∨ (a ∧ b ∧ c).
Solution. See the figure. □
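Switching circuits such as those in 12.A.6 and 12.A.7 are just Boolean functions, so they can be analyzed exhaustively; the Python sketch below (our own illustration) tabulates the circuit φ = a ∧ (b ∨ d′ ∨ (c ∧ (a ∨ d ∨ c′))) ∧ b of 12.A.6.

from itertools import product

def circuit(a, b, c, d):
    # series connection = and, parallel connection = or, primed switch = not
    x = b or (not d) or (c and (a or d or (not c)))
    return a and x and b

for a, b, c, d in product([False, True], repeat=4):
    if circuit(a, b, c, d):
        print(a, b, c, d)   # the switch settings under which current flows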
Further properties of Boolean algebras
Proposition. In every Boolean algebra (K, ∧, ∨),
(1) A ∧ 0 = 0, A ∨ 1 = 1,
(2) A ∧ (A ∨ B) = A, A ∨ (A ∧ B) = A,
(3) A ∧ A = A, A ∨ A = A,
(4) (A ∧ B)′ = A′ ∨ B′, (A ∨ B)′ = A′ ∧ B′,
(5) (A′)′ = A
for all A, B ∈ K.

Proof. By the principle of duality, it suffices to prove only one of the claims in each item. Begin with (3), of course using just the axioms of Boolean algebras:
A = A ∧ 1 = A ∧ (A ∨ A′) = (A ∧ A) ∨ 0 = A ∧ A.
Now, (1) is proved easily:
A ∧ 0 = A ∧ (A ∧ A′) = (A ∧ A) ∧ A′ = A ∧ A′ = 0.
(2) is also easy (read the second equality from right to left):
A ∧ (A ∨ B) = (A ∨ 0) ∧ (A ∨ B) = A ∨ (0 ∧ B) = A ∨ 0 = A.
In order to prove the De Morgan laws, it suffices to verify that A′ ∨ B′ has the properties of the complement of A ∧ B; by the uniqueness proved above, it then must be the complement. Using (1), compute
(A ∧ B) ∧ (A′ ∨ B′) = ((A ∧ B) ∧ A′) ∨ ((A ∧ B) ∧ B′) = (0 ∧ B) ∨ (A ∧ 0) = 0.
Similarly,
(A ∧ B) ∨ (A′ ∨ B′) = (A ∨ (A′ ∨ B′)) ∧ (B ∨ (A′ ∨ B′)) = (1 ∨ B′) ∧ (1 ∨ A′) = 1.
Finally, from the definition, A′ ∧ A = 0 and A′ ∨ A = 1. Hence, A has the required properties of the complement of A′, which means that A = (A′)′. □

12.1.4. Examples of Boolean algebras. The intersection and union on all subsets of a given set M always define a Boolean algebra. The smallest one arises when M is a singleton: then K = 2^M contains two elements, namely 0 = ∅ and 1 = M, with the obvious equalities 0 ∧ 1 = 0, 0 ∨ 1 = 1, etc. The operations ∧ and ∨ are the same as multiplication and addition in the remainder class ring Z_2 of even and odd numbers. This is called the Boolean algebra Z_2. This is the only case when a Boolean algebra is a field of scalars at the same time!
As in the case of rings of scalars or vector spaces, the algebraic structure of a Boolean algebra can be extended to all spaces of functions whose codomain is a Boolean algebra. For the set S = {f : M → K} of all functions from a set M to a Boolean algebra (K, ∧, ∨), the necessary operations and the distinguished elements 0 and 1 on S can be defined as functions of an argument x ∈ M as follows:
(f_1 ∧ f_2)(x) = (f_1(x)) ∧ (f_2(x)) ∈ K,
(f_1 ∨ f_2)(x) = (f_1(x)) ∨ (f_2(x)) ∈ K,
(1)(x) = 1 ∈ K,
(0)(x) = 0 ∈ K,
(f)′(x) = (f(x))′ ∈ K.
It is easy and straightforward to verify that these new operations define a Boolean algebra.
Recall that the subsets of a given set M can be viewed as mappings M → Z_2 (the elements of the subset in question are mapped to 1, while all others go to 0). Then, the union and intersection can be defined in the above manner — for instance, evaluating the expression (A ∨ B)(x) for a point x ∈ M determines whether x lies in A, whether it lies in B, and takes the join of the two results in Z_2. The result is 1 if and only if x lies in the union.

12.A.8. Verify the equivalence of the following two Boolean propositions: (a ∨ b) ∧ (c ∨ d) and (a ∧ c) ∨ (a ∧ d) ∨ (b ∧ c) ∨ (b ∧ d).
Solution. The distributive laws yield (a ∨ b) ∧ (c ∨ d) = ((a ∨ b) ∧ c) ∨ ((a ∨ b) ∧ d) = (a ∧ c) ∨ (b ∧ c) ∨ (a ∧ d) ∨ (b ∧ d). □

12.A.9. Determine whether the following two Boolean propositions are equivalent: ((a′ ∧ b) ∨ (a ∧ c′))′ and (a ∨ b′) ∧ (a ∨ c′).
Solution. Setting a = 1, b = 1, c = 0 gives ((a′ ∧ b) ∨ (a ∧ c′))′ = 0 and (a ∨ b′) ∧ (a ∨ c′) = 1. Hence, these propositions are not equivalent. □

12.A.10. Determine whether the following two Boolean propositions are equivalent: (a ∧ b ∧ c) ∨ (a ∨ c)′ and (a ∧ c) ∨ (a′ ∧ c′).
Solution. Setting a = 1, b = 0, c = 1 gives (a ∧ b ∧ c) ∨ (a ∨ c)′ = 0 and (a ∧ c) ∨ (a′ ∧ c′) = 1. Hence, these propositions are not equivalent. □

12.A.11. Show that the negation a′ cannot be obtained by a combinatorial circuit that consists of ∧ and ∨ operations only.
Solution. We prove the claim by induction on the number n of operations in the circuit. For n = 1, it is clear that neither proposition b ∧ c nor b ∨ c is equivalent to a′. Suppose that a′ cannot be realized by any circuit with at most n operations built from ∨ and ∧ only, and consider a circuit with n + 1 operations and the first operation into which the input a enters; it is either ∧ or ∨. If it is ∧ with the second argument either a or 1, then its output equals the input a, and this ∧ can be deleted from the circuit, with the input a going directly to the next available operation. If the second argument is 0, then a ∧ 0 = 0, and the operation can again be deleted, with 0 going as input to the next available operation. If a enters the circuit at an ∨ whose second argument is either a or 0, then the ∨ can be deleted, with a entering the next operation. If the second argument equals 1, then the result of the ∨ is 1, and 1 can be entered into the next available operation. In each case we obtain an equivalent circuit with at most n operations, so the induction hypothesis applies. □
12.A.12. Simplify the formula ((A ∧ B) ∨ (A ⇒ B)) ∧ ((B′ ⇒ C) ∨ (B ∧ C′)).
Solution. In Boolean algebra notation (writing + for ∨ and · for ∧), we obtain (a · b + a′ + b) · (b + c + b · c′) = (a′ + b) · (b + c) = a′ · c + b. This means that the given formula is equivalent to (A′ ∧ C) ∨ B. □

12.1.5. Propositional logic. The latter simple observation brings us close to the calculus of elementary logic. View the notations for operations in a Boolean algebra as creating "words" from the elements A, B, . . . ∈ K, the operations ∨, ∧, ′, and parentheses, which clarify the desired precedence of the operations. The axioms of Boolean algebras and their corollaries say how different words may produce the same result in K. This is clear in the case of K = 2^M, the set of all subsets of a given set; it is just equality of subsets.
Now, another interpretation is mentioned, in terms of operations in formal logic. We work with words as above, but view them as propositions composed from elementary (atomic) propositions A, B, . . . and the logical operations AND (the binary operation ∧), OR (the binary operation ∨), and the negation NOT (the unary operation ′). These words are called propositions. They are assigned a truth value depending on the truth values of the individual atomic propositions. The truth value is an element of the trivial Boolean algebra Z_2, i.e., either 0 or 1.
The truth value of a proposition is completely determined by assigning the truth values for the simplest propositions A ∧ B, A ∨ B and A′: A ∧ B is defined to be true if and only if both A and B are true; A ∨ B is false if and only if both A and B are false; the value of A′ is complementary to that of A.
A proposition with n elementary propositions defines a function (Z_2)^n → Z_2. Two propositions are called logically equivalent if and only if they define the same function. By the previous paragraphs, the set of all classes of logically equivalent propositions has the structure of a Boolean algebra, so propositional logic satisfies everything proved for general Boolean algebras. Next, we consider how other usual simple propositions of propositional logic are represented as elements of the Boolean algebra.
Expressions always correspond to a class of logically equivalent propositions:

The standard logical operators
(1) A AND B corresponds to A ∧ B,
(2) A OR B corresponds to A ∨ B,
(3) the implication A ⇒ B can be obtained as A′ ∨ B,
(4) the equivalence A ⇔ B corresponds to (A ∧ B) ∨ (A′ ∧ B′),
(5) the exclusive OR, known as A XOR B, is given as (A ∧ B′) ∨ (A′ ∧ B),
(6) the negation of OR, A NOR B, is expressed as A′ ∧ B′,
(7) the negation of AND, A NAND B, is given as A′ ∨ B′,
(8) a tautology (a proposition which is always true) is given in terms of an arbitrary atomic proposition as A ∨ A′; its negation (always false) is A ∧ A′.

Note that in set algebra, XOR corresponds to the symmetric difference.

12.A.13. Design a logical proposition in three Boolean variables that takes the value True if and only if the majority of the variables are True.
Solution. For this purpose, a disjunctive normal form is quite suitable. Consider the proposition (a ∧ b ∧ c′) ∨ (a ∧ b′ ∧ c) ∨ (a′ ∧ b ∧ c) ∨ (a ∧ b ∧ c). □

12.A.14. Let A be the free Boolean algebra generated by countably many generators a_1, . . . , a_n, . . . with the operations ∧, ∨ and ′, whose elements are finitely many of these generators linked by the operations. Prove that this algebra is atomless.
Solution. Suppose f is an atom in A. Without loss of generality, f can be expressed in terms of the first n generators a_1, . . . , a_n. Then f ∧ a_{n+1} is neither f nor 0, so f cannot be an atom in A. □

12.A.15. Prove the monotonicity laws for Boolean algebras: if A ≤ Ã and B ≤ B̃, then
i) A ∨ B ≤ Ã ∨ B̃;
ii) A ∧ B ≤ Ã ∧ B̃;
iii) (Ã)′ ≤ A′.
Solution. Since (A ∨ B) ∨ (Ã ∨ B̃) = (A ∨ Ã) ∨ (B ∨ B̃) = Ã ∨ B̃, the first assertion follows. The second follows by duality from the fact that A ≤ Ã if and only if A ∧ Ã = A. An application of the De Morgan laws, (Ã)′ ∨ A′ = (A ∧ Ã)′ = A′, verifies the third. □

12.A.16. For Boolean algebras, prove that
i) A ≤ B if and only if B′ ≤ A′;
ii) A ≤ B if and only if A ∧ B′ = 0;
iii) C ∧ A ≤ B if and only if A ≤ C′ ∨ B.
Solution. By the previous exercise, A ≤ B yields B′ ≤ A′, and vice versa, A = (A′)′ ≤ (B′)′ = B. If A ≤ B, then A = A ∧ B and A ∧ B′ = A ∧ B ∧ B′ = 0. Conversely, from A ∧ B′ = 0 it follows that A = A ∧ (B ∨ B′) = (A ∧ B) ∨ (A ∧ B′) = A ∧ B, so A ≤ B. For the third assertion: since C ∧ A ≤ B and C′ ∧ A ≤ C′, the monotonicity of ∨ gives A = (C ∧ A) ∨ (C′ ∧ A) ≤ B ∨ C′. On the other hand, if A ≤ C′ ∨ B, then by the monotonicity of ∧, C ∧ A ≤ C ∧ (C′ ∨ B) = (C ∧ C′) ∨ (C ∧ B) = C ∧ B ≤ B. □

12.1.6. Switch boards as Boolean algebras. A switch is a black box with only two states – it is either on (and the signal goes through) or off (and the signal does not go through). One or more switches may be interconnected in a series circuit or a parallel circuit. The series circuit corresponds to the binary operation ∧, while the parallel circuit corresponds to ∨. The unary operation A′ defines a switch whose state is always the opposite of that of A.
Every finite word created from the switches A, B, . . . and the operations ∧, ∨, and ′ can be transformed into a diagram that represents a system of switches connected by wires, similarly as in the above subsection, where each choice of states of the individual switches gives the value "on/off" for the entire system. We thus speak of switchboards and their logical evaluation function. Again, it is easy to verify all the axioms of Boolean algebras for this system. The diagram illustrates one of the distributivity axioms. The circuit without a switch corresponds to 1.
When the endpoints are not connected, this corresponds to 0 (consider a series circuit of A and A′). Draw diagrams for all axioms of Boolean algebras and verify them! We return to this example shortly, showing that each expression in propositional logic can be modeled by a switchboard.

12.A.17. For any two elements A, B of a Boolean algebra, the symmetric difference of A and B is defined as A △ B = (A ∧ B′) ∨ (B ∧ A′). This is analogous to the symmetric difference of sets, A △ B = (A \ B) ∪ (B \ A). Show that
i) A = B if and only if A △ B = 0;
ii) A △ (B △ C) = (A △ B) △ C;
iii) A ∧ (B △ C) = (A ∧ B) △ (A ∧ C).
Solution. i) If A = B, then A ≤ B and B ≤ A, and therefore, by the previous exercise, A ∧ B′ = 0 and B ∧ A′ = 0; thus A △ B = 0. On the other hand, A △ B = 0 implies that both A ∧ B′ = 0 and B ∧ A′ = 0, which yield A ≤ B and B ≤ A, thus A = B.
ii) Expanding (B △ C)′ using distributivity and the De Morgan laws, we obtain (B △ C)′ = (B′ ∨ C) ∧ (C′ ∨ B) = (B ∧ C) ∨ (B′ ∧ C′). Further expanding A △ (B △ C), we get
A △ (B △ C) = (A ∧ ((B ∧ C) ∨ (B′ ∧ C′))) ∨ (A′ ∧ ((B ∧ C′) ∨ (B′ ∧ C))) = (A ∧ B ∧ C) ∨ (A ∧ B′ ∧ C′) ∨ (A′ ∧ B ∧ C′) ∨ (A′ ∧ B′ ∧ C),
which equals (A △ B) △ C by the obvious symmetry.
iii) This follows from A ∧ (B △ C) = A ∧ ((B ∧ C′) ∨ (C ∧ B′)) = (A ∧ B ∧ C′) ∨ (A ∧ C ∧ B′) and (A ∧ B) △ (A ∧ C) = (A ∧ B ∧ (A′ ∨ C′)) ∨ (A ∧ C ∧ (A′ ∨ B′)) = (A ∧ B ∧ C′) ∨ (A ∧ C ∧ B′). □

12.A.18. The Sheffer stroke is defined as A | B = A′ ∨ B′ (i.e., (A ∧ B)′), and the Peirce arrow as A ↓ B = A′ ∧ B′ (i.e., (A ∨ B)′). Prove that each of | and ↓ alone suffices to define A ∨ B, A ∧ B and A′.
Solution. It is clear from the De Morgan laws that A′ = A ↓ A and A ∧ B = (A ↓ A) ↓ (B ↓ B). Since A ∨ B = (A′ ∧ B′)′, we obtain that ↓ generates all three Boolean operations. Analogously, A′ = A | A and A ∨ B = A′ | B′ = (A | A) | (B | B). The De Morgan law A ∧ B = (A′ ∨ B′)′ proves that | also generates all three Boolean operations. □

12.A.19. The exclusive-or binary operation in a Boolean algebra is defined as A ⊕ B = (A ∨ B) ∧ (A′ ∨ B′). Prove that ⊕ and ∧ do not generate all Boolean operations.
Solution. Any expression built only from the operations ⊕ and ∧ yields the output False whenever all its inputs are False. Thus, A′ (which is True for A False) cannot be obtained from a proposition A using only ⊕ and ∧. □

12.1.7. Algebra of divisors. There are other natural examples of Boolean algebras. Choose a positive integer p ∈ N. The underlying set D_p is the set of all divisors q of p. For two such divisors q, r, define q ∧ r to be the greatest common divisor of q and r, and q ∨ r to be their least common multiple (cf. the previous chapter for the definitions and context). The distinguished element 1 ∈ D_p is defined to be p itself. The neutral element 0 for the join on D_p is the integer 1 ∈ N. The unary operation ′ is defined using division as q′ = p/q.

Proposition. The set D_p together with the above operations ∧, ∨, and ′ is a Boolean algebra if and only if the factorization of p contains no squares (i.e., in the unique factorization p = q_1 · · · q_n, the prime numbers q_i are pairwise distinct).

Proof. It is easy to verify the axioms of Boolean algebras under the assumptions of the proposition. It might be interesting to see where the assumption that p is squarefree is needed.
The greatest common divisor of a finite number of integers is independent of their order, and the same holds for the least common multiple. This corresponds to the axioms (1) and (2) in 12.1.2. The commutativity (3) is clear. For any three elements a, b, c, write their factorizations without loss of generality as
a = q_1^{n_1} · · · q_s^{n_s}, b = q_1^{m_1} · · · q_s^{m_s}, c = q_1^{k_1} · · · q_s^{k_s}.
Zero powers are allowed, and all q_j are pairwise coprime. Thus, a ∧ b ∈ D_p corresponds to the element in which each q_i occurs with the power that is the minimum of its powers in a and b; this holds analogously for a ∨ b and the maximum. The distributivity laws (4) and (5) of 12.1.2 now follow easily.
There is no problem with the existence of the distinguished elements 0 and 1. These are already defined directly and clearly satisfy the axioms (6) and (7). However, if there are squares in the factorization, then this prevents the existence of complements. For instance, in D_12 = {1, 2, 3, 4, 6, 12}, the equality 6 ∧ 6′ = 1 cannot be achieved, since 6 has a non-trivial divisor which is common to all other elements of D_12 except for 1, but 6 ∨ 1 = 6 ̸= 12. (The number 1 is the potential least element in D_12; it plays the role of 0 from the axioms.) Nevertheless, if there are no squares in the factorization of the integer p, then the complement can be defined as q′ = p/q, and it is easily verified that this definition satisfies the axiom 12.1.2(8). □
If there are no squares in the decomposition of p, then the number of all divisors is a power of 2. This suggests that these Boolean algebras are very similar to the set algebras we started with. We return to the classification of all finite Boolean algebras later. Before that, we consider structures like the divisors above for general p.

12.1.8. Partial order. There is a much more fundamental concept, the partial order; see the end of chapter 1. Recall that a partial order is a reflexive, antisymmetric, and transitive relation ≤ on a set K. A set with a partial order (K, ≤) is called a partially ordered set, or poset for short. The adjective "partial" means that, in general, this relation does not say whether a ≤ b or b ≤ a for every two different elements a, b ∈ K. If it does for each pair, it is called a linear order or a total order.

12.A.20. Translate the following sentence as a logical proposition, taking into account the ambiguity of the English language: Either school X improves its performance and continues to perform well, or school Y wins the academic competition.
Solution. Modern English is sometimes vague about the meaning of "or". Therefore, the sentence can be translated using either ∨ or ⊕ in place of the "or". Let A, B, C represent the propositions "school X improves its performance", "school X continues to perform well", and "school Y wins the academic competition". Then the sentence can be interpreted as
i) (A ∧ B) ∨ C, or
ii) (A ∧ B) ⊕ C.
These two propositions differ when all three propositions A, B, C have the value True. □

12.A.21. Let the logical implication → be defined by A → B = A′ ∨ B. Prove the following:
i) A′ → (B → C) = B → (A ∨ C);
ii) (A → B′)′ ∧ (A ∨ B)′ = 0, that is, the proposition is always False;
iii) (A ∨ B) ∧ (B′ ∨ C) → (A ∨ C) = 1, that is, the proposition is always True.

12.A.22. Show that the following divisibility relations are partial orders on X. Is any one of them a linear order?
i) X = N, with the relation | defined by m | n if m divides n.
ii) X is the set of all positive divisors of 36, again with the relation |.
Solution. i) As n | n for each n, the relation is reflexive. If m | n and n | m, then m = n, so it is antisymmetric. Since r | m and m | n imply r | n, it is also transitive. The order is not linear, since, for example, 4 and 5 are not divisors of each other.
ii) X = {1, 2, 3, 4, 6, 9, 12, 18, 36}. By the previous item, | is reflexive, antisymmetric, and transitive. However, it is again not linear: for instance, the divisors 2 and 3 are not related. □
12.A.23. Given a partial order ≤ on a set A, one can also define a corresponding strict order <, which is also useful in various situations. A relation R on A is called irreflexive if (x, x) ̸∈ R for all x ∈ A. R is called asymmetric if (x, y) ∈ R implies (y, x) ̸∈ R. A relation R on A is a strict order if it is transitive, irreflexive, and asymmetric. Let A be a set.
i) If ≤ is a partial order on A, then the relation < defined by a < b if a ≤ b and a ̸= b is a strict order on A.
ii) Given a strict order < on A, a partial order ≤ can be defined by a ≤ b if a < b or a = b.
Solution. Suppose ≤ is a partial order, with < defined by a < b if a ≤ b and a ̸= b. Then < is irreflexive, since a < b requires a ̸= b. Now suppose that a < b and b < c; then a ≤ c by the transitivity of ≤. If a = c, then the antisymmetry of ≤ (applied to a ≤ b and b ≤ a) would give a = b, a contradiction. Hence a < c, and < is transitive. Asymmetry follows similarly: a < b and b < a would yield a = b by antisymmetry. The second part is verified analogously. □

There is always a partial order on the set K = 2^M of all subsets of a given set M – the inclusion. In terms of meets or joins, the inclusion can be defined as A ⊆ B if and only if A ∧ B = A, or equivalently, A ⊆ B if and only if A ∨ B = B. In general, each Boolean algebra is a very special poset:

Lemma. Let (K, ∧, ∨) be a Boolean algebra. Then the relation ≤ defined by A ≤ B if and only if A ∧ B = A is a partial order. Moreover, for all A, B, C ∈ K:
(1) A ∧ B ≤ A,
(2) A ≤ A ∨ B,
(3) if A ≤ C and B ≤ C, then A ∨ B ≤ C,
(4) A ≤ B if and only if A ∧ B′ = 0,
(5) 0 ≤ A and A ≤ 1.

Proof. All the properties to be proved are results of simple calculations in the Boolean algebra K. Begin with the properties of a partial order for ≤. Reflexivity is a direct corollary of idempotency: A ∧ A = A, i.e., A ≤ A. Similarly, the commutativity of ∧ guarantees the antisymmetry of ≤, since if both A ∧ B = A and B ∧ A = B, then A = A ∧ B = B ∧ A = B. Finally, if A ∧ B = A and B ∧ C = B, then
A ∧ C = (A ∧ B) ∧ C = A ∧ (B ∧ C) = A ∧ B = A,
which verifies the transitivity of ≤.
Similarly, (A ∧ B) ∧ A = (A ∧ A) ∧ B = A ∧ B, that is, A ∧ B ≤ A, proving (1). It follows from A ∧ (A ∨ B) = A (see 12.1.3(2)) that A ≤ A ∨ B, which proves the claim (2). Distributivity together with the assumptions of (3) provides
(A ∨ B) ∧ C = (A ∧ C) ∨ (B ∧ C) = A ∨ B,
so (3) holds. The claim (5) follows directly from the axioms for the distinguished elements 1 and 0. It remains to prove (4). If A ≤ B, then A ∧ B′ = A ∧ B ∧ B′ = 0. On the other hand, if A ∧ B′ = 0, then A = A ∧ 1 = A ∧ (B ∨ B′) = (A ∧ B) ∨ (A ∧ B′) = (A ∧ B) ∨ 0 = A ∧ B. Hence A ≤ B, and the proof is finished. □

Note that, as for the algebra of subsets, in all Boolean algebras A ∧ B = A if and only if A ∨ B = B: if A ∧ B = A, then the absorption laws imply that A ∨ B = (A ∧ B) ∨ B = B, and vice versa. Therefore, the operation ∨ can also be used in the definition of the partial order.
Every poset (K, ≤) corresponds to an (oriented) graph (cf. the beginning of chapter 13 for definitions if necessary): the vertex set is K, and there is an edge leading from a to b if and only if a ≤ b. This is a convenient way to represent finite posets. A Hasse diagram of a poset is a drawing of this graph in the plane so that greater elements are drawn above lower ones. Since the edge orientation is implicitly given by this, it need not be drawn explicitly. Furthermore, loops and edges which are implied by transitivity and reflexivity are omitted in the diagram. Especially when K has only a few elements, this is a very transparent way of discussing several cases; see the examples in the exercise column.

12.A.24. Prove that a finite poset A can be embedded into a totally ordered set whose order extends the original order on A.
Solution. Let (A, ≤) be a finite poset with a minimal element a_1. The set A \ {a_1} remains a poset, with some minimal element a_2.
Choose a_3, . . . , a_n similarly until A is exhausted. Defining a_i ≤ a_j for i ≤ j gives a total order on A which extends the original one. □

12.A.25. Convert the poset of divisors of 36 into a linearly ordered set in different ways.
Solution. One order is the natural order in N: 1, 2, 3, 4, 6, 9, 12, 18, 36. Another order can be chosen as 1, 2, 4, 3, 6, 12, 9, 18, 36. □

12.A.26. Prove that lattices of the diamond and pentagon types are not distributive.
Solution. i) For the diamond lattice, with pairwise incomparable elements a, b, c between 0 and 1, we get a ∧ (b ∨ c) = a ∧ 1 = a, while (a ∧ b) ∨ (a ∧ c) = 0 ∨ 0 = 0.
ii) For the pentagon lattice, labelled so that 0 < b < c < 1 and a is incomparable to both b and c, we get c ∧ (a ∨ b) = c ∧ 1 = c, while (c ∧ a) ∨ (c ∧ b) = 0 ∨ b = b ̸= c. □

12.A.27. Prove the uniqueness of complements in distributive lattices, i.e., if (A, ≤) is a bounded distributive lattice with minimum 0 and maximum 1, then complements (when they exist) are unique.
Solution. Let x′ and z both denote complements of x ∈ A. Then
x′ = x′ ∧ 1 = x′ ∧ (x ∨ z) = (x′ ∧ x) ∨ (x′ ∧ z) = 0 ∨ (x′ ∧ z) = x′ ∧ z,
and, symmetrically, z = z ∧ (x ∨ x′) = (z ∧ x) ∨ (z ∧ x′) = z ∧ x′. Hence x′ = x′ ∧ z = z. □

12.A.28. Anne, Brenda, Kate, and Dana want to set out on a trip. Find out which of the girls will go if the following must hold: At least one of Brenda and Dana will go; at most one of Anne and Kate will go; at least one of Anne and Dana will go; at most one of Brenda and Kate will go; Brenda will not go unless Anne goes; and Kate will go if Dana goes.
Solution. Transforming the problem to Boolean algebra, simplifying, and transforming back, we find out that either Anne and Brenda will go, or Kate and Dana will go. □

12.1.9. Lattices. Not every poset arises in the latter way from a Boolean algebra. For instance, the trivial partial order is defined on any set by A ≤ A for each A, with all pairs of different elements incomparable. Such a poset cannot arise from a Boolean algebra if K contains more than one element (as we have seen, the least and greatest elements of a Boolean algebra are comparable to every element).
Think about to what extent the operations ∧ and ∨ can be built from a partial order. They are the suprema and infima of the following definition:

Lower and upper bounds, suprema, infima
Consider a fixed poset (K, ≤). An element C ∈ K is said to be a lower bound for a subset L ⊆ K if and only if C ≤ A for all A ∈ L. An element C ∈ K is said to be the greatest lower bound (or infimum) of a subset L ⊆ K if and only if it is a lower bound and D ≤ C for every lower bound D of L. Replacing ≤ with ≥ in the above, we obtain the definitions of an upper bound and of the least upper bound (or supremum) of a subset L.

If the suprema and infima exist for all couples A, B, they define the binary operations ∨ and ∧, respectively.

Lattices
Definition. A lattice is a poset (K, ≤) where every two-element set {A, B} has a supremum A ∨ B and an infimum A ∧ B. The poset (K, ≤) is said to be a complete lattice if and only if every subset of K has a supremum and an infimum.

The binary operations ∧ and ∨ on a lattice (K, ≤) are clearly commutative and associative (prove this in detail!). These properties ensure that all finite non-empty subsets of K possess infima and suprema. Note that any element of a lattice K is an upper bound for the empty set. Thus, in a complete lattice, the supremum of the empty set is the least element 0 of K.
Similarly, the infimum of the empty set is the greatest element 1 of K. Of course, a finite lattice (K, ≤) is always complete (with 1 being the supremum of all elements in K and 0 the infimum of all elements in K).
Remark. The poset of real numbers, completed with the greatest and least elements ±∞, is a complete lattice (with the standard ordering). We may view it as a completion of the poset of rational numbers. The classical Dedekind cut construction of this completion of the rational numbers extends to all posets. Indeed, the so-called Dedekind–MacNeille completion provides the unique smallest complete lattice containing a given poset. We shall not go into any details here.
A lattice is said to be distributive if and only if the operations ∧ and ∨ satisfy the distributivity axioms (4) and (5) of subsection 12.1.2. There are lattices which are not distributive; see the Hasse diagrams of the two simple lattices (the diamond and the pentagon) in exercise 12.A.26, and check that in both cases x ∧ (y ∨ z) ̸= (x ∧ y) ∨ (x ∧ z).
Now, Boolean algebras can be defined in terms of lattices: a Boolean algebra is a bounded distributive lattice in which each element has a complement (i.e., the axiom 12.1.2(8) is satisfied). It is already verified that the latter requirement determines the complements uniquely (see the ideas at the beginning of subsection 12.1.3), which means that this alternative definition of Boolean algebras is correct. During the discussion of the divisors of a given integer p, the distributive lattices D_p were encountered; these are Boolean algebras if and only if p is squarefree, see 12.1.7.

12.A.29. Solve the following problem by transforming it to Boolean algebra: Tom, Paul, Sam, Ralph, and Lucas are suspected of having committed a murder. It is certain that at the crime scene, there were: at least one of Tom and Ralph, at most one of Lucas and Paul, and at least one of Lucas and Tom. Sam could be there only if so was Ralph. However, if Sam was there, then so was Tom. Paul could never cooperate with Ralph, but Paul and Tom are an inseparable pair. Who committed the murder?
Solution. Transforming into Boolean algebra, using the first letter of each name (and writing + for ∨ and juxtaposition for ∧), we get
(t + r)(l′ + p′)(l + t)(r + s′)(s′ + t)(p′ + r′)(pt + p′t′),
and thanks to x² = x, xx′ = 0, x + x′ = 1, we can rearrange the above to s′r′ptl′ + s′rp′t′l. Thus, the murder was committed either by Tom and Paul or by Ralph and Lucas. □

12.A.30. A vote box for three voters is a box which processes three votes and outputs "yes" if and only if the majority of the voters are for. Design this box using switching circuits.
Solution. The majority function of 12.A.13 can be realized as the parallel connection of the three series circuits a ∧ b, a ∧ c, and b ∧ c. □

12.A.31. Find a finite subset of the set of positive integers which is not a lattice with respect to divisibility. ⃝

12.A.32. Find the number of partial orders on a given 4-element set. Draw the Hasse diagram of each isomorphism class and determine whether it is a lattice. Is one of them a Boolean algebra?
Solution. We go through all Hasse diagrams of the partial orders on a 4-element set M and, for each diagram, we count the number of partial orders (i.e., subsets of M × M) that correspond to it; see the picture. Altogether, there are 219 partial orders on a given 4-element set.
Note that the condition of the existence of suprema and infima of any pair of elements in a lattice implies (by induction) their existence for any finite non-empty subset. In particular, this means that every non-empty finite lattice has a greatest element as well as a least element. Using this criterion, we can see that only the last two Hasse diagrams may be lattices. Indeed, they are lattices; the first one is even a Boolean algebra. □
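The divisor lattices D_p of 12.1.7 also invite experiment. In the Python sketch below (our own illustration), the meet is gcd, the join is lcm, and the candidate complement is q′ = p/q; the test succeeds for the squarefree p = 30 and fails for p = 12, in accordance with the proposition of 12.1.7.

from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def is_boolean_algebra(p):
    divisors = [q for q in range(1, p + 1) if p % q == 0]
    # check q ∧ q' = 1 and q ∨ q' = p for the candidate complement q' = p/q
    return all(gcd(q, p // q) == 1 and lcm(q, p // q) == p for q in divisors)

print(is_boolean_algebra(30))   # True:  30 = 2·3·5 is squarefree
print(is_boolean_algebra(12))   # False: 12 contains the square 2^2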
12.1.10. Homomorphisms. When dealing with mathematical structures, most information about the objects can be obtained and understood via homomorphisms — mappings which preserve the corresponding operations. The linear mappings between vector spaces, or the continuous mappings on R^n or on any metric space with the given topology of open neighbourhoods, represent very good examples. This concept is particularly simple for posets:

Poset homomorphisms
Let (K, ≤_K) and (L, ≤_L) be posets. A mapping f : K → L is called a poset homomorphism (also an order-preserving, monotone, or isotone mapping) if A ≤_K B implies f(A) ≤_L f(B).

Although the structure of a Boolean algebra is completely determined by its subordinate poset structure, an isotone mapping does not necessarily respect suprema and infima. Non-comparable elements A, B can be mapped to the same image f(A) = f(B), while their supremum could be mapped to a strictly larger element f(A ∨ B).

12.A.33. Find the number of partial orders on the set {1, 2, 3, 4, 5} such that there are exactly two pairs of incomparable elements. ⃝

12.A.34. Draw the Hasse diagram of the lattice of all (positive) divisors of 36. Is this lattice distributive? Is it a Boolean algebra?
Solution. The lattice is distributive (it does not contain a sublattice isomorphic to the diamond or the pentagon), but it is not a Boolean algebra: 36 is not squarefree, so complements are missing (cf. 12.1.7). □

12.A.35. Draw the Hasse diagram of the lattice of all (positive) divisors of 30. Is this lattice distributive? Is it a Boolean algebra?
Solution. This lattice is a Boolean algebra, and it has 8 elements. All finite Boolean algebras are of size 2^n for an appropriate n, and for a fixed n, they are all isomorphic (see 12.1.16). This Boolean algebra is a "cube": its graph can be drawn as the projection of a cube onto the plane. □

In the case of Boolean algebras, homomorphisms are defined as follows:

Lattice and Boolean-algebra homomorphisms
A mapping f : (K, ∧, ∨) → (L, ∧, ∨) is a homomorphism of Boolean algebras if and only if for all A, B ∈ K,
(1) f(A ∧ B) = f(A) ∧ f(B),
(2) f(A ∨ B) = f(A) ∨ f(B),
(3) f(A′) = f(A)′.
Moreover, if f is bijective, it is an isomorphism of Boolean algebras. Similarly, lattice homomorphisms are defined as mappings which satisfy the properties (1) and (2).

It is easily verified that if a homomorphism f is bijective, then f^{−1} is also a homomorphism. It is clear from the definition of the partial order on Boolean algebras or lattices that every homomorphism f : K → L also satisfies f(A) ≤ f(B) for all A, B ∈ K with A ≤ B; i.e., it is in particular a poset homomorphism. Moreover, a homomorphism of Boolean algebras satisfies f(0) = 0 and f(1) = 1. The converse of the above is generally not true; that is, a poset homomorphism need not be a lattice homomorphism.

12.1.11. Fixed-point theorems. Many practical problems lead to a discussion of the existence and properties of fixed points of a mapping f : K → K on a set K, i.e., of elements x ∈ K such that f(x) = x. The concepts of infima and suprema allow the derivation of very strong propositions of this type surprisingly easily. Here follows a classical theorem proved by Knaster and Tarski1:

Tarski's fixed-point theorem
Theorem. Let (K, ∧, ∨) be a complete lattice and f : K → K a poset homomorphism.
Then f has a fixed point, and the set of all fixed points of f, together with the ordering restricted from K, is again a complete lattice.

Proof. Denote M = {x ∈ K; x ≤ f(x)}. Since K has a least element, M is non-empty. Since f is order-preserving, f(M) ⊆ M. Denote z_1 = sup M. Then, for x ∈ M, x ≤ z_1, which means that f(x) ≤ f(z_1). At the same time, x ≤ f(x); hence f(z_1) is an upper bound for M. Then z_1 ≤ f(z_1), so z_1 ∈ M; consequently f(z_1) ∈ M, and hence f(z_1) ≤ z_1. It follows that f(z_1) = z_1, so a fixed point is found.

1 Knaster and Tarski proved this in the special case of the Boolean algebra of all subsets of a given set already in 1928, cf. Ann. Soc. Polon. Math. 6: 133–134. Much later, in 1955, Tarski published the general result, cf. Pacific Journal of Mathematics 5:2, 285–309. Alfred Tarski (1901–1983) was a renowned and influential Polish logician, mathematician and philosopher, who worked most of his active career in Berkeley, California. His elder colleague Bronisław Knaster (1893–1980) was also a Polish mathematician.

12.A.36. Decide whether every lattice on a 3-element set is a chain, i.e., whether each pair of elements is necessarily comparable.
Solution. As we noticed in exercise 12.A.32, every finite non-empty lattice must contain a greatest element and a least element. Each of these is comparable to any other element, so the remaining element is comparable with these two; and there are no other elements. □

12.A.37. Find an example of two lattices and a poset homomorphism between them which is not a lattice homomorphism.
Solution. Again, we return to exercise 12.A.32 and consider the mapping in the figure. □

12.A.38. Decide whether every lattice homomorphism between finite non-empty lattices K, L maps the least element of K to the least element of L.
Solution. No: any constant mapping between two lattices is a lattice homomorphism. Thus, sending everything to an element different from the least one gives the desired counterexample. □

12.A.39. Decide whether every chain which has a greatest element and a least element is a complete lattice.
Solution. No. Consider the set of non-zero integers and order it as follows: any positive integer is greater than any negative integer, but the ordering among the positive integers is reversed, as well as among the negative integers. Then, 1 is the greatest element of the resulting chain, and −1 is the least element. However, the subset of all positive integers does not have an infimum in this poset. Formally, we define the linear order ≺ on Z \ {0} by
a ≺ b ⇐⇒ [(sgn(a) · sgn(b) = 1 ∧ a > b) ∨ (sgn(a) < sgn(b))]. □

It is more difficult to verify the last statement of the theorem, namely that the set Z ⊆ K of all fixed points of f is a complete lattice. The greatest element z_1 = max Z has already been found. Analogously, using the infimum and the property f(x) ≤ x in the definition of M, we find the least element z_0 = min Z.
Consider any non-empty set Q ⊆ Z and denote y = sup Q (in K). This supremum need not lie in Z. However, as we shall see shortly, Q has a supremum with respect to the partial order of K restricted to Z. For that purpose, denote R = {x ∈ K; y ≤ x}. It is clear from the definitions that this set, together with the partial order of K restricted to R, is again a complete lattice, and that the restriction of f to R is again a poset homomorphism f|_R : R → R. By the above, f|_R has a least fixed point ȳ.
Of course, ȳ ∈ Z, and ȳ is the supremum of the set Q with respect to the inherited order on Z. Note that it is possible that ȳ > y. Analogously, the infimum of any non-empty subset of Z can be found. Since the least and greatest elements have already been found, the proof is finished. □

Remark. In the literature, one may find many variants of fixed-point theorems, in various contexts. One very useful variant is Kleene's recursion theorem, which can be derived from the theorem just proved and formulated as follows. Consider a poset homomorphism f (using the notation of Tarski's fixed-point theorem) and the countable subset of K formed by the Kleene chain 0 ≤ f(0) ≤ f(f(0)) ≤ · · · . Then, the supremum z of this chain cannot be greater than any fixed point of f: if y is a fixed point of f, then it follows from 0 ≤ y that f(0) ≤ f(y) = y, etc. Moreover, if f is assumed continuous in a certain sense of reasonably preserving suprema, then it can be shown that f(z) is also the supremum of this chain and hence is a fixed point. Therefore, it is the smallest fixed point. This statement is called Kleene's fixed-point theorem. It has many applications in recursion theory, when discussing the termination of algorithms, etc. We omit details about the necessary "continuity" of mappings between posets and further generalizations.2 We point out the added value compared to the general formulation of Tarski's theorem — Kleene's theorem provides an iterative computational process approaching the least fixed point from the given "seed", the minimal point.

2 Stephen Cole Kleene (1909–1994) was a famous American mathematician working with Church, Turing, Post and others. The interested reader may consult a full exposition of the above-mentioned theorem in chapter 1 of the book: V. Stoltenberg-Hansen, I. Lindström, E. R. Griffor, Mathematical Theory of Domains, Cambridge Tracts in Theoretical Computer Science, Cambridge University Press, 1994.

12.A.40. Give an example of an infinite chain which is a complete lattice.
Solution. We can take the set of real numbers together with −∞, ∞, where −∞ is the least element (and thus the supremum of the empty set) and ∞ is the greatest element (and thus the infimum of the empty set). The lattice suprema and infima are thus defined in accordance with these concepts for the real numbers. Moreover, −∞ is the infimum of the subsets which are not bounded from below, and similarly, ∞ is the supremum of the subsets which are not bounded from above. □

12.A.41. Decide whether the set of all convex subsets of R^3 is a lattice (with respect to suitably defined operations of suprema and infima). If so, is this lattice complete? Distributive?
Solution. It is a lattice. The infimum is simply the intersection, since the intersection of convex subsets is again convex. The supremum is the convex hull of the union. It is clear that the lattice axioms are indeed satisfied for these operations (think this out!). The lattice is complete, since the above operations work for infinite systems as well, and clearly, the lattice has both a least element (the empty set) and a greatest element (the entire space). However, the lattice is not distributive. For example, consider three unit balls B_1, B_2, B_3 centered at [3, 0, 0], [−3, 0, 0], [0, 0, 0], respectively. Then, B_1 ∨ (B_2 ∧ B_3) = B_1 ̸= (B_1 ∨ B_3) ∧ (B_1 ∨ B_2). □
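The iterative flavour of Kleene's theorem can be seen on a tiny finite lattice. The following Python sketch (our own toy example) iterates an order-preserving map on the power-set lattice of {1, 2, 3}, starting from the least element ∅, until the Kleene chain 0 ≤ f(0) ≤ f(f(0)) ≤ · · · stabilizes at the least fixed point.

M = frozenset({1, 2, 3})

def f(x):
    # an order-preserving map on the lattice 2^M: adjoin the element 1
    return x | {1}

x = frozenset()      # the least element 0 = ∅, the seed of the Kleene chain
while f(x) != x:     # iterate until a fixed point is reached
    x = f(x)
print(x)             # frozenset({1}) -- the least fixed point of f

In a finite lattice the chain must stabilize; the "continuity" assumption discussed above is what guarantees the analogous behaviour in the infinite case.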
Solution. This is a lattice, infima correspond to intersections and suprema to sums of vector spaces (it is easy to verify that these operations satisfy the lattice axioms). This lattice is complete (the operations work for infinite subsets as well, the least element is the zero-dimensional subspace, and the greatest element is the entire space). However, it is not distributive (consider three lines in a plane). □ CHAPTER 12. ALGEBRAIC STRUCTURES 12.1.12. Back to Boolean algebras. When discussing propositional logic, there is the problem of what exactly are the elements of the corresponding Boolean algebra. Formally, they are defined as the classes of equivalent propositions. In other words, we work with truth-value functions for propositions with a given number of arguments. There is the problem of recognizing propositions which are equivalent in this sense. There is the question of whether every function (Z2)n → Z2 can be defined in terms of the basic logical operations. Clearly all such functions form a Boolean algebra, since their values are in the Boolean algebra Z2. Similarly, there is the problem of deciding whether or not two systems of switches can have the same function. Just as for propositions, a system consisting of n switches corresponds to a function (Z2)n → Z2. There are 22n such functions. A Boolean algebra can be naturally defined on these functions (again using the fact that the function values are in the Boolean algebra Z2). We summarize a few such questions: some basic questions Question 1: Are all finite Boolean algebras (K, ∧, ∨) defined on sets K with 2n elements? Question 2: Can each function (Z2)n → Z2 be the truth function of some logical expression built of n elementary propositions and the logical operators? Question 3: How to recognize whether two such expressions represent the same function? Question 4: Can each function (Z2)n → Z2 be realized by some switch board with n switches? Question 5: How to recognize whether two switchboards represent the same function? All these questions are answered by finding the normal form of every element of a general Boolean algebra. This is achieved by writing it as the join of certain particularly simple elements. By comparing the normal forms of any pair of elements, it is easily determined whether or not they are the same. This helps to classify all finite Boolean algebras, giving the affirmative answer to question 1. 12.1.13. Atoms and normal forms. First, define the “simplest” elements of a Boolean algebra: Atoms in a Boolean algebra Let K be a Boolean algebra. An element A ∈ K, A ̸= 0, is called an atom if and only if for all B ∈ K, A ∧ B = A or A ∧ B = 0. In other words, A ̸= 0 is an atom if and only if there are only two elements B such that B ≤ A, namely B = 0 and B = A. Note that 0 is not considered an atom, just as the integer 1 is not considered a prime. Let us remark that infinite Boolean algebras may contain no atoms at all. 1052 B. Rings 12.B.1. Decide whether the set R with the operations ⊕, ⊙ form a ring, a commutative ring, an integral domain or a field: i) R = Z, a ⊕ b = a + b + 3, a ⊙ b = −3, ii) R = Z, a ⊕ b = a + b − 3, a ⊙ b = a · b − 1, iii) R = Z, a ⊕ b = a + b − 1, a ⊙ b = a + b − a · b, iv) R = Q, a ⊕ b = a + b, a ⊙ b = b, v) R = Q, a ⊕ b = a + b + 1, a ⊙ b = a + b + a · b, vi) R = Q, a ⊕ b = a + b − 1, a ⊙ b = a + b + a · b. ⃝ Solution. i) not a ring (but is a commutative rng), ii) not a ring, iii) an integral domain, iv) not a ring, v) a field, vi) not a ring. □ 12.B.2. 
Prove that the subset Z[i] = {a + bi | a, b ∈ Z} of the complex numbers is an integral domain. Is it a field? Solution. Any subring of an integral domain must be an integral domain again. In this case, we are talking about a subset of the field C (thus also an integral domain). Since the subset is closed with respect to all the operations (sum, additive inverse, multiplication) and contains both 0 and 1, it is indeed a subring. However, multiplicative inverses exist only for the numbers 1, i, −1, −i (these form the so-called subgroup of units – invertible elements), so it is not a field. □ 12.B.3. In the ring of 2-by-2 matrices over the real numbers, consider the subring of matrices of the form ( a −b b a ) . Prove that this subring is isomorphic to C. Solution. We will show that the isomorphism is given by the mapping φ : ( a −b b a ) → a + ib. The multiplication in the subring works as follows: ( a −b b a ) · ( c −d d c ) = ( ac − bd −bc − ad bc + ad ac − bd ) , and in, C, we have (a + ib)(c + id) = ac − bd + i(bc + ad). Hence we can see that φ is a homomorphism with respect to multiplication. Since addition is defined componentwise, φ is a homomorphism to it as well. Moreover, this mapping is clearly both injective and surjective, thus it is an isomorphism. □ 12.B.4. Prove that the identity is the only automorphism of the field of real numbers. Solution. Consider an automorphism φ : R → R. Clearly, it must satisfy φ(0) = 0 and φ(1) = 1. Since φ respects addition, we must have for all positive integers n that φ(n) = φ(1 + 1 + · · · + 1) = nφ(1) = n and φ(−n) = −n. Since it respects multiplication, we must have for any integers CHAPTER 12. ALGEBRAIC STRUCTURES The situation is very simple in the Boolean algebra of all subsets of a given finite set M. Clearly, the atoms are precisely the singletons A = {x}. For every subset B, either A ∧ B = A (if x ∈ B) or A ∧ B = 0 (if x /∈ B). The requirements fail whenever there is more than one element in A. Next, consider which elements are atoms in the Boolean algebra of functions of the switch boards with n switches A1, . . . , An. It can be easily verified that there are 2n atoms, which are of the form Aσ1 1 ∧· · · ∧Aσn n , where either Aσi i = Ai or Aσi i = A′ i. The infimum φ ∧ ψ of functions φ and ψ is the function whose values are given by the products of the corresponding values in Z2. Therefore, φ ≤ ψ if φ takes the value 1 ∈ Z2 only on arguments where ψ also has value 1. Hence in the Boolean algebra of truth-value functions, a function φ is an atom if and only if φ returns 1 ∈ Z2 for exactly one of the 2n possible choices of arguments. All these functions can be created in the above mentioned manner. Now, the promised theorem can be formulated. While this one is called the disjunctive normal form, there is also the opposite version with the suprema and infima interchanged (the conjunctive normal form). Disjunctive normal form Theorem. Each element B of a finite Boolean algebra (K, ∧, ∨) can be written as a supremum of atoms B = A1 ∨ · · · ∨ Ak. This expression is unique up to the order of the atoms. The proof takes several paragraphs, but the basic idea is quite simple: Consider all atoms A1, A2, . . . , Ak in K which are less or equal to B. From the properties of the order on K, (see 12.1.8(3)) it follows that Y = A1 ∨ · · · ∨ Ak ≤ B. The main step of the proof is to verify that B∧Y ′ = 0, which by 12.1.8(4) guarantees that B ≤ Y . That proves the equality B = Y . 12.1.14. Three useful claims. 
We derive several technical properties of atoms, in order to complete the proof of the theorem on disjunctive normal form. We retain the notation of the previous subsection. Proposition. (1) If Y, X1, . . . , Xℓ are atoms in K, then Y ≤ X1 ∨ . . . ∨ Xℓ if and only if Y = Xi for some i = 1, . . . , ℓ. (2) For each Y ∈ K, Y ̸= 0, there is an atom X ∈ K such that X ≤ Y . (3) If X1, . . . , Xr are precisely all the atoms of K, then Y = 0 if and only if Y ∧ Xi = 0 for all i = 1, . . . , r. Proof. (1) If the inequality of the proposition holds, then Y ∧ (X1 ∨ · · · ∨ Xℓ) = Y. 1053 p, q (q ̸= 0) that φ(p) = φ(q · p q ) = φ(p)cdotφ(p q ). Hence, φ(p q ) = p q , i. e., φ(r) = r for all rational numbers r. Consider a positive number x ∈ R. Then, φ(x) = φ (√ x 2 ) = φ ( √ x) 2 ≥ ≥ 0. Thus, for any x, y ∈ R such that x < y, we must have φ(x) < φ(y). Now, assume that φ is not the identity, i. e., there exists a z ∈ R such that φ(z) ̸= z. We can assume without loss of generality that φ(z) < z. Since Q is dense in R, there exists an r ∈ Q for which φ(z) < r < z. However, we know that φ(r) = r, which means that r < z implies φ(r) < φ(z). Altogether, we get the wanted contradiction φ(z) < φ(r) < φ(z). □ 12.B.5. Let p be a prime and R a ring which contains p2 elements. Prove that R is commutative. Solution. Since (R, +) is a finite commutative group with p2 elements, it is by 12.4.8 isomorphic to either Zp2 or Zp × Zp. In the first case, (R, +) is cyclic, so there exists an element x ∈ R such that each element of R is of the form nx for some 1 ≤ n ≤ p2 . Since all these elements commute, we get that the entire R is commutative. In the second case, each element (except 0) must have order p with respect to addition. Let x ∈ R be any element that is not in the additive subgroup generated by 1. Then, each element of R is of the form m + nx, where 1 ≤ m, n ≤ p. Again, all these elements commute, so R is commutative. □ 12.B.6. Find the inverses of 17, 18, and 19 in (Z∗ 131, ·) (the group of all invertible elements in Z131 with multiplication). Solution. Applying the Euclidean algorithm, we get 131 = 7 · 17 + 12, 17 = 1 · 12 + 5, 12 = 2 · 5 + 2, 5 = 2 · 2 + 1. Therefore, 1 = 5−2·2 = 5−2(12−2·5) = 5·5−2·12 = 5·(17−12)−2·12 = 5·17−7·12 = 5·17−7·(131−7·17) = 54·17−7·131. The inverse of 17 is 54. Similarly, [18]−1 = 51 and [19]−1 = 69. □ 12.B.7. Find the inverse of [49]Z253 in Z253 ⃝ 12.B.8. Find the inverse of [37]Z208 in Z208. ⃝ 12.B.9. Find the inverse of [57]Z359 in Z359. ⃝ 12.B.10. Find the inverse of [17]Z40 in Z40. ⃝ C. Polynomial rings 12.C.1. Eisenstein’s irreducibility criterion This criterion provides a sufficient condition for a polynomial over Z to be irreducible over Q (which is the same as to be irreducible over Z): Let f(x) = anxn + an−1xn−1 + · · · + a1x + a0 be a polynomial over Z and p be a prime such that • p divides aj, j = 0, . . . , n − 1, CHAPTER 12. ALGEBRAIC STRUCTURES By distributivity, the equality can be rewritten as (Y ∧ X1) ∨ · · · ∨ (Y ∧ Xℓ) = Y. However, for all i either Y ∧ Xi = 0 or Y ∧ Xi = Xi. If all these intersections are 0, then Y = 0. Thus, there is an i for which Y ∧ Xi = Xi. Since Y is also an atom, the desired equality Y = Xi is proved. The other implication is trivial. (2) If Y is an atom itself, choose X = Y . If Y is not an atom, then it follows from the definition that there must exist a non-zero element Z1 ̸= Y for which Z1 ≤ Y . 
If Z1 is not an atom either, then similarly find a Z2 ≤ Z1, etc., leading to a sequence of pairwise distinct elements · · · ≤ Zk ≤ Zk−1 ≤ · · · ≤ Z1 ≤ Y, which cannot be infinite since the entire Boolean algebra K is finite. Therefore, it must end with an atom Zk. (3) Assume that Y ∧ Xi = 0 for all indices i. If Y ̸= 0, then due to the above claim, there must exist an atom Xj for which Xj ∧ Y = Xj, which is a contradiction. The other implication is trivial. □ 12.1.15. Completion of the proof of theorem 12.1.13. Write Y = A1 ∨ · · · ∨ Ak ≤ B, where Ai are all the atoms in K which are less then or equal to B. Compute B ∧ Y ′ = B ∧ (A1 ∨ · · · ∨ Ak)′ = B ∧ A′ 1 ∧ · · · ∧ A′ k. If an atom A = Ai is contained in the join Y , then B ∧ Y ′ ∧ A = 0. However, if A is an atom which does not occur in Y , then also B ∧ Y ′ ∧ A = 0, since Y contains exactly those atoms which are ≤ B. Hence B ∧ Y ′ ∧ A = 0 for all atoms A in K. Thus it is proved that the intersection of B ∧ Y ′ and any atom is zero, which means that it must be zero itself, by the third claim in the latter proposition. Therefore, B ≤ Y (cf. 12.1.8(4)). The definition of Y implies Y ≤ B, so the antisymmetry of the order implies that B = Y . It remains to prove the uniqueness of the expression, up to order. Thus, suppose B can be written in two ways as B = A1 ∨ · · · ∨ Ak = ˜A1 ∨ · · · ∨ ˜Aℓ. Since each Ai satisfies Ai ≤ B, the first claim in the proposition above ensures it must equal one of the ˜Aj. Repeating this argument gives the desired uniqueness and finishes the proof. 12.1.16. Classification. To end the discussion of Boolean algebras, we prove that all the examples of finite Boolean algebras (of given size) are isomorphic. In particular, each of the 22n truth-value functions for n atomic propositions can be expressed as an appropriate proposition, just like each of the 22n switch board functions can be defined in terms of n suitably arranged switches. In both cases, the algebra in question behaves the 1054 • p does not divide an, • p2 does not divide a0. Then, f(x) is irreducible over Z (Q). Prove this criterion. ⃝ 12.C.2. Factorize over C and R the polynomial x4 + 2x3 + 3x2 + 2x + 1. Solution. This polynomial can be factorized either by looking for multiple roots or as a reciprocal equation: • Let us compute the greatest common divisor of the polynomial and its derivative 4x3 + 6x2 + 6x + 2, using the Euclidean algorithm. The greatest common divisor is given in any ring up to a multiple by a unit, and during the Euclidean algorithm, we may multiply the partial results by units of the ring. In the case of a polynomial ring over a field of scalars, the units are exactly the nonzero scalars. We perform the multiplication in the way to avoid calculations with fractions as much as possible. 2x4 + 4x3 + 6x2 + 4x + 2 : 2x3 + 3x2 + 3x + 1 = x + 1 2 2x4 + 3x3 + 3x2 + x x3 + 3x2 + 3x + 2 x3 + 3 2 x2 + 3 2 x + 1 2 3 2 x2 + 3 2 x + 3 2 Further, we divide the polynomial 2x3 + 3x2 + 3x + 1 by the remainder 3 2 x2 + 3 2 x + 3 2 (multiplied by the unit 2 3 ) 2x3 + 3x2 + 3x + 1 : x2 + x + 1 = 2x + 1 2x3 + 2x2 + 2x x2 + x + 1 The roots of the greatest common divisor of the original polynomial and its derivative are exactly the multiple roots of the original polynomial. In this case, the roots of x2 + x + 1 are −1 2 ± i √ 3/2, which are thus double roots of the original polynomial. 
The factorization over C is thus to root factors (this is always the case over C, as stated by the fundamental theorem of algebra): x4 + 2x3 + 3x2 + 2x + 1 = = ( x + 1 2 − i √ 3 2 )2 · ( x + 1 2 + i √ 3 2 )2 . The factorization over R can be obtained by multiplying the factors corresponding to pairs of complex-conjugated roots of the polynomial (verify that such a product must always result in a polynomial with real coefficients!): x4 + 2x3 + 3x2 + 2x + 1 = ( x2 + x + 1 )2 . • Let us solve the equation x4 + 2x3 + 3x2 + 2x + 1 = 0. CHAPTER 12. ALGEBRAIC STRUCTURES same way as the Boolean algebra of all subsets of a given 2n -element set. Moreover, each of these expressions can be written in a unique normal form, so it can be decided algorithmically whether two switch boards have the same behaviour without comparing their values for all 2n possible inputs (which on the other hand might still be faster, in particular the resulting normal formula tends to be exponentially large). Theorem. Every finite Boolean algebra is isomorphic to the Boolean algebra K = 2M where M is the set of atoms in K. Proof. The idea of the proof is quite straightforward. Every isomorphism of a Boolean algebra (K, ∧, ∨) must map atoms to atoms. Let M be the set of all atoms in K and consider the Boolean algebra (2M , ∩, ∪). This defines a natural correspondence between the atoms of K and the atoms of 2M . Next, use the disjunctive normal form to extend the mapping to all of K. Each element X ∈ K can be written uniquely (up to order) as a join of atoms: X = A1 ∨ · · · ∨ Ak Define the function f : K → 2M by f(X) = f(A1) ∪ · · · ∪ f(Ak) = {A1, . . . , Ak}, as the union of the singletons Ai ⊆ M that occur in the ex- pression. The uniqueness of the normal form implies that f is a bijection. It remains to show that it is a homomorphism of the Boolean algebras. Let X, Y ∈ K. The normal form of their supremum contains exactly the atoms which occur in at least one of X, Y ; while the infimum involves just those atoms which occur in both. This verifies that f preserves the operations ∧ and ∨. As for the complements, note that an atom A occurs in the normal form of X′ if and only if X ∧ A = 0. Hence f preserves complements, which finishes the proof. □ The classification of infinite Boolean algebras is far more complicated. It is not the case that each would be isomorphic to the Boolean algebra of all subsets of an appropriate set M. However, every Boolean algebra is isomorphic to a Boolean subalgebra of a Boolean algebra 2M for an appropriate set M. This result is known as Stone’s representation theorem3 . 2. Elements of Logic This section provides a quick introduction to logic. Mastering the calculus of Boolean algebras will become useful, but we have to start from scratch. 3The American mathematician Marshall Harvey Stone (1903 – 1989) proved this theorem in 1936 when dealing with the spectral theory of operators on Hilbert spaces, required for analysis and topology. Nowadays, it belongs to standard material in advanced textbooks. 1055 Dividing by x2 and substituting t = x + 1 x , we get the equation t2 + 2t + 1 = 0 with double root −1. Now, substituting this into the definition of t, we get the known equation x2 + x + 1 = 0, which was solved above. □ Remark. Let us remark that the only irreducible polynomials over R are linear polynomials and quadratic polynomials with negative discriminant. This also follows from the reasonings in the above exercise. 12.C.3. 
Factorize the polynomial x5 +3x3 +3 to irreducible factors over i) Q, ii) Z7. Solution. i) By Eisenstein’s criterion, the given polynomial is irreducible over Z and Q (we use the prime 3). ii) (x − 1)2 (x3 + 2x2 − x + 3). Using Horner’s scheme, for instance, we find the double root 1. When divided by the polynomial (x − 1)2 , we get (x3 + 2x2 − x + 3), which has no roots over Z7. Since it is only of degree 3, this means that it must be irreducible (if it were reducible, one of the factors would have to be linear, which means that the cubic polynomial (x3 +2x2 −x+3) would have a root). □ 12.C.4. Factorize the polynomial x4 + 1 over • Z3, • C, • R. Solution. • (x2 + x + 2)(x2 + 2x + 2) • The roots are the fourth roots of −1, which lie in the complex plane on the unit circle, and their arguments are π/4, π/4 + π/2, π/4 + π, and π/4 + 3π/2 i. e., they are the numbers ± √ 2/2 ± i √ 2/2. Thus, the factorization is ( x − √ 2 2 − i √ 2 2 ) ( x − √ 2 2 + i √ 2 2 ) ( x + √ 2 2 − i √ 2 2 ) ( x − √ 2 2 + i √ 2 2 ) . • Multiplying the root factors of complex-conjugated roots in the factorization over C, we get the factorization over R: ( x2 − √ 2x + 1 ) ( x2 + √ 2x + 1 ) .. □ 12.C.5. Find a polynomial with rational coefficients of the lowest degree possible which has 2007 √ 2 as a root. Solution. P(x) = x2007 −2. Let us show that there is no polynomial of lower degree with root 2007 √ 2: Let Q(x) be a nonzero polynomial of the lowest degree with root 2007 √ 2. Then, deg Q(x) ≤ 2007. Let us divide P(x) by Q(x) with remainder: P(x) = Q(x)·D(x)+R(x), where D(x) is the quotient and R(x) is the remainder, and either deg R(x) < st Q(x) or R(x) = 0. Substituting the number 2007 √ 2 into the last equation, we can see that 2007 √ 2 is also a root of R(x). By the definition of Q(x), this means that R(x) must be the zero polynomial, which means that Q(x) divides P(x). However, P(x) is irreducible (by Eisenstein’s criterion for 2), so its only CHAPTER 12. ALGEBRAIC STRUCTURES 12.2.1. From set algebra to logical propositions. Let us recall that in 12.1.2, we abstracted Boolean algebras from the properties of set operations; in particular, the ∨ operation corresponded to the union of sets. We understand sets as collections of elements with some property. So, if an element x is in the union of sets C and B, then it has ”property C” or ”property B.” Furthermore, if we know that element x does not have ”property C,” then we know for sure that it has ”property B.” In other words, from the fact that element x has ”property C” or ”property B,” but not ”property C,” we can conclude that element x definitely has ”property B.” Symbolically, x ∈ C ∪ B x ̸∈ C x ∈ B , or more succinctly using Boolean operations, C ∨ B C′ B This simple observation can be reformulated as follows: ”From C ∨ B and C′ , it logically follows that B.” For the sake of simplicity, let’s denote A = C′ , which means C = A′ , and rewrite the previous observation as: (1) A′ ∨ B A B and read it as: ”From A′ ∨ B and A, it necessarily implies B.” Next, let’s perform three simple Boolean calculations: A′ ∨ (B′ ∨ A) = (A′ ∨ A) ∨ B = 1 ∨ B = 1, (A′ ∨ B)′ ∨ ( (A′ ∨ B′ )′ ∨ A ) = (A′ ∨ B)′ ∨ ( (A ∧ B) ∨ A′ ) = (A′ ∨ B)′ ∨ (B ∨ A′ ) = 1, (A′ ∨ B)′ ∨ (A′ ∨ C) = (A ∧ B′ ) ∨ A′ ∨ C =(A ∨ A′ ∨ C) ∧ (B′ ∨ A′ ∨ C) = 1 ∧ (A′ ∨ B′ ∨ C) = ( A′ ∨ (B′ ∨ C) ) . Now, we will use the axioms (8) of Boolean algebras, and we obtain the following formulas: (2) A′ ∨ (B′ ∨ A) = 1, A′ ∨ A = 1 (A′ ∨ B)′ ∨ ( (A′ ∨ B′ )′ ∨ A ) = 1, ( A′ ∨ (B′ ∨ C) )′ ∨ ( (A′ ∨ B)′ ∨ (A′ ∨ C) ) = 1 . 
We already noticed the connections between Boolean algebras and propositional logic in paragraph 12.1.5. In this sense, we can see that the left sides of the previous equations express true propositions. Now, let’s transition more thoroughly from the ”Boolean world” to the ”world of logic.” We will do this by replacing uppercase letters with lowercase letters, replacing the symbol ′ ∨ with the symbol →, and changing the ”postfix” symbol ′ to the ”prefix” symbol ¬. So, instead of A′ ∨B, we will write a → b, and instead of A′ , we will write ¬a.” 1056 non-trivial divisor is itself (up to multiplication by a unit of the polynomial ring over Q, i. e. a non-zero rational constant). Thus, we have Q(x) = P(x) up to multiplication by a unit. For instance, the polynomial 1 3 x2007 − 2 3 also satisfies the stated conditions. However, if we require the polynomial to be monic (i. e., with leading coefficient 1), then the only solution is the mentioned polynomial P(x). □ 12.C.6. Find all irreducible polynomials of degree at most 2 over Z3. Solution. By definition, all linear polynomials are irreducible. As for quadratic irreducible polynomials, the easiest way is to simply enumerate them all and leave out the reducible ones, i. e. those which are a product of two linear polynomials. The reducible polynomials are (x + 1)2 = x2 + 2x + 1, (x + 2)2 = x2 + x + 1, (x + 1)(x + 2) = x2 + 2, x2 , x(x + 1) = x2 + x, x(x + 2) = x2 + 2x. (It suffices to consider monic polynomials, since the non-monic can be obtained by multiplication by 2.) The remaining quadratic polynomials over Z3 are irreducible; these are x2 + 2x + 2, x2 + x + 2, x2 + 1. □ 12.C.7. Decide whether the following polynomial is irreducible over Z3; if not, factorize it: x4 + x3 + x + 2. Solution. Evaluating the polynomial at 0, 1, 2, we find that it has no root in Z3. This means that it is either irreducible or a product of two quadratic polynomials. Assume it is reducible. Then, we may assume without loss of generality that it is a product of two monic polynomials (the only other option is that it is a product of two polynomials with leading coefficients equal to 2 – then both can be multiplied by 2 in order to become monic). Thus, let us look for constants a, b, c, d ∈ Z3 so that x4 + x3 + x + 2 = (x2 + ax + b)(x2 + cx + d) = = x4 + (a + c)x3 + (ac + b + d)x2 + + (ad + bc)x + bd. Comparing the coefficients of individual power of x, we get the following system of four equations in four variables: 1 = a + c, 0 = ac + b + d, 1 = ad + bc, 2 = bd. From the last equation, we get that one of the numbers b, d is equal to 1 and the other one to 2. Thanks to symmetry of the system in the pairs (a, b) and (c, d), we can choose b = 1, d = 2. From the second equation, we get ac = 0, i. e., at least one of the numbers a, c is 0. From the first equation, we get that the other one is 1. From the third equation, we get 2a + c = 1, i. e., a = 0, c = 1. Altogether, x4 + x3 + x + 2 = (x2 + 1)(x2 + x + 2). □ CHAPTER 12. ALGEBRAIC STRUCTURES 12.2.2. Classical Propositional Logic. The alphabet of classical propositional logic consists of lowercase Latin letters a, b, c, . . . possibly with subscripts a1, a2, a3, . . ., special symbols ¬, →, and parentheses (, ). These letters are called ”atomic propositions” or ”propositional variables.” We can think of them as declarative statements that can be either true or false. (Truth in this context is understood in the usual vague sense; the concept of truth has not been precisely defined yet.) 
The symbols ¬, → are called ”operators” or ”connectives,” sometimes also ”propositional functors.” What are the propositions? Definition. ”Propositions” are defined by the following rules: (i) Every atomic proposition is a proposition. (ii) If A is a proposition, then ¬A is also a proposition. (iii) If A and B are propositions, then (A → B) is also a proposition. In this definition, uppercase letters are actually parameters that represent any proposition. Example. a, b, ¬a, (a → b) are propositions; A, B, C are parameters that represent arbitrary propositions. A statement created according to rule (ii) will be called negation of statement A. We can read it as ‘not A,’ ‘non-A,’ ‘A is not true,’ and similar expressions. A statement created according to rule (iii) is called an implication, where statement A is called the antecedent of the implication, and statement B is called the consequent of the implication. We can read it as ‘A logically implies B,’ ‘if A, then B,’ ‘A is a sufficient condition for B,’ ‘B is a necessary condition for A,’ ‘A implies B,’ and similar expressions. If there is no risk of misunderstanding, we will omit the outer parentheses in statements created according to rule (iii). The last statement from the previous example will then have the form ¬ ( (b → a) → ¬a ) → (b → a). We consider the alphabet of propositional calculus as primitive concepts. From them, we construct, delineate, and define further concepts using rules. In Kantian terminology, atomic statements are synthetic, while other statements are analytical—they can be broken down into atomic statements. Statements are understood as declarative sentences that are clear enough to decide whether they are true or not. Since ‘truth’ is still a very vague concept, we will use the word ‘validity.’ However, we do not yet know any valid statements. We will now declare as valid statements those that are prescribed by the Boolean expressions on the left sides of equations (2). 1057 12.C.8. For any odd prime p, find all roots of the polynomial P(x) = xp−2 + xp−3 + · · · + x + 2 over the field Zp. Solution. Considering the equality xp−1 − 1 = (x − 1)(P(x) − 1), we can see that all numbers of Zp, except 0 and 1, are roots of P(x) − 1, so they cannot be roots of P(x) + 1. Clearly, 0 is never a root of P(x), and 1 is always a root, which means that it is the only root.. □ 12.C.9. Factorize the polynomial p(x) = x2 + x + 1 in Z5[x] and Z7[x]. Solution. Irreducible in Z5[x]; p(x) = (x − 2)(x − 4) in Z7[x]. □ 12.C.10. Factorize the polynomial p(x) = x6 −x4 −5x2 −3 in C[x], R[x], Q[x], Z[x], Z5[x], Z7[x], knowing that it has a multiple root. Solution. Applying the Euclidean algorithm, we find out that the greatest common divisor of p and its derivative p′ is x2 +1. Dividing the polynomial p(x) twice by this factor, we get p(x) = (x2 + 1)2 (x2 − 3). Clearly, these factors are irreducible in the rings Q[x] and Z[x]. In C[x], we can always factorize a polynomial to linear factors. In this case, it suffices to factorize x2 + 1, which is easy: x2 + 1 = (x + i)(x − 1). The factor x2 − 3 is equal to (x − √ 3)(x + √ 3) even in R[x]. Thus, in C[x], we have p(x) = (x + i)2 (x − i)2 ( x − √ 3 ) ( x + √ 3 ) , while in R[x], we have p(x) = (x2 + 1)2 ( x − √ 3 ) ( x + √ 3 ) . In Z5[x], the polynomial x2 + 1 has roots ±2, and the polynomial x2 − 3 has no roots, which means that p(x) = (x − 2)2 (x + 2)2 (x2 − 3). In Z7[x], neither polynomial has a root, so the factorization to irreducible factors is identical to that in Q[x] and Z[x]. 
p(x) = (x2 + 1)2 (x2 − 3). □ 12.C.11. Knowing that the polynomial p = x6 +x5 +4x4 + 2x3 +5x2 +x+2 has multiple root x = i, factorize it to irreducible polynomials over C[x], R[x], Z2[x], Z5[x], and Z7[x]. Divide the polynomial q = x2 y2 + y2 + xy + x2 y + 2y + 1 by the irreducible factors of p in R[x], and use the result to solve the system of polynomial equations p = q = 0 over C. Solution. p = (x2 +1)2 (x2 +x+2), in Z2: p = x(x+1)5 , in Z5: p = (x−2)2 (x+2)2 (x2 +x+2), in Z7: p = (x2 +1)2 (x+ 4)2 . For the second polynomial, we get q = (y2 +y)(x2 +x+ 2) − y2 (x + 1) + 1 and q = (y2 + y)(x2 + 1) + y(x + 1) + 1. Thus, if x = α is a root of x2 + x + 2, i. e., α = −1 2 ± 1 2 i √ 7, CHAPTER 12. ALGEBRAIC STRUCTURES Axioms of classical propositional calculus Definition. Axioms of classical propositional calculus are: (kvp1) A → (B → A), (kvp2) (A → B) → ( (A → ¬B) → ¬A ) , (kvp3) ( A → (B → C) ) → ( (A → B) → (A → C) ) , (kvp4) ¬¬A → A, where A, B, C denote statements. Axiom (kvp1) can be interpreted as follows: if statement A is valid, then we don’t need to know where it follows from; a valid statement follows from anything. Axiom (kvp2) can be read as follows: if statement B follows from statement A, then if the negation of statement B also follows from statement A, statement A must necessarily not be valid. In other words, an implication is valid if its antecedent is not valid. In other words, a sufficient (but not necessary!) condition for the validity of an implication is the invalidity of the antecedent. Axiom (kvp3) is called the axiom of deduction, and the reason for this name will become clear later. Axiom (kvp3), together with axiom (kvp1), expresses the transitivity of implication. However, this might not be immediately evident (if it is, I apologize to the reader for underestimating them). Therefore, we will also address the transitivity of implication later. Axiom (kvp4) simply states that when it’s not true that statement A is not valid, then statement A is valid. The selection of axioms for classical propositional calculus was arbitrary, and nothing forced us to choose them. However, as we will see later, this selection is very useful. Although it’s not the only possible one. We can now construct statements and know four valid statements. What remains is to clarify how to derive new valid statements from valid ones. For this purpose, we will use the rule (1). Definition. The modus ponens inference rule is of the form: A → B A B. This rule states that if an implication is valid and its antecedent is valid, then its consequent is also valid. In other words, from the implication, we deduce the valid antecedent, and the valid consequent remains. Hence the name of this rule, the method of detachment. We can summarize the first part of our considerations: The propositional logic system consists of three components. Countably infinite alphabets together with finitely many syntactic rules tell us how to construct words and sentences, i.e., statements, from individual letters of the alphabet. Further, a finite number of axioms, i.e., statements that we consider valid. The third component consists of a finite number of inference rules. 1058 then y = 1√ 1+α . If x = β is a root of x2 + 1, i. e., β = ±i, then y = − 1 1+β □ 12.C.12. Factorize the following polynomial to irreducible polynomials in R[x] and in C[x]: 4x5 − 8x4 + 9x3 − 7x2 + 3x − 1. ⃝ 12.C.13. Factorize the following polynomial to irreducible polynomials in R[x] and in C[x]: x5 + 3x4 + 7x3 + 9x2 + 8x + 4. ⃝ 12.C.14. 
Factorize x4 −4x3 +10x2 −12x+9 to irreducible polynomials in R[x] and in C[x]: ⃝ 12.C.15. Decide whether the following polynomial over Z3 is irreducible; if not, factorize it to irreducible polynomials: x5 + x2 + 2x + 1. ⃝ 12.C.16. Decide whether the following polynomial over Z3 is irreducible; if not, factorize it to irreducible polynomials: x4 + 2x3 + 2. ⃝ 12.C.17. Find all monic quadratic irreducible polynomials over Z5. Solution. We write out all monic quadratic polynomials over Z5 and exclude those which are not irreducible, i. e., have a root: x2 ±2, x2 ±x+2, x2 ±2x−2, x2 −x±1, x2 ±2x−1. □ D. Rings of multivariate polynomials 12.D.1. Find the remainder of the polynomial x3 y + x + yz + yz4 with respect to the basis (x2 y + z, y + z) and the orderings 0, ∅ ε < 0. We will write zj = xj + iyj = xj + √ −1yj. Therefore, Xε is given as a subset in R4 by a system of two real equations: Re(z2 1 + z2 2 − ε) = x2 1 + x2 2 − y2 1 − y2 2 − ε = 0, Im(z2 1 + z2 2 − ε) = 2(x1y1 + x2y2) = 0. Thus, we can assume that Xε will be a “two-dimensional surface” in R4 . We will try to imagine it as a surface in R3 in a suitable projection R4 → R3 . For this purpose, we choose the mapping φ+ : (x1, x2, y1, y2) → ( x1, x2, x1y2 − x2y1 √ x2 1 + x2 2 ) Denote by V the subset of R4 which is given by our second equation, i. e., V = {(x1, x2, y1, y2); x1y1 + x2y2 = 0, (x1, x2) ̸= (0, 0)}. The restriction of φ+ to V is invertible, and its inverse ψ+ is given by ψ+ : (u, v, w) → ( u, v, − vw √ u2 + v2 , uw √ u2 + v2 ) . Now, note that ( x1y2 − x2y1 √ x2 1 + x2 2 )2 = y2 1 + y2 2, and hence it follows that φ+(V ∩ Xε ) = Hε = {(u, v, w); u2 + v2 − w2 − |ε| = 0}. Now, we can compose the constructed mappings φε : Xε → V @ > φ+ >> R3 \ {(0, 0, 0)} ⊇ Hε , and for every ε > 0, we get a bijection φε : Xε → Hε . The real part of this variety is the “thinnest circle” on the one-part rotational hyperboloid Hε ; see the picture. CHAPTER 12. ALGEBRAIC STRUCTURES 1. A → B premise 2. (A → B) → ( (B → C) → (A → B) ) (kvp1) 3. (B → C) → (A → B) 1., 2. 4. (B → C) → ( A → (B → C) ) (kvp1) 5. ( A → (B → C) ) → ( (A → B) → (A → C) ) (kvp3) 6. (B → C) → ( (A → B) → (A → C) ) 4., 5., TI. 7. ( (B → C) → ( (A → B) → (A → C) )) → (( (B → C) → (A → B) ) → ( (B → C) → (A → C) )) (kvp3) 8. ( (B → C) → (A → B) ) → ( (B → C) → (A → C) 6., 7. 9. (B → C) → (A → C) 3., 8. 2) Theorem [3]: ⊢ ( A → (B → C) ) → ( B → (A → C) ) . We will show {A → (B → C), B, A} ⊢ C: 1. A → (B → C) premise 2. A premise 3. B → C 1., 2. 4. B premise 5. C 3., 4. Now we use the Deduction Theorem three times to get {A → (B → C), B} ⊢ A → C, {A → (B → C)} ⊢ B → (A → C), ⊢ ( A → (B → C) ) → ( B → (A → C) ) . Semantics of Classical Propositional Logic Classical propositional logic was introduced purely formally and axiomatically. An alphabet and syntax (rules for constructing ”letters,” ”words,” and ”sentences”) were introduced, some statements (axioms) were declared valid, and a rule was specified for deriving other valid statements from valid ones. The idea that statements are or represent ”declarative sentences that can be designated as true or false” and that we can somehow ”read” propositional connectives is entirely unnecessary when creating a propositional logic system. Now we will ”fill the form with content,” show ”what it is about.” More precisely, we will give meanings to statements. Let’s start somewhat broadly: We somehow understand linguistic expressions; they express some ”meaning” for us. 
This meaning helps us determine or identify a fact or reality in the real (or fictional, or virtual) world that the respective linguistic expression denotes. This fact or reality is the ”meaning” of the linguistic expression. Schematically: expression meaning denotes senseexpresses identifies A classic example, mentioned by Gottlob Frege, is the linguistic expression ”Morning Star.” Its sense is the ”brightest star in the morning sky.” Its meaning in the real world is the planet Venus. The sense of another expression, ”Evening Star,” is the ”brightest star in the evening sky.” But its meaning is the same—Venus. 1062 For ε < 0, we can repeat the above reasoning, merely interchanging x and y and the signs in the definition of φ+: φ− : (x1, x2, y1, y2) → ( −y1, −y2, −y1x2 + y2x1 √ y2 1 + y2 2 ) , which changes the inversion ψ− ψ+ : (u, v, w) → ( − vw √ u2 + v2 , uw √ u2 + v2 , −u, −v ) . Now, Hε is again a one-part rotational hyperboloid, but its real part is Xε R = ∅. In the complex case, we can observe that when continuously changing the coefficients, the resulting variety changes only a bit, except for certain “catastrophic” points, where a qualitative leap may occur. This is called the principle of permanence. In the real case, this principle does not hold at all. 12.D.8. The projective extension of the line and the plane. The real projective space P1(R) is defined as the set of all directions in R2 , i. e., its points are one-dimensional subspaces of R2 . The complex projective space P1(C) is defined as the set of all directions in C2 , i. e., its points are one-dimensional subspaces of C2 . Similarly, the points of the real and complex twodimensional projective spaces are defined as directions in R3 and C3 , respectively. CHAPTER 12. ALGEBRAIC STRUCTURES The example shows that different expressions can have the same sense; conversely, we can also encounter expressions that have no sense. Statements in classical propositional logic are represented by various expressions. However, the meaning of a statement will be exactly one of the possibilities ”true” or ”false.” These meanings are symbolically denoted as T and F, sometimes as P and N in Czech, alternatively as 1 and 0. We will use this last notation. Sometimes it will be appropriate to interpret the symbols ”0” and ”1” directly as numbers, between which an order and arithmetic operations are naturally defined. It is reasonable to consider the negation of a statement to be true precisely when the original statement is false, and vice versa. From the modus ponens rule, it follows that if an implication is true, and its antecedent is also true, then the consequent must also be true. Axiom (kvp2) implies that if the antecedent of an implication is false, then the entire implication is true. These considerations allow us to formulate a definition. Definition. An interpretation of the classical propositional logic system is a mapping I from the set of statements to the set {0, 1} such that (i1) I(¬A) = 1 if and only if I(A) = 0, (i2) I(A → B) = 0 if and only if I(A) = 1 and simultaneously I(B) = 0. The set {0, 1} is called the set of truth values. The interpretations of primitive statements ¬A and A → B can be written in tables: A ¬A 0 1 1 0 , A B A → B 0 0 1 0 1 1 1 0 0 1 1 1 . We will use the notation |A| = I(A), which is often found in logic textbooks. Then we can express the interpretation of primitive statements as follows: |¬A| = 1−|¬A|, |A → B| = max{1−|A|, |B|} = max{|¬A|, |B|}. 
From the definition of interpretation, it is clear that the truth value of any statement can be computed for any interpretation of atomic statements by a finite number of subtractions and searching for the greater of the two values. Example. Let |a| = 1, |b| = 0. Then ¬ ( (b → a) → ¬a ) → (b → a) = max { 1 − ¬ ( (b → a) → ¬a ) , |b → a max {|(b → a) → ¬a| , max {1 − |b|, |a|}} = max {max(1 − |b → a| , |¬a|), 1} = 1. This ordered computation is, of course, not very efficient. A much shorter evaluation would be achieved using a truth ta- ble. Definition. 1063 E. Algebraic structures First of all, we practice general properties of operations and we find out what structures the known sets and operations actually are. 12.E.1. Decide about the following sets and operations what algebraic structures they are (groupoid, semigroup (with potential one-sided neutral elements), monoid, group): i) the set of all subsets of the integers with union, ii) the set of positive integers with the greatest common divisor as the binary operation, iii) the set of positive integers with the least common multiple as the binary operation, iv) the set of all 2-by-2 invertible matrices over R with addi- tion, v) the set of all 2-by-2 matrices over R with multiplication, vi) the set of all 2-by-2 matrices over R with subtraction, vii) the set of all 2-by-2 invertible matrices over Z2 with mul- tiplication, viii) the set Z6 with multiplication (modulo 6), ix) the set Z7 with multiplication (modulo 7). Construct the table of the operation for the last-but-two struc- ture. Solution. i) a monoid (the empty set being neutral), ii) a semigroup (with no neutral elements), iii) a monoid (1 being neutral), iv) not even a groupoid (consider A+(−A) for an invertible matrix A), v) a monoid, vi) a groupoid (not associative), vii) a group, viii) a monoid (the class [1] being neutral), ix) a monoid (the class [1] being neutral). The group in vii) consists of the following elements: A = ( 1 0 0 1 ) , B = ( 0 1 1 0 ) , C = ( 1 1 0 1 ) , D = ( 1 1 1 0 ) , E = ( 0 1 1 1 ) , F = ( 1 0 1 1 ) . The table of the matrix multiplication looks as follows: A B C D E F A A B C D E F B B A E F C D C C D A B F E D D C F E A B E E F B A D C F F E D C B A Note that each row and column (disregarding the heading ones) contains each element exactly once (why is it so?). Thus, we do not have to calculate each product and instead we can play “sudoku” as soon as we have filled enough entries of the table. □ CHAPTER 12. ALGEBRAIC STRUCTURES • We say that a statement A is a tautology if, for every interpretation of the statements from which A is composed, |A| = 1. • We say that a statement A is a contradiction if, for every interpretation of the statements from which A is composed, |A| = 0. • We say that a statement A follows from a set of statements {A1, A2, . . . } (finite or countable) if, for an interpretation of the statements A1, A2, . . . such that |A1| = |A2| = · · · = 1, it holds that |A| = 1. • We say that a set of statements {A1, A2, . . . } is semantically consistent if there exists an interpretation such that |A1| = |A2| = · · · = 1. A tautology follows from an empty set of statements. Examples. (1) The statement A → ¬¬A is a tautology: If |A| = 1, then |A → ¬¬A| = max {1 − |A|, |¬¬A|}= max {0, 1 − (1 − 1)} = 1, If |A| = 0, then |A → ¬¬A| = max {1 − |A|, |¬¬A|}= max {1, 1 − (1 − 0)} = 1. The given statement is also a theorem in classical vector calculus. (2) The statements A → B and B → C are semantically consistent. 
We construct a table A B C A → B B → C 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1 and from it, we can see that both statements have an interpretation of 1 in four cases, specifically in rows 1, 2, 4, and 8. (3) The statement A → C follows from the set of statements {A → B, B → C} . From the previous example, we already know that the given set of statements is semantically consistent. In the interpretation table, we only need to focus on the rows where both statements have a truth value of 1. A B C A → B B → C A → C 0 0 0 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 All truth values in the last column of the table are equal to 1. Recall that the statement A → C is also derivable from the set of statements {A → B, B → C} ; this is the transitivity of implication. 1064 12.E.2. Let X be a set and P(X) denote the set of all subsets of X. Decide whether the set P(X) together with each of the following operations forms a groupoid, semigroup, monoid, group and whether the operation is commutative: i) set intersection, ii) set union, iii) set symmetric difference (xor). Solution. If the set X is empty, then P(X) together with any of the mentioned operations is the trivial (1-element) group. Otherwise: i) with the set intersection, the resulting structure is a commutative monoid, ii) with the set union, the resulting structure is a commutative monoid, iii) with the set xor, the resulting structure is a commutative group where the empty set is neutral and each element is self-inverse. A−1 = A. □ 12.E.3. Decide about the following sets and operations what algebraic structures they are (groupoid, semigroup, group), whether they have one-sided or two-sided neutral elements, and whether the operation is commutative: i) the set of all 3-by-3 invertible matrices over R with addi- tion, ii) the set of all 3-by-3 matrices over R with multiplication, iii) the set of all 3-by-3 matrices over R with addition, iv) the set of all 3-by-3 invertible matrices over Z2 with mul- tiplication, v) (Z9, +), vi) (Z9, ·). ⃝ 12.E.4. Decide about the following subsets G of the complex numbers what algebraic structures they form together with multiplication (groupoid, semigroup, group), and whether the operation is commutative: i) G = {a + bi | a, b ∈ Z}, ii) G = {a + bi | a, b ∈ R, a2 + b2 = 1}, iii) G = {a + b · √ 5 | a, b ∈ Q, a2 + b2 ̸= 0}. ⃝ 12.E.5. Decide whether Z together with the operation ♡ forms a groupoid, semigroup, monoid, group, and whether ♡ is commutative, provided it is defined by: i) a♡b = (a, b), ii) a♡b = a|b| , iii) a♡b = 2a + b, iv) a♡b = |a|, v) a♡b = a + b + a · b, vi) a♡b = a + b − a · b, vii) a♡b = a + (−1)a b. ⃝ CHAPTER 12. ALGEBRAIC STRUCTURES There are four possible truth tables for unary logical connectors. One of them is the table for the negation interpretation. We will show that the remaining three possible tables also express interpretations of statements formed using the unary connector ¬ and the binary connector →. A ♠A A → A 0 1 1 1 1 1 , A ♣A ¬(A → A) 0 1 0 1 1 0 , A ♡A A 0 1 0 1 1 1 . The first connector is called unary verum, the second unary falsum, and the third assertion; the symbols used are not stan- dard. Similarly, any truth table for binary logical connectors is satisfied by a statement created using the connectors ¬ and →. We will demonstrate this with at least one example. A B A ↔ B ¬ ( (A → B) → ¬(B → A) ) 0 0 1 1 0 1 0 0 1 0 0 0 1 1 1 1 . You probably know that the connector ↔ B is called equiva- lence. 
Basic Properties of Classical Propositional Logic First, let’s emphasize that it’s essential to distinguish between concepts that are axiomatic, i.e., introduced using axioms and rules of inference, and those that are semantic, i.e., introduced using interpretation. (And let’s also note that the concepts referred to as axiomatic are commonly called syntactic, but that’s misleading because it leads to mixing axioms with syntax in the narrower sense—rules for forming state- ments.) However, we’ve already seen that both the axiomatics and semantics include concepts that are related to each other—a ”theorem” corresponds to a ”tautology,” ”derivation” corresponds to ”entailment.” But this is the case only in classical propositional logic; in other propositional logic systems, it may not hold. We’ve already formulated the basic properties of classical propositional logic, or at least hinted at them. Now let’s summarize them without proofs: Deduction.: The concept of a ”theorem” is defined using the concept of ”derivation” and is thus reducible to this concept. Conversely, the deduction theorem implies that the concept of ”derivation from a finite set of statements” is reducible to the concept of a ”theorem.” The set of theorems coincides with the set of derivable statements. The semantic counterpart of the deduction theorem also holds: Statement A follows from a set of statements {A1, . . . , An−1, An} if and only if statement An → A follows from the set of statements {A1, . . . , An−1}. 1065 12.E.6. In how many ways can we fill the following table so that ({a, b, c}, ⋇) would be a i) groupoid ii) commutative groupoid iii) a groupoid with a neutral element iv) a monoid v) a group ⋇ a b c a c b a b b c Solution. i) 35 ii) 9 iii) 9 iv) 1 v) 0 □ 12.E.7. Find the number of groupoids on a given threeelement set. Solution. Since the set is given, it remains to define the binary operation. In a groupoid, there is no restriction except that the result of the operation must be an element of the underlying set. Thus, for any pair of elements, there are three possibilities for the result. By the product rule, this gives 33·3 = 19683 groupoids. □ 12.E.8. Decide whether the set G = (R ∖ {0} × R) together with the operation △ defined by (x, y)△(u, v) = (xu, xv+y) for all (x, y), (u, v) ∈ G is a groupoid, semigroup, monoid, group, and whether △ is commutative. ⃝ 12.E.9. Let Ω be a set with the multiplication operation defined by ab = a for all a, b in Ω Prove that such multiplication is associative. ⃝ Solution. For a, b, c ∈ Ω we have: a(bc) = ab = a = ac = (ab)c □ 12.E.10. Suppose that abc = e for some a, b, c in a group G. Show that bca = e as well. ⃝ Solution. Set x = bc. Then ax = e i.e. x−1 = a since x is invertible and so xa = (bc)a = e. □ F. Groups We begin with recalling permutations and their properties. We have already met permutations in chapter two, see ??, where we used them to define the determinant of a matrix. 12.F.1. For each of the following conditions, find all permutations π ∈ S7 which satisfy it: i) π4 = (1, 2, 3, 4, 5, 6, 7) CHAPTER 12. ALGEBRAIC STRUCTURES Compactness.: A trivial consequence of the definition of derivation is that a statement is derivable from any set of statements if and only if it is derivable from some of its finite subsets. (This follows from the fact that derivation is a finite sequence.) This also holds for entailment: A statement follows from an infinite set of statements if and only if it follows from some of its finite subsets. 
It should be noted that proving this statement is not trivial. Correctness.: Classical propositional logic is correct, meaning that every theorem is a tautology. ”What can be proven is true.” Completeness.: Classical propositional logic is complete, meaning that every tautology is a theorem. ”What is true can be proven.” Strong Correctness and Completeness.: Classical propositional logic is strongly correct, meaning that whenever a statement is derivable from a set of statements, it follows from it. Classical propositional logic is also strongly complete, meaning that whenever a statement follows from a set of statements, it is derivable from it. Therefore, a statement in classical propositional logic follows from a set of statements if and only if it is derivable from it. Denotational Saturation.: An operator with any truth table can be expressed using the operators ¬ and →; that is, for any n-ary operator defined by a truth table, there exists a statement constructed solely using ¬ and → that has the same truth table. Decidability.: For any (correctly formulated) statement, it can be decided whether it is a tautology or not. For any statement, it can be decided whether it is a theorem or not. This statement follows directly from the definition of interpretation. Every statement is composed of only finitely many parts, that is, only finitely many atomic statements. And then, it can be decided by finite computation or a finite truth table whether the value is 1 for all evaluations of atomic parts. Alternative Formulations of Propositional Logic The primitive operations of propositional logic are, by definition, negation ¬ and implication →. The first is unary, the second is binary. Using them, we introduce several other binary operations on the set of statements, i.e., propositional functors: 1066 ii) π2 = (1, 2, 3) ◦ (4, 5, 6) iii) π2 = (1, 2, 3, 4) ⃝ 12.F.2. Find the signature (parity) of each of the following permutations: i) ( 1 2 3 4 5 6 . . . 3n − 2 3n − 1 3n 2 3 1 5 6 4 . . . 3n − 1 3n 3n − 2 ) , ii) ( 1 2 3 . . . n n + 1 n + 2 . . . 2n 2 4 6 . . . 2n 1 3 . . . 2n − 1 ) . Solution. The parity of a permutation corresponds to the number of transpositions from which it is built or, equivalently, to the number of its inversions, see 2.2.2. The number of inverses can be read easily from the two-row representation of the permutation. For each number of the second row, we count the number of numbers that are less and lie more to the right than the current number. Thus, the first permutation is even (the signature is 1), and in the second case, the signature depends on n and is equal to (−1) n·(n+1) 2 . □ 12.F.3. Find all permutations ρ ∈ S9 such that [ρ ◦ (1, 2, 3)] 2 ◦ [ρ ◦ (2, 3, 4)] 2 = (1, 2, 3, 4). Solution. No such permutation exists, since the left-hand side is always an even permutation, while the right-hand side is an odd one. □ 12.F.4. Find all permutations ρ ∈ S9 such that ρ2 ◦ (1, 2) ◦ ρ2 = (1, 2) ◦ ρ2 ◦ (1, 2). ⃝ 12.F.5. Consider the permutation σ = ( 1 2 3 4 5 6 7 3 6 5 7 1 2 4 ) . Find the order of σ in the group (S7, ◦), the inverse of σ and compute σ2013 . Show that σ does not commute with the transposition τ = (2, 3). Solution. σ = (1, 3, 5) ◦ (2, 6) ◦ (4, 7). Therefore, the order of σ is the least common multiple of the cycle lengths 3, 2, 2, which is 6. Furthermore, σ−1 = (1, 5, 3) ◦ (2, 6) ◦ (4, 7) and σ2013 = (σ3 35)6 ◦ σ3 = σ3 = (2, 6) ◦ (4, 7). Finally, we have σ ◦ τ = (1, 3, 6, 2, 5) ◦ (4, 7), but τ ◦ σ = (1, 2, 6, 3, 5) ◦ (4, 7). □ 12.F.6. 
Find σ−1 and σ2013 , where • (a) σ = ( 1 2 3 4 5 6 7 4 5 7 6 1 2 3 ) in the symmetric group (S7, ◦). • (b) σ = [4]11 in the group (Z× 11, ·). Solution. (a) σ = (1, 4, 6, 2, 5)◦(3, 7), σ−1 = (1, 5, 2, 6, 4)◦ (3, 7), since the order of (1, 4, 6, 2, 5) is 5 and the order of the transposition (3, 7) is 2, we get that the order of σ is the least CHAPTER 12. ALGEBRAIC STRUCTURES statement definition name A ∧ B ¬(A → ¬B) conjunction A ∨ B ¬A → B [inclusive] disjunction A ⊻ B (A ∧ ¬B) ∨ (¬A ∧ B) exclusive disjunction A | B ¬(A ∧ B) Sheffer’s function, incompatibility A ↓ B ¬(A ∨ B) Peirce’s function A ↔ B (A → B) ∧ (B → A) equivalence All the connectors introduced in this way are not strictly part of propositional logic; they are meta-symbols. However, this strict distinction is not significant for our purposes, and we can freely add the symbols ∧, ∨, ⊻, | , ↔, and ↓ to the propositional alphabet. The conjunction A ∧ B can be read as ”A and B” or ”A simultaneously with B” and so on. The disjunction A ∨ B can be read as ”A or B.” The grammatical connector ”or” here has a similar meaning to the Latin ”vel,” indicating possible but not always necessary alternatives, so at least one of the statements A, B is true. Similarly, in the Czech sentence: ”Tonight I will go to the pub or study mathematics.” I can study mathematics with my classmates in the pub. In contrast, the exclusive disjunction A⊻B, which can be read as ”either A or B,” expresses ”or” in an exclusionary sense, meaning that exactly one of the statements A, B is true, as in the Czech sentence: ”Tonight I will go to the pub or somewhere else.” The Czech language has another meaning for the word ”or,” which is that at most one of the options will occur, for example, in the sentence: ”Tonight I will sleep either in Brno or in New York.” This possibility is formalized by Sheffer’s connector, and the statement A | B is suitable to read as ”A is incompatible with B.” Statement A ↓ B can be read as ”neither A nor B,” and the equivalence A ↔ B can be read as ”A if and only if B” or ”A is a necessary and sufficient condition for B.” Furthermore, it should be noted that the exclusive disjunction A ⊻ B is sometimes denoted as A ↔| B, and the negation of implication ¬(A → B) is sometimes called ”inhibition” and denoted as A →| B. Alternative names for the Peirce function are ¯∨, .|., or NOR. The semantics of these propositional connectors can be easily derived from their definitions. Let’s summarize them in a truth table: A B A ∧ B A ∨ B A ⊻ B A | B A ↓ B A ↔ B 0 0 0 0 0 1 1 1 0 1 0 1 1 1 0 0 1 0 0 1 1 1 0 0 1 1 1 1 0 0 0 1 As an exercise, you can verify some ”standard” tautolo- gies: 1067 common multiple of 2 and 5, which is 10, i. e., σ10 = 1. Then, σ2013 = (σ1 0)201 ◦ σ3 = σ3 = (1, 2, 4, 5, 6) ◦ (3, 7) . (b) For the sake of simplicity, we will write only k for the residue class [k]11, k ∈ Z. Then, 45 ≡ 1 (mod 1)1 ⇒ σ−1 = 44 ≡ 3 (mod 1)1 σ2013 = 42013 ≡ 43 ≡ 9 (mod 1)1. □ 12.F.7. Prove that every group whose number of elements is even contains a non-trivial (i. e., different from the identity) element which is self-inverse. Solution. Since each element which is not self-inverse can be paired with its inverse, we see that there are an even number of elements which are not-self inverse. Thus, there remain an even number of elements which are self-inverse, and one of them is the identity, so there must be at least one more such element. □ 12.F.8. Prove that there exists no non-commutative group of order 4. Solution. 
By Lagrange’s theorem (see 12.4.10), the nontrivial elements of a 4-element group are of order 2 or 4. If there is an element of order 4, then the group is cyclic, and thus commutative. So the only remaining case is that there are (besides the identity e) three elements of order 2, call them a, b, c. We are going to show that we must have ab = c: It cannot be that ab = e, since the inverse of a is a itself, and not b. It cannot be that ab = a, since this would mean that b = e, and similarly, it cannot be that ab = b, since this would mean that a = e. Therefore, the only remaining possibility is that indeed ab = c, and it can be shown analogously that the product of any two non-trivial elements, regardless of the order, must be equal to the third one, so this group is commutative, too. Altogether, we have shown that there are exactly two groups of order 4, up to isomorphism. The latter is called the Klein group, and one instance of it is the group Z2 × Z2. □ 12.F.9. Show that there exists no non-commutative group of order 5. Solution. By Lagrange’s theorem (see 12.4.10), the nontrivial elements of a 5-element group are of order 5, so the group must be cyclic, and thus commutative. □ Remark. The same argumentation show that each group of prime order must be cyclic, and thus commutative. In particular, there are neither 2-element nor 3-element noncommutative groups. As we have shown (see 12.F.8), there is even no 4-element non-commutative group. Therefore, the smallest non-commutative group may be of order 6. As we have seen (see 12.E.1(vii)), this is indeed the case. CHAPTER 12. ALGEBRAIC STRUCTURES A ∨ ¬A, ¬(A ∧ ¬A) Law of excluded middle (A ∧ ¬A) → B, ¬A → (A → B) Law of Duns Scotus( (A → B) ∧ A ) → B Modus ponens (¬B → ¬A) → (A → B) Reductio ad absurdum, indirec( (A ∧ ¬B) → (C ∧ ¬C) ) → (A → B) Proof by contradiction( (A → B) ∧ (B → C) ) → (A → C) Transitivity of implication( (A ∧ B) → C ) → ( A → (B → C) ) Deduction We can declare other connectors, such as ¬, ∧, and ∨, as primitive operations alongside →. In this sense, Propositional Calculus can be treated as discussed in Section 11.48. For constructing propositional calculus, one of either ∧ or ∨ alongside ¬ is sufficient, as implication → can be expressed using them: A → B = ¬A ∨ B, or A → B = ¬(A ∧ ¬B). Axioms can then be rewritten using the respective connector. The inference rules will take the following form: A ∨ B ¬A B, A ∨ B ¬B A, or ¬(A ∧ B) A ¬B, ¬(A ∧ B) B ¬A. Each of the Sheffer and Peirce functions can serve as a single propositional connector. Specifically: ¬A = A | A, A → B = A | (B | B), and ¬A = A ↓ A, A → B = ( (A ↓ A) ↓ B ) ↓ ( (A ↓ A) ↓ B ) . The inference rules can then be written as: A | B A B | B, A | B B A | A and A ↓ B A ↓ A B ↓ B. As previously mentioned, Propositional Calculus can be built using other connectors than ¬ and →. However, any combination of these connectors will suffice, as shown in Section 11.48. The construction of Propositional Calculus will only require one of ∧ or ∨ in addition to negation, as implication can be expressed using them. Therefore, the axioms for classical propositional calculus can be either: (kvp1) A → (B → A) (kvp2′ ) (¬A → ¬B) → ( (¬A → B) → A ) (kvp3) ( A → (B → C) ) → ( (A → B) → (A → C) ) Or, a single axiom known as the ”Meredith’s Axiom”: ((( (A → B) → (¬C → ¬D) ) → C ) → E ) → ( (E → A) → (D → A) ) . 
12.F.10. Prove that any group G where each element is self-inverse must be commutative.

Solution. Let a, b ∈ G. Since each of ba, b, a is assumed to be self-inverse (so that, in particular, (ba)(ba) = e), we get

ab = ab((ba)(ba)) = a(bb)aba = (aa)ba = ba. □

12.F.11. Prove that every group G of order 6 is isomorphic to Z_6 or S_3.

Solution. By Lagrange's theorem (see 12.4.10), the non-trivial elements of a 6-element group are of order 2, 3, or 6. If there is an element of order 6, then G is cyclic, and thus isomorphic to Z_6. Therefore, assume from now on that the order of each non-trivial element is 2 or 3. Since an element a of order 3 is not self-inverse (we have a^{-1} = a^2, since a · a^2 = a^3 = e), we get from exercise 12.F.7 that there must be at least one element of order 2. As we are going to show, there must also be an element of order 3. For the sake of contradiction, assume that each element of G is self-inverse, and let a ≠ b be any two elements different from the identity e. The same argument as in 12.F.8 shows that the product ab cannot be any of e, a, b. Thus, H = {e, a, b, ab} is a 4-element subset of G. Thanks to the self-inverseness, we can see that H is closed under the operation, with the possible exception of the products b · a, b · ab, and ab · a. However, we get from the above exercise that G is commutative, so these three products also lie in H, and it follows that H is actually a subgroup of G. However, this contradicts theorem 12.4.10, by which a 6-element group cannot have a 4-element subgroup.

The only remaining case is that there is an element of order 2 (call it a) as well as an element of order 3 (call it b). Then, b^2 is also of order 3 (and different from b), so G contains the four elements e, a, b, b^2. Furthermore, G must also contain ab, ba, ab^2, b^2a, and by the uniqueness of inverses, none of these is equal to e. Moreover, none of these may be equal to any of a, b, b^2 (e.g., if we had a = ab, then multiplication by a^{-1} from the left would yield e = b; the other equalities can be refuted similarly). Since G contains only 6 elements, the set {ab, ba, ab^2, b^2a} has at most two elements. Again, we can have neither ab = ab^2 nor ba = b^2a. If ab = ba, then (ab)^2 = a^2b^2 = b^2 ≠ e and (ab)^3 = a^3b^3 = a ≠ e, so the order of ab is greater than 3, which contradicts our assumption. Therefore, it must be that ab = b^2a and ba = ab^2, so that G is indeed isomorphic to S_3 (a corresponds to a transposition and b to a cycle of length 3). This group can also be viewed as the group of symmetries of an equilateral triangle (a corresponds to a reflection and b to a rotation by 120°), see also 12.4.3. We have discussed all possibilities, so the proof is finished. □

On the other hand, if we consider ¬, →, ∧, and ∨ as primitive connectors (i.e., without defining conjunction and disjunction in terms of implication and negation), we need a more extensive system of axioms. Kleene introduced such axioms in 1967:

(Kkvp1) A → (B → A)
(Kkvp2) (A → (B → C)) → ((A → B) → (A → C))
(Kkvp3) (A ∧ B) → A
(Kkvp4) (A ∧ B) → B
(Kkvp5) A → (B → (A ∧ B))
(Kkvp6) A → (A ∨ B)
(Kkvp7) B → (A ∨ B)
(Kkvp8) (A → C) → ((B → C) → ((A ∨ B) → C))
(Kkvp9) (A → B) → ((A → ¬B) → ¬A)
(Kkvp10) ¬¬A → A

Notably, axioms (Kkvp1), (Kkvp2), (Kkvp9), and (Kkvp10) are the same as axioms (kvp1), (kvp3), (kvp2), and (kvp4).

Alternatives to Classical Propositional Calculus

Classical propositional calculus possesses beautiful properties, particularly decidability and strong completeness and correctness.
This means that in classical propositional calculus, the axiomatic system essentially aligns with the semantics. However, this can also pose a danger: one might blur the line between semantics and axiomatics and perceive all of mathematics as precise formal calculation.

One of the problems with classical propositional calculus is the primitive concept of →. It is meant to correspond to the intuitive notion of implication. However, consider the sentences: "If it is dry in the Sahara, then beer is brewed in the Czech Republic" (which is true due to the truth of the consequent) and "If the Illuminati rule the world, then they caused a tsunami in Sri Lanka" (which is true due to the falsity of the antecedent). These are certainly not exemplary true statements. Constructions like these, which are consequences of axiom (kvp1) and of the theorem ¬A → (A → B), are sometimes referred to as "paradoxes of implication".

Another problem is posed by axiom (kvp4), which is semantically interpreted as saying that if a statement is not false, then it is true. This appears self-evident, but it conceals the unspoken assumption that there are no truth values other than "true" and "false". A statement could have another truth value, such as "possible", or it could have no truth value at all, expressing uncertainty or "I don't know". Accepting this possibility means relinquishing the "view from God's eye". Alternative approaches to propositional calculus may involve "improving implication" or introducing multiple truth values. We explore some significant cases.

(1) Intuitionistic propositional calculus. This arises through a slight modification of classical propositional calculus. Axiom (kvp4) is replaced by a weaker axiom:

(ivp4) A → (¬A → B).

This axiom is weaker in the sense that it is a theorem of classical propositional calculus (which can easily be verified by showing that (ivp4) is a tautology of classical propositional calculus), while (kvp4) is not a theorem of intuitionistic propositional calculus.

12.F.12. Find all commutative groups of order 8 (up to isomorphism). Then, for each of the following groups, decide which of the found ones it is isomorphic to (the operation is always multiplication):
• Z_15^×,
• Z_16^×,
• the quotient of Z_17^× by the subgroup {[1], [−1] = [16]},
• the complex roots of the polynomial z^8 − 1.

Solution. By theorem 12.4.8, every commutative group is a product of cyclic groups. By 12.4.10, their orders divide 8. This means that there are only 3 possibilities: Z_8, Z_2 × Z_4, and Z_2 × Z_2 × Z_2.
• The group Z_15^× contains the residue classes which are coprime to 15. There are φ(15) = (5 − 1)(3 − 1) = 8 of them, so indeed |Z_15^×| = 8. In particular, these are 1, 2, 4, 7, 8, 11, 13, 14. Their orders are either 2 (for 4, 11, 14) or 4 (for 2, 7, 8, 13), which means that Z_15^× is isomorphic to Z_2 × Z_4.
• Z_16^× = {1, 3, 5, 7, 9, 11, 13, 15}. Again, this group contains 8 elements, and their orders are either 2 (for 7, 9, 15) or 4 (for 3, 5, 11, 13), which means that Z_16^× is also isomorphic to Z_2 × Z_4.
• Z_17^× = {±1, ±2, . . . , ±8}. Thus, the quotient Z_17^×/(±1) = {1, 2, . . . , 8} has 8 elements. We can easily calculate that the order of 3 is 8. Therefore, 3 generates the entire group, which means that Z_17^×/(±1) ≅ Z_8.
• The complex roots of the polynomial z^8 − 1 are e^{nπi/4}, where n = 1, 2, . . . , 8. Clearly, these form a cyclic group of order 8, isomorphic to Z_8. □

12.F.13. Let G be a commutative group and denote H = {g ∈ G | g^2 = e}, where e is the identity of G. Prove that H is a subgroup of G.

Solution. Clearly, e ∈ H. If a ∈ H, then we also have a^{-1} ∈ H, because a = a^{-1} (since a^2 = e). Moreover, if a, b ∈ H, then (ab)^2 = a^2 b^2 = e (this is where we use the commutativity of G), which means that ab ∈ H. Thus, H is closed under the operation, and it is indeed a subgroup. □
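The classification in 12.F.12 is easy to cross-check by brute force. The following minimal Python sketch (ours; the function name is arbitrary) computes the order of every element of Z_n^×:

from math import gcd

def unit_orders(n):
    """Order of each element of the group Z_n^x (residues coprime to n)."""
    units = [a for a in range(1, n) if gcd(a, n) == 1]
    orders = {}
    for a in units:
        x, k = a, 1
        while x != 1:
            x = (x * a) % n
            k += 1
        orders[a] = k
    return orders

print(unit_orders(15))   # orders 1, 2, 4 only, so Z_15^x ≅ Z_2 x Z_4
print(unit_orders(16))   # the same multiset of orders, so Z_16^x ≅ Z_2 x Z_4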
This means that anything provable or derivable in intuitionistic propositional calculus can also be proven in classical propositional calculus. All the theorems of classical propositional calculus proven without using axiom (kvp4) are also theorems of intuitionistic propositional calculus. The Deduction Theorem also holds in intuitionistic propositional calculus, since its proof does not use axiom (kvp4).

The connection between classical and intuitionistic propositional calculus is further demonstrated by the following Reflection Theorem: If a statement A is derivable from statements A_1, . . . , A_n in classical propositional calculus, then it is also derivable from the statements ¬¬A_1, . . . , ¬¬A_n in intuitionistic propositional calculus. Classical reasoning is thus mirrored in intuitionistic calculus through double negation. In this sense, intuitionistic propositional calculus is stronger than the classical one.

These observations lead to the conclusion that the → operator has a different meaning in classical and in intuitionistic logic, even though it is denoted by the same symbol. (The meaning of an expression can depend on the context in which it is used.) Therefore, the semantics of intuitionistic logic is different from the classical one. It cannot be defined using truth tables; it is introduced using the concept of possible worlds. However, intuitionistic propositional calculus is just as "beautiful" as the classical one, being strongly correct and complete, compact, and decidable.

(2) Łukasiewicz's three-valued propositional calculus. The set of truth values includes not only 0 and 1 but also the value x, which can be interpreted as "I don't know". The semantics of the connectors is again defined by truth tables. The axioms of this propositional calculus are:

(l3vp1) A → (B → A),
(l3vp2) (A → B) → ((B → C) → (A → C)),
(l3vp3) (¬A → ¬B) → (B → A),
(l3vp4) ((A → ¬A) → A) → A.

The modus ponens rule is used for inference. Łukasiewicz's three-valued propositional calculus is decidable, compact, and strongly correct and complete. However, the Deduction Theorem does not hold in it.

(3) Łukasiewicz's fuzzy propositional calculus. The set of truth values is the closed interval [0, 1]. The values 0 and 1 represent falsehood and truth, just as in classical propositional calculus.

12.F.14. Let GL_n(R) denote the set of all regular n-by-n matrices with real coefficients. Prove that G = GL_2(R) with multiplication is a group, and decide for each of the following subsets H of G whether it is a subgroup of G:
i) H = GL_2(Q),
ii) H = GL_2(Z),
iii) H = {A ∈ GL_2(Z) | |A| = 1},
iv) H = { ( 0 a ; a b ) ∈ G | a, b ∈ Q },
v) H = { ( 1 0 ; a 1 ) ∈ G | a ∈ Z },
vi) H = { ( 1 a ; a 1 ) ∈ G | a ∈ Q },
vii) H = { ( 0 a ; b c ) ∈ G | a, b, c ∈ R },
viii) H = { ( 1 a ; b c ) ∈ G | a, b, c ∈ R }. ⃝

12.F.15. i) Decide whether the set H = {a ∈ R^* | a^2 ∈ Q} is a subgroup of the group (R^*, ·).
ii) Decide whether the set H = {a ∈ R | a^2 ∈ Q} is a subgroup of the group (R, +). ⃝

12.F.16. Find all positive integers m ≠ 5 such that the group Z_m^× is isomorphic to Z_5^×. ⃝

12.F.17. How many cycles of length p (1 < p ≤ n) are there in S_n?

Solution. The elements of the cycle (i.e., the non-fixed points of the permutation) can be selected in (n choose p) ways. Now, without loss of generality, we can proclaim one of the p elements to be the first one in the cycle representation (for instance the least one, if we are working with numbers). This element can be mapped to any of the p − 1 remaining elements, that one to any of the p − 2 remaining elements, etc. Altogether, the product rule yields (n choose p) · (p − 1)! cycles of length p. □
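The counting argument of 12.F.17 can be verified experimentally for small n. A short Python sketch (ours) compares a brute-force count of the p-cycles in S_n with the formula C(n, p) · (p − 1)! just derived:

from itertools import permutations
from math import comb, factorial

def cycle_type(perm):
    """Sorted cycle lengths of a permutation given as a tuple with perm[i] = image of i."""
    seen, lengths = set(), []
    for i in range(len(perm)):
        if i not in seen:
            j, L = i, 0
            while j not in seen:
                seen.add(j)
                j = perm[j]
                L += 1
            lengths.append(L)
    return sorted(lengths)

def count_p_cycles(n, p):
    target = sorted([1] * (n - p) + [p])   # a single p-cycle, rest fixed
    return sum(cycle_type(s) == target for s in permutations(range(n)))

for n, p in [(4, 2), (4, 3), (4, 4), (5, 3)]:
    assert count_p_cycles(n, p) == comb(n, p) * factorial(p - 1)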
12.F.18. Let G be the set of real 3-by-3 matrices with ones on the diagonal and zeros above it. Prove that G with matrix multiplication forms a group, i.e., a subgroup of GL(3, R), and find the center of G (i.e., the subgroup Z(G) = {z ∈ G | ∀g ∈ G : zg = gz}).

Solution. We can either verify all the group axioms or make use of the known fact that GL(3, R) is a group and verify only that G is closed under multiplication and inverses. Clearly, the neutral element (the identity matrix) lies in G. Further (writing the rows of each matrix separated by semicolons),

( 1 0 0 ; a 1 0 ; b c 1 ) ( 1 0 0 ; a1 1 0 ; b1 c1 1 ) = ( 1 0 0 ; a + a1 1 0 ; b + c·a1 + b1 c + c1 1 ) ∈ G,

( 1 0 0 ; a 1 0 ; b c 1 )^{-1} = ( 1 0 0 ; −a 1 0 ; −b + ac −c 1 ) ∈ G.

It follows from the form of the products in G that the center consists precisely of the matrices of the form ( 1 0 0 ; 0 1 0 ; b 0 1 ). □

12.F.19. For any subset X ⊆ G, define its centralizer as C_G(X) = {y ∈ G | xy = yx for all x ∈ X}. Prove that if X ⊆ Y, then C_G(Y) ⊆ C_G(X). Further, prove that X ⊆ C_G(C_G(X)) and C_G(X) = C_G(C_G(C_G(X))).

Solution. The first proposition is clear: the elements of G which commute with everything from Y also commute with everything from X. We have from the definition that C_G(C_G(X)) = {y ∈ G | xy = yx for all x ∈ C_G(X)}, and this is in particular satisfied by the elements y ∈ X, which proves the second proposition. The last statement follows simply from the two above: substituting X := C_G(X) into the second one, we get C_G(X) ⊆ C_G(C_G(C_G(X))), and applying the first one to the second one, we obtain C_G(X) ⊇ C_G(C_G(C_G(X))). □

The values in between can be seen as the degree of certainty about the statement. The interpretation is governed by the rules:

|¬A| = 1 − |A|,
|A → B| = max{1 − |A|, |B|} = max{|¬A|, |B|},
|A ∧ B| = min{|A|, |B|},
|A ∨ B| = max{|A|, |B|}.

The first two rules are the same as in classical propositional calculus. Note that the degree of certainty of a statement cannot be interpreted as its probability, even when the statements are about random events. In such a case, the conjunction ∧ would express the simultaneous occurrence of events; but the degree of certainty is calculated directly from the degrees of certainty of the individual events, while the probability of a simultaneous occurrence also depends on the stochastic dependence or independence of the relevant events.

The axioms of this propositional calculus are:

(lFvp1) A → (B → A),
(lFvp2) (A → B) → ((B → C) → (A → C)),
(lFvp3) (¬A → ¬B) → (B → A),
(lFvp4) ((A → B) → B) → ((B → A) → A).

The modus ponens rule is used for inference. The first three axioms of fuzzy propositional calculus are the same as the axioms of the three-valued propositional calculus. Fuzzy propositional calculus is decidable and compact, but it is not strongly correct or complete. The Deduction Theorem does not hold in it.
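The four interpretation rules are straightforward to express in code. A minimal Python sketch (ours), assuming exactly the rules above; it illustrates that, for intermediate degrees, the law of excluded middle A ∨ ¬A does not reach the degree 1:

# degrees of certainty lie in the interval [0, 1]
def neg(a):      return 1 - a
def imp(a, b):   return max(1 - a, b)
def con(a, b):   return min(a, b)      # conjunction
def dis(a, b):   return max(a, b)      # disjunction

a = 0.3
print(dis(a, neg(a)))    # 0.7: A ∨ ¬A has degree < 1, unlike in the two-valued case
print(imp(a, a))         # 0.7 as well: even A → A is not constantly 1 under these rules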
(4) Paraconsistent four-valued propositional calculus. We previously interpreted the truth value x in Łukasiewicz's three-valued propositional calculus as "I don't know". However, we can also interpret it as meaning that a given statement has no truth value, i.e., it is "neither true nor false". Now, to the three truth values 0, 1, and x, we add a fourth value, denoted g, and interpret it as "simultaneously true and false". This may sound nonsensical at first, but these truth values acquire meaning when "seeking the truth" in the real world. For example, imagine that we have witness statements about a situation that can be expressed by a statement (declarative sentence). If the witnesses agree that the situation occurred, we assign the truth value 1 to the statement. If they agree that the situation did not occur, we assign the value 0. However, the witness statements may differ; in such a case, we assign the truth value g to the statement. And if there is no testimony at all, the statement has the value x. The value x is also referred to as a "truth-value gap", while the value g is called a "truth-value glut".

12.F.20. Suppose that a group G has a non-trivial subgroup H which is contained in every non-trivial subgroup of G. Prove that H is contained in the center of G.

Solution. For each g ∈ G, the centralizer C_G(g) = {x ∈ G | xg = gx} is a non-trivial subgroup, since g ∈ C_G(g) and C_G(e) = G. Thus, the group H is contained in every C_G(g). Therefore, it is contained in their intersection (over all g ∈ G), which is exactly the center of G. □

12.F.21. Let G be a finite group. The conjugation class of a ∈ G is the set Cl(a) = {xax^{-1} | x ∈ G}. Prove that:
i) the set of conjugation classes of all elements of G is a partition of G,
ii) the size of each conjugation class divides the order of G,
iii) if G has only two conjugation classes, then its order is 2.

Solution. (i) Since a = eae^{-1} ∈ Cl(a), the classes cover G, so it suffices to show that for any a, b ∈ G, either Cl(a) = Cl(b) or Cl(a) ∩ Cl(b) = ∅. Thus, assume that the intersection of Cl(a) and Cl(b) is non-empty. Then, by definition, there are x, y ∈ G such that xax^{-1} = yby^{-1}. Multiplying this equality by y^{-1} from the left and by y from the right leads to y^{-1}xax^{-1}y = b. However, (y^{-1}x)^{-1} = x^{-1}y, which means that b is of the form zaz^{-1} for z = y^{-1}x, and thus lies in Cl(a). Analogously, we get a ∈ Cl(b), so both conjugation classes coincide.
(ii) Note that the elements of Cl(a) are in one-to-one correspondence with the cosets of the centralizer C_G(a) = {x ∈ G | xax^{-1} = a}. Indeed, if elements b and c lie in the same coset (i.e., they satisfy b = cz for some z ∈ C_G(a)), then bab^{-1} = cza(cz)^{-1} = czaz^{-1}c^{-1} = czz^{-1}ac^{-1} = cac^{-1}. By 10.2.1, we have |G| = |C_G(a)| · |G/C_G(a)|, which means that |Cl(a)| = |G/C_G(a)| divides |G|.
(iii) The neutral element always forms its own conjugation class Cl(e) = {e}. Therefore, if there are only two conjugation classes, then all the other elements a ≠ e must lie in one class. Thus, its size is |G| − 1, and by (ii), this integer must divide |G|. But then |G| − 1 also divides |G| − (|G| − 1) = 1, which means that |G| = 2. □
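Parts (i) and (ii) of 12.F.21 can be observed concretely in a small group. The following Python sketch (ours) computes the conjugation classes of S_3, with permutations represented as tuples, and checks that their sizes divide the group order:

from itertools import permutations

def compose(p, q):              # (p ∘ q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

G = list(permutations(range(3)))         # the group S_3
classes, remaining = [], set(G)
while remaining:
    a = next(iter(remaining))
    cl = {compose(compose(x, a), inverse(x)) for x in G}
    classes.append(cl)
    remaining -= cl

sizes = sorted(len(c) for c in classes)
print(sizes)                             # [1, 2, 3]: identity, the 3-cycles, the transpositions
assert all(len(G) % s == 0 for s in sizes)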
12.F.22. Let G be a commutative group. Suppose that the order r of an element a ∈ G and the order s of an element b ∈ G are coprime. Prove that the order of ab is rs.

Solution. We have (ab)^{rs} = a^{rs} b^{rs} = (a^r)^s (b^s)^r = e^s e^r = e, so the order is at most rs. For the sake of contradiction, assume that (ab)^q = e for some q < rs. Since q is less than the least common multiple of r and s (recall that r, s are coprime), at least one of them does not divide q. Assume that it is r (the other case can be refuted analogously). Taking the s-th power of the equality (ab)^q = e, we get e = ((ab)^q)^s = (ab)^{qs} = a^{qs} b^{qs} = a^{qs} (b^s)^q = a^{qs} e^q = a^{qs}. Since r does not divide q and is coprime to s, we get that r (the order of a) does not divide qs, but a^{qs} = e, which is a contradiction. □

We can define the interpretation as an ordered pair (I_1, I_2) of interpretations of the atomic statements in Łukasiewicz's three-valued logic, with two witnesses interpreting the statement. Each of them can respond with "yes", "no", or "I don't know". To these pairs of interpretations, we assign truth values in the interpretation I according to the table:

(I_1, I_2)   (0,0)  (0,x)  (x,0)  (x,x)  (0,1)  (1,0)  (x,1)  (1,x)  (1,1)
I              0      0      0      x      g      g      1      1      1

We create the interpretation of negation by taking the interpretations of the negation in I_1 and I_2 (recall that if |A| = x, then |¬A| = x in Łukasiewicz's three-valued logic). This results in the table:

A    ¬A
0     1
x     x
g     g
1     0

Furthermore, on the set of truth values, we introduce a partial ordering in which 0 is the least value, 1 is the greatest one, and the intermediate values x and g are mutually incomparable, and we define the interpretation of the propositional connectives by the rules:

|A → B| = sup{|¬A|, |B|},  |A ∨ B| = sup{|A|, |B|},  |A ∧ B| = inf{|A|, |B|}.

The semantics is thus a straightforward generalization of the semantics of Łukasiewicz's fuzzy propositional calculus. Within this semantics, we define a tautology as a statement which, for any interpretation of the statements it is composed of, has the interpretation 1 or g. Furthermore, we say that a statement A follows from a set of statements {A_1, . . . , A_n} if, for any interpretation of the atomic statements such that the statements A_1, . . . , A_n have interpretations 1 or g, the statement A also has the interpretation 1 or g.

If a statement A has the interpretation x, then the statements A ∨ ¬A and ¬(A ∧ ¬A) also have the interpretation x and are not tautologies. Hence, the Law of Excluded Middle does not hold in the considered paraconsistent logic. If a statement A has the interpretation g and a statement B has the interpretation 1, then the statement A → B has the interpretation 1 and the statement A → ¬B has the interpretation g. This means that in paraconsistent logic, "local inconsistency" can occur (a case where both a statement and its negation follow from another statement). If a statement A has the interpretation g, then the statement A ∧ ¬A also has the interpretation g, and a statement B with the interpretation 0 or x does not follow from it. This means that the Law of Non-Contradiction does not hold in paraconsistent logic, while "global inconsistency" (a case where every statement follows from some set of statements) cannot occur. A propositional logic with "local inconsistencies" but without "global inconsistencies" is interesting because people sometimes hold beliefs that are factually contradictory and yet manage to work with them relatively successfully.

12.F.23. Prove that every finite group G whose order is greater than 2 has a non-trivial automorphism.

Solution. If G is not commutative and a is an element that does not lie in the center, then the conjugation x ↦ axa^{-1} defines a non-trivial automorphism. For a cyclic group of order m, we have, for any n coprime to m, the automorphism x ↦ x^n. If G is commutative, then it is a product of cyclic groups (see 10.1.8). If the order of at least one of the factors is greater than 2, then we can use the above automorphism for cyclic groups. If the order of each factor is 2, then permuting any pair of factors is a non-trivial automorphism. □
12.F.24. Consider the group (Q, +) of the rational numbers with addition and the group (Q^+, ·) of the positive rational numbers with multiplication. Find all homomorphisms (Q, +) → (Q^+, ·).

Solution. There is only one homomorphism, the trivial one. For the sake of contradiction, assume that there exists a non-trivial homomorphism φ, i.e., φ(a) = b ≠ 1 for some a, b ∈ Q. Then, for all n ∈ N, we have b = φ(a) = φ(n · a/n) = φ(a/n)^n. This is a contradiction, since a rational number b ≠ 1 does not have a rational n-th root for every n (cf. ??). □

12.F.25. Let G be the group of matrices of the form ( a 0 ; b a^{-1} ), where a, b ∈ R and a > 0, and let N be the set of matrices of the form ( 1 0 ; b 1 ), where b ∈ R. Show that N is a normal subgroup of G and prove that G/N is isomorphic to R.

Solution. The key to the proof is the formula for multiplication in G:

( a 0 ; b a^{-1} ) ( a1 0 ; b1 a1^{-1} ) = ( a·a1 0 ; b·a1 + a^{-1}·b1 a^{-1}·a1^{-1} ).

Hence we can see that the mapping ( a 0 ; b a^{-1} ) ↦ a is a homomorphism with kernel N. Thus, N is a normal subgroup of G. Moreover, G/N is isomorphic to the multiplicative group R^+, which is isomorphic to the additive group R. □

3. Polynomial rings

The operations of addition and multiplication are fundamental for scalars as well as for vectors, and there are further similar structures. Besides the integers Z, the rational numbers Q, and the complex numbers C, there are polynomials over similar scalars K to be considered. Among others, the abstract algebraic theory can in many aspects be viewed as a straightforward generalization of the divisibility properties of the integers.

12.3.1. Rings and fields. Recall that the integers and all other scalars K have the following properties:

Commutative rings and integral domains

Definition. Let (M, +, ·) be an algebraic structure with two binary operations + and ·. It is a commutative ring if it satisfies
• (a + b) + c = a + (b + c) for all a, b, c ∈ M;
• a + b = b + a for all a, b ∈ M;
• there is an element 0 such that 0 + a = a for all a ∈ M;
• for each a ∈ M, there is a unique element −a ∈ M such that a + (−a) = 0;
• (a · b) · c = a · (b · c) for all a, b, c ∈ M;
• a · b = b · a for all a, b ∈ M;
• there is an element 1 such that 1 · a = a for all a ∈ M;
• a · (b + c) = a · b + a · c for all a, b, c ∈ M.
If, moreover, c · d = 0 implies that c or d is zero, then the ring is called an integral domain.

The first four properties define the algebraic structure of a commutative group (M, +). Groups are considered in more detail in the next part of this chapter. The last property in the list of ring axioms is called distributivity of multiplication over addition. (There are similar axioms for Boolean algebras, where each of the operations is distributive over the other.) A general ring is defined by the same axioms with the commutativity of "·" omitted; if the operation "·" is commutative, the ring is called commutative, otherwise non-commutative. In the sequel, rings are commutative unless otherwise stated. Traditionally, the operation "+" is called addition and the operation "·" multiplication, even if they are not the standard operations on one of the known rings of numbers. In the literature, there are also structures without the assumption of the existence of the identity for multiplication. These are not discussed here, so it is always assumed that a ring has a multiplicative identity, denoted by 1. The identity for addition is denoted by 0.

Fields

A non-trivial ring in which all non-zero elements are invertible with respect to multiplication is called a division ring. If the multiplication is commutative, it is called a field.
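The difference between fields, integral domains and general rings is easy to explore in the residue class rings Z_n. A small Python sketch (ours; the function name is arbitrary) lists the units and the zero divisors:

def analyse(n):
    """Units and zero divisors among the non-zero elements of Z_n."""
    units = [a for a in range(1, n)
             if any(a * b % n == 1 for b in range(1, n))]
    zero_divisors = [a for a in range(1, n)
                     if any(a * b % n == 0 for b in range(1, n))]
    return units, zero_divisors

print(analyse(7))   # all of 1..6 are units, no zero divisors: Z_7 is a field
print(analyse(6))   # units [1, 5], zero divisors [2, 3, 4]: Z_6 is not an integral domain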
12.F.26. Let G be the group, with the operation of matrix multiplication, of 2-by-2 matrices ( a b ; c d ) where ad − bc ≠ 0 and a, b, c, d are integers from Z_3. Show that:
i) |G| = 48;
ii) the matrices with ad − bc = 1 form a subgroup H with |H| = 24.

Solution. i) For the first row (a, b) of a matrix M ∈ G, the values a, b ∈ Z_3 can be arbitrary except for a = 0, b = 0. Hence, there are 3 · 3 − 1 = 8 possibilities for the first row. The second row must not be a multiple of the first one; the multiplication factor can be 0, 1, or 2. Hence, there are 3 · 3 − 3 = 6 possibilities for the second row once the first row is filled. Thus, |G| = 8 · 6 = 48.
ii) The set H of matrices ( a b ; c d ) with ad − bc = 1 forms a normal subgroup of G. Note that for every g ∈ G, the value det g ∈ Z_3 is either 1 or 2; e.g., det g = 2 for g = ( 2 2 ; 1 2 ) and det h = 1 for h = ( 2 1 ; 1 1 ). Hence, det g′ = 2 for all g′ ∈ Hg and, conversely, if det g′ = 2, then g′ ∈ Hg. Multiplication by g maps H bijectively onto Hg. Hence, the index of H in G equals 2, and therefore |H| = |G|/2 = 24. □

12.F.27. If N is a normal subgroup of G and H is any subgroup of G, then NH is also a subgroup of G.

Solution. Let n1·h1 and n2·h2 be two elements of NH. Then:
i) n1h1n2h2 = n1(h1n2h1^{-1})h1h2 = n1n2′h1h2 = n′h′ ∈ NH;
ii) (nh)^{-1} = h^{-1}n^{-1} = (h^{-1}n^{-1}h)h^{-1} = n′h^{-1} ∈ NH. □

12.F.28. Suppose that N and M are two normal subgroups of G with N ∩ M = {e}. Show that nm = mn for any n ∈ N and m ∈ M.

Solution. On the one hand, mnm^{-1}n^{-1} = m(nm^{-1}n^{-1}) = mm′ ∈ M. On the other hand, mnm^{-1}n^{-1} = (mnm^{-1})n^{-1} = n′n^{-1} ∈ N. Hence mnm^{-1}n^{-1} ∈ M ∩ N, so mnm^{-1}n^{-1} = e, and thus mn = nm. □

12.F.29. Show that the intersection of two normal subgroups of G is also a normal subgroup of G.

Solution. It is clear that the intersection of two subgroups of G is also a subgroup of G. Let N and M be normal subgroups of G. Then, for any g ∈ G, any n ∈ N and any m ∈ M, we have gng^{-1} ∈ N and gmg^{-1} ∈ M. Let w ∈ N ∩ M. Since w ∈ N, we have gwg^{-1} ∈ N, and since w ∈ M, we have gwg^{-1} ∈ M. Hence gwg^{-1} ∈ N ∩ M for any g ∈ G, i.e., N ∩ M is a normal subgroup of G. □

12.F.30. Prove that if G is a group and H ⊂ G is a subgroup of index 2, then H is a normal subgroup of G.

Solution. Let H and aH be the left cosets of H in G, and let H and Hb be the right cosets of H in G. Since there are only two cosets, aH = G \ H and Hb = G \ H, whence aH = Hb. In order to show that H is normal in G, we need xH = Hx for any x ∈ G.

Typical examples of fields are the rational numbers Q, the real numbers R, and the complex numbers C. Furthermore, every remainder class set Z_m is a commutative ring, while only the Z_p with prime p are also fields. We shall come back to fields in the next part of this chapter, too.

Recall the useful example of a non-commutative ring: the set Mat_k(K) of all k-by-k matrices over a ring K, k ≥ 2. As can be checked already for K = Z_2 and k = 2, these rings are never integral domains (see 2.1.5 on page 87 for the full argument).

As an example of a division ring which is not a field, consider the ring of quaternions H. This is constructed as an extension of the complex numbers by adding another imaginary unit j, i.e., H = C ⊕ jC ≅ R^4, just as the complex numbers are obtained from the reals. Another "new" element i · j is usually denoted by k. It follows from the construction that ij = −ji. This structure is a division ring. Think out the details as a not completely easy exercise!
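As a hint for that exercise, the multiplication in H can be experimented with numerically. A minimal Python sketch (ours), assuming the usual Hamilton product on quadruples (a, b, c, d) = a + bi + cj + dk:

def qmul(p, q):
    """Hamilton product of two quaternions given as 4-tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
assert qmul(i, j) == k and qmul(j, i) == (0, 0, 0, -1)    # ij = k = -ji

def qinv(q):
    # every non-zero quaternion is invertible: q^{-1} = conjugate(q) / |q|^2
    a, b, c, d = q
    n2 = a*a + b*b + c*c + d*d
    return (a/n2, -b/n2, -c/n2, -d/n2)

q = (1.0, 2.0, -1.0, 0.5)
print(qmul(q, qinv(q)))     # (1.0, 0.0, 0.0, 0.0) up to rounding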
12.3.2. Elementary properties of rings. The following lemma collects properties which all seem obvious for the rings of scalars. But the properties need proofs if an abstract theory is to be built:

Lemma. In every commutative ring K, the following holds:
(1) 0 · c = c · 0 = 0 for all c ∈ K;
(2) −c = (−1) · c = c · (−1) for all c ∈ K;
(3) −(c · d) = (−c) · d = c · (−d) for all c, d ∈ K;
(4) a · (b − c) = a · b − a · c;
(5) the entire ring K collapses to the trivial set {0} = {1} if and only if 0 = 1.

Proof. All of the propositions are direct consequences of the definition axioms. In the first case, for any c, a,

c · a = c · (a + 0) = c · a + c · 0,

and since 0 is the only element that is neutral with respect to addition, c · 0 = 0. In the second case, it suffices to compute

0 = c · 0 = c · (1 + (−1)) = c + c · (−1).

This means that c · (−1) is the additive inverse of c, as desired. The following two propositions are direct consequences of the second one and the axioms. Finally, if the ring contains only one element, then 0 = 1. On the other hand, if 1 = 0, then for any c ∈ K, necessarily c = 1 · c = 0 · c = 0. □

12.3.3. Polynomials over rings. The definition of a commutative ring uses precisely the properties that are expected for multiplication and addition. The concept of polynomials can now be extended. A polynomial is any expression built from (known) constant elements of K and an (unknown) variable using a finite number of additions and multiplications. Formally, polynomials are defined as follows. (It is not by accident that the symbol K is used for the ring: you can imagine, e.g., any of the number rings behind it.)

Let x ∈ G. If x ∈ H, then obviously xH = H = Hx. If x ∈ G \ H = aH = Hb, then there exist h1, h2 ∈ H such that x = ah1 = h2b. Then xH = (ah1)H = aH = Hb = H(h2b) = Hx. Thus xH = Hx for any x ∈ G, i.e., x^{-1}Hx = H. □

12.F.31. Show that the intersection of two normal subgroups of G is a normal subgroup of G.

Solution. This is proved in 12.F.29 above: for normal subgroups N, M of G and any w ∈ N ∩ M, g ∈ G, we have gwg^{-1} ∈ N and gwg^{-1} ∈ M, so gwg^{-1} ∈ N ∩ M. □

12.F.32. If N and M are normal subgroups of G, show that NM is also a normal subgroup of G.

Solution. By 12.F.27, NM is a subgroup. Let nm ∈ NM; then gnmg^{-1} = (gng^{-1})(gmg^{-1}) ∈ NM, since gng^{-1} ∈ N and gmg^{-1} ∈ M. □

12.F.33. Let N be a normal subgroup of a finite group G such that gcd(|G : N|, |N|) = 1. Show that N contains every element x ∈ G satisfying x^{|N|} = e.

Solution. Let x ∈ G be such that x^{|N|} = e. Since gcd(|G : N|, |N|) = 1, there exist m, n ∈ Z such that m|G : N| + n|N| = 1. Then x = x^{m|G:N| + n|N|} = x^{m|G:N|} x^{n|N|} = x^{m|G:N|}, using x^{|N|} = e. Consider the element xN ∈ G/N. Since (xN)^k = x^k N for any k ∈ Z and the order of any element of G/N divides |G/N| = |G : N|, we get (xN)^{|G:N|} = x^{|G:N|} N = N, the identity element of G/N. This means x^{|G:N|} ∈ N, and so x = (x^{|G:N|})^m ∈ N. □

12.F.34. If H is a subgroup of G such that the product of any two right cosets of H in G is again a right coset of H in G, show that H is normal in G.

Solution. Let Ha and Hb be two right cosets of H in G. By the assumption, HaHb is a right coset of H in G. The set HaHb contains the element eaeb = ab. Since there is only one right coset of H in G containing ab, namely Hab, we have HaHb = Hab for all a, b ∈ G. Hence HaH = Ha for all a ∈ G, i.e., h1ah2 equals h3a for some h3 ∈ H. But h1ah2 = h3a implies ah2a^{-1} = h1^{-1}h3 = h′ ∈ H for any h2 ∈ H and a ∈ G. So, H is normal in G. □
12.F.35. Let Z(G) be the center of a group G and suppose that xy = z ∈ Z(G) for some x, y ∈ G. Show that x and y commute.

Solution. As zx = xz, we have xyx = zx = xz = xxy. Multiplying by x^{-1} on the left, we obtain yx = xy. □

Polynomials

Definition. Let K be a commutative ring. A polynomial over K is a finite expression

f(x) = a_0 + a_1 x + · · · + a_k x^k,

where the a_i ∈ K, i = 0, 1, . . . , k, are the coefficients of the polynomial. If a_k ≠ 0, then, by definition, f(x) has degree k, written deg f = k. The zero polynomial is not assigned a degree. Polynomials of degree zero (called constant polynomials) are exactly the non-zero elements of K. Polynomials f(x) and g(x) are equal if they have the same coefficients. The set of all polynomials over a ring K is denoted by K[x].

Every polynomial defines a mapping f : K → K by substituting an argument c for the variable x and evaluating the resulting expression, i.e., f(c) = a_0 + a_1 c + · · · + a_k c^k. Note that the constant polynomials define constant mappings in this manner. A root of a polynomial f(x) is an element c ∈ K for which f(c) = 0 ∈ K.

It may happen that different polynomials define the same mapping. For instance, the polynomial x^2 + x ∈ Z_2[x] defines the mapping which is constantly equal to zero. More generally, for every finite ring K = {a_0, a_1, . . . , a_k}, the polynomial f(x) = (x − a_0)(x − a_1) · · · (x − a_k) defines the constant-zero mapping.

Polynomials f(x) = ∑_i a_i x^i and g(x) = ∑_i b_i x^i can be added and multiplied in a natural way (just think of introducing the structure of a ring again and invoke the expected distributivity of multiplication over addition):

(f + g)(x) = (a_0 + b_0) + (a_1 + b_1) x + · · · + (a_k + b_k) x^k,
(f · g)(x) = a_0 b_0 + (a_0 b_1 + a_1 b_0) x + · · · + (a_0 b_r + a_1 b_{r−1} + · · · + a_r b_0) x^r + · · · + a_k b_ℓ x^{k+ℓ},

where k ≥ ℓ are the degrees of f and g, respectively. Zero coefficients are assumed wherever there is no coefficient in the original expression. (To avoid this formal hassle, a polynomial can alternatively be defined as an infinite expression, like a formal power series over the ring in question, with the condition that only finitely many coefficients are non-zero. Concerning the degree, the zero polynomial is then attributed the degree −∞.)

This definition corresponds to the addition and multiplication of the function values of f, g : K → K, by the properties of the "coefficients" in the original ring K. It follows directly from the definitions that the set K[x] of polynomials over a commutative ring K is again a commutative ring, where the multiplicative identity is the element 1 ∈ K, perceived as a polynomial of degree zero, and the additive identity is the zero polynomial. You should check all the axioms carefully!

12.F.36. Let G be the group of 2-by-2 matrices M = ( a b ; c d ) with ad − bc ≠ 0 and a, b, c, d ∈ R. Let H = { ( 1 0 ; x 1 ) | x ∈ Z } be a subgroup of G and let g = ( 1 0 ; 0 2 ). Show that gHg^{-1} is a proper subset of H, so that H ≠ gHg^{-1} ⊂ H.

Solution. Obviously, the matrix ( 1 0 ; 1 1 ) lies in H. However,

( 1 0 ; 0 2 ) ( 1 0 ; x 1 ) ( 1 0 ; 0 1/2 ) = ( 1 0 ; 2x 2 ) ( 1 0 ; 0 1/2 ) = ( 1 0 ; 2x 1 ).

Thus, gHg^{-1} = { ( 1 0 ; 2x 1 ) | x ∈ Z } does not contain ( 1 0 ; 1 1 ) ∈ H. □
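The addition and multiplication formulas above translate directly into code. A short Python sketch (ours), with polynomials represented by their coefficient lists:

def poly_add(f, g):
    """Add two polynomials given as coefficient lists, f[i] = coefficient of x^i."""
    n = max(len(f), len(g))
    f = f + [0] * (n - len(f))
    g = g + [0] * (n - len(g))
    return [a + b for a, b in zip(f, g)]

def poly_mul(f, g):
    # c_r = sum over i + j = r of a_i * b_j, exactly the formula in the text
    c = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            c[i + j] += a * b
    return c

# (1 + x) * (1 - x) = 1 - x^2
print(poly_mul([1, 1], [1, -1]))    # [1, 0, -1]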
12.F.37. Let G be a group of even order. Prove that it contains a subgroup of order 2.

Solution. Consider the set of pairs {g, g^{-1}}, where g ≠ g^{-1}. These pairs engage an even number of elements of G. The elements g ∈ G that do not occur in such a pair are exactly those satisfying g = g^{-1}, i.e., g^2 = e. Therefore, counting modulo 2, we can ignore the pairs {g, g^{-1}} and obtain |G| ≡ |{g ∈ G | g^2 = e}| (mod 2). One solution of g^2 = e is e itself. If it were the only one, then |G| ≡ 1 (mod 2), which is false. Therefore, some g ≠ e satisfies g^2 = e, which provides an element of order 2 and hence also a subgroup of G of order 2. □

12.F.38. Let G be a finite abelian group whose order n is divisible by a prime p. Show that G contains an element of order p, and hence a cyclic subgroup of that order. This claim is known as the abelian case of Cauchy's theorem.

Solution. We use induction on n. The case n = p is trivial. Let n > p with p | n, and suppose that the proposition is true for all abelian groups of order divisible by p and less than n. Assume no element of G has order p. Then no element has order divisible by p, because if g ∈ G had order r with p | r, then g^{r/p} would have order p. Let G = {g_1, g_2, . . . , g_n} and let g_i have order m_i, so that m_i is not divisible by p. Set m to be the least common multiple of all the m_i, i = 1, 2, . . . , n; then m is not divisible by p and g_i^m = e for all i. Because G is abelian, the function f : (Z/m)^n → G given by

f(a_1, . . . , a_n) = g_1^{a_1} · · · g_n^{a_n}

is a homomorphism, f(a_1, . . . , a_n) f(b_1, . . . , b_n) = f(a_1 + b_1, . . . , a_n + b_n), since, by the commutativity of G, g_1^{a_1} · · · g_n^{a_n} g_1^{b_1} · · · g_n^{b_n} = g_1^{a_1+b_1} · · · g_n^{a_n+b_n}. This homomorphism is surjective: each g_i equals f(a_1, . . . , a_n) with a_i = 1 and a_j = 0 for j ≠ i. Moreover, the set of elements where f takes a particular value is a coset of ker f, so |G| equals the number of cosets of ker f, which is a factor of |(Z/m)^n| = m^n. Since p divides |G| and m^n is not divisible by p, this is a contradiction. □

Lemma. A polynomial ring over an integral domain is again an integral domain.

Proof. The task is to show that K[x] can contain non-trivial divisors of zero only if they lie in K. However, this is clear from the expression for polynomial multiplication: if f(x) and g(x) are polynomials of degrees k and ℓ as above, then the coefficient at x^{k+ℓ} in the product f(x) · g(x) is the product a_k · b_ℓ, which is non-zero unless there are zero divisors in K. □

12.3.4. Multivariate polynomials. Some objects can be described using polynomials in several variables. For instance, consider a circle in the plane R^2 with center S = (x_0, y_0) and radius R. This circle can be defined by the equation

(x − x_0)^2 + (y − y_0)^2 − R^2 = 0.

Rings of polynomials in variables x_1, . . . , x_r can be defined similarly as in the case of K[x]. Instead of the powers x^k of a single variable x, consider the monomials x_1^{k_1} · · · x_r^{k_r} and their formal linear combinations with coefficients a_{k_1···k_r} ∈ K. However, it is simpler, both formally and technically, to define them inductively by

K[x_1, . . . , x_r] := (K[x_1, . . . , x_{r−1}])[x_r].

For instance, K[x, y] = K[x][y]: one can consider polynomials in the variable y over the ring K[x]. It can be shown (check this in detail!) that polynomials in the variables x_1, . . . , x_r can be viewed, even with this definition, as expressions created from the variables x_1, . . . , x_r and the elements of the ring K by a finite number of (formal) additions and multiplications in a commutative ring. For example, the elements of K[x, y] are of the form

f = a_n(x) y^n + a_{n−1}(x) y^{n−1} + · · · + a_0(x)
  = (a_{mn} x^m + · · · + a_{0n}) y^n + · · · + (b_{p0} x^p + · · · + b_{00})
  = c_{00} + c_{10} x + c_{01} y + c_{20} x^2 + c_{11} xy + c_{02} y^2 + · · · .
To simplify the notation, we use multi-index notation (as we did with real polynomials and partial derivatives in infinitesimal analysis).

12.F.39. Let G be a finite non-abelian group whose order n is divisible by a prime p. Show that G contains an element of order p, and hence a cyclic subgroup of that order. This is the non-abelian case of Cauchy's theorem.

Solution. If a proper subgroup H of G has order divisible by p, then by induction there is an element of order p in H, which gives an element of order p in G. Thus, we may assume that no proper subgroup of G has order divisible by p. For any proper subgroup H ⊂ G, we have |G| = |H| · [G : H] and |H| is not divisible by p, so p | [G : H] for every proper subgroup H. Denote the conjugacy class of an element a ∈ G by C(a) = {b ∈ G | b = g^{-1}ag for some g ∈ G}. Let the conjugacy classes in G with size greater than 1 be represented by g_1, g_2, . . . , g_k. The conjugacy classes of size 1 are exactly the elements of the center Z(G). Since the conjugacy classes partition G, counting |G| through the classes yields

|G| = |Z(G)| + ∑_{i=1}^{k} |C(g_i)| = |Z(G)| + ∑_{i=1}^{k} [G : Z(g_i)],

where Z(g_i) is the centralizer of g_i. Every [G : Z(g_i)] > 1, since each |C(g_i)| is greater than 1. Since |G| = |Z(g_i)| · [G : Z(g_i)] for every i, it follows that p | [G : Z(g_i)]. Thus, |Z(G)| is divisible by p, and so the center is non-trivial. Since proper subgroups of G do not have order divisible by p, the center Z(G) has to be all of G. That means G is abelian, which is a contradiction. □

12.F.40. Prove that any group of order 15 is cyclic.

Solution. Let G be a group of order 15. By Cauchy's theorem, there exists a subgroup M of order 5 and a subgroup N of order 3; [G : M] = |G|/|M| = 3. Both M and N are cyclic. Let x be a generator of M and y a generator of N. Let us prove that xy generates the entire G. (NEED TO COME UP WITH AN ELEGANT PROOF..) □

12.F.41. Let G be a group of order 14 which has a normal subgroup N of order 2. Prove that G is commutative.

Solution. Clearly, the order of the group G/N is |G/N| = |G|/|N| = 7. By Lagrange's theorem 12.4.10, the orders of its elements are 1 or 7. Since only the identity has order 1, there is an element of order 7, so the group G/N is cyclic. Let N = {e, n}, where e is the identity of G, and let [a] be a generator of G/N. Since N is normal, we have ana^{-1} ∈ N; but ana^{-1} = e would imply n = e, so we must have ana^{-1} = n, i.e., na = an. Since [a] generates G/N, each element of G/N is of the form [a]^k = [a^k], k = 0, . . . , 6. Then, each element of G is of the form a^k or a^k n, and since a and n commute, we get that actually all elements of G commute. □

Multi-indices

A multi-index α of length r is an r-tuple of non-negative integers (α_1, . . . , α_r). The integer |α| = α_1 + · · · + α_r is called the size of the multi-index α. Monomials are written shortly as x^α instead of x_1^{α_1} x_2^{α_2} · · · x_r^{α_r}.

Polynomials in r variables can then be expressed symbolically in a similar way to univariate polynomials:

f = ∑_{|α|≤n} a_α x^α,   g = ∑_{|β|≤m} b_β x^β ∈ K[x_1, . . . , x_r].

f is said to have total degree n if at least one coefficient with a multi-index α of size n is non-zero, while all the coefficients with multi-indices of larger sizes vanish. Analogous formulae define the addition and multiplication of multivariate polynomials of total degrees m and n, respectively:

f + g = ∑_{|α|≤max(m,n)} (a_α + b_α) x^α,
f · g = ∑_{|γ|≤m+n} ( ∑_{α+β=γ} a_α b_β ) x^γ,

where the multi-indices are added componentwise, and the formally non-existing coefficients are assumed to be zero.
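The multi-index formulae suggest a natural data structure: a dictionary mapping multi-indices to coefficients. A minimal Python sketch (ours):

def madd(f, g):
    """Add polynomials given as dicts {multi-index tuple: coefficient}."""
    h = dict(f)
    for alpha, b in g.items():
        h[alpha] = h.get(alpha, 0) + b
    return {a: c for a, c in h.items() if c != 0}

def mmul(f, g):
    h = {}
    for alpha, a in f.items():
        for beta, b in g.items():
            gamma = tuple(i + j for i, j in zip(alpha, beta))  # componentwise sum
            h[gamma] = h.get(gamma, 0) + a * b
    return {a: c for a, c in h.items() if c != 0}

# (x + y)^2 = x^2 + 2xy + y^2 in K[x, y]
f = {(1, 0): 1, (0, 1): 1}
print(mmul(f, f))    # {(2, 0): 1, (1, 1): 2, (0, 2): 1}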
Clearly, in the case of univariate polynomials, we recover the formula

(fg)(x) = ∑_{i=0}^{k+ℓ} ( ∑_{p+q=i} a_p b_q ) x^i.

Lemma. These formulae describe the addition and multiplication in the inductively defined ring of polynomials in r variables. In particular, in a polynomial ring over an integral domain, the total degree of a sum or difference of two polynomials (if defined) is at most the maximum of their total degrees, while the total degree of a product of two polynomials is the sum of their total degrees.

Proof. The proposition is easily proved by induction on the number of variables. Suppose that the formulae are valid in K[x_1, . . . , x_{r−1}], and calculate the sum of

f = a_k(x_1, . . . , x_{r−1}) x_r^k + · · · + a_0(x_1, . . . , x_{r−1}) = ( ∑_α a_{k,α} x^α ) x_r^k + · · · ,
g = b_l(x_1, . . . , x_{r−1}) x_r^l + · · · + b_0(x_1, . . . , x_{r−1}) = ( ∑_β b_{l,β} x^β ) x_r^l + · · · ,

namely

f + g = ( a_0(x_1, . . . , x_{r−1}) + b_0(x_1, . . . , x_{r−1}) ) + ( a_1(x_1, . . . , x_{r−1}) + b_1(x_1, . . . , x_{r−1}) ) x_r + · · ·

12.F.42. Decide whether the following holds: if the quotient G/N of a group G by a normal subgroup N is commutative, then G itself is commutative. ⃝

12.F.43. Prove that any subgroup H of the symmetric group S_n contains either only even permutations, or the same number of even and odd permutations.

Solution. Consider the homomorphism p : H → Z_2 which maps each permutation to its parity (0 for even, 1 for odd). Then p^{-1}(0) = Ker(p) is a normal subgroup of H: for h ∈ Ker(p),

p(ghg^{-1}) = p(g)p(h)p(g^{-1}) = p(g)p(g^{-1}) = p(gg^{-1}) = p(e) = 0,

which means that ghg^{-1} ∈ Ker(p), i.e., Ker(p) is normal. Since Z_2 has only two elements, it follows that H/Ker(p) has either only one coset (i.e., all permutations in H are even) or two cosets, which must be of equal size (i.e., there are the same number of even and odd permutations). □

12.F.44. Describe the group of symmetries of a regular tetrahedron and find all of its subgroups.

Solution. Let us denote the vertices of the tetrahedron by a, b, c, d. Each symmetry can be described as a permutation of the vertices (telling to which vertex each one goes). Thus, the group of symmetries of the tetrahedron is isomorphic to a certain subgroup of the symmetric group S_4. Given any pair of vertices, there exists a symmetry which swaps this pair and keeps the other two vertices fixed (the reflection with respect to the plane that is perpendicular to the line segment joining the pair and goes through its center). Thus, the wanted subgroup is generated by all transpositions in S_4. However, this is the group S_4 itself.

Thus, let us describe all subgroups of the group S_4. This group has 24 elements, which means that the order of any subgroup must be one of 1, 2, 3, 4, 6, 8, 12, 24 (see 12.4.10). Clearly, the only subgroup of order 1 is the trivial subgroup {id}. Similarly, the only subgroup of order 24 is the entire group S_4. Now, let us look at the remaining orders of a potential subgroup H ⊆ S_4.

(i) |H| = 2. H must consist of the identity and another self-inverse element (x^2 = id). These are the transpositions and the double transpositions (compositions of two independent transpositions). Geometrically, the double transpositions correspond to the rotations by 180° around an axis that goes through the centers of opposite edges. Thus, we get nine subgroups:

{id, (a, b)}, {id, (a, c)}, {id, (a, d)}, {id, (b, c)}, {id, (b, d)}, {id, (c, d)},
{id, (a, b) ∘ (c, d)}, {id, (a, c) ∘ (b, d)}, {id, (a, d) ∘ (b, c)}.
(ii) |H| = 3. By Lagrange's theorem, such a subgroup must be cyclic, i.e., of the form {id, p, p^2} with p^3 = id. Thus, the factorization of p into independent cycles must contain a cycle of length 3, which means that p cannot contain anything else. By 12.F.17, there are 4 · 2 cycles of length 3, which give rise to the following four subgroups:

{id, (a, b, c), (a, c, b)}, {id, (a, c, d), (a, d, c)}, {id, (a, b, d), (a, d, b)}, {id, (b, c, d), (b, d, c)}.

The cycles of length 3 correspond to the rotations by 120° around an axis that goes through a vertex and the center of the opposite side.

(iii) |H| = 4. Such a subgroup must be isomorphic to Z_4 or Z_2 × Z_2. Considering the factorization into independent cycles, we find out that the only permutations of order 4 are the cycles of length 4. Thus, a cyclic subgroup of order 4 must contain a cycle of length 4, namely exactly two of them, since if p has order 4, then p^{-1} = p^3 is also of order 4, i.e., a cycle of length 4. Then, the permutation p^2 has order 2, so it must be a double transposition (it is not a single transposition, since p^2 clearly has no fixed point).

= ( ∑_γ (a_{k,γ} + b_{k,γ}) (x_1, . . . , x_{r−1})^γ ) x_r^k + · · · + ( ∑_γ (a_{0,γ} + b_{0,γ}) (x_1, . . . , x_{r−1})^γ )
= ∑_{(γ,j)} (a_{j,γ} + b_{j,γ}) (x_1, . . . , x_{r−1})^γ x_r^j.

The proof for multiplication is similar (do it yourselves!). □

The definition, or the above formulae, for polynomials over general commutative rings yields the following corollary:

Corollary. If a ring K is an integral domain, then the ring K[x_1, . . . , x_r] is also an integral domain.

Proof. Proceed by induction on the number r of variables. (Alternatively, proceed directly using the multi-index formulae for the product, provided an appropriate ordering on the monomials is defined; see the last part of this chapter.) We know the result for the univariate polynomials already. In particular, the product of non-zero polynomials is again non-zero. If the proposition holds for r − 1 variables, then the result follows from the inductive definition of the polynomials. □

Let us notice that each polynomial f ∈ K[x_1, . . . , x_n] also represents a mapping K × · · · × K → K in its n arguments.

12.3.5. Divisibility and irreducibility. The next goal is to understand how polynomials over a general integral domain can be expressed as products of simpler polynomials. In the case of univariate polynomials, this is related to finding the roots of a polynomial. Since multivariate polynomials can be defined inductively, it suffices to consider univariate polynomials over a general integral domain. This leads to a generalization of the concept of divisibility, which forms the basis of elementary number theory in chapter eleven.

Consider an integral domain K (for instance, the integers Z or the ring Z_p for a prime p).

Divisibility in rings

For a, c ∈ K, we say that a divides c in K if and only if there is some b ∈ K such that a · b = c. This is written a | c. The divisors of 1 (the multiplicative identity), i.e., the invertible elements of K, are called units.

The units of a commutative ring always form a commutative group (in the sense used for the properties of addition in the definition of rings). It is called the group of units of K. The group of units of Z is {−1, 1}, while in a field all non-zero elements are units.

In an integral domain, divisors are determined uniquely: if b = a · c and b ≠ 0, then c is determined by the choice of a and b, since if b = ac = ac′, then 0 = a · (c − c′), and being in an integral domain, a ≠ 0 implies c = c′. The reader should check that this claim fails, e.g., in Z_4.
There are six cycles of length 4 (see 12.F.17), and they pair up to the following three subgroups of this type:

{id, (a, b, c, d), (a, c) ∘ (b, d), (a, d, c, b)},
{id, (a, c, b, d), (a, b) ∘ (c, d), (a, d, b, c)},
{id, (a, b, d, c), (a, d) ∘ (b, c), (a, c, d, b)}.

As for the subgroups isomorphic to Z_2 × Z_2, they must contain (besides the identity) only elements of order 2, which are the transpositions and the double transpositions. By 12.F.43, such a subgroup must contain either no or exactly two transpositions. Moreover, it cannot contain two dependent transpositions, since their composition is a cycle of length 3. Thus, the subgroup contains (besides the identity) either two independent transpositions together with the double transposition which is their composition (this gives rise to three subgroups), or the three double transpositions. Altogether, we have found:

{id, (a, b), (a, b) ∘ (c, d), (c, d)}, {id, (a, c), (a, c) ∘ (b, d), (b, d)}, {id, (a, d), (a, d) ∘ (b, c), (b, c)}

and

{id, (a, b) ∘ (c, d), (a, c) ∘ (b, d), (a, d) ∘ (b, c)}.

(iv) |H| = 6. By 12.F.11, such a subgroup is isomorphic to S_3 (it cannot be isomorphic to Z_6, since there is no element of order 6 in S_4), so it contains (besides the identity) two elements x, x^{-1} of order 3 and three elements of order 2. Thus, x and x^{-1} are cycles of length 3 which fix the same vertex (say a). What are the other three elements? There cannot be a double transposition, since its composition with x yields another cycle of length 3. There cannot be a transposition which does not fix a, since its composition with x yields a cycle of length 4. Thus, the only possibility is that the subgroup contains the three transpositions which also fix a. Since there are four possibilities for the fixed vertex, we obtain four subgroups of order 6.

(v) |H| = 8. The subgroup cannot consist of even permutations only (it would be a subgroup of the group A_4 of even permutations, but there are 12 of those and 8 does not divide 12). Thus, by 12.F.43, H must contain four even and four odd permutations, and the even ones must form a subgroup of A_4. We saw in (iii) that the only such 4-element subgroup is {id, (a, b) ∘ (c, d), (a, c) ∘ (b, d), (a, d) ∘ (b, c)}, which is normal. Considering any odd permutation and the coset (with respect to the above normal subgroup) which contains it, we can see that this coset together with the above 4 elements forms a subgroup of S_4. We thus get three subgroups of S_4. It is not hard to realize that each of them is isomorphic to the group of symmetries of a square (the so-called dihedral group D_4). From the geometrical point of view, we can describe them as follows: consider the orthogonal projection of the tetrahedron onto a plane perpendicular to the line that goes through the centers of two opposite edges. The boundary of this projection is a square. Out of all the symmetries of the tetrahedron, we take only those which induce a symmetry of this square (for instance, we exclude a symmetry which only swaps adjacent vertices of the resulting square). Since there are three pairs of opposite edges in the tetrahedron, we get three 8-element subgroups, isomorphic to the dihedral group D_4.

(vi) |H| = 12. By 12.F.43, such a subgroup contains either only even permutations, or six even and six odd permutations, where the six even permutations would have to form a subgroup of S_4. However, we saw in (iv) that there is no 6-element subgroup of S_4 consisting only of even permutations. Thus, the only possibility is the alternating group A_4 of all even permutations in S_4. From the geometric point of view, these are the so-called direct symmetries, which are realized by rotations (not reflections), and thus can be performed in space. □

Remark. In general, the group of symmetries of a solid with n vertices is a subgroup of the symmetric group S_n.

Just as for the integers, the following propositions are direct corollaries of the definitions (check the details yourself!):

Lemma. Let a, b, c ∈ K. Then,
(1) if a | b and b | c, then a | c;
(2) if a | b and a | c, then a | (αb + βc) for all α, β ∈ K;
(3) a | 0 (since a · 0 = 0; in particular, 0 | 0);
(4) each a ∈ K is divisible by every unit e ∈ K and by its a-multiple a · e (this follows from the existence of e^{-1}).

Unique factorization domains

An element a ∈ K is said to be irreducible if and only if it is divisible only by units e ∈ K and their a-multiples. A ring K is called a unique factorization domain if and only if the following conditions hold:
• For every non-zero element a ∈ K, there are irreducible elements a_1, . . . , a_r ∈ K such that a = a_1 · a_2 · · · a_r. This is called a factorization of a.
• If a = a_1 a_2 · · · a_r = b_1 b_2 · · · b_s are two factorizations of a into irreducible non-unit elements, then r = s and, up to a permutation of the order of the factors b_j, we have a_j = e_j b_j for suitable units e_j, j = 1, . . . , r.
Z is a unique factorization domain. So is every field, since every non-zero element there is a unit. There are also examples of integral domains without the unique factorization property. The construction is similar to polynomials; instead of the powers of the unknown variable x, consider conveniently combined roots of its powers. Let the integral domain K consist of the finite expressions of the form

a_0 + ∑_{i=1}^{k} a_i (x^{m_i})^{1/2^{n_i}},

where a_0, . . . , a_k ∈ Z and m_i, n_i ∈ Z_{>0}, with multiplication and addition defined as for polynomials, assuming the standard behaviour of the rational powers of x. Then, the only units in K are ±1, and all elements with a_0 = 0 are reducible, but the expression x, for example, cannot be expressed as a product of irreducible elements. There are simply very few irreducible elements in K.

12.3.6. Euclidean division and roots of polynomials. The fundamental tool for the discussion of divisibility, common divisors, etc. in the ring of integers Z is the procedure of division with remainder, together with the Euclidean algorithm for the greatest common divisor. These procedures can be generalized. Consider univariate polynomials a_k x^k + · · · + a_0 over a general integral domain K. The monomial a_k x^k is called the leading monomial, and a_k the leading coefficient.

12.F.45. Which subgroups of the group S_4 are normal?

Solution. By definition, a subgroup H ⊆ S_4 is normal iff it is closed under conjugation, i.e., ghg^{-1} ∈ H for any g ∈ S_4, h ∈ H. Since conjugation in symmetric groups only renames the permuted items and preserves the permutation structure (i.e., the cycle lengths in the factorization into independent cycles), we can see that H is normal if and only if it contains either no or all permutations of each type. Examining all the subgroups found in the previous exercise, we conclude that the normal ones are the trivial group {id}, the so-called Klein group (consisting of the identity and the three double transpositions, which we already met in 12.F.8), the alternating group A_4 of all even permutations, and the entire group S_4. □
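The normality criterion of 12.F.45 can be confirmed by a direct computation in S_4. A short Python sketch (ours), with permutations represented as tuples on {0, 1, 2, 3}:

from itertools import permutations

def compose(p, q):                   # (p ∘ q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(4))

def inverse(p):
    inv = [0] * 4
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def parity(p):                       # 0 for even, 1 for odd permutations
    return sum(p[i] > p[j] for i in range(4) for j in range(i + 1, 4)) % 2

S4 = list(permutations(range(4)))

def is_normal(H):
    return all(compose(compose(g, h), inverse(g)) in H for g in S4 for h in H)

e = (0, 1, 2, 3)
klein = {e, (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)}    # the three double transpositions
A4 = {p for p in S4 if parity(p) == 0}

assert is_normal(klein) and is_normal(A4) and is_normal(set(S4))
assert not is_normal({e, (1, 0, 2, 3)})   # a subgroup generated by one transposition is not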
12.F.46. Find the group of symmetries of a cube (describe all symmetries). Is this group commutative?

Solution. The group has 48 elements: 24 of them are rotations (the so-called direct symmetries); the other 24 are compositions of a direct symmetry and a reflection. The group is not commutative (consider the composition of the reflection with respect to the plane containing the centers of four parallel edges and the rotation by 90° around the axis that lies in this plane and goes through the centers of two opposite faces). The group is isomorphic to S_4 × Z_2. □

12.F.47. In the group of symmetries of a cube, find the subgroup generated by the reflection with respect to the plane containing the centers of four parallel edges and the rotation by 180° around the axis that lies in this plane and goes through the centers of two opposite faces. Is this subgroup normal? ⃝

Lemma (An algorithm for division with remainder). Let K be an integral domain and let f, g ∈ K[x] be polynomials, g ≠ 0. Then, there exist an element a ∈ K, a ≠ 0, and polynomials q and r such that af = qg + r, where either r = 0 or deg r < deg g. Moreover, if K is a field, or if the leading coefficient of g is one, then the choice a = 1 can be made, and the polynomials q and r exist and are unique.

Proof. If deg f < deg g or f = 0, then the choice a = 1, q = 0, r = f satisfies all the conditions. If g is constant, set a = g, q = f, r = 0. Continue by induction on the degree of f. Suppose deg f ≥ deg g > 0, and write f = a_0 + · · · + a_n x^n, g = b_0 + · · · + b_m x^m. Either b_m f − a_n x^{n−m} g = 0 or deg(b_m f − a_n x^{n−m} g) < deg f. In the former case, the proof is finished. In the latter case, it follows by the induction hypothesis that there exist a′, q′, r′ satisfying

a′ (b_m f − a_n x^{n−m} g) = q′ g + r′,

where either r′ = 0 or deg r′ < deg g. This means that

a′ b_m f = (q′ + a′ a_n x^{n−m}) g + r′.

If b_m = 1 or K is a field, then the induction hypothesis allows the choice a′ = 1, and then q′, r′ are unique. In this case, b_m f = (q′ + a_n x^{n−m}) g + r′, and if K is a field, this equation can further be multiplied by b_m^{-1}.

For the uniqueness, assume that there is another solution f = q_1 g + r_1. Then, 0 = f − f = (q − q_1) g + (r − r_1), and either r = r_1 or deg(r − r_1) < deg g. In the former case, it follows that q = q_1 as well, since there are no zero divisors in K[x]. In the latter case, let a x^s be the term of the highest degree in q − q_1 ≠ 0 (it must exist). Then, its product with the term of the highest degree in g must be zero (since the term of the highest degree of (q − q_1)g is just the product of the terms of the highest degrees, and there are no other terms of this degree there). However, this means that a = 0. Since a x^s is the non-zero term with the highest degree, q − q_1 contains no non-zero monomials, so it equals zero. But then r = r_1. □
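For K a field, the division algorithm from the proof is easily implemented. A minimal Python sketch (ours), working over Q via exact fractions:

from fractions import Fraction

def poly_divmod(f, g):
    """Divide f by g over a field: f = q*g + r, with f[i] = coefficient of x^i."""
    f = [Fraction(c) for c in f]
    g = [Fraction(c) for c in g]
    q = [Fraction(0)] * max(1, len(f) - len(g) + 1)
    r = f[:]
    while len(r) >= len(g) and any(r):
        # cancel the leading term of r against the leading term of g
        k = len(r) - len(g)
        c = r[-1] / g[-1]
        q[k] = c
        r = [ri - c * (g[i - k] if 0 <= i - k < len(g) else 0)
             for i, ri in enumerate(r)][:-1]
    return q, r

# x^3 - 1 divided by x - 1: quotient x^2 + x + 1, remainder 0
q, r = poly_divmod([-1, 0, 0, 1], [-1, 1])
print(q, r)    # [1, 1, 1] [0]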
It follows that the element b ∈ K is a root of the polynomial f if and only if $(x - b) \mid f$. Since division by a polynomial of degree one decreases the degree of the original polynomial by at least one, the following proposition is proved:

Corollary. Every polynomial $f \in K[x]$ has at most $\deg f$ roots, including multiplicities. In particular, polynomials over an infinite integral domain define the same mapping K → K if and only if they are the same polynomial.

12.F.48. For each of the following permutations, decide whether the subgroup it generates is normal in the corresponding group:
• (1, 2, 3) in S3,
• (1, 2, 3, 4) in S4,
• (1, 2, 3) in A4.
For the last case, find the right cosets of A4 by the considered subgroup. Find all n ≥ 3 for which the subset of all cycles of length n together with the identity is a subgroup of Sn. Show that if this is so, then it is even a normal subgroup.

Solution.
• It generates the alternating group A3, which is a normal subgroup of S3.
• It is not a normal subgroup: $(1, 2) \circ (1, 3) \circ (2, 4) \circ (1, 2) = (1, 4) \circ (2, 3)$, which does not lie in the generated subgroup.
• It is not a normal subgroup. The right cosets are {(1, 2, 4), (2, 4, 3), (1, 3) ◦ (2, 4)}, {(1, 4, 2), (1, 4, 3), (1, 4) ◦ (2, 3)}, {(2, 3, 4), (1, 2) ◦ (3, 4), (1, 3, 4)}, {id, (1, 2, 3), (1, 3, 2)}.
The mentioned subset is a subgroup only for n = 3. In this case, it is the alternating group A3 of all even permutations in S3, which is a normal subgroup. (For greater values of n, we can find two cycles of length n whose composition is neither a cycle of length n nor the identity.) □

12.F.49. Find the subgroup of S6 that is generated by the permutations (1, 2) ◦ (3, 4) ◦ (5, 6), (1, 2, 3, 4), and (5, 6). Is this subgroup normal? If so, describe the set of (two-sided) cosets S6/H.

Solution. First of all, note that all of the generating permutations lie in the subgroup S4 × S2 ⊆ S6 (considering the natural inclusion of S4 × S2, i. e., for s ∈ S4 × S2, the restriction of s to {1, 2, 3, 4} is a permutation of this set and so is the restriction of s to {5, 6}). This means that the group they generate is also a subgroup of S4 × S2. Moreover, since (5, 6) is among the generators, we can see that the subgroup is of the form H × S2, where H ⊆ S4. Thus, it suffices to describe H. This group is generated by the elements (1, 2) ◦ (3, 4) and (1, 2, 3, 4) (the projections of the generators to S4). We have
$(1, 2, 3, 4)^2 = (1, 3) \circ (2, 4)$, $(1, 2, 3, 4)^3 = (4, 3, 2, 1)$, $(1, 2, 3, 4)^4 = \mathrm{id}$,
$[(1, 2) \circ (3, 4)]^2 = \mathrm{id}$,
$[(1, 2) \circ (3, 4)] \circ (1, 2, 3, 4) = (2, 4)$, $(1, 2, 3, 4) \circ [(1, 2) \circ (3, 4)] = (1, 3)$,
$[(1, 2) \circ (3, 4)] \circ (4, 3, 2, 1) = (1, 3)$, $(4, 3, 2, 1) \circ [(1, 2) \circ (3, 4)] = (2, 4)$,
$[(1, 2) \circ (3, 4)] \circ [(1, 3) \circ (2, 4)] = (1, 4) \circ (2, 3)$, $[(1, 3) \circ (2, 4)] \circ [(1, 2) \circ (3, 4)] = (1, 4) \circ (2, 3)$,

Indeed, if two polynomials over an integral domain define the same mapping K → K, then their difference has any element of K as a root. This means that if their difference is not the zero polynomial, then K has at most as many elements as the maximum of the degrees of the polynomials in question.

12.3.7. Multiple roots and derivatives. In continuous modelling, we mostly work over infinite integral domains K, and so we identify the algebraic expressions for the polynomials with the mappings they define.
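The infiniteness assumption in the corollary cannot be dropped. A two-line check over the finite domain Z2 (a hypothetical quick illustration, not from the text):

```python
# Over K = Z_2, the non-zero polynomial x^2 + x defines the zero mapping,
# so the two distinct polynomials x^2 + x and 0 give the same function.
print([(x**2 + x) % 2 for x in (0, 1)])   # [0, 0]
```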
However, the properties of differentiation of polynomials are of a purely algebraic character in general: differentiation of polynomials over the real or complex numbers is an algebraic operation which makes sense over every commutative ring K, and it still satisfies the Leibniz rule:

Derivative of polynomials
Let $f(x) = a_0 + a_1 x + \cdots + a_n x^n$ and $g(x) = b_0 + b_1 x + \cdots + b_m x^m$ be polynomials of degrees n and m over a commutative ring K. The derivative $'\colon f(x) \mapsto f'(x) = a_1 + 2 a_2 x + \cdots + n a_n x^{n-1}$ respects the addition of polynomials and their multiplication by the elements of K. Moreover, it satisfies the Leibniz rule
(1) $(f(x) g(x))' = f'(x) g(x) + f(x) g'(x)$.

While the claim on the additive structure is obvious, let us check the Leibniz rule:
$$f(x) \cdot g(x) = \sum_{k=0}^{m+n} c_k x^k, \qquad c_k = \sum_{i+j=k} a_i b_j,$$
and thus expanding $f'(x) \cdot g(x) + f(x) \cdot g'(x)$ yields exactly the expression for the derivative of the product, $\sum_{k=1}^{m+n} k c_k x^{k-1}$.

In particular, in view of (1), the derivative is not a homomorphism K[x] → K[x] of the ring of polynomials. In a much more general context, the homomorphisms of the additive structure of a ring satisfying the Leibniz rule (1) are called derivations. For polynomial rings, we see inductively that the only derivation with $x' = 1$ is our operation ′.

Differentiation can be exploited easily for discussing multiple roots of polynomials. Consider a polynomial f(x) ∈ K[x] over an integral domain K, with a root c ∈ K of multiplicity k. Thus, in view of the division of polynomials discussed in the previous paragraph 12.3.6, $f(x) = (x - c)^k g(x)$, with the unique polynomial g, $g(c) \neq 0$. Differentiating f(x) and applying the Leibniz rule, we obtain
$$f'(x) = k (x - c)^{k-1} g(x) + (x - c)^k g'(x) = (x - c)^{k-1} \bigl(k g(x) + (x - c) g'(x)\bigr).$$
Clearly, the polynomial $h(x) = k g(x) + (x - c) g'(x)$ does not admit c as a root, i.e. $h(c) = k g(c) \neq 0$ (note that this uses that k is non-zero as an element of K, which holds, e.g., in characteristic zero).

$[(1, 2) \circ (3, 4)] \circ (4, 2) = (1, 2, 3, 4)$, $(1, 3) \circ (4, 2) = (1, 3) \circ (2, 4)$.
Now, we can note that the generating permutations (1, 2, 3, 4) and (1, 2) ◦ (3, 4) are symmetries of a square on the vertices 1, 2, 3, 4. Therefore, they cannot generate more than the 8-element dihedral group D4, and we have already found 8 elements. This means that no more permutations can be obtained by further compositions. Thus, the subgroup H ⊆ S4 has 8 elements (which is possible by Lagrange's theorem, since 8 divides 24):
H = {id, (1, 2, 3, 4), (1, 3) ◦ (2, 4), (4, 3, 2, 1), (1, 2) ◦ (3, 4), (1, 3), (2, 4), (1, 4) ◦ (2, 3)}.
Altogether, the examined subgroup in S6 has 16 elements: for each h ∈ H, it contains (h, id) and (h, (5, 6)). □

12.F.50. Find the subgroup in S4 that is generated by the permutations (1, 2) ◦ (3, 4), (1, 2, 3).

Solution. Since both the generating permutations are even, they can generate only even permutations. Thus, the examined group is a subgroup of the alternating group A4 of all even permutations. We have
$[(1, 2) \circ (3, 4)]^2 = \mathrm{id}$, $(1, 2, 3)^2 = (3, 2, 1)$,
$[(1, 2) \circ (3, 4)] \circ (1, 2, 3) = (2, 4, 3)$, $(1, 2, 3) \circ [(1, 2) \circ (3, 4)] = (1, 3, 4)$,
$[(1, 2) \circ (3, 4)] \circ (3, 2, 1) = (3, 1, 4)$, $(3, 2, 1) \circ [(1, 2) \circ (3, 4)] = (2, 3, 4)$,
and now, we already have eight elements of the examined subgroup of A4. Since A4 has 12 elements and the order of a subgroup must divide that of the group, it is clear that the subgroup is the whole A4. □

12.F.51. Find all subgroups of the group of invertible 2-by-2 matrices over Z2 (with matrix multiplication). Is any of them normal?

Solution. In exercise 12.E.1, we built the table of the operation in this group.
By Lagrange’s theorem (12.4.10), the order of any subgroup must divide the order of the group, which is six. Thus, besides the trivial subgroup {A} and the entire group, each subgroup must have two or three elements. The identity is A. In a 2-element subgroup, the non-trivial element must be self-inverse, which is also sufficient for the subset to be a subgroup. We thus get the subgroups {A, B}, {A, C}, {A, F}, which are not normal, as can be easily verified. Since B, C, F have order 2, they cannot lie in a 3-element subgroup. Thus, the only remaining possibility is P = {A, D, E}, which is indeed a subgroup. Moreover, checking the conjugations BDB = E, CDC = E, FDF = E (whence it follows that BEB = D, CEC = D, FEF = D), we find out that this subgroup is normal. □

On the contrary, if c is a joint root of f and its derivative, then, reading the previous computation backwards (with k = 1), we see that c is also a root of g(x), and thus a root of f with multiplicity at least two. Inductively, we arrive at the following very useful claim:

Proposition. A polynomial f(x) over an integral domain K admits a root c ∈ K of multiplicity k if and only if c is a root of f′(x) of multiplicity k − 1.

12.3.8. The greatest common divisor. Consider a polynomial ring K[x] over an integral domain K. A polynomial h ∈ K[x] is called the greatest common divisor of polynomials f and g ∈ K[x] if and only if the following hold:
• h|f and h|g,
• for any k, if both k|f and k|g, then k|h.
As a direct corollary of the existence of an algorithm for unique division with remainder, there is the very important Bezout's identity (it is proved using the Euclidean division similarly as in the case of the integers in Chapter 11).

Theorem. Let K be a field and f, g ∈ K[x]. Then, there exists a greatest common divisor h of the polynomials f and g. The polynomial h is unique up to a multiple by a non-zero scalar. In addition, there exist polynomials A, B ∈ K[x] such that h = Af + Bg.

Proof. The polynomials h, A, B can be constructed directly using the Euclidean algorithm. Continue dividing with remainder (since K is a field, there is always a unique way to do this; see the above lemma 12.3.6):
$$f = q_1 g + r_1,$$
$$g = q_2 r_1 + r_2,$$
$$r_1 = q_3 r_2 + r_3,$$
$$\vdots$$
$$r_{p-1} = q_{p+1} r_p + 0.$$
In this procedure, the degrees of the polynomials $r_i$ are strictly decreasing; hence the equality from the last line must occur (for some p), and this says that $r_p \mid r_{p-1}$. It follows from the line above that $r_p \mid r_{p-2}$, etc. Continue sequentially up to the first and second lines, to obtain $r_p \mid g$ and $r_p \mid f$. If h|f and h|g, then the same equalities imply that h divides all the $r_i$; in particular, it divides $r_p$. In this way, the greatest common divisor $h = r_p$ of the polynomials f and g is obtained. Substitute upwards, starting with the last equation:
$$h = r_p = r_{p-2} - q_p r_{p-1}$$
$$= r_{p-2} - q_p (r_{p-3} - q_{p-1} r_{p-2})$$
$$= -q_p r_{p-3} + (1 + q_{p-1} q_p) r_{p-2}$$
$$= -q_p r_{p-3} + (1 + q_p q_{p-1})(r_{p-4} - q_{p-2} r_{p-3})$$
$$\vdots$$

12.F.52. Find all subgroups of the group (Z10, +).

Solution. The subgroups are isomorphic to (Zd, +), where d|10, i. e., $\{0\} \cong \mathbb{Z}_1$, $\{0, 5\} \cong \mathbb{Z}_2$, $\{0, 2, 4, 6, 8\} \cong \mathbb{Z}_5$, and $\mathbb{Z}_{10}$. □

12.F.53. Find the orders of the elements 2, 4, 5 in $(\mathbb{Z}_{35}^{\times}, \cdot)$ and in $(\mathbb{Z}_{35}, +)$.

Solution. By definition, the order of x in the group $(\mathbb{Z}_{35}^{\times}, \cdot)$ is the least positive integer k such that $x^k \equiv 1 \pmod{35}$. By Euler's theorem, the orders of x = 2 and of x = 4 divide φ(35) = 24. Computing the corresponding modular powers, we find out that the order of x = 2 is 12.
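Such modular orders are easy to check mechanically; a small Python sketch under the obvious conventions (the helper name mult_order is ours):

```python
from math import gcd

def mult_order(x, n):
    # least k >= 1 with x^k == 1 (mod n); x must be invertible mod n
    assert gcd(x, n) == 1, "x must be invertible modulo n"
    k, power = 1, x % n
    while power != 1:
        power = power * x % n
        k += 1
    return k

print(mult_order(2, 35))   # 12
print(mult_order(4, 35))   # 6
print(35 // gcd(35, 5))    # 7 -- the additive order of 5 in (Z_35, +)
```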
Hence it immediately follows that the order of x = 4 is 6. The number x = 5 does not lie in the group $(\mathbb{Z}_{35}^{\times}, \cdot)$. Specifically, we have (modulo 35):
$$\begin{array}{c|cccccccccccc} & x & x^2 & x^3 & x^4 & x^5 & x^6 & x^7 & x^8 & x^9 & x^{10} & x^{11} & x^{12} \\ \hline 2 & 2 & 4 & 8 & 16 & 32 & 29 & 23 & 11 & 22 & 9 & 18 & 1 \\ 4 & 4 & 16 & 29 & 11 & 9 & 1 & & & & & & \end{array}$$
In the group $(\mathbb{Z}_{35}, +)$, the order of x is the least positive integer k such that $k \cdot x \equiv 0 \pmod{35}$. This can be calculated simply as $k = 35/(35, x)$. Therefore, the order of 2 and of 4 is 35, while the order of 5 is 7. □

12.F.54. Find all finite subgroups of the group $(\mathbb{R}^{*}, \cdot)$.¹

Solution. If a given subgroup of the group $(\mathbb{R}^{*}, \cdot)$ contains an element a with $|a| \neq 1$, then the elements $a, a^2, a^3, \dots$ form an infinite geometric progression of pairwise distinct elements, all of which must lie in the considered subgroup, so it is infinite. Thus, a finite subgroup may contain only the numbers 1 and −1, which means that there are two finite subgroups: {1} and {−1, 1}. □

12.F.55. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism, and if so, find its kernel. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Z}_4 \times \mathbb{Z}_3 \to \mathbb{Z}_{12}$, $\varphi(([a]_4, [b]_3)) = [a - b]_{12}$,
ii) $\varphi: \mathbb{Z}_4 \times \mathbb{Z}_3 \to \mathbb{Z}_{12}$, $\varphi(([a]_4, [b]_3)) = [6a + 4b]_{12}$,
iii) $\varphi: \mathbb{Z}_4 \times \mathbb{Z}_3 \to \mathbb{Z}_{12}$, $\varphi(([a]_4, [b]_3)) = [0]_{12}$.

Solution. i) Not a mapping. For instance, if we take the two representatives $([6]_4, [1]_3) = ([2]_4, [1]_3)$ of the same element of $\mathbb{Z}_4 \times \mathbb{Z}_3$, then we get $\varphi(([6]_4, [1]_3)) = [5]_{12}$ but $\varphi(([2]_4, [1]_3)) = [1]_{12}$, so this is not a correct definition of a mapping.
ii) A homomorphism, neither injective nor surjective. Its kernel Ker(φ) is the set $\{([2]_4, [0]_3), ([0]_4, [0]_3)\}$.
iii) A homomorphism, neither injective nor surjective. Its kernel is the entire group $\mathbb{Z}_4 \times \mathbb{Z}_3$. □

¹ The group of all invertible elements of R and C is denoted by $\mathbb{R}^{*}$ and $\mathbb{C}^{*}$, respectively, and by $\mathbb{Z}_n^{\times}$ for $\mathbb{Z}_n$.

$$\cdots = Af + Bg. \qquad \square$$

12.3.9. Unique factorization. Now follows a very useful (and elegant) statement, the proof of which is straightforward, yet it requires many technical details (and it also concerns the so-called field of rational functions). It is recommended to read the following paragraph carefully. Then maybe, at the first reading, skip the technical lemmas of the proof.

Theorem. Let K be a unique factorization domain. Then, K[x] is also a unique factorization domain.

Proof. The idea of the proof is very simple. Consider a polynomial $f \in K[x]$. If f is reducible, then $f = f_1 \cdot f_2$, where neither of the polynomials $f_1, f_2 \in K[x]$ is a unit. Moreover, assume for a while that if f is divisible by an irreducible polynomial h, then so is $f_1$ or $f_2$. If this is always the case, this procedure can be applied step by step to reach a unique factorization. If $f_1$ is further reducible, then $f_1 = g_1 \cdot g_2$, where $g_1, g_2$ are not units, and either both the polynomials $g_1$ and $g_2$ have degree less than that of f, or the number of irreducible factors in the leading coefficients of $g_1$ and $g_2$ decreases (for instance, over the integers Z, $2x^2 + 2x + 2 = 2(x^2 + x + 1)$). After a finite number of steps, a factorization $f = f_1 \cdots f_r$ is obtained, where the polynomials $f_1, \dots, f_r$ are irreducible. It follows from the additional assumption that every irreducible polynomial h which divides f also divides one of $f_1, \dots, f_r$. Therefore, for every other factorization $f = f'_1 f'_2 \cdots f'_s$, each factor $f_i$ divides one of the $f'_j$, and in this case, $f'_j = e f_i$ for an appropriate unit e.
Cancel such pairs step by step, to conclude that r = s and that the individual factors differ only by unit multiples. □

To conclude, we still have to deal with the assumption made in the first paragraph of the above proof. This will require some technical preparation, and we shall come back to it soon.

The direct consequence of the latter theorem for multivariate polynomials can be formulated (due to their inductive definition):

Corollary. Let K be a unique factorization domain. Then, $K[x_1, \dots, x_r]$ is also a unique factorization domain.

Every polynomial over a unique factorization domain can be factored in a similar way to the case of polynomials with real or complex coefficients. In particular, this holds for polynomials over every field of scalars.

12.3.10. Fields of fractions. When dealing with integer calculations, it is often more advantageous to work with rational numbers and verify only at the end of the procedure that the result is an integer. This method is useful in the case of polynomials, too.

12.F.56. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism, and if so, find its kernel. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Z}_4 \to \mathbb{C}^{*}$, $\varphi([a]_4) = i^a$,
ii) $\varphi: \mathbb{Z}_5 \to \mathbb{C}^{*}$, $\varphi([a]_5) = i^a$,
iii) $\varphi: \mathbb{Z}_4 \to \mathbb{C}^{*}$, $\varphi([a]_4) = (-1)^a$,
iv) $\varphi: \mathbb{Z} \to \mathbb{C}^{*}$, $\varphi(a) = i^a$.

Solution. i) We have $\varphi([a]_4 + [b]_4) = i^{a+b} = i^a \cdot i^b = \varphi([a]) \cdot \varphi([b])$, and $\varphi([4]) = i^4 = 1$, which means that if $[c]_4 = [d]_4$, i. e., $c = d + 4k$, $k \in \mathbb{Z}$, then $\varphi([c]_4) = i^c = i^{d+4k} = i^d = \varphi([d]_4)$, so the mapping is a well-defined homomorphism. It is injective (which is equivalent to saying that $\mathrm{Ker}(\varphi) = \{[0]_4\}$), but it is clearly not surjective.
ii) Not a mapping, since we have $[0]_5 = [5]_5$ and $\varphi([0]_5) = i^0 = 1$, but $\varphi([5]_5) = i^5 = i$.
iii) A homomorphism, neither injective (we have $\varphi([1]_4) = -1 = (-1)^3 = \varphi([3]_4)$) nor surjective. The kernel is $\mathrm{Ker}(\varphi) = \{[0]_4, [2]_4\}$.
iv) A homomorphism, neither injective nor surjective. The kernel is $\mathrm{Ker}(\varphi) = 4\mathbb{Z} = \{4k \mid k \in \mathbb{Z}\}$. □

12.F.57. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Q}^{*} \to \mathbb{Q}^{*}$, $\varphi\left(\frac{p}{q}\right) = \frac{q}{p}$,
ii) $\varphi: \mathbb{Q}^{*} \to \mathbb{Q}^{*}$, $\varphi\left(\frac{p}{q}\right) = \frac{p^2}{q^2}$,
iii) $\varphi: \mathbb{Q}^{*} \to \mathbb{Q}^{*}$, $\varphi\left(\frac{p}{q}\right) = \frac{p^2 + q^2}{pq}$. ⃝

12.F.58. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{C} \to \mathbb{R}$, $\varphi(a + bi) = a + b$,
ii) $\varphi: \mathbb{C} \to \mathbb{R}$, $\varphi(a + bi) = a$,
iii) $\varphi: \mathbb{C}^{*} \to \mathbb{R}^{*}$, $\varphi(a + bi) = a^2 + b^2$,
iv) $\varphi: \mathbb{C}^{*} \to \mathbb{R}^{*}$, $\varphi(c) = 2|c|$,
v) $\varphi: \mathbb{C}^{*} \to \mathbb{R}^{*}$, $\varphi(c) = |c|^3$,
vi) $\varphi: \mathbb{C}^{*} \to \mathbb{R}^{*}$, $\varphi(c) = 1/|c|$. ⃝

12.F.59. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathrm{GL}_2(\mathbb{R}) \to \mathbb{R}^{*}$, $\varphi(A) = |A|$,
ii) $\varphi: \mathrm{GL}_2(\mathbb{R}) \to \mathbb{R}^{*}$, $\varphi\begin{pmatrix} a & b \\ c & d \end{pmatrix} = a^2 + b^2$,
iii) $\varphi: \mathrm{GL}_2(\mathbb{R}) \to \mathbb{R}^{*}$, $\varphi\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ac + bd$.

Let K be an integral domain. Its field of fractions is defined as the set of equivalence classes of the pairs $(a, b) \in K \times K$, $b \neq 0$. These classes are written $\frac{a}{b}$, and the equivalence is defined by
$$\frac{a}{b} = \frac{a'}{b'} \iff ab' = a'b.$$
Addition and multiplication are defined in terms of representatives:
$$\frac{a}{b} + \frac{c}{d} = \frac{ad + bc}{bd}, \qquad \frac{a}{b} \cdot \frac{c}{d} = \frac{ac}{bd}.$$
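These formulas translate directly into code. A minimal sketch of the construction for K = Z (the class name is ours; Python's fractions.Fraction is the standard-library version of the same idea):

```python
class Frac:
    """Equivalence classes of pairs (a, b), b != 0, over K = Z."""
    def __init__(self, a, b):
        assert b != 0
        self.a, self.b = a, b
    def __eq__(self, other):            # a/b = a'/b'  iff  a*b' == a'*b
        return self.a * other.b == other.a * self.b
    def __add__(self, other):           # a/b + c/d = (ad + bc)/(bd)
        return Frac(self.a * other.b + other.a * self.b, self.b * other.b)
    def __mul__(self, other):           # (a/b)(c/d) = ac/(bd)
        return Frac(self.a * other.a, self.b * other.b)

print(Frac(3, 5) * Frac(5, 3) == Frac(1, 1))   # True: (a/b)(b/a) = 1/1
print(Frac(1, 2) + Frac(1, 3) == Frac(5, 6))   # True
```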
It is easily verified that this definition is correct and that the resulting structure satisfies all the field axioms. In particular, $\frac{0}{1}$ is the additive identity, and $\frac{1}{1}$ is the multiplicative identity. If $a \neq 0$, $b \neq 0$, then $\frac{a}{b} \cdot \frac{b}{a} = \frac{1}{1}$. All the details of the arguments are in fact identical with the discussion of the rational numbers in 1.6.6.

The field of fractions of a ring $K[x_1, \dots, x_r]$ is called the field of rational functions (of r variables) and denoted $K(x_1, \dots, x_r)$. In software systems like Sage, Maple or Mathematica, all algebraic operations with polynomials are performed in the corresponding field of fractions, i.e. in the field of rational functions, usually using K = Q.

12.3.11. Completion of the proof. It remains to prove that if a polynomial $f = f_1 f_2$ is divisible by an irreducible polynomial h, then h divides either $f_1$ or $f_2$ or both. This statement is proved in the following three lemmas.

Lemma. Let K be a unique factorization domain. Then:
(1) If $a, b, c \in K$, a is irreducible and $a \mid bc$, then either $a \mid b$ or $a \mid c$.
(2) If a constant polynomial $a \in K[x]$ divides $f \in K[x]$, then a divides all coefficients of f.
(3) If a is an irreducible constant polynomial in K[x] and $a \mid fg$, $f, g \in K[x]$, then $a \mid f$ or $a \mid g$.

Proof. (1) By the assumption, $bc = ad$ for a suitable $d \in K$. Let $d = d_1 \cdots d_r$, $b = b_1 \cdots b_s$, $c = c_1 \cdots c_q$ be the factorizations into irreducible factors. This means that $a d_1 \cdots d_r = b_1 \cdots b_s c_1 \cdots c_q$. Since ad factors in a unique way, it follows that $a = e b_j$ or $a = e c_i$ for a suitable unit e.

(2) Let $f = b_0 + b_1 x + \cdots + b_n x^n$. Since $a \mid f$, there must exist a polynomial $g = c_0 + c_1 x + \cdots + c_k x^k$ such that $f = ag$. Hence it immediately follows that $k = n$ and $a c_0 = b_0, \dots, a c_n = b_n$.

(3) Consider f, g ∈ K[x] as above and suppose that a divides neither f nor g. By the previous claim, there exists an i such that a does not divide $b_i$, and there exists a j such that a does not divide $c_j$. Choose the least such i and j. The coefficient of $x^{i+j}$ in the polynomial fg is
$$b_0 c_{i+j} + b_1 c_{i+j-1} + \cdots + b_{i+j} c_0.$$
By the choice of i and j, a divides all of $b_0 c_{i+j}, \dots, b_{i-1} c_{j+1}, b_{i+1} c_{j-1}, \dots, b_{i+j} c_0$. At the same

⃝

12.F.60. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Z}_3 \to A_4$, $\varphi([a]_3) = (1, 2, 4) \circ (1, 3, 2)^a \circ (1, 4, 2)$,
ii) $\varphi: \mathbb{Z}_3 \to A_4$, $\varphi([a]_3) = (1, 2) \circ (1, 3, 2)^a$. ⃝

12.F.61. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = 2a$,
ii) $\varphi: \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = a + 1$,
iii) $\varphi: \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = 3|a|$,
iv) $\varphi: \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = 1$. ⃝

12.F.62. For each of the following formulas, decide whether it correctly defines a mapping φ. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi: \mathbb{Z} \times \mathbb{Z} \times \mathbb{Z} \to \mathbb{Q}^{*}$, $\varphi((a, b, c)) = 2^a 3^b 12^c$,
ii) $\varphi: \mathbb{Z}_3^{*} \times \mathbb{Z}_5 \to \mathbb{Z}_5$, $\varphi((a, b)) = b^a$,
iii) $\varphi: \mathbb{Z}_2 \times \mathbb{Z} \to \mathbb{Z}$, $\varphi(([a]_2, b)) = b$. ⃝

12.F.63. Prove that there exists no isomorphism of the multiplicative group of non-zero complex numbers onto the multiplicative group of non-zero real numbers.

Solution. Every homomorphism must map the identity of the domain to the identity of the codomain (see 12.4.5). Thus, 1 must be mapped to itself. And what about −1? We know that $f(-1)^2 = f((-1)^2) = f(1) = 1$. Therefore, the image of −1 is a square root of 1.
Since we are interested in bijective homomorphisms only, we must have $f(-1) = -1$. However, then $f(i)^2 = f(i^2) = f(-1) = -1$, so that f(i) would be a square root of −1 in R; however, no such real number exists. Therefore, no bijective homomorphism may exist. □

Remark. The mapping which assigns to each non-zero complex number its absolute value is a surjective homomorphism of $\mathbb{C}^{*}$ onto $\mathbb{R}^{+}$.

G. Burnside's lemma

12.G.1. How many necklaces can be created from 3 black and 6 white beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection.

time, it does not divide $b_i c_j$. Therefore, it cannot divide the coefficient. □

12.3.12. Lemma. Consider the field of fractions L of a unique factorization domain K. If a polynomial f is irreducible in K[x], then it is irreducible in L[x], too.

Proof. Each coefficient $a \in K$ can be considered as the element $\frac{a}{1} \in L$. Therefore, every non-zero polynomial $f \in K[x]$ can be considered a polynomial in L[x]. Suppose that $f = g'h'$ for some $g', h' \in L[x]$, where the polynomials $g', h'$ are not units in L[x] (i.e. they are not constant polynomials, since L is a field). Let a be a common multiple of the denominators of the coefficients in $g'$, and b a common multiple of the denominators of the coefficients in $h'$. Then, $bh', ag' \in K[x]$, and so $abf = (bh')(ag')$. Let c be an irreducible factor in the factorization of ab. Then, c divides $(bh')(ag')$, and hence c divides $bh'$ or $ag'$ (by the previous lemma). This means that c can be canceled out. After a finite number of such cancellations, the conclusion is that $f = gh$ for polynomials $g, h \in K[x]$. Since the degrees of the polynomials are not changed, neither g nor h is constant. Thus, if f is reducible in L[x], then it is also reducible in K[x], which is the contrapositive of the claimed implication. □

12.3.13. Lemma. Let K be a unique factorization domain and $f, g, h \in K[x]$. Suppose that f is irreducible and $f \mid gh$. Then, either $f \mid g$ or $f \mid h$.

Proof. This statement is already proved in one of the previous lemmas for the case that f is a constant polynomial (i.e. an element of K). Suppose that $\deg f > 0$. Then f is irreducible in L[x] as well, where L is the field of fractions of the ring K.

Suppose first that K itself is a field (and as such equals its field of fractions). Moreover, suppose that $f \mid gh$ and f does not divide g. The greatest common divisor of the polynomials g and f must then be a constant polynomial in K. Therefore, there are $A, B \in K[x]$ such that $1 = Af + Bg$. Hence, $h = Afh + Bgh$. Since $f \mid gh$, it follows that $f \mid h$ as well.

Return to the general case. It follows from the assumptions that $f \mid g$ or $f \mid h$ in the polynomial ring L[x] over the field of fractions L of the ring K. For instance, let $h = kf$ in L[x], and choose an $a \in K$ so that $ak \in K[x]$. Then, $ah = akf$, and it must hold for every irreducible factor c of a that $c \mid ak$, because f is irreducible and not constant. It follows that c can be canceled. After a finite number of such cancellations, a becomes a unit, i.e. $h = k'f$ for an appropriate $k' \in K[x]$. □

The proof of this lemma completes the proof of theorem 12.3.9.

Solution. Let us view the necklace as a coloring of the vertices of a regular 9-gon. Let S denote the set of all such colorings. Since each coloring is determined by the positions of the 3 black beads, we get that S has $\binom{9}{3} = 84$ elements. We know that the group of symmetries is D9, which contains 9 rotations (including the identity) and 9 reflections.
Two colorings are the same if and only if they lie in the same orbit of the action of D9 on the set S. Thus, we are interested in the number of orbits (let us denote it by N). In order to find N, it suffices to compute the sizes of the sets $S_g$ of fixed colorings for all elements g of D9:

The identity is the only element of order 1; we have $|S_{\mathrm{id}}| = 84$, so its contribution to the sum is 84. There are 9 reflections g, each of order 2. Clearly, we have $|S_g| = 4$, so the total contribution is 4 · 9 = 36. There are 2 rotations g, by 2π/3 and by 4π/3, both of order 3, with $|S_g| = 3$. Their contribution is 6. Finally, there are 6 rotations of order 9, and no coloring is kept unchanged by them, so they do not contribute to the sum. Altogether, we get by the formula of Burnside's lemma that
$$N = \frac{1}{|D_9|} \sum_{g \in D_9} |S_g| = \frac{126}{18} = 7.$$
Draw the seven necklaces! □

12.G.2. Find the number of colorings of a 3-by-3 grid with three colors. Two colorings are considered the same if they can be transformed to each other by rotation and/or reflection.

Solution. The group of symmetries of the table is the same as for a square, i. e., it is the dihedral group D4. Without any identification, there are clearly $3^9$ colorings of the table. Now, the group G = D4 acts on these colorings. For each symmetry g ∈ G, we find the number of colorings that g keeps unchanged:
• g = Id: $|S_g| = 3^9$.
• g is a rotation by 90° or 270° (= −90°): in this rotation, every corner tile is sent to an adjacent corner tile. This means that if the coloring is to be unchanged, all the corner tiles must be of the same color. Similarly, all the edge tiles must be of the same color. Then, the center tile may be of any color. Altogether, we get that there are $3^3$ colorings which are not changed by the considered rotations.
• g is the rotation by 180°: there are four pairs of tiles that are sent to each other by this symmetry, which means that the two tiles of each pair must be of the same color. Then, the center tile may be of any color. Altogether, we have $|S_g| = 3^5$.
• g is one of the four reflections: there are three pairs of tiles that are sent to each other by the reflection, so again the tiles within each pair must be of one color. The three tiles that are fixed by the reflection may each be of an arbitrary color. Altogether, we get $|S_g| = 3^6$.

4. Groups, rings, and fields

As an illustration of the most abstract approach to an algebraic theory, concepts enjoying just one operation are considered. The focus is on objects and situations where equations of the form a · x = b always have a unique solution (as usual with linear equations, the objects a and b are given, while x is what is sought). This is group theory. Note that nothing is known about the “nature” of the objects, nor even what the dot stands for. The only assumption is that any two objects a and x are assigned an object a · x. In a previous part of this chapter, such operations appeared as addition or multiplication in rings. The concepts and vocabulary concerning such operations are now extended. We first meet such “group” objects among numbers and transformations of the plane and space; then the foundations of a general theory follow.

12.4.1. Examples and concepts. Let A be a set. A binary operation on A is defined to be any mapping A × A → A. The result of such an operation is often denoted $(a, b) \mapsto a \cdot b$ and called the product of a and b. A set together with a binary operation is called a groupoid or a magma.
Further assumed properties of the operations are needed in order to be able to say something interesting.

Binary operations and semigroups
A binary operation is said to be associative if and only if a · (b · c) = (a · b) · c for all a, b, c ∈ A. A groupoid where the operation is associative is called a semigroup.
A binary operation is said to be commutative if and only if a · b = b · a for all a, b ∈ A.

The natural numbers N = {0, 1, 2, . . . } together with either addition or multiplication form a groupoid. These operations are both commutative and associative. The integers Z = {. . . , −2, −1, 0, 1, 2, . . . } form a groupoid with any of addition, subtraction, and multiplication. Subtraction is neither associative, for example (5 − 3) − 2 = 0 ≠ 5 − (3 − 2) = 4, nor commutative, since a − b = −(b − a), which is in general different from b − a.

By Burnside's lemma, the wanted number of colorings is equal to
$$\frac{1}{8}\left(3^9 + 2 \cdot 3^3 + 3^5 + 4 \cdot 3^6\right) = 2862.$$
(A brute-force cross-check of this count is sketched below.) □

12.G.3. a) Find all rotational symmetries of a regular octahedron. b) Find the number of colorings of its sides. Two colorings are considered the same if they can be transformed to each other by rotation.

Solution. a) Placing the octahedron into the Cartesian coordinate system so that the pairs of opposite vertices lie on the axes and the center of the octahedron lies at the origin, every rotational symmetry is given by which of the six vertices is on the positive x-semiaxis and which of the four adjacent vertices is on the positive y-semiaxis. Thus, the group has 24 elements. These are (besides the identity) rotations by ±90° and 180° around axes going through opposite vertices, rotations by 180° around axes going through the centers of opposite edges, and finally rotations by ±120° around axes going through the centers of opposite sides.

b) Without any identifications, there are $3^8$ colorings. For each rotational symmetry g, we compute the number of colorings that are kept unchanged by it:
– g is a rotation by ±90° around an axis going through opposite vertices. Then, g fixes $3^2$ colorings, and there are 6 such rotations.
– g is a rotation by 180° around an axis going through opposite vertices or through the centers of opposite edges. Then, g fixes $3^4$ colorings. There are 3 + 6 = 9 of these.
– g is a rotation by ±120°. Then, g also fixes $3^4$ colorings, and there are 8 such rotations.
Together with $3^8$ for the identity, we get that the number of colorings is
$$\frac{1}{24}\left(3^8 + 6 \cdot 3^2 + 17 \cdot 3^4\right) = 333. \qquad \square$$

12.G.4. How many necklaces can be created from 9 white, 6 red, and 3 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection.

Neutral elements, inverses, and groups⁷
A left identity (or left neutral element) in a groupoid (A, ·) is an element e ∈ A such that e · a = a for all a ∈ A. Similarly, e ∈ A is a right identity (right neutral element) iff a · e = a for all a ∈ A. If e satisfies both these properties, it is called an identity (or a neutral element).
In a groupoid (A, ·) with identity e, an element b is a left inverse of an element a if and only if b · a = e; it is a right inverse of a if and only if a · b = e. If b satisfies both these properties, it is called an inverse of a.
A monoid (M, ·) is a semigroup which has a neutral element. A group (G, ·) is a monoid where each element has an inverse.
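Before continuing, here is the promised brute-force cross-check of the Burnside count from 12.G.2 (a verification aid only; the helper names rot and mir are ours). Cells of the 3×3 grid are indexed 0–8 row by row, and D4 is generated by a quarter turn and a left-right mirror.

```python
# Orbits of 3-colorings of the 3x3 grid under the dihedral group D4.
from itertools import product

def rot(c):   # rotate by 90 degrees: new[r][col] = old[2-col][r]
    return tuple(c[3 * (2 - col) + r] for r in range(3) for col in range(3))

def mir(c):   # left-right mirror: new[r][col] = old[r][2-col]
    return tuple(c[3 * r + (2 - col)] for r in range(3) for col in range(3))

seen, orbits = set(), 0
for c in product(range(3), repeat=9):
    if c in seen:
        continue
    orbits += 1
    imgs = [c]
    for _ in range(3):
        imgs.append(rot(imgs[-1]))
    imgs += [mir(x) for x in imgs]     # the full dihedral orbit of c
    seen.update(imgs)
print(orbits)                          # 2862, matching Burnside's lemma
```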
A commutative semigroup is a semigroup where the operation is commutative; similarly for a commutative monoid or a commutative group. A commutative group is also often called an Abelian group.

Consider direct consequences of the definitions. A groupoid cannot have both a left identity and a different right identity (if it had, what would their product be equal to?). Thus, if a groupoid has a (two-sided) identity, then it is the only identity element, called the identity. Similarly, in a monoid, an element x cannot have both a left inverse a and a different right inverse b, since if a · x = x · b = e, then also a = a · (x · b) = (a · x) · b = b. Note that the associativity of the operation is needed here. It follows that if x has an inverse, then it is unique. It is usually denoted by $x^{-1}$.

As an example, consider again subtraction on the integers. This operation is not associative. There is a right identity (zero), i.e. a − 0 = a for any integer a, but it is not a left identity. There is no left identity for subtraction. The integers are a semigroup with respect to either addition or multiplication. They form a group only with addition, since with respect to multiplication, only the integers ±1 have an inverse.

If (A, ·) is a group, then any subset B ⊆ A which is closed with respect to the restriction of · (i.e. a · b ∈ B for any a, b ∈ B) and forms a group with this operation is called a subgroup. Both conditions are essential. For instance, consider the integers as a subset of the rational numbers with multiplication.

Let G be a group and M ⊂ G. The subgroup generated by M is the smallest (with respect to set inclusion) subgroup of G which contains all the elements of M. Clearly, this is the intersection of all subgroups containing M.

⁷ The name “Abelian” is in honour of the young mathematician Niels Henrik Abel. The adjective is so widely used that it is common to write it with a lower-case ‘a’, abelian, although it is derived from a surname.

Solution. The group of symmetries of the necklace is the dihedral group D18, which has 36 elements. It acts on the set of necklaces, where we can number the places (1 through 18), resulting in $\frac{18!}{9! \, 6! \, 3!} = 4084080$ necklaces (without any identification). The only symmetries that fix a non-zero number of necklaces are the rotations by 120° and 240°, the reflections whose axis passes through two opposite beads, and of course the identity. By Burnside's lemma, the wanted number of necklaces is equal to
$$\frac{1}{36}\left(4084080 + 2 \cdot \binom{6}{3}\binom{3}{2} + 9 \cdot \binom{8}{4}\binom{4}{3}\right) = \frac{1}{36}\left(4084080 + 120 + 2520\right) = 113520. \qquad \square$$

12.G.5. How many necklaces can be created from 6 white, 6 red, and 6 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection. ⃝

12.G.6. How many necklaces can be created from 8 white, 8 red, and 8 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection. ⃝

12.G.7. How many necklaces can be created from 3 white and 6 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection. ⃝

H. Codes

12.H.1. Consider the (5, 3)-code over Z2 generated by the polynomial $x^2 + x + 1$. Find all codewords as well as the generating matrix and the check matrix.

Solution. $p(x) = x^2 + x + 1$.
The code words are precisely the multiples of the generating polynomial:
$$0 \cdot p,\ 1 \cdot p,\ x \cdot p,\ (x+1) \cdot p,\ x^2 \cdot p,\ (x^2+1) \cdot p,\ (x^2+x) \cdot p,\ (x^2+x+1) \cdot p,$$
or
$$0,\ x^2+x+1,\ x^3+x^2+x,\ x^3+1,\ x^4+x^3+x^2,\ x^4+x^3+x+1,\ x^4+x,\ x^4+x^2+1,$$

Here are a few very well known examples of groups. The rational numbers Q are a commutative group with respect to addition. The integers are one of their subgroups. The non-zero rational numbers are a commutative group with respect to multiplication. For every positive integer k, the set of all k-th roots of unity, i.e. the set $\{z \in \mathbb{C};\ z^k = 1\}$, is a finite commutative group with respect to multiplication of complex numbers. For k = 2, this is the two-element group {−1, 1}, both of whose elements are self-inverse. For k = 4, this is the group G = {1, i, −1, −i}.

The set Matn, n > 1, of all square matrices of order n is a (non-commutative) monoid with respect to multiplication and a commutative group with respect to addition (see subsections 2.1.2–2.1.5). The set of all linear mappings Hom(V, V) on a vector space is a monoid with respect to mapping composition and a commutative group with respect to addition (see subsection 2.3.12). In every monoid, the subset of all invertible elements forms a group. In the former of the above examples, it was the group of invertible matrices. In the latter case, it was the group of linear isomorphisms of the corresponding vector space. In previous chapters, there are several (semi)group structures, sometimes met quite unexpectedly. For example, recall various subgroups of the group of matrices or the group structure on elliptic curves.

12.4.2. Permutation groups. Groups and semigroups often arise as sets of mappings on a fixed set M, which are closed with respect to mapping composition. This is easily seen on finite non-empty sets M, where every subset of invertible mappings generates a group with respect to composition. Such a set M consisting of $m = |M| \in \mathbb{N}$ elements allows for $m^m$ possible mappings (each of the m elements can be sent to an arbitrary element of M), and all of these mappings can be composed. Since mapping composition is associative, this yields a semigroup. If a mapping $\alpha: M \to M$ is required to have an inverse $\alpha^{-1}$, then α must be a bijection. The composition of two bijections is again a bijection; hence the set Σm of all bijections on an m-element set M is a group. This is called the symmetric group (on m elements). It is an example of a finite group.⁸

The name of the group Σm brings another connection: instead of bijections on a finite set, permutations can be viewed as rearrangements of distinguished objects. Permutations are encountered in this sense when studying determinants; see subsection 2.2.1 on page 96 for a few elementary results.

⁸ It can be proved that every finite group is a subgroup of an appropriate finite symmetric group. This can be interpreted so that the groups Σm are as non-commutative and complex as possible.

or 00000, 11100, 01110, 10010, 00111, 11011, 01001, 10101. The basis vectors multiplied by $x^{5-3} = x^2$ yield, mod p: $x^2 \equiv x + 1$, $x^3 = x \cdot x^2 \equiv x(x+1) = x^2 + x \equiv 1$, $x^4 \equiv x$. This means that the basis vectors are encoded as follows:
$1 \mapsto x^2 + x + 1$, i. e., 100 → 11100,
$x \mapsto x^3 + 1$, i. e., 010 → 10010,
$x^2 \mapsto x^4 + x$, i. e., 001 → 01001.
Thus, the generating matrix is
$$G = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
and the check matrix is
$$H = \begin{pmatrix} 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 \end{pmatrix}. \qquad \square$$
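The whole computation of 12.H.1 can be reproduced mechanically. A hedged Python sketch (the helper names are ours; numpy is assumed to be available) — it rebuilds the columns of G by systematic encoding of the basis messages and checks that H annihilates the code:

```python
# Reconstructing the (5, 3)-code of 12.H.1 over Z_2 (names ours).
# Polynomials are int lists, p[i] = coefficient of x^i.
import numpy as np

p = [1, 1, 1]                      # the generating polynomial x^2 + x + 1

def polymod(f, g):
    """Remainder of f modulo g over Z_2."""
    f = f[:]
    while len(f) >= len(g):
        if f[-1]:                  # cancel the leading term with a shift of g
            d = len(f) - len(g)
            for i, gi in enumerate(g):
                f[d + i] ^= gi
        f.pop()
    return f

# Systematic encoding of the basis messages 100, 010, 001:
# message x^j  ->  x^{2+j} + (x^{2+j} mod p), giving the columns of G.
cols = []
for j in range(3):
    shifted = [0] * (2 + j) + [1]                 # the polynomial x^{2+j}
    rem = polymod(shifted, p)
    word = [(rem[i] if i < len(rem) else 0) ^ (1 if i == 2 + j else 0)
            for i in range(5)]
    cols.append(word)

G = np.array(cols).T                              # 5x3 generating matrix
P = G[:2, :]                                      # parity block
H = np.hstack([np.eye(2, dtype=int), P])          # check matrix (I_2 | P)
print(G.T)                                        # rows 11100, 10010, 01001
print((H @ G) % 2)                                # zero: H annihilates the code
```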
12.H.2. Find the generating matrix and the check matrix of the (7, 4)-code over Z2 generated by the polynomial $x^3 + x + 1$. ⃝

12.H.3. A 7-bit message $a_0 a_1 \dots a_6$, considered as the polynomial $a_0 + a_1 x + \cdots + a_6 x^6$, is encoded using the polynomial code generated by $x^4 + x + 1$.
i) Encode the message 1100011.
ii) You have received the word 10111010001. What was the message if you assume that at most one bit was flipped?
iii) What was the message in ii) if you assume that exactly two bits were flipped?

Solution. i) $x^4 \equiv x + 1$, $x^5 \equiv x^2 + x$, $x^9 \equiv x^3 + x$, $x^{10} \equiv x^2 + x + 1$, whence
$$1 + x + x^5 + x^6 \mapsto x^4 + x^5 + x^9 + x^{10} + (x + 1) + (x^2 + x) + (x^3 + x) + (x^2 + x + 1) = x^3 + x^4 + x^5 + x^9 + x^{10}.$$
Thus, the code is 00011100011.

Let us briefly recollect them now in view of the general concepts of groups and their homomorphisms. What the operation in this group looks like needs more thought. In the case of a (small) finite group, one can build a complete table of the operation results for all pairs of operands. Consider the group Σ3 on the numbers {1, 2, 3} and denote the particular permutations by the ordering of the numbers (not to be confused with the notation for cycles!):
a = (1, 2, 3), b = (2, 3, 1), c = (3, 1, 2), d = (1, 3, 2), e = (3, 2, 1), f = (2, 1, 3).
Then the composition is given by the following table:
$$\begin{array}{c|cccccc} \cdot & a & b & c & d & e & f \\ \hline a & a & b & c & d & e & f \\ b & b & c & a & f & d & e \\ c & c & a & b & e & f & d \\ d & d & e & f & a & b & c \\ e & e & f & d & c & a & b \\ f & f & d & e & b & c & a \end{array}$$
Note that there is a fundamental difference between the permutations a, b, c and the other three. The former three form a cycle, generated by either b or c: $b^2 = c$, $b^3 = a$, $c^2 = b$, $c^3 = a$. It follows that these three permutations form a commutative subgroup. Here (as well as in the whole group), a is the neutral element, and b and c are inverses of each other. Therefore, this subgroup is the same as the group Z3 of residue classes modulo 3, or as the group of third roots of unity. The other three permutations are self-inverse, which means that any one of them together with the identity a creates a subgroup, the same one as Z2. Further, b and c are elements of order 3, i.e. the third power is the first one equal to the identity a, while d, e, and f are of order 2. Since the table is not symmetric with respect to the main diagonal, the composition · is not commutative.

Other permutation groups Σm of finite m-element sets behave similarly. Each permutation σ partitions the set M into a disjoint union of maximal invariant subsets, which are obtained by taking unprocessed elements x ∈ M step by step and putting all iteration results $\sigma^k(x)$, k = 1, 2, …, into the class $M_x$ until $\sigma^k(x) = x$. Each permutation is obtained as a composition of the cycles, which behave as the identity outside $M_x$ and as σ on $M_x$. If the elements of $M_x$ are numbered as $(1, 2, \dots, |M_x|)$ so that i corresponds to $\sigma^i(x)$, then the permutation is simply a one-place shift in the cycle (i.e. the last element is mapped back to the first one). Hence the name cycle. These cycles commute, so it does not matter in which order the permutation σ is composed from them. (Of course, if we pick two arbitrary cycles on M, they do not have to commute.)

The simplest cycles are one-element fixed points of σ and two-element subsets (x, σ(x)), where σ(σ(x)) = x. The latter are called transpositions. Since every cycle can be composed from transpositions of adjacent elements (just let the last element “bubble” back to the beginning), every permutation can be written as a composition of transpositions of adjacent elements.

ii) $1 + x^2 + x^3 + x^4 + x^6 + x^{10}$ divided by $x^4 + x + 1$ gives remainder $x^2 + 1 \equiv x^8$.
Thus, the ninth bit was flipped and the original message was 1010101.
iii) Either the first and third bits were flipped ($x^2 + 1$), or the fifth and sixth were ($x^4 + x^5 \equiv x^2 + 1$). In the first case, the message was 1010001, while in the second case, it was 0110001. □

12.H.4. A 7-bit message $a_0 a_1 \dots a_6$, considered as the polynomial $a_0 + a_1 x + \cdots + a_6 x^6$, is encoded using the polynomial code generated by $x^4 + x^3 + 1$.
i) Encode the message 1101011.
ii) You have received the word 01001011101. What was the message if you assume that at most one bit was flipped?
iii) What was the message in ii) if you assume that exactly two bits were flipped?

Solution. i) $x^4 \equiv x^3 + 1$, $x^5 \equiv x^3 + x + 1$, $x^7 \equiv x^2 + x + 1$, $x^9 \equiv x^2 + 1$, $x^{10} \equiv x^3 + x$, thus we get
$$1 + x + x^3 + x^5 + x^6 \mapsto x^4 + x^5 + x^7 + x^9 + x^{10} + (x^3 + 1) + (x^3 + x + 1) + (x^2 + x + 1) + (x^2 + 1) + (x^3 + x) = x + x^3 + x^4 + x^5 + x^7 + x^9 + x^{10}.$$
Therefore, the codeword is 0101 1101011, the first four bits being the redundancy and the remaining seven the message.

ii) $x + x^4 + x^6 + x^7 + x^8 + x^{10}$ divided by $x^4 + x^3 + 1$ gives remainder $x^2 + x + 1 \equiv x^7$. Thus, the eighth bit was flipped, and the original message was 1010101.

iii) Either the second and tenth bits were flipped ($x + x^9 \equiv x^2 + x + 1$), or the fourth and seventh ($x^3 + x^6 \equiv x^2 + x + 1$), or the fifth and ninth ($x^4 + x^8 \equiv x^2 + x + 1$). The respective corrected codewords are 00001011111, 01011010101, and 01000011001. □

12.H.5. Consider the (15, 11)-code generated by the polynomial $1 + x^3 + x^4$. We have received the word 011101110111001. Find the original 11-bit message, provided exactly one bit was flipped.

Solution. The word is a codeword if and only if it is divisible by the generating polynomial $1 + x^3 + x^4$. The received word corresponds to the polynomial $x + x^2 + x^3 + x^5 + x^6 + x^7 + x^9 + x^{10} + x^{11} + x^{14}$. When divided by $1 + x^3 + x^4$, it leaves remainder $x + 1$. This means that an error has occurred. If we assume that only one bit was flipped, then there must be a power of x which is equal to this remainder modulo $1 + x^3 + x^4$. Thus, we compute $x^4 \equiv x^3 + 1$, $x^5 \equiv x^3 + x + 1$, …, $x^{12} \equiv x + 1$, and find out that the thirteenth bit was flipped, so the original message was 01110111101.

Return to the case of Σ3. Two elements, b, c, represent cycles which include all the three elements; each of them generates {a, b, c} = Z3. Besides those, d, e, f are composed of cycles of length 2 and 1; finally, a is composed of three cycles of length one. There are no more possibilities. However, it is clear from the procedure that for more elements, there are very many possibilities.

In general, there are many ways of expressing a permutation as a composition of transpositions. However, for a given permutation, the parity of the number of transpositions is fixed and independent of the choice of particular transpositions. This can be seen from the number of inversions of a permutation, since each transposition changes the number of inversions by an odd number (see the discussion in subsection 2.2.2 on page 97). It follows that there is a well-defined mapping sgn: Σm → Z2 = {±1}, the permutation parity. This recovers the proposition crucial for building the determinants (see 2.2.1 and on):

Theorem. Every permutation of a finite set can be written as a composition of cycles. A cycle of length ℓ can be expressed as a composition of ℓ − 1 transpositions. The parity of this cycle is $(-1)^{\ell - 1}$.
The parity of the composition σ ◦ τ is equal to the product of the parities of the composed permutations σ and τ.

The last proposition says that the mapping sgn transforms permutation composition σ ◦ τ to the product sgn σ · sgn τ in the commutative group Z2.

(Semi)group homomorphisms
In general, a mapping $f: G_1 \to G_2$ is a (semi)group homomorphism if and only if it respects the operation, i.e. $f(a \cdot b) = f(a) \cdot f(b)$.

In particular, the permutation parity is a homomorphism sgn: Σm → Z2. In a moment, we shall see that group inversions and units are also preserved by homomorphisms. Before discussing the theory, let us look at more examples of groups.

12.4.3. Symmetries of plane figures. In the fifth part of chapter one, the connections between invertible 2-by-2 matrices and linear transformations in the plane are thoroughly considered. A matrix in Mat2(R) defines a linear mapping $\mathbb{R}^2 \to \mathbb{R}^2$ that preserves standard distances if and only if its columns form an orthonormal basis of $\mathbb{R}^2$ (which is a simple condition on the matrix entries, see subsection 1.5.7 on page 33). Combining the orthogonal linear mappings with translations, we arrive at the group of all Euclidean transformations of the plane.

Let us look at the exercise more thoroughly. Computing all powers of x, we obtain
$x^4 \equiv x^3 + 1$, $x^5 \equiv x^3 + x + 1$, $x^6 \equiv x^3 + x^2 + x + 1$, $x^7 \equiv x^2 + x + 1$, $x^8 \equiv x^3 + x^2 + x$, $x^9 \equiv x^2 + 1$, $x^{10} \equiv x^3 + x$, $x^{11} \equiv x^3 + x^2 + 1$, $x^{12} \equiv x + 1$, $x^{13} \equiv x^2 + x$, $x^{14} \equiv x^3 + x^2$,
so the generating matrix is
$$G = \begin{pmatrix} P \\ I_{11} \end{pmatrix}, \qquad P = \begin{pmatrix} 1&1&1&1&0&1&0&1&1&0&0 \\ 0&1&1&1&1&0&1&0&1&1&0 \\ 0&0&1&1&1&1&0&1&0&1&1 \\ 1&1&1&0&1&0&1&1&0&0&1 \end{pmatrix},$$
where $I_{11}$ denotes the identity matrix of order 11. We can verify that multiplication by 01110111101 yields the codeword 011101110111101, which differs from the received word 011101110111001 exactly in the thirteenth bit. □

Now, we begin to use the check matrix efficiently.

12.H.6. Find the generating matrix and the check matrix of the (7, 2)-code (i. e., there are 2 bits of the message and 5 redundant bits) generated by the polynomial $x^5 + x^4 + x^2 + 1$. Decode the received word 0010111 (i. e., find the message that was sent) assuming that the least number of errors occurred.

Solution. The generating matrix of the code is
$$G = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

In fact, it is possible to prove that every mapping of the plane into itself which preserves distances is affine, and hence such a Euclidean transformation.⁹ As observed, the linear part of this mapping is orthogonal. Thus, all these mappings form the group of all orthogonal transformations (also called Euclidean transformations) in the plane. Moreover, it is shown that besides the translations $T_a$ by a vector a, these are only the rotations $R_\varphi$ around the origin by any angle φ, and the reflections $F_\ell$ with respect to any line that goes through the origin (also note that the central inversion is the same as the rotation by π).

Now, general group concepts are illustrated on the problem of symmetries of plane figures. For example, consider tiles. First, consider them individually, in the form of a bounded diagram in the plane. Then consider them with the condition of tiling a band, and then the entire plane.
As an example, consider a line segment and an equilateral triangle. It is of interest how symmetric these objects are; that is, with respect to which distance-preserving transformations they are invariant. In other words, we want the image of the figure to be identical to the original one (unless some significant points are labeled, for example the vertices of the triangle A, B, C or the endpoints of the line segment). It is clear that all symmetries of a fixed object form a group (usually with only one element, the identity).

In the case of the line segment, the situation is very simple – it is clear that the only non-trivial symmetries are the rotation by π around the center of the segment, the reflection FH with respect to the axis of the segment, and the reflection FV with respect to the line itself. All these symmetries are self-inverse. Hence the group of symmetries has four elements. Its table looks as follows:

⁹ If a mapping $F: \mathbb{R}^2 \to \mathbb{R}^2$ preserves distances, then this must also hold for the mapped vectors of velocity, i.e. the Jacobi matrix DF(x, y) must be orthogonal at every point. Expanding this condition for the given mapping $F = (f(x, y), g(x, y)): \mathbb{R}^2 \to \mathbb{R}^2$ leads to a system of differential equations which has only affine solutions, since all second derivatives of F must be zero (and then, the proposition is an immediate consequence of Taylor's remainder theorem). Try to think out the details! The same procedure leads to the result for Euclidean spaces of arbitrary dimension. Note that the condition to be proved is independent of the choice of affine coordinates. Composing F with a linear mapping does not change the result. Hence, for a fixed point (x, y), compose $(DF)^{-1} \circ F$ and assume, without loss of generality, that DF(x, y) is the identity matrix. Differentiation of the equations then yields the desired proposition.

The generating matrix is of the form $G = \begin{pmatrix} P \\ I_k \end{pmatrix}$, where
$$P = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}.$$
The check matrix is of the form $(I_{n-k} \mid P)$, i. e., in our case,
$$H = \begin{pmatrix} 1&0&0&0&0&1&1 \\ 0&1&0&0&0&0&1 \\ 0&0&1&0&0&1&1 \\ 0&0&0&1&0&0&1 \\ 0&0&0&0&1&1&1 \end{pmatrix}.$$
Multiplying the received word by the check matrix, we get the syndrome (error) of the word:
$$Hz = \begin{pmatrix} 1&0&0&0&0&1&1 \\ 0&1&0&0&0&0&1 \\ 0&0&1&0&0&1&1 \\ 0&0&0&1&0&0&1 \\ 0&0&0&0&1&1&1 \end{pmatrix} \begin{pmatrix} 0\\0\\1\\0\\1\\1\\1 \end{pmatrix} = \begin{pmatrix} 0\\1\\1\\1\\1 \end{pmatrix}.$$
The syndrome corresponding to the received word is 01111. Now, we find all words corresponding to this syndrome. This can be done by adding all codewords to the received word. There are four codewords, corresponding to the four possible messages. They are obtained by multiplying the messages (00, 01, 10, 11) by the generating matrix. Thus, we get the codewords 0000000, 1111101, 1010110, 0101011. The space of words corresponding to a given syndrome is an affine space whose direction is the vector space of all codewords (see 12.5.8). Thus, we get the words 0010111, 1101010, 1000001, 0111100. The least number of errors is equal to the least number of ones in the obtained words. In our case, this is achieved by the word 1000001, which contains only two ones and thus is the so-called leading representative of the class of words with syndrome 01111. The original message can be obtained by subtracting (or adding – this is equivalent in Z2) the received word and the leading representative of the class with the given syndrome. In our case, we get 0010111 − 1000001 = 1010110.
Therefore, assuming the least number of errors, the sent word was 1010110, where the last two bits are the original message, i. e., 10. □

12.H.7. Consider the (7, 3)-code generated by the polynomial $x^4 + x^3 + x + 1$. Find its generating matrix and check matrix. Using the method of leading representatives, decode the received word 1110010. ⃝

$$\begin{array}{c|cccc} \cdot & R_0 & R_\pi & F_H & F_V \\ \hline R_0 & R_0 & R_\pi & F_H & F_V \\ R_\pi & R_\pi & R_0 & F_V & F_H \\ F_H & F_H & F_V & R_0 & R_\pi \\ F_V & F_V & F_H & R_\pi & R_0 \end{array}$$
This group is commutative.

For the equilateral triangle, there are more symmetries: one can rotate by 2π/3, or one can mirror with respect to the axes of the sides. In order to obtain the entire group, all compositions of these transformations must be added in. In 1.5.9 it is shown that the composition of two reflections is always a rotation. At the same time, it is clear that changing the order of composition of two fixed reflections leads to a rotation by the same angle but with the other orientation. It follows that the reflections around two axes generate all the symmetries, of which there are six altogether. Placing the triangle as is shown in the diagram, the six transformations are given by the following matrices:
$$a = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad b = \begin{pmatrix} -\frac{1}{2} & \frac{\sqrt{3}}{2} \\ -\frac{\sqrt{3}}{2} & -\frac{1}{2} \end{pmatrix}, \quad c = \begin{pmatrix} -\frac{1}{2} & -\frac{\sqrt{3}}{2} \\ \frac{\sqrt{3}}{2} & -\frac{1}{2} \end{pmatrix},$$
$$d = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}, \quad e = \begin{pmatrix} \frac{1}{2} & -\frac{\sqrt{3}}{2} \\ -\frac{\sqrt{3}}{2} & -\frac{1}{2} \end{pmatrix}, \quad f = \begin{pmatrix} \frac{1}{2} & \frac{\sqrt{3}}{2} \\ \frac{\sqrt{3}}{2} & -\frac{1}{2} \end{pmatrix}.$$
A comparison of the table of the operation with that of the permutation group Σ3 shows that it is the same. For the sake of clarity, the vertices are labeled with numbers, so the corresponding permutations can be easily understood.

Similarly, there are groups of symmetries with k rotations and k reflections. It suffices to consider a regular k-gon. These groups are usually denoted Dk and called the dihedral groups; Dk has 2k elements. They are not commutative for k ≥ 3 (D2 is commutative). The name comes from the fact that D2 is the group of symmetries of the hydrogen molecule H2, which contains two hydrogen atoms and can be imagined as a line segment.

Similarly, there are figures whose only symmetries are rotations, and hence the corresponding groups are commutative. They are denoted Ck and called cyclic groups of order k. For that, it suffices to consider a regular polygon whose sides are changed non-symmetrically, but in the same manner (see the extension of the triangle in the diagram). Note that the group C2 can be realized in two ways: either using the rotation by π or a single reflection.

As the first illustration of the power of abstraction, we prove the following theorem. A figure is said to have a discrete group of symmetries if and only if the set of images of an arbitrary point over all the symmetries is a discrete subset of the plane (i.e. each of its points has a neighbourhood in which there is no other point of the set). Note that every discrete group of symmetries of a bounded figure is necessarily finite.

Theorem. Let M be a bounded set in the plane $\mathbb{R}^2$ with a discrete group of symmetries G. Then G is either trivial or one of the groups Ck, Dk for k > 1.

12.H.8. Consider the linear (7, 4)-code (i. e., the message has length 4) over Z2 defined by the matrix
$$\begin{pmatrix} 0&1&1&0 \\ 1&1&0&1 \\ 1&0&1&1 \\ 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{pmatrix}.$$
Decode the received word 1010001 (i. e., find the sent message) assuming that the least number of errors occurred.

Solution. There are $2^4 = 16$ possible messages. All codewords can be obtained by multiplying the possible messages (0000, 0001, …, 1111) by the generating matrix of the code.
Thus, we get:
0110001, 1010010, 1100100, 0111000, 1100011, 1010101, 0001001, 1011100, 1101010, 0110110, 0001110, 1101101, 1011011, 0000111, 0111111, 0000000.
Now, we construct the check matrix of the given code:
$$H = \begin{pmatrix} 1&0&0&0&1&1&0 \\ 0&1&0&1&1&0&1 \\ 0&0&1&1&0&1&1 \end{pmatrix}$$
(we remove the block of the generating matrix that consists of the identity matrix, and to the left of the remaining block, we write the identity matrix of fitting size). Now, multiplying the vector of the received word $z = (1010001)^T$ by H, we get the syndrome $s = Hz = (110)^T$. One word with this syndrome is 1100000 (we fill the syndrome with zeros to the appropriate length). All words with syndrome 110 are obtained by adding this word to all codewords. Thus, we get the words
1000001, 0110010, 0000100, 1011000, 0000011, 0110101, 1101001, 0111100, 0001010, 1010110, 1101110, 0001101, 0111011, 1100111, 1011111, 1100000.
Out of these words with syndrome 110, only the word 0000100 contains a single one, so this is the leading representative of the class of words with syndrome 110. Subtracting the leading representative from the received word, we get the word that was sent, assuming the least number of bit flips (1 in this case), i. e., the word (101)0101, where the last four bits are the message, i. e., 0101. □

12.H.9. Consider the linear (7, 4)-code (i. e., the message has length 4) over Z2 defined by the matrix
$$\begin{pmatrix} 1&1&0&1 \\ 0&0&1&1 \\ 1&0&1&0 \\ 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{pmatrix}.$$

Proof. If there were a set M with a translation as one of its symmetries, then it could not be bounded. If M had, as one of its symmetries, a rotation by an angle which is an irrational multiple of 2π, then iterating this rotation would lead to a dense subset of images on the corresponding circle.¹⁰ It follows that the group is not discrete.

If M had non-trivial rotations with different centers as symmetries, then again it could not be bounded. To see this, write the corresponding rotations in the complex plane as $R: z \mapsto (z - a)\zeta + a$, $Q: z \mapsto z\eta$ for complex units $\zeta = e^{2\pi i/k}$, $\eta = e^{2\pi i/\ell}$ and an arbitrary $a \neq 0 \in \mathbb{C}$. Then, it is immediate (a straightforward computation with complex numbers) that
$$Q \circ R \circ Q^{-1} \circ R^{-1}: z \mapsto z + a(-1 + \zeta + \eta - \zeta\eta),$$
which is a translation by a non-trivial vector unless the angle of one of the rotations is zero. It follows that M is not bounded. The same holds for the case of a rotation and a reflection with respect to a line which does not go through the center of the rotation. Check this case yourself!

The only symmetries available are rotations with a common center and reflections with respect to lines which pass through this center. It remains to prove that the entire group is composed either only of rotations, or of the same number of rotations and reflections. Recall that the composition of two different reflections yields a rotation whose angle doubles the angle enclosed by the corresponding axes (see 1.5.9). Therefore, composing a reflection with respect to a line p with a rotation by angle φ is again a reflection, with respect to the line which is at angle φ/2 from p (draw a diagram!). The proof is almost complete. Observe that the subgroup of all rotations in the group of symmetries contains a rotation by the smallest positive angle $\varphi_0$ (there are only finitely many of them there).
12.4.4. Symmetries of plane tilings. There is more complicated behaviour in the case of plane figures in bands or in the entire plane (for example, symmetries of various tilings).

¹⁰ The argument is subtle but straightforward: if there were an interval of length ε on the circle not hit by the orbit of a point under the rotation, then all the points of the orbit would have to be at distance at least ε from each other. Thus, there could be only finitely many of them, which contradicts the irrationality of the angle.

Decode the received word 1101001 (i.e., find the sent message) assuming that the least number of errors occurred.

Solution. Syndrome 101, leading representative 0001000, sent message (110)0001. □

12.H.10. Consider the linear (7, 4)-code (i.e., the message has length 4) over Z2 defined by the matrix

\begin{pmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}

Decode the received word 0000011 (i.e., find the sent message) assuming that the least number of errors occurred.

Solution. Syndrome 011, leading representative 0000100, sent message (000)0111. □

12.H.11. Consider the linear (7, 4)-code (i.e., the message has length 4) over Z2 defined by the matrix

\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}

Decode the received word 0001100 (i.e., find the sent message) assuming that the least number of errors occurred.

Solution. Syndrome 110, leading representative 0000010, sent message (000)1110. □

12.H.12. We want to transfer one of four possible messages with a binary code which should be able to correct all single errors. What is the minimum possible length of the codewords (all codewords have to be of the same length)? Why?

Solution. Let us denote the desired length by n. The minimum Hamming distance of any two codewords must be at least three. This means that if we take two different codewords and flip one bit in each, the resulting words must still be different (and also different from every codeword). There are n + 1 words that can be obtained from a given one by flipping at most one bit (this includes the original word itself). Thus, we need at least 4(n + 1) possible words. On the other hand, there are 2^n words of length n, so 4(n + 1) ≤ 2^n. This inequality is satisfied only for n ≥ 5, so the codewords must be at least 5 bits long. And indeed, there are four codewords of length 5 with minimum Hamming distance 3, for instance 00111, 01001, 10100, 11010. □

Consider first the set of points lying between two fixed parallel lines. Suppose that this band is covered by disjoint images of a bounded subset M under some translation. Of course, this translation is a symmetry of the chosen tiling of the band, so the group of symmetries is necessarily infinite. Such a set allows for no rotation symmetries other than Rπ, and the only possible reflections are either horizontal, with respect to the axis of the band, or vertical, with respect to a line perpendicular to the boundary lines. In addition, there are translations given by vectors parallel to the axis of the band.
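Returning briefly to the decoding exercises 12.H.8–12.H.11 above: the whole recipe (syndrome, leading representative, correction) is mechanical and easy to automate. A minimal sketch in Python with NumPy, assuming a generating matrix in the standard form used above, with the identity block in the last four rows (the helper names are ours):

    import numpy as np
    from itertools import product

    # the (7, 4)-code of 12.H.8; the last four rows form the identity block
    G = np.array([[0,1,1,0], [1,1,0,1], [1,0,1,1],
                  [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]])
    H = np.hstack([np.eye(3, dtype=int), G[:3]])   # check matrix [I | P], cf. 12.H.8

    def decode(received):
        z = np.array(received)
        s = H @ z % 2                                  # the syndrome
        # the leading representative: a minimum-weight word with the same syndrome
        e = min((np.array(w) for w in product([0, 1], repeat=7)
                 if np.array_equal(H @ np.array(w) % 2, s)), key=sum)
        return (z + e) % 2                             # the corrected codeword

    print(decode([1, 0, 1, 0, 0, 0, 1]))   # -> [1 0 1 0 1 0 1]; message 0101, as in 12.H.8

Swapping in the matrices of 12.H.9–12.H.11 reproduces the remaining solutions.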
A not-too-complicated discussion leads to a description of all discrete groups of symmetries of such bands. Such a group is generated by some of the following symmetries: a translation T, a shifted reflection G (i.e., the composition of the horizontal reflection with a translation), a vertical reflection V, the horizontal reflection H, and the rotation R by π.

Theorem. Every discrete group of symmetries of a band in the plane is isomorphic to one of the groups generated by the following symmetries:
(1) a single translation T,
(2) a single shifted reflection G,
(3) a single translation T and a vertical reflection V,
(4) a single translation T and the rotation R,
(5) a single shifted reflection G and the rotation R,
(6) a single translation T and the horizontal reflection H,
(7) a single translation T, the horizontal reflection H, and a vertical reflection V.

The proof is not presented here. The following diagram shows examples of schematic patterns with the corresponding symmetries:

It is even more complicated with symmetries of tilings which cover the entire plane; there is insufficient space here to consider further details. It can be shown that there are 17 such groups of symmetries, known as the two-dimensional crystallographic groups. A similar complete discussion is known even for three-dimensional discrete groups of symmetries. The rich theory was developed mainly in the 19th century in connection with the study of symmetries of crystals and molecules.

12.H.13. We want to transfer 4-bit messages with a binary code which should be able to correct all single and double errors. What is the minimum possible length of the codewords (all codewords have to be of the same length)? Why?

Solution. We proceed similarly as in the above exercise. If the code is to correct double errors as well, then the minimum Hamming distance of any two codewords must be at least five. This means that if we take two different codewords and flip up to two bits in each, the resulting words must still be different. Denoting by n the length of the words, we get the inequality

2^4 (1 + n + \binom{n}{2}) ≤ 2^n.

The least value of n for which it is satisfied is n = 10, so the codewords must be at least 10 bits long. □

I. Extension of the stereographic projection

Let us try to extend the definition of the stereographic projection so that the circle is parametrized by the points of P1(R). We look at the corresponding mapping P1(R) → P2(R). The points of projective extensions are described in the so-called homogeneous coordinates, which are given up to a non-zero multiple. For instance, the points of P2(R) are written (x : y : z). The circle in the plane z = 1 is given as the intersection of the cone of directions defined by x^2 + y^2 − z^2 = 0 with this plane. The inverse of the stereographic projection (i.e., our parametrization of the circle) can be described as

(t : 1) ↦ ( 2t/(1 + t^2) : (t^2 − 1)/(t^2 + 1) : 1 ) = ( 2t : t^2 − 1 : t^2 + 1 ).

For t ≠ 0, we have (t : 1) = (2t^2 : 2t), and the original stereographic projection (i.e., the inverse of the above mapping) can be written linearly as (x : y : z) ↦ (y + z : x), which extends our parametrization to the improper point: (1 : 0) ↦ (0 : 1 : 1). Then, the mapping of P1(R) onto the circle takes the form

P1(R) ∋ (x : y) ↦ (2xy : x^2 − y^2 : x^2 + y^2) ∈ P2(R).
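A quick symbolic check of these formulas (a sympy sketch of ours, not part of the text):

    from sympy import symbols, simplify

    t = symbols('t')
    x, y, z = 2*t, t**2 - 1, t**2 + 1      # the parametrization (2t : t^2 - 1 : t^2 + 1)
    print(simplify(x**2 + y**2 - z**2))    # 0: the image lies on the cone
    print(simplify((y + z) / x))           # t: the projection (x : y : z) -> (y + z : x)
                                           #    recovers the parameter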
Now, let us look at how simply the formula for the stereographic projection can be calculated directly in the projective extensions (see 4.4.1): We include P1(R) as the points with homogeneous coordinates (t : 0 : 1), and among the linear combinations of the point (0 : 1 : −1) (i.e., the pole from which we project) and (x : y : z) (a general point of the circle), we must find the one whose coordinates are (u : 0 : v). The only possibility is the point (x : 0 : z + y), which recovers our previous formula.

12.4.5. Group homomorphisms. Recall that a mapping f : G → H from a group G to a group H is called a group homomorphism if and only if it respects the operation, i.e., f(a · b) = f(a) · f(b) for all a, b ∈ G. Note that the operation on the left-hand side is the operation in G, before f is applied, while the operation on the right-hand side is the operation in H, after f is applied. The following properties of homomorphisms follow easily from the definition:

Proposition. Every group homomorphism f : G → H satisfies:
(1) the identity of G is mapped to the identity of H,
(2) the inverse of an element of G is mapped to the inverse of its image, i.e., f(a^{-1}) = f(a)^{-1},
(3) the image of a subgroup K ⊆ G is a subgroup f(K) ⊆ H,
(4) the preimage f^{-1}(K) ⊆ G of a subgroup K ⊆ H is again a subgroup,
(5) if f is also a bijection, then the inverse mapping f^{-1} is also a homomorphism,
(6) f is injective if and only if f^{-1}(e_H) = {e_G}.

Proof. (3), (2), and (1). If K ⊆ G is a subgroup, then for each y = f(a), z = f(b) in H with a, b ∈ K, the product y · z = f(a · b) also lies in the image. In particular, f(b) = f(e · b) = f(e) · f(b) and similarly f(b) = f(b) · f(e). Thus f(e) is the unit in the image f(K). Finally, f(e) = f(a · a^{-1}) = f(a) · f(a^{-1}), so f(a^{-1}) is a right inverse of f(a). Similarly, it is a left inverse, too, and the first three claims are proved.

We proceed similarly in the case of preimages: if a, b ∈ G satisfy f(a), f(b) ∈ K ⊆ H, then also f(a · b) ∈ K.

Suppose there exists an inverse mapping g = f^{-1}. Fix arbitrary y = f(a), z = f(b) ∈ H. Then f(a · b) = y · z = f(a) · f(b), which is equivalent to g(y) · g(z) = a · b = g(y · z). Thus the inverse mapping is also a homomorphism.

If f(a) = f(b), then f(a · b^{-1}) = e_H. Therefore, if the only element mapped to e_H is e_G, then a · b^{-1} = e_G, i.e., a = b. The other implication is trivial. □

The subgroup f^{-1}(e_H) (the preimage of the identity in H) is called the kernel of the homomorphism f and is denoted ker f. A bijective group homomorphism is called a group isomorphism. It follows directly from the above that a homomorphism f : G → H with a trivial kernel is an isomorphism onto the image f(G).

J. Elliptic curves

A singular point of a hypersurface in P^n, defined by a homogeneous polynomial F(x_0, x_1, . . . , x_n) = 0, is a point of the hypersurface which satisfies ∂F/∂x_i = 0 for i = 0, 1, . . . , n. From the geometric point of view, "something weird" happens at such a point. In the case of a curve in the projective space P2(R), the condition that all the partial derivatives vanish means that there is no tangent line to the curve at the given point: the curve has a so-called cusp there or intersects itself. A "nice" singularity can be seen in the "quatrefoil", i.e., the variety given by the zero set of the polynomial (x^2 + y^2)^3 − 4x^2y^2 in R^2. A cusp can be found on the curve in R^2 given by x^3 − y^2 = 0.
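Singular points of affine curves are easy to test for: the polynomial and its gradient must vanish simultaneously. A small sympy sketch (our own illustration) checking the cusp above, and previewing exercise 12.J.1 below:

    from sympy import symbols, diff

    x, y = symbols('x y')

    def check(f, point):
        """Evaluate f and its gradient at the given point."""
        grad = [diff(f, v) for v in (x, y)]
        return f.subs(point), [g.subs(point) for g in grad]

    # the cuspidal cubic x^3 - y^2 = 0: the origin is singular
    print(check(x**3 - y**2, {x: 0, y: 0}))
    # the curve y^2 = x^3 - 3x + 2 has 4a^3 + 27b^2 = 0 (a = -3, b = 2);
    # cf. 12.J.1 below: the point (1, 0) is singular
    print(check(y**2 - x**3 + 3*x - 2, {x: 1, y: 0}))
    # both print (0, [0, 0]): point on the curve, vanishing gradient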
12.4.6. Examples. The additive group Zk of residue classes modulo k is isomorphic to the group of k-th roots of unity, and also to the group of rotations by integer multiples of 2π/k. Draw a diagram; calculating with the complex units e^{2πi/k} is very efficient.

The mapping exp : R → R^+ is an isomorphism of the additive group of the real numbers onto the multiplicative group of the positive real numbers. This isomorphism extends naturally to a homomorphism exp : C → C \ {0} of the additive group of the complex numbers onto the multiplicative group of the non-zero complex numbers. However, this homomorphism has a non-trivial kernel. The restriction of exp to the purely imaginary numbers (a subgroup isomorphic to R) is the homomorphism it ↦ e^{it} = cos t + i sin t. This means that the numbers 2kπi, k ∈ Z, lie in the kernel. Nothing else lies in the kernel: if e^{s+it} = e^s · e^{it} = 1 for real numbers s and t, then e^s = 1, i.e., s = 0, and then t = 2kπ for an integer k.

The determinant of a matrix is a mapping which assigns, to each square matrix of scalars in K, a scalar in K (the cases K = Z, Q, R, C have already been worked with). The Cauchy theorem about the determinant of a product of square matrices, det(A · B) = (det A) · (det B), can also be read as the fact that for the group G = GL(n, K) of invertible matrices, the mapping det : G → K \ {0} is a group homomorphism.

12.4.7. Group product. Given any two groups, a more complicated group can be constructed using the following construction:

Group product

For any groups G, H, the group product G × H is defined as follows: the underlying set is the Cartesian product G × H, and the operation is defined componentwise, that is, (a, x) · (b, y) = (a · b, x · y), where the left-hand operation is the one being defined, while the right-hand operations are those in G and H, respectively.

The projections onto the components G and H of the product, p_G : G × H ∋ (a, b) ↦ a ∈ G, p_H : G × H ∋ (a, b) ↦ b ∈ H, are surjective homomorphisms, whose kernels are ker p_G = {(e_G, b); b ∈ H} ≃ H, ker p_H = {(a, e_H); a ∈ G} ≃ G.

The group Z6 is isomorphic to the product Z2 × Z3. This can be seen easily in the multiplicative realization of the groups Zk as the complex k-th roots of unity. Z6 consists of the points of the unit circle that form the vertices of a regular hexagon. Then, Z2 corresponds to ±1, while Z3 corresponds to the equilateral triangle one of whose vertices is the number 1.

An elliptic curve C is the set of points in K^2, where K is a given field, which satisfy an equation of the form y^2 = x^3 + ax + b, where a, b ∈ K. In addition, we require that there are no singularities, which means, over the field of real numbers, that Δ = −16(4a^3 + 27b^2) ≠ 0. The expression Δ is called the discriminant of the equation. Note that the right-hand side is a cubic polynomial without the quadratic term. This form of the equation is called the Weierstrass equation of an elliptic curve.

12.J.1. Prove that the curve y^2 = x^3 + ax + b in R^2 has a singularity if and only if 4a^3 + 27b^2 = 0.

Solution. The equation of the curve in homogeneous coordinates (see 4.4.1) is F(x, y, z) = 0, where

(1) F(x, y, z) = y^2 z − x^3 − axz^2 − bz^3.

We have ∂F/∂x = −3x^2 − az^2, ∂F/∂y = 2yz, ∂F/∂z = y^2 − 2axz − 3bz^2. Let [x, y, z] be a singular point of the given curve. If z = 0, then, since the partial derivatives of F with respect to x and z must be zero, we get x = 0 and y = 0, respectively. However, this is excluded, because the point [0, 0, 0] does not lie in the considered projective space P2(R).
Thus, a singular point has z ≠ 0, so ∂F/∂y = 0 implies y = 0. Denoting γ = x/z, the equation −3x^2 − az^2 = 0 implies 3γ^2 = −a, and y^2 − 2axz − 3bz^2 = 0 implies 2aγ = −3b. We can see that a = 0 implies b = 0, i.e., the equality 4a^3 + 27b^2 = 0 is satisfied trivially. If a ≠ 0, then we can express γ from the two obtained equations: from the second one, γ = −3b/(2a), and from the first one, γ^2 = −a/3. Altogether,

γ^2 = −a/3 = 9b^2/(4a^2)  ⟹  4a^3 + 27b^2 = 0.

Thus, we have proved one of the implications. On the other hand, if 4a^3 + 27b^2 = 0, then defining γ = −3b/(2a) (so that γ^2 = 9b^2/(4a^2) = −a/3), we can see that the point [γ, 0, 1] satisfies the equation of the elliptic curve:

γ^3 + aγ + b = γ · γ^2 + aγ + b = (−3b/(2a)) · (−a/3) + a · (−3b/(2a)) + b = b/2 − 3b/2 + b = 0.

Thanks to the choice of γ, all three partial derivatives of F at the point [γ, 0, 1] are zero. □

In order to define a group operation on the points of an elliptic curve, it is useful to consider the curve in the projective extension of the plane (see 4.4.1), and we define a point

If each point is identified with the rotation that maps 1 to that point, then the composition of such rotations is always commutative. Composing a rotation from Z2 with a rotation from Z3 yields exactly all the rotations of Z6. Draw a diagram! This leads to the following isomorphism (using additive notation, as is common for residue classes):

[0]_6 ↦ ([0]_2, [0]_3), [1]_6 ↦ ([1]_2, [2]_3), [2]_6 ↦ ([0]_2, [1]_3), [3]_6 ↦ ([1]_2, [0]_3), [4]_6 ↦ ([0]_2, [2]_3), [5]_6 ↦ ([1]_2, [1]_3).

Similar constructions are available for finite commutative groups in complete generality.

12.4.8. Commutative groups. Any element a of a group G is contained in the minimal subgroup {. . . , a^{-2}, a^{-1}, e, a, a^2, a^3, . . .} which contains it. Clearly, this subgroup is commutative. If G is finite, then a^k = e for some positive integer k. The least positive integer k with this property is called the order of the element a in G.

A cyclic group G is one which is generated by one of its elements a in the above manner. If the order k of the generator is finite, the result is one of the groups Ck, known from the discussion of symmetries of plane figures. It follows directly from the definition that every cyclic group is isomorphic either to the group of integers Z (if it is infinite) or to one of the groups of residue classes Zk (if it is finite). These simple building stones suffice to create all finite commutative groups.

Theorem. Every finite commutative group G is isomorphic to a product of cyclic groups Ck. The orders of the components Ck are always powers of the prime divisors of the number of elements n = |G|, and this product decomposition is unique up to order. If n = p_1^{k_1} · · · p_r^{k_r} is the prime factorization of n, then the group Cn is isomorphic to the product

C_n = C_{p_1^{k_1}} × · · · × C_{p_r^{k_r}}.

Incomplete proof. We are going to prove only the second claim now and return to the first claim later, see 12.4.12, 12.4.13. For a simpler case, suppose n = pq with p coprime to q. Fix a generator a of the group Cn, a generator b of Cp, and a generator c of Cq. Define the mapping f : C_n → C_p × C_q by f(a^k) = (b^k, c^k). Since a^k · a^ℓ = a^{k+ℓ}, and similarly for b and c, it follows that f(a^k · a^ℓ) = (b^{k+ℓ}, c^{k+ℓ}) = (b^k, c^k) · (b^ℓ, c^ℓ), so the mapping f is a homomorphism. If the image is the identity, then k must be a multiple of p as well as of q.
O ∈ C as the direction (0, 1) (i.e., the point [0 : 1 : 0] in homogeneous coordinates). Then, the addition of two points A, B ∈ C is geometrically defined as the point −C, where C is the third intersection point of the line AB with the elliptic curve. If A = B, the role of the line AB is played by the tangent line to the elliptic curve at A.

12.J.2. Prove that the above definition correctly defines an operation on the points of an elliptic curve.

Solution. The intersections of a line with the elliptic curve are obtained as the roots of a cubic equation. If it has two real roots, corresponding to the points A and B, then it must have a third real root as well (the complex roots of a real cubic come in conjugate pairs), i.e., the line AB must have another intersection point with the curve. In the case of a tangent line, the point A corresponds to a double root, so there again exists another intersection point. As for the improper points (the last homogeneous coordinate is zero; they correspond to directions in the plane), the only improper point that belongs to the curve given by equation (1) is the point O = [0, 1, 0]. Addition with the point O means looking for the second intersection of the elliptic curve (besides the point A itself) with the line which goes through A and is parallel to the y-axis. The improper line z = 0 has a triple intersection point O with the curve, i.e., O + O = O. □

Remark. Thus, the operation is well-defined. Moreover, it follows directly from the definition that it is commutative, and it even follows from the above that O is a neutral element of the operation. However, the proof of associativity is far from trivial.

12.J.3. Define the above operation algebraically.

Solution. For any point A ∈ C, we define A + O = O + A = A. For a point A = [α, β, 1] ∈ C, the point B = [α, −β, 1] clearly lies on C as well, and we define A + B = O, i.e., B = −A. For points A = [α, β, 1] and B = [γ, δ, 1] on C with A ≠ −B, we set

k = (β − δ)/(α − γ) for A ≠ B,    k = (3α^2 + a)/(2β) for A = B,
σ = k^2 − α − γ,    τ = −β + k(α − σ).

Since p and q are coprime, k is a multiple of n, so f is injective. Moreover, the group Cn has the same number of elements as Cp × Cq, so f is an isomorphism. Finally, the proposition about the decomposition of cyclic groups of order k into smaller cyclic groups follows by induction on the number of distinct primes p_i in the factorization of n. □

Notice that C_{p^2} is never isomorphic to the product C_p × C_p: while C_{p^2} is generated by an element of order p^2, the highest order of an element in C_p × C_p is only p.

Since every finite commutative group is isomorphic to a product of cyclic groups, it is possible, for a given number of elements, to enumerate all commutative groups of that order up to isomorphism. For instance, there are only two commutative groups of order 12: C12 = C4 × C3 and C2 × C2 × C3 = C2 × C6. Notice similarly that if all elements (except the identity) of a finite commutative group G have order 2, then G has the form (C2)^n for an integer n. In particular, such a group G has 2^n elements. If the decomposition of G into cyclic groups contained a group C_{p^k} with p^k > 2, then G would contain elements of higher order.

12.4.9. Subgroups and cosets. Selecting any subgroup H of a group G gives further information about the structure of the whole group. A binary relation ∼_H on G can be defined as follows: a ∼_H b if and only if b^{-1} · a ∈ H. This relation expresses when two elements of G are "the same" up to multiplication by an element of H from the right.
It is easily verified that this relation is an equivalence: clearly, a^{-1} · a = e ∈ H, so it is reflexive. If b^{-1} · a = h ∈ H, then a^{-1} · b = (b^{-1} · a)^{-1} = h^{-1} ∈ H, so it is symmetric as well. Finally, if c^{-1} · b ∈ H and b^{-1} · a ∈ H, then c^{-1} · a = (c^{-1} · b) · (b^{-1} · a) ∈ H, so it is transitive, too.

It follows that G partitions into the left cosets of mutually equivalent elements with respect to the subgroup H. The coset corresponding to an element a is denoted a · H, and a · H = {a · h; h ∈ H}, since an element b is equivalent to a if and only if it can be expressed this way. The corresponding partition of G (i.e., the set of all left cosets) with respect to H is denoted G/H. Similarly, right cosets H · a can be defined. The corresponding equivalence relation is given by a ∼ b if and only if a · b^{-1} ∈ H. Hence, H \ G = {H · a; a ∈ G}.

Proposition. Let G be a group and H a subgroup of G. Then:
(1) The left cosets with respect to H coincide with the right cosets with respect to H if and only if for each a ∈ G,

Then, we define A + B = [σ, τ, 1]. We leave it to the reader to verify that this is indeed the operation that we defined geometrically. □

K. Gröbner bases

12.K.1. Is the basis g1 = x^2, g2 = xy + y^2 a Gröbner basis for the lexicographic ordering x > y? If not, find one.

Solution. Clearly, the leading monomials are LT(g1) = x^2, LT(g2) = xy, so the S-polynomial is equal to S(g1, g2) = y·g1 − x·g2 = −xy^2. By theorem 12.6.12, g1, g2 is a Gröbner basis if and only if the remainder in the multivariate division of this S-polynomial by the basis polynomials is zero. Performing this division (see 12.6.6), we obtain S(g1, g2) = −y·g2 + y^3. The remainder y^3 shows that g1, g2 do not form a Gröbner basis. By 12.6.13, in order to get one, we must add the remainder polynomial g3 = y^3 to g1, g2. Now, we calculate that S(g1, g3) = y^3·g1 − x^2·g3 = 0 and S(g2, g3) = y^2·g2 − x·g3 = y^4 = y·g3. Hence it follows by theorem 12.6.12 that g1, g2, g3 is already a Gröbner basis. □

12.K.2. Is the basis g1 = xy − 2y, g2 = y^2 − x^2 a Gröbner basis for the lexicographic ordering y > x? If not, find one.

Solution. Since LT(g1) = xy and LT(g2) = y^2, the corresponding S-polynomial is S(g1, g2) = y·g1 − x·g2 = x^3 − 2y^2 = −2g2 + x^3 − 2x^2. The leading term x^3 is a multiple of neither xy nor y^2, which means that g1, g2 do not form a Gröbner basis. We can obtain one by adding the polynomial g3 = x^3 − 2x^2. Then, we have S(g1, g3) = x^2·g1 − y·g3 = 0 and S(g2, g3) = x^3·g2 − y^2·g3 = 2y^2x^2 − x^5 = (4y + 2xy)g1 − (x^2 + 2x + 4)g3 + 8g2, which reduces to zero, so g1, g2, g3 is a Gröbner basis. □

12.K.3. Eliminate variables in the ideal I = ⟨x^2 + y^2 + z^2 − 1, x^2 + y^2 + z^2 − 2x, 2x − y − z⟩.

Solution. The variable elimination is obtained by finding a Gröbner basis with respect to the lexicographic monomial ordering. Let us denote the generating polynomials of I by g1, g2, g3, respectively. The reduction g2 = g1 + 1 − 2x yields the reduced polynomial f1 = 2x − 1. Now, we use this polynomial to reduce g3 = f1 + 1 − y − z to f2 = y + z − 1. Finally, we reduce g1, dividing it by f1 and f2, which leads to g1 = (x/2 + 1/4)f1 + y^2 + z^2 − 3/4 and y^2 + z^2 − 3/4 = (y − z + 1)f2 + 2z^2 − 2z + 1/4.

h ∈ H, a · h · a^{-1} ∈ H.
(2) Each coset (left or right) has the same cardinality as the subgroup H.

Proof. Both properties are direct consequences of the definition. In the first case, for any a ∈ G, h ∈ H, an element h′ ∈ H is required so that h · a = a · h′. This occurs if and only if a^{-1} · h · a = h′ ∈ H.
In the second case, if a · h = a · h′, then multiplication by a^{-1} from the left yields h = h′. □

As an immediate corollary of the above statement, we get the following extremely useful results:

12.4.10. Theorem. Let G be a finite group with n elements and H a subgroup of G. Then:
(1) the cardinality n = |G| is the product of the cardinality of H and the cardinality of G/H, i.e., |G| = |G/H| · |H|,
(2) the integer |H| divides n,
(3) if a ∈ G is of order k, then k divides n,
(4) for each a ∈ G, a^n = e,
(5) if n is prime, then G is isomorphic to the cyclic group Zn.

The second proposition is called Lagrange's theorem, and the fourth proposition is called Fermat's little theorem; special cases are discussed in the previous chapter on number theory.

Proof. Each left coset has exactly |H| elements, and different cosets are disjoint. Hence the first proposition follows, and the second is its direct corollary. Each element a ∈ G generates the cyclic subgroup {a, a^2, . . . , a^k = e}, and the order of this subgroup is exactly the order of a. Therefore, the order of a divides the number of elements of G. Since the order k of any element a divides n and a^k = e, also a^n = (a^k)^s = e, where n = ks. Finally, if n > 1 is prime, then there exists an element a ∈ G different from e. Its order k is an integer greater than one which divides n, so k must equal n. This means that all the elements of G are of the form a^ℓ for ℓ = 1, . . . , n. □

12.4.11. Normal subgroups and quotient groups. A subgroup H which satisfies a · h · a^{-1} ∈ H for all a ∈ G, h ∈ H, is called a normal subgroup. For normal subgroups, an operation on G/H can be defined by (a · H) · (b · H) = (a · b) · H. Choosing other representatives a · h, b · h′ leads to the same result: (a · h · b · h′) · H = ((a · b) · (b^{-1} · h · b) · h′) · H = (a · b) · H.

Hence, f3 = 8z^2 − 8z + 1. We see that we could make do with polynomial reductions alone and did not have to add any further polynomials. The basis of I with eliminated variables is I = ⟨2x − 1, y + z − 1, 8z^2 − 8z + 1⟩. □

12.K.4. Solve the following system of polynomial equations: x^2 y − z^3 = 0, 2xy − 4z = 1, z − y^2 = 0, x^3 − 4yz = 0.

Solution. Using appropriate software, we find out that the corresponding ideal ⟨x^2 y − z^3, 2xy − 4z − 1, z − y^2, x^3 − 4yz⟩ has the Gröbner basis ⟨1⟩ with respect to the lexicographic monomial ordering, which means that the system has no solution. □

12.K.5. Find the Gröbner basis of the variety in R^3 defined parametrically as x = 3u + 3uv^2 − u^3, y = 3v + 3u^2 v − v^3, z = 3u^2 − 3v^2. This is the so-called Enneper surface, depicted in the picture on page 1119.

Solution. Applying the elimination procedure (e.g., in the MAPLE system, using gbasis with the plex ordering), we obtain the corresponding implicit representation, i.e., an equation with a single polynomial of degree nine:

−59049z − 104976z^2 − 6561y^2 − 72900z^3 − 18954y^2 z − 23328z^4 + 32805z^2 x^2 + 14580z^3 x^2 + 3645z^4 x^2 − 1296y^4 z − 16767y^2 z^2 − 6156y^2 z^3 − 783y^2 z^4 + 39366z x^2 + 19683x^2 − 1296y^4 − 2430z^5 + 432z^6 + 108z^7 + 486z^5 x^2 − 432y^4 z^2 + 54y^2 z^5 + 27z^6 x^2 − 48y^4 z^3 + 15y^2 z^6 − 64y^6 − z^9 = 0. □
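Computations like those in 12.K.1–12.K.5 can be reproduced with any computer algebra system. A small sketch of ours using sympy's groebner (sympy returns the reduced basis, so the outputs below are what we expect, up to scaling):

    from sympy import groebner, symbols

    x, y, z = symbols('x y z')

    # 12.K.1: the reduced basis gains the remainder polynomial y^3
    print(groebner([x**2, x*y + y**2], x, y, order='lex'))
    # expected: GroebnerBasis([x**2, x*y + y**2, y**3], ...)

    # 12.K.3: elimination in lex order; cf. <2x - 1, y + z - 1, 8z^2 - 8z + 1>
    print(groebner([x**2 + y**2 + z**2 - 1, x**2 + y**2 + z**2 - 2*x, 2*x - y - z],
                   x, y, z, order='lex'))
    # expected: GroebnerBasis([2*x - 1, y + z - 1, 8*z**2 - 8*z + 1], ...)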
As we illustrate with the following simple exercise, Gröbner bases can also be used for solving integer optimization problems.

12.K.6. What is the minimum number of banknotes that are necessary to pay 77700 CZK? Solve this problem for three scenarios: First, assume that the banknotes at our disposal have the values 100 CZK, 200 CZK, 500 CZK, and 1000 CZK. Then, assume that there are also banknotes of value 2000 CZK. Finally, assume that there are no banknotes of value 2000 CZK, but there are banknotes of value 5000 CZK.

Solution. Let us denote the respective banknotes by the variables s, d, p, t, D, P. The banknotes to be used will be represented as a polynomial in these variables, where the exponent of each variable determines the number of the corresponding banknotes. For instance, if we decide to use only the 100 CZK banknotes, then the polynomial will be q = s^777. If we pay with ten 1000 CZK banknotes, ten 500 CZK banknotes,

Moreover, the cosets can be written as H · a · H, and the equation (H · a) · (b · H) = H · (a · b) · H is straightforward. On the other hand, the definition of the product on the cosets fails if H is not normal.

Clearly, this new operation on G/H satisfies all the group axioms: the identity is the group H itself (formally, it is the coset e · H that corresponds to the identity e of G), the inverse of a · H is a^{-1} · H, and the associativity is clear from the definition. This is called the quotient group G/H of G by the normal subgroup H. Of course, in commutative groups, every subgroup is normal. The subset nZ = {na; a ∈ Z} ⊆ Z is a subgroup of the integers, and the corresponding quotient group is the (additive) group Zn of residue classes.

It is clear from the definition that the kernel of every homomorphism is a normal subgroup. On the other hand, if a subgroup H ⊆ G is normal, then the mapping p : G → G/H, a ↦ a · H, is a surjective homomorphism whose kernel is H. It can be seen directly from the definition of the operation on G/H that p is a homomorphism, and it is clearly surjective. It follows that normal subgroups are precisely the kernels of homomorphisms. Moreover, for any group homomorphism f : G → K with kernel H = ker f, there is a well-defined homomorphism ˜f : G/H → K, ˜f(a · H) = f(a), which is injective. If H is any normal subgroup of G contained in ker f, the latter homomorphism is still well defined, but not necessarily injective.

There is a seemingly paradoxical example of a group homomorphism C* → C*, defined on the non-zero complex numbers by z ↦ z^k, where k is a fixed positive integer. Clearly, this is a surjective homomorphism, and its kernel is the set of the k-th roots of unity, i.e., the cyclic subgroup Zk. Reasoning as above, there is an isomorphism ˜f : C*/Zk → C* for any positive integer k. This example illustrates that in the case of infinite groups, calculations with cardinalities are not as intuitive as in the case of finite groups and theorem 12.4.10.

12.4.12. Exact sequences. A normal subgroup H of a group G yields the short exact sequence of groups

e → H → G → G/H → e,

where the arrows respectively correspond to the only homomorphism of the trivial group {e} into the group H, the inclusion ι of the subgroup H ⊆ G, the projection ν onto the quotient group G/H, and the only homomorphism of the group G/H onto the trivial group {e}. In each case, the image of one arrow is precisely the kernel of the following one. This is the definition of exactness of a sequence of homomorphisms.

and the remaining amount with 100 CZK banknotes, then the polynomial will be q = t^{10} p^{10} s^{627}. In the former case, the number of banknotes is 777; in the latter, it is 10 + 10 + 627 = 647. If we have only the banknotes s, d, p, t, then the ideal that describes the relations among the individual banknotes is I1 = ⟨s^2 − d, s^5 − p, s^{10} − t⟩.
In order to minimize the number of banknotes used, we compute the Gröbner basis with respect to the graded reverse lexicographic ordering (we want to eliminate the small banknotes): G1 = (p^2 − t, s^2 − d, d^3 − sp, sd^2 − p). Now, we take any polynomial that represents a given choice of banknotes. Reducing this polynomial with respect to the basis G1, we get a polynomial whose degree is minimal for our monomial ordering, and it is easy to show that it is the polynomial corresponding to the optimal choice. For instance, take q = s^777. Reduction with respect to G1 yields t^{77} p d. This means that the optimal choice is seventy-seven 1000 CZK banknotes, one 500 CZK banknote, and one 200 CZK banknote. Altogether, it is 79 banknotes.

In the second scenario, when we also have the banknote D, the ideal is I2 = ⟨s^2 − d, s^5 − p, s^{10} − t, s^{20} − D⟩ and its Gröbner basis is G2 = (t^2 − D, p^2 − t, s^2 − d, d^3 − sp, sd^2 − p). Reduction of q = s^777 with respect to G2 gives D^{38} t p d, so this time we pay with 41 banknotes. In the third scenario, we have I3 = ⟨s^2 − d, s^5 − p, s^{10} − t, s^{50} − P⟩ and G3 = (t^5 − P, p^2 − t, s^2 − d, d^3 − sp, sd^2 − p), and the reduction equals P^{15} t^2 p d. In this case, we need only 19 banknotes.

Of course, this simple problem can be solved quickly with common sense. However, the presented method of Gröbner bases gives a universal algorithm which can be applied automatically to higher amounts and other, more complicated cases. □

Gröbner bases have applications in robotics as well. In particular, they appear in inverse kinematics, where one must find how to set the individual joints of a robot so that it reaches a given position. This problem often leads to a system of non-linear equations which can be solved by finding a Gröbner basis, as in the following problem.

12.K.7. Consider a simple robot, as shown in the picture, which consists of three straight parts of length 1, connected by independent joints that allow arbitrary angles α, β, γ. We want this robot to grasp, from above, an object which lies on the ground at distance x. What values should the angles α, β, γ be set to? Draw the configuration of the robot for x = 1, 1.5, and √3.

Solution. Consider the natural coordinate system where the initial end of the robotic arm lies at the origin and the

If there exists a homomorphism σ : G/H → G such that ν ◦ σ = id_{G/H}, it is said that the exact sequence splits.

Lemma. Every split short exact sequence of commutative groups defines an isomorphism G → H × G/H.

Proof. Define a mapping f : H × G/H → G by f(a, b) = a · σ(b). Since the groups are commutative, f is a homomorphism: f(aa′, bb′) = aa′σ(b)σ(b′) = (aσ(b))(a′σ(b′)). If f(a, b) = e, then σ(b) = a^{-1} ∈ H, i.e., b = ν(σ(b)) is the identity in G/H. However, its image is then σ(b) = e, so a = e. Since the left and right cosets of commutative groups coincide, the mapping f is surjective. Hence f is an isomorphism. □

Now, the main idea of the proof of theorem 12.4.8 can be indicated. If it is known that every short exact sequence created by choosing a cyclic subgroup H of a finite commutative group G splits, then it is easy to proceed with the proof by induction. If G is a group of order n which is not cyclic, then an element a of order p, p < n, can be selected. The cyclic subgroup H generated by a can be found, as well as the splitting of the corresponding short exact sequence. This expresses the group G as the product of the selected cyclic subgroup H and the group G/H of order n/p.
The main technical point of the proof is the verification that in each finite commutative group there are elements of order p^r for appropriate powers of the primes p, and that the short exact sequences for these groups really split.

12.4.13. Return to finite Abelian groups. Below is a brief exposition of the complete proof of the classification theorem, broken into several steps. The following lemma provides the cyclic subgroups of prime order that are required.

Lemma (Claim 1). Let G be a finite Abelian group with n elements. If p is a prime which divides n, then there is an element g ∈ G of order p.¹¹

Proof. The claim is obvious if n is prime, i.e., G = Zp (as proved above). If n is not prime, proceed by induction on n. Clearly, G must have a proper non-trivial subgroup H, |H| = m < n. Either p | m or p | (n/m). In the former case, the claim follows directly from the induction hypothesis. Otherwise, assume p | (n/m) (recall that n/m is the order of G/H). Then there is an element g ∈ G such that the order of g · H in the quotient group G/H is p. Thus g^p ∈ H, and therefore the order of g in G divides p|H|. Since p is a prime, the order of g is ℓp for some integer ℓ. Hence the element g^ℓ has the required order p. □

¹¹ This is a special version of a more general result valid for all finite groups, called the Cauchy theorem. The formulation remains the same, with the word Abelian omitted.

ground corresponds to the x-axis. It follows from elementary trigonometry that the total x-range of the robot at angles α, β, γ is equal to

x = sin α + sin(α + β) + sin(α + β + γ).

Similarly, the range of the robot in the vertical direction is

y = cos α + cos(α + β) + cos(α + β + γ).

The condition of grasping the object from above is clearly equivalent to the condition α + β + γ = π, so the statement of the problem leads to the system

sin α + sin(α + β) = x,
cos α + cos(α + β) − 1 = 0.

In order to transform this into a system of polynomial equations, we introduce the new variables s1 = sin α, c1 = cos α, s2 = sin β, c2 = cos β, which of course satisfy s1^2 + c1^2 = 1 and s2^2 + c2^2 = 1. Using the basic trigonometric identities for sums of arguments, we obtain the following equivalent system of polynomial equations:

s1 + s1 c2 + c1 s2 − x = 0,
c1 + c1 c2 − s1 s2 − 1 = 0,
s1^2 + c1^2 − 1 = 0,
s2^2 + c2^2 − 1 = 0.

The Gröbner basis of the corresponding ideal can be found in a moment using appropriate software. For the graded reverse lexicographic ordering with s1 > c1 > s2 > c2, we get the basis

(2c2 + 1 − x^2, 2c1(1 + x^2) − 2s2 x − 1 − x^2, 2s1(1 + x^2) + 2s2 − x − x^3, 4s2^2 − 3 − 2x^2 + x^4),

and hence it is easy to calculate the values of the variables in dependence on x. For example, we can immediately see that c2 = (x^2 − 1)/2, i.e., β = arccos((x^2 − 1)/2). In particular, it is clear that the problem has no solution for |x| > √3. Specifically, for |x| < √3, there are 2 solutions, and for |x| = √3, there is one solution (α = π/3, β = 0, γ = 2π/3 for positive x; α = −π/3, β = 0, γ = 4π/3 for negative x). For x = 1, we get the solution α = 0, β = π/2, γ = π/2 and the degenerate solution α = π/2, β = −π/2, γ = π. The case x = −1 is similar. It is good to realize that for |x| < 1, one of the solutions always corresponds to a configuration where some parts of the robot intersect; for these values of x, there is only one realizable configuration. □

For any prime number p, G is called a p-group if each of its elements has order p^k for some k.
Claim 1 has an obvious corollary:

Lemma (Claim 2). A finite Abelian group G is a p-group if and only if its number of elements n is a power of p.

Proof. One implication follows straight from Lagrange's theorem, since all divisors of a power of the prime p are just smaller powers of p. On the other hand, if n is not a power of a prime, it has another prime divisor q, and so there is an element of order q by Claim 1. □

Now it can be shown that a given finite Abelian group G can always be decomposed into a product of p-groups.

Lemma (Claim 3). If G is a finite Abelian group, then it is isomorphic to a product of p-groups. This decomposition is unique up to order.

Proof. Consider a prime p dividing n = |G|. Define Gp to be the subgroup of all elements whose orders are powers of p, while G′p is the subgroup of all elements whose orders are not divisible by p (check yourself that these are indeed subgroups). By Claim 1, the subgroup Gp is not trivial. Next, consider an element g of order qp^ℓ with q not divisible by p. Then g^{p^ℓ} has order q, so this element belongs to G′p, while g^q ∈ Gp. The Bezout equality guarantees that there are integers r and s such that rp^ℓ + sq = 1. Hence g = g^{rp^ℓ} · g^{sq} is a decomposition of g into a product of elements of G′p and Gp. This verifies G ≃ Gp × G′p, and Gp is a p-group. This process can be repeated for the subgroup G′p and the remaining primes in the decomposition, which completes the proof. The uniqueness claim is obvious. □

It remains to consider the p-groups only. The next claim shows that the p-groups which are not cyclic must have more than one subgroup of order p.

Lemma (Claim 4). If a finite Abelian p-group G has just one subgroup H with |H| = p, then it is cyclic.

Proof. The case p = n = |G| is obvious. Proceed by induction on n. Assume H is the only subgroup of order p, consider σ : G → G, σ(g) = g^p, and write K = ker(σ). Then H ⊆ K, and since p is prime, all non-identity elements of K have order p. For any e ≠ g ∈ K, the cyclic group generated by g has order p and so coincides with H; consequently, H = K. If G ≠ K, then σ(G) is a non-trivial subgroup of G which must be isomorphic to G/K. By Claims 1 and 2, there is a subgroup of σ(G) of order p. This yields a subgroup of G, and by the assumption, it is again H. Finally, apply the induction hypothesis to the group σ(G) ≃ G/H, which therefore has to be cyclic. Choosing a generator g · H of the latter group, even in G the

Gröbner bases can also be used in software engineering when looking for loop invariants, which are needed for the verification of algorithms, as in the following problem.

12.K.8. Verify the correctness of the algorithm for the product of two integers a, b.

    (x, y, z) := (a, b, 0);
    while not (y = 0) do
        if y mod 2 = 0 then
            (x, y, z) := (2*x, y/2, z)
        else
            (x, y, z) := (2*x, (y-1)/2, x+z)
        end if
    end while
    return z

Solution. Let X, Y, Z denote the initial values of the variables x, y, z, respectively. Then, by definition, a polynomial p is an invariant of the loop if and only if we have p(x, y, z, X, Y, Z) = 0 after each iteration. Such a polynomial can be found using Gröbner bases as follows: Let f1, f2 denote the assignments of the then- and else-branches, respectively, i.e., f1(x, y, z) = (2x, y/2, z) and f2(x, y, z) = (2x, (y − 1)/2, x + z). For n iterations of the first one, we immediately obtain the explicit formula f1^n(x, y, z) = (2^n x, y/2^n, z). In order to transform this iterated function into a polynomial mapping, we introduce the new variables u := 2^n, v := 1/2^n.
Then, f1^n is given by the polynomial mapping F1 : x ↦ ux, y ↦ vy, z ↦ z,

cyclic subgroup generated by g must have a subgroup of order p (again by Claim 1). The uniqueness assumption ensures that this is again the subgroup H. Clearly, |G/H| = |G|/p is the smallest exponent with g^{|G|/p} ∈ H. At the same time, this power is not equal to the unit, since H ⊂ ⟨g⟩. Consequently, the order of g in G is bigger than |G|/p, and so the group G is cyclic. □

Finally, a splitting condition for the p-groups is proved, which provides the property discussed at the end of the previous paragraph on exact sequences. This completes the entire proof of the classification theorem.

Lemma (Claim 5). Let G be a finite Abelian p-group and let C be a cyclic subgroup of maximal order in G. Then G = C × L for some subgroup L.

Proof. If G is cyclic, then C = G, and of course G = C × L with L = {e}. Proceed by induction on n = |G|. Assume G is not cyclic. Then it contains more than one cyclic subgroup of order p, and the subgroup C contains one of them. Choose H to be another subgroup of order p which is not contained in C. Since p is prime, the intersection of H and C is trivial. Consequently, the quotient group (C × H)/H ⊂ G/H is isomorphic to C.

Now consider the induction step. The order of the cyclic subgroup (C × H)/H in G/H must be maximal, since the orders of the elements g · H in the quotient group are divisors of the orders of the elements g in the group G. By the induction hypothesis, G/H = (C × H)/H × K for some subgroup K ⊂ G/H. Clearly, the preimage of K under the quotient projection is a group L satisfying H ⊂ L ⊂ G. Now, the latter identification of G/H with the product implies G = (C · H) · L = C · (H · L) = C · L. At the same time, L ∩ (C · H) = H, and so L ∩ C = {e}. So G = C × L. □

The proof is complete up to the uniqueness claim. It is known already that the decomposition into p-groups is unique. Assume that a p-group G decomposes into two products of cyclic groups H1 × . . . × Hk and H′1 × . . . × H′ℓ with non-increasing orders of the Hi and the H′j. Then the orders of H1 and H′1 coincide, since these are the maximal orders of elements in G. By induction, all the orders coincide, and the work is complete.

The classification theorem is a special case of a more general result on finitely generated Abelian groups. In additive notation, if g1, . . . , gt are generators of the entire G, then all elements of G are of the form a1g1 + · · · + atgt with integer coefficients ai. The general theorem provides a severe restriction on the possible relations between such combinations. In fact, it says that all finitely generated Abelian groups are products of cyclic groups, hence G = Z^ℓ × Z_{p1} × . . . × Z_{pk}. This means that there is always a finite number of completely independent generators of G, and each of them generates a cyclic subgroup in

where the new variables satisfy uv = 1. Clearly, the invariant polynomial must lie in the ideal I1 = ⟨ux − X, vy − Y, z − Z, uv − 1⟩. In order to find such a polynomial, it suffices to eliminate the variables u and v, which can be done just with the Gröbner basis with respect to the graded reverse lexicographic ordering with u > v > x > y > z. This basis equals (xy − XY, z − Z, x − vX, y − uY). Hence F1(xy − XY) = xy − XY and F1(z − Z) = z − Z, and all the polynomials invariant with respect to any number n of applications of f1 are given by polynomials in xy − XY and z − Z. Now, we proceed similarly for f2.
For n iterations, we derive the formula f2^n(x, y, z) = (2^n x, (y + 1)/2^n − 1, (2^n − 1)x + z), and introducing the variables u and v, we get the equivalent polynomial mapping F2 : x ↦ ux, y ↦ v(y + 1) − 1, z ↦ (u − 1)x + z. The invariant polynomial for F2 can be obtained similarly as above, using the Gröbner basis of the corresponding ideal. However, we are interested in those polynomials which are invariant for both F1 and F2. Clearly, these must lie in the ideal I2 = ⟨F2(xy − XY), F2(z − Z), uv − 1⟩. Substituting for F2, we obtain I2 = ⟨uxv(y + 1) − ux − XY, (u − 1)x + z − Z, uv − 1⟩, and with the Gröbner basis of this ideal, we eliminate the variables u and v, thus finding the polynomial xy − XY + z − Z, which is invariant for both F1 and F2, so it is an invariant of the given loop. Since at the beginning we have X = a, Y = b, Z = 0, we see that xy − ab + z = 0 holds at every step of the algorithm. Since the loop terminates only when y = 0, we get that indeed z = ab. □

Now, we present several exercises where we use Gröbner bases to solve various polynomial systems. The primary goal will not be to find the Gröbner basis, but rather to solve the given system.

12.K.9. Using a Gröbner basis, solve the polynomial system

x^3 − 2xy = 0,
x^2 y + x − 2y^2 = 0.

Solution. Let us denote f1 := x^3 − 2xy, f2 := x^2 y + x − 2y^2. The basis (f1, f2) is not a Gröbner basis since, e.g., LM(y f1 − x f2) = x^2 ∉ ⟨x^3, x^2 y⟩ = ⟨LM(f1), LM(f2)⟩. Thus, we have to add the polynomial y f1 − x f2 = −x^2.

G. (Compare this to the description of finite-dimensional vector spaces via their bases, as discussed in chapter 2.)

12.4.14. Group actions. Groups can be considered as sets of transformations of a fixed set. All the transformations are invertible, and the set of transformations must be closed under composition. The idea is to work with a fixed group whose elements are represented as mappings on a fixed set, but the mappings corresponding to different elements of the group need not be different. For instance, the rotations around the origin by all possible angles correspond to the group of real numbers, while the rotation by 2π is the identity as a mapping. Formally, this situation is described as follows:

Group actions

A left action of a group G on a set S is a homomorphism of the group G to the subgroup of invertible mappings in the monoid S^S of all mappings S → S. Such a homomorphism can be viewed as a mapping φ : G × S → S which satisfies φ(a · b, x) = φ(a, φ(b, x)); hence the name "left action". Often, the notation a · x is used for the result of a ∈ G applied to a point x ∈ S (although this is a different dot than the operation inside the group). Then, the defining property reads (a · b) · x = a · (b · x).

The image of a point x ∈ S under the action of the entire group G is called the orbit S_x of x, i.e., S_x = {y = φ(a, x); a ∈ G}. For each point x ∈ S, we define the isotropy subgroup G_x ⊆ G of the action φ (also called the stabilizer subgroup): G_x = {a ∈ G; φ(a, x) = x}. If for every two points x, y ∈ S there is an a ∈ G such that φ(a, x) = y, then the action φ is said to be transitive.

Choosing any two points x, y ∈ S and a g ∈ G which maps x to y = g · x, the set {ghg^{-1}; h ∈ G_x} is clearly the isotropy subgroup G_y. In addition, the mapping h ↦ ghg^{-1} is a group homomorphism G_x → G_y. In the case of transitive actions, the entire space forms a single orbit and all the isotropy subgroups have the same cardinality.
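The orbits and isotropy subgroups just defined are easy to enumerate in small cases. A tiny Python sketch of ours for the symmetric group of {0, 1, 2} acting on this set; it also confirms the counting formula |G| = |G_x| · |S_x| proved in 12.4.15 below:

    from itertools import permutations

    S = [0, 1, 2]
    G = list(permutations(S))                 # the symmetric group of S; g[i] is the image of i
    x = 0
    orbit = {g[x] for g in G}                 # S_x, the orbit of x
    stabilizer = [g for g in G if g[x] == x]  # G_x, the isotropy subgroup of x
    print(len(G), len(stabilizer) * len(orbit))   # prints: 6 6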
As an example of a transitive action of a finite group, consider the evident action of the symmetric group of a fixed set X on X itself. The natural action of all invertible linear transformations on the non-zero elements of a vector space V is also transitive. However, if the entire space V is considered, then the zero vector forms an orbit of its own. The mentioned action of the additive group of real numbers by rotations around a fixed center O in the plane is not transitive: the orbit of each point M is the circle centered at O passing through M.

The resulting basis can be reduced using division of the polynomials f1, f2 by x^2. This leads to the basis (xy, x − 2y^2, x^2). Now, we can divide the first polynomial by the second one, with remainder 2y^3, and the third one by the second one, with remainder 4y^4. Thus, we get the basis (x − 2y^2, y^3), and this is a Gröbner basis: by the naive algorithm (see 12.6.13), it suffices to verify that the polynomial S(x − 2y^2, y^3) = y^3 (x − 2y^2) − x y^3 = −2y^5 gives zero remainder with respect to the basis (x − 2y^2, y^3), and this holds for any monomial ordering. Clearly, the only solution of the system is the point (0, 0). □

12.K.10. Consider the following system of polynomial equations:

x^2 yz^2 + x^2 y^2 + yz − xyz^2 − z^2 = 0,
x^2 y + z = 0,
xyz + z + 1 = 0.

Sort the monomials of the polynomials with respect to the lexicographic ordering with x > y > z, then divide the first polynomial by the second one and the third one, and use the result to solve the system in the real numbers.

Solution. We have

x^2 y^2 + x^2 yz^2 − xyz^2 + yz − z^2 = (y + z^2)(x^2 y + z) − z(xyz + z + 1) − z^3 + z.

Since the remainder −z^3 + z must vanish on the solutions, we get z = 0, ±1. Then, e.g.,

0 = z(x^2 y + z) − x(xyz + z + 1) = z^2 − zx − x.

Hence x = z^2/(z + 1), and we get from the third equation that y = −(1 + z)^2/z^3. The values z = 0 and z = −1 lead to contradictions, so the system is satisfied by the sole point (1/2, −4, 1). □

12.K.11. Using a Gröbner basis, solve the polynomial system

x^2 + y + z = 1,
x + y^2 + z = 1,
x + y + z^2 = 1.

Solution. Let us denote f1 := x + y + z^2 − 1. The division of x + y^2 + z − 1 by f1 gives f2 = y^2 − y − z^2 + z. The division of x^2 + y + z − 1 by f1 yields y^2 + 2yz^2 − y + z^4 − 2z^2 + z, and further division by f2 produces the remainder f3 = 2yz^2 + z^4 − z^2.

A typical example of a transitive action of a group G is the natural action on the set G/H of left cosets, for any subgroup H. It is defined by g · (aH) = (ga)H. This is the form of every transitive group action: for any transitive action G × S → S and a fixed point x ∈ S, the set S can be identified with the set G/G_x of left cosets via gG_x ↦ g · x. Clearly, this mapping is surjective, and the images g · x = h · x coincide if and only if h^{-1}g ∈ G_x, which is equivalent to gG_x = hG_x. Finally, note that this identification transforms the original action of G on S exactly into the mentioned action of G on G/G_x.

12.4.15. Theorem. Let an action of a finite group G on a finite set S be given. Then:
(1) for each point x ∈ S, |G| = |G_x| · |S_x|,
(2) (Burnside's lemma) if N denotes the number of orbits, then

|G| = (1/N) ∑_{g∈G} |S^g|,

where S^g = {x ∈ S; g · x = x} denotes the set of fixed points of the action of the element g.

Proof. Consider any point x ∈ S and its isotropy subgroup G_x ⊆ G. The same argument as the one at the end of the previous paragraph for transitive group actions can be applied to each action of the group G. This gives the mapping G/G_x → S_x, g · G_x ↦ g · x. If g · x = h · x, then clearly g^{-1}h ∈ G_x, so this mapping is injective.
Clearly, it is also surjective, which means that the cardinalities of these finite sets satisfy |G/G_x| = |S_x|. The first proposition follows, because |G| = |G/G_x| · |G_x|.

The second proposition is proved by counting the cardinality of the set of fixed points in two different ways: F = {(x, g) ∈ S × G; g · x = x} ⊆ S × G. Since these are finite sets, the elements of the Cartesian product S × G can be considered as the entries of a matrix (columns indexed by the elements of S, rows indexed by the elements of G). Summing this matrix up, either by rows or by columns, yields

|F| = ∑_{g∈G} |S^g| = ∑_{x∈S} |G_x|.

For the sake of clarity, choose one representative for each orbit (denote them x_1, . . . , x_N), and recall that the cardinalities of the isotropy groups of points lying in the same orbit coincide. Using the already proved statement (1), we easily obtain

|F| = ∑_{g∈G} |S^g| = ∑_{i=1}^{N} ∑_{x∈S_{x_i}} |G_x| = ∑_{i=1}^{N} |S_{x_i}| · |G_{x_i}| = N · |G|,

which completes the proof. □

However, (f1, f2, f3) is not a Gröbner basis yet. One is constructed by choosing g1 := f1, g2 := f2, g3 := f3 and considering the S-polynomial 2z^2 f2 − y f3 = −yz^4 − yz^2 − 2z^4 + 2z^3, whose division by f3 leaves (up to a scalar multiple) the remainder

g4 = z^6 − 4z^4 + 4z^3 − z^2 = z^2 (z − 1)^2 (z^2 + 2z − 1).

Now, (g1, g2, g3, g4) is a Gröbner basis, so we can solve the system by elimination. We get from g4 = 0 that z = 0, 1, −1 ± √2. Substituting this into g2 = 0 and g1 = 0 gives the solutions

(1, 0, 0), (0, 1, 0), (0, 0, 1), (−1 + √2, −1 + √2, −1 + √2), (−1 − √2, −1 − √2, −1 − √2). □

12.K.12. Solve the following system of polynomial equations in R:

x^2 − 2xz − 4, x^2 y^2 z + yz^3, 2xy^2 − 3z^3.

Solution. A basis suitable for variable elimination is the Gröbner basis with respect to the lexicographic monomial ordering with x > y > z. Using Maple, we find the basis

144z^5 + 35z^7 + 12z^9, 23z^6 + 12z^8 + 44yz^4, yz^3 + 3z^5 + 4zy^2, 9z^4 + 4y^3, −8y^2 − 6z^4 + 3xz^3, 2xy^2 − 3z^3, x^2 − 2xz − 4.

Since the first polynomial divided by z^5 is quadratic in z^2 and its discriminant satisfies 35^2 − 4 · 12 · 144 < 0, we must have z = 0. Substituting this into the other polynomials, we immediately obtain y = 0, x = ±2. □

12.K.13. Solve the following system of polynomial equations in R:

xy + yz − 1, yz + zw − 1, zw + wx − 1, wx + xy − 1.

Solution. In this case, it is a good idea to take the graded lexicographic ordering with w > x > y > z. Using the algorithm 12.6.13 or appropriate software, we find the corresponding Gröbner basis (x − z, w − y, 2yz − 1). Thus, the system is satisfied exactly by the points (1/(2t), t, 1/(2t), t) for arbitrary non-zero t ∈ R. □

It is recommended to think through how useful these propositions are for solving combinatorial problems, cf. 12.G.1 through 12.G.7.

Next, we come back to fields: having the complete classification of finite Boolean algebras and finite Abelian groups, we focus on finite fields. As a first step, we derive some elementary results on rings and univariate polynomial rings.

12.4.16. Rings, ideals, quotients. We have been meeting rings on our journey since the very beginning. Now we add some structural understanding of the objects and homomorphisms in this category. Similarly to groups, the inclusion of a subring K ⊂ L allows us to construct the residue class space L/K = {a + K; a ∈ L}, i.e., we consider only the (commutative) additive group structure.
If we want to equip L/K with a ring structure by applying the operations to representatives, we face no problems with the addition (since all subgroups are normal in the Abelian case), but we need the properties a · K ⊂ K and K · a ⊂ K for all a ∈ L to get the multiplication right, too. Indeed, (a + K) · (b + K) = a · b + a · K + K · b + K = a · b + K under this assumption. Such a subring K ⊂ L is called an ideal in L, and we call L/K the residue class ring, or quotient ring. The verification of all the ring properties for the addition and multiplication defined by means of representatives is straightforward.

For every homomorphism of rings f : K → L, the kernel ker f = {a ∈ K; f(a) = 0} is clearly an ideal, since all multiples of zero are zero. Similarly to groups, the ideals K ⊂ L come with the short exact sequences of rings

(1) 0 → K → L → L/K → 0.

By the very definition, an intersection of ideals is again an ideal. Thus we may talk about the ideals generated by subsets of L. In commutative rings L, we distinguish the ideals generated by a single element and call them the principal ideals, ⟨a⟩ = {b · a; b ∈ L}. If L is an integral domain in which every ideal is principal, we call L a principal ideal domain.

The integer ring Z is a principal ideal domain. Indeed, if K ⊂ Z is a non-zero ideal and n is the smallest positive integer in K, then for any m ∈ K, the division yields m = qn + r with the remainder r again in K. Thus r = 0 (it cannot be positive and smaller than n), and we have proved K = ⟨n⟩. If n = 1, then K = Z. The quotient ring Z/⟨n⟩ is the ring of residue classes Zn. Notice that there are no proper non-trivial ideals in any field K.

12.4.17. Characteristic of rings and fields. Consider a ring K. Its additive structure provides the cyclic subgroups generated by single elements a ∈ K, G_a = {a, 2a, 3a, . . .}. If there is a minimal positive integer n such that na = 0, then we say that the order of a is n. If there is such a (minimal) n valid for all a ∈ K, we say that the characteristic of the ring K is n.

12.K.14. Solve the following system of polynomial equations in R:

x^2 + yz + x, z^2 + xy + z, y^2 + xz + y.

Solution. Using the algorithm 12.6.13 or appropriate software, we find the corresponding Gröbner basis for the lexicographic monomial ordering with x > y > z, consisting of six polynomials:

z^2 + 3z^3 + 2z^4, z^2 + z^3 + 2yz^2 + 2yz^3, y − yz − z − z^2 − 2yz^2 + y^2, yz + z + z^2 + 2yz^2 + xz, z^2 + xy + z, x^2 + yz + x.

The roots of the first polynomial are z = 0, −1, −1/2. Discussing the individual cases, we find out that the system is satisfied exactly by the points (0, 0, 0), (−1, 0, 0), (0, −1, 0), (0, 0, −1), and (−1/2, −1/2, −1/2). □

If there are elements of infinite order, or if the orders are not bounded, we say that the characteristic of K is infinite. Clearly, the characteristic of Zn is n, while the scalars Z, Q, R, C all have characteristic ∞.

Lemma. If K is a finite ring with |K| = m elements, then the characteristic of K is a divisor of m.

Proof. The orders of the cyclic subgroups G_a must be finite (if ka = ℓa, then (k − ℓ)a = 0, and there are only finitely many elements in K), and they must be divisors of m (as known for all finite Abelian groups). □
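The quotient constructions above can be experimented with directly. A minimal hand-rolled sketch of ours (Python) of arithmetic in Z2[x]/⟨x^3 + x + 1⟩, an eight-element field of the kind appearing in 12.4.18 and 12.4.19 below; its characteristic 2 indeed divides 8, as the lemma requires:

    MOD = 0b1011                        # the polynomial x^3 + x + 1 as a bit mask

    def add(a, b):
        return a ^ b                    # coefficient-wise addition mod 2

    def mul(a, b):
        res = 0
        for i in range(3):              # schoolbook multiplication of polynomials
            if (b >> i) & 1:
                res ^= a << i
        for i in (4, 3):                # reduction modulo x^3 + x + 1
            if (res >> i) & 1:
                res ^= MOD << (i - 3)
        return res

    for a in range(1, 8):               # every non-zero element has a unique inverse,
        print(a, [b for b in range(1, 8) if mul(a, b) == 1])   # so the quotient is a field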
Let p be the characteristic of K. Suppose p = kℓ is not a prime (clearly, 1 < k ≤ ℓ < p). Then, looking at 1 ∈ K, 0 = p · 1 = (kℓ) · 1 = (k · 1) · (ℓ · 1), and this is impossible without divisors of zero. Thus the characteristic of K is a prime p. Next, for any 0 ≠ a ∈ K, all the elements a · b, b ∈ K, must be mutually different, since a · b1 = a · b2 would imply a · (b1 − b2) = 0, i.e., b1 = b2. In particular, the unit 1 is expressed as 1 = a · b for some b, and so b is the inverse of a. Thus K is a finite field. Finally, the additive subgroup G1 = {1, 2 · 1, . . . , p · 1} is closed under the multiplication, too (and is isomorphic to the field Z_p). Now, K can be viewed as a vector space over G1, and it must be generated by a finite number of elements a1, . . . , an (with much freedom in the choices, but always the same number, as known from elementary linear algebra). But then it is clear that |K| = pⁿ, as required. □
The theorem suggests that all finite fields are built from the simplest bricks — the so-called Galois fields GF(p) isomorphic to Z_p with p prime. They are often denoted F_p in the literature.
12.4.18. Univariate polynomials over fields. The ring of univariate polynomials K[x] over any field K is a principal ideal domain. Indeed, the same argument as for the integers applies. Given a non-zero ideal J, there must be a polynomial f of lowest degree in J. If the degree is zero, J = K[x]. Otherwise, each g ∈ J is expressed as g = qf + r by the division with remainder (cf. 12.3.6). Then r ∈ J, and it must be either of lower degree than f (which is impossible) or zero. Thus, J = ⟨f⟩.
Recall that each ideal J = ⟨f⟩ ⊂ K[x], deg f > 0, provides the quotient ring L = K[x]/J.
Lemma. The ring L = K[x]/⟨f⟩, d = deg f > 0, contains K as a subfield, and L is a vector space over K of dimension at most d.
Proof. Since d = deg f > 0, no non-zero constant polynomial can appear in ⟨f⟩. Thus a ↦ a + J embeds K into L as a subring. Finally, the elements 1 + J, x + J, . . . , x^{d−1} + J must generate L (recall that we deal with univariate polynomials and the division with remainder). □
Proposition. The ring L = K[x]/⟨f⟩, deg f > 0, is a field if and only if the polynomial f is irreducible over K (in particular, f then has no roots in K). If K is finite, then |L| = p^{nm}, where pⁿ = |K| and m is the dimension of the vector space L over K.
Proof. If f = gh with 0 < deg g, deg h < deg f, then (g + J) · (h + J) = 0 in L although both factors are non-zero, so L cannot be a field. Conversely, if f is irreducible and g ∉ J, then GCD(f, g) = 1, and Bézout's identity 1 = af + bg shows that b + J is the inverse of g + J. The formula for |L| follows from the lemma by elementary linear algebra. □
In the situation of the proposition, we call L the algebraic extension of the field K with respect to the polynomial f. Clearly, after finitely many iterations of this construction, the polynomial f becomes completely reducible, f = a(x − α1) . . . (x − αd). Thus the construction adds all the missing roots αi to K and extends the field this way.
12.4.19. Theorem (Classification of finite fields). For each prime number p and each power pⁿ, n ∈ N, there is a field with pⁿ elements. Such a field F is obtained as the extension of Z_p by all roots of the polynomial x^{pⁿ} − x. All fields with the same number of elements are isomorphic.
Proof. □
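As a concrete illustration of the extension construction (a sketch of ours, not from the text), the following Python fragment builds the four-element field GF(4) = Z_2[x]/⟨x² + x + 1⟩, representing each class a0 + a1·x by the pair (a0, a1), and checks that every non-zero element is invertible.

# Elements of GF(4) = Z_2[x]/<x^2 + x + 1> as pairs (a0, a1) meaning a0 + a1*x.
def add(p, q):
    return ((p[0] + q[0]) % 2, (p[1] + q[1]) % 2)

def mul(p, q):
    # (a0 + a1 x)(b0 + b1 x) = a0b0 + (a0b1 + a1b0) x + a1b1 x^2,
    # and x^2 = x + 1 modulo the irreducible polynomial x^2 + x + 1.
    c0 = p[0] * q[0] + p[1] * q[1]                 # a1b1 contributes 1 from x^2
    c1 = p[0] * q[1] + p[1] * q[0] + p[1] * q[1]   # and x from x^2
    return (c0 % 2, c1 % 2)

elems = [(a, b) for a in (0, 1) for b in (0, 1)]
one = (1, 0)
assert mul((0, 1), (1, 1)) == one      # x * (1 + x) = 1, so x is invertible
# Every non-zero element has a multiplicative inverse, hence this is a field.
for p in elems:
    if p != (0, 0):
        assert any(mul(p, q) == one for q in elems)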
12.4.20. The fundamental theorem of algebra. While it may happen that a polynomial over the real numbers has no roots, every polynomial over the complex numbers has a root. This is the statement of the so-called fundamental theorem of algebra, which is presented here with an (essentially) complete proof. By this result, every polynomial in C[x] has as many roots (counted with multiplicity) as its degree deg f = k. Hence it always admits a factorization of the form f(x) = b(x − a1)(x − a2) · · · (x − ak) for the complex roots ai and the appropriate leading coefficient b.
Theorem. The field C is algebraically closed, i.e. every polynomial of degree at least one has a root.
Proof. We are going to provide an elementary proof based on simple real analysis, in particular on the concept of continuity (the reader should be familiar with the techniques developed in Chapters 5 and 6). Suppose that f ∈ C[z] is a non-zero polynomial with no root, i.e. f(z) ≠ 0 for all z ∈ C. Consider the mapping φ : C → C, z ↦ f(z)/|f(z)|. Then φ maps the entire plane C into the unit circle K1 = {e^{it}; t ∈ R} ⊆ C. By the assumption that f(z) is never zero, this mapping is well-defined.
Next, we shall consider the restrictions of φ to the individual circles Kr ⊆ C with center at zero and radius r ≥ 0. We can parameterize these circles by the mappings ψr : R → Kr, t ↦ ψr(t) = r e^{it}. The composition κ : (0, ∞) × R → K1, κ(r, t) = φ ◦ ψr(t), is continuous in both r and t. Thus, for each r, there exists a continuous mapping αr : R → R which is uniquely given by the conditions 0 ≤ αr(0) < 2π and κ(r, t) = e^{iαr(t)}; moreover, the obtained mapping αr depends continuously on r. Altogether, there is a continuous mapping α : (0, ∞) × R → R, (r, t) ↦ αr(t). It follows from its construction that, for every r, (1/2π)(αr(2π) − αr(0)) = nr ∈ Z. Since α is continuous in r, the integer nr is a constant independent of r.
In order to complete the proof, it suffices to note that if f = a0 + · · · + ad z^d and ad ≠ 0, then for small values of r, αr behaves nearly as a constant mapping, while for large values of r, it behaves almost as if f = z^d. First, calculate nr for f = z^d; then make this statement precise.
The complex functions z ↦ z^d, z ↦ z^d/|z^d| can be expressed easily using the trigonometric form of the complex numbers z = r(cos θ + i sin θ):
z^d = r^d(cos dθ + i sin dθ) = r^d e^{idθ},  z^d/|z^d| = cos dθ + i sin dθ = e^{idθ}.
In this case, the mapping φ winds each circle Kr d times around the origin, followed by the central projection onto the unit circle. Then κ(r, t) = e^{idt}, and so αr(t) = dt, regardless of r. It follows that nr = d for the choice f = z^d. If f = a z^d is chosen, a ≠ 0, then there is no impact on the above result (verify this yourselves!).
Consider a general polynomial f = a0 + · · · + ad z^d with no root. Then a0 ≠ 0 (a0 = 0 implies that 0 is a root). For z ≠ 0, f(z)/(ad z^d) = 1 + (1/ad)(a0 z^{−d} + · · · + a_{d−1} z^{−1}). Hence, lim_{|z|→∞} f(z)/(ad z^d) = 1. Knowing this, calculate
$$\lim_{|z|\to\infty}\Bigl(\frac{f(z)}{|f(z)|} - \frac{a_d z^d}{|a_d z^d|}\Bigr) = \lim_{|z|\to\infty}\Bigl(\frac{f(z)}{a_d z^d}\cdot\frac{a_d z^d}{|a_d z^d|}\cdot\frac{|a_d z^d|}{|f(z)|} - \frac{a_d z^d}{|a_d z^d|}\Bigr) = 0.$$
Hence, nr = d for large values of r. A similar computation can be done for small values of r. Recall that a0 ≠ 0: f(z)/a0 = 1 + (1/a0)(a1 z + · · · + ad z^d). Thus, lim_{|z|→0} f(z)/a0 = 1. In addition, f(z)/|f(z)| = (f(z)/a0) · (a0/|a0|) · (|a0|/|f(z)|). Hence, lim_{|z|→0} f(z)/|f(z)| = a0/|a0|, i.e. nr = 0 for small values of r. Altogether, d = 0, i.e. a polynomial with no root must be constant. □
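The continuity argument can be probed numerically. The following sketch (ours; a numerical illustration, not a proof) samples the argument of f(z)/|f(z)| along a circle of radius r and accumulates its total change, recovering nr = deg f for large r and nr = 0 for small r.

import cmath
import math

def winding(f, r, steps=20000):
    # total change of arg f(z) along the circle |z| = r, divided by 2*pi
    total, prev = 0.0, cmath.phase(f(r))
    for k in range(1, steps + 1):
        z = r * cmath.exp(2j * math.pi * k / steps)
        cur = cmath.phase(f(z))
        d = cur - prev
        # unwrap the argument so jumps across the branch cut are corrected
        if d > math.pi:
            d -= 2 * math.pi
        elif d < -math.pi:
            d += 2 * math.pi
        total += d
        prev = cur
    return round(total / (2 * math.pi))

f = lambda z: z**3 - 2*z + 5      # degree 3; note f(0) = 5, so 0 is no root
print(winding(f, 100.0))   # 3 = the degree, for a circle enclosing all roots
print(winding(f, 0.01))    # 0, near the origin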
5. Coding theory
It is often needed to transfer data while guaranteeing that they are transferred correctly. In some cases, it suffices to recognize whether the data is unchanged; if not, it can be transferred again. In other cases, retransmission might not be reasonable, so the data needs to be recoverable after a small number of errors in the transfer. This is the goal of coding theory. Some of the algorithms are explored now. Notice that coding is quite different from encrypting: if no one but the addressee is meant to be able to read the message, then it should be encrypted. This topic is discussed briefly at the end of the previous chapter.
12.5.1. Codes. Data transfer is usually prone to errors. Since any information may be encoded as a sequence of bits (zeros and ones), we work with Z_2, although the theory may be developed over any finite field. Furthermore, the length of the message to be transferred is assumed to be known in advance. Thus, one transfers k-bit words, where k ∈ N is fixed. It is desired to detect potential errors and, if possible, to recover the original data. For this reason, further n − k bits are added to the k-bit word, where n is also fixed (and of course n > k). These are called (n, k)-codes. There are 2^k binary words of length k, and each should be mapped to one of the 2^n possible words of length n. For an (n, k)-code, there remain 2^n − 2^k = 2^k(2^{n−k} − 1) words which are not codewords (if such a word is received, then an error has occurred). Thus, even for a large value of k, only a few added bits provide much redundant information.
The simplest example is the parity check code. Having a message of length k, the codeword is created by adding a bit whose value is determined so that the total number of ones is even. This is an example of a (k + 1, k)-code. If an odd number of errors occurs during the transfer, then it is detected by this simple code. Every two codewords differ in at least two bits, but an error word differs from at least two codewords in only one bit. Therefore, this code is unable to recover the original message, even under the assumption that only one bit was changed. The following diagram illustrates all 2-bit words with the parity bit added; the codewords are marked with a bold dot.
[Diagram: the eight 3-bit words 000, 001, 010, 011, 100, 101, 110, 111 at the vertices of a cube; the codewords 000, 011, 101, 110 are marked with bold dots.]
Moreover, the parity check code is unable to detect the error of interchanging a pair of adjacent bits, which often happens.
12.5.2. Word distance. In the diagram of the parity check (3, 2)-code, each error word is at the “same” distance from three codewords – those which differ from it in exactly one bit. The other words are farther. Formally, this observation can be described by the following definition of distance:
Word distance
The Hamming distance of a pair of words (of equal length) is the number of bits in which they differ.
Consider words x, y, z such that x and y differ in r bits, and y and z differ in s bits. Then x and z differ in at most r + s bits, which verifies the triangle inequality for this distance. If the code is to detect errors in up to r bits, then the minimum distance between each pair of codewords must be at least r + 1. If the code is to recover errors in up to r bits, then there must exist only one codeword whose distance from the received word is at most r. Thus, the following propositions are verified:
Theorem. (1) A code reliably detects at most r errors if and only if the minimum Hamming distance of the codewords is at least r + 1.
(2) A code reliably detects and recovers at most r errors if and only if the minimum Hamming distance of the codewords is at least 2r + 1.
12.5.3. Construction of polynomial codes. For practical applications, the codewords should be constructed efficiently, so that they can be easily recognized among all the words. The parity check code is one example; another trivial possibility is to simply repeat the bits – for instance, the (3, 1)-code which triplicates each bit.
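A minimal sketch (ours) of the (k + 1, k) parity check code and of the Hamming distance; it confirms that a single bit flip is detected while a double flip escapes detection, exactly as discussed above.

def encode_parity(word):
    # append one bit so that the total number of ones is even
    return word + [sum(word) % 2]

def is_codeword(w):
    return sum(w) % 2 == 0

def hamming(u, v):
    # number of positions in which u and v differ
    return sum(a != b for a, b in zip(u, v))

u = encode_parity([1, 0, 1, 1])               # -> [1, 0, 1, 1, 1]
assert is_codeword(u)

single = u.copy(); single[2] ^= 1             # one error: detected
assert not is_codeword(single)

double = u.copy(); double[0] ^= 1; double[3] ^= 1   # two errors: undetected
assert is_codeword(double) and hamming(u, double) == 2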
A systematic way of code construction is to use division of polynomials. A message b0 b1 . . . b_{k−1} is understood as the polynomial m(x) = b0 + b1 x + · · · + b_{k−1} x^{k−1} over the field Z_2. The encoded message should be another polynomial v(x) of degree at most n − 1.
Polynomial codes
Let p(x) = a0 + · · · + a_{n−k} x^{n−k} ∈ Z_2[x] be a polynomial with coefficients a0 = 1, a_{n−k} = 1. The polynomial code generated by p(x) is the (n, k)-code whose codewords are the polynomials of degree less than n divisible by p(x). A message m(x) is encoded as v(x) = r(x) + x^{n−k} m(x), where r(x) is the remainder in the division of the polynomial x^{n−k} m(x) by p(x).
Check the claimed properties first. Writing x^{n−k} m(x) = q(x)p(x) + r(x), the definition of the codeword gives
v(x) = r(x) + x^{n−k} m(x) = r(x) + q(x)p(x) + r(x) = q(x)p(x),
since the sum of two identical polynomials is always zero over Z_2. Therefore, all codewords are divisible by p(x). On the other hand, if v(x) is divisible by p(x), the above calculation can be read from right to left (setting r(x) = x^{n−k} m(x) − q(x)p(x)), and so it is a codeword created by the above procedure. By the definition, the codeword is created by adding the n − k bits given by r(x) at the beginning of the word (the message itself is simply shifted to the right by that). It follows that the original message is contained in the polynomial v(x), and the decoding is easy.
Consider the two simple examples already mentioned. First, note that p(x) = 1 + x divides v(x) if and only if v(1) = 0. This occurs if and only if v(x) has an even number of non-zero coefficients. So the polynomial p(x) = 1 + x generates the parity check (n, n − 1)-code for any n ≥ 2. Similarly, it is easily verified that the polynomial p(x) = 1 + x + · · · + x^{n−1} generates the (n, 1)-code of n-fold bit repetition: dividing the polynomial b0 x^{n−1} by p(x) gives the remainder b0(1 + · · · + x^{n−2}), so the corresponding codeword is b0 p(x).
12.5.4. Error detection. Let e(x) denote the error vector, that is, the difference between the transmitted codeword v ∈ (Z_2)^n and the received data u: u(x) = v(x) + e(x). The error is detected if and only if the generator of the code (i.e. the polynomial p(x)) does not divide e(x). Therefore, polynomials in Z_2[x] which only rarely occur as divisors are of interest.
Definition. An irreducible polynomial p(x) ∈ Z_2[x] of degree m is said to be primitive if and only if p(x) divides 1 + x^k for k = 2^m − 1, but not for any smaller value of k.
Theorem. Let p(x) be a primitive polynomial of degree m and n ≤ 2^m − 1. Then the polynomial (n, n − m)-code generated by p(x) detects all simple and double errors.
Proof. If exactly one error occurs, then e(x) = x^i for some i, 0 ≤ i < n. Since p(x) is irreducible, it cannot have a root in Z_2; in particular, it cannot divide x^i, since the factorization of x^i is unique. It follows that every simple error is detected. If exactly two errors occur, then e(x) = x^i + x^j = x^i (1 + x^{j−i}) for some 0 ≤ i < j < n. p(x) does not divide any x^i, and since it is primitive, it does not divide 1 + x^{j−i} either, because j − i < 2^m − 1. At the same time, p(x) is irreducible, which means that it does not divide the product e(x) = x^i (1 + x^{j−i}), which completes the proof. □
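The encoding v(x) = r(x) + x^{n−k} m(x) is easy to mechanize. The sketch below (ours) represents polynomials over Z_2 as lists of bits (index i = coefficient of x^i), encodes a message with a generator polynomial, and detects a corrupted word by checking divisibility; it is run here on the (6, 3)-code generated by p(x) = 1 + x + x³, which reappears later in this section.

def poly_mod(a, p):
    # remainder of the division of a(x) by p(x) over Z_2
    a = a[:]
    for i in range(len(a) - 1, len(p) - 2, -1):
        if a[i]:                          # cancel the leading term with p(x)
            for j, c in enumerate(p):
                a[i - len(p) + 1 + j] ^= c
    return a[:len(p) - 1]

def encode(msg, p, n):
    k = n - len(p) + 1
    assert len(msg) == k
    shifted = [0] * (n - k) + msg         # x^(n-k) * m(x)
    r = poly_mod(shifted, p)              # remainder r(x)
    return r + msg                        # v(x) = r(x) + x^(n-k) m(x)

def is_codeword(w, p):
    return not any(poly_mod(w, p))

p = [1, 1, 0, 1]                          # p(x) = 1 + x + x^3
v = encode([0, 0, 1], p, 6)               # message m(x) = x^2
assert is_codeword(v, p)
corrupted = v[:]; corrupted[1] ^= 1
assert not is_codeword(corrupted, p)      # a single error is detected
print(v)                                  # [1, 1, 1, 0, 0, 1], i.e. 1+x+x^2+x^5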
12.5.5. Corollary. Let q(x) be a primitive polynomial of degree m and n ≤ 2^m − 1. Then the polynomial (n, n − m − 1)-code generated by the polynomial p(x) = q(x)(1 + x) detects all double errors as well as all errors with an odd number of bit flips.
Proof. The codewords generated by the chosen polynomial p(x) are divisible by both x + 1 and the primitive polynomial q(x). As verified above, the factor x + 1 is responsible for parity checking, i.e. all codewords have an even number of non-zero bits. This detects any odd number of errors. By the above theorem, the second factor is able to detect all double errors. □
The following table illustrates the power of the above theorems for several primitive polynomials of low degrees. For instance, the last row says that by adding only 11 redundant bits to a message of length 1012 and employing the polynomial (x + 1)p(x), all single, double, triple, and odd-numbered errors in the transfer can be detected. These are already quite large numbers, with over 300 decimal digits.
primitive polynomial p(x)      redundant bits   codeword length
1 + x                                1                  1
1 + x + x²                           2                  3
1 + x + x³                           3                  7
1 + x + x⁴                           4                 15
1 + x² + x⁵                          5                 31
1 + x + x⁶                           6                 63
1 + x³ + x⁷                          7                127
1 + x² + x³ + x⁴ + x⁸                8                255
1 + x⁴ + x⁹                          9                511
1 + x³ + x¹⁰                        10               1023
Note that quite strong results on divisibility and the decomposition of polynomials, derived in the second part of this chapter, are used here. But tools which would assist in constructing primitive polynomials have not been mentioned. Such tools come from the theory of finite fields. The name "primitive" reflects the connection to the primitive elements in the Galois fields GF(2^m). This theory also provides a convenient way of applying the Euclidean division, that is, of verifying whether or not the received word is a codeword, using the delayed registers. This is a simple circuit with as many elements as the degree of the polynomial.¹²
¹² More about the beautiful theory and its connection with codes can be found in the book Gilbert, W., Nicholson, K., Modern Algebra and its Applications, John Wiley & Sons, 2nd edition, 2003, 330+xvii pp., ISBN 0-471-41451-4.
12.5.6. Linear codes. Polynomial codes can also be described using elementary matrix calculus. Recall that when working over the field Z_2, caution is required when applying the results of elementary linear algebra: over the real numbers, the property that v = −v implies v = 0 is often used, and this is not available now. However, the basic definition of vector spaces, the existence of bases, and the descriptions of linear mappings by matrices are still valid. It is useful to recall the general theory and its applicability here.
Start with a more general definition of codes, which only requires linear dependency of the codeword on the original message:
Linear codes
Any injective linear mapping g : (Z_2)^k → (Z_2)^n is a linear code. The n-by-k matrix G that corresponds to this mapping (in the canonical bases) is called the generating matrix of the code. For each message v, the corresponding codeword is given by u = G · v.
Theorem. Every polynomial (n, k)-code is a linear code.
Proof. Use elementary properties of the Euclidean division. Apply the assignment of the polynomial v(x) = r(x) + x^{n−k} m(x), determined by the original message m(x), to the sum of two messages m(x) = m1(x) + m2(x). The remainder in the division of x^{n−k}(m1(x) + m2(x)) by p(x) is, by uniqueness, given as the sum r1(x) + r2(x) of the remainders for the individual messages. It follows that v(x) = r1(x) + r2(x) + x^{n−k}(m1(x) + m2(x)), which is the desired additivity. Since the only non-zero scalar in Z_2 is 1, the linearity of the mapping of the message m(x) to the longer codeword v(x) is proved. Moreover, this mapping is clearly injective, since the original message m(x) is simply copied beyond the redundant bits. □
For instance, consider the (6, 3)-code generated by the polynomial p(x) = 1 + x + x³, for encoding 3-bit words.
Evaluate it on the individual basis vectors mi(x) = x^i, i = 0, 1, 2, to get v0 = (1 + x) + x³, v1 = (x + x²) + x⁴, v2 = (1 + x + x²) + x⁵. It follows that the generating matrix of this (6, 3)-code is
$$G = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
Polynomial codes always copy the original message beyond the redundant bits. So the generating matrix can be split into two blocks P and Q consisting of n − k and k rows, respectively; the block Q then equals the identity matrix I_k.
12.5.7. Theorem. Let g : (Z_2)^k → (Z_2)^n be a linear code with generating matrix G, written in blocks as
$$G = \begin{pmatrix} P \\ I_k \end{pmatrix}.$$
Then the mapping h : (Z_2)^n → (Z_2)^{n−k} with the matrix H = (I_{n−k} P) has the following properties:
(1) Ker h = Im g;
(2) a received word u is a codeword if and only if H · u = 0.
Proof. The composition h ◦ g : (Z_2)^k → (Z_2)^{n−k} is given by the product of matrices (computing over Z_2):
$$H \cdot G = \begin{pmatrix} I_{n-k} & P \end{pmatrix}\cdot\begin{pmatrix} P \\ I_k \end{pmatrix} = P + P = 0.$$
Hence it is proved that Im g ⊆ Ker h. Since the first n − k columns of H are the basis vectors of (Z_2)^{n−k}, the image Im h has the maximum dimension n − k, which means that this image contains 2^{n−k} vectors. Vector spaces over Z_2 are finite commutative groups, so the formula relating the orders of subgroups and quotient groups from subsection 12.4.10 can be used, thus obtaining |Ker h| · |Im h| = |(Z_2)^n| = 2^n. Therefore, the number of vectors in Ker h is equal to 2^n · 2^{k−n} = 2^k. In order to complete the proof of the first proposition, it suffices to note that the image Im g also has 2^k elements. The second proposition is a trivial corollary of the first one. □
The matrix H from the theorem is called the parity check matrix of the corresponding linear (n, k)-code. For instance, the matrix H = (1 1 1) is the parity check matrix for the parity check (3, 2)-code, encoding 2-bit words. It is easily obtained from the matrix
$$G = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}$$
that generates this code. For the (6, 3)-code mentioned above, the parity check matrix is
$$H = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix}.$$
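A quick numerical confirmation (ours, assuming numpy is available) that the two matrices above fit together, i.e. H · G = 0 over Z_2, so that the image of the code lies in the kernel of the parity check:

import numpy as np

G = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
H = np.array([[1, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 0, 1, 1]])

assert not ((H @ G) % 2).any()     # H . G = 0 over Z_2

msg = np.array([0, 0, 1])
u = (G @ msg) % 2                  # the codeword of the message 001
assert not ((H @ u) % 2).any()     # the syndrome of a codeword is zero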
12.5.8. Error-correcting codes. As seen, transferring a message u gives the result v = u + e. Over Z_2, this is equivalent to e = u + v. It follows that if the error e to be detected is fixed, all the received words determined by the correct codewords u fill one of the cosets in the quotient space (Z_2)^n/V, where V ⊆ (Z_2)^n is the vector subspace of the codewords. The mapping h : (Z_2)^n → (Z_2)^{n−k} corresponding to the parity check matrix has V as its kernel. Therefore, it induces an injective linear mapping h̃ : (Z_2)^n/V → (Z_2)^{n−k}. Clearly, the value h̃(v + V) on the coset generated by v is determined uniquely by the value H · v.
Syndromes
The expression H · v, where H is the parity check matrix of the considered linear code, is called the syndrome of v.
The following claim is a direct corollary of the construction and the above observations:
Theorem. Two words are in the same class u + V if and only if they have the same syndrome.
It follows that self-correcting codes can be constructed by choosing, for every syndrome, the element of the corresponding coset which is most likely to be the sent codeword. Naturally, when choosing the code, it is desirable to maximize the probability that it can correct single errors (and possibly even more errors).
Try it on the example of the (6, 3)-code for which the matrices G and H are already computed. Build the table of all syndromes and the corresponding words. The syndrome 000 is possessed exactly by the codewords. All words with a given different syndrome are obtained by choosing one of them and adding all the codewords. The following two tables display the syndromes in the first rows; the second rows then display the vector which has the least number of ones among the vectors of the corresponding coset. In almost all cases, there is just a single one there; only in the last column of the second table there are two ones, and the element is chosen where the ones are adjacent (because, for instance, multiple errors are more likely to be adjacent).
000      100      010      001
000000   100000   010000   001000
110100   010100   100100   111100
011010   111010   001010   010010
111001   011001   101001   110001
101110   001110   111110   100110
001101   101101   011101   000101
100011   000011   110011   101011
010111   110111   000111   011111

110      011      111      101
000100   000010   000001   000110
110000   110110   110101   110010
011110   011000   011011   011100
111101   111011   111000   111111
101010   101100   101111   101000
001001   001111   001100   001011
100111   100001   100010   100101
010011   010101   010110   010001
All the columns in the tables are affine subspaces whose modelling vector space is always the first column of the first table. This is because the code is linear, so the set of all codewords forms a vector space, and the individual cosets of the quotient space are consequently affine subspaces. In particular, the difference of each pair of words in the same column is a codeword. The words in the second rows are the leading representatives of the cosets (affine spaces) that correspond to the given syndromes. These are the words with the least number of ones in their column. They correspond to the least number of bit flips which must be made to any word in the given column in order to get a codeword.
For instance, if the word 111101 is received, compute that its syndrome is 110. The leading representative of the coset of this syndrome is 000100. Subtract it from the received word to obtain the codeword 111001. This is the codeword with the least Hamming distance from the received word. So the original message is most likely to be 001.
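The decoding procedure just described is mechanical: precompute a leading representative for each syndrome, then subtract it from the received word. A sketch of ours for the same (6, 3)-code (assuming numpy; ties between representatives of equal weight are resolved arbitrarily here):

from itertools import product
import numpy as np

H = np.array([[1, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 0, 1, 1]])

def syndrome(w):
    return tuple((H @ w) % 2)

# leading representative = a word of minimal weight with the given syndrome
leader = {}
for bits in sorted(product((0, 1), repeat=6), key=sum):
    leader.setdefault(syndrome(np.array(bits)), np.array(bits))

received = np.array([1, 1, 1, 1, 0, 1])
corrected = (received + leader[syndrome(received)]) % 2
print(corrected)       # [1 1 1 0 0 1], the codeword nearest to 111101
print(corrected[3:])   # the decoded message: [0 0 1]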
6. Systems of polynomial equations
In practical problems, objects or actions are often encountered which are described in terms of polynomials or systems of polynomial equations. For instance, the set of points in R³ defined by the two equations x² + y² − 1 = 0 and z = 0 is the circle which is centered at (0, 0, 0), has radius 1, and lies in the xy-plane. Similarly, the equations xz = 0 and yz = 0, considered in R³, define the union of the line x = 0, y = 0 and the plane z = 0. Notice that we have to specify the space carefully, since x² + y² = 1 defines a circle in R², but a cylinder if viewed in R³. Deciding whether or not a given point lies within a given body, finding extrema over algebraically defined subsets of multidimensional spaces, analyzing movements of parts of some machine, etc., are examples of such problems.
12.6.1. Affine varieties. For the sake of simplicity (existence of roots of polynomials), we work mainly with the field of complex numbers. Some ideas are extended to the case of a general field K. An affine n-dimensional space over a field K is understood to be K^n = K × · · · × K (n factors) with the standard affine structure (see the beginning of Chapter 4). As seen, a polynomial f = ∑_α a_α x^α ∈ K[x1, . . . , xn] can be viewed naturally as a mapping f : K^n → K, defined by f(u1, . . . , un) := ∑_α a_α u^α, where u^α = u1^{α1} · · · un^{αn}.
In dimension n = 1, the equation f(x) = 0 describes only finitely many points of K. In higher dimensions, the similar equation f(x1, . . . , xn) = 0 describes subsets similar to curves in the plane or surfaces in the space. However, they may be of quite complicated and self-intersecting shapes. For instance, the set given by the equation (x² + y²)³ − 4x²y² = 0 looks like a quatrefoil (see the illustration at the beginning of part J in the other column). Another illustration of a two-dimensional surface is given by Whitney's umbrella x² − y²z = 0, which, besides the part shown in the diagram, also includes the line {x = 0, y = 0}. The diagram was drawn using the parametric description x = uv, y = v, z = u², whence the implicit description x² − y²z = 0 is easily guessed.
In the following illustration, there is the Enneper surface with the parametrization x = 3u + 3uv² − u³, y = 3v + 3u²v − v³, z = 3u² − 3v². It is hard to imagine how to obtain the implicit description with this parametrization in hand. Nevertheless, there is an algorithm to eliminate the variables u and v from these three equations. Quite a complex theory needs to be developed for that. As usual, begin by formalizing the objects of interest.
Affine varieties
Let f1, . . . , fs ∈ K[x1, . . . , xn]. The affine variety in K^n corresponding to the set of polynomials f1, . . . , fs is the set
V(f1, . . . , fs) = {(a1, . . . , an) ∈ K^n; fi(a1, . . . , an) = 0, i = 1, . . . , s}.
Affine varieties include conic sections, quadrics, and hyperquadrics, both singular and regular. Many curves and surfaces can be easily described as affine varieties. The variety corresponding to a set of polynomials is the intersection of the varieties corresponding to the individual polynomials. For instance, V(x² + y² − 1, z) ⊂ R³ is the circle which is centered at (0, 0, 0), has radius 1, and lies in the xy-plane. Similarly, V(xz, yz) ⊂ R³ is the union of the line x = 0, y = 0 and the plane z = 0, since it is exactly at the points of these two objects that both the polynomials xz, yz vanish.
These examples illustrate that it is not easy to deal with the concept of dimension. Is the mentioned line, added to the plane, enough for the variety to be considered three-dimensional, or should one keep considering it two-dimensional with a certain anomaly?
Verify the following straightforward proposition:
Theorem. Let V = V(f1, . . . , fs) and W = V(g1, . . . , gt) ⊆ K^n be affine varieties. Then V ∪ W and V ∩ W are also affine varieties, namely
V ∩ W = V(f1, . . . , fs, g1, . . . , gt),  V ∪ W = V(fi gj; 1 ≤ i ≤ s, 1 ≤ j ≤ t).
In the following subsections, some questions which arise in the context of varieties are answered:
Q1. Is the set V(f1, . . . , fs) empty?
Q2. Is the set V(f1, . . . , fs) finite?
Q3. How to understand the concept of dimension for varieties?
All of these problems can be “reasonably” solved for varieties over the complex numbers (as well as over any algebraically closed field); it is more difficult for the real numbers and nearly impossible for general fields.
For instance, over the rational numbers, the question whether V(x^n + y^n − z^n) = ∅ leads to the well-known Fermat's last theorem, mentioned many times in Chapter 11.
12.6.2. Parametrization. For some purely practical operations with varieties, it is convenient to use the implicit representation (the one used so far). For instance, deciding whether a given point lies in a given variety, or inside the space enclosed by it, is quite easy using the implicit description. On the other hand, the parametric description may also come in handy in many situations (for example, it was used to draw the diagrams above). The variety V(x + y + z − 1, x + 2y − z − 3) is a line (the intersection of two planes). If the system x + y + z − 1 = 0, x + 2y − z − 3 = 0 is solved, the parametric description of this line is immediate: x = −1 − 3t, y = 2 + 2t, z = t, as known from affine geometry. One needs to be more careful in general:
Rational parametrization
Definition. A rational parametric representation of a variety V(f1, . . . , fr) ⊆ K^n is a set of rational functions r1, . . . , rn ∈ K(t1, . . . , ts) such that:
• the entire image of the mapping r = (r1, . . . , rn) is contained in V(f1, . . . , fr);
• V(f1, . . . , fr) is the minimal affine variety which contains these points (x1, . . . , xn).
Note that the parametrization is not required to describe all the points of the variety. This is important, as seen in a simple example of a parametrization of a circle in the plane:
x = 2t/(1 + t²),  y = (−1 + t²)/(1 + t²),
which can be obtained using the stereographic projection. (Verify this in detail!) Note that this parametrization describes all points except for the point (0, 1), from which we project, since this point is not reachable for any value of the parameter t. This is nobody's fault; it follows from the different topological properties of the circle and the line that there exists no global bijective rational parametrization.
In this connection, two more questions arise:
Q4. Does there exist a parametrization of a given variety, and how to find it?
Q5. Is there an implicit description of a parametrically defined variety?
The general answer to Q4 is negative. In fact, most affine varieties cannot be parametrized; or at least, there is no algorithm for finding a parametrization from the implicit description. It is clear at first sight that a given variety may admit many implicit and parametric descriptions. In the case of implicit descriptions, the variety is given by several “generating” polynomials, and there is clearly much freedom in their choice. Once a parametrization is found, it can be composed with any rational bijection on the parameters in order to obtain another one.
12.6.3. Ideals. In order to avoid the dependence on the chosen equations that define a variety, consider all consequences of the given equations. This leads to the following algebraic concept of subsets in rings (which play a role similar to that of normal subgroups):
Ideals
Definition. Let K be a commutative ring. A subset I ⊆ K is called an ideal if and only if 0 ∈ I and
f, g ∈ I ⟹ f + g ∈ I;  f ∈ I, h ∈ K ⟹ f · h ∈ I.
Since the definition contains only universal quantifiers (the properties are required for all elements in K or I), the intersection of two ideals is also an ideal. Consequently, ideals generated by subsets can be considered. Use the notation I = ⟨a1, . . . , an⟩. It is easy to prove that such an ideal is I = {∑_i a_i b_i; b_i ∈ K}, where only finite sums are considered.
(Check that this is the intersection of all ideals containing the set of generators!) The set of generators may be infinite, too. If there are only finitely many generators, the ideal is said to be finitely generated.
It is easy to verify that each variety defines an ideal in the ring of polynomials in the following way:
The ideal of a variety
For a variety V = V(f1, . . . , fs), set
I(V) := {f ∈ K[x1, . . . , xn]; f(a1, . . . , an) = 0 for all (a1, . . . , an) ∈ V}.
Lemma. Let f1, . . . , fs, g1, . . . , gt ∈ K[x1, . . . , xn] be polynomials. Then:
(1) if ⟨f1, . . . , fs⟩ = ⟨g1, . . . , gt⟩, then V(f1, . . . , fs) = V(g1, . . . , gt);
(2) I(V) is an ideal, and ⟨f1, . . . , fs⟩ ⊆ I(V).
Proof. If a point a = (a1, . . . , an) lies in the variety V(f1, . . . , fs), then any polynomial of the form f = h1 f1 + · · · + hs fs (i.e. any member of the ideal I = ⟨f1, . . . , fs⟩) vanishes at a. In particular, this means that all the polynomials gi vanish at a. Hence V(f1, . . . , fs) ⊆ V(g1, . . . , gt). The other inclusion is proved similarly. In order to verify the second proposition, choose g, g′ ∈ I(V) and h ∈ K[x1, . . . , xn]. Then, for any point a ∈ V, (gh)(a) = 0 and (g + g′)(a) = 0, so gh ∈ I(V) and g + g′ ∈ I(V). Hence I(V) is an ideal in K[x1, . . . , xn]. Moreover, any polynomial f = h1 f1 + · · · + hs fs ∈ ⟨f1, . . . , fs⟩ satisfies f(a) = 0 for every a ∈ V, which proves the desired inclusion. □
The simplest examples are trivial varieties – a single point and the entire affine space:
I({(0, 0, . . . , 0)}) = ⟨x1, . . . , xn⟩,  I(K^n) = {0}
for any infinite field K. The other inclusion in the second part of the lemma does not hold in general. For instance, the variety V(x², y²) contains only the single point (0, 0). This means that I(V) = ⟨x, y⟩ ⊃ ⟨x², y²⟩. If V, W ⊆ K^n are varieties, then V ⊆ W ⟹ I(V) ⊇ I(W). In other words, a polynomial which vanishes at each point of a given variety clearly vanishes at each point of any of the variety's subsets.
Now, further natural problems can be formulated:
Q6. Is every ideal I ⊆ K[x1, . . . , xn] finitely generated?
Q7. Is there an algorithm which decides whether f ∈ ⟨f1, . . . , fs⟩?
Q8. What is the precise relation between ⟨f1, . . . , fs⟩ and I(V(f1, . . . , fs))?
12.6.4. Dimension 1. Consider first univariate polynomials f = a0 x^n + a1 x^{n−1} + · · · + an, where a0 ≠ 0. The leading term of such a polynomial is defined to be LT(f) := a0 x^n. Clearly, deg f ≤ deg g ⟺ LT(f) | LT(g).
Let K be a field and g a non-zero polynomial. Every polynomial f ∈ K[x] can be written in a unique way as f = q · g + r, where r = 0 or deg r < deg g. In fact, the quotient q and the remainder r can be computed by the following algorithm:
(1) q := 0, r := f
(2) while r ≠ 0 ∧ LT(g) | LT(r):
(a) q := q + LT(r)/LT(g)
(b) r := r − (LT(r)/LT(g)) · g
When checking the loop condition, the invariant f = q · g + r holds, so the algorithm answers correctly as soon as the loop condition becomes false. Since the degree of r decreases in each step, the algorithm eventually terminates.
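The algorithm transcribes directly into code. A small Python version of ours (dense coefficient lists over the rationals, index = power of x):

def degree(f):
    return len(f) - 1

def divide(f, g):
    # returns (q, r) with f = q*g + r and deg r < deg g;
    # polynomials are coefficient lists, f[i] = coefficient of x^i
    q = [0.0] * max(len(f) - len(g) + 1, 1)
    r = f[:]
    while any(r) and degree(r) >= degree(g):
        shift = degree(r) - degree(g)
        coef = r[-1] / g[-1]            # LT(r)/LT(g)
        q[shift] += coef
        for i, c in enumerate(g):       # r := r - (LT(r)/LT(g)) * g
            r[i + shift] -= coef * c
        r.pop()                         # the leading term has cancelled
    return q, r

# (x^3 - 1) / (x - 1) = x^2 + x + 1 with remainder 0
q, r = divide([-1.0, 0.0, 0.0, 1.0], [-1.0, 1.0])
print(q, r)   # [1.0, 1.0, 1.0] [0.0]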
Corollary. Let K be a field. Then every ideal in the polynomial ring K[x] is of the form ⟨f⟩.
Proof. Consider an ideal I ⊆ K[x]. If I = {0}, then the ideal is generated by the zero polynomial. If I contains a non-zero polynomial f, then choose one with the lowest degree. Clearly ⟨f⟩ ⊆ I. For any polynomial g ∈ I, consider the Euclidean division of g by f, i.e. g = qf + r. Clearly, qf ∈ I, which means that r ∈ I as well. However, the degree of f is as small as possible, so r = 0. Therefore, g is a multiple of f, and I = ⟨f⟩. □
An ideal which is generated by a single element is called a principal ideal. A ring in which every ideal is principal, as in the corollary, is called a principal ideal domain.
Recall the Euclidean algorithm for the greatest common divisor h = GCD(f, g) of polynomials f and g (the variable h contains the desired greatest common divisor when the algorithm terminates):
(1) h := f, s := g
(2) while s ≠ 0:
(a) r := h mod s
(b) h := s
(c) s := r
To see that the algorithm is correct, let f = q · g + r and h = GCD(f, g). Then h divides both g and r; conversely, every p ∈ K[x] which divides both g and r also divides f, and hence divides h. Therefore, GCD(f, g) = GCD(g, r). Trivially, GCD(h, 0) = h, so the algorithm computes GCD(f, g) correctly. Since the degree of r is strictly decreasing, the algorithm eventually terminates.
It follows that each pair of polynomials has a greatest common divisor, and it is unique up to multiplication by a scalar. Indeed, if two polynomials are greatest common divisors of a given pair of polynomials, then they must divide each other, which is, for polynomials, possible only in the mentioned case. The greatest common divisor of more polynomials is defined recursively: if s > 2, then GCD(f1, . . . , fs) := GCD(f1, GCD(f2, . . . , fs)).
Lemma. Let f1, . . . , fs be polynomials. Then ⟨GCD(f1, . . . , fs)⟩ = ⟨f1, . . . , fs⟩.
Proof. GCD(f1, . . . , fs) divides all the polynomials fi. Hence the principal ideal ⟨GCD(f1, . . . , fs)⟩ contains the ideal ⟨f1, . . . , fs⟩. The other inclusion follows immediately from Bézout's identity. □
Earlier, eight questions were formulated. Here are some answers for dimension 1:
• Since V(f1, . . . , fs) = V(GCD(f1, . . . , fs)), the problem of emptiness of a given variety reduces to the problem of existence of a root of a single polynomial.
• For the same reason, each variety is a finite set of isolated points – the roots of the polynomial GCD(f1, . . . , fs) – except for the case GCD(f1, . . . , fs) = 0. This can happen only if f1 = f2 = · · · = fs = 0, and then the variety is the entire K.
• The concept of dimension is not of much interest in this case; each variety has dimension zero, being a discrete set of points.
• Each ideal can be generated by a single polynomial.
• f ∈ ⟨f1, . . . , fs⟩ ⟺ GCD(f1, . . . , fs) | f.
• Writing ⟨f⟩ := I(V(f1, . . . , fs)), the polynomials f and GCD(f1, . . . , fs) may differ only in the multiplicities of their roots.
12.6.5. Monomial ordering. In order to generalize the Euclidean division of polynomials to more variables, one must first find an appropriate analogy of the degree of a polynomial and of its leading term. The Euclidean division of a polynomial f ∈ K[x1, . . . , xn] by polynomials g1, . . . , gs should be an expression of the form f = a1 g1 + · · · + as gs + r, where no term of the remainder r is divisible by the leading term of any of the polynomials gi.
Try this with f = x²y + xy² + y², g1 = xy − 1, and g2 = y² − 1. The first division yields f = (x + y) · g1 + (x + y² + y). Now LT(y² − 1) does not divide x (the leading term of the remainder), so, theoretically, continuation is not possible. However, x can be moved into the remainder, thus obtaining the result f = (x + y) · g1 + g2 + (x + y + 1). No term of this remainder is divisible by either LT(g1) or LT(g2). How are the leading terms determined?
Monomial ordering
A monomial ordering on K[x1, . . . , xn] is a well-ordering < on N^n (every non-empty subset has a least element) which satisfies: for all α, β, γ ∈ N^n, α < β ⟹ α + γ < β + γ.
An ordering on N^n induces an ordering on monomials as soon as the order of the variables x1, x2, . . . , xn is fixed. Each polynomial can then be rearranged as a decreasing sequence of monomials (ignoring the coefficients for now). The following three definitions introduce the most common monomial orderings. Each ordering assumes that the order of the variables is fixed, usually x1 > x2 > · · · > xn.
Definition. Let α, β ∈ N^n.
The lexicographic ordering: α >lex β ⟺ the left-most non-zero entry of α − β is positive.
The graded lexicographic ordering: α >grlex β ⟺ |α| > |β|, or |α| = |β| and α >lex β.
The graded reverse lexicographic ordering: α >grevlex β ⟺ |α| > |β|, or |α| = |β| and the right-most non-zero entry of α − β is negative.
If x > y > z, then x >grevlex y >grevlex z, and x²yz² >grlex xy³z, yet x²yz² <grevlex xy³z. Verify that >lex, >grlex, >grevlex are indeed monomial orderings!
12.6.6. Multivariate division with remainder. Consider a non-zero polynomial f = ∑_{α∈N^n} a_α x^α in K[x1, . . . , xn] and a monomial ordering <. Then define the degree, leading coefficient, leading monomial, and leading term of f as follows:
• multideg f := max{α ∈ N^n; a_α ≠ 0},
• LC f := a_{multideg f},
• LM f := x^{multideg f},
• LT f := LC f · LM f.
Of course, these concepts depend on the underlying monomial ordering.
Lemma. Let f, g ∈ K[x1, . . . , xn] and < be a monomial ordering. Then
(1) multideg(f · g) = multideg f + multideg g,
(2) f + g ≠ 0 ⟹ multideg(f + g) ≤ max{multideg f, multideg g}.
Proof. Both claims are straightforward corollaries of the definitions. □
Theorem. Let < be a monomial ordering and F = (f1, . . . , fs) an s-tuple of polynomials in K[x1, . . . , xn]. Then every polynomial f ∈ K[x1, . . . , xn] can be expressed as f = a1 f1 + · · · + as fs + r, where ai, r ∈ K[x1, . . . , xn] for all i = 1, 2, . . . , s. Moreover, either r = 0 or r is a linear combination of monomials none of which is divisible by any of LT f1, . . . , LT fs; and if ai fi ≠ 0, then multideg f ≥ multideg(ai fi) for each i. The polynomial r is called the remainder of the multivariate division f/F.
Proof. The theorem says nothing about the uniqueness of the result. The following algorithm produces a possible solution and thus proves the theorem. In the sequel, consider the output of this algorithm to be the result of the division.
(1) a1 := 0, . . . , as := 0, r := 0, p := f
(2) while p ≠ 0:
(a) i := 1
(b) d := false
(c) while i ≤ s ∧ not d:
(i) if LT fi | LT p, then ai := ai + LT p/LT fi; p := p − (LT p/LT fi) · fi; d := true
(ii) else i := i + 1
(d) if not d:
(i) r := r + LT p
(ii) p := p − LT p
In every iteration of the outer loop, exactly one of the commands 2(c)i, 2(d)ii is executed, so the degree of p decreases. Therefore, the algorithm eventually terminates. When checking the loop condition, the invariant f = a1 f1 + · · · + as fs + p + r holds, and each term of each ai is a quotient LT p/LT fi from some moment of the computation. The degrees of the corresponding products with fi equal the degree of p at that moment, which is at most the degree of f. Altogether, the degree of each ai fi is at most the degree of f. □
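Rather than re-implementing the loop, one can let a computer algebra system perform the multivariate division. A sketch of ours, assuming sympy is available: its function reduced returns the quotients ai and the remainder r, and it reproduces the example discussed right below (the assert holds regardless of the exact quotients returned).

from sympy import symbols, reduced

x, y = symbols('x y')
f  = x*y**2 - x
f1 = x*y + 1
f2 = y**2 - 1

# divide f by the ordered tuple (f1, f2) with the lex ordering, x > y
quotients, r = reduced(f, [f1, f2], x, y, order='lex')
print(quotients, r)   # expected: [y, 0] and -x - y
assert f.expand() == (sum(q*g for q, g in zip(quotients, [f1, f2])) + r).expand()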
In the ring K[x1, . . . , xn], the following implication clearly holds: f = a1 f1 + · · · + as fs + 0 ⟹ f ∈ ⟨f1, . . . , fs⟩. However, the converse is generally not true for multivariate division. Consider f = xy² − x, f1 = xy + 1, f2 = y² − 1. The algorithm outputs f = y(xy + 1) + 0 · (y² − 1) + (−x − y), but f = x(y² − 1), so that f ∈ ⟨f1, f2⟩.
The next goal is to find some distinguished generators of the ideal I = ⟨f1, . . . , fs⟩ which would behave better. In a certain sense, this is a procedure similar to the Gaussian elimination of variables for systems of linear equations. Begin with some special assumptions about the ideals.
12.6.7. Monomial ideals. An ideal I ⊆ K[x1, . . . , xn] is called monomial if and only if there is a set of multi-indices A ⊆ N^n such that I is generated by the monomials x^α with α ∈ A. This means that all polynomials in I are of the form ∑_{α∈A} h_α x^α, where h_α ∈ K[x1, . . . , xn]. Clearly, for a monomial ideal I, we have x^β ∈ I if and only if there exists an α ∈ A such that x^α divides x^β.
Lemma. Let I ⊆ K[x1, . . . , xn] be a monomial ideal and f ∈ K[x1, . . . , xn] a polynomial. Then the following propositions are equivalent:
(1) f ∈ I;
(2) each term of f lies in I;
(3) the polynomial f is a linear combination of monomials from I with coefficients from K.
Proof. The implications (3) ⟹ (2) ⟹ (1) are obvious. It remains to prove (1) ⟹ (3). Write the polynomial f as f = ∑_α a_α x^α, where a_α ∈ K. It follows from the assumption f ∈ I that f = ∑_{β∈A} h_β x^β, where x^β ∈ I and h_β ∈ K[x1, . . . , xn]. Each term a_α x^α must equal the sum of the terms of the same multidegree on the right-hand side; hence it can be expressed as a sum of expressions d·x^{β+δ}, where d ∈ K and x^β ∈ I. However, this means that x^α ∈ I, so that (3) holds. □
Corollary. Two monomial ideals coincide if and only if they contain the same monomials.
The following theorem goes much further. It says that every monomial ideal is finitely generated and, moreover, the finite set of generators may be chosen from any given set of generators.
12.6.8. Theorem (Dickson's lemma). Every monomial ideal I = ⟨x^α; α ∈ A⟩ ⊆ K[x1, . . . , xn] can be written in the form I = ⟨x^{α1}, . . . , x^{αs}⟩, where α1, . . . , αs ∈ A.
Proof. Proceed by induction on the number of variables. If n = 1, then I ⊆ K[x], I = ⟨x^α; α ∈ A ⊆ N⟩. The set of exponents A has a minimum; denote it by β := min A. Then x^β divides each monomial x^α with α ∈ A, so I = ⟨x^β⟩.
Now suppose n > 1 and assume that the proposition is true for fewer variables. Denote the variables x1, . . . , x_{n−1}, y, and write monomials in the form x^α y^m, where α ∈ N^{n−1}, m ∈ N. Suppose that I ⊆ K[x1, . . . , x_{n−1}, y] is monomial, and define J ⊆ K[x1, . . . , x_{n−1}] by J := ⟨x^α; ∃m ∈ N, x^α y^m ∈ I⟩. Clearly, J is a monomial ideal in n − 1 variables, so by the induction hypothesis, J = ⟨x^{α1}, . . . , x^{αs}⟩. It follows from the definition of J that there are minimal integers mi ∈ N such that x^{αi} y^{mi} ∈ I. Denote m := max{mi} and define an analogous system of ideals Jk ⊆ K[x1, . . . , x_{n−1}] for 0 ≤ k ≤ m − 1: Jk := ⟨x^β; x^β y^k ∈ I⟩. Again, all the ideals Jk satisfy the induction hypothesis, so they can be expressed as Jk = ⟨x^{α_{k,1}}, . . . , x^{α_{k,s_k}}⟩. It remains to show that I is generated by the following finite set of monomials:
x^{α1} y^m, . . . , x^{αs} y^m,
x^{α_{0,1}} y^0, . . . , x^{α_{0,s_0}} y^0,
. . .
x^{α_{m−1,1}} y^{m−1}, . . . , x^{α_{m−1,s_{m−1}}} y^{m−1}.
Consider a monomial x^α y^p ∈ I. One of the following cases occurs:
• p ≥ m. Then x^α ∈ J, so one of x^{α1} y^m, . . . , x^{αs} y^m divides x^α y^p.
• p < m. Then, analogously, x^α ∈ J_p, and one of x^{α_{p,1}} y^p, . . . , x^{α_{p,s_p}} y^p divides x^α y^p.
By the previous lemma, each polynomial f ∈ I can be expressed as a linear combination of monomials from I. Each of these is divisible by one of the above generators; hence f lies in the ideal generated by them, and I is contained in that ideal. The other inclusion is trivial, which completes the proof of Dickson's lemma. □
12.6.9. Hilbert's theorem. Everything is now at hand for the discussion of ideal bases in polynomial rings. The main idea is the maximal utilization of the information about the leading terms among the generating polynomials and in the ideal. For a nonzero ideal I ⊆ K[x1, . . . , xn], denote LT I := {a x^α; ∃f ∈ I : LT f = a x^α}. Clearly, ⟨LT I⟩ is a monomial ideal, so by Dickson's lemma, ⟨LT I⟩ = ⟨LT g1, . . . , LT gs⟩ for appropriate g1, . . . , gs ∈ I.
Theorem. Every ideal I ⊆ K[x1, . . . , xn] is finitely generated.
Proof. The statement is trivial for I = {0}. So suppose I ≠ {0}. By Dickson's lemma and the above note, there are g1, . . . , gs ∈ I such that ⟨LT I⟩ = ⟨LT g1, . . . , LT gs⟩. Clearly, ⟨g1, . . . , gs⟩ ⊆ I. Choose any polynomial f ∈ I and divide it by the s-tuple g1, . . . , gs. This yields f = a1 g1 + · · · + as gs + r, where no term of r is divisible by any of LT g1, . . . , LT gs. Since r = f − a1 g1 − · · · − as gs, we have r ∈ I, and also LT r ∈ LT I. This means that LT r ∈ ⟨LT I⟩. Suppose that r ≠ 0. Since ⟨LT I⟩ is monomial, LT r must be divisible by one of its generators, i.e. by one of LT g1, . . . , LT gs. This contradicts the result of the multivariate division algorithm. Therefore, r = 0 and I is generated by g1, . . . , gs. □
12.6.10. Gröbner bases. The basis used in the proof of Hilbert's theorem has the properties stated in the following definition:
Gröbner bases of ideals
Definition. A finite set of generators g1, . . . , gs of an ideal I ⊆ K[x1, . . . , xn] is called a Gröbner basis if and only if ⟨LT I⟩ = ⟨LT g1, . . . , LT gs⟩.
Corollary. Every ideal I ⊆ K[x1, . . . , xn] has a Gröbner basis. Every set of polynomials g1, . . . , gs ∈ I such that ⟨LT I⟩ = ⟨LT g1, . . . , LT gs⟩ is a Gröbner basis of I.
Example. Return to the remark on the similarity with the Gaussian elimination of variables for systems of linear equations. That is, illustrate the general results above on the simplest case of polynomials of degree one with the lexicographic ordering. Denote the generators fi = ∑_j a_{ij} x_j + a_{i0}. Consider the matrix A = (a_{ij}), where i = 1, . . . , s and j = 0, . . . , n, and apply the Gaussian elimination to it. This gives a matrix B = (b_{ij}) in echelon form, from which zero rows can be omitted. Hence there is a new basis g1, . . . , gt, where t ≤ s. Due to the performed steps, each fi can be expressed as a linear combination of g1, . . . , gt, which means that ⟨f1, . . . , fs⟩ = ⟨g1, . . . , gt⟩.
Now, verify that these polynomials g1, . . . , gt form a Gröbner basis. Without loss of generality, assume that the variables are labeled so that LM gi = xi for i = 1, . . . , t. Any polynomial f ∈ I can be written as f = h1 f1 + · · · + hs fs = h′1 g1 + · · · + h′t gt. It is required that LT f ∈ ⟨LT g1, . . . , LT gt⟩, that is, that LT f is divisible by one of x1, . . . , xt. Suppose, on the contrary, that f contains only the variables x_{t+1}, . . . , xn. Then h′1 = 0, since x1 occurs only in g1 by the echelon form of B. Analogously, h′2 = · · · = h′t = 0, and so f = 0.
The existence of these very special bases is now proved. However, they cannot yet be constructed algorithmically. This is the goal of the following subsections.
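In practice, one simply asks a computer algebra system for a Gröbner basis. For instance, the basis from exercise 12.K.13 can be reproduced as follows (a sketch of ours, assuming sympy; note that sympy normalizes the generators to be monic, so 2yz − 1 appears as yz − 1/2):

from sympy import symbols, groebner

w, x, y, z = symbols('w x y z')
F = [x*y + y*z - 1, y*z + z*w - 1, z*w + w*x - 1, w*x + x*y - 1]

# graded lexicographic ordering with w > x > y > z
G = groebner(F, w, x, y, z, order='grlex')
print(list(G.exprs))   # expected: a basis equivalent to (w - y, x - z, 2*y*z - 1)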
12.6.11. Theorem. Let G = {g1, . . . , gt} be a Gröbner basis of an ideal I ⊆ K[x1, . . . , xn] and f a polynomial in K[x1, . . . , xn]. Then there is a unique r = ∑_α a_α x^α ∈ K[x1, . . . , xn] such that:
(1) no term of r is divisible by any of LT g1, . . . , LT gt, i.e. ∀α ∀i: LT gi ∤ a_α x^α;
(2) ∃g ∈ I : f = g + r.
Proof. The algorithm for multivariate division produces f = a1 g1 + · · · + at gt + r, where r satisfies condition (1). Select g as a1 g1 + · · · + at gt, which of course lies in I. It remains to prove uniqueness. Suppose that f = g + r = g′ + r′, where r ≠ r′. Clearly, r − r′ = g′ − g ∈ I. Since G is a Gröbner basis, LT(r − r′) is divisible by one of LT g1, . . . , LT gt. There are two possibilities:
• LM r ≠ LM r′. Then the one of them with the higher degree equals LM(r − r′), so it must be divisible by one of LT g1, . . . , LT gt, which contradicts condition (1).
• LM r = LM r′ and LC r ≠ LC r′. Then both the monomials LM r and LM r′ equal LM(r − r′), so they must be divisible by one of LT g1, . . . , LT gt, which is again a contradiction.
It follows that LT r = LT r′, and the same argument applied inductively to the lower terms shows that r = r′. □
The previous theorem generalizes the Euclidean division, with an ideal playing the role of the divisor. In the univariate case, this is no generalization, since every ideal is generated by a single polynomial. If only the remainder is of interest, the order of the polynomials in the Gröbner basis does not matter. Hence it makes sense to introduce the notation f̄^G for the remainder in the division f/G, provided G = (g1, . . . , gs) is a Gröbner basis.
Corollary. Let G = {g1, . . . , gt} be a Gröbner basis of an ideal I ⊆ K[x1, . . . , xn] and f a polynomial in K[x1, . . . , xn]. Then f ∈ I if and only if the remainder f̄^G is zero.
12.6.12. Syzygies. The next step is to find a sufficient “testing set” of polynomials of a given ideal which allows us to verify whether the considered system is a Gröbner basis. Again, we wish to test this by means of multivariate division only. For α = multideg f and β = multideg g, consider γ := (γ1, . . . , γn), where γi = max{αi, βi}. The monomial x^γ is called the least common multiple of the monomials LM f and LM g, denoted LCM(LM f, LM g) := x^γ. The expression
S(f, g) := (x^γ / LT f) · f − (x^γ / LT g) · g
is called the S-polynomial (also syzygy, or pair) of the polynomials f, g. This is a tool for the elimination of leading terms. The Gaussian elimination is a special case of this procedure for polynomials of degree one. However, during the general procedure, it may happen that the degrees of the resulting polynomials are higher, even though the original leading terms are removed. For instance, consider the polynomials f = x³y² − x²y³ + x and g = 3x⁴y + y², of degree 5 in R[x, y] with the lexicographic ordering x >lex y.
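A sketch of ours computing the S-polynomial of these two polynomials with sympy (the top-level functions LT and lcm are standard sympy); note how the degree grows from 5 to 6 even though the leading terms cancel.

from sympy import symbols, LT, lcm, expand

x, y = symbols('x y')
f = x**3*y**2 - x**2*y**3 + x
g = 3*x**4*y + y**2

ltf = LT(f, x, y, order='lex')        # x**3*y**2
ltg = LT(g, x, y, order='lex')        # 3*x**4*y
m = lcm(x**3*y**2, x**4*y)            # LCM of the leading monomials: x**4*y**2
S = expand(m/ltf * f - m/ltg * g)     # the leading terms cancel
print(S)                              # -x**3*y**3 + x**2 - y**3/3, of degree 6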
Theorem (Elimination of variables). Let G be a Gröbner basis of an ideal I ⊆ K[x1, . . . , xn] with respect to the lexicographic ordering x1 >lex x2 >lex · · · >lex xn. Then, for each p = 0, . . . , n, Gp := G ∩ K[x_{p+1}, . . . , xn] is a Gröbner basis for the elimination ideal Ip := I ∩ K[x_{p+1}, . . . , xn]. If G is minimal or reduced, then Gp is again minimal or reduced, respectively.
Proof. Without loss of generality, assume that Gp = {g1, . . . , gr}. Since G ⊆ I, it follows that Gp ⊆ Ip, so the inclusion ⟨Gp⟩ ⊆ Ip is trivial. For the other inclusion, it needs to be verified for each polynomial f ∈ Ip that f = h1 g1 + · · · + hr gr. To do this, perform the multivariate division by the original Gröbner basis G. Since f ∈ I, it follows that f̄^G = 0, i.e. f = h1 g1 + · · · + hr gr + h_{r+1} g_{r+1} + · · · + hm gm. Each of the polynomials g_{r+1}, . . . , gm must contain at least one of the variables x1, . . . , xp; otherwise it would lie in Gp. By the properties of the lexicographic ordering, this variable must then also be contained in LT g_{r+1}, . . . , LT gm. Recalling the individual steps of the algorithm for multivariate division, and the fact that f contains no monomial with any of x1, . . . , xp, we obtain h_{r+1} = · · · = hm = 0, thus verifying that f ∈ ⟨Gp⟩. This proves not only the desired inclusion, but also that on Ip, the division f/G gives the same result as f/Gp. For 1 ≤ i < j ≤ r, consider the S-polynomials S(gi, gj); their remainders satisfy S(gi, gj)‾^{Gp} = S(gi, gj)‾^{G} = 0, so Gp is a Gröbner basis of Ip. It is clear that the property of the basis being minimal or reduced is preserved. □
The only property of the lexicographic ordering used in the proof is that if a variable occurs in a polynomial, then it occurs in its leading term as well. However, this condition is much weaker than the full lexicographic ordering. Therefore, in actual implementations, one may use any ordering with the mentioned property. This usually leads to more efficient computations, since the pure lexicographic ordering tends to cause an unpleasant increase of the polynomials' degrees.
12.6.18. Back to parametrized varieties. The above theorem suggests an algorithm for finding an implicit representation of a variety defined in terms of a polynomial parametrization. The tools necessary for working with the smallest varieties containing the points given by a parametrization are not available here, so a detailed discussion is omitted. When the parametrization of a variety is given by polynomial relations x1 = f1(u1, . . . , uk), . . . , xn = fn(u1, . . . , uk), the reduced Gröbner basis of the ideal ⟨x1 − f1, . . . , xn − fn⟩ can be computed in the lexicographic ordering where each ui precedes each xj. From this basis, the reduced Gröbner basis of the elimination ideal Ik is obtained. This is precisely the required ideal together with its implicit representation. It suffices to use any ordering which guarantees that each ui comes before each xj, so that the computation of the Gröbner basis eliminates the ui; otherwise the ordering may be arbitrary. There is a chance that this gives a more efficient computation than the pure lexicographic ordering.
When the parametrization is rational, i.e. xi = fi(t1, . . . , tm)/gi(t1, . . . , tm), it is perhaps natural to think of substituting the ideal ⟨x1 g1 − f1, . . . , xn gn − fn⟩ into the above theorem. However, the result of this is usually not good. For instance, consider x = u²/v, y = v²/u, z = u. Here, I = ⟨vx − u², uy − v², z − u⟩, and the elimination yields I2 = ⟨z(x²y − z³)⟩. However, the correct result is V(x²y − z³): the computation has added an entire plane. The problem is that the entire variety of zero points of the denominators in the parametrizations of the individual variables is included in W = V(g1 · · · gn). Instead, perceive the parametrization F as a mapping F : (K^m − W) → K^n. For the implicitization, use the ideal I = ⟨g1 x1 − f1, . . . , gn xn − fn, 1 − g1 · · · gn y⟩ ⊆ K[y, t1, . . . , tm, x1, . . . , xn], where the additional variable y enables the avoidance of the zero sets of the denominators. It can be shown that V(I_{m+1}) is the minimal affine variety which contains F(K^m − W).
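The elimination recipe, including the extra variable killing the denominators, can be tried in sympy on the example above (a sketch of ours; we keep only the basis elements free of the auxiliary variables, and the expected outputs are the ones stated in the text):

from sympy import symbols, groebner

t, u, v, x, y, z = symbols('t u v x y z')

# naive ideal for x = u^2/v, y = v^2/u, z = u: picks up the extra plane z = 0
naive = groebner([v*x - u**2, u*y - v**2, z - u], u, v, x, y, z, order='lex')
print([g for g in naive.exprs if not g.has(u, v)])
# expected: [x**2*y*z - z**4], i.e. z*(x**2*y - z**3)

# corrected ideal: 1 - (u*v)*t forces the denominators u, v to be nonzero
fixed = groebner([v*x - u**2, u*y - v**2, z - u, 1 - u*v*t],
                 t, u, v, x, y, z, order='lex')
print([g for g in fixed.exprs if not g.has(t, u, v)])
# expected: [x**2*y - z**3]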
Like number theory, combinatorics is a field of mathematics where the problems can often be formulated very easily. On the other hand, solutions can be much more difficult to find. We begin with graph theory, and display a collection of useful algorithms based on this theory. At the end of the chapter methods of combinatorial computations are considered. 1. Elements of Graph theory 13.1.1. Two examples. Several people come to a party; some pairs of people know each other, while other people know nobody. (Acquaintance is assumed to be symmetric). How many people must be there in order to guarantee that there are either three people who all know each other, or there are three people with no mutual acquaintance? Such situations can be aptly illustrated by a diagram. The points (or vertices) stand for the particular people of the party, the full lines represent pairs who know one another, while the dashed lines stand for pairs who do not know one another. Note that every pair of vertices is connected by either a full or a dashed line. The question is now reformulated as: how many vertices must be there in order that either there is a triangle whose sides are all full or a triangle whose sides are all dashed? 00 0000 11 1111 0000 00 1111 11 0000 00 1111 11000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000 111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000 111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 111111111111111111111111111111111111111111111111111111111111 00 0000 11 1111 0000000000 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000 11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 0011 000 111 0011 0011 00 0 11 100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 00000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000000000 111111111111111111111111111111 11111111111111111111 
There is no such triangle in the left-hand diagram with four vertices. The example of a regular pentagon, in which all the outside edges are full while all the diagonals are dashed (draw a picture!), shows that at least six vertices are required. Such a triangle always exists if the number of vertices is at least six. To show this, consider a set of six vertices, each pair of which is joined by either a dashed line or a full line. If v is one of the vertices, then it is joined to the others by five lines. At least three of these lines are of one type; without loss of generality, let the lines joining v to the vertices vA, vB, vC be full. Then either the triangle formed by the vertices vA, vB, vC contains only dashed lines, in which case it is the desired triangle, or one of its edges is full, in which case this edge together with v forms a full triangle.

A. Fundamental concepts

One of the motives for creating graph theory was the visualization of certain problems concerning relations. A human brain likes thinking about entities it can imagine. Therefore, we like representing a binary relation with a graph whose vertices correspond to the elements, while the edges (lines between the elements) correspond to the fact that the given pair is related. Optionally, we can encode a relation in a more complicated way – using a Hasse diagram (see 12.1.8), for instance. Partially ordered sets are almost always depicted this way. The relation of friendship or acquaintance between people can also be translated to graphs, which gives rise to a good deal of "relaxing" problems.

13.A.1. Prove that the number of odd-degree vertices in an undirected graph is always even. Solution. Let G = (V, E) be an arbitrary undirected graph, and write d(v) for the degree of a vertex v ∈ V. Adding up the degrees of all the vertices, we count every edge twice; therefore, ∑_{v∈V} d(v) = 2|E|. Let V1 ⊂ V be the set of vertices with odd degrees and V2 ⊂ V the set of vertices with even degrees. Then

2|E| = ∑_{v∈V} d(v) = ∑_{v∈V1} d(v) + ∑_{v∈V2} d(v),

so ∑_{v∈V1} d(v) = 2|E| − ∑_{v∈V2} d(v) is an even number. A sum of |V1| odd numbers can be even only if |V1| is even. □

As another example, consider a black box which consumes one bit after another and shines blue or red according to whether the last bit was zero or one. Imagine this could be a light over a toilet door, recognizing whether the last person came out (0) or went in (1). Again, this scheme can be illustrated by a diagram:

(Diagram: three states S, BLUE, RED; every arrow labeled 0 leads to BLUE, every arrow labeled 1 leads to RED.)

The third vertex, which has only two outgoing arrows, represents the beginning of the system (before the first bit is sent).
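For illustration, the black box can be written down directly as a transition table and simulated. A minimal sketch in Python; the state names and the function name are our own choice:

    # the three states of the diagram; on input 0, every state switches to
    # BLUE, and on input 1 to RED
    transitions = {
        ('S', 0): 'BLUE', ('S', 1): 'RED',
        ('BLUE', 0): 'BLUE', ('BLUE', 1): 'RED',
        ('RED', 0): 'BLUE', ('RED', 1): 'RED',
    }

    def light_after(bits):
        state = 'S'  # the initial state, before the first bit is sent
        for b in bits:
            state = transitions[(state, b)]
        return state

    print(light_after([0, 1, 1, 0]))  # BLUE, since the last bit was 0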
Both situations share the same scheme: there is a finite set of objects represented by vertices, and a set of their properties represented by connecting lines between particular vertices. The scheme can be modified by distinguishing the directions of the connecting lines by arrows. Such a situation can be described in terms of relations; see the text from subsection 1.6.1 onwards in the sixth part of chapter one. But this is a complicated terminology for describing simple situations: in the first case, there is one set of people with two complementary symmetric and non-reflexive relations; in the second case, there are two antisymmetric relations on three elements.

13.1.2. Fundamental concepts of graphs. We use the terminology which corresponds to the latter diagrams.

Graphs and directed graphs

Definition. A graph (also an undirected graph) is a pair G = (V, E), where V is the set of its vertices and E is a subset of the set \binom{V}{2} of all 2-element subsets of V. The elements of E are called the edges of the graph.

The vertices of an edge e = {v, w}, v ≠ w, are called the endpoints of e. An endpoint of an edge is said to be incident to that edge. Two edges which share a vertex are called adjacent, and so are two vertices which are the endpoints of a common edge.

13.A.2. Does there exist a graph with the degree sequence (3, 3, 2, 2, 2, 1)? Solution. In this sequence, the number of odd-degree vertices equals 3, which is odd; by 13.A.1, such a graph is impossible. □

13.A.3. Let G be a graph with minimum degree d(G) > 1. Prove that G contains a cycle of length at least d(G) + 1. Solution. Let v1 . . . vk be a maximal path in G, i.e., a path that cannot be extended. Then every neighbour of v1 must lie on the path, since otherwise we could extend it. Since v1 has at least d(G) neighbours, the set {v2, . . . , vk} must contain at least d(G) elements, hence k ≥ d(G) + 1. Now, the neighbour of v1 that is furthest from v1 along the path is some vi with i ≥ d(G) + 1, and v1 . . . vi v1 forms a cycle of length at least d(G) + 1. □

13.A.4. Show that any graph with |V| ≥ 2 contains at least two vertices of equal degree. Solution. We use the pigeonhole principle. Let |V| = n. The degree d(v) of a vertex v ∈ V can take values from 0 to n − 1, i.e., n possible distinct values. However, if some vertex has degree n − 1, which means that it is connected to every other vertex, then no degree can be 0. Thus, the degrees take at most n − 1 distinct values, and by the pigeonhole principle, among the n values of d(v), at least two coincide. □

13.A.5. Show that if n people attend a party and some shake hands with others, then at the end, there are at least two people who have shaken hands with the same number of people. Solution. Let G be the graph whose vertices are the people, with an edge between two of them whenever they shake hands. By the previous problem, there are at least two vertices of equal degree. □
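The statements of 13.A.1 and 13.A.4 are easy to test numerically. A minimal sketch in Python over a random graph; the number of vertices n = 8 and the edge probability 0.5 are arbitrary choices of ours:

    import random
    from itertools import combinations

    n = 8
    edges = [e for e in combinations(range(n), 2) if random.random() < 0.5]

    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1

    assert sum(deg) == 2 * len(edges)        # every edge is counted twice
    assert sum(d % 2 for d in deg) % 2 == 0  # 13.A.1: evenly many odd degrees
    assert len(set(deg)) < n                 # 13.A.4: two equal degrees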
13.A.6. Show that if every connected component of a graph is bipartite, then the graph is bipartite. Solution. A graph is bipartite if we can divide its vertices into two subsets A and B such that every edge of the graph connects a vertex in A to a vertex in B. If the vertex sets of the components are divided into such sets Ai and Bi, we set A = ∪iAi and B = ∪iBi. The subsets A and B endow the whole graph with a bipartite structure. □

A directed graph is a pair G = (V, E), where V is as above, but now E ⊆ V × V. The first of the vertices that define an edge e = (v, w) is called the tail of the edge, and the other vertex is called its head. From the vertices' point of view, e is an outgoing edge of v and an ingoing edge of w. Directed edges are also called arcs or arrows. The head and the tail of a directed edge may be the same vertex; such an edge is called a loop. Two directed edges are called consecutive if the tail of one of them is the head of the other. Similarly, two vertices which are the head and the tail of an edge are called consecutive.

To every directed graph G = (V, E), its symmetrization can be assigned. This is an undirected graph with the same set of vertices as G; it contains an edge e = {v, w} if and only if at least one of the edges e′ = (v, w) and e′′ = (w, v) belongs to E.

Graph theory provides an extraordinarily good language for thinking about procedures and deriving properties that concern finite sets of objects. Graphs are a good example of a compromise between the tendency to "think in diagrams" and precise mathematical formulations. The language of graph theory allows adding information about the vertices or edges in particular problems. For instance, the vertices of a graph can be "coloured" according to the membership of the corresponding objects in several (pairwise disjoint) classes. Or the edges can be labeled with several values, and so on. The existence of an edge between differently coloured vertices can indicate a "conflict". For example, if the vertices are coloured red and blue according to membership in two groups of people with different interests, and the edges represent adjacency at a dining table, then an edge connecting two differently coloured vertices can mean a potential conflict.

Our first example from the previous subsection can thus be perceived as a graph with coloured edges. The statement verified there reads, in the language of graph theory: the graph Kn = (V, \binom{V}{2}) with n ≥ 6 vertices and all possible edges, labeled with two colours, always contains a triangle whose sides are of the same colour.

The directed graph in the second example above, whose edges are labeled with zero or one, represents a simple finite automaton. This name reflects the idea that the graph describes a process which is, at any moment, in a state represented by the corresponding vertex, and which changes to another state in a step represented by one of the outgoing edges of that vertex. The theory of finite automata is not considered here.

13.1.3. Examples of useful graphs. The simplest graphs are those which contain no edges; there is no special notation for them. At the other extreme is a graph which contains all possible edges. This is called a complete graph, denoted by Kn, where n is the number of its vertices. The graphs K4 and K6 are presented in the introductory subsection.

13.A.7. Show that a graph is bipartite if and only if it contains no cycles of odd length. Solution. Suppose there are no cycles of odd length in the graph G. Choose any vertex of the graph and put it in set A. Follow every edge from that vertex and put all the vertices at the other end in set B, marking the vertices already used. Now, for every vertex in B, follow all the edges from it and put the vertices at the other end in A, again marking the used vertices. Alternate back and forth in this manner until we can no longer proceed. This happens either when we exhaust all the vertices, or when we encounter a vertex that is already in one set and should be moved to the other. In the latter case, there is a closed walk of odd length through that vertex, and hence a cycle of odd length. If the graph is not connected, there may still be vertices that have not been assigned; we repeat the same process until all vertices are assigned either to set A or to set B. Thus, the graph G is bipartite. Suppose now that the graph G is bipartite, and let c be a cycle vi1 . . . vik of length k. The vertices along this cycle alternate between the subsets A and B of V, and the cycle returns to its first vertex. Therefore, there is the same number of A-vertices and B-vertices on c, and so k is even. □
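The alternating construction of the sets A and B in this solution is exactly breadth-first 2-colouring. A minimal sketch in Python (the function name is ours); it returns None as soon as an odd cycle is detected:

    from collections import deque

    def two_colouring(n, edges):
        adj = [[] for _ in range(n)]
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        colour = [None] * n
        for start in range(n):                # handle every component
            if colour[start] is not None:
                continue
            colour[start] = 0
            queue = deque([start])
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if colour[v] is None:
                        colour[v] = 1 - colour[u]
                        queue.append(v)
                    elif colour[v] == colour[u]:
                        return None           # odd cycle: not bipartite
        return colour

    print(two_colouring(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # [0, 1, 0, 1]
    print(two_colouring(3, [(0, 1), (1, 2), (2, 0)]))          # None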
13.A.8. Show that any tree with at least two vertices is bipartite. Solution. Since a tree does not contain any cycles, it contains no cycles of odd length in particular. Therefore, it is bipartite by 13.A.7. □

13.A.9. Prove that if u is a vertex of odd degree in a graph, then there exists a path from u to some other vertex v of odd degree. Solution. We build a trail, i.e., a walk which does not reuse any edge; as we build it, we erase the edges already used. Begin at the vertex u and select an arbitrary edge emanating from it. If, at any point, the trail reaches a vertex of odd degree, we are done. Each time we arrive at a vertex t of even degree, we are guaranteed that there is another edge along which we can leave; passing through t erases two edges at t, so the number of unused edges at t remains even, and there is always a way to continue. (The trail can never get stuck at u either: the initial departure used one edge, so whenever we arrive at u, an odd, hence positive, number of unused edges remains there.) Since there are only finitely many edges, the trail must end eventually, and the only way it can end is by arriving at a vertex v of odd degree. Finally, a trail from u to v contains a path from u to v. □

13.A.10. If the distance d(u, v) between two vertices u and v that can be connected by a path in a graph is defined to be the length of the shortest path connecting them, show that the distance function satisfies the triangle inequality: d(u, v) + d(v, w) ≥ d(u, w). Solution. Concatenating a shortest path from u to v with a shortest path from v to w, one obtains a walk from u to w of length d(u, v) + d(v, w), and this walk contains a path from u to w of no greater length. Since d(u, w) is the minimum of the lengths of all paths from u to w, the triangle inequality follows. □

The graph K3 is called a triangle. An important type of graph is a path. This is a graph whose vertices can be ordered as (v0, . . . , vn) so that E = {e1, . . . , en}, where ei = {vi−1, vi} for all i = 1, . . . , n. A path graph of length n is denoted by Pn. If the first and the last vertices coincide (for n ≥ 3), the graph is called a cycle graph of length n, denoted by Cn. The graphs K3 = C3, C5, and P5 are shown in the following diagram.

(Diagram: the graphs K3 = C3, C5, and P5.)
Another type of graph is the complete bipartite graph. Its vertices can be coloured with two (distinct) colours so that all possible edges between vertices of different colours are present, but no other edges. Such a graph is denoted by Km,n, where m and n are the numbers of vertices of the particular colours. The diagram below illustrates the graphs K1,3, K2,3, and K3,3.

(Diagram: the complete bipartite graphs K1,3, K2,3, and K3,3.)

Another interesting example of a graph is the hypercube Hn in dimension n, whose vertices are the integers 0, . . . , 2^n − 1 and whose edges join those pairs of vertices whose binary expansions differ in exactly one bit. The diagram below depicts the hypercube H4, with the labels of the vertices indicated. From the definition it follows that a hypercube of a given dimension can always be composed from two hypercubes of dimension one lower, connected by edges in an appropriate way. These new edges between the two disjoint copies of H3 are the dashed ones in the diagram. The hypercube H4 can be decomposed in this way with respect to any fixed bit position (in the diagram, it is the very first position).

(Diagram: the hypercube H4 with the vertex labels 0000, 0001, 0010, 0100, 0110, 1110, 1111, 1011, etc.; the dashed edges connect the two copies of H3.)
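The one-bit-difference rule makes hypercubes easy to generate: the vertices u and u XOR 2^i are adjacent for every bit position i. A minimal sketch in Python (the function name is ours):

    def hypercube_edges(n):
        # vertices are 0, ..., 2^n - 1; flipping the i-th bit of u gives
        # a neighbour; the condition u < u ^ (1 << i) counts each edge once
        return [(u, u ^ (1 << i))
                for u in range(2 ** n)
                for i in range(n)
                if u < u ^ (1 << i)]

    print(len(hypercube_edges(4)))  # 32 edges in H_4, i.e. 4 * 2^4 / 2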
Here are two more examples. The first is the cycle ladder graph CLn with 2n vertices. This consists of two cycle graphs Cn whose vertices are connected by edges according to their order in the cycles. The second is the Petersen graph. It is somewhat similar to CL5, yet it is actually the simplest counterexample to many propositions about graphs.

(Diagram: the Petersen graph.)

13.A.11. Show that in a directed graph where every vertex has the same number of incoming and outgoing edges, there exists an Eulerian path. Solution. Suppose a trail starts at some vertex v. The first edge of the trail can be any edge emanating from v; we delete it from the set of edges available for the rest of the trail. Whenever we arrive at another vertex u, we choose an edge that is still in the set of available edges outgoing from u. We stop when we can no longer choose such an edge. This can happen only if we arrive again at v with all the outgoing edges of v already engaged in the trail. Indeed, at any other vertex w, every arrival engages one incoming edge and can be followed by an unused outgoing edge, since their numbers are the same. Hence, the trail starts and ends at v. If some edges of a connected graph remain unused, the same argument yields a closed trail through a vertex already visited, and the two trails can be spliced into one; repeating this, we obtain a closed trail containing all the edges of the graph. □

13.A.12. An n-cube Qn is a cube in n dimensions: its vertices are the points with all coordinates equal to 0 or 1, and its edges connect the neighbouring vertices, i.e., those which differ from each other in exactly one coordinate. Show that Qn possesses a Hamiltonian circuit. Solution. We proceed by induction. The (k + 1)-dimensional cube Q_{k+1} can be considered as two copies of Qk, with every vertex of the first copy given the first coordinate 0, and 1 for the second copy. Consider some Hamiltonian cycle vi1 . . . vik vi1 in Qk. It yields the cycle 0vi1 . . . 0vik 0vi1 in the first copy and the cycle 1vi1 . . . 1vik 1vi1 in the second one. Then the path 0vi1 . . . 0vik, 1vik 1vik−1 . . . 1vi1, 0vi1 forms a Hamiltonian cycle on Q_{k+1}. □

13.A.13. Show that a tree with n vertices has exactly n − 1 edges. Solution. We proceed by induction; a tree with one vertex clearly has no edges. Suppose that every tree with k vertices has precisely k − 1 edges, and let T be a tree with k + 1 vertices. We first show that T contains a vertex with a single edge connected to it. If not, start at any vertex and keep following edges, marking each vertex as we pass it. Since every vertex is assumed to have more than one incident edge, there is never a reason to stop, so we eventually encounter a marked vertex; but then the traversed edges contain a cycle, which is impossible in a tree. Take a vertex with a single edge connecting to it, and delete it together with its edge from the tree T. The new graph T′ has k vertices. It is connected, since no path between two other vertices can pass through the removed vertex. If there were no cycles before, removing an edge certainly cannot produce a cycle, so T′ is a tree. By the induction hypothesis, T′ has k − 1 edges. To recover T from T′, we add one vertex and one edge, so T also satisfies the formula for the number of edges. □
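The induction step in 13.A.12 is precisely the construction of the reflected Gray code: traverse the first copy of Qk forwards and the second copy backwards. A minimal sketch in Python (the function name is ours):

    def gray_cycle(n):
        # returns the vertices of Q_n (as bit strings) in the order of a
        # Hamiltonian cycle; consecutive strings differ in exactly one bit,
        # and so do the last and the first one
        if n == 1:
            return ['0', '1']
        prev = gray_cycle(n - 1)
        return ['0' + v for v in prev] + ['1' + v for v in reversed(prev)]

    print(gray_cycle(3))
    # ['000', '001', '011', '010', '110', '111', '101', '100']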
13.1.4. Morphisms of graphs. Mappings between the sets of vertices or edges which respect the considered structure are of great importance in graph theory. It is enough to consider mappings between the vertices only.

Morphisms of graphs

Definition. Let G = (V, E) and G′ = (V′, E′) be two given graphs. A morphism (or homomorphism) f : G → G′ is a mapping fV : V → V′ between the sets of vertices such that if e = {v, w} is an edge in E, then e′ = {fV(v), fV(w)} is an edge in E′.

In practice, there is no need to distinguish between the morphism f and the mapping fV. The definition is the same for directed graphs, using ordered pairs e = (v, w) as edges. In the case of undirected graphs, the definition implies that if f(v) = f(w) for distinct v, w ∈ V, then v and w are not connected by an edge. On the other hand, such an edge is admissible for directed graphs, provided the common image of the two vertices has a loop.

An important special case is a morphism of a graph G whose codomain is Km. Such a morphism is equivalent to a labeling of the vertices of G with the m colours (or any other names) of the vertices of Km so that vertices of one colour are never adjacent. In this case, it is called a (vertex) colouring of the graph G with m colours.

If a morphism f : G → G′ is a bijection between the sets of vertices such that the inverse mapping f⁻¹ is also a morphism, then f is called an isomorphism of graphs. Two graphs are isomorphic if they differ only in the labeling of their vertices.

Every morphism of directed graphs is also a morphism of their symmetrizations; the converse is not true in general.

There are simple and extraordinarily useful examples of graph morphisms: namely a path, a walk, and a cycle in a graph.

13.A.14. If u and v are two vertices of a tree T, show that there is a unique path connecting them. Solution. Since T is a tree, it is connected, and therefore there is at least one path connecting u and v. Suppose there are two different paths P and Q connecting u to v. Reverse Q to obtain a path Q′ leading from v to u; the concatenation of P and Q′ leads from u back to itself. This closed walk PQ′ need not itself be a cycle, since it may use some edges in both directions, but we assume that there are some differences between P and Q. From PQ′, we can extract a cycle: begin at u and continue one vertex at a time until the paths P and Q′ first differ; at this point, the paths split. Continue along both paths beyond the bifurcation point until they join again for the first time, which must occur eventually, since we know they meet at the end. The two fragments of P and Q′ between these points form a cycle in the tree, which is impossible, since T is a tree. □

13.A.15. If G is a connected graph and k ≥ 2 is the maximum path length, then any two paths in G of length k share at least one common vertex. Solution. Suppose not, and let P = vi1 . . . vik and Q = vj1 . . . vjk be two paths of maximal length k that do not share any vertex. Since G is connected, there is a path R connecting a vertex of P to a vertex of Q. Consider the fragment of R between the last vertex it shares with P and the first vertex it shares with Q; denote it by vir = um um+1 . . . um+t = vjℓ, where t ≥ 1, so that its inner vertices avoid both P and Q. Let P′ be the longer of the two parts of P ending at vir, and let Q′ be the longer of the two parts of Q starting at vjℓ; each of them has length at least k/2. Then P′, followed by the fragment um . . . um+t and by Q′, is a path of length at least k/2 + 1 + k/2 > k, contradicting the maximality of k. □
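The morphism condition of 13.1.4 can be checked mechanically: a vertex map is a morphism if and only if it sends every edge to an edge. A minimal sketch in Python for undirected graphs (the function name and the dictionary encoding of f are our choices); the example realizes a 2-colouring of C4 as a morphism C4 → K2:

    def is_morphism(edges_G, edges_H, f):
        # images of edges; a frozenset of size 1 signals two adjacent
        # vertices mapped to the same vertex, which is forbidden
        image = {frozenset((f[u], f[v])) for u, v in edges_G}
        allowed = {frozenset(e) for e in edges_H}
        return all(len(e) == 2 for e in image) and image <= allowed

    C4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
    K2 = [('a', 'b')]
    print(is_morphism(C4, K2, {0: 'a', 1: 'b', 2: 'a', 3: 'b'}))  # True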
13.A.16. Solution. □

13.A.17. In a dormitory, there is a party held every night. Every time, the organizer of the party invites all of his/her acquaintances, and at the end of the party, all of the guests know each other. Suppose that each member of the dormitory has organized a party at least once, yet there are still two students who do not know each other. Show that they will not meet at the next party. Solution. Consider the acquaintance graph of the students at the beginning (the vertices correspond to the students, the edges to the acquaintances). We are going to show that if two students lie in the same connected component of this graph (i.e., there exists a chain of acquaintances beginning with one of the considered students and ending with the other one, see 13.1.10), then they know each other as soon as each member of the dormitory has held a party. Indeed, consider a shortest path (acquaintance chain) between two students lying in the same connected component. Every time someone from this path organizes a party, the path becomes shorter by one (the organizer falls out). Since we assume that each of the students on the path has organized a party, the two marginal students must know each other as well. Therefore, if there are two students who do not know each other even after everyone has held a party, then they lie in different connected components of the graph, so they will never meet at a party (in particular, not at the upcoming one). □

Walks, Paths, Trails, and Cycles

A walk of length n in a graph G is a morphism s : Pn → G; both vertices and edges may repeat in the image. A trail is a walk in which vertices are allowed to repeat, but edges are not. A path of length n in a graph G is a morphism p : Pn → G which is injective, so the images of the vertices v0, . . . , vn of Pn are pairwise distinct. A cycle of length n in a graph G is a morphism c : Cn → G which is injective on the vertices.

For simplicity, the morphism is often identified with its image. Walks are often written explicitly in the form (v0, e1, v1, . . . , en, vn), where ei = {vi−1, vi} for i = 1, . . . , n. A walk can be thought of as the trajectory of a "pilgrim" moving from the vertex f(v0) to the vertex f(vn) of an (undirected) graph, not stopping at any vertex. Pn always contains an edge connecting the adjacent vertices vi−1 and vi, while loops are not admitted in undirected graphs. The pilgrim may enter a vertex more than once, or even go along an edge already visited. The pilgrim making a "trail" is a little wiser – he never goes along any edge for the second time on his way from the initial vertex f(v0) to the terminal vertex f(vn).

13.1.5. Subgraphs. The images of paths, walks, and cycles are examples of subgraphs, but not in the same way.

Subgraphs

Definition. A graph G′ = (V′, E′) is a subgraph of a graph G = (V, E) if and only if V′ ⊆ V and E′ ⊆ E.

Consider a graph G = (V, E) and choose a subset V′ ⊆ V. The largest subgraph (with respect to the number of edges) with V′ as its set of vertices is called the induced subgraph. It is the graph G′ = (V′, E′) where an edge e ∈ E belongs to E′ if and only if both of its endpoints lie in V′; that is, E′ = E ∩ \binom{V′}{2}.

A spanning subgraph (also a factor) of a graph G = (V, E) is any graph G′ = (V, E′) with the same vertex set as G, while the set of edges E′ ⊆ E may be arbitrary.

A clique in the graph G is a subgraph which is isomorphic to a complete graph.

Every subgraph can be constructed by a step-by-step application of these two constructions – first select V′ ⊆ V, then choose the target edge set E′ inside the subgraph induced on V′. Every image of a homomorphism (vertices as well as edges) forms a subgraph.
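The description E′ = E ∩ \binom{V′}{2} of the induced subgraph translates directly into code. A minimal sketch in Python (the function name is ours):

    from itertools import combinations

    def induced_edges(edges, W):
        # the edge set of the subgraph induced on the vertex subset W
        W = set(W)
        return [e for e in edges if set(e) <= W]

    K4 = list(combinations(range(4), 2))   # the complete graph on 4 vertices
    print(induced_edges(K4, {0, 1, 2}))    # [(0, 1), (0, 2), (1, 2)]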
Now, we are going to practice the fundamental concepts of graph theory on simple combinatorial problems.

13.A.18. Determine the number of edges of each of the graphs K6, K5,6, C8. Solution. The complete graph K6 on 6 vertices has \binom{6}{2} = 15 edges. The complete bipartite graph K5,6 (see 13.1.3) has 5 · 6 = 30 edges. Finally, the cycle graph C8 has 8 edges. □

13.A.19. Degree sequences. Verify whether each of the following sequences is the degree sequence (see 13.1.7) of some graph. If so, draw one of the corresponding graphs.
i) (1, 2, 3, 4, 5, 6, 7, 8, 9),
ii) (1, 1, 1, 2, 2, 3, 4, 5, 5).
Solution. First of all, we check the necessary condition from (1). In the former case, we have 1 + 2 + · · · + 9 = 45, which is odd, so the condition is not satisfied and the first sequence does not correspond to any graph. As for the latter sequence, the sum of the wanted degrees equals 24, so the necessary condition is satisfied. Now, we proceed by the Havel–Hakimi theorem from subsection 13.1.7:

(1, 1, 1, 2, 2, 3, 4, 5, 5) ←→ (1, 1, 1, 1, 1, 2, 3, 4)
←→ (1, 1, 1, 0, 0, 1, 2) ←→ (0, 0, 1, 1, 1, 1, 2)
←→ (0, 0, 1, 1, 0, 0) ←→ (0, 0, 0, 0, 1, 1)
←→ (0, 0, 0, 0, 0).

Of course, it was not necessary to execute the procedure to the very end; we could have stopped as soon as we saw that the obtained sequence is indeed the degree sequence of some graph. Now, we construct the corresponding graph "backwards" (however, we must take care to always add edges to vertices of the appropriate degrees – it is at this place that we have options and can obtain non-isomorphic graphs with the same degree sequence). One of the possible outcomes is the following graph (the order in which each vertex was selected is written inside it):

(Diagram: the constructed graph on nine vertices; the numbers 1, 0, 0, 0, 0, 0, 3, 2, 4 inside the vertices indicate the selection order.) □

13.1.6. How many non-isomorphic graphs are there? It is easy to draw all graphs (up to isomorphism) with a predetermined small number of vertices (three or four, for instance). In general, however, this is a complicated combinatorial problem, and it is often difficult to decide whether or not two given graphs are isomorphic.

Remark. This problem, known as the graph isomorphism problem, is a somewhat peculiar member of the class NP¹ – it is known neither whether it is NP-complete nor whether it can be solved in polynomial time. It is a special case of the problem of deciding whether or not a given graph is isomorphic to a subgraph of another graph; this subgraph isomorphism problem is known to be NP-complete.

¹ Wikipedia, NP (complexity), http://en.wikipedia.org/wiki/NP_(complexity) (as of Aug. 7, 2013, 13:44 GMT).

Just to get some feeling, let us find a rough lower estimate of the total number of non-isomorphic graphs. There are as many graphs on a given set of n vertices as there are subsets of the set of all possible edges, and a k-element set has 2^k subsets. There are at most n! graphs isomorphic to a given one, since this is the number of bijections between two n-element sets. It follows that there are at least

k(n) = 2^{\binom{n}{2}} / n!

pairwise non-isomorphic graphs on n vertices. Hence,

log2 k(n) = \binom{n}{2} − log2 n! ≥ (n²/2) (1 − 1/n − (2 log2 n)/n),

since n! ≤ n^n. For large n, the asymptotic estimate log2 k(n) ≥ n²/2 − O(n log2 n) follows (see the notation for asymptotic bounds in subsection 6.1.12 on page 520). Thus, the logarithm of the number of non-isomorphic graphs grows at least as fast as n².
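For very small n, the number of isomorphism classes can also be found by brute force and compared with the lower bound k(n). A minimal sketch in Python (the function name is ours); every graph is replaced by a canonical relabeling, and distinct canonical forms are counted:

    from itertools import combinations, permutations

    def nonisomorphic_count(n):
        slots = list(combinations(range(n), 2))
        classes = set()
        for mask in range(2 ** len(slots)):
            edges = [e for i, e in enumerate(slots) if mask >> i & 1]
            # canonical form: the lexicographically smallest relabeling
            canon = min(
                tuple(sorted(tuple(sorted((p[u], p[v]))) for u, v in edges))
                for p in permutations(range(n))
            )
            classes.add(canon)
        return len(classes)

    print(nonisomorphic_count(3), nonisomorphic_count(4))  # 4 11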
13.1.7. Vertex degree and degree sequence. It is relatively easy to verify that two given graphs are not isomorphic. Since isomorphic graphs differ only in the relabeling of their vertices, they share all numerical and other characteristics which are not changed by relabeling. Simple data of this type include, for instance, the numbers of edges incident to the particular vertices.

13.A.20. Find the number of pairwise non-isomorphic complete bipartite graphs with 1001 edges. Solution. A complete bipartite graph Km,n has m · n edges. Therefore, the problem can be restated as follows: in how many ways can we write the integer 1001 as a product of two integers? Since 1001 = 7 · 11 · 13, we get 1001 = 1 · 1001 = 7 · (11 · 13) = 11 · (7 · 13) = 13 · (7 · 11). Thus, there are four non-isomorphic complete bipartite graphs having 1001 edges: K1,1001, K7,143, K11,91, and K13,77. □

13.A.21. Find the number of graph homomorphisms (see 13.1.4) a) from P2 to K5, b) from K3 to K5. Solution. We can see from the definition of a graph homomorphism that the only condition to be satisfied is that adjacent vertices must not be mapped to the same vertex. a) 5 · 4 · 4 = 80. b) 5 · 4 · 3 = 60. □

13.A.22. Number of walks. Using the adjacency matrix (see 13.1.8), find the number of walks of length 4 from vertex 1 to vertex 2 in the following graph:

(Diagram: a graph on the vertices 1, 2, 3, 4, 5.)

Solution. The adjacency matrix of the given graph is

AG =
( 0 1 1 0 0 )
( 1 0 1 1 1 )
( 1 1 0 1 1 )
( 0 1 1 0 1 )
( 0 1 1 1 0 ).

Vertex degree and degree sequence

The degree of a vertex v ∈ V in a graph G = (V, E) is the number of edges of E incident to v; it is denoted by deg v. The degree sequence of a graph G with vertices V = (v1, . . . , vn) is the sequence (deg v1, deg v2, . . . , deg vn). Sometimes it is required that the sequence be sorted in ascending or descending order rather than corresponding to the selected order of the vertices. In the case of directed graphs, we distinguish between the indegree deg⁺ v of a vertex v and its outdegree deg⁻ v. A directed graph is said to be balanced if and only if deg⁻ v = deg⁺ v for all vertices v.

The degree sequence of a graph (and of all its isomorphic copies) is unique up to permutation. Therefore, if the degree sequences of two graphs differ by more than a permutation, then the graphs are not isomorphic. The converse statement is not true in general: the graphs G = C3 ∪ C3 and C6 are non-isomorphic, yet both have the degree sequence (2, 2, 2, 2, 2, 2). Indeed, C6 contains a path of length 5, but C3 ∪ C3 does not, so these two graphs cannot be isomorphic.

Since every edge has two endpoints, it is counted twice in the sum of the degrees (this observation is sometimes known as the handshaking lemma). It follows that

(1) ∑_{v∈V} deg v = 2|E|.

In particular, the sum of the degree sequence must be even.
The following theorem² illustrates how useful invariants like the degree sequence can be. The proof is constructive: it describes an algorithm which constructs a graph with a given degree sequence, or shows that there is no such graph.

Deciding about a given degree sequence

Theorem. For any natural numbers 0 ≤ d1 ≤ · · · ≤ dn, there exists a graph G on n vertices with these values as its degree sequence if and only if there exists a graph on n − 1 vertices with the degree sequence

(d1, d2, . . . , d_{n−dn−1}, d_{n−dn} − 1, . . . , d_{n−1} − 1).

Proof. If there exists a graph G′ on n − 1 vertices with the degree sequence as stated in the theorem, then a new vertex vn can be added to G′ and connected by edges to the last dn vertices of G′, thereby obtaining a graph G with the desired degree sequence.

² Proved independently by Václav J. Havel in 1955 in the Časopis pro pěstování matematiky (in Czech) and by S. L. Hakimi in 1962 in the Journal of the Society for Industrial and Applied Mathematics.

The number of walks of length 4 from vertex 1 to vertex 2 is the entry at position [1, 2] of the matrix A^4_G. Since

A^2_G =
( 2 1 1 2 2 )
( 1 4 3 2 2 )
( 1 3 4 2 2 )
( 2 2 2 3 2 )
( 2 2 2 2 3 ),

we have (A^4_G)1,2 = (2, 1, 1, 2, 2) · (1, 4, 3, 2, 2)^T = 17. Therefore, there are 17 walks of length 4 between the vertices 1 and 2. □

13.A.23. A cut edge (also called a bridge) in a graph is an edge whose removal increases the number of connected components of the graph. Similarly, a cut vertex (also called an articulation point) is a vertex with this property, i.e., when it is removed (together with the edges incident to it, of course), the graph splits into more connected components. Find all the cut edges and cut vertices of the following graph:

(Diagram: a graph on the vertices 0–17.) ⃝

13.A.24. Prove that a Hamiltonian graph (see 13.1.13) must be 2-vertex-connected. Give an example of a graph which is 2-vertex-connected yet does not contain a Hamiltonian cycle. Solution. For any pair of vertices in a Hamiltonian graph, there are two paths between them which are disjoint except for the two vertices (the "arcs" of the Hamiltonian cycle). Therefore, if we remove any single vertex, the graph clearly remains connected (the removed vertex lies on one of the two paths only). As for the example of a non-Hamiltonian graph which is 2-vertex-connected, we can recall the Petersen graph (see the picture earlier in this chapter). □

13.A.25. Determine the number of cycles (see 13.1.3) in the graph K5. Solution. We sort the cycles by their lengths, i.e., we count separately the numbers of cycles on three, four, and five vertices. A cycle of length three is determined uniquely by its three vertices, and there are \binom{5}{3} ways to choose them. A cycle of length four is determined by its four vertices (which can be chosen in \binom{5}{4} ways) and by the pair of neighbors of one fixed vertex (the pair can be selected from the remaining three vertices in \binom{3}{2} ways). Finally, a cycle of length five is given by the pair of neighbors of a fixed vertex as well as the other neighbor (from the two remaining vertices) of a fixed vertex of this pair.

The reverse implication is more difficult. The following needs to be proved: given a fixed degree sequence (d1, . . . , dn) with 0 ≤ d1 ≤ · · · ≤ dn, there exists a graph whose vertex vn is adjacent to exactly the last dn vertices v_{n−dn}, . . . , v_{n−1}.
The idea is simple – if any of the last dn vertices, say vk, is not adjacent to vn, then vn must be adjacent to one of the prior vertices. We want to interchange the endpoints of two edges so that the vertices vn and vk become adjacent while the degree sequence remains unchanged. Technically, this can be done as follows. Consider all graphs G with the given degree sequence and let, for each G, ν(G) denote the greatest index of a vertex which is not adjacent to the vertex vn. Fix G such that ν(G) is as small as possible. Then either ν(G) = n − dn − 1 (and the desired graph is obtained), or ν(G) ≥ n − dn. If the latter is true, then vn is adjacent to some vertex vi with i < ν(G). Since deg v_{ν(G)} ≥ deg vi, there exists a vertex vℓ which is adjacent to v_{ν(G)} but not to vi. Replace the edge {vℓ, v_{ν(G)}} with {vℓ, vi}, and the edge {vi, vn} with {v_{ν(G)}, vn}, to get a graph G′ with the same degree sequence but with ν(G′) < ν(G), which contradicts the choice of G. (Draw a diagram!) Therefore, the former possibility holds, and the graph is created by adding the last vertex and connecting it by edges to the last dn vertices. □

The theorem describes an exact procedure for constructing a graph with a given degree sequence; if there is no such graph, the algorithm indicates this during the computation. Begin with the degree sequence in (say) ascending order. Delete the largest value d and subtract one from each of the d remaining values on the very right. Then sort the obtained sequence and repeat the above step until either the current sequence is evidently the degree sequence of some graph, or it evidently corresponds to no graph. If a graph is eventually constructed after a number of steps, then one can reverse the procedure, adding one vertex in each step and connecting it to those vertices whose degrees had one subtracted during the procedure. (Try some examples yourself!) The algorithm constructs only one of the possibly many graphs which share the same degree sequence.

13.1.8. Matrix representation. The efficiency of graph representations is of importance for running algorithms. One of them is useful in theoretical considerations:

Altogether, there are

\binom{5}{3} + \binom{5}{4} · \binom{3}{2} + \binom{5}{5} · \binom{4}{2} · \binom{2}{1} = 10 + 15 + 12 = 37

cycles. □

13.A.26. Determine the number of subgraphs (see 13.1.5) of the graph K5. Solution. Again, we count the subgraphs separately by the number v of their vertices:
• v = 0: there is a unique graph on 0 vertices, the empty graph.
• v = 1: there are 5 ways of selecting 1 vertex, resulting in 5 subgraphs.
• v = 2: two vertices can be chosen in \binom{5}{2} ways, and there may or may not be an edge between them. Altogether, we get \binom{5}{2} · 2 such subgraphs.
• v = 3: three vertices can be selected in \binom{5}{3} ways, and each of the \binom{3}{2} pairs of them may or may not be joined by an edge. This results in \binom{5}{3} · 2^{\binom{3}{2}} subgraphs.
• v = 4: here, we calculate \binom{5}{4} · 2^{\binom{4}{2}} subgraphs.
• v = 5: finally, in this case, there are \binom{5}{5} · 2^{\binom{5}{2}} subgraphs.
Altogether, we have found 1 + 5 + 20 + 80 + 320 + 1024 = 1450 subgraphs of the graph K5. □

13.A.27. Determine the number of paths between a fixed pair of distinct vertices in the graph K7. Solution. We sort the paths by their lengths. There is a unique path of length one (it consists of the edge connecting the selected vertices). There are five paths of length two (the path may lead through any of the remaining five vertices). There are 5 · 4 paths of length three (we select the two intermediate vertices, and their order matters). Similarly, there are 5 · 4 · 3 paths of length four, 5 · 4 · 3 · 2 paths of length five, and 5! paths of length six. Clearly, there are no longer paths in K7. Altogether, we have 1 + 5 + 5 · 4 + 5 · 4 · 3 + 5 · 4 · 3 · 2 + 5! = 326 paths. □
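The count in 13.A.27 can be confirmed by direct enumeration: a path between the two fixed vertices is determined by the ordered tuple of its interior vertices. A minimal sketch in Python:

    from itertools import permutations

    total = 0
    others = range(2, 7)   # the five vertices other than the fixed pair 0, 1
    for k in range(6):     # k interior vertices give a path of length k + 1
        total += sum(1 for _ in permutations(others, k))
    print(total)           # 326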
At the end of this subsection, we present one more amusing problem.

13.A.28. The towns of a certain country are connected by roads so that each town is directly connected to exactly three other towns. Prove that there exists a town from which one can make a sightseeing round trip such that the number of roads used is not divisible by three. Solution. First of all, we reformulate this problem in the language of graph theory: our task is to prove that every 3-regular graph (i.e., one where the degree of every vertex equals three) contains a cycle whose length is not divisible by three. We proceed by induction, and we actually prove a stronger proposition: every graph each of whose vertices has degree at least three contains a cycle whose length is not divisible by three. In fact, the original proposition could not be proved by induction directly, since the induction hypothesis would be too weak. The induction is carried on the number k of vertices of the graph. Clearly, the statement holds for k = 4. Now, consider a graph where the degree of each vertex is at least three, and suppose that the statement is true for any such graph on fewer vertices. The reader should be able to prove that there exists a cycle in the graph. If its length is not divisible by three, we are done. Thus, suppose from now on that C = v1v2 . . . v3n. Each vertex of this cycle is connected to at least one more vertex (different from its neighbors on the cycle) in the graph. If there is a vertex vi on the cycle that is connected to a vertex vj on the cycle (j > i + 1), then the lengths of the cycles v1v2 . . . vivjvj+1 . . . v3n and vivi+1 . . . vj total 3n + 2, so the length of at least one of them is not divisible by three, as wanted. The situation is similar if there are two vertices vi and vj, 1 ≤ i < j ≤ 3n, which are connected to the same vertex outside the cycle. Therefore, suppose that each vertex of the cycle is connected to some vertices outside C and that no two vertices of the cycle are adjacent to the same vertex outside. Then, we can consider the graph which is obtained from the original one by replacing the vertices v1, v2, . . . , v3n with a single vertex V. In this new graph, there are again at least three edges leading from each vertex (including V), so we can apply the induction hypothesis to it. Therefore, there is a cycle w1w2 . . . wk with 3 ∤ k. If it does not contain the vertex V, then it is a cycle in the original graph as well. If it does, then we proceed analogously as above: we consider two cycles whose lengths sum to 3n + 2k, so the length of at least one of them is not divisible by three. We have found the wanted cycle in every case, which finishes the proof. □

Adjacency matrix

The adjacency matrix of an (undirected) graph G = (V, E) is defined as follows. Fix a (total) ordering of its vertices, V = (v1, . . . , vn), and define the matrix AG = (aij) over Z2 (the entries are zeros and ones) by

aij = 1 if the edge eij = {vi, vj} ∈ E, and aij = 0 otherwise.

It is recommended to write out explicitly the adjacency matrices of the graphs mentioned at the beginning of this chapter! By definition, adjacency matrices are symmetric. There are straightforward generalizations of this concept for more general graphs: for oriented edges, the direction may be indicated by the sign; multiple edges might be encoded by appropriate integers, etc.

If the matrix is stored in a two-dimensional array, then this method of graph representation is very inefficient: it consumes O(n²) memory. If the graph is sparse, i.e. it has only a few edges, then almost all the entries of the matrix are zeros, and there are many methods of storing such matrices more efficiently.

The matrix representation of graphs is suggestive of linear algebra considerations. For example, there is the following beautiful theorem:

Theorem. Let G = (V, E) be a graph with vertices ordered as V = (v1, . . . , vn), and let AG be its adjacency matrix. Further, let A^k_G = (a^(k)_ij) denote the k-th power of the matrix AG = (aij). Then a^(k)_ij is the number of walks of length k between the vertices vi and vj.

Proof. The proof is by induction on the length of the walks. For k = 1, the statement is simply a reformulation of the definition of the adjacency matrix. Suppose the proposition holds for a fixed positive integer k, and examine the number of walks of length k + 1 between the vertices vi and vj for some fixed indices i and j. Each such walk is obtained by attaching an edge from vi to some vertex vℓ to a walk of length k between vℓ and vj, and each walk of length k + 1 is obtained uniquely in this way. Therefore, if a^(k)_ℓj denotes the number of walks of length k from vℓ to vj, then the number of walks of length k + 1 is

a^(k+1)_ij = ∑_{ℓ=1}^{n} aiℓ · a^(k)_ℓj.

This is exactly the formula for the product of the matrix AG and the power A^k_G. It follows that the entries of the matrix A^(k+1)_G are the integers a^(k+1)_ij. □
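The theorem is easy to check numerically on the graph of 13.A.22. A minimal sketch in Python, assuming NumPy is available:

    import numpy as np

    # the adjacency matrix from 13.A.22
    A = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 1, 1],
                  [1, 1, 0, 1, 1],
                  [0, 1, 1, 0, 1],
                  [0, 1, 1, 1, 0]])

    # the entry [0, 1] of A^4 counts the walks of length 4 between the
    # vertices labeled 1 and 2
    print(np.linalg.matrix_power(A, 4)[0, 1])  # 17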
Corollary. If G = (V, E) and AG are as above, then each pair of vertices in G is connected by a path if and only if the matrix (AG + En)^(n−1) has only positive entries (En is the n-by-n identity matrix).

Proof. By the binomial theorem,

(AG + En)^(n−1) = A^(n−1)_G + \binom{n−1}{1} A^(n−2)_G + · · · + \binom{n−1}{n−2} AG + En.

The entries of the resulting matrix are (using the notation as above)

a^(n−1)_ij + · · · + \binom{n−1}{ℓ} a^(n−1−ℓ)_ij + · · · + (n − 1) aij + δij,

where δii = 1 for all i, and δij = 0 for i ≠ j. This gives the sum of the numbers of walks of lengths 0, . . . , n − 1 between the vertices vi and vj, multiplied by positive constants. Therefore, the entry is non-zero if and only if there is a path between these vertices. □

13.1.9. Remark. Observe how permuting the vertices of V affects the adjacency matrix of the corresponding graph. It is not hard to see that each such permutation permutes both the rows and the columns of the matrix AG in the same way. Such a permutation can be given uniquely by a permutation matrix, each of whose rows and columns contains only zeros except for one entry, which equals 1. If P is such a permutation matrix, then the adjacency matrix of the isomorphic graph G′ is AG′ = P · AG · P^T, where the dots stand for matrix multiplication.
The transposed matrix P^T is also the inverse matrix of P, since permutation matrices are orthogonal. Every permutation can be written as a composition of transpositions; hence every permutation matrix can be obtained as the product of the matrices corresponding to transpositions. Of course, this is exactly how the matrices of linear mappings change under a change of basis. Understanding the adjacency matrix as a linear mapping is often useful. For example, the adjacency matrix may be thought of as acting on vectors of zeros and ones (imagine the ones indicating the active vertices of interest) and yielding vectors of integers (showing how many times the given vertices are reached from the active vertices along the edges in one step). This observation also shows that the question whether two adjacency matrices describe isomorphic graphs is equivalent to asking whether the matrices are equivalent via a permutation matrix P.

13.1.10. Connected components of a graph. Every graph G = (V, E) naturally partitions into disjoint subgraphs Gi such that two vertices v ∈ Gi and w ∈ Gj are connected by a path if and only if i = j. This can be formalized as follows. Let G = (V, E) be an undirected graph. Define a relation ∼ on the set V by setting v ∼ w for vertices v, w ∈ V if and only if there exists a path from v to w in G. This relation is clearly a well-defined equivalence relation. Every class [v] of this equivalence determines the induced subgraph G[v] ⊆ G, and the (disjoint) union of these subgraphs gives the original graph G. According to the definition of the equivalence relation, no edge of the original graph can connect vertices that belong to different components.

B. Fundamental algorithms

Let us begin with breadth-first search and depth-first search, which serve as a basis for more sophisticated algorithms. Their actual implementations may differ; therefore, the answers to the following problems may be ambiguous.

13.B.1. Consider a graph on six vertices which are labeled 1, 2, . . . , 6. A pair of vertices is connected with an edge if and only if the sum of their labels is odd. Describe the run of the breadth-first search algorithm on this graph. Which edge is visited last, provided the search is initiated at vertex 5 and the neighbors of a given vertex are visited in ascending order? Solution. The algorithm starts at vertex 5 and goes along the edges (5, 2), (5, 4), (5, 6), thereby visiting the vertices 2, 4, 6 (the queue of vertices to be processed is 2, 4, 6). The first vertex to have been visited is 2, so the algorithm continues the search from there, i.e., vertex 5 becomes processed and vertex 2 becomes active. The algorithm goes along the edges (2, 1), (2, 3), (2, 5) (the last one has already been used), thereby visiting the vertices 1 and 3 (the queue of vertices to be processed is 4, 6, 1, 3). Now, vertex 2 becomes processed and the first unprocessed vertex to have been visited becomes active; that is vertex 4. The algorithm discovers the edges (4, 1) and (4, 3), yet no new vertices. Vertex 4 becomes processed and vertex 6 becomes active. This leads to the discovery of the edges (6, 1) and (6, 3). If the algorithm knows the number of edges in the graph, it terminates at this moment. Otherwise, it goes through the vertices 1 and 3, finding out that there are no new edges or vertices, and then it terminates. In either case, the last edge to have been discovered is (3, 6). □
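The graph of 13.B.1 is the complete bipartite graph K3,3 on the odd and even labels, and the run of the search is easy to reproduce. A minimal sketch in Python of the breadth-first variant; note that it records only the tree edges by which new vertices are discovered, not every inspected edge:

    from collections import deque

    # vertices 1..6, an edge iff the sum of the labels is odd;
    # adjacency lists are kept in ascending order
    adj = {u: [v for v in range(1, 7) if v != u and (u + v) % 2 == 1]
           for u in range(1, 7)}

    def bfs(start):
        visited, order = {start}, []
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    order.append((u, v))   # tree edges in discovery order
                    queue.append(v)
        return order

    print(bfs(5))  # [(5, 2), (5, 4), (5, 6), (2, 1), (2, 3)]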
13.B.3. Let the vertices of the graph K6 be labeled 1, 2, . . . , 6. Write the order of the edges of K6 in which they are visited by the depth-first search algorithm, supposing the search is initiated

According to the definition of an equivalence relation, no edge of the original graph can connect vertices that belong to different components. The subgraphs G[v] are called connected components of the graph G. A graph G = (V, E) is said to be connected if and only if it has exactly one connected component.

If the graph G is directed, then the definition is analogous to the case of undirected graphs – it is only required that there exist both paths from v to w and from w to v in order for the pair (v, w) to be related. Using this definition, strongly connected components can be discussed. On the other hand, it may only be required that the symmetrization of the graph be connected (in the undirected sense); then weak connectedness can be discussed.

13.1.11. Multiply connected graphs. It is useful to consider the concept of connectedness in a much stronger sense, i.e. to enforce a certain redundancy in the number of paths between vertices.

Definition. A graph G = (V, E) is said to be
• k-vertex-connected if and only if it has at least k + 1 vertices and remains connected whenever any k − 1 vertices are removed;
• k-edge-connected if and only if it has at least k edges and remains connected whenever any k − 1 edges are removed.

In the case k = 1, the definition simply says that the graph is connected (in both cases) since the condition is vacuously true. Stronger graph connectedness is desirable with any networks supporting some services (roads, pipelines, internet connection, etc.) where the clients prefer considerable redundancy of the provided service for the case that several connections in the network (i.e. edges in a graph) or nodes in the network (vertices in a graph) break down.

In general, Menger's theorem³ holds. It says that for every pair of vertices v and w, the number of pairwise edge-disjoint paths from v to w equals the minimum number of edges that must be removed so as to leave v and w in different components of the new graph.
Similarly, the number of pairwise vertex-disjoint paths from v to w equals the number of vertices that must be removed in order to disconnect v from w. We return to this topic in subsection 13.2.13. Right now, we consider the simplest interesting case in detail. These are graphs (on at least three vertices) such that deleting any one vertex does not destroy the connectedness.

Theorem. If G = (V, E) has at least three vertices, then the following conditions are equivalent:
• G is 2-vertex-connected;
• every pair of vertices v and w in G lies on a common cycle;
• the graph G can be constructed from the triangle K3 by repeatedly adding and splitting edges.

³Karl Menger proved this as early as in 1927; that is, before graph theory came into being.

from vertex 3 and the neighbors of a given vertex are visited in ascending order. ⃝

13.B.4. Let the vertices of the graph K6 be labeled 1, 2, . . . , 6. Write the order of the edges of K6 in which they are visited by the breadth-first search algorithm, supposing the search is initiated from vertex 3 and the neighbors of a given vertex are visited in ascending order. ⃝

13.B.5. Apply Dijkstra's algorithm to find the shortest path from vertex number 9 to each of the other vertices.

picture skipped

⃝

13.B.6. Give an example of
i) a graph on at least 4 vertices which does not contain a negative cycle, yet Dijkstra's algorithm fails on it;
ii) a graph on at least 4 vertices which contains a negative edge, yet Dijkstra's algorithm succeeds on it.

Solution. In both cases, we must be well aware of how Dijkstra's algorithm works. Then, it is easy to find the wanted examples (apparently, there are many more possibilities). As for the first problem, we can consider the following graph (where S is the initial vertex):

picture skipped

If Dijkstra's algorithm is run from S, then it visits the vertex A and fixes its distance from S to 1. However, there is a shorter path, namely the path (S, B, C, A) of length 0. As for the second problem, consider the following:

picture skipped

□

Bellman–Ford algorithm. This algorithm is based on the same principle as Dijkstra's. However, instead of going through particular vertices, it processes them "simultaneously" – the relaxation loop (i.e., finding out whether the temporary distances of the vertices can be improved using a given edge) is iterated (|V| − 1) times over all edges.

Proof. If the second proposition is true, there are at least two different paths between any two vertices. So deleting a vertex cannot destroy the connectedness, and the first proposition follows. Conversely, suppose the first proposition is true. Proceed by induction on the minimal length of a path between v and w. Suppose first that the vertices are the endpoints of an edge e, and that the shortest path is of length 1. If removing the edge e splits the graph into two components, then this would also occur if the vertex v is removed or if the vertex w is removed. Therefore, the graph is connected even without the edge e, so there is a path between v and w. This path, together with the edge e, forms a cycle. For the induction hypothesis, assume that such a shared cycle is constructed for all pairs of vertices connected by a path whose length does not exceed k.
Consider vertices v, w and one of the shortest paths between them: (v = v0, e1, v1, e2, . . . , vk+1 = w) of length k + 1. Then, v1 and w can be connected by a path of length at most k, hence they lie on a common cycle. Denote by P1 and P2 the corresponding two disjoint paths between v1 and w. Now, the graph G \ {v1} is also connected, so there exists a path P from v to w which does not go through the vertex v1, and this path must at some point meet one of the paths P1, P2. Without loss of generality, suppose that this occurs on the path P1, at a vertex z. Now, the cycle can be built: it consists of the part of P from v to z, the part of P1 from z to w, and P2 (directed the other way) from w to v1, closed by the edge e1 back to v (draw a diagram!). It follows that the second proposition is a consequence of the first proposition, and hence the first condition is equivalent to the second one.

Suppose the third proposition is true. Neither splitting an edge nor adding a new one in a 2-vertex-connected graph destroys the 2-connectedness. So the first proposition follows from the third proposition. It remains to prove that the third proposition follows from the first proposition. From the first proposition, G is 2-connected, so there exists a cycle, which can be obtained from K3 by splitting edges. Consider the subgraph G′ = (V′, E′) determined by this cycle, and consider an edge e = {v, w} ∉ E′ such that one of its endpoints lies in V′. If both of its endpoints lie there, a new edge can simply be added to the graph G′, which leads to the subgraph (V′, E′ ∪ {e}) in G, which contains more edges than the graph G′. Consider the remaining possibility, i.e. v ∈ V′ while w ∉ V′. Since G is 2-connected, it remains connected even if the vertex v is removed, and it contains a shortest path P between the vertex w and some vertex (denote it v′) in G′ (apart from the removed vertex v) and containing no other vertex from V′.

The advantage is that this approach works even with negative edges, and it is able to detect negative cycles (if another iteration of the relaxation loop leads to a change, then there must be a negative cycle in the graph). However, we pay for that with increased time complexity.

13.B.7. Use the Bellman–Ford algorithm to find the shortest paths from the vertex S to all other vertices. Assume that the edges are ordered by the number of the tail (or head) and the initial vertex is the least one. Then, change the value of the edge (8, 6) from 18 to −18, execute the algorithm on this new graph, and show the detection of negative cycles.

picture skipped

Solution. According to the conditions, the edges are visited in the following order: (S,4), (S,7), (1,2), (1,5), (2,1), (2,3), (2,6), (3,7), (4,7), (4,8), (5,1), (5,6), (6,2), (6,5), (7,8), (8,6). The vertex distances develop as follows (potentially higher values computed earlier during the same iteration are written in parentheses):

    it. |  S    1       2    3    4   5    6    7     8
     1  |  0    ∞       ∞    ∞    1   ∞   22   3(6)   4
     2  |  0    ∞      23    ∞    1  24   22   3      4
     3  |  0   25(30)  23   26    1  24   22   3      3
     4  |  0   25      23   26    1  24   22   3      3

Since the fourth iteration does not lead to any change, we can terminate the algorithm at this moment.

In the changed graph, the execution is as follows (for the sake of clarity, we do not write the values of vertices that are untouched by the change):

    it. 1 |  0   ∞   ∞   ∞   1   ∞   −14   3(6)   4
    it. 2 |  −13   −12
    it. 3 |  −11(−6)   −10   −19   −2   −1
    it. 4 |  −18   −17
    it. 5 |  −16   −15   −24   −7   −6
    it. 6 |  −23   −22
    it. 7 |  −21   −20   −29   −12   −11
    it. 8 |  −28   −27
    it. 9 |  −26   −25   −34   −17   −16

The graph has 9 vertices, and since the ninth iteration changed the distance of one of the vertices, there is a negative cycle. Of course, we could have terminated the algorithm much earlier if we had noticed exactly what changes took place between the particular steps. Clearly, the values of the vertices 1, 2, 3, 5, 6, 7, 8 keep decreasing below all bounds. The algorithm can also be implemented so that it produces the tree of shortest paths and also finds the vertices lying on a negative cycle if there is one. □
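A minimal sketch of the procedure just described (our own code, with an invented toy graph): |V| − 1 rounds of the relaxation loop, plus one extra round whose success signals a negative cycle, exactly as used in 13.B.7.

    INF = float("inf")

    def bellman_ford(n, edges, source):
        """n vertices labeled 0..n-1, edges given as (tail, head, weight)."""
        dist = [INF] * n
        dist[source] = 0
        for _ in range(n - 1):                    # the relaxation loop
            for u, v, w in edges:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
        # Detection round: any further improvement means a negative cycle.
        has_negative_cycle = any(dist[u] + w < dist[v] for u, v, w in edges)
        return dist, has_negative_cycle

    # The cycle 1-2-3 has total weight -1, so it is reported as negative.
    edges = [(0, 1, 1), (1, 2, -3), (2, 3, 1), (3, 1, 1)]
    print(bellman_ford(4, edges, 0))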
Adding this path to the graph G′, together with the edge e (which can be done by adding the edge {v, v′} and splitting it into the desired number of "new" vertices and edges), a new subgraph is obtained which satisfies the requirements and contains more vertices than the considered graph G′. After a finite number of these steps, the entire graph G is built from the triangle K3, as desired. The proof is complete. □

13.1.12. Eulerian graphs. There are problems of the type "draw a graph without removing the pencil from the paper". In the language of graph theory, this can be stated as follows:

Eulerian trails
Definition. A trail which visits every edge exactly once and whose initial and terminal vertices are the same is called an Eulerian trail. Connected graphs that admit such a trail are called Eulerian graphs.

Of course, an Eulerian trail goes through every vertex at least once, but it can visit a vertex more than once. To draw a graph without removing the pencil from the paper while ending at the same point where one started means to find an Eulerian trail. The terminology refers to the classical story about the seven bridges in Königsberg. There, the task was to go for a walk and visit each of the bridges exactly once. The first proof that this is impossible is by Leonhard Euler, in 1736. The situation is depicted in the diagram. On the left, there is a sketch of the river with the islands and bridges. The corresponding multigraph is caught in the right-hand diagram. The vertices of this graph correspond to the "connected land", while the edges correspond to the bridges. If it is desired to do without the multiple edges (which have not been admitted so far), it would suffice to place another vertex inside each bridge (i.e. to split the edges with new vertices). Surprisingly, the general solution of this problem is quite simple, as shown by the following theorem. Of course, this also shows that Euler could not design the desired walk.

Eulerian graphs
Theorem. A graph G is Eulerian if and only if it is connected and all vertices of G have even degree.

Proof. If a graph is Eulerian, then for every entrance to a vertex there is an exit. Therefore, the degree of every vertex is even. More formally: consider a trail that begins and ends at a vertex v0 and passes through all edges.
Paths between all pairs of vertices. We often need to know the shortest paths between all pairs of vertices. Of course, we could apply the above algorithms to all initial vertices. However, there is a more effective method. One of the possibilities is to use the similarity with matrix multiplication, which is the basis of the Floyd–Warshall algorithm (the best-known among algorithms of the "all pairs shortest paths" type), which:
• computes the distances between all pairs of vertices in time O(n³);
• starts with the matrix U0 = A = (aij) of edge lengths (setting uii = 0 for each vertex i) and then iteratively computes the matrices U0, U1, . . . , U|V|, where uk(i, j) is the length of the shortest path from i to j such that all of its inner vertices are among {1, 2, . . . , k};
• the matrices are computed using the formula uk(i, j) = min{uk−1(i, j), uk−1(i, k) + uk−1(k, j)}.

In other words, considering the shortest path from i to j which can go only through the vertices 1, . . . , k, we can ask whether it uses the vertex k. If so, then this path consists of the shortest path from i to k and the shortest path from k to j (and these two paths use only the vertices 1, . . . , k − 1). Otherwise, the wanted path is also the shortest path from i to j which can go only through the vertices 1, . . . , k − 1. Clearly, for k = |V|, we get the shortest paths between all pairs of vertices without any restrictions. Moreover, we can maintain the so-called predecessor matrix (i.e., the predecessor of each vertex on the shortest path from each vertex) and update it as follows:
• Initialization: (P0)ij = i for i ≠ j and aij < ∞;
• In the k-th step, we update
$$(P_k)_{ij} = \begin{cases} (P_{k-1})_{kj}, & \text{if the path through } k \text{ is better,} \\ (P_{k-1})_{ij}, & \text{otherwise.} \end{cases}$$

As soon as the algorithm terminates, we can easily construct the shortest path between any pair of vertices u, v: we derive it from the matrix P = Pn = (pij) (in the reverse order) as v, w = puv, puw, . . .

13.B.8. Apply the Floyd–Warshall algorithm to the graph in the picture. Write the intermediate results into matrices. Show the detection of negative cycles. Maintain all information necessary for the construction of the shortest paths.

picture skipped

Solution. We proceed according to the algorithm, obtaining the following shortest-length matrices and predecessor matrices:

$$U_0 = \begin{pmatrix} 0 & 4 & -3 & \infty \\ -3 & 0 & -7 & \infty \\ \infty & 10 & 0 & 3 \\ 5 & 6 & 6 & 0 \end{pmatrix}, \quad P_0 = \begin{pmatrix} - & 1 & 1 & - \\ 2 & - & 2 & - \\ - & 3 & - & 3 \\ 4 & 4 & 4 & - \end{pmatrix};$$

$$U_1 = \begin{pmatrix} 0 & 4 & -3 & \infty \\ -3 & 0 & -7 & \infty \\ \infty & 10 & 0 & 3 \\ 5 & 6 & 2 & 0 \end{pmatrix}, \quad P_1 = \begin{pmatrix} - & 1 & 1 & - \\ 2 & - & 2 & - \\ - & 3 & - & 3 \\ 4 & 4 & 1 & - \end{pmatrix};$$

$$U_2 = \begin{pmatrix} 0 & 4 & -3 & \infty \\ -3 & 0 & -7 & \infty \\ 7 & 10 & 0 & 3 \\ 3 & 6 & -1 & 0 \end{pmatrix}, \quad P_2 = \begin{pmatrix} - & 1 & 1 & - \\ 2 & - & 2 & - \\ 2 & 3 & - & 3 \\ 2 & 4 & 2 & - \end{pmatrix};$$

Every vertex occurs once or more on this trail and its degree equals twice the number of its occurrences. Now suppose that all vertices of a graph G have even degree. Consider the longest possible trail (v0, e1, . . . , vk) in G where no edge occurs twice or more. First, suppose for a moment that vk ≠ v0. This would mean that the number of edges of the trail that enter or leave the vertex v0 is odd, so there must be an edge which is incident to v0 and not contained in the trail. However, then the trail can be prolonged while still using every edge of the graph at most once, which is a contradiction. Therefore, v0 = vk. Define a subgraph G′ = (V′, E′) of G as follows: It contains the vertices and edges of our fixed trail and nothing else. If V′ ≠ V, then (since the graph G is connected) there exists an edge e = {v, w} such that v ∈ V′ and w ∉ V′. However, then the trail can be "rotated" so that it begins and ends at the vertex v. It can be prolonged with the edge e, which contradicts the assumption of the greatest possible length. Therefore, V′ = V. It remains to show that E′ = E. So suppose there is an edge e = {v, w} ∉ E′. As above, the trail can be rotated so that it begins and ends at the vertex v, and then it can be continued with the edge e – a contradiction. □
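The computation of 13.B.8 can be replayed mechanically. The sketch below is our own illustration of the update formula together with the predecessor matrix; the input is the matrix U0 above (with 0-based indices in the code).

    INF = float("inf")

    def floyd_warshall(a):
        # a[i][j]: edge length, a[i][i] == 0, INF where there is no edge.
        n = len(a)
        u = [row[:] for row in a]
        pred = [[i if i != j and a[i][j] < INF else None for j in range(n)]
                for i in range(n)]
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    if u[i][k] + u[k][j] < u[i][j]:
                        u[i][j] = u[i][k] + u[k][j]    # the path through k is better
                        pred[i][j] = pred[k][j]
        return u, pred

    a = [[0, 4, -3, INF],
         [-3, 0, -7, INF],
         [INF, 10, 0, 3],
         [5, 6, 6, 0]]
    u, pred = floyd_warshall(a)
    print(u[2][0], pred[2][0])                    # 6 and 1, i.e. U4[3,1] = 6, predecessor 2
    print(all(u[i][i] >= 0 for i in range(4)))    # True: no negative cycle on the diagonal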
Corollary. A connected graph can be drawn without removing the pencil from the paper if and only if there are either no vertices of odd degree or exactly two of them.

Proof. Let G be a connected graph with exactly two odd-degree vertices. Construct a new graph G′ by attaching a new vertex w to the original graph G and connecting it to both of the odd-degree vertices. This graph is Eulerian, and an Eulerian trail in it leads to the desired result. On the other hand, if a graph G can be drawn in the desired way, then the graph G′ is necessarily Eulerian, so the degrees of the vertices in G are as stated. □

The situation for directed graphs is similar. A directed graph is called balanced if and only if the outgoing and incoming degrees coincide, i.e. deg⁺(v) = deg⁻(v) for all vertices v.

Proposition. A directed graph G is Eulerian if and only if it is balanced and its symmetrization is connected (i.e. the graph G is weakly connected).

Proof. The proof is analogous to the undirected case. (Work out the details yourself!) □

13.1.13. Hamiltonian cycles. Find a walk or cycle that visits every vertex of a graph G exactly once. Of necessity, such a walk can visit every edge at most once. Such a cycle is called a Hamiltonian cycle in the graph G. A graph is called Hamiltonian if and only if it contains a Hamiltonian cycle. This problem seems to be very similar to the above one of visiting every edge exactly once. But while the problem of deciding the existence of an Eulerian trail is trivial, the problem of deciding whether a graph is Hamiltonian is NP-complete. Of course, this problem can be solved by "brute force". Given a graph on n vertices, generate all n! possible orders of the n vertices, and for each of them, verify whether it is a cycle in G.

$$U_3 = \begin{pmatrix} 0 & 4 & -3 & 0 \\ -3 & 0 & -7 & -4 \\ 7 & 10 & 0 & 3 \\ 3 & 6 & -1 & 0 \end{pmatrix}, \quad P_3 = \begin{pmatrix} - & 1 & 1 & 3 \\ 2 & - & 2 & 3 \\ 2 & 3 & - & 3 \\ 2 & 4 & 2 & - \end{pmatrix};$$

$$U_4 = \begin{pmatrix} 0 & 4 & -3 & 0 \\ -3 & 0 & -7 & -4 \\ 6 & 9 & 0 & 3 \\ 3 & 6 & -1 & 0 \end{pmatrix}, \quad P_4 = \begin{pmatrix} - & 1 & 1 & 3 \\ 2 & - & 2 & 3 \\ 2 & 4 & - & 3 \\ 2 & 4 & 2 & - \end{pmatrix}.$$

Since there is no negative number on the diagonal of U4, there is no negative cycle in the graph. Suppose we would like to find the shortest path from vertex 3 to vertex 1, for instance: The predecessor of 1 is P4[3, 1] = 2, and the predecessor of 2 is P4[3, 2] = 4. Therefore, the wanted path is 3, 4, 2, 1 and its length is U4[3, 1] = 6. □

Hamiltonian graphs. Deciding whether a given graph is Hamiltonian is an NP-complete problem. Therefore, it might be useful to have some simpler necessary or sufficient conditions for this property at our disposal. We mention three sufficient conditions: Dirac's, Ore's, and the Bondy–Chvátal theorem.

Dirac: Let a graph G with n ≥ 3 vertices be given. If each vertex of G has degree at least n/2, then G is Hamiltonian.

Ore: Let a graph G with n ≥ 3 vertices be given. If the sum of the degrees of each pair of non-adjacent vertices is at least n, then G is Hamiltonian.

The closure of a graph G is the graph cl(G) obtained from G by repeatedly adding an edge {u, v} such that u, v have not been adjacent and deg(u) + deg(v) ≥ n, until no such pair of vertices u, v exists.

Bondy, Chvátal: A graph G is Hamiltonian if and only if cl(G) is.

13.B.9. Prove the Bondy–Chvátal theorem.

Solution. Clearly, it suffices to prove that if G is Hamiltonian after the addition of an edge {u, v} such that u, v have not been adjacent and deg(u) + deg(v) ≥ n, then it is already Hamiltonian without this edge. Suppose that G + {u, v} is Hamiltonian, but G is not. Then, there exists a Hamiltonian path from u to v in G.
It must hold for each vertex adjacent to u that its predecessor on this path is not adjacent to v (otherwise, there would be a Hamiltonian cycle in G). Therefore, deg(u) + deg(v) ≤ n − 1, which contradicts the assumption deg(u) + deg(v) ≥ n. □

13.B.10. i) Prove that the Bondy–Chvátal theorem implies Ore's and that Ore's implies Dirac's.
ii) Give an example of a Hamiltonian graph which satisfies Ore's condition but not Dirac's.
iii) Give an example of a Hamiltonian graph whose closure is not a complete graph.

Solution.

This problem forms a vital field of research. For instance, in 2010, A. Björklund published a randomized algorithm based on the Monte Carlo method, which counts the number of Hamiltonian cycles in a graph on n vertices in time O(1.657ⁿ).⁴ Finding Hamiltonian cycles is desired in many problems related to logistics. For example, finding optimal paths for goods delivery.

13.1.14. Trees. Sometimes it is desired to minimize the number of edges in the graph while keeping it connected. Of course, this is possible if and only if there is at least one cycle in the graph. The graphs without cycles are extremely important, as we shall see below.

Forests, trees, leaves
A connected graph which does not contain a cycle is called a tree. A graph which does not contain a cycle is called a forest (a forest is not required to be connected). Every vertex of degree one in any graph is called a leaf.

The definition suggests an easily memorable "theorem": A tree is a connected forest.

Lemma. Every tree with at least two vertices contains at least two leaves. For any graph G with a leaf v, the following propositions are equivalent:
• G is a tree;
• G \ v is a tree.

Proof. Let P = (v0, . . . , vk) be (any) longest possible path in a tree G. If the vertex v0 is not a leaf, then there is an edge e incident to it whose other endpoint v is not in P (otherwise, this would form a cycle in the tree). Then the path P could be prolonged with this edge, which contradicts "longest". So the vertex v0 is a leaf. The proof for the vertex vk is similar. Thus, if the longest path is not trivial, then it must contain two leaves v0 and vk.

Next, consider any leaf v of a tree G. Consider any two other vertices w, z in G. There exists a path between them, and no vertex on this path has degree one. Therefore, this path remains the same in G \ v. Hence the graph remains connected even after removing the vertex v. There is no cycle, since G \ v is constructed by removing a vertex from a tree. Conversely, if G \ v is a tree, then adding a vertex of degree 1 cannot create a cycle. The resulting graph is evidently connected. □

Trees can be characterized by many equivalent and useful properties. Some of them appear in the following theorem, which is more difficult to formulate than to prove.

⁴Björklund, Andreas (2010), "Determinant sums for undirected Hamiltonicity", Proc. 51st Annual Symposium on Foundations of Computer Science (FOCS '10), pp. 173–182, arXiv:1008.0541, doi:10.1109/FOCS.2010.24.

i) If a graph G satisfies Ore's condition, then its closure is a complete graph, which is Hamiltonian, of course. By the Bondy–Chvátal theorem, the original graph is Hamiltonian as well. Further, if G satisfies Dirac's condition, then it clearly satisfies Ore's as well and thus is Hamiltonian.

ii) Consider the following example:

picture skipped

The degree of vertex 5 is 2, which is less than 5/2. The sum of the degrees of any pair of (not only non-adjacent) vertices is at least 5.

iii) The wanted conditions are satisfied by the cycle graphs Cn, n > 4, for which cl(Cn) = Cn. □
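The closure from the Bondy–Chvátal theorem is easy to compute. The following sketch (our own code, on 0-based vertices) verifies the claim of part iii) for C5 and, for contrast, closes C4 into K4.

    from itertools import combinations

    def closure(n, edges):
        # Repeatedly join non-adjacent u, v with deg(u) + deg(v) >= n.
        E = {frozenset(e) for e in edges}
        changed = True
        while changed:
            changed = False
            deg = {v: sum(v in e for e in E) for v in range(n)}
            for u, v in combinations(range(n), 2):
                if frozenset((u, v)) not in E and deg[u] + deg[v] >= n:
                    E.add(frozenset((u, v)))
                    changed = True
                    break          # recompute the degrees before continuing
        return E

    c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    print(len(closure(5, c5)))     # 5 -- cl(C5) = C5, as in iii)
    c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(len(closure(4, c4)))     # 6 -- cl(C4) = K4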
Planar graphs.

13.B.11. Decide whether the graph in the picture is planar.

Solution. By the Kuratowski theorem (see page 1162), this graph is not planar, since one of its subgraphs is a subdivision of K3,3. □

13.B.12. Decide whether there is a graph with degree sequence (6, 6, 6, 7, 7, 7, 7, 8, 8, 8). If so, is there a planar graph with this degree sequence? ⃝

13.B.13. What is the minimum number of edges of a hexahedron?

Solution. In any polyhedron, every face is bounded by at least three edges. At the same time, every edge belongs to two faces. If f is the number of faces and e the number of edges of the polyhedron, then we have 3f ≤ 2e (see also 13.1.20). For a hexahedron, this bound yields 18 ≤ 2e, i.e., e ≥ 9. Indeed, there exists a hexahedron with nine edges. It can be obtained by "gluing" two identical regular tetrahedra together along one face. Therefore, the minimum number of edges of a hexahedron is nine. □

13.1.15. Theorem. Let G = (V, E) be a graph. The following conditions are equivalent:
(1) G is a tree.
(2) For every pair of vertices v, w, there is exactly one path from v to w.
(3) G is connected but ceases to be so if any edge is removed.
(4) G does not contain a cycle, but the addition of any edge creates one.
(5) G is a connected graph, and the numbers of its vertices and edges satisfy |V| = |E| + 1.

Proof. The properties 2–5 are satisfied in every tree. Indeed, by the previous lemma, every tree which has at least two vertices has a leaf v, and it continues to be a tree when this leaf v is removed. Therefore, it suffices to show that if any of the statements 2–5 is true for a given tree, then it remains true when a leaf is added to the tree as well. This is clear.

In the case of properties 2 and 3, the graph is connected, and their formulation directly excludes the existence of cycles. As for the fourth property, it suffices to verify that G is connected. However, any two vertices v, w in G are either connected with an edge, or adding this edge to the graph creates a cycle, so there exists a path between them even without this edge. The last implication can be proved by induction on the number of vertices. Suppose that all connected graphs on n vertices and n − 1 edges are trees. The sum of the vertex degrees of any connected graph on n + 1 vertices and n edges is 2n, so the graph must contain a leaf. It follows from the induction hypothesis that this graph can be constructed by attaching a leaf to a tree; hence it is also a tree. □

13.1.16. Rooted trees, binary trees, and heaps. Trees are often suitable structures for data storage. They permit basic database operations (e.g. finding a particular piece of information) to be performed efficiently. Since there is no cycle in a tree, fixing one vertex vr defines the orientation of all edges. For every vertex v, there is exactly one path from vr to v, so the orientation can be defined accordingly. Since there are no cycles, it is impossible for two such paths to force both orientations of a particular edge. If one of the vertices of a tree is fixed, the situation is similar to a real tree in nature – there is a distinctive vertex which "grows from the ground". Trees with a fixed distinguished vertex vr are called rooted trees, and vr is said to be the root of the tree.
In a rooted tree, the terms successor and predecessor of a vertex are defined as follows: a vertex w is a successor of v (and v is a predecessor of w) if and only if the path from the root of the tree to the vertex w goes through v and v ≠ w. If the vertices are directly connected with an edge, we can talk about a direct successor and a direct predecessor.

13.B.14. Decide whether the given planar graph is maximal. Add as many edges as possible while keeping the graph planar.

Solution. The graph has 14 vertices and 20 edges, hence 3|V| − 6 − |E| = 16. Therefore, it is not maximal, and 16 edges can be added so that it is still planar.

picture skipped

Ten (dashed) edges have been added. For the sake of clarity, the other 6 edges that connect the vertices of the "outer" 9-gon are not drawn. □

13.B.15. Prove or disprove each of the following propositions.
i) Every graph with fewer than 9 edges is planar.
ii) Every graph which is not planar is not Hamiltonian.
iii) Every graph which is not planar is Hamiltonian.
iv) Every graph which is not planar is not Eulerian (see 13.1.12).
v) Every graph which is not planar is Eulerian.
vi) Every Hamiltonian graph is planar.
vii) No Hamiltonian graph is planar.
viii) Every Eulerian graph is planar.
ix) No Eulerian graph is planar.
⃝

Trees.

13.B.16. Determine the code of the following graph as a
i) plane tree,
ii) tree.

picture skipped

More often, they are called a child and a parent (motivated by genealogical trees). The most common data structures are the binary trees, which are special cases of rooted trees: there, every vertex has at most two children (sometimes, the term binary tree implies that every vertex is either a leaf or has exactly two children; to avoid ambiguity, such trees are often called full binary trees). Such trees are very useful in search procedures. If the vertices are associated with keys from a totally ordered set (e.g. the integers), the search for the vertex with a given key is performed by following the path from the root of the tree towards that vertex. At every vertex, we compare its key to the desired one. This decides whether one continues to the left or to the right, or stops the search if the key is found. If this algorithm is to be correct, one of the children with all its successors must have lower keys than the other child and all its successors. In order for the search to be efficient, some effort must be made to keep the binary trees balanced, with the lengths of the paths from the root to the leaves differing by at most one. The most unfortunate example of a binary tree on n vertices is the path graph Pn (which may be formally considered a binary tree), while the most desired case is the perfect complete binary tree, where every vertex that is not a leaf has exactly two children, and all leaves are at the same level. Such a tree can be constructed only when the number of vertices is of the form n = 2^k − 1, k = 1, 2, . . . . Therefore, in a balanced tree, finding the vertex with a given key can be done in O(log₂ n) steps. Such trees are often called binary search trees. Work out as an exercise how to efficiently perform basic operations over binary search trees (additions and removals of the vertex with a given key, as well as how to keep the tree balanced).

An extraordinarily useful example of binary trees is the structure of a heap. It is a full balanced binary tree, where the keys are either strictly decreasing along each path from the root (the so-called max heap), or they are increasing (the min heap). Because of this ordering along the paths in a max heap, the maximum key of the heap can be found in constant time and removed in logarithmic time (similarly with the minimum in the min heap). The desired maximum sits at the root, and after removing it, we need to rebalance the shape of the heap. Prove that this is possible in logarithmic time yourself!
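Python's standard heapq module implements exactly such an array-based binary min heap; a quick illustration (ours, with made-up keys) of the promised complexities:

    import heapq

    h = []
    for key in [17, 3, 25, 1, 100, 19, 36, 2, 7]:
        heapq.heappush(h, key)       # O(log n) per insertion
    print(h[0])                      # 1 -- the minimal key sits at the root, O(1)
    print(heapq.heappop(h))          # 1 -- removal rebalances one root-leaf path, O(log n)
    print(heapq.heappop(h))          # 2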
picture skipped

The left-hand diagram shows a binary search tree. In the right-hand diagram, there is a max heap.

Much literature is devoted to trees, their applications and miscellaneous variations.

Solution.
i) Using the procedure from 13.1.18, we get the following code of the plane tree: 0 0 0001100100101111 1 0 0101000010101111 1 1. The highlighted vertex in the graph is indeed the appropriate candidate to be the root, since it is the only element of the center of the tree.
ii) As for the unique construction of a plane tree, we sort the descendants lexicographically in ascending order. Thus, the wanted code is 0000001010111101011 0000010110110011111. □

13.B.17. For each of the following codes, decide whether there exists a tree with this code. If so, draw the corresponding tree.
• 00011001111001,
• 00000110010010111110010100001010111111.
⃝

Huffman coding. We are working with plane binary trees where every edge is colored with a symbol of an output alphabet A (we often have A = {0, 1}). The codewords C are those words over the alphabet A to which we translate the symbols of the input alphabet. Our task is to represent a given text using suitable codewords over the output alphabet. We can easily see that it makes sense to require that the set of codewords be a prefix code (i.e., no codeword is a prefix of another one); otherwise, we could get into trouble when decoding. We will use binary trees for the construction of binary prefix codes (i.e. over the alphabet A = {0, 1}). We label the edges going from each vertex by 0 and 1. Further, we label the leaves of our tree with the symbols of the input alphabet. This results in a prefix code over A for these symbols by concatenating the edge labels along the path from the root to the corresponding leaf. Clearly, this code is a prefix code. Moreover, if we take into account the frequencies of particular symbols of the input alphabet in the input text, we obtain a lossless data compression.

13.1.17. Remarks on sorting. Suppose it is required to distinguish all different sortings of n elements, thus distinguishing among n! different objects. If there is no information other than comparing the order of two single elements, then the tree of all possible decision paths can be written down. The sorting provides a path through this binary tree. As seen, any binary tree of depth h (i.e. h − 1 is the length of the longest path from the root to a leaf) has at most 2^(h−1) leaves. It follows that a tree of depth h satisfying 2^(h−1) ≥ n! is needed. Consequently, the depth h satisfies h log 2 > log n!, and
$$\log n! = \log 1 + \log 2 + \cdots + \log n > \int_1^n \log x \,\mathrm{d}x = n \log n - (n-1),$$
hence
$$h > \frac{n \log n - n}{\log 2} > n \log n - n.$$
It is proved that the depth of the necessary binary tree is bounded from below by an expression of size n log n. Hence no algorithm based only on the comparison of two elements of the ordered set can have a better worst-case run time than O(n log n).

The latter claim is not true if there is further relevant information. For example, if it is known that only a finite number of k values may appear among our n elements, then one may simply run through the list, counting how many occurrences of the individual values there are, and hence write the correctly ordered list from scratch. This all happens in linear time!
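A sketch of this linear-time trick (our own code): with only k admissible values, one counting pass replaces all comparisons.

    def counting_sort(items, values):
        """values: the k admissible values, listed in the desired order."""
        counts = {v: 0 for v in values}
        for x in items:                 # one linear pass to count occurrences
            counts[x] += 1
        out = []
        for v in values:                # rewrite the ordered list from scratch
            out.extend([v] * counts[v])
        return out

    print(counting_sort([2, 0, 1, 2, 0, 2], [0, 1, 2]))   # [0, 0, 1, 2, 2, 2]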
13.1.18. Tree isomorphisms. Simple features of trees are exploited in order to illustrate the (generally difficult) problem of graph isomorphism on this special class of graphs. First, we strengthen the structure to be preserved by the isomorphisms. Then we show that the obtained procedure is also applicable to the most general trees. In order to keep more information about the structure of rooted trees, remember the parent–child relations. Also have the children of every node sorted in a specific order (for instance, from left to right if drawn on a diagram). Such trees are called ordered trees or plane trees. They are formally defined as a tuple T = (V, E, vr, ν), where ν is a partial order on the edges such that a pair of edges is comparable if and only if they have the same tail (i.e. they all go from one parent vertex to all its children).

A homomorphism of rooted trees T = (V, E, vr) and T′ = (V′, E′, v′r) is a graph morphism φ : T → T′ such that vr is mapped to v′r; similarly for isomorphisms. For plane trees, it is further required that the morphism preserve the partial orders ν and ν′.

Let M be the list of frequencies of the symbols of the input alphabet in the input text. The algorithm constructs the optimal binary tree (the so-called minimum-weight binary tree) and the assignment of the symbols to the leaves.
• Select the two least frequencies w1, w2 from M. Create a tree with two leaves labeled by the corresponding symbols and the root labeled by w1 + w2; then replace the values w1, w2 with the new value w1 + w2 in M.
• Repeat the above step; if the selected value from M is a sum, then simply "connect" the existing subtree.
• The code of each symbol is determined by the path from the root to the corresponding leaf (left edge = "0", right edge = "1", for instance).

13.B.18. Find the Huffman code for the input alphabet with the frequencies ['A': 16, 'B': 13, 'C': 9, 'D': 12, 'E': 45, 'F': 5].

Solution. If we naively assign a 3-bit code to each letter of the alphabet, then this message of length 100 consumes 300 bits. We show that the Huffman code is more succinct. We build the tree according to the algorithm.

picture skipped (the tree merges F:5 and C:9 into FC:14, then FC with A:16 into AFC:30, D:12 and B:13 into BD:25, BD with AFC into ABCDF:55, and finally E:45 with ABCDF:55)

We have thus obtained the codes A: 111, B: 100, C: 1101, D: 101, E: 0, F: 1100. Multiplying the code lengths by the frequencies, we can see that a 100-letter message with the given distribution of letters is encoded into only 3·16 + 3·13 + 4·9 + 3·12 + 1·45 + 4·5 = 224 bits. □
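The construction of 13.B.18 can be scripted directly; the sketch below (our own code) uses a heap for the repeated selection of the two least frequencies. Tie-breaking, and hence the individual codewords, may differ from the tree above, but the optimal total of 224 bits is the same.

    import heapq
    from itertools import count

    def huffman(freqs):
        tiebreak = count()                    # avoids comparing trees on ties
        heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)   # the two least frequencies
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")   # left edge = "0"
                walk(tree[1], prefix + "1")   # right edge = "1"
            else:
                codes[tree] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    freqs = {"A": 16, "B": 13, "C": 9, "D": 12, "E": 45, "F": 5}
    codes = huffman(freqs)
    print(sum(len(codes[s]) * w for s, w in freqs.items()))   # 224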
C. Minimum spanning tree

13.C.1. How many spanning trees (see 13.2.6) of the graph K5 are there? And how many are there if we do not distinguish isomorphic ones?

Solution. There are three pairwise non-isomorphic spanning trees (with degree sequences (1, 2, 2, 2, 1), (1, 2, 3, 1, 1), and (4, 1, 1, 1, 1)). The corresponding classes of isomorphic spanning trees have $5 \cdot \binom{4}{2} \cdot 2$, 5·4·3, and 5 elements, respectively. Altogether, there are 125 = 5³ spanning trees, which is in accordance with Cayley's formula for the number of spanning trees of a complete graph (see 13.4.11). □

Coding the plane trees
Given a plane tree T = (V, E, vr, ν), it has a code W by strings of ones and zeros, defined recursively as follows: Start with the word 01 for the root v0 and write W = 0W₁ . . . Wℓ1, where the Wi are the ℓ still unknown words for the subtrees rooted by the children of v0. In particular, the code of the tree with just one vertex is W = 01. Applying the same procedure recursively over the children and concatenating the results defines the code.

The tree in the left-hand diagram above (at the end of 13.1.16) is encoded as follows (the children of a vertex are ordered from left to right, Wr stands for the code of the child with key r):

0W₃W₈1 → 0 0W₁W₅1 0W₉1 1 → 0 0 01 0W₄W₆1 1 0 0W₁₁1 1 1 → 000100101110001111.

Imagine drawing the entire plane tree with one move of the pencil, starting with an arrow ending in the root and going downwards with arrows towards the leaves and then upwards to the predecessors, reaching consecutively all the leaves from the left to the right, and writing 0 when going down and 1 when going up. The very last arrow then leaves the root upwards.

Theorem. Two plane trees are isomorphic if and only if their codes are the same.

Proof. By construction, two isomorphic trees are assigned the same code. It remains to show that different codes lead to non-isomorphic plane trees. This is proved by induction on the length of the code (i.e. the number of zeros and ones). This length is 2(|E| + 1), i.e., twice the number of vertices; therefore, the proof can be viewed as an induction on the number of vertices of the tree T. The shortest code corresponds to the smallest tree on one vertex. Assume that the proposition holds for all trees up to n vertices, i.e. for codes of length up to k = 2n, and consider a code of the form 0W1, where W is a code of length 2n. Find the shortest prefix of W which contains the same number of zeros and ones (when drawing a diagram of the tree, this is the first moment when we return to the root of the tree that corresponds to the code 0W1). Similarly, find the next part of the code W that contains the same number of zeros and ones, etc. Hence the code W can be written as W = W₁W₂ . . . Wℓ. By the induction hypothesis, the codes Wi correspond uniquely (up to isomorphism) to plane trees, and the order of their roots, being the children of the root of our tree T, is given uniquely by the order in the code. Therefore, the tree T is determined uniquely by the code 0W1 up to isomorphism. □

13.C.2. Let the vertices of K6 be labeled 1, 2, . . . , 6, and let every edge {i, j} be assigned the integer [(i + j) mod 3] + 1. How many minimum spanning trees are there in this graph?

Solution. There are five edges whose value is 1: four of them lie on the cycle 12451 and the remaining one is the edge 36. Therefore, they form a disconnected subgraph of the complete graph, so the spanning tree must contain at least one edge of value 2. Thus, the total weight of a minimum spanning tree is at least 4·1 + 2 = 6. And indeed, there exist spanning trees with this weight. We select all the edges of value 1 except for one that lies on the mentioned cycle and connect the resulting components 1245 and 36 with any edge of value 2. There are four such edges. Altogether, there are 4·4 = 16 minimum spanning trees. □

13.C.3. Find a minimum spanning tree of the following graph using
i) Kruskal's algorithm,
ii) Jarník's (Prim's) algorithm.
Explain why we cannot apply Borůvka's algorithm directly.

picture skipped

Solution. The spanning tree is

picture skipped

Borůvka's algorithm cannot be applied directly, since the mapping of the weights to the edges is not injective. However, this can be fixed easily by slight modifications of the weights. □
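A minimal Kruskal sketch with a union-find structure (our own code, on an invented toy graph rather than the one in the picture): edges are scanned in the order of increasing weight and accepted whenever they join two different components.

    def kruskal(n, edges):
        """edges: (weight, u, v) triples on the vertices 0..n-1."""
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x
        tree = []
        for w, u, v in sorted(edges):           # edges in order of weight
            ru, rv = find(u), find(v)
            if ru != rv:                        # joining two components: no cycle
                parent[ru] = rv
                tree.append((w, u, v))
        return tree

    edges = [(1, 0, 1), (3, 0, 2), (2, 1, 2), (2, 1, 3), (4, 2, 3)]
    print(kruskal(4, edges))   # [(1, 0, 1), (2, 1, 2), (2, 1, 3)]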
13.C.4. Consider the following procedure for finding the shortest path between a given pair of vertices in an undirected weighted graph: First, we find a minimum spanning tree. Then, we proclaim that the only path between the pair of vertices in the obtained spanning tree is the shortest one. Prove the correctness of this method, or disprove it by providing a counterexample. ⃝

13.C.5. We are given the following table of distances between world metropolises: London, Mexico City, New York, Paris,

Use the encoding of plane trees to encode any tree. Deal first with the case of rooted trees. Determine the order of the children of every vertex uniquely up to isomorphism. The order is unimportant if and only if the subgraphs determined by the respective children are isomorphic. The same construction can be used as for the plane trees, ordering the vertices lexicographically with respect to their codes (see 12.6.5 for the related concepts). This means that codes W1, W2 satisfy W1 > W2 if and only if W1 contains a one at an earlier position than W2, or W2 is a prefix of W1. The rooted tree as a whole is described by the same recursive procedure: if the children of a vertex v are coded by W1, . . . , Wℓ, then the code of the vertex v is 0W1 . . . Wℓ1, where the order is selected so that W1 ≤ W2 ≤ · · · ≤ Wℓ.

If no vertex is designated to be the root of a tree, the root can be chosen so that it is almost "in the middle" of the tree. This can be realized by assigning to every vertex of the tree an integer which describes its eccentricity. The eccentricity exT(v) of a vertex v in a graph T is defined to be the greatest possible distance between v and some vertex w in T. This concept is meaningful for all graphs; however, by the absence of cycles in trees, it is guaranteed that there are at most two vertices with the minimal eccentricity.

Lemma. Let C(T) be the set of those vertices of a tree T whose eccentricity is minimal. Then C(T) contains either a single vertex, or exactly two vertices, which are connected by an edge in T.

Proof. The claim is proved by induction, using the trivial fact that the vertex most distant from any vertex v must be a leaf. Therefore, the center of T coincides with the center of the tree T′ which is created from the tree T by removing all its leaves and the corresponding edges. After a finite number of such steps, there remains either just one vertex, or a subtree with two vertices. □

The set C(T) determined by the latter lemma is called the center of the graph, and the minimal eccentricity is called the radius of the graph. A unique (up to isomorphism) code can now be assigned to every tree. If the center of T contains only one vertex, use it as the root. Otherwise, create the codes for the two rooted subtrees of the original tree without the edge that connects the vertices of the center, and the code of T is the code of the rooted tree (T, x), where x is the vertex of the center whose subtree has the lexicographically smaller code.

Corollary. Trees T and T′ are isomorphic if and only if they are assigned the same code.
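The whole procedure fits in a few lines. The sketch below (ours, with an invented example tree) computes the center by peeling leaves, as in the proof of the lemma, and the canonical code of a rooted tree by sorting the children's codes; for simplicity, only the single-vertex-center case is rooted here. Plain string comparison of the 0/1 words agrees with the lexicographic order described above.

    def rooted_code(tree, root, parent=None):
        # 0 W_1 ... W_l 1 with the children's codes sorted in ascending order.
        subcodes = sorted(rooted_code(tree, w, root)
                          for w in tree[root] if w != parent)
        return "0" + "".join(subcodes) + "1"

    def center(tree):
        # Peel the leaves layer by layer until one or two vertices remain.
        deg = {v: len(ws) for v, ws in tree.items()}
        layer = [v for v in tree if deg[v] <= 1]
        remaining = len(tree)
        while remaining > 2:
            remaining -= len(layer)
            nxt = []
            for v in layer:
                deg[v] = 0
                for w in tree[v]:
                    if deg[w] > 1:
                        deg[w] -= 1
                        if deg[w] == 1:
                            nxt.append(w)
            layer = nxt
        return layer

    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}  # a path on 5 vertices
    c = center(path)
    print(c, rooted_code(path, c[0]))   # [2] 0001100111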
The above ideas imply that the algorithm for verifying plane tree isomorphism can be implemented in linear time with respect to the number of vertices of the trees.

Peking, and Tokyo:

          L     MC     NY      P     Pe      T
    L     –   5558   3469    214   5074   5959
    MC           –   2090   5725   7753   7035
    NY                  –   3636   6844   6757
    P                          –   5120   6053
    Pe                                –   1307

What is the least total length of wire used for interconnecting these cities (assuming the length necessary to connect a given pair of cities is equal to the distance in the table)? ⃝

13.C.6. Using the matrix variant of the Jarník–Prim algorithm, find a minimum spanning tree of the graph given by the following matrix:

$$\begin{pmatrix} - & 12 & - & 16 & - & - & - & 13 \\ 12 & - & 16 & - & - & - & 14 & - \\ - & 16 & - & 12 & - & 14 & - & - \\ 16 & - & 12 & - & 13 & - & - & - \\ - & - & - & 13 & - & 14 & - & 15 \\ - & - & 14 & - & 14 & - & 15 & - \\ - & 14 & - & - & - & 15 & - & 14 \\ 13 & - & - & - & 15 & - & 14 & - \end{pmatrix}$$

⃝

D. Flow networks

13.D.1. An example of bad behavior of the Ford–Fulkerson algorithm. The worst-case time complexity of the Ford–Fulkerson algorithm is O(|E| · |f|), where |f| is the size of a maximum flow. Consider the following network:

picture skipped (four edges of capacity 100 and a middle edge of capacity 1, with all flows initially zero)

The bad behavior of the algorithm is due to the fact that it uses depth-first search to find unsaturated paths.

Solution. We proceed strictly by depth-first search (examining the vertices from left to right and then top down):

picture skipped (the snapshots show the flow on the middle edge alternating between 1/1 and 0/1, while the flows on the outer edges grow by 1 in each iteration)

The trees form a special class of graphs. They are often used in miscellaneous variations and with additional requirements. We return to them later, in connection with practical applications. Now follows another extraordinarily important class of graphs.

13.1.19. Planar graphs. Some graphs are drawn in the plane in such a way that their edges do not "cross" one another. This means that every vertex of the graph is identified with a point of the plane, and an edge between vertices v, w corresponds to a continuous curve c : [0, 1] → R² that connects the vertices c(0) = v and c(1) = w. Furthermore, suppose that edges may intersect only at their endpoints. This describes a planar graph G. The question whether a given graph admits a realization as a planar graph often emerges in practical problems. Here is an example: Providers of water, electricity, and gas have their connection spots near three houses (each provider has one spot). Each house wants to be connected to each resource so that the connections do not cross (they might want not to dig too deep, for instance). Is it possible to do this? The answer is "no". In this particular case, it is clear from the diagram. There is a complete bipartite graph K3,3, where three of the vertices correspond to the connection spots, while the other three represent the houses. The edges are the connections between the spots and the houses. All edges can be placed except the last one – see the diagram, where the dashed edge cannot be drawn without crossing any more:

picture skipped

For a complete proof, more mathematical tools are needed. A complete explanation is not provided here, but an indication of the reasoning follows.
One of the basic results from the topology of the plane (the Jordan curve theorem) states that every closed continuous curve c in the plane that is not self-intersecting (i.e. it is a "crooked circle") divides the plane into two distinct parts. In other words, every other continuous curve which connects a point inside the curve c and a point outside c must intersect c. If the edges are realized as piecewise linear curves (every edge composed of finitely many adjacent line segments), then it is quite easy to prove the Jordan curve theorem (you might do it yourself!). The general theorem can be proved by approximating the continuous curves by piecewise linear ones (quite difficult to do, but it is much easier if the curve is assumed to be piecewise differentiable).

picture skipped

We can see that 200 iterations were needed in order to find the maximum flow. □

13.D.2. Find the size of a maximum flow in the network given by the following matrix A, where vertex 1 is the source and vertex 8 is the sink. Further, find the corresponding minimum cut.

$$A = \begin{pmatrix} - & 16 & 24 & 12 & - & - & - & - \\ - & - & - & - & 30 & - & - & - \\ - & - & - & - & 9 & 6 & 12 & - \\ - & - & - & - & - & - & 21 & - \\ - & - & - & - & - & 9 & - & 15 \\ - & - & - & - & - & - & - & 9 \\ - & - & - & - & - & - & - & 18 \\ - & - & - & - & - & - & - & - \end{pmatrix}$$

Solution. The following augmenting semipaths are found:
1–2–5–8 with residual capacity 15,
1–2–5–6–8 with residual capacity 1,
1–3–5–6–8 with residual capacity 8,
1–4–7–8 with residual capacity 12,
1–3–7–8 with residual capacity 6.
The total size of the found flow is 42. We can see that it is indeed of maximum size from the fact that the cut consisting of the edges (5, 8), (6, 8), and (7, 8) also has size 42 (and it is thus a minimum cut). □

13.D.3. The following picture depicts a flow network (the numbers f/c define the actual flow and the capacity of a given edge, respectively). Decide whether the given flow is maximum. If so, justify your answer. If not, find a maximum flow and describe the used procedure in detail. Find a minimum cut in the network.

picture skipped

Solution. In the given network, there exists an augmenting (semi)path 1–2–3–4–8 with residual capacity 4. Its saturation results in a flow of size 32. Since the cut (3, 8), (5, 8), (2, 4), (6, 4) is of the same size, we have found a maximum flow. □

13.D.4. Find a maximum flow and a minimum cut in the following flow network (source = 1, sink = 14).

Consider the graph K3,3. The triples of vertices that are not connected with edges are indistinguishable up to order. Therefore, the thick cycle can be considered the general case of a cycle with four points in the graph. The position of the remaining two vertices can then be discussed. In order for the graph to be planar, either both of the vertices must lie inside the cycle, or both outside. Again, these possibilities are equivalent, so it can be assumed without loss of generality that they are on opposite sides, as are the black vertices in the diagram. Now, their position with respect to a suitable cycle with two thick edges and two thin black edges can be discussed (i.e. through three gray vertices and one black one). Then, we can discuss the position of the remaining black vertex with respect to this cycle. This leads to the impossibility of drawing the last (dashed) edge without crossing the thick cycle. It can be shown similarly that the complete graph K5 is not planar either. We provide a purely combinatorial argument why K5 and K3,3 cannot be planar graphs below; see the Corollary at the end of the next subsection. Notice that if a graph G is "expanded" by dividing some of its edges (i.e. adding new vertices inside the edges), then the new graph is planar if and only if G is planar. The new graph is a subdivision of G. Planar graphs must not contain any subdivision of K3,3 or K5. The reverse implication is also true:

Kuratowski theorem
Theorem. A graph G is planar if and only if none of its subgraphs is isomorphic to a subdivision of K3,3 or K5.

The proof is complicated, so it is not discussed here. Much attention is devoted to planar graphs both in research and in practical applications. There are algorithms which are capable of deciding whether or not a given graph is planar in linear time. Direct application of the Kuratowski theorem would lead to a worse time complexity.

13.1.20. Faces of planar graphs. Consider a planar graph G embedded in the plane R². Let S be the set of those points x ∈ R² which do not belong to any edge of the graph (nor are vertices). In this way, the set R² \ G is partitioned into connected subsets Si, called the faces of the planar graph G. Since the graphs are finite, there is exactly one unbounded face S0. The set of all faces is denoted by S = {S0, S1, . . . , Sk}, and the planar graph by G = (V, E, S).

The simplest case of a planar graph is a tree. Every tree is a planar graph, since it can be constructed by step-by-step addition of leaves, starting with a single vertex. Of course, the Kuratowski theorem can also be applied – when there is no cycle in a graph G, then there cannot be a subdivision of K3,3 or K5 either.
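Returning to the flow networks of 13.D.1–13.D.3: the bad behavior of 13.D.1 disappears if the unsaturated paths are found by breadth-first search (the Edmonds–Karp variant of the algorithm). The following sketch is our own code, with the network of 13.D.1 encoded as a capacity matrix.

    from collections import deque

    def max_flow(cap, s, t):
        n = len(cap)
        res = [row[:] for row in cap]            # residual capacities
        flow = 0
        while True:
            parent = [None] * n
            parent[s] = s
            q = deque([s])
            while q and parent[t] is None:       # BFS for an augmenting path
                u = q.popleft()
                for v in range(n):
                    if parent[v] is None and res[u][v] > 0:
                        parent[v] = u
                        q.append(v)
            if parent[t] is None:
                return flow                      # no augmenting path is left
            v, bottleneck = t, float("inf")      # find the bottleneck ...
            while v != s:
                u = parent[v]
                bottleneck = min(bottleneck, res[u][v])
                v = u
            v = t                                # ... and augment along the path
            while v != s:
                u = parent[v]
                res[u][v] -= bottleneck
                res[v][u] += bottleneck          # allow later cancellation
                v = u
            flow += bottleneck

    # The four-vertex network of 13.D.1: two iterations suffice with BFS.
    cap = [[0, 100, 100, 0], [0, 0, 1, 100], [0, 0, 0, 100], [0, 0, 0, 0]]
    print(max_flow(cap, 0, 3))   # 200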
We provide a pure combinatorial argument why K5 and K3,3 cannot be planar graphs below, see the Corollary in the end of the next subsection. Notice that if a graph G is“expanded” by dividing some of its edges (i.e. adding new vertices in the edges), then the new graph is planar if only if G is planar. The new graph is a subdivision of G. Planar graphs must not contain any subdivision of K3,3 or K5. The reverse implication is also true: Kuratowski theorem Theorem. A graph G is planar if and only if none of its subgraphs is isomorphic to a subdivision of K3,3 or K5. The proof is complicated, so it is not discussed here. Much attention is devoted to planar graphs both in research and practical applications. There are algorithms which are capable of deciding whether or not a given graph is planar in linear time. Direct application of the Kuratowski theorem would lead to a worse time complexity. 13.1.20. Faces of planar graphs. Consider a planar graph G embedded in the plane R2 . Let S be the set of those points x ∈ R2 which do not belong to any edge of the graph (nor are vertices). In this way, the set R2 \ G is partitioned into connected subsets Si, called the faces of the planar graph G. Since the graphs are finite, there is exactly one unbounded face S0. The set of all faces are denoted by S = {S0, S1, . . . , Sk}, and the planar graph by G = (V, E, S). The simplest case of a planar graph is a tree. Every tree is a planar graph since it can be constructed by step-by-step addition of leaves, staring with single vertex. Of course, the Kuratowski theorem can also be applied– when there is no cycle in a graph G, then there cannot be a subdivision of K3,3 or K5, either. Since a tree G cannot contain a cycle, there is 1162 51 10 2 7 3 8 4 6 14 12 13 11 9 12/30 4/4 4/6 6/12 12/18 22/32 4/6 4/4 6/8 18/18 0/2 2/2 6/24 4/4 2/2 0/2 0/14 2/2 10/10 12/18 6/6 8/8 8/20 0/8 4/4 4/4 14/28 2/2 0/4 Solution. The paths are saturated in the following order: 1 18 −−→ 2 18 −−→ 7 14 −−→ 10 12 ←−− 5 12 −−→ 8 4 −→ 11 2 −→ 13 10 −−→ 14 r.2 1 16 −−→ 2 16 −−→ 7 12 −−→ 10 10 ←−− 5 10 −−→ 8 14 −−→ 13 8 −→ 14 r.8 We have found a flow of size 50. And indeed, it is a maximum flow since there is no further unsaturated path. If we look for the reachable vertices, we can also find a cut with capacity 50, consisting of edges [2, 4] : 4, [7, 9] : 2, [7, 12] : 4, [10, 12] : 2, [10, 14] : 6, [13, 14] : 32. □ 13.D.5. Find a maximum flow in the following network on the vertex set {1, 2, . . . , 9} with source 1 and sink 9 using the Ford-Fulkerson algorithm (during the depth-first search, choose the vertices in ascending order). Find a minimum cut in this network. Describe the steps of the procedure in detail. The edges e ∈ E as well as the lower and upper bounds on the flow (l(e) and u(e)) and the current flow f(e) are given in the table: CHAPTER 13. COMBINATORIAL METHODS, GRAPHS, AND ALGORITHMS only one face S0 there (the unbounded face). Since the number of edges of a tree is related to the number of its vertices, cf. the formula 13.1.15(5), it follows that |V | − |E| + |S| = 2 for all trees. Surprisingly, the latter formula linking the number of edges, faces, and vertices can be derived for all planar graphs. The formula is named after Leonhard Euler. Especially, the number of faces is independent of the particular embedding of the graph in the plane: Euler’s formula Theorem. Let G = (V, E, S) be a connected planar graph. Then, |V | − |E| + |S| = 2. Proof. The proof is by induction on the number of edges. 
The graph with zero or one edge satisfies the formula. Consider a graph G with |E| > 1. If G does not contain a cycle, then it is a tree, and the formula is already proved for this case. Suppose that there is an edge e of G that is contained in a cycle. Then, the graph G′ = G \ e is connected, and it follows from the induction hypothesis that G′ satisfies Euler's formula: |V| − (|E| − 1) + (|S| − 1) = 2, since removing an edge necessarily leads to merging two faces of G into one face in G′. Hence Euler's formula is valid for the graph G. □

Corollary. Let G = (V, E, S) be a planar graph with |V| = n ≥ 3 and |E| = e. Then:
• e ≤ 3n − 6, and this becomes an equality if and only if G is a maximal planar graph (adding any edge to G would violate planarity).
• If G does not contain a triangle (i.e. the graph K3 is not a subgraph), then e ≤ 2n − 4.

Proof. Continue adding edges to a given graph until it is maximal. If the obtained maximal graph G satisfies the inequality with equality, then the inequality holds for the original graph as well. Similarly, if the graph G is not connected, two of its components can be connected with a new edge, so such a graph cannot be maximal. Even if it were connected but not 2-connected, there would exist a vertex v ∈ V such that when it is removed, the graph G collapses into several components G1, . . . , Gk, k ≥ 2. However, then an edge can be added between these components without destroying the planarity of the original graph G (draw a diagram!). Therefore, it can be assumed from the beginning that the original graph G is a maximal planar 2-connected graph. As shown in theorem 13.1.11, every 2-connected graph can be constructed from the triangle K3 by splitting edges and attaching new ones. It is easily proved by induction that every face of a planar graph must be bounded by a cycle (which seems intuitively apparent). However, if there is a face of our maximal planar graph G that is not bounded by a triangle, then this face can be split with another edge (a "diagonal" in geometrical terminology), so G would not be maximal. It follows that all faces of G are bounded by triangles K3. Hence 3|S| = 2|E|. It suffices to substitute |S| = (2/3)|E| for the number of faces in Euler's formula. The second proposition is proved analogously; now, the faces of the maximal planar graph without triangles are bounded by either four or five edges, whence it follows that 4|S| ≤ 2|E|, with equality if and only if there are just quadrangles there. □

    e     | l(e) | u(e) | f(e)
    (1,2) |  0   |  6   |  0
    (1,3) |  0   |  6   |  0
    (1,6) |  0   |  4   |  0
    (2,3) |  0   |  2   |  0
    (2,4) |  0   |  3   |  0
    (3,4) |  0   |  4   |  0
    (3,5) |  0   |  4   |  0
    (4,5) |  3   |  5   |  4
    (4,8) |  0   |  3   |  0
    (5,1) |  0   |  3   |  0
    (5,6) |  0   |  6   |  0
    (5,7) |  0   |  5   |  4
    (5,8) |  0   |  5   |  0
    (6,9) |  0   |  5   |  0
    (7,4) |  1   |  6   |  4
    (7,9) |  0   |  3   |  0
    (8,9) |  0   |  9   |  0

⃝

13.D.6. A cut in a network (V, E, s, t, w) can also be viewed as a set C ⊂ E of edges such that in the network (V, E \ C, s, t, w), there is no path from the source s to the sink t, but if any edge e is removed from C, then the resulting set does not satisfy this property, i.e., there is a path from s to t in (V, (E \ C) ∪ {e}, s, t, w). Find all cuts (and their sizes) in the following network:

picture skipped

Solution. Let us fix the following edge labeling:

picture skipped

Then, the cuts are: {f, i}, {f, h, j, a}, {f, j, c, a, d, e}, {f, j, c, a, d, g}, {b, j, c}, {b, j, h}, {b, i}. Their capacities are 12, 9, 20, 18, 15, 10, and 15, respectively. □

13.D.7. Find a maximum flow in the network given in the above exercise. ⃝
□

The corollary implies (even without the Kuratowski theorem) that neither K5 nor K3,3 is planar: in the former case, |V| = 5 and |E| = 10 > 3|V| − 6 = 9, while in the latter, |V| = 6 and |E| = 9 > 2|V| − 4 = 8, which is again a contradiction, since K3,3 does not contain a triangle.

13.1.21. Convex polyhedra in the space. Planar graphs can be imagined as drawn on the sphere instead of in the plane. The sphere can be constructed from the plane by attaching one point "at infinity". Again, faces of such graphs can be discussed, and the faces are now equivalent to one another (even the face S0 is bounded). Conversely, every convex polyhedron P ⊆ R³ can be imagined as a graph drawn on the sphere (project the vertices and edges of the polyhedron onto a sufficiently large sphere from any point inside P). Removing a point inside one of the faces (that face becomes the unbounded face S0) then leads to a planar graph as above: the sphere with a hole can be spread out into the plane. The planar graphs formed from convex polyhedra are clearly 2-connected, since every pair of vertices of a convex polyhedron lies on a common cycle. Moreover, every face is the interior of its boundary cycle, and the graphs of convex polyhedra are always 3-connected. In fact, they are exactly the graphs described in the following Steinitz theorem (we omit the proof):

Steinitz's polyhedra theorem

Theorem. A graph G is the graph of a convex polyhedron if and only if it is planar and 3-vertex-connected.

13.1.22. Platonic solids. As an illustration of the combinatorial approach to polyhedral graphs, we classify all the regular polyhedra: those built up from one type of regular polygon in such a way that the same number of them meet at every vertex. It was known as early as in the epoch of the ancient philosopher Plato that there are only five of them:

(diagrams of the five regular polyhedra)

Further exercises on maximum flows and minimum cuts can be found on page 1220.

E. Classical probability and combinatorics

In this section, we recall the methods we learned as early as in the first chapter.

13.E.1. We throw n dice. What is the probability that none of the values 1, 3, 6 is cast? Solution. We can also see the problem as throwing one die n times. The probability that none of the values 1, 3, 6 is cast in the first throw is 1/2. The probability that they are cast neither in the first throw nor in the second one is clearly 1/4 (the result of the first throw has no impact on the result of the second one). Since this holds generally, i.e., the results of different throws are (stochastically) independent, the wanted probability is 1/2^n. □

13.E.2. We have a pack of ten playing cards, exactly one of which is an Ace. Each time, we randomly draw one of the ten cards and then put it back. How many times do we have to repeat this experiment if we require the probability of getting the Ace at least once to be at least 0.9? Solution. Let Ai be the event "the Ace is picked in the i-th draw". The events Ai are (stochastically) independent. Hence, we know that P(∪_{i=1}^{n} Ai) = 1 − (1 − P(A1)) · (1 − P(A2)) ⋯ (1 − P(An)) for every n ∈ N. We are looking for an n ∈ N such that P(∪_{i=1}^{n} Ai) = 1 − (1 − P(A1)) ⋯ (1 − P(An)) > 0.9. Apparently, we have P(Ai) = 1/10 for any i ∈ N. Therefore, it suffices to solve the inequality 1 − (9/10)^n > 0.9, whence n > log_a 0.1 / log_a 0.9 for any base a > 1. Evaluating this, we find out that we have to repeat the experiment at least 22 times. □
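An aside to the theory column: the corollary applied at the beginning of 13.1.21 is easy to check mechanically. The following small Python sketch (an illustration added here, not part of the original text) verifies that K5 and K3,3 violate the respective edge bounds:

    from itertools import combinations

    k5 = list(combinations(range(5), 2))                   # all 10 edges on 5 vertices
    k33 = [(a, b) for a in range(3) for b in range(3, 6)]  # 9 edges, parts {0,1,2} and {3,4,5}

    print(len(k5), '>', 3 * 5 - 6)    # 10 > 9: K5 violates e <= 3n - 6
    print(len(k33), '>', 2 * 6 - 4)   # 9 > 8: the triangle-free K3,3 violates e <= 2n - 4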
13.E.3. We randomly draw six cards from a pack of 32 cards (containing four Kings). Calculate the probability that the sixth card is a King and, at the same time, it is the only King drawn. Solution. By the theorem on the product of probabilities, the result is (28/32) · (27/31) · (26/30) · (25/29) · (24/28) · (4/27) ≈ 0.0723. □

13.E.4. We randomly draw two cards from a pack of 32 cards (containing four Aces). Calculate the probability that the second card drawn is an Ace if: a) the first card is put back; b) the first card is not put back. Solution. If the first card is put back in the pack, then we clearly repeat an experiment with 32 possible outcomes (of the same probability), 4 of which are favorable. Therefore, the wanted probability is 1/8. However, even if we do not put the first card back, the probability is the same: clearly, the probability of a given card being drawn the first time is the same as for the second time. Of course, we can also apply conditional probability, which leads to (4/32) · (3/31) + (28/32) · (4/31) = 1/8. □

13.E.5. Combinatorial identities. Use combinatorial means to derive the following important identities (in particular, do not use induction):
Arithmetic series: ∑_{k=0}^{n} k = n(n+1)/2 = \binom{n+1}{2}
Geometric series: ∑_{k=0}^{n} x^k = (x^{n+1} − 1)/(x − 1)
Binomial theorem: (x + y)^n = ∑_{k=0}^{n} \binom{n}{k} x^k y^{n−k}
Upper binomial theorem: ∑_{k=0}^{n} \binom{k}{m} = \binom{n+1}{m+1}
Vandermonde's convolution¹: \binom{m+n}{r} = ∑_{k=0}^{r} \binom{m}{k} \binom{n}{r−k}. ⃝

13.E.6. An urn contains 30 red balls and 70 green balls. What is the probability of getting exactly k red balls, 0 ≤ k ≤ 20, in a sample of size 20 if the sampling is done with replacement (repetition allowed)?

Translate the condition of regularity into properties of the corresponding graphs: every vertex needs the same degree d ≥ 3, and the boundary of every face must contain the same number k ≥ 3 of vertices. Let n, e, s denote the total numbers of vertices, edges, and faces, respectively. Firstly, the relation between the vertex degrees and the number of edges requires dn = 2e. Secondly, every edge lies in the boundary of exactly two faces, so 2e = ks. Thirdly, Euler's formula states that 2 = n − e + s = 2e/d − e + 2e/k. Putting this together (divide by 2e), the constants d and k must satisfy 1/d − 1/2 + 1/k = 1/e. Since d, k, e, n are positive integers (in particular, 1/e > 0), this equality restricts the possibilities. The left-hand side is maximal for d = 3. Substituting this value, we obtain the inequality −1/6 + 1/k = 1/e > 0. It follows that k ∈ {3, 4, 5}, for any d. The roles of k and d in the original equality are symmetric, so also d ∈ {3, 4, 5}. Checking each of the remaining possibilities yields all the solutions:

d  k  n   e   s
3  3  4   6   4
3  4  8   12  6
4  3  6   12  8
3  5  20  30  12
5  3  12  30  20

It remains to show that the corresponding regular polyhedra exist. This is suggested by the above diagrams, but that is not a mathematical proof. The existence of the first three is apparent. Let us concentrate on the geometrical construction of the regular dodecahedron (draw a diagram!). Begin with a cube, building "A-tents" on all its sides simultaneously. The upper horizontal poles are set on the level of the cube's sides so that those of adjacent sides are perpendicular to each other. Their length is chosen so that the trapezoids of the lateral sides have three sides of the same length.
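Another aside to the theory column: the table of solutions above can be reproduced by a mechanical search. A small illustrative Python sketch (not from the original text), using exact rational arithmetic:

    from fractions import Fraction

    # solve 1/d + 1/k - 1/2 = 1/e > 0 over the admissible degrees d and face
    # sizes k, then recover e, n = 2e/d and s = 2e/k
    for d in range(3, 6):
        for k in range(3, 6):
            inv_e = Fraction(1, d) + Fraction(1, k) - Fraction(1, 2)
            if inv_e > 0:
                e = 1 / inv_e
                print(d, k, 2 * e / d, e, 2 * e / k)   # columns d, k, n, e, s

This prints exactly the five rows of the table.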
Solution. Any time we take a sample from the urn, we put it back before the next draw (sampling with replacement). Thus, each time we sample, the probability of choosing a red ball is 30/100, and we repeat this in 20 independent trials. Using the binomial formula, we obtain P(k red balls) = \binom{20}{k} (0.3)^k (0.7)^{20−k}. □

13.E.7. An urn contains 30 red balls and 70 green balls. What is the probability of getting exactly k red balls, 0 ≤ k ≤ 20, in a sample of size 20 if the sampling is done without replacement (repetition not allowed)? Solution. Let A be the event of getting exactly k red balls. To find P(A), we need to find |A| and the total number of possibilities |S|. Here, |S| = \binom{100}{20}. Next, to find |A|, we need to find out in how many ways one can choose k red balls and 20 − k green balls. Thus, |A| = \binom{30}{k} \binom{70}{20−k}, and P(A) = |A|/|S| = \binom{30}{k} \binom{70}{20−k} / \binom{100}{20}. □

13.E.8. Assume that there are k people in a room and we know that:
• k = 5 with probability 1/4;
• k = 10 with probability 1/4;
• k = 15 with probability 1/2.

¹ Also called the hockey-stick identity.

Now, simultaneously raise all the tents while keeping the ratio of the three sides of the trapezoids. There is a position at which the adjacent trapezoid and triangle sides are coplanar. At that position, the regular dodecahedron is created. The regular icosahedron can then be constructed via the so-called dual graph construction. The dual graph G′ of a planar graph G = (V, E, S) has the faces in S as its vertices, and there is an edge between faces S1 and S2 if and only if they share an edge (i.e., were neighbours) in G. Clearly, the dual graph of the dodecahedron is the icosahedron, exactly as the cube and the octahedron are dual, while the tetrahedron is dual to itself.

2. A few graph algorithms

In this part, we consider several applications of graph concepts and the algorithms built upon them.

13.2.1. Graph representations. As already indicated, algorithms are often formulated with the help of the language of graphs. The concept of an algorithm can be formalized as a procedure dealing with a (directed) graph whose vertices and/or edges are equipped with further information. The procedure consists in walking through the graph along its edges, while processing the information associated with the visited vertices and edges. Of course, processing the information also includes the decision which outgoing edges must be investigated in a further walk, and in which order. In the case of an undirected graph, each (undirected) edge can be replaced with a pair of directed edges. The graph may also be changed during the run of the algorithm, i.e., vertices and/or edges may be added or removed. In order to execute such algorithms efficiently (usually on a computer), it is necessary to represent the graph in a suitable way. The adjacency matrix representation is one possibility, cf. 13.1.8. There are many other options based on various lists with suitable pointers. The edge list (also the adjacency list) of the graph G = (V, E) consists of two lists V and E that are interconnected by pointers so that every vertex points to the edges it is incident with, and every edge points to its endpoints. The memory necessary to represent the graph as an edge list is O(|V| + |E|), since every edge is pointed at twice and every vertex is pointed at d times, where d is its degree, and the sum of the degrees of all vertices equals twice the number of edges. Therefore, up to a constant multiple, this is an optimal way of representing the graph in memory.
It is of interest how the basic graph operations are processed in both representations. By the basic operations, we mean:
• removal of an edge,
• addition of an edge,
• removal of a vertex,
• addition of a vertex,
• splitting an edge with a new vertex.

i) What is the probability that at least two of them have been born in the same month? Assume that all months are equally likely. ii) Given that we already know there are at least two people who celebrate their birthday in the same month, what is the probability that k = 10? Solution. Let Ak be the event that at least 2 people out of k have their birthday in the same month. The complementary event Bk is that no two out of the k people have their birthday in the same month, i.e., there are exactly k distinct months in which these people were born. Thus, P(Ak) = 1 − (12 · 11 ⋯ (12 − k + 1))/12^k for 2 ≤ k ≤ 12. For k > 12, obviously, P(Ak) = 1. Therefore, the required probability is P = (1/4)P(A5) + (1/4)P(A10) + (1/2)P(A15) = (1/4)(1 − (12 · 11 · 10 · 9 · 8)/12^5) + (1/4)(1 − (12!/2!)/12^{10}) + 1/2. The second part of the question asks for the conditional probability P(k = 10 | A). According to Bayes' rule: P(k = 10 | A) = P(A | k = 10)P(k = 10)/P(A) = P(A10)/(4P(A)) = (1 − (12!/2!)/12^{10}) / ((1 − (12 · 11 · 10 · 9 · 8)/12^5) + (1 − (12!/2!)/12^{10}) + 2). □

13.E.9. Hat matching problem. N guests arrive at a party. Each person is wearing a hat. All hats are collected and then randomly redistributed at the departure. What is the probability that at least one person receives his/her own hat? Solution. Let Ai be the event that person i receives his own hat. Then the task is to find P(E), where E = ∪_{i=1}^{N} Ai. To find P(E), we use the inclusion-exclusion principle: P(E) = ∑_{i=1}^{N} P(Ai) − ∑_{i<j} P(Ai ∩ Aj) + ∑_{i<j<k} P(Ai ∩ Aj ∩ Ak) − ⋯. Since any k given guests all receive their own hats with probability (N − k)!/N! and there are \binom{N}{k} such groups of guests, the k-th sum equals \binom{N}{k}(N − k)!/N! = 1/k!, so P(E) = 1 − 1/2! + 1/3! − ⋯ + (−1)^{N−1}/N! ≈ 1 − 1/e. □

(5) if dG(v, w) > 1, then there exists a vertex z distinct from v and w such that dG(v, w) = dG(v, z) + dG(z, w).

The following is true: every function dG on V × V (for a finite set V) satisfying the five properties listed above allows us to define the edges E so that G = (V, E) is a graph with metric dG. Prove this yourself as an exercise! (It is quite clear how to construct the corresponding graph. It remains "merely" to show that the given function dG can be achieved as the metric on the constructed graph.)

13.2.4. Dijkstra's shortest-path algorithm. One may suspect that the shortest path between a given vertex v and another given vertex w can be found by breadth-first searching the graph. With this approach, first discuss the vertices which are reachable by one edge from the initial vertex v, then those which are two edges distant, and so on. This is the fundamental idea of one of the most often used graph algorithms: Dijkstra's algorithm⁵. This algorithm is able to find the shortest paths even in problems from practice, where each edge e is assigned a weight w(e), which is a positive real number. When looking for shortest paths, the weights represent the lengths of the edges.

⁵ Edsger Wybe Dijkstra (1930-2002) was a famous Dutch computer scientist, being one of the fathers of this discipline. Among others, he is credited as one of the founders of concurrent computing. He published the above algorithm in 1959.

13.E.11. Four players are given two cards each from a pack consisting of four Aces and four Kings. What is the probability that at least one of the players is given a pair of Aces? Express the result as a ratio of two-digit integers. ⃝
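The computation in 13.E.8 above is easy to evaluate numerically. A small illustrative Python sketch (not part of the original solution):

    import math

    # P(A_k) = 1 - (12 * 11 * ... * (12 - k + 1)) / 12^k for k <= 12, else 1
    def p_shared(k):
        return 1.0 if k > 12 else 1 - math.perm(12, k) / 12**k

    p = 0.25 * p_shared(5) + 0.25 * p_shared(10) + 0.5 * p_shared(15)
    print(p)                         # part i)
    print(0.25 * p_shared(10) / p)   # part ii), by Bayes' rule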
13.E.12. Alex owns two special dice: one of them has only 6's on its sides; the other one has two 4's, two 5's, and two 6's. Martin has two ordinary dice. Each of the players throws his dice, and the one whose sum is higher wins. What is the probability that Alex wins? Express the result as a ratio of two-digit integers. ⃝

13.E.13. In how many ways can we place n rooks on an n × n chessboard so that every unoccupied square is guarded by at least one of the rooks? Solution. Clearly, the condition is satisfied if and only if at least one of the following holds: there is at least one rook in each rank (which implies that there must be exactly one rook in each rank; there are n^n such placements, since the particular squares can be selected independently for each rank); or there is at least one rook in each file (again resulting in n^n placements). However, there are n! placements which satisfy both (we have n squares where to put a rook in the first rank, n − 1 squares for the second rank since one of the files is already occupied, etc.). By the inclusion-exclusion principle, the wanted number of placements is 2n^n − n!. □

13.E.14. We flip a coin five times. Every time it comes up heads, we put a white ball into a hat. Every time it comes up tails, we put a black ball into the hat. Express the probability that there are more black balls than white ones provided there is at least one black ball. Solution. Let us define the events A: there are more black balls than white ones, and H: there is at least one black ball. We want to compute P(A | H). Clearly, the probability P(H^C) of the event complementary to H is 2^{−5}. Further, the probability of A is the same as the probability P(A^C) (by the symmetry between black and white; with five balls, there cannot be equally many of each color). Therefore, P(H) = 1 − 2^{−5} and P(A) = 1/2. Moreover, P(A ∩ H) = P(A), since the event H contains the event A (the event A implies the event H). Altogether, we obtain P(A | H) = P(A ∩ H)/P(H) = (1/2)/(1 − (1/2)^5) = 16/31. □

F. More advanced problems from combinatorics

In the first chapter, we met the fundamental methods used in combinatorics. Even using merely these ideas, we are able to solve relatively complicated problems.

However, in general, the weights may have other meanings: they may stand for profits or costs, network flows, and so on. The input of the algorithm consists of an edge-weighted graph G = (V, E) and an initial vertex v0. The output consists of the numbers dw(v), which give the least possible sum of the weights of edges along a path from the vertex v0 to the vertex v. This procedure works in undirected graphs as well as in directed ones. In order to ensure the termination and the correctness of the algorithm, it is important that all of the weights are positive – see example 13.B.6. Dijkstra's algorithm needs only a little modification of the general breadth-first search:
• For every vertex v, keep the information d(v), which is an upper bound for the actual distance of v from the initial vertex v0.
• At every stage, the already processed vertices are those for which the shortest path is already known; for these, d(v) = dw(v).
• When some sleeping vertices are to be made active, choose exactly those vertices y from the set Z of sleeping vertices for which d(y) = min{d(z); z ∈ Z}.
Suppose that the graph G has at least two vertices. More formally, Dijkstra's algorithm can be described as follows:

Dijkstra's Algorithm

Input: a vertex v0 in the graph G = (V, E) with weights on all edges.
Output: the distances from v0 within G, associated to all vertices.
(1) Initialization step: for all v ∈ V, set d(v) = 0 for v = v0 and d(v) = ∞ for v ≠ v0. Set Z = V, W = ∅.
(2) Cycle condition: if every vertex y ∈ Z is assigned ∞, the algorithm terminates; otherwise the algorithm continues with another iteration. (In particular, the algorithm terminates if Z = ∅.)
(3) Update of the vertex statuses:
• Find the set N of those vertices v ∈ Z for which d(v) = δ is as small as possible, δ = min{d(y); y ∈ Z}.
• All vertices which have been in W are removed and marked as processed; the new set of active vertices is W = N, and all these vertices are removed from Z, i.e., they are no longer sleeping.
(4) Cycle body: for each edge e ∈ E_{WZ} (i.e., whose tail is an active vertex v and whose head is a sleeping vertex y): if d(v) + w(e) < d(y), then update d(y) to d(v) + w(e). Move back to check the cycle condition (step 2).

13.F.1. There are n (n ≥ 3) fortresses positioned on a circle, numbered 1 through n. At a given moment, every fortress shoots at one of its neighbors (i.e., fortress 1 shoots at n or 2, fortress 2 shoots at 1 or 3, etc.). We refer to the set of hit fortresses as a result (i.e., we are only interested in whether each fortress was hit or not; it does not matter whether it was hit once or twice). Let P(n) denote the number of possible results. Prove that the integers P(n) and P(n + 1) are coprime. Solution. First of all, note that a set of hit fortresses is a possible result if and only if no pair of adjacent-but-one fortresses (i.e., whose numbers differ by 2 modulo n) is unhit. Therefore, if n is odd, then P(n) is equal to the number K(n) of results where no pair of adjacent fortresses is unhit (consider the order 1, 3, 5, . . . , n, 2, 4, . . . , n − 1). If n is even, then P(n) equals K(n/2)², since the fortresses at even positions and those at odd positions can be considered independently. We can easily derive the following recurrence for K(n): K(n) = K(n − 1) + K(n − 2). (Well, on the other hand, it is not so trivial... It is left as an exercise for the reader.) Further, we can easily calculate that K(2) = 3, K(3) = 4, K(4) = 7, so K(2) = F(4) − F(0), K(3) = F(5) − F(1), K(4) = F(6) − F(2), and a simple induction argument shows that K(n) = F(n + 2) − F(n − 2), where F(n) denotes the n-th term of the Fibonacci sequence (F(0) = 0, F(1) = F(2) = 1). Moreover, since (K(2), K(3)) = 1, we have for n ≥ 3 (similarly as with the Fibonacci sequence) that (K(n), K(n − 1)) = (K(n) − K(n − 1), K(n − 1)) = (K(n − 2), K(n − 1)) = ⋯ = 1. Now, we are going to show that, for every n = 2a, P(n) = K(a)² is coprime to both P(n + 1) = K(2a + 1) and P(n − 1) = K(2a − 1). It suffices to realize that for a ≥ 2, we have
(K(a), K(2a + 1)) = (K(a), F(2)K(2a) + F(1)K(2a − 1)) = (K(a), F(3)K(2a − 1) + F(2)K(2a − 2)) = ⋯ = (K(a), F(a + 1)K(a + 1) + F(a)K(a)) = (K(a), F(a + 1)) = (F(a + 2) − F(a − 2), F(a + 1)) = (F(a + 2) − F(a + 1) − F(a − 2), F(a + 1)) = (F(a) − F(a − 2), F(a + 1))

13.2.5. Theorem. For a given vertex v0, Dijkstra's algorithm finds the distance dw(v) of each vertex v in G that lies in the connected component of the vertex v0. For the vertices v of the other connected components, d(v) = ∞ remains. The algorithm can be implemented in such a way that it terminates in time O(n log n + m), where n is the number of vertices and m is the number of edges in G.

Proof. The algorithm is correct, since:
• it terminates after a finite number of steps;
• when it does, its output has the desired properties.
The cycle condition guarantees that in each iteration, the number of sleeping vertices decreases by at least one, since N is always non-empty. Therefore, the algorithm necessarily terminates after a finite number of steps. After going through the initialization step,
(1) dw(v) ≤ d(v)
for all vertices v of the graph. Now assume that this property holds when the algorithm enters the main cycle, and show that it holds when it leaves the cycle as well. Indeed, if d(y) is changed during step 4, then it is caused by finding a vertex v such that dw(y) ≤ dw(v) + w({v, y}) ≤ d(v) + w({v, y}) = d(y), where the new value is on the right-hand side. The inequality (1) is thus satisfied when the algorithm terminates. It remains to verify that the opposite inequality holds as well. For this purpose, consider what is actually done in steps 3 and 4 of the algorithm. Let 0 = d0 < ⋯ < dk denote all the (distinct) finite distances dw(v) of the vertices in G from the initial vertex v0. At the same time, this partitions the vertex set of the graph G into clusters Vi of vertices whose distance from v0 is exactly di. During the first iteration of the main cycle, N = V0 = {v0}, the number δ is just d1, and the set of sleeping vertices is changed to V \ V0. Suppose this holds up to the j-th iteration (inclusive), i.e., the algorithm enters the cycle with N = Vj, δ = dj, and the set of sleeping vertices Z = V \ ∪_{i=0}^{j} Vi. Consider a vertex y ∈ V_{j+1}, i.e., dw(y) = d_{j+1} < ∞, and there exists a path (v0, e1, v1, . . . , vℓ, e_{ℓ+1}, y) with total length d_{j+1}. However, then
(2) dw(vℓ) ≤ d_{j+1} − w({vℓ, y}) < d_{j+1}.
It follows from the assumption that the vertex vℓ was active during an earlier iteration of the main cycle, with dw(vℓ) = d(vℓ) = di for some i ≤ j. Therefore, after the current iteration of the main cycle has been finished, d(y) = dw(vℓ) + w({vℓ, y}) = d_{j+1}, and this does not change any more. It follows that the inequality (1) holds with equality when the algorithm terminates.

= (F(a − 1), F(a + 1)) = (F(a − 1), F(a)) = 1,
(K(a), K(2a − 1)) = (K(a), F(2)K(2a − 2) + F(1)K(2a − 3)) = (K(a), F(3)K(2a − 3) + F(2)K(2a − 4)) = ⋯ = (K(a), F(a)K(a) + F(a − 1)K(a − 1)) = (K(a), F(a − 1)) = (F(a + 2) − F(a − 2), F(a − 1)) = (F(a + 2) − F(a), F(a − 1)) = (F(a + 2) − F(a + 1), F(a − 1)) = (F(a), F(a − 1)) = 1.
This proves the proposition. □

G. Probability in combinatorics

Classical probability is tightly connected to combinatorics, as we have already seen in the first chapter. Now, we present another example, which is a bit more complicated. Combinatorics is hidden even in the following "probabilistic" problem.

13.G.1. There are 100 prisoners in a prison, numbered 1 through 100. The chief guard has placed 100 chests (also numbered 1 through 100) into a closed room and randomly put balls numbered 1 through 100 into the chests so that each chest contains exactly one ball. He has decided to play the following game with the prisoners: he calls them one by one into the room, and the invited prisoner is allowed to gradually open 50 chests. Then, he leaves without any possibility to talk to the other prisoners, the guard closes all the chests, and another prisoner is let in. The guard has promised to free all the prisoners provided each of them finds the ball with his number in one of the 50 opened chests. However, if any of the prisoners fails to find his ball, all will be executed. Before the game begins, the prisoners are allowed to agree on a strategy. Does there exist a strategy that gives the prisoners a "reasonable" chance of winning?
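Returning to the theory column for a moment: a minimal Python sketch of Dijkstra's algorithm from 13.2.4, in the heap-based variant suggested in the proof of 13.2.5. It is only an illustration under simplifying assumptions (adjacency lists with positive weights; vertices are processed one at a time instead of in the level sets N):

    import heapq

    def dijkstra(graph, v0):
        # graph: vertex -> list of (neighbour, weight), all weights positive
        d = {v: float('inf') for v in graph}
        d[v0] = 0
        heap, done = [(0, v0)], set()
        while heap:
            dist, v = heapq.heappop(heap)
            if v in done:
                continue               # an outdated heap entry
            done.add(v)                # d(v) is final here: d(v) = dw(v)
            for y, w in graph[v]:
                if dist + w < d[y]:    # the update of step (4)
                    d[y] = dist + w
                    heapq.heappush(heap, (d[y], y))
        return d

    graph = {'z': [('a', 2), ('b', 5)], 'a': [('b', 1)], 'b': []}
    print(dijkstra(graph, 'z'))        # {'z': 0, 'a': 2, 'b': 3}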
Solution. Clearly, if the prisoners choose to open the chests randomly (where the choices of the particular prisoners are independent), the chance for one prisoner to find his ball is 1/2, so the total probability of success is merely 1/2^100. Therefore, it is necessary to look for a strategy where the successes of the prisoners are as dependent as possible. First of all, we should realize that the invited prisoner has no information from the other prisoners and does not know the positions of the particular balls in the chests. However, once he opens a chest, he knows the number of the ball it contains and may choose the next chest accordingly. This suggests the following simple strategy: every prisoner starts with the chest that bears his number. If it contains the corresponding ball, the prisoner succeeds and can open the remaining chests at random. If not, he opens the chest with the number of the ball just found. He continues this way until he eventually finds his ball or opens the fiftieth chest. Since every chest "points" at another chest according to the described procedure, let us call this strategy the pointer strategy.

The analysis of the main cycle just made also determines a bound for the running time of the algorithm (i.e., the number of elementary operations with the graph and other corresponding objects). The main cycle is iterated as many times as there are (distinct) distances di in the graph. Every vertex, when processed during step 3, is considered exactly once. The vertices that are still sleeping must be sorted. This gives the bound O(n log n) for this part of the algorithm, provided the graph is stored as a list of vertices and weighted edges, such that the sleeping vertices are kept in a suitable data structure that allows finding the set N of active vertices in time O(log n + |N|). This can be achieved if a heap is used. Every edge is processed exactly once in step 4, since the vertices are active only during one iteration of the cycle. □

Note that the inequality (2), essential for the analysis of the algorithm, need not hold if the weights of the edges are allowed to be negative. In practice, many heuristic improvements of the algorithm are applied. For instance, it is not necessary to compute the distances of all vertices if only the distance between a given pair of vertices is of interest: when a vertex is excluded from the active ones, its distance is final. Further, it is not necessary to initialize the distances with the value of infinity. Of course, this is technically impossible anyway, and a sufficiently large constant would be needed in the implementation.

13.2.6. Spanning trees. In practical applications, graphs often encode all possibilities of connections between particular objects, as in road or electrical networks. If it is only required that each pair of vertices is connected by a path, using as few edges as possible, then what is needed is a subgraph T which is a tree. This corresponds to the problem of finding a minimal network.

Spanning tree of a graph

Definition. Any tree T = (V, E′), E′ ⊆ E, in a graph G = (V, E) is called a spanning tree of the graph G.

A graph can have a spanning tree if and only if it is connected. Indeed, a graph with a spanning tree is connected, since all trees are. Conversely, the following algorithm finds a spanning tree for any given connected graph.

Probability of success. The guard's possible placements of the balls bijectively correspond to permutations of the numbers 1 through 100.
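Before deriving this probability exactly, one can estimate it empirically. The following Monte Carlo sketch in Python (an illustration, not part of the original argument) uses the fact, established below, that the pointer strategy succeeds if and only if no cycle of the random permutation is longer than n/2:

    import random

    def all_succeed(n=100):
        perm = list(range(n))
        random.shuffle(perm)          # chest j contains ball perm[j]
        seen = [False] * n
        for start in range(n):
            if seen[start]:
                continue
            length, j = 0, start
            while not seen[j]:        # walk the cycle through 'start'
                seen[j] = True
                length += 1
                j = perm[j]
            if length > n // 2:       # some prisoner would open > n/2 chests
                return False
        return True

    trials = 10_000
    print(sum(all_succeed() for _ in range(trials)) / trials)   # about 0.31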
In order to find the probability of success, we must realize for which permutations the pointer strategy works. Recall that every permutation can be expressed as the composition of pairwise disjoint cycles. If each prisoner were allowed to open an arbitrary number of chests, he would find his ball as the last one of the corresponding cycle, since he begins with the chest with his number, which is pointed at just by the chest with his ball. It follows that the strategy fails if and only if there is a cycle of length greater than 50, because then no prisoner of this cycle finds his ball in time. Thus, we must count the number of such permutations. In general, the number of permutations of length n containing a cycle of length r > n/2 (there could be more occurrences of shorter cycles; however, there can be at most one cycle of length greater than n/2, which simplifies the calculation) is computed as follows: we must choose the r elements of the cycle, order them cyclically, and then choose an arbitrary permutation of the remaining n − r numbers. This leads to \binom{n}{r} (r − 1)! (n − r)! = n!/r. Therefore, the probability that such a permutation is selected (among all the n! permutations) is 1/r. Thus, the probability that our 100 prisoners succeed is 1 − ∑_{k=51}^{100} 1/k ≈ 0.311828. As we can see, this is much higher than the original 1/2^100. Now, let us look at the behavior of this probability for a general number n of prisoners (each prisoner is then allowed to open at most n/2 chests). In general, the probability that a random permutation of length n contains a cycle of length greater than n/2 is equal to p = ∑_{k=n/2+1}^{n} 1/k. Recall that ∑_{k=1}^{n} 1/k behaves as ln(n) + γ for n → ∞, where γ is Euler's constant. Thus, we have:

Spanning forest algorithm

Input: a graph G = (V, E).
Output: a forest T = (V, E′) consisting of spanning trees of the components of G.
(1) Sort all the edges e0, . . . , em ∈ E in any order.
(2) Start with E0 = {e0} and gradually build the sets of edges Ei, so that in the i-th step, the edge ei is added to E_{i−1} unless this creates a cycle in the graph Gi = (V, E_{i−1} ∪ {ei}). If this edge creates a cycle, leave Ei = E_{i−1} unchanged.
(3) The algorithm terminates if the graph Gi = (V, Ei) has exactly n − 1 edges at some step i, or if i = m, and it produces the graph T = (V, Ei).

If the algorithm terminates for the latter reason, then the graph is not connected and no spanning tree exists (but T still consists of the spanning trees of all the individual components).

Proof. It follows from the rules of the algorithm that the resulting subgraph T of G never contains a cycle. Therefore, it is a forest. If the resulting number of edges is n − 1, then it must be a tree; see theorem 13.1.15. It remains to show that the connected components of the graph T have the same sets of vertices as the connected components of the original graph G. Every path in T is also a path in G; therefore, all vertices that lie in one tree of T must lie in the same component of G. If there were a path in G from v to w whose endpoints lie in different trees of T, then one of its vertices vi would be the last one lying in the component determined by v (in particular, v_{i+1} does not lie in this component). The corresponding edge {vi, v_{i+1}} must have created a cycle when examined by the algorithm, since otherwise it would be in T. Since the edges are never removed from T, there is a path between vi and v_{i+1} in T, which contradicts the assumptions. Therefore, v and w cannot lie in different trees of T.
The number of components of T is given by the fact that the numbers of vertices and edges differ by one in every tree. The difference increases by one with every component, so if there are n vertices and k edges in the forest, then there are n − k components. □

Remark. As always, the time complexity of the algorithm is of interest. The addition of an edge creates a cycle if and only if its endpoints lie in the same connected component of the forest T under construction, so knowledge of the connected components of the current forest T is helpful. To implement the algorithm, it is needed to unite two equivalence classes of a given equivalence relation on a given set (the vertex set) and to find out whether two vertices are in the same class or not. The union requires time O(k), where k is the number of elements to be united; k can be bounded from above by n, the total number of vertices. However, for each equivalence class, it can be noted how many vertices it contains. If, for each vertex, the information to which class it belongs is kept, then the union operation means to relabel the vertices of one of the united classes.

p = ∑_{k=1}^{n} 1/k − ∑_{k=1}^{n/2} 1/k → ln(n) + γ − ln(n/2) − γ = ln 2, for n → ∞.

Hence it follows that, for large values of n, the probability of success approaches 1 − p ≈ 1 − ln 2 = 0.30685 . . .. Now, we are going to show that the pointer strategy is optimal.

Optimality of the pointer strategy. In order to prove the optimality of the pointer strategy, we merely modify the rules of the game and further define another game. Consider the following rules: every prisoner keeps opening the chests until he finds his ball. The prisoners win iff each opens at most 50 chests. Clearly, this modification does not change the probability of success, but it will help us prove the optimality. We will refer to this game as game A. Now, consider another game (game B) with the following rules: first, prisoner number 1 enters the chest room and keeps opening the chests until he finds his ball. Then, the guard leaves the opened chests as they are and immediately invites the prisoner with the least undiscovered number. The game proceeds this way until all chests are opened. The prisoners win iff none of them has opened more than 50 (n/2 in the general case) chests. Suppose that the guard notes the ball numbers in the order they were discovered by the prisoners. This results in a permutation of the numbers 1 through 100, from which he can see whether the prisoners won or not. The probability of discovering a particular ball is at every moment independent of the selected strategy. There are 100! permutations which correspond to the possible strategies (no matter whether they are random or sophisticated), since they are merely the orders in which the ball numbers are discovered. In order to compute the probability of success in game B, we should note that any order can be written as the composition of cycles where each cycle contains the ball numbers discovered by a given prisoner. For the sake of clarity, consider a game with 8 prisoners. If the guard has noted the permutation (2, 5, 7, 1, 6, 8, 3, 4), then we can see that the prisoners win, since prisoner 1 discovered the numbers (2, 5, 7, 1), then prisoner 3 discovered (6, 8, 3), and finally prisoner 4 discovered only his number (4). In this case, we can write: (2, 5, 7, 1, 6, 8, 3, 4) → (2, 5, 7, 1)(6, 8, 3)(4). Further, any such permutation corresponds to a unique order of the numbers 1 through 8.
Having any permutation in the cyclic notation, we first rearrange each cycle so that its least number is the last one, and then we sort the cycles by their last numbers in ascending order. For instance, we have: (7, 5, 8)(2, 4)(1, 6, 3) → (6, 3, 1)(4, 2)(8, 7, 5) → (6, 3, 1, 4, 2, 8, 7, 5). We have thus constructed a bijection between the winning orders of discovered numbers and the permutations of the numbers 1 through 8 that do not contain a cycle of length greater than 4 (n/2 in the general case). It follows that the probability of success in game B is the same as the probability that a random permutation does not contain such a cycle. This corresponds to the probability of success in the original game using the pointer strategy. Now, this implies an important conclusion for game A.
Indeed, the prisoners may apply any strategy from game A to game B as follows: each prisoner behaves like in game A, but he considers open chests to be closed, i.e., if he wants to open a chest which has already been opened, he just "passes" this move and further behaves as if he had just discovered the ball number in the considered chest. Therefore, any strategy that succeeds for a given placement of the balls in game A must succeed for the same placement in game B as well. Therefore, if there existed a better strategy for game A, we could apply it to game B and obtain a higher chance of winning there. However, this is impossible, since all strategies in game B lead to the same probability of success. Therefore, the pointer strategy is better than or equally good as any other strategy. □

13.G.2. In a competition, there are m contestants and n officials, where n is an odd integer greater than two. Each official judges each contestant as either good or bad. Suppose that any pair of officials agree on at most k contestants. Prove that k/m ≥ (n − 1)/(2n). Let us look at two possible approaches to this problem. Solution. Let us count the number N of pairs ({official, official}, contestant), where the officials are distinct and agree on the contestant. Altogether, there are \binom{n}{2} pairs of officials, and each pair can agree on at most k contestants. Therefore, N ≤ k\binom{n}{2}. Now, let us fix a contestant X and count the number of pairs of officials who agree on X. Say that x officials said X was good. Then, there are \binom{x}{2} pairs who agree that X is good and \binom{n−x}{2} pairs who agree that X is bad. Altogether, there are \binom{x}{2} + \binom{n−x}{2} = x(x − 1)/2 + (n − x)(n − x − 1)/2 pairs that agree on X. We have x(x − 1)/2 + (n − x)(n − x − 1)/2 = (2x² − 2nx + n² − n)/2 = (x − n/2)² + n²/4 − n/2 ≥ n²/4 − n/2 = (n − 1)²/4 − 1/4. Since n is odd, the expression (n − 1)²/4 is an integer. Thus, the number of pairs that agree on X, being an integer, is at least (n − 1)²/4. Hence N ≥ m(n − 1)²/4. Combining these two inequalities, we get k/m ≥ (n − 1)/(2n). An alternative solution, using probabilities: let us choose a pair of officials at random, and let X be the random variable which tells the number of contestants on which this pair agrees. We are going to prove the contrapositive implication, i.e., if k/m < (n − 1)/(2n), then X is greater than k with probability greater than zero, which will be denoted P(X > k) > 0.

If the smaller class is always selected to be relabeled, then the total number of operations of the algorithm is O(n log n + m). (As an exercise, complete the details of these considerations yourself!) The above reasoning shows that a slightly better time might be achieved if only the spanning tree of the connected component of a given starting vertex is of interest:

Another spanning tree algorithm

Input: G = (V, E) with n vertices and m edges, and a vertex v ∈ V.
Output: the tree T spanning the connected component of v.
(1) Initialize T0 = ({v}, ∅).
(2) In the i-th step, build the tree Ti as follows: look for edges e which are not in T_{i−1} but whose tail vertex ve is. Take one of them and add it to T_{i−1}, i.e., add the head vertex to V_{i−1} and e to E_{i−1}.
(3) The algorithm terminates as soon as no such edge exists.

Apparently, the resulting graph T is connected. The count of its vertices and edges shows that it is a tree.

Proof. The vertices of T coincide with the vertices of the connected component of the graph G containing the starting vertex v. Indeed, suppose there is a path from v to a vertex w. If w did not lie in T, then label by vi the last of the path's vertices that lies in T (just like in the proof of the previous lemma). However, the subsequent edge of the path would have to be added to T by the algorithm before it terminated, which is a contradiction. Consequently, this algorithm finds a spanning tree of the connected component that contains a given initial vertex v in time O(n + m). □

13.2.7. Minimum spanning tree. All spanning trees of a given graph G have the same number of edges, since this is a general property of trees. Just as the shortest path in graphs with weighted edges was found, a spanning tree with the minimum sum of its edges' weights is now desired.

Definition. Let G = (V, E, w) be a connected graph whose edges e are labeled by non-negative weights w(e). A minimum spanning tree of G is a spanning tree whose total weight does not exceed that of any other spanning tree.

This problem has many applications in practice, for instance in networks of electricity, gas, water, etc. Surprisingly, it is quite simple to find a minimum spanning tree (supposing all edge weights w(e) of G are non-negative) by the following procedure⁶:

⁶ Joseph Bernard Kruskal (1928-2010) was a famous American mathematician, statistician, computer scientist, and psychometrician. There are other famous mathematicians of the same surname: his two brothers and one nephew. Martin David co-invented solitons and surreal numbers, William was active in statistics, and Clyde was a computer scientist too. The above algorithm dates from 1956.

Kruskal's algorithm

Input: a graph G = (V, E, w) with non-negative weights on the edges.
Output: the minimum spanning trees for all components of G.
(1) Sort the m edges in E so that w(e1) ≤ w(e2) ≤ ⋯ ≤ w(em).
(2) For this order of the edges, call the "Spanning forest algorithm" from the previous subsection.

This is a typical example of the "greedy approach", where maximizing profits (or minimizing expenses) is attempted by always choosing the option which is the most advantageous at each stage. In many problems, this approach fails, since low expenses at the beginning may be the cause of much higher ones at the end. Therefore, greedy algorithms are often a base for very useful heuristic algorithms but seldom yield optimal solutions. However, in the case of the minimum spanning tree, this approach works:

Theorem. Kruskal's algorithm finds a minimum spanning tree for every connected graph G with non-negative edge weights. The algorithm runs in O(m log m) time, where m is the number of edges of G.

Proof.
Let T = (V, E(T)) denote the spanning tree generated by Kruskal's algorithm, and let T̃ = (V, E(T̃)) be an arbitrary minimum spanning tree. The minimality implies that ∑_{e∈E(T̃)} w(e) ≤ ∑_{e∈E(T)} w(e), so the goal is to show that also ∑_{e∈E(T)} w(e) ≤ ∑_{e∈E(T̃)} w(e). If E(T) = E(T̃), then nothing further is needed. So assume there exists an edge e ∈ E(T) such that e ∉ E(T̃). From all such edges, choose one, call it e, with weight w(e) as small as possible. The addition of e into T̃ creates a cycle e e1 e2 ⋯ ek in T̃, and at least one of its edges ei is not in E(T). The choice of the edge e implies that if w(ei) < w(e), then the edge ei would have been among the candidate edges in Kruskal's algorithm after a certain subtree T′ ⊆ T ∩ T̃ had been created, so its addition to the gradually constructed tree T would not create a cycle. Therefore, if w(ei) < w(e), the edge ei would have been chosen by the algorithm. It follows that w(ei) ≥ w(e). However, now the edge ei can be replaced with e in T̃ (by the choice of ei, this results in a spanning tree again) without increasing the total weight. So the resulting T̃ is again a minimum spanning tree, and it differs from T in fewer edges than before. Therefore, in a finite number of steps, T̃ is changed to T without increasing the total weight. □

13.2.8. Two more algorithms. The second algorithm for finding a spanning tree, presented in 13.2.6, also leads to a minimum spanning tree:

Consider the random variables Xi for i = 1, 2, . . . , m with values in {0, 1}, denoting whether the pair agrees on the i-th contestant: let Xi = 1 when they agree, and let Xi = 0 otherwise. Hence we have X = X1 + X2 + ⋯ + Xm. Using the linearity of expectation, we obtain E[X] = E[X1] + E[X2] + ⋯ + E[Xm]. Now, let us calculate E[Xi] = ∑_{xi∈{0,1}} xi · P(Xi = xi). Since Xi can be only 0 or 1, we have directly E[Xi] = P(Xi = 1). Let us examine the probability P(Xi = 1), i.e., that the officials agree on the i-th contestant. There are \binom{n}{2} pairs of officials. Let ti denote the number of officials who say the i-th contestant is good, and n − ti the number of those who do not. Then, there are \binom{ti}{2} pairs who agree that the i-th contestant is good and \binom{n−ti}{2} pairs who agree on the contrary. Altogether, there are \binom{ti}{2} + \binom{n−ti}{2} pairs that agree on the i-th contestant. Therefore, E[Xi] = P(Xi = 1) = (\binom{ti}{2} + \binom{n−ti}{2}) / \binom{n}{2}. Hence, E[X] = ∑_{i=1}^{m} (\binom{ti}{2} + \binom{n−ti}{2}) / \binom{n}{2}. We are going to show that for odd values of n, we have \binom{ti}{2} + \binom{n−ti}{2} ≥ (n − 1)²/4. Rearranging this leads to (n − 2ti)² ≥ 1 ⟺ ti ≤ (n − 1)/2 or ti ≥ (n + 1)/2, which is clearly true, since (n − 1)/2 and (n + 1)/2 are adjacent integers. Using the inequality \binom{ti}{2} + \binom{n−ti}{2} ≥ (n − 1)²/4, we obtain E[X] ≥ m((n − 1)/2)² / (n(n − 1)/2) = m(n − 1)/(2n). Thanks to the assumption m(n − 1)/(2n) > k, we have E[X] > k, and thus P(X > k) > 0, which finishes the proof. □

Further, we demonstrate an application of probabilities to an interesting problem.

13.G.3. Let S be a finite set of points in the plane which are in general position (i.e., no three of them lie on a straight line). For any convex polygon P all of whose vertices lie in S, let a(P) denote the number of its vertices and b(P) the number of points from S which are outside P. Prove that for any real number x, we have ∑_P x^{a(P)} (1 − x)^{b(P)} = 1, where the sum runs over all convex polygons P with vertices in S. (A line segment, a singleton, and the empty set are considered to be a convex 2-gon, 1-gon, and 0-gon, respectively.)
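An aside to the theory column: a compact Python sketch of Kruskal's algorithm proved above, combined with the relabel-the-smaller-class bookkeeping from the remark in 13.2.6 (names are illustrative; the sorting step already accounts for the O(m log m) bound):

    def kruskal(n, edges):
        # edges: list of (weight, u, v); vertices are 0, ..., n - 1
        comp = list(range(n))              # comp[v] = label of v's class
        members = [[v] for v in range(n)]
        tree = []
        for w, u, v in sorted(edges):
            cu, cv = comp[u], comp[v]
            if cu == cv:
                continue                   # the edge would close a cycle
            if len(members[cu]) < len(members[cv]):
                cu, cv = cv, cu            # always relabel the smaller class
            for x in members[cv]:
                comp[x] = cu
            members[cu].extend(members[cv])
            members[cv] = []
            tree.append((u, v, w))
        return tree

    print(kruskal(4, [(1, 0, 1), (4, 1, 2), (3, 0, 2), (2, 2, 3)]))
    # [(0, 1, 1), (2, 3, 2), (0, 2, 3)]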
Jarník-Prim's algorithm⁷

Input: a connected graph G = (V, E, w) with n vertices and m edges, with non-negative weights on the edges.
Output: a minimum spanning tree T of G.
(1) Initialize T0 = ({v}, ∅) with some vertex v ∈ V.
(2) In the i-th step, look for all edges e which are not in T_{i−1} but whose tail vertex ve is. Take the one of them with minimal weight and add it to T_{i−1}, i.e., add the head vertex to V_{i−1} and e to E_{i−1}.
(3) The algorithm terminates when the number of added edges reaches n − 1.

Borůvka's algorithm is similar. It constructs as many connected components as possible simultaneously: it begins with the singleton components in the graph T0 = (V, ∅), and in each step, it connects every component to another component with the shortest edge possible. It is easy to prove that (provided the edge weights are pairwise distinct) this results in a minimum spanning tree.

Borůvka's algorithm

Input: a connected graph G = (V, E, w) with non-negative weights on the edges.
Output: the minimum spanning tree for G.
(1) Initialization: create the graph S with the same vertex set as G and no edges, and an empty auxiliary set F of edges.
(2) The main loop: while S contains more than one component, do:
• for every tree T in S, find the shortest edge that connects T to G \ T, and add this edge into F;
• add all edges of F into the graph S and clear F.

Note that Borůvka's algorithm can be executed using parallel computation, which is why it is used in many practical modifications. The proofs that both of these algorithms are correct are similar to that of Kruskal's. The details are omitted.

13.2.9. Traveling salesman problem. So far, our short excursion through graph-based algorithms could give the feeling that simple and straightforward algorithms can always be found for the problems considered. So far, however, only the easy problems have been considered. In all but very few cases, the contrary is true: mostly, there are no algorithms running in polynomial time, so one needs to use algorithms which do not always find the optimal solution but give one which is as good as possible. This is called a heuristic approach.

⁷ Robert Clay Prim (born 1921) is an American mathematician and computer scientist. While he published his work already in the realm of computer science, and hence the most common name of the algorithm is "Prim's", earlier works by Otakar Borůvka (1899-1995) and Vojtěch Jarník (1897-1970) appeared before those by Prim. Borůvka's algorithm was designed when consulting the construction of a new electricity network in Moravia, a region in central Europe, in 1926, and Jarník published the algorithm (rediscovered much later by Prim) in 1930, motivated by Borůvka.

Solution. First of all, we prove the wanted equality for x ∈ [0, 1]. Let us color each point from S white with probability x and black with probability 1 − x (in other words, we consider a random choice of the size |S| with the binomial probability distribution Bi(n, x), where success corresponds to white and failure corresponds to black). We can note that for any such coloring, there must exist a polygon such that all of its vertices are white and all points outside are black (this polygon is the convex hull of the white points). The above shows that the probability that the random choice realizes a polygon with all vertices white and all exterior points black is equal to one.
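An aside to the theory column: a heap-based Python sketch of the Jarník-Prim algorithm above (an illustration only; the heap replaces the explicit search for the minimal edge leaving the tree):

    import heapq

    def prim(graph, v0):
        # graph: vertex -> list of (weight, neighbour); undirected, connected
        in_tree, total = {v0}, 0
        heap = list(graph[v0])
        heapq.heapify(heap)
        while heap and len(in_tree) < len(graph):
            w, v = heapq.heappop(heap)     # lightest edge leaving the tree
            if v in in_tree:
                continue
            in_tree.add(v)
            total += w
            for edge in graph[v]:
                heapq.heappush(heap, edge)
        return total

    graph = {0: [(1, 1), (3, 2)], 1: [(1, 0), (4, 2)],
             2: [(3, 0), (4, 1), (2, 3)], 3: [(2, 2)]}
    print(prim(graph, 0))                  # 1 + 3 + 2 = 6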
However, we can compute this probability in a different way. The event of a polygon having this property is the union of k disjoint events, where k is the number of convex polygons: namely, the events that a given polygon has the desired property (note that the property cannot be shared by different convex polygons). For every given convex polygon P, the probability that its vertices are white and the points outside it are black is equal to x^{a(P)} (1 − x)^{b(P)}, where a(P) is the number of vertices of P and b(P) is the number of points from S outside P. Since the probability of a union of disjoint events is equal to the sum of the particular events' probabilities, we get ∑_P x^{a(P)} (1 − x)^{b(P)} = 1. This proves the equality for all numbers in the interval [0, 1]. However, we can also perceive this fact as follows: any real number from the interval [0, 1] is a root of the polynomial ∑_P x^{a(P)} (1 − x)^{b(P)} − 1. As we know, a nonzero polynomial over the (infinite) field of real numbers can have only finitely many roots (see 12.3.6). Therefore, ∑_P x^{a(P)} (1 − x)^{b(P)} − 1 is the zero polynomial, and the equality ∑_P x^{a(P)} (1 − x)^{b(P)} = 1 thus holds for all real numbers x. □

Remark. This equality holds even if we define the numbers a(P) and b(P) in another way: the definition of a(P) is the same, but now let b(P) denote the number of points from S which are not the vertices of P (thus, we always have a(P) + b(P) = |S|). Then, the given equality is a corollary of the binomial theorem for (x + (1 − x))^{|S|}.

13.G.4. A competition with n players is called an (n, k)-tournament iff it has k rounds and satisfies the following:
i) every player competes in every round, and any pair of players competes at most once;
ii) if A plays with B in the i-th round, C plays with D in the i-th round, and A plays with C in the j-th round, then B plays with D in the j-th round.
Find all pairs (n, k) for which there exists an (n, k)-tournament. Solution. There exists an (n, k)-tournament if and only if 2^⌈log₂(k+1)⌉ divides the integer n. First of all, we are going to show the "if" part. We construct a (2^t, k)-tournament

One of the most important combinatorial problems of this class is the problem of finding a minimum Hamiltonian cycle. This is a Hamiltonian cycle with the minimum sum of the weights of its edges among all Hamiltonian cycles. This problem arises in many practical applications. For instance:
• goods or post delivery (via a given network),
• network maintenance (electricity, water pipelines, IT, etc.),
• request processing (parallel requests for reading from a hard disk, for instance),
• measuring several parts of a system (for example, when studying the structure of a protein crystal using X-rays, the main expenses are due to the movements and focusing for particular measurements),
• material division (for instance, when covering a wall with wallpaper, one tries to keep the pattern continuous while minimizing the amount of unused material).
The greedy approach can be applied in the case of looking for a minimum Hamiltonian cycle as well. The algorithm begins in an arbitrary vertex v1, which is set active, and the other vertices are labeled as sleeping. In each step, it examines the sleeping vertices adjacent to the active one and selects the one which is connected by the shortest edge. The active vertex is labeled as processed, and the selected vertex becomes active. The algorithm terminates either with a failure, when there is no edge going from the active vertex to a sleeping one, or it successfully finds a Hamiltonian path. In the latter case, if there exists an edge from the last vertex vn to v1, a Hamiltonian cycle is obtained.
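For illustration, a Python sketch of this greedy nearest-neighbour heuristic on a complete graph (names are illustrative; as noted below, the result is a Hamiltonian cycle, but rarely a minimal one):

    def greedy_cycle(dist, start=0):
        # dist: symmetric matrix of edge weights of a complete graph
        n = len(dist)
        path, sleeping = [start], set(range(n)) - {start}
        while sleeping:
            v = path[-1]                              # the active vertex
            nxt = min(sleeping, key=lambda y: dist[v][y])
            path.append(nxt)                          # shortest edge to a sleeping vertex
            sleeping.remove(nxt)
        cost = sum(dist[u][v] for u, v in zip(path, path[1:]))
        return path, cost + dist[path[-1]][start]     # close the cycle

    dist = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 8], [10, 4, 8, 0]]
    print(greedy_cycle(dist))                         # ([0, 1, 3, 2], 23)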
This algorithm seldom produces a minimal Hamiltonian cycle. At least, it always finds some (and a relatively small) Hamiltonian cycle in a complete graph.

13.2.10. Flow networks. Another group of applications of the language of graph theory concerns moving some amount of a measurable material in a fixed network. The vertices of a directed graph represent places between which one transports material up to predetermined limits, which are given as assessments of the edges (called capacities). There are two important types of vertices: the sources and the sinks of the network. A network is a directed graph with valued edges, where some of the vertices are labeled as sources or sinks. Without loss of generality, assume that the graph is directed and has only one source and one sink: in the general case, an artificial source and an artificial sink can always be added, connected with directed edges to the original sources and sinks. The capacities of the added edges would then cover the maximum capacities of the particular sources and sinks. The situation is depicted in the diagram. There, the black vertices on the left correspond to the given sources, while the black vertices on the right stand for the given sinks. On the left,

where k ≤ 2^t − 1 (the general case 2^t | n can then be easily derived from that). There are thus 2^t players in the tournament, so we assign to each player a (unique) number from the set {0, 1, . . . , 2^t − 1}. In the i-th round, player α competes with player α ⊕ i (where ⊕ is the binary XOR operation, i.e., the j-th bit of a ⊕ b is one if and only if the j-th bit of a is different from the j-th bit of b). This schedule is correct, since every player is engaged in every round and different players have different opponents (for α ≠ β, we have α ⊕ i ≠ β ⊕ i). Further, the opponent of the opponent of α is indeed α (since (α ⊕ i) ⊕ i = α). Moreover, the second tournament rule is also satisfied: if α plays with β and γ plays with δ in the i-th round (i.e., β = α ⊕ i and δ = γ ⊕ i) and if j is such that α plays with γ in the j-th round, then we have β ⊕ j = (α ⊕ i) ⊕ j = (α ⊕ j) ⊕ i = γ ⊕ i = δ, so β indeed plays with δ in the j-th round. Any (2^t · s, k)-tournament, where s is odd, can be obtained as s parallelized (2^t, k)-tournaments. Now, we are going to show that the condition 2^⌈log₂(k+1)⌉ | n is necessary as well. Consider the graph Gi whose vertices correspond to the players and whose edges are between the pairs who have played in or before the i-th round. Consider players A and B who play together in round i + 1. We want to show that we must have |Γ| = |Δ|, where Γ is the component of A and Δ is the component of B. Actually, we show that any player of Γ competes with a player of Δ in round i + 1. Thus, let C ∈ Γ, i.e., in Gi, there exists a path A = X1, X2, . . . , Xm = C such that Xj has played with X_{j+1}, j = 1, . . . , m − 1, in or before the i-th round. Consider the sequence Y1, Y2, . . . , Ym, where Yk is the opponent of Xk in round i + 1, k = 1, . . . , m (thus Y1 = B). Then, for any 1 ≤ j ≤ m − 1, we have that Xj competes with Yj and X_{j+1} competes with Y_{j+1} in round i + 1 (by the definition of the sequence Y1, . . . , Ym), and in a certain round r (1 ≤ r ≤ i), Xj played with X_{j+1} (by the definition of the sequence X1, . . . , Xm).
However, by the second tournament rule, this means that Yj also played with Y_{j+1} in the r-th round, so the edge YjY_{j+1} is contained in Gi for any 1 ≤ j ≤ m − 1. Thus, Y1, Y2, . . . , Ym is a path in Gi, so B = Y1 and Ym lie in the same component (Δ). It can be deduced analogously that any player from Δ competes with a player of Γ in round i + 1, and since every player plays exactly once in a given round, we must have |Γ| = |Δ|. By the definition of a component, the component of A in G_{i+1} is equal to Γ ∪ Δ. Then, we have either Γ = Δ (in which case the component of A in G_{i+1} is Γ), or Γ ∩ Δ = ∅ (in this case, the component of A in G_{i+1} is the disjoint union Γ ∪ Δ). Altogether, the component of A in G_{i+1} is either the same or twice as great as in Gi. Now, consider the components Γ1, Γ2, . . . , Γk of A in the respective graphs G1, G2, . . . , Gk. We have |Γ1| = 2 (since A had exactly one opponent in the first round), and for 1 ≤ i ≤ k − 1, we have either |Γi| = |Γ_{i+1}| or 2|Γi| = |Γ_{i+1}|. Therefore, the number of vertices (players) of every component is a power of two, i.e., |Γk| = 2^l for some l, and |Γk| ≥ k + 1 (A had different opponents in the k rounds). Hence, 2^l ≥ k + 1, i.e., 2^l is at least 2^⌈log₂(k+1)⌉, so the number of players in each component is divisible by 2^⌈log₂(k+1)⌉. Thus, so must be the total number n. □

there is an artificial source (a white vertex), and there is an artificial sink on the right. The edge values are not shown in the diagram.

Flow networks

A network is a directed graph G = (V, E) with a distinguished vertex z, called the source, and another distinguished vertex s, called the sink, together with a non-negative assessment of the edges w : E → R, which represents their capacities. A flow in a network S = (V, E, z, s, w) is an assessment of the edges f : E → R such that, for each vertex v except for the source and the sink, the total input is equal to the total output, i.e., ∑_{e∈IN(v)} f(e) = ∑_{e∈OUT(v)} f(e). This rule is often called Kirchhoff's law (referring to the terminology used in physics). The size of a flow f is given by the total balance of the source values, |f| = ∑_{e∈OUT(z)} f(e) − ∑_{e∈IN(z)} f(e).

It follows directly from the definition that the size of a flow f can also be computed as |f| = ∑_{e∈IN(s)} f(e) − ∑_{e∈OUT(s)} f(e). The left-hand part of the following diagram shows a simple network with the source in the white circled vertex and the sink in the black bold vertex. The labels over the edges determine the maximal capacities. Looking at the sum of the capacities that enter the sink, the maximum flow in this network is 5 (the sum of the capacities leaving the source is larger).

(diagram: on the left, the network with edge capacities; on the right, the same network with a flow of size 5, each edge labeled flow/capacity)

13.2.11. Maximum flow problem. The next task is to find the maximum possible flow for a given network on a graph G. The right-hand side of the above diagram shows a flow of size five, and the size of any flow cannot exceed this.
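For illustration, a small Python sketch (not from the original text) checking Kirchhoff's law and computing the size of a flow given as a dictionary over the directed edges:

    def flow_size(flow, z):
        # |f| = flow out of the source minus flow into the source
        return (sum(f for (u, v), f in flow.items() if u == z)
                - sum(f for (u, v), f in flow.items() if v == z))

    def kirchhoff_ok(flow, vertices, z, s):
        for x in vertices - {z, s}:       # conservation at the inner vertices
            into = sum(f for (u, v), f in flow.items() if v == x)
            out = sum(f for (u, v), f in flow.items() if u == x)
            if into != out:
                return False
        return True

    flow = {('z', 'a'): 2, ('z', 'b'): 3, ('a', 's'): 2, ('b', 's'): 3}
    print(flow_size(flow, 'z'), kirchhoff_ok(flow, {'z', 'a', 'b', 's'}, 'z', 's'))
    # 5 True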
The player who takes the last token wins. Is there a winning strategy for one of the players?

Solution. Note that this game is the sum of four games corresponding to one-pile games in which an arbitrary (positive) number of tokens can be removed (the sum of combinatorial games is both commutative and associative, so we can talk about the sum of those games without having to specify the order). A simple induction argument shows that the value of the Sprague–Grundy function (the SG-value) of such a one-pile game equals the number of tokens: Suppose that a natural number n is such that for all k < n, the SG-value of the game with k tokens is k. According to the rules of the game, we can remove an arbitrary (positive) number of tokens, i.e., we can leave an arbitrary number from 0 to n − 1. By the induction hypothesis, this means that for any number k < n, we can reach a position whose SG-value is k, and we cannot reach a position whose SG-value would be n. By the definition of the SG-function, the value of the game with n tokens is n.

It follows from the theorem of subsection 13.2.17 that the SG-value of the initial position of our game equals the xor of the SG-values of the initial positions of the particular games, namely $9 \oplus 10 \oplus 11 \oplus 14 = 6$. Since this value is non-zero, there exists a winning strategy for the first player: he always moves to a position whose SG-value is zero; such a position must exist by the definition of the SG-function. For instance, the first move would be to remove 6 tokens from the pile containing 14. (We look at the highest one in the binary expansion of the SG-value and find a pile where the corresponding bit is also one. Then, we set this bit to zero, thereby surely decreasing the number of tokens, and adjust the lower bits so that there is an even number of ones in each position, resulting in zero SG-value.) □

13.H.2. Consider the following game for two players: On the table, there is a pile of tokens. Players alternate moves, where a move consists of either splitting one pile into two (non-empty) piles or removing an arbitrary (positive) number of tokens from a pile. The player who takes the last token wins. Find the SG-value of the initial position of this game if the pile contains n tokens.

The fundamental principle is to add up the capacities of a set of edges through which every path from z to s must go. In the diagram, there are three such choices, providing the limits 12, 8, 5 (from left to right). At the same time, in such a simple case, the flow realizing the maximal possible value is easily found. This idea can be formalized as follows:

Cut in a network
A cut in a network S = (V, E, z, s, w) is a minimal set of edges C ⊆ E such that when these edges are removed, there remains no path from the source z to the sink s in the graph G = (V, E \ C). The number $|C| = \sum_{e \in C} w(e)$ is called the capacity of the cut C.

Clearly, there is no flow whose size is greater than the capacity of any cut. We present the Ford–Fulkerson algorithm⁸, which finds a cut with the minimum possible capacity as well as a flow which realizes this value. This proves the following theorem:

Theorem. In any network S = (V, E, z, s, w), the maximum size of a flow equals the minimum capacity of a cut in S.

The idea of the algorithm is quite simple. It looks for paths between the vertices of the graph, trying to “saturate” them with the flow.
For this purpose, define the following terminology: An undirected path from the vertex v to the vertex v′ in a network S = (V, E, z, s, w) is called unsaturated if and only if all edges e directed along the path from v to v′ satisfy f(e) < w(e), and the edges e in the other direction satisfy f(e) > 0 (sometimes, one tries to saturate the flow in the other direction, yielding a semipath, or the augmenting semipath). The residual capacity of an edge e is the number w(e) − f(e) if the edge is directed along the path from v to v′, and it is the number f(e) otherwise. The residual capacity of a path is defined to be the minimum residual capacity of its edges. For the sake of simplicity, assume that all the edge capacities are rational numbers.

⁸ Ford, L. R.; Fulkerson, D. R. (1956), “Maximal flow through a network”, Canadian Journal of Mathematics 8, 399–404.

Solution. We are going to prove by induction that every non-negative integer k satisfies:
g(4k + 1) = 4k + 1,
g(4k + 2) = 4k + 2,
g(4k + 3) = 4k + 4,
g(4k + 4) = 4k + 3.
Clearly, we have g(0) = 0. The following picture shows how we can deduce the value of the SG-function for one-, two-, and three-token piles. However, it is apparent that this would be much harder for a general number of tokens. Now, assume that the above holds for all positive integers below 4k + 1, and let us prove that we indeed have g(4k + 1) = 4k + 1. By definition, the SG-value is the least non-negative integer l such that there is no move to a position with SG-value l. Moreover, this property (together with the fact that the terminal positions have zero value) determines the Sprague–Grundy function uniquely. Therefore, it suffices to prove that, for each l < 4k + 1, we can move to a position with SG-value l, and that we cannot get into a position with SG-value 4k + 1. The former is clear, since by the induction hypothesis, the SG-values of the one-pile games of 0, 1, . . . , 4k tokens take all the integers 0, 1, . . . , 4k (although not in this order), so we can just remove the corresponding number of tokens from the pile. Now, we show that we cannot reach a position with SG-value 4k + 1: the only moves that could possibly lead to this SG-value are those splitting the pile into two. If we examine the resulting amounts modulo 4, there are two possibilities: either the number of tokens in one of the resulting piles is divisible by 4 (i.e. 4a) and the other one leaves remainder 1 (i.e. 4b + 1), or the numbers leave remainders 2 and 3, respectively. In the former case, the SG-values of the resulting piles are, by the induction hypothesis, 4a − 1 and 4b + 1 (the numbers of tokens in the particular piles are non-zero and less than 4k + 1, so we may use the induction hypothesis).

Ford–Fulkerson algorithm
Input: A network S = (V, E, z, s, w).
Output: A maximal possible flow f : E → R and a minimal cut C, given by the edges leading from the set U ⊆ V of the vertices to which there exists an unsaturated path, to V \ U.
(1) Initialization: Set f(e) = 0 for each edge e ∈ E and, using depth-first search from z, find the set U ⊆ V of the vertices to which there exists an unsaturated path.
(2) The main loop: While s ∈ U, do
• select an unsaturated path P from the source z to the sink s; then increase the flow f along the path P by the value of the residual capacity of P (i.e., increase f on the edges directed along P and decrease it on the remaining ones);
• update U.

Proof. As seen above, the size of any flow cannot exceed the capacity of any cut.
Therefore, it suffices to show that when the algorithm terminates, the capacity of the generated cut equals the size of the constructed flow. The algorithm terminates at the first moment when there is no unsaturated path from the source z to the sink s. This means that U does not contain s, and all edges e starting in U and ending outside of U have f(e) = w(e) (otherwise, the other endpoint of e would be added to U when updating U). For the same reason, all edges e leading from V \ U to U must have f(e) = 0. Clearly, at each beginning of the main loop, the total size of the flow satisfies
$$|f| = \sum_{\text{edges from } U \text{ to } V\setminus U} f(e) \;-\; \sum_{\text{edges from } V\setminus U \text{ to } U} f(e).$$
When the algorithm terminates, this expression equals the capacity of the final cut C, which is the desired result. It remains to show that the algorithm always terminates. Since the edges are assessed with rational numbers, it can be assumed, after rescaling, that the capacities are integers. Then every flow constructed during the run of the algorithm has an integer size, and every iteration of the main loop increases the size of the flow by a positive integer. Since any cut bounds the maximum size of any flow from above, the algorithm must terminate after a finite number of steps. □

(Diagram: two stages of the run of the algorithm on the sample network, with the current flow written as flow/capacity on each edge.)

In the latter case, i.e., if we split the pile into 4a + 2 and 4b + 3 tokens, we get that their SG-values are 4a + 2 and 4b + 4. Furthermore, a two-pile game is the sum of the two corresponding one-pile games, so the SG-value of the two-pile game is the xor (nim-sum) of their SG-values. In both cases, the resulting SG-value leaves remainder 2 upon division by 4 (consider the last two bits). In particular, it is surely not equal to 4k + 1. This proves the induction step for positive integers of the form 4k + 1. The proof for integers of the form 4k + 2 is analogous.

The situation is more interesting in the 4k + 3 case: Similarly as above, it follows from the induction hypothesis that the SG-values of the one-pile positions we can move to exhaust all the non-negative integers up to 4k + 2. However, note that if we split the pile into two piles containing 1 and 4k + 2 tokens, respectively, then their SG-values are also 1 and 4k + 2 by the induction hypothesis, and the xor of these integers is 4k + 3. It remains to prove that there is no move into a position with SG-value 4k + 4: Again, the only remaining possibility is to split the existing pile. Then, the resulting remainders modulo 4 are either 0 and 3, or 1 and 2. By the induction hypothesis, the remainders of the corresponding SG-values are respectively 3 and 0 in the former case, and 1 and 2 in the latter. In either case, the xor of these integers (and thus the SG-value of the resulting position) leaves remainder 3, so it is not equal to 4k + 4. This proves the induction step for positive integers of the form 4k + 3. The proof for integers of the form 4k + 4 is analogous. □
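The pattern just proved can also be confirmed by brute force, directly from the definition of the SG-function. A minimal sketch in Python (the function name sg and the memoization via lru_cache are our own choices):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def sg(n):
        # SG-value of one pile of n tokens in the game of 13.H.2:
        # a move either removes a positive number of tokens,
        # or splits the pile into two non-empty piles.
        reachable = {sg(k) for k in range(n)}                  # remove tokens
        reachable |= {sg(a) ^ sg(n - a) for a in range(1, n)}  # split the pile
        m = 0                                                  # mex of the set
        while m in reachable:
            m += 1
        return m

    print([sg(n) for n in range(1, 13)])
    # [1, 2, 4, 3, 5, 6, 8, 7, 9, 10, 12, 11]

The printed values match the formulas g(4k + 1) = 4k + 1, g(4k + 2) = 4k + 2, g(4k + 3) = 4k + 4, g(4k + 4) = 4k + 3.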
I. Generating functions

13.I.1. In how many ways can we buy 12 packs of coffee if we can choose from 5 kinds? Further, solve this problem with the following modifications: i) we want to buy at least 2 packs of each kind; ii) we want to buy an even number of packs of each kind; iii) there are only 3 packs of one of the kinds.

Solution. The basic problem is a classical example of a combinatorial problem on the number of 5-combinations with repetition; the answer is $\binom{12+5-1}{5-1} = \binom{16}{4}$. The modifications can also be solved by combinatorial reasoning with a bit of invention. However, we want to demonstrate how these problems can be solved (almost without any need to think) using generating functions. The wanted number corresponds to the coefficient at $x^{12}$ in the expansion of the function
$$(1 + x + x^2 + \cdots)^5 = (1 + x + \cdots)(1 + x + \cdots) \cdots (1 + x + \cdots)$$
into a power series. The number of packs of the first kind determines which term is selected from the first parenthesis, and similarly for the other kinds. (Note that we need not pay special attention to the fact that there cannot be more than 12 packs of a given kind; it turns out that infinite series are usually simpler to work with than finite polynomials.)

The run of the algorithm is illustrated in two diagrams. On the left, there are two shortest unsaturated paths from the source to the sink, drawn in gray (the upper one has two edges, while the lower one has three). On the right, another path is saturated (taking the first turn in the upper path), also drawn in gray. Now, it is apparent that there can be no other unsaturated path from the source to the sink. Therefore, the algorithm terminates at this moment.

13.2.12. Further remarks. The algorithm allows further conditions to be incorporated in the problem. For instance, capacity limits can be set for the vertices of the network as well, and there may be not only upper limits for the flows along particular edges or through vertices, but also lower ones. It is easy to add vertex capacities: just double every vertex (one copy for the incoming edges, the other for the outgoing edges), connecting each pair with an edge of the corresponding capacity. The lower limits for the flow can be included in the initialization part of our algorithm; however, one then needs to check whether such a flow exists at all. Many other variations can be found in the literature. On the other hand, the algorithm does not necessarily terminate if the edge capacities are irrational. Moreover, the flows constructed during the run may not even converge to the optimal solution in such a case. Still, it holds that if the algorithm terminates, then a maximum flow is found. If the capacities are integers (equivalently, rational numbers), the running time of the algorithm can be bounded by O(f|E|), where f is the size of a maximum flow in the network and |E| is the number of edges. The worst case occurs if every iteration increases the size of the flow by one. In the proof of correctness, no explicit way of searching the graph when looking for an unsaturated path is used. A variation of the Ford–Fulkerson algorithm is to use breadth-first search; the resulting algorithm is called Edmonds–Karp, and its running time is O(|V||E|²).⁹ We also mention Dinic's algorithm, which simplifies the search for an unsaturated path by constructing the level graph, where augmenting edges are considered only if they lead between vertices whose distances from the source differ by one. The time complexity of this algorithm is O(|V|²|E|), which is much better for dense graphs than the Edmonds–Karp algorithm.
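For illustration, here is a minimal sketch of the breadth-first (Edmonds–Karp) variant in Python. The encoding of the network as a nested dictionary of residual capacities and the function name max_flow are our own choices, and every edge is assumed to be accompanied by its reverse edge with capacity 0:

    from collections import deque

    def max_flow(capacity, z, s):
        # capacity[u][v] is the residual capacity of the edge u -> v;
        # reverse edges with capacity 0 must be present in the dictionary.
        total = 0
        while True:
            # breadth-first search for a shortest unsaturated path z -> s
            parent = {z: None}
            queue = deque([z])
            while queue and s not in parent:
                u = queue.popleft()
                for v, c in capacity[u].items():
                    if c > 0 and v not in parent:
                        parent[v] = u
                        queue.append(v)
            if s not in parent:          # no unsaturated path remains
                return total
            # collect the path and its residual capacity
            path, v = [], s
            while parent[v] is not None:
                path.append((parent[v], v))
                v = parent[v]
            aug = min(capacity[u][v] for u, v in path)
            for u, v in path:            # update the residual capacities
                capacity[u][v] -= aug
                capacity[v][u] += aug
            total += aug

For integer capacities, termination follows exactly as in the proof above; applied to the sample network of 13.2.10 (encoded as such a dictionary), the computation returns the maximum flow of size 5.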
13.2.13. Problems related to flow networks. A good application of flow networks is the problem of bipartite matching. The task is to find a maximum matching in a bipartite graph, i.e. a set of as many edges as possible such that each vertex of the graph is an endpoint of at most one of the selected edges.

⁹ Edmonds, Jack; Karp, Richard M. (1972), “Theoretical improvements in algorithmic efficiency for network flow problems”, Journal of the ACM 19 (2), 248–264, doi:10.1145/321694.321699.

Since $\frac{1}{1-x} = 1 + x + x^2 + \cdots$ (see 13.4.3), the function we are considering is $(1-x)^{-5}$. Our task is thus to expand $(1-x)^{-5}$ into a power series. By the generalized binomial theorem from 13.4.3, the coefficient at $x^k$ is the number $\binom{k+5-1}{5-1}$, which is $\binom{16}{4}$ in our case. Note that using generating functions, we have answered the question not only for 12, but for an arbitrary number of packs of coffee. The modifications can be solved analogously:

i) The generating function is $(x^2 + x^3 + \cdots)^5 = \left(\frac{x^2}{1-x}\right)^5 = \frac{x^{10}}{(1-x)^5}$, hence the coefficient at $x^{12}$ equals $\binom{2+5-1}{5-1}$.

ii) An even number of packs of each kind corresponds to the generating function $(1 + x^2 + x^4 + \cdots)^5 = \frac{1}{(1-x^2)^5}$. The coefficient at $x^{12}$ can be found by many means; the easiest seems to be the substitution $y = x^2$ and looking for the coefficient at $y^6$ (which can be perceived as joining the packs into pairs in the shop). This leads to the answer $\binom{6+5-1}{5-1}$.

iii) In this case, the generating function equals $(1 + x + x^2 + x^3)(1 + x + x^2 + \cdots)^4$, and the wanted result is thus $\binom{12+4-1}{4-1} + \binom{11+4-1}{4-1} + \binom{10+4-1}{4-1} + \binom{9+4-1}{4-1}$. □

13.I.2. In how many ways can we use coins of values 1, 2, 5, 10, 20, and 50 crowns to pay exactly 100 crowns?

Solution. We are looking for non-negative integers $a_1, a_2, a_5, a_{10}, a_{20}, a_{50}$ such that $a_i$ is a multiple of i for all $i \in \{1, 2, 5, 10, 20, 50\}$ and, at the same time, $a_1 + a_2 + a_5 + a_{10} + a_{20} + a_{50} = 100$. The wanted number of ways can be obtained as the coefficient at $x^{100}$ in the product
$$(1 + x + x^2 + \cdots)(1 + x^2 + x^4 + \cdots)(1 + x^5 + x^{10} + \cdots)(1 + x^{10} + x^{20} + \cdots)(1 + x^{20} + x^{40} + \cdots)(1 + x^{50} + x^{100} + \cdots)$$
$$= \frac{1}{1-x} \cdot \frac{1}{1-x^2} \cdot \frac{1}{1-x^5} \cdot \frac{1}{1-x^{10}} \cdot \frac{1}{1-x^{20}} \cdot \frac{1}{1-x^{50}}.$$
The result can be obtained using software, for instance SAGE (the names of the used commands are pretty self-descriptive, aren't they?):

    sage: f = 1/(1-x) * 1/(1-x^2) * 1/(1-x^5) * 1/(1-x^10) * 1/(1-x^20) * 1/(1-x^50)
    sage: r = taylor(f, x, 0, 100)
    sage: r.coeff(x, 100)
    4562

□

This is an abstract variation of a quite common problem. For instance, it may be needed to match boys and girls in dancing lessons, provided information is given about which pairs would be willing to dance together. This problem is easily reduced to the problem of maximum flow: add an artificial source to the graph and connect it with edges to all vertices of one part of the bipartite graph, while the vertices of the other part are connected to an artificial sink. The capacity of each edge is set to one, and the resulting graph is searched for a maximum flow. Then, the edges used in the flow correspond to the selected pairs. Of course, additional information on which pairs may be put together can be included by leaving some of the edges out. Another important application of flow networks is the proof of Menger's theorem (mentioned as a theorem in 13.1.10).
It can be understood as follows: Given a directed graph, set the capacity of each edge as well as of each vertex to one. Further, select an arbitrary pair of vertices v and w, which are considered to be the source and the sink, respectively. Then, the size of a maximum flow in this graph equals the maximum number of disjoint paths from v to w (the paths may share only the source and the sink). Every cut separates v and w into different connected components of the remaining graph (since they are chosen to be the source and sink). The desired statements then follow from the fact that the size of a maximum flow equals the capacity of a minimum cut.

13.2.14. Game trees. We turn our attention to a very broadly used application of tree structures: the analysis of possible strategies or procedures. Such analyses can be encountered in the theory of artificial intelligence as well as in game theory, and they play an important role in economics and many other social fields. This is about games. In the mathematical sense, game theory examines models in which one or more players take turns in playing moves according to predetermined and generally known rules. Usually, the moves are assessed with profits or losses for the given player. The task is then to find a strategy for each player, i.e. an algorithmic procedure which maximizes the profits or minimizes the losses. We use an extensive description of the games. This means that a complete and finite analysis of all possible states of the game is given, and the resulting analysis gives an exact account of the profits and losses, supposing that the other players also play the moves which are best for them. A game tree is a rooted tree whose vertices are the possible states of the game, labeled according to whose turn it is. The outgoing edges of a vertex correspond to the possible moves of the player from that state. This complete description of a game using the game tree may be used for common games like chess, noughts and crosses (also known as tic-tac-toe), etc.

13.I.3. Expand the following functions into power series: i) $\frac{x}{x+2}$, ii) $\frac{x^2+x+1}{2x^3-3x^2+1}$.

Solution. i)
$$\frac{x}{x+2} = \frac{x/2}{1 - (-x/2)} = \frac{x}{2} - \frac{x^2}{4} + \frac{x^3}{8} - \cdots = \sum_{n=1}^{\infty} (-1)^{n-1} \frac{x^n}{2^n}.$$

ii) We perform a partial fraction decomposition:
$$\frac{x^2+x+1}{2x^3-3x^2+1} = \frac{x^2+x+1}{(x-1)^2(2x+1)} = \frac{A}{2x+1} + \frac{B}{x-1} + \frac{C}{(x-1)^2},$$
finding that $A = B = \frac{1}{3}$ and $C = 1$; hence
$$\frac{x^2+x+1}{2x^3-3x^2+1} = \frac{1/3}{1+2x} - \frac{1/3}{1-x} + \frac{1}{(1-x)^2} = \sum_{n=0}^{\infty} \left[\tfrac{1}{3}\bigl((-2)^n - 1\bigr) + (n+1)\right] x^n.$$
□

13.I.4. Find the generating functions of the following sequences: i) (1, 2, 3, 4, 5, . . . ), ii) (1, 4, 9, 16, . . . ), iii) (1, 1, 2, 2, 4, 4, 8, 8, . . . ), iv) (9, 0, 0, 2 · 16, 0, 0, 4 · 25, 0, 0, 8 · 36, . . . ), v) (9, 1, −9, 32, 1, −32, 100, 1, −100, . . . ). ⃝

13.I.5. In how many ways can we buy n pieces of the following five kinds of fruit if we do not distinguish between particular pieces of a given kind, we need not buy all kinds, and:
• there is no restriction on the number of apples we buy,
• we want to buy an even number of bananas,
• the number of pears we buy must be a multiple of 4,
• we can buy at most 3 oranges, and
• we can buy at most 1 pomelo.

As a simple example, consider a simple variation of the game known as Nim.¹⁰ There are k tokens on the table (the tokens may be sticks or matches), where k > 1 is an integer, and players take turns at removing one or two tokens.
The player who manages to take the last token(s) wins. There is a variation of the game in which the player who is forced to take the last token loses. The tree of this game, including all necessary information about the game, can be constructed as follows:
• The state with ℓ tokens on the table and the first player to move corresponds to the subtree rooted at Fℓ. The state with the same number of tokens but the second player to move is represented by the subtree rooted at Sℓ.
• The vertex Fℓ has Sℓ−1 as its left-hand son and Sℓ−2 as its right-hand son. Similarly, the sons of the vertex Sℓ are Fℓ−1 and Fℓ−2.
• The leaves are always F0 or S0. (In the variation where the player to take the last token loses, these would be the states F1 and S1.)
Every run of the game starting at the root Fk corresponds to exactly one leaf of the resulting tree. Therefore, the total number p(k) of possible runs for Fk satisfies p(k) = p(k − 1) + p(k − 2) for k ≥ 3, and clearly p(1) = 1 and p(2) = 2. This difference equation was already considered. It is satisfied by the Fibonacci numbers, which can be computed by an explicit formula (see the subsection on generating functions at the end of this chapter, or the corresponding part about difference equations in chapter three, cf. 3.B.1). Hence a formula is known for the number of possible runs of the game. The number of possible states equals the number of all vertices of the tree. The game always ends in a win for one of the players; we can also consider games where a tie is possible.

13.2.15. Game analysis. The tree structure allows an analysis of the game so that an algorithmic strategy for each player can be built. This is done with a simple recursive procedure for assessing the root of a subtree. Each vertex is given a label: W for vertices where the first player can force a win, L for those where the first player loses if the other one plays optimally, and, optionally, T for vertices where optimal play of both players results in a tie. The procedure is as follows:
(1) The leaves are labeled directly according to the rules of the game (in the case of our Nim, the leaves S0 are labeled by W, and the leaves F0 by L).
(2) Considering the vertex Fℓ: label it W if it has a son labeled by W. If there is no such son, but there is a son labeled by T, then Fℓ is given the label T. Otherwise, i.e. if all sons are labeled by L, then Fℓ also gets L.

¹⁰ The game was given this name by Charles Bouton in his analysis of this type of games from 1901. It refers to the German word “Nimm!”, meaning “Take!”.

Solution. The generating function for the sequence (an), where an is the wanted number of ways to buy n pieces of fruit, is
$$(1 + x + x^2 + \cdots)(1 + x^2 + x^4 + \cdots)(1 + x^4 + x^8 + \cdots)(1 + x + x^2 + x^3)(1 + x)$$
$$= \frac{1}{1-x} \cdot \frac{1}{1-x^2} \cdot \frac{1}{1-x^4} \cdot \frac{1-x^4}{1-x} \cdot (1+x) = \frac{1}{(1-x)^3}.$$
By the generalized binomial theorem, we have $(1-x)^{-3} = \sum_{n=0}^{\infty}\binom{n+2}{2}x^n$. Therefore, the wanted number of ways satisfies $a_n = \binom{n+2}{2}$. □

13.I.6. Using the generalized binomial theorem, prove again the following combinatorial identities:
• $\sum_{k=0}^{n} \binom{n}{k} = 2^n$,
• $\sum_{k=0}^{n} (-1)^k \binom{n}{k} = 0$,
• $\sum_{k=0}^{n} k \binom{n}{k} = n2^{n-1}$.

Solution. Substituting the numbers x = 1 and x = −1 into the binomial theorem
$$(1+x)^n = \binom{n}{0} + \binom{n}{1}x + \binom{n}{2}x^2 + \cdots + \binom{n}{n}x^n,$$
we obtain the first and the second identities, respectively. The third one can be obtained by viewing both sides of the binomial theorem “continuously” (as functions of a real variable) and using the properties of derivatives, as spelled out below. □
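For completeness, the differentiation step for the third identity is the following routine computation:
$$\frac{d}{dx}(1+x)^n = n(1+x)^{n-1} = \sum_{k=1}^{n} k\binom{n}{k}x^{k-1},$$
and substituting x = 1 yields $\sum_{k=0}^{n} k\binom{n}{k} = n2^{n-1}$ (the k = 0 term vanishes).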
13.I.7. In a box, there are 30 red, 40 blue, and 50 white balls. Balls of one color are indistinguishable. In how many ways can we select 70 balls?

Solution. The wanted number equals the coefficient at $x^{70}$ in the product
$$(1 + x + \cdots + x^{30})(1 + x + \cdots + x^{40})(1 + x + \cdots + x^{50}).$$
This product can be rearranged to $(1-x)^{-3}(1-x^{31})(1-x^{41})(1-x^{51})$, whence, using the generalized binomial theorem, we obtain
$$\left(\binom{2}{2} + \binom{3}{2}x + \binom{4}{2}x^2 + \cdots\right)(1 - x^{31} - x^{41} - x^{51} + x^{72} + \cdots).$$

(3) Similarly, a vertex Sℓ is labeled L if it has a son labeled by L. Otherwise, if it has a son labeled by T, it receives T. Otherwise (i.e. if it has only W-sons), it is labeled by W.
Calling this procedure on the root of the tree gives the labeling of each vertex as well as an optimal strategy for each player:
• The first player tries to move to a vertex labeled by W; if this cannot be achieved, he moves to a T-vertex at least.
• Similarly, the second player tries to move to a vertex labeled by L; if this cannot be achieved, he moves to a T-vertex at least.
The depth of the recursion is given by the depth of the tree. For instance, for a Nim game with k tokens, the depth is k. This analysis is not very useful yet: in order to use it in the mentioned form, the entire game tree must be at our disposal. This can be a great amount of data (for instance, in the case of noughts and crosses on a 3 × 3 playground, the corresponding tree has tens of thousands of vertices). Usually, the analysis with a game tree is used when only a small part of the whole tree is examined, applying appropriate heuristic methods, and the corresponding part is created dynamically during the game. This is a fascinating field of the modern theory of artificial intelligence; the details are omitted here. There is a more compact representation of the tree structure for our purposes of complete formal analysis. If the game tree for Nim is drawn, then one state of the game is represented by many vertices which correspond to different histories of the game. However, the strategies depend only on the actual state (i.e. the number of tokens and the player to move) rather than on the history of the game. Therefore, the same game can be described by a graph where, for each number of tokens, there is only one vertex, and the whole strategy is determined by identifying who is winning (whether it is the player on move or the other one). Directed edges are used for the description of possible moves, and the result is always an acyclic graph.

(Diagram: on the left, the complete game tree for three tokens with the vertices Fℓ, Sℓ labeled by W and L; on the right, the directed acyclic graph for the game with seven tokens, with vertices labeled 0P, 1N, 2N, 3P, 4N, 5N, 6P, 7N.)

The example of the game Nim is displayed in the diagram. On the left, there is a complete game tree corresponding to three tokens. The directed graph on the right represents the game with seven tokens. A complete tree for this game would already have 21 leaves, and the number of leaves grows exponentially with the number of tokens.

Hence, the coefficient at $x^{70}$ is clearly
$$\binom{70+2}{2} - \binom{70+2-31}{2} - \binom{70+2-41}{2} - \binom{70+2-51}{2} = 1061.$$
□

13.I.8. Prove that $\sum_{k=1}^{n} H_k = (n+1)(H_{n+1} - 1)$.

Solution. The necessary convolution can be obtained as the product of the series $\frac{1}{1-x}$ and $\frac{1}{1-x}\ln\frac{1}{1-x}$. Hence
$$[x^n]\,\frac{1}{(1-x)^2}\ln\frac{1}{1-x} = \sum_{k=1}^{n} \frac{1}{k}(n+1-k),$$
whence the wanted identity follows easily. □

13.I.9. Solve the recurrence $a_0 = a_1 = 1$, $a_n = a_{n-1} + 2a_{n-2} + (-1)^n$.

Solution.
As always, it may be a good idea to write out a few terms of the sequence (this will not help us much in this case; still, it can serve as a verification of the result).²

Step 1: $a_n = a_{n-1} + 2a_{n-2} + (-1)^n[n \ge 0] + [n = 1]$.
Step 2: $A(x) = xA(x) + 2x^2A(x) + \frac{1}{1+x} + x$.
Step 3: $A(x) = \frac{1 + x + x^2}{(1-2x)(1+x)^2}$.
Step 4: $a_n = \frac{7}{9}2^n + \left(\frac{1}{3}n + \frac{2}{9}\right)(-1)^n$. □

13.I.10. Quicksort – analysis of the average case. Our task is to determine the expected number of comparisons made by Quicksort, a well-known algorithm for sorting a (finite) sequence of elements. An example of a simple divide-and-conquer implementation:

    def qsort(L):
        if L == []:
            return []
        # divide by the first element (the pivot), then conquer recursively
        return (qsort([x for x in L[1:] if x < L[0]])
                + [L[0]]
                + qsort([x for x in L[1:] if x >= L[0]]))

It is not too difficult to construct a formula for the number of comparisons (we assume that the particular orders of the sequence to be sorted are distributed uniformly). The following parameters are important for the analysis of the algorithm:
i) The number of comparisons in the divide phase: n − 1.
ii) The uniformness assumption: the probability that L[0] is the k-th greatest element of the sequence is $\frac{1}{n}$.
iii) The sizes of the sorted subsequences in the conquer phase: k − 1 and n − k.

² Despite the statement in Concrete Mathematics, this sequence can already be found in The On-Line Encyclopedia of Integer Sequences.

The individual vertices in the directed acyclic graph on the right-hand side of the diagram indicate the number of tokens left and the information whether the game at that state is won by the player who is to move (letter N as “next”) or by the other one (letter P as “previous”). Altogether, for a game with k tokens, this graph always has only k + 1 vertices. At the same time, the complete strategy is encoded in the graph: the players always try to move from the current state into a vertex labeled by P, if such one exists. In fact, every directed acyclic graph can be seen as a description of a game. The initial situations are represented by those vertices which have no incoming edges (there can be one or more of them), and the game ends in leaves, i.e. vertices with no outgoing edges (again, there can be one or more of them). The strategy for each player can be obtained by a simple recursive procedure as above (for the sake of simplicity, it is assumed that there is no tie):
• The leaves are labeled by P (the player who is to move from a leaf loses).
• A non-leaf vertex of the graph is labeled by N if there is an edge leading to a P-vertex. Otherwise, it is labeled P.
In the case of our variation of Nim, the situation is very simple. It follows from the strategy described that the player who is to move loses if and only if the number of tokens is divisible by three; a short computation confirming this is sketched below. The games that can be represented by a directed acyclic graph are called impartial. These are exactly the games which satisfy:
• in every state, both players choose from the same set of moves;
• the number of possible states is finite;
• the game has “zero sum”, i.e. the better the outcome for one of the players, the worse for the other one.
An example of an impartial game is tic-tac-toe. Although the players use different symbols in this game, they can place them in any of the unoccupied squares. On the other hand, chess is not an impartial game in this sense, since the set of possible moves in every situation depends on the number of pieces the players have at their disposal.
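The promised computation: labeling the positions of the take-one-or-two Nim by N and P takes only a few lines (a sketch in Python; the function name is ours):

    def np_labels(k):
        # label[n] is 'P' if the player to move from n tokens loses,
        # and 'N' if the player to move wins (with optimal play)
        label = ['P']                # 0 tokens: the player to move has lost
        for n in range(1, k + 1):
            moves = [n - 1] + ([n - 2] if n >= 2 else [])
            # a position is N iff some move leads to a P-position
            label.append('N' if any(label[m] == 'P' for m in moves) else 'P')
        return label

    print(np_labels(7))   # ['P', 'N', 'N', 'P', 'N', 'N', 'P', 'N']

The P-positions are exactly the multiples of three, in agreement with the labels 0P, 1N, . . . , 7N in the diagram above.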
13.2.16. Sum of combinatorial games. The rules of the real classical game Nim are somewhat more complicated: There are three piles of tokens. A move consists of selecting one of the piles and removing an arbitrary (positive) number of tokens from that pile. The player who manages to take the last token wins. There is a variation of the game in which the player who is forced to take the last token loses. If this game is played with one pile, the situation is easy: the first player takes all the tokens and wins immediately. However, it is not that easy with three piles. Whether the analysis of the one-pile game is of any use for this more complicated game is a good question.

We thus get the following recurrent formula for the expected number of comparisons:
$$C_n = n - 1 + \sum_{k=1}^{n} \frac{1}{n}\left(C_{k-1} + C_{n-k}\right).$$
It is possible to solve this recurrence (using certain tricks which can be learned to some extent) even without generating functions:
$$C_n = n - 1 + \frac{2}{n}\sum_{k=1}^{n} C_{k-1} \quad\text{(symmetry of both sums)}$$
$$nC_n = n(n-1) + 2\sum_{k=1}^{n} C_{k-1} \quad\text{(multiply by } n\text{)}$$
$$(n-1)C_{n-1} = (n-1)(n-2) + 2\sum_{k=1}^{n-1} C_{k-1} \quad\text{(the same expression for } C_{n-1}\text{)}$$
$$nC_n = (n+1)C_{n-1} + 2(n-1) \quad\text{(subtract and rearrange)}$$
We have thus obtained the much simpler recurrence $nC_n = (n+1)C_{n-1} + 2(n-1)$. On the other hand, this equation contains non-constant coefficients. Note also that the recurrence has been simplified to the extent that the values $C_n$ can be computed iteratively. Nevertheless, it is advantageous to express these values explicitly as a function of n (or at least to approximate them). First, we use a slight trick, dividing both sides by n(n + 1):
$$\frac{C_n}{n+1} = \frac{C_{n-1}}{n} + \frac{2(n-1)}{n(n+1)}.$$
Now, we “expand” this expression (telescoping; we can also use the substitution $B_n = C_n/(n+1)$):
$$\frac{C_n}{n+1} = \frac{2(n-1)}{n(n+1)} + \frac{2(n-2)}{(n-1)n} + \cdots + \frac{2 \cdot 1}{2 \cdot 3} + \frac{C_1}{2}.$$
Hence
$$\frac{C_n}{n+1} = 2\sum_{k=1}^{n-1} \frac{k}{(k+1)(k+2)}.$$
This can be summed using partial fraction decomposition, for instance: $\frac{k}{(k+1)(k+2)} = \frac{2}{k+2} - \frac{1}{k+1}$. This leads to
$$\frac{C_n}{n+1} = 2\left(H_{n+1} - 2 + \frac{1}{n+1}\right),$$
whence $C_n = 2(n+1)H_{n+1} - 4(n+1) + 2$ ($H_n = \sum_{k=1}^{n}\frac{1}{k}$ is the sum of the first n terms of the harmonic progression). At the same time, we have the bound $H_n \sim \int_1^n \frac{dx}{x} + \gamma = \ln n + \gamma$, whence $C_n \sim 2(n+1)(\ln(n+1) + \gamma - 2) + 2$.

13.I.11. Using the generating function $F(x) = x/(1 - x - x^2)$ for the Fibonacci sequence, find the generating function for the “semi-Fibonacci” sequence $(F_0, F_2, F_4, \dots)$. ⃝

For this purpose, introduce a new concept, the sum of impartial games: A situation in the game composed of two simpler games is a pair of possible situations in the particular games. A move then consists of selecting one of the two games and performing a move in that game (the other game is left unchanged). Therefore, the sum of impartial games is an operation which assigns to a pair of directed acyclic graphs a new one. Considering graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, their sum $G_1 + G_2$ is the graph $G = (V, E)$, where $V = V_1 \times V_2$ and
$$E = \{(v_1v_2, w_1v_2);\ (v_1, w_1) \in E_1\} \cup \{(v_1v_2, v_1w_2);\ (v_2, w_2) \in E_2\}.$$
In the case of one game, the vertices can be labeled step by step by the letters N and P in an upwards manner, according to whether one can get to a P-vertex along some of the edges. However, in the sum of games, the movement along the edges combines in a much more complicated way. Therefore, finer tools are needed for expressing the reachability of vertices labeled by P from other vertices.
This needs some preparation, which might seem like strange magic (but the proof of the theorem below shows that all of it is quite natural). Define the Sprague–Grundy function $g : V \to \mathbb{N}$ on a directed acyclic graph $G = (V, E)$ recursively as follows:¹¹
(1) for a leaf v, set g(v) = 0;
(2) for a non-leaf vertex $v \in V$, define $g(v) = \operatorname{mex}\{g(w);\ (v, w) \text{ is an edge}\}$, where the minimum excluded value function mex is defined on subsets S of the natural numbers $\mathbb{N} = \{0, 1, \dots\}$ by $\operatorname{mex} S = \min(\mathbb{N} \setminus S)$.
The value g(v) is thus the result of the mex operation applied to the set S of the values g(w) over those vertices w which can be reached from v along an edge. Note that this definition is correct, since the formula clearly defines a unique function assigning a natural number to every vertex of the acyclic graph in question. Yet another operation on the natural numbers is needed: the binary XOR operation $(a, b) \mapsto a \oplus b$, performing the exclusive-or operation bit-wise on the binary expansions of a and b. This operation can also be viewed as follows: consider the binary expansions of a and b to be vectors in the vector space $(\mathbb{Z}_2)^k$ over $\mathbb{Z}_2$ (for a sufficiently large k), and add them there. The resulting vector is the binary expansion of $a \oplus b$. Now the main result can be formulated.

¹¹ We are presenting the theory which was developed in combinatorial game theory independently by R. P. Sprague in 1935 and P. M. Grundy in 1939.

13.I.12. The fan of order n is a graph on n + 1 vertices, labeled 0, 1, . . . , n, with the following edges: vertex 0 is connected to all other vertices, and for each k satisfying 1 ≤ k < n, vertex k is connected to vertex k + 1. How many spanning trees does this graph have?

Solution. Denoting by $v_n$ the wanted number of spanning trees, we clearly have $v_1 = 1$, and since the fan of order 2 is the triangle graph $K_3$, we have $v_2 = 3$. Further, we are going to show that for n > 1, the following recurrence³ holds:
$$v_n = v_{n-1} + \sum_{k=0}^{n-1} v_k + 1, \qquad v_0 = 0.$$
For a fixed spanning tree of the fan of order n, let k be the greatest integer of the set {1, . . . , n − 1} such that the spanning tree contains all edges of the path (0, 1, 2, 3, . . . , k). This spanning tree cannot contain the edges {0, 2}, . . . , {0, k}, {k, k + 1}; therefore, for a fixed k, there are the same number of spanning trees as in the fan of order n − k with vertices 0, k + 1, k + 2, . . . , n, i.e. $v_{n-k}$. Further, we must count one spanning tree for k = n and those spanning trees which do not contain the edge {0, 1} (thus they must contain the edge {1, 2}): they are obtained from fans of order n − 1 on the vertices 0, 2, . . . , n. We have thus obtained the wanted recurrence $v_n = v_{n-1} + v_{n-1} + v_{n-2} + \cdots + v_0 + 1$. Now, we have the general formula
$$v_n = v_{n-1} + \sum_{k=0}^{n-1} v_k + 1 - [n = 0],$$
whence the usual procedure for finding the generating function V(x) of this sequence yields
$$V(x) = xV(x) + \sum_{n=0}^{\infty}\Bigl(\sum_{k<n} v_k\Bigr)x^n + \frac{x}{1-x} = xV(x) + \Bigl(\sum_{k=0}^{\infty} v_kx^k\Bigr)\cdot\Bigl(\sum_{n>k} x^{n-k}\Bigr) + \frac{x}{1-x} = xV(x) + V(x)\cdot\frac{x}{1-x} + \frac{x}{1-x}.$$
The solution of the equation $V(x) = xV(x) + \frac{x}{1-x}V(x) + \frac{x}{1-x}$ is
$$V(x) = \frac{x}{1 - 3x + x^2},$$
whence the standard method (partial fraction decomposition) or the previous problem leads to the result $v_n = F_{2n}$. □

³ Using this recurrent formula to calculate more values $v_n$, we find $v_3 = 8$, $v_4 = 21$, which suggests a hypothesis about a connection with the Fibonacci sequence in the form $v_n = F_{2n}$. This can be proved easily by induction.
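Both the closed form $V(x) = x/(1 - 3x + x^2)$ and the equality $v_n = F_{2n}$ are easily double-checked with a computer algebra system, in the same spirit as the Sage computation in 13.I.2 (a sketch using the sympy library; the names are ours):

    from sympy import symbols, series, fibonacci

    x = symbols('x')
    V = x / (1 - 3*x + x**2)
    print(series(V, x, 0, 6))
    # x + 3*x**2 + 8*x**3 + 21*x**4 + 55*x**5 + O(x**6)
    print([fibonacci(2*n) for n in range(1, 6)])
    # [1, 3, 8, 21, 55]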
Sprague–Grundy theorem
13.2.17. Theorem. Consider a directed acyclic graph G = (V, E). Its vertices v are labeled by P if and only if g(v) = 0, where g is the Sprague–Grundy function. For any two directed acyclic graphs $G_1 = (V_1, E_1)$, $G_2 = (V_2, E_2)$ and their Sprague–Grundy functions $g_1$, $g_2$, the Sprague–Grundy function g of their sum is given by $g(v_1v_2) = g_1(v_1) \oplus g_2(v_2)$.

Proof. The first proposition of the theorem follows directly by induction from the definition of the Sprague–Grundy function g. The proof of the other part is more complicated. Let $v_1v_2$ be a position of the game $G_1 + G_2 = (V, E)$, and consider any $a \in \mathbb{N}_0$ such that $a < g_1(v_1) \oplus g_2(v_2)$. We claim that there exists a state $x_1x_2$ of the game $G_1 + G_2$ such that $g_1(x_1) \oplus g_2(x_2) = a$ and $(v_1v_2, x_1x_2) \in E$ and, at the same time, that there is no edge $(v_1v_2, y_1y_2) \in E$ such that $g_1(y_1) \oplus g_2(y_2) = g_1(v_1) \oplus g_2(v_2)$. This justifies the recursive definition of the Sprague–Grundy function and proves the rest of the theorem.

To show this, first find a vertex $x_1x_2$ with a given value $a < g_1(v_1) \oplus g_2(v_2)$ of the Sprague–Grundy function. Consider the integer $b := a \oplus g_1(v_1) \oplus g_2(v_2)$. Refer to the bit of value $2^i$ as the i-th bit of an integer. Clearly, $b \ne 0$. Let k be the position of the highest one in the binary expansion of b, i.e. $2^k \le b < 2^{k+1}$. This means that the k-th bit of exactly one of the integers a, $g_1(v_1) \oplus g_2(v_2)$ is one and that these integers do not differ in higher bits. It follows from the assumption $a < g_1(v_1) \oplus g_2(v_2)$ that it is the integer $g_1(v_1) \oplus g_2(v_2)$ whose k-th bit is one. Therefore, the k-th bit of exactly one of the integers $g_1(v_1)$, $g_2(v_2)$ is one. Assume without loss of generality that it is the integer $g_1(v_1)$. Further, consider the integer $c := g_1(v_1) \oplus b$. Recall that the highest one of b is at position k, so the integers c and $g_1(v_1)$ do not differ in higher bits, and the k-th bit of c is zero. Therefore, $c < g_1(v_1)$. Then, by the definition of the value $g_1(v_1)$, there must exist a state $w_1$ of the game $G_1$ such that $(v_1, w_1) \in E_1$ and $g_1(w_1) = c$. Now, $(v_1v_2, w_1v_2) \in E$ and
$$g_1(w_1) \oplus g_2(v_2) = c \oplus g_2(v_2) = g_1(v_1) \oplus b \oplus g_2(v_2) = g_1(v_1) \oplus a \oplus g_1(v_1) \oplus g_2(v_2) \oplus g_2(v_2) = a.$$
This fulfills the first part of our plan. Further, consider any edge $(v_1v_2, y_1y_2) \in E$ in G, where $(v_1, y_1) \in E_1$, and hence $v_2 = y_2$. Suppose that $g_1(y_1) \oplus g_2(y_2) = g_1(v_1) \oplus g_2(v_2)$. Then, $g_1(y_1) \oplus g_2(v_2) = g_1(v_1) \oplus g_2(v_2)$. Clearly, the terms $g_2(v_2)$ can be canceled (it is an operation in a vector space), leading to $g_1(y_1) = g_1(v_1)$. This contradicts the properties of the Sprague–Grundy function $g_1$ of the game $G_1$. This proves the second part of the theorem. □

Recursively connected sequences. Sometimes, we are able to express the wanted numbers of ways or events only in terms of several mutually connected sequences.

13.I.13. In how many ways can we cover a 3 × n rectangle with 1 × 2 domino pieces? Evaluate this number for n = 20.

Solution. We can easily find out that $c_1 = 0$, $c_2 = 3$, $c_3 = 0$, and it is reasonable to set $c_0 = 1$ (this is not merely a convention; there is indeed a unique empty covering). We are looking for a recursive formula. Discussing the behavior “on the edge”, we find that
$$c_n = 2r_{n-1} + c_{n-2}, \qquad r_n = c_{n-1} + r_{n-2}, \qquad r_0 = 0,\ r_1 = 1,$$
where $r_n$ is the number of coverings of the 3 × n rectangle without one of its corner tiles. The values of $c_n$ and $r_n$ for the first few non-negative integers n are:

    n:    0  1  2  3   4   5   6   7
    c_n:  1  0  3  0  11   0  41   0
    r_n:  0  1  0  4   0  15   0  56

• Step 1: $c_n = 2r_{n-1} + c_{n-2} + [n = 0]$, $r_n = c_{n-1} + r_{n-2}$.
• Step 2: $C(x) = 2xR(x) + x^2C(x) + 1$, $R(x) = xC(x) + x^2R(x)$.
• Step 3: $C(x) = \frac{1 - x^2}{1 - 4x^2 + x^4}$, $R(x) = \frac{x}{1 - 4x^2 + x^4}$.
• Step 4: We can see that both are functions of $x^2$. We can thus save much work by considering the function $D(z) = 1/(1 - 4z + z^2)$. Then, we have $C(x) = (1 - x^2)D(x^2)$, i.e., $[x^{2n}]C(x) = [x^{2n}](1 - x^2)D(x^2) = [x^n](1 - x)D(x)$, so $c_{2n} = d_n - d_{n-1}$. The roots of $1 - 4x + x^2$ are $2 + \sqrt{3}$ and $2 - \sqrt{3}$, whence the standard procedure yields
$$c_{2n} = \frac{(2+\sqrt{3})^n}{3-\sqrt{3}} + \frac{(2-\sqrt{3})^n}{3+\sqrt{3}}.$$
Just as with the Fibonacci sequence, the second term is negligible for large values of n, and it is always between 0 and 1. Therefore,
$$c_{2n} = \left\lceil \frac{(2+\sqrt{3})^n}{3-\sqrt{3}} \right\rceil.$$
For instance, $c_{20} = 413403$. □

13.I.14. Using generating functions, find the expected number of ones in a random bit string.

Solution. Let B be the set of bit strings and, for b ∈ B, let |b| denote the length of b and j(b) the number of ones in it. The generating function counting the strings is
$$B(x) = \sum_{b \in B} x^{|b|} = \sum_{n \ge 0} 2^n x^n = \frac{1}{1 - 2x}.$$

The following useful result is a direct corollary of this theorem:

Corollary. A vertex $v_1v_2$ in the sum of games is labeled by P if and only if $g_1(v_1) = g_2(v_2)$.

For example, if three piles of tokens are combined in the simplified Nim game (where it is only allowed to take one or two tokens), the first player always wins if all three piles have the same number of tokens, not divisible by three. The individual functions $g_i(k)$ for the individual piles equal the remainder of k after division by 3. It follows that, when summing the first two pile games, the value g(v) = 0 is obtained for the initial position. Summing again with the third pile game gives $g(v) \ne 0$. In the original game, the individual piles are described by g(k) = k (any number of tokens can be taken, hence the function g grows in this way). The losing positions are those where the binary sum (xor) of the numbers of tokens is zero. For example, if two of the initial piles are of equal size, then a simple winning strategy is to remove the third pile completely and then always restore the equality of the remaining two piles after the opponent's move.

Remark. Further details are omitted in this text. It can be proved that every finite directed acyclic graph is isomorphic to a finite sum of suitably generalized games of Nim. In particular, the analysis of this simple game and the construction of the function g basically (at least implicitly) gives a complete analysis of all impartial games.

3. Remarks on Computational Geometry

A large amount of practical problems consists in constructing or analyzing finite geometrical objects in Euclidean spaces, mainly in 2D or 3D. This is a very busy area of both applications and research. At the same time, most of the algorithms and their complexity analysis are based on graph-theoretical and further combinatorial concepts. We provide several glimpses into this beautiful topic.¹² We discuss convex hulls, triangulations, and Voronoi diagrams, and focus on a few basic approaches only.

13.3.1. Convex hulls. We start with a simple and practical problem. In the plane $\mathbb{R}^2$, suppose n points $X = \{v_1, \dots, v_n\}$ are given, and the task is to find their convex hull CH(X). As learned in Chapter 4, CH(X) is given by a convex polygon, and it is desired to find it effectively. First we have to decide how CH(X) should be encoded as a data structure. Choose the connected list of edges.
There is the cyclic list of the vertices $v_i$ of the polygon, sorted in the counter-clockwise order, together with pointers towards the oriented segments between the consecutive vertices (the edges).

¹² The beautiful book Computational Geometry, Algorithms and Applications by de Berg, M., Cheong, O., van Kreveld, M., Overmars, M., published by Springer (1997), can be warmly recommended, http://www.springer.com/us/book/9783540779735

The generating function for the number of ones is $C(x) = \sum_{b \in B} j(b)x^{|b|}$. A string b can be obtained from a string b′ shorter by one bit by appending either a zero or a one; the two extensions of b′ contain j(b′) and j(b′) + 1 ones, respectively. Therefore,
$$C(x) = \sum_{b' \in B} (1 + 2j(b'))x^{|b'|+1} = \sum_{b' \in B} x^{|b'|+1} + 2\sum_{b' \in B} j(b')x^{|b'|+1} = xB(x) + 2xC(x).$$
Hence $C(x) = \frac{x}{(1-2x)^2} = x(1-2x)^{-2}$, and the n-th coefficient is $c_n = (-2)^{n-1}\binom{-2}{n-1} = n2^{n-1}$. This number gives the total number of ones over all strings of length n, and there are $b_n = 2^n$ such strings. Therefore, the expected number of ones in such a string is $\frac{c_n}{b_n} = \frac{n}{2}$, which is, of course, what we anticipated. □

13.I.15. Find the generating function and an explicit formula for the n-th term of the sequence $\{a_n\}$ defined by the recurrent formula $a_0 = 1$, $a_1 = 2$, $a_n = 4a_{n-1} - 3a_{n-2} + 1$ for $n \ge 2$.

Solution. The universal formula which holds for all $n \in \mathbb{Z}$ is $a_n = 4a_{n-1} - 3a_{n-2} + 1 - 3[n = 1]$. Multiplying by $x^n$ and summing over all n, we get the following equation for the generating function $A(x) = \sum_{n=0}^{\infty} a_nx^n$:
$$A(x) = 4xA(x) - 3x^2A(x) + \frac{1}{1-x} - 3x.$$
Hence, we can express
$$A(x) = \frac{3x^2 - 3x + 1}{(1-x)^2(1-3x)} = \frac{3}{4}\cdot\frac{1}{1-x} - \frac{1}{2}\cdot\frac{1}{(1-x)^2} + \frac{3}{4}\cdot\frac{1}{1-3x}.$$
Therefore, the coefficient at $x^n$ is
$$a_n = \frac{3}{4} - \frac{1}{2}(n+1) + \frac{3}{4}3^n = \frac{1 - 2n + 3^{n+1}}{4}.$$
□

13.I.16. Solve the following recurrence using generating functions: $a_0 = 1$, $a_1 = 2$, $a_n = 5a_{n-1} - 4a_{n-2}$ for $n \ge 2$.

Solution. The universal formula is of the form $a_n = 5a_{n-1} - 4a_{n-2} - 3[n = 1] + [n = 0]$. Multiplying both sides by $x^n$ and summing over all n, we obtain $A(x) = 5xA(x) - 4x^2A(x) - 3x + 1$. Hence
$$A(x) = \frac{1 - 3x}{(1-4x)(1-x)} = \frac{2}{3}\cdot\frac{1}{1-x} + \frac{1}{3}\cdot\frac{1}{1-4x}$$

Moreover, there is the list of edges, each pointing to its tail and head vertices. There is a simple way to get CH(X). Namely, create the oriented edges $e_{ij}$ for all pairs $(v_i, v_j)$ of the points in X, and decide whether $e_{ij}$ belongs to CH(X) by testing whether all the other points of X are to the left of $e_{ij}$ (in the obvious sense). It is known already from chapter one that this is tested in constant time by means of a determinant. Clearly, $e_{ij}$ belongs to CH(X) if and only if all the latter tests are positive. In the end, the order in which to sort the edges and vertices in the output is found. This does not look like a good algorithm, since $O(n^2)$ edges need to be tested against O(n) points; hence, cubic time complexity is expected. But there is a strong and simple improvement available. Consider the lexicographic order on the points $v_i$ with respect to their coordinates. Then build the convex hull consecutively, running the tests only for the edges having the last added vertex as their tail.

Gift Wrapping Convex Hull Algorithm
Input: A set of points X = {v1, . . . , vn} in the plane.
Output: The requested edge list for CH(X).
(1) Initialization. Take the smallest vertex $v_0$ in the lexicographic order with respect to the coordinates, and set $v_{\mathrm{active}} = v_0$.
(2) Main cycle.
• Test the edges with tail $v_{\mathrm{active}}$ until an edge e belonging to CH(X) is found.
• Add e to CH(X) and set $v_{\mathrm{active}}$ to be its head.
• If $v_{\mathrm{active}} \ne v_0$, repeat the cycle.

Obviously, the leftmost and lowest vertex $v_0$ in X is in CH(X). Since CH(X) is a cycle (as a directed graph), the algorithm works correctly. It is necessary to be careful about possible collinear edges in CH(X) and about the lack of robustness of the test for those nearly collinear. This simple improvement reduces the worst-case running time of the algorithm to $O(n^2)$. The worst case can occur if all the points $v_i$ lie on one circle and, unluckily, the right next point is always found as the very last one in the partial tests. But the actual running time is much better, at most O(ns), where s is the size of CH(X).

whence $a_n = \frac{2}{3} + \frac{1}{3}4^n = \frac{4^n + 2}{3}$. □

13.I.17. A cash dispenser can provide us with banknotes of values 200, 500, and 1,000 crowns. In how many ways can we pick 7,000 crowns? Use generating functions to find the solution.

Solution. The problem can be reformulated as looking for the number of integer solutions of the equation $2a + 5b + 10c = 70$, $a, b, c \ge 0$. This number equals the coefficient at $x^{70}$ in the function
$$G(x) = (1 + x^2 + x^4 + \cdots)(1 + x^5 + x^{10} + \cdots)(1 + x^{10} + x^{20} + \cdots) = \frac{1}{1-x^2} \cdot \frac{1}{1-x^5} \cdot \frac{1}{1-x^{10}},$$
and since $\frac{1-x^{10}}{1-x^5} = 1 + x^5$ and $\frac{1-x^{10}}{1-x^2} = 1 + x^2 + x^4 + x^6 + x^8$, we can transform it into the form
$$G(x) = \frac{(1 + x^2 + x^4 + x^6 + x^8)(1 + x^5)}{(1 - x^{10})^3}.$$
By the generalized binomial theorem, we have $(1 - x^{10})^{-3} = \sum_{k=0}^{\infty} (-1)^k \binom{-3}{k} x^{10k}$. Therefore, G(x) equals
$$(1 + x^2 + x^4 + x^5 + x^6 + x^7 + x^8 + x^9 + x^{11} + x^{13}) \sum_{k=0}^{\infty} (-1)^k \binom{-3}{k} x^{10k}.$$
The term $x^{70}$ can be obtained only as $7 \cdot 10 + 0$, i.e., the coefficient at $x^{70}$ equals $[x^{70}]G(x) = -\binom{-3}{7} = \binom{3+7-1}{7} = \frac{9 \cdot 8}{2} = 36$. □

13.I.18. Find the probability of getting exactly k heads after tossing a coin n times.

Solution. Represent the outcome H as x and T as y. All possible outcomes of n tosses are represented by the expansion of $f(x) = (x + y)^n$. The coefficient at $x^ky^{n-k}$ is the number of outcomes with exactly k heads. The required probability is, therefore, $\binom{n}{k}/2^n$. □

13.I.19. Find the probability of getting exactly k heads after tossing a coin n times if the probabilities of a head and a tail equal p and q, respectively, p + q = 1.

Solution. Represent the outcome H as px and T as q. All possible outcomes of n tosses are represented by the expansion of $f(x) = (px + q)^n$. The coefficient at $x^k$ is the probability of getting exactly k heads; it equals $\binom{n}{k}p^kq^{n-k}$. □

For example, in situations where the distribution of the points in the plane is random with the normal distribution (see chapter 10 for what this means), it is known that the expected size would be logarithmic. At the same time, finding CH(X) for X distributed on a circle is equivalent to sorting the points along the circle. So the worst-case running time cannot be better than O(n log n) for any such algorithm (cf. 13.1.17).

13.3.2. The sweep line paradigm. We illustrate several main approaches to computational geometry algorithms on the same convex hull problem. The latter algorithm is close to the idea of having a special object L running through all the objects in the input X consecutively, taking care of constructing the relevant partial output of the algorithm on the run.
This is organized via the event structure, describing all the events needing consideration, and the sweep line structure, carrying all the information needed to deal with the individual events. The procedure is similar to the search over a graph discussed earlier. As a benefit, this may reduce the dimension of the problem (e.g. from 2D to 1D), at the cost of implementing dynamical structures. To start, initialize the queue of the events and begin dealing with them. At each step, there are the still sleeping events (those further in the queue, not yet treated), the current active events (those under consideration), and the already processed ones. With CH(X), this idea can be implemented as follows. Initialize the lexicographically ordered points $v_i$ in X. Notice that the first and the last ones necessarily appear in CH(X). This way, CH(X) splits into two disjoint chains of edges between them, which we call the upper and the lower convex hulls. Hence the entire construction can be split into building the upper convex hull and the lower convex hull. As in the diagram, moving through the events one by one, it is only necessary to check whether the edge joining the last but one active vertex with the current one makes a left or a right turn (as usual, right means the clockwise orientation). If right, then add the edge to the list; if left, omit the recent edges one by one until a right turn is obtained.

13.I.20. There are coins C1, . . . , Cn. For each k, the coin Ck is biased so that, when tossed, the probability that it falls heads is $\frac{1}{2k+1}$. If the n coins are tossed, what is the probability of getting an odd number of heads?

Solution. Let $p_k = \frac{1}{2k+1}$ and $q_k = 1 - p_k$. The generating function can be written as
$$f(x) = \prod_{i=1}^{n} (q_i + p_ix) = \sum_{m=0}^{n} a_mx^m,$$
with $a_m$ the probability of getting exactly m heads. Observe that $f(1) = \sum a_m = \sum a_{2k} + \sum a_{2k+1}$ and $f(-1) = \sum a_m(-1)^m = \sum a_{2k} - \sum a_{2k+1}$. Hence,
$$\sum a_{2k+1} = \frac{1}{2}\bigl(f(1) - f(-1)\bigr) = \frac{1}{2}\left(1 - \frac{1}{2n+1}\right) = \frac{n}{2n+1},$$
as f(1) = 1 and $f(-1) = \frac{1}{3} \cdot \frac{3}{5} \cdot \frac{5}{7} \cdots \frac{2n-1}{2n+1} = \frac{1}{2n+1}$. □

13.I.21. Let n be a positive integer. Find the number of polynomials P(x) with coefficients from the set {0, 1, 2, 3} such that P(2) = n.

Solution. Let $P(x) = a_0 + a_1x + a_2x^2 + \cdots + a_kx^k + \cdots$ be such a polynomial. Then its coefficients satisfy the equation $a_0 + 2a_1 + 4a_2 + \cdots + 2^ka_k + \cdots = n$ with integer coefficients $0 \le a_k \le 3$. The number of such solutions is given by the coefficient at $x^n$ in
$$f(x) = (1 + x + x^2 + x^3)(1 + x^2 + x^4 + x^6)(1 + x^4 + x^8 + x^{12}) \cdots = \prod_{k=0}^{\infty} \left(1 + x^{2^k} + x^{2 \cdot 2^k} + x^{3 \cdot 2^k}\right)$$
$$= \prod_{k=0}^{\infty} \frac{1 - x^{2^{k+2}}}{1 - x^{2^k}} = \frac{1}{(1-x)(1-x^2)} = \frac{1}{4}\cdot\frac{1}{1-x} + \frac{1}{2}\cdot\frac{1}{(1-x)^2} + \frac{1}{4}\cdot\frac{1}{1+x} = \sum_{n=0}^{\infty} \left(\frac{1}{4} + \frac{(-1)^n}{4} + \frac{n+1}{2}\right)x^n.$$
Hence, the required number is $\lfloor n/2 \rfloor + 1$. □

13.I.22. Find the number of subsets of {1, . . . , 2017} such that the sum of their elements is divisible by 5.

Solution. For a subset $S = \{s_1, s_2, \dots, s_k\}$, define $\sigma(S) = x^{s_1}x^{s_2}\cdots x^{s_k}$. Then
$$f(x) := \sum_{S \subseteq \{1,\dots,2017\}} \sigma(S) = (1+x)(1+x^2)\cdots(1+x^{2017}).$$
The coefficient at $x^n$ in f(x) is the number of subsets of {1, . . . , 2017} with the sum of elements being exactly n. Let $f(x) = \sum a_nx^n$. We are looking for the sum $A_5 = a_0 + a_5 + a_{10} + \cdots$. Let $\omega_j = \exp\left(\frac{2j\pi i}{5}\right)$, j = 0, . . . , 4, be the 5-th roots of unity, with $\omega_0 = 1$. Then
$$A_5 = \frac{1}{5}\bigl(f(\omega_0) + f(\omega_1) + f(\omega_2) + f(\omega_3) + f(\omega_4)\bigr).$$

Sweep Line Upper Convex Hull
Input: A set of points X = {v1, . . . , vn} in the plane.
Output: The directed path UCH(X).
(1) Initialization.
• Set the event structure to be the lexicographically ordered list of points $v_{\mathrm{first}}, \dots, v_{\mathrm{last}}$. There is no special sweep line structure, only the indicator distinguishing the stage of each event.
• Set the active event to be $v_{\mathrm{active}} = v_{\mathrm{first}}$ and initiate UCH(X) as the trivial path with the single vertex $v_{\mathrm{first}}$ (this is the current last vertex of the path in construction).
(2) Main cycle.
• Set the active event to the next point v in the queue, and consider the potential edge e having $v_{\mathrm{active}}$ as its tail and the last vertex in UCH(X) as its head.
• Check whether UCH(X) is to the left of e (it suffices to check this against the last edge in the current UCH(X)). If so, add e and $v_{\mathrm{active}}$ to UCH(X). If not, remove the edges in UCH(X) one by one, until the test turns positive.
• Repeat the cycle until the next event is $v_{\mathrm{last}}$.

It is easy to check that the algorithm is correct. Exactly n events need to be considered, and at each of them, up to O(n) vertices can be removed from the current UCH(X). This suggests $O(n^2)$ time but, in fact, no vertex is ever added to UCH(X) again after its removal. It follows that the asymptotic estimate for the main cycle is O(n) in total, and it is the ordering in the initialization which dominates with its O(n log n) time. Clearly, linear O(n) memory is sufficient, and so the optimal solution is achieved again for the convex hull problem.

13.3.3. The divide and conquer paradigm. Another very standard idea is to divide the entire problem into pieces, apply the same procedure recursively to them, and merge the partial results together. These are the two phases of the divide and conquer approach. This paradigm is common in many areas, cf. 13.I.10. With convex hulls, adopt the gift wrapping approach for the conquer phase. The idea is to split the task recursively, producing disjoint “left CH(X)” and “right CH(X)”, and to merge them by finding the upper and lower “tangent edges” of those two parts.

Obviously, $f(\omega_0) = f(1) = 2^{2017}$. For $1 \le j \le 4$,
$$f(\omega_j) = \left((1+\omega_j^0)(1+\omega_j)(1+\omega_j^2)(1+\omega_j^3)(1+\omega_j^4)\right)^{403}(1+\omega_j)(1+\omega_j^2) = 2^{403}(1 + \omega_j + \omega_j^2 + \omega_j^3).$$
Therefore,
$$\sum_{j=1}^{4} f(\omega_j) = 2^{403}(4 - 1 - 1 - 1) = 2^{403}.$$
Therefore, the answer is $\frac{1}{5}(2^{2017} + 2^{403})$, which is indeed an integer, as 2 is the last digit of $2^{2017}$ and 8 is the last digit of $2^{403}$. □

13.I.23. Using generating functions, solve the recursive equation $a_{k+1} = 2a_k + 4^k$, $a_1 = 3$.

Solution. From $a_1 = 2a_0 + 4^0$, it follows that $a_0 = 1$. Multiplying both sides of the recursion by $x^k$ and summing over k, we obtain
$$\sum_{k=0}^{\infty} a_{k+1}x^k = 2\sum_{k=0}^{\infty} a_kx^k + \sum_{k=0}^{\infty} 4^kx^k.$$
Therefore, for the generating function $f(x) = \sum_{k=0}^{\infty} a_kx^k$, we have
$$\frac{1}{x}\sum_{k=0}^{\infty} a_{k+1}x^{k+1} = 2f(x) + \sum_{k=0}^{\infty} (4x)^k, \quad\text{that is,}\quad \frac{1}{x}\bigl(f(x) - 1\bigr) = 2f(x) + \frac{1}{1-4x}.$$
Therefore,
$$f(x) = \frac{1}{1-2x} + \frac{x}{(1-2x)(1-4x)} = \frac{1/2}{1-2x} + \frac{1/2}{1-4x}.$$
Expanding f(x), we obtain $f(x) = \frac{1}{2}\sum_{k=0}^{\infty}(2x)^k + \frac{1}{2}\sum_{k=0}^{\infty}(4x)^k$, and so $a_k = \frac{2^k + 4^k}{2}$. □

13.I.24. In how many ways can n balls be distributed among 4 boxes if the first box must contain at least two balls?

Solution. The generating function for the first box is $b_1(x) = x^2 + x^3 + x^4 + \cdots = \frac{x^2}{1-x}$. The generating function for each of the other boxes is $b(x) = 1 + x + x^2 + x^3 + \cdots = \frac{1}{1-x}$. For all 4 boxes together, this gives $f(x) = b_1(x)(b(x))^3 = \frac{x^2}{(1-x)^4}$. The coefficient at $x^n$ in f(x), which is the wanted number of ways to distribute the n balls, equals $\binom{n-2+4-1}{4-1} = \binom{n+1}{3}$. □
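Answers of this kind are quickly verified by expanding the generating function numerically; for instance, for 13.I.24 (a sketch using the sympy library, with our own names):

    from sympy import symbols, series, binomial

    x = symbols('x')
    f = x**2 / (1 - x)**4
    g = series(f, x, 0, 8).removeO()
    print([g.coeff(x, n) for n in range(2, 8)])       # [1, 4, 10, 20, 35, 56]
    print([binomial(n + 1, 3) for n in range(2, 8)])  # [1, 4, 10, 20, 35, 56]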
Output: The directed path CH(X).
(1) Divide. If n ≤ 3, return CH(X). Otherwise, split X = X1 ∪ X2 into two subsets of (roughly) the same size, respecting the order (i.e. all vertices in X1 smaller than those in X2).
(2) Merge.
• Start with the edge joining the largest point of CH(X1) and the smallest point of CH(X2), and iteratively balance it into the lower tangent segment el of CH(X1) and CH(X2).
• Proceed similarly to get the upper tangent eu.
• Merge the relevant parts of CH(X1) and CH(X2) with the help of el and eu.

Perhaps the merge step requires some more explanation; the situation is illustrated in the diagram. For the upper tangent, first fix the right-hand vertex of the initial edge joining the two convex polygons, and find the tangent to the left polygon from this vertex. Then fix the head of the moving edge and find the right-hand touch point of the potential tangent. After a finite number of exchanges like this, the edge stabilizes; this is the upper tangent edge eu. Observe that during the balancing we move only clockwise on the right-hand polygon and counter-clockwise on the other one. Notice also that it is the smart divide strategy which prevents any of the points of the input X1 from appearing inside CH(X2), and vice versa.

Again, the analysis of the algorithm is easy. The typical merge time is asymptotically linear, and the recursion has O(log n) levels, so the total estimated time is again O(n log n). Notice there is no initialization in the procedure itself; just assume that the points of X are already ordered. Hence another O(n log n) time must be added to prepare for the very first call. The memory necessary for running the algorithm is estimated by O(n), if the recursion is implemented properly.

13.3.4. The incremental paradigm. This approach consists in taking the input objects one by one and consecutively building the required resulting structure. This is particularly useful if the application does not allow all the points to be available at the beginning. Imagine incrementally building the convex hull of shots into a target, as they occur. Another good use is in randomized algorithms, where all the data is known at the beginning, but treated in a random order. Typically, the expected running time is then very good, while there might be much less effective, but improbable, worst case runs.

13.I.25. How many sequences over $\{0, 1, 2, 3\}$ of length $n$ have at least 2 zeroes?

Solution. The exponential generating function for the 0-entry is $b_0(x) = \frac{x^2}{2!} + \frac{x^3}{3!} + \dots = e^x - 1 - x$. The exponential generating function for each of the other entries is $e^x$, as there are no restrictions. Hence, the exponential generating function in question is
$$f(x) = (e^x - 1 - x)\,e^{3x} = e^{4x} - e^{3x} - x\,e^{3x}.$$
The answer to the question is $n!$ times the coefficient at $x^n$ in $f(x)$, which is
$$n!\Big(\frac{4^n}{n!} - \frac{3^n}{n!} - \frac{3^{n-1}}{(n-1)!}\Big) = 4^n - 3^n - n\,3^{n-1}. \;\square$$

The former case is easy to illustrate on the convex hull problem. In each step, employ the merge step of a very degenerate version of the divide and conquer algorithm, merging the hull CH(X1) of the previously known points X1 with X2 = CH(X2) = {vk}, where vk is just the new point. An extra step is needed to check whether vk is inside CH(X1) or not. If it is, then skip the new point and wait for the next one. If not, then merge.
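A sketch of this skip-or-merge step follows. The inside test against a counter-clockwise hull is again just the orientation predicate; rebuilding the hull from scratch is the lazy fallback (the text's merge via the two tangents would update in O(k) instead). All the function names are ours, not part of the text's pseudocode.

```python
def cross(o, a, b):
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def convex_hull(points):
    """Full hull (counter-clockwise) via two monotone chains."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_convex(hull, p):
    """p is inside (or on) a ccw convex polygon iff it is left of every edge."""
    n = len(hull)
    return all(cross(hull[i], hull[(i+1) % n], p) >= 0 for i in range(n))

def add_point(hull, p):
    """One incremental step: skip p if inside, otherwise merge (here: rebuild)."""
    return hull if inside_convex(hull, p) else convex_hull(hull + [p])

hull = convex_hull([(0, 0), (2, 0), (1, 1)])
hull = add_point(hull, (1, 3))      # outside: the hull grows
hull = add_point(hull, (1, 1))      # inside: skipped
print(hull)                         # [(0, 0), (2, 0), (1, 3)]
```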
The worst case time of this algorithm is O(n²), but as with the gift wrapping method, it depends on the actual size of the output as well as on the quality of the algorithm checking whether vk is inside CH(X1).

We illustrate the second case with a more elaborate convex hull algorithm. The main idea is to keep track of the position of all points of X with respect to the convex hull CH(Xk) of the first k of them (in the fixed order chosen at the beginning). With this goal in mind, keep the dynamical structure of a bipartite graph G whose first group of vertices consists of the points not yet processed, while the other group contains all the faces of the current convex polygon S = CH(Xk) (call them faces, so as not to confuse them with the edges of the graph G). Remember the faces of S are oriented. Such a face e is in conflict with a point v if the face is "visible" from v, i.e. v lies in the right-hand half-plane determined by e. In the bipartite graph, join each point to each of its faces in conflict; call G the graph of conflicts. The algorithm can now be formulated:

Randomized Incremental Convex Hull Algorithm
Input: A set X = {v1, . . . , vn}, n > 3, of at least four points in the plane.
Output: The edge list R of the convex hull CH(X).
(1) Initialization. Fix a random order on X. Choose the first three points as X0, create the list of conflicts for the edge list R = CH(X0) (i.e. state which of the three faces are seen from which points), and remove the three points from X.
(2) Main cycle. Repeat until the list X is empty:
• choose the first point v ∈ X;
• if there are some conflicts of v in G, then
– remove all the faces in conflict with v from both R and G,
– find the two new faces (the upper and lower tangents from the new point v to the existing CH(X) – they are easily found by taking care of the "vertices with missing edges"),
– add the two new faces to both R and G and find all their conflicts;
• remove the point v from the list X and from the graph G.

The complete analysis of this algorithm is omitted. Notice that finding the newly emerging conflicts is easy, since it is only necessary to check the points which were in conflict, before the update, with the faces incident to the two vertices to which the two new faces are attached. It can be proved that the expected time of this algorithm is O(n log n), while the worst case time is O(n²). The complete framework for the analysis of randomized algorithms is nicely explained in the book mentioned at the very beginning of this part, see page 1188.

13.3.5. Convex hull in 3D. Many applications need convex hulls of finite sets of points in higher dimensions, in particular in 3D. There are several ways of adjusting the 2D algorithms to 3D versions. First, it needs to be stated what the right structure for CH(X) is. As seen in 13.1.21, the convex polyhedra in R³ can be perfectly described by planar graphs. In order to modify the algorithms to 3D, a good encoding for them is needed. We want to find all vertices, edges, or faces which are incident or neighbouring in time proportional to the output. This is nicely achieved by the double connected edge lists.

Double connected edge list – DCEL
Let G = (V, E, S) be a planar graph.
The double connected edge list is the list E such that each edge is represented by two oriented twin-edges e, ẽ, equipped with the pointers
• Vt, Vh to the tail and head of e,
• F to the incident face (the left one with respect to the directed edge e), and the pointer to the twin ẽ,
• P to the edge following along the face F (in the counter-clockwise direction).
At the same time, keep the list of vertices and the list of faces, always with just one pointer towards one of the incident edges.

Clearly, we can remove or add faces, edges, or vertices, and find all kinds of neighbours, in time proportional to the output (notice the use of the twins and think about the details!).

Next, look at the 2D incremental algorithm above and try to imagine what needs changing there to make it work in 3D. First, we have to deal with the 2D faces S of the DCEL of the convex hull, and instead of their boundary vertices, deal with their boundary edges. Again, all the faces in conflict with the point just processed have to be removed (see the picture). This leaves a directed cycle of edges with the pointers F missing (call them "unsaturated edges"). Finally, instead of adding two new faces as in 2D, add the tangent cone of faces joining the point v to the unsaturated edges. Of course, the graph of conflicts must also be updated.

Randomized Incremental 3D Convex Hull Algorithm
Input: A set X = {v1, . . . , vn}, n > 4, of at least five points in the space R³.
Output: The DCEL R for the convex hull CH(X).
(1) Initialization. Fix a random order on X. Choose the first four points as X0, create the list of conflicts G and the DCEL R for CH(X0) (i.e. state which of the four faces are seen from which points), and remove the four points from X.
(2) Main cycle. Until the list X is empty, repeat:
• take the first point v ∈ X;
• if there are some conflicts of v in G, then
– remove all the faces of R in conflict with v (from both R and G), and take care of the (oriented) edges e in R left without their incident faces,
– build the "tangent cone" from the new point v to the current R by connecting v to the latter "unsaturated" edges,
– add the new faces to both R and G and find all their conflicts (again, note that the check for new conflicts can be restricted to the points which were in conflict with the faces incident to the edges where the cone has been attached to the previous R);
• remove the point v from the list X and from the graph G.

A detailed analysis is omitted. As in the 2D case, the expected running time of this algorithm is O(n log n). Thanks to the very well adapted DCEL data structure for the convex hull, it is a very good algorithm.
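A minimal Python rendering of the DCEL records may be helpful. It stores the head implicitly as `twin.tail`, one common simplification of the Vt/Vh pair of pointers; the class and field names are ours, not fixed by the text.

```python
from dataclasses import dataclass

@dataclass
class Vertex:
    point: tuple               # coordinates
    edge: "HalfEdge" = None    # one outgoing half-edge

@dataclass
class Face:
    edge: "HalfEdge" = None    # one half-edge on its boundary

@dataclass
class HalfEdge:
    tail: Vertex               # pointer Vt; the head Vh is twin.tail
    twin: "HalfEdge" = None    # the oppositely oriented copy of the same edge
    nxt:  "HalfEdge" = None    # pointer P: next edge along the same face (ccw)
    face: Face = None          # pointer F: incident face on the left

def face_edges(face):
    """Walk the boundary cycle of a face in time proportional to its length."""
    e = start = face.edge
    while True:
        yield e
        e = e.nxt
        if e is start:
            break
```

Removing a face amounts to clearing the `face` pointers of its boundary cycle (producing exactly the "unsaturated" edges of the algorithm above), and the twins give the neighbouring faces in constant time.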
The divide and conquer algorithm from the 2D case can be easily adapted, too. Skipping details, the initial lexicographic ordering of the input points allows us to recursively call the same procedure, producing the DCELs of two disjoint convex polytopes. This allows us to apply a more sophisticated "gift wrapping" approach when merging the results. A sort of "tubular collar" wrapping the two polytopes is to be built to create their convex hull. Imagine rotating planes, similarly to the lines in 2D, in order to get the first edge of the tubular set of faces to be added. Then the first plane containing one of the missing new faces is obtained. Continue breaking the plane along the new edges, until both directed cycles, along which the collar is attached to the two polytopes, are closed. All of this is done by bending the planes by the smallest possible angle in each step and checking that the right position is reached. Of course, the DCEL structure is essential to update all the data properly in time proportional to the size of the changes. With a reasonably smart implementation, this algorithm achieves the optimal O(n log n) running time.

Both of the latter algorithms can be generalized to all higher dimensions, too.

13.3.6. Voronoi diagrams. The next stop is at one of the most popular and useful planar divisions (and searching within them). For a finite set of points X = {v1, . . . , vn} in the plane R², write VR(vi) for the set of points in R² having vi ∈ X as their unique closest point in X. Define:

Voronoi diagram
For a given set of points X = {v1, . . . , vn} (not all collinear), the Voronoi regions are
$$VR(v_i) = \{x \in \mathbb{R}^2;\ \|x - v_i\| < \|x - v_k\| \text{ for all } v_k \in X,\ k \ne i\}.$$
This is an intersection of n − 1 open half-planes bounded by lines, so it is an open convex set, and its boundary is a convex polygon. The Voronoi diagram VD(X) is the planar graph whose faces are the open regions VR(vi), while the boundaries of the VR(vi) yield the edges and vertices.

Care is needed about collinearity, since if all the points vi lie on the same line in R², then their Voronoi regions are strips in the plane bounded by parallel lines. Under all other circumstances, the planar graph VD(X) from the latter definition is well defined and connected. By definition, the vertices p of VD(X) are the points in the plane such that at least three points v, w, u ∈ X are at the same distance from p, with no points of X inside the circle through v, w, u. If there are no further points of X on the latter circle, then the degree of the vertex p is 3. The most degenerate situation occurs if all the points of X lie on one circle. Then, obviously, the Voronoi regions are each delimited by two half-lines, all emanating from the center of the circle and cutting the angles properly. The construction of VD(X) is then equivalent to ordering the points by angles, so at least O(n log n) time is needed for the worst case of any algorithm building the Voronoi diagrams.

Some of the Voronoi regions are unbounded, others bounded. If just two points v and w are considered, then the axis of the segment vw is the boundary between the regions VR(v) and VR(w). In particular, the region VR(v) must be bounded for each v in the interior of the convex hull of X. On the contrary, consider an edge of CH(X) with incident vertices v and w, and the "outer" half-axis of the segment vw. If one considers any interior point u of CH(X), then it lies in the other half-plane with respect to the segment vw. Sooner or later, the points on the latter half-axis are closer to v and w than to u. It follows that both VR(v) and VR(w) are unbounded. Summarizing:

Lemma. Each Voronoi region VR(v) of the Voronoi diagram VD(X) is an open convex polygonal region. It is unbounded if and only if v lies on the boundary of the convex hull of X.

13.3.7. An incremental algorithm. Each Voronoi diagram represents a planar division, and when adding a new point p, it is quite obvious how to update the diagram. First assume we know which region VR(v) the point p hits.
Then choose the center v, and split the region VR(v) by the relevant part of the axis of the segment pv. Add this new edge e into the updated VD(X), simultaneously creating the two new faces and removing the old VR(v). The new edge e hits the boundary of the current VR(v) in either two points or one point (if the new edge is unbounded). These hits show which region of the updated diagram is to be split next. "Walk" further, with each new hit at the boundary revealing the next region's center playing the role of v above. Ultimately, this walk consecutively splits the visited old regions and creates the new directed cycle of edges bounding the new region, or an unbounded path of boundary edges, if the new region is unbounded. See the diagram for an illustration. If the new point p is on the boundary, i.e. hits one of the edges or vertices of VD(X), then the same algorithm works; just start with one of the incident regions.

So far this looks easy, but how does one find the relevant region hit by the new point? An efficient structure to search for it during the run of the algorithm is desired. Build an acyclic directed graph G for that purpose. The vertices of G are all the temporary Voronoi regions as they were created in the individual incremental steps. Whenever a region is split by the above procedure, new leaves of G are created. Draw edges towards these leaves from all the earlier regions which have some nontrivial overlap. Of course, care must be taken as to how the old regions overlap with the new ones, but this is not difficult. We illustrate the procedure in the diagram, updating from one point to four points.

Incremental Voronoi Diagram
Input: The set of points X = {v1, . . . , vn} in the plane, not all collinear.
Output: The DCEL of VD(X) and the search graph G.
(1) Initialization. Consider the first two points X0 = {v1, v2} and create the DCEL for VD(X0) with two regions. Create the acyclic directed graph G (just the root and two leaves).
(2) Main cycle. Repeat until there are no new points z ∈ X:
• localize the region VR(v) hit by z (by searching in G);
• perform the walk finding the boundary of the new region VR(z) in VD(X);
• update the DCEL for VD(X) and the acyclic directed search graph G.

This algorithm is easy to implement, and it directly produces a search structure for finding the Voronoi regions of given points. Unfortunately, it is very far from optimal in both respects – the worst case running time is O(n²), and the worst case depth of the search graph is O(n). If this is treated as a randomized incremental algorithm, the expected values are better, but not optimal either. Below is a useful modification via triangulations.

13.3.8. Delaunay triangulation. One remarkable feature of the Voronoi diagrams should not remain unnoticed. Right after the definition of the Voronoi diagram, an important fact was mentioned: the vertices of the planar graph VD(X) are centers of circles containing at least three points of X, with no other points of X inside the circle. The dual graph of VD(X) (see 13.1.22 for the definition) can be realized by taking the vertices V = X, with the edges indicating the neighbouring faces. Then this is again a tessellation of the plane. Actually, the convex hull of X is tessellated into convex regions, and one unbounded region remains. It is called the Delaunay tessellation DT(X).
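The "no points of X inside the circle" condition, which drives everything that follows, is decided by the standard in-circle determinant predicate. A sketch follows (assuming integer coordinates and counter-clockwise order of a, b, c; the function name is ours):

```python
def in_circle(a, b, c, d):
    """> 0 iff d lies strictly inside the circle through a, b, c,
    assuming a, b, c are in counter-clockwise order
    (the classical 3x3 determinant test, exact for integer input)."""
    m = [[a[0]-d[0], a[1]-d[1], (a[0]-d[0])**2 + (a[1]-d[1])**2],
         [b[0]-d[0], b[1]-d[1], (b[0]-d[0])**2 + (b[1]-d[1])**2],
         [c[0]-d[0], c[1]-d[1], (c[0]-d[0])**2 + (c[1]-d[1])**2]]
    return (m[0][0]*(m[1][1]*m[2][2] - m[1][2]*m[2][1])
          - m[0][1]*(m[1][0]*m[2][2] - m[1][2]*m[2][0])
          + m[0][2]*(m[1][0]*m[2][1] - m[1][1]*m[2][0]))

# (0,0), (2,0), (0,2) lie on the circle centred at (1,1); (1,1) is inside:
print(in_circle((0, 0), (2, 0), (0, 2), (1, 1)) > 0)   # True
```

The same predicate decides the edge flips in the triangulation algorithms discussed next.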
In the generic case (i.e. when no four points of X lie on one circle), the degrees of all vertices of VD(X) are 3. Then DT(X) is the Delaunay triangulation of the convex hull of X.13 Notice that it is easy to turn any Delaunay tessellation into a triangulation by adding the edges necessary to triangulate the convex regions with more edges. Any of these refined tessellations is called a Delaunay triangulation associated to VD(X).

In general, a planar graph T is called a triangulation of its vertices X ⊂ R², |X| = n, if all its bounded faces have just 3 vertices. It is easy to see that each triangulation T has τ = 2n − 2 − k triangles and ν = 3n − 3 − k edges, where k is the number of vertices on the boundary of the unbounded face. Indeed, by the Euler formula (13.1.20), n − ν + τ + 1 = 2 (there is the unbounded face on top of all the triangles). Now, every triangle has 3 edges, there are k edges around the unbounded face, and each edge is incident to exactly two faces. It follows that 3τ + k = 2ν. It remains to solve the two linear equations for τ and ν.

Triangulations are extremely useful in numerical mathematics and in computer graphics as the typical background mesh for processing approximate values of functions. Of course, there are many triangulations on a given set, and one of the qualitative requests is to aim at triangles as close to equilateral triangles as possible. This can be phrased as the goal to maximize the minimal angles inside the triangles. A practical way to do this is to write down the angle vector of the triangulation, A(T) = (γ1, γ2, . . . , γ3τ), where γj are the angles of all the triangles in T sorted by their value, γj ≤ γk for all j < k. A triangulation T on X is said to be angle optimal if A(T) ≥ A(T′) in the lexicographic ordering, for all triangulations T′ on the same set of vertices X. In particular, an angle optimal triangulation achieves the maximum over the minimal angles of the triangles.

Surprisingly, there is a very simple (though not very effective) procedure producing (one of) the angle optimal triangulations. Consider any two adjacent triangles and check the six angle sequences of their interior angles. If the current position of the diagonal edge provides the worse sequence, flip it. See the diagram.

13 Although the name sounds French, Boris Nikolaevich Delone (1890–1980) was a Russian mathematician using the French transcription of his name in his publications. His name is associated with the triangulation because of his important work on it from 1934.

The flip is necessary if and only if one of the vertices off the diagonal is inside the circle drawn through the remaining three vertices. Since each such flip of an edge inside a triangulation T definitively increases the angle vector, the following algorithm must stop, and it achieves an angle optimal triangulation:

Edge Flipping Delaunay
Input: Any triangulation T̃ of points X in the plane.
Output: An angle optimal triangulation T of the same set X.
(1) Main cycle. Repeat until there are no edges to flip:
• Find an edge which should be flipped and flip it.

Theorem. A triangulation T on a set of points X in the plane R² is angle optimal if and only if it is a Delaunay triangulation associated to the Voronoi diagram VD(X).

Proof. Consider any Delaunay triangulation T associated to VD(X) and one of the vertices p of VD(X).
If the four vertices of two neighbouring triangles are not all on the same circle, then by the very definition of VD(X), the two circles in question do not contain the remaining points, and thus there is no need for any flips. Further, let v1, . . . , vk be all the points of X lying on the circle determining p. Fix an edge with two neighbouring endpoints on the circle. All triangles with the third vertex on the circle above the edge share the same angle at that vertex (by the inscribed angle theorem). A simple check now verifies that different ways of triangulating the same region of VD(X) with more than 3 boundary edges always lead to the same angle vector. In particular, no flips at all are necessary in the above algorithm if one starts with a Delaunay triangulation T. Hence the angle optimal triangulation is arrived at.

In order to prove the other implication, recall the comments on the diagram above. All triangles in an angle optimal triangulation T have the following two properties: (1) the circle drawn through their three vertices does not include any other point in its interior; (2) the circle having any of their edges as diameter does not have any other point in its interior. Consider the dual graph G of T and its realization in the plane obtained by drawing the vertices as the centers of the circles circumscribed to the individual triangles, while the edges are the segments joining them. If there are no more than 3 points on any of those circles, then G = VD(X) is obtained. In the degenerate situations, all the triangles sharing a circle produce the same vertex in the plane, and some of the relevant edges degenerate completely. Identifying those collapsing elements in G, the right VD(X) is obtained. □

13.3.9. Incremental Delaunay. We return to the general idea for the Voronoi diagram, namely to design an algorithm which constructs both VD(X) and DT(X) and which behaves very well in its randomized implementation. The idea is straightforward: use the incremental approach as with the Voronoi diagrams to refine the consecutive Delaunay triangulations, employing the edge flipping method. Looking at the diagram, the Voronoi algorithm is easily modified. Care must be taken of three different cases for the new point – hitting the unbounded face, hitting one of the internal triangles, or hitting one of the edges.

Incremental Delaunay Triangulation
Input: The set of points X = {v1, . . . , vn} in the plane, not all collinear.
Output: The DCEL of DT(X) and the search graph G for this triangulation.
(1) Initialization. Consider the first three points X0 = {v1, v2, v3}. Create the DCEL for DT(X0) with two regions, and create CH(X0) (the connected edge list). Create the acyclic directed graph G (just the root and two leaves).
(2) Main cycle. Repeat until there are no new points z ∈ X:
• Localize the face ∆ in DT(Xk) hit by z (by searching in G).
• If z is in the unbounded face, then
– add the new triangles ∆1, . . . , ∆ℓ to DT(Xk) by joining z to the visible edges of CH(Xk),
– update the CH(X).
• If z hits a (bounded) triangle ∆, then split it into the three new triangles ∆1, ∆2, ∆3.
• If z hits an edge e, then split the adjacent bounded triangles into ∆1, . . . , ∆4 (only two, if an edge of CH(Xk) is hit).
• Create a queue Q of the not yet checked modified triangles and repeat as long as Q is not empty:
– take the first ∆ from the queue Q, look for its neighbour not in Q and not yet checked, and flip the edge if necessary;
– if an edge is flipped, put the newly modified triangles into Q.

A detailed analysis of the algorithm is omitted. It is almost obvious that the algorithm is correct. It is only necessary to prove that the proposed version of the edge flipping update ensures that, after each addition of a new point z in the k-th step, the correct Delaunay triangulation of Xk arises. Once an edge is flipped, it need not be considered again later. Finally, if the Voronoi diagram is needed instead, it can be obtained from DT(X) in linear time. Obviously, the search structures can be used directly. Surprisingly enough, it turns out that the expected number of total flips over the whole run is of size O(n log n). Hence the algorithm achieves the perfect expected time O(n log n). A detailed analysis of this beautiful example of results in computational geometry can be found in section 9.4 of the book by Berg et al., cf. the link on page 1188.

13.3.10. The beach line Voronoi. The Voronoi algorithm provides a perfect example for the sweep line paradigm, where the extra structure keeping track of the events has to be quite smart.14

14 This is mostly called the Fortune Voronoi algorithm – not because the construction is so lucky, but because the algorithm was published by Steven Fortune of Bell Laboratories in 1986.

Imagine a horizontal line L (parallel to the x axis) flowing in the top-down direction and meeting the points in X = {v1, . . . , vn}. Of course, the part of VD(X) involving all points above the current position of L cannot simply be drawn, since it depends also on the points below the line. It is better to look at the part RL of the plane,
$$R_L = \{p \in \mathbb{R}^2;\ \mathrm{dist}(p, L) \ge \mathrm{dist}(p, v_i) \text{ for at least one } v_i \in X \text{ above } L\}.$$
This is exactly the part of R² which can be tessellated into the Voronoi diagram with the information collected at the current position of L. Obviously, RL is bounded by a continuous curve consisting of arcs of parabolas, since for one point vi this is the case, and RL is the union of these parabolic regions. Call the boundary of RL the beach line BL. The break points on BL draw the VD(X) as L moves. Since the Voronoi diagram consists of straight edges, we do not even compute the parabolas; we only take care of the arrangement of the still active arcs of parabolas on the beach line, as determined by the individual points.

New arcs of the beach line arise when the line L meets one of the points. Add all the points to an ordered list in the obvious lexicographic order, and call them the point events. An active arc of the beach line disappears when the line L meets the bottom of the circle drawn through the three points determining a vertex of the Voronoi diagram. Such an event is called a circle event. Both types of events are illustrated in the diagram above. There is a striking difference between them. The point events always initiate a new arc and start "drawing" two edges of the Voronoi diagram; they initiate previously unknown circle events. The circle events might disappear without creating a genuine vertex of VD(X). Look at the diagram at the s point event: the new s, r, q circle event is encountered there.
But this would not create a vertex of the diagram if there were a next point u somewhere close enough to the indicated vertex. This is found out as soon as such a point event u is met. Such "ineffectively disappearing" circle events are called false alarms. On the contrary, the p, q, r circle event shown in the diagram gives rise to the indicated vertex. Summarizing, the emerging circle events must be inserted properly into the ordered queue of events and handled properly at each of the point events.

Further details are not considered here. When implemented properly, this algorithm runs in the optimal O(n log n) time and O(n) storage. See again the above mentioned book by Berg et al. (section 7) for details.

13.3.11. Geometric transformations. Various geometric transformations of the ambient space can often help to transform one problem into another. This is illustrated by a beautiful construction relating convex hulls and Voronoi diagrams. Of course, transformations which behave well on lines and planes and preserve incidences should be sought. The affine and projective transformations behave well in this respect, as seen in the fourth chapter. We introduce a more interesting one – the spherical inversion.

In the plane R², consider the unit circle x² + y² = 1. For arbitrary v = (x, y) ≠ (0, 0) define
$$\varphi(v) = \frac{1}{\|v\|^2}\, v = \frac{1}{x^2 + y^2}\,(x, y).$$
Clearly, φ is a bijection of R² \ {(0, 0)}. The geometric meaning of the transform is clear from the formula, see the diagram: a "general" point v is sent to the point on the same ray through the origin, but at the reciprocal distance. The unit circle is the set of fixed points. The same principle works in all dimensions, so we may equally well (and more interestingly) consider v ∈ R³ in the sequel.

Next comes the crucial property of φ.

Lemma. The mapping φ maps the spheres and planes in R³ onto spheres and planes. The image of a sphere is a plane if and only if the sphere contains the origin.

Proof. Consider a sphere C with center c and radius r. The equation for its general points p reads ∥p − c∥² = r². By drawing a few images as in the diagram above, it is easily guessed that the image is the sphere with center $s = \frac{1}{\|c\|^2 - r^2}\,c$ (i.e. again on the same line through the origin). Now consider q = φ(p) and compute (using $2\,p\cdot c = \|p\|^2 + \|c\|^2 - r^2$, which follows from the latter equation):
$$\|q - s\|^2 = \Big\|\frac{p}{\|p\|^2} - \frac{c}{\|c\|^2 - r^2}\Big\|^2 = \frac{1}{\|p\|^2} + \frac{\|c\|^2}{(\|c\|^2 - r^2)^2} - \frac{2\,p\cdot c}{(\|c\|^2 - r^2)\,\|p\|^2} = \frac{\|c\|^2}{(\|c\|^2 - r^2)^2} - \frac{1}{\|c\|^2 - r^2} = \Big(\frac{r}{\|c\|^2 - r^2}\Big)^2.$$
The latter computation assumes ∥c∥ ≠ r. Fix the center c ≠ 0 and consider radii r approaching ∥c∥ from below or above. Then the images are spheres with centers s approaching one or the other infinite point of the line span{c}, with fast growing radii. In the limit position, the plane is obtained, as requested. (Check this asymptotic computation directly yourself, if in any doubt.) □
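A quick numerical sanity check of the lemma (in the plane, for simplicity) is easy: sample a circle avoiding the origin, invert the samples, and measure their distances from the predicted center. The helper names are ours.

```python
import math

def invert(p):
    """Spherical inversion in the unit circle: v -> v / |v|^2."""
    n2 = p[0]**2 + p[1]**2
    return (p[0]/n2, p[1]/n2)

c, r = (3.0, 0.0), 1.0                       # a circle avoiding the origin
images = [invert((c[0] + r*math.cos(t), c[1] + r*math.sin(t)))
          for t in (0.1*k for k in range(63))]
# predicted centre s = c / (|c|^2 - r^2) and radius r / (|c|^2 - r^2)
d = c[0]**2 + c[1]**2 - r**2
s = (c[0]/d, c[1]/d)
radii = [math.hypot(q[0]-s[0], q[1]-s[1]) for q in images]
print(min(radii), max(radii))                # both approx 1/8, as predicted
```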
The continuity of φ has important consequences. Consider a general plane µ (not containing the origin). The inversion φ maps one of the half-spaces determined by µ to the interior of the image sphere; the other half-space maps to the unbounded complement of the sphere. The latter is, of course, the half-space containing the origin.

The efficient link between the Voronoi diagrams and convex hulls can now be explained. Assume a set of points X = {v1, . . . , vn} in the plane is given. View them as points in the plane z = 1 in R³, i.e. add the coordinate z = 1 to all the points (x, y) of X. For simplicity, assume that no three of them are collinear and no four of them lie on the same circle. The spherical inversion φ maps the entire plane z = 1 to the sphere S with center c = (0, 0, 1/2) and radius 1/2. Write w1, . . . , wn for the images wi = φ(vi). Now, consider CH(Y) for the set of images Y = {w1, . . . , wn}. This is a convex polytope with all vertices on the sphere S. All its faces span planes not containing the origin (this is due to the assumption that no three points of X are collinear). Split the faces of CH(Y) into those "visible" from the origin and the "invisible" ones. In the latter case, all points of Y are on the same side of the plane µ generated by the face as the origin. This implies that all the other points are outside of the image sphere Sµ = φ(µ). In particular, there are no points of X inside the intersection of Sµ with the plane z = 1. This is the defining condition for obtaining one of the vertices of the Voronoi diagram. Since the map φ preserves incidences, the entire DCEL for VD(X) is easily reconstructed from the DCEL of CH(Y), and vice versa. This resembles the construction of the dual graph, i.e. the Delaunay triangulation DT(X), from the Voronoi diagram, with a further geometric transformation in the background.

Last but not least, the faces of CH(Y) visible from the origin are worth mentioning too. For the same reason as above, all the points of Y appear on the other side from the origin, and so all the points of X are inside the image sphere. This means that the diagram of furthest points, instead of the Voronoi diagram of the closest ones, is obtained. This is a very useful tool in several areas of mathematics, see some of the exercises (??) for further illustration.

4. Remarks on more advanced combinatorial calculations

13.4.1. Generating functions. The worlds of discrete and continuous mathematics meet all the time, and there have already been many instances of useful interactions. With some slight exaggeration, we can claim that all results in analysis are achieved by an appropriate reduction of the continuous tasks to combinatorial problems (for instance, integration of rational functions is reduced to partial fraction decomposition, the solution of analytic differential equations may boil down to recurrences, etc.). In the opposite direction, we demonstrate how handy continuous methods can be in purely combinatorial problems.

We begin with a simple combinatorial question: There are four 1-crown coins, five 2-crown coins, and three 5-crown coins at our disposal. Suppose we want to buy a bottle of coke which costs 22 crowns. In how many ways can we pay the exact amount with the given coins? We are looking for integers i, j, k such that i + j + k = 22 and i ∈ {0, 1, 2, 3, 4}, j ∈ {0, 2, 4, 6, 8, 10}, k ∈ {0, 5, 10, 15}. Consider the product of polynomials (over the real numbers, for instance)
$$(x^0 + x^1 + x^2 + x^3 + x^4)(x^0 + x^2 + x^4 + x^6 + x^8 + x^{10})(x^0 + x^5 + x^{10} + x^{15}).$$
It should be clear that the number of solutions equals the coefficient at x²² in the resulting polynomial. This corresponds to the four possibilities of choosing the values i, j, k: 3·5 + 3·2 + 1·1, 3·5 + 2·2 + 3·1, 2·5 + 5·2 + 2·1, and 2·5 + 4·2 + 4·1. This simple example deserves more attention.
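The computation is easy to reproduce by multiplying the three coefficient lists; a short sketch with a helper of our own:

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists."""
    r = [0]*(len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i+j] += a*b
    return r

one  = [1]*5                                          # 1 + x + x^2 + x^3 + x^4
two  = [1 if i % 2 == 0 else 0 for i in range(11)]    # x^0 + x^2 + ... + x^10
five = [1 if i % 5 == 0 else 0 for i in range(16)]    # x^0 + x^5 + x^10 + x^15

f = poly_mul(poly_mul(one, two), five)
print(f[22])   # 4 ways to pay 22 crowns
```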
The coefficients of the particular polynomials represent sequences of numbers, recording in how many ways the given value can be achieved with one type of coin only. To avoid a prior bound on how many values are available, work with infinite sequences. Encode the possibilities by the infinite sequences
(1, 1, 1, 1, 1, 0, 0, . . . ) — 1-crowns,
(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, . . . ) — 2-crowns,
(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, . . . ) — 5-crowns.
Each such sequence with only finitely many non-zero terms can be assigned a polynomial, and the solution of the problem is given by the product of these polynomials, as noted before. This is an instance of a general procedure for handling sequences effectively.

Generating function of a sequence
Definition. An (ordinary) generating function for an infinite sequence a = (a0, a1, a2, . . . ) is the (formal) power series
$$a(x) = a_0 + a_1 x + a_2 x^2 + \cdots = \sum_{i=0}^{\infty} a_i x^i.$$
The values ai are considered in some fixed field K, normally the rational, real, or complex numbers.

In practice, there are several standard ways of defining and using generating functions:
- to find an explicit formula for the n-th term of a sequence;
- to derive new recurrent relations between values (although generating functions are often based on recurrent formulae themselves);
- to calculate means or other statistical quantities (for instance, the average time complexity of an algorithm);
- to prove miscellaneous combinatorial identities;
- to find an approximate formula or the asymptotic behaviour when the exact formula is too hard to get.
We shall see examples of some of these.

13.4.2. Operations with generating functions. Several basic operations with sequences correspond to simple operations over power series (as is easily proved by performing the relevant operation on the power series):
• Componentwise, the sum (ai + bi) of two sequences corresponds to the sum a(x) + b(x) of the generating functions.
• Multiplication (α · ai) of all terms by a given scalar α corresponds to the multiplication α · a(x) of the generating function.
• Multiplication of the generating function a(x) by the monomial x^k corresponds to shifting the sequence k places to the right and filling the first k places with zeros.
• In order to shift the sequence k places to the left (i.e. omit the first k terms), subtract the polynomial bk(x) corresponding to the sequence (a0, . . . , ak−1, 0, . . . ) from a(x), and then divide the generating function by x^k.
• Substitution of a polynomial f(x) for x leads to a specific combination of the terms of the original sequence. It can be expressed easily for f(x) = αx, which corresponds to multiplying the k-th term of the sequence by the scalar α^k. The substitution f(x) = x^n inserts n − 1 zeros between each pair of adjacent terms.
The first and second rules express the fact that the assignment of the generating function to a sequence is an isomorphism of the two vector spaces (over the field in question).

There are other important operations which often appear when working with generating functions:
• Differentiation with respect to x: the function a′(x) generates the sequence (a1, 2a2, 3a3, . . . ); the term at index k is (k + 1)a_{k+1} (i.e. the power series is differentiated term by term).
• Integration: the function $\int_0^x a(t)\,dt$ generates the sequence $(0, a_0, \frac12 a_1, \frac13 a_2, \frac14 a_3, \dots)$; for k ≥ 1, the term at index k equals $\frac{1}{k}a_{k-1}$ (clearly, differentiating the corresponding power series term by term leads back to the original function a(x)).
• Product of power series: the product a(x)b(x) is the generating function of the sequence (c0, c1, c2, . . . ), where $c_k = \sum_{i+j=k} a_i b_j$; i.e., the terms of the product agree, up to the index k, with those of the product $(a_0 + a_1 x + a_2 x^2 + \dots + a_k x^k)(b_0 + b_1 x + b_2 x^2 + \dots + b_k x^k)$. The sequence (cn) is also called the convolution of the sequences (an), (bn).
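A tiny illustration of the convolution rule: multiplying by the generating function of (1, 1, 1, . . . ), i.e. by $\frac{1}{1-x}$, produces the partial sums, a fact used in 13.4.5 below. The helper is ours; the sequences are truncated lists.

```python
def convolve(a, b):
    """c_k = sum over i+j=k of a_i * b_j, for two truncated sequences."""
    c = [0]*(len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i+j] += x*y
    return c

a = [1, 2, 3, 4, 5]            # some sequence a_n
ones = [1]*5                   # the sequence of 1/(1-x)
print(convolve(a, ones)[:5])   # [1, 3, 6, 10, 15]: the partial sums of a_n
```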
13.4.3. More links to continuous analysis. There are many useful examples of generating functions; most of them appear when working with power series in chapters five and six. Perhaps the reader recognizes the generating function given by the geometric series,
$$a(x) = \frac{1}{1-x} = 1 + x + x^2 + \dots,$$
which corresponds to the constant sequence (1, 1, 1, . . . ). As we know, this power series converges for x ∈ (−1, 1) and equals 1/(1 − x). It works the other way round as well: expanding this function into its Taylor series at the point 0, the original series is obtained. This "encoding" of a sequence into a function, and decoding it back, is the key idea in both the theory and practice of generating functions.

Generally, consider any sequence $a_n$ with $\sqrt[n]{a_n}$ bounded. Then there is a neighbourhood of zero on which its generating function converges (see 5.4.10 on page 428). For example, an easy check shows that this happens whenever $|a_n| = O(n^k)$ with a constant exponent k ≥ 0. On this neighbourhood, the generating functions can be worked with as with ordinary functions; in particular, one can add, multiply, compose, differentiate, and integrate them. All the equalities obtained carry over to the corresponding sequences. Recall several very useful basic power series and their sums:
$$\frac{1}{1-x} = \sum_{n\ge 0} x^n, \qquad \ln(1+x) = \sum_{n\ge 1} (-1)^{n+1}\frac{x^n}{n}, \qquad \ln\frac{1}{1-x} = \sum_{n\ge 1} \frac{x^n}{n},$$
$$e^x = \sum_{n\ge 0} \frac{x^n}{n!}, \qquad \sin x = \sum_{n\ge 0} (-1)^n \frac{x^{2n+1}}{(2n+1)!}, \qquad \cos x = \sum_{n\ge 0} (-1)^n \frac{x^{2n}}{(2n)!}.$$

13.4.4. Binomial theorem. Recall the standard finite binomial formula $(a+b)^r = a^r(1+c)^r = a^r\sum_{n=0}^{r}\binom{r}{n}c^n$, where $r \in \mathbb{N}$, $0 \ne a, b \in \mathbb{C}$, $c = b/a$. Even if the power r is not a natural number, the Taylor series of $(1+x)^r$ can still be computed. This yields the following generalization:

Generalized binomial theorem
Theorem. For any $r \in \mathbb{R}$, $k \in \mathbb{N}$, write
$$\binom{r}{k} = \frac{r(r-1)(r-2)\cdots(r-k+1)}{k!}$$
(in particular $\binom{r}{0} = 1$, the empty product divided by 1 in the latter formula). The power series expansion
$$(1+x)^r = \sum_{k\ge 0}\binom{r}{k}x^k$$
converges on a neighbourhood of zero, for each $r \in \mathbb{R}$.

The latter formula is called the generalized binomial theorem. In particular, the function $\frac{1}{(1-x)^n}$, $n \in \mathbb{N}$, can be expanded into the series
$$1 + \binom{1+n-1}{n-1}x + \dots + \binom{k+n-1}{n-1}x^k + \dots.$$

Proof. The theorem is obvious if $r \in \mathbb{N}$, since it is then the finite binomial formula. So assume r is not a natural number, and thus zero is never obtained when evaluating $\binom{r}{k}$. First, differentiate the function $a(x) = (1+x)^r$ and evaluate all the derivatives at x = 0. Obviously,
$$a^{(k)}(0) = r(r-1)\cdots(r-k+1)\,(1+x)^{r-k}\big|_{x=0} = r(r-1)\cdots(r-k+1),$$
which provides the coefficients $a_k = \binom{r}{k}$ of the series. In 5.4.5, there are several simple tests to decide on the convergence of a number series. The ratio test helps here:
$$\frac{a_{k+1}x^{k+1}}{a_k x^k} = \frac{\;\frac{r(r-1)\cdots(r-k)}{(k+1)!}\;}{\;\frac{r(r-1)\cdots(r-k+1)}{k!}\;}\,x = \frac{r-k}{k+1}\,x.$$
By the ratio test, the radius of convergence is 1 for all $r \notin \mathbb{N}$. The generalized binomial formula for negative integers is a straightforward consequence: if $r = -n$, we may write
$$\binom{-n}{k} = (-1)^k\,\frac{1}{k!}\,(n+k-1)(n+k-2)\cdots n = (-1)^k\binom{n+k-1}{n-1},$$
and substituting −x for the argument just kills the signs, as requested. □

13.4.5. Examples. The formulae with r a negative integer are very useful in practice. The simplest one is the geometric series, with r = −1. Write down two more of them:
$$\frac{1}{(1-x)^2} = \sum_{n\ge 0}(n+1)x^n, \qquad \frac{1}{(1-x)^3} = \sum_{n\ge 0}\binom{n+2}{2}x^n.$$
The same results can be obtained by consecutive convolutions. Indeed, for the generating function a(x) of a sequence (a0, a1, a2, . . . ), $\frac{1}{1-x}a(x)$ is the generating function of the sequence of all the partial sums (a0, a0 + a1, a0 + a1 + a2, . . . ). For instance,
$$\frac{1}{1-x}\,\ln\frac{1}{1-x}$$
is the generating function of the harmonic numbers $H_n = 1 + \frac12 + \dots + \frac1n$.

13.4.6. Difference equations. Typically, generating functions can be very useful when sequences are defined by relations between their terms. An instructive example of such an application is the complete discussion of the solutions of linear difference equations with constant coefficients. These are examined in the second part of chapter one, see 1.2.4. Back there, a formula is derived for first-order equations, while the uniqueness and existence of the solution is justified after only "guessing" the solution. Now it can be truly derived, in a straightforward way which works also for more complex and non-linear problems.

First, sort out the well-known example of the Fibonacci sequence, given by the recurrence
$$F_{n+2} = F_n + F_{n+1}, \qquad F_0 = 0,\ F_1 = 1,$$
and write F(x) for the (yet unknown) generating function of this sequence. We want to compute F(x) and so obtain an explicit expression for the n-th Fibonacci number. The defining equality can be expressed in terms of F(x) using our operations for shifting the terms of the sequence. Indeed, xF(x) corresponds to the sequence (0, F0, F1, F2, . . . ), and x²F(x) to (0, 0, F0, F1, . . . ). Therefore, the generating function G(x) = F(x) − xF(x) − x²F(x) represents the sequence (F0, F1 − F0, 0, 0, . . . , 0, . . . ). Substituting the values F0 = 0, F1 = 1 (the initial condition), obviously G(x) = x, and hence
$$(1 - x - x^2)\,F(x) = x.$$
So F(x) is a rational function, and it can be rewritten as a linear combination of simple rational functions. This is helpful, since a linear combination of generating functions corresponds to the same combination of the sequences. Rational functions can be decomposed into partial fractions, see 6.2.7. Using this procedure, write
$$F(x) = \frac{x}{1 - x - x^2} = \frac{A}{x - x_1} + \frac{B}{x - x_2} = \frac{a}{1 - \lambda_1 x} + \frac{b}{1 - \lambda_2 x},$$
where A, B are suitable (generally complex) constants, and x1, x2 are the roots of the polynomial in the denominator. The ultimate constants a, b, λ1, and λ2 can be obtained by a simple rearrangement of the particular fractions. This leads to the general form of the generating function,
$$F(x) = \sum_{n=0}^{\infty}\big(a\lambda_1^n + b\lambda_2^n\big)x^n,$$
and so the general solution of the recurrence is known as well.

In the present case, the roots of the quadratic polynomial are $\frac{-1\pm\sqrt5}{2}$, hence the reciprocal values of the roots are $\lambda_{1,2} = \frac{1\pm\sqrt5}{2}$. The partial fraction decomposition equality gives
$$x = a\Big(1 - \frac{1-\sqrt5}{2}\,x\Big) + b\Big(1 - \frac{1+\sqrt5}{2}\,x\Big),$$
and so $a = -b = \frac{1}{\sqrt5}$. Finally, the requested solution
$$F_n = \frac{1}{\sqrt5}\left(\Big(\frac{1+\sqrt5}{2}\Big)^{\!n} - \Big(\frac{1-\sqrt5}{2}\Big)^{\!n}\right)$$
is obtained. Compare this procedure to the approach in 3.2.2 and 3.B.1. This expression, full of irrational numbers, is an integer. The second summand is the power of (1 − √5)/2 ≃ −0.618, whose value is negligible for large n. Hence Fn can be computed by evaluating just the first summand and approximating to the nearest integer.
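It is worth checking the closed formula against the recurrence numerically; a short sketch follows (floating point suffices for small n, together with the rounding trick just mentioned):

```python
from math import sqrt

def fib_closed(n):
    s5 = sqrt(5)
    return ((1 + s5)/2)**n / s5 - ((1 - s5)/2)**n / s5

# compare with F_{n+2} = F_n + F_{n+1}, F_0 = 0, F_1 = 1
F = [0, 1]
for _ in range(18):
    F.append(F[-2] + F[-1])
print(all(round(fib_closed(n)) == F[n] for n in range(20)))   # True
```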
Of course, the same procedure can be applied to general k-th order homogeneous linear difference equations. Consider the recurrence
$$F_{n+k} = \alpha_0 F_n + \dots + \alpha_{k-1} F_{n+k-1}.$$
The generating function of the resulting sequence is
$$F(x) = \frac{g(x)}{1 - \alpha_{k-1}x - \dots - \alpha_0 x^k},$$
where the polynomial g(x), of degree at most k − 1, is determined by the chosen initial conditions. Using partial fraction decomposition, the general result follows as in subsection 3.2.4.

13.4.7. The general method. Power series are a much stronger tool for solving recurrences. The point is that one is not restricted to linearity and homogeneity. Using the following general approach, recurrences that seem intractable at first sight can quite often be managed. The first steps are just algorithmic, while the final solution of the equation for the generating function may need very diverse approaches.

In order to write down the necessary equations efficiently, adopt the convention of the logical predicate [δ(n)], attached before the expression it governs: simply multiply by the coefficient 1 if δ(n) is true, and by zero otherwise. For instance, the equation
$$F_n = F_{n-2} + F_{n-1} + [n = 1]\,1 + [n = 0]\,1$$
defines the above Fibonacci recurrence with the initial conditions F0 = 1 and F1 = 2.

Method to resolve recurrences
Recurrent definitions of sequences (a0, a1, . . . ) may be solved in the following 4 steps:
(1) Write the complete dependence between the terms of the sequence as a single equation, expressing an in terms of the terms with smaller indices. This universal formula must hold for all n ∈ N (supposing a−1 = a−2 = · · · = 0).
(2) Multiply both sides of the equation by x^n and sum the resulting expressions over all n ∈ N. One of the summands is $\sum_{n\ge 0} a_n x^n$, which is the generating function A(x) of the sequence. Rearrange the other summands so that they contain only A(x) and some further polynomial expressions.
(3) Solve the resulting equation for A(x) explicitly.
(4) Expand the function A(x) into a power series. Its coefficients at x^n are the requested values an.

As an example, consider a second-order linear difference equation with constant coefficients, but with a non-linear right-hand side. The recurrence is
$$a_n = 5a_{n-1} - 6a_{n-2} - n$$
with the initial conditions a0 = 0, a1 = 1. The individual steps of the latter procedure are as follows.

Step 1. The universal equation is clear, up to the initial conditions. First check n = 0, which yields no extra term; then n = 1 enforces the extra value 2 to be added. Hence,
$$a_n = 5a_{n-1} - 6a_{n-2} - n + [n = 1]\,2.$$

Step 2.
$$\sum_{n\ge 0} a_n x^n = 5x\sum_{n\ge 0} a_{n-1}x^{n-1} - 6x^2\sum_{n\ge 0} a_{n-2}x^{n-2} - \sum_{n\ge 0} n x^n + 2x.$$
Next, one of the terms is nearly the power series of $(1-x)^{-2}$; thus remove one x there, in order to get the equality for A(x) in the required form (ignore the negative values of the indices, since all of a−1, a−2, . . . vanish by assumption):
$$A(x) = 5xA(x) - 6x^2 A(x) - x\,\frac{1}{(1-x)^2} + 2x.$$

Step 3.
Find the reciprocal values, 2 and 3, of the roots of the polynomial $1 - 5x + 6x^2 = (1-2x)(1-3x)$. An elementary calculation yields
$$A(x) = \frac{2x^3 - 4x^2 + x}{(1-2x)(1-3x)(1-x)^2}.$$

Step 4. Partial fraction decomposition directly leads to the result
$$A(x) = -\frac{1}{4}\cdot\frac{1}{1-3x} + 2\cdot\frac{1}{1-2x} - \frac{1}{2}\cdot\frac{1}{(1-x)^2} - \frac{5}{4}\cdot\frac{1}{1-x}.$$
This corresponds to the solution (notice that the last but one term yields $\sum (n+1)x^n$, and thus the constant term in our formula is the sum $-\frac12 - \frac54$)
$$a_n = -\frac{1}{4}\,3^n + 2^{n+1} - \frac{1}{2}\,n - \frac{7}{4}.$$
The first eight terms of the sequence are 0, 1, 3, 6, 8, −1, −59, −296, all integers of course.
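Again, the closed formula is easily checked against the recurrence; a minimal sketch:

```python
def a_closed(n):
    return -3**n / 4 + 2**(n+1) - n/2 - 7/4

a = [0, 1]                                   # a_0, a_1
for n in range(2, 20):
    a.append(5*a[-1] - 6*a[-2] - n)          # the recurrence itself
print(all(a_closed(n) == a[n] for n in range(20)))   # True
print(a[:8])                                 # [0, 1, 3, 6, 8, -1, -59, -296]
```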
13.4.8. Plane binary trees and Catalan numbers. The next application of generating functions answers the question about the number bn of nonisomorphic plane binary trees on n vertices (cf. 13.1.18 for plane trees). Treat these trees in the form of the root (of a subtree) together with the pair [left binary subtree, right binary subtree]. Examine the initial values of n, namely b0 = 1, b1 = 1, b2 = 2, b3 = 5. It is more or less obvious that for n ≥ 1, the sequence bn satisfies the recurrent formula
$$b_n = b_0 b_{n-1} + b_1 b_{n-2} + \dots + b_{n-1} b_0,$$
and this is actually close to a convolution of two equal sequences. Rearranged so that it holds for all n ∈ N0,
$$b_n = \sum_{0 \le k < n} b_k\, b_{n-1-k} + [n = 0]\,1.$$

(2) Divide phase. Distribute the remaining elements into the list L1 of those smaller than the pivot x = L[0], while putting the other elements into the list L2.
(3) Conquer phase. Combine the lists L = Qsort(L1) + (L[0]) + Qsort(L2) and return the list L.

We analyze how many comparisons are needed. Assume that all possible orderings of the list L to be sorted are distributed uniformly. The following parameters are crucial:
• The number of comparisons in the divide phase is n − 1.
• The assumption of uniformity ensures that the probability of L[0] being the k-th greatest element of the sequence is 1/n.
• The sizes of the sublists to be sorted in the conquer phase are k − 1 and n − k.
There is the following recurrent formula for the expected number of comparisons Cn:
$$\text{(1)} \qquad C_n = n - 1 + \sum_{k=1}^{n} \frac{1}{n}\,\big(C_{k-1} + C_{n-k}\big).$$
One could work through the steps of the general method directly, but the symmetry of the two summands allows us to rewrite (1), multiplying by n at the same time:
$$\text{(2)} \qquad n\,C_n = n(n-1) + 2\sum_{k=1}^{n} C_{k-1}.$$
In the first step, care is needed concerning n = 0. The defining recurrence (1) does not treat n = 0 at all (the equation does not make sense there), so the convention must be extended to include C0 = 0 in the computation. Then the equation (2) defines C1 = 0 properly, and it is not necessary to add any terms in view of the initial conditions. Next, multiply both sides by x^n and sum:
$$\sum_{n\ge 0} n\,C_n x^n = \sum_{n\ge 0} n(n-1)x^n + 2\sum_{n\ge 0}\sum_{k=1}^{n} C_{k-1}x^n.$$
All the terms look familiar. The left hand side is the derivative of the generating function $C(x) = \sum_{n\ge 0} C_n x^n$, with one x removed. The first term on the right is the series of $(1-x)^{-3}$, up to a constant and a shift of powers by 2. Finally, the last term is the convolution with $(1-x)^{-1}$, up to one x and the coefficient 2. Hence the equation
$$x\,C'(x) = \frac{2x^2}{(1-x)^3} + 2\,\frac{x\,C(x)}{1-x}.$$
The third step is straightforward, see 8.3.3. Divide by x to obtain
$$C'(x) = \frac{2}{1-x}\,C(x) + \frac{2x}{(1-x)^3}.$$
The corresponding integrating factor is $e^{-\int\frac{2}{1-x}\,dx} = (1-x)^2$. Hence
$$\big((1-x)^2\,C(x)\big)' = \frac{2x}{1-x},$$
and finally
$$C(x) = 2\left(\frac{1}{(1-x)^2}\,\ln\frac{1}{1-x} - \frac{x}{(1-x)^2}\right).$$
The first term in the bracket corresponds to the convolution of two known sequences, so it contributes to Cn by
$$\sum_{k=1}^{n} \frac{1}{k}\,(n-k+1) = (n+1)\sum_{k=1}^{n}\frac{1}{k} - n = (n+1)H_n - n = (n+1)(H_{n+1} - 1),$$
where Hn are the harmonic numbers. The result is
$$C_n = 2(n+1)(H_{n+1} - 1) - 2n.$$
Notice that in 13.I.10, the very same recurrence is solved by different (more direct and simpler) tricks, without any differential equations involved. Since the harmonic numbers Hn are easily approximated by $\ln n = \int_1^n \frac{1}{x}\,dx$, the analysis shows that the expected time cost of quicksort is O(n log n). But it is easy to see that the worst case time is O(n²) (in this version it happens if the list is already properly ordered – then L1 is always empty and the depth of the recursion is linear).

13.4.10. Exponential generating functions. Another approach to generating functions is to take the exponential $e^x = \sum_{n\ge 0}\frac{1}{n!}x^n$ as the power series corresponding to the constant sequence (1, 1, . . . ). In general, this leads to the exponential generating functions
$$A(x) = \sum_{n\ge 0} a_n\,\frac{x^n}{n!}.$$
Here are a few elementary examples:
$e^x$ ←→ (1, 1, 1, . . . ),
$\frac{1}{1-x}$ ←→ (1, 1, 2, 6, 24, . . . , n!, . . . ),
$\ln\frac{1}{1-x}$ ←→ (0, 1, 1, 2, 6, 24, . . . ).
The slight modification of the definition (just the extra coefficient $\frac{1}{n!}$) is responsible for a very different behaviour, compared to the ordinary generating functions. Indeed, now the elementary operations are:
• Multiplication of A(x) by x yields the sequence with terms ãn = n·a_{n−1}.
• Differentiation of A(x) shifts the sequence to the left.
• Integration of A(x) shifts the sequence to the right.
• The product of functions A(x) and B(x) corresponds to the sequence with terms $h_n = \sum_k \binom{n}{k} a_k b_{n-k}$, the binomial convolution of an and bn.

As before, the exponential generating functions might become useful when resolving recurrences. Here is a simple example. Define the sequence by the initial conditions g0 = 0, g1 = 1 and the formula
$$g_n = -2n\,g_{n-1} + \sum_{k\ge 0}\binom{n}{k}\,g_k\,g_{n-k}.$$
At first glance, seeing the binomial convolution suggests trying the exponential version. Write G(x) for the corresponding power series and proceed in the usual four steps again.

Step 1. Complete the formula to accommodate the initial conditions:
$$g_n = -2n\,g_{n-1} + \sum_{k=0}^{n}\binom{n}{k}\,g_k\,g_{n-k} + [n = 1].$$
There seems to be a subtle point about g0 here, because the equation gives g0 = g0², with the two solutions 0 and 1. The proper choice of g0 now yields the correct value for g1, but the right solution G is chosen only later.

Step 2. Multiply by $\frac{x^n}{n!}$ and sum over all n, to obtain
$$G(x) = -2x\,G(x) + G(x)^2 + x.$$

Step 3. Now solve the easy quadratic equation, arriving at $G(x) = \frac12\big(1 + 2x \pm \sqrt{1 + 4x^2}\big)$. The evaluation at zero provides g0, hence the right choice for g0 = 0 is the minus sign, and
$$G(x) = \frac{1 + 2x - \sqrt{1 + 4x^2}}{2}.$$

Step 4. Apply the generalized binomial theorem to expand G(x) into a power series, see 13.4.8:
$$\sqrt{1 + 4x^2} = 1 + \sum_{k\ge 1}\frac{(-1)^{k-1}}{k}\cdot 2\cdot\binom{2k-2}{k-1}\,x^{2k}.$$
Further, since $G(x) = \sum_{n\ge 0} g_n\frac{x^n}{n!} = \frac{1+2x-\sqrt{1+4x^2}}{2}$, we get $g_{2k+1} = 0$ for k ≥ 1, and
$$g_{2k} = (-1)^k\cdot\frac{1}{k}\binom{2k-2}{k-1}\cdot(2k)! = (-1)^k\,(2k)!\;C_{k-1},$$
where Cn is the n-th Catalan number.
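The claimed values are easy to confirm by running the recurrence directly. Note that the k = 0 and k = n terms of the sum vanish, because g0 = 0. A sketch (using math.comb from Python 3.8+):

```python
from math import comb, factorial

def catalan(n):
    return comb(2*n, n) // (n + 1)

g = [0, 1]                                   # g_0 = 0, g_1 = 1
for n in range(2, 13):
    # the endpoint terms k = 0 and k = n drop out since g_0 = 0
    g.append(-2*n*g[n-1] + sum(comb(n, k)*g[k]*g[n-k] for k in range(1, n)))

# closed form: g_{2k+1} = 0 (k >= 1) and g_{2k} = (-1)^k (2k)! C_{k-1}
ok = all(g[2*k+1] == 0 for k in range(1, 6)) and \
     all(g[2*k] == (-1)**k * factorial(2*k) * catalan(k-1) for k in range(1, 7))
print(ok)   # True
```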
13.4.11. Cayley's formula. We conclude this chapter with a more complicated example. Cayley's formula computes the number of trees (i.e. graphs with unique paths between all pairs of vertices) on n given vertices,
$$\kappa(K_n) = n^{n-2}.$$
The notation refers to the equivalent formulation: find the number of all spanning trees in the complete graph Kn. Equivalently, in how many ways can a tree be realized on n vertices with the vertices labeled? For example, already the path Pn can be realized in n!/2 ways, so there must be very many of them.

This result is proved with the help of exponential generating functions. Write Tn = κ(Kn) for the unknown values. It is easily shown that T1 = T2 = 1, T3 = 3, T4 = 16. For instance, consider trees on 4 vertices. Out of the $\binom{6}{3} = 20$ potential graphs with exactly three edges, those where the edges form a triangle must not be counted, and there are $\binom{4}{3} = 4$ of them. In the diagram, there are four different possibilities, and each of them can be rotated into another three; hence the solution is 16.

A recurrent formula can be obtained by fixing one of the vertices and adding together the possibilities for all available degrees of this vertex. This suggests looking rather at the number Rn of rooted trees. It is clear that Rn = nTn, because there are n possibilities for placing the root in each of the trees. Also, one can work with one fixed ordering of the vertices of Kn and multiply the result by n! in the end. In this way, go through the possible degrees m of the first vertex, and for each m find the different possibilities for the sizes k1, . . . , km of the corresponding subtrees. Obviously k1 + · · · + km = n − 1, all ki > 0, and since the labeling of all vertices is fixed, all the orders of the subtrees must be considered equivalent; so multiply the contribution by $\frac{1}{m!}$, and similarly for each of the possibilities for the subtrees. The recurrent formula is
$$R_n = n!\sum_{m>0}\frac{1}{m!}\sum_{k_1+\dots+k_m = n-1}\frac{1}{k_1!\cdots k_m!}\,R_{k_1}\cdots R_{k_m}.$$
Of course, R0 = 0, R1 = 1, and, already using the formula, R2 = 2R1 = 2. Next, R3 = 3R2 + 3R1² = 9 and R4 = 4R3 + 12R1R2 + 4R1³ = 64, all as expected. The first step of the standard procedure is accomplished.

Next, write $R(x) = \sum_{n\ge 0} R_n\frac{x^n}{n!}$. The inner sum above is the coefficient at $x^{n-1}$ in the m-th power of the series R(x). Therefore,
$$R_n\,\frac{1}{n!} = [x^{n-1}]\sum_{m\ge 0}\frac{1}{m!}\,R(x)^m,$$
and hence we have the required equation for R:
$$R(x) = x\,e^{R(x)}.$$
There are several ways of solving such functional equations. Here is one such tool, stated without proof.

Theorem (Lagrange inverse formula). Consider an analytic function f with f(0) = 0 and f′(0) ≠ 0. Then there is (locally) an analytic inverse of f, i.e. $w = g(z) = \sum_{n\ge 1} g_n\frac{z^n}{n!}$ with z = f(g(z)). Moreover, for all n > 0,
$$g_n = \lim_{w\to 0}\,\frac{d^{n-1}}{dw^{n-1}}\Big(\frac{w}{f(w)}\Big)^{\!n}.$$

In this case, solve the equation $x = R(x)\,e^{-R(x)}$, so that the latter theorem applies with g = R and $f(w) = w\,e^{-w}$. It follows that
$$[x^n]\,R(x) = \frac{1}{n}\,[w^{n-1}]\Big(\frac{w}{w\,e^{-w}}\Big)^{\!n} = \frac{1}{n}\,[w^{n-1}]\,e^{wn} = \frac{1}{n}\cdot\frac{n^{n-1}}{(n-1)!} = \frac{n^{n-1}}{n!}.$$
In particular, $R_n = n^{n-1}$, and so $T_n = \frac{R_n}{n} = n^{n-2}$.
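For small n, Cayley's formula is easy to confirm by brute force, testing all (n − 1)-element edge sets of Kn for connectivity. A sketch with a tiny union-find (all names ours):

```python
from itertools import combinations

def count_labeled_trees(n):
    """Brute force: choose n-1 of the C(n,2) edges and test connectivity."""
    edges = list(combinations(range(n), 2))
    count = 0
    for choice in combinations(edges, n - 1):
        parent = list(range(n))
        def find(v):
            while parent[v] != v:
                v = parent[v]
            return v
        merged = 0
        for u, v in choice:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                merged += 1
        if merged == n - 1:      # n-1 successful unions = spanning tree
            count += 1
    return count

print([count_labeled_trees(n) for n in range(2, 6)])   # [1, 3, 16, 125]
```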
J. Additional exercises to the whole chapter

13.J.1. Determine the number of edges that must be added into i) the cycle graph $C_n$ on $n$ vertices, ii) the complete bipartite graph $K_{m,n}$, in order to obtain a complete graph. ⃝

13.J.2. Let the vertices of $K_6$ be labeled $1, 2, \dots, 6$ and let every edge $\{i,j\}$ be assigned the integer $[(i+j) \bmod 3] + 1$. How many maximum spanning trees are there in this graph? ⃝

13.J.3. Let the vertices of $K_7$ be labeled $1, 2, \dots, 7$ and let every edge $\{i,j\}$ be assigned the integer $[(i+j) \bmod 3] + 1$. How many minimum spanning trees are there in this graph? ⃝

13.J.4. Let the vertices of $K_5$ be labeled $1, 2, \dots, 5$ and let every edge $\{i,j\}$ be assigned the integer: 1 if $i+j$ is odd; 2 if $i+j$ is even. How many maximum spanning trees are there in this graph? ⃝

13.J.5. Let the vertices of $K_5$ be labeled $1, 2, \dots, 5$ and let every edge $\{i,j\}$ be assigned the integer: 1 if $i+j$ is odd; 2 if $i+j$ is even. How many minimum spanning trees are there in this graph? ⃝

13.J.6. Let the vertices of $K_6$ be labeled $1, 2, \dots, 6$ and let every edge $\{i,j\}$ be assigned the integer: 1 if $i+j$ leaves remainder 1 upon division by 3; 2 if $i+j$ leaves remainder 2 upon division by 3; 3 if $i+j$ is divisible by 3. How many minimum spanning trees are there in this graph? ⃝

13.J.7. Let the vertices of $K_6$ be labeled $1, 2, \dots, 6$ and let every edge $\{i,j\}$ be assigned the integer: 1 if $i+j$ leaves remainder 1 upon division by 3; 2 if $i+j$ leaves remainder 2 upon division by 3; 3 if $i+j$ is divisible by 3. How many maximum spanning trees are there in this graph? ⃝

13.J.8. Icosian game – find a Hamiltonian cycle in the graph consisting of the vertices and edges of the regular dodecahedron.
Solution. See Wikipedia, Icosian game, http://en.wikipedia.org/wiki/Icosian_game (as of Aug. 8, 2013, 13:24 GMT). □

13.J.9. Does there exist a Hamiltonian cycle in the Petersen graph?
Solution. No (however, when any one of the vertices is removed, the resulting graph is already Hamiltonian). This can be shown by enumerating all 3-regular Hamiltonian graphs on 10 vertices and finding a cycle of length less than 5 in each of them. □

13.J.10. Show that if $G = (V, E)$ is Hamiltonian and $\emptyset \ne W \subsetneq V$, then $G \setminus W$ has at most $|W|$ connected components. Give an example of a graph where the converse does not hold.
Solution. □

13.J.11. Find a maximum flow and the corresponding minimum cut in the given weighted directed graph (a network on the vertices Z, S, A, B, C, D, E, F; the diagram with the edge capacities is not reproduced here). ⃝

13.J.12. Find a maximum flow and the corresponding minimum cut in the given weighted directed graph (second network on the vertices Z, S, A, B, C, D, E, F; diagram not reproduced). ⃝

13.J.13. Find a maximum flow and the corresponding minimum cut in the given weighted directed graph (third network on the vertices Z, S, A, B, C, D, E, F; diagram not reproduced). ⃝

13.J.14. Find a maximum flow and the corresponding minimum cut in the given weighted directed graph (fourth network on the vertices Z, S, A, B, C, D, E, F; diagram not reproduced). ⃝

13.J.15. Find the generating functions of the following sequences:
i) $(1, 2, 1, 4, 1, 8, 1, 16, \dots)$
ii) $(1, 1, 0, 1, 1, 0, 1, 1, \dots)$
iii) $(1, -1, 2, -2, 3, -3, 4, -4, \dots)$
Solution. i) $(1, 2, 1, 4, 1, 8, 1, 16, \dots) = (1, 0, 1, 0, \dots) + (0, 2, 0, 4, 0, 8, \dots)$. Thus, we find the generating functions for each summand separately. As for the first one, consider the sequence $(1, 1, 1, 1, \dots)$, generated by $\frac{1}{1-x}$; the zeros can be inserted by substituting $x^2$ for $x$. As for the second one, we proceed similarly, starting with $(1, 2, 4, 8, 16, \dots)$, i.e. $\frac{1}{1-2x}$, then multiplying by two, inserting zeros, and finally shifting to the right by multiplying by $x$.
ii) $(1, 1, 0, 1, 1, 0, 1, 1, \dots) = (1, 0, 0, 1, 0, 0, 1, \dots) + (0, 1, 0, 0, 1, 0, 0, 1, \dots)$.
The results are:
i) $\frac{1}{1-x^2} + \frac{2x}{1-2x^2}$
ii) $\frac{1+x}{1-x^3}$
iii) $\frac{1}{(1-x^2)^2} - \frac{x}{(1-x^2)^2} = \frac{1-x}{(1-x^2)^2}$ □
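The spanning-tree counts asked for in 13.J.2–13.J.7 concern only $K_5$–$K_7$, so they can be checked by exhaustive enumeration. A Python sketch of such a check (our own verification code; the expected counts in the comments are those from the key at the end of the chapter):

from itertools import combinations

def spanning_tree_weights(n, weight):
    """Total weights of all spanning trees of K_n with vertices 1..n."""
    vertices = list(range(1, n + 1))
    all_edges = list(combinations(vertices, 2))
    weights = []
    for tree in combinations(all_edges, n - 1):
        parent = {v: v for v in vertices}
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        acyclic = True
        for u, v in tree:
            ru, rv = find(u), find(v)
            if ru == rv:
                acyclic = False  # the chosen edges close a cycle
                break
            parent[ru] = rv
        if acyclic:  # n-1 acyclic edges form a spanning tree
            weights.append(sum(weight(u, v) for u, v in tree))
    return weights

w6 = spanning_tree_weights(6, lambda i, j: (i + j) % 3 + 1)
print(w6.count(max(w6)))  # 13.J.2: 16 maximum spanning trees
w7 = spanning_tree_weights(7, lambda i, j: (i + j) % 3 + 1)
print(w7.count(min(w7)))  # 13.J.3: 72 minimum spanning trees
# 13.J.4-13.J.7 can be checked analogously with the other edge-weight rules.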
13.J.16. Find the coefficient at $x^{17}$ in $(x^3 + x^4 + x^5 + \cdots)^3$.
Solution. $(x^3 + x^4 + x^5 + \cdots)^3 = \frac{x^9}{(1-x)^3} = x^9 \cdot \frac{1}{(1-x)^3}$. We are thus looking for the coefficient at $x^8$ in $\frac{1}{(1-x)^3}$. This is equal to $\binom{10}{2}$, i.e. 45. □

13.J.17. There are 30 red, 40 blue, and 50 white balls in a box (balls of the same color are indistinguishable). In how many ways can we pick up 70 balls from the box?
Solution. Clearly, the number of possibilities is equal to the coefficient at $x^{70}$ in the expression $(1 + x + \cdots + x^{30})(1 + x + \cdots + x^{40})(1 + x + \cdots + x^{50})$. Mere rearrangements lead to
\[
(1 + x + \cdots + x^{30})(1 + x + \cdots + x^{40})(1 + x + \cdots + x^{50}) = \frac{1}{(1-x)^3}\,(1-x^{31})(1-x^{41})(1-x^{51}).
\]
Applying the generalized binomial theorem, we obtain the solution $\binom{72}{2} - \binom{41}{2} - \binom{31}{2} - \binom{21}{2}$. □

13.J.18. What is the probability that a roll of 12 dice results in the sum of 30? Hint: Express the number of possibilities when the sum is 30. Consider $(x + x^2 + x^3 + x^4 + x^5 + x^6)^{12}$. ⃝

13.J.19. A fruit grower wants to plant 25 new trees, having four species at his disposal. However, his wife insists that there be at most 1 walnut, at most 10 apples, at least 6 cherries, and at least 8 plums. In how many ways can he fulfill his beloved’s wishes? Hint: We are interested in the coefficient at $x^{25}$ in the expression $(1+x)(1 + x + \cdots + x^{10})(x^6 + x^7 + \cdots)(x^8 + x^9 + \cdots)$.
Solution.
\[
(1+x)(1 + x + \cdots + x^{10})(x^6 + x^7 + \cdots)(x^8 + x^9 + \cdots) = \frac{x^{14}\,(1-x^2)(1-x^{11})}{(1-x)^4}.
\]
Therefore, we are looking for the coefficient at $x^{11}$ in $(1 - x^2 - x^{11} + x^{13})\cdot\frac{1}{(1-x)^4}$, which is equal to $\binom{14}{3} - \binom{12}{3} - \binom{3}{3}$. □

13.J.20. Express the general term of the sequences defined by the following recurrences:
i) $a_1 = 3$, $a_2 = 5$, $a_{n+2} = 4a_{n+1} - 3a_n$ for $n = 1, 2, 3, \dots$
ii) $a_0 = 0$, $a_1 = 1$, $a_{n+2} = 2a_{n+1} - 4a_n$ for $n = 0, 1, 2, 3, \dots$
Solution. i) $a_n = 2 + 3^{n-1}$. ii) $a_n = \frac{1}{2\sqrt{-3}}\bigl((1+\sqrt{-3})^n - (1-\sqrt{-3})^n\bigr)$. □

13.J.21. Solve the recurrence where each term of the sequence $(a_0, a_1, a_2, \dots)$ is equal to the arithmetic mean of the preceding two terms. ⃝

13.J.22. Solve the recurrence $a_{n+2} = \sqrt{a_{n+1} a_n}$ with the initial conditions $a_0 = 2$, $a_1 = 8$. Hint: Create a new sequence $b_n = \log_2 a_n$. ⃝

13.J.23. Solve the recurrence given by $a_n = \sum_{k\ge 0} \binom{n}{k} \frac{a_k}{2^k}$, $a_0 = 1$. Hint: Multiply both sides by $\frac{x^n}{n!}$ and sum it up. Note that $A(x)$ is the exponential generating function for the sequence $(a_n)$. ⃝

13.J.24. Find the number of triangulations of a convex $n$-gon. Hint: Select any diagonal that goes through a fixed vertex; this splits the polygon in two.
Solution. $t_n = C_{n-2}$, where $C_n$ denotes the $n$-th Catalan number. □

13.J.25. Find the number of walks in a square grid of size $n \times n$ from the lower left-hand corner $A$ to the upper right-hand corner $B$ which go only upwards or rightwards and intersect the diagonal $AB$ at exactly one point (besides $A$ and $B$). Hint: Catalan numbers. ⃝

13.J.26. Prove that the Fibonacci numbers satisfy:
i) $F_2 + F_4 + \cdots + F_{2n} = F_{2n+1} - 1$
ii) $F_1 + F_3 + \cdots + F_{2n-1} = F_{2n}$ ⃝

13.J.27. Recall the well-known puzzle Tower of Hanoi and let $H_n$ denote the minimum number of steps necessary to move a tower consisting of $n$ disks from one rod to another one. Find a recurrent formula for $H_n$ as well as its general solution.
Solution. $H_{n+1} = 2H_n + 1$, $H_n = 2^n - 1$. □
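The coefficient extractions in 13.J.16–13.J.18 can be double-checked by multiplying out truncated polynomials. A short Python sketch of such a check (our own code; poly_mul is a hypothetical helper, not from the text):

def poly_mul(p, q, deg):
    """Product of coefficient lists p and q, truncated beyond degree deg."""
    r = [0] * (deg + 1)
    for i, a in enumerate(p):
        if a:
            for j, b in enumerate(q):
                if i + j <= deg:
                    r[i + j] += a * b
    return r

# 13.J.16: coefficient at x^17 in (x^3 + x^4 + ...)^3 (truncating at degree 17 suffices)
factor = [0, 0, 0] + [1] * 15         # x^3 + x^4 + ... + x^17
p = [1]
for _ in range(3):
    p = poly_mul(p, factor, 17)
print(p[17])                           # 45

# 13.J.17: coefficient at x^70 in (1+...+x^30)(1+...+x^40)(1+...+x^50)
q = poly_mul(poly_mul([1] * 31, [1] * 41, 70), [1] * 51, 70)
print(q[70])                           # = C(72,2) - C(41,2) - C(31,2) - C(21,2) = 1061

# 13.J.18: number of rolls of 12 dice summing to 30
die = [0] + [1] * 6                    # x + x^2 + ... + x^6
d = [1]
for _ in range(12):
    d = poly_mul(d, die, 30)
print(d[30], 6 ** 12)                  # favorable cases and all cases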
At the very end of the book, we present one problem from practice.

13.J.28. A volleyball team (with a libero, i.e. 7 people) is sitting in a pub, drinking their favorite and well-deserved beer. However, there are only 7 beer mugs available. What is the probability that in the next round,
i) exactly one volleyball player is not given the same mug as the last time,
ii) no one is given the same mug as the last time,
iii) exactly three players are given the same mug as the last time?
Solution. i) If six of the seven people are given the same mug, then so must be the last one. Therefore, the probability is zero.
ii) Let $M$ be the set of all assignments of the 7 mugs to the 7 players and let $A_i$ be the event that the $i$-th player is given his mug. We want to calculate $|M \setminus \bigcup_i A_i|$. By inclusion–exclusion, we get $7! \sum_{k=0}^{7} \frac{(-1)^k}{k!} = 1854$, so the probability is $\frac{1854}{5040} = \frac{103}{280} \doteq 0.37$.
iii) There are $\binom{7}{3} = 35$ ways to select the three people who are to get the same mug. The remaining four people must all be given different mugs than the last time. Again, we can apply the formula from above, i.e., there are $4! \sum_{k=0}^{4} \frac{(-1)^k}{k!} = 9$ possibilities. Altogether, there are $9 \cdot 35 = 315$ favorable cases, so the probability is $\frac{315}{5040} = \frac{1}{16}$. □
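The inclusion–exclusion counts above are small enough to confirm by brute force over all $7! = 5040$ mug assignments; a Python sketch (our own check, not part of the text):

from itertools import permutations
from math import factorial

# Distribution of the number of fixed points among all 7! mug assignments.
fixed = [0] * 8
for p in permutations(range(7)):
    fixed[sum(1 for i in range(7) if p[i] == i)] += 1

total = factorial(7)                  # 5040
print(fixed[6] / total)               # i) exactly six players keep their mug: 0.0
print(fixed[0], fixed[0] / total)     # ii) derangements: 1854, about 0.37
print(fixed[3] / total)               # iii) exactly three keep their mug: 0.0625 = 1/16

# Inclusion-exclusion closed form for the derangement number D_7
assert fixed[0] == sum((-1) ** k * factorial(7) // factorial(k) for k in range(8))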
Key to the exercises

13.A.23. The cut vertices are 0, 1, 9, 10; the cut edges are (0, 1), (0, 12), (9, 10).

13.B.3. (3, 1), (3, 2), (3, 4), (3, 5), (3, 6), (6, 1), (6, 2), (6, 4), (6, 5), (5, 1), (5, 2), (5, 4), (4, 1), (4, 2), (2, 1).

13.B.4. (3, 1), (3, 2), (3, 4), (3, 5), (3, 6), (1, 2), (1, 4), (1, 5), (1, 6), (2, 4), (2, 5), ..., (5, 6).

13.B.5. Solution is missing.

13.B.12. It can be shown using the Havel–Hakimi procedure that such a graph indeed exists. However, it cannot be planar: $|V| = 10$, $|E| = 35$, but if it were planar, we would have $3|V| - 6 \ge |E|$, i.e., $24 \ge 35$.

13.B.15. i) Yes. This follows immediately from the Kuratowski theorem ($K_5$ has 10 edges and $K_{3,3}$ has 9). ii) No. Consider $K_5$ or $K_{3,3}$. iii) No. There are many counterexamples, for instance $K_{3,3}$ with another vertex and an edge leading to it. iv) No. Consider $K_5$. v) No. Consider $K_{3,3}$. vi) The same as (ii). vii) No. Consider $C_n$. viii) No. Consider $K_5$. ix) No. Consider $C_n$.

13.B.17. The first code does not represent a tree (it has a proper prefix with the same number of zeros and ones). There is a tree corresponding to the second code.

13.C.4. The procedure is incorrect. As a counterexample, consider a cycle graph with one edge of length two and all other edges of length one.

13.C.5. Applying any algorithm for finding a minimum spanning tree, we find out that the wanted length is 12154 (the spanning tree consists of the edges LPe, LP, LNY, PeT, MCNY).

13.C.6. Solution is missing.

13.D.5. We find a maximum flow of size 15 and the cut [1, 6], [1, 3], [2, 4], [2, 3] of the same capacity.

13.D.7. We know from the theory and the result of the above exercise that the minimum capacity of a cut is 9. There are more maximum flows in the network. For instance, we can set $f(a) = 2$, $f(b) = 4$, $f(c) = 1$, $f(h) = 1$, $f(j) = 4$, $f(f) = 2$, $f(i) = 7$, and $f(v) = 0$ for all other edges $v$ of the graph.

13.E.5. Solution is missing.

13.E.11. $1 - \frac{4! \cdot 4! \cdot 2^4}{8!} = \frac{27}{35}$.

13.E.12. $\frac{49}{54}$.

13.I.4. i) We know from the exercise of subsection 13.4.3 that the generating function of the sequence $(1, 2, 3, 4, \dots)$ is $\frac{1}{(1-x)^2}$.
ii) Since we have (by the previous exercise as well)
\[
\frac{x}{(1-x)^2} \;\overset{\text{o.g.f.}}{\longleftrightarrow}\; (0, 1, 2, 3, \dots),
\]
we have for the derivative of this function
\[
\left(\frac{x}{(1-x)^2}\right)' = \frac{1+x}{(1-x)^3} \;\overset{\text{o.g.f.}}{\longleftrightarrow}\; (1\cdot 1,\; 2\cdot 2,\; 3\cdot 3,\; \dots).
\]
Let us emphasize that this problem could also be solved using the fact that $\frac{1}{(1-x)^3} \overset{\text{o.g.f.}}{\longleftrightarrow} \binom{n+2}{n}$.
iii) We have
\[
\frac{1}{1-x} \leftrightarrow (1, 1, 1, 1, \dots), \quad
\frac{1}{1-2x} \leftrightarrow (1, 2, 4, 8, \dots), \quad
\frac{1}{1-2x^2} \leftrightarrow (1, 0, 2, 0, 4, 0, \dots), \quad
\frac{x}{1-2x^2} \leftrightarrow (0, 1, 0, 2, 0, 4, \dots),
\]
whence we get the result $\frac{1+x}{1-2x^2} \leftrightarrow (1, 1, 2, 2, 4, 4, 8, 8, \dots)$.
iv) We know from the above that $f(x) = \frac{1+x}{(1-x)^3} \leftrightarrow (1^2, 2^2, 3^2, \dots)$, hence $\frac{f(x) - (1+4x)}{x^2} \leftrightarrow (3^2, 4^2, 5^2, \dots)$. Substituting $2x^3$ for $x$, we obtain
\[
\frac{f(2x^3) - (1+8x^3)}{4x^6} \leftrightarrow (9, 0, 0, 2\cdot 16, 0, 0, 4\cdot 25, \dots).
\]
v) If we denote the result of the previous problem by $F(x)$, then the result of this one is $F(x) - x^2 F(x) + \frac{x}{1-x^3}$.

13.I.11. $\frac{x}{1 - 3x + x^2}$.

13.J.1. i) The complete graph on $n$ vertices has $\frac{n(n-1)}{2}$ edges, the cycle graph on $n$ vertices has $n$ edges. Therefore, $\frac{n(n-1)}{2} - n$ edges must be added to the cycle graph. ii) Similarly as above, we get the result $\frac{(m+n)(m+n-1)}{2} - m\cdot n$.

13.J.2. There are five edges whose value is 3: four of them lie on the cycle 23562 and the remaining one is the edge 14. Therefore, they form a disconnected subgraph of the complete graph, so the spanning tree must contain at least one edge of value 2. Thus, the total weight of a maximum spanning tree is at most $4 \cdot 3 + 2 = 14$. And indeed, there exist spanning trees with this weight. We select all the edges of value 3 except for one that lies on the mentioned cycle and connect the resulting components 2356 and 14 with any edge of value 2. There are four such edges. Altogether, there are $4 \cdot 4 = 16$ maximum spanning trees.

13.J.3. The edges of value 1 form a subgraph with two connected components, namely $\{1, 2, 4, 5, 7\}$ and $\{3, 6\}$. Further, there are six edges of value 2 that lead between these two components. Therefore, the total weight of a minimum spanning tree is $5 \cdot 1 + 2 = 7$ (five edges of value 1 and one connecting edge of value 2). Moreover, there are exactly three cycles in the former component, each of length 4, and each of the 6 edges of this component belongs to exactly two of the three cycles. In order to obtain a tree from this component, we must omit two edges, which can be done in $6 \cdot 4 / 2 = 12$ ways. Altogether, we get $12 \cdot 6 = 72$ minimum spanning trees.

13.J.4. 18. 13.J.5. 12. 13.J.6. 16. 13.J.7. 16.

13.J.11. The minimum cut is given by the set $\{Z, A, E\}$. Its value is 32.
13.J.12. The minimum cut is given by the set $\{B, D, S\}$. Its value is 40.
13.J.13. The minimum cut is given by the set $\{F, S, D\}$. Its value is 29.
13.J.14. The minimum cut is given by the set $\{F, S\}$. Its value is 39.

13.J.18. The resulting probability is the ratio of the number of favorable cases to the number of all cases. Clearly, the latter is $6^{12}$. Now, let us compute the number of favorable cases, i.e., the coefficient at $x^{30}$ in $(x + x^2 + x^3 + x^4 + x^5 + x^6)^{12}$. We have
\[
(x + x^2 + \cdots + x^6)^{12} = \left(\frac{x(1-x^6)}{1-x}\right)^{\!12} = x^{12}\left(\frac{1-x^6}{1-x}\right)^{\!12}.
\]
Therefore, we are interested in the coefficient at $x^{18}$ in
\[
\left(\frac{1-x^6}{1-x}\right)^{\!12} = \bigl(1 - 12x^6 + 66x^{12} - 220x^{18} + \cdots\bigr)\cdot\frac{1}{(1-x)^{12}}.
\]
It follows from the generalized binomial theorem that the number of favorable cases is
\[
\binom{29}{11} - 12\binom{23}{11} + 66\binom{17}{11} - 220\binom{11}{11}.
\]

13.J.21. $a_n = k\left(-\frac{1}{2}\right)^n + l$.

13.J.22. Solution is missing.
13.J.23. Solution is missing.
13.J.25. Solution is missing.
13.J.26. Solution is missing.
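The closed forms in 13.J.20 and in the key to 13.J.21 can be confirmed numerically. A brief Python sketch (our own check; the constants k = -2/3, l = 2/3 below are our sample choice fitting the initial values a_0 = 0, a_1 = 1, not from the text):

import cmath

# 13.J.20 i): a_1 = 3, a_2 = 5, a_{n+2} = 4 a_{n+1} - 3 a_n; claim a_n = 2 + 3^(n-1)
a = [3, 5]
for _ in range(20):
    a.append(4 * a[-1] - 3 * a[-2])
assert all(a[n - 1] == 2 + 3 ** (n - 1) for n in range(1, len(a) + 1))

# 13.J.20 ii): a_0 = 0, a_1 = 1, a_{n+2} = 2 a_{n+1} - 4 a_n (complex characteristic roots)
r, s = 1 + cmath.sqrt(-3), 1 - cmath.sqrt(-3)
b = [0, 1]
for _ in range(20):
    b.append(2 * b[-1] - 4 * b[-2])
for n, bn in enumerate(b):
    closed = (r ** n - s ** n) / (2 * cmath.sqrt(-3))
    assert abs(bn - closed) < 1e-6 * max(1.0, abs(closed))

# 13.J.21: a_{n+2} = (a_{n+1} + a_n)/2; general solution k(-1/2)^n + l
k, l = -2 / 3, 2 / 3     # fits a_0 = 0, a_1 = 1 (our sample choice)
c = [0.0, 1.0]
for _ in range(20):
    c.append((c[-1] + c[-2]) / 2)
assert all(abs(c[n] - (k * (-0.5) ** n + l)) < 1e-9 for n in range(len(c)))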
Based on the earlier textbook
Matematika drsně a svižně
Jan Slovák, Martin Panák, Michal Bulant and collaborators
published by Masarykova univerzita in 2013

1st edition, 2013
500 copies

Typography, LaTeX and more: Tomáš Janoušek
Print: Tiskárna Knopp, Černčice 24, 549 01 Nové Město nad Metují