E7441: Scientific computing
Vlad Popovici, Ph.D. (RECETOX)

    There is nothing more practical than a good theory.
        Kurt Lewin (1890–1947)

Bibliography
- KONG Q., SIAUW T., BAYEN A. (2020). Python Programming and Numerical Methods. Academic Press. ISBN: 9780128195499
- HEATH M.T. (2002). Scientific Computing: An Introductory Survey. 2nd edition. McGraw-Hill. ISBN: 0-07-239910-4. Good accompanying materials (slides and demos) at https://heath.cs.illinois.edu/scicomp/notes/index.html; used as the basis for the first part of the course.
- KEPNER J. (2009). Parallel Matlab for Multicore and Multinode Computers. SIAM Publishing. ISBN: 978-0-898716-73-3
- GENTLE J.E. (2005). Elements of Computational Statistics. Springer. ISBN: 978-0387954899

Computing environments for the course
- Python 3, https://www.python.org, with the NumPy and SciPy packages; JupyterLab is recommended for the exercises. Suggestion: install Python and the related packages through a distribution such as Anaconda or Mamba for easier management of dependencies.
- R, http://www.r-project.org, an "environment for statistical computing and graphics".

WARNING: some pieces of code shown during the course may not be the optimal implementation in the given language. They are merely a device for demonstrating some principles.

Scientific computing
Wikipedia: "Computational science (also scientific computing or scientific computation) is concerned with constructing mathematical models and quantitative analysis techniques and using computers to analyze and solve scientific problems."
Basically: find numerical solutions to mathematically formulated problems.

Well-posed problems
(J. Hadamard) A problem is well posed if its solution
- exists,
- is unique, and
- changes continuously with the initial conditions;
otherwise, it is ill posed. Inverse problems are often ill posed. Example: recovering a 3D scene from its 2D projection.

Moving from the continuous to the discrete domain, even a well-posed problem can be ill-conditioned: small errors in the input lead to large variations in the solution. Conditioning can be improved by regularization.

General computational approach
- continuous domain → discrete domain
- infinite → finite
- differential → algebraic
- nonlinear → (combination of) linear
- accept approximate solutions, but control the error

Approximations
- Modeling approximations:
  ▶ a "model" is an approximation of nature
  ▶ data: inexact measurements or results of previous computations
- Implementation/computational approximations:
  ▶ discretization of the continuous domain; truncation
  ▶ rounding
- Errors in the input data and errors propagated by the algorithm together determine the accuracy of the final result.
Example: the area of the Earth
- model: a sphere, A = 4πr²
- r = ? (inexact measurement)
- π = 3.14159… (truncated)
- rounded arithmetic

Errors
- absolute error = approximate value (x̂) − true value (x)
- relative error = absolute error / true value, so that
    approximate value = (1 + relative error) × (true value)
- if the relative error is ~10⁻ᵈ, then x̂ has about d exact digits: there exists τ = ±(0.0…0 n_{d+1} n_{d+2} …) such that x̂ = x + τ
- the true value is usually not known → use estimates or bounds on the error
- the relative error can also be taken relative to the approximate value

Example/exercise - Homework!
Stirling's approximation for factorials:
    S_n = √(2πn) (n/e)ⁿ ≈ n!,  n = 1, 2, …
where e = exp(1). Compute the relative error (S_n − n!)/n! for increasing n.

Errors: data and computational
Compute f(x) for f: ℝ → ℝ, where
▶ x ∈ ℝ is the true input and f(x) the true/desired result,
▶ x̂ is the approximate input and f̂ the approximate computation.
Total error:
    f̂(x̂) − f(x) = (f̂(x̂) − f(x̂)) + (f(x̂) − f(x)) = computational error + propagated data error
The algorithm has no effect on the propagated data error.

The computational error is the sum of:
- truncation error = (true result) − (result of the algorithm using exact arithmetic). Example: considering only the first terms of an infinite Taylor series; stopping an iteration before convergence.
- rounding error = (result of the algorithm using exact arithmetic) − (result of the algorithm using limited-precision arithmetic). Example: π ≈ 3.14 or π ≈ 3.141593.

Finite difference approximation
    f′(x) = lim_{h→0} (f(x+h) − f(x))/h ≈ (f(x+h) − f(x))/h,  for some small h > 0
- truncation error: |f′(x) − (f(x+h) − f(x))/h| ≤ Mh/2, where |f″(t)| ≤ M for t in a small neighborhood of x (HOMEWORK)
- rounding error: 2ϵ/h, where ϵ is the machine precision
- the total error is minimized for h ≈ 2√(ϵ/M) (a numerical check appears at the end of this part)

Figure: total computational error as a tradeoff between truncation and rounding error (from Heath, Scientific Computing).

Error analysis
For y = f(x), with f: ℝ → ℝ, an approximate result ŷ is obtained.
- forward error: Δy = ŷ − y
- backward error: Δx = x̂ − x, where f(x̂) = ŷ

    x  ──f──→  y = f(x)
    |              |
    | backward     | forward
    | error        | error
    |              |
    x̂  ──f──→  ŷ = f̂(x) = f(x̂)

Example: compute f(x) = eˣ for x = 1, using the first 4 terms of the Taylor expansion:
    f̂(x) = 1 + x + x²/2 + x³/6
Take the "true" value f(1) = 2.718282 and compute f̂(1) = 2.666667; then
- forward error: |Δy| = 0.051615, i.e. a relative forward error of about 2%
- backward error: x̂ = ln f̂(1) = 0.980829 ⇒ |Δx| = 0.019171, i.e. a relative backward error of about 2%
These are two perspectives on assessing the accuracy (see the numerical check at the end of this part).

Exercise
Consider the Taylor series with limit e:
    Σ_{n=0}^{∞} 1/n! = e
How many terms are needed for an approximation of e to three decimal places?

Backward error analysis
- idea: the approximate result is the exact solution of a modified problem
- how far from the original problem is the modified version?
- how much error in the input data would explain all the error in the result?
- an approximate solution is good if it is the exact solution of a nearby problem
- backward analysis is usually easier
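As a numerical check of the forward/backward error example above, a minimal sketch in plain Python (standard library only):

    import math

    x = 1.0
    y = math.exp(x)                    # "true" value f(x) = e
    y_hat = 1 + x + x**2/2 + x**3/6    # 4-term Taylor approximation: 2.666667

    forward = y_hat - y                # Δy, about -0.051615
    x_hat = math.log(y_hat)            # the input that f maps exactly onto ŷ
    backward = x_hat - x               # Δx, about -0.019171

    print(forward, backward)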
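The truncation/rounding tradeoff in the finite-difference formula can also be observed numerically. A small sketch, assuming f = exp at x = 1 (so M ≈ e and the predicted optimum is h ≈ 2√(ϵ/M) ≈ 10⁻⁸):

    import numpy as np

    x = 1.0
    hs = np.logspace(-15, -1, 15)               # step sizes 1e-15 ... 1e-1
    approx = (np.exp(x + hs) - np.exp(x)) / hs  # forward differences
    err = np.abs(approx - np.exp(x))            # total error vs f'(x) = e

    print(hs[np.argmin(err)])                   # smallest error near h ~ 1e-8

The error first decreases as h shrinks (truncation dominates) and then grows again as rounding takes over, reproducing the tradeoff shown in the figure above.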
Sensitivity and conditioning
- insensitive (well-conditioned) problem: relative changes in the input data cause similar relative changes in the result
- large changes in the solution for small changes in the input data indicate a sensitive (ill-conditioned) problem
- condition number:
    cond = |relative change in solution| / |relative change in input data| = |Δy/y| / |Δx/x|
- if cond ≫ 1, the problem is sensitive

The condition number is a scale factor for the error:
    relative forward error = cond × relative backward error
Usually only upper bounds on the condition number can be estimated, cond ≤ C, hence
    relative forward error ≤ C × relative backward error

For x̂ = x + Δx:
- forward error: f(x + Δx) − f(x) ≈ f′(x)Δx, for small enough Δx
- relative forward error: ≈ f′(x)Δx / f(x)
- hence cond ≈ |x f′(x) / f(x)|

Example: the tangent function is sensitive in the neighborhood of π/2:
    tan(1.57079) ≈ 1.58058 × 10⁵;  tan(1.57078) ≈ 6.12490 × 10⁴
    for x = 1.57079, cond ≈ 2.48275 × 10⁵

Stability
- an algorithm is stable if it is relatively insensitive to perturbations during the computation
- stability of algorithms is analogous to conditioning of problems
- backward view: an algorithm is stable if the result it produces is the exact solution of a nearby problem
- for a stable algorithm, the effect of the computational error is no worse than the effect of a small error in the input data

Accuracy
- accuracy: closeness of the computed result to the true solution of the problem
- depends on the conditioning of the problem AND on the stability of the algorithm
- stable algorithm + well-conditioned problem = accurate results

Number representation
- internally, all data are represented in binary format (each digit is either 0 or 1, e.g. 1011001…)
- units: bit, nybble (4 bits), byte (8 bits); the word size is architecture-specific: 1, 2, 4, or 8 bytes
- integers:
  ▶ unsigned (≥ 0): on n bits, the range is 0, …, 2ⁿ − 1. The stored representation (for 1 byte) is b₇b₆b₅b₄b₃b₂b₁b₀, for the value x = Σ_{i=0}^{7} b_i 2ⁱ.
  ▶ signed: one bit determines the sign; the range is −2ⁿ⁻¹, …, 0, …, 2ⁿ⁻¹ − 1. In the usual two's-complement representation (for 1 byte), b₇b₆b₅b₄b₃b₂b₁b₀ encodes the value x = b₇(−2⁷) + Σ_{i=0}^{6} b_i 2ⁱ.

Floating-point numbers
- as in scientific notation: mantissa × radix^exponent, e.g. 2.35 × 10³
- formally,
    x = ±(b₀ + b₁/β + b₂/β² + ⋯ + b_{p−1}/β^{p−1}) × β^E
  where
  ▶ β is the radix (or base),
  ▶ p is the precision,
  ▶ L ≤ E ≤ U are the limits of the exponent,
  ▶ 0 ≤ b_k ≤ β − 1.
- mantissa: m = b₀b₁…b_{p−1}; fraction: b₁b₂…b_{p−1}
- the sign, mantissa, and exponent are stored in fixed-sized fields (the radix is implicit for a given system; usually β = 2); a sketch for inspecting these fields follows below
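To make these fixed-sized fields concrete, a small illustrative sketch (the helper float_bits is not part of any library) that prints the stored bits of a Python float, which is an IEEE 754 double; the exact 64-bit layout is described next:

    import struct

    def float_bits(x: float) -> str:
        """Raw 64 bits of an IEEE 754 double, most significant bit first."""
        (u,) = struct.unpack(">Q", struct.pack(">d", x))
        return f"{u:064b}"

    b = float_bits(-6.5)           # -6.5 = -1.625 x 2^2
    print(b[0], b[1:12], b[12:])   # sign | biased exponent | fraction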
Normalization
- b₀ ≠ 0 for all x ≠ 0, i.e. the mantissa satisfies 1 ≤ m < β
- ensures a unique representation and optimal use of the available bits

Internal representation (on 64 bits, "double precision", binary):
    x = sign | exponent | fraction = b₆₃ | b₆₂ … b₅₂ | b₅₁ … b₀

Properties
- only a finite number of discrete values can be represented
- the total number of floating-point numbers representable in normalized format is
    2(β − 1)β^{p−1}(U − L + 1) + 1
- underflow level (smallest positive normalized number): UFL = β^L
- overflow level (largest number): OFL = β^{U+1}(1 − β^{−p})
- not all real numbers can be represented exactly:
  ▶ those that can are called machine numbers
  ▶ the others have to be rounded → rounding error

Example: for β = 2, p = 3, L = −1, U = 1, there are 25 distinct representable numbers:
- UFL = 0.5₁₀; OFL = 3.5₁₀
- note the non-uniform coverage of the real line
- for x ∈ ℝ, fl(x) denotes its floating-point representation; x − fl(x) is the rounding error

Rounding rules
- chop (round toward zero): truncate the base-β representation after the (p−1)st digit
- round to nearest: fl(x) is the machine number closest to x

Machine precision
- machine precision, ϵ_mach:
  ▶ with chopping: ϵ_mach = β^{1−p}
  ▶ with rounding to nearest: ϵ_mach = ½ β^{1−p}
- also called unit roundoff: the smallest number ϵ such that fl(1 + ϵ) > 1
- maximum relative error of representation:
    |fl(x) − x| / |x| ≤ ϵ_mach
- usually 0 < UFL < ϵ_mach < OFL

Machine precision - example
- for β = 2, p = 3: ϵ_mach = (0.01)₂ = (0.25)₁₀ with chopping; ϵ_mach = (0.001)₂ = (0.125)₁₀ with rounding to nearest
- the usual case (IEEE floating-point systems):
  ▶ ϵ_mach = 2⁻²⁴ ≈ 10⁻⁷ in single precision
  ▶ ϵ_mach = 2⁻⁵³ ≈ 10⁻¹⁶ in double precision
  → about 7 and 16 decimal digits of precision, respectively (hence, in R: p-value < 2.2e−16)

Gradual underflow
- to improve the representation of numbers around 0, subnormal (or denormalized) numbers are used
- when the exponent is at its minimum, leading digits are allowed to be 0
- subnormals are less precise → gradual underflow

Special values (IEEE standard)
- Inf: infinity, e.g. the result of 1/0
- NaN ("not a number"), e.g. the result of 0/0 or Inf/Inf
- both use special bit patterns in the exponent field

Floating-point arithmetic
- addition/subtraction: denormalization might be required:
    3.52 × 10³ + 1.97 × 10⁵ = 0.0352 × 10⁵ + 1.97 × 10⁵ = 2.0052 × 10⁵
  → this might cause the loss of some digits
- multiplication/division: the result may not be representable
- overflow is more serious than underflow: there is no reasonable approximation for numbers that are too large, while an underflowed result can often be approximated by 0
- in floating-point arithmetic, addition and multiplication are commutative but not associative: if ϵ is slightly smaller than ϵ_mach, then (1 + ϵ) + ϵ = 1, but 1 + (ϵ + ϵ) > 1 (see the sketch below)
- ideally, x flop y = fl(x op y); the IEEE standard ensures this for results within range
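A quick double-precision demonstration of this non-associativity. Note that NumPy's eps is the spacing between 1 and the next representable number (β^{1−p} = 2⁻⁵²), so a quantity just below the round-to-nearest threshold is eps/2:

    import numpy as np

    eps = np.finfo(np.float64).eps   # 2**-52
    e = eps / 2                      # 1 + e rounds back to 1 (round to nearest/even)

    print((1.0 + e) + e == 1.0)      # True: each addition is rounded away
    print(1.0 + (e + e) > 1.0)       # True: e + e = eps is exactly representable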
Example: a divergent series
    Σ_{n=1}^{∞} 1/n = ∞
In floating-point arithmetic the computed sum of the series is finite; depending on the system, this is because:
▶ after a while, the partial sum overflows, or
▶ 1/n underflows, or
▶ for all n such that 1/n < ϵ_mach × Σ_{k=1}^{n−1} 1/k, the sum no longer changes.

Cancellation
- subtracting two numbers of similar magnitude usually cancels the most significant digits:
    1.92403 × 10² − 1.92275 × 10² = 1.28000 × 10⁻¹
  → only 3 significant digits remain
- let ϵ > 0 be slightly smaller than ϵ_mach; then (1 + ϵ) − (1 − ϵ) yields 0 in floating-point arithmetic, instead of the correct 2ϵ

Cancellation - example
For the quadratic equation ax² + bx + c = 0, the two solutions are given by
    x₁,₂ = (−b ± √(b² − 4ac)) / (2a)
Problems:
- for very large/small coefficients, the terms b² or 4ac may over-/underflow → rescale the coefficients by max{|a|, |b|, |c|}
- cancellation between −b and √(b² − 4ac) can be avoided by computing one root as
    x = 2c / (−b ∓ √(b² − 4ac))
Exercise: let x₁ = 2000 and x₂ = 0.05 be the roots of a quadratic equation. Compute the coefficients and then use the formulas above to retrieve the roots. Try the numpy.roots() function in Python. (A sketch of the cancellation-free computation is given at the end of this section.)

Cancellation - another example
P(X) = (X − 1)⁶ = X⁶ − 6X⁵ + 15X⁴ − 20X³ + 15X² − 6X + 1. What happens around X = 1?

    import matplotlib.pyplot as plt
    import numpy as np

    epsilon = [0.01, 0.005, 0.001]
    for k in range(3):
        x = np.linspace(1 - epsilon[k], 1 + epsilon[k], 100)
        # expanded form: suffers from cancellation near x = 1
        px = x**6 - 6*x**5 + 15*x**4 - 20*x**3 + 15*x**2 - 6*x + 1
        # factored form: numerically well behaved
        px0 = (x - 1)**6
        plt.subplot(2, 3, k + 1)
        plt.plot(x, px, '-b', x, np.zeros(100), '-r')
        plt.axis([1 - epsilon[k], 1 + epsilon[k], -max(abs(px)), max(abs(px))])
        plt.subplot(2, 3, k + 4)
        plt.plot(x, px0, '-b', x, np.zeros(100), '-r')
        plt.axis([1 - epsilon[k], 1 + epsilon[k], -max(abs(px0)), max(abs(px0))])
    plt.show()

…mathematically equivalent, but numerically different…

(very small) Project
Study the paper: Moler C., Morrison D. (1983). Replacing square roots by Pythagorean sums. IBM J. Res. Develop. 27(6). Then implement the proposed method and compare it with the naive sqrt()-based approach.

In Python...
- basic (and not only basic) numerical functions are in the numpy package
- ϵ_mach is returned by:
  ▶ single precision: np.finfo(np.float32).eps gives 1.1920929e-07 = 2⁻²³
  ▶ double precision: np.finfo(np.float64).eps gives 2.220446049250313e-16 = 2⁻⁵²
  (note: NumPy reports the spacing between 1 and the next float, β^{1−p}, i.e. twice the round-to-nearest unit roundoff defined earlier)
- to obtain the smallest or largest single/double precision numbers, use np.finfo(np.float32).min, np.finfo(np.float32).max, np.finfo(np.float64).min, np.finfo(np.float64).max
- the special constants np.inf and np.nan are also available
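Returning to the quadratic-equation exercise above, a minimal sketch of the cancellation-free computation (equivalent to the 2c/(−b ∓ √(b² − 4ac)) form; real roots are assumed, and quad_roots is just an illustrative name):

    import math

    def quad_roots(a, b, c):
        # pick the root formula in which -b and the square root carry the
        # same sign, so that no cancellation occurs
        d = math.sqrt(b*b - 4*a*c)               # assumes b^2 - 4ac >= 0
        x1 = (-b - math.copysign(d, b)) / (2*a)
        x2 = c / (a * x1)                        # Vieta's formula: x1*x2 = c/a
        return x1, x2

    # roots 2000 and 0.05  =>  a = 1, b = -2000.05, c = 100
    print(quad_roots(1.0, -2000.05, 100.0))

Comparing with the naive ± formula typically shows the smaller root losing digits to cancellation, while the version above recovers both roots to full precision.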