PV027 Optimization
Tomáš Brázdil

Resources & Prerequisites
Resources:
▶ Lectures & tutorials (the main resources)
▶ Books:
Joaquim R. R. A. Martins and Andrew Ning. Engineering Design Optimization. Cambridge University Press, 2021. ISBN: 9781108833417.
Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 2006. ISBN: 0387303030.
We shall need elementary knowledge and understanding of
▶ linear algebra in R^n: operations with vectors and matrices, bases, diagonalization;
▶ multi-variable calculus (i.e., in R^n): partial derivatives, gradients, Hessians, Taylor's theorem.
We will refresh our memories during lectures and tutorials.

Evaluation
Oral exam: you will get a manual describing the knowledge necessary for grade E and better. There might be homework assignments that you may discuss at tutorials, but (for this year) there is no mandatory homework. Please be aware that this is a difficult math-based course.

What is Optimization
Merriam-Webster: An act, process, or methodology of making something (such as a design, system, or decision) as perfect, functional, or effective as possible; specifically, the mathematical procedures (such as finding the maximum of a function) involved in this.
Britannica: A collection of mathematical principles and methods for solving quantitative problems in many disciplines, including physics, biology, engineering, economics, and business.
Historically, (mathematical/numerical) optimization has been called mathematical programming.

Optimization
People optimize in
▶ scheduling (transportation, education, ...),
▶ investments (portfolio management, utility maximization, ...),
▶ industrial design (aerodynamics, electrical engineering, ...),
▶ sciences (molecular modeling, computational systems biology, ...),
▶ machine learning.

Optimization Algorithms (overview figure)

Design Optimization Process (flowchart figure)
A bit naive example:
▶ Consider a company with several plants producing a single product, but with different efficiency.
▶ The goal is to set the production of each plant so that the demand for the goods is satisfied, but overproduction is minimized.
▶ First try: Model each plant's production and maximize the total production efficiency. This leads to a solution where only the most efficient plant produces.
▶ However, beyond a certain level of demand, no single plant can satisfy the demand, so we introduce constraints on the maximum production of the plants. The solution then maximizes the production of the most efficient plant, then of the second most efficient one, etc.
▶ Then you notice that all plant employees must work.
▶ Then you start solving transportation problems depending on the location of the plants.
▶ ...
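As a toy illustration of how adding capacity constraints changes the solution, the production-planning story above can be phrased as a small linear program. This is a minimal sketch; the per-unit costs (inverse efficiencies), capacities, and demand below are made-up numbers, and scipy.optimize.linprog is used as the solver.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: cost per unit produced (lower = more efficient) and capacity.
cost = np.array([1.0, 1.5, 2.0])        # plant 1 is the most efficient
capacity = np.array([40.0, 30.0, 30.0])
demand = 60.0

# minimize cost @ x  subject to  x_1 + ... + x_n >= demand,  0 <= x_i <= capacity_i.
# linprog expects A_ub @ x <= b_ub, so the demand constraint is written as -sum(x) <= -demand.
res = linprog(c=cost,
              A_ub=-np.ones((1, 3)), b_ub=[-demand],
              bounds=list(zip(np.zeros(3), capacity)))
print(res.x)  # expected [40., 20., 0.]: the most efficient plants fill the demand first
```

Without the capacity bounds, the optimizer would indeed push all production to plant 1, exactly as the narrative predicts.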
Optimization Problem Formulation
1. Describe the problem
▶ Problem formulation is vital since the optimizer exploits any weaknesses in the model formulation.
▶ You might get the "right answer to the wrong question."
▶ The problem description is typically informal at the beginning.
2. Gather information
▶ Identify possible inputs/outputs.
▶ Gather data and identify the analysis procedure.
3. Define the design variables
▶ Identify the quantities that describe the system: x ∈ R^n (certain characteristics of the system, such as position, investments, etc.).
▶ The variables are supposed to be independent; the optimizer must be free to choose the components of x independently.
▶ The choice of variables is typically not unique (e.g., a square can be described by its side or by its area).
▶ The variables may affect the functional form of the objective and constraints (e.g., linear vs. non-linear).
4. Define the objective
▶ The function that determines whether one design is better than another.
▶ Must be a scalar computable from the variables: f : R^n → R (e.g., profit, time, potential energy, etc.).
▶ The objective function is either maximized or minimized, depending on the application.
▶ The choice is not always obvious: e.g., minimizing just the weight of a vehicle might result in a vehicle too expensive to manufacture.
5. Define the constraints
▶ Prescribe the allowed values of the variables.
▶ May have the general form c(x) ≤ 0, c(x) ≥ 0, or c(x) = 0, where c : R^n → R is a function of the variables (e.g., time cannot be negative, the amount of money to invest is bounded).

Modelling and Optimization
The optimization problem consists of
▶ variables,
▶ objective,
▶ constraints.
The above components constitute a model.
Modelling is concerned with model building; optimization with maximization/minimization of the objective for a given model. We concentrate on the optimization part, but keep in mind that it is intertwined with modelling.
The Optimization Problem (OP): Find a setting of the variables that maximizes/minimizes the objective while satisfying the constraints.
An Optimization Algorithm (OA) solves the above problem and provides a solution: a setting of the variables satisfying the constraints and minimizing/maximizing the objective.

Optimization Problems

Optimization Problem Formally
Denote by
f : R^n → R an objective function,
x a vector of real variables,
g_1, ..., g_{n_g} inequality constraint functions g_i : R^n → R,
h_1, ..., h_{n_h} equality constraint functions h_j : R^n → R.
The optimization problem is to
minimize f(x) by varying x
subject to g_i(x) ≤ 0, i = 1, ..., n_g,
h_j(x) = 0, j = 1, ..., n_h.

Optimization Problem - Example
f(x_1, x_2) = (x_1 − 2)^2 + (x_2 − 1)^2
g_1(x_1, x_2) = x_1^2 − x_2
g_2(x_1, x_2) = x_1 + x_2 − 2
The optimization problem is
minimize (x_1 − 2)^2 + (x_2 − 1)^2
subject to x_1^2 − x_2 ≤ 0, x_1 + x_2 − 2 ≤ 0.
A contour of f is defined, for some c ∈ R, by {x ∈ R^n | f(x) = c}.

Constraints
Consider the constraints
g_i(x) ≤ 0, i = 1, ..., n_g,
h_j(x) = 0, j = 1, ..., n_h.
Define the feasibility region by
F = {x | g_i(x) ≤ 0, h_j(x) = 0, i = 1, ..., n_g, j = 1, ..., n_h}.
x ∈ F is feasible; x ∉ F is infeasible.
Note that constraints of the form g_i(x) ≥ 0 can easily be transformed into the inequality constraints −g_i(x) ≤ 0.
x* ∈ F is now a constrained minimizer if f(x*) ≤ f(x) for all x ∈ F.
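A quick way to sanity-check a small constrained problem like the example above is to hand it to a general-purpose solver. The sketch below solves it with SciPy's SLSQP method; note that scipy.optimize.minimize expects inequality constraints in the form c(x) ≥ 0, i.e., with the opposite sign of our g_i(x) ≤ 0 convention.

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2)**2 + (x[1] - 1)**2

# SciPy's convention is c(x) >= 0, so each g_i(x) <= 0 is negated.
cons = ({'type': 'ineq', 'fun': lambda x: -(x[0]**2 - x[1])},
        {'type': 'ineq', 'fun': lambda x: -(x[0] + x[1] - 2)})

res = minimize(f, x0=np.array([0.0, 0.0]), method='SLSQP', constraints=cons)
print(res.x)  # x ~ [1, 1]; both constraints are active at this solution
```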
Constraints
Inequality constraints g_i(x) ≤ 0 can be active or inactive:
active at x: g_i(x) = 0,
inactive at x: g_i(x) < 0.

More Practical Example
The problem formulation:
▶ A company has two chemical factories F_1 and F_2, and a dozen retail outlets R_1, ..., R_12.
▶ Each F_i can produce a maximum of a_i tons of a chemical each week.
▶ Each retail outlet R_j demands at least b_j tons.
▶ The cost of shipping one ton from F_i to R_j is c_ij.
The problem: Determine how much each factory should ship to each outlet to satisfy the requirements and minimize cost.
Variables: x_ij for i = 1, 2 and j = 1, ..., 12. Each x_ij (intuitively) corresponds to the tons shipped from F_i to R_j.
The objective:
min Σ_ij c_ij x_ij
subject to
Σ_{j=1}^{12} x_ij ≤ a_i, i = 1, 2,
Σ_{i=1}^{2} x_ij ≥ b_j, j = 1, ..., 12,
x_ij ≥ 0, i = 1, 2, j = 1, ..., 12.
The above is a linear programming problem, since both the objective and the constraint functions are linear.

Discrete Optimization
In our original optimization problem definition, we consider real (continuous) variables. Sometimes, we need to assume discrete values. For example, in the previous example, the factories may produce tractors. In such a case, it does not make sense to produce 4.6 tractors.
Usually, an integer constraint is added, such as x_i ∈ Z, which restricts x_i to integer values. This leads to so-called integer programming. Discrete optimization problems have discrete and finite variables.
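The transportation problem with an integrality requirement added (think tractors instead of tons) becomes an integer linear program. Below is a minimal sketch using scipy.optimize.milp (available in SciPy ≥ 1.9), shrunk to 2 factories and 3 outlets with made-up data so the constraint matrices stay readable.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Hypothetical data: 2 factories, 3 outlets (instead of 12, to keep A small).
a = np.array([70, 50])                      # factory capacities
b = np.array([30, 40, 30])                  # outlet demands
c = np.array([[4, 6, 9],                    # c[i, j]: cost of shipping F_i -> R_j
              [5, 4, 7]], dtype=float)

n_f, n_r = c.shape
# Flatten x_ij row by row: x = (x_11, x_12, x_13, x_21, x_22, x_23).
A_cap = np.kron(np.eye(n_f), np.ones(n_r))  # rows select sum_j x_ij  (<= a_i)
A_dem = np.kron(np.ones(n_f), np.eye(n_r))  # rows select sum_i x_ij  (>= b_j)
constraints = [LinearConstraint(A_cap, ub=a),
               LinearConstraint(A_dem, lb=b)]

res = milp(c=c.ravel(),
           constraints=constraints,
           integrality=np.ones(n_f * n_r),  # 1 marks a variable as integer
           bounds=Bounds(lb=0))
print(res.x.reshape(n_f, n_r))
```

Dropping the integrality argument recovers the plain linear program from the slide.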
Wing Design Example
Our goal is to design the wing shape of an aircraft. Assume a rectangular wing. The parameters are called the span b and the chord c. However, two other variables are often used in aircraft design: the wing area S and the wing aspect ratio AR. It holds that
S = bc, AR = b^2/S.
What exactly are the objectives and constraints?
Our objective function is the power required to keep level flight:
f(b, c) = Dv/η
Here,
▶ D is the drag, i.e., the aerodynamic force that opposes the aircraft's motion through the air;
▶ η is the propulsive efficiency, i.e., the efficiency with which the energy contained in the vehicle's fuel is converted into kinetic energy of the vehicle;
▶ v is the lift velocity, i.e., the velocity needed to lift the aircraft, which depends on its weight.
For illustration, let us look at the lift velocity v. In level flight, the aircraft must generate enough lift L to equal its weight W, that is, L = W. The weight partially depends on the wing area:
W = W_0 + W_S S
Here S = bc is the wing area, and W_0 is the payload weight. The lift can be approximated using the following formula:
L = q · C_L · S
where q = (1/2) ϱ v^2 is the fluid dynamic pressure, ϱ is the air density, and C_L is a lift coefficient (depending on the wing shape). Thus, we may obtain the lift velocity as
v = sqrt(2W/(ϱ C_L S)) = sqrt(2(W_0 + W_S bc)/(ϱ C_L bc)).
Similarly, various physics-based arguments provide approximations of the drag D and the propulsive efficiency η.
The drag D = D_i + D_f is the sum of the induced and viscous drag. The induced drag can be approximated by
D_i = W^2/(q π b^2 e)
Here, e is the Oswald efficiency factor, a correction factor representing the change in drag with the lift of a wing, as compared with an ideal wing having the same aspect ratio. The viscous drag can be approximated by
D_f = k C_f q (2.05 S)
Here, k is the form factor (accounting for the pressure drag), 2.05 S approximates the wetted area of the wing, and C_f is the skin friction coefficient, which can be approximated by
C_f = 0.074/Re^0.2
where Re is the Reynolds number, which characterizes the air flow patterns around the wing and is defined as
Re = ϱvc/µ
Here µ is the air dynamic viscosity.
The propulsive efficiency η can be roughly approximated by the Gaussian efficiency curve
η = η_max exp(−(v − v̄)^2/(2σ^2))
Here, v̄ is the peak propulsive efficiency velocity, and σ is the standard deviation of the efficiency curve.
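Putting the pieces together, the required power can be evaluated numerically. The sketch below implements f(b, c) = Dv/η from the formulas above; all constants (W_0, W_S, ϱ, C_L, e, k, µ, η_max, v̄, σ) are illustrative guesses, not values from the lecture.

```python
import numpy as np

# Illustrative constants (assumptions, not lecture values): payload weight,
# wing weight per unit area, air density, lift coefficient, Oswald factor,
# form factor, viscosity, and the Gaussian propulsive-efficiency parameters.
W0, WS = 1000.0, 8.0                        # N, N/m^2
rho, CL, e, k, mu = 1.2, 0.4, 0.8, 1.2, 1.8e-5
eta_max, v_bar, sigma = 0.85, 20.0, 10.0

def required_power(b, c):
    """Power required for level flight of a rectangular wing: f(b, c) = D*v/eta."""
    S = b * c                                # wing area
    W = W0 + WS * S                          # total weight
    v = np.sqrt(2 * W / (rho * CL * S))      # lift velocity from L = W
    q = 0.5 * rho * v**2                     # dynamic pressure
    Re = rho * v * c / mu                    # Reynolds number
    Cf = 0.074 / Re**0.2                     # skin-friction coefficient
    D = W**2 / (q * np.pi * b**2 * e) + k * Cf * q * 2.05 * S  # induced + viscous
    eta = eta_max * np.exp(-(v - v_bar)**2 / (2 * sigma**2))   # propulsive efficiency
    return D * v / eta

print(required_power(b=10.0, c=1.0))
```

Sweeping required_power over a grid of (b, c) values reproduces the kind of contour plot shown on the next slide.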
Wing Design Example
The objective function contours (figure): the engineers would refuse this solution, as the aspect ratio is much higher than typically seen in airplanes, which adversely affects the structural strength. Add constraints!
With an added constraint on the bending stress at the root of the wing (figure), it looks like a reasonable wing...

Optimization Problem Classification (classification tree figure)
▶ Continuous allows only x_i ∈ R; discrete allows only x_i ∈ Z; mixed allows variables of both kinds.
▶ Single-objective: f : R^n → R; multi-objective: f : R^n → R^m.
▶ Unconstrained: no constraints, just the objective function.

Smoothness
We consider various classes of problems depending on the smoothness properties of the objective/constraint functions:
▶ C0: continuous functions. Continuity allows us to estimate values in small neighborhoods. Discontinuous functions exist.
▶ C1: continuous first derivatives. The derivative gives information about the slope. If it is continuous, the slope changes smoothly, allowing us to estimate it locally. Non-differentiable continuous functions, and differentiable functions with discontinuous derivatives, exist.
▶ C2: continuous second derivatives. The second derivative informs about curvature. Continuously differentiable functions without second derivatives, and twice differentiable functions with discontinuous second derivatives, exist.

Examples:
f(x) = |x| is continuous; f is not differentiable at 0.
f(x) = x|x| is differentiable on R with f'(x) = 2|x|; f' is not differentiable at 0, so f has no second derivative there.
f(x) = x^2 sin(1/x) for x ≠ 0, f(0) = 0:
f'(x) = 2x sin(1/x) − cos(1/x) for x ≠ 0, f'(0) = 0.
f is differentiable on R, but f' is not continuous at 0.
f(x) = x^4 sin(1/x) for x ≠ 0, f(0) = 0:
f is differentiable on R, with
f'(x) = 4x^3 sin(1/x) − x^2 cos(1/x) for x ≠ 0, f'(0) = 0.
f' is differentiable on R, with
f''(x) = 12x^2 sin(1/x) − 6x cos(1/x) − sin(1/x) for x ≠ 0, f''(0) = 0.
Clearly, f'' does not have a limit at 0, as sin(1/x) oscillates between −1 and 1, and thus f'' is not continuous.
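The C1 counterexample above is easy to probe numerically: sampling f'(x) = 2x sin(1/x) − cos(1/x) ever closer to 0 shows values that keep oscillating in [−1, 1] instead of settling toward f'(0) = 0. The snippet below is just this experiment.

```python
import numpy as np

fprime = lambda x: 2 * x * np.sin(1 / x) - np.cos(1 / x)

# Sample the derivative at points approaching 0: the values do not converge;
# they keep oscillating, dominated by the -cos(1/x) term.
for x in 10.0 ** -np.arange(1, 8):
    print(f"x = {x:.0e}   f'(x) = {fprime(x):+.4f}")
```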
Linearity
Linear programming: both the objective and the constraints are linear. It can be solved precisely, efficiently, and in rational numbers (see linear programming later).

Multimodality
Denote by F the feasibility set.
x* is a (weak) local minimiser if there is ε > 0 such that f(x*) ≤ f(x) for all x ∈ F satisfying ||x* − x|| ≤ ε.
x* is a (weak) global minimiser if f(x*) ≤ f(x) for all x ∈ F.
A global/local minimiser is strict if the inequality is strict (for x ≠ x*).
Unimodal functions have a single global minimiser in F; multimodal functions have multiple local minimisers in F.

Convexity
S ⊆ R^n is a convex set if the straight line segment connecting any two points in S lies entirely inside S. Formally, for any two points x ∈ S and y ∈ S, we have
αx + (1 − α)y ∈ S for all α ∈ [0, 1].
f is a convex function if its domain is a convex set and if, for any two points x and y in this domain, the graph of f lies below (or on) the straight line connecting (x, f(x)) to (y, f(y)) in the space R^{n+1}. That is,
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) for all α ∈ (0, 1).
A standard form convex optimization problem assumes
▶ a convex objective f and convex inequality constraint functions g_i,
▶ affine equality constraint functions h_j.
Implications:
▶ Every local minimum is a global minimum.
▶ If the above inequality is strict for all x ≠ y, then there is a unique minimum.

Stochasticity
Sometimes, the parameters of a model cannot be specified with certainty. For example, in the transportation model, customer demand cannot be predicted precisely in practice. However, such parameters may often be statistically estimated and modeled using an appropriate probability distribution.
The stochastic optimization problem is to minimize/maximize the expectation of a statistic parametrized by the variables x:
Find x maximizing E f(x; W)
Here, W is a vector of random variables, and the expectation is taken with respect to the probability distribution of these variables. In this course, we stick with deterministic optimization.
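An aside on the convexity definition above: the defining inequality can be spot-checked numerically. The sketch below tests it on random pairs of points for f(x) = ||x||^2, which is convex, so all checks pass; a single failed check would certify non-convexity, while passing checks merely support it.

```python
import numpy as np

def violates_convexity(f, x, y, alphas):
    """True if some alpha violates f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)."""
    return any(f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + 1e-12
               for a in alphas)

rng = np.random.default_rng(0)
f = lambda x: np.sum(x**2)            # a convex function
alphas = np.linspace(0, 1, 21)
pairs = ((rng.normal(size=3), rng.normal(size=3)) for _ in range(1000))
print(any(violates_convexity(f, x, y, alphas) for x, y in pairs))  # False
```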
Optimization Algorithms

Optimization Algorithm
An optimization algorithm solves the optimization problem, i.e., searches for x* which (in some sense) minimizes the objective f and satisfies the constraints. Typically, the algorithm computes a sequence of candidate solutions x_0, x_1, ... and then identifies one resembling a solution.
The problem is to
▶ compute the candidate solutions (complexity of the objective function, difficulties in the selection of candidates, etc.),
▶ select the one closest to a minimum; it is hard to decide whether a given point is a minimum (even a local one). Example: neural network training.

Optimization Algorithm Properties
Typically, we are concerned with the following issues:
▶ Robustness: the algorithm should perform well on a wide variety of problems in its class, for all reasonable choices of the initial variables.
▶ Efficiency: the algorithm should not require too much computer time or storage.
▶ Accuracy: the algorithm should be able to identify a solution with precision, without being overly sensitive to
▶ errors in the data/model,
▶ arithmetic rounding errors.

Order and Search
Order
▶ Zeroth order = gradient-free: no information about derivatives is used.
▶ First order = gradient-based: uses information about first derivatives (e.g., gradient descent).
▶ Second order: uses information about first and second derivatives (e.g., Newton's method).
Search
▶ Local search starts at a point and searches for a solution by successively updating the current solution (e.g., gradient descent).
▶ Global search tries to span the whole space (e.g., grid search).
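To make the zeroth- vs first-order and global vs local distinctions concrete, here is a toy comparison on f(x) = (x − 1)^2: a gradient-free global grid search versus a gradient-based local search. Both find x* = 1; the grid search never touches f', while the gradient method never looks beyond its current neighborhood.

```python
import numpy as np

f = lambda x: (x - 1)**2
fprime = lambda x: 2 * (x - 1)

# Zeroth order, global: evaluate f on a grid and keep the best point.
grid = np.linspace(-10, 10, 2001)
print(grid[np.argmin(f(grid))])       # ~1.0

# First order, local: gradient descent from an initial point.
x, alpha = -7.0, 0.1
for _ in range(200):
    x -= alpha * fprime(x)
print(x)                              # ~1.0
```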
Mathematical vs Heuristic
For some algorithms, and under specific assumptions imposed on the optimization problem, we can do the following:
▶ Prove that the algorithm converges to an optimum/minimum.
▶ Determine the rate of convergence.
▶ Decide whether we are at (or close to) an optimum/minimum.
For example, for linear optimization problems, the simplex algorithm converges to a minimum (or reports that there is no minimum) in at most exponentially many steps, and we may efficiently decide whether we have reached a minimum.
For other algorithms, we may prove only some, or none, of these properties. There are (almost) infinitely many heuristic algorithms without provable convergence, often motivated by the behavior of various animals.

Deterministic vs Stochastic and Static vs Dynamic
Stochastic optimization is based on a random selection of candidate solutions. Evolutionary algorithms contain some randomness (e.g., in the form of random mutations). Also, various variants of gradient-based methods are often randomized (e.g., variants of stochastic gradient descent).
In this course, we stick to static optimization problems, where the optimization problem is solved only once. In contrast, in dynamic optimization, a sequence of (usually) dependent optimization problems is solved sequentially. For example, consider driving a car, where the driver must react optimally to changing situations several times per second. Dynamic optimization problems are usually defined using a kind of (Markov) decision process.

Summary
The course consists of the following main parts:
▶ Unconstrained optimization
▶ non-linear objectives, (twice) differentiable,
▶ second-order methods (quasi-Newton).
▶ Constrained optimization
▶ non-linear objectives and constraints, (twice) differentiable,
▶ Lagrange multipliers, Newton-Lagrange method,
▶ quadratic programming (a little bit).
▶ Linear programming
▶ linear objectives and constraints,
▶ simplex algorithm deep dive (including the degenerate case).
▶ Integer linear programming
▶ linear objectives and mixed integer linear constraints,
▶ branch-and-bound, Gomory cuts algorithms.
▶ A little bit on non-differentiable algorithms.
You will need to understand: calculus in R^n (gradient, Hessian) and linear algebra in R^n (vectors, matrices, geometry).

Single-variable Objectives

Unconstrained Single Variable Optimization Problem
An objective function f : R → R, a variable x.
Find x* such that f(x*) ≤ min_{x ∈ R} f(x).
We consider
▶ f continuously differentiable,
▶ f twice continuously differentiable,
and present the following methods:
▶ gradient descent,
▶ Newton's method,
▶ secant method.

Gradient Based Methods
Assume that
f'(x) = lim_{h→0} (f(x + h) − f(x))/h
exists for every x ∈ R and is continuous on R. Denote by C1 the set of all continuously differentiable functions.

Gradient Descent in Single Variable
Gradient descent algorithm for finding a local minimum of a function f, using a variable step length.
Input: Function f with first derivative f', initial point x_0, initial step length α_0 > 0, tolerance ϵ > 0
Output: A point x that approximately minimizes f(x)
1: Set k ← 0
2: while |f'(x_k)| > ϵ do
3:   Calculate the derivative: y' ← f'(x_k)
4:   Update x_{k+1} ← x_k − α_k · y'
5:   Update the step length α_k to α_{k+1} based on a certain strategy
6:   Increment k
7: end while
8: return x_k

Convergence of Single Variable Gradient Descent
Theorem 1
Assume that f is
▶ differentiable, i.e., f' exists,
▶ bounded below, i.e., there is B ∈ R such that f(x) ≥ B for all x ∈ R,
▶ L-smooth, i.e., there is L > 0 such that |f'(x) − f'(x')| ≤ L|x − x'| for all x, x' ∈ R.
Consider a sequence x_0, x_1, ... computed by the gradient descent algorithm for f. Assume a constant step length α ≤ 1/L. Then lim_{k→∞} |f'(x_k)| = 0 and, moreover,
min_{0≤t<k} |f'(x_t)| ≤ sqrt(2(f(x_0) − B)/(αk)).

Newton's Method
Newton's method algorithm for finding a local minimum of a function f.
Input: Function f with first and second derivatives f' and f'', initial point x_0, tolerance ϵ > 0
Output: A point x that approximately minimizes f(x)
1: Set k ← 0
2: while |x_{k+1} − x_k| > ϵ do
3:   Calculate the derivative: y' ← f'(x_k)
4:   Calculate the second derivative: y'' ← f''(x_k)
5:   Update the estimate: x_{k+1} ← x_k − y'/y''
6:   Increment k
7: end while
8: return x_k
Note that the method implicitly assumes that f''(x_k) ≠ 0 in every iteration.
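A direct transcription of the Newton listing into Python, applied in the setting of the example that follows. This is a minimal sketch: as in the listing, there is no safeguard against f''(x_k) ≈ 0, which the method implicitly assumes away.

```python
import math

def newton_minimize(fprime, fsecond, x0, eps=1e-5, max_iter=100):
    """Newton's method for a local minimum: iterate x <- x - f'(x)/f''(x)."""
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)   # assumes f''(x) != 0
        x, x_prev = x - step, x
        if abs(x - x_prev) <= eps:      # stop when |x_{k+1} - x_k| <= eps
            break
    return x

# f(x) = x^2/2 - sin(x), as in the example below: f'(x) = x - cos x, f''(x) = 1 + sin x.
x_star = newton_minimize(lambda x: x - math.cos(x), lambda x: 1 + math.sin(x), x0=0.5)
print(x_star)  # ~0.7390851332
```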
Example
Consider the objective function
f(x) = (1/2)x^2 − sin x.
Assume x_0 = 0.5 and required accuracy ϵ = 10^−5, i.e., we stop when |x_{k+1} − x_k| ≤ ϵ. We compute
f'(x) = x − cos x, f''(x) = 1 + sin x.
Hence,
x_1 = 0.5 − (0.5 − cos 0.5)/(1 + sin 0.5) = 0.5 − (−0.3775)/1.479 = 0.7552.
Proceeding similarly, we obtain
x_2 = x_1 − f'(x_1)/f''(x_1) = x_1 − 0.02710/1.685 = 0.7391,
x_3 = x_2 − f'(x_2)/f''(x_2) = x_2 − (9.461 × 10^−5)/1.673 = 0.7390851339,
x_4 = x_3 − f'(x_3)/f''(x_3) = x_3 − (1.17 × 10^−9)/1.673 = 0.7390851332,
...
Note that
|x_4 − x_3| < ϵ = 10^−5,
f'(x_4) = −8.6 × 10^−6 ≈ 0,
f''(x_4) = 1.673 > 0.
So we conclude that x* ≈ x_4 is a strict minimizer. However, remember that the above does not have to be true!

Convergence
Newton's method works well if f''(x) > 0 everywhere. However, if f''(x) < 0 for some x, Newton's method may fail to converge to a minimizer; it converges to a point x where f'(x) = 0 (figure).
If the method converges to a minimizer, it does so quadratically. What does this mean?

Types of Convergence Rates
Linear convergence: an algorithm converges linearly if the error at each step is reduced by a constant factor:
lim_{k→∞} |x_{k+1} − x*|/|x_k − x*| = r, 0 < r < 1.
Superlinear convergence: convergence is superlinear if
lim_{k→∞} |x_{k+1} − x*|/|x_k − x*| = 0.
This often requires an algorithm to utilize second-order information.
Quadratic convergence: quadratic convergence is achieved when the number of accurate digits roughly doubles with each iteration:
lim_{k→∞} |x_{k+1} − x*|/|x_k − x*|^2 = C, C > 0.
Newton's method is a classic example of an algorithm with quadratic convergence.
Theorem 2 (Quadratic Convergence of Newton's Method)
Let f : R → R satisfy f ∈ C2, and suppose x* is a minimizer of f such that f''(x*) > 0. Assume that f'' is Lipschitz continuous. If the initial guess x_0 is sufficiently close to x*, then the sequence {x_k} computed by Newton's method converges quadratically to x*.

Newton's Method of Tangents
Newton's method is also a technique for finding roots of functions. In our case, this means finding a root of f'. Denote g = f'. Then Newton's approximation goes like this:
x_{k+1} = x_k − g(x_k)/g'(x_k).
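The doubling of accurate digits can be seen directly in the iterates of the example above: the ratios |x_{k+1} − x*|/|x_k − x*|^2 stay roughly constant. The snippet below regenerates the iterates at full precision (the rounded values printed on the slide would distort the later ratios) and checks this.

```python
import math

fprime = lambda x: x - math.cos(x)
fsecond = lambda x: 1 + math.sin(x)

# Regenerate the Newton iterates from the example at full precision.
xs = [0.5]
for _ in range(5):
    xs.append(xs[-1] - fprime(xs[-1]) / fsecond(xs[-1]))
x_star = xs[-1]                       # converged to machine precision

errs = [abs(x - x_star) for x in xs[:-2]]
for e0, e1 in zip(errs, errs[1:]):
    print(e1 / e0**2)                 # roughly constant (~0.2-0.3): quadratic convergence
```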
Secant Method
What if f'' is unavailable, but we want to use something like Newton's method (with its superlinear convergence)? Assume f ∈ C1 and approximate f'' around x_{k−1} by
f''(x) ≈ (f'(x) − f'(x_{k−1}))/(x − x_{k−1}).
Substituting x_k for x, we obtain
1/f''(x_k) ≈ (x_k − x_{k−1})/(f'(x_k) − f'(x_{k−1})).
Then, we may try to use Newton's step with this approximation:
x_{k+1} = x_k − (x_k − x_{k−1})/(f'(x_k) − f'(x_{k−1})) · f'(x_k).
Is the rate of convergence superlinear?

Example
Consider again the objective function
f(x) = (1/2)x^2 − sin x.
Assume x_0 = 0.5 and x_1 = 1.0; now we need to initialize the first two values. We have f'(x) = x − cos x. Hence,
x_2 = 1.0 − (1.0 − 0.5)/((1.0 − cos 1.0) − (0.5 − cos 0.5)) · (1.0 − cos 1.0) = 0.7254.
Continuing, we obtain:
x_0 = 0.5
x_1 = 1.0
x_2 = 0.72548
x_3 = 0.73839
x_4 = 0.739087
x_5 = 0.739085132
x_6 = 0.739085133
Now start the secant method from the first approximation given by Newton's method:
x_0 = 0.5
x_1 = 0.7552
x_2 = 0.7381
x_3 = 0.739081
x_4 = 0.7390851339
x_5 = 0.7390851332
...
Compare with Newton's method:
x_0 = 0.5
x_1 = 0.7552
x_2 = 0.7391
x_3 = 0.7390851339
x_4 = 0.73908513321516067229
x_5 = 0.73908513321516067229
...

Theorem 3 (Superlinear Convergence of Secant Method)
Assume f : R → R is twice continuously differentiable and x* is a minimizer of f. Assume that f'' is Lipschitz continuous and f''(x*) > 0. The sequence {x_k} generated by the secant method converges to x* superlinearly if x_0 and x_1 are sufficiently close to x*.
The rate of convergence p of the secant method is given by the positive root of the equation p^2 − p − 1 = 0, which is p = (1 + √5)/2 ≈ 1.618 (the golden ratio). Formally,
lim_{k→∞} |x_{k+1} − x*|/|x_k − x*|^p = C, C > 0.

Secant Method for Root Finding
As with Newton's method of tangents, the secant method can be seen as a method for finding a root of f'. Denote g = f'. Then the secant approximation is
x_{k+1} = x_k − (x_k − x_{k−1})/(g(x_k) − g(x_{k−1})) · g(x_k).

General Form
Note that all the methods have a similar update formula:
x_{k+1} = x_k − f'(x_k)/a_k.
Different choices of a_k produce different algorithms:
▶ a_k = 1 gives gradient descent (with unit step length),
▶ a_k = f''(x_k) gives Newton's method,
▶ a_k = (f'(x_k) − f'(x_{k−1}))/(x_k − x_{k−1}) gives the secant method,
▶ a_k = f''(x_m), where m = ⌊k/p⌋ · p, gives the Shamanskii method (f'' is recomputed only once every p steps).
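For completeness, here is a sketch of the secant update mirroring the Newton implementation earlier; initialized with x_0 = 0.5 and x_1 = 1.0, it reproduces the iterates listed above. Only f' is needed (and, as above, no safeguard is included for f'(x_k) = f'(x_{k−1})).

```python
import math

def secant_minimize(fprime, x0, x1, eps=1e-9, max_iter=100):
    """Secant method for a local minimum: Newton's step with f'' approximated
    by the difference quotient (f'(x_k) - f'(x_{k-1})) / (x_k - x_{k-1})."""
    for _ in range(max_iter):
        g0, g1 = fprime(x0), fprime(x1)
        x0, x1 = x1, x1 - (x1 - x0) / (g1 - g0) * g1
        if abs(x1 - x0) <= eps:         # stop when |x_{k+1} - x_k| <= eps
            break
    return x1

# f(x) = x^2/2 - sin(x): f'(x) = x - cos(x).
print(secant_minimize(lambda x: x - math.cos(x), 0.5, 1.0))  # ~0.7390851332
```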
Summary
▶ Newton's method
▶ Converges quickly to an extremum under rather strict conditions (see Theorem 2).
▶ The choice of the initial point is critical; the method may converge to a stationary point that is not a minimizer. The method may also cycle.
▶ If the second derivative is very small close to the minimizer, the method can be very slow (quadratic convergence is guaranteed only if the second derivative is non-zero at the minimizer, and the constants depend on the second derivative).
▶ Secant method
▶ The second derivative is not needed.
▶ Superlinear (but not quadratic) convergence for initial points close to a minimum (under the rather strict conditions of Theorem 3).

Constrained Single Variable Optimization Problem
An objective function f : R → R, a variable x, and a constraint a_0 ≤ x ≤ b_0.
Consider the following cases:
▶ f continuously differentiable on [a_0, b_0],
▶ f twice continuously differentiable on [a_0, b_0].
Homework: Modify gradient descent and Newton's method to work on the bounded interval (the above definitions guarantee continuous differentiability at a_0 and b_0).