PV027 Optimization
Tomáš Brázdil

Resources & Prerequisites
Resources:
▶ Lectures & tutorials (the main resources)
▶ Books:
Joaquim R. R. A. Martins and Andrew Ning. Engineering Design Optimization. Cambridge University Press, 2021. ISBN: 9781108833417.
Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 2006. ISBN: 0387303030.
We shall need elementary knowledge and understanding of
▶ linear algebra in R^n: operations with vectors and matrices, bases, diagonalization;
▶ multi-variable calculus (i.e., in R^n): partial derivatives, gradients, Hessians, Taylor's theorem.
We will refresh our memories during lectures and tutorials.

Evaluation
Oral exam: you will get a manual describing the knowledge necessary for grade E and better. There might be homework assignments that you may discuss at tutorials, but (for this year) there is no mandatory homework. Please be aware that this is a difficult math-based course.

What is Optimization
Merriam-Webster: An act, process, or methodology of making something (such as a design, system, or decision) as perfect, functional, or effective as possible; specifically, the mathematical procedures (such as finding the maximum of a function) involved in this.
Britannica: A collection of mathematical principles and methods for solving quantitative problems in many disciplines, including physics, biology, engineering, economics, and business.
Historically, (mathematical/numerical) optimization has been called mathematical programming.

Optimization
People optimize in
▶ scheduling (transportation, education, ...),
▶ investments (portfolio management, utility maximization, ...),
▶ industrial design (aerodynamics, electrical engineering, ...),
▶ sciences (molecular modeling, computational systems biology, ...),
▶ machine learning.

Optimization Algorithms (overview figure)

Design Optimization Process (flowchart figure)
A bit naive example:
▶ Consider a company with several plants producing a single product, but with different efficiency.
▶ The goal is to set the production of each plant so that the demand for the goods is satisfied, but overproduction is minimized.
▶ First try: Model each plant's production and maximize the total production efficiency. This leads to a solution where only the most efficient plant produces.
▶ However, beyond a certain level of demand, no single plant can satisfy the demand, so we introduce constraints on the maximum production of the plants. The solution then maximizes the production of the most efficient plant, then of the second most efficient one, etc.
▶ Then you notice that all plant employees must work.
▶ Then you start solving transportation problems depending on the location of the plants.
▶ ...
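As a toy illustration of how adding capacity constraints changes the solution, the production-planning story above can be phrased as a small linear program. This is a minimal sketch; the per-unit costs (inverse efficiencies), capacities, and demand below are made-up numbers, and scipy.optimize.linprog is used as the solver.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: cost per unit produced (lower = more efficient) and capacity.
cost = np.array([1.0, 1.5, 2.0])        # plant 1 is the most efficient
capacity = np.array([40.0, 30.0, 30.0])
demand = 60.0

# minimize cost @ x  subject to  x_1 + ... + x_n >= demand,  0 <= x_i <= capacity_i.
# linprog expects A_ub @ x <= b_ub, so the demand constraint is written as -sum(x) <= -demand.
res = linprog(c=cost,
              A_ub=-np.ones((1, 3)), b_ub=[-demand],
              bounds=list(zip(np.zeros(3), capacity)))
print(res.x)  # expected [40., 20., 0.]: the most efficient plants fill the demand first
```

Without the capacity bounds, the optimizer would indeed push all production to plant 1, exactly as the narrative predicts.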
Optimization Problem Formulation
1. Describe the problem
▶ Problem formulation is vital since the optimizer exploits any weaknesses in the model formulation.
▶ You might get the "right answer to the wrong question."
▶ The problem description is typically informal at the beginning.
2. Gather information
▶ Identify possible inputs/outputs.
▶ Gather data and identify the analysis procedure.
3. Define the design variables
▶ Identify the quantities that describe the system: x ∈ R^n (certain characteristics of the system, such as position, investments, etc.).
▶ The variables are supposed to be independent; the optimizer must be free to choose the components of x independently.
▶ The choice of variables is typically not unique (e.g., a square can be described by its side or by its area).
▶ The variables may affect the functional form of the objective and constraints (e.g., linear vs. non-linear).
4. Define the objective
▶ The function that determines whether one design is better than another.
▶ Must be a scalar computable from the variables: f : R^n → R (e.g., profit, time, potential energy, etc.).
▶ The objective function is either maximized or minimized, depending on the application.
▶ The choice is not always obvious: e.g., minimizing just the weight of a vehicle might result in a vehicle too expensive to manufacture.
5. Define the constraints
▶ Prescribe the allowed values of the variables.
▶ May have the general form c(x) ≤ 0, c(x) ≥ 0, or c(x) = 0, where c : R^n → R is a function of the variables (e.g., time cannot be negative, the amount of money to invest is bounded).

Modelling and Optimization
The optimization problem consists of
▶ variables,
▶ objective,
▶ constraints.
The above components constitute a model.
Modelling is concerned with model building; optimization with maximization/minimization of the objective for a given model. We concentrate on the optimization part, but keep in mind that it is intertwined with modelling.
The Optimization Problem (OP): Find a setting of the variables that maximizes/minimizes the objective while satisfying the constraints.
An Optimization Algorithm (OA) solves the above problem and provides a solution: a setting of the variables satisfying the constraints and minimizing/maximizing the objective.

Optimization Problems

Optimization Problem Formally
Denote by
f : R^n → R an objective function,
x a vector of real variables,
g_1, ..., g_{n_g} inequality constraint functions g_i : R^n → R,
h_1, ..., h_{n_h} equality constraint functions h_j : R^n → R.
The optimization problem is to
minimize f(x) by varying x
subject to g_i(x) ≤ 0, i = 1, ..., n_g,
h_j(x) = 0, j = 1, ..., n_h.

Optimization Problem - Example
f(x_1, x_2) = (x_1 − 2)^2 + (x_2 − 1)^2
g_1(x_1, x_2) = x_1^2 − x_2
g_2(x_1, x_2) = x_1 + x_2 − 2
The optimization problem is
minimize (x_1 − 2)^2 + (x_2 − 1)^2
subject to x_1^2 − x_2 ≤ 0, x_1 + x_2 − 2 ≤ 0.
A contour of f is defined, for some c ∈ R, by {x ∈ R^n | f(x) = c}.

Constraints
Consider the constraints
g_i(x) ≤ 0, i = 1, ..., n_g,
h_j(x) = 0, j = 1, ..., n_h.
Define the feasibility region by
F = {x | g_i(x) ≤ 0, h_j(x) = 0, i = 1, ..., n_g, j = 1, ..., n_h}.
x ∈ F is feasible; x ∉ F is infeasible.
Note that constraints of the form g_i(x) ≥ 0 can easily be transformed into the inequality constraints −g_i(x) ≤ 0.
x* ∈ F is now a constrained minimizer if f(x*) ≤ f(x) for all x ∈ F.
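A quick way to sanity-check a small constrained problem like the example above is to hand it to a general-purpose solver. The sketch below solves it with SciPy's SLSQP method; note that scipy.optimize.minimize expects inequality constraints in the form c(x) ≥ 0, i.e., with the opposite sign of our g_i(x) ≤ 0 convention.

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2)**2 + (x[1] - 1)**2

# SciPy's convention is c(x) >= 0, so each g_i(x) <= 0 is negated.
cons = ({'type': 'ineq', 'fun': lambda x: -(x[0]**2 - x[1])},
        {'type': 'ineq', 'fun': lambda x: -(x[0] + x[1] - 2)})

res = minimize(f, x0=np.array([0.0, 0.0]), method='SLSQP', constraints=cons)
print(res.x)  # x ~ [1, 1]; both constraints are active at this solution
```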
Constraints
Inequality constraints g_i(x) ≤ 0 can be active or inactive:
active at x: g_i(x) = 0,
inactive at x: g_i(x) < 0.

More Practical Example
The problem formulation:
▶ A company has two chemical factories F_1 and F_2, and a dozen retail outlets R_1, ..., R_12.
▶ Each F_i can produce a maximum of a_i tons of a chemical each week.
▶ Each retail outlet R_j demands at least b_j tons.
▶ The cost of shipping one ton from F_i to R_j is c_ij.
The problem: Determine how much each factory should ship to each outlet to satisfy the requirements and minimize cost.
Variables: x_ij for i = 1, 2 and j = 1, ..., 12. Each x_ij (intuitively) corresponds to the tons shipped from F_i to R_j.
The objective:
min Σ_ij c_ij x_ij
subject to
Σ_{j=1}^{12} x_ij ≤ a_i, i = 1, 2,
Σ_{i=1}^{2} x_ij ≥ b_j, j = 1, ..., 12,
x_ij ≥ 0, i = 1, 2, j = 1, ..., 12.
The above is a linear programming problem, since both the objective and the constraint functions are linear.

Discrete Optimization
In our original optimization problem definition, we consider real (continuous) variables. Sometimes, we need to assume discrete values. For example, in the previous example, the factories may produce tractors. In such a case, it does not make sense to produce 4.6 tractors.
Usually, an integer constraint is added, such as x_i ∈ Z, which restricts x_i to integer values. This leads to so-called integer programming. Discrete optimization problems have discrete and finite variables.
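The transportation problem with an integrality requirement added (think tractors instead of tons) becomes an integer linear program. Below is a minimal sketch using scipy.optimize.milp (available in SciPy ≥ 1.9), shrunk to 2 factories and 3 outlets with made-up data so the constraint matrices stay readable.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Hypothetical data: 2 factories, 3 outlets (instead of 12, to keep A small).
a = np.array([70, 50])                      # factory capacities
b = np.array([30, 40, 30])                  # outlet demands
c = np.array([[4, 6, 9],                    # c[i, j]: cost of shipping F_i -> R_j
              [5, 4, 7]], dtype=float)

n_f, n_r = c.shape
# Flatten x_ij row by row: x = (x_11, x_12, x_13, x_21, x_22, x_23).
A_cap = np.kron(np.eye(n_f), np.ones(n_r))  # rows select sum_j x_ij  (<= a_i)
A_dem = np.kron(np.ones(n_f), np.eye(n_r))  # rows select sum_i x_ij  (>= b_j)
constraints = [LinearConstraint(A_cap, ub=a),
               LinearConstraint(A_dem, lb=b)]

res = milp(c=c.ravel(),
           constraints=constraints,
           integrality=np.ones(n_f * n_r),  # 1 marks a variable as integer
           bounds=Bounds(lb=0))
print(res.x.reshape(n_f, n_r))
```

Dropping the integrality argument recovers the plain linear program from the slide.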
Wing Design Example
Our goal is to design the wing shape of an aircraft. Assume a rectangular wing. The parameters are called the span b and the chord c. However, two other variables are often used in aircraft design: the wing area S and the wing aspect ratio AR. It holds that
S = bc, AR = b^2/S.
What exactly are the objectives and constraints?
Our objective function is the power required to keep level flight:
f(b, c) = Dv/η
Here,
▶ D is the drag, i.e., the aerodynamic force that opposes the aircraft's motion through the air;
▶ η is the propulsive efficiency, i.e., the efficiency with which the energy contained in the vehicle's fuel is converted into kinetic energy of the vehicle;
▶ v is the lift velocity, i.e., the velocity needed to lift the aircraft, which depends on its weight.
For illustration, let us look at the lift velocity v. In level flight, the aircraft must generate enough lift L to equal its weight W, that is, L = W. The weight partially depends on the wing area:
W = W_0 + W_S S
Here S = bc is the wing area, and W_0 is the payload weight. The lift can be approximated using the following formula:
L = q · C_L · S
where q = (1/2) ϱ v^2 is the fluid dynamic pressure, ϱ is the air density, and C_L is a lift coefficient (depending on the wing shape). Thus, we may obtain the lift velocity as
v = sqrt(2W/(ϱ C_L S)) = sqrt(2(W_0 + W_S bc)/(ϱ C_L bc)).
Similarly, various physics-based arguments provide approximations of the drag D and the propulsive efficiency η.
The drag D = D_i + D_f is the sum of the induced and viscous drag. The induced drag can be approximated by
D_i = W^2/(q π b^2 e)
Here, e is the Oswald efficiency factor, a correction factor representing the change in drag with the lift of a wing, as compared with an ideal wing having the same aspect ratio. The viscous drag can be approximated by
D_f = k C_f q (2.05 S)
Here, k is the form factor (accounting for the pressure drag), 2.05 S approximates the wetted area of the wing, and C_f is the skin friction coefficient, which can be approximated by
C_f = 0.074/Re^0.2
where Re is the Reynolds number, which characterizes the air flow patterns around the wing and is defined as
Re = ϱvc/µ
Here µ is the air dynamic viscosity.
The propulsive efficiency η can be roughly approximated by the Gaussian efficiency curve
η = η_max exp(−(v − v̄)^2/(2σ^2))
Here, v̄ is the peak propulsive efficiency velocity, and σ is the standard deviation of the efficiency curve.
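Putting the pieces together, the required power can be evaluated numerically. The sketch below implements f(b, c) = Dv/η from the formulas above; all constants (W_0, W_S, ϱ, C_L, e, k, µ, η_max, v̄, σ) are illustrative guesses, not values from the lecture.

```python
import numpy as np

# Illustrative constants (assumptions, not lecture values): payload weight,
# wing weight per unit area, air density, lift coefficient, Oswald factor,
# form factor, viscosity, and the Gaussian propulsive-efficiency parameters.
W0, WS = 1000.0, 8.0                        # N, N/m^2
rho, CL, e, k, mu = 1.2, 0.4, 0.8, 1.2, 1.8e-5
eta_max, v_bar, sigma = 0.85, 20.0, 10.0

def required_power(b, c):
    """Power required for level flight of a rectangular wing: f(b, c) = D*v/eta."""
    S = b * c                                # wing area
    W = W0 + WS * S                          # total weight
    v = np.sqrt(2 * W / (rho * CL * S))      # lift velocity from L = W
    q = 0.5 * rho * v**2                     # dynamic pressure
    Re = rho * v * c / mu                    # Reynolds number
    Cf = 0.074 / Re**0.2                     # skin-friction coefficient
    D = W**2 / (q * np.pi * b**2 * e) + k * Cf * q * 2.05 * S  # induced + viscous
    eta = eta_max * np.exp(-(v - v_bar)**2 / (2 * sigma**2))   # propulsive efficiency
    return D * v / eta

print(required_power(b=10.0, c=1.0))
```

Sweeping required_power over a grid of (b, c) values reproduces the kind of contour plot shown on the next slide.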
Wing Design Example
The objective function contours (figure): the engineers would refuse this solution, as the aspect ratio is much higher than typically seen in airplanes, which adversely affects the structural strength. Add constraints!
With an added constraint on the bending stress at the root of the wing (figure), it looks like a reasonable wing...

Optimization Problem Classification (classification tree figure)
▶ Continuous allows only x_i ∈ R; discrete allows only x_i ∈ Z; mixed allows variables of both kinds.
▶ Single-objective: f : R^n → R; multi-objective: f : R^n → R^m.
▶ Unconstrained: no constraints, just the objective function.

Smoothness
We consider various classes of problems depending on the smoothness properties of the objective/constraint functions:
▶ C0: continuous functions. Continuity allows us to estimate values in small neighborhoods. Discontinuous functions exist.
▶ C1: continuous first derivatives. The derivative gives information about the slope. If it is continuous, the slope changes smoothly, allowing us to estimate it locally. Non-differentiable continuous functions, and differentiable functions with discontinuous derivatives, exist.
▶ C2: continuous second derivatives. The second derivative informs about curvature. Continuously differentiable functions without second derivatives, and twice differentiable functions with discontinuous second derivatives, exist.

Examples:
f(x) = |x| is continuous; f is not differentiable at 0.
f(x) = x|x| is differentiable on R with f'(x) = 2|x|; f' is not differentiable at 0, so f has no second derivative there.
f(x) = x^2 sin(1/x) for x ≠ 0, f(0) = 0:
f'(x) = 2x sin(1/x) − cos(1/x) for x ≠ 0, f'(0) = 0.
f is differentiable on R, but f' is not continuous at 0.
f(x) = x^4 sin(1/x) for x ≠ 0, f(0) = 0:
f is differentiable on R, with
f'(x) = 4x^3 sin(1/x) − x^2 cos(1/x) for x ≠ 0, f'(0) = 0.
f' is differentiable on R, with
f''(x) = 12x^2 sin(1/x) − 6x cos(1/x) − sin(1/x) for x ≠ 0, f''(0) = 0.
Clearly, f'' does not have a limit at 0, as sin(1/x) oscillates between −1 and 1, and thus f'' is not continuous.
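The C1 counterexample above is easy to probe numerically: sampling f'(x) = 2x sin(1/x) − cos(1/x) ever closer to 0 shows values that keep oscillating in [−1, 1] instead of settling toward f'(0) = 0. The snippet below is just this experiment.

```python
import numpy as np

fprime = lambda x: 2 * x * np.sin(1 / x) - np.cos(1 / x)

# Sample the derivative at points approaching 0: the values do not converge;
# they keep oscillating, dominated by the -cos(1/x) term.
for x in 10.0 ** -np.arange(1, 8):
    print(f"x = {x:.0e}   f'(x) = {fprime(x):+.4f}")
```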
Linearity
Linear programming: both the objective and the constraints are linear. It can be solved precisely, efficiently, and in rational numbers (see linear programming later).

Multimodality
Denote by F the feasibility set.
x* is a (weak) local minimiser if there is ε > 0 such that f(x*) ≤ f(x) for all x ∈ F satisfying ||x* − x|| ≤ ε.
x* is a (weak) global minimiser if f(x*) ≤ f(x) for all x ∈ F.
A global/local minimiser is strict if the inequality is strict (for x ≠ x*).
Unimodal functions have a single global minimiser in F; multimodal functions have multiple local minimisers in F.

Convexity
S ⊆ R^n is a convex set if the straight line segment connecting any two points in S lies entirely inside S. Formally, for any two points x ∈ S and y ∈ S, we have
αx + (1 − α)y ∈ S for all α ∈ [0, 1].
f is a convex function if its domain is a convex set and if, for any two points x and y in this domain, the graph of f lies below (or on) the straight line connecting (x, f(x)) to (y, f(y)) in the space R^{n+1}. That is,
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) for all α ∈ (0, 1).
A standard form convex optimization problem assumes
▶ a convex objective f and convex inequality constraint functions g_i,
▶ affine equality constraint functions h_j.
Implications:
▶ Every local minimum is a global minimum.
▶ If the above inequality is strict for all x ≠ y, then there is a unique minimum.

Stochasticity
Sometimes, the parameters of a model cannot be specified with certainty. For example, in the transportation model, customer demand cannot be predicted precisely in practice. However, such parameters may often be statistically estimated and modeled using an appropriate probability distribution.
The stochastic optimization problem is to minimize/maximize the expectation of a statistic parametrized by the variables x:
Find x maximizing E f(x; W)
Here, W is a vector of random variables, and the expectation is taken with respect to the probability distribution of these variables. In this course, we stick with deterministic optimization.
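An aside on the convexity definition above: the defining inequality can be spot-checked numerically. The sketch below tests it on random pairs of points for f(x) = ||x||^2, which is convex, so all checks pass; a single failed check would certify non-convexity, while passing checks merely support it.

```python
import numpy as np

def violates_convexity(f, x, y, alphas):
    """True if some alpha violates f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)."""
    return any(f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + 1e-12
               for a in alphas)

rng = np.random.default_rng(0)
f = lambda x: np.sum(x**2)            # a convex function
alphas = np.linspace(0, 1, 21)
pairs = ((rng.normal(size=3), rng.normal(size=3)) for _ in range(1000))
print(any(violates_convexity(f, x, y, alphas) for x, y in pairs))  # False
```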
Optimization Algorithms

Optimization Algorithm
An optimization algorithm solves the optimization problem, i.e., searches for x* which (in some sense) minimizes the objective f and satisfies the constraints. Typically, the algorithm computes a sequence of candidate solutions x_0, x_1, ... and then identifies one resembling a solution.
The problem is to
▶ compute the candidate solutions (complexity of the objective function, difficulties in the selection of candidates, etc.),
▶ select the one closest to a minimum; it is hard to decide whether a given point is a minimum (even a local one). Example: neural network training.

Optimization Algorithm Properties
Typically, we are concerned with the following issues:
▶ Robustness: the algorithm should perform well on a wide variety of problems in its class, for all reasonable choices of the initial variables.
▶ Efficiency: the algorithm should not require too much computer time or storage.
▶ Accuracy: the algorithm should be able to identify a solution with precision, without being overly sensitive to
▶ errors in the data/model,
▶ arithmetic rounding errors.

Order and Search
Order
▶ Zeroth order = gradient-free: no information about derivatives is used.
▶ First order = gradient-based: uses information about first derivatives (e.g., gradient descent).
▶ Second order: uses information about first and second derivatives (e.g., Newton's method).
Search
▶ Local search starts at a point and searches for a solution by successively updating the current solution (e.g., gradient descent).
▶ Global search tries to span the whole space (e.g., grid search).
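To make the zeroth- vs first-order and global vs local distinctions concrete, here is a toy comparison on f(x) = (x − 1)^2: a gradient-free global grid search versus a gradient-based local search. Both find x* = 1; the grid search never touches f', while the gradient method never looks beyond its current neighborhood.

```python
import numpy as np

f = lambda x: (x - 1)**2
fprime = lambda x: 2 * (x - 1)

# Zeroth order, global: evaluate f on a grid and keep the best point.
grid = np.linspace(-10, 10, 2001)
print(grid[np.argmin(f(grid))])       # ~1.0

# First order, local: gradient descent from an initial point.
x, alpha = -7.0, 0.1
for _ in range(200):
    x -= alpha * fprime(x)
print(x)                              # ~1.0
```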
Mathematical vs Heuristic
For some algorithms, and under specific assumptions imposed on the optimization problem, we can do the following:
▶ Prove that the algorithm converges to an optimum/minimum.
▶ Determine the rate of convergence.
▶ Decide whether we are at (or close to) an optimum/minimum.
For example, for linear optimization problems, the simplex algorithm converges to a minimum (or reports that there is no minimum) in at most exponentially many steps, and we may efficiently decide whether we have reached a minimum.
For other algorithms, we may prove only some, or none, of these properties. There are (almost) infinitely many heuristic algorithms without provable convergence, often motivated by the behavior of various animals.

Deterministic vs Stochastic and Static vs Dynamic
Stochastic optimization is based on a random selection of candidate solutions. Evolutionary algorithms contain some randomness (e.g., in the form of random mutations). Also, various variants of gradient-based methods are often randomized (e.g., variants of stochastic gradient descent).
In this course, we stick to static optimization problems, where the optimization problem is solved only once. In contrast, in dynamic optimization, a sequence of (usually) dependent optimization problems is solved sequentially. For example, consider driving a car, where the driver must react optimally to changing situations several times per second. Dynamic optimization problems are usually defined using a kind of (Markov) decision process.

Summary
The course consists of the following main parts:
▶ Unconstrained optimization
▶ non-linear objectives, (twice) differentiable,
▶ second-order methods (quasi-Newton).
▶ Constrained optimization
▶ non-linear objectives and constraints, (twice) differentiable,
▶ Lagrange multipliers, Newton-Lagrange method,
▶ quadratic programming (a little bit).
▶ Linear programming
▶ linear objectives and constraints,
▶ simplex algorithm deep dive (including the degenerate case).
▶ Integer linear programming
▶ linear objectives and mixed integer linear constraints,
▶ branch-and-bound, Gomory cuts algorithms.
▶ A little bit on non-differentiable algorithms.
You will need to understand: calculus in R^n (gradient, Hessian) and linear algebra in R^n (vectors, matrices, geometry).

Single-variable Objectives

Unconstrained Single Variable Optimization Problem
An objective function f : R → R, a variable x.
Find x* such that f(x*) ≤ min_{x ∈ R} f(x).
We consider
▶ f continuously differentiable,
▶ f twice continuously differentiable,
and present the following methods:
▶ gradient descent,
▶ Newton's method,
▶ secant method.

Gradient Based Methods
Assume that
f'(x) = lim_{h→0} (f(x + h) − f(x))/h
exists for every x ∈ R and is continuous on R. Denote by C1 the set of all continuously differentiable functions.

Gradient Descent in Single Variable
Gradient descent algorithm for finding a local minimum of a function f, using a variable step length.
Input: Function f with first derivative f', initial point x_0, initial step length α_0 > 0, tolerance ϵ > 0
Output: A point x that approximately minimizes f(x)
1: Set k ← 0
2: while |f'(x_k)| > ϵ do
3:   Calculate the derivative: y' ← f'(x_k)
4:   Update x_{k+1} ← x_k − α_k · y'
5:   Update the step length α_k to α_{k+1} based on a certain strategy
6:   Increment k
7: end while
8: return x_k

Convergence of Single Variable Gradient Descent
Theorem 1
Assume that f is
▶ differentiable, i.e., f' exists,
▶ bounded below, i.e., there is B ∈ R such that f(x) ≥ B for all x ∈ R,
▶ L-smooth, i.e., there is L > 0 such that |f'(x) − f'(x')| ≤ L|x − x'| for all x, x' ∈ R.
Consider a sequence x_0, x_1, ... computed by the gradient descent algorithm for f. Assume a constant step length α ≤ 1/L. Then lim_{k→∞} |f'(x_k)| = 0 and, moreover,
min_{0≤t<k} |f'(x_t)| ≤ sqrt(2(f(x_0) − B)/(αk)).

Newton's Method
Newton's method algorithm for finding a local minimum of a function f.
Input: Function f with first and second derivatives f' and f'', initial point x_0, tolerance ϵ > 0
Output: A point x that approximately minimizes f(x)
1: Set k ← 0
2: while |x_{k+1} − x_k| > ϵ do
3:   Calculate the derivative: y' ← f'(x_k)
4:   Calculate the second derivative: y'' ← f''(x_k)
5:   Update the estimate: x_{k+1} ← x_k − y'/y''
6:   Increment k
7: end while
8: return x_k
Note that the method implicitly assumes that f''(x_k) ≠ 0 in every iteration.
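A direct transcription of the Newton listing into Python, applied in the setting of the example that follows. This is a minimal sketch: as in the listing, there is no safeguard against f''(x_k) ≈ 0, which the method implicitly assumes away.

```python
import math

def newton_minimize(fprime, fsecond, x0, eps=1e-5, max_iter=100):
    """Newton's method for a local minimum: iterate x <- x - f'(x)/f''(x)."""
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)   # assumes f''(x) != 0
        x, x_prev = x - step, x
        if abs(x - x_prev) <= eps:      # stop when |x_{k+1} - x_k| <= eps
            break
    return x

# f(x) = x^2/2 - sin(x), as in the example below: f'(x) = x - cos x, f''(x) = 1 + sin x.
x_star = newton_minimize(lambda x: x - math.cos(x), lambda x: 1 + math.sin(x), x0=0.5)
print(x_star)  # ~0.7390851332
```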
Example
Consider the objective function
f(x) = (1/2)x^2 − sin x.
Assume x_0 = 0.5 and required accuracy ϵ = 10^−5, i.e., we stop when |x_{k+1} − x_k| ≤ ϵ. We compute
f'(x) = x − cos x, f''(x) = 1 + sin x.
Hence,
x_1 = 0.5 − (0.5 − cos 0.5)/(1 + sin 0.5) = 0.5 − (−0.3775)/1.479 = 0.7552.
Proceeding similarly, we obtain
x_2 = x_1 − f'(x_1)/f''(x_1) = x_1 − 0.02710/1.685 = 0.7391,
x_3 = x_2 − f'(x_2)/f''(x_2) = x_2 − (9.461 × 10^−5)/1.673 = 0.7390851339,
x_4 = x_3 − f'(x_3)/f''(x_3) = x_3 − (1.17 × 10^−9)/1.673 = 0.7390851332,
...
Note that
|x_4 − x_3| < ϵ = 10^−5,
f'(x_4) = −8.6 × 10^−6 ≈ 0,
f''(x_4) = 1.673 > 0.
So we conclude that x* ≈ x_4 is a strict minimizer. However, remember that the above does not have to be true!

Convergence
Newton's method works well if f''(x) > 0 everywhere. However, if f''(x) < 0 for some x, Newton's method may fail to converge to a minimizer; it converges to a point x where f'(x) = 0 (figure).
If the method converges to a minimizer, it does so quadratically. What does this mean?

Types of Convergence Rates
Linear convergence: an algorithm converges linearly if the error at each step is reduced by a constant factor:
lim_{k→∞} |x_{k+1} − x*|/|x_k − x*| = r, 0 < r < 1.
Superlinear convergence: convergence is superlinear if
lim_{k→∞} |x_{k+1} − x*|/|x_k − x*| = 0.
This often requires an algorithm to utilize second-order information.
Quadratic convergence: quadratic convergence is achieved when the number of accurate digits roughly doubles with each iteration:
lim_{k→∞} |x_{k+1} − x*|/|x_k − x*|^2 = C, C > 0.
Newton's method is a classic example of an algorithm with quadratic convergence.
Theorem 2 (Quadratic Convergence of Newton's Method)
Let f : R → R satisfy f ∈ C2, and suppose x* is a minimizer of f such that f''(x*) > 0. Assume that f'' is Lipschitz continuous. If the initial guess x_0 is sufficiently close to x*, then the sequence {x_k} computed by Newton's method converges quadratically to x*.

Newton's Method of Tangents
Newton's method is also a technique for finding roots of functions. In our case, this means finding a root of f'. Denote g = f'. Then Newton's approximation goes like this:
x_{k+1} = x_k − g(x_k)/g'(x_k).
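The doubling of accurate digits can be seen directly in the iterates of the example above: the ratios |x_{k+1} − x*|/|x_k − x*|^2 stay roughly constant. The snippet below regenerates the iterates at full precision (the rounded values printed on the slide would distort the later ratios) and checks this.

```python
import math

fprime = lambda x: x - math.cos(x)
fsecond = lambda x: 1 + math.sin(x)

# Regenerate the Newton iterates from the example at full precision.
xs = [0.5]
for _ in range(5):
    xs.append(xs[-1] - fprime(xs[-1]) / fsecond(xs[-1]))
x_star = xs[-1]                       # converged to machine precision

errs = [abs(x - x_star) for x in xs[:-2]]
for e0, e1 in zip(errs, errs[1:]):
    print(e1 / e0**2)                 # roughly constant (~0.2-0.3): quadratic convergence
```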
Secant Method
What if f'' is unavailable, but we want to use something like Newton's method (with its superlinear convergence)? Assume f ∈ C1 and approximate f'' around x_{k−1} by
f''(x) ≈ (f'(x) − f'(x_{k−1}))/(x − x_{k−1}).
Substituting x_k for x, we obtain
1/f''(x_k) ≈ (x_k − x_{k−1})/(f'(x_k) − f'(x_{k−1})).
Then, we may try to use Newton's step with this approximation:
x_{k+1} = x_k − (x_k − x_{k−1})/(f'(x_k) − f'(x_{k−1})) · f'(x_k).
Is the rate of convergence superlinear?

Example
Consider again the objective function
f(x) = (1/2)x^2 − sin x.
Assume x_0 = 0.5 and x_1 = 1.0; now we need to initialize the first two values. We have f'(x) = x − cos x. Hence,
x_2 = 1.0 − (1.0 − 0.5)/((1.0 − cos 1.0) − (0.5 − cos 0.5)) · (1.0 − cos 1.0) = 0.7254.
Continuing, we obtain:
x_0 = 0.5
x_1 = 1.0
x_2 = 0.72548
x_3 = 0.73839
x_4 = 0.739087
x_5 = 0.739085132
x_6 = 0.739085133
Now start the secant method from the first approximation given by Newton's method:
x_0 = 0.5
x_1 = 0.7552
x_2 = 0.7381
x_3 = 0.739081
x_4 = 0.7390851339
x_5 = 0.7390851332
...
Compare with Newton's method:
x_0 = 0.5
x_1 = 0.7552
x_2 = 0.7391
x_3 = 0.7390851339
x_4 = 0.73908513321516067229
x_5 = 0.73908513321516067229
...

Theorem 3 (Superlinear Convergence of Secant Method)
Assume f : R → R is twice continuously differentiable and x* is a minimizer of f. Assume that f'' is Lipschitz continuous and f''(x*) > 0. The sequence {x_k} generated by the secant method converges to x* superlinearly if x_0 and x_1 are sufficiently close to x*.
The rate of convergence p of the secant method is given by the positive root of the equation p^2 − p − 1 = 0, which is p = (1 + √5)/2 ≈ 1.618 (the golden ratio). Formally,
lim_{k→∞} |x_{k+1} − x*|/|x_k − x*|^p = C, C > 0.

Secant Method for Root Finding
As with Newton's method of tangents, the secant method can be seen as a method for finding a root of f'. Denote g = f'. Then the secant approximation is
x_{k+1} = x_k − (x_k − x_{k−1})/(g(x_k) − g(x_{k−1})) · g(x_k).

General Form
Note that all the methods have a similar update formula:
x_{k+1} = x_k − f'(x_k)/a_k.
Different choices of a_k produce different algorithms:
▶ a_k = 1 gives gradient descent (with unit step length),
▶ a_k = f''(x_k) gives Newton's method,
▶ a_k = (f'(x_k) − f'(x_{k−1}))/(x_k − x_{k−1}) gives the secant method,
▶ a_k = f''(x_m), where m = ⌊k/p⌋ · p, gives the Shamanskii method (f'' is recomputed only once every p steps).
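For completeness, here is a sketch of the secant update mirroring the Newton implementation earlier; initialized with x_0 = 0.5 and x_1 = 1.0, it reproduces the iterates listed above. Only f' is needed (and, as above, no safeguard is included for f'(x_k) = f'(x_{k−1})).

```python
import math

def secant_minimize(fprime, x0, x1, eps=1e-9, max_iter=100):
    """Secant method for a local minimum: Newton's step with f'' approximated
    by the difference quotient (f'(x_k) - f'(x_{k-1})) / (x_k - x_{k-1})."""
    for _ in range(max_iter):
        g0, g1 = fprime(x0), fprime(x1)
        x0, x1 = x1, x1 - (x1 - x0) / (g1 - g0) * g1
        if abs(x1 - x0) <= eps:         # stop when |x_{k+1} - x_k| <= eps
            break
    return x1

# f(x) = x^2/2 - sin(x): f'(x) = x - cos(x).
print(secant_minimize(lambda x: x - math.cos(x), 0.5, 1.0))  # ~0.7390851332
```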
Summary
▶ Newton's method
▶ Converges quickly to an extremum under rather strict conditions (see Theorem 2).
▶ The choice of the initial point is critical; the method may converge to a stationary point that is not a minimizer. The method may also cycle.
▶ If the second derivative is very small close to the minimizer, the method can be very slow (quadratic convergence is guaranteed only if the second derivative is non-zero at the minimizer, and the constants depend on the second derivative).
▶ Secant method
▶ The second derivative is not needed.
▶ Superlinear (but not quadratic) convergence for initial points close to a minimum (under the rather strict conditions of Theorem 3).

Constrained Single Variable Optimization Problem
An objective function f : R → R, a variable x, and a constraint a_0 ≤ x ≤ b_0.
Consider the following cases:
▶ f continuously differentiable on [a_0, b_0],
▶ f twice continuously differentiable on [a_0, b_0].
Homework: Modify gradient descent and Newton's method to work on the bounded interval (the above definitions guarantee continuous differentiability at a_0 and b_0).