
Optimization basics for machine learning


DOCUMENT INFORMATION

Basic information

Title: Optimization Basics for Machine Learning
Author: Gabriel Peyré
Institution: École Normale Supérieure
Field: Machine Learning
Type: Course notes
Year: 2021
City: Paris
Pages: 45
Size: 8.85 MB

Structure

  • 1.1 Unconstrained optimization
  • 1.2 Regression
  • 1.3 Classification
  • 2.1 Existence of Solutions
  • 2.2 Convexity
  • 2.3 Convex Sets
  • 3.1 Gradient
  • 3.2 First Order Conditions
  • 3.3 Least Squares
  • 3.4 Link with PCA
  • 3.5 Classification
  • 3.6 Chain Rule
  • 4.1 Steepest Descent Direction
  • 4.2 Gradient Descent
  • 5.1 Quadratic Case
  • 5.2 General Case
  • 5.3 Acceleration
  • 6.1 Bregman Divergences
  • 6.2 Mirror descent
  • 6.3 Re-parameterized flows
  • 6.4 Implicit Bias
  • 7.1 Penalized Least Squares
  • 7.2 Ridge Regression
  • 7.3 Lasso
  • 7.4 Iterative Soft Thresholding
  • 8.1 Minimizing Sums and Expectation
  • 8.2 Batch Gradient Descent (BGD)
  • 8.3 Stochastic Gradient Descent (SGD)
  • 8.4 Stochastic Gradient Descent with Averaging (SGA)
  • 8.5 Stochastic Averaged Gradient Descent (SAG)
  • 9.1 MLP and its derivative
  • 9.2 MLP and Gradient Computation
  • 9.3 Universality
  • 10.1 Finite Differences and Symbolic Calculus
  • 10.2 Computational Graphs
  • 10.3 Forward Mode of Automatic Differentiation
  • 10.4 Reverse Mode of Automatic Differentiation
  • 10.5 Feed-forward Compositions
  • 10.6 Feed-forward Architecture
  • 10.7 Recurrent Architectures

Contents

Course notes on Optimization for Machine Learning. Gabriel Peyré, CNRS, DMA, École Normale Supérieure. mathematical-tours.github.io, www.numerical-tours.com. March 30, 2021.

Regression

For regression, y_i ∈ R, in which case

f(x) = (1/2) ‖Ax − y‖², (3)

is the least squares quadratic risk function (see Fig. 1). Here ⟨u, v⟩ = Σ_{i=1}^p u_i v_i is the canonical inner product in R^p and ‖·‖² = ⟨·, ·⟩.
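As a quick numerical companion to (3), here is a minimal NumPy sketch; the matrix A, the targets y, and the random seed are synthetic placeholders, not data from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
A = rng.normal(size=(n, p))    # synthetic design matrix (rows a_i)
y = rng.normal(size=n)         # synthetic targets

def f(x):
    """Least squares risk (3): f(x) = 1/2 ||Ax - y||^2."""
    r = A @ x - y
    return 0.5 * r @ r

# Since ker(A) = {0} here (generically for a tall Gaussian matrix),
# the minimizer solves the normal equations A^T A x = A^T y.
x_star = np.linalg.solve(A.T @ A, A.T @ y)
```

At `x_star` the gradient Aᵀ(Ax − y) vanishes, and any other point has a larger risk.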

Classification

For classification, y_i ∈ {−1, 1}, in which case

f(x) = (1/n) Σ_{i=1}^n ℓ(−y_i ⟨x, a_i⟩) = L(−diag(y) A x), (4)

where ℓ is a smooth approximation of the 0-1 loss 1_{R^+}. For instance, ℓ(u) = log(1 + exp(u)), and diag(y) ∈ R^{n×n} is the diagonal matrix with the y_i along the diagonal (see Fig. 1, right). Here L is the separable loss function L(z) = (1/n) Σ_{i=1}^n ℓ(z_i).
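A minimal sketch of the logistic risk in NumPy follows; the features, labels, and the placement of the 1/n averaging inside f are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 2
A = rng.normal(size=(n, p))            # feature vectors a_i as rows
y = rng.choice([-1.0, 1.0], size=n)    # labels y_i in {-1, +1}

def ell(u):
    """Smooth surrogate of the 0-1 loss: ell(u) = log(1 + exp(u))."""
    return np.logaddexp(0.0, u)        # numerically stable form

def f(x):
    """Classification risk (4), averaged over the n samples."""
    return np.mean(ell(-y * (A @ x)))
```

A sanity check: at x = 0 every margin is zero, so the risk equals log 2 regardless of the data.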

Existence of Solutions

In general, the optimization problem (1) may have no solution. This happens in particular when the objective is unbounded below: for example, f(x) = −x² has infimum −∞, so there is no minimizer. The same phenomenon can occur even when the function is bounded below, if it does not grow at infinity; for instance, f(x) = e^{−x} has infimum 0 but no minimizer, since the function values remain positive and never reach zero.

To guarantee the existence of a minimizer and prevent the minimizer set from becoming unbounded, we replace the whole space R^p by a compact subset Ω ⊂ R^p, i.e. Ω is bounded and closed. By restricting the problem to Ω, we aim to show that f attains its minimum on Ω. If f is continuous on Ω, a minimizer exists; one may also use the weaker condition that f is lower semicontinuous, but here we focus on continuity on Ω.

Figure 2: Left: non-existence of minimizer, middle: multiple minimizers, right: uniqueness.

Figure 3 illustrates the coercivity condition for least squares. A function is coercive if f(x) → +∞ as ‖x‖ → +∞, meaning the objective grows without bound as the parameter vector x becomes unbounded. This property allows us to focus on a bounded search region, since minima cannot lie at infinity. Consequently, for any x_0 ∈ R^p one can consider its associated sublevel set L = {x ∈ R^p : f(x) ⩽ f(x_0)}, which is bounded under coercivity, enabling a finite-domain analysis of the least squares problem.

Let Ω = {x ∈ R^p : f(x) ⩽ f(x_0)}. This sublevel set is bounded by coercivity and closed because f is continuous. In fact, for convex functions, having a bounded, nonempty set of minimizers is equivalent to the function being coercive, although this equivalence does not hold for non-convex functions; for example, f(x) = min(1, x²) has a single minimizer but is not coercive.

Example 1 (Least squares). For the quadratic loss f(x) = (1/2) ‖Ax − y‖², coercivity holds if and only if ker(A) = {0}, which corresponds to the setting of an injective (typically overdetermined) A. If ker(A) ≠ {0} and x* is a minimizer, then x* + u is also a minimizer for every u ∈ ker(A), so the set of minimizers is unbounded. By contrast, if ker(A) = {0}, the minimizer is unique (see Fig. 3). If the loss ℓ is strictly convex, the same conclusion holds in the classification setting.
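The non-uniqueness in the kernel direction can be checked numerically; a sketch with a synthetic underdetermined system (the seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 5                    # p > n forces ker(A) != {0}
A = rng.normal(size=(n, p))
y = rng.normal(size=n)

def f(x):
    r = A @ x - y
    return 0.5 * r @ r

# lstsq returns the minimum-norm minimizer of f.
x_star = np.linalg.lstsq(A, y, rcond=None)[0]

# Right singular vectors beyond the rank span ker(A);
# u below is a (generically nonzero) element of the kernel.
_, _, Vt = np.linalg.svd(A)
u = Vt[n:].T @ rng.normal(size=p - n)
```

Translating `x_star` by `u` leaves the risk unchanged, so the set of minimizers is the unbounded affine space x* + ker(A).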

Convexity

Convex functions are the main class of functions that are relatively easy to optimize: all minimizers of a convex function are global minimizers, and there are often efficient methods to locate these minimizers, especially for smooth convex functions. A function f : R^p → R is convex if, for any two points x, y ∈ R^p and any t ∈ [0, 1], f(tx + (1 − t)y) ⩽ t f(x) + (1 − t) f(y). This defining property underpins why convex optimization is reliable: it eliminates the risk of getting trapped in local minima and enables efficient algorithms, such as gradient descent and its accelerated variants, to converge to the global minimum.

Let f be a convex function. For any x, y and any t ∈ [0, 1],

f((1 − t)x + t y) ⩽ (1 − t) f(x) + t f(y), (5)

meaning the graph lies below its secants (and, where defined, above its tangents, as illustrated in Fig. 4). If x* is a local minimizer of a convex f, then x* is a global minimizer, i.e. x* ∈ argmin f. Convex functions are especially convenient because they are closed under many transformations: if f and g are convex and a, b > 0, then a f + b g is convex, and max(f, g) is convex, so the set of convex functions is an infinite-dimensional convex cone; furthermore, if g : R^q → R is convex and B ∈ R^{q×p}, b ∈ R^q, then f(x) = g(Bx + b) is convex. This immediately shows that common square losses, such as the one in (3), are convex.

Indeed, ‖·‖²/2 is convex (as a sum of squares). Similarly, if ℓ (and hence L) is convex, then the classification loss function (4) is itself convex.

Figure 4: Convex vs. non-convex functions; strictly convex vs. non-strictly convex functions.

Figure 5: Comparison of convex functions f : R^p → R (for p = 1) and convex sets C ⊂ R^p (for p = 2).

Strict convexity. When f is convex, one can strengthen condition (5) and impose that the inequality is strict for t ∈ ]0, 1[ and x ≠ y (see Fig. 4, right), i.e.

f((1 − t)x + t y) < (1 − t) f(x) + t f(y).

In this case, if a minimizer x* exists, then it is unique. Indeed, if x*_1 ≠ x*_2 were two distinct minimizers, strict convexity would give f((x*_1 + x*_2)/2) < (1/2) f(x*_1) + (1/2) f(x*_2) = f(x*_1), which is impossible.

For the quadratic loss f(x) = (1/2) ‖Ax − y‖², strict convexity is equivalent to ker(A) = {0} (i.e. A has full column rank). The Hessian of f is ∇²f(x) = AᵀA, and strict convexity is guaranteed when the eigenvalues of AᵀA are strictly positive, i.e. when AᵀA is positive definite. Equivalently, ker(AᵀA) = {0}, and since zᵀ(AᵀA)z = ‖Az‖², AᵀAz = 0 implies z ∈ ker(A). Therefore ker(A) = {0} exactly when AᵀA has no zero eigenvalue, which is precisely the condition for strict convexity of f.
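This eigenvalue criterion is easy to test numerically; a sketch in which the helper name and the tolerance are my own conventions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_strictly_convex_quadratic(A, tol=1e-10):
    """f(x) = 1/2||Ax - y||^2 is strictly convex iff A^T A is positive
    definite, i.e. iff its smallest eigenvalue is strictly positive."""
    return np.linalg.eigvalsh(A.T @ A).min() > tol

A_tall = rng.normal(size=(10, 3))   # injective generically: ker(A) = {0}
A_wide = rng.normal(size=(3, 10))   # p > n forces ker(A) != {0}
```

`eigvalsh` is used because AᵀA is symmetric, which also guarantees real eigenvalues.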

Convex Sets

A subset Ω ⊂ R^p is convex if for any x, y ∈ Ω and any t ∈ [0, 1], the point (1 − t)x + t y also lies in Ω. The connection between convex sets and convex functions is that a function f is convex if and only if its epigraph epi(f) = {(x, t) ∈ R^{p+1} : t ⩾ f(x)} is a convex subset of R^{p+1}.

Remark: The minimizers x* may be non-unique, as Figure 3 illustrates. When f is convex, the set of minimizers argmin(f) is itself convex. Indeed, if x_1 and x_2 are minimizers with f(x_1) = f(x_2) = min f, then for any t ∈ [0, 1], f((1 − t)x_1 + t x_2) ⩽ (1 − t) f(x_1) + t f(x_2) = min f, so (1 − t)x_1 + t x_2 is also a minimizer. Figure 5 shows convex and non-convex sets.

Gradient

If f is differentiable along each axis, we denote

∇f(x) := ( ∂f(x)/∂x_1, …, ∂f(x)/∂x_p )ᵀ ∈ R^p

the gradient vector, so that ∇f : R^p → R^p is a vector field. Here the partial derivatives (when they exist) are defined as

∂f(x)/∂x_k := lim_{η→0} ( f(x + η δ_k) − f(x) ) / η,

where δ_k = (0, …, 0, 1, 0, …, 0)ᵀ ∈ R^p is the k-th canonical basis vector.

Beware that ∇f(x) can exist without f being differentiable at x. Differentiability of f at x reads

f(x + ε) = f(x) + ⟨ε, ∇f(x)⟩ + o(‖ε‖). (7)

Here, R(ε) = o(‖ε‖) denotes a quantity that decays faster than ‖ε‖ as ε → 0, i.e. R(ε)/‖ε‖ → 0 when ε → 0. The existence of partial derivatives corresponds to differentiability along the coordinate axes, while differentiability must hold for every converging sequence ε → 0, not just along a fixed direction. A two-dimensional counterexample illustrating the distinction is f(x) = 2 x_1 x_2² / (x_1² + x_2²) with f(0) = 0, which is affine (with a different slope) along each radial line, so all its directional derivatives at 0 exist, yet f is not differentiable at 0.

The gradient ∇f(x) is the unique vector that satisfies the relation (7). Consequently, a practical way to prove that f is differentiable at x and to obtain a formula for ∇f(x) is to exhibit an expansion of the form f(x + ε) = f(x) + ⟨g, ε⟩ + o(‖ε‖) as ε → 0, in which case the linear term identifies ∇f(x) with g.
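This identification can be illustrated on the least squares risk, where the remainder in the expansion is exactly quadratic; the data below is a synthetic placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
A = rng.normal(size=(n, p))
y = rng.normal(size=n)

def f(x):
    r = A @ x - y
    return 0.5 * r @ r

def grad_f(x):
    """Candidate gradient g = A^T(Ax - y), read off the expansion
    f(x+eps) = f(x) + <g, eps> + 1/2 ||A eps||^2."""
    return A.T @ (A @ x - y)

x = rng.normal(size=p)
g = grad_f(x)

# The remainder f(x+eps) - f(x) - <g, eps> equals 1/2||A eps||^2,
# which is O(||eps||^2) and hence o(||eps||): g is indeed grad f(x).
eps = 1e-4 * rng.normal(size=p)
remainder = f(x + eps) - f(x) - g @ eps
```

Because f is quadratic, the remainder identity is exact up to floating-point rounding.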

The following proposition shows that convexity is equivalent to the graph of the function being above its tangents.

Proposition 1. If f is differentiable, then

f convex ⇔ ∀ (x, x0), f(x) ⩾ f(x0) + ⟨∇f(x0), x − x0⟩.

Proof. One can write the convexity condition as

f((1 − t)x + t x0) ⩽ (1 − t) f(x) + t f(x0)  ⟹  ( f(x + t(x0 − x)) − f(x) ) / t ⩽ f(x0) − f(x),

hence, taking the limit t → 0, one obtains ⟨∇f(x), x0 − x⟩ ⩽ f(x0) − f(x).

To establish the other implication, consider the convex combination x_t := (1 − t)x + t x0 and apply the supporting inequality at x_t to both endpoints: f(x) ⩾ f(x_t) + ⟨∇f(x_t), x − x_t⟩ and f(x0) ⩾ f(x_t) + ⟨∇f(x_t), x0 − x_t⟩. Multiplying the first inequality by (1 − t) and the second by t, then summing, the inner-product terms cancel since (1 − t)(x − x_t) + t(x0 − x_t) = 0, which yields (1 − t) f(x) + t f(x0) ⩾ f(x_t), i.e. the convexity of f.
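The tangent inequality of Proposition 1 can be checked numerically on a convex quadratic; the matrix Q and the helper name tangent_gap are illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
M = rng.normal(size=(p, p))
Q = M.T @ M + np.eye(p)        # positive definite, so f below is convex
b = rng.normal(size=p)

def f(x):
    return 0.5 * x @ Q @ x + b @ x

def grad_f(x):
    return Q @ x + b

def tangent_gap(x, x0):
    """f(x) minus its tangent at x0; nonnegative for convex f
    (here it equals 1/2 (x-x0)^T Q (x-x0))."""
    return f(x) - (f(x0) + grad_f(x0) @ (x - x0))
```

The gap vanishes exactly when x = x0, matching the picture of the graph touching its tangent at the base point.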

Figure 6: Function with local maxima/minima (left), saddle point (middle) and global minimum (right).

First Order Conditions

The main theoretical interest of the gradient (we will see later that it also has algorithmic interest) is that it provides a necessary condition for optimality, as stated below.

Proposition 2. If x* is a local minimum of the function f (i.e. f(x*) ⩽ f(x) for all x in some ball around x*), then ∇f(x*) = 0.

Proof. For ε > 0 small enough and u fixed, one has

f(x*) ⩽ f(x* + ε u) = f(x*) + ε ⟨∇f(x*), u⟩ + o(ε)  ⟹  ⟨∇f(x*), u⟩ ⩾ o(1)  ⟹  ⟨∇f(x*), u⟩ ⩾ 0.

Applying this to both u and −u shows that ⟨∇f(x*), u⟩ = 0 for all u, and hence ∇f(x*) = 0.

Note that the converse is not true in general: a point with ∇f(x) = 0 does not necessarily lie at a local minimum. For example, at x = 0 the function f(x) = −x² has a maximum and f(x) = x³ has a saddle point, yet f′(0) = 0 in both cases (see Fig. 6). In practice, if ∇f(x*) = 0 but x* is not a local minimum, x* tends to be an unstable equilibrium, so gradient-based algorithms typically converge to points where ∇f(x*) = 0 and x* is a local minimizer. The following proposition shows that a much stronger result holds if f is convex.

Proposition 3. If f is convex and x* is a local minimum, then x* is also a global minimum. If f is differentiable and convex, then

x* ∈ argmin_x f(x)  ⇔  ∇f(x*) = 0.

Proof. For any x and t ∈ ]0, 1[ small enough, the point (1 − t)x* + t x lies in a ball around x* on which x* is minimal, so that f(x*) ⩽ f((1 − t)x* + t x) ⩽ (1 − t) f(x*) + t f(x), and hence f(x*) ⩽ f(x): the local minimum is global.

For gradient descent on a convex function with Lipschitz gradient, one can show that there exists C > 0 such that f(x_k) − f(x*) ⩽ C/k.

If furthermore f is μ-strongly convex, then there exists 0 ⩽ ρ < 1 such that f(x_k) − f(x*) ⩽ ρ^k ( f(x_0) − f(x*) ), i.e. linear convergence.

For the universality result: for any ε > 0, there exist an integer q ⩾ 1 and parameters w_k ∈ R^p, z_k ∈ R, and amplitudes u_k ∈ R (for k = 1, …, q) such that

sup_{x ∈ Ω} | f(x) − Σ_{k=1}^q u_k φ_{w_k, z_k}(x) | < ε.
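The gradient descent rates discussed above can be observed on a synthetic least squares problem; this sketch assumes the step size 1/L with L the largest eigenvalue of AᵀA, and for this quadratic the linear rate holds with ρ = 1 − μ/L:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 5
A = rng.normal(size=(n, p))   # synthetic data; ker(A) = {0} generically
y = rng.normal(size=n)

H = A.T @ A                               # Hessian of f
eigs = np.linalg.eigvalsh(H)
mu, L = eigs.min(), eigs.max()            # strong convexity / smoothness
x_star = np.linalg.solve(H, A.T @ y)      # unique minimizer

def f(x):
    r = A @ x - y
    return 0.5 * r @ r

# Gradient descent x_{k+1} = x_k - (1/L) grad f(x_k).
x = np.zeros(p)
gaps = []
for k in range(100):
    gaps.append(f(x) - f(x_star))
    x -= (1.0 / L) * (A.T @ (A @ x - y))

rho = 1.0 - mu / L   # contraction factor for this strongly convex quadratic
```

The recorded gaps f(x_k) − f(x*) decay geometrically, well below the ρ^k envelope.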

In a typical machine learning scenario, a hypothesis class with sufficiently large capacity can drive the training error toward zero, making overfitting possible. However, due to the bias-variance tradeoff, model capacity must be chosen carefully, and cross-validation is used to account for the finite amount of data and to ensure good generalization properties.

In one dimension, the proof can be understood as an approximation by smoothed step functions. By introducing a small regularization ε > 0 and assuming the function is Lipschitz to ensure uniform convergence, the smoothed mapping φ_{w,ε}(z) converges as ε → 0+ to the indicator of the half-line [−z/w, ∞), i.e. a Heaviside-type step function.

As ε → 0 and the parameters z and u tend to their limits, h converges to a piecewise constant function on the partition defined by the breakpoints t_k, with h taking the value u_k on each interval [t_k, t_{k+1}). Conversely, any piecewise constant function can be written in this form. If h assumes the constant value d_k on each interval [t_k, t_{k+1}), then h admits an indicator-based representation, for example h = Σ_k d_k χ_{[t_k, t_{k+1})}, or equivalently h = Σ_k d_k (1_{[t_k, ∞)} − 1_{[t_{k+1}, ∞)}), which shows the direct link between piecewise-constant behavior and a sum of shifted step functions.
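The step-function representation above can be verified numerically; the breakpoints t_k and values d_k below are hypothetical examples:

```python
import numpy as np

# Hypothetical breakpoints t_k and values d_k of a piecewise constant h.
t = np.array([0.0, 1.0, 2.5, 4.0])
d = np.array([2.0, -1.0, 0.5])       # h = d_k on [t_k, t_{k+1}), 0 elsewhere

def h_direct(x):
    """h as a direct lookup over the intervals [t_k, t_{k+1})."""
    out = np.zeros_like(x)
    for k in range(len(d)):
        out[(x >= t[k]) & (x < t[k + 1])] = d[k]
    return out

def step(x, s):
    """Heaviside step 1_{[s, +inf)}(x)."""
    return (x >= s).astype(float)

def h_steps(x):
    """Same h as sum_k d_k (1_{[t_k, inf)} - 1_{[t_{k+1}, inf)})."""
    return sum(d[k] * (step(x, t[k]) - step(x, t[k + 1])) for k in range(len(d)))
```

Both expressions agree on a fine grid, since each difference of steps is exactly the indicator of one interval.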

Since the space of piecewise constant functions is dense (for the uniform norm) in the space of continuous functions on an interval, this proves the theorem.

Proof in arbitrary dimension p. We start by proving the following dual characterization of density, using bounded Borel measures μ ∈ M(Ω), i.e. such that μ(Ω) < +∞.

Posted: 09/09/2022, 20:05