Lecture notes on Advanced Optimization - Chapter 7: Subgradient method. The lecture covers: last last time - gradient descent, the subgradient method, step size choices, convergence analysis, Lipschitz continuity, proof of convergence, and more.
Page 1: Subgradient Method
Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
Page 2: Last last time: gradient descent

Consider the problem

    min_x f(x)

for f convex and differentiable, dom(f) = R^n.

Gradient descent: choose initial x^(0) ∈ R^n, repeat:

    x^(k) = x^(k-1) - t_k · ∇f(x^(k-1)),  k = 1, 2, 3, ...

- Requires f differentiable: addressed this lecture
- Can be slow to converge: addressed next lecture
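The loop above can be sketched in a few lines of Python; the quadratic objective and all names here are illustrative choices of mine, not from the slides:

```python
import numpy as np

def gradient_descent(grad_f, x0, t, num_iters):
    """Gradient descent with fixed step size: x_k = x_{k-1} - t * grad_f(x_{k-1})."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - t * grad_f(x)
    return x

# Illustrative smooth convex problem: f(x) = (1/2) ||x - c||_2^2, so grad f(x) = x - c
c = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: x - c, x0=np.zeros(2), t=0.5, num_iters=50)
```

Here the minimizer is x = c, and the iterates contract toward it geometrically.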
Page 3: Subgradient method

Now consider f convex with dom(f) = R^n, but not necessarily differentiable.

Subgradient method: like gradient descent, but replacing gradients with subgradients, i.e., initialize x^(0) and repeat:

    x^(k) = x^(k-1) - t_k · g^(k-1),  k = 1, 2, 3, ...,

where g^(k-1) ∈ ∂f(x^(k-1)) is any subgradient of f at x^(k-1).

The subgradient method is not necessarily a descent method, so we keep track of the best iterate x_best^(k) among x^(0), ..., x^(k) so far, i.e.,

    f(x_best^(k)) = min_{i=0,...,k} f(x^(i))
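A minimal sketch of this method in Python, with the best-iterate bookkeeping; the 1-norm objective and its sign subgradient are my illustrative choices, not the slides':

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, num_iters):
    """Subgradient method: x_k = x_{k-1} - t_k * g_{k-1}, g_{k-1} a subgradient.
    Not a descent method, so track and return the best iterate seen so far."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, num_iters + 1):
        g = subgrad(x)
        x = x - step(k) * g
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Illustrative nonsmooth problem: f(x) = ||x||_1; sign(x) is a valid subgradient
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)
x_best, f_best = subgradient_method(f, subgrad, x0=np.array([3.0, -2.0]),
                                    step=lambda k: 1.0 / k, num_iters=200)
```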
Page 6: Step size choices

- Fixed step sizes: t_k = t for all k = 1, 2, 3, ...
- Fixed step length, i.e., t_k = s/‖g^(k-1)‖_2, and hence ‖x^(k) - x^(k-1)‖_2 = t_k ‖g^(k-1)‖_2 = s

There are several other options too, but the key difference to gradient descent: step sizes are pre-specified, not adaptively computed.
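These rules (plus the diminishing rule used later in the convergence analysis) can be written as small step-size functions; a sketch with hypothetical names:

```python
import numpy as np

def fixed_step(t):
    """Fixed step size: t_k = t for all k."""
    return lambda k, g: t

def fixed_length(s):
    """Fixed step length: t_k = s / ||g_{k-1}||_2, so every step moves distance s."""
    return lambda k, g: s / np.linalg.norm(g)

def diminishing(c=1.0):
    """Diminishing step sizes, square summable but not summable, e.g. t_k = c / k."""
    return lambda k, g: c / k

# With fixed step length: ||x_k - x_{k-1}||_2 = t_k * ||g_{k-1}||_2 = s
g = np.array([3.0, 4.0])                 # example subgradient with ||g||_2 = 5
t = fixed_length(0.1)(1, g)
step_norm = t * np.linalg.norm(g)        # equals s = 0.1
```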
Page 8: Lipschitz continuity

Before the proof, let us consider the Lipschitz continuity assumption.

Lemma. f is Lipschitz continuous with constant L > 0, i.e.,

    |f(x) - f(y)| ≤ L ‖x - y‖_2  for all x, y,

if and only if ‖g‖_2 ≤ L for all x and all subgradients g ∈ ∂f(x).
Page 10: Convergence analysis - Proof

Both results can be proved from the same basic inequality. Key steps:

- Using the definition of subgradient,

    ‖x^(k) - x*‖_2² = ‖x^(k-1) - t_k g^(k-1) - x*‖_2²
                    = ‖x^(k-1) - x*‖_2² - 2 t_k (g^(k-1))^T (x^(k-1) - x*) + t_k² ‖g^(k-1)‖_2²
                    ≤ ‖x^(k-1) - x*‖_2² - 2 t_k (f(x^(k-1)) - f(x*)) + t_k² ‖g^(k-1)‖_2²
Page 12: Convergence analysis - Proof

- Iterating the inequality above, then using ‖x^(k) - x*‖_2 ≥ 0 and letting R = ‖x^(0) - x*‖_2, we have

    0 ≤ R² - 2 Σ_{i=1}^k t_i (f(x^(i-1)) - f(x*)) + Σ_{i=1}^k t_i² ‖g^(i-1)‖_2²
Page 13: Convergence analysis - Proof

- Using ‖x^(k) - x*‖_2 ≥ 0 and letting R = ‖x^(0) - x*‖_2, we have

    0 ≤ R² - 2 Σ_{i=1}^k t_i (f(x^(i-1)) - f(x*)) + Σ_{i=1}^k t_i² ‖g^(i-1)‖_2²

- Introducing f(x_best^(k)) = min_{i=0,...,k} f(x^(i)) and rearranging, we have the basic inequality

    f(x_best^(k)) - f(x*) ≤ (R² + Σ_{i=1}^k t_i² ‖g^(i-1)‖_2²) / (2 Σ_{i=1}^k t_i)
Page 15: Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

    f(x_best^(k)) - f(x*) ≤ (R² + Σ_{i=1}^k t_i² ‖g^(i-1)‖_2²) / (2 Σ_{i=1}^k t_i)

With fixed step size t_i = t and ‖g^(i-1)‖_2 ≤ L, this gives

    f(x_best^(k)) - f(x*) ≤ R²/(2kt) + L²t/2

- Does not guarantee convergence (as k → ∞)
- For large k, f(x_best^(k)) is approximately L²t/2-suboptimal
Page 18: Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

    f(x_best^(k)) - f(x*) ≤ (R² + Σ_{i=1}^k t_i² ‖g^(i-1)‖_2²) / (2 Σ_{i=1}^k t_i)

With fixed step length, i.e., t_i = s/‖g^(i-1)‖_2, we have

    f(x_best^(k)) - f(x*) ≤ (R² + ks²) / (2s Σ_{i=1}^k ‖g^(i-1)‖_2^(-1))
                          ≤ (R² + ks²) / (2s Σ_{i=1}^k L^(-1)) = LR²/(2ks) + Ls/2

- Does not guarantee convergence (as k → ∞)
- For large k, f(x_best^(k)) is approximately Ls/2-suboptimal
- To make the gap ≤ ε, let us make each term ≤ ε/2. So we can choose s = ε/L and k = LR²/s · 1/ε = R²L²/ε²
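A quick numerical sanity check of the choice s = ε/L, k = R²L²/ε², on a toy problem of mine: f(x) = |x|, with L = 1 and f* = 0. The start point is chosen so the iterates never hit x = 0, where the fixed-length rule would divide by ‖g‖_2 = 0:

```python
import numpy as np

f = lambda x: abs(x)                      # toy objective, Lipschitz with L = 1
subgrad = lambda x: np.sign(x)

L, eps = 1.0, 0.1
x = 1.05                                  # start; R = |x - x*| = 1.05
R = abs(x)
s = eps / L                               # fixed step length s = eps / L
k = int(np.ceil(R**2 * L**2 / eps**2))    # k = R^2 L^2 / eps^2 iterations

f_best = f(x)
for _ in range(k):
    g = subgrad(x)
    x = x - (s / abs(g)) * g              # fixed step length: move distance s
    f_best = min(f_best, f(x))
# Theory: f_best - f* <= L R^2/(2ks) + Ls/2, which is about eps here
```

The iterates step down to 0.05, then oscillate between ±0.05, so the best gap is 0.05 ≤ ε, as the bound promises.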
Page 20: Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

    f(x_best^(k)) - f(x*) ≤ (R² + Σ_{i=1}^k t_i² ‖g^(i-1)‖_2²) / (2 Σ_{i=1}^k t_i)

From this and the Lipschitz continuity (‖g^(i-1)‖_2 ≤ L), we have

    f(x_best^(k)) - f(x*) ≤ (R² + L² Σ_{i=1}^k t_i²) / (2 Σ_{i=1}^k t_i)

With diminishing step sizes, Σ_{i=1}^∞ t_i = ∞ and Σ_{i=1}^∞ t_i² < ∞, there holds

    lim_{k→∞} f(x_best^(k)) = f*
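A small numeric illustration of the contrast (my toy problem again: f(x) = |x|, f* = 0): a fixed step size stalls near L²t/2-suboptimality, while diminishing step sizes t_k = 1/k keep improving:

```python
import numpy as np

def run(step, x0=1.05, iters=5000):
    """Subgradient method on f(x) = |x|; returns the best value f(x_best^(k))."""
    x, f_best = x0, abs(x0)
    for k in range(1, iters + 1):
        x = x - step(k) * np.sign(x)
        f_best = min(f_best, abs(x))
    return f_best

best_fixed = run(lambda k: 0.1)       # fixed t: stalls around L^2 t / 2 = 0.05
best_dimin = run(lambda k: 1.0 / k)   # sum t_k = inf, sum t_k^2 < inf: converges
```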
Page 21: Example: 1-norm minimization

[Figure: convergence plots for the 1-norm minimization example.]

Page 22: Diminishing step size: t_k = 0.01/√k

[Figure: convergence plot for this step size choice.]
Page 23: Example: regularized logistic regression

Given (x_i, y_i) ∈ R^p × {0, 1} for i = 1, ..., n, the logistic regression loss is

    f(β) = Σ_{i=1}^n ( -y_i x_i^T β + log(1 + exp(x_i^T β)) )

This is a smooth and convex function with

    ∇f(β) = Σ_{i=1}^n ( p_i(β) - y_i ) x_i,  where p_i(β) = exp(x_i^T β) / (1 + exp(x_i^T β))
Page 24: Example: regularized logistic regression

We add a penalty and consider min_β f(β) + λ·P(β), with P(β) = ‖β‖_2² (ridge) or P(β) = ‖β‖_1 (lasso). Ridge: use gradients; lasso: use subgradients.

[Figure: convergence comparison of gradient descent (ridge) and the subgradient method (lasso).]

Step sizes were hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors).
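A sketch of the lasso case in Python: subgradient steps on f(β) + λ‖β‖_1, with f the logistic loss above. The data, λ, and step sizes are made up for illustration, not the slides' example:

```python
import numpy as np

def logistic_loss(beta, X, y):
    """f(beta) = sum_i [ -y_i x_i^T beta + log(1 + exp(x_i^T beta)) ]"""
    z = X @ beta
    return np.sum(-y * z + np.log1p(np.exp(z)))

def lasso_subgrad(beta, X, y, lam):
    """A subgradient of f(beta) + lam * ||beta||_1 (choosing 0 where beta_j = 0)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # p_i(beta)
    return X.T @ (p - y) + lam * np.sign(beta)

# Made-up small instance
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = (X[:, 0] > 0).astype(float)
lam = 1.0
obj = lambda b: logistic_loss(b, X, y) + lam * np.abs(b).sum()

beta = np.zeros(5)
beta_best, obj_best = beta.copy(), obj(beta)
for k in range(1, 501):                        # diminishing steps t_k = 0.1 / sqrt(k)
    beta = beta - 0.1 / np.sqrt(k) * lasso_subgrad(beta, X, y, lam)
    if obj(beta) < obj_best:
        beta_best, obj_best = beta.copy(), obj(beta)
```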
Page 25: Polyak step sizes

Polyak step sizes: when the optimal value f* is known, take

    t_k = (f(x^(k-1)) - f*) / ‖g^(k-1)‖_2²,  k = 1, 2, 3, ...

This can be motivated from the first step of the proof:

    ‖x^(k) - x*‖_2² ≤ ‖x^(k-1) - x*‖_2² - 2 t_k (f(x^(k-1)) - f(x*)) + t_k² ‖g^(k-1)‖_2²

The Polyak step size minimizes the right-hand side.

With Polyak step sizes, one can show the subgradient method converges to the optimal value. The convergence rate is still O(1/ε²):

    f(x_best^(k)) - f(x*) ≤ LR/√k

(Proof: see slide 11, http://www.seas.ucla.edu/~vandenbe/236C/lectures/sgmethod.pdf)
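A sketch of Polyak steps on a toy problem of mine where f* is known: f(x) = ‖x - c‖_1, with f* = 0 at x = c:

```python
import numpy as np

c = np.array([1.0, -2.0, 0.5])            # f(x) = ||x - c||_1, minimized at x = c
f = lambda x: np.abs(x - c).sum()
subgrad = lambda x: np.sign(x - c)

x = np.zeros(3)
f_best = f(x)
for _ in range(500):
    g = subgrad(x)
    gap = f(x) - 0.0                      # f(x^(k-1)) - f*, with f* = 0 known
    if gap == 0.0 or not g.any():
        break                             # already optimal
    x = x - (gap / (g @ g)) * g           # Polyak step t_k = gap / ||g||_2^2
    f_best = min(f_best, f(x))
```

On this polyhedral example the iterates converge very quickly; the general guarantee is the LR/√k bound above.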
Page 26: Example: intersection of sets

Suppose we want to find x* ∈ C_1 ∩ ... ∩ C_m, i.e., find a point in the intersection of closed, convex sets C_1, ..., C_m.

First define

    f_i(x) = dist(x, C_i),  i = 1, ..., m
    f(x) = max_{i=1,...,m} f_i(x)

and now solve

    min_x f(x)

Check: is this convex?

Note that f* = 0 ⟺ x* ∈ C_1 ∩ ... ∩ C_m.
Page 27: Example: intersection of sets

Recall the distance function dist(x, C) = min_{y∈C} ‖y - x‖_2. Last time we computed its gradient

    ∇dist(x, C) = (x - P_C(x)) / ‖x - P_C(x)‖_2

where P_C(x) is the projection of x onto C.

Also recall the subgradient rule: if f(x) = max_{i=1,...,m} f_i(x), then

    ∂f(x) = conv( ⋃_{i : f_i(x) = f(x)} ∂f_i(x) )

i.e., the convex hull of the union of the subdifferentials of all active functions.
Page 28: Example: intersection of sets

Put these two facts together for the intersection of sets problem, with f_i(x) = dist(x, C_i): if C_i is the farthest set from x (so f_i(x) = f(x)), and

    g_i = ∇f_i(x) = (x - P_{C_i}(x)) / ‖x - P_{C_i}(x)‖_2

then g_i ∈ ∂f(x).

Now apply the subgradient method with Polyak step size t_k = f(x^(k-1)). At iteration k, with C_i farthest from x^(k-1), we perform the update

    x^(k) = x^(k-1) - f(x^(k-1)) · (x^(k-1) - P_{C_i}(x^(k-1))) / ‖x^(k-1) - P_{C_i}(x^(k-1))‖_2
          = P_{C_i}(x^(k-1)),

since

    f(x^(k-1)) = dist(x^(k-1), C_i) = ‖x^(k-1) - P_{C_i}(x^(k-1))‖_2
Page 29: Example: intersection of sets

For two sets, this is the famous alternating projections algorithm¹, i.e., just keep projecting back and forth.

[Figure: alternating projections between two convex sets. (From Boyd's lecture notes)]

1 von Neumann (1950), "Functional operators, volume II: The geometry of orthogonal spaces"
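The back-and-forth update can be sketched directly for two illustrative sets with closed-form projections (a halfspace and the Euclidean unit ball; the sets and names are mine, not the slides'):

```python
import numpy as np

# Two closed convex sets with easy projections (illustrative choices):
# C1 = {x : a^T x <= b} (halfspace), C2 = {x : ||x||_2 <= 1} (unit ball)
a, b = np.array([1.0, 1.0]), 0.5

def proj_halfspace(x):
    """P_{C1}(x): move back along a if a^T x > b."""
    viol = a @ x - b
    return x if viol <= 0 else x - (viol / (a @ a)) * a

def proj_ball(x):
    """P_{C2}(x): rescale onto the unit ball if ||x||_2 > 1."""
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n

# Alternating projections: keep projecting back and forth between C1 and C2
x = np.array([3.0, 2.0])
for _ in range(100):
    x = proj_ball(proj_halfspace(x))
```

Since the intersection here is nonempty, the iterates land in a point satisfying both constraints.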
Page 30: Projected subgradient method

To optimize a convex function f over a convex set C,

    min_x f(x)  subject to x ∈ C

we can use the projected subgradient method. It is just like the usual subgradient method, except we project onto C at each iteration:

    x^(k) = P_C(x^(k-1) - t_k · g^(k-1)),  k = 1, 2, 3, ...

Assuming we can do this projection, we get the same convergence guarantees as the usual subgradient method, with the same step size choices.
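A sketch of this method on a made-up problem: minimize f(x) = ‖x - c‖_1 subject to x ≥ 0, where the projection onto the nonnegative orthant is coordinatewise clipping:

```python
import numpy as np

# With c = (1, -2): the constrained solution is x* = (1, 0), so f* = 2
c = np.array([1.0, -2.0])
f = lambda x: np.abs(x - c).sum()
subgrad = lambda x: np.sign(x - c)
proj = lambda x: np.maximum(x, 0.0)       # P_C for C = nonnegative orthant

x = np.array([5.0, 5.0])
x_best, f_best = x.copy(), f(x)
for k in range(1, 1001):
    x = proj(x - (1.0 / k) * subgrad(x))  # project after each subgradient step
    if f(x) < f_best:
        x_best, f_best = x.copy(), f(x)
```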
Page 31: Projected subgradient method

What sets C are easy to project onto? Lots, e.g.,

- Affine images: {Ax + b : x ∈ R^n}
- Solution set of a linear system: {x : Ax = b}
- Nonnegative orthant: R^n_+ = {x : x ≥ 0}
- Some norm balls: {x : ‖x‖_p ≤ 1} for p = 1, 2, ∞
- Some simple polyhedra and simple cones

Warning: it is easy to write down a seemingly simple set C for which P_C turns out to be very hard! E.g., it is generally hard to project onto an arbitrary polyhedron C = {x : Ax ≤ b}.

Note: projected gradient descent works too; more next time.
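Two of the easy projections from the list, as hedged sketches (A is assumed to have full row rank; the data is made up):

```python
import numpy as np

def proj_affine(x, A, b):
    """Projection onto {x : Ax = b}: x - A^T (A A^T)^{-1} (Ax - b)."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def proj_linf_ball(x):
    """Projection onto {x : ||x||_inf <= 1}: clip each coordinate to [-1, 1]."""
    return np.clip(x, -1.0, 1.0)

A = np.array([[1.0, 1.0, 0.0]])
b = np.array([1.0])
y = proj_affine(np.array([3.0, -1.0, 7.0]), A, b)   # now A @ y equals b
z = proj_linf_ball(np.array([0.5, -3.0]))           # -> [0.5, -1.0]
```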
Page 32: Can we do better?

Upside of the subgradient method: broad applicability.

Downside: the O(1/ε²) convergence rate over the problem class of convex, Lipschitz functions is really slow.

Nonsmooth first-order methods: iterative methods updating x^(k) in

    x^(0) + span{g^(0), g^(1), ..., g^(k-1)}

where the subgradients g^(0), g^(1), ..., g^(k-1) come from a weak oracle.

Theorem (Nesterov): for any k ≤ n - 1 and starting point x^(0), there is a function in the problem class such that any nonsmooth first-order method satisfies

    f(x^(k)) - f(x*) ≥ RL / (2(1 + √(k+1)))
Page 33: Improving on the subgradient method

In words, we cannot do better than the O(1/ε²) rate of the subgradient method (unless we go beyond nonsmooth first-order methods).

So instead of trying to improve across the board, we will focus on minimizing composite functions of the form

    f(x) = g(x) + h(x)

where g is convex and differentiable, and h is convex and nonsmooth but "simple".

For a lot of problems (i.e., functions h), we can recover the O(1/ε) rate of gradient descent with a simple algorithm, which has important practical consequences.
Page 34: References and further reading

- S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010-2011
- Y. Nesterov (1998), Introductory lectures on convex optimization: a basic course, Chapter 3
- B. Polyak (1987), Introduction to optimization, Chapter 5
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012