Lecture notes for Advanced Optimization, Chapter 8: Proximal Gradient Descent (and Acceleration). Topics covered: the subgradient method, decomposable functions, the proximal mapping, proximal gradient descent, and acceleration.
Proximal Gradient Descent (and Acceleration)
Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
Last time: subgradient method
Consider the problem
$$\min_x f(x)$$
with $f$ convex and $\mathrm{dom}(f) = \mathbb{R}^n$.

Subgradient method: choose an initial $x^{(0)} \in \mathbb{R}^n$, and repeat:
$$x^{(k)} = x^{(k-1)} - t_k \cdot g^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $g^{(k-1)} \in \partial f(x^{(k-1)})$. We use pre-set rules for the step sizes (e.g., a diminishing step size rule).

If $f$ is Lipschitz, then the subgradient method has convergence rate $O(1/\varepsilon^2)$.

Upside: very generic. Downside: can be slow; addressed today.
Decomposable functions
Suppose
$$f(x) = g(x) + h(x)$$
where:
- $g$ is convex, differentiable, $\mathrm{dom}(g) = \mathbb{R}^n$
- $h$ is convex, not necessarily differentiable

If $f$ were differentiable, then the gradient descent update would be
$$x^+ = x - t \cdot \nabla f(x)$$
Recall the motivation: minimize the quadratic approximation to $f$ around $x$, replacing $\nabla^2 f(x)$ by $\frac{1}{t} I$:
$$x^+ = \operatorname*{argmin}_z \; f(x) + \nabla f(x)^T (z - x) + \frac{1}{2t} \|z - x\|_2^2$$
In our case $f$ is not differentiable, but $f = g + h$ with $g$ differentiable. Why don't we make a quadratic approximation to $g$ and leave $h$ alone? I.e., update
$$x^+ = \operatorname*{argmin}_z \; \tilde g_t(z) + h(z) = \operatorname*{argmin}_z \; g(x) + \nabla g(x)^T (z - x) + \frac{1}{2t} \|z - x\|_2^2 + h(z)$$
The first two terms are the quadratic approximation to $g$, the term $\frac{1}{2t} \|z - x\|_2^2$ keeps $z$ close to the gradient update for $g$, and $h(z)$ is left alone.
Proximal mapping

Define the proximal mapping of a convex function $h$:
$$\operatorname{prox}_h(x) = \operatorname*{argmin}_z \; \frac{1}{2} \|x - z\|_2^2 + h(z)$$
Uniqueness holds since the objective function is strictly convex.

Optimality condition:
$$z = \operatorname{prox}_h(x) \iff x - z \in \partial h(z)$$
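To make this concrete, here is a minimal numerical sketch (ours, not from the slides) for $h(z) = \lambda \|z\|_1$, whose prox is the soft-thresholding operator that reappears later in the lasso example; the function name soft_threshold is our choice.

```python
import numpy as np

def soft_threshold(x, lam):
    """prox of h(z) = lam * ||z||_1: shrink each entry toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([3.0, -0.5, 1.2])
lam = 1.0
z = soft_threshold(x, lam)   # z = prox_h(x)

# Optimality condition x - z in subdifferential of h at z: entrywise,
# (x - z)_i = lam * sign(z_i) when z_i != 0, and (x - z)_i lies in
# [-lam, lam] when z_i = 0.
print(z)      # [ 2.  -0.   0.2]
print(x - z)  # [ 1.  -0.5  1. ]  -- consistent with the condition above
```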
Proximal gradient descent

Proximal gradient descent: choose an initial $x^{(0)}$, and repeat:
$$x^{(k)} = \operatorname{prox}_{t_k h}\big(x^{(k-1)} - t_k \cdot \nabla g(x^{(k-1)})\big), \quad k = 1, 2, 3, \ldots$$
To make this update step look familiar, we can rewrite it as
$$x^{(k)} = x^{(k-1)} - t_k \cdot G_{t_k}(x^{(k-1)})$$
where $G_t$ is the generalized gradient of $f$,
$$G_t(x) = \frac{x - \operatorname{prox}_{th}\big(x - t \nabla g(x)\big)}{t}$$
For $h = 0$ it is gradient descent.
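The update translates directly into code. Below is a minimal sketch (our illustration, not part of the original slides): it takes $\nabla g$ and the prox as callables, so the specific examples later in the chapter can be plugged in.

```python
import numpy as np

def proximal_gradient(grad_g, prox, x0, t, num_iters=100):
    """Proximal gradient descent with fixed step size t:
    x_k = prox_{t*h}(x_{k-1} - t * grad_g(x_{k-1}))."""
    x = x0
    for _ in range(num_iters):
        x = prox(x - t * grad_g(x), t)
    return x

# Sanity check with h = 0 (identity prox), i.e., plain gradient descent
# on g(x) = 0.5 * ||x - a||^2 for an arbitrary vector a:
a = np.array([1.0, -2.0, 3.0])
x_star = proximal_gradient(grad_g=lambda x: x - a,
                           prox=lambda v, t: v,   # prox of h = 0
                           x0=np.zeros(3), t=0.5)
print(x_star)  # converges to a
```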
What good did this do?

You have a right to be suspicious: it may look like we just swapped one minimization problem for another.

The key point is that $\operatorname{prox}_h(\cdot)$ can be computed analytically for a lot of important functions $h$.

Note:
- The mapping $\operatorname{prox}_h(\cdot)$ doesn't depend on $g$ at all, only on $h$
- The smooth part $g$ can be complicated; we only need to compute its gradients

Convergence analysis will be in terms of the number of iterations of the algorithm. Each iteration evaluates $\operatorname{prox}_h(\cdot)$ once, and this can be cheap or expensive, depending on $h$.
Example: ISTA (Iterative Shrinkage-Thresholding Algorithm)

Given $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, recall the lasso criterion
$$f(\beta) = \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
The proximal mapping is now
$$\operatorname{prox}_{th}(\beta) = \operatorname*{argmin}_z \; \frac{1}{2t} \|\beta - z\|_2^2 + \lambda \|z\|_1 = S_{\lambda t}(\beta)$$
where $S_\lambda(\beta)$ is the soft-thresholding operator,
$$[S_\lambda(\beta)]_i = \begin{cases} \beta_i - \lambda & \text{if } \beta_i > \lambda \\ 0 & \text{if } -\lambda \le \beta_i \le \lambda \\ \beta_i + \lambda & \text{if } \beta_i < -\lambda \end{cases}$$
Recall $\nabla g(\beta) = -X^T(y - X\beta)$, hence the proximal gradient update is
$$\beta^+ = S_{\lambda t}\big(\beta + t X^T (y - X\beta)\big)$$
This is the ISTA update; see Beck and Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems".
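As a hedged illustration (our code, not the paper's implementation), here is ISTA with fixed step size $t = 1/L$, where $L = \|X\|_2^2$ is the Lipschitz constant of $\nabla g$; the synthetic data below are arbitrary.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(X, y, lam, num_iters=500):
    """ISTA: beta+ = S_{lam*t}(beta + t * X^T (y - X beta)),
    with fixed step t = 1/L, L = ||X||_2^2 (largest eigenvalue of X^T X)."""
    t = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
    return beta

# Small synthetic lasso instance (sizes and seed arbitrary):
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
beta_true = np.zeros(100)
beta_true[:5] = 3.0
y = X @ beta_true + 0.1 * rng.standard_normal(50)
print(ista(X, y, lam=1.0)[:8])  # leading entries: approximately sparse recovery
```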
Backtracking line search

Backtracking for proximal gradient descent works similarly as before (in gradient descent), but operates on $g$ and not $f$:

Choose parameter $0 < \beta < 1$. At each iteration, start at $t = t_{\text{init}}$, and while
$$g\big(x - t G_t(x)\big) > g(x) - t \nabla g(x)^T G_t(x) + \frac{t}{2} \|G_t(x)\|_2^2$$
shrink $t = \beta t$. Else perform the proximal gradient update.

(Alternative formulations exist that require less computation, i.e., fewer calls to prox.)
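A sketch of this rule in code (ours; the parameter defaults are arbitrary choices, and the arrays are assumed to be NumPy vectors):

```python
def backtracking_step(g, grad_g, prox, x, t_init=1.0, beta=0.5):
    """One backtracking pass: shrink t until
    g(x - t*G_t(x)) <= g(x) - t*grad_g(x)^T G_t(x) + (t/2)*||G_t(x)||_2^2,
    then return the accepted proximal gradient step and the step size."""
    t = t_init
    while True:
        G = (x - prox(x - t * grad_g(x), t)) / t   # generalized gradient G_t(x)
        x_new = x - t * G                          # equals prox(x - t*grad_g(x), t)
        if g(x_new) <= g(x) - t * grad_g(x) @ G + (t / 2) * (G @ G):
            return x_new, t
        t *= beta                                  # shrink and retry
```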
Convergence analysis

For the criterion $f(x) = g(x) + h(x)$, we assume:
- $g$ is convex, differentiable, $\mathrm{dom}(g) = \mathbb{R}^n$, and $\nabla g$ is Lipschitz continuous with constant $L > 0$
- $h$ is convex, and $\operatorname{prox}_{th}(x) = \operatorname*{argmin}_z \{\frac{1}{2t} \|x - z\|_2^2 + h(z)\}$ can be evaluated

Proximal gradient descent has convergence rate $O(1/k)$, or equivalently $O(1/\varepsilon)$. Same as gradient descent! (But remember, the prox cost matters.)

Proof: see http://www.seas.ucla.edu/~vandenbe/236C/lectures/proxgrad.pdf
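For completeness, the standard statement of this guarantee (as in the linked notes) for a fixed step size $t \le 1/L$ is
$$f(x^{(k)}) - f^\star \;\le\; \frac{\|x^{(0)} - x^\star\|_2^2}{2tk},$$
so reaching accuracy $f(x^{(k)}) - f^\star \le \varepsilon$ takes $O(1/\varepsilon)$ iterations.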
Example: matrix completion

Given a matrix $Y \in \mathbb{R}^{m \times n}$, we only observe the entries $Y_{ij}$, $(i,j) \in \Omega$. Suppose we want to fill in the missing entries (e.g., for a recommender system), so we solve the matrix completion problem³
$$\min_B \; \frac{1}{2} \sum_{(i,j) \in \Omega} (Y_{ij} - B_{ij})^2 + \lambda \|B\|_{\mathrm{tr}}$$
Here $\|B\|_{\mathrm{tr}}$ is the trace (or nuclear) norm of $B$,
$$\|B\|_{\mathrm{tr}} = \sum_{i=1}^{r} \sigma_i(B)$$
where $r = \operatorname{rank}(B)$ and $\sigma_1(B) \ge \cdots \ge \sigma_r(B) \ge 0$ are the singular values.⁴

³ Wikipedia: in the case of the Netflix problem, the ratings matrix is expected to be low-rank, since user preferences can often be described by a few factors, such as the movie genre and time of release.
⁴ https://math.berkeley.edu/~hutching/teach/54-2017/svd-notes.pdf
Define $P_\Omega$, the projection operator onto the observed set:
$$[P_\Omega(B)]_{ij} = \begin{cases} B_{ij} & (i,j) \in \Omega \\ 0 & (i,j) \notin \Omega \end{cases}$$
Two ingredients are needed for proximal gradient descent:
- Gradient calculation:
$$\nabla g(B) = -\big(P_\Omega(Y) - P_\Omega(B)\big)$$
- Prox function:
$$\operatorname{prox}_t(B) = \operatorname*{argmin}_Z \; \frac{1}{2t} \|B - Z\|_F^2 + \lambda \|Z\|_{\mathrm{tr}}$$
Claim:
$$\operatorname{prox}_t(B) = S_{\lambda t}(B),$$
matrix soft-thresholding at the level $\lambda t$. Here $S_\lambda(B)$ is defined by
$$S_\lambda(B) = U \Sigma_\lambda V^T$$
where $B = U \Sigma V^T$ is an SVD, and $\Sigma_\lambda$ is diagonal with
$$(\Sigma_\lambda)_{ii} = \max\{\Sigma_{ii} - \lambda, 0\}$$
Proof: note that $\operatorname{prox}_t(B) = Z$, where $Z$ satisfies
$$0 \in Z - B + \lambda t \cdot \partial \|Z\|_{\mathrm{tr}}$$
and one can check that $Z = S_{\lambda t}(B)$ satisfies this condition, using the known form of the subdifferential of the trace norm.
Hence the proximal gradient update step is
$$B^+ = S_{\lambda t}\big(B + t(P_\Omega(Y) - P_\Omega(B))\big)$$
Note that $\nabla g(B)$ is Lipschitz continuous with $L = 1$, so we can choose the fixed step size $t = 1$. The update step is then
$$B^+ = S_\lambda\big(P_\Omega(Y) + P_{\Omega^\perp}(B)\big)$$
where $P_{\Omega^\perp}$ projects onto the unobserved set, so that $P_\Omega(B) + P_{\Omega^\perp}(B) = B$.
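A compact sketch of this update in code (ours, not from the slides; a boolean mask plays the role of $\Omega$):

```python
import numpy as np

def matrix_soft_threshold(B, lam):
    """S_lam(B) = U Sigma_lam V^T: soft-threshold the singular values of B."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def matrix_completion(Y, mask, lam, num_iters=100):
    """Proximal gradient with fixed step t = 1 (valid since L = 1):
    B+ = S_lam(P_Omega(Y) + P_Omega_perp(B)).
    mask is a boolean array, True on the observed set Omega."""
    B = np.zeros_like(Y)
    for _ in range(num_iters):
        # np.where(mask, Y, B) computes P_Omega(Y) + P_Omega_perp(B)
        B = matrix_soft_threshold(np.where(mask, Y, B), lam)
    return B
```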
Special cases

Proximal gradient descent is also called composite gradient descent, or generalized gradient descent.

Why "generalized"? This refers to the several special cases obtained when minimizing $f = g + h$:
- $h = 0$: gradient descent
- $h = I_C$: projected gradient descent
- $g = 0$: proximal minimization algorithm

Therefore these algorithms all have $O(1/\varepsilon)$ convergence rate.
Projected gradient descent

Given a closed, convex set $C \subseteq \mathbb{R}^n$,
$$\min_{x \in C} g(x) \iff \min_x g(x) + I_C(x)$$
where $I_C$ is the indicator function of $C$,
$$I_C(x) = \begin{cases} 0 & x \in C \\ \infty & x \notin C \end{cases}$$
Hence
$$\operatorname{prox}_t(x) = \operatorname*{argmin}_z \; \frac{1}{2t} \|x - z\|_2^2 + I_C(z) = \operatorname*{argmin}_{z \in C} \|x - z\|_2^2$$
i.e., $\operatorname{prox}_t(x) = P_C(x)$, the projection operator onto $C$.
Therefore the proximal gradient update step is
$$x^+ = P_C\big(x - t \nabla g(x)\big)$$
i.e., perform the usual gradient update and then project back onto $C$. This is called projected gradient descent.
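A small sketch (ours): projected gradient descent with the projection supplied as a callable, illustrated with the unit Euclidean ball, whose projection is known in closed form.

```python
import numpy as np

def projected_gradient(grad_g, project, x0, t, num_iters=100):
    """x+ = P_C(x - t * grad_g(x)): gradient step, then project onto C."""
    x = x0
    for _ in range(num_iters):
        x = project(x - t * grad_g(x))
    return x

# Example: minimize g(x) = 0.5 * ||x - a||^2 over the unit Euclidean ball,
# whose projection is x / max(1, ||x||_2).
a = np.array([3.0, 4.0])
x_star = projected_gradient(grad_g=lambda x: x - a,
                            project=lambda x: x / max(1.0, np.linalg.norm(x)),
                            x0=np.zeros(2), t=0.5)
print(x_star)  # approaches a / ||a||_2 = [0.6, 0.8]
```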
Proximal minimization algorithm

Consider, for $h$ convex (not necessarily differentiable),
$$\min_x h(x)$$
The proximal gradient update step is just
$$x^+ = \operatorname*{argmin}_z \; \frac{1}{2t} \|x - z\|_2^2 + h(z) = \operatorname{prox}_{th}(x)$$
This is called the proximal minimization algorithm. It is faster than the subgradient method, but not implementable unless we know the prox in closed form.
What happens if we can't evaluate prox?

The theory for proximal gradient, with $f = g + h$, assumes that the prox function can be evaluated, i.e., assumes that the minimization
$$\operatorname{prox}_{th}(x) = \operatorname*{argmin}_z \; \frac{1}{2t} \|x - z\|_2^2 + h(z)$$
can be done exactly. In general, it is not clear what happens if we just minimize this approximately.

But, if you can precisely control the errors in approximating the prox operator, then you can recover the original convergence rates.⁶

In practice, if the prox evaluation is done approximately, then it should be done to decently high accuracy.

⁶ Schmidt et al. (2011), "Convergence rates of inexact proximal-gradient methods for convex optimization"
Acceleration

Four ideas (three acceleration methods) by Nesterov:
- 1983: original acceleration idea for smooth functions
- 1988: another acceleration idea for smooth functions
- 2005: smoothing techniques for nonsmooth functions, coupled with the original acceleration idea
- 2007: acceleration idea for composite functions⁷

We will follow Beck and Teboulle (2008), an extension of Nesterov (1983) to composite functions.⁸

⁷ Each step uses the entire history of previous steps and makes two prox calls.
⁸ Each step uses information from the last two steps and makes one prox call.
Accelerated proximal gradient method

As before, consider
$$\min_x \; g(x) + h(x)$$
where $g$ is convex and differentiable, and $h$ is convex.

Accelerated proximal gradient method: choose an initial point $x^{(0)} = x^{(-1)} \in \mathbb{R}^n$, and repeat:
$$v = x^{(k-1)} + \frac{k-2}{k+1}\big(x^{(k-1)} - x^{(k-2)}\big)$$
$$x^{(k)} = \operatorname{prox}_{t_k h}\big(v - t_k \nabla g(v)\big), \quad k = 1, 2, 3, \ldots$$
- The first step $k = 1$ is just the usual proximal gradient update
- After that, $v = x^{(k-1)} + \frac{k-2}{k+1}(x^{(k-1)} - x^{(k-2)})$ carries some "momentum" from previous iterations
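In code, the method differs from the earlier proximal gradient sketch only by the momentum step; this is our illustration, not the original authors' implementation.

```python
def accelerated_proximal_gradient(grad_g, prox, x0, t, num_iters=100):
    """Accelerated proximal gradient with fixed step size t:
    v   = x_{k-1} + (k-2)/(k+1) * (x_{k-1} - x_{k-2})
    x_k = prox_{t*h}(v - t * grad_g(v)).
    At k = 1 the momentum term vanishes since x^{(0)} = x^{(-1)}."""
    x_prev, x = x0, x0
    for k in range(1, num_iters + 1):
        v = x + (k - 2) / (k + 1) * (x - x_prev)
        x_prev, x = x, prox(v - t * grad_g(v), t)
    return x
```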
Back to the lasso example: acceleration can really help!

Note: accelerated proximal gradient is not a descent method.
Backtracking line search

Backtracking can be combined with acceleration in different ways.

A simple approach: fix $\beta < 1$, $t_0 = 1$. At iteration $k$, start with $t = t_{k-1}$, and while
$$g(x^+) > g(v) + \nabla g(v)^T (x^+ - v) + \frac{1}{2t} \|x^+ - v\|_2^2$$
shrink $t = \beta t$ and let $x^+ = \operatorname{prox}_{th}\big(v - t \nabla g(v)\big)$. Else keep $x^+$.

Note that this strategy forces us to take decreasing step sizes (more complicated strategies exist which avoid this).
Convergence analysis

For the criterion $f(x) = g(x) + h(x)$, we assume as before:
- $g$ is convex, differentiable, $\mathrm{dom}(g) = \mathbb{R}^n$, and $\nabla g$ is Lipschitz continuous with constant $L > 0$
- $h$ is convex, and $\operatorname{prox}_{th}(x) = \operatorname*{argmin}_z \{\frac{1}{2t} \|x - z\|_2^2 + h(z)\}$ can be evaluated

The accelerated proximal gradient method has convergence rate $O(1/k^2)$, or equivalently $O(1/\sqrt{\varepsilon})$.
FISTA (Fast ISTA)

Back to the lasso problem
$$\min_\beta \; \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
Recall ISTA (Iterative Soft-thresholding Algorithm):
$$\beta^{(k)} = S_{\lambda t_k}\big(\beta^{(k-1)} + t_k X^T (y - X\beta^{(k-1)})\big), \quad k = 1, 2, 3, \ldots$$
with $S_\lambda(\cdot)$ being vector soft-thresholding. Applying acceleration gives us FISTA (F is for Fast): for $k = 1, 2, 3, \ldots$,
$$v = \beta^{(k-1)} + \frac{k-2}{k+1}\big(\beta^{(k-1)} - \beta^{(k-2)}\big)$$
$$\beta^{(k)} = S_{\lambda t_k}\big(v + t_k X^T (y - X v)\big)$$
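Combining the soft-thresholding prox with the momentum step gives a compact FISTA sketch (ours, with the fixed step size $t = 1/\|X\|_2^2$):

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def fista(X, y, lam, num_iters=500):
    """FISTA for the lasso: the ISTA step applied at the momentum point v."""
    t = 1.0 / np.linalg.norm(X, 2) ** 2
    beta_prev = beta = np.zeros(X.shape[1])
    for k in range(1, num_iters + 1):
        v = beta + (k - 2) / (k + 1) * (beta - beta_prev)
        beta_prev, beta = beta, soft_threshold(v + t * X.T @ (y - X @ v), lam * t)
    return beta
```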
ISTA vs FISTA

Lasso regression: 100 instances (with $n = 100$, $p = 500$).

Lasso logistic regression: 100 instances ($n = 100$, $p = 500$).
Is acceleration always useful?

Acceleration can be a very effective speedup tool, but should it always be used?

In practice, the speedup of using acceleration is diminished in the presence of warm starts. E.g., suppose we want to solve the lasso problem for tuning parameter values
$$\lambda_1 > \lambda_2 > \cdots > \lambda_r$$
- When solving for $\lambda_1$, initialize $x^{(0)} = 0$, and record the solution $\hat{x}(\lambda_1)$
- When solving for $\lambda_j$, initialize $x^{(0)} = \hat{x}(\lambda_{j-1})$, the recorded solution for $\lambda_{j-1}$

Over a fine enough grid of $\lambda$ values, proximal gradient descent can often perform just as well without acceleration.
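A sketch of this warm-start strategy (our code; the ISTA solver is restated so the block is self-contained):

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista_warm(X, y, lam, beta0, num_iters=100):
    """ISTA started from beta0 instead of 0."""
    t = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = beta0.copy()
    for _ in range(num_iters):
        beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
    return beta

def lasso_path(X, y, lambdas):
    """Solve over a decreasing grid lambda_1 > ... > lambda_r with warm starts."""
    beta = np.zeros(X.shape[1])                 # x^(0) = 0 for lambda_1
    path = []
    for lam in lambdas:
        beta = ista_warm(X, y, lam, beta0=beta)  # warm start from previous solution
        path.append((lam, beta))
    return path
```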
Sometimes backtracking and acceleration can be disadvantageous! Recall the matrix completion problem: the proximal gradient update is
$$B^+ = S_{\lambda t}\big(B + t(P_\Omega(Y) - P_\Omega(B))\big)$$
where $S_\lambda$ is the matrix soft-thresholding operator, which requires an SVD.
- One backtracking loop evaluates the generalized gradient $G_t(x)$, i.e., evaluates $\operatorname{prox}_t(x)$, across various values of $t$. For matrix completion, this means multiple SVDs.
- Acceleration changes the argument we pass to prox: $v - t \nabla g(v)$ instead of $x - t \nabla g(x)$. For matrix completion (and $t = 1$),
$$B - \nabla g(B) = \underbrace{P_\Omega(Y)}_{\text{sparse}} + \underbrace{P_{\Omega^\perp}(B)}_{\text{low rank}} \;\Longrightarrow\; \text{fast SVD}$$
whereas the accelerated argument $v - \nabla g(v)$ does not have this sparse-plus-low-rank structure in general.
References and further reading

Nesterov's four ideas (three acceleration methods):
- Y. Nesterov (1983), "A method for solving a convex programming problem with convergence rate O(1/k²)"
- Y. Nesterov (1988), "On an approach to the construction of optimal methods of minimization of smooth convex functions"
- Y. Nesterov (2005), "Smooth minimization of non-smooth functions"
- Y. Nesterov (2007), "Gradient methods for minimizing composite objective function"
Extensions and/or analyses:
- A. Beck and M. Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems"
- S. Becker, J. Bobin and E. Candes (2009), "NESTA: a fast and accurate first-order method for sparse recovery"
- P. Tseng (2008), "On accelerated proximal gradient methods for convex-concave optimization"
Helpful lecture notes/books:
- E. Candes, Lecture notes for Math 301, Stanford University, Winter 2010-2011
- Y. Nesterov (1998), "Introductory lectures on convex optimization: a basic course", Chapter 2
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012