
Lecture slides: Advanced Optimization, Chapter 8 - Hoàng Nam Dũng


Lecture slides for Advanced Optimization, Chapter 8: Proximal gradient descent (and acceleration). Topics covered: the subgradient method, decomposable functions, the proximal mapping, proximal gradient descent, and more.

Page 1

Proximal Gradient Descent (and Acceleration)

Hoàng Nam Dũng

Faculty of Mathematics, Mechanics and Informatics, University of Science, Vietnam National University, Hanoi

Page 2

Last time: subgradient method

Consider the problem

min_x f(x), with f convex and dom(f) = R^n

Subgradient method: choose an initial x(0) ∈ R^n, and repeat:

x(k) = x(k−1) − tk · g(k−1), k = 1, 2, 3, ..., where g(k−1) ∈ ∂f(x(k−1))

We use pre-set rules for the step sizes (e.g., a diminishing step size rule).

If f is Lipschitz, then the subgradient method has convergence rate O(1/ε²).

Upside: very generic. Downside: can be slow; addressed today.

Page 4

Decomposable functions

Suppose

f(x) = g(x) + h(x), where

- g is convex, differentiable, dom(g) = R^n

- h is convex, not necessarily differentiable

If f were differentiable, then the gradient descent update would be

x+ = x − t · ∇f(x)

Recall the motivation: minimize the quadratic approximation to f around the current point x.

Page 5

Decomposable functions

In our case f is not differentiable, but f = g + h with g differentiable. Why don't we make a quadratic approximation to g and leave h alone? I.e., update

x+ = argmin_z (1/(2t)) ‖z − (x − t · ∇g(x))‖₂² + h(z)

The first term keeps z close to the gradient update for g; the second keeps h small.

Page 11

Uniqueness holds since the objective function is strictly convex.

Optimality condition:

z = prox_h(x) ⇔ x − z ∈ ∂h(z)
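As a concrete sanity check (not on the slides; the helper names here are mine): for h(z) = λ|z| in one dimension, the prox is the soft-thresholding operator, and we can confirm the closed form against a brute-force minimization of the prox objective.

```python
import numpy as np

def soft_threshold(x, lam):
    # Closed-form prox of h(z) = lam * |z| (1-D): shrink x toward 0 by lam
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_numeric(x, lam):
    # Brute-force prox: argmin_z 0.5*(x - z)^2 + lam*|z| over a fine grid
    grid = np.linspace(-5.0, 5.0, 200001)
    vals = 0.5 * (x - grid) ** 2 + lam * np.abs(grid)
    return grid[int(np.argmin(vals))]

z = soft_threshold(2.0, 1.0)  # shrinks 2.0 down to 1.0
```

This matches the optimality condition: for z ≠ 0 it forces x − z = λ·sign(z), while for z = 0 it only requires |x| ≤ λ.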

Page 12

Properties of proximal mapping

Page 15

Proximal gradient descent

Proximal gradient descent: choose an initial x(0), and repeat:

x(k) = prox_{tk h}( x(k−1) − tk · ∇g(x(k−1)) ), k = 1, 2, 3, ...

To make this update step look familiar, we can rewrite it as

x(k) = x(k−1) − tk · G_{tk}(x(k−1)),

where G_t is the generalized gradient of f:

G_t(x) = ( x − prox_{th}(x − t · ∇g(x)) ) / t

For h = 0 it is gradient descent.
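The update above can be sketched in a few lines (a minimal version of my own; the function names are not from the slides). The solver needs only ∇g and the prox of h.

```python
import numpy as np

def proximal_gradient(grad_g, prox, x0, t, n_iter=100):
    # Repeat x+ = prox_{t h}(x - t * grad_g(x)) with fixed step size t
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        x = prox(x - t * grad_g(x), t)
    return x

# Tiny demo: g(x) = 0.5*||x - b||^2, h(x) = ||x||_1,
# whose minimizer is b soft-thresholded at level 1
b = np.array([3.0, -0.2, 1.5])
grad_g = lambda x: x - b
prox_l1 = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x_hat = proximal_gradient(grad_g, prox_l1, np.zeros(3), t=1.0)
```

With h = 0 the prox is the identity and the loop reduces to plain gradient descent, as noted above.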


Page 19

What good did this do?

You have a right to be suspicious: it may look like we just swapped one minimization problem for another.

The key point is that prox_h(·) can be computed analytically for a lot of important functions h.

Note:

- The mapping prox_h(·) doesn't depend on g at all, only on h

- The smooth part g can be complicated; we only need to compute its gradients

Convergence analysis will be in terms of the number of iterations of the algorithm. Each iteration evaluates prox_h(·) once, and this can be cheap or expensive, depending on h.

Page 20

Example: ISTA (Iterative Shrinkage-Thresholding Algorithm)

Given y ∈ R^n, X ∈ R^{n×p}, recall the lasso criterion

f(β) = (1/2) ‖y − Xβ‖₂² + λ‖β‖₁

The proximal mapping is now

prox_th(β) = argmin_z (1/(2t)) ‖β − z‖₂² + λ‖z‖₁ = S_{λt}(β),

where S_λ(β) is the soft-thresholding operator, [S_λ(β)]_i = sign(β_i) · max{|β_i| − λ, 0}.

Page 21

Example: ISTA (Iterative Shrinkage-Thresholding Algorithm)

Recall ∇g(β) = −Xᵀ(y − Xβ); hence the proximal gradient update is

β+ = S_{λt}( β + t · Xᵀ(y − Xβ) )²

² Beck and Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems"
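Putting the two pieces together gives a compact ISTA sketch (my own minimal version, with fixed step size t = 1/L, where L is the largest eigenvalue of XᵀX, the Lipschitz constant of ∇g):

```python
import numpy as np

def soft_threshold(v, tau):
    # Vector soft-thresholding S_tau(v), applied componentwise
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, n_iter=500):
    # Fixed step size t = 1/L, with L the largest eigenvalue of X^T X
    t = 1.0 / np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
    return beta

# With X = I the lasso solution is exactly S_lam(y), an easy check
beta_hat = ista(np.eye(3), np.array([2.0, -0.5, 1.3]), lam=1.0)
```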

Page 22

Backtracking line search

Backtracking for prox gradient descent works similarly to before (in gradient descent), but operates on g and not f.

Choose parameter 0 < β < 1. At each iteration, start at t = t_init, and while

g(x − t·G_t(x)) > g(x) − t·∇g(x)ᵀG_t(x) + (t/2) ‖G_t(x)‖₂²,

shrink t = βt. Else perform the proximal gradient update.

(Alternative formulations exist that require less computation, i.e., fewer calls to prox.)
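One iteration of this rule can be sketched as follows (a minimal version of the slide's condition; the helper names and the shrink factor default are my choices):

```python
import numpy as np

def backtracking_prox_step(x, g, grad_g, prox, t_init=1.0, shrink=0.5):
    # Shrink t until g(x - t*G_t(x)) <= g(x) - t*grad_g(x)'G_t(x) + (t/2)*||G_t(x)||^2
    t = t_init
    while True:
        z = prox(x - t * grad_g(x), t)   # candidate update x+
        G = (x - z) / t                  # generalized gradient G_t(x)
        if g(z) <= g(x) - t * (grad_g(x) @ G) + 0.5 * t * (G @ G):
            return z, t
        t *= shrink

# Demo with g(x) = 0.5*||x||^2 and h = 0 (the prox is the identity)
z, t = backtracking_prox_step(np.array([4.0, -2.0]),
                              g=lambda x: 0.5 * x @ x,
                              grad_g=lambda x: x,
                              prox=lambda v, t: v)
```

Note the loop re-evaluates the prox at every trial t, which is exactly the cost concern raised later for matrix completion.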

Page 23

Convergence analysis

For the criterion f(x) = g(x) + h(x), we assume:

- g is convex, differentiable, dom(g) = R^n, and ∇g is Lipschitz continuous with constant L > 0

- h is convex, and prox_th(x) = argmin_z { (1/(2t)) ‖x − z‖₂² + h(z) } can be evaluated

Proximal gradient descent has convergence rate O(1/k), i.e., O(1/ε). Same as gradient descent! (But remember, the prox cost matters.)

Proof: see http://www.seas.ucla.edu/~vandenbe/236C/lectures/proxgrad.pdf


Page 26

Example: matrix completion

Given a matrix Y ∈ R^{m×n}, we only observe entries Yij, (i, j) ∈ Ω. Suppose we want to fill in the missing entries (e.g., for a recommender system), so we solve a matrix completion problem³:

min_B (1/2) Σ_{(i,j)∈Ω} (Yij − Bij)² + λ‖B‖tr

Here ‖B‖tr is the trace (or nuclear) norm of B,

‖B‖tr = Σ_{i=1}^{r} σ_i(B),

where r = rank(B) and σ1(B) ≥ · · · ≥ σr(B) ≥ 0 are the singular values.⁴

³ Wikipedia: In the case of the Netflix problem the ratings matrix is expected to be low-rank, since user preferences can often be described by a few factors, such as the movie genre and time of release.

⁴ https://math.berkeley.edu/~hutching/teach/54-2017/svd-notes.pdf

Page 27

Example: matrix completion

Define PΩ, the projection operator onto the observed set:

[PΩ(B)]ij = Bij if (i, j) ∈ Ω, and 0 otherwise

Two ingredients are needed for proximal gradient descent:

- Gradient calculation: ∇g(B) = −(PΩ(Y) − PΩ(B))

- Prox function: prox_th(B) = argmin_Z (1/(2t)) ‖B − Z‖_F² + λ‖Z‖tr

Page 28

Example: matrix completion

Claim:

prox_t(B) = S_{λt}(B), matrix soft-thresholding at the level λt

Here S_λ(B) is defined by

S_λ(B) = U Σ_λ Vᵀ,

where B = U Σ Vᵀ is an SVD, and Σ_λ is diagonal with

(Σ_λ)ii = max{Σii − λ, 0}

Proof: note that prox_th(B) = Z, where Z satisfies the optimality condition

0 ∈ Z − B + λt · ∂‖Z‖tr

Page 29

Example: matrix completion

Hence the proximal gradient update step is

B+ = S_{λt}( B + t(PΩ(Y) − PΩ(B)) )

Note that ∇g(B) is Lipschitz continuous with L = 1, so we can choose the fixed step size t = 1. The update step is then

B+ = S_λ( PΩ(Y) + PΩ⊥(B) ),

where PΩ⊥ projects onto the unobserved set, so that PΩ(B) + PΩ⊥(B) = B.
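The full t = 1 update can be sketched with NumPy's SVD (a minimal illustration; `mask` marking the observed set Ω and the function names are my own):

```python
import numpy as np

def matrix_soft_threshold(B, lam):
    # S_lam(B): shrink each singular value by lam (costs one SVD per call)
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def completion_step(B, Y, mask, lam):
    # t = 1 update: B+ = S_lam(P_Omega(Y) + P_Omega_perp(B))
    return matrix_soft_threshold(np.where(mask, Y, B), lam)

# Fully observed rank-1 demo: the update just shrinks the one singular value
Y = np.outer([1.0, 1.0], [1.0, 1.0])  # singular value 2
B_plus = completion_step(np.zeros((2, 2)), Y, np.ones((2, 2), bool), lam=0.5)
```

Here `np.where(mask, Y, B)` assembles PΩ(Y) + PΩ⊥(B) in one pass, so each iteration costs one SVD.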

Page 30

Special cases

Proximal gradient descent is also called composite gradient descent, or generalized gradient descent.

Why "generalized"? This refers to the several special cases when minimizing f = g + h:

- h = 0: gradient descent

- h = I_C: projected gradient descent

- g = 0: proximal minimization algorithm

Therefore these algorithms all have an O(1/ε) convergence rate.

Page 31

Projected gradient descent

Given a closed, convex set C ⊆ R^n,

min_{x∈C} g(x) ⇐⇒ min_x g(x) + I_C(x),

where I_C(x) = 0 if x ∈ C and I_C(x) = ∞ otherwise is the indicator function of C.


Page 33

Projected gradient descent

Therefore the proximal gradient update step is

x+ = P_C( x − t · ∇g(x) ),

i.e., perform the usual gradient update and then project back onto C. This is called projected gradient descent.
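A minimal sketch with C the box [0, 1]^n, where the projection is just a clip (the example and names are my choices):

```python
import numpy as np

def projected_gradient(grad_g, project, x0, t, n_iter=100):
    # x+ = P_C(x - t * grad_g(x)): gradient step, then project back onto C
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        x = project(x - t * grad_g(x))
    return x

# Minimize 0.5*||x - b||^2 over the box [0, 1]^3; solution is clip(b, 0, 1)
b = np.array([2.0, -1.0, 0.4])
x_hat = projected_gradient(lambda x: x - b,
                           lambda v: np.clip(v, 0.0, 1.0),
                           np.zeros(3), t=1.0)
```

Any set with a cheap projection (box, ball, simplex, affine subspace) fits this template; only the `project` argument changes.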

Page 34

Proximal minimization algorithm

Consider, for h convex (not necessarily differentiable), the problem

min_x h(x)

The proximal gradient update with g = 0 is simply x+ = prox_th(x). This is called the proximal minimization algorithm. It is faster than the subgradient method, but not implementable unless we know the prox in closed form.

Page 35

What happens if we can't evaluate prox?

Theory for proximal gradient, with f = g + h, assumes that the prox function can be evaluated, i.e., assumes that the minimization

prox_th(x) = argmin_z (1/(2t)) ‖x − z‖₂² + h(z)

can be done exactly. In general, it is not clear what happens if we just minimize this approximately.

But, if you can precisely control the errors in approximating the prox operator, then you can recover the original convergence rates.⁶

In practice, if the prox evaluation is done approximately, then it should be done to decently high accuracy.

⁶ Schmidt et al. (2011), "Convergence rates of inexact proximal-gradient methods for convex optimization"

Page 36

Acceleration

Nesterov's acceleration ideas:

- 1983: original acceleration idea for smooth functions

- 1988: another acceleration idea for smooth functions

- 2005: smoothing techniques for nonsmooth functions, coupled with the original acceleration idea

- 2007: acceleration idea for composite functions⁷

We will follow Beck and Teboulle (2008), an extension of Nesterov (1983) to composite functions.⁸

⁷ Each step uses the entire history of previous steps and makes two prox calls.

⁸ Each step uses information from the last two steps and makes one prox call.

Page 37

Accelerated proximal gradient method

As before, consider

min_x g(x) + h(x),

where g is convex and differentiable, and h is convex.

Accelerated proximal gradient method: choose an initial point x(0) = x(−1) ∈ R^n, and repeat, for k = 1, 2, 3, ...:

v = x(k−1) + ((k − 2)/(k + 1)) · (x(k−1) − x(k−2))

x(k) = prox_{tk h}( v − tk · ∇g(v) )

- The first step k = 1 is just the usual proximal gradient update

- After that, v = x(k−1) + ((k − 2)/(k + 1))(x(k−1) − x(k−2)) carries some "momentum" from previous iterations


Page 39

Accelerated proximal gradient method

Back to the lasso example: acceleration can really help!

Note: accelerated proximal gradient is not a descent method.

Page 40

Backtracking line search

Backtracking under acceleration can be done in different ways.

Simple approach: fix β < 1, t0 = 1. At iteration k, start with t = tk−1, and while

g(x+) > g(v) + ∇g(v)ᵀ(x+ − v) + (1/(2t)) ‖x+ − v‖₂²,

shrink t = βt and let x+ = prox_th(v − t∇g(v)). Else keep x+.

Note that this strategy forces us to take decreasing step sizes (more complicated strategies exist which avoid this).

Page 41

Convergence analysis

For the criterion f(x) = g(x) + h(x), we assume as before:

- g is convex, differentiable, dom(g) = R^n, and ∇g is Lipschitz continuous with constant L > 0

- h is convex, and prox_th(x) = argmin_z { (1/(2t)) ‖x − z‖₂² + h(z) } can be evaluated

The accelerated proximal gradient method has convergence rate O(1/k²), i.e., O(1/√ε).

Page 42

FISTA (Fast ISTA)

Back to the lasso problem

min_β (1/2) ‖y − Xβ‖₂² + λ‖β‖₁

Recall ISTA (Iterative Soft-thresholding Algorithm):

β(k) = S_{λtk}( β(k−1) + tk · Xᵀ(y − Xβ(k−1)) ), k = 1, 2, 3, ...,

with S_λ(·) being vector soft-thresholding.

Applying acceleration gives us FISTA (F is for Fast)⁹: for k = 1, 2, 3, ...,

v = β(k−1) + ((k − 2)/(k + 1)) (β(k−1) − β(k−2))

β(k) = S_{λtk}( v + tk · Xᵀ(y − Xv) )
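FISTA differs from ISTA only in the momentum step; a minimal sketch (my own, with fixed t = 1/L):

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(X, y, lam, n_iter=500):
    t = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # fixed step t = 1/L
    beta = beta_prev = np.zeros(X.shape[1])
    for k in range(1, n_iter + 1):
        v = beta + (k - 2) / (k + 1) * (beta - beta_prev)  # momentum term
        beta_prev = beta
        beta = soft_threshold(v + t * X.T @ (y - X @ v), lam * t)
    return beta

# Same X = I sanity check as for ISTA: the solution is S_lam(y)
beta_hat = fista(np.eye(3), np.array([2.0, -0.5, 1.3]), lam=1.0)
```

At k = 1 the momentum coefficient multiplies a zero difference, so the first step is the plain proximal gradient update, as stated on page 37.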

Page 43

ISTA vs FISTA

Lasso regression: 100 instances (with n = 100, p = 500):

Page 44

ISTA vs FISTA

Lasso logistic regression: 100 instances (n = 100, p = 500):

Page 45

Is acceleration always useful?

Acceleration can be a very effective speedup tool... but should it always be used?

In practice, the speedup from acceleration is diminished in the presence of warm starts. E.g., suppose we want to solve the lasso problem for tuning parameter values

λ1 > λ2 > · · · > λr

- When solving for λ1, initialize x(0) = 0, and record the solution x̂(λ1)

- When solving for λj, initialize x(0) = x̂(λj−1), the recorded solution for λj−1

Over a fine enough grid of λ values, proximal gradient descent can often perform just as well without acceleration.
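The warm-start recipe above can be sketched as follows (my own minimal version on top of an ISTA-style inner solver; with X = I each solution is just S_λ(y), which makes the path easy to check):

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_path(X, y, lams, n_iter=200):
    # Solve over a decreasing grid of lambdas, warm-starting each solve
    # at the previous solution
    t = 1.0 / np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(X.shape[1])  # cold start only for the first lambda
    path = []
    for lam in lams:             # lams assumed sorted in decreasing order
        for _ in range(n_iter):
            beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
        path.append(beta.copy())
    return path

path = lasso_path(np.eye(2), np.array([2.0, -0.5]), lams=[1.5, 1.0, 0.25])
```

Because consecutive solutions on a fine λ grid are close, each inner solve starts near its answer, which is why the momentum term buys little here.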


Page 47

Is acceleration always useful?

Sometimes backtracking and acceleration can be disadvantageous! Recall the matrix completion problem, where the proximal gradient update is

B+ = S_{λt}( B + t(PΩ(Y) − PΩ(B)) ),

and S_λ is the matrix soft-thresholding operator, which requires an SVD.

- One backtracking loop evaluates the generalized gradient G_t(x), i.e., evaluates prox_t(x), across various values of t. For matrix completion this means multiple SVDs.

- Acceleration changes the argument we pass to the prox: v − t∇g(v) instead of x − t∇g(x). For matrix completion (and t = 1),

B − ∇g(B) = PΩ(Y) + PΩ⊥(B),

the sum of a sparse term and a low-rank term ⇒ fast SVD

Page 48

References and further reading

Nesterov's four ideas (three acceleration methods):

- Y. Nesterov (1983), "A method for solving a convex programming problem with convergence rate O(1/k²)"

- Y. Nesterov (1988), "On an approach to the construction of optimal methods of minimization of smooth convex functions"

- Y. Nesterov (2005), "Smooth minimization of non-smooth functions"

- Y. Nesterov (2007), "Gradient methods for minimizing composite objective function"

Page 49

References and further reading

Extensions and/or analyses:

- A. Beck and M. Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems"

- S. Becker, J. Bobin and E. Candes (2009), "NESTA: a fast and accurate first-order method for sparse recovery"

- P. Tseng (2008), "On accelerated proximal gradient methods for convex-concave optimization"

Page 50

References and further reading

Helpful lecture notes/books:

- E. Candes, Lecture notes for Math 301, Stanford University, Winter 2010-2011

- Y. Nesterov (1998), "Introductory lectures on convex optimization: a basic course", Chapter 2

- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012

Posted: 16/05/2020, 01:39
