Advanced Optimization Lecture Notes: Chapter 7 - Hoàng Nam Dũng


Chapter 7 of the Advanced Optimization lecture series, "Subgradient method", covers: last last time - gradient descent, subgradient method, step size choices, convergence analysis, Lipschitz continuity, convergence analysis - proof, and more.

Page 1

Subgradient Method

Hoàng Nam Dũng

Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi

Page 2

Last last time: gradient descent

Consider the problem

min_x f(x)

for f convex and differentiable, dom(f) = R^n.

Gradient descent: choose initial x^(0) ∈ R^n, repeat:

x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)), k = 1, 2, 3, ...

- Requires f differentiable — addressed this lecture
- Can be slow to converge — addressed next lecture
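As a concrete illustration, here is a minimal sketch of the gradient descent update on a toy smooth function (the quadratic f(x) = (x − 2)², our own example, not from the slides):

```python
# Gradient descent sketch on the toy smooth function f(x) = (x - 2)^2,
# whose gradient is f'(x) = 2(x - 2), with a fixed step size t.
# Function names and the example itself are ours, for illustration only.

def grad_descent(grad, x0, t, num_iters):
    x = x0
    for _ in range(num_iters):
        # x^(k) = x^(k-1) - t * grad f(x^(k-1))
        x = x - t * grad(x)
    return x

x_final = grad_descent(grad=lambda x: 2.0 * (x - 2.0),
                       x0=0.0, t=0.1, num_iters=100)
```

With t = 0.1 the error (x − 2) shrinks by a factor 0.8 each iteration, so the iterates approach the minimizer x = 2.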

Page 3

Subgradient method

Now consider f convex, with dom(f) = R^n, but not necessarily differentiable.

Subgradient method: like gradient descent, but replacing gradients with subgradients, i.e., initialize x^(0), repeat:

x^(k) = x^(k−1) − t_k · g^(k−1), k = 1, 2, 3, ...

where g^(k−1) ∈ ∂f(x^(k−1)) is any subgradient of f at x^(k−1).

Subgradient method is not necessarily a descent method, so we keep track of the best iterate x_best^(k) among x^(0), ..., x^(k) so far, i.e.,

f(x_best^(k)) = min_{i=0,...,k} f(x^(i))
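The update and the best-iterate bookkeeping can be sketched as follows, on the nondifferentiable toy function f(x) = |x − 3| (our own example, not from the slides); a subgradient of f is sign(x − 3), and any value in [−1, 1] is valid at the kink x = 3:

```python
# Subgradient method sketch on f(x) = |x - 3| with diminishing step
# sizes t_k = 1/k.  Since the method is not a descent method, we keep
# track of the best iterate seen so far.  Example data is ours.

def f(x):
    return abs(x - 3.0)

def subgrad(x):
    # one valid element of the subdifferential of f at x
    if x > 3.0:
        return 1.0
    if x < 3.0:
        return -1.0
    return 0.0   # 0 lies in [-1, 1] = subdifferential at the kink

def subgradient_method(x0, num_iters):
    x = x0
    f_best, x_best = f(x), x
    for k in range(1, num_iters + 1):
        t_k = 1.0 / k                  # diminishing step size
        x = x - t_k * subgrad(x)       # x^(k) = x^(k-1) - t_k g^(k-1)
        if f(x) < f_best:              # track best iterate
            f_best, x_best = f(x), x
    return x_best, f_best

x_best, f_best = subgradient_method(x0=0.0, num_iters=500)
```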


Page 6

Step size choices

- Fixed step sizes: t_k = t, for all k = 1, 2, 3, ...
- Fixed step length, i.e., t_k = s/‖g^(k−1)‖₂, and hence ‖t_k · g^(k−1)‖₂ = s
- Diminishing step sizes: t_k satisfying Σ_{k=1}^∞ t_k = ∞ and Σ_{k=1}^∞ t_k² < ∞

There are several other options too, but the key difference to gradient descent: step sizes are pre-specified, not adaptively computed.
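The schedules above can be written as plain functions of the iteration counter k and the previous subgradient g (the function names and default constants are ours, for illustration):

```python
# Sketch of the three pre-specified step size schedules.  Note none of
# them looks at function values: they depend only on k and on the norm
# of the previous subgradient, never on progress made.

def fixed_step(k, g, t=0.1):
    # t_k = t for all k
    return t

def fixed_length(k, g, s=0.1):
    # t_k = s / ||g^(k-1)||_2, so the step t_k * g has 2-norm s
    norm_g = sum(gi * gi for gi in g) ** 0.5
    return s / norm_g

def diminishing_step(k, c=1.0):
    # e.g. t_k = c / k: not summable, but square-summable
    return c / k
```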


Page 8

Lipschitz continuity

Before the proof, let us consider the Lipschitz continuity assumption.

Lemma. f is Lipschitz continuous with constant L > 0, i.e.,

|f(x) − f(y)| ≤ L‖x − y‖₂ for all x, y,

if and only if ‖g‖₂ ≤ L for all x and all g ∈ ∂f(x).


Page 10

Convergence analysis - Proof

Can prove both results from the same basic inequality. Key steps:

- Using the definition of subgradient,

‖x^(k) − x*‖₂² ≤ ‖x^(k−1) − x*‖₂² − 2t_k (f(x^(k−1)) − f(x*)) + t_k² ‖g^(k−1)‖₂²


Page 12

Convergence analysis - Proof

- Iterating this, using ‖x^(k) − x*‖₂² ≥ 0 and letting R = ‖x^(0) − x*‖₂, we have

0 ≤ R² − 2 Σ_{i=1}^k t_i (f(x^(i−1)) − f(x*)) + Σ_{i=1}^k t_i² ‖g^(i−1)‖₂²

- Introducing f(x_best^(k)) = min_{i=0,...,k} f(x^(i)) and rearranging, we have the basic inequality

Page 15

Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

f(x_best^(k)) − f(x*) ≤ (R² + Σ_{i=1}^k t_i² ‖g^(i−1)‖₂²) / (2 Σ_{i=1}^k t_i)

With fixed step size t_i = t, using the Lipschitz bound ‖g^(i−1)‖₂ ≤ L this becomes R²/(2kt) + L²t/2.

- Does not guarantee convergence (as k → ∞)
- For large k, f(x_best^(k)) is approximately L²t/2-suboptimal


Page 18

Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

f(x_best^(k)) − f(x*) ≤ (R² + Σ_{i=1}^k t_i² ‖g^(i−1)‖₂²) / (2 Σ_{i=1}^k t_i)

With fixed step length, i.e., t_i = s/‖g^(i−1)‖₂, we have

f(x_best^(k)) − f(x*) ≤ (R² + ks²) / (2s Σ_{i=1}^k ‖g^(i−1)‖₂⁻¹) ≤ (R² + ks²) / (2s Σ_{i=1}^k L⁻¹) = LR²/(2ks) + Ls/2

- Does not guarantee convergence (as k → ∞)
- For large k, f(x_best^(k)) is approximately Ls/2-suboptimal
- To make the gap ≤ ε, let's make each term ≤ ε/2. So we can choose s = ε/L, and k = (LR²/s) · (1/ε) = R²L²/ε²
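A quick numeric sanity check of the choice above (the constants L, R, ε are arbitrary values of our own): with s = ε/L and k = R²L²/ε², each of the two terms in LR²/(2ks) + Ls/2 equals exactly ε/2, so the bound equals ε.

```python
# Verify that s = eps/L and k = R^2 L^2 / eps^2 make each term of the
# bound LR^2/(2ks) + Ls/2 equal to eps/2.  Constants are arbitrary.
L, R, eps = 2.0, 5.0, 0.1

s = eps / L
k = R**2 * L**2 / eps**2

term1 = L * R**2 / (2 * k * s)   # should be eps/2
term2 = L * s / 2                # should be eps/2
bound = term1 + term2            # should be eps
```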


Page 20

Convergence analysis - Proof

The basic inequality tells us that after k steps, we have

f(x_best^(k)) − f(x*) ≤ (R² + Σ_{i=1}^k t_i² ‖g^(i−1)‖₂²) / (2 Σ_{i=1}^k t_i)

From this and the Lipschitz continuity, we have

f(x_best^(k)) − f(x*) ≤ (R² + L² Σ_{i=1}^k t_i²) / (2 Σ_{i=1}^k t_i)

With diminishing step sizes, Σ_{i=1}^∞ t_i = ∞ and Σ_{i=1}^∞ t_i² < ∞, there holds

lim_{k→∞} f(x_best^(k)) = f*
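The two behaviors can be seen side by side on the toy function f(x) = |x| (our own example, not from the slides): a fixed step size stalls at roughly L²t/2 = t/2-suboptimal (here L = 1), while diminishing steps t_k = 1/k drive the best value toward f* = 0.

```python
# Fixed vs diminishing step sizes on f(x) = |x|.  The fixed step ends
# up oscillating around 0, so the best value stalls near t/2 = 0.25;
# the diminishing schedule keeps improving.  Example data is ours.

def f(x):
    return abs(x)

def g(x):
    # a subgradient of |x| (0 is valid at the kink)
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def run(step, x0=0.75, iters=500):
    x, best = x0, f(x0)
    for k in range(1, iters + 1):
        x = x - step(k) * g(x)
        best = min(best, f(x))       # track the best iterate
    return best

best_fixed = run(lambda k: 0.5)      # fixed t_k = 0.5: stalls
best_dimin = run(lambda k: 1.0 / k)  # sum t_k = inf, sum t_k^2 < inf
```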

Page 21

Example: 1-norm minimization

Page 22

(Figure: convergence of the subgradient method on a 1-norm minimization problem; diminishing step size t_k = 0.01/√k.)

Page 23

Example: regularized logistic regression

Given (x_i, y_i) ∈ R^p × {0, 1} for i = 1, ..., n, the logistic regression loss is

f(β) = Σ_{i=1}^n (−y_i x_i^T β + log(1 + exp(x_i^T β)))

This is a smooth and convex function with

∇f(β) = Σ_{i=1}^n (p_i(β) − y_i) x_i, where p_i(β) = exp(x_i^T β) / (1 + exp(x_i^T β))
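A short sketch of this loss and its gradient, checked against finite differences (the data is random and the helper names are ours, not from the slides):

```python
import numpy as np

# Logistic loss f(beta) = sum_i (-y_i x_i^T beta + log(1 + exp(x_i^T beta)))
# and its gradient sum_i (p_i(beta) - y_i) x_i, verified numerically.

def logistic_loss(beta, X, y):
    z = X @ beta
    return float(np.sum(-y * z + np.log1p(np.exp(z))))

def logistic_grad(beta, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # p_i(beta)
    return X.T @ (p - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = (rng.random(50) < 0.5).astype(float)
beta = rng.normal(size=5)

# central finite-difference approximation of the gradient
eps = 1e-6
g = logistic_grad(beta, X, y)
g_fd = np.array([(logistic_loss(beta + eps * e, X, y)
                  - logistic_loss(beta - eps * e, X, y)) / (2 * eps)
                 for e in np.eye(5)])
```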

Page 24

Example: regularized logistic regression

Ridge: use gradients; lasso: use subgradients.

Step sizes hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors).

Page 25

Polyak step sizes

Polyak step sizes: when the optimal value f* is known, take

t_k = (f(x^(k−1)) − f*) / ‖g^(k−1)‖₂², k = 1, 2, 3, ...

Recall the first step of the convergence proof:

‖x^(k) − x*‖₂² ≤ ‖x^(k−1) − x*‖₂² − 2t_k (f(x^(k−1)) − f(x*)) + t_k² ‖g^(k−1)‖₂²

The Polyak step size minimizes the right-hand side.

With Polyak step sizes, can show the subgradient method converges to the optimal value. Convergence rate is still O(1/ε²), with

f(x_best^(k)) − f(x*) ≤ LR/√k

(Proof: see slide 11, http://www.seas.ucla.edu/~vandenbe/236C/lectures/sgmethod.pdf)
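A minimal sketch of the Polyak step size on a problem with known optimal value, f(x) = ‖x‖₁ with f* = 0 (our own toy instance, not from the slides):

```python
# Polyak step sizes t_k = (f(x) - f*) / ||g||_2^2 on f(x) = ||x||_1,
# whose optimal value f* = 0 is known.  Toy instance is ours.

def f(x):
    return sum(abs(xi) for xi in x)

def subgrad(x):
    # componentwise sign vector, a valid subgradient of the 1-norm
    return [1.0 if xi > 0 else (-1.0 if xi < 0 else 0.0) for xi in x]

f_star = 0.0
x = [3.0, 1.0]
for k in range(100):
    g = subgrad(x)
    gg = sum(gi * gi for gi in g)
    if gg == 0.0:                    # x is already optimal
        break
    t = (f(x) - f_star) / gg         # Polyak step size
    x = [xi - t * gi for xi, gi in zip(x, g)]
```

On this particular instance the method happens to reach the minimizer exactly after two steps.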

Page 26

Example: intersection of sets

Suppose we want to find x* ∈ C₁ ∩ ⋯ ∩ C_m, i.e., find a point in the intersection of closed, convex sets C₁, ..., C_m.

First define

f_i(x) = dist(x, C_i), i = 1, ..., m
f(x) = max_{i=1,...,m} f_i(x)

and now solve

min_x f(x)

Check: is this convex?

Note that f* = 0 ⟺ x* ∈ C₁ ∩ ⋯ ∩ C_m

Page 27

Example: intersection of sets

Recall the distance function dist(x, C) = min_{y∈C} ‖y − x‖₂. Last time we computed its gradient

∇dist(x, C) = (x − P_C(x)) / ‖x − P_C(x)‖₂

where P_C(x) is the projection of x onto C.

Also recall the subgradient rule: if f(x) = max_{i=1,...,m} f_i(x), then

∂f(x) = conv( ⋃_{i : f_i(x) = f(x)} ∂f_i(x) )

Page 28

Example: intersection of sets

Put these two facts together for the intersection of sets problem, with f_i(x) = dist(x, C_i): if C_i is the farthest set from x (so f_i(x) = f(x)), and

g_i = ∇f_i(x) = (x − P_{C_i}(x)) / ‖x − P_{C_i}(x)‖₂

then g_i ∈ ∂f(x)

Now apply the subgradient method, with Polyak step size t_k = f(x^(k−1))

At iteration k, with C_i farthest from x^(k−1), we perform the update

x^(k) = x^(k−1) − f(x^(k−1)) · (x^(k−1) − P_{C_i}(x^(k−1))) / ‖x^(k−1) − P_{C_i}(x^(k−1))‖₂ = P_{C_i}(x^(k−1))

since

f(x^(k−1)) = dist(x^(k−1), C_i) = ‖x^(k−1) − P_{C_i}(x^(k−1))‖₂

Page 29

Example: intersection of sets

For two sets, this is the famous alternating projections algorithm¹, i.e., just keep projecting back and forth.

(From Boyd's lecture notes)

¹ von Neumann (1950), "Functional operators, volume II: The geometry of orthogonal spaces"
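Alternating projections can be sketched on a toy instance of our own (not from the slides): C₁ = {x : x₁ + x₂ = 1}, a hyperplane, and C₂ = {x : ‖x‖₂ ≤ 1}, the unit ball, both of which have closed-form projections.

```python
import math

# Alternating projections between a hyperplane and the unit ball.
# The intersection is nonempty (it contains (1, 0) and (0, 1)), so the
# iterates converge to a point in both sets.  Toy instance is ours.

def proj_hyperplane(x):
    # projection onto {x : x1 + x2 = 1}
    r = (x[0] + x[1] - 1.0) / 2.0
    return [x[0] - r, x[1] - r]

def proj_ball(x):
    # projection onto {x : ||x||_2 <= 1}
    n = math.hypot(x[0], x[1])
    if n <= 1.0:
        return list(x)
    return [x[0] / n, x[1] / n]

x = [4.0, -1.0]
for _ in range(200):                 # just project back and forth
    x = proj_ball(proj_hyperplane(x))
```

From this starting point the iterates converge to the intersection point (1, 0).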

Page 30

Projected subgradient method

To optimize a convex function f over a convex set C,

min_x f(x) subject to x ∈ C

we can use the projected subgradient method. Just like the usual subgradient method, except we project onto C at each iteration:

x^(k) = P_C(x^(k−1) − t_k · g^(k−1)), k = 1, 2, 3, ...

Assuming we can do this projection, we get the same convergence guarantees as the usual subgradient method, with the same step size choices.
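A minimal sketch of the projected update on a toy instance of our own (not from the slides): minimize f(x) = ‖x − a‖₁ over the nonnegative orthant, whose projection is the componentwise clamp max(x, 0).

```python
# Projected subgradient method for min ||x - a||_1 over {x : x >= 0},
# with a = (1, -2).  The constrained optimum is x* = (1, 0) with
# f* = 2.  Example data and helper names are ours.

a = [1.0, -2.0]

def f(x):
    return sum(abs(xi - ai) for xi, ai in zip(x, a))

def subgrad(x):
    # sign(x - a), componentwise: a valid subgradient of ||x - a||_1
    return [1.0 if d > 0 else (-1.0 if d < 0 else 0.0)
            for d in (xi - ai for xi, ai in zip(x, a))]

def proj_C(x):
    # projection onto the nonnegative orthant
    return [max(xi, 0.0) for xi in x]

x = [0.0, 0.0]
f_best, x_best = f(x), list(x)
for k in range(1, 200):
    g = subgrad(x)
    # project the subgradient step back onto C
    x = proj_C([xi - (1.0 / k) * gi for xi, gi in zip(x, g)])
    if f(x) < f_best:
        f_best, x_best = f(x), list(x)
```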

Page 31

Projected subgradient method

What sets C are easy to project onto? Lots, e.g.,

- Affine images: {Ax + b : x ∈ R^n}
- Solution set of linear system: {x : Ax = b}
- Nonnegative orthant: R^n_+ = {x : x ≥ 0}
- Some norm balls: {x : ‖x‖_p ≤ 1} for p = 1, 2, ∞
- Some simple polyhedra and simple cones

Warning: it is easy to write down a seemingly simple set C, and P_C can turn out to be very hard! E.g., it is generally hard to project onto an arbitrary polyhedron C = {x : Ax ≤ b}.

Note: projected gradient descent works too; more next time.

Page 32

Can we do better?

Upside of the subgradient method: broad applicability.

Downside: the O(1/ε²) convergence rate over the problem class of convex, Lipschitz functions is really slow.

Nonsmooth first-order methods: iterative methods updating x^(k) in

x^(0) + span{g^(0), g^(1), ..., g^(k−1)}

where the subgradients g^(0), g^(1), ..., g^(k−1) come from a weak oracle.

Page 33

Improving on the subgradient method

In words, we cannot do better than the O(1/ε²) rate of the subgradient method (unless we go beyond nonsmooth first-order methods).

So instead of trying to improve across the board, we will focus on minimizing composite functions of the form

f(x) = g(x) + h(x)

where g is convex and differentiable, h is convex and nonsmooth but "simple".

For a lot of problems (i.e., functions h), we can recover the O(1/ε) rate of gradient descent with a simple algorithm, having important practical consequences.
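To make "simple" concrete, a common illustration (ours, anticipating the next lecture rather than taken from this one) is h(x) = λ‖x‖₁: its proximal operator has a closed form, componentwise soft-thresholding.

```python
# Soft-thresholding: the closed-form proximal operator of lam * |v|,
# applied componentwise for an l1 term.  Having such a closed form is
# one reading of h being "simple"; this illustration is ours.

def soft_threshold(v, lam):
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0
```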

Page 34

References and further reading

- S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010-2011
- Y. Nesterov (1998), Introductory Lectures on Convex Optimization: A Basic Course, Chapter 3
- B. Polyak (1987), Introduction to Optimization, Chapter 5
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012

Posted: 18/05/2021, 11:57
