Lecture notes for Advanced Optimization, Chapter 9: Stochastic gradient descent. This chapter covers: proximal gradient descent, stochastic gradient descent, convergence rates, early stopping, and mini-batches.
Page 1: Stochastic Gradient Descent
Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
Page 2: Last time: proximal gradient descent
Consider the problem

min_x g(x) + h(x)

with g, h convex, g differentiable, and h "simple" in so much as its prox operator

prox_t(x) = argmin_z (1/(2t)) ‖x − z‖₂² + h(z)

is computable. Proximal gradient descent: choose an initial x^(0) and repeat

x^(k) = prox_{t_k}( x^(k−1) − t_k ∇g(x^(k−1)) ), k = 1, 2, 3, ...

Step sizes t_k chosen to be fixed and small, or via backtracking.

If ∇g is Lipschitz with constant L, then this has convergence rate O(1/ε). Lastly, we can accelerate this, to optimal rate O(1/√ε).
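To make the recap concrete, here is a minimal Python sketch of the update above (my illustration, not from the slides); soft-thresholding, the prox of the ℓ1 norm, is used purely as an example of a "simple" h:

```python
import numpy as np

def soft_threshold(z, c):
    """Prox of c*||.||_1: elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - c, 0.0)

def proximal_gradient(grad_g, prox_h, x0, t, num_iters=500):
    """x^(k) = prox_t(x^(k-1) - t * grad g(x^(k-1))), fixed step t."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_iters):
        x = prox_h(x - t * grad_g(x), t)
    return x
```

With prox_h = soft_threshold, this is exactly the ISTA iteration for the lasso.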
Page 4: Stochastic gradient descent
Consider minimizing an average of functions:

min_x (1/m) Σ_{i=1}^m f_i(x)

Since ∇ (1/m) Σ_{i=1}^m f_i(x) = (1/m) Σ_{i=1}^m ∇f_i(x), gradient descent would repeat

x^(k) = x^(k−1) − t_k · (1/m) Σ_{i=1}^m ∇f_i(x^(k−1)), k = 1, 2, 3, ...

In comparison, stochastic gradient descent (SGD) repeats

x^(k) = x^(k−1) − t_k · ∇f_{i_k}(x^(k−1)), k = 1, 2, 3, ...

where i_k ∈ {1, ..., m} is some chosen index at iteration k.
Page 5: Stochastic gradient descent
Two rules for choosing index i_k at iteration k:
• Cyclic rule: choose i_k = 1, 2, ..., m, 1, 2, ..., m, ...
• Randomized rule: choose i_k ∈ {1, ..., m} uniformly at random

The randomized rule is more common in practice. For the randomized rule, note that

E[∇f_{i_k}(x)] = ∇f(x),

so we can view SGD as using an unbiased estimate of the gradient at each step.

Main appeal of SGD (see the sketch below):
• Iteration cost is independent of m (number of functions)
• Can also be a big savings in terms of memory usage
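A minimal sketch of both index rules (my illustration, not from the slides); each iteration touches a single f_i, so the cost per step does not depend on m:

```python
import numpy as np

def sgd(grad_fis, x0, step, num_iters, rule="randomized", seed=0):
    """Minimize (1/m) * sum_i f_i(x), given per-function gradients.

    step is a function of the iteration counter k, so diminishing
    schedules like step = lambda k: 1.0 / (k + 1) plug in directly.
    """
    rng = np.random.default_rng(seed)
    m = len(grad_fis)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(num_iters):
        # cyclic: i_k = 1, 2, ..., m, 1, 2, ...; randomized: uniform draw
        i = k % m if rule == "cyclic" else int(rng.integers(m))
        x -= step(k) * grad_fis[i](x)  # one gradient: cost independent of m
    return x
```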
Page 6: Example: stochastic logistic regression
Given (x_i, y_i) ∈ ℝ^p × {0, 1}, i = 1, ..., n, recall logistic regression:

min_β f(β) = (1/n) Σ_{i=1}^n ( −y_i x_i^T β + log(1 + e^{x_i^T β}) )

Full gradient (also called batch) versus stochastic gradient:
• One batch update costs O(np)
• One stochastic update costs O(p)

Clearly, e.g., 10K stochastic steps are much more affordable (a rough sketch of the two updates is given below).
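As an illustration (my sketch, with hypothetical array names X and y), here is one batch update gradient versus one stochastic update gradient for the logistic criterion; the first touches all n rows, the second only one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_grad(X, y, beta):
    """Full gradient (1/n) * sum_i (p_i(beta) - y_i) x_i: O(np)."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

def stochastic_grad(X, y, beta, i):
    """Gradient of a single f_i: touches one row of X, so O(p)."""
    return (sigmoid(X[i] @ beta) - y[i]) * X[i]
```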
Page 7: Batch vs stochastic gradient descent
Small example with n = 10, p = 2 to show the "classic picture" for batch versus stochastic methods:

[Figure: blue shows batch steps, costing O(np) each; red shows stochastic steps, costing O(p) each.]

Rule of thumb for stochastic methods:
• generally thrive far from optimum
• generally struggle close to optimum
Page 8: Step sizes
Standard in SGD is to use diminishing step sizes, e.g., t_k = 1/k, for k = 1, 2, 3, ...

Why not fixed step sizes? Here's some intuition. Suppose we take the cyclic rule for simplicity. Setting t_k = t for m updates in a row, we get

x^(k+m) = x^(k) − t Σ_{i=1}^m ∇f_i(x^(k+i−1))

Meanwhile, full gradient descent with step size mt would give

x^(k+m) = x^(k) − t Σ_{i=1}^m ∇f_i(x^(k))

The difference here: t Σ_{i=1}^m [∇f_i(x^(k+i−1)) − ∇f_i(x^(k))], and if we hold t constant, this difference will not generally be going to zero. (A small numerical illustration follows.)
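A tiny numerical illustration of this point (my toy construction, not from the slides): on an average of one-dimensional quadratics, a fixed step leaves the iterate bouncing in a noise ball, while t_k = 1/k settles down.

```python
import numpy as np

# Toy average of functions: f_i(x) = (x - a_i)^2 / 2, so the
# average f is minimized at mean(a), and grad f_i(x) = x - a_i.
rng = np.random.default_rng(0)
a = rng.normal(size=100)

x_fixed, x_dimin = 0.0, 0.0
for k in range(1, 10_001):
    i = rng.integers(len(a))
    x_fixed -= 0.1 * (x_fixed - a[i])        # fixed step t = 0.1
    x_dimin -= (1.0 / k) * (x_dimin - a[i])  # diminishing t_k = 1/k

# x_fixed typically stalls at a noise floor; x_dimin gets much closer.
print(abs(x_fixed - a.mean()), abs(x_dimin - a.mean()))
```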
Page 9: Convergence rates

When f is differentiable with Lipschitz gradient, there holds for gradient descent with suitable fixed step sizes

f(x^(k)) − f* = O(1/k)

For convex f, SGD with diminishing step sizes instead satisfies the slower rate E[f(x^(k))] − f* = O(1/√k).
Page 10: Convergence rates
Even worse is the following discrepancy!

When f is strongly convex and has a Lipschitz gradient, gradient descent satisfies

f(x^(k)) − f* = O(c^k)

where c < 1. But under the same conditions, SGD (with diminishing step sizes) only gives us

E[f(x^(k))] − f* = O(1/k)

So SGD cannot exploit strong convexity to get a linear rate (a toy comparison follows).
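The discrepancy is easy to see numerically (again my toy construction, not from the slides): on a strongly convex quadratic, full-gradient descent contracts the error geometrically, while SGD with t_k = 1/k shrinks it only polynomially.

```python
import numpy as np

# f_i(x) = (x - a_i)^2 / 2: strongly convex with Lipschitz gradient;
# the full gradient of the average is x - mean(a).
rng = np.random.default_rng(1)
a = rng.normal(size=50)

x_gd, x_sgd = 0.0, 0.0
for k in range(1, 201):
    x_gd -= 0.5 * (x_gd - a.mean())       # full gradient: O(c^k) gap
    i = rng.integers(len(a))
    x_sgd -= (1.0 / k) * (x_sgd - a[i])   # SGD: only O(1/k) gap
print((x_gd - a.mean()) ** 2, (x_sgd - a.mean()) ** 2)
```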
Page 11: Mini-batches

Also common is mini-batch stochastic gradient descent, where we choose a random subset I_k ⊆ {1, ..., m} of size |I_k| = b ≪ m and repeat

x^(k) = x^(k−1) − t_k · (1/b) Σ_{i∈I_k} ∇f_i(x^(k−1)), k = 1, 2, 3, ...

Again, we are approximating the full gradient by an unbiased estimate:

E[ (1/b) Σ_{i∈I_k} ∇f_i(x) ] = ∇f(x)

Using mini-batches reduces the variance of our gradient estimate by a factor of 1/b, but is also b times more expensive. (A minimal sketch is given below.)
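A minimal mini-batch version of the earlier SGD sketch (my illustration); each update is the averaged gradient over a random subset of size b:

```python
import numpy as np

def minibatch_sgd(grad_fis, x0, step, b, num_iters, seed=0):
    """Each step averages grad f_i over a random batch I_k, |I_k| = b."""
    rng = np.random.default_rng(seed)
    m = len(grad_fis)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(num_iters):
        idx = rng.choice(m, size=b, replace=False)    # draw I_k
        g = sum(grad_fis[i](x) for i in idx) / b      # unbiased for grad f
        x -= step(k) * g                              # cost O(b * p)
    return x
```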
Page 12: Batch vs mini-batches vs stochastic
Back to logistic regression, let's now consider a regularized version:

min_{β∈ℝ^p} (1/n) Σ_{i=1}^n ( −y_i x_i^T β + log(1 + e^{x_i^T β}) ) + (λ/2) ‖β‖₂²

Write the criterion as

f(β) = (1/n) Σ_{i=1}^n f_i(β),   f_i(β) = −y_i x_i^T β + log(1 + e^{x_i^T β}) + (λ/2) ‖β‖₂²

Full gradient computation is

∇f(β) = (1/n) Σ_{i=1}^n (p_i(β) − y_i) x_i + λβ,   where p_i(β) = e^{x_i^T β} / (1 + e^{x_i^T β})

Comparison between methods (a sketch of the single-example update follows the list):
• One batch update costs O(np)
• One mini-batch update costs O(bp)
• One stochastic update costs O(p)
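For the regularized criterion, one stochastic update is still O(p). A short sketch (my code, matching the f_i defined above):

```python
import numpy as np

def fi_grad(X, y, beta, i, lam):
    """grad f_i(beta) = (p_i(beta) - y_i) x_i + lam * beta: O(p)."""
    p_i = 1.0 / (1.0 + np.exp(-(X[i] @ beta)))
    return (p_i - y[i]) * X[i] + lam * beta
```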
Page 13: Batch vs mini-batches vs stochastic
Example with n = 10,000, p = 20, all methods use fixed step sizes:
Page 14: Batch vs mini-batches vs stochastic
What’s happening? Now let’s parametrize by flops:
Page 15: Batch vs mini-batches vs stochastic
Finally, looking at the suboptimality gap (on a log scale):
Page 16: End of the story?
Short story:
• SGD can be super effective in terms of iteration cost and memory
• But SGD is slow to converge, and can't adapt to strong convexity
• And mini-batches seem to be a wash in terms of flops (though they can still be useful in practice)

Is this the end of the story for SGD?

For a while, the answer was believed to be yes. Slow convergence for strongly convex functions was believed inevitable, as Nemirovski and others established matching lower bounds ... but this was for a more general stochastic problem, where f(x) = ∫ F(x, ζ) dP(ζ). A new wave of "variance reduction" work shows we can modify SGD to converge much faster for finite sums (more later?).
Page 19: SGD in large-scale ML
SGD has really taken off in large-scale machine learning:
• In many ML problems we don't care about optimizing to high accuracy; it doesn't pay off in terms of statistical performance
• Thus (in contrast to what classic theory says) fixed step sizes are commonly used in ML applications
• One trick is to experiment with step sizes using a small fraction of the training data before running SGD on the full data set; many other heuristics are common
• Many variants provide better practical stability and convergence: momentum, acceleration, averaging, coordinate-adapted step sizes, variance reduction (a momentum sketch is given below)
• See AdaGrad, Adam, AdaMax, SVRG, SAG, SAGA (more later?)
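As one example of the variants listed above, here is a hedged sketch of SGD with heavy-ball momentum (my illustration, not from the slides; the step size and momentum constant are arbitrary):

```python
import numpy as np

def sgd_momentum(grad_fis, x0, t=0.01, mu=0.9, num_iters=1000, seed=0):
    """SGD plus a velocity term that accumulates past gradients."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(num_iters):
        g = grad_fis[int(rng.integers(len(grad_fis)))](x)
        v = mu * v - t * g   # momentum smooths the noisy step directions
        x = x + v
    return x
```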
Page 20: Early stopping

Suppose we want to fit a logistic regression model while controlling the size of the coefficients, via the ℓ2 constrained problem:

min_{β∈ℝ^p} (1/n) Σ_{i=1}^n ( −y_i x_i^T β + log(1 + e^{x_i^T β}) ) subject to ‖β‖₂ ≤ t
We could also run gradient descent on the unregularized problem

min_{β∈ℝ^p} (1/n) Σ_{i=1}^n ( −y_i x_i^T β + log(1 + e^{x_i^T β}) )

and stop early, i.e., terminate gradient descent well short of the global minimum.
Page 21: Early stopping
Consider the following, for a very small constant step size ε:
• Start at β^(0) = 0, the solution to the regularized problem at t = 0
• Perform gradient descent on the unregularized criterion

  β^(k) = β^(k−1) − ε · (1/n) Σ_{i=1}^n (p_i(β^(k−1)) − y_i) x_i, k = 1, 2, 3, ...

  (we could equally well consider SGD)
• Treat β^(k) as an approximate solution to the regularized problem with t = ‖β^(k)‖₂

This is called early stopping for gradient descent. Why would we ever do this? It's both more convenient and potentially much more efficient than using explicit regularization. (A minimal sketch follows.)
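A minimal sketch of the procedure (my code; p_i is the logistic probability defined earlier): run plain gradient descent with a tiny ε and read off ‖β^(k)‖₂ as the implicit value of t.

```python
import numpy as np

def early_stopping_path(X, y, eps=1e-3, num_iters=5000):
    """Gradient descent on the unregularized logistic loss; each
    iterate beta^(k) is treated as an approximate solution of the
    regularized problem with t = ||beta^(k)||_2."""
    beta = np.zeros(X.shape[1])          # beta^(0) = 0, i.e. t = 0
    path = []
    for _ in range(num_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        beta = beta - eps * X.T @ (p - y) / len(y)
        path.append((np.linalg.norm(beta), beta.copy()))
    return path
```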
Page 22: An intriguing connection
When we solve the ℓ2 regularized logistic problem for varying t, the solution path looks quite similar to the gradient descent path!

Example with p = 8, solution and gradient descent paths side by side:

[Figure: ℓ2 regularized solution path and gradient descent path, shown side by side.]
Page 23: Lots left to explore
• The connection holds beyond logistic regression, for an arbitrary loss
• In general, the gradient descent path will not coincide with the ℓ2 regularized path (as ε → 0), though in practice it seems to give competitive statistical performance
• Can extend the early stopping idea to mimic a generic regularizer (beyond ℓ2)⁵
• There is a lot of literature on early stopping, but it's still not as well-understood as it should be
• Early stopping is just one instance of implicit or algorithmic regularization; there are many such instances, and they all should be better understood
⁵ Tibshirani (2015), “A general framework for fast stagewise algorithms”
Page 24: References and further reading
• D. Bertsekas (2010), “Incremental gradient, subgradient, and proximal methods for convex optimization: a survey”
• A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro (2009), “Robust stochastic approximation approach to stochastic programming”
• R. Tibshirani (2015), “A general framework for fast stagewise algorithms”