Lecture notes for Advanced Optimization, Chapter 9: Stochastic gradient descent. This chapter covers: proximal gradient descent, stochastic gradient descent, convergence rates, early stopping, and mini-batches.
Page 1: Stochastic Gradient Descent
Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
Page 2: Last time: proximal gradient descent
Consider the problem

min_x g(x) + h(x)

with g, h convex, g differentiable, and h "simple" in so much as its prox operator

prox_t(x) = argmin_z (1/(2t)) ‖x − z‖₂² + h(z)

is computable. Proximal gradient descent: choose an initial x^(0) and repeat

x^(k) = prox_{t_k}( x^(k−1) − t_k ∇g(x^(k−1)) ), k = 1, 2, 3, ...

Step sizes t_k chosen to be fixed and small, or via backtracking.

If ∇g is Lipschitz with constant L, then this has convergence rate O(1/ε). Lastly, we can accelerate this, to optimal rate O(1/√ε).
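To make the recap concrete, here is a minimal Python sketch of the update above (my illustration, not from the slides); soft-thresholding, the prox of the ℓ1 norm, is used purely as an example of a "simple" h:

```python
import numpy as np

def soft_threshold(z, c):
    """Prox of c*||.||_1: elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - c, 0.0)

def proximal_gradient(grad_g, prox_h, x0, t, num_iters=500):
    """x^(k) = prox_t(x^(k-1) - t * grad g(x^(k-1))), fixed step t."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_iters):
        x = prox_h(x - t * grad_g(x), t)
    return x
```

With prox_h = soft_threshold, this is exactly the ISTA iteration for the lasso.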
Page 4: Stochastic gradient descent
Consider minimizing an average of functions:

min_x (1/m) Σ_{i=1}^m f_i(x)

Since ∇ (1/m) Σ_{i=1}^m f_i(x) = (1/m) Σ_{i=1}^m ∇f_i(x), gradient descent would repeat

x^(k) = x^(k−1) − t_k · (1/m) Σ_{i=1}^m ∇f_i(x^(k−1)), k = 1, 2, 3, ...

In comparison, stochastic gradient descent (SGD) repeats

x^(k) = x^(k−1) − t_k · ∇f_{i_k}(x^(k−1)), k = 1, 2, 3, ...

where i_k ∈ {1, ..., m} is some chosen index at iteration k.
Page 5: Stochastic gradient descent
Two rules for choosing index i_k at iteration k:
• Cyclic rule: choose i_k = 1, 2, ..., m, 1, 2, ..., m, ...
• Randomized rule: choose i_k ∈ {1, ..., m} uniformly at random

The randomized rule is more common in practice. For the randomized rule, note that

E[∇f_{i_k}(x)] = ∇f(x),

so we can view SGD as using an unbiased estimate of the gradient at each step.

Main appeal of SGD (see the sketch below):
• Iteration cost is independent of m (number of functions)
• Can also be a big savings in terms of memory usage
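A minimal sketch of both index rules (my illustration, not from the slides); each iteration touches a single f_i, so the cost per step does not depend on m:

```python
import numpy as np

def sgd(grad_fis, x0, step, num_iters, rule="randomized", seed=0):
    """Minimize (1/m) * sum_i f_i(x), given per-function gradients.

    step is a function of the iteration counter k, so diminishing
    schedules like step = lambda k: 1.0 / (k + 1) plug in directly.
    """
    rng = np.random.default_rng(seed)
    m = len(grad_fis)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(num_iters):
        # cyclic: i_k = 1, 2, ..., m, 1, 2, ...; randomized: uniform draw
        i = k % m if rule == "cyclic" else int(rng.integers(m))
        x -= step(k) * grad_fis[i](x)  # one gradient: cost independent of m
    return x
```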
Page 6: Example: stochastic logistic regression
Given (x_i, y_i) ∈ ℝ^p × {0, 1}, i = 1, ..., n, recall logistic regression:

min_β f(β) = (1/n) Σ_{i=1}^n ( −y_i x_i^T β + log(1 + e^{x_i^T β}) )

Full gradient (also called batch) versus stochastic gradient:
• One batch update costs O(np)
• One stochastic update costs O(p)

Clearly, e.g., 10K stochastic steps are much more affordable (a rough sketch of the two updates is given below).
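As an illustration (my sketch, with hypothetical array names X and y), here is one batch update gradient versus one stochastic update gradient for the logistic criterion; the first touches all n rows, the second only one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_grad(X, y, beta):
    """Full gradient (1/n) * sum_i (p_i(beta) - y_i) x_i: O(np)."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

def stochastic_grad(X, y, beta, i):
    """Gradient of a single f_i: touches one row of X, so O(p)."""
    return (sigmoid(X[i] @ beta) - y[i]) * X[i]
```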
Page 7: Batch vs stochastic gradient descent
Small example with n = 10, p = 2 to show the "classic picture" for batch versus stochastic methods:

[Figure: blue shows batch steps, costing O(np) each; red shows stochastic steps, costing O(p) each.]

Rule of thumb for stochastic methods:
• generally thrive far from optimum
• generally struggle close to optimum
Page 8: Step sizes
Standard in SGD is to use diminishing step sizes, e.g., t_k = 1/k, for k = 1, 2, 3, ...

Why not fixed step sizes? Here's some intuition. Suppose we take the cyclic rule for simplicity. Setting t_k = t for m updates in a row, we get

x^(k+m) = x^(k) − t Σ_{i=1}^m ∇f_i(x^(k+i−1))

Meanwhile, full gradient descent with step size mt would give

x^(k+m) = x^(k) − t Σ_{i=1}^m ∇f_i(x^(k))

The difference here: t Σ_{i=1}^m [∇f_i(x^(k+i−1)) − ∇f_i(x^(k))], and if we hold t constant, this difference will not generally be going to zero. (A small numerical illustration follows.)
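A tiny numerical illustration of this point (my toy construction, not from the slides): on an average of one-dimensional quadratics, a fixed step leaves the iterate bouncing in a noise ball, while t_k = 1/k settles down.

```python
import numpy as np

# Toy average of functions: f_i(x) = (x - a_i)^2 / 2, so the
# average f is minimized at mean(a), and grad f_i(x) = x - a_i.
rng = np.random.default_rng(0)
a = rng.normal(size=100)

x_fixed, x_dimin = 0.0, 0.0
for k in range(1, 10_001):
    i = rng.integers(len(a))
    x_fixed -= 0.1 * (x_fixed - a[i])        # fixed step t = 0.1
    x_dimin -= (1.0 / k) * (x_dimin - a[i])  # diminishing t_k = 1/k

# x_fixed typically stalls at a noise floor; x_dimin gets much closer.
print(abs(x_fixed - a.mean()), abs(x_dimin - a.mean()))
```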
Page 9: Convergence rates

When f is differentiable with Lipschitz gradient, there holds for gradient descent with suitable fixed step sizes

f(x^(k)) − f* = O(1/k)

For convex f, SGD with diminishing step sizes instead satisfies the slower rate E[f(x^(k))] − f* = O(1/√k).
Page 10: Convergence rates
Even worse is the following discrepancy!

When f is strongly convex and has a Lipschitz gradient, gradient descent satisfies

f(x^(k)) − f* = O(c^k)

where c < 1. But under the same conditions, SGD (with diminishing step sizes) only gives us

E[f(x^(k))] − f* = O(1/k)

So SGD cannot exploit strong convexity to get a linear rate (a toy comparison follows).
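The discrepancy is easy to see numerically (again my toy construction, not from the slides): on a strongly convex quadratic, full-gradient descent contracts the error geometrically, while SGD with t_k = 1/k shrinks it only polynomially.

```python
import numpy as np

# f_i(x) = (x - a_i)^2 / 2: strongly convex with Lipschitz gradient;
# the full gradient of the average is x - mean(a).
rng = np.random.default_rng(1)
a = rng.normal(size=50)

x_gd, x_sgd = 0.0, 0.0
for k in range(1, 201):
    x_gd -= 0.5 * (x_gd - a.mean())       # full gradient: O(c^k) gap
    i = rng.integers(len(a))
    x_sgd -= (1.0 / k) * (x_sgd - a[i])   # SGD: only O(1/k) gap
print((x_gd - a.mean()) ** 2, (x_sgd - a.mean()) ** 2)
```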
Page 11: Mini-batches

Also common is mini-batch stochastic gradient descent, where we choose a random subset I_k ⊆ {1, ..., m} of size |I_k| = b ≪ m and repeat

x^(k) = x^(k−1) − t_k · (1/b) Σ_{i∈I_k} ∇f_i(x^(k−1)), k = 1, 2, 3, ...

Again, we are approximating the full gradient by an unbiased estimate:

E[ (1/b) Σ_{i∈I_k} ∇f_i(x) ] = ∇f(x)

Using mini-batches reduces the variance of our gradient estimate by a factor of 1/b, but is also b times more expensive. (A minimal sketch is given below.)
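A minimal mini-batch version of the earlier SGD sketch (my illustration); each update is the averaged gradient over a random subset of size b:

```python
import numpy as np

def minibatch_sgd(grad_fis, x0, step, b, num_iters, seed=0):
    """Each step averages grad f_i over a random batch I_k, |I_k| = b."""
    rng = np.random.default_rng(seed)
    m = len(grad_fis)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(num_iters):
        idx = rng.choice(m, size=b, replace=False)    # draw I_k
        g = sum(grad_fis[i](x) for i in idx) / b      # unbiased for grad f
        x -= step(k) * g                              # cost O(b * p)
    return x
```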
Page 12: Batch vs mini-batches vs stochastic
Back to logistic regression, let's now consider a regularized version:

min_{β∈ℝ^p} (1/n) Σ_{i=1}^n ( −y_i x_i^T β + log(1 + e^{x_i^T β}) ) + (λ/2) ‖β‖₂²

Write the criterion as

f(β) = (1/n) Σ_{i=1}^n f_i(β),   f_i(β) = −y_i x_i^T β + log(1 + e^{x_i^T β}) + (λ/2) ‖β‖₂²

Full gradient computation is

∇f(β) = (1/n) Σ_{i=1}^n (p_i(β) − y_i) x_i + λβ,   where p_i(β) = e^{x_i^T β} / (1 + e^{x_i^T β})

Comparison between methods (a sketch of the single-example update follows the list):
• One batch update costs O(np)
• One mini-batch update costs O(bp)
• One stochastic update costs O(p)
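For the regularized criterion, one stochastic update is still O(p). A short sketch (my code, matching the f_i defined above):

```python
import numpy as np

def fi_grad(X, y, beta, i, lam):
    """grad f_i(beta) = (p_i(beta) - y_i) x_i + lam * beta: O(p)."""
    p_i = 1.0 / (1.0 + np.exp(-(X[i] @ beta)))
    return (p_i - y[i]) * X[i] + lam * beta
```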
Page 13: Batch vs mini-batches vs stochastic
Example with n = 10,000, p = 20, all methods use fixed step sizes:
Page 14: Batch vs mini-batches vs stochastic
What’s happening? Now let’s parametrize by flops:
Page 15: Batch vs mini-batches vs stochastic
Finally, looking at the suboptimality gap (on a log scale):
Page 16: End of the story?
Short story:
• SGD can be super effective in terms of iteration cost and memory
• But SGD is slow to converge, and can't adapt to strong convexity
• And mini-batches seem to be a wash in terms of flops (though they can still be useful in practice)

Is this the end of the story for SGD?

For a while, the answer was believed to be yes. Slow convergence for strongly convex functions was believed inevitable, as Nemirovski and others established matching lower bounds ... but this was for a more general stochastic problem, where f(x) = ∫ F(x, ζ) dP(ζ). A new wave of "variance reduction" work shows we can modify SGD to converge much faster for finite sums (more later?).
Page 19: SGD in large-scale ML
SGD has really taken off in large-scale machine learning:
• In many ML problems we don't care about optimizing to high accuracy; it doesn't pay off in terms of statistical performance
• Thus (in contrast to what classic theory says) fixed step sizes are commonly used in ML applications
• One trick is to experiment with step sizes using a small fraction of the training data before running SGD on the full data set; many other heuristics are common
• Many variants provide better practical stability and convergence: momentum, acceleration, averaging, coordinate-adapted step sizes, variance reduction (a momentum sketch is given below)
• See AdaGrad, Adam, AdaMax, SVRG, SAG, SAGA (more later?)
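As one example of the variants listed above, here is a hedged sketch of SGD with heavy-ball momentum (my illustration, not from the slides; the step size and momentum constant are arbitrary):

```python
import numpy as np

def sgd_momentum(grad_fis, x0, t=0.01, mu=0.9, num_iters=1000, seed=0):
    """SGD plus a velocity term that accumulates past gradients."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(num_iters):
        g = grad_fis[int(rng.integers(len(grad_fis)))](x)
        v = mu * v - t * g   # momentum smooths the noisy step directions
        x = x + v
    return x
```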
Page 20: Early stopping

Suppose we want to fit a logistic regression model while controlling the size of the coefficients, via the ℓ2 constrained problem:

min_{β∈ℝ^p} (1/n) Σ_{i=1}^n ( −y_i x_i^T β + log(1 + e^{x_i^T β}) ) subject to ‖β‖₂ ≤ t
We could also run gradient descent on the unregularized problem

min_{β∈ℝ^p} (1/n) Σ_{i=1}^n ( −y_i x_i^T β + log(1 + e^{x_i^T β}) )

and stop early, i.e., terminate gradient descent well short of the global minimum.
Page 21: Early stopping
Consider the following, for a very small constant step size ε:
• Start at β^(0) = 0, the solution to the regularized problem at t = 0
• Perform gradient descent on the unregularized criterion

  β^(k) = β^(k−1) − ε · (1/n) Σ_{i=1}^n (p_i(β^(k−1)) − y_i) x_i, k = 1, 2, 3, ...

  (we could equally well consider SGD)
• Treat β^(k) as an approximate solution to the regularized problem with t = ‖β^(k)‖₂

This is called early stopping for gradient descent. Why would we ever do this? It's both more convenient and potentially much more efficient than using explicit regularization. (A minimal sketch follows.)
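A minimal sketch of the procedure (my code; p_i is the logistic probability defined earlier): run plain gradient descent with a tiny ε and read off ‖β^(k)‖₂ as the implicit value of t.

```python
import numpy as np

def early_stopping_path(X, y, eps=1e-3, num_iters=5000):
    """Gradient descent on the unregularized logistic loss; each
    iterate beta^(k) is treated as an approximate solution of the
    regularized problem with t = ||beta^(k)||_2."""
    beta = np.zeros(X.shape[1])          # beta^(0) = 0, i.e. t = 0
    path = []
    for _ in range(num_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        beta = beta - eps * X.T @ (p - y) / len(y)
        path.append((np.linalg.norm(beta), beta.copy()))
    return path
```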
Page 22: An intriguing connection
When we solve the ℓ2 regularized logistic problem for varying t, the solution path looks quite similar to the gradient descent path!

Example with p = 8, solution and gradient descent paths side by side:

[Figure: ℓ2 regularized solution path and gradient descent path, shown side by side.]
Page 23: Lots left to explore
• The connection holds beyond logistic regression, for an arbitrary loss
• In general, the gradient descent path will not coincide with the ℓ2 regularized path (as ε → 0), though in practice it seems to give competitive statistical performance
• Can extend the early stopping idea to mimic a generic regularizer (beyond ℓ2)⁵
• There is a lot of literature on early stopping, but it's still not as well-understood as it should be
• Early stopping is just one instance of implicit or algorithmic regularization; there are many such instances, and they all should be better understood
⁵ Tibshirani (2015), “A general framework for fast stagewise algorithms”
Page 24: References and further reading
• D. Bertsekas (2010), “Incremental gradient, subgradient, and proximal methods for convex optimization: a survey”
• A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro (2009), “Robust stochastic approximation approach to stochastic programming”
• R. Tibshirani (2015), “A general framework for fast stagewise algorithms”