Proximal Gradient Descent (and Acceleration)
Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
Last time: subgradient method
Consider the problem

    \min_x f(x)

with f convex and \mathrm{dom}(f) = \mathbb{R}^n. The subgradient method iterates

    x^{(k)} = x^{(k-1)} - t_k \cdot g^{(k-1)}, \quad k = 1, 2, 3, \ldots

where g^{(k-1)} \in \partial f(x^{(k-1)}). We use pre-set rules for the step sizes (e.g., a diminishing step size rule).

If f is Lipschitz, then the subgradient method has convergence rate O(1/\varepsilon^2).

Upside: very generic. Downside: can be slow (addressed today).
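As an illustration (not part of the original slides), a minimal NumPy sketch of this iteration; the oracle names f and subgrad and the step rule t_k = 1/k are my own choices here:

    import numpy as np

    def subgradient_method(f, subgrad, x0, n_iter=1000):
        # Subgradient method with pre-set diminishing step sizes t_k = 1/k.
        x, x_best = x0.copy(), x0.copy()
        for k in range(1, n_iter + 1):
            x = x - (1.0 / k) * subgrad(x)   # g^{(k-1)} in the subdifferential of f at x^{(k-1)}
            if f(x) < f(x_best):             # not a descent method, so track
                x_best = x.copy()            # the best iterate seen so far
        return x_best

    # Example: f(x) = ||x||_1, for which sign(x) is a valid subgradient
    x_star = subgradient_method(lambda x: np.abs(x).sum(), np.sign,
                                np.array([3.0, -2.0]))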
Today
- Proximal gradient descent
- Convergence analysis
- ISTA, matrix completion
- Special cases
- Acceleration
Decomposable functions
Suppose

    f(x) = g(x) + h(x)

where
- g is convex, differentiable, \mathrm{dom}(g) = \mathbb{R}^n
- h is convex, not necessarily differentiable.

If f were differentiable, then the gradient descent update would be

    x^+ = x - t \cdot \nabla f(x).

Recall the motivation: minimize the quadratic approximation to f around x, replacing \nabla^2 f(x) by \frac{1}{t} I:

    x^+ = \mathrm{argmin}_z \; \tilde{f}_t(z), \quad \text{where } \tilde{f}_t(z) = f(x) + \nabla f(x)^T (z - x) + \frac{1}{2t} \|z - x\|_2^2.
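Completing the square shows why minimizing \tilde{f}_t(z) recovers the gradient step (a standard calculation, written out here for clarity):

    f(x) + \nabla f(x)^T (z - x) + \frac{1}{2t} \|z - x\|_2^2 = \frac{1}{2t} \big\|z - \big(x - t \nabla f(x)\big)\big\|_2^2 + f(x) - \frac{t}{2} \|\nabla f(x)\|_2^2,

so the minimizer over z is exactly x^+ = x - t \nabla f(x).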
Decomposable functions
In our case f is not differentiable, but f = g + h with g differentiable. Why don't we make a quadratic approximation to g and leave h alone? I.e., update

    x^+ = \mathrm{argmin}_z \; \tilde{g}_t(z) + h(z)
        = \mathrm{argmin}_z \; g(x) + \nabla g(x)^T (z - x) + \frac{1}{2t} \|z - x\|_2^2 + h(z)
        = \mathrm{argmin}_z \; \frac{1}{2t} \|z - (x - t \nabla g(x))\|_2^2 + h(z).

The first term \frac{1}{2t} \|z - (x - t \nabla g(x))\|_2^2 keeps z close to the gradient update for g, while the second term h(z) keeps h small.
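The last argmin defines the proximal gradient iteration. A minimal NumPy sketch, assuming the caller supplies grad_g and a prox oracle prox_th(v, t) for the function t·h (names are illustrative, not from the slides):

    import numpy as np

    def proximal_gradient(grad_g, prox_th, x0, t, n_iter=500):
        # Proximal gradient descent: gradient step on g, then prox step on h.
        x = x0.copy()
        for _ in range(n_iter):
            # x+ = argmin_z (1/2t)||z - (x - t grad g(x))||_2^2 + h(z)
            x = prox_th(x - t * grad_g(x), t)
        return x

    # Toy example: g(x) = 0.5*||x - b||_2^2, h(x) = ||x||_1, for which the
    # prox of t*h is soft-thresholding at level t (see the next slide).
    b = np.array([3.0, 0.2, -1.5])
    prox_l1 = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    x_star = proximal_gradient(lambda x: x - b, prox_l1, np.zeros(3), t=0.5)
    # x_star is approximately [2.0, 0.0, -0.5], the soft-threshold of b at level 1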
Proximal mapping
The proximal mapping (or prox-operator) of a convex function h is defined as

    \mathrm{prox}_h(x) = \mathrm{argmin}_z \; \frac{1}{2} \|x - z\|_2^2 + h(z).

Examples:
- h(x) = 0: \mathrm{prox}_h(x) = x.
- h(x) is the indicator function of a closed convex set C: \mathrm{prox}_h is the projection onto C,

      \mathrm{prox}_h(x) = \mathrm{argmin}_{z \in C} \; \frac{1}{2} \|x - z\|_2^2 = P_C(x).

- h(x) = \|x\|_1: \mathrm{prox}_h is the 'soft-threshold' (shrinkage) operator,

      \mathrm{prox}_h(x)_i = \begin{cases} x_i - 1 & \text{if } x_i \geq 1 \\ 0 & \text{if } |x_i| \leq 1 \\ x_i + 1 & \text{if } x_i \leq -1. \end{cases}
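The \ell_1 case is easy to implement and vectorize; a minimal NumPy sketch with a general threshold lam (the slide's case is lam = 1):

    import numpy as np

    def soft_threshold(x, lam=1.0):
        # prox of lam*||.||_1: shrink each entry toward zero by lam,
        # zeroing entries with |x_i| <= lam.
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    print(soft_threshold(np.array([2.0, 0.5, -3.0])))   # [ 1.  0. -2.]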
Proximal mapping
Theorem
If h is convex and closed (i.e., has a closed epigraph), then

    \mathrm{prox}_h(x) = \mathrm{argmin}_z \; \frac{1}{2} \|x - z\|_2^2 + h(z)

exists and is unique for all x.

Proof
See proxop.pdf. Uniqueness follows since the objective function is strictly convex.

Optimality condition:

    z = \mathrm{prox}_h(x) \iff x - z \in \partial h(z)
                           \iff h(u) \geq h(z) + (x - z)^T (u - z) \quad \forall u.
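A quick numerical sanity check of this optimality condition for h = \|\cdot\|_1, using a soft-threshold implementation as above (illustrative, not from the slides): at z = \mathrm{prox}_h(x), each coordinate of x - z must lie in the subdifferential of |\cdot| at z_i.

    import numpy as np

    def soft_threshold(x, lam=1.0):
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    x = np.array([2.0, 0.5, -3.0])
    z = soft_threshold(x)            # z = prox_h(x) for h = ||.||_1
    r = x - z                        # should lie in the subdifferential of h at z
    # coordinate-wise: {sign(z_i)} if z_i != 0, and the interval [-1, 1] if z_i = 0
    print(np.all(np.where(z != 0, r == np.sign(z), np.abs(r) <= 1)))   # True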