Lecture notes for Advanced Optimization, Chapter 6: Subgradients. The chapter covers: last time (gradient descent), subgradients, examples of subgradients, monotonicity, examples of non-subdifferentiable functions, and more. Readers are invited to consult the material.
Hoàng Nam Dũng
Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi
Last time: gradient descent

Consider the problem
$$\min_x f(x)$$
for f convex and differentiable, with dom(f) = R^n.

Gradient descent: choose an initial point x^(0) ∈ R^n and repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
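To make the iteration concrete, here is a minimal Python sketch of the update (my own illustration, not code from the slides); the quadratic objective, the fixed step size t, and the stopping tolerance are arbitrary choices.

```python
import numpy as np

def gradient_descent(grad_f, x0, t=0.05, max_iter=1000, tol=1e-8):
    """Iterate x^(k) = x^(k-1) - t * grad_f(x^(k-1)) starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:   # stop once the gradient is (nearly) zero
            break
        x = x - t * g                  # fixed-step gradient descent update
    return x

# Illustrative smooth convex objective: f(x) = 1/2 ||A x - b||_2^2 with gradient A^T (A x - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_hat = gradient_descent(lambda x: A.T @ (A @ x - b), x0=np.zeros(2))
print(x_hat, np.linalg.solve(A, b))    # the two should roughly agree
```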
Basic inequality

Recall that for convex and differentiable f,
$$f(y) \ge f(x) + \nabla f(x)^T (y - x), \quad \forall x, y \in \mathrm{dom}(f).$$

• The first-order approximation of f at x is a global lower bound.
• ∇f(x) defines a non-vertical supporting hyperplane to epi f at (x, f(x)):
$$\begin{bmatrix} \nabla f(x) \\ -1 \end{bmatrix}^T \left( \begin{bmatrix} y \\ t \end{bmatrix} - \begin{bmatrix} x \\ f(x) \end{bmatrix} \right) \le 0, \quad \forall (y, t) \in \mathrm{epi}\, f.$$
Subgradient

A subgradient of a convex function f at x is any g ∈ R^n such that
$$f(y) \ge f(x) + g^T (y - x), \quad \forall y \in \mathrm{dom}(f).$$

• It always exists (on the relative interior of dom(f)).
• If f is differentiable at x, then g = ∇f(x) is the unique subgradient.
• The same definition works for nonconvex f (however, subgradients need not exist).

[Figure: g_1 and g_2 are subgradients at x_1; g_3 is a subgradient at x_2.]
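As a quick numerical illustration of the definition (my own sketch, not from the slides), the code below checks the subgradient inequality f(y) ≥ f(x) + g(y − x) on a grid of points for f(x) = |x|; the candidate values of g and the grid are arbitrary.

```python
import numpy as np

def looks_like_subgradient(f, g, x, ys, tol=1e-12):
    """Check f(y) >= f(x) + g * (y - x) for every sampled y (evidence, not a proof)."""
    return all(f(y) >= f(x) + g * (y - x) - tol for y in ys)

f = abs                               # f(x) = |x|: convex, non-differentiable at 0
ys = np.linspace(-2.0, 2.0, 401)

print(looks_like_subgradient(f, 0.5, 0.0, ys))   # True:  0.5 lies in [-1, 1], the subdifferential at 0
print(looks_like_subgradient(f, 1.5, 0.0, ys))   # False: 1.5 lies outside [-1, 1]
print(looks_like_subgradient(f, 1.0, 2.0, ys))   # True:  the unique subgradient sign(2) = 1 at x = 2
```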
Examples of subgradients

Consider f : R → R, f(x) = |x|.
• For x ≠ 0, the unique subgradient is g = sign(x).
• For x = 0, the subgradient g is any element of [−1, 1].
Consider f : R^n → R, f(x) = ‖x‖_2.
• For x ≠ 0, the unique subgradient is g = x / ‖x‖_2.
• For x = 0, the subgradient g is any element of {z : ‖z‖_2 ≤ 1}.
Consider f : R^n → R, f(x) = ‖x‖_1 (a code sketch follows below).
• For x_i ≠ 0, the unique i-th component of the subgradient is g_i = sign(x_i).
• For x_i = 0, the i-th component g_i is any element of [−1, 1].
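The componentwise rule translates directly into a subgradient oracle; a minimal sketch (my own, not from the slides), where returning 0 for zero components is one arbitrary choice from [−1, 1]:

```python
import numpy as np

def l1_subgradient(x):
    """One subgradient of f(x) = ||x||_1: sign(x_i) where x_i != 0,
    and 0 (an arbitrary element of [-1, 1]) where x_i == 0."""
    return np.sign(x)

x = np.array([1.5, 0.0, -0.2])
g = l1_subgradient(x)                       # array([ 1.,  0., -1.])
y = np.array([0.3, -1.0, 2.0])
# Subgradient inequality ||y||_1 >= ||x||_1 + g^T (y - x) holds at this pair:
print(np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x))   # True
```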
Consider f(x) = max{f_1(x), f_2(x)}, with f_1, f_2 convex and differentiable (a code sketch follows below).
• For f_1(x) > f_2(x), the unique subgradient is g = ∇f_1(x).
• For f_2(x) > f_1(x), the unique subgradient is g = ∇f_2(x).
• For f_1(x) = f_2(x), the subgradient g is any point on the line segment between ∇f_1(x) and ∇f_2(x).
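A minimal subgradient oracle for this pointwise maximum (my own illustration, not code from the slides): it returns the gradient of the active function, and an arbitrary convex combination (here the midpoint) at a tie.

```python
def max_subgradient(f1, grad_f1, f2, grad_f2, x):
    """One subgradient of f(x) = max(f1(x), f2(x)) for differentiable convex f1, f2."""
    v1, v2 = f1(x), f2(x)
    if v1 > v2:
        return grad_f1(x)
    if v2 > v1:
        return grad_f2(x)
    # Tie: any point on the segment between the two gradients works; take the midpoint.
    return 0.5 * (grad_f1(x) + grad_f2(x))

# Example: f(x) = max(x^2, 1); the two pieces cross at x = 1 and x = -1.
f1, g1 = lambda x: x**2, lambda x: 2.0 * x
f2, g2 = lambda x: 1.0, lambda x: 0.0
print(max_subgradient(f1, g1, f2, g2, 2.0))   # 4.0, since f1 is active there
print(max_subgradient(f1, g1, f2, g2, 1.0))   # 1.0, the midpoint of the segment [0, 2]
```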
The set of all subgradients of a convex f is called the subdifferential:
$$\partial f(x) = \{ g \in R^n : g \text{ is a subgradient of } f \text{ at } x \}.$$

Properties:
• ∂f(x) is nonempty for convex f at x ∈ int(dom(f)).
• ∂f(x) is closed and convex (even for nonconvex f).
• If f is differentiable at x, then ∂f(x) = {∇f(x)}.
• If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g.
Proof: see http://www.seas.ucla.edu/~vandenbe/236C/lectures/subgradients.pdf
Monotonicity

The subdifferential of a convex function f is a monotone operator:
$$(u - v)^T (x - y) \ge 0 \quad \text{for all } u \in \partial f(x),\ v \in \partial f(y).$$
Proof: by the definition of subgradients,
$$f(y) \ge f(x) + u^T (y - x) \quad \text{and} \quad f(x) \ge f(y) + v^T (x - y).$$
Adding the two inequalities gives 0 ≥ (u − v)^T (y − x), i.e., (u − v)^T (x − y) ≥ 0, which shows monotonicity.

Question: what does monotonicity say for a differentiable convex function? It is
$$(\nabla f(x) - \nabla f(y))^T (x - y) \ge 0,$$
which follows directly from the first-order characterization of convex functions.
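A small numerical sanity check of this monotonicity property (my own sketch, not in the slides), using f(x) = ‖x‖_1 and the sign-based subgradients from the earlier example; the random sample points are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_subgradient(x):
    """One subgradient of f(x) = ||x||_1 (0 chosen for zero components)."""
    return np.sign(x)

# Check (u - v)^T (x - y) >= 0 on randomly sampled pairs of points.
ok = True
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    u, v = l1_subgradient(x), l1_subgradient(y)
    ok = ok and (u - v) @ (x - y) >= 0
print(ok)   # True: consistent with the subdifferential being a monotone operator
```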
Examples of non-subdifferentiable functions

The following functions are not subdifferentiable at x = 0:
Connection to convex geometry

Given a convex set C ⊆ R^n, consider the indicator function I_C : R^n → R,
$$I_C(x) = I\{x \in C\} = \begin{cases} 0 & \text{if } x \in C \\ \infty & \text{if } x \notin C. \end{cases}$$
Subgradient calculus

Basic rules for convex functions:
• Scaling: ∂(af) = a · ∂f, provided a > 0.
• Norms: an important special case is f(x) = ‖x‖_p. Let q be such that 1/p + 1/q = 1; then
$$\|x\|_p = \max_{\|z\|_q \le 1} z^T x \quad \text{and} \quad \partial f(x) = \mathrm{argmax}_{\|z\|_q \le 1} z^T x.$$
Why subgradients?

Subgradients are important for two reasons:
• Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality.
• Convex optimization: if you can compute subgradients, then you can minimize any convex function.
Optimality condition

Subgradient optimality condition: for any f (convex or not),
$$f(x^*) = \min_x f(x) \iff 0 \in \partial f(x^*),$$
i.e., x* is a minimizer if and only if 0 is a subgradient of f at x*.

Why? Easy: g = 0 being a subgradient means that for all y,
$$f(y) \ge f(x^*) + 0^T (y - x^*) = f(x^*).$$

Note the implication for a convex and differentiable function f with ∂f(x) = {∇f(x)}: in that case x* is a minimizer if and only if ∇f(x*) = 0.
Derivation of first-order optimality

Example of the power of subgradients: we can use what we have learned so far to derive the first-order optimality condition: for f convex and differentiable and C a convex set,
$$\min_x f(x) \ \text{ subject to } \ x \in C \quad \text{is solved at } x \iff \nabla f(x)^T (y - x) \ge 0 \ \text{ for all } y \in C.$$
For a direct proof see, e.g., http://www.princeton.edu/~amirali/Public/Teaching/ORF523/S16/ORF523_S16_Lec7_gh.pdf; a proof using subgradients is on the next slide.

Intuitively, the condition says that the gradient increases as we move away from x. Note that for C = R^n (the unconstrained case) it reduces to ∇f(x) = 0.
Derivation of first-order optimality

Recast the constrained problem as the unconstrained problem min_x f(x) + I_C(x), where I_C is the indicator function of C. By subgradient optimality,
$$0 \in \partial\big(f(x) + I_C(x)\big) = \{\nabla f(x)\} + N_C(x),$$
where N_C(x) = {g : g^T x ≥ g^T y for all y ∈ C} is the normal cone of C at x. Hence −∇f(x) ∈ N_C(x), i.e.,
$$-\nabla f(x)^T x \ge -\nabla f(x)^T y \ \text{ for all } y \in C,$$
which is exactly ∇f(x)^T (y − x) ≥ 0 for all y ∈ C.
Example: lasso optimality conditions

Given y ∈ R^n, X ∈ R^{n×p}, the lasso problem can be parametrized as
$$\min_\beta \ \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1,$$
where λ ≥ 0. Subgradient optimality:
$$0 \in \partial\Big(\frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1\Big)
\iff 0 \in -X^T(y - X\beta) + \lambda\, \partial\|\beta\|_1
\iff X^T(y - X\beta) = \lambda v$$
for some v ∈ ∂‖β‖_1, i.e.,
$$v_i \in \begin{cases} \{\mathrm{sign}(\beta_i)\} & \text{if } \beta_i \ne 0 \\ [-1, 1] & \text{if } \beta_i = 0, \end{cases} \qquad i = 1, \ldots, p.$$
Example: lasso optimality conditions

Write X_1, …, X_p for the columns of X. Then our condition reads
$$\begin{cases} X_i^T (y - X\beta) = \lambda \cdot \mathrm{sign}(\beta_i) & \text{if } \beta_i \ne 0 \\ |X_i^T (y - X\beta)| \le \lambda & \text{if } \beta_i = 0. \end{cases}$$
Note that these conditions do not directly give a closed-form expression for a lasso solution, but they do provide a way to check optimality (a numerical check is sketched below). They are also helpful in understanding the lasso estimator; e.g., if |X_i^T (y − Xβ)| < λ, then β_i = 0 (used by screening rules, more later?).
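A minimal numerical check of these conditions for a candidate solution (my own sketch, not code from the slides); the tolerance `tol` and the toy data are arbitrary.

```python
import numpy as np

def is_lasso_optimal(X, y, beta, lam, tol=1e-8):
    """Check the lasso subgradient optimality conditions:
    X_i^T (y - X beta) = lam * sign(beta_i)  if beta_i != 0,
    |X_i^T (y - X beta)| <= lam              if beta_i == 0."""
    corr = X.T @ (y - X @ beta)                 # vector of X_i^T (y - X beta)
    active = beta != 0
    ok_active = np.all(np.abs(corr[active] - lam * np.sign(beta[active])) <= tol)
    ok_zero = np.all(np.abs(corr[~active]) <= lam + tol)
    return bool(ok_active and ok_zero)

# Toy example with X = I, where the solution is given by soft-thresholding (next slides).
y = np.array([3.0, 0.5, -2.0])
X = np.eye(3)
lam = 1.0
beta_hat = np.array([2.0, 0.0, -1.0])
print(is_lasso_optimal(X, y, beta_hat, lam))    # True
print(is_lasso_optimal(X, y, y, lam))           # False: beta = y ignores the l1 penalty
```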
Example: soft-thresholding

Simplified lasso problem with X = I:
$$\min_\beta \ \frac{1}{2}\|y - \beta\|_2^2 + \lambda \|\beta\|_1.$$
This we can solve directly using subgradient optimality. The solution is β = S_λ(y), where S_λ is the soft-thresholding operator
$$[S_\lambda(y)]_i = \begin{cases} y_i - \lambda & \text{if } y_i > \lambda \\ 0 & \text{if } -\lambda \le y_i \le \lambda \\ y_i + \lambda & \text{if } y_i < -\lambda, \end{cases}$$
and the subgradient optimality conditions are
$$\begin{cases} y_i - \beta_i = \lambda \cdot \mathrm{sign}(\beta_i) & \text{if } \beta_i \ne 0 \\ |y_i - \beta_i| \le \lambda & \text{if } \beta_i = 0. \end{cases}$$
Example: soft-thresholding

Now plug in β = S_λ(y) and check that these conditions are satisfied:
• When y_i > λ: β_i = y_i − λ > 0, so y_i − β_i = λ = λ · 1 = λ · sign(β_i).
• When y_i < −λ: the argument is similar.
• When |y_i| ≤ λ: β_i = 0, and |y_i − β_i| = |y_i| ≤ λ.
A code sketch of the operator follows below.
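A minimal implementation of the soft-thresholding operator (my own illustration, not code from the slides), together with a numerical check of the optimality conditions above:

```python
import numpy as np

def soft_threshold(y, lam):
    """Elementwise soft-thresholding operator S_lam(y)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, 0.4, -1.2, 0.0])
lam = 1.0
beta = soft_threshold(y, lam)
print(beta)                                  # [ 2.   0.  -0.2  0. ]

# Verify the subgradient optimality conditions from the previous slide:
nonzero = beta != 0
print(np.allclose(y[nonzero] - beta[nonzero], lam * np.sign(beta[nonzero])))  # True
print(np.all(np.abs(y[~nonzero] - beta[~nonzero]) <= lam))                    # True
```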
Example: distance to a convex set

Recall the distance function to a closed, convex set C:
$$\mathrm{dist}(x, C) = \min_{y \in C} \|y - x\|_2.$$
This is a convex function. What are its subgradients?

Write dist(x, C) = ‖x − P_C(x)‖_2, where P_C(x) is the projection of x onto C. It turns out that when dist(x, C) > 0,
$$\partial\, \mathrm{dist}(x, C) = \left\{ \frac{x - P_C(x)}{\|x - P_C(x)\|_2} \right\}.$$
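To illustrate the formula (my own sketch, not from the slides), take C to be the Euclidean unit ball, where the projection P_C has a simple closed form; the test points are arbitrary.

```python
import numpy as np

def project_unit_ball(x):
    """Projection of x onto C = {z : ||z||_2 <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1 else x / norm

def dist_and_subgradient(x, project):
    """dist(x, C) and its subgradient (x - P_C(x)) / dist(x, C), assuming dist(x, C) > 0."""
    p = project(x)
    d = np.linalg.norm(x - p)
    return d, (x - p) / d

x = np.array([3.0, 4.0])                     # ||x||_2 = 5, so dist(x, C) = 4
d, g = dist_and_subgradient(x, project_unit_ball)
print(d, g)                                  # 4.0 [0.6 0.8]

# Check the subgradient inequality dist(y, C) >= dist(x, C) + g^T (y - x) at a test point:
y = np.array([0.0, 2.0])
dist_y = max(np.linalg.norm(y) - 1.0, 0.0)   # dist(y, C) for the unit ball
print(dist_y >= d + g @ (y - x))             # True
```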
Example: distance to a convex set

We will only show one direction, i.e., that
$$\frac{x - P_C(x)}{\|x - P_C(x)\|_2} \in \partial\, \mathrm{dist}(x, C).$$
Example: distance to a convex set

Write u = P_C(x) and let H = {y : (x − u)^T (y − u) ≤ 0}; by the first-order optimality condition for the projection, C ⊆ H, and hence dist(y, C) ≥ dist(y, H) for all y.

Now for y ∉ H, we have
$$(x - u)^T (y - u) = \|x - u\|_2 \|y - u\|_2 \cos\theta,$$
where θ is the angle between x − u and y − u. Thus
$$\mathrm{dist}(y, C) \ge \mathrm{dist}(y, H) = \|y - u\|_2 \cos\theta = \frac{(x - u)^T (y - u)}{\|x - u\|_2},$$
and this lower bound holds trivially for y ∈ H as well, since there the right-hand side is nonpositive. Expanding (x − u)^T (y − u) = (x − u)^T (y − x) + ‖x − u‖_2^2 and using dist(x, C) = ‖x − u‖_2 gives, for all y,
$$\mathrm{dist}(y, C) \ge \mathrm{dist}(x, C) + \left( \frac{x - u}{\|x - u\|_2} \right)^T (y - x),$$
which is exactly the subgradient inequality.
References and further reading

• S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010–2011.
• R. T. Rockafellar (1970), Convex Analysis, Chapters 23–25.
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011–2012.