
Comput Optim Appl (2013) 55:75–111

DOI 10.1007/s10589-012-9515-6

Combining Lagrangian decomposition and excessive gap smoothing technique for solving large-scale separable convex optimization problems

Quoc Tran Dinh · Carlo Savorgnan · Moritz Diehl

Received: 22 June 2011 / Published online: 18 November 2012

© Springer Science+Business Media New York 2012

Abstract A new algorithm for solving large-scale convex optimization problems with a separable objective function is proposed. The basic idea is to combine three techniques: Lagrangian dual decomposition, excessive gap and smoothing. The main advantage of this algorithm is that it automatically and simultaneously updates the smoothness parameters, which significantly improves its performance. The convergence of the algorithm is proved under weak conditions imposed on the original problem. The rate of convergence is $O(1/k)$, where $k$ is the iteration counter. In the second part of the paper, the proposed algorithm is coupled with a dual scheme to construct a switching variant in a dual decomposition framework. We discuss implementation issues and make a theoretical comparison. Numerical examples confirm the theoretical results.

Keywords Excessive gap · Smoothing technique · Lagrangian decomposition · Proximal mappings · Large-scale problem · Separable convex optimization · Distributed optimization

Q. Tran Dinh (✉) · C. Savorgnan · M. Diehl
Department of Electrical Engineering (ESAT-SCD) and Optimization in Engineering Center (OPTEC), KU Leuven, Kasteelpark Arenberg 10, 3001 Heverlee-Leuven, Belgium


1 Introduction

Large-scale convex optimization problems appear in many areas of science such as graph theory, networks, transportation, distributed model predictive control, distributed estimation and multistage stochastic optimization [16,20,30,35–37,39]. Solving large-scale optimization problems is still a challenge in many applications [4]. Over the years, thanks to the development of parallel and distributed computer systems, the chances for solving large-scale problems have been increased. However, methods and algorithms for solving this type of problems are limited [1,4].

Convex minimization problems with a separable objective function form a class of problems which is relevant in many applications. This class of problems is also known as separable convex minimization problems, see, e.g. [1]. Without loss of generality, a separable convex optimization problem can be written in the form of a convex program with a separable objective function and coupled linear constraints [1]. In addition, decoupled convex constraints may also be considered. Mathematically, this problem can be formulated in the following form:



$$\min_{x}\ \Big\{\phi(x) := \sum_{i=1}^{M}\phi_i(x_i)\Big\} \quad \text{s.t.}\quad \sum_{i=1}^{M} A_i x_i = b, \quad x_i \in X_i,\ i = 1, \dots, M, \tag{1}$$

where $\phi_i : \mathbb{R}^{n_i} \to \mathbb{R}$ is convex, $X_i \subseteq \mathbb{R}^{n_i}$ is a nonempty, closed convex set, $A_i \in \mathbb{R}^{m \times n_i}$, $b \in \mathbb{R}^m$ for all $i = 1, \dots, M$, and $n_1 + n_2 + \cdots + n_M = n$. The last constraint is called the coupling linear constraint.
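To make the separable structure concrete, here is a minimal numerical sketch of an instance of (1) with $M = 2$: quadratic $\phi_i$, box sets $X_i$, and one coupling constraint. All data and names are hypothetical stand-ins, not taken from the paper.

```python
import numpy as np

# Hypothetical two-block instance of problem (1): phi_i(x_i) = 0.5*||x_i - c_i||^2,
# X_i = [0, 1]^{n_i}, and the coupling constraint A_1 x_1 + A_2 x_2 = b.
rng = np.random.default_rng(0)
n1, n2, m = 4, 3, 2
A1, A2 = rng.standard_normal((m, n1)), rng.standard_normal((m, n2))
c1, c2 = rng.standard_normal(n1), rng.standard_normal(n2)
b = rng.standard_normal(m)

def phi(x1, x2):
    # Separable objective: each term depends on one block only.
    return 0.5 * np.sum((x1 - c1) ** 2) + 0.5 * np.sum((x2 - c2) ** 2)

def coupling_residual(x1, x2):
    # The only interaction between the blocks.
    return A1 @ x1 + A2 @ x2 - b

x1, x2 = np.clip(c1, 0, 1), np.clip(c2, 0, 1)  # feasible for X_1 x X_2 only
print(phi(x1, x2), np.linalg.norm(coupling_residual(x1, x2)))
```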

In the literature, several solution approaches have been proposed for solving problem (1). For example, (augmented) Lagrangian relaxation and subgradient methods of multipliers [1,10,29,36], Fenchel's dual decomposition [11], alternating direction methods [2,9,13,15], proximal point-type methods [3,33], splitting methods [7,8], interior point methods [18,32,39], mean value cross decomposition [17] and the partial inverse method [31] have been studied among many others. One of the classical approaches for solving (1) is Lagrangian dual decomposition. The main idea of this approach is to solve the dual problem by means of a subgradient method. It has been recognized in practice that subgradient methods are usually slow [25] and numerically sensitive to the choice of step sizes. In the special case of a strongly convex objective function, the dual function is differentiable. Consequently, gradient schemes can be applied to solve the dual problem.

Recently, Nesterov [25] developed smoothing techniques for solving nonsmooth convex optimization problems based on the fast gradient scheme which was introduced in his early work [24]. The fast gradient schemes have been used in numerous applications including image processing, compressed sensing, networks and system identification, see e.g. [9,12,28]. Exploiting Nesterov's idea in [26], Necoara and Suykens [22] applied the smoothing technique to the dual problem in the framework of Lagrangian dual decomposition and then used Nesterov's fast gradient scheme to maximize the smoothed function of the dual problem. This resulted in a new variant of dual decomposition algorithms for solving separable convex optimization problems. The authors proved that the rate of convergence of their algorithm is $O(1/k)$, which is much better than the $O(1/\sqrt{k})$ rate of the subgradient methods of multipliers [6,23], where $k$ is the iteration counter. A main disadvantage of this scheme is that the smoothness parameter requires to be given a priori. Moreover, this parameter crucially depends on a given desired accuracy. Since the Lipschitz constant of the gradient of the objective function in the dual problem is inversely proportional to the smoothness parameter, the algorithm usually generates short steps towards a solution of the dual problem although the rate of convergence is $O(1/k)$.

To overcome this drawback, in this paper we propose a new algorithm which combines three techniques: smoothing [26,27], excessive gap [27] and Lagrangian dual decomposition [1]. Although the convergence rate is still $O(1/k)$, the algorithms developed in this paper have some advantages compared to the one in [22]. First, instead of fixing the smoothness parameters, we update them dynamically at every iteration. Second, our algorithm is a primal-dual method which not only gives us a dual approximate solution but also a primal approximate solution of (1). Note that the computational cost of the proposed algorithms remains almost the same as in the proximal-center-based decomposition algorithm proposed in [22, Algorithm 3.2]. (Algorithm 3.2 in [22] requires one to compute an additional dual step.) This algorithm is called dual decomposition with two primal steps (Algorithm 1). Alternatively, we apply the switching strategy of [27] to obtain a decomposition algorithm with switching primal-dual steps for solving problem (1). This algorithm differs from the one in [27] at two points. First, the smoothness parameter is dynamically updated with an exact formula. Second, proximal-based mappings are used to handle the nonsmoothness of the objective function. The second point is more significant since, in practice, estimating the Lipschitz constants is not an easy task even if the objective function is differentiable. We notice that the proximal-based mappings proposed in this paper only play a role in handling the nonsmoothness of the objective function. Therefore, the algorithms developed in this paper do not belong to any proximal-point algorithm class considered in the literature. The approach presented in the present paper is different from the splitting methods and alternating methods considered in the literature, see, e.g. [2,7,13,15], in the sense that it solves the convex subproblems of each component simultaneously without transforming the original problem to any equivalent form. Moreover, all algorithms are first order methods which can be implemented in a highly parallel and distributed manner.

Contribution. The contribution of this paper is the following:

1. We apply the Lagrangian relaxation, smoothing and excessive gap techniques to large-scale separable convex optimization problems which are not necessarily smooth. Note that the excessive gap condition that we use in this paper is different from the one in [27]: not only the duality gap is measured but also the feasibility gap is used in the framework of constrained optimization, see Lemma 3.

2. We propose two algorithms for solving general separable convex optimization problems. The first algorithm is new, while the second one is a new variant of the first algorithm proposed by Nesterov in [27, Algorithm 1] applied to Lagrangian dual decomposition. Both algorithms allow us to obtain the primal and dual approximate solutions simultaneously. Moreover, all the algorithm parameters are updated automatically without any tuning procedure. As a special case of these algorithms, a new method for solving problem (1) with a strongly convex objective function is studied. All the algorithms are highly parallelizable and distributed.

3. The convergence of the algorithms is proved and the convergence rate is estimated. In the first two algorithms, this convergence rate is $O(1/k)$, which is much higher […]

[…] and estimates its worst-case complexity. Section 4 is a combination of the two primal steps and the two dual steps schemes, which we call the decomposition algorithm with switching primal-dual steps. Section 5 is an application of the two dual steps scheme (53) to solve problem (1) with a strongly convex objective function. We also discuss the implementation issues of the proposed algorithms and a theoretical comparison of Algorithms 1 and 2 in Sect. 6. Numerical examples are presented in Sect. 7 to examine the performance of the proposed algorithms and to compare different methods.

Notation. Throughout the paper, we shall consider the Euclidean space $\mathbb{R}^n$ endowed with an inner product $x^T y$ for $x, y \in \mathbb{R}^n$ and the norm $\|x\| := \sqrt{x^T x}$. The notation $x := (x_1, \dots, x_M)$ represents a column vector in $\mathbb{R}^n$, where $x_i$ is a subvector in $\mathbb{R}^{n_i}$, $i = 1, \dots, M$, and $n_1 + \cdots + n_M = n$.

2 Lagrangian dual decomposition and excessive gap smoothing technique

A classical technique to address coupling constraints in optimization is Lagrangian relaxation [1]. However, this technique often leads to a nonsmooth optimization problem in the dual form. To overcome this situation, we combine the Lagrangian dual decomposition and the smoothing technique in [26,27] to obtain a smooth approximate dual problem.

For simplicity of discussion, we consider problem (1) with $M = 2$. However, the methods presented in the next sections can be directly applied to the case $M > 2$ (see Sect. 6). Problem (1) with $M = 2$ can be rewritten as follows:

$$\min_{x}\ \big\{\phi(x) := \phi_1(x_1) + \phi_2(x_2)\big\} \quad \text{s.t.}\quad A_1 x_1 + A_2 x_2 = b, \quad x \in X_1 \times X_2 := X, \tag{2}$$


where $\phi_i$, $X_i$ and $A_i$ are defined as in (1) for $i = 1, 2$, and $b \in \mathbb{R}^m$. Problem (2) is said to satisfy the Slater constraint qualification condition if $\mathrm{ri}(X) \cap \{x = (x_1, x_2) \mid A_1 x_1 + A_2 x_2 = b\} \neq \emptyset$, where $\mathrm{ri}(X)$ is the relative interior of the convex set $X$. Let us denote by $X^*$ the solution set of this problem. We make the following assumption.

Assumption 1 The solution set $X^*$ is nonempty and either the Slater qualification condition for problem (2) holds or $X_i$ is polyhedral. The function $\phi_i$ is proper, lower semicontinuous and convex in $\mathbb{R}^{n_i}$, $i = 1, 2$.

Note that the objective function $\phi$ is not necessarily smooth. For example, $\phi(x) = \|x\|_1 = \sum_{i=1}^{n} |x^{(i)}|$, which is nonsmooth and separable.
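A one-line check of this separability (a sketch with arbitrary data):

```python
import numpy as np

# Minimal check that phi(x) = ||x||_1 is separable across blocks:
x1, x2 = np.array([1.0, -2.0]), np.array([0.5, -0.5, 3.0])
x = np.concatenate([x1, x2])
assert np.isclose(np.abs(x).sum(), np.abs(x1).sum() + np.abs(x2).sum())
```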

2.1 Decomposition via Lagrangian relaxation

Let us first define the Lagrange function of problem (2) as:

$$L(x, y) := \phi_1(x_1) + \phi_2(x_2) + y^T(A_1 x_1 + A_2 x_2 - b), \tag{3}$$

where $y \in \mathbb{R}^m$ is the multiplier associated with the coupling constraint $A_1 x_1 + A_2 x_2 = b$. Then, the dual problem of (2) can be written as:

$$d^* := \max_{y \in \mathbb{R}^m} d(y), \tag{4}$$

where

$$d(y) := \min_{x \in X} L(x, y) \tag{5}$$

is the dual function.

Let $A := [A_1, A_2]$. Due to Assumption 1, strong duality holds and we have:

$$d^* = \max_{y \in \mathbb{R}^m} d(y) = \min_{x \in X}\big\{\phi(x) : Ax = b\big\} = \phi^*. \tag{6}$$

Let us denote by $Y^*$ the solution set of the dual problem (4). It is well known that $Y^*$ is bounded due to Assumption 1.

Finally, we note that the dual function $d$ defined by (5) can be computed separately as:

$$d(y) = d_1(y) + d_2(y), \tag{7}$$

where

$$d_i(y) := \min_{x_i \in X_i}\Big\{\phi_i(x_i) + y^T\Big(A_i x_i - \frac{1}{2}b\Big)\Big\}, \quad i = 1, 2, \tag{8}$$

with minimizers $(x_1^*(y), x_2^*(y))$. The representation (7)–(8) is called a dual decomposition of the dual function $d$. It is obvious that, in general, the dual function $d$ is concave and nonsmooth.
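The practical consequence of (7)–(8) is that evaluating $d(y)$ splits into independent subproblems. A minimal sketch for the hypothetical quadratic/box instance introduced after (1), where each subproblem happens to have a closed-form solution (all data are stand-ins):

```python
import numpy as np

# Sketch: evaluate d(y) = d_1(y) + d_2(y) via (7)-(8); with
# phi_i(x_i) = 0.5*||x_i - c_i||^2 and X_i = [0,1]^{n_i}, each subproblem
# min phi_i(x_i) + y^T A_i x_i has the closed form clip(c_i - A_i^T y).
rng = np.random.default_rng(0)
n1, n2, m = 4, 3, 2
A = [rng.standard_normal((m, n1)), rng.standard_normal((m, n2))]
c = [rng.standard_normal(n1), rng.standard_normal(n2)]
b = rng.standard_normal(m)

def d_i(i, y):
    # i-th subproblem of (8); the two subproblems are independent and
    # can be solved in parallel.
    x = np.clip(c[i] - A[i].T @ y, 0.0, 1.0)
    return 0.5 * np.sum((x - c[i]) ** 2) + y @ (A[i] @ x - 0.5 * b), x

y = np.zeros(m)
(v1, x1), (v2, x2) = d_i(0, y), d_i(1, y)
print(v1 + v2)   # d(y), cf. (7)
```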

2.2 Smoothing via prox-functions

Let us recall the definition of a proximity function. A function $p_X$ is called a proximity function (prox-function) of a given nonempty, closed and convex set $X \subset \mathbb{R}^{n_x}$ if $p_X$ is continuous, strongly convex with a convexity parameter $\sigma_X > 0$ and $X \subseteq \mathrm{dom}(p_X)$. Let $x^c$ be the prox-center of $X$, which is defined as:

$$x^c = \arg\min_{x \in X} p_X(x). \tag{9}$$

Without loss of generality, we can assume that $p_X(x^c) = 0$; otherwise, we consider the function $\hat{p}_X(x) := p_X(x) - p_X(x^c)$. Let:

$$D_X := \max_{x \in X} p_X(x). \tag{10}$$

We make the following assumption.

Assumption 2 Each feasible set $X_i$ is endowed with a prox-function $p_i$ which has a convexity parameter $\sigma_i > 0$. Moreover, $0 \leq D_i := \max_{x_i \in X_i} p_i(x_i) < +\infty$ for $i = 1, 2$.

In particular, if $X_i$ is bounded then Assumption 2 is satisfied. Throughout the paper, we assume that Assumptions 1 and 2 are satisfied.

Now, we consider the following functions:

$$d_i(y; \beta_1) := \min_{x_i \in X_i}\Big\{\phi_i(x_i) + y^T\Big(A_i x_i - \frac{1}{2}b\Big) + \beta_1 p_i(x_i)\Big\}, \quad i = 1, 2, \tag{11}$$

and

$$d(y; \beta_1) := d_1(y; \beta_1) + d_2(y; \beta_1). \tag{12}$$

Here, $\beta_1 > 0$ is a given parameter called the smoothness parameter. We denote by

$$x^*(y; \beta_1) := \big(x_1^*(y; \beta_1),\, x_2^*(y; \beta_1)\big) \tag{13}$$

the solution of the strongly convex minimization problems in (11). Note that we can use different parameters $\beta_1^i$ for (11), $i = 1, 2$.

The following lemma shows the main properties of $d(\cdot; \beta_1)$, whose proof can be found, e.g., in [22,27].

Lemma 1 For any $\beta_1 > 0$, the function $d_i(\cdot; \beta_1)$ defined by (11) is well-defined, concave and continuously differentiable on $\mathbb{R}^m$. The gradient $\nabla_y d_i(y; \beta_1) = A_i x_i^*(y; \beta_1) - \frac{1}{2}b$ is Lipschitz continuous with a Lipschitz constant $L^{d_i}(\beta_1) = \frac{\|A_i\|^2}{\beta_1 \sigma_i}$ ($i = 1, 2$). Consequently, the function $d(\cdot; \beta_1)$ defined by (12) is concave and differentiable. Its gradient is given by $\nabla_y d(y; \beta_1) := A x^*(y; \beta_1) - b$, which is Lipschitz continuous with a Lipschitz constant $L^d(\beta_1) := \frac{1}{\beta_1}\sum_{i=1}^{2}\frac{\|A_i\|^2}{\sigma_i}$. Moreover, it holds that:

$$d(y; \beta_1) - \beta_1(D_1 + D_2) \leq d(y) \leq d(y; \beta_1), \tag{14}$$

and $d(y; \beta_1) \to d(y)$ as $\beta_1 \downarrow 0^+$ for any $y \in \mathbb{R}^m$.
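For intuition, the following sketch evaluates $\nabla_y d(y; \beta_1)$ on the same hypothetical quadratic/box instance, using $p_i(x_i) = \frac{1}{2}\|x_i - x_i^c\|^2$ (so $\sigma_i = 1$); the regularized subproblems (11) then have closed forms.

```python
import numpy as np

# Sketch: nabla_y d(y; beta1) = A x*(y; beta1) - b (Lemma 1) for the
# hypothetical quadratic/box instance with p_i(x_i) = 0.5*||x_i - x_i^c||^2.
rng = np.random.default_rng(0)
n1, n2, m = 4, 3, 2
A = [rng.standard_normal((m, n1)), rng.standard_normal((m, n2))]
c = [rng.standard_normal(n1), rng.standard_normal(n2)]
b = rng.standard_normal(m)
xc = [np.full(n1, 0.5), np.full(n2, 0.5)]  # prox-centers of the boxes

def x_star(i, y, beta1):
    # Closed form of the regularized subproblem (11) for this instance:
    # argmin_{x in [0,1]^n} 0.5||x-c_i||^2 + y^T A_i x + beta1*0.5||x-x_i^c||^2
    return np.clip((c[i] - A[i].T @ y + beta1 * xc[i]) / (1.0 + beta1), 0.0, 1.0)

def grad_smoothed_dual(y, beta1):
    xs = [x_star(i, y, beta1) for i in (0, 1)]  # block-wise, parallelizable
    return A[0] @ xs[0] + A[1] @ xs[1] - b

print(grad_smoothed_dual(np.zeros(m), beta1=0.1))
```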


Remark 1 Even without the boundedness of $X$, if the solution set $X^*$ of (2) is bounded then, in principle, we can bound the feasible set $X$ by a large compact set which contains all the sampling points generated by the algorithms (see Sect. 4 below). However, in the following algorithms we do not use $D_i$, $i = 1, 2$ (defined by (10)) in any computational step. They only appear in the theoretical complexity estimates.

Next, for a given $\beta_2 > 0$, we define a mapping $\psi(\cdot; \beta_2)$ from $X$ to $\mathbb{R}$ by:

$$\psi(x; \beta_2) := \max_{y \in \mathbb{R}^m}\Big\{(Ax - b)^T y - \frac{\beta_2}{2}\|y\|^2\Big\}. \tag{15}$$

This function can be considered as a smoothed version of $\psi(x) := \max_{y \in \mathbb{R}^m}\{(Ax - b)^T y\}$ via the prox-function $p(y) := \frac{1}{2}\|y\|^2$. It is easy to show that the unique solution of the maximization problem in (15) is given explicitly as $y^*(x; \beta_2) = \frac{1}{\beta_2}(Ax - b)$.

The next lemma summarizes the properties of $\psi(\cdot; \beta_2)$ and $f(\cdot; \beta_2)$.

Lemma 2 For any $\beta_2 > 0$, the function $\psi(\cdot; \beta_2)$ defined by (15) is a quadratic function of the form $\psi(x; \beta_2) = \frac{1}{2\beta_2}\|Ax - b\|^2$ on $X$. Its gradient vector is given by:

$$\nabla_x \psi(x; \beta_2) = \frac{1}{\beta_2}A^T(Ax - b) = A^T y^*(x; \beta_2). \tag{16}$$

[…] The function $f(\cdot; \beta_2)$ is defined by:

$$f(x; \beta_2) := \phi(x) + \psi(x; \beta_2). \tag{19}$$

Proof […]

$$\cdots \leq \frac{1}{\beta_2}\|A_1\|^2\|x_1 - \hat{x}_1\|^2 + \frac{1}{\beta_2}\|A_2\|^2\|x_2 - \hat{x}_2\|^2. \tag{20}$$

This inequality is indeed (18). The inequality (19) follows directly from (16). □
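Both $\psi(x; \beta_2)$ and its maximizer $y^*(x; \beta_2)$ are explicit, which the following sketch illustrates ($A$, $b$, $x$ and $\beta_2$ are arbitrary stand-ins):

```python
import numpy as np

# Sketch: psi(x; beta2) = ||Ax - b||^2/(2*beta2) with maximizer
# y*(x; beta2) = (Ax - b)/beta2, cf. (15)-(16).
rng = np.random.default_rng(0)
m, n = 2, 7
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
x, beta2 = rng.standard_normal(n), 0.5

r = A @ x - b
psi = r @ r / (2.0 * beta2)   # psi(x; beta2)
y_star = r / beta2            # unique maximizer in (15)
grad_psi = A.T @ y_star       # gradient (16)
print(psi, y_star, grad_psi)
```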

2.3 Excessive gap technique

Since the duality gap of the primal and dual problems (2)–(4) is measured by $g(x, y) := \phi(x) - d(y)$, if the gap $g$ is equal to zero for some feasible point $(x, y)$ then this point is an optimal solution of (2)–(4). In this section, we apply a technique called excessive gap, proposed by Nesterov in [27], to the Lagrangian dual decomposition framework. First, we recall the following definition.

Definition 1 We say that a point $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies the excessive gap condition with respect to two smoothness parameters $\beta_1 > 0$ and $\beta_2 > 0$ if:

$$f(\bar{x}; \beta_2) \leq d(\bar{y}; \beta_1), \tag{21}$$

where $f(\cdot; \beta_2)$ and $d(\cdot; \beta_1)$ are defined by (19) and (12), respectively.

The following lemma provides an upper bound estimate for the duality gap and the feasibility gap of problem (2).

Lemma 3 Suppose that $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies the excessive gap condition (21). Then for any $y^* \in Y^*$, we have:

$$[\ldots]$$


3 New decomposition algorithm

In this section, we derive an iterative decomposition algorithm for solving (2) based on the excessive gap technique. This method is called a decomposition algorithm with two primal steps. The aim is to generate a point $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ at each iteration such that this point maintains the excessive gap condition (21) while the algorithm drives the parameters $\beta_1$ and $\beta_2$ to zero.

3.1 Finding a starting point

As assumed earlier, the function $\phi_i$ is convex but not necessarily differentiable. Therefore, we cannot use the gradient information of these functions. We consider the following mappings ($i = 1, 2$):

$$P_i(\cdot; \beta_2) := [\ldots] \tag{24}$$

Since the constant $L^{\psi}_i(\beta_2)$ defined in Lemma 2 is positive, $P_i(\cdot; \beta_2)$ is well-defined. This mapping is called a proximal operator [3]. Let $P(\cdot; \beta_2) := (P_1(\cdot; \beta_2), P_2(\cdot; \beta_2))$.

First, we show in the following lemma that there exists a point $(\bar{x}, \bar{y})$ satisfying the excessive gap condition (21). The proof of this lemma can be found in the Appendix.

Lemma 4 Suppose that $x^c$ is the prox-center of $X$. For a given $\beta_2 > 0$, let:

$$[\ldots] \tag{25}$$

then $(\bar{x}, \bar{y})$ satisfies the excessive gap condition (21).
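Since the explicit formulas (25) are not legible in this copy, the following sketch only shows how one would verify the excessive gap condition (21) numerically for a candidate pair on the hypothetical quadratic/box instance; the candidate $(\bar{x}, \bar{y})$ below is an assumed illustration, not the paper's starting point.

```python
import numpy as np

# Sketch: evaluate both sides of (21), f(x; beta2) <= d(y; beta1), for the
# toy instance (quadratic phi_i, box X_i, p_i = 0.5*||. - x_i^c||^2).
# The candidate pair (x_bar, y_bar) is an ASSUMED stand-in for (25).
rng = np.random.default_rng(0)
n1, n2, m = 4, 3, 2
A = [rng.standard_normal((m, n1)), rng.standard_normal((m, n2))]
c = [rng.standard_normal(n1), rng.standard_normal(n2)]
b = rng.standard_normal(m)
xc = [np.full(n1, 0.5), np.full(n2, 0.5)]

def f_smoothed(x, beta2):
    # f(x; beta2) = phi(x) + ||Ax - b||^2/(2*beta2), cf. (19) and Lemma 2
    r = A[0] @ x[0] + A[1] @ x[1] - b
    return sum(0.5 * np.sum((x[i] - c[i]) ** 2) for i in (0, 1)) + r @ r / (2 * beta2)

def d_smoothed(y, beta1):
    # d(y; beta1) from (11)-(12), using the closed-form block solutions
    val = 0.0
    for i in (0, 1):
        x = np.clip((c[i] - A[i].T @ y + beta1 * xc[i]) / (1 + beta1), 0, 1)
        val += (0.5 * np.sum((x - c[i]) ** 2) + y @ (A[i] @ x - 0.5 * b)
                + beta1 * 0.5 * np.sum((x - xc[i]) ** 2))
    return val

beta1, beta2 = 2.0, 2.0
x_bar = xc                                          # assumed candidate
y_bar = (A[0] @ xc[0] + A[1] @ xc[1] - b) / beta2   # y*(x_bar; beta2)
print(f_smoothed(x_bar, beta2) <= d_smoothed(y_bar, beta1))
```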

3.2 Main iteration scheme

Suppose that $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies the excessive gap condition (21). We generate a new point $(\bar{x}^+, \bar{y}^+) \in X \times \mathbb{R}^m$ by applying the following update scheme:

$$[\ldots] \tag{27}$$

and update the smoothness parameters by:

$$\beta_1^+ := (1 - \tau)\beta_1, \qquad \beta_2^+ := (1 - \tau)\beta_2, \tag{28}$$

where $\tau \in (0, 1)$.

Remark 2 In the scheme (27), the points $x^*(\bar{y}; \beta_1) = (x_1^*(\bar{y}; \beta_1), x_2^*(\bar{y}; \beta_1))$, $\hat{x} = (\hat{x}_1, \hat{x}_2)$ and $\bar{x}^+ = (\bar{x}_1^+, \bar{x}_2^+)$ can be computed in parallel. To compute $x^*(\bar{y}; \beta_1)$ and $\bar{x}^+$ we need to solve two corresponding convex programs in $\mathbb{R}^{n_1}$ and $\mathbb{R}^{n_2}$, respectively.

The following theorem shows that the scheme (27)–(28) maintains the excessive gap condition (21).

Theorem 1 Suppose that $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies (21) with respect to two values $\beta_1 > 0$ and $\beta_2 > 0$. Then, if the parameter $\tau$ is chosen such that $\tau \in (0, 1)$ and:

$$[\ldots] \tag{29}$$

and $\beta_2^+ = (1 - \tau)\beta_2$, one has:

$$f\big(\bar{x}^+; \beta_2^+\big) \leq d\big(\bar{y}^+; \beta_1^+\big), \tag{30}$$

i.e. the new point $(\bar{x}^+, \bar{y}^+)$ generated by the scheme (27)–(28) maintains the excessive gap condition (21) with respect to $\beta_1^+$ and $\beta_2^+$.

Proof Since $\beta_2^+ = (1 - \tau)\beta_2$, we have:

$$\psi(\bar{x}; \beta_2) = \frac{1}{2\beta_2}\|A\bar{x} - b\|^2 = (1 - \tau)\frac{1}{2\beta_2^+}\|A\bar{x} - b\|^2 = (1 - \tau)\psi\big(\bar{x}; \beta_2^+\big). \tag{31}$$

Moreover, if we denote $x^{*1} := x^*(\bar{y}; \beta_1)$, then, by the strong convexity of $p_1$ and $p_2$, (31) and $f(\bar{x}; \beta_2) \leq d(\bar{y}; \beta_1)$, we have:

$$[\ldots] \tag{32}$$

Next, we use […] and $\nabla_x \psi(\hat{x}; \beta_2^+) = A^T \hat{y}$ to obtain:

$$[\cdot]_2 := \phi(x) + (Ax - b)^T \hat{y} = \phi(x) + \hat{y}^T A(x - \hat{x}) + (A\hat{x} - b)^T \hat{y} = \phi(x) + \psi\big(\hat{x}; \beta_2^+\big) + \nabla_x \psi\big(\hat{x}; \beta_2^+\big)^T(x - \hat{x}) + \psi\big(\hat{x}; \beta_2^+\big). \tag{33}$$

Substituting (32) and (33) into (30) and noting that $(1 - \tau)(\bar{x} - \hat{x}) + \tau(x - \hat{x}) = \tau(x - x^{*1})$ due to the first line of (27), we obtain:

$$[\ldots]$$

[…] $\frac{1}{2\beta_2^+}\|A(\bar{x} - \hat{x})\|^2 + \tau\psi(\hat{x}; \beta_2^+) - \tau(1 - \tau)\psi(\bar{x}; \beta_2^+)$ […]. Next, we note that the condition (29) is equivalent to:

$$[\ldots]$$

[…]

$$\phi\big(\bar{x}^+\big) + \psi\big(\bar{x}^+; \beta_2^+\big) = f\big(\bar{x}^+; \beta_2^+\big). \tag{37}$$

To complete the proof, we show that $T_1 \geq 0$. Indeed, let us define $\hat{u} := A\hat{x} - b$ and $\bar{u} := A\bar{x} - b$; then $\hat{u} - \bar{u} = A(\hat{x} - \bar{x})$. We have:

$$\begin{aligned}
T_1 &= \frac{1}{2\beta_2^+}\Big[\tau\|\hat{u}\|^2 - \tau(1 - \tau)\|\bar{u}\|^2 + (1 - \tau)\|\hat{u} - \bar{u}\|^2\Big] \\
&= \frac{1}{2\beta_2^+}\Big[\tau\|\hat{u}\|^2 - \tau(1 - \tau)\|\bar{u}\|^2 + (1 - \tau)\|\hat{u}\|^2 + (1 - \tau)\|\bar{u}\|^2 - 2(1 - \tau)\hat{u}^T\bar{u}\Big] \\
&= \frac{1}{2\beta_2^+}\Big[\|\hat{u}\|^2 + (1 - \tau)^2\|\bar{u}\|^2 - 2(1 - \tau)\hat{u}^T\bar{u}\Big] \\
&= \frac{1}{2\beta_2^+}\big\|\hat{u} - (1 - \tau)\bar{u}\big\|^2 \geq 0.
\end{aligned} \tag{38}$$

Substituting (38) into (37) we obtain the inequality $d(\bar{y}^+; \beta_1^+) \geq f(\bar{x}^+; \beta_2^+)$. □

Remark 3 If $\phi_i$ is convex and differentiable such that its gradient is Lipschitz continuous with a Lipschitz constant $L^{\phi_i} \geq 0$ for some $i = 1, 2$, then instead of using the proximal mapping $P_i(\cdot; \beta_2)$ in (27) we can use the following mapping:

$$G_i(\cdot; \beta_2) := [\ldots] \tag{39}$$

Indeed, let us prove the condition $d(\bar{y}^+; \beta_1^+) \geq f(\hat{\bar{x}}^+; \beta_2^+)$, where $G(x; \beta_2) := (G_1(x_1; \beta_2), G_2(x_2; \beta_2))$ and $\hat{\bar{x}}^+ := G(\hat{x}; \beta_2^+)$. First, by using the convexity of $\phi_i$ and the Lipschitz continuity of its gradient, we have:

$$\phi_i(\hat{x}_i) + \nabla\phi_i(\hat{x}_i)^T(u_i - \hat{x}_i) \leq \phi_i(u_i) \leq \phi_i(\hat{x}_i) + \nabla\phi_i(\hat{x}_i)^T(u_i - \hat{x}_i) + \frac{L^{\phi_i}}{2}\|u_i - \hat{x}_i\|^2.$$

[…]

If $X_i$ is polytopic then problem (39) becomes a convex quadratic program.

3.3 The step size update rule

In this step, we show how to update the parameter $\tau$ such that the conditions (26) and (29) hold. From the update rule (28) we have […]. This condition leads to $\tau \geq \frac{\tau^+}{1 - \tau^+}$. Since $\tau, \tau^+ \in (0, 1)$, the last inequality implies:

$$\tau^+ \leq \frac{\tau}{1 + \tau}.$$


Lemma 5 Suppose that $\tau_0$ is arbitrarily chosen in $(0, \frac{1}{2}]$. Then the sequence $\{\tau_k\}_{k \geq 0}$ generated by the rule $\tau_{k+1} := \frac{\tau_k}{\tau_k + 1}$ satisfies:

$$\tau_k = \frac{\tau_0}{\tau_0 k + 1} \quad \text{for all } k \geq 0, \tag{45}$$

[…]

Proof Let $t_k := \frac{1}{\tau_k}$ and consider the function $\xi(t) := t + 1$; then the sequence $\{t_k\}_{k \geq 0}$ generated by the rule $t_{k+1} := \xi(t_k) = t_k + 1$ satisfies $t_k = t_0 + k$ for all $k \geq 0$. Hence $\tau_k = \frac{1}{t_k} = \frac{1}{t_0 + k} = \frac{\tau_0}{\tau_0 k + 1}$ for $k \geq 0$. To prove (46), we observe that $\beta^{k+1} = \beta^0\prod_{i=0}^{k}(1 - \tau_i)$. Hence, by substituting (45) into the last equality and […]

Remark 4 Since $\tau_0 \in (0, 0.5]$, from Lemma 5 we see that with $\tau_0 := 0.5$ the right-hand side estimate of (46) is minimized. In this case, the update rule of $\tau_k$ is simplified to $\tau_k := \frac{1}{k + 2}$ for $k \geq 0$.
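A quick numerical check (sketch) of Lemma 5 and Remark 4: with $\tau_0 = 0.5$ the recursion $\tau_{k+1} = \tau_k/(1 + \tau_k)$ reproduces $\tau_k = 1/(k+2)$, and $\beta^{k+1} = \beta^0\prod_{i \leq k}(1 - \tau_i)$ telescopes to $\beta^0/(k+2)$ ($\beta^0 = 1$ is an arbitrary stand-in):

```python
# Numerical check of the step size rule: tau_0 = 0.5,
# tau_{k+1} = tau_k/(1 + tau_k)  ==>  tau_k = 1/(k+2)   (Lemma 5 / Remark 4),
# and beta^{k+1} = beta^0 * prod_{i<=k}(1 - tau_i) = beta^0/(k+2).
tau, beta = 0.5, 1.0
for k in range(6):
    assert abs(tau - 1.0 / (k + 2)) < 1e-12      # closed form (45)
    beta *= (1.0 - tau)                          # parameter update, cf. (28)
    assert abs(beta - 1.0 / (k + 2)) < 1e-12     # telescoping product
    tau /= (1.0 + tau)                           # step size recursion
print("checks passed")
```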

3.4 The algorithm and its worst case complexity

Now, we combine the results of Lemma 4, Theorem 1 and Lemma 5 in order to build the following algorithm.

Algorithm 1 (Decomposition algorithm with two primal steps)

Initialization: Perform the following steps:
1. […]
2. Compute $\bar{x}^0$ and $\bar{y}^0$ from (25) as: […]

Iteration: For $k = 0, 1, \dots$, perform the following steps:
1. If a given stopping criterion is satisfied then terminate.
2. Update the smoothness parameter $\beta^{k+1} := (1 - \tau_k)\beta^k$.
[…]
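The following Python sketch shows the overall structure of Algorithm 1 on the toy quadratic/box instance. Only the skeleton is grounded in the text above (the convex combination from the first line of (27), the update $\beta^{k+1} = (1-\tau_k)\beta^k$, and the step size rule of Lemma 5); the starting point, the dual averaging step and the proximal-step constants are assumed placeholders, since (25) and (27) are not fully legible here.

```python
import numpy as np

# Schematic sketch of Algorithm 1 on the hypothetical quadratic/box instance.
# Pieces marked ASSUMED are stand-ins, not the paper's exact formulas.
rng = np.random.default_rng(0)
n1, n2, m = 4, 3, 2
A = [rng.standard_normal((m, n1)), rng.standard_normal((m, n2))]
c = [rng.standard_normal(n1), rng.standard_normal(n2)]
b = rng.standard_normal(m)
xc = [np.full(n1, 0.5), np.full(n2, 0.5)]        # prox-centers (sigma_i = 1)

Lbar = max(np.linalg.norm(A[i], 2) ** 2 for i in (0, 1))  # ASSUMED beta^0 value
beta1 = beta2 = Lbar
tau = 0.5
x_bar = [xi.copy() for xi in xc]                  # ASSUMED starting primal point
y_bar = (A[0] @ x_bar[0] + A[1] @ x_bar[1] - b) / beta2   # y*(x_bar; beta2)

def x_star(y, b1):
    # Smoothed dual subproblems (11): closed form for quadratic phi_i + box X_i.
    return [np.clip((c[i] - A[i].T @ y + b1 * xc[i]) / (1 + b1), 0, 1) for i in (0, 1)]

for k in range(20):
    beta2_plus = (1 - tau) * beta2                # update rule (28)
    xs = x_star(y_bar, beta1)
    x_hat = [(1 - tau) * x_bar[i] + tau * xs[i] for i in (0, 1)]  # first line of (27)
    r = A[0] @ x_hat[0] + A[1] @ x_hat[1] - b
    y_hat = r / beta2_plus                        # y*(x_hat; beta2+), cf. (15)
    Lp = [2 * np.linalg.norm(A[i], 2) ** 2 / beta2_plus for i in (0, 1)]
    # ASSUMED proximal step P(x_hat; beta2+) (closed form for quadratic phi_i + box):
    x_bar = [np.clip((c[i] - A[i].T @ y_hat + Lp[i] * x_hat[i]) / (1 + Lp[i]), 0, 1)
             for i in (0, 1)]
    y_bar = (1 - tau) * y_bar + tau * y_hat       # ASSUMED dual averaging step
    beta1, beta2 = (1 - tau) * beta1, beta2_plus
    tau = tau / (1 + tau)                         # Lemma 5 step size rule

print(np.linalg.norm(A[0] @ x_bar[0] + A[1] @ x_bar[1] - b))  # feasibility gap
```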


Note that the dual solution set $Y^*$ is compact due to Assumption 1 and the conclusions of Lemma 3 hold for any $y^* \in Y^*$, so we can define the following quantities:

$$[\ldots] \tag{47}$$

The next theorem provides the worst-case complexity estimate for Algorithm 1.

Theorem 2 Let $\{(\bar{x}^k, \bar{y}^k)\}$ be a sequence generated by Algorithm 1. Then the following duality and feasibility gap estimates hold:

$$[\ldots] \tag{48}$$

$$[\ldots] \tag{49}$$

where $D_X$, $D_{Y^*}$ and $D_Y$ are defined in (47).

Proof By the choice of $\beta_1^0 = \beta_2^0 = \bar{L}$ and Step 1 in the initialization phase of Algorithm 1, we see that $\beta_1^k = \beta_2^k$ for all $k \geq 0$. Moreover, since $\tau_0 = 0.5$, by Lemma 5 we have $\beta_1^k = \beta_2^k = \frac{\beta^0}{\tau_0 k + 1} = \frac{\bar{L}}{0.5k + 1}$. Now, by applying Lemma 3 with $\beta_1$ and $\beta_2$ equal to $\beta_1^k$ and $\beta_2^k$ respectively, we obtain the estimates (48) and (49). □

Remark 5 The worst case complexity of Algorithm 1 is $O(\frac{1}{\varepsilon})$. However, the constants in the estimates (48) and (49) also depend on the choices of $\beta_1^0$ and $\beta_2^0$, which satisfy the condition (26). The values of $\beta_1^0$ and $\beta_2^0$ will affect the accuracy of the duality and feasibility gaps.

4 Switching decomposition algorithm

In this section, we apply a switching strategy to obtain a new variant of the first algorithm proposed in [27, Algorithm 1] for solving problem (2). This scheme alternately switches between a two primal steps scheme and a two dual steps scheme depending on the iteration counter $k$ being even or odd.

4.1 The gradient mapping of the smoothed dual function

Since the smoothed dual function $d(\cdot; \beta_1)$ is Lipschitz continuously differentiable on $\mathbb{R}^m$ (see Lemma 1), we define the following mapping:

$$G(\hat{y}; \beta_1) := \arg\max_{y \in \mathbb{R}^m}\Big\{\nabla_y d(\hat{y}; \beta_1)^T(y - \hat{y}) - \frac{L^d(\beta_1)}{2}\|y - \hat{y}\|^2\Big\} = \hat{y} + \frac{1}{L^d(\beta_1)}\nabla_y d(\hat{y}; \beta_1).$$
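On the toy instance, this gradient mapping is cheap to evaluate: $\nabla_y d(\hat{y}; \beta_1) = A x^*(\hat{y}; \beta_1) - b$ and $L^d(\beta_1)$ come from Lemma 1. A sketch (data are stand-ins, $\sigma_i = 1$):

```python
import numpy as np

# Sketch of the gradient mapping G(y_hat; beta1): an explicit gradient ascent
# step on the smoothed dual, with nabla d = A x*(y; beta1) - b and
# L^d(beta1) = (1/beta1) * sum_i ||A_i||^2 / sigma_i (Lemma 1, sigma_i = 1).
rng = np.random.default_rng(0)
n1, n2, m = 4, 3, 2
A = [rng.standard_normal((m, n1)), rng.standard_normal((m, n2))]
c = [rng.standard_normal(n1), rng.standard_normal(n2)]
b = rng.standard_normal(m)
xc = [np.full(n1, 0.5), np.full(n2, 0.5)]

def G(y_hat, beta1):
    xs = [np.clip((c[i] - A[i].T @ y_hat + beta1 * xc[i]) / (1 + beta1), 0, 1)
          for i in (0, 1)]                        # x*(y_hat; beta1), block-wise
    grad = A[0] @ xs[0] + A[1] @ xs[1] - b        # nabla d(y_hat; beta1)
    Ld = sum(np.linalg.norm(A[i], 2) ** 2 for i in (0, 1)) / beta1
    return y_hat + grad / Ld                      # maximizer of the quadratic model

print(G(np.zeros(m), beta1=0.5))
```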

First, we adapt the scheme (27)–(28) to the switching primal-dual framework. Suppose that the pair $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies the excessive gap condition (21). The two primal steps scheme computes $(\bar{x}^+, \bar{y}^+)$ as follows:

$$[\ldots] \quad (\mathcal{A}^p)$$

and then updates $\beta_1^+ := (1 - \tau)\beta_1$, where $\tau \in (0, 1)$ and $P(\cdot; \beta_2)$ is defined in (24). The difference between schemes $\mathcal{A}^p_m$ and $\mathcal{A}^p$ is that the parameter $\beta_2$ is fixed in $\mathcal{A}^p$.

Symmetrically, the two dual steps scheme computes $(\bar{x}^+, \bar{y}^+)$ as:

$$[\ldots] \quad (\mathcal{A}^d)$$


Lemma 6 Suppose that $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies (21) with respect to two values $\beta_1$ and $\beta_2$. Then if the parameter $\tau$ is chosen such that $\tau \in (0, 1)$ and:

$$[\ldots] \tag{54}$$

[…]

Remark 6 Given $\beta_1 > 0$, we can choose $\beta_2 > 0$ such that the condition (26) holds. We compute a point $(\bar{x}^0, \bar{y}^0)$ as:

$$[\ldots] \tag{55}$$

as a starting point for Algorithm 2 below.

In Algorithm 2 below we apply either the scheme $\mathcal{A}^p$ or $\mathcal{A}^d$ by using the following rule:

Rule A If the iteration counter $k$ is even then apply $\mathcal{A}^p$. Otherwise, $\mathcal{A}^d$ is used.
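Structurally, Rule A is just an alternation; a schematic sketch follows, where `primal_scheme` and `dual_scheme` are hypothetical stand-ins for $(\mathcal{A}^p)$ and $(\mathcal{A}^d)$ (whose exact formulas are only partially legible above), and the step size rule is assumed to follow Lemma 5.

```python
# Schematic alternation of Rule A; primal_scheme/dual_scheme are placeholders.
def switching_loop(x_bar, y_bar, beta1, beta2, tau, primal_scheme, dual_scheme, iters):
    for k in range(iters):
        if k % 2 == 0:
            # even k: two primal steps (A^p), which also decreases beta1
            x_bar, y_bar, beta1 = primal_scheme(x_bar, y_bar, beta1, beta2, tau)
        else:
            # odd k: two dual steps (A^d); beta2 decreased symmetrically (assumed)
            x_bar, y_bar, beta2 = dual_scheme(x_bar, y_bar, beta1, beta2, tau)
        tau = tau / (1.0 + tau)   # assumed step size rule, cf. Lemma 5
    return x_bar, y_bar, beta1, beta2

# Demo with trivial stand-ins that only shrink the smoothness parameters:
noop_p = lambda x, y, b1, b2, t: (x, y, (1 - t) * b1)
noop_d = lambda x, y, b1, b2, t: (x, y, (1 - t) * b2)
print(switching_loop(0.0, 0.0, 1.0, 1.0, 0.5, noop_p, noop_d, 4))
```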

Now, we provide an update rule to generate a sequence $\{\tau_k\}$ such that the condition (54) holds. Let $\bar{L}_2 := 2\max_{1 \leq i \leq 2}\big\{\frac{\|A_i\|^2}{\sigma_i}\big\}$. Suppose that at iteration $k$ the condition […] $(1 - \tau_k)\beta_1^k\beta_2^k$ […] holds. However, since the condition (56) holds, we have $(1 - \tau_k)\beta_1^k\beta_2^k \geq \tau_k^2\bar{L}_2$. Now, we suppose that the condition (54) is satisfied with $\beta^{k+1}$ […]

τ k2¯L2 Now, we suppose that the condition (54) is satisfied with β k+1

... 15

Excessive gap smoothing techniques in Lagrangian dual decomposition 89

Lemma Suppose that τ0 is arbitrarily chosen... data-page="17">

Excessive gap smoothing techniques in Lagrangian dual decomposition 91

satisfy the condition (26) The values of β10 and β20... sequence generated by Algorithm1 Then the

follow-ing duality and feasibility gap estimates hold:

and DX , D Yand DY are

Ngày đăng: 15/12/2017, 05:53

References
1. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods. Prentice Hall, New York (1989)
2. Boyd, S., Parikh, N., Chu, E., Peleato, B.: Distributed optimization and statistics via alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
3. Chen, G., Teboulle, M.: A proximal-based decomposition method for convex minimization problems. Math. Program. 64, 81–101 (1994)
4. Conejo, A.J., Mínguez, R., Castillo, E., García-Bertrand, R.: Decomposition Techniques in Mathematical Programming: Engineering and Science Applications. Springer, Berlin (2006)
5. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91, 201–213 (2002)
6. Duchi, J.C., Agarwal, A., Wainwright, M.J.: Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(3), 592–606 (2012)
7. Eckstein, J., Bertsekas, D.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)
8. Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Problems, vols. 1–2. Springer, Berlin (2003)
9. Goldfarb, D., Ma, S.: Fast multiple splitting algorithms for convex optimization. SIAM J. Optim.
10. Hamdi, A.: Two-level primal-dual proximal decomposition technique to solve large-scale optimization problems. Appl. Math. Comput. 160, 921–938 (2005)
11. Han, S.P., Lou, G.: A parallel algorithm for a class of convex programs. SIAM J. Control Optim. 26, 345–355 (1988)
12. Hariharan, L., Pucci, F.D.: Decentralized resource allocation in dynamic networks of agents. SIAM J. Optim. 19(2), 911–940 (2008)
13. He, B.S., Tao, M., Xu, M.H., Yuan, X.M.: Alternating directions based contraction method for generally separable linearly constrained convex programming problems. Optimization (2011). doi:10.1080/
14. He, B.S., Yang, H., Wang, S.L.: Alternating directions method with self-adaptive penalty parameters for monotone variational inequalities. J. Optim. Theory Appl. 106, 349–368 (2000)
15. He, B.S., Yuan, X.M.: On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM J. Numer. Anal. 50, 700–709 (2012)
16. Holmberg, K.: Experiments with primal-dual decomposition and subgradient methods for the uncapacitated facility location problem. Optimization 49(5–6), 495–516 (2001)
17. Holmberg, K., Kiwiel, K.C.: Mean value cross decomposition for nonlinear convex problems. Optim. Methods Softw. 21(3), 401–417 (2006)
18. Kojima, M., Megiddo, N., Mizuno, S., et al.: Horizontal and vertical decomposition in interior point methods for linear programs. Technical report, Information Sciences, Tokyo Institute of Technology, Tokyo (1993)
19. Lenoir, A., Mahey, P.: Accelerating convergence of a separable augmented Lagrangian algorithm. Technical report, LIMOS/RR-07-14, pp. 1–34 (2007)
20. Love, R.F., Kraemer, S.A.: A dual decomposition method for minimizing transportation costs in multifacility location problems. Transp. Sci. 7, 297–316 (1973)