
DOCUMENT INFORMATION

Title: Gradients and Optimization Methods
Authors: Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Type: Presentation
Year: 2001
Pages: 20
Size: 440.89 KB


Gradients and Optimization Methods

The main task in the independent component analysis (ICA) problem, formulated in Chapter 1, is to estimate a separating matrix W that will give us the independent components. It also became clear that W cannot generally be solved in closed form, that is, we cannot write it as some function of the sample or training set whose value could be directly evaluated. Instead, the solution method is based on cost functions, also called objective functions or contrast functions. Solutions W to ICA are found at the minima or maxima of these functions. Several possible ICA cost functions will be given and discussed in detail in Parts II and III of this book. In general, statistical estimation is largely based on optimization of cost or objective functions, as will be seen in Chapter 4.

Minimization of multivariate functions, possibly under some constraints on the solutions, is the subject of optimization theory. In this chapter, we discuss some typical iterative optimization algorithms and their properties. Mostly, the algorithms are based on the gradients of the cost functions. Therefore, vector and matrix gradients are reviewed first, followed by the most typical ways to solve unconstrained and constrained optimization problems with gradient-type learning algorithms.

3.1 VECTOR AND MATRIX GRADIENTS

3.1.1 Vector gradient

Consider a scalar-valued function $g$ of $m$ variables

$$g = g(w_1, \dots, w_m) = g(\mathbf{w})$$


Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja.
Copyright © 2001 John Wiley & Sons, Inc. ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)


where we have used the notation $\mathbf{w} = (w_1, \dots, w_m)^T$. By convention, we define $\mathbf{w}$ as a column vector. Assuming the function $g$ is differentiable, its vector gradient with respect to $\mathbf{w}$ is the $m$-dimensional column vector of partial derivatives

$$\frac{\partial g}{\partial \mathbf{w}} = \begin{pmatrix} \dfrac{\partial g}{\partial w_1} \\ \vdots \\ \dfrac{\partial g}{\partial w_m} \end{pmatrix} \qquad (3.1)$$

The notation $\frac{\partial g}{\partial \mathbf{w}}$ is just shorthand for the gradient; it should be understood that it does not imply any kind of division by a vector, which is not a well-defined concept. Another commonly used notation would be $\nabla g$ or $\nabla_{\mathbf{w}} g$.

In some iteration methods, we also have reason to use second-order gradients. We define the second-order gradient of a function $g$ with respect to $\mathbf{w}$ as

$$\frac{\partial^2 g}{\partial \mathbf{w}^2} = \begin{pmatrix} \dfrac{\partial^2 g}{\partial w_1^2} & \cdots & \dfrac{\partial^2 g}{\partial w_1 \partial w_m} \\ \vdots & & \vdots \\ \dfrac{\partial^2 g}{\partial w_m \partial w_1} & \cdots & \dfrac{\partial^2 g}{\partial w_m^2} \end{pmatrix} \qquad (3.2)$$

This is an $m \times m$ matrix whose elements are second-order partial derivatives. It is called the Hessian matrix of the function $g(\mathbf{w})$. It is easy to see that it is always symmetric.
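The definitions above can be checked numerically with finite differences. The sketch below is a minimal illustration in NumPy; the test function $g$ and the step sizes are assumptions chosen for the demonstration, and the symmetry of the resulting Hessian matrix can be verified directly.

```python
import numpy as np

# Finite-difference approximations of the vector gradient (3.1) and the
# Hessian (3.2), for an illustrative function g(w) = w1^2 * w2 + sin(w3).
def g(w):
    return w[0] ** 2 * w[1] + np.sin(w[2])

def num_gradient(f, w, h=1e-6):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        grad[i] = (f(w + e) - f(w - e)) / (2 * h)   # central difference
    return grad

def num_hessian(f, w, h=1e-4):
    m = len(w)
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei = np.zeros(m); ei[i] = h
            ej = np.zeros(m); ej[j] = h
            # second-order central difference for d^2 g / (dw_i dw_j)
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * h ** 2)
    return H

w = np.array([1.0, 2.0, 0.5])
grad = num_gradient(g, w)      # analytic gradient: [2*w1*w2, w1^2, cos(w3)]
H = num_hessian(g, w)
symmetric = np.allclose(H, H.T, atol=1e-3)   # the Hessian is symmetric
```

Running this confirms, up to discretization error, both the analytic gradient and the symmetry claimed for the Hessian.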

These concepts generalize to vector-valued functions; this means an $n$-element vector

$$\mathbf{g}(\mathbf{w}) = \begin{pmatrix} g_1(\mathbf{w}) \\ \vdots \\ g_n(\mathbf{w}) \end{pmatrix} \qquad (3.3)$$

whose elements $g_i(\mathbf{w})$ are themselves functions of $\mathbf{w}$. The Jacobian matrix of $\mathbf{g}$ with respect to $\mathbf{w}$ is

$$\frac{\partial \mathbf{g}}{\partial \mathbf{w}} = \begin{pmatrix} \dfrac{\partial g_1}{\partial w_1} & \cdots & \dfrac{\partial g_n}{\partial w_1} \\ \vdots & & \vdots \\ \dfrac{\partial g_1}{\partial w_m} & \cdots & \dfrac{\partial g_n}{\partial w_m} \end{pmatrix} \qquad (3.4)$$

Thus the $i$th column of the Jacobian matrix is the gradient vector of $g_i(\mathbf{w})$ with respect to $\mathbf{w}$. The Jacobian matrix is sometimes denoted by $\mathbf{J}\mathbf{g}$.

For computing the gradients of products and quotients of functions, as well as of composite functions, the same rules apply as for ordinary functions of one variable.


Thus

$$\frac{\partial [f(\mathbf{w}) g(\mathbf{w})]}{\partial \mathbf{w}} = \frac{\partial f(\mathbf{w})}{\partial \mathbf{w}} g(\mathbf{w}) + f(\mathbf{w}) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}} \qquad (3.5)$$

$$\frac{\partial [f(\mathbf{w})/g(\mathbf{w})]}{\partial \mathbf{w}} = \left[ \frac{\partial f(\mathbf{w})}{\partial \mathbf{w}} g(\mathbf{w}) - f(\mathbf{w}) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}} \right] \Big/ \, g^2(\mathbf{w}) \qquad (3.6)$$

$$\frac{\partial f(g(\mathbf{w}))}{\partial \mathbf{w}} = f'(g(\mathbf{w})) \, \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}} \qquad (3.7)$$

The gradient of the composite function $f(g(\mathbf{w}))$ can be generalized to any number of nested functions, giving the same chain rule of differentiation that is valid for functions of one variable.
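The product rule (3.5) and chain rule (3.7) can be verified numerically. A minimal sketch follows; the particular choices $f(\mathbf{w}) = \mathbf{a}^T\mathbf{w}$, $g(\mathbf{w}) = \|\mathbf{w}\|^2$, and the outer function $\exp$ are assumptions made for the demonstration.

```python
import numpy as np

# Numerical check of the gradient product rule (3.5) and chain rule (3.7)
# using illustrative functions f(w) = a^T w and g(w) = ||w||^2.
a = np.array([1.0, -2.0, 3.0])
f = lambda w: a @ w
g = lambda w: w @ w

def num_grad(fun, w, h=1e-6):
    return np.array([(fun(w + h * e) - fun(w - h * e)) / (2 * h)
                     for e in np.eye(len(w))])

w = np.array([0.5, 1.0, -1.0])

# product rule: grad(f*g) = grad(f) g + f grad(g), with grad(f) = a,
# grad(g) = 2w
lhs = num_grad(lambda v: f(v) * g(v), w)
rhs = a * g(w) + f(w) * (2 * w)
product_rule_ok = np.allclose(lhs, rhs, atol=1e-4)

# chain rule with the outer scalar function exp:
# grad(exp(g(w))) = exp(g(w)) * grad(g)
lhs2 = num_grad(lambda v: np.exp(g(v)), w)
rhs2 = np.exp(g(w)) * (2 * w)
chain_rule_ok = np.allclose(lhs2, rhs2, atol=1e-3)
```

Both checks agree with the finite-difference gradients up to discretization error.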

3.1.2 Matrix gradient

In many of the algorithms encountered in this book, we have to consider scalar-valued functions $g$ of the elements of an $m \times n$ matrix $\mathbf{W} = (w_{ij})$:

$$g = g(\mathbf{W}) = g(w_{11}, \dots, w_{ij}, \dots, w_{mn}) \qquad (3.8)$$

A typical function of this kind is the determinant of $\mathbf{W}$.

Of course, any matrix can be trivially represented as a vector by scanning the elements row by row into a vector and reindexing. Thus, when considering the gradient of $g$ with respect to the matrix elements, it would suffice to use the notion of vector gradient reviewed earlier. However, using the separate concept of matrix gradient gives some advantages in terms of simplified notation and sometimes intuitively appealing results.

In analogy with the vector gradient, the matrix gradient means a matrix of the same size $m \times n$ as the matrix $\mathbf{W}$, whose $ij$th element is the partial derivative of $g$ with respect to $w_{ij}$. Formally, we can write

$$\frac{\partial g}{\partial \mathbf{W}} = \begin{pmatrix} \dfrac{\partial g}{\partial w_{11}} & \cdots & \dfrac{\partial g}{\partial w_{1n}} \\ \vdots & & \vdots \\ \dfrac{\partial g}{\partial w_{m1}} & \cdots & \dfrac{\partial g}{\partial w_{mn}} \end{pmatrix} \qquad (3.9)$$

Again, the notation $\frac{\partial g}{\partial \mathbf{W}}$ is just shorthand for the matrix gradient.

Let us look next at some examples of vector and matrix gradients. The formulas presented in these examples will be frequently needed later in this book.

3.1.3 Examples of gradients

Example 3.1 Consider the simple linear functional of $\mathbf{w}$, or inner product,

$$g(\mathbf{w}) = \sum_{i=1}^m a_i w_i = \mathbf{a}^T \mathbf{w}$$

where $\mathbf{a} = (a_1, \dots, a_m)^T$ is a constant vector. The gradient is, according to (3.1),

$$\frac{\partial g}{\partial \mathbf{w}} = \begin{pmatrix} a_1 \\ \vdots \\ a_m \end{pmatrix}$$

which is the vector $\mathbf{a}$. We can write

$$\frac{\partial \, \mathbf{a}^T \mathbf{w}}{\partial \mathbf{w}} = \mathbf{a}$$

Because the gradient is constant (independent of $\mathbf{w}$), the Hessian matrix of $g(\mathbf{w}) = \mathbf{a}^T \mathbf{w}$ is zero.
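A quick numerical check of Example 3.1 can be sketched as follows; the vector $\mathbf{a}$ and the evaluation points are illustrative assumptions.

```python
import numpy as np

# Check of Example 3.1: the gradient of g(w) = a^T w is the constant
# vector a, so the Hessian vanishes.
a = np.array([2.0, -1.0, 0.5])

def num_grad(fun, w, h=1e-6):
    return np.array([(fun(w + h * e) - fun(w - h * e)) / (2 * h)
                     for e in np.eye(len(w))])

w = np.array([1.0, 2.0, 3.0])
grad = num_grad(lambda v: a @ v, w)
grad_is_a = np.allclose(grad, a, atol=1e-6)

# the gradient is the same at any other point, so the Hessian is zero
grad2 = num_grad(lambda v: a @ v, w + 10.0)
hessian_zero = np.allclose(grad, grad2, atol=1e-6)
```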

Example 3.2 Next consider the quadratic form

$$g(\mathbf{w}) = \mathbf{w}^T \mathbf{A} \mathbf{w} = \sum_{i=1}^m \sum_{j=1}^m w_i w_j a_{ij} \qquad (3.11)$$

where $\mathbf{A} = (a_{ij})$ is a square $m \times m$ matrix. We have

$$\frac{\partial g}{\partial \mathbf{w}} = \begin{pmatrix} \sum_{j=1}^m w_j a_{1j} + \sum_{i=1}^m w_i a_{i1} \\ \vdots \\ \sum_{j=1}^m w_j a_{mj} + \sum_{i=1}^m w_i a_{im} \end{pmatrix}$$

which is equal to the vector $\mathbf{A}\mathbf{w} + \mathbf{A}^T \mathbf{w}$. So,

$$\frac{\partial \, \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} = \mathbf{A}\mathbf{w} + \mathbf{A}^T \mathbf{w} \qquad (3.12)$$

For symmetric $\mathbf{A}$, this becomes $2\mathbf{A}\mathbf{w}$.

The second-order gradient or Hessian becomes

$$\frac{\partial^2 \, \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}^2} = \begin{pmatrix} 2a_{11} & \cdots & a_{1m} + a_{m1} \\ \vdots & & \vdots \\ a_{m1} + a_{1m} & \cdots & 2a_{mm} \end{pmatrix} \qquad (3.13)$$

which is equal to the matrix $\mathbf{A} + \mathbf{A}^T$. If $\mathbf{A}$ is symmetric, then the Hessian of $\mathbf{w}^T \mathbf{A} \mathbf{w}$ is equal to $2\mathbf{A}$.
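The two formulas of Example 3.2 can be confirmed numerically; the random matrix and evaluation point below are illustrative assumptions.

```python
import numpy as np

# Check of Example 3.2: for g(w) = w^T A w the gradient is Aw + A^T w
# and the Hessian is A + A^T.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
w = rng.normal(size=4)

def num_grad(fun, w, h=1e-6):
    return np.array([(fun(w + h * e) - fun(w - h * e)) / (2 * h)
                     for e in np.eye(len(w))])

grad = num_grad(lambda v: v @ A @ v, w)
grad_ok = np.allclose(grad, A @ w + A.T @ w, atol=1e-6)

# Hessian obtained by differentiating each component of the (linear)
# gradient vector Aw + A^T w
H = np.array([num_grad(lambda v: (A @ v + A.T @ v)[i], w) for i in range(4)])
hessian_ok = np.allclose(H, A + A.T, atol=1e-6)
```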

Example 3.3 For the quadratic form (3.11), we might just as well take the gradient with respect to $\mathbf{A}$, assuming now that $\mathbf{w}$ is a constant vector. Then

$$\frac{\partial \, \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial a_{ij}} = w_i w_j$$

Compiling this into matrix form, we notice that the matrix gradient is the $m \times m$ matrix $\mathbf{w}\mathbf{w}^T$.
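The matrix-gradient result of Example 3.3 can be sketched numerically by perturbing one matrix element at a time; the vector $\mathbf{w}$ and the base matrix $\mathbf{A}$ are illustrative assumptions.

```python
import numpy as np

# Check of Example 3.3: holding w fixed, the matrix gradient of
# g(A) = w^T A w with respect to A is the outer product w w^T.
w = np.array([1.0, -2.0, 0.5])
A = np.arange(9.0).reshape(3, 3)

def g(M):
    return w @ M @ w

h = 1e-6
m = len(w)
grad = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        E = np.zeros((m, m))
        E[i, j] = h
        grad[i, j] = (g(A + E) - g(A - E)) / (2 * h)   # central difference

matrix_gradient_ok = np.allclose(grad, np.outer(w, w), atol=1e-6)
```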

Example 3.4 In some ICA models, we must compute the matrix gradient of the determinant of a matrix. The determinant is a scalar function of the matrix elements, consisting of multiplications and summations, and therefore its partial derivatives are relatively simple to compute. Let us prove the following: If $\mathbf{W}$ is an invertible square $m \times m$ matrix whose determinant is denoted $\det \mathbf{W}$, then

$$\frac{\partial}{\partial \mathbf{W}} \det \mathbf{W} = (\det \mathbf{W}) (\mathbf{W}^T)^{-1} \qquad (3.14)$$

This is a good example for showing that a compact formula is obtained using the matrix gradient; if $\mathbf{W}$ were stacked into a long vector, and only the vector gradient were used, this result could not be expressed so simply.

Instead of starting from scratch, we employ a well-known result from matrix algebra (see, e.g., [159]), stating that the inverse of a matrix $\mathbf{W}$ is obtained as

$$\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}} \, \mathrm{adj}(\mathbf{W}) \qquad (3.15)$$

with $\mathrm{adj}(\mathbf{W})$ the so-called adjoint of $\mathbf{W}$. The adjoint is the matrix

$$\mathrm{adj}(\mathbf{W}) = \begin{pmatrix} W_{11} & \cdots & W_{n1} \\ \vdots & & \vdots \\ W_{1n} & \cdots & W_{nn} \end{pmatrix} \qquad (3.16)$$

where the scalar numbers $W_{ij}$ are the so-called cofactors. The cofactor $W_{ij}$ is obtained by first taking the $(n-1) \times (n-1)$ submatrix of $\mathbf{W}$ that remains when the $i$th row and $j$th column are removed, then computing the determinant of this submatrix, and finally multiplying by $(-1)^{i+j}$.

The determinant $\det \mathbf{W}$ can also be expressed in terms of the cofactors:

$$\det \mathbf{W} = \sum_{k=1}^n w_{ik} W_{ik} \qquad (3.17)$$

Row $i$ can be any row, and the result is always the same. In the cofactors $W_{ik}$, none of the matrix elements of the $i$th row appear, so the determinant is a linear function of these elements. Taking now a partial derivative of (3.17) with respect to one of the elements, say $w_{ij}$, gives

$$\frac{\partial \det \mathbf{W}}{\partial w_{ij}} = W_{ij}$$

By definitions (3.9) and (3.16), this implies directly that

$$\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = \mathrm{adj}(\mathbf{W})^T$$

But $\mathrm{adj}(\mathbf{W})^T$ is equal to $(\det \mathbf{W})(\mathbf{W}^T)^{-1}$ by (3.15), so we have shown our result (3.14).

This also implies that

$$\frac{\partial \log |\det \mathbf{W}|}{\partial \mathbf{W}} = \frac{1}{|\det \mathbf{W}|} \frac{\partial |\det \mathbf{W}|}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \qquad (3.18)$$

see (3.15). This is an example of the matrix gradient of a composite function consisting of the $\log$, absolute value, and $\det$ functions. This result will be needed when the ICA problem is solved by maximum likelihood estimation in Chapter 9.
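The compact result (3.18) is easy to check element by element with finite differences; the particular well-conditioned matrix below is an illustrative assumption.

```python
import numpy as np

# Check of (3.18): the matrix gradient of log|det W| equals
# (W^T)^{-1} = (W^{-1})^T.
W = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])   # det W = 13, safely invertible

h = 1e-6
grad = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = h
        # central difference of log|det W| in the (i, j) element
        grad[i, j] = (np.log(abs(np.linalg.det(W + E)))
                      - np.log(abs(np.linalg.det(W - E)))) / (2 * h)

logdet_gradient_ok = np.allclose(grad, np.linalg.inv(W).T, atol=1e-6)
```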


3.1.4 Taylor series expansions of multivariate functions

In deriving some of the gradient-type learning algorithms, we have to resort to Taylor series expansions of multivariate functions. In analogy with the well-known Taylor series expansion of a function $g(w)$ of a scalar variable $w$,

$$g(w') = g(w) + \frac{dg}{dw}(w' - w) + \frac{1}{2} \frac{d^2 g}{dw^2} (w' - w)^2 + \cdots \qquad (3.19)$$

we can do a similar expansion for a function $g(\mathbf{w}) = g(w_1, \dots, w_m)$ of $m$ variables. We have

$$g(\mathbf{w}') = g(\mathbf{w}) + \left( \frac{\partial g}{\partial \mathbf{w}} \right)^T (\mathbf{w}' - \mathbf{w}) + \frac{1}{2} (\mathbf{w}' - \mathbf{w})^T \frac{\partial^2 g}{\partial \mathbf{w}^2} (\mathbf{w}' - \mathbf{w}) + \cdots \qquad (3.20)$$

where the derivatives are evaluated at the point $\mathbf{w}$. The second term is the inner product of the gradient vector with the vector $\mathbf{w}' - \mathbf{w}$, and the third term is a quadratic form with the symmetric Hessian matrix $\frac{\partial^2 g}{\partial \mathbf{w}^2}$. The truncation error depends on the distance $\|\mathbf{w}' - \mathbf{w}\|$; the distance has to be small if $g(\mathbf{w}')$ is approximated using only the first- and second-order terms.

The same expansion can be made for a scalar function of a matrix variable. The second-order term already becomes complicated, because the second-order gradient is a four-dimensional tensor. But we can easily extend the first-order term in (3.20), the inner product of the gradient with the vector $\mathbf{w}' - \mathbf{w}$, to the matrix case. Remember that the vector inner product is defined as

$$\left( \frac{\partial g}{\partial \mathbf{w}} \right)^T (\mathbf{w}' - \mathbf{w}) = \sum_{i=1}^m \left( \frac{\partial g}{\partial \mathbf{w}} \right)_i (w'_i - w_i)$$

For the matrix case, this must become the sum $\sum_{i=1}^m \sum_{j=1}^n \left( \frac{\partial g}{\partial \mathbf{W}} \right)_{ij} (w'_{ij} - w_{ij})$. This is the sum of the products of corresponding elements, just like in the vectorial inner product. This can be nicely presented in matrix form when we remember that for any two matrices, say $\mathbf{A}$ and $\mathbf{B}$,

$$\mathrm{trace}(\mathbf{A}^T \mathbf{B}) = \sum_{i=1}^m (\mathbf{A}^T \mathbf{B})_{ii} = \sum_{i=1}^m \sum_{j=1}^m (\mathbf{A})_{ij} (\mathbf{B})_{ij}$$

with obvious notation. So, we have

$$g(\mathbf{W}') = g(\mathbf{W}) + \mathrm{trace}\left[ \left( \frac{\partial g}{\partial \mathbf{W}} \right)^T (\mathbf{W}' - \mathbf{W}) \right] + \cdots \qquad (3.21)$$

for the first two terms in the Taylor series of a function $g$ of a matrix variable.
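The first-order matrix Taylor expansion (3.21) can be tested on the determinant, whose matrix gradient was derived in Example 3.4. The perturbation size in this sketch is an assumption; the approximation error should scale with the squared norm of the perturbation.

```python
import numpy as np

# Check of (3.21) with g = det: for a small perturbation D = W' - W,
# det(W + D) ≈ det(W) + trace[(dg/dW)^T D], where the matrix gradient
# is det(W) (W^T)^{-1} by (3.14).
W = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])
grad = np.linalg.det(W) * np.linalg.inv(W).T

rng = np.random.default_rng(2)
D = 1e-5 * rng.normal(size=(3, 3))          # small perturbation W' - W
lhs = np.linalg.det(W + D)
rhs = np.linalg.det(W) + np.trace(grad.T @ D)
first_order_ok = abs(lhs - rhs) < 1e-6      # error is O(||D||^2)
```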


3.2 LEARNING RULES FOR UNCONSTRAINED OPTIMIZATION

3.2.1 Gradient descent

Many of the ICA criteria have the basic form of minimizing a cost function $J(\mathbf{W})$ with respect to a parameter matrix $\mathbf{W}$, or possibly with respect to one of its columns $\mathbf{w}$. In many cases, there are also constraints that restrict the set of possible solutions. A typical constraint is to require that the solution vector must have a bounded norm, or that the solution matrix has orthonormal columns.

For the unconstrained problem of minimizing a multivariate function, the most classic approach is steepest descent or gradient descent. Let us consider in more detail the case when the solution is a vector $\mathbf{w}$; the matrix case goes through in a completely analogous fashion.

In gradient descent, we minimize a function $J(\mathbf{w})$ iteratively by starting from some initial point $\mathbf{w}(0)$, computing the gradient of $J(\mathbf{w})$ at this point, and then moving in the direction of the negative gradient or the steepest descent by a suitable distance. Once there, we repeat the same procedure at the new point, and so on. For $t = 1, 2, \dots$ we have the update rule

$$\mathbf{w}(t) = \mathbf{w}(t-1) - \alpha(t) \left. \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \right|_{\mathbf{w} = \mathbf{w}(t-1)} \qquad (3.22)$$

with the gradient taken at the point $\mathbf{w}(t-1)$. The parameter $\alpha(t)$ gives the length of the step in the negative gradient direction. It is often called the step size or learning rate. Iteration (3.22) is continued until it converges, which in practice happens when the Euclidean distance between two consecutive solutions, $\|\mathbf{w}(t) - \mathbf{w}(t-1)\|$, goes below some small tolerance level.

If there is no reason to emphasize the time or iteration step, a convenient shorthand notation will be used throughout this book in presenting update rules of the preceding type. Denote the difference between the new and old value by

$$\Delta \mathbf{w} = \mathbf{w}(t) - \mathbf{w}(t-1) \qquad (3.23)$$

We can then write the rule (3.22) either as

$$\Delta \mathbf{w} = -\alpha \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$

or even shorter as

$$\Delta \mathbf{w} \propto -\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$

The symbol $\propto$ is read "is proportional to"; it is then understood that the vector on the left-hand side, $\Delta \mathbf{w}$, has the same direction as the gradient vector on the right-hand side, but there is a positive scalar coefficient by which the length can be adjusted. In the upper version of the update rule, this coefficient is denoted by $\alpha$. In many cases, this learning rate can and should in fact be time dependent. Yet a third very convenient way to write such update rules, in conformity with programming languages, is

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$

where the symbol $\leftarrow$ means substitution; i.e., the value of the right-hand side is computed and substituted in $\mathbf{w}$.
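A minimal gradient-descent sketch of the update rule (3.22), on an illustrative quadratic cost with its unique minimum at the origin; the matrix, learning rate, and tolerance are assumptions for the demonstration.

```python
import numpy as np

# Gradient descent on J(w) = 1/2 w^T A w, with A symmetric positive
# definite, so the unique minimum is w* = 0 and dJ/dw = A w.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
alpha = 0.1                       # fixed learning rate alpha(t)
w = np.array([4.0, -3.0])         # initial point w(0)

for t in range(1, 1001):
    grad = A @ w                  # gradient of J at the current point
    w_new = w - alpha * grad      # update rule (3.22)
    if np.linalg.norm(w_new - w) < 1e-10:   # practical stopping criterion
        w = w_new
        break
    w = w_new

converged_to_minimum = np.linalg.norm(w) < 1e-8
```

The iteration stops once two consecutive solutions are closer than the tolerance, exactly as described above, and the final point is numerically at the minimum.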

Geometrically, a gradient descent step as in (3.22) means going downhill. The graph of $J(\mathbf{w})$ is the multidimensional equivalent of mountain terrain, and we are always moving downward in the steepest direction. This also immediately shows the disadvantage of steepest descent: unless the function $J(\mathbf{w})$ is very simple and smooth, steepest descent will lead to the closest local minimum instead of a global minimum. As such, the method offers no way to escape from a local minimum. Nonquadratic cost functions may have many local maxima and minima. Therefore, good initial values are important in initializing the algorithm.

Fig. 3.1 Contour plot of a cost function with a local minimum.

As an example, consider the case of Fig. 3.1. A function $J(\mathbf{w})$ is shown there as a contour plot. In the region shown in the figure, there is one local minimum and one global minimum. From the initial point chosen there, where the gradient vector has been plotted, it is very likely that the algorithm will converge to the local minimum.

Generally, the speed of convergence can be quite low close to the minimum point, because the gradient approaches zero there. The speed can be analyzed as follows. Let us denote by $\mathbf{w}^*$ the local or global minimum point where the algorithm will eventually converge. From (3.22) we have

$$\mathbf{w}(t) - \mathbf{w}^* = \mathbf{w}(t-1) - \mathbf{w}^* - \alpha(t) \left. \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \right|_{\mathbf{w} = \mathbf{w}(t-1)} \qquad (3.24)$$

Let us expand the gradient vector $\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$ element by element as a Taylor series around the point $\mathbf{w}^*$, as explained in Section 3.1.4. Using only the zeroth- and first-order terms, we have for the $i$th element

$$\left. \frac{\partial J(\mathbf{w})}{\partial w_i} \right|_{\mathbf{w} = \mathbf{w}(t-1)} = \left. \frac{\partial J(\mathbf{w})}{\partial w_i} \right|_{\mathbf{w} = \mathbf{w}^*} + \sum_{j=1}^m \left. \frac{\partial^2 J(\mathbf{w})}{\partial w_i \partial w_j} \right|_{\mathbf{w} = \mathbf{w}^*} [w_j(t-1) - w_j^*] + \cdots$$


Now, because $\mathbf{w}^*$ is the point of convergence, the partial derivatives of the cost function must be zero at $\mathbf{w}^*$. Using this result, and compiling the above expansion into vector form, yields

$$\left. \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \right|_{\mathbf{w} = \mathbf{w}(t-1)} = \mathbf{H}(\mathbf{w}^*) [\mathbf{w}(t-1) - \mathbf{w}^*] + \cdots$$

where $\mathbf{H}(\mathbf{w}^*)$ is the Hessian matrix computed at the point $\mathbf{w} = \mathbf{w}^*$. Substituting this in (3.24) gives

$$\mathbf{w}(t) - \mathbf{w}^* \approx [\mathbf{I} - \alpha(t) \mathbf{H}(\mathbf{w}^*)] [\mathbf{w}(t-1) - \mathbf{w}^*]$$

This kind of convergence, which is essentially equivalent to multiplying a matrix many times with itself, is called linear. The speed of convergence depends on the learning rate and the size of the Hessian matrix. If the cost function $J(\mathbf{w})$ is very flat at the minimum, with second partial derivatives also small, then the Hessian is small and the convergence is slow (for fixed $\alpha(t)$). Usually, we cannot influence the shape of the cost function, and we have to choose $\alpha(t)$ for a given fixed cost function. The choice of an appropriate step length or learning rate $\alpha(t)$ is thus essential: too small a value will lead to slow convergence. The value cannot be too large either: too large a value will lead to overshooting and instability, which prevents convergence altogether. In Fig. 3.1, too large a learning rate will cause the solution point to zigzag around the local minimum. The problem is that we do not know the Hessian matrix, and therefore determining a good value for the learning rate is difficult.
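The linear-convergence relation can be sketched numerically: for a quadratic cost the error contracts by the factor $|1 - \alpha \lambda|$ per step for each Hessian eigenvalue $\lambda$, so $\alpha > 2/\lambda_{\max}$ makes the iteration diverge. The diagonal Hessian and the two learning rates below are illustrative assumptions.

```python
import numpy as np

# Demonstration of w(t) - w* = [I - alpha H][w(t-1) - w*] on
# J(w) = 1/2 w^T H w, whose minimum is w* = 0.
H = np.diag([1.0, 10.0])          # Hessian eigenvalues 1 and 10
w0 = np.array([1.0, 1.0])

def run(alpha, steps=100):
    w = w0.copy()
    for _ in range(steps):
        w = w - alpha * (H @ w)   # gradient step on the quadratic cost
    return np.linalg.norm(w)

small_ok = run(0.05) < 1e-2       # factors 0.95 and 0.5: converges
too_large = run(0.25) > 1e3       # factor |1 - 2.5| = 1.5: diverges
```

This is the zigzag instability described above: with the larger rate the component along the steep direction overshoots and its magnitude grows at every step.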

A simple extension to basic gradient descent, popular in neural network learning rules like the back-propagation algorithm, is to use a two-step iteration instead of just one step as in (3.22), leading to the so-called momentum method. The neural network literature has produced a large number of tricks for boosting steepest descent learning by adjustable learning rates, clever choice of the initial value, etc. However, in ICA, many of the most popular algorithms are still straightforward gradient descent methods, in which the gradient of an appropriate contrast function is computed and used as such in the algorithm.

3.2.2 Second-order learning

In numerical analysis, a large number of methods that are more efficient than plain gradient descent have been introduced for minimizing or maximizing a multivariate scalar function. They could be immediately used for the ICA problem. Their advantage is faster convergence in terms of the number of iterations required, but the disadvantage quite often is increased computational complexity per iteration. Here we consider second-order methods, which means that we also use the information contained in the second-order derivatives of the cost function. Obviously, this information relates to the curvature of the optimization terrain and should help in finding a better direction for the next step in the iteration than plain gradient descent.


A good starting point is the multivariate Taylor series; see Section 3.1.4. Let us develop the function $J(\mathbf{w})$ in a Taylor series around a point $\mathbf{w}$ as

$$J(\mathbf{w}') = J(\mathbf{w}) + \left[ \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \right]^T (\mathbf{w}' - \mathbf{w}) + \frac{1}{2} (\mathbf{w}' - \mathbf{w})^T \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} (\mathbf{w}' - \mathbf{w}) + \cdots \qquad (3.25)$$

In trying to minimize the function $J(\mathbf{w})$, we ask what choice of the new point $\mathbf{w}'$ gives us the largest decrease in the value of $J(\mathbf{w})$. We can write $\mathbf{w}' - \mathbf{w} = \Delta \mathbf{w}$ and minimize the function

$$J(\mathbf{w}') - J(\mathbf{w}) = \left[ \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \right]^T \Delta \mathbf{w} + \frac{1}{2} \Delta \mathbf{w}^T \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} \Delta \mathbf{w}$$

with respect to $\Delta \mathbf{w}$. The gradient of this function with respect to $\Delta \mathbf{w}$ is (see Example 3.2) equal to

$$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} + \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} \Delta \mathbf{w}$$

note that the Hessian matrix is symmetric. If the Hessian is also positive definite, then the function will have a parabolic shape and the minimum is given by the zero of the gradient. Setting the gradient to zero gives

$$\Delta \mathbf{w} = -\left[ \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} \right]^{-1} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$

From this, the following second-order iteration rule emerges:

$$\mathbf{w}' = \mathbf{w} - \left[ \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} \right]^{-1} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \qquad (3.26)$$

where we have to compute the gradient and Hessian on the right-hand side at the point $\mathbf{w}$.

Algorithm (3.26) is called Newton's method, and it is one of the most efficient ways for function minimization. It is, in fact, a special case of the well-known Newton's method for solving an equation; here it solves the equation stating that the gradient is zero.

Newton's method provides fast convergence in the vicinity of the minimum, if the Hessian matrix is positive definite there, but the method may perform poorly farther away. A complete convergence analysis is given in [284]. It is also shown there that the convergence of Newton's method is quadratic; if $\mathbf{w}^*$ is the limit of convergence, then

$$\|\mathbf{w}' - \mathbf{w}^*\| \le c \, \|\mathbf{w} - \mathbf{w}^*\|^2$$

where $c$ is a constant. This is a very strong mode of convergence: when the error on the right-hand side is relatively small, its square can be orders of magnitude smaller. (If the exponent is 3, the convergence is called cubic, which is somewhat better than quadratic, although the difference is not as large as the difference between linear and quadratic convergence.)
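A small sketch of Newton's rule (3.26) on an illustrative nonquadratic cost shows this very fast convergence in practice; the cost function, whose minimum at the origin has a positive definite Hessian, and the starting point are assumptions for the demonstration.

```python
import numpy as np

# Newton's method (3.26) on J(w) = ||w||^2 + w1^4 + w2^4, minimized at
# w* = 0 where the Hessian is 2I (positive definite).
def gradient(w):
    return 2 * w + 4 * w ** 3           # dJ/dw

def hessian(w):
    return np.diag(2 + 12 * w ** 2)     # d^2 J / dw^2 (diagonal here)

w = np.array([1.0, 0.5])
errors = []
for _ in range(8):
    # Newton step: solve H dw = grad rather than inverting H explicitly
    w = w - np.linalg.solve(hessian(w), gradient(w))
    errors.append(np.linalg.norm(w))

newton_converged = errors[-1] < 1e-10   # near machine precision in a few steps
```

The error shrinks superlinearly from one iteration to the next, reaching numerical precision within a handful of steps, in contrast to the hundred-odd steps a plain gradient iteration with a safe learning rate would need on a comparable problem.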

On the other hand, Newton's method is computationally much more demanding per iteration than the steepest descent method. The inverse of the Hessian has to be computed at each step, which is prohibitively heavy for many practical cost functions in high dimensions. It may also happen that the Hessian matrix becomes ill-conditioned or close to singular at some step of the algorithm, which introduces numerical errors into the iteration. One possible remedy for this is
