
DOCUMENT INFORMATION

Title: ICA by Nonlinear Decorrelation and Nonlinear PCA
Authors: Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Type: Book chapter (from Independent Component Analysis, Copyright 2001 John Wiley & Sons, Inc.)
Year: 2001
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
Pages: 24
File size: 693.89 KB


Contents


ICA by Nonlinear Decorrelation and Nonlinear PCA

This chapter starts by reviewing some of the early research efforts in independent component analysis (ICA), especially the technique based on nonlinear decorrelation that was successfully used by Jutten, Hérault, and Ans to solve the first ICA problems. Today, this work is mainly of historical interest, because there exist several more efficient algorithms for ICA.

Nonlinear decorrelation can be seen as an extension of second-order methods such as whitening and principal component analysis (PCA). These methods give components that are uncorrelated linear combinations of input variables, as explained in Chapter 6. We will show that independent components can in some cases be found as nonlinearly uncorrelated linear combinations. The nonlinear functions used in this approach introduce higher-order statistics into the solution method, making ICA possible.

We then show how the work on nonlinear decorrelation eventually led to the Cichocki-Unbehauen algorithm, which is essentially the same as the algorithm that we derived in Chapter 9 using the natural gradient. Next, the criterion of nonlinear decorrelation is extended and formalized to the theory of estimating functions, and the closely related EASI algorithm is reviewed.

Another approach to ICA that is related to PCA is the so-called nonlinear PCA. A nonlinear representation is sought for the input data that minimizes a least mean-square error criterion. For the linear case, it was shown in Chapter 6 that principal components are obtained. It turns out that in some cases the nonlinear PCA approach gives independent components instead. We review the nonlinear PCA criterion and show its equivalence to other criteria like maximum likelihood (ML). Then, two typical learning rules introduced by the authors are reviewed, of which the first one is a stochastic gradient algorithm and the other one a recursive least mean-square algorithm.

The correlation between two random variables $y_1$ and $y_2$ was discussed in detail in Chapter 2. Here we consider zero-mean variables only, so correlation and covariance are equal. Correlation is related to independence in such a way that independent variables are always uncorrelated. The opposite is not true, however: the variables can be uncorrelated, yet dependent. An example is a uniform density in a rotated square centered at the origin of the $(y_1, y_2)$ plane. Uncorrelatedness implies independence only in special cases, for example when the joint density of $(y_1, y_2)$ is constrained to be jointly gaussian.

Extending the concept of correlation, we here define the nonlinear correlation of the random variables $y_1$ and $y_2$ as $\mathrm{E}\{f(y_1)g(y_2)\}$. Here, $f(y_1)$ and $g(y_2)$ are two functions, of which at least one is nonlinear. Typical examples might be polynomials of degree higher than 1, or more complex functions like the hyperbolic tangent. This means that one or both of the random variables are first transformed nonlinearly to new variables $f(y_1)$, $g(y_2)$, and then the usual linear correlation between these new variables is considered.

The question now is: Assuming that $y_1$ and $y_2$ are nonlinearly decorrelated in the sense

$$\mathrm{E}\{f(y_1)g(y_2)\} = 0 \qquad (12.1)$$

can we say something about their independence? We would hope that by making this kind of nonlinear correlation zero, independence would be obtained under some additional conditions to be specified.
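To make the discussion concrete, here is a short numerical sketch (not part of the original chapter; the rotated-square data and the choice of centered square functions for $f$ and $g$ are illustrative assumptions). The two rotated components are uncorrelated in the ordinary sense, yet a nonlinear correlation of the form $\mathrm{E}\{f(y_1)g(y_2)\}$, with the transformed variables centered to zero mean, is clearly nonzero, while for genuinely independent components it vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent sources, uniform on [-1, 1]; a 45-degree rotation then gives
# the "uniform density in a rotated square": y1 and y2 are uncorrelated
# but clearly dependent.
s = rng.uniform(-1.0, 1.0, size=(2, 200_000))
c, q = np.cos(np.pi / 4), np.sin(np.pi / 4)
y = np.array([[c, -q], [q, c]]) @ s

def nonlinear_corr(a, b, f, g):
    """Sample version of E{f(a) g(b)} with the transformed variables centered."""
    fa, gb = f(a), g(b)
    return np.mean((fa - fa.mean()) * (gb - gb.mean()))

square = lambda u: u ** 2          # illustrative (even) nonlinearity

print("linear correlation, rotated pair:   ", np.mean(y[0] * y[1]))                       # ~ 0
print("nonlinear correlation, rotated pair:", nonlinear_corr(y[0], y[1], square, square))  # != 0
print("nonlinear correlation, sources:     ", nonlinear_corr(s[0], s[1], square, square))  # ~ 0
```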

There is a general theorem (see, e.g., [129]) stating that $y_1$ and $y_2$ are independent if and only if

$$\mathrm{E}\{f(y_1)g(y_2)\} = \mathrm{E}\{f(y_1)\}\,\mathrm{E}\{g(y_2)\}$$

for all continuous functions $f$ and $g$ that are zero outside a finite interval. Based on this, it seems very difficult to approach independence rigorously, because the functions $f$ and $g$ are almost arbitrary. Some kind of approximations are needed. This problem was considered by Jutten and Hérault [228]. Let us assume that $f(y_1)$ and $g(y_2)$ are smooth functions that have derivatives of all orders in a neighborhood of the origin. They can be expanded in Taylor series:

$$f(y_1) = f(0) + f'(0)y_1 + \tfrac{1}{2}f''(0)y_1^2 + \cdots = \sum_{i=0}^{\infty} f_i y_1^i$$

$$g(y_2) = g(0) + g'(0)y_2 + \tfrac{1}{2}g''(0)y_2^2 + \cdots = \sum_{i=0}^{\infty} g_i y_2^i$$

where $f_i$ and $g_i$ are shorthand for the coefficients of the $i$th powers in the series.

The product of the functions is then

$$f(y_1)g(y_2) = \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} f_i g_j\, y_1^i y_2^j$$

and condition (12.1) is equivalent to

$$\mathrm{E}\{f(y_1)g(y_2)\} = \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} f_i g_j\, \mathrm{E}\{y_1^i y_2^j\} = 0 \qquad (12.4)$$

Obviously, a sufficient condition for this equation to hold is

$$\mathrm{E}\{y_1^i y_2^j\} = 0 \qquad (12.5)$$

for all indices $i, j$ appearing in the series expansion (12.4). There may be other solutions in which the higher-order correlations are not zero, but the coefficients $f_i$, $g_j$ happen to be just suitable to cancel the terms and make the sum in (12.4) exactly equal to zero. For nonpolynomial functions that have infinite Taylor expansions, such spurious solutions can be considered unlikely (we will see later that such spurious solutions do exist, but they can be avoided by the theory of ML estimation).

Again, a sufficient condition for (12.5) to hold is that the variables $y_1$ and $y_2$ are independent and one of $\mathrm{E}\{y_1^i\}$, $\mathrm{E}\{y_2^j\}$ is zero. Let us require that $\mathrm{E}\{y_1^i\} = 0$ for all powers $i$ appearing in its series expansion. But this is only possible if $f(y_1)$ is an odd function; then the Taylor series contains only odd powers $1, 3, 5, \ldots$, and the powers $i$ in Eq. (12.5) will also be odd. Otherwise, we would have the case that even moments of $y_1$ must be zero, which is impossible for a nonzero random variable. Thus it is natural to require that the nonlinearity is an odd function such that $f(y_1)$ has zero mean.

The preceding discussion is informal, but it should make it credible that nonlinear correlations are useful as a possible general criterion for independence. Several things have to be decided in practice: the first one is how to actually choose the functions $f$ and $g$. Is there some natural optimality criterion that can tell us that some functions are better than others?


Consider the ICA model $\mathbf{x} = \mathbf{A}\mathbf{s}$. Let us first look at a $2 \times 2$ case, which was considered by Hérault, Jutten, and Ans [178, 179, 226] in connection with the blind separation of two signals from two linear mixtures. The model is then

$$x_1 = a_{11}s_1 + a_{12}s_2$$
$$x_2 = a_{21}s_1 + a_{22}s_2$$

The separating network (Fig. 12.1) is a feedback circuit in which the outputs $y_1, y_2$ are fed back through the cross-coupling weights $m_{12}, m_{21}$, collected in a matrix $\mathbf{M}$ with zero diagonal. From Fig. 12.1 we have directly

$$\mathbf{y} = \mathbf{x} - \mathbf{M}\mathbf{y}$$

Thus the input-output mapping of the network is

$$\mathbf{y} = (\mathbf{I} + \mathbf{M})^{-1}\mathbf{x} \qquad (12.8)$$


Note that from the original ICA model we have $\mathbf{s} = \mathbf{A}^{-1}\mathbf{x}$, provided that $\mathbf{A}$ is invertible. If $\mathbf{I} + \mathbf{M} = \mathbf{A}$, then $\mathbf{y}$ becomes equal to $\mathbf{s}$. However, the problem in blind separation is that the matrix $\mathbf{A}$ is unknown.
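A quick numerical check of this observation (a sketch with an arbitrarily chosen mixing matrix; the numbers are not from the book):

```python
import numpy as np

A = np.array([[1.0, 0.6],            # an arbitrary invertible 2 x 2 mixing matrix
              [0.4, 1.0]])
s = np.array([[0.3, -0.8, 0.5],      # a few source samples
              [-0.6, 0.2, 0.9]])
x = A @ s                            # mixtures

# Feedback network: y = x - M y, i.e. y = (I + M)^{-1} x.
# Choosing I + M = A (M = A - I) makes the network invert the mixing exactly.
M = A - np.eye(2)
y = np.linalg.solve(np.eye(2) + M, x)

print(np.allclose(y, s))             # True: the outputs equal the sources
```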

The solution that Jutten and Hérault introduced was to adapt the two feedback coefficients $m_{12}, m_{21}$ so that the outputs of the network $y_1, y_2$ become independent. Then the matrix $\mathbf{A}$ has been implicitly inverted and the original sources have been found. For independence, they used the criterion of nonlinear correlations. They proposed the following learning rules:

$$\Delta m_{12} = \mu f(y_1)g(y_2) \qquad (12.9)$$
$$\Delta m_{21} = \mu f(y_2)g(y_1) \qquad (12.10)$$

where $\mu$ is the learning rate and $f$ and $g$ are two different odd functions, at least one of them nonlinear. At each step, computing the outputs from (12.8) requires inverting the matrix $\mathbf{I} + \mathbf{M}$; instead, one can use a rough approximation

$$\mathbf{y} = (\mathbf{I} + \mathbf{M})^{-1}\mathbf{x} \approx (\mathbf{I} - \mathbf{M})\mathbf{x}$$

that seems to work in practice.
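The following sketch simulates this adaptation (it is not the authors' code; the batch-style updates, the mild mixing matrix, the learning rate, and the choice $f(u) = u^3$, $g(u) = u$, which suits the sub-gaussian uniform sources used here, are all illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[1.0, 0.5],            # mild, invertible mixing with unit diagonal
              [0.4, 1.0]])

f = lambda u: u ** 3                 # odd nonlinearity
g = lambda u: u                      # second function kept linear here

M = np.zeros((2, 2))                 # feedback weights m12, m21 (zero diagonal)
mu = 0.2

for _ in range(2000):
    s = rng.uniform(-1.0, 1.0, size=(2, 200))   # fresh batch of uniform sources
    x = A @ s                                    # observed mixtures
    y = np.linalg.solve(np.eye(2) + M, x)        # network outputs, Eq. (12.8)
    # Herault-Jutten style updates: drive E{f(y1)g(y2)} and E{f(y2)g(y1)} to zero
    M[0, 1] += mu * np.mean(f(y[0]) * g(y[1]))
    M[1, 0] += mu * np.mean(f(y[1]) * g(y[0]))

print("learned M:\n", np.round(M, 3))                          # close to A - I here
print("global system (I + M)^{-1} A:\n",
      np.round(np.linalg.solve(np.eye(2) + M, A), 3))          # close to identity
```

Because the diagonal of $\mathbf{A}$ equals one in this toy setting, the separating solution is exactly $\mathbf{M} = \mathbf{A} - \mathbf{I}$; for a general mixing matrix the sources would be recovered only up to scaling.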

Although the Hérault-Jutten algorithm was a very elegant pioneering solution to the ICA problem, we know now that it has some drawbacks in practice. The algorithm may work poorly or even fail to separate the sources altogether if the signals are badly scaled or the mixing matrix is ill-conditioned. The number of sources that the method can separate is severely limited. Also, although the local stability was shown in [408], good global convergence behavior is not guaranteed.

Starting from the Hérault-Jutten algorithm, Cichocki, Unbehauen, and coworkers [82, 85, 84] derived an extension that has a much enhanced performance and reliability. Instead of a feedback circuit like the Hérault-Jutten network in Fig. 12.1, Cichocki and Unbehauen proposed a feedforward network with weight matrix $\mathbf{B}$, with the mixture vector $\mathbf{x}$ for input and with output $\mathbf{y} = \mathbf{B}\mathbf{x}$. Now the dimensionality of the problem can be higher than 2. The goal is to adapt the $m \times m$ matrix $\mathbf{B}$ so that the elements of $\mathbf{y}$ become independent. The learning algorithm for $\mathbf{B}$ is as follows:

$$\Delta \mathbf{B} = \mu\,[\boldsymbol{\Lambda} - \mathbf{f}(\mathbf{y})\mathbf{g}(\mathbf{y})^T]\,\mathbf{B} \qquad (12.11)$$

where $\mu$ is the learning rate, $\boldsymbol{\Lambda}$ is a diagonal matrix, and $\mathbf{f}(\mathbf{y})$, $\mathbf{g}(\mathbf{y})$ denote componentwise nonlinear functions (vectors with elements $f(y_i)$ and $g(y_i)$, respectively).

The argumentation showing that this algorithm will give independent components, too, is based on nonlinear decorrelations. Consider the stationary solution of this learning rule, defined as the matrix $\mathbf{B}$ for which $\mathrm{E}\{\Delta\mathbf{B}\} = \mathbf{0}$, with the expectation taken over the density of the mixtures $\mathbf{x}$. For this matrix, the update is on the average zero. Because this is a stochastic-approximation-type algorithm (see Chapter 3), such stationarity is a necessary condition for convergence. Excluding the trivial solution $\mathbf{B} = \mathbf{0}$, we must have

$$\boldsymbol{\Lambda} - \mathrm{E}\{\mathbf{f}(\mathbf{y})\mathbf{g}(\mathbf{y})^T\} = \mathbf{0}$$

Especially, for the off-diagonal elements, this implies

$$\mathrm{E}\{f(y_i)g(y_j)\} = 0, \quad i \neq j \qquad (12.12)$$

which is exactly our definition of nonlinear decorrelation in Eq. (12.1), extended to $n$ output signals $y_1, \ldots, y_n$. The diagonal elements satisfy

$$\mathrm{E}\{f(y_i)g(y_i)\} = \lambda_{ii}$$

showing that the diagonal elements $\lambda_{ii}$ of the matrix $\boldsymbol{\Lambda}$ only control the amplitude scaling of the outputs.

The conclusion is that if the learning rule converges to a nonzero matrix $\mathbf{B}$, then the outputs of the network must become nonlinearly decorrelated, and hopefully independent. The convergence analysis has been performed in [84]; for general principles of analyzing stochastic iteration algorithms like (12.11), see Chapter 3. The justification for the Cichocki-Unbehauen algorithm (12.11) in the original articles was based on nonlinear decorrelations, not on any rigorous cost functions that would be minimized by the algorithm. However, it is interesting to note that this algorithm, first appearing in the early 1990s, is in fact the same as the popular natural gradient algorithm introduced later by Amari, Cichocki, and Yang [12] as an extension to the original Bell-Sejnowski algorithm [36]. All we have to do is choose $\boldsymbol{\Lambda}$ as the unit matrix, the function $g(y)$ as the linear function $g(y) = y$, and the function $f(y)$ as a sigmoidal nonlinearity related to the true density of the sources. The Amari-Cichocki-Yang algorithm and the Bell-Sejnowski algorithm were reviewed in Chapter 9, and it was shown how the algorithms are derived from the rigorous maximum likelihood criterion. The maximum likelihood approach also tells us what kind of nonlinearities should be used, as discussed in Chapter 9.
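As an illustration of rule (12.11), the sketch below (an illustrative implementation, not the original authors' code) uses $\boldsymbol{\Lambda} = \mathbf{I}$, $g(y) = y$, and $f(y) = y^3$; the cubic nonlinearity is chosen because the sources of this toy example are sub-gaussian (uniform), in line with the maximum likelihood considerations mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 3
A = np.array([[1.0, 0.6, -0.4],          # a fixed, well-conditioned mixing matrix
              [0.5, 1.0,  0.3],
              [-0.3, 0.4, 1.0]])

f = lambda u: u ** 3                     # suited to sub-gaussian sources
g = lambda u: u                          # linear second function
Lam = np.eye(n)                          # Lambda = I fixes the output scales

B = np.eye(n)
mu = 0.02

for _ in range(3000):
    s = rng.uniform(-1.0, 1.0, size=(n, 500))    # independent uniform sources
    x = A @ s
    y = B @ x
    # Cichocki-Unbehauen update, Eq. (12.11): Delta B = mu [Lambda - f(y) g(y)^T] B
    B += mu * (Lam - (f(y) @ g(y).T) / y.shape[1]) @ B

print(np.round(B @ A, 2))    # approximately a scaled permutation matrix
```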


THE ESTIMATING FUNCTIONS APPROACH *

Consider the criterion of nonlinear decorrelations being zero, generalized to $n$ random variables $y_1, \ldots, y_n$, as shown in Eq. (12.12). Among the possible roots $y_1, \ldots, y_n$ of these equations are the source signals $s_1, \ldots, s_n$. When solving these in an algorithm like the Hérault-Jutten algorithm or the Cichocki-Unbehauen algorithm, one in fact solves for the separating matrix $\mathbf{B}$.

This notion was generalized and formalized by Amari and Cardoso [8] to the case of estimating functions. Again, consider the basic ICA model $\mathbf{x} = \mathbf{A}\mathbf{s}$, with a separating matrix $\mathbf{B}$ such that $\mathbf{s} = \mathbf{B}\mathbf{x}$. An estimating function $\mathbf{F}(\mathbf{x}, \mathbf{B})$ is a matrix-valued function for which the separating matrices are obtained as solutions of

$$\mathrm{E}\{\mathbf{F}(\mathbf{x}, \mathbf{B})\} = \mathbf{0} \qquad (12.13)$$

This means that, taking the expectation with respect to the density of $\mathbf{x}$, the true separating matrices are roots of the equation. Once these are solved from Eq. (12.13), the independent components are directly obtained.

Example 12.1 Given a set of nonlinear functions $f_1(y_1), \ldots, f_n(y_n)$, with $\mathbf{y} = \mathbf{B}\mathbf{x}$, and defining a vector function $\mathbf{f}(\mathbf{y}) = [f_1(y_1), \ldots, f_n(y_n)]^T$, a suitable estimating function for ICA is

$$\mathbf{F}(\mathbf{x}, \mathbf{B}) = \mathbf{f}(\mathbf{y})\mathbf{y}^T - \boldsymbol{\Lambda} \qquad (12.14)$$

because at a separating matrix the off-diagonal elements satisfy $\mathrm{E}\{f_i(y_i)y_j\} = 0$. The diagonal matrix $\boldsymbol{\Lambda}$ determines the scales of the separated sources. Another estimating function is the right-hand side of the learning rule (12.11),

$$\mathbf{F}(\mathbf{x}, \mathbf{B}) = [\boldsymbol{\Lambda} - \mathbf{f}(\mathbf{y})\mathbf{g}(\mathbf{y})^T]\,\mathbf{B}$$

There is a fundamental difference in the estimating function approach compared to most of the other approaches to ICA: the usual starting point in ICA is a cost function that somehow measures how independent or nongaussian the outputs $y_i$ are, and the independent components are solved by minimizing the cost function. In contrast, there is no such cost function here. The estimating function need not be the gradient of any other function. In this sense, the theory of estimating functions is very general and potentially useful for finding ICA algorithms. For a discussion of this approach in connection with neural networks, see [328].

It is not a trivial question how to design in practice an estimating function so that we can solve the ICA model. Even if we have two estimating functions that both have been shaped in such a way that separating matrices are their roots, what is a relevant measure to compare them? Statistical considerations are helpful here. Note that in practice, the densities of the sources and the mixtures are unknown in the ICA model. It is impossible in practice to solve Eq. (12.13) as such, because the expectation cannot be formed. Instead, it has to be estimated using a finite sample of $\mathbf{x}$. Denoting this sample by $\mathbf{x}(1), \ldots, \mathbf{x}(T)$, we use the sample function

$$\mathrm{E}\{\mathbf{F}(\mathbf{x}, \mathbf{B})\} \approx \frac{1}{T}\sum_{t=1}^{T}\mathbf{F}(\mathbf{x}(t), \mathbf{B})$$

The general result provided by Amari and Cardoso [8] is that estimating functions of the form (12.14) are optimal in the sense that, given any estimating function $\mathbf{F}$, one can always find a better or at least equally good estimating function (in the sense of efficiency) having the form

$$\mathbf{F}^*(\mathbf{x}, \mathbf{B}) = \mathbf{f}(\mathbf{y})\mathbf{y}^T - \boldsymbol{\Lambda} \qquad (12.15)$$

whose off-diagonal elements are of the form $f_i(y_i)y_j$; the diagonal elements are simply scaling factors.

The result shows that it is unnecessary to use a nonlinear function $g(y)$ instead of $y$ as the other one of the two functions in nonlinear decorrelation. Only one nonlinear function $f(y)$, combined with $y$, is sufficient. It is interesting that functions of exactly the type $\mathbf{f}(\mathbf{y})\mathbf{y}^T$ naturally emerge as gradients of cost functions such as the likelihood; the question of how to choose the nonlinearity $f(y)$ is also answered in that case. A further example is given in the following section.

The preceding analysis is not related in any way to the practical methods for finding the roots of estimating functions. Due to the nonlinearities, closed-form solutions do not exist and numerical algorithms have to be used. The simplest iterative stochastic approximation algorithm for solving the roots of $\mathbf{F}(\mathbf{x}, \mathbf{B})$ has the form

$$\Delta\mathbf{B} = \mu\,\mathbf{F}(\mathbf{x}, \mathbf{B}) \qquad (12.17)$$

with $\mu$ an appropriate learning rate. In fact, we now discover that the learning rules (12.9), (12.10), and (12.11) are examples of this more general framework.
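The following sketch (illustrative, with the same kind of synthetic data as above) views separation as root finding: it evaluates the sample version of the estimating equation (12.13) for the estimating function $\mathbf{F}(\mathbf{x}, \mathbf{B}) = \mathbf{f}(\mathbf{y})\mathbf{y}^T - \boldsymbol{\Lambda}$ and checks that the true separating matrix is (approximately) a root, whereas an arbitrary matrix is not. Here $\boldsymbol{\Lambda}$ is matched to the source scaling of this synthetic example so that $\mathbf{A}^{-1}$ is an exact root; in a blind setting $\boldsymbol{\Lambda}$ would simply fix the output scales.

```python
import numpy as np

rng = np.random.default_rng(3)

n, T = 3, 100_000
A = np.array([[1.0, 0.6, -0.4],
              [0.5, 1.0,  0.3],
              [-0.3, 0.4, 1.0]])
s = rng.uniform(-1.0, 1.0, size=(n, T))          # independent sources
x = A @ s                                        # finite sample x(1), ..., x(T)

f = lambda u: u ** 3
Lam = np.diag([np.mean(f(si) * si) for si in s]) # matches the source scaling here

def F_hat(B):
    """Sample average (1/T) sum_t F(x(t), B) with F(x, B) = f(y) y^T - Lambda."""
    y = B @ x
    return (f(y) @ y.T) / T - Lam

print("||F_hat|| at the separating matrix:", np.linalg.norm(F_hat(np.linalg.inv(A))))  # ~ 0
print("||F_hat|| at an arbitrary matrix:  ", np.linalg.norm(F_hat(np.eye(n))))         # >> 0
```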


EQUIVARIANT ADAPTIVE SEPARATION VIA INDEPENDENCE

In most of the proposed approaches to ICA, the learning rules are gradient descent algorithms of cost (or contrast) functions. Many cases have been covered in previous chapters. Typically, the cost function has the form $J(\mathbf{B}) = \mathrm{E}\{G(\mathbf{y})\}$, with $G$ some scalar function, and usually some additional constraints are used. Here again $\mathbf{y} = \mathbf{B}\mathbf{x}$, and the form of the function $G$ and the probability density of $\mathbf{x}$ determine the shape of the contrast function $J(\mathbf{B})$.

It is easy to show (see the definition of matrix and vector gradients in Chapter 3) that

$$\frac{\partial J(\mathbf{B})}{\partial \mathbf{B}} = \mathrm{E}\{\mathbf{g}(\mathbf{y})\mathbf{x}^T\} = \mathrm{E}\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}(\mathbf{B}^T)^{-1} \qquad (12.19)$$

where $\mathbf{g}(\mathbf{y})$ denotes the gradient of $G$ with respect to $\mathbf{y}$. For appropriate nonlinearities $G(\mathbf{y})$, these gradients are estimating functions in the sense that the elements of $\mathbf{y}$ must be statistically independent when the gradient becomes zero. Note also that in the form $\mathrm{E}\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}(\mathbf{B}^T)^{-1}$, the first factor $\mathbf{g}(\mathbf{y})\mathbf{y}^T$ has the shape of an optimal estimating function (except for the diagonal elements); see Eq. (12.15). Now we also know how the nonlinear function $\mathbf{g}(\mathbf{y})$ can be determined: it is directly the gradient of the function $G(\mathbf{y})$ appearing in the original cost function.

Unfortunately, the matrix inversion $(\mathbf{B}^T)^{-1}$ in (12.19) is cumbersome. Matrix inversion can be avoided by using the so-called natural gradient introduced by Amari [4]. This is covered in Chapter 3. The natural gradient is obtained in this case by multiplying the usual matrix gradient (12.19) from the right by the matrix $\mathbf{B}^T\mathbf{B}$, which gives the learning rule

$$\Delta\mathbf{B} = -\mu\,\mathrm{E}\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}\,\mathbf{B} \qquad (12.20)$$

This gradient algorithm can also be derived using the relative gradient introduced by Cardoso and Hvam Laheld [71]. This approach is also reviewed in Chapter 3. Based on this, the authors developed their equivariant adaptive separation via independence (EASI) learning algorithm. To proceed from (12.20) to the EASI learning rule, an extra step must be taken. In EASI, as in many other learning rules for ICA, a whitening preprocessing is considered for the mixture vectors $\mathbf{x}$ (see Chapter 6). We first transform $\mathbf{x}$ linearly to $\mathbf{z} = \mathbf{V}\mathbf{x}$, whose elements have unit variances and zero covariances: $\mathrm{E}\{\mathbf{z}\mathbf{z}^T\} = \mathbf{I}$.

The ICA model using these whitened vectors instead of the original ones becomes $\mathbf{z} = \mathbf{V}\mathbf{A}\mathbf{s}$, and it is easily seen that the matrix $\mathbf{V}\mathbf{A}$ is an orthogonal matrix (a rotation). Thus its inverse, which gives the separating matrix, is also orthogonal. As in earlier chapters, let us denote the orthogonal separating matrix by $\mathbf{W}$.

Basically, the learning rule for $\mathbf{W}$ would be the same as (12.20). However, as noted by [71], certain constraints must hold in any updating of $\mathbf{W}$ if the orthogonality is to be preserved at each iteration step. Let us denote the serial update for $\mathbf{W}$ using the learning rule (12.20), briefly, as $\mathbf{W} \leftarrow \mathbf{W} + \mathbf{D}\mathbf{W}$, where now $\mathbf{D} = -\mu\,\mathbf{g}(\mathbf{y})\mathbf{y}^T$. The orthogonality condition for the updated matrix becomes

$$(\mathbf{W} + \mathbf{D}\mathbf{W})(\mathbf{W} + \mathbf{D}\mathbf{W})^T = \mathbf{W}\mathbf{W}^T + \mathbf{D}\mathbf{W}\mathbf{W}^T + \mathbf{W}\mathbf{W}^T\mathbf{D}^T + \mathbf{D}\mathbf{W}\mathbf{W}^T\mathbf{D}^T = \mathbf{I} + \mathbf{D} + \mathbf{D}^T + \mathbf{D}\mathbf{D}^T$$

using the orthogonality $\mathbf{W}\mathbf{W}^T = \mathbf{I}$. For a small learning rate, the last term is negligible, so preserving orthogonality to first order requires $\mathbf{D} + \mathbf{D}^T = \mathbf{0}$, that is, the update matrix $\mathbf{D}$ must be antisymmetric.
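As a sketch (not the chapter's own code), the update below implements the EASI rule in the form commonly quoted from Cardoso and Hvam Laheld [71]: a symmetric term $\mathbf{y}\mathbf{y}^T - \mathbf{I}$ drives whitening, and an antisymmetric term $\mathbf{g}(\mathbf{y})\mathbf{y}^T - \mathbf{y}\mathbf{g}(\mathbf{y})^T$, in line with the antisymmetry requirement just derived, drives the rotation. The batch averaging, the learning rate, and the cubic nonlinearity (suited to the sub-gaussian sources of this toy example) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 3
A = np.array([[1.0, 0.6, -0.4],
              [0.5, 1.0,  0.3],
              [-0.3, 0.4, 1.0]])

g = lambda u: u ** 3                  # nonlinearity chosen for sub-gaussian sources
B = np.eye(n)                         # EASI adapts a single matrix on the raw mixtures
mu = 0.01

for _ in range(5000):
    s = rng.uniform(-1.0, 1.0, size=(n, 500))
    x = A @ s
    y = B @ x
    T = y.shape[1]
    sym = (y @ y.T) / T - np.eye(n)               # whitening part (symmetric)
    skew = (g(y) @ y.T - y @ g(y).T) / T          # rotation part (antisymmetric)
    B -= mu * (sym + skew) @ B

print(np.round(B @ A, 2))     # approximately a (scaled, signed) permutation matrix
```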

The concept of equivariance that forms part of the name of the EASI algorithm is a general concept in statistical estimation; see, e.g., [395]. Equivariance of an estimator means, roughly, that its performance does not depend on the actual value of the parameter. In the context of the basic ICA model, this means that the ICs can be estimated with the same performance whatever the mixing matrix may be. EASI was one of the first ICA algorithms that was explicitly shown to be equivariant. In fact, most estimators of the basic ICA model are equivariant. For a detailed discussion, see [69].


NONLINEAR PRINCIPAL COMPONENTS

One of the basic definitions of PCA was optimal least mean-square error compression, as explained in more detail in Chapter 6. Assuming a random $m$-dimensional zero-mean vector $\mathbf{x}$, we search for a lower dimensional subspace such that the residual error between $\mathbf{x}$ and its orthogonal projection on the subspace is minimal, averaged over the probability density of $\mathbf{x}$. Denoting an orthonormal basis of this subspace by $\mathbf{w}_1, \ldots, \mathbf{w}_n$, the projection is $\sum_{i=1}^{n}(\mathbf{w}_i^T\mathbf{x})\mathbf{w}_i$, whose coefficients $\mathbf{w}_i^T\mathbf{x}$ are linear functions of $\mathbf{x}$. For instance, if $\mathbf{x}$ is two-dimensional with a gaussian density, and we seek a one-dimensional subspace (a straight line passing through the center of the density), then the solution is given by the principal axis of the elliptical density.

We now pose the question how this criterion and its solution are changed if a nonlinearity is included in the criterion. Perhaps the simplest nontrivial nonlinear extension is provided as follows. Assuming $g_1(\cdot), \ldots, g_n(\cdot)$ is a set of scalar functions, as yet unspecified, let us look at a modified criterion to be minimized with respect to the basis vectors [232]:

$$J(\mathbf{w}_1, \ldots, \mathbf{w}_n) = \mathrm{E}\Big\{\big\|\mathbf{x} - \sum_{i=1}^{n} g_i(\mathbf{w}_i^T\mathbf{x})\,\mathbf{w}_i\big\|^2\Big\} \qquad (12.25)$$

Instead of the linear coefficients $\mathbf{w}_i^T\mathbf{x}$, we now have nonlinear functions of them in the expansion that gives the approximation to $\mathbf{x}$. In the optimal solution that minimizes the criterion $J(\mathbf{w}_1, \ldots, \mathbf{w}_n)$, such factors might be termed nonlinear principal components. Therefore, the technique of finding the basis vectors $\mathbf{w}_i$ is here called "nonlinear principal component analysis" (NLPCA).

It should be emphasized that practically always when a well-defined linear problem is extended into a nonlinear one, many ambiguities and alternative definitions arise. This is the case here, too. The term "nonlinear PCA" is by no means unique. There are several other techniques, like the method of principal curves [167, 264] or the nonlinear autoassociators [252, 325], that also give "nonlinear PCA". In these methods, the approximating subspace is a curved manifold, while the solution to the problem posed earlier is still a linear subspace. Only the coefficients corresponding to the principal components are nonlinear functions of $\mathbf{x}$. It should be noted that minimizing the criterion (12.25) does not give a smaller least mean-square error than standard PCA. Instead, the virtue of this criterion is that it introduces higher-order statistics in a simple manner via the nonlinearities $g_i$.

Before going into any deeper analysis of (12.25), it may be instructive to see in a simple special case how it differs from linear PCA and how it is in fact related to ICA.

If the functions $g_i(y)$ were linear, as in the standard PCA technique, and the number $n$ of terms in the sum were equal to $m$, the dimension of $\mathbf{x}$, then the representation error would always be zero, as long as the weight vectors are chosen orthonormal. For nonlinear functions $g_i(y)$, however, this is usually not true. Instead, in some cases at least, it turns out that the optimal basis vectors $\mathbf{w}_i$ minimizing (12.25) will be aligned with the independent components of the input vectors.
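This remark is easy to verify numerically. The sketch below (illustrative; any data and any orthonormal basis will do) evaluates the sample version of criterion (12.25) for a complete orthonormal basis: with linear $g_i$ the representation error is numerically zero, while with a nonlinear $g_i$ it is not.

```python
import numpy as np

rng = np.random.default_rng(5)

def J(W, X, g):
    """Sample version of (12.25): mean ||x - sum_i g(w_i^T x) w_i||^2.
    Rows of W are the basis vectors w_i; columns of X are observations."""
    recon = W.T @ g(W @ X)            # sum_i g(w_i^T x) w_i for every column x
    return np.mean(np.sum((X - recon) ** 2, axis=0))

X = rng.normal(size=(2, 50_000))      # any two-dimensional data
t = 0.7                               # any rotation gives an orthonormal basis
W = np.array([[np.cos(t),  np.sin(t)],
              [-np.sin(t), np.cos(t)]])

print(J(W, X, g=lambda u: u))         # linear g, complete basis: ~ 0
print(J(W, X, g=np.tanh))             # nonlinear g: strictly positive
```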

Example 12.2 Assume that $\mathbf{x}$ is a two-dimensional random vector that has a uniform density in a unit square that is not aligned with the coordinate axes $x_1, x_2$, according to Fig. 12.2. Then it is easily shown that the elements $x_1, x_2$ are uncorrelated and have equal variances (equal to 1/3), and the covariance matrix of $\mathbf{x}$ is therefore equal to $\frac{1}{3}\mathbf{I}$. Thus, except for the scaling by $1/3$, the vector $\mathbf{x}$ is whitened (sphered). However, the elements are not independent. The problem is to find a rotation $\mathbf{s} = \mathbf{W}\mathbf{x}$ of $\mathbf{x}$ such that the elements of the rotated vector $\mathbf{s}$ are statistically independent. It is obvious from Fig. 12.2 that the elements of $\mathbf{s}$ must be aligned with the orientation of the square, because then and only then is the joint density separable into the product of the two marginal uniform densities.

Because of the whitening, we know that the rows of the separating matrix $\mathbf{W}$ must be orthogonal. This is seen by writing

$$\mathrm{E}\{\mathbf{s}\mathbf{s}^T\} = \mathbf{W}\,\mathrm{E}\{\mathbf{x}\mathbf{x}^T\}\,\mathbf{W}^T = \tfrac{1}{3}\mathbf{W}\mathbf{W}^T \qquad (12.26)$$

Because the elements $s_1$ and $s_2$ are uncorrelated, it must hold that $\mathbf{w}_1^T\mathbf{w}_2 = 0$.

The solution minimizing the criterion (12.25), with $\mathbf{w}_1, \mathbf{w}_2$ orthogonal two-dimensional vectors and $g_1(\cdot) = g_2(\cdot) = g(\cdot)$ a suitable nonlinearity, now provides a rotation into independent components. This can be seen as follows. Assume that $g$ is a very sharp sigmoid, e.g., $g(y) = \tanh(10y)$, which is approximately the sign function. The term $\sum_{i=1}^{2} g(\mathbf{w}_i^T\mathbf{x})\mathbf{w}_i$ in criterion (12.25) becomes

$$\mathbf{w}_1\, g(\mathbf{w}_1^T\mathbf{x}) + \mathbf{w}_2\, g(\mathbf{w}_2^T\mathbf{x}) \approx \mathbf{w}_1\,\mathrm{sign}(\mathbf{w}_1^T\mathbf{x}) + \mathbf{w}_2\,\mathrm{sign}(\mathbf{w}_2^T\mathbf{x})$$

Thus, according to (12.25), each $\mathbf{x}$ should be optimally represented by one of the four possible points $\pm\mathbf{w}_1 \pm \mathbf{w}_2$. For instance, a vector $\mathbf{x}$ in the first quadrant, where the angles between $\mathbf{x}$ and the basis vectors are positive, is represented by the point $\mathbf{w}_1 + \mathbf{w}_2$. From Fig. 12.2, it can be seen that the optimal fit is obtained when the basis vectors $\mathbf{w}_1, \mathbf{w}_2$ are aligned with the orientation of the square, that is, with the directions of the independent components.
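Example 12.2 can be checked numerically with the short sketch below (not from the book; the square's orientation, the sample size, and the sharpness of the sigmoid are illustrative choices). Sweeping the orientation of the orthonormal basis shows that criterion (12.25) with $g(y) = \tanh(10y)$ is smallest when the basis is aligned with the square, i.e., with the independent components.

```python
import numpy as np

rng = np.random.default_rng(6)

# Uniform density in a square rotated by 30 degrees: the ICs lie at 30 and 120 degrees.
alpha = np.deg2rad(30)
R = np.array([[np.cos(alpha), -np.sin(alpha)],
              [np.sin(alpha),  np.cos(alpha)]])
x = R @ rng.uniform(-1.0, 1.0, size=(2, 100_000))

g = lambda u: np.tanh(10 * u)         # sharp sigmoid, roughly the sign function

def J(phi):
    """Criterion (12.25) for the orthonormal basis rotated by angle phi."""
    W = np.array([[np.cos(phi),  np.sin(phi)],
                  [-np.sin(phi), np.cos(phi)]])
    recon = W.T @ g(W @ x)
    return np.mean(np.sum((x - recon) ** 2, axis=0))

for deg in (0, 15, 30, 45, 60, 75):
    print(f"basis at {deg:2d} degrees: J = {J(np.deg2rad(deg)):.3f}")
# J is smallest at 30 degrees, where the basis is aligned with the square.
```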
