Title: ICA by Maximum Likelihood Estimation
Authors: Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Type: Book chapter
Year of publication: 2001
9 ICA by Maximum Likelihood Estimation

A very popular approach for estimating the independent component analysis (ICA) model is maximum likelihood (ML) estimation. Maximum likelihood estimation is a fundamental method of statistical estimation; a short introduction was provided in Section 4.5. One interpretation of ML estimation is that we take as estimates those parameter values that give the highest probability for the observations. In this section, we show how to apply ML estimation to ICA estimation. We also show its close connection to the neural network principle of maximization of information flow (infomax).

9.1 THE LIKELIHOOD OF THE ICA MODEL

9.1.1 Deriving the likelihood

It is not difficult to derive the likelihood in the noise-free ICA model. This is based on using the well-known result on the density of a linear transform, given in (2.82). According to this result, the density $p_x$ of the mixture vector

$$\mathbf{x} = \mathbf{A}\mathbf{s} \tag{9.1}$$

can be formulated as

$$p_x(\mathbf{x}) = |\det \mathbf{B}|\, p_s(\mathbf{s}) = |\det \mathbf{B}| \prod_i p_i(s_i) \tag{9.2}$$

where $\mathbf{B} = \mathbf{A}^{-1}$, and the $p_i$ denote the densities of the independent components. This can be expressed as a function of $\mathbf{B} = (\mathbf{b}_1, \ldots, \mathbf{b}_n)^T$ and $\mathbf{x}$, giving

$$p_x(\mathbf{x}) = |\det \mathbf{B}| \prod_i p_i(\mathbf{b}_i^T \mathbf{x}) \tag{9.3}$$

Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja. Copyright © 2001 John Wiley & Sons, Inc. ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)

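Assume that we have $T$ observations of $\mathbf{x}$, denoted by $\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(T)$. Then the likelihood can be obtained (see Section 4.5) as the product of this density evaluated at the $T$ points. This is denoted by $L$ and considered as a function of $\mathbf{B}$:

$$L(\mathbf{B}) = \prod_{t=1}^{T} \prod_{i=1}^{n} p_i(\mathbf{b}_i^T \mathbf{x}(t))\, |\det \mathbf{B}| \tag{9.4}$$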

Very often it is more practical to use the logarithm of the likelihood, since it is algebraically simpler. This does not make any difference here since the maximum of the logarithm is obtained at the same point as the maximum of the likelihood. The log-likelihood is given by

$$\log L(\mathbf{B}) = \sum_{t=1}^{T} \sum_{i=1}^{n} \log p_i(\mathbf{b}_i^T \mathbf{x}(t)) + T \log |\det \mathbf{B}| \tag{9.5}$$

The base of the logarithm makes no difference, though in the following the natural logarithm is used.

To simplify notation and to make it consistent with what was used in the previous chapter, we can denote the sum over the sample index $t$ by an expectation operator, and divide the likelihood by $T$ to obtain

$$\frac{1}{T} \log L(\mathbf{B}) = E\Bigl\{\sum_{i=1}^{n} \log p_i(\mathbf{b}_i^T \mathbf{x})\Bigr\} + \log |\det \mathbf{B}| \tag{9.6}$$

The expectation here is not the theoretical expectation, but an average computed from the observed sample. Of course, in the algorithms the expectations are eventually replaced by sample averages, so the distinction is purely theoretical.
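As an illustration, the sample-average form of the log-likelihood in (9.6) can be evaluated with a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function names, the data layout (one observation per column), and the default log-density $-2\log\cosh(s)$ (taken up to an additive constant, which does not move the maximum) are all assumptions made for the example.

```python
import numpy as np

def log_cosh(s):
    """Numerically stable log(cosh(s)), computed as logaddexp(s, -s) - log(2)."""
    return np.logaddexp(s, -s) - np.log(2.0)

def avg_log_likelihood(B, X, log_density=lambda s: -2.0 * log_cosh(s)):
    """Sample version of (9.6): mean_t sum_i log p_i(b_i^T x(t)) + log|det B|.

    B: (n, n) separating matrix.
    X: (n, T) data array, one observation x(t) per column (assumed layout).
    log_density: assumed log p_i, applied elementwise; the default is a
    hypothetical choice that is only defined up to an additive constant.
    """
    Y = B @ X                                   # y(t) = B x(t) for every t
    _, log_abs_det = np.linalg.slogdet(B)       # log|det B| without overflow
    return np.mean(np.sum(log_density(Y), axis=0)) + log_abs_det
```

Comparing `avg_log_likelihood(B1, X)` with `avg_log_likelihood(B2, X)` then tells which of two candidate separating matrices is more likely under the assumed densities.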

9.1.2 Estimation of the densities

Problem of semiparametric estimation. In the preceding, we have expressed the likelihood as a function of the parameters of the model, which are the elements of the mixing matrix. For simplicity, we used the elements of the inverse $\mathbf{B}$ of the mixing matrix. This is allowed since the mixing matrix can be directly computed from its inverse.

There is another thing to estimate in the ICA model, though: the densities of the independent components. Actually, the likelihood is a function of these densities as well. This makes the problem much more complicated, because the estimation of densities is, in general, a nonparametric problem. Nonparametric means that it cannot be reduced to the estimation of a finite parameter set. In fact, the number of parameters to be estimated is infinite, or in practice, very large. Thus the estimation of the ICA model also has a nonparametric part, which is why the estimation is sometimes called “semiparametric”.

Nonparametric estimation of densities is known to be a difficult problem. Many parameters are always more difficult to estimate than just a few; since nonparametric problems have an infinite number of parameters, they are the most difficult to estimate. This is why we would like to avoid the nonparametric density estimation in the ICA. There are two ways to avoid it.

First, in some cases we might know the densities of the independent components in advance, using some prior knowledge on the data at hand. In this case, we could simply use these prior densities in the likelihood. Then the likelihood would really be a function of $\mathbf{B}$ only. If reasonably small errors in the specification of these prior densities have little influence on the estimator, this procedure will give reasonable results. In fact, it will be shown below that this is the case.

A second way to solve the problem of density estimation is to approximate the densities of the independent components by a family of densities that are specified by a limited number of parameters. If the number of parameters in the density family needs to be very large, we do not gain much from this approach, since the goal was to reduce the number of parameters to be estimated. However, if it is possible to use a very simple family of densities to estimate the ICA model for any densities $p_i$, we will get a simple solution. Fortunately, this turns out to be the case. We can use an extremely simple parameterization of the $p_i$, consisting of the choice between two densities, i.e., a single binary parameter.

A simple density family. It turns out that in maximum likelihood estimation, it is enough to use just two approximations of the density of an independent component. For each independent component, we just need to determine which one of the two approximations is better. This shows that, first, we can make small errors when we fix the densities of the independent components, since it is enough that we use a density that is in the same half of the space of probability densities. Second, it shows that we can estimate the independent components using very simple models of their densities, in particular, using models consisting of only two densities.

This situation can be compared with the one encountered in Section 8.3.4, where we saw that any nonlinearity can be seen to divide the space of probability distributions in half. When the distribution of an independent component is in one of the halves, the nonlinearity can be used in the gradient method to estimate that independent component. When the distribution is in the other half, the negative of the nonlinearity must be used in the gradient method. In the ML case, a nonlinearity corresponds to a density approximation.

The validity of these approaches is shown in the following theorem, whose proof can be found in the appendix. This theorem is basically a corollary of the stability theorem in Section 8.3.4.

Theorem 9.1 Denote by $\tilde{p}_i$ the assumed densities of the independent components, and

$$g_i(s_i) = \frac{\partial}{\partial s_i} \log \tilde{p}_i(s_i) = \frac{\tilde{p}_i'(s_i)}{\tilde{p}_i(s_i)} \tag{9.7}$$

Constrain the estimates of the independent components $y_i = \mathbf{b}_i^T \mathbf{x}$ to be uncorrelated and to have unit variance. Then the ML estimator is locally consistent, if the assumed densities $\tilde{p}_i$ fulfill

$$E\{s_i g_i(s_i) - g_i'(s_i)\} > 0 \tag{9.8}$$

for all $i$.

This theorem shows rigorously that small misspecifications in the densities $p_i$ do not affect the local consistency of the ML estimator, since sufficiently small changes do not change the sign in (9.8).

Moreover, the theorem shows how to construct families consisting of only two densities, so that the condition in (9.8) is true for one of these densities. For example, consider the following log-densities:

$$\log \tilde{p}_i^{+}(s) = \alpha_1 - 2 \log\cosh(s) \tag{9.9}$$

$$\log \tilde{p}_i^{-}(s) = \alpha_2 - [s^2/2 - \log\cosh(s)] \tag{9.10}$$

where $\alpha_1, \alpha_2$ are constants that are fixed so as to make these two functions logarithms of probability densities. Actually, these constants can be ignored in the following. The factor 2 in (9.9) is not important, but it is usually used here; also, the factor 1/2 in (9.10) could be changed.

The motivation for these functions is that $\tilde{p}_i^{+}$ is a supergaussian density, because the log cosh function is close to the absolute value that would give the Laplacian density. The density given by $\tilde{p}_i^{-}$ is subgaussian, because it is like a gaussian log-density, $-s^2/2$ plus a constant, that has been somewhat “flattened” by the log cosh function.

Simple computations show that the value of the nonpolynomial moment in (9.8) is for $\tilde{p}_i^{+}$

$$2E\{-\tanh(s_i)s_i + (1 - \tanh(s_i)^2)\} \tag{9.11}$$

and for $\tilde{p}_i^{-}$ it is

$$E\{\tanh(s_i)s_i - (1 - \tanh(s_i)^2)\} \tag{9.12}$$

since the derivative of $\tanh(s)$ equals $1 - \tanh(s)^2$, and $E\{s_i^2\} = 1$ by definition.
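The computation behind these two moments can be written out step by step. The sketch below only uses the definition of $g_i$ as the derivative of the assumed log-density, the two log-densities $\alpha_1 - 2\log\cosh(s)$ and $\alpha_2 - [s^2/2 - \log\cosh(s)]$, and the unit-variance constraint $E\{s^2\} = 1$; the superscripts on $g^{+}$ and $g^{-}$ are notation chosen for the example.

```latex
\begin{align*}
g^{+}(s) &= \frac{d}{ds}\bigl[\alpha_1 - 2\log\cosh(s)\bigr] = -2\tanh(s),
  \qquad (g^{+})'(s) = -2\bigl(1 - \tanh(s)^2\bigr), \\
E\{s\,g^{+}(s) - (g^{+})'(s)\} &= 2E\{-\tanh(s)\,s + (1 - \tanh(s)^2)\}, \\[4pt]
g^{-}(s) &= \frac{d}{ds}\bigl[\alpha_2 - s^2/2 + \log\cosh(s)\bigr] = \tanh(s) - s,
  \qquad (g^{-})'(s) = -\tanh(s)^2, \\
E\{s\,g^{-}(s) - (g^{-})'(s)\} &= E\{\tanh(s)\,s - s^2 + \tanh(s)^2\}
  = E\{\tanh(s)\,s - (1 - \tanh(s)^2)\}.
\end{align*}
```

The last equality uses $E\{s^2\} = 1$. Up to the positive factor 2, the two expectations are negatives of each other.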

We see that the signs of these expressions are always opposite. Thus, for practically any distributions of the $s_i$, one of these functions fulfills the condition, i.e., has the desired sign, and estimation is possible. Of course, for some distribution of the $s_i$ the nonpolynomial moment in the condition could be zero, which corresponds to the case of zero kurtosis in cumulant-based estimation; such cases can be considered to be very rare.

Thus we can just compute the nonpolynomial moments for the two prior distributions in (9.9) and (9.10), and choose the one that fulfills the stability condition in (9.8). This can be done on-line during the maximization of the likelihood. This always provides a (locally) consistent estimator, and solves the problem of semiparametric estimation.
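The selection rule can be tried on synthetic data. The following sketch estimates the moment $E\{-\tanh(s)s + (1 - \tanh(s)^2)\}$ from a sample and picks the density whose condition it satisfies. The function names, the sample sizes, and the choice of Laplacian and uniform test distributions are assumptions made for illustration; the sample is standardized first because the theorem assumes unit variance.

```python
import numpy as np

def nonpolynomial_moment(s):
    """Sample estimate of E{-tanh(s) s + (1 - tanh(s)^2)} for a standardized sample."""
    s = np.asarray(s, dtype=float)
    s = (s - s.mean()) / s.std()        # zero mean, unit variance, as the theorem assumes
    t = np.tanh(s)
    return np.mean(-t * s + (1.0 - t ** 2))

def choose_density(s):
    """Return which of the two assumed densities satisfies the stability condition."""
    if nonpolynomial_moment(s) > 0:
        return "supergaussian density (9.9)"
    return "subgaussian density (9.10)"

rng = np.random.default_rng(0)
laplace_sample = rng.laplace(size=20000)             # a supergaussian example
uniform_sample = rng.uniform(-1.0, 1.0, size=20000)  # a subgaussian example

print(choose_density(laplace_sample))
print(choose_density(uniform_sample))
```

For a gaussian sample the moment is zero in expectation, so the estimate fluctuates around zero and neither choice is clearly indicated, which matches the remark about zero kurtosis.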

In fact, the nonpolynomial moment in question measures the shape of the density function in much the same way as kurtosis. For $g(s) = s^3$, we would actually obtain kurtosis. Thus, the choice of nonlinearity could be compared with the choice whether to minimize or maximize kurtosis, as previously encountered in Section 8.2. That choice was based on the value of the sign of kurtosis; here we use the sign of a nonpolynomial moment.

Indeed, the nonpolynomial moment of this chapter is the same as the one encountered in Section 8.3 when using more general measures of nongaussianity. However, it must be noted that the set of nonlinearities that we can use here is more restricted than those used in Chapter 8. This is because the nonlinearities $g_i$ used must correspond to the derivative of the logarithm of a probability density function (pdf). For example, we cannot use the function $g(s) = s^3$ because the corresponding pdf would be of the form $\exp(s^4/4)$, and this is not integrable, i.e., it is not a pdf at all.

9.2 ALGORITHMS FOR MAXIMUM LIKELIHOOD ESTIMATION

To perform maximum likelihood estimation in practice, we need an algorithm to perform the numerical maximization of likelihood. In this section, we discuss different methods to this end. First, we show how to derive simple gradient algorithms, of which especially the natural gradient algorithm has been widely used. Then we show how to derive a fixed-point algorithm, a version of FastICA, that maximizes the likelihood faster and more reliably.

9.2.1 Gradient algorithms

The Bell-Sejnowski algorithm. The simplest algorithms for maximizing likelihood are obtained by gradient methods. Using the well-known results in Chapter 3, one can easily derive the stochastic gradient of the log-likelihood in (9.6) as:

$$\frac{1}{T} \frac{\partial \log L}{\partial \mathbf{B}} = [\mathbf{B}^T]^{-1} + E\{\mathbf{g}(\mathbf{B}\mathbf{x})\mathbf{x}^T\} \tag{9.13}$$

Here, $\mathbf{g}(\mathbf{y}) = (g_1(y_1), \ldots, g_n(y_n))$ is a component-wise vector function that consists of the so-called (negative) score functions $g_i$ of the distributions of $s_i$, defined as

$$g_i = (\log p_i)' = \frac{p_i'}{p_i} \tag{9.14}$$
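One way to gain confidence in the gradient formula $[\mathbf{B}^T]^{-1} + E\{\mathbf{g}(\mathbf{B}\mathbf{x})\mathbf{x}^T\}$ is to compare it with finite differences of the averaged log-likelihood. The sketch below does this for the assumed log-density $-2\log\cosh(s)$ (up to a constant), whose score function is $-2\tanh$. All function names, the data layout (observations in columns), and the step size `eps` are choices made for the example.

```python
import numpy as np

def avg_loglik(B, X):
    """Averaged log-likelihood for the assumed log-density -2 log cosh(s) (constant dropped)."""
    Y = B @ X
    log_cosh = np.logaddexp(Y, -Y) - np.log(2.0)
    return np.mean(np.sum(-2.0 * log_cosh, axis=0)) + np.linalg.slogdet(B)[1]

def loglik_gradient(B, X):
    """Analytic gradient: [B^T]^{-1} + mean_t g(B x(t)) x(t)^T with g(y) = -2 tanh(y)."""
    T = X.shape[1]
    Y = B @ X
    return np.linalg.inv(B.T) + (-2.0 * np.tanh(Y)) @ X.T / T

def numerical_gradient(B, X, eps=1e-6):
    """Central finite differences of avg_loglik with respect to each entry of B."""
    G = np.zeros_like(B)
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            E = np.zeros_like(B)
            E[i, j] = eps
            G[i, j] = (avg_loglik(B + E, X) - avg_loglik(B - E, X)) / (2.0 * eps)
    return G
```

Calling both functions on the same random `B` and `X` should give matrices that agree to within the finite-difference error.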

This immediately gives the following algorithm for ML estimation:

$$\Delta\mathbf{B} \propto [\mathbf{B}^T]^{-1} + E\{\mathbf{g}(\mathbf{B}\mathbf{x})\mathbf{x}^T\} \tag{9.15}$$

A stochastic version of this algorithm could be used as well. This means that the expectation is omitted, and in each step of the algorithm, only one data point is used:

$$\Delta\mathbf{B} \propto [\mathbf{B}^T]^{-1} + \mathbf{g}(\mathbf{B}\mathbf{x})\mathbf{x}^T \tag{9.16}$$

This algorithm is often called the Bell-Sejnowski algorithm. It was first derived in [36], though from a different approach using the infomax principle that is explained in Section 9.3 below.
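A single stochastic step of this kind might look as follows. This is only a sketch: the learning rate `mu` (which turns the proportionality into an equality), the default nonlinearity $-2\tanh$, and the function name are assumptions, and no whitening is performed.

```python
import numpy as np

def bell_sejnowski_step(B, x, mu=0.01, g=lambda y: -2.0 * np.tanh(y)):
    """One stochastic update: B <- B + mu * ([B^T]^{-1} + g(Bx) x^T).

    B: (n, n) current separating matrix; x: (n,) a single observation.
    mu and g are hypothetical defaults chosen for this example.
    """
    y = B @ x
    return B + mu * (np.linalg.inv(B.T) + np.outer(g(y), x))
```

In use, the step would be applied repeatedly while cycling through the observations, for example `for x in X.T: B = bell_sejnowski_step(B, x)`. Note the matrix inversion inside every step.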

The algorithm in Eq. (9.15) converges very slowly, however, especially due to the inversion of the matrix $\mathbf{B}$ that is needed in every step. The convergence can be improved by whitening the data, and especially by using the natural gradient.

The natural gradient algorithm. The natural (or relative) gradient method simplifies the maximization of the likelihood considerably, and makes it better conditioned. The principle of the natural gradient is based on the geometrical structure of the parameter space, and is related to the principle of relative gradient, which uses the Lie group structure of the ICA problem. See Chapter 3 for more details. In the case of basic ICA, both of these principles amount to multiplying the right-hand side of (9.15) by $\mathbf{B}^T\mathbf{B}$. Thus we obtain

$$\Delta\mathbf{B} \propto (\mathbf{I} + E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\})\mathbf{B} \tag{9.17}$$
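The multiplication by $\mathbf{B}^T\mathbf{B}$ can be carried out explicitly. The following short derivation, a sketch using only $\mathbf{y} = \mathbf{B}\mathbf{x}$, shows why the matrix inverse disappears from the update:

```latex
\begin{align*}
\bigl([\mathbf{B}^T]^{-1} + E\{\mathbf{g}(\mathbf{B}\mathbf{x})\mathbf{x}^T\}\bigr)\mathbf{B}^T\mathbf{B}
&= [\mathbf{B}^T]^{-1}\mathbf{B}^T\mathbf{B} + E\{\mathbf{g}(\mathbf{y})\mathbf{x}^T\mathbf{B}^T\}\mathbf{B} \\
&= \mathbf{B} + E\{\mathbf{g}(\mathbf{y})(\mathbf{B}\mathbf{x})^T\}\mathbf{B} \\
&= (\mathbf{I} + E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\})\mathbf{B}.
\end{align*}
```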

Interestingly, this algorithm can be interpreted as nonlinear decorrelation. This principle will be treated in more detail in Chapter 12. The idea is that the algorithm converges when $E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\} = -\mathbf{I}$, which means that the $y_i$ and $g_j(y_j)$ are uncorrelated for $i \neq j$. This is a nonlinear extension of the ordinary requirement of uncorrelatedness, and, in fact, this algorithm is a special case of the nonlinear decorrelation algorithms to be introduced in Chapter 12.

In practice, one can use, for example, the two densities described in Section 9.1.2. For supergaussian independent components, the pdf defined by (9.9) is usually used. This means that the component-wise nonlinearity $g$ is the tanh function:

$$g^{+}(y) = -2\tanh(y) \tag{9.18}$$

For subgaussian independent components, other functions must be used. For example, one could use the pdf in (9.10), which leads to

$$g^{-}(y) = \tanh(y) - y \tag{9.19}$$

(Another possibility is to use $g(y) = -y^3$ for subgaussian components.) These nonlinearities are illustrated in Fig. 9.1.
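The two nonlinearities are simply the derivatives of the two log-densities of Section 9.1.2, which can be checked numerically. The sketch below assumes $g^{+}(y) = -2\tanh(y)$ and $g^{-}(y) = \tanh(y) - y$ and compares them with central differences of $-2\log\cosh(s)$ and $-[s^2/2 - \log\cosh(s)]$ (the additive constants are dropped, since they vanish under differentiation). The function names, grid, and step `h` are choices made for the example.

```python
import numpy as np

def log_cosh(s):
    """Numerically stable log(cosh(s))."""
    return np.logaddexp(s, -s) - np.log(2.0)

def log_p_plus(s):
    """Supergaussian log-density of (9.9), additive constant omitted."""
    return -2.0 * log_cosh(s)

def log_p_minus(s):
    """Subgaussian log-density of (9.10), additive constant omitted."""
    return -(s ** 2 / 2.0 - log_cosh(s))

def g_plus(y):
    return -2.0 * np.tanh(y)

def g_minus(y):
    return np.tanh(y) - y

y = np.linspace(-3.0, 3.0, 13)
h = 1e-5
num_plus = (log_p_plus(y + h) - log_p_plus(y - h)) / (2.0 * h)     # d/dy log p+
num_minus = (log_p_minus(y + h) - log_p_minus(y - h)) / (2.0 * h)  # d/dy log p-
```

Both `num_plus` and `num_minus` should match `g_plus(y)` and `g_minus(y)` to within the finite-difference error.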

The choice between the two nonlinearities in (9.18) and (9.19) can be made by computing the nonpolynomial moment:

$$E\{-\tanh(s_i)s_i + (1 - \tanh(s_i)^2)\} \tag{9.20}$$

using some estimates of the independent components. If this nonpolynomial moment is positive, the nonlinearity in (9.18) should be used, otherwise the nonlinearity in (9.19) should be used. This is because of the condition in Theorem 9.1.

Fig. 9.1 The functions $g^{+}$ in Eq. (9.18) and $g^{-}$ in Eq. (9.19), given by the solid line and the dashed line, respectively.

The choice of nonlinearity can be made while running the gradient algorithm, using the running estimates of the independent components to estimate the nature of the independent components (that is, the sign of the nonpolynomial moment). Note that the use of the nonpolynomial moment requires that the estimates of the independent components are first scaled properly, constraining them to unit variance, as in the theorem. Such normalizations are often omitted in practice, which may in some cases lead to situations in which the wrong nonlinearity is chosen.

The resulting algorithm is recapitulated in Table 9.1. In this version, whitening and the above-mentioned normalization in the estimation of the nonpolynomial moments are omitted; in practice, these may be very useful.
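The steps of Table 9.1 can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions rather than a tuned implementation: the names `mu` and `mu_gamma` for the two learning rates, `gamma` for the running moment estimates, the identity initialization of $\mathbf{B}$, the fixed number of passes over the data in place of a convergence test, and the use of a single sample in place of the expectation in step 4(a) are all choices made here. As in the table, whitening and the unit-variance normalization of the moment estimate are omitted, so the nonlinearity chosen in step 4(b) can be wrong in some cases.

```python
import numpy as np

def natural_gradient_ml(X, mu=0.01, mu_gamma=0.01, n_epochs=50, seed=0):
    """On-line natural gradient ML estimation, following the steps of Table 9.1.

    X: (n, T) data array, one observation per column (assumed layout).
    Returns the separating matrix B and the running moment estimates gamma.
    """
    rng = np.random.default_rng(seed)
    n, T = X.shape
    X = X - X.mean(axis=1, keepdims=True)      # step 1: center the data
    B = np.eye(n)                              # step 2: initial B (identity, an assumption)
    gamma = np.ones(n)                         # step 2: start as if all were supergaussian
    for _ in range(n_epochs):                  # fixed passes instead of a convergence test
        for t in rng.permutation(T):
            x = X[:, t]
            y = B @ x                          # step 3
            th = np.tanh(y)
            # step 4(a): running estimate of the moment, one sample per update
            gamma = (1.0 - mu_gamma) * gamma + mu_gamma * (-th * y + (1.0 - th ** 2))
            # step 4(b): g+ = -2 tanh(y) where gamma_i > 0, else g- = tanh(y) - y
            g = np.where(gamma > 0, -2.0 * th, th - y)
            # step 5: B <- B + mu [I + g(y) y^T] B
            B = B + mu * (np.eye(n) + np.outer(g, y)) @ B
    return B, gamma
```

A typical call would be `B, gamma = natural_gradient_ml(X)`, after which `B @ X` gives the estimated components up to order, sign, and scale. The learning rates usually need adjusting to the scale of the data.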

9.2.2 A fast fixed-point algorithm

Likelihood can be maximized by a fixed-point algorithm as well. The fixed-point algorithm given by FastICA is a very fast and reliable maximization method that was introduced in Chapter 8 to maximize the measures of nongaussianity used for ICA estimation. Actually, the FastICA algorithm can be directly applied to maximization of the likelihood.

The FastICA algorithm was derived in Chapter 8 for optimization of $E\{G(\mathbf{w}^T\mathbf{z})\}$ under the constraint of the unit norm of $\mathbf{w}$. In fact, maximization of likelihood gives us an almost identical optimization problem, if we constrain the estimates of the independent components to be white (see Chapter 7). In particular, this implies that the term $\log|\det \mathbf{W}|$ is constant, as proven in the Appendix, and thus the likelihood basically consists of the sum of terms of the form optimized by FastICA. Thus we could use directly the same kind of derivation of fixed-point iteration as used in Chapter 8.

1. Center the data to make its mean zero.
2. Choose an initial (e.g., random) separating matrix $\mathbf{B}$. Choose initial values of $\gamma_i$, $i = 1, \ldots, n$, either randomly or using prior information. Choose the learning rates $\mu$ and $\mu_\gamma$.
3. Compute $\mathbf{y} = \mathbf{B}\mathbf{x}$.
4. If the nonlinearities are not fixed a priori:
   (a) update $\gamma_i = (1 - \mu_\gamma)\gamma_i + \mu_\gamma E\{-\tanh(y_i)y_i + (1 - \tanh(y_i)^2)\}$
   (b) if $\gamma_i > 0$, define $g_i$ as in (9.18), otherwise define it as in (9.19).
5. Update the separating matrix by
   $$\mathbf{B} \leftarrow \mathbf{B} + \mu[\mathbf{I} + \mathbf{g}(\mathbf{y})\mathbf{y}^T]\mathbf{B} \tag{9.21}$$
   where $\mathbf{g}(\mathbf{y}) = (g_1(y_1), \ldots, g_n(y_n))^T$.
6. If not converged, go back to step 3.

Table 9.1 The on-line stochastic natural gradient algorithm for maximum likelihood estimation. Preliminary whitening is not shown here, but in practice it is highly recommended.

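In Eq. (8.42) in Chapter 8 we had the following form of the FastICA algorithm (for whitened data):

$$\mathbf{w} \leftarrow \mathbf{w} - [E\{\mathbf{z}g(\mathbf{w}^T\mathbf{z})\} + \beta\mathbf{w}]/[E\{g'(\mathbf{w}^T\mathbf{z})\} + \beta] \tag{9.22}$$

where $\beta$ can be computed from (8.40) as $\beta = -E\{y_i g(y_i)\}$. If we write this in matrix form, we obtain:

$$\mathbf{W} \leftarrow \mathbf{W} + \mathrm{diag}(\alpha_i)[\mathrm{diag}(\beta_i) + E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}]\mathbf{W} \tag{9.23}$$

where $\alpha_i$ is defined as $-1/(E\{g'(\mathbf{w}^T\mathbf{z})\} + \beta_i)$, and $\mathbf{y} = \mathbf{W}\mathbf{z}$. To express this using nonwhitened data, as we have done in this chapter, it is enough to multiply both sides of (9.23) from the right by the whitening matrix. This means simply that we replace the $\mathbf{W}$ by $\mathbf{B}$, since we have $\mathbf{W}\mathbf{z} = \mathbf{W}\mathbf{V}\mathbf{x}$ which implies $\mathbf{B} = \mathbf{W}\mathbf{V}$.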
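Written out, the right-multiplication by the whitening matrix is a one-line step. In this sketch $\mathbf{V}$ denotes the whitening matrix with $\mathbf{z} = \mathbf{V}\mathbf{x}$, and $\alpha_i$, $\beta_i$ denote the two sets of diagonal coefficients of the matrix iteration; these symbol names are assumptions of the example.

```latex
\begin{align*}
\mathbf{W}\mathbf{V} &\leftarrow \mathbf{W}\mathbf{V}
  + \mathrm{diag}(\alpha_i)\bigl[\mathrm{diag}(\beta_i) + E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}\bigr]\mathbf{W}\mathbf{V}, \\
\mathbf{y} &= \mathbf{W}\mathbf{z} = \mathbf{W}\mathbf{V}\mathbf{x} = \mathbf{B}\mathbf{x}
  \quad\text{with}\quad \mathbf{B} = \mathbf{W}\mathbf{V}.
\end{align*}
```

The bracketed factor depends on the data only through $\mathbf{y}$, so it is unchanged by the substitution, and every occurrence of $\mathbf{W}\mathbf{V}$ can be read as $\mathbf{B}$.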

Thus, we obtain the basic iteration of FastICA as:

$$\mathbf{B} \leftarrow \mathbf{B} + \mathrm{diag}(\alpha_i)[\mathrm{diag}(\beta_i) + E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}]\mathbf{B} \tag{9.24}$$

where $\mathbf{y} = \mathbf{B}\mathbf{x}$, $\beta_i = -E\{y_i g(y_i)\}$, and $\alpha_i = -1/(\beta_i + E\{g'(y_i)\})$. After every step, the matrix $\mathbf{B}$ must be projected on the set of whitening matrices. This can be accomplished by the classic method involving matrix square roots,

$$\mathbf{B} \leftarrow (\mathbf{B}\mathbf{C}\mathbf{B}^T)^{-1/2}\mathbf{B} \tag{9.25}$$

where $\mathbf{C} = E\{\mathbf{x}\mathbf{x}^T\}$ is the correlation matrix of the data (see exercises). The inverse square root is obtained as in (7.20). For alternative methods, see Section 8.4 and Chapter 6, but note that those algorithms require that the data is prewhitened, since they simply orthogonalize the matrix.
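The projection onto the set of whitening matrices can be sketched as below. The inverse square root is computed here from an eigenvalue decomposition of the symmetric matrix $\mathbf{B}\mathbf{C}\mathbf{B}^T$, which is assumed (not confirmed by the text at hand) to be the method that (7.20) refers to; the function names are likewise chosen for the example.

```python
import numpy as np

def inv_sqrt_sym(M):
    """Inverse square root of a symmetric positive definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def decorrelate(B, C):
    """Projection B <- (B C B^T)^{-1/2} B, after which B C B^T equals the identity."""
    return inv_sqrt_sym(B @ C @ B.T) @ B
```

After the call, the outputs $\mathbf{y} = \mathbf{B}\mathbf{x}$ are uncorrelated and have unit variance with respect to the sample correlation matrix `C`, which can be confirmed by checking that `B @ C @ B.T` is close to the identity.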

This version of FastICA is recapitulated in Table 9.2. FastICA could be compared with the natural gradient method for maximizing likelihood given in (9.17). Then we see that FastICA can be considered as a computationally optimized version of the gradient algorithm. In FastICA, convergence speed is optimized by the choice of the matrices $\mathrm{diag}(\alpha_i)$ and $\mathrm{diag}(\beta_i)$. These two matrices give an optimal step size to be used in the algorithm.

Another advantage of FastICA is that it can estimate both sub- and supergaussian independent components without any additional steps: We can fix the nonlinearity to be equal to the tanh nonlinearity for all the independent components. The reason is clear from (9.24): The matrix $\mathrm{diag}(\beta_i)$ contains estimates on the nature (sub- or supergaussian) of the independent components. These estimates are used as in the gradient algorithm in the previous subsection. On the other hand, the matrix $\mathrm{diag}(\alpha_i)$ can be considered as a scaling of the nonlinearities, since we could reformulate $[\mathrm{diag}(\beta_i) + E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}] = \mathrm{diag}(\beta_i)[\mathbf{I} + \mathrm{diag}(\beta_i^{-1})E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}]$. Thus we can say that FastICA uses a richer parameterization of the densities than that used in Section 9.1.2: a parameterized family instead of just two densities.

Note that in FastICA, the outputs $y_i$ are decorrelated and normalized to unit variance after every step. No such operations are needed in the gradient algorithm. FastICA is not stable if these additional operations are omitted. Thus the optimization space is slightly reduced.

In the version given here, no preliminary whitening is done. In practice, it is often highly recommended to do prewhitening, possibly combined with PCA dimension reduction.

9.3 THE INFOMAX PRINCIPLE

An estimation principle for ICA that is very closely related to maximum likelihood is the infomax principle [282, 36]. This is based on maximizing the output entropy, or information flow, of a neural network with nonlinear outputs. Hence the name infomax.

Assume that $\mathbf{x}$ is the input to the neural network whose outputs are of the form

$$y_i = \phi_i(\mathbf{b}_i^T\mathbf{x}) + \mathbf{n}$$

where the $\phi_i$ are some nonlinear scalar functions, and the $\mathbf{b}_i$ are the weight vectors of the neurons. The vector $\mathbf{n}$ is additive gaussian white noise. One then wants to maximize the entropy of the outputs:

$$H(\mathbf{y}) = H(\phi_1(\mathbf{b}_1^T\mathbf{x}), \ldots, \phi_n(\mathbf{b}_n^T\mathbf{x}))$$

This can be motivated by considering information flow in a neural network. Efficient information transmission requires that we maximize the mutual information between

1. Center the data to make its mean zero. Compute correlation matrix $\mathbf{C} = E\{\mathbf{x}\mathbf{x}^T\}$.
2. Choose an initial (e.g., random) separating matrix $\mathbf{B}$.
3. Compute
   $$\beta_i = -E\{y_i g(y_i)\}, \quad \text{for } i = 1, \ldots, n \tag{9.27}$$
   $$\alpha_i = -1/(\beta_i + E\{g'(y_i)\}), \quad \text{for } i = 1, \ldots, n \tag{9.28}$$
4. Update the separating matrix by
   $$\mathbf{B} \leftarrow \mathbf{B} + \mathrm{diag}(\alpha_i)[\mathrm{diag}(\beta_i) + E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}]\mathbf{B} \tag{9.29}$$
5. Decorrelate and normalize by
   $$\mathbf{B} \leftarrow (\mathbf{B}\mathbf{C}\mathbf{B}^T)^{-1/2}\mathbf{B}$$
6. If not converged, go back to step 3.

Table 9.2 The FastICA algorithm for maximum likelihood estimation. This is a version without whitening; in practice, whitening combined with PCA may often be useful. The nonlinear function $g$ is typically the tanh function.
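The steps of Table 9.2 translate fairly directly into NumPy. The sketch below is one possible reading under stated assumptions, not a reference implementation: it takes $\beta_i = -E\{y_i g(y_i)\}$ and $\alpha_i = -1/(\beta_i + E\{g'(y_i)\})$ with $g = \tanh$, computes the inverse square root by eigendecomposition, projects the random initial matrix once before iterating, and runs a fixed number of iterations instead of testing convergence. The function names and the data layout (observations in columns) are choices made for the example.

```python
import numpy as np

def decorrelate(B, C):
    """Step 5: B <- (B C B^T)^{-1/2} B, with the inverse square root via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(B @ C @ B.T)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T @ B

def fastica_ml(X, n_iter=100, seed=0):
    """FastICA for ML estimation on nonwhitened data, following the steps of Table 9.2.

    X: (n, T) data array, one observation per column (assumed layout).
    """
    rng = np.random.default_rng(seed)
    n, T = X.shape
    X = X - X.mean(axis=1, keepdims=True)               # step 1: center
    C = X @ X.T / T                                     # step 1: correlation matrix
    B = decorrelate(rng.standard_normal((n, n)), C)     # step 2: random start, projected once
    for _ in range(n_iter):                             # fixed iterations, no convergence test
        Y = B @ X
        gY = np.tanh(Y)                                 # g = tanh
        beta = -np.mean(Y * gY, axis=1)                 # beta_i = -E{y_i g(y_i)}
        alpha = -1.0 / (beta + np.mean(1.0 - gY ** 2, axis=1))  # alpha_i, using g' = 1 - tanh^2
        # step 4: B <- B + diag(alpha) [diag(beta) + E{g(y) y^T}] B
        B = B + np.diag(alpha) @ (np.diag(beta) + gY @ Y.T / T) @ B
        B = decorrelate(B, C)                           # step 5
    return B
```

Because of the final projection, the returned matrix always satisfies $\mathbf{B}\mathbf{C}\mathbf{B}^T = \mathbf{I}$ for the sample correlation matrix. Whether the rows have also converged to the independent components should be checked separately, for instance by monitoring the change in $\mathbf{B}$ between iterations up to sign.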
