18 Methods Using Time Structure
The model of independent component analysis (ICA) that we have considered so far consists of mixing independent random variables, usually linearly. In many applications, however, what is mixed is not a set of random variables but a set of time signals, or time series. This is in contrast to the basic ICA model, in which the samples of $\mathbf{x}$ have no particular order: we could shuffle them in any way we like, and this would have no effect on the validity of the model, nor on the estimation methods we have discussed. If the independent components (ICs) are time signals, the situation is quite different.

In fact, if the ICs are time signals, they may contain much more structure than simple random variables. For example, the autocovariances (covariances over different time lags) of the ICs are then well-defined statistics. One can use such additional statistics to improve the estimation of the model. This additional information can actually make estimation possible in cases where the basic ICA methods cannot estimate the model, for example, if the ICs are gaussian but correlated over time.
In this chapter, we consider the estimation of the ICA model when the ICs are time signals $s_i(t)$, $t = 1, \ldots, T$, where $t$ is the time index. In the previous chapters, we denoted by $t$ the sample index, but here $t$ has a more precise meaning, since it defines an ordering of the samples. The model is then expressed by

$$\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t) \qquad (18.1)$$

where $\mathbf{A}$ is assumed to be square as usual, and the ICs are of course independent. In contrast to the basic model, however, the ICs need not be nongaussian.
In the following, we shall make some assumptions on the time structure of the ICs that allow for the estimation of the model. These assumptions are alternatives to the assumption of nongaussianity made in the other chapters of this book.
First, we shall assume that the ICs have different autocovariances (and, in particular, that these are different from zero). Second, we shall consider the case where the variances of the ICs are nonstationary. Finally, we discuss Kolmogoroff complexity as a general framework for ICA with time-correlated mixtures.
We do not consider here the case where it is the mixing matrix that changes in time; see [354].
18.1 SEPARATION BY AUTOCOVARIANCES
18.1.1 Autocovariances as an alternative to nongaussianity
The simplest form of time structure is given by (linear) autocovariances. These are covariances between the values of the signal at different time points: $\mathrm{cov}(x_i(t), x_i(t-\tau))$, where $\tau$ is some lag constant, $\tau = 1, 2, 3, \ldots$. If the data has time dependencies, the autocovariances are often different from zero.

In addition to the autocovariances of one signal, we also need the covariances between two signals: $\mathrm{cov}(x_i(t), x_j(t-\tau))$ with $i \neq j$. All these statistics for a given time lag $\tau$ can be grouped together in the time-lagged covariance matrix

$$\mathbf{C}_\tau^x = E\{\mathbf{x}(t)\,\mathbf{x}(t-\tau)^T\} \qquad (18.2)$$
The theory of time-dependent signals was briefly discussed in Section 2.8.
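In practice, $\mathbf{C}_\tau^x$ is estimated by a sample average. The following minimal numpy sketch (the function name and conventions are our own, not from the text) illustrates the estimator for zero-mean data:

    import numpy as np

    def lagged_cov(X, tau):
        """Estimate the lagged covariance C_tau = E{x(t) x(t-tau)^T}.

        X: (n, T) array of zero-mean signals, one signal per row; tau >= 0.
        """
        T = X.shape[1]
        # Pair x(t) with x(t - tau) by shifting one copy of the data
        return X[:, tau:] @ X[:, :T - tau].T / (T - tau)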
As we saw in Chapter 7, the problem in ICA is that the simple zero-lagged covariance (or correlation) matrix $\mathbf{C}$ does not contain enough parameters to allow the estimation of $\mathbf{A}$. This means that simply finding a matrix $\mathbf{V}$ so that the components of the vector

$$\mathbf{y}(t) = \mathbf{V}\mathbf{x}(t) \qquad (18.3)$$

are white is not enough to estimate the independent components, because there is an infinity of different matrices $\mathbf{V}$ that give decorrelated components. This is why in basic ICA we have to use the nongaussian structure of the independent components, for example, by minimizing the higher-order dependencies as measured by mutual information.
The key point here is that the information in a time-lagged covariance matrix $\mathbf{C}_\tau^x$ can be used instead of the higher-order information [424, 303]. What we do is to find a matrix $\mathbf{B}$ so that, in addition to making the instantaneous covariances of $\mathbf{y}(t) = \mathbf{B}\mathbf{x}(t)$ go to zero, the lagged covariances are made zero as well:

$$E\{y_i(t)\,y_j(t-\tau)\} = 0 \quad \text{for all } i \neq j \qquad (18.4)$$

The motivation for this is that for the ICs $s_i(t)$, the lagged cross-covariances are all zero due to independence. Using these lagged covariances, we get enough extra information to estimate the model, under certain conditions specified below. No higher-order information is then needed.
18.1.2 Using one time lag
In the simplest case, we can use just one time lag. Denote by $\tau$ such a time lag; it is very often taken equal to 1. A very simple algorithm can now be formulated to find a matrix that cancels both the instantaneous covariances and the ones corresponding to lag $\tau$.
Consider whitened data (see Chapter 6), denoted by $\mathbf{z}$. Then we have for the orthogonal separating matrix $\mathbf{W}$:

$$\mathbf{s}(t) = \mathbf{W}\mathbf{z}(t) \qquad (18.5)$$

Let us consider a slightly modified version of the lagged covariance matrix as defined in (18.2), given by

$$\bar{\mathbf{C}}_\tau^z = \frac{1}{2}\left[\mathbf{C}_\tau^z + (\mathbf{C}_\tau^z)^T\right] \qquad (18.6)$$
We have by linearity and orthogonality the relation

$$\bar{\mathbf{C}}_\tau^z = \frac{1}{2}\mathbf{W}^T\left[E\{\mathbf{s}(t)\mathbf{s}(t-\tau)^T\} + E\{\mathbf{s}(t-\tau)\mathbf{s}(t)^T\}\right]\mathbf{W} = \mathbf{W}^T \bar{\mathbf{C}}_\tau^s \mathbf{W} \qquad (18.8)$$

Due to the independence of the $s_i(t)$, the time-lagged covariance matrix $\mathbf{C}_\tau^s = E\{\mathbf{s}(t)\mathbf{s}(t-\tau)^T\}$ is diagonal; let us denote it by $\mathbf{D}$. Clearly, the symmetrized matrix $\bar{\mathbf{C}}_\tau^s$ equals this same matrix. Thus we have
$$\bar{\mathbf{C}}_\tau^z = \mathbf{W}^T \mathbf{D} \mathbf{W} \qquad (18.9)$$

What this equation shows is that the matrix $\mathbf{W}$ is part of the eigenvalue decomposition of $\bar{\mathbf{C}}_\tau^z$. The eigenvalue decomposition of this symmetric matrix is simple to compute. In fact, the reason why we considered this matrix instead of the simple time-lagged covariance matrix (as in [303]) is precisely that we wanted a symmetric matrix, because then the eigenvalue decomposition is well defined and simple to compute. (It is actually true that the lagged covariance matrix is symmetric if the data exactly follows the ICA model, but estimates of such matrices are not symmetric.)
The AMUSE algorithm. Thus we have a simple algorithm, called AMUSE [424], for estimating the separating matrix $\mathbf{W}$ for whitened data:

1. Whiten the (zero-mean) data $\mathbf{x}$ to obtain $\mathbf{z}(t)$.
2. Compute the eigenvalue decomposition of $\bar{\mathbf{C}}_\tau^z = \frac{1}{2}[\mathbf{C}_\tau^z + (\mathbf{C}_\tau^z)^T]$, where $\mathbf{C}_\tau^z = E\{\mathbf{z}(t)\mathbf{z}(t-\tau)^T\}$ is the time-lagged covariance matrix, for some lag $\tau$.
3. The rows of the separating matrix $\mathbf{W}$ are given by the eigenvectors.

An essentially similar algorithm was proposed in [303].
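To make the steps concrete, here is a minimal numpy sketch of AMUSE; the PCA-based whitening and all helper names are our own choices, and a production implementation would add checks for near-equal eigenvalues, as discussed below:

    import numpy as np

    def amuse(X, tau=1):
        """AMUSE: blind separation using one time-lagged covariance.

        X: (n, T) array of observed mixtures. Returns (B, S), where B is
        the separating matrix for the original data and S = B X recovers
        the ICs up to permutation and sign, assuming the eigenvalues in
        step 2 are distinct.
        """
        X = X - X.mean(axis=1, keepdims=True)
        # Step 1: whiten the data by PCA
        d, E = np.linalg.eigh(np.cov(X))
        V = E @ np.diag(d ** -0.5) @ E.T        # whitening matrix
        Z = V @ X
        # Step 2: eigendecomposition of the symmetrized lagged covariance
        T = Z.shape[1]
        C = Z[:, tau:] @ Z[:, :T - tau].T / (T - tau)
        _, W = np.linalg.eigh((C + C.T) / 2)
        # Step 3: rows of the separating matrix are the eigenvectors
        W = W.T
        return W @ V, W @ Z

On data generated as $\mathbf{X} = \mathbf{A}\mathbf{S}$ from sources with distinct lag-$\tau$ autocovariances, this recovers the rows of $\mathbf{S}$ up to scaling, sign, and order.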
This algorithm is very simple and fast to compute. The problem, however, is that it only works when the eigenvectors of the matrix $\bar{\mathbf{C}}_\tau^z$ are uniquely defined. This is the case if the eigenvalues are all distinct (not equal to each other). If some of the eigenvalues are equal, the corresponding eigenvectors are not uniquely defined, and the corresponding ICs cannot be estimated. This restricts the applicability of the method considerably. The eigenvalues are given by $\mathrm{cov}(s_i(t), s_i(t-\tau))$, and thus they are distinct if and only if the lagged covariances are different for all the ICs.

As a remedy to this restriction, one can search for a suitable time lag $\tau$ so that the eigenvalues are distinct, but this is not always possible: if the signals $s_i(t)$ have identical power spectra, that is, identical autocovariances, then no value of $\tau$ makes estimation possible.
18.1.3 Extension to several time lags
An extension of the AMUSE method that improves its performance is to consider several time lags instead of a single one. Then it is enough that the covariances are distinct for just one of these time lags, so the choice of $\tau$ becomes a less serious problem.

In principle, using several time lags, we want to simultaneously diagonalize all the corresponding lagged covariance matrices. It must be noted that an exact diagonalization is in general not possible, since the eigenvectors of the different covariance matrices are unlikely to be identical, except in the theoretical case where the data is exactly generated by the ICA model. So here we formulate functions that express the degree of diagonalization obtained, and find their optima.
One simple way of measuring the nondiagonality of a matrix $\mathbf{M}$ is to use the operator

$$\mathrm{off}(\mathbf{M}) = \sum_{i \neq j} m_{ij}^2 \qquad (18.10)$$
which gives the sum of squares of the off-diagonal elements of $\mathbf{M}$. What we now want to do is to minimize the sum of the off-diagonal elements of several lagged covariances of $\mathbf{y} = \mathbf{W}\mathbf{z}$. As before, we use the symmetric version $\bar{\mathbf{C}}_\tau^y$ of the lagged covariance matrix. Denote by $S$ the set of chosen lags $\tau$. Then we can write this as an objective function $J_1(\mathbf{W})$:

$$J_1(\mathbf{W}) = \sum_{\tau \in S} \mathrm{off}(\mathbf{W} \bar{\mathbf{C}}_\tau^z \mathbf{W}^T) \qquad (18.11)$$

Minimizing $J_1$ under the constraint that $\mathbf{W}$ is orthogonal gives us the estimation method. This minimization could be performed by (projected) gradient descent. Another alternative is to adapt existing methods for eigenvalue decomposition to this simultaneous approximate diagonalization of several matrices. The algorithm called SOBI (second-order blind identification) [43] is based on these principles, and so is TDSEP [481].
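To make the objective concrete, the following sketch (helper names are ours) evaluates the off-diagonality measure (18.10) and the multi-lag objective (18.11) for whitened data; SOBI and TDSEP minimize such a criterion by Jacobi-type rotation techniques, which we do not reproduce here:

    import numpy as np

    def off(M):
        """Sum of squares of the off-diagonal elements, eq. (18.10)."""
        return np.sum(M ** 2) - np.sum(np.diag(M) ** 2)

    def J1(W, Z, lags):
        """Multi-lag off-diagonality J_1(W), eq. (18.11), for whitened Z."""
        T = Z.shape[1]
        total = 0.0
        for tau in lags:
            C = Z[:, tau:] @ Z[:, :T - tau].T / (T - tau)
            C_bar = (C + C.T) / 2               # symmetrized, as in (18.6)
            total += off(W @ C_bar @ W.T)
        return total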
The criterion $J_1$ can be simplified. For an orthogonal transformation $\mathbf{W}$, the sum of the squares of the elements of $\mathbf{W}\mathbf{M}\mathbf{W}^T$ is constant.¹ Thus, the "off" criterion can be expressed as the total sum of squares minus the sum of squares on the diagonal, so we can instead minimize

$$J_2(\mathbf{W}) = -\sum_{\tau \in S} \sum_i (\mathbf{w}_i^T \bar{\mathbf{C}}_\tau^z \mathbf{w}_i)^2 \qquad (18.12)$$

where the $\mathbf{w}_i^T$ are the rows of $\mathbf{W}$. Minimizing $J_2$ is thus equivalent to minimizing $J_1$.

¹ This is because the sum of squares equals $\mathrm{trace}(\mathbf{W}\mathbf{M}\mathbf{W}^T(\mathbf{W}\mathbf{M}\mathbf{W}^T)^T) = \mathrm{trace}(\mathbf{W}\mathbf{M}\mathbf{M}^T\mathbf{W}^T) = \mathrm{trace}(\mathbf{M}\mathbf{M}^T)$, which does not depend on the orthogonal matrix $\mathbf{W}$.
An alternative method for measuring the diagonality can be obtained using the approach in [240]. For any positive-definite matrix $\mathbf{M}$, we have

$$\sum_i \log m_{ii} \geq \log|\det \mathbf{M}| \qquad (18.13)$$

and equality holds only for diagonal $\mathbf{M}$. Thus, we could measure the nondiagonality of $\mathbf{M}$ by

$$F(\mathbf{M}) = \sum_i \log m_{ii} - \log|\det \mathbf{M}| \qquad (18.14)$$
Again, the total nondiagonality of the $\bar{\mathbf{C}}_\tau^y$ at different time lags can be measured by the sum of these measures over the lags. This gives us the following objective function to minimize:

$$J_3(\mathbf{W}) = \frac{1}{2}\sum_{\tau \in S} F(\bar{\mathbf{C}}_\tau^y) = \frac{1}{2}\sum_{\tau \in S} F(\mathbf{W}\bar{\mathbf{C}}_\tau^z\mathbf{W}^T) \qquad (18.15)$$
Just as in maximum likelihood (ML) estimation, $\mathbf{W}$ decouples from the term involving the logarithm of the determinant. We obtain

$$J_3(\mathbf{W}) = \sum_{\tau \in S}\left[\sum_i \frac{1}{2}\log(\mathbf{w}_i^T \bar{\mathbf{C}}_\tau^z \mathbf{w}_i) - \log|\det \mathbf{W}| - \frac{1}{2}\log|\det \bar{\mathbf{C}}_\tau^z|\right] \qquad (18.16)$$

Considering whitened data, in which case $\mathbf{W}$ can be constrained orthogonal, we see that the terms involving the determinants are constant, and we finally have

$$J_3(\mathbf{W}) = \sum_{\tau \in S}\sum_i \frac{1}{2}\log(\mathbf{w}_i^T \bar{\mathbf{C}}_\tau^z \mathbf{w}_i) + \text{const.} \qquad (18.17)$$
This is in fact rather similar to the function $J_2$ in (18.12). The only difference is that the function $-u^2$ has been replaced by $\frac{1}{2}\log u$. What these functions have in common is concavity, so one might speculate that many other concave functions could be used as well.
The gradient of $J_3$ can be evaluated as

$$\frac{\partial J_3}{\partial \mathbf{W}} = \sum_{\tau \in S} \mathbf{Q}_\tau \mathbf{W}\bar{\mathbf{C}}_\tau^z \qquad (18.18)$$

with

$$\mathbf{Q}_\tau = \mathrm{diag}(\mathbf{W}\bar{\mathbf{C}}_\tau^z\mathbf{W}^T)^{-1} \qquad (18.19)$$

Thus we obtain the gradient descent algorithm

$$\Delta\mathbf{W} \propto -\sum_{\tau \in S} \mathbf{Q}_\tau \mathbf{W}\bar{\mathbf{C}}_\tau^z \qquad (18.20)$$

Here, $\mathbf{W}$ should be orthogonalized after every iteration. Moreover, care must be taken that very small diagonal entries do not cause numerical problems in the inverse in (18.19). A very similar gradient descent can be obtained for (18.12), the main difference being the scalar function in the definition of $\mathbf{Q}_\tau$.
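A minimal sketch of this gradient descent follows; the step size, iteration count, the small floor on the diagonal entries, and the SVD-based symmetric orthogonalization are our own implementation choices:

    import numpy as np

    def separate_multilag(Z, lags, n_iter=200, mu=0.05, eps=1e-6):
        """Gradient descent on J_3, eqs. (18.17)-(18.20), for whitened Z.

        Z: (n, T) whitened data. Returns an orthogonal separating matrix W.
        """
        n, T = Z.shape
        # Precompute the symmetrized lagged covariance for each lag in S
        C_bars = []
        for tau in lags:
            C = Z[:, tau:] @ Z[:, :T - tau].T / (T - tau)
            C_bars.append((C + C.T) / 2)
        W = np.eye(n)
        for _ in range(n_iter):
            grad = np.zeros((n, n))
            for C_bar in C_bars:
                d = np.diag(W @ C_bar @ W.T)
                # Guard against very small entries in the inverse (18.19)
                d = np.where(np.abs(d) < eps, eps, d)
                grad += np.diag(1.0 / d) @ W @ C_bar
            W -= mu * grad                      # descent step, eq. (18.20)
            # Symmetric orthogonalization: W <- (W W^T)^(-1/2) W
            U, _, Vt = np.linalg.svd(W)
            W = U @ Vt
        return W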
Thus we obtain an algorithm that estimates $\mathbf{W}$ based on autocorrelations with several time lags. This gives a simpler alternative to methods based on joint approximate diagonalization. Such an extension allows estimation of the model in some cases where the simple method using a single time lag fails. The basic limitation cannot be avoided, however: if the ICs have identical autocovariances (i.e., identical power spectra), they cannot be estimated by methods using time-lagged covariances only. This is in contrast to ICA using higher-order information, where the independent components are allowed to have identical distributions.

Further work on using autocovariances for source separation can be found in [11, 6, 106]. In particular, the optimal weighting of different lags has been considered in [472, 483].
18.2 SEPARATION BY NONSTATIONARITY OF VARIANCES
An alternative approach to using the time structure of the signals was introduced in [296], where it was shown that ICA can be performed using the nonstationarity of the signals. The nonstationarity used here is that of the variances of the ICs: the variances are assumed to change smoothly in time. Note that this kind of nonstationarity is independent of nongaussianity and of linear autocovariances, in the sense that none of these properties implies or presupposes any of the others.
To illustrate variance nonstationarity in its purest form, let us look at the signal in Fig. 18.1. This signal was created so that it has a gaussian marginal density and no linear time correlations, i.e., $\mathrm{cov}(y(t), y(t-\tau)) = 0$ for any lag $\tau$.
Fig. 18.1 A signal with nonstationary variance.
Thus, ICs of this kind could not be separated by basic ICA methods, nor by using linear time correlations. On the other hand, the nonstationarity of the signal is clearly visible: it is characterized by bursts of activity.
Below, we review some basic approaches to this problem. Further work can be found in [40, 370, 126, 239, 366].
18.2.1 Using local autocorrelations
Separation of nonstationary signals can be achieved by using a variant of autocorrelations, somewhat similar to the case of Section 18.1. It was shown in [296] that if we find a matrix $\mathbf{B}$ so that the components of $\mathbf{y}(t) = \mathbf{B}\mathbf{x}(t)$ are uncorrelated at every time point $t$, we have estimated the ICs. Note that due to nonstationarity, the covariance of $\mathbf{y}(t)$ depends on $t$, and thus forcing the components to be uncorrelated for every $t$ is a much stronger condition than simple whitening.
The (local) uncorrelatedness of $\mathbf{y}(t)$ can be measured using the same measures of diagonality as in Section 18.1.3. We use here a measure based on (18.14):

$$Q(\mathbf{B}, t) = \sum_i \log E_t\{y_i(t)^2\} - \log|\det E_t\{\mathbf{y}(t)\mathbf{y}(t)^T\}| \qquad (18.21)$$

The subscript $t$ in the expectation emphasizes that the signal is nonstationary, and that the expectation is taken around the time point $t$. This function is minimized by the separating matrix.
Expressing this as a function of $\mathbf{B} = (\mathbf{b}_1, \ldots, \mathbf{b}_n)^T$, we obtain

$$Q(\mathbf{B}, t) = \sum_i \log E_t\{(\mathbf{b}_i^T\mathbf{x}(t))^2\} - \log|\det E_t\{\mathbf{B}\mathbf{x}(t)\mathbf{x}(t)^T\mathbf{B}^T\}|$$
$$= \sum_i \log E_t\{(\mathbf{b}_i^T\mathbf{x}(t))^2\} - \log|\det E_t\{\mathbf{x}(t)\mathbf{x}(t)^T\}| - 2\log|\det \mathbf{B}| \qquad (18.22)$$

Note that the term $\log|\det E_t\{\mathbf{x}(t)\mathbf{x}(t)^T\}|$ does not depend on $\mathbf{B}$ at all. Furthermore, to take into account all the time points, we sum the values of $Q$ at different time points and obtain the objective function

$$J_4(\mathbf{B}) = \sum_t Q(\mathbf{B}, t) = \sum_{i,t} \log E_t\{(\mathbf{b}_i^T\mathbf{x}(t))^2\} - 2\log|\det \mathbf{B}| + \text{const.} \qquad (18.23)$$
As usual, we can whiten the data to obtain whitened data $\mathbf{z}$, and force the separating matrix $\mathbf{W}$ to be orthogonal. Then the objective function simplifies to

$$J_4(\mathbf{W}) = \sum_t Q(\mathbf{W}, t) = \sum_{i,t} \log E_t\{(\mathbf{w}_i^T\mathbf{z}(t))^2\} + \text{const.} \qquad (18.24)$$

Thus we can compute the gradient of $J_4$ as

$$\frac{\partial J_4}{\partial \mathbf{W}} = 2\sum_t \mathrm{diag}(E_t\{(\mathbf{w}_i^T\mathbf{z}(t))^2\}^{-1})\,\mathbf{W}\,E_t\{\mathbf{z}(t)\mathbf{z}(t)^T\} \qquad (18.25)$$
The question is now: How do we estimate the local variances $E_t\{(\mathbf{w}_i^T\mathbf{z}(t))^2\}$? We cannot simply use the sample variances, due to nonstationarity, which leads to dependence between these variances and the $\mathbf{z}(t)$. Instead, we have to use some local estimates at time point $t$. A natural approach is to assume that the variance changes slowly; then we can estimate the local variance by local sample variances. In other words,

$$\hat{E}_t\{(\mathbf{w}_i^T\mathbf{z}(t))^2\} = \sum_\tau h(\tau)(\mathbf{w}_i^T\mathbf{z}(t-\tau))^2 \qquad (18.26)$$

where $h$ is a moving average operator (low-pass filter), normalized so that the sum of its coefficients is one.
Thus we obtain the following algorithm:

$$\Delta\mathbf{W} \propto -\sum_t \mathrm{diag}(\hat{E}_t\{(\mathbf{w}_i^T\mathbf{z}(t))^2\}^{-1})\,\mathbf{W}\mathbf{z}(t)\mathbf{z}(t)^T \qquad (18.27)$$

where after every iteration $\mathbf{W}$ is symmetrically orthogonalized (see Chapter 6), and $\hat{E}_t$ is computed as in (18.26). Again, care must be taken that taking the inverse of very small local variances does not cause numerical problems. This is the basic method for estimating signals with nonstationary variances; it is a simplified form of the algorithm in [296].
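The following sketch implements this iteration for whitened data; the window length of the moving average, the learning rate, and the variance floor are our own choices, and the algorithm in [296] is more elaborate:

    import numpy as np

    def separate_nonstationary(Z, win=51, n_iter=50, mu=0.01, eps=1e-6):
        """Separation by nonstationarity of variance, eqs. (18.26)-(18.27).

        Z: (n, T) whitened data. Returns an orthogonal separating matrix W.
        """
        n, T = Z.shape
        h = np.ones(win) / win                  # normalized low-pass filter
        W = np.eye(n)
        for _ in range(n_iter):
            Y = W @ Z
            # Local variances via moving average of energies, eq. (18.26)
            V = np.array([np.convolve(y ** 2, h, mode="same") for y in Y])
            V = np.maximum(V, eps)              # guard tiny local variances
            # Summed update of eq. (18.27): sum_t diag(1/v) W z(t) z(t)^T
            grad = (Y / V) @ Z.T / T
            W -= mu * grad
            # Symmetric orthogonalization after every iteration (Chapter 6)
            U, _, Vt = np.linalg.svd(W)
            W = U @ Vt
        return W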
Fig. 18.2 The energy (i.e., squares) of the initial part of the signal in Fig. 18.1. This is clearly time-correlated.
The algorithm in (18.27) enables one to estimate the ICs using the information contained in the nonstationarity of their variances. This principle is different from those considered in the preceding chapters and the preceding section. It was implemented by simultaneously considering different local autocorrelations. An alternative method for using nonstationarity is considered next.
18.2.2 Using cross-cumulants
Nonlinear autocorrelations. A second method of using nonstationarity is based on interpreting variance nonstationarity in terms of higher-order cross-cumulants. This yields a very simple criterion that expresses the nonstationarity of variance.

To see how this works, consider the energy (i.e., squared amplitude) of the signal in Fig. 18.1. The energies of the initial 1000 time points are shown in Fig. 18.2. What is clearly visible is that the energies are correlated in time. This is of course a consequence of the assumption that the variance changes smoothly in time.

Before proceeding, note that the nonstationarity of a signal depends on the time scale and the level of detail in the model of the signal. If the nonstationarity of the variance is incorporated in the model (by hidden Markov models, for example), the signal no longer needs to be considered nonstationary [370]. This is the approach we choose in the following. In particular, the energies are not considered nonstationary; rather, they are considered as stationary signals that are time-correlated. This is simply a change of viewpoint.
So, we could measure the variance nonstationarity of a signal $y(t)$, $t = 1, \ldots, T$, using a measure based on the time correlation of energies: $E\{y(t)^2 y(t-\tau)^2\}$, where $\tau$ is some lag constant, often equal to one. For the sake of mathematical simplicity, it is often useful to use cumulants instead of such basic higher-order correlations. The cumulant corresponding to the correlation of energies is the fourth-order cross-cumulant
$$\mathrm{cum}(y(t), y(t), y(t-\tau), y(t-\tau))$$
$$= E\{y(t)^2 y(t-\tau)^2\} - E\{y(t)^2\}E\{y(t-\tau)^2\} - 2(E\{y(t)y(t-\tau)\})^2 \qquad (18.28)$$

This can be considered a normalized version of the cross-correlation of energies. In our case, where the variances change smoothly, this cumulant is positive, because the first term dominates the two normalizing terms.
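As an illustration, a sample version of this cross-cumulant is easy to compute for a zero-mean signal (the function name is ours):

    import numpy as np

    def energy_cumulant(y, tau=1):
        """Sample estimate of the cross-cumulant (18.28) for zero-mean y."""
        a, b = y[tau:], y[:len(y) - tau]        # y(t) and y(t - tau)
        return (np.mean(a ** 2 * b ** 2)
                - np.mean(a ** 2) * np.mean(b ** 2)
                - 2 * np.mean(a * b) ** 2)

For a signal like that of Fig. 18.1 this estimate is clearly positive, while for an i.i.d. gaussian signal it is close to zero.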
Note that although cross-cumulants are zero for random variables with jointly gaussian distributions, they need not be zero for variables with gaussian marginal distributions. Thus positive cross-cumulants do not imply nongaussian marginal distributions for the ICs, which shows that the property measured by this cross-cumulant is indeed completely different from the property of nongaussianity.

The validity of this criterion can be easily proven. Consider a linear combination of the observed signals $x_i(t)$, which are mixtures of the original ICs as in (18.1). This linear combination, say $\mathbf{b}^T\mathbf{x}(t)$, is a linear combination of the ICs: $\mathbf{b}^T\mathbf{x}(t) = \mathbf{b}^T\mathbf{A}\mathbf{s}(t)$, say $\mathbf{q}^T\mathbf{s}(t) = \sum_i q_i s_i(t)$. Using the basic properties of cumulants, the nonstationarity of such a linear combination can be evaluated as

$$\mathrm{cum}(\mathbf{b}^T\mathbf{x}(t), \mathbf{b}^T\mathbf{x}(t), \mathbf{b}^T\mathbf{x}(t-\tau), \mathbf{b}^T\mathbf{x}(t-\tau)) = \sum_i q_i^4\,\mathrm{cum}(s_i(t), s_i(t), s_i(t-\tau), s_i(t-\tau)) \qquad (18.29)$$

Now, we can constrain the variance of $\mathbf{b}^T\mathbf{x}$
to be equal to unity to normalize the scale (cumulants are not scale-invariant). This implies $\mathrm{var}(\sum_i q_i s_i) = \|\mathbf{q}\|^2 = 1$. Let us consider what happens if we maximize this nonstationarity measure with respect to $\mathbf{b}$. This is equivalent to the optimization problem

$$\max_{\|\mathbf{q}\|^2 = 1} \sum_i q_i^4\,\mathrm{cum}(s_i(t), s_i(t), s_i(t-\tau), s_i(t-\tau)) \qquad (18.30)$$

This optimization problem is formally identical to the one encountered when kurtosis (or, in general, its absolute value) is maximized to find the most nongaussian directions, as in Chapter 8. There it was proven that the solutions of this optimization problem give the ICs; in other words, the maxima of (18.30) are obtained when only one of the $q_i$ is nonzero. That proof applies directly in our case as well, and thus we see that the maximally nonstationary linear combinations give the ICs.² A sketch of this maximization is given after the footnote below. Since the cross-cumulants are assumed to be all positive, the problem here is in fact slightly simpler than with kurtosis: we can simply maximize the cross-cumulant of the linear combination, and need not consider its absolute value as is done in Chapter 8.
² Note that this statement requires that we identify nonstationarity with the energy correlations, which may or may not be meaningful depending on the context.
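To close, here is a sketch of maximizing the cross-cumulant (18.28) of $y = \mathbf{w}^T\mathbf{z}$ by projected gradient ascent on the unit sphere. This is our own illustrative implementation, with the step size, iteration count, and initialization as assumptions; it is not the method of [296]:

    import numpy as np

    def max_energy_cumulant(Z, tau=1, n_iter=500, mu=0.1, seed=0):
        """Find one separating vector w by gradient ascent on the
        cross-cumulant (18.28) of y = w^T z, keeping ||w|| = 1.
        Z: (n, T) whitened data.
        """
        rng = np.random.default_rng(seed)
        n, T = Z.shape
        Za, Zb = Z[:, tau:], Z[:, :T - tau]     # z(t) and z(t - tau)
        m = Za.shape[1]                         # number of sample pairs
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            a, b = w @ Za, w @ Zb               # y(t) and y(t - tau)
            # Gradient of E{a^2 b^2} - E{a^2}E{b^2} - 2(E{ab})^2 w.r.t. w
            g = (2 * (Za @ (a * b ** 2) + Zb @ (a ** 2 * b)) / m
                 - 2 * np.mean(b ** 2) * (Za @ a) / m
                 - 2 * np.mean(a ** 2) * (Zb @ b) / m
                 - 4 * np.mean(a * b) * (Za @ b + Zb @ a) / m)
            w += mu * g                         # ascent step
            w /= np.linalg.norm(w)              # project back to ||w|| = 1
        return w

Each run finds one row of the separating matrix; further rows can be obtained by deflation, i.e., by constraining each new $\mathbf{w}$ to be orthogonal to the previously found ones, as is done with kurtosis-based methods in Chapter 8.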