Independent component analysis P15


Part III

EXTENSIONS AND RELATED METHODS

Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja

Copyright © 2001 John Wiley & Sons, Inc. ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)


Noisy ICA

In real life, there is always some kind of noise present in the observations. Noise can correspond to actual physical noise in the measuring devices, or to inaccuracies of the model used. Therefore, it has been proposed that the independent component analysis (ICA) model should include a noise term as well. In this chapter, we consider different methods for estimating the ICA model when noise is present.

However, estimation of the mixing matrix seems to be quite difficult when noise is present. It could be argued that in practice, a better approach could often be to reduce noise in the data before performing ICA. For example, simple filtering of time signals is often very useful in this respect, and so is dimension reduction by principal component analysis (PCA); see Sections 13.1.2 and 13.2.2.

In noisy ICA, we also encounter a new problem: estimation of the noise-free realizations of the independent components (ICs). The noisy model is not invertible, and therefore estimation of the noise-free components requires new methods. This problem leads to some interesting forms of denoising.

15.1 DEFINITION

Here we extend the basic ICA model to the situation where noise is present. The noise is assumed to be additive. This is a rather realistic assumption, standard in factor analysis and signal processing, and allows for a simple formulation of the noisy model. Thus, the noisy ICA model can be expressed as


$$\mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{n} \tag{15.1}$$

where $\mathbf{n} = (n_1, \ldots, n_n)$ is the noise vector. Some further assumptions on the noise are usually made. In particular, it is assumed that

1. The noise is independent of the independent components.

2. The noise is gaussian.

The covariance matrix of the noise, say $\boldsymbol{\Sigma}$, is often assumed to be of the form $\sigma^2\mathbf{I}$, but this may be too restrictive in some cases. In any case, the noise covariance is assumed to be known. Little work on estimation of an unknown noise covariance has been conducted; see [310, 215, 19].

The identifiability of the mixing matrix in the noisy ICA model is guaranteed under the same restrictions that are sufficient in the basic case,¹ basically meaning independence and nongaussianity. In contrast, the realizations of the independent components $s_i$ can no longer be identified, because they cannot be completely separated from noise.

In the typical case where the noise covariance is assumed to be of the form $\sigma^2\mathbf{I}$, the noise in Eq. (15.1) could be considered as “sensor” noise. This is because the noise variables are separately added on each sensor, i.e., observed variable $x_i$. This is in contrast to “source” noise, in which the noise is added to the independent components (sources). Source noise can be modeled with an equation slightly different from the preceding, given by

$$\mathbf{x} = \mathbf{A}(\mathbf{s} + \mathbf{n}) \tag{15.2}$$

where again the covariance of the noise is diagonal. In fact, we could consider the noisy independent components, given by $\tilde{s}_i = s_i + n_i$, and rewrite the model as

$$\mathbf{x} = \mathbf{A}\tilde{\mathbf{s}} \tag{15.3}$$

We see that this is just the basic ICA model, with modified independent components. What is important is that the assumptions of the basic ICA model are still valid: the components of $\tilde{\mathbf{s}}$ are nongaussian and independent. Thus we can estimate the model in (15.3) by any method for basic ICA. This gives us a perfectly suitable estimator for the noisy ICA model. This way we can estimate the mixing matrix and the noisy independent components. The estimation of the original independent components from the noisy ones is an additional problem, though; see below.

This idea is, in fact, more general. Assume that the noise covariance has the form

$$\boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}^T\sigma^2 \tag{15.4}$$

¹ This seems to be admitted by the vast majority of ICA researchers. We are not aware of any rigorous proofs of this property, though.


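Then the noise vector can be transformed into another one, $\tilde{\mathbf{n}} = \mathbf{A}^{-1}\mathbf{n}$, which can be called equivalent source noise. Then the equation (15.1) becomes

$$\mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{A}\tilde{\mathbf{n}} = \mathbf{A}(\mathbf{s} + \tilde{\mathbf{n}}) \tag{15.5}$$

The point is that the covariance of $\tilde{\mathbf{n}}$ is $\sigma^2\mathbf{I}$, and thus the transformed components in $\mathbf{s} + \tilde{\mathbf{n}}$ are independent. Thus, we see again that the mixing matrix $\mathbf{A}$ can be estimated by basic ICA methods.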
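The covariance claim behind the equivalent source noise can be checked numerically. The following is a minimal sketch, assuming NumPy; the mixing matrix `A`, the noise level `sigma`, and the sample size are illustrative choices that do not come from the text. It draws gaussian noise whose covariance has the form (15.4), maps it through $\mathbf{A}^{-1}$, and estimates the covariance of the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, sigma = 200_000, 0.5

# Hypothetical invertible mixing matrix (illustrative values only).
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, -0.4],
              [-0.6, 0.1, 1.0]])
n_dim = A.shape[0]

# Noise covariance of the form (15.4): Sigma = sigma^2 * A A^T.
Sigma = sigma**2 * A @ A.T

# Draw n ~ N(0, Sigma) through a Cholesky factor of Sigma.
L = np.linalg.cholesky(Sigma)
n = L @ rng.normal(size=(n_dim, n_samples))

# Equivalent source noise: n_tilde = A^{-1} n.
n_tilde = np.linalg.solve(A, n)

cov_n_tilde = np.cov(n_tilde)
print(np.round(cov_n_tilde, 3))   # should be close to sigma^2 * I
```

Up to sampling error, the estimated covariance should be diagonal with entries near `sigma**2`, which is what makes the components of $\mathbf{s} + \tilde{\mathbf{n}}$ independent.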

To recapitulate: if the noise is added to the independent components and not to the observed mixtures, or has a particular covariance structure, the mixing matrix can be estimated by ordinary ICA methods. The denoising of the independent components is another problem, though; it will be treated in Section 15.5 below.

15.3 FEW NOISE SOURCES

Another special case that reduces to the basic ICA model can be found when the number of noise components and independent components is not very large. In particular, if their total number is not larger than the number of mixtures, we again have an ordinary ICA model, in which some of the components are gaussian noise and others are the real independent components. Such a model could still be estimated by basic ICA methods, using one-unit algorithms with fewer units than the dimension of the data.

In other words, we could define the vector of the independent components as $\tilde{\mathbf{s}} = (s_1, \ldots, s_k, n_1, \ldots, n_l)^T$, where the $s_i$, $i = 1, \ldots, k$ are the “real” independent components and the $n_i$, $i = 1, \ldots, l$ are the noise variables. Assume that the number of mixtures equals $k + l$, that is, the number of real ICs plus the number of noise variables. In this case, the ordinary ICA model holds with $\mathbf{x} = \mathbf{A}\tilde{\mathbf{s}}$, where $\mathbf{A}$ is a matrix that incorporates the mixing of the real ICs and the covariance structure of the noise, and the number of the independent components in $\tilde{\mathbf{s}}$ is equal to the number of observed mixtures. Therefore, finding the $k$ most nongaussian directions, we can estimate the real independent components. We cannot estimate the remaining dummy independent components that are actually noise variables, but we did not want to estimate them in the first place.

The applicability of this idea is quite limited, though, since in most cases we want to assume that the noise is added on each mixture, in which case $k + l$, the number of real ICs plus the number of noise variables, is necessarily larger than the number of mixtures, and the basic ICA model does not hold for $\tilde{\mathbf{s}}$.

15.4 ESTIMATION OF THE MIXING MATRIX

Not many methods for noisy ICA estimation exist in the general case. The estimation of the noiseless model seems to be a challenging task in itself, and thus the noise is usually neglected in order to obtain tractable and simple results. Moreover, it may be unrealistic in many cases to assume that the data could be divided into signals and noise in any meaningful way.

Here we treat first the problem of estimating the mixing matrix. Estimation of the independent components will be treated below.

15.4.1 Bias removal techniques

Perhaps the most promising approach to noisy ICA is given by bias removal techniques. This means that ordinary (noise-free) ICA methods are modified so that the bias due to noise is removed, or at least reduced.

Let us denote the noise-free data in the following by

$$\mathbf{v} = \mathbf{x} - \mathbf{n} = \mathbf{A}\mathbf{s} \tag{15.6}$$

We can now use the basic idea of finding projections, say $\mathbf{w}^T\mathbf{v}$, in which nongaussianity is locally maximized for whitened data, with constraint $\|\mathbf{w}\| = 1$. As shown in Chapter 8, projections in such directions give consistent estimates of the independent components, if the measure of nongaussianity is well chosen. This approach could be used for noisy ICA as well, if only we had measures of nongaussianity which are immune to gaussian noise, or at least, whose values for the original data can be easily estimated from noisy observations. We have $\mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{v} + \mathbf{w}^T\mathbf{n}$, and thus the point is to measure the nongaussianity of $\mathbf{w}^T\mathbf{v}$ from the observed $\mathbf{w}^T\mathbf{x}$ so that the measure is not affected by the noise $\mathbf{w}^T\mathbf{n}$.

Bias removal for kurtosis. If the measure of nongaussianity is kurtosis (the fourth-order cumulant), it is almost trivial to construct one-unit methods for noisy ICA, because kurtosis is immune to gaussian noise. This is because the kurtosis of $\mathbf{w}^T\mathbf{x}$ equals the kurtosis of $\mathbf{w}^T\mathbf{v}$, as can be easily proven by the basic properties of kurtosis: cumulants of a sum of independent variables add, and the fourth-order cumulant of a gaussian variable is zero.
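The immunity of kurtosis to gaussian noise can be illustrated with a small simulation. This is a minimal sketch, assuming NumPy; the helper `kurt`, the Laplacian source, and the noise level are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def kurt(y):
    """Sample fourth-order cumulant: E{y^4} - 3 (E{y^2})^2 after centering."""
    y = y - y.mean()
    return np.mean(y**4) - 3.0 * np.mean(y**2) ** 2

rng = np.random.default_rng(1)
T = 1_000_000

# Unit-variance Laplacian variable (scale b gives variance 2 b^2).
z = rng.laplace(scale=1 / np.sqrt(2), size=T)
# Independent gaussian noise; the standard deviation 0.7 is arbitrary.
n = rng.normal(scale=0.7, size=T)

kurt_clean = kurt(z)        # kurtosis of the noise-free variable
kurt_noisy = kurt(z + n)    # kurtosis of the noisy variable
kurt_noise = kurt(n)        # kurtosis of the gaussian noise alone

print(kurt_clean, kurt_noisy, kurt_noise)
```

Up to sampling error, the first two values should agree and the third should be close to zero, even though the variance of `z + n` is clearly larger than that of `z`. This is why only the whitening step, which uses second-order statistics, needs a correction.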

It must be noted, however, that in the preliminary whitening, the effect of noise must be taken into account; this is quite simple if the noise covariance matrix is known. Denoting by $\mathbf{C} = E\{\mathbf{x}\mathbf{x}^T\}$ the covariance matrix of the observed noisy data, the ordinary whitening should be replaced by the operation

$$\tilde{\mathbf{x}} = (\mathbf{C} - \boldsymbol{\Sigma})^{-1/2}\mathbf{x} \tag{15.7}$$

In other words, the covariance matrix $\mathbf{C} - \boldsymbol{\Sigma}$ of the noise-free data should be used in whitening instead of the covariance matrix $\mathbf{C}$ of the noisy data. In the following, we call this operation “quasiwhitening”. After this operation, the quasiwhitened data $\tilde{\mathbf{x}}$ follows a noisy ICA model as well:

$$\tilde{\mathbf{x}} = \mathbf{B}\mathbf{s} + \tilde{\mathbf{n}} \tag{15.8}$$

where $\mathbf{B}$ is orthogonal, and $\tilde{\mathbf{n}}$ is a linear transform of the original noise in (15.1). Thus, the theorem in Chapter 8 is valid for $\tilde{\mathbf{x}}$, and finding local maxima of the absolute value of kurtosis is a valid method for estimating the independent components.
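Quasiwhitening as in (15.7) can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reference implementation: it assumes NumPy, a known noise covariance `Sigma`, and a sample covariance in place of $\mathbf{C}$; the function name `quasiwhiten` and the toy mixing matrix are hypothetical.

```python
import numpy as np

def quasiwhiten(x, Sigma):
    """Return (x_tilde, V) with V = (C - Sigma)^{-1/2}, C the sample covariance of x.

    x has shape (dimension, samples); Sigma is the known noise covariance.
    """
    C = np.cov(x)
    evals, evecs = np.linalg.eigh(C - Sigma)
    if np.any(evals <= 0):
        raise ValueError("C - Sigma must be positive definite")
    V = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric inverse square root
    x_centered = x - x.mean(axis=1, keepdims=True)
    return V @ x_centered, V

rng = np.random.default_rng(2)
T, sigma = 500_000, 0.5
A = np.array([[1.0, 0.5], [0.3, 1.0]])             # illustrative mixing matrix
s = rng.laplace(scale=1 / np.sqrt(2), size=(2, T)) # unit-variance sources
Sigma = sigma**2 * np.eye(2)                       # sensor noise covariance
x = A @ s + rng.normal(scale=sigma, size=(2, T))   # noisy model (15.1)

x_tilde, V = quasiwhiten(x, Sigma)
B = V @ A                                          # mixing matrix after quasiwhitening
print(np.round(B @ B.T, 2))                        # should be close to the identity
```

With unit-variance sources, $\mathbf{C} - \boldsymbol{\Sigma}$ estimates $\mathbf{A}\mathbf{A}^T$, so `B @ B.T` should be close to the identity, which is the orthogonality stated for $\mathbf{B}$ in (15.8).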


Bias removal for general nongaussianity measures. As was argued in Chapter 8, it is important in many applications to use measures of nongaussianity that have better statistical properties than kurtosis. We introduced the following measure:

$$J_G(\mathbf{w}^T\mathbf{v}) = [E\{G(\mathbf{w}^T\mathbf{v})\} - E\{G(\nu)\}]^2 \tag{15.9}$$

where the function $G$ is a sufficiently regular nonquadratic function, and $\nu$ is a standardized gaussian variable.

Such a measure could be used for noisy data as well, if only we were able to estimate $J_G(\mathbf{w}^T\mathbf{v})$ of the noise-free data from the noisy observations $\mathbf{x}$. Denoting by $z$ a nongaussian random variable, and by $n$ a gaussian noise variable of variance $\sigma^2$, we should be able to express the relation between $E\{G(z)\}$ and $E\{G(z + n)\}$ in simple algebraic terms. In general, this relation seems quite complicated, and can be computed only using numerical integration.

However, it was shown in [199] that for certain choices of $G$, a similar relation becomes very simple. The basic idea is to choose $G$ to be the density function of a zero-mean gaussian random variable, or a related function. These nonpolynomial moments are called gaussian moments.

Denote by

$$\varphi_c(x) = \frac{1}{c}\,\varphi\!\left(\frac{x}{c}\right) = \frac{1}{\sqrt{2\pi}\,c}\exp\!\left(-\frac{x^2}{2c^2}\right) \tag{15.10}$$

the gaussian density function with variance $c^2$, and by $\varphi_c^{(k)}(x)$ the $k$th ($k > 0$) derivative of $\varphi_c(x)$. Denote further by $\varphi_c^{(-k)}$ the $k$th integral function of $\varphi_c(x)$, obtained by $\varphi_c^{(-k)}(x) = \int_0^x \varphi_c^{(-k+1)}(\xi)\,d\xi$, where we define $\varphi_c^{(0)}(x) = \varphi_c(x)$. (The lower integration limit 0 is here quite arbitrary, but has to be fixed.) Then we have the following theorem [199]:

Theorem 15.1 Let $z$ be any nongaussian random variable, and $n$ an independent gaussian noise variable of variance $\sigma^2$. Define the gaussian function $\varphi$ as in (15.10). Then for any constant $c$ with $c^2 > \sigma^2$, we have

$$E\{\varphi_c(z)\} = E\{\varphi_d(z + n)\} \tag{15.11}$$

with $d = \sqrt{c^2 - \sigma^2}$. Moreover, (15.11) still holds when $\varphi$ is replaced by $\varphi^{(k)}$ for any integer index $k$.
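The identity (15.11) can be checked by simulation for the case $k = 0$. The following is a minimal sketch, assuming NumPy; the uniform distribution for $z$ and the values of `sigma` and `c` are illustrative assumptions. It compares the gaussian moment of the noise-free variable with the corrected moment computed from noisy data only, and with the uncorrected moment.

```python
import numpy as np

def phi(x, c):
    """Gaussian density with standard deviation c, as in (15.10)."""
    return np.exp(-x**2 / (2 * c**2)) / (np.sqrt(2 * np.pi) * c)

rng = np.random.default_rng(3)
T = 1_000_000
sigma, c = 0.6, 1.0                  # requires c^2 > sigma^2
d = np.sqrt(c**2 - sigma**2)

# Nongaussian variable: uniform with unit variance (an arbitrary choice).
z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=T)
n = rng.normal(scale=sigma, size=T)  # independent gaussian noise

moment_clean = phi(z, c).mean()          # E{phi_c(z)}, needs the noise-free z
moment_corrected = phi(z + n, d).mean()  # E{phi_d(z + n)}, noisy data only
moment_naive = phi(z + n, c).mean()      # uncorrected, biased by the noise

print(moment_clean, moment_corrected, moment_naive)
```

The first two values should agree up to sampling error, while the uncorrected value differs. The reason is that averaging $\varphi_d(z + n)$ over the noise convolves two gaussian densities, and their variances add: $d^2 + \sigma^2 = c^2$.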

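The theorem means that we can estimate the independent components from noisy observations by maximizing a general contrast function of the form (15.9), where the direct estimation of the statistics $E\{G(\mathbf{w}^T\mathbf{v})\}$ of the noise-free data is made possible by using $G(u) = \varphi_c^{(k)}(u)$. We call the statistics of the form $E\{\varphi_c^{(k)}(\mathbf{w}^T\mathbf{v})\}$ the gaussian moments of the data. Thus, for quasiwhitened data $\tilde{\mathbf{x}}$, we maximize the following contrast function:

$$\max_{\|\mathbf{w}\| = 1}\,[E\{\varphi_{d(\mathbf{w})}^{(k)}(\mathbf{w}^T\tilde{\mathbf{x}})\} - E\{\varphi_c^{(k)}(\nu)\}]^2 \tag{15.12}$$

with $d(\mathbf{w}) = \sqrt{c^2 - \mathbf{w}^T\tilde{\boldsymbol{\Sigma}}\mathbf{w}}$, where $\tilde{\boldsymbol{\Sigma}}$ is the covariance of the transformed noise $\tilde{\mathbf{n}}$, given in (15.14). This gives a consistent (i.e., convergent) method of estimating the noisy ICA model, as was shown in Chapter 8.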

To use these results in practice, we need to choose some values for $k$. In fact, $c$ disappears from the final algorithm, so a value for this parameter need not be chosen. Two indices $k$ for the gaussian moments seem to be of particular interest: $k = 0$ and $k = -2$. The first corresponds to the gaussian density function; its use was proposed in Chapter 8. The case $k = -2$ is interesting because the contrast function is then of the form of a (negative) log-density of a supergaussian variable. In fact, $\varphi^{(-2)}(u)$ can be very accurately approximated by $G(u) = \tfrac{1}{2}\log\cosh u$, which was also used in Chapter 8.

FastICA for noisy data. Using the unbiased measures of nongaussianity given in this section, we can derive a variant of the FastICA algorithm [198]. Using kurtosis or gaussian moments gives algorithms of a similar form, just like in the noise-free case.

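The algorithm takes the form [199, 198]:

$$\mathbf{w}^* = E\{\tilde{\mathbf{x}}\,g(\mathbf{w}^T\tilde{\mathbf{x}})\} - (\mathbf{I} + \tilde{\boldsymbol{\Sigma}})\,\mathbf{w}\,E\{g'(\mathbf{w}^T\tilde{\mathbf{x}})\} \tag{15.13}$$

where $\mathbf{w}^*$, the new value of $\mathbf{w}$, is normalized to unit norm after every iteration, and $\tilde{\boldsymbol{\Sigma}}$ is given by

$$\tilde{\boldsymbol{\Sigma}} = E\{\tilde{\mathbf{n}}\tilde{\mathbf{n}}^T\} = (\mathbf{C} - \boldsymbol{\Sigma})^{-1/2}\,\boldsymbol{\Sigma}\,(\mathbf{C} - \boldsymbol{\Sigma})^{-1/2} \tag{15.14}$$

The function $g$ is here the derivative of $G$, and can thus be chosen among the following:

$$g_1(u) = \tanh(u), \qquad g_2(u) = u\exp(-u^2/2), \qquad g_3(u) = u^3 \tag{15.15}$$

where $g_1$ is an approximation of $\varphi^{(-1)}$, which is the gaussian cumulative distribution function (these relations hold up to some irrelevant constants). These functions cover essentially the nonlinearities ordinarily used in the FastICA algorithm.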
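The iteration (15.13) can be sketched as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: it assumes NumPy, uses only the cubic nonlinearity $g_3(u) = u^3$ (so $g'(u) = 3u^2$), replaces expectations by sample averages, and takes the noise covariance as known. The function name, the stopping rule, and the toy data are hypothetical.

```python
import numpy as np

def noisy_fastica_one_unit(x_tilde, Sigma_tilde, n_iter=50, seed=0):
    """One-unit iteration (15.13) with g(u) = u^3 on quasiwhitened data."""
    dim = x_tilde.shape[0]
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)
    M = np.eye(dim) + Sigma_tilde
    for _ in range(n_iter):
        y = w @ x_tilde
        # w* = E{x g(w^T x)} - (I + Sigma_tilde) w E{g'(w^T x)}
        w_new = (x_tilde * y**3).mean(axis=1) - (M @ w) * (3 * y**2).mean()
        w_new /= np.linalg.norm(w_new)
        converged = abs(w_new @ w) > 1 - 1e-10   # sign flips are allowed
        w = w_new
        if converged:
            break
    return w

# Toy data: two uniform sources, illustrative mixing matrix, sensor noise.
rng = np.random.default_rng(4)
T, sigma = 200_000, 0.4
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))
A = np.array([[1.0, 0.5], [0.3, 1.0]])
Sigma = sigma**2 * np.eye(2)
x = A @ s + rng.normal(scale=sigma, size=(2, T))

# Quasiwhitening (15.7) and transformed noise covariance (15.14).
evals, evecs = np.linalg.eigh(np.cov(x) - Sigma)
V = evecs @ np.diag(evals ** -0.5) @ evecs.T
x_tilde = V @ x
Sigma_tilde = V @ Sigma @ V

w = noisy_fastica_one_unit(x_tilde, Sigma_tilde)
overlap = np.abs((V @ A).T @ w)   # |B^T w| for B = V A
print(np.round(overlap, 3))
```

If the iteration has found one independent component, one entry of `overlap` should be close to 1 and the other close to 0, meaning that `w` matches a column of $\mathbf{B}$ up to sign. With $g_3$, the update is proportional to the gradient of the kurtosis of $\mathbf{w}^T\tilde{\mathbf{x}}$, which is why the correction term involves $\mathbf{I} + \tilde{\boldsymbol{\Sigma}}$, the covariance of the quasiwhitened data.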

15.4.2 Higher-order cumulant methods

A different approach to estimation of the mixing matrix is given by methods using higher-order cumulants only. Higher-order cumulants are unaffected by gaussian noise (see Section 2.7), and therefore any such estimation method would be immune to gaussian noise. Such methods can be found in [63, 263, 471]. The problem is, however, that such methods often use cumulants of order 6. Higher-order cumulants are sensitive to outliers, and therefore methods using cumulants of orders higher than 4 are unlikely to be very useful in practice. A nice feature of this approach is, however, that we do not need to know the noise covariance matrix.

Note that the cumulant-based methods in Part II used both second- and fourth-order cumulants. Second-order cumulants are not immune to gaussian noise, and therefore the cumulant-based methods introduced in the previous chapters would not be immune either. Most of the cumulant-based methods could probably be modified to work in the noisy case, as we did in this chapter for methods maximizing the absolute value of kurtosis.

15.4.3 Maximum likelihood methods

Another approach for estimation of the mixing matrix with noisy data is given by maximum likelihood (ML) estimation. First, one could maximize the joint likelihood of the mixing matrix and the realizations of the independent components, as in [335, 195, 80]. This is given by

$$\log L(\mathbf{A}, \mathbf{s}(1), \ldots, \mathbf{s}(T)) = -\sum_{t=1}^{T}\left[\frac{1}{2}\|\mathbf{A}\mathbf{s}(t) - \mathbf{x}(t)\|^2_{\boldsymbol{\Sigma}^{-1}} + \sum_{i=1}^{n} f_i(s_i(t))\right] + C \tag{15.16}$$

where $\|\mathbf{m}\|^2_{\boldsymbol{\Sigma}^{-1}}$ is defined as $\mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m}$, the $\mathbf{s}(t)$ are the realizations of the independent components, and $C$ is an irrelevant constant. The $f_i$ are the negative logarithms of the probability density functions (pdf's) of the independent components; this sign convention is the one used in (15.18) and in the examples of Section 15.5.2. Maximization of this joint likelihood is, however, computationally very expensive.

A more principled method would be to maximize the (marginal) likelihood of the mixing matrix, and possibly that of the noise covariance, which was done in [310]. This was based on the idea of approximating the densities of the independent components as gaussian mixture densities; the application of the EM algorithm then becomes feasible. In [42], the simpler case of discrete-valued independent components was treated. A problem with the EM algorithm is, however, that the computational complexity grows exponentially with the dimension of the data.

A more promising approach might be to use bias removal techniques so as to modify existing ML algorithms to be consistent with noisy data. Actually, the bias removal techniques given here can be interpreted as such methods; a related method was given in [119].

Finally, let us mention a method based on the geometric interpretation of the maximum likelihood estimator, introduced in [33], and a rather different approach for narrow-band sources, introduced in [76].

15.5 ESTIMATION OF THE NOISE-FREE INDEPENDENT COMPONENTS

15.5.1 Maximum a posteriori estimation

In noisy ICA, it is not enough to estimate the mixing matrix. Inverting the mixing matrix in (15.1), we obtain

$$\mathbf{A}^{-1}\mathbf{x} = \mathbf{s} + \mathbf{A}^{-1}\mathbf{n} \tag{15.17}$$

In other words, we only get noisy estimates of the independent components. Therefore, we would like to obtain estimates of the original independent components $\hat{s}_i$ that are somehow optimal, i.e., contain minimum noise.

A simple approach to this problem would be to use the maximum a posteriori (MAP) estimates; see Section 4.6.3 for the definition. Basically, this means that we take the values that have maximum probability, given the $\mathbf{x}$. Equivalently, we take as $\hat{s}_i$ those values that maximize the joint likelihood in (15.16), so this could also be called a maximum likelihood (ML) estimator.

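To compute the MAP estimator, let us take the gradient of the log-likelihood (15.16) with respect to the $\mathbf{s}(t)$, $t = 1, \ldots, T$, and equate this to 0. Thus we obtain the equation

$$\hat{\mathbf{A}}^T\boldsymbol{\Sigma}^{-1}\hat{\mathbf{A}}\,\hat{\mathbf{s}}(t) - \hat{\mathbf{A}}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}(t) + f'(\hat{\mathbf{s}}(t)) = 0 \tag{15.18}$$

where the derivative of the negative log-density, denoted by $f'$, is applied separately on each component of the vector $\hat{\mathbf{s}}(t)$.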
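When (15.18) has no closed-form solution, it can be solved numerically. The following is a minimal sketch, assuming NumPy and, purely for illustration, the smooth choice $f_i(s) = \log\cosh(s)$ with $f'(s) = \tanh(s)$; the matrix `A`, the noise covariance, the observation `x`, and the step size are all hypothetical values chosen for this toy problem. It runs plain gradient descent on the bracketed term of (15.16) for a single time point.

```python
import numpy as np

def objective(s, x, A, Sigma_inv):
    """Bracketed term of (15.16) for one time point, with f_i(s) = log cosh(s)."""
    r = A @ s - x
    return 0.5 * r @ Sigma_inv @ r + np.log(np.cosh(s)).sum()

def map_gradient(s, x, A, Sigma_inv):
    """Left-hand side of (15.18), with f'(s) = tanh(s)."""
    return A.T @ Sigma_inv @ (A @ s - x) + np.tanh(s)

A = np.array([[1.0, 0.5], [0.3, 1.0]])        # hypothetical estimate of the mixing matrix
Sigma_inv = np.linalg.inv(0.25 * np.eye(2))   # inverse of an assumed known noise covariance
x = np.array([0.8, -1.2])                     # one observation x(t)

# Gradient descent; the step size 0.1 is small enough for this toy problem.
s_hat = np.zeros(2)
for _ in range(500):
    s_hat = s_hat - 0.1 * map_gradient(s_hat, x, A, Sigma_inv)

grad_norm = np.linalg.norm(map_gradient(s_hat, x, A, Sigma_inv))
print(s_hat, grad_norm)
```

At convergence the gradient norm should be numerically zero, so `s_hat` satisfies (15.18) for this observation. In practice one would likely prefer a second-order or quasi-Newton optimizer; plain gradient descent is used here only to keep the sketch short.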

In fact, this method gives a nonlinear generalization of classic Wiener filtering presented in Section 4.6.2. An alternative approach would be to use the time structure of the ICs (see Chapter 18) for denoising. This results in a method resembling the Kalman filter; see [250, 249].

15.5.2 Special case of shrinkage estimation

Solving for the $\hat{\mathbf{s}}$ is not easy, however. In general, we must use numerical optimization.

A simple special case is obtained if the noise covariance is assumed to be of the same form as in (15.4) [200, 207]. This corresponds to the case of (equivalent) source noise. Then (15.18) gives

$$\hat{\mathbf{s}} = g(\hat{\mathbf{A}}^{-1}\mathbf{x}) \tag{15.19}$$

where the scalar component-wise function $g$ is obtained by inverting the relation

$$g^{-1}(u) = u + \sigma^2 f'(u) \tag{15.20}$$

Thus, the MAP estimator is obtained by inverting a certain function involving $f'$, or the score function [395] of the density of $s$. For nongaussian variables, the score function is nonlinear, and so is $g$.
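The step from (15.18) to (15.19) and (15.20) can be written out explicitly. The following short derivation is a sketch that assumes $\hat{\mathbf{A}}$ is square and invertible and that the noise covariance has the form (15.4) with $\hat{\mathbf{A}}$ in place of $\mathbf{A}$; the time index is dropped for brevity.

```latex
\begin{align*}
\boldsymbol{\Sigma} = \sigma^2\hat{\mathbf{A}}\hat{\mathbf{A}}^T
  &\;\Rightarrow\;
  \hat{\mathbf{A}}^T\boldsymbol{\Sigma}^{-1}\hat{\mathbf{A}} = \sigma^{-2}\mathbf{I},
  \qquad
  \hat{\mathbf{A}}^T\boldsymbol{\Sigma}^{-1} = \sigma^{-2}\hat{\mathbf{A}}^{-1}, \\
\text{(15.18)}
  &\;\Rightarrow\;
  \sigma^{-2}\hat{\mathbf{s}} - \sigma^{-2}\hat{\mathbf{A}}^{-1}\mathbf{x} + f'(\hat{\mathbf{s}}) = 0, \\
  &\;\Rightarrow\;
  \hat{\mathbf{A}}^{-1}\mathbf{x} = \hat{\mathbf{s}} + \sigma^2 f'(\hat{\mathbf{s}}).
\end{align*}
```

Writing $u$ for one component of $\hat{\mathbf{A}}^{-1}\mathbf{x}$, each component thus satisfies $u = \hat{s} + \sigma^2 f'(\hat{s})$, so $\hat{s}$ is obtained by applying the inverse of the map $s \mapsto s + \sigma^2 f'(s)$ to $u$.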

In general, the inversion required in (15.20) may be impossible analytically. Here we show three examples, which will be shown to have great practical value in Chapter 21, where the inversion can be done easily.

Example 15.1 Assume that $s$ has a Laplacian (or double exponential) distribution of unit variance. Then $p(s) = \exp(-\sqrt{2}|s|)/\sqrt{2}$, $f'(s) = \sqrt{2}\,\operatorname{sign}(s)$, and $g$ takes the form

$$g(u) = \operatorname{sign}(u)\max(0, |u| - \sqrt{2}\sigma^2) \tag{15.21}$$

(Rigorously speaking, the function in (15.20) is not invertible in this case, but approximating it by a sequence of invertible functions, (15.21) is obtained as the limit.)

The function in (15.21) is a shrinkage function that reduces the absolute value of its argument by a fixed amount, as depicted in Fig. 15.1. Intuitively, the utility of such a function can be seen as follows. Since the density of a supergaussian random variable (e.g., a Laplacian random variable) has a sharp peak at zero, it can be assumed that small values of the noisy variable correspond to pure noise, i.e., to $s = 0$. Thresholding such values to zero should thus reduce noise, and the shrinkage function can indeed be considered a soft thresholding operator.
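The soft thresholding in (15.21) is simple to try out. The code below is a minimal sketch, assuming NumPy; the function name `shrink_laplace`, the noise variance, and the simulated data are illustrative assumptions. It applies the shrinkage to a noisy Laplacian variable and compares mean squared errors before and after.

```python
import numpy as np

def shrink_laplace(u, sigma2):
    """Soft threshold (15.21): unit-variance Laplacian prior, noise variance sigma2."""
    return np.sign(u) * np.maximum(0.0, np.abs(u) - np.sqrt(2.0) * sigma2)

rng = np.random.default_rng(5)
T, sigma2 = 200_000, 0.5

s = rng.laplace(scale=1 / np.sqrt(2), size=T)       # unit-variance Laplacian source
u = s + rng.normal(scale=np.sqrt(sigma2), size=T)   # noisy component, as in (15.17)

mse_noisy = np.mean((u - s) ** 2)                   # error of the raw noisy estimate
mse_shrunk = np.mean((shrink_laplace(u, sigma2) - s) ** 2)
print(mse_noisy, mse_shrunk)
```

In this simulated setting the shrunken estimate would be expected to have the smaller mean squared error, since many of the small noisy values are set exactly to zero. Note that the threshold $\sqrt{2}\sigma^2$ depends on the noise variance, which is assumed known here.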

Example 15.2 More generally, assume that the score function is approximated as a linear combination of the score functions of the gaussian and the Laplacian distributions:

$$f'(s) = as + b\,\operatorname{sign}(s) \tag{15.22}$$

with $a, b > 0$. This corresponds to assuming the following density model for $s$:

$$p(s) = C\exp(-as^2/2 - b|s|) \tag{15.23}$$

where $C$ is an irrelevant scaling constant. This is depicted in Fig. 15.2. Then we obtain

$$g(u) = \frac{1}{1 + \sigma^2 a}\,\operatorname{sign}(u)\max(0, |u| - b\sigma^2) \tag{15.24}$$

This function is a shrinkage with additional scaling, as depicted in Fig. 15.1.
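That (15.24) really inverts the relation (15.20) for the score function (15.22) can be verified directly. This is a small sketch, assuming NumPy; the function names and the parameter values are illustrative assumptions.

```python
import numpy as np

def shrink_gauss_laplace(u, sigma2, a, b):
    """Shrinkage with scaling (15.24) for the score model f'(s) = a*s + b*sign(s)."""
    return np.sign(u) * np.maximum(0.0, np.abs(u) - b * sigma2) / (1.0 + sigma2 * a)

def g_inverse(s, sigma2, a, b):
    """Relation (15.20) with the score function (15.22)."""
    return s + sigma2 * (a * s + b * np.sign(s))

sigma2, a, b = 0.3, 0.5, 1.2          # illustrative parameter values
u = np.array([-3.0, -1.0, 0.2, 0.5, 2.0])

s_hat = shrink_gauss_laplace(u, sigma2, a, b)
nonzero = s_hat != 0
# Mapping the nonzero estimates back through (15.20) should reproduce u.
round_trip = g_inverse(s_hat[nonzero], sigma2, a, b)
print(s_hat, round_trip)
```

Inputs with $|u| \le b\sigma^2$ are set to zero; for all other inputs, applying (15.20) to the output returns the input, which confirms the inversion on this sample of values.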

Example 15.3 Yet another possibility is to use the following strongly supergaussian probability density:

$$p(s) = \frac{1}{2d}\,\frac{(\alpha + 2)\,[\alpha(\alpha + 1)/2]^{(\alpha/2 + 1)}}{[\sqrt{\alpha(\alpha + 1)/2} + |s/d|]^{(\alpha + 3)}} \tag{15.25}$$

with parameters $\alpha, d > 0$; see Fig. 15.2. When $\alpha \to \infty$, the Laplacian density is obtained as the limit. The strong sparsity of the densities given by this model can be seen, e.g., from the fact that the kurtosis [131, 210] of these densities is always larger than the kurtosis of the Laplacian density, and reaches infinity for $\alpha \le 2$. Similarly, $p(0)$ reaches infinity as $\alpha$ goes to zero. The resulting shrinkage function given by (15.20) can be obtained after some straightforward algebraic manipulations as:

$$g(u) = \operatorname{sign}(u)\max\!\left(0,\; \frac{|u| - ad}{2} + \frac{1}{2}\sqrt{(|u| + ad)^2 - 4\sigma^2(\alpha + 3)}\right) \tag{15.26}$$

where $a = \sqrt{\alpha(\alpha + 1)/2}$, and $g(u)$ is set to zero in case the square root in (15.26) is imaginary. This is a shrinkage function that has a stronger thresholding flavor, as depicted in Fig. 15.1.
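The shrinkage function (15.26) can be implemented and checked against (15.20) in the same way. This is a minimal sketch, assuming NumPy. The score function used in the check, $f'(s) = (\alpha + 3)\operatorname{sign}(s)/(ad + |s|)$, is not stated in the text; it is obtained here by differentiating the negative logarithm of (15.25). Function names and parameter values are illustrative.

```python
import numpy as np

def shrink_sparse(u, sigma2, alpha, d):
    """Shrinkage function (15.26) for the strongly supergaussian density (15.25)."""
    a = np.sqrt(alpha * (alpha + 1) / 2)
    abs_u = np.abs(u)
    disc = (abs_u + a * d) ** 2 - 4 * sigma2 * (alpha + 3)
    root = np.sqrt(np.maximum(disc, 0.0))
    # Set g(u) to zero where the square root would be imaginary.
    value = np.where(disc >= 0, (abs_u - a * d) / 2 + root / 2, 0.0)
    return np.sign(u) * np.maximum(0.0, value)

def g_inverse(s, sigma2, alpha, d):
    """Relation (15.20) with f'(s) = (alpha + 3) sign(s) / (a d + |s|)."""
    a = np.sqrt(alpha * (alpha + 1) / 2)
    return s + sigma2 * (alpha + 3) * np.sign(s) / (a * d + np.abs(s))

sigma2, alpha, d = 0.1, 1.0, 1.0      # illustrative parameter values
u = np.array([-2.0, -0.1, 0.05, 0.8, 3.0])

s_hat = shrink_sparse(u, sigma2, alpha, d)
nonzero = s_hat != 0
# Mapping the nonzero estimates back through (15.20) should reproduce u.
round_trip = g_inverse(s_hat[nonzero], sigma2, alpha, d)
print(s_hat, round_trip)
```

Small inputs are set exactly to zero, while large inputs are reduced only slightly, which is the stronger thresholding behavior described above.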
