13 Practical Considerations
In the preceding chapters, we presented several approaches for the estimation of the independent component analysis (ICA) model. In particular, several algorithms were proposed for the estimation of the basic version of the model, which has a square mixing matrix and no noise. Now we are, in principle, ready to apply those algorithms on real data sets. Many such applications will be discussed in Part IV.
However, when applying the ICA algorithms to real data, some practical considerations arise and need to be taken into account. In this chapter, we discuss different problems that may arise, in particular, overlearning and noise in the data.
We also propose some preprocessing techniques (dimension reduction by principal component analysis, time filtering) that may be useful and even necessary before the application of the ICA algorithms in practice.
13.1 PREPROCESSING BY TIME FILTERING
The success of ICA for a given data set may depend crucially on performing some application-dependent preprocessing steps. In the basic methods discussed in the previous chapters, we always used centering in preprocessing, and often whitening was done as well. Here we discuss further preprocessing methods that are not necessary in theory, but are often very useful in practice.
13.1.1 Why time filtering is possible
In many cases, the observed random variables are, in fact, time signals or time series, which means that they describe the time course of some phenomenon or system. Thus the sample index $t$ in $x_i(t)$ is a time index. In such a case, it may be very useful to filter the signals. In other words, this means taking moving averages of the time series. Of course, in the ICA model no time structure is assumed, so filtering is not always possible: if the sample points $\mathbf{x}(t)$ cannot be ordered in any meaningful way with respect to $t$, filtering is not meaningful, either.
For time series, any linear filtering of the signals is allowed, since it does not change the ICA model. In fact, if we filter linearly the observed signals $x_i(t)$ to obtain new signals, say $x_i^*(t)$, the ICA model still holds for $x_i^*(t)$, with the same mixing matrix. This can be seen as follows. Denote by $\mathbf{X}$ the matrix that contains the observations $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ as its columns, and similarly for $\mathbf{S}$. Then the ICA model can be expressed as:
$$
\mathbf{X} = \mathbf{A}\mathbf{S} \tag{13.1}
$$
Now, time filtering of $\mathbf{X}$ corresponds to multiplying $\mathbf{X}$ from the right by a matrix, let us call it $\mathbf{M}$. This gives
$$
\mathbf{X}^* = \mathbf{X}\mathbf{M} = \mathbf{A}\mathbf{S}\mathbf{M} = \mathbf{A}\mathbf{S}^* \tag{13.2}
$$
which shows that the ICA model still remains valid. The independent components are filtered by the same filtering that was applied on the mixtures. They are not mixed with each other in $\mathbf{S}^*$, because the matrix $\mathbf{M}$ is by definition a component-wise filtering matrix.
Since the mixing matrix remains unchanged, we can use the filtered data in the ICA estimation method only. After estimating the mixing matrix, we can apply the same mixing matrix on the original data to obtain the independent components. The question then arises what kind of filtering could be useful. In the following, we consider three different kinds of filtering: high-pass and low-pass filtering, as well as their compromise.
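As a concrete illustration (a minimal sketch of our own, not taken from the original text; all variable names are ours), the following Python fragment verifies numerically that filtering the mixtures is equivalent to mixing the filtered sources, so the mixing matrix is unchanged:

import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 1000

S = rng.laplace(size=(n, T))           # supergaussian sources, one per row
A = rng.normal(size=(n, n))            # random square mixing matrix
X = A @ S                              # observed mixtures, X = A S, as in (13.1)

# Component-wise filtering from the right, implemented here by filtering
# each row with the same moving-average kernel.
kernel = np.ones(3) / 3.0
def filter_rows(Z):
    return np.apply_along_axis(lambda z: np.convolve(z, kernel, mode='same'), 1, Z)

X_filt = filter_rows(X)                # X* = X M
S_filt = filter_rows(S)                # S* = S M

# The filtered mixtures are the *same* mixture of the filtered sources:
print(np.allclose(X_filt, A @ S_filt))   # True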
13.1.2 Low-pass filtering
Basically, low-pass filtering means that every sample point is replaced by a weighted average of that point and the points immediately before it.¹ This is a form of smoothing the data. Then the matrix $\mathbf{M}$ in (13.2) would be something like
$$
\mathbf{M} = \frac{1}{3}
\begin{pmatrix}
\cdots & 1 & 1 & 1 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 1 & 1 & 1 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 1 & 1 & 1 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 1 & 1 & 1 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 1 & 1 & 1 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & \cdots
\end{pmatrix}
\tag{13.3}
$$
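As a rough sketch (our own construction, with the boundary handling chosen arbitrarily), a filtering matrix of this causal moving-average type can be built and applied in Python as follows:

import numpy as np

def moving_average_matrix(T, width=3):
    """T x T filtering matrix M: column t averages the `width` points up to time t."""
    M = np.zeros((T, T))
    for t in range(T):
        lo = max(0, t - width + 1)
        M[lo:t + 1, t] = 1.0 / width
    return M

T = 8
M = moving_average_matrix(T)
X = np.arange(2 * T, dtype=float).reshape(2, T)   # two toy signals as rows
X_filt = X @ M                                    # filtered mixtures, as in (13.2)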
Low-pass filtering is often used because it tends to reduce noise. This is a well-known property in signal processing that is explained in most basic signal processing textbooks.
In the basic ICA model, the effect of noise is more or less neglected; see Chapter 15 for a detailed discussion. Thus basic ICA methods work much better with data that does not have much noise, and reducing noise is thus useful and sometimes even necessary.
A possible problem with low-pass filtering is that it reduces the information in the data, since the fast-changing, high-frequency features of the data are lost. It often happens that this leads to a reduction of independence as well (see next section).
13.1.3 High-pass filtering and innovations
High-pass filtering is the opposite of low-pass filtering. The point is to remove slowly changing trends from the data. Thus a low-pass filtered version is subtracted from the signal. A classic way of doing high-pass filtering is differencing, which means replacing every sample point by the difference between the value at that point and the value at the preceding point. Thus, the matrix $\mathbf{M}$ in (13.2) would be
$$
\mathbf{M} =
\begin{pmatrix}
\cdots & 1 & -1 & 0 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 1 & -1 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 1 & -1 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 1 & -1 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 1 & -1 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 0 & 1 & -1 & \cdots
\end{pmatrix}
\tag{13.4}
$$
¹ To have a causal filter, points after the current point may be left out of the averaging.
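A corresponding sketch for the differencing filter (again our own illustration; the first column is handled arbitrarily):

import numpy as np

def differencing_matrix(T):
    """T x T matrix M such that (X @ M)[:, t] = X[:, t] - X[:, t-1] for t >= 1."""
    M = np.eye(T)
    M -= np.eye(T, k=1)   # -1 just above the diagonal: column t also hits row t-1
    return M

T = 6
x = np.array([[1., 4., 9., 16., 25., 36.]])   # one toy signal as a row vector
x_hp = x @ differencing_matrix(T)             # first entry is just x[:, 0]
# For columns 1..T-1 this is equivalent to np.diff(x, axis=1).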
High-pass filtering may be useful in ICA because in certain cases it increases the independence of the components. It often happens in practice that the components have slowly changing trends or fluctuations, in which case they are not very independent. If these slow fluctuations are removed by high-pass filtering, the filtered components are often much more independent. A more principled approach to high-pass filtering is to consider it in the light of innovation processes.
Innovation processes  Given a stochastic process $s(t)$, we define its innovation process $\tilde{s}(t)$ as the error of the best prediction of $s(t)$, given its past. Such a best prediction is given by the conditional expectation of $s(t)$ given its past, because it is the expected value of the conditional distribution of $s(t)$ given its past. Thus the innovation process $\tilde{s}(t)$ of $s(t)$ is defined by
$$
\tilde{s}(t) = s(t) - E\{s(t) \mid s(t-1), s(t-2), \ldots\} \tag{13.5}
$$
The expression "innovation" describes the fact that $\tilde{s}(t)$ contains all the new information about the process that can be obtained at time $t$ by observing $s(t)$.
The concept of innovations can be utilized in the estimation of the ICA model due to the following property:
Theorem 13.1 If $\mathbf{x}(t)$ and $\mathbf{s}(t)$ follow the basic ICA model, then the innovation processes $\tilde{\mathbf{x}}(t)$ and $\tilde{\mathbf{s}}(t)$ follow the ICA model as well. In particular, the components $\tilde{s}_i(t)$ are independent from each other.
On the other hand, independence of the innovations does not imply the independence of the $s_i(t)$. Thus, the innovations are more often independent from each other than the original processes. Moreover, one could argue that the innovations are usually more nongaussian than the original processes. This is because $s_i(t)$ is a kind of moving average of the innovation process, and sums tend to be more gaussian than the original variables. Together these mean that the innovation process is more likely to be independent and nongaussian, and thus to fulfill the basic assumptions of ICA.
Innovation processes were discussed in more detail in [194], where it was also shown that using innovations, it is possible to separate signals (images of faces) that are otherwise strongly correlated and very difficult to separate.
The connection between innovations and ordinary filtering techniques is that the computation of the innovation process is often rather similar to high-pass filtering. Thus, the arguments in favor of using innovation processes apply at least partly in favor of high-pass filtering.
A possible problem with high-pass filtering, however, is that it may increase noise, for the same reasons that low-pass filtering decreases noise.
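As a hedged illustration (our own simplification, not from the text): the conditional expectation in (13.5) is in general hard to compute, but for roughly linear processes it can be approximated by a linear autoregressive predictor fitted by least squares, and the prediction residual then serves as an approximate innovation process:

import numpy as np

def approx_innovation(s, order=5):
    """Approximate innovation process of a 1-D signal s: the residual of a
    least-squares AR predictor of the given order, used here as a linear
    stand-in for the conditional expectation in (13.5)."""
    T = len(s)
    # Regression matrix of past values s(t-1), ..., s(t-order).
    past = np.column_stack([s[order - k - 1: T - k - 1] for k in range(order)])
    target = s[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    return target - past @ coeffs     # ~s(t) = s(t) - E{s(t) | past}, approximately

rng = np.random.default_rng(0)
innov = rng.laplace(size=1000)        # nongaussian innovations
s = np.zeros(1000)
for t in range(1, 1000):              # slowly varying AR(1) process built on them
    s[t] = 0.9 * s[t - 1] + innov[t]
s_tilde = approx_innovation(s)        # approximately recovers the innovations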
13.1.4 Optimal filtering
Both of the preceding types of filtering have their pros and cons. The optimum would be to find a filter that increases the independence of the components while reducing noise. To achieve this, some compromise between high- and low-pass filtering may be the best solution. This leads to band-pass filtering, in which the highest and the lowest frequencies are filtered out, leaving a suitable frequency band in between. What this band should be depends on the data, and general answers are impossible to give.
In addition to simple low-pass/high-pass filtering, one might also use more sophisticated techniques. For example, one might take the (1-D) wavelet transforms of the data [102, 290, 17]. Other time-frequency decompositions could be used as well.
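A minimal sketch of band-pass preprocessing (the cut-off frequencies, sampling rate, and filter order below are arbitrary choices of ours and would have to be tuned to the data at hand):

import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(X, low_hz, high_hz, fs, order=4):
    """Band-pass filter each mixture (row of X) with a zero-phase Butterworth filter."""
    b, a = butter(order, [low_hz, high_hz], btype='bandpass', fs=fs)
    return filtfilt(b, a, X, axis=1)

# Example: keep roughly 1-40 Hz of signals sampled at 200 Hz before running ICA.
fs = 200.0
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5 * int(fs)))     # 4 mixtures, 5 seconds of toy data
X_band = bandpass(X, 1.0, 40.0, fs)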
13.2 PREPROCESSING BY PCA
A common preprocessing technique for multidimensional data is to reduce its dimension by principal component analysis (PCA). PCA was explained in more detail in Chapter 6. Basically, the data is projected linearly onto a subspace
$$
\tilde{\mathbf{x}} = \mathbf{E}_n^T \mathbf{x}
$$
where the columns of $\mathbf{E}_n$ are the $n$ dominant eigenvectors of the data covariance matrix, so that the maximum amount of information (in the least-squares sense) is preserved. Reducing dimension in this way has several benefits, which we discuss in the next subsections.
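In code, this projection can be sketched as follows (a bare-bones version of our own using the eigendecomposition of the sample covariance matrix; centering is included since it is assumed throughout):

import numpy as np

def pca_reduce(X, n):
    """Project the data (one column per observation) onto its first n principal components."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center the data
    C = np.cov(Xc)                              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    E_n = eigvecs[:, ::-1][:, :n]               # n dominant eigenvectors as columns
    return E_n.T @ Xc                           # x_tilde = E_n^T x for every sample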
13.2.1 Making the mixing matrix square
First, let us consider the case where the number of independent components $n$ is smaller than the number of mixtures, say $m$. Performing ICA on the mixtures directly can cause big problems in such a case, since the basic ICA model does not hold anymore. Using PCA we can reduce the dimension of the data to $n$. After such a reduction, the number of mixtures and ICs are equal, the mixing matrix is square, and the basic ICA model holds.
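A sketch of this reduce-then-separate procedure using scikit-learn (a toy example of our own; it assumes the number of ICs n is known, and the class names PCA and FastICA refer to scikit-learn's implementations):

import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
n, m, T = 3, 10, 2000                       # 3 sources observed through 10 mixtures

S = rng.laplace(size=(T, n))                # scikit-learn expects samples as rows
A = rng.normal(size=(n, m))
X = S @ A                                   # 10-dimensional mixtures of 3 sources

X_reduced = PCA(n_components=n).fit_transform(X)   # now the mixing matrix is square
S_est = FastICA(n_components=n, random_state=0).fit_transform(X_reduced)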
The question is whether PCA is able to find the subspace correctly, so that the $n$ ICs can be estimated from the reduced mixtures. This is not true in general, but in a special case it turns out to be the case. If the data consists of $n$ ICs only, with no noise added, the whole data is contained in an $n$-dimensional subspace. Using PCA for dimension reduction clearly finds this $n$-dimensional subspace, since the eigenvalues corresponding to that subspace, and only those eigenvalues, are nonzero. Thus reducing dimension with PCA works correctly. In practice, the data is usually not exactly contained in the subspace, due to noise and other factors, but if the noise level is low, PCA still finds approximately the right subspace; see Section 6.1.3. In the general case, some "weak" ICs may be lost in the dimension reduction process, but PCA may still be a good idea for optimal estimation of the "strong" ICs [313].
Performing first PCA and then ICA has an interesting interpretation in terms of factor analysis. In factor analysis, it is conventional that after finding the factor subspace, the actual basis vectors for that subspace are determined by some criteria
that make the mixing matrix as simple as possible [166]. This is called factor rotation. Now, ICA can be interpreted as one method for determining this factor rotation, based on higher-order statistics instead of the structure of the mixing matrix.
13.2.2 Reducing noise and preventing overlearning
A well-known benefit of reducing the dimension of the data is that it reduces noise, as was already discussed in Chapter 6. Often, the dimensions that have been omitted consist mainly of noise. This is especially true in the case where the number of ICs is smaller than the number of mixtures.
Another benefit of reducing dimensions is that it prevents overlearning, to which the rest of this subsection is devoted. Overlearning means that if the number of parameters in a statistical model is too large when compared to the number of available data points, the estimation of the parameters becomes difficult, maybe impossible. The estimation of the parameters is then determined too much by the available sample points, instead of the actual process that generated the data, which is what we are really interested in.
Overlearning in ICA [214] typically produces estimates of the ICs that have a single spike or bump, and are practically zero everywhere else. This is because in the space of source signals of unit variance, nongaussianity is more or less maximized by such spike/bump signals. This becomes easily comprehensible if we consider the extreme case where the sample size $T$ equals the dimension of the data $m$, and these are both equal to the number of independent components $n$. Let us collect the realizations $\mathbf{x}(t)$ of $\mathbf{x}$ as the columns of the matrix $\mathbf{X}$, and denote by $\mathbf{S}$ the corresponding matrix of the realizations of $\mathbf{s}(t)$, as in (13.1). Note that now all the matrices in (13.1) are square. This means that by changing the values of $\mathbf{A}$ (and keeping $\mathbf{X}$ fixed), we can give any values whatsoever to the elements of $\mathbf{S}$. This is a case of serious overlearning, not unlike the classic case of regression with equal numbers of data points and parameters.
Thus it is clear that in this case, the estimate of $\mathbf{S}$ that is obtained by ICA estimation depends little on the observed data. Let us assume that the densities of the source signals are known to be supergaussian (i.e., positively kurtotic). Then the ICA estimation basically consists of finding a separating matrix $\mathbf{B}$ that maximizes a measure of the supergaussianities (or sparsities) of the estimates of the source signals. Intuitively, it is easy to see that sparsity is maximized when the source signals each have only one nonzero point. Thus we see that ICA estimation with an insufficient sample size leads to a form of overlearning that gives artifactual (spurious) source signals. Such source signals are characterized by large spikes.
An important fact shown experimentally [214] is that a similar phenomenon is much more likely to occur if the source signals are not independently and identically distributed (i.i.d.) in time, but have strong time-dependencies. In such cases the sample size needed to get rid of overlearning is much larger, and the source signals are better characterized by bumps, i.e., low-pass filtered versions of spikes. An intuitive way of explaining this phenomenon is to consider such a signal as being constant on $N/k$ blocks of $k$ consecutive sample points. This means that the data can be considered as really having only $N/k$ sample points; each sample point has simply been repeated $k$ times. Thus, in the case of overlearning, the estimation procedure gives "spikes" that have a width of $k$ time points, i.e., bumps.
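The spike effect is easy to reproduce in a rough sketch of our own (the sizes below are chosen only to provoke overlearning, and scikit-learn's FastICA may issue convergence warnings in this regime):

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
T, m = 120, 100                               # barely more samples than dimensions

S = rng.laplace(size=(T, m))                  # supergaussian "sources", samples as rows
A = rng.normal(size=(m, m))
X = S @ A                                     # square mixing, no dimension reduction

S_est = FastICA(n_components=m, random_state=0, max_iter=1000).fit_transform(X)

# With so few samples per parameter, the estimates tend to be dominated by a
# single large spike; compare the peak-to-standard-deviation ratio of the
# estimates with that of the true sources.
peak_est = (np.max(np.abs(S_est), axis=0) / np.std(S_est, axis=0)).mean()
peak_true = (np.max(np.abs(S), axis=0) / np.std(S, axis=0)).mean()
print(peak_est, peak_true)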
Here we illustrate the phenomenon by separation of artificial source signals. Three positively kurtotic signals, with 500 sample points each, were used in these simulations, and are depicted in Fig. 13.1a. Five hundred mixtures were produced, and a very small amount of gaussian noise was added to each mixture separately.
As an example of a successful ICA estimation, Fig. 13.1b shows the result of applying the FastICA and maximum likelihood (ML) gradient ascent algorithms (denoted by "Bell-Sejnowski") to the mixed signals. In both approaches, the preprocessing (whitening) stage included a dimension reduction of the data to the first three principal components. It is evident that both algorithms are able to extract all the initial signals.
In contrast, when the whitening is made with very small dimension reduction (we took 400 dimensions), we see the emergence of spiky solutions (like Dirac functions), which is an extreme case of kurtosis maximization (Fig. 13.1c). The algorithm used in FastICA was of a deflationary type, from which we plot the first five components extracted. As for the ML gradient ascent, which was of a symmetric type, we show five representative solutions of the 400 extracted.
Thus, we see here that without dimension reduction, we are not able to estimate the source signals.
Fig. 13.1d presents an intermediate stage of dimension reduction (from the original 500 mixtures we took 50 whitened vectors). We see that the actual source signals are revealed by both methods, even though each resulting vector is noisier than the ones shown in Fig. 13.1b.
For the final example, in Fig. 13.1e, we low-pass filtered the mixed signals, prior to the independent component analysis, using a 10-delay moving average filter. Taking the same number of principal components as in (d), we can see that we lose all the original source signals: the decompositions show a bumpy structure corresponding to the low-pass filtering of the spiky outputs presented in (c). Through low-pass filtering, we have reduced the information contained in the data, and so the estimation is rendered impossible even with this dimension reduction, which is not very weak. Thus, we see that with this low-pass filtered data, a much stronger dimension reduction by PCA is necessary to prevent overlearning.
In addition to PCA, some kind of prior information on the mixing matrix could be useful in preventing overlearning. This is considered in detail in Section 20.1.3.
13.3 HOW MANY COMPONENTS SHOULD BE ESTIMATED?
Another problem that often arises in practice is to decide the number of ICs to be estimated. This problem does not arise if one simply estimates the same number of components as the dimension of the data. This may not always be a good idea, however.
Fig. 13.1 (From [214]) Illustration of the importance of the degree of dimension reduction and filtering in artificially generated data, using FastICA and a gradient algorithm for ML estimation. (a) Original positively kurtotic signals. (b) ICA decomposition in which the preprocessing includes a dimension reduction to the first 3 principal components. (c) Poor, i.e., too weak dimension reduction. (d) Decomposition using an intermediate dimension reduction (50 components retained). (e) Same results as in (d) but using low-pass filtered mixtures.
First, since dimension reduction by PCA is often necessary, one must choose the number of principal components to be retained. This is a classic problem; see Chapter 6. It is usually solved by choosing the minimum number of principal components that explain the data well enough, containing, for example, 90% of the variance. Often, the dimension is actually chosen by trial and error with no theoretical guidelines.
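For the variance-based rule, a short sketch (our own example, using the convenience of scikit-learn's PCA, where a fractional n_components keeps just enough components to explain that share of the variance):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30)) @ rng.normal(size=(30, 30))   # toy correlated data

pca = PCA(n_components=0.90)      # smallest number of PCs explaining 90% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())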
Second, for computational reasons we may prefer to estimate only a smaller number of ICs than the dimension of the data (after PCA preprocessing). This is the case when the dimension of the data is very large, and we do not want to reduce the dimension by PCA too much, since PCA always contains the risk of not including the ICs in the reduced data. Using FastICA and other algorithms that allow estimation of a smaller number of components, we can thus perform a kind of dimension reduction by ICA. In fact, this is an idea somewhat similar to projection pursuit. Here, it is even more difficult to give any guidelines as to how many components should be estimated. Trial and error may be the only method applicable.
Information-theoretic, Bayesian, and other criteria for determining the number of ICs are discussed in more detail in [231, 81, 385].
13.4 CHOICE OF ALGORITHM
Now we shall briefly discuss the choice of ICA algorithm from a practical viewpoint. As will be discussed in detail in Chapter 14, most estimation principles and objective functions for ICA are equivalent, at least in theory. So, the main choice is reduced to a couple of points:
One choice is between estimating all the independent components in parallel, or just estimating a few of them (possibly one-by-one). This corresponds to choosing between symmetric and hierarchical decorrelation. In most cases, symmetric decorrelation is recommended. Deflation is mainly useful in the case where we want to estimate only a very limited number of ICs, and in other special cases. The disadvantage with deflationary orthogonalization is that the estimation errors in the components that are estimated first accumulate and increase the errors in the later components.
One must also choose the nonlinearity used in the algorithms. It seems that the robust, nonpolynomial nonlinearities are to be preferred in most applications. The simplest thing to do is to just use the tanh function as the nonlinearity $g$. This is sufficient when using FastICA. (When using gradient algorithms, especially in the ML framework, a second function needs to be used as well; see Chapter 9.) These choices are illustrated in the code sketch after this list.
Finally, there is the choice between on-line and batch algorithms. In most cases, the whole data set is available before the estimation, which is called in different contexts batch, block, or off-line estimation. This is the case where FastICA can be used, and it is the algorithm that we recommend. On-line or adaptive algorithms are needed in signal-processing applications where the mixing matrix may change on-line, and fast tracking is needed. In the on-line case, the recommended algorithms are those obtained by stochastic gradient methods. It should also be noted that in some cases, the FastICA algorithm may not converge well, as Newton-type algorithms sometimes exhibit oscillatory behavior. This problem can be alleviated by using gradient methods, or combinations of the two (see [197]).
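These choices map directly onto the options of typical batch implementations; for instance, in scikit-learn's FastICA (a sketch, with the caveat that scikit-learn's 'logcosh' contrast is the one whose derivative is the tanh nonlinearity $g$):

from sklearn.decomposition import FastICA

# Symmetric (parallel) estimation of all components with the tanh-based contrast:
ica_symmetric = FastICA(n_components=5, algorithm='parallel', fun='logcosh',
                        random_state=0)

# Deflationary estimation, useful when only a few components are wanted:
ica_deflation = FastICA(n_components=2, algorithm='deflation', fun='logcosh',
                        random_state=0)

# Both are batch (off-line) estimators: they are fitted on the whole data set
# at once, e.g. S_est = ica_symmetric.fit_transform(X) with samples as rows of X.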
13.5 CONCLUDING REMARKS AND REFERENCES
In this chapter, we considered some practical problems in ICA. When dealing with time signals, low-pass filtering of the data is useful to reduce noise. On the other hand, high-pass filtering, or computing innovation processes, is useful to increase the independence and nongaussianity of the components. One of these, or their combination, may be very useful in practice. Another very useful thing to do is to reduce the dimension of the data by PCA. This reduces noise and prevents overlearning. It may also solve the problems with data that has a smaller number of ICs than mixtures.
Problems
13.1 Take a Fourier transform of every observed signal $x_i(t)$. Does the ICA model still hold, and in what way?
13.2 Prove the theorem on innovations.
Computer assignments
13.1 Take a gaussian white noise sequence. Low-pass filter it by a low-pass filter with coefficients (..., 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, ...). What does the signal look like?
13.2 High-pass filter the gaussian white noise sequence. What does the signal look like?
13.3 Generate 100 samples of 100 independent components. Run FastICA on this data without any mixing. What do the estimated ICs look like? Is the estimate of the mixing matrix close to identity?