13 Practical Considerations
In the preceding chapters, we presented several approaches for the estimation of the independent component analysis (ICA) model. In particular, several algorithms were proposed for the estimation of the basic version of the model, which has a square mixing matrix and no noise. Now we are, in principle, ready to apply those algorithms on real data sets. Many such applications will be discussed in Part IV.
However, when applying the ICA algorithms to real data, some practical considerations arise and need to be taken into account. In this chapter, we discuss different problems that may arise, in particular, overlearning and noise in the data.
We also propose some preprocessing techniques (dimension reduction by principal component analysis, time filtering) that may be useful and even necessary before the application of the ICA algorithms in practice.
13.1 PREPROCESSING BY TIME FILTERING
The success of ICA for a given data set may depend crucially on performing some application-dependent preprocessing steps. In the basic methods discussed in the previous chapters, we always used centering in preprocessing, and often whitening was done as well. Here we discuss further preprocessing methods that are not necessary in theory, but are often very useful in practice.
13.1.1 Why time filtering is possible
In many cases, the observed random variables are, in fact, time signals or time series, which means that they describe the time course of some phenomenon or system. Thus the sample index $t$ in $x_i(t)$ is a time index. In such a case, it may be very useful to filter the signals. In other words, this means taking moving averages of the time series. Of course, in the ICA model no time structure is assumed, so filtering is not always possible: if the sample points $\mathbf{x}(t)$ cannot be ordered in any meaningful way with respect to $t$, filtering is not meaningful, either.
For time series, any linear filtering of the signals is allowed, since it does not change the ICA model. In fact, if we filter linearly the observed signals $x_i(t)$ to obtain new signals, say $x_i^*(t)$, the ICA model still holds for $x_i^*(t)$, with the same mixing matrix. This can be seen as follows. Denote by $\mathbf{X}$ the matrix that contains the observations $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ as its columns, and similarly for $\mathbf{S}$. Then the ICA model can be expressed as:
$$
\mathbf{X} = \mathbf{A}\mathbf{S} \tag{13.1}
$$
Now, time filtering of $\mathbf{X}$ corresponds to multiplying $\mathbf{X}$ from the right by a matrix, let us call it $\mathbf{M}$. This gives
$$
\mathbf{X}^* = \mathbf{X}\mathbf{M} = \mathbf{A}\mathbf{S}\mathbf{M} = \mathbf{A}\mathbf{S}^* \tag{13.2}
$$
which shows that the ICA model still remains valid. The independent components are filtered by the same filtering that was applied on the mixtures. They are not mixed with each other in $\mathbf{S}^*$, because the matrix $\mathbf{M}$ is by definition a component-wise filtering matrix.
Since the mixing matrix remains unchanged, we can use the filtered data in the ICA estimation method only. After estimating the mixing matrix, we can apply the same mixing matrix on the original data to obtain the independent components. The question then arises what kind of filtering could be useful. In the following, we consider three different kinds of filtering: high-pass and low-pass filtering, as well as their compromise.
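As a concrete illustration (a minimal sketch of our own, not taken from the original text; all variable names are ours), the following Python fragment verifies numerically that filtering the mixtures is equivalent to mixing the filtered sources, so the mixing matrix is unchanged:

import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 1000

S = rng.laplace(size=(n, T))           # supergaussian sources, one per row
A = rng.normal(size=(n, n))            # random square mixing matrix
X = A @ S                              # observed mixtures, X = A S, as in (13.1)

# Component-wise filtering from the right, implemented here by filtering
# each row with the same moving-average kernel.
kernel = np.ones(3) / 3.0
def filter_rows(Z):
    return np.apply_along_axis(lambda z: np.convolve(z, kernel, mode='same'), 1, Z)

X_filt = filter_rows(X)                # X* = X M
S_filt = filter_rows(S)                # S* = S M

# The filtered mixtures are the *same* mixture of the filtered sources:
print(np.allclose(X_filt, A @ S_filt))   # True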
13.1.2 Low-pass filtering
Basically, low-pass filtering means that every sample point is replaced by a weighted average of that point and the points immediately before it.¹ This is a form of smoothing the data. Then the matrix $\mathbf{M}$ in (13.2) would be something like
$$
\mathbf{M} = \frac{1}{3}
\begin{pmatrix}
\cdots & 1 & 1 & 1 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 1 & 1 & 1 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 1 & 1 & 1 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 1 & 1 & 1 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 1 & 1 & 1 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & \cdots
\end{pmatrix}
\tag{13.3}
$$
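As a rough sketch (our own construction, with the boundary handling chosen arbitrarily), a filtering matrix of this causal moving-average type can be built and applied in Python as follows:

import numpy as np

def moving_average_matrix(T, width=3):
    """T x T filtering matrix M: column t averages the `width` points up to time t."""
    M = np.zeros((T, T))
    for t in range(T):
        lo = max(0, t - width + 1)
        M[lo:t + 1, t] = 1.0 / width
    return M

T = 8
M = moving_average_matrix(T)
X = np.arange(2 * T, dtype=float).reshape(2, T)   # two toy signals as rows
X_filt = X @ M                                    # filtered mixtures, as in (13.2)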
Low-pass filtering is often used because it tends to reduce noise. This is a well-known property in signal processing that is explained in most basic signal processing textbooks.
In the basic ICA model, the effect of noise is more or less neglected; see Chapter 15 for a detailed discussion. Thus basic ICA methods work much better with data that does not have much noise, and reducing noise is thus useful and sometimes even necessary.
A possible problem with low-pass filtering is that it reduces the information in the data, since the fast-changing, high-frequency features of the data are lost. It often happens that this leads to a reduction of independence as well (see next section).
13.1.3 High-pass filtering and innovations
High-pass filtering is the opposite of low-pass filtering. The point is to remove slowly changing trends from the data. Thus a low-pass filtered version is subtracted from the signal. A classic way of doing high-pass filtering is differencing, which means replacing every sample point by the difference between the value at that point and the value at the preceding point. Thus, the matrix $\mathbf{M}$ in (13.2) would be
$$
\mathbf{M} =
\begin{pmatrix}
\cdots & 1 & -1 & 0 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 1 & -1 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 1 & -1 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 1 & -1 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 1 & -1 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 0 & 1 & -1 & \cdots
\end{pmatrix}
\tag{13.4}
$$
¹ To have a causal filter, points after the current point may be left out of the averaging.
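A corresponding sketch for the differencing filter (again our own illustration; the first column is handled arbitrarily):

import numpy as np

def differencing_matrix(T):
    """T x T matrix M such that (X @ M)[:, t] = X[:, t] - X[:, t-1] for t >= 1."""
    M = np.eye(T)
    M -= np.eye(T, k=1)   # -1 just above the diagonal: column t also hits row t-1
    return M

T = 6
x = np.array([[1., 4., 9., 16., 25., 36.]])   # one toy signal as a row vector
x_hp = x @ differencing_matrix(T)             # first entry is just x[:, 0]
# For columns 1..T-1 this is equivalent to np.diff(x, axis=1).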
High-pass filtering may be useful in ICA because in certain cases it increases the independence of the components. It often happens in practice that the components have slowly changing trends or fluctuations, in which case they are not very independent. If these slow fluctuations are removed by high-pass filtering, the filtered components are often much more independent. A more principled approach to high-pass filtering is to consider it in the light of innovation processes.
Innovation processes  Given a stochastic process $s(t)$, we define its innovation process $\tilde{s}(t)$ as the error of the best prediction of $s(t)$, given its past. Such a best prediction is given by the conditional expectation of $s(t)$ given its past, because it is the expected value of the conditional distribution of $s(t)$ given its past. Thus the innovation process $\tilde{s}(t)$ of $s(t)$ is defined by
$$
\tilde{s}(t) = s(t) - E\{s(t) \mid s(t-1), s(t-2), \ldots\} \tag{13.5}
$$
The expression "innovation" describes the fact that $\tilde{s}(t)$ contains all the new information about the process that can be obtained at time $t$ by observing $s(t)$.
The concept of innovations can be utilized in the estimation of the ICA model due to the following property:
Theorem 13.1 If $\mathbf{x}(t)$ and $\mathbf{s}(t)$ follow the basic ICA model, then the innovation processes $\tilde{\mathbf{x}}(t)$ and $\tilde{\mathbf{s}}(t)$ follow the ICA model as well. In particular, the components $\tilde{s}_i(t)$ are independent from each other.
On the other hand, independence of the innovations does not imply the independence of the $s_i(t)$. Thus, the innovations are more often independent from each other than the original processes. Moreover, one could argue that the innovations are usually more nongaussian than the original processes. This is because $s_i(t)$ is a kind of moving average of the innovation process, and sums tend to be more gaussian than the original variables. Together these mean that the innovation process is more likely to be independent and nongaussian, and thus to fulfill the basic assumptions of ICA.
Innovation processes were discussed in more detail in [194], where it was also shown that using innovations, it is possible to separate signals (images of faces) that are otherwise strongly correlated and very difficult to separate.
The connection between innovations and ordinary filtering techniques is that the computation of the innovation process is often rather similar to high-pass filtering. Thus, the arguments in favor of using innovation processes apply at least partly in favor of high-pass filtering.
A possible problem with high-pass filtering, however, is that it may increase noise, for the same reasons that low-pass filtering decreases noise.
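As a hedged illustration (our own simplification, not from the text): the conditional expectation in (13.5) is in general hard to compute, but for roughly linear processes it can be approximated by a linear autoregressive predictor fitted by least squares, and the prediction residual then serves as an approximate innovation process:

import numpy as np

def approx_innovation(s, order=5):
    """Approximate innovation process of a 1-D signal s: the residual of a
    least-squares AR predictor of the given order, used here as a linear
    stand-in for the conditional expectation in (13.5)."""
    T = len(s)
    # Regression matrix of past values s(t-1), ..., s(t-order).
    past = np.column_stack([s[order - k - 1: T - k - 1] for k in range(order)])
    target = s[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    return target - past @ coeffs     # ~s(t) = s(t) - E{s(t) | past}, approximately

rng = np.random.default_rng(0)
innov = rng.laplace(size=1000)        # nongaussian innovations
s = np.zeros(1000)
for t in range(1, 1000):              # slowly varying AR(1) process built on them
    s[t] = 0.9 * s[t - 1] + innov[t]
s_tilde = approx_innovation(s)        # approximately recovers the innovations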
13.1.4 Optimal filtering
Both of the preceding types of filtering have their pros and cons. The optimum would be to find a filter that increases the independence of the components while reducing noise. To achieve this, some compromise between high- and low-pass filtering may be the best solution. This leads to band-pass filtering, in which the highest and the lowest frequencies are filtered out, leaving a suitable frequency band in between. What this band should be depends on the data, and general answers are impossible to give.
In addition to simple low-pass/high-pass filtering, one might also use more sophisticated techniques. For example, one might take the (1-D) wavelet transforms of the data [102, 290, 17]. Other time-frequency decompositions could be used as well.
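A minimal sketch of band-pass preprocessing (the cut-off frequencies, sampling rate, and filter order below are arbitrary choices of ours and would have to be tuned to the data at hand):

import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(X, low_hz, high_hz, fs, order=4):
    """Band-pass filter each mixture (row of X) with a zero-phase Butterworth filter."""
    b, a = butter(order, [low_hz, high_hz], btype='bandpass', fs=fs)
    return filtfilt(b, a, X, axis=1)

# Example: keep roughly 1-40 Hz of signals sampled at 200 Hz before running ICA.
fs = 200.0
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5 * int(fs)))     # 4 mixtures, 5 seconds of toy data
X_band = bandpass(X, 1.0, 40.0, fs)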
13.2 PREPROCESSING BY PCA
A common preprocessing technique for multidimensional data is to reduce its dimension by principal component analysis (PCA). PCA was explained in more detail in Chapter 6. Basically, the data is projected linearly onto a subspace
$$
\tilde{\mathbf{x}} = \mathbf{E}_n^T \mathbf{x}
$$
where the columns of $\mathbf{E}_n$ are the $n$ dominant eigenvectors of the data covariance matrix, so that the maximum amount of information (in the least-squares sense) is preserved. Reducing dimension in this way has several benefits, which we discuss in the next subsections.
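In code, this projection can be sketched as follows (a bare-bones version of our own using the eigendecomposition of the sample covariance matrix; centering is included since it is assumed throughout):

import numpy as np

def pca_reduce(X, n):
    """Project the data (one column per observation) onto its first n principal components."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center the data
    C = np.cov(Xc)                              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    E_n = eigvecs[:, ::-1][:, :n]               # n dominant eigenvectors as columns
    return E_n.T @ Xc                           # x_tilde = E_n^T x for every sample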
13.2.1 Making the mixing matrix square
First, let us consider the case where the number of independent components $n$ is smaller than the number of mixtures, say $m$. Performing ICA on the mixtures directly can cause big problems in such a case, since the basic ICA model does not hold anymore. Using PCA we can reduce the dimension of the data to $n$. After such a reduction, the number of mixtures and ICs are equal, the mixing matrix is square, and the basic ICA model holds.
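A sketch of this reduce-then-separate procedure using scikit-learn (a toy example of our own; it assumes the number of ICs n is known, and the class names PCA and FastICA refer to scikit-learn's implementations):

import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
n, m, T = 3, 10, 2000                       # 3 sources observed through 10 mixtures

S = rng.laplace(size=(T, n))                # scikit-learn expects samples as rows
A = rng.normal(size=(n, m))
X = S @ A                                   # 10-dimensional mixtures of 3 sources

X_reduced = PCA(n_components=n).fit_transform(X)   # now the mixing matrix is square
S_est = FastICA(n_components=n, random_state=0).fit_transform(X_reduced)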
The question is whether PCA is able to find the subspace correctly, so that the $n$ ICs can be estimated from the reduced mixtures. This is not true in general, but in a special case it turns out to be the case. If the data consists of $n$ ICs only, with no noise added, the whole data is contained in an $n$-dimensional subspace. Using PCA for dimension reduction clearly finds this $n$-dimensional subspace, since the eigenvalues corresponding to that subspace, and only those eigenvalues, are nonzero. Thus reducing dimension with PCA works correctly. In practice, the data is usually not exactly contained in the subspace, due to noise and other factors, but if the noise level is low, PCA still finds approximately the right subspace; see Section 6.1.3. In the general case, some "weak" ICs may be lost in the dimension reduction process, but PCA may still be a good idea for optimal estimation of the "strong" ICs [313].
Performing first PCA and then ICA has an interesting interpretation in terms of factor analysis. In factor analysis, it is conventional that after finding the factor subspace, the actual basis vectors for that subspace are determined by some criteria
that make the mixing matrix as simple as possible [166]. This is called factor rotation. Now, ICA can be interpreted as one method for determining this factor rotation, based on higher-order statistics instead of the structure of the mixing matrix.
13.2.2 Reducing noise and preventing overlearning
A well-known benefit of reducing the dimension of the data is that it reduces noise, as was already discussed in Chapter 6. Often, the dimensions that have been omitted consist mainly of noise. This is especially true in the case where the number of ICs is smaller than the number of mixtures.
Another benefit of reducing dimensions is that it prevents overlearning, to which the rest of this subsection is devoted. Overlearning means that if the number of parameters in a statistical model is too large when compared to the number of available data points, the estimation of the parameters becomes difficult, maybe impossible. The estimation of the parameters is then determined too much by the available sample points, instead of the actual process that generated the data, which is what we are really interested in.
Overlearning in ICA [214] typically produces estimates of the ICs that have a single spike or bump, and are practically zero everywhere else. This is because in the space of source signals of unit variance, nongaussianity is more or less maximized by such spike/bump signals. This becomes easily comprehensible if we consider the extreme case where the sample size $T$ equals the dimension of the data $m$, and these are both equal to the number of independent components $n$. Let us collect the realizations $\mathbf{x}(t)$ of $\mathbf{x}$ as the columns of the matrix $\mathbf{X}$, and denote by $\mathbf{S}$ the corresponding matrix of the realizations of $\mathbf{s}(t)$, as in (13.1). Note that now all the matrices in (13.1) are square. This means that by changing the values of $\mathbf{A}$ (and keeping $\mathbf{X}$ fixed), we can give any values whatsoever to the elements of $\mathbf{S}$. This is a case of serious overlearning, not unlike the classic case of regression with equal numbers of data points and parameters.
Thus it is clear that in this case, the estimate of $\mathbf{S}$ that is obtained by ICA estimation depends little on the observed data. Let us assume that the densities of the source signals are known to be supergaussian (i.e., positively kurtotic). Then the ICA estimation basically consists of finding a separating matrix $\mathbf{B}$ that maximizes a measure of the supergaussianities (or sparsities) of the estimates of the source signals. Intuitively, it is easy to see that sparsity is maximized when the source signals each have only one nonzero point. Thus we see that ICA estimation with an insufficient sample size leads to a form of overlearning that gives artifactual (spurious) source signals. Such source signals are characterized by large spikes.
An important fact shown experimentally [214] is that a similar phenomenon is much more likely to occur if the source signals are not independently and identically distributed (i.i.d.) in time, but have strong time-dependencies. In such cases the sample size needed to get rid of overlearning is much larger, and the source signals are better characterized by bumps, i.e., low-pass filtered versions of spikes. An intuitive way of explaining this phenomenon is to consider such a signal as being constant on $N/k$ blocks of $k$ consecutive sample points. This means that the data can be considered as really having only $N/k$ sample points; each sample point has simply been repeated $k$ times. Thus, in the case of overlearning, the estimation procedure gives "spikes" that have a width of $k$ time points, i.e., bumps.
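The spike effect is easy to reproduce in a rough sketch of our own (the sizes below are chosen only to provoke overlearning, and scikit-learn's FastICA may issue convergence warnings in this regime):

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
T, m = 120, 100                               # barely more samples than dimensions

S = rng.laplace(size=(T, m))                  # supergaussian "sources", samples as rows
A = rng.normal(size=(m, m))
X = S @ A                                     # square mixing, no dimension reduction

S_est = FastICA(n_components=m, random_state=0, max_iter=1000).fit_transform(X)

# With so few samples per parameter, the estimates tend to be dominated by a
# single large spike; compare the peak-to-standard-deviation ratio of the
# estimates with that of the true sources.
peak_est = (np.max(np.abs(S_est), axis=0) / np.std(S_est, axis=0)).mean()
peak_true = (np.max(np.abs(S), axis=0) / np.std(S, axis=0)).mean()
print(peak_est, peak_true)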
Here we illustrate the phenomenon by separation of artificial source signals. Three positively kurtotic signals, with 500 sample points each, were used in these simulations, and are depicted in Fig. 13.1a. Five hundred mixtures were produced, and a very small amount of gaussian noise was added to each mixture separately.
As an example of a successful ICA estimation, Fig. 13.1b shows the result of applying the FastICA and maximum likelihood (ML) gradient ascent algorithms (denoted by "Bell-Sejnowski") to the mixed signals. In both approaches, the preprocessing (whitening) stage included a dimension reduction of the data to the first three principal components. It is evident that both algorithms are able to extract all the initial signals.
In contrast, when the whitening is made with very small dimension reduction (we took 400 dimensions), we see the emergence of spiky solutions (like Dirac functions), which is an extreme case of kurtosis maximization (Fig. 13.1c). The algorithm used in FastICA was of a deflationary type, from which we plot the first five components extracted. As for the ML gradient ascent, which was of a symmetric type, we show five representative solutions of the 400 extracted.
Thus, we see here that without dimension reduction, we are not able to estimate the source signals.
Fig. 13.1d presents an intermediate stage of dimension reduction (from the original 500 mixtures we took 50 whitened vectors). We see that the actual source signals are revealed by both methods, even though each resulting vector is noisier than the ones shown in Fig. 13.1b.
For the final example, in Fig. 13.1e, we low-pass filtered the mixed signals, prior to the independent component analysis, using a 10-delay moving average filter. Taking the same number of principal components as in (d), we can see that we lose all the original source signals: the decompositions show a bumpy structure corresponding to the low-pass filtering of the spiky outputs presented in (c). Through low-pass filtering, we have reduced the information contained in the data, and so the estimation is rendered impossible even with this dimension reduction, which is not very weak. Thus, we see that with this low-pass filtered data, a much stronger dimension reduction by PCA is necessary to prevent overlearning.
In addition to PCA, some kind of prior information on the mixing matrix could be useful in preventing overlearning. This is considered in detail in Section 20.1.3.
13.3 HOW MANY COMPONENTS SHOULD BE ESTIMATED?
Another problem that often arises in practice is to decide the number of ICs to be estimated. This problem does not arise if one simply estimates the same number of components as the dimension of the data. This may not always be a good idea, however.
Fig. 13.1 (From [214]) Illustration of the importance of the degree of dimension reduction and filtering in artificially generated data, using FastICA and a gradient algorithm for ML estimation. (a) Original positively kurtotic signals. (b) ICA decomposition in which the preprocessing includes a dimension reduction to the first 3 principal components. (c) Poor, i.e., too weak dimension reduction. (d) Decomposition using an intermediate dimension reduction (50 components retained). (e) Same results as in (d) but using low-pass filtered mixtures.
First, since dimension reduction by PCA is often necessary, one must choose the number of principal components to be retained. This is a classic problem; see Chapter 6. It is usually solved by choosing the minimum number of principal components that explain the data well enough, containing, for example, 90% of the variance. Often, the dimension is actually chosen by trial and error with no theoretical guidelines.
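For the variance-based rule, a short sketch (our own example, using the convenience of scikit-learn's PCA, where a fractional n_components keeps just enough components to explain that share of the variance):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30)) @ rng.normal(size=(30, 30))   # toy correlated data

pca = PCA(n_components=0.90)      # smallest number of PCs explaining 90% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())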
Second, for computational reasons we may prefer to estimate only a smaller number of ICs than the dimension of the data (after PCA preprocessing). This is the case when the dimension of the data is very large, and we do not want to reduce the dimension by PCA too much, since PCA always contains the risk of not including the ICs in the reduced data. Using FastICA and other algorithms that allow estimation of a smaller number of components, we can thus perform a kind of dimension reduction by ICA. In fact, this is an idea somewhat similar to projection pursuit. Here, it is even more difficult to give any guidelines as to how many components should be estimated. Trial and error may be the only method applicable.
Information-theoretic, Bayesian, and other criteria for determining the number of ICs are discussed in more detail in [231, 81, 385].
13.4 CHOICE OF ALGORITHM
Now we shall briefly discuss the choice of ICA algorithm from a practical viewpoint. As will be discussed in detail in Chapter 14, most estimation principles and objective functions for ICA are equivalent, at least in theory. So, the main choice is reduced to a couple of points:
One choice is between estimating all the independent components in parallel, or just estimating a few of them (possibly one-by-one). This corresponds to choosing between symmetric and hierarchical decorrelation. In most cases, symmetric decorrelation is recommended. Deflation is mainly useful in the case where we want to estimate only a very limited number of ICs, and in other special cases. The disadvantage with deflationary orthogonalization is that the estimation errors in the components that are estimated first accumulate and increase the errors in the later components.
One must also choose the nonlinearity used in the algorithms. It seems that the robust, nonpolynomial nonlinearities are to be preferred in most applications. The simplest thing to do is to just use the tanh function as the nonlinearity $g$. This is sufficient when using FastICA. (When using gradient algorithms, especially in the ML framework, a second function needs to be used as well; see Chapter 9.) These choices are illustrated in the code sketch after this list.
Finally, there is the choice between on-line and batch algorithms. In most cases, the whole data set is available before the estimation, which is called in different contexts batch, block, or off-line estimation. This is the case where FastICA can be used, and it is the algorithm that we recommend. On-line or adaptive algorithms are needed in signal-processing applications where the mixing matrix may change on-line, and fast tracking is needed. In the on-line case, the recommended algorithms are those obtained by stochastic gradient methods. It should also be noted that in some cases, the FastICA algorithm may not converge well, as Newton-type algorithms sometimes exhibit oscillatory behavior. This problem can be alleviated by using gradient methods, or combinations of the two (see [197]).
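These choices map directly onto the options of typical batch implementations; for instance, in scikit-learn's FastICA (a sketch, with the caveat that scikit-learn's 'logcosh' contrast is the one whose derivative is the tanh nonlinearity $g$):

from sklearn.decomposition import FastICA

# Symmetric (parallel) estimation of all components with the tanh-based contrast:
ica_symmetric = FastICA(n_components=5, algorithm='parallel', fun='logcosh',
                        random_state=0)

# Deflationary estimation, useful when only a few components are wanted:
ica_deflation = FastICA(n_components=2, algorithm='deflation', fun='logcosh',
                        random_state=0)

# Both are batch (off-line) estimators: they are fitted on the whole data set
# at once, e.g. S_est = ica_symmetric.fit_transform(X) with samples as rows of X.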
13.5 CONCLUDING REMARKS AND REFERENCES
In this chapter, we considered some practical problems in ICA. When dealing with time signals, low-pass filtering of the data is useful to reduce noise. On the other hand, high-pass filtering, or computing innovation processes, is useful to increase the independence and nongaussianity of the components. One of these, or their combination, may be very useful in practice. Another very useful thing to do is to reduce the dimension of the data by PCA. This reduces noise and prevents overlearning. It may also solve the problems with data that has a smaller number of ICs than mixtures.
Problems
13.1 Take a Fourier transform of every observed signal $x_i(t)$. Does the ICA model still hold, and in what way?
13.2 Prove the theorem on innovations.
Computer assignments
13.1 Take a gaussian white noise sequence. Low-pass filter it by a low-pass filter with coefficients (..., 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, ...). What does the signal look like?
13.2 High-pass filter the gaussian white noise sequence. What does the signal look like?
13.3 Generate 100 samples of 100 independent components. Run FastICA on this data without any mixing. What do the estimated ICs look like? Is the estimate of the mixing matrix close to identity?