20 Other Extensions
In this chapter, we present some additional extensions of the basic independent component analysis (ICA) model. First, we discuss the use of prior information on the mixing matrix, especially on its sparseness. Second, we present models that somewhat relax the assumption of the independence of the components. In the model called independent subspace analysis, the components are divided into subspaces that are independent, but the components inside the subspaces are not independent. In the model of topographic ICA, higher-order dependencies are modeled by a topographic organization. Finally, we show how to adapt some of the basic ICA algorithms to the case where the data is complex-valued instead of real-valued.
20.1 PRIORS ON THE MIXING MATRIX
20.1.1 Motivation for prior information
No prior knowledge on the mixing matrix is used in the basic ICA model. This has the advantage of giving the model great generality. In many application areas, however, information on the form of the mixing matrix is available. Using prior information on the mixing matrix is likely to give better estimates of the matrix for a given number of data points. This is of great importance in situations where the computational costs of ICA estimation are so high that they severely restrict the amount of data that can be used, as well as in situations where the amount of data is restricted due to the nature of the application.
This situation can be compared to that found in nonlinear regression, where overlearning or overfitting is a very general phenomenon [48]. The classic way of avoiding overlearning in regression is to use regularizing priors, which typically penalize regression functions that have large curvatures, i.e., lots of “wiggles”. This makes it possible to use regression methods even when the number of parameters in the model is very large compared to the number of observed data points. In the extreme theoretical case, the number of parameters is infinite, but the model can still be estimated from finite amounts of data by using prior information. Thus suitable priors can reduce the overlearning that was discussed in Section 13.2.2.
One example of using prior knowledge that predates modern ICA methods is the literature on beamforming (see the discussion in [72]), where a very specific form of the mixing matrix is represented by a small number of parameters. Another example is the application of ICA to magnetoencephalography (see Chapter 22), where it has been found that the independent components (ICs) can be modeled by the classic dipole model, which shows how to constrain the form of the mixing coefficients [246]. The problem with these methods, however, is that they may be applicable to a few data sets only, and lose the generality that is one of the main factors in the current flood of interest in ICA.
Prior information can be taken into account in ICA estimation by using Bayesian prior distributions for the parameters. This means that the parameters, which in this case are the elements of the mixing matrix, are treated as random variables. They have a certain distribution and are thus more likely to assume certain values than others. A short introduction to Bayesian estimation was given in Section 4.6.
In this section, we present a form of prior information on the mixing matrix that is both general enough to be used in many applications and strong enough to increase the performance of ICA estimation. To give some background, we first investigate the possibility of using two simple classes of priors for the mixing matrix $\mathbf{A}$: Jeffreys' prior and quadratic priors. We come to the conclusion that these two classes are not very useful in ICA. Then we introduce the concept of sparse priors. These are priors that enforce a sparse structure on the mixing matrix. In other words, the prior penalizes mixing matrices with a larger number of significantly nonzero entries. Thus this form of prior is analogous to the widely used prior knowledge on the supergaussianity or sparseness of the independent components. In fact, due to this similarity, sparse priors are so-called conjugate priors, which implies that estimation using this kind of prior is particularly easy: ordinary ICA methods can be simply adapted to using such priors.
20.1.2 Classic priors
In the following, we assume that the estimator $\mathbf{B}$ of the inverse of the mixing matrix $\mathbf{A}$ is constrained so that the estimates of the independent components $\mathbf{y} = \mathbf{B}\mathbf{x}$ are white, i.e., decorrelated and of unit variance: $E\{\mathbf{y}\mathbf{y}^T\} = \mathbf{I}$. This restriction greatly facilitates the analysis. It is basically equivalent to first whitening the data and then restricting $\mathbf{B}$ to be orthogonal, but here we do not want to restrict the generality of
these results by whitening. We concentrate here on formulating priors for $\mathbf{B} = \mathbf{A}^{-1}$. Completely analogous results hold for priors on $\mathbf{A}$.
Jeffreys' prior The classic prior in Bayesian inference is Jeffreys' prior. It is considered a maximally uninformative prior, which already indicates that it is probably not useful for our purpose.
Indeed, it was shown in [342] that Jeffreys' prior for the basic ICA model has the form
$$p(\mathbf{B}) \propto |\det \mathbf{B}|^{-1}$$
Now, the constraint of whiteness of $\mathbf{y} = \mathbf{B}\mathbf{x}$ means that $\mathbf{B}$ can be expressed as $\mathbf{B} = \mathbf{W}\mathbf{V}$, where $\mathbf{V}$ is a constant whitening matrix and $\mathbf{W}$ is restricted to be orthogonal. But we have $|\det \mathbf{B}| = |\det \mathbf{W}|\,|\det \mathbf{V}| = |\det \mathbf{V}|$, which implies that Jeffreys' prior is constant in the space of allowed estimators (i.e., decorrelating $\mathbf{B}$). Thus we see that Jeffreys' prior has no effect on the estimator, and therefore cannot reduce overlearning.
Quadratic priors In regression, the use of quadratic regularizing priors is very common [48]. It would be tempting to try to use the same idea in the context of ICA. Especially in feature extraction, we could require the columns of $\mathbf{A}$, i.e., the features, to be smooth in the same sense as smoothness is required from regression functions. In other words, we could consider every column of $\mathbf{A}$ as a discrete approximation of a smooth function, and choose a prior that imposes smoothness for the underlying continuous function. Similar arguments hold for priors defined on the rows of $\mathbf{B}$, i.e., the filters corresponding to the features.

The simplest class of regularizing priors is given by quadratic priors. We will show here, however, that such quadratic regularizers, at least the simple class that we define below, do not change the estimator.
Consider priors that are of the form
$$\log p(\mathbf{B}) = -\sum_{i=1}^{n} \mathbf{b}_i^T \mathbf{M} \mathbf{b}_i + \mathrm{const.}$$
where the $\mathbf{b}_i^T$ are the rows of $\mathbf{B} = \mathbf{A}^{-1}$, and $\mathbf{M}$ is a matrix that defines the quadratic prior. For example, for $\mathbf{M} = \mathbf{I}$ we have a “weight decay” prior $\log p(\mathbf{B}) = -\sum_i \|\mathbf{b}_i\|^2$
that is often used to penalize large elements in $\mathbf{B}$. Alternatively, we could include in $\mathbf{M}$ some differential operators so that the prior would measure the “smoothnesses” of the $\mathbf{b}_i$, in the sense explained above. The prior can be manipulated algebraically to yield
$$\sum_{i=1}^{n} \mathbf{b}_i^T \mathbf{M} \mathbf{b}_i = \sum_{i=1}^{n} \mathrm{tr}(\mathbf{M}\mathbf{b}_i\mathbf{b}_i^T) = \mathrm{tr}(\mathbf{M}\mathbf{B}^T\mathbf{B})$$
Quadratic priors have little significance in ICA estimation, however. To see this, let us constrain the estimates of the independent components to be white as previously. This means that we have
$$E\{\mathbf{y}\mathbf{y}^T\} = E\{\mathbf{B}\mathbf{x}\mathbf{x}^T\mathbf{B}^T\} = \mathbf{B}\mathbf{C}\mathbf{B}^T = \mathbf{I}$$
in the space of allowed estimates, where $\mathbf{C} = E\{\mathbf{x}\mathbf{x}^T\}$ denotes the covariance matrix of the data. After some algebraic manipulations, this gives
$$\mathbf{B}^T\mathbf{B} = \mathbf{C}^{-1}$$
Now we see that
$$\sum_{i=1}^{n} \mathbf{b}_i^T \mathbf{M} \mathbf{b}_i = \mathrm{tr}(\mathbf{M}\mathbf{C}^{-1})$$
In other words, the quadratic prior is constant. The same result can be proven for a quadratic prior on $\mathbf{A}$. Thus, quadratic priors are of little interest in ICA.
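The constancy of the quadratic prior over the allowed estimators is easy to verify numerically. The following is a small numpy sketch (our own illustration, not from the text; all names are assumptions): it builds an arbitrary covariance $\mathbf{C}$ and a symmetric $\mathbf{M}$, draws several random decorrelating matrices $\mathbf{B} = \mathbf{W}\mathbf{V}$, and checks that the quadratic form always equals $\mathrm{tr}(\mathbf{M}\mathbf{C}^{-1})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
G0 = rng.standard_normal((n, n))
C = G0 @ G0.T + n * np.eye(n)                   # a positive-definite data covariance
M = rng.standard_normal((n, n)); M = M @ M.T    # an arbitrary symmetric "prior" matrix

# Whitening matrix V = C^(-1/2); every decorrelating B can be written B = W V
# with W orthogonal.
eigval, eigvec = np.linalg.eigh(C)
V = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

for _ in range(3):
    W, _ = np.linalg.qr(rng.standard_normal((n, n)))   # a random orthogonal W
    B = W @ V
    # sum_i b_i^T M b_i = tr(M B^T B); both printed values coincide
    print(np.trace(M @ B.T @ B), np.trace(M @ np.linalg.inv(C)))
```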
20.1.3 Sparse priors
Motivation A much more satisfactory class of priors is given by what we call sparse priors. This means that the prior information says that most of the elements of each row of $\mathbf{B}$ are zero; thus their distribution is supergaussian or sparse. The motivation for considering sparse priors is both empirical and algorithmic.
Empirically, it has been observed in feature extraction of images (see Chapter 21) that the obtained filters tend to be localized in space. This implies that the distribution of the elements $b_{ij}$ of the filter $\mathbf{b}_i$ tends to be sparse, i.e., most elements are practically zero. A similar phenomenon can be seen in the analysis of magnetoencephalography, where each source signal is usually captured by a limited number of sensors. This is due to the spatial localization of the sources and the sensors.
The algorithmic appeal of sparsifying priors, on the other hand, is based on the fact that sparse priors can be made to be conjugate priors (see below for the definition). This is a special class of priors, and means that estimation of the model using this prior requires only very simple modifications in ordinary ICA algorithms.

Another motivation for sparse priors is their neural interpretation. Biological neural networks are known to be sparsely connected, i.e., only a small proportion of all possible connections between neurons are actually used. This is exactly what sparse priors model. This interpretation is especially interesting when ICA is used in modeling of the visual cortex (Chapter 21).
Measuring sparsity The sparsity of a random variable, say $s$, can be measured by expectations of the form $E\{G(s)\}$, where $G$ is a nonquadratic function, for example
$$G(s) = -|s| \qquad (20.6)$$
The use of such measures requires that the variance of $s$ is normalized to a fixed value, and that its mean is zero. These kinds of measures were widely used in Chapter 8 to probe the higher-order structure of the estimates of the ICs. Basically, such a measure is a robust nonpolynomial moment that typically is a monotonic function of kurtosis; maximizing it amounts to maximizing kurtosis, and thus supergaussianity and sparsity.
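As a small illustration (our own, not from the text), the sample version of this measure can be computed as follows, with the Laplacian-type choice $G(s) = -|s|$ of (20.6) as an assumed default:

```python
import numpy as np

def sparsity(s, G=lambda u: -np.abs(u)):
    """Sample version of the sparsity measure E{G(s)}, with G(s) = -|s| as an
    assumed default.  The variable is first standardized to zero mean and unit
    variance, as the text requires."""
    s = np.asarray(s, dtype=float)
    s = (s - s.mean()) / s.std()
    return G(s).mean()

# Example: a Laplacian (sparse) sample scores higher than a Gaussian one.
rng = np.random.default_rng(0)
print(sparsity(rng.laplace(size=10000)), sparsity(rng.standard_normal(10000)))
```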
In feature extraction, and probably in several other applications as well, the distributions of the elements of the mixing matrix and its inverse are zero-mean due to symmetry. Let us assume that the data $\mathbf{x}$ is whitened as a preprocessing step. Denote by $\mathbf{z}$ the whitened data vector, whose components are thus uncorrelated and have unit variance. Constraining the estimates $\mathbf{y} = \mathbf{W}\mathbf{z}$ of the independent components to be white implies that $\mathbf{W}$, the inverse of the whitened mixing matrix, is orthogonal. This implies that the sum of squares of the elements, $\sum_j w_{ij}^2$, is equal to one for every $i$. The elements of each row $\mathbf{w}_i^T$ of $\mathbf{W}$ can then be considered a realization of a random variable of zero mean and unit variance. This means we could measure the sparsities of the rows of $\mathbf{W}$ using a sparsity measure of the form (20.6).
Thus, we can define a sparse prior of the form
$$\log p(\mathbf{W}) = \sum_{i=1}^{n}\sum_{j=1}^{n} G(w_{ij}) + \mathrm{const.} \qquad (20.7)$$
where $G$ is the logarithm of some supergaussian density function. The function $G$ in (20.6) is such a log-density, corresponding to the Laplacian density, so we see that we have here a measure of the sparsity of the $\mathbf{w}_i$.

The prior in (20.7) has the nice property of being a conjugate prior. Let us assume that the independent components are supergaussian, and for simplicity, let us further assume that they have identical distributions, with log-density $G$. Now we can take that same log-density as the log-prior density $G$ in (20.7). Then we can write the prior in the form
$$\log p(\mathbf{W}) = \sum_{i=1}^{n}\sum_{j=1}^{n} G(\mathbf{w}_i^T\mathbf{e}_j) + \mathrm{const.} \qquad (20.8)$$
where we denote by $\mathbf{e}_i$ the canonical basis vectors, i.e., the $i$th element of $\mathbf{e}_i$ is equal to one, and all the others are zero. Thus the posterior distribution has the form:
$$\log p(\mathbf{W}\mid\mathbf{z}(1),\ldots,\mathbf{z}(T)) = \sum_{i=1}^{n}\left[\sum_{t=1}^{T} G(\mathbf{w}_i^T\mathbf{z}(t)) + \sum_{j=1}^{n} G(\mathbf{w}_i^T\mathbf{e}_j)\right] + \mathrm{const.} \qquad (20.9)$$
This form shows that the posterior distribution has the same form as the prior distribution (and, in fact, the original likelihood). Priors with this property are called conjugate priors in Bayesian theory. The usefulness of conjugate priors resides in the property that the prior can be considered to correspond to a “virtual” sample. The posterior distribution in (20.9) has the same form as the likelihood of a sample of size $T + n$, which consists of both the observed $\mathbf{z}(t)$ and the canonical basis vectors $\mathbf{e}_i$. In other words, the posterior in (20.9) is the likelihood of the augmented (whitened) data sample
$$\tilde{\mathbf{z}}(t) = \begin{cases} \mathbf{z}(t), & \text{if } 1 \le t \le T \\ \mathbf{e}_{t-T}, & \text{if } T < t \le T + n \end{cases} \qquad (20.10)$$
Thus, using conjugate priors has the additional benefit that we can use exactly the same algorithm for maximization of the posterior as in ordinary maximum likelihood estimation of ICA. All we need to do is to add this virtual sample to the data; the virtual sample is of the same size $n$ as the dimension of the data.
For experiments using sparse priors in image feature extraction, see [209].
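To make the virtual-sample trick concrete, here is a minimal numpy sketch (our own construction, not the authors' code): the whitened data is augmented with the $n$ canonical basis vectors, and an ordinary symmetric fixed-point ICA update with a tanh nonlinearity, standing in for “any ordinary ICA algorithm”, is run on the augmented sample. The tanh nonlinearity corresponds to the log cosh density, i.e., a smooth Laplacian-type choice of $G$.

```python
import numpy as np

def ica_with_sparse_prior(Z, n_iter=200, seed=0):
    """Minimal sketch (not from the text): estimation of the whitened
    separating matrix W under the conjugate sparse prior, implemented by
    appending the n canonical basis vectors as a virtual sample and running
    an ordinary symmetric fixed-point ICA update (tanh nonlinearity) on the
    augmented data.

    Z : whitened data, shape (n, T), columns are observations z(t).
    """
    n, _ = Z.shape
    Z_aug = np.hstack([Z, np.eye(n)])            # augmented sample of size T + n
    T_aug = Z_aug.shape[1]

    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal start
    for _ in range(n_iter):
        Y = W @ Z_aug
        g = np.tanh(Y)
        g_prime = 1.0 - g ** 2
        W_new = (g @ Z_aug.T) / T_aug - np.diag(g_prime.mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W_new)          # symmetric decorrelation:
        W = U @ Vt                               # keeps W orthogonal, estimates white
    return W
```

Calling ica_with_sparse_prior(Z) on whitened data Z then returns an estimate of W that takes the prior into account; dropping the np.eye(n) block recovers the ordinary estimate without the prior.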
Modifying prior strength The conjugate priors given above can be generalized by considering a family of supergaussian priors given by
$$\log p(\mathbf{W}) = \alpha\sum_{i=1}^{n}\sum_{j=1}^{n} G(\mathbf{w}_i^T\mathbf{e}_j) + \mathrm{const.} \qquad (20.11)$$
Using this kind of prior means that the virtual sample points are weighted by some parameter $\alpha$. This parameter expresses the degree of belief that we have in the prior. A large $\alpha$ means that the belief in the prior is strong. Also, the parameter $\alpha$ could be different for different $i$, but this seems less useful here. The posterior distribution then has the form:
$$\log p(\mathbf{W}\mid\mathbf{z}(1),\ldots,\mathbf{z}(T)) = \sum_{i=1}^{n}\left[\sum_{t=1}^{T} G(\mathbf{w}_i^T\mathbf{z}(t)) + \alpha\sum_{j=1}^{n} G(\mathbf{w}_i^T\mathbf{e}_j)\right] + \mathrm{const.} \qquad (20.12)$$
The preceding expression can be further simplified in the case where the assumed density of the independent components is Laplacian, i.e., $G(y) = -|y|$. In this case, the $\alpha$ can multiply the $\mathbf{e}_j$ themselves:
$$\log p(\mathbf{W}\mid\mathbf{z}(1),\ldots,\mathbf{z}(T)) = -\sum_{i=1}^{n}\left[\sum_{t=1}^{T} |\mathbf{w}_i^T\mathbf{z}(t)| + \sum_{j=1}^{n} |\mathbf{w}_i^T(\alpha\mathbf{e}_j)|\right] + \mathrm{const.} \qquad (20.13)$$
This is simpler than (20.12) from the algorithmic viewpoint: it amounts to the addition of just $n$ virtual data vectors of the form $\alpha\mathbf{e}_j$ to the data. This avoids all the complications due to the differential weighting of sample points in (20.12), and ensures that any conventional ICA algorithm can be used by simply adding the virtual sample to the data. In fact, the Laplacian prior is most often used in ordinary ICA algorithms, sometimes in the form of the log cosh function, which can be considered a smoother approximation of the absolute value function.
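In code, this Laplacian simplification amounts to scaling the appended basis vectors; a hypothetical helper (alpha is our name for the prior-strength parameter) might look like:

```python
import numpy as np

def augment_with_prior(Z, alpha=1.0):
    """Hypothetical helper: weighted virtual sample for the Laplacian prior.
    With G(y) = -|y| the prior strength alpha simply scales the appended basis
    vectors; any conventional ICA algorithm can then be run on the result."""
    n = Z.shape[0]
    return np.hstack([Z, alpha * np.eye(n)])
```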
Whitening and priors In the preceding derivation, we assumed that the data is preprocessed by whitening. It should be noted that the effect of the sparse prior depends on the whitening matrix. This is because sparseness is imposed on the separating matrix of the whitened data, and the value of this matrix depends on the whitening matrix. There is an infinity of whitening matrices, so imposing sparseness on the whitened separating matrix may have different meanings.

On the other hand, it is not necessary to whiten the data. The preceding framework can be used for non-white data as well. If the data is not whitened, the meaning of the sparse prior is somewhat different, though. This is because every row of $\mathbf{B}$ is not
constrained to have unit norm for general data. Thus our measure of sparsity no longer measures the sparsities of each $\mathbf{b}_i$. On the other hand, the developments of the preceding section show that the sum of squares of the whole matrix, $\sum_{ij} b_{ij}^2$, does stay constant. This means that the sparsity measure is now measuring the global sparsity of $\mathbf{B}$, rather than the sparsities of the individual rows.
In practice, one usually wants to whiten the data for technical reasons. Then the problem arises: how to impose the sparseness on the original separating matrix even when the data used in the estimation algorithm needs to be whitened? The preceding framework can be easily modified so that the sparseness is imposed on the original separating matrix. Denote by $\mathbf{V}$ the whitening matrix and by $\mathbf{B}$ the separating matrix for the original data. Thus, we have $\mathbf{W}\mathbf{V} = \mathbf{B}$ and $\mathbf{z} = \mathbf{V}\mathbf{x}$ by definition. Now, we can express the prior in (20.8) as
$$\log p(\mathbf{B}) = \sum_{i=1}^{n}\sum_{j=1}^{n} G(\mathbf{b}_i^T\mathbf{e}_j) + \mathrm{const.} = \sum_{i=1}^{n}\sum_{j=1}^{n} G(\mathbf{w}_i^T(\mathbf{V}\mathbf{e}_j)) + \mathrm{const.} \qquad (20.14)$$
Thus, we see that the virtual sample added to the $\mathbf{z}(t)$ now consists of the columns of the whitening matrix, instead of the identity matrix.
Incidentally, a similar manipulation of (20.8) shows how to put the prior on the original mixing matrix instead of the separating matrix. We always have $\mathbf{V}\mathbf{A} = \mathbf{W}^{-1} = \mathbf{W}^T$. Thus, we obtain $\mathbf{a}_i^T\mathbf{e}_j = \mathbf{a}_i^T\mathbf{V}^T(\mathbf{V}^{-1})^T\mathbf{e}_j = \mathbf{w}_i^T(\mathbf{V}^{-1})^T\mathbf{e}_j$. This shows that imposing a sparse prior on $\mathbf{A}$ is done by using the virtual sample given by the rows of the inverse of the whitening matrix. (Note that for whitened data, the mixing matrix is the transpose of the separating matrix, so the fourth logical possibility, formulating the prior for the whitened mixing matrix, is not different from using a prior on the whitened separating matrix.)
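The different choices of prior discussed above thus differ only in which virtual sample is appended to the whitened data. A hypothetical summary in code ($\mathbf{V}$ is the whitening matrix, $\mathbf{z} = \mathbf{V}\mathbf{x}$; the function and option names are ours):

```python
import numpy as np

def virtual_sample(V, target="whitened"):
    """Hypothetical sketch of the virtual samples discussed above, for a
    whitening matrix V (z = V x).  The returned columns are appended to the
    whitened data before running an ordinary ICA algorithm."""
    n = V.shape[0]
    if target == "whitened":        # sparse prior on the whitened separating matrix W
        return np.eye(n)
    if target == "separating":      # sparse prior on the original separating matrix B
        return V                    # the columns V e_j, as in (20.14)
    if target == "mixing":          # sparse prior on the original mixing matrix A
        return np.linalg.inv(V).T   # columns are the rows of V^{-1}
    raise ValueError(target)
```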
In practice, the problems implied by whitening can often be solved by using a whitening matrix that is sparse in itself. Then imposing sparseness on the whitened separating matrix is meaningful. In the context of image feature extraction, a sparse whitening matrix is obtained by the zero-phase whitening matrix (see [38] for discussion), for example. Then it is natural to impose the sparseness on the whitened separating matrix, and the complications discussed in this subsection can be ignored.
20.1.4 Spatiotemporal ICA
When using sparse priors, we typically make rather similar assumptions on both the ICs and the mixing matrix. Both are assumed to be generated so that the values are taken from independent, typically sparse, distributions. At the limit, we might develop a model where the very same assumptions are made on the mixing matrix and the ICs. Such a model [412] is called spatiotemporal ICA, since it does ICA both in the temporal domain (assuming that the ICs are time signals) and in the spatial domain, which corresponds to the spatial mixing defined by the mixing matrix.
In spatiotemporal ICA, the distinction between the ICs and the mixing matrix is completely abolished. To see why this is possible, collect the data into a single matrix $\mathbf{X} = (\mathbf{x}(1),\ldots,\mathbf{x}(T))$ with the observed vectors as its columns, and likewise $\mathbf{S} = (\mathbf{s}(1),\ldots,\mathbf{s}(T))$ for the ICs. Then the ICA model can be expressed as
$$\mathbf{X} = \mathbf{A}\mathbf{S} \qquad (20.15)$$
Now, taking the transpose of this equation, we obtain
$$\mathbf{X}^T = \mathbf{S}^T\mathbf{A}^T \qquad (20.16)$$
Now we see that the matrix $\mathbf{S}^T$ plays the role of a mixing matrix, with $\mathbf{A}^T$ giving the realizations of the “independent components”. Thus, by taking the transpose, we flip the roles of the mixing matrix and the ICs.
In the basic ICA model, the difference between $\mathbf{s}$ and $\mathbf{A}$ is due to the statistical assumptions made on $\mathbf{s}$, whose elements are independent random variables, and on $\mathbf{A}$, which is a constant matrix of parameters. But with sparse priors, we made assumptions on $\mathbf{A}$ that are very similar to those usually made on $\mathbf{s}$. So, we can simply consider both $\mathbf{A}$ and $\mathbf{S}$ as being generated by independent random variables, in which case either one of the mixing equations (with or without the transpose) is equally valid. This is the basic idea in spatiotemporal ICA.
There is another important difference between $\mathbf{S}$ and $\mathbf{A}$, though. The dimensions of $\mathbf{A}$ and $\mathbf{S}$ are typically very different: $\mathbf{A}$ is square, whereas $\mathbf{S}$ has many more columns than rows. This difference can be abolished by assuming that $\mathbf{A}$ has many fewer columns than rows, that is, that there is some redundancy in the signal. The estimation of the spatiotemporal ICA model can then be performed in a manner rather similar to using sparse priors. The basic idea is to form a virtual sample where the data consists of two parts, the original data and the data obtained by transposing the data matrix. The dimensions of these data sets must be strongly reduced and made equal to each other, using PCA-like methods. This is possible because it was assumed that both $\mathbf{A}$ and $\mathbf{S}^T$ have the same kind of redundancy: many more rows than columns. For details, see [412], where the infomax criterion was applied to this estimation task.
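As a rough sketch of this preprocessing step (our own construction based on the description above; the joint infomax optimization of [412] is not shown), one could reduce both the data matrix and its transpose to a common dimension with an SVD:

```python
import numpy as np

def spatiotemporal_views(X, k):
    """Reduce both the data matrix X (n x T) and its transpose to a common
    dimension k, giving a 'spatial' data set related to the mixing matrix A
    and a 'temporal' data set related to the ICs S (hypothetical sketch)."""
    U, d, Vt = np.linalg.svd(X - X.mean(axis=1, keepdims=True),
                             full_matrices=False)
    spatial_data = U[:, :k] * d[:k]       # n x k : spatial structure (columns of A)
    temporal_data = Vt[:k, :].T * d[:k]   # T x k : temporal structure (rows of S)
    return spatial_data, temporal_data
```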
20.2 RELAXING THE INDEPENDENCE ASSUMPTION
In the ICA data model, it is assumed that the components $s_i$ are independent. However, ICA is often applied on data sets, for example, on image data, in which the obtained estimates of the independent components are not very independent, even approximately. In fact, it is not possible, in general, to decompose a random vector $\mathbf{x}$ linearly into components that are independent. This raises questions on the utility and interpretation of the components given by ICA. Is it useful to perform ICA on real data that does not give independent components, and if it is, how should the results be interpreted?
One approach to this problem is to reinterpret the estimation results. A straightforward reinterpretation was offered in Chapter 10: ICA gives components that are as independent as possible. Even in cases where this is not enough, we can still justify the utility by other arguments. This is because ICA simultaneously serves certain
other useful purposes than dependence reduction. For example, it can be interpreted as projection pursuit (see Section 8.5) or sparse coding (see Section 21.2). Both of these methods are based on the maximal nongaussianity property of the independent components, and they give important insight into what ICA algorithms are really doing.

A different approach to the problem of not finding independent components is to relax the very assumption of independence, thus explicitly formulating new data models. In this section, we consider this approach, and present three recently developed methods in this category. In multidimensional ICA, it is assumed that only certain sets (subspaces) of the components are mutually independent. A closely related method is independent subspace analysis, where a particular distribution structure inside such subspaces is defined. Topographic ICA, on the other hand, attempts to utilize the dependence of the estimated “independent” components to define a topographic order.
20.2.1 Multidimensional ICA
In multidimensional independent component analysis [66, 277], a linear generative model as in basic ICA is assumed. In contrast to basic ICA, however, the components (responses) $s_i$ are not assumed to be all mutually independent. Instead, it is assumed that the $s_i$ can be divided into couples, triplets, or in general $k$-tuples, such that the $s_i$ inside a given $k$-tuple may be dependent on each other, but dependencies between different $k$-tuples are not allowed.
Every $k$-tuple of $s_i$ corresponds to $k$ basis vectors $\mathbf{a}_i$. In general, the dimensionalities of the independent subspaces need not be equal, but we assume so for simplicity. The model can be simplified by two additional assumptions. First, even though the components $s_i$ are not all independent, we can always define them so that they are uncorrelated and of unit variance. In fact, linear correlations inside a given $k$-tuple of dependent components could always be removed by a linear transformation. Second, we can assume that the data is whitened (sphered), just as in basic ICA.
These two assumptions imply that the $\mathbf{a}_i$ are orthonormal. In particular, the independent subspaces become orthogonal after whitening. These facts follow directly from the proof in Section 7.4.2, which applies here as well, due to our present assumptions.
Let us denote by $J$ the number of independent feature subspaces, and by $S_j$, $j = 1,\ldots,J$, the set of the indices of the $s_i$ belonging to the subspace of index $j$. Assume that the data consists of $T$ observed data points $\mathbf{x}(t)$, $t = 1,\ldots,T$. Then we can express the likelihood $L$ of the data, given the model, as follows:
$$L\bigl(\mathbf{x}(t), t = 1,\ldots,T;\ \mathbf{b}_i, i = 1,\ldots,n\bigr) = \prod_{t=1}^{T} |\det\mathbf{B}| \prod_{j=1}^{J} p_j\bigl(\mathbf{b}_i^T\mathbf{x}(t),\ i \in S_j\bigr) \qquad (20.17)$$
where $p_j(\cdot)$, which is a function of the $k$ arguments $\mathbf{b}_i^T\mathbf{x}(t)$, $i \in S_j$, gives the probability density inside the $j$th $k$-tuple of the $s_i$. The term $|\det\mathbf{B}|$ appears here as in any expression of the probability density of a transformation, giving the change in volume produced by the linear transformation, as in Chapter 9.
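As an illustration, the log of the likelihood (20.17) could be evaluated as in the following hypothetical sketch, where the subspace density log_pj is supplied by the user:

```python
import numpy as np

def md_ica_loglik(X, B, subspaces, log_pj):
    """Log of the likelihood (20.17) for multidimensional ICA (sketch).

    X         : n x T data matrix with the observations x(t) as columns
    B         : n x n separating matrix with rows b_i^T
    subspaces : list of index tuples S_j partitioning range(n)
    log_pj    : callable giving log p_j of the responses inside one tuple
    """
    Y = B @ X                                   # responses b_i^T x(t)
    T = X.shape[1]
    ll = T * np.log(np.abs(np.linalg.det(B)))   # the |det B| factor, once per x(t)
    for S_j in subspaces:
        ll += sum(log_pj(Y[list(S_j), t]) for t in range(T))
    return ll

# An assumed (not specified in the text) supergaussian choice for log p_j:
example_log_pj = lambda y: -np.linalg.norm(y)
```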
The $k$-dimensional probability density $p_j(\cdot)$ is not specified in advance in the general definition of multidimensional ICA [66]. Thus, the question arises how to estimate the model of multidimensional ICA. One approach is to estimate the basic ICA model, and then group the components into $k$-tuples according to their dependence structure [66]. This is meaningful only if the independent components are well defined and can be accurately estimated; in general we would like to utilize the subspace structure in the estimation process. Another approach is to model the distributions inside the subspaces by a suitable model. This is potentially very difficult, since we then encounter the classic problem of estimating $k$-dimensional distributions. One solution to this problem is given by independent subspace analysis, to be explained next.
20.2.2 Independent subspace analysis
Independent subspace analysis [204] is a simple model that models some dependencies between the components. It is based on combining multidimensional ICA with the principle of invariant-feature subspaces.
Invariant-feature subspaces To motivate independent subspace analysis, let us consider the problem of feature extraction, treated in more detail in Chapter 21. In the most basic case, features are given by linear transformations, or filters. The presence of a given feature is detected by computing the dot-product of the input data with a given feature vector. For example, wavelet, Gabor, and Fourier transforms, as well as most models of V1 simple cells, use such linear features (see Chapter 21). The problem with linear features, however, is that they necessarily lack any invariance with respect to such transformations as spatial shift or change in (local) Fourier phase [373, 248].

Kohonen [248] developed the principle of invariant-feature subspaces as an abstract approach to representing features with some invariances. The principle of invariant-feature subspaces states that one can consider an invariant feature as a linear subspace in a feature space. The value of the invariant, higher-order feature is given by (the square of) the norm of the projection of the given data point on that subspace, which is typically spanned by lower-order features.
A feature subspace, as any linear subspace, can always be represented by a set of orthogonal basis vectors, say $\mathbf{b}_i$, $i = 1,\ldots,k$, where $k$ is the dimension of the subspace. Then the value $F(\mathbf{x})$ of the feature $F$ with input vector $\mathbf{x}$ is given by
$$F(\mathbf{x}) = \sum_{i=1}^{k} (\mathbf{b}_i^T\mathbf{x})^2 \qquad (20.18)$$
In fact, this is equivalent to computing the distance between the input vector $\mathbf{x}$ and a general linear combination of the vectors (possibly filters) $\mathbf{b}_i$ of the feature subspace [248].
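In code, evaluating such an invariant feature is straightforward; a hypothetical helper (B_sub collects the orthogonal basis vectors of one subspace as its rows) could be:

```python
import numpy as np

def subspace_feature(x, B_sub):
    """Value F(x) of an invariant feature, as in (20.18): the squared norm of
    the projection of x onto the subspace spanned by the rows of B_sub."""
    return float(np.sum((B_sub @ x) ** 2))
```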