17 Nonlinear ICA
This chapter deals with independent component analysis (ICA) for nonlinear mixing models. A fundamental difficulty in the nonlinear ICA problem is that it is highly nonunique without some extra constraints, which are often realized by using a suitable regularization. We also address the nonlinear blind source separation (BSS) problem. Contrary to the linear case, we consider it different from the respective nonlinear ICA problem. After considering these matters, some methods introduced for solving the nonlinear ICA or BSS problems are discussed in more detail. Special emphasis is given to a Bayesian approach that applies ensemble learning to a flexible multilayer perceptron model for finding the sources and the nonlinear mixing mapping that have most probably given rise to the observed mixed data. The efficiency of this method is demonstrated using both artificial and real-world data. At the end of the chapter, other techniques proposed for solving the nonlinear ICA and BSS problems are reviewed.
17.1 NONLINEAR ICA AND BSS
17.1.1 The nonlinear ICA and BSS problems
In many situations, the basic linear ICA or BSS model

x = A s = \sum_{j=1}^{n} a_j s_j    (17.1)

is too simple for describing the observed data x adequately. Hence, it is natural to consider an extension of the linear model to nonlinear mixing models.
For instantaneous mixtures, the nonlinear mixing model has the general form

x = f(s)    (17.2)

where x is the observed m-dimensional data (mixture) vector, f is an unknown real-valued m-component mixing function, and s is an n-vector whose elements are the n unknown independent components.
Assume now for simplicity that the number of independent components n equals the number of mixtures m. The general nonlinear ICA problem then consists of finding a mapping h: R^n \to R^n that gives components

y = h(x)    (17.3)
that are statistically independent. A fundamental characteristic of the nonlinear ICA problem is that in the general case, solutions always exist, and they are highly nonunique. One reason for this is that if x and y are two independent random variables, any of their functions f(x) and g(y) are also independent. An even more serious problem is that in the nonlinear case, x and y can be mixed and still be statistically independent, as will be shown below. This is not unlike the case of gaussian ICs in a linear mixing.
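A simple concrete example of the latter phenomenon: let x and y be independent and both uniformly distributed on [0, 1], and define

z = (x + y) \bmod 1

Then z is a genuine mixture of x and y, but for every fixed value of x the conditional distribution of z is still uniform on [0, 1]. Hence z is statistically independent of x, even though it depends on both variables.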
In this chapter, we define BSS in a special way to clarify the distinction between finding independent components and finding the original sources. Thus, in the respective nonlinear BSS problem, one should find the original source signals s that have generated the observed data. This is usually a clearly more meaningful and unique problem than nonlinear ICA defined above, provided that suitable prior information is available on the sources and/or the mixing mapping. It is worth emphasizing that if some arbitrary independent components are found for the data generated by (17.2), they may be quite different from the true source signals. Hence the situation differs greatly from the basic linear data model (17.1), for which the ICA and BSS problems have the same solution. Generally, solving the nonlinear BSS problem is not easy, and requires additional prior information or suitable regularizing constraints.
An important special case of the general nonlinear mixing model (17.2) consists of so-called post-nonlinear mixtures. There each mixture has the form

x_i = f_i\left( \sum_{j=1}^{n} a_{ij} s_j \right), \qquad i = 1, \ldots, n    (17.4)
Thus the sources s_j, j = 1, ..., n, are first mixed linearly according to the basic ICA/BSS model (17.1), but after that a nonlinear function f_i is applied to them to get the final observations x_i. It can be shown [418] that for the post-nonlinear mixtures, the indeterminacies are usually the same as for the basic linear instantaneous mixing model (17.1). That is, the sources can be separated or the independent components estimated up to the scaling, permutation, and sign indeterminacies under weak conditions on the mixing matrix A and the source distributions. The post-nonlinearity assumption is useful and reasonable in many signal processing applications, because
it can be thought of as a model for a nonlinear sensor distortion. In more general situations, it is a restrictive and somewhat arbitrary constraint. This model will be treated in more detail below.
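To make the model concrete, the following short sketch generates post-nonlinear mixtures according to (17.4). The source distributions, the mixing matrix, and the componentwise nonlinearities f_i are illustrative assumptions only, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent sources and a linear mixing matrix A (illustrative choices).
T = 1000
s = np.vstack([rng.laplace(size=T), rng.uniform(-1, 1, size=T)])   # shape (n, T)
A = np.array([[0.7, 0.3],
              [0.3, 0.7]])

# Post-nonlinear mixtures, Eq. (17.4): each linear mixture is passed through
# its own invertible componentwise nonlinearity f_i.
f = [np.tanh, lambda u: u + u**3]
z = A @ s
x = np.vstack([f[i](z[i]) for i in range(len(f))])                  # observed data
```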
Another difficulty in the general nonlinear BSS (or ICA) methods proposed thus far is that they tend to be computationally rather demanding. Moreover, the computational load usually increases very rapidly with the dimensionality of the problem, preventing in practice the application of nonlinear BSS methods to high-dimensional data sets.
The nonlinear BSS and ICA methods presented in the literature can be divided into two broad classes: generative approaches and signal transformation approaches [438]. In the generative approaches, the goal is to find a specific model that explains how the observations were generated. In our case, this amounts to estimating both the source signals s and the unknown mixing mapping f(·) that have generated the observed data x through the general mapping (17.2). In the signal transformation methods, one tries to estimate the sources directly using the inverse transformation (17.3). In these methods, the number of estimated sources is the same as the number of observed mixtures [438].
17.1.2 Existence and uniqueness of nonlinear ICA
The question of existence and uniqueness of solutions for nonlinear independent component analysis has been addressed in [213]. The authors show that there always exists an infinity of solutions if the space of the nonlinear mixing functions f is not limited. They also present a method for constructing parameterized families of nonlinear ICA solutions. A unique solution (up to a rotation) can be obtained in the two-dimensional special case if the mixing mapping f is constrained to be a conformal mapping, together with some other assumptions; see [213] for details.
In the following, we present in more detail the constructive method introduced in [213] that always yields at least one solution to the nonlinear ICA problem. This procedure might be considered as a generalization of the well-known Gram-Schmidt orthogonalization method. Given m independent variables y = (y_1, ..., y_m) and a variable x, a new variable y_{m+1} = g(y, x) is constructed so that the set y_1, ..., y_{m+1} is mutually independent.
The construction is defined recursively as follows. Assume that we have already m independent random variables y_1, ..., y_m which are jointly uniformly distributed in [0, 1]^m. Here it is not a restriction to assume that the distributions of the y_i are uniform, since this follows directly from the recursion, as will be seen below; for a single variable, uniformity can be attained by the probability integral transformation; see (2.85). Denote by x any random variable, and by a_1, ..., a_m, b some nonrandom scalars. Define

g(y, x; p_{y,x}) = P(x \le b \mid y_1 = a_1, \ldots, y_m = a_m)\big|_{b = x,\; a_i = y_i} = \frac{\int_{-\infty}^{x} p_{y,x}(y, \xi)\, d\xi}{p_y(y)}    (17.5)

where p_y(·) and p_{y,x}(·) are the marginal probability densities of y and (y, x), respectively (it is assumed here implicitly that such densities exist), and P(·|·) denotes the conditional probability. The p_{y,x} in the argument of g is to remind that g depends on the joint probability distribution of y and x. For m = 0, g is simply the cumulative distribution function of x. Now, g as defined above gives a nonlinear decomposition, as stated in the following theorem.
Theorem 17.1  Assume that y_1, ..., y_m are independent scalar random variables that have a joint uniform distribution in the unit cube [0, 1]^m. Let x be any scalar random variable. Define g as in (17.5), and set

y_{m+1} = g(y, x; p_{y,x})    (17.6)

Then y_1, ..., y_{m+1} are mutually independent and jointly uniformly distributed in the unit cube [0, 1]^{m+1}.
The theorem is proved in [213]. The constructive method given above can be used to decompose n variables x_1, ..., x_n into n independent components y_1, ..., y_n, giving a solution for the nonlinear ICA problem.
This construction also clearly shows that the decomposition into independent components is by no means unique. For example, we could first apply a linear transformation on x to obtain another random vector x', and then apply the same construction to x'; the resulting independent components are in general different from those obtained from x directly.
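The construction can be imitated numerically with empirical conditional cumulative distribution functions. The sketch below is only an illustration of the idea; the particular mixture, the sample size, and the binning used to approximate the conditional CDF are ad hoc choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dependent observations x1, x2 (an arbitrary nonlinear mixture of two sources).
T = 20000
s = rng.uniform(-1, 1, size=(2, T))
x1 = np.tanh(s[0] + 0.5 * s[1])
x2 = (s[0] - s[1]) ** 3

def ecdf(u):
    """Empirical probability integral transform: maps u to approx. Uniform(0,1)."""
    return (np.argsort(np.argsort(u)) + 0.5) / len(u)

# Step m = 0: y1 = g(x1) is just the (empirical) CDF of x1.
y1 = ecdf(x1)

# Step m = 1: y2 = g(y1, x2) is the CDF of x2 conditional on y1,
# approximated here by applying the CDF within narrow bins of y1.
n_bins = 50
y2 = np.empty(T)
bins = np.minimum((y1 * n_bins).astype(int), n_bins - 1)
for b in range(n_bins):
    idx = np.where(bins == b)[0]
    y2[idx] = ecdf(x2[idx])

# y1 and y2 are approximately independent and uniform on [0,1], even though
# x1 and x2 are dependent; compare the sample correlations as a rough check.
print(np.corrcoef(x1, x2)[0, 1], np.corrcoef(y1, y2)[0, 1])
```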
Lin [278] has recently derived some interesting theoretical results on ICA that are useful in describing the nonuniqueness of the general nonlinear ICA problem. Let the matrices H_s and H_x denote the Hessians of the logarithmic probability densities log p_s(s) and log p_x(x) of the source vector s and the mixture (data) vector x, respectively. Then for the basic linear ICA model (17.1) it holds that

H_s = A^T H_x A    (17.7)

where A is the mixing matrix. If the components of s are truly independent, H_s should be a diagonal matrix. Due to the symmetry of the Hessian matrices H_s and H_x, Eq. (17.7) imposes n(n-1)/2 constraints on the elements of the n × n matrix A. Thus a constant mixing matrix A can be solved for by estimating H_x at two different points, and assuming some values for the diagonal elements of H_s.
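The relation (17.7) is easy to check numerically for a concrete case. The sketch below assumes independent logistic sources and a fixed 2 × 2 mixing matrix (arbitrary illustrative choices) and compares finite-difference Hessians of the two log-densities.

```python
import numpy as np

# Independent logistic sources: log p_s(s) = sum_i [-s_i - 2*log(1 + exp(-s_i))].
def log_ps(s):
    return np.sum(-s - 2.0 * np.log1p(np.exp(-s)))

A = np.array([[0.7, 0.3],
              [0.3, 0.7]])
Ainv = np.linalg.inv(A)

# Density of x = A s by the change-of-variables formula.
def log_px(x):
    return log_ps(Ainv @ x) - np.log(abs(np.linalg.det(A)))

def hessian(f, v, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at the point v."""
    n = len(v)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(v + e_i + e_j) - f(v + e_i - e_j)
                       - f(v - e_i + e_j) + f(v - e_i - e_j)) / (4 * eps**2)
    return H

s0 = np.array([0.3, -0.8])          # an arbitrary evaluation point
x0 = A @ s0
Hs = hessian(log_ps, s0)            # diagonal, since the sources are independent
Hx = hessian(log_px, x0)
print(np.allclose(A.T @ Hx @ A, Hs, atol=1e-4))   # Eq. (17.7): H_s = A^T H_x A
```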
If the nonlinear mapping (17.2) is twice differentiable, we can approximate it locally at any point by the linear mixing model (17.1). There A is defined by the first-order term ∂f(s)/∂s of the Taylor series expansion of f(s) at the desired point. But now A generally changes from point to point, so that the constraint conditions (17.7) still leave n(n-1)/2 degrees of freedom for determining the mixing matrix A (omitting the diagonal elements). This also shows that the nonlinear ICA problem is highly nonunique.
Taleb and Jutten have considered the separability of nonlinear mixtures in [418, 227]. Their general conclusion is the same as earlier: separation is impossible without additional prior knowledge on the model, since the independence assumption alone is not strong enough in the general nonlinear case.
17.2 SEPARATION OF POST-NONLINEAR MIXTURES
Before discussing approaches applicable to general nonlinear mixtures, let us briefly consider blind separation methods proposed for the simpler case of post-nonlinear mixtures (17.4). Especially Taleb and Jutten have developed BSS methods for this case. Their main results have been presented in [418], and a short overview of their studies on this problem can be found in [227]. In the following, we present the main points of their method.
A separation method for the post-nonlinear mixtures (17.4) should generally consist of two subsequent parts or stages:

1. A nonlinear stage, which should cancel the nonlinear distortions f_i, i = 1, ..., n. This part consists of parametric nonlinear functions g_i(θ_i, u), where θ_i denotes the parameters of the i-th function g_i.

2. A linear stage, which separates the roughly linear mixtures obtained after the nonlinear stage using a separating matrix B.

The separating matrix B of the linear stage can be adapted with the Bell-Sejnowski algorithm (17.8), which requires the score functions

ψ_i(u) = -\frac{d \log p_i(u)}{du} = -\frac{p_i'(u)}{p_i(u)}    (17.9)

where p_i(u) is the probability density function of y_i and p_i'(u) its derivative. In practice, the natural gradient algorithm is used instead of the Bell-Sejnowski algorithm (17.8); see Chapter 9.
For the nonlinear stage, one can derive the gradient learning rule [418]

\Delta θ_k \propto E\left\{ \frac{\partial \log |g_k'(θ_k, x_k)|}{\partial θ_k} \right\} - E\left\{ \sum_{i=1}^{n} ψ_i(y_i)\, b_{ik}\, \frac{\partial g_k(θ_k, x_k)}{\partial θ_k} \right\}
Here x_k is the k-th component of the input vector, b_{ik} is the element ik of the matrix B, and g_k' is the derivative of the k-th nonlinear function g_k. The exact computational algorithm depends naturally on the specific parametric form of the chosen nonlinear mapping g_k.
The score functions ψ_i in (17.9) are not known in advance and must be estimated from the data; ways to do this are considered in [418]. An estimation method based on the Gram-Charlier expansion performs appropriately only for mild post-nonlinear distortions. However, another method, which estimates the score functions directly, also provides very good results for hard nonlinearities. Experimental results are presented in [418]. A well-performing batch-type method for estimating the score functions has been introduced in a later paper [417].
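The following minimal numerical sketch puts the two stages together. The data, the parametric form chosen for g_k (a linear term plus a tanh term), the fixed score function ψ(y) = y^3, and the step sizes are all illustrative assumptions rather than the choices made in [418]; the point is only to show the structure of the gradient updates given above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy post-nonlinear data, Eq. (17.4): x_k = f_k(sum_j a_kj s_j).
T = 5000
s = rng.uniform(-1.0, 1.0, size=(2, T))              # subgaussian sources
A = np.array([[0.7, 0.3], [0.3, 0.7]])
z = A @ s
x = z + 0.2 * z**3                                    # distortion f_k(t) = t + 0.2 t^3
n = x.shape[0]

# Separating structure: y = B g(x), with g_k(theta_k, u) = theta_k0*u + theta_k1*tanh(u).
theta = np.tile([1.0, 0.0], (n, 1))
B = np.eye(n)
psi = lambda y: y**3                                  # assumed score, suited to subgaussian sources
mu_B, mu_th = 0.02, 0.02

def g(th, u):
    return th[:, [0]] * u + th[:, [1]] * np.tanh(u)

def g_prime(th, u):                                   # derivative dg_k/du
    return th[:, [0]] + th[:, [1]] * (1.0 - np.tanh(u)**2)

for _ in range(300):
    u = g(theta, x)                                   # nonlinear stage
    y = B @ u                                         # linear stage
    # Natural-gradient update of B (cf. Chapter 9).
    B += mu_B * (np.eye(n) - (psi(y) @ y.T) / T) @ B
    # Gradient update of the nonlinearity parameters theta_k.
    dg = np.stack([x, np.tanh(x)], axis=-1)                            # dg_k/dtheta_k, shape (n, T, 2)
    dgp = np.stack([np.ones_like(x), 1.0 - np.tanh(x)**2], axis=-1)    # dg'_k/dtheta_k
    term1 = (dgp / g_prime(theta, x)[..., None]).mean(axis=1)          # E{d log|g'_k| / d theta_k}
    term2 = ((B.T @ psi(y))[..., None] * dg).mean(axis=1)              # E{sum_i psi(y_i) b_ik dg_k/dtheta_k}
    theta += mu_th * (term1 - term2)
```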
Before proceeding, we mention that separation of post-nonlinear mixtures has also been studied in [271, 267, 469], using mainly extensions of the natural gradient algorithm.
17.3 NONLINEAR BSS USING SELF-ORGANIZING MAPS
One of the earliest ideas for achieving nonlinear BSS (or ICA) is to use Kohonen's self-organizing map (SOM) to that end. This method was originally introduced by Pajunen et al. [345]. The SOM [247, 172] is a well-known mapping and visualization method that in an unsupervised manner learns a nonlinear mapping from the data to a usually two-dimensional grid. The learned mapping from the often high-dimensional data space to the grid is such that it tries to preserve the structure of the data as well as possible. Another goal in the SOM method is to map the data so that it would be uniformly distributed on the rectangular (or hexagonal) grid. This can be roughly achieved with suitable choices [345].
If the joint probability density of two random variables is uniformly distributed inside a rectangle, then clearly the marginal densities along the sides of the rectangle are statistically independent. This observation gives the justification for applying the self-organizing map to nonlinear BSS or ICA. The SOM mapping provides the regularization needed in nonlinear BSS, because it tries to preserve the structure
of the data. This implies that the mapping should be as simple as possible while achieving the desired goals.
Fig 17.2 Nonlinear mixtures.
The following experiment [345] illustrates the use of the self-organizing map in nonlinear blind source separation. There were two subgaussian source signals s_i, shown in Fig 17.1, consisting of a sinusoid and uniformly distributed white noise. Each source vector s was first mixed linearly using the mixing matrix

A = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}    (17.10)

After this, the data vectors x were obtained as post-nonlinear mixtures of the sources by applying the formula (17.4), where the nonlinearity f_i(t) = t^3 + t, i = 1, 2. These mixtures x_i are depicted in Fig 17.2.
Fig 17.4 Converged SOM map.
The sources separated by the SOM method are shown in Fig 17.3, and the converged SOM map is illustrated in Fig 17.4. The estimates of the source signals in Fig 17.3 are obtained by mapping each data vector onto the map of Fig 17.4, and reading the coordinates of the mapped data vector. Even though the preceding experiment was carried out with post-nonlinear mixtures, the use of the SOM method is not limited to them.
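A from-scratch sketch of this procedure is given below. The grid size, the learning-rate and neighborhood schedules, and the number of training iterations are ad hoc choices rather than the settings used in [345]; the source estimates are simply the grid coordinates of each data vector's best-matching unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two sources (a sinusoid and uniform noise), linear mixing as in (17.10),
# then the post-nonlinearity f_i(t) = t^3 + t.
T = 2000
s = np.vstack([np.sin(0.05 * np.arange(T)), rng.uniform(-1, 1, T)])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
z = A @ s
x = (z**3 + z).T                                   # data vectors, shape (T, 2)

# A basic online SOM on a K x K grid (illustrative hyperparameters).
K, n_iter = 20, 20000
gi, gj = np.meshgrid(np.arange(K), np.arange(K), indexing="ij")
W = rng.uniform(x.min(), x.max(), size=(K, K, 2))  # codebook (reference) vectors
for t in range(n_iter):
    frac = t / n_iter
    lr = 0.5 * (0.01 / 0.5) ** frac                # learning rate decays 0.5 -> 0.01
    sig = 5.0 * (0.5 / 5.0) ** frac                # neighborhood width decays 5 -> 0.5
    v = x[rng.integers(T)]
    d2 = ((W - v) ** 2).sum(axis=2)
    bi, bj = np.unravel_index(np.argmin(d2), d2.shape)
    h = np.exp(-((gi - bi) ** 2 + (gj - bj) ** 2) / (2 * sig**2))
    W += lr * h[..., None] * (v - W)

# Source estimates: grid coordinates of each data vector's best-matching unit,
# scaled to [0, 1]; they approximate the (uniformized) sources up to ordering.
bmu = np.array([np.unravel_index(np.argmin(((W - v) ** 2).sum(axis=2)), (K, K))
                for v in x], dtype=float)
y = bmu / (K - 1)
```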
Generally speaking, there are several difficulties in applying self-organizing maps to nonlinear blind source separation. If the sources are uniformly distributed, then it can be heuristically justified that the regularization of the nonlinear separating mapping provided by the SOM approximately separates the sources. But if the true sources are not uniformly distributed, the separating mapping providing uniform densities inevitably causes distortions, which are in general the more serious the farther the true source densities are from the uniform ones. Of course, the SOM method still provides an approximate solution to the nonlinear ICA problem, but this solution may have little to do with the true source signals.
Another difficulty in using the SOM for nonlinear BSS or ICA is that the computational complexity increases very rapidly with the number of sources (the dimensionality of the map), limiting the potential application of this method to small-scale problems. Furthermore, the mapping provided by the SOM is discrete, where the discretization is determined by the number of grid points.
17.4 A GENERATIVE TOPOGRAPHIC MAPPING APPROACH TO NONLINEAR BSS *
17.4.1 Background
The self-organizing map discussed briefly in the previous section is a nonlinear mapping method that is inspired by neurobiological modeling arguments. Bishop, Svensen, and Williams introduced the generative topographic mapping (GTM) method as a statistically more principled alternative to the SOM. Their method is presented in detail in [49].
In the basic GTM method, mutually similar impulse (delta) functions that are equispaced on a rectangular grid are used to model the discrete uniform density in the space of latent variables, or the joint density of the sources in our case. The mapping from the sources to the observed data, corresponding in our nonlinear BSS problem to the nonlinear mixing mapping (17.2), is modeled using a mixture-of-gaussians model. The parameters of the mixture-of-gaussians model, defining the mixing mapping, are then estimated using a maximum likelihood (ML) method (see Section 4.5) realized by the expectation-maximization (EM) algorithm [48, 172]. After this, the inverse (separating) mapping from the data to the latent variables (sources) can be determined.
It is well known that any sufficiently smooth continuous mapping can be approximated with arbitrary accuracy using a mixture-of-gaussians model with sufficiently many gaussian basis functions [172, 48]. Roughly stated, this provides the theoretical basis of the GTM method. A fundamental difference of the GTM method compared with the SOM is that GTM is based on a generative approach that starts by assuming a model for the latent variables, in our case the sources. On the other hand, the SOM
tries to separate the sources directly by starting from the data and constructing a suitable separating signal transformation. A key benefit of GTM is its firm theoretical foundation, which helps to overcome some of the limitations of the SOM. This also provides the basis for generalizing the GTM approach to arbitrary source densities. Using the basic GTM method instead of the SOM for nonlinear blind source separation does not yet bring any notable improvement, because the densities of the sources are still assumed to be uniform. However, it is straightforward to generalize the GTM method to arbitrary known source densities. The advantage of this approach is that one can directly regularize the inverse of the mixing mapping by using the known source densities. This modified GTM method is then used for finding a noncomplex mixing mapping. This approach is described in the following.
17.4.2 The modified GTM method
The modified GTM method introduced in [346] differs from the standard GTM [49] only in that the required joint density of the latent variables (sources) is defined as a weighted sum of delta functions instead of plain delta functions. The weighting coefficients are determined by discretizing the known source densities. Only the main points of the GTM method are presented here, with emphasis on the modifications made for applying it to nonlinear blind source separation. Readers wishing to gain a deeper understanding of the GTM method should look at the original paper [49].

The GTM method closely resembles the SOM in that it uses a discrete grid of points forming a regular array in the m-dimensional latent space. As in the SOM, the dimension of the latent space is usually m = 2. Vectors lying in the latent space are denoted by s(t); in our application they will be source vectors. The GTM method uses a set of L fixed nonlinear basis functions {φ_j(s)}, j = 1, ..., L, which form a nonorthogonal basis set. These basis functions typically consist of a regular array of spherical gaussian functions, but the basis functions can at least in principle be of other types.

The mapping from the m-dimensional latent space to the n-dimensional data space, which is in our case the mixing mapping of Eq. (17.2), is in GTM modeled as a linear combination of the basis functions φ_j:

x = f(s) = M φ(s), \qquad φ(s) = [φ_1(s), φ_2(s), \ldots, φ_L(s)]^T    (17.11)

Here M is an n × L matrix of weight parameters.
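In code, the latent grid, the gaussian basis functions, and the mapping (17.11) could be set up roughly as follows; the grid size, the number of basis functions, their width, and the initialization of M are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 3, 2                  # data and latent dimensions
K_side, L_side = 10, 4       # grid points per side, basis functions per side
sigma_basis = 0.3            # width of the spherical gaussian basis functions

# Regular K = K_side^2 grid of node locations xi_q in the latent space [-1, 1]^2.
g1d = np.linspace(-1, 1, K_side)
xi = np.array([[a, b] for a in g1d for b in g1d])            # shape (K, m)

# Regular L = L_side^2 array of gaussian basis function centers.
c1d = np.linspace(-1, 1, L_side)
centers = np.array([[a, b] for a in c1d for b in c1d])       # shape (L, m)

def phi(s):
    """Basis function vector phi(s) = [phi_1(s), ..., phi_L(s)]^T."""
    d2 = ((s[None, :] - centers) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma_basis**2))

# The mixing mapping (17.11): x = f(s) = M phi(s), with M an n x L weight matrix.
L = len(centers)
M = 0.1 * rng.standard_normal((n, L))
Phi = np.array([phi(q) for q in xi])                          # K x L matrix of phi_j(xi_q)
ref = Phi @ M.T                                               # reference vectors, one per node
```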
Denote the node locations in the latent space by ξ_i. Eq. (17.11) then defines a corresponding set of reference vectors

m_i = M φ(ξ_i)    (17.12)

in the data space. Each reference vector m_i is the center of a spherical gaussian distribution

p_x(x \mid ξ_i, M, β) = \left( \frac{β}{2π} \right)^{n/2} \exp\left( -\frac{β}{2} \| m_i - x \|^2 \right)    (17.13)

where β is the common inverse variance of the gaussians.
The probability density function for the GTM model is obtained by summing over all of the gaussian components, yielding

p_x(x \mid M, β) = \frac{1}{K} \sum_{i=1}^{K} \left( \frac{β}{2π} \right)^{n/2} \exp\left( -\frac{β}{2} \| m_i - x \|^2 \right)    (17.14)

Here K is the total number of gaussian components, which is equal to the number of grid points in the latent space, and the prior probabilities P(i) of the gaussian components are all equal to 1/K.
GTM tries to represent the distribution of the observed data x in the n-dimensional data space in terms of a smaller m-dimensional nonlinear manifold [49]. The gaussian distribution in (17.13) represents a noise or error model, which is needed because the data usually do not lie exactly in such a lower dimensional manifold. It is important to realize that the K gaussian distributions defined in (17.13) have nothing to do with the basis functions φ_i, i = 1, ..., L. Usually it is advisable that the number L of the basis functions is clearly smaller than the number K of node locations and their respective noise distributions (17.13). In this way, one can avoid overfitting and prevent the mixing mapping (17.11) from becoming overly complicated.
The unknown parameters in this model are the weight matrix M and the inverse variance β. These parameters are estimated by fitting the model (17.14) to the observed data vectors x(1), x(2), ..., x(T) using the maximum likelihood method discussed earlier in Section 4.5. The log-likelihood function of the observed data is given by

L(M, β) = \sum_{t=1}^{T} \log p_x(x(t) \mid M, β) = \sum_{t=1}^{T} \log \int p_x(x(t) \mid s, M, β)\, p_s(s)\, ds    (17.15)

where 1/β is the variance of x given s and M, and T is the total number of data vectors x(t).
For applying the modified GTM method, the probability density function p_s(s) of the source vectors s should be known. Assuming that the sources s_1, s_2, ..., s_m are statistically independent, this joint density can be evaluated as the product of the marginal densities of the individual sources:

p_s(s) = \prod_{i=1}^{m} p_i(s_i)    (17.16)
The latent space in the GTM method usually has a small dimension, typically m = 2. The method can be applied in principle for m > 2, but its computational load then increases quite rapidly, just as in the SOM method. For this reason, only the case m = 2 is considered here. The known joint density (17.16) is discretized by evaluating it at the node locations s_{ij} of the rectangular K_1 × K_2 latent space grid and normalizing, which yields the weighted sum of delta functions

p_s(s) = \sum_{i=1}^{K_1} \sum_{j=1}^{K_2} a_{ij}\, δ(s - s_{ij}) = \sum_{q=1}^{K} a_q\, δ(s - s_q)    (17.17)

where K = K_1 K_2, and in the latter form the grid points and their weights have been reindexed with a single index q.
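When the marginal source densities are known, the weights can be obtained simply by evaluating the product density (17.16) at the grid nodes and normalizing. A small sketch, with an illustrative (assumed) choice of marginals:

```python
import numpy as np

K_side = 10
grid = np.linspace(-1, 1, K_side)

# Assumed (known) marginal densities of the two sources, evaluated on the grid.
p1 = np.exp(-np.abs(3 * grid))          # Laplacian-shaped marginal (unnormalized)
p2 = np.ones_like(grid)                 # uniform marginal

# Eqs. (17.16)-(17.17): joint density = product of marginals, discretized at the
# node locations and normalized so that the weights a_ij sum to one.
a = np.outer(p1, p2)
a /= a.sum()
a_q = a.ravel()                         # weights a_q indexed by the node q = (i, j)
```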
Inserting (17.17) into (17.15) yields

L(M, β) = \sum_{t=1}^{T} \log \sum_{q=1}^{K} a_q\, p_x(x(t) \mid ξ_q, M, β)    (17.18)

where ξ_q denotes the latent space node with index q. The likelihood (17.18) is maximized with the EM algorithm. In the M-step, the new weight matrix M_new is obtained by solving the linear system

Φ^T G Φ\, M_{new}^T = Φ^T R X    (17.19)

where Φ is the K × L matrix with elements Φ_{qj} = φ_j(ξ_q), X is the T × n matrix whose rows are the data vectors x(t)^T, and R is the K × T matrix of responsibilities defined in (17.21) below. Furthermore, G is a diagonal matrix with elements

G_{qq} = \sum_{t=1}^{T} R_q(x(t), M, β)    (17.20)

In the E-step, the responsibilities are computed from

R_q(x(t), M, β) = \frac{a_q\, p_x(x(t) \mid ξ_q, M, β)}{\sum_{q'=1}^{K} a_{q'}\, p_x(x(t) \mid ξ_{q'}, M, β)}    (17.21)

Finally, the inverse variance β is updated according to

\frac{1}{β_{new}} = \frac{1}{Tn} \sum_{q=1}^{K} \sum_{t=1}^{T} R_q(x(t), M, β)\, \| M_{new} φ(ξ_q) - x(t) \|^2    (17.22)

where n is the dimension of the data space.
Fig 17.5 Source signals.
Fig 17.6 Separated signals.
In GTM, the EM algorithm is used for maximizing the likelihood. Here the E-step (17.21) consists of computing the responsibilities R_q, and the M-steps (17.19), (17.22) of updating the parameters M and β. The preceding derivation is quite similar to the one in the original GTM method [49]; only the prior density coefficients a_{ij} = a_q have been added to the model.
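For concreteness, a compact sketch of the complete EM iteration of the modified GTM is given below. The grid, the basis functions, the prior weights (taken uniform here for brevity), and the data are all placeholder assumptions; the loop only illustrates the structure of Eqs. (17.19)-(17.22), with a small ridge term added to the linear system for numerical safety.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- setup: latent grid, basis functions, prior weights, and placeholder data ---
n, m, T = 3, 2, 500
K_side, L_side, sigma_b = 8, 4, 0.4
g1d = np.linspace(-1, 1, K_side)
xi = np.array([[a, b] for a in g1d for b in g1d])            # node locations, (K, m)
c1d = np.linspace(-1, 1, L_side)
centers = np.array([[a, b] for a in c1d for b in c1d])       # basis centers, (L, m)
Phi = np.exp(-((xi[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
             / (2 * sigma_b**2))                              # (K, L), Phi[q, j] = phi_j(xi_q)
K, L = Phi.shape
a_q = np.full(K, 1.0 / K)              # discretized source prior (uniform here for brevity)
X = rng.standard_normal((T, n))        # observed data matrix, rows x(t)^T (placeholder data)

M = 0.1 * rng.standard_normal((n, L))  # weight matrix
beta = 1.0                             # inverse noise variance

for _ in range(20):                    # EM iterations
    # E-step (17.21): responsibilities R[q, t] proportional to a_q p_x(x(t) | xi_q, M, beta).
    ref = Phi @ M.T                                            # reference vectors m_q, (K, n)
    d2 = ((ref[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # ||m_q - x(t)||^2, (K, T)
    logw = np.log(a_q)[:, None] - 0.5 * beta * d2
    logw -= logw.max(axis=0, keepdims=True)
    R = np.exp(logw)
    R /= R.sum(axis=0, keepdims=True)
    # M-step (17.19): solve (Phi^T G Phi) M_new^T = Phi^T R X for M_new.
    G = np.diag(R.sum(axis=1))                                 # diagonal matrix, Eq. (17.20)
    M = np.linalg.solve(Phi.T @ G @ Phi + 1e-6 * np.eye(L), Phi.T @ R @ X).T
    # Eq. (17.22): update of the inverse variance beta.
    sq = ((Phi @ M.T)[:, None, :] - X[None, :, :]) ** 2
    beta = 1.0 / ((R[:, :, None] * sq).sum() / (T * n))
```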
After a few iterations, the EM algorithm converges to its final parameter values M and β. Since the density p_s(s) of the sources s is assumed to be known, it is then straightforward to compute the posterior density p(s(t) | x(t), M, β) of the sources given the observed data using Bayes' rule. As mentioned in Chapter 4, this posterior density contains all the relevant information about the sources.
However, it is often convenient to choose a specific source estimate ŝ(t) corresponding to each data vector x(t) for visualization purposes. An often used estimate is the posterior mean

ŝ(t) = \sum_{q=1}^{K} R_q(x(t), M, β)\, ξ_q    (17.23)

If the posterior density of the sources is multimodal, the posterior mean (17.23) can give misleading results. Then it is better to use, for example, the maximum a posteriori (MAP) estimate, which is simply the source value ξ_{q_{max}} corresponding to the maximum responsibility q_{max} = \arg\max_q R_q, q = 1, ..., K, for each sample index t.
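Given the responsibilities, both estimates are one line each; the sketch below uses placeholder responsibilities purely so that it runs on its own (in practice R and xi come from the EM iteration above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid node locations xi (K x m) and responsibilities R (K x T); placeholder values.
K_side, T = 8, 500
g1d = np.linspace(-1, 1, K_side)
xi = np.array([[a, b] for a in g1d for b in g1d])
R = rng.random((len(xi), T))
R /= R.sum(axis=0, keepdims=True)

# Posterior-mean source estimates, Eq. (17.23): s_hat(t) = sum_q R_q(x(t)) xi_q.
s_mean = R.T @ xi                        # shape (T, m)

# MAP-style estimates: the node with maximum responsibility for each t.
s_map = xi[np.argmax(R, axis=0)]         # shape (T, m)
```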
17.4.3 An experiment
Fig 17.7 Joint mixture densities with superimposed maps. Top left: Joint density p(x1, x2).

In the following, a simple experiment involving the two sources shown in Fig 17.5 and three noisy nonlinear mixtures is described. The mixed data was generated by transforming linear mixtures of the original sources using a multilayer perceptron network with a volume-conserving architecture (see [104]). Such an architecture was chosen for ensuring that the total mixing mapping is bijective and therefore reversible, and for avoiding highly complex distortions of the source densities. However, this choice has the advantage that it makes the total mixing mapping more complex than the post-nonlinear model (17.4). Finally, gaussian noise was added to the mixtures. The mixtures were generated using the model

x = As + \tanh(UAs) + n    (17.24)

where U is an upper-triangular matrix with zero diagonal elements. The nonzero elements of U were drawn from a standard gaussian distribution. The matrix U ensures volume conservation of the nonlinearity applied to As.
The modified GTM algorithm presented above was used to learn a separating mapping. For reducing scaling effects, the mixtures were first whitened. After whitening, the mixtures are uncorrelated and have unit variance. Then the modified GTM algorithm was run for eight iterations using a 5 × 5 grid. The number of basis functions was