Part I
MATHEMATICAL PRELIMINARIES
The reader is assumed to have basic knowledge of single-variable probability theory, so that fundamental definitions such as probability, elementary events, and random variables are familiar. Readers who already have a good knowledge of multivariate statistics can skip most of this chapter. For those who need a more extensive review or more information on advanced matters, many good textbooks ranging from elementary ones to advanced treatments exist. A widely used textbook covering probability, random variables, and stochastic processes is [353].
2.1 PROBABILITY DISTRIBUTIONS AND DENSITIES
2.1.1 Distribution of a random variable
In this book, we assume that random variables are continuous-valued unless stated otherwise. The cumulative distribution function (cdf) F_x of a random variable x at point x = x_0 is defined as the probability that x ≤ x_0:

$$ F_x(x_0) = P(x \le x_0) \qquad (2.1) $$

Allowing x_0 to change from $-\infty$ to $\infty$ defines the whole cdf for all values of x. Clearly, for continuous random variables the cdf is a nonnegative, nondecreasing (often monotonically increasing) continuous function whose values lie in the interval $0 \le F_x(x) \le 1$.
Usually a probability distribution is characterized in terms of its density function rather than the cdf. Formally, the probability density function (pdf) p_x(x) of a continuous random variable x is obtained as the derivative of its cumulative distribution function:

$$ p_x(x_0) = \left. \frac{dF_x(x)}{dx} \right|_{x = x_0} \qquad (2.2) $$

In practice, the cdf is computed from the known pdf by using the inverse relationship

$$ F_x(x_0) = \int_{-\infty}^{x_0} p_x(\xi) \, d\xi \qquad (2.3) $$

For simplicity, the subscript denoting the random variable is often dropped, and the density is written p(x) whenever no confusion is possible.
Example 2.1 The gaussian (or normal) probability distribution is used in numerous models and applications, for example, to describe additive noise. Its density function is given by

$$ p_x(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - m)^2}{2\sigma^2} \right) \qquad (2.4) $$

Here the parameter m (mean) determines the peak point of the symmetric density function, and $\sigma$ (standard deviation) its effective width (flatness or sharpness of the peak). See Figure 2.1 for an illustration.

Generally, the cdf of the gaussian density cannot be evaluated in closed form using (2.3). The term $1/(\sqrt{2\pi}\,\sigma)$ in (2.4) is a normalizing factor which guarantees that the density integrates to unity. The gaussian cdf is usually evaluated numerically, for example, via the tabulated error function

$$ \mathrm{erf}(x) = \frac{1}{\sqrt{2\pi}} \int_0^x \exp\left( -\frac{t^2}{2} \right) dt $$

The error function is closely related to the cdf of a normalized gaussian density, for which the mean m = 0 and the variance $\sigma^2$ = 1. See [353] for details.
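As a quick numerical aside (my own sketch, not part of the text), the gaussian pdf (2.4) and its cdf are easy to evaluate in Python. The cdf is expressed below through the standard library function math.erf, which uses the common convention erf(x) = (2/√π)∫₀ˣ e^(−t²) dt, slightly different from the error function defined above; the function names are my own.

```python
import math

def gaussian_pdf(x, m=0.0, sigma=1.0):
    """Gaussian density (2.4) with mean m and standard deviation sigma."""
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def gaussian_cdf(x, m=0.0, sigma=1.0):
    """Gaussian cdf written in terms of the standard error function math.erf."""
    return 0.5 * (1.0 + math.erf((x - m) / (sigma * math.sqrt(2.0))))

# At the mean, the density peaks and the cdf equals 1/2.
print(gaussian_pdf(0.0), gaussian_cdf(0.0))   # 0.3989..., 0.5
```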
2.1.2 Distribution of a random vector
Assume now that x is an n-dimensional random vector

$$ \mathbf{x} = (x_1, x_2, \ldots, x_n)^T \qquad (2.6) $$

where T denotes the transpose. (We take the transpose because all vectors in this book are column vectors. Note that vectors are denoted by boldface lowercase letters.) The components x_1, x_2, ..., x_n of the column vector x are continuous random variables. The concept of probability distribution generalizes easily to such a random vector. In particular, the cumulative distribution function of x is defined by

$$ F_{\mathbf{x}}(\mathbf{x}_0) = P(\mathbf{x} \le \mathbf{x}_0) \qquad (2.7) $$

where P(.) again denotes the probability of the event in parentheses, and x_0 is some constant value of the random vector x. The notation x ≤ x_0 means that each component of the vector x is less than or equal to the respective component of the vector x_0. The multivariate cdf in Eq. (2.7) has similar properties to that of a single random variable: it is a nondecreasing function of each component, with values lying in the interval $0 \le F_{\mathbf{x}}(\mathbf{x}) \le 1$.
The multivariate probability density function p_x(x) of x is defined as the derivative of the cumulative distribution function F_x(x) with respect to all components of the random vector x:

$$ p_{\mathbf{x}}(\mathbf{x}_0) = \left. \frac{\partial^n F_{\mathbf{x}}(\mathbf{x})}{\partial x_1 \, \partial x_2 \cdots \partial x_n} \right|_{\mathbf{x} = \mathbf{x}_0} \qquad (2.8) $$

Hence

$$ F_{\mathbf{x}}(\mathbf{x}_0) = \int_{-\infty}^{x_{01}} \int_{-\infty}^{x_{02}} \cdots \int_{-\infty}^{x_{0n}} p_{\mathbf{x}}(\mathbf{x}) \, dx_n \cdots dx_2 \, dx_1 \qquad (2.9) $$

where x_{0i} is the ith component of the vector x_0. Clearly,

$$ \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} p_{\mathbf{x}}(\mathbf{x}) \, dx_n \cdots dx_1 = 1 $$
In many cases, random variables have nonzero probability density functions only on certain finite intervals. An illustrative example of such a case is presented below.
Example 2.2 Assume that the probability density function of a two-dimensional random vector z = (x, y)^T is nonzero only in the region 0 < x ≤ 2, 0 < y ≤ 1; outside this region, p_z(z), and consequently also the cdf, is zero. In the region where 0 < x ≤ 2 and 0 < y ≤ 1, the cdf is given by integrating the density over the rectangle (0, x] × (0, y]:

$$ F_{\mathbf{z}}(\mathbf{z}) = \int_0^y \int_0^x p_{\mathbf{z}}(u, v) \, du \, dv $$

In the region where 0 < x ≤ 2 and y > 1, the upper limit in integrating over y becomes equal to 1, and the cdf is obtained by inserting y = 1 into the preceding expression. Similarly, in the region x > 2 and 0 < y ≤ 1, the cdf is obtained by inserting x = 2 into the preceding formula. Finally, if both x > 2 and y > 1, the cdf becomes unity, showing that the probability density p_z(z) has been normalized correctly. Collecting these results yields the piecewise expression for F_z(z) over the four regions.
2.1.3 Joint and marginal distributions
The joint distribution of two different random vectors can be handled in a similar manner. In particular, let y be another random vector having in general a dimension m different from the dimension n of x. The vectors x and y can be concatenated into a "supervector" z^T = (x^T, y^T), and the preceding formulas used directly. The cdf that arises is called the joint distribution function of x and y, and is given by

$$ F_{\mathbf{x},\mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) = P(\mathbf{x} \le \mathbf{x}_0, \; \mathbf{y} \le \mathbf{y}_0) $$

where x_0 and y_0 are some constant vectors. The joint probability density function p_{x,y}(x, y) is again defined as the derivative of the joint distribution function F_{x,y}(x, y) with respect to all components of the random vectors x and y. Hence, the relationship

$$ F_{\mathbf{x},\mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) = \int_{-\infty}^{\mathbf{x}_0} \int_{-\infty}^{\mathbf{y}_0} p_{\mathbf{x},\mathbf{y}}(\boldsymbol{\xi}, \boldsymbol{\eta}) \, d\boldsymbol{\eta} \, d\boldsymbol{\xi} $$

holds. The marginal densities p_x(x) of x and p_y(y) of y are obtained by integrating the joint density p_{x,y}(x, y) over the other random vector:

$$ p_{\mathbf{x}}(\mathbf{x}) = \int_{-\infty}^{\infty} p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) \, d\mathbf{y} $$

$$ p_{\mathbf{y}}(\mathbf{y}) = \int_{-\infty}^{\infty} p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) \, d\mathbf{x} $$
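As a small numerical sketch (my own illustration, not from the text), a marginal density can be approximated by numerically integrating the joint density over the other variable. The joint density below, p(x, y) = x + y on the unit square, is an assumed example chosen only because it integrates to one there; its x-marginal is x + 1/2.

```python
import numpy as np

def joint_pdf(x, y):
    # Assumed example density: p(x, y) = x + y for 0 <= x, y <= 1, zero elsewhere.
    return np.where((0 <= x) & (x <= 1) & (0 <= y) & (y <= 1), x + y, 0.0)

y_grid = np.linspace(0.0, 1.0, 2001)
dy = y_grid[1] - y_grid[0]

def marginal_x(x):
    # p_x(x) ~= sum over the y grid of p(x, y) * dy (simple Riemann approximation).
    return float(np.sum(joint_pdf(x, y_grid)) * dy)

print(marginal_x(0.3))   # close to 0.3 + 0.5 = 0.8
```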
Example 2.3 Consider the joint density given in Example 2.2. The marginal densities of the random variables x and y are obtained by integrating the joint density over the other variable:

$$ p_x(x) = \int_0^1 p_{\mathbf{z}}(x, y) \, dy, \qquad 0 < x \le 2 $$

$$ p_y(y) = \int_0^2 p_{\mathbf{z}}(x, y) \, dx, \qquad 0 < y \le 1 $$

Both marginal densities are zero outside the respective intervals.
2.2 EXPECTATIONS AND MOMENTS
2.2.1 Definition and general properties
In practice, the exact probability density function of a vector- or scalar-valued random variable is usually unknown. However, one can instead use expectations of some functions of that random variable for performing useful analyses and processing. A great advantage of expectations is that they can be estimated directly from the data, even though they are formally defined in terms of the density function.

Let g(x) denote some scalar-, vector-, or matrix-valued quantity derived from the random vector x. Its expectation is denoted by E{g(x)}, and is defined by

$$ \mathrm{E}\{\mathbf{g}(\mathbf{x})\} = \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x}) \, p_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x} $$

Here the integral is computed over all the components of x. The integration operation is applied separately to every component of the vector or element of the matrix, yielding as a result another vector or matrix of the same size. If g(x) = x, we get the expectation E{x} of x; this is discussed in more detail in the next subsection. Expectations have some important fundamental properties.
1. Linearity. Let x_i, i = 1, ..., m, be a set of different random vectors, and a_i, i = 1, ..., m, some nonrandom scalar coefficients. Then

$$ \mathrm{E}\Big\{ \sum_{i=1}^{m} a_i \mathbf{x}_i \Big\} = \sum_{i=1}^{m} a_i \, \mathrm{E}\{\mathbf{x}_i\} $$

2. Linear transformation. Let x be an m-dimensional random vector, and A and B some nonrandom k × m and m × l matrices, respectively. Then

$$ \mathrm{E}\{\mathbf{A}\mathbf{x}\} = \mathbf{A}\,\mathrm{E}\{\mathbf{x}\}, \qquad \mathrm{E}\{\mathbf{x}^T\mathbf{B}\} = \mathrm{E}\{\mathbf{x}^T\}\,\mathbf{B} $$

3. Transformation invariance. Let y = g(x) be a vector-valued function of the random vector x. Then

$$ \int_{-\infty}^{\infty} \mathbf{y} \, p_{\mathbf{y}}(\mathbf{y}) \, d\mathbf{y} = \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x}) \, p_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x} $$

that is, the expectation E{y} = E{g(x)} can be computed using either the density of y or the density of x.
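The linearity property is easy to check by simulation. The following sketch (my own, with arbitrarily chosen coefficients and distributions, not from the text) compares the sample average of a_1 x_1 + a_2 x_2 with a_1 E{x_1} + a_2 E{x_2}.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 200_000                                   # number of samples

# Two independent 3-dimensional random vectors with known means.
x1 = rng.normal(loc=1.0, scale=2.0, size=(K, 3))    # E{x1} = (1, 1, 1)
x2 = rng.uniform(low=0.0, high=4.0, size=(K, 3))    # E{x2} = (2, 2, 2)
a1, a2 = 0.5, -3.0

lhs = np.mean(a1 * x1 + a2 * x2, axis=0)      # sample estimate of E{a1 x1 + a2 x2}
rhs = a1 * np.mean(x1, axis=0) + a2 * np.mean(x2, axis=0)
print(lhs, rhs)                               # both close to (-5.5, -5.5, -5.5)
```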
2.2.2 Mean vector and correlation matrix
Moments of a random vector x are typical expectations used to characterize it. They are obtained when g(x) consists of products of components of x. In particular, the first moment of a random vector x is called the mean vector m_x of x. It is defined as the expectation of x:

$$ \mathbf{m}_{\mathbf{x}} = \mathrm{E}\{\mathbf{x}\} = \int_{-\infty}^{\infty} \mathbf{x} \, p_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x} $$

Each component m_{x_i} of the mean vector is given by

$$ m_{x_i} = \mathrm{E}\{x_i\} = \int_{-\infty}^{\infty} x_i \, p_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x} = \int_{-\infty}^{\infty} x_i \, p_{x_i}(x_i) \, dx_i \qquad (2.20) $$

where the last form follows because integrating the joint density over the other components of x yields the marginal density p_{x_i}(x_i).
Another important set of moments consists of correlations between pairs of components of x. The correlation r_ij between the ith and jth components of x is given by the second moment

$$ r_{ij} = \mathrm{E}\{x_i x_j\} = \int_{-\infty}^{\infty} x_i x_j \, p_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_i x_j \, p_{x_i, x_j}(x_i, x_j) \, dx_j \, dx_i \qquad (2.21) $$

Note that correlation can be negative or positive.

The n × n correlation matrix

$$ \mathbf{R}_{\mathbf{x}} = \mathrm{E}\{\mathbf{x}\mathbf{x}^T\} \qquad (2.22) $$

of x has the correlations r_ij as its elements. It is a symmetric and positive semidefinite matrix; hence its eigenvalues are real and nonnegative, and its eigenvectors can be chosen so that they are mutually orthonormal.
Higher-order moments can be defined analogously, but their discussion is postponed to Section 2.7. Instead, we shall first consider the corresponding central moments, as well as second-order moments for two different random vectors.
2.2.3 Covariances and joint moments
Central moments are defined in a similar fashion to usual moments, but the mean vectors of the random vectors involved are subtracted prior to computing the expectation. Clearly, central moments are only meaningful above the first order. The quantity corresponding to the correlation matrix R_x is called the covariance matrix of x, defined by

$$ \mathbf{C}_{\mathbf{x}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T\} \qquad (2.24) $$

The elements c_ij of the n × n matrix C_x are called covariances, and they are the central moments corresponding to the correlations¹ r_ij defined in Eq. (2.21):

$$ c_{ij} = \mathrm{E}\{(x_i - m_{x_i})(x_j - m_{x_j})\} $$

¹In statistics, normalized versions of the covariances (correlation coefficients) are often used instead, and the matrix consisting of them is called the correlation matrix. In this book, the correlation matrix is defined by the formula (2.22), which is a common practice in signal processing, neural networks, and engineering.
The covariance matrix C_x satisfies the same properties as the correlation matrix R_x. Using the properties of the expectation operator, it is easy to see that

$$ \mathbf{R}_{\mathbf{x}} = \mathbf{C}_{\mathbf{x}} + \mathbf{m}_{\mathbf{x}}\mathbf{m}_{\mathbf{x}}^T $$

If the mean vector m_x = 0, the correlation and covariance matrices become the same. If necessary, the data can easily be made zero mean by subtracting the (estimated) mean vector from the data vectors as a preprocessing step. This is the usual practice in independent component analysis, and thus in later chapters we simply denote by C_x the correlation/covariance matrix, often even dropping the subscript x for simplicity.
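As a numerical aside (my own sketch, not part of the text), the relation R_x = C_x + m_x m_x^T can be verified directly from sample estimates; NumPy's cov routine is used with bias=True so that it matches a 1/K-normalized estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 100_000
# Samples of a 3-dimensional random vector with nonzero mean (rows = samples).
X = rng.normal(size=(K, 3)) @ np.array([[2.0, 0.0, 0.0],
                                        [1.0, 1.0, 0.0],
                                        [0.0, 0.5, 0.3]]) + np.array([1.0, -2.0, 0.5])

m = X.mean(axis=0)                       # estimate of the mean vector m_x
R = (X.T @ X) / K                        # estimate of the correlation matrix R_x = E{x x^T}
C = np.cov(X, rowvar=False, bias=True)   # estimate of the covariance matrix C_x (1/K normalization)

print(np.allclose(R, C + np.outer(m, m)))   # True: R_x = C_x + m_x m_x^T
```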
For a single random variable x, the mean vector reduces to its mean value m_x = E{x}, the correlation matrix to the second moment E{x²}, and the covariance matrix to the variance of x:

$$ \sigma_x^2 = \mathrm{E}\{(x - m_x)^2\} $$
Joint moments of two random vectors x and y are defined in a similar way in terms of their joint density. In particular, the expectation of a quantity g(x, y) is

$$ \mathrm{E}\{\mathbf{g}(\mathbf{x}, \mathbf{y})\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x}, \mathbf{y}) \, p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) \, d\mathbf{y} \, d\mathbf{x} \qquad (2.28) $$

The integrals are computed over all the components of x and y.
Of the joint expectations, the most widely used are the cross-correlation matrix

$$ \mathbf{R}_{\mathbf{x}\mathbf{y}} = \mathrm{E}\{\mathbf{x}\mathbf{y}^T\} $$

and the cross-covariance matrix

$$ \mathbf{C}_{\mathbf{x}\mathbf{y}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{y} - \mathbf{m}_{\mathbf{y}})^T\} $$

Note that the dimensions of the vectors x and y can be different. Hence, the cross-correlation and cross-covariance matrices are not necessarily square matrices, and they are not symmetric in general. However, from their definitions it follows easily that R_xy = R_yx^T and C_xy = C_yx^T.

Fig. 2.2 An example of negative covariance between the random variables x and y.

In the example of Fig. 2.2, the random variables x and y have a clear negative covariance (or correlation). A positive value of x mostly implies that y is negative, and vice versa. On the other hand, in the case of Fig. 2.3, it is not possible to infer anything about the value of y by observing x. Hence, their covariance is approximately zero.

In practice, the expectations cannot be computed exactly, because the probability densities involved are unknown. Instead, they are estimated by averaging over a set of K samples x_1, x_2, ..., x_K of the random vector x:

$$ \widehat{\mathrm{E}}\{\mathbf{g}(\mathbf{x})\} = \frac{1}{K} \sum_{j=1}^{K} \mathbf{g}(\mathbf{x}_j) \qquad (2.33) $$
For example, applying (2.33), we get for the mean vector m_x of x its standard estimator, the sample mean

$$ \widehat{\mathbf{m}}_{\mathbf{x}} = \frac{1}{K} \sum_{j=1}^{K} \mathbf{x}_j $$

where the hat over m is a standard notation for an estimator of a quantity. Similarly, if instead of the joint density p_{x,y}(x, y) of the random vectors x and y we know K sample pairs (x_1, y_1), (x_2, y_2), ..., (x_K, y_K), the cross-correlation matrix R_xy can be estimated using the respective sample average

$$ \widehat{\mathbf{R}}_{\mathbf{x}\mathbf{y}} = \frac{1}{K} \sum_{j=1}^{K} \mathbf{x}_j \mathbf{y}_j^T $$
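The following short sketch (mine, with made-up data) computes these sample estimators directly from data matrices whose rows are the observed samples.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 50_000
X = rng.normal(size=(K, 4))                        # K samples of a 4-dimensional vector x
Y = X[:, :2] + 0.1 * rng.normal(size=(K, 2))       # K samples of a related 2-dimensional vector y

m_hat = X.mean(axis=0)                             # sample mean, estimate of m_x
R_xy_hat = (X.T @ Y) / K                           # sample cross-correlation matrix, shape (4, 2)

print(m_hat.shape, R_xy_hat.shape)                 # (4,) (4, 2)
```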
2.3 UNCORRELATEDNESS AND INDEPENDENCE
2.3.1 Uncorrelatedness and whiteness
Two random vectors x and y are uncorrelated if their cross-covariance matrix C_xy is a zero matrix:

$$ \mathbf{C}_{\mathbf{x}\mathbf{y}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{y} - \mathbf{m}_{\mathbf{y}})^T\} = \mathbf{0} \qquad (2.37) $$

In the special case of two different scalar random variables x and y (for example, two components of a random vector z), x and y are uncorrelated if their covariance is zero,

$$ \mathrm{E}\{(x - m_x)(y - m_y)\} = 0 $$

which is equivalent to

$$ \mathrm{E}\{xy\} = \mathrm{E}\{x\}\,\mathrm{E}\{y\} \qquad (2.40) $$

Another important special case concerns the correlations between the components of a single random vector x, given by the covariance matrix C_x defined in (2.24). In this case a condition equivalent to (2.37) can never be met, because each component of x is perfectly correlated with itself. The best that we can achieve is that different components of x are mutually uncorrelated, leading to the uncorrelatedness condition

$$ \mathbf{C}_{\mathbf{x}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T\} = \mathbf{D} $$

where D is a diagonal matrix containing the variances of the components of x on its main diagonal. In particular, random vectors having zero mean and unit covariance (and hence correlation) matrix, possibly multiplied by a constant variance $\sigma^2$, are called white:

$$ \mathbf{C}_{\mathbf{x}} = \mathbf{R}_{\mathbf{x}} = \sigma^2 \mathbf{I} $$

where I is the n × n identity matrix.
Assume now that an orthogonal transformation defined by an n × n matrix T is applied to the random vector x. Mathematically, this can be expressed as

$$ \mathbf{y} = \mathbf{T}\mathbf{x}, \qquad \text{where } \mathbf{T}\mathbf{T}^T = \mathbf{T}^T\mathbf{T} = \mathbf{I} $$

An orthogonal matrix T defines a rotation (change of coordinate axes) in the n-dimensional space, preserving norms and distances. Assuming that x is white, we get

$$ \mathbf{C}_{\mathbf{y}} = \mathrm{E}\{\mathbf{y}\mathbf{y}^T\} = \mathrm{E}\{\mathbf{T}\mathbf{x}\mathbf{x}^T\mathbf{T}^T\} = \mathbf{T}\,\mathrm{E}\{\mathbf{x}\mathbf{x}^T\}\,\mathbf{T}^T = \sigma^2\,\mathbf{T}\mathbf{T}^T = \sigma^2\mathbf{I} $$

showing that y is white, too. Hence we can conclude that the whiteness property is preserved under orthogonal transformations. In fact, whitening of the original data can be done in infinitely many ways. Whitening will be discussed in more detail in Chapter 6, because it is a highly useful and widely used preprocessing step in independent component analysis.

It is clear that there also exist infinitely many ways to decorrelate the original data, because whiteness is a special case of the uncorrelatedness property.
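A small numerical sketch of these ideas (my own, not from the text): the data are first whitened using an eigendecomposition of the estimated covariance matrix, and an arbitrary orthogonal transformation of the whitened data is then seen to be white as well.

```python
import numpy as np

rng = np.random.default_rng(3)
K, n = 100_000, 3
X = rng.normal(size=(K, n)) @ rng.normal(size=(n, n))    # correlated zero-mean data, rows = samples

# Whitening: C_x = E D E^T, whitened data z = D^{-1/2} E^T x (one of infinitely many choices).
C = np.cov(X, rowvar=False, bias=True)
d, E = np.linalg.eigh(C)                     # eigenvalues d, orthonormal eigenvectors E
V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T      # symmetric whitening matrix
Z = X @ V.T
print(np.round(np.cov(Z, rowvar=False, bias=True), 2))    # ~ identity matrix

# Any orthogonal matrix T keeps the data white.
T, _ = np.linalg.qr(rng.normal(size=(n, n)))              # random orthogonal matrix
Y = Z @ T.T
print(np.round(np.cov(Y, rowvar=False, bias=True), 2))    # ~ identity matrix again
```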
Example 2.5 Consider the linear signal model

$$ \mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{n} \qquad (2.47) $$

where x is an n-dimensional random or data vector, A an n × m constant matrix, s an m-dimensional random signal vector, and n an n-dimensional random vector that usually describes additive noise. The correlation matrix of x then becomes

$$ \mathbf{R}_{\mathbf{x}} = \mathrm{E}\{\mathbf{x}\mathbf{x}^T\} = \mathbf{A}\mathbf{R}_{\mathbf{s}}\mathbf{A}^T + \mathbf{A}\mathbf{R}_{\mathbf{s}\mathbf{n}} + \mathbf{R}_{\mathbf{n}\mathbf{s}}\mathbf{A}^T + \mathbf{R}_{\mathbf{n}} $$

Usually the signal vector s and the noise vector n can be assumed to be mutually uncorrelated, and the noise to have zero mean, so that

$$ \mathbf{R}_{\mathbf{s}\mathbf{n}} = \mathrm{E}\{\mathbf{s}\mathbf{n}^T\} = \mathrm{E}\{\mathbf{s}\}\,\mathrm{E}\{\mathbf{n}^T\} = \mathbf{0} \qquad (2.49) $$

Similarly, R_ns = 0, and the correlation matrix of x simplifies to

$$ \mathbf{R}_{\mathbf{x}} = \mathbf{A}\mathbf{R}_{\mathbf{s}}\mathbf{A}^T + \mathbf{R}_{\mathbf{n}} \qquad (2.51) $$

Sometimes, for example in a noisy version of the ICA model (Chapter 15), the components of the signal vector s are also mutually uncorrelated, so that the signal correlation matrix becomes the diagonal matrix

$$ \mathbf{D}_{\mathbf{s}} = \mathrm{diag}\big(\mathrm{E}\{s_1^2\}, \mathrm{E}\{s_2^2\}, \ldots, \mathrm{E}\{s_m^2\}\big) $$

and the correlation matrix of x can be written in the form

$$ \mathbf{R}_{\mathbf{x}} = \mathbf{A}\mathbf{D}_{\mathbf{s}}\mathbf{A}^T + \mathbf{R}_{\mathbf{n}} = \sum_{i=1}^{m} \mathrm{E}\{s_i^2\}\,\mathbf{a}_i\mathbf{a}_i^T + \mathbf{R}_{\mathbf{n}} $$

where a_i is the ith column vector of the matrix A.

The noisy linear signal or data model (2.47) is encountered frequently in signal processing and other areas, and the assumptions made on s and n vary depending on the problem at hand. It is straightforward to see that the results derived in this example hold for the respective covariance matrices as well.
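As an illustration (my own sketch with arbitrarily chosen dimensions, not from the text), the structure R_x = A D_s A^T + R_n can be checked numerically for uncorrelated, zero-mean sources and noise.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, K = 4, 2, 200_000

A = rng.normal(size=(n, m))                       # constant mixing matrix
s_var = np.array([3.0, 0.5])                      # variances of the uncorrelated sources
S = rng.normal(size=(K, m)) * np.sqrt(s_var)      # zero-mean uncorrelated sources
N = 0.2 * rng.normal(size=(K, n))                 # zero-mean noise, R_n = 0.04 I

X = S @ A.T + N                                   # noisy linear model x = A s + n
R_x = (X.T @ X) / K                               # sample correlation matrix of x

R_theory = A @ np.diag(s_var) @ A.T + 0.04 * np.eye(n)
print(np.max(np.abs(R_x - R_theory)))             # small sampling error
```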
2.3.2 Statistical independence
A key concept that constitutes the foundation of independent component analysis is
statistical independence. For simplicity, consider first the case of two different scalar random variables x and y. The random variable x is independent of y if knowing the value of y does not give any information on the value of x. For example, x and y can be outcomes of two events that have nothing to do with each other, or random signals originating from two quite different physical processes that are in no way related to each other. Examples of such independent random variables are the value of a die thrown and the outcome of a coin tossed, or a speech signal and background noise originating from a ventilation system at a certain time instant.
Mathematically, statistical independence is defined in terms of probability densities. The random variables x and y are said to be independent if and only if

$$ p_{x,y}(x, y) = p_x(x)\,p_y(y) \qquad (2.54) $$

In words, the joint density p_{x,y}(x, y) of x and y must factorize into the product of their marginal densities p_x(x) and p_y(y). Equivalently, independence could be defined by replacing the probability density functions in the definition (2.54) by the respective cumulative distribution functions, which must also be factorizable.

Independent random variables satisfy the basic property

$$ \mathrm{E}\{g(x)h(y)\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x)h(y)\,p_{x,y}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} g(x)p_x(x)\,dx \int_{-\infty}^{\infty} h(y)p_y(y)\,dy = \mathrm{E}\{g(x)\}\,\mathrm{E}\{h(y)\} \qquad (2.55) $$

where g(x) and h(y) are any functions of x and y for which the above expectations exist.

Equation (2.55) reveals that statistical independence is a much stronger property than uncorrelatedness. Equation (2.40), defining uncorrelatedness, is obtained from the independence property (2.55) as a special case where both g(x) and h(y) are linear functions, and thus takes into account second-order statistics (correlations or covariances) only. However, if the random variables have gaussian distributions, independence and uncorrelatedness become the same thing. This very special property of gaussian distributions will be discussed in more detail in Section 2.5.
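The distinction can be demonstrated numerically (my own sketch, not from the text): with x standard gaussian and y = x², the covariance of x and y is close to zero, yet choosing the nonlinear functions g(x) = x² and h(y) = y exposes the dependence.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=1_000_000)
y = x ** 2                       # y is completely determined by x, hence not independent of it

# Uncorrelated: the covariance is (close to) zero ...
print(np.mean(x * y) - np.mean(x) * np.mean(y))        # ~ 0

# ... but property (2.55) fails for the nonlinear choice g(x) = x**2, h(y) = y.
print(np.mean(x**2 * y), np.mean(x**2) * np.mean(y))   # ~ 3.0 versus ~ 1.0
```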
Definition (2.54) of independence generalizes in a natural way to more than two random variables, and to random vectors. Let x, y, z, ... be random vectors, which may in general have different dimensions. The independence condition for x, y, z, ... is then

$$ p_{\mathbf{x},\mathbf{y},\mathbf{z},\ldots}(\mathbf{x}, \mathbf{y}, \mathbf{z}, \ldots) = p_{\mathbf{x}}(\mathbf{x})\,p_{\mathbf{y}}(\mathbf{y})\,p_{\mathbf{z}}(\mathbf{z})\cdots \qquad (2.57) $$

The general definition (2.57) gives rise to a generalization of the standard notion of statistical independence. The components of the random vector x are themselves scalar random variables, and the same holds for y and z. Clearly, the components of x can be mutually dependent, while they are independent with respect to the components of the other random vectors y and z, and (2.57) still holds. A similar argument applies to the random vectors y and z.
Example 2.6 First consider the random variables x and y discussed in Examples 2.2 and 2.3. Comparing their joint density with the product of the marginal densities obtained in Example 2.3 shows directly whether the factorization condition (2.54) holds, and hence whether x and y are independent. A further illustration, involving a random vector x and a one-dimensional random vector y, is given in [419].
2.4 CONDITIONAL DENSITIES AND BAYES’ RULE
Thus far, we have dealt with the usual probability densities, joint densities, and marginal densities. Still one class of probability density functions consists of conditional densities. They are especially important in estimation theory, which will be studied in Chapter 4. Conditional densities arise when answering the following question: "What is the probability density of a random vector x given that another random vector y has the fixed value y_0?" Provided that the densities involved exist, the conditional probability density of x given y is defined as

$$ p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y}) = \frac{p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})}{p_{\mathbf{y}}(\mathbf{y})} \qquad (2.59) $$

Loosely speaking, (2.59) describes the probability that x lies in a small region around a point x_0, given that y lies in a small region around y_0; here x_0 and y_0 are some constant vectors, and both regions are small. Similarly,

$$ p_{\mathbf{y}|\mathbf{x}}(\mathbf{y} \,|\, \mathbf{x}) = \frac{p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})}{p_{\mathbf{x}}(\mathbf{x})} \qquad (2.60) $$
In conditional densities, the conditioning quantity, y in (2.59) and x in (2.60), is thought of as a nonrandom parameter vector, even though it is actually a random vector itself.
Example 2.7 Consider the two-dimensional joint density p_{x,y}(x, y) depicted in Fig. 2.4. For a given constant value x_0, the conditional density of y is

$$ p_{y|x}(y \,|\, x_0) = \frac{p_{x,y}(x_0, y)}{p_x(x_0)} $$

Hence, it is a one-dimensional distribution obtained by "slicing" the joint distribution p_{x,y}(x, y) parallel to the y-axis at the point x = x_0. Note that the denominator p_x(x_0) is merely a constant that scales the slice so that it integrates to unity. Similarly, the conditional density p_{x|y}(x | y_0) can be obtained geometrically by slicing the joint distribution of Fig. 2.4 parallel to the x-axis at the point y = y_0. The resulting conditional distributions are shown in Fig. 2.5 for a fixed value x_0 of x, and in Fig. 2.6 for a fixed value y_0 of y.

The marginal densities p_x(x) and p_y(y) appearing in the denominators of (2.59) and (2.60) are obtained by integrating the joint density p_{x,y}(x, y) over the unconditional random vector. This also shows immediately that the conditional densities are true probability densities satisfying

$$ \int_{-\infty}^{\infty} p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y}) \, d\mathbf{x} = 1, \qquad \int_{-\infty}^{\infty} p_{\mathbf{y}|\mathbf{x}}(\mathbf{y} \,|\, \mathbf{x}) \, d\mathbf{y} = 1 $$
Fig. 2.4 A two-dimensional joint density of the random variables x and y.

If the random vectors x and y are statistically independent, the conditional density p_{x|y}(x | y) equals the unconditional density p_x(x) of x, since x does not depend in any way on y; similarly, p_{y|x}(y | x) = p_y(y). Both Eqs. (2.59) and (2.60) can then be written in the form

$$ p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) = p_{\mathbf{x}}(\mathbf{x})\,p_{\mathbf{y}}(\mathbf{y}) $$

which is exactly the definition of independence of the random vectors x and y.
In the general case, we get from Eqs. (2.59) and (2.60) two different expressions for the joint density of x and y:

$$ p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) = p_{\mathbf{y}|\mathbf{x}}(\mathbf{y} \,|\, \mathbf{x})\,p_{\mathbf{x}}(\mathbf{x}) = p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y})\,p_{\mathbf{y}}(\mathbf{y}) $$

Equating these two expressions and solving for p_{y|x}(y | x) yields

$$ p_{\mathbf{y}|\mathbf{x}}(\mathbf{y} \,|\, \mathbf{x}) = \frac{p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y})\,p_{\mathbf{y}}(\mathbf{y})}{p_{\mathbf{x}}(\mathbf{x})} \qquad (2.64) $$

where the denominator can be computed by integrating the numerator if necessary:

$$ p_{\mathbf{x}}(\mathbf{x}) = \int_{-\infty}^{\infty} p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y})\,p_{\mathbf{y}}(\mathbf{y})\,d\mathbf{y} \qquad (2.65) $$
Fig. 2.6 The conditional probability density p_{x|y}.
Formula (2.64), together with (2.65), is called Bayes' rule. This rule is important especially in statistical estimation theory. There, p_{x|y}(x | y) is typically the conditional density of the measurement vector x, with y denoting the vector of unknown random parameters. Bayes' rule (2.64) allows the computation of the posterior density p_{y|x}(y | x) of the unknown parameters y after observing the data x, assuming or knowing the prior distribution p_y(y) of the random parameters y. These matters will be discussed in more detail in Chapter 4.
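A toy numerical illustration of Bayes' rule (my own, with an arbitrarily chosen gaussian prior and likelihood, not from the text): the posterior over a scalar parameter y is computed on a grid from the likelihood p(x | y) and the prior p(y), with the denominator (2.65) obtained by numerical integration.

```python
import numpy as np

def gauss(t, m, s):
    # One-dimensional gaussian density, cf. (2.4).
    return np.exp(-(t - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

y_grid = np.linspace(-5.0, 5.0, 2001)
dy = y_grid[1] - y_grid[0]

prior = gauss(y_grid, 0.0, 2.0)            # assumed prior p_y(y)
x_obs = 1.3                                # one observed measurement
likelihood = gauss(x_obs, y_grid, 1.0)     # p_{x|y}(x_obs | y): x = y + unit-variance gaussian noise

evidence = np.sum(likelihood * prior) * dy           # p_x(x_obs), Eq. (2.65) on the grid
posterior = likelihood * prior / evidence            # Bayes' rule (2.64)

print(np.sum(posterior) * dy)                        # ~ 1: the posterior is a proper density
print(y_grid[np.argmax(posterior)])                  # posterior mode, ~ 1.04 for these choices
```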
Conditional expectations are defined similarly to the expectations defined earlier, but the pdf appearing in the integral is now the appropriate conditional density. Hence, for example,

$$ \mathrm{E}\{\mathbf{x} \,|\, \mathbf{y}\} = \int_{-\infty}^{\infty} \mathbf{x}\,p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y})\,d\mathbf{x} $$

which is still a function of the random vector y. Taking the expectation of such a conditional expectation with respect to y then gives the joint expectation,

$$ \mathrm{E}\{\mathbf{g}(\mathbf{x}, \mathbf{y})\} = \mathrm{E}_{\mathbf{y}}\{\,\mathrm{E}\{\mathbf{g}(\mathbf{x}, \mathbf{y}) \,|\, \mathbf{y}\}\,\} $$

Actually, this is just an alternative two-stage procedure for computing the expectation (2.28), following easily from Bayes' rule.
2.5 THE MULTIVARIATE GAUSSIAN DENSITY
The multivariate gaussian or normal density has several special properties that make it unique among probability density functions. Due to its importance, we shall discuss it more thoroughly in this section.
Consider an n-dimensional random vector x. It is said to be gaussian if the probability density function of x has the form

$$ p_{\mathbf{x}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}(\det \mathbf{C}_{\mathbf{x}})^{1/2}} \exp\left( -\frac{1}{2}(\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T \mathbf{C}_{\mathbf{x}}^{-1} (\mathbf{x} - \mathbf{m}_{\mathbf{x}}) \right) \qquad (2.68) $$

Here m_x is an n-dimensional vector and C_x an n × n matrix; it will be seen shortly that they are in fact the mean vector and the covariance matrix of x. The notation det A is used for the determinant of a matrix A, in this case C_x. It is easy to see that for a single random variable x (n = 1), the density (2.68) reduces to the one-dimensional gaussian pdf (2.4) discussed briefly in Example 2.1. Note also that the covariance matrix C_x is assumed strictly positive definite, which also implies that its inverse exists.
It can be shown that for the density (2.68)

$$ \mathrm{E}\{\mathbf{x}\} = \mathbf{m}_{\mathbf{x}}, \qquad \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T\} = \mathbf{C}_{\mathbf{x}} $$

Hence calling m_x the mean vector and C_x the covariance matrix of the multivariate gaussian density is justified.
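The density (2.68) is straightforward to evaluate numerically; the following sketch (my own, with an assumed mean and covariance) uses NumPy's determinant and matrix inverse routines.

```python
import numpy as np

def multivariate_gaussian_pdf(x, m, C):
    """Evaluate the density (2.68) at a point x, for mean vector m and covariance matrix C."""
    n = len(m)
    d = x - m
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(C)))
    return norm_const * np.exp(-0.5 * d @ np.linalg.inv(C) @ d)

m = np.array([1.0, -1.0])
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(multivariate_gaussian_pdf(np.array([1.0, -1.0]), m, C))   # density value at the mean
```

In practice one would typically use a library routine such as scipy.stats.multivariate_normal, and work with a Cholesky factorization rather than an explicit inverse for numerical stability.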
2.5.1 Properties of the gaussian density
In the following, we list the most important properties of the multivariate gaussian density, omitting proofs. The proofs can be found in many books; see, for example, [353, 419, 407].
Only first- and second-order statistics are needed. Knowledge of the mean vector m_x and the covariance matrix C_x of x is sufficient for defining the multivariate gaussian density (2.68) completely. Therefore, all the higher-order moments must also depend only on m_x and C_x. This implies that these moments do not carry any novel information about the gaussian distribution. An important consequence of this fact and the form of the gaussian pdf is that linear processing methods based on first- and second-order statistical information are usually optimal for gaussian data. For example, independent component analysis does not bring out anything new compared with standard principal component analysis (to be discussed later) for gaussian data. Similarly, linear time-invariant discrete-time filters used in classic statistical signal processing are optimal for filtering gaussian data.
Linear transformations are gaussian. If x is a gaussian random vector and y = Ax its linear transformation, then y is also gaussian, with mean vector m_y = A m_x and covariance matrix C_y = A C_x A^T. A special case of this result says that any linear combination of gaussian random variables is itself gaussian. This result again has implications for standard independent component analysis: it is impossible to estimate the ICA model for gaussian data, that is, one cannot blindly separate gaussian sources from their mixtures without extra knowledge of the sources, as will be discussed in Chapter 7.
Marginal and conditional densities are gaussian. Consider now two random vectors x and y having dimensions n and m, respectively. Let us collect them in a single random vector z^T = (x^T, y^T). If z is gaussian, then the marginal densities p_x(x) and p_y(y) of the joint gaussian density p_z(z) are gaussian. Also, the conditional densities p_{x|y} and p_{y|x} are n- and m-dimensional gaussian densities, respectively. The mean and covariance matrix of the conditional density can be expressed in terms of the means, covariances, and cross-covariances of x and y; for p_{x|y}(x | y) they are given by the standard formulas

$$ \mathbf{m}_{\mathbf{x}|\mathbf{y}} = \mathbf{m}_{\mathbf{x}} + \mathbf{C}_{\mathbf{x}\mathbf{y}}\mathbf{C}_{\mathbf{y}}^{-1}(\mathbf{y} - \mathbf{m}_{\mathbf{y}}), \qquad \mathbf{C}_{\mathbf{x}|\mathbf{y}} = \mathbf{C}_{\mathbf{x}} - \mathbf{C}_{\mathbf{x}\mathbf{y}}\mathbf{C}_{\mathbf{y}}^{-1}\mathbf{C}_{\mathbf{y}\mathbf{x}} $$
Uncorrelatedness and geometrical structure. We mentioned earlier that uncorrelated gaussian random variables are also independent, a property which is not shared by other distributions in general. Derivation of this important result is left to the reader as an exercise. If the covariance matrix C_x of the multivariate gaussian density (2.68) is not diagonal, the components of x are correlated. Since C_x is a symmetric and positive definite matrix, it can always be represented in the form

$$ \mathbf{C}_{\mathbf{x}} = \mathbf{E}\mathbf{D}\mathbf{E}^T = \sum_{i=1}^{n} \lambda_i \mathbf{e}_i \mathbf{e}_i^T $$

Here E is an orthogonal matrix (that is, a rotation) having as its columns e_1, e_2, ..., e_n, the n eigenvectors of C_x, and D = diag(λ_1, λ_2, ..., λ_n) is the diagonal matrix containing the respective eigenvalues λ_i of C_x. Now it can readily be verified that applying the rotation

$$ \mathbf{u} = \mathbf{E}^T \mathbf{x} $$

to x makes the components of the gaussian distribution of u uncorrelated, and hence also independent.
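A short numerical check of this decomposition (my own sketch, not from the text): the estimated covariance matrix is factored with numpy.linalg.eigh, and the rotated variable u = E^T x is seen to have the diagonal covariance matrix D.

```python
import numpy as np

rng = np.random.default_rng(6)
K = 200_000
C_true = np.array([[4.0, 1.5],
                   [1.5, 1.0]])
L = np.linalg.cholesky(C_true)
X = rng.normal(size=(K, 2)) @ L.T            # zero-mean gaussian samples with covariance C_true

C = np.cov(X, rowvar=False, bias=True)       # estimated covariance matrix C_x
lam, E = np.linalg.eigh(C)                   # eigenvalues lambda_i and orthonormal eigenvectors e_i

print(np.allclose(C, E @ np.diag(lam) @ E.T))             # True: C_x = E D E^T
U = X @ E                                                  # rows of U are the rotated vectors u = E^T x
print(np.round(np.cov(U, rowvar=False, bias=True), 3))     # ~ diag(lambda_1, lambda_2)
```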
Moreover, the eigenvalues λ_i and eigenvectors e_i of the covariance matrix C_x reveal the geometrical structure of the multivariate gaussian distribution. The contours of any pdf are defined by curves of constant values of the density, given by the equation p_x(x) = constant. For the multivariate gaussian density, this is equivalent to requiring that the exponent is a constant c:

$$ (\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T \mathbf{C}_{\mathbf{x}}^{-1} (\mathbf{x} - \mathbf{m}_{\mathbf{x}}) = c $$

This equation defines an ellipsoid in the n-dimensional space. The principal axes of the ellipsoid point in the directions of the eigenvectors e_i of C_x, and their lengths are proportional to the square roots of the respective eigenvalues λ_i.

Fig. 2.7 Illustration of a multivariate gaussian probability density.
2.5.2 Central limit theorem
Still another argument underlining the significance of the gaussian distribution is provided by the central limit theorem. Let

$$ x_k = \sum_{i=1}^{k} z_i $$

be a partial sum of a sequence {z_i} of independent and identically distributed random variables z_i. Since the mean and variance of x_k can grow without bound as k → ∞, consider instead of x_k the standardized variables