Part I
MATHEMATICAL PRELIMINARIES
The reader is assumed to have basic knowledge of single-variable probability theory, so that fundamental definitions such as probability, elementary events, and random variables are familiar. Readers who already have a good knowledge of multivariate statistics can skip most of this chapter. For those who need a more extensive review or more information on advanced matters, many good textbooks ranging from elementary ones to advanced treatments exist. A widely used textbook covering probability, random variables, and stochastic processes is [353].
2.1 PROBABILITY DISTRIBUTIONS AND DENSITIES
2.1.1 Distribution of a random variable
In this book, we assume that random variables are continuous-valued unless stated otherwise. The cumulative distribution function (cdf) F_x of a random variable x at point x = x_0 is defined as the probability that x ≤ x_0:

$$ F_x(x_0) = P(x \le x_0) \qquad (2.1) $$

Allowing x_0 to change from $-\infty$ to $\infty$ defines the whole cdf for all values of x. Clearly, for continuous random variables the cdf is a nonnegative, nondecreasing (often monotonically increasing) continuous function whose values lie in the interval $0 \le F_x(x) \le 1$.
Usually a probability distribution is characterized in terms of its density function rather than the cdf. Formally, the probability density function (pdf) p_x(x) of a continuous random variable x is obtained as the derivative of its cumulative distribution function:

$$ p_x(x_0) = \left. \frac{dF_x(x)}{dx} \right|_{x = x_0} \qquad (2.2) $$

In practice, the cdf is computed from the known pdf by using the inverse relationship

$$ F_x(x_0) = \int_{-\infty}^{x_0} p_x(\xi) \, d\xi \qquad (2.3) $$

For simplicity, the subscript denoting the random variable is often dropped, and the density is written p(x) whenever no confusion is possible.
Example 2.1 The gaussian (or normal) probability distribution is used in numerous models and applications, for example, to describe additive noise. Its density function is given by

$$ p_x(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - m)^2}{2\sigma^2} \right) \qquad (2.4) $$

Here the parameter m (mean) determines the peak point of the symmetric density function, and $\sigma$ (standard deviation) its effective width (flatness or sharpness of the peak). See Figure 2.1 for an illustration.

Generally, the cdf of the gaussian density cannot be evaluated in closed form using (2.3). The term $1/(\sqrt{2\pi}\,\sigma)$ in (2.4) is a normalizing factor which guarantees that the density integrates to unity. The gaussian cdf is usually evaluated numerically, for example, via the tabulated error function

$$ \mathrm{erf}(x) = \frac{1}{\sqrt{2\pi}} \int_0^x \exp\left( -\frac{t^2}{2} \right) dt $$

The error function is closely related to the cdf of a normalized gaussian density, for which the mean m = 0 and the variance $\sigma^2$ = 1. See [353] for details.
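As a quick numerical aside (my own sketch, not part of the text), the gaussian pdf (2.4) and its cdf are easy to evaluate in Python. The cdf is expressed below through the standard library function math.erf, which uses the common convention erf(x) = (2/√π)∫₀ˣ e^(−t²) dt, slightly different from the error function defined above; the function names are my own.

```python
import math

def gaussian_pdf(x, m=0.0, sigma=1.0):
    """Gaussian density (2.4) with mean m and standard deviation sigma."""
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def gaussian_cdf(x, m=0.0, sigma=1.0):
    """Gaussian cdf written in terms of the standard error function math.erf."""
    return 0.5 * (1.0 + math.erf((x - m) / (sigma * math.sqrt(2.0))))

# At the mean, the density peaks and the cdf equals 1/2.
print(gaussian_pdf(0.0), gaussian_cdf(0.0))   # 0.3989..., 0.5
```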
2.1.2 Distribution of a random vector
Assume now that x is an n-dimensional random vector

$$ \mathbf{x} = (x_1, x_2, \ldots, x_n)^T \qquad (2.6) $$

where T denotes the transpose. (We take the transpose because all vectors in this book are column vectors. Note that vectors are denoted by boldface lowercase letters.) The components x_1, x_2, ..., x_n of the column vector x are continuous random variables. The concept of probability distribution generalizes easily to such a random vector. In particular, the cumulative distribution function of x is defined by

$$ F_{\mathbf{x}}(\mathbf{x}_0) = P(\mathbf{x} \le \mathbf{x}_0) \qquad (2.7) $$

where P(.) again denotes the probability of the event in parentheses, and x_0 is some constant value of the random vector x. The notation x ≤ x_0 means that each component of the vector x is less than or equal to the respective component of the vector x_0. The multivariate cdf in Eq. (2.7) has similar properties to that of a single random variable: it is a nondecreasing function of each component, with values lying in the interval $0 \le F_{\mathbf{x}}(\mathbf{x}) \le 1$.
The multivariate probability density function p_x(x) of x is defined as the derivative of the cumulative distribution function F_x(x) with respect to all components of the random vector x:

$$ p_{\mathbf{x}}(\mathbf{x}_0) = \left. \frac{\partial^n F_{\mathbf{x}}(\mathbf{x})}{\partial x_1 \, \partial x_2 \cdots \partial x_n} \right|_{\mathbf{x} = \mathbf{x}_0} \qquad (2.8) $$

Hence

$$ F_{\mathbf{x}}(\mathbf{x}_0) = \int_{-\infty}^{x_{01}} \int_{-\infty}^{x_{02}} \cdots \int_{-\infty}^{x_{0n}} p_{\mathbf{x}}(\mathbf{x}) \, dx_n \cdots dx_2 \, dx_1 \qquad (2.9) $$

where x_{0i} is the ith component of the vector x_0. Clearly,

$$ \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} p_{\mathbf{x}}(\mathbf{x}) \, dx_n \cdots dx_1 = 1 $$
In many cases, random variables have nonzero probability density functions only on certain finite intervals. An illustrative example of such a case is presented below.
Example 2.2 Assume that the probability density function of a two-dimensional random vector z = (x, y)^T is nonzero only in the region 0 < x ≤ 2, 0 < y ≤ 1; outside this region, p_z(z), and consequently also the cdf, is zero. In the region where 0 < x ≤ 2 and 0 < y ≤ 1, the cdf is given by integrating the density over the rectangle (0, x] × (0, y]:

$$ F_{\mathbf{z}}(\mathbf{z}) = \int_0^y \int_0^x p_{\mathbf{z}}(u, v) \, du \, dv $$

In the region where 0 < x ≤ 2 and y > 1, the upper limit in integrating over y becomes equal to 1, and the cdf is obtained by inserting y = 1 into the preceding expression. Similarly, in the region x > 2 and 0 < y ≤ 1, the cdf is obtained by inserting x = 2 into the preceding formula. Finally, if both x > 2 and y > 1, the cdf becomes unity, showing that the probability density p_z(z) has been normalized correctly. Collecting these results yields the piecewise expression for F_z(z) over the four regions.
2.1.3 Joint and marginal distributions
The joint distribution of two different random vectors can be handled in a similar manner. In particular, let y be another random vector having in general a dimension m different from the dimension n of x. The vectors x and y can be concatenated into a "supervector" z^T = (x^T, y^T), and the preceding formulas used directly. The cdf that arises is called the joint distribution function of x and y, and is given by

$$ F_{\mathbf{x},\mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) = P(\mathbf{x} \le \mathbf{x}_0, \; \mathbf{y} \le \mathbf{y}_0) $$

where x_0 and y_0 are some constant vectors. The joint probability density function p_{x,y}(x, y) is again defined as the derivative of the joint distribution function F_{x,y}(x, y) with respect to all components of the random vectors x and y. Hence, the relationship

$$ F_{\mathbf{x},\mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) = \int_{-\infty}^{\mathbf{x}_0} \int_{-\infty}^{\mathbf{y}_0} p_{\mathbf{x},\mathbf{y}}(\boldsymbol{\xi}, \boldsymbol{\eta}) \, d\boldsymbol{\eta} \, d\boldsymbol{\xi} $$

holds. The marginal densities p_x(x) of x and p_y(y) of y are obtained by integrating the joint density p_{x,y}(x, y) over the other random vector:

$$ p_{\mathbf{x}}(\mathbf{x}) = \int_{-\infty}^{\infty} p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) \, d\mathbf{y} $$

$$ p_{\mathbf{y}}(\mathbf{y}) = \int_{-\infty}^{\infty} p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) \, d\mathbf{x} $$
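As a small numerical sketch (my own illustration, not from the text), a marginal density can be approximated by numerically integrating the joint density over the other variable. The joint density below, p(x, y) = x + y on the unit square, is an assumed example chosen only because it integrates to one there; its x-marginal is x + 1/2.

```python
import numpy as np

def joint_pdf(x, y):
    # Assumed example density: p(x, y) = x + y for 0 <= x, y <= 1, zero elsewhere.
    return np.where((0 <= x) & (x <= 1) & (0 <= y) & (y <= 1), x + y, 0.0)

y_grid = np.linspace(0.0, 1.0, 2001)
dy = y_grid[1] - y_grid[0]

def marginal_x(x):
    # p_x(x) ~= sum over the y grid of p(x, y) * dy (simple Riemann approximation).
    return float(np.sum(joint_pdf(x, y_grid)) * dy)

print(marginal_x(0.3))   # close to 0.3 + 0.5 = 0.8
```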
Example 2.3 Consider the joint density given in Example 2.2. The marginal densities of the random variables x and y are obtained by integrating the joint density over the other variable:

$$ p_x(x) = \int_0^1 p_{\mathbf{z}}(x, y) \, dy, \qquad 0 < x \le 2 $$

$$ p_y(y) = \int_0^2 p_{\mathbf{z}}(x, y) \, dx, \qquad 0 < y \le 1 $$

Both marginal densities are zero outside the respective intervals.
2.2 EXPECTATIONS AND MOMENTS
2.2.1 Definition and general properties
In practice, the exact probability density function of a vector- or scalar-valued random variable is usually unknown. However, one can instead use expectations of some functions of that random variable for performing useful analyses and processing. A great advantage of expectations is that they can be estimated directly from the data, even though they are formally defined in terms of the density function.

Let g(x) denote some scalar-, vector-, or matrix-valued quantity derived from the random vector x. Its expectation is denoted by E{g(x)}, and is defined by

$$ \mathrm{E}\{\mathbf{g}(\mathbf{x})\} = \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x}) \, p_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x} $$

Here the integral is computed over all the components of x. The integration operation is applied separately to every component of the vector or element of the matrix, yielding as a result another vector or matrix of the same size. If g(x) = x, we get the expectation E{x} of x; this is discussed in more detail in the next subsection. Expectations have some important fundamental properties.
1. Linearity. Let x_i, i = 1, ..., m, be a set of different random vectors, and a_i, i = 1, ..., m, some nonrandom scalar coefficients. Then

$$ \mathrm{E}\Big\{ \sum_{i=1}^{m} a_i \mathbf{x}_i \Big\} = \sum_{i=1}^{m} a_i \, \mathrm{E}\{\mathbf{x}_i\} $$

2. Linear transformation. Let x be an m-dimensional random vector, and A and B some nonrandom k × m and m × l matrices, respectively. Then

$$ \mathrm{E}\{\mathbf{A}\mathbf{x}\} = \mathbf{A}\,\mathrm{E}\{\mathbf{x}\}, \qquad \mathrm{E}\{\mathbf{x}^T\mathbf{B}\} = \mathrm{E}\{\mathbf{x}^T\}\,\mathbf{B} $$

3. Transformation invariance. Let y = g(x) be a vector-valued function of the random vector x. Then

$$ \int_{-\infty}^{\infty} \mathbf{y} \, p_{\mathbf{y}}(\mathbf{y}) \, d\mathbf{y} = \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x}) \, p_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x} $$

that is, the expectation E{y} = E{g(x)} can be computed using either the density of y or the density of x.
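The linearity property is easy to check by simulation. The following sketch (my own, with arbitrarily chosen coefficients and distributions, not from the text) compares the sample average of a_1 x_1 + a_2 x_2 with a_1 E{x_1} + a_2 E{x_2}.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 200_000                                   # number of samples

# Two independent 3-dimensional random vectors with known means.
x1 = rng.normal(loc=1.0, scale=2.0, size=(K, 3))    # E{x1} = (1, 1, 1)
x2 = rng.uniform(low=0.0, high=4.0, size=(K, 3))    # E{x2} = (2, 2, 2)
a1, a2 = 0.5, -3.0

lhs = np.mean(a1 * x1 + a2 * x2, axis=0)      # sample estimate of E{a1 x1 + a2 x2}
rhs = a1 * np.mean(x1, axis=0) + a2 * np.mean(x2, axis=0)
print(lhs, rhs)                               # both close to (-5.5, -5.5, -5.5)
```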
2.2.2 Mean vector and correlation matrix
Moments of a random vector x are typical expectations used to characterize it. They are obtained when g(x) consists of products of components of x. In particular, the first moment of a random vector x is called the mean vector m_x of x. It is defined as the expectation of x:

$$ \mathbf{m}_{\mathbf{x}} = \mathrm{E}\{\mathbf{x}\} = \int_{-\infty}^{\infty} \mathbf{x} \, p_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x} $$

Each component m_{x_i} of the mean vector is given by

$$ m_{x_i} = \mathrm{E}\{x_i\} = \int_{-\infty}^{\infty} x_i \, p_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x} = \int_{-\infty}^{\infty} x_i \, p_{x_i}(x_i) \, dx_i \qquad (2.20) $$

where the last form follows because integrating the joint density over the other components of x yields the marginal density p_{x_i}(x_i).
Another important set of moments consists of correlations between pairs of components of x. The correlation r_ij between the ith and jth components of x is given by the second moment

$$ r_{ij} = \mathrm{E}\{x_i x_j\} = \int_{-\infty}^{\infty} x_i x_j \, p_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_i x_j \, p_{x_i, x_j}(x_i, x_j) \, dx_j \, dx_i \qquad (2.21) $$

Note that correlation can be negative or positive.

The n × n correlation matrix

$$ \mathbf{R}_{\mathbf{x}} = \mathrm{E}\{\mathbf{x}\mathbf{x}^T\} \qquad (2.22) $$

of x has the correlations r_ij as its elements. It is a symmetric and positive semidefinite matrix; hence its eigenvalues are real and nonnegative, and its eigenvectors can be chosen so that they are mutually orthonormal.
Higher-order moments can be defined analogously, but their discussion is postponed to Section 2.7. Instead, we shall first consider the corresponding central moments, as well as second-order moments for two different random vectors.
2.2.3 Covariances and joint moments
Central moments are defined in a similar fashion to usual moments, but the mean vectors of the random vectors involved are subtracted prior to computing the expectation. Clearly, central moments are only meaningful above the first order. The quantity corresponding to the correlation matrix R_x is called the covariance matrix of x, defined by

$$ \mathbf{C}_{\mathbf{x}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T\} \qquad (2.24) $$

The elements c_ij of the n × n matrix C_x are called covariances, and they are the central moments corresponding to the correlations¹ r_ij defined in Eq. (2.21):

$$ c_{ij} = \mathrm{E}\{(x_i - m_{x_i})(x_j - m_{x_j})\} $$

¹In statistics, normalized versions of the covariances (correlation coefficients) are often used instead, and the matrix consisting of them is called the correlation matrix. In this book, the correlation matrix is defined by the formula (2.22), which is a common practice in signal processing, neural networks, and engineering.
The covariance matrix C_x satisfies the same properties as the correlation matrix R_x. Using the properties of the expectation operator, it is easy to see that

$$ \mathbf{R}_{\mathbf{x}} = \mathbf{C}_{\mathbf{x}} + \mathbf{m}_{\mathbf{x}}\mathbf{m}_{\mathbf{x}}^T $$

If the mean vector m_x = 0, the correlation and covariance matrices become the same. If necessary, the data can easily be made zero mean by subtracting the (estimated) mean vector from the data vectors as a preprocessing step. This is the usual practice in independent component analysis, and thus in later chapters we simply denote by C_x the correlation/covariance matrix, often even dropping the subscript x for simplicity.
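As a numerical aside (my own sketch, not part of the text), the relation R_x = C_x + m_x m_x^T can be verified directly from sample estimates; NumPy's cov routine is used with bias=True so that it matches a 1/K-normalized estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 100_000
# Samples of a 3-dimensional random vector with nonzero mean (rows = samples).
X = rng.normal(size=(K, 3)) @ np.array([[2.0, 0.0, 0.0],
                                        [1.0, 1.0, 0.0],
                                        [0.0, 0.5, 0.3]]) + np.array([1.0, -2.0, 0.5])

m = X.mean(axis=0)                       # estimate of the mean vector m_x
R = (X.T @ X) / K                        # estimate of the correlation matrix R_x = E{x x^T}
C = np.cov(X, rowvar=False, bias=True)   # estimate of the covariance matrix C_x (1/K normalization)

print(np.allclose(R, C + np.outer(m, m)))   # True: R_x = C_x + m_x m_x^T
```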
For a single random variable x, the mean vector reduces to its mean value m_x = E{x}, the correlation matrix to the second moment E{x²}, and the covariance matrix to the variance of x:

$$ \sigma_x^2 = \mathrm{E}\{(x - m_x)^2\} $$
Joint moments of two random vectors x and y are defined in a similar way in terms of their joint density. In particular, the expectation of a quantity g(x, y) is

$$ \mathrm{E}\{\mathbf{g}(\mathbf{x}, \mathbf{y})\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x}, \mathbf{y}) \, p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) \, d\mathbf{y} \, d\mathbf{x} \qquad (2.28) $$

The integrals are computed over all the components of x and y.
Of the joint expectations, the most widely used are the cross-correlation matrix

$$ \mathbf{R}_{\mathbf{x}\mathbf{y}} = \mathrm{E}\{\mathbf{x}\mathbf{y}^T\} $$

and the cross-covariance matrix

$$ \mathbf{C}_{\mathbf{x}\mathbf{y}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{y} - \mathbf{m}_{\mathbf{y}})^T\} $$

Note that the dimensions of the vectors x and y can be different. Hence, the cross-correlation and cross-covariance matrices are not necessarily square matrices, and they are not symmetric in general. However, from their definitions it follows easily that R_xy = R_yx^T and C_xy = C_yx^T.

Fig. 2.2 An example of negative covariance between the random variables x and y.

In the example of Fig. 2.2, the random variables x and y have a clear negative covariance (or correlation). A positive value of x mostly implies that y is negative, and vice versa. On the other hand, in the case of Fig. 2.3, it is not possible to infer anything about the value of y by observing x. Hence, their covariance is approximately zero.

In practice, the expectations cannot be computed exactly, because the probability densities involved are unknown. Instead, they are estimated by averaging over a set of K samples x_1, x_2, ..., x_K of the random vector x:

$$ \widehat{\mathrm{E}}\{\mathbf{g}(\mathbf{x})\} = \frac{1}{K} \sum_{j=1}^{K} \mathbf{g}(\mathbf{x}_j) \qquad (2.33) $$
For example, applying (2.33), we get for the mean vector m_x of x its standard estimator, the sample mean

$$ \widehat{\mathbf{m}}_{\mathbf{x}} = \frac{1}{K} \sum_{j=1}^{K} \mathbf{x}_j $$

where the hat over m is a standard notation for an estimator of a quantity. Similarly, if instead of the joint density p_{x,y}(x, y) of the random vectors x and y we know K sample pairs (x_1, y_1), (x_2, y_2), ..., (x_K, y_K), the cross-correlation matrix R_xy can be estimated using the respective sample average

$$ \widehat{\mathbf{R}}_{\mathbf{x}\mathbf{y}} = \frac{1}{K} \sum_{j=1}^{K} \mathbf{x}_j \mathbf{y}_j^T $$
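The following short sketch (mine, with made-up data) computes these sample estimators directly from data matrices whose rows are the observed samples.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 50_000
X = rng.normal(size=(K, 4))                        # K samples of a 4-dimensional vector x
Y = X[:, :2] + 0.1 * rng.normal(size=(K, 2))       # K samples of a related 2-dimensional vector y

m_hat = X.mean(axis=0)                             # sample mean, estimate of m_x
R_xy_hat = (X.T @ Y) / K                           # sample cross-correlation matrix, shape (4, 2)

print(m_hat.shape, R_xy_hat.shape)                 # (4,) (4, 2)
```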
2.3 UNCORRELATEDNESS AND INDEPENDENCE
2.3.1 Uncorrelatedness and whiteness
Two random vectors x and y are uncorrelated if their cross-covariance matrix C_xy is a zero matrix:

$$ \mathbf{C}_{\mathbf{x}\mathbf{y}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{y} - \mathbf{m}_{\mathbf{y}})^T\} = \mathbf{0} \qquad (2.37) $$

In the special case of two different scalar random variables x and y (for example, two components of a random vector z), x and y are uncorrelated if their covariance is zero,

$$ \mathrm{E}\{(x - m_x)(y - m_y)\} = 0 $$

which is equivalent to

$$ \mathrm{E}\{xy\} = \mathrm{E}\{x\}\,\mathrm{E}\{y\} \qquad (2.40) $$

Another important special case concerns the correlations between the components of a single random vector x, given by the covariance matrix C_x defined in (2.24). In this case a condition equivalent to (2.37) can never be met, because each component of x is perfectly correlated with itself. The best that we can achieve is that different components of x are mutually uncorrelated, leading to the uncorrelatedness condition

$$ \mathbf{C}_{\mathbf{x}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T\} = \mathbf{D} $$

where D is a diagonal matrix containing the variances of the components of x on its main diagonal. In particular, random vectors having zero mean and unit covariance (and hence correlation) matrix, possibly multiplied by a constant variance $\sigma^2$, are called white:

$$ \mathbf{C}_{\mathbf{x}} = \mathbf{R}_{\mathbf{x}} = \sigma^2 \mathbf{I} $$

where I is the n × n identity matrix.
Assume now that an orthogonal transformation defined by an n × n matrix T is applied to the random vector x. Mathematically, this can be expressed as

$$ \mathbf{y} = \mathbf{T}\mathbf{x}, \qquad \text{where } \mathbf{T}\mathbf{T}^T = \mathbf{T}^T\mathbf{T} = \mathbf{I} $$

An orthogonal matrix T defines a rotation (change of coordinate axes) in the n-dimensional space, preserving norms and distances. Assuming that x is white, we get

$$ \mathbf{C}_{\mathbf{y}} = \mathrm{E}\{\mathbf{y}\mathbf{y}^T\} = \mathrm{E}\{\mathbf{T}\mathbf{x}\mathbf{x}^T\mathbf{T}^T\} = \mathbf{T}\,\mathrm{E}\{\mathbf{x}\mathbf{x}^T\}\,\mathbf{T}^T = \sigma^2\,\mathbf{T}\mathbf{T}^T = \sigma^2\mathbf{I} $$

showing that y is white, too. Hence we can conclude that the whiteness property is preserved under orthogonal transformations. In fact, whitening of the original data can be done in infinitely many ways. Whitening will be discussed in more detail in Chapter 6, because it is a highly useful and widely used preprocessing step in independent component analysis.

It is clear that there also exist infinitely many ways to decorrelate the original data, because whiteness is a special case of the uncorrelatedness property.
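A small numerical sketch of these ideas (my own, not from the text): the data are first whitened using an eigendecomposition of the estimated covariance matrix, and an arbitrary orthogonal transformation of the whitened data is then seen to be white as well.

```python
import numpy as np

rng = np.random.default_rng(3)
K, n = 100_000, 3
X = rng.normal(size=(K, n)) @ rng.normal(size=(n, n))    # correlated zero-mean data, rows = samples

# Whitening: C_x = E D E^T, whitened data z = D^{-1/2} E^T x (one of infinitely many choices).
C = np.cov(X, rowvar=False, bias=True)
d, E = np.linalg.eigh(C)                     # eigenvalues d, orthonormal eigenvectors E
V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T      # symmetric whitening matrix
Z = X @ V.T
print(np.round(np.cov(Z, rowvar=False, bias=True), 2))    # ~ identity matrix

# Any orthogonal matrix T keeps the data white.
T, _ = np.linalg.qr(rng.normal(size=(n, n)))              # random orthogonal matrix
Y = Z @ T.T
print(np.round(np.cov(Y, rowvar=False, bias=True), 2))    # ~ identity matrix again
```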
Example 2.5 Consider the linear signal model

$$ \mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{n} \qquad (2.47) $$

where x is an n-dimensional random or data vector, A an n × m constant matrix, s an m-dimensional random signal vector, and n an n-dimensional random vector that usually describes additive noise. The correlation matrix of x then becomes

$$ \mathbf{R}_{\mathbf{x}} = \mathrm{E}\{\mathbf{x}\mathbf{x}^T\} = \mathbf{A}\mathbf{R}_{\mathbf{s}}\mathbf{A}^T + \mathbf{A}\mathbf{R}_{\mathbf{s}\mathbf{n}} + \mathbf{R}_{\mathbf{n}\mathbf{s}}\mathbf{A}^T + \mathbf{R}_{\mathbf{n}} $$

Usually the signal vector s and the noise vector n can be assumed to be mutually uncorrelated, and the noise to have zero mean, so that

$$ \mathbf{R}_{\mathbf{s}\mathbf{n}} = \mathrm{E}\{\mathbf{s}\mathbf{n}^T\} = \mathrm{E}\{\mathbf{s}\}\,\mathrm{E}\{\mathbf{n}^T\} = \mathbf{0} \qquad (2.49) $$

Similarly, R_ns = 0, and the correlation matrix of x simplifies to

$$ \mathbf{R}_{\mathbf{x}} = \mathbf{A}\mathbf{R}_{\mathbf{s}}\mathbf{A}^T + \mathbf{R}_{\mathbf{n}} \qquad (2.51) $$

Sometimes, for example in a noisy version of the ICA model (Chapter 15), the components of the signal vector s are also mutually uncorrelated, so that the signal correlation matrix becomes the diagonal matrix

$$ \mathbf{D}_{\mathbf{s}} = \mathrm{diag}\big(\mathrm{E}\{s_1^2\}, \mathrm{E}\{s_2^2\}, \ldots, \mathrm{E}\{s_m^2\}\big) $$

and the correlation matrix of x can be written in the form

$$ \mathbf{R}_{\mathbf{x}} = \mathbf{A}\mathbf{D}_{\mathbf{s}}\mathbf{A}^T + \mathbf{R}_{\mathbf{n}} = \sum_{i=1}^{m} \mathrm{E}\{s_i^2\}\,\mathbf{a}_i\mathbf{a}_i^T + \mathbf{R}_{\mathbf{n}} $$

where a_i is the ith column vector of the matrix A.

The noisy linear signal or data model (2.47) is encountered frequently in signal processing and other areas, and the assumptions made on s and n vary depending on the problem at hand. It is straightforward to see that the results derived in this example hold for the respective covariance matrices as well.
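As an illustration (my own sketch with arbitrarily chosen dimensions, not from the text), the structure R_x = A D_s A^T + R_n can be checked numerically for uncorrelated, zero-mean sources and noise.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, K = 4, 2, 200_000

A = rng.normal(size=(n, m))                       # constant mixing matrix
s_var = np.array([3.0, 0.5])                      # variances of the uncorrelated sources
S = rng.normal(size=(K, m)) * np.sqrt(s_var)      # zero-mean uncorrelated sources
N = 0.2 * rng.normal(size=(K, n))                 # zero-mean noise, R_n = 0.04 I

X = S @ A.T + N                                   # noisy linear model x = A s + n
R_x = (X.T @ X) / K                               # sample correlation matrix of x

R_theory = A @ np.diag(s_var) @ A.T + 0.04 * np.eye(n)
print(np.max(np.abs(R_x - R_theory)))             # small sampling error
```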
2.3.2 Statistical independence
A key concept that constitutes the foundation of independent component analysis is
statistical independence. For simplicity, consider first the case of two different scalar random variables x and y. The random variable x is independent of y if knowing the value of y does not give any information on the value of x. For example, x and y can be outcomes of two events that have nothing to do with each other, or random signals originating from two quite different physical processes that are in no way related to each other. Examples of such independent random variables are the value of a die thrown and the outcome of a coin tossed, or a speech signal and background noise originating from a ventilation system at a certain time instant.
Mathematically, statistical independence is defined in terms of probability densities. The random variables x and y are said to be independent if and only if

$$ p_{x,y}(x, y) = p_x(x)\,p_y(y) \qquad (2.54) $$

In words, the joint density p_{x,y}(x, y) of x and y must factorize into the product of their marginal densities p_x(x) and p_y(y). Equivalently, independence could be defined by replacing the probability density functions in the definition (2.54) by the respective cumulative distribution functions, which must also be factorizable.

Independent random variables satisfy the basic property

$$ \mathrm{E}\{g(x)h(y)\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x)h(y)\,p_{x,y}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} g(x)p_x(x)\,dx \int_{-\infty}^{\infty} h(y)p_y(y)\,dy = \mathrm{E}\{g(x)\}\,\mathrm{E}\{h(y)\} \qquad (2.55) $$

where g(x) and h(y) are any functions of x and y for which the above expectations exist.

Equation (2.55) reveals that statistical independence is a much stronger property than uncorrelatedness. Equation (2.40), defining uncorrelatedness, is obtained from the independence property (2.55) as a special case where both g(x) and h(y) are linear functions, and thus takes into account second-order statistics (correlations or covariances) only. However, if the random variables have gaussian distributions, independence and uncorrelatedness become the same thing. This very special property of gaussian distributions will be discussed in more detail in Section 2.5.
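The distinction can be demonstrated numerically (my own sketch, not from the text): with x standard gaussian and y = x², the covariance of x and y is close to zero, yet choosing the nonlinear functions g(x) = x² and h(y) = y exposes the dependence.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=1_000_000)
y = x ** 2                       # y is completely determined by x, hence not independent of it

# Uncorrelated: the covariance is (close to) zero ...
print(np.mean(x * y) - np.mean(x) * np.mean(y))        # ~ 0

# ... but property (2.55) fails for the nonlinear choice g(x) = x**2, h(y) = y.
print(np.mean(x**2 * y), np.mean(x**2) * np.mean(y))   # ~ 3.0 versus ~ 1.0
```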
Definition (2.54) of independence generalizes in a natural way to more than two random variables, and to random vectors. Let x, y, z, ... be random vectors, which may in general have different dimensions. The independence condition for x, y, z, ... is then

$$ p_{\mathbf{x},\mathbf{y},\mathbf{z},\ldots}(\mathbf{x}, \mathbf{y}, \mathbf{z}, \ldots) = p_{\mathbf{x}}(\mathbf{x})\,p_{\mathbf{y}}(\mathbf{y})\,p_{\mathbf{z}}(\mathbf{z})\cdots \qquad (2.57) $$

The general definition (2.57) gives rise to a generalization of the standard notion of statistical independence. The components of the random vector x are themselves scalar random variables, and the same holds for y and z. Clearly, the components of x can be mutually dependent, while they are independent with respect to the components of the other random vectors y and z, and (2.57) still holds. A similar argument applies to the random vectors y and z.
Example 2.6 First consider the random variables x and y discussed in Examples 2.2 and 2.3. Comparing their joint density with the product of the marginal densities obtained in Example 2.3 shows directly whether the factorization condition (2.54) holds, and hence whether x and y are independent. A further illustration, involving a random vector x and a one-dimensional random vector y, is given in [419].
2.4 CONDITIONAL DENSITIES AND BAYES’ RULE
Thus far, we have dealt with the usual probability densities, joint densities, and marginal densities. Still one class of probability density functions consists of conditional densities. They are especially important in estimation theory, which will be studied in Chapter 4. Conditional densities arise when answering the following question: "What is the probability density of a random vector x given that another random vector y has the fixed value y_0?" Provided that the densities involved exist, the conditional probability density of x given y is defined as

$$ p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y}) = \frac{p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})}{p_{\mathbf{y}}(\mathbf{y})} \qquad (2.59) $$

Loosely speaking, (2.59) describes the probability that x lies in a small region around a point x_0, given that y lies in a small region around y_0; here x_0 and y_0 are some constant vectors, and both regions are small. Similarly,

$$ p_{\mathbf{y}|\mathbf{x}}(\mathbf{y} \,|\, \mathbf{x}) = \frac{p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})}{p_{\mathbf{x}}(\mathbf{x})} \qquad (2.60) $$
In conditional densities, the conditioning quantity, y in (2.59) and x in (2.60), is thought of as a nonrandom parameter vector, even though it is actually a random vector itself.
Example 2.7 Consider the two-dimensional joint density p_{x,y}(x, y) depicted in Fig. 2.4. For a given constant value x_0, the conditional density of y is

$$ p_{y|x}(y \,|\, x_0) = \frac{p_{x,y}(x_0, y)}{p_x(x_0)} $$

Hence, it is a one-dimensional distribution obtained by "slicing" the joint distribution p_{x,y}(x, y) parallel to the y-axis at the point x = x_0. Note that the denominator p_x(x_0) is merely a constant that scales the slice so that it integrates to unity. Similarly, the conditional density p_{x|y}(x | y_0) can be obtained geometrically by slicing the joint distribution of Fig. 2.4 parallel to the x-axis at the point y = y_0. The resulting conditional distributions are shown in Fig. 2.5 for a fixed value x_0 of x, and in Fig. 2.6 for a fixed value y_0 of y.

The marginal densities p_x(x) and p_y(y) appearing in the denominators of (2.59) and (2.60) are obtained by integrating the joint density p_{x,y}(x, y) over the unconditional random vector. This also shows immediately that the conditional densities are true probability densities satisfying

$$ \int_{-\infty}^{\infty} p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y}) \, d\mathbf{x} = 1, \qquad \int_{-\infty}^{\infty} p_{\mathbf{y}|\mathbf{x}}(\mathbf{y} \,|\, \mathbf{x}) \, d\mathbf{y} = 1 $$
Fig. 2.4 A two-dimensional joint density of the random variables x and y.

If the random vectors x and y are statistically independent, the conditional density p_{x|y}(x | y) equals the unconditional density p_x(x) of x, since x does not depend in any way on y; similarly, p_{y|x}(y | x) = p_y(y). Both Eqs. (2.59) and (2.60) can then be written in the form

$$ p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) = p_{\mathbf{x}}(\mathbf{x})\,p_{\mathbf{y}}(\mathbf{y}) $$

which is exactly the definition of independence of the random vectors x and y.
In the general case, we get from Eqs. (2.59) and (2.60) two different expressions for the joint density of x and y:

$$ p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) = p_{\mathbf{y}|\mathbf{x}}(\mathbf{y} \,|\, \mathbf{x})\,p_{\mathbf{x}}(\mathbf{x}) = p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y})\,p_{\mathbf{y}}(\mathbf{y}) $$

Equating these two expressions and solving for p_{y|x}(y | x) yields

$$ p_{\mathbf{y}|\mathbf{x}}(\mathbf{y} \,|\, \mathbf{x}) = \frac{p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y})\,p_{\mathbf{y}}(\mathbf{y})}{p_{\mathbf{x}}(\mathbf{x})} \qquad (2.64) $$

where the denominator can be computed by integrating the numerator if necessary:

$$ p_{\mathbf{x}}(\mathbf{x}) = \int_{-\infty}^{\infty} p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y})\,p_{\mathbf{y}}(\mathbf{y})\,d\mathbf{y} \qquad (2.65) $$
Fig. 2.6 The conditional probability density p_{x|y}.
Formula (2.64), together with (2.65), is called Bayes' rule. This rule is important especially in statistical estimation theory. There, p_{x|y}(x | y) is typically the conditional density of the measurement vector x, with y denoting the vector of unknown random parameters. Bayes' rule (2.64) allows the computation of the posterior density p_{y|x}(y | x) of the unknown parameters y after observing the data x, assuming or knowing the prior distribution p_y(y) of the random parameters y. These matters will be discussed in more detail in Chapter 4.
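A toy numerical illustration of Bayes' rule (my own, with an arbitrarily chosen gaussian prior and likelihood, not from the text): the posterior over a scalar parameter y is computed on a grid from the likelihood p(x | y) and the prior p(y), with the denominator (2.65) obtained by numerical integration.

```python
import numpy as np

def gauss(t, m, s):
    # One-dimensional gaussian density, cf. (2.4).
    return np.exp(-(t - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

y_grid = np.linspace(-5.0, 5.0, 2001)
dy = y_grid[1] - y_grid[0]

prior = gauss(y_grid, 0.0, 2.0)            # assumed prior p_y(y)
x_obs = 1.3                                # one observed measurement
likelihood = gauss(x_obs, y_grid, 1.0)     # p_{x|y}(x_obs | y): x = y + unit-variance gaussian noise

evidence = np.sum(likelihood * prior) * dy           # p_x(x_obs), Eq. (2.65) on the grid
posterior = likelihood * prior / evidence            # Bayes' rule (2.64)

print(np.sum(posterior) * dy)                        # ~ 1: the posterior is a proper density
print(y_grid[np.argmax(posterior)])                  # posterior mode, ~ 1.04 for these choices
```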
Conditional expectations are defined similarly to the expectations defined earlier, but the pdf appearing in the integral is now the appropriate conditional density. Hence, for example,

$$ \mathrm{E}\{\mathbf{x} \,|\, \mathbf{y}\} = \int_{-\infty}^{\infty} \mathbf{x}\,p_{\mathbf{x}|\mathbf{y}}(\mathbf{x} \,|\, \mathbf{y})\,d\mathbf{x} $$

which is still a function of the random vector y. Taking the expectation of such a conditional expectation with respect to y then gives the joint expectation,

$$ \mathrm{E}\{\mathbf{g}(\mathbf{x}, \mathbf{y})\} = \mathrm{E}_{\mathbf{y}}\{\,\mathrm{E}\{\mathbf{g}(\mathbf{x}, \mathbf{y}) \,|\, \mathbf{y}\}\,\} $$

Actually, this is just an alternative two-stage procedure for computing the expectation (2.28), following easily from Bayes' rule.
2.5 THE MULTIVARIATE GAUSSIAN DENSITY
The multivariate gaussian or normal density has several special properties that make it unique among probability density functions. Due to its importance, we shall discuss it more thoroughly in this section.
Consider an n-dimensional random vector x. It is said to be gaussian if the probability density function of x has the form

$$ p_{\mathbf{x}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}(\det \mathbf{C}_{\mathbf{x}})^{1/2}} \exp\left( -\frac{1}{2}(\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T \mathbf{C}_{\mathbf{x}}^{-1} (\mathbf{x} - \mathbf{m}_{\mathbf{x}}) \right) \qquad (2.68) $$

Here m_x is an n-dimensional vector and C_x an n × n matrix; it will be seen shortly that they are in fact the mean vector and the covariance matrix of x. The notation det A is used for the determinant of a matrix A, in this case C_x. It is easy to see that for a single random variable x (n = 1), the density (2.68) reduces to the one-dimensional gaussian pdf (2.4) discussed briefly in Example 2.1. Note also that the covariance matrix C_x is assumed strictly positive definite, which also implies that its inverse exists.
It can be shown that for the density (2.68)

$$ \mathrm{E}\{\mathbf{x}\} = \mathbf{m}_{\mathbf{x}}, \qquad \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T\} = \mathbf{C}_{\mathbf{x}} $$

Hence calling m_x the mean vector and C_x the covariance matrix of the multivariate gaussian density is justified.
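The density (2.68) is straightforward to evaluate numerically; the following sketch (my own, with an assumed mean and covariance) uses NumPy's determinant and matrix inverse routines.

```python
import numpy as np

def multivariate_gaussian_pdf(x, m, C):
    """Evaluate the density (2.68) at a point x, for mean vector m and covariance matrix C."""
    n = len(m)
    d = x - m
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(C)))
    return norm_const * np.exp(-0.5 * d @ np.linalg.inv(C) @ d)

m = np.array([1.0, -1.0])
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(multivariate_gaussian_pdf(np.array([1.0, -1.0]), m, C))   # density value at the mean
```

In practice one would typically use a library routine such as scipy.stats.multivariate_normal, and work with a Cholesky factorization rather than an explicit inverse for numerical stability.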
2.5.1 Properties of the gaussian density
In the following, we list the most important properties of the multivariate gaussian density, omitting proofs. The proofs can be found in many books; see, for example, [353, 419, 407].
Only first- and second-order statistics are needed. Knowledge of the mean vector m_x and the covariance matrix C_x of x is sufficient for defining the multivariate gaussian density (2.68) completely. Therefore, all the higher-order moments must also depend only on m_x and C_x. This implies that these moments do not carry any novel information about the gaussian distribution. An important consequence of this fact and the form of the gaussian pdf is that linear processing methods based on first- and second-order statistical information are usually optimal for gaussian data. For example, independent component analysis does not bring out anything new compared with standard principal component analysis (to be discussed later) for gaussian data. Similarly, linear time-invariant discrete-time filters used in classic statistical signal processing are optimal for filtering gaussian data.
Linear transformations are gaussian. If x is a gaussian random vector and y = Ax its linear transformation, then y is also gaussian, with mean vector m_y = A m_x and covariance matrix C_y = A C_x A^T. A special case of this result says that any linear combination of gaussian random variables is itself gaussian. This result again has implications for standard independent component analysis: it is impossible to estimate the ICA model for gaussian data, that is, one cannot blindly separate gaussian sources from their mixtures without extra knowledge of the sources, as will be discussed in Chapter 7.
Marginal and conditional densities are gaussian. Consider now two random vectors x and y having dimensions n and m, respectively. Let us collect them in a single random vector z^T = (x^T, y^T). If z is gaussian, then the marginal densities p_x(x) and p_y(y) of the joint gaussian density p_z(z) are gaussian. Also, the conditional densities p_{x|y} and p_{y|x} are n- and m-dimensional gaussian densities, respectively. The mean and covariance matrix of the conditional density can be expressed in terms of the means, covariances, and cross-covariances of x and y; for p_{x|y}(x | y) they are given by the standard formulas

$$ \mathbf{m}_{\mathbf{x}|\mathbf{y}} = \mathbf{m}_{\mathbf{x}} + \mathbf{C}_{\mathbf{x}\mathbf{y}}\mathbf{C}_{\mathbf{y}}^{-1}(\mathbf{y} - \mathbf{m}_{\mathbf{y}}), \qquad \mathbf{C}_{\mathbf{x}|\mathbf{y}} = \mathbf{C}_{\mathbf{x}} - \mathbf{C}_{\mathbf{x}\mathbf{y}}\mathbf{C}_{\mathbf{y}}^{-1}\mathbf{C}_{\mathbf{y}\mathbf{x}} $$
Uncorrelatedness and geometrical structure. We mentioned earlier that uncorrelated gaussian random variables are also independent, a property which is not shared by other distributions in general. Derivation of this important result is left to the reader as an exercise. If the covariance matrix C_x of the multivariate gaussian density (2.68) is not diagonal, the components of x are correlated. Since C_x is a symmetric and positive definite matrix, it can always be represented in the form

$$ \mathbf{C}_{\mathbf{x}} = \mathbf{E}\mathbf{D}\mathbf{E}^T = \sum_{i=1}^{n} \lambda_i \mathbf{e}_i \mathbf{e}_i^T $$

Here E is an orthogonal matrix (that is, a rotation) having as its columns e_1, e_2, ..., e_n, the n eigenvectors of C_x, and D = diag(λ_1, λ_2, ..., λ_n) is the diagonal matrix containing the respective eigenvalues λ_i of C_x. Now it can readily be verified that applying the rotation

$$ \mathbf{u} = \mathbf{E}^T \mathbf{x} $$

to x makes the components of the gaussian distribution of u uncorrelated, and hence also independent.
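A short numerical check of this decomposition (my own sketch, not from the text): the estimated covariance matrix is factored with numpy.linalg.eigh, and the rotated variable u = E^T x is seen to have the diagonal covariance matrix D.

```python
import numpy as np

rng = np.random.default_rng(6)
K = 200_000
C_true = np.array([[4.0, 1.5],
                   [1.5, 1.0]])
L = np.linalg.cholesky(C_true)
X = rng.normal(size=(K, 2)) @ L.T            # zero-mean gaussian samples with covariance C_true

C = np.cov(X, rowvar=False, bias=True)       # estimated covariance matrix C_x
lam, E = np.linalg.eigh(C)                   # eigenvalues lambda_i and orthonormal eigenvectors e_i

print(np.allclose(C, E @ np.diag(lam) @ E.T))             # True: C_x = E D E^T
U = X @ E                                                  # rows of U are the rotated vectors u = E^T x
print(np.round(np.cov(U, rowvar=False, bias=True), 3))     # ~ diag(lambda_1, lambda_2)
```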
Moreover, the eigenvalues λ_i and eigenvectors e_i of the covariance matrix C_x reveal the geometrical structure of the multivariate gaussian distribution. The contours of any pdf are defined by curves of constant values of the density, given by the equation p_x(x) = constant. For the multivariate gaussian density, this is equivalent to requiring that the exponent is a constant c:

$$ (\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T \mathbf{C}_{\mathbf{x}}^{-1} (\mathbf{x} - \mathbf{m}_{\mathbf{x}}) = c $$

This equation defines an ellipsoid in the n-dimensional space. The principal axes of the ellipsoid point in the directions of the eigenvectors e_i of C_x, and their lengths are proportional to the square roots of the respective eigenvalues λ_i.

Fig. 2.7 Illustration of a multivariate gaussian probability density.
2.5.2 Central limit theorem
Still another argument underlining the significance of the gaussian distribution is provided by the central limit theorem. Let

$$ x_k = \sum_{i=1}^{k} z_i $$

be a partial sum of a sequence {z_i} of independent and identically distributed random variables z_i. Since the mean and variance of x_k can grow without bound as k → ∞, consider instead of x_k the standardized variables