Principal Component Analysis and Whitening
Principal component analysis (PCA) and the closely related Karhunen-Loève transform, or the Hotelling transform, are classic techniques in statistical data analysis, feature extraction, and data compression, stemming from the early work of Pearson [364]. Given a set of multivariate measurements, the purpose is to find a smaller set of variables with less redundancy, that would give as good a representation as possible. This goal is related to the goal of independent component analysis (ICA). However, in PCA the redundancy is measured by correlations between data elements, while in ICA the much richer concept of independence is used, and in ICA the reduction of the number of variables is given less emphasis. Using only the correlations as in PCA has the advantage that the analysis can be based on second-order statistics only. In connection with ICA, PCA is a useful preprocessing step.
The basic PCA problem is outlined in this chapter. Both the closed-form solution and on-line learning algorithms for PCA are reviewed. Next, the related linear statistical technique of factor analysis is discussed. The chapter is concluded by presenting how data can be preprocessed by whitening, removing the effect of first- and second-order statistics, which is very helpful as the first step in ICA.
6.1 PRINCIPAL COMPONENTS
The starting point for PCA is a random vector x with n elements. There is available a sample x(1), ..., x(T) from this random vector. No explicit assumptions on the probability density of the vectors are made in PCA, as long as the first- and second-order statistics are known or can be estimated from the sample. Also, no generative
model is assumed for vector x. Typically the elements of x are measurements like pixel gray levels or values of a signal at different time instants. It is essential in PCA that the elements are mutually correlated, and there is thus some redundancy in x, making compression possible. If the elements are independent, nothing can be achieved by PCA.
In the PCA transform, the vector x is first centered by subtracting its mean:

$$\mathbf{x} \leftarrow \mathbf{x} - E\{\mathbf{x}\}$$
The mean is in practice estimated from the available sample x(1), ..., x(T) (see Chapter 4). Let us assume in the following that the centering has been done and thus E{x} = 0. Next, x is linearly transformed to another vector y with m elements, m < n, so that the redundancy induced by the correlations is removed. This is done by finding a rotated orthogonal coordinate system such that the elements of x in the new coordinates become uncorrelated. At the same time, the variances of the projections of x on the new coordinate axes are maximized, so that the first axis corresponds to the maximal variance, the second axis corresponds to the maximal variance in the direction orthogonal to the first axis, and so on.
For instance, if x has a gaussian density that is constant over ellipsoidal surfaces in the n-dimensional space, then the rotated coordinate system coincides with the principal axes of the ellipsoid. A two-dimensional example is shown in Fig. 2.7 in Chapter 2. The principal components are now the projections of the data points on the two principal axes, e_1 and e_2. In addition to achieving uncorrelated components, the variances of the components (projections) also will be very different in most applications, with a considerable number of the variances so small that the corresponding components can be discarded altogether. Those components that are left constitute the vector y.
As an example, take a set of 8 × 8 pixel windows from a digital image, an application that is considered in detail in Chapter 21. They are first transformed, e.g., using row-by-row scanning, into vectors x whose elements are the gray levels of the 64 pixels in the window. In real-time digital video transmission, it is essential to reduce this data as much as possible without losing too much of the visual quality, because the total amount of data is very large. Using PCA, a compressed representation vector y can be obtained from x, which can be stored or transmitted. Typically, y can have as few as 10 elements, and a good replica of the original 8 × 8 image window can still be reconstructed from it. This kind of compression is possible because neighboring elements of x, which are the gray levels of neighboring pixels in the digital image, are heavily correlated. These correlations are utilized by PCA, allowing almost the same information to be represented by a much smaller vector y. PCA is a linear technique, so computing y from x is not computationally heavy, which makes real-time processing possible.
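The following sketch illustrates this idea numerically. It is not taken from the book: the "image", the patch extraction, and the choice of 10 components are placeholder assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "image": a smooth random surface, so neighboring pixels are correlated.
img = np.cumsum(np.cumsum(rng.standard_normal((256, 256)), axis=0), axis=1)

# Scan 8x8 windows row by row into 64-dimensional vectors x.
patches = np.array([img[i:i + 8, j:j + 8].ravel()
                    for i in range(0, 256, 8)
                    for j in range(0, 256, 8)])

# Center the data and estimate the covariance matrix C_x.
mean = patches.mean(axis=0)
X = patches - mean
C = X.T @ X / len(X)

# Eigenvectors e_1, ..., e_64 ordered by decreasing eigenvalue.
d, E = np.linalg.eigh(C)
E = E[:, np.argsort(d)[::-1]]

# Compress each window to m = 10 numbers y = E_m^T x and reconstruct.
m = 10
Y = X @ E[:, :m]                 # compressed representation y (one row per window)
X_hat = Y @ E[:, :m].T + mean    # reconstruction of the 8x8 window

print("mean-square error per window:",
      np.mean(np.sum((patches - X_hat) ** 2, axis=1)))
```

Because the gray levels of neighboring pixels are strongly correlated, the reconstruction error with only 10 of the 64 components is typically small.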
6.1.1 PCA by variance maximization
In mathematical terms, consider a linear combination

$$y_1 = \sum_{k=1}^{n} w_{k1} x_k = \mathbf{w}_1^T \mathbf{x}$$

of the elements x_1, ..., x_n of the vector x. The w_11, ..., w_n1 are scalar coefficients or weights, elements of an n-dimensional vector w_1, and w_1^T denotes the transpose of w_1.
The factor y_1 is called the first principal component of x, if the variance of y_1 is maximally large. Because the variance depends on both the norm and orientation of the weight vector w_1 and grows without limits as the norm grows, we impose the constraint that the norm of w_1 is constant, in practice equal to 1. Thus we look for a weight vector w_1 maximizing the PCA criterion

$$J_1^{\mathrm{PCA}}(\mathbf{w}_1) = E\{y_1^2\} = E\{(\mathbf{w}_1^T \mathbf{x})^2\} = \mathbf{w}_1^T E\{\mathbf{x}\mathbf{x}^T\}\mathbf{w}_1 = \mathbf{w}_1^T \mathbf{C}_x \mathbf{w}_1 \qquad (6.1)$$

so that ||w_1|| = 1.
There E{·} is the expectation over the (unknown) density of the input vector x, and the norm of w_1 is the usual Euclidean norm defined as

$$\|\mathbf{w}_1\| = (\mathbf{w}_1^T \mathbf{w}_1)^{1/2} = \left[\sum_{k=1}^{n} w_{k1}^2\right]^{1/2}$$
The matrix C_x in Eq. (6.1) is the n × n covariance matrix of x (see Chapter 4), given for the zero-mean vector x by the correlation matrix

$$\mathbf{C}_x = E\{\mathbf{x}\mathbf{x}^T\}$$
It is well known from basic linear algebra (see, e.g., [324, 112]) that the solution to the PCA problem is given in terms of the unit-length eigenvectors e_1, ..., e_n of the matrix C_x. The ordering of the eigenvectors is such that the corresponding eigenvalues d_1, ..., d_n satisfy d_1 ≥ d_2 ≥ ... ≥ d_n. The solution maximizing (6.1) is given by

$$\mathbf{w}_1 = \mathbf{e}_1$$

Thus the first principal component of x is y_1 = e_1^T x.

The criterion J_1^PCA in Eq. (6.1) can be generalized to m principal components, with m any number between 1 and n. Denoting the m-th (1 ≤ m ≤ n) principal component by y_m = w_m^T x, with w_m the corresponding unit-norm weight vector, the variance of y_m is now maximized under the constraint that y_m is uncorrelated with all the previously found principal components:
$$E\{y_m y_k\} = 0, \quad k < m \qquad (6.4)$$

Note that the principal components y_m have zero means, because

$$E\{y_m\} = \mathbf{w}_m^T E\{\mathbf{x}\} = 0$$
The condition (6.4) yields:
$$E\{y_m y_k\} = E\{(\mathbf{w}_m^T \mathbf{x})(\mathbf{w}_k^T \mathbf{x})\} = \mathbf{w}_m^T \mathbf{C}_x \mathbf{w}_k = 0 \qquad (6.5)$$

For the second principal component, we have the condition that
$$\mathbf{w}_2^T \mathbf{C}_x \mathbf{w}_1 = d_1 \mathbf{w}_2^T \mathbf{e}_1 = 0$$

because we already know that w_1 = e_1. We are thus looking for maximal variance E{y_2^2} = E{(w_2^T x)^2} in the subspace orthogonal to the first eigenvector of C_x. The solution is given by

$$\mathbf{w}_2 = \mathbf{e}_2$$

Likewise, recursively it follows that

$$\mathbf{w}_k = \mathbf{e}_k$$

Thus the kth principal component is y_k = e_k^T x.
Exactly the same result for the w_i is obtained if the variances of the y_i are maximized under the constraint that the principal component vectors are orthonormal, or w_i^T w_j = δ_ij. This is left as an exercise.
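To make the variance-maximization view concrete, the following sketch computes e_1 from an estimated covariance matrix and checks numerically that no randomly drawn unit-norm weight vector gives a larger sample variance of w^T x. The synthetic data, dimensions, and number of trials are placeholder choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic zero-mean data with correlated elements (n = 5, T = 10000).
n, T = 5, 10_000
A = rng.standard_normal((n, n))
X = rng.standard_normal((T, n)) @ A.T
X -= X.mean(axis=0)

# Estimate C_x and take its dominant unit-length eigenvector e_1.
C = X.T @ X / T
d, E = np.linalg.eigh(C)          # eigenvalues in ascending order
e1, d1 = E[:, -1], d[-1]

# Variance of the first principal component y_1 = e_1^T x equals d_1 (cf. Eq. 6.12).
y1 = X @ e1
print(np.var(y1), d1)

# No random unit-norm vector w should achieve a larger variance of w^T x.
for _ in range(1000):
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)
    assert np.var(X @ w) <= d1 + 1e-9
```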
6.1.2 PCA by minimum mean-square error compression
In the preceding subsection, the principal components were defined as weighted sums of the elements of x with maximal variance, under the constraints that the weights are normalized and the principal components are uncorrelated with each other. It turns out that this is strongly related to minimum mean-square error compression of x, which is another way to pose the PCA problem. Let us search for a set of m orthonormal basis vectors, spanning an m-dimensional subspace, such that the mean-square error between x and its projection on the subspace is minimal. Denoting again the basis vectors by w_1, ..., w_m, for which we assume

$$\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$$

the projection of x on the subspace spanned by them is $\sum_{i=1}^{m} (\mathbf{w}_i^T \mathbf{x})\mathbf{w}_i$. The mean-square error (MSE) criterion, to be minimized by the orthonormal basis w_1, ..., w_m, becomes
$$J_{\mathrm{PCA}}^{\mathrm{MSE}} = E\left\{\left\|\mathbf{x} - \sum_{i=1}^{m} (\mathbf{w}_i^T \mathbf{x})\mathbf{w}_i\right\|^2\right\} \qquad (6.7)$$
It is easy to show (see exercises) that, due to the orthogonality of the vectors w_i, this criterion can be further written as
$$J_{\mathrm{PCA}}^{\mathrm{MSE}} = E\{\|\mathbf{x}\|^2\} - E\left\{\sum_{j=1}^{m} (\mathbf{w}_j^T \mathbf{x})^2\right\} = \mathrm{trace}(\mathbf{C}_x) - \sum_{j=1}^{m} \mathbf{w}_j^T \mathbf{C}_x \mathbf{w}_j \qquad (6.9)$$
It can be shown (see, e.g., [112]) that the minimum of (6.9) under the orthonormality condition on the w_i is given by any orthonormal basis of the PCA subspace spanned by the m first eigenvectors e_1, ..., e_m. However, the criterion does not specify the basis of this subspace at all: any orthonormal basis of the subspace will give the same optimal compression. While this ambiguity can be seen as a disadvantage, it should be noted that there may be some other criteria by which a certain basis in the PCA subspace is to be preferred over others. Independent component analysis is a prime example of methods in which PCA is a useful preprocessing step, but once the vector x has been expressed in terms of the first m eigenvectors, a further rotation brings out the much more useful independent components.
It can also be shown [112] that the value of the minimum mean-square error of (6.7) is

$$J_{\mathrm{PCA}}^{\mathrm{MSE}} = \sum_{i=m+1}^{n} d_i \qquad (6.10)$$

the sum of the eigenvalues corresponding to the discarded eigenvectors e_{m+1}, ..., e_n.
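Both facts are easy to check numerically, as in the sketch below: projecting on the m leading eigenvectors, or on any rotated orthonormal basis of the same subspace, gives the same mean-square error, and that error equals the sum of the discarded eigenvalues as in (6.10). The data and dimensions are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Zero-mean synthetic data: n = 8 dimensions, T = 20000 samples, m = 3 kept.
n, T, m = 8, 20_000, 3
X = rng.standard_normal((T, n)) @ rng.standard_normal((n, n)).T
X -= X.mean(axis=0)

C = X.T @ X / T
d, E = np.linalg.eigh(C)
d, E = d[::-1], E[:, ::-1]          # descending eigenvalues and eigenvectors
W = E[:, :m]                        # the m leading eigenvectors as basis columns

def mse(basis):
    # Mean-square error between x and its projection on span(basis).
    X_hat = (X @ basis) @ basis.T
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

R, _ = np.linalg.qr(rng.standard_normal((m, m)))   # a random m x m rotation
print(mse(W))            # error with the eigenvector basis
print(mse(W @ R))        # identical error with a rotated basis of the same subspace
print(d[m:].sum())       # sum of the discarded eigenvalues, Eq. (6.10)
```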
If the orthonormality constraint is simply changed to the weighted condition

$$\mathbf{w}_i^T \mathbf{w}_j = \omega_j \delta_{ij} \qquad (6.11)$$

where all the numbers ω_k are positive and different, then the mean-square error problem will have a unique solution given by scaled eigenvectors [333].
6.1.3 Choosing the number of principal components
From the result that the principal component basis vectors w_i are eigenvectors e_i of C_x, it follows that

$$E\{y_m^2\} = E\{\mathbf{e}_m^T \mathbf{x}\mathbf{x}^T \mathbf{e}_m\} = \mathbf{e}_m^T \mathbf{C}_x \mathbf{e}_m = d_m \qquad (6.12)$$

The variances of the principal components are thus directly given by the eigenvalues of C_x. Note that, because the principal components have zero means, a small eigenvalue (a small variance) d_m indicates that the value of the corresponding principal component y_m is mostly close to zero.
An important application of PCA is data compression. The vectors x in the original data set (that have first been centered by subtracting the mean) are approximated by the truncated PCA expansion

$$\hat{\mathbf{x}} = \sum_{i=1}^{m} (\mathbf{e}_i^T \mathbf{x})\mathbf{e}_i \qquad (6.13)$$

Then we know from (6.10) that the mean-square error $E\{\|\mathbf{x} - \hat{\mathbf{x}}\|^2\}$ is equal to $\sum_{i=m+1}^{n} d_i$. As the eigenvalues are all positive, the error decreases when more and more terms are included in (6.13), until the error becomes zero when m = n, or all the principal components are included. A very important practical problem is how to
choose m in (6.13); this is a trade-off between error and the amount of data needed for the expansion. Sometimes a rather small number of principal components is sufficient.
Fig. 6.1 Leftmost column: some digital images in a 32 × 32 grid. Second column: means of the samples. Remaining columns: reconstructions by PCA when 1, 2, 5, 16, 32, and 64 principal components were used in the expansion.
Example 6.1 In digital image processing, the amount of data is typically very large, and data compression is necessary for storage, transmission, and feature extraction. PCA is a simple and efficient method. Fig. 6.1 shows 10 handwritten characters that were represented as binary 32 × 32 matrices (left column) [183]. Such images, when scanned row by row, can be represented as 1024-dimensional vectors. For each of the 10 character classes, about 1700 handwritten samples were collected, and the sample means and covariance matrices were computed by standard estimation methods. The covariance matrices were 1024 × 1024 matrices. For each class, the first 64 principal component vectors or eigenvectors of the covariance matrix were computed. The second column in Fig. 6.1 shows the sample means, and the other columns show the reconstructions (6.13) for various values of m. In the reconstructions, the sample means have been added again to scale the images for visual display. Note how a relatively small percentage of the 1024 principal components produces reasonable reconstructions.
The condition (6.12) can often be used in advance to determine the number of principal components m, if the eigenvalues are known. The eigenvalue sequence d_1, d_2, ..., d_n of a covariance matrix for real-world measurement data is usually sharply decreasing, and it is possible to set a limit below which the eigenvalues, hence principal components, are insignificantly small. This limit determines how many principal components are used.
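One simple way to inspect this trade-off is to tabulate, for each m, the reconstruction error $\sum_{i>m} d_i$ and the retained fraction of total variance, and to pick the smallest m that is acceptable. The sketch below does this with placeholder synthetic data and an arbitrary 95% retention rule, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data whose covariance has a few dominant directions.
n, T = 20, 5000
A = rng.standard_normal((n, n)) * np.linspace(2.0, 0.1, n)   # decaying column scales
X = rng.standard_normal((T, n)) @ A.T
X -= X.mean(axis=0)

d = np.sort(np.linalg.eigvalsh(X.T @ X / T))[::-1]   # d_1 >= ... >= d_n

total = d.sum()
for m in range(1, n + 1):
    err = d[m:].sum()                 # mean-square error of the m-term expansion (6.10)
    retained = 1.0 - err / total      # fraction of total variance retained
    print(f"m = {m:2d}  error = {err:8.3f}  retained variance = {retained:.3f}")

# Example rule: smallest m retaining at least 95% of the total variance.
m_95 = int(np.searchsorted(np.cumsum(d) / total, 0.95) + 1)
print("smallest m retaining 95% of the variance:", m_95)
```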
Sometimes the threshold can be determined from some prior information on the vectors x. For instance, assume that x obeys a signal-noise model

$$\mathbf{x} = \sum_{i=1}^{m} \mathbf{a}_i s_i + \mathbf{n} \qquad (6.14)$$

where m < n. There the a_i are some fixed vectors and the coefficients s_i are random numbers that are zero mean and uncorrelated. We can assume that their variances have been absorbed in the vectors a_i so that they have unit variances. The term n is white noise, for which E{nn^T} = σ²I. Then the vectors a_i span a subspace, called the signal subspace, that has lower dimensionality than the whole space of vectors x. The subspace orthogonal to the signal subspace is spanned by pure noise, and it is called the noise subspace.
It is easy to show (see exercises) that in this case the covariance matrix of x has a special form:
$$\mathbf{C}_x = \sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T + \sigma^2 \mathbf{I} \qquad (6.15)$$

The eigenvalues are now the eigenvalues of $\sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T$, added by the constant σ². But the matrix $\sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T$ has at most m nonzero eigenvalues, and these correspond to eigenvectors that span the signal subspace. When the eigenvalues of C_x are computed, the first m form a decreasing sequence and the rest are small constants, equal to σ²:

$$d_1 > d_2 > \cdots > d_m > d_{m+1} = d_{m+2} = \cdots = d_n = \sigma^2$$
It is usually possible to detect where the eigenvalues become constants, and putting a threshold at this index, m, cuts off the eigenvalues and eigenvectors corresponding to pure noise. Then only the signal part remains.
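A quick numerical check of this eigenvalue structure is sketched below; the vectors a_i, the noise level, and the dimensions are placeholder choices, and the eigenvalues are computed directly from the model covariance (6.15) rather than from data.

```python
import numpy as np

rng = np.random.default_rng(5)

n, m, sigma2 = 10, 3, 0.05           # placeholder dimensions and noise variance
A = rng.standard_normal((n, m))      # columns are the fixed vectors a_1, ..., a_m

# Model covariance C_x = sum_i a_i a_i^T + sigma^2 I, as in (6.15).
C = A @ A.T + sigma2 * np.eye(n)

d = np.sort(np.linalg.eigvalsh(C))[::-1]
print(d)
# The first m eigenvalues are large and decreasing; the remaining n - m
# are all (numerically) equal to sigma^2, which marks the noise subspace.
print("noise-floor eigenvalues:", d[m:])
```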
A more disciplined approach to this problem was given by [453]; see also [231]. They give formulas for two well-known information-theoretic modeling criteria, Akaike's information criterion (AIC) and the minimum description length criterion (MDL), as functions of the signal subspace dimension m. The criteria depend on the length T of the sample x(1), ..., x(T) and on the eigenvalues d_1, ..., d_n of the matrix C_x. Finding the minimum point gives a good value for m.
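The exact formulas are not reproduced here. The sketch below uses an MDL criterion in the form commonly associated with the approach of [453] (a ratio of arithmetic to geometric means of the smallest eigenvalues plus a log T penalty); this form and the example eigenvalues are assumptions made for illustration and should be checked against the original reference before use.

```python
import numpy as np

def mdl_order(d, T):
    """Estimate the signal subspace dimension from eigenvalues d_1 >= ... >= d_n.

    Assumed form of the MDL criterion (to be verified against [453]):
    MDL(k) = T*(n-k)*log(arithmetic mean / geometric mean of the n-k smallest d_i)
             + 0.5*k*(2n - k)*log(T)
    """
    d = np.sort(np.asarray(d, dtype=float))[::-1]
    n = len(d)
    scores = []
    for k in range(n):                  # candidate signal dimensions k = 0, ..., n-1
        tail = d[k:]
        arith = tail.mean()
        geom = np.exp(np.mean(np.log(tail)))
        scores.append(T * (n - k) * np.log(arith / geom)
                      + 0.5 * k * (2 * n - k) * np.log(T))
    return int(np.argmin(scores))

# Example with eigenvalues resembling a 3-signal-plus-noise model.
d_example = [5.2, 3.1, 1.4, 0.05, 0.05, 0.05, 0.05, 0.05]
print(mdl_order(d_example, T=1000))     # expected to return 3
```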
6.1.4 Closed-form computation of PCA
To use the closed-form solution w_i = e_i given earlier for the PCA basis vectors, the eigenvectors of the covariance matrix must be known. In the conventional use of PCA, there is a sufficiently large sample of vectors x available, from which the mean and the covariance matrix C_x can be estimated by standard methods (see Chapter 4). Solving the eigenvector-eigenvalue problem for C_x gives the estimate for e_1. There are several efficient numerical methods available for solving the eigenvectors, e.g., the QR algorithm with its variants [112, 153, 320].
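In practice this closed-form route is only a few lines with any standard linear-algebra library. The sketch below uses NumPy's symmetric eigensolver (which wraps LAPACK routines rather than a hand-written QR iteration) on placeholder data; the function name and the choice m = 2 are assumptions made for this example.

```python
import numpy as np

def pca_closed_form(X, m):
    """Closed-form PCA: estimate mean and covariance, then solve the eigenproblem.

    X is a (T, n) array of samples; returns the mean, the m leading eigenvectors
    (as columns), and the corresponding eigenvalues in decreasing order.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    C = Xc.T @ Xc / len(Xc)               # sample covariance matrix C_x
    d, E = np.linalg.eigh(C)              # symmetric eigensolver, ascending order
    order = np.argsort(d)[::-1][:m]
    return mean, E[:, order], d[order]

# Placeholder usage on random correlated data.
rng = np.random.default_rng(6)
X = rng.standard_normal((1000, 6)) @ rng.standard_normal((6, 6))
mean, E_m, d_m = pca_closed_form(X, m=2)
Y = (X - mean) @ E_m                      # the first two principal components
print(d_m, Y.shape)
```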
However, it is not always feasible to solve the eigenvectors by standard numerical methods. In an on-line data compression application like image or speech coding, the data samples x(t) arrive at high speed, and it may not be possible to estimate the covariance matrix and solve the eigenvector-eigenvalue problem once and for all. One reason is computational: the eigenvector problem is numerically too demanding if the dimensionality n is large and the sampling rate is high. Another reason is that the covariance matrix C_x may not be stationary, due to fluctuating statistics in the sample sequence x(t), so the estimate would have to be incrementally updated. Therefore, the PCA solution is often replaced by suboptimal nonadaptive transformations like the discrete cosine transform [154].
6.2 PCA BY ON-LINE LEARNING
Another alternative is to derive gradient ascent algorithms or other on-line methods for the preceding maximization problems. The algorithms will then converge to the solutions of the problems, that is, to the eigenvectors. The advantage of this approach is that such algorithms work on-line, using each input vector x(t) once as it becomes available and making an incremental change to the eigenvector estimates, without computing the covariance matrix at all. This approach is the basis of the PCA neural network learning rules.
Neural networks provide a novel way for parallel on-line computation of the PCA expansion. The PCA network [326] is a layer of parallel linear artificial neurons shown in Fig. 6.2. The output of the ith unit (i = 1, ..., m) is y_i = w_i^T x, with x denoting the n-dimensional input vector of the network and w_i denoting the weight vector of the ith unit. The number of units, m, will determine how many principal components the network will compute. Sometimes this can be determined in advance for typical inputs, or m can be equal to n if all principal components are required. The PCA network learns the principal components by unsupervised learning rules, by which the weight vectors are gradually updated until they become orthonormal and tend to the theoretically correct eigenvectors. The network also has the ability to track slowly varying statistics in the input data, maintaining its optimality when the statistical properties of the inputs do not stay constant. Due to their parallelism and adaptivity to input data, such learning algorithms and their implementations in neural networks are potentially useful in feature detection and data compression tasks. In ICA, where decorrelating the mixture variables is a useful preprocessing step, these learning rules can be used in connection with on-line ICA.
Fig. 6.2 The basic linear PCA layer: the n-dimensional input vector x feeds m parallel linear units with weight vectors w_1, ..., w_m, producing the outputs w_1^T x, ..., w_m^T x.
6.2.1 The stochastic gradient ascent algorithm
In this learning rule, the gradient of y_1^2 is taken with respect to w_1, and the normalizing constraint ||w_1|| = 1 is taken into account. The learning rule is

$$\mathbf{w}_1(t+1) = \mathbf{w}_1(t) + \gamma(t)[y_1(t)\mathbf{x}(t) - y_1^2(t)\mathbf{w}_1(t)]$$

with y_1(t) = w_1(t)^T x(t). This is iterated over the training set x(1), x(2), .... The parameter γ(t) is the learning rate controlling the speed of convergence.
In this chapter we will use the shorthand notation introduced in Chapter 3 and write the learning rule as

$$\Delta\mathbf{w}_1 = \gamma(y_1\mathbf{x} - y_1^2\mathbf{w}_1) \qquad (6.16)$$
The name stochastic gradient ascent (SGA) is due to the fact that the gradient is not taken with respect to the variance E{y_1^2} but with respect to the instantaneous random value y_1^2. In this way, the gradient can be updated every time a new input vector becomes available, contrary to batch mode learning. Mathematically, this is a stochastic approximation type of algorithm (for details, see Chapter 3). Convergence requires that the learning rate is decreased during learning at a suitable rate. For tracking nonstationary statistics, the learning rate should remain at a small constant value. For a derivation of this rule, as well as for the mathematical details of its convergence, see [323, 324, 330]. The algorithm (6.16) is often called Oja's rule in the literature.

Likewise, taking the gradient of y_j^2 with respect to the weight vector w_j and using the normalization and orthogonality constraints, we end up with the learning rule
$$\Delta\mathbf{w}_j = \gamma y_j\left[\mathbf{x} - y_j\mathbf{w}_j - 2\sum_{i<j} y_i\mathbf{w}_i\right] \qquad (6.17)$$
On the right-hand side there is a term y_j x, which is a so-called Hebbian term, the product of the output y_j of the jth neuron and the input x to it. The other terms are implicit orthonormality constraints. The case j = 1 gives the one-unit learning rule (6.16) of the basic PCA neuron. The convergence of the vectors w_1, ..., w_m to the eigenvectors e_1, ..., e_m was established in [324, 330]. A modification called the generalized Hebbian algorithm (GHA) was later presented by Sanger [391], who also applied it to image coding, texture segmentation, and the development of receptive fields.
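A minimal simulation of these one-unit and multi-unit rules is sketched below, assuming a small constant learning rate and synthetic data; the learning-rate value, the data model, and the number of samples are placeholder choices, and the final check only compares the learned weights against the closed-form eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic zero-mean data stream x(t) with n = 4 correlated components.
n, T, m = 4, 20_000, 2
X = rng.standard_normal((T, n)) @ rng.standard_normal((n, n)).T
X -= X.mean(axis=0)

gamma = 0.001                            # small constant learning rate (placeholder)
W = rng.standard_normal((m, n)) * 0.1    # rows are w_1, ..., w_m

for x in X:
    y = W @ x                            # outputs y_j = w_j^T x
    for j in range(m):
        # SGA rule (6.17); j = 0 reduces to Oja's one-unit rule (6.16).
        W[j] += gamma * y[j] * (x - y[j] * W[j] - 2 * (y[:j] @ W[:j]))

# Compare with the closed-form eigenvectors of the sample covariance.
d, E = np.linalg.eigh(X.T @ X / T)
E = E[:, ::-1]
for j in range(m):
    print(abs(W[j] @ E[:, j]))           # close to 1 if w_j has converged to +/- e_j
```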
6.2.2 The subspace learning algorithm
The following algorithm [324, 458]

$$\Delta\mathbf{w}_j = \gamma y_j\left[\mathbf{x} - \sum_{i=1}^{m} y_i\mathbf{w}_i\right] \qquad (6.18)$$

is obtained as a constrained gradient ascent maximization of $\sum_{j=1}^{m} (\mathbf{w}_j^T \mathbf{x})^2$, the mean of which gives criterion (6.9). The regular structure allows this algorithm to be written in a simple matrix form: denoting by W = (w_1, ..., w_m)^T the m × n matrix whose rows are the weight vectors w_j, we have the update rule
$$\Delta\mathbf{W} = \gamma[\mathbf{W}\mathbf{x}\mathbf{x}^T - (\mathbf{W}\mathbf{x}\mathbf{x}^T\mathbf{W}^T)\mathbf{W}] \qquad (6.19)$$

The network implementation of (6.18) is analogous to the SGA algorithm but still simpler, because the normalizing feedback term, depending on the other weight vectors, is the same for all neuron units. The convergence was studied by Williams [458], who showed that the weight vectors w_1, ..., w_m will not tend to the eigenvectors e_1, ..., e_m but only to some rotated basis in the subspace spanned by them, in analogy with the minimum mean-square criterion of Section 6.1.2. For this reason, this learning rule is called the subspace algorithm. A global convergence analysis was given in [465, 75].
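A compact simulation of the matrix-form update (6.19) is sketched below; the learning rate, data, and convergence check are placeholder choices, and the check verifies only that the learned rows span the PCA subspace, not that they equal the eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(8)

n, m, T = 6, 3, 30_000
X = rng.standard_normal((T, n)) @ rng.standard_normal((n, n)).T
X -= X.mean(axis=0)

gamma = 0.0005
W = rng.standard_normal((m, n)) * 0.1     # rows are the weight vectors w_j

for x in X:
    Wx = W @ x                            # vector of outputs y = W x
    # Subspace rule (6.19): delta W = gamma [W x x^T - (W x x^T W^T) W]
    W += gamma * (np.outer(Wx, x) - np.outer(Wx, Wx) @ W)

# The rows should span the subspace of the m leading eigenvectors of C_x,
# even though they need not coincide with the eigenvectors themselves.
d, E = np.linalg.eigh(X.T @ X / T)
E_m = E[:, ::-1][:, :m]
proj = W @ E_m                            # projection of each row onto the PCA subspace
print(np.linalg.norm(proj, axis=1) / np.linalg.norm(W, axis=1))   # each close to 1
```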
A variant of the subspace algorithm (6.18) is the weighted subspace algorithm

$$\Delta\mathbf{w}_j = \gamma y_j\left[\mathbf{x} - \theta_j \sum_{i=1}^{m} y_i\mathbf{w}_i\right] \qquad (6.20)$$

Algorithm (6.20) is similar to (6.18) except for the scalar parameters θ_1, ..., θ_m, which are inverses of the parameters ω_1, ..., ω_m in criterion (6.11). If all of them are chosen different and positive, then it was shown by [333] that the vectors w_1, ..., w_m will tend to the true PCA eigenvectors e_1, ..., e_m multiplied by scalars. The algorithm is appealing because it produces the true eigenvectors but can be computed in a fully parallel way in a homogeneous network. It can be easily presented in a matrix form, analogous to (6.19).
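The following variant of the previous sketch adds the weighting parameters θ_j, following the reconstruction of (6.20) given above; the particular θ values, data, and alignment check are placeholder choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)

n, m, T = 6, 3, 50_000
X = rng.standard_normal((T, n)) @ rng.standard_normal((n, n)).T
X -= X.mean(axis=0)

gamma = 0.0005
theta = np.array([1.0, 1.5, 2.0])         # distinct positive parameters theta_j
W = rng.standard_normal((m, n)) * 0.1     # rows are w_1, ..., w_m

for x in X:
    y = W @ x
    feedback = y @ W                      # sum_i y_i w_i, shared by all units
    for j in range(m):
        # Weighted subspace rule (6.20)
        W[j] += gamma * y[j] * (x - theta[j] * feedback)

# With distinct theta_j, each w_j should align with one true eigenvector (up to scale).
d, E = np.linalg.eigh(X.T @ X / T)
E = E[:, ::-1]
for j in range(m):
    cosines = np.abs(E.T @ W[j]) / np.linalg.norm(W[j])
    print(j, np.argmax(cosines), cosines.max())   # maximal cosine close to 1
```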
Other related on-line algorithms have been introduced in [136, 388, 112, 450]. Some of them, like the APEX algorithm by Diamantaras and Kung [112], are based