Statistics in Geophysics: Principal Component Analysis
Steffen Unkel
Department of Statistics, Ludwig-Maximilians-University Munich, Germany
Multivariate data
Let x = (x_1, ..., x_p)^T be a p-dimensional random vector with population mean µ and population covariance matrix Σ. Suppose that a sample of n realizations of x is available.
These np measurements x_ij (i = 1, ..., n; j = 1, ..., p) can be collected in a data matrix
X = (x_(1), ..., x_(n))^T = (x_1, ..., x_p) ∈ R^{n×p},
with x_(i)^T = (x_i1, ..., x_ip) being the i-th observation vector (i = 1, ..., n) and x_j = (x_1j, ..., x_nj)^T being the vector of the n measurements on the j-th variable (j = 1, ..., p).
Thus, with Z denoting the column-wise standardized data matrix, Z^T Z/(n − 1) is the sample correlation matrix.
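As a quick numerical check (the course uses R; this NumPy sketch with simulated data is only illustrative), standardizing the columns of a data matrix and forming Z^T Z/(n − 1) reproduces the sample correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n = 100 observations, p = 3 variables

# Standardize column-wise: subtract the mean, divide by the sample std (ddof=1)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Z^T Z / (n - 1) reproduces the sample correlation matrix
R = Z.T @ Z / (X.shape[0] - 1)
print(np.allclose(R, np.corrcoef(X, rowvar=False)))  # True
```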
Eigendecomposition of the sample covariance matrix
Let S_X be positive semi-definite with rank(S_X) = r (r ≤ p).
The eigenvalue decomposition (or spectral decomposition) of S_X is
S_X = E Ω E^T,
where Ω ∈ R^{r×r} is a diagonal matrix with the eigenvalues ω_1 ≥ ω_2 ≥ ... ≥ ω_r > 0 of S_X on its main diagonal and E ∈ R^{p×r} is a column-wise orthonormal matrix whose columns e_1, ..., e_r are the unit-norm eigenvectors corresponding to ω_1, ..., ω_r.
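A minimal NumPy sketch of this decomposition (simulated data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))

# Sample covariance matrix S_X (divisor n - 1)
S = np.cov(X, rowvar=False)

# Spectral decomposition S_X = E Omega E^T; sort eigenvalues in decreasing order
omega, E = np.linalg.eigh(S)
order = np.argsort(omega)[::-1]
omega, E = omega[order], E[:, order]

# Check the reconstruction and the column-wise orthonormality of E
print(np.allclose(E @ np.diag(omega) @ E.T, S))   # True
print(np.allclose(E.T @ E, np.eye(S.shape[0])))   # True
```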
The aim of principal component analysis I
Principal component analysis (PCA) provides a computationally efficient way of projecting the p-dimensional data cloud orthogonally onto a k-dimensional subspace.
The aim of PCA is to derive k (≪ p) uncorrelated linear combinations of the p-dimensional observation vectors x_(1), ..., x_(n), called the sample principal components (PCs), which retain most of the total variation present in the data. This is achieved by taking those k components that successively have maximum variance.
The aim of principal component analysis II
PCA looks for r vectors e_j ∈ R^{p×1} (j = 1, ..., r) which maximize e_j^T S_X e_j subject to e_j^T e_j = 1 for j = 1, ..., r and e_i^T e_j = 0 for i = 1, ..., j − 1 (j ≥ 2).
It turns out that y_j = X e_j is the j-th sample PC with zero mean and variance ω_j, where e_j is an eigenvector of S_X corresponding to its j-th largest eigenvalue ω_j (j = 1, ..., r).
The total variance of the r PCs will equal the total variance of the original variables, so that Σ_{j=1}^{r} ω_j = trace(S_X).
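These properties can be verified numerically; the following NumPy sketch (simulated data, illustrative only) computes the PCs from the eigenvectors of S_X and checks that their variances are the eigenvalues and that the total variance equals trace(S_X):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
Xc = X - X.mean(axis=0)               # column-centre the data matrix

S = np.cov(Xc, rowvar=False)
omega, E = np.linalg.eigh(S)
order = np.argsort(omega)[::-1]
omega, E = omega[order], E[:, order]

Y = Xc @ E                            # sample PCs y_j = X e_j

# Each PC has zero mean and variance omega_j; total variance is preserved
print(np.allclose(Y.mean(axis=0), 0))            # True
print(np.allclose(Y.var(axis=0, ddof=1), omega)) # True
print(np.isclose(omega.sum(), np.trace(S)))      # True
```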
Singular value decomposition of the data matrix I
The sample PCs can also be found using the singular value decomposition (SVD) of X.
Expressing X with rank r (r ≤ min{n, p}) by its SVD gives
X = V D E^T,
where V ∈ R^{n×r} and E ∈ R^{p×r} are column-wise orthonormal matrices and D ∈ R^{r×r} is a diagonal matrix with the singular values of X sorted in decreasing order, σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0, on its main diagonal.
Singular value decomposition of the data matrix II
The matrix E is composed of coefficients or loadings, and the matrix of component scores Y ∈ R^{n×r} is given by Y = VD.
Since it holds that E^T E = I_r and Y^T Y/(n − 1) = D^2/(n − 1), the loadings are orthogonal and the sample PCs are uncorrelated.
The variance of the j-th sample PC is σ_j^2/(n − 1), which is equal to the j-th largest eigenvalue, ω_j, of S_X (j = 1, ..., r).
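A short NumPy sketch (simulated data, illustrative only) confirming that the squared singular values of the centered data matrix, divided by n − 1, coincide with the eigenvalues of S_X:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
Xc = X - X.mean(axis=0)               # PCA assumes a centred data matrix

# Thin SVD: Xc = V D E^T with singular values in decreasing order
V, d, Et = np.linalg.svd(Xc, full_matrices=False)

Y = V * d                             # component scores Y = V D

# sigma_j^2 / (n - 1) equals the j-th largest eigenvalue of S_X
omega = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
print(np.allclose(d**2 / (Xc.shape[0] - 1), omega))   # True
```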
Singular value decomposition of the data matrix III
In practice, the leading k components with k ≪ r usually account for a substantial proportion
(ω_1 + ... + ω_k) / trace(S_X)
of the total variance in the data, and the sum in the SVD of X is therefore truncated after the first k terms.
If so, PCA comes down to finding a matrix Y = (y_1, ..., y_k) ∈ R^{n×k} of component scores of the n samples on the k components and a matrix E = (e_1, ..., e_k) ∈ R^{p×k} of coefficients whose k-th column is the vector of loadings for the k-th component.
Least squares property of the SVD
PCA can be defined as the minimization of ||X − Y E^T||_F^2, where ||B||_F = √trace(B^T B) denotes the Frobenius norm of B.
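The least squares property can be illustrated numerically: for a rank-k truncation, the squared Frobenius norm of the residual equals the sum of the discarded squared singular values (the Eckart–Young theorem). A NumPy sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
Xc = X - X.mean(axis=0)

V, d, Et = np.linalg.svd(Xc, full_matrices=False)

k = 2
Y = V[:, :k] * d[:k]                  # scores on the first k components
E = Et[:k].T                          # corresponding loadings

# Squared Frobenius norm of the residual equals the sum of the discarded
# squared singular values
resid = np.linalg.norm(Xc - Y @ E.T, 'fro')**2
print(np.isclose(resid, np.sum(d[k:]**2)))   # True
```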
When variables are measured on different scales or on a common scale with widely differing ranges, the data are often standardized prior to PCA.
The sample PCs are then obtained from an eigenvalue decomposition of the sample correlation matrix. These components are not equal to those derived from S_X.
Choosing the number of components I
(i) Retain the first k components which explain a large proportion of the total variation, say 70–80%.
(ii) If the correlation matrix is analyzed, retain only those components with eigenvalues greater than 1 (or 0.7).
(iii) Examine a scree plot. This is a plot of the eigenvalues versus the component number. The idea is to look for an "elbow", which corresponds to the point after which the eigenvalues decrease more slowly.
(iv) Consider whether the component has a sensible and useful interpretation.
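Rule (i) is easy to apply programmatically. A NumPy sketch with simulated data that has an (assumed) two-dimensional dominant structure:

```python
import numpy as np

rng = np.random.default_rng(5)
# Simulated data: a dominant two-dimensional structure plus small noise
B = rng.normal(size=(80, 2)) @ rng.normal(size=(2, 6))
X = B + 0.1 * rng.normal(size=(80, 6))

# Eigenvalues of the sample covariance matrix, in decreasing order
omega = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
cumprop = np.cumsum(omega) / omega.sum()

# Rule (i): keep the smallest k whose cumulative proportion exceeds, say, 80%
k = int(np.argmax(cumprop >= 0.8)) + 1
print(k, np.round(cumprop, 3))
```

A scree plot for rule (iii) would simply plot `omega` against the component number.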
Choosing the number of components II
Interpretation I
Correlations and covariances of variables and components
The covariance of variable i with component j is given by
Cov(x_i, y_j) = ω_j e_ji.
The correlation of variable i with component j is therefore
r_{x_i, y_j} = √ω_j e_ji / s_i,
where s_i is the standard deviation of variable i.
If the components are extracted from the correlation matrix, then
r_{x_i, y_j} = √ω_j e_ji.
Interpretation II
Rescaling principal components
The coefficients e_j can be rescaled so that coefficients for the most important components are larger than those for less important components.
These rescaled coefficients are calculated as
e*_j = √ω_j e_j,
for which e*_j^T e*_j = ω_j, rather than unity.
When the correlation matrix is analyzed, this rescaling leads to coefficients that are the correlations between the components and the original variables.
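This relationship can be checked numerically; the following NumPy sketch (simulated data, illustrative only) extracts components from the correlation matrix and verifies that the rescaled loadings equal the variable–component correlations:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
X[:, 1] += 0.5 * X[:, 0]              # introduce some correlation

# PCA of the sample correlation matrix
R = np.corrcoef(X, rowvar=False)
omega, E = np.linalg.eigh(R)
order = np.argsort(omega)[::-1]
omega, E = omega[order], E[:, order]

# Rescaled loadings e*_j = sqrt(omega_j) e_j ...
E_star = E * np.sqrt(omega)

# ... equal the correlations between variables and component scores
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Y = Z @ E
corr = np.array([[np.corrcoef(X[:, i], Y[:, j])[0, 1] for j in range(3)]
                 for i in range(3)])
print(np.allclose(corr, E_star))      # True
```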
Rotation I
Rotation can be performed either in an orthogonal or an oblique (non-orthogonal) fashion.
Several analytic orthogonal and oblique rotation criteria exist in the literature.
Rotation II
To aid interpretation, all rotation criteria are designed to make the coefficients as simple as possible in some sense, with most loadings made to have values either "close to zero" or "far from zero", and with as few as possible of the coefficients taking intermediate values.
After rotation, either one or both of the properties possessed by PCA, that is, orthogonality of the loadings and uncorrelatedness of the component scores, is lost.
PCA in the open-source software R
Function princomp() in the stats package: eigendecomposition of the covariance or correlation matrix. Alternative: use the function eigen() directly.
Function prcomp() in the stats package: SVD of the (centered and possibly scaled) data matrix. Alternative: use the function svd() directly.
Description of the data
For 41 cities in the United States, seven variables were recorded.
We shall examine how PCA can be used to explore various aspects of the data.
Files: chap3usair.dat and pcausair.R
Description of the data
Source: National Center for Environmental Prediction/National Center for Atmospheric Research.
Winter monthly sea level pressures over the Northern Hemisphere.
Figure: Spatial map representations of the two leading PCs for winter sea level pressure data (left: North Atlantic Oscillation; right: North Pacific Oscillation). The loadings have been multiplied by 100.