8.4 Principal Component Regression
For the case of high-dimensional data or data sets where the variables are collinear, the relation between ridge regression and principal component analysis can be exploited to define a regression based on the principal components. For centered data (i.e., zero mean) we recall from § 7.3 that we can define the principal components of a system from the data covariance matrix, $X^T X$, by applying an eigenvalue decomposition (EVD) or singular value decomposition (SVD),

$$X^T X = V \Lambda V^T,$$

with $V$ the matrix of eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues.
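As a concrete check of this relation, here is a minimal NumPy sketch (the data matrix and its dimensions are invented for illustration): the eigenvalues of $X^T X$ can be obtained either from an EVD of the covariance matrix or from an SVD of the centered data matrix itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # hypothetical data: 100 points, 5 variables
X -= X.mean(axis=0)                # center the data (zero mean per column)

# EVD of the covariance matrix X^T X
lam, V = np.linalg.eigh(X.T @ X)   # eigh returns eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]     # reorder so the largest eigenvalue comes first

# equivalent route via SVD of X: X = U S V^T, hence X^T X = V S^2 V^T
U, S, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(lam, S**2)      # eigenvalues equal squared singular values
```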
Projecting the data matrix onto these eigenvectors, we define a set of projected data points,

$$Z = X V,$$

and truncate this expansion to exclude those components with small eigenvalues. A standard linear regression can now be applied to the data transformed to this principal component space, with

$$\theta = (M_z^T M_z)^{-1} M_z^T Y,$$

with $M_z$ the design matrix for the projected components $z_i$. The PCA (including the truncation) and the regression are undertaken as separate steps in this procedure. The distinction between principal component regression (PCR) and ridge regression is that in PCR the model components are ordered by their eigenvalues and the truncation is absolute (i.e., we weight the regression coefficients by 1 or 0), whereas for ridge regression the weighting of the regression coefficients is continuous.
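The two-step procedure can be written out directly. The following is a minimal sketch in NumPy (the synthetic data, the collinear column, and the choice to keep $k = 4$ components are illustrative assumptions, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
X[:, 5] = X[:, 4] + 1e-6 * rng.normal(size=200)  # two nearly collinear columns
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 1.0, 1.0]) + 0.1 * rng.normal(size=200)

X = X - X.mean(axis=0)               # center the independent variables
y = y - y.mean()                     # center the dependent variable

# step 1: principal components from the SVD of the centered data
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 4                                # truncation level (illustrative)
Z = X @ Vt[:k].T                     # projected data points, Z = X V_k

# step 2: ordinary least squares in the projected space
theta_z, *_ = np.linalg.lstsq(Z, y, rcond=None)

# coefficients mapped back to the original variables
theta = Vt[:k].T @ theta_z
```

Note that the regression sees only the $k$ retained components; the remaining components are weighted by exactly zero, in contrast to the continuous shrinkage of ridge regression.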
The advantages of PCR over ridge regression arise primarily for data containing independent variables that are collinear (i.e., where the correlation between the variables is almost unity). For these cases, the regression coefficients have large variance and their solutions can become unstable. Excluding those principal components with small eigenvalues alleviates this issue. At what level to truncate the set of eigenvectors is, however, an open question (see § 7.3.2). A simple approach is to truncate based on the eigenvalue, with common cutoffs ranging between 1% and 10% of the average eigenvalue (see § 8.2 of [10] for a detailed discussion of truncation levels for PCR). The disadvantage of such an approach is that an eigenvalue does not always correlate with the ability of a given principal component to predict the dependent variable. Other techniques, including cross-validation [17], have been proposed, yet there is no well-adopted solution to this problem.
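To make the cross-validation idea concrete, the sketch below scores a PCR pipeline for each candidate truncation level and keeps the one that best predicts the dependent variable. It is a generic illustration rather than the specific method of [17], and it assumes scikit-learn's PCA, LinearRegression, and cross_val_score; the synthetic data are invented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

# score a PCR pipeline for each possible number of retained components
scores = []
for k in range(1, X.shape[1] + 1):
    pcr = make_pipeline(PCA(n_components=k), LinearRegression())
    scores.append(cross_val_score(pcr, X, y, cv=5).mean())

best_k = int(np.argmax(scores)) + 1  # truncation level chosen by predictive power
```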
Finally, we note that in the case of ill-conditioned regression problems (e.g., those with collinear data), most implementations of linear regression will implicitly perform a form of principal component regression when inverting the singular matrix $M^T M$. This comes through the use of the robust pseudoinverse, which truncates small singular values to prevent numerical overflow in ill-conditioned problems.
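For instance, NumPy's np.linalg.pinv discards singular values below a relative threshold before inverting; the sketch below (with invented data and the threshold written out explicitly) shows how this makes the normal equations solvable even when a column of $M$ is exactly duplicated:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(100, 3))
M = np.column_stack([M, M[:, 0]])    # duplicate a column: M^T M is singular
y = M[:, 0] + 0.05 * rng.normal(size=100)

# inverting M^T M directly would fail or be wildly unstable here; the
# pseudoinverse instead zeros out singular values below
# rcond * (largest singular value), an implicit PCR-like truncation
theta = np.linalg.pinv(M, rcond=1e-15) @ y
```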