Statistics in Geophysics: Linear Regression

Steffen Unkel
Department of Statistics, Ludwig-Maximilians-University Munich, Germany
Historical remarks

Sir Francis Galton (1822-1911) was responsible for the introduction of the word "regression".

Galton, F. (1886): Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, Vol. 15, pp. 246-263.
Regression equation:

ŷ = ȳ + (2/3)(x − x̄),

where y denotes the height of the child and x is a weighted average of the mother's and father's heights.
Regression to the mean
Figure: Scatterplot of mid-parental height against child’s height, and regression line (dark red line).
Relationship between two variables
We can distinguish predictor variables and response variables.
Other names frequently seen are:
Predictor variable: input variable, X -variable, regressor, covariate, independent variable.
Response variable: output variable, predictand, Y -variable, dependent variable.
We shall be interested in finding out how changes in the predictor variables affect the values of a response variable.
Relationship between two variables: Example
In simple (multiple) linear regression, one (two or more) predictor variable(s) is (are) assumed to affect the values of a response variable in a linear fashion.
For the model of simple linear regression, we assume

y = β0 + β1x + ε,

where ε is the random error term.

Inserting the data yields the n equations

yi = β0 + β1xi + εi,  i = 1, …, n,

with unknown regression coefficients β0 and β1.
1. Linearity: the response is a linear function of the covariates, that is, f is linear in the parameters.
2. Additivity of errors.
3. The error terms εi (i = 1, …, n) are random variables with E(εi) = 0 and constant (unknown) variance σ², that is, homoscedastic errors with Var(εi) = σ².
4. Uncorrelated errors: Cov(εi, εj) = 0 for i ≠ j.
5. Gaussian errors: εi ∼ N(0, σ²).
Least squares (LS) fitting

The estimated values β̂0 and β̂1 are determined as minimizers of the sum of squared deviations

Q(β0, β1) = Σ (yi − β0 − β1xi)²,  with the sum running over i = 1, …, n.

The minimizers are

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²  and  β̂0 = ȳ − β̂1x̄.
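As a numerical sketch of these closed-form estimators, using small made-up data and assuming numpy is available:

```python
import numpy as np

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

xbar, ybar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared x-deviations.
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
# Intercept: line passes through the point of means (xbar, ybar).
beta0_hat = ybar - beta1_hat * xbar
```

The fitted line always passes through (x̄, ȳ), which is exactly what the intercept formula encodes.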
Least squares (LS) fitting II

An estimate for the error variance σ², called the residual variance, is

σ̂² = (1/(n − 2)) Σ (yi − ŷi)²,  where ŷi = β̂0 + β̂1xi.
Figure: Minimum temperature (°F) observations at Ithaca and Canandaigua, New York, for January 1987, with fitted least squares line.
How well do the data fit the regression line?

Consider the identity

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²,  that is, SST = SSR + SSE.
Coefficient of determination

Some of the variation in the data (SST) can be ascribed to the regression line (SSR) and some to the fact that the actual observations do not all lie on the regression line (SSE).

A useful statistic to check is the R² value (coefficient of determination):

R² = SSR/SST = 1 − SSE/SST,

for which it holds that 0 ≤ R² ≤ 1 and which is often expressed as a percentage by multiplying by 100.

The square root of R² is the absolute value of the Pearson correlation between x and y.
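The decomposition and R² can be checked numerically; a sketch with made-up data, assuming numpy is available:

```python
import numpy as np

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least squares fit.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)      # error (residual) sum of squares

r2 = 1 - sse / sst                  # coefficient of determination
```

The assertions SST = SSR + SSE and √R² = |corr(x, y)| hold exactly (up to floating-point error) for any data.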
ANOVA table for simple linear regression

Source       Degrees of freedom   Sum of squares   Mean square         F-value
Regression   1                    SSR              MSR = SSR/1         MSR/MSE
Error        n − 2                SSE              MSE = SSE/(n − 2)
Total        n − 1                SST
F-test for significance of regression

Suppose that the errors εi are independent N(0, σ²) variables. Then it can be shown that if β1 = 0, the ratio

F = MSR/MSE

follows an F(1, n − 2)-distribution.

Statistical test: H0: β1 = 0 versus H1: β1 ≠ 0.

We compare the F-value with the 100(1 − α)% point of the tabulated F(1, n − 2)-distribution in order to determine whether H0 can be rejected given the data we have seen.
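A minimal sketch of the F-test, with made-up data and assuming numpy and scipy are available:

```python
import numpy as np
from scipy import stats

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# Least squares fit and fitted values.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

msr = ssr / 1          # mean square for regression (1 degree of freedom)
mse = sse / (n - 2)    # mean squared error = residual variance estimate

f_value = msr / mse
p_value = stats.f.sf(f_value, 1, n - 2)   # P(F(1, n-2) > f_value)
```

For simple linear regression this F-test is equivalent to the two-sided t-test for β1 (F = t²), so its p-value matches the one returned by `scipy.stats.linregress`.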
Confidence intervals

(1 − α) × 100% confidence intervals for β0 and β1:

[β̂j ± σ̂β̂j × t1−α/2(n − 2)],  j = 0, 1,

where σ̂β̂j denotes the estimated standard error of β̂j and t1−α/2(n − 2) is the (1 − α/2)-quantile of the t(n − 2)-distribution.

For sufficiently large n: replace quantiles of the t(n − 2)-distribution by quantiles of the N(0, 1)-distribution.
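A sketch of 95% confidence intervals for the coefficients, with made-up data and assuming numpy and scipy are available:

```python
import numpy as np
from scipy import stats

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n, alpha = len(x), 0.05

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

resid = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)      # residual variance

# Estimated standard errors of the coefficients.
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x.mean() ** 2 / sxx))

t_crit = stats.t.ppf(1 - alpha / 2, n - 2)     # t-quantile, n-2 df

ci_beta1 = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
ci_beta0 = (beta0_hat - t_crit * se_beta0, beta0_hat + t_crit * se_beta0)
```

The standard errors agree with `stderr` and `intercept_stderr` from `scipy.stats.linregress`.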
Hypothesis tests

Example: two-sided test for β1:

H0: β1 = 0 versus H1: β1 ≠ 0.

Observed test statistic:

t = (β̂1 − 0)/σ̂β̂1 = β̂1/σ̂β̂1.
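The test statistic and its two-sided p-value can be computed directly; a sketch with made-up data, assuming numpy and scipy are available:

```python
import numpy as np
from scipy import stats

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

resid = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)

t_stat = beta1_hat / np.sqrt(sigma2_hat / sxx)   # beta1-hat / se(beta1-hat)
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)     # two-sided p-value, n-2 df
```

Under H0 the statistic follows a t(n − 2)-distribution, so H0 is rejected at level α when |t| exceeds t1−α/2(n − 2).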
Prediction intervals

A prediction interval for a future observation y0 at a location x0 with level (1 − α) is given by

[ŷ0 ± t1−α/2(n − 2) × σ̂ × √(1 + 1/n + (x0 − x̄)²/Σ(xi − x̄)²)],

where ŷ0 = β̂0 + β̂1x0 and the sum runs over i = 1, …, n.
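A sketch of this prediction interval for a single new location x0, with made-up data and assuming numpy and scipy are available:

```python
import numpy as np
from scipy import stats

# Made-up data and a made-up prediction location x0.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n, alpha, x0 = len(x), 0.05, 3.5

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

resid = y - (beta0_hat + beta1_hat * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

y0_hat = beta0_hat + beta1_hat * x0
# Prediction standard error: the leading 1 accounts for the new error term.
se_pred = sigma_hat * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)

pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
```

Note that the interval widens as x0 moves away from x̄, and it is always wider than the confidence interval for the mean response because of the extra "1 +" term.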
Prediction intervals: Example
Residuals versus fitted values
Durbin-Watson test

The Durbin-Watson statistic compares successive residuals ei:

d = Σi=2..n (ei − ei−1)² / Σi=1..n ei².

If successive residuals are positively (negatively) serially correlated, d will be near 0 (near 4).

The distribution of d is symmetric around 2.

The critical values for Durbin-Watson tests vary depending on the sample size and the number of predictor variables.
Durbin-Watson test II

Compare d (or 4 − d, whichever is closer to zero) with the tabulated critical values dL and dU.

If d < dL, conclude that positive serial correlation is a possibility; if d > dU, conclude that no serial correlation is indicated.

If 4 − d < dL, conclude that negative serial correlation is a possibility; if 4 − d > dU, conclude that no serial correlation is indicated.

If the d (or 4 − d) value lies between dL and dU, the test is inconclusive.
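The statistic itself is easy to compute from the residuals of a fit; a sketch with made-up data, assuming numpy is available:

```python
import numpy as np

# Made-up data with a roughly linear trend, for illustration only.
x = np.arange(1.0, 11.0)
y = np.array([1.2, 2.1, 2.8, 4.1, 4.9, 6.2, 6.8, 8.1, 9.2, 9.8])

# Least squares residuals.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
e = y - (beta0_hat + beta1_hat * x)

# Durbin-Watson: squared successive differences over squared residuals.
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```

The resulting d lies between 0 and 4 by construction; whether it signals serial correlation must then be judged against the tabulated dL and dU for the given n and number of predictors.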
Durbin-Watson test: Example
Quantile-quantile plot

A quantile-quantile (Q-Q) plot compares the empirical quantiles of the residuals with the theoretical quantiles of a normal distribution; under normality the points fall close to a straight line.

If all the points lie on such a line, more or less, one would conclude that the residuals do not deny the assumption of normality of errors.
Quantile-quantile plot: Example

Figure: Gaussian Q-Q plot of the residuals obtained from the regression of the January 1987 temperature data.